# Pandas

## What is Pandas?

##### **Pandas is a Python library used for data manipulation, analysis, and cleaning.**
##### *The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.*
It provides two main data structures:
1. Series (1D data)
2. DataFrame (2D tabular data)

In [None]:
#Checking pandas version...
import pandas as pd
print(pd.__version__)

In [None]:
import pandas as pd
s = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])   # index lets us assign our own labels
print(s)

# These labels act as the index through which we can access the data.
print(s["d"])
print(s["b":"e"])   # label-based slicing includes both endpoints
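Slicing by label behaves differently from slicing by position, which is easy to miss. The sketch below reuses the same Series; the variable names `by_label` and `by_position` are only illustrative.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])

# Label-based slice: both the start label "b" and the stop label "e" are included.
by_label = s["b":"e"]

# Position-based slice with .iloc: the stop position (4) is excluded, as in a plain list.
by_position = s.iloc[1:4]

print(by_label.tolist())
print(by_position.tolist())
```

The label slice returns four values (20 to 50), while the position slice returns three (20 to 40).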

In [None]:
import pandas as pd
data = {
    "star_names" : ["Betelgeuse", "Bellatrix", "Rigel", "Saiph"],
    "distance_in_light_years" : [643, 860, 250, 650]
}
df = pd.DataFrame(data, index = [1, 2, 3, 4])  # index lets us assign our own row labels
df
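Because the row labels here start at 1, a label and a position no longer mean the same thing. A minimal sketch, assuming the same DataFrame as above, of how `.loc` (labels) and `.iloc` (positions) reach the same row; the variable names are illustrative.

```python
import pandas as pd

data = {
    "star_names": ["Betelgeuse", "Bellatrix", "Rigel", "Saiph"],
    "distance_in_light_years": [643, 860, 250, 650],
}
df = pd.DataFrame(data, index=[1, 2, 3, 4])

first_by_label = df.loc[1]       # row whose label is 1
first_by_position = df.iloc[0]   # row at position 0, which is the same row here

print(first_by_label["star_names"])
print(first_by_position["star_names"])
```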

#### Reading a .csv file of stars and the constellations they belong to, for the letter 'A'.

In [None]:
import pandas as pd

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

star_data_A = pd.read_csv("sample.csv")
print("0-20 pc → Solar neighborhood\n20-50 pc → Local stellar region")
print("50-100 pc → Nearby galactic disk\n100+ pc → Distant bright stars\n")
star_data_A

In [None]:
star_data_A.head()   # Returns the first 5 rows (with column headers) when no argument is given.

In [None]:
star_data_A.tail()   # Returns the last 5 rows (with column headers) when no argument is given.

In [None]:
star_data_A.info()   # Shows the structure: column names, data types and non-null counts.

In [None]:
star_data_A.describe()   # Gives a statistical summary of the numeric columns.
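The same inspection methods can be tried without the CSV file. The sketch below uses a hypothetical stand-in DataFrame called `demo`, filled with placeholder numbers, to show `head()` and `tail()` with an explicit row count and what `describe()` returns.

```python
import pandas as pd

# Hypothetical stand-in for the CSV contents: ten rows of placeholder numbers.
demo = pd.DataFrame({"value": range(1, 11)})

first_three = demo.head(3)   # first 3 rows instead of the default 5
last_two = demo.tail(2)      # last 2 rows instead of the default 5
summary = demo.describe()    # count, mean, std, min, quartiles and max per numeric column

print(first_three)
print(last_two)
print(summary.loc[["count", "mean", "min", "max"]])
```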

#### Column Selection

In [None]:
import pandas as pd

frame_data = {
    "Planets" : ["Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"],
    "Radius_in_km" : [2439.7, 6051.8, 6371.0, 3389.5, 69911, 58232, 25362, 24622],
    "Distance_from_the_sun_in_AU" : [0.39, 0.72, 1.00, 1.52, 5.20, 9.54, 19.22, 30.06],
    "Gravitational_acceleration_in_m/s^2" : [3.7, 8.87, 9.8, 3.7, 24.79, 10.44, 8.69, 11.15]
}
df = pd.DataFrame(frame_data)
print("DataFrame Set: ")
df

Single Column Selection: It works like indexing. Pass the column header in square brackets and the data under that header is returned.

In [None]:
#Single Column Selection...
df["Planets"]

Multiple Column Selection: Pass a list of headers to get the data under each of those headers.

In [None]:
#Multiple Column Selection...
df[["Planets", "Radius_in_km", "Distance_from_the_sun_in_AU"]]

Outputs:
1. Selecting a single column returns a Series: one column of values with its header as the name.
2. Selecting multiple columns returns a DataFrame containing the listed headers and the data under them.
The two return types are different, so it matters whether a single header or a list of headers is passed.
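The difference in return type can be checked directly. A minimal sketch on a shortened version of the planets table; the variable names `single`, `as_frame` and `multiple` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Planets": ["Mercury", "Venus", "Earth"],
    "Radius_in_km": [2439.7, 6051.8, 6371.0],
})

single = df["Planets"]                      # one header -> Series
as_frame = df[["Planets"]]                  # list with one header -> DataFrame
multiple = df[["Planets", "Radius_in_km"]]  # list of headers -> DataFrame

print(type(single).__name__, single.shape)
print(type(as_frame).__name__, as_frame.shape)
print(type(multiple).__name__, multiple.shape)
```

Note that double brackets always give a DataFrame, even when the list holds only one header.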

#### Row Filtering

Row filtering allows us to access data from specific rows by using conditions. 

In [None]:
import pandas as pd
student_data = {
    "name" : ["Pia", "Hikaru", "Judit", "Magnus"],
    "age" : [29, 23, 24, 19],
    "birth_year" : [2000, 2002, 2001, 2005],
}
data_f = pd.DataFrame(student_data)
data_f

##### Using single conditional statements.

In [None]:
print(data_f["age"] > 20)       # boolean mask: True where the condition holds
data_f[data_f["age"] > 20]      # only the rows where the mask is True are returned

In [None]:
print(data_f["birth_year"] >= 2002)
data_f[data_f["birth_year"] >= 2002]    # only the rows where the mask is True are returned

##### Using multiple conditional statements.

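Here we combine more than one condition to access specific data. The parentheses around each condition are required because '&' binds more tightly than comparison operators such as '>' and '>='. Without them the expression is evaluated in the wrong order and gives an error or a wrong result.

'&' is used instead of the 'and' operator because...
1. 'and' works on single truth values (bool), so applying it to a whole Series raises an error, while '&' is the bitwise-AND operator, which pandas applies element by element.
2. '&' on two boolean Series returns a new boolean Series, while 'and' returns one of its operands (the last one evaluated).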
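A minimal sketch of the difference, using the age and birth-year columns from the table above. The names `older`, `recent` and `and_worked` are illustrative; `|` and `~` are shown as the element-wise OR and NOT counterparts of `&`.

```python
import pandas as pd

data_f = pd.DataFrame({
    "age": [29, 23, 24, 19],
    "birth_year": [2000, 2002, 2001, 2005],
})

older = data_f["age"] > 20             # boolean Series
recent = data_f["birth_year"] >= 2002  # boolean Series

both = older & recent     # element-wise AND
either = older | recent   # element-wise OR
not_older = ~older        # element-wise NOT

# `and` needs one truth value per operand, which a whole Series cannot provide.
try:
    older and recent
    and_worked = True
except ValueError as err:
    and_worked = False
    print("`and` fails:", err)

print(both.tolist())
```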

In [None]:
#Check for shape of the dataframe...
print(f"Original Shape = {data_f.shape}\n\n")

In [None]:
print((data_f["age"] > 20) & (data_f["birth_year"] >= 2002))   # each condition in its own parentheses
data_1 = data_f[(data_f["age"] > 20) & (data_f["birth_year"] >= 2002)]
data_1

In [None]:
#Check for shape after filtering...
print(f"Filtered Shape = {data_1.shape}")

### Sorting Data

#### sort_values() : It sorts the rows by the values of a column, in ascending order by default.

In [None]:
import pandas as pd

earth_composition = {
    "Gases" : ["Nitrogen", "Oxygen", "Carbon Dioxide", "Argon", "Other Gases"],
    # The last two values are lower bounds (> 0.9 and > 0.17), kept numeric so the column can be sorted.
    "percent" : [78, 20.9, 0.03, 0.9, 0.17]
}
comp_data = pd.DataFrame(earth_composition)
comp_data
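A minimal sketch of `sort_values()` on a shortened version of this table. It assumes the "percent" column holds plain numbers (Argon is stored as 0.9), since a column that mixes numbers and text cannot be sorted this way.

```python
import pandas as pd

comp_data = pd.DataFrame({
    "Gases": ["Nitrogen", "Oxygen", "Carbon Dioxide", "Argon"],
    "percent": [78, 20.9, 0.03, 0.9],  # assumption: numeric values only
})

ascending = comp_data.sort_values("percent")                    # smallest first (default)
descending = comp_data.sort_values("percent", ascending=False)  # largest first

print(ascending)
print(descending)
```

`sort_values()` returns a new DataFrame and leaves `comp_data` in its original order.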