# Pandas

## What is Pandas?

##### **Pandas is a Python library used for data manipulation, analysis, and cleaning.**
##### *The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.*
It provides two main data structures:
1. Series (1D data)
2. DataFrame (2D tabular data)

In [1]:
#Checking pandas version...
import pandas as pd
print(pd.__version__)

2.3.3


In [2]:
import pandas as pd
s = pd.Series([10, 20, 30, 40, 50], index = ["a", "b", "c", "d", "e"])   #index lets us assign our own labels.
s

#These labels act as the index through which we can access the data.
print(s["d"])    
print(s["b" : "e"])   #slicing works with the custom labels and includes the end label "e".

40
b    20
c    30
d    40
e    50
dtype: int64


In [3]:
import pandas as pd
data = {
    "star_names" : ["Betelgeuse", "Bellatrix", "Rigel", "Saiph"],
    "distance_in_light_years" : [643, 860, 250, 650]
}
df = pd.DataFrame(data, index = [1, 2, 3, 4])  #index lets us assign our own row labels.
df

Unnamed: 0,star_names,distance_in_light_years
1,Betelgeuse,643
2,Bellatrix,860
3,Rigel,250
4,Saiph,650


#### Reading a .csv file of stars whose names begin with 'A', along with the constellations they belong to.

In [4]:
import pandas as pd

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

star_data = pd.read_csv("sample.csv")
print("0-20 pc → Solar neighborhood\n20-50 pc → Local stellar region")
print("50-100 pc → Nearby galactic disk\n100+ pc → Distant bright stars\n")
star_data

0-20 pc → Solar neighborhood
20-50 pc → Local stellar region
50-100 pc → Nearby galactic disk
100+ pc → Distant bright stars



Unnamed: 0,Proper_name,Scientific_name,Constellation,Distance_ly,Distance_pc
0,Aljanah,ε Cygni,Cygnus,73.0,22.4
1,Alkaid,η Ursae Majoris,Ursa Major,104.0,31.9
2,Alkalurops,μ Boötis,Boötes,121.0,37.1
3,Alkaphrah,κ Ursae Majoris,Ursa Major,358.0,109.8
4,Alkarab,ε Pegasi,Pegasus,690.0,211.7
5,Alkes,α Crateris,Crater,174.0,53.4
6,Almaaz,ε Aurigae,Auriga,2030.0,622.7
7,Almach,γ Andromedae,Andromeda,355.0,108.9
8,Al Minliar al Asad,ε Leonis,Leo,247.0,75.8
9,Alnair,α Gruis,Grus,101.0,31.0


In [5]:
star_data.head()   #Returns the first 5 rows by default when no argument is given.

Unnamed: 0,Proper_name,Scientific_name,Constellation,Distance_ly,Distance_pc
0,Aljanah,ε Cygni,Cygnus,73.0,22.4
1,Alkaid,η Ursae Majoris,Ursa Major,104.0,31.9
2,Alkalurops,μ Boötis,Boötes,121.0,37.1
3,Alkaphrah,κ Ursae Majoris,Ursa Major,358.0,109.8
4,Alkarab,ε Pegasi,Pegasus,690.0,211.7


In [6]:
star_data.tail()   #Returns the last 5 rows by default when no argument is given.

Unnamed: 0,Proper_name,Scientific_name,Constellation,Distance_ly,Distance_pc
64,Axólotl,HD 224693,Cetus,308.0,94.5
65,Ayeyarwady,HD 18742,Eridanus,530.0,162.6
66,Azelfafage,π Cygni,Cygnus,172.0,52.8
67,Azha,η Eridani,Eridanus,139.0,42.6
68,Azmidi,ξ Puppis,Puppis,1200.0,368.1


In [7]:
star_data.info()   #Shows the structure, data types and non-null counts (which reveal missing values).

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69 entries, 0 to 68
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Proper_name      69 non-null     object 
 1   Scientific_name  69 non-null     object 
 2   Constellation    69 non-null     object 
 3   Distance_ly      69 non-null     float64
 4   Distance_pc      69 non-null     float64
dtypes: float64(2), object(3)
memory usage: 2.8+ KB


In [8]:
star_data.describe()   #Returns a statistical summary of the numeric columns.

Unnamed: 0,Distance_ly,Distance_pc
count,69.0,69.0
mean,361.698551,110.955072
std,439.366709,134.771342
min,16.7,5.1
25%,97.0,29.8
50%,177.0,54.3
75%,432.0,132.5
max,2200.0,674.8


#### Column Selection

In [9]:
import pandas as pd

frame_data = {
    "Planets" : ["Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"],
    "Radius_in_km" : [2439.7, 6051.8, 6371.0, 3389.5, 69911, 58232, 25362, 24622],
    "Distance_from_the_sun_in_AU" : [0.39, 0.72, 1.00, 1.52, 5.20, 9.54, 19.22, 30.06],
    "Gravitational_acceleration_in_m/s^2" : [3.7, 8.87, 9.8, 3.7, 24.79, 10.44, 8.69, 11.15]
}
df = pd.DataFrame(frame_data)
print("DataFrame Set: ")
df

DataFrame Set: 


Unnamed: 0,Planets,Radius_in_km,Distance_from_the_sun_in_AU,Gravitational_acceleration_in_m/s^2
0,Mercury,2439.7,0.39,3.7
1,Venus,6051.8,0.72,8.87
2,Earth,6371.0,1.0,9.8
3,Mars,3389.5,1.52,3.7
4,Jupiter,69911.0,5.2,24.79
5,Saturn,58232.0,9.54,10.44
6,Uranus,25362.0,19.22,8.69
7,Neptune,24622.0,30.06,11.15


Single Column Selection : It works like indexing. You pass the column header and the data under that header is returned.

In [10]:
#Single Column Selection...
df["Planets"]

0    Mercury
1      Venus
2      Earth
3       Mars
4    Jupiter
5     Saturn
6     Uranus
7    Neptune
Name: Planets, dtype: object

Multiple Column Selection : You pass a list of column headers and the data under those headers is returned.

In [11]:
#Multiple Column Selection...
df[["Planets", "Radius_in_km", "Distance_from_the_sun_in_AU"]]

Unnamed: 0,Planets,Radius_in_km,Distance_from_the_sun_in_AU
0,Mercury,2439.7,0.39
1,Venus,6051.8,0.72
2,Earth,6371.0,1.0
3,Mars,3389.5,1.52
4,Jupiter,69911.0,5.2
5,Saturn,58232.0,9.54
6,Uranus,25362.0,19.22
7,Neptune,24622.0,30.06


Outputs :
1. Selecting a single column returns a Series: the values of that one column, labelled with the column name.
2. Selecting multiple columns returns a DataFrame containing the listed columns and their values.

The two results are different types, so the way a column is selected determines what kind of object you work with afterwards.
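
The difference between the two selection styles can be checked directly. The sketch below uses a small hypothetical frame (the name `planets` and its values are for illustration only) and compares the type and shape of each result.

```python
import pandas as pd

# Hypothetical two-column frame, used only for this illustration.
planets = pd.DataFrame({
    "Planets": ["Mercury", "Venus", "Earth"],
    "Radius_in_km": [2439.7, 6051.8, 6371.0],
})

single = planets["Planets"]        # one label -> Series
as_frame = planets[["Planets"]]    # list holding one label -> DataFrame

print(type(single).__name__, single.shape)      # Series (3,)
print(type(as_frame).__name__, as_frame.shape)  # DataFrame (3, 1)
```

Note that even a list with a single header gives a DataFrame, so the brackets decide the result type, not the number of columns.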

#### Row Filtering

Row filtering allows us to access data from specific rows by using conditions. 

In [12]:
import pandas as pd
student_data = {
    "name" : ["Pia", "Hikaru", "Judit", "Magnus"],
    "age" : [29, 23, 24, 19],
    "birth_year" : [2000, 2002, 2001, 2005],
}
data_f = pd.DataFrame(student_data)
data_f

Unnamed: 0,name,age,birth_year
0,Pia,29,2000
1,Hikaru,23,2002
2,Judit,24,2001
3,Magnus,19,2005


##### Using single conditional statements.

In [13]:
print(data_f["age"] > 20)
data_f[ data_f[ "age"] > 20]    #only the rows where the condition is True are returned

0     True
1     True
2     True
3    False
Name: age, dtype: bool


Unnamed: 0,name,age,birth_year
0,Pia,29,2000
1,Hikaru,23,2002
2,Judit,24,2001


In [14]:
print(data_f[ "birth_year"] >= 2002)
data_f[ data_f[ "birth_year"] >= 2002]    #only the rows where the condition is True are returned

0    False
1     True
2    False
3     True
Name: birth_year, dtype: bool


Unnamed: 0,name,age,birth_year
1,Hikaru,23,2002
3,Magnus,19,2005


##### Using multiple conditional statements.

Here we combine more than one condition to select specific rows. Each condition must be wrapped in parentheses because '&' has higher precedence than comparison operators such as '>' and '>=', so without them the expression is grouped incorrectly and gives a wrong result or an error.

Here '&' is used instead of the 'and' operator because...
1. '&' combines two boolean Series element by element and returns a new boolean Series.
2. 'and' needs a single truth value for each operand, so using it on a Series raises a ValueError.
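
As a minimal sketch of the two points above, the following uses a hypothetical frame named `people` (values chosen for illustration) to show the parenthesised `&` form and what happens when `and` is applied to two Series.

```python
import pandas as pd

# Hypothetical data for illustration only.
people = pd.DataFrame({"age": [29, 23, 24, 19],
                       "birth_year": [2000, 2002, 2001, 2005]})

# Each comparison is parenthesised, then combined element-wise with &.
mask = (people["age"] > 20) & (people["birth_year"] >= 2002)
print(mask.tolist())   # [False, True, False, False]

# `and` asks for one truth value of a whole Series, which is ambiguous.
try:
    (people["age"] > 20) and (people["birth_year"] >= 2002)
    raised = False
except ValueError:
    raised = True
print(raised)          # True
```

The same rule applies to `|` (element-wise OR) and `~` (element-wise NOT), which replace `or` and `not` in filters.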

In [15]:
#Check for shape of the dataframe...
print(f"Original Shape = {data_f.shape}\n\n")

Original Shape = (4, 3)




In [16]:
print((data_f[ "age"] > 20) & (data_f[ "birth_year"] >= 2002))
data_1 = data_f[(data_f[ "age"] > 20) & (data_f[ "birth_year"] >= 2002)]
data_1

0    False
1     True
2    False
3    False
dtype: bool


Unnamed: 0,name,age,birth_year
1,Hikaru,23,2002


In [17]:
#Check for shape after filtering...
print(f"Filtered Shape = {data_1.shape}")

Filtered Shape = (1, 3)


### Sorting Data

#### sort_values() : It sorts the values in ascending order by default.

In [18]:
import pandas as pd

earth_composition = {
    "Gases" : ["Nitrogen", "Oxygen", "Carbon Dioxide", "Argon", "Other Gases"],
    "percent" : [78.084, 20.946, 0.042, 0.934, 0.002]
}
comp_data = pd.DataFrame(earth_composition)
comp_data

Unnamed: 0,Gases,percent
0,Nitrogen,78.084
1,Oxygen,20.946
2,Carbon Dioxide,0.042
3,Argon,0.934
4,Other Gases,0.002


In [19]:
#sorting the above composition values
comp_data.sort_values("percent")    #By default sorts in ascending order

Unnamed: 0,Gases,percent
4,Other Gases,0.002
2,Carbon Dioxide,0.042
3,Argon,0.934
1,Oxygen,20.946
0,Nitrogen,78.084


In [20]:
#sorting in descending order
comp_data.sort_values("percent", ascending = False)

#when ascending = False, the values are sorted in descending order

Unnamed: 0,Gases,percent
0,Nitrogen,78.084
1,Oxygen,20.946
3,Argon,0.934
2,Carbon Dioxide,0.042
4,Other Gases,0.002
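
sort_values() can also take several columns, and it returns a new DataFrame rather than changing the original. The sketch below illustrates both points with a hypothetical frame named `readings`.

```python
import pandas as pd

# Hypothetical readings, for illustration only.
readings = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
    "temp": [32, 35, 30, 36],
})

# Sort by city A-Z, then by temperature from high to low within each city.
ordered = readings.sort_values(["city", "temp"], ascending=[True, False])

print(ordered["temp"].tolist())    # [32, 30, 36, 35]
print(readings["temp"].tolist())   # original order is untouched: [32, 35, 30, 36]
```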


In [21]:
comp_data.isna()

Unnamed: 0,Gases,percent
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False


#### Checking the missing values (NaN) inside the DataFrame

In [22]:
import pandas as pd
presenty_data = {
    "student_name" : ["Rahul", "Priya", "Daksh", "Sanya", "Durvesh"], 
    "presenty" : [1, None, 0, 1, None]
}
p_data = pd.DataFrame(presenty_data)
p_data

Unnamed: 0,student_name,presenty
0,Rahul,1.0
1,Priya,
2,Daksh,0.0
3,Sanya,1.0
4,Durvesh,


In [23]:
#Checking which values are missing
print(p_data.isna())

#Counting the missing values (NaN) per column
print(p_data.isna().sum())

   student_name  presenty
0         False     False
1         False      True
2         False     False
3         False     False
4         False      True
student_name    0
presenty        2
dtype: int64


**.isna() returns boolean values. For missing values such as NaN or None it returns 'True', otherwise 'False'.**

**.sum() here counts the number of missing values per column, because each 'True' counts as 1.**
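
Because each 'True' counts as 1, the same boolean table can also give the share of missing values. This is a small sketch that rebuilds the attendance data under the assumed name `attendance`; `.mean()` on the boolean table is used here as one possible way to get a fraction.

```python
import pandas as pd

attendance = pd.DataFrame({
    "student_name": ["Rahul", "Priya", "Daksh", "Sanya", "Durvesh"],
    "presenty": [1, None, 0, 1, None],
})

missing_count = attendance.isna().sum()    # number of missing values per column
missing_share = attendance.isna().mean()   # fraction of rows that are missing

print(missing_count["presenty"])   # 2
print(missing_share["presenty"])   # 0.4
```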

### How to handle missing values?

In [24]:
import pandas as pd
import numpy as np

df2 = pd.read_csv("hokage.csv", na_values = [" None", "None "])
df2

Unnamed: 0,Hokage,Clan,Signature_Jutsu,Moniker,Databook_Total
0,Hashirama,Senju,WoodRelease,GodofShinobi,
1,Tobirama,Senju,FlyingRaijin,,38.0
2,Hiruzen,,ReaperDeathSeal,TheProfessor,
3,Minato,Namikaze,Rasengan,YellowFlash,32.5
4,Tsunade,Senju,100Healings,LegendarySannin,35.0
5,Kakashi,,PurpleLightning,CopyNinja,33.0
6,Naruto,Uzumaki,Rasenshuriken,ChildofProphecy,
7,Shikamaru,Nara,,,37.5


In [25]:
#checking for null or missing values
df2.isna().sum()

Hokage             0
Clan               2
Signature_Jutsu    1
Moniker            2
Databook_Total     3
dtype: int64

**Options to handle missing values :**

In [26]:
#1. Drop missing rows >>>
df2.dropna()

Unnamed: 0,Hokage,Clan,Signature_Jutsu,Moniker,Databook_Total
3,Minato,Namikaze,Rasengan,YellowFlash,32.5
4,Tsunade,Senju,100Healings,LegendarySannin,35.0


*As you can see in the above output, .dropna() removes every row that has NaN/None in any column. Only two of the eight rows remain, which is a heavy loss when most of the data is needed to handle, store or organize it.*

In [27]:
#2. Fill missing values >>>
df2_filled = df2.fillna(df2.mean(numeric_only = True))
df2_filled

Unnamed: 0,Hokage,Clan,Signature_Jutsu,Moniker,Databook_Total
0,Hashirama,Senju,WoodRelease,GodofShinobi,35.2
1,Tobirama,Senju,FlyingRaijin,,38.0
2,Hiruzen,,ReaperDeathSeal,TheProfessor,35.2
3,Minato,Namikaze,Rasengan,YellowFlash,32.5
4,Tsunade,Senju,100Healings,LegendarySannin,35.0
5,Kakashi,,PurpleLightning,CopyNinja,33.0
6,Naruto,Uzumaki,Rasenshuriken,ChildofProphecy,35.2
7,Shikamaru,Nara,,,37.5


*As you can see in the above output, the string columns are unchanged and only the numeric column is affected: its missing values are filled with the mean of the remaining values (35.2).*

#### When to drop and when to fill ?

Use of .dropna() ->
1. Use it when the dataset is large and losing a few rows does not affect your result.
2. Use it when a single row has too many NaN values to be useful.

Use of .fillna() ->
1. Use it when the dataset is small and every dropped row would affect the result.
2. Use it when there are only a few NaN values and they can reasonably be replaced by a fill value.
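
Both methods accept options that make these rules easier to apply. The sketch below uses a hypothetical frame named `records`; `subset` and `thresh` are dropna() parameters, and passing a dictionary to fillna() sets a separate fill value per column.

```python
import pandas as pd
import numpy as np

# Hypothetical records, for illustration only.
records = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "clan": ["Senju", np.nan, "Nara", np.nan, "Uzumaki"],
    "score": [38.0, np.nan, 37.5, 33.0, np.nan],
})

# Drop a row only if the column we depend on is missing.
needs_score = records.dropna(subset=["score"])

# Keep rows that have at least 2 non-missing values.
mostly_complete = records.dropna(thresh=2)

# Fill each column with a value that suits its type.
filled = records.fillna({"clan": "Unknown", "score": records["score"].mean()})

print(len(needs_score), len(mostly_complete), int(filled.isna().sum().sum()))
# 3 4 0
```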

***Real datasets almost always contain missing values.***

### Learning Groupby and Aggregation functions

#### Using groupby() on actual data
*groupby() groups the rows that share the same value in a column, so that each distinct value appears only once in the result.*

In [28]:
import pandas as pd

df3 = pd.read_csv("weather.csv")
df3.index = ["A", "B", "C", "D", "E"]
df3

Unnamed: 0,City_name,Temperature,Humidity
A,Delhi,30,60
B,Delhi,32,55
C,Mumbai,35,70
D,Mumbai,36,65
E,Chennai,33,75


In [29]:
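#group by column
df3.groupby("City_name")   #returns a GroupBy object; nothing is computed until an aggregation is applied
df3                        #the original DataFrame is unchanged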

Unnamed: 0,City_name,Temperature,Humidity
A,Delhi,30,60
B,Delhi,32,55
C,Mumbai,35,70
D,Mumbai,36,65
E,Chennai,33,75


In [30]:
#apply aggregation functions 
df3.groupby("City_name").mean(numeric_only = True)

Unnamed: 0_level_0,Temperature,Humidity
City_name,Unnamed: 1_level_1,Unnamed: 2_level_1
Chennai,33.0,75.0
Delhi,31.0,57.5
Mumbai,35.5,67.5


In [31]:
#maximum value of each numeric column per city
df3.groupby("City_name").max(numeric_only=True)

Unnamed: 0_level_0,Temperature,Humidity
City_name,Unnamed: 1_level_1,Unnamed: 2_level_1
Chennai,33,75
Delhi,32,60
Mumbai,36,70


In [32]:
#minimum value of each numeric column per city
df3.groupby("City_name").min(numeric_only = True)

Unnamed: 0_level_0,Temperature,Humidity
City_name,Unnamed: 1_level_1,Unnamed: 2_level_1
Chennai,33,75
Delhi,30,55
Mumbai,35,65


#### mean()/max()/min() and numeric_only
*mean() calculates the mean values per city, max() calculates maximum values per city and min() calculates minimum values per city.*
*'numeric_only = True' restricts the calculation to numeric columns; non-numeric columns are left out of the result.*
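
The effect of numeric_only can be seen by adding a text column to the groups. In this sketch the frame `weather` and its `Station` column are hypothetical and exist only to show that the text column does not appear in the result.

```python
import pandas as pd

weather = pd.DataFrame({
    "City_name": ["Delhi", "Delhi", "Mumbai"],
    "Station": ["north", "south", "west"],   # hypothetical text column
    "Temperature": [30, 32, 35],
})

per_city = weather.groupby("City_name").mean(numeric_only=True)

print(per_city.columns.tolist())              # ['Temperature']
print(per_city.loc["Delhi", "Temperature"])   # 31.0
```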

#### Groupby with multiple aggregation functions
*groupby with multiple aggregations applies several functions per group and returns them in a single table.*

groupby() splits the data into groups based on the city name.
Then aggregation functions such as max, min, or mean are applied to each group.
This helps compare the temperature and humidity of different cities.

In [33]:
#groupby with multiple aggregations
df3_mix = df3.groupby("City_name").agg({
    "Temperature": ["max", "min", "mean"],
    "Humidity": ["max", "min", "mean"]
})
df3_mix

Unnamed: 0_level_0,Temperature,Temperature,Temperature,Humidity,Humidity,Humidity
Unnamed: 0_level_1,max,min,mean,max,min,mean
City_name,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
Chennai,33,33,33.0,75,75,75.0
Delhi,32,30,31.0,60,55,57.5
Mumbai,36,35,35.5,70,65,67.5


*In the above cell, the output combines the groups and all the aggregation functions in a single table with two header levels.*
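
A table with two header levels is addressed with (column, function) pairs. The sketch below rebuilds the weather data inline under the assumed name `weather`, reads one cell, and then joins the two levels into flat names, which is one common way to simplify such a table.

```python
import pandas as pd

weather = pd.DataFrame({
    "City_name": ["Delhi", "Delhi", "Mumbai", "Mumbai", "Chennai"],
    "Temperature": [30, 32, 35, 36, 33],
    "Humidity": [60, 55, 70, 65, 75],
})

summary = weather.groupby("City_name").agg({
    "Temperature": ["max", "min"],
    "Humidity": ["mean"],
})

# A single cell is addressed with a (column, function) tuple.
print(summary.loc["Delhi", ("Temperature", "max")])   # 32

# Join the two header levels into flat names such as "Temperature_max".
summary.columns = ["_".join(pair) for pair in summary.columns]
print(summary.columns.tolist())
# ['Temperature_max', 'Temperature_min', 'Humidity_mean']
```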

In [34]:
#Common methods to access various forms of data from DataFrame
print(df3.index)
print("\n")
print(df3.columns)
print("\n")
print(df3.City_name)
print("\n")
print(df3.Temperature)
print("\n")
print(df3.Humidity)

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')


Index(['City_name', 'Temperature', 'Humidity'], dtype='object')


A      Delhi
B      Delhi
C     Mumbai
D     Mumbai
E    Chennai
Name: City_name, dtype: object


A    30
B    32
C    35
D    36
E    33
Name: Temperature, dtype: int64


A    60
B    55
C    70
D    65
E    75
Name: Humidity, dtype: int64


#### Uses of .loc[] and .iloc[] in the program

In [35]:
#printing the first few rows for index purpose
df3.head(3)

Unnamed: 0,City_name,Temperature,Humidity
A,Delhi,30,60
B,Delhi,32,55
C,Mumbai,35,70


In [36]:
#id() returns the identity of the DataFrame object (its memory address in CPython)
id(df3)

1779455752272

In [37]:
#using .loc[] to get the row content of the specified label.
df3.loc["A"]    #Specified label here is 'A' i.e. first row.

City_name      Delhi
Temperature       30
Humidity          60
Name: A, dtype: object

In [38]:
#using .iloc[] to get the row content at the specified integer position.
df3.iloc[0]    #Specified position here is '0' i.e. first row.

City_name      Delhi
Temperature       30
Humidity          60
Name: A, dtype: object

Slicing also works with both .loc[] and .iloc[].

In [39]:
#using .loc[] for slicing 
df3.loc['B' : 'D']    #prints only the row contents from 'B' to 'D'.

Unnamed: 0,City_name,Temperature,Humidity
B,Delhi,32,55
C,Mumbai,35,70
D,Mumbai,36,65


In [40]:
#One can also select specific columns from the DataFrame to get the row content of only those columns.
df3.loc['B' : 'D', ["City_name", "Temperature"]]

#here it requires the column labels to select those specific columns.

Unnamed: 0,City_name,Temperature
B,Delhi,32
C,Mumbai,35
D,Mumbai,36


In [41]:
#using .iloc[] for slicing 
df3.iloc[1 : 4]    #returns the rows at positions 1 to 3; the end position 4 is excluded.

Unnamed: 0,City_name,Temperature,Humidity
B,Delhi,32,55
C,Mumbai,35,70
D,Mumbai,36,65


In [42]:
#One can also select specific columns from the DataFrame to get the row content of only those columns.
df3.iloc[1 : 4, [0, 1]]

#here it requires the integer positions of the columns to select those specific columns.

Unnamed: 0,City_name,Temperature
B,Delhi,32
C,Mumbai,35
D,Mumbai,36


Remember a few things :
1) .loc uses labels (names of rows and columns)
2) .iloc uses integer positions (like list indexing)
3) .loc includes the end label in slices
4) .iloc excludes the end position in slices

*“I will use loc when I know column names, and iloc when I work by positions.”*
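
The slice rules can be verified side by side. This sketch rebuilds part of the weather table inline under the assumed name `weather` and compares a label slice with a position slice.

```python
import pandas as pd

weather = pd.DataFrame(
    {"City_name": ["Delhi", "Delhi", "Mumbai", "Mumbai", "Chennai"],
     "Temperature": [30, 32, 35, 36, 33]},
    index=["A", "B", "C", "D", "E"],
)

by_label = weather.loc["B":"D"]      # label slice: B, C and D
by_position = weather.iloc[1:3]      # position slice: positions 1 and 2 only

print(by_label.index.tolist())       # ['B', 'C', 'D']
print(by_position.index.tolist())    # ['B', 'C']
```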

## Mini EDA (Exploratory Data Analysis)

In [43]:
import pandas as pd

planets_data_book = pd.read_csv("planets.csv")
planets_data_book

Unnamed: 0,Planet,Type,Distance_AU,Mass_Earths,Radius_km,Gravity_mps2,Orbital_Period_days,Rotational_Period_hours,Num_Moons,Has_Rings,Surface_Temp_C,Atmospheric_Composition,Density_gcm3,Escape_Velocity_kms,Albedo_reflectivity
0,Mercury,Terrestrial,0.39,0.055,,3.7,87.97,1407.6,0,No,-173 to 427,"O, Na, He, H, K, Ca, Mg",5.43,4.25,0.088
1,Venus,Terrestrial,0.72,0.815,6051.8,8.87,224.7,-5832.0,0,No,462,"CO2 96.5%, N2 3.5%",,10.36,
2,Earth,Terrestrial,1.0,1.0,6371.0,9.8,365.25,23.93,1,No,-89 to 58,"N2 78%, O2 21%, Ar, CO2",5.51,11.19,0.306
3,Mars,Terrestrial,1.52,0.107,3389.5,3.71,686.98,24.62,2,,-125 to 20,"CO2 ~95%, N2, Ar",3.93,5.03,0.25
4,Jupiter,Gas Giant,5.2,317.8,69911.0,24.79,,9.93,97,Yes,-145,"H2 ~90%, He ~10%",1.33,59.5,
5,Saturn,Gas Giant,9.54,95.2,58232.0,10.44,10759.22,10.66,274,,-178,"H2 ~96%, He ~3%",,35.5,0.47
6,Uranus,Ice Giant,19.18,14.5,25362.0,8.69,30685.4,-17.24,29,Yes,-197,"H2, He, CH4",1.27,21.3,
7,Neptune,Ice Giant,30.06,17.1,24622.0,,60189.0,16.11,16,Yes,-201,"H2, He, CH4",1.64,23.5,


### Inspection

In [44]:
planets_data_book.head(4)   # Returns first four rows.

Unnamed: 0,Planet,Type,Distance_AU,Mass_Earths,Radius_km,Gravity_mps2,Orbital_Period_days,Rotational_Period_hours,Num_Moons,Has_Rings,Surface_Temp_C,Atmospheric_Composition,Density_gcm3,Escape_Velocity_kms,Albedo_reflectivity
0,Mercury,Terrestrial,0.39,0.055,,3.7,87.97,1407.6,0,No,-173 to 427,"O, Na, He, H, K, Ca, Mg",5.43,4.25,0.088
1,Venus,Terrestrial,0.72,0.815,6051.8,8.87,224.7,-5832.0,0,No,462,"CO2 96.5%, N2 3.5%",,10.36,
2,Earth,Terrestrial,1.0,1.0,6371.0,9.8,365.25,23.93,1,No,-89 to 58,"N2 78%, O2 21%, Ar, CO2",5.51,11.19,0.306
3,Mars,Terrestrial,1.52,0.107,3389.5,3.71,686.98,24.62,2,,-125 to 20,"CO2 ~95%, N2, Ar",3.93,5.03,0.25


In [45]:
planets_data_book.tail(3)    # Returns last three rows.

Unnamed: 0,Planet,Type,Distance_AU,Mass_Earths,Radius_km,Gravity_mps2,Orbital_Period_days,Rotational_Period_hours,Num_Moons,Has_Rings,Surface_Temp_C,Atmospheric_Composition,Density_gcm3,Escape_Velocity_kms,Albedo_reflectivity
5,Saturn,Gas Giant,9.54,95.2,58232.0,10.44,10759.22,10.66,274,,-178,"H2 ~96%, He ~3%",,35.5,0.47
6,Uranus,Ice Giant,19.18,14.5,25362.0,8.69,30685.4,-17.24,29,Yes,-197,"H2, He, CH4",1.27,21.3,
7,Neptune,Ice Giant,30.06,17.1,24622.0,,60189.0,16.11,16,Yes,-201,"H2, He, CH4",1.64,23.5,


In [46]:
planets_data_book.info()    
# Returns the class, total entries with index, no. of non-null objects, the data-types and the memory 
# usage.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Planet                   8 non-null      object 
 1   Type                     8 non-null      object 
 2   Distance_AU              8 non-null      float64
 3   Mass_Earths              8 non-null      float64
 4   Radius_km                7 non-null      float64
 5   Gravity_mps2             7 non-null      float64
 6   Orbital_Period_days      7 non-null      float64
 7   Rotational_Period_hours  8 non-null      float64
 8   Num_Moons                8 non-null      int64  
 9   Has_Rings                6 non-null      object 
 10  Surface_Temp_C           8 non-null      object 
 11  Atmospheric_Composition  8 non-null      object 
 12  Density_gcm3             6 non-null      float64
 13  Escape_Velocity_kms      8 non-null      float64
 14  Albedo_reflectivity      4 non

In [47]:
planets_data_book.describe()    # Returns the description of the statistical information as a summary.

Unnamed: 0,Distance_AU,Mass_Earths,Radius_km,Gravity_mps2,Orbital_Period_days,Rotational_Period_hours,Num_Moons,Density_gcm3,Escape_Velocity_kms,Albedo_reflectivity
count,8.0,8.0,7.0,7.0,7.0,8.0,8.0,6.0,8.0,4.0
mean,8.45125,55.822125,27705.614286,10.0,14714.074286,-544.54875,52.375,3.185,21.32875,0.2785
std,10.837234,110.605697,26594.118607,7.085488,22968.115626,2191.647874,95.350388,2.024646,18.681635,0.157619
min,0.39,0.055,3389.5,3.7,87.97,-5832.0,0.0,1.27,4.25,0.088
25%,0.93,0.638,6211.4,6.2,294.975,3.1375,0.75,1.4075,9.0275,0.2095
50%,3.36,7.75,24622.0,8.87,686.98,13.385,9.0,2.785,16.245,0.278
75%,11.95,36.625,41797.0,10.12,20722.31,24.1025,46.0,5.055,26.5,0.347
max,30.06,317.8,69911.0,24.79,60189.0,1407.6,274.0,5.51,59.5,0.47


In [48]:
print(planets_data_book.isna())    
# Returns the boolean values where 'True' means null data and 'False' means non-null data.

print("\n")

print("Total Null Data: ")
print(planets_data_book.isna().sum())
# Returns the number of missing values in each column of the dataframe.

   Planet   Type  Distance_AU  Mass_Earths  Radius_km  Gravity_mps2  \
0   False  False        False        False       True         False   
1   False  False        False        False      False         False   
2   False  False        False        False      False         False   
3   False  False        False        False      False         False   
4   False  False        False        False      False         False   
5   False  False        False        False      False         False   
6   False  False        False        False      False         False   
7   False  False        False        False      False          True   

   Orbital_Period_days  Rotational_Period_hours  Num_Moons  Has_Rings  \
0                False                    False      False      False   
1                False                    False      False      False   
2                False                    False      False      False   
3                False                    False      False       Tru

In [49]:
planets_data_book.groupby("Type").agg(    
    Mean_mass = ("Mass_Earths", "mean"),
    Mean_radius = ("Radius_km", "mean"),
    Total_moons = ("Num_Moons", "sum")
)
# Returns the mean_mass, mean_radius and total_moons for all three types of planets.

Unnamed: 0_level_0,Mean_mass,Mean_radius,Total_moons
Type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Gas Giant,206.5,64071.5,371
Ice Giant,15.8,24992.0,45
Terrestrial,0.49425,5270.766667,3


### Data cleaning

#### The DataFrame has several missing values and needs to be cleaned. Here we fill the missing numeric values instead of dropping rows.
**REMEMBER : The mean-based fill does not apply to object-type columns; their missing values need to be filled manually or with a separate pandas operation per column.**
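
As a rough sketch of such separate filling, the example below works on a hypothetical subset of the planets table. Using the placeholder "Unknown" for text and a per-Type mean for numbers are illustrative choices, not the only options.

```python
import pandas as pd
import numpy as np

# A hypothetical subset of the planets table, for illustration only.
planets = pd.DataFrame({
    "Planet": ["Mercury", "Venus", "Jupiter", "Saturn"],
    "Type": ["Terrestrial", "Terrestrial", "Gas Giant", "Gas Giant"],
    "Radius_km": [np.nan, 6051.8, 69911.0, 58232.0],
    "Has_Rings": ["No", "No", "Yes", np.nan],
})

# Text column: mark the gap explicitly instead of inventing a value.
planets["Has_Rings"] = planets["Has_Rings"].fillna("Unknown")

# Numeric column: fill with the mean of planets of the same Type.
type_mean = planets.groupby("Type")["Radius_km"].transform("mean")
planets["Radius_km"] = planets["Radius_km"].fillna(type_mean)

print(planets.loc[0, "Radius_km"])   # 6051.8
print(planets.loc[3, "Has_Rings"])   # Unknown
```

A per-group mean keeps a filled terrestrial radius in the range of other terrestrial planets, whereas a mean over all rows would be pulled upward by the giants.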

In [50]:
filled_planets_data_book = planets_data_book.fillna(planets_data_book.mean(numeric_only = True))
print(filled_planets_data_book)
# Each missing numeric value is replaced with the mean of its column, i.e. the sum of the
# available values divided by the number of rows that have a value.

print("\n")

# And now the cleaned numeric data looks like ...
print("Filled planetary numeric data (not for actual measurement) : ")
print(filled_planets_data_book.iloc[0 : 8, [0, 2, 3, 4, 5, 6, 7, 8]])   # end position 8 so that all 8 rows are shown

    Planet         Type  Distance_AU  Mass_Earths     Radius_km  Gravity_mps2  \
0  Mercury  Terrestrial         0.39        0.055  27705.614286          3.70   
1    Venus  Terrestrial         0.72        0.815   6051.800000          8.87   
2    Earth  Terrestrial         1.00        1.000   6371.000000          9.80   
3     Mars  Terrestrial         1.52        0.107   3389.500000          3.71   
4  Jupiter    Gas Giant         5.20      317.800  69911.000000         24.79   
5   Saturn    Gas Giant         9.54       95.200  58232.000000         10.44   
6   Uranus    Ice Giant        19.18       14.500  25362.000000          8.69   
7  Neptune    Ice Giant        30.06       17.100  24622.000000         10.00   

   Orbital_Period_days  Rotational_Period_hours  Num_Moons Has_Rings  \
0            87.970000                  1407.60          0        No   
1           224.700000                 -5832.00          0        No   
2           365.250000                    23.93       

#### Basic questions to answer :

In [51]:
# How many rows and columns ?

print(f"Total rows: {len(planets_data_book.index)}")   # Returns the total rows in the DataFrame.

print(f"Total columns: {len(planets_data_book.columns)}")   # Returns the total columns in the DataFrame.

Total rows: 8
Total columns: 15


In [52]:
# Which columns have missing values?

print(f"Columns with missing values: \n{planets_data_book.columns[planets_data_book.isnull().any()]}\n")
# Returns the names of the columns that have missing values.

print(f"Number of columns with missing values: {planets_data_book.isnull().any().sum()}")
# Returns the number of columns with missing values.

Columns with missing values: 
Index(['Radius_km', 'Gravity_mps2', 'Orbital_Period_days', 'Has_Rings',
       'Density_gcm3', 'Albedo_reflectivity'],
      dtype='object')

Number of columns with missing values: 6


In [53]:
# Which columns are numeric?

print(f"Columns with numeric values: {planets_data_book.select_dtypes('number').columns}")
# Returns the names of all the columns with numeric data types.

Columns with numeric values: Index(['Distance_AU', 'Mass_Earths', 'Radius_km', 'Gravity_mps2',
       'Orbital_Period_days', 'Rotational_Period_hours', 'Num_Moons',
       'Density_gcm3', 'Escape_Velocity_kms', 'Albedo_reflectivity'],
      dtype='object')


Note : For questions that compare planet properties, max() and idxmax() skip NaN values, so the answer reflects only the planets that have a recorded value. If a missing value were actually the largest, the true answer would be different.
For example: 

In [54]:
# Which planet is largest by radius?   

max_radius = planets_data_book['Radius_km'].max()
max_radius_index = planets_data_book['Radius_km'].idxmax()

print(f"""Planet that is largest by radius is {planets_data_book.loc[max_radius_index, 'Planet']}.
It has the radius of {max_radius} km.""")

Planet that is largest by radius is Jupiter.
It has the radius of 69911.0 km.
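
The note about NaN values can be checked with a small sketch. The Series `radius` and its labels are hypothetical: the body labelled "Huge" has no recorded value, so it cannot be reported as the maximum.

```python
import pandas as pd
import numpy as np

# Hypothetical radii; the body labelled "Huge" has no recorded value.
radius = pd.Series([2439.7, np.nan, 6371.0], index=["Small", "Huge", "Medium"])

print(radius.max())          # 6371.0
print(radius.idxmax())       # Medium
print(radius.isna().sum())   # 1
```

Checking isna().sum() on the column before trusting max() or idxmax() shows how many rows were left out of the comparison.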


In [55]:
# Which planet has the strongest gravity?

max_gravity = planets_data_book['Gravity_mps2'].max()
max_gravity_index = planets_data_book['Gravity_mps2'].idxmax()

print(f"""The planet with the strongest gravity is {planets_data_book.loc[max_gravity_index, 'Planet']}.
It has a gravitational acceleration of {max_gravity} m/s^2.""")

The planet with the strongest gravity is Jupiter.
It has a gravitational acceleration of 24.79 m/s^2.


In [56]:
# Which planet is farthest from the sun?

max_distance = planets_data_book['Distance_AU'].max()
max_distance_index = planets_data_book['Distance_AU'].idxmax()
farthest_planet = planets_data_book.loc[max_distance_index, 'Planet']

print(f"The planet farthest from the sun is {farthest_planet}.")
print(f"The distance between the sun and {farthest_planet} is {max_distance} astronomical units (AU).")

# 1 AU is about 149.6 million km (1.496 x 10^8 km). It is the average distance between the Sun and the Earth.

The planet farthest from the sun is Neptune.
The distance between the sun and Neptune is 30.06 astronomical units (AU).


In [57]:
# How many terrestrial vs gas giants?

counts = planets_data_book['Type'].value_counts()
terrestrial = counts.get("Terrestrial", 0)
gas_giants = counts.get("Gas Giant", 0)
ice_giants = counts.get("Ice Giant", 0)

total_planets = planets_data_book.shape[0]

print(f"There are a total of {total_planets} planets.")

print(f"""Out of these planets, {terrestrial} are terrestrial and {gas_giants} are gas giants
and the remaining {ice_giants} are ice giants.""")

There are a total of 8 planets.
Out of these planets, 4 are terrestrial and 2 are gas giants
and the remaining 2 are ice giants.


In [58]:
# Does distance roughly relate to temperature? 

# Here we added an extra column to represent the temperatures as a single value.
planets_data_book["High_Surface_Temp"] = [167, 464, 15, -65, -110, -140, -195, -200] 

relation = planets_data_book["Distance_AU"].corr(planets_data_book["High_Surface_Temp"])

rel_table = planets_data_book.iloc[0 : 8, [0, 2, 15]]

print(rel_table)

print(f"""\nThe correlation between distance and surface temperature is {relation}.
The correlation is negative, meaning the temperature tends to fall as the distance grows.
As the distance between the planet and the sun increases, the temperature decreases.
From the above table, Venus and Mercury have the highest temperatures while Neptune has the lowest
temperature.""")

# Venus is much hotter than Mercury even though it is only the second closest planet, because of its 
# dense atmosphere and especially its carbon dioxide, which absorbs radiation and traps the heat 
# around the planet.

    Planet  Distance_AU  High_Surface_Temp
0  Mercury         0.39                167
1    Venus         0.72                464
2    Earth         1.00                 15
3     Mars         1.52                -65
4  Jupiter         5.20               -110
5   Saturn         9.54               -140
6   Uranus        19.18               -195
7  Neptune        30.06               -200

The correlation between distance and surface temperature is -0.6307420036402225.
The correlation is negative, meaning the temperature tends to fall as the distance grows.
As the distance between the planet and the sun increases, the temperature decreases.
From the above table, Venus and Mercury have the highest temperatures while Neptune has the lowest
temperature.
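
.corr() measures a linear relationship by default, and the drop in temperature with distance is far from linear. The sketch below repeats the two columns inline (the variable names are assumptions for this example) and also correlates the ranks, which is one way to measure how consistently the ordering holds.

```python
import pandas as pd

distance_au = pd.Series([0.39, 0.72, 1.00, 1.52, 5.20, 9.54, 19.18, 30.06])
surface_temp = pd.Series([167, 464, 15, -65, -110, -140, -195, -200])

pearson = distance_au.corr(surface_temp)                    # linear relationship
rank_based = distance_au.rank().corr(surface_temp.rank())   # ordering only

print(round(pearson, 2), round(rank_based, 2))
```

The rank-based value is much closer to -1 because, apart from Venus being hotter than Mercury, every planet farther out is colder than the one before it.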


In [59]:
# Which planet has the most moons?

max_moons = planets_data_book['Num_Moons'].max()
max_moon_index = planets_data_book['Num_Moons'].idxmax()

print(f"The planet with the largest number of moons is {planets_data_book.loc[max_moon_index, 'Planet']}.")
print(f"It has {max_moons} known moons so far.")

#Studies suggest that the number might be even larger.

The planet with the largest number of moons is Saturn.
It has 274 known moons so far.


In [60]:
# How many planets have rings around them?

has_rings = planets_data_book['Has_Rings'].value_counts()['Yes']
planets_with_rings = planets_data_book.iloc[0 : 8, [0, 9]]

print(f"According to the dataframe, there are {has_rings} planets with rings considering only the available data.\n")
print(planets_with_rings)

#In reality, all 4 giant planets have rings around them, namely Jupiter, Saturn, Uranus and 
#Neptune.

According to the dataframe, there are 3 planets with rings considering only the available data.

    Planet Has_Rings
0  Mercury        No
1    Venus        No
2    Earth        No
3     Mars       NaN
4  Jupiter       Yes
5   Saturn       NaN
6   Uranus       Yes
7  Neptune       Yes


In [61]:
# Are larger planets more likely to have rings?

has_rings = planets_data_book['Has_Rings'].value_counts()['Yes']

planets_with_rings = planets_data_book.iloc[0 : 8, [0, 9]]

print(planets_with_rings)

print("""\nFrom the above dataframe with the available information, three planets are recorded as having rings,
and all three are giant planets, while the planets recorded without rings are the smaller terrestrial ones.
This suggests that larger planets are more likely to have rings.""")

    Planet Has_Rings
0  Mercury        No
1    Venus        No
2    Earth        No
3     Mars       NaN
4  Jupiter       Yes
5   Saturn       NaN
6   Uranus       Yes
7  Neptune       Yes

From the above dataframe with the available information, three planets are recorded as having rings,
and all three are giant planets, while the planets recorded without rings are the smaller terrestrial ones.
This suggests that larger planets are more likely to have rings.
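
The size comparison can also be made with numbers rather than by reading the table. This is a minimal sketch that copies the Radius_km and Has_Rings values shown earlier into a hypothetical frame named `rings` and compares the mean radius of the two groups.

```python
import pandas as pd
import numpy as np

# Values copied from the planets table; NaN marks the gaps in the data.
rings = pd.DataFrame({
    "Planet": ["Mercury", "Venus", "Earth", "Mars",
               "Jupiter", "Saturn", "Uranus", "Neptune"],
    "Radius_km": [np.nan, 6051.8, 6371.0, 3389.5,
                  69911.0, 58232.0, 25362.0, 24622.0],
    "Has_Rings": ["No", "No", "No", np.nan, "Yes", np.nan, "Yes", "Yes"],
})

# Rows with an unknown Has_Rings value are left out of the groups by default.
mean_radius = rings.groupby("Has_Rings")["Radius_km"].mean()

print(mean_radius["Yes"] > mean_radius["No"])   # True
```

With only eight planets and two unknown entries this is an observation about the sample, not a statistical test.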
