
# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Practice Notebook 2: Introduction to Pandas
#### (Ungraded)

## Learning Objectives


At the end of the experiment, you will be able to:

* understand the various applications of Pandas and why it is a building block in the field of Data Science
* define a Pandas DataFrame and describe how data can be stored and accessed in these data structures
* describe the key characteristics of pandas DataFrames
* perform data cleaning and manipulation using Pandas
* run calculations and summarize data in pandas DataFrames

## Information

#### Pandas

* Pandas is an important Python library for data manipulation, wrangling, and analysis.
* It functions as an intuitive and easy-to-use set of tools for performing operations on any kind of data.
* Initial work on pandas was done by Wes McKinney in 2008 while he was a developer at AQR Capital Management. Since then, the scope of the pandas project has grown considerably, and it has become a popular library of choice for data scientists all over the world.
* Pandas allows you to work with both cross-sectional data and time-series data.
* The data representation in pandas is done using two primary data structures:
  - Series
  - DataFrame

##### Series

* A Series in pandas is a one-dimensional array with axis labels.
* In functionality, it is similar to a simple array, except that each value carries a label.
* The labels form the index of the Series and must be hashable, although they need not be unique. pandas relies on hashable labels for lookup and alignment when data in a Series is manipulated or summarized.
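
As a minimal sketch of the points above, the snippet below builds a small Series with string labels and reads values back by label and by position. The variable names are illustrative only; the values reuse the Name/Age example from the exercises in this notebook.

```python
import pandas as pd

# A Series pairs each value with a label from its index.
ages = pd.Series([30, 56, 8], index=['Jeff', 'Esha', 'Jia'], name='Age')

# Label-based lookup, similar to a dictionary.
esha_age = ages['Esha']

# Position-based lookup, similar to an array.
first_age = ages.iloc[0]

# Arithmetic is applied element-wise and the labels stay attached.
ages_next_year = ages + 1

print(esha_age, first_age)
print(ages_next_year)
```

Strings and integers are hashable, which is why they can be used as labels here.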

##### DataFrame

* The DataFrame is the most important and useful data structure in pandas, used for almost all kinds of data representation and manipulation. Unlike NumPy arrays (in general), a DataFrame can contain heterogeneous data.
* Pandas DataFrames are composed of rows and columns that can have header names, and the columns can be of different types (e.g. the first column containing integers and the second column containing text strings). Each value in a pandas DataFrame is referred to as a cell that has a specific row index and column index within the tabular structure.
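
The following sketch illustrates heterogeneous columns and cell access. The `Height_m` column and its values are hypothetical and were added only to show a floating-point column next to text and integers.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Jeff', 'Esha', 'Jia'],      # text column
    'Age': [30, 56, 8],                   # integer column
    'Height_m': [1.80, 1.62, 1.25],       # float column (hypothetical values)
})

# Each column keeps its own data type.
print(people.dtypes)

# A cell is addressed by a row label and a column label ...
cell = people.loc[1, 'Name']

# ... or by integer row and column positions.
same_cell = people.iloc[1, 0]

print(cell, same_cell)
```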

##### Features

* Fast and efficient DataFrame object with default and customized indexing.
* Tools for loading data into in-memory data objects from different file formats.
* Data alignment and integrated handling of missing data.
* Reshaping and pivoting of data sets.
* Label-based slicing, indexing and subsetting of large data sets.
* Columns from a data structure can be deleted or inserted.
* Grouping of data for aggregation and transformation.
* High performance merging and joining of data.
* Time Series functionality.
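
Two of the features listed above, data alignment with missing-data handling and group-by aggregation, can be sketched as follows. All data in this example is made up for illustration.

```python
import pandas as pd

# Two Series whose labels only partly overlap (hypothetical data).
q1 = pd.Series([10, 20], index=['a', 'b'])
q2 = pd.Series([1, 2], index=['b', 'c'])

# Addition aligns on labels; labels present in only one Series give NaN.
total = q1 + q2

# fill_value treats a missing label as 0 instead of producing NaN.
total_filled = q1.add(q2, fill_value=0)

# Group rows by a key column and aggregate each group.
sales = pd.DataFrame({'region': ['N', 'S', 'N', 'S'],
                      'units': [3, 5, 2, 4]})
units_by_region = sales.groupby('region')['units'].sum()

print(total)
print(total_filled)
print(units_by_region)
```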

**To learn more about Pandas, click [here](https://pandas.pydata.org/docs/getting_started/overview.html)**

##### Exercise 1: How to import pandas and check the version?

In [17]:
# Your code here
import pandas as pd
import numpy as np
print(pd.__version__)

2.2.2


##### Exercise 2: Create a Pandas DataFrame from a dictionary

In [4]:
# Your code here
mydic = {'Name': ['Shruti','Saanvi','Divik'],
         'Relation': ['Mother','Daughter','Father'],
         'Age': ['45','17','46']}

df1 = pd.DataFrame(mydic)
df1

|   | Name   | Relation | Age |
| --- | --- | --- | --- |
| 0 | Shruti | Mother   | 45  |
| 1 | Saanvi | Daughter | 17  |
| 2 | Divik  | Father   | 46  |
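
When ages are entered as quoted strings in the dictionary, pandas stores the column as text rather than numbers. The sketch below rebuilds the same dictionary and shows one possible way to convert the column with `pd.to_numeric` so that numeric operations such as the mean behave as expected.

```python
import pandas as pd

mydic = {'Name': ['Shruti', 'Saanvi', 'Divik'],
         'Relation': ['Mother', 'Daughter', 'Father'],
         'Age': ['45', '17', '46']}   # ages given as strings

df1 = pd.DataFrame(mydic)
print(df1.dtypes)          # Age is a text column at this point

# Convert the text column to a numeric one.
df1['Age'] = pd.to_numeric(df1['Age'])
print(df1.dtypes)          # Age is now an integer column

mean_age = df1['Age'].mean()
print(mean_age)
```

Writing the ages as plain integers in the dictionary would avoid the conversion step altogether.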


##### Exercise 3: Get the number of rows, columns, data types and summary statistics of each column of the Cars93 dataset. Also get the NumPy array and list equivalent of the DataFrame.

**Access the dataset using this [link](https://cdn.iisc.talentsprint.com/CDS/Datasets/Cars93_miss.csv)**

**Hint:** use `wget` to download the dataset

In [None]:
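# Your code here
# pd.read_csv accepts a URL directly, so a separate download step is optional
path = 'https://cdn.iisc.talentsprint.com/CDS/Datasets/Cars93_miss.csv'
df2 = pd.read_csv(path)
# count() reports the number of non-missing values in each column
df2.count()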

In [20]:
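# Only the last expression in a cell is displayed, so the result of df2.dtypes is not shown here;
# wrap it in print() to see the column types as well
df2.dtypes
df2.head()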

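|   | Manufacturer | Model | Type | Min.Price | Price | Max.Price | MPG.city | MPG.highway | AirBags | DriveTrain | ... | Passengers | Length | Wheelbase | Width | Turn.circle | Rear.seat.room | Luggage.room | Weight | Origin | Make |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Acura | Integra | Small | 12.9 | 15.9 | 18.8 | 25.0 | 31.0 | NaN | Front | ... | 5.0 | 177.0 | 102.0 | 68.0 | 37.0 | 26.5 | NaN | 2705.0 | non-USA | Acura Integra |
| 1 | NaN | Legend | Midsize | 29.2 | 33.9 | 38.7 | 18.0 | 25.0 | Driver & Passenger | Front | ... | 5.0 | 195.0 | 115.0 | 71.0 | 38.0 | 30.0 | 15.0 | 3560.0 | non-USA | Acura Legend |
| 2 | Audi | 90 | Compact | 25.9 | 29.1 | 32.3 | 20.0 | 26.0 | Driver only | Front | ... | 5.0 | 180.0 | 102.0 | 67.0 | 37.0 | 28.0 | 14.0 | 3375.0 | non-USA | Audi 90 |
| 3 | Audi | 100 | Midsize | NaN | 37.7 | 44.6 | 19.0 | 26.0 | Driver & Passenger | NaN | ... | 6.0 | 193.0 | 106.0 | NaN | 37.0 | 31.0 | 17.0 | 3405.0 | non-USA | Audi 100 |
| 4 | BMW | 535i | Midsize | NaN | 30.0 | NaN | 22.0 | 30.0 | NaN | Rear | ... | 4.0 | 186.0 | 109.0 | 69.0 | 39.0 | 27.0 | 13.0 | 3640.0 | non-USA | BMW 535i |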


In [21]:
# Selecting one column returns a pandas Series; calling .to_numpy() on it would give a NumPy array
np1 = df2['Manufacturer']
np1

|   | Manufacturer |
| --- | --- |
| 0 | Acura |
| 1 | NaN |
| 2 | Audi |
| 3 | Audi |
| 4 | BMW |
| ... | ... |
| 88 | Volkswagen |
| 89 | Volkswagen |
| 90 | Volkswagen |
| 91 | Volvo |


In [22]:
np2 = df2['Manufacturer'].tolist()
np2

['Acura',
 nan,
 'Audi',
 'Audi',
 'BMW',
 'Buick',
 'Buick',
 'Buick',
 'Buick',
 'Cadillac',
 'Cadillac',
 'Chevrolet',
 'Chevrolet',
 'Chevrolet',
 'Chevrolet',
 'Chevrolet',
 'Chevrolet',
 'Chevrolet',
 'Chevrolet',
 nan,
 'Chrysler',
 'Chrysler',
 'Dodge',
 'Dodge',
 'Dodge',
 'Dodge',
 'Dodge',
 'Dodge',
 'Eagle',
 'Eagle',
 'Ford',
 'Ford',
 'Ford',
 'Ford',
 'Ford',
 'Ford',
 'Ford',
 'Ford',
 'Geo',
 'Geo',
 'Honda',
 'Honda',
 'Honda',
 'Hyundai',
 'Hyundai',
 'Hyundai',
 'Hyundai',
 'Infiniti',
 'Lexus',
 nan,
 'Lincoln',
 'Lincoln',
 'Mazda',
 'Mazda',
 'Mazda',
 'Mazda',
 'Mazda',
 'Mercedes-Benz',
 'Mercedes-Benz',
 'Mercury',
 'Mercury',
 'Mitsubishi',
 'Mitsubishi',
 'Nissan',
 'Nissan',
 'Nissan',
 'Nissan',
 'Oldsmobile',
 'Oldsmobile',
 'Oldsmobile',
 'Oldsmobile',
 'Plymouth',
 'Pontiac',
 'Pontiac',
 'Pontiac',
 'Pontiac',
 'Pontiac',
 'Saab',
 'Saturn',
 'Subaru',
 'Subaru',
 'Subaru',
 'Suzuki',
 'Toyota',
 'Toyota',
 'Toyota',
 'Toyota',
 'Volkswagen',
 'Volkswa
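
Exercise 3 asks for the shape, data types, summary statistics, and the NumPy array and list equivalents of the whole DataFrame. One possible way to obtain each of these is sketched below on a three-row stand-in, built from a few values shown in the `head()` output above, so that it runs without a download. The name `cars` is illustrative.

```python
import numpy as np
import pandas as pd

# Small stand-in for the Cars93 data (first three rows, three columns).
cars = pd.DataFrame({
    'Manufacturer': ['Acura', np.nan, 'Audi'],
    'Price': [15.9, 33.9, 29.1],
    'Passengers': [5.0, 5.0, 5.0],
})

# Number of rows and columns.
n_rows, n_cols = cars.shape

# Data type of each column.
column_types = cars.dtypes

# Summary statistics; by default only numeric columns are included.
summary = cars.describe()

# NumPy array equivalent of the whole DataFrame.
as_array = cars.to_numpy()

# Nested list equivalent (one inner list per row).
as_list = cars.to_numpy().tolist()

print(n_rows, n_cols)
print(column_types)
print(summary)
print(as_array)
print(as_list)
```

The same calls should apply unchanged to the DataFrame loaded from the CSV file.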

##### Exercise 4: Given the DataFrame below

| Name        | Age         |
| ----------- | ----------- |
| Jeff        | 30          |
| Esha        | 56          |
| Jia         | 8           |


Display the details of the Name column



In [23]:
# Your code here
data2 = {'Name':['Jeff','Esha','Jia'],
         'Age': [30,56,8]}

df3 = pd.DataFrame(data2)
df3['Name']

|   | Name |
| --- | --- |
| 0 | Jeff |
| 1 | Esha |
| 2 | Jia  |
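
Selecting a column with single brackets returns a Series, while double brackets return a one-column DataFrame. The short sketch below, using the same data, shows the difference; the variable names are illustrative.

```python
import pandas as pd

df3 = pd.DataFrame({'Name': ['Jeff', 'Esha', 'Jia'],
                    'Age': [30, 56, 8]})

# Single brackets: a one-dimensional Series.
name_series = df3['Name']

# Double brackets (a list of column names): a DataFrame with one column.
name_frame = df3[['Name']]

print(type(name_series).__name__, name_series.shape)
print(type(name_frame).__name__, name_frame.shape)
```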


##### Exercise 5: Given the DataFrame below

| Name        | Age         |
| ----------- | ----------- |
| Jeff        | 30          |
| Esha        | 56          |
| Jia         | 8           |


Display the details of the row with Jia in it


In [24]:
# Your code here
df3[df3['Name'] == 'Jia']

|   | Name | Age |
| --- | --- | --- |
| 2 | Jia  | 8   |
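
The row filter in this exercise works through a boolean mask: the comparison produces one True/False value per row, and only the rows marked True are kept. A minimal sketch that makes the intermediate mask visible, with illustrative variable names:

```python
import pandas as pd

df3 = pd.DataFrame({'Name': ['Jeff', 'Esha', 'Jia'],
                    'Age': [30, 56, 8]})

# The comparison is evaluated element-wise and yields a boolean Series.
mask = df3['Name'] == 'Jia'
print(mask.tolist())

# .loc keeps only the rows where the mask is True.
jia_row = df3.loc[mask]

# A row mask can be combined with a column label to pull out a single value.
jia_age = df3.loc[mask, 'Age'].iloc[0]

print(jia_row)
print(jia_age)
```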


##### Exercise 6: Given the DataFrame below

| Name        | Age         |
| ----------- | ----------- |
| Jeff        | 30          |
| Esha        | 56          |
| Jia         | 8           |


Display the Name of the person with minimum age


In [25]:
# Your code here
df3[df3.Age.min() == df3['Age']]

|   | Name | Age |
| --- | --- | --- |
| 2 | Jia  | 8   |


##### Exercise 7: Given the DataFrame below

| Name        | Age         |
| ----------- | ----------- |
| Jeff        | 30          |
| Esha        | 56          |
| Jia         | 8           |


Display the Name of the person with maximum age


In [26]:
# Your code here
df3[df3.Age.max() == df3['Age']]

|   | Name | Age |
| --- | --- | --- |
| 1 | Esha | 56  |
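
Exercises 6 and 7 ask for the Name only, whereas the boolean filter returns the whole row. An alternative sketch uses `idxmin` and `idxmax`, which return the index label of the smallest and largest value, and then looks up the Name at that label. The variable names are illustrative.

```python
import pandas as pd

df3 = pd.DataFrame({'Name': ['Jeff', 'Esha', 'Jia'],
                    'Age': [30, 56, 8]})

# Index label of the row with the smallest / largest Age.
youngest_label = df3['Age'].idxmin()
oldest_label = df3['Age'].idxmax()

# Look up only the Name at those labels.
youngest = df3.loc[youngest_label, 'Name']
oldest = df3.loc[oldest_label, 'Name']

print(youngest, oldest)
```

If several people share the minimum or maximum age, `idxmin` and `idxmax` return only the first match, while the boolean filter returns every tied row.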


##### Exercise 8: Given the DataFrame below

| Name        | Age         |
| ----------- | ----------- |
| Jeff        | 30          |
| Esha        | 56          |
| Jia         | 8           |
| Eiran       | 46          |


Display the details of the person whose name starts with 'E'

In [36]:
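# Your code here
# Append the row for Eiran to the existing DataFrame
new_row = pd.DataFrame({'Name':['Eiran'],'Age':[46]})
df3 = pd.concat([df3,new_row], ignore_index=True)
# Drop exact duplicate rows in case this cell is run more than once
df3 = df3.drop_duplicates()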

In [37]:
df3[df3['Name'].str[0] == 'E']

|   | Name  | Age |
| --- | --- | --- |
| 1 | Esha  | 56  |
| 3 | Eiran | 46  |
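
Another way to express the same filter is `Series.str.startswith`. The sketch below adds a hypothetical row with a missing name to show the `na=False` argument, which treats missing names as non-matching; the frame `names` is illustrative only.

```python
import pandas as pd

# Same people as above, plus one hypothetical row with a missing name.
names = pd.DataFrame({'Name': ['Jeff', 'Esha', None, 'Eiran'],
                      'Age': [30, 56, 8, 46]})

# True where the name starts with 'E'; missing names count as False.
starts_with_e = names['Name'].str.startswith('E', na=False)

result = names[starts_with_e]
print(result)
```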


##### Exercise 9: Form a new dataset by removing or replacing the missing values from the Cars93 dataset.



**To download the dataset click [here](https://cdn.iisc.talentsprint.com/CDS/Datasets/Cars93_miss.csv)**

In [2]:
import pandas as pd
# Your code here
path = 'https://cdn.iisc.talentsprint.com/CDS/Datasets/Cars93_miss.csv'
df4 = pd.read_csv(path)
df4.count()

| Column | Non-null count |
| --- | --- |
| Manufacturer | 89 |
| Model | 92 |
| Type | 90 |
| Min.Price | 86 |
| Price | 91 |
| Max.Price | 88 |
| MPG.city | 84 |
| MPG.highway | 91 |
| AirBags | 55 |
| DriveTrain | 86 |
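
The counts above only show how many values are present in each column. The exercise itself can be approached in two ways: removing rows with missing values, or replacing the missing values. Both are sketched below on a four-row stand-in built from values in the first rows of the dataset. Filling numeric columns with the column mean and text columns with the placeholder `'Unknown'` is an assumption made for this sketch, not a prescribed strategy.

```python
import numpy as np
import pandas as pd

# Small stand-in for the Cars93 data with some missing values.
cars = pd.DataFrame({
    'Manufacturer': ['Acura', np.nan, 'Audi', 'Audi'],
    'Min.Price': [12.9, 29.2, 25.9, np.nan],
    'AirBags': [np.nan, 'Driver & Passenger', 'Driver only', 'Driver & Passenger'],
})

# Option 1: remove every row that contains at least one missing value.
dropped = cars.dropna()

# Option 2: replace missing values instead of removing rows.
filled = cars.copy()
numeric_cols = filled.select_dtypes(include='number').columns
text_cols = filled.columns.difference(numeric_cols)

# Numeric columns: fill with the column mean (assumed strategy).
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

# Text columns: fill with a placeholder label (assumed strategy).
filled[text_cols] = filled[text_cols].fillna('Unknown')

print(dropped)
print(filled)
print(filled.isna().sum())
```

Dropping rows is simple but can discard a large share of the data when many columns have gaps; filling keeps every row at the cost of introducing estimated values.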


##### Exercise 10: Use the dataset obtained after performing the above operations on the Cars93 dataset. Perform the following operations:

- Find the count of the models that provide AirBags for the driver only
- Find the count of the models that provide AirBags for the driver only and whose engine size is greater than 3
- Find the count of the models that provide AirBags for the driver only, whose engine size is greater than 3, and whose type is Large


In [6]:
df4['AirBags'].value_counts()

| AirBags | count |
| --- | --- |
| Driver only | 39 |
| Driver & Passenger | 16 |


In [13]:
# Your code here
len(df4[df4['AirBags'] == 'Driver only'])

39

In [14]:
len(df4[(df4['AirBags'] == 'Driver only') & (df4['EngineSize'] > 3)])

9

In [15]:
len(df4[(df4['AirBags'] == 'Driver only') & (df4['EngineSize'] > 3) & (df4['Type'] == 'Large')])

7
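
The three counts can also be obtained by summing boolean masks, since True counts as 1. The sketch below uses a small hypothetical table (the values are made up, not taken from Cars93) and also shows that comparisons against missing values evaluate to False, so rows with a missing AirBags or EngineSize entry are not counted.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries.
cars = pd.DataFrame({
    'AirBags': ['Driver only', 'Driver only', np.nan,
                'Driver & Passenger', 'Driver only'],
    'EngineSize': [3.8, 1.8, 4.6, 3.2, np.nan],
    'Type': ['Large', 'Small', 'Large', 'Midsize', 'Large'],
})

# One boolean mask per condition; missing values compare as False.
driver_only = cars['AirBags'] == 'Driver only'
big_engine = cars['EngineSize'] > 3
large = cars['Type'] == 'Large'

# Combine masks with & and count the True values with sum().
n_driver_only = int(driver_only.sum())
n_driver_big = int((driver_only & big_engine).sum())
n_driver_big_large = int((driver_only & big_engine & large).sum())

print(n_driver_only, n_driver_big, n_driver_big_large)
```

Naming each mask keeps the combined conditions readable and lets the individual conditions be reused.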