# 0. Jupyter-Notebooks 
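- Cells can be run separately and in any order, but they all share the state of one kernel session.
- Variables created by a cell stay in memory even after the cell is deleted, until the kernel is restarted.
- Every notebook is backed by a kernel session.
- Prefix a line with `!` to run a shell command (for example `!ls`).
- `%time` reports the runtime of a single statement; `%%time` at the top of a cell reports the runtime of the whole cell.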
- Markdown and code cells
- Some common keybindings: j, k, a, b, shift+enter, ctrl+enter, ctrl+z for undo, ctrl+a, double click, triple click
- Google Drive integration is available in Colab
- GPU acceleration 
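
The note about deleted cells follows from the kernel being one long-running Python process: a name stays defined until it is removed explicitly or the kernel restarts. A minimal sketch in plain Python, with a made-up variable name `scratch` standing in for something a deleted cell once created:

```python
# Pretend this assignment lived in a cell that has since been deleted.
scratch = [1, 2, 3]

# The value is still reachable from any other cell in the same session.
length_before = len(scratch)

# Removing the name explicitly (or restarting the kernel) is what frees it.
del scratch
try:
    scratch
    defined_after_del = True
except NameError:
    defined_after_del = False

print(length_before, defined_after_del)
```

Restarting the kernel has the same effect for every name at once, which is why "Restart and run all" is a useful check that a notebook does not depend on leftover state.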

# 1. Pandas

In [48]:
import pandas as pd

- Why pandas?
- Why CSVs?

## 1.1 Reading csvs
`pd.read_csv` parses a CSV file and returns a pandas DataFrame.

In [49]:
df = pd.read_csv("../datasets/housing.csv")
df

Unnamed: 0,city,area,floor,price
0,Delhi,1200,3,5000000
1,Mumbai,1500,2,7000000
2,Bangalore,1000,1,4000000
3,Chennai,1300,4,6000000
4,Hyderabad,1100,2,4500000
5,Kolkata,1400,3,5500000
6,Pune,900,1,3500000
7,Ahmedabad,1600,4,6500000
8,Surat,1050,2,4200000
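9,Jaipur,1250,3,5200000
10,Kanpur,800,1,3000000
11,Lucknow,1150,2,4300000
12,Nagpur,950,1,3800000
13,Indore,1350,3,5300000
14,Vadodara,1000,2,4000000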
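
To try `read_csv` without the file on disk, the sketch below feeds it CSV text from memory. It assumes the file has the four columns shown in the output above; the three rows are copied from that output, and the names `csv_text` and `sample` are only for illustration.

```python
import io

import pandas as pd

# Three rows copied from the output above, laid out as CSV text.
csv_text = """city,area,floor,price
Delhi,1200,3,5000000
Mumbai,1500,2,7000000
Bangalore,1000,1,4000000
"""

# read_csv accepts a path or any file-like object; StringIO stands in for the file.
sample = pd.read_csv(io.StringIO(csv_text))

# Column types are inferred from the text: integers for the numeric columns.
print(sample.dtypes)
```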


## 1.2 Shape, rows and columns
`df.shape` returns a `(rows, columns)` tuple, so the row and column counts can be read from its two elements.

In [50]:
# Getting information about the shape of the dataset
print(f"The shape of the dataset is as follows: {df.shape}")
print(f"No. of rows: {df.shape[0]}")
print(f"No. of columns: {df.shape[1]}")

The shape of the dataset is as follows: (15, 4)
No. of rows: 15
No. of columns: 4
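
A few equivalent ways to get the same counts, sketched on a three-row frame built from rows shown earlier (the name `sample` is only for illustration):

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "city": ["Delhi", "Mumbai", "Bangalore"],
        "area": [1200, 1500, 1000],
        "floor": [3, 2, 1],
        "price": [5000000, 7000000, 4000000],
    }
)

# shape is a plain tuple, so it can be unpacked directly.
n_rows, n_cols = sample.shape

# len() of a DataFrame is its row count; size is rows * columns.
row_count = len(sample)
cell_count = sample.size

print(n_rows, n_cols, row_count, cell_count)
```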


## 1.3 Head and tail
- `head` shows the first rows of the dataset (5 by default)
- `tail` shows the last rows of the dataset (5 by default)

In [51]:
# displays the first 5 rows of the dataset
df.head()

Unnamed: 0,city,area,floor,price
0,Delhi,1200,3,5000000
1,Mumbai,1500,2,7000000
2,Bangalore,1000,1,4000000
3,Chennai,1300,4,6000000
4,Hyderabad,1100,2,4500000


In [52]:
# you can pass the number of records that you want to display
df.head(10)

Unnamed: 0,city,area,floor,price
0,Delhi,1200,3,5000000
1,Mumbai,1500,2,7000000
2,Bangalore,1000,1,4000000
3,Chennai,1300,4,6000000
4,Hyderabad,1100,2,4500000
5,Kolkata,1400,3,5500000
6,Pune,900,1,3500000
7,Ahmedabad,1600,4,6500000
8,Surat,1050,2,4200000
9,Jaipur,1250,3,5200000


In [53]:
# displaying the last 5 records from the dataset
df.tail()

Unnamed: 0,city,area,floor,price
10,Kanpur,800,1,3000000
11,Lucknow,1150,2,4300000
12,Nagpur,950,1,3800000
13,Indore,1350,3,5300000
14,Vadodara,1000,2,4000000


In [54]:
# adding a parameter specifying the number of lines to display from the end
df.tail(10)

Unnamed: 0,city,area,floor,price
5,Kolkata,1400,3,5500000
6,Pune,900,1,3500000
7,Ahmedabad,1600,4,6500000
8,Surat,1050,2,4200000
9,Jaipur,1250,3,5200000
10,Kanpur,800,1,3000000
11,Lucknow,1150,2,4300000
12,Nagpur,950,1,3800000
13,Indore,1350,3,5300000
14,Vadodara,1000,2,4000000
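
Both methods also accept a negative count, which means "everything except that many rows". A small sketch on five rows taken from the dataset (the name `sample` is only for illustration):

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "city": ["Delhi", "Mumbai", "Bangalore", "Chennai", "Hyderabad"],
        "area": [1200, 1500, 1000, 1300, 1100],
    }
)

# head(-2): all rows except the last 2.
all_but_last_two = sample.head(-2)

# tail(-2): all rows except the first 2.
all_but_first_two = sample.tail(-2)

print(all_but_last_two["city"].tolist())
print(all_but_first_two["city"].tolist())
```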


## 1.3 Info, describe commands
- `info` shows the datatype and non-null count of each column
- `describe` gives summary statistics for the numerical columns

In [55]:
# Run the "info" method to see datatypes, counts and null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   city    15 non-null     object
 1   area    15 non-null     int64 
 2   floor   15 non-null     int64 
 3   price   15 non-null     int64 
dtypes: int64(3), object(1)
memory usage: 612.0+ bytes


In [56]:
df.describe()

Unnamed: 0,area,floor,price
count,15.0,15.0,15.0
mean,1170.0,2.266667,4786667.0
std,229.751418,1.032796,1134439.0
min,800.0,1.0,3000000.0
25%,1000.0,1.5,4000000.0
50%,1150.0,2.0,4500000.0
75%,1325.0,3.0,5400000.0
max,1600.0,4.0,7000000.0


In [57]:
# transpose the describe output so that each column becomes a row
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
area,15.0,1170.0,229.7514,800.0,1000.0,1150.0,1325.0,1600.0
floor,15.0,2.266667,1.032796,1.0,1.5,2.0,3.0,4.0
price,15.0,4786667.0,1134439.0,3000000.0,4000000.0,4500000.0,5400000.0,7000000.0
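
`describe` skips text columns on a DataFrame by default, which is why `city` is missing from the output above. Calling it on the text column itself gives a different summary. A sketch on made-up data with a repeated city so that the `top` and `freq` entries are meaningful:

```python
import pandas as pd

# Made-up sample: "Pune" appears twice on purpose.
sample = pd.DataFrame(
    {"city": ["Pune", "Surat", "Pune"], "area": [900, 1050, 950]}
)

# On the whole frame only the numeric column is summarised.
numeric_summary = sample.describe()

# On a text column: count, number of unique values, most common value, its frequency.
city_summary = sample["city"].describe()

print(city_summary)
```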


## 1.4 Iloc and loc
- Used for selecting rows and columns: `iloc` by integer position, `loc` by label
- Roughly the dataframe counterpart of a SQL `SELECT` query

In [58]:
# entries from the 4th row (position 3) to the last
df.iloc[3:]

Unnamed: 0,city,area,floor,price
3,Chennai,1300,4,6000000
4,Hyderabad,1100,2,4500000
5,Kolkata,1400,3,5500000
6,Pune,900,1,3500000
7,Ahmedabad,1600,4,6500000
8,Surat,1050,2,4200000
9,Jaipur,1250,3,5200000
10,Kanpur,800,1,3000000
11,Lucknow,1150,2,4300000
12,Nagpur,950,1,3800000
13,Indore,1350,3,5300000
14,Vadodara,1000,2,4000000


In [59]:
# entries from first 3 rows
df.iloc[:3]

Unnamed: 0,city,area,floor,price
0,Delhi,1200,3,5000000
1,Mumbai,1500,2,7000000
2,Bangalore,1000,1,4000000


In [60]:
# entries for the first 2 columns
df.iloc[:, :2]

Unnamed: 0,city,area
0,Delhi,1200
1,Mumbai,1500
2,Bangalore,1000
3,Chennai,1300
4,Hyderabad,1100
5,Kolkata,1400
6,Pune,900
7,Ahmedabad,1600
8,Surat,1050
9,Jaipur,1250
10,Kanpur,800
11,Lucknow,1150
12,Nagpur,950
13,Indore,1350
14,Vadodara,1000


In [61]:
# first 3 entries of the first 2 columns
df.iloc[:3, :2]

Unnamed: 0,city,area
0,Delhi,1200
1,Mumbai,1500
2,Bangalore,1000


In [64]:
# last 3 entries of the first 2 columns
df.iloc[-3:, :2]

Unnamed: 0,city,area
12,Nagpur,950
13,Indore,1350
14,Vadodara,1000


### 1.4.1 Loc & Iloc: Major difference

In [65]:
df.loc[1:3]
# notice how 3 is inclusive

Unnamed: 0,city,area,floor,price
1,Mumbai,1500,2,7000000
2,Bangalore,1000,1,4000000
3,Chennai,1300,4,6000000


In [66]:
df.iloc[1:3]
# notice how 3 is exclusive

Unnamed: 0,city,area,floor,price
1,Mumbai,1500,2,7000000
2,Bangalore,1000,1,4000000
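
With the default 0, 1, 2, ... index, labels and positions look identical, so the difference is easy to miss. It becomes clearer once the index holds something other than integers. A sketch that uses city names as the index (an arrangement chosen for illustration, not the one used in this notebook):

```python
import pandas as pd

sample = pd.DataFrame(
    {"area": [1200, 1500, 1000, 1300]},
    index=["Delhi", "Mumbai", "Bangalore", "Chennai"],
)

# loc slices by label and includes both end labels.
by_label = sample.loc["Mumbai":"Chennai"]

# iloc slices by position and excludes the stop position, like a Python list.
by_position = sample.iloc[1:3]

print(by_label.index.tolist())
print(by_position.index.tolist())
```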


### 1.4.2 Querying using loc

In [67]:
# Find rows where the city is Mumbai
df.loc[(df.city == "Mumbai")]

Unnamed: 0,city,area,floor,price
1,Mumbai,1500,2,7000000


In [68]:
# Find entries where city is one of ["Mumbai", "Hyderabad", "Chennai", "Bangalore"]
df.loc[(df.city.isin(["Mumbai", "Hyderabad", "Chennai", "Bangalore"]))]

Unnamed: 0,city,area,floor,price
1,Mumbai,1500,2,7000000
2,Bangalore,1000,1,4000000
3,Chennai,1300,4,6000000
4,Hyderabad,1100,2,4500000


In [69]:
# Find entries where area is > 1000 and floor is 2
df.loc[(df.area > 1000) & (df.floor == 2)]

Unnamed: 0,city,area,floor,price
1,Mumbai,1500,2,7000000
4,Hyderabad,1100,2,4500000
8,Surat,1050,2,4200000
11,Lucknow,1150,2,4300000


In [70]:
# Find entries where area is less than 1000 or floor is > 2
df.loc[(df.area < 1000) | (df.floor > 2)]

Unnamed: 0,city,area,floor,price
0,Delhi,1200,3,5000000
3,Chennai,1300,4,6000000
5,Kolkata,1400,3,5500000
6,Pune,900,1,3500000
7,Ahmedabad,1600,4,6500000
9,Jaipur,1250,3,5200000
10,Kanpur,800,1,3000000
12,Nagpur,950,1,3800000
13,Indore,1350,3,5300000


## 1.5 Drop
- Dropping rows and columns

In [71]:
# Create an independent copy of the first 10 records
df_copy = df[:10].copy()
df_copy

Unnamed: 0,city,area,floor,price
0,Delhi,1200,3,5000000
1,Mumbai,1500,2,7000000
2,Bangalore,1000,1,4000000
3,Chennai,1300,4,6000000
4,Hyderabad,1100,2,4500000
5,Kolkata,1400,3,5500000
6,Pune,900,1,3500000
7,Ahmedabad,1600,4,6500000
8,Surat,1050,2,4200000
9,Jaipur,1250,3,5200000


In [72]:
# drop the city column
df_copy.drop(columns = "city")

Unnamed: 0,area,floor,price
0,1200,3,5000000
1,1500,2,7000000
2,1000,1,4000000
3,1300,4,6000000
4,1100,2,4500000
5,1400,3,5500000
6,900,1,3500000
7,1600,4,6500000
8,1050,2,4200000
9,1250,3,5200000


In [73]:
# drop the row with index label 2 (the third record)
df_copy.drop(index = 2)

Unnamed: 0,city,area,floor,price
0,Delhi,1200,3,5000000
1,Mumbai,1500,2,7000000
3,Chennai,1300,4,6000000
4,Hyderabad,1100,2,4500000
5,Kolkata,1400,3,5500000
6,Pune,900,1,3500000
7,Ahmedabad,1600,4,6500000
8,Surat,1050,2,4200000
9,Jaipur,1250,3,5200000


In [74]:
# drop the rows with index labels 3 and 4; axis=0 means rows
df_copy.drop([3, 4], axis = 0)

Unnamed: 0,city,area,floor,price
0,Delhi,1200,3,5000000
1,Mumbai,1500,2,7000000
2,Bangalore,1000,1,4000000
5,Kolkata,1400,3,5500000
6,Pune,900,1,3500000
7,Ahmedabad,1600,4,6500000
8,Surat,1050,2,4200000
9,Jaipur,1250,3,5200000


In [75]:
# drop the "floor" column; axis=1 means columns
df_copy.drop("floor", axis = 1)

Unnamed: 0,city,area,price
0,Delhi,1200,5000000
1,Mumbai,1500,7000000
2,Bangalore,1000,4000000
3,Chennai,1300,6000000
4,Hyderabad,1100,4500000
5,Kolkata,1400,5500000
6,Pune,900,3500000
7,Ahmedabad,1600,6500000
8,Surat,1050,4200000
9,Jaipur,1250,5200000
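
Note that every `drop` call above returns a new DataFrame and leaves `df_copy` unchanged, which is why the dropped column or row reappears in the next cell. A sketch of that behaviour on three rows from the dataset (the name `sample` is only for illustration):

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "city": ["Delhi", "Mumbai", "Bangalore"],
        "area": [1200, 1500, 1000],
        "floor": [3, 2, 1],
    }
)

# drop returns a new frame; `sample` itself is left untouched.
without_floor = sample.drop(columns="floor")
columns_after_drop = list(sample.columns)

# Reassign the result to make the change stick.
sample = sample.drop(index=0)
remaining_cities = list(sample["city"])

print(columns_after_drop, remaining_cities)
```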


## 1.6 Miscellaneous
- Dropping duplicates
- Looking at unique values and their count (`nunique`)
- Finding nulls in the dataset
- Finding counts of all values in a column

In [76]:
# drop duplicates when necessary
df = df.drop_duplicates()
df

Unnamed: 0,city,area,floor,price
0,Delhi,1200,3,5000000
1,Mumbai,1500,2,7000000
2,Bangalore,1000,1,4000000
3,Chennai,1300,4,6000000
4,Hyderabad,1100,2,4500000
5,Kolkata,1400,3,5500000
6,Pune,900,1,3500000
7,Ahmedabad,1600,4,6500000
8,Surat,1050,2,4200000
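9,Jaipur,1250,3,5200000
10,Kanpur,800,1,3000000
11,Lucknow,1150,2,4300000
12,Nagpur,950,1,3800000
13,Indore,1350,3,5300000
14,Vadodara,1000,2,4000000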


In [77]:
# finding the number of nulls in each column
df.isna().sum()
# or df.isnull().sum()

city     0
area     0
floor    0
price    0
dtype: int64

In [78]:
# print the number of unique values in the "city" column and the unique values themselves
print("The no. of unique values in city column: ", df["city"].nunique())
print("The unique values in city column: ", df["city"].unique())

The no. of unique values in city column:  15
The unique values in city column:  ['Delhi' 'Mumbai' 'Bangalore' 'Chennai' 'Hyderabad' 'Kolkata' 'Pune'
 'Ahmedabad' 'Surat' 'Jaipur' 'Kanpur' 'Lucknow' 'Nagpur' 'Indore'
 'Vadodara']


In [79]:
# finding counts of all values in a column
df.floor.value_counts()

floor
2    5
3    4
1    4
4    2
Name: count, dtype: int64
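
The housing data has no repeated rows, so `drop_duplicates` has nothing to remove there. The sketch below uses a made-up frame that does contain duplicates to show what the method does, along with `duplicated` for inspecting them first and `value_counts(normalize=True)` for shares instead of counts.

```python
import pandas as pd

# Made-up sample: rows 0 and 1 are identical.
sample = pd.DataFrame(
    {"city": ["Pune", "Pune", "Surat", "Pune"], "floor": [1, 1, 2, 3]}
)

# True for every row that repeats an earlier row in full.
dup_mask = sample.duplicated()

# Drop rows that are duplicated across all columns.
deduped = sample.drop_duplicates()

# Treat rows as duplicates when only the "city" value repeats; the first is kept.
one_per_city = sample.drop_duplicates(subset="city")

# Fractions instead of raw counts.
floor_share = sample["floor"].value_counts(normalize=True)

print(dup_mask.tolist(), len(deduped), one_per_city["city"].tolist())
```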