<h1> What is pandas? </h1><br>
<font size = "+1">pandas arose from the need for a dedicated data-analysis library that provides, in the simplest possible way, the tools for data processing, data extraction, and data manipulation.<br>
This Python package is built on top of the NumPy library.

<font size = "+1">At the heart of pandas are two primary data structures, around which almost every operation performed during data analysis is organized:
<font size = "+1"><ul>• Series</ul>
<font size = "+1"><ul>• DataFrame</ul>
The Series, as you will see, is the data structure designed to hold a
one-dimensional sequence of data, while the DataFrame, a more complex data structure,
is designed to hold tabular data with two dimensions (rows and columns).
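
As a rough illustration of the difference between the two structures, the sketch below builds one of each. The names `prices` and `items` and their values are made up for this example.

```python
import numpy as np
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
prices = pd.Series([1.2, 1.0, 0.6], index=["ball", "pen", "pencil"])

# A DataFrame: a two-dimensional table whose columns share one index.
items = pd.DataFrame({"price": prices, "stock": [10, 25, 40]})

print(prices.ndim)  # number of dimensions of the Series
print(items.ndim)   # number of dimensions of the DataFrame

# The values of a Series can be handed back to NumPy at any time.
print(isinstance(prices.to_numpy(), np.ndarray))
```

Each column of `items` can be pulled out again as a Series, for example `items["price"]`.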

In [19]:
import pandas as pd
import numpy as np

<h3> Series </h3>

![image.png](attachment:image.png)

<font size = "+0.5"> This section covers seven ways to create a pandas Series:<br>
1. Creating an empty Series <br>
2. Creating a Series from a list <br>
3. Creating a Series from a NumPy array <br>
4. Creating a Series from other NumPy functions, e.g., arange, linspace<br>
5. Creating a Series from a dictionary <br>
6. Creating a Series from a constant value <br>
7. Creating a Series using the range function

In [20]:
# 1. defining an empty series
s = pd.Series(dtype ='float64')

In [21]:
s

Series([], dtype: float64)

In [22]:
# 2. Creating from a list
s2 = pd.Series([12,-4,7,9])
print(s2)

0    12
1    -4
2     7
3     9
dtype: int64


In [23]:
# 3. creating from numpy array
s3 = pd.Series(np.array([2,3,4,5,6]))
s3

0    2
1    3
2    4
3    5
4    6
dtype: int32

In [24]:
# 4. creating using numpy functions
s4_1 = pd.Series(np.arange(6))
s4_1

0    0
1    1
2    2
3    3
4    4
5    5
dtype: int32

In [25]:
# another numpy function
s4_2 = pd.Series(np.linspace(1,4,7))
s4_2

0    1.0
1    1.5
2    2.0
3    2.5
4    3.0
5    3.5
6    4.0
dtype: float64

In [26]:
# 5. Creating Series from a dictionary
mydict = {'red': 2000, 'blue': 1000, 'yellow': 500, 'orange': 1000}

In [27]:
mydict

{'red': 2000, 'blue': 1000, 'yellow': 500, 'orange': 1000}

In [28]:
s5 = pd.Series(mydict)
s5

red       2000
blue      1000
yellow     500
orange    1000
dtype: int64

In [29]:
# 6. creating using a constant value
s6 = pd.Series(12, index = [0,1,2,3,4])
s6

0    12
1    12
2    12
3    12
4    12
dtype: int64

In [30]:
# 7. creating using range function
s7 = pd.Series(range(4,12,2), index=['a','b','c','d'])
s7

a     4
b     6
c     8
d    10
dtype: int64

<h3> Method and Attributes in pandas Series  </h3>
<font size = "+1">  A pandas Series comes with many methods and attributes; the ones discussed here are:<br>
    1. values<br>
    2. index and index operations<br>
    3. slicing using [ ]<br>
    4. value_counts()<br>
    5. unique() and is_unique<br>
    6. isin(), isna() and notna()<br>
    7. resetting the index using reset_index()<br>
    8. combining series with pd.concat() (Series.append() was removed in pandas 2.0)<br>
    9. the index labels of the maximum and minimum values using idxmax() and idxmin()<br>
    10. operations between two or more series

In [31]:
# to get the values of a series
s7.values

array([ 4,  6,  8, 10], dtype=int64)

In [32]:
# to get the indexes of series
s7.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [33]:
# fetching a value by position (positional access through plain [] is deprecated in recent pandas)
s7.iloc[2]

8

In [34]:
# fetching values from index name
s7['b']

6

In [35]:
# fetching multiple values with a positional slice (the end position is excluded)
s7[0:3]

a    4
b    6
c    8
dtype: int64

In [36]:
# fetching multiple values by position
s7.iloc[[0,2]]

a    4
c    8
dtype: int64

In [37]:
# fetching multiple values from index names
s7[['b','c']]

b    6
c    8
dtype: int64
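
When the index holds labels rather than integers, it can help to state explicitly whether a lookup is by label or by position. The sketch below uses the `.loc` and `.iloc` indexers on a Series shaped like `s7`; the name `demo` is only for illustration.

```python
import pandas as pd

demo = pd.Series(range(4, 12, 2), index=["a", "b", "c", "d"])

by_label = demo.loc["b"]          # label-based lookup
by_position = demo.iloc[2]        # position-based lookup
label_slice = demo.loc["a":"c"]   # label slices include the end label
pos_slice = demo.iloc[0:3]        # position slices exclude the end position

print(by_label, by_position)
print(label_slice.equals(pos_slice))
```

Both slices select the same three elements here, even though one is written with labels and the other with positions.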

In [38]:
# changing the value at a given position
s7.iloc[1] = 1

In [39]:
s7

a     4
b     1
c     8
d    10
dtype: int64

In [40]:
# filtering the values
s7[s7>4]

c     8
d    10
dtype: int64

In [41]:
# mathematical operations
np.log(s7)

a    1.386294
b    0.000000
c    2.079442
d    2.302585
dtype: float64

In [42]:
# we can also define a series from a list with a custom index; index labels may repeat
s8 = pd.Series([1,0,4,1,4,3], index=['white','white','blue','green','green','yellow'])

In [43]:
# To know all the values contained in the series, excluding duplicates
s8.unique()

array([1, 0, 4, 3], dtype=int64)

In [44]:
# to check whether all values in the series are unique
s8.is_unique

False

In [45]:
# value_counts() returns each unique value together with the number of times it occurs in the series.
s8.value_counts()

1    2
4    2
0    1
3    1
Name: count, dtype: int64

In [46]:
s8

white     1
white     0
blue      4
green     1
green     4
yellow    3
dtype: int64

In [47]:
# isin() tells you whether each value of the series is contained in the given list.
# It returns a boolean Series.
s8.isin([1,3])

white      True
white     False
blue      False
green      True
green     False
yellow     True
dtype: bool

In [48]:
s8[s8.isin([1,3])]

white     1
green     1
yellow    3
dtype: int64

In [49]:
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
s9 = pd.Series([5,-3,np.nan,14])

In [50]:
s9

0     5.0
1    -3.0
2     NaN
3    14.0
dtype: float64

In [51]:
s9.isna() 

0    False
1    False
2     True
3    False
dtype: bool

In [52]:
s9.notna()

0     True
1     True
2    False
3     True
dtype: bool

In [53]:
s9[s9.notna()]

0     5.0
1    -3.0
3    14.0
dtype: float64

In [54]:
s9

0     5.0
1    -3.0
2     NaN
3    14.0
dtype: float64

In [55]:
# Return the row label of the minimum value.
s9.idxmin()

1

In [56]:
s9.idxmax()

3

In [57]:
# operation between series
s10 = pd.Series(range(5))
s11 = pd.Series(range(6,11))

In [58]:
s10

0    0
1    1
2    2
3    3
4    4
dtype: int64

In [59]:
s11

0     6
1     7
2     8
3     9
4    10
dtype: int64

In [60]:
# element-wise operation
s10 + s11

0     6
1     8
2    10
3    12
4    14
dtype: int64
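
The two series above share the same index, so every element finds a partner. As a sketch of what happens when the labels only partly overlap, the example below uses two made-up series, `left` and `right`.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=["a", "b", "c"])
right = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Values are matched by label; labels present on only one side give NaN.
total = left + right

# add() with fill_value treats a missing partner as 0 instead.
total_filled = left.add(right, fill_value=0)

print(total)
print(total_filled)
```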

In [61]:
# combining two series (pd.concat() replaces the removed Series.append())
s12 = pd.concat([s10,s11])
s12

0     0
1     1
2     2
3     3
4     4
0     6
1     7
2     8
3     9
4    10
dtype: int64

In [62]:
# reset the index; drop=True discards the old, duplicated labels
s12.reset_index(drop = True)

0     0
1     1
2     2
3     3
4     4
5     6
6     7
7     8
8     9
9    10
dtype: int64
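
If the old labels are not needed at all, concatenation and renumbering can be done in one step. A minimal sketch, using the `ignore_index` option of `pd.concat()` on two illustrative series:

```python
import pandas as pd

first = pd.Series(range(5))
second = pd.Series(range(6, 11))

# ignore_index=True discards the original labels and numbers the result 0..n-1.
combined = pd.concat([first, second], ignore_index=True)

print(combined.index.is_unique)
print(combined)
```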

<h3> DataFrame </h3><br>
<font size = "+0.5"> 1. A pandas DataFrame is a 2-dimensional labeled data structure, like a table with rows and columns. <br>2. Both the size and the values of a DataFrame are mutable, i.e., they can be modified.<br>3. It is the most commonly used pandas object and can be created in multiple ways. <br><br>
    <em> The syntax for creating a DataFrame is: </em><br>
    <b> pandas.DataFrame(data, index, columns) </b>

![image-2.png](attachment:image-2.png)

<font size = "+1"> There are various ways to create a DataFrame: <br>
    1. From a dictionary of ndarrays or lists.<br>
    2. From a list, multiple lists, or nested lists.<br>
    3. From a list of dicts (or a dict of dicts).<br>
    4. From one or more Series.<br>
    5. From a NumPy array.

In [63]:
# 1. df from dictionary of list
data1 = {
    "color":["blue","green","yellow","red","white"],
    "object":["ball", "pen","pencil","paper","mug"],
    "price":[1.2, 1, 0.6, 0.9, 1.7]
}

In [64]:
df1 = pd.DataFrame(data1)

In [65]:
df1

Unnamed: 0,color,object,price
0,blue,ball,1.2
1,green,pen,1.0
2,yellow,pencil,0.6
3,red,paper,0.9
4,white,mug,1.7


In [66]:
# we can also choose which keys of the dictionary to use as columns, and supply our own index
df2 = pd.DataFrame(data1, columns=["object","price"], index = [1,2,3,4,5])
df2

Unnamed: 0,object,price
1,ball,1.2
2,pen,1.0
3,pencil,0.6
4,paper,0.9
5,mug,1.7


In [67]:
# 2. df From list
Age = [25, 30, 26, 22]

In [68]:
df3 = pd.DataFrame(data = Age, columns=["Age"])
df3

Unnamed: 0,Age
0,25
1,30
2,26
3,22


In [69]:
Name = ['Monica', 'krish', 'nick', 'juli']
Gender = ["F","M","M","F"]

In [70]:
# for multiple lists, we can collect them in a dictionary and pass it as the data of the df
data2 = {
    "Name":Name,
    "Age":Age,
    "Gender":Gender
}

In [71]:
# making a dataframe from multiple lists (no index is passed, so the default integer index is used)
df4 = pd.DataFrame(data = data2)
df4

Unnamed: 0,Name,Age,Gender
0,Monica,25,F
1,krish,30,M
2,nick,26,M
3,juli,22,F


In [72]:
# for making df using nested lists
data3 = [['tom', 10], ['nick', 15], ['juli', 14]] # this is a nested list

In [73]:
data3

[['tom', 10], ['nick', 15], ['juli', 14]]

In [74]:
# this time, supplying the index labels manually
df5 = pd.DataFrame(data = data3, columns=["Name","Age"], index = ['a','b','c'])
df5

Unnamed: 0,Name,Age
a,tom,10
b,nick,15
c,juli,14


In [75]:
# 3. df from a list of dictionaries (keys missing from a dictionary become NaN)
data4 = [{'a': 1, 'b': 2, 'c': 3},
        { 'b': 20, 'c': 30}]

In [76]:
# creating df using data4
df6 = pd.DataFrame(data = data4)
df6

Unnamed: 0,a,b,c
0,1.0,2,3
1,,20,30


In [77]:
# a dict of dicts: outer keys become the columns, inner keys become the index
dod = {'Nevada':{2001: 2.4, 2002:5.6},
'Ohio':{2000:4.5, 2001:3.4, 2002:6.7}}
frame3 = pd.DataFrame(dod)
frame3

Unnamed: 0,Nevada,Ohio
2001,2.4,3.4
2002,5.6,6.7
2000,,4.5
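
Method 4 in the list above, creating a DataFrame from one or more Series, can be sketched as follows. The series `price` and `stock` are made-up examples.

```python
import pandas as pd

price = pd.Series([1.2, 1.0, 0.6], index=["ball", "pen", "pencil"])
stock = pd.Series([10, 40], index=["ball", "pencil"])

# One Series: to_frame() turns it into a single-column DataFrame.
from_one = price.to_frame(name="price")

# Several Series: each becomes a column, aligned on the index labels.
# "pen" has no entry in stock, so that cell is NaN.
from_many = pd.DataFrame({"price": price, "stock": stock})

print(from_one)
print(from_many)
```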


In [79]:
# 5. df from a NumPy array: first build a 4x4 array
ser2 = np.arange(16).reshape(4,4)
ser2

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [80]:
df8 = pd.DataFrame(ser2)
df8

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15


In [81]:
# the same array again, this time with named columns
df8 = pd.DataFrame(np.arange(16).reshape(4,4), columns=["a",'b','c','d'])
df8

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15


In [82]:
df8["a"]

0     0
1     4
2     8
3    12
Name: a, dtype: int32

<h3> loc and iloc </h3>
<font size = "+0.5"> These indexers are used to slice data from a pandas DataFrame. They make it convenient to select rows and columns, and to filter the data according to some condition.
<h3>loc</h3><font size = "+0.5"> loc is a label-based indexer, used with square brackets rather than called like a function: we pass the labels of the rows and columns that we want to select.

In [83]:
data5 = {
    "color":["blue","green","yellow","red","white"],
    "object":["ball", "pen","pencil","paper","mug"],
    "price":[1.2, 1, 0.6, 0.9, 1.7]
}

In [84]:
df9 = pd.DataFrame(data5, index = range(2,7))

In [85]:
df9

Unnamed: 0,color,object,price
2,blue,ball,1.2
3,green,pen,1.0
4,yellow,pencil,0.6
5,red,paper,0.9
6,white,mug,1.7


In [86]:
df9.loc[4,'color']

'yellow'

In [87]:
df9.loc[2:4, ['color','price']] # unlike ordinary Python slicing, a loc slice includes the end label

Unnamed: 0,color,price
2,blue,1.2
3,green,1.0
4,yellow,0.6


In [88]:
df9.loc[[5,2,3]]

Unnamed: 0,color,object,price
5,red,paper,0.9
2,blue,ball,1.2
3,green,pen,1.0


In [89]:
df9.loc[:,['color','price']]

Unnamed: 0,color,price
2,blue,1.2
3,green,1.0
4,yellow,0.6
5,red,0.9
6,white,1.7


In [90]:
# boolean indexing: one True/False per row, then one per column
df9.loc[[True, False, True,True, False],[True,False,False]]

Unnamed: 0,color
2,blue
4,yellow
5,red


<h3>iloc</h3><font size = "+0.5"> iloc is an integer-position-based indexer, also used with square brackets: we pass integer positions (starting at 0) to select specific rows and columns.

In [91]:
df9

Unnamed: 0,color,object,price
2,blue,ball,1.2
3,green,pen,1.0
4,yellow,pencil,0.6
5,red,paper,0.9
6,white,mug,1.7


In [92]:
df9.iloc[0:2]

Unnamed: 0,color,object,price
2,blue,ball,1.2
3,green,pen,1.0


In [93]:
df9.iloc[0:2,1:3]

Unnamed: 0,object,price
2,ball,1.2
3,pen,1.0


In [94]:
df9.iloc[[1,2,4],[0,2]]

Unnamed: 0,color,price
3,green,1.0
4,yellow,0.6
6,white,1.7


<font size = "+0.5"> Once you understand how to access the various elements that make up a dataframe, you
can follow the same logic to add or change the values in it.
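
As a sketch of that idea, the same `loc` and `iloc` expressions that read a value can appear on the left-hand side of an assignment. The frame `shop` below is a made-up, smaller version of the one used above.

```python
import pandas as pd

shop = pd.DataFrame(
    {"color": ["blue", "green", "yellow"], "price": [1.2, 1.0, 0.6]},
    index=[2, 3, 4],
)

shop.loc[3, "price"] = 1.1                      # change one cell by label
shop.iloc[0, 0] = "navy"                        # change one cell by position
shop.loc[shop["price"] < 1.0, "price"] = 0.75   # change every row matching a condition
shop.loc[5] = ["red", 0.9]                      # assigning to a new label adds a row

print(shop)
```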

In [95]:
df9

Unnamed: 0,color,object,price
2,blue,ball,1.2
3,green,pen,1.0
4,yellow,pencil,0.6
5,red,paper,0.9
6,white,mug,1.7


In [96]:
# giving the index a name
df9.index.name = "id"

In [97]:
df9

Unnamed: 0_level_0,color,object,price
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2,blue,ball,1.2
3,green,pen,1.0
4,yellow,pencil,0.6
5,red,paper,0.9
6,white,mug,1.7


In [98]:
# giving the columns axis a name
df9.columns.name = 'item'

In [99]:
df9

item,color,object,price
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2,blue,ball,1.2
3,green,pen,1.0
4,yellow,pencil,0.6
5,red,paper,0.9
6,white,mug,1.7


In [100]:
# adding a new column 
df9["new"] = 12

In [101]:
df9

item,color,object,price,new
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2,blue,ball,1.2,12
3,green,pen,1.0,12
4,yellow,pencil,0.6,12
5,red,paper,0.9,12
6,white,mug,1.7,12


In [102]:
# to give each row its own value, assign a list or array with one element per row
df9["new"] = [12,13,14,15,16]

In [103]:
df9

item,color,object,price,new
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2,blue,ball,1.2,12
3,green,pen,1.0,13
4,yellow,pencil,0.6,14
5,red,paper,0.9,15
6,white,mug,1.7,16


In [104]:
# we can also add a column from a Series; its values are matched to the rows by index label
s5 = pd.Series([200,150,233,111,400],index = range(2,7))
s5

2    200
3    150
4    233
5    111
6    400
dtype: int64

In [105]:
df9["Quantity"] = s5

In [106]:
df9

item,color,object,price,new,Quantity
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2,blue,ball,1.2,12,200
3,green,pen,1.0,13,150
4,yellow,pencil,0.6,14,233
5,red,paper,0.9,15,111
6,white,mug,1.7,16,400


In [107]:
# checking, element by element, whether each value is one of the given values
df9.isin([1.0,'pen'])

item,color,object,price,new,Quantity
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2,False,False,False,False,False
3,False,True,True,False,False
4,False,False,False,False,False
5,False,False,False,False,False
6,False,False,False,False,False


In [108]:
df9[df9.isin([1.0,'pen'])]

item,color,object,price,new,Quantity
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2,,,,,
3,,pen,1.0,,
4,,,,,
5,,,,,
6,,,,,


In [109]:
# deleting a column
del df9['new']

In [110]:
df9

item,color,object,price,Quantity
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2,blue,ball,1.2,200
3,green,pen,1.0,150
4,yellow,pencil,0.6,233
5,red,paper,0.9,111
6,white,mug,1.7,400


In [111]:
# keeping only the rows where Quantity is greater than 160
df9[df9['Quantity']>160]

item,color,object,price,Quantity
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2,blue,ball,1.2,200
4,yellow,pencil,0.6,233
6,white,mug,1.7,400


In [112]:
# checking whether all index labels are unique (useful for large datasets)
df9.index.is_unique

True

<h3> Drop </h3>

In [113]:
# to drop a column or multiple columns
df10 = df9.drop(columns = ["color",'price'])
df10

item,object,Quantity
id,Unnamed: 1_level_1,Unnamed: 2_level_1
2,ball,200
3,pen,150
4,pencil,233
5,paper,111
6,mug,400


In [114]:
# to drop rows
df10 = df10.drop(index = [2,3])
df10

item,object,Quantity
id,Unnamed: 1_level_1,Unnamed: 2_level_1
4,pencil,233
5,paper,111
6,mug,400


<h3> Writing data to CSV and reading it back </h3>

In [115]:
import seaborn as sns

In [116]:
dataset = sns.load_dataset('iris')

In [117]:
dataset.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [118]:
dataset.shape

(150, 5)

In [119]:
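# to_csv() also writes the index, which read_csv() then loads as an extra "Unnamed: 0" column; pass index=False to leave it out
dataset.to_csv('iris.csv')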

In [120]:
df = pd.read_csv('iris.csv')
df.head(10)

Unnamed: 0.1,Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,0,5.1,3.5,1.4,0.2,setosa
1,1,4.9,3.0,1.4,0.2,setosa
2,2,4.7,3.2,1.3,0.2,setosa
3,3,4.6,3.1,1.5,0.2,setosa
4,4,5.0,3.6,1.4,0.2,setosa
5,5,5.4,3.9,1.7,0.4,setosa
6,6,4.6,3.4,1.4,0.3,setosa
7,7,5.0,3.4,1.5,0.2,setosa
8,8,4.4,2.9,1.4,0.2,setosa
9,9,4.9,3.1,1.5,0.1,setosa
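
The extra `Unnamed: 0` column in the output above is the index that `to_csv()` wrote into the file. The sketch below shows two ways to avoid it. It uses a tiny made-up frame and an in-memory buffer (`io.StringIO`) instead of a file on disk, purely to keep the example self-contained.

```python
import io
import pandas as pd

frame = pd.DataFrame({"sepal_length": [5.1, 4.9], "species": ["setosa", "setosa"]})

with_index = frame.to_csv()                # index written as an unnamed first column
without_index = frame.to_csv(index=False)  # index left out of the file

reread_extra = pd.read_csv(io.StringIO(with_index))                 # gains "Unnamed: 0"
reread_clean = pd.read_csv(io.StringIO(without_index))              # original columns only
reread_indexed = pd.read_csv(io.StringIO(with_index), index_col=0)  # first column used as index

print(list(reread_extra.columns))
print(list(reread_clean.columns))
print(list(reread_indexed.columns))
```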


<h3> Data Alignment in Pandas </h3>

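<font size = "+0.5">pandas can align indexes coming from two different data structures. When two structures are combined, values are matched by row and column label rather than by position.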

In [121]:
frame1 = pd.DataFrame(np.arange(20).reshape((5,4)),
                      columns=['ball','pen','pencil','paper'])

In [122]:
frame2 =  pd.DataFrame(np.arange(12).reshape((4,3)),
                       columns=['mug','pen','ball'])

In [123]:
frame1

Unnamed: 0,ball,pen,pencil,paper
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [124]:
frame2

Unnamed: 0,mug,pen,ball
0,0,1,2
1,3,4,5
2,6,7,8
3,9,10,11


In [125]:
# values with matching row and column labels are added; cells present in only one frame become NaN
frame1+frame2

Unnamed: 0,ball,mug,paper,pen,pencil
0,2.0,,,2.0,
1,9.0,,,9.0,
2,16.0,,,16.0,
3,23.0,,,23.0,
4,,,,,


In [126]:
frame1 - frame2

Unnamed: 0,ball,mug,paper,pen,pencil
0,-2.0,,,0.0,
1,-1.0,,,1.0,
2,0.0,,,2.0,
3,1.0,,,3.0,
4,,,,,


Division and multiplication align in the same way.
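
A small sketch of multiplication with alignment, using two made-up frames. It also shows the method form `mul()` with `fill_value`, which substitutes a value when only one of the two frames has a cell at a given position.

```python
import numpy as np
import pandas as pd

frame_a = pd.DataFrame(np.arange(4).reshape(2, 2), columns=["ball", "pen"])
frame_b = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["pen", "mug"])

# Operator form: NaN wherever one side has no matching cell.
plain = frame_a * frame_b

# Method form: a cell missing on one side is treated as 1.
# A cell missing on both sides still comes out as NaN.
filled = frame_a.mul(frame_b, fill_value=1)

print(plain)
print(filled)
```

The same pattern applies to `add()`, `sub()` and `div()`.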

<h3> Element-wise functions </h3>

In [127]:
frame1

Unnamed: 0,ball,pen,pencil,paper
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [128]:
# you could calculate the square root of each value in the dataframe using the NumPy np.sqrt().
np.sqrt(frame1)

Unnamed: 0,ball,pen,pencil,paper
0,0.0,1.0,1.414214,1.732051
1,2.0,2.236068,2.44949,2.645751
2,2.828427,3.0,3.162278,3.316625
3,3.464102,3.605551,3.741657,3.872983
4,4.0,4.123106,4.242641,4.358899


<h3> Statistical functions </h3>

In [129]:
frame1.max()

ball      16
pen       17
pencil    18
paper     19
dtype: int32

In [130]:
frame1.sum()

ball      40
pen       45
pencil    50
paper     55
dtype: int64

In [131]:
frame1.mean()

ball       8.0
pen        9.0
pencil    10.0
paper     11.0
dtype: float64

In [132]:
frame1.describe()

Unnamed: 0,ball,pen,pencil,paper
count,5.0,5.0,5.0,5.0
mean,8.0,9.0,10.0,11.0
std,6.324555,6.324555,6.324555,6.324555
min,0.0,1.0,2.0,3.0
25%,4.0,5.0,6.0,7.0
50%,8.0,9.0,10.0,11.0
75%,12.0,13.0,14.0,15.0
max,16.0,17.0,18.0,19.0


<h3> Sorting and Ranking </h3>

In [133]:
# redefining frame1 with string index labels to demonstrate sorting
frame1 = pd.DataFrame(np.arange(16).reshape((4,4)), 
                      index=['red','blue','yellow','white'], 
                      columns=['ball','pen','pencil','paper'])

In [134]:
frame1

Unnamed: 0,ball,pen,pencil,paper
red,0,1,2,3
blue,4,5,6,7
yellow,8,9,10,11
white,12,13,14,15


In [135]:
# sorting the index of a dataframe
frame1.sort_index(ascending=False, axis = 0)

Unnamed: 0,ball,pen,pencil,paper
yellow,8,9,10,11
white,12,13,14,15
red,0,1,2,3
blue,4,5,6,7


In [136]:
# sorting the columns
frame1.sort_index(ascending=True, axis = 1)

Unnamed: 0,ball,paper,pen,pencil
red,0,3,1,2
blue,4,7,5,6
yellow,8,11,9,10
white,12,15,13,14


In [137]:
# to sort the data on the basis of the values of particular column
frame1.sort_values(by = "pen", ascending=False)

Unnamed: 0,ball,pen,pencil,paper
white,12,13,14,15
yellow,8,9,10,11
blue,4,5,6,7
red,0,1,2,3
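
The heading also mentions ranking, which assigns each value its position in the sorted order instead of reordering the data. A minimal sketch with the `rank()` method on a made-up series `scores`:

```python
import pandas as pd

scores = pd.Series([7, -5, 7, 4, 2], index=list("abcde"))

average_rank = scores.rank()               # tied values share the mean of their ranks
first_rank = scores.rank(method="first")   # tied values are ranked in order of appearance
descending = scores.rank(ascending=False)  # the largest value gets the smallest rank

print(average_rank)
print(first_rank)
print(descending)
```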


<h2> Why Is Data Missing From the Dataset? </h2><br>
<font size = "+1"> There can be multiple reasons why certain values are missing from the data. Some of them are listed below:<br>
<font size = "+1"><ul>Past data may have been corrupted due to improper maintenance.</ul>
<font size = "+1"><ul>Observations were not recorded for certain fields, for example because of human error while recording the values.</ul>
<font size = "+1"><ul>The user intentionally did not provide the values.</ul>
<font size = "+1"><ul>Item nonresponse: the participant refused to respond.</ul></font>

<h2>Why Do We Need to Care About Handling Missing Data?</h2>
<font size = "+1"><ul>Many machine learning algorithms fail if the dataset contains missing values.</ul>
<font size = "+1"><ul>You may end up building a machine learning model that will lead to incorrect results if the missing values are not handled properly.</ul>
<font size = "+1"><ul>Missing data can lead to a lack of precision in the statistical analysis.</ul>

<h2> Handling null values </h2>
<font size = "+1"> There are two main methods for handling null values: <br>
<br>1. dropna() method - to remove the rows that contain null values. <br>
        <ul> &#x2022;  remove rows with any null value </ul>
        <ul> &#x2022;  remove only those rows in which all values are null </ul>
        <ul> &#x2022;  remove rows based on a threshold</ul>
<br>2. fillna() method (and the related interpolate() method) - to fill the null values 
    <ul> &#x2022; Using any constant value, e.g., 0, 1 </ul>
    <ul> &#x2022; Using statistical measures like mean and median. </ul>
    <ul> &#x2022; Using forward fill or backward fill </ul>
    <ul> &#x2022; Using interpolation 
    <ul> &#x2022; linear interpolation</ul>
        <ul> &#x2022; polynomial interpolation</ul>
        <ul> &#x2022; padding interpolation</ul></ul>

<h4> 1. dropna </h4>

In [138]:
data = {
    "initial":['a','b',np.nan,'d','e',np.nan],
    "age":[24,22,np.nan,23,25,np.nan],
    "student":[True,False,True,True,np.nan,np.nan]
}

In [139]:
frame3 = pd.DataFrame(data)

In [140]:
frame3

Unnamed: 0,initial,age,student
0,a,24.0,True
1,b,22.0,False
2,,,True
3,d,23.0,True
4,e,25.0,
5,,,


In [141]:
frame3.isna().sum()

initial    2
age        2
student    2
dtype: int64

<h4> 1.1 Delete rows with any null value </h4>

In [142]:
frame4 = frame3.dropna()

In [143]:
frame4

Unnamed: 0,initial,age,student
0,a,24.0,True
1,b,22.0,False
3,d,23.0,True


In [144]:
frame4.isna().sum()

initial    0
age        0
student    0
dtype: int64

<font size = "+0.5"> By default, dropna() removes a row as soon as it contains a single NaN value (and, with axis=1, a column as soon as it contains one).
<br> Therefore, to avoid losing entire rows or columns because of a single missing value, you can set the how option to "all". <br>
    This tells the dropna() function to delete only the rows or columns in which all elements are NaN.
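
The column case mentioned above can be sketched with the `axis=1` option. The frame `grid` is a made-up example in which column `y` and row 1 are entirely NaN.

```python
import numpy as np
import pandas as pd

grid = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, np.nan, np.nan],
    "z": [7.0, np.nan, 9.0],
})

rows_any = grid.dropna()                   # every row has a NaN in "y", so nothing is left
cols_all = grid.dropna(axis=1, how="all")  # drops only the all-NaN column "y"
rows_all = grid.dropna(how="all")          # drops only the all-NaN row 1

print(len(rows_any))
print(list(cols_all.columns))
print(list(rows_all.index))
```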

<h4> 1.2 Delete only those rows in which all values are null </h4>

In [145]:
frame4 = frame3.dropna(how="all")

In [146]:
frame4

Unnamed: 0,initial,age,student
0,a,24.0,True
1,b,22.0,False
2,,,True
3,d,23.0,True
4,e,25.0,


<h4> 1.3 Delete rows based on a threshold </h4>

<font size = "+0.5">  For instance, “thresh = 2” means that rows with at least 2 non-missing values are kept; all other rows are dropped.

In [147]:
frame4 = frame3.dropna(thresh=2)
frame4

Unnamed: 0,initial,age,student
0,a,24.0,True
1,b,22.0,False
3,d,23.0,True
4,e,25.0,


<h4>2.1 fillna() - using a constant value </h4>

In [148]:
# filling with a constant value in all columns
frame4 = frame3.fillna(1)
frame4

Unnamed: 0,initial,age,student
0,a,24.0,True
1,b,22.0,False
2,1,1.0,True
3,d,23.0,True
4,e,25.0,1
5,1,1.0,1


In [149]:
# or we can fill specific value for each column like this
frame4 = frame3.fillna({"initial":"f","age":22, "student":True})
frame4

Unnamed: 0,initial,age,student
0,a,24.0,True
1,b,22.0,False
2,f,22.0,True
3,d,23.0,True
4,e,25.0,True
5,f,22.0,True


<h4>2.2 fill using statistical measures like mean, median and mode </h4>

In [150]:
frame3

Unnamed: 0,initial,age,student
0,a,24.0,True
1,b,22.0,False
2,,,True
3,d,23.0,True
4,e,25.0,
5,,,


In [151]:
age_mean = frame3["age"].mean()
age_mean

23.5

In [152]:
frame3["age"].fillna(age_mean)

0    24.0
1    22.0
2    23.5
3    23.0
4    25.0
5    23.5
Name: age, dtype: float64

##### The median and mode work the same way: pass frame3["age"].median() or frame3["age"].mode()[0] to fillna()
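
A minimal sketch of filling with the median and the mode. It uses a made-up series `ages` in which 22 occurs twice, so the mode is unambiguous. `mode()` returns a Series because several values can tie, so its first entry is taken.

```python
import numpy as np
import pandas as pd

ages = pd.Series([24, 22, np.nan, 23, 25, np.nan, 22])

# Median of the non-missing values.
filled_median = ages.fillna(ages.median())

# Most frequent non-missing value; .iloc[0] picks the first mode.
filled_mode = ages.fillna(ages.mode().iloc[0])

print(filled_median)
print(filled_mode)
```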

<h3>2.3 forward fill and backward fill</h3>
<font size = "+0.5"> It is possible to replace the missing values in a column with the previous or next value in that column.<br>
    This method comes in handy when working with time-series data.<br>
    e.g., consider a data frame that contains daily temperature measurements in which the temperature for one day is missing. A reasonable estimate is the temperature of the previous or the next day.
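
The temperature scenario can be sketched as follows. The dates and readings are invented for illustration; `ffill()` and `bfill()` are the dedicated pandas methods for forward and backward filling.

```python
import numpy as np
import pandas as pd

days = pd.date_range("2024-01-01", periods=5, freq="D")
temperature = pd.Series([21.0, np.nan, 23.5, np.nan, 20.0], index=days)

previous_day = temperature.ffill()  # each gap takes the reading of the day before
next_day = temperature.bfill()      # each gap takes the reading of the day after

print(previous_day)
print(next_day)
```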

In [153]:
frame3.ffill() # forward filling; fillna(method="ffill") is deprecated in recent pandas

Unnamed: 0,initial,age,student
0,a,24.0,True
1,b,22.0,False
2,b,22.0,True
3,d,23.0,True
4,e,25.0,True
5,e,25.0,True


In [154]:
frame3.bfill() # backward filling; fillna(method="bfill") is deprecated in recent pandas

Unnamed: 0,initial,age,student
0,a,24.0,True
1,b,22.0,False
2,d,23.0,True
3,d,23.0,True
4,e,25.0,
5,,,


<h3>2.4 interpolation</h3>
<font size = "+0.5"> Interpolation is a method of estimating values between known data points. When a data series contains a gap, but data is available on either side of the gap or at a few specific points within it, the values inside the gap can be estimated by interpolation.<br>
    Three interpolation techniques are covered here:<br>
    1. linear<br>
    2. polynomial<br>
    3. padding <br>
    

<h4> 2.4.1 linear interpolation </h4><font size = "+0.5"> The simplest method of interpolation is to draw straight lines between the known data points and treat the function as the combination of those straight lines. This method, called linear interpolation, can introduce considerable error when the underlying data is far from linear. In pandas, method="linear" ignores the index and treats the values as equally spaced.

In [155]:
pd.Series([1,np.nan,2,3,np.nan,5,6,7,np.nan]).interpolate(method="linear")

0    1.0
1    1.5
2    2.0
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
8    7.0
dtype: float64
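
To make the straight-line idea concrete: a gap at position x between known neighbours (x0, y0) and (x1, y1) is filled with y0 + (x - x0) * (y1 - y0) / (x1 - x0). The sketch below cross-checks the pandas result against NumPy's `np.interp`, which applies the same formula. The series `values` is a shortened, made-up version of the one above.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, np.nan, 2, 3, np.nan, 5])
known = values.notna().to_numpy()    # True where a value is present
positions = np.arange(len(values))   # equally spaced positions 0, 1, 2, ...

# Interpolate every position from the known (position, value) pairs.
by_hand = np.interp(positions, positions[known], values.to_numpy()[known])
by_pandas = values.interpolate(method="linear").to_numpy()

print(by_hand)
print(np.allclose(by_hand, by_pandas))
```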

In [156]:
frame3.interpolate(method = "linear")

Unnamed: 0,initial,age,student
0,a,24.0,True
1,b,22.0,False
2,,22.5,True
3,d,23.0,True
4,e,25.0,
5,,25.0,


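<h4> 2.4.2 polynomial interpolation </h4><font size = "+0.5">  A polynomial is a mathematical expression comprising a sum of terms, each term including a variable or variables raised to a power and multiplied by a coefficient.<br>e.g., x<sup>2</sup> + 2x + 1<br>Polynomial interpolation fits such a curve through the known points instead of straight lines. In pandas, method="polynomial" needs an order argument and requires SciPy to be installed.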

In [157]:
pd.Series([1,np.nan,2,3,np.nan,5,6,7,np.nan]).interpolate(method="polynomial", order = 2)

0    1.000000
1    1.316227
2    2.000000
3    3.000000
4    4.008322
5    5.000000
6    6.000000
7    7.000000
8         NaN
dtype: float64

In [158]:
frame3.interpolate(method = "polynomial", order = 2)

Unnamed: 0,initial,age,student
0,a,24.0,True
1,b,22.0,False
2,,21.833333,True
3,d,23.0,True
4,e,25.0,
5,,,


<h4> 2.4.3 interpolation through padding </h4><font size = "+0.5"> Interpolation through padding simply means filling each missing value with the last valid value above it in the dataset. With this technique you can also specify a limit, which is the maximum number of consecutive NaN values to fill.<br>
    In the example below, the NaN value at index 4 is not filled because the limit is 2.

In [159]:
# ffill(limit=2) is the current spelling of padding; interpolate(method='pad') is deprecated in recent pandas
pd.Series([1,2,np.nan,np.nan,np.nan,6,7,np.nan,np.nan,np.nan]).ffill(limit = 2)

0    1.0
1    2.0
2    2.0
3    2.0
4    NaN
5    6.0
6    7.0
7    7.0
8    7.0
9    NaN
dtype: float64

In [160]:
frame3.ffill(limit = 2) # padding on a DataFrame; replaces the deprecated interpolate(method='pad')

Unnamed: 0,initial,age,student
0,a,24.0,True
1,b,22.0,False
2,b,22.0,True
3,d,23.0,True
4,e,25.0,True
5,e,25.0,True
