# **Pandas Tutorial**

<font color = 'blue'> *pandas is a Python library for manipulating and cleaning data. It helps prepare data for analysis and offers many other features and functions, which are discussed below.*</font>

## 1. Importing the library
<font color = 'blue'>*We use the* <font color = 'green'> **import**</font> *keyword to import the library.*</font>

In [1]:
# This imports the library
import pandas 

### 1.1 Aliasing

<font color = 'blue'>*We use the* <font color = 'green'> **as**</font> *keyword to give the library an alias.*</font>

In [2]:
# we give pd as an alias name
import pandas as pd 

## 2. Data types in pandas

<font color = 'blue'>*The two main data structures in pandas are* <font color = 'green'> **Series and DataFrame**</font></font>

### 2.1 Creating a Series
<font color = 'blue'>*A Series is a one-dimensional data structure: a single column of values plus an index, which pandas generates automatically if none is given.*</font> 

In [3]:
# We use Series function

s = pd.Series([0,1,2,3,4])
print(s)

0    0
1    1
2    2
3    3
4    4
dtype: int64


### 2.1.1 Creating through numpy array

In [4]:
# importing numpy 
import numpy as np

# creating a numpy array using arange function 
arr = np.arange(1,6)

# creating pandas series using numpy array
s = pd.Series(arr)
print(s)

0    1
1    2
2    3
3    4
4    5
dtype: int32


### 2.1.2 Giving an index in place of the automatically generated one

In [5]:
# Hardcoding the index

s = pd.Series(arr, index=[1,2,3,4,5])
print(s)

1    1
2    2
3    3
4    4
5    5
dtype: int32


#### Giving index through numpy array

In [6]:
# Creating a numpy array of letters to use as the index

idx = np.array(['A','B','C','D','E'])

s = pd.Series(arr, index = idx)
print(s)

A    1
B    2
C    3
D    4
E    5
dtype: int32
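Once a Series has a labelled index, values can be looked up by label. The following is a minimal sketch that rebuilds the Series from the cell above; the names `series`, `value_c` and `subset` are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# Rebuild the labelled Series from the cell above
labels = np.array(['A', 'B', 'C', 'D', 'E'])
series = pd.Series(np.arange(1, 6), index=labels)

# Look up a single value by its label
value_c = series['C']

# Look up several values by passing a list of labels
subset = series[['A', 'E']]

print(value_c)
print(subset)
```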


### 2.1.3 Creating Series through dictionary

In [7]:
# Here the keys will be used as the indices for the series

dict1 = {
    'Name' : 'Devansh',
    'Age' : 22,
    'Blood' : 'O+',
    'Board' : 'CBSE'
}

s = pd.Series(dict1)
print(s)

Name     Devansh
Age           22
Blood         O+
Board       CBSE
dtype: object
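When a dictionary is passed together with an `index` argument, only the listed keys are used, in the given order, and a key that is missing from the dictionary produces NaN. A small sketch, assuming a hypothetical `weights` dictionary:

```python
import pandas as pd

# Hypothetical dictionary used only for this illustration
weights = {'Devansh': 70, 'Suryansh': 56, 'Yash': 69}

# All keys become the index
s_all = pd.Series(weights)

# Only the listed keys are kept; 'Vidit' is not in the dictionary, so it becomes NaN
s_some = pd.Series(weights, index=['Yash', 'Devansh', 'Vidit'])

print(s_all)
print(s_some)
```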


### 2.2 Creating a DataFrame using Dictionary
<font color = 'blue'>*A DataFrame is a two-dimensional data structure with rows and columns. It can be thought of as a table.*</font> 

In [8]:
# Create a dictionary

data = {
    "Name" : ["Devansh","Suryansh","Yash","Vidit"],
    "Weight"  : [70,56,69,80]
}
s = pd.DataFrame(data, index = np.arange(1,5))
print(s)

       Name  Weight
1   Devansh      70
2  Suryansh      56
3      Yash      69
4     Vidit      80
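Each column of a DataFrame is itself a Series that shares the DataFrame's index. The sketch below rebuilds the table above under the name `frame` (a name chosen here to keep the example self-contained) and pulls out one column.

```python
import numpy as np
import pandas as pd

data = {
    "Name": ["Devansh", "Suryansh", "Yash", "Vidit"],
    "Weight": [70, 56, 69, 80],
}
frame = pd.DataFrame(data, index=np.arange(1, 5))

# Selecting one column returns a Series with the same index as the DataFrame
weights = frame["Weight"]

print(type(weights).__name__)
print(weights.mean())
```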


## 3 Locating the rows 

### 3.1 loc[]
<font color = 'blue'> *We can do this using the*</font> <font color = 'green'>**loc** </font> <font color = 'blue'> *indexer. It takes row labels inside square brackets.* </font>

In [9]:
# We can locate rows using loc[]
print(s.loc[2])

Name      Suryansh
Weight          56
Name: 2, dtype: object


In [10]:
# We can also locate multiple rows by passing a list of labels

print(s.loc[[1,2]])

       Name  Weight
1   Devansh      70
2  Suryansh      56


### 3.2 iloc[]
<font color = 'blue'> *There is another indexer, i.e.* </font> <font color = 'green'>**iloc** </font> <font color = 'blue'> *It takes integer positions of the rows and columns, just as in Python slicing.* </font>

In [11]:
# accessing a single cell through iloc[row, column]. Positions start from 0

print(s.iloc[1,1])

56
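The difference between label-based and position-based access is easiest to see side by side. This is a small sketch on the same table, rebuilt as `frame` for self-containment; note that a label slice with loc includes its end label, while a position slice with iloc excludes its end position.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {"Name": ["Devansh", "Suryansh", "Yash", "Vidit"],
     "Weight": [70, 56, 69, 80]},
    index=np.arange(1, 5),
)

# Label 2 is the second row, because the index starts at 1
by_label = frame.loc[2, "Name"]

# Position 2 is the third row, because positions start at 0
by_position = frame.iloc[2, 0]

# loc slices by label and includes the end label (labels 1 and 2)
label_slice = frame.loc[1:2]

# iloc slices by position and excludes the end position (position 1 only)
position_slice = frame.iloc[1:2]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```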


## 4 Loading the dataset


<font color = 'blue'> *To load a dataset from a CSV file we use* </font> <font color = 'green'>**read_csv()** </font> <font color = 'blue'> *This function takes the path of the CSV file.* </font>

In [12]:
# Loading the dataset into a DataFrame named df
df = pd.read_csv('IndiaCovidSample.csv')
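To try read_csv without the data file, the same call can read from an in-memory text buffer. The sketch below uses a three-row excerpt with only three of the columns, typed in by hand for illustration; the empty last field shows how a missing value is loaded as NaN.

```python
import io
import pandas as pd

# A hand-typed excerpt standing in for the real CSV file
csv_text = """Patient_ID,Age,City
1385,52,Bangalore
1105,81,Delhi
1683,80,
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)
# Count the empty cells in the City column
print(sample["City"].isna().sum())
```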

## 5 Functions with data

### 5.1 head()
<font color = 'green'>**head** </font> <font color = 'blue'> *shows the specified number of rows from the top. The default is 5.* </font>

In [13]:
# Widen the display so all columns fit on one line, and do not truncate long cell values

pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)

In [14]:
# Using head to get the first 5 rows
print(df.head())

   Patient_ID  Age Temperature Oxygen_Saturation Heart_Rate COVID_Test_Result       City
0        1385   52        98.3              91.8         72          Negative  Bangalore
1        1105   81       100.6              92.3         80          Positive      Delhi
2        1700   85        97.8              87.2         69           Pending      Delhi
3        1683   80       100.2              97.0         63           Pending        NaN
4        1001   48        97.4              97.4        116           Pending      Delhi


In [15]:
# Using head for specific number of rows
print(df.head(7))

   Patient_ID  Age Temperature Oxygen_Saturation Heart_Rate COVID_Test_Result       City
0        1385   52        98.3              91.8         72          Negative  Bangalore
1        1105   81       100.6              92.3         80          Positive      Delhi
2        1700   85        97.8              87.2         69           Pending      Delhi
3        1683   80       100.2              97.0         63           Pending        NaN
4        1001   48        97.4              97.4        116           Pending      Delhi
5        1205   41        97.9              93.2         80          Positive    Chennai
6        1143   74        98.1              98.7         68           Pending     Mumbai


### 5.2 tail()
<font color = 'green'>**tail** </font> <font color = 'blue'> *works just like head, except that it shows rows from the bottom.* </font>

In [16]:
# using tail to get the bottom 5 rows
print(df.tail())

     Patient_ID  Age Temperature Oxygen_Saturation Heart_Rate COVID_Test_Result     City
515        1345   28        96.3              96.5         86          Positive  Kolkata
516        1862   77        98.7              96.4         71          Positive   Mumbai
517        1441   60       101.7              95.9        104          Positive  Chennai
518        1374   61        96.9              97.8         79          Positive  Chennai
519        1201   53         NaN               NaN         78          Negative   Mumbai


In [17]:
# using tail to get specific number of rows
print(df.tail(7))

     Patient_ID  Age Temperature Oxygen_Saturation Heart_Rate COVID_Test_Result     City
513        1791   70        97.7              95.3         93          Positive  Chennai
514        1661   32        99.8              95.1         65          Positive   Mumbai
515        1345   28        96.3              96.5         86          Positive  Kolkata
516        1862   77        98.7              96.4         71          Positive   Mumbai
517        1441   60       101.7              95.9        104          Positive  Chennai
518        1374   61        96.9              97.8         79          Positive  Chennai
519        1201   53         NaN               NaN         78          Negative   Mumbai


### 5.3 info()
<font color = 'green'>**info** </font> <font color = 'blue'> *summarizes the dataset: the column names, non-null counts, data types, and memory usage.* </font>

In [30]:
# info() prints the summary itself and returns None
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 520 entries, 0 to 519
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Patient_ID         520 non-null    int64 
 1   Age                520 non-null    int64 
 2   Temperature        465 non-null    object
 3   Oxygen_Saturation  468 non-null    object
 4   Heart_Rate         520 non-null    object
 5   COVID_Test_Result  520 non-null    object
 6   City               470 non-null    object
dtypes: int64(2), object(5)
memory usage: 28.6+ KB


### 5.3 describe()
<font color = 'green'>**describe** </font> <font color = 'blue'> *gives summary statistics (count, mean, standard deviation, minimum, quartiles, maximum) for the numeric columns.* </font>

In [18]:
print(df.describe())

        Patient_ID         Age
count   520.000000  520.000000
mean   1496.469231   53.871154
std     290.349067   20.754477
min    1001.000000   20.000000
25%    1232.000000   36.000000
50%    1496.000000   53.000000
75%    1747.250000   72.250000
max    1996.000000   89.000000
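By default describe skips non-numeric columns, which is why only two columns appear above. Passing `include="all"` adds text columns as well, reported with `unique`, `top` and `freq` rows. A minimal sketch on a small hand-made frame (the four rows are illustrative, not the full dataset):

```python
import pandas as pd

# Small illustrative frame with one numeric and one text column
frame = pd.DataFrame({
    "Age": [52, 81, 85, 80],
    "City": ["Bangalore", "Delhi", "Delhi", None],
})

# include="all" summarizes every column; statistics that do not apply are NaN
summary = frame.describe(include="all")

print(summary)
```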


### 5.4 Shape
<font color = 'green'>**shape** </font> <font color = 'blue'> *tells us the number of rows and columns in the dataset as a (rows, columns) tuple.* </font>


In [19]:
df.shape

(520, 7)

### 5.5 to_string
<font color = 'green'>**to_string** </font> <font color = 'blue'> *returns the whole DataFrame as a string, so printing it shows every row without truncation.* </font>

In [20]:
# to_string returns a string; store it, then print it
x = df.to_string()
print(x)

     Patient_ID  Age Temperature Oxygen_Saturation Heart_Rate COVID_Test_Result       City
0          1385   52        98.3              91.8         72          Negative  Bangalore
1          1105   81       100.6              92.3         80          Positive      Delhi
2          1700   85        97.8              87.2         69           Pending      Delhi
3          1683   80       100.2              97.0         63           Pending        NaN
4          1001   48        97.4              97.4        116           Pending      Delhi
5          1205   41        97.9              93.2         80          Positive    Chennai
6          1143   74        98.1              98.7         68           Pending     Mumbai
7          1797   22        97.2               NaN         77           Pending    Chennai
8          1460   36        98.4              92.8        103           Pending  Bangalore
9          1098   66        99.2              92.3         75           Pending      Delhi

In [21]:
# check the maximum number of rows pandas prints before truncating the output
print(pd.options.display.max_rows)

60
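The row limit can also be changed temporarily instead of globally. The sketch below uses `pd.option_context`, which restores the previous setting when the `with` block ends; the 100-row frame is made up for the demonstration.

```python
import pandas as pd

# Illustrative frame that is longer than the temporary limit
frame = pd.DataFrame({"n": range(100)})

rows_before = pd.get_option("display.max_rows")

# Inside the block, at most 10 rows are displayed before truncation
with pd.option_context("display.max_rows", 10):
    rows_inside = pd.get_option("display.max_rows")
    short_text = repr(frame)

# After the block, the earlier setting is back
rows_after = pd.get_option("display.max_rows")

print(short_text)
print(rows_before, rows_inside, rows_after)
```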


## 6 Data Cleaning
<font color = 'blue'>*Data cleaning is one of the most important uses of pandas; it prepares the data for analysis.*</font>
<font color = 'blue'>*It can be done through various functions.*</font>

### 6.1 Cleaning Empty cells 
<font color = 'blue'>*There are multiple ways to do it.*</font>

#### 6.1.1 Remove the empty rows
<font color = 'green'>**dropna**</font> <font color = 'blue'>*This function drops the rows that contain empty cells (NaN) from the dataset.*</font>

In [22]:
x = df.dropna()
print(x.to_string())

     Patient_ID  Age Temperature Oxygen_Saturation Heart_Rate COVID_Test_Result       City
0          1385   52        98.3              91.8         72          Negative  Bangalore
1          1105   81       100.6              92.3         80          Positive      Delhi
2          1700   85        97.8              87.2         69           Pending      Delhi
4          1001   48        97.4              97.4        116           Pending      Delhi
5          1205   41        97.9              93.2         80          Positive    Chennai
6          1143   74        98.1              98.7         68           Pending     Mumbai
8          1460   36        98.4              92.8        103           Pending  Bangalore
9          1098   66        99.2              92.3         75           Pending      Delhi
10         1681   38        96.8              94.0         96           Pending     Mumbai
11         1217   76        99.9              97.2         86          Positive    Kolkata

In [23]:
# using the how parameter
import pandas as pd
df = pd.read_csv('IndiaCovidSample.csv')

# how = 'any'. This drops any rows with at least one null value
print(df.dropna(how = 'any').to_string())

     Patient_ID  Age Temperature Oxygen_Saturation Heart_Rate COVID_Test_Result       City
0          1385   52        98.3              91.8         72          Negative  Bangalore
1          1105   81       100.6              92.3         80          Positive      Delhi
2          1700   85        97.8              87.2         69           Pending      Delhi
4          1001   48        97.4              97.4        116           Pending      Delhi
5          1205   41        97.9              93.2         80          Positive    Chennai
6          1143   74        98.1              98.7         68           Pending     Mumbai
8          1460   36        98.4              92.8        103           Pending  Bangalore
9          1098   66        99.2              92.3         75           Pending      Delhi
10         1681   38        96.8              94.0         96           Pending     Mumbai
11         1217   76        99.9              97.2         86          Positive    Kolkata

In [24]:
# using the parameter how = 'all'. This drops only the rows in which every value is null
print(df.dropna(how = 'all').to_string())

     Patient_ID  Age Temperature Oxygen_Saturation Heart_Rate COVID_Test_Result       City
0          1385   52        98.3              91.8         72          Negative  Bangalore
1          1105   81       100.6              92.3         80          Positive      Delhi
2          1700   85        97.8              87.2         69           Pending      Delhi
3          1683   80       100.2              97.0         63           Pending        NaN
4          1001   48        97.4              97.4        116           Pending      Delhi
5          1205   41        97.9              93.2         80          Positive    Chennai
6          1143   74        98.1              98.7         68           Pending     Mumbai
7          1797   22        97.2               NaN         77           Pending    Chennai
8          1460   36        98.4              92.8        103           Pending  Bangalore
9          1098   66        99.2              92.3         75           Pending      Delhi

In [25]:
import pandas as pd
df = pd.read_csv('IndiaCovidSample.csv')
print(df.to_string())

     Patient_ID  Age Temperature Oxygen_Saturation Heart_Rate COVID_Test_Result       City
0          1385   52        98.3              91.8         72          Negative  Bangalore
1          1105   81       100.6              92.3         80          Positive      Delhi
2          1700   85        97.8              87.2         69           Pending      Delhi
3          1683   80       100.2              97.0         63           Pending        NaN
4          1001   48        97.4              97.4        116           Pending      Delhi
5          1205   41        97.9              93.2         80          Positive    Chennai
6          1143   74        98.1              98.7         68           Pending     Mumbai
7          1797   22        97.2               NaN         77           Pending    Chennai
8          1460   36        98.4              92.8        103           Pending  Bangalore
9          1098   66        99.2              92.3         75           Pending      Delhi

In [26]:
# subset limits the null check to the listed columns: rows with an empty Temperature are dropped
# (to drop columns instead of rows, use axis = 1)
x = df.dropna(subset = ['Temperature'])
print(x.to_string())

     Patient_ID  Age Temperature Oxygen_Saturation Heart_Rate COVID_Test_Result       City
0          1385   52        98.3              91.8         72          Negative  Bangalore
1          1105   81       100.6              92.3         80          Positive      Delhi
2          1700   85        97.8              87.2         69           Pending      Delhi
3          1683   80       100.2              97.0         63           Pending        NaN
4          1001   48        97.4              97.4        116           Pending      Delhi
5          1205   41        97.9              93.2         80          Positive    Chennai
6          1143   74        98.1              98.7         68           Pending     Mumbai
7          1797   22        97.2               NaN         77           Pending    Chennai
8          1460   36        98.4              92.8        103           Pending  Bangalore
9          1098   66        99.2              92.3         75           Pending      Delhi

**The 'how' parameter works the same way when dropping columns (axis = 1).**

**dropna returns a new DataFrame; the original dataset is not changed unless inplace = True is passed.**
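The dropna variants are easier to compare on a tiny frame. The following sketch uses a hand-made three-row frame; the all-empty `Notes` column is hypothetical and exists only to show `how="all"` on columns. It also shows `thresh`, which keeps rows with at least the given number of non-null values.

```python
import numpy as np
import pandas as pd

# Illustrative frame; "Notes" is a made-up column that is entirely empty
frame = pd.DataFrame({
    "Age": [52, 81, 85],
    "Temperature": [98.3, np.nan, 97.8],
    "Notes": [np.nan, np.nan, np.nan],
})

# Default: drop every row with at least one empty cell (here, all of them)
rows_any = frame.dropna()

# axis=1 works on columns; how="all" drops only columns that are entirely empty
cols_all = frame.dropna(axis=1, how="all")

# subset: only look at Temperature when deciding which rows to drop
rows_subset = frame.dropna(subset=["Temperature"])

# thresh: keep rows that have at least 2 non-null values
rows_thresh = frame.dropna(thresh=2)

print(len(rows_any))
print(list(cols_all.columns))
print(list(rows_subset.index))
print(list(rows_thresh.index))
```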

### 6.1.2 Replace the empty cells
<font color = 'green'>**fillna**</font> <font color = 'blue'>*This function fills the empty cells with the provided value.*</font>

In [27]:
# replace the empty cells with a value
x = df.fillna(100)
print(x)

     Patient_ID  Age Temperature Oxygen_Saturation Heart_Rate COVID_Test_Result       City
0          1385   52        98.3              91.8         72          Negative  Bangalore
1          1105   81       100.6              92.3         80          Positive      Delhi
2          1700   85        97.8              87.2         69           Pending      Delhi
3          1683   80       100.2              97.0         63           Pending        100
4          1001   48        97.4              97.4        116           Pending      Delhi
..          ...  ...         ...               ...        ...               ...        ...
515        1345   28        96.3              96.5         86          Positive    Kolkata
516        1862   77        98.7              96.4         71          Positive     Mumbai
517        1441   60       101.7              95.9        104          Positive    Chennai
518        1374   61        96.9              97.8         79          Positive    Chennai

In [28]:
# We can also fill empty cells in a particular column
df['Temperature'].fillna(100)

0       98.3
1      100.6
2       97.8
3      100.2
4       97.4
       ...  
515     96.3
516     98.7
517    101.7
518     96.9
519      100
Name: Temperature, Length: 520, dtype: object

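**fillna also returns a new object. To keep the result, assign it back (for example df['Temperature'] = df['Temperature'].fillna(100)) or pass inplace = True when calling it on the whole DataFrame.**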
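A single fill value rarely suits every column, as the `100` placed in the City column above shows. fillna also accepts a dictionary that maps column names to fill values. A minimal sketch on a hand-made frame, with the fill values `98.6` and `"Unknown"` chosen arbitrarily for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative frame with one empty cell in each column
frame = pd.DataFrame({
    "Temperature": [98.3, np.nan, 97.8],
    "City": ["Delhi", None, "Mumbai"],
})

# Each column gets its own replacement value
filled = frame.fillna({"Temperature": 98.6, "City": "Unknown"})

print(filled)
# The original frame still has its empty cells
print(frame.isna().sum().sum())
```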

In [33]:
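# Replace the empty cells with the mean of the column (mean() needs a numeric column).
# Patient_ID has no empty cells, so the output here is unchanged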
x = df['Patient_ID'].mean()
df['Patient_ID'].fillna(x)

0      1385
1      1105
2      1700
3      1683
4      1001
       ... 
515    1345
516    1862
517    1441
518    1374
519    1201
Name: Patient_ID, Length: 520, dtype: int64

**The same works with median()**

In [None]:
# for the mode: mode() returns a Series (ties are possible), so [0] takes the first value
df = pd.read_csv('IndiaCovidSample.csv')
x = df['Temperature'].mode()[0]
print(df['Temperature'].fillna(x).to_string())
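For the median, the pattern is the same as for the mean. The sketch below uses a short numeric Series made up for the example; median() ignores the empty cell when computing the value.

```python
import numpy as np
import pandas as pd

# Illustrative numeric Series with one empty cell
temps = pd.Series([98.3, np.nan, 97.8, 100.6])

# median() skips NaN, so it is computed from the three present values
median_temp = temps.median()

# Fill the empty cell with the median
filled = temps.fillna(median_temp)

print(median_temp)
print(filled)
```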

### 6.1.3 Cleaning wrong format
<font color = 'blue'>*Values stored in the wrong format (for example numbers stored as text) can either be removed along with their rows or converted to the right format.*</font>

In [37]:
# convert the Temperature column to a numeric type
df['Temperature'] = pd.to_numeric(df['Temperature'], errors='coerce')

# errors='coerce' turns values that cannot be parsed as numbers into NaN (empty cells)
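The effect of the conversion can be checked on a short text column. This sketch uses a made-up object-dtype Series in which one entry, `"unknown"`, is not a number.

```python
import pandas as pd

# Illustrative text column; "unknown" cannot be parsed as a number
raw = pd.Series(["98.3", "100.6", "unknown"], dtype=object)

# Unparseable entries become NaN instead of raising an error
clean = pd.to_numeric(raw, errors="coerce")

print(clean)
print(clean.isna().tolist())
```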

### 6.1.4 Cleaning duplicates
<font color = 'green'>**duplicated** </font> <font color = 'blue'> *This function detects duplicate rows. It returns a boolean Series in which True marks a row that repeats an earlier row.* </font>

In [40]:
# check for any duplicate values
df.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
515    False
516    False
517    False
518    False
519    False
Length: 520, dtype: bool
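With 520 rows, the truncated output hides whether any True values exist. Summing the boolean Series counts them, and the `keep` and `subset` parameters change what counts as a duplicate. A small sketch on a hand-made frame:

```python
import pandas as pd

# Illustrative frame: row 2 repeats row 0 exactly
frame = pd.DataFrame({
    "Patient_ID": [1385, 1105, 1385, 1700],
    "City": ["Bangalore", "Delhi", "Bangalore", "Delhi"],
})

# Default keep="first": only the later copy is flagged
flags = frame.duplicated()

# True counts as 1, so the sum is the number of duplicate rows
duplicate_count = flags.sum()

# keep=False flags every row that has a copy, including the first one
flags_all = frame.duplicated(keep=False)

# subset compares only the listed columns
flags_city = frame.duplicated(subset=["City"])

print(duplicate_count)
print(flags_all.tolist())
print(flags_city.tolist())
```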

<font color = 'green'>**drop_duplicates** </font> <font color = 'blue'> *This function drops the duplicate rows present in the dataset.* </font>

In [44]:
# inplace = True drops the duplicates in the original dataset instead of returning a copy
df.drop_duplicates(inplace = True)
print(df.duplicated().to_string())

0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18     False
19     False
20     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29     False
30     False
31     False
32     False
33     False
34     False
36     False
37     False
38     False
39     False
40     False
41     False
42     False
43     False
44     False
45     False
46     False
47     False
48     False
49     False
50     False
51     False
52     False
53     False
54     False
55     False
56     False
57     False
58     False
59     False
60     False
61     False
62     False
63     False
64     False
65     False
66     False
67     False
68     False
69     False
70     False
71     False
72     False
73     False
74     False
75     False
76     False
77     False
78     False