# Pandas for Data Science
The pandas module contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python.

<br>
<br>
<br>

## Basics:

### Install Pandas:
To install pandas on your system, run `pip install pandas` in a terminal.

### Import pandas
Import pandas with the conventional alias `pd`:

In [1]:
import pandas as pd

<br>
<br>
<br>

## Pandas Data Structures
pandas provides two important data structures:
1. **Series**: a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.
2. **DataFrame**: a rectangular table of data that contains an ordered collection of columns, each of which can hold a different value type.

### Creating pandas series:

In [2]:
# using the pd.Series() function:
data= pd.Series([1,2,3,4,5,6])
data

0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

The pd.Series() function accepts any one-dimensional sequence object and creates a pandas Series from its elements.
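As a small sketch of this point, the same Series can be built from a list, a tuple, or a NumPy array. The variable names are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Three different one-dimensional sequences holding the same values
from_list = pd.Series([10, 20, 30])
from_tuple = pd.Series((10, 20, 30))
from_array = pd.Series(np.array([10, 20, 30]))

# All three hold the same values under the default 0, 1, 2 index
print(from_list.tolist() == from_tuple.tolist() == from_array.tolist())
print(list(from_tuple.index))
```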

In [3]:
# specifying the index
data2= pd.Series(['Alice','Bob','Claire','Dean','Elaina'],index=['a','b','c','d','e'])
data2

a     Alice
b       Bob
c    Claire
d      Dean
e    Elaina
dtype: object

The pd.Series() function also provides a way to specify the indices of the series object. If not specified, the elements will be indexed starting from 0 by default.
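A short sketch of how the index is used once it is set: values can be looked up by label, and a Series created without an index falls back to integer labels. The names `names` and `default` are illustrative.

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', 'Claire'], index=['a', 'b', 'c'])

print(names['b'])            # look up a value by its label
print(list(names.index))     # the labels themselves

default = pd.Series(['Alice', 'Bob', 'Claire'])
print(list(default.index))   # default integer labels starting at 0
```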

### Creating pandas DataFrame:

#### Using Lists

In [4]:
df=pd.DataFrame([['Name','Age','Grade'],['Alice',20,'A'],['Bob',19,'B'],['Claire',19,'A'],['Dean',20,'B']])
print(df)

        0    1      2
0    Name  Age  Grade
1   Alice   20      A
2     Bob   19      B
3  Claire   19      A
4    Dean   20      B


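A DataFrame can be created from a list of lists, where each inner list is treated as an individual row. Because no column names were given here, the columns are labeled 0, 1 and 2, and the intended header row ends up as row 0 of the data.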

In [7]:
# Specifying column names
df=pd.DataFrame([['Alice',20,'A'],['Bob',19,'B'],['Claire',19,'A'],['Dean',20,'B']],columns=['Name','Age','Grade'])
print(df)

     Name  Age Grade
0   Alice   20     A
1     Bob   19     B
2  Claire   19     A
3    Dean   20     B


The pd.DataFrame() function allows you to specify the column names (`columns`) and the row names (`index`) of the object.
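The cell above sets only the column names. A minimal sketch of setting row names as well, assuming hypothetical row labels `s1` and `s2`:

```python
import pandas as pd

df_named = pd.DataFrame(
    [['Alice', 20, 'A'], ['Bob', 19, 'B']],
    columns=['Name', 'Age', 'Grade'],   # column names
    index=['s1', 's2'],                 # row names (hypothetical labels)
)

print(df_named)
print(df_named.loc['s2', 'Age'])        # look up one cell by row and column name
```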

#### Using Dictionaries

In [8]:
df=pd.DataFrame({'Name':['Alice','Bob','Claire','Dean'],'Age':[19,20,19,19],'Grade':['A','B','A','B']})
print(df)

     Name  Age Grade
0   Alice   19     A
1     Bob   20     B
2  Claire   19     A
3    Dean   19     B


When creating a DataFrame from a dictionary, each key becomes a column name and its list of values becomes that column.
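One consequence worth sketching: since each dictionary item is a whole column, the lists must all have the same length. The example below uses illustrative names and shows pandas rejecting a ragged dictionary.

```python
import pandas as pd

columns = {'Name': ['Alice', 'Bob'], 'Age': [19, 20]}
df_cols = pd.DataFrame(columns)
print(list(df_cols.columns))   # dictionary keys become column names

# Lists of unequal length cannot form a rectangular table
ragged_rejected = False
try:
    pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [19]})
except ValueError as err:
    ragged_rejected = True
    print('ValueError:', err)
```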

#### Using CSV files

In [57]:
# read_csv already returns a DataFrame, so no extra conversion is needed
df=pd.read_csv("Python Programming Lab.csv")
df

Unnamed: 0,S No,Lab 1,Lab 2,Lab 3,Lab 4,Lab 5,Lab 6,Lab 7,Lab 8\n,Lab 9,Lab 10,Lab 11,Lab Total,Lab Marks,Record,Lab Internal,Attendance,Total
0,1,,80.0,100.0,,100.0,100.0,100.0,,,,,480,7,8.0,13,4,32
1,2,100.0,100.0,100.0,100.0,,100.0,,50.0,100.0,90.0,90.0,830,12,8.0,18,4,42
2,3,100.0,95.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,,995,14,8.0,19,5,46
3,4,80.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,50.0,100.0,,930,13,7.0,16,5,41
4,5,80.0,100.0,100.0,100.0,,100.0,100.0,100.0,50.0,,,730,10,8.0,14,4,36
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65,66,,,100.0,100.0,90.0,100.0,100.0,,,80.0,85.0,655,9,7.0,13,2,31
66,67,,100.0,,100.0,100.0,100.0,100.0,,,,,500,7,7.0,16,4,34
67,68,,100.0,100.0,100.0,100.0,100.0,100.0,80.0,50.0,80.0,,810,12,7.0,12,4,35
68,69,,,,100.0,100.0,100.0,100.0,,80.0,10.0,,490,7,7.0,12,2,28
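The lab CSV file itself is not included here, so the following sketch reads a tiny in-memory stand-in instead. The two rows reuse a few values from the table above; `io.StringIO` simply lets `pd.read_csv` treat a string as a file.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for the CSV file
csv_text = "S No,Lab 1,Lab 2\n1,,80\n2,100,100\n"

marks = pd.read_csv(io.StringIO(csv_text))

print(type(marks).__name__)            # read_csv returns a DataFrame directly
print(marks.shape)                     # (rows, columns)
print(marks['Lab 1'].isnull().sum())   # the empty field is read as a missing value
```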


<br>
<br>
<br>

## Data Handling using pandas

Let us use the CSV file above to see how pandas helps in data science. The data contains the marks of 70 students across several components.
The main context of the data:
- Each student has performed 11 lab sessions. The total marks scored across these sessions are scaled down to a mark out of 15.
- Each student also has marks allotted for the record, internals and attendance.
- The total marks are the sum of all these components.
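The relationship between the components and the total can be checked with a small sketch. The values are the first three rows of the relevant columns from the table above; the names `sample`, `parts` and `recomputed` are illustrative.

```python
import pandas as pd

# First three rows of the relevant columns, copied from the displayed table
sample = pd.DataFrame({
    'Lab Marks': [7, 12, 14],
    'Record': [8, 8, 8],
    'Lab Internal': [13, 18, 19],
    'Attendance': [4, 4, 5],
    'Total': [32, 42, 46],
})

parts = ['Lab Marks', 'Record', 'Lab Internal', 'Attendance']
recomputed = sample[parts].sum(axis=1)   # add the components row by row

print((recomputed == sample['Total']).all())
```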


### Accessing Data in a DataFrame

Columns can be accessed by name, using either attribute access (`df.Total`) or bracket indexing (`df['Total']`). To access individual rows, pandas provides the `loc` and `iloc` indexers.

In [58]:
# Total marks of all students
df.Total

0     32
1     42
2     46
3     41
4     36
      ..
65    31
66    34
67    35
68    28
69    26
Name: Total, Length: 70, dtype: int64

In [59]:
df['Total']

0     32
1     42
2     46
3     41
4     36
      ..
65    31
66    34
67    35
68    28
69    26
Name: Total, Length: 70, dtype: int64

In [60]:
# lab marks and total marks
df[['Lab Marks','Total']]

    Lab Marks  Total
0           7     32
1          12     42
2          14     46
3          13     41
4          10     36
..        ...    ...
65          9     31
66          7     34
67         12     35
68          7     28
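The three access styles used above can be compared side by side in a short sketch on a three-row sample. Attribute access only works when the column name is a valid Python identifier, so a name with a space such as `Lab Marks` needs brackets.

```python
import pandas as pd

sample = pd.DataFrame({'Lab Marks': [7, 12, 14], 'Total': [32, 42, 46]})

# Attribute and bracket access return the same column
print(sample.Total.equals(sample['Total']))

# 'Lab Marks' contains a space, so only bracket access works for it
print(sample['Lab Marks'].tolist())

# A list of names returns a DataFrame rather than a Series
print(type(sample[['Lab Marks', 'Total']]).__name__)
```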


#### Using the loc and iloc indexers

The `loc` indexer accesses rows by their labels (row names), while `iloc` accesses rows by their integer positions. Both are used with square brackets rather than called like functions.

In [61]:
df.loc[34]

S No              35.0
Lab 1            100.0
Lab 2            100.0
Lab 3            100.0
Lab 4            100.0
Lab 5            100.0
Lab 6            100.0
Lab 7            100.0
Lab 8\n           50.0
Lab 9             80.0
Lab 10            95.0
Lab 11            90.0
Lab Total       1015.0
Lab Marks         14.0
Record             8.0
Lab Internal      18.0
Attendance         5.0
Total             45.0
Name: 34, dtype: float64

In [62]:
df.loc[[34,35,36]]

Unnamed: 0,S No,Lab 1,Lab 2,Lab 3,Lab 4,Lab 5,Lab 6,Lab 7,Lab 8\n,Lab 9,Lab 10,Lab 11,Lab Total,Lab Marks,Record,Lab Internal,Attendance,Total
34,35,100.0,100.0,100.0,100.0,100.0,100.0,100.0,50.0,80.0,95.0,90.0,1015,14,8.0,18,5,45
35,36,100.0,100.0,100.0,100.0,100.0,100.0,100.0,80.0,90.0,80.0,90.0,1040,15,9.0,18,5,47
36,37,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,,80.0,90.0,970,14,7.0,14,5,40
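With the default index, labels and positions coincide, so `loc` and `iloc` appear to behave identically. A minimal sketch with hypothetical non-default labels makes the difference visible:

```python
import pandas as pd

# Hypothetical labels 10, 20, 30 so that labels and positions differ
marks = pd.Series([32, 42, 46], index=[10, 20, 30])

print(marks.loc[20])     # the row labeled 20
print(marks.iloc[0])     # the row at position 0

# Slicing differs too: loc includes the end label, iloc excludes the end position
print(marks.loc[10:20].tolist())
print(marks.iloc[0:1].tolist())
```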


#### Conditional Accessing:

We can retrieve only the rows that satisfy a condition using the `loc` indexer. For example, let us filter the data to display only those records whose total is greater than 45.

In [63]:
df.loc[df['Total']>45]

Unnamed: 0,S No,Lab 1,Lab 2,Lab 3,Lab 4,Lab 5,Lab 6,Lab 7,Lab 8\n,Lab 9,Lab 10,Lab 11,Lab Total,Lab Marks,Record,Lab Internal,Attendance,Total
2,3,100.0,95.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,,995,14,8.0,19,5,46
9,10,90.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,95.0,80.0,1065,15,9.0,19,5,48
28,29,100.0,100.0,100.0,100.0,100.0,100.0,100.0,80.0,50.0,70.0,80.0,980,14,8.0,19,5,46
35,36,100.0,100.0,100.0,100.0,100.0,100.0,100.0,80.0,90.0,80.0,90.0,1040,15,9.0,18,5,47
58,59,90.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,90.0,90.0,85.0,1055,15,9.0,19,5,48
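Under the hood, the comparison produces a boolean Series that `loc` uses as a row mask. The sketch below, on a four-row sample with values taken from the table, also shows one way to combine two conditions; the threshold of 40 is chosen only for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    'Attendance': [4, 4, 5, 5],
    'Total': [32, 42, 46, 41],
})

# The comparison yields one True/False value per row
mask = sample['Total'] > 40
print(mask.tolist())

# Combine conditions with & (and) or | (or), wrapping each in parentheses
top = sample.loc[(sample['Total'] > 40) & (sample['Attendance'] == 5)]
print(top)
```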


### Handling Missing Data:
pandas provides the isnull(), dropna() and fillna() functions to handle missing data. 

The isnull() function returns a DataFrame of the same shape that contains True where a cell value is null and False otherwise. The notnull() function is the opposite of isnull().

In [64]:
df.isnull()

Unnamed: 0,S No,Lab 1,Lab 2,Lab 3,Lab 4,Lab 5,Lab 6,Lab 7,Lab 8\n,Lab 9,Lab 10,Lab 11,Lab Total,Lab Marks,Record,Lab Internal,Attendance,Total
0,False,True,False,False,True,False,False,False,True,True,True,True,False,False,False,False,False,False
1,False,False,False,False,False,True,False,True,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False
4,False,False,False,False,False,True,False,False,False,False,True,True,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65,False,True,True,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False
66,False,True,False,True,False,False,False,False,True,True,True,True,False,False,False,False,False,False
67,False,True,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False
68,False,True,True,True,False,False,False,False,True,False,False,True,False,False,False,False,False,False


To get the number of null values in each column:

In [65]:
df.isnull().sum()

S No             0
Lab 1           17
Lab 2            3
Lab 3            7
Lab 4            3
Lab 5            8
Lab 6            4
Lab 7            8
Lab 8\n         15
Lab 9           21
Lab 10          15
Lab 11          36
Lab Total        0
Lab Marks        0
Record           3
Lab Internal     0
Attendance       0
Total            0
dtype: int64

To get total null values in the data frame:

In [66]:
df.isnull().sum().sum()

140
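The counting works because `True` is treated as 1 when summed. A small sketch on three rows (values taken from the first rows of the table) shows the per-column count, the overall count, and the `notnull()` counterpart:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'Lab 1': [np.nan, 100.0, 100.0],
    'Lab 2': [80.0, 100.0, 95.0],
    'Lab 11': [np.nan, 90.0, np.nan],
})

per_column = sample.isnull().sum()        # True counts as 1, so sum() counts nulls
print(per_column.to_dict())
print(int(per_column.sum()))              # total across the whole frame
print(sample.notnull().sum().to_dict())   # non-null counts per column
```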

Handling missing data is crucial for reliable analysis. Let us eliminate the missing values: the cells below first fill the gaps in `Lab 1` with 100, then fill the remaining gaps with the mean of each column.

In [67]:
df['Lab 1']=df['Lab 1'].fillna(100)

In [68]:
df=df.fillna(df.mean())
df

Unnamed: 0,S No,Lab 1,Lab 2,Lab 3,Lab 4,Lab 5,Lab 6,Lab 7,Lab 8\n,Lab 9,Lab 10,Lab 11,Lab Total,Lab Marks,Record,Lab Internal,Attendance,Total
0,1,100.0,80.000000,100.00000,99.552239,100.000000,100.0,100.000000,72.0,74.693878,78.636364,81.617647,480,7,8.000000,13,4,32
1,2,100.0,100.000000,100.00000,100.000000,99.193548,100.0,99.677419,50.0,100.000000,90.000000,90.000000,830,12,8.000000,18,4,42
2,3,100.0,95.000000,100.00000,100.000000,100.000000,100.0,100.000000,100.0,100.000000,100.000000,81.617647,995,14,8.000000,19,5,46
3,4,80.0,100.000000,100.00000,100.000000,100.000000,100.0,100.000000,100.0,50.000000,100.000000,81.617647,930,13,7.000000,16,5,41
4,5,80.0,100.000000,100.00000,100.000000,99.193548,100.0,100.000000,100.0,50.000000,78.636364,81.617647,730,10,8.000000,14,4,36
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65,66,100.0,99.328358,100.00000,100.000000,90.000000,100.0,100.000000,72.0,74.693878,80.000000,85.000000,655,9,7.000000,13,2,31
66,67,100.0,100.000000,99.84127,100.000000,100.000000,100.0,100.000000,72.0,74.693878,78.636364,81.617647,500,7,7.000000,16,4,34
67,68,100.0,100.000000,100.00000,100.000000,100.000000,100.0,100.000000,80.0,50.000000,80.000000,81.617647,810,12,7.000000,12,4,35
68,69,100.0,99.328358,99.84127,100.000000,100.000000,100.0,100.000000,72.0,80.000000,10.000000,81.617647,490,7,7.000000,12,2,28
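The walkthrough uses only `fillna()`, although `dropna()` was mentioned earlier. The following sketch contrasts the two on a three-row sample; it illustrates the two calls and is not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'Lab 9': [np.nan, 100.0, 50.0],
    'Lab 10': [np.nan, 90.0, 100.0],
})

# fillna with a Series of means: each NaN gets the mean of its own column
filled = sample.fillna(sample.mean())
print(filled.iloc[0].tolist())

# dropna removes every row that contains at least one NaN
dropped = sample.dropna()
print(len(dropped))
```

Both calls return a new DataFrame and leave `sample` unchanged, which is why the notebook assigns the result back to `df`.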


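Now convert the data type so that the DataFrame contains only integer values. Note that `astype('int64')` truncates the fractional part rather than rounding (for example, 99.552239 becomes 99).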

In [69]:
df=df.astype('int64')
df

Unnamed: 0,S No,Lab 1,Lab 2,Lab 3,Lab 4,Lab 5,Lab 6,Lab 7,Lab 8\n,Lab 9,Lab 10,Lab 11,Lab Total,Lab Marks,Record,Lab Internal,Attendance,Total
0,1,100,80,100,99,100,100,100,72,74,78,81,480,7,8,13,4,32
1,2,100,100,100,100,99,100,99,50,100,90,90,830,12,8,18,4,42
2,3,100,95,100,100,100,100,100,100,100,100,81,995,14,8,19,5,46
3,4,80,100,100,100,100,100,100,100,50,100,81,930,13,7,16,5,41
4,5,80,100,100,100,99,100,100,100,50,78,81,730,10,8,14,4,36
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65,66,100,99,100,100,90,100,100,72,74,80,85,655,9,7,13,2,31
66,67,100,100,99,100,100,100,100,72,74,78,81,500,7,7,16,4,34
67,68,100,100,100,100,100,100,100,80,50,80,81,810,12,7,12,4,35
68,69,100,99,99,100,100,100,100,72,80,10,81,490,7,7,12,2,28
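If rounding to the nearest integer is preferred over truncation, one option is to call `round()` before converting. A brief sketch using three of the filled-in means from the output above:

```python
import pandas as pd

values = pd.Series([99.552239, 74.693878, 81.617647])

truncated = values.astype('int64')           # fractional part is dropped
rounded = values.round().astype('int64')     # rounded to the nearest integer first

print(truncated.tolist())
print(rounded.tolist())
```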


<br>
<br>
<br>

## Data Analysis using Pandas

Pandas provides efficient ways to perform data analysis. In fact, this is the purpose pandas is most often used for.

#### Summary Statistics:
The df.describe() method displays several summary statistics for each numeric column.

In [70]:
df.describe()

Unnamed: 0,S No,Lab 1,Lab 2,Lab 3,Lab 4,Lab 5,Lab 6,Lab 7,Lab 8\n,Lab 9,Lab 10,Lab 11,Lab Total,Lab Marks,Record,Lab Internal,Attendance,Total
count,70.0,70.0,70.0,70.0,70.0,70.0,70.0,70.0,70.0,70.0,70.0,70.0,70.0,70.0,70.0,70.0,70.0,70.0
mean,35.5,103.714286,99.314286,99.757143,99.528571,99.171429,98.314286,99.6,72.0,74.485714,78.5,81.3,838.785714,11.871429,7.8,15.571429,4.214286,39.157143
std,20.351085,85.975585,2.936738,1.22102,2.657892,3.530289,11.988573,2.398067,20.387549,16.342124,22.945272,9.473609,201.863485,2.812608,0.714244,3.132677,1.317583,6.706336
min,1.0,0.0,80.0,90.0,80.0,80.0,0.0,80.0,0.0,50.0,0.0,50.0,200.0,3.0,6.0,0.0,0.0,3.0
25%,18.25,100.0,100.0,100.0,100.0,100.0,100.0,100.0,50.0,56.0,78.0,81.0,720.0,10.0,7.0,13.0,4.0,36.0
50%,35.5,100.0,100.0,100.0,100.0,100.0,100.0,100.0,72.0,74.0,80.0,81.0,902.5,13.0,8.0,16.0,5.0,40.5
75%,52.75,100.0,100.0,100.0,100.0,100.0,100.0,100.0,80.0,90.0,90.0,85.0,983.75,14.0,8.0,18.0,5.0,44.0
max,70.0,800.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,90.0,1480.0,21.0,10.0,19.0,5.0,48.0


From the above output, we can conclude that:
- The minimum total marks is 3.
- The maximum total marks is 48.
- On average, the class scored about 39.16.
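Individual statistics can also be read programmatically instead of by eye. A minimal sketch on the first five `Total` values from the data (so the numbers differ from the full-class figures above):

```python
import pandas as pd

totals = pd.Series([32, 42, 46, 41, 36], name='Total')

summary = totals.describe()     # a Series indexed by statistic name
print(summary['min'], summary['max'])

# The same figures are available through dedicated methods
print(totals.min(), totals.max(), totals.mean())
```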

#### Grouping:
The df.groupby() function lets you group data based on a specific column. For example, let us find the average total marks of students at each attendance level.

In [73]:
df.groupby('Attendance').agg({'Total':['mean']})

                Total
                 mean
Attendance
0           21.000000
2           34.125000
4           38.125000
5           41.744186


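We see that students with an attendance score of 5 scored 41.74 on average, those with an attendance score of 4 scored 38.125, and so on. Average total marks rise with attendance, which suggests that attendance may be an important factor influencing a student's total marks. Keep in mind that the attendance marks are themselves one component of the total.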
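When comparing group averages, it can help to see how many students each group contains. The sketch below, on the first five rows of the data, computes the mean and the group size in one call; selecting the column before aggregating also avoids the two-level column header seen above.

```python
import pandas as pd

sample = pd.DataFrame({
    'Attendance': [4, 4, 5, 5, 4],
    'Total': [32, 42, 46, 41, 36],
})

# Mean total and number of students for each attendance level
by_attendance = sample.groupby('Attendance')['Total'].agg(['mean', 'count'])
print(by_attendance)
```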

<br>
<br>
<br>

## Applications of Pandas in Data Science:

### Advantages of Pandas:

1. **Easy Data Representation**
- **Pandas**: Uses DataFrames, which are like tables in Excel. They can hold different types of data in each column, making them easy to work with.
- **Traditional Python**: Would require complex combinations of lists or dictionaries, which can be hard to manage.

2. **Simple Data Manipulation**
- **Pandas**: Has built-in functions to easily filter, sort, and group data. You can do complex operations with just a few lines of code.
- **Traditional Python**: Requires more code, often involving explicit loops, which is slower and harder to maintain.

3. **Better Performance**
- **Pandas**: Is optimized to handle large datasets quickly, relying on compiled C code and NumPy behind the scenes.
- **Traditional Python**: Slower for large datasets, especially when using loops.


4. **Efficient Data Cleaning**
- **Pandas**: Offers tools to handle missing data, remove duplicates, and convert data types easily.
- **Traditional Python**: Requires more manual work, making the process longer and more error-prone.
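The contrast drawn in this list can be sketched on a tiny example: adding two columns with an explicit Python loop versus a single column-wise expression. On three rows the speed difference is not measurable; the sketch only shows the difference in style.

```python
import pandas as pd

sample = pd.DataFrame({
    'Lab Marks': [7, 12, 14],
    'Record': [8, 8, 8],
})

# Plain-Python style: loop over the rows and build a list
looped = []
for lab, record in zip(sample['Lab Marks'], sample['Record']):
    looped.append(lab + record)

# pandas style: one vectorized expression over whole columns
vectorized = sample['Lab Marks'] + sample['Record']

print(looped == vectorized.tolist())
```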


### Applications in Real World:

- **Finance**: Used for analyzing stock data, managing portfolios, and assessing risks through large datasets of transactions and market prices.

- **Healthcare**: Helps in managing patient records, analyzing clinical trial data, and improving diagnosis and treatment through pattern identification.

- **Social Media**: Performs sentiment and trend analysis by processing large volumes of social media data, helping businesses engage effectively with their audience.

- **Education**: Analyzes student performance data to tailor teaching methods and improve learning outcomes.