# **Pandas (Lib)** 
- Useful for **Data Processing** and **Analysis**
- Pandas provides two main object types
    1. Series
    2. Data Frame (Useful in ML)

## **Pandas Series**
- A Pandas Series is a 1-D, column-like data structure with a single labeled axis (the index).

## **Pandas Data Frame**
- Pandas Data Frame is 2-D tabular data structure with labeled axes (rows and columns)   

## **Import Pandas**

In [1]:
# importing pandas lib
import pandas as pd
pd.__version__

'1.3.4'

---
## **Initializing Pandas Objects**


### **Series**
- A **Pandas Series** is like a **column in a table.**
- It is a **one-dimensional array** holding **data of any type**.
- The index can be **numeric** or a **user-defined set of labels**.
    - By **default**, the index is 0, 1, 2, ... like an **array**
    - With labels, a Series can be used like a **dictionary** of **key-value** pairs
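
The two ways of addressing a Series described above can be sketched as follows. The subject names and marks are made up for illustration; `.loc` and `.iloc` are the explicit label-based and position-based accessors.

```python
import pandas as pd

# Hypothetical marks, used only for illustration
marks = pd.Series([72, 85, 90], index=["maths", "physics", "chemistry"])

by_label = marks.loc["physics"]   # label-based lookup, like a dictionary key
by_position = marks.iloc[1]       # position-based lookup, like an array index

# Without an explicit index, pandas numbers the entries 0, 1, 2, ...
default_index = list(pd.Series([72, 85, 90]).index)

print(by_label, by_position, default_index)
```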

<br>

**Syntax:** 
- <code>pd.Series([])</code>
- <code>pd.Series([],index=[])</code>

***
#### **Loading Data Into Series Using <code>pd.Series([])</code>**

In [2]:
col = [1,2,3,4]
series = pd.Series(col)
series

0    1
1    2
2    3
3    4
dtype: int64

In [3]:
series[0] #Normal Indexing

1

In [4]:
series[1:] #Support Slicing

1    2
2    3
3    4
dtype: int64

***
#### **Loading Data Into Series Using <code>pd.Series([],index=[])</code>**

In [5]:
# Using Two lists
col = ["hari","CS",8.5,None]
ind = ["Name","Course","CGPA",3]
series1 = pd.Series(col,index=ind)
series1

Name      hari
Course      CS
CGPA       8.5
3         None
dtype: object

In [6]:
series1["Name"]

'hari'

In [7]:
print(series1[3])

None


In [8]:
#Using Dictionary
dict1 = {
    "Name":"Sri",
    "Course":"EEE",
    "CGPA":8.0,
    3:None
}
series2 = pd.Series(dict1)
series2

Name       Sri
Course     EEE
CGPA       8.0
3         None
dtype: object

In [9]:
# Taking only the needed key-value pair
series3 = pd.Series(dict1,index=["Name","CGPA"])
series3

Name    Sri
CGPA    8.0
dtype: object

In [10]:
# What if no_of_index is not equal to no_of_values
col = ["hari","CS",8.5,None]
ind = ["Name","Course","CGPA"]
series4 = pd.Series(col,index=ind)
series4
# Error

ValueError: Length of values (4) does not match length of index (3)
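
A small sketch of how `index` behaves differently for dictionaries and lists, using illustrative values. With a dictionary, `index` selects keys and fills missing ones with NaN; with a list, a length mismatch raises `ValueError`.

```python
import pandas as pd

student = {"Name": "Sri", "Course": "EEE", "CGPA": 8.0}  # illustrative values

# With a dictionary, `index` picks keys; "City" is not in the dict, so it becomes NaN
picked = pd.Series(student, index=["Name", "City"])

# With a list, values and index must have the same length
try:
    pd.Series(["hari", "CS", 8.5, None], index=["Name", "Course", "CGPA"])
    length_error = None
except ValueError as err:
    length_error = str(err)

print(picked)
print(length_error)
```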

***
### **DataFrames**
- A Pandas DataFrame is like a table.
- It is a **two-dimensional** structure whose columns can hold **data of any type**.

**Syntax:**
- <code>pd.DataFrame(dict)</code>
- <code>pd.DataFrame(dict,index=[])</code>
- <code>pd.DataFrame([series1,series2])</code>
- <code>pd.read_csv("file_path")</code>
- <code>pd.read_excel("file_path")</code>

#### **Loading Data Into DataFrames Using Dictionary**
- Each **key** should map to a **list of values**, and all lists should have the same length
    - Because a DataFrame is a 2-D data structure, each key becomes one column

In [None]:
dict2 = {
    "Name":["Hari","Dheepan","Hithesh","Shree"],
    "CGPA":[8.0,6.5,7.8,9.0],
    "City":["Pdk","Salem","Chennai","Coimbatore"]
}
df = pd.DataFrame(dict2)
df

In [None]:
df1 = pd.DataFrame(dict2,index=["A","B","C","D"])
df1
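
The dictionary form relies on every key holding the same number of values. The sketch below, with made-up data, shows what happens when the lengths differ and how a single scalar is repeated for every row.

```python
import pandas as pd

# Hypothetical data: "CGPA" has one value fewer than "Name"
uneven = {"Name": ["Hari", "Dheepan", "Hithesh"], "CGPA": [8.0, 6.5]}

try:
    pd.DataFrame(uneven)
    error = None
except ValueError as err:
    error = str(err)  # pandas refuses columns of different lengths

# A scalar value is broadcast to every row
broadcast = pd.DataFrame({"Name": ["Hari", "Dheepan"], "Batch": 2022}, index=["A", "B"])

print(error)
print(broadcast)
```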

***
#### **Loading Data Into DataFrames Using Series**

In [None]:
df2 = pd.DataFrame([series1,series2])
df2

In [None]:
df2 = pd.DataFrame([series1,series2],index=["A","B"])
df2
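
When a list of Series is passed to `pd.DataFrame`, each Series becomes one row. If the Series should become columns instead, `pd.concat` with `axis=1` is one option. A minimal sketch with illustrative values:

```python
import pandas as pd

s1 = pd.Series({"Name": "hari", "CGPA": 8.5})
s2 = pd.Series({"Name": "Sri", "CGPA": 8.0})

as_rows = pd.DataFrame([s1, s2], index=["A", "B"])          # one row per Series
as_columns = pd.concat([s1, s2], axis=1, keys=["A", "B"])   # one column per Series

print(as_rows)
print(as_columns)
```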

***
#### **Loading Data Into DataFrames Using sklearn.datasets**

**sklearn.datasets** contains small practice datasets that are useful for practising **ML concepts**.

##### **Importing the Boston house price data**
- The loader returns the data as a **Bunch**, a dictionary-like object

A **Bunch** exposes keys such as ***(data, feature_names, target, target_names, DESCR, filename)***; the exact set depends on the dataset:

- **data:** all the feature data (here, attributes of each housing area, such as crime rate and average number of rooms) in a NumPy array
- **feature_names:** the names of the feature variables, in other words the names of the columns in data
- **target:** the target data (the variable you want to predict, in this case the house price) in a NumPy array
- **target_names:** the name(s) of the target variable(s) or classes, where the dataset provides them
- **DESCR:** short for DESCRIPTION, a description of the dataset
- **filename:** the path to the actual file of the data in CSV format

In [None]:
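# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import needs an older scikit-learn version
from sklearn.datasets import load_boston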

In [None]:
boston_dataset = load_boston()

In [None]:
print(type(boston_dataset)) # DataType is Bunch

In [None]:
print(boston_dataset)
# we get the data in "dictionary" format


##### **Converting Bunch to DataFrame**

In [None]:
boston_df = pd.DataFrame(boston_dataset.data, columns = boston_dataset.feature_names)

In [None]:
boston_df
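
The same conversion can be tried without scikit-learn. The sketch below uses a hand-built dictionary as a stand-in for a Bunch (an assumption for illustration; the numbers are invented) and also attaches the target as an extra column.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for a Bunch; the values are made up for illustration
bunch_like = {
    "data": np.array([[0.1, 6.5], [0.3, 5.9], [0.2, 7.1]]),
    "feature_names": np.array(["CRIM", "RM"]),
    "target": np.array([24.0, 21.6, 34.7]),
}

# Feature matrix becomes the table body, feature names become the column labels
features_df = pd.DataFrame(bunch_like["data"], columns=bunch_like["feature_names"])
features_df["target"] = bunch_like["target"]  # keep features and target together

print(features_df)
```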

##### **Some Basic In-built methods**

In [None]:
boston_df.head()
# Returns the first 5 rows

In [None]:
boston_df.tail()
# Returns the last 5 rows

In [None]:
boston_df.head(10)
# Returns the first 10 rows

In [None]:
boston_df.tail(10)
# Returns the last 10 rows

In [None]:
boston_df.shape
# Returns a tuple (no_of_rows, no_of_columns)

***
#### **Loading DataFrame using ".csv" file**

In [None]:
diabetes_df = pd.read_csv("../data/diabetes.csv")

In [None]:
type(diabetes_df)

In [None]:
diabetes_df.head()

In [None]:
diabetes_df.tail()

In [None]:
diabetes_df.shape

***
#### **Exporting DataFrame to csv file/excel file**

In [None]:
boston_df.to_csv('../output/boston.csv')

In [None]:
boston_df.to_excel('../output/boston.xlsx')
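
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. A minimal round trip, using an in-memory buffer instead of a file path and made-up sample data, shows two ways to avoid that:

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Hari", "Shree"], "CGPA": [8.0, 9.0]})  # sample data

with_index = df.to_csv()                # index written as an unnamed first column
without_index = df.to_csv(index=False)  # index left out

reloaded_with = pd.read_csv(io.StringIO(with_index))
reloaded_without = pd.read_csv(io.StringIO(without_index))
reloaded_index_col = pd.read_csv(io.StringIO(with_index), index_col=0)

print(list(reloaded_with.columns))
print(list(reloaded_without.columns))
print(list(reloaded_index_col.columns))
```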

***
#### **Loading DataFrame with Random Values**

In [None]:
import numpy as np

##### **Own Logic to create random DataFrame**

In [None]:
rand_dict = dict()

In [None]:
rows = 10
cols = 10
rand_dict["data"] = np.random.randint(10, 20, (rows, cols))  # integers from 10 to 19

In [None]:
rand_dict['data']

In [None]:
col = ["column-"+str(i+1) for i in range(cols)]
col = np.array(col,dtype='U')
rand_dict['feature_data'] = col
rand_dict

In [None]:
rand_df = pd.DataFrame(rand_dict["data"],columns=rand_dict["feature_data"])
rand_df

##### **Another Way**

In [None]:
rand_df = pd.DataFrame(np.random.rand(10,20))
rand_df
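
If the random table needs to be reproducible, a seeded NumPy generator is one option. This is a sketch under that assumption; the seed value and the column naming are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # arbitrary seed, chosen for repeatability
rows, cols = 10, 4

values = rng.integers(10, 20, size=(rows, cols))    # integers in [10, 20)
columns = [f"column-{i + 1}" for i in range(cols)]  # column-1, column-2, ...
seeded_df = pd.DataFrame(values, columns=columns)

print(seeded_df.head())
```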

***
## **Analysing The DataFrame**


### **Finding the number of rows and columns**

In [None]:
boston_df.shape

***
### **DataFrame.head(n) (First n rows)** 

In [None]:
boston_df.head() # by default 5 rows

In [None]:
boston_df.head(7)

***
### **DataFrame.tail(n) (Last n rows)** 

In [None]:
boston_df.tail() # by default 5 rows

In [None]:
boston_df.tail(7)

***
### **DataFrame.info()**
- **Number of rows (entries)** and **number of columns**, with each column's name
- **Datatype** of the values in each column
- **Number of non-null values** in each column
- The number of **null values** is not listed separately; it is the number of entries minus the non-null count
    - **Null values** means missing values

In [None]:
boston_df.info()

***
### **Finding the Number of Missing Values**
<code>DataFrame.isnull().sum()</code>

In [None]:
boston_df.isnull() 

In [None]:
boston_df.isnull().sum()
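
`isnull()` returns a table of booleans, and summing it counts the `True` entries per column. A small sketch with deliberately missing values (made-up data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Hari", None, "Shree"], "CGPA": [8.0, np.nan, np.nan]})

missing_per_column = df.isnull().sum()           # count of missing values per column
missing_total = int(missing_per_column.sum())    # count over the whole table
missing_share = df.isnull().mean()               # fraction of missing values per column

print(missing_per_column)
print(missing_total)
print(missing_share)
```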

***
### **Counting The Values Based On Labels**
<code>DataFrame.value_counts("column_name")</code>

In [None]:
diabetes_df.value_counts("Outcome")

In [None]:
diabetes_df.value_counts("Age")
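
The same counting is available on a single column through `Series.value_counts`, which can also report proportions. A sketch with made-up labels:

```python
import pandas as pd

outcome = pd.Series([0, 1, 0, 0, 1], name="Outcome")  # made-up labels

counts = outcome.value_counts()                # occurrences of each label
shares = outcome.value_counts(normalize=True)  # proportions instead of counts

print(counts)
print(shares)
```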

***
### **Applying function with Grouping and without Grouping**
**With Grouping** : <code>DataFrame.groupby('column_name').mean()</code> <br>
**Without Grouping** : <code>DataFrame.mean()</code>

In [None]:
# with grouping
diabetes_df.groupby('Outcome').mean()

In [None]:
# without grouping
diabetes_df.mean()
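
To make the difference concrete, here is a sketch on a tiny invented table. The column names mirror the diabetes data but the values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "Outcome": [0, 0, 1, 1],
    "Glucose": [90, 100, 150, 170],
    "Age": [25, 35, 50, 60],
})

group_means = df.groupby("Outcome").mean()                        # one row per Outcome value
summary = df.groupby("Outcome")["Glucose"].agg(["mean", "max"])   # several statistics at once
overall = df.mean()                                               # one value per column

print(group_means)
print(summary)
print(overall)
```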

***
## **Statistical Analysis**

### **Count Number Of Values**

In [None]:
boston_df.count()

***
### **Mean Value - Column Wise**

In [None]:
boston_df.mean()

***
### **Standard Deviation Column Wise**

In [None]:
boston_df.std()

***
### **Minimum & Maximum Value In Each Column**

In [None]:
boston_df.min()

In [None]:
boston_df.max()

***
### **Statistical Analysis using one method**

In [None]:
boston_df.describe()
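# Reading the percentile rows, for one column of the output below:
# 25% - 25% of the values are less than or equal to 0.0820
# 50% - 50% of the values are less than or equal to 0.256510 (the median)
# 75% - 75% of the values are less than or equal to 3.677083
# (These rows are percentiles)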
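
The percentile rows of `describe()` match what `quantile` returns for the same column. A quick check on made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({"score": [1, 2, 3, 4, 5]})  # made-up values

stats = df.describe()                                   # count, mean, std, min, 25%, 50%, 75%, max
quartiles = df["score"].quantile([0.25, 0.5, 0.75])     # the same percentiles, computed directly

print(stats)
print(quartiles)
```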

***
## **Manipulating The DataFrame**


### **Adding The Column/Row to DataFrame**

#### **Adding Column**

In [None]:
boston_df1 = boston_df.copy()  # copy, so that boston_df itself is not modified
boston_df1['Price'] = boston_dataset.target
boston_df1
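
Plain assignment gives a second name for the same DataFrame, so a column added through one name appears under the other too. A minimal sketch of the difference between an alias and a copy:

```python
import pandas as pd

original = pd.DataFrame({"a": [1, 2]})

alias = original            # same object under a second name
alias["b"] = [3, 4]         # also shows up in `original`

independent = original.copy()   # separate object
independent["c"] = [5, 6]       # does not affect `original`

print(list(original.columns))
print(list(independent.columns))
```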

#### **Adding a row**

In [None]:
df

In [None]:
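df1 = pd.concat([df, pd.DataFrame([{"Name":"Abbi","CGPA":7.6,"City":"Vilu"}])], ignore_index=True)
# DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat works in both
df

In [None]:
df1
# The new row is only in df1; the original df is unchanged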

In [None]:
df1.loc[len(df1)] = ['Nithu',9.0,"Dindgul"]

In [None]:
df1

In [None]:
# Inserting at a specific position: add the row under a fractional label,
# then sort by index and renumber the rows
df1.loc[1.5] = ['Sud',9.0,"Dindgul"]
df1 = df1.sort_index().reset_index(drop=True)
df1
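
An alternative to the fractional-label trick is to slice the table and join the pieces with `pd.concat`. This is a sketch on made-up rows; `position` is a hypothetical variable naming where the new row should land.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Hari", "Shree"], "CGPA": [8.0, 9.0]})  # sample data
new_row = pd.DataFrame([{"Name": "Sud", "CGPA": 9.0}])

position = 1  # hypothetical target position for the new row
inserted = pd.concat(
    [df.iloc[:position], new_row, df.iloc[position:]],
    ignore_index=True,  # renumber the rows 0, 1, 2, ...
)

print(inserted)
```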

***
### **Removing The Row/Column From DataFrame**
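<code>DataFrame.drop(labels, axis=)</code> <br>
<code>axis=0</code> for row <br>
<code>axis=1</code> for column <br>
Alternatively, <code>DataFrame.drop(index=...)</code> or <code>DataFrame.drop(columns=...)</code> names the axis directly, so <code>axis</code> is not needed. <br>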

In [None]:
boston_df1.drop(index=0)
# Returns a new DataFrame without row 0; it does not remove the row from boston_df1 permanently

In [None]:
boston_df1.drop(columns="Price")
# Returns a new DataFrame without the column; it does not remove the column from boston_df1 permanently

In [None]:
boston_df1.drop(columns="Price",inplace=True)
# Removes the column permanently (boston_df1 is modified in place)

In [None]:
boston_df1
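
The different spellings of `drop` can be compared on a small made-up table. Each call returns a new DataFrame and leaves the original as it was.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

no_row = df.drop(index="x")         # drop a row by its label
no_col = df.drop(columns="b")       # drop a column by its name
same_as_no_col = df.drop("b", axis=1)   # equivalent spelling using labels + axis

print(no_row)
print(no_col)
print(df.shape)  # original is unchanged
```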

***
## **Locating a Row/Column**

### **Locating a Row**

In [None]:
boston_df.iloc[2]
# iloc selects by integer position (here, the third row).

In [None]:
boston_df.loc[boston_df.B==396.90]
# loc selects by label or by boolean mask; here it returns every row where column B equals 396.90.

***
### **Locating a Column**

In [None]:
boston_df.iloc[:,0]

In [None]:
boston_df.iloc[:,-1]
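
The difference between `loc` and `iloc` is easier to see when the row labels are not the default integers. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {"CGPA": [8.0, 6.5, 9.0], "City": ["Pdk", "Salem", "Chennai"]},
    index=["A", "B", "C"],
)

by_position = df.iloc[1]                 # second row, whatever its label
by_label = df.loc["B"]                   # row whose label is "B"
high = df.loc[df["CGPA"] > 7.0, "City"]  # boolean mask for rows, label for the column
last_col = df.iloc[:, -1]                # last column by position

print(by_position)
print(high)
print(last_col)
```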

***
## **Correlation**
- Measures how strongly the values of two numeric columns move together

### **Types of Correlation**
1. **Positive Correlation**
    - When one column's value changes, the other column's value changes in the same direction. **(Both increase or both decrease)**
2. **Negative Correlation**
    - When one column's value changes, the other column's value changes in the opposite direction. **(One increases while the other decreases)**

In [None]:
boston_df.corr()
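
Both cases can be reproduced on a constructed example: one column that rises with `x` and one that falls with it. The column names are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]})
df["up"] = 2 * df["x"] + 1    # rises with x: positive correlation
df["down"] = -3 * df["x"]     # falls as x rises: negative correlation

corr = df.corr()  # pairwise correlation between all numeric columns

print(corr)
```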

## **Cleaning Data**

### **Removing Missing values**
<code>DataFrame.dropna()</code>

In [35]:
dict2 = {
    "Name":["Hari","Dheepan","Hithesh","Shree"],
    "CGPA":[8.0,6.5,7.8,9.0],
    "City":["Pdk","Salem","Chennai","Coimbatore"]
}
df = pd.DataFrame(dict2)
df

| | Name | CGPA | City |
|---|---|---|---|
| 0 | Hari | 8.0 | Pdk |
| 1 | Dheepan | 6.5 | Salem |
| 2 | Hithesh | 7.8 | Chennai |
| 3 | Shree | 9.0 | Coimbatore |


In [36]:
for i in range(10):
    df.loc[len(df)] = ["",None,""]

In [37]:
df

| | Name | CGPA | City |
|---|---|---|---|
| 0 | Hari | 8.0 | Pdk |
| 1 | Dheepan | 6.5 | Salem |
| 2 | Hithesh | 7.8 | Chennai |
| 3 | Shree | 9.0 | Coimbatore |
| 4 | | | |
| 5 | | | |
| 6 | | | |
| 7 | | | |
| 8 | | | |
| 9 | | | |


In [38]:
df.dropna() # Does not change df; it returns a new DataFrame without the rows that contain missing values

| | Name | CGPA | City |
|---|---|---|---|
| 0 | Hari | 8.0 | Pdk |
| 1 | Dheepan | 6.5 | Salem |
| 2 | Hithesh | 7.8 | Chennai |
| 3 | Shree | 9.0 | Coimbatore |


In [39]:
df.dropna(inplace=True)
df

| | Name | CGPA | City |
|---|---|---|---|
| 0 | Hari | 8.0 | Pdk |
| 1 | Dheepan | 6.5 | Salem |
| 2 | Hithesh | 7.8 | Chennai |
| 3 | Shree | 9.0 | Coimbatore |
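
`dropna` can be made more selective through its `how` and `subset` parameters. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Hari", "Shree", None], "CGPA": [8.0, np.nan, np.nan]})

any_missing = df.dropna()                 # default: drop rows with at least one missing value
all_missing = df.dropna(how="all")        # drop only rows where every value is missing
cgpa_only = df.dropna(subset=["CGPA"])    # look at the CGPA column only

print(any_missing)
print(all_missing)
print(cgpa_only)
```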


### **Filling the Missing Value**
<code>DataFrame.fillna()</code>

In [40]:
for i in range(10):
    df.loc[len(df)] = [None,None,None]

In [41]:
df.fillna(130) # inplace is not set, so df itself is unchanged

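| | Name | CGPA | City |
|---|---|---|---|
| 0 | Hari | 8.0 | Pdk |
| 1 | Dheepan | 6.5 | Salem |
| 2 | Hithesh | 7.8 | Chennai |
| 3 | Shree | 9.0 | Coimbatore |
| 4 | 130 | 130.0 | 130 |
| 5 | 130 | 130.0 | 130 |
| 6 | 130 | 130.0 | 130 |
| 7 | 130 | 130.0 | 130 |
| 8 | 130 | 130.0 | 130 |
| 9 | 130 | 130.0 | 130 |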


In [42]:
df["CGPA"].fillna(5.0)

0     8.0
1     6.5
2     7.8
3     9.0
4     5.0
5     5.0
6     5.0
7     5.0
8     5.0
9     5.0
10    5.0
11    5.0
12    5.0
13    5.0
Name: CGPA, dtype: float64

In [43]:
df.fillna(value={"Name":"Loki","CGPA":8.0,"City":"Karur"},inplace = True)
df

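| | Name | CGPA | City |
|---|---|---|---|
| 0 | Hari | 8.0 | Pdk |
| 1 | Dheepan | 6.5 | Salem |
| 2 | Hithesh | 7.8 | Chennai |
| 3 | Shree | 9.0 | Coimbatore |
| 4 | Loki | 8.0 | Karur |
| 5 | Loki | 8.0 | Karur |
| 6 | Loki | 8.0 | Karur |
| 7 | Loki | 8.0 | Karur |
| 8 | Loki | 8.0 | Karur |
| 9 | Loki | 8.0 | Karur |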
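
Instead of a fixed constant, a common choice is to fill a numeric column with a statistic such as its mean. The sketch below uses made-up data, and the placeholder `"unknown"` is an arbitrary choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Hari", None, "Shree"], "CGPA": [8.0, np.nan, 6.0]})

# One fill value per column: a placeholder for names, the column mean for CGPA
filled = df.fillna({"Name": "unknown", "CGPA": df["CGPA"].mean()})

print(filled)
```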


### **Removing Duplicates**
<code>DataFrame.drop_duplicates()</code>

In [44]:
df.drop_duplicates()

| | Name | CGPA | City |
|---|---|---|---|
| 0 | Hari | 8.0 | Pdk |
| 1 | Dheepan | 6.5 | Salem |
| 2 | Hithesh | 7.8 | Chennai |
| 3 | Shree | 9.0 | Coimbatore |
| 4 | Loki | 8.0 | Karur |
4,Loki,8.0,Karur


In [46]:
df.drop_duplicates(inplace =True)
df

| | Name | CGPA | City |
|---|---|---|---|
| 0 | Hari | 8.0 | Pdk |
| 1 | Dheepan | 6.5 | Salem |
| 2 | Hithesh | 7.8 | Chennai |
| 3 | Shree | 9.0 | Coimbatore |
| 4 | Loki | 8.0 | Karur |
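
By default a row counts as a duplicate only if every column matches an earlier row. The `subset` and `keep` parameters change that, and `duplicated()` shows which rows would be removed. A sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Loki", "Loki", "Hari"], "CGPA": [8.0, 8.0, 8.0]})

unique_rows = df.drop_duplicates()                              # compare whole rows
unique_cgpa = df.drop_duplicates(subset="CGPA", keep="last")    # compare CGPA only, keep the last match
dup_mask = df.duplicated()                                      # True for rows that repeat an earlier row

print(unique_rows)
print(unique_cgpa)
print(dup_mask)
```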


___
## **To Print DataFrame in Console**

In [None]:
print(boston_df.to_string())