## Contents

- Handling missing values
- Data formatting
- Data normalization (scaling and centering)
- Data binning (grouping data)
- Creating dummy variables for categorical data

| pandas     | Excel        |
|------------|--------------|
| DataFrame  | Worksheet    |
| Series     | Column       |
| Index      | Row headings |
| Row        | Row          |
| NaN        | Empty cell   |

In [259]:
# Import libraries
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [260]:
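# Load the dataset (student data)
st = pd.read_csv("familydataset.csv")
# .copy() makes independent copies, so later edits to st1 or st2 do not change st
st1 = st.copy()
st2 = st.copy()
st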

Unnamed: 0,Names,Age,Marks,Cnic,Domicile,Fee
0,Asim,30.0,80,2220124449717,lahore,50.0
1,komal,22.0,90,4420157576574,islamabad,
2,zeshan,,100,3320123356978,peshawar,70.0
3,wasim,,70,1220233669811,karachi,
4,wasif,8.0,60,3120123346833,sargodha,40.0
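
In pandas, plain assignment such as `alias = df` binds a second name to the same DataFrame, while `.copy()` creates an independent one. The sketch below illustrates the difference on a small hypothetical frame (`df`, `alias`, and `clone` are illustrative names, not part of the notebook).

```python
import pandas as pd

# Hypothetical two-row frame standing in for the loaded dataset
df = pd.DataFrame({"Names": ["Asim", "komal"], "Fee": [50.0, None]})

alias = df          # same object, no data is copied
clone = df.copy()   # independent copy of the data

alias.loc[0, "Fee"] = 999.0   # this also changes df
clone.loc[1, "Fee"] = 0.0     # this leaves df untouched

alias_is_same = alias is df
df_fee_first = df.loc[0, "Fee"]
df_fee_second_missing = pd.isna(df.loc[1, "Fee"])
```

Working on a copy keeps the originally loaded data available for comparison after cleaning.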


In [261]:
# Rename the "Age" column; rename() returns a new DataFrame and leaves st unchanged
st.rename(columns={"Age": "Age in years"})

Unnamed: 0,Names,Age in years,Marks,Cnic,Domicile,Fee
0,Asim,30.0,80,2220124449717,lahore,50.0
1,komal,22.0,90,4420157576574,islamabad,
2,zeshan,,100,3320123356978,peshawar,70.0
3,wasim,,70,1220233669811,karachi,
4,wasif,8.0,60,3120123346833,sargodha,40.0
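
`rename()` returns a new DataFrame by default, so the result has to be assigned to a name if it is needed later. A minimal sketch on a hypothetical one-column frame:

```python
import pandas as pd

# Hypothetical frame with a single column
df = pd.DataFrame({"Age": [30.0, 22.0]})

# The renamed frame is a new object; df keeps its original column name
renamed = df.rename(columns={"Age": "Age in years"})

original_columns = list(df.columns)
renamed_columns = list(renamed.columns)
```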


### 1. Handling missing values
- In a dataset, missing values may appear as NaN, N/A, 0, or a blank cell.
  #### Steps
  - Check for mistakes and, if possible, collect the data again.
  - Remove the row or column if dropping it does not affect the analysis.
  - Replace the missing values:
    - with the mean (average) of the variable
    - with the mode (most frequent value)
    - with a value predicted by an ML algorithm

In [262]:
# finding missing values in dataset
st.isnull().sum()

Names       0
Age         2
Marks       0
Cnic        0
Domicile    0
Fee         2
dtype: int64
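
Besides raw counts, it can help to see what share of each column is missing and which rows are affected. The sketch below rebuilds only the numeric columns of the table above in a standalone frame (`df` is an illustrative name).

```python
import numpy as np
import pandas as pd

# Numeric columns copied from the table above
df = pd.DataFrame({
    "Age": [30.0, 22.0, np.nan, np.nan, 8.0],
    "Marks": [80, 90, 100, 70, 60],
    "Fee": [50.0, np.nan, 70.0, np.nan, 40.0],
})

# isnull() gives a boolean frame; the mean of booleans is the missing fraction
missing_share = df.isnull().mean()

# Boolean mask selecting the rows that have at least one missing value
rows_with_gaps = df[df.isnull().any(axis=1)]
```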

In [263]:
# Remove missing values
st.dropna()  # drops every row that contains at least one missing value

Unnamed: 0,Names,Age,Marks,Cnic,Domicile,Fee
0,Asim,30.0,80,2220124449717,lahore,50.0
4,wasif,8.0,60,3120123346833,sargodha,40.0


In [264]:
# Drop only the rows that have NaN in the "Fee" column
modst = st.dropna(subset=["Fee"], axis=0)
modst

Unnamed: 0,Names,Age,Marks,Cnic,Domicile,Fee
0,Asim,30.0,80,2220124449717,lahore,50.0
2,zeshan,,100,3320123356978,peshawar,70.0
4,wasif,8.0,60,3120123346833,sargodha,40.0
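
The cells above drop rows. Dropping columns, or keeping only rows that are mostly complete, uses the same method with different arguments. A minimal sketch on a hypothetical three-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; column names mirror the notebook
df = pd.DataFrame({
    "Names": ["Asim", "komal", "zeshan"],
    "Age": [30.0, 22.0, np.nan],
    "Marks": [80, 90, 100],
})

# axis=1 drops every column that contains at least one NaN
no_nan_columns = df.dropna(axis=1)

# thresh keeps only the rows with at least that many non-missing values
mostly_complete_rows = df.dropna(thresh=3)
```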


In [265]:
st1.isnull().sum()

Names       0
Age         2
Marks       0
Cnic        0
Domicile    0
Fee         2
dtype: int64

### Replacing missing values with the column average

In [266]:
# first find the mean or average of that column
mean=st1["Fee"].mean()
mean

53.333333333333336

In [267]:
# Replace NaN with the mean value
st1["Fee"] = st1["Fee"].fillna(mean)
st1

Unnamed: 0,Names,Age,Marks,Cnic,Domicile,Fee
0,Asim,30.0,80,2220124449717,lahore,50.0
1,komal,22.0,90,4420157576574,islamabad,53.333333
2,zeshan,,100,3320123356978,peshawar,70.0
3,wasim,,70,1220233669811,karachi,53.333333
4,wasif,8.0,60,3120123346833,sargodha,40.0
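
Mode replacement follows the same pattern and also works for text columns, where a mean is not defined. A minimal sketch, assuming a hypothetical `Domicile` column with one gap (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical column with one missing entry
df = pd.DataFrame({"Domicile": ["lahore", "karachi", None, "lahore"]})

# mode() returns a Series (ties give several values), so take the first entry
domicile_mode = df["Domicile"].mode()[0]

# Fill the gap with the most frequent value
df["Domicile"] = df["Domicile"].fillna(domicile_mode)
```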


In [268]:
# converting one data type to another
st1["Marks"]=st1["Marks"].astype("float64")
st1.dtypes


Names        object
Age         float64
Marks       float64
Cnic          int64
Domicile     object
Fee         float64
dtype: object
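
One pitfall when converting types: NumPy integer dtypes cannot hold NaN, so casting a column with missing values to `int64` raises an error. The sketch below shows this on a standalone copy of the `Age` values and uses the pandas nullable `Int64` dtype as one possible alternative.

```python
import numpy as np
import pandas as pd

# Age values from the dataset, including the two gaps
age = pd.Series([30.0, 22.0, np.nan, np.nan, 8.0], name="Age")

# Casting NaN to a plain integer dtype fails with a ValueError
try:
    age.astype("int64")
    plain_int_failed = False
except ValueError:
    plain_int_failed = True

# The nullable integer dtype (capital "I") keeps the gaps as <NA>
age_nullable = age.astype("Int64")
```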

In [269]:
# Convert age in years to age in days (approximating a year as 365 days)
st1["Age"] = st1["Age"] * 365
st1 = st1.rename(columns={"Age": "Age in days"})
st1

Unnamed: 0,Names,Age in days,Marks,Cnic,Domicile,Fee
0,Asim,10950.0,80.0,2220124449717,lahore,50.0
1,komal,8030.0,90.0,4420157576574,islamabad,53.333333
2,zeshan,,100.0,3320123356978,peshawar,70.0
3,wasim,,70.0,1220233669811,karachi,53.333333
4,wasif,2920.0,60.0,3120123346833,sargodha,40.0


In [270]:
# Replace every remaining NaN in the DataFrame with a chosen value (here 5000)
st1 = st1.replace(to_replace=np.nan, value=5000)
# Multiply the "Fee" column by 200
st1["Fee"] = st1["Fee"] * 200
st1


Unnamed: 0,Names,Age in days,Marks,Cnic,Domicile,Fee
0,Asim,10950.0,80.0,2220124449717,lahore,10000.0
1,komal,8030.0,90.0,4420157576574,islamabad,10666.666667
2,zeshan,5000.0,100.0,3320123356978,peshawar,14000.0
3,wasim,5000.0,70.0,1220233669811,karachi,10666.666667
4,wasif,2920.0,60.0,3120123346833,sargodha,8000.0


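### Data normalization
- Parametric tests are used for normally distributed data.
- Non-parametric tests are used for data that are not normally distributed.
- Normalization puts the variables on a uniform scale,
- so that each variable has a comparable impact on the analysis.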

### Methods of Normalization
- Simple feature scaling
- Min-Max method
- Z-score (standard score)
- Log transformation

### Simple feature scaling

    x_new = x_old / x_max

In [271]:
# Normalize the "Age in days" column by dividing by its maximum
st1["Age in days"]=st1["Age in days"]/st1["Age in days"].max()
st1

Unnamed: 0,Names,Age in days,Marks,Cnic,Domicile,Fee
0,Asim,1.0,80.0,2220124449717,lahore,10000.0
1,komal,0.733333,90.0,4420157576574,islamabad,10666.666667
2,zeshan,0.456621,100.0,3320123356978,peshawar,14000.0
3,wasim,0.456621,70.0,1220233669811,karachi,10666.666667
4,wasif,0.266667,60.0,3120123346833,sargodha,8000.0


### Min-Max method
    x_new = (x_old - x_min) / (x_max - x_min)

In [272]:
st1["Fee"]=(st1["Fee"]-st1["Fee"].min())/(st1["Fee"].max()-st1["Fee"].min())
st1

Unnamed: 0,Names,Age in days,Marks,Cnic,Domicile,Fee
0,Asim,1.0,80.0,2220124449717,lahore,0.333333
1,komal,0.733333,90.0,4420157576574,islamabad,0.444444
2,zeshan,0.456621,100.0,3320123356978,peshawar,1.0
3,wasim,0.456621,70.0,1220233669811,karachi,0.444444
4,wasif,0.266667,60.0,3120123346833,sargodha,0.0
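
The min-max formula divides by `max - min`, which is zero for a constant column. A small helper can guard against that case. This is a sketch only: `min_max_scale` is a hypothetical name, and returning zeros for a constant column is one possible convention.

```python
import pandas as pd

def min_max_scale(column: pd.Series) -> pd.Series:
    """Rescale a numeric Series to the range [0, 1]."""
    span = column.max() - column.min()
    if span == 0:
        # Constant column: avoid division by zero and return all zeros
        return column * 0.0
    return (column - column.min()) / span

# Three of the Fee values from the table above
fee = pd.Series([10000.0, 14000.0, 8000.0])
scaled = min_max_scale(fee)

constant_scaled = min_max_scale(pd.Series([5.0, 5.0]))
```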


### Z-score method
    x_new = (x_old - mean) / std
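
A quick way to check a z-score calculation is that the result should have mean 0 and standard deviation 1. The sketch below applies the formula to a standalone copy of the original `Marks` values. Note that pandas `.std()` uses the sample standard deviation (`ddof=1`) by default; `ddof=0` gives the population version.

```python
import pandas as pd

# Original Marks values from the dataset
marks = pd.Series([80.0, 90.0, 100.0, 70.0, 60.0], name="Marks")

# Default: sample standard deviation (ddof=1)
z_sample = (marks - marks.mean()) / marks.std()

# Population standard deviation (ddof=0) gives slightly larger z-scores
z_population = (marks - marks.mean()) / marks.std(ddof=0)
```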

In [273]:
# Subtract the mean of "Marks" itself, then divide by its standard deviation
st1["Marks"]=(st1["Marks"]-st1["Marks"].mean())/st1["Marks"].std()
st1

Unnamed: 0,Names,Age in days,Marks,Cnic,Domicile,Fee
0,Asim,1.0,0.0,2220124449717,lahore,0.333333
1,komal,0.733333,0.632456,4420157576574,islamabad,0.444444
2,zeshan,0.456621,1.264911,3320123356978,peshawar,1.0
3,wasim,0.456621,-0.632456,1220233669811,karachi,0.444444
4,wasif,0.266667,-1.264911,3120123346833,sargodha,0.0


### Log transformation

In [274]:
# Cnic is an identifier, so it serves here only to demonstrate the np.log call
st1["Cnic"]=np.log(st1["Cnic"])
st1

Unnamed: 0,Names,Age in days,Marks,Cnic,Domicile,Fee
0,Asim,1.0,0.0,28.428584,lahore,0.333333
1,komal,0.733333,0.632456,29.117196,islamabad,0.444444
2,zeshan,0.456621,1.264911,28.831023,peshawar,1.0
3,wasim,0.456621,-0.632456,27.830063,karachi,0.444444
4,wasif,0.266667,-1.264911,28.768894,sargodha,0.0
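
A log transformation is usually applied to right-skewed numeric values, and `np.log` is only defined for positive inputs. When zeros can occur, `np.log1p` (which computes log(1 + x)) is a common alternative. A minimal sketch, assuming a hypothetical skewed series named `income` with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values, including a zero
income = pd.Series([0.0, 10.0, 100.0, 1000.0, 100000.0])

# np.log(0) is -inf, so log1p is used here instead
log_income = np.log1p(income)

# expm1 inverts log1p and recovers the original values
recovered = np.expm1(log_income)
```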
