# Data Preprocessing with scikit-learn

## Preprocessing techniques
- Data preprocessing is the process of converting raw data into a clean data set

### Data Preprocessing steps
- Loading data
- Exploring data
- Cleaning data
- Transforming data

We will learn data preprocessing techniques with scikit-learn, one of the most popular data science frameworks in industry. The library includes tools for data preprocessing and data mining.

### Data Imputation
- If the dataset is missing too many values, we usually do not use the dataset at all
- If only a few values are missing, we can perform data imputation, which substitutes the missing data with some other value(s)
- There are many different methods for data imputation:
     - using the mean value
     - using the median value
     - using the most frequent value
     - filling in missing values with a constant

### Feature Scaling
>1. Standardizing Data
    - Example: distance can be recorded in different units (cm, m, km, miles)
    - Data scientists convert the data into a standard format to make it easier to understand
    - The standard format refers to data with mean 0 and variance 1; converting data into this format is called data standardization
    - Often improves the performance of models
    - The formula for standardization is (x - mean) / standard deviation
    - image --> std.png, stddata.png
    
>2. Data Range
    - Scales data by compressing it into a fixed range, [0, 1] by default: (x - min) / (max - min)
    - MinMaxScaler  -->minmax.png
>3. Normalizing data
    - Scales the individual data observations (rows) rather than the features (columns)
    - Mostly used in classification problems and data mining
    - When clustering data, we often apply L2 normalization to each row
    - The L2 norm of a row is the square root of the sum of its squared values
    - L2 normalization divides every value in a row by that row's L2 norm, so each row ends up with unit length
    - normal.png
    
>4. Robust Scaling
    - Deals with outliers
    - Scales the data robustly, i.e. in a way that is not strongly affected by outliers
    - Scales using the data's median and interquartile range (IQR)
    - Outliers shift the mean, but the median stays nearly the same
    - Subtract the median from each data value, then divide by the IQR
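
The robust scaling steps above can be sketched with scikit-learn's `RobustScaler`. The toy column below is made up for illustration (it is not one of the notebook's datasets), and the manual computation assumes the scaler's default quantile range of 25 to 75.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical toy column with one obvious outlier (1000)
x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# Defaults: center on the median, scale by the IQR (25th to 75th percentile)
scaler = RobustScaler()
x_robust = scaler.fit_transform(x)

# The same computation by hand
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
x_manual = (x - median) / (q3 - q1)

print(scaler.center_, scaler.scale_)   # fitted median and IQR
print(np.allclose(x_robust, x_manual))
```

The outlier stays far from the rest after scaling, but it does not distort the center or the scale used for the other four values.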

In [3]:
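import pandas as pd
from sklearn.impute import SimpleImputer

# The two Series have different indexes: "a" has no value at index 4,
# so the resulting DataFrame contains a NaN in that cell
di = {
    "a": pd.Series([12, 34, 56], index=[1, 2, 3]),
    "b": pd.Series([90, 87, 78, 56], index=[1, 2, 3, 4]),
}
df = pd.DataFrame(di)
df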

|   | a    | b  |
|---|------|----|
| 1 | 12.0 | 90 |
| 2 | 34.0 | 87 |
| 3 | 56.0 | 78 |
| 4 | NaN  | 56 |


In [5]:
# Replace each NaN with its column mean; other strategies: "median", "most_frequent"
si = SimpleImputer(strategy="mean")
si.fit_transform(df)

array([[12., 90.],
       [34., 87.],
       [56., 78.],
       [34., 56.]])

In [6]:
df.mean()

a    34.00
b    77.75
dtype: float64

In [7]:
# Replace each NaN with a fixed value instead of a column statistic
si = SimpleImputer(strategy="constant", fill_value=-1)
si.fit_transform(df)

array([[12., 90.],
       [34., 87.],
       [56., 78.],
       [-1., 56.]])
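
The cells above cover the `"mean"` and `"constant"` strategies. As a small sketch of the remaining two, the example below runs `"median"` and `"most_frequent"` on a made-up column (`toy` is hypothetical, not part of the notebook's data) that contains an outlier, to show how the choice of strategy changes the filled value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column: one missing value and one outlier (100)
toy = pd.DataFrame({"a": [1.0, 2.0, 2.0, 100.0, np.nan]})

mean_filled = SimpleImputer(strategy="mean").fit_transform(toy)
median_filled = SimpleImputer(strategy="median").fit_transform(toy)
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(toy)

# The last row held the NaN; compare what each strategy put there
print(mean_filled[-1, 0])    # pulled upward by the outlier
print(median_filled[-1, 0])  # middle value of the observed data
print(mode_filled[-1, 0])    # most common observed value
```

When a column has outliers, the median or the most frequent value is usually a safer fill than the mean.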

In [8]:
adv = pd.read_csv("https://raw.githubusercontent.com/APSSDC-Data-Analysis/DataAnalysisBatch-6/main/08-10-2020(Day-4)/Datasets/Advertising.csv")
adv.head()

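|   | TV    | radio | newspaper | sales |
|---|-------|-------|-----------|-------|
| 0 | 230.1 | 37.8  | 69.2      | 22.1  |
| 1 | 44.5  | 39.3  | 45.1      | 10.4  |
| 2 | 17.2  | 45.9  | 69.3      | 9.3   |
| 3 | 151.5 | 41.3  | 58.5      | 18.5  |
| 4 | 180.8 | 10.8  | 58.4      | 12.9  |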


In [9]:
from sklearn.preprocessing import scale

# Standardize every column to mean 0 and unit variance
scl = scale(adv)

In [10]:
scl

array([[ 9.69852266e-01,  9.81522472e-01,  1.77894547e+00,
         1.55205313e+00],
       [-1.19737623e+00,  1.08280781e+00,  6.69578760e-01,
        -6.96046111e-01],
       [-1.51615499e+00,  1.52846331e+00,  1.78354865e+00,
        -9.07405869e-01],
       [ 5.20496822e-02,  1.21785493e+00,  1.28640506e+00,
         8.60330287e-01],
       [ 3.94182198e-01, -8.41613655e-01,  1.28180188e+00,
        -2.15683025e-01],
       [-1.61540845e+00,  1.73103399e+00,  2.04592999e+00,
        -1.31091086e+00],
       [-1.04557682e+00,  6.43904671e-01, -3.24708413e-01,
        -4.27042783e-01],
       [-3.13436589e-01, -2.47406325e-01, -8.72486994e-01,
        -1.58039455e-01],
       [-1.61657614e+00, -1.42906863e+00, -1.36042422e+00,
        -1.77205942e+00],
       [ 6.16042873e-01, -1.39530685e+00, -4.30581584e-01,
        -6.57617064e-01],
       [-9.45155670e-01, -1.17923146e+00, -2.92486143e-01,
        -1.04190753e+00],
       [ 7.90028350e-01,  4.96973404e-02, -1.22232878e+00,
      

In [11]:
scl_data = pd.DataFrame(scl,columns = adv.columns)
scl_data

|   | TV        | radio     | newspaper | sales     |
|---|-----------|-----------|-----------|-----------|
| 0 | 0.969852  | 0.981522  | 1.778945  | 1.552053  |
| 1 | -1.197376 | 1.082808  | 0.669579  | -0.696046 |
| 2 | -1.516155 | 1.528463  | 1.783549  | -0.907406 |
| 3 | 0.052050  | 1.217855  | 1.286405  | 0.860330  |
| 4 | 0.394182  | -0.841614 | 1.281802  | -0.215683 |
| 5 | -1.615408 | 1.731034  | 2.045930  | -1.310911 |
| 6 | -1.045577 | 0.643905  | -0.324708 | -0.427043 |
| 7 | -0.313437 | -0.247406 | -0.872487 | -0.158039 |
| 8 | -1.616576 | -1.429069 | -1.360424 | -1.772059 |
| 9 | 0.616043  | -1.395307 | -0.430582 | -0.657617 |


In [15]:
scl_data.mean().round(3),scl_data.std()

(TV           0.0
 radio       -0.0
 newspaper    0.0
 sales       -0.0
 dtype: float64, TV           1.002509
 radio        1.002509
 newspaper    1.002509
 sales        1.002509
 dtype: float64)
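
The standard deviations printed above are about 1.0025 rather than exactly 1. A likely explanation is that `scale` divides by the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below uses made-up data (`demo` is hypothetical) with 200 rows to show the effect; for 200 rows the ratio is sqrt(200/199) ≈ 1.002509, which is consistent with the output above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale

# Hypothetical stand-in data with 200 rows
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.normal(loc=50, scale=10, size=200)})

scaled = pd.DataFrame(scale(demo), columns=demo.columns)

# Manual standardization: (x - mean) / population standard deviation
manual = (demo - demo.mean()) / demo.std(ddof=0)
print(np.allclose(manual, scaled))

population_std = scaled["x"].std(ddof=0)  # what scale() normalizes to 1
sample_std = scaled["x"].std()            # pandas default, ddof=1
print(population_std, sample_std, np.sqrt(200 / 199))
```

Passing `ddof=0` to `.std()` is enough to confirm that the scaled columns have unit variance in the sense `scale` uses.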

In [14]:
adv.mean(),adv.std()

(TV           147.0425
 radio         23.2640
 newspaper     30.5540
 sales         14.0225
 dtype: float64, TV           85.854236
 radio        14.846809
 newspaper    21.778621
 sales         5.217457
 dtype: float64)

In [17]:
from sklearn.preprocessing import MinMaxScaler

# Keep the fitted scaler under its own name; mnscl holds the scaled array
scaler = MinMaxScaler()
mnscl = scaler.fit_transform(adv)

In [18]:
mnscl

array([[0.77578627, 0.76209677, 0.60598065, 0.80708661],
       [0.1481231 , 0.79233871, 0.39401935, 0.34645669],
       [0.0557998 , 0.92540323, 0.60686016, 0.30314961],
       [0.50997633, 0.83266129, 0.51187335, 0.66535433],
       [0.60906324, 0.21774194, 0.51099384, 0.44488189],
       [0.02705445, 0.9858871 , 0.65699208, 0.22047244],
       [0.19208657, 0.66129032, 0.20404573, 0.4015748 ],
       [0.4041258 , 0.39516129, 0.09938434, 0.45669291],
       [0.02671627, 0.04233871, 0.00615655, 0.12598425],
       [0.67331755, 0.05241935, 0.18381706, 0.35433071],
       [0.2211701 , 0.11693548, 0.21020229, 0.27559055],
       [0.72370646, 0.48387097, 0.03254178, 0.62204724],
       [0.07811972, 0.70766129, 0.5769569 , 0.2992126 ],
       [0.32735881, 0.15322581, 0.06068602, 0.31889764],
       [0.68785932, 0.66330645, 0.40193492, 0.68503937],
       [0.65843761, 0.96169355, 0.46262093, 0.81889764],
       [0.22691917, 0.73790323, 1.        , 0.42913386],
       [0.94927291, 0.7983871 ,

In [19]:
mnscl.min()

0.0

In [20]:
mnscl.max()

1.0
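
Note that `mnscl.min()` and `mnscl.max()` report the extremes of the whole array. Since `MinMaxScaler` works column by column, a per-column check is more informative. The sketch below uses a small made-up array (not the Advertising data) to compare the scaler with the formula (x - min) / (max - min) and to show that a fitted scaler can undo the transformation.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two columns on very different scales
data = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])

scaler = MinMaxScaler()  # default feature_range is (0, 1)
scaled = scaler.fit_transform(data)

# The same computation by hand, one column at a time
col_min = data.min(axis=0)
col_max = data.max(axis=0)
manual = (data - col_min) / (col_max - col_min)
print(np.allclose(scaled, manual))

# Every column should span exactly [0, 1]
print(scaled.min(axis=0), scaled.max(axis=0))

# A fitted scaler can map scaled values back to the original units
restored = scaler.inverse_transform(scaled)
print(np.allclose(restored, data))
```

Keeping the scaler object around, rather than overwriting it with the transformed array, is what makes `inverse_transform` and reuse on new data possible.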

In [21]:
from sklearn.preprocessing import Normalizer
norm = Normalizer()

In [22]:
home = pd.read_csv("https://raw.githubusercontent.com/APSSDC-Data-Analysis/DataAnalysisBatch-6/main/08-10-2020(Day-4)/Datasets/HomeBuyer.csv")
home.head()

|   | Age | EstimatedSalary | Purchased |
|---|-----|-----------------|-----------|
| 0 | 19  | 19000           | 0         |
| 1 | 35  | 20000           | 0         |
| 2 | 26  | 43000           | 0         |
| 3 | 27  | 57000           | 0         |
| 4 | 19  | 76000           | 0         |


In [24]:
# Rescale every row to unit L2 norm (all columns, including Purchased, are used)
norm_data = norm.fit_transform(home)

In [25]:
norm_data.min(),norm_data.max()

(0.0, 0.9999999807797668)
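
A maximum just below 1 is what one would expect when one column is far larger than the others: after dividing by the row's L2 norm, that column ends up close to 1 and the rest close to 0. The sketch below uses two made-up rows (the second loosely mimics an age/salary pair) to check `Normalizer` against a manual computation.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical rows; the second has one value that dominates the other
rows = np.array([[3.0, 4.0], [19.0, 19000.0]])

normalized = Normalizer(norm="l2").fit_transform(rows)

# By hand: divide each row by its own L2 norm
l2_norms = np.linalg.norm(rows, axis=1, keepdims=True)
manual = rows / l2_norms
print(np.allclose(normalized, manual))

# Every normalized row has length 1
row_lengths = np.linalg.norm(normalized, axis=1)
print(row_lengths)

# In the second row the large value ends up just below 1
print(normalized[1])
```

If columns differ this much in magnitude, it may be worth scaling the features first, because row normalization alone lets the largest column dominate.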