## Context

This notebook introduces encoding categorical variables and scaling (normalizing) numeric data.

## Contents:

 
1. Import necessary libraries and Load Data
2. Encoding Categorical Variables - One-hot encoding and label encoding 
3. Normalization and Scaling 

## Objective

To encode categorical data (one-hot encoding, dummy variables, label encoding) and to scale numeric data.

## Attributes:

1. Suburb	
2. Address	
3. Rooms	
4. Type	
5. Method	
6. SellerG	
7. Date	
8. Distance	
9. Postcode	
10. Bedroom	
11. Bathroom	
12. Car	
13. Landsize	
14. BuildingArea	
15. YearBuilt
16. CouncilArea	
17. Latitude	
18. Longtitude	
19. Regionname	
20. Propertycount	
21. ParkingArea	
22. Price


## Loading Libraries

In [41]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings

## Setting Options

In [42]:
# to suppress warnings
warnings.filterwarnings('ignore')

## Loading Dataset

In [43]:
df = pd.read_csv('Melbourne_housing_FULL.csv')

In [44]:
df.shape

(34857, 22)

This dataset contains 34,857 rows and 22 columns. The encoding examples below use the 'ParkingArea' and 'Regionname' columns, and the scaling examples use 'Rooms'.

**The first step is to check whether the data has any missing values.**

In [45]:
df.isnull().sum()

Suburb               0
Address              0
Rooms                0
Type                 0
Method               0
SellerG              0
Date                 0
Distance             1
Postcode             1
Bedroom           8217
Bathroom          8226
Car               8728
Landsize         11810
BuildingArea     21097
YearBuilt        19306
CouncilArea          3
Latitude          7976
Longtitude        7976
Regionname           0
Propertycount        3
ParkingArea          0
Price             7610
dtype: int64

In [46]:
# Drop every row that contains a missing value. You can also impute with the mean, median, etc.

# Work on a deep copy so the original dataset stays untouched
df1 = df.copy(deep=True)
df1.dropna(inplace=True)

In [47]:
df1.isnull().sum()

Suburb           0
Address          0
Rooms            0
Type             0
Method           0
SellerG          0
Date             0
Distance         0
Postcode         0
Bedroom          0
Bathroom         0
Car              0
Landsize         0
BuildingArea     0
YearBuilt        0
CouncilArea      0
Latitude         0
Longtitude       0
Regionname       0
Propertycount    0
ParkingArea      0
Price            0
dtype: int64
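
Dropping every row with a gap keeps only 8,890 of the 34,857 rows, so imputation may be worth considering. The comment in the cell above mentions it as an alternative; a minimal sketch follows, assuming a made-up toy frame whose column names mirror the dataset. The median and the mode are only one possible choice of fill values.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "Landsize": [200.0, np.nan, 650.0, 400.0],
    "CouncilArea": ["Yarra", None, "Yarra", "Moonee Valley"],
})

imputed = toy.copy()
# Numeric column: fill gaps with the median, which is less sensitive to outliers than the mean.
imputed["Landsize"] = imputed["Landsize"].fillna(imputed["Landsize"].median())
# Categorical column: fill gaps with the most frequent value (the mode).
imputed["CouncilArea"] = imputed["CouncilArea"].fillna(imputed["CouncilArea"].mode()[0])

print(imputed)
```

Unlike `dropna`, this keeps all four rows of the toy frame.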

In [48]:
df1.dtypes

Suburb            object
Address           object
Rooms              int64
Type              object
Method            object
SellerG           object
Date              object
Distance         float64
Postcode         float64
Bedroom          float64
Bathroom         float64
Car              float64
Landsize         float64
BuildingArea      object
YearBuilt        float64
CouncilArea       object
Latitude         float64
Longtitude       float64
Regionname        object
Propertycount    float64
ParkingArea       object
Price            float64
dtype: object

# Handling non-numeric (categorical) data

Most machine learning models are designed to work on numeric data, so categorical text data must be converted into numbers before model building.

### One-Hot Encoding

One-hot encoding replaces a categorical variable with one dummy variable per category. Each dummy variable is 1 if the record has that category and 0 otherwise.


In [49]:
df_dummies = pd.get_dummies(df1, prefix='Park', columns=['ParkingArea'])  # one-hot encodes the 'ParkingArea' text column

In [50]:
df_dummies.head()

|  | Suburb | Address | Rooms | Type | Method | SellerG | Date | Distance | Postcode | Bedroom | ... | Propertycount | Price | Park_Attached Garage | Park_Carport | Park_Detached Garage | Park_Indoor | Park_Outdoor Stall | Park_Parkade | Park_Parking Pad | Park_Underground |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Airport West | 154 Halsey Rd | 3 | t | PI | Nelson | 3/9/2016 | 13.5 | 3042.0 | 3.0 | ... | 3464.0 | 840000.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | Albert Park | 105 Kerferd Rd | 2 | h | S | hockingstuart | 3/9/2016 | 3.3 | 3206.0 | 2.0 | ... | 3280.0 | 1275000.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | Alphington | 6 Smith St | 4 | h | S | Brace | 3/9/2016 | 6.4 | 3078.0 | 3.0 | ... | 2211.0 | 2000000.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 6 | Alphington | 5/6 Yarralea St | 3 | h | S | Jellis | 3/9/2016 | 6.4 | 3078.0 | 3.0 | ... | 2211.0 | 1110000.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 7 | Altona | 158 Queen St | 3 | h | VB | Greg | 3/9/2016 | 13.8 | 3018.0 | 3.0 | ... | 5301.0 | 520000.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |


In [51]:
df_dummies.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Method', 'SellerG', 'Date',
       'Distance', 'Postcode', 'Bedroom', 'Bathroom', 'Car', 'Landsize',
       'BuildingArea', 'YearBuilt', 'CouncilArea', 'Latitude', 'Longtitude',
       'Regionname', 'Propertycount', 'Price', 'Park_Attached Garage',
       'Park_Carport', 'Park_Detached Garage', 'Park_Indoor',
       'Park_Outdoor Stall', 'Park_Parkade', 'Park_Parking Pad',
       'Park_Underground'],
      dtype='object')

In [52]:
df_dummies.shape

(8890, 29)

In [53]:
df.shape

(34857, 22)
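
Replacing 'ParkingArea' with eight indicator columns is what takes the frame from 22 to 29 columns. The sketch below repeats the idea on a made-up toy frame and shows two optional arguments of `pd.get_dummies`. `dtype=int` requests 0/1 integers, because recent pandas versions return boolean columns by default. `drop_first=True` drops one category so that it acts as the baseline. The toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Price": [840000.0, 1275000.0, 2000000.0],
    "ParkingArea": ["Carport", "Indoor", "Carport"],
})

# One indicator column per category, stored as 0/1 integers.
full = pd.get_dummies(toy, prefix="Park", columns=["ParkingArea"], dtype=int)

# drop_first=True omits the first category, which then acts as the baseline.
reduced = pd.get_dummies(toy, prefix="Park", columns=["ParkingArea"],
                         dtype=int, drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Dropping one category avoids perfectly collinear indicator columns, which matters mainly for linear models.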

### Sklearn Label Encoding

In [54]:
df_dummies['Regionname']

1              Western Metropolitan
2             Southern Metropolitan
5             Northern Metropolitan
6             Northern Metropolitan
7              Western Metropolitan
                    ...            
34838          Western Metropolitan
34846    South-Eastern Metropolitan
34848          Western Metropolitan
34851    South-Eastern Metropolitan
34856         Northern Metropolitan
Name: Regionname, Length: 8890, dtype: object

In [55]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df_dummies['RegionId'] = le.fit_transform(df_dummies['Regionname'])
df_dummies['RegionId'].head()

1    6
2    5
5    2
6    2
7    6
Name: RegionId, dtype: int32

In [56]:
df_dummies['RegionId'].value_counts()

5    2709
2    2613
6    2059
0     982
4     371
3      62
1      51
7      43
Name: RegionId, dtype: int64

In [57]:
df_dummies['RegionId'].unique()

array([6, 5, 2, 0, 4, 3, 1, 7])

In [58]:
df_dummies['RegionId'].nunique()

8
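
The integer ids above are easier to read when they are mapped back to region names. The sketch below, on a small made-up sample of region names, uses the fitted encoder's `classes_` attribute and `inverse_transform` to do that. `LabelEncoder` assigns ids in sorted order of the labels, so the numbers carry no real ranking. The scikit-learn documentation describes `LabelEncoder` as intended for target labels rather than input features.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small made-up sample of region names.
regions = pd.Series(["Western Metropolitan", "Southern Metropolitan",
                     "Northern Metropolitan", "Western Metropolitan"])

le = LabelEncoder()
ids = le.fit_transform(regions)

# classes_ holds the labels in sorted order; each label's position is its id.
mapping = dict(zip(le.classes_, range(len(le.classes_))))
print(mapping)

# inverse_transform recovers the original labels from the ids.
restored = le.inverse_transform(ids)
print(list(restored))
```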

### Sklearn - OneHotEncoder

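OneHotEncoder one-hot encodes categorical features. Here it is applied to the integer codes in RegionId. By default it returns a sparse matrix, which `.toarray()` converts to a dense array.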

In [59]:
from sklearn.preprocessing import OneHotEncoder

he = OneHotEncoder()
encoded = he.fit_transform(df_dummies['RegionId'].values.reshape(-1,1)).toarray()
encoded

array([[0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])

In [60]:
encoded.shape

(8890, 8)

Let us now add this back to the df_dummies dataframe.

In [61]:
# Convert the array into a one-hot encoded dataframe

df_encoded = pd.DataFrame(encoded, columns=[f"RegionId_{i}" for i in range(encoded.shape[1])])
df_encoded.head()

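|  | RegionId_0 | RegionId_1 | RegionId_2 | RegionId_3 | RegionId_4 | RegionId_5 | RegionId_6 | RegionId_7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |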


In [62]:
df_encoded.shape

(8890, 8)

In [63]:
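# pd.concat matches rows by index label, so give df_encoded the same index as df_dummies first;
# otherwise the 0..n-1 index of df_encoded misaligns with the labels left by dropna() and yields NaN rows
df_dummies = pd.concat([df_dummies, df_encoded.set_index(df_dummies.index)], axis=1)
df_dummies.head()

|  | Suburb | Address | Rooms | Type | Method | SellerG | Date | Distance | Postcode | Bedroom | ... | Park_Underground | RegionId | RegionId_0 | RegionId_1 | RegionId_2 | RegionId_3 | RegionId_4 | RegionId_5 | RegionId_6 | RegionId_7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Airport West | 154 Halsey Rd | 3 | t | PI | Nelson | 3/9/2016 | 13.5 | 3042.0 | 3.0 | ... | 0 | 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | Albert Park | 105 Kerferd Rd | 2 | h | S | hockingstuart | 3/9/2016 | 3.3 | 3206.0 | 2.0 | ... | 0 | 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 5 | Alphington | 6 Smith St | 4 | h | S | Brace | 3/9/2016 | 6.4 | 3078.0 | 3.0 | ... | 1 | 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | Alphington | 5/6 Yarralea St | 3 | h | S | Jellis | 3/9/2016 | 6.4 | 3078.0 | 3.0 | ... | 0 | 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | Altona | 158 Queen St | 3 | h | VB | Greg | 3/9/2016 | 13.8 | 3018.0 | 3.0 | ... | 0 | 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |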


The steps above illustrate what OneHotEncoder and LabelEncoder can do. You do not need to follow all of them in practice.
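
One detail of the concatenation step deserves a closer look. `pd.concat` with `axis=1` matches rows by index label, not by position. A frame that has been through `dropna()` keeps its original, non-contiguous labels, while a frame built from a bare array gets a fresh 0..n-1 index. The sketch below reproduces this on made-up toy frames and shows one possible way to align them.

```python
import pandas as pd

# Left frame keeps non-contiguous labels, as a frame does after dropna().
left = pd.DataFrame({"Suburb": ["Airport West", "Albert Park", "Alphington"]},
                    index=[1, 2, 5])

# Right frame built from plain values gets a fresh 0..n-1 index.
right = pd.DataFrame({"RegionId_6": [1.0, 0.0, 0.0]})

# Labels differ, so the result covers the union of labels and fills the gaps with NaN.
misaligned = pd.concat([left, right], axis=1)

# Giving both frames the same labels makes the rows line up one to one.
aligned = pd.concat([left, right.set_index(left.index)], axis=1)

print(misaligned)
print(aligned)
```

Passing `index=left.index` when constructing the right-hand frame, or calling `reset_index(drop=True)` on the left-hand frame, would work equally well.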

# Normalizing and Scaling

Normalizing and scaling convert variables measured on different scales to a common scale. There are several ways to do this:

1. Z-Score
2. StandardScaler (uses mean/std)
3. MinMaxScaler
4. Log Transformation
5. Exponential Transformation

###  1. Z-Score

In [85]:
from scipy.stats import zscore

df['Rooms_std_zscore'] = df[['Rooms']].apply(zscore)
df['Rooms_std_zscore']

0       -1.062988
1       -0.031974
2       -1.062988
3       -1.062988
4       -0.031974
           ...   
34852   -0.031974
34853    0.999040
34854    0.999040
34855   -0.031974
34856    0.999040
Name: Rooms_std_zscore, Length: 34857, dtype: float64
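
The z-score is (x - mean) / standard deviation. The sketch below computes it by hand on made-up room counts and compares the result with `scipy.stats.zscore`. `zscore` uses the population standard deviation (`ddof=0`) by default, whereas pandas `Series.std()` defaults to `ddof=1`, so a hand-rolled pandas version can differ slightly.

```python
import numpy as np
from scipy.stats import zscore

# Made-up room counts for illustration.
rooms = np.array([2, 3, 2, 2, 3, 4, 4, 3], dtype=float)

# z = (x - mean) / std, with the population standard deviation (ddof=0).
manual = (rooms - rooms.mean()) / rooms.std(ddof=0)

print(np.allclose(manual, zscore(rooms)))
```

After the transformation the values have mean 0 and standard deviation 1.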

### 2. StandardScaler

In [82]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

df['Rooms_std']= ss.fit_transform(df[['Rooms']])
df['Rooms_std']

0       -1.062988
1       -0.031974
2       -1.062988
3       -1.062988
4       -0.031974
           ...   
34852   -0.031974
34853    0.999040
34854    0.999040
34855   -0.031974
34856    0.999040
Name: Rooms_std, Length: 34857, dtype: float64
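
The numbers above match the Z-score output because `StandardScaler` applies the same formula. What it adds is that the fitted scaler stores the mean and standard deviation, so the same parameters can be reused on new data. The sketch below shows this on a made-up train/test split, which is an assumption for illustration and not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up split for illustration.
train = np.array([[2.0], [3.0], [2.0], [4.0], [4.0]])
test = np.array([[3.0], [5.0]])

ss = StandardScaler()
train_scaled = ss.fit_transform(train)  # learns mean_ and scale_ from the training rows
test_scaled = ss.transform(test)        # reuses them without refitting on the test rows

print(ss.mean_, ss.scale_)
print(test_scaled.ravel())
```

Fitting on the training rows only keeps information from the test rows out of the scaling parameters.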

### 3. MinMaxScaler

In [88]:
from sklearn.preprocessing import MinMaxScaler

mm = MinMaxScaler()

df['Rooms_mm'] = mm.fit_transform(df[['Rooms']])
df['Rooms_mm']

0        0.066667
1        0.133333
2        0.066667
3        0.066667
4        0.133333
           ...   
34852    0.133333
34853    0.200000
34854    0.200000
34855    0.133333
34856    0.200000
Name: Rooms_mm, Length: 34857, dtype: float64
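
`MinMaxScaler` maps each value to (x - min) / (max - min), so the result lies between 0 and 1. The outputs above (2 rooms gives 0.066667, 3 rooms gives 0.133333) are consistent with a minimum of 1 and a maximum of 16 in 'Rooms'. That range is inferred from the printed values, not stated in the notebook. The sketch below uses made-up data with the same range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values spanning 1 to 16, the range inferred from the output above.
rooms = np.array([[1.0], [2.0], [3.0], [4.0], [16.0]])

mm = MinMaxScaler()
scaled = mm.fit_transform(rooms)

# Same result by hand: (x - min) / (max - min)
manual = (rooms - rooms.min()) / (rooms.max() - rooms.min())

# inverse_transform maps scaled values back to the original units.
restored = mm.inverse_transform(scaled)

print(scaled.ravel())
```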

### 4. Log Transformation

In [91]:
from sklearn.preprocessing import FunctionTransformer

log = FunctionTransformer(np.log)

df['Rooms_log'] = log.fit_transform(df[['Rooms']])
df['Rooms_log']

0        0.693147
1        1.098612
2        0.693147
3        0.693147
4        1.098612
           ...   
34852    1.098612
34853    1.386294
34854    1.386294
34855    1.098612
34856    1.386294
Name: Rooms_log, Length: 34857, dtype: float64
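
`np.log` works here because the 'Rooms' values shown are all positive. It returns `-inf` for zero and `nan` for negative numbers. For columns that can contain zeros, `np.log1p`, which computes log(1 + x), is a common substitute. The sketch below uses made-up, right-skewed values and an optional `inverse_func` so the transformation can be undone.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Made-up, right-skewed values that include a zero.
landsize = np.array([[0.0], [120.0], [650.0], [43000.0]])

# log1p computes log(1 + x), so zero maps to 0 instead of -inf.
# expm1 computes exp(x) - 1 and is its exact inverse.
log1p = FunctionTransformer(np.log1p, inverse_func=np.expm1)

transformed = log1p.fit_transform(landsize)
restored = log1p.inverse_transform(transformed)

print(transformed.ravel())
```

The log compresses large values far more than small ones, which is why it is often applied to right-skewed columns.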

### 5. Exponential Transformation

In [95]:
# from sklearn.preprocessing import FunctionTransformer

exp = FunctionTransformer(np.exp)

df['Rooms_exp'] = exp.fit_transform(df[['Rooms']])
df['Rooms_exp']

0         7.389056
1        20.085537
2         7.389056
3         7.389056
4        20.085537
           ...    
34852    20.085537
34853    54.598150
34854    54.598150
34855    20.085537
34856    54.598150
Name: Rooms_exp, Length: 34857, dtype: float64
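
The exponential is the inverse of the natural log, so it is often used to map log-scale values back to their original units. It also grows very quickly, which the sketch below illustrates with made-up inputs, one of them deliberately large enough to overflow.

```python
import numpy as np

rooms = np.array([2.0, 3.0, 4.0])

# exp undoes log: exp(log(x)) equals x up to floating-point error.
round_trip = np.exp(np.log(rooms))

# exp grows quickly; inputs above roughly 709 overflow float64 to inf.
with np.errstate(over="ignore"):
    big = np.exp(np.array([16.0, 710.0]))

print(round_trip)
print(big)
```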