# Practice for Lesson 1 - Intro To Machine Learning
### Topics covered:
- MinMax Scaling
- Standard Scaling
- Label Encoding
- One Hot Encoding

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('airbnb_ny.csv')

In [4]:
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [6]:
g_df = df[['host_id', 'neighbourhood_group', 'room_type', 'price', 'minimum_nights']]
g_df.head()

Unnamed: 0,host_id,neighbourhood_group,room_type,price,minimum_nights
0,2787,Brooklyn,Private room,149,1
1,2845,Manhattan,Entire home/apt,225,1
2,4632,Manhattan,Private room,150,3
3,4869,Brooklyn,Entire home/apt,89,1
4,7192,Manhattan,Entire home/apt,80,10


# Data Scaling

## Let's say we want to use price and minimum nights to build a regression model. We should scale those features first so they sit on comparable ranges, which makes the learned weights easier to fit and compare!

There are 2 ways to do this:
- MinMax Scaler
- Standard Scaler

When building a regression model, it's good to try both of these to see which one works best.
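One way to "try both" is to put each scaler in a pipeline and compare cross-validated scores. The sketch below uses synthetic data rather than the Airbnb set, and the choice of `Ridge` is an assumption: a regularized model is used because plain least squares gives the same predictions under either scaler, so there would be nothing to compare.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in data (NOT the Airbnb set): two features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 10_000, 200), rng.uniform(1, 30, 200)])
y = 0.01 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 1, 200)

scores = {}
for name, scaler in [("minmax", MinMaxScaler()), ("standard", StandardScaler())]:
    # The pipeline refits the scaler inside each CV fold, so held-out rows never influence it
    model = make_pipeline(scaler, Ridge(alpha=1.0))
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

print(scores)
```

Whichever scaler gives the better mean score on your own data is the one to keep; the numbers from this toy setup say nothing about the Airbnb columns.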


### MinMax Scaler:

#### Manual Way

In [19]:
def min_max_scaler(input_X):
    # Rescale each column to [0, 1]; assumes no column is constant (max > min)
    X = np.copy(input_X)
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
    return X
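The manual formula divides by `max - min`, so a column where every value is the same would divide by zero. A minimal sketch of one way to guard against that, assuming a 2-D input; `safe_min_max_scaler` is a hypothetical helper name, not part of the lesson:

```python
import numpy as np

def safe_min_max_scaler(input_X):
    """Min-max scale each column of a 2-D array, leaving constant columns at 0."""
    X = np.asarray(input_X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # A constant column has range 0; dividing by 1 instead keeps it at 0 rather than NaN
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

out = safe_min_max_scaler([[1, 5], [3, 5], [5, 5]])
print(out)
```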

In [22]:
a = g_df[['price', 'minimum_nights']].to_numpy()
print(f'Values BEFORE min max scaler: \n{a}')
print(f'Values AFTER min max scaler: \n{min_max_scaler(a)}')

Values BEFORE min max scaler: 
[[149   1]
 [225   1]
 [150   3]
 ...
 [115  10]
 [ 55   1]
 [ 90   7]]
Values AFTER min max scaler: 
[[0.0149     0.        ]
 [0.0225     0.        ]
 [0.015      0.00160128]
 ...
 [0.0115     0.00720576]
 [0.0055     0.        ]
 [0.009      0.00480384]]


#### Using Sklearn

In [24]:
from sklearn import preprocessing

# Named sk_min_max_scaler so it does not shadow the manual min_max_scaler function above
sk_min_max_scaler = preprocessing.MinMaxScaler()

a = g_df[['price', 'minimum_nights']].to_numpy()
print(f'Values BEFORE min max scaler: \n{a}')
print(f'Values AFTER min max scaler: \n{sk_min_max_scaler.fit_transform(a)}')

Values BEFORE min max scaler: 
[[149   1]
 [225   1]
 [150   3]
 ...
 [115  10]
 [ 55   1]
 [ 90   7]]
Values AFTER min max scaler: 
[[0.0149     0.        ]
 [0.0225     0.        ]
 [0.015      0.00160128]
 ...
 [0.0115     0.00720576]
 [0.0055     0.        ]
 [0.009      0.00480384]]


### Standard Scaler

#### Manual Way

In [25]:
def standard_scaler(input_X):
    # Shift each column to mean 0 and divide by its population std (ddof=0), as StandardScaler does
    X = np.copy(input_X)
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    return X

In [26]:
a = g_df[['price', 'minimum_nights']].to_numpy()
print(f'Values BEFORE standard scaler: \n{a}')
print(f'Values AFTER standard scaler: \n{standard_scaler(a)}')

Values BEFORE standard scaler: 
[[149   1]
 [225   1]
 [150   3]
 ...
 [115  10]
 [ 55   1]
 [ 90   7]]
Values AFTER standard scaler: 
[[-0.01549307 -0.29399621]
 [ 0.30097355 -0.29399621]
 [-0.01132904 -0.19648442]
 ...
 [-0.15707024  0.14480686]
 [-0.4069123  -0.29399621]
 [-0.2611711  -0.00146083]]


#### Using Sklearn

In [28]:
from sklearn import preprocessing

# Named sk_standard_scaler so it does not shadow the manual standard_scaler function above
sk_standard_scaler = preprocessing.StandardScaler()

a = g_df[['price', 'minimum_nights']].to_numpy()
print(f'Values BEFORE standard scaler: \n{a}')
print(f'Values AFTER standard scaler: \n{sk_standard_scaler.fit_transform(a)}')

Values BEFORE standard scaler: 
[[149   1]
 [225   1]
 [150   3]
 ...
 [115  10]
 [ 55   1]
 [ 90   7]]
Values AFTER standard scaler: 
[[-0.01549307 -0.29399621]
 [ 0.30097355 -0.29399621]
 [-0.01132904 -0.19648442]
 ...
 [-0.15707024  0.14480686]
 [-0.4069123  -0.29399621]
 [-0.2611711  -0.00146083]]


## As you can see, both the manual way and sklearn produced the same results! 
- MinMax scaler maps each column onto the range 0 to 1
- Standard scaler shifts each column to mean 0 and standard deviation 1; the values are not confined to -1 to 1
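A quick check of what standard scaling does and does not guarantee, on toy numbers that are not from the dataset: the result has mean 0 and standard deviation 1, but individual values can fall well outside -1 to 1 when the column contains an outlier.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy column with one large value
toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
z = StandardScaler().fit_transform(toy)

print(z.mean(axis=0), z.std(axis=0))  # approximately 0 and 1
print(z.min(), z.max())               # the outlier lands well above 1
```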

# Converting non-numerical values to numerical

## Let's say that we have values that are non-numeric but we want to use them when we build our models. In order to use them, we have to convert them to numerical values!

There are also two ways to do this:

- Label Encoding
- One Hot Encoding

These two do slightly different things, as you will see.

### Label Encoding

In [29]:
g_df.head()

Unnamed: 0,host_id,neighbourhood_group,room_type,price,minimum_nights
0,2787,Brooklyn,Private room,149,1
1,2845,Manhattan,Entire home/apt,225,1
2,4632,Manhattan,Private room,150,3
3,4869,Brooklyn,Entire home/apt,89,1
4,7192,Manhattan,Entire home/apt,80,10


In [33]:
g_df['neighbourhood_group'].value_counts()

Manhattan        21661
Brooklyn         20104
Queens            5666
Bronx             1091
Staten Island      373
Name: neighbourhood_group, dtype: int64

In [34]:
g_df['room_type'].value_counts()

Entire home/apt    25409
Private room       22326
Shared room         1160
Name: room_type, dtype: int64

#### So we have 5 different types of neighborhood groups, and 3 different room types. Let's convert them into numbers!

In [43]:
from sklearn.preprocessing import LabelEncoder

le_1 = LabelEncoder()
le_2 = LabelEncoder()

enc_df = g_df.copy()
enc_df['room_type'] = le_1.fit_transform(enc_df['room_type'])
enc_df['neighbourhood_group'] = le_2.fit_transform(enc_df['neighbourhood_group'])

room_type_map = dict(zip(le_1.classes_, le_1.transform(le_1.classes_)))
neighbourhood_group_map = dict(zip(le_2.classes_, le_2.transform(le_2.classes_)))

print(f'Map for room type: {room_type_map}')
print(f'Map for neighbourhood group: {neighbourhood_group_map}')

Map for room type: {'Entire home/apt': 0, 'Private room': 1, 'Shared room': 2}
Map for neighbourhood group: {'Bronx': 0, 'Brooklyn': 1, 'Manhattan': 2, 'Queens': 3, 'Staten Island': 4}


In [44]:
enc_df.head()

Unnamed: 0,host_id,neighbourhood_group,room_type,price,minimum_nights
0,2787,1,1,149,1
1,2845,2,0,225,1
2,4632,2,1,150,3
3,4869,1,0,89,1
4,7192,2,0,80,10


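As you can see, everything is converted to numbers, and we even have mappings from each original non-numerical value to its numerical code! One caveat: the codes imply an order (for example, Bronx = 0 is "less than" Staten Island = 4) that the categories don't actually have, so a model may treat them as quantities.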
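The fitted encoder can also go the other way. A small sketch on a toy list (the values are room types from the dataset, but the list itself is made up) showing that `LabelEncoder` assigns codes in sorted order of the classes and that `inverse_transform` recovers the original strings:

```python
from sklearn.preprocessing import LabelEncoder

rooms = ["Private room", "Entire home/apt", "Private room", "Shared room"]

le = LabelEncoder()
codes = le.fit_transform(rooms)

print(le.classes_)                  # classes are stored in sorted order
print(codes)                        # [1 0 1 2]
print(le.inverse_transform(codes))  # back to the original strings
```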

### One Hot Encoding

In [45]:
g_df.head()

Unnamed: 0,host_id,neighbourhood_group,room_type,price,minimum_nights
0,2787,Brooklyn,Private room,149,1
1,2845,Manhattan,Entire home/apt,225,1
2,4632,Manhattan,Private room,150,3
3,4869,Brooklyn,Entire home/apt,89,1
4,7192,Manhattan,Entire home/apt,80,10


In [46]:
enc_df = pd.get_dummies(g_df)
enc_df.head()

Unnamed: 0,host_id,price,minimum_nights,neighbourhood_group_Bronx,neighbourhood_group_Brooklyn,neighbourhood_group_Manhattan,neighbourhood_group_Queens,neighbourhood_group_Staten Island,room_type_Entire home/apt,room_type_Private room,room_type_Shared room
0,2787,149,1,0,1,0,0,0,0,1,0
1,2845,225,1,0,0,1,0,0,1,0,0
2,4632,150,3,0,0,1,0,0,0,1,0
3,4869,89,1,0,1,0,0,0,1,0,0
4,7192,80,10,0,0,1,0,0,1,0,0


### What happened here??

Well, pandas automatically found the non-numerical columns and created a separate column for each unique value in them, putting 1 in rows that have that value and 0 in rows that don't! Numeric columns such as host_id and price were left as they were.
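If you would rather choose the columns yourself than let pandas pick every non-numerical one, `pd.get_dummies` accepts a `columns` argument. The sketch below uses a made-up three-row frame; `dtype=int` is passed because pandas 2.0 and later return True/False by default instead of the 1/0 shown above.

```python
import pandas as pd

# Toy frame (not the full dataset)
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
    "price": [149, 225, 80],
})

# Encode only room_type; price is passed through unchanged
encoded = pd.get_dummies(toy, columns=["room_type"], dtype=int)
print(encoded)
```

Each row has exactly one 1 across the new room_type columns, which is where the name "one hot" comes from.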

## That's it for now, might come back later to explore this dataset more! 