# Classification
## Use K-Nearest Neighbours on Airbnb [data](https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data?resource=download)
- The data file is already downloaded to: data/AB_NYC_2019.csv. Load it into a pandas DataFrame.
- The purpose of this exercise is to use the K-Nearest Neighbours algorithm for binary classification: estimate whether the price of a specific Airbnb accommodation is above or below the median.
- First we will try to do it based on only 2 features: longitude and latitude.
- Next we will see if we can improve accuracy by using more features.
- As independent variables, we have location, neighbourhood, room type and the number of reviews the accommodation has on Airbnb.
1. Use the following imports:
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import auc, roc_curve, confusion_matrix
```
2. Get the data into a pandas dataframe
3. Add a column to the dataframe: "is_cheap", that contains boolean values for the price being below median. Hint: DataFrame has a median() method. This column contains our target data: y
4. Create a Classifier model with `KNeighborsClassifier()` and give it an arbitrary number for the n_neighbors argument
5. Create input data: X as a DataFrame containing only longitude and latitude.
6. Based on X and y above, split the data into training and test data using the `train_test_split()` function with 33% test data.
7. Fit the model with the training data. Hint: `knn_class.fit(X_train, y_train)`
8. Make predictions with the test data. Hint: `knn_class.predict(X_test)`
9. Now we have our targets and our predictions, and we need to compare them to see how well our model has done. For this we can use the `roc_curve` function like this: `fpr, tpr, _ = roc_curve(y_test, y_pred, pos_label=True)`, where `pos_label=True` tells the function which class counts as positive (our target column is boolean). This gives us the true positive rate (TPR) and the false positive rate (FPR). A ROC curve plots the fraction of true positives out of the positives (TPR) against the fraction of false positives out of the negatives (FPR) at various threshold settings. Finally we use the `auc(fpr, tpr)` function to get an AUC score. This score is 1 when the model makes 100% correct predictions and less than 1 otherwise. The result should be around `.7`, which is not a great score, but it's a start and we can try to improve it by adding more features to the model.
Study: [ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis. This means that the top left corner of the plot is the “ideal” point - a false positive rate of zero, and a true positive rate of one. This is not very realistic, but it does mean that a larger area under the curve (AUC) is usually better.](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py).
10. Now let's add some more columns from the dataframe:
    1. First we need to one-hot encode the data of 3 columns: ['neighbourhood','neighbourhood_group','room_type']. Hint: Use the pandas `get_dummies` function (see the example in the clustering with titanic notebook).
    2. With these new columns in the dataframe, do the train_test_split operation again to get 33% test data and 67% training data for both the input data X and the target/labels y.
    3. Normalize (standardize) both training and test data with [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html). Hint: `StandardScaler().fit(X_train[independent_variables])` where `independent_variables` is a list of all the columns we want to use in the model. There are many, so a quick way to get the names of the one-hot encoded columns is a list comprehension like this: `[col for col in df if col.startswith('neighbourhood') or col.startswith('room_type')]`. Then just add the 'latitude', 'longitude', 'number_of_reviews' and 'reviews_per_month' columns.
    4. Now get the normalized training data with something like: `X_train_norm = np.nan_to_num(scaler.transform(X_train[independent_variables]))` where `np.nan_to_num()` is used to replace NaN values with zeros.
    5. Do the same with the test data, reusing the scaler that was fitted on the training data.
    6. Now create a `KNeighborsClassifier` model like last time and fit it with the training data and the training targets.
    7. Get predictions on the test data and produce the AUC score like last time. Has it improved?
    8. When we create our `KNeighborsClassifier` model, we can try it out with different numbers of neighbours and with different ways of measuring the distance between neighbours, like this: `KNeighborsClassifier(n_neighbors=k, metric=dist)`. [These are the different available methods for measuring distance.](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics). Now create a function that takes k and dist (as shown above) and prints an AUC score based on the data we used above and on the 2 arguments.
    9. Run the function with all combinations of `n_neighbors` values of 2, 4, 8, 32, 64 and with metric values of 'manhattan', 'euclidean', 'haversine', 'cosine'. Note that 'haversine' only works on 2-dimensional input (latitude and longitude, in radians), so it cannot be used with the full feature set.
    10. Are there any noticeable differences?
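
The last steps above (a function of `k` and `dist` that reports an AUC score, run over a grid of values) can be sketched as follows. This is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the Airbnb features, and the names `knn_auc` and `results` are hypothetical, not part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data stands in for the standardized Airbnb features.
X_demo, y_demo = make_classification(n_samples=600, n_features=6, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
scaler = StandardScaler().fit(Xtr)            # fit on the training data only
Xtr_s, Xte_s = scaler.transform(Xtr), scaler.transform(Xte)


def knn_auc(k, dist, X_train, y_train, X_test, y_test):
    """Fit a k-NN classifier and return the AUC of its hard test predictions."""
    model = KNeighborsClassifier(n_neighbors=k, metric=dist)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    fpr, tpr, _ = roc_curve(y_test, y_pred, pos_label=1)
    return auc(fpr, tpr)


results = {}
for dist in ["manhattan", "euclidean", "cosine"]:
    for k in [2, 4, 8, 32, 64]:
        results[(dist, k)] = knn_auc(k, dist, Xtr_s, ytr, Xte_s, yte)
        print("metric={}, k={}, AUC: {:.2f}".format(dist, k, results[(dist, k)]))
```

Returning the score instead of only printing it makes it easy to collect the results in a dictionary and compare them afterwards.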
    
## Part 2 Neural Network



In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import auc, roc_curve, confusion_matrix

In [2]:
!ls ../data

1.jpg
AB_NYC_2019.csv
aclImdb
aclImdb.tar.gz
adjectives
API_EN.ATM.CO2E.KT_DS2_en_csv_v2_1345584.csv
API_EN.ATM.CO2E.KT_DS2_en_csv_v2_1345584.zip
API_EN.ATM.CO2E.KT_DS2_en_csv_v2_2056082.csv
API_EN.ATM.CO2E.KT_DS2_en_csv_v2_2056082.zip
API_EN.ATM.CO2E.KT_DS2_en_csv_v2_887574.csv
API_MS.MIL.XPND.CN_DS2_en_csv_v2_898165.csv
baboon.jpg
befkbhalderstatkode.csv
befkbh_stat_code.json
bones_in_london.txt
cities.json
commander_keen_sprite_run.gif
country_codes.csv
DKstat_bykoder.csv
donut.json
dronning_tale2021.txt
ds_salaries.csv
eb4182cd41cd737fe19f28afbc5cf286
employees.csv
enrollment_forecast.csv
example.html
example.xlsx
flamenco_contours.png
folkekirkemedlemskab.csv
github_repoes
haarcascade_eye.xml
haarcascade_frontalface_default.xml
housing.csv
installed_packages.txt
iris.csv
iris_data.csv
iris_data.xlsx
jurassic-park-tour-jeep.jpg
keen.gif
loeberuter.json
mare-08.jpg
Metadata_Country_API_EN.ATM.CO2E.KT_DS2_en_csv_v2_1345584.csv
Metadata_Countr

In [3]:
# Reading data
df = pd.read_csv("../data/AB_NYC_2019.csv")
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [4]:
# Transforming categorical variables into binary
# dummies_columns = pd.get_dummies(df[['neighbourhood','neighbourhood_group','room_type']])
# df = pd.concat([df,dummies_columns], axis=1)
# df = df.drop(['neighbourhood','neighbourhood_group','room_type'], axis=1)

In [5]:
# Transforming categorical variables into binary
df = pd.get_dummies(df,columns=['neighbourhood','neighbourhood_group','room_type'])
df.head()

Unnamed: 0,id,name,host_id,host_name,latitude,longitude,price,minimum_nights,number_of_reviews,last_review,...,neighbourhood_Woodrow,neighbourhood_Woodside,neighbourhood_group_Bronx,neighbourhood_group_Brooklyn,neighbourhood_group_Manhattan,neighbourhood_group_Queens,neighbourhood_group_Staten Island,room_type_Entire home/apt,room_type_Private room,room_type_Shared room
0,2539,Clean & quiet apt home by the park,2787,John,40.64749,-73.97237,149,1,9,2018-10-19,...,0,0,0,1,0,0,0,0,1,0
1,2595,Skylit Midtown Castle,2845,Jennifer,40.75362,-73.98377,225,1,45,2019-05-21,...,0,0,0,0,1,0,0,1,0,0
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,40.80902,-73.9419,150,3,0,,...,0,0,0,0,1,0,0,0,1,0
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,40.68514,-73.95976,89,1,270,2019-07-05,...,0,0,0,1,0,0,0,1,0,0
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,40.79851,-73.94399,80,10,9,2018-11-19,...,0,0,0,0,1,0,0,1,0,0
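
To see what `get_dummies` does in the cell above, here is a minimal sketch on a toy frame (the `toy` data is made up for illustration). Passing `dtype=int` is an assumption added here: newer pandas versions return boolean dummy columns by default, while the output above shows 0/1 values.

```python
import pandas as pd

# Hypothetical toy data with one categorical column
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "price": [149, 225, 150],
})

# One new 0/1 column per category; the original column is dropped
encoded = pd.get_dummies(toy, columns=["room_type"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```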


In [6]:
# Making target (y)
y = df['price'] < df['price'].median()
# Train test split
X = df
print('X:\n',X.shape)
print('y:\n',y.head())

X:
 (48895, 242)
y:
 0    False
1    False
2    False
3     True
4     True
Name: price, dtype: bool
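
A strict `<` comparison against the median does not necessarily split the data 50/50 when several listings are priced exactly at the median. The following sketch uses made-up prices (not the Airbnb data) to show how the class balance can be checked with `mean()` on the boolean target.

```python
import pandas as pd

# Hypothetical prices; three of them are exactly at the median
prices = pd.Series([80, 89, 100, 100, 100, 149, 150, 225], name="price")
median_price = prices.median()

is_cheap_strict = prices < median_price       # same comparison as in the cell above
is_cheap_inclusive = prices <= median_price   # alternative that counts ties as cheap

share_strict = is_cheap_strict.mean()         # fraction of True values
share_inclusive = is_cheap_inclusive.mean()
print(median_price, share_strict, share_inclusive)
```

On the real data, `y.mean()` or `y.value_counts(normalize=True)` gives the same kind of check.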


In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [8]:
# Creating a model with 60 nearest neighbours
knn_class = KNeighborsClassifier(n_neighbors=60)

# Fitting the model on only the 2 features latitude and longitude
knn_class.fit(X_train[['latitude','longitude']], y_train)

# Making predictions
y_pred = knn_class.predict(X_test[['latitude','longitude']]) 

# Evaluation
fpr, tpr, thresholds = roc_curve(y_test, y_pred, pos_label=True)
# ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. 
# It is created by plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false positives out of the negatives (FPR = false positive rate), at various threshold settings
auc_score = round(auc(fpr, tpr), 4)
# threshold = 0.5
print("AUC: {}".format(auc_score))
# AUC is a quality metric for classification tasks, and the closer it is to 1, the better. An AUC of 0.72 is alright, but not good


AUC: 0.7189
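
The cell above passes hard True/False predictions to `roc_curve`, so the curve has only a single operating point between the two corners. A common alternative is to pass the predicted probability of the positive class from `predict_proba`, which lets `roc_curve` sweep over many thresholds. The sketch below illustrates the difference on synthetic data (an assumption, not the Airbnb data); all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: two synthetic features stand in for latitude and longitude
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)

knn = KNeighborsClassifier(n_neighbors=60).fit(Xtr, ytr)

# Variant 1: hard labels, as in the cell above (few thresholds)
fpr_h, tpr_h, thr_h = roc_curve(yte, knn.predict(Xte), pos_label=1)
auc_hard = auc(fpr_h, tpr_h)

# Variant 2: probability of the positive class (column 1 matches knn.classes_[1])
scores = knn.predict_proba(Xte)[:, 1]
fpr_s, tpr_s, thr_s = roc_curve(yte, scores, pos_label=1)
auc_scores = auc(fpr_s, tpr_s)

print("thresholds with hard labels:", len(thr_h))
print("thresholds with scores:", len(thr_s))
print("AUC hard: {:.4f}, AUC scores: {:.4f}".format(auc_hard, auc_scores))
```

`roc_auc_score(yte, scores)` computes the same area as the second variant in one call.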


In [11]:
# Get all the independent variables for the model
categorical_variables = [col for col in df if col.startswith('neighbourhood') or col.startswith('room_type')]
numeric_variables = ['latitude', 'longitude','number_of_reviews','reviews_per_month']
independent_variables = categorical_variables+numeric_variables # All 233 columns of numeric data
# print(len(independent_variables))

# Standardize the training and test data
scaler = StandardScaler().fit(X_train[independent_variables])
X_train_norm = np.nan_to_num(scaler.transform(X_train[independent_variables])) # Standardize and change NAN to 0
X_test_norm = np.nan_to_num(scaler.transform(X_test[independent_variables]))

# Fit the model
neigh = KNeighborsClassifier(n_neighbors=10)
neigh.fit(X_train_norm, y_train)
y_pred = neigh.predict(X_test_norm)
# Evaluation
fpr, tpr, thresholds = roc_curve(y_test, y_pred, pos_label=True)
auc_score = round(auc(fpr, tpr), 2)
print("AUC: {}".format(auc_score))

AUC: 0.83
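
A note on the `np.nan_to_num` step in the cell above: `StandardScaler` ignores NaN values when it computes the mean and standard deviation and leaves them as NaN in `transform`, so replacing them with 0 afterwards is equivalent to filling them with the column mean. The sketch below checks this on a made-up column; whether the mean is a sensible fill value for listings without reviews is a modelling choice.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical reviews_per_month column with one missing value
X_toy = np.array([[0.21], [0.38], [np.nan], [4.64], [0.10]])

scaler = StandardScaler().fit(X_toy)          # NaN is disregarded in fit

# Approach from the cell above: scale first, then turn NaN into 0
scaled_then_zero = np.nan_to_num(scaler.transform(X_toy))

# Equivalent: fill NaN with the column mean, then scale with the same scaler
col_mean = np.nanmean(X_toy, axis=0)
imputed = np.where(np.isnan(X_toy), col_mean, X_toy)
imputed_then_scaled = scaler.transform(imputed)

same = np.allclose(scaled_then_zero, imputed_then_scaled)
print(same)
```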


In [10]:
for dist in ['manhattan', 'euclidean','cosine']: # 'haversine' distance can only be used with 2 dimensions.
    print("Distance metric: {}".format(dist))
    for k in [2, 4, 10, 50, 100, 500]:
        # Fit & predict
        neigh = KNeighborsClassifier(n_neighbors=k, metric=dist)
        neigh.fit(X_train_norm, y_train)
        y_pred = neigh.predict(X_test_norm)
        # Evaluation
        fpr, tpr, thresholds = roc_curve(y_test, y_pred, pos_label=True)
        auc_score = round(auc(fpr, tpr), 2)
        print("For K = {}, AUC: {}".format(k, auc_score))

Distance metric: manhattan
For K = 2, AUC: 0.77
For K = 4, AUC: 0.8
For K = 10, AUC: 0.83
For K = 50, AUC: 0.82
For K = 100, AUC: 0.82
For K = 500, AUC: 0.81
Distance metric: euclidean
For K = 2, AUC: 0.77
For K = 4, AUC: 0.8
For K = 10, AUC: 0.83
For K = 50, AUC: 0.82
For K = 100, AUC: 0.82
For K = 500, AUC: 0.81
Distance metric: cosine
For K = 2, AUC: 0.77
For K = 4, AUC: 0.8
For K = 10, AUC: 0.83
For K = 50, AUC: 0.82
For K = 100, AUC: 0.82
For K = 500, AUC: 0.81
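
The 'haversine' metric was left out of the loop above because it only accepts two features. It can still be tried on latitude and longitude alone, provided the coordinates are converted to radians in (latitude, longitude) order. The following is a minimal sketch on randomly generated coordinates with a made-up target; `algorithm="ball_tree"` is set explicitly here as an assumption to make sure a haversine-capable search structure is used.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical coordinates roughly in the New York City area
rng = np.random.default_rng(0)
lat = rng.uniform(40.5, 40.9, size=300)
lon = rng.uniform(-74.2, -73.7, size=300)
labels = lat > 40.7                            # made-up target for illustration

# Haversine expects [latitude, longitude] in radians
coords_rad = np.radians(np.column_stack([lat, lon]))

knn_geo = KNeighborsClassifier(n_neighbors=8, metric="haversine", algorithm="ball_tree")
knn_geo.fit(coords_rad[:200], labels[:200])
pred = knn_geo.predict(coords_rad[200:])

accuracy = (pred == labels[200:]).mean()
print("accuracy: {:.2f}".format(accuracy))
```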
