# Classification
## Use K-Nearest Neighbors on Airbnb [data](https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data?resource=download)
- The data file is already downloaded to `data/AB_NYC_2019.csv`. Load it into a pandas DataFrame.
- The purpose of this exercise is to use the K-Nearest-Neighbors algorithm for binary classification: estimating whether the price of a specific Airbnb accommodation is above or below the median.
- First we will try it with only 2 features: longitude and latitude.
- Next we will see whether using more features improves accuracy.
- As independent variables we have location, neighbourhood and the number of reviews the accommodation has on Airbnb.
1. Use the following imports:
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import auc, roc_curve, confusion_matrix
```
2. Load the data into a pandas DataFrame.
3. Add a column "is_cheap" to the DataFrame that contains boolean values for whether the price is below the median. Hint: DataFrame has a `median()` method. This column contains our target data: y.
4. Create a classifier model with `KNeighborsClassifier()` and give it an arbitrary positive integer for the `n_neighbors` argument.
5. Create the input data X as a DataFrame containing only longitude and latitude.
5. Based on X and y above, split the data into training and test sets using `train_test_split()` with 33% test data.
6. Fit the model with the training data. Hint: `knn_class.fit(X_train, y_train)`
7. Make predictions with the test data. Hint: `knn_class.predict(X_test)`
8. Now we have our target and our predictions, and we need to compare them to see how well the model has done. For this we can use the `roc_curve` function like this: `fpr, tpr, _ = roc_curve(y_test, y_pred, pos_label=True)`, where `pos_label` tells the function which label counts as the positive class (here `True`, since our target column is boolean). This gives us the True Positive Rate (TPR) and the False Positive Rate (FPR). A ROC curve plots the fraction of true positives out of all positives (TPR) against the fraction of false positives out of all negatives (FPR) at various threshold settings. Finally, we use the `auc(fpr, tpr)` function to get an AUC score. This score is 1 when the model predicts everything correctly and lower when it does not; 0.5 corresponds to random guessing. The result should be around `.7`, which is not a great score, but it's a start, and we can try to improve it by adding more features to the model.
Study: [ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis. This means that the top left corner of the plot is the “ideal” point - a false positive rate of zero, and a true positive rate of one. This is not very realistic, but it does mean that a larger area under the curve (AUC) is usually better.](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py).
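
To make the threshold idea concrete, here is a minimal sketch on six made-up labels (not Airbnb data). It contrasts hard 0/1 predictions, as `predict()` would return them, with probability-like scores such as `predict_proba()` would give. The arrays `y_true`, `y_hard` and `y_score` are illustrative assumptions. With hard labels the curve has a single interior point, so the AUC reduces to the average of TPR and 1 - FPR, while scores yield one point per distinct threshold.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Hypothetical toy labels and predictions, not taken from the Airbnb data.
y_true = np.array([True, True, False, True, False, False])
y_hard = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0])    # hard predictions cast to 0/1
y_score = np.array([0.9, 0.4, 0.3, 0.8, 0.6, 0.1])   # probability-like scores

# Hard predictions: only one real threshold, so the curve has three points.
fpr_hard, tpr_hard, _ = roc_curve(y_true, y_hard, pos_label=True)
auc_hard = auc(fpr_hard, tpr_hard)

# Scores: one point per distinct threshold, so the curve is finer grained.
fpr_soft, tpr_soft, _ = roc_curve(y_true, y_score, pos_label=True)
auc_soft = auc(fpr_soft, tpr_soft)

print(f"AUC from hard labels: {auc_hard:.3f}")
print(f"AUC from scores:      {auc_soft:.3f}")
```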
9. Now let's add some more columns from the DataFrame:
    1. First we need to one-hot encode the data of 3 columns: `['neighbourhood', 'neighbourhood_group', 'room_type']`. Hint: use the pandas `get_dummies` function (see the example in the clustering with Titanic notebook).
    2. With these new columns in the DataFrame, do the `train_test_split` operation again to get 33% test data and 67% training data for both the input data X and the target/labels y.
    3. Normalize both training and test data with [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html). Hint: `scaler = StandardScaler().fit(X_train[independent_variables])`, where `independent_variables` is a list of all the columns we want to use in the model. There are many, so a quick way to get the names of the one-hot encoded columns is a list comprehension like this: `[col for col in df if col.startswith('neighbourhood') or col.startswith('room_type')]`. Then just add the 'latitude', 'longitude', 'number_of_reviews' and 'reviews_per_month' columns.
    4. Now get the normalized training data with something like `X_train_norm = np.nan_to_num(scaler.transform(X_train[independent_variables]))`, where `np.nan_to_num()` replaces NaN values with zeros.
    5. Do the same with the test data.
    6. Now create a `KNeighborsClassifier` model like last time and fit it with the training data and the training targets.
    7. Get predictions on the test data and produce the AUC score like last time. Has it improved?
    8. When we create our `KNeighborsClassifier` model, we can try different numbers of neighbors and different ways of measuring the distance between neighbors, like this: `KNeighborsClassifier(n_neighbors=k, metric=dist)`. [These are the different available methods for measuring distance](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics). Now create a function that takes k and dist (as shown above) and prints an AUC score based on the data used above and on the 2 arguments.
    9. Run the function with all combinations of `n_neighbors` values of 2, 4, 8, 32, 64 and `metric` values of 'manhattan', 'euclidean', 'haversine', 'cosine'. Note that 'haversine' is only defined for 2-dimensional input (latitude and longitude in radians), so expect it to raise an error on the full feature set.
    10. Are there any noticeable differences?
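
The sub-steps of step 9 can be sketched end to end as follows. This is a rough illustration on a synthetic frame, not a reference solution: the values, the made-up `price` formula and the helper name `knn_auc` are assumptions, and only the column names mirror the Airbnb file. The 'haversine' metric is left out because it only accepts 2-dimensional input.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Airbnb frame (hypothetical values, real column names).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "latitude": rng.normal(40.73, 0.05, n),
    "longitude": rng.normal(-73.95, 0.05, n),
    "number_of_reviews": rng.integers(0, 300, n),
    "reviews_per_month": rng.uniform(0, 5, n),
    "neighbourhood_group": rng.choice(["Brooklyn", "Manhattan", "Queens"], n),
    "room_type": rng.choice(["Private room", "Entire home/apt", "Shared room"], n),
})
# Listings without reviews have a missing reviews_per_month in the real data.
df.loc[df["number_of_reviews"] == 0, "reviews_per_month"] = np.nan
# Made-up price so that room type and location carry some signal.
df["price"] = (
    60
    + 80 * (df["room_type"] == "Entire home/apt")
    + 40 * (df["neighbourhood_group"] == "Manhattan")
    + rng.normal(0, 25, n)
)
df["is_cheap"] = df["price"] < df["price"].median()

# One-hot encode the categorical columns as 0.0/1.0 floats.
df = pd.get_dummies(df, columns=["neighbourhood_group", "room_type"], dtype=float)
independent_variables = [
    col for col in df
    if col.startswith("neighbourhood") or col.startswith("room_type")
] + ["latitude", "longitude", "number_of_reviews", "reviews_per_month"]

X_train, X_test, y_train, y_test = train_test_split(
    df[independent_variables], df["is_cheap"], test_size=0.33, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_norm = np.nan_to_num(scaler.transform(X_train))
X_test_norm = np.nan_to_num(scaler.transform(X_test))


def knn_auc(k, dist):
    """Fit a KNN classifier with k neighbours and metric dist; return the test AUC."""
    model = KNeighborsClassifier(n_neighbors=k, metric=dist)
    model.fit(X_train_norm, y_train)
    y_pred = model.predict(X_test_norm)
    fpr, tpr, _ = roc_curve(y_test, y_pred.astype(float), pos_label=True)
    return auc(fpr, tpr)


scores = {}
for dist in ["manhattan", "euclidean", "cosine"]:
    for k in [2, 4, 8, 32, 64]:
        scores[(k, dist)] = knn_auc(k, dist)
        print(f"k={k:<3} metric={dist:<10} AUC={scores[(k, dist)]:.3f}")
```

Fitting the scaler on the training split alone keeps information from the test rows out of the model, which is why `fit` and `transform` are separate calls here.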
    
## Part 2 Neural Network



## 1. Use the following imports:

In [12]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import auc, roc_curve, confusion_matrix

## 2. Get the data into a pandas dataframe

In [25]:
# read_csv already returns a DataFrame, so no pd.DataFrame() wrapper is needed
df = pd.read_csv('./my_data/AB_NYC_2019.csv')

df

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.10,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2


## 3. Add a column to the dataframe: "is_cheap", that contains boolean values for the price being below median. Hint: DataFrame has a median() method. This column contains our target data: y

In [24]:
# Compare each price with the median so the column holds booleans, not the median itself
median_price = df['price'].median()
df['is_cheap'] = df['price'] < median_price

df

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,is_cheap
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365,False
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355,False
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,,,1,365,False
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194,True
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.10,1,0,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9,True
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36,True
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27,False
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2,True


## 4. Create a Classifier model with KNeighborsClassifier() and give it an arbitrary number for the n_neighbors argument

In [47]:
# n_neighbors must be a positive integer; 5 is an arbitrary choice
knn_class = KNeighborsClassifier(n_neighbors=5)
knn_class

## 5. Create input data: X as a DataFrame containing only longitude and latitude.

In [45]:
# Double brackets select several columns and return a DataFrame
X = df[['longitude', 'latitude']]
X
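
The exercise also asks for a 33% test split between creating X and fitting the model. A minimal sketch of that step, assuming a hypothetical six-row frame that reuses the Airbnb column names (the values and the `random_state` seed are illustrative choices, not part of the exercise):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical mini-frame with the same column names as the Airbnb data.
df = pd.DataFrame({
    "longitude": [-73.97, -73.98, -73.94, -73.96, -73.94, -73.95],
    "latitude": [40.65, 40.75, 40.81, 40.69, 40.80, 40.68],
    "is_cheap": [False, False, False, True, True, True],
})

X = df[["longitude", "latitude"]]   # features as a DataFrame
y = df["is_cheap"]                  # boolean target as a Series

# test_size=0.33 holds out roughly a third of the rows; random_state is an
# arbitrary seed added here so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
print(X_train.shape, X_test.shape)
```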

## 6. Fit the model with the training data. Hint: `knn_class.fit(X_train, y_train)`
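
One way this step might look, sketched on synthetic coordinates rather than the Airbnb data. The labelling rule, the 10% label noise and `n_neighbors=5` are assumptions made for illustration. `confusion_matrix` from the imports above is used to summarise the predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic coordinates and labels (hypothetical, not the Airbnb data):
# points west of the centre line are labelled cheap, with some labels flipped.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(-73.95, 0.05, 300),   # longitude
    rng.normal(40.73, 0.05, 300),    # latitude
])
y = X[:, 0] < -73.95
flip = rng.random(300) < 0.1         # flip about 10% of the labels
y = np.where(flip, ~y, y)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

knn_class = KNeighborsClassifier(n_neighbors=5)   # k=5 is an arbitrary choice
knn_class.fit(X_train, y_train)
y_pred = knn_class.predict(X_test)

# Rows are true classes, columns are predicted classes, ordered [False, True].
cm = confusion_matrix(y_test, y_pred, labels=[False, True])
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(f"accuracy: {accuracy:.3f}")
```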