# Machine Learning


## Student
* Student Name:
* Student ID:

### Rules
1. This is a take-home exam.
2. You are expected to complete and submit the exam by 27 - May - 2020, 15:00 GMT
   + No late submissions will be allowed.
   + Submit your exam through canvas.
3. The exam is in English and you are expected to complete it in English.
4. The exam requires coding and you are expected to code in Python 3.0.
5. The exam is worth 100 points, and it will constitute 50% of your grade.
6. To pass the course you need to obtain at least 50 points in this exam (and a 5.5 overall course grade)
7. The exam describes a problem you will need to solve.
8. This is an open-ended problem, which means that
   + there is no single perfect solution, however 
   + there are still correct and incorrect things one may do, and
   + there are certain best practices to follow.
9. You are expected to complete this exam on your own. 
   + You are not allowed to discuss or share the exam, or share your solutions with anyone else.
   + We run a plagiarism check both against the web and against all submitted exams. 
   + There are different ways to work on this problem. It would really be a coincidence if 2 students choose very similar way of solving it; if that is the case, we will take a closer look and we will invite the students to a short oral examination/discussion over their solutions.
10. We will spend approximately 15 minutes to assess your work. So make sure that everything looks clear, is neatly documented, and that everything runs.
11. You are allowed to import from the following APIs: sklearn, pandas, numpy, matplotlib, scipy, nltk. You are not allowed to use any other API.

### Instructions

1. You are provided with a dataset of Airbnb listings for apartments in Amsterdam. Download the data to the same folder as this ipynb file.
2. Each row in the listings corresponds to an appartment, and each column to an attribute/feature for that appartment.
3. One of the attributes is the *price* of the apartment. Based on the actual price listings are split into two categories, *cheap* and *expensive*, which is the value of the attribute *price*.
4. Your goal is to 
  1. classify listings into cheap and expensive by the means of machine learning, 
  2. discuss the quality of your classifier in terms of appropriate evaluation measures, using tables or plots for comparisons, and 
  3. compare it to a naive classifier.
5. You can use numerical, categorical, and textual features you think can help, or derive features of your own from existing ones,  apply the right treatment to them, and build some simple classifiers over them. You should not use the features *neighborhood*, *longitude* and *latitude* (see point 6.).
6. Two of the features are the *longitude* and the *latitude* of the apartment. Together these two numbers tell you the precise location of a listing in Amsterdam. You should use  these two numbers, but you are not allowed to use them directly. Instead, you are expected to cluster listings, based on these two numbers, into a number of clusters/neightborhoods, and use the neighborhood cluster as a feature.
7. You must demonstrate that you are constructing the best possible classifier, that is you should run all those experiments needed to persuade us that you have achieved optimal performance, using everything that you were taught by this class.
8. You are expected to explain what are you doing at every step and argue about your choices on the basis of the course.

## Exam

In [2]:
import sklearn, pandas, numpy, matplotlib, scipy, nltk

### 1. Load the dataset and make choices on how to treat it (e.g. split, balance, impute, etc.) (5 points)

*Explain what you do and why*

In [73]:
# Your code goes here

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as nr


data = pd.read_csv("exam_listings.csv")  
print(len(data))
data.head()


9408


Unnamed: 0,id,listing_url,name,summary,description,experiences_offered,neighbourhood,latitude,longitude,property_type,...,review_scores_location,review_scores_value,requires_license,jurisdiction_names,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,reviews_per_month
0,2818,https://www.airbnb.com/rooms/2818,Quiet Garden View Room & Super Fast WiFi,Quiet Garden View Room & Super Fast WiFi,Quiet Garden View Room & Super Fast WiFi I'm r...,none,Indische Buurt,52.36575,4.94142,Apartment,...,9.0,10.0,f,"{Amsterdam,"" NL Zip Codes 2"","" Amsterdam"","" NL""}",t,f,strict_14_with_grace_period,f,f,2.06
1,27886,https://www.airbnb.com/rooms/27886,"Romantic, stylish B&B houseboat in canal district",Stylish and romantic houseboat on fantastic hi...,Stylish and romantic houseboat on fantastic hi...,none,Westelijke Eilanden,52.38761,4.89188,Houseboat,...,10.0,10.0,f,"{Amsterdam,"" NL Zip Codes 2"","" Amsterdam"","" NL""}",t,f,strict_14_with_grace_period,f,f,2.15
2,29051,https://www.airbnb.com/rooms/29051,Comfortable single room,because of the city imposing a 4 paying guest ...,because of the city imposing a 4 paying guest ...,none,Amsterdam Centrum,52.36773,4.89151,Apartment,...,10.0,9.0,f,"{Amsterdam,"" NL Zip Codes 2"","" Amsterdam"","" NL""}",f,f,moderate,f,f,4.33
3,42970,https://www.airbnb.com/rooms/42970,Comfortable room@PERFECT location + 2 bikes,A home away from home Great location Including...,A home away from home Great location Including...,none,Amsterdam Centrum,52.36781,4.89001,Bed and breakfast,...,10.0,9.0,f,"{Amsterdam,"" NL Zip Codes 2"","" Amsterdam"","" NL""}",t,f,strict_14_with_grace_period,f,t,4.02
4,48076,https://www.airbnb.com/rooms/48076,Amsterdam Central and lot of space,"third floor apartment two bedrooms,bathroom,te...","third floor apartment two bedrooms,bathroom,te...",none,Grachtengordel,52.38042,4.89453,Bed and breakfast,...,10.0,9.0,f,"{Amsterdam,"" NL Zip Codes 2"","" Amsterdam"","" NL""}",t,f,strict_14_with_grace_period,f,t,2.0


In [105]:
# split data
x = data.loc[:, data.columns != "price"]
y = data.loc[:,data.columns == "price"]

# clean data


In [None]:
from sklearn.preprocessing import  LabelEncoder
x = x.select_dtypes(include=['int', 'float'])
le = LabelEncoder()
x = x.apply(le.fit_transform)
x.head()
y = y.apply(le.fit_transform)

### 2. Numerical Features (15 points)
* Choose the numerical features you wish to use and pre-process them
* Train a logistic regression classifier and demonstrate its performance with the right measures

*Explain what you do and why*

In [99]:
# Your code goes here
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
cols = x.columns
num_cols = x._get_numeric_data().columns
num_cols = x[num_cols]

le = LabelEncoder()
y = y.apply(le.fit_transform)

X_train, X_test, y_train, y_test = train_test_split( num_cols, y, test_size=0.33, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print((np.sqrt(mean_squared_error(y_test, predictions))))

0.6370220572706061


  y = column_or_1d(y, warn=True)


*Explain what you observe and what is your conclusion regarding both classes*

### 3. Categorical Features (15 points)
* Choose the categorical features you wish to use and pre-process them
* Train a logistic regression classifier and demonstrate its performance with the right measures

*Explain what you do and why*

In [112]:
cols = x.columns
num_cols = x._get_numeric_data().columns
num_cols = list(set(cols) - set(num_cols))
num_cols = x[num_cols]
le = LabelEncoder()
num_cols = num_cols.apply(le.fit_transform)
y = y.apply(le.fit_transform)
X_train, X_test, y_train, y_test = train_test_split( num_cols, y, test_size=0.33, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print((np.sqrt(mean_squared_error(y_test, predictions))))

0.5098072004616178


  y = column_or_1d(y, warn=True)


*Explain what you observe and what is your conclusion regarding both classes*

### 4. Textual Features (15 points)
* Choose the textual features you wish to use and pre-process them 
* Train a logistic regression classifier and demonstrate its performance with the right measures

*Explain what you do and why*

In [147]:

textualFeatures = num_cols[["description","amenities"]]
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re
nltk.download('stopwords')
stemmer = PorterStemmer()
words = stopwords.words("english")
myDes = textualFeatures["amenities"].apply(lambda x: " ".join([stemmer.stem(i) for i in re.sub("[^a-zA-Z]", " ", x).split() if i not in words]).lower())

[nltk_data] Downloading package stopwords to /home/shahir/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df= 3, stop_words="english", sublinear_tf=True, norm='l2', ngram_range=(1, 2))
final_features = vectorizer.fit_transform(dataset['cleaned']).toarray()
final_features.shape

*Explain what you observe and what is your conclusion regarding both classes*

### 5. Clustering (20 points)
* Use *latitude* and *longitude* to cluster listings into neighborhoods
* Use the feature 'neighborhood' included in the listings to evaluate your clustering
* Visualize the listings using a scatter plot over *latitude* and *longitude* and a different color for each cluster
* Use the cluster label as a listing feature for classification
* Train a logistic regression classifier and demonstrate its performance with the right measures

*Explain what you do and why*

In [6]:
# your code goes here

*Explain what you observe and what is your conclusion*

### 6. Naive classifier (5 points)
* Construct a naive classifier
* Compare it to all the classifiers you implemented above

*Explain what you do and why*

In [47]:
# your code goes here

*Explain what you observe and what is your conclusion*

### 7. Best possible classifier (25 points)
* Develop the best possible classifier using whatever you have learnt in this class.
* Demonstrate performance in the best possible way to show that it is indeed the best possible classifer. 

If you wish to merge all features into a single feature vector Xn you can follow the code below, assuming that Xnum corresponds to the vector of numerical features, Xcat categorical features, Xtext textual features, and Xpos the neighborhood feature. Change the code below to whatever is appropriate for you.

In [None]:
from scipy import sparse
Xn = sparse.hstack((Xnum, Xcat, Xtext, Xpos)).tocsr()

*Explain what you do and why*

In [48]:
# your code goes here

*Explain what you observe and what is your conclusion*