In this assignment, you are going to apply what you learned about machine learning to a dataset of your choice on Kaggle. Kaggle is an online platform for data science. Predict the outcomes in a dataset using either Random Forest or k-NN.

The documentation is even more important than the code. Explain what you are doing and why. Only comments on the code should be in code formatting.

# Introduction

What influences love at first sight? (Or, at least, love in the first four minutes?) This dataset was compiled by Columbia Business School professors Ray Fisman and Sheena Iyengar for their paper Gender Differences in Mate Selection: Evidence From a Speed Dating Experiment.

Data was gathered from participants in experimental speed dating events from 2002 to 2004. During the events, the attendees would have a four-minute "first date" with every other participant of the opposite sex. At the end of their four minutes, participants were asked if they would like to see their date again. They were also asked to rate their date on six attributes: Attractiveness, Sincerity, Intelligence, Fun, Ambition, and Shared Interests.

The dataset also includes questionnaire data gathered from participants at different points in the process. These fields include: demographics, dating habits, self-perception across key attributes, beliefs on what others find valuable in a mate, and lifestyle information. See the Speed Dating Data Key document below for details.

We are going to predict the variable dec_o (the partner's decision) using this dataset, to see what influences the decision-making of the dates. I chose this dataset because I was interested in what could be influential for a successful date.

In [24]:
import seaborn as sns 
import pandas as pd 
import matplotlib.pyplot as plt 
import math
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

Let's first look at the dataset and see which variables we can use.

In [25]:
SD_data = pd.read_csv('Speed Dating Data.csv', encoding ='unicode_escape')
SD_data.head(20)

Unnamed: 0,iid,id,gender,idg,condtn,wave,round,position,positin1,order,...,attr3_3,sinc3_3,intel3_3,fun3_3,amb3_3,attr5_3,sinc5_3,intel5_3,fun5_3,amb5_3
0,1,1.0,0,1,1,1,10,7,,4,...,5.0,7.0,7.0,7.0,7.0,,,,,
1,1,1.0,0,1,1,1,10,7,,3,...,5.0,7.0,7.0,7.0,7.0,,,,,
2,1,1.0,0,1,1,1,10,7,,10,...,5.0,7.0,7.0,7.0,7.0,,,,,
3,1,1.0,0,1,1,1,10,7,,5,...,5.0,7.0,7.0,7.0,7.0,,,,,
4,1,1.0,0,1,1,1,10,7,,7,...,5.0,7.0,7.0,7.0,7.0,,,,,
5,1,1.0,0,1,1,1,10,7,,6,...,5.0,7.0,7.0,7.0,7.0,,,,,
6,1,1.0,0,1,1,1,10,7,,1,...,5.0,7.0,7.0,7.0,7.0,,,,,
7,1,1.0,0,1,1,1,10,7,,2,...,5.0,7.0,7.0,7.0,7.0,,,,,
8,1,1.0,0,1,1,1,10,7,,8,...,5.0,7.0,7.0,7.0,7.0,,,,,
9,1,1.0,0,1,1,1,10,7,,9,...,5.0,7.0,7.0,7.0,7.0,,,,,


- We don't need to use any of the ID values
- dec_o is our dependent variable
- There are 6 other variables I want to use for this model: field_cd, mn_sat, zipcode, income, goal and date. I chose these because I felt they would give the most interesting results in predicting the dec_o variable.

So let's select those variables and drop the rows with NaNs so we can use the information later on.

# Data cleaning

In [26]:
df = SD_data[['dec_o','field_cd', 'mn_sat', 'zipcode', 'income', 'goal', 'date']]
df = df.dropna() #get rid of rows with empty cells
df.head(20)

Unnamed: 0,dec_o,field_cd,mn_sat,zipcode,income,goal,date
3408,1,3.0,1070.0,23060,44346.0,3.0,7.0
3409,0,3.0,1070.0,23060,44346.0,3.0,7.0
3410,1,3.0,1070.0,23060,44346.0,3.0,7.0
3411,0,3.0,1070.0,23060,44346.0,3.0,7.0
3412,1,3.0,1070.0,23060,44346.0,3.0,7.0
3413,0,3.0,1070.0,23060,44346.0,3.0,7.0
3414,1,3.0,1070.0,23060,44346.0,3.0,7.0
3415,0,3.0,1070.0,23060,44346.0,3.0,7.0
3416,0,3.0,1070.0,23060,44346.0,3.0,7.0
3426,0,10.0,1400.0,46815,42225.0,2.0,7.0


I don't think that we need to look for impossible values, since all of the data is from a survey and has been categorised. 

Let's see how often the partner's decision on the night of the event was positive. We can use this for the evaluation.

In [27]:
df['dec_o'].value_counts()

0    1081
1     838
Name: dec_o, dtype: int64
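
One way to use these counts in the evaluation is as a baseline: a model that always predicts the most frequent class should be beaten by any useful classifier. A minimal sketch, with the counts copied from the output above (the names `counts` and `baseline_accuracy` are just for illustration):

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 1081, 1: 838}, name="dec_o")

# Accuracy of a model that always predicts the most frequent class.
baseline_accuracy = counts.max() / counts.sum()

print(f"Majority class: {counts.idxmax()}, baseline accuracy: {baseline_accuracy:.3f}")
# Majority class: 0, baseline accuracy: 0.563
```

So a test accuracy of roughly 0.56 would mean the model does no better than always guessing "no".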

# Exploratory data analysis

Let's have a look at which variables correlate most strongly with dec_o.

In [28]:
SD_data.corr().loc['dec_o'].sort_values(ascending=False).head()

dec_o     1.000000
match     0.522326
like_o    0.513399
attr_o    0.486885
fun_o     0.414276
Name: dec_o, dtype: float64

# Predictive model 

Let's select the variables and create a test and training set.

In [29]:
from sklearn.preprocessing import normalize #get the function needed to normalize our data.

X = SD_data[['field_cd', 'mn_sat', 'zipcode', 'income', 'goal', 'date']] #create the X matrix
X = normalize(X) #rescale each row (sample) to unit length; note this does not put the columns on one scale
y = SD_data['dec_o'] #create the y-variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #split the data, store it 

ValueError: could not convert string to float: '1,070.00'
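
The error message suggests that at least one of the selected columns (most likely mn_sat) was read as text with a thousands separator, such as '1,070.00', so it cannot be converted to a number directly. Below is a minimal sketch of one possible fix on a few made-up rows. It is an assumption that zipcode and income are stored in the same format, and the names `raw`, `clean` and `features` are only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative rows only; the string formats of zipcode and income are assumed.
raw = pd.DataFrame({
    "dec_o":    [1, 0, 1, 0, 0, 1],
    "field_cd": [3.0, 3.0, 10.0, 10.0, 1.0, 1.0],
    "mn_sat":   ["1,070.00", "1,070.00", "1,400.00", "1,400.00", None, "1,290.00"],
    "zipcode":  ["23,060", "23,060", "46,815", "46,815", "10,021", "10,021"],
    "income":   ["44,346.00", "44,346.00", "42,225.00", "42,225.00", "55,080.00", None],
    "goal":     [3.0, 3.0, 2.0, 2.0, 1.0, 1.0],
    "date":     [7.0, 7.0, 7.0, 7.0, 5.0, 5.0],
})

features = ["field_cd", "mn_sat", "zipcode", "income", "goal", "date"]

clean = raw.copy()
for col in ["mn_sat", "zipcode", "income"]:
    # Remove the thousands separator, then parse; anything unparseable becomes NaN.
    without_commas = clean[col].str.replace(",", "", regex=False)
    clean[col] = pd.to_numeric(without_commas, errors="coerce")

# Drop incomplete rows before building X and y, so both come from the same cleaned frame.
clean = clean.dropna(subset=features + ["dec_o"])
X = clean[features]
y = clean["dec_o"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
print(X_train.shape, X_test.shape)
```

Passing `thousands=','` to `pd.read_csv` might also handle this at load time. Note as well that the cell above builds X and y from SD_data rather than from the cleaned df, so the rows with NaNs would still be included there.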

Let's use k-NN.

In [30]:
from sklearn.neighbors import KNeighborsClassifier #the object class we need

knn = KNeighborsClassifier(n_neighbors=3) #create a KNN-classifier with 3 neighbors (the default is 5)
knn = knn.fit(X_train, y_train) #this fits the k-nearest neighbor model with the train data
knn.score(X_test, y_test) #calculate the fit on the test data

NameError: name 'X_train' is not defined
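
This cell fails only because the split in the previous cell never ran. Since there is no real score to show, here is a hedged sketch on synthetic data of how the k-NN step could look once the data is numeric. Two things in it are my own choices and not part of the notebook above: `StandardScaler` is swapped in for `normalize` so that each column is put on a comparable scale, and the number of neighbors is picked by cross-validation instead of being fixed at 3.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the six predictors and the binary dec_o target.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Compare a few values of k using 5-fold cross-validation on the training set only.
cv_scores = {}
for k in (3, 5, 7, 9):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_train, y_train, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)

# Refit with the chosen k and score once on the held-out test set.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
knn.fit(X_train, y_train)
test_accuracy = knn.score(X_test, y_test)

print(cv_scores)
print(best_k, round(test_accuracy, 3))
```

Putting the scaler inside the pipeline means it is fitted on the training folds only, so no information from the test data leaks into the scaling.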

This would give the accuracy on the test set, which we could use to comment on how well our model works.

# Evaluation 

Let's create a confusion matrix from the test data.

In [31]:
from sklearn.metrics import confusion_matrix
y_test_pred = knn.predict(X_test) #the predicted values
cm = confusion_matrix(y_test, y_test_pred) #creates a "confusion matrix" of the test set
cm

NameError: name 'X_test' is not defined

Let's make it easier to read.

In [32]:
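#To read it easily, let's make a dataframe out of it and add labels to it.
#confusion_matrix sorts the labels, so row/column 0 is dec_o = 0 (no match) and row/column 1 is dec_o = 1 (match).
conf_matrix = pd.DataFrame(cm, index=['No_match', 'Match'], columns = ['No_match_p', 'Match_p']) 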
conf_matrix

NameError: name 'cm' is not defined
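
scikit-learn's `confusion_matrix` puts the true classes on the rows and the predicted classes on the columns, both in sorted label order, so with 0/1 labels the first row and column belong to class 0. A toy sketch with made-up labels (not results from this notebook) to check which cell is which before labelling the dataframe:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: three true 0s and two true 1s.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 0]

# Rows are the true classes, columns the predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

conf_matrix = pd.DataFrame(
    cm,
    index=["actual_no", "actual_yes"],
    columns=["predicted_no", "predicted_yes"],
)
print(conf_matrix)
```

Here the top-left cell (2) counts the true 0s predicted as 0, and the top-right cell (1) counts the true 0s wrongly predicted as 1.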

Let's make a classification report to draw a conclusion about the model.

In [34]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_test_pred))

NameError: name 'y_test' is not defined
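
The classification report lists precision, recall and F1 for each class. As a reminder of what those numbers mean, here is a small worked sketch for the positive class. The counts are hypothetical and chosen only for illustration; they are not results from this model.

```python
# Hypothetical confusion-matrix counts (illustration only, not notebook results).
tn, fp, fn, tp = 50, 10, 20, 20

# Precision: of everything predicted positive, how much really was positive?
precision = tp / (tp + fp)

# Recall: of everything that really was positive, how much did we find?
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Accuracy: share of all predictions that were correct.
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# precision=0.667 recall=0.500 f1=0.571 accuracy=0.700
```

With unbalanced classes like ours (more 0s than 1s), precision and recall for the 1 class say more about the model than accuracy alone.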

# Conclusion 

I could review the precision, recall and related metrics to draw a conclusion about how well the model works.