TODO: these are the four labels we want to predict, each followed by the features we plan to use

EDUCATION
income
job

BODY TYPE
diet
sex
drinks
height

LOCATION
income
job
status
offspring
pets

INCOME
diet
smokes
drinks
drugs

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import OrdinalEncoder

# Preprocessing the data and training the model

In [None]:
raw_dataset = pd.read_csv('../data/okcupid.csv') 
okcupid_profiles = raw_dataset.drop(columns="Unnamed: 0") 

As we can see from the output below, almost every column has the object dtype, which we cannot use to fit the Random Forest.
We need to convert the objects into numbers, and we can do that using OrdinalEncoder from sklearn.
We need to handle the missing data first, though.

In [None]:
okcupid_profiles.dtypes
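
To list only the columns that still need encoding, one option is `DataFrame.select_dtypes`. The sketch below runs on a small made-up frame: the column names and values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Toy stand-in for okcupid_profiles; values are invented for illustration
toy = pd.DataFrame({
    "age": [22, 35, 38],
    "height": [75.0, 70.0, 68.0],
    "education": ["working on college/university", None, "graduated from masters program"],
    "job": ["transportation", "hospitality / travel", None],
})

# Everything that is not numeric has to be encoded before fitting a model
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()
numeric_columns = toy.select_dtypes(include="number").columns.tolist()

print(non_numeric_columns)
print(numeric_columns)
```

Excluding `"number"` rather than including `"object"` keeps the selection independent of how a given pandas version stores strings.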

In [None]:
okcupid_profiles['education'].value_counts()

## Filling the missing data

The columns containing missing data are the following:

In [None]:
print(okcupid_profiles.isna().sum())

Comparing the missing data output with the dtype output shows that, except for height, all the columns with missing data hold categorical strings.

Only three rows have a missing value for height. Instead of replacing the NaN with a placeholder such as 0 or -1, or with the average height, we think it's better to drop those rows, since there are so few of them.

In [None]:
# Passing subset as a list works across pandas versions
okcupid_profiles = okcupid_profiles.dropna(how = 'any', subset = ['height'])
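
As a quick illustration of what `subset` does, here is a minimal sketch on an invented three-row frame (the values are assumptions for the example). Only rows with a missing height are removed, while gaps in other columns are left for the next step.

```python
import numpy as np
import pandas as pd

# Invented rows: one missing height, one missing diet
toy = pd.DataFrame({
    "height": [70.0, np.nan, 65.0],
    "diet": ["anything", "vegetarian", None],
})

# Drop only the rows where height is missing; a missing diet does not trigger a drop
cleaned = toy.dropna(subset=["height"])

print(len(toy), "->", len(cleaned))
```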

For all the other attributes, we will just replace the missing values with the 'missing' string.

In [None]:
okcupid_profiles = okcupid_profiles.fillna(value = 'missing')
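
The global `fillna` works here because, after dropping the height rows, only string columns still have gaps. If a numeric column had missing values too, filling it with a string would change its dtype. A more defensive variant, sketched on an invented frame (column names and values are assumptions), fills only the non-numeric columns:

```python
import pandas as pd

# Invented rows with gaps only in the text columns
toy = pd.DataFrame({
    "age": [22, 35, 38],
    "diet": ["anything", None, "vegetarian"],
    "pets": [None, "likes dogs", None],
})

# Fill only the non-numeric columns, leaving numeric dtypes untouched
text_columns = toy.select_dtypes(exclude="number").columns
filled = toy.copy()
filled[text_columns] = filled[text_columns].fillna("missing")

print(filled.isna().sum().sum())
```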

Now no column has missing values:

In [None]:
print(okcupid_profiles.isna().sum())

## Encoding the data

In [None]:
enc = OrdinalEncoder()
enc.fit(okcupid_profiles)

In [None]:
encoded_data = enc.transform(okcupid_profiles)

Now we have a NumPy array with the encoded data, so there are no more objects, only numbers.

In [None]:
encoded_data.dtype
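
To see what the encoder actually does, here is a minimal sketch on an invented two-column frame (the values are assumptions, and `toy_enc` is a separate encoder used only for this example). Each column gets its own sorted list of categories, and a value's code is its position in that list.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Invented values, including the 'missing' placeholder used above
toy = pd.DataFrame({
    "diet": ["vegan", "anything", "missing", "anything"],
    "smokes": ["no", "yes", "no", "missing"],
})

toy_enc = OrdinalEncoder()
toy_encoded = toy_enc.fit_transform(toy)

# categories_ holds one sorted array of labels per column;
# the position of a label in that array is its numeric code
print(toy_enc.categories_[0])
print(toy_encoded[:, 0])

# inverse_transform maps the codes back to the original strings
decoded = toy_enc.inverse_transform(toy_encoded)
print(decoded[0])
```

Note that the codes follow alphabetical order, so they carry no real ranking between categories. `categories_` and `inverse_transform` are what we would use later to turn predicted codes back into readable labels.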

# Random Forest

## Education prediction model

In [None]:
# encoded_data is a NumPy array, so columns are selected by position, not by name

y = encoded_data[:, 6]  # Target: this should pick the education column

X = encoded_data[:, [1, 2, 3, 4, 5, 7, 8, 15, 16]]  # Feature columns, selected by position
# X = np.delete(encoded_data, 11, axis = 1) # This should remove the income column

# test_size = 0.3   means 70% training set | 30% test set
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size = 0.3, 
                                                    random_state = 42)
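
Hard-coded positions are easy to get wrong, so one option is to look them up from the column names. This is a minimal sketch on an invented frame: the column names, their order, and the `_toy` variables are assumptions for the example, not the layout of the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the encoded data; the real frame has more columns
toy = pd.DataFrame({
    "age": [22, 35, 38, 23],
    "education": [0.0, 1.0, 1.0, 2.0],
    "income": [3.0, 0.0, 5.0, 0.0],
    "job": [1.0, 4.0, 2.0, 2.0],
})
toy_encoded = toy.to_numpy(dtype=float)

target_name = "education"
feature_names = ["income", "job"]

# Translate names into positions so the NumPy slicing stays readable
target_idx = toy.columns.get_loc(target_name)
feature_idx = toy.columns.get_indexer(feature_names)

# get_indexer returns -1 for names it cannot find, so check before slicing
assert (feature_idx >= 0).all(), "unknown feature column"

y_toy = toy_encoded[:, target_idx]
X_toy = toy_encoded[:, feature_idx]

print(target_idx, feature_idx)
```

In the notebook, the same lookup against `okcupid_profiles.columns` would confirm whether position 6 really is the education column.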

In [None]:
# n_estimators is the number of trees in the forest
rfc = RandomForestClassifier(n_estimators = 50)
rfc.fit(X_train, y_train)
rfc_prediction = rfc.predict(X_test)
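
The forest above is built without a seed, so its scores can change slightly between runs. A minimal sketch of a reproducible run, using synthetic data from `make_classification` as a stand-in for the encoded profiles (the data and the `_demo` names are assumptions for the example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the encoded profiles
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     n_informative=3, random_state=0)

# test_size = 0.3 keeps 70% of the rows for training and 30% for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=42)

# Same seed, same forest: two independent fits give identical predictions
first = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
second = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

print(len(X_tr), len(X_te))
print((first.predict(X_te) == second.predict(X_te)).all())
```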

In [None]:
print("Random Forest Classification report")
print(classification_report(y_test, rfc_prediction))
print("Random Forest Confusion Matrix")
print(confusion_matrix(y_test, rfc_prediction))

Let's now plot a proper confusion matrix

In [None]:
# Get and reshape confusion matrix data
matrix = confusion_matrix(y_test, rfc_prediction)

matrix = matrix.astype('float') / matrix.sum(axis=1)[:, np.newaxis] # normalize each row so it sums to 1 (fractions, not %)
#matrix = matrix.astype('float')

# Build the plot
plt.figure(figsize = (15, 10))
sns.set(font_scale = 1.4)
sns.heatmap(matrix, 
            annot = True, 
            annot_kws = {'size':5},
            cmap = plt.cm.Blues, 
            linewidths = 0.2)

# Add labels to the plot
# class_names = np.array(okcupid_profiles["income"].unique())
# class_names = [-1, 80000, 20000, 40000, 30000, 50000, 60000, 1000000, 150000, 100000, 500000, 70000, 250000]
class_names = []  # left empty for now, so no tick labels are drawn

# Heatmap cells are centered at half-integer positions on both axes
tick_marks = np.arange(len(class_names)) + 0.5

plt.xticks(tick_marks, class_names, rotation=90)
plt.yticks(tick_marks, class_names, rotation=0)

plt.xlabel('Predicted label')
plt.ylabel('True label')

plt.title('Confusion Matrix for Education')
plt.show()
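
To make the row normalization concrete, here is a minimal sketch on invented labels (in the notebook these would be `y_test` and `rfc_prediction`). It also checks the manual division against the `normalize="true"` option of `confusion_matrix`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels for illustration
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0, 1])

# Fix the label order explicitly so rows, columns, and tick labels line up
labels = np.array([0, 1, 2])
counts = confusion_matrix(y_true, y_pred, labels=labels)

# Divide each row by its total: entry [i, j] becomes the share
# of true class i that was predicted as class j
row_totals = counts.sum(axis=1, keepdims=True)
shares = counts / row_totals

# scikit-learn can do the same normalization directly
shares_builtin = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")

print(counts)
print(shares.round(2))
```

With the label order fixed like this, readable tick labels could come from the encoder, for example `enc.categories_[6][labels.astype(int)]`, assuming column 6 is the education column as noted earlier.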