# Movie Wars
## ~ BONUS – The dark side of The Data ~

First of all, we configure the notebook to display the result of every expression in a cell, not only the last one.

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Then we import the Python libraries needed for this step.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import neighbors
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error 
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_graphviz
from imblearn.over_sampling import RandomOverSampler

pd.options.mode.chained_assignment = None

Next, let's define how to plot a **confusion matrix**.

In [None]:
def plot_confusion_matrix(y_true, y_pred, classes, normalize = False, title = None, cmap = plt.cm.Blues, size = 4):
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    
    # Optionally convert counts to per-row (true label) proportions
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation = 'nearest', cmap = cmap)
    ax.figure.colorbar(im, ax = ax)
    # We want to show all ticks...
    ax.set(xticks = np.arange(cm.shape[1]),
           yticks = np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels = classes, yticklabels = classes,
           title = title,
           ylabel = 'True label',
           xlabel = 'Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation = 45, ha = "right",
             rotation_mode = "anchor")
    
    fig.set_size_inches(size, size)
    
    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha = "center", va = "center",
                    color = "white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax
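
To make the `normalize = True` branch concrete, here is a minimal sketch of row normalization in plain Python. The helper `normalize_rows` and the counts are hypothetical and are not part of the notebook. Unlike the function above, this sketch guards against a row whose total is zero, which would otherwise produce a division by zero.

```python
def normalize_rows(cm):
    """Divide each row of a confusion matrix by its row total."""
    normalized = []
    for row in cm:
        total = sum(row)
        # Guard against classes with no true samples (row total of zero)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized

# Hypothetical counts: rows are true labels, columns are predicted labels
counts = [[80, 20],
          [15, 35]]
rates = normalize_rows(counts)
print(rates)
```

After normalization each row sums to 1, so the diagonal can be read as the fraction of each true class that was predicted correctly.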

Next, we state where our data sources are.

In [None]:
# Forward slashes work on Windows, macOS and Linux
data_folder_path = 'data/'

user_profiles_file_path = data_folder_path + 'user_profiles.csv'
movie_profiles_file_path = data_folder_path + 'movie_profiles.csv'

Load the data.

In [None]:
user_profiles = pd.read_csv(user_profiles_file_path, sep = ';')
movie_profiles = pd.read_csv(movie_profiles_file_path, sep = ';')

movie_genres = ['Action','Adventure','Animation',"Children's",'Comedy','Crime','Documentary','Drama',
                'Fantasy','Film-Noir','Horror','Musical','Mystery','Romance','Sci-Fi','Thriller','War','Western']

user_occupations = { 0:  "other", 1:  "academic/educator", 2:  "artist",  3:  "clerical/admin", 4:  "college/grad student",
                    5:  "customer service", 6:  "doctor/health care", 7:  "executive/managerial", 8:  "farmer", 9:  "homemaker",
                     10:  "K-12 student", 11:  "lawyer", 12:  "programmer", 13:  "retired", 14:  "sales/marketing", 15:  "scientist",
                     16:  "self-employed", 17:  "technician/engineer", 18:  "tradesman/craftsman", 19:  "unemployed", 20:  "writer"}

Next, we apply min-max normalization to the favorite epoch so that it falls in the range [0, 1].

In [None]:
epoch = user_profiles['Favorite epoch']
user_profiles['Favorite_epoch_norm'] = (epoch - epoch.min()) / (epoch.max() - epoch.min())

Then we split the data into training and test sets.

In [None]:
users_train, users_test = train_test_split(user_profiles, test_size = 0.2, random_state = 0)

Now we are ready to start digging into the dark side of The Data.

# User's gender prediction

First, we list the features used to train the model. In this case, we use each user's affinity for every movie genre to predict their gender.

In [None]:
features_for_training = [x + '_affinity' for x in movie_genres]
target_feature = 'Gender'

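As a preprocessing step, we randomly oversample the minority gender in the training set. With `sampling_strategy = 0.75`, the minority class is brought up to 75% of the size of the majority class, which reduces the imbalance without removing it completely.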

In [None]:
x_train = users_train[features_for_training]
x_test = users_test[features_for_training]
y_train = users_train[target_feature]
y_test = users_test[target_feature]

ros = RandomOverSampler(random_state = 0, sampling_strategy = 0.75)
x_train_resampled, y_train_resampled = ros.fit_resample(x_train, y_train)
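
As an illustration of what a ratio of 0.75 means, here is a minimal sketch of random oversampling in plain Python. It is not the `imblearn` implementation. The helper `oversample_minority` and the `'M'`/`'F'` labels are hypothetical, and the sketch assumes exactly two classes.

```python
import random
from collections import Counter

def oversample_minority(labels, ratio, seed=0):
    """Return row indices after duplicating minority rows at random.

    Assumes exactly two classes; `ratio` is the desired
    minority / majority count after resampling.
    """
    counts = Counter(labels)
    (_, n_major), (minority, n_minor) = counts.most_common(2)
    target = int(ratio * n_major)
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(labels) if label == minority]
    # Draw the missing minority rows with replacement
    n_extra = max(0, target - n_minor)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    return list(range(len(labels))) + extra

labels = ['M'] * 8 + ['F'] * 2   # hypothetical, imbalanced 8:2
idx = oversample_minority(labels, ratio=0.75)
resampled = Counter(labels[i] for i in idx)
print(resampled)
```

In this toy case the minority class grows from 2 to 6 rows while the majority stays at 8, so the classes end up closer to balance but still unequal.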

Finally, we use a simple logistic regression as the model.

In [None]:
LR = LogisticRegression(random_state = 0, max_iter = 100)
LR.fit(x_train_resampled, y_train_resampled);

print('Test_score', round(LR.score(x_test, y_test), 2), 'Train_score', round(LR.score(x_train, y_train), 2))

We obtain a good initial predictor of user gender. The main remaining problem is the imbalance towards the male gender.

In [None]:
y_pred = LR.predict(x_test)
plot_confusion_matrix(y_test, y_pred, classes = ['Male', 'Female'],  normalize = True);

# User's age prediction


Now we try to predict the user's age. We again use the genre affinity features, but this time we train a regression model.

In [None]:
features_for_training = [x + '_affinity' for x in movie_genres]
target_feature = 'Age'

In [None]:
x_train = users_train[features_for_training]
x_test = users_test[features_for_training]
y_train = users_train[target_feature] 
y_test = users_test[target_feature]

We apply principal component analysis (PCA) to reduce the noise and the dimensionality of the data. The components are fitted on the training set only and then applied to both sets.

In [None]:
pca = PCA(n_components = 16)

pca.fit(x_train)

x_train_pca = pd.DataFrame(pca.transform(x_train))
x_test_pca = pd.DataFrame(pca.transform(x_test))
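
Since we keep 16 of 18 dimensions, it can help to check how much variance those components retain. The sketch below computes the explained variance ratio with NumPy on synthetic data. The helper `explained_variance_ratio` and the random matrix are assumptions for illustration only. On the fitted object above, the same quantity is available as `pca.explained_variance_ratio_`.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance captured by each principal component."""
    X_centered = X - X.mean(axis=0)
    # Singular values of the centered data give the component variances
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2 / (X.shape[0] - 1)
    return variances / variances.sum()

rng = np.random.default_rng(0)
# Synthetic stand-in for the affinity matrix: 200 users, 18 features
X = rng.normal(size=(200, 18)) * np.linspace(3.0, 0.1, 18)
ratios = explained_variance_ratio(X)
cumulative = np.cumsum(ratios)
print(round(float(cumulative[15]), 3))  # variance kept by the first 16 components
```

If the cumulative value for the chosen number of components is already very close to 1 with fewer components, a smaller `n_components` may be worth trying.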

Next, we use a gradient boosting regressor as the model.

In [None]:
GB = GradientBoostingRegressor(random_state = 1, n_estimators = 1500)
GB.fit(x_train_pca, y_train);

After training, we measure the model's error. The error is acceptable, but there is a marked gap between the training and test errors, which indicates overfitting.

In [None]:
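pred_train = GB.predict(x_train_pca)
pred_test = GB.predict(x_test_pca)

# Snap each test prediction to the nearest value in Ages_list.
# Note: the training predictions are left unrounded.
Ages_list = [1,18,25,35,45,50,56]
pred_test = [min(Ages_list, key = lambda x: abs(x - y)) for y in pred_test]

# Metrics take the true values first, then the predictions
print('MAE(TRAINING): ', round(mean_absolute_error(y_train, pred_train), 2))
print('MAE(TEST): ', round(mean_absolute_error(y_test, pred_test), 2))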
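
The snapping step in the cell above can be isolated as a small helper. This is a minimal sketch: the name `snap_to_age_code` and the raw predictions are hypothetical, while the age codes are the ones listed in `Ages_list`.

```python
AGE_CODES = [1, 18, 25, 35, 45, 50, 56]

def snap_to_age_code(value, codes=AGE_CODES):
    """Return the code closest to `value`; ties go to the earlier code."""
    return min(codes, key=lambda code: abs(code - value))

raw_predictions = [23.4, 47.6, 60.2, 21.5]   # hypothetical regressor outputs
snapped = [snap_to_age_code(p) for p in raw_predictions]
print(snapped)  # [25, 50, 56, 18]
```

Note that 21.5 is equally far from 18 and 25. `min` returns the first minimum it finds, so the earlier code wins.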

# User's occupation prediction


We extend the previous feature set with personal information about the users to predict their occupation.

In [None]:
features_for_training = [x + '_affinity' for x in movie_genres] + ['Age', 'Gender', 'Favorite epoch']
target_feature = 'Occupation'

In [None]:
x_train = users_train[features_for_training] 
x_test = users_test[features_for_training]
y_train = users_train[target_feature] 
y_test = users_test[target_feature]

We choose a simple decision tree model as a first candidate.

In [None]:
DT = DecisionTreeClassifier(random_state = 0, min_samples_leaf = 30)
DT.fit(x_train, y_train);

Judging by the following confusion matrix, this approach leaves much room for improvement: the model is biased and its accuracy should be higher.

In [None]:
pred_test = DT.predict(x_test)
plot_confusion_matrix(y_test, pred_test, classes = list(user_occupations.values()), normalize = True, size = 12);
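
One way to put the tree's accuracy in context is to compare it with a trivial baseline that always predicts the most frequent occupation. The sketch below is an assumption-laden illustration: the helper `majority_baseline_accuracy` and the label lists are hypothetical and are not taken from the dataset.

```python
from collections import Counter

def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most frequent training label."""
    most_common_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == most_common_label)
    return most_common_label, hits / len(y_test)

# Hypothetical occupation codes, not taken from the dataset
train_labels = [4, 4, 4, 0, 7, 7, 4, 12]
test_labels = [4, 0, 4, 7]
label, accuracy = majority_baseline_accuracy(train_labels, test_labels)
print(label, accuracy)  # 4 0.5
```

With the real data, the same function could be called with `y_train` and `y_test`. A model that does not clearly beat this number is mostly reproducing the class frequencies.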