# Movie Wars
## ~ Episode III – Revenge of the Outliers  ~

First of all, we configure the notebook so that it displays the result of every expression in a cell, not only the last one, and we silence the warnings.

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import warnings
warnings.filterwarnings('ignore')

Then we import all the Python libraries needed for this step.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

Next, we state where our data sources are.

In [None]:
# A forward slash works on Windows, Linux and macOS alike
data_folder_path = 'data/'

movies_file_path = data_folder_path + 'movies_with_genre_and_year.csv'
users_file_path = data_folder_path + 'users_with_age_interval_and_occupation.csv'
ratings_file_path = data_folder_path + 'ratings.csv'
ratings_by_user_file_path  = data_folder_path + 'ratings_by_user_data.csv'

And load the data.

In [None]:
all_movies = pd.read_csv(movies_file_path, sep = ';', index_col = 'ID')
all_users = pd.read_csv(users_file_path, sep = ';', index_col = 'Id')
all_ratings = pd.read_csv(ratings_file_path, sep = ',')
ratings_by_user = pd.read_csv(ratings_by_user_file_path, sep = ';')

movie_genres = ['Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

Now, we are ready to start with the feature engineering process.

## Transforming features

### Normalizing Movie Year feature

We split the 20th century into 5 epochs. Each interval includes its first year and excludes its last one, except the final epoch, which also includes the year 2000:

 - **[1900, 1939)**: black and white movies and silent movies.
 - **[1939, 1970)**: western movies and classic movies.
 - **[1970, 1985)**: first world famous actors, classic action movies, first blockbusters and color movies.
 - **[1985, 1995)**: future based topics, introduction of special effects.
 - **[1995, 2000]**: sci-fi, new age movies and computer effects.

In [None]:
def movie_year_normalization(year):
    if year < 1939:
        return 0
    if year < 1970:
        return 0.25
    if year < 1985:
        return 0.5
    if year < 1995:
        return 0.75
    return 1

all_movies['Year_normalized'] = all_movies['Year'].apply(movie_year_normalization)
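
The same binning can also be expressed without a Python-level function. The following is a minimal sketch, assuming the thresholds of `movie_year_normalization` above; the `years` sample and the names `epoch_edges` and `epoch_values` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of release years; the real values live in all_movies['Year'].
years = pd.Series([1925, 1939, 1969, 1970, 1984, 1994, 1995, 2000])

# Bin edges mirror the thresholds of movie_year_normalization. With right=False
# every bin is closed on the left, so 1939 already belongs to the second epoch.
epoch_edges = [-float('inf'), 1939, 1970, 1985, 1995, float('inf')]
epoch_values = [0.0, 0.25, 0.5, 0.75, 1.0]

year_normalized = pd.cut(years, bins=epoch_edges, labels=epoch_values, right=False).astype(float)
print(year_normalized.tolist())
```

Keeping the edges and the values in two lists makes it easy to see that there is exactly one value per epoch.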

### Changing the class mark for the users' age

To improve the age feature, we can set the **age** class mark to the midpoint of each interval rather than to its lower bound.

In [None]:
centered_user_ages = {1: (0 + 17)/2, 18: (18 + 24)/2, 25: (25 + 34)/2, 35: (35 + 44)/2, 45: (45 + 49)/2, 50: (50 + 55)/2, 56: (56 + 70)/2}

all_users['Age'] = all_users['Age'].apply(lambda x: centered_user_ages[x])

### Normalizing the users' age values

We apply **min-max** normalization to the users' age, so that the values lie between 0 and 1.

In [None]:
minimum_age = all_users['Age'].min()
maximum_age = all_users['Age'].max()

# Min-max normalization applied to the whole column at once
all_users['Age_normalized'] = (all_users['Age'] - minimum_age) / (maximum_age - minimum_age)
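
As a quick worked example, min-max normalization maps each value `x` to `(x - min) / (max - min)`. The sketch below applies it to the seven interval midpoints defined in `centered_user_ages`; the standalone `ages` series is only an illustration, not the real `all_users` table.

```python
import pandas as pd

# Interval midpoints taken from centered_user_ages above.
ages = pd.Series([8.5, 21.0, 29.5, 39.5, 47.0, 52.5, 63.0])

# Min-max normalization: the smallest midpoint maps to 0 and the largest to 1.
age_normalized = (ages - ages.min()) / (ages.max() - ages.min())
print(age_normalized.round(3).tolist())
```

For instance, the midpoint 21 becomes (21 - 8.5) / (63 - 8.5) = 12.5 / 54.5, roughly 0.229.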

### Grouping the users' occupation

To improve the occupation feature we can regroup its classes in **fewer categories** and make them **more balanced**.

In [None]:
user_occupations_categories = {1: 'artist', 2: 'craftsmen', 3: 'engineer', 4: 'academic', 5: 'student', 6: 'customer-facing',
                               7: 'other', 8: 'unemployed', 9: 'high-wage'}

occupations_map = {20: 1, 2: 1, 18: 2, 8: 2, 9: 2, 12: 3, 17: 3, 15:4 ,1: 4, 10: 5, 4: 5, 16: 6, 14: 6, 5: 6, 0: 7, 3: 7, 
                  19: 8, 13: 8, 6: 9, 7: 9, 11: 9}

all_users['Occupation_category'] = all_users['Occupation'].apply(lambda occupation: occupations_map[occupation])
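
Since the dictionary lookup raises a `KeyError` for any code that is missing from `occupations_map`, it can be worth checking the coverage first. This is a minimal sketch, assuming the map shown above; the sample `occupation_codes` (including the made-up code 99) is hypothetical.

```python
import pandas as pd

# Same mapping as above: original occupation code -> grouped category.
occupations_map = {20: 1, 2: 1, 18: 2, 8: 2, 9: 2, 12: 3, 17: 3, 15: 4, 1: 4, 10: 5, 4: 5,
                   16: 6, 14: 6, 5: 6, 0: 7, 3: 7, 19: 8, 13: 8, 6: 9, 7: 9, 11: 9}

# Hypothetical occupation codes; 99 is deliberately not part of the map.
occupation_codes = pd.Series([0, 4, 12, 99])

# Series.map yields NaN for unmapped codes instead of raising a KeyError,
# which makes the missing codes easy to list.
occupation_category = occupation_codes.map(occupations_map)
unmapped_codes = occupation_codes[occupation_category.isna()].tolist()
print(unmapped_codes)
```

An empty list means every code present in the data has a category.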

Let's see how it looks now.

In [None]:
all_users['Occupation_category_decoded'] = all_users['Occupation_category'].apply(lambda category: user_occupations_categories[category])

occupation_distribution = pd.DataFrame({
    'Occupation': list(Counter(all_users.Occupation_category_decoded).keys()),
    'Count': list(Counter(all_users.Occupation_category_decoded).values()),
})

occupations_order = occupation_distribution.sort_values(by = 'Count', ascending = False)['Occupation']

occupation_barplot = sns.barplot(x = 'Occupation', y = 'Count', data = occupation_distribution, order = occupations_order)
occupation_barplot.set_xticklabels(occupation_barplot.get_xticklabels(), rotation = 45, horizontalalignment = 'right')
occupation_barplot.set_title("Users' occupation distribution");

### Transforming the users' gender feature into binary

For model learning purposes, we transform the users' Gender feature from a string into a binary value (1 for 'F', 0 otherwise).

In [None]:
all_users['Gender'] = all_users['Gender'].apply(lambda value: 1 if value == 'F' else 0)

## Improving the profiles

### Users profiles

We can use information about the movies rated by each user to improve our models' performance. For example, we can:
- Find the **user's average rating**.
- Find the **user's affinity to each genre** by averaging the ratings of the movies that belong to that genre.
- Find the user's **favorite movie epoch**.

In [None]:
users_ids = list(all_ratings['user'].unique())
movies_ids = list(all_ratings['movie'].unique())

# Get the movies rated by each user
movies_by_user = dict()
for user in users_ids:
    movies_by_user[user] = list(all_ratings[all_ratings['user'] == user]['movie'])

# Get the movies by genre
movies_by_genre = dict()
for genre in movie_genres:
    movies_by_genre[genre] = list(all_movies[all_movies[genre] > 0].index) 

# Compute the average rating of each movie
mean_ratings = dict()
for movie in movies_ids:
    mean_ratings[movie] = round(all_ratings[all_ratings['movie'] == movie]['rating'].mean(), 3)

# For each user and genre, compute the mean of the movies' average ratings and the number of rated movies
genre_means_per_user = dict()
genre_count_per_user = dict()
for genre in movie_genres:
    selected_movies = list(map(lambda user_movies: set(user_movies) & set(movies_by_genre[genre]), movies_by_user.values()))
    selected_ratings = list(map(lambda movies: [mean_ratings[movie] for movie in movies], selected_movies))
    
    genre_means_per_user[genre] = list(map(lambda ratings: round(np.array(ratings).mean(), 3), selected_ratings))
    genre_count_per_user[genre] = list(map(lambda ratings: len(ratings), selected_ratings))

# Create user profiles
user_profiles = pd.DataFrame({
    'User': users_ids,
    'Age': [all_users.at[user_id, 'Age'] for user_id in users_ids],
    'Gender': [all_users.at[user_id, 'Gender'] for user_id in users_ids],
    'Occupation': [all_users.at[user_id, 'Occupation_category'] for user_id in users_ids],
    'Favorite epoch': list(map(lambda movies: round(all_movies[all_movies.index.isin(list(movies))]['Year'].mean(), 3), movies_by_user.values())),
    'Mean rating': list(map(lambda movies: round(np.array([mean_ratings[movie] for movie in movies]).mean(), 3), movies_by_user.values())),
    'Ratings count': list(map(lambda movies: len(movies), movies_by_user.values()))
}).set_index('User')

for genre in movie_genres:
    user_profiles[genre + '_affinity'] = [(mean - 1)/4 if (not np.isnan(mean)) else 0 for mean in genre_means_per_user[genre]]
    user_profiles[genre + '_ratings_count'] = genre_count_per_user[genre]

user_profiles.head(5)
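
The per-user aggregates can also be computed with `groupby` instead of explicit loops. The following is a sketch on made-up data, assuming the same definition as above, where a user's mean rating is the average of the mean ratings of the movies they rated; the intermediate rounding to 3 decimals is left out, and the `ratings` frame and the `movie_mean` column are hypothetical.

```python
import pandas as pd

# Hypothetical ratings in the same shape as all_ratings (user, movie, rating).
ratings = pd.DataFrame({
    'user':   [1, 1, 2, 2, 3],
    'movie':  [10, 20, 10, 30, 20],
    'rating': [4, 2, 5, 3, 4],
})

# Average rating of each movie across all users.
movie_means = ratings.groupby('movie')['rating'].mean()

# For every user, average the movie means of the movies they rated
# and count how many movies they rated.
ratings['movie_mean'] = ratings['movie'].map(movie_means)
user_mean_rating = ratings.groupby('user')['movie_mean'].mean()
user_ratings_count = ratings.groupby('user')['movie'].count()
print(user_mean_rating.to_dict(), user_ratings_count.to_dict())
```

On the full dataset this avoids filtering `all_ratings` once per user and once per movie.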

### Movies profiles

We can use information about the users that rated each movie to improve our models' performance. For example, we can:
- Find the **movie's average rating** 

In [None]:
# Create movie profiles
movie_profiles = pd.DataFrame({
    'Movie': movies_ids,
    'Year': [all_movies.at[movie_id, 'Year'] for movie_id in movies_ids],
    'Mean rating': list(map(lambda movie: round(all_ratings[all_ratings['movie'] == movie]['rating'].mean(), 3), movies_ids)),
    'Ratings count': list(map(lambda movie: len(all_ratings[all_ratings['movie'] == movie]['rating']), movies_ids)),
    'Genre count': list(map(lambda movie: sum(all_movies.loc[movie][movie_genres].values), movies_ids))
}).set_index('Movie')

for genre in movie_genres:
    movie_profiles[genre] = [all_movies.at[movie_id, genre] for movie_id in movies_ids]

movie_profiles.head(5)

### Rating estimate

Knowing the user's affinity to each genre and the genres of a movie, we can estimate the rating as the average affinity over the movie's genres, rescaled from [0, 1] back to the 1-5 rating scale:

In [None]:
def estimator(movie, user):
    user_genre_affinity = user_profiles.loc[user][[genre + "_affinity" for genre in movie_genres]].values
    genres = movie_profiles.loc[movie][movie_genres].values
    
    affinities = np.array([x * y for x, y in zip(user_genre_affinity, genres)])
    affinities = affinities[affinities != 0.]
    
    return round(affinities.mean() * 4 + 1, 3)
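
To make the arithmetic concrete, here is a standalone sketch of the same calculation on plain arrays. The function name `estimate_from_affinities` and the three-genre example are hypothetical; the explicit guard for the case where no genre contributes is an addition that returns NaN directly.

```python
import numpy as np

def estimate_from_affinities(user_genre_affinity, movie_genre_flags):
    """Hypothetical array-based version of the estimator above."""
    # Keep only the affinities of the genres the movie belongs to.
    affinities = np.asarray(user_genre_affinity, dtype=float) * np.asarray(movie_genre_flags, dtype=float)
    affinities = affinities[affinities != 0.]

    # No matching genre with a non-zero affinity: there is nothing to average.
    if affinities.size == 0:
        return float('nan')

    # Affinities live in [0, 1]; map the mean back to the 1-5 rating scale.
    return round(float(affinities.mean()) * 4 + 1, 3)

# The movie belongs to the first two genres only, so the third affinity is ignored.
estimate = estimate_from_affinities([0.5, 0.75, 0.9], [1, 1, 0])
print(estimate)
```

Here the mean affinity is (0.5 + 0.75) / 2 = 0.625, which gives 0.625 * 4 + 1 = 3.5.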

This is what most classic content-based recommender systems did. Nowadays, this estimate is used as an additional feature to further improve the models.

***Note:*** *This can take a while, since it needs to compute it for half a million data points.*

In [None]:
all_ratings['estimate'] = all_ratings.apply(lambda rating: estimator(rating['movie'], rating['user']), axis=1)

We should not forget to add to our dataset the other features that we want to use later on. In our case, we will use: **users (age, gender, occupation, favorite epoch)** and **movies (year, genres)**.

In [None]:
all_ratings['user_age'] = all_ratings['user'].apply(lambda user: user_profiles.at[user, 'Age'])
all_ratings['user_gender'] = all_ratings['user'].apply(lambda user: user_profiles.at[user, 'Gender'])
all_ratings['user_occupation_category'] = all_ratings['user'].apply(lambda user: user_profiles.at[user, 'Occupation'])
all_ratings['user_movies_epoch'] = all_ratings['user'].apply(lambda user: user_profiles.at[user, 'Favorite epoch'])

all_ratings['movie_year'] = all_ratings['movie'].apply(lambda movie: movie_profiles.at[movie, 'Year'])
for genre in movie_genres:
    all_ratings[genre] = all_ratings['movie'].apply(lambda movie: movie_profiles.at[movie, genre])
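
An alternative to one `apply` per column is to join the profile table onto the ratings in a single step. This is only a sketch on made-up data: the `ratings` and `profiles` frames are hypothetical stand-ins for `all_ratings` and `user_profiles`.

```python
import pandas as pd

# Hypothetical stand-ins for all_ratings and user_profiles.
ratings = pd.DataFrame({'user': [1, 2, 1], 'movie': [10, 10, 20], 'rating': [4, 5, 2]})
profiles = pd.DataFrame({'Age': [21.0, 39.5], 'Gender': [1, 0]},
                        index=pd.Index([1, 2], name='User'))

# Rename the profile columns to the feature names used in the dataset, then
# look up each rating's user in the profile index.
user_columns = profiles.rename(columns={'Age': 'user_age', 'Gender': 'user_gender'})
enriched = ratings.join(user_columns, on='user')
print(enriched)
```

The row order and index of `ratings` are preserved, so the result can replace the original frame directly.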

## Removing noise

Thankfully, in a recommendation system based on ratings, there are neither missing values nor measurement noise.

But there are still some things that we can do to improve the quality of the dataset.

### Outliers

Outliers are data points that differ significantly from other observations. Among our users, we can find two groups of them:
- The *"haters"* that rate everything negatively
- The *"lovers"* that rate everything as perfect

In [None]:
haters = list(ratings_by_user[(ratings_by_user['Mean rating'] < 2.5) & (ratings_by_user['Rating deviation'] < 2)].index)
lovers = list(ratings_by_user[(ratings_by_user['Mean rating'] > 4) & (ratings_by_user['Rating deviation'] < 0.6)].index)

f"Number of haters: {len(haters)} ({round(100*len(haters)/len(all_users), 2)}%)"
f"Number of lovers: {len(lovers)} ({round(100*len(lovers)/len(all_users), 2)}%)"

For models based on **user profiling**, we need to **remove** these anomalous users to improve the generalization of the profiles. But for models based on **closest neighbors**, those anomalous users can actually be very helpful, so it is better to **keep** them.
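
The two rules can be tried out on a tiny example. The sketch below rebuilds per-user statistics from raw ratings, assuming that `Rating deviation` is the sample standard deviation (how the original `ratings_by_user` file was computed is not shown here); the `ratings` data is made up, and the thresholds are the ones used above.

```python
import pandas as pd

# Hypothetical ratings: user 1 rates low, user 2 always rates 5, user 3 is mixed.
ratings = pd.DataFrame({
    'user':   [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'rating': [1, 2, 1, 5, 5, 5, 2, 4, 5],
})

# Per-user mean and (sample) standard deviation of the ratings.
stats = ratings.groupby('user')['rating'].agg(['mean', 'std'])
stats.columns = ['Mean rating', 'Rating deviation']

# Same thresholds as in the cell above; the index holds the user ids.
haters = stats[(stats['Mean rating'] < 2.5) & (stats['Rating deviation'] < 2)].index.tolist()
lovers = stats[(stats['Mean rating'] > 4) & (stats['Rating deviation'] < 0.6)].index.tolist()
print(haters, lovers)
```

Because the statistics are indexed by user id, the resulting lists can be compared directly against the `user` column of the ratings.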

###  Lack of ratings

For models based on user profiling, it is also important to **remove** users with very few ratings.

In [None]:
users_with_few_ratings = list(ratings_by_user[ratings_by_user['Rating count'] < 30].index)

f"People with few ratings: {len(users_with_few_ratings)} ({round(100*len(users_with_few_ratings)/len(all_users), 2)}%)"

## Splitting the data

We can consider two ways of splitting our data set into 80% training and 20% testing sets. The first option is simply a **random sampling split**.

In [None]:
classic_train_data, classic_test_data = train_test_split(all_ratings, test_size = 0.2)
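
If the split needs to be reproducible between runs, `train_test_split` accepts a `random_state` seed. A minimal sketch on made-up data (the seed value 42 and the `ratings` frame are arbitrary choices for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ratings table with 10 rows.
ratings = pd.DataFrame({'user': range(10), 'movie': range(10), 'rating': [3] * 10})

# With the same seed, two calls select exactly the same rows.
train_a, test_a = train_test_split(ratings, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(ratings, test_size=0.2, random_state=42)
print(len(train_a), len(test_a), test_a.index.equals(test_b.index))
```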

The second option is to first add **at least one rating** for each movie and one rating for each user to the training set, and then **fill the remaining capacity up to 80% randomly**. This guarantees that every movie and every user in the test set also has at least one rating in the training set.

In [None]:
def minimum_random_sampling(dataset):
    selected_users = set()
    selected_movies = set()
    selected_ratings = []
    
    shuffled_dataset = shuffle(dataset)
    
    for index, row in shuffled_dataset.iterrows():
        if row.user not in selected_users or row.movie not in selected_movies:
            selected_users.add(row.user)
            selected_movies.add(row.movie)
            # Store the index label, since the rows are selected by index below
            selected_ratings.append(index)
    
    return (dataset[dataset.index.isin(selected_ratings)], dataset[~dataset.index.isin(selected_ratings)])

def special_random_sampling(dataset, test_size = 0.2):   
    (minimum, remaining) = minimum_random_sampling(dataset)
    (train, test) = train_test_split(remaining, test_size = test_size * (len(dataset)/(len(dataset) - len(minimum))))
    
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    return (pd.concat([train, minimum]), test)

This second approach must be taken when the selected model cannot handle movies or users that it has not seen during training.

In [None]:
(special_train_data, special_test_data) = special_random_sampling(all_ratings)
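
The guarantee of this split can be verified by checking that no user or movie appears only in the test set. The helper `unseen_in_train` below is a hypothetical sketch, shown on made-up frames with the same `user` and `movie` columns:

```python
import pandas as pd

def unseen_in_train(train, test):
    """Return the users and movies that appear in test but not in train."""
    unseen_users = set(test['user']) - set(train['user'])
    unseen_movies = set(test['movie']) - set(train['movie'])
    return unseen_users, unseen_movies

# Hypothetical split: user 4 only shows up in the test set.
train = pd.DataFrame({'user': [1, 2, 3], 'movie': [10, 20, 30]})
test = pd.DataFrame({'user': [1, 4], 'movie': [10, 20]})

unseen_users, unseen_movies = unseen_in_train(train, test)
print(unseen_users, unseen_movies)
```

For a split produced by `special_random_sampling`, both sets are expected to be empty.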

### Filtering data

We want to remove the outliers and the users with few ratings **only from the training set**, and only for certain models. If we were to remove them from the test set, our metrics wouldn't be valid.

In [None]:
classic_train_data_reduced = classic_train_data[(~classic_train_data['user'].isin(haters)) & (~classic_train_data['user'].isin(lovers)) & (~classic_train_data['user'].isin(users_with_few_ratings))]
special_train_data_reduced = special_train_data[(~special_train_data['user'].isin(haters)) & (~special_train_data['user'].isin(lovers)) & (~special_train_data['user'].isin(users_with_few_ratings))]
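
Since the same filter is applied to both training sets, it could be factored into a small helper. The function `drop_users` below is a hypothetical sketch, not part of the original notebook, and the example data is made up:

```python
import pandas as pd

def drop_users(ratings, *user_groups):
    """Return the ratings whose user is in none of the given groups."""
    excluded = set().union(*user_groups)
    return ratings[~ratings['user'].isin(excluded)]

# Hypothetical training data and user groups.
train = pd.DataFrame({'user': [1, 2, 3, 4, 5], 'rating': [1, 5, 3, 4, 2]})
reduced = drop_users(train, [1], [2], [5])
print(reduced['user'].tolist())
```

With the notebook's variables, the call would look like `drop_users(classic_train_data, haters, lovers, users_with_few_ratings)`.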

## Results

Finally, we save the resulting training and testing sets.

In [None]:
features = ['rating', 'user_age', 'user_gender', 'user_occupation_category', 'user_movies_epoch', 'movie_year', 'estimate'] + movie_genres

classic_train_data[features].to_csv(data_folder_path + 'ratings_training_data_basic_split.csv', sep = ';')
classic_train_data_reduced[features].to_csv(data_folder_path + 'ratings_training_data_basic_split_reduced.csv', sep = ';')
classic_test_data[features].to_csv(data_folder_path + 'ratings_test_data_basic_split.csv', sep = ';')

special_train_data[features].to_csv(data_folder_path +'ratings_training_data_special_split.csv', sep = ';')
special_train_data_reduced[features].to_csv(data_folder_path +'ratings_training_data_special_split_reduced.csv', sep = ';')
special_test_data[features].to_csv(data_folder_path +'ratings_test_data_special_split.csv', sep = ';')