# Unsupervised Learning Predict
© Explore Data Science Academy

---
*   Thomas Kenyon
*   Name
*   Name
*   Name
*   Name
---

### Honour Code

We, Team 8, confirm - by submitting this document - that the solutions in this notebook are a result of our own work and that we abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

---

### Predict Overview: Movie Recommender

--Insert description and context here

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modeling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

<a href=#eight>8. Appendix</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---

In [3]:
# Import our regular old heroes 
import numpy as np
import pandas as pd
import scipy as sp # <-- The sister of Numpy, used in our code for numerical efficiency. 
import matplotlib.pyplot as plt
import seaborn as sns

# Entity featurization and similarity computation
from sklearn.metrics.pairwise import cosine_similarity 
from sklearn.feature_extraction.text import TfidfVectorizer

# Libraries used during sorting procedures.
import operator # <-- Convenient item retrieval during iteration 
import heapq # <-- Efficient sorting of large lists

# Imported for our sanity
import warnings
warnings.filterwarnings('ignore')

In [4]:
import cufflinks as cf
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---

In [5]:
genome_scores = pd.read_csv('genome_scores.csv')
genome_tags = pd.read_csv('genome_tags.csv')
imdb_data = pd.read_csv('imdb_data.csv')
links = pd.read_csv('links.csv')
movies = pd.read_csv('movies.csv')
tags = pd.read_csv('tags.csv')
test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'genome_scores.csv'

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---

In [2]:
genome_scores.isnull().sum()
genome_tags.isnull().sum()

NameError: name 'genome_scores' is not defined

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---

Before training a model on any data, it is essential to clean and re-engineer it. This ensures that the data is in a consistent form, with no missing values, incorrect data types or plainly incorrect entries. For structured, numeric data, this entails scaling values, filling in missing values and typecasting any non-numeric data to numeric form. Much of the data in this database, however, is unstructured text, which computers cannot interpret directly: they work with 1s and 0s, not letters and words. The text must therefore be converted into a numeric form before it can be used.

### Content-based Recommender

---
First, we need to merge the movies dataframe with the imdb_data dataframe. Together, these two dataframes contain most of the information we need to construct our recommender.

As you can see, for each movie we now have multiple columns that contain useful information about each movie:
* genres
* title_cast
* director
* plot_keywords

There are more columns that could be useful, such as budget, but let's stick to these for now.

In [None]:
joined2 = movies.merge(imdb_data,  how='left', on = 'movieId')
joined2.head()
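
Because this is a left merge, every row of movies is kept even when imdb_data has no matching entry. Below is a minimal sketch, on made-up toy data, of how the `indicator=True` option of `merge` could be used to count such unmatched rows. The `*_demo` frames and their values are purely illustrative and are not taken from the real dataset.

```python
import pandas as pd

# Toy stand-ins for the real dataframes (values are illustrative only)
movies_demo = pd.DataFrame({'movieId': [1, 2, 3],
                            'title': ['Movie A (2000)', 'Movie B (2001)', 'Movie C (2002)']})
imdb_demo = pd.DataFrame({'movieId': [1, 3],
                          'director': ['Director X', 'Director Y']})

# indicator=True adds a '_merge' column saying where each row came from
merged_demo = movies_demo.merge(imdb_demo, how='left', on='movieId', indicator=True)

# rows marked 'left_only' exist in movies but have no imdb metadata
n_missing = int((merged_demo['_merge'] == 'left_only').sum())
print(n_missing, 'of', len(merged_demo), 'movies have no imdb metadata')
```

Running the same check on the real merge would give a quick idea of how many movies end up with empty metadata columns.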

There are a lot of actor names for each movie in the title_cast column! Since we're ultimately going to vectorize all of this data, we should ask ourselves whether we really need every actor listed for each movie. To minimize the number of features in our vectorized dataset, let's reduce this to the first 3 actors. These seem to be the top-billed, most prominent cast members for each film, so they are likely the most important anyway.

In [None]:
def first_3(actors):
  '''Return the first 3 names in a list of actors, joined with "|"'''
  # slicing is safe even when the list holds fewer than 3 names
  return '|'.join(actors[:3])

# x = ['Rhys Ifans', 'Tim Allen', 'Don Rickles', 'Crispin Glover', 'Christian McKay']
# print(first_3(x))


joined2['title_cast'] = joined2['title_cast'].apply(lambda x: str(x).split('|'))
joined2['title_cast'] = joined2['title_cast'].apply(first_3)
joined2.head(2)

One problem with this dataset is the number of nan values. Most of these are actual np.nan null values, but some of them are actually 'nan' strings. Sneaky! We'll need to convert all these cryptic nan values to real nan values and then impute something wherever they occur, since null values cannot be passed to a vectorizer.

In [None]:
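def check_nan(field):
  '''Some NaN values are being stored as strings, so fillna won't work'''
  # anything that is not a string (e.g. a float NaN) is treated as missing
  if not isinstance(field, str):
    return np.nan
  f2 = field.strip().lower()
  # exact match, so that names containing 'nan' (e.g. 'Fernando') are kept
  if f2 == 'nan':
    return np.nan
  else:
    return field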

joined3 = joined2.copy()
joined3['title_cast'] = joined3['title_cast'].apply(check_nan)
joined3['plot_keywords'] = joined3['plot_keywords'].apply(check_nan)
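
When deciding whether a string is a 'nan' placeholder, an exact comparison and a substring test behave differently, which matters for names that happen to contain those three letters. The sketch below shows one possible column-wise way to do the clean-up. `clean_nan_strings` and the demo values are hypothetical and only meant as an illustration.

```python
import numpy as np
import pandas as pd

def clean_nan_strings(series):
    '''Hypothetical helper: turn 'nan' placeholder strings into real NaN.'''
    # normalise to lowercase text; real NaN becomes the string 'nan' too
    text = series.fillna('nan').astype(str).str.strip().str.lower()
    # keep only values whose normalised text is not exactly 'nan'
    return series.where(text != 'nan', np.nan)

cast_demo = pd.Series(['Tom Hanks', 'nan', np.nan, 'Fernando Rey'])
cleaned_demo = clean_nan_strings(cast_demo)

# a substring test would also flag legitimate names that contain 'nan'
print('nan' in 'fernando rey')        # True
print(cleaned_demo.isna().tolist())   # [False, True, True, False]
```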

Let's replace all nan values with empty strings. This means that there will be a lot of movies with a lot of empty metadata columns, but hopefully this will become less of an issue once we include the genome tag dataset.

In [None]:
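# assign the result back rather than using inplace=True on a column selection,
# which newer pandas versions may not propagate to the parent dataframe
joined3['director'] = joined3['director'].fillna('')
joined3['title_cast'] = joined3['title_cast'].fillna('')
joined3['runtime'] = joined3['runtime'].fillna(0)
joined3['budget'] = joined3['budget'].fillna(0)
joined3['plot_keywords'] = joined3['plot_keywords'].fillna('')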

Next, we create a word soup column. This column holds all the text data that we will vectorize. Think of all the words/names in these columns becoming tags. I've added director 3 times so that it carries more weight, i.e., if someone likes a movie like Inception, then they are likely to enjoy other movies directed by Christopher Nolan. This is quite a janky way of altering the weightings of specific tags, but creating a word soup like this does simplify the vectorizing process.

In [None]:
joined3['soup'] = joined3['genres'] + '|' + joined3['director'] + '|' + joined3['director'] + '|' + joined3['director'] + '|' + joined3['plot_keywords'] + '|'+ joined3['title_cast']
joined3.iloc[1]
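
The repeat-to-weight idea can be written as a small helper so that the weights live in one place. This is only a sketch: `weighted_soup`, the `weights` dictionary and the one-row demo frame are hypothetical names introduced here for illustration, not part of the original notebook.

```python
import pandas as pd

def weighted_soup(df, weights):
    '''Hypothetical helper: join columns with '|', repeating each column `weight` times.'''
    parts = []
    for col, weight in weights.items():
        # repeating a column is what gives its terms a higher weighting
        parts.extend([df[col].fillna('').astype(str)] * weight)
    soup = parts[0]
    for part in parts[1:]:
        soup = soup + '|' + part
    return soup

demo = pd.DataFrame({'genres': ['Action|Sci-Fi'],
                     'director': ['Christopher Nolan'],
                     'plot_keywords': ['dream'],
                     'title_cast': ['Leonardo DiCaprio']})
weights = {'genres': 1, 'director': 3, 'plot_keywords': 1, 'title_cast': 1}
soup_demo = weighted_soup(demo, weights)
print(soup_demo.iloc[0])
```

With the weights above, the director appears three times in the resulting string, which mirrors the column order and repetition used in the cell above.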

The movie titles contain their release year. We'll leave those in to keep things compatible with the streamlit app, but we'll also add the years to their own column, just in case we decide to use this information.

In [None]:
def get_year(title):
  s = title.split()
  year = s[-1]
  if '(' in year and ')' in year:
    year2 = year[1:-1]
    return year2
  else:
    return np.nan
  
joined3['year'] = joined3['title'].apply(get_year)
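
As an alternative sketch, the year could also be pulled out with a regular expression applied to the whole column at once. The pattern below assumes the year is always four digits inside the last pair of brackets in the title; the demo titles are illustrative.

```python
import pandas as pd

titles_demo = pd.Series(['Toy Story (1995)', 'Untitled Project', 'Ice Age (2002) '])

# capture four digits inside a final pair of brackets, allowing trailing spaces;
# titles without such a pattern come back as NaN
years_demo = titles_demo.str.extract(r'\((\d{4})\)\s*$', expand=False)
print(years_demo.tolist())
```

Unlike `get_year`, this version only accepts a bracketed four-digit number, so a title ending in some other bracketed text would yield NaN rather than that text.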

Filling in nan values in the year column with 0 

In [None]:
# assign the result back instead of using inplace=True on a column selection
joined3['year'] = joined3['year'].fillna('0')

#### What to do about the genome tag data?

---
The genome tag data consists of 1128 tags assigned to the movies in the movie database. What is handy about this data is that there are no null values. Having 1128 standardized tags used across all movies is also useful for vectorization: at most, we'll only be adding 1128 new features. This is very efficient considering how much data there is! In addition, each tag for each movie is assigned a relevance value. A relevance closer to 0 means that the tag is not related to that movie, while a relevance closer to 1 means that it is.

---
How would we add these tags to our word soup data while also accounting for the extra information in the relevance column? The option we've gone for here is again quite janky: tags with a relevance >= 0.5 are added to a movie's word soup once, while tags with a relevance >= 0.8 are added twice. This increases their weighting in the same way that adding a movie's director multiple times does.

---
A word of warning: we could not find a straightforward way to write this code more efficiently. On a relatively new laptop with an 11th gen i5 processor, the cell below takes 2h30m to run. Fortunately, it only needs to be run once.

In [None]:
%%time
def get_genome_tags(movieid):
    scores = genome_scores.loc[genome_scores['movieId'] == movieid]
    output = []
    output2 = ''
    for index, row in scores.iterrows(): # itering through a dataframe is heresy but we're taking the janky route here
        # look up the tag text (second column of genome_tags) for this tagId
        tag = genome_tags.loc[genome_tags['tagId'] == row['tagId']].values[0][1]
        relevance = row['relevance']
        if relevance >= 0.5:
            output.append(tag)
        if relevance >= 0.80:
            output.append(tag)  # added a second time for extra weight
    # join once, after the loop, instead of on every iteration
    output2 = '|'.join(output)
    return output2


# joined3['genome'] = joined3['movieId'].apply(get_genome_tags)
joined4 = joined3.copy()
# joined4
joined4['genome'] = joined4['movieId'].apply(get_genome_tags)
joined4.head()
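
The same threshold rule can also be expressed with a merge and a group-by instead of a per-row loop. The sketch below is an alternative that has not been benchmarked on the full data here. It assumes that genome_tags has columns named `tagId` and `tag` (the loop above only relies on the tag text being the second column), and `build_genome_soup` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def build_genome_soup(genome_scores, genome_tags, low=0.5, high=0.8):
    '''Hypothetical helper: Series mapping movieId -> '|'-joined tags, weighted by relevance.'''
    # attach the tag text to every (movieId, tagId, relevance) row
    scored = genome_scores.merge(genome_tags, on='tagId', how='inner')
    # keep only tags that clear the lower threshold
    scored = scored[scored['relevance'] >= low]
    # tags at or above the upper threshold are repeated twice, the rest once
    repeats = np.where(scored['relevance'] >= high, 2, 1)
    scored = scored.loc[scored.index.repeat(repeats)]
    return scored.groupby('movieId')['tag'].agg('|'.join)

# toy data (illustrative values only)
scores_demo = pd.DataFrame({'movieId': [1, 1, 1, 2],
                            'tagId': [1, 2, 3, 1],
                            'relevance': [0.9, 0.6, 0.1, 0.4]})
tags_demo = pd.DataFrame({'tagId': [1, 2, 3], 'tag': ['pixar', 'toys', 'war']})

genome_demo = build_genome_soup(scores_demo, tags_demo)
print(genome_demo.to_dict())
```

Movies with no tag above the lower threshold are absent from the result, so mapping it onto the main dataframe would need something like `joined4['movieId'].map(genome_demo).fillna('')` to reproduce the empty strings returned by the loop.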

In [None]:
joined4['soup'] = joined4['soup'] + '|' + joined4['genome']

In [None]:
joined4.drop('genome', inplace=True, axis=1)

Alternatively, uncomment and run the cell below to load the saved output dataframe produced by the code 3 cells above:

In [None]:
# joined4 = pd.read_csv('joined_df_soup_genome3.csv')

In [None]:
joined4['soup'][3]

Great! That looks like a pretty decent collection of tags

#### Time to vectorize
We've got a soup column that now contains all the movie information we want. It's time to vectorize! One thing to note: we need to ensure that the vectorizer splits our word soup into the correct 'tokens'. We don't want tokens to be generated at each whitespace. Morgan Freeman is one person (or God?), so we don't want our vectorizer to split his name into two tokens. We'll avoid this by using a regex pattern that treats runs of letters, spaces and full stops as a single token, so tokens are split wherever the '|' symbol occurs, since that is what we've been using to separate the terms in our word soup.

In [None]:
tf = TfidfVectorizer(analyzer='word',
                     min_df=4, max_df=0.5, max_features=100000, token_pattern='[A-Za-z .]{3,50}')
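
To see what this token pattern does, it can be tried directly with Python's `re` module on a made-up soup string. This is only an illustration of the regex; note that TfidfVectorizer also lowercases the text by default before tokenizing, which the snippet does not do.

```python
import re

# same pattern as passed to TfidfVectorizer above
token_pattern = r'[A-Za-z .]{3,50}'

# illustrative soup string, not taken from the dataset
soup_example = 'Sci-Fi|Christopher Nolan|Leonardo DiCaprio'
tokens = re.findall(token_pattern, soup_example)
print(tokens)   # ['Sci', 'Christopher Nolan', 'Leonardo DiCaprio']
```

Full names survive as single tokens because spaces are part of the allowed character set. Characters outside the set, such as hyphens or digits, also act as separators, and fragments shorter than 3 characters are dropped, which is why 'Sci-Fi' becomes just 'Sci'.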

In [None]:
tf_soup = tf.fit_transform(joined4['soup'])

In [None]:
tf_soup

This vectorized object is a sparse matrix. If we wanted to use as little memory as possible we'd keep it sparse, but for simplicity's sake we're going to convert it to a dense array and then to a dataframe so that we can use the movie titles as the dataframe's index. This will definitely use far more memory, but it'll make our code easier to read.

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_df = pd.DataFrame(tf_soup.toarray(),  columns=tf.get_feature_names_out())
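
For comparison, here is a small sketch of the memory-friendly route, where the TF-IDF output stays sparse and is passed straight to `cosine_similarity`. The three soup strings are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# illustrative word soups, not taken from the dataset
soups_demo = ['Animation|Pixar|toys',
              'Animation|Pixar|fish',
              'Horror|ghost|haunted house']

vec_demo = TfidfVectorizer(token_pattern=r'[A-Za-z .]{3,50}')
X_demo = vec_demo.fit_transform(soups_demo)   # stays a scipy sparse matrix

# similarity of the first soup to every soup, computed without densifying X_demo
sims_demo = cosine_similarity(X_demo[0], X_demo).ravel()
print(sims_demo)
```

The first soup is most similar to itself, shares two terms with the second, and shares none with the third, so its similarity to the third is zero. The trade-off is that titles have to be tracked separately, since a sparse matrix has no labelled index.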

In [None]:
tfidf_df.index = joined4['title']

In [None]:
tfidf_df.shape

A few movies in this dataset are actually duplicated. Let's remove those duplicates now, before attempting PCA or any cosine similarity calculations.

In [None]:
tfidf_df = tfidf_df[~tfidf_df.index.duplicated(keep='first')]

In [None]:
tfidf_df.shape

In [None]:
tfidf_df.head()

### PCA time

---
Okay, so we've got an enormous dataframe with 62325 rows and 6526 columns (features). That is a huge dataframe, not just to store and manipulate but also to perform calculations on. It's likely that many of these features are not that important, i.e., they don't contain much information about each observation and don't explain much of the variation in the dataset. We'll use Principal Component Analysis (PCA) here to reduce the number of features in our dataset while having a minimal effect on the quality of the dataset.

---
Another reason we converted our vectorized soup data to a dataframe is that Sklearn's PCA implementation didn't support sparse matrices as input in the version we used!
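
If staying sparse were a priority, scikit-learn's `TruncatedSVD` is a related technique that does accept sparse input. It is not the method used in this notebook, and unlike PCA it does not centre the data first, so the components are not identical. A minimal sketch on random sparse data (sizes and density are arbitrary):

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# random sparse matrix standing in for the TF-IDF output (illustrative only)
X_sparse_demo = sparse.random(200, 50, density=0.05, format='csr', random_state=0)

# reduce 50 features to 10 components without converting to a dense array
svd_demo = TruncatedSVD(n_components=10, random_state=0)
reduced_demo = svd_demo.fit_transform(X_sparse_demo)

print(reduced_demo.shape)
print(svd_demo.explained_variance_ratio_.sum())
```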

DO NOT run the cell below. It takes ~12 minutes to run and uses about 13GB of memory. The image below is its output, saved from a previous run and added here for convenience.

In [None]:
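# PCA is not among the imports in section 1, so import it here
from sklearn.decomposition import PCA

# define PCA object
pca = PCA()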

# fit the PCA model to our data and apply the dimensionality reduction 
prin_comp = pca.fit_transform(tfidf_df)

# create a dataframe containing the principal components
pca_df = pd.DataFrame(data = prin_comp)

# plot line graph of cumulative variance explained 
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')

In [2]:
from IPython.display import Image
Image(url= "https://github.com/ethanmacrae/unsupervised-predict-streamlit-template/blob/master/resources/imgs/Cumulative_explained_variance.png?raw=true")

Amazing! From the graph above, we can see that the majority of variation between observations can be explained by a fraction of the features. It appears that if we were to keep the 1000 most significant principal components we'd still be able to explain ~90% of the variation. Not only will that significantly reduce the memory footprint of our dataset, it'll also drastically speed up movie prediction times.
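
Rather than reading the cut-off from the graph, the number of components needed for a given share of variance can also be computed from the cumulative sum. The sketch below uses small random data, and the 90% target is just the figure quoted above.

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic data whose columns have decreasing variance (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 20)) * np.linspace(5, 0.1, 20)

pca_demo = PCA().fit(X_demo)
cumvar = np.cumsum(pca_demo.explained_variance_ratio_)

# index of the first component at which the cumulative variance reaches 90%,
# plus one to turn the zero-based index into a count
n_keep = int(np.searchsorted(cumvar, 0.90) + 1)
print(n_keep, cumvar[n_keep - 1])
```

scikit-learn's PCA also accepts a float between 0 and 1 for `n_components`, in which case it selects the number of components needed to exceed that share of variance itself.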

---
Let's now confirm that the first 1000 principal components really do explain most of the variance, and then store them in a numpy array.

In [None]:
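# import PCA here as well, since the previous PCA cell is not meant to be re-run
from sklearn.decomposition import PCA

# create PCA object with n_components set to 1000
pca_reg = PCA(n_components=1000)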

# fit the PCA model to our data and apply the dimensionality reduction 
PCA_array = pca_reg.fit_transform(tfidf_df)

# confirm the number of components
pca_reg.n_components_



pca_reg.explained_variance_ratio_.sum()

Storing the PCA data as a numpy array proved to be the most efficient option (better than a dataframe). This .npy file is what is loaded into streamlit and used to make content-based recommendations.

---
Since this data doesn't contain movie titles, we're also going to save a dataframe containing just the movie titles as a separate csv file.

In [None]:
# numpy is already imported as np, so no extra imports are needed
np.save('PCA1000features.npy', PCA_array)

In [None]:
titles_df = tfidf_df.index
titles_df = pd.DataFrame(titles_df.values, columns=['title'])
titles_df.to_csv('titles_df.csv')

Assuming you haven't run the PCA code above, for obvious reasons, you can run the cell below to load the two files that were saved in the two cells above. Then we'll assign the movie titles as the index of the PCA dataframe.

In [None]:
from numpy import load
# load array
PCA_array = load('PCA1000features.npy')
titles = pd.read_csv('titles_df.csv')

In [None]:
PCA_df = pd.DataFrame(PCA_array)

In [None]:
PCA_df.index = titles['title']

In [None]:
PCA_df.head()

Below is our function to generate content-based recommendations.

In [None]:
%%time
def content_model(movie_list,top_n=10):
    recommended_movies = []
    sims1 = cosine_similarity(PCA_df.loc[movie_list[0]].values.reshape(1, -1), PCA_df)
    sims2 = cosine_similarity(PCA_df.loc[movie_list[1]].values.reshape(1, -1), PCA_df)
    sims3 = cosine_similarity(PCA_df.loc[movie_list[2]].values.reshape(1, -1), PCA_df)
    sims1_df = pd.DataFrame(sims1.T, index=PCA_df.index,columns=['similarity_score'])
    sims2_df = pd.DataFrame(sims2.T, index=PCA_df.index,columns=['similarity_score'])
    sims3_df = pd.DataFrame(sims3.T, index=PCA_df.index,columns=['similarity_score'])
    del sims1,sims2,sims3
    sims1_df.drop(movie_list[0], inplace=True)
    sims2_df.drop(movie_list[1], inplace=True)
    sims3_df.drop(movie_list[2], inplace=True)
    sims1_df_sorted = sims1_df.sort_values(by='similarity_score', ascending=False)
    sims2_df_sorted = sims2_df.sort_values(by='similarity_score', ascending=False)
    sims3_df_sorted = sims3_df.sort_values(by='similarity_score', ascending=False)
    del sims1_df,sims2_df,sims3_df
    sims1_df_sorted = sims1_df_sorted.head(100)
    sims2_df_sorted = sims2_df_sorted.head(100)
    sims3_df_sorted = sims3_df_sorted.head(100)
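    # DataFrame.append was removed in pandas 2.0, so combine the three lists with pd.concat
    final = pd.concat([sims1_df_sorted, sims2_df_sorted, sims3_df_sorted])
    # sort before de-duplicating so that each title keeps its highest similarity score
    final.sort_values(by='similarity_score', ascending=False, inplace=True)
    final = final[~final.index.duplicated(keep='first')]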
    recommended_movies = final.head(top_n).index
    return recommended_movies
    
print(content_model(['Moana (2016)', 'Toy Story (1995)', 'Ice Age (2002)']))
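
The same idea can be sketched more compactly for any number of favourite movies by computing all similarities in one call and keeping, for each candidate, its best score against any favourite. This is a variation rather than the function above: it also removes every favourite from the results, and `recommend` and the toy feature table are hypothetical. It assumes the titles in the index are unique.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def recommend(features, favourites, top_n=10):
    '''Hypothetical helper. features: DataFrame indexed by title; favourites: list of titles.'''
    # one call gives the similarity of every favourite to every movie
    sims = cosine_similarity(features.loc[favourites], features)
    # best score per candidate across all favourites
    best = pd.Series(sims.max(axis=0), index=features.index)
    # never recommend the favourites themselves
    best = best.drop(favourites)
    return best.nlargest(top_n).index.tolist()

# toy 2-feature table (illustrative values only)
features_demo = pd.DataFrame([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.9], [-1.0, 0.0]],
                             index=['A', 'B', 'C', 'D', 'E'])
print(recommend(features_demo, ['A', 'C'], top_n=2))   # ['B', 'D']
```

With the real data, `features` would be `PCA_df` and `favourites` the list of selected titles.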