<a href="https://colab.research.google.com/github/Ms-Noxolo/Team_EN3_Jozi/blob/master/Team_EN3_JHB_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Team EN3 Unsupervised Learning predict

### Kaggle Submission: Team_EN3_

---


**Team Members:** Refiloe Phipa, Selebogo Mosoeu, Itumeleng Ngoetjana, Noxolo Kheswa, Jamie Japhta, Nkopane

**Supervisor :** Ebrahim Noormahomed

### Table of contents
---
1.   [Introduction](#intro)
  *   Background
  *   Problem statement
---
2.   [Load Dependencies](#imports)
---
3.   [Data Overview](#cleaning)
---
4.   [Data preprocessing](#preprocessing)
---
5.   [Exploratory Data Analysis](#EDA)
---
6.   [Modelling](#modelling)
---
7.   [Performance Evaluation](#evaluation)
---
8.   [Conclusion](#ending)
---
9.  [References](#ending)
















# 1. Introduction

### Background

In today's technology-driven world, recommender systems are socially and economically critical for helping individuals make appropriate choices about the content they engage with every day. Movie recommendation is one application where this is especially true: intelligent algorithms can help viewers find great titles among tens of thousands of options.

A recommender system needs broad access to a user's historical preferences; the more of this history it has, the richer its insights and the more accurate its future predictions. We can implement an unsupervised machine learning algorithm to solve this problem.

Machine learning is the study of computer algorithms that improve automatically through experience. It is a powerful branch of artificial intelligence, dating as far back as 1952. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.

Unsupervised learning is a type of machine learning that looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision.


### Problem Statement

Build an unsupervised machine learning model capable of accurately predicting how a user will rate a movie they have not yet viewed, based on their historical preferences, using content-based or collaborative filtering.


# 2. Load Dependencies

In [2]:
# importing the libraries
import numpy as np
import pandas as pd
import seaborn as sns
import math
import random
from nltk.corpus import stopwords
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse.linalg import svds
from sklearn.preprocessing import MinMaxScaler

import matplotlib.pyplot as plt 
import json
%matplotlib inline
import re
from wordcloud import WordCloud, STOPWORDS

!pip install scikit-surprise
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from surprise import NormalPredictor
from surprise import SVDpp
from surprise import NMF





In [1]:
#from google.colab import files
#uploaded = files.upload()

In [None]:
# loading in the datasets
movies = pd.read_csv("movies.csv")
links = pd.read_csv('links.csv')
imdb = pd.read_csv('imdb_data.csv')
tags = pd.read_csv('tags.csv')
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
scores = pd.read_csv('genome_scores.csv')
#Sample_Submission = pd.read_csv('sample_submission.csv')

# 3. Data Overview

Below we take a general look at the datasets, taking note of their shapes, info and features (i.e. columns), all of which will help us establish a good approach to the exploratory data analysis.

In [None]:
# The movies
print(movies.shape)
movies.head()

In [None]:
# The links
print(links.shape)
links.head()

Given the similar shapes of the movies and links (i.e. movie homepage) dataframes, we can already note that the links dataframe contains information relating to the movies file.

In [None]:
# The imdb
print(imdb.shape)
imdb.head()

This dataframe consists of additional movie metadata scraped from IMDB using the links.csv file. It includes cast/crew, budgets, plots and runtimes. The IMDB platform has its own movie-listing requirements, which could be why the dataframe doesn't capture all the movies in links.csv.

In [None]:
# The tags 
print(tags.shape)
tags.head(2)

In [None]:
print(scores.shape)
scores.head(2)

We see that the scores and tags dataframes contain information about the movies that can be used or included in creating a metadata dataframe of the movies, which can be very useful in building a suitable recommender system.

In [None]:
# The training 
print(train.shape)
train.head()

It is not surprising that this dataframe has almost 2x more entries than the movies dataset, because an individual user can watch more than one movie and can also provide ratings for a selection of movies. This dataframe can also be treated as a ratings table.

In [None]:
# Creating a new dataframe as a copy of the train data
ratings = train.copy()
ratings.head()

In [None]:
# Since we're considering the train as ratings, it is useful to check for missing values,
ratings.isna().sum()

In [None]:
# as well as the datatype:
ratings.info()

There are no missing values in the dataframe and it consists of only numeric values. That's good! However, we may need to use only a subset of the dataframe to build and train our models, as it can become impossible to work with a huge dataset depending on one's computational power.


In [None]:
# Overview of the testing dataframe
print(test.shape)
test.head()

This dataset will be used to test the algorithms built for the recommender systems.

# 4. Data preprocessing

Data preprocessing is the process of detecting and correcting corrupt or inaccurate records from the dataset and identifying incomplete, incorrect, inaccurate or irrelevant parts of the data.

In [None]:
num_users = ratings.userId.nunique()
num_items = ratings.movieId.nunique()
print('There are {} unique users and {} unique movies in this data set'.format(num_users, num_items))

In [None]:
user_maxId = ratings.userId.max()
item_maxId = ratings.movieId.max()
print('There are {} distinct users and the max of user ID is also {}'.format(num_users, user_maxId))
print('There are {} distinct movies, however, the max of movie ID is {}'.format(num_items, item_maxId))

For matrix factorization, an item vector that sits in an unnecessarily high-dimensional space requires data cleaning to reduce its dimension back to the number of items, i.e. movies.

In [None]:
def reduce_item_dim(df_ratings):
    """
    Reduce item vector dimension to the number of distinct items in our data sets
    
    input: pd.DataFrame, df_ratings should have columns ['userId', 'movieId', 'rating']
    output: pd.DataFrame, df_ratings with a new, compressed 'movieId' column
    """
    # pivot
    df_user_item = df_ratings.pivot(index='userId', columns='movieId', values='rating')
    # reset movieId
    df_user_item = df_user_item.T.reset_index(drop=True).T
    # undo pivot/melt - compress data frame
    df_ratings_new = df_user_item \
        .reset_index('userId') \
        .melt(
            id_vars='userId', 
            value_vars=df_user_item.columns,
            var_name='movieId',
            value_name='rating')
    # drop nan and final clean up
    return df_ratings_new.dropna().sort_values(['userId', 'movieId']).reset_index(drop=True)
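
The same compression can be sketched without building the full user-item pivot, which can be memory-hungry on a ratings table of this size. The version below is a minimal alternative, not the notebook's own method: it relies on `pandas.factorize` with `sort=True`, which assigns integer codes in ascending `movieId` order. The helper name `compress_movie_ids` and the toy ratings are illustrative assumptions.

```python
import pandas as pd


def compress_movie_ids(df_ratings):
    """Map movieId onto 0..n_items-1 without pivoting (hypothetical helper)."""
    out = df_ratings.copy()
    # factorize returns integer codes plus the original ids;
    # sort=True makes code k correspond to the k-th smallest movieId
    codes, original_ids = pd.factorize(out['movieId'], sort=True)
    out['movieId'] = codes
    out = out.sort_values(['userId', 'movieId']).reset_index(drop=True)
    return out, original_ids


# Toy ratings with sparse movie ids (made-up values)
toy = pd.DataFrame({
    'userId': [1, 1, 2],
    'movieId': [10, 5000, 10],
    'rating': [4.0, 3.5, 5.0],
})
compressed, id_lookup = compress_movie_ids(toy)
```

Keeping `id_lookup` around makes it possible to translate a compressed id back to the original `movieId` when writing predictions out.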

In [None]:
print('reduce item dimension before:')
ratings.head()

In [None]:
new_ratings = reduce_item_dim(ratings.copy())
print('reduce item dimension after:')
new_ratings.head()

An alternative Filtering and cleaning method:

In [None]:
# Limiting the ratings to users who have rated at least 25 movies:
ratings_f = ratings.groupby('userId').filter(lambda x: len(x) >= 25)

# Creating a list with movie titles that survived the filtering:
movie_list_rating = ratings_f.movieId.unique().tolist()

In [None]:
# The percentage of the original movie titles that we have retained in the filtered ratings:
len(ratings_f.movieId.unique())/len(movies.movieId.unique()) * 100

In [None]:
# The percentage of the users in the ratings data frame that we have retained:
len(ratings_f.userId.unique())/len(ratings.userId.unique()) * 100

Now that we have somewhat finalized the size and type of dataset we will be building and training models with, we can continue with cleaning the data some more.

In [None]:
# Using the results to filter the movies data frame (copying so later column assignments are safe):
movies = movies[movies.movieId.isin(movie_list_rating)].copy()

In [None]:
# Storing the years from the titles separately:

# We match the parentheses so we don't conflict with movies that have years in their titles
movies["year"] = movies.title.str.extract(r"\((\d{4})\)", expand=False)
# Removing the years from the 'title' column
movies["title"] = movies.title.str.replace(r"\(\d{4}\)", "", regex=True)
# Stripping any leading or trailing whitespace left behind
movies["title"] = movies["title"].str.strip()
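
To make the pattern easier to check in isolation, here is a small standalone sketch using Python's `re` module on plain strings. It assumes the same rule as the cell above (a four-digit year wrapped in parentheses); the helper name `split_title_and_year` and the sample titles are only for illustration.

```python
import re

# A four-digit year wrapped in parentheses; the group captures the digits only
YEAR_IN_PARENS = re.compile(r"\((\d{4})\)")


def split_title_and_year(raw_title):
    """Return (title_without_year, year_or_None) for one title string."""
    match = YEAR_IN_PARENS.search(raw_title)
    year = match.group(1) if match else None
    # Remove the "(YYYY)" token, then trim leftover whitespace
    title = YEAR_IN_PARENS.sub("", raw_title).strip()
    return title, year


samples = ("Toy Story (1995)", "2001: A Space Odyssey (1968)", "Untitled")
examples = [split_title_and_year(t) for t in samples]
```

Because the parentheses are part of the pattern, the leading "2001" in the second title is left alone and only the release year is pulled out.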

In [None]:
# Removing the character separating the genres for each movie
# (regex=False so that '|' is treated as a literal character)
movies['genres'] = movies['genres'].str.replace('|', ' ', regex=False)

# Removing the same character for the metadata:
imdb['title_cast'] = imdb['title_cast'].str.replace('|', ' ', regex=False)
imdb['plot_keywords'] = imdb['plot_keywords'].str.replace('|', ' ', regex=False)

In [None]:
# map movie to id:
Mapping_file = dict(zip(movies.title.tolist(), movies.movieId.tolist()))

In [None]:
# Dropping the timestamp columns, which are not used in the analysis below
tags = tags.drop(columns=['timestamp'])
ratings_f = ratings_f.drop(columns=['timestamp'])

Merging the movies and the tags dataframes and creating a metadata tag for each movie

In [None]:
# creating the mixed dataframe of movies title, genres and all user tags given to each movie
mixed = pd.merge(movies, tags, on='movieId', how='left')
mixed.head(3)

In [None]:
print(mixed.shape)

Cleaning this newly created metadata:

In [None]:
# create metadata from tags and genres
mixed.fillna("", inplace=True)
mixed = pd.DataFrame(mixed.groupby('movieId')['tag'].apply(' '.join))
Final = pd.merge(movies, mixed, on='movieId', how='left')
Final['metadata'] = Final[['tag', 'genres']].apply(' '.join, axis=1)
Final[['movieId','title','metadata']].head(3)

# 5. Exploratory Data Analysis

We need to investigate our data to see whether we can unearth any useful insights. Since the data comes from websites and much of it is text, we use Exploratory Data Analysis, aided by visuals, to analyse it and to support data-driven decisions.




### The distribution of ratings
We start with the general distribution of the ratings.

In [None]:
# Plotting the distribution of rating values
ratings['rating'].value_counts().sort_index().plot(kind='bar')
plt.ylabel('count')
plt.xlabel('movie rating')
plt.title('Count of ratings by rating value');

The most common rating is 4.0, followed by 3.0 and then 5.0. One issue is that a movie may have been rated by a single user who gave it 5.0. To curb this, we consider only movies that have been rated by 10 or more users.

In [None]:
# Evaluating the number of ratings a movie has received:
no_of_ratings = ratings.groupby('movieId').count()['rating']
no_of_ratings = no_of_ratings[no_of_ratings >= 10]
no_of_ratings

In [None]:
# How many ratings remain once we keep only these movies?
new_ratings = ratings[ratings['movieId'].isin(no_of_ratings.index)]
len(new_ratings)

In [2]:
# How do these numbers look visually?
plt.figure(figsize=(8,5))
new_ratings['rating'].value_counts().sort_index().plot(kind='bar')
plt.ylabel('count')
plt.xlabel('movie rating')
plt.title('Count of ratings for movies with 10 or more ratings');

The most common rating is still 4.0, followed by 3.0 and then 5.0, with 0.5 and 1.5 remaining the least common.

In [None]:
# Average rating of movies in the database
avg_rating = new_ratings.groupby('movieId')['rating'].mean()

### The number of movies that are being released
Knowing the numbers around movies can help paint a picture of the relationship that exists between movies and users, as the availability of movies can play a huge role in the general growth of the movie-audience industry.

In [None]:
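# The number of movies released per year
# Assumption: `metadata` is not defined earlier in this notebook; here it is taken
# to be the movies table joined with the IMDB data on movieId
metadata = pd.merge(movies, imdb, on='movieId', how='left')
num = metadata.groupby('year').count()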
plt.figure(figsize=(35,25))
plt.plot(num.index, num['budget'])
plt.xlabel("years", size=25)
plt.xticks(rotation='vertical')
plt.ylabel('No. of Movies', size=25)
plt.title('Number of Movies Released By year', size=25)
plt.show()

Although there have been some drops in the number of movies released over the years, there has clearly been significant overall growth, with the growth becoming exponential around the 1990s.

### The years with majority of movies being released
Now we want to find out which years are dominating the movie industry.

In [None]:
# Looking at all the years of movie releases
year_corpus = metadata['year'].value_counts()
# Generating the word cloud
year_wordcloud = WordCloud(background_color='white', height=2000, width=4000).generate_from_frequencies(year_corpus)
plt.figure(figsize=(16,8))
plt.imshow(year_wordcloud)
plt.axis('off')
plt.show()

We can see that the word cloud corresponds with the plot above: the movie industry grew exponentially in the 2000s, with 2015 and 2016 being the most frequent years of movie releases.

### The kind of movies that are being released
Now we want to find out which movies, in terms of genre are dominating the movie industry.

In [None]:
# Looking at the genres and checking which ones dominate
metadata['genres'] = metadata['genres'].astype('str')
genre_corpus = ' '.join(metadata['genres'])
# Defining the stopwords
stopword = ['no genres', 'no', 'genres', 'genre', 'listed']
# Generating the word cloud
genre_wordcloud = WordCloud(stopwords=stopword, background_color='white', height=2000, width=4000).generate(genre_corpus)
plt.figure(figsize=(16,8))
plt.imshow(genre_wordcloud)
plt.axis('off')
plt.show()

We can see that the majority of the movies in the dataset are Comedy, Drama and Romance.

In [None]:
# Looking at the titles and checking for any similarity
metadata['title'] = metadata['title'].astype('str')
title_corpus = ' '.join(metadata['title'])
# Generating the word cloud
title_wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', height=2000, width=4000).generate(title_corpus)
plt.figure(figsize=(16,8))
plt.imshow(title_wordcloud)
plt.axis('off')
plt.show()

As the word cloud suggests, many movies tell the stories of a boy and/or girl, and many are about wars, crime and America, or are sequels, as indicated by "II". These correspond to the genres unpacked above.

The dataset consists of 27248 movies for which we have data on overview, cast/crew and budget. This is close to only 44% of the entire dataset. Although this is less than 50% of the entire dataset, it is more than enough to perform very useful analysis and discover interesting insights about the world of movies.

In [None]:
# Looking at the plots and checking for any similarity
metadata['plot_keywords'] = metadata['plot_keywords'].astype('str')
overview_corpus = ' '.join(metadata['plot_keywords'])
# Generating the word cloud
plot_wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', height=2000, width=4000).generate(overview_corpus)
plt.figure(figsize=(16,8))
plt.imshow(plot_wordcloud)
plt.axis('off')
plt.show()

### The runtime of movies that are being released
Movies have progressed in terms of runtime, from one-minute silent, black-and-white clips to epic three-hour CGI productions. In this section, we try to gain some additional insights into the nature of movie lengths and their evolution over time.

First, we want to find out how long the released movies are.

In [None]:
# converting the column to numeric
metadata['runtime'] = pd.to_numeric(metadata['runtime'])

# Viewing relative durations of the movies
metadata['runtime'].describe()


This is only a subset of the dataset, i.e. 22%, and from it we can see that the average length of a movie is about 1 hour and 40 minutes. The longest movie recorded in this dataset is 877 minutes (over 14 hours) long.

In [None]:
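# The distribution of runtimes for mainstream movies, i.e. movies shorter than 200 minutes (3 hours 20 minutes)
plt.figure(figsize=(12,6))
runtimes = metadata[(metadata['runtime'] < 200) & (metadata['runtime'] > 0)]['runtime']
# sns.distplot is deprecated; histplot with a KDE overlay gives the equivalent view
sns.histplot(runtimes, kde=True, stat='density')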

Next, we look for possible trends in what may be considered an appropriate movie length across the years.

In [None]:
# Looking at the shortest Movies
metadata[metadata['runtime'] > 0][['runtime', 'title', 'year']].sort_values('runtime').head(10)

The majority of the shortest movies were filmed in the late 1890s and at the beginning of the 20th century, and they are only a minute long. The exceptions in the top 10 are Fresh Guacamole, released in 2012, and Curb Dance, released in 2010, both two minutes long.

In [None]:
# Looking at the longest Movies
metadata[metadata['runtime'] > 0][['runtime', 'title', 'year']].sort_values('runtime', ascending=False).head(10)

Notably, almost all the entries in the above list were released in the 2000s and are actually miniseries and sequels, so they can't count as feature-length films. There isn't much insight we can gather from this, as there is no way of distinguishing feature-length films from TV miniseries in our dataset unless it is done manually, which could take days.

# 6. Modelling

## Collaborative Filtering

Collaborative filtering is a technique that filters out items a user might like on the basis of reactions by similar users.

It works by searching for a group of people with tastes similar to those of a specific user. It predicts a rating for a user-item pair based on the history of ratings given by the user and given to the item.
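
The idea of "similar users" can be sketched in a few lines of plain Python. This is a toy user-based example, not the algorithm used by the Surprise models below: the similarity measure (one over one plus the mean absolute rating gap on shared movies), the function names and the ratings are all assumptions chosen for readability.

```python
# Made-up ratings: user -> {movie: rating}
ratings_by_user = {
    'ann': {'A': 5.0, 'B': 4.0, 'C': 1.0},
    'bob': {'A': 5.0, 'B': 4.0, 'C': 1.0, 'D': 5.0},
    'cat': {'A': 1.0, 'B': 2.0, 'C': 5.0, 'D': 1.0},
}


def similarity(u, v):
    """Similarity in (0, 1]: 1 / (1 + mean absolute rating gap on shared movies)."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    gap = sum(abs(u[m] - v[m]) for m in shared) / len(shared)
    return 1.0 / (1.0 + gap)


def predict_rating(user, movie, data):
    """Similarity-weighted average of other users' ratings for `movie`."""
    weighted_sum = weight_total = 0.0
    for other, their_ratings in data.items():
        if other == user or movie not in their_ratings:
            continue
        weight = similarity(data[user], their_ratings)
        weighted_sum += weight * their_ratings[movie]
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None


# ann has not rated movie D; bob (same tastes) counts far more than cat
predicted = predict_rating('ann', 'D', ratings_by_user)
```

Here `bob` agrees with `ann` on every shared movie, so his rating of `D` dominates the weighted average, while `cat`, whose tastes are nearly opposite, contributes only a little.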

In [None]:
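# Ratings in this dataset run from 0.5 to 5.0 (Surprise's default scale is 1 to 5)
reader = Reader(rating_scale=(0.5, 5.0))
train = Dataset.load_from_df(train[['userId', 'movieId', 'rating']], reader)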

### SVD

Singular Value Decomposition (SVD) is a method from linear algebra that is widely used as a dimensionality reduction technique in machine learning. It is a matrix factorisation technique that reduces the number of features of a dataset by reducing the space from N dimensions to K dimensions (where K < N).
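
The reduction from N to K dimensions can be illustrated with NumPy on a tiny, fully observed matrix. This is a sketch of the classical linear-algebra SVD only; Surprise's `SVD` class is a related matrix-factorisation model fitted on the observed ratings rather than a direct call to this decomposition. The matrix values and the choice `K = 2` are made up.

```python
import numpy as np

# Toy user-item matrix (rows: users, columns: movies); values are made up
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [2.0, 1.0, 4.0, 5.0],
])

# Full decomposition: R = U @ diag(s) @ Vt, singular values in descending order
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep only the K largest singular values (K < N)
K = 2
R_k = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]

# Frobenius-norm error of the rank-K approximation
error = np.linalg.norm(R - R_k)
```

The error equals the root of the sum of the squared discarded singular values, so a sharp drop-off in `s` means few dimensions are needed to describe the ratings well.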

In [None]:
svd = SVD()
cross_validate(svd, train, measures=['RMSE', 'MAE'])
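
For reference, the two measures requested above are defined as follows. The sketch computes them by hand on made-up numbers; it is only meant to show what RMSE and MAE measure, not to reproduce Surprise's internals.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up ratings and predictions, for illustration only
true_ratings = [4.0, 3.0, 5.0, 2.0]
predicted_ratings = [3.5, 3.0, 4.0, 3.0]

rmse_value = rmse(true_ratings, predicted_ratings)
mae_value = mae(true_ratings, predicted_ratings)
```

RMSE squares each error before averaging, so it penalises large misses more heavily than MAE does; lower is better for both.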

Now we train the model on the full training set so that it can make predictions.

In [None]:
data_train = train.build_full_trainset()
svd.fit(data_train)

Breakdown of what is put into the model:


* a = userId
* b = movieId
* c = rating (the user's actual rating, if known)
* d = expected rating (the prediction returned by the model)

In [None]:
a = 
b =
c =

In [None]:
d = svd.predict(a, b, c)
print (d)

For our prediction, we give the model a different user's information (userId, movieId and their rating). It then estimates what that user may rate this movie.

Comparing other recommendation algorithms

In [None]:
norm = NormalPredictor()
cross_validate(norm, train, measures=['RMSE', 'MAE'])
data_train = train.build_full_trainset()
norm.fit(data_train)
d = norm.predict(a, b, c)
print ('Normal predictor', d)

svd2 = SVDpp()
cross_validate(svd2, train, measures=['RMSE', 'MAE'])
data_train = train.build_full_trainset()
svd2.fit(data_train)
d = svd2.predict(a, b, c)
print ('SVD++', d)

nmf = NMF()
cross_validate(nmf, train, measures=['RMSE', 'MAE'])
data_train = train.build_full_trainset()
nmf.fit(data_train)
d = nmf.predict(a, b, c)
print ('Non-negative Matrix Factorization', d)

## Content Based Filtering

A content-based filter uses attributes to recommend similar content.

In a content-based recommender system, keywords are used to describe the items and a user profile is built to indicate the type of item this user likes. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past.

In [3]:
from google.colab import files
uploaded = files.upload()

Saving imdb_data.csv to imdb_data (3).csv
Saving movies.csv to movies (2).csv


In [6]:
#In order to best recommend a movie, we need to look at the list of movies we have. 
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
movies = pd.read_csv('movies.csv')
imdb = pd.read_csv('imdb_data.csv')

Other features that could help or influence what a user may like include the cast, director or keywords.

This data can all be found in the imdb dataset.

In [7]:
imdb.head()

Unnamed: 0,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,81.0,"$30,000,000",toy|rivalry|cowboy|cgi animation
1,2,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,104.0,"$65,000,000",board game|adventurer|fight|game
2,3,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,101.0,"$25,000,000",boat|lake|neighbor|rivalry
3,4,Whitney Houston|Angela Bassett|Loretta Devine|...,Terry McMillan,124.0,"$16,000,000",black american|husband wife relationship|betra...
4,5,Steve Martin|Diane Keaton|Martin Short|Kimberl...,Albert Hackett,106.0,"$30,000,000",fatherhood|doberman|dog|mansion


### Combine relevant datasets

The movie database has movie titles and genres and the imdb database has the cast, directors and keywords.
We make it easier to view all of this information by combining them into one dataset.  

In [8]:
alls = pd.merge(movies, imdb)
alls.head()

Unnamed: 0,movieId,title,genres,title_cast,director,runtime,budget,plot_keywords
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,81.0,"$30,000,000",toy|rivalry|cowboy|cgi animation
1,2,Jumanji (1995),Adventure|Children|Fantasy,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,104.0,"$65,000,000",board game|adventurer|fight|game
2,3,Grumpier Old Men (1995),Comedy|Romance,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,101.0,"$25,000,000",boat|lake|neighbor|rivalry
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Whitney Houston|Angela Bassett|Loretta Devine|...,Terry McMillan,124.0,"$16,000,000",black american|husband wife relationship|betra...
4,5,Father of the Bride Part II (1995),Comedy,Steve Martin|Diane Keaton|Martin Short|Kimberl...,Albert Hackett,106.0,"$30,000,000",fatherhood|doberman|dog|mansion


Check for null values and replace them.

In [9]:
alls.isna().sum()

movieId              0
title                0
genres               0
title_cast        9665
director          9519
runtime          11345
budget           17583
plot_keywords    10482
dtype: int64

In [10]:
alls['title_cast'] = alls['title_cast'].fillna('')
alls['director'] = alls['director'].fillna('')
alls['plot_keywords'] = alls['plot_keywords'].fillna('')
alls.isna().sum()

movieId              0
title                0
genres               0
title_cast           0
director             0
runtime          11345
budget           17583
plot_keywords        0
dtype: int64

In [11]:
alls.head()

Unnamed: 0,movieId,title,genres,title_cast,director,runtime,budget,plot_keywords
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,81.0,"$30,000,000",toy|rivalry|cowboy|cgi animation
1,2,Jumanji (1995),Adventure|Children|Fantasy,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,104.0,"$65,000,000",board game|adventurer|fight|game
2,3,Grumpier Old Men (1995),Comedy|Romance,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,101.0,"$25,000,000",boat|lake|neighbor|rivalry
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Whitney Houston|Angela Bassett|Loretta Devine|...,Terry McMillan,124.0,"$16,000,000",black american|husband wife relationship|betra...
4,5,Father of the Bride Part II (1995),Comedy,Steve Martin|Diane Keaton|Martin Short|Kimberl...,Albert Hackett,106.0,"$30,000,000",fatherhood|doberman|dog|mansion


### Drop columns

Runtime and budget do not really influence a user's choice in movie preferences, so we drop them. We also drop movieId, which is not needed for the steps that follow.

In [12]:
alls = alls.drop(['runtime', 'budget'], axis = 1)
alls = alls.drop('movieId', axis = 1)

In [13]:
alls.isna().sum()

title            0
genres           0
title_cast       0
director         0
plot_keywords    0
dtype: int64

### Removing unnecessary characters

We remove any extra symbols that could interfere with our recommendation system, as well as the years in our title column.

In [14]:
alls['title_cast'] = alls.title_cast.str.split('|')
alls['plot_keywords'] = alls.plot_keywords.str.split('|')
# A single capture group, so that str.extract returns a Series (the title without its year)
alls['title'] = alls['title'].str.extract(r'(.*)\(\d{4}\)', expand=False).str.strip()
alls['genres'] = alls.genres.str.split('|')

In [15]:
alls.head()

Unnamed: 0,title,genres,title_cast,director,plot_keywords
0,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]","[Tom Hanks, Tim Allen, Don Rickles, Jim Varney...",John Lasseter,"[toy, rivalry, cowboy, cgi animation]"
1,Jumanji,"[Adventure, Children, Fantasy]","[Robin Williams, Jonathan Hyde, Kirsten Dunst,...",Jonathan Hensleigh,"[board game, adventurer, fight, game]"
2,Grumpier Old Men,"[Comedy, Romance]","[Walter Matthau, Jack Lemmon, Sophia Loren, An...",Mark Steven Johnson,"[boat, lake, neighbor, rivalry]"
3,Waiting to Exhale,"[Comedy, Drama, Romance]","[Whitney Houston, Angela Bassett, Loretta Devi...",Terry McMillan,"[black american, husband wife relationship, be..."
4,Father of the Bride Part II,[Comedy],"[Steve Martin, Diane Keaton, Martin Short, Kim...",Albert Hackett,"[fatherhood, doberman, dog, mansion]"


We then convert the data to lowercase and remove any spaces so that each full name becomes a single token. This ensures our system does not confuse names or movies that start with the same first word; e.g. 'James Bond' and 'James Carter' would otherwise share the token 'james' and may come across as the same character.

In [22]:
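def clean(x):
    # Lowercase and remove spaces from a list of strings or from a single string
    if isinstance(x, list):
        return [i.replace(" ", "").lower() for i in x]
    if isinstance(x, str):
        return x.replace(" ", "").lower()
    # Anything else (e.g. NaN) becomes an empty string
    return ''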

In [23]:
for items in alls:
    alls[items] = alls[items].apply(clean)

In [24]:
alls.head()

Unnamed: 0,index,title,genres,title_cast,director,plot_keywords,soup
0,,toystory,"[adventure, animation, children, comedy, fantasy]","[tomhanks, timallen, donrickles, jimvarney, wa...",johnlasseter,"[toy, rivalry, cowboy, cgianimation]",adventureanimationchildrencomedyfantasytomhank...
1,,jumanji,"[adventure, children, fantasy]","[robinwilliams, jonathanhyde, kirstendunst, br...",jonathanhensleigh,"[boardgame, adventurer, fight, game]",adventurechildrenfantasyrobinwilliamsjonathanh...
2,,grumpieroldmen,"[comedy, romance]","[waltermatthau, jacklemmon, sophialoren, ann-m...",markstevenjohnson,"[boat, lake, neighbor, rivalry]",comedyromancewaltermatthaujacklemmonsophialore...
3,,waitingtoexhale,"[comedy, drama, romance]","[whitneyhouston, angelabassett, lorettadevine,...",terrymcmillan,"[blackamerican, husbandwiferelationship, betra...",comedydramaromancewhitneyhoustonangelabassettl...
4,,fatherofthebridepartii,[comedy],"[stevemartin, dianekeaton, martinshort, kimber...",alberthackett,"[fatherhood, doberman, dog, mansion]",comedystevemartindianekeatonmartinshortkimberl...


Now that we have cleaned the relevant data, we create the metadata soup that will be used in vectorizing.

In [25]:
def soup(x):
    return ' '.join(x['genres']) + ' ' + ' '.join(x['title_cast']) + ' ' + x['director'] + ' ' + ' '.join(x['plot_keywords'])
alls['soup'] = alls.apply(soup, axis=1)

###Reset the index
This function generates a new dataframe or series setting the indices in order, starting from 0, making it easier to work with the dataframe/series.

In [26]:
alls = alls.reset_index()
indices = pd.Series(alls.index, index=alls['title'])

### CountVectorizer

It provides a simple way to both tokenize a collection of text documents and build a vocabulary of known words.
It also enables preprocessing of the text data before generating the vector representation, making it a highly flexible feature representation module for text.

In [27]:
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(alls['soup'])
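
What the vectorizer produces can be mimicked by hand on two tiny soups. This is a simplified sketch: it splits on whitespace and ignores stop words, whereas `CountVectorizer` applies its own tokenisation rules. The two documents reuse names from the tables above, and `to_count_vector` is a hypothetical helper.

```python
from collections import Counter

# Two tiny "soups" built from names that appear in the tables above
docs = [
    "adventure children fantasy robinwilliams",
    "adventure animation children comedy fantasy tomhanks",
]

# Vocabulary: every distinct token, in sorted order, mapped to a column index
all_tokens = sorted({tok for doc in docs for tok in doc.split()})
vocabulary = {tok: i for i, tok in enumerate(all_tokens)}


def to_count_vector(doc):
    """Count how often each vocabulary token occurs in one document."""
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in vocabulary]


count_vectors = [to_count_vector(d) for d in docs]
```

Each row of `count_vectors` plays the role of one row of `count_matrix`: one column per vocabulary word, one count per movie.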

### Cosine Similarity

A metric used to determine how similar two items are. It measures the similarity between two non-zero vectors of an inner product space as the cosine of the angle between them.

In [None]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)
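
For two count vectors u and v, the cosine similarity is the dot product divided by the product of their lengths. The sketch below computes it by hand for made-up vectors over a four-word vocabulary; `cosine` is an illustrative helper, while the cell above uses scikit-learn's `cosine_similarity` on the whole matrix at once.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Count vectors over a made-up four-word vocabulary
movie_a = [1, 1, 0, 1]
movie_b = [1, 0, 1, 1]
movie_c = [0, 0, 1, 0]

sim_ab = cosine(movie_a, movie_b)  # the two movies share two words
sim_ac = cosine(movie_a, movie_c)  # the two movies share no words
```

With non-negative counts the score lies between 0 (no words in common) and 1 (identical word proportions), which is why sorting by it in descending order surfaces the most similar movies first.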

The final step before asking for a recommendation is to write a function that considers all our relevant features along with the cosine similarity and returns a list of recommended movies.

In [None]:
def reco(title, cosine_sim = cosine_sim):
    
    index = indices[title]

    sim_scores = list(enumerate(cosine_sim[index]))
    sims = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sims = sims[1:11]

    movie_indices = [i[0] for i in sims]

    # Return the top 10 most similar movies
    return alls['title'].iloc[movie_indices]

Finally we are now able to suggest a movie and find other similar movies for our user to watch.

In [21]:
reco('jumanji', cosine_sim)

KeyError: ignored
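
The `KeyError` output above does not say which key was missing. One plausible reading, given that this cell's counter (In [21]) precedes the cleaning cells (In [22] onwards), is that `'jumanji'` was not yet a key in `indices` when the cell ran. The sketch below shows a defensive lookup under that assumption; `find_row` and `toy_indices` are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the `indices` Series built above (cleaned title -> row position)
toy_indices = pd.Series([0, 1, 2], index=['toystory', 'jumanji', 'grumpieroldmen'])


def find_row(title, index_series):
    """Normalise a title the way `clean` does and return its row position, or None."""
    key = title.replace(" ", "").lower()
    if key not in index_series.index:
        return None
    position = index_series[key]
    # Duplicate titles make the lookup return a Series; keep the first match
    if isinstance(position, pd.Series):
        position = position.iloc[0]
    return int(position)
```

A recommender function could call such a helper first and return an empty result when it yields `None`, instead of raising on an unknown title.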

# 7. Performance Evaluation

# 8. Conclusion

# 9. References

1. Beginner Tutorial: Recommender Systems in Python
https://www.datacamp.com/community/tutorials/recommender-systems-python

2. Build a Recommendation Engine With Collaborative Filtering
https://realpython.com/build-recommendation-engine-collaborative-filtering/

3. How to Build Simple Recommender Systems in Python
https://medium.com/swlh/how-to-build-simple-recommender-systems-in-python-647e5bcd78bd

4. Introduction to Recommendation System. Part 1
https://hackernoon.com/introduction-to-recommender-system-part-1-collaborative-filtering-singular-value-decomposition-44c9659c5e75

5. Building a Recommender System With Pandas
https://medium.com/towards-artificial-intelligence/building-a-recommender-system-with-pandas-1ca0bb03fdce