<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# DSI-13 Capstone Project: 
# Recommender Systems

_Authors: Davis Hong_

---

### Problem Statement:
A recommender system predicts a user's future preferences over a set of items and recommends either items similar to what the user already likes (content-based) or items preferred by other users with similar tastes (collaborative). When the brick-and-mortar business model was the norm, recommendations were usually made by word of mouth. With the internet, however, large amounts of data combined with good prediction models can make recommendations fairly accurate. For businesses that rely heavily on the internet for revenue, recommender systems can help boost growth and increase revenue.


### Objective:
There are many types of recommender systems that can be used to recommend movies, but this project focuses on the two models below.

The objective is to build a simple non-machine-learning model and use its results as the baseline. A machine-learning-based model will then be compared against this baseline and is expected to produce better results.

<a id="two-classical-recommendation-methods"></a>
### Two classical recommendation methods

- **Collaborative Filtering**: _(similar people)_
    - If you like the same 5 movies as someone else, you'll likely enjoy other movies they like.
    - There are two main types: (a) Find users who are similar and recommend what they like (**user-based**), or (b) recommend items that are similar to already-liked items (**item-based**).
   

- **Content-Based Filtering** _(similar items)_
    - If you enjoy certain characteristics of movies (e.g. certain actors, genre, etc.), you'll enjoy other movies with those characteristics.
    - Note this can easily be done using machine learning methods! Each movie can be decomposed into features. Then, for each user we compute a model -- the target can be a binary classifier (e.g. "LIKE"/"DISLIKE") or regression (e.g. star rating).
<br>
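
The content-based idea above (decompose each movie into features, then fit one model per user) can be sketched on toy data. This is only an illustration under stated assumptions: the titles, genres and star ratings are made up, and a plain least-squares fit stands in for whatever regression model would actually be chosen.

```python
import numpy as np
import pandas as pd

# Toy catalogue; titles and genres are illustrative, not taken from the dataset.
toy_movies = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'genres': ['Comedy|Romance', 'Action', 'Comedy', 'Action|Thriller'],
})

# Decompose each movie into binary genre features (one column per genre).
features = toy_movies['genres'].str.get_dummies(sep='|')

# One user's star ratings for the first three movies (hypothetical values).
rated = features.iloc[:3].to_numpy(dtype=float)
stars = np.array([5.0, 1.0, 4.0])

# Per-user linear model: stars ~ rated @ weights, solved by least squares.
weights, *_ = np.linalg.lstsq(rated, stars, rcond=None)

# Predict the user's rating for the unseen movie D from its genre features.
predicted = float(features.iloc[3].to_numpy(dtype=float) @ weights)
print(features.columns.tolist())
print(round(predicted, 2))
```

With so few ratings the fit is exact and says little about generalisation; a real per-user model would need regularisation and a held-out set.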


### Data Source:
The MovieLens datasets will be used for the project.

Datasets can be found at https://grouplens.org/datasets/movielens/1m/

Data dictionary: http://files.grouplens.org/datasets/movielens/ml-1m-README.txt


### Potential challenges and obstacles:
There is no single metric that tells us whether a recommender engine is good, so some human interpretation of the results will be required to gauge the performance of the models.

### Measuring results:
Whenever possible, a baseline score will be set and metrics such as MSE and RMSE will be used to measure the effectiveness of the model.
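
As a minimal sketch of how such a comparison might look, the snippet below computes MSE and RMSE for a set of predictions and for a naive baseline that always predicts the mean rating. The rating values and the helper names `mse` and `rmse` are assumptions made for illustration.

```python
import numpy as np

# Hypothetical held-out ratings and model predictions.
actual = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.5, 3.0, 4.0, 3.0])

# Naive baseline: always predict the mean rating
# (in practice the mean would come from the training set).
baseline = np.full_like(actual, actual.mean())

def mse(y_true, y_pred):
    """Mean squared error."""
    return float(np.mean((y_true - y_pred) ** 2))

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the ratings."""
    return float(np.sqrt(mse(y_true, y_pred)))

print(f'Baseline RMSE: {rmse(actual, baseline):.3f}')
print(f'Model RMSE:    {rmse(actual, predicted):.3f}')
```

A model is only worth keeping if its error is clearly below the baseline's.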

Barring unforeseen circumstances, e.g. interruptions caused by COVID-19, this project should be achievable because the data is available. Some data cleaning is expected, but it should be minimal.

### Let's get started...

### Load required libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import sparse
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn import preprocessing
sns.set()
sns.set_style('darkgrid')
plt.style.use('fivethirtyeight')
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

## Load datasets
---

We'll be using the [MovieLens](https://grouplens.org/datasets/movielens/) dataset for building our recommendation engine. Three datasets, namely movies, ratings and users, will be used to build the recommender system.

In [2]:
pwd

'P:\\GitHub\\General-Assembly-Projects\\Capstone'

In [3]:
# Forward slashes avoid invalid escape sequences such as '\d' and also work on Windows
movies = pd.read_csv('./datasets/movies.csv')
ratings = pd.read_csv('./datasets/ratings.csv')
users = pd.read_csv('./datasets/users.csv')

In [4]:
print(f'Shape of Movies dataframe: {movies.shape}\n')
print(f'Any null values in the dataframe:\n{movies.isnull().sum()}\n')
print(f'Data types:\n{movies.dtypes}')
movies.head()

Shape of Movies dataframe: (9742, 3)

Any null values in the dataframe:
movieId    0
title      0
genres     0
dtype: int64

Data types:
movieId     int64
title      object
genres     object
dtype: object


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
print(f'Shape of Ratings dataframe: {ratings.shape}\n')
print(f'Any null values in the dataframe:\n{ratings.isnull().sum()}\n')
print(f'Data types:\n{ratings.dtypes}')
ratings.head()

Shape of Ratings dataframe: (100836, 4)

Any null values in the dataframe:
userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

Data types:
userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object


Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [6]:
# Rename the movieId column of the movies dataframe to lower case
# (the ratings headers are lower-cased in a later cell)
movies.rename(columns = {'movieId': 'movieid'}, inplace = True)
movies.head(2)

Unnamed: 0,movieid,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy


In [7]:
print(f'Shape of Users dataframe: {users.shape}\n')
print(f'Any null values in the dataframe:\n{users.isnull().sum()}\n')
print(f'Data types:\n{users.dtypes}')
users.head()

Shape of Users dataframe: (6040, 5)

Any null values in the dataframe:
userid        0
gender        0
age           0
occupation    0
zipcode       0
dtype: int64

Data types:
userid         int64
gender        object
age            int64
occupation     int64
zipcode       object
dtype: object


Unnamed: 0,userid,gender,age,occupation,zipcode
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455


In [8]:
ratings.columns = map(str.lower, ratings.columns)
ratings.head(2)

Unnamed: 0,userid,movieid,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
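
Since `users` and `ratings` are loaded from separate files, it may be worth checking how their `userid` keys overlap before relying on a join between them. The sketch below shows one way to do that on toy frames; `toy_users` and `toy_ratings` and their values are hypothetical stand-ins for the real dataframes.

```python
import pandas as pd

# Hypothetical stand-ins for the users and ratings dataframes.
toy_users = pd.DataFrame({'userid': [1, 2, 3, 4],
                          'gender': ['F', 'M', 'M', 'F']})
toy_ratings = pd.DataFrame({'userid': [1, 1, 3, 9],
                            'movieid': [10, 20, 10, 30],
                            'rating': [4.0, 3.5, 5.0, 2.0]})

rating_ids = set(toy_ratings['userid'])
user_ids = set(toy_users['userid'])

# Raters that have no matching row in the users frame.
missing_profiles = sorted(rating_ids - user_ids)
# Users that never rated anything.
silent_users = sorted(user_ids - rating_ids)

print(f'Raters without a profile: {missing_profiles}')
print(f'Users without ratings:    {silent_users}')
```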


In [9]:
# Per the data dictionary, the numeric codes used in the users dataset
# contain categorical information.

# Create dictionary for age group.
data = {'age': [1, 18, 25, 35, 45, 50, 56], 
        'agegrp': ['Under 18', '18-24', '25-34', '35-44', '45-49', '50-55', '56+']} 

# Create DataFrame for age group
age_grp = pd.DataFrame(data)

# Create dictionary for occupation
data = {'code': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], 
        'occupation': ['other', 'academic/educator', 'artist', 'cleric/admin',
                       'college/grad student', 'customer service',
                       'doctor/health care', 'executive/managerial', 'farmer',
                       'homemaker', 'K-12 student', 'lawyer', 'programmer',
                      'retired', 'sales/marketing', 'scientist',
                      'self-employed', 'technician/engineer',
                      'tradesman/craftsman', 'unemployed', 'writer']} 

# Create DataFrame for occupation
occupation = pd.DataFrame(data)
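
The lookup frames above are only created at this point. One way they could be applied to decode the numeric codes is sketched below, assuming a left merge for the age group and `Series.map` for the occupation. The small frames here are cut-down copies made for illustration (`toy_users`, `age_lookup` and `occ_lookup` are hypothetical names), using the first three user rows shown earlier.

```python
import pandas as pd

# Cut-down copies of the users frame and the two lookup frames.
toy_users = pd.DataFrame({'userid': [1, 2, 3],
                          'age': [1, 56, 25],
                          'occupation': [10, 16, 15]})
age_lookup = pd.DataFrame({'age': [1, 25, 56],
                           'agegrp': ['Under 18', '25-34', '56+']})
occ_lookup = pd.DataFrame({'code': [10, 15, 16],
                           'occupation': ['K-12 student', 'scientist',
                                          'self-employed']})

# Left merge keeps every user and attaches the age-group label.
labelled = toy_users.merge(age_lookup, on='age', how='left')

# Map occupation codes to labels through a code -> label Series.
code_to_label = occ_lookup.set_index('code')['occupation']
labelled['occupation'] = labelled['occupation'].map(code_to_label)

print(labelled)
```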

In [10]:
age_grp

Unnamed: 0,age,agegrp
0,1,Under 18
1,18,18-24
2,25,25-34
3,35,35-44
4,45,45-49
5,50,50-55
6,56,56+


In [11]:
occupation

Unnamed: 0,code,occupation
0,0,other
1,1,academic/educator
2,2,artist
3,3,cleric/admin
4,4,college/grad student
5,5,customer service
6,6,doctor/health care
7,7,executive/managerial
8,8,farmer
9,9,homemaker


## Drop unnecessary columns
---

We won't need the `timestamp` column from `ratings`, nor will we need the `genres` column from `movies`. Drop both columns in the cell below.

In [12]:
movies.drop('genres', axis = 1, inplace = True)
ratings.drop('timestamp', inplace = True, axis = 1)

## Merge `movies` and `ratings`
---

Use `pd.merge` to **inner join** `movies` with `ratings` on the `movieid` column.

In [13]:
movie_ratings = pd.merge(ratings, movies, on = 'movieid')
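
`pd.merge` performs an inner join by default, so rows whose `movieid` appears in only one frame are silently dropped. The toy example below, with made-up rows, illustrates this and uses an outer join with `indicator=True` to show which rows would be lost.

```python
import pandas as pd

# Toy frames: movieid 99 has no title, movieid 2 has no ratings.
toy_ratings = pd.DataFrame({'userid': [1, 2, 2],
                            'movieid': [1, 1, 99],
                            'rating': [4.0, 3.0, 5.0]})
toy_movies = pd.DataFrame({'movieid': [1, 2],
                           'title': ['Toy Story (1995)', 'Jumanji (1995)']})

# Inner join keeps only movieids present in both frames.
inner = pd.merge(toy_ratings, toy_movies, on='movieid', how='inner')

# Outer join with an indicator column reveals the unmatched rows.
outer = pd.merge(toy_ratings, toy_movies, on='movieid',
                 how='outer', indicator=True)

print(len(inner))
print(outer['_merge'].value_counts())
```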

In [14]:
# Print first and last 5 rows of the merged dataframe
print(movie_ratings.head())
print(movie_ratings.tail())

   userid  movieid  rating             title
0       1        1     4.0  Toy Story (1995)
1       5        1     4.0  Toy Story (1995)
2       7        1     4.5  Toy Story (1995)
3      15        1     2.5  Toy Story (1995)
4      17        1     4.5  Toy Story (1995)
        userid  movieid  rating                             title
100831     610   160341     2.5                  Bloodmoon (1997)
100832     610   160527     4.5  Sympathy for the Underdog (1971)
100833     610   160836     3.0                     Hazard (2005)
100834     610   163937     3.5                Blair Witch (2016)
100835     610   163981     3.5                         31 (2016)


In [15]:
movie_ratings.to_csv('./output/movielens.csv', index = False, header = True)

## Create pivot table
---

Because we're creating an item-based (movie-based) collaborative recommender, we'll set up our pivot table as follows:
1. The `title` will be the index
2. The `userid` will be the column
3. The `rating` will be the value

**If we were building a user-based collaborative recommender, what would change about this pivot table?**
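
One way to explore this question is to build both orientations on a toy frame and compare them. The data below is invented for illustration; only the `index` and `columns` arguments differ between the two calls.

```python
import pandas as pd

# Toy ratings: three users, two movies.
toy = pd.DataFrame({
    'userid': [1, 1, 2, 3],
    'title': ['A', 'B', 'A', 'B'],
    'rating': [4.0, 3.0, 5.0, 2.0],
})

# Item-based layout: one row per movie, one column per user.
item_pivot = pd.pivot_table(toy, index='title',
                            columns='userid', values='rating')

# User-based layout: one row per user, one column per movie.
user_pivot = pd.pivot_table(toy, index='userid',
                            columns='title', values='rating')

print(item_pivot.shape, user_pivot.shape)
```

Pairwise distances are computed between rows, so the row dimension decides whether movies or users are being compared.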

In [16]:
movie_ratings.head(2)

Unnamed: 0,userid,movieid,rating,title
0,1,1,4.0,Toy Story (1995)
1,5,1,4.0,Toy Story (1995)


In [17]:
pivot = pd.pivot_table(movie_ratings, index = 'title', 
                       columns='userid', values = 'rating')
pivot.head(2)

userid,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
'71 (2014),,,,,,,,,,,...,,,,,,,,,,4.0
'Hellboy': The Seeds of Creation (2004),,,,,,,,,,,...,,,,,,,,,,


## Create sparse matrix
Before calculating the cosine distance between movies with the `pairwise_distances` function, we convert the pivot table into a sparse matrix (datatype) using `scipy`'s `sparse` module. Missing ratings are filled with 0 first.

In [18]:
sparse_pivot = sparse.csr_matrix(pivot.fillna(0))

In [19]:
sparse_pivot

<9719x610 sparse matrix of type '<class 'numpy.float64'>'
	with 100832 stored elements in Compressed Sparse Row format>
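
The output above reports 100,832 stored values in a 9719 x 610 matrix, i.e. roughly 1.7% of the cells are filled, which is why a sparse format makes sense. The same conversion and a density calculation are sketched below on a toy pivot; `toy_pivot` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy pivot: rows are movies, columns are users, NaN means "not rated".
toy_pivot = pd.DataFrame([[4.0, np.nan, 5.0],
                          [np.nan, 3.0, np.nan]],
                         index=['A', 'B'], columns=[1, 2, 3])

# Fill missing ratings with 0, then store only the non-zero entries.
csr = sparse.csr_matrix(toy_pivot.fillna(0).to_numpy())

# Density: share of cells that actually hold a rating.
density = csr.nnz / (csr.shape[0] * csr.shape[1])
print(csr.nnz, csr.shape, density)
```

Filling with 0 means "not rated" and "rated 0" become indistinguishable, which is an assumption worth keeping in mind when interpreting the distances.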

## Calculate cosine similarity
Use `sklearn`'s built-in `pairwise_distances` function for the recommender. With `metric = 'cosine'` it returns a square matrix of cosine distances (1 minus cosine similarity), comparing every movie with every other movie in the dataset.

In [20]:
# Compute the pairwise cosine distance (1 - cosine similarity) between movies
recommender = pairwise_distances(sparse_pivot, metric = 'cosine')
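
To make the distance concrete, the sketch below runs `pairwise_distances` on a tiny hand-made matrix and checks one entry against the formula 1 - (a · b) / (‖a‖ ‖b‖). The three rows are hypothetical movies rated by three users.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import pairwise_distances

# Rows are movies, columns are users; 0 means "not rated".
ratings_matrix = sparse.csr_matrix(np.array([
    [5.0, 0.0, 4.0],
    [4.0, 0.0, 5.0],
    [0.0, 3.0, 0.0],
]))

distances = pairwise_distances(ratings_matrix, metric='cosine')

# Manual cosine distance between the first two movies.
a = np.array([5.0, 0.0, 4.0])
b = np.array([4.0, 0.0, 5.0])
manual = 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(distances.round(4))
print(round(float(manual), 4))
```

Movies rated by the same users in a similar way end up close to 0, while movies with no raters in common get a distance of exactly 1.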

## Create distances DataFrame
---

At this point, we essentially have a recommender. We'll load it into a `pandas` DataFrame for readability. 

You'll notice that each movie has a "distance" of 0 with itself (along the diagonal).

In [21]:
recommender_df = pd.DataFrame(recommender, 
                              columns = pivot.index, 
                              index = pivot.index)
recommender_df.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
'71 (2014),0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.858347,1.0,...,1.0,0.657945,0.456695,0.292893,1.0,1.0,0.860569,0.672673,1.0,1.0
'Hellboy': The Seeds of Creation (2004),1.0,0.0,0.292893,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
'Round Midnight (1986),1.0,0.292893,0.0,1.0,1.0,1.0,0.823223,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
'Salem's Lot (2004),1.0,1.0,1.0,0.0,0.142507,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
'Til There Was You (1997),1.0,1.0,1.0,0.142507,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
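
Looking up recommendations from a distance DataFrame like this one can be wrapped in a small helper. The sketch below is one possible approach on a toy matrix: `most_similar` is a hypothetical name, and it drops the queried title explicitly rather than assuming it sorts first, since another movie can also sit at distance 0.

```python
import pandas as pd

titles = ['A', 'B', 'C', 'D']
# Toy symmetric distance matrix; note that D is at distance 0 from A.
toy_distances = pd.DataFrame(
    [[0.0, 0.2, 0.9, 0.0],
     [0.2, 0.0, 0.5, 0.7],
     [0.9, 0.5, 0.0, 0.4],
     [0.0, 0.7, 0.4, 0.0]],
    index=titles, columns=titles)

def most_similar(distance_df, title, n=10):
    """Return the n titles closest to `title`, excluding the title itself."""
    return distance_df[title].drop(title).nsmallest(n)

top_two = most_similar(toy_distances, 'A', n=2)
print(top_two)
```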


## Evaluate recommender performance
---

Now comes the fun part! Let's check out a few movies to see if the recommender aligns with our intuition. In the cell below we'll do the following:
1. Create a search term
2. Use that to find all titles matching the search query
3. For each title, we'll list off the following:
    1. The average rating
    2. The number of ratings
    3. The ten most similar movies
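
The search and summary steps above can be sketched on a toy frame as follows. The helper `title_stats` is a hypothetical name, the ratings are made up, and the match is case-insensitive and literal (`regex=False`), which is a design choice rather than what the notebook cell does.

```python
import pandas as pd

# Toy long-format ratings; titles come from the dataset, ratings are made up.
toy = pd.DataFrame({
    'title': ['Lord of Illusions (1995)', 'Lord of Illusions (1995)',
              'Little Lord Fauntleroy (1936)', 'Toy Story (1995)'],
    'rating': [3.0, 2.5, 4.0, 4.5],
})

def title_stats(frame, query):
    """Average rating and number of ratings for every title containing `query`."""
    mask = frame['title'].str.contains(query, case=False, regex=False)
    return frame[mask].groupby('title')['rating'].agg(['mean', 'count'])

stats = title_stats(toy, 'lord')
print(stats)
```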

In [22]:
search = input("What is your favorite movie: ").title()

What is your favorite movie: Lord


In [23]:
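# Loop over every movie title that contains the search term (literal match, not a regex)
for title in movies.loc[movies['title'].str.contains(search, regex=False), 'title']:
    print(title)
    print(f'Average rating: {pivot.loc[title, :].mean()}') # Average rating across users who rated it
    print(f'Number of rating: {pivot.loc[title, :].count()}') # Number of users who rated it
    print('')
    print('Here are some movies with similar movie title and genre as per your input.')
    print('')
    # Smallest cosine distances first; skip position 0, which is normally the movie itself
    print(recommender_df[title].sort_values()[1:11])
    print('')
    print('*' * 70)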
    

Lord of Illusions (1995)
Average rating: 2.75
Number of rating: 8

Here are some movies with similar movie title and genre as per your input.

title
Vampire in Brooklyn (1995)               0.481266
Suspiria (1977)                          0.502466
Hellraiser: Bloodline (1996)             0.511376
Pokémon Heroes (2003)                    0.521909
Relative Fear (1994)                     0.521909
To Live and Die in L.A. (1985)           0.521909
Frisk (1995)                             0.521909
Young Poisoner's Handbook, The (1995)    0.521909
It Came from Outer Space (1953)          0.521909
Castle Freak (1995)                      0.521909
Name: Lord of Illusions (1995), dtype: float64

**********************************************************************
Little Lord Fauntleroy (1936)
Average rating: 4.0
Number of rating: 1

Here are some movies with similar movie title and genre as per your input.

title
Land and Freedom (Tierra y libertad) (1995)             0.0
Pinocchio (2002)   