# Recommendation Engines

## Introduction

Recommendation engines are used to suggest everything from movies and music to friends and new destinations. Throughout this lesson, you will become familiar with three main methods for implementing recommendations:
* Knowledge Based Recommendations
* Collaborative Filtering Based Recommendations
* Content Based Recommendations

After completing this lesson, you will be ready for the upcoming lessons where you will:
* Learn about more advanced techniques.
* Deploy your recommendations in a web application.

These three lessons aim to be extremely practical: you will write code to implement a number of different recommendation techniques.

**Example Recommendations:**

* LinkedIn and Facebook
> Both LinkedIn and Facebook recommend new connections (business contacts or friends).

* AirBnB Experiences and Destinations
> AirBnB uses recommendations to suggest experiences and destinations to its users.

* Walmart, Amazon, and Other Retailers
> As humans on the Internet, we all get pinged with constant recommendations from retailers.

## What's Ahead

### Types of Recommendations

In this lesson, you will be working with the MovieTweetings data to apply each of the three methods of recommendations:
1. Knowledge Based Recommendations
2. Collaborative Filtering Based Recommendations
3. Content Based Recommendations

Within Collaborative Filtering, there are two main branches:
1. Model Based Collaborative Filtering
2. Neighborhood Based Collaborative Filtering

In this lesson, you will implement Neighborhood Based Collaborative Filtering. In the next lesson, you will implement Model Based Collaborative Filtering.

### Similarity Metrics

In order to implement Neighborhood Based Collaborative Filtering, you will learn about some common ways to measure the similarity between two users (or two items) including:
1. Pearson's correlation coefficient
2. Spearman's correlation coefficient
3. Kendall's Tau
4. Euclidean Distance
5. Manhattan Distance

You will learn why sometimes one metric works better than another by looking at a specific situation where one metric provides more information than another.
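
As a rough illustration of how two metrics can disagree, here is a minimal sketch in plain Python. The ratings in `user_a` and `user_b` are made up, and the helper names `pearson`, `euclidean`, and `manhattan` are hypothetical, not part of the lesson's code.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length rating lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def euclidean(x, y):
    """Straight-line distance between two rating lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute rating differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical ratings two users gave to the same three movies
user_a = [1, 2, 3]
user_b = [8, 9, 10]

print("Pearson:  ", pearson(user_a, user_b))
print("Euclidean:", euclidean(user_a, user_b))
print("Manhattan:", manhattan(user_a, user_b))
```

In this toy case the correlation is perfect because both users rank the movies in the same order with the same spacing, while both distances are large because `user_b` rates every movie 7 points higher. Which of those two facts matters more depends on what "similar" should mean for the recommendation at hand.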

### Business Cases For Recommendations

Finally, you will look at four ideas that businesses rely on to make recommendations that successfully drive revenue:
1. Relevance
2. Novelty
3. Serendipity
4. Increased Diversity

At the end of this lesson, you will have gained a ton of skills to build upon or to start creating your own recommendations in practice.

## Base Data - MovieTweetings

For more information about the MovieTweetings data, see the links below:
* [The MovieTweetings white paper](http://crowdrec2013.noahlab.com.hk/papers/crowdrec2013_Dooms.pdf) (dead link)
* [A GitHub account set up for MovieTweetings](https://github.com/sidooms/MovieTweetings)
* [A slide deck by Simon Dooms about MovieTweetings](https://www.slideshare.net/simondooms/movie-tweetings-a-movie-rating-dataset-collected-from-twitter)
> Attached in repo as well

### Recommendations with MovieTweetings: Getting to Know The Data

Throughout this lesson, you will be working with the [MovieTweetings Data](https://github.com/sidooms/MovieTweetings/tree/master/recsyschallenge2014).

**Note:** Solutions to each of the notebooks are available by clicking the orange Jupyter logo in the top left of this notebook. Additionally, you can watch me work through the solutions in the screencasts that follow each workbook.

To get started, import the libraries and read in the two datasets you will be using throughout the lesson using the code below.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import tests as t

%matplotlib inline

In [2]:
# Read in the MovieTweetings dataset originally taken from https://github.com/sidooms/MovieTweetings/tree/master/latest
movies = pd.read_csv(
    '06_recommendation_engines/movies.dat',
    delimiter='::',
    header=None,
    names=['movie_id', 'movie', 'genre'],
    dtype={'movie_id': object},
    engine='python')
reviews = pd.read_csv(
    '06_recommendation_engines/ratings.dat',
    delimiter='::',
    header=None,
    names=['user_id', 'movie_id', 'rating', 'timestamp'],
    dtype={'movie_id': object, 'user_id': object, 'timestamp': object},
    engine='python')
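
The read step above relies on a two-character `::` separator. The sketch below shows the same call on a small in-memory sample, so the effect of each argument is easy to see. The two rows are copied from the output shown in this lesson, and the assumption is that the raw file stores them in exactly this layout; `sample` and `sample_movies` are hypothetical names.

```python
import io

import pandas as pd

# Two rows in the `::`-delimited layout assumed for movies.dat
sample = io.StringIO(
    "8::Edison Kinetoscopic Record of a Sneeze (1894)::Documentary|Short\n"
    "91::Le manoir du diable (1896)::Short|Horror\n"
)

sample_movies = pd.read_csv(
    sample,
    sep="::",                    # multi-character separator
    header=None,                 # the file has no header row
    names=["movie_id", "movie", "genre"],
    dtype={"movie_id": object},  # keep IDs as strings, not integers
    engine="python",             # the C parser only handles single-character separators
)
print(sample_movies)
```

Passing `engine="python"` explicitly avoids the warning pandas emits when it has to fall back from the faster C parser.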

#### 1. Take a Look At The Data 

Take a look at the data and use your findings to fill in the dictionary below with the correct responses to show your understanding of the data.

In [3]:
print(movies.shape)
display(movies.head())

(35479, 3)


Unnamed: 0,movie_id,movie,genre
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short
1,10,La sortie des usines Lumière (1895),Documentary|Short
2,12,The Arrival of a Train (1896),Documentary|Short
3,25,The Oxford and Cambridge University Boat Race ...,
4,91,Le manoir du diable (1896),Short|Horror


In [4]:
print(reviews.shape)
display(reviews.head())

(863866, 4)


Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,114508,8,1381006850
1,2,208092,5,1586466072
2,2,358273,9,1579057827
3,2,10039344,5,1578603053
4,2,6751668,9,1578955697


In [5]:
dict_sol1 = {
    # nunique() on the title column counts distinct titles, which can be fewer than the row count
    'The number of movies in the dataset': movies['movie'].nunique(),
    'The number of ratings in the dataset': reviews['rating'].notnull().sum(),
    # nunique() on the raw column counts distinct pipe-joined genre combinations,
    # not individual genres
    'The number of different genres': movies['genre'].nunique(),
    'The number of unique users in the dataset': reviews['user_id'].nunique(),
    'The number missing ratings in the reviews dataset': reviews['rating'].isna().sum(),
    'The average rating given across all ratings': reviews['rating'].mean(),
    'The minimum rating given across all ratings': reviews['rating'].min(),
    'The maximum rating given across all ratings': reviews['rating'].max(),
}

In [6]:
dict_sol1

{'The number of movies in the dataset': 35416,
 'The number of ratings in the dataset': 863866,
 'The number of different genres': 2736,
 'The number of unique users in the dataset': 67353,
 'The number missing ratings in the reviews dataset': 0,
 'The average rating given across all ratings': 7.315877693994207,
 'The minimum rating given across all ratings': 0,
 'The maximum rating given across all ratings': 10}
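
Note that calling `nunique()` directly on the `genre` column counts distinct pipe-joined strings, so the figure of 2736 reflects genre combinations rather than individual genres. The following sketch, on a made-up `toy_genres` Series, shows one possible way to count both; it is an illustration, not the lesson's official solution.

```python
import pandas as pd

# Toy genre column in the same pipe-joined format as movies['genre']
toy_genres = pd.Series(
    ["Documentary|Short", "Short|Horror", "Horror", "Short", None]
)

# Distinct strings, i.e. distinct genre combinations (missing values are ignored)
n_combinations = toy_genres.nunique()

# Distinct individual labels: split each string, flatten the lists, then count
n_individual = toy_genres.str.split("|").explode().nunique()

print("combinations:", n_combinations)
print("individual genres:", n_individual)
```

Here the four non-missing strings are all different, yet they are built from only three labels: Documentary, Short, and Horror.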

#### 2. Data Cleaning

Next, we need to pull some additional relevant information out of the existing columns. 

For each of the datasets, there are a couple of cleaning steps we need to take care of:

#### Movies
* Extract the release year from the title into a new `date` column
* Create dummy columns of 1's and 0's for the century of each movie (1800's, 1900's, and 2000's)
* Create dummy columns of 1's and 0's for each genre

#### Reviews
* Create a date from the Unix timestamp

You can check your results against the header of my solution by running the cell below with the **show_clean_dataframes** function.

In [7]:
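def remove_year_in_paren(s):
    """Return the text inside the last pair of parentheses in `s`.

    Despite its name, this extracts the year rather than removing it,
    e.g. 'Le manoir du diable (1896)' -> '1896'.
    """
    open_idx = s.rfind('(')
    close_idx = s.rfind(')')
    return s[open_idx + 1:close_idx]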

In [8]:
movies['date'] = movies['movie'].apply(remove_year_in_paren)
movies.head()

Unnamed: 0,movie_id,movie,genre,date
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895
2,12,The Arrival of a Train (1896),Documentary|Short,1896
3,25,The Oxford and Cambridge University Boat Race ...,,1895
4,91,Le manoir du diable (1896),Short|Horror,1896
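
The helper above assumes every title ends with a year in parentheses. If a title had no parentheses at all, both `rfind` calls would return -1 and the slice would return the title minus its last character. A slightly more defensive sketch using a regular expression is shown below; `extract_year` and `YEAR_AT_END` are hypothetical names, and the second and third titles are invented for illustration.

```python
import re

# Four digits in parentheses at the very end of the title, e.g. "(1894)"
YEAR_AT_END = re.compile(r"\((\d{4})\)\s*$")

def extract_year(title):
    """Return the trailing four-digit year as a string, or None if absent."""
    match = YEAR_AT_END.search(title)
    return match.group(1) if match else None

print(extract_year("Le manoir du diable (1896)"))     # title from the dataset
print(extract_year("A Title Without a Year"))         # invented: no year present
print(extract_year("Title (Director's Cut) (2001)"))  # invented: extra parentheses
```

Returning `None` for unmatched titles makes missing years easy to find afterwards, for example with `isna()` once the function is applied to a column.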


In [9]:
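# Map the first two digits of the year to a century label
date_ind = {'18': "1800's", '19': "1900's", '20': "2000's"}
for prefix, label in date_ind.items():
    # 1 if the year starts with this prefix, otherwise 0
    movies[label] = (movies['date'].str[:2] == prefix).astype(int)
movies.head()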

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,1,0,0
2,12,The Arrival of a Train (1896),Documentary|Short,1896,1,0,0
3,25,The Oxford and Cambridge University Boat Race ...,,1895,1,0,0
4,91,Le manoir du diable (1896),Short|Horror,1896,1,0,0


In [10]:
# Collect the individual genres; each raw value is a pipe-joined combination
genres = set()
for val in movies['genre'].dropna():
    genres.update(val.split('|'))

def has_genre(val, genre):
    """Return 1 if `genre` is one of the pipe-separated entries in `val`, else 0."""
    # Missing genres are NaN (a float), so anything that is not a string maps to 0
    if not isinstance(val, str):
        return 0
    # Compare whole labels: a substring test would also match 'Music' inside 'Musical'
    return int(genre in val.split('|'))

# Add one column of 1's and 0's per genre
for genre in genres:
    movies[genre] = movies['genre'].apply(has_genre, args=(genre,))
# print("The number of genres is {}.".format(len(genres)))

movies.head()

Unnamed: 0,movie_id,movie,genre,date,1800's,1900's,2000's,Animation,Game-Show,Documentary,...,Film-Noir,Talk-Show,Short,Thriller,Action,Crime,Sci-Fi,Comedy,Adult,War
0,8,Edison Kinetoscopic Record of a Sneeze (1894),Documentary|Short,1894,1,0,0,0,0,1,...,0,0,1,0,0,0,0,0,0,0
1,10,La sortie des usines Lumière (1895),Documentary|Short,1895,1,0,0,0,0,1,...,0,0,1,0,0,0,0,0,0,0
2,12,The Arrival of a Train (1896),Documentary|Short,1896,1,0,0,0,0,1,...,0,0,1,0,0,0,0,0,0,0
3,25,The Oxford and Cambridge University Boat Race ...,,1895,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,91,Le manoir du diable (1896),Short|Horror,1896,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
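
For comparison, pandas also offers a vectorized way to build indicator columns from delimited strings. The sketch below applies `Series.str.get_dummies` to a small made-up frame (`toy` and its titles are hypothetical); it is an alternative to the loop above, not the approach this lesson's solution uses.

```python
import pandas as pd

toy = pd.DataFrame({
    "movie": ["A (1894)", "B (1896)", "C (1895)"],  # hypothetical titles
    "genre": ["Documentary|Short", "Short|Horror", None],
})

# One 0/1 column per individual genre; a missing genre gives a row of zeros
genre_dummies = toy["genre"].str.get_dummies(sep="|")

# Attach the indicator columns to the original frame
toy = pd.concat([toy, genre_dummies], axis=1)
print(toy)
```

Because the split is on the separator, each label is matched as a whole, and the resulting columns come back in alphabetical order.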


In [11]:
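# timestamp was read in as strings, so cast to integers before interpreting it as Unix seconds
reviews['date'] = pd.to_datetime(reviews['timestamp'].astype('int64'), unit='s')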
reviews.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,date
0,1,114508,8,1381006850,2013-10-05 21:00:50
1,2,208092,5,1586466072,2020-04-09 21:01:12
2,2,358273,9,1579057827,2020-01-15 03:10:27
3,2,10039344,5,1578603053,2020-01-09 20:50:53
4,2,6751668,9,1578955697,2020-01-13 22:48:17
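
The `timestamp` values are seconds since the Unix epoch, and the read step stored them as strings. The sketch below converts two of the values shown above on a standalone Series named `ts` (a hypothetical name). Casting to integers first is an assumption about good practice here: depending on the pandas version, combining `unit` with string input may raise a warning.

```python
import pandas as pd

# Timestamps stored as strings, mirroring dtype=object in the read step
ts = pd.Series(["1381006850", "1586466072"])

# Cast to integers so that unit="s" is applied to numbers rather than strings
dates = pd.to_datetime(ts.astype("int64"), unit="s")
print(dates)
```

The resulting datetimes are timezone-naive and expressed in UTC, which matches the `date` column displayed above.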


## Solution

The solution to the previous notebook is available in the two videos below. Remember that you can access the solution notebooks from within the classroom workspaces by clicking the orange Jupyter Notebook icon in the upper-left corner.