# Movie Recommendation System with Python
In this project, we'll develop a basic recommender system with Python and pandas.

Movies are suggested based on how closely their user ratings track those of another movie. This is not a robust recommendation system, but it is a reasonable starting point.

In [1]:
import numpy as np
import pandas as pd

## Data
We have two datasets:

+ A dataset of movie ratings.
+ A dataset of all movie titles and their ids.

In [3]:
# Read the ratings dataset.
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('../data/data.data', sep='\t', names=column_names)

In [4]:
df.head()

|   | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 9438161 | NaN | NaN | NaN |
| 1 | 332 | 222.0 | 4.0 | 887916529.0 |
| 2 | 551 | 735.0 | 5.0 | 892783110.0 |
| 3 | 13 | 160.0 | 4.0 | 882140070.0 |
| 4 | 532 | 946.0 | 5.0 | 888635366.0 |
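
The NaN values in the first row of the preview above suggest it is worth checking that the separator and header settings match the file. Here is a minimal sketch of such a check, using a made-up in-memory sample rather than the real file, and assuming a tab-separated layout with no header row:

```python
import io
import pandas as pd

# Made-up sample: tab-separated, no header row (an assumption about the file layout).
sample = "1\t10\t4\t880000000\n2\t20\t5\t880000100\n3\t10\t3\t880000200\n"
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
sample_df = pd.read_csv(io.StringIO(sample), sep='\t', names=column_names)

# Sanity checks: integer dtypes and zero missing values indicate a clean parse.
print(sample_df.dtypes)
print(sample_df.isna().sum())
```

Running the same two checks on `df` should show quickly whether rows are being split into the expected four columns.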


Reading the movie titles:

In [5]:
movie_titles = pd.read_csv("../data/Movie_Id_Titles.csv")
movie_titles.head()

|   | item_id | title |
|---|---|---|
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | GoldenEye (1995) |
| 2 | 3 | Four Rooms (1995) |
| 3 | 4 | Get Shorty (1995) |
| 4 | 5 | Copycat (1995) |


We can merge them together:

In [None]:
df = pd.merge(df,movie_titles,on='item_id')
df.head()
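
By default `pd.merge` performs an inner join, so ratings whose `item_id` has no matching title are silently dropped. The sketch below uses made-up toy frames (`toy_ratings` and `toy_titles` are illustrative names, not part of the project data) to show one way to spot such rows with `how='left'` and `indicator=True`:

```python
import pandas as pd

# Made-up data: item 3 deliberately has no entry in the titles table.
toy_ratings = pd.DataFrame({'user_id': [1, 1, 2],
                            'item_id': [1, 2, 3],
                            'rating': [5, 3, 4]})
toy_titles = pd.DataFrame({'item_id': [1, 2],
                           'title': ['Movie A', 'Movie B']})

# Default inner join: the rating for item 3 disappears.
inner = pd.merge(toy_ratings, toy_titles, on='item_id')

# Left join with an indicator column keeps every rating and flags the unmatched ones.
left = pd.merge(toy_ratings, toy_titles, on='item_id', how='left', indicator=True)
unmatched = left[left['_merge'] == 'left_only']
print(unmatched)
```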

## Exploratory Analysis
Let's explore the data a bit and get a look at some of the best rated movies.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
%matplotlib inline

Let's create a ratings dataframe with average rating and number of ratings:

In [None]:
df.groupby('title')['rating'].mean().sort_values(ascending=False).head()

In [None]:
df.groupby('title')['rating'].count().sort_values(ascending=False).head()

In [None]:
ratings = pd.DataFrame(df.groupby('title')['rating'].mean())
ratings.head()

Setting the number of ratings column:

In [2]:
# Count how many ratings each title received and store it as a new column.
ratings['num of ratings'] = df.groupby('title')['rating'].count()
ratings.head()
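
The mean and the count can also be computed in a single pass. This is a minimal sketch on made-up data (the `toy` frame is illustrative only), using pandas named aggregation to get the same two columns:

```python
import pandas as pd

# Made-up ratings for two titles.
toy = pd.DataFrame({'title': ['A', 'A', 'B', 'B', 'B'],
                    'rating': [4, 2, 5, 3, 4]})

# Named aggregation: each keyword becomes an output column.
# Dict unpacking is used because 'num of ratings' contains spaces.
toy_summary = toy.groupby('title')['rating'].agg(
    **{'rating': 'mean', 'num of ratings': 'count'}
)
print(toy_summary)
```

Building both columns at once also avoids depending on the order in which the notebook cells were run.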

Visualizing the number of ratings:

In [None]:
plt.figure(figsize=(10,4))
ratings['num of ratings'].hist(bins=40)

In [None]:
plt.figure(figsize=(10,4))
ratings['rating'].hist(bins=40)

It makes intuitive sense for most average ratings to cluster around the 3.0 mark.

## Recommending Similar Movies
The next step is to create a matrix that has the user ids on one axis and the movie titles on the other. Each cell then holds the rating a particular user gave to a particular movie.

In [None]:
moviemat = df.pivot_table(index='user_id',columns='title',values='rating')
moviemat.head()
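
To make the shape of this matrix concrete, here is a small sketch on made-up ratings (the `toy` frame is illustrative only). It also measures how sparse the result is:

```python
import numpy as np
import pandas as pd

# Made-up long-format ratings: one row per (user, title) pair.
toy = pd.DataFrame({'user_id': [1, 1, 2, 3],
                    'title': ['A', 'B', 'A', 'B'],
                    'rating': [5, 3, 4, 2]})

# Users become rows, titles become columns; missing pairs become NaN.
toy_mat = toy.pivot_table(index='user_id', columns='title', values='rating')

# Fraction of empty cells in the user-by-title matrix.
sparsity = toy_mat.isna().to_numpy().mean()
print(toy_mat)
print(f"sparsity: {sparsity:.2f}")
```

Note that `pivot_table` aggregates with the mean by default, so if a user rated the same title more than once, the cell would hold the average of those ratings.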

Lots of NaN values are expected here, since most users have rated only a small fraction of the movies.

Checking out the most-rated movies:

In [None]:
ratings.sort_values('num of ratings',ascending=False).head(10)

Let's choose two movies to focus on: Star Wars, a sci-fi movie, and Liar Liar, a comedy.

In [None]:
ratings.head()

Now let's grab the user ratings for those two movies:

In [None]:
starwars_user_ratings = moviemat['Star Wars (1977)']
liarliar_user_ratings = moviemat['Liar Liar (1997)']
starwars_user_ratings.head()

We can then use the corrwith() method to correlate every column of the matrix with a single movie's ratings Series:

In [None]:
similar_to_starwars = moviemat.corrwith(starwars_user_ratings)
similar_to_liarliar = moviemat.corrwith(liarliar_user_ratings)
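
As a small illustration of what `corrwith()` computes, the sketch below runs it on a made-up matrix (`toy_mat` is illustrative only) and compares one entry with a direct `Series.corr` call. Each correlation uses only the users who rated both movies:

```python
import numpy as np
import pandas as pd

# Made-up user-by-title matrix; user 2 has not rated movie C.
toy_mat = pd.DataFrame({'A': [5, 4, 1, 2],
                        'B': [4, 5, 2, 1],
                        'C': [1, np.nan, 5, 4]},
                       index=pd.Index([1, 2, 3, 4], name='user_id'))

target = toy_mat['A']

# Pearson correlation of every column with the target column.
toy_corr = toy_mat.corrwith(target)

# The same value for one column, computed directly on the overlapping users.
manual = toy_mat['C'].corr(target)
print(toy_corr)
```

A movie always correlates perfectly with itself, so the target title shows up with a value of 1.0 in its own results.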

Let's clean the data by removing NaN values and using a DataFrame instead of a Series:

In [None]:
corr_starwars = pd.DataFrame(similar_to_starwars,columns=['Correlation'])
corr_starwars.dropna(inplace=True)
corr_starwars.head()

If we sort the DataFrame by correlation, we should get the most similar movies. However, some of the top results won't really make sense.

This is because many movies were rated by only a handful of users who also rated Star Wars (the most popular movie), so their correlation is computed from very few overlapping ratings.

In [None]:
corr_starwars.sort_values('Correlation',ascending=False).head(10)
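
The effect is easy to reproduce on made-up data. In this sketch (`popular` and `obscure` are illustrative Series, not project data), two movies share only two raters, which forces the Pearson correlation to be exactly 1 or -1 regardless of taste. The `min_periods` argument of `Series.corr` is one way to require a minimum overlap:

```python
import numpy as np
import pandas as pd

# Made-up ratings from six users; only users 2 and 4 rated the obscure movie.
popular = pd.Series([5, 4, 3, 5, 2, 4], index=range(1, 7))
obscure = pd.Series([np.nan, 1, np.nan, 2, np.nan, np.nan], index=range(1, 7))

# Number of users who rated both movies.
overlap = int((popular.notna() & obscure.notna()).sum())

# With two overlapping points the correlation is a perfect +1 or -1.
corr = popular.corr(obscure)

# Requiring at least five common raters yields NaN instead of a misleading score.
guarded = popular.corr(obscure, min_periods=5)
print(overlap, corr, guarded)
```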

Let's fix this by keeping only movies that have more than 100 ratings (this threshold was chosen based on the histogram from earlier).

In [None]:
corr_starwars = corr_starwars.join(ratings['num of ratings'])
corr_starwars.head()

Now sort the values and notice how the titles make a lot more sense:

In [None]:
corr_starwars[corr_starwars['num of ratings']>100].sort_values('Correlation',ascending=False).head()
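
The correlate, join, filter, and sort steps can be wrapped into one function. The sketch below is a hypothetical helper (`similar_movies` and its parameters are not part of the original notebook). It assumes the number of ratings per title can be taken from the non-NaN count of each column of the user-by-title matrix, and it additionally drops the target movie from its own results:

```python
import numpy as np
import pandas as pd

def similar_movies(moviemat, title, min_ratings=100, top_n=5):
    """Hypothetical helper: rank titles by rating correlation with `title`."""
    # Correlate every column with the target movie and drop undefined results.
    corr = moviemat.corrwith(moviemat[title]).dropna().rename('Correlation').to_frame()
    # Assumption: ratings per title = non-NaN cells per column.
    corr['num of ratings'] = moviemat.count()
    # Design choice not in the original: exclude the target movie itself.
    corr = corr.drop(index=title, errors='ignore')
    keep = corr['num of ratings'] > min_ratings
    return corr[keep].sort_values('Correlation', ascending=False).head(top_n)

# Made-up matrix: D has only two ratings and is filtered out.
toy_mat = pd.DataFrame({'A': [5, 4, 1, 2, 3],
                        'B': [4, 5, 2, 1, 3],
                        'C': [1, 2, 5, 4, 3],
                        'D': [5, np.nan, np.nan, 1, np.nan]})
result = similar_movies(toy_mat, 'A', min_ratings=3)
print(result)
```

With the real data, a call such as `similar_movies(moviemat, 'Star Wars (1977)')` would be expected to reproduce the filtered ranking, minus the Star Wars row itself.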

Now the same for Liar Liar:

In [None]:
corr_liarliar = pd.DataFrame(similar_to_liarliar,columns=['Correlation'])
corr_liarliar.dropna(inplace=True)
corr_liarliar = corr_liarliar.join(ratings['num of ratings'])
corr_liarliar[corr_liarliar['num of ratings']>100].sort_values('Correlation',ascending=False).head()

And we're done!

It looks like our results make sense. Even though the system is far from state of the art, it still recommended the other Star Wars movies (and another George Lucas film) for Star Wars. The Liar Liar results leave more room for improvement, although we do get another Jim Carrey movie as a recommendation.