# Movie Recommender System
#### Recommending movies by 'movie rating correlation' and 'number of ratings', using the IMDB Movies data.

By: Anamika Singh

In [None]:
# IMPORTING REQUIRED LIBRARIES

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

The data I'm working with is a medium-sized dataset of 100,003 rows and 4 columns. A much larger version of the dataset is freely available on the internet.

In [None]:
# LOADING THE DATASET

column_names = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('u.data', sep = '\t', names = column_names)

Missing values are usually represented as NaN, null, or None in the dataset.

The df.info() function gives information about the dataset: the column names along with the number of non-null values in each column.

A second way of checking whether the data has null values is the isnull() function.

In [None]:
# CHECKING FOR MISSING VALUES

df.info()
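
The isnull() approach mentioned above can be sketched on a small made-up table. The `toy` dataframe and its values are assumptions for illustration only, not the real ratings data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ratings table; the values are made up for illustration.
toy = pd.DataFrame({
    'user_id': [1, 2, 3],
    'item_id': [10, 10, 20],
    'rating': [5.0, np.nan, 3.0],
})

# isnull() marks each missing cell as True; sum() then counts them per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

On the real data, the same call would be `df.isnull().sum()`.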

Getting the movie titles:

In [None]:
movie_titles = pd.read_csv("Movie_Id_Titles")
movie_titles.head()

Merging both the datasets together:

In [None]:
df = pd.merge(df, movie_titles, on = 'item_id')
df.head()
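
A small sketch of what the merge does with ids that have no match, using made-up tables (`toy_ratings` and `toy_titles` are hypothetical names). By default `pd.merge` performs an inner join, so ratings whose `item_id` has no title are dropped; `how='left'` keeps them with a missing title instead.

```python
import pandas as pd

# Made-up ratings; item_id 99 has no entry in the titles table.
toy_ratings = pd.DataFrame({
    'user_id': [1, 2, 3],
    'item_id': [10, 10, 99],
    'rating': [5, 4, 3],
})
toy_titles = pd.DataFrame({
    'item_id': [10, 20],
    'title': ['Movie A', 'Movie B'],
})

# Default how='inner': only item_ids present in both tables survive.
merged = pd.merge(toy_ratings, toy_titles, on='item_id')

# how='left': every rating row is kept; unmatched titles become NaN.
merged_left = pd.merge(toy_ratings, toy_titles, on='item_id', how='left')

print(len(merged), len(merged_left))
```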

## Exploratory Data Analysis

In [None]:
sns.set_style('white')
%matplotlib inline

Creating a ratings dataframe with average rating and number of ratings:

In [None]:
df.groupby('title')['rating'].mean().sort_values(ascending = False).head()

In [None]:
df.groupby('title')['rating'].count().sort_values(ascending = False).head()

In [None]:
ratings = pd.DataFrame(df.groupby('title')['rating'].mean())
ratings.head()

Setting the number of ratings column:

In [None]:
ratings['num of ratings'] = df.groupby('title')['rating'].count()
ratings.head()
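
As an alternative sketch, both columns can be built in a single groupby pass with `agg`. The `toy` data below is made up, and this is just one possible way to get the same two columns.

```python
import pandas as pd

# Made-up ratings for two titles.
toy = pd.DataFrame({
    'title': ['A', 'A', 'B'],
    'rating': [4, 2, 5],
})

# Compute mean and count per title in one pass, then rename
# the columns to match the names used in this notebook.
summary = toy.groupby('title')['rating'].agg(['mean', 'count'])
summary = summary.rename(columns={'mean': 'rating', 'count': 'num of ratings'})
print(summary)
```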

## Visualizing Data

In [None]:
plt.figure(figsize = (10, 4))
ratings['num of ratings'].hist(bins = 70)

In [None]:
plt.figure(figsize = (10, 4))
ratings['rating'].hist(bins = 70)

In [None]:
sns.jointplot(x = 'rating', y = 'num of ratings', data = ratings, alpha=0.5)
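
A numeric check can complement the plots. The sketch below uses made-up, deliberately skewed counts (`toy_counts` is a hypothetical series) to show how the median, the mean, and the share of titles above a cut-off could be read off.

```python
import pandas as pd

# Made-up rating counts per title, deliberately skewed toward small values.
toy_counts = pd.Series([1, 2, 2, 3, 5, 8, 40, 120, 300, 580], name='num of ratings')

# A median far below the mean indicates a long right tail.
median_count = toy_counts.median()
mean_count = toy_counts.mean()

# Share of titles that would survive a "more than 100 ratings" cut.
share_kept = (toy_counts > 100).mean()

print(median_count, mean_count, share_kept)
```

On the real data, `ratings['num of ratings']` would take the place of `toy_counts`.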

## Recommending Similar Movies

Creating a matrix with user ids on one axis and movie titles on the other. Each cell holds the rating the user gave to that movie. There are a lot of NaN values, because most people have not seen most of the movies.

In [None]:
moviemat = df.pivot_table(index = 'user_id', columns = 'title', values = 'rating')
moviemat.head()
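
To see where the NaN values come from, here is a minimal sketch on made-up data (`toy` and `toy_mat` are hypothetical names). Any user/title pair without a rating becomes NaN in the pivoted matrix.

```python
import pandas as pd

# Made-up long-format ratings: user 2 never rated B, user 3 never rated A.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'title': ['A', 'B', 'A', 'B'],
    'rating': [5, 3, 4, 2],
})

# One row per user, one column per title, ratings as cell values.
toy_mat = toy.pivot_table(index='user_id', columns='title', values='rating')

# Fraction of cells that are empty.
sparsity = toy_mat.isnull().sum().sum() / toy_mat.size

print(toy_mat)
print(sparsity)
```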

Most rated movies:

In [None]:
ratings.sort_values('num of ratings', ascending = False).head(10)

Choosing two movies: Star Wars, a sci-fi movie, and Liar Liar, a comedy.

In [None]:
ratings.head()

User ratings for those two movies:

In [None]:
starwars_user_ratings = moviemat['Star Wars (1977)']
liarliar_user_ratings = moviemat['Liar Liar (1997)']
starwars_user_ratings.head()

Using the corrwith() method to get the correlation between every movie column in the matrix and the chosen movie's ratings (a pandas Series):

In [None]:
similar_to_starwars = moviemat.corrwith(starwars_user_ratings)
similar_to_liarliar = moviemat.corrwith(liarliar_user_ratings)
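
A minimal sketch of how corrwith() behaves, on a made-up user-by-movie matrix (the column names `Target`, `Similar`, `Opposite` and `Sparse` are assumptions for illustration). Each column is correlated with the given Series using only the users who rated both, and a column with no users in common yields NaN.

```python
import numpy as np
import pandas as pd

# Made-up matrix: rows are users, columns are movies.
toy_mat = pd.DataFrame({
    'Target': [5, 4, 3, 2, np.nan],
    'Similar': [5, 4, 3, 2, 1],
    'Opposite': [1, 2, 3, 4, 5],
    # Rated only by user 5, who did not rate 'Target'.
    'Sparse': [np.nan, np.nan, np.nan, np.nan, 3],
}, index=pd.Index([1, 2, 3, 4, 5], name='user_id'))

# Correlate every column with the 'Target' column.
toy_corr = toy_mat.corrwith(toy_mat['Target'])
print(toy_corr)
```

The NaN entry for `Sparse` is the kind of value that has to be dropped afterwards.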

Removing NaN values and using a DataFrame instead of a series:

In [None]:
corr_starwars = pd.DataFrame(similar_to_starwars, columns = ['Correlation'])
corr_starwars.dropna(inplace = True)
corr_starwars.head()

If we sort the dataframe by correlation, we get the most similar movies, but some of the results don't really make sense. This is because many movies were rated by only a few users who also rated Star Wars (it was the most popular movie), so their correlations rest on too little data.

In [None]:
corr_starwars.sort_values('Correlation', ascending = False).head(10)
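
Why a tiny overlap produces misleading results can be sketched with two made-up Series (`popular` and `obscure` are hypothetical). With only two users in common, the two points always lie on a straight line, so the Pearson correlation comes out as exactly +1 or -1. The `min_periods` argument of `Series.corr` is a different guard from the rating-count filter used in this notebook: it requires a minimum number of shared users.

```python
import numpy as np
import pandas as pd

users = [1, 2, 3, 4, 5, 6]

# Made-up ratings: the popular film has six raters, the obscure one only two.
popular = pd.Series([5, 4, 3, 5, 2, 4], index=users)
obscure = pd.Series([5, np.nan, 3, np.nan, np.nan, np.nan], index=users)

# Only users 1 and 3 rated both films, so just two points are compared.
corr_two_points = popular.corr(obscure)

# Requiring at least 5 shared users returns NaN instead of a perfect score.
corr_guarded = popular.corr(obscure, min_periods=5)

print(corr_two_points, corr_guarded)
```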

Filtering out movies that have 100 or fewer ratings (this threshold is chosen based on the histogram from earlier). First, the number of ratings is joined onto the correlation dataframe:

In [None]:
corr_starwars = corr_starwars.join(ratings['num of ratings'])
corr_starwars.head()

Applying the filter and sorting by correlation, so that the titles make a lot more sense:

In [None]:
corr_starwars[corr_starwars['num of ratings'] > 100].sort_values('Correlation', ascending = False).head()

Same for the comedy Liar Liar:

In [None]:
corr_liarliar = pd.DataFrame(similar_to_liarliar, columns = ['Correlation'])
corr_liarliar.dropna(inplace = True)
corr_liarliar = corr_liarliar.join(ratings['num of ratings'])
corr_liarliar[corr_liarliar['num of ratings'] > 100].sort_values('Correlation', ascending = False).head()
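
Since the same steps are repeated for each movie, they could be wrapped in a small function. The sketch below is one possible way to do that: `similar_movies`, its parameters, and the toy data are hypothetical and not part of the original notebook. Unlike the cells above, it also drops the query title from its own results.

```python
import pandas as pd

def similar_movies(moviemat, ratings, title, min_ratings=100, top_n=5):
    """Hypothetical helper: rank titles by rating correlation with `title`."""
    # Correlate every movie column with the chosen movie's ratings.
    corr = moviemat.corrwith(moviemat[title]).to_frame('Correlation')
    # Drop undefined correlations and attach the rating counts.
    corr = corr.dropna().join(ratings['num of ratings'])
    # Keep only movies with more than `min_ratings` ratings.
    corr = corr[corr['num of ratings'] > min_ratings]
    # Remove the query title itself, which trivially matches its own ratings.
    corr = corr.drop(index=title, errors='ignore')
    return corr.sort_values('Correlation', ascending=False).head(top_n)

# Made-up matrix: rows are users, columns are movies.
toy_mat = pd.DataFrame({
    'A': [5, 4, 3, 2],
    'B': [4, 4, 2, 1],
    'C': [1, 2, 4, 5],
})
toy_ratings = pd.DataFrame({'num of ratings': toy_mat.count()})

result = similar_movies(toy_mat, toy_ratings, 'A', min_ratings=3, top_n=2)
print(result)
```

With the notebook's own objects, a call might look like `similar_movies(moviemat, ratings, 'Liar Liar (1997)')`.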