## Movie Recommendations ##

This recommendation system is content-based: it uses each movie's description together with a few other text features, tokenizes that text, and compares movies by their token counts.

Columns used:
- 'original_title'
- 'original_language'
- 'overview'
- 'cast'
- 'director'

Package Dependencies:
- pandas
- numpy
- pymongo
- scikit-learn (imported as `sklearn`)

### Step 1: Connecting to DataBase ###

For this use case I used MongoDB. It is easy to set up, and it works well for simple collections that do not reference each other.

PyMongo is the official Python driver for MongoDB.

In [1]:
from pymongo import MongoClient

client = MongoClient("mongodb+srv://andre:annette@recommendation-sys.hzgmt.mongodb.net/?retryWrites=true&w=majority")

db = client['reccomender-sys']
db.list_collection_names()

['credits', 'movies']
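
The connection string above embeds the username and password directly. A minimal sketch of one alternative, assuming the credentials are stored in environment variables (the variable names `MONGO_USER` and `MONGO_PASSWORD` and the helper `build_mongo_uri` are hypothetical and not part of the original notebook):

```python
import os
from urllib.parse import quote_plus

# Hypothetical helper: the environment variable names are assumptions.
def build_mongo_uri(host, user_var="MONGO_USER", password_var="MONGO_PASSWORD"):
    # quote_plus escapes characters such as '@' or ':' that would break the URI
    user = quote_plus(os.environ[user_var])
    password = quote_plus(os.environ[password_var])
    return f"mongodb+srv://{user}:{password}@{host}/?retryWrites=true&w=majority"
```

The returned string can be passed to `MongoClient` in place of the literal URI.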

In [2]:
import pandas as pd
import numpy as np

movie = db.movies
movies = pd.DataFrame(list(movie.find()))

credit = db.credits
credits = pd.DataFrame(list(credit.find()))

### Step 2: Handling Incorrect Data ###

Some rows have errors in the `id` column: it holds a date instead of the numeric id that ties a movie title to its credits. To clean the data I dropped these rows. The stray dates are all 10 characters long (for example `2012-09-29`), while the longest valid id has 6 characters, so I computed the length of each id and removed the rows with length 10.

In [3]:
# length of each id string; valid ids are short, stray dates are 10 characters
movies['idlen'] = movies['id'].apply(len)

In [4]:
movies.sort_values("idlen", ascending=False)[:4]

Unnamed: 0,_id,adult,budget,genres,id,imdb_id,original_language,original_title,overview,popularity,...,release_date,revenue,runtime,spoken_languages,status,title,vote_average,vote_count,tagline,idlen
29503,62a792b30200fd59f88551e0,Rune Balot goes to a casino connected to the ...,/zV8bHuSL6WXoD6FWogP9j4x80bL.jpg,"[{'name': 'Aniplex', 'id': 2883}, {'name': 'Go...",2012-09-29,0,68.0,"[{'iso_639_1': 'ja', 'name': '日本語'}]",Released,,...,12,,,,,,,,,10
19730,62a792ad0200fd59f8852bb3,- Written by Ørnås,/ff9qCepilowshEtG2GYWwzt2bs4.jpg,"[{'name': 'Carousel Productions', 'id': 11176}...",1997-08-20,0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,...,1,,,,,,,,,10
35587,62a792b60200fd59f88569a4,Avalanche Sharks tells the story of a bikini ...,/zaSf5OG7V8X8gqFvly88zDdRm46.jpg,"[{'name': 'Odyssey Media', 'id': 17161}, {'nam...",2014-01-01,0,82.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Beware Of Frost Bites,...,22,,,,,,,,,10
22733,62a792af0200fd59f885376e,False,0,[],133018,tt0088193,en,A Streetcar Named Desire,Blanche Dubois goes to visit her pregnant sist...,1.1622,...,1984-03-04,0.0,119.0,[],Released,A Streetcar Named Desire,5.0,1.0,,6


In [5]:
# removing rows with incorrect id (total 3 rows)
# .copy() avoids a SettingWithCopyWarning when 'id' is reassigned later
movies = movies[movies['idlen'] != 10].copy()
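
As an illustration of the same cleanup on a toy frame (the sample values are made up), the sketch below reproduces the length check and shows a format-based alternative, `str.isdigit()`, which does not rely on malformed ids happening to be 10 characters long. It assumes `id` holds strings, as it does in this dataset.

```python
import pandas as pd

# toy data: two valid ids and one date that landed in the id column
toy = pd.DataFrame({"id": ["862", "133018", "2012-09-29"]})

# approach used above: drop rows whose id is 10 characters long
by_length = toy[toy["id"].str.len() != 10]

# alternative sketch: keep only rows whose id consists solely of digits
by_digits = toy[toy["id"].str.isdigit()]
```

On this toy frame both filters keep the same two rows.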

### Step 3: Merging Data ###

The data for this model is stored in two separate collections (tables), so the two need to be merged into one DataFrame.

In [6]:
# making sure id columns are of same datatypes
movies['id'] = movies['id'].astype('int64')
credits['id'] = credits['id'].astype('int64')

# merging the two dataframes
df = movies.merge(credits, on='id')
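
`merge` defaults to an inner join, so a movie without a matching credits row (or the reverse) is dropped without any message. A small sketch on made-up data, using the `indicator` parameter of `DataFrame.merge` to see which rows would fall out:

```python
import pandas as pd

toy_movies = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
toy_credits = pd.DataFrame({"id": [2, 3, 4], "cast": ["x", "y", "z"]})

# inner join (the default): only ids present in both frames survive
inner = toy_movies.merge(toy_credits, on="id")

# outer join with indicator=True adds a '_merge' column naming each row's origin
audit = toy_movies.merge(toy_credits, on="id", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]
```

Here `inner` keeps ids 2 and 3, while `unmatched` lists ids 1 and 4.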

In [7]:
# columns we will be using at first
columns = ['id', 'cast', 'crew', 'genres', 'title', 'overview']

# displaying new dataframe
df[columns].head(3)

Unnamed: 0,id,cast,crew,genres,title,overview
0,862,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",Toy Story,"Led by Woody, Andy's toys live happily in his ..."
1,8844,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",Jumanji,When siblings Judy and Peter discover an encha...
2,15602,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",Grumpier Old Men,A family wedding reignites the ancient feud be...


### Step 4: Data Manipulation ###

Some of the columns hold JSON-like strings, so some basic extraction is needed to get at the data this model uses.

Extracting this text is necessary because the model relies on tokenization, an NLP technique that splits text into individual words (tokens). Counting those tokens gives the computer a numeric representation of text such as the English movie descriptions, which it can then compare across movies.

In [8]:
# parsing the stringified JSON-like columns
# importing the needed function
from ast import literal_eval

# converting each string into a real Python list of dicts
for col in ['cast', 'crew', 'genres']:
    df[col] = df[col].apply(literal_eval)
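
To make the parsing step concrete, here is a small sketch on a made-up string shaped like one `genres` cell. Because the cells use single quotes, they are Python literals rather than valid JSON, so `json.loads` would reject them; `literal_eval` parses them without executing arbitrary code.

```python
from ast import literal_eval

# a stringified list of dicts, shaped like one cell of the 'genres' column
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

parsed = literal_eval(raw)             # now a real list of dicts
names = [g["name"] for g in parsed]    # pull out one field per entry
print(" ".join(names))
```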

In [9]:
def new_columns(df, col, extract, join=True):
    # builds one entry per row of df[col]
    result = []
    for row in df[col]:
        # pull the requested key out of every dict in the row
        values = [item[extract] for item in row]
        # either join the values into one string or keep them as a list
        result.append(' '.join(values) if join else values)

    return result

df['cast_char_str'] = new_columns(df, 'cast', 'character')
df['cast_name_str'] = new_columns(df, 'cast', 'name')
df['genres'] = new_columns(df, 'genres', 'name')

In [10]:
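# function that returns the position of the first 'Director' entry in a list of jobs
# (returns None when the list has no director)
def get_director(x):
    if 'Director' in x:
        return x.index('Director')

# rows without a director get the out-of-range placeholder 1000000
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
df['director_id'] = pd.Series(new_columns(df, 'crew', 'job', False)).apply(get_director).replace(np.nan, 1000000)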

In [11]:
# creating the list of names of the crew
df['director_name'] = new_columns(df, 'crew', 'name', False)

In [12]:
# creating a new column holding the names of the directors
dir_list = []

# pairing each row's director index with its list of crew names
for index, name in zip(df['director_id'], df['director_name']):
    try:
        dir_list.append(name[int(index)])
    except (IndexError, ValueError):
        # the placeholder index is out of range: no director for this row
        dir_list.append(np.nan)

df['director_name'] = dir_list
df['director_name'].head(3)

0    John Lasseter
1     Joe Johnston
2    Howard Deutch
Name: director_name, dtype: object
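
The index-based lookup above can also be written as a single pass over each crew list. This is an alternative sketch rather than the method used in this notebook; `first_director` is a hypothetical helper and the crew entries are made up.

```python
def first_director(crew):
    # return the name of the first crew member whose job is 'Director',
    # or None when the list has no director
    return next((m["name"] for m in crew if m.get("job") == "Director"), None)

toy_crew = [
    {"job": "Producer", "name": "Jane Roe"},
    {"job": "Director", "name": "John Doe"},
]
print(first_director(toy_crew))
print(first_director([]))
```

It would be applied to the parsed `crew` column with `apply`, before that column is dropped.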

In [13]:
# dropping columns that are no longer needed
df.drop(
    ['cast', 'crew', '_id_x', '_id_y', 'poster_path', 'genres', 'production_companies', 'production_countries'],
    axis='columns', inplace=True)

In [14]:
# displaying the data
df.head(1)

Unnamed: 0,adult,budget,id,imdb_id,original_language,original_title,overview,popularity,release_date,revenue,...,status,title,vote_average,vote_count,tagline,idlen,cast_char_str,cast_name_str,director_id,director_name
0,False,30000000,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,1995-10-30,373554033,...,Released,Toy Story,7.7,5415,,3,Woody (voice) Buzz Lightyear (voice) Mr. Potat...,Tom Hanks Tim Allen Don Rickles Jim Varney Wal...,0.0,John Lasseter


In [15]:
# printing out the columns
df.columns

Index(['adult', 'budget', 'id', 'imdb_id', 'original_language',
       'original_title', 'overview', 'popularity', 'release_date', 'revenue',
       'runtime', 'spoken_languages', 'status', 'title', 'vote_average',
       'vote_count', 'tagline', 'idlen', 'cast_char_str', 'cast_name_str',
       'director_id', 'director_name'],
      dtype='object')

In [16]:
# copying the data before it is changed
model_data = df.copy()

# selecting columns to use to build a model
data = model_data[['id', 'original_language', 'original_title', 'overview', 'director_name', 'cast_name_str']]

# checking that no nulls remain once null rows are dropped
# (the shape printed is that of the data before dropping)
print(data.dropna(axis=0).isnull().sum(), data.shape)

# dropping null values
modelDF = data.dropna(axis=0)

id                   0
original_language    0
original_title       0
overview             0
director_name        0
cast_name_str        0
dtype: int64 (45538, 6)


In [17]:
# printing first row of DF that will be used to train model
modelDF.head(1)

Unnamed: 0,id,original_language,original_title,overview,director_name,cast_name_str
0,862,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",John Lasseter,Tom Hanks Tim Allen Don Rickles Jim Varney Wal...


### Step 5: Training the Model ###

Now it is time to use scikit-learn objects and functions to train the model.

In [18]:
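# creating a Series with one text string per movie (the "soup") to tokenize
# for recommendations
# note: the fields are concatenated without a separator, so the last word of
# one field runs into the first word of the next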
def create_soup(df):
    return df['original_language'] + df['original_title'] \
        + df['overview'] + df['director_name'] + df['cast_name_str']

soup = create_soup(modelDF)
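
Because `create_soup` concatenates the fields with no separator, words at field boundaries fuse into a single token. The sketch below illustrates the effect on made-up field values, using the regular expression that `CountVectorizer` uses as its default `token_pattern`; it does not model stop-word removal. Joining with a space instead is an assumption about what might be preferable, and it would change the vocabulary and therefore the results shown in this notebook.

```python
import re

# CountVectorizer's default token pattern: runs of two or more word characters
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def tokens(text):
    # lowercase first, as CountVectorizer does by default
    return TOKEN.findall(text.lower())

fields = ["en", "Toy Story", "Led by Woody", "John Lasseter"]

fused = tokens("".join(fields))     # concatenation without a separator
spaced = tokens(" ".join(fields))   # concatenation with a space
print(fused)
print(spaced)
```

In the fused version, `story` and `led` become the single token `storyled`, so neither word can match the same word in another movie.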

In [19]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(soup)

count_matrix.shape

(43771, 250534)

In [20]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
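
For intuition, the cosine similarity of two count vectors is their dot product divided by the product of their lengths. A dependency-free sketch on made-up token lists follows; the helper `cosine` is hypothetical, and `cosine_similarity` in scikit-learn computes the same quantity for every pair of rows at once.

```python
from collections import Counter
from math import sqrt

def cosine(a_tokens, b_tokens):
    a, b = Counter(a_tokens), Counter(b_tokens)
    # dot product over the tokens the two documents share
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

x = ["space", "rebel", "empire"]
y = ["space", "empire", "jedi"]
z = ["wedding", "feud"]
print(cosine(x, y), cosine(x, z))
```

Two of the three tokens in `x` and `y` overlap, giving a score of 2/3, while `x` and `z` share nothing and score 0. Note that with 43,771 rows the full pairwise matrix has about 1.9 billion entries, roughly 15 GB if stored as dense 64-bit floats.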

In [23]:
# Reset the index of modelDF and build a reverse mapping from title to row position
metadata = modelDF.reset_index()
indices = pd.Series(metadata.index, index=metadata['original_title'])

# Function that takes in movie title as input and outputs most similar movies
def get_recommendations(title, cosine_sim=cosine_sim2):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar movies
    # (position 0 is skipped because it is normally the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies
    return metadata['original_title'].iloc[movie_indices]
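
One caveat with a title-based mapping: when two movies share a title, indexing the Series by that title returns a Series of positions rather than a single integer, and an unknown title raises a `KeyError`. The sketch below shows one way to handle both cases on made-up titles; `lookup` is a hypothetical helper, and picking the first match is an assumption, not the notebook's behavior.

```python
import pandas as pd

# toy title -> position mapping with one duplicated title
toy_titles = pd.Series(["Star Wars", "Superman", "Superman"])
toy_indices = pd.Series(toy_titles.index, index=toy_titles)

def lookup(title, mapping):
    # hypothetical helper: always return a single integer position
    if title not in mapping.index:
        raise KeyError(f"title not found: {title!r}")
    idx = mapping[title]
    # a duplicated label yields a Series; take the first match
    return int(idx.iloc[0]) if isinstance(idx, pd.Series) else int(idx)

print(lookup("Star Wars", toy_indices), lookup("Superman", toy_indices))
```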

### Step 6: Using the Model ###

In [32]:
get_recommendations('Star Wars')

1144                               The Empire Strikes Back
1157                                    Return of the Jedi
915                        Around the World in Eighty Days
29816                        The Star Wars Holiday Special
2588                                             Spartacus
1152                                    Lawrence of Arabia
2498                                              Superman
15357    Empire of Dreams: The Story of the Star Wars T...
10578                            Knute Rockne All American
11219                                        Union Pacific
Name: original_title, dtype: object