# Explanation of topic

## Movie Recommendation

For this midterm, I started from the movie recommendation system given as an example and added a simple chatbot pipeline, so that you can request information about movies similar to the one initially passed in.

In [1]:
#Imports
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import make_classification

In [2]:
#Reading in data
df = pd.read_csv('metadata_clean.csv')

orig_df = pd.read_csv('movies_metadata.csv', low_memory=False)

In [3]:
#Copying the overview and id columns from the original metadata (pandas aligns them on the row index)
df['overview'], df['id'] = orig_df['overview'], orig_df['id']

In [4]:
#Creating Vectorizer instance with English stopwords
tfidf = TfidfVectorizer(stop_words='english')

In [5]:
#Filling NA
df['overview'] = df['overview'].fillna('')

In [6]:
#Preview of data structure
df.iloc[34682]

title                 How the Lion Cub and the Turtle Sang a Song
genres                                              ['Animation']
runtime                                                       9.0
vote_average                                                  6.5
vote_count                                                    4.0
year                                                         1974
overview        The Tortoise composed a song and the Lion cub ...
id                                                         273126
Name: 34682, dtype: object

In [7]:
#Building the TF-IDF matrix, then the pairwise cosine similarities between all overviews
tfidf_matrix = tfidf.fit_transform(df['overview'])
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
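
A note on why `linear_kernel` yields cosine similarity here: `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the plain dot product of two rows already equals their cosine similarity. A minimal check on a made-up three-document corpus (the sentences and the `toy_matrix` name are illustrative, not taken from the dataset):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy corpus, purely illustrative
docs = [
    "a young lion cub wants to be king",
    "the tortoise composed a song for the lion cub",
    "a detective investigates a murder in the city",
]

toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(docs)

dot_scores = linear_kernel(toy_matrix, toy_matrix)
cos_scores = cosine_similarity(toy_matrix, toy_matrix)

# Rows are unit length, so both computations agree
same = np.allclose(dot_scores, cos_scores)
```

An overview that is empty after stop-word removal produces an all-zero row, which scores 0 against everything, including itself.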

In [8]:
#Creating our table of indices for faster lookups
indices = pd.Series(df.index, index=df['title']).drop_duplicates()
indices

title
Toy Story                          0
Jumanji                            1
Grumpier Old Men                   2
Waiting to Exhale                  3
Father of the Bride Part II        4
                               ...  
Subdue                         45461
Century of Birthing            45462
Betrayal                       45463
Satan Triumphant               45464
Queerama                       45465
Length: 45466, dtype: int64
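
One caveat with this lookup table: `drop_duplicates()` on a Series removes repeated *values*, and the values here are the already unique row numbers, so any repeated titles stay in the index. If a title appears more than once, `indices[title]` returns a Series instead of a single integer. A possible fix, sketched on a made-up frame (the `toy_` names and the sample titles are hypothetical), keeps the first row for each title:

```python
import pandas as pd

toy_df = pd.DataFrame({'title': ['Prey', 'Thirst', 'Prey']})

# Same construction as above: the duplicate title survives
toy_indices = pd.Series(toy_df.index, index=toy_df['title']).drop_duplicates()
ambiguous = toy_indices['Prey']          # a Series with two entries

# Keep only the first row for each title
unique_indices = toy_indices[~toy_indices.index.duplicated(keep='first')]
first_match = unique_indices['Prey']     # a single integer
```

Keeping the first occurrence is an arbitrary choice; disambiguating by year would be another option.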

In [46]:
def content_recommender(title):
    #Looking up the row number for the given title
    idx = indices[title]
    #Pairing every row number with its similarity to the given movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    #Reordering so the highest score is first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    #Keeping the next 20, skipping the first entry (normally the movie itself)
    sim_scores = sim_scores[1:21]
    #Collecting the row numbers of those movies
    movie_indices = [i[0] for i in sim_scores]
    #Returning the matching rows
    return df.iloc[movie_indices]
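
The full `cosine_sim` matrix holds one entry per pair of movies, which for the 45,466 rows shown above is roughly two billion values. A lighter alternative, sketched here under the assumption that only one title is queried at a time, computes just the row that is needed and orders it with `np.argsort`. The function name `recommend_on_demand` and the toy data are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

toy_df = pd.DataFrame({
    'title': ['Lion A', 'Lion B', 'City Crime', 'Turtle Song'],
    'overview': [
        'a lion cub grows up on the savanna',
        'a lion cub learns to hunt on the savanna',
        'a detective hunts a killer in the city',
        'a turtle sings a song',
    ],
})
toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_df['overview'])
toy_indices = pd.Series(toy_df.index, index=toy_df['title'])

def recommend_on_demand(title, top_n=2):
    idx = toy_indices[title]
    # Similarity of one movie against all others: a single row, not a full matrix
    scores = linear_kernel(toy_matrix[idx:idx + 1], toy_matrix).ravel()
    # Highest scores first, then drop the queried movie itself
    order = np.argsort(scores)[::-1]
    order = order[order != idx][:top_n]
    return toy_df.iloc[order]

result = recommend_on_demand('Lion A')
```

Removing the queried row by its index, rather than skipping the first sorted entry, also covers the case where the movie does not rank first against itself.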

In [47]:
content_recommender('How the Lion Cub and the Turtle Sang a Song')

| | title | genres | runtime | vote_average | vote_count | year | overview | id |
|---|---|---|---|---|---|---|---|---|
| 359 | The Lion King | ['Family', 'Animation', 'Drama'] | 89.0 | 8.0 | 5520.0 | 1994 | A young lion cub named Simba can't wait to be ... | 8587 |
| 36290 | Bad | ['Music'] | 17.0 | 7.6 | 20.0 | 1987 | "Bad" is a song by American song writer and re... | 342381 |
| 42829 | Prey | ['Horror', 'Thriller'] | 0.0 | 5.9 | 8.0 | 2016 | A Lion Terrorizes the City of Amsterdam. | 400610 |
| 19252 | The Tortoise and the Hare | ['Animation'] | 9.0 | 6.8 | 21.0 | 1935 | The Tortoise and the Hare is an animated short... | 58159 |
| 17041 | African Cats | ['Documentary', 'Family', 'Adventure'] | 89.0 | 7.1 | 74.0 | 2011 | African Cats captures the real-life love, humo... | 57586 |
| 33842 | Gangster High | ['Action', 'Crime'] | 100.0 | 6.2 | 13.0 | 2006 | Sang-ho is an ordinary high school student. Hi... | 40122 |
| 33778 | Broken | ['Thriller', 'Drama', 'Mystery'] | 122.0 | 6.6 | 15.0 | 2014 | Sang-Hyun, who lost his wife, lives with his d... | 264273 |
| 33798 | Howling | ['Thriller', 'Mystery', 'Foreign'] | 114.0 | 5.9 | 33.0 | 2012 | A man is found burned to death inside of a car... | 116227 |
| 14177 | Thirst | ['Drama', 'Horror', 'Thriller'] | 133.0 | 7.0 | 198.0 | 2009 | Sang-hyun (Song Kang-ho), a respected priest, ... | 22536 |
| 27933 | Massaï, les guerriers de la pluie | ['Adventure', 'Drama'] | 94.0 | 8.5 | 2.0 | 2004 | After a lion kills the village leader, the exp... | 74684 |


In [11]:
#Creation of chatbot pipeline to be used in recommendation bot.
text_clf = Pipeline([
    ('BOW', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())
])
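
To illustrate how this pipeline is meant to work as a lookup: it is fit with titles as the training text and row numbers as the class labels, so `predict` maps free-form input back to a row. The sketch below uses three title and row-number pairs taken from the outputs in this notebook; the `toy_` names are hypothetical. `CountVectorizer` lowercases its input and matches whole words, so this tolerates capitalization differences and missing words, but not misspellings inside a word.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

toy_clf = Pipeline([
    ('BOW', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

# Three titles with their row numbers, for illustration only
toy_titles = ['The Lion King', 'African Cats', 'Born Free']
toy_rows = [359, 17041, 6094]
toy_clf.fit(toy_titles, toy_rows)

# predict returns an array with one label per input string
lion_row = int(toy_clf.predict(['lion king'])[0])
cats_row = int(toy_clf.predict(['AFRICAN cats'])[0])
```

Because this is a classifier, it always returns one of the known labels, even when the input shares no words with any title.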

In [50]:
def recommendation_bot(title):
    ids = []
    #Calls the content recommender created previously
    movies = content_recommender(title)
    print(movies['title'])
    #Fits the pipeline with the recommended titles as text and their row indices as labels
    text_clf.fit(movies['title'], movies.index)
    #Requests input from the user
    movie = input('If you would like to see some overviews, please enter the names of any of the following movies. No need to make sure it is exact. \nType "done" if not.\n')
    while movie != 'done':
        #Predicts the row index of the recommended title closest to the input
        movie_id = int(text_clf.predict([movie])[0])
        ids.append(movie_id)
        #Requests input from the user
        movie = input('Any more movies to add to the overview? \nType "done" if not.\n')
    #Returns the rows for the selected movies
    return df.iloc[ids]

In [51]:
recommendation_bot('The Lion King')

34682    How the Lion Cub and the Turtle Sang a Song
9353                                The Lion King 1½
9115                  The Lion King 2: Simba's Pride
42829                                           Prey
25654                                 Fearless Fagan
17041                                   African Cats
27933              Massaï, les guerriers de la pluie
6094                                       Born Free
37409                                     Sour Grape
3203                                The Waiting Game
14402           Michael Jackson: Life of a Superstar
3293                                        The Bear
31208                                     White Lion
34754                             Helen the Baby Fox
37278                          The Little Bear Movie
35242                              His Name was King
31439                   Michael Jackson: Number Ones
34095                          Knives of the Avenger
6574                   Once Upon a Time in Chi

| | title | genres | runtime | vote_average | vote_count | year | overview | id |
|---|---|---|---|---|---|---|---|---|
| 42829 | Prey | ['Horror', 'Thriller'] | 0.0 | 5.9 | 8.0 | 2016 | A Lion Terrorizes the City of Amsterdam. | 400610 |
| 37278 | The Little Bear Movie | ['Animation', 'Family'] | 75.0 | 6.0 | 2.0 | 2001 | Little Bear and Father Bear go camping in the ... | 33371 |
| 34095 | Knives of the Avenger | ['Action', 'Adventure'] | 85.0 | 5.5 | 3.0 | 1966 | A mysterious knife-throwing viking warrior pro... | 66087 |


# Conclusion

I first tried to use the pipeline to look up the movie before generating recommendations, so that perfect spelling and capitalization would not be required and you could find movies whose exact title you are unsure of. That did work, but frankly it was extremely slow. I think I need to learn more about building efficient search engines before attempting to compare titles across more than 45,000 records. I ended up settling for a small pipeline that lets the user request data on one of the movie recommendations without needing proper capitalization or perfect spelling. Twenty records is small enough that the pipeline can be refit on a new set of recommendations every time, which allows it to run inside the function at call time.

For future additions, I would love to add the ability to search the entire dataset through the pipeline, but that needs a more efficient search. Expanding the chatbot would also be ideal, so that the user can ask for specific information instead of the entire overview.
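
One possible direction for the faster whole-dataset title search, offered as a sketch that has not been run on the real data: vectorize the titles once with character n-grams, then score each query against all of them with a single sparse product. Character n-grams tolerate small typos, which word-level features do not. The names `title_vectorizer` and `find_title`, the n-gram range, and the choice of sample titles are all assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# A few titles from the outputs above; the real list would be df['title']
titles = ['Toy Story', 'Jumanji', 'The Lion King', 'African Cats', 'Born Free']

# Fit once, up front; character n-grams taken within word boundaries
title_vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 3))
title_matrix = title_vectorizer.fit_transform(titles)

def find_title(query):
    # Vectorize the query with the already-fitted vocabulary
    query_vec = title_vectorizer.transform([query])
    # One row of similarities: the query against every title
    scores = linear_kernel(query_vec, title_matrix).ravel()
    return titles[int(np.argmax(scores))]

match = find_title('lion kng')
```

Unlike the classifier pipeline, nothing is refit per query here, so the cost of each lookup is one sparse matrix product.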