<a href="https://colab.research.google.com/github/JamesKha/MovieRecommendation/blob/main/TfIdfVectorizer_Checker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

In [3]:
tagsDF = pd.read_csv('/content/drive/MyDrive/Miniproject Files/tags.csv')
moviesDF = pd.read_csv('/content/drive/MyDrive/Miniproject Files/movies.csv')

# Recommending with tags

In [4]:
tagsAndGenresDataFrame = pd.merge(tagsDF, moviesDF, on='movieId')[['movieId','title', 'tag','genres']]
titlesAndTags = pd.DataFrame(tagsAndGenresDataFrame.groupby(['title']).tag.unique())
titlesAndTags = titlesAndTags.reset_index()
titlesAndTags['tag'] = titlesAndTags['tag'].apply(lambda x: ' '.join([str(i) for i in x]))
titlesAndTags['tag'] = titlesAndTags['tag'].str.lower()

In [5]:
titlesAndTags.head()

Unnamed: 0,title,tag
0,(500) Days of Summer (2009),artistic funny humorous inspiring intelligent ...
1,...And Justice for All (1979),lawyers
2,10 Cloverfield Lane (2016),creepy suspense
3,10 Things I Hate About You (1999),shakespeare sort of
4,101 Dalmatians (1996),dogs remake
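
The merge-and-group step above can be tried on a few rows without the Drive files. This is a minimal sketch, assuming two made-up frames (`toy_tags`, `toy_movies`) that only mimic the columns of `tags.csv` and `movies.csv`; the titles and tags are hypothetical. It also shows that `unique()` is case-sensitive, so lowercasing afterwards can leave repeated words in the joined string.

```python
import pandas as pd

# Hypothetical stand-ins for tags.csv and movies.csv (not the real data).
toy_tags = pd.DataFrame({
    "movieId": [1, 1, 1, 2],
    "tag": ["Funny", "funny", "Will Ferrell", "Boxing"],
})
toy_movies = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Movie A (2008)", "Movie B (2011)"],
    "genres": ["Comedy", "Drama"],
})

merged = pd.merge(toy_tags, toy_movies, on="movieId")

# One row per title: unique tags joined by spaces, then lowercased.
# "Funny" and "funny" are distinct for unique(), so both survive.
toy_titles_and_tags = (
    merged.groupby("title")["tag"]
    .unique()
    .apply(lambda tags: " ".join(str(t) for t in tags))
    .str.lower()
    .reset_index()
)
print(toy_titles_and_tags)
```

Lowercasing the `tag` column before calling `unique()` would be one way to avoid the repeated word, if that matters for the weighting.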


In [6]:
tfidf = TfidfVectorizer(stop_words='english')
titlesAndTags['tag'] = titlesAndTags['tag'].fillna('')
tfidf_matrix = tfidf.fit_transform(titlesAndTags['tag'])
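
To see what the vectorizer produces, here is a small sketch on three made-up tag strings (`toy_docs` is an assumption, not data from the notebook). It illustrates two effects of the cell above: `stop_words='english'` drops common words such as "of", and an empty string (what `fillna('')` yields for a missing tag) becomes an all-zero row.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical tag strings; the last one mimics a missing tag after fillna('').
toy_docs = ["shakespeare sort of", "dogs remake", ""]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_docs)

# Rows are documents, columns are the vocabulary terms that were kept.
print(toy_matrix.shape)
print(sorted(toy_tfidf.vocabulary_))

# With the default norm='l2', every non-empty row has unit length.
row_norms = np.sqrt(np.asarray(toy_matrix.multiply(toy_matrix).sum(axis=1)).ravel())
print(row_norms)
```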

In [7]:
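from sklearn.metrics.pairwise import linear_kernel
# TfidfVectorizer L2-normalizes each row by default, so the plain dot product
# computed by linear_kernel is already the cosine similarity.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)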
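
Using `linear_kernel` for cosine similarity relies on the tf-idf rows having unit length, which is the default behaviour of `TfidfVectorizer` (`norm='l2'`). A quick check on made-up tag strings (`toy_docs` is hypothetical) compares it with `cosine_similarity` directly:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical tag strings, only for illustration.
toy_docs = ["comedy funny", "comedy will ferrell", "creepy suspense"]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

dot_products = linear_kernel(toy_matrix, toy_matrix)
cosines = cosine_similarity(toy_matrix, toy_matrix)

# Both matrices agree because every row already has L2 norm 1.
print(np.allclose(dot_products, cosines))
```

If the vectorizer were created with `norm=None`, the two would differ and `cosine_similarity` would be the safer call.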

In [8]:
indices = pd.Series(titlesAndTags.index, index=titlesAndTags['title']).drop_duplicates()

In [9]:
indices

title
(500) Days of Summer (2009)             0
...And Justice for All (1979)           1
10 Cloverfield Lane (2016)              2
10 Things I Hate About You (1999)       3
101 Dalmatians (1996)                   4
                                     ... 
Zero Dark Thirty (2012)              1567
Zombieland (2009)                    1568
Zoolander (2001)                     1569
Zulu (1964)                          1570
eXistenZ (1999)                      1571
Length: 1572, dtype: int64

In [10]:
cosine_sim

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [11]:
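def get_recommendations(title, cosine_sim=cosine_sim):
    # Row position of the requested title in titlesAndTags
    idx = indices[title]
    # Pair every row position with its similarity to the requested title
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Most similar first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (normally the title itself, similarity 1) and keep the next 10
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    return titlesAndTags[['title','tag']].iloc[movie_indices]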

In [12]:
titlesAndTags

Unnamed: 0,title,tag
0,(500) Days of Summer (2009),artistic funny humorous inspiring intelligent ...
1,...And Justice for All (1979),lawyers
2,10 Cloverfield Lane (2016),creepy suspense
3,10 Things I Hate About You (1999),shakespeare sort of
4,101 Dalmatians (1996),dogs remake
...,...,...
1567,Zero Dark Thirty (2012),afghanistan american propaganda assassination ...
1568,Zombieland (2009),bill murray dark comedy emma stone funny jesse...
1569,Zoolander (2001),ben stiller comedy david bowie goofy mindless ...
1570,Zulu (1964),africa


In [13]:
get_recommendations('Step Brothers (2008)')

Unnamed: 0,title,tag
986,Old School (2003),comedy will ferrell
66,Anchorman 2: The Legend Continues (2013),comedy steve carell stupid but funny will ferrell
36,Airheads (1994),comedy
242,Chalet Girl (2011),comedy
146,"Big Lebowski, The (1998)",coen brothers black comedy bowling classic coe...
491,Game Night (2018),comedy funny rachel mcadams
67,Anchorman: The Legend of Ron Burgundy (2004),hilarious steve carell will ferrell stupid awe...
274,Clueless (1995),chick flick funny paul rudd quotable seen more...
713,Jumanji: Welcome to the Jungle (2017),action comedy dwayne johnson funny
1385,The Lego Batman Movie (2017),funny heartwarming
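
The `sim_scores[1:11]` slice assumes the queried movie always sorts first. When another movie has identical tags, both score 1.0 and the query can land in second place and be returned as its own recommendation. The sketch below is one possible alternative, not the notebook's method: a hypothetical helper `top_n_similar` that removes the query index explicitly before taking the top `n`.

```python
import numpy as np

def top_n_similar(sim_row, query_idx, n=10):
    """Return positions of the n highest scores in sim_row, skipping query_idx."""
    # Stable sort on negated scores: highest first, ties keep their original order.
    order = np.argsort(-np.asarray(sim_row), kind="stable")
    # Drop the query itself wherever it ended up in the ranking.
    order = order[order != query_idx]
    return order[:n].tolist()

# Made-up similarity row: positions 0 and 2 tie at 1.0, and 2 is the query.
toy_row = np.array([1.0, 0.2, 1.0, 0.0, 0.7])
print(top_n_similar(toy_row, query_idx=2, n=3))  # [0, 4, 1]
```

With the slice-based approach, the same row would drop position 0 and keep position 2, i.e. the query.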


# Implementation with genres

In [14]:
movieAndGenres = tagsAndGenresDataFrame

In [15]:
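# Join all tags of a movie into one string; str() guards against non-string tags such as NaN
movieAndGenres['tag'] = movieAndGenres.groupby(['movieId','title'])['tag'].transform(lambda x: ' '.join(map(str, x)))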

In [16]:
movieAndGenres = movieAndGenres.drop_duplicates()

In [17]:
movieAndGenres

Unnamed: 0,movieId,title,tag,genres
0,60756,Step Brothers (2008),funny Highly quotable will ferrell comedy funn...,Comedy
8,89774,Warrior (2011),Boxing story MMA Tom Hardy,Drama
11,106782,"Wolf of Wall Street, The (2013)",drugs Leonardo DiCaprio Martin Scorsese Stock ...,Comedy|Crime|Drama
16,48516,"Departed, The (2006)",way too long Leonardo DiCaprio suspense twist ...,Crime|Drama|Thriller
26,431,Carlito's Way (1993),Al Pacino gangster mafia,Crime|Drama
...,...,...,...,...
3677,1948,Tom Jones (1963),British,Adventure|Comedy|Romance
3678,5694,Staying Alive (1983),70mm,Comedy|Drama|Musical
3679,6107,Night of the Shooting Stars (Notte di San Lore...,World War II,Drama|War
3680,7936,Shame (Skammen) (1968),austere,Drama|War


In [18]:
movieAndGenres['genresList'] = movieAndGenres['genres'].str.split("|")
movieAndGenres.drop('genres',axis=1, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
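
The messages above are pandas' SettingWithCopyWarning: the frame returned by `drop_duplicates()` is treated as a possible slice of the original, so later column assignments are flagged. One common way to avoid this, sketched here on a made-up frame (`toy` and its rows are assumptions), is to take an explicit `.copy()` and to reassign instead of using `inplace=True`.

```python
import pandas as pd

# Hypothetical rows that only mimic the title/genres columns.
toy = pd.DataFrame({
    "title": ["Movie A (2008)", "Movie A (2008)", "Movie B (2011)"],
    "genres": ["Comedy|Drama", "Comedy|Drama", "Drama"],
})

# .copy() makes the de-duplicated frame an independent object,
# so the assignments below modify it directly.
deduped = toy.drop_duplicates().copy()
deduped["genresList"] = deduped["genres"].str.split("|")

# Reassigning the result of drop() avoids mutating in place.
deduped = deduped.drop(columns="genres")
print(deduped)
```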


In [19]:
movieAndGenres['genresList'] = movieAndGenres['genresList'].apply(lambda x: ' '.join([str(i) for i in x]))
movieAndGenres['genresList'] = movieAndGenres['genresList'].str.lower()

In [20]:
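# Raw string avoids invalid-escape warnings; the greedy leading .* means the last parenthesized group is captured
movieAndGenres['year'] = movieAndGenres['title'].str.extract(r'.*\((.*)\).*')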

In [21]:
def create_soup(x):
    # 'tag' and 'genresList' are already plain strings, so concatenate them directly
    return x['tag'].lower() + ' ' + x['genresList']
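
The "soup" is simply the lowercased tags followed by the lowercased genres, separated by a space. As a hedged alternative to the row-wise `apply`, the same string can be built with vectorized string methods; the frame `toy` below is made up for illustration and starts from the raw `genres` column rather than `genresList`.

```python
import pandas as pd

# Hypothetical rows that only mimic the tag/genres columns.
toy = pd.DataFrame({
    "tag": ["Boxing story MMA", "Al Pacino gangster"],
    "genres": ["Drama", "Crime|Drama"],
})

# Replace the literal "|" separator with spaces, lowercase both parts, and concatenate.
genres_text = toy["genres"].str.replace("|", " ", regex=False).str.lower()
toy["soup"] = toy["tag"].str.lower() + " " + genres_text
print(toy["soup"].tolist())
```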

In [22]:
movieAndGenres['soup'] = movieAndGenres.apply(create_soup, axis=1)
movieAndGenres.reset_index(drop=True,inplace=True)


# Word Soup


In [23]:
tfidf_matrix = tfidf.fit_transform(movieAndGenres['soup'])

In [24]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [25]:
movieAndGenresindices = pd.Series(movieAndGenres.index, index=movieAndGenres['title']).drop_duplicates()

In [26]:
movieAndGenresindices

title
Step Brothers (2008)                                                0
Warrior (2011)                                                      1
Wolf of Wall Street, The (2013)                                     2
Departed, The (2006)                                                3
Carlito's Way (1993)                                                4
                                                                 ... 
Tom Jones (1963)                                                 1567
Staying Alive (1983)                                             1568
Night of the Shooting Stars (Notte di San Lorenzo, La) (1982)    1569
Shame (Skammen) (1968)                                           1570
Hard-Boiled (Lat sau san taam) (1992)                            1571
Length: 1572, dtype: int64

In [27]:
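def get_recommendations(title, cosine_sim=cosine_sim):
    # Row position of the requested title in movieAndGenres
    idx = movieAndGenresindices[title]
    # Pair every row position with its similarity to the requested title
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Most similar first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (normally the title itself, similarity 1) and keep the next 10
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    return movieAndGenres[['title','soup']].iloc[movie_indices]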
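
The lookup `movieAndGenresindices[title]` raises a `KeyError` when the title is not spelled exactly as in the data, including the year suffix. A possible guard, sketched with the standard-library `difflib`, is shown below; `resolve_title` and the toy index are hypothetical and not part of the notebook.

```python
import difflib
import pandas as pd

def resolve_title(query, title_index):
    """Return the known title closest to `query`, or None if nothing is close."""
    # `in` on a Series tests membership in its index (the titles here).
    if query in title_index:
        return query
    matches = difflib.get_close_matches(query, list(title_index.index), n=1, cutoff=0.6)
    return matches[0] if matches else None

# Made-up title -> row-position mapping, shaped like the indices Series.
toy_index = pd.Series(
    [0, 1, 2],
    index=["Step Brothers (2008)", "Warrior (2011)", "Zulu (1964)"],
)

print(resolve_title("Step Brothers", toy_index))
print(resolve_title("qqqq", toy_index))
```

The resolved title could then be passed to `get_recommendations`, with a `None` result handled as "no such movie".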

In [28]:
testDF = pd.DataFrame(get_recommendations('Step Brothers (2008)'))

In [31]:
testDF.to_csv('/content/drive/MyDrive/output1.csv')

In [32]:
import streamlit as st
import pandas as pd
import numpy as np

In [39]:
%%writefile app.py
import streamlit as st
import pandas as pd
PAGE_CONFIG = {"page_title": "Movie Recommender", "page_icon": ":movie_camera:", "layout": "centered"}
st.set_page_config(**PAGE_CONFIG)


def main():
    st.title("Movie Recommender")
    # output1.csv was written with its index, so read the first column back as the index
    st.dataframe(pd.read_csv('/content/drive/MyDrive/output1.csv', index_col=0))


if __name__ == '__main__':
    main()

Overwriting app.py


In [41]:
!streamlit run app.py &>/dev/null&

In [42]:
!pgrep streamlit

428
508
542
677
719
750
882
929
1006
1138
1179


In [43]:
from pyngrok import ngrok
# Set up a tunnel to the Streamlit port 8501
public_url = ngrok.connect(port='8501')
public_url

2021-08-31 01:13:08.667 INFO    pyngrok.process: t=2021-08-31T01:13:08+0000 lvl=info msg=start pg=/api/tunnels id=e6b81f2338cf5ebe

2021-08-31 01:13:08.753 INFO    pyngrok.process: t=2021-08-31T01:13:08+0000 lvl=info msg="started tunnel" obj=tunnels name="http-8501-d1c6aef3-3763-4244-b3e8-2960b664375f (http)" addr=http://localhost:8501 url=http://fb92-34-73-71-140.ngrok.io

2021-08-31 01:13:08.757 INFO    pyngrok.process: t=2021-08-31T01:13:08+0000 lvl=info msg="started tunnel" obj=tunnels name=http-8501-d1c6aef3-3763-4244-b3e8-2960b664375f addr=http://localhost:8501 url=https://fb92-34-73-71-140.ngrok.io



'http://fb92-34-73-71-140.ngrok.io'

2021-08-31 01:13:08.763 INFO    pyngrok.process: t=2021-08-31T01:13:08+0000 lvl=info msg=end pg=/api/tunnels id=e6b81f2338cf5ebe status=201 dur=86.247969ms

