## Movie recommendation system
Can we predict which movies are likely to be a commercial success? This dataset holds a subset of the data available through the TMDB API: roughly 5,000 movies out of the full catalogue. Two CSV files are available, both loaded in Step 1: `tmdb_5000_movies.csv` and `tmdb_5000_credits.csv`.

### Step 1: Loading the dataset

In [20]:
import pandas as pd

movies = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_movies.csv")
movies.head()
movies.shape

(4803, 20)

In [21]:
credits = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_credits.csv")
credits.head()
credits.shape

(4803, 4)

### Step 2: Creation of a database

In [22]:
import sqlite3

# Create the database and populate its tables

con = sqlite3.connect("../data/test.db")
movies.to_sql('movies', con, if_exists='replace', index=False)
credits.to_sql('credits', con, if_exists='replace', index=False)


4803

In [23]:
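# Build the third table with a join.
# Note: joining on title pairs every movie with every credits row that has the
# same title, so films sharing a title produce extra rows (4809 rows here,
# against 4803 in each input table).
query = """SELECT *
FROM movies
INNER JOIN credits ON movies.title=credits.title;
"""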

In [24]:
# Load the joined result into a third DataFrame
data = pd.read_sql_query(query,con)
print(data.head())
print(data.shape)

      budget                                             genres  \
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
1  300000000  [{"id": 12, "name": "Adventure"}, {"id": 14, "...   
2  245000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
3  250000000  [{"id": 28, "name": "Action"}, {"id": 80, "nam...   
4  260000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   

                                       homepage      id  \
0                   http://www.avatarmovie.com/   19995   
1  http://disney.go.com/disneypictures/pirates/     285   
2   http://www.sonypictures.com/movies/spectre/  206647   
3            http://www.thedarkknightrises.com/   49026   
4          http://movies.disney.com/john-carter   49529   

                                            keywords original_language  \
0  [{"id": 1463, "name": "culture clash"}, {"id":...                en   
1  [{"id": 270, "name": "ocean"}, {"id": 726, "na...                en   
2  [{"id": 470, "nam
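The row count of a join depends on the key. The sketch below uses an in-memory SQLite database with made-up rows (two different films that share a title) to compare joining on `title` with joining on `movies.id = credits.movie_id`. The toy tables and their contents are assumptions for illustration, not part of the dataset.

```python
import sqlite3

# Hypothetical toy rows: two different films share the title "Batman".
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE movies (id INTEGER, title TEXT)")
con.execute("CREATE TABLE credits (movie_id INTEGER, title TEXT)")
rows = [(1, "Batman"), (2, "Batman"), (3, "Avatar")]
con.executemany("INSERT INTO movies VALUES (?, ?)", rows)
con.executemany("INSERT INTO credits VALUES (?, ?)", rows)

# Join on title: each "Batman" movie matches both "Batman" credits rows.
by_title = con.execute(
    "SELECT movies.id, credits.movie_id FROM movies "
    "INNER JOIN credits ON movies.title = credits.title"
).fetchall()

# Join on the numeric identifier: exactly one match per film.
by_id = con.execute(
    "SELECT movies.id, credits.movie_id FROM movies "
    "INNER JOIN credits ON movies.id = credits.movie_id"
).fetchall()

print(len(by_title))  # 5
print(len(by_id))     # 3
con.close()
```

If one row per film is wanted, joining on the identifier is the safer key; the rest of this notebook keeps the title join and its 4809 rows.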

In [25]:
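# Keep only the columns needed for the recommender
data = data[['movie_id','title','overview','genres','keywords','cast','crew']]
# The join returns two title columns; transposing, dropping duplicate rows and
# transposing back removes the repeated column (and leaves every dtype as object)
data = data.T.drop_duplicates().T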
print(data.shape)
print(data.info())

(4809, 7)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4809 entries, 0 to 4808
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4809 non-null   object
 1   title     4809 non-null   object
 2   overview  4806 non-null   object
 3   genres    4809 non-null   object
 4   keywords  4809 non-null   object
 5   cast      4809 non-null   object
 6   crew      4809 non-null   object
dtypes: object(7)
memory usage: 263.1+ KB
None
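As the `info()` output shows, the double transpose leaves every column with dtype `object`. A possible alternative, sketched here on a made-up two-row frame, is to drop repeated column names with `columns.duplicated()`, which keeps the original dtypes. The frame `joined` is hypothetical and only mimics the shape of the join output.

```python
import pandas as pd

# Hypothetical frame mimicking the join output: "title" appears twice.
joined = pd.DataFrame(
    [[19995, "Avatar", "Avatar"], [206647, "Spectre", "Spectre"]],
    columns=["movie_id", "title", "title"],
)

# Keep the first occurrence of each column name; dtypes are preserved.
deduped = joined.loc[:, ~joined.columns.duplicated()]

print(list(deduped.columns))
print(deduped["movie_id"].dtype)
```

Note that this compares column names, whereas `data.T.drop_duplicates().T` compares column contents.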


In [26]:
data.head(10)

| | movie_id | title | overview | genres | keywords | cast | crew |
|---|---|---|---|---|---|---|---|
| 0 | 19995 | Avatar | In the 22nd century, a paraplegic Marine is di... | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | [{"id": 1463, "name": "culture clash"}, {"id":... | [{"cast_id": 242, "character": "Jake Sully", "... | [{"credit_id": "52fe48009251416c750aca23", "de... |
| 1 | 285 | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, ha... | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | [{"id": 270, "name": "ocean"}, {"id": 726, "na... | [{"cast_id": 4, "character": "Captain Jack Spa... | [{"credit_id": "52fe4232c3a36847f800b579", "de... |
| 2 | 206647 | Spectre | A cryptic message from Bond’s past sends him o... | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | [{"id": 470, "name": "spy"}, {"id": 818, "name... | [{"cast_id": 1, "character": "James Bond", "cr... | [{"credit_id": "54805967c3a36829b5002c41", "de... |
| 3 | 49026 | The Dark Knight Rises | Following the death of District Attorney Harve... | [{"id": 28, "name": "Action"}, {"id": 80, "nam... | [{"id": 849, "name": "dc comics"}, {"id": 853,... | [{"cast_id": 2, "character": "Bruce Wayne / Ba... | [{"credit_id": "52fe4781c3a36847f81398c3", "de... |
| 4 | 49529 | John Carter | John Carter is a war-weary, former military ca... | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | [{"id": 818, "name": "based on novel"}, {"id":... | [{"cast_id": 5, "character": "John Carter", "c... | [{"credit_id": "52fe479ac3a36847f813eaa3", "de... |
| 5 | 559 | Spider-Man 3 | The seemingly invincible Spider-Man goes up ag... | [{"id": 14, "name": "Fantasy"}, {"id": 28, "na... | [{"id": 851, "name": "dual identity"}, {"id": ... | [{"cast_id": 30, "character": "Peter Parker / ... | [{"credit_id": "52fe4252c3a36847f80151a5", "de... |
| 6 | 38757 | Tangled | When the kingdom's most wanted-and most charmi... | [{"id": 16, "name": "Animation"}, {"id": 10751... | [{"id": 1562, "name": "hostage"}, {"id": 2343,... | [{"cast_id": 34, "character": "Flynn Rider (vo... | [{"credit_id": "52fe46db9251416c91062101", "de... |
| 7 | 99861 | Avengers: Age of Ultron | When Tony Stark tries to jumpstart a dormant p... | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | [{"id": 8828, "name": "marvel comic"}, {"id": ... | [{"cast_id": 76, "character": "Tony Stark / Ir... | [{"credit_id": "55d5f7d4c3a3683e7e0016eb", "de... |
| 8 | 767 | Harry Potter and the Half-Blood Prince | As Harry begins his sixth year at Hogwarts, he... | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | [{"id": 616, "name": "witch"}, {"id": 2343, "n... | [{"cast_id": 3, "character": "Harry Potter", "... | [{"credit_id": "52fe4273c3a36847f801fab1", "de... |
| 9 | 209112 | Batman v Superman: Dawn of Justice | Fearing the actions of a god-like Super Hero l... | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | [{"id": 849, "name": "dc comics"}, {"id": 7002... | [{"cast_id": 18, "character": "Bruce Wayne / B... | [{"credit_id": "553bf23692514135c8002886", "de... |


### Step 3: Transform the data

In [27]:
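# Parse the JSON-encoded columns into plain Python values
import json

# If a column has already been parsed (e.g. the cell is re-run), the lambda
# raises and that column is left unchanged.
try:
    # genres: keep every genre name
    data['genres'] = data['genres'].apply(lambda x : [var['name'] for var in json.loads(x)] if pd.notna(x) else None)
except (TypeError, ValueError):
    pass
try:
    # keywords: keep every keyword name
    data["keywords"] = data["keywords"].apply(lambda x: [item["name"] for item in json.loads(x)] if pd.notna(x) else None)
except (TypeError, ValueError):
    pass
try:
    # cast: keep only the first three names
    data["cast"] = data["cast"].apply(lambda x: [item["name"] for item in json.loads(x)][:3] if pd.notna(x) else None)
except (TypeError, ValueError):
    pass
try:
    # crew: keep only the director(s), joined into a single string
    data["crew"] = data["crew"].apply(lambda x: " ".join([crew_member['name'] for crew_member in json.loads(x) if crew_member['job'] == 'Director']))
except (TypeError, ValueError):
    pass
# overview: wrap the text in a one-element list so it can be concatenated with the other lists
data["overview"] = data["overview"].apply(lambda x: [x])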

print(data.head())


  movie_id                                     title  \
0    19995                                    Avatar   
1      285  Pirates of the Caribbean: At World's End   
2   206647                                   Spectre   
3    49026                     The Dark Knight Rises   
4    49529                               John Carter   

                                            overview  \
0  [In the 22nd century, a paraplegic Marine is d...   
1  [Captain Barbossa, long believed to be dead, h...   
2  [A cryptic message from Bond’s past sends him ...   
3  [Following the death of District Attorney Harv...   
4  [John Carter is a war-weary, former military c...   

                                          genres  \
0  [Action, Adventure, Fantasy, Science Fiction]   
1                   [Adventure, Fantasy, Action]   
2                     [Action, Adventure, Crime]   
3               [Action, Crime, Drama, Thriller]   
4           [Action, Adventure, Science Fiction]   

             

In [28]:
data.head()

| | movie_id | title | overview | genres | keywords | cast | crew |
|---|---|---|---|---|---|---|---|
| 0 | 19995 | Avatar | [In the 22nd century, a paraplegic Marine is d... | [Action, Adventure, Fantasy, Science Fiction] | [culture clash, future, space war, space colon... | [Sam Worthington, Zoe Saldana, Sigourney Weaver] | James Cameron |
| 1 | 285 | Pirates of the Caribbean: At World's End | [Captain Barbossa, long believed to be dead, h... | [Adventure, Fantasy, Action] | [ocean, drug abuse, exotic island, east india ... | [Johnny Depp, Orlando Bloom, Keira Knightley] | Gore Verbinski |
| 2 | 206647 | Spectre | [A cryptic message from Bond’s past sends him ... | [Action, Adventure, Crime] | [spy, based on novel, secret agent, sequel, mi... | [Daniel Craig, Christoph Waltz, Léa Seydoux] | Sam Mendes |
| 3 | 49026 | The Dark Knight Rises | [Following the death of District Attorney Harv... | [Action, Crime, Drama, Thriller] | [dc comics, crime fighter, terrorist, secret i... | [Christian Bale, Michael Caine, Gary Oldman] | Christopher Nolan |
| 4 | 49529 | John Carter | [John Carter is a war-weary, former military c... | [Action, Adventure, Science Fiction] | [based on novel, mars, medallion, space travel... | [Taylor Kitsch, Lynn Collins, Samantha Morton] | Andrew Stanton |
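The parsing step relies on `json.loads` turning each cell into a list of dictionaries, from which only the `name` field is kept. A minimal sketch of that idea follows, using only the standard library; `extract_names` is a hypothetical helper, not a function from the notebook, and unlike the cell above it returns an empty list (rather than `None`) for a missing value.

```python
import json

def extract_names(raw, limit=None):
    """Return the "name" of each entry in a JSON-encoded list (hypothetical helper)."""
    if not isinstance(raw, str):
        return []  # e.g. None or NaN for a missing value
    names = [entry["name"] for entry in json.loads(raw)]
    return names if limit is None else names[:limit]

raw_genres = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(extract_names(raw_genres))           # ['Action', 'Adventure']
print(extract_names(raw_genres, limit=1))  # ['Action']
print(extract_names(None))                 # []
```

Returning a list in every case means later list comprehensions over the column do not have to special-case `None`.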


In [29]:
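# overview is already a one-element list (see the previous cell), so str(x) stores
# the list's repr; this is why the printed tags start with "['" around the overview
data["overview"] = data["overview"].apply(lambda x: [str(x)])
# Remove inner spaces so multi-word names become single tokens (e.g. "ScienceFiction")
data["genres"] = data["genres"].apply(lambda x: [str(genre).replace(" ","") for genre in x])
data["keywords"] = data["keywords"].apply(lambda x: [str(keyword).replace(" ","") for keyword in x])
data["cast"] = data["cast"].apply(lambda x: [str(actor).replace(" ","") for actor in x])
data["crew"] = data["crew"].apply(lambda x: [x.replace(" ","")])

# Concatenate the lists and flatten them into one space-separated string
# (commas inside the overview are replaced by spaces as well)
data["tags"] = data["overview"] + data["genres"] + data["keywords"] + data["cast"] + data["crew"]
data["tags"] = data["tags"].apply(lambda x: ",".join(x).replace(",", " "))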
print(data.loc[0,'tags'])

['In the 22nd century  a paraplegic Marine is dispatched to the moon Pandora on a unique mission  but becomes torn between following orders and protecting an alien civilization.'] Action Adventure Fantasy ScienceFiction cultureclash future spacewar spacecolony society spacetravel futuristic romance space alien tribe alienplanet cgi marine soldier battle loveaffair antiwar powerrelations mindandsoul 3d SamWorthington ZoeSaldana SigourneyWeaver JamesCameron
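The printed tags keep the brackets and quotes of the overview list. One way to avoid that, sketched below with plain Python values, is to treat the overview as a string and only compact the multi-word names. `build_tags` and the shortened overview text are assumptions made for this example.

```python
def build_tags(overview, genres, keywords, cast, director):
    """Join the overview and the space-stripped names into one string (hypothetical helper)."""
    # "Science Fiction" -> "ScienceFiction", so each name is a single token
    compact = [name.replace(" ", "") for name in genres + keywords + cast + [director]]
    # The overview is kept as ordinary text, so no list repr leaks into the tags
    return " ".join([overview] + compact)

tags = build_tags(
    overview="In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora.",
    genres=["Action", "Science Fiction"],
    keywords=["culture clash", "future"],
    cast=["Sam Worthington", "Zoe Saldana"],
    director="James Cameron",
)
print(tags)
```

Punctuation left in the overview is harmless for the next step, since `TfidfVectorizer` tokenises on word characters by default.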


In [30]:
data.drop(columns = ['overview','genres','keywords','cast','crew'], inplace=True)
data.head()

| | movie_id | title | tags |
|---|---|---|---|
| 0 | 19995 | Avatar | ['In the 22nd century a paraplegic Marine is ... |
| 1 | 285 | Pirates of the Caribbean: At World's End | ['Captain Barbossa long believed to be dead ... |
| 2 | 206647 | Spectre | ['A cryptic message from Bond’s past sends him... |
| 3 | 49026 | The Dark Knight Rises | ["Following the death of District Attorney Har... |
| 4 | 49529 | John Carter | ["John Carter is a war-weary former military ... |


In [31]:
data.to_csv("../data/processed/clean_data.csv", index = False)

conn = sqlite3.connect("../data/movies_database.db")

# Store the cleaned table: data, not the raw movies DataFrame
data.to_sql("clean_movies_data", conn, if_exists = "replace", index = False)

4809

### Step 4: Build a KNN

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(data["tags"])

print(tfidf_matrix)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 261477 stored elements and shape (4809, 35722)>
  Coords	Values
  (0, 15381)	0.03430379984009723
  (0, 32188)	0.05197251138234446
  (0, 233)	0.1782999934178856
  (0, 5383)	0.11605098002492341
  (0, 23843)	0.1741688620570135
  (0, 20151)	0.2894581404073792
  (0, 16010)	0.0407567052400133
  (0, 9090)	0.16259432423830888
  (0, 32558)	0.028837181438295535
  (0, 21742)	0.14372186033980094
  (0, 23783)	0.17067604025945823
  (0, 23283)	0.048730978267768996
  (0, 33661)	0.14094387672328448
  (0, 21518)	0.10108343771123139
  (0, 4666)	0.05731590249071853
  (0, 3059)	0.08929978906455724
  (0, 32771)	0.13926470190030477
  (0, 3351)	0.09361003986445465
  (0, 12051)	0.1256781920680997
  (0, 23409)	0.14805088370998445
  (0, 1470)	0.029464533002284618
  (0, 25589)	0.15497037107988149
  (0, 1422)	0.04909112080642707
  (0, 1125)	0.21777635532299378
  (0, 6094)	0.1584631928774368
  :	:
  (4808, 11150)	0.13664140942609315
  (4808, 25447)	0.112
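The sparse output above is hard to read at full size. The sketch below runs `TfidfVectorizer` on three made-up tag strings to show the two properties the recommender depends on: each row is L2-normalised, and a term that appears in fewer documents gets a higher weight. The corpus is an assumption for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature "tags" corpus
docs = [
    "dragon viking adventure",
    "dragon knight fantasy",
    "spy secret agent",
]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

print(matrix.shape)                    # (3, 8): 3 documents, 8 distinct terms
print(sorted(vectorizer.vocabulary_))  # the learned terms

# Each row has unit Euclidean length (the default norm="l2")
row_norms = np.sqrt(np.asarray(matrix.multiply(matrix).sum(axis=1)).ravel())
print(row_norms)

# "dragon" occurs in two documents, "viking" in one, so "viking" weighs more
dragon_weight = matrix[0, vectorizer.vocabulary_["dragon"]]
viking_weight = matrix[0, vectorizer.vocabulary_["viking"]]
print(dragon_weight < viking_weight)   # True
```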

In [33]:
from sklearn.metrics.pairwise import cosine_similarity

similarity = cosine_similarity(tfidf_matrix)
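`cosine_similarity` builds a dense 4809 × 4809 array. Since this step is titled KNN and the earlier cell imports `NearestNeighbors` without using it, a neighbour search is sketched here as a possible alternative: with `metric="cosine"` it returns cosine distances (1 minus the similarity) without materialising the full matrix. The three-document corpus is made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical miniature "tags" corpus
docs = [
    "dragon viking adventure",
    "dragon knight fantasy",
    "spy secret agent",
]
matrix = TfidfVectorizer().fit_transform(docs)

# Brute-force search with cosine distance works directly on the sparse matrix
knn = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
knn.fit(matrix)

# Neighbours of document 0: itself first, then the other "dragon" document
distances, indices = knn.kneighbors(matrix[0])
print(indices[0])
print(distances[0])
```

As with the `[1:6]` slice in `recommend`, the first neighbour is the query document itself and would be skipped when recommending.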

In [34]:
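def recommend(movie):
    # Row index of the first film whose title matches exactly
    movie_index = data[data["title"] == movie].index[0]
    # Cosine similarities between that film and every film (higher means more similar)
    distances = similarity[movie_index]
    # Sort by similarity, descending; skip the first entry (normally the film itself) and keep the next five
    movie_list = sorted(list(enumerate(distances)), reverse = True , key = lambda x: x[1])[1:6]

    return movie_list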
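`recommend` raises an `IndexError` when the title is not in the data, because `.index[0]` is taken from an empty index. A more defensive variant is sketched below in plain Python; `recommend_titles`, the three titles and the similarity scores are all made up for illustration.

```python
def recommend_titles(title, titles, similarity, k=5):
    """Return up to k (title, score) pairs most similar to `title` (hypothetical helper)."""
    if title not in titles:
        return []  # unknown title: nothing to recommend
    index = titles.index(title)
    scores = similarity[index]
    # Rank every other film by similarity, highest first
    ranked = sorted(
        (i for i in range(len(titles)) if i != index),
        key=lambda i: scores[i],
        reverse=True,
    )
    return [(titles[i], scores[i]) for i in ranked[:k]]

titles = ["Spectre", "Skyfall", "Avatar"]
similarity = [  # made-up symmetric scores
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
print(recommend_titles("Spectre", titles, similarity, k=2))  # [('Skyfall', 0.6), ('Avatar', 0.1)]
print(recommend_titles("Unknown", titles, similarity))       # []
```

Excluding the query by index, rather than dropping the top-ranked entry, also behaves predictably when another row has an identical score.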

In [35]:
input_movie = "How to Train Your Dragon"
recommendations = recommend(input_movie)
print("Film recommendations '{}'".format(input_movie))
for movie, distance in recommendations:
    print("- Film: {}".format(data.loc[movie,'title']))

Film recommendations 'How to Train Your Dragon'
- Film: How to Train Your Dragon 2
- Film: Dragon Nest: Warriors' Dawn
- Film: Pete's Dragon
- Film: George and the Dragon
- Film: Eragon


In [36]:
input_movie = "Avatar"
recommendations = recommend(input_movie)
print("Film recommendations '{}'".format(input_movie))
for movie, distance in recommendations:
    print("- Film: {}".format(data.loc[movie,'title']))

Film recommendations 'Avatar'
- Film: Aliens
- Film: Battle: Los Angeles
- Film: Falcon Rising
- Film: Apollo 18
- Film: Titan A.E.


In [37]:
input_movie = "Spectre"
recommendations = recommend(input_movie)
print("Film recommendations '{}'".format(input_movie))
for movie, distance in recommendations:
    print("- Film: {}".format(data.loc[movie,'title']))

Film recommendations 'Spectre'
- Film: Skyfall
- Film: Never Say Never Again
- Film: Quantum of Solace
- Film: From Russia with Love
- Film: Thunderball


After three tests, the movie recommender has been confirmed to work as expected.