# Movie Recommendation System Using a K-Nearest Neighbors Model
## 1. Data Download and Inspection 

### 1.1 Import modules 

In [22]:
import json 
import pandas as pd 

### 1.2 Data download 

In [23]:
movies = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_movies.csv")
credits = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_credits.csv")

### 1.3 Data inspection 

In [24]:
movies.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| budget | 237000000 | 300000000 | 245000000 | 250000000 | 260000000 |
| genres | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | [{"id": 12, "name": "Adventure"}, {"id": 14, "... | [{"id": 28, "name": "Action"}, {"id": 12, "nam... | [{"id": 28, "name": "Action"}, {"id": 80, "nam... | [{"id": 28, "name": "Action"}, {"id": 12, "nam... |
| homepage | http://www.avatarmovie.com/ | http://disney.go.com/disneypictures/pirates/ | http://www.sonypictures.com/movies/spectre/ | http://www.thedarkknightrises.com/ | http://movies.disney.com/john-carter |
| id | 19995 | 285 | 206647 | 49026 | 49529 |
| keywords | [{"id": 1463, "name": "culture clash"}, {"id":... | [{"id": 270, "name": "ocean"}, {"id": 726, "na... | [{"id": 470, "name": "spy"}, {"id": 818, "name... | [{"id": 849, "name": "dc comics"}, {"id": 853,... | [{"id": 818, "name": "based on novel"}, {"id":... |
| original_language | en | en | en | en | en |
| original_title | Avatar | Pirates of the Caribbean: At World's End | Spectre | The Dark Knight Rises | John Carter |
| overview | In the 22nd century, a paraplegic Marine is di... | Captain Barbossa, long believed to be dead, ha... | A cryptic message from Bond’s past sends him o... | Following the death of District Attorney Harve... | John Carter is a war-weary, former military ca... |
| popularity | 150.437577 | 139.082615 | 107.376788 | 112.31295 | 43.926995 |
| production_companies | [{"name": "Ingenious Film Partners", "id": 289... | [{"name": "Walt Disney Pictures", "id": 2}, {"... | [{"name": "Columbia Pictures", "id": 5}, {"nam... | [{"name": "Legendary Pictures", "id": 923}, {"... | [{"name": "Walt Disney Pictures", "id": 2}] |


In [25]:
credits.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| movie_id | 19995 | 285 | 206647 | 49026 | 49529 |
| title | Avatar | Pirates of the Caribbean: At World's End | Spectre | The Dark Knight Rises | John Carter |
| cast | [{"cast_id": 242, "character": "Jake Sully", "... | [{"cast_id": 4, "character": "Captain Jack Spa... | [{"cast_id": 1, "character": "James Bond", "cr... | [{"cast_id": 2, "character": "Bruce Wayne / Ba... | [{"cast_id": 5, "character": "John Carter", "c... |
| crew | [{"credit_id": "52fe48009251416c750aca23", "de... | [{"credit_id": "52fe4232c3a36847f800b579", "de... | [{"credit_id": "54805967c3a36829b5002c41", "de... | [{"credit_id": "52fe4781c3a36847f81398c3", "de... | [{"credit_id": "52fe479ac3a36847f813eaa3", "de... |


- The `id` values in `movies` and the `movie_id` values in `credits` are identical in the first five rows, which suggests that each row describes the same movie in both DataFrames. The two DataFrames can therefore be joined on this key for further processing.
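The observation above rests on the first five rows only. A minimal sketch of how the same check could be extended to the whole key column is shown below. The frames `movies_demo` and `credits_demo` and the helper `keys_match` are hypothetical stand-ins introduced for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Hypothetical miniature stand-ins for the real `movies` and `credits` frames.
movies_demo = pd.DataFrame({
    "id": [19995, 285, 206647],
    "original_title": ["Avatar", "Pirates of the Caribbean: At World's End", "Spectre"],
})
credits_demo = pd.DataFrame({
    "movie_id": [285, 206647, 19995],  # same keys, different row order
    "title": ["Pirates of the Caribbean: At World's End", "Spectre", "Avatar"],
})


def keys_match(left, right, left_key, right_key):
    """Return True when both key columns are unique and hold the same set of values."""
    same_values = set(left[left_key]) == set(right[right_key])
    both_unique = left[left_key].is_unique and right[right_key].is_unique
    return bool(same_values and both_unique)


ids_match = keys_match(movies_demo, credits_demo, "id", "movie_id")
print(ids_match)
```

On the real data, the same call would be `keys_match(movies, credits, "id", "movie_id")`. Comparing sets rather than columns position by position means the check does not depend on both files listing the movies in the same order.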

In [26]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

- Some columns have missing values: `homepage`, `overview`, `release_date`, `runtime`, and `tagline`.
- The DataFrame mixes numerical and categorical (object) features.
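Reading the missing columns off the `info()` output works, but a small summary table can make the gaps easier to compare. The sketch below assumes a hypothetical helper `missing_summary` and a made-up three-row frame `demo`; the values are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Made-up frame: two columns with gaps, one complete column.
demo = pd.DataFrame({
    "homepage": ["http://example.com/", None, None],
    "runtime": [120.0, None, 95.0],
    "budget": [1000, 2000, 3000],
})


def missing_summary(df):
    """Count missing values per column and keep only the columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


summary = missing_summary(demo)
print(summary)
```

Calling `missing_summary(movies)` on the real frame would list only the columns with gaps, sorted from most to least affected.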

In [27]:
credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title     4803 non-null   object
 2   cast      4803 non-null   object
 3   crew      4803 non-null   object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB


- The `credits` DataFrame has no missing values.

### 1.4 Join two dataframes 

In [28]:
credits.rename({"movie_id": "id"}, axis=1, inplace=True)  # rename the key column so both DataFrames can be merged on 'id'
credits.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| id | 19995 | 285 | 206647 | 49026 | 49529 |
| title | Avatar | Pirates of the Caribbean: At World's End | Spectre | The Dark Knight Rises | John Carter |
| cast | [{"cast_id": 242, "character": "Jake Sully", "... | [{"cast_id": 4, "character": "Captain Jack Spa... | [{"cast_id": 1, "character": "James Bond", "cr... | [{"cast_id": 2, "character": "Bruce Wayne / Ba... | [{"cast_id": 5, "character": "John Carter", "c... |
| crew | [{"credit_id": "52fe48009251416c750aca23", "de... | [{"credit_id": "52fe4232c3a36847f800b579", "de... | [{"credit_id": "54805967c3a36829b5002c41", "de... | [{"credit_id": "52fe4781c3a36847f81398c3", "de... | [{"credit_id": "52fe479ac3a36847f813eaa3", "de... |


In [29]:
data_df = pd.merge(movies, credits, on='id', how='outer')
data_df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| budget | 4000000 | 11000000 | 94000000 | 55000000 | 15000000 |
| genres | [{"id": 80, "name": "Crime"}, {"id": 35, "name... | [{"id": 12, "name": "Adventure"}, {"id": 28, "... | [{"id": 16, "name": "Animation"}, {"id": 10751... | [{"id": 35, "name": "Comedy"}, {"id": 18, "nam... | [{"id": 18, "name": "Drama"}] |
| homepage | | http://www.starwars.com/films/star-wars-episod... | http://movies.disney.com/finding-nemo | | http://www.dreamworks.com/ab/ |
| id | 5 | 11 | 12 | 13 | 14 |
| keywords | [{"id": 612, "name": "hotel"}, {"id": 613, "na... | [{"id": 803, "name": "android"}, {"id": 4270, ... | [{"id": 494, "name": "father son relationship"... | [{"id": 422, "name": "vietnam veteran"}, {"id"... | [{"id": 255, "name": "male nudity"}, {"id": 29... |
| original_language | en | en | en | en | en |
| original_title | Four Rooms | Star Wars | Finding Nemo | Forrest Gump | American Beauty |
| overview | It's Ted the Bellhop's first night on the job.... | Princess Leia is captured and held hostage by ... | Nemo, an adventurous young clownfish, is unexp... | A man with a low IQ has accomplished great thi... | Lester Burnham, a depressed suburban father in... |
| popularity | 22.87623 | 126.393695 | 85.688789 | 138.133331 | 80.878605 |
| production_companies | [{"name": "Miramax Films", "id": 14}, {"name":... | [{"name": "Lucasfilm", "id": 1}, {"name": "Twe... | [{"name": "Pixar Animation Studios", "id": 3}] | [{"name": "Paramount Pictures", "id": 4}] | [{"name": "DreamWorks SKG", "id": 27}, {"name"... |


In [30]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 23 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

- The merge worked, but three columns now refer to the movie title: `title_x`, `title_y`, and `original_title`.
- Drop the two duplicated columns and keep `original_title` as the single title column.

In [31]:
data_df.drop(['title_x', 'title_y'], axis=1, inplace=True)
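The `title_x` and `title_y` columns dropped above come from the default suffixes that `pd.merge` appends when both inputs share a non-key column name. The sketch below reproduces this on two made-up frames (`movies_demo` and `credits_demo`, both hypothetical) and shows one possible alternative: dropping the redundant column before merging so no suffixed columns appear. This is an illustration, not the approach the notebook takes.

```python
import pandas as pd

# Hypothetical two-row stand-ins; both frames carry a `title` column.
movies_demo = pd.DataFrame({
    "id": [1, 2],
    "original_title": ["A", "B"],
    "title": ["A", "B"],
})
credits_demo = pd.DataFrame({
    "id": [1, 2],
    "title": ["A", "B"],
    "cast": ["[]", "[]"],
})

# Overlapping non-key columns receive the default suffixes "_x" and "_y".
merged = pd.merge(movies_demo, credits_demo, on="id", how="outer")
print(list(merged.columns))

# Alternative: remove the duplicate column from one side before merging.
slim = pd.merge(movies_demo, credits_demo.drop(columns="title"), on="id", how="outer")
print(list(slim.columns))
```

Either route ends with a single title column; dropping before the merge simply avoids the cleanup step afterwards.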

In [32]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [33]:
data_df.rename({"original_title": "title"}, axis=1, inplace=True)

In [34]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   title                 4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [42]:
data_df['cast']

0       [{"cast_id": 42, "character": "Ted the Bellhop...
1       [{"cast_id": 3, "character": "Luke Skywalker",...
2       [{"cast_id": 8, "character": "Marlin (voice)",...
3       [{"cast_id": 7, "character": "Forrest Gump", "...
4       [{"cast_id": 6, "character": "Lester Burnham",...
                              ...                        
4798    [{"cast_id": 1, "character": "Dawn", "credit_i...
4799    [{"cast_id": 4, "character": "Smith Bhatnagar"...
4800    [{"cast_id": 3, "character": "Amber", "credit_...
4801                                                   []
4802    [{"cast_id": 0, "character": "Narrator", "cred...
Name: cast, Length: 4803, dtype: object

In [43]:
data_df['keywords']

0       [{"id": 612, "name": "hotel"}, {"id": 613, "na...
1       [{"id": 803, "name": "android"}, {"id": 4270, ...
2       [{"id": 494, "name": "father son relationship"...
3       [{"id": 422, "name": "vietnam veteran"}, {"id"...
4       [{"id": 255, "name": "male nudity"}, {"id": 29...
                              ...                        
4798                                                   []
4799                                                   []
4800    [{"id": 10060, "name": "christian film"}, {"id...
4801                                                   []
4802    [{"id": 6027, "name": "music"}, {"id": 225822,...
Name: keywords, Length: 4803, dtype: object

In [44]:
data_df['genres']

0       [{"id": 80, "name": "Crime"}, {"id": 35, "name...
1       [{"id": 12, "name": "Adventure"}, {"id": 28, "...
2       [{"id": 16, "name": "Animation"}, {"id": 10751...
3       [{"id": 35, "name": "Comedy"}, {"id": 18, "nam...
4                           [{"id": 18, "name": "Drama"}]
                              ...                        
4798                       [{"id": 27, "name": "Horror"}]
4799    [{"id": 35, "name": "Comedy"}, {"id": 10751, "...
4800    [{"id": 53, "name": "Thriller"}, {"id": 18, "n...
4801                    [{"id": 10751, "name": "Family"}]
4802                  [{"id": 99, "name": "Documentary"}]
Name: genres, Length: 4803, dtype: object

## 2. Exploratory Data Analysis (EDA)

In [35]:
encoded_data_df = data_df.copy()
encoded_data_df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| budget | 4000000 | 11000000 | 94000000 | 55000000 | 15000000 |
| genres | [{"id": 80, "name": "Crime"}, {"id": 35, "name... | [{"id": 12, "name": "Adventure"}, {"id": 28, "... | [{"id": 16, "name": "Animation"}, {"id": 10751... | [{"id": 35, "name": "Comedy"}, {"id": 18, "nam... | [{"id": 18, "name": "Drama"}] |
| homepage | | http://www.starwars.com/films/star-wars-episod... | http://movies.disney.com/finding-nemo | | http://www.dreamworks.com/ab/ |
| id | 5 | 11 | 12 | 13 | 14 |
| keywords | [{"id": 612, "name": "hotel"}, {"id": 613, "na... | [{"id": 803, "name": "android"}, {"id": 4270, ... | [{"id": 494, "name": "father son relationship"... | [{"id": 422, "name": "vietnam veteran"}, {"id"... | [{"id": 255, "name": "male nudity"}, {"id": 29... |
| original_language | en | en | en | en | en |
| title | Four Rooms | Star Wars | Finding Nemo | Forrest Gump | American Beauty |
| overview | It's Ted the Bellhop's first night on the job.... | Princess Leia is captured and held hostage by ... | Nemo, an adventurous young clownfish, is unexp... | A man with a low IQ has accomplished great thi... | Lester Burnham, a depressed suburban father in... |
| popularity | 22.87623 | 126.393695 | 85.688789 | 138.133331 | 80.878605 |
| production_companies | [{"name": "Miramax Films", "id": 14}, {"name":... | [{"name": "Lucasfilm", "id": 1}, {"name": "Twe... | [{"name": "Pixar Animation Studios", "id": 3}] | [{"name": "Paramount Pictures", "id": 4}] | [{"name": "DreamWorks SKG", "id": 27}, {"name"... |


In [47]:
import pandas as pd
import json

# Small sample DataFrame for prototyping the name extraction.
# It uses its own name (sample_df) so the merged data_df is not overwritten.
data = {'cast': [
    '[{"name": "Robert Downey Jr."}, {"name": "Chris Evans"}, {"name": "Scarlett Johansson"}, {"name": "Mark Ruffalo"}]',
    '[{"name": "Tom Holland"}, {"name": "Zendaya"}, {"name": "Jacob Batalon"}]',
    '[{"name": "Ryan Reynolds"}, {"name": "Hugh Jackman"}]'  # Only two names
]}

sample_df = pd.DataFrame(data)
sample_df.T

| | 0 | 1 | 2 |
|---|---|---|---|
| cast | [{"name": "Robert Downey Jr."}, {"name": "Chri... | [{"name": "Tom Holland"}, {"name": "Zendaya"},... | [{"name": "Ryan Reynolds"}, {"name": "Hugh Jac... |


In [None]:
extracted_values = []

# Parse each JSON string and collect the 'name' of every cast member
for json_string in sample_df['cast']:
    array = json.loads(json_string)
    names = [elem['name'] for elem in array]
    extracted_values.append(names)

[]
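The loop above can also be wrapped in a small reusable function, which makes it easier to apply the same parsing to `cast`, `keywords`, and `genres`. The sketch below is one possible shape, not the notebook's final implementation: the helper name `extract_names` and its optional `limit` parameter are assumptions introduced here. It uses only the standard library and also handles the empty `[]` cells seen in the real columns.

```python
import json


def extract_names(json_string, limit=None):
    """Parse a JSON-encoded list of objects and return their "name" values.

    `limit` (hypothetical parameter) keeps only the first `limit` names when given.
    """
    items = json.loads(json_string)
    names = [item["name"] for item in items if "name" in item]
    return names if limit is None else names[:limit]


cast_cell = (
    '[{"name": "Robert Downey Jr."}, {"name": "Chris Evans"}, '
    '{"name": "Scarlett Johansson"}, {"name": "Mark Ruffalo"}]'
)

top_three = extract_names(cast_cell, limit=3)
empty_cell = extract_names("[]")
print(top_three)
print(empty_cell)
```

With pandas, a function like this could be passed to `Series.apply`, for example `data_df["cast"].apply(extract_names)`, to transform a whole column in one step.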
