# K-nearest neighbors: Movie recommendation system

## 1. Data loading
### 1.1. Load

In [1]:
# Handle imports up-front
import json
import pandas as pd

movies=pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_movies.csv")
credits=pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_credits.csv")

### 1.2. Inspect

In [2]:
# Inspect column names, dtypes and non-null counts
print(movies.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

In [3]:
print(credits.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title     4803 non-null   object
 2   cast      4803 non-null   object
 3   crew      4803 non-null   object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB
None


In [4]:
print(movies.head())

      budget                                             genres  \
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
1  300000000  [{"id": 12, "name": "Adventure"}, {"id": 14, "...   
2  245000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
3  250000000  [{"id": 28, "name": "Action"}, {"id": 80, "nam...   
4  260000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   

                                       homepage      id  \
0                   http://www.avatarmovie.com/   19995   
1  http://disney.go.com/disneypictures/pirates/     285   
2   http://www.sonypictures.com/movies/spectre/  206647   
3            http://www.thedarkknightrises.com/   49026   
4          http://movies.disney.com/john-carter   49529   

                                            keywords original_language  \
0  [{"id": 1463, "name": "culture clash"}, {"id":...                en   
1  [{"id": 270, "name": "ocean"}, {"id": 726, "na...                en   
2  [{"id": 470, "nam

In [5]:
print(movies['keywords'].head(10))  # Print the first 10 rows

0    [{"id": 1463, "name": "culture clash"}, {"id":...
1    [{"id": 270, "name": "ocean"}, {"id": 726, "na...
2    [{"id": 470, "name": "spy"}, {"id": 818, "name...
3    [{"id": 849, "name": "dc comics"}, {"id": 853,...
4    [{"id": 818, "name": "based on novel"}, {"id":...
5    [{"id": 851, "name": "dual identity"}, {"id": ...
6    [{"id": 1562, "name": "hostage"}, {"id": 2343,...
7    [{"id": 8828, "name": "marvel comic"}, {"id": ...
8    [{"id": 616, "name": "witch"}, {"id": 2343, "n...
9    [{"id": 849, "name": "dc comics"}, {"id": 7002...
Name: keywords, dtype: object


### 1.3. Join

In [6]:
# Combine the datasets (hint: you don't need SQL here - Pandas can do SQL-like joins directly).
data = pd.merge(movies, credits, left_on='id', right_on='movie_id')
print(data.shape)
print(data.head())

(4803, 24)
      budget                                             genres  \
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
1  300000000  [{"id": 12, "name": "Adventure"}, {"id": 14, "...   
2  245000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   
3  250000000  [{"id": 28, "name": "Action"}, {"id": 80, "nam...   
4  260000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...   

                                       homepage      id  \
0                   http://www.avatarmovie.com/   19995   
1  http://disney.go.com/disneypictures/pirates/     285   
2   http://www.sonypictures.com/movies/spectre/  206647   
3            http://www.thedarkknightrises.com/   49026   
4          http://movies.disney.com/john-carter   49529   

                                            keywords original_language  \
0  [{"id": 1463, "name": "culture clash"}, {"id":...                en   
1  [{"id": 270, "name": "ocean"}, {"id": 726, "na...                en   
2  [{"id"

## 2. EDA

### 2.1. Feature encoding

In [7]:
# Make a copy to work with while encoding so that we have the original to go back to if needed
encoded_data_df=data.copy()

Some of the features contain per-cell JSON-formatted data. This is a terrible practice; no competent data scientist would ever produce a dataset this way. It's like building a perfectly tuned, carbon fibre racing bicycle and then using a delicious bacon, egg and cheese sandwich for the seat. Either one is awesome, but the awkward combination is nonsensical.

This kind of thing happens. We can't control the format(s) we find interesting data in. But, as bad*as data scientists, we can use our Python/Pandas chops to extract and parse any data we want into a useful format. This requires some item-by-item processing and is necessarily messy.

Below, I re-wrote the apply() lambda function provided in the 4Geeks solution in a more verbose - but possibly more familiar - style using loops. The lambda apply method is better. Not only is it more succinct, but there can be a modest performance benefit to using apply() vs an explicit Python loop on a Pandas dataframe, though apply() still calls the function once per row, so the gain is small compared with true vectorization. I added the loop version for comparison to help you understand what the lambda function is doing:

It loads the 'cast' JSON from each row of the dataframe as a list of dictionaries and extracts the value of 'name' from each one.
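
To make the comparison concrete, here is a minimal sketch of the two styles on a made-up, two-entry JSON string. The sample data and the function names `names_with_loop` and `names_with_comprehension` are illustrative assumptions, not part of the dataset or the 4Geeks solution. Both parse the string into a list of dictionaries and collect the value stored under 'name'.

```python
import json

# Hypothetical sample shaped like one cell of the 'cast' column
sample_cell = '[{"cast_id": 1, "name": "Actor One"}, {"cast_id": 2, "name": "Actor Two"}]'

def names_with_loop(json_string):
    """Verbose version: explicit loop and append."""
    names = []
    for item in json.loads(json_string):  # json.loads returns a list of dicts here
        names.append(item['name'])
    return names

def names_with_comprehension(json_string):
    """Succinct version: the same logic as the body of an apply() lambda."""
    return [item['name'] for item in json.loads(json_string)]

loop_result = names_with_loop(sample_cell)
comprehension_result = names_with_comprehension(sample_cell)
print(loop_result)
print(comprehension_result)
```

Passing either function to `data['cast'].apply(...)` would run it once per row, which is the pattern the lambda in the next cell follows.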

In [8]:
import ast
data['cast_names'] = data['cast'].apply(lambda x: [i['name'] for i in ast.literal_eval(x)])

# Keep at most the first five cast members (slicing is safe on shorter lists)
data['cast_names'] = data['cast_names'].apply(lambda x: x[:5])

print(data['cast_names'].head())

0    [Sam Worthington, Zoe Saldana, Sigourney Weave...
1    [Johnny Depp, Orlando Bloom, Keira Knightley, ...
2    [Daniel Craig, Christoph Waltz, Léa Seydoux, R...
3    [Christian Bale, Michael Caine, Gary Oldman, A...
4    [Taylor Kitsch, Lynn Collins, Samantha Morton,...
Name: cast_names, dtype: object


#### 2.1.1. Extract cast names: loop

In [9]:
# Loop version of the cast-name extraction, left commented out for comparison.
#
# # Empty list to hold extracted values
# extracted_values=[]
#
# # Loop on the elements of the cast column
# for json_string in data['cast']:
#
#     # Load the json string into a python list of dictionaries
#     json_list=json.loads(json_string)
#
#     # Empty list to hold values from this element
#     values=[]
#
#     # Loop on the first three elements of the json list
#     for item in json_list[:3]:
#
#         # Extract the value for the name key
#         value=item['name']
#
#         # Add it to the list
#         values.append(value)
#
#     extracted_values.append(values)

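# Count how many films each cast member appears in
cast_frequency = data['cast_names'].explode().value_counts()

# Keep only the 500 most frequent cast members
top_cast = cast_frequency.head(500).index

# Drop everyone else from each film's cast list
data['cast_names'] = data['cast_names'].apply(lambda x: [name for name in x if name in top_cast])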

print(data['cast_names'])


0       [Zoe Saldana, Sigourney Weaver, Michelle Rodri...
1       [Johnny Depp, Orlando Bloom, Keira Knightley, ...
2                           [Daniel Craig, Ralph Fiennes]
3       [Christian Bale, Michael Caine, Gary Oldman, A...
4       [Samantha Morton, Willem Dafoe, Thomas Haden C...
                              ...                        
4798                                                   []
4799                                       [Edward Burns]
4800                                                   []
4801                                        [Bill Paxton]
4802                                     [Drew Barrymore]
Name: cast_names, Length: 4803, dtype: object


#### 2.1.2. Extract cast names: lambda apply()

In [10]:
data['genres'] = data['genres'].apply(lambda x: [i['name'] for i in ast.literal_eval(x)])

# Parse and extract keywords from JSON-like strings
data['keywords'] = data['keywords'].apply(lambda x: [i['name'] for i in ast.literal_eval(x)] if isinstance(x, str) else [])

data['production_companies'] = data['production_companies'].apply(lambda x: [i['name'] for i in ast.literal_eval(x)])

print(data['keywords'].head())

0    [culture clash, future, space war, space colon...
1    [ocean, drug abuse, exotic island, east india ...
2    [spy, based on novel, secret agent, sequel, mi...
3    [dc comics, crime fighter, terrorist, secret i...
4    [based on novel, mars, medallion, space travel...
Name: keywords, dtype: object


#### 2.1.3. Extract other features

In [11]:
# Same for the 'keywords' column
# encoded_data_df['keywords']=data['keywords'].apply(lambda x: [item['name'] for item in json.loads(x)][:3] if pd.notna(x) else None)

# # And the 'genres' column
# encoded_data_df['genres']=data['genres'].apply(lambda x: [item['name'] for item in json.loads(x)][:3] if pd.notna(x) else None)

# encoded_data_df.head(3)

data['overview'] = data['overview'].apply(lambda x: x if isinstance(x, str) else "")


### 2.2. Missing and/or extreme values

In [12]:
# Look for and clean up any junk data, if it exists
print(data.isnull().sum())

budget                     0
genres                     0
homepage                3091
id                         0
keywords                   0
original_language          0
original_title             0
overview                   0
popularity                 0
production_companies       0
production_countries       0
release_date               1
revenue                    0
runtime                    2
spoken_languages           0
status                     0
tagline                  844
title_x                    0
vote_average               0
vote_count                 0
movie_id                   0
title_y                    0
cast                       0
crew                       0
cast_names                 0
dtype: int64


In [13]:
data.drop(columns=['homepage', 'tagline'], inplace=True)

In [14]:
# Fill the one missing release date; assigning the result back avoids chained in-place modification
data['release_date'] = data['release_date'].fillna('Unknown')


In [15]:
# Fill the two missing runtimes with the median; assigning the result back avoids chained in-place modification
data['runtime'] = data['runtime'].fillna(data['runtime'].median())


### 2.3. Feature selection

In [16]:
# Do we need all of the features?
from sklearn.preprocessing import MultiLabelBinarizer

mlb_genres = MultiLabelBinarizer()
genres_encoded = pd.DataFrame(mlb_genres.fit_transform(data['genres']), columns=mlb_genres.classes_, index=data.index)
data = pd.concat([data, genres_encoded], axis=1)

In [17]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data['popularity'] = scaler.fit_transform(data[['popularity']])

In [18]:
# data['keywords'] = data['keywords'].apply(lambda x: [i.strip() for i in x.split()] if isinstance(x, str) else [])

mlb_keywords = MultiLabelBinarizer()
keywords_encoded = pd.DataFrame(mlb_keywords.fit_transform(data['keywords']), columns=mlb_keywords.classes_, index=data.index)

keywords_top = keywords_encoded.sum().sort_values(ascending=False).head(500).index
keywords_encoded = keywords_encoded[keywords_top]

data = pd.concat([data, keywords_encoded], axis=1)

print(keywords_encoded.shape)

(4803, 500)
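
As a rough illustration of what the multi-hot encoding and the top-N filter above do, the sketch below runs the same steps on three made-up keyword lists and keeps the two most frequent keywords instead of 500. The toy data and the `toy_*` variable names are assumptions for this example only.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up keyword lists standing in for data['keywords']
toy_keywords = pd.Series([['spy', 'sequel'], ['spy', 'ocean'], ['spy', 'sequel', 'mars']])

# One column per distinct keyword, 1 where the film has that keyword
toy_mlb = MultiLabelBinarizer()
toy_encoded = pd.DataFrame(
    toy_mlb.fit_transform(toy_keywords),
    columns=toy_mlb.classes_,
    index=toy_keywords.index,
)

# Column sums count how many films carry each keyword; keep the 2 most common
toy_top = toy_encoded.sum().sort_values(ascending=False).head(2).index
toy_encoded_top = toy_encoded[toy_top]

print(toy_encoded_top)
```

Rare keywords ('ocean' and 'mars' here) are dropped, which keeps the feature matrix narrow at the cost of losing some detail about individual films.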


In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=500, stop_words='english')

overview_tfidf = pd.DataFrame(tfidf.fit_transform(data['overview']).toarray(), columns=tfidf.get_feature_names_out(), index=data.index)
data = pd.concat([data, overview_tfidf], axis=1)

print(overview_tfidf.shape)

(4803, 500)


In [20]:
mlb_cast = MultiLabelBinarizer()
cast_encoded = pd.DataFrame(mlb_cast.fit_transform(data['cast_names']), columns=mlb_cast.classes_, index=data.index)
data = pd.concat([data, cast_encoded], axis=1)

print(cast_encoded.shape)
print(cast_encoded.head())

(4803, 500)
   Aaron Eckhart  Abigail Breslin  Adam Sandler  Adam Scott  Adrien Brody  \
0              0                0             0           0             0   
1              0                0             0           0             0   
2              0                0             0           0             0   
3              0                0             0           0             0   
4              0                0             0           0             0   

   Al Pacino  Alan Arkin  Alec Baldwin  Alfred Molina  Amanda Peet  ...  \
0          0           0             0              0            0  ...   
1          0           0             0              0            0  ...   
2          0           0             0              0            0  ...   
3          0           0             0              0            0  ...   
4          0           0             0              0            0  ...   

   William H. Macy  William Hurt  William Shatner  Winona Ryder  Woody All

In [21]:
final_features = list(genres_encoded.columns) + ['popularity'] + list(keywords_encoded.columns) + list(overview_tfidf.columns) + list(cast_encoded.columns)
movieFeatures = data[final_features]

print(movieFeatures.shape)

print(movieFeatures.dtypes)

print(movieFeatures.head())

(4803, 1692)
Action               int64
Adventure            int64
Animation            int64
Comedy               int64
Crime                int64
                     ...  
Woody Harrelson      int64
Zac Efron            int64
Zach Galifianakis    int64
Zoe Saldana          int64
Zooey Deschanel      int64
Length: 1692, dtype: object
   Action  Adventure  Animation  Comedy  Crime  Documentary  Drama  Family  \
0       1          1          0       0      0            0      0       0   
1       1          1          0       0      0            0      0       0   
2       1          1          0       0      1            0      0       0   
3       1          0          0       0      1            0      1       0   
4       1          1          0       0      0            0      0       0   

   Fantasy  Foreign  ...  William H. Macy  William Hurt  William Shatner  \
0        1        0  ...                0             0                0   
1        1        0  ...                0

In [22]:
movieFeatures = movieFeatures.loc[:, ~movieFeatures.columns.duplicated()]
print(movieFeatures.shape)

(4803, 1436)
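
The drop from 1692 to 1436 columns suggests that some column labels occur more than once. One likely reason is that the same word can be both a keyword and a TF-IDF term, or that a TF-IDF term can match the name of an original column. The sketch below reproduces that effect on toy frames (all names here are made up) and shows prefixing as one possible way to keep the labels distinct. It is an illustration under those assumptions, not what this notebook does.

```python
import pandas as pd

# Toy stand-ins for two encoded feature blocks that share the label 'war'
toy_keywords = pd.DataFrame({'war': [1, 0], 'spy': [0, 1]})
toy_tfidf = pd.DataFrame({'war': [0.7, 0.0], 'love': [0.0, 0.4]})

combined = pd.concat([toy_keywords, toy_tfidf], axis=1)

# Selecting by label returns every column that carries the label,
# so listing 'war' twice yields four 'war' columns instead of two
selected = combined[list(toy_keywords.columns) + list(toy_tfidf.columns)]
print(selected.shape)

# Dropping duplicated labels keeps only the first 'war' column
deduplicated = selected.loc[:, ~selected.columns.duplicated()]
print(list(deduplicated.columns))

# One alternative: prefix each block before concatenating so labels never collide
prefixed = pd.concat(
    [toy_keywords.add_prefix('kw_'), toy_tfidf.add_prefix('tfidf_')], axis=1
)
print(list(prefixed.columns))
```

Note that de-duplicating by label keeps whichever column comes first, so with this approach the keyword flag survives and the TF-IDF weight for the same word is discarded.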


In [23]:
print(movieFeatures.isnull().sum().sum())  # Should return 0

0


In [24]:
for col in movieFeatures.columns:
    if movieFeatures[col].dtype == 'object':
        print(f"Column: {col}")
        print(movieFeatures[col].head())

Column: crew
0    [{"credit_id": "52fe48009251416c750aca23", "de...
1    [{"credit_id": "52fe4232c3a36847f800b579", "de...
2    [{"credit_id": "54805967c3a36829b5002c41", "de...
3    [{"credit_id": "52fe4781c3a36847f81398c3", "de...
4    [{"credit_id": "52fe479ac3a36847f813eaa3", "de...
Name: crew, dtype: object


In [25]:
# Parse JSON-like strings to extract relevant information (e.g., names)
movieFeatures['crew'] = movieFeatures['crew'].apply(
    lambda x: [i['name'] for i in ast.literal_eval(x)] if isinstance(x, str) else []
)

In [26]:
movieFeatures = movieFeatures.drop(columns=['crew', 'credits'], errors='ignore')

In [27]:
print(movieFeatures.dtypes)
print(movieFeatures.isnull().sum().sum())  # Ensure no NaNs remain

Action               int64
Adventure            int64
Animation            int64
Comedy               int64
Crime                int64
                     ...  
Woody Harrelson      int64
Zac Efron            int64
Zach Galifianakis    int64
Zoe Saldana          int64
Zooey Deschanel      int64
Length: 1435, dtype: object
0


In [28]:
scaler = MinMaxScaler()
movieFeatures = scaler.fit_transform(movieFeatures)

print(movieFeatures.shape)


(4803, 1435)


## 3. Model training

In [29]:
# Fit a nearest-neighbors index using cosine distance between feature vectors
from sklearn.neighbors import NearestNeighbors

knn_model = NearestNeighbors(n_neighbors=6, metric='cosine')
knn_model.fit(movieFeatures)


In [30]:
# Get the index of the target movie
movie_title = "Avatar"
movie_index = data[data['original_title'] == movie_title].index[0]

# Find the nearest neighbors
distances, indices = knn_model.kneighbors([movieFeatures[movie_index]])

# Print the titles of the nearest neighbors (the first one is the target movie itself)
print("Similar Movies:")
for idx in indices.flatten():
    print(data.iloc[idx]['original_title'])

Similar Movies:
Avatar
Jupiter Ascending
Star Trek Into Darkness
Predator
The Time Machine
Megaforce
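
For intuition about what `metric='cosine'` measures, the following sketch computes cosine distance (one minus the cosine of the angle between two vectors) by brute force in plain Python on three made-up feature vectors. It is an illustration under those assumptions, not a description of scikit-learn's internals, and the helper names `cosine_distance` and `nearest` are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle between u and v); 0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def nearest(query_index, vectors, k):
    """Indices of the k vectors closest to vectors[query_index], query included."""
    query = vectors[query_index]
    ranked = sorted(range(len(vectors)), key=lambda i: cosine_distance(query, vectors[i]))
    return ranked[:k]

# Made-up binary feature vectors (e.g. genre flags) for three films
toy_vectors = [
    [1, 1, 0, 0],  # film 0
    [1, 1, 1, 0],  # film 1: shares two features with film 0
    [0, 0, 1, 1],  # film 2: shares none with film 0
]

neighbours = nearest(0, toy_vectors, k=2)
print(neighbours)
```

The query film comes back first at a distance of (nearly) zero, which is why 'Avatar' heads its own list above and presumably why the model asks for 6 neighbors to obtain 5 recommendations.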


## 4. Recommender

In [42]:
# Recommender function
def get_movie_recommendations(movie_title, knn, X, data, top_n=5):
    '''Takes a movie title string, looks up the feature vector for that movie
    and returns the titles of the top_n most similar movies.'''

    # Row label of the target movie (data has a default RangeIndex, so label == position)
    movie_index = data[data['original_title'] == movie_title].index[0]

    # Ask for one extra neighbor because the movie itself is normally returned first
    distances, indices = knn.kneighbors([X[movie_index]], n_neighbors=top_n + 1)

    # Drop the target movie and map the remaining indices back to titles
    recommended_movies = [data.iloc[idx]['original_title'] for idx in indices.flatten() if idx != movie_index]

    return recommended_movies[:top_n]

recommendations = get_movie_recommendations("Ronin", knn_model, movieFeatures, data, top_n=5)
print("Recommended Movies:", recommendations)

Recommended Movies: ['Midnight Run', 'El Mariachi', 'Dick Tracy', 'Serbuan maut', 'Code of Honor']


In [39]:
# 'Target' movie
input_movie = "Predator"

# Call the recommendation function
recommendations = get_movie_recommendations(input_movie, knn_model, movieFeatures, data, top_n=5)

# Print the results
print("Film recommendations for '{}'".format(input_movie))
for movie in recommendations:
    print("- Film: {}".format(movie))


print(recommendations)

Film recommendations for 'Predator'
- Film: Star Trek III: The Search for Spock
- Film: I Am Number Four
- Film: Aliens vs Predator: Requiem
- Film: Terminator Genisys
- Film: Green Lantern
['Star Trek III: The Search for Spock', 'I Am Number Four', 'Aliens vs Predator: Requiem', 'Terminator Genisys', 'Green Lantern']
