# Explore here - Problem Statement | Background

## **Movie recommendation system**

This dataset collects part of the knowledge from the API TMDB, which contains only 5000 movies out of the total number.

The following resources are available:

tmdb_5000_movies: https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_movies.csv

tmdb_5000_credits: https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_credits.csv

### Step 1  - Load the files
load the two files and store them in two separate data structures (Pandas DataFrames). On one side we will have stored the information of the movies and their credits.


### Step 2: Creation of a database
Create a database to store the two DataFrames in separate tables. Then join the two tables with SQL (and integrate it with Python) to generate a third table containing information from both tables unified. The key through which the join can be done is the title of the movie (title).

Now, clean the generated table and leave only the following columns:

- movie_id
- title
- overview
- genres
- keywords
- cast
- crew

### Import Libraries


In [1]:
import pandas as pd
from pickle import dump

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


### Read the CSV files for both Movies and Credits

In [2]:
#import csv movie file
movies_data = pd.read_csv('https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_movies.csv')

# Set display options to show all columns (None means unlimited)
pd.set_option('display.max_columns', None)

#Read csv file and display intial rows
movies_data.head(3)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466


In [3]:
#import csv credits file
credits_data = pd.read_csv('https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_credits.csv')

# Set display options to show all columns (None means unlimited)
pd.set_option('display.max_columns', None)

#Read csv file and display intial rows
credits_data.head(3)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."


In [4]:
# display shape
print(movies_data.shape)
print(credits_data.shape)

(4803, 20)
(4803, 4)


### Creation of a database with SQLite 3

In [5]:
import sqlite3
import pandas as pd
import pandasql as psql

# Create SQLite Database
conn = sqlite3.connect('recommend_db.sqlite')

# Store DataFrames in the Database
movies_data.to_sql('movies_data', conn, if_exists='replace', index=False)
credits_data.to_sql('credits_data', conn, if_exists='replace', index=False)

# Close the connection
#conn.close()

4803

In [6]:
# Merge tables for creating a new DataFrame
query = """
    SELECT *
    FROM movies_data
    INNER JOIN credits_data
    ON movies_data.title = credits_data.title;
"""
total_data = pd.read_sql_query(query, conn)
conn.close()

total_data = total_data.loc[:, ~total_data.columns.duplicated()]

# Specify the columns you want to keep
columns_to_keep = ['movie_id', 'title', 'overview', 'genres', 'keywords', 'cast', 'crew']

# Create a new DataFrame with only these columns
clean_total_data = total_data[columns_to_keep]

clean_total_data.head()


Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...","[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...","[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,A cryptic message from Bond’s past sends him o...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...","[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,Following the death of District Attorney Harve...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...","[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...","[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"John Carter is a war-weary, former military ca...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","[{""id"": 818, ""name"": ""based on novel""}, {""id"":...","[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [7]:
import json
import pandas as pd

def load_json_safe(json_str, default_value=None):
    """Safely load JSON string into Python object."""
    try:
        return json.loads(json_str)
    except (TypeError, json.JSONDecodeError):
        return default_value

# Assuming clean_total_data is your DataFrame
# Check if columns exist in clean_total_data
required_columns = ['genres', 'keywords', 'cast', 'crew', 'overview']
for col in required_columns:
    if col not in clean_total_data.columns:
        print(f"Column {col} not found in DataFrame")
        # Handle missing column (e.g., add it, skip transformation, or raise an error)

# Apply transformations using .loc for assignments
if 'genres' in clean_total_data:
    clean_total_data.loc[:, 'genres'] = clean_total_data['genres'].apply(
        lambda x: [item["name"] for item in load_json_safe(x)] if pd.notna(x) else None
    )

if 'keywords' in clean_total_data:
    clean_total_data.loc[:, 'keywords'] = clean_total_data['keywords'].apply(
        lambda x: [item["name"] for item in load_json_safe(x)] if pd.notna(x) else None
    )

if 'cast' in clean_total_data:
    clean_total_data.loc[:, 'cast'] = clean_total_data['cast'].apply(
        lambda x: [item["name"] for item in load_json_safe(x)][:3] if pd.notna(x) else None
    )

if 'crew' in clean_total_data:
    clean_total_data.loc[:, 'crew'] = clean_total_data['crew'].apply(
        lambda x: " ".join([crew_member['name'] for crew_member in load_json_safe(x) if crew_member['job'] == 'Director'])
    )

if 'overview' in clean_total_data:
    clean_total_data.loc[:, 'overview'] = clean_total_data['overview'].apply(lambda x: [x] if pd.notna(x) else None)

# Renaming clean_total_data to tot_data2
tot_data2 = clean_total_data
tot_data2.head()

Unnamed: 0,movie_id,title,overview,genres,keywords,cast,crew
0,19995,Avatar,"[In the 22nd century, a paraplegic Marine is d...","[Action, Adventure, Fantasy, Science Fiction]","[culture clash, future, space war, space colon...","[Sam Worthington, Zoe Saldana, Sigourney Weaver]",James Cameron
1,285,Pirates of the Caribbean: At World's End,"[Captain Barbossa, long believed to be dead, h...","[Adventure, Fantasy, Action]","[ocean, drug abuse, exotic island, east india ...","[Johnny Depp, Orlando Bloom, Keira Knightley]",Gore Verbinski
2,206647,Spectre,[A cryptic message from Bond’s past sends him ...,"[Action, Adventure, Crime]","[spy, based on novel, secret agent, sequel, mi...","[Daniel Craig, Christoph Waltz, Léa Seydoux]",Sam Mendes
3,49026,The Dark Knight Rises,[Following the death of District Attorney Harv...,"[Action, Crime, Drama, Thriller]","[dc comics, crime fighter, terrorist, secret i...","[Christian Bale, Michael Caine, Gary Oldman]",Christopher Nolan
4,49529,John Carter,"[John Carter is a war-weary, former military c...","[Action, Adventure, Science Fiction]","[based on novel, mars, medallion, space travel...","[Taylor Kitsch, Lynn Collins, Samantha Morton]",Andrew Stanton


In [8]:
tot_data2['genres'][1]

['Adventure', 'Fantasy', 'Action']

In [9]:
# Ensure each column is a string and remove commas if the column is a list
for column in ["overview", "genres", "keywords", "cast", "crew"]:
    tot_data2[column] = tot_data2[column].apply(lambda x: ' '.join(x).replace(',', '') if isinstance(x, list) else str(x))

# Combine columns into a new 'tags' column as a list of strings
tot_data2["tags"] = tot_data2.apply(lambda row: [row["overview"], row["genres"], row["keywords"], row["cast"], row["crew"]], axis=1)

# Drop the original columns
tot_data2.drop(columns=["overview", "genres", "keywords", "cast", "crew"], inplace=True)

# Display the 'tags' of the first row
print(tot_data2.iloc[0].tags)

['In the 22nd century a paraplegic Marine is dispatched to the moon Pandora on a unique mission but becomes torn between following orders and protecting an alien civilization.', 'Action Adventure Fantasy Science Fiction', 'culture clash future space war space colony society space travel futuristic romance space alien tribe alien planet cgi marine soldier battle love affair anti war power relations mind and soul 3d', 'Sam Worthington Zoe Saldana Sigourney Weaver', 'James Cameron']


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tot_data2[column] = tot_data2[column].apply(lambda x: ' '.join(x).replace(',', '') if isinstance(x, list) else str(x))
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tot_data2["tags"] = tot_data2.apply(lambda row: [row["overview"], row["genres"], row["keywords"], row["cast"], row["crew"]], axis=1)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tot_data2.

In [16]:
tot_data2.head(2)

Unnamed: 0,movie_id,title,tags,tags_vec
0,19995,Avatar,[In the 22nd century a paraplegic Marine is di...,
1,285,Pirates of the Caribbean: At World's End,[Captain Barbossa long believed to be dead has...,


### Initialization and training of the model

In [11]:
#Import Libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Assuming tot_data2 is your dataset and is already loaded

# Preprocess tags: lowercasing and ensuring string type
tot_data2 = tot_data2.copy()
tot_data2.loc[:, 'tags_vec'] = tot_data2['tags'].str.lower().astype(str)

# Feature Extraction
tfidf = TfidfVectorizer() # No limit on vectorizer
features = tfidf.fit_transform(tot_data2['tags_vec'])

# Model Training
model = NearestNeighbors(n_neighbors=6, algorithm="brute", metric="cosine")
model.fit(features)

def get_movie_recommendations(movie_title: str):
    if movie_title not in tot_data2['title'].values:
        return "Movie not found in the dataset."

    movie_index = tot_data2[tot_data2["title"] == movie_title].index[0]
    distances, indices = model.kneighbors(features[movie_index], return_distance=True)
    return [(tot_data2["title"].iloc[i], distances[0][j]) 
            for j, i in enumerate(indices[0]) if j != 0]

# User inputs the movie title
input_movie = "Superman"

recommendations = get_movie_recommendations(input_movie)

if isinstance(recommendations, str):
    print(recommendations)
elif recommendations:
    print(f"Film recommendations for '{input_movie}':")
    for movie, _ in recommendations:
        print(movie)
else:
    print("No recommendations available.")

Film recommendations for 'Superman':
Beneath Hill 60
Crocodile Dundee
The Girl on the Train
The I Inside
Polisse


#### Save Model

In [None]:
dump(model, open("KNN_default_42.sav", "wb"))