# Explore here - Problem Statement | Background

**Movie recommendation system**

This dataset is a subset of the data available through the TMDB API; it covers only about 5000 movies out of the full catalog.

The following resources are available:

tmdb_5000_movies: https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_movies.csv

tmdb_5000_credits: https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_credits.csv

### Step 1: Load the files
Load the two files and store them in two separate data structures (Pandas DataFrames): one holds the movie information and the other holds the credits.


### Step 2: Creation of a database
Create a database to store the two DataFrames in separate tables. Then join the two tables with SQL, run from Python, to generate a third table that combines the information from both. The join key is the movie title (`title`).

Now clean the generated table and keep only the following columns:

- movie_id
- title
- overview
- genres
- keywords
- cast
- crew

### Import Libraries


In [None]:
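import pandas as pd
from pickle import dump

# scikit-learn components used by the modelling cells further down
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score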

### Read the CSV files for both Movies and Credits

In [None]:
# Load the movies CSV file
movies_data = pd.read_csv('https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_movies.csv')

# Set display options to show all columns (None means unlimited)
pd.set_option('display.max_columns', None)

# Display the first three rows
movies_data.head(3)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466


In [None]:
# Load the credits CSV file
credits_data = pd.read_csv('https://raw.githubusercontent.com/4GeeksAcademy/k-nearest-neighbors-project-tutorial/main/tmdb_5000_credits.csv')

# Display the first three rows (the display option set earlier still applies)
credits_data.head(3)

Unnamed: 0,movie_id,title,cast,crew
0,19995,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."


In [None]:
# display shape
print(movies_data.shape)
print(credits_data.shape)

(4803, 20)
(4803, 4)


Use FLASK to connect to SQLite

### Remove 'package_name' and clean the review text

In [None]:
# List of keywords to filter out
keywords = ['package_name']

# Finding columns that contain any of the keywords
columns_to_drop = [col for col in tot_data.columns if any(keyword in col for keyword in keywords)]

tot_data["review"] = tot_data["review"].str.strip().str.lower()

# Dropping these columns from the DataFrame
tot_data = tot_data.drop(columns=columns_to_drop)

In [None]:
# Step 1: Text Preprocessing
# Basic preprocessing can include lowercasing, removing punctuation, etc.
tot_data['review_cleaned'] = tot_data['review'].str.lower()

# Step 2: Feature Extraction
tfidf = TfidfVectorizer(max_features=1000)  # Limit the vocabulary to the 1000 most frequent terms
features = tfidf.fit_transform(tot_data['review_cleaned'])

# Step 3: Label Encoding
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(tot_data['polarity'])

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

In [None]:
print(X_train.toarray()[:5])  # to print the first 5 rows


[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [None]:
# Training the Naive Bayes Classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

#### Model prediction and Evaluation


In [None]:
# Predicting and Evaluating
y_pred = nb_classifier.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, y_pred),5))
#print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.83799


#### Save Model

In [None]:
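# Use a context manager so the file is closed once the model is written
with open("nb_classifier_default_42.sav", "wb") as model_file:
    dump(nb_classifier, model_file)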
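
A saved model can be read back with `pickle.load`. The sketch below is a hedged round-trip check: it trains a tiny `MultinomialNB` on made-up counts (`X_demo`, `y_demo`) and writes it to a hypothetical file name in a temporary directory. It does not touch the classifier or the file saved above.

```python
import os
import tempfile
from pickle import dump, load

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up count matrix and labels standing in for the real features.
X_demo = np.array([[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 2, 1]])
y_demo = np.array([1, 0, 1, 0])
model = MultinomialNB().fit(X_demo, y_demo)

# Save and reload inside a temporary directory (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "nb_demo.sav")
    with open(model_path, "wb") as f:
        dump(model, f)
    with open(model_path, "rb") as f:
        restored = load(f)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and a model is best reloaded with the same scikit-learn version that saved it.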