<p style="font-family: Arial; font-size:3.75em;color:purple; font-weight:bold"><br>
Jupyter Notebook for the Final Project</p><br>

<h1 style="font-size:2em;color:#2467C0">Case Study: Netflix Movies and TV Shows Data Analysis</h1>

### Proposed Question 1: What are the most frequent genres for long-running TV shows (at least five seasons long)?
### Proposed Question 2: Can we build a machine-learning model that estimates the content rating of a Netflix movie or TV show based on all other attributes?
### Proposed Question 3: What are the most frequent themes or topics based on Netflix TV shows and movie descriptions?
##### Disclaimer: Throughout the project, I reuse parts of the code that Dr. Porter and Dr. Altintas demonstrated in the lectures.

<h1 style="font-size:2em;color:#2467C0">Data Engineering: Step 1, Acquire Data</h1>

This Notebook uses a dataset from the Kaggle website. Here is the link to the data source: https://www.kaggle.com/datasets/shivamb/netflix-shows<br>
<br>Once the download completes, please make sure the data file **netflix_titles.csv** is in the same directory as this **Notebook**.
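
Before running the cells below, it can help to confirm that the data file really sits next to the Notebook. The following is a minimal sketch, not part of the original analysis: the helper name `dataset_path` is hypothetical, and the file name `netflix_titles.csv` is taken from the `pd.read_csv` calls used later in this Notebook.

```python
from pathlib import Path


def dataset_path(directory=".", filename="netflix_titles.csv"):
    """Return the path to the data file, or raise a clear error if it is missing."""
    path = Path(directory) / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"{filename} not found in {Path(directory).resolve()}; "
            "download it from Kaggle and place it next to this Notebook."
        )
    return path
```

Calling `dataset_path()` with no arguments checks the current working directory, which is normally the folder the Notebook was started from.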

# Proposed Question 1: What are the most frequent genres for long-running TV shows (at least five seasons long)?

<h1 style="font-size:2em;color:#2467C0">Data Engineering: Step 2A, Exploring Data</h1>

In [None]:
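# Filter for long-running TV shows (more than 8 seasons)
# Note: this cell depends on `data` and its 'show_seasons' column, which are only created in the
# Question 2 pre-processing cells further down, so run those cells first
long_running_shows = data[(data['type'] == 'TV Show') & (data['show_seasons'] > 8)]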

long_running_shows

In [None]:
import pandas as pd

df = pd.read_csv('./netflix_titles.csv')
df.isnull().sum()

In [None]:
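# Filter for TV Shows (take a copy so that new columns can be added without a SettingWithCopyWarning)
df_tv_shows = df[df['type'] == 'TV Show'].copy()

# Parse the number of seasons from the "duration" field (e.g. "3 Seasons" -> 3)
df_tv_shows['num_seasons'] = df_tv_shows['duration'].str.extract(r'(\d+)', expand=False).astype(int)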

# Filter for shows with at least 5 seasons
df_long_running = df_tv_shows[df_tv_shows['num_seasons'] >= 5]

# Split the 'listed_in' column into separate rows
df_long_running = df_long_running.assign(listed_in=df_long_running['listed_in'].str.split(',')).explode('listed_in')

# Trim whitespace
df_long_running['listed_in'] = df_long_running['listed_in'].str.strip()

df_long_running.head()
import matplotlib.pyplot as plt

# Count the occurrences of each genre
genre_counts = df_long_running['listed_in'].value_counts()

In [None]:
# Create the plot
plt.figure(figsize=(10, 6))
genre_counts.plot(kind='bar', color='skyblue')
plt.title('Genres of Long-Running TV Shows (5 Seasons or More)')
plt.xlabel('Genres')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()
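
The genre count above rests on three steps: filter by season count, split the comma-separated `listed_in` string, and tally the trimmed genre names. As a dependency-free illustration of that logic, here is a small sketch in plain Python. The rows are made up for the example and the helper name `count_genres` is hypothetical; the Notebook itself does this with pandas.

```python
from collections import Counter

# Toy rows (made up for illustration): (number of seasons, comma-separated genres)
shows = [
    (9, "Crime TV Shows, TV Dramas"),
    (5, "TV Comedies, TV Dramas"),
    (2, "TV Comedies"),
    (7, "Kids' TV, TV Comedies"),
]


def count_genres(rows, min_seasons=5):
    """Count genres across shows with at least `min_seasons` seasons."""
    counts = Counter()
    for seasons, listed_in in rows:
        if seasons >= min_seasons:
            # Split the multi-genre string and trim whitespace around each genre
            counts.update(genre.strip() for genre in listed_in.split(","))
    return counts


genre_counts_demo = count_genres(shows)
print(genre_counts_demo.most_common())
```

A show that lists several genres contributes once to each of them, so the counts add up to more than the number of shows.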

# Proposed Question 2: Can we build a machine-learning model that estimates the content rating of a Netflix movie or TV show based on all other attributes?

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

In [None]:
# Load the Netflix titles data from the CSV file
data = pd.read_csv('./netflix_titles.csv')

data.isnull().sum()

<h1 style="font-size:2em;color:#2467C0">Data Engineering: Step 2B, Pre-Processing Data</h1>

In [None]:
# Keep a copy of the original data for later use, specifically for restoring the 'cast' column
data_copy = data.copy()

In [None]:
# Create mappings for 'cast' to 'director' and 'director' to 'country'
# These mappings are created from existing non-null data
cast_director_mapping = data[data.director.notna()].set_index('cast')['director'].to_dict()
director_country_mapping = data[data.country.notna()].set_index('director')['country'].to_dict()

cast_director_mapping, director_country_mapping

In [None]:
# Fill missing 'director' data based on the 'cast' to 'director' mapping created above
# If the mapping does not exist for a particular cast, fill the value with 'Unknown'
# (assign the result back instead of using inplace=True on a column, which newer pandas versions warn about)
data['director'] = data['director'].fillna(data['cast'].map(cast_director_mapping))
data['director'] = data['director'].fillna('Unknown')

data.director.isnull().sum()

In [None]:
# Fill missing 'country' data based on the 'director' to 'country' mapping created above
# If the mapping does not exist for a particular director, fill the value with 'Unknown'
# (assign the result back instead of using inplace=True on a column, which newer pandas versions warn about)
data['country'] = data['country'].fillna(data['director'].map(director_country_mapping))
data['country'] = data['country'].fillna('Unknown')

data.country.isnull().sum()
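
The two fill steps above follow the same pattern: look up a missing value through a related column, and fall back to `'Unknown'` when no match exists. The sketch below shows that pattern on plain Python lists. The data is made up and the helper name `fill_from_mapping` is hypothetical; it is meant as an illustration of the idea, not as a replacement for the pandas cells.

```python
def fill_from_mapping(values, keys, mapping, default="Unknown"):
    """Fill None entries in `values` by looking up the matching key; fall back to `default`."""
    filled = []
    for value, key in zip(values, keys):
        if value is None:
            value = mapping.get(key, default)
        filled.append(value)
    return filled


# Toy data (made up): the second title has no director but shares a cast with the first
casts = ["Cast A", "Cast A", "Cast B"]
directors = ["Director X", None, None]

# Build the mapping only from rows where the director is known
cast_to_director = {c: d for c, d in zip(casts, directors) if d is not None}

print(fill_from_mapping(directors, casts, cast_to_director))
# ['Director X', 'Director X', 'Unknown']
```

Note that such a mapping keeps only one director per cast, so the fill is a heuristic guess rather than a verified value.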

In [None]:
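# Fill missing 'date_added' data using the backfill method
# The backfill method replaces missing values with the next valid value in the column
# (assign the result back instead of using inplace=True on a column, which newer pandas versions warn about)
data['date_added'] = data['date_added'].bfill()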

data.date_added.isnull().sum()
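
To make the backfill behavior concrete, here is a minimal plain-Python sketch of the same idea. The helper name `backfill` and the sample values are made up for illustration; pandas performs this on the column directly.

```python
def backfill(values):
    """Replace each None with the next non-None value that follows it."""
    result = list(values)
    next_valid = None
    # Walk from the end so that the "next valid value" is always known
    for i in range(len(result) - 1, -1, -1):
        if result[i] is None:
            result[i] = next_valid
        else:
            next_valid = result[i]
    return result


print(backfill(["Sep 1", None, None, "Aug 5", None]))
# ['Sep 1', 'Aug 5', 'Aug 5', 'Aug 5', None]
```

Two things follow from this: a gap at the very end stays missing because no later value exists, and the filled date is borrowed from a neighboring row, so it is only an approximation.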

In [None]:
# Remove rows with missing 'duration' data
# These rows are not useful for the analysis and predictive model, so they are removed
data.dropna(subset=['duration'], inplace=True)

data.duration.isnull().sum()

In [None]:
# Store the position of 'date_added' for later use
date_added_position = data.columns.get_loc('date_added')

date_added_position

In [None]:
# Convert 'date_added' from string to datetime format and extract the year, storing it in a new column 'year_added'
# The year when the title was added to Netflix could be a relevant feature for the predictive model
# Stray whitespace around the date strings is stripped first so that every value is parsed the same way
data['date_added'] = pd.to_datetime(data['date_added'].str.strip())
data['year_added'] = data['date_added'].dt.year

data.date_added, data.year_added
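
As a small worked example of the year extraction, the sketch below parses single date strings with the standard library. It assumes the dates are written like `September 25, 2021` (full English month name, day, year); the helper name `year_added` is hypothetical and only mirrors the column name used above.

```python
from datetime import datetime


def year_added(date_string, fmt="%B %d, %Y"):
    """Parse a date such as 'September 25, 2021' and return its year."""
    # strip() guards against leading or trailing spaces in the raw value
    return datetime.strptime(date_string.strip(), fmt).year


print(year_added("September 25, 2021"))  # 2021
print(year_added(" August 4, 2017"))     # 2017
```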

In [None]:
# Drop the original 'date_added' column
# We no longer need this column as we have extracted the year information
data.drop('date_added', axis=1, inplace=True)

data.info()

In [None]:
# Insert 'year_added' at the original position of 'date_added'
# This is done to maintain the original structure of the dataset
data.insert(date_added_position, 'year_added', data.pop('year_added'))

data.info()

In [None]:
# Store the position of 'duration' for later use
duration_position = data.columns.get_loc('duration')

duration_position

In [None]:
# Convert 'duration' to separate features 'movie_duration' and 'show_seasons'
# This is done because the 'duration' column contains different types of information for movies and TV shows
# For movies, it contains the duration in minutes, while for TV shows, it contains the number of seasons
is_movie = data['duration'].str.contains('min')
is_show = data['duration'].str.contains('Season')
data.loc[is_movie, 'movie_duration'] = data.loc[is_movie, 'duration'].str.replace(' min', '').astype(int)
data.loc[is_show, 'show_seasons'] = data.loc[is_show, 'duration'].str.replace(' Season(s)?', '', regex=True).astype(int)

data.info()

In [None]:
# Drop the original 'duration' column
# We no longer need this column as we have separated the information into 'movie_duration' and 'show_seasons'
data.drop('duration', axis=1, inplace=True)

data.info()

In [None]:
# Insert 'movie_duration' and 'show_seasons' at the original position of 'duration'
# This is done to maintain the original structure of the dataset
data.insert(duration_position, 'movie_duration', data.pop('movie_duration'))
data.insert(duration_position + 1, 'show_seasons', data.pop('show_seasons'))

data.info()
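
The `duration` split can be illustrated on single values with a regular expression. This is a minimal sketch under the assumption that every value looks like `90 min`, `1 Season`, or `3 Seasons`; the names `DURATION_PATTERN` and `split_duration` are hypothetical.

```python
import re

# A number, whitespace, then either "min" or "Season"/"Seasons"
DURATION_PATTERN = re.compile(r"^\s*(\d+)\s+(min|Seasons?)\s*$")


def split_duration(duration):
    """Return (movie_duration, show_seasons); exactly one of the two is not None."""
    match = DURATION_PATTERN.match(duration)
    if match is None:
        raise ValueError(f"Unrecognized duration: {duration!r}")
    amount, unit = int(match.group(1)), match.group(2)
    if unit == "min":
        return amount, None
    return None, amount


print(split_duration("90 min"))     # (90, None)
print(split_duration("1 Season"))   # (None, 1)
print(split_duration("3 Seasons"))  # (None, 3)
```

Raising an error on unexpected values makes data problems visible early instead of silently producing missing numbers.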

In [None]:
# Separate the data into two subsets: rows with a known 'rating' and rows where 'rating' is missing
known_rating_data = data[data.rating.notna()].copy()
unknown_rating_data = data[data.rating.isna()].copy()

known_rating_data.isnull().sum(), unknown_rating_data.isnull().sum()

In [None]:
# Save the indices of the rows with unknown 'rating' for later use
missing_rating_index = unknown_rating_data.index

missing_rating_index

In [None]:
# Display the rows with missing 'rating' before filling
before_filling = data_copy.loc[missing_rating_index]
print("Rows with missing 'rating' before filling:\n", before_filling)

<h1 style="font-size:2em;color:#2467C0">Data Analysis: Step 3, Analyze Data</h1>

In [None]:
# Remove 'cast' column as it is not used in the predictive model
# The 'cast' column contains too many unique values, which could make the predictive model overly complex
data.drop('cast', axis=1, inplace=True)

data.info()

In [None]:
# Split the data with known 'rating' into a training set and a test set
# The model will be trained on the training set and tested on the test set
train_data, test_data = train_test_split(known_rating_data, test_size=0.2, random_state=42)

train_data.head(), test_data.head()
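
For readers who want to see what an 80/20 split with a fixed seed amounts to, here is a plain-Python sketch. It is not scikit-learn's exact algorithm (rounding and shuffling details differ), and the helper name `split_rows` is hypothetical; it only illustrates the shuffle-then-cut idea and why a seed makes the split repeatable.

```python
import random


def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible for a given seed
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


train_rows, test_rows = split_rows(range(10))
print(len(train_rows), len(test_rows))  # 8 2
```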

In [None]:
# Define the features to be used in the predictive model
# These features are selected based on their relevance to the 'rating'
features = ['type', 'director', 'country', 'year_added', 'release_year', 'movie_duration', 'show_seasons', 'listed_in']

# Append a row with 'Unknown' for each feature to the training data
# This is done to ensure that the 'Unknown' category is included in the LabelEncoder classes for each feature
unknown_row = pd.DataFrame({feature: ['Unknown'] for feature in features}, index=[-1])
train_data = pd.concat([train_data, unknown_row], axis=0)

train_data.tail()

In [None]:
# Convert all features to string type, as they will be encoded as categories
# This is necessary for the LabelEncoder to work correctly
for feature in features:
    train_data[feature] = train_data[feature].astype(str)
    test_data[feature] = test_data[feature].astype(str)
    unknown_rating_data[feature] = unknown_rating_data[feature].astype(str)
    
features
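
The appended `'Unknown'` row and the string conversion above exist so that categories never seen during training can still be encoded later. The sketch below shows that idea in plain Python with one encoder per feature. The function names, the `encoders_demo` dictionary, and the toy values are all hypothetical; scikit-learn's `LabelEncoder` likewise assigns integer codes to the sorted distinct labels, but it raises an error on unseen labels, which is why the fallback is needed.

```python
def fit_encoder(values, unknown="Unknown"):
    """Map each distinct value (plus the `unknown` placeholder) to an integer code."""
    classes = sorted(set(values) | {unknown})
    return {label: code for code, label in enumerate(classes)}


def encode(values, encoder, unknown="Unknown"):
    """Encode values, sending labels not seen during fitting to the `unknown` code."""
    return [encoder.get(value, encoder[unknown]) for value in values]


# One encoder per feature, fitted on training values only (toy data, made up)
train_columns = {"type": ["Movie", "TV Show"], "country": ["India", "Japan", "India"]}
encoders_demo = {name: fit_encoder(col) for name, col in train_columns.items()}

print(encode(["Japan", "Brazil"], encoders_demo["country"]))  # [1, 2]
```

Keeping a separate encoder for each feature matters because the integer codes of one column mean nothing for another column.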

In [None]:
# Encode the categorical variables with one LabelEncoder per feature
# This converts categorical features to numerical values, as required by the RandomForestClassifier
# Each fitted encoder is stored so the same mapping can be reused for the rows with unknown 'rating'
encoders = {}
for feature in features:
    le = LabelEncoder()
    train_data[feature] = le.fit_transform(train_data[feature])
    # Categories not seen in the training set are mapped to 'Unknown' before encoding
    test_data.loc[~test_data[feature].isin(le.classes_), feature] = 'Unknown'
    test_data[feature] = le.transform(test_data[feature])
    encoders[feature] = le

train_data.head(), test_data.head()

In [None]:
# Remove the appended row from the training data
# The 'Unknown' row was only needed for the LabelEncoder and is not used in the actual training of the model
train_data = train_data[train_data.index != -1]

train_data.head()

In [None]:
# Train a RandomForestClassifier on the training set
# A random forest combines many decision trees and works on the integer-encoded features without scaling
X_train = train_data[features]
y_train = train_data['rating']
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

model

In [None]:
# Evaluate the model on the test set
# This provides an estimate of how well the model will perform on new, unseen data
X_test = test_data[features]
y_test = test_data['rating']
y_pred_test = model.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred_test))
print("Model Accuracy: ", accuracy_score(y_test, y_pred_test))

In [None]:
# Predict the missing ratings in the unknown_rating_data
# This is done by applying the trained model to the data with unknown 'rating'
# Each feature is encoded with the encoder that was fitted on that same feature during training
for feature in features:
    le = encoders[feature]
    unknown_rating_data.loc[~unknown_rating_data[feature].isin(le.classes_), feature] = 'Unknown'
    unknown_rating_data[feature] = le.transform(unknown_rating_data[feature])

In [None]:
X_unknown = unknown_rating_data[features]
y_pred_unknown = model.predict(X_unknown)

In [None]:
# Fill the missing ratings in the original data
# The predicted ratings are used to fill the missing values in the 'rating' column
data.loc[data.rating.isna(), 'rating'] = y_pred_unknown

In [None]:
# Add the original 'cast' column back to the dataset
# The 'cast' column is restored to the dataset for completeness
data.insert(4, 'cast', data_copy['cast'])

data.info()

<h1 style="font-size:2em;color:#2467C0">Data Analysis: Step 4, Reporting Insights</h1>

In [None]:
# Display the rows with missing 'rating' before filling
# This is done to provide a comparison before and after filling the missing values
print("Rows with missing 'rating' before filling:\n", before_filling)

In [None]:
# Display the rows with missing 'rating' after filling
# This is done to provide a comparison before and after filling the missing values
print("Rows with missing 'rating' after filling:\n", data.loc[missing_rating_index])

In [None]:
# Display the final state of the data
# This provides a complete view of the dataset after all preprocessing and filling of missing values
data.info()

# Proposed Question 3: What are the most frequent themes or topics based on Netflix TV shows and movie descriptions?

<h1 style="font-size:2em;color:#2467C0">Data Engineering: Step 2B, Pre-Processing Data</h1>

In [None]:
import nltk
import string
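nltk.download('punkt')
nltk.download('punkt_tab')  # assumption: newer NLTK releases load the tokenizer data from 'punkt_tab'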
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

netflix_data = pd.read_csv('./netflix_titles.csv')

In [None]:
# Initialize a WordNetLemmatizer for lemmatization
lemmatizer = WordNetLemmatizer()

# Define a list of English stop words
stop_words = set(stopwords.words('english'))

# Define a function to preprocess text
def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    
    # Convert the tokens to lower case
    tokens = [token.lower() for token in tokens]
    
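    # Remove tokens that consist only of punctuation (e.g. ',', '--', '``')
    # A plain `token not in string.punctuation` test is a substring check and lets multi-character marks through
    tokens = [token for token in tokens if not all(ch in string.punctuation for ch in token)]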
    
    # Remove stop words from the tokens
    tokens = [token for token in tokens if token not in stop_words]
    
    # Lemmatize the tokens
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    # Return the processed tokens
    return tokens

# Apply the text preprocessing function to the 'description' column
netflix_data['tokens'] = netflix_data['description'].apply(preprocess_text)

netflix_data['tokens'].head(10)
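
To show the shape of the preprocessing on one sentence without downloading any NLTK data, here is a simplified, dependency-free sketch. It is not the pipeline used above: it tokenizes with a regular expression, uses a tiny made-up stop-word set (`DEMO_STOP_WORDS`), and skips lemmatization. The helper name `demo_preprocess` and the sample sentence are hypothetical.

```python
import re

# A tiny stop-word list for illustration only; the Notebook uses NLTK's English list
DEMO_STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "with", "his", "her"}


def demo_preprocess(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_words]


print(demo_preprocess("A detective returns to his hometown, and the past follows."))
# ['detective', 'returns', 'hometown', 'past', 'follows']
```

The result is the kind of token list that the topic model consumes: short, lowercase, and free of filler words and punctuation.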

<h1 style="font-size:2em;color:#2467C0">Data Analysis: Step 3, Analyze Data</h1>

In [None]:
from gensim import corpora

# Create a dictionary from the processed tokens
dictionary = corpora.Dictionary(netflix_data['tokens'])

# Filter out tokens that occur in fewer than 20 documents or in more than 50% of the documents
dictionary.filter_extremes(no_below=20, no_above=0.5)

# Create a bag-of-words representation for each document
corpus = [dictionary.doc2bow(doc) for doc in netflix_data['tokens']]
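
The dictionary filtering and the bag-of-words conversion can be sketched in plain Python to show what they compute. This is an illustration under simplifying assumptions, not gensim's implementation: the helpers `build_vocabulary` and `to_bow` are hypothetical, the documents are made up, and the thresholds are scaled down so the effect is visible on four documents.

```python
from collections import Counter


def build_vocabulary(docs, no_below=2, no_above=0.5):
    """Keep tokens found in at least `no_below` docs and in at most `no_above` of all docs."""
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    kept = sorted(
        token for token, df in doc_freq.items()
        if df >= no_below and df / len(docs) <= no_above
    )
    return {token: token_id for token_id, token in enumerate(kept)}


def to_bow(doc, vocabulary):
    """Return sorted (token_id, count) pairs for the tokens present in the vocabulary."""
    counts = Counter(vocabulary[token] for token in doc if token in vocabulary)
    return sorted(counts.items())


docs = [["love", "family", "love"], ["family", "war"], ["love", "war"], ["family", "secret"]]
vocab = build_vocabulary(docs)
print(vocab)                   # {'love': 0, 'war': 1}
print(to_bow(docs[0], vocab))  # [(0, 2)]
```

Here `secret` is dropped for being too rare and `family` for being too common, which is the same trade-off the `no_below` and `no_above` settings control in the cell above.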

In [None]:
from gensim.models import LdaModel

# Set the parameters for the LDA model
num_topics = 10
chunksize = 2000
passes = 20
iterations = 400
eval_every = 1  # estimate log perplexity after every update; this slows training noticeably

# Create an id-to-word dictionary
temp = dictionary[0]  # This is only to "load" the dictionary
id2word = dictionary.id2token

# Initialize the LDA model
model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=num_topics,
    passes=passes,
    eval_every=eval_every
)

<h1 style="font-size:2em;color:#2467C0">Data Analysis: Step 4, Reporting Insights</h1>

In [None]:
# Get the top topics from the LDA model
top_topics = model.top_topics(corpus) 

# Print the top 10 words for each topic
for i, topic in enumerate(top_topics):
    print(f'Top 10 words for topic #{i}:')
    print([word for _, word in topic[0][:10]])
    print('\n')
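
The reporting loop relies on the shape of the `top_topics` result. The sketch below assumes that shape is a list of `(topic, coherence)` pairs, where each topic is a list of `(probability, word)` pairs; this matches how the cell above indexes the result, but the numbers and words here are made up, and the helper name `top_words` is hypothetical.

```python
def top_words(top_topics, n=10):
    """Return the first `n` words of each topic, in the order given."""
    return [[word for _, word in topic[:n]] for topic, _coherence in top_topics]


# Made-up values shaped like the assumed return structure
demo_topics = [
    ([(0.05, "family"), (0.04, "life"), (0.03, "love")], -1.2),
    ([(0.06, "police"), (0.05, "crime"), (0.02, "murder")], -1.5),
]

for i, words in enumerate(top_words(demo_topics, n=2)):
    print(f"Topic #{i}: {words}")
```

Separating the extraction from the printing makes it easy to reuse the word lists, for example to label the topics by hand.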