# Final Project Section 2 Team 3

The primary goal of this project is to evaluate and compare the effectiveness of traditional machine learning (ML) models with that of large language models (LLMs) at recommending movies. Traditional ML models have been used successfully in many recommendation systems because they handle structured data efficiently. Recent LLMs, however, can process natural language in considerable depth, which creates an opportunity to use them for recommendations as well.

We are performing this experiment using the Netflix Prize dataset.

Link to GitHub:
> https://github.com/PhilipFelizarta/LLM-ML-Recommender-Study

Link to Dataset:
> https://www.kaggle.com/datasets/netflix-inc/netflix-prize-data

## Data Understanding (EDA)


## Data Preparation
Loading the data and performing necessary data cleaning and preprocessing steps.

We load the data from two files, `combined_data_1.txt` and `movie_titles.csv`. The `combined_data_1.txt` file contains the movie ratings, while the `movie_titles.csv` file contains the movie titles and release years.
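
To make the ratings file layout concrete, here is a minimal sketch of the parsing step on an in-memory sample. It assumes the layout implied by the loader used in this notebook: a line such as `1:` introduces a movie ID, and each following line holds `customer_id,rating,date`. The sample rows and the helper name `parse_ratings` are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd


def parse_ratings(lines):
    """Parse rating lines in the assumed layout into a DataFrame (illustrative helper)."""
    rows = []
    current_movie_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith(":"):
            # A line like "1:" starts the block of ratings for movie 1
            current_movie_id = int(line[:-1])
        else:
            customer_id, rating, date = line.split(",")
            rows.append([current_movie_id, int(customer_id), float(rating), date])
    return pd.DataFrame(rows, columns=["Movie_Id", "Cust_Id", "Rating", "Date"])


# Made-up sample in the assumed layout (not real data)
sample = io.StringIO("1:\n101,3,2005-09-06\n102,5,2005-05-13\n2:\n101,4,2005-10-19\n")
sample_df = parse_ratings(sample)
print(sample_df)
```

Each rating row inherits the movie ID of the most recent header line, which is why the parser has to carry `current_movie_id` across iterations.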

After cleaning and joining the data, we also enrich it with the following features:
- average rating per movie
- number of ratings per movie
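
As a rough sketch of how these two features can be derived, the example below uses `groupby(...).transform(...)` on a made-up toy frame to attach per-movie statistics directly to each rating row. The notebook itself computes them with separate `groupby` and `merge` steps; this is only an alternative illustration, and the toy values are hypothetical. The column names mirror the ones used later in the notebook.

```python
import pandas as pd

# Made-up ratings: movie 1 has three ratings, movie 2 has two
toy = pd.DataFrame({
    "Movie_Id": [1, 1, 1, 2, 2],
    "Cust_Id": [10, 11, 12, 10, 11],
    "Rating": [3.0, 4.0, 5.0, 2.0, 4.0],
})

per_movie = toy.groupby("Movie_Id")["Rating"]
# transform() returns one value per original row, aligned with toy's index
average_rating = per_movie.transform("mean")
review_count = per_movie.transform("count")

toy["Average_Rating"] = average_rating
toy["Review_Count"] = review_count
print(toy)
```

Because `transform` keeps the original row order and index, no merge is needed to bring the statistics back onto the rating rows.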

In [None]:
import pandas as pd
import csv
import matplotlib.pyplot as plt

In [None]:
ratings_file_1 = "combined_data_1.txt"

# Not loading all the data at once to reduce iteration time
ratings = [ratings_file_1]
movie_titles = "movie_titles.csv"
dataset_path = "./data"

In [None]:
def load_ratings_data(filename):
    """Parse a ratings file into a DataFrame with one row per rating."""
    data = []
    with open(filename, 'r') as file:
        current_movie_id = None
        for line in file:
            line = line.strip()
            if not line:
                # Skip blank lines so the split below does not fail
                continue
            if line.endswith(':'):
                # A line such as "1:" starts the ratings for a new movie
                current_movie_id = int(line.replace(':', ''))
            else:
                customer_id, rating, date = line.split(',')
                data.append([current_movie_id, int(customer_id), float(rating), date])
    return pd.DataFrame(data, columns=['Movie_Id', 'Cust_Id', 'Rating', 'Date'])


In [None]:
# ignore_index avoids duplicate row labels if more rating files are added later
df_ratings = pd.concat([load_ratings_data(f"{dataset_path}/{rating}") for rating in ratings], ignore_index=True)

In [None]:
df_ratings.head(3)

In [None]:
print(df_ratings.shape)

In [None]:
df_ratings.describe()

In [None]:
cust_count = df_ratings['Cust_Id'].nunique()
movie_count = df_ratings['Movie_Id'].nunique()
rating_count = df_ratings['Cust_Id'].count()


print(f"Total number of unique customers: {cust_count}")

p = df_ratings.groupby('Rating')['Rating'].agg(['count'])

ax = p.plot(kind = 'barh', legend = False, figsize = (15,10))
plt.title('Total pool: {:,} Movies, {:,} customers, {:,} ratings given'.format(movie_count, cust_count, rating_count), fontsize=20)
plt.axis('off')

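# Label each bar with its share of all ratings; avoids positional [0] lookups
# on a label-indexed Series, which newer pandas versions deprecate
total = p['count'].sum()
for pos, (rating, count) in enumerate(p['count'].items()):
    ax.text(count / 4, pos, 'Rating {:.0f}: {:.0f}%'.format(rating, count * 100 / total), color = 'white', weight = 'bold')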

In [None]:
titles = []

with open(f"{dataset_path}/{movie_titles}", encoding="ISO-8859-1") as file:
    reader = csv.reader(file)
    for row in reader:
        movie_id = int(row[0])
        # Need to handle a few cases where the year is missing. 
        if row[1] == 'NULL':
            year = -1
        else:
            year = int(row[1])
        # Need to handle the case where a movie title has a comma in the name
        name = ','.join(row[2:]) 
        titles.append([movie_id, year, name])

df_titles = pd.DataFrame(titles, columns=['Movie_Id', 'Movie_Year', 'Name'])
df_titles.set_index('Movie_Id', inplace=True)


In [None]:
df_titles.head(3)

In [None]:
df = df_ratings.join(df_titles, on='Movie_Id', how='inner')
df.head(3)

In [None]:
# Check for missing values (missing years were stored as -1 above, so they are not counted here)
na_check = df.isna().sum()
print(na_check)

##### Enrich data with average rating and review count for each movie

In [None]:
average_ratings = df.groupby('Movie_Id')['Rating'].mean().reset_index()
average_ratings.columns = ['Movie_Id', 'Average_Rating']

review_counts = df.groupby('Movie_Id')['Rating'].count().reset_index()
review_counts.columns = ['Movie_Id', 'Review_Count']

In [None]:
df = df.merge(average_ratings, on='Movie_Id', how='inner')
df = df.merge(review_counts, on='Movie_Id', how='inner')

##### Find the customers who have rated all of the top N movies
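
The idea behind this step can be sketched on a made-up toy frame: restrict the ratings to the chosen movies, count how many distinct movies each customer rated, and keep only the customers whose count equals N. This sketch uses `groupby(...).nunique()` in place of the pivot table used in the notebook; the data and the names `toy`, `chosen`, and `complete` are hypothetical.

```python
import pandas as pd

# Made-up ratings: only customer 10 has rated all three chosen movies
toy = pd.DataFrame({
    "Cust_Id": [10, 10, 10, 11, 11, 12, 12, 12],
    "Movie_Id": [1, 2, 3, 1, 2, 1, 2, 4],
    "Rating": [3.0, 4.0, 5.0, 2.0, 4.0, 5.0, 1.0, 3.0],
})

chosen = [1, 2, 3]  # stand-in for the top N movie IDs
n = len(chosen)

# Keep only ratings for the chosen movies
subset = toy[toy["Movie_Id"].isin(chosen)]

# Number of distinct chosen movies rated by each customer
movies_per_customer = subset.groupby("Cust_Id")["Movie_Id"].nunique()

# Customers who rated every chosen movie
complete = movies_per_customer[movies_per_customer == n].index
dense = subset[subset["Cust_Id"].isin(complete)]
print(dense)
```

The result is a dense block of ratings with no missing customer and movie pairs. Counting distinct movies also stays correct if a customer happens to have more than one rating for the same movie.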

In [None]:
KEEP_TOP_N = 20
top_movies = review_counts.sort_values(by='Review_Count', ascending=False)['Movie_Id']
top_n_movies = top_movies.head(KEEP_TOP_N)

In [None]:
df_top_n = df[df['Movie_Id'].isin(top_n_movies)]
print(df_top_n.shape)

In [None]:
pivot_table = df_top_n.pivot_table(index='Cust_Id', columns='Movie_Id', values='Rating', aggfunc='count', fill_value=0)

customers_all_n_movies = pivot_table[pivot_table.sum(axis=1) == KEEP_TOP_N].index

df_final = df_top_n[df_top_n['Cust_Id'].isin(customers_all_n_movies)]


In [None]:
df_final.head(3)

In [None]:
print(f"Number of customers who reviewed all {KEEP_TOP_N} movies: {len(customers_all_n_movies)}")
print(df_titles.loc[top_n_movies])


In [None]:
# Save the final dataset for the experiments; the file name follows KEEP_TOP_N
df_final.to_csv(f"{dataset_path}/df_top_{KEEP_TOP_N}_movies_customers_reviewed_all.csv", index=False)

## Feature Engineering


## Feature Selection


## Modeling


## Evaluation

## Discussion and Conclusions

