# Movie Recommendation System

## Library Imports and Constants Initialization

In [1]:
import os
import pandas as pd
import numpy as np

## Data Loading

In [2]:
# Paths are relative to the notebook's working directory
movies_df = pd.read_csv(os.path.join("data", "movies.csv"))
# Read only the first 500,000 ratings for now; the full file is very large
ratings_df = pd.read_csv(os.path.join("data", "ratings.csv"), nrows=500_000)
tags_df = pd.read_csv(os.path.join("data", "tags.csv"))

## Dataset Exploration

### Movies Info

The movies dataset needs little attention: all three columns are relevant, and none has missing values.

In [3]:
print("\nGeneral information of the movies' dataset:")
print(movies_df.info())

print("------------------------------------------------------------------")

print("\nHead of the movies' dataset:")
print(movies_df.head())

print("------------------------------------------------------------------")

print("\nShape of the movies' dataset:", movies_df.shape)

print("------------------------------------------------------------------")

print("\nMissing values of the movies' dataset:")
print(movies_df.isnull().sum())


General information of the movies' dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB
None
------------------------------------------------------------------

Head of the movies' dataset:
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3            

### Ratings Info

In the ratings dataset, we don't plan to use the timestamp column.

In [4]:
print("\nGeneral information of the ratings' dataset:")
print(ratings_df.info())
# Timestamp unnecessary

print("------------------------------------------------------------------")

print("\nHead of the ratings' dataset:")
print(ratings_df.head())

print("------------------------------------------------------------------")

print("\nShape of the ratings' dataset:", ratings_df.shape)

print("------------------------------------------------------------------")

print("\nMissing values of the ratings' dataset:")
print(ratings_df.isnull().sum())


General information of the ratings' dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     500000 non-null  int64  
 1   movieId    500000 non-null  int64  
 2   rating     500000 non-null  float64
 3   timestamp  500000 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 15.3 MB
None
------------------------------------------------------------------

Head of the ratings' dataset:
   userId  movieId  rating   timestamp
0       1      296     5.0  1147880044
1       1      306     3.5  1147868817
2       1      307     5.0  1147868828
3       1      665     5.0  1147878820
4       1      899     3.5  1147868510
------------------------------------------------------------------

Shape of the ratings' dataset: (500000, 4)
------------------------------------------------------------------

Missing values of the ratings' 

### Tags Info

Once again, we'll likely drop the timestamp column. Surprisingly, the tag column also has 16 missing values. Since that is tiny compared to the 1,093,360 rows in the dataset, we'll simply drop those rows.

The bigger task for this dataset is cleaning the values in the tag column. Each tag is free-form user input, unlike the genres column in the movies dataset, which draws on a fixed set of labels. This has two consequences. First, different users can give a movie what is effectively the same tag, yet the strings won't match because of how each user spelled it, e.g. sci-fi vs. scifi. Second, some tags are heavily opinion-based, e.g. "so bad it's good". These may affect the final outcome of our model, since it will rely on the "genreness" of tags rather than on opinionated descriptions. After all, a tag like "so bad it's good" can apply to any type of movie.
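
One possible way to handle both issues is sketched below on a toy frame. This is only an illustration, not the notebook's final approach: `normalize_tag`, `OPINION_TAGS`, and the sample rows are hypothetical names and values introduced here. The idea is to lowercase each tag and strip everything except letters and digits so that spelling variants collapse to one key, then flag tags that appear in a hand-curated opinion list.

```python
import re

import pandas as pd

# Hypothetical stoplist of opinion-style tags (already normalized).
# A real list would have to be curated by inspecting the data.
OPINION_TAGS = {"sobaditsgood", "overrated", "boring"}


def normalize_tag(tag: str) -> str:
    """Lowercase a tag and remove every character that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]", "", tag.lower())


# Toy rows standing in for tags_df; the values are illustrative only.
toy_tags = pd.DataFrame(
    {
        "movieId": [260, 260, 260, 7569, 1732],
        "tag": ["sci-fi", "SciFi", "Sci Fi", "So bad it's good", "90's"],
    }
)

# Spelling variants of the same tag end up with the same normalized key.
toy_tags["tag_norm"] = toy_tags["tag"].map(normalize_tag)

# Mark tags that match the (assumed) opinion stoplist.
toy_tags["is_opinion"] = toy_tags["tag_norm"].isin(OPINION_TAGS)

print(toy_tags)
```

Stripping spaces and punctuation is aggressive: it also merges multi-word tags into a single token ("dark comedy" becomes "darkcomedy"), which is fine for matching but less readable, so it may be worth keeping the original tag column alongside the normalized one.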

In [5]:
print("\nGeneral information of the tags' dataset:")
print(tags_df.info())
# Timestamp unnecessary
# Tags are user-defined, so many equivalent tags are spelled differently,
# e.g. sci-fi vs scifi, 90's vs 90s, and Horror vs horror

print("------------------------------------------------------------------")

print("\nHead of the tags' dataset:")
print(tags_df.head())

print("------------------------------------------------------------------")

print("\nShape of the tags' dataset:", tags_df.shape)
print("------------------------------------------------------------------")

print("\nMissing values of the tags' dataset:")
print(tags_df.isnull().sum())
# 16 missing values

print("------------------------------------------------------------------")

print("\nNumber of unique tags in the tags' dataset:")
print(tags_df.tag.nunique())


General information of the tags' dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1093360 entries, 0 to 1093359
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   userId     1093360 non-null  int64 
 1   movieId    1093360 non-null  int64 
 2   tag        1093344 non-null  object
 3   timestamp  1093360 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 33.4+ MB
None
------------------------------------------------------------------

Head of the tags' dataset:
   userId  movieId               tag   timestamp
0       3      260           classic  1439472355
1       3      260            sci-fi  1439472256
2       4     1732       dark comedy  1573943598
3       4     1732    great dialogue  1573943604
4       4     7569  so bad it's good  1573943455
------------------------------------------------------------------

Shape of the tags' dataset: (1093360, 4)
---------------------------------------------

## Initial Data Preprocessing for Ratings and Tags

### Ratings Preprocessing

In [None]:
ratings_cleaned = ratings_df.drop(columns=['timestamp'])

### Tags Preprocessing

In [None]:
tags_cleaned = tags_df.drop(columns=['timestamp'])
tags_cleaned = tags_cleaned.dropna()
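
As a quick sanity check on the tags preprocessing, the sketch below repeats the same two steps on a toy frame and confirms that the number of rows removed equals the number of missing tags. The frame and the variable names (`toy_tags`, `toy_cleaned`) are made up for illustration. It uses `dropna(subset=["tag"])`, which restricts the check to the tag column; for the real dataset this should give the same result as a plain `dropna()`, assuming tag is the only column with missing values, as the exploration output suggests.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for tags_df; the values are illustrative only.
toy_tags = pd.DataFrame(
    {
        "userId": [3, 3, 4, 4],
        "movieId": [260, 260, 1732, 7569],
        "tag": ["classic", np.nan, "dark comedy", None],
        "timestamp": [1439472355, 1439472256, 1573943598, 1573943455],
    }
)

# Count missing tags before cleaning.
missing_before = int(toy_tags["tag"].isna().sum())

# Same steps as the preprocessing cell, but limited to the tag column.
toy_cleaned = toy_tags.drop(columns=["timestamp"]).dropna(subset=["tag"])

# The number of dropped rows should match the number of missing tags.
rows_dropped = len(toy_tags) - len(toy_cleaned)
print(f"missing tags: {missing_before}, rows dropped: {rows_dropped}")
```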