![title_pic](./img/title_page.png)
#### TV & Movie recommendation system using a collaborative and content-based filtering approach

# Data Understanding
The data used in the Film Finder project comes from the Amazon Review Data (2018) dataset, found [here](https://nijianmo.github.io/amazon/index.html), which was originally sourced by Jianmo Ni (UCSD) for the paper [“Justifying recommendations using distantly-labeled reviews and fine-grained aspects”](https://cseweb.ucsd.edu//~jmcauley/pdfs/emnlp19a.pdf).

The raw files are stored as compressed JSON files, with each JSON entry on its own line. To access the larger files, such as the 'Movies and TV' reviews and metadata files used in this project, you need to fill out a Google Form linked from the source site.
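
Because every line of a compressed file is a standalone JSON object, the files can be streamed record by record instead of being loaded whole. The following is a minimal sketch using only the standard library; the function name `iter_json_lines` and the path in the usage comment are illustrative assumptions, not part of the project code.

```python
import gzip
import json


def iter_json_lines(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Hypothetical usage: peek at the first record without reading the whole file.
# first = next(iter_json_lines("./data/meta_Movies_and_TV.json.gz"))
```

Streaming like this is handy for inspecting the schema of a multi-gigabyte file before committing to a full conversion.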

The dataset spans May 1996 to October 2018. Our analysis focuses on the Movies and TV category, which comprises 8,765,568 reviews and metadata for 203,970 products.

The dataset offers user reviews and metadata for various films and TV shows, allowing us to better understand user preferences and industry trends.

### Key dataset features:
**overall**: User-assigned ratings from 1 to 5.

**verified**: Indicates if the user actually purchased or streamed the product.

**reviewerID**: Unique Amazon user ID.

**asin**: Unique Amazon product identification number.

**style**: Format of the movie or TV show (e.g., streaming, DVD, VHS, etc.).

**description**: Synopsis/summary of the movie or TV show.

**reviewText**: User's review text.

**brand**: Starring or leading role of the film.

### Dataset limitations:
- The category feature, with genre and sub-genre details, has ambiguous labels, complicating recommendation accuracy and trend analysis.
- The description feature contains missing or unusable information in at least a third of entries, which restricts accurate suggestions based on plot summaries.
- The brand feature, despite listing the leading role, lacks comprehensive cast information, potentially limiting actor/actress-based recommendations.
- The dataset doesn't extend beyond 2018, limiting insights into recent viewership trends.

# Data Preparation

### Initial Dataset Access
Access the review and metadata files, which are stored as compressed JSON files with one entry per line. For larger files such as the 'Movies and TV' reviews and metadata, request access through the Google Form linked from the source site.

### Data Conversion
Uncompress the JSON files and convert them to CSV using Pandas. Perform data cleaning using Pandas, NumPy, and Python's [ast (Abstract Syntax Trees)](https://docs.python.org/3/library/ast.html) module.

### EDA - Part 1: Metadata
**Remove unused features**: Discard unnecessary features for collaborative and content-based filtering, focusing on null or redundant information.

**Remove unused main_cat**: Discard all entries whose main category is not 'Movies & TV', then discard the feature.

**Extract genres**: Parse category to acquire specific genres/subgenres for movies/TV shows.

**Extract leading roles**: Analyze brand to obtain leading roles in films.

**Preprocess descriptions**: Clean description, isolating valuable information, and remove duplicates for text preprocessing and vectorization efficiency in the content-based filtering system.

### EDA - Part 2: User Review Data
**Remove unused features**: Discard unnecessary features for collaborative and content-based filtering, focusing on null or redundant information.

**Extract video content formats**: Parse format to obtain the top 4 formats, excluding VHS for relevancy.

**Keep verified reviews only**: Eliminate unverified reviews to maintain data validity.

**Remove duplicate reviews**: Discard duplicates based on asin and reviewerID to avoid bias.

**Match review with metadata**: Using cleaned metadata dataframe, remove review entries with movie/TV show IDs not found in the cleaned metadata dataframe.

**Filter by user review count**: Remove user IDs with fewer than 5 reviews to reduce dataset size and enhance model effectiveness. The threshold of 5 was chosen after comparing it against thresholds of 3, 4, and 6, based on the prediction score of the BaselineOnly() model from Surprise, a Python scikit for recommender systems. [(Surprise Documentation found here)](https://surprise.readthedocs.io/en/stable/)
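
The effect of a candidate review-count threshold on dataset size can be checked before any model is scored. The sketch below only counts surviving rows and users; it does not reproduce the BaselineOnly() comparison, and the helper names and toy data are assumptions made for illustration.

```python
import pandas as pd


def filter_by_min_reviews(reviews, min_reviews, user_col="user_id"):
    """Keep only rows whose user has at least `min_reviews` reviews."""
    counts = reviews[user_col].value_counts()
    keep = counts[counts >= min_reviews].index
    return reviews[reviews[user_col].isin(keep)]


def summarize_thresholds(reviews, thresholds, user_col="user_id"):
    """Report how many rows and users survive each candidate threshold."""
    rows = []
    for threshold in thresholds:
        kept = filter_by_min_reviews(reviews, threshold, user_col)
        rows.append({
            "min_reviews": threshold,
            "rows": len(kept),
            "users": kept[user_col].nunique(),
        })
    return pd.DataFrame(rows)


# Toy data (made up): user "a" has 3 reviews, "b" has 2, "c" has 1.
toy = pd.DataFrame({"user_id": list("aaabbc"), "rating": [5, 4, 3, 5, 2, 1]})
summary = summarize_thresholds(toy, thresholds=[1, 2, 3])
```

A table like `summary` makes the trade-off visible: higher thresholds shrink the data but leave users with denser rating histories.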

In [265]:
import pandas as pd
import numpy as np
import ast

# increasing display to view large descriptions and reviewText
pd.set_option('display.max_colwidth', None)


import warnings
warnings.filterwarnings('ignore')

### Dataset conversion: JSON to CSV
[data source](https://nijianmo.github.io/amazon/index.html)

In [3]:
# reviews = pd.read_json('./data/Movies_and_TV.json.gz', compression='gzip', lines=True)
# reviews.to_csv('./data/reviews.csv', encoding='utf-8', index=False)

# meta = pd.read_json('./data/meta_Movies_and_TV.json.gz', compression='gzip', lines=True)
# meta.to_csv('./data/reviews_meta.csv', encoding='utf-8', index=False)

# Part 1: Content Based Dataframes

In [197]:
meta_df = pd.read_csv('./data/reviews_meta.csv')
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203766 entries, 0 to 203765
Data columns (total 19 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   category         203766 non-null  object 
 1   tech1            6 non-null       object 
 2   description      203766 non-null  object 
 3   fit              0 non-null       float64
 4   title            203707 non-null  object 
 5   also_buy         203766 non-null  object 
 6   tech2            0 non-null       float64
 7   brand            137335 non-null  object 
 8   feature          203766 non-null  object 
 9   rank             203766 non-null  object 
 10  also_view        203766 non-null  object 
 11  main_cat         203756 non-null  object 
 12  similar_item     0 non-null       float64
 13  date             38 non-null      object 
 14  price            110745 non-null  object 
 15  asin             203766 non-null  object 
 16  imageURL         203766 non-null  obje

#### Dropping mostly null columns and columns unnecessary for modeling

In [198]:
meta_df.rename(columns={'category': 'genre', 'brand': 'starring', 'asin': 'movie_id'}, inplace=True)
meta_df.drop_duplicates(subset='movie_id', inplace=True)
meta_df = meta_df[meta_df['main_cat'] == 'Movies & TV']
meta_df.drop(['tech1', 'fit', 'tech2', 'similar_item',
         'date', 'price', 'imageURL', 'imageURLHighRes',
        'also_buy', 'also_view', 'feature', 'rank',
          'main_cat', 'details'], axis=1, inplace=True)

meta_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 181552 entries, 0 to 203765
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   genre        181552 non-null  object
 1   description  181552 non-null  object
 2   title        181494 non-null  object
 3   starring     120607 non-null  object
 4   movie_id     181552 non-null  object
dtypes: object(5)
memory usage: 8.3+ MB


#### Reformatting column types

In [199]:

convert = ['.', '\n', '-', '--', 'Na',
           'BRIDGESTONE MULTIMEDIA', '*', 'none',
           'na', 'N/a', 'VARIOUS', 'Artist Not Provided',
           'Sinister Cinema', 'Learn more', 'Various', 'various',
           'The Ambient Collection', 'Animation', 'Standard Deviants',
          'Animated']

meta_df['starring'] = meta_df['starring'].apply(lambda x: 'Various Artists' if isinstance(x, str) and (x in convert or '\n' in x) else x)

meta_df['starring'] = meta_df['starring'].fillna('Various Artists')
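
The placeholder handling above can also be written as a single plain function, which makes the rule easy to test in isolation. This is a sketch under assumptions: the name `normalize_starring`, the shortened placeholder set, and the sample values are illustrative only.

```python
def normalize_starring(value, placeholders, fallback="Various Artists"):
    """Return `fallback` for missing, placeholder, or multi-line values."""
    if not isinstance(value, str):  # NaN / None from missing data
        return fallback
    if value in placeholders or "\n" in value:
        return fallback
    return value


# A small subset of the placeholder list, for illustration only.
placeholders = {".", "-", "na", "Various", "VARIOUS"}
samples = ["Some Actor", "VARIOUS", None, "Line one\nLine two"]
cleaned = [normalize_starring(v, placeholders) for v in samples]
```

With a DataFrame, the same function could be applied with `meta_df['starring'].apply(normalize_starring, placeholders=set(convert))`.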

In [200]:
# parse the stringified category lists, e.g. "['Movies & TV', 'Drama']"
meta_df['genre'] = meta_df['genre'].apply(ast.literal_eval)
# drop the redundant 'Movies & TV' root label
meta_df['genre'] = [x[1:] if len(x) > 1 and x[0] == 'Movies & TV' else x for x in meta_df['genre']]
meta_df = meta_df[~meta_df['genre'].apply(lambda x: 'Exercise & Fitness' in x)]
# for 'Art House & International' entries, keep the root label plus the third label
meta_df.loc[meta_df['genre'].apply(lambda x: isinstance(x, list) and len(x) > 2 and x[0] == 'Art House & International'), 'genre'] = meta_df['genre'].apply(lambda x: [x[0] + ' ' + x[2]] if len(x) > 2 else x)
meta_df['genre'] = meta_df['genre'].apply(lambda x: x[:1] + x[2:] if isinstance(x, list) and len(x) > 2 and x[0] == 'Art House & International' else x)
# collapse each list into one space-separated label; empty labels become 'Unknown'
meta_df['genre'] = meta_df['genre'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)
meta_df['genre'] = meta_df['genre'].replace({'': 'Unknown'})
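
The core of the genre parsing in this cell can be illustrated on a single value. The sketch below is a simplification under stated assumptions: the function name `parse_genre` is hypothetical, and it leaves out the 'Exercise & Fitness' filter and the 'Art House & International' special case.

```python
import ast


def parse_genre(raw, root="Movies & TV"):
    """Turn a stringified category list into a single space-joined label."""
    labels = ast.literal_eval(raw)          # "['a', 'b']" -> ['a', 'b']
    if len(labels) > 1 and labels[0] == root:
        labels = labels[1:]                 # drop the redundant root category
    joined = " ".join(labels)
    return joined if joined else "Unknown"  # empty lists become 'Unknown'
```

Because the sub-labels are joined into one string, a value such as `"['Movies & TV', 'Boxed Sets', 'Comedy']"` becomes `'Boxed Sets Comedy'`, which is why later cells strip non-genre substrings and consolidate labels with regular expressions.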

In [296]:
remove_genre_str = ['Genre for Featured Categories', 'Blu-ray Movies',
                    'Studio Specials Universal Studios Home Entertainment All Universal Studios Titles',
                    'Studio Specials Sony Pictures Home Entertainment All Sony Pictures Titles',
                    'Studio Specials Lionsgate Home Entertainment All Lionsgate Titles',
                    'Studio Specials Warner Home Video All Titles', 'Independently Distributed',
                    'Studio Specials 20th Century Fox Home Entertainment All Fox Titles',
                    'A&E Home Video All A&E Titles', 'Studio Specials MGM Home Entertainment All MGM Titles',
                    'MOD CreateSpace Video', 'BBC All BBC Titles', 'Independently Distributed',
                    'Boxed Sets', 'Blu-ray TV', 'Fully Loaded DVDs DTS', 'HBO All HBO Titles',
                    'Studio Specials 20th Century Fox Home Entertainment Action General',
                    'Studio Specials 20th Century Fox Home Entertainment Fox TV General',
                    'Walt Disney Studios Home Entertainment All Disney Titles',
                    'Paramount Home Entertainment', 'Fully Loaded DVDs Special Editions',
                    'Studio Specials Miramax Home Entertainment All Titles'
                   ] 

In [None]:
def remove_substrings(s, word_list):
    for word in word_list:
        s = s.replace(word, '')
    return s

mask = ~meta_df['genre'].isin(remove_genre_str)
meta_df = meta_df[mask]
meta_df['genre'] = meta_df['genre'].apply(remove_substrings, word_list=remove_genre_str).str.strip()
len(meta_df['genre'].value_counts())

In [202]:
# saving in case there is time after modeling to recover more entries with unknown or...
# ... ambiguous genres

# meta_df.to_csv('./data/desc_all_genre.csv', encoding='utf-8', index=False)
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 176224 entries, 0 to 203765
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   genre        176224 non-null  object
 1   description  176224 non-null  object
 2   title        176166 non-null  object
 3   starring     176224 non-null  object
 4   movie_id     176224 non-null  object
dtypes: object(5)
memory usage: 12.1+ MB


In [203]:
meta_df.dropna(subset='title', inplace=True)
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 176166 entries, 0 to 203765
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   genre        176166 non-null  object
 1   description  176166 non-null  object
 2   title        176166 non-null  object
 3   starring     176166 non-null  object
 4   movie_id     176166 non-null  object
dtypes: object(5)
memory usage: 8.1+ MB


In [204]:
len(meta_df['genre'].value_counts())

627

In [205]:
meta_df['genre'] = meta_df['genre'].str.replace('.*Christmas.*', 'Christmas', regex=True)
len(meta_df['genre'].value_counts())

620

In [206]:
meta_df['genre'] = meta_df['genre'].str.replace('.*Anime.*', 'Anime', regex=True)
len(meta_df['genre'].value_counts())

618

In [207]:
meta_df['genre'] = meta_df['genre'].str.replace('.*Animation.*', 'Animation', regex=True)
len(meta_df['genre'].value_counts())

606

In [208]:
meta_df['genre'] = meta_df['genre'].str.replace('.*Reality TV.*', 'Reality TV', regex=True)
len(meta_df['genre'].value_counts())

605

In [209]:
meta_df['genre'] = meta_df['genre'].str.replace('.*Musicals.*', 'Musicals & Performing Arts', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*Performing Arts.*', 'Musicals & Performing Arts', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*Music Video.*', 'Music Videos & Concerts', regex=True)
len(meta_df['genre'].value_counts())

593

In [210]:
meta_df['genre'] = meta_df['genre'].str.replace('.*Art House.*', 'Art House & International', regex=True)
len(meta_df['genre'].value_counts())

532

In [211]:
meta_df['genre'] = meta_df['genre'].str.replace('.*Science Fiction & Fantasy.*', 'Science Fiction & Fantasy', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*Sci-Fi & Fantasy.*', 'Science Fiction & Fantasy', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*Science Fiction.*', 'Science Fiction & Fantasy', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*Fantasy.*', 'Science Fiction & Fantasy', regex=True)
len(meta_df['genre'].value_counts())

512

In [212]:
meta_df['genre'] = meta_df['genre'].str.replace('Action', 'Action & Adventure', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*Action & Adventure.*', 'Action & Adventure', regex=True)
len(meta_df['genre'].value_counts())

498

In [213]:
meta_df['genre'] = meta_df['genre'].str.replace('.*Christian Movies & TV.*', 'Faith & Spirituality', regex=True)
len(meta_df['genre'].value_counts())

497

In [214]:
meta_df['genre'] = meta_df['genre'].str.replace('.*Mystery.*', 'Mystery & Thrillers', regex=True)
len(meta_df['genre'].value_counts())

494

In [215]:
meta_df['genre'] = meta_df['genre'].str.replace('.*News.*', 'News', regex=True)
len(meta_df['genre'].value_counts())

485

In [216]:
meta_df['genre'] = meta_df['genre'].str.replace('.*John Wayne.*', 'Westerns', regex=True)
len(meta_df['genre'].value_counts())

484

In [217]:
meta_df['genre'] = meta_df['genre'].str.replace('.*Disney.*', 'Disney Studios', regex=True)
len(meta_df['genre'].value_counts())

470

In [218]:
meta_df['genre'] = meta_df['genre'].str.replace('.*Classics.*', 'Classics & Silent Film', regex=True)
len(meta_df['genre'].value_counts())

455
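
The consolidation cells above all follow one pattern: replace any label that contains a keyword with a single canonical label. One way to express that pattern is an ordered list of rules applied in a loop. This is a sketch, not the notebook's code; `GENRE_RULES` holds only the first four rules and `consolidate_genres` is a hypothetical helper name.

```python
import pandas as pd

# Ordered (pattern, replacement) pairs; order matters because earlier
# replacements change what later patterns can match.
GENRE_RULES = [
    (r".*Christmas.*", "Christmas"),
    (r".*Anime.*", "Anime"),
    (r".*Animation.*", "Animation"),
    (r".*Reality TV.*", "Reality TV"),
]


def consolidate_genres(genres, rules):
    """Apply each regex rule in order and return the relabelled Series."""
    for pattern, replacement in rules:
        genres = genres.str.replace(pattern, replacement, regex=True)
    return genres


# Toy labels, made up for illustration.
toy = pd.Series(["Holidays Christmas Kids", "Anime Action", "Kids Animation", "Drama"])
result = consolidate_genres(toy, GENRE_RULES)
```

Keeping the rules in one list makes their order explicit, which matters here: a label containing both 'Anime' and 'Action' is claimed by whichever rule runs first.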

In [219]:
meta_df = meta_df[meta_df['genre'].isin(meta_df['genre'].value_counts()[meta_df['genre'].value_counts() >= 210].index)]
len(meta_df['genre'].value_counts())

47

In [220]:
not_descriptive = ['Movies','Sony Pictures Home Entertainment','Warner Home Video',
                   'TV','All Lionsgate Titles','MOD CreateSpace Video',
                   'A&E Home Video All A&E Titles','All Fox Titles',
                   'All Universal Studios Titles','MGM Home Entertainment All MGM Titles',
                   'BBC','DTS','HBO','Fox TV','Unknown','Special Editions',
                   'Television','Miramax Home Entertainment','PBS'
                  ]

In [221]:
meta_df = meta_df[~meta_df['genre'].isin(not_descriptive)]
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 108901 entries, 8 to 203764
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   genre        108901 non-null  object
 1   description  108901 non-null  object
 2   title        108901 non-null  object
 3   starring     108901 non-null  object
 4   movie_id     108901 non-null  object
dtypes: object(5)
memory usage: 5.0+ MB


In [222]:
len(meta_df['genre'].value_counts())

28

In [223]:
# copying dataframe to hold on to more movie_ids...
# ...that will be used for my collaborative filtering model
# note: make sure shape is (108901, 5)
col_meta = meta_df.copy()
col_meta.info()

<class 'pandas.core.frame.DataFrame'>
Index: 108901 entries, 8 to 203764
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   genre        108901 non-null  object
 1   description  108901 non-null  object
 2   title        108901 non-null  object
 3   starring     108901 non-null  object
 4   movie_id     108901 non-null  object
dtypes: object(5)
memory usage: 5.0+ MB


In [283]:
meta_df['description'] = meta_df['description'].apply(lambda x: " ".join(ast.literal_eval(x)).strip())
meta_df.loc[meta_df['description'].str.len() < 200, 'description'] = np.nan
meta_df.replace({'description': "###############################################################################################################################################################################################################################################################"}, np.nan, inplace=True)
meta_df.dropna(subset=['description'], inplace=True)
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 80387 entries, 10 to 203764
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   genre        80387 non-null  object
 1   description  80387 non-null  object
 2   title        80387 non-null  object
 3   starring     80387 non-null  object
 4   movie_id     80387 non-null  object
dtypes: object(5)
memory usage: 3.7+ MB
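
The description cleaning step can be sketched for a single value as follows. The function name `clean_description` is an assumption, and the sketch omits the check for the placeholder string made of '#' characters.

```python
import ast


def clean_description(raw, min_len=200):
    """Join a stringified list of description fragments; None if too short."""
    parts = ast.literal_eval(raw)            # "['a', 'b']" -> ['a', 'b']
    text = " ".join(parts).strip()
    return text if len(text) >= min_len else None
```

Note that the parsing happens once: after joining, the value is ordinary prose and is no longer a valid Python literal, so passing it to `ast.literal_eval` a second time would generally raise an error.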


In [284]:
meta_df.to_csv('./data/desc_filtered_genre.csv', encoding='utf-8', index=False)

_____

# Part 2: Collaborative Filtering Dataframes

In [285]:
df_collab = pd.read_csv('./data/reviews.csv')
df_collab.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8765568 entries, 0 to 8765567
Data columns (total 12 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   overall         int64 
 1   verified        bool  
 2   reviewTime      object
 3   reviewerID      object
 4   asin            object
 5   style           object
 6   reviewerName    object
 7   reviewText      object
 8   summary         object
 9   unixReviewTime  int64 
 10  vote            object
 11  image           object
dtypes: bool(1), int64(2), object(9)
memory usage: 744.0+ MB


In [286]:
df_collab.drop(['image', 'reviewTime', 'reviewerName', 'summary',
              'vote', 'unixReviewTime'], axis=1, inplace=True)

df_collab.rename(columns={'overall': 'rating', 'asin': 'movie_id',
                          'reviewerID': 'user_id','reviewText':'reviews'}, inplace=True)
df_collab.head()

|   | rating | verified | user_id | movie_id | style | reviews |
|---|---|---|---|---|---|---|
| 0 | 5 | True | A3478QRKQDOPQ2 | 1527665 | {'Format:': ' VHS Tape'} | really happy they got evangelised .. spoiler alert==happy ending liked that..since started bit worrisome... but yeah great stories these missionary movies, really short only half hour but still great |
| 1 | 5 | True | A2VHSG6TZHU1OB | 1527665 | {'Format:': ' Amazon Video'} | Having lived in West New Guinea (Papua) during the time period covered in this video, it is realistic, accurate, and conveys well the entrance of light and truth into a culture that was for centuries dead to and alienated from God. |
| 2 | 5 | False | A23EJWOW1TLENE | 1527665 | {'Format:': ' Amazon Video'} | Excellent look into contextualizing the Gospel and God's sovereignty over cultural barriers. The book and movie are both captivating. I would definitely recommend to both Christians and non-believers. |
| 3 | 5 | True | A1KM9FNEJ8Q171 | 1527665 | {'Format:': ' Amazon Video'} | More than anything, I've been challenged to find ways to share Christ is a culturally relevant way to those around me. Peace child is a cherished "how to" for me to do that. |
| 4 | 4 | True | A38LY2SSHVHRYB | 1527665 | {'Format:': ' Amazon Video'} | This is a great movie for a missionary going into a foreign country, especially one that is not used to foreign presence. But, it was a little on the short side. |


In [287]:
rm_format = df_collab['style'].apply(lambda x: isinstance(x, str) and "{'Format:" in x)
df_collab = df_collab.loc[rm_format]

df_collab['style'] = df_collab['style'].apply(lambda x: ast.literal_eval(x))
df_collab['style'] = df_collab['style'].apply(lambda x: x['Format:'])
df_collab['style'] = df_collab['style'].astype(str)

df_collab = df_collab[df_collab['style'].isin(df_collab['style'].value_counts()[df_collab['style'].value_counts() >= 200000].index)]
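
The `style` column stores a stringified dict, so extracting the format takes a parse and a key lookup. Below is a minimal standalone sketch; `extract_format` is a hypothetical helper, the sample values are made up, and unlike the cell above it also strips the leading space from the format name.

```python
import ast
from collections import Counter


def extract_format(raw):
    """Return the 'Format:' value from a stringified style dict, or None."""
    if not isinstance(raw, str) or "{'Format:" not in raw:
        return None
    return ast.literal_eval(raw)["Format:"].strip()


# Toy values (made up), including a missing entry and a dict without a format.
styles = [
    "{'Format:': ' VHS Tape'}",
    "{'Format:': ' Amazon Video'}",
    "{'Format:': ' Amazon Video'}",
    None,
    "{'Color:': ' Black'}",
]
formats = Counter(f for f in map(extract_format, styles) if f is not None)
```

Counting the extracted formats first shows which ones are common enough to keep before a frequency cutoff is applied.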

In [288]:
df_collab = df_collab[df_collab['verified']]
df_collab.drop(columns = ['verified', 'style'], inplace=True)

In [289]:
df_collab.dropna(subset=['reviews'], inplace=True)
df_collab.drop_duplicates(subset=['user_id', 'movie_id'], keep='first', inplace=True)
df_collab.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6212819 entries, 1 to 8765561
Data columns (total 4 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   rating    int64 
 1   user_id   object
 2   movie_id  object
 3   reviews   object
dtypes: int64(1), object(3)
memory usage: 237.0+ MB


In [290]:
all_vid = df_collab['movie_id'].unique().tolist()
col_meta = col_meta[col_meta['movie_id'].isin(all_vid)]

In [291]:
all_vid2 = col_meta['movie_id'].unique().tolist()
df_collab = df_collab[df_collab['movie_id'].isin(all_vid2)]
df_collab = df_collab[df_collab['user_id'].isin(df_collab['user_id'].value_counts()[df_collab['user_id'].value_counts() >= 5].index)]

In [292]:
all_vid3 = df_collab['movie_id'].unique().tolist()
col_meta = col_meta[col_meta['movie_id'].isin(all_vid3)]
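
The last three cells keep the review and metadata frames in sync: reviews are limited to known titles, inactive users are dropped, and the metadata is then trimmed to titles that still have reviews. The same sequence can be sketched as one function; the name `sync_reviews_and_meta` and the toy frames are assumptions for illustration.

```python
import pandas as pd


def sync_reviews_and_meta(reviews, meta, min_reviews=5):
    """Keep reviews for known titles from active users, then trim meta to match."""
    reviews = reviews[reviews["movie_id"].isin(meta["movie_id"])]
    counts = reviews["user_id"].value_counts()
    active = counts[counts >= min_reviews].index
    reviews = reviews[reviews["user_id"].isin(active)]
    meta = meta[meta["movie_id"].isin(reviews["movie_id"])]
    return reviews, meta


# Toy frames (made up): "mX" has no metadata and user "u2" has a single review.
toy_reviews = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2"],
    "movie_id": ["m1", "m2", "mX", "m3"],
})
toy_meta = pd.DataFrame({"movie_id": ["m1", "m2", "m3", "m4"]})
kept_reviews, kept_meta = sync_reviews_and_meta(toy_reviews, toy_meta, min_reviews=2)
```

The metadata has to be trimmed last because dropping inactive users can remove every review of some titles.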

In [293]:
df_collab.info()

<class 'pandas.core.frame.DataFrame'>
Index: 694899 entries, 71 to 8765560
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   rating    694899 non-null  int64 
 1   user_id   694899 non-null  object
 2   movie_id  694899 non-null  object
 3   reviews   694899 non-null  object
dtypes: int64(1), object(3)
memory usage: 26.5+ MB


In [294]:
col_meta.info()

<class 'pandas.core.frame.DataFrame'>
Index: 57348 entries, 22 to 203763
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   genre        57348 non-null  object
 1   description  57348 non-null  object
 2   title        57348 non-null  object
 3   starring     57348 non-null  object
 4   movie_id     57348 non-null  object
dtypes: object(5)
memory usage: 2.6+ MB


In [295]:
col_meta['genre'].value_counts()

genre
Drama                         8069
Action & Adventure            7109
Documentary                   6624
Comedy                        6076
Kids & Family                 4014
Musicals & Performing Arts    3418
Science Fiction & Fantasy     2824
Special Interests             2532
Animation                     2216
Horror                        2208
Art House & International     1877
Anime                         1829
Sports                        1637
Music Videos & Concerts       1247
Mystery & Thrillers            908
Christmas                      732
Foreign Films                  727
Classics & Silent Film         713
Westerns                       638
Disney Studios                 535
Romance                        309
Faith & Spirituality           302
Reality TV                     166
Military & War                 156
3-6 Years                      145
LGBT                           144
Educational                    118
News                            75
Name: count, dtype: int64

In [47]:
df_collab.to_csv('./data/collab_model_revs.csv',  encoding='utf-8', index=False)
col_meta.to_csv('./data/collab_meta.csv',  encoding='utf-8', index=False)