![title_pic](./img/title_page.png)
#### TV & Movie recommendation system using a collaborative and content-based filtering approach

# Project Overview
The main goal of this project is to develop a hybrid movie/TV recommendation system that combines collaborative filtering and content-based filtering to suggest new content to users. These techniques are often applied independently; this project aims to harness their combined potential.

**Collaborative Filtering**: Analyzes existing user profiles to discover shared preferences and recommend new content based on similarities.

**Content-Based Filtering**: Suggests new content with features similar to those of the movie/TV show that you input.


# Business Understanding
As streaming platforms accumulate content, users struggle to pinpoint films or shows that align with their tastes. Bias in platform algorithms exacerbates this challenge, making it harder for users to rely on platform recommendations. Biases emerge from factors like skewed user preferences, popularity bias, or even the platform's promotional agenda. As a result, recommended content may not cater to users' unique tastes, negatively affecting the overall user experience.

Streaming platforms stand to gain from implementing an unbiased hybrid recommendation system that blends content-based and collaborative filtering techniques. This approach leverages the best of both methods, increasing reliability and personalization while mitigating biases. The content-based technique analyzes features like genre and content description, while collaborative filtering harnesses the collective trends of user ratings. Together, they forge a powerful recommendation engine, enhancing user satisfaction and overall experience.


# Data Understanding
The data used in the Film Finder project comes from the Amazon Review Data (2018) dataset found [here](https://nijianmo.github.io/amazon/index.html), which was originally sourced by Jianmo Ni, UCSD for their paper [“Justifying recommendations using distantly-labeled reviews and fine-grained aspects”](https://cseweb.ucsd.edu//~jmcauley/pdfs/emnlp19a.pdf).

The raw files are stored as compressed JSON files, with each JSON entry on its own line. To access the larger files, such as the 'Movies and TV' reviews and metadata files used in this project, you need to fill out a Google Form linked from the source site.
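As a minimal sketch of that line-delimited layout, the snippet below writes two made-up records to a gzip file and reads them back one line at a time using only the standard library. The file name `sample.json.gz`, the helper `iter_json_lines`, and the record values are hypothetical and exist only for illustration; the notebook itself loads the real files with pandas.

```python
import gzip
import json
import os
import tempfile


def iter_json_lines(path):
    """Yield one parsed record per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Made-up stand-in rows so the example runs without the real dataset.
sample_rows = [
    {"asin": "A1", "overall": 5.0, "verified": True},
    {"asin": "B2", "overall": 3.0, "verified": False},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.json.gz")
    # Write one JSON object per line, which is the layout described above.
    with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
        for row in sample_rows:
            handle.write(json.dumps(row) + "\n")
    records = list(iter_json_lines(sample_path))

print(len(records), records[0]["asin"])
```

Reading record by record like this keeps memory use flat, which may matter for the multi-million-row review file.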

This dataset spans from May 1996 to October 2018, with our analysis focusing on the Movies and TV category, comprising 8,765,568 reviews and metadata for 203,970 products.

The dataset offers user reviews and metadata for various films and TV shows, allowing us to better understand user preferences and industry trends.

### Key dataset features:
**overall**: User-assigned ratings from 1 to 5.

**verified**: Indicates if the user actually purchased or streamed the product.

**reviewerID**: Unique Amazon user ID.

**asin**: Unique Amazon product identification number.

**style**: Format of the movie or TV show (e.g., streaming, DVD, VHS, etc.).

**description**: Synopsis/summary of the movie or TV show.

**reviewText**: User's review text.

**brand**: Starring or leading role of the film.

### Dataset limitations:
- The category feature, with genre and sub-genre details, has ambiguous labels, complicating recommendation accuracy and trend analysis.
- The description feature contains missing or unusable information in at least a third of entries, which restricts accurate suggestions based on plot summaries.
- The brand feature, despite listing the leading role, lacks comprehensive cast information, potentially limiting actor/actress-based recommendations.
- The dataset doesn't extend beyond 2018, limiting insights into recent viewership trends.

# Data Preparation

### Initial Dataset Access
Access review and metadata files, which are stored as compressed JSON files with separate entries on each line. For larger files like 'Movies and TV' reviews and metadata, fill out a Google Form provided by their respective links.

![data_source](./img/data_source.png)
![data_survey](./img/data_survey.png)

### Data Conversion
Uncompress JSON files and convert them using Pandas. 

![conversion](./img/json_convert.png)

### Data Cleaning
Perform data cleaning using pandas, NumPy, and Python's [ast (Abstract Syntax Trees)](https://docs.python.org/3/library/ast.html) module.

### EDA - Part 1: Metadata
**Remove unused features**: Discard unnecessary features for collaborative and content-based filtering, focusing on null or redundant information.

**Remove unused main_cat**: Discard all entries whose main category is not labeled 'Movies & TV', and then discard the feature.

**Extract genres**: Parse category to acquire specific genres/subgenres for movies/TV shows.

**Extract leading roles**: Analyze brand to obtain leading roles in films.

**Preprocess descriptions**: Clean description, isolating valuable information, and remove duplicates for text preprocessing and vectorization efficiency in the content-based filtering system.

### EDA - Part 2: User Review Data
**Remove unused features**: Discard unnecessary features for collaborative and content-based filtering, focusing on null or redundant information.

**Extract video content formats**: Parse format to obtain the top 4 formats, excluding VHS for relevancy.

**Keep verified reviews only**: Eliminate unverified reviews to maintain data validity.

**Remove duplicate reviews**: Discard duplicates based on asin and reviewerID to avoid bias.

**Match review with metadata**: Using cleaned metadata dataframe, remove review entries with movie/TV show IDs not found in the cleaned metadata dataframe.

**Filter by user review count**: Remove user IDs with fewer than 5 reviews to reduce dataset size and enhance model effectiveness. The count of 5 was chosen after comparing it against counts of 3, 4, and 6, based on the prediction score of the BaselineOnly() model from Surprise, a Python scikit for recommender systems. [(Surprise documentation found here)](https://surprise.readthedocs.io/en/stable/)
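The duplicate-removal and review-count steps above can be sketched on plain Python records as follows. This is an illustrative, dependency-free sketch rather than the notebook's pandas code; the helper name `filter_reviews`, its `min_reviews` parameter, and the toy data are assumptions made for this example.

```python
from collections import Counter


def filter_reviews(reviews, min_reviews=5):
    """Drop repeated (reviewerID, asin) pairs, then keep users with at least min_reviews reviews."""
    seen = set()
    unique = []
    for review in reviews:
        key = (review["reviewerID"], review["asin"])
        if key not in seen:  # keep only the first review per user/product pair
            seen.add(key)
            unique.append(review)
    counts = Counter(review["reviewerID"] for review in unique)
    return [review for review in unique if counts[review["reviewerID"]] >= min_reviews]


# Toy data: u1 has five distinct reviews plus one duplicate, u2 has only two reviews.
toy = [{"reviewerID": "u1", "asin": f"m{i}", "overall": 5} for i in range(5)]
toy.append({"reviewerID": "u1", "asin": "m0", "overall": 4})
toy += [{"reviewerID": "u2", "asin": f"m{i}", "overall": 3} for i in range(2)]

kept = filter_reviews(toy, min_reviews=5)
print(len(kept), {review["reviewerID"] for review in kept})
```

Deduplicating before counting matters: otherwise a user could reach the threshold by reviewing the same title repeatedly.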

# Part 1: Meta Content

**sidenote**: The metadata is cleaned first so that every video used in the collaborative model has at least a title and a description.

### Importing all packages used for data prep

In [1]:
import pandas as pd
import numpy as np
import ast

from bs4 import BeautifulSoup

from nltk.tokenize import RegexpTokenizer

pd.set_option('display.max_colwidth', None)

import warnings
warnings.filterwarnings('ignore')

In [2]:
# Converting the original gzipped JSON files to CSV files
reviews_csv = pd.read_json('./data_original/Movies_and_TV.json.gz', compression='gzip', lines=True)
reviews_csv.to_csv('./data_original/reviews.csv', encoding='utf-8', index=False)

meta_csv = pd.read_json('./data_original/meta_Movies_and_TV.json.gz', compression='gzip', lines=True)
meta_csv.to_csv('./data_original/reviews_meta.csv', encoding='utf-8', index=False)

In [3]:
meta_df = pd.read_csv('./data/reviews_meta.csv')
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203766 entries, 0 to 203765
Data columns (total 19 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   category         203766 non-null  object 
 1   tech1            6 non-null       object 
 2   description      203766 non-null  object 
 3   fit              0 non-null       float64
 4   title            203707 non-null  object 
 5   also_buy         203766 non-null  object 
 6   tech2            0 non-null       float64
 7   brand            137335 non-null  object 
 8   feature          203766 non-null  object 
 9   rank             203766 non-null  object 
 10  also_view        203766 non-null  object 
 11  main_cat         203756 non-null  object 
 12  similar_item     0 non-null       float64
 13  date             38 non-null      object 
 14  price            110745 non-null  object 
 15  asin             203766 non-null  object 
 16  imageURL         203766 non-null  obje

### Renaming Columns and Removing Unnecessary/Unusable:

**Renamed Columns**:
- 'category' --> 'genre': contains information about the genre of the video. (Unfortunately, many entries have no specific genre and carry only a generic label such as 'Movies'.)
- 'brand' --> 'starring': contains the leading or most recognized actor associated with the video.
- 'asin' --> 'movie_id': contains the unique Amazon product ID for each video.

**Unchanged**
- 'description': contains mostly synopsis information about the video, such as the plot, along with other descriptive details.

**Unnecessary Columns**: tech1, fit, also_buy, tech2, feature, rank, also_view, similar_item, date, price, imageURL, imageURLHighRes, details, and main_cat (after removing all entries not labeled 'Movies & TV')



In [4]:
meta_df.rename(columns={'category': 'genre', 'brand': 'starring', 'asin': 'movie_id'}, inplace=True)
meta_df.drop_duplicates(subset='movie_id', inplace=True)
meta_df.dropna(subset='title', inplace=True)
meta_df = meta_df[meta_df['main_cat'] == 'Movies & TV']
meta_df.drop(['tech1', 'fit', 'tech2', 'similar_item',
         'date', 'price', 'imageURL', 'imageURLHighRes',
        'also_buy', 'also_view', 'feature', 'rank',
          'main_cat', 'details'], axis=1, inplace=True)

meta_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 181494 entries, 0 to 203765
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   genre        181494 non-null  object
 1   description  181494 non-null  object
 2   title        181494 non-null  object
 3   starring     120604 non-null  object
 4   movie_id     181494 non-null  object
dtypes: object(5)
memory usage: 8.3+ MB


### Creating functions to remove html format from movie titles and 'description'

In [5]:
def remove_html_tags(text):
    return BeautifulSoup(text, "html.parser").get_text(strip=True)

def clean_html(title):
    # Remove HTML entities and strip leading/trailing whitespaces
    cleaned_title = remove_html_tags(title).strip()
    # Remove extra information by splitting on " / " and taking the first part of the split
    cleaned_title = cleaned_title.split(" / ", 1)[0]
    return cleaned_title

In [6]:
meta_df[meta_df['movie_id'] == 'B001P82XPG']['title']

124912    Feh&eacute;rl&oacute;fia 1982 / Hungarian cartoon / Region 2 PAL / Hungarian only version / Director: Marcell Jankovics Writers: L&aacute;szl&oacute; Gy&ouml;rgy (writer) Marcell Jankovics (writer)
Name: title, dtype: object

In [7]:
meta_df['title'] = meta_df['title'].apply(remove_html_tags)
meta_df[meta_df['movie_id'] == 'B001P82XPG']['title']

124912    Fehérlófia 1982 / Hungarian cartoon / Region 2 PAL / Hungarian only version / Director: Marcell Jankovics Writers: László György (writer) Marcell Jankovics (writer)
Name: title, dtype: object

In [8]:
meta_df['title'] = meta_df['title'].apply(clean_html)
meta_df[meta_df['movie_id'] == 'B001P82XPG']['title']

124912    Fehérlófia 1982
Name: title, dtype: object

### Converting all values in starring to 'Various Artists' where the leading/most well-known actor is not defined

In [9]:
meta_df['starring'].value_counts()

starring
Various               3058
.                     1027
-                      420
\n                     408
Learn more             356
                      ... 
Jon Long                 1
Rajiv Kankala            1
James T. Flocker         1
T.M. Crew                1
Misha Gomiashvili        1
Name: count, Length: 55253, dtype: int64

In [10]:
convert = ['.', '\n', '-', '--', 'Na',
           'BRIDGESTONE MULTIMEDIA', '*', 'none',
           'na', 'N/a', 'VARIOUS', 'Artist Not Provided',
           'Sinister Cinema', 'Learn more', 'Various', 'various',
           'The Ambient Collection', 'Animation', 'Standard Deviants',
          'Animated']

meta_df['starring'] = meta_df['starring'].apply(lambda x: 'Various Artists' if isinstance(x, str) and (x in convert or '\n' in x) else x)
meta_df['starring'].fillna('Various Artists', inplace=True)

In [11]:
meta_df['starring'].value_counts()

starring
Various Artists                    67851
John Wayne                           203
LeVar Burton                         116
William Shatner                      111
Roy Rogers                            95
                                   ...  
Sergey Soloviev                        1
Fred Warshofsky                        1
Steve Girman                           1
Kent Nichols and Douglas Sarine        1
Misha Gomiashvili                      1
Name: count, Length: 55214, dtype: int64
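The placeholder replacement above can also be expressed as a small standalone function. The sketch below is a variation, not the notebook's exact logic: it compares case-insensitively and uses a shortened, hypothetical `PLACEHOLDERS` set instead of the full `convert` list, and the name `normalize_starring` is made up for this example.

```python
# Hypothetical, shortened placeholder set; the notebook's `convert` list is longer.
PLACEHOLDERS = {".", "-", "--", "*", "na", "n/a", "none",
                "various", "learn more", "artist not provided"}


def normalize_starring(value, fallback="Various Artists"):
    """Return `fallback` for missing or placeholder names, otherwise the trimmed name."""
    if not isinstance(value, str):  # covers None and NaN (NaN is a float)
        return fallback
    if "\n" in value:  # entries containing newlines are treated as unusable
        return fallback
    text = value.strip()
    if not text or text.lower() in PLACEHOLDERS:
        return fallback
    return text


print(normalize_starring("VARIOUS"))
print(normalize_starring("John Wayne"))
print(normalize_starring(float("nan")))
```

Lower-casing before the lookup lets one entry cover spellings such as 'Various', 'VARIOUS', and 'various', at the cost of also matching any real name that happens to equal a placeholder.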

### Reformatting and Filtering 'genre'
The genre column required the most work to clean up. The following steps were used to extract the genre for each entry:
- Converted each entry from a list to a single string
- Removed the generic 'Movies & TV' label found in each entry
- Removed genres that are not explicitly entertainment, such as 'Exercise & Fitness' (the assumption being that someone looking for a video to watch for entertainment is unlikely to want an exercise video)
- Renamed and reformatted genres with a lot of overlap, such as 'Art House & International'

In [12]:
meta_df['genre'].value_counts().head(10)

genre
['Movies & TV', 'Movies']                                                 28826
['Movies & TV', 'Genre for Featured Categories', 'Action & Adventure']     8964
['Movies & TV', 'Genre for Featured Categories', 'Drama']                  8284
['Movies & TV', 'Genre for Featured Categories', 'Documentary']            7947
['Movies & TV', 'Genre for Featured Categories', 'Special Interests']      7573
['Movies & TV', 'Genre for Featured Categories', 'Kids & Family']          7044
['Movies & TV', 'Genre for Featured Categories', 'Comedy']                 6395
['Movies & TV', 'Genre for Featured Categories', 'Exercise & Fitness']     5328
['Movies & TV', 'Independently Distributed', 'Documentary']                5273
['Movies & TV', 'Genre for Featured Categories', 'Sports']                 4248
Name: count, dtype: int64

In [13]:
#converting all entries to list using ast
meta_df['genre'] = meta_df['genre'].apply(lambda x: ast.literal_eval(x))

#removing 'Movies & TV' from the beginning of each genre list
meta_df['genre'] = [x[1:] if len(x) > 1 and x[0] == 'Movies & TV' else x for x in meta_df['genre']]

#removing 'Exercise & Fitness' videos
meta_df = meta_df[~meta_df['genre'].apply(lambda x: 'Exercise & Fitness' in x)]

#extracting Art House & International and the language origin of the film
meta_df.loc[meta_df['genre'].apply(lambda x: isinstance(x, list) and len(x) > 2 and x[0] == 'Art House & International'), 'genre'] = meta_df['genre'].apply(lambda x: [x[0] + ' ' + x[2]] if len(x) > 2 else x)

#combining Art House with its language origin
meta_df['genre'] = meta_df['genre'].apply(lambda x: x[:1] + x[2:] if isinstance(x, list) and len(x) > 2 and x[0] == 'Art House & International' and len(x) > 2 else x)

#joining all of the lists so they are now one string value
meta_df['genre'] = meta_df['genre'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)

#replacing empty string values with unknown
meta_df['genre'].replace({'': 'unknown'}, inplace=True)
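To make the core of this cell easier to follow, here is a reduced sketch of what happens to a single raw `genre` value: the stored string is parsed back into a list, the leading 'Movies & TV' label is dropped, and the remainder is joined into one string. The helper name `parse_genre` is hypothetical, and the 'Exercise & Fitness' and 'Art House & International' handling is deliberately left out.

```python
import ast


def parse_genre(raw):
    """Parse a stringified list of category labels into a single genre string."""
    labels = ast.literal_eval(raw)  # e.g. "['Movies & TV', 'Movies']" -> list of str
    # Drop the generic leading label only when something else follows it.
    if len(labels) > 1 and labels[0] == "Movies & TV":
        labels = labels[1:]
    # An empty result is mapped to the 'unknown' marker.
    return " ".join(labels) or "unknown"


print(parse_genre("['Movies & TV', 'Genre for Featured Categories', 'Drama']"))
print(parse_genre("['Movies & TV', 'Movies']"))
print(parse_genre("[]"))
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for strings read from a CSV file.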

### Reformatting description by converting to string and applying the HTML functions to remove HTML

In [14]:
meta_df['description'] = meta_df['description'].apply(lambda x: " ".join(ast.literal_eval(x)).strip())
meta_df[meta_df['movie_id'] == 'B0002ERXC2']['description']

41833    From David Simon, creator and co-writer of HBO's triple Emmy-winning mini-series <em>The Corner</em>, this unvarnished, highly realistic HBO series follows a single sprawling drug and murder investigation in Baltimore. Told from the point of view of both the police and their targets, the series captures a universe of subterfuge and surveillance, where easy distinctions between good and evil, and crime and punishment, are challenged at every turn. After one episode of <I>The Wire</I> you'll be hooked. After three, you'll be astonished by the precision of its storytelling. After viewing all 13 episodes of the HBO series' remarkable first season, you'll be cheering a bona-fide American masterpiece. Series creator David Simon was a veteran crime reporter from <I>The Baltimore Sun</I> who cowrote the <a href="/exec/obidos/ASIN/0804109990/${0}">book</a> that inspired TV's <I><a href="/b/?node=13745601">Homicide</a></I>, and cowriter Ed Burns was a Baltimore cop, lending impeccable s

In [15]:
meta_df['description'] = meta_df['description'].apply(remove_html_tags)
meta_df[meta_df['movie_id'] == 'B0002ERXC2']['description']

41833    From David Simon, creator and co-writer of HBO's triple Emmy-winning mini-seriesThe Corner, this unvarnished, highly realistic HBO series follows a single sprawling drug and murder investigation in Baltimore. Told from the point of view of both the police and their targets, the series captures a universe of subterfuge and surveillance, where easy distinctions between good and evil, and crime and punishment, are challenged at every turn. After one episode ofThe Wireyou'll be hooked. After three, you'll be astonished by the precision of its storytelling. After viewing all 13 episodes of the HBO series' remarkable first season, you'll be cheering a bona-fide American masterpiece. Series creator David Simon was a veteran crime reporter fromThe Baltimore Sunwho cowrote thebookthat inspired TV'sHomicide, and cowriter Ed Burns was a Baltimore cop, lending impeccable street-cred to an inner-city Baltimore saga (and companion piece toThe Corner) that Simon aptly describes as "a visual 

In [16]:
meta_df['description'] = meta_df['description'].apply(clean_html)
meta_df[meta_df['movie_id'] == 'B0002ERXC2']['description']

41833    From David Simon, creator and co-writer of HBO's triple Emmy-winning mini-seriesThe Corner, this unvarnished, highly realistic HBO series follows a single sprawling drug and murder investigation in Baltimore. Told from the point of view of both the police and their targets, the series captures a universe of subterfuge and surveillance, where easy distinctions between good and evil, and crime and punishment, are challenged at every turn. After one episode ofThe Wireyou'll be hooked. After three, you'll be astonished by the precision of its storytelling. After viewing all 13 episodes of the HBO series' remarkable first season, you'll be cheering a bona-fide American masterpiece. Series creator David Simon was a veteran crime reporter fromThe Baltimore Sunwho cowrote thebookthat inspired TV'sHomicide, and cowriter Ed Burns was a Baltimore cop, lending impeccable street-cred to an inner-city Baltimore saga (and companion piece toThe Corner) that Simon aptly describes as "a visual 
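The cleaned output above shows words fused together wherever an inline tag was removed (for example `ofThe Wireyou'll`). That is what happens when each text fragment is stripped before the fragments are concatenated, as `get_text(strip=True)` does; BeautifulSoup's `get_text` also accepts a `separator` argument for this situation. As a hedged, dependency-free alternative, the sketch below uses the standard library's `html.parser` to keep the original spacing and then collapses runs of whitespace. The names `_TextCollector` and `strip_tags_keep_spaces` are hypothetical, and this is not the code the notebook ran.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the text between tags without trimming individual fragments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities such as &eacute;
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags_keep_spaces(markup):
    """Remove tags, decode entities, and collapse whitespace to single spaces."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    text = "".join(collector.chunks)  # fragments keep their own surrounding spaces
    return " ".join(text.split())


sample = "After one episode of <I>The Wire</I> you'll be hooked."
print(strip_tags_keep_spaces(sample))
```

One limitation of this sketch: text in adjacent block-level tags with no whitespace between them (for example `<p>a</p><p>b</p>`) would still be joined, so a separator-based approach may suit such markup better.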

In [17]:
meta_df['description'].value_counts().head()

description
                                                                                                                                                                                               24720
Quick Shipping !!! New And Sealed !!! This Disc WILL NOT play on standard US DVD player. A multi-region PAL/NTSC DVD player is request to view it in USA/Canada. Please Review Description.     1571
DVD                                                                                                                                                                                              538
vhs                                                                                                                                                                                              373
VHS                                                                                                                                                                                              274
Nam

### Removing Videos with Duplicate Descriptions
- Videos with duplicate descriptions either represent different versions of the same video or carry text that does not actually describe the synopsis of the movie, which makes them unusable for inference later.
- The description is not necessary for collaborative modeling, but the only other descriptive information is in 'title' and 'genre', which alone do not provide enough information to evaluate the recommendations created later.

In [18]:
meta_df.drop_duplicates(subset='description', inplace=True)
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 143273 entries, 0 to 203764
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   genre        143273 non-null  object
 1   description  143273 non-null  object
 2   title        143273 non-null  object
 3   starring     143273 non-null  object
 4   movie_id     143273 non-null  object
dtypes: object(5)
memory usage: 6.6+ MB


### Using nltk package for easier removal of labels not descriptive of genre

In [19]:
tokenizer = RegexpTokenizer(r'\w+')


sw = ['genre','for','featured','categories',
      'independently','distributed','for','studio',
     'home', 'warner', 'specials', 'all', 'hbo',
      'titles', 'pictures', 'entertainment', 'blue',
      'ray', 'dvd', 'vhs', 'lionsgate', 'mod',
      'createspace', 'video', 'a', 'e', '20th', 'fox',
      'universal', 'mgm', 'entertainment', 'specials',
      'bbc', 'boxed', 'sets', 'walt', 'general',
      'paramount', 'loaded', 'dvds', 'fully', 'blu',
      'sony', 'studios', 'pbs', 'television', 'dts',
      'miramax', 'history', 'series', 'movies',
      'criterion','collection','century', 'top',
      'sellers', 'first', 'to', 'know', 'disney'
     ]

In [20]:
def tokenize_sw(text):

    # converting all letters to lowercase
    text = text.lower()

    # tokenizing the text so that individual words can be isolated and unnecessary labels removed
    words = tokenizer.tokenize(text)

    # removing unnecessary genre labels found in the stopwords list
    words = [word for word in words if word not in sw]

    return words

In [21]:
meta_df['genre'] = meta_df['genre'].apply(tokenize_sw)

In [22]:
meta_df['genre'] = meta_df['genre'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)
meta_df['genre'].replace({'': 'unknown'}, inplace=True)
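The tokenize, filter, and re-join steps above can be reproduced on a single label with the standard library, which may help when checking the effect of individual stopwords. In this sketch `re.findall(r"\w+", ...)` stands in for the NLTK tokenizer, and `STOPWORDS` is a shortened, hypothetical subset of the `sw` list; the function name `clean_genre` is also made up for illustration.

```python
import re

# Shortened, hypothetical subset of the notebook's `sw` list.
STOPWORDS = {"genre", "for", "featured", "categories",
             "independently", "distributed"}


def clean_genre(text, stopwords=STOPWORDS):
    """Lowercase a genre label, drop stopword tokens, and re-join the rest."""
    words = re.findall(r"\w+", text.lower())  # runs of letters, digits, underscores
    kept = [word for word in words if word not in stopwords]
    return " ".join(kept) or "unknown"  # empty result becomes 'unknown'


print(clean_genre("Genre for Featured Categories Action & Adventure"))
print(clean_genre("Independently Distributed Documentary"))
print(clean_genre("Genre for Featured Categories"))
```

Because the pattern keeps only word characters, the '&' in labels such as 'Action & Adventure' disappears during tokenization.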

### Removing the last of the frequent unusable genre labels
##### Sidenote: 'tv' and 'special editions' were left out of the earlier stopwords list in order to keep genres such as 'reality tv' and 'special interest'

In [23]:
remove_genre_str = ['tv', 'special editions'] 

def remove_substrings(s, word_list):
    for word in word_list:
        s = s.replace(word, '')
    return s

mask = ~meta_df['genre'].isin(remove_genre_str)
meta_df = meta_df[mask]
meta_df['genre'] = meta_df['genre'].apply(remove_substrings, word_list=remove_genre_str).str.strip()
meta_df['genre'].replace({'': 'unknown'}, inplace=True)

In [24]:
# Copying the dataframe with all genre, including unknown,...
# ...to be used to filter the reviews dataframe later.
col_meta = meta_df.copy()
col_meta.info()

<class 'pandas.core.frame.DataFrame'>
Index: 139483 entries, 0 to 203764
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   genre        139483 non-null  object
 1   description  139483 non-null  object
 2   title        139483 non-null  object
 3   starring     139483 non-null  object
 4   movie_id     139483 non-null  object
dtypes: object(5)
memory usage: 6.4+ MB


### Renaming 'genre' values so that any label containing at least the keyword for a genre is counted under that genre. This better categorizes the genres and reduces complexity.
**sidenote**: The order in which genres are renamed matters because many labels contain several genres; for example, 'christmas' videos may also contain 'animation'. To keep each genre as specific as possible, the more specific sub-genres are renamed first, followed by progressively broader genres.

In [25]:
meta_df['genre'].nunique()

562

In [26]:
meta_df['genre'] = meta_df['genre'].str.replace('.*christmas.*', 'Christmas', regex=True)

In [27]:
meta_df['genre'] = meta_df['genre'].str.replace('.*anime.*', 'Anime', regex=True)

In [28]:
meta_df['genre'] = meta_df['genre'].str.replace('.*animation.*', 'Animation', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*animated.*', 'Animation', regex=True)

In [29]:
meta_df['genre'] = meta_df['genre'].str.replace('.*reality.*', 'Reality TV', regex=True)

In [30]:
meta_df['genre'] = meta_df['genre'].str.replace('.*musicals.*', 'Musicals & Performing Arts', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*performing arts.*', 'Musicals & Performing Arts', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*shakespeare.*', 'Musicals & Performing Arts', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*theatre.*', 'Musicals & Performing Arts', regex=True)

In [31]:
meta_df['genre'] = meta_df['genre'].str.replace('.*music art.*', 'Music Videos & Concerts', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*music con.*', 'Music Videos & Concerts', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*music video.*', 'Music Videos & Concerts', regex=True)

In [32]:
meta_df['genre'] = meta_df['genre'].str.replace('.*art house.*', 'Art House & International', regex=True)

In [33]:
meta_df['genre'] = meta_df['genre'].str.replace('.*science fi.*', 'Science Fiction & Fantasy', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*sci fi.*', 'Science Fiction & Fantasy', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*fantasy.*', 'Science Fiction & Fantasy', regex=True)

In [34]:
meta_df['genre'] = meta_df['genre'].str.replace('.*classic.*', 'Classics & Silent Film', regex=True)

In [35]:
meta_df['genre'] = meta_df['genre'].str.replace('.*action.*', 'Action & Adventure', regex=True)

In [36]:
meta_df['genre'] = meta_df['genre'].str.replace('.*christian.*', 'Faith & Spirituality', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*faith spirit.*', 'Faith & Spirituality', regex=True)

In [37]:
meta_df['genre'] = meta_df['genre'].str.replace('.*mystery.*', 'Mystery & Thrillers', regex=True)

In [38]:
meta_df['genre'] = meta_df['genre'].str.replace('.*news.*', 'News', regex=True)

In [39]:
meta_df['genre'] = meta_df['genre'].str.replace('.*kids.*', 'Kids & Family', regex=True)

In [40]:
meta_df['genre'] = meta_df['genre'].str.replace('.*comedy.*', 'Comedy', regex=True)

In [41]:
meta_df['genre'] = meta_df['genre'].str.replace('.*horror.*', 'Horror', regex=True)

In [42]:
meta_df['genre'] = meta_df['genre'].str.replace('.*drama.*', 'Drama', regex=True)

In [43]:
meta_df['genre'] = meta_df['genre'].str.replace('.*years.*', 'Young Children', regex=True)

In [44]:
meta_df['genre'].nunique()

220
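The renaming cells above all follow one pattern: test a label for a keyword and, on a match, replace the whole label with a canonical name, with earlier rules taking priority. As a compact sketch of that idea, the ordered rule list below applies the first matching rule. `GENRE_RULES` is a shortened, hypothetical subset of the rules above and `canonical_genre` is a made-up helper name, so this illustrates the ordering logic rather than replacing the cells.

```python
import re

# Ordered from specific to broad; only a subset of the notebook's rules is shown.
GENRE_RULES = [
    ("christmas", "Christmas"),
    ("anime", "Anime"),
    ("animation|animated", "Animation"),
    ("action", "Action & Adventure"),
    ("comedy", "Comedy"),
    ("drama", "Drama"),
]


def canonical_genre(label, rules=GENRE_RULES):
    """Return the canonical name of the first rule whose keyword occurs in the label."""
    for pattern, name in rules:
        if re.search(pattern, label):
            return name
    return label  # labels that match no rule are left unchanged


print(canonical_genre("christmas animation"))
print(canonical_genre("kids family animated"))
print(canonical_genre("westerns"))
```

Keeping the rules in one list makes the priority order explicit and easy to reorder when a broad keyword turns out to swallow a more specific genre.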

In [45]:
meta_df['genre'] = meta_df['genre'].str.title()
meta_df['genre'].replace('Unknown', np.nan, inplace=True)
meta_df.dropna(subset=['genre'], inplace=True)
meta_df['genre'].nunique()

219

In [46]:
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 99292 entries, 8 to 203764
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   genre        99292 non-null  object
 1   description  99292 non-null  object
 2   title        99292 non-null  object
 3   starring     99292 non-null  object
 4   movie_id     99292 non-null  object
dtypes: object(5)
memory usage: 4.5+ MB


In [47]:
genres_to_remove = meta_df['genre'].value_counts()[meta_df['genre'].value_counts() < 200].index
meta_df = meta_df[~meta_df['genre'].isin(genres_to_remove)]
meta_df['genre'].nunique()

27

In [48]:
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 96834 entries, 8 to 203764
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   genre        96834 non-null  object
 1   description  96834 non-null  object
 2   title        96834 non-null  object
 3   starring     96834 non-null  object
 4   movie_id     96834 non-null  object
dtypes: object(5)
memory usage: 4.4+ MB


In [49]:
meta_df.to_csv('./data/descript_cont_based.csv', encoding='utf-8', index=False)
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 96834 entries, 8 to 203764
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   genre        96834 non-null  object
 1   description  96834 non-null  object
 2   title        96834 non-null  object
 3   starring     96834 non-null  object
 4   movie_id     96834 non-null  object
dtypes: object(5)
memory usage: 4.4+ MB


___

# Part 2: Collaborative Filtering Dataframes

### Loading the JSON-converted CSV of review data

In [56]:
df_reviews = pd.read_csv('./data/reviews.csv')
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8765568 entries, 0 to 8765567
Data columns (total 12 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   overall         int64 
 1   verified        bool  
 2   reviewTime      object
 3   reviewerID      object
 4   asin            object
 5   style           object
 6   reviewerName    object
 7   reviewText      object
 8   summary         object
 9   unixReviewTime  int64 
 10  vote            object
 11  image           object
dtypes: bool(1), int64(2), object(9)
memory usage: 744.0+ MB


### Renaming Columns and Removing Unnecessary/Unusable:

**Renamed Columns**:
- 'asin' --> 'movie_id': contains the unique Amazon product ID for each video.
- 'overall' --> 'rating': the rating that a user gave the video in their review.
- 'reviewerID' --> 'user_id': unique ID given to an Amazon user account and recorded for each review made.
- 'reviewText' --> 'reviews': the review text written by the user for the specific video they reviewed.

**Unnecessary Columns**: 'image', 'reviewTime', 'reviewerName', 'summary', 'unixReviewTime', 'vote'

**Unchanged**
- 'style': represents the format of the Amazon product/video
- 'verified': represents whether or not a user is verified to have purchased/downloaded the video.

In [57]:
df_reviews.drop(['image', 'reviewTime', 'reviewerName', 'summary',
              'vote', 'unixReviewTime'], axis=1, inplace=True)

df_reviews.rename(columns={'overall': 'rating', 'asin': 'movie_id',
                          'reviewerID': 'user_id','reviewText':'reviews'}, inplace=True)
df_reviews.head()

Unnamed: 0,rating,verified,user_id,movie_id,style,reviews
0,5,True,A3478QRKQDOPQ2,1527665,{'Format:': ' VHS Tape'},"really happy they got evangelised .. spoiler alert==happy ending liked that..since started bit worrisome... but yeah great stories these missionary movies, really short only half hour but still great"
1,5,True,A2VHSG6TZHU1OB,1527665,{'Format:': ' Amazon Video'},"Having lived in West New Guinea (Papua) during the time period covered in this video, it is realistic, accurate, and conveys well the entrance of light and truth into a culture that was for centuries dead to and alienated from God."
2,5,False,A23EJWOW1TLENE,1527665,{'Format:': ' Amazon Video'},Excellent look into contextualizing the Gospel and God's sovereignty over cultural barriers. The book and movie are both captivating. I would definitely recommend to both Christians and non-believers.
3,5,True,A1KM9FNEJ8Q171,1527665,{'Format:': ' Amazon Video'},"More than anything, I've been challenged to find ways to share Christ is a culturally relevant way to those around me. Peace child is a cherished ""how to"" for me to do that."
4,4,True,A38LY2SSHVHRYB,1527665,{'Format:': ' Amazon Video'},"This is a great movie for a missionary going into a foreign country, especially one that is not used to foreign presence. But, it was a little on the short side."


### Removing all unverified reviews
- Removing all reviews from users who were not verified, in order to retain as much validity as possible
- Dropping the 'verified' column afterwards, as it is no longer necessary

In [58]:
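# Keep only verified reviews, then drop the now-constant 'verified' column.
# Chaining avoids an in-place drop on a filtered slice of the original dataframe.
df_reviews = df_reviews[df_reviews['verified'] == True].drop(columns='verified')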
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6731296 entries, 0 to 8765566
Data columns (total 5 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   rating    int64 
 1   user_id   object
 2   movie_id  object
 3   style     object
 4   reviews   object
dtypes: int64(1), object(4)
memory usage: 308.1+ MB
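
As a minimal sketch on made-up toy data (the frame `toy_reviews` and its values are assumptions for illustration, not part of the dataset), the verified-only filter can be expressed as one chained expression that returns a new dataframe rather than modifying a filtered slice in place:

```python
import pandas as pd

# Toy stand-in for df_reviews; the values are invented for illustration.
toy_reviews = pd.DataFrame({
    "rating": [5, 4, 3],
    "verified": [True, False, True],
    "user_id": ["u1", "u2", "u3"],
})

# Boolean-mask filter and column drop chained into one new dataframe,
# so nothing is modified in place on a slice of the original.
verified_only = toy_reviews[toy_reviews["verified"]].drop(columns="verified")

print(verified_only)
```

The original frame is left untouched, which makes it easy to compare row counts before and after the filter.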


### Reformatting 'style' column
- Reformatting so that each entry in 'style' is a string naming the format of the video

In [59]:
# Keep only the rows in df_reviews where 'style' is a string containing "{'Format:"
rm_format = df_reviews['style'].apply(lambda x: isinstance(x, str) and "{'Format:" in x)
# .copy() so the column assignments below act on an independent dataframe
df_reviews = df_reviews.loc[rm_format].copy()

# Convert the 'style' string to a dictionary using ast.literal_eval
df_reviews['style'] = df_reviews['style'].apply(ast.literal_eval)
# Extract the 'Format:' value from the dictionary
df_reviews['style'] = df_reviews['style'].apply(lambda x: x['Format:'])
# Convert the 'style' column to string data type
df_reviews['style'] = df_reviews['style'].astype(str)

# Keep only rows whose 'style' (format) value occurs at least 25,000 times.
# Intended to get rid of movies labeled as VHS and other miscellaneous formats;
# VHS is the most outdated tech listed, so removing it keeps the dataset more current.
style_counts = df_reviews['style'].value_counts()
df_reviews = df_reviews[df_reviews['style'].isin(style_counts[style_counts >= 25000].index)]

df_reviews.head()

Unnamed: 0,rating,user_id,movie_id,style,reviews
0,5,A3478QRKQDOPQ2,1527665,VHS Tape,"really happy they got evangelised .. spoiler alert==happy ending liked that..since started bit worrisome... but yeah great stories these missionary movies, really short only half hour but still great"
1,5,A2VHSG6TZHU1OB,1527665,Amazon Video,"Having lived in West New Guinea (Papua) during the time period covered in this video, it is realistic, accurate, and conveys well the entrance of light and truth into a culture that was for centuries dead to and alienated from God."
3,5,A1KM9FNEJ8Q171,1527665,Amazon Video,"More than anything, I've been challenged to find ways to share Christ is a culturally relevant way to those around me. Peace child is a cherished ""how to"" for me to do that."
4,4,A38LY2SSHVHRYB,1527665,Amazon Video,"This is a great movie for a missionary going into a foreign country, especially one that is not used to foreign presence. But, it was a little on the short side."
5,5,AHTYUW2H1276L,1527665,Amazon Video,This movie was in ENGLISH....it was a great summary of the book and the experience of the Richardsons while in New Guinea.
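
The parsing and frequency filter above can be sketched on toy data as follows. The helper name `extract_format`, the `MIN_COUNT` threshold, the `.strip()` call, and the sample rows are all assumptions made for illustration; the notebook itself uses a threshold of 25000 and does not strip whitespace.

```python
import ast
import pandas as pd

# Toy 'style' column: stringified dicts, plus one missing value.
toy = pd.DataFrame({"style": [
    "{'Format:': ' DVD'}",
    "{'Format:': ' DVD'}",
    "{'Format:': ' VHS Tape'}",
    None,
]})

def extract_format(raw):
    """Return the 'Format:' value of a stringified dict, or None if unparseable."""
    if not isinstance(raw, str):
        return None
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return None
    if not isinstance(parsed, dict):
        return None
    value = parsed.get("Format:")
    # Stripping the leading space is an addition, not something the notebook does.
    return value.strip() if isinstance(value, str) else None

toy["style"] = toy["style"].apply(extract_format)
toy = toy.dropna(subset=["style"])

# Hypothetical threshold sized for the toy data.
MIN_COUNT = 2
counts = toy["style"].value_counts()
toy = toy[toy["style"].isin(counts[counts >= MIN_COUNT].index)]

print(toy["style"].tolist())
```

Computing `value_counts()` once and reusing it keeps the filter readable and avoids counting the column twice on a large dataframe.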


### Creating a new dataframe for collaborative filtering
- Will be used for the collaborative filtering model, and also for content-based filtering if I decide not to use 'description' as a feature from the meta dataframe.

In [60]:
df_collab = df_reviews.copy()
df_collab.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6432095 entries, 0 to 8765561
Data columns (total 5 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   rating    int64 
 1   user_id   object
 2   movie_id  object
 3   style     object
 4   reviews   object
dtypes: int64(1), object(4)
memory usage: 294.4+ MB


In [61]:
col_meta.info()

<class 'pandas.core.frame.DataFrame'>
Index: 139483 entries, 0 to 203764
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   genre        139483 non-null  object
 1   description  139483 non-null  object
 2   title        139483 non-null  object
 3   starring     139483 non-null  object
 4   movie_id     139483 non-null  object
dtypes: object(5)
memory usage: 6.4+ MB


In [62]:
# Get the unique movie_ids from the meta dataframe as a list
all_vid = col_meta['movie_id'].unique().tolist()

# Keep only rows in df_collab where 'movie_id' is in all_vid (for inference)
df_collab = df_collab[df_collab['movie_id'].isin(all_vid)]

# Removing entries where a user reviewed the same movie multiple times, so that
# each (user, movie) pair contributes a single rating
df_collab = df_collab.drop_duplicates(subset=['user_id', 'movie_id'], keep='first')

# Dropping 'style', as I will be assuming that all formats are reviewed solely based on their content
df_collab = df_collab.drop(columns='style')

# Removing user_ids that have fewer than 4 reviews so that collaborative filtering has
# enough overlap between similar users (may want to filter more depending on RMSE)
user_counts = df_collab['user_id'].value_counts()
df_collab = df_collab[df_collab['user_id'].isin(user_counts[user_counts >= 4].index)]

df_collab.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2025476 entries, 9 to 8765560
Data columns (total 4 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   rating    int64 
 1   user_id   object
 2   movie_id  object
 3   reviews   object
dtypes: int64(1), object(3)
memory usage: 77.3+ MB
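
The de-duplication and minimum-activity filter can be illustrated on a small invented frame. The names `toy_collab` and `MIN_REVIEWS` and all values are assumptions for this sketch; the notebook uses a minimum of 4 reviews per user.

```python
import pandas as pd

# Toy ratings: u1 rates m1 twice, u2 rates only two movies.
toy_collab = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u1", "u2"],
    "movie_id": ["m1", "m2", "m3", "m1", "m1", "m2"],
    "rating": [5, 4, 3, 2, 1, 5],
})

# One rating per (user, movie) pair, keeping the first one seen.
deduped = toy_collab.drop_duplicates(subset=["user_id", "movie_id"], keep="first")

# Hypothetical threshold sized for the toy data.
MIN_REVIEWS = 3

# Map each row to its user's review count, then keep sufficiently active users.
reviews_per_user = deduped["user_id"].map(deduped["user_id"].value_counts())
active_users = deduped[reviews_per_user >= MIN_REVIEWS]

print(active_users)
```

Counting after de-duplication matters here: counting first would let repeat reviews of one movie push a user over the threshold.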


In [63]:
col_meta['movie_id'].nunique()

139483

In [64]:
df_collab['movie_id'].nunique()

85605

In [65]:
# Removing movie_ids from the meta dataframe that are not in df_collab, for a clean merge
all_vid = df_collab['movie_id'].unique().tolist()
col_meta = col_meta[col_meta['movie_id'].isin(all_vid)]
col_meta['movie_id'].nunique()

85605

In [66]:
revtext_merged_df = pd.merge(df_collab, col_meta, on="movie_id", how="left")
revtext_merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2025476 entries, 0 to 2025475
Data columns (total 8 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   rating       int64 
 1   user_id      object
 2   movie_id     object
 3   reviews      object
 4   genre        object
 5   description  object
 6   title        object
 7   starring     object
dtypes: int64(1), object(7)
memory usage: 123.6+ MB
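
One way to sanity-check a left merge like this is with the `validate` and `indicator` parameters of `pd.merge`. The sketch below uses invented toy frames (`toy_ratings`, `toy_meta`) and is an optional check, not something the notebook itself performs:

```python
import pandas as pd

# Toy stand-ins for the ratings and metadata frames.
toy_ratings = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "movie_id": ["m1", "m1", "m2"],
    "rating": [5, 4, 3],
})
toy_meta = pd.DataFrame({
    "movie_id": ["m1", "m2"],
    "title": ["A", "B"],
})

# validate="many_to_one" raises MergeError if movie_id is duplicated in toy_meta,
# which would otherwise silently multiply rating rows.
# indicator=True adds a '_merge' column that flags ratings without metadata.
merged = pd.merge(
    toy_ratings, toy_meta,
    on="movie_id", how="left",
    validate="many_to_one", indicator=True,
)

unmatched = (merged["_merge"] == "left_only").sum()
merged = merged.drop(columns="_merge")

print(len(merged), unmatched)
```

If the row count after the merge equals the row count of the left frame and `unmatched` is zero, every rating found exactly one metadata row.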


In [67]:
# Saving for collaborative filtering recommendations
revtext_merged_df.to_csv('./data/collab_merged.csv', encoding='utf-8', index=False)

### Creating a dataframe for review text, in case I end up using it as a feature

In [68]:
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 96834 entries, 8 to 203764
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   genre        96834 non-null  object
 1   description  96834 non-null  object
 2   title        96834 non-null  object
 3   starring     96834 non-null  object
 4   movie_id     96834 non-null  object
dtypes: object(5)
memory usage: 4.4+ MB


In [69]:
df_reviews.dropna(subset='reviews', inplace=True)
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6424014 entries, 0 to 8765561
Data columns (total 5 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   rating    int64 
 1   user_id   object
 2   movie_id  object
 3   style     object
 4   reviews   object
dtypes: int64(1), object(4)
memory usage: 294.1+ MB


In [70]:
df_reviews['reviews'].value_counts().head()

reviews
Great movie    30679
Great          26348
Good movie     19748
Good           18624
great          18056
Name: count, dtype: int64
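
The counts above treat "Great" and "great" as different reviews. As a hedged sketch on invented strings (the notebook does not do this), normalizing case and surrounding whitespace before counting shows how much of the text is the same generic review:

```python
import pandas as pd

# Toy review texts that differ only in case or trailing whitespace.
toy_text = pd.Series(["Great movie", "great movie ", "Great", "great", "GOOD"])

# Strip surrounding whitespace and lowercase before counting.
normalized = toy_text.str.strip().str.lower()
counts = normalized.value_counts()

print(counts)
```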

In [71]:
# Get the unique movie_ids from the meta dataframe as a list
all_vid = meta_df['movie_id'].unique().tolist()

# Keep only rows in df_reviews where 'movie_id' is in all_vid (for inference)
df_reviews = df_reviews[df_reviews['movie_id'].isin(all_vid)]

# Removing entries where a user reviewed a movie multiple times, and reviews whose
# text is duplicated across different users
df_reviews = df_reviews.drop_duplicates(subset=['user_id', 'movie_id'], keep='first')
df_reviews = df_reviews.drop_duplicates(subset=['reviews'], keep='first')

# Removing reviews shorter than 200 characters
# in order to remove generic reviews like 'great movie'
df_reviews = df_reviews[(df_reviews['reviews'].str.len() >= 200)]

# Dropping 'style', as I will be assuming that all formats are reviewed solely based on their content
df_reviews = df_reviews.drop(columns='style')

df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 652426 entries, 9 to 8765560
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   rating    652426 non-null  int64 
 1   user_id   652426 non-null  object
 2   movie_id  652426 non-null  object
 3   reviews   652426 non-null  object
dtypes: int64(1), object(3)
memory usage: 24.9+ MB
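
The review-text cleaning steps can be traced on a toy frame. The frame `toy`, its values, and the constant name `MIN_CHARS` are assumptions for illustration; the 200-character cutoff matches the one used above.

```python
import pandas as pd

# Toy reviews: one generic short review, two identical long ones, one missing.
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "movie_id": ["m1", "m1", "m2", "m3"],
    "reviews": ["Great movie", "x" * 250, "x" * 250, None],
})

MIN_CHARS = 200

cleaned = (
    toy.dropna(subset=["reviews"])                                  # drop missing text
       .drop_duplicates(subset=["user_id", "movie_id"], keep="first")  # one review per pair
       .drop_duplicates(subset=["reviews"], keep="first")              # drop repeated text
)
# Drop short, generic reviews.
cleaned = cleaned[cleaned["reviews"].str.len() >= MIN_CHARS]

print(cleaned["user_id"].tolist())
```

Only `u2` survives: `u4` has no text, `u3` repeats `u2`'s text, and `u1`'s review is too short.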


In [72]:
df_reviews['movie_id'].nunique()

60695

In [73]:
meta_df['movie_id'].nunique()

96834

In [74]:
# Removing movie_ids from the meta dataframe that are not in df_reviews, for a clean merge
all_vid = df_reviews['movie_id'].unique().tolist()
meta_df = meta_df[meta_df['movie_id'].isin(all_vid)]
meta_df['movie_id'].nunique()

60695

In [75]:
revtext_merged_nonull_df = pd.merge(df_reviews, meta_df, on="movie_id", how="left")
revtext_merged_nonull_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 652426 entries, 0 to 652425
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   rating       652426 non-null  int64 
 1   user_id      652426 non-null  object
 2   movie_id     652426 non-null  object
 3   reviews      652426 non-null  object
 4   genre        652426 non-null  object
 5   description  652426 non-null  object
 6   title        652426 non-null  object
 7   starring     652426 non-null  object
dtypes: int64(1), object(7)
memory usage: 39.8+ MB


In [76]:
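# Saving the review-text dataframe built in the previous cell for content-based recommendations
revtext_merged_nonull_df.to_csv('./data/revtext_cont_based.csv', encoding='utf-8', index=False)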