![title_pic](./img/title_page.png)
#### TV & Movie recommendation system using collaborative and content-based filtering

# Data Understanding
The data used in the Film Finder project comes from the Amazon Review Data (2018) dataset, found [here](https://nijianmo.github.io/amazon/index.html). It was originally sourced by Jianmo Ni (UCSD) for the paper [“Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects”](https://cseweb.ucsd.edu//~jmcauley/pdfs/emnlp19a.pdf).

The raw files are stored as compressed JSON files, with one JSON record per line. To access the larger files, such as the 'Movies and TV' reviews and metadata files used in this project, you need to fill out a Google Form linked from the source site.

The dataset spans May 1996 to October 2018. Our analysis focuses on the Movies and TV category, which comprises 8,765,568 reviews and metadata for 203,970 products.

The dataset offers user reviews and metadata for various films and TV shows, allowing us to better understand user preferences and industry trends.

### Key dataset features:
**overall**: User-assigned ratings from 1 to 5.

**verified**: Indicates if the user actually purchased or streamed the product.

**reviewerID**: Unique Amazon user ID.

**asin**: Unique Amazon product identification number.

**style**: Format of the movie or TV show (e.g., streaming, DVD, VHS, etc.).

**description**: Synopsis/summary of the movie or TV show.

**reviewText**: User's review text.

**brand**: Starring or leading role of the film.

### Dataset limitations:
- The category feature, with genre and sub-genre details, has ambiguous labels, complicating recommendation accuracy and trend analysis.
- The description feature contains missing or unusable information in at least a third of entries, which restricts accurate suggestions based on plot summaries.
- The brand feature, despite listing the leading role, lacks comprehensive cast information, potentially limiting actor/actress-based recommendations.
- The dataset doesn't extend beyond 2018, limiting insights into recent viewership trends.

# Data Preparation

### Initial Dataset Access
Download the review and metadata files, which are stored as compressed JSON files with one record per line. For larger files such as the 'Movies and TV' reviews and metadata, request access through the Google Form linked from the source site.

### Data Conversion
Uncompress the JSON files and load them with Pandas. Clean the data using Pandas, NumPy, and Python's [ast module (Abstract Syntax Trees)](https://docs.python.org/3/library/ast.html).
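
As a minimal sketch of this conversion step, assuming the download is a gzip-compressed file with one JSON record per line, `pd.read_json` can read it directly when `lines=True` is set. The helper name `load_json_lines`, the file name, and the sample records below are hypothetical and only stand in for the real files.

```python
import gzip
import json
import os
import tempfile

import pandas as pd


def load_json_lines(path):
    """Read a gzip-compressed, line-delimited JSON file into a DataFrame."""
    # lines=True treats every line of the file as a separate JSON record
    return pd.read_json(path, lines=True, compression='gzip')


# Build a tiny stand-in file so the sketch runs without the real download
sample_records = [
    {'overall': 5, 'verified': True, 'asin': 'B000000001'},
    {'overall': 3, 'verified': False, 'asin': 'B000000002'},
]
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample_reviews.json.gz')
with gzip.open(sample_path, 'wt', encoding='utf-8') as fh:
    for record in sample_records:
        fh.write(json.dumps(record) + '\n')

sample_df = load_json_lines(sample_path)
print(sample_df.shape)
```

The resulting frame could then be written out with `to_csv`, which is the form the cells further down read back in.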

### EDA - Part 1: Metadata
**Remove unused features**: Discard unnecessary features for collaborative and content-based filtering, focusing on null or redundant information.

**Remove unused main_cat**: Discard all entries whose main category is not 'Movies & TV', then discard the feature.

**Extract genres**: Parse category to acquire specific genres/subgenres for movies/TV shows.

**Extract leading roles**: Analyze brand to obtain leading roles in films.

**Preprocess descriptions**: Clean description, isolating valuable information, and remove duplicates for text preprocessing and vectorization efficiency in the content-based filtering system.

### EDA - Part 2: User Review Data
**Remove unused features**: Discard unnecessary features for collaborative and content-based filtering, focusing on null or redundant information.

**Extract video content formats**: Parse style to obtain the top 4 formats, excluding VHS for relevancy.

**Keep verified reviews only**: Eliminate unverified reviews to maintain data validity.

**Remove duplicate reviews**: Discard duplicates based on asin and reviewerID to avoid bias.

**Match review with metadata**: Using cleaned metadata dataframe, remove review entries with movie/TV show IDs not found in the cleaned metadata dataframe.

**Filter by user review count**: Remove users with fewer than 4 reviews to reduce dataset size and improve model effectiveness. The threshold of 4 was chosen after comparing it against thresholds of 3, 5, and 6, based on the prediction score of the BaselineOnly() model from Surprise, a Python scikit for recommender systems. [(Surprise documentation found here)](https://surprise.readthedocs.io/en/stable/)
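
The duplicate removal and review-count filter described above can be sketched on a toy frame as follows. The function name `filter_by_review_count` and the parameter `min_reviews` are hypothetical, and the BaselineOnly() comparison used to choose the threshold is not reproduced here.

```python
import pandas as pd


def filter_by_review_count(reviews, min_reviews=4):
    """Keep one review per (user, movie) pair, then keep users with enough reviews."""
    deduped = reviews.drop_duplicates(subset=['user_id', 'movie_id'], keep='first')
    # Map every row to the number of reviews written by its user
    per_user = deduped['user_id'].map(deduped['user_id'].value_counts())
    return deduped[per_user >= min_reviews]


toy_reviews = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u1', 'u1', 'u1', 'u2', 'u2'],
    'movie_id': ['m1', 'm2', 'm3', 'm4', 'm4', 'm1', 'm2'],
    'rating': [5, 4, 3, 5, 1, 2, 4],
})
filtered = filter_by_review_count(toy_reviews, min_reviews=4)
print(filtered)
```

Deduplicating before counting matters: a user who reviewed the same title several times should not pass the threshold on repeats alone.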

In [1]:
import pandas as pd
import numpy as np

from nltk.tokenize import RegexpTokenizer

import ast


import warnings
warnings.filterwarnings('ignore')

In [112]:
meta_df = pd.read_csv('./data/reviews_meta.csv')
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203766 entries, 0 to 203765
Data columns (total 19 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   category         203766 non-null  object 
 1   tech1            6 non-null       object 
 2   description      203766 non-null  object 
 3   fit              0 non-null       float64
 4   title            203707 non-null  object 
 5   also_buy         203766 non-null  object 
 6   tech2            0 non-null       float64
 7   brand            137335 non-null  object 
 8   feature          203766 non-null  object 
 9   rank             203766 non-null  object 
 10  also_view        203766 non-null  object 
 11  main_cat         203756 non-null  object 
 12  similar_item     0 non-null       float64
 13  date             38 non-null      object 
 14  price            110745 non-null  object 
 15  asin             203766 non-null  object 
 16  imageURL         203766 non-null  obje

In [113]:
meta_df.rename(columns={'category': 'genre', 'brand': 'starring', 'asin': 'movie_id'}, inplace=True)
meta_df.drop_duplicates(subset='movie_id', inplace=True)
meta_df.dropna(subset='title', inplace=True)
meta_df = meta_df[meta_df['main_cat'] == 'Movies & TV']
meta_df.drop(['tech1', 'fit', 'tech2', 'similar_item',
         'date', 'price', 'imageURL', 'imageURLHighRes',
        'also_buy', 'also_view', 'feature', 'rank',
          'main_cat', 'details'], axis=1, inplace=True)

meta_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 181494 entries, 0 to 203765
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   genre        181494 non-null  object
 1   description  181494 non-null  object
 2   title        181494 non-null  object
 3   starring     120604 non-null  object
 4   movie_id     181494 non-null  object
dtypes: object(5)
memory usage: 8.3+ MB


In [114]:
# Parse the stringified category lists into real Python lists
meta_df['genre'] = meta_df['genre'].apply(ast.literal_eval)
# Drop the leading 'Movies & TV' label when more specific categories follow
meta_df['genre'] = [x[1:] if len(x) > 1 and x[0] == 'Movies & TV' else x for x in meta_df['genre']]
# Exclude exercise and fitness videos
meta_df = meta_df[~meta_df['genre'].apply(lambda x: 'Exercise & Fitness' in x)]
# For 'Art House & International' titles, combine the top-level label with the third entry
meta_df.loc[meta_df['genre'].apply(lambda x: isinstance(x, list) and len(x) > 2 and x[0] == 'Art House & International'), 'genre'] = meta_df['genre'].apply(lambda x: [x[0] + ' ' + x[2]] if len(x) > 2 else x)
meta_df['genre'] = meta_df['genre'].apply(lambda x: x[:1] + x[2:] if isinstance(x, list) and len(x) > 2 and x[0] == 'Art House & International' else x)
# Flatten each list into one space-separated string; empty results become 'unknown'
meta_df['genre'] = meta_df['genre'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)
meta_df['genre'] = meta_df['genre'].replace({'': 'unknown'})
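
To make the parsing step easier to follow, here is a small trace of what happens to a single category value. It is a simplified sketch: the function name `normalize_category` and the sample strings are hypothetical, and the 'Exercise & Fitness' and 'Art House & International' handling is left out.

```python
import ast


def normalize_category(raw):
    """Turn a stringified category list into one space-separated genre string."""
    labels = ast.literal_eval(raw)  # "['a', 'b']" -> ['a', 'b']
    if len(labels) > 1 and labels[0] == 'Movies & TV':
        labels = labels[1:]  # drop the generic top-level label
    joined = ' '.join(labels)
    return joined if joined else 'unknown'


with_genre = normalize_category("['Movies & TV', 'Genre for Featured Categories', 'Comedy']")
empty = normalize_category("[]")
print(with_genre)
print(empty)
```

`ast.literal_eval` is used rather than `eval` because it only accepts Python literals, so a malformed or unexpected string raises an error instead of executing code.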

In [115]:
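def tokenize_sw(text):
    """Lowercase text, split it into word tokens, and drop the stop words in sw."""
    # tokenizer and sw are module-level names defined in the next cell
    text = text.lower()
    words = tokenizer.tokenize(text)
    words = [word for word in words if word not in sw]
    return words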

In [116]:
tokenizer = RegexpTokenizer(r'\w+')


sw = ['genre','for','featured','categories',
      'independently','distributed','for','studio',
     'home', 'warner', 'specials', 'all', 'hbo',
      'titles', 'pictures', 'entertainment', 'blue',
      'ray', 'dvd', 'vhs', 'lionsgate', 'mod',
      'createspace', 'video', 'a', 'e', '20th', 'fox',
      'universal', 'mgm', 'entertainment', 'specials',
      'bbc', 'boxed', 'sets', 'walt', 'general',
      'paramount', 'loaded', 'dvds', 'fully', 'blu',
      'sony', 'studios', 'pbs', 'television', 'dts',
      'miramax', 'history', 'series', 'movies',
      'criterion','collection','century', 'top',
      'sellers', 'first', 'to', 'know', 'disney'
     ]

In [117]:
meta_df['genre'] = meta_df['genre'].apply(tokenize_sw)

In [118]:
meta_df['genre'] = meta_df['genre'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)
meta_df['genre'] = meta_df['genre'].replace({'': 'unknown'})
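
The tokenize, filter, and rejoin steps above can be illustrated on single strings. This sketch swaps NLTK's `RegexpTokenizer` for `re.findall` with the same `\w+` pattern to stay dependency-free; `clean_genre` and the abbreviated `STOP_WORDS` set are hypothetical names, and the stop words are only a subset of the list above.

```python
import re

# Abbreviated subset of the stop-word list above, for illustration only
STOP_WORDS = {'genre', 'for', 'featured', 'categories', 'studio', 'dvd', 'blu', 'ray'}


def clean_genre(text, stop_words=STOP_WORDS):
    """Lowercase, tokenize on word characters, drop stop words, and rejoin."""
    tokens = re.findall(r'\w+', text.lower())
    kept = [tok for tok in tokens if tok not in stop_words]
    # A string made up entirely of stop words carries no genre information
    return ' '.join(kept) if kept else 'unknown'


featured = clean_genre('Genre for Featured Categories Science Fiction & Fantasy')
format_only = clean_genre('Blu-ray')
print(featured)
print(format_only)
```

Because `\w+` splits on punctuation, 'Blu-ray' becomes the two tokens 'blu' and 'ray', which is why both appear separately in the stop-word list.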

In [119]:
remove_genre_str = ['tv', 'special editions'] 

In [120]:
def remove_substrings(s, word_list):
    for word in word_list:
        s = s.replace(word, '')
    return s

mask = ~meta_df['genre'].isin(remove_genre_str)
meta_df = meta_df[mask]
meta_df['genre'] = meta_df['genre'].apply(remove_substrings, word_list=remove_genre_str).str.strip()

In [121]:
meta_df['genre'] = meta_df['genre'].str.replace('.*christmas.*', 'Christmas', regex=True)

In [122]:
meta_df['genre'] = meta_df['genre'].str.replace('.*anime.*', 'Anime', regex=True)

In [123]:
meta_df['genre'] = meta_df['genre'].str.replace('.*animation.*', 'Animation', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*animated.*', 'Animation', regex=True)

In [124]:
meta_df['genre'] = meta_df['genre'].str.replace('.*reality.*', 'Reality TV', regex=True)

In [125]:
meta_df['genre'] = meta_df['genre'].str.replace('.*musicals.*', 'Musicals & Performing Arts', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*performing arts.*', 'Musicals & Performing Arts', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*shakespeare.*', 'Musicals & Performing Arts', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*theatre.*', 'Musicals & Performing Arts', regex=True)

In [126]:
meta_df['genre'] = meta_df['genre'].str.replace('.*music art.*', 'Music Videos & Concerts', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*music con.*', 'Music Videos & Concerts', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*music video.*', 'Music Videos & Concerts', regex=True)

In [127]:
meta_df['genre'] = meta_df['genre'].str.replace('.*art house.*', 'Art House & International', regex=True)

In [128]:
meta_df['genre'] = meta_df['genre'].str.replace('.*science fi.*', 'Science Fiction & Fantasy', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*sci fi.*', 'Science Fiction & Fantasy', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*fantasy.*', 'Science Fiction & Fantasy', regex=True)

In [129]:
meta_df['genre'] = meta_df['genre'].str.replace('.*classic.*', 'Classics & Silent Film', regex=True)

In [130]:
meta_df['genre'] = meta_df['genre'].str.replace('.*action.*', 'Action & Adventure', regex=True)

In [131]:
meta_df['genre'] = meta_df['genre'].str.replace('.*christian.*', 'Faith & Spirituality', regex=True)
meta_df['genre'] = meta_df['genre'].str.replace('.*faith spirit.*', 'Faith & Spirituality', regex=True)

In [132]:
meta_df['genre'] = meta_df['genre'].str.replace('.*mystery.*', 'Mystery & Thrillers', regex=True)

In [133]:
meta_df['genre'] = meta_df['genre'].str.replace('.*news.*', 'News', regex=True)

In [134]:
meta_df['genre'] = meta_df['genre'].str.replace('.*kids.*', 'Kids & Family', regex=True)

In [135]:
meta_df['genre'] = meta_df['genre'].str.replace('.*comedy.*', 'Comedy', regex=True)

In [136]:
meta_df['genre'] = meta_df['genre'].str.replace('.*horror.*', 'Horror', regex=True)

In [137]:
meta_df['genre'] = meta_df['genre'].str.replace('.*drama.*', 'Drama', regex=True)

In [138]:
meta_df['genre'] = meta_df['genre'].str.replace('.*years.*', 'Young Children', regex=True)
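
The keyword-to-label replacements in the cells above all follow one pattern, so they could also be driven from an ordered list of rules. The following is a sketch of that idea rather than the notebook's code: `GENRE_RULES` and `consolidate_genres` are hypothetical names, and only a handful of the rules are included.

```python
import pandas as pd

# Ordered (keyword, label) pairs; a hypothetical subset of the rules above
GENRE_RULES = [
    ('christmas', 'Christmas'),
    ('sci fi', 'Science Fiction & Fantasy'),
    ('action', 'Action & Adventure'),
    ('comedy', 'Comedy'),
    ('drama', 'Drama'),
]


def consolidate_genres(genres, rules=GENRE_RULES):
    """Apply each keyword rule in order; a matching string is replaced by the label."""
    for keyword, label in rules:
        genres = genres.str.replace(f'.*{keyword}.*', label, regex=True)
    return genres


sample = pd.Series(['holiday christmas comedy', 'sci fi action', 'romantic drama'])
consolidated = consolidate_genres(sample)
print(consolidated.tolist())
```

Order matters here. The keywords are lowercase and the labels are capitalized, so once a string has been replaced by a label, later rules no longer match it and the first matching rule effectively wins.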

In [139]:
meta_df = meta_df[meta_df['genre'].isin(meta_df['genre'].value_counts()[meta_df['genre'].value_counts() >= 200].index)]
meta_df['genre'] = meta_df['genre'].str.title()

In [140]:
convert = ['.', '\n', '-', '--', 'Na',
           'BRIDGESTONE MULTIMEDIA', '*', 'none',
           'na', 'N/a', 'VARIOUS', 'Artist Not Provided',
           'Sinister Cinema', 'Learn more', 'Various', 'various',
           'The Ambient Collection', 'Animation', 'Standard Deviants',
          'Animated']

meta_df['starring'] = meta_df['starring'].apply(lambda x: 'Various Artists' if isinstance(x, str) and (x in convert or '\n' in x) else x)
meta_df['starring'] = meta_df['starring'].fillna('Various Artists')

In [141]:
col_meta = meta_df.copy()
col_meta.info()

<class 'pandas.core.frame.DataFrame'>
Index: 168110 entries, 0 to 203765
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   genre        168110 non-null  object
 1   description  168110 non-null  object
 2   title        168110 non-null  object
 3   starring     168110 non-null  object
 4   movie_id     168110 non-null  object
dtypes: object(5)
memory usage: 7.7+ MB


In [142]:
# saving for later to classify missing & ambiguous genres
meta_df.to_csv('./data/genre_classification.csv', encoding='utf-8', index=False)

In [144]:
meta_df['genre'] = meta_df['genre'].replace('Unknown', np.nan)
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 168110 entries, 0 to 203765
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   genre        111207 non-null  object
 1   description  168110 non-null  object
 2   title        168110 non-null  object
 3   starring     168110 non-null  object
 4   movie_id     168110 non-null  object
dtypes: object(5)
memory usage: 7.7+ MB


In [146]:
meta_df.dropna(subset='genre', inplace=True)
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 111207 entries, 8 to 203764
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   genre        111207 non-null  object
 1   description  111207 non-null  object
 2   title        111207 non-null  object
 3   starring     111207 non-null  object
 4   movie_id     111207 non-null  object
dtypes: object(5)
memory usage: 5.1+ MB


In [104]:
revtex_meta = meta_df.copy()
revtex_meta.info()

<class 'pandas.core.frame.DataFrame'>
Index: 111207 entries, 8 to 203764
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   genre        111207 non-null  object
 1   description  111207 non-null  object
 2   title        111207 non-null  object
 3   starring     111207 non-null  object
 4   movie_id     111207 non-null  object
dtypes: object(5)
memory usage: 5.1+ MB


In [147]:
meta_df['description'] = meta_df['description'].apply(lambda x: " ".join(ast.literal_eval(x)).strip())
meta_df.drop_duplicates(subset='description', inplace=True)
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 97167 entries, 8 to 203764
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   genre        97167 non-null  object
 1   description  97167 non-null  object
 2   title        97167 non-null  object
 3   starring     97167 non-null  object
 4   movie_id     97167 non-null  object
dtypes: object(5)
memory usage: 4.4+ MB


In [148]:
meta_df.loc[meta_df['description'].str.len() < 20, 'description'] = np.nan
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 97167 entries, 8 to 203764
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   genre        97167 non-null  object
 1   description  95381 non-null  object
 2   title        97167 non-null  object
 3   starring     97167 non-null  object
 4   movie_id     97167 non-null  object
dtypes: object(5)
memory usage: 4.4+ MB


In [149]:
meta_df.dropna(subset='description', inplace=True)
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 95381 entries, 9 to 203764
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   genre        95381 non-null  object
 1   description  95381 non-null  object
 2   title        95381 non-null  object
 3   starring     95381 non-null  object
 4   movie_id     95381 non-null  object
dtypes: object(5)
memory usage: 4.4+ MB
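
The description cleaning in the last three cells (join the fragments, drop duplicates, discard texts shorter than 20 characters) can be condensed into one helper. This is a sketch on made-up rows: `clean_descriptions` and `min_chars` are hypothetical names, and the short-text filter is done in one step instead of marking values as NaN first.

```python
import ast

import pandas as pd


def clean_descriptions(meta, min_chars=20):
    """Join description fragments, drop duplicates, and drop very short texts."""
    out = meta.copy()
    out['description'] = out['description'].apply(
        lambda raw: ' '.join(ast.literal_eval(raw)).strip()
    )
    out = out.drop_duplicates(subset='description')
    # Keep only descriptions long enough to carry usable plot information
    return out[out['description'].str.len() >= min_chars]


toy_meta = pd.DataFrame({
    'movie_id': ['m1', 'm2', 'm3', 'm4'],
    'description': [
        "['A detective returns', 'to her hometown to solve a cold case.']",
        "['A detective returns', 'to her hometown to solve a cold case.']",
        "['N/A']",
        "[]",
    ],
})
cleaned = clean_descriptions(toy_meta)
print(cleaned['movie_id'].tolist())
```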


In [150]:
meta_df.to_csv('./data/descript_cont_based.csv', encoding='utf-8', index=False)

___

# Part 2: Collaborative Filtering Dataframes

In [75]:
df_reviews = pd.read_csv('./data/reviews.csv')
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8765568 entries, 0 to 8765567
Data columns (total 12 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   overall         int64 
 1   verified        bool  
 2   reviewTime      object
 3   reviewerID      object
 4   asin            object
 5   style           object
 6   reviewerName    object
 7   reviewText      object
 8   summary         object
 9   unixReviewTime  int64 
 10  vote            object
 11  image           object
dtypes: bool(1), int64(2), object(9)
memory usage: 744.0+ MB


In [76]:
df_reviews.drop(['image', 'reviewTime', 'reviewerName', 'summary',
              'vote', 'unixReviewTime'], axis=1, inplace=True)

df_reviews.rename(columns={'overall': 'rating', 'asin': 'movie_id',
                          'reviewerID': 'user_id','reviewText':'reviews'}, inplace=True)
df_reviews.head()

|   | rating | verified | user_id | movie_id | style | reviews |
|---|--------|----------|---------|----------|-------|---------|
| 0 | 5 | True | A3478QRKQDOPQ2 | 1527665 | {'Format:': ' VHS Tape'} | really happy they got evangelised .. spoiler alert==happy ending liked that..since started bit worrisome... but yeah great stories these missionary movies, really short only half hour but still great |
| 1 | 5 | True | A2VHSG6TZHU1OB | 1527665 | {'Format:': ' Amazon Video'} | Having lived in West New Guinea (Papua) during the time period covered in this video, it is realistic, accurate, and conveys well the entrance of light and truth into a culture that was for centuries dead to and alienated from God. |
| 2 | 5 | False | A23EJWOW1TLENE | 1527665 | {'Format:': ' Amazon Video'} | Excellent look into contextualizing the Gospel and God's sovereignty over cultural barriers. The book and movie are both captivating. I would definitely recommend to both Christians and non-believers. |
| 3 | 5 | True | A1KM9FNEJ8Q171 | 1527665 | {'Format:': ' Amazon Video'} | More than anything, I've been challenged to find ways to share Christ is a culturally relevant way to those around me. Peace child is a cherished "how to" for me to do that. |
| 4 | 4 | True | A38LY2SSHVHRYB | 1527665 | {'Format:': ' Amazon Video'} | This is a great movie for a missionary going into a foreign country, especially one that is not used to foreign presence. But, it was a little on the short side. |


In [77]:
df_reviews = df_reviews[df_reviews['verified']]
df_reviews.drop(columns = 'verified', inplace=True)
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6731296 entries, 0 to 8765566
Data columns (total 5 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   rating    int64 
 1   user_id   object
 2   movie_id  object
 3   style     object
 4   reviews   object
dtypes: int64(1), object(4)
memory usage: 308.1+ MB


In [78]:
rm_format = df_reviews['style'].apply(lambda x: isinstance(x, str) and "{'Format:" in x)
df_reviews = df_reviews.loc[rm_format]

df_reviews['style'] = df_reviews['style'].apply(lambda x: ast.literal_eval(x))
df_reviews['style'] = df_reviews['style'].apply(lambda x: x['Format:'])
df_reviews['style'] = df_reviews['style'].astype(str)

df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6445983 entries, 0 to 8765561
Data columns (total 5 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   rating    int64 
 1   user_id   object
 2   movie_id  object
 3   style     object
 4   reviews   object
dtypes: int64(1), object(4)
memory usage: 295.1+ MB
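
As an illustration of the `style` parsing above, the sketch below extracts the format from one value at a time. The function name `extract_format` is hypothetical; the check for `{'Format:` mirrors the filter used in the cell.

```python
import ast


def extract_format(style):
    """Return the 'Format:' value from a stringified style dict, or None."""
    if not isinstance(style, str) or "{'Format:" not in style:
        return None  # missing value, or a style entry that does not start with a format
    return ast.literal_eval(style)['Format:']


vhs = extract_format("{'Format:': ' VHS Tape'}")
missing = extract_format(float('nan'))
print(repr(vhs))
print(missing)
```

Note that the raw values keep their leading space (`' VHS Tape'`), so any later comparison against format names has to include it or strip it first.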


In [173]:
df_revtex_revs = df_reviews.copy()
df_revtex_revs.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6445983 entries, 0 to 8765561
Data columns (total 5 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   rating    int64 
 1   user_id   object
 2   movie_id  object
 3   style     object
 4   reviews   object
dtypes: int64(1), object(4)
memory usage: 295.1+ MB


In [174]:
all_vid = revtex_meta['movie_id'].unique().tolist()
df_revtex_revs = df_revtex_revs[df_revtex_revs['movie_id'].isin(all_vid)]
df_revtex_revs.dropna(subset=['reviews'], inplace=True)
df_revtex_revs.drop_duplicates(subset=['reviews'], keep='first', inplace=True)
df_revtex_revs = df_revtex_revs[df_revtex_revs['style'].isin(df_revtex_revs['style'].value_counts()[df_revtex_revs['style'].value_counts() >= 25000].index)]
df_revtex_revs.drop(columns='style', inplace=True)
df_revtex_revs.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2346445 entries, 1 to 8765561
Data columns (total 4 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   rating    int64 
 1   user_id   object
 2   movie_id  object
 3   reviews   object
dtypes: int64(1), object(3)
memory usage: 89.5+ MB


In [176]:
all_vid2 = df_revtex_revs['movie_id'].unique().tolist()
revtex_meta = revtex_meta[revtex_meta['movie_id'].isin(all_vid2)]
revtex_meta.info()

<class 'pandas.core.frame.DataFrame'>
Index: 79659 entries, 9 to 203763
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   genre        79659 non-null  object
 1   description  79659 non-null  object
 2   title        79659 non-null  object
 3   starring     79659 non-null  object
 4   movie_id     79659 non-null  object
dtypes: object(5)
memory usage: 3.6+ MB


In [178]:
revtex_merged_df = pd.merge(df_revtex_revs, revtex_meta, on="movie_id", how="left")
revtex_merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2346445 entries, 0 to 2346444
Data columns (total 8 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   rating       int64 
 1   user_id      object
 2   movie_id     object
 3   reviews      object
 4   genre        object
 5   description  object
 6   title        object
 7   starring     object
dtypes: int64(1), object(7)
memory usage: 143.2+ MB
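
A left merge like the one above keeps every review row only if `movie_id` is unique on the metadata side. As a hedged sketch on toy data (the frames and titles are made up), pandas can check that assumption through the `validate` and `indicator` parameters of `pd.merge`:

```python
import pandas as pd

toy_reviews = pd.DataFrame({
    'user_id': ['u1', 'u2', 'u3'],
    'movie_id': ['m1', 'm1', 'm2'],
    'rating': [5, 4, 3],
})
toy_meta = pd.DataFrame({
    'movie_id': ['m1', 'm2'],
    'title': ['First Film', 'Second Film'],
})

# validate raises a MergeError if movie_id is not unique in the metadata;
# indicator adds a _merge column showing whether each review found a match
merged = pd.merge(
    toy_reviews, toy_meta, on='movie_id', how='left',
    validate='many_to_one', indicator=True,
)
same_length = len(merged) == len(toy_reviews)
all_matched = bool((merged['_merge'] == 'both').all())
print(same_length, all_matched)
```

In the notebook the row count is unchanged by the merge (2,346,445 before and after), which is consistent with one metadata row per `movie_id`.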


In [195]:
revtex_merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2346445 entries, 0 to 2346444
Data columns (total 8 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   rating       int64 
 1   user_id      object
 2   movie_id     object
 3   reviews      object
 4   genre        object
 5   description  object
 6   title        object
 7   starring     object
dtypes: int64(1), object(7)
memory usage: 143.2+ MB


In [194]:
revtex_merged_df.to_csv('./data/revtext_cont_based.csv', encoding='utf-8', index=False)

___

In [221]:
df_collab = df_reviews.copy()
df_collab.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6445983 entries, 0 to 8765561
Data columns (total 5 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   rating    int64 
 1   user_id   object
 2   movie_id  object
 3   style     object
 4   reviews   object
dtypes: int64(1), object(4)
memory usage: 295.1+ MB


In [211]:
col_meta = col_meta.copy()  # col_meta was saved earlier, before unknown genres were dropped
col_meta.info()

<class 'pandas.core.frame.DataFrame'>
Index: 168110 entries, 0 to 203765
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   genre        168110 non-null  object
 1   description  168110 non-null  object
 2   title        168110 non-null  object
 3   starring     168110 non-null  object
 4   movie_id     168110 non-null  object
dtypes: object(5)
memory usage: 7.7+ MB


In [222]:
all_vid = col_meta['movie_id'].unique().tolist()
df_collab = df_collab[df_collab['movie_id'].isin(all_vid)]
df_collab.drop_duplicates(subset=['user_id', 'movie_id'], keep='first', inplace=True)
df_collab = df_collab[df_collab['user_id'].isin(df_collab['user_id'].value_counts()[df_collab['user_id'].value_counts() >= 4].index)]
df_collab = df_collab[df_collab['style'].isin(df_collab['style'].value_counts()[df_collab['style'].value_counts() >= 25000].index)]
df_collab.drop(columns='style', inplace=True)
df_collab.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2348972 entries, 21 to 8765560
Data columns (total 4 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   rating    int64 
 1   user_id   object
 2   movie_id  object
 3   reviews   object
dtypes: int64(1), object(3)
memory usage: 89.6+ MB


In [223]:
df_collab['movie_id'].nunique()

93649

In [224]:
all_vid = df_collab['movie_id'].unique().tolist()
col_meta = col_meta[col_meta['movie_id'].isin(all_vid)]
col_meta.info()

<class 'pandas.core.frame.DataFrame'>
Index: 93649 entries, 3 to 203763
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   genre        93649 non-null  object
 1   description  93649 non-null  object
 2   title        93649 non-null  object
 3   starring     93649 non-null  object
 4   movie_id     93649 non-null  object
dtypes: object(5)
memory usage: 4.3+ MB


In [225]:
collab_merged_df = pd.merge(df_collab, col_meta, on="movie_id", how="left")
collab_merged_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2348972 entries, 0 to 2348971
Data columns (total 8 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   rating       int64 
 1   user_id      object
 2   movie_id     object
 3   reviews      object
 4   genre        object
 5   description  object
 6   title        object
 7   starring     object
dtypes: int64(1), object(7)
memory usage: 143.4+ MB


In [226]:
print(collab_merged_df['rating'].isnull().sum())
print(collab_merged_df['user_id'].isnull().sum())
print(collab_merged_df['movie_id'].isnull().sum())
print(collab_merged_df['reviews'].isnull().sum())
print(collab_merged_df['genre'].isnull().sum())
print(collab_merged_df['title'].isnull().sum())
print(collab_merged_df['starring'].isnull().sum())

0
0
0
1762
0
0
0
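
The per-column checks above can also be collected in a single call. A minimal sketch on a made-up frame (`toy_merged` is hypothetical):

```python
import numpy as np
import pandas as pd

toy_merged = pd.DataFrame({
    'rating': [5, 4, 3],
    'reviews': ['great', np.nan, 'fine'],
    'title': ['First Film', 'First Film', 'Second Film'],
})

# One call returns the null count for every column, indexed by column name
null_counts = toy_merged.isnull().sum()
print(null_counts)
```

This form covers every column, including any that a hand-written list of `print` calls might leave out.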


In [227]:
collab_merged_df.to_csv('./data/collab_merged.csv', encoding='utf-8', index=False)