Remaining Tasks:  

1) Conclusion  
2) Objective  

# Amazon Product Recommender

**Jason Truong**  **|**  Jasontruong19@gmail.com  **|**  September 25, 2022

**Problem Statement:** Based on a person's review of a product, can items with similar properties be recommended to them so that they are more inclined to buy? This project adds business value by improving the customer buying experience on Amazon through personalized product recommendations. These recommendations influence customer purchases by making it easier for customers to find similar products.

***

# 1b Meta Data Cleaning

**Note**: This is notebook **2 of 5** for building the recommendation system.

This Jupyter notebook processes the meta data for the Movies and TV Amazon reviews. The data was sourced from http://deepyeti.ucsd.edu/jianmo/amazon/index.html

**Citing source:** Justifying recommendations using distantly-labeled reviews and fine-grained aspects
Jianmo Ni, Jiacheng Li, Julian McAuley
Empirical Methods in Natural Language Processing (EMNLP), 2019


# Table of Contents

1. **[Introduction](#1Introduction)**  
2. **[Data Processing](#2Preliminary)**  
    2.1. [Process `Date` column](#2.1date)  
    2.2. [Process columns `tech1`, `fit`, `tech2` and `similar_item`](#2.2multiple)  
    2.3. [Process `price` column](#2.3price)  
    2.4. [Process `feature` column](#2.4feature)  
    2.5. [Process `main_cat` column](#2.5maincat)   
    2.6. [Process `title` column](#2.6title)  
    2.7. [Process `description` column](#2.7description)  
    2.8. [Process `category` column](#2.8category)  
    2.9. [Process `rank` column](#2.9rank)  
    2.10. [Process `details` column](#2.10detail)  
    2.11. [Process `asin` column](#2.11asin)  
    2.12. [Remove the Also_Buy and Also_view columns](#2.12alsobuy)  
    2.13. [Process Duplicates](#2.13duplicates)  
    2.14. [Which brands produce the most films?](#2.14mostfilm)  
3. **[Export Preprocessed Data](#3export)**  
4. **[Summary](#4summary)**  

# 1. Introduction <a class ='anchor' id='1Introduction'></a>

In the first notebook, the review dataset was processed so that a model can be trained to try to predict the sentiment of the review. This notebook will focus on processing the product meta data so that it can be used in the recommendation system.

# 2. Data Processing<a class ='anchor' id='2Preliminary'></a>

The necessary base packages will be imported below.

In [1]:
# Import dependencies
import random
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

Read in the data from the json file

In [2]:
# If the dataset is too big and only a portion of the data can be loaded in
# meta_data = []
# with open('meta_Movies_and_TV.json','r') as metafile:
#     for x in range(50000):
#         meta_data.append(json.loads(next(metafile)))

In [3]:
meta_data = []

# Read in every line of the json file; the with-block closes the file afterwards
with open('meta_Movies_and_TV.json', 'r') as metafile:
    for line in metafile:
        meta_data.append(json.loads(line))
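
The partial and full loading strategies above can be folded into one helper. The following is a minimal sketch rather than the notebook's own code: `load_json_lines` and its `limit` parameter are hypothetical names, and the demo writes a tiny temporary file with made-up records instead of reading `meta_Movies_and_TV.json`.

```python
import itertools
import json
import os
import tempfile


def load_json_lines(path, limit=None):
    """Parse a JSON-lines file into a list of dicts (hypothetical helper).

    limit=None reads every line; an integer reads only the first `limit` lines.
    """
    with open(path, 'r') as handle:
        # islice stops early, so the rest of the file is never parsed
        return [json.loads(line) for line in itertools.islice(handle, limit)]


# Demo on a throwaway file holding three made-up records
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    for asin in ['A1', 'A2', 'A3']:
        tmp.write(json.dumps({'asin': asin, 'title': ''}) + '\n')
    demo_path = tmp.name

first_two = load_json_lines(demo_path, limit=2)
everything = load_json_lines(demo_path)
os.remove(demo_path)
```

Passing a `limit` mirrors the commented-out 50,000-line variant, while the default reads the whole file.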

In [4]:
meta_data[0]

{'category': ['Movies & TV', 'Movies'],
 'tech1': '',
 'description': [],
 'fit': '',
 'title': 'Understanding Seizures and Epilepsy',
 'also_buy': [],
 'tech2': '',
 'brand': '',
 'feature': [],
 'rank': '886,503 in Movies & TV (',
 'also_view': [],
 'main_cat': 'Movies & TV',
 'similar_item': '',
 'date': '',
 'price': '',
 'asin': '0000695009',
 'imageURL': [],
 'imageURLHighRes': []}

Create dataframe

In [5]:
meta_df = pd.DataFrame(meta_data)

Check the data types and see if there are any null values in the dataframe

In [6]:
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203766 entries, 0 to 203765
Data columns (total 19 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   category         203766 non-null  object
 1   tech1            203766 non-null  object
 2   description      203766 non-null  object
 3   fit              203766 non-null  object
 4   title            203766 non-null  object
 5   also_buy         203766 non-null  object
 6   tech2            203766 non-null  object
 7   brand            203766 non-null  object
 8   feature          203766 non-null  object
 9   rank             203766 non-null  object
 10  also_view        203766 non-null  object
 11  main_cat         203766 non-null  object
 12  similar_item     203766 non-null  object
 13  date             203766 non-null  object
 14  price            203766 non-null  object
 15  asin             203766 non-null  object
 16  imageURL         203766 non-null  object
 17  imageURLHi

At first glance, there seem to be null values in the `details` column.

In [7]:
meta_df.head()

Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
0,"[Movies & TV, Movies]",,[],,Understanding Seizures and Epilepsy,[],,,[],"886,503 in Movies & TV (",[],Movies & TV,,,,695009,[],[],
1,"[Movies & TV, Movies]",,[],,Spirit Led&mdash;Moving By Grace In The Holy S...,[],,,[],"342,688 in Movies & TV (",[],Movies & TV,,,,791156,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
2,"[Movies & TV, Movies]",,[Disc 1: Flour Power (Scones; Shortcakes; Sout...,,My Fair Pastry (Good Eats Vol. 9),[],,Alton Brown,[],"370,026 in Movies & TV (",[],Movies & TV,,,,143529,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
3,"[Movies & TV, Movies]",,[Barefoot Contessa Volume 2: On these three di...,,"Barefoot Contessa (with Ina Garten), Entertain...","[B002I5GNW4, B005WXPVMM, B009UY3W8O, B00N27ID1...",,Ina Garten,[],"342,914 in Movies & TV (","[B002I5GNW4, 0804187045, B009UY3W8O, 060960219...",Movies & TV,,,$74.95,143588,[],[],
4,"[Movies & TV, Movies]",,[Rise and Swine (Good Eats Vol. 7) includes bo...,,Rise and Swine (Good Eats Vol. 7),"[B000P1CKES, B000NR4CRM]",,Alton Brown,[],"351,684 in Movies & TV (",[B0015SVNXY],Movies & TV,,,,143502,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,


The overview of the dataframe shows that there are many empty cells.

**Image data is not being considered in this NLP recommendation project so `imageURL` and `imageURLHighRes` will be dropped**

In [8]:
# Drop columns:
meta_df.drop(columns = ['imageURL','imageURLHighRes'], inplace = True)

In [9]:
# Check dataframe:
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203766 entries, 0 to 203765
Data columns (total 17 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   category      203766 non-null  object
 1   tech1         203766 non-null  object
 2   description   203766 non-null  object
 3   fit           203766 non-null  object
 4   title         203766 non-null  object
 5   also_buy      203766 non-null  object
 6   tech2         203766 non-null  object
 7   brand         203766 non-null  object
 8   feature       203766 non-null  object
 9   rank          203766 non-null  object
 10  also_view     203766 non-null  object
 11  main_cat      203766 non-null  object
 12  similar_item  203766 non-null  object
 13  date          203766 non-null  object
 14  price         203766 non-null  object
 15  asin          203766 non-null  object
 16  details       195392 non-null  object
dtypes: object(17)
memory usage: 26.4+ MB


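Columns `tech1`, `fit`, `tech2`, `similar_item`, `date`, `price`, `feature`, `rank` and `details` may contain missing values. Apart from `details`, these appear as empty strings or empty lists, which `info()` counts as non-null.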
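
Because `info()` only reports true nulls, a per-column check that also treats empty strings and empty lists as blank can give a quicker overview. The sketch below is an illustration under assumptions: `blank_share` is a hypothetical helper and `toy_df` is made-up data shaped like the metadata, not the real file.

```python
import pandas as pd


def blank_share(df):
    """Fraction of cells per column that are null, '' or an empty list/dict."""
    def is_blank(value):
        # Strings, lists and dicts are blank when they have no content
        if isinstance(value, (str, list, dict)):
            return len(value) == 0
        # Anything else is blank only if it is a null marker (None/NaN)
        return pd.isna(value)

    return df.apply(lambda column: column.map(is_blank).mean())


# Made-up rows that mimic the shape of the metadata
toy_df = pd.DataFrame({
    'date': ['', '', '', 'June 1, 2010'],
    'feature': [[], [], [], ['Officially Licensed']],
    'title': ['A', 'B', 'C', 'D'],
    'details': [None, {'ASIN: ': 'X'}, None, {}],
})
shares = blank_share(toy_df)
```

Each entry of `shares` is a value between 0 and 1, so columns can be ranked by how empty they are before deciding which to inspect first.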

## 2.1 Process `date` column<a class ='anchor' id='2.1date'></a>

In this section, the `date` column will be preprocessed.

In [10]:
# Count the blank dates
blank_dates = sum(meta_df['date'] == "")

In [11]:
# Determine the proportion of blank dates
(blank_dates/meta_df.shape[0])*100

99.9813511577005

99.98% of the `date` column is blank, so the column will be dropped.

In [12]:
meta_df.drop(columns = 'date', inplace = True)

Check that the column has been dropped

In [13]:
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203766 entries, 0 to 203765
Data columns (total 16 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   category      203766 non-null  object
 1   tech1         203766 non-null  object
 2   description   203766 non-null  object
 3   fit           203766 non-null  object
 4   title         203766 non-null  object
 5   also_buy      203766 non-null  object
 6   tech2         203766 non-null  object
 7   brand         203766 non-null  object
 8   feature       203766 non-null  object
 9   rank          203766 non-null  object
 10  also_view     203766 non-null  object
 11  main_cat      203766 non-null  object
 12  similar_item  203766 non-null  object
 13  price         203766 non-null  object
 14  asin          203766 non-null  object
 15  details       195392 non-null  object
dtypes: object(16)
memory usage: 24.9+ MB


## 2.2 Process columns `tech1`, `fit`, `tech2` and `similar_item` <a class ='anchor' id='2.2multiple'></a>

At first glance, these columns appear to be empty, so they are checked below. If they are indeed empty, they will be removed.

In [14]:
print('# of empty cells in tech1 column:')
print(sum(meta_df['tech1'] == ""))

# of empty cells in tech1 column:
203760


Only 6 out of 203766 entries in the tech1 column have data so this column can be dropped.

In [15]:
print('# of empty cells in fit column:')
print(sum(meta_df['fit'] == ""))

# of empty cells in fit column:
203766


In [16]:
print('# of empty cells in tech2 column:')
print(sum(meta_df['tech2'] == ""))

# of empty cells in tech2 column:
203766


In [17]:
print('# of empty cells in similar_item column:')
print(sum(meta_df['similar_item'] == ""))

# of empty cells in similar_item column:
203766


Over 99% of the entries in each of the above columns are empty, so these columns can be dropped.

In [18]:
# Drop columns:
meta_df.drop(columns = ['tech1','fit','tech2','similar_item'], inplace = True)

# Check dataframe:
meta_df.info()
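
The four separate checks above follow the same pattern, so they could be expressed as one threshold rule. This is a minimal sketch on made-up data: `BLANK_THRESHOLD` is a hypothetical cutoff chosen for illustration, and the comparison only covers string columns (list-valued columns would need a length check instead).

```python
import pandas as pd

# Hypothetical cutoff: drop a column when more than 99% of its cells are ''
BLANK_THRESHOLD = 0.99

# Made-up frame: 'fit' is always blank, 'brand' is mostly populated
toy_df = pd.DataFrame({
    'fit': [''] * 200,
    'brand': ['Alton Brown'] * 150 + [''] * 50,
    'asin': [f'{i:010d}' for i in range(200)],
})

# Share of empty strings per column, computed in one pass
blank_fraction = toy_df.eq('').mean()

# Names of the columns whose blank share exceeds the cutoff
to_drop = blank_fraction[blank_fraction > BLANK_THRESHOLD].index.tolist()
trimmed_df = toy_df.drop(columns=to_drop)
```

Keeping the cutoff in a named constant makes the drop decision explicit and easy to revisit.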

## 2.3 Process `price` column <a class ='anchor' id='2.3price'></a>

In [20]:
sum(meta_df['price'] == "")/meta_df['price'].shape[0]

0.45650893672153353

About 45.7% of the `price` column is blank, so just over half of the rows contain data. The column is therefore inspected further before deciding whether to drop it.

In [21]:
meta_df[meta_df['price'] != ""]

Unnamed: 0,category,description,title,also_buy,brand,feature,rank,also_view,main_cat,price,asin,details
3,"[Movies & TV, Movies]",[Barefoot Contessa Volume 2: On these three di...,"Barefoot Contessa (with Ina Garten), Entertain...","[B002I5GNW4, B005WXPVMM, B009UY3W8O, B00N27ID1...",Ina Garten,[],"342,914 in Movies & TV (","[B002I5GNW4, 0804187045, B009UY3W8O, 060960219...",Movies & TV,$74.95,0000143588,
7,"[Movies & TV, Movies]","[Giada de Laurentis on ""Everyday Italian"" DVDs...","Everyday Italian (with Giada de Laurentiis), V...","[B002I5GNVU, B000KL8ODE]",,[],"409,173 in Movies & TV (",[B000KL8ODE],Movies & TV,$24.95,0000143561,
14,"[Movies & TV, Christian Video, General]",[An early movie edition of the life of Jesus.],Where Jesus Walked VHS,[],,[],"1,108,385 in Movies & TV (",[],Movies & TV,.a-box-inner{background-color:#fff}#alohaBuyBo...,0005000009,
15,"[Movies & TV, Genre for Featured Categories, F...",[],Passion According to Luke VHS,[],,[],"692,122 in Movies & TV (",[],Movies & TV,.a-box-inner{background-color:#fff}#alohaBuyBo...,0005054885,
16,"[Movies & TV, Movies]",[VHS],The Homecoming Kids Camp Out VHS,[],,[],"390,381 in Movies & TV (",[],Movies & TV,.a-box-inner{background-color:#fff}#alohaBuyBo...,0005059836,
...,...,...,...,...,...,...,...,...,...,...,...,...
203758,"[Movies & TV, Movies]","[Between the years of 1954 and 1963, the Rebbe...",The Rebbe's Nigunim,"[B00V3UOVBA, B07J22VS9C, B06WGSL4J6, B00HSC0OJ...",Benny Friedman,[],"120,487 in Movies & TV (",[B07J22VS9C],Movies & TV,$21.99,B01HIQHDAS,{'ASIN: ': 'B01HIQHDAS'}
203759,"[Movies & TV, HBO, All HBO Titles]","[<![CDATA[, All the Way (Digital HD/DVD), Brya...",All the Way (Digital HD/DVD),"[B077SKJCWZ, B001BSBC16, B00CLH08KC, B076ZR2F9...",Various,[],"50,367 in Movies & TV (","[B077SKJCWZ, B001BSBC16, B00CLH08KC, B078952HS...",Movies & TV,$7.90,B01HIUL6WU,"{'Language:': 'English', 'Dubbed:': 'English',..."
203761,"[Movies & TV, Studio Specials, Sony Pictures H...",[Tony Award-winning director Bartlett Sher pro...,Verdi: Otello,"[B01MXE4EVV, B014HFML6E, B07HGR7P4Z, B000F3T3C...",Sonya Yoncheva,[],"68,026 in Movies & TV (","[B07BF2PQHM, B00012AST0, B0094AH3Q6, B0007P0LP...",Movies & TV,$14.98,B01HJ1INB0,"{'Language:': 'English', 'Subtitles:': 'Italia..."
203762,"[Movies & TV, Movies]",[Mr. Miracle DVD Region 2 need an all region D...,Mr. Miracle - Ihn schickt der Himmel,[],,[],"344,483 in Movies & TV (",[],Movies & TV,$16.18,B01HJ3E0PQ,"{'Language:': 'German (Dolby Digital 5.1), Eng..."


After reviewing the goal of the project, the price of an item will not be taken into consideration by the recommendation system: nearly half of the price values are missing, and some of the remaining values (for example, rows 14 to 16 above) contain page-styling text rather than a price. In future work, these prices will be determined by web scraping the Amazon site for more accurate pricing. This column will therefore be dropped.
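
If prices are revisited in future work, the valid entries would first need to be separated from the blanks and the styling text. The sketch below is one possible approach, not part of the original pipeline: `parse_price` is a hypothetical helper, and it assumes a valid price is a string of the exact form `$74.95`, optionally with thousands separators.

```python
import pandas as pd


def parse_price(prices):
    """Convert strings like '$74.95' to floats; anything else becomes NaN."""
    # Capture the numeric part only when the whole string looks like a price
    extracted = prices.str.extract(r'^\$([\d,]+(?:\.\d+)?)$', expand=False)
    # Remove thousands separators, then convert; non-matches stay NaN
    return pd.to_numeric(extracted.str.replace(',', '', regex=False),
                         errors='coerce')


# Made-up sample covering a price, a blank, styling text and a large price
sample = pd.Series([
    '$74.95',
    '',
    '.a-box-inner{background-color:#fff}',
    '$1,299.00',
])
parsed = parse_price(sample)
```

Anchoring the pattern at both ends of the string is what keeps the styling text from being mistaken for a price.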

In [22]:
meta_df.drop(columns='price', inplace=True)

# Check results
meta_df.head()

Unnamed: 0,category,description,title,also_buy,brand,feature,rank,also_view,main_cat,asin,details
0,"[Movies & TV, Movies]",[],Understanding Seizures and Epilepsy,[],,[],"886,503 in Movies & TV (",[],Movies & TV,695009,
1,"[Movies & TV, Movies]",[],Spirit Led&mdash;Moving By Grace In The Holy S...,[],,[],"342,688 in Movies & TV (",[],Movies & TV,791156,
2,"[Movies & TV, Movies]",[Disc 1: Flour Power (Scones; Shortcakes; Sout...,My Fair Pastry (Good Eats Vol. 9),[],Alton Brown,[],"370,026 in Movies & TV (",[],Movies & TV,143529,
3,"[Movies & TV, Movies]",[Barefoot Contessa Volume 2: On these three di...,"Barefoot Contessa (with Ina Garten), Entertain...","[B002I5GNW4, B005WXPVMM, B009UY3W8O, B00N27ID1...",Ina Garten,[],"342,914 in Movies & TV (","[B002I5GNW4, 0804187045, B009UY3W8O, 060960219...",Movies & TV,143588,
4,"[Movies & TV, Movies]",[Rise and Swine (Good Eats Vol. 7) includes bo...,Rise and Swine (Good Eats Vol. 7),"[B000P1CKES, B000NR4CRM]",Alton Brown,[],"351,684 in Movies & TV (",[B0015SVNXY],Movies & TV,143502,


## 2.4 Process `feature` column <a class ='anchor' id='2.4feature'></a>

In [23]:
meta_df['feature'].value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas\_libs\hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  203594
[Officially Licensed, Highest Quality Recording]                                                                                             

In [24]:
# Count the rows whose feature list is empty
blank_features = (meta_df['feature'].str.len() == 0).sum()

In [25]:
# Determine the percentage of blank features
(blank_features/meta_df.shape[0])*100

99.91558945064438

99.9% of this column is blank, so this column should be dropped.
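
The `TypeError: unhashable type: 'list'` message in the output above arises because each cell of `feature` is a Python list, and lists cannot be hashed for counting. A minimal sketch of one workaround follows, using made-up data; joining each list into a single string is an assumption made for illustration and is not the notebook's method.

```python
import pandas as pd

# Made-up stand-in for meta_df['feature']: each cell is a list of strings
features = pd.Series([
    [],
    [],
    ['Officially Licensed', 'Highest Quality Recording'],
    [],
])

# Lists cannot be hashed, so flatten each one to a single string first
feature_text = features.map(lambda items: ' | '.join(items))
counts = feature_text.value_counts()

# Percentage of rows whose feature list is empty
blank_pct = features.map(len).eq(0).mean() * 100
```

Empty lists become empty strings after joining, so `counts['']` gives the number of blank rows without relying on the ordering of the counts.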

In [26]:
# Drop columns:
meta_df.drop(columns = ['feature'], inplace = True)

In [27]:
# Check dataframe:
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203766 entries, 0 to 203765
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   category     203766 non-null  object
 1   description  203766 non-null  object
 2   title        203766 non-null  object
 3   also_buy     203766 non-null  object
 4   brand        203766 non-null  object
 5   rank         203766 non-null  object
 6   also_view    203766 non-null  object
 7   main_cat     203766 non-null  object
 8   asin         203766 non-null  object
 9   details      195392 non-null  object
dtypes: object(10)
memory usage: 15.5+ MB


## 2.5 Process `main_cat` column <a class ='anchor' id='2.5maincat'></a>

This section will be for processing the `main_cat` column. The value counts method will be used to determine the data in the main category column.

In [28]:
meta_df['main_cat'].value_counts()

Movies & TV                                                                                                                                                                  203464
Sports & Outdoors                                                                                                                                                               118
Books                                                                                                                                                                            35
Amazon Home                                                                                                                                                                      33
<img src="https://images-na.ssl-images-amazon.com/images/G/01/digital/music/logos/amzn_music_logo_subnav._CB471835632_.png" class="nav-categ-image" alt="Digital Music"/>        31
Toys & Games                                                                                        

In [29]:
# Check the proportion that each value represents in the dataset
meta_df['main_cat'].value_counts()/meta_df.shape[0]

Since the Movies & TV category represents the vast majority of the data (203,464 of 203,766 rows), the other categories will be dropped to focus on Movies & TV.
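
The proportions can also be obtained directly from `value_counts`. The following is a small sketch on made-up labels; the variable names are hypothetical and the real column has far more rows and values.

```python
import pandas as pd

# Made-up category labels standing in for meta_df['main_cat']
main_cat = pd.Series(['Movies & TV'] * 8 + ['Books', 'Toys & Games'])

# normalize=True returns fractions that sum to 1 instead of raw counts
proportions = main_cat.value_counts(normalize=True)

# Label with the largest share
dominant = proportions.idxmax()

# Keep only the rows in the dominant category and renumber them from 0
kept = main_cat[main_cat == dominant].reset_index(drop=True)
```

Selecting the dominant label programmatically avoids hard-coding the category name, although hard-coding is arguably clearer when the target category is known in advance.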

In [30]:
# Keep only the rows where the main category is Movies & TV and store them in meta_df.
meta_df = meta_df[meta_df['main_cat'] == 'Movies & TV'].reset_index(drop=True)

# Check results
meta_df.head()

Unnamed: 0,category,description,title,also_buy,brand,rank,also_view,main_cat,asin,details
0,"[Movies & TV, Movies]",[],Understanding Seizures and Epilepsy,[],,"886,503 in Movies & TV (",[],Movies & TV,0000695009,
1,"[Movies & TV, Movies]",[],Spirit Led&mdash;Moving By Grace In The Holy S...,[],,"342,688 in Movies & TV (",[],Movies & TV,0000791156,
2,"[Movies & TV, Movies]",[Disc 1: Flour Power (Scones; Shortcakes; Sout...,My Fair Pastry (Good Eats Vol. 9),[],Alton Brown,"370,026 in Movies & TV (",[],Movies & TV,0000143529,
3,"[Movies & TV, Movies]",[Barefoot Contessa Volume 2: On these three di...,"Barefoot Contessa (with Ina Garten), Entertain...","[B002I5GNW4, B005WXPVMM, B009UY3W8O, B00N27ID1...",Ina Garten,"342,914 in Movies & TV (","[B002I5GNW4, 0804187045, B009UY3W8O, 060960219...",Movies & TV,0000143588,
4,"[Movies & TV, Movies]",[Rise and Swine (Good Eats Vol. 7) includes bo...,Rise and Swine (Good Eats Vol. 7),"[B000P1CKES, B000NR4CRM]",Alton Brown,"351,684 in Movies & TV (",[B0015SVNXY],Movies & TV,0000143502,
...,...,...,...,...,...,...,...,...,...,...
203459,"[Movies & TV, Studio Specials, Sony Pictures H...",[Tony Award-winning director Bartlett Sher pro...,Verdi: Otello,"[B01MXE4EVV, B014HFML6E, B07HGR7P4Z, B000F3T3C...",Sonya Yoncheva,"68,026 in Movies & TV (","[B07BF2PQHM, B00012AST0, B0094AH3Q6, B0007P0LP...",Movies & TV,B01HJ1INB0,"{'Language:': 'English', 'Subtitles:': 'Italia..."
203460,"[Movies & TV, Movies]",[Mr. Miracle DVD Region 2 need an all region D...,Mr. Miracle - Ihn schickt der Himmel,[],,"344,483 in Movies & TV (",[],Movies & TV,B01HJ3E0PQ,"{'Language:': 'German (Dolby Digital 5.1), Eng..."
203461,"[Movies & TV, Independently Distributed, Drama]",[The President and his family rule the land wi...,The President,"[B01M6DA5RJ, B0002F6BFG, B06XRGPHM3, B06XG2N43...",Misha Gomiashvili,"199,854 in Movies & TV (","[B01GWCHP2U, B0009PW3RE]",Movies & TV,B01HJ6R77G,"{'Language:': 'Georgian', 'Subtitles:': 'Engli..."
203462,"[Movies & TV, Genre for Featured Categories, A...","[Filmed in 1972 as Pope Joan, Michael Anderson...",She.....Who Would Be Pope,[B004L690XW],Liv Ullmann,"246,494 in Movies & TV (",[B0000C23HX],Movies & TV,B01HJCCLOY,"{'Language:': 'English', 'ASIN: ': 'B01HJCCLOY'}"


In [31]:
meta_df['main_cat'].value_counts()

Movies & TV    203464
Name: main_cat, dtype: int64

Since the data only contains Movies & TV, the main_cat column can be dropped because it isn't providing any useful information.

In [32]:
# Drop columns:
meta_df.drop(columns = 'main_cat', inplace = True)

In [33]:
# Check dataframe:
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203464 entries, 0 to 203463
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   category     203464 non-null  object
 1   description  203464 non-null  object
 2   title        203464 non-null  object
 3   also_buy     203464 non-null  object
 4   brand        203464 non-null  object
 5   rank         203464 non-null  object
 6   also_view    203464 non-null  object
 7   asin         203464 non-null  object
 8   details      195104 non-null  object
dtypes: object(9)
memory usage: 14.0+ MB


## 2.6 Process `title` column <a class ='anchor' id='2.6title'></a>

In this section, the title will be preprocessed.

In [34]:
# Check the unique values in this column
meta_df['title'].value_counts()

In [35]:
sum(meta_df['title'] =='')/meta_df['title'].shape[0]

0.0002899775881728463

The `title` column contains the title of the movie, and only about 0.03% of its entries are empty strings, so this column will be kept.
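
Some titles in the previews above contain HTML entities such as `&mdash;`. If these need to be decoded before text processing, the standard library's `html.unescape` is one option. This is a sketch on a few sample strings and not a step the notebook performs.

```python
import html

import pandas as pd

# Two titles copied from the preview above plus one empty string
titles = pd.Series([
    'Spirit Led&mdash;Moving By Grace In The Holy S...',
    'My Fair Pastry (Good Eats Vol. 9)',
    '',
])

# html.unescape turns entities such as &mdash; into the character they name
clean_titles = titles.map(html.unescape).str.strip()

# Share of titles that are empty after cleaning
empty_share = clean_titles.eq('').mean()
```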

## 2.7 Process `description` column <a class ='anchor' id='2.7description'></a>

The `description` column will be preprocessed in this section. Each entry in the `description` column is a list, so the column will be expanded into a separate dataframe, one column per list element, to check the contents of the lists.

In [36]:
# Convert the description column to a separate dataframe
desc_split_df = pd.DataFrame(meta_df['description'].tolist()).add_prefix('description_')

# Check results
desc_split_df
    

Unnamed: 0,description_0,description_1,description_2,description_3,description_4,description_5,description_6,description_7,description_8,description_9,...,description_597,description_598,description_599,description_600,description_601,description_602,description_603,description_604,description_605,description_606
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,Disc 1: Flour Power (Scones; Shortcakes; South...,,,,,,,,,,...,,,,,,,,,,
3,Barefoot Contessa Volume 2: On these three dis...,,,,,,,,,,...,,,,,,,,,,
4,Rise and Swine (Good Eats Vol. 7) includes bon...,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
203459,Tony Award-winning director Bartlett Sher prob...,,,,,,,,,,...,,,,,,,,,,
203460,Mr. Miracle DVD Region 2 need an all region DV...,,,,,,,,,,...,,,,,,,,,,
203461,The President and his family rule the land wit...,WINNER - Best Film - Chicago Int'l Film Festiv...,,,,,,,,,...,,,,,,,,,,
203462,"Filmed in 1972 as Pope Joan, Michael Andersons...",,,,,,,,,,...,,,,,,,,,,


There seem to be a lot of empty cells in the above dataframe, so the proportion of empty cells in each column is checked below.

In [37]:
(desc_split_df.isna().sum())/desc_split_df.shape[0]

description_0      0.142295
description_1      0.790346
description_2      0.944147
description_3      0.970437
description_4      0.980011
                     ...   
description_602    0.999995
description_603    0.999995
description_604    0.999995
description_605    0.999995
description_606    0.999995
Length: 607, dtype: float64

Keep only the first column: at least 79% of the values are null in every other column, and the first column holds the main description, which is the most important portion.
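
Since only the first list element is kept, it can also be pulled out directly without building the wide intermediate dataframe. The sketch below uses made-up lists and hypothetical variable names; it is an alternative illustration, not the notebook's approach.

```python
import pandas as pd

# Made-up stand-in for meta_df['description']: each cell is a list of strings
descriptions = pd.Series([
    [],
    ['Disc 1: Flour Power'],
    ['The President and his family rule the land', 'WINNER - Best Film'],
])

# Take the first element of each list, or None when the list is empty
first_description = descriptions.map(lambda parts: parts[0] if parts else None)

# Fraction of rows with no main description
missing_share = first_description.isna().mean()
```

Empty lists map to `None`, which `isna()` treats as missing, so the result can be filtered in the same way as a `description_0` column built from the expanded dataframe.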

In [38]:
# Add the new description column into the original dataframe
meta_df['description_0'] = desc_split_df['description_0']

# Check results
meta_df.head()

Unnamed: 0,category,description,title,also_buy,brand,rank,also_view,asin,details,description_0
0,"[Movies & TV, Movies]",[],Understanding Seizures and Epilepsy,[],,"886,503 in Movies & TV (",[],0000695009,,
1,"[Movies & TV, Movies]",[],Spirit Led&mdash;Moving By Grace In The Holy S...,[],,"342,688 in Movies & TV (",[],0000791156,,
2,"[Movies & TV, Movies]",[Disc 1: Flour Power (Scones; Shortcakes; Sout...,My Fair Pastry (Good Eats Vol. 9),[],Alton Brown,"370,026 in Movies & TV (",[],0000143529,,Disc 1: Flour Power (Scones; Shortcakes; South...
3,"[Movies & TV, Movies]",[Barefoot Contessa Volume 2: On these three di...,"Barefoot Contessa (with Ina Garten), Entertain...","[B002I5GNW4, B005WXPVMM, B009UY3W8O, B00N27ID1...",Ina Garten,"342,914 in Movies & TV (","[B002I5GNW4, 0804187045, B009UY3W8O, 060960219...",0000143588,,Barefoot Contessa Volume 2: On these three dis...
4,"[Movies & TV, Movies]",[Rise and Swine (Good Eats Vol. 7) includes bo...,Rise and Swine (Good Eats Vol. 7),"[B000P1CKES, B000NR4CRM]",Alton Brown,"351,684 in Movies & TV (",[B0015SVNXY],0000143502,,Rise and Swine (Good Eats Vol. 7) includes bon...
...,...,...,...,...,...,...,...,...,...,...
203459,"[Movies & TV, Studio Specials, Sony Pictures H...",[Tony Award-winning director Bartlett Sher pro...,Verdi: Otello,"[B01MXE4EVV, B014HFML6E, B07HGR7P4Z, B000F3T3C...",Sonya Yoncheva,"68,026 in Movies & TV (","[B07BF2PQHM, B00012AST0, B0094AH3Q6, B0007P0LP...",B01HJ1INB0,"{'Language:': 'English', 'Subtitles:': 'Italia...",Tony Award-winning director Bartlett Sher prob...
203460,"[Movies & TV, Movies]",[Mr. Miracle DVD Region 2 need an all region D...,Mr. Miracle - Ihn schickt der Himmel,[],,"344,483 in Movies & TV (",[],B01HJ3E0PQ,"{'Language:': 'German (Dolby Digital 5.1), Eng...",Mr. Miracle DVD Region 2 need an all region DV...
203461,"[Movies & TV, Independently Distributed, Drama]",[The President and his family rule the land wi...,The President,"[B01M6DA5RJ, B0002F6BFG, B06XRGPHM3, B06XG2N43...",Misha Gomiashvili,"199,854 in Movies & TV (","[B01GWCHP2U, B0009PW3RE]",B01HJ6R77G,"{'Language:': 'Georgian', 'Subtitles:': 'Engli...",The President and his family rule the land wit...
203462,"[Movies & TV, Genre for Featured Categories, A...","[Filmed in 1972 as Pope Joan, Michael Anderson...",She.....Who Would Be Pope,[B004L690XW],Liv Ullmann,"246,494 in Movies & TV (",[B0000C23HX],B01HJCCLOY,"{'Language:': 'English', 'ASIN: ': 'B01HJCCLOY'}","Filmed in 1972 as Pope Joan, Michael Andersons..."


Drop the original description column

In [39]:
# Drop column
meta_df.drop(columns = 'description', inplace = True)

# Check results
meta_df.head()

Unnamed: 0,category,title,also_buy,brand,rank,also_view,asin,details,description_0
0,"[Movies & TV, Movies]",Understanding Seizures and Epilepsy,[],,"886,503 in Movies & TV (",[],0000695009,,
1,"[Movies & TV, Movies]",Spirit Led&mdash;Moving By Grace In The Holy S...,[],,"342,688 in Movies & TV (",[],0000791156,,
2,"[Movies & TV, Movies]",My Fair Pastry (Good Eats Vol. 9),[],Alton Brown,"370,026 in Movies & TV (",[],0000143529,,Disc 1: Flour Power (Scones; Shortcakes; South...
3,"[Movies & TV, Movies]","Barefoot Contessa (with Ina Garten), Entertain...","[B002I5GNW4, B005WXPVMM, B009UY3W8O, B00N27ID1...",Ina Garten,"342,914 in Movies & TV (","[B002I5GNW4, 0804187045, B009UY3W8O, 060960219...",0000143588,,Barefoot Contessa Volume 2: On these three dis...
4,"[Movies & TV, Movies]",Rise and Swine (Good Eats Vol. 7),"[B000P1CKES, B000NR4CRM]",Alton Brown,"351,684 in Movies & TV (",[B0015SVNXY],0000143502,,Rise and Swine (Good Eats Vol. 7) includes bon...
...,...,...,...,...,...,...,...,...,...
203459,"[Movies & TV, Studio Specials, Sony Pictures H...",Verdi: Otello,"[B01MXE4EVV, B014HFML6E, B07HGR7P4Z, B000F3T3C...",Sonya Yoncheva,"68,026 in Movies & TV (","[B07BF2PQHM, B00012AST0, B0094AH3Q6, B0007P0LP...",B01HJ1INB0,"{'Language:': 'English', 'Subtitles:': 'Italia...",Tony Award-winning director Bartlett Sher prob...
203460,"[Movies & TV, Movies]",Mr. Miracle - Ihn schickt der Himmel,[],,"344,483 in Movies & TV (",[],B01HJ3E0PQ,"{'Language:': 'German (Dolby Digital 5.1), Eng...",Mr. Miracle DVD Region 2 need an all region DV...
203461,"[Movies & TV, Independently Distributed, Drama]",The President,"[B01M6DA5RJ, B0002F6BFG, B06XRGPHM3, B06XG2N43...",Misha Gomiashvili,"199,854 in Movies & TV (","[B01GWCHP2U, B0009PW3RE]",B01HJ6R77G,"{'Language:': 'Georgian', 'Subtitles:': 'Engli...",The President and his family rule the land wit...
203462,"[Movies & TV, Genre for Featured Categories, A...",She.....Who Would Be Pope,[B004L690XW],Liv Ullmann,"246,494 in Movies & TV (",[B0000C23HX],B01HJCCLOY,"{'Language:': 'English', 'ASIN: ': 'B01HJCCLOY'}","Filmed in 1972 as Pope Joan, Michael Andersons..."


Check the rows where the description is empty, because the recommendation system will be based on the description column.

In [40]:
# Find where there are no descriptions
meta_df['description_0'].isna()

0          True
1          True
2         False
3         False
4         False
          ...  
203459    False
203460    False
203461    False
203462    False
203463    False
Name: description_0, Length: 203464, dtype: bool

In [41]:
meta_df[meta_df['description_0'].isna()]

Unnamed: 0,category,title,also_buy,brand,rank,also_view,asin,details,description_0
0,"[Movies & TV, Movies]",Understanding Seizures and Epilepsy,[],,"886,503 in Movies & TV (",[],0000695009,,
1,"[Movies & TV, Movies]",Spirit Led&mdash;Moving By Grace In The Holy S...,[],,"342,688 in Movies & TV (",[],0000791156,,
13,"[Movies & TV, Art House & International, By Or...",Peace Child VHS,[],,"866,012 in Movies & TV (",[],0001527665,,
15,"[Movies & TV, Genre for Featured Categories, F...",Passion According to Luke VHS,[],,"692,122 in Movies & TV (",[],0005054885,,
18,"[Movies & TV, Movies]",Dance for Joy - Worship and Praise Through Dav...,[],,"710,984 in Movies & TV (",[],0005022134,,
...,...,...,...,...,...,...,...,...,...
203426,"[Movies & TV, Movies]",The Rezort,"[B07HSK1LX4, B06XC76G5K, B0192PERT8, B01N56KV0...",,"170,836 in Movies & TV (","[B07624VF27, B01GGJZNX2]",B01HGRJUGE,"{'Language:': 'German (Dolby Digital 5.1), Eng...",
203431,"[Movies & TV, Movies]",The Raw Chef Academy Raw Decadent Desserts Rus...,[],Russell James,"538,198 in Movies & TV (",[],B01HH17FAM,{'ASIN: ': 'B01HH17FAM'},
203439,"[Movies & TV, Movies]",Trading Spaces 4 Pack (The Best of Trading Spa...,[],,"301,993 in Movies & TV (",[],B01HH5I2RI,{'ASIN: ': 'B01HH5I2RI'},
203442,"[Movies & TV, TV]","Castle : Complete Collection, DVD (Series Seas...","[B07GJLNJR7, B01N95Z9L8, B00UCJ6TPE, B07D51563...",,"178,932 in Movies & TV (","[B07FB3P2WC, B06XRQ935Z, B003L77FYS, B01E7XTFN...",B01HH95M2M,{'ASIN: ': 'B01HH95M2M'},


From a quick scan of the above, some rows that have no description do have information in the `also_buy` and `also_view` columns. These columns contain important information about the movies, but due to time constraints they will be incorporated into the recommendation system in future work. For simplicity, only the rows that have a description will be kept.

In [42]:
# Keep the rows where there is a value in the description
meta_df = meta_df[meta_df['description_0'].notna()]

# Check results
na_remain = meta_df['description_0'].isna().sum()

print(f'There are {na_remain} null values remaining in the description column.')

There are 0 null values remaining in the description column.


In [43]:
# Reset the index
meta_df.reset_index(drop = True, inplace = True)

# Check results
meta_df.head()

Unnamed: 0,category,title,also_buy,brand,rank,also_view,asin,details,description_0
0,"[Movies & TV, Movies]",My Fair Pastry (Good Eats Vol. 9),[],Alton Brown,"370,026 in Movies & TV (",[],0000143529,,Disc 1: Flour Power (Scones; Shortcakes; South...
1,"[Movies & TV, Movies]","Barefoot Contessa (with Ina Garten), Entertain...","[B002I5GNW4, B005WXPVMM, B009UY3W8O, B00N27ID1...",Ina Garten,"342,914 in Movies & TV (","[B002I5GNW4, 0804187045, B009UY3W8O, 060960219...",0000143588,,Barefoot Contessa Volume 2: On these three dis...
2,"[Movies & TV, Movies]",Rise and Swine (Good Eats Vol. 7),"[B000P1CKES, B000NR4CRM]",Alton Brown,"351,684 in Movies & TV (",[B0015SVNXY],0000143502,,Rise and Swine (Good Eats Vol. 7) includes bon...
3,"[Movies & TV, Genre for Featured Categories, E...",The Power of the Cross Joseph Prince,[],Joseph Prince,"444,474 in Movies & TV (",[],000073991X,,Have failures in your life caused you to feel ...
4,"[Movies & TV, Movies]",Live in Houston [VHS],[],Douglas Miller,"1,005,955 in Movies & TV (",[],000107461X,,Track Listings 1. Come On Everybody 2. My Stre...
...,...,...,...,...,...,...,...,...,...
174507,"[Movies & TV, Studio Specials, Sony Pictures H...",Verdi: Otello,"[B01MXE4EVV, B014HFML6E, B07HGR7P4Z, B000F3T3C...",Sonya Yoncheva,"68,026 in Movies & TV (","[B07BF2PQHM, B00012AST0, B0094AH3Q6, B0007P0LP...",B01HJ1INB0,"{'Language:': 'English', 'Subtitles:': 'Italia...",Tony Award-winning director Bartlett Sher prob...
174508,"[Movies & TV, Movies]",Mr. Miracle - Ihn schickt der Himmel,[],,"344,483 in Movies & TV (",[],B01HJ3E0PQ,"{'Language:': 'German (Dolby Digital 5.1), Eng...",Mr. Miracle DVD Region 2 need an all region DV...
174509,"[Movies & TV, Independently Distributed, Drama]",The President,"[B01M6DA5RJ, B0002F6BFG, B06XRGPHM3, B06XG2N43...",Misha Gomiashvili,"199,854 in Movies & TV (","[B01GWCHP2U, B0009PW3RE]",B01HJ6R77G,"{'Language:': 'Georgian', 'Subtitles:': 'Engli...",The President and his family rule the land wit...
174510,"[Movies & TV, Genre for Featured Categories, A...",She.....Who Would Be Pope,[B004L690XW],Liv Ullmann,"246,494 in Movies & TV (",[B0000C23HX],B01HJCCLOY,"{'Language:': 'English', 'ASIN: ': 'B01HJCCLOY'}","Filmed in 1972 as Pope Joan, Michael Andersons..."


In [44]:
meta_df['description_0'].isna().sum()

0

## 2.8 Process `category` column<a class ='anchor' id='2.8category'></a>

This section will focus on preprocessing the category column.

In [45]:
# Check the values in the category column.
meta_df['category'].value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas\_libs\hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[Movies & TV, Movies]                                               19249
[Movies & TV, Genre for Featured Categories, Action & Adventure]     9442
[Movies & TV, Genre for Featured Categories, Drama]                  8271
[Movies
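
The `TypeError: unhashable type: 'list'` message appears because each entry in `category` is a Python list, and lists cannot be hashed for counting. A minimal sketch of one workaround, using a small made-up Series in place of `meta_df['category']`, joins each list into a single string before counting. The `" > "` separator is an arbitrary choice for illustration.

```python
import pandas as pd

# Toy stand-in for meta_df['category']; the values are illustrative only.
category = pd.Series([
    ["Movies & TV", "Movies"],
    ["Movies & TV", "Genre for Featured Categories", "Drama"],
    ["Movies & TV", "Movies"],
])

# Strings are hashable, so value_counts can group identical category paths.
category_counts = category.map(" > ".join).value_counts()
print(category_counts)
```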

Each entry in the category column is a list of category labels, so the lists can be expanded into one column per label.

In [46]:
cate_split_df = pd.DataFrame(meta_df['category'].tolist()).add_prefix('category_')

# Check results
cate_split_df

Unnamed: 0,category_0,category_1,category_2,category_3,category_4,category_5,category_6
0,Movies & TV,Movies,,,,,
1,Movies & TV,Movies,,,,,
2,Movies & TV,Movies,,,,,
3,Movies & TV,Genre for Featured Categories,Exercise & Fitness,,,,
4,Movies & TV,Movies,,,,,
...,...,...,...,...,...,...,...
174507,Movies & TV,Studio Specials,Sony Pictures Home Entertainment,All Sony Pictures Titles,,,
174508,Movies & TV,Movies,,,,,
174509,Movies & TV,Independently Distributed,Drama,,,,
174510,Movies & TV,Genre for Featured Categories,Action & Adventure,,,,
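
To make the expansion step concrete, here is a small sketch on made-up data. It shows that lists of different lengths are padded with missing values, and it passes `index=` so the new frame keeps the row labels of the source Series (an optional safeguard that the cell above does not need, because `meta_df` had its index reset beforehand).

```python
import pandas as pd

# Toy stand-in for meta_df['category'] with non-default row labels.
categories = pd.Series(
    [
        ["Movies & TV", "Movies"],
        ["Movies & TV", "Genre for Featured Categories", "Drama"],
    ],
    index=[10, 42],
)

# One column per list position; shorter lists are padded with None.
expanded = pd.DataFrame(categories.tolist(), index=categories.index).add_prefix("category_")
print(expanded)
```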


Check the values in the first category column.

In [47]:
cate_split_df['category_0'].value_counts()

Movies & TV    174340
Name: category_0, dtype: int64

The first column only states that every item belongs to Movies & TV, so it carries no useful information and will be removed.

In [48]:
cate_split_df.drop(columns= 'category_0', inplace = True)

# Check results
cate_split_df

Unnamed: 0,category_1,category_2,category_3,category_4,category_5,category_6
0,Movies,,,,,
1,Movies,,,,,
2,Movies,,,,,
3,Genre for Featured Categories,Exercise & Fitness,,,,
4,Movies,,,,,
...,...,...,...,...,...,...
174507,Studio Specials,Sony Pictures Home Entertainment,All Sony Pictures Titles,,,
174508,Movies,,,,,
174509,Independently Distributed,Drama,,,,
174510,Genre for Featured Categories,Action & Adventure,,,,


Check the proportion of null values in each column.

In [49]:
cate_split_df.isna().sum()/cate_split_df.shape[0]

category_1    0.000986
category_2    0.151651
category_3    0.842183
category_4    0.983720
category_5    0.999983
category_6    0.999994
dtype: float64

Drop categories 3-6 because the majority of their values are null.

In [50]:
cate_split_df = cate_split_df[['category_1','category_2']]

# Check results
cate_split_df

Unnamed: 0,category_1,category_2
0,Movies,
1,Movies,
2,Movies,
3,Genre for Featured Categories,Exercise & Fitness
4,Movies,
...,...,...
174507,Studio Specials,Sony Pictures Home Entertainment
174508,Movies,
174509,Independently Distributed,Drama
174510,Genre for Featured Categories,Action & Adventure
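
The column selection above is hard-coded from the null proportions. As a sketch only, the same decision could be expressed with a threshold; the 0.5 cut-off and the toy frame below are assumptions for illustration, not values taken from this notebook.

```python
import pandas as pd

# Toy stand-in for cate_split_df.
cate_split = pd.DataFrame({
    "category_1": ["Movies", "Movies", "Studio Specials", "Movies"],
    "category_2": ["Drama", None, "Sony Pictures Home Entertainment", "Comedy"],
    "category_3": [None, None, "All Sony Pictures Titles", None],
})

MAX_NULL_FRACTION = 0.5  # assumed cut-off

# isna().mean() gives the share of missing values per column.
null_fraction = cate_split.isna().mean()
kept = cate_split.loc[:, null_fraction <= MAX_NULL_FRACTION]
print(list(kept.columns))
```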


Combine the category dataframe above with `meta_df`, dropping the original `category` column from `meta_df`.

In [4]:
meta_df = pd.concat([meta_df.drop(columns = 'category'), cate_split_df], axis = 1)

# Check results
meta_df.head()
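
A point worth checking at this step: `pd.concat(..., axis = 1)` matches rows by index label, not by position. `cate_split_df` was built from a plain list, so it carries a fresh 0..n-1 index, which lines up with `meta_df` only because `meta_df` had its index reset earlier. The sketch below uses made-up frames to show what would happen otherwise.

```python
import pandas as pd

# 'left' mimics a filtered frame whose index has gaps; 'right' has a fresh index.
left = pd.DataFrame({"title": ["a", "b", "c"]}, index=[0, 2, 5])
right = pd.DataFrame({"category_1": ["x", "y", "z"]})

# Labels 0, 2, 5 versus 0, 1, 2: rows are matched by label and gaps become NaN.
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first makes the labels agree row for row.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned)
print(aligned)
```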


## 2.9 Process `rank` column<a class ='anchor' id='2.9rank'></a>

The different values in the rank column will be analyzed.

In [52]:
meta_df['rank'].value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas\_libs\hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[]                                                                                         836
350,519 in Movies & TV (                                                                     7
356,475 in Movies & TV (                                                                     6
68,427 in Movies & TV (                                                                      6
78,926 in Movies & TV (                                                                      6
                                                                                          ... 
110,254 in Movies & TV (                                                                     1
376,906 in Movies & TV (                                                                     1
203,879 in Movies & TV (                                                                     1
85,362 in Movies & TV (                                                                      1
[>#303,487 in Movies & TV (See Top 100 in Movies &

The text in the rank column is unnecessary, so it will be removed to leave only the rank number.

In [74]:
# Remove the trailing " in Movies & TV (" text so that only the number remains
meta_df['rank'] = meta_df['rank'].str.replace(r" in Movies & TV \(", "", regex = True)

# Remove the thousands separators (a literal comma, so no regex is needed)
meta_df['rank'] = meta_df['rank'].str.replace(",", "", regex = False)

# Check results
meta_df

Unnamed: 0,title,also_buy,brand,rank,also_view,product_id,description_0,category_1,category_2
0,My Fair Pastry (Good Eats Vol. 9),[],Alton Brown,370026,[],0000143529,Disc 1: Flour Power (Scones; Shortcakes; South...,Movies,
1,"Barefoot Contessa (with Ina Garten), Entertain...","[B002I5GNW4, B005WXPVMM, B009UY3W8O, B00N27ID1...",Ina Garten,342914,"[B002I5GNW4, 0804187045, B009UY3W8O, 060960219...",0000143588,Barefoot Contessa Volume 2: On these three dis...,Movies,
2,Rise and Swine (Good Eats Vol. 7),"[B000P1CKES, B000NR4CRM]",Alton Brown,351684,[B0015SVNXY],0000143502,Rise and Swine (Good Eats Vol. 7) includes bon...,Movies,
3,The Power of the Cross Joseph Prince,[],Joseph Prince,444474,[],000073991X,Have failures in your life caused you to feel ...,Genre for Featured Categories,Exercise & Fitness
4,Live in Houston [VHS],[],Douglas Miller,1005955,[],000107461X,Track Listings 1. Come On Everybody 2. My Stre...,Movies,
...,...,...,...,...,...,...,...,...,...
174507,Verdi: Otello,"[B01MXE4EVV, B014HFML6E, B07HGR7P4Z, B000F3T3C...",Sonya Yoncheva,68026,"[B07BF2PQHM, B00012AST0, B0094AH3Q6, B0007P0LP...",B01HJ1INB0,Tony Award-winning director Bartlett Sher prob...,Studio Specials,Sony Pictures Home Entertainment
174508,Mr. Miracle - Ihn schickt der Himmel,[],,344483,[],B01HJ3E0PQ,Mr. Miracle DVD Region 2 need an all region DV...,Movies,
174509,The President,"[B01M6DA5RJ, B0002F6BFG, B06XRGPHM3, B06XG2N43...",Misha Gomiashvili,199854,"[B01GWCHP2U, B0009PW3RE]",B01HJ6R77G,The President and his family rule the land wit...,Independently Distributed,Drama
174510,She.....Who Would Be Pope,[B004L690XW],Liv Ullmann,246494,[B0000C23HX],B01HJCCLOY,"Filmed in 1972 as Pope Joan, Michael Andersons...",Genre for Featured Categories,Action & Adventure
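
After this step the ranks are still stored as text. If a numeric rank is needed later, one possible approach is sketched below on made-up values. The helper `rank_to_int` is hypothetical, and treating non-string entries (such as the list-valued ones seen in the value counts) as missing is an assumption rather than something this notebook does.

```python
import re
import pandas as pd

def rank_to_int(value):
    """Return the leading number of a rank string, or None if there is none."""
    if not isinstance(value, str):
        return None  # assumed handling for list-valued or missing entries
    match = re.match(r"\s*(\d[\d,]*)", value)
    return int(match.group(1).replace(",", "")) if match else None

# Toy stand-in for the raw rank column.
raw_rank = pd.Series(["350,519 in Movies & TV (", "68,427 in Movies & TV (", [], None])

# Nullable Int64 keeps whole numbers while still allowing missing values.
rank_numeric = pd.to_numeric(raw_rank.map(rank_to_int), errors="coerce").astype("Int64")
print(rank_numeric)
```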


## 2.10 Process `details` column<a class ='anchor' id='2.10detail'></a>

In this section, the details column will be analyzed.

In [54]:
#meta_df['details'].value_counts()

The details column seems to contain the same information as other columns in the dataframe. A check will be performed to see whether this is the case.

In [55]:
# Randomly sample from the meta_df details column to see what kind of data is stored in this column
for i in random.sample(range(0, meta_df.shape[0]), 25):
    print(meta_df['details'][i])

{'Language:': 'English', 'ASIN: ': 'B000HC2PLS'}
{'Language:': 'English', 'ASIN: ': 'B014L2YKQ2'}
{'Language:': 'English', 'ASIN: ': 'B0019BI114'}
{'Language:': 'English', 'ASIN: ': 'B007AAF1FW'}
{'Language:': 'English', 'ASIN: ': 'B000GAKQT4'}
{'Language:': 'Japanese (Mono), English (Dolby Digital 5.1)', 'Subtitles:': 'English', 'ASIN: ': 'B002Y0KR70'}
{'Language:': 'English', 'ASIN: ': 'B00GXCK85Q'}
{'Language:': 'Spanish', 'Subtitles:': 'French, English', 'Dubbed:': 'French', 'ASIN: ': 'B000ZBEOEM'}
{'ASIN: ': 'B001OFM2VK'}
{'Language:': 'Italian (Dolby Digital 2.0), English (Dolby Digital 2.0)', 'ASIN: ': 'B000FH0SDM'}
{'Language:': 'English (Dolby Digital 2.0), Spanish (Dolby Digital 2.0)', 'Subtitles:': 'Spanish', 'ASIN: ': 'B00NW88NTM'}
{'Language:': 'English', 'ASIN: ': 'B00008H2HE'}
{'Language:': 'English (Dolby Digital 5.1), French (Dolby Digital 2.0 Stereo), Spanish (Dolby Digital 2.0 Stereo)', 'Subtitles:': 'English, French, Spanish', 'Dubbed:': 'French, Spanish', 'ASIN: ':
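
The sample suggests that the `'ASIN: '` entry repeats the `asin` column. Rather than judging by eye, the overlap could be measured directly. The following is a sketch on made-up rows; the helper `asin_from_details` is hypothetical, and the `'ASIN: '` key (with its trailing space) is copied from the output above.

```python
import pandas as pd

# Toy stand-in for the asin and details columns of meta_df.
sample = pd.DataFrame({
    "asin": ["B000HC2PLS", "B001OFM2VK", "0000143529"],
    "details": [
        {"Language:": "English", "ASIN: ": "B000HC2PLS"},
        {"ASIN: ": "B001OFM2VK"},
        None,
    ],
})

def asin_from_details(details):
    """Pull the ASIN out of a details dict, or return None if unavailable."""
    return details.get("ASIN: ") if isinstance(details, dict) else None

details_asin = sample["details"].map(asin_from_details)
has_details_asin = details_asin.notna()

# Share of rows (among those with an ASIN in details) that agree with the asin column.
agreement = (details_asin[has_details_asin] == sample.loc[has_details_asin, "asin"]).mean()
print(f"{has_details_asin.sum()} rows with an ASIN in details, agreement = {agreement:.0%}")
```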

Most of the data in this column is language-related information (audio language, subtitles and dubbing), which is not needed for the NLP review-based recommendation system, and the ASIN entry appears to repeat the `asin` column. This column can therefore be dropped.

In [56]:
# The details column will be dropped.
meta_df.drop(columns = 'details',inplace = True)

# Check results
meta_df.head()

## 2.11 Process `asin` column<a class ='anchor' id='2.11asin'></a>

In this section, the `asin` column will be preprocessed.

In [58]:
# Rename the column to a more understandable label
meta_df.rename(columns = {'asin':'product_id'}, inplace = True)

# Check results
meta_df.head()

Unnamed: 0,title,also_buy,brand,rank,also_view,product_id,description_0,category_1,category_2
0,My Fair Pastry (Good Eats Vol. 9),[],Alton Brown,370026,[],0000143529,Disc 1: Flour Power (Scones; Shortcakes; South...,Movies,
1,"Barefoot Contessa (with Ina Garten), Entertain...","[B002I5GNW4, B005WXPVMM, B009UY3W8O, B00N27ID1...",Ina Garten,342914,"[B002I5GNW4, 0804187045, B009UY3W8O, 060960219...",0000143588,Barefoot Contessa Volume 2: On these three dis...,Movies,
2,Rise and Swine (Good Eats Vol. 7),"[B000P1CKES, B000NR4CRM]",Alton Brown,351684,[B0015SVNXY],0000143502,Rise and Swine (Good Eats Vol. 7) includes bon...,Movies,
3,The Power of the Cross Joseph Prince,[],Joseph Prince,444474,[],000073991X,Have failures in your life caused you to feel ...,Genre for Featured Categories,Exercise & Fitness
4,Live in Houston [VHS],[],Douglas Miller,1005955,[],000107461X,Track Listings 1. Come On Everybody 2. My Stre...,Movies,
...,...,...,...,...,...,...,...,...,...
174507,Verdi: Otello,"[B01MXE4EVV, B014HFML6E, B07HGR7P4Z, B000F3T3C...",Sonya Yoncheva,68026,"[B07BF2PQHM, B00012AST0, B0094AH3Q6, B0007P0LP...",B01HJ1INB0,Tony Award-winning director Bartlett Sher prob...,Studio Specials,Sony Pictures Home Entertainment
174508,Mr. Miracle - Ihn schickt der Himmel,[],,344483,[],B01HJ3E0PQ,Mr. Miracle DVD Region 2 need an all region DV...,Movies,
174509,The President,"[B01M6DA5RJ, B0002F6BFG, B06XRGPHM3, B06XG2N43...",Misha Gomiashvili,199854,"[B01GWCHP2U, B0009PW3RE]",B01HJ6R77G,The President and his family rule the land wit...,Independently Distributed,Drama
174510,She.....Who Would Be Pope,[B004L690XW],Liv Ullmann,246494,[B0000C23HX],B01HJCCLOY,"Filmed in 1972 as Pope Joan, Michael Andersons...",Genre for Featured Categories,Action & Adventure


## 2.12 Remove the Also_Buy and Also_view columns<a class ='anchor' id='2.12alsobuy'></a>

Since the also_buy and also_view columns will not be used in this project, they will be dropped.

In [76]:
# Drop columns
new_df = meta_df.drop(columns = ['also_buy','also_view'])

# Check results
new_df.head()

Unnamed: 0,title,brand,rank,product_id,description_0,category_1,category_2
0,My Fair Pastry (Good Eats Vol. 9),Alton Brown,370026,0000143529,Disc 1: Flour Power (Scones; Shortcakes; South...,Movies,
1,"Barefoot Contessa (with Ina Garten), Entertain...",Ina Garten,342914,0000143588,Barefoot Contessa Volume 2: On these three dis...,Movies,
2,Rise and Swine (Good Eats Vol. 7),Alton Brown,351684,0000143502,Rise and Swine (Good Eats Vol. 7) includes bon...,Movies,
3,The Power of the Cross Joseph Prince,Joseph Prince,444474,000073991X,Have failures in your life caused you to feel ...,Genre for Featured Categories,Exercise & Fitness
4,Live in Houston [VHS],Douglas Miller,1005955,000107461X,Track Listings 1. Come On Everybody 2. My Stre...,Movies,
...,...,...,...,...,...,...,...
174507,Verdi: Otello,Sonya Yoncheva,68026,B01HJ1INB0,Tony Award-winning director Bartlett Sher prob...,Studio Specials,Sony Pictures Home Entertainment
174508,Mr. Miracle - Ihn schickt der Himmel,,344483,B01HJ3E0PQ,Mr. Miracle DVD Region 2 need an all region DV...,Movies,
174509,The President,Misha Gomiashvili,199854,B01HJ6R77G,The President and his family rule the land wit...,Independently Distributed,Drama
174510,She.....Who Would Be Pope,Liv Ullmann,246494,B01HJCCLOY,"Filmed in 1972 as Pope Joan, Michael Andersons...",Genre for Featured Categories,Action & Adventure


## 2.13 Drop the duplicated rows <a class ='anchor' id='2.13duplicates'></a>

This section will deal with processing the duplicated rows.

In [60]:
# Remove duplicates
new_df.drop_duplicates(inplace = True,ignore_index = True)

# Check results
new_df.head()

Unnamed: 0,title,brand,rank,product_id,description_0,category_1,category_2
0,My Fair Pastry (Good Eats Vol. 9),Alton Brown,370026,0000143529,Disc 1: Flour Power (Scones; Shortcakes; South...,Movies,
1,"Barefoot Contessa (with Ina Garten), Entertain...",Ina Garten,342914,0000143588,Barefoot Contessa Volume 2: On these three dis...,Movies,
2,Rise and Swine (Good Eats Vol. 7),Alton Brown,351684,0000143502,Rise and Swine (Good Eats Vol. 7) includes bon...,Movies,
3,The Power of the Cross Joseph Prince,Joseph Prince,444474,000073991X,Have failures in your life caused you to feel ...,Genre for Featured Categories,Exercise & Fitness
4,Live in Houston [VHS],Douglas Miller,1005955,000107461X,Track Listings 1. Come On Everybody 2. My Stre...,Movies,
...,...,...,...,...,...,...,...
156476,Verdi: Otello,Sonya Yoncheva,68026,B01HJ1INB0,Tony Award-winning director Bartlett Sher prob...,Studio Specials,Sony Pictures Home Entertainment
156477,Mr. Miracle - Ihn schickt der Himmel,,344483,B01HJ3E0PQ,Mr. Miracle DVD Region 2 need an all region DV...,Movies,
156478,The President,Misha Gomiashvili,199854,B01HJ6R77G,The President and his family rule the land wit...,Independently Distributed,Drama
156479,She.....Who Would Be Pope,Liv Ullmann,246494,B01HJCCLOY,"Filmed in 1972 as Pope Joan, Michael Andersons...",Genre for Featured Categories,Action & Adventure


Check the `product_id` values to determine whether any are still duplicated.

In [61]:
new_df['product_id'].value_counts()

630315851X    1
B004V8ZPMK    1
B00LX66388    1
B000056YPK    1
6303118097    1
             ..
B005HZC93Y    1
6305538522    1
B004D8TXB2    1
B000A2UBKW    1
B00AMQGBYQ    1
Name: product_id, Length: 156481, dtype: int64

Every `product_id` appears exactly once, so each row is unique and the duplicated rows are gone.
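
The same check can be written as a single boolean. The sketch below, on made-up rows, also illustrates the difference between removing fully identical rows (as done above) and removing rows that share a key; using `subset='product_id'` is shown only as an alternative, not as what this notebook does.

```python
import pandas as pd

# Toy data: two rows share a product_id but differ in title.
products = pd.DataFrame({
    "product_id": ["0000143529", "0000143588", "0000143529"],
    "title": ["My Fair Pastry", "Barefoot Contessa", "My Fair Pastry (reissue)"],
})

# Whole-row de-duplication keeps all three rows because no two rows are identical.
whole_row = products.drop_duplicates(ignore_index=True)

# De-duplicating on the key alone keeps the first row for each product_id.
by_key = products.drop_duplicates(subset="product_id", ignore_index=True)

print(whole_row["product_id"].is_unique)  # False
print(by_key["product_id"].is_unique)     # True
```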

## 2.14 Which brands produce the most films? <a class ='anchor' id='2.14mostfilm'></a>

In [70]:
# Determine how many films each brand produced
new_df['brand'].value_counts().head(20)

                          43749
Various                    3021
.                          1027
n/a                         494
-                           418
Learn more                  338
Sinister Cinema             265
None                        264
\n                          264
*                           232
John Wayne                  186
none                        153
na                          119
LeVar Burton                113
VARIOUS                     111
William Shatner             106
Artist Not Provided         106
N/A                          91
various                      89
The Ambient Collection       87
Name: brand, dtype: int64

A large share of the films do not have a usable brand: the most common value is an empty string, and many of the other top values are placeholders such as `Various`, `.`, `n/a` and `-`. These values would need to be cleaned and imputed to better determine which brands produce the most films. Due to time constraints, this is left as future work.
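
As a starting point for that future work, the placeholder values could be mapped to missing before counting. This is a sketch only: the `PLACEHOLDERS` set is an assumption based on the values visible in the output above, the helper `normalise_brand` is hypothetical, and the toy Series stands in for `new_df['brand']`.

```python
import pandas as pd

# Assumed placeholder spellings (compared in lower case after stripping whitespace).
PLACEHOLDERS = {
    "", ".", "-", "*", "n/a", "na", "none",
    "various", "learn more", "artist not provided",
}

def normalise_brand(brand: pd.Series) -> pd.Series:
    """Strip whitespace and replace placeholder brands with NaN."""
    cleaned = brand.fillna("").astype(str).str.strip()
    is_placeholder = cleaned.str.lower().isin(PLACEHOLDERS)
    return cleaned.mask(is_placeholder)

# Toy stand-in for new_df['brand'].
brands = pd.Series(
    ["John Wayne", " ", "N/A", "VARIOUS", "\n", "Sinister Cinema", None, "John Wayne"]
)

# value_counts skips NaN by default, so only real brands are counted.
top_brands = normalise_brand(brands).value_counts()
print(top_brands)
```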

# 3. Export preprocessed meta data<a class ='anchor' id='3export'></a>

The preprocessed meta data will be exported to a CSV file for use in the recommendation system.

In [62]:
# Export the dataframe
new_df.to_csv('clean_meta.csv',index = False)
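
One thing to keep in mind when loading this file in later notebooks: several `product_id` values consist of digits with leading zeros (for example `0000143529`), and `pd.read_csv` may parse such values as integers unless the column type is given. The sketch below demonstrates the effect on a toy frame written to an in-memory buffer instead of `clean_meta.csv`.

```python
import io
import pandas as pd

# Toy frame with all-digit IDs that start with zeros.
toy = pd.DataFrame({
    "product_id": ["0000143529", "0000143588"],
    "rank": ["370026", "342914"],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default type inference turns the IDs into integers and drops the zeros.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Declaring the column as str keeps the IDs exactly as written.
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={"product_id": str})

print(default_read["product_id"].tolist())
print(typed_read["product_id"].tolist())
```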

# 4. Summary <a class ='anchor' id='4summary'></a>

The meta data has been preprocessed, and the most important columns, `description_0` and `product_id`, can now be used in the other notebooks. 

Future work for this notebook includes:

1) Web scraping amazon.ca for the products with missing descriptions, prices and brands  
2) Incorporating the `also_buy` and `also_view` columns into the recommendation system.

