# Meta Data Cleaning

Author: Jason Truong

Date: Aug 14, 2022 

Email: truong3@ualberta.ca 




This Jupyter notebook cleans the metadata for the Movies and TV Amazon reviews. The data was sourced from https://nijianmo.github.io/amazon/index.html.

Citing source: Justifying recommendations using distantly-labeled reviews and fine-grained aspects
Jianmo Ni, Jiacheng Li, Julian McAuley
Empirical Methods in Natural Language Processing (EMNLP), 2019


In [1]:
# Import dependencies
import random
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Read in the data from the json file

In [2]:
# If the dataset is too big and only a portion of the data can be loaded in
# meta_data = []
# with open('meta_Movies_and_TV.json','r') as metafile:
#     for x in range(50000):
#         meta_data.append(json.loads(next(metafile)))

In [3]:
meta_data = []
# Use a context manager so the file handle is closed after reading
with open('meta_Movies_and_TV.json','r') as metafile:
    for line in metafile:
        meta_data.append(json.loads(line))
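The two loading cells above (partial load and full load) can be expressed as one helper. The following is a minimal sketch, not the notebook's own code: `read_json_lines` and its `limit` parameter are hypothetical names, and the UTF-8 encoding is an assumption. The demo writes a tiny two-line file so the block runs without the real dataset.

```python
import json
import os
import tempfile
from itertools import islice


def read_json_lines(path, limit=None):
    """Parse one JSON object per line; stop after `limit` lines if given."""
    with open(path, "r", encoding="utf-8") as handle:  # encoding is an assumption
        # islice(handle, None) walks the whole file; blank lines are skipped.
        return [json.loads(line) for line in islice(handle, limit) if line.strip()]


# Demo on a throwaway file with two records.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json")
    with open(demo_path, "w", encoding="utf-8") as out:
        out.write('{"asin": "0000695009"}\n{"asin": "0000791156"}\n')
    first_only = read_json_lines(demo_path, limit=1)
    everything = read_json_lines(demo_path)

print(len(first_only), len(everything))
```

Passing `limit=50000` would mirror the commented-out partial load, while omitting it reads every line.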

In [4]:
meta_data[0]

{'category': ['Movies & TV', 'Movies'],
 'tech1': '',
 'description': [],
 'fit': '',
 'title': 'Understanding Seizures and Epilepsy',
 'also_buy': [],
 'tech2': '',
 'brand': '',
 'feature': [],
 'rank': '886,503 in Movies & TV (',
 'also_view': [],
 'main_cat': 'Movies & TV',
 'similar_item': '',
 'date': '',
 'price': '',
 'asin': '0000695009',
 'imageURL': [],
 'imageURLHighRes': []}

Create dataframe

In [5]:
meta_df = pd.DataFrame(meta_data)

Check the data types and see if there are any null values in the dataframe

In [6]:
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203766 entries, 0 to 203765
Data columns (total 19 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   category         203766 non-null  object
 1   tech1            203766 non-null  object
 2   description      203766 non-null  object
 3   fit              203766 non-null  object
 4   title            203766 non-null  object
 5   also_buy         203766 non-null  object
 6   tech2            203766 non-null  object
 7   brand            203766 non-null  object
 8   feature          203766 non-null  object
 9   rank             203766 non-null  object
 10  also_view        203766 non-null  object
 11  main_cat         203766 non-null  object
 12  similar_item     203766 non-null  object
 13  date             203766 non-null  object
 14  price            203766 non-null  object
 15  asin             203766 non-null  object
 16  imageURL         203766 non-null  object
 17  imageURLHi

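At first glance, the columns shown have no null values. Note that empty strings and empty lists (as in the first record above) are not counted as null.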

In [7]:
meta_df.head()


Unnamed: 0,category,tech1,description,fit,title,also_buy,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin,imageURL,imageURLHighRes,details
0,"[Movies & TV, Movies]",,[],,Understanding Seizures and Epilepsy,[],,,[],"886,503 in Movies & TV (",[],Movies & TV,,,,695009,[],[],
1,"[Movies & TV, Movies]",,[],,Spirit Led&mdash;Moving By Grace In The Holy S...,[],,,[],"342,688 in Movies & TV (",[],Movies & TV,,,,791156,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
2,"[Movies & TV, Movies]",,[Disc 1: Flour Power (Scones; Shortcakes; Sout...,,My Fair Pastry (Good Eats Vol. 9),[],,Alton Brown,[],"370,026 in Movies & TV (",[],Movies & TV,,,,143529,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,
3,"[Movies & TV, Movies]",,[Barefoot Contessa Volume 2: On these three di...,,"Barefoot Contessa (with Ina Garten), Entertain...","[B002I5GNW4, B005WXPVMM, B009UY3W8O, B00N27ID1...",,Ina Garten,[],"342,914 in Movies & TV (","[B002I5GNW4, 0804187045, B009UY3W8O, 060960219...",Movies & TV,,,$74.95,143588,[],[],
4,"[Movies & TV, Movies]",,[Rise and Swine (Good Eats Vol. 7) includes bo...,,Rise and Swine (Good Eats Vol. 7),"[B000P1CKES, B000NR4CRM]",,Alton Brown,[],"351,684 in Movies & TV (",[B0015SVNXY],Movies & TV,,,,143502,[https://images-na.ssl-images-amazon.com/image...,[https://images-na.ssl-images-amazon.com/image...,


**Image data is not being considered in this NLP recommendation project so `imageURL` and `imageURLHighRes` will be dropped**

In [8]:
# Drop columns:
meta_df.drop(columns = ['imageURL','imageURLHighRes'], inplace = True)

In [9]:
# Check dataframe:
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203766 entries, 0 to 203765
Data columns (total 17 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   category      203766 non-null  object
 1   tech1         203766 non-null  object
 2   description   203766 non-null  object
 3   fit           203766 non-null  object
 4   title         203766 non-null  object
 5   also_buy      203766 non-null  object
 6   tech2         203766 non-null  object
 7   brand         203766 non-null  object
 8   feature       203766 non-null  object
 9   rank          203766 non-null  object
 10  also_view     203766 non-null  object
 11  main_cat      203766 non-null  object
 12  similar_item  203766 non-null  object
 13  date          203766 non-null  object
 14  price         203766 non-null  object
 15  asin          203766 non-null  object
 16  details       195392 non-null  object
dtypes: object(17)
memory usage: 26.4+ MB


Columns `tech1`, `fit`, `tech2`, `similar_item`, `date`, `price`, `feature`, `rank` and `details` may contain empty or null values.

### Check `date` column

In [10]:
sum(meta_df['date'] == "")

203728

In [11]:
203728/203766

0.999813511577005

99.98% of the `date` column is blank so the column will be dropped

In [12]:
meta_df.drop(columns = 'date', inplace = True)

Check that the column has been dropped

In [13]:
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203766 entries, 0 to 203765
Data columns (total 16 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   category      203766 non-null  object
 1   tech1         203766 non-null  object
 2   description   203766 non-null  object
 3   fit           203766 non-null  object
 4   title         203766 non-null  object
 5   also_buy      203766 non-null  object
 6   tech2         203766 non-null  object
 7   brand         203766 non-null  object
 8   feature       203766 non-null  object
 9   rank          203766 non-null  object
 10  also_view     203766 non-null  object
 11  main_cat      203766 non-null  object
 12  similar_item  203766 non-null  object
 13  price         203766 non-null  object
 14  asin          203766 non-null  object
 15  details       195392 non-null  object
dtypes: object(16)
memory usage: 24.9+ MB


### Check the columns `tech1`, `fit`, `tech2` and `similar_item`

In [14]:
print('# of empty cells in tech1 column:')
print(sum(meta_df['tech1'] == ""))

# of empty cells in tech1 column:
203760


Only 6 out of 203766 entries in the tech1 column have data so this column can be dropped.

In [15]:
print('# of empty cells in fit column:')
print(sum(meta_df['fit'] == ""))

# of empty cells in fit column:
203766


In [16]:
print('# of empty cells in tech2 column:')
print(sum(meta_df['tech2'] == ""))

# of empty cells in tech2 column:
203766


In [17]:
print('# of empty cells in similar_item column:')
print(sum(meta_df['similar_item'] == ""))

# of empty cells in similar_item column:
203766


Over 99% of the entries in each of the above columns are empty, so these columns can be dropped.
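The per-column checks above can be generalized. The sketch below is an illustration on toy data, not the notebook's code: `blank_share` is a hypothetical helper that treats `None`/NaN, empty strings, and empty lists all as blank, and the 0.99 cutoff is an assumption taken from the "over 99%" rule of thumb used here.

```python
import pandas as pd


def blank_share(column: pd.Series) -> float:
    """Fraction of entries that are None/NaN, an empty string, or an empty container."""
    def is_blank(value) -> bool:
        if isinstance(value, (str, list, tuple, dict)):
            return len(value) == 0
        return bool(pd.isna(value))
    return float(column.map(is_blank).astype(bool).mean())


# Toy frame standing in for meta_df; the values are made up for illustration.
sample = pd.DataFrame({
    "fit": ["", "", "", ""],
    "feature": [[], [], ["Region 1"], []],
    "title": ["A", "B", "", "D"],
})

shares = {name: blank_share(sample[name]) for name in sample.columns}
to_drop = [name for name, share in shares.items() if share > 0.99]
print(shares)
print(to_drop)
```

Computing the share with `.mean()` avoids hard-coding row counts such as `203728/203766`, so the check stays correct if the dataframe is filtered later.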

In [18]:
# Drop columns:
meta_df.drop(columns = ['tech1','fit','tech2','similar_item'], inplace = True)

In [19]:
# Check dataframe:
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203766 entries, 0 to 203765
Data columns (total 12 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   category     203766 non-null  object
 1   description  203766 non-null  object
 2   title        203766 non-null  object
 3   also_buy     203766 non-null  object
 4   brand        203766 non-null  object
 5   feature      203766 non-null  object
 6   rank         203766 non-null  object
 7   also_view    203766 non-null  object
 8   main_cat     203766 non-null  object
 9   price        203766 non-null  object
 10  asin         203766 non-null  object
 11  details      195392 non-null  object
dtypes: object(12)
memory usage: 18.7+ MB


### Check `price` column

In [20]:
sum(meta_df['price'] == "")/meta_df['price'].shape[0]

0.45650893672153353

About 45.7% of the price column is blank, so more than half of the rows contain data and this column should not be dropped.

In [21]:
meta_df[meta_df['price'] != ""]

Unnamed: 0,category,description,title,also_buy,brand,feature,rank,also_view,main_cat,price,asin,details
3,"[Movies & TV, Movies]",[Barefoot Contessa Volume 2: On these three di...,"Barefoot Contessa (with Ina Garten), Entertain...","[B002I5GNW4, B005WXPVMM, B009UY3W8O, B00N27ID1...",Ina Garten,[],"342,914 in Movies & TV (","[B002I5GNW4, 0804187045, B009UY3W8O, 060960219...",Movies & TV,$74.95,0000143588,
7,"[Movies & TV, Movies]","[Giada de Laurentis on ""Everyday Italian"" DVDs...","Everyday Italian (with Giada de Laurentiis), V...","[B002I5GNVU, B000KL8ODE]",,[],"409,173 in Movies & TV (",[B000KL8ODE],Movies & TV,$24.95,0000143561,
14,"[Movies & TV, Christian Video, General]",[An early movie edition of the life of Jesus.],Where Jesus Walked VHS,[],,[],"1,108,385 in Movies & TV (",[],Movies & TV,.a-box-inner{background-color:#fff}#alohaBuyBo...,0005000009,
15,"[Movies & TV, Genre for Featured Categories, F...",[],Passion According to Luke VHS,[],,[],"692,122 in Movies & TV (",[],Movies & TV,.a-box-inner{background-color:#fff}#alohaBuyBo...,0005054885,
16,"[Movies & TV, Movies]",[VHS],The Homecoming Kids Camp Out VHS,[],,[],"390,381 in Movies & TV (",[],Movies & TV,.a-box-inner{background-color:#fff}#alohaBuyBo...,0005059836,
...,...,...,...,...,...,...,...,...,...,...,...,...
203758,"[Movies & TV, Movies]","[Between the years of 1954 and 1963, the Rebbe...",The Rebbe's Nigunim,"[B00V3UOVBA, B07J22VS9C, B06WGSL4J6, B00HSC0OJ...",Benny Friedman,[],"120,487 in Movies & TV (",[B07J22VS9C],Movies & TV,$21.99,B01HIQHDAS,{'ASIN: ': 'B01HIQHDAS'}
203759,"[Movies & TV, HBO, All HBO Titles]","[<![CDATA[, All the Way (Digital HD/DVD), Brya...",All the Way (Digital HD/DVD),"[B077SKJCWZ, B001BSBC16, B00CLH08KC, B076ZR2F9...",Various,[],"50,367 in Movies & TV (","[B077SKJCWZ, B001BSBC16, B00CLH08KC, B078952HS...",Movies & TV,$7.90,B01HIUL6WU,"{'Language:': 'English', 'Dubbed:': 'English',..."
203761,"[Movies & TV, Studio Specials, Sony Pictures H...",[Tony Award-winning director Bartlett Sher pro...,Verdi: Otello,"[B01MXE4EVV, B014HFML6E, B07HGR7P4Z, B000F3T3C...",Sonya Yoncheva,[],"68,026 in Movies & TV (","[B07BF2PQHM, B00012AST0, B0094AH3Q6, B0007P0LP...",Movies & TV,$14.98,B01HJ1INB0,"{'Language:': 'English', 'Subtitles:': 'Italia..."
203762,"[Movies & TV, Movies]",[Mr. Miracle DVD Region 2 need an all region D...,Mr. Miracle - Ihn schickt der Himmel,[],,[],"344,483 in Movies & TV (",[],Movies & TV,$16.18,B01HJ3E0PQ,"{'Language:': 'German (Dolby Digital 5.1), Eng..."


Some rows contain prices while others contain CSS code, so the rows with CSS will have to be cleaned up.

In [22]:
meta_df['price'].value_counts()


For any rows in the price column that contain CSS text, replace the value with an empty string.

In [23]:
meta_df.loc[meta_df['price'].str.contains('a-box-inner'), 'price'] = ''

Replace all the '$' with an empty string so that the numbers can be converted to a float type

In [24]:
meta_df['price'] = meta_df['price'].str.replace(r'\$','',regex = True)

Convert all the price values to float and change any non numbers to NaN

In [25]:
meta_df['price'] = pd.to_numeric(meta_df['price'], errors='coerce')

In [26]:
sum(meta_df['price'].isna())

96058

6 out of 96077 prices were not numbers. This is an extremely small portion, so leaving them as NaN is justifiable.
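The price-cleaning steps above can be folded into a single helper. This is a minimal sketch on toy values rather than the notebook's own code, and `clean_price` is a hypothetical name. It relies on `pd.to_numeric(..., errors="coerce")` mapping any non-numeric text to NaN, which covers blanks and CSS fragments alike, so the explicit CSS check becomes optional under this approach.

```python
import pandas as pd


def clean_price(raw: pd.Series) -> pd.Series:
    """Turn raw price strings into floats; anything unparseable becomes NaN."""
    # Remove surrounding whitespace and the literal dollar sign.
    stripped = raw.str.strip().str.replace("$", "", regex=False)
    # errors="coerce" converts blanks, CSS text, and other non-numbers to NaN.
    return pd.to_numeric(stripped, errors="coerce")


sample = pd.Series(["$74.95", "", ".a-box-inner{background-color:#fff}", "$7.90"])
cleaned = clean_price(sample)
print(cleaned.tolist())
```

Using `regex=False` treats `$` as a literal character, which sidesteps the need to escape it.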

### Check the column `feature` 

In [27]:
meta_df['feature'].value_counts()

[]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      

In [28]:
203594/203766

0.9991558945064437

99.9% of the entries in this column are empty lists, so this column should be dropped.

In [29]:
# Drop columns:
meta_df.drop(columns = ['feature'], inplace = True)

In [30]:
# Check dataframe:
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203766 entries, 0 to 203765
Data columns (total 11 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   category     203766 non-null  object 
 1   description  203766 non-null  object 
 2   title        203766 non-null  object 
 3   also_buy     203766 non-null  object 
 4   brand        203766 non-null  object 
 5   rank         203766 non-null  object 
 6   also_view    203766 non-null  object 
 7   main_cat     203766 non-null  object 
 8   price        107708 non-null  float64
 9   asin         203766 non-null  object 
 10  details      195392 non-null  object 
dtypes: float64(1), object(10)
memory usage: 17.1+ MB


### Check the column `main_cat` 

The `value_counts` method will be used to determine which main categories are present.

In [31]:
meta_df['main_cat'].value_counts()

Movies & TV                                                                                                                                                                  203464
Sports & Outdoors                                                                                                                                                               118
Books                                                                                                                                                                            35
Amazon Home                                                                                                                                                                      33
<img src="https://images-na.ssl-images-amazon.com/images/G/01/digital/music/logos/amzn_music_logo_subnav._CB471835632_.png" class="nav-categ-image" alt="Digital Music"/>        31
Toys & Games                                                                                        

In [32]:
203464/203766

0.9985179077961976

Since the Movies & TV category represents the majority of the data, the other categories will be dropped to focus on Movies & TV

In [33]:
# Keep only the rows where the main category is Movies & TV and store them in meta_df.
meta_df = meta_df[meta_df['main_cat'] == 'Movies & TV'].reset_index(drop=True)

In [34]:
meta_df

Unnamed: 0,category,description,title,also_buy,brand,rank,also_view,main_cat,price,asin,details
0,"[Movies & TV, Movies]",[],Understanding Seizures and Epilepsy,[],,"886,503 in Movies & TV (",[],Movies & TV,,0000695009,
1,"[Movies & TV, Movies]",[],Spirit Led&mdash;Moving By Grace In The Holy S...,[],,"342,688 in Movies & TV (",[],Movies & TV,,0000791156,
2,"[Movies & TV, Movies]",[Disc 1: Flour Power (Scones; Shortcakes; Sout...,My Fair Pastry (Good Eats Vol. 9),[],Alton Brown,"370,026 in Movies & TV (",[],Movies & TV,,0000143529,
3,"[Movies & TV, Movies]",[Barefoot Contessa Volume 2: On these three di...,"Barefoot Contessa (with Ina Garten), Entertain...","[B002I5GNW4, B005WXPVMM, B009UY3W8O, B00N27ID1...",Ina Garten,"342,914 in Movies & TV (","[B002I5GNW4, 0804187045, B009UY3W8O, 060960219...",Movies & TV,74.95,0000143588,
4,"[Movies & TV, Movies]",[Rise and Swine (Good Eats Vol. 7) includes bo...,Rise and Swine (Good Eats Vol. 7),"[B000P1CKES, B000NR4CRM]",Alton Brown,"351,684 in Movies & TV (",[B0015SVNXY],Movies & TV,,0000143502,
...,...,...,...,...,...,...,...,...,...,...,...
203459,"[Movies & TV, Studio Specials, Sony Pictures H...",[Tony Award-winning director Bartlett Sher pro...,Verdi: Otello,"[B01MXE4EVV, B014HFML6E, B07HGR7P4Z, B000F3T3C...",Sonya Yoncheva,"68,026 in Movies & TV (","[B07BF2PQHM, B00012AST0, B0094AH3Q6, B0007P0LP...",Movies & TV,14.98,B01HJ1INB0,"{'Language:': 'English', 'Subtitles:': 'Italia..."
203460,"[Movies & TV, Movies]",[Mr. Miracle DVD Region 2 need an all region D...,Mr. Miracle - Ihn schickt der Himmel,[],,"344,483 in Movies & TV (",[],Movies & TV,16.18,B01HJ3E0PQ,"{'Language:': 'German (Dolby Digital 5.1), Eng..."
203461,"[Movies & TV, Independently Distributed, Drama]",[The President and his family rule the land wi...,The President,"[B01M6DA5RJ, B0002F6BFG, B06XRGPHM3, B06XG2N43...",Misha Gomiashvili,"199,854 in Movies & TV (","[B01GWCHP2U, B0009PW3RE]",Movies & TV,18.99,B01HJ6R77G,"{'Language:': 'Georgian', 'Subtitles:': 'Engli..."
203462,"[Movies & TV, Genre for Featured Categories, A...","[Filmed in 1972 as Pope Joan, Michael Anderson...",She.....Who Would Be Pope,[B004L690XW],Liv Ullmann,"246,494 in Movies & TV (",[B0000C23HX],Movies & TV,,B01HJCCLOY,"{'Language:': 'English', 'ASIN: ': 'B01HJCCLOY'}"


In [35]:
meta_df['main_cat'].value_counts()

Movies & TV    203464
Name: main_cat, dtype: int64

Since the data only contains Movies & TV, the main_cat column can be dropped because it isn't providing any useful information.

In [36]:
# Drop columns:
meta_df.drop(columns = 'main_cat', inplace = True)

In [37]:
# Check dataframe:
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 203464 entries, 0 to 203463
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   category     203464 non-null  object 
 1   description  203464 non-null  object 
 2   title        203464 non-null  object 
 3   also_buy     203464 non-null  object 
 4   brand        203464 non-null  object 
 5   rank         203464 non-null  object 
 6   also_view    203464 non-null  object 
 7   price        107501 non-null  float64
 8   asin         203464 non-null  object 
 9   details      195104 non-null  object 
dtypes: float64(1), object(9)
memory usage: 15.5+ MB


### Check the column `title` 

In [38]:
meta_df['title'].value_counts()

                                                                                                                                                                59
Treasure Island                                                                                                                                                 15
Live                                                                                                                                                            11
Carmen                                                                                                                                                          10
Mozart: Don Giovanni                                                                                                                                            10
                                                                                                                                                                ..
Mantis Fist and Tiger 

In [39]:
sum(meta_df['title'] =='')/meta_df['title'].shape[0]

0.0002899775881728463

The title column contains the title of the movie and only 0.03% of its entries are empty strings, so this column must be kept.

### Check the column `description` 

Each entry in the description column is a list, so the column will be expanded into a dataframe with one column per list element to check the contents of the lists.

In [40]:
desc_split_df = pd.DataFrame(meta_df['description'].tolist()).add_prefix('description_')
desc_split_df
    

Unnamed: 0,description_0,description_1,description_2,description_3,description_4,description_5,description_6,description_7,description_8,description_9,...,description_597,description_598,description_599,description_600,description_601,description_602,description_603,description_604,description_605,description_606
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,Disc 1: Flour Power (Scones; Shortcakes; South...,,,,,,,,,,...,,,,,,,,,,
3,Barefoot Contessa Volume 2: On these three dis...,,,,,,,,,,...,,,,,,,,,,
4,Rise and Swine (Good Eats Vol. 7) includes bon...,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
203459,Tony Award-winning director Bartlett Sher prob...,,,,,,,,,,...,,,,,,,,,,
203460,Mr. Miracle DVD Region 2 need an all region DV...,,,,,,,,,,...,,,,,,,,,,
203461,The President and his family rule the land wit...,WINNER - Best Film - Chicago Int'l Film Festiv...,,,,,,,,,...,,,,,,,,,,
203462,"Filmed in 1972 as Pope Joan, Michael Andersons...",,,,,,,,,,...,,,,,,,,,,


In [41]:
(desc_split_df.isna().sum())/desc_split_df.shape[0]

description_0      0.142295
description_1      0.790346
description_2      0.944147
description_3      0.970437
description_4      0.980011
                     ...   
description_602    0.999995
description_603    0.999995
description_604    0.999995
description_605    0.999995
description_606    0.999995
Length: 607, dtype: float64

Keep only the first column, since at least 79% of the values are null in every other column and the first column holds the main description, which is the most important part.
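If only the first description is needed, it can also be pulled out directly without building the full 607-column dataframe. The sketch below shows that alternative on toy data; it is a suggestion, not the approach used in this notebook, and the variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for meta_df['description']: each entry is a list of strings.
descriptions = pd.Series([[], ["Main text", "Extra blurb"], ["Only text"]])

# Take the first list element, or None when the list is empty.
first_description = descriptions.map(lambda parts: parts[0] if parts else None)
print(first_description.tolist())
```

Empty lists end up as missing values, which matches what the wide expansion produces for rows with no description.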

In [42]:
meta_df['description_0'] = desc_split_df['description_0']
meta_df

Unnamed: 0,category,description,title,also_buy,brand,rank,also_view,price,asin,details,description_0
0,"[Movies & TV, Movies]",[],Understanding Seizures and Epilepsy,[],,"886,503 in Movies & TV (",[],,0000695009,,
1,"[Movies & TV, Movies]",[],Spirit Led&mdash;Moving By Grace In The Holy S...,[],,"342,688 in Movies & TV (",[],,0000791156,,
2,"[Movies & TV, Movies]",[Disc 1: Flour Power (Scones; Shortcakes; Sout...,My Fair Pastry (Good Eats Vol. 9),[],Alton Brown,"370,026 in Movies & TV (",[],,0000143529,,Disc 1: Flour Power (Scones; Shortcakes; South...
3,"[Movies & TV, Movies]",[Barefoot Contessa Volume 2: On these three di...,"Barefoot Contessa (with Ina Garten), Entertain...","[B002I5GNW4, B005WXPVMM, B009UY3W8O, B00N27ID1...",Ina Garten,"342,914 in Movies & TV (","[B002I5GNW4, 0804187045, B009UY3W8O, 060960219...",74.95,0000143588,,Barefoot Contessa Volume 2: On these three dis...
4,"[Movies & TV, Movies]",[Rise and Swine (Good Eats Vol. 7) includes bo...,Rise and Swine (Good Eats Vol. 7),"[B000P1CKES, B000NR4CRM]",Alton Brown,"351,684 in Movies & TV (",[B0015SVNXY],,0000143502,,Rise and Swine (Good Eats Vol. 7) includes bon...
...,...,...,...,...,...,...,...,...,...,...,...
203459,"[Movies & TV, Studio Specials, Sony Pictures H...",[Tony Award-winning director Bartlett Sher pro...,Verdi: Otello,"[B01MXE4EVV, B014HFML6E, B07HGR7P4Z, B000F3T3C...",Sonya Yoncheva,"68,026 in Movies & TV (","[B07BF2PQHM, B00012AST0, B0094AH3Q6, B0007P0LP...",14.98,B01HJ1INB0,"{'Language:': 'English', 'Subtitles:': 'Italia...",Tony Award-winning director Bartlett Sher prob...
203460,"[Movies & TV, Movies]",[Mr. Miracle DVD Region 2 need an all region D...,Mr. Miracle - Ihn schickt der Himmel,[],,"344,483 in Movies & TV (",[],16.18,B01HJ3E0PQ,"{'Language:': 'German (Dolby Digital 5.1), Eng...",Mr. Miracle DVD Region 2 need an all region DV...
203461,"[Movies & TV, Independently Distributed, Drama]",[The President and his family rule the land wi...,The President,"[B01M6DA5RJ, B0002F6BFG, B06XRGPHM3, B06XG2N43...",Misha Gomiashvili,"199,854 in Movies & TV (","[B01GWCHP2U, B0009PW3RE]",18.99,B01HJ6R77G,"{'Language:': 'Georgian', 'Subtitles:': 'Engli...",The President and his family rule the land wit...
203462,"[Movies & TV, Genre for Featured Categories, A...","[Filmed in 1972 as Pope Joan, Michael Anderson...",She.....Who Would Be Pope,[B004L690XW],Liv Ullmann,"246,494 in Movies & TV (",[B0000C23HX],,B01HJCCLOY,"{'Language:': 'English', 'ASIN: ': 'B01HJCCLOY'}","Filmed in 1972 as Pope Joan, Michael Andersons..."


Drop the original description column

In [43]:
meta_df.drop(columns = 'description', inplace = True)
meta_df

Unnamed: 0,category,title,also_buy,brand,rank,also_view,price,asin,details,description_0
0,"[Movies & TV, Movies]",Understanding Seizures and Epilepsy,[],,"886,503 in Movies & TV (",[],,0000695009,,
1,"[Movies & TV, Movies]",Spirit Led&mdash;Moving By Grace In The Holy S...,[],,"342,688 in Movies & TV (",[],,0000791156,,
2,"[Movies & TV, Movies]",My Fair Pastry (Good Eats Vol. 9),[],Alton Brown,"370,026 in Movies & TV (",[],,0000143529,,Disc 1: Flour Power (Scones; Shortcakes; South...
3,"[Movies & TV, Movies]","Barefoot Contessa (with Ina Garten), Entertain...","[B002I5GNW4, B005WXPVMM, B009UY3W8O, B00N27ID1...",Ina Garten,"342,914 in Movies & TV (","[B002I5GNW4, 0804187045, B009UY3W8O, 060960219...",74.95,0000143588,,Barefoot Contessa Volume 2: On these three dis...
4,"[Movies & TV, Movies]",Rise and Swine (Good Eats Vol. 7),"[B000P1CKES, B000NR4CRM]",Alton Brown,"351,684 in Movies & TV (",[B0015SVNXY],,0000143502,,Rise and Swine (Good Eats Vol. 7) includes bon...
...,...,...,...,...,...,...,...,...,...,...
203459,"[Movies & TV, Studio Specials, Sony Pictures H...",Verdi: Otello,"[B01MXE4EVV, B014HFML6E, B07HGR7P4Z, B000F3T3C...",Sonya Yoncheva,"68,026 in Movies & TV (","[B07BF2PQHM, B00012AST0, B0094AH3Q6, B0007P0LP...",14.98,B01HJ1INB0,"{'Language:': 'English', 'Subtitles:': 'Italia...",Tony Award-winning director Bartlett Sher prob...
203460,"[Movies & TV, Movies]",Mr. Miracle - Ihn schickt der Himmel,[],,"344,483 in Movies & TV (",[],16.18,B01HJ3E0PQ,"{'Language:': 'German (Dolby Digital 5.1), Eng...",Mr. Miracle DVD Region 2 need an all region DV...
203461,"[Movies & TV, Independently Distributed, Drama]",The President,"[B01M6DA5RJ, B0002F6BFG, B06XRGPHM3, B06XG2N43...",Misha Gomiashvili,"199,854 in Movies & TV (","[B01GWCHP2U, B0009PW3RE]",18.99,B01HJ6R77G,"{'Language:': 'Georgian', 'Subtitles:': 'Engli...",The President and his family rule the land wit...
203462,"[Movies & TV, Genre for Featured Categories, A...",She.....Who Would Be Pope,[B004L690XW],Liv Ullmann,"246,494 in Movies & TV (",[B0000C23HX],,B01HJCCLOY,"{'Language:': 'English', 'ASIN: ': 'B01HJCCLOY'}","Filmed in 1972 as Pope Joan, Michael Andersons..."


### Check the column `category` 

In [44]:
meta_df['category'].value_counts()

[Movies & TV, Movies]                                               30703
[Movies & TV, Genre for Featured Categories, Action & Adventure]    10760
[Movies & TV, Genre for Featured Categories, Drama]                  9639
[Movies & TV, Genre for Featured Categories, Documentary]            8842
[Movies & TV, Genre for Featured Categories, Kids & Family]          8027
                                                                    ...  
[Movies & TV, Mini-DVD, Television]                                     1
[Movies & TV, Charlie Rose Store, Entertainment, Theater]               1
[Movies & TV, Charlie Rose Store, Other Topics, In Memoriam]            1
[Movies & TV, Charlie Rose Store, Entertainment, Music]                 1
[Movies & TV, Mini-DVD, Animation]                                      1
Name: category, Length: 704, dtype: int64

In [45]:
cate_split_df = pd.DataFrame(meta_df['category'].tolist()).add_prefix('category_')
cate_split_df

Unnamed: 0,category_0,category_1,category_2,category_3,category_4,category_5,category_6
0,Movies & TV,Movies,,,,,
1,Movies & TV,Movies,,,,,
2,Movies & TV,Movies,,,,,
3,Movies & TV,Movies,,,,,
4,Movies & TV,Movies,,,,,
...,...,...,...,...,...,...,...
203459,Movies & TV,Studio Specials,Sony Pictures Home Entertainment,All Sony Pictures Titles,,,
203460,Movies & TV,Movies,,,,,
203461,Movies & TV,Independently Distributed,Drama,,,,
203462,Movies & TV,Genre for Featured Categories,Action & Adventure,,,,


In [46]:
cate_split_df['category_0'].value_counts()

Movies & TV    203239
Name: category_0, dtype: int64

The first column carries no information beyond confirming that every item is in Movies & TV, so it is removed.

In [47]:
cate_split_df.drop(columns= 'category_0', inplace = True)
cate_split_df

Unnamed: 0,category_1,category_2,category_3,category_4,category_5,category_6
0,Movies,,,,,
1,Movies,,,,,
2,Movies,,,,,
3,Movies,,,,,
4,Movies,,,,,
...,...,...,...,...,...,...
203459,Studio Specials,Sony Pictures Home Entertainment,All Sony Pictures Titles,,,
203460,Movies,,,,,
203461,Independently Distributed,Drama,,,,
203462,Genre for Featured Categories,Action & Adventure,,,,


Check the proportion of null values in each column.

In [48]:
cate_split_df.isna().sum()/cate_split_df.shape[0]

category_1    0.001106
category_2    0.194668
category_3    0.849649
category_4    0.984513
category_5    0.999985
category_6    0.999995
dtype: float64

Drop categories 3-6 because the majority of their values are null.
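The split-then-prune pattern used for the category column can be sketched end to end on a small example. The data below is made up, and the 0.5 cutoff is an assumption that encodes "majority null"; the notebook itself selects the columns by name.

```python
import pandas as pd

# Toy stand-in for meta_df['category']: lists of differing lengths.
categories = pd.Series([
    ["Movies & TV", "Movies"],
    ["Movies & TV", "Independently Distributed", "Drama"],
    ["Movies & TV", "Movies"],
    ["Movies & TV", "Movies"],
])

# One column per list position; shorter lists are padded with missing values.
split = pd.DataFrame(categories.tolist()).add_prefix("category_")

# isna().mean() gives the same ratio as isna().sum() / number of rows.
null_share = split.isna().mean()
kept = split.loc[:, null_share < 0.5]

print(null_share)
print(kept.columns.tolist())
```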

In [49]:
cate_split_df = cate_split_df[['category_1','category_2']]
cate_split_df

Unnamed: 0,category_1,category_2
0,Movies,
1,Movies,
2,Movies,
3,Movies,
4,Movies,
...,...,...
203459,Studio Specials,Sony Pictures Home Entertainment
203460,Movies,
203461,Independently Distributed,Drama
203462,Genre for Featured Categories,Action & Adventure


Combine the above category dataframe with the meta_df and drop the category column from the meta_df
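Column-wise `pd.concat` aligns rows by index label, so this step depends on `meta_df` having been re-indexed with `reset_index(drop=True)` after the `main_cat` filter. The sketch below illustrates the difference on made-up data; the frame and variable names are hypothetical.

```python
import pandas as pd

items = pd.DataFrame({
    "title": ["a", "b", "c"],
    "main_cat": ["Movies & TV", "Books", "Movies & TV"],
})
kept = items[items["main_cat"] == "Movies & TV"]            # index labels: 0, 2
extra = pd.DataFrame({"category_1": ["Movies", "Drama"]})   # index labels: 0, 1

# Without resetting, labels 0, 1, 2 are unioned and rows are mismatched.
misaligned = pd.concat([kept, extra], axis=1)

# After resetting, both frames share labels 0, 1 and line up row by row.
aligned = pd.concat([kept.reset_index(drop=True), extra], axis=1)

print(len(misaligned), len(aligned))
```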

In [50]:
meta_df = pd.concat([meta_df.drop(columns = 'category'), cate_split_df], axis = 1)

meta_df

Unnamed: 0,title,also_buy,brand,rank,also_view,price,asin,details,description_0,category_1,category_2
0,Understanding Seizures and Epilepsy,[],,"886,503 in Movies & TV (",[],,0000695009,,,Movies,
1,Spirit Led&mdash;Moving By Grace In The Holy S...,[],,"342,688 in Movies & TV (",[],,0000791156,,,Movies,
2,My Fair Pastry (Good Eats Vol. 9),[],Alton Brown,"370,026 in Movies & TV (",[],,0000143529,,Disc 1: Flour Power (Scones; Shortcakes; South...,Movies,
3,"Barefoot Contessa (with Ina Garten), Entertain...","[B002I5GNW4, B005WXPVMM, B009UY3W8O, B00N27ID1...",Ina Garten,"342,914 in Movies & TV (","[B002I5GNW4, 0804187045, B009UY3W8O, 060960219...",74.95,0000143588,,Barefoot Contessa Volume 2: On these three dis...,Movies,
4,Rise and Swine (Good Eats Vol. 7),"[B000P1CKES, B000NR4CRM]",Alton Brown,"351,684 in Movies & TV (",[B0015SVNXY],,0000143502,,Rise and Swine (Good Eats Vol. 7) includes bon...,Movies,
...,...,...,...,...,...,...,...,...,...,...,...
203459,Verdi: Otello,"[B01MXE4EVV, B014HFML6E, B07HGR7P4Z, B000F3T3C...",Sonya Yoncheva,"68,026 in Movies & TV (","[B07BF2PQHM, B00012AST0, B0094AH3Q6, B0007P0LP...",14.98,B01HJ1INB0,"{'Language:': 'English', 'Subtitles:': 'Italia...",Tony Award-winning director Bartlett Sher prob...,Studio Specials,Sony Pictures Home Entertainment
203460,Mr. Miracle - Ihn schickt der Himmel,[],,"344,483 in Movies & TV (",[],16.18,B01HJ3E0PQ,"{'Language:': 'German (Dolby Digital 5.1), Eng...",Mr. Miracle DVD Region 2 need an all region DV...,Movies,
203461,The President,"[B01M6DA5RJ, B0002F6BFG, B06XRGPHM3, B06XG2N43...",Misha Gomiashvili,"199,854 in Movies & TV (","[B01GWCHP2U, B0009PW3RE]",18.99,B01HJ6R77G,"{'Language:': 'Georgian', 'Subtitles:': 'Engli...",The President and his family rule the land wit...,Independently Distributed,Drama
203462,She.....Who Would Be Pope,[B004L690XW],Liv Ullmann,"246,494 in Movies & TV (",[B0000C23HX],,B01HJCCLOY,"{'Language:': 'English', 'ASIN: ': 'B01HJCCLOY'}","Filmed in 1972 as Pope Joan, Michael Andersons...",Genre for Featured Categories,Action & Adventure
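
`pd.concat` with `axis = 1` aligns rows on their index labels, so the two dataframes need identical indexes for the categories to land on the right items. A minimal sketch of that check, using toy frames (`left` and `right` are hypothetical names standing in for `meta_df` and `cate_split_df`):

```python
import pandas as pd

# Toy stand-ins for meta_df and cate_split_df
left = pd.DataFrame({
    'asin': ['A1', 'A2', 'A3'],
    'category': [['Movies'], ['Movies', 'Drama'], ['Movies']],
})
right = pd.DataFrame({
    'category_1': ['Movies', 'Movies', 'Movies'],
    'category_2': [None, 'Drama', None],
})

# Column-wise concat matches rows by index label, so confirm the labels agree first
assert left.index.equals(right.index)

combined = pd.concat([left.drop(columns='category'), right], axis=1)
print(combined)
```

If the indexes differed, for example after rows were dropped without resetting the index, the concat would silently introduce rows filled with NaN.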


### Check the column `rank` 

The different values in the rank column will be analyzed.

In [51]:
meta_df['rank'].value_counts()

[]                          2047
350,519 in Movies & TV (       7
490,118 in Movies & TV (       6
114,816 in Movies & TV (       6
141,169 in Movies & TV (       6
                            ... 
65,815 in Movies & TV (        1
413,789 in Movies & TV (       1
159,833 in Movies & TV (       1
225,491 in Movies & TV (       1
239,915 in Movies & TV (       1
Name: rank, Length: 159859, dtype: int64

The words in the rank column are unnecessary, so they are removed to leave only the number.

In [52]:
# Replace the words and the bracket with an empty string
meta_df['rank'] = meta_df['rank'].str.replace(r" in Movies & TV \(","",regex = True)

In [53]:
meta_df

Unnamed: 0,title,also_buy,brand,rank,also_view,price,asin,details,description_0,category_1,category_2
0,Understanding Seizures and Epilepsy,[],,886503,[],,0000695009,,,Movies,
1,Spirit Led&mdash;Moving By Grace In The Holy S...,[],,342688,[],,0000791156,,,Movies,
2,My Fair Pastry (Good Eats Vol. 9),[],Alton Brown,370026,[],,0000143529,,Disc 1: Flour Power (Scones; Shortcakes; South...,Movies,
3,"Barefoot Contessa (with Ina Garten), Entertain...","[B002I5GNW4, B005WXPVMM, B009UY3W8O, B00N27ID1...",Ina Garten,342914,"[B002I5GNW4, 0804187045, B009UY3W8O, 060960219...",74.95,0000143588,,Barefoot Contessa Volume 2: On these three dis...,Movies,
4,Rise and Swine (Good Eats Vol. 7),"[B000P1CKES, B000NR4CRM]",Alton Brown,351684,[B0015SVNXY],,0000143502,,Rise and Swine (Good Eats Vol. 7) includes bon...,Movies,
...,...,...,...,...,...,...,...,...,...,...,...
203459,Verdi: Otello,"[B01MXE4EVV, B014HFML6E, B07HGR7P4Z, B000F3T3C...",Sonya Yoncheva,68026,"[B07BF2PQHM, B00012AST0, B0094AH3Q6, B0007P0LP...",14.98,B01HJ1INB0,"{'Language:': 'English', 'Subtitles:': 'Italia...",Tony Award-winning director Bartlett Sher prob...,Studio Specials,Sony Pictures Home Entertainment
203460,Mr. Miracle - Ihn schickt der Himmel,[],,344483,[],16.18,B01HJ3E0PQ,"{'Language:': 'German (Dolby Digital 5.1), Eng...",Mr. Miracle DVD Region 2 need an all region DV...,Movies,
203461,The President,"[B01M6DA5RJ, B0002F6BFG, B06XRGPHM3, B06XG2N43...",Misha Gomiashvili,199854,"[B01GWCHP2U, B0009PW3RE]",18.99,B01HJ6R77G,"{'Language:': 'Georgian', 'Subtitles:': 'Engli...",The President and his family rule the land wit...,Independently Distributed,Drama
203462,She.....Who Would Be Pope,[B004L690XW],Liv Ullmann,246494,[B0000C23HX],,B01HJCCLOY,"{'Language:': 'English', 'ASIN: ': 'B01HJCCLOY'}","Filmed in 1972 as Pope Joan, Michael Andersons...",Genre for Featured Categories,Action & Adventure
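
If the rank is later needed as a number rather than text, one possible follow-up is to extract the digits, drop the thousands separators, and convert. This is a sketch under assumptions: `toy_df` is a small stand-in built from values shown above, and treating the `[]` entries as missing is a choice made here, not something the notebook states.

```python
import pandas as pd

# Toy stand-in with the two kinds of values seen in the rank column
toy_df = pd.DataFrame([
    {'rank': '886,503 in Movies & TV ('},
    {'rank': '68,026 in Movies & TV ('},
    {'rank': []},
])

# Cast to str first so the list entry becomes the text "[]" and simply fails to match
rank_digits = toy_df['rank'].astype(str).str.extract(r'^([\d,]+)', expand=False)

# Remove thousands separators, then convert; unmatched entries become <NA>
rank_num = pd.to_numeric(
    rank_digits.str.replace(',', '', regex=False), errors='coerce'
).astype('Int64')

print(rank_num)
```

The nullable `Int64` dtype keeps the ranks as integers while still allowing missing values.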


### Check the column `details` 

In [54]:
#meta_df['details'].value_counts()

The `details` column seems to contain the same information as other columns in the dataframe. A random sample of its values is inspected to check this.

In [55]:
# Randomly sample from the meta_df details column to see what kind of data is stored in this column
for i in random.sample(range(0, meta_df.shape[0]), 25):
    print(meta_df['details'][i])

{'ASIN: ': 'B00004XPJ4'}
{'Language:': 'English', 'ASIN: ': 'B0019KAQ7M'}
{'Language:': 'Castilian', 'ASIN: ': 'B00008T2I1'}
{'Language:': 'English', 'ASIN: ': 'B0000AR8NR'}
{'Language:': 'English, French', 'ASIN: ': 'B00005AQ4W'}
{'Language:': 'English', 'ASIN: ': 'B007JVJ36U'}
{'ASIN: ': 'B00O4GTX7M'}
nan
{'Language:': 'English', 'Subtitles:': 'English, Spanish, French', 'ASIN: ': 'B00OV3VGP0'}
{'Language:': 'English', 'ASIN: ': 'B001G7ICZ6'}
{'Language:': 'English, Japanese', 'Subtitles:': 'English', 'ASIN: ': 'B0037B2WOG'}
{'Language:': 'English (Dolby Digital 2.0 Mono), French (Dolby Digital 2.0 Mono)', 'Subtitles:': 'English, Spanish', 'ASIN: ': 'B00005PJ8W'}
{'ASIN: ': 'B000KICXGG'}
nan
{'Language:': 'English', 'Subtitles:': 'Spanish', 'ASIN: ': 'B001LX4BVY'}
{'ASIN: ': 'B005ZV54Z0'}
{'ASIN: ': 'B006M5V2HC'}
{'Language:': 'Spanish', 'ASIN: ': 'B00004Z4SK'}
{'Language:': 'English', 'Subtitles:': 'English, Spanish', 'ASIN: ': 'B0017INRHM'}
{'ASIN: ': 'B0014H7EFW'}
{'Language:': 'E

Most of the data in this column is language-related information, alongside an ASIN that already has its own column. Neither is needed for the NLP review recommendation system, so this column can be dropped.

In [56]:
# The details column will be dropped.
meta_df.drop(columns = 'details',inplace = True)
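
The overlap between `details` and `asin` could also be verified programmatically before a drop like this, rather than by eye. The sketch below assumes the dictionary key is exactly `'ASIN: '` (as it appears in the sampled output) and uses a toy dataframe; `asin_from_details` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in with the shapes of values seen in the details column
toy_df = pd.DataFrame({
    'asin': ['B00004XPJ4', 'B0019KAQ7M', '0000695009'],
    'details': [
        {'ASIN: ': 'B00004XPJ4'},
        {'Language:': 'English', 'ASIN: ': 'B0019KAQ7M'},
        np.nan,
    ],
})

def asin_from_details(value):
    """Return the 'ASIN: ' entry when value is a dict, otherwise None."""
    return value.get('ASIN: ') if isinstance(value, dict) else None

details_asin = toy_df['details'].map(asin_from_details)

# Compare only the rows where details actually carries an ASIN
has_details = details_asin.notna()
all_match = (details_asin[has_details] == toy_df.loc[has_details, 'asin']).all()
print(all_match)
```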

## Dropping duplicated rows

Check for duplicated rows in the data and remove any duplicated rows

In [57]:
meta_df.duplicated()

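TypeError: unhashable type: 'list'

The check fails because the `also_buy` and `also_view` columns hold Python lists, which are not hashable. These two columns are set aside so that duplicates can be detected on the remaining columns.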

In [None]:
new_df = meta_df.drop(columns = ['also_buy','also_view'])

In [None]:
new_df

Drop the duplicated rows

In [None]:
new_df.drop_duplicates(inplace = True,ignore_index = True)
new_df
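
An alternative that keeps the list columns is to convert the lists to tuples, which are hashable, before calling `duplicated()`. This is a minimal sketch on a toy dataframe, not what the notebook does; `hashable_df` and `deduped_df` are names introduced here for illustration.

```python
import pandas as pd

# Toy stand-in: the first two rows are exact duplicates, including the list column
toy_df = pd.DataFrame({
    'asin': ['A1', 'A1', 'A2'],
    'title': ['x', 'x', 'y'],
    'also_buy': [['B1', 'B2'], ['B1', 'B2'], []],
})

# Tuples are hashable, so duplicated() no longer raises a TypeError
hashable_df = toy_df.copy()
hashable_df['also_buy'] = hashable_df['also_buy'].map(tuple)

dup_mask = hashable_df.duplicated()

# Apply the mask to the original frame so the lists are preserved
deduped_df = toy_df[~dup_mask].reset_index(drop=True)
print(deduped_df)
```

This way the comparison also takes the list contents into account, whereas dropping the columns ignores them.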

Check the ASIN of items to determine if any are still duplicated

In [None]:
new_df['asin'].value_counts()

All ASINs are unique; therefore, every row is unique and the duplicated rows are gone.
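
Reading uniqueness off `value_counts()` works, but the same conclusion can be stated as an explicit boolean. A small sketch, using a toy series of ASINs taken from the rows shown earlier:

```python
import pandas as pd

# Toy stand-in for new_df['asin']
asins = pd.Series(['0000695009', '0000791156', '0000143529'], name='asin')

# True only when no ASIN appears more than once
all_unique = asins.is_unique

# Any repeated ASINs, listed for inspection (empty when all are unique)
repeated = asins[asins.duplicated(keep=False)]
print(all_unique, len(repeated))
```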

**Export preprocessed meta data into a csv file**

In [None]:
#new_df.to_csv('clean_meta.csv',index = False)

# What to do with also_buy and also_view

In [None]:
meta_df['also_buy'].value_counts()

In [None]:
abuy_split_df = pd.DataFrame(meta_df['also_buy'].tolist()).add_prefix('also_buy_')
abuy_split_df
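
Splitting the lists into one column per position gives as many columns as the longest list, most of them null for items with short lists. One possible alternative, offered only as a sketch, is a long format with one row per (item, also-bought item) pair via `DataFrame.explode`. The toy data and the name `also_buy_long` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the asin and also_buy columns of meta_df
toy_df = pd.DataFrame({
    'asin': ['A1', 'A2', 'A3'],
    'also_buy': [['B1', 'B2'], [], ['B3']],
})

# One row per list element; empty lists become NaN and are dropped
also_buy_long = (
    toy_df[['asin', 'also_buy']]
    .explode('also_buy')
    .dropna(subset=['also_buy'])
    .reset_index(drop=True)
)
print(also_buy_long)
```

The long table has two hashable columns, so it can be deduplicated, exported to csv, and joined back on `asin` without the list-related errors seen earlier.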