# Capstone 3: Data Wrangling
### By Joshua Dytko

In [1]:
import pandas as pd
import numpy as np

The dataset comes from the Amazon Review Data (2018) collection at https://nijianmo.github.io/amazon/#subsets. The specific subset used here is the Movies and TV dataset. The data is in JSON Lines (JSONL) format, where each record is stored as a JSON object on its own line of the file. Pandas' `read_json` has an optional `lines` argument that, when set to `True`, reads this one-record-per-line format. The data should already have been cleaned by the team that compiled it. We will confirm that here.

In [2]:
df = pd.read_json('Movies_and_TV_5.json', lines=True)
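
To illustrate what `lines=True` does, here is a minimal sketch that reads two made-up JSON Lines records from an in-memory string instead of the real file. The records and the names `sample_jsonl` and `sample_df` are assumptions for illustration only.

```python
import io

import pandas as pd

# Two made-up records in JSON Lines form: one complete JSON object per line.
sample_jsonl = (
    '{"overall": 5, "reviewerID": "USER_A", "asin": "ITEM_1"}\n'
    '{"overall": 3, "reviewerID": "USER_B", "asin": "ITEM_2"}\n'
)

# lines=True tells pandas to parse each line as a separate record (row).
sample_df = pd.read_json(io.StringIO(sample_jsonl), lines=True)
print(sample_df)
```

Each line becomes one row, and the keys of the JSON objects become the columns.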

In [3]:
df.head(5)

| | overall | verified | reviewTime | reviewerID | asin | style | reviewerName | reviewText | summary | unixReviewTime | vote | image |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | True | 11 9, 2012 | A2M1CU2IRZG0K9 | 0005089549 | {'Format:': ' VHS Tape'} | Terri | So sorry I didn't purchase this years ago when... | Amazing! | 1352419200 | | |
| 1 | 5 | True | 12 30, 2011 | AFTUJYISOFHY6 | 0005089549 | {'Format:': ' VHS Tape'} | Melissa D. Abercrombie | Believe me when I tell you that you will recei... | Great Gospel VHS of the Cathedrals! | 1325203200 | | |
| 2 | 5 | True | 04 21, 2005 | A3JVF9Y53BEOGC | 000503860X | {'Format:': ' DVD'} | Anthony Thompson | I have seen X live many times, both in the ear... | A great document of a great band | 1114041600 | 11.0 | |
| 3 | 5 | True | 04 6, 2005 | A12VPEOEZS1KTC | 000503860X | {'Format:': ' DVD'} | JadeRain | I was so excited for this! Finally, a live co... | YES!! X LIVE!! | 1112745600 | 5.0 | |
| 4 | 5 | True | 12 3, 2010 | ATLZNVLYKP9AZ | 000503860X | {'Format:': ' DVD'} | T. Fisher | X is one of the best punk bands ever. I don't ... | X have still got it | 1291334400 | 5.0 | |


The dataset comes with many more columns than are useful for an item recommender system, so it can be trimmed down.

In [4]:
df.drop(columns=['verified', 'reviewTime', 'style', 'reviewerName', 'reviewText', 'summary', 'unixReviewTime', 'vote', 'image'], inplace=True)
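
An equivalent way to trim the frame is to list the columns to keep rather than the columns to drop. The sketch below uses a small made-up frame (`raw`) with a few of the original column names. It is an illustration of the alternative, not the approach used in this notebook.

```python
import pandas as pd

# Made-up rows with a subset of the original column names.
raw = pd.DataFrame({
    "overall": [5, 3],
    "verified": [True, False],
    "reviewerID": ["USER_A", "USER_B"],
    "asin": ["ITEM_1", "ITEM_2"],
    "vote": [None, "5"],
})

# Keep only the columns the recommender needs; .copy() avoids working on a view.
keep = ["overall", "reviewerID", "asin"]
trimmed = raw[keep].copy()
print(list(trimmed.columns))
```

A keep-list stays correct if the source file later gains extra columns, whereas a drop-list would let them through.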

With that done, we can rename the columns to names that better reflect their purpose.

In [5]:
df.columns = ['rating', 'user_id', 'asin']
df.head(10)

| | rating | user_id | asin |
|---|---|---|---|
| 0 | 5 | A2M1CU2IRZG0K9 | 0005089549 |
| 1 | 5 | AFTUJYISOFHY6 | 0005089549 |
| 2 | 5 | A3JVF9Y53BEOGC | 000503860X |
| 3 | 5 | A12VPEOEZS1KTC | 000503860X |
| 4 | 5 | ATLZNVLYKP9AZ | 000503860X |
| 5 | 5 | A3TNYNA2360NPA | 000503860X |
| 6 | 5 | A2LUL6PRTXE7SE | 000503860X |
| 7 | 5 | A2CFV9UPFTTM10 | 0005419263 |
| 8 | 3 | A3139J3877Y61F | 0005419263 |
| 9 | 5 | A2PANT8U0OJNT4 | 0005419263 |
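
Assigning a list to `df.columns` depends on the columns being in the expected order. As a hedged alternative, the sketch below renames by explicit mapping with `DataFrame.rename`, using a made-up frame (`raw`) that carries the original column names.

```python
import pandas as pd

# Made-up rows using the original column names from the raw data.
raw = pd.DataFrame({
    "overall": [5, 3],
    "reviewerID": ["USER_A", "USER_B"],
    "asin": ["ITEM_1", "ITEM_2"],
})

# Map old names to new names explicitly; unlisted columns keep their names.
renamed = raw.rename(columns={"overall": "rating", "reviewerID": "user_id"})
print(list(renamed.columns))
```

With a mapping, each new name is tied to a specific old name, so a change in column order cannot silently mislabel the data.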


The first thing to do is check whether there are any null values in our data.

In [6]:
#Checking to see if there are any null values
df.isnull().values.any()

False
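
If the check above had returned `True`, a per-column count would show where the missing values are. A minimal sketch on a made-up frame (`sample`) with one deliberately missing value:

```python
import pandas as pd

# Made-up rows; one user_id is deliberately missing.
sample = pd.DataFrame({
    "rating": [5, 4, 3],
    "user_id": ["USER_A", None, "USER_C"],
    "asin": ["ITEM_1", "ITEM_2", "ITEM_3"],
})

# isnull() marks missing cells; sum() counts them per column.
null_counts = sample.isnull().sum()
print(null_counts)
```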

In [7]:
df.dtypes

rating      int64
user_id    object
asin       object
dtype: object

Now I'll check the rating column to confirm that all of its values are integers from 1 through 5.

In [8]:
df['rating'].unique()

array([5, 3, 4, 2, 1], dtype=int64)
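
Inspecting `unique()` by eye works for five values. The same check can also be written as an explicit test, sketched here on a made-up series (`ratings`), with the allowed range 1 through 5 taken from the text above.

```python
import pandas as pd

# Made-up ratings for illustration.
ratings = pd.Series([5, 3, 4, 2, 1, 5], name="rating")

# True only if every rating is one of the allowed values 1 through 5.
allowed = [1, 2, 3, 4, 5]
all_valid = bool(ratings.isin(allowed).all())
print(all_valid)
```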

Next we will check that the strings that represent the ASINs and user IDs don't contain stray whitespace. The pattern below flags leading or trailing whitespace; it does not test for other special characters.

In [9]:
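# Flag columns with any value that has leading or trailing whitespace
# (the stray space after `$` in the earlier pattern kept the second alternative from matching)
cols = df.astype(str).apply(lambda c: c.str.contains(r'^\s+|\s+$')).any()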
cols.head()

rating     False
user_id    False
asin       False
dtype: bool
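
To also test for special characters or spaces anywhere in the string, one option is to require every ID to consist only of ASCII letters and digits. This is a sketch under that assumption about the ID format, which the dataset documentation may or may not guarantee. The frame `ids` mixes IDs from the sample rows above with one made-up bad value.

```python
import pandas as pd

# IDs from the sample rows above, plus one made-up bad value ("BAD ID!").
ids = pd.DataFrame({
    "user_id": ["A2M1CU2IRZG0K9", "AFTUJYISOFHY6", "BAD ID!"],
    "asin": ["0005089549", "000503860X", "0005419263"],
})

# True where a value is NOT made up entirely of ASCII letters and digits.
bad = ids.apply(lambda c: ~c.str.fullmatch(r"[A-Za-z0-9]+"))

# Per column: does any value fail the check?
has_bad = bad.any()
print(has_bad)
```

Here `user_id` is flagged because of the made-up value, while `asin` passes.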

In [10]:
df.to_csv('Cleaned data.csv', index=False)
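
One thing to watch when reading the CSV back: ASINs such as `0005089549` look numeric, so `read_csv` may infer an integer type and drop the leading zeros. Whether this happens on the real file depends on its contents. The sketch below shows the effect on a made-up two-row CSV and one way to avoid it with the `dtype` argument.

```python
import io

import pandas as pd

# Made-up CSV text with all-digit ASINs that start with zeros.
csv_text = "rating,user_id,asin\n5,USER_A,0005089549\n3,USER_B,0005419263\n"

# Default type inference parses the all-digit column as integers.
default_read = pd.read_csv(io.StringIO(csv_text))

# Forcing asin to str keeps the IDs exactly as written.
string_read = pd.read_csv(io.StringIO(csv_text), dtype={"asin": str})

print(default_read["asin"].iloc[0])
print(string_read["asin"].iloc[0])
```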

### Summary

The data has been imported from the Amazon Review Data (2018) dataset, put together by Jianmo Ni of the University of California San Diego computer science department. The dataset has been checked for null values, invalid ratings, and stray whitespace, reduced to only the necessary information, and is now ready for EDA.

The columns retained are rating, user_id, and asin.

Source:

Justifying recommendations using distantly-labeled reviews and fine-grained aspects

Jianmo Ni, Jiacheng Li, Julian McAuley

Empirical Methods in Natural Language Processing (EMNLP), 2019