## Creating Our Own Dataset

In part 1 of the project, this notebook creates a dataset that contains a random sample of reviews.
<br>
<br>
Some of the lessons learned during these tasks:
* Working with a large dataset is laborious and takes a long time to process on a personal computer. Working from a reduced corpus is more efficient. Because an arbitrary cut could skew the data, the reduction should be a random sample.
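
As a rough illustration of why a random sample is a reasonable stand-in for the full data, the sketch below draws 10% of a made-up, skewed list of ratings and compares the share of 5-star ratings before and after. The `population` values are hypothetical and only the standard library is used.

```python
import random
from collections import Counter

# Hypothetical skewed population of 10,000 ratings
# (60% fives, 10% for each of the other ratings).
population = [5] * 6000 + [4] * 1000 + [3] * 1000 + [2] * 1000 + [1] * 1000

rng = random.Random(42)                  # fixed seed so the draw is reproducible
sample = rng.sample(population, k=1000)  # sampling without replacement

# Compare the share of 5-star ratings in the population and in the sample.
pop_share = Counter(population)[5] / len(population)
sample_share = Counter(sample)[5] / len(sample)
print(f"share of 5s: population={pop_share:.2f}, sample={sample_share:.2f}")
```

The two shares should be close, which is the property that makes a random sample preferable to an arbitrary slice.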

## Tasks 1, 2 and 3

The tasks revolve around:
<br>
1) Import the relevant libraries
<br>
2) Read the game review data from newline-delimited JSON (ndjson) into a pandas DataFrame
<br>
3) Create a plot to understand the distribution of the product ratings (**overall** column)
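
For readers unfamiliar with the format, newline-delimited JSON stores one JSON object per line. The sketch below parses two made-up records with the standard `json` module. It is an illustration of the format, not the loading method used in this notebook, and the record contents are hypothetical.

```python
import io
import json

# Two hypothetical review records, one JSON object per line.
raw = io.StringIO(
    '{"overall": 5.0, "reviewText": "great game"}\n'
    '{"overall": 2.0, "reviewText": "too complicated"}\n'
)

# Each non-empty line is parsed independently of the others.
records = [json.loads(line) for line in raw if line.strip()]
print(len(records), records[0]["overall"])
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame`.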

In [42]:
import pandas as pd 
import altair as alt 
import ndjson
import matplotlib.pyplot as plt 
import requests 
import gzip
import numpy as np 
from pathlib import Path
%matplotlib inline

In [43]:
file_location = Path.home()/'Downloads/Video_Games_5.json'

In [44]:
# Read json (new line delimited) data via ndjson
with open(file_location) as f:
    file = ndjson.load(f)

In [45]:
# Sample of the json 'like' data 
file[0]

{'overall': 5.0,
 'verified': True,
 'reviewTime': '10 17, 2015',
 'reviewerID': 'A1HP7NVNPFMA4N',
 'asin': '0700026657',
 'reviewerName': 'Ambrosia075',
 'reviewText': "This game is a bit hard to get the hang of, but when you do it's great.",
 'summary': "but when you do it's great.",
 'unixReviewTime': 1445040000}

In [46]:
# To make this more operational the json(nd) data should be converted into a pandas DataFrame format
df = pd.DataFrame(file)

In [47]:
df.head()

Unnamed: 0,overall,verified,reviewTime,reviewerID,asin,reviewerName,reviewText,summary,unixReviewTime,vote,style,image
0,5.0,True,"10 17, 2015",A1HP7NVNPFMA4N,700026657,Ambrosia075,"This game is a bit hard to get the hang of, bu...",but when you do it's great.,1445040000,,,
1,4.0,False,"07 27, 2015",A1JGAP0185YJI6,700026657,travis,I played it a while but it was alright. The st...,"But in spite of that it was fun, I liked it",1437955200,,,
2,3.0,True,"02 23, 2015",A1YJWEXHQBWK2B,700026657,Vincent G. Mezera,ok game.,Three Stars,1424649600,,,
3,2.0,True,"02 20, 2015",A2204E1TH211HT,700026657,Grandma KR,"found the game a bit too complicated, not what...",Two Stars,1424390400,,,
4,5.0,True,"12 25, 2014",A2RF5B5H74JLPE,700026657,jon,"great game, I love it and have played it since...",love this game,1419465600,,,


In [48]:
# Check basic structure (info) of dataset 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 497577 entries, 0 to 497576
Data columns (total 12 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   overall         497577 non-null  float64
 1   verified        497577 non-null  bool   
 2   reviewTime      497577 non-null  object 
 3   reviewerID      497577 non-null  object 
 4   asin            497577 non-null  object 
 5   reviewerName    497501 non-null  object 
 6   reviewText      497419 non-null  object 
 7   summary         497468 non-null  object 
 8   unixReviewTime  497577 non-null  int64  
 9   vote            107793 non-null  object 
 10  style           289237 non-null  object 
 11  image           3634 non-null    object 
dtypes: bool(1), float64(1), int64(1), object(9)
memory usage: 42.2+ MB


Almost all observations are present: the first 9 columns contain very few null values. Only the last 3 columns **(vote, style and image)** contain a noticeably large number of null values. We can leave these as they are for now because these features are not part of the analysis.
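
The null counts read off `df.info()` can also be computed directly. A minimal sketch on a toy frame (the `toy` data is hypothetical, while `isna`, `sum` and `mean` are standard pandas methods):

```python
import pandas as pd

# Toy frame standing in for the review data (hypothetical values).
toy = pd.DataFrame({
    "overall": [5, 4, 3, 1],
    "reviewText": ["fun", None, "ok", "bad"],
    "vote": [None, None, "3", None],
})

null_counts = toy.isna().sum()   # number of missing values per column
null_share = toy.isna().mean()   # fraction of missing values per column
print(null_counts)
print(null_share)
```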

In [49]:
import pandas_bokeh
pandas_bokeh.output_notebook()

In [50]:
# Originally loaded as floating-point values - better to keep as integers
df['overall'] = df['overall'].astype('int')

In [51]:
df.overall.unique()

array([5, 4, 3, 2, 1])

Given that the **overall** column is ***ordinal*** in nature, it could be converted to an ordered category in ascending order (1 to 5). The conversion is left commented out here, so the column stays as integers.

In [52]:
#order = [1, 2, 3, 4, 5] # From lowest-highest
#ordered_cat = pd.api.types.CategoricalDtype(ordered= True, categories = order)
#df['overall'] = df.overall.astype(ordered_cat)

In [53]:
#df['overall'].unique()
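
For reference, the ordered-category conversion can be sketched on a toy Series as follows. The `toy_ratings` values are hypothetical, and `CategoricalDtype` is the same pandas type named in the commented-out cell.

```python
import pandas as pd

# Toy ratings standing in for df['overall'] (hypothetical values).
toy_ratings = pd.Series([5, 4, 3, 2, 1, 5])

# Ordered categories from lowest to highest rating.
ordered_cat = pd.api.types.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
ratings = toy_ratings.astype(ordered_cat)

# With an ordered dtype, comparisons and min/max respect the category order.
high_ratings = int((ratings > 3).sum())
print(ratings.min(), ratings.max(), high_ratings)
```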

In [54]:
df_counts = df['overall'].value_counts()
df_counts

5    299759
4     93654
3     49146
1     30883
2     24135
Name: overall, dtype: int64

In [55]:
df_counts.sort_index(ascending=True).plot_bokeh(kind='bar', ylabel='Count', xlabel='Rating', legend=False,
                                               hovertool_string='<h4>Count:</h4> @{overall}');

~~Find ways via Altair to visualise results~~

In [56]:
from collections import Counter
rating_counts = Counter(df['overall'])
ratings = [str(i) for i in list(rating_counts.keys())]
counts = list(rating_counts.values())

# Altair takes a DataFrame as input, so build a small one holding the counts
df_distribution = pd.DataFrame({'rating': ratings, 'count': counts})
df_distribution

Unnamed: 0,rating,count
0,5,299759
1,4,93654
2,3,49146
3,2,24135
4,1,30883


In [59]:
chart1 = alt.Chart(df_distribution).mark_bar().encode(x="rating", y="count", tooltip=[alt.Tooltip('count'), alt.Tooltip('rating')]).interactive()
chart1

In [62]:
chart1.save('rating_counts.html')

## Task 4 

Take a random sample of the reviews by selecting 1500 reviews with rating 1, 500 reviews each with ratings 2, 3 and 4, and 1500 reviews with rating 5. This gives us a **small corpus** to work from.

In [159]:
# Relevant libraries/workflow to perform random under-sampling - taking majority classes and randomly picking samples with/without replacement
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

In [160]:
# Define features and target 
X = df[['overall', 'reviewText']]
y = df['overall']

In [161]:
# Convert to numpy arrays 
X_arr = X.values
y_arr = y.values

In [162]:
print(f'Original dataset shape: {Counter(y)}')

Original dataset shape: Counter({5: 299759, 4: 93654, 3: 49146, 1: 30883, 2: 24135})


In [163]:
sampling_strategy = {1: 1500, 2: 500, 3: 500, 4: 500, 5: 1500}

In [164]:
# Fix the seed so the results are reproducible
seed = 42
undersample = RandomUnderSampler(random_state=seed, sampling_strategy=sampling_strategy)
X_under_small, y_under_small = undersample.fit_resample(X_arr, y_arr)
print(f'Resampled dataset shape: {Counter(y_under_small)}')

Resampled dataset shape: Counter({1: 1500, 5: 1500, 2: 500, 3: 500, 4: 500})


In [165]:
print(X_under_small.shape, y_under_small.shape)

(4500, 2) (4500,)
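
The same kind of per-rating sample could also be drawn with pandas alone. This is an alternative sketch and not the method used above: the `toy` frame and the scaled-down `targets` are hypothetical, and the rows selected would differ from those picked by `RandomUnderSampler`.

```python
import pandas as pd

# Toy frame standing in for the review data (hypothetical values).
toy = pd.DataFrame({
    "overall": [1] * 10 + [2] * 10 + [5] * 10,
    "reviewText": [f"review {i}" for i in range(30)],
})

# Hypothetical per-rating sample sizes, scaled down for the toy data.
targets = {1: 3, 2: 1, 5: 3}

# Sample each rating separately without replacement, then stack the parts.
parts = [
    toy[toy["overall"] == rating].sample(n=n, random_state=42)
    for rating, n in targets.items()
]
small = pd.concat(parts, ignore_index=True)

counts = small["overall"].value_counts().to_dict()
print(counts)
```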


## Task 5

Take a random sample of the reviews by selecting 100,000 reviews - which provides us with a **large corpus**

In [170]:
random_state = 42
np.random.seed(random_state)

# Draw 100k ratings uniformly at random from 1 to 5 (the `high` bound is exclusive);
# the number of times each rating is drawn becomes the per-rating sample size
rand_ratings_sample = np.random.randint(low=1, high=6, size=100_000)

In [171]:
# Advanced data structure (dictionary) to handle counting unique frequencies 
from collections import defaultdict

In [172]:
# Count how many times each rating was drawn
sampling_strategy = defaultdict(int)
for rating in rand_ratings_sample:
    # defaultdict(int) starts missing keys at 0, so no membership check is needed
    sampling_strategy[rating] += 1

In [173]:
sampling_strategy

defaultdict(int, {4: 19981, 5: 20187, 3: 19732, 2: 20082, 1: 20018})
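
The counting step can also be done without an explicit loop. A minimal sketch using `np.unique` with `return_counts=True`: note that it uses `np.random.default_rng`, which is an assumption and produces different draws from the `np.random.seed` call above, and that the sample size is scaled down.

```python
import numpy as np

# Assumption: a Generator-based RNG, which does not reproduce the draws above.
rng = np.random.default_rng(42)
draws = rng.integers(low=1, high=6, size=1_000)  # `high` is exclusive

# Unique ratings and how often each one was drawn.
values, counts = np.unique(draws, return_counts=True)

# Plain Python ints as keys and values, mapping rating -> number of draws.
strategy = {int(v): int(c) for v, c in zip(values, counts)}
print(strategy)
```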

In [174]:
undersample = RandomUnderSampler(random_state=seed, sampling_strategy=sampling_strategy)
X_under_large, y_under_large = undersample.fit_resample(X_arr, y_arr)
print(f'Original dataset shape: {Counter(y)}')
print(f'Resampled dataset shape: {Counter(y_under_large)}')

Original dataset shape: Counter({5: 299759, 4: 93654, 3: 49146, 1: 30883, 2: 24135})
Resampled dataset shape: Counter({5: 20187, 2: 20082, 1: 20018, 4: 19981, 3: 19732})


In [175]:
print(X_under_large.shape, y_under_large.shape)

(100000, 2) (100000,)
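
The cells above draw a roughly equal number of reviews per rating. If a simple random sample that keeps the original rating proportions were wanted instead, `DataFrame.sample` is one option. A minimal sketch on a toy frame, where the `toy` data and the scaled-down sample size are hypothetical:

```python
import pandas as pd

# Toy frame with a skewed rating distribution (hypothetical values).
toy = pd.DataFrame({
    "overall": [5] * 600 + [4] * 200 + [3] * 100 + [2] * 50 + [1] * 50,
    "reviewText": [f"review {i}" for i in range(1000)],
})

# Simple random sample of 100 rows without replacement, reproducible via the seed.
large = toy.sample(n=100, random_state=42)

print(large["overall"].value_counts(normalize=True).sort_index())
```

Which of the two behaviours is preferable depends on whether the large corpus is meant to be balanced or representative.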


## Task 6

Finally, we'll export the corpora (small and large corpus) to two separate csv files 

In [180]:
# Go from arrays to a pandas DataFrame - tabulated format to then export to csv 
# format - ratings (as overall) and text (as reviewText)
# small corpus 
small_corpus = pd.DataFrame({'ratings': X_under_small[:,0], 'reviews': X_under_small[:,1]})
# notice ratings are not specified as int - fix 
small_corpus['ratings'] = small_corpus['ratings'].astype('int')
small_corpus.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4500 entries, 0 to 4499
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   ratings  4500 non-null   int64 
 1   reviews  4496 non-null   object
dtypes: int64(1), object(1)
memory usage: 70.4+ KB


In [181]:
# large corpus 
large_corpus = pd.DataFrame({'ratings': X_under_large[:,0], 'reviews': X_under_large[:,1]})
large_corpus['ratings'] = large_corpus['ratings'].astype('int')
large_corpus.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 2 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   ratings  100000 non-null  int64 
 1   reviews  99985 non-null   object
dtypes: int64(1), object(1)
memory usage: 1.5+ MB


In [185]:
# Finally export to csv - specify encoding 'utf-8-sig' so that special characters display correctly when the file is opened in a spreadsheet
small_corpus.to_csv('small_corpus.csv', encoding='utf-8-sig', index=False)
large_corpus.to_csv('large_corpus.csv', encoding='utf-8-sig', index=False)
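
To show what the `utf-8-sig` encoding changes, the sketch below writes a small CSV with the standard library and inspects its first bytes. The file name and row contents are hypothetical. The encoding prepends a byte order mark (BOM), which spreadsheet programs commonly use to detect UTF-8.

```python
import csv
import os
import tempfile

# Hypothetical file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# Write with 'utf-8-sig': the file starts with the UTF-8 byte order mark.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ratings", "reviews"])
    writer.writerow([5, "café was great"])

# The first three bytes are the BOM.
with open(path, "rb") as f:
    first_bytes = f.read(3)

# Reading back with 'utf-8-sig' strips the BOM again.
with open(path, encoding="utf-8-sig", newline="") as f:
    rows = list(csv.reader(f))

print(first_bytes, rows)
```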