# Supplementary Notebook: Features

This notebook introduces the dataset used in the Recommender System notebook in this course. Most of it should be review from courses 2 and 3, but it serves as a refresher if you need one. We will discuss how to obtain certain features from our data, using the dataset found here: https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz

### The Data: Amazon Video Game Reviews

This dataset is a collection of Amazon customer reviews and star ratings for digital video games.

We will import the data and set up our dataset below.

In [1]:
import gzip
path = "C:/Users/Ian/Documents/PythonDataProducts4PredictiveAnalytics/DesignThinking&PredictiveAnalytics4DataProducts/Final_Course2/datasets/W2_Task/amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz"

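#open the compressed file in text mode; the with-block closes it when done
with gzip.open(path, 'rt', encoding="utf8") as f:
    #the first line holds the tab-separated column names
    header = f.readline().strip().split('\t')
    dataset = []

    for line in f:
        #pair each column name with its value for this review
        fields = line.strip().split('\t')
        d = dict(zip(header, fields))
        dataset.append(d)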
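
One caveat with `line.strip().split('\t')` in the loading cell: `strip()` also removes trailing tabs, so a row whose last columns are empty yields fewer fields than the header, and `zip` silently drops those keys. A minimal sketch of an alternative loader using the standard-library `csv.DictReader` follows. The function name `load_tsv_gz` and the small demo file are hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
import csv
import gzip
import os
import tempfile

# Hypothetical helper: read a gzipped TSV into a list of dicts.
def load_tsv_gz(path):
    with gzip.open(path, 'rt', encoding='utf8', newline='') as f:
        # QUOTE_NONE treats quote characters in review text as ordinary text
        reader = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
        return list(reader)

# Build a tiny demo file (made-up rows) so the sketch runs on its own
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.tsv.gz')
with gzip.open(demo_path, 'wt', encoding='utf8') as f:
    f.write('marketplace\tstar_rating\treview_date\n')
    f.write('US\t2\t2015-08-31\n')
    f.write('US\t5\t\n')  # last column is empty on purpose

rows = load_tsv_gz(demo_path)
print(rows[1])  # the empty trailing field is kept as '' rather than dropped
```

With this approach an empty final column shows up as an empty string, which a later cleaning step can detect explicitly.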

In the next cell we display the header, which lists the features that each entry in our dataset contains. The following cell shows how to view an entry, and the one after that shows how to get a specific value out of an entry.

In [2]:
#list of features
header

['marketplace',
 'customer_id',
 'review_id',
 'product_id',
 'product_parent',
 'product_title',
 'product_category',
 'star_rating',
 'helpful_votes',
 'total_votes',
 'vine',
 'verified_purchase',
 'review_headline',
 'review_body',
 'review_date']

In [3]:
#general format of dataset entry
dataset[0]

{'marketplace': 'US',
 'customer_id': '21269168',
 'review_id': 'RSH1OZ87OYK92',
 'product_id': 'B013PURRZW',
 'product_parent': '603406193',
 'product_title': 'Madden NFL 16 - Xbox One Digital Code',
 'product_category': 'Digital_Video_Games',
 'star_rating': '2',
 'helpful_votes': '2',
 'total_votes': '3',
 'vine': 'N',
 'verified_purchase': 'N',
 'review_headline': 'A slight improvement from last year.',
 'review_body': "I keep buying madden every year hoping they get back to football. This years version is a little better than last years -- but that's not saying much.The game looks great. The only thing wrong with the animation, is the way the players are always tripping on each other.<br /><br />The gameplay is still slowed down by the bloated pre-play controls. What used to take two buttons is now a giant PITA to get done before an opponent snaps the ball or the play clock runs out.<br /><br />The turbo button is back, but the player movement is still slow and awkward. If you lik

In [4]:
#pulling a feature out of an entry
dataset[0]['product_title']

'Madden NFL 16 - Xbox One Digital Code'
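
Note that every value in an entry is a string, including numeric columns such as `'star_rating': '2'`. Before using those columns as numerical features, they need to be cast. The following is a small sketch of one way to do that; `parse_numeric` and `NUMERIC_FEATURES` are hypothetical names introduced here, and `sample` is a shortened stand-in for `dataset[0]`.

```python
# Hypothetical helper (not part of the original notebook): cast the numeric
# columns of one entry from strings to ints, leaving other fields untouched.
NUMERIC_FEATURES = ['star_rating', 'helpful_votes', 'total_votes']

def parse_numeric(datum, feat_list=NUMERIC_FEATURES):
    parsed = dict(datum)  # copy so the original entry is not modified
    for key in feat_list:
        parsed[key] = int(datum[key])
    return parsed

# A shortened stand-in for dataset[0], using values from the entry shown earlier
sample = {'product_title': 'Madden NFL 16 - Xbox One Digital Code',
          'star_rating': '2', 'helpful_votes': '2', 'total_votes': '3'}

parsed_sample = parse_numeric(sample)
print(parsed_sample['star_rating'] + 1)  # arithmetic now works on the int value
```

Returning a copy keeps the raw string data intact, which can help if you later need to re-parse a column differently.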

Let's check whether any data entries contain an 'NA' value by creating a second dataset that excludes entries with 'NA' values and comparing its length to the original.

In [5]:
#this function will help with the below dataset cleaning
def cleaned(datum, feat_list):
    for f in feat_list:
        if datum[f] == 'NA':
            return False
    return True

dataset_cleaned = [d for d in dataset if cleaned(d, header)]

len(dataset) == len(dataset_cleaned)

True

Notice the two are equal! This is because the Amazon datasets used in this course are pre-cleaned, meaning that they contain no missing values. You may not always have this with your data, so be sure to clean your data before using it!
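
When a dataset is not pre-cleaned, it is often useful to know which features are missing and how often, not just whether the lengths match. The sketch below counts missing values per feature on a made-up toy list. The name `count_missing` and the choice of `'NA'` and the empty string as missing markers are assumptions for illustration; other datasets may mark missing values differently.

```python
from collections import Counter

# Assumption: these markers are treated as "missing"; adjust for your data.
MISSING_MARKERS = {'NA', ''}

def count_missing(data, feat_list):
    counts = Counter()
    for datum in data:
        for f in feat_list:
            # .get() also counts keys that are absent from an entry
            if datum.get(f, '') in MISSING_MARKERS:
                counts[f] += 1
    return counts

# Made-up entries, not taken from the Amazon data
toy = [{'star_rating': '5', 'total_votes': '3'},
       {'star_rating': 'NA', 'total_votes': ''},
       {'star_rating': '4'}]

missing = count_missing(toy, ['star_rating', 'total_votes'])
print(dict(missing))
```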

### Try this!

Next, try to write a function that replaces any 'NA' value in an entry with the average for that feature. (Note: The Amazon data has text fields as well, but we'll use the new feature set defined below, which covers only a few numerical columns.)

In [102]:
num_features = ['star_rating','helpful_votes','total_votes']

#notice avg_list is an input here, so the averages are calculated outside of this
#function, one per feature and in the same order as feat_list
def replace_w_avg(datum, feat_list, avg_list):
    for key, avg in zip(feat_list, avg_list):
        if datum[key] == 'NA':
            #fill the missing value with the average for this feature
            datum[key] = avg

    print(datum)
    return datum

In [103]:
#Test for previous function
test_dat = [{'star_rating': 3, 'helpful_votes': 3, 'total_votes': 0},
            {'star_rating': 2, 'helpful_votes': 4, 'total_votes': 3},
            {'star_rating': 3, 'helpful_votes': 2, 'total_votes': 3},
            {'star_rating': 2, 'helpful_votes': 4, 'total_votes': 1},
            {'star_rating': 4, 'helpful_votes': 1, 'total_votes': 2}]
#note this was randomly generated such that each numeric value was an int between 0-4

# the calculated averages
avg_list = [2.8, 2.8, 1.8]

In [104]:
test_dat[0]

{'star_rating': 3, 'helpful_votes': 3, 'total_votes': 0}

In [105]:
test_datum = {'star_rating': 'NA', 'helpful_votes': 3, 'total_votes': 'NA'}

replace_w_avg(test_datum, num_features, avg_list) == {'star_rating': 2.8, 'helpful_votes': 3, 'total_votes': 1.8}

{'star_rating': 2.8, 'helpful_votes': 3, 'total_votes': 1.8}


True
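
The averages in `avg_list` were supplied by hand for the test. As a sketch of how they could be computed from the data itself, the hypothetical helper `feature_averages` below skips 'NA' values and returns one average per feature, in the same order as the feature list. It assumes the remaining values can be converted with `float`.

```python
# Hypothetical helper: one average per feature, ignoring missing values.
def feature_averages(data, feat_list, missing='NA'):
    averages = []
    for key in feat_list:
        values = [float(d[key]) for d in data if d[key] != missing]
        # None signals that a feature had no usable values at all
        averages.append(sum(values) / len(values) if values else None)
    return averages

num_features = ['star_rating', 'helpful_votes', 'total_votes']

# Same toy data as the test cell
test_dat = [{'star_rating': 3, 'helpful_votes': 3, 'total_votes': 0},
            {'star_rating': 2, 'helpful_votes': 4, 'total_votes': 3},
            {'star_rating': 3, 'helpful_votes': 2, 'total_votes': 3},
            {'star_rating': 2, 'helpful_votes': 4, 'total_votes': 1},
            {'star_rating': 4, 'helpful_votes': 1, 'total_votes': 2}]

avg_list = feature_averages(test_dat, num_features)
print(avg_list)
```

On the toy data this reproduces the hand-calculated averages of 2.8, 2.8, and 1.8 (up to floating-point rounding).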

## There we go!

This should have been a quick refresher on data features, transformations, and missing values. This example is based entirely on categorical data, but the same principles apply to temporal data, where each feature would represent a time step (similar to the rating:month example in the Temporal data video).
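
To make the time-step idea concrete, here is a minimal sketch that groups star ratings by month. It assumes `review_date` is a string in `YYYY-MM-DD` form, which is not shown in this notebook, so check it against your data. The function name `average_rating_by_month` and the toy reviews are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical helper: average star rating per (year, month) time step.
def average_rating_by_month(data):
    totals = defaultdict(lambda: [0, 0])  # (year, month) -> [rating sum, count]
    for d in data:
        # Assumption: dates look like '2015-08-31'
        date = datetime.strptime(d['review_date'], '%Y-%m-%d')
        key = (date.year, date.month)
        totals[key][0] += int(d['star_rating'])
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

# Made-up reviews, not taken from the Amazon data
toy_reviews = [{'star_rating': '2', 'review_date': '2015-08-31'},
               {'star_rating': '4', 'review_date': '2015-08-30'},
               {'star_rating': '5', 'review_date': '2015-07-12'}]

monthly = average_rating_by_month(toy_reviews)
print(monthly)
```

Each key of the result is one time step, so the same cleaning and averaging ideas from this notebook can be applied per month instead of over the whole dataset.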