# Reviewer Feature Engineering

Now that the reviewer data has been mapped and consolidated, it has only the following columns:

reviewerID | 1* | 2* | 3* | 4* | 5*

A few per-reviewer metrics still need to be added:

activity | avg_rating | std_reviews

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pd.options.mode.chained_assignment = None
%matplotlib inline

In [2]:
consolidated_reviewer_data = pd.read_csv("./Intermediate_Datasets/consolidated_reviewer_data.csv")

In [3]:
consolidated_reviewer_data.shape # as expected, there are around 43 million rows

(43531850, 6)

In [4]:
ratings = [1,2,3,4,5]
rating_columns = ['1*','2*','3*','4*','5*']

In [5]:
consolidated_reviewer_data['activity'] = consolidated_reviewer_data[rating_columns].sum(axis=1)

In [6]:
consolidated_reviewer_data['activity'].sum() #as expected, around 230 million reviews

233055318

In [7]:
consolidated_reviewer_data['avg_rating'] = \
    ( consolidated_reviewer_data['1*'] + \
     2*consolidated_reviewer_data['2*'] + \
     3*consolidated_reviewer_data['3*'] + \
     4*consolidated_reviewer_data['4*'] + \
     5*consolidated_reviewer_data['5*'] ) / \
    consolidated_reviewer_data['activity']
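
The weighted sum above can also be written as a matrix-vector product, which avoids spelling out each column. This is a minimal sketch on a small made-up frame; the `toy` data and the `weighted_average_rating` helper are illustrative assumptions, not part of the notebook. It also shows that a row with zero activity produces NaN, since 0/0 is undefined:

```python
import numpy as np
import pandas as pd

ratings = [1, 2, 3, 4, 5]
rating_columns = ['1*', '2*', '3*', '4*', '5*']

def weighted_average_rating(df):
    """Average star rating per row from per-star counts (hypothetical helper)."""
    counts = df[rating_columns]
    activity = counts.sum(axis=1)
    # dot() multiplies each count column by its star value and sums across columns
    weighted_sum = counts.dot(np.array(ratings))
    return weighted_sum / activity

# made-up data for illustration only
toy = pd.DataFrame(
    [[0, 1, 0, 0, 2],   # ratings 2, 5, 5
     [0, 0, 0, 0, 0]],  # no reviews at all
    columns=rating_columns,
)
toy_avg = weighted_average_rating(toy)
```

Rows with NaN averages may need to be dropped or handled explicitly before modelling.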

## Vectorized approach to calculating standard deviation

###### The main difficulty is that we have the count of each rating, not a list of individual ratings

In [8]:
consolidated_reviewer_data['sum_resid_squared'] = 0

for rating, rating_column in zip(ratings, rating_columns):
    # the right side is essentially (resid^2 + resid^2 + resid^2), which is just count*(resid^2)
    consolidated_reviewer_data['sum_resid_squared'] += \
        consolidated_reviewer_data[rating_column] * (rating - consolidated_reviewer_data['avg_rating'])**2
    
# now divide by n (population standard deviation, matching np.std's default ddof=0) and take the square root
consolidated_reviewer_data['std_reviews'] = \
    (consolidated_reviewer_data['sum_resid_squared'] / (consolidated_reviewer_data['activity'])) ** 0.5
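
One way to sanity-check the counts-based formula is to expand the counts back into a list of ratings and compare against `np.std`. The sketch below does this for a single reviewer; `std_from_counts` is a hypothetical helper written for illustration, not something defined elsewhere in the notebook. It divides by n (population standard deviation), which is what `np.std` does by default; a sample standard deviation would divide by n - 1 instead.

```python
import numpy as np

ratings = np.array([1, 2, 3, 4, 5])

def std_from_counts(counts):
    """Population std of ratings given per-star counts (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    mean = (counts * ratings).sum() / n
    # each star value contributes count * (rating - mean)^2 to the sum of squares
    sum_resid_squared = (counts * (ratings - mean) ** 2).sum()
    return np.sqrt(sum_resid_squared / n)  # divide by n, not n - 1

counts = [1, 0, 1, 0, 5]                # one 1*, one 3*, five 5*
expanded = np.repeat(ratings, counts)   # the same reviews as an explicit list
from_counts = std_from_counts(counts)
from_list = np.std(expanded)            # ddof=0 by default
```

Expanding with `np.repeat` is fine for a spot check but would be far too slow and memory-hungry across tens of millions of rows, which is why the column-wise version is used on the full frame.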

In [9]:
consolidated_reviewer_data.drop(columns=['sum_resid_squared'], inplace=True)

## Double check the calculations

In [10]:
consolidated_reviewer_data.head(2000).query('std_reviews > 1').head(5)

Unnamed: 0,reviewerID,1*,2*,3*,4*,5*,activity,avg_rating,std_reviews
9,A002556217M3R4LLKZHR,0,1,0,0,2,3,4.0,1.414214
18,A0048432VUYJSUTI513P,0,1,0,0,2,3,4.0,1.414214
27,A0081581LX99MYDYNRIB,1,0,1,0,5,7,4.142857,1.456863
31,A0092179C7BLLJP4Y2WP,0,2,0,0,1,3,3.0,1.414214
38,A0116085IRE3KOZX3Y4D,4,1,1,1,12,19,3.842105,1.662691


In [11]:
np.std([2,5,5]) # matches the calculated value in the dataframe

1.4142135623730951

In [12]:
consolidated_reviewer_data.head(1000).query('std_reviews == 0').head(5)
# these are all from users with only one type of rating (like all 5's)

Unnamed: 0,reviewerID,1*,2*,3*,4*,5*,activity,avg_rating,std_reviews
0,A0009674W2SIW8AIECUF,0,0,0,0,2,2,5.0,0.0
1,A0010028HGBTWSS5F8J6,0,0,0,0,1,1,5.0,0.0
2,A0011110I4YVY1W3WC02,0,0,0,0,5,5,5.0,0.0
3,A00142007XJDHJI1P4J5,0,0,0,0,2,2,5.0,0.0
4,A001753853F77I9Z0WD9,0,0,0,1,0,1,4.0,0.0


## Take an initial look at the distribution

In [13]:
consolidated_reviewer_data['activity'].describe()

count    4.353185e+07
mean     5.353674e+00
std      1.686515e+01
min      0.000000e+00
25%      1.000000e+00
50%      2.000000e+00
75%      4.000000e+00
max      1.344600e+04
Name: activity, dtype: float64

#### Because later models will try to predict the average rating given a review, we need to identify which users to include in the training set

#### We need users with enough reviews to establish a meaningful average
#### We don't want users with too many reviews, because they might not be representative (e.g. a single reviewer posting 200+ reviews)

In [14]:
consolidated_reviewer_data['active_reviewer'] = \
    (consolidated_reviewer_data['activity'] > 5) & (consolidated_reviewer_data['activity'] < 15)
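
Because the bounds in the cell above are strict, the filter keeps reviewers with 6 through 14 reviews. A small sketch on made-up activity counts makes the boundary behaviour explicit and shows an equivalent form using `Series.between`, which includes both endpoints by default:

```python
import pandas as pd

# made-up activity counts for illustration only
activity = pd.Series([1, 5, 6, 14, 15, 200], name='activity')

# strict bounds, as in the cell above
strict = (activity > 5) & (activity < 15)
# equivalent for integer counts: between() is inclusive on both ends by default
inclusive = activity.between(6, 14)
```

Either form works; the `between` version simply states the kept range directly.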

In [15]:
consolidated_reviewer_data['active_reviewer'].value_counts()
# even after applying the condition for "active_reviewer", we still have almost 6 million users!

False    37792102
True      5739748
Name: active_reviewer, dtype: int64

## Save for later EDA!

In [16]:
consolidated_reviewer_data.to_csv("../Processed_Data/reviewer_data.csv", index=False)