##### Spoiler Alert! Spoiler Detection Project

## Train-Test-Split and Preprocessing 

In [2]:
reset -fs

In [3]:
import seaborn as sns
import pandas as pd
import numpy as np
import gzip
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from datetime import datetime
from tqdm import tqdm

In [4]:
#Disable scientific notation for floats
pd.options.display.float_format = '{:,}'.format

#Enable viewing more (in this case: all) features of a dataset
pd.set_option('display.max_columns', 500)

#ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [5]:
#Load datafile
df = pd.read_hdf('data/complete_data.h5')

### Train-Test-Split

In [6]:
# Split train, validation and test test with ratios 70% - 20% -10%
train, validation, test = np.split(df.sample(frac=1), [int(.7*len(df)), int(.9*len(df))])

In [6]:
#Save validation and test sets (train set will be saved after preprocessing) as HDF5
test.to_hdf('data/test_data.h5', key = 'test')
validation.to_hdf('data/validation_data.h5', key = 'validation')

### Data Preprocessing

From now on, only the train data is manipulated, validation and test sets are only worked on just before model evaluation.

In [7]:
#Reset the index
train = train.reset_index()

In [7]:
#Drop the feature containing the old index
train.drop('index', axis = 1, inplace = True)

In [1]:
#Information on data types and missing values
train.info()

NameError: name 'train' is not defined

In [10]:
#Show unique values in the sorted genres column. Since the genres column contains dictionaries, 
#the data type is temporarily changed to string format. 
values = train.genres.astype('str').sort_values(ascending = False)
values

475915                                                   {}
404976    {'young-adult': 998, 'fantasy, paranormal': 13...
335487    {'young-adult': 998, 'fantasy, paranormal': 13...
720943    {'young-adult': 998, 'fantasy, paranormal': 13...
416191    {'young-adult': 998, 'fantasy, paranormal': 13...
                                ...                        
785621                      {'children': 102, 'fiction': 7}
583641                      {'children': 102, 'fiction': 7}
611706                      {'children': 102, 'fiction': 7}
53633                       {'children': 102, 'fiction': 7}
454983                      {'children': 102, 'fiction': 7}
Name: genres, Length: 964623, dtype: object

#### Change data and data types

Obviously, some changes are necessary:
* Missing values are denoted as '' or '[]', respectively, and need to be changed to np.nan
* Datatypes need to be changed for some variables:
  * _time_ to date
  * _book_id_ to string
  * _publication_year_, _publication_month_, _publication_day_, _average_rating_, _ratings_count_, _num_pages_ to numeric

#### Change data types

In [8]:
#Change the data type of book_id to string
train.book_id = train.book_id.astype('str')

In [9]:
#Change the data type of time from object to date in the format (YYYY-MM-DD)
from datetime import datetime
train.time = pd.to_datetime(train.time)

In [10]:
#Change datatypes from object to floats.
to_num = ['average_rating', 'ratings_count', 'publication_year', 'publication_month', 'publication_day']
for col in to_num:
    train[col] = pd.to_numeric(train[col], errors = 'coerce')

#### Feature Engineering

We add a new feature containing the frequency-weighted average of book ratings.

In [11]:
train['weighted_avg_rating'] = train.average_rating * train.ratings_count

The genres column contains more than one genre assignment to the books. Since we only want one genre per book, we create a new column containing the genre most frequently allocated. 

In [15]:
#Define function fetching the most frequent (= value) genre (= key)
import operator

def get_genre(dic):
    
    ''' Return the key of the highest value of dictionary given in.
    If the dictionary is empty, return np.nan
    '''
    
    try:
        x = max(dic.items(), key = operator.itemgetter(1))[0]
        return x
    except:
        return np.nan  

In [16]:
#Use the function defined above to fetch the most frequent genre allocation.
#First, write all keys to a list.
genre = []
for i in range(len(train)):
    a = get_genre(train.genres[i])
    genre.append(a)

In [17]:
#Add the information from the list as a new column to the genre dataframe
train['genre'] = pd.Series(genre)

We compute another column with overall spoiler labels coded as 0 = "no spoiler" and 1 = "spoiler".

In [18]:
train['spoiler_dum'] = np.where(train['spoiler']== False, 0, 1)

To also have sentence-wise labels and review text without labels, we define and apply the following functions:

In [19]:
#Get only the labels 0 and 1 from the review
def get_labels(x):
    return [label for label, text in x]

#Get only the text from the review
def get_text(x):
    return [text for label, text in x]

In [20]:
#Apply the function to the data
train['sentence_labels'] = train.review.apply(lambda x: get_labels(x))
train['review_texts'] = train.review.apply(lambda x: get_text(x))

In [21]:
train.review_texts = train.review_texts.astype('str')

We also want to delete special and digits characters from the review text and lower the text.

In [22]:
train['raw_text'] = pd.Series('str')

In [23]:
import re
for i in tqdm(range(len(train))):
    train['raw_text'][i] = re.sub('[^a-zA-Z0-9 " "]', '', train['review_texts'].copy()[i])

100%|██████████| 964623/964623 [11:44:01<00:00, 22.84it/s]  


In [24]:
train.head()

Unnamed: 0,user_id,time,review,rating,spoiler,book_id,review_id,genres,title,description,publication_year,publication_month,publication_day,average_rating,ratings_count,num_pages,weighted_avg_rating,genre,spoiler_dum,sentence_labels,review_texts,raw_text
0,28f8f1b5d8462df2dde271f0c4992bf3,2013-09-19,"[[0, Harry Dresden is the best supernatural de...",5,False,7779059,41e678f5fa16cb717e0000ba833d423e,"{'fantasy, paranormal': 3499, 'fiction': 416, ...",Side Jobs: Stories from the Dresden Files (The...,"Here, together for the first time, are the sho...",2010.0,10.0,26.0,4.24,34578.0,418.0,146610.72,"fantasy, paranormal",0,"[0, 0, 0]",['Harry Dresden is the best supernatural detec...,Harry Dresden is the best supernatural detecti...
1,521496a9a29d60e3fa1d814041f1c62b,2017-04-05,"[[0, Aww this was so much fun.], [0, I think I...",4,False,32066878,29443974b3bb5c63d8dab8871f76ece0,"{'romance': 251, 'young-adult': 6}","The Failing Hours (How to Date a Douchebag, #2)",Zeke Daniels isn't just a douchebag; he's an a...,2017.0,1.0,31.0,4.18,3959.0,322.0,16548.62,romance,0,"[0, 0]","['Aww this was so much fun.', 'I think I enjoy...",Aww this was so much fun I think I enjoyed the...
2,a8e55d9ec4691a720168153872b0e8b5,2015-03-17,"[[0, Book 1 was unique and clever despite bein...",2,False,7778609,bb8a6328c6235fdc602b395b27101469,"{'fantasy, paranormal': 1177, 'fiction': 145, ...","Kill the Dead (Sandman Slim, #2)",What do you do after you've crawled out of Hel...,2010.0,10.0,5.0,4.06,13882.0,434.0,56360.91999999999,"fantasy, paranormal",0,"[0, 0, 0, 0, 0, 0]",['Book 1 was unique and clever despite being o...,Book 1 was unique and clever despite being out...
3,a5145fcfae582bd6a2a1f958604d0903,2013-12-29,"[[0, So this one was much better than the firs...",4,False,2567987,aef659e4536be8d5803896a24738eec0,"{'comics, graphic': 2096, 'fantasy, paranormal...",Buffy the Vampire Slayer: No Future for You (S...,When a rogue debutant Slayer begins to use her...,2008.0,5.0,14.0,4.11,9511.0,120.0,39090.21000000001,"comics, graphic",0,"[0, 0]","[""So this one was much better than the first G...","""So this one was much better than the first GN..."
4,68f9915717ccc347b5f46f1b11ec40fe,2017-06-19,"[[0, 4 Beautifully Flawed Stars!], [0, Source:...",4,False,34117112,7d612b0fed55b60d8c81f9f36a57c9c6,"{'romance': 94, 'fiction': 5, 'mystery, thrill...","Singe (Guardian Protection, #1)",From USA Today bestselling author Aly Martinez...,,,,4.12,880.0,,3625.6,romance,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","['4 Beautifully Flawed Stars!', 'Source: eARC ...",4 Beautifully Flawed Stars Source eARC for Hon...


We finally compute two features with the length (word-wise) of each review since one can hypothesize that longer reviews are more likely to contain spoilers than shorter ones.

In [25]:
train['review_len'] = train.raw_text.str.split(' ').map(len)

#### Missing values

In [8]:
#Denote missing values as np.nan instead of ''.  
train.replace('', np.nan, inplace = True)

In [27]:
# We want to see how many missing values are in every column (as relative frequencies): 
for col in train.columns:
    pct_missing = np.mean(train[col].isna())
    print('{} - {}%'.format(col, round(pct_missing*100, 2)))

user_id - 0.0%
time - 0.0%
review - 0.0%
rating - 0.0%
spoiler - 0.0%
book_id - 0.0%
review_id - 0.0%
genres - 0.0%
title - 0.0%
description - 0.51%
publication_year - 9.88%
publication_month - 11.68%
publication_day - 13.59%
average_rating - 0.0%
ratings_count - 0.0%
num_pages - 3.71%
weighted_avg_rating - 0.0%
genre - 0.0%
spoiler_dum - 0.0%
sentence_labels - 0.0%
review_texts - 0.0%
raw_text - 0.0%
review_len - 0.0%


Publication year, month and day also contain missing values. We drop month and day features as well as the rows with missing values for the publication year.
Missing values for num_pages are also dropped.

This means a reduction of the train set by 12.3%. Since all features containing NaNs are not of major importance, the reduced dataset is stored separately and will be used only when needed.

In [28]:
#Copy the original train set
train_red = train.copy()

In [29]:
# Drop the columns and rows not needed in the copied dataframe
train_red.drop(columns = ['publication_month', 'publication_day'], axis = 1, inplace = True)
train_red.dropna(subset = ['publication_year', 'num_pages', 'description'], inplace = True)

#### Outliers

We explore boxplots of numeric features for outlier detection.

In [30]:
plot = [train.rating, train.average_rating, train.ratings_count, train.num_pages, train.review_len, train.publication_year]
plt.figure(figsize = (10,6), sharey = False)
for i in range(len(plot)):
    plt.subplot(2,3, i+1)
    plt.boxplot(x = plot[i])
    plt.title(plot[i].name)

TypeError: __init__() got an unexpected keyword argument 'sharey'

Generally, outliers play only a minor role for our project: our main subject is the classification of reviews with regard to spoilers, which might be modulated by other features (we will learn about that in the EDA) but we're only secondarily intersted in these interactions. Therefore, outliers will not be removed from the dataframe but we will account for them in the feature standardization. 

To look for unusual values, let's have a look at the descriptives:

In [31]:
#Descriptives of non-numeric features
train.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
rating,964623.0,3.685356869989623,1.2521834710155255,0.0,3.0,4.0,5.0,5.0
publication_year,869295.0,2011.083542410804,9.749220026983188,16.0,2010.0,2012.0,2014.0,2089.0
publication_month,851963.0,6.286950254881961,3.26023845799034,1.0,4.0,6.0,9.0,12.0
publication_day,833516.0,14.024936533911768,9.809261160922626,1.0,5.0,13.0,24.0,31.0
average_rating,964622.0,4.002128460682008,0.2716504563013976,2.08,3.83,4.02,4.19,4.82
ratings_count,964622.0,124056.87547557488,393196.15318957984,17.0,4180.0,16379.0,69923.0,4899965.0
weighted_avg_rating,964622.0,509597.67249578587,1647387.2711884014,65.96,16418.38,65793.75,285122.81,21265848.1
spoiler_dum,964623.0,0.0651031542892923,0.2467079177648244,0.0,0.0,0.0,0.0,1.0
review_len,964623.0,193.34548937771544,226.9422574450046,1.0,40.0,110.0,267.0,3278.0


In [32]:
train.query('publication_year > 2020')

Unnamed: 0,user_id,time,review,rating,spoiler,book_id,review_id,genres,title,description,publication_year,publication_month,publication_day,average_rating,ratings_count,num_pages,weighted_avg_rating,genre,spoiler_dum,sentence_labels,review_texts,raw_text,review_len
39896,da77e2322e9f7e4d83ff38dcb37dfbae,2013-01-16,"[[0, I received this book free from the author...",3,False,13638436,0de0456e39de7131a4927d8bfd1b6afb,"{'young-adult': 39, 'fiction': 55, 'history, h...","Showtime (Marvelle Circus, #1)",Nope.,2089.0,9.0,21.0,3.58,198.0,2.0,708.84,fiction,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",['I received this book free from the author in...,I received this book free from the author in e...,382
143766,2379b71a22fc6fc92c79b3b0fe0a60c5,2014-01-07,"[[0, 10/14/12 I'm just going to go ahead and a...",0,False,6382055,ea5ce53b86bf075d95a82e69473d2308,"{'fantasy, paranormal': 982, 'fiction': 196, '...","A Dream of Spring (A Song of Ice and Fire, #7)","Originally titled ""A Time For Wolves"". The sev...",2021.0,,,4.41,914.0,,4030.74,"fantasy, paranormal",0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[""10/14/12 I'm just going to go ahead and assu...","""101412 Im just going to go ahead and assume t...",189
148748,5c65c852183f68b0e0b03885ebb8e9cf,2013-12-12,"[[0, Summary:], [0, Laila Vilonia escaped from...",3,False,13638436,d35fd2072ce6955927a3be4488e1a4da,"{'young-adult': 39, 'fiction': 55, 'history, h...","Showtime (Marvelle Circus, #1)",Nope.,2089.0,9.0,21.0,3.58,198.0,2.0,708.84,fiction,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","['Summary:', 'Laila Vilonia escaped from a pit...",Summary Laila Vilonia escaped from a pit in Mi...,205
187892,e3feb63356bdb985f4dd8dd0adb10218,2015-03-22,"[[0, Dear George R.R.], [0, Martin, King of th...",0,False,6382055,0290cbcbc470da9059c01e84d989904e,"{'fantasy, paranormal': 982, 'fiction': 196, '...","A Dream of Spring (A Song of Ice and Fire, #7)","Originally titled ""A Time For Wolves"". The sev...",2021.0,,,4.41,914.0,,4030.74,"fantasy, paranormal",0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","['Dear George R.R.', 'Martin, King of the Seve...",Dear George RR Martin King of the Seven Kingdo...,268
191704,92de18f3a06b499b1bf6d2157b22b3ca,2015-04-11,"[[0, I have no idea when this book will be out...",0,False,6382055,9fac6b58270a11583e5539c5c919e95b,"{'fantasy, paranormal': 982, 'fiction': 196, '...","A Dream of Spring (A Song of Ice and Fire, #7)","Originally titled ""A Time For Wolves"". The sev...",2021.0,,,4.41,914.0,,4030.74,"fantasy, paranormal",0,[0],['I have no idea when this book will be out--s...,I have no idea when this book will be outsafe ...,27
229305,0e68b8e2a2bd7bd6874ef96ae720f299,2012-12-28,"[[0, 3.5 STARS], [0, Upon reading the summary ...",3,False,13638436,50fa04c1748b23a46c7decbdc26817a1,"{'young-adult': 39, 'fiction': 55, 'history, h...","Showtime (Marvelle Circus, #1)",Nope.,2089.0,9.0,21.0,3.58,198.0,2.0,708.84,fiction,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","['3.5 STARS', 'Upon reading the summary of the...",35 STARS Upon reading the summary of the book ...,436
448515,931f1250f18bf586090bda909b3002ee,2016-08-17,"[[0, All in all I really like the book very ve...",3,True,13638436,18d6e734439288e1f4bf865ec004de86,"{'young-adult': 39, 'fiction': 55, 'history, h...","Showtime (Marvelle Circus, #1)",Nope.,2089.0,9.0,21.0,3.58,198.0,2.0,708.84,fiction,1,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, ...",['All in all I really like the book very very ...,All in all I really like the book very very mu...,376
454429,ab01eb77625180ec7b2cd26e6f4f8c87,2012-12-21,"[[0, This book was absolutely sensational.], [...",5,False,13638436,e8ae708e779ea89282086810a44a47dc,"{'young-adult': 39, 'fiction': 55, 'history, h...","Showtime (Marvelle Circus, #1)",Nope.,2089.0,9.0,21.0,3.58,198.0,2.0,708.84,fiction,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","['This book was absolutely sensational.', 'I c...",This book was absolutely sensational I could n...,154
486522,e9b0c550b7ee0ae72301c1a4aca56b77,2014-03-16,"[[0, When the sun rises in the west and sets i...",5,False,6382055,0338530f3e156fa792db416854efdafc,"{'fantasy, paranormal': 982, 'fiction': 196, '...","A Dream of Spring (A Song of Ice and Fire, #7)","Originally titled ""A Time For Wolves"". The sev...",2021.0,,,4.41,914.0,,4030.74,"fantasy, paranormal",0,[0],['When the sun rises in the west and sets in t...,When the sun rises in the west and sets in the...,12
533177,4c1080bee19c7f135f46eaf876dab50b,2014-02-25,"[[0, The 1918s.], [0, WTF, who doesn't want to...",3,False,13638436,d72e26ed1686aeca258bcdf3481d91a1,"{'young-adult': 39, 'fiction': 55, 'history, h...","Showtime (Marvelle Circus, #1)",Nope.,2089.0,9.0,21.0,3.58,198.0,2.0,708.84,fiction,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","['The 1918s.', ""WTF, who doesn't want to get s...","The 1918s ""WTF who doesnt want to get snuggly ...",689


Publication year: maximum is not a possible value. Looking for the corresponding titles with publication years > 2020, we learn that there are some reviews for the final GoT Series 7, which hasn't been published yet. The corresponding entries are dropped.
The other title with an impossible publication date was actually published in 2012, so the value is changed.
The minimum values for publication year are also errorous. The corresponding book was published in 2016, so the value is changed.

Number of pages: Also, the minimum number of pages is null which is impossible. Respective values are changed to their true values.
There still remain over 2000 entries with values < 100 pages. Although this seems legit for children's books, respective entries are dropped from the datatset if they do not belong to the children genre. 

In [9]:
#Drop entries for last GoT volume.
train.drop(train.loc[train.title == 'A Dream of Spring (A Song of Ice and Fire, #7)'].index, inplace = True)

In [34]:
#Change values from 2089 to 2012 
train.publication_year.replace(2089, 2012, inplace = True)

In [35]:
train.query('publication_year < 1000')

Unnamed: 0,user_id,time,review,rating,spoiler,book_id,review_id,genres,title,description,publication_year,publication_month,publication_day,average_rating,ratings_count,num_pages,weighted_avg_rating,genre,spoiler_dum,sentence_labels,review_texts,raw_text,review_len
67256,42f0a7cb849df52c7b4950ce0a5bf876,2017-03-13,"[[0, So much fun.], [0, I really enjoyed this ...",4,False,28385237,10eb775fe0fdea27fa47389afce98710,"{'romance': 115, 'fiction': 6}","Magnificent Bastard (Sexy Flirty Dirty, #1)","F*ck Prince Charming. Sometimes, you need a Ma...",16.0,4.0,25.0,4.1,1808.0,252,7412.799999999999,romance,0,"[0, 0]","['So much fun.', 'I really enjoyed this book.']",So much fun I really enjoyed this book,8
163155,23fdcd448f85aa584c0994edf1aa10d1,2017-07-13,"[[0, I liked, but didn't love this novel.], [0...",3,False,28385237,8a218cddcd98f1fed1c4c7bfdf22a0ee,"{'romance': 115, 'fiction': 6}","Magnificent Bastard (Sexy Flirty Dirty, #1)","F*ck Prince Charming. Sometimes, you need a Ma...",16.0,4.0,25.0,4.1,1808.0,252,7412.799999999999,romance,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[""I liked, but didn't love this novel."", 'It w...","""I liked but didnt love this novel"" It was fai...",351
177921,01546af41b83c0b4141b11c19e8c35af,2016-04-26,"[[0, The characters in this book were wonderfu...",4,False,28385237,d34670c070d9ae7e73b4065c86587f2b,"{'romance': 115, 'fiction': 6}","Magnificent Bastard (Sexy Flirty Dirty, #1)","F*ck Prince Charming. Sometimes, you need a Ma...",16.0,4.0,25.0,4.1,1808.0,252,7412.799999999999,romance,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",['The characters in this book were wonderful a...,The characters in this book were wonderful and...,661
208076,c434b8f9592dec15134e6b12293bcb4f,2016-08-11,"[[0, * 5 STARS *], [0, Another new author [to ...",5,False,28385237,9c6047dafb13c3e907d4573a75ce58aa,"{'romance': 115, 'fiction': 6}","Magnificent Bastard (Sexy Flirty Dirty, #1)","F*ck Prince Charming. Sometimes, you need a Ma...",16.0,4.0,25.0,4.1,1808.0,252,7412.799999999999,romance,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]","['* 5 STARS *', 'Another new author [to me] an...",5 STARS Another new author to me and another...,235
258294,4a7171cd31ff626293b2ea3990279cdb,2016-11-12,"[[0, Oww, this was so damn sweet!], [0, I real...",4,False,28385237,dc4ccd4b87a950184dd9efcc063c1be5,"{'romance': 115, 'fiction': 6}","Magnificent Bastard (Sexy Flirty Dirty, #1)","F*ck Prince Charming. Sometimes, you need a Ma...",16.0,4.0,25.0,4.1,1808.0,252,7412.799999999999,romance,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0]","['Oww, this was so damn sweet!', ""I really did...","Oww this was so damn sweet ""I really didnt hav...",81
326522,27c6ff07ee057f379ad6f49a684b6acb,2016-04-28,"[[0, 5 lobster pot stars!], [0, Lili writes a ...",5,False,28385237,c64bdcaa54c804f1c5d4b3f561f76047,"{'romance': 115, 'fiction': 6}","Magnificent Bastard (Sexy Flirty Dirty, #1)","F*ck Prince Charming. Sometimes, you need a Ma...",16.0,4.0,25.0,4.1,1808.0,252,7412.799999999999,romance,0,"[0, 0, 0, 0, 0, 0, 0]","['5 lobster pot stars!', 'Lili writes a Swoony...",5 lobster pot stars Lili writes a Swoony hero ...,84
346183,0223a9592bfaf2edce5a348a293c254b,2016-04-25,"[[0, ARC provided by author in exchange for an...",4,False,28385237,1d2b4f3ac42c2ef200ce199119984070,"{'romance': 115, 'fiction': 6}","Magnificent Bastard (Sexy Flirty Dirty, #1)","F*ck Prince Charming. Sometimes, you need a Ma...",16.0,4.0,25.0,4.1,1808.0,252,7412.799999999999,romance,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",['ARC provided by author in exchange for an ho...,ARC provided by author in exchange for an hone...,321
474491,1351894a9079f881009cb2006c76839b,2017-01-31,"[[0, 4 stars!], [0, Full Review for 1001 Dark ...",4,False,28385237,5b16979ee69e40291c3dded3795b69e9,"{'romance': 115, 'fiction': 6}","Magnificent Bastard (Sexy Flirty Dirty, #1)","F*ck Prince Charming. Sometimes, you need a Ma...",16.0,4.0,25.0,4.1,1808.0,252,7412.799999999999,romance,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","['4 stars!', 'Full Review for 1001 Dark Nights...",4 stars Full Review for 1001 Dark Nights Bundl...,258
630898,8aeb1623684ec6d185a060f57f26e9e7,2016-11-30,"[[0, One thing I loved the best of this book w...",5,True,28385237,649bc9c8f30817bd379b6ae0676ccffe,"{'romance': 115, 'fiction': 6}","Magnificent Bastard (Sexy Flirty Dirty, #1)","F*ck Prince Charming. Sometimes, you need a Ma...",16.0,4.0,25.0,4.1,1808.0,252,7412.799999999999,romance,1,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]",['One thing I loved the best of this book was ...,One thing I loved the best of this book was th...,138
644629,0c2511284f1a56239d879d5f29dc95a9,2016-12-15,"[[0, Told primarily from a male POV, this is a...",4,False,28385237,04f9989d424a54ef78b14ff365cba498,"{'romance': 115, 'fiction': 6}","Magnificent Bastard (Sexy Flirty Dirty, #1)","F*ck Prince Charming. Sometimes, you need a Ma...",16.0,4.0,25.0,4.1,1808.0,252,7412.799999999999,romance,0,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","['Told primarily from a male POV, this is a pr...",Told primarily from a male POV this is a prett...,413


In [36]:
#Change values from 16 to 2016 
train.publication_year.replace(16, 2016, inplace = True)

In [37]:
train.query('num_pages == 0')

Unnamed: 0,user_id,time,review,rating,spoiler,book_id,review_id,genres,title,description,publication_year,publication_month,publication_day,average_rating,ratings_count,num_pages,weighted_avg_rating,genre,spoiler_dum,sentence_labels,review_texts,raw_text,review_len


In [38]:
#Change values of num_pages for titles with num_pages == 0
train.num_pages = np.where(train.title == 'Ruins (Pathfinder, #2)', 544, train.num_pages)
train.num_pages = np.where(train.title == 'The False Prince (The Ascendance Trilogy, #1)', 342, train.num_pages)
train.num_pages = np.where(train.title == 'The Night Circus', 400, train.num_pages)
train.num_pages = np.where(train.title == 'War Horse (War Horse, #1)', 165, train.num_pages)

In [39]:
#Safe the reduced dataframe as HDF5
train_red.to_json('data/train_reduced.json')

In [40]:
#Save the not reduced dataframe as json (HDF5 is not possible for values too large to convert)
train.to_json('data/train_data.json')

#### Rescaling of numeric variables

Data are rescaled using the RobustScaler.
The centering and scaling statistics of this scaler are based on percentiles and are therefore not influenced by a few number of very large marginal outliers. Consequently, the resulting range of the transformed feature values is larger than for the previous scalers and, more importantly, are approximately similar.

In [41]:
#Select numerical variables
train_num = train.copy().select_dtypes('number')

In [42]:
#Rescale data
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
scaler = MinMaxScaler()

scaled_features = scaler.fit_transform(train_num)
train_num_scaled = pd.DataFrame(scaled_features, index= train_num.index, columns= train_num.columns)

In [43]:
#Rescale reduced data
train_red_num = train_red.copy().select_dtypes('number')
scaled_features_red = scaler.fit_transform(train_red_num)
train_red_num_scaled = pd.DataFrame(scaled_features_red, index= train_red_num.index, columns= train_red_num.columns)

In [44]:
#Isolate the non-numeric variables
train_obj = train.copy().select_dtypes('object', 'datetime')

In [45]:
train_red_obj = train_red.copy().select_dtypes('object', 'datetime')

In [46]:
# Concatenate 
train_scaled = pd.concat([train_obj, train_num_scaled],axis = 1)

In [47]:
train_scaled.head()

Unnamed: 0,user_id,review,book_id,review_id,genres,title,description,num_pages,genre,sentence_labels,review_texts,raw_text,rating,publication_year,publication_month,publication_day,average_rating,ratings_count,weighted_avg_rating,spoiler_dum,review_len
0,28f8f1b5d8462df2dde271f0c4992bf3,"[[0, Harry Dresden is the best supernatural de...",7779059,41e678f5fa16cb717e0000ba833d423e,"{'fantasy, paranormal': 3499, 'fiction': 416, ...",Side Jobs: Stories from the Dresden Files (The...,"Here, together for the first time, are the sho...",418.0,"fantasy, paranormal","[0, 0, 0]",['Harry Dresden is the best supernatural detec...,Harry Dresden is the best supernatural detecti...,1.0,0.9488636363636348,0.8181818181818182,0.8333333333333334,0.7883211678832117,0.0070533401578955,0.0068911060517428,0.0,0.0341776014647543
1,521496a9a29d60e3fa1d814041f1c62b,"[[0, Aww this was so much fun.], [0, I think I...",32066878,29443974b3bb5c63d8dab8871f76ece0,"{'romance': 251, 'young-adult': 6}","The Failing Hours (How to Date a Douchebag, #2)",Zeke Daniels isn't just a douchebag; he's an a...,322.0,romance,"[0, 0]","['Aww this was so much fun.', 'I think I enjoy...",Aww this was so much fun I think I enjoyed the...,0.8,0.9886363636363632,0.0,1.0,0.7664233576642333,0.0008044983334517,0.0007750789456738,0.0,0.0045773573390296
2,a8e55d9ec4691a720168153872b0e8b5,"[[0, Book 1 was unique and clever despite bein...",7778609,bb8a6328c6235fdc602b395b27101469,"{'fantasy, paranormal': 1177, 'fiction': 145, ...","Kill the Dead (Sandman Slim, #2)",What do you do after you've crawled out of Hel...,434.0,"fantasy, paranormal","[0, 0, 0, 0, 0, 0]",['Book 1 was unique and clever despite being o...,Book 1 was unique and clever despite being out...,0.4,0.9488636363636348,0.8181818181818182,0.1333333333333333,0.7226277372262772,0.0028296218653748,0.0026472085357308,0.0,0.0176991150442477
3,a5145fcfae582bd6a2a1f958604d0903,"[[0, So this one was much better than the firs...",2567987,aef659e4536be8d5803896a24738eec0,"{'comics, graphic': 2096, 'fantasy, paranormal...",Buffy the Vampire Slayer: No Future for You (S...,When a rogue debutant Slayer begins to use her...,120.0,"comics, graphic","[0, 0]","[""So this one was much better than the first G...","""So this one was much better than the first GN...",0.8,0.9375,0.3636363636363636,0.4333333333333333,0.740875912408759,0.0019375715823923,0.0018350724061353,0.0,0.0131217577052181
4,68f9915717ccc347b5f46f1b11ec40fe,"[[0, 4 Beautifully Flawed Stars!], [0, Source:...",34117112,7d612b0fed55b60d8c81f9f36a57c9c6,"{'romance': 94, 'fiction': 5, 'mystery, thrill...","Singe (Guardian Protection, #1)",From USA Today bestselling author Aly Martinez...,,romance,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","['4 Beautifully Flawed Stars!', 'Source: eARC ...",4 Beautifully Flawed Stars Source eARC for Hon...,0.8,,,,0.7445255474452553,0.0001761243180539,0.0001673881532579,0.0,0.1727189502593835


In [48]:
train_red_scaled = pd.concat([train_red_num_scaled, train_red_obj],axis = 1)

In [49]:
#Safe the reduced dataframe as json-file
#train_red_scaled.to_json('data/train_reduced_scaled.json')

In [50]:
#Save the not reduced dataframe as json.file
train_scaled.to_json('data/train_data_scaled.json')

In [None]:
time = pd.Series(train.time.astype('str'))
time.to_csv('data/time.csv')