# Spoiler Alert! Spoiler Detection Project

## Getting the Data

The data for this project is available at [UCSD Book Graph](https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home).
For our project, we use three datasets:
 * The _Book Reviews Spoiler Subset_ containing English book reviews, where each book/user has at least one associated spoiler review. This dataset contains about 1.3 million reviews by about 19,000 users for about 25,000 books. Besides the review texts, in which every sentence is marked as spoiler or no spoiler, the dataset includes the overall book rating, user ID, book ID, review ID and a timestamp.
 * The _Detailed Book Graph_ containing metadata such as book titles, average book ratings, number of pages, publication date, etc. As you'll see below, we will only use a few features.
 * The _Extracted Fuzzy Book Genres_ with the genre assigned to each book ID. 

### Imports 

In [1]:
reset -fs

In [2]:
import json
import gzip
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm as tqdm
from collections import defaultdict

In [3]:
#Disable scientific notation for floats
pd.options.display.float_format = '{:,}'.format

#Enable viewing more (in this case: all) features of a dataset
pd.set_option('display.max_columns', 500)

### Load Data

#### Book reviews

In [4]:
#Load the data from a compressed json file
reviews = []
with gzip.open('data/GoodReads.json.gz') as f:
    for l in f:
        reviews.append(json.loads(l.strip()))

In [5]:
#Convert the data into a pandas dataframe
df_reviews = pd.DataFrame(reviews)

In [6]:
#Show first 3 rows of the dataframe 
df_reviews.head(3)

| | user_id | timestamp | review_sentences | rating | has_spoiler | book_id | review_id |
|---|---|---|---|---|---|---|---|
| 0 | 8842281e1d1347389f2ab93d60773d4d | 2017-08-30 | [[0, This is a special book.], [0, It started ... | 5 | True | 18245960 | dfdbb7b0eb5a7e4c26d59a937e2e5feb |
| 1 | 8842281e1d1347389f2ab93d60773d4d | 2017-03-22 | [[0, Recommended by Don Katz.], [0, Avail for ... | 3 | False | 16981 | a5d2c3628987712d0e05c4f90798eb67 |
| 2 | 8842281e1d1347389f2ab93d60773d4d | 2017-03-20 | [[0, A fun, fast paced science fiction thrille... | 3 | True | 28684704 | 2ede853b14dc4583f96cf5d120af636f |
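
Each entry of `review_sentences` is a list of `[label, sentence]` pairs. As a minimal sketch of how such an entry could be split into sentence texts and sentence-level labels, assuming that a label of `1` marks a spoiler sentence and `0` a non-spoiler sentence (the example review and the helper `split_review` are hypothetical, not part of the dataset or this notebook):

```python
# Hypothetical review in the [label, sentence] format shown above
review_sentences = [
    [0, "This is a special book."],
    [1, "The narrator turns out to be the culprit."],
    [0, "I would read it again."],
]

def split_review(sentences):
    """Return (texts, labels) for a list of [label, sentence] pairs."""
    labels = [int(label) for label, _ in sentences]
    texts = [text for _, text in sentences]
    return texts, labels

texts, labels = split_review(review_sentences)

# Review-level flag: True if at least one sentence is labeled as a spoiler
has_spoiler = any(labels)

# Share of sentences in this review that are labeled as spoilers
spoiler_share = sum(labels) / len(labels)
```

Under that assumption, the review-level `has_spoiler` flag should agree with whether any sentence label is non-zero.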


In [7]:
#Rename columns for comfort (make them shorter)
df_reviews.rename(columns = {'review_sentences': 'review', 'has_spoiler': 'spoiler', 'timestamp': 'time'}, inplace = True)

#### Functions

Our reviews dataset is only a small subset of the original data. The associated metadata files, however, cover the much larger original dataset, i.e., they contain a great deal of information on books that are not part of our review subset.
Therefore, the following functions fetch only the information on _our_ books, identified by their book ID.

Additionally, we are only interested in a few features of the _Detailed Book Graph_ data, so we load only these columns:
* book_id
* title
* description
* publication_year, publication_month, publication_day
* average_rating
* ratings_count
* num_pages

In [8]:
# Function to get the data files
def get_data(file):
    
    '''
    Generator will yield lines of the passed file
    '''
    
    with gzip.open(file, 'r') as f:
        for l in f:
            yield l

In [9]:
# Fetch the features we are interested in.
def fetch_features(file, features, book_list):
    '''
    Provide a list of features you want to extract in a single run.
    Returns a dictionary.
    
    Parameters:
    file: file you want to load
    features: list of features you want to extract
    book_list: containing all book IDs (--> rows) you want to load
    '''
    
    feature_dict = defaultdict(list)
    data = get_data(file)
    
    for l in tqdm(data):
        data_dict = json.loads(l)
        book_id = data_dict.get('book_id')
        
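        try:
            # Keep only the rows that belong to one of our books
            if book_id in book_list: 
                for f in features:
                    feature_dict[f].append(data_dict.get(f))
        except TypeError:
            # Skip rows whose book_id cannot be looked up (e.g. unhashable value)
            continue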
    return feature_dict
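
To see the filtering idea in isolation, here is a minimal, self-contained sketch that runs on a tiny temporary file instead of the real metadata. The records, the file name `toy_meta.json.gz` and the helper `filter_features` are hypothetical stand-ins; book IDs are assumed to be strings here.

```python
import gzip
import json
import os
import tempfile
from collections import defaultdict

# Hypothetical records standing in for the metadata file (one JSON object per line)
records = [
    {"book_id": "1", "title": "A", "num_pages": "100"},
    {"book_id": "2", "title": "B", "num_pages": "200"},
    {"book_id": "3", "title": "C", "num_pages": "300"},
]

# Write a small gzip-compressed JSON-lines file
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy_meta.json.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

def filter_features(path, features, wanted_ids):
    """Stream the file line by line and keep only rows whose book_id is wanted."""
    out = defaultdict(list)
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            if row.get("book_id") in wanted_ids:
                for name in features:
                    out[name].append(row.get(name))
    return dict(out)

# A set gives constant-time membership tests, which matters for large files
wanted = {"1", "3"}
result = filter_features(path, ["book_id", "title"], wanted)
```

Because the file is read one line at a time, only the selected rows are ever held in memory, which is why the book IDs are collected into a set before the metadata is scanned.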

In [10]:
#Fetch only one feature
def fetch_one_feature(file, f):
    '''
    Extract a single feature from every line of the file.
    Returns a list.
    
    Parameters:
    file: file you want to load
    f: feature you want to load
    '''
    
    feature = []
    data = get_data(file)
    
    for l in tqdm(data):
        data_dict = json.loads(l)
        feature.append(data_dict.get(f))
    return feature

In [11]:
#file with metadata
file_meta = 'data/goodreads_meta.json.gz'

#file with reviews
file_reviews = 'data/GoodReads.json.gz'

#file with genres
file_genre = 'data/goodreads_genre.json.gz'

#features we want to extract
features = ['book_id', 'title', 'description', 'publication_year', 'publication_month', 
            'publication_day', 'average_rating', 'ratings_count', 'num_pages']

#book_list containing all book IDs in the reviews dataset:
book_ids = set(fetch_one_feature(file_reviews, 'book_id'))


#### Metadata Feature Extraction

In [12]:
#Get the metadata we are interested in with the function defined above
features = fetch_features(file_meta, features, book_ids)


In [13]:
#Convert data to a pandas dataframe
df_feat = pd.DataFrame.from_dict(features)

#Show the first 10 rows of the dataframe
df_feat.head(10)

| | book_id | title | description | publication_year | publication_month | publication_day | average_rating | ratings_count | num_pages |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 22642971 | The Body Electric | The future world is at peace.\nElla Shepherd h... | 2014.0 | 10.0 | 6.0 | 3.71 | 1525 | 351.0 |
| 1 | 32336119 | Worth the Wait (Guthrie Brothers #2) | Ready or not...love will find a way \nSingle d... | 2017.0 | 7.0 | 25.0 | 4.19 | 693 | 384.0 |
| 2 | 2741853 | Slow Hands | This is Maddy Turner's lucky day. The civilize... | 2008.0 | 6.0 | 1.0 | 3.41 | 3852 | 210.0 |
| 3 | 12077902 | Solaris: The Definitive Edition | A classic work of science fiction by renowned ... | 2011.0 | 6.0 | 7.0 | 3.98 | 252 | 8.0 |
| 4 | 7843586 | More (More, #1) | After a series of explosive encounters with tw... | 2010.0 | 3.0 | 23.0 | 3.88 | 1675 | 245.0 |
| 5 | 18663972 | Fortune's Pawn (Paradox, #1) | When professional mercenary Deviana Morris too... | | | | 3.96 | 1023 | 340.0 |
| 6 | 25501128 | Kept from You (Tear Asunder, #4) | A sexy second-chance romance from New York Tim... | 2017.0 | 3.0 | 5.0 | 4.36 | 1759 | |
| 7 | 10806009 | The Storyteller | A good girl.\nA bad boy.\nA fairy tale that's ... | 2012.0 | 1.0 | 1.0 | 4.1 | 289 | 402.0 |
| 8 | 10806008 | Peter Nimble and His Fantastic Eyes (Peter Nim... | Peter Nimble and His Fantastic Eyesis the utte... | 2011.0 | 8.0 | 1.0 | 4.04 | 6049 | 400.0 |
| 9 | 9469517 | Mercy | | 2009.0 | 2.0 | 1.0 | 3.64 | 3559 | 226.0 |


#### Metadata Genre Extraction

In [14]:
#Define the features we want to extract
features = ['book_id', 'genres']

#Get the features only for our book IDs with the function defined above
genres = fetch_features(file_genre, features , book_ids)


In [15]:
#Convert to pandas dataframe
df_gen = pd.DataFrame.from_dict(genres)

In [16]:
#Show the first 5 rows
df_gen.head(5)

| | book_id | genres |
|---|---|---|
| 0 | 22642971 | {'young-adult': 235, 'fiction': 182, 'romance'... |
| 1 | 32336119 | {'romance': 84, 'mystery, thriller, crime': 4,... |
| 2 | 2741853 | {'romance': 555, 'fiction': 61} |
| 3 | 12077902 | {'fiction': 2695, 'fantasy, paranormal': 358, ... |
| 4 | 7843586 | {'romance': 232, 'fiction': 11, 'mystery, thri... |


#### Merge Dataframes

Finally, we merge the three dataframes based on the book ID. 

In [17]:
#Make the final dataframe complete by adding the metadata
df = df_reviews.merge(df_gen, on = 'book_id').merge(df_feat, on = 'book_id')
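
Note that `merge` performs an inner join by default, so any review whose book ID is missing from one of the metadata frames is dropped without a message. A minimal sketch of how this could be checked, using hypothetical toy frames (the names `reviews` and `meta` and their contents are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frames; the key column mirrors the one used above
reviews = pd.DataFrame({"book_id": ["1", "1", "2", "3"], "rating": [5, 3, 4, 2]})
meta = pd.DataFrame({"book_id": ["1", "2"], "title": ["A", "B"]})

# Inner join (the default): the review of book "3" disappears
inner = reviews.merge(meta, on="book_id")

# Left join with an indicator column keeps every review and flags unmatched ones;
# validate raises an error if a book ID occurs more than once in the metadata
checked = reviews.merge(meta, on="book_id", how="left", indicator=True,
                        validate="many_to_one")
unmatched = checked.loc[checked["_merge"] == "left_only", "book_id"].tolist()
```

Comparing `len(inner)` with `len(reviews)`, or inspecting `unmatched`, shows whether any reviews were lost in the join.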

In [18]:
df.head(2)

| | user_id | time | review | rating | spoiler | book_id | review_id | genres | title | description | publication_year | publication_month | publication_day | average_rating | ratings_count | num_pages |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8842281e1d1347389f2ab93d60773d4d | 2017-08-30 | [[0, This is a special book.], [0, It started ... | 5 | True | 18245960 | dfdbb7b0eb5a7e4c26d59a937e2e5feb | {'fiction': 393, 'fantasy, paranormal': 341, '... | The Three-Body Problem (Remembrance of Earth’s... | The Three-Body Problemis the first chance for ... | 2014 | 10 | 14 | 4.01 | 6336 | 400 |
| 1 | 1a2398eca437fed5d9add310a0c09611 | 2015-10-21 | [[0, Average between the 4 star concepts (over... | 3 | False | 18245960 | b88eb6519a046159a31afcc21a448b6f | {'fiction': 393, 'fantasy, paranormal': 341, '... | The Three-Body Problem (Remembrance of Earth’s... | The Three-Body Problemis the first chance for ... | 2014 | 10 | 14 | 4.01 | 6336 | 400 |


In [19]:
#Save the data as an HDF5 file. HDF5 is chosen because, unlike CSV, it stores data types.
df.to_hdf('data/complete_data.h5', complevel = 0, key = 'complete')

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_values] [items->Index(['user_id', 'time', 'review', 'book_id', 'review_id', 'genres', 'title',
       'description', 'publication_year', 'publication_month',
       'publication_day', 'average_rating', 'ratings_count', 'num_pages'],
      dtype='object')]

  encoding=encoding,
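
The comment in the cell above notes that HDF5, unlike CSV, stores data types. A minimal sketch of what gets lost in a CSV round trip, using a hypothetical toy frame (the values, including the zero-padded IDs, are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical toy frame with string IDs and real datetime values
toy = pd.DataFrame({
    "book_id": ["00123", "00456"],
    "time": pd.to_datetime(["2017-08-30", "2017-03-22"]),
})

# Round trip through CSV using an in-memory buffer instead of a file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# The string IDs come back as integers and the timestamps as plain text
id_is_integer = pd.api.types.is_integer_dtype(restored["book_id"])
time_is_datetime = pd.api.types.is_datetime64_any_dtype(restored["time"])
```

With CSV, the types would have to be restored by hand on every load (for example via the `dtype` and `parse_dates` arguments of `read_csv`), which is the overhead the HDF5 choice avoids.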
