##### Silver Speech and Golden Silence: Spoiler Detection Project

### Baseline Stochastic Gradient Descent Classifier (downsampled)

After running the baseline models on all data, we repeat the procedure on downsampled training data since the dataset is very imbalanced. 

We downsample the non-spoiler cases instead of upsampling the spoilers because our attempts to upsample with SMOTE and similar techniques repeatedly crashed.

In [1]:
#Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

In [2]:
#Disable scientific notation for floats
pd.options.display.float_format = '{:,}'.format

#Enable viewing more (in this case: all) features of a dataset
pd.set_option('display.max_columns', 500)

#ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [5]:
#Load datafiles
train = pd.read_json('/Users/juliaschafer/NF_Capstone_Spoiler_Detection/data/train_preprocessed.json')

In [7]:
val = pd.read_json('/Users/juliaschafer/NF_Capstone_Spoiler_Detection/data/validation_preprocessed.json')

In [8]:
#Since the data is imbalanced, we also try a model with downsampled non-spoilers, accepting a potentially large loss of information.
def downsample_nonspoilers(df):
    df_majority = df[df['spoiler_dum'] == 0] #nonspoilers
    df_minority = df[df['spoiler_dum'] == 1] #spoilers
    
    # Downsample majority labels equal to the number of samples in the minority class
    df_majority = df_majority.sample(len(df_minority), random_state = 42)

    # Concatenate the majority and minority dataframes
    sample = pd.concat([df_majority, df_minority])
    
    sample.reset_index(inplace = True, drop = True)
    
    return sample

In [9]:
#Downsample
train = downsample_nonspoilers(train)
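
A quick way to confirm that this step balances the classes is to run the same logic on a small toy frame. The sketch below is only an illustration: the toy data and the `review_id` column are made up, and only the column name `spoiler_dum` and the function body are taken from the notebook.

```python
import pandas as pd

def downsample_nonspoilers(df):
    # Same logic as the notebook function, repeated so the example runs on its own
    df_majority = df[df['spoiler_dum'] == 0]  # non-spoilers
    df_minority = df[df['spoiler_dum'] == 1]  # spoilers
    df_majority = df_majority.sample(len(df_minority), random_state=42)
    sample = pd.concat([df_majority, df_minority])
    return sample.reset_index(drop=True)

# Hypothetical toy data: 8 non-spoiler reviews and 2 spoiler reviews
toy = pd.DataFrame({'review_id': range(10), 'spoiler_dum': [0] * 8 + [1] * 2})

balanced = downsample_nonspoilers(toy)
counts = balanced['spoiler_dum'].value_counts()
print(counts.to_dict())
```

Note that the balancing acts on reviews (`spoiler_dum`); it does not necessarily balance the sentence-level labels used by the sentence-wise model.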

### Model: SGD-Classifier

Stochastic Gradient Descent (SGD) is a simple and very efficient approach to fitting linear classifiers under convex loss functions such as (linear) Support Vector Machines and Logistic Regression. 

SGD has been successfully applied to large-scale machine learning problems often encountered in text classification and natural language processing. 

Therefore, we use this approach as a baseline model for spoiler detection.
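
To make the idea concrete, here is a heavily simplified sketch of SGD on the L2-regularised hinge loss (the linear SVM objective). It is not how `SGDClassifier` is implemented: the function name `sgd_hinge`, the constant learning rate `lr`, and the toy data are assumptions made for illustration, and labels are coded as -1/+1 rather than 0/1.

```python
import numpy as np

def sgd_hinge(X, y, alpha=1e-4, lr=0.1, epochs=20, seed=42):
    """Plain SGD on the L2-regularised hinge loss; y must be in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        # Visit the samples one at a time, in a fresh random order each epoch
        for i in rng.permutation(len(X)):
            margin = y[i] * (X[i] @ w + b)
            # Gradient step for the L2 penalty (alpha / 2) * ||w||^2
            w *= (1 - lr * alpha)
            # Subgradient step for the hinge loss, active only inside the margin
            if margin < 1:
                w += lr * y[i] * X[i]
                b += lr * y[i]
    return w, b

# Hypothetical, linearly separable toy data
X_toy = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y_toy = np.array([1, 1, -1, -1])

w, b = sgd_hinge(X_toy, y_toy)
pred = np.sign(X_toy @ w + b)
print(pred)
```

Each update touches a single sample, which is why the method scales to the millions of sentences used below.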

We train two kinds of models: in the first, the reviews are fed to the classifier sentence by sentence; in the second, each whole review is a single input.

#### First model: Feed the reviews sentence-wise

In [10]:
#Function to transfer the review sentences to a list and then to a numpy array for training.
def get_X_sen(df):
    
    '''Get the review sentences from the dataframe df.
    The sentences are collected in a list 'lst', which is then converted
    into a numpy array.'''
    
    lst = []
    for review in df['tokenized']:
        for sentence in review:
            lst.append(sentence)
    X = np.array(lst) 
    return X 

In [11]:
#Function to transfer the review labels for each review sentence to a list and then to a numpy array for training.
def get_y_sen(df):
    
    '''Get the label of each review sentence from the dataframe df.
    The labels are collected in a list 'llst', which is then converted
    into a numpy array.'''
    
    llst = []
    for labellist in df['sentence_labels']:
        for label in labellist:
            llst.append(label)
    y = np.array(llst)
    return y
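
Both helpers flatten a column of per-review lists, so they could also be written as one function. The following is a possible compact variant, not the notebook's code: `flatten_column` is a hypothetical name, and the toy frame assumes that `tokenized` holds a list of sentence strings per review and `sentence_labels` a list of labels of the same length.

```python
from itertools import chain

import numpy as np
import pandas as pd

def flatten_column(df, column):
    """Concatenate the per-review lists in `column` into one flat numpy array."""
    return np.array(list(chain.from_iterable(df[column])))

# Hypothetical toy data with two reviews (two sentences and one sentence)
toy = pd.DataFrame({
    'tokenized': [['great plot', 'the hero dies'], ['nice cast']],
    'sentence_labels': [[0, 1], [0]],
})

X_toy = flatten_column(toy, 'tokenized')
y_toy = flatten_column(toy, 'sentence_labels')

# Sentences and labels must stay aligned one-to-one
assert X_toy.shape == y_toy.shape
print(X_toy.shape, y_toy.shape)  # (3,) (3,)
```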

In [12]:
#Get X and y (train) with sentence-wise review texts
X_train_sen = get_X_sen(train)
y_train_sen = get_y_sen(train)

print(y_train_sen.shape, X_train_sen.shape)

(2257873,) (2257873,)


In [13]:
#Get X and y (validation) with sentence-wise review texts
X_val_sen = get_X_sen(val)
y_val_sen = get_y_sen(val)

print(y_val_sen.shape, X_val_sen.shape)

(664558,) (664558,)


In [14]:
#Build a pipeline for feature extraction with TF IDF and SGD
#TFIDF
tfidf = TfidfVectorizer(stop_words = 'english', ngram_range = (1,1), min_df = 100, max_features = 5000)
#SGD
sgd = SGDClassifier(random_state = 42, penalty = 'l2', shuffle = True, n_jobs = -1, max_iter = 1000, 
                                       loss = 'hinge', class_weight = {0: 0.5, 1: .5}, alpha = .0001)
pipe = Pipeline([('tfidf', tfidf),('sgd', sgd)])

In [15]:
#Function to run a model and print the classification report
def run_sgd(pipeline, X_train, y_train, X_test, y_test):
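    #Fit the whole pipeline (TF-IDF + SGD) on the training data
    pipeline.fit(X_train, y_train)
    
    #Predict labels of test data
    y_pred = pipeline.predict(X_test)
    
    #Print the report (print() returns None, so there is nothing to return)
    print(classification_report(y_test, y_pred))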

In [16]:
run_sgd(pipe, X_train_sen, y_train_sen, X_val_sen, y_val_sen)

              precision    recall  f1-score   support

           0       0.83      1.00      0.91    550592
           1       0.00      0.00      0.00    113966

    accuracy                           0.83    664558
   macro avg       0.41      0.50      0.45    664558
weighted avg       0.69      0.83      0.75    664558
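
The accuracy of 0.83 in this report should be read against the class shares. A short check using the two support counts printed above shows that a classifier which always predicts class 0 would reach the same accuracy:

```python
# Support counts copied from the classification report above
support_nonspoiler = 550592
support_spoiler = 113966

# Accuracy of a classifier that labels every sentence as non-spoiler
majority_accuracy = support_nonspoiler / (support_nonspoiler + support_spoiler)
print(round(majority_accuracy, 2))  # 0.83
```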



The basic model completely fails to detect spoilers: recall for the spoiler class is 0.00.
We tune the hyperparameters.
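
The notebook does not record how the tuned values were found. One common way to search over pipeline hyperparameters is `GridSearchCV`, sketched here on made-up toy data; the grid values, the toy texts, and the choice of F1 as the score are assumptions, not the settings used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = spoiler, 0 = non-spoiler
texts = [
    'the killer turns out to be the butler',
    'the hero dies in the final chapter',
    'she betrays him at the very end',
    'the twist reveals the narrator is dead',
    'beautiful writing and vivid characters',
    'a slow start but an enjoyable read',
    'the cover art is lovely',
    'i recommend this book to fantasy fans',
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('sgd', SGDClassifier(random_state=42, max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'sgd__alpha': [1e-4, 1e-3],
    'sgd__loss': ['hinge', 'perceptron'],
}

# Score each combination by the F1 of the positive (spoiler) class
search = GridSearchCV(toy_pipe, param_grid, scoring='f1', cv=2)
search.fit(texts, labels)
print(search.best_params_)
```

On the full sentence-level data a grid like this is expensive, so a smaller grid or a subsample may be more practical.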

In [17]:
#Build a pipeline for feature extraction with TF IDF and SGD
#TFIDF
tfidf = TfidfVectorizer(stop_words = 'english', ngram_range = (1,2), min_df = 1)
#SGD (with l1_ratio = 0 the elastic-net penalty reduces to a pure L2 penalty)
sgd = SGDClassifier(random_state = 42, penalty = 'elasticnet', alpha = .001, class_weight = {0: 0.3, 1: 0.7}, 
                    l1_ratio = 0, max_iter = 1000, loss = 'perceptron', shuffle = True, n_jobs = -1)
pipe = Pipeline([('tfidf', tfidf), ('sgd', sgd)])

In [18]:
run_sgd(pipe, X_train_sen, y_train_sen, X_val_sen, y_val_sen)

              precision    recall  f1-score   support

           0       0.93      0.87      0.90    550592
           1       0.52      0.68      0.59    113966

    accuracy                           0.84    664558
   macro avg       0.73      0.78      0.75    664558
weighted avg       0.86      0.84      0.85    664558



#### Second model: review-wise modelling

Now we use the whole review as the model input.

In [21]:
#For review-wise model training: join the sentences of each review into one string and collect the strings in an np.array
def reviewwise_X(df):
    reviews = []
    for review in df['tokenized']: 
        reviews.append(' '.join(review))
    X = np.array(reviews)
    return X

In [22]:
X_train_rev = reviewwise_X(train)
y_train_rev = train.spoiler_dum
print(X_train_rev.shape, y_train_rev.shape)

(125572,) (125572,)


In [23]:
X_val_rev = reviewwise_X(val)
y_val_rev = val.spoiler_dum
print(X_val_rev.shape, y_val_rev.shape)

(36124,) (36124,)


In [26]:
#Build a pipeline for feature extraction with TF IDF and SGD
#TFIDF
tfidf = TfidfVectorizer(stop_words = 'english', ngram_range = (1,1), min_df = 1)
#SGD
sgd = SGDClassifier(random_state = 42, penalty = 'l2', shuffle = True, n_jobs = -1, max_iter = 1000, 
                                       loss = 'hinge', class_weight = {0: 0.4, 1: .6}, alpha = .0001)

pipe = Pipeline([('tfidf', tfidf),('sgd', sgd)])

In [27]:
run_sgd(pipe, X_train_rev, y_train_rev, X_val_rev, y_val_rev)

              precision    recall  f1-score   support

           0       0.83      0.52      0.64     18062
           1       0.65      0.89      0.75     18062

    accuracy                           0.70     36124
   macro avg       0.74      0.70      0.69     36124
weighted avg       0.74      0.70      0.69     36124



The best baseline model is the one with review-wise training (spoiler-class F1 of 0.75, versus 0.59 for the tuned sentence-wise model) and the following hyperparameters:

TfidfVectorizer(stop_words = 'english', ngram_range = (1,1), min_df = 1)

SGDClassifier(random_state = 42, penalty = 'l2', shuffle = True, n_jobs = -1, max_iter = 1000, 
                                       loss = 'hinge', class_weight = {0: 0.4, 1: .6}, alpha = .0001)
