<a href="https://colab.research.google.com/github/NekrozQliphort/SarcasmDetectionReddit/blob/main/NLP_Reddit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Reddit Sarcasm Detection

### Import Libraries

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np
import os
print(os.getcwd())

/content


### Import CSV

In [2]:
training_csv_1 = pd.read_csv("train-balanced-sarcasm.csv")

In [3]:
training_csv_1["comment"] = training_csv_1["comment"].astype(str)

In [4]:
training_csv_1.head()

Unnamed: 0,label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment
0,0,NC and NH.,Trumpbart,politics,2,-1,-1,2016-10,2016-10-16 23:55:23,"Yeah, I get that argument. At this point, I'd ..."
1,0,You do know west teams play against west teams...,Shbshb906,nba,-4,-1,-1,2016-11,2016-11-01 00:24:10,The blazers and Mavericks (The wests 5 and 6 s...
2,0,"They were underdogs earlier today, but since G...",Creepeth,nfl,3,3,0,2016-09,2016-09-22 21:45:37,They're favored to win.
3,0,"This meme isn't funny none of the ""new york ni...",icebrotha,BlackPeopleTwitter,-8,-1,-1,2016-10,2016-10-18 21:03:47,deadass don't kill my buzz
4,0,I could use one of those tools.,cush2push,MaddenUltimateTeam,6,-1,-1,2016-12,2016-12-30 17:00:13,Yep can confirm I saw the tool they use for th...


### Exploratory Data Analysis

In [5]:
print(f"The total training data has {training_csv_1.author.nunique()} rows.")
training_csv_1.groupby("author").mean()["label"].value_counts()

The total training data has 32690 rows.


0.000000    17856
1.000000    10666
0.500000     3033
0.333333      432
0.666667      352
0.250000       75
0.750000       68
0.400000       49
0.600000       49
0.200000       16
0.800000       15
0.428571       11
0.833333        9
0.571429        8
0.375000        8
0.625000        5
0.714286        4
0.166667        4
0.285714        4
0.555556        4
0.545455        3
0.444444        2
0.384615        2
0.720000        1
0.363636        1
0.657143        1
0.533333        1
0.458333        1
0.380952        1
0.526316        1
0.542857        1
0.142857        1
0.777778        1
0.437500        1
0.466667        1
0.818182        1
0.461538        1
0.642857        1
Name: label, dtype: int64

##### The authors is mostly 0.5 probability of each label, might consider dropping it

In [6]:
print(f"The total training data has {training_csv_1.subreddit.nunique()} rows.")
training_csv_1.groupby("subreddit").mean()["label"].value_counts()

The total training data has 3682 rows.


0.000000    1666
1.000000     532
0.500000     362
0.333333     204
0.250000      78
            ... 
0.451613       1
0.378238       1
0.314815       1
0.488889       1
0.483146       1
Name: label, Length: 285, dtype: int64

##### Subreddit seems to provide more info than expected, should probably keep

In [7]:
training_csv_1[["ups", "downs"]]

Unnamed: 0,ups,downs
0,-1,-1
1,-1,-1
2,3,0
3,-1,-1
4,-1,-1
...,...,...
41477,-1,-1
41478,-1,-1
41479,-1,-1
41480,-1,-1


##### Notice how ups and downs seem to have a correlation? Lets test this theory out

In [8]:
training_csv_1[training_csv_1["ups"].apply(lambda x: -1 if x <= -1 else 0) != training_csv_1["downs"]]

Unnamed: 0,label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment
140,0,My comment very similar to this went down a fu...,Schumarker,Android,-6,-6,0,2016-09,2016-09-24 21:50:56,Badumm-tzz
204,0,it really does,Horus_Krishna_2,radiohead,-1,-1,0,2016-09,2016-09-14 20:07:04,"As far as I know, someone's reddit history doe..."
414,0,"Meh, my upper body blows his away.",GiveMeSomeIhedigbo,bodybuilding,-6,-6,0,2016-09,2016-09-19 06:27:32,Do you Agree that this version is The BEST Ver...
431,0,Such a shitty meme.,Geralt-of_Rivia,AdviceAnimals,-4,-4,0,2016-09,2016-09-02 02:39:44,Front page post with 2000 comments and is 10 h...
454,0,"This sub is for open ended questions, not yes ...",hunterz5,AskReddit,-3,-3,0,2016-09,2016-09-10 01:54:47,Do you think IB/AP classes are truly worth it?...
...,...,...,...,...,...,...,...,...,...,...
38690,0,Wtf is this boob ribbon thing ?,powsm,anime,-1,-1,0,2016-09,2016-09-27 17:28:36,I know that but those girls are mostly either ...
39290,0,"Idc, dumbasses prolly deserved it",PM_ME_UR_THIGH_HIGHS,BlackPeopleTwitter,-19,-19,0,2016-09,2016-09-24 21:59:50,And you find that acceptable?!
39476,0,"Well we'll talk about that next year lol, I'm ...",OrbisAlius,formula1,-1,-1,0,2016-09,2016-09-27 06:53:06,If next year or 2018 turns out to go well for ...
40042,0,"And yet, I feel less sympathetic.",Blazebow,todayilearned,-1,-1,0,2016-09,2016-09-05 09:52:30,That doesn't justify what happened.


##### Only 6.1% does not follow the rules, is downs worth keeping? Debatable I guess

### Build model using Comment Column only (Unigram Model)

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier

In [4]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

In [11]:
## Better abstraction

class sklearnClassifier:
    def __init__(self, model, data, label, fitBool = True):
        self.model = model
        if fitBool: self.fit(data, label)
            
    def fit(self, data, label):
        self.model.fit(data, label)
    
    def score(self, X, y_true):
        y_pred = self.model.predict(X)
        print(f"Accuracy score: {accuracy_score(y_true, y_pred)}")
        print(f"Recall score: {recall_score(y_true, y_pred)}")
        print(f"Precision score: {precision_score(y_true, y_pred)}")
        print(f"F1 score: {f1_score(y_true, y_pred)}")

In [12]:
training_csv_1["comment"] = training_csv_1["comment"].apply(lambda x: x.lower())

In [13]:
X_train, X_val, y_train, y_val = train_test_split(
    training_csv_1["comment"], 
    training_csv_1["label"], 
    test_size = 0.2
)

In [14]:
def create_ngram_vectorizer(text_train, ngram_range = (1,1), **kwargs):
    vectorizer = CountVectorizer(ngram_range = ngram_range, **kwargs)
    vectorizer.fit(text_train)
    return vectorizer

In [15]:
unigram_vectorizer = create_ngram_vectorizer(X_train)

In [16]:
X_train_transformed = unigram_vectorizer.transform(X_train)
X_val_transformed = unigram_vectorizer.transform(X_val)

In [17]:
base_classifier = sklearnClassifier(SGDClassifier(), X_train_transformed, y_train)

In [18]:
print("Training: ")
base_classifier.score(X_train_transformed, y_train)
print("Validation: ")
base_classifier.score(X_val_transformed, y_val)

Training: 
Accuracy score: 0.8060569534428206
Recall score: 0.6184512782512042
Precision score: 0.8663967611336032
F1 score: 0.7217225873400207
Validation: 
Accuracy score: 0.681089550439918
Recall score: 0.44682115270350564
Precision score: 0.6573426573426573
F1 score: 0.5320127343473647


### Now what? Bigrams and Trigrams, LETZ GO!!!

In [19]:
# for i in range(1, 3): # Trigram is a bit slow so we'll bring that back later
#     igram_vectorizer = create_ngram_vectorizer(X_train, ngram_range = (1,i))
#     X_train_transformed = igram_vectorizer.transform(X_train)
#     X_val_transformed = igram_vectorizer.transform(X_val)
    
#     base_classifier = sklearnClassifier(SGDClassifier(), X_train_transformed, y_train)
    
#     print("Training: ")
#     base_classifier.score(X_train_transformed, y_train)
#     print("Validation: ")
#     base_classifier.score(X_val_transformed, y_val)
#     print()

### Using TFIDF instead of just counting

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [21]:
def create_tfidf_ngram_vectorizer(text_train, ngram_range = (1,1), **kwargs):
    vectorizer = TfidfVectorizer(ngram_range = ngram_range, **kwargs)
    vectorizer.fit(text_train)
    return vectorizer

In [22]:
# for i in range(1,3):
#     tfidf_igram_vectorizer = create_tfidf_ngram_vectorizer(X_train, ngram_range = (1,i))
#     X_train_transformed = tfidf_igram_vectorizer.transform(X_train)
#     X_val_transformed = tfidf_igram_vectorizer.transform(X_val)
    
#     base_classifier = sklearnClassifier(SGDClassifier(), X_train_transformed, y_train)
    
#     print("Training: ")
#     base_classifier.score(X_train_transformed, y_train)
#     print("Validation: ")
#     base_classifier.score(X_val_transformed, y_val)
#     print()

### Vector Representation Test

In [23]:
### Abstraction for easier work
class EmbeddingTechniques:
    def __init__(self, method):
        self.transformMethod = method
    
    def transform(self, X):
        return self.transformMethod(X)

In [24]:
class EmbeddingTester:
    def __init__(self, sklearnmodel):
        self.list_of_techniques = {}
        self.tokenized = {}
        self.model = sklearnmodel
        
    def addEmbeddingTechniques(self, key, method, tokenized = False):
        self.list_of_techniques[key] = method
        self.tokenized[key] = tokenized
        
        
    def testModel(self, X_train_transformed, y_train_true, X_test_transformed, y_test_true, text = None):
        if text is not None: print(text)
        self.model.fit(X_train_transformed, y_train_true)
        print("Training: ")
        self.model.score(X_train_transformed, y_train_true)
        print()
        print("Validation: ")
        self.model.score(X_test_transformed, y_test_true)
        print("-" * 80)
        
    def test(self, X_train_untransformed, y_train_true, X_test_untransformed, y_test_true,
            X_train_tokenized, X_test_tokenized):
        for key, val in self.list_of_techniques.items():
            if self.tokenized[key]:
                X_train_transformed = val.transform(X_train_tokenized)
                X_test_transformed = val.transform(X_test_tokenized)
            else:
                X_train_transformed = val.transform(X_train_untransformed)
                X_test_transformed = val.transform(X_test_untransformed)
            self.testModel(X_train_transformed, y_train_true, X_test_transformed, y_test_true, text = key)

In [25]:
tester = EmbeddingTester(base_classifier)
tester.addEmbeddingTechniques(
    "Count Vectorizer(No stopwords removal)", 
    create_ngram_vectorizer(X_train, ngram_range = (1,2))
)

tester.addEmbeddingTechniques(
    "TFIDF Vectorizer(No stopwords removal)", 
    create_tfidf_ngram_vectorizer(X_train, ngram_range = (1,2))
)

tester.addEmbeddingTechniques(
    "Count Vectorizer(With stopwords removal)", 
    create_ngram_vectorizer(X_train, ngram_range = (1,2), stop_words='english')
)

tester.addEmbeddingTechniques(
    "TFIDF Vectorizer(With stopwords removal)", 
    create_tfidf_ngram_vectorizer(X_train, ngram_range = (1,2), stop_words='english')
)

In [26]:
## Thanks Rama, like srsly
from gensim.models import Word2Vec
from nltk.tokenize import TreebankWordTokenizer

In [27]:
vector_size = 128
word_tokenizer = TreebankWordTokenizer()

X_train_tokenized = [word_tokenizer.tokenize(text) for text in X_train]
X_val_tokenized = [word_tokenizer.tokenize(text) for text in X_val]

model = Word2Vec(X_train_tokenized, min_count = 1, vector_size= vector_size, workers = 3, window = 3, sg = 1)

TypeError: ignored

In [28]:
def transform(X_tokenized):
    temp = np.matrix(
        [np.mean([model.wv[i] if i in model.wv else np.array([0.0] * vector_size, dtype=np.float64) for i in tokens], axis = 0) for tokens in X_tokenized],
        dtype=np.float64
    )
    return temp

In [29]:
tester.addEmbeddingTechniques(
    "word2Vec Mean Embedding", 
    EmbeddingTechniques(transform),
    True
)

tester.test(X_train, y_train, X_val, y_val, X_train_tokenized, X_val_tokenized)

Count Vectorizer(No stopwords removal)
Training: 
Accuracy score: 0.9656772638240169
Recall score: 0.922341608002964
Precision score: 0.9927420641250598
F1 score: 0.9562478392809126

Validation: 
Accuracy score: 0.6835000602627456
Recall score: 0.49376114081996436
Precision score: 0.6431888544891641
F1 score: 0.5586554621848739
--------------------------------------------------------------------------------
TFIDF Vectorizer(No stopwords removal)
Training: 
Accuracy score: 0.7322886846466777
Recall score: 0.3780659503519822
Precision score: 0.9122116931879135
F1 score: 0.5345766974015088

Validation: 
Accuracy score: 0.682656381824756
Recall score: 0.33481877599524656
Precision score: 0.7409598948060486
F1 score: 0.46122365459382036
--------------------------------------------------------------------------------
Count Vectorizer(With stopwords removal)
Training: 
Accuracy score: 0.9313545276480337
Recall score: 0.8415709522045202
Precision score: 0.9878229103244325
F1 score: 0.908850832

NameError: ignored

## Feature Engineering

### Imports

In [5]:
from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')
porter_stemmer = PorterStemmer()
word_tokenizer = TreebankWordTokenizer()
word_tokenizer2 = WordPunctTokenizer()
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [3]:
## tokenize cols
## Define function to remove stopwords and tokenize comments and or parent_comments
def create_stopwords_dict():
  stopwords_dict = {}
  for word in set(stopwords.words('english')):
    stopwords_dict[word] = True
  return stopwords_dict

stop_words_dict = create_stopwords_dict()


def remove_stopwords_and_tokenize(text):
  try:
    arr = word_tokenizer.tokenize(text)
    arr = [word for word in arr if word not in stop_words_dict]
    return arr
  except:
    print(text)
    return []

def remove_stopwords_and_tokenize_cols_in_dataset(dataset, cols):
    for col in cols:
        dataset[col] = dataset[col].apply(lambda x: remove_stopwords_and_tokenize(x))
    return dataset

In [4]:
## After tokenization
## Define function to add length of comments and or parent comments
def add_length_feature_to_dataset(dataset, cols):
    for col in cols:
        new_col = "num_" + col + "_words" 
        dataset[new_col] = dataset[col].apply(lambda x: len(x))
    return dataset

In [5]:
## Define a function that splits training set into just sarcasm and just non-sarcasm
def split_training_dataset_into_separate_labels(training_dataset):
    sarcasm = training_dataset[training_dataset['label'] == 1]
    non_sarcasm = training_dataset[training_dataset['label'] == 0]
    return sarcasm, non_sarcasm

## Define function to engineer features for model such as subreddit history and author history
def feature_history(training_dataset, col):
    history_sarcasm = {}
    history_non_sarcasm = {}
    
    total_comments_by_feature_history = {}
    proportion_sarcasm_by_feature_history = {}
    
    for index, row in training_dataset.iterrows():
        if int(row['label']) == 1:
            if row[col] not in history_sarcasm:
                history_sarcasm[row[col]] = 0
                history_non_sarcasm[row[col]] = 0
            history_sarcasm[row[col]] += 1
    
        elif int(row['label']) == 0:
            if row[col] not in history_non_sarcasm:
                history_non_sarcasm[row[col]] = 0
                history_sarcasm[row[col]] = 0
            history_non_sarcasm[row[col]] += 1
    
    for val in history_sarcasm.keys():
        num_sarcasm = history_sarcasm[val]
        num_non_sarcasm = history_non_sarcasm[val]
        total_comments = num_sarcasm + num_non_sarcasm
        sarcasm_proportion = num_sarcasm/total_comments
        
        proportion_sarcasm_by_feature_history[val] = sarcasm_proportion
        total_comments_by_feature_history[val] = total_comments
    
    return proportion_sarcasm_by_feature_history, total_comments_by_feature_history



## Define function to prepare training dataset

def add_feature_history_to_train(train_dataset, col):
    (proportion_history, total_comments_history) = feature_history(train_dataset, col)
    proportion_col = "sarcasm_proportion_by_" + col
    total_col = "total_num_comments_by_" + col
    
    train_dataset[proportion_col] = train_dataset[col].apply(lambda x: proportion_history[x])
    train_dataset[total_col] = train_dataset[col].apply(lambda x: total_comments_history[x])
    
    return train_dataset

## Define function to prepare testing dataset

def calculate_mean(table):
    values = table.values()
    return sum(values)/(len(values))

def add_feature_history_to_test(test_dataset, col, proportion_history, total_comments_history):
    default_proportion = calculate_mean(proportion_history)
    default_total_comments = calculate_mean(total_comments_history)
    
    def getProportion(col_val):
        proportion = default_proportion
        if col_val in proportion_history:
            proportion = proportion_history[col_val]
    
        return proportion
    
    def getTotal(col_val):
        total = default_total_comments
        if col_val in total_comments_history:
            total = total_comments_history[col_val]
        
        return total
    
    proportion_col = "sarcasm_proportion_by_" + col
    total_col = "total_num_comments_by_" + col
    
    test_dataset[proportion_col] = test_dataset[col].apply(lambda x: getProportion(x))
    test_dataset[total_col] = test_dataset[col].apply(lambda x: getTotal(x))
    
    return test_dataset

In [6]:
## Before tokenizing
## Counting number of exclamation marks
def count_num_exclamation_marks(text):
    return text.count("!")
        
def add_num_exclamation_mark_in_feature(dataset, cols):
    for col in cols:
        dataset[col + "_num_exclamation_marks"] = dataset[col].apply(lambda x: count_num_exclamation_marks(x))
    return dataset

In [7]:
## Before tokenizing
## Counting number of repeated exclamation marks
def count_num_repeated_explanation_marks(text):
    return text.count("!!")

def add_num_repeated_exclamation_mark_in_feature(dataset, cols):
    for col in cols:
        dataset[col + "_num_repeated_exclamation_marks"] = dataset[col].apply(lambda x: count_num_repeated_explanation_marks(x))
    return dataset

In [8]:
## Before tokenizing
## Count number of emoticons
def count_num_common_emoticons(text):
    common_emoticons = [":(", ":)", "<3", ":'(", ":')", "):", "(:", "</3"]
    count = 0
    for emoticon in common_emoticons:
        count += text.count(emoticon)
    return count

def add_num_emoticons_in_feature(dataset, cols):
    for col in cols:
        dataset[col + "_num_emoticons"] = dataset[col].apply(lambda x: count_num_common_emoticons(x))
    return dataset

In [9]:
## Before tokenizing
## Count number of common "slang" style abbreviations
def count_num_common_slang(text):
    common_slang = ["kms", "smh", "smdh", "smfh", "rofl", "roflmao", "sic", 
                    "lol", "yolo", "ikr ", "dfkm", "lmao", "ofc", "surprise surprise",
                   ]
    count = 0
    text = text.casefold()
    for slang in common_slang:
        count += text.count(slang)
    return count

def add_num_slang_in_feature(dataset, cols):
    for col in cols:
        dataset[col + "_num_slang"] = dataset[col].apply(lambda x: count_num_common_slang(x))
    return dataset

In [10]:
!pip install text2emotion
!pip install pyspellchecker

Collecting text2emotion
  Downloading text2emotion-0.0.5-py3-none-any.whl (57 kB)
[?25l[K     |█████▊                          | 10 kB 27.9 MB/s eta 0:00:01[K     |███████████▍                    | 20 kB 15.6 MB/s eta 0:00:01[K     |█████████████████               | 30 kB 18.0 MB/s eta 0:00:01[K     |██████████████████████▊         | 40 kB 16.4 MB/s eta 0:00:01[K     |████████████████████████████▍   | 51 kB 14.1 MB/s eta 0:00:01[K     |████████████████████████████████| 57 kB 3.9 MB/s 
Collecting emoji>=0.6.0
  Downloading emoji-1.6.1.tar.gz (170 kB)
[?25l[K     |██                              | 10 kB 33.6 MB/s eta 0:00:01[K     |███▉                            | 20 kB 43.0 MB/s eta 0:00:01[K     |█████▉                          | 30 kB 47.6 MB/s eta 0:00:01[K     |███████▊                        | 40 kB 50.7 MB/s eta 0:00:01[K     |█████████▋                      | 51 kB 53.8 MB/s eta 0:00:01[K     |███████████▋                    | 61 kB 17.0 MB/s eta 0:00:0

In [6]:
## Before tokenizing
## Measure emotions of text
import text2emotion as t2e
def get_emotions_from_text(text):
    return t2e.get_emotion(text)

def get_emotion_from_text(text, emotion):
    return get_emotions_from_text(text)[emotion]



def add_emotions_features_to_dataset(dataset, cols, emotions):
    for col in cols:
        for emotion in emotions:
            col_name = col + "_" + emotion
            dataset[col_name] = dataset[col].apply(lambda x: get_emotion_from_text(x, emotion))
    return dataset

print(get_emotions_from_text("I love you"))
print(get_emotion_from_text("I love you", "Happy"))

ModuleNotFoundError: ignored

In [31]:
## After tokenizing
## Count number of misspelled words
from spellchecker import SpellChecker

spellchecker = SpellChecker(language="en")

def count_number_of_misspelled_words(text):
    count = 0
    misspelled_words = spellchecker.unknown(text)
    return len(misspelled_words)

def add_num_misspelled_words_feature(dataset, cols):
    for col in cols:
        dataset[col + "_num_misspelled_words"] = dataset[col].apply(lambda x: count_number_of_misspelled_words(x))
    return dataset

In [13]:
## After tokenizing
## Measure misspelling in a different way - by summing up edit distances
from nltk.metrics import edit_distance

def measure_sum_of_edit_distances(text):
    distances = 0
    misspelled_words = spellchecker.unknown(text)
    for misspelled_word in misspelled_words:
        corrected_word = spellchecker.correction(misspelled_word)
        distances += edit_distance(corrected_word, misspelled_word)
    return distances

def add_sum_of_edit_distances_feature(dataset, cols):
    for col in cols:
        dataset[col + "_edit_distance_misspelled_words"] = dataset[col].apply(lambda x: measure_sum_of_edit_distances(x))
    return dataset

## Prepare Data Structures

In [40]:
## Load CSV
csv_feature_engineering = pd.read_csv("train-balanced-sarcasm.csv")
csv_feature_engineering.dropna(subset=['comment', 'parent_comment'], inplace=True)
csv_feature_engineering["comment"] = csv_feature_engineering["comment"].astype(str)
csv_feature_engineering["parent_comment"] = csv_feature_engineering["parent_comment"].astype(str)

In [55]:
labels = csv_feature_engineering['label']
training_csv_feature_engineering, testing_csv_feature_engineering, label_train, label_test = train_test_split(csv_feature_engineering, labels, test_size=0.25, random_state=1000)


In [42]:
COMMENT_AND_PARENT_COMMENT = ["comment", "parent_comment"]
COMMENT = ["comment"]
PARENT_COMMENT = ["parent_comment"]
AUTHOR = "author"
SUBREDDIT = "subreddit"
EMOTIONS = ["Happy", "Sad", "Angry", "Surprise", "Fear"]

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Add Features Before Tokenization

In [43]:
## Add BEFORE tokenization features to training data
training_csv_feature_engineering = add_num_exclamation_mark_in_feature(
    training_csv_feature_engineering, COMMENT_AND_PARENT_COMMENT)

training_csv_feature_engineering = add_num_repeated_exclamation_mark_in_feature(
    training_csv_feature_engineering, COMMENT_AND_PARENT_COMMENT)

training_csv_feature_engineering = add_num_emoticons_in_feature(
    training_csv_feature_engineering, COMMENT_AND_PARENT_COMMENT)

training_csv_feature_engineering = add_num_slang_in_feature(
    training_csv_feature_engineering, COMMENT_AND_PARENT_COMMENT)

## training_csv_feature_engineering = add_emotions_features_to_dataset(
##    training_csv_feature_engineering, COMMENT_AND_PARENT_COMMENT, EMOTIONS)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  from ipykernel import kernelapp as app


In [44]:
## Add BEFORE tokenization features to testing data
testing_csv_feature_engineering = add_num_exclamation_mark_in_feature(
    testing_csv_feature_engineering, COMMENT_AND_PARENT_COMMENT)

testing_csv_feature_engineering = add_num_repeated_exclamation_mark_in_feature(
    testing_csv_feature_engineering, COMMENT_AND_PARENT_COMMENT)

testing_csv_feature_engineering = add_num_emoticons_in_feature(
    testing_csv_feature_engineering, COMMENT_AND_PARENT_COMMENT)

testing_csv_feature_engineering = add_num_slang_in_feature(
    testing_csv_feature_engineering, COMMENT_AND_PARENT_COMMENT)

## testing_csv_feature_engineering = add_emotions_features_to_dataset(
##    testing_csv_feature_engineering, COMMENT_AND_PARENT_COMMENT, EMOTIONS)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  from ipykernel import kernelapp as app


## Tokenize data

In [45]:
## tokenize training data
training_csv_feature_engineering = remove_stopwords_and_tokenize_cols_in_dataset(
    training_csv_feature_engineering, COMMENT_AND_PARENT_COMMENT)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [46]:
## tokenize testing data
testing_csv_feature_engineering = remove_stopwords_and_tokenize_cols_in_dataset(
    testing_csv_feature_engineering, COMMENT_AND_PARENT_COMMENT)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


## Add Features After Tokenization

In [47]:
## Add AFTER tokenization features to training data
training_csv_feature_engineering = add_length_feature_to_dataset(
    training_csv_feature_engineering, COMMENT_AND_PARENT_COMMENT)

training_csv_feature_engineering = add_feature_history_to_train(
    training_csv_feature_engineering, AUTHOR)

training_csv_feature_engineering = add_feature_history_to_train(
    training_csv_feature_engineering, SUBREDDIT)

training_csv_feature_engineering = add_num_misspelled_words_feature(
    training_csv_feature_engineering, COMMENT_AND_PARENT_COMMENT)

## training_csv_feature_engineering = add_sum_of_edit_distances_feature(
    ## training_csv_feature_engineering, COMMENT_AND_PARENT_COMMENT)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/u

In [48]:
proportion_history_author, total_comments_history_author = feature_history(
    training_csv_feature_engineering, AUTHOR)
proportion_history_subreddit, total_comments_history_subreddit = feature_history(
    training_csv_feature_engineering, SUBREDDIT)

In [49]:
## Add AFTER tokenization features to testing data
testing_csv_feature_engineering = add_length_feature_to_dataset(
    testing_csv_feature_engineering, COMMENT_AND_PARENT_COMMENT)

testing_csv_feature_engineering = add_feature_history_to_test(
    testing_csv_feature_engineering, AUTHOR, 
    proportion_history_author, total_comments_history_author)

testing_csv_feature_engineering = add_feature_history_to_test(
    testing_csv_feature_engineering, SUBREDDIT,
    proportion_history_subreddit, total_comments_history_subreddit)

testing_csv_feature_engineering = add_num_misspelled_words_feature(
    testing_csv_feature_engineering, COMMENT_AND_PARENT_COMMENT)

## testing_csv_feature_engineering = add_sum_of_edit_distances_feature(
    ## testing_csv_feature_engineering, COMMENT_AND_PARENT_COMMENT)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/u

## Normalize Data

In [50]:
from google.colab import files
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() 
scaled_values = scaler.fit_transform(training_csv_feature_engineering.iloc[:, 10:])
training_csv_feature_engineering.iloc[:,10:] = scaled_values

scaled_values = scaler.fit_transform(testing_csv_feature_engineering.iloc[:, 10:]) 
testing_csv_feature_engineering.iloc[:,10:] = scaled_values

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value[:, i].tolist())
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value[:, i].tolist())


## Download files in case

In [None]:
training_csv_feature_engineering.to_csv(r'training_data_engineered.csv', index = False, header=True)
files.download("training_data_engineered.csv")

testing_csv_feature_engineering.to_csv(r'testing_data_engineered.csv', index = False, header=True)
files.download("testing_data_engineered.csv")

## Data Preview

In [51]:
## Preview of features
## Engineered features from column 10 to end (0 based indexing)
training_csv_feature_engineering.head()

Unnamed: 0,label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment,comment_num_exclamation_marks,parent_comment_num_exclamation_marks,comment_num_repeated_exclamation_marks,parent_comment_num_repeated_exclamation_marks,comment_num_emoticons,parent_comment_num_emoticons,comment_num_slang,parent_comment_num_slang,num_comment_words,num_parent_comment_words,sarcasm_proportion_by_author,total_num_comments_by_author,sarcasm_proportion_by_subreddit,total_num_comments_by_subreddit,comment_num_misspelled_words,parent_comment_num_misspelled_words
439773,0,"[Like, NSA, care, .]",Qeebl,CrusaderKings,217,217,0,2016-01,2016-01-31 22:28:48,"[Well, ,, MI5, ,, Brit, .]",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002,0.000785,0.5,0.002294,0.298013,0.004567,0.0,0.002364
915206,0,[Yup],oliverw92,Minecraft,5,5,0,2011-08,2011-08-31 10:40:50,"[like, meteor, something, ?]",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0005,0.000524,0.0,0.0,0.515436,0.022651,0.0,0.0
68827,0,"[-cough-nuclear, war-cough-]",agfnov,videos,1,-1,-1,2016-10,2016-10-17 05:30:47,"[Surface, temperatures, Arizona, exceed, melti...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001,0.001309,0.666667,0.004587,0.488696,0.185806,0.046512,0.002364
309105,0,"[That, would, include, Domino, 's, ,, 7-11s, ,...",chuboy91,australia,1,1,0,2016-05,2016-05-07 11:56:08,"[Small, businesses, per, government, 's, defin...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.008,0.000916,0.8,0.009174,0.603081,0.067162,0.046512,0.002364
189391,0,"[It, 's, thing, .]",GSBaelog,The_Donald,3,3,0,2016-05,2016-05-07 05:51:27,"[In, description, video, use, term, ``, womyn,...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002,0.00144,0.666667,0.004587,0.463695,0.145467,0.023256,0.007092


In [52]:
## Preview of features
## Engineered features from column 10 to end (0 based indexing)
testing_csv_feature_engineering.head()

Unnamed: 0,label,comment,author,subreddit,score,ups,downs,date,created_utc,parent_comment,comment_num_exclamation_marks,parent_comment_num_exclamation_marks,comment_num_repeated_exclamation_marks,parent_comment_num_repeated_exclamation_marks,comment_num_emoticons,parent_comment_num_emoticons,comment_num_slang,parent_comment_num_slang,num_comment_words,num_parent_comment_words,sarcasm_proportion_by_author,total_num_comments_by_author,sarcasm_proportion_by_subreddit,total_num_comments_by_subreddit,comment_num_misspelled_words,parent_comment_num_misspelled_words
432262,1,"[Oh, ,, get, extra, Missionary, Points, dare, ...",charlaron,atheism,2,2,0,2016-03,2016-03-16 23:58:07,"[I, live, Indian, reserve, (, 30, minute, driv...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003355,0.009937,0.5,0.002294,0.638126,0.113043,0.0,0.003279
163893,0,"[Good, tackle, Scottish, thunda]",ThatBucsLife,buccaneers,3,-1,-1,2016-10,2016-10-11 02:00:45,"[Game, Thread, :, Bucs, vs, Panthers, -, Monda...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00122,0.006625,0.0,0.0,0.462687,0.002009,0.010753,0.013115
161030,1,"[Breaking, news, :, Antonio, Morrison, cut, .]",yellowlikegold,Colts,4,-1,-1,2016-10,2016-10-04 20:46:31,"[I, thought, liked, Moore, Irving, enough, let...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002135,0.008612,0.0,0.002294,0.475806,0.003745,0.0,0.009836
525810,0,"[Are, 50, ?]",HelloLadies13,DotA2,1,1,0,2015-08,2015-08-08 12:25:13,"[See, ,, shit, reason, I, 'm, installing, win1...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.000915,0.004306,0.5,0.011468,0.504279,0.0889,0.0,0.009836
605262,1,"[As, saying, goes, :, I, 'm, gay, ,, $, 50, $,...",MaxNanasy,creepyPMs,12,12,0,2015-08,2015-08-31 17:12:28,"[Thats, twice, insult, ,, jeez., So, discovere...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.00366,0.006956,0.45,0.043578,0.770833,0.084729,0.010753,0.009836


In [53]:
## for downloading df into csv for R visualisation
## training_csv_feature_engineering.to_csv (r'FOLDER_PATH\FILE_NAME.csv', index = False, header=True)

## Download Dataset with Features Engineered

In [8]:
training_csv_feature_engineering = pd.read_csv("training_data_engineered.csv")
testing_csv_feature_engineering = pd.read_csv("testing_data_engineered.csv")
label_train = training_csv_feature_engineering['label']
label_test = testing_csv_feature_engineering['label']
X_train = training_csv_feature_engineering.iloc[:,10:]
X_test = testing_csv_feature_engineering.iloc[:,10:]

## Baseline model (Logistic Regression)

In [11]:
from sklearn.linear_model import LogisticRegression

## Without word embeddings and word vectorization

LogisticRegression_classifier = LogisticRegression()
LogisticRegression_classifier.fit(X_train, label_train)
score = LogisticRegression_classifier.score(X_test, label_test)
print(score)

0.5753330586457904
