### This project cleans the text and builds a document-term matrix (DTM) with both a count vectorizer and a TF-IDF vectorizer. The DTM is used to train three base models (Multinomial Naive Bayes, Decision Tree Classifier and Logistic Regression), whose results are then compared.


In [101]:
import pandas as pd
import numpy as np

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from nltk.tokenize import RegexpTokenizer
from nltk.stem import LancasterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

import string
from six import string_types
from bs4 import BeautifulSoup


from textblob import TextBlob
import matplotlib.pyplot as plt
import requests


In [102]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

lancaster=LancasterStemmer()
lmtzr = WordNetLemmatizer()

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/jervissaldanha/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jervissaldanha/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jervissaldanha/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Import the data into a dataframe

In [103]:
data = pd.read_csv('sarcasm_detection.csv')
data.head()

Unnamed: 0,ID,comment,date,down,parent_comment,score,top,topic,user,label
0,uid_590555,"Well, let's be honest here, they don't actuall...",2015-04,0,They should shut the fuck up and let the commu...,2,2,starcitizen,Combat_Wombatz,0
1,uid_671762,"Well, I didn't need evidence to believe in com...",2016-12,-1,You need evidence to kill people? I thought we...,6,-1,EnoughCommieSpam,starkadd,1
2,uid_519689,"Who does an ""official promo"" in 360p?",2013-11,0,2014 BMW S1000R: Official Promo,3,3,motorcycles,phybere,0
3,uid_788362,Grotto koth was the best,2015-09,0,Not really that memorable lol if you want memo...,2,2,hcfactions,m0xyMC,1
4,uid_299252,Neal's back baby,2015-11,0,James Neal hit on Zach Parise,-5,-5,hockey,Somuch101,1


In [104]:
comments = data.comment.copy()

In [105]:
comments.head()

0    Well, let's be honest here, they don't actuall...
1    Well, I didn't need evidence to believe in com...
2                Who does an "official promo" in 360p?
3                             Grotto koth was the best
4                                     Neal's back baby
Name: comment, dtype: object

## Clean and preprocess the text data:

In [106]:
# Preprocessing applied to each tokenized comment:
# - Case conversion to lower case
# - Removing punctuation & stopwords
# - Lemmatization (the Lancaster stemmer is created above but never applied)

nltk.download('wordnet')
lmtzr = WordNetLemmatizer()

# Build the exclusion set once instead of rebuilding it for every token
punct_and_stop_words = set(stopwords.words('english') + list(string.punctuation))

def process_tokens(tokens: list) -> list:
    """Lowercase each token, drop stopwords and punctuation, then lemmatize."""
    return [lmtzr.lemmatize(token.lower()) for token in tokens
            if token.lower() not in punct_and_stop_words]
    

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jervissaldanha/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [107]:
from cleantext.sklearn import CleanTransformer

cleaner = CleanTransformer(no_punct=True)
cleaned_comments = cleaner.transform(comments)



In [108]:
comments

0        Well, let's be honest here, they don't actuall...
1        Well, I didn't need evidence to believe in com...
2                    Who does an "official promo" in 360p?
3                                 Grotto koth was the best
4                                         Neal's back baby
                               ...                        
14995    Well with a name like El Cubano I'm surprised ...
14996                            ... This is a good point.
14997                                                 Yep.
14998     That's what the government WANTS you to believe!
14999    because Windows 10 has the glorious start menu...
Name: comment, Length: 15000, dtype: object

In [109]:
cleaned_comments

0        well lets be honest here they dont actually se...
1        well i didnt need evidence to believe in commu...
2                       who does an official promo in 360p
3                                 grotto koth was the best
4                                          neals back baby
                               ...                        
14995    well with a name like el cubano im surprised h...
14996                                 this is a good point
14997                                                  yep
14998       thats what the government wants you to believe
14999    because windows 10 has the glorious start menu...
Name: comment, Length: 15000, dtype: object

In [110]:

### Tokenize the cleaned comments
# Raw string avoids an invalid-escape warning; the tokenizer is built once and reused
tokenizer = RegexpTokenizer(r'\w+')
data_list = [tokenizer.tokenize(text) for text in cleaned_comments]

## Print example
print(data_list[:3])

[['well', 'lets', 'be', 'honest', 'here', 'they', 'dont', 'actually', 'seem', 'to', 'do', 'much', 'moderating', 'so', 'they', 'have', 'to', 'spend', 'their', 'time', 'doing', 'something'], ['well', 'i', 'didnt', 'need', 'evidence', 'to', 'believe', 'in', 'communism'], ['who', 'does', 'an', 'official', 'promo', 'in', '360p']]
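
Stripping punctuation before tokenizing matters for contractions and possessives. A minimal sketch, using the standard-library `re.findall` as a stand-in for `RegexpTokenizer` (for a plain `\w+` pattern the two should return the same matches); the sample strings are taken from the outputs above, and the `tokenize` helper is hypothetical:

```python
import re

# Stand-in for RegexpTokenizer(r'\w+').tokenize: collects every run of word characters
def tokenize(text: str) -> list:
    return re.findall(r"\w+", text)

raw = "Neal's back baby"      # original comment
cleaned = "neals back baby"   # same comment after CleanTransformer

raw_tokens = tokenize(raw)          # the apostrophe splits the word in two
cleaned_tokens = tokenize(cleaned)  # the possessive stays attached to the word

print(raw_tokens)
print(cleaned_tokens)
```

Tokenizing the raw text would leave a stray `s` token, which is why the comments are cleaned first.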


In [111]:
cleaned_corpus = list(map(process_tokens, data_list))

In [112]:
cleaned_corpus = [" ".join(tokens) for tokens in cleaned_corpus]

In [113]:
cleaned_corpus

['well let honest dont actually seem much moderating spend time something',
 'well didnt need evidence believe communism',
 'official promo 360p',
 'grotto koth best',
 'neals back baby',
 'orange new black house card hemlock grove going see sense8 worth watching hype',
 'pff everybody know science sexist',
 'called rmmas golden boy sage shitlist since',
 'he dude though',
 'probably shouldve followed original comment',
 'thats gonna look great later life',
 'nonsense bookstore competitive price way leverage location',
 'least asus known putting quality ram android device',
 'dont equate stance rice',
 'since federal law start applying military',
 'take fistful shrooms go watch holy mountain way meant enjoyed',
 'id date id date hard',
 'thought sub balance fun walking spawn kill sentry gun placed',
 'oh yeah well doctorate c master soldering iron',
 'wow glanced history racist antisemitic comment impressive sir impressive',
 'took guy long',
 'could put suck john oliver dick',
 'got f

In [114]:
df = pd.DataFrame({"comments": cleaned_corpus, 'labels':data.label})
df.head()

Unnamed: 0,comments,labels
0,well let honest dont actually seem much modera...,0
1,well didnt need evidence believe communism,1
2,official promo 360p,0
3,grotto koth best,1
4,neals back baby,1


## Split data into train and test sets

In [115]:
from sklearn.model_selection import train_test_split

trainx, testx, trainy, testy = train_test_split(df.drop(columns=['labels']), df.labels, test_size=0.2, random_state=32)
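
The split above does not stratify on the label. If keeping the same class ratio in the train and test sets matters, `train_test_split` accepts a `stratify` argument. A small sketch on a hypothetical toy frame (`toy_df` and the `strat_*` names are illustrative, not the project data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy frame standing in for `df`
toy_df = pd.DataFrame({
    "comments": [f"comment {i}" for i in range(20)],
    "labels": [0] * 10 + [1] * 10,
})

# stratify=... keeps the label proportions identical in both splits
strat_trainx, strat_testx, strat_trainy, strat_testy = train_test_split(
    toy_df[["comments"]], toy_df["labels"],
    test_size=0.2, random_state=32, stratify=toy_df["labels"],
)

print(strat_trainy.value_counts().to_dict())
print(strat_testy.value_counts().to_dict())
```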

## Building model with count vectorizer

In [116]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

## Vectorize the words
vectorizer = CountVectorizer().fit(trainx.comments)

## Create vocabulary
vocab = vectorizer.get_feature_names_out()  # replaces get_feature_names(), removed in scikit-learn 1.2

## Vectorize train & test
train_dtm = vectorizer.transform(trainx.comments)
test_dtm = vectorizer.transform(testx.comments)




In [117]:
train_dtm

<12000x14225 sparse matrix of type '<class 'numpy.int64'>'
	with 66883 stored elements in Compressed Sparse Row format>

In [118]:
trainy.values

array([0, 0, 0, ..., 1, 0, 1])

In [119]:
from sklearn.naive_bayes import MultinomialNB

naive_bayes_model = MultinomialNB().fit(train_dtm, trainy)
test_y_pred = naive_bayes_model.predict(test_dtm)

In [120]:
from sklearn.metrics import accuracy_score, f1_score
print('Accuracy score: ', accuracy_score(testy, test_y_pred))

Accuracy score:  0.6086666666666667
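
Accuracy on its own is hard to interpret without a reference point, and `f1_score` is imported above but never used. A sketch of a slightly fuller evaluation, on hypothetical labels and predictions (these are not outputs of the model above):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical labels and predictions, for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)           # F1 for the positive class (label 1)
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class

# Accuracy of always predicting the most frequent class, as a reference point
majority_baseline = max(y_true.count(0), y_true.count(1)) / len(y_true)

print(acc, f1, majority_baseline)
print(cm)
```

In this toy case the accuracy equals the majority-class baseline, which is the kind of situation the extra numbers are meant to expose.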


## Building model with TF-IDF vectorizer

In [125]:
## Vectorize the words
vectorizer = TfidfVectorizer(min_df=10).fit(trainx.comments)

## Create vocabulary
vocab = vectorizer.get_feature_names_out()  # replaces get_feature_names(), removed in scikit-learn 1.2

## Vectorize train & test
train_dtm = vectorizer.transform(trainx.comments)
test_dtm = vectorizer.transform(testx.comments)



## Build Multinomial Naive Bayes Model

In [126]:
naive_bayes_model = MultinomialNB().fit(train_dtm, trainy)
test_y_pred = naive_bayes_model.predict(test_dtm)

In [127]:
print('Accuracy score: ', accuracy_score(testy, test_y_pred))

Accuracy score:  0.5986666666666667


## Build Logistic regression model

In [128]:
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression().fit(train_dtm, trainy)
test_y_pred = lr_model.predict(test_dtm)
print('Accuracy score: ', accuracy_score(testy, test_y_pred))


Accuracy score:  0.6106666666666667


## Build Decision tree classifier

In [129]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier().fit(train_dtm, trainy)
test_y_pred = dtc.predict(test_dtm)
print('Accuracy score: ', accuracy_score(testy, test_y_pred))

Accuracy score:  0.585


## Conclusion

We cleaned the text by removing punctuation and stop words and by applying lemmatization.
We then used a count vectorizer and a TF-IDF vectorizer to create the document-term matrix (DTM) that was used to train the base models.

On the TF-IDF features, Logistic Regression reached the highest accuracy (0.611), ahead of MultinomialNB (0.599) and the Decision Tree classifier (0.585). MultinomialNB trained on raw counts scored 0.609, so the gap between the two best models is small.
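
In the notebook, only MultinomialNB was trained on both representations, and every score comes from a single train/test split. One way to make the comparison more uniform is to run each vectorizer and model pair through a pipeline with cross-validation. This is a sketch on a hypothetical toy corpus (`texts` and `labels` are made up; the real run would pass `df.comments` and `df.labels`):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, six examples per class
texts = [
    "oh great another monday", "yeah right that will work", "wow what a surprise",
    "sure because that always helps", "totally not biased at all", "nice job breaking it",
    "the game starts at noon", "i updated the driver today", "this recipe needs more salt",
    "the patch fixed the crash", "he scored twice last night", "the store opens at nine",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

vectorizers = {"count": CountVectorizer(), "tfidf": TfidfVectorizer()}
models = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

results = {}
for vec_name, vec in vectorizers.items():
    for model_name, model in models.items():
        # The vectorizer sits inside the pipeline, so it is refit on each training fold
        pipe = make_pipeline(vec, model)
        scores = cross_val_score(pipe, texts, labels, cv=3, scoring="accuracy")
        results[(vec_name, model_name)] = scores.mean()

for key, score in sorted(results.items()):
    print(key, round(score, 3))
```

Scores on a corpus this small are not meaningful; the point is only the structure of the comparison.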