The following dataset was gathered from Kaggle.

Dataset name: Twitter Sentiment Analysis

Link: https://www.kaggle.com/arkhoshghalb/twitter-sentiment-analysis-hatred-speech?select=train.csv

Potential uses of the dataset, per website: 

The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate speech if it has a racist or sexist sentiment associated with it. So, the task is to classify racist or sexist tweets from other tweets.

Formally, given a training sample of tweets and labels, where label '1' denotes the tweet is racist/sexist and label '0' denotes the tweet is not racist/sexist, your objective is to predict the labels on the test dataset.

In [1]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
from textblob import TextBlob
import sklearn
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix

np.random.seed(47)

tweet_train = pd.read_csv('train.csv')

#loading the test dataset alongside the train dataset so predictions can be run on it later
tweet_test = pd.read_csv('test.csv')


tweet_train.head()


Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [2]:
#dropping column id from both datasets
tweet_train = tweet_train.drop(['id'],axis=1)
tweet_test = tweet_test.drop(['id'],axis=1)

print(tweet_train.tail())


       label                                              tweet
31957      0  ate @user isz that youuu?ðððððð...
31958      0    to see nina turner on the airwaves trying to...
31959      0  listening to sad songs on a monday morning otw...
31960      1  @user #sikh #temple vandalised in in #calgary,...
31961      0                   thank you @user for you follow  


In [3]:
import re


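def preprocessor(text):
    # strip HTML-style tags
    text = re.sub(r'<[^>]*>', '', text)
    # collect emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # lowercase, collapse non-word characters to spaces, then re-append the emoticons
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    return text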
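
A quick way to see what the preprocessor does is to run it on a single made-up string. The sketch below repeats the function so it runs on its own; the sample tweet is hypothetical and not taken from the dataset.

```python
import re


def preprocessor(text):
    # same steps as the notebook's preprocessor, written with raw strings
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) + \
        ' '.join(emoticons).replace('-', '')
    return text


# hypothetical sample: an HTML tag, mixed case, an emoticon and a mention
sample = "<b>Happy</b> day :-) @user!"
cleaned = preprocessor(sample)
print(repr(cleaned))
```

The tag, the punctuation and the '@' are removed, the text is lowercased, and the emoticon is moved to the end with its '-' dropped.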

In [4]:
tweet_train['tweet'] = tweet_train['tweet'].apply(preprocessor)

#running the preprocessor on the test data
tweet_test['tweet'] = tweet_test['tweet'].apply(preprocessor)


In [5]:

#checking that the preprocessor works by looking at column 'tweet', row 28.

print(tweet_train.loc[28, 'tweet'])


happy father s day user ð ð ð ð 


In [6]:
#randomizing the row order of our train dataset.

tweet_train = tweet_train.reindex(np.random.permutation(tweet_train.index))



print (tweet_train.head())

       label                                              tweet
17253      0   user getting spoilt or what wishing all dads ...
15758      0                 user smile with my dog dogsarejoy 
25747      0   user user user user user user user user ð ð m...
20548      0         i am successful i_am positive affirmation 
17073      0   duschszene fear origins pib moore temprano cl...


In [7]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/izzy/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [8]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
stop = stop + [u'a',u'b',u'c',u'd',u'e',u'f',u'g',u'h',u'i',u'j',u'k',u'l',u'm',u'n',u'o',u'p',u'q',u'r',u's',u't',u'v',u'w',u'x',u'y',u'z']

stop.append('user') #user is a common word in our dataset, therefore removing

print(stop)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
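
The filtering that this stop list drives can be sketched without NLTK. The snippet below is only an illustration: the five stop words are a small hypothetical subset of the English list, and the token list is made up. It also shows that the single-letter list above contains every letter except 'u', and uses a set, which makes the `not in` lookup cheaper than scanning a list.

```python
import string

# the notebook's single-letter additions: a to z without 'u'
single_letters = [c for c in string.ascii_lowercase if c != 'u']

# hypothetical subset of the NLTK English stop words, plus the extras
stop_set = {"i", "me", "my", "the", "and"} | set(single_letters) | {"user"}

# made-up tokens, as they might look after preprocessing
tokens = ["user", "the", "father", "s", "day", "u"]
kept = [t for t in tokens if t not in stop_set]
print(kept)
```

Because 'u' is not in the list, a token such as 'u' (often used for "you" in tweets) survives the filter.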

In [9]:
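def split_into_lemmas(text):
    # note: 'utf8' + ... prepends the literal string 'utf8' to the text
    # (it does not set an encoding), hence the 'utf8' tokens in the output below
    text = 'utf8' + str(text).lower()
    words = TextBlob(text).words
    # for each word, take its "base form" = lemma 
    return [word.lemma for word in words if word not in stop]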

#applying the lemmatizer to the test and train data to inspect its output (the results are not stored)
tweet_test.tweet.apply(split_into_lemmas)

tweet_train.tweet.apply(split_into_lemmas)

17253    [utf8, getting, spoilt, wishing, dad, amp, hea...
15758                       [utf8, smile, dog, dogsarejoy]
25747                           [utf8, ð, ð, monday, ð, ð]
20548     [utf8i, successful, i_am, positive, affirmation]
17073    [utf8, duschszene, fear, origin, pib, moore, t...
                               ...                        
23112    [utf8, strong, word, love, introduction, thoug...
11528    [utf8, life, vanity, let, hand, find, still, a...
14663    [utf8, clearwater, polar, bear, climb, racing,...
18310                                 [utf8, kindð, rð, ð]
5255     [utf8wishing, happy, day, daddy, ð, daddy, fat...
Name: tweet, Length: 31962, dtype: object

In [10]:
%%time
bow_transformer = CountVectorizer(analyzer=split_into_lemmas).fit(tweet_train['tweet'])
print (len(bow_transformer.vocabulary_))

41718
CPU times: user 8.53 s, sys: 20.5 ms, total: 8.55 s
Wall time: 8.55 s
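
What fitting a bag-of-words transformer amounts to can be sketched in plain Python. This is not scikit-learn's implementation, only a minimal illustration: `fit_vocabulary` and `transform` are hypothetical helper names, and the token lists are made up.

```python
from collections import Counter


def fit_vocabulary(token_lists):
    # assign one column index to every distinct token seen during fitting
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: idx for idx, tok in enumerate(vocab)}


def transform(token_lists, vocabulary):
    # one sparse row per document: {column index: count}
    rows = []
    for tokens in token_lists:
        counts = Counter(tok for tok in tokens if tok in vocabulary)
        rows.append({vocabulary[tok]: n for tok, n in counts.items()})
    return rows


docs = [["happy", "father", "day"], ["father", "father", "love"]]
vocabulary = fit_vocabulary(docs)

# the third document contains a token that was not seen when fitting
rows = transform(docs + [["unseen", "love"]], vocabulary)
print(vocabulary)
print(rows)
```

Tokens that were not in the fitted vocabulary are dropped at transform time, which is why new text can only be scored on words the training tweets contained.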


In [11]:
%%time


tweets_bow = bow_transformer.transform(tweet_train['tweet'])

#running our bag-of-words transformer on our test dataset
tweet_final_test = bow_transformer.transform(tweet_test['tweet'])

print ('sparse matrix shape:', tweets_bow.shape)
print ('number of non-zeros:', tweets_bow.nnz)
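#note: nnz / total cells is the share of non-zero entries (the density), printed here under the label 'sparsity'
print ('sparsity: %.2f%%' % (100.0 * tweets_bow.nnz / (tweets_bow.shape[0] * tweets_bow.shape[1])))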


sparse matrix shape: (31962, 41718)
number of non-zeros: 276004
sparsity: 0.02%
CPU times: user 13 s, sys: 17.9 ms, total: 13 s
Wall time: 13 s
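
The figure printed above can be reproduced by hand from the reported shape and non-zero count. The sketch below uses only those three numbers; the variable names are illustrative.

```python
# values reported in the output above
rows, cols = 31962, 41718
nnz = 276004

# share of cells that hold a non-zero count (what the cell prints)
density_pct = 100.0 * nnz / (rows * cols)

# share of cells that are zero
sparsity_pct = 100.0 - density_pct

# average number of distinct vocabulary terms per tweet
avg_terms_per_tweet = nnz / rows

print('density:  %.2f%%' % density_pct)
print('sparsity: %.2f%%' % sparsity_pct)
print('distinct terms per tweet: %.1f' % avg_terms_per_tweet)
```

In other words, about 0.02% of the matrix is filled and roughly 99.98% is zeros, which is why a sparse matrix format is used.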


In [12]:
#the train dataset has a total of 31,962 rows; half of that is 15,981

tweets_bow_train = tweets_bow[:15981]
tweets_bow_test = tweets_bow[15981:]
tweets_class_train = tweet_train['label'][:15981]
tweets_class_test = tweet_train['label'][15981:]

print (tweets_bow_train.shape)
print (tweets_bow_test.shape) 

(15981, 41718)
(15981, 41718)
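
Because label 1 is rare in this data, a plain half-and-half slice can leave the two halves with slightly different class ratios. One alternative is a stratified split, sketched below in plain Python under assumed toy labels; `stratified_split` is a hypothetical helper, not part of the notebook. scikit-learn's `train_test_split` offers the same idea through its `stratify` argument.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.5, seed=47):
    # group row indices by class, then split each class separately
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(round(len(idxs) * test_fraction))
        test_idx.extend(idxs[:cut])
        train_idx.extend(idxs[cut:])
    return sorted(train_idx), sorted(test_idx)


# toy labels with roughly the same imbalance as the dataset (about 7% ones)
labels = [0] * 93 + [1] * 7
train_idx, test_idx = stratified_split(labels)

test_positives = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_positives)
```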


In [13]:
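%time 
#note: a bare %time line times an empty statement, so the figure below does not measure the fit; %%time on the first line would time the whole cell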

tweets_class = MultinomialNB().fit(tweets_bow_train, tweets_class_train)

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.05 µs
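
To make the fitted model less of a black box, here is a minimal from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing, the same smoothing value that `MultinomialNB` uses by default (`alpha=1.0`). It is an illustration on made-up token lists, not scikit-learn's code; `fit_multinomial_nb` and `predict_proba` are hypothetical helper names.

```python
import math
from collections import Counter


def fit_multinomial_nb(docs, labels, alpha=1.0):
    # docs: list of token lists, labels: list of class ids
    classes = sorted(set(labels))
    vocab = sorted({tok for doc in docs for tok in doc})
    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(tok for d in class_docs for tok in d)
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {
            t: math.log((counts[t] + alpha) / total) for t in vocab
        }
    return classes, log_prior, log_likelihood


def predict_proba(model, tokens):
    classes, log_prior, log_likelihood = model
    scores = []
    for c in classes:
        # tokens outside the fitted vocabulary are skipped
        score = log_prior[c] + sum(
            log_likelihood[c][t] for t in tokens if t in log_likelihood[c]
        )
        scores.append(score)
    # normalise in a numerically stable way
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    return [e / sum(exps) for e in exps]


# made-up training data: three class-0 documents, one class-1 document
docs = [["love", "day"], ["happy", "love"], ["hate", "racist"],
        ["happy", "day", "love"]]
labels = [0, 0, 1, 0]
model = fit_multinomial_nb(docs, labels)

p_love = predict_proba(model, ["love"])
p_hate = predict_proba(model, ["hate"])
p_unseen = predict_proba(model, ["zzz"])
print(p_love, p_hate, p_unseen)
```

In this sketch, text made up only of tokens absent from the fitted vocabulary is scored by the class prior alone, so with imbalanced classes it leans toward the majority class.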


In [14]:

predictions = tweets_class.predict(tweets_bow_test)
print (predictions)

[0 0 0 ... 0 0 0]


In [15]:
print ('accuracy', accuracy_score(tweets_class_test, predictions))
print ('confusion matrix\n', confusion_matrix(tweets_class_test, predictions))
print ('(row=expected, col=predicted)')

accuracy 0.9432451035604781
confusion matrix
 [[14594   212]
 [  695   480]]
(row=expected, col=predicted)


In [16]:
#classification report on the held-out half of the train dataset

print (classification_report(tweets_class_test, predictions))

              precision    recall  f1-score   support

           0       0.95      0.99      0.97     14806
           1       0.69      0.41      0.51      1175

    accuracy                           0.94     15981
   macro avg       0.82      0.70      0.74     15981
weighted avg       0.94      0.94      0.94     15981
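
The class-1 row of this report can be derived directly from the confusion matrix printed earlier. The sketch below redoes that arithmetic and adds a majority-class baseline (always predicting 0) for comparison; the variable names are illustrative.

```python
# confusion matrix reported above (row = expected, col = predicted)
tn, fp = 14594, 212
fn, tp = 695, 480
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision = tp / (tp + fp)   # of the tweets flagged as 1, how many were 1
recall = tp / (tp + fn)      # of the tweets that were 1, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

# accuracy of a model that always predicts the majority class 0
baseline_accuracy = (tn + fp) / total

print('accuracy  %.4f' % accuracy)
print('precision %.2f  recall %.2f  f1 %.2f' % (precision, recall, f1))
print('always-0 baseline accuracy %.4f' % baseline_accuracy)
```

With about 93% of the held-out labels equal to 0, accuracy alone says little here; the recall of 0.41 on class 1 is the more telling number.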



In [19]:
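#running predictions on the test dataset
#note: tweets_class_test holds the labels of the held-out train rows, not labels for test.csv,
#so the rows are not aligned and the scores below do not measure performance on the test dataset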

predictions2 = tweets_class.predict(tweet_final_test[:15981])

print ('accuracy', accuracy_score(tweets_class_test, predictions2))
print ('confusion matrix\n', confusion_matrix(tweets_class_test, predictions2))
print ('(row=expected, col=predicted)')


accuracy 0.9023840810963019
confusion matrix
 [[14389   417]
 [ 1143    32]]
(row=expected, col=predicted)


In [20]:
print (classification_report(tweets_class_test, predictions2))

              precision    recall  f1-score   support

           0       0.93      0.97      0.95     14806
           1       0.07      0.03      0.04      1175

    accuracy                           0.90     15981
   macro avg       0.50      0.50      0.49     15981
weighted avg       0.86      0.90      0.88     15981
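
In this second comparison the predictions come from rows of test.csv while the expected labels come from the held-out half of train.csv, so the two are not paired row by row. If predictions and labels were unrelated, class-1 recall would be close to the overall share of predicted 1s, and class-1 precision close to the overall share of actual 1s. The sketch below checks this against the confusion matrix above; it is a sanity check, not a statistical test.

```python
# second confusion matrix reported above (row = expected, col = predicted)
tn, fp = 14389, 417
fn, tp = 1143, 32
total = tn + fp + fn + tp

recall = tp / (tp + fn)
precision = tp / (tp + fp)

# what unrelated predictions and labels would give on average
predicted_positive_rate = (fp + tp) / total
actual_positive_rate = (fn + tp) / total

print('recall    %.3f vs predicted-1 rate %.3f'
      % (recall, predicted_positive_rate))
print('precision %.3f vs actual-1 rate    %.3f'
      % (precision, actual_positive_rate))
```

Both pairs are nearly equal, which is consistent with the labels and predictions being unrelated rather than with the classifier performing badly on new tweets.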



In [21]:
#creating a helper to test the model on any text.


def predict_tweets(new_tweets): 
    new_sample = bow_transformer.transform([new_tweets])
    print (new_tweets, np.around(tweets_class.predict_proba(new_sample), decimals=3),"\n")
    

In [22]:
predict_tweets('Horrible. Terrible. Dreadful. Awful. Pile of garbage. Junk.')
predict_tweets('Fantastic. Amazing. Terrific. Classic. Best! Extraordinary. Authentic. Ideal. Vibrant. Powerful. Perfect. Imaginative. Incredible. Happy. Love. Pleasure.')
predict_tweets('Okay. Great.')

Horrible. Terrible. Dreadful. Awful. Pile of garbage. Junk. [[0.04 0.96]] 

Fantastic. Amazing. Terrific. Classic. Best! Extraordinary. Authentic. Ideal. Vibrant. Powerful. Perfect. Imaginative. Incredible. Happy. Love. Pleasure. [[1. 0.]] 

Okay. Great. [[0.983 0.017]] 



In [23]:
predict_tweets('alright white supremacy')

alright white supremacy [[0.166 0.834]] 



In [24]:
predict_tweets('whitesupremacy and hate jews')

whitesupremacy and hate jews [[0.255 0.745]] 



In [25]:
predict_tweets('Little fucking kikes. They get ruled by people like me. Little fucking octaroons. My ancestors fucking enslaved those little pieces of fucking shit.')

Little fucking kikes. They get ruled by people like me. Little fucking octaroons. My ancestors fucking enslaved those little pieces of fucking shit. [[1. 0.]] 

