## <p style="text-align:center;">Let's classify Xenophobic tweets using a [Naive Bayes Classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) model from scratch</p>
### <div style="text-align:center;font-size:1.2em;">Author : <a style="color:#1da1f2 !important;text-decoration:none !important;" href="https://linkedin.com/in/qasimwani/">Qasim Wani</a></div>

### <p style="text-align:center;font-weight:700;">Here are some examples of [Xenophobic](https://en.wikipedia.org/wiki/Xenophobia) speech</p>

![alt text](https://miro.medium.com/max/1485/0*c8nQ_q3acJHeO_gq)
![alt text](https://miro.medium.com/max/1495/1*gEGkXAA99FIVpoRSd7vIKQ.png)
![alt text](https://miro.medium.com/max/1488/0*kcb5Rs9m9RL4CVue)

<h4 style="font-size:1.1em">Hate speech is starting to become a major issue where social media is prevelant.</h4>

> "[Louise Matsakis](https://twitter.com/lmatsakis?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor) of Wired explains that *only <span style="font-weight:900;">38%</span> of hate-speech posts that Facebook removes are detected by AI.* 
This is mainly because there are so many types of hate speech, and the language used changes rapidly."
<br>
Source : [Abraham Starosta](https://medium.com/sculpt/xenophobic-tweets-78a9b316635)

>**With that being said, let's classify Tweets based on a special type of hate-speech, known as Xenophobia**


### Step 0 : Importing libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
import re
from nltk.corpus import stopwords 
from collections import Counter

### Step 1 : Extracting Data

In [2]:
# We have two datasets, training and testing.

#Extracting training
df_train = pd.read_csv("xenophobia train.csv")
df_test  = pd.read_csv("xenophobia test.csv")

In [3]:
X_train = np.array(df_train.iloc[:,1])
y_train = np.array(df_train.iloc[:,0])

In [4]:
X_test = np.array(df_test.iloc[:, 1])
y_test = np.array(df_test.iloc[:, 0])

In [5]:
df_train.head()

Unnamed: 0,label,tweets
0,1,To send them back where they come from you hav...
1,2,A bunch of racists chanted send her back after...
2,2,Who are the phony sources who do not exist? Th...
3,1,Trump didn't tell any one to send them back. I...
4,1,In order for an unlawful Alien to have same Ri...


In [6]:
df_test.head()

Unnamed: 0,label,tweets
0,1,what do you have to say about this? Illegal pr...
1,1,send them back
2,2,Send her back Trump reverses course and backs ...
3,2,How could you have quality of life if your pre...
4,1,"Send them back, build the wall, fix the laws."


### It appears as if the labels aren't clear. Let's assign <span style="color:red;">1</span> as Xenophobic and <span style="color:red;">-1</span> as normal.

In [7]:
df_train["label"] = df_train["label"].replace(1, 1).replace(2, -1)
df_test["label"] = df_test["label"].replace(1, 1).replace(2, -1)

In [8]:
df_train.head()

Unnamed: 0,label,tweets
0,1,To send them back where they come from you hav...
1,-1,A bunch of racists chanted send her back after...
2,-1,Who are the phony sources who do not exist? Th...
3,1,Trump didn't tell any one to send them back. I...
4,1,In order for an unlawful Alien to have same Ri...


In [9]:
df_test.head()

Unnamed: 0,label,tweets
0,1,what do you have to say about this? Illegal pr...
1,1,send them back
2,-1,Send her back Trump reverses course and backs ...
3,-1,How could you have quality of life if your pre...
4,1,"Send them back, build the wall, fix the laws."


> #### Let's check to see if we have any null data in our training and test set

In [10]:
print("Null objects in our training set:\n",df_train.notnull().count()-df_train.isnull().count())
print("\nNull objects in our testing set:\n",df_test.notnull().count() - df_test.isnull().count())

Null objects in our training set:
 label     0
tweets    0
dtype: int64

Null objects in our testing set:
 label     0
tweets    0
dtype: int64


#### Let's classify our dataframe into two numpy arrays:
> **1.** xenophobic tweets
<br><br>
> **2.** non-xenophobic tweets

In [11]:
def classify(df):
    """
    This function accepts a pandas dataframe object.
    It returns two classified np.array objects (xenophobic (1) and non-xenophobic (-1))
    
    Parameters:
    df : a pandas dataframe object consisting of all tweets to be classified
    """
    y = np.array(df["label"])
    X = np.array(df["tweets"])
    
    xenophobic = []     # Xenophobic = 1
    non_xenophobic = [] # non-xenophobic = -1
    
    for i in range(len(X)):
        one_tweet = str(X[i]).lower().strip()
        one_tweet = re.sub(r'[^a-zA-Z0-9\s]', "", one_tweet)
        if(y_train[i] == 1):
            xenophobic.append(one_tweet)
        else:
            non_xenophobic.append(one_tweet)
    return np.array(xenophobic), np.array(non_xenophobic)

In [12]:
xen_train, non_xen_train = classify(df_train)
print("Number of Training Xenophobic tweets : {0}\nNumber of Training non-xenophobic tweets : {1}"
      .format(len(xen_train), len(non_xen_train)))

Number of Training Xenophobic tweets : 7031
Number of Training non-xenophobic tweets : 3029


In [13]:
xen_test, non_xen_test = classify(df_test)
print("Number of Testing Set Xenophobic tweets : {0}\nNumber of Testing Set non-xenophobic tweets : {1}"
      .format(len(xen_test), len(non_xen_test)))

Number of Testing Set Xenophobic tweets : 81
Number of Testing Set non-xenophobic tweets : 38


> Now, in order to know which tweets classify as Xenophobic, we need to **tokenize words.**
<br>
> This will help us see the **most occuring words** in xenophobic/non-xenophobic speech.
<br>
> We will tokenize words from most occurring 1 words upto most occurring 4 words.

In [14]:
def ngram_tokenizer(data, n=3):
    """
    This function finds the n most occurring words in our data.
    Returns a list of sorted tuples of 500 most occurring words.
    
    Parameters: 
    1. n    : Int. Number of words to tokenize. By default, n = 3.
    2. data : np.array() object. List of datapoints to tokenize.
    """
    n_word_count = {}
    stop_words = set(stopwords.words('english')) 
    for i in range(len(data)):
        n_grams = ngrams(word_tokenize(data[i]), n)
#         tokenized = [word for word in n_grams if word not in stop_words] <-- Use this when ignoring stop_words
        tokenized = [ ' '.join(grams) for grams in n_grams]
        for tokens in tokenized:
#             if(tokens not in stop_words): <-- Use this when ignoring stop_words
            if(tokens not in n_word_count):
                n_word_count[tokens] = 1
            else:
                n_word_count[tokens] += 1
            
    most_common = np.array(Counter.most_common(n_word_count))[:500]
    return most_common

In [15]:
def calc_tf_idf(words, size):
    """Calculates the term frequency of top 500 most common words in all tweets i.e.
        Xenophobic or non-xenophobic.
        
        Returns a new list with the word, frequency, and occurance as a fraction
        
        Takes in two parameters: 
        1. words : a list of tuples consisting of most frequent words and their respective frequencies
        2. size  : number of tweets in given class
    """
    i = 0
    new_list = []
    for i in range(len(words)):
        num = float(words[i][-1])
        x = float(num/size)
        a = list(words[i])
        y = float(x)
        y = x*float(np.log(1/y))
        a.append(y)
        new_list.append(a)
    return new_list

In [16]:
def first_n_tf_idf(n_start, n_end):
    """
    Calculates the term frequency - inverse document frequency of the
    n most frequent words
    
    Parameters : 
    n_start : number to start from (n_start is inclusive)
    n_end   : number to end (n_end is exclusive)
    Returns : a list of Text Document Matrices
    """
    xen_all_tdm = []
    non_xen_all_tdm = []
    for i in range(n_start, n_end):
        
        n_sorted_xen = ngram_tokenizer(xen_train, i)
        n_sorted_non_xen = ngram_tokenizer(non_xen_train, i)   
        
        non_xen_td_idf = calc_tf_idf(n_sorted_non_xen, len(non_xen_train))
        xen_td_idf = calc_tf_idf(n_sorted_xen, len(xen_train))
        
        xen_all_tdm.append(xen_td_idf)
        non_xen_all_tdm.append(non_xen_td_idf)
        
    return np.array(xen_all_tdm), np.array(non_xen_all_tdm)

In [17]:
#Let's calculate the TD-IDF for the first 500 most common tokenized words
xen_TDM, non_xen_TDM = first_n_tf_idf(1, 5)

In [18]:
#let's represent it into a pandas dataframe

xen_df_TDM = pd.DataFrame(data=xen_TDM[0], columns=['Terms','Frequency','TF-IDF'])
non_xen_df_TDM = pd.DataFrame(data=non_xen_TDM[0], columns=['Terms','Frequency','TF-IDF'])

### <div style='text-align:center;'>Sorting based on Frequencies</div>

In [19]:
xen_df_TDM.sort_values(by="TF-IDF",ascending=False).head()

Unnamed: 0,Terms,Frequency,TF-IDF
8,of,2655,0.3677517829652808
7,her,2726,0.3673542285171035
6,a,2736,0.367276944573686
9,you,2410,0.3670021748928815
10,is,2403,0.3669303371454771


In [20]:
non_xen_df_TDM.sort_values(by="TF-IDF",ascending=False).head()

Unnamed: 0,Terms,Frequency,TF-IDF
9,is,1088,0.3677761060974408
8,trump,1239,0.3656575452374511
10,you,944,0.3633454581605677
11,chant,941,0.3631796090547336
7,of,1326,0.3616253017541022



# <div style="text-align:center"> Naive Bayes Classifier Formula</div>

![alt text](https://blog.easysol.net/wp-content/uploads/2017/12/Image-1-1-600x169.png)

### <div style="text-align:center"> Let's understand what the above formula means in detail</div>
> Here, P(A|B) is the posterior probability, i.e. the objective. 
In our case, P(A|B) is P(xenophobia|tweet)
P(B|A) is the likelihood, i.e. P(tweet | xenophobia)
P(A) referes to the prior probability, i.e. P(xenophobia)
P(B) referes to the marginal probability, i.e. P(tweet)

## <div style="text-align:center"> Note about calculating likelihood probability:</div>
> In order to calculate P(B | A), we need to use the product operator, Π
> Here's an example of how it works
![alt text](https://math.illinoisstate.edu/day/courses/old/305/contentsum07.gif)

## <div style="text-align:center">Xenophobic Tweet Naive Bayes Classifier</div>
![alt text](https://pbs.twimg.com/media/EA93xLoUEAEUmWQ?format=jpg&name=small)

In [21]:
def calculate_posterior(likelihood, prior, marginal):
    """
    Calculates the posterior probability of a tweet being xenophobic or not.
    Return the posterior value (0 - 1)
    Parameters:
    1. likelihood : The likelihood probability (float : 0 - 1)
    2. prior : The prior probability (float : 0 - 1)
    3. marginal : The marginal probability (float : 0 - 1)
    """
    num = float(likelihood * prior)
    marginal = num/float(marginal)
    return float(marginal)

In [22]:
def calculate_marginal(word, _type,xen_tdm, non_xen_tdm):
    """
    Calculates the marginal probability of a word.
    Returns the marginal probability (0-1) as a float.
    Parameters:
    1. word : the word to calculate marginal probability for.
    """
    
    marginal_non = 1
    marginal_xen = 1
    for xen, non in zip(xen_tdm, non_xen_tdm):
        if(xen[0] == word):
            marginal_xen = float(xen[1])
        if(non[0] == word):
            marginal_non = float(xen[1])
    
    frequency = marginal_non + marginal_xen
    marginal_non /= frequency
    marginal_xen /= frequency
    
    if(_type == "xen"):
        return float(marginal_xen)
    elif(_type == "non"):
        return float(marginal_non)

In [23]:
def one_naive_bayes(twt):
    """
    Predicts if a tweet is Xenophobic or not.
    
    Returns 1 if Xenophobic; 
    Returns -1 if non-xenophobic;
    
    Also Returns the posterior of Xenophobic and non-xenophobic.
    
    Parameters:
    1. tweet : tweet to calculate the posterior for. Type : np.array() [Split each word.]
    """
    
    tots_xen = 0
    tots_non = 0
    size = 0
    i = 0
    for (xen_tdm, non_xen_tdm) in zip(xen_TDM, non_xen_TDM):
        i += 1
        tweet = list(ngrams(word_tokenize(twt), i))
    # Calculating the prior probability of Xen and non-xen tweet
        size_xen = len(xen_tdm)
        size_non = len(non_xen_tdm)
        total_size = size_non + size_xen
        prior_xen = float(size_xen/total_size)
        prior_non = float(size_non/total_size)
    #-----------------------------------------------------------
        likelihood_xen = 1
        likelihood_non = 1

        marginal_xen = 1
        marginal_non = 1

        for word in tweet:
            word = " ".join(word)
            for (checker_xen,checker_non) in zip(xen_tdm, non_xen_tdm):
                if(checker_xen[0] == word):
                    likelihood_xen *= float(checker_xen[-1])
                    marginal_xen *= calculate_marginal(word, 'xen',xen_tdm, non_xen_tdm)
                if(checker_non[0] == word):
                    likelihood_non *= float(checker_non[-1])
                    marginal_non *= calculate_marginal(word,"non",xen_tdm, non_xen_tdm)

        posterior_xen = calculate_posterior(likelihood_xen, prior_xen, marginal_xen)
        posterior_non = calculate_posterior(likelihood_non, prior_non, marginal_non)
        tots_xen += abs(posterior_xen)
        tots_non += abs(posterior_non)
        size += 1
        
    XEN = float(tots_xen/size)
    NON_XEN = float(tots_non/size)
    
    if(XEN >= NON_XEN):
        return XEN,NON_XEN, -1
    return XEN, NON_XEN, 1
       

# <div style='text-align:center;'>Validating our Model</div>

In [24]:
def polish_text(text):
    """
    Polished text by making it lowercase and removing punctuation.
    Returns the polished rext.
    Parameters:
    1. text : text to polish
    """
    sentence = str(text).lower().strip()
    sentence = re.sub(r'[^a-zA-Z0-9\s]', " ", sentence)
    return sentence

In [25]:
def validation(data):
    """
    This function validates our Naive Bayes Model.
    Returns the number of estimated Xenophobic and non-xenophobic tweets
    Parameters:
    1. data : dataset of tweets to classify. Type = np.array()
    """
    xen = 0
    non = 0
    for i in range(len(data)):
        tweet = polish_text(data[i])
        _, _, result = one_naive_bayes(tweet)
        if(result == True):
            xen += 1
        else:
            non += 1
            
    return xen, non

### <div style="text-align:center;">Let's calculate the precision score of our training set:</div>

In [26]:
#returns the number of trained Xenophobic tweets and non xenophobc tweets
xen_T, non_T = validation(X_train)

In [27]:
print(xen_T, non_T, "<-- Model Generated ||| Actual -->", len(xen_train), len(non_xen_train))

6996 3064 <-- Model Generated ||| Actual --> 7031 3029


In [28]:
precision_score = (xen_T/len(xen_train))*100
print("Training Precision score : {0:.3g}%".format(precision_score))

Training Precision score : 99.5%


### <div style='text-align:center;'><span style="color:red;">99.5% precision score. </span> Not bad for our training set. Hopefully, we didn't overfit 🙏😂<br><br><br>Let's validate our testing set now</div>

In [29]:
#returns the number of test xenophobic tweets and non xenophobic tweets
xen_TEST, non_TEST = validation(X_test)

In [30]:
precision_score = (1 - ((non_TEST - len(non_xen_test))/len(non_xen_test)))*100
print("Testing Set Precision score : {0:.3g}%".format(precision_score))

Testing Set Precision score : 97.4%


# <div style='text-align:center;'><span style="color:red;">97.4% precision score</span> for our testing set!</div>
### <div style='text-align:center;'>And that is why a Multinomial Naive Bayes Classifier is powerful for classifying text</div>

### But before I conclude, let's validate a sample tweet from [another wesbite](https://www.humanrights.gov.au/our-work/examples-racist-material-internet):![alt text](https://pbs.twimg.com/media/EA9-jo3UIAAyQDj?format=jpg&name=medium)

In [31]:
racist_content = """
GET THE
FUCK OUT OF OUR COUNTRY
NIGGERS,SPICS,KIKES,SANDNIGGERS,ANDCHINKS are ALL the SHIT that makes
our COUNTRY STINK

"""
x, n, _ = one_naive_bayes(racist_content)
if(x > n):
     print("Xenophobic Content.\nXenophobic Percent: {}%\nNon-Xenophobic content: {}%".format(x*100, n*100))
else:
    print("Non-Xenophobic Content.\nXenophic Percent: {}%\nNon-Xenophobic content: {}%".format(x*100, n*100))

Xenophobic Content.
Xenophobic Percent: 37.52657871080821%
Non-Xenophobic content: 37.524267053139525%


## <div style='text-align:center;'>As you can see, that was clearly maked as Xenophobic content.</div>
### Now, let's check to see if the [following quote](http://www.wiseoldsayings.com/neighbors-quotes/) is categorized as Xenophobic or not.![alt text](https://pbs.twimg.com/media/EA901ciUYAAuq5f?format=jpg&name=medium)

In [32]:
peaceful_content = """
How you can have dreams when your neighbors have nightmares.
"""
xen2,non2,ret = one_naive_bayes(peaceful_content)
if(xen2>non2):
    print("Xenophobic Content.\n\nXenophobic Percent: {}%\nNon-Xenophobic content: {}%".format(xen2*100, non2*100))
else:
    print("Non-Xenophobic Content.\n\nXenophic Percent: {}%\nNon-Xenophobic content: {}%".format(xen2*100, non2*100))

Non-Xenophobic Content.

Xenophic Percent: 26.167131080473325%
Non-Xenophobic content: 26.780867721054353%


### Follow me on [Github](https://github.com/QasimWani)#### <div style="text-align:center;color:blue;font-size:1.618em;">Peace Out!</div>

> ### Note: The model usually takes 30 minutes to run.