### The Purpose of this Model is to build a classifier that can do the following:

#### Predict if a post is from a driver or a rider. Build a Gaussian Binomial Naive Bayes Model given post content.

> Copyright Product of HitchHiqe © 2019
>
>  Author: Qasim Wani.
>
> Written: 26th November, 2019
>
> Version: 1.0

In [8]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
import sklearn.model_selection as skl
import re
from nltk.corpus import stopwords 
from collections import Counter

## Step 1. Data Extraction and Train Test Split

In [41]:
df = pd.read_csv("C:/Users/qasim/Desktop/Exigence/HitchHiqe - official/sourcecode/hitchHiqe/Fraper/data/dev train/raw.csv")

In [46]:
df = df[df.Type != -1] ## Dropping irrelevant data
X = np.array(df.iloc[:,1])
y = np.array(df.iloc[:,-1])
X_train, X_test, y_train, y_test = skl.train_test_split(X, y, test_size=0.33, random_state=42)

In [47]:
df.head(3)

Unnamed: 0,MongoDB Object ID,Post,Post Date,client id,Name,Post Link,Post Timestamp,Type
0,5dddf40c7d7e3e467c07a921,leaving sunday december 1st (can make stops on...,"11/26/19, 9:49 PM",https://www.facebook.com/eric.aponte.5817,Eric Aponte,https://www.facebook.com/groups/19322097409271...,1574822985,0
1,5dddf40c7d7e3e467c07a91e,please let me know if anyone is driving to ral...,"11/26/19, 9:08 PM",https://www.facebook.com/yogen.phanse,Yogen Phanse,https://www.facebook.com/groups/19322097409271...,1574820503,1
2,5dddf40c7d7e3e467c07a922,"driving from asheville, nc to vt on saturday, ...","11/26/19, 4:23 PM",https://www.facebook.com/haley.fontaine.3,Haley Fontaine,https://www.facebook.com/groups/19322097409271...,1574803421,0


#### Check to see for null data

In [48]:
print("\nNull objects in our set:\n", df.isnull().sum())  # missing values per column


Null objects in our set:
 MongoDB Object ID    0
Post                 0
Post Date            0
client id            0
Name                 0
Post Link            0
Post Timestamp       0
Type                 0
dtype: int64


In [49]:
## Looks good! Let's proceed to data synthesis and then to building the actual model.

## Step 2. Data Synthesis and Document Matrix Generation

### Note:
> Type = 0 : <strong>Driver</strong>
>
> Type = 1 : <strong>Rider</strong>

In [50]:
def classify(X, y):
    """
    This function accepts two parameters.
    1. X: Posts.
    2. y : Type of posts (0 = Driver ; 1 = Rider)
    It returns two np.array() objects:
    1. driver: An np.array() of all driver posts
    2. rider: An np.array() of all rider posts
    """
    driver = []     # driver = 0
    rider = [] # rider = 1
    
    for i in range(len(X)):
        one_post = str(X[i]).lower().strip()
        one_post = re.sub(r'[^a-zA-Z0-9\s]', "", one_post)
        if(y[i] == 1):
            rider.append(one_post)
        else:
            driver.append(one_post)
    return np.array(driver), np.array(rider)

In [51]:
driver_train, rider_train = classify(X_train, y_train)
print("Number of Training Driver Posts : {0}\nNumber of Training Rider Posts : {1}"
      .format(len(driver_train), len(rider_train)))

Number of Training Driver Posts : 129
Number of Training Rider Posts : 53


In [52]:
### Wow. This is some nice data, though not evenly split: about 71% driver posts to 29% rider posts. Exciting for HitchHiqe.
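As a quick check on the class balance, the shares can be computed from the two counts printed above (129 driver and 53 rider training posts). This is a minimal sketch; the variable names are illustrative.

```python
# Counts taken from the training split printed in the notebook
n_driver, n_rider = 129, 53
total = n_driver + n_rider

driver_share = n_driver / total
rider_share = n_rider / total

print("Driver share: {0:.1%}".format(driver_share))  # 70.9%
print("Rider share : {0:.1%}".format(rider_share))   # 29.1%
```

On this split, always predicting Driver would already be right for about 71% of training posts, which is useful context when reading the scores further down.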

> Now, to detect whether a post is from a driver or a rider, we need to tokenize it into words.
> This helps identify the N most frequent terms in each of the two types of posts.
> The model relies on the most frequent n-grams of 1 to 4 words, using a bag-of-words approach.
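
To illustrate the n-gram idea, here is a minimal sketch that counts word n-grams using only the standard library. It splits on whitespace instead of calling `word_tokenize`, and the helper name `count_ngrams` and the sample posts are made up for illustration.

```python
from collections import Counter

def count_ngrams(posts, n=1):
    """Counts word n-grams across posts (whitespace split; an assumption, not NLTK)."""
    counts = Counter()
    for post in posts:
        words = post.split()
        # Slide a window of n words over the post
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

# Hypothetical, already lowercased posts
sample_posts = [
    "need a ride to dc",
    "need a ride back to school",
]
bigrams = count_ngrams(sample_posts, n=2)
print(bigrams["need a"], bigrams["a ride"], bigrams["to dc"])  # 2 2 1
```

With `n=1` the same function gives plain word counts, which is the bag-of-words case.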

In [54]:
def ngram_tokenizer(data, n=1, topN=314):
    """
    This function finds the topN most frequent n-grams in our data.
    Returns an np.array() of (n-gram, count) pairs, sorted by descending count.
    
    Parameters: 
    1. data : np.array() object. List of posts to tokenize.
    2. n    : Int. Number of words per n-gram. By default, n = 1.
    3. topN : Int. Number of most frequent n-grams to return. By default, topN = 314.
    """
    n_word_count = {}
    stop_words = set(stopwords.words('english')) 
    for i in range(len(data)):
        n_grams = ngrams(word_tokenize(data[i]), n)
#         tokenized = [word for word in n_grams if word not in stop_words] <-- Use this when ignoring stop_words
        tokenized = [ ' '.join(grams) for grams in n_grams]
        for tokens in tokenized:
#             if(tokens not in stop_words): <-- Use this when ignoring stop_words
            if(tokens not in n_word_count):
                n_word_count[tokens] = 1
            else:
                n_word_count[tokens] += 1
            
    most_common = np.array(Counter(n_word_count).most_common(topN))
    return most_common

In [55]:
def calc_tf_idf(words, size):
    """Scores the top-N most common terms of one class of Posts.
        
        Returns a new list of [term, frequency, score] entries, where
        score = p * log(1 / p) and p = frequency / size.
        
        Takes in two parameters: 
        1. words : a list of tuples consisting of most frequent terms and their respective frequencies
        2. size  : number of posts in given class
    """
    new_list = []
    for word in words:
        # Occurrence of the term as a fraction of the number of posts
        p = float(word[-1]) / size
        entry = list(word)
        entry.append(p * float(np.log(1 / p)))
        new_list.append(entry)
    return new_list
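As a worked example of the score: with p = frequency / size, each term gets p * log(1 / p). The sketch below reproduces the value for a term that appears 47 times across 129 driver training posts (the figures for `ride` in the driver table further down). It also shows that p * log(1 / p) peaks at p = 1/e, so terms with mid-range frequencies score highest. The function name `term_score` is illustrative.

```python
import math

def term_score(frequency, size):
    """Score used in the notebook: p * log(1 / p), with p = frequency / size."""
    p = frequency / size
    return p * math.log(1 / p)

# 'ride' appears 47 times across 129 driver training posts
print(round(term_score(47, 129), 6))  # 0.367862

# p * log(1 / p) is maximal at p = 1 / e, where it equals 1 / e
peak = 1 / math.e
print(round(peak * math.log(1 / peak), 6))  # 0.367879
```

This is why a more frequent term such as `me` (57 occurrences) ranks below `ride` when the table is sorted by this score.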

In [56]:
def first_n_tf_idf(n_start, n_end, driver_data_train, rider_data_train):
    """
    Calculates the term frequency - inverse document frequency of the
    n most frequent words
    
    Parameters : 
    1. n_start : number to start from (n_start is inclusive)
    2. n_end   : number to end (n_end is exclusive)
    3. driver_data_train : Driver Training data
    4. rider_data_train : Rider Training data
    
    Returns :
    1. driver_all_TDM : List of Driver TDM of n most frequent words.
    2. rider_all_TDM  : List of Rider TDM of n most frequent words.
    """
    rider_all_TDM = []
    driver_all_TDM = []
    for i in range(n_start, n_end):
        
        n_sorted_driver = ngram_tokenizer(driver_data_train, i)
        n_sorted_rider = ngram_tokenizer(rider_data_train, i)
        
        driver_TD_IDF = calc_tf_idf(n_sorted_driver, len(driver_data_train))
        rider_TD_IDF = calc_tf_idf(n_sorted_rider, len(rider_data_train))
        
        driver_all_TDM.append(driver_TD_IDF)
        rider_all_TDM.append(rider_TD_IDF)
        
    return np.array(driver_all_TDM), np.array(rider_all_TDM)

In [57]:
#Let's calculate the TF-IDF for the most common n-grams of 1 to 4 words (top 314 per n-gram size by default)
driver_TDM, rider_TDM = first_n_tf_idf(1, 5, driver_train, rider_train)

In [58]:
#let's represent it into a pandas dataframe

driver_df_TDM = pd.DataFrame(data=driver_TDM[0], columns=['Terms','Frequency','TF-IDF'])
rider_df_TDM = pd.DataFrame(data=rider_TDM[0], columns=['Terms','Frequency','TF-IDF'])

### Now, let's sort the dataframe objects by descending TF-IDF score

In [64]:
# Driver TDM, sorted
driver_df_TDM.sort_values(by="TF-IDF",ascending=False).head(11)

Unnamed: 0,Terms,Frequency,TF-IDF
9,ride,47,0.3678623699583397
10,at,47,0.3678623699583397
11,you,44,0.3668790844923262
12,message,43,0.3662040962227032
13,i,42,0.3653488140720059
8,for,54,0.3645328009384456
14,around,41,0.3643089445267591
7,driving,55,0.3634601321868688
15,off,40,0.3630799845729414
6,me,57,0.3608944556747748


In [65]:
# Rider TDM, sorted
rider_df_TDM.sort_values(by="TF-IDF",ascending=False).head(11)

Unnamed: 0,Terms,Frequency,TF-IDF
10,will,20,0.367758
11,vt,18,0.366765
9,gas,23,0.362271
12,friday,16,0.361571
13,the,16,0.361571
8,or,24,0.358749
7,pay,25,0.354442
14,sunday,13,0.344707
15,if,12,0.336314
6,from,29,0.329941


## Step 3. Gaussian Binomial Naive Bayes Algorithm Build

In [67]:
def calculate_posterior(likelihood, prior, marginal):
    """
    Calculates the posterior score of a Post Type being a driver/rider
    using Bayes' rule: likelihood * prior / marginal.
    Returns the posterior value as a float.
    Parameters:
    1. likelihood : The likelihood probability (float : 0 - 1)
    2. prior : The prior probability (float : 0 - 1)
    3. marginal : The marginal probability (float : 0 - 1)
    """
    return float(likelihood * prior) / float(marginal)
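For reference, the function above follows Bayes' rule. Writing $c$ for the class (driver or rider) and $w$ for an observed term (notation chosen here for illustration):

$$
P(c \mid w) = \frac{P(w \mid c)\, P(c)}{P(w)}
$$

Here `likelihood` plays the role of $P(w \mid c)$, `prior` of $P(c)$, and `marginal` of $P(w)$.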

In [69]:
def calculate_marginal(word, _type, driver_tdm, rider_tdm):
    """
    Calculates the marginal probability of a word.
    Returns the marginal probability (0-1) as a float.
    Parameters:
    1. word : the word to calculate marginal probability for.
    2. _type : Indicates the _type of post, generated from one_naive_bayes() : 0 -> DRIVER; 1 -> RIDER
    3. driver_tdm : The associated Driver TDM
    4. rider_tdm  : The associated Rider TDM
    """
    marginal_DRIVER = 1    
    marginal_RIDER = 1
    
    for driver, rider in zip(driver_tdm, rider_tdm):
        if(driver[0] == word):
            marginal_DRIVER = float(driver[1])
        if(rider[0] == word):
            marginal_RIDER = float(rider[1])
    
    frequency = marginal_RIDER + marginal_DRIVER
    marginal_DRIVER /= frequency
    marginal_RIDER /= frequency
    
    if(_type == 0):
        return float(marginal_DRIVER)
    elif(_type == 1):
        return float(marginal_RIDER)
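To make the marginal concrete: for a given term, the function divides its frequency in one class by its combined frequency in both classes, and falls back to a frequency of 1 for a class whose TDM does not contain the term. Below is a minimal sketch with made-up frequencies; the helper `class_share` and the numbers are hypothetical, not taken from the dataset.

```python
def class_share(freq_driver=1, freq_rider=1):
    """Returns (driver share, rider share) of a term's combined frequency.

    The defaults of 1 mirror the notebook's fallback for a term missing from a TDM.
    """
    total = freq_driver + freq_rider
    return freq_driver / total, freq_rider / total

# Hypothetical term seen 30 times in driver posts and 10 times in rider posts
print(class_share(30, 10))        # (0.75, 0.25)

# Hypothetical term present only in the rider TDM, with frequency 9
print(class_share(freq_rider=9))  # (0.1, 0.9)
```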

In [82]:
def one_naive_bayes(pst, rider_T_D_M, driver_T_D_M):
    """
    Predicts if a Post is a Ride request (RIDER --> 1) or a Ride offer (DRIVER --> 0).
    
    Returns a tuple (DRIVER_prob, RIDER_prob, label):
    the posterior scores of Driver and Rider, averaged over the n-gram sizes,
    followed by the label (1 if Rider; 0 if Driver).
    The label is 1 when DRIVER_prob > RIDER_prob, and 0 otherwise.
    
    Parameters:
    1. pst : Post to calculate the posterior for; Type : str (tokenized inside the function).
    2. rider_T_D_M : The Rider TDMs, one per n-gram size.
    3. driver_T_D_M : The Driver TDMs, one per n-gram size.
    """
    
    tots_driver = 0
    tots_rider = 0
    size = 0
    i = 0
    for (driver_tdm, rider_tdm) in zip(driver_T_D_M, rider_T_D_M):
        i += 1
        fb_POST = list(ngrams(word_tokenize(pst), i))
        # Calculating the prior of Driver and Rider from the number of terms in each TDM
        size_driver_tdm = len(driver_tdm)
        size_rider_tdm = len(rider_tdm)
        total_size = size_driver_tdm + size_rider_tdm
        prior_DRIVER = float(size_driver_tdm / total_size)
        prior_RIDER = float(size_rider_tdm / total_size)
    #-----------------------------------------------------------
        likelihood_DRIVER = 1
        likelihood_RIDER = 1

        marginal_DRIVER = 1
        marginal_RIDER = 1

        for word in fb_POST:
            word = " ".join(word)
            for (checker_DRIVER, checker_RIDER) in zip(driver_tdm, rider_tdm):
                if(checker_DRIVER[0] == word):
                    likelihood_DRIVER *= float(checker_DRIVER[-1])
                    marginal_DRIVER *= calculate_marginal(word, 0, driver_tdm, rider_tdm)
                    
                if(checker_RIDER[0] == word):
                    likelihood_RIDER *= float(checker_RIDER[-1])
                    marginal_RIDER *= calculate_marginal(word, 1, driver_tdm, rider_tdm)

        posterior_DRIVER = calculate_posterior(likelihood_DRIVER, prior_DRIVER, marginal_DRIVER)
        posterior_RIDER = calculate_posterior(likelihood_RIDER, prior_RIDER, marginal_RIDER)
        tots_driver += abs(posterior_DRIVER)
        tots_rider += abs(posterior_RIDER)
        size += 1
        
    DRIVER_prob = float(tots_driver / size)
    RIDER_prob = float(tots_rider / size)
    
    if(DRIVER_prob > RIDER_prob):
        return DRIVER_prob, RIDER_prob, 1
    return DRIVER_prob, RIDER_prob, 0

## Step 4. Validation
#### (no k-fold cross-validation or hyperparameter tuning)

In [83]:
def polish_text(text):
    """
    Polishes text by making it lowercase and replacing punctuation with spaces.
    Returns the polished text.
    Parameters:
    1. text : text to polish
    """
    sentence = str(text).lower().strip()
    sentence = re.sub(r'[^a-zA-Z0-9\s]', " ", sentence)
    return sentence
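A quick illustration of the cleaning step on a sample string. For comparison, the sketch also shows the variant used inside `classify`, which deletes punctuation instead of replacing it with a space. The helper names are illustrative.

```python
import re

def clean_with_space(text):
    """Same steps as polish_text: lowercase, strip, punctuation -> space."""
    return re.sub(r'[^a-zA-Z0-9\s]', " ", str(text).lower().strip())

def clean_with_delete(text):
    """Same steps as the cleaning inside classify: punctuation is removed."""
    return re.sub(r'[^a-zA-Z0-9\s]', "", str(text).lower().strip())

sample = "Leaving Sunday around 10/11 am."
print(repr(clean_with_space(sample)))   # 'leaving sunday around 10 11 am '
print(repr(clean_with_delete(sample)))  # 'leaving sunday around 1011 am'
```

The two variants tokenize dates such as `10/11` differently, which may matter when training and validation text are cleaned by different functions.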

In [84]:
def validation(data, rider_tdm, driver_tdm):
    """
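    This function validates our GBN-NB Model.
    Returns the number of Posts predicted as Driver and as Rider.
    Parameters:
    1. data : dataset of Posts to classify. Type = np.array()
    2. rider_tdm : The Rider TDMs, passed on to one_naive_bayes()
    3. driver_tdm : The Driver TDMs, passed on to one_naive_bayes()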
    """
    DRIVER = 0
    RIDER = 0
    for i in range(len(data)):
        post = polish_text(data[i])
        _, _, result = one_naive_bayes(post, rider_tdm, driver_tdm)
        if(result == 1):
            RIDER += 1
        else:
            DRIVER += 1
            
    return DRIVER, RIDER

In [85]:
### Let the validation begin...

In [90]:
#returns the number of trained Driver and Rider Posts as calculated from the GBN-NB Model
driver_post_trained, rider_post_trained = validation(X_train, rider_TDM, driver_TDM)

In [91]:
print(driver_post_trained, rider_post_trained, "<-- Model Generated ||| Actual -->", len(driver_train), len(rider_train))

127 55 <-- Model Generated ||| Actual --> 129 53


In [103]:
# Ratio of predicted to actual Driver post counts (a count comparison, not a per-post precision)
precision_score_Train = (driver_post_trained/len(driver_train))*100
print("Training Precision score : {0:.3g}%".format(precision_score_Train))

Training Precision score : 98.4%
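Comparing class counts does not check whether individual posts were labelled correctly, since errors in opposite directions cancel out. A per-post comparison could look like the sketch below. The label lists are toy data and the function name `per_post_metrics` is an assumption, not part of the notebook.

```python
def per_post_metrics(y_true, y_pred, positive=1):
    """Returns (accuracy, precision, recall) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Toy labels: both lists contain two riders (1) and two drivers (0),
# so the class counts match even though two posts are misclassified.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]
print(per_post_metrics(y_true, y_pred))  # (0.5, 0.5, 0.5)
```

In the notebook, `y_true` would correspond to `y_train` or `y_test`, and `y_pred` to the third value returned by `one_naive_bayes` for each post.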


In [99]:
### Wow!!! On the training set, the Gaussian Binomial Naive Bayes Model's predicted Driver count is 98.4% of the actual Driver count.
## Let's now see how our model behaves on the test set.

In [101]:
#returns the number of test Driver and Rider Posts
driver_post_test, rider_post_test = validation(X_test, rider_TDM, driver_TDM)
validated_driver_test, validated_rider_test = classify(X_test, y_test)

In [102]:
print(driver_post_test, rider_post_test, "<-- Model Prediction vs. Actual -->", len(validated_driver_test), len(validated_rider_test))

64 26 <-- Model Prediction vs. Actual --> 63 27


In [134]:
# Ratio of predicted to actual Rider post counts (a count comparison, not a per-post precision)
precision_score_Test = (rider_post_test/len(validated_rider_test))*100
print("Testing Set Precision score : {0:.3g}%".format(precision_score_Test))

Testing Set Precision score : 96.3%


## Summary: 
### Training Set Precision score : 98.4%
### Testing Set Precision score : 96.3%

In [105]:
### That's sexy good. 96.3% Test validation score in detecting a driver or rider post.
### Implementing this into HitchHiqe is literally going to take the product through the roof. 
### Now, the challenging part: Unleashing the engineering onto the world.

#### For further testing purposes, let me run the classifier on totally different text from the NOVA carpool FB group.

In [106]:
vt_nova_carpool_group_sample_post = "Leaving Leesburg area Sunday around 10/11 am. Message me if you need a ride back to school."
driver_posterior, rider_posterior, result = one_naive_bayes(vt_nova_carpool_group_sample_post, rider_TDM, driver_TDM)

In [115]:
rider_posterior / driver_posterior, result # 0 --> Driver ; 1 --> Rider

(23.923161879676172, 0)

#### Wow. A totally random post from a completely new source is classified as driver-related, with the two scores differing by a factor of ~24.

In [117]:
#Let's try another, a bit more confusing post

In [118]:
confusing_post = "Anyone driving to DC on Thursday 11/28? Need a ride."
driver_posterior2, rider_posterior2, result2 = one_naive_bayes(confusing_post, rider_TDM, driver_TDM)

In [123]:
driver_posterior2, rider_posterior2, result2

(0.2871249275197578, 0.3203524308752435, 0)
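The label comes from the final comparison in `one_naive_bayes`, which returns 1 (Rider) only when the averaged driver score is larger than the rider score. The sketch below replays that comparison on the scores printed above; the helper name `decide` is illustrative.

```python
def decide(driver_score, rider_score):
    """Mirrors the last step of one_naive_bayes: 1 --> Rider, 0 --> Driver."""
    return 1 if driver_score > rider_score else 0

# Scores recorded for the confusing post
print(decide(0.2871249275197578, 0.3203524308752435))  # 0
```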

In [124]:
### Okay, this one is indeed a bit confusing, which is why the algorithm misclassified it (it is a rider post, but the result is 0 = Driver).
### Still, the two scores are strikingly close.

In [125]:
third_post = "Need a ride back to tech from Springfield area on Sunday 12/1 will pay gas"
dr3, rd3, rs3 = one_naive_bayes(third_post, rider_TDM, driver_TDM)

In [130]:
dr3 / rd3, rs3 # 0 --> Driver ; 1 --> Rider
## Nice!!! (Nearly 10 times as likely to be a post from a rider than a driver.)

(9.069096214230044, 1)

### If the following post works, I'll be celebrating by leaving my dorm to stretch for a minute after 4 straight days!

In [131]:
last_test = "Yoo so this is my third time posting. Is no one leaving on Saturday? Looking for a ride! Will pay for gas."

In [132]:
dr4, rd4, rs4 = one_naive_bayes(last_test, rider_TDM, driver_TDM)
dr4 / rd4, rs4 # 0 --> Driver ; 1 --> Rider
## Nice!!! (Nearly 2 times as likely to be a post from a rider than a driver.)

(1.9777276010766784, 1)

### Fuck. Guess I have to leave now... Now, I somehow need to figure out how to implement this on HitchHiqe.