
<img src="https://www2.gmu.edu/sites/all/modules/features/feature_core_theme/templates/resources/images/mason-logo.png" alt="GMU Logo" title="George Mason University" />
<hr style ="color:#99CC99">
    
<h2 style="font-family:Helvetica; color:#006633;">Programming Assignment # 1</h2>
<h3 style="font-family:Helvetica; color:#006633;"> Sentiment Classification with Naïve Bayes and Logistic Regression</h3>

<p style="font-family:Helvetica; font-size:1.5em;"> 
Authors: Team 1 - Shinoj Jerald Bounaventure Kumar Jeronmary, Yitong Li, Anh Nguyen and Nina Nnamani<br>
Course Professor: Dr. Lindi Liao <br>
Course Name: Natural Language Processing <br>
Course Number and Section: AIT 726-001<br>
University Name: George Mason University<br>
Date: October 4, 2020    <br>
</p>    
<hr style ="color:#99CC99" width="75%">
<p style="font-family:Helvetica; font-size:1.4em;"> 
Description: This program implements a "from scratch" Naïve Bayes and a Logistic Regression classifier for sentiment classification of airline review tweets. 
 <br>
    
<p style="font-family:Helvetica; font-size:1.4em;"> 
Instructions: This program is presented as a Jupyter notebook and requires all packages in the "Import Libraries" section to be installed before the code is run. All training and testing tweet texts, the notebook, and an HTML export of the outputs are included in the folder, so the code can be run directly from it. Because the coding for each classifier was split within the team, the functions created for each step of the classifiers' development are documented with comments in the notebook. Naïve Bayes comes first, followed by Logistic Regression. The following overarching features are implemented:<br>

<p style="font-family:Helvetica; font-size:1.2em;"> 
1) Creates vocabulary<br>
2) Extracts features/ Bag of words representations <br>
3) Trains the classifiers<br>
4) Evaluates the test sets<br>
5) Reports the accuracy score and confusion matrix of each classifier's performance<br>
<br> 
    
        
<p style="font-family:Helvetica; font-size:1.4em;"> 
References: The following listed sources provided some insight on code techniques that were partially adapted.
    
https://github.com/ChanchalKumarMaji/Natural-Language-Processing-Specialization-deeplearning.ai/tree/master/Natural%20Language%20Processing%20with%20Classification%20and%20Vector%20Spaces/Week%201 
    
https://streamsql.io/blog/sentiment-analysis
    
https://stackoverflow.com/questions/4145451/using-a-regular-expression-to-replace-upper-case-repeated-letters-in-python-with 
    
https://www.nltk.org/_modules/nltk/tokenize/casual.html
 <br>

<hr style ="color:#99CC99">


## Import Libraries

In [1]:
#Import packages
import os
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
import collections
import re
import numpy as np
import time
import xml
import pandas as pd
import math
import warnings
warnings.filterwarnings("ignore")

#captures runtime
start_time = time.time()

## Naïve Bayes

### Removes mark-up, normalizes words with a capitalized first letter, and removes emoticons/emojis

In [2]:
def normalize_case(s):    
    '''
    Parameter: word to be normalized
    Lower-cases the word unless it is written entirely in capitals,
    so "Great" becomes "great" while "GREAT" is kept as is.
    '''
    if(not s.isupper()):
        return s.lower()
    else:
        return s
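
As a quick illustration of the rule above, the sketch below copies `normalize_case` and applies it to a few sample tokens. The tokens are made up for illustration and are not taken from the data set. Only tokens written entirely in capitals keep their case, presumably so that emphasis such as "GREAT" stays distinct from "great".

```python
def normalize_case(s):
    # Lower-case the token unless every cased character is upper case
    if not s.isupper():
        return s.lower()
    return s

# Hypothetical sample tokens
samples = ["Great", "GREAT", "flight", "iPhone"]
normalized = [normalize_case(t) for t in samples]
print(normalized)  # ['great', 'GREAT', 'flight', 'iphone']
```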

In [3]:
def remove_emoji(string):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251" 
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642" 
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # dingbats
        u"\u3030""]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string) # no emoji

def remove_tags(s):
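    '''
    Parameter: token to be cleaned
    Strips every character that is neither a word character nor whitespace.
    This drops punctuation, including the angle brackets of HTML tags.
    '''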
    s = re.sub(r'[^\w\s]', '', s)
    return s
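
The pattern `[^\w\s]` used in `remove_tags` deletes punctuation characters one by one, so on a raw string it removes the angle brackets of a tag but leaves the tag name behind. The sketch below contrasts it with a tag-matching pattern on a made-up string; the helper names are hypothetical.

```python
import re

def strip_punctuation(text):
    # Same pattern as remove_tags above: drop anything that is not a word character or whitespace
    return re.sub(r'[^\w\s]', '', text)

def strip_html_tags(text):
    # Drop whole tags such as <br> or <p class="x">
    return re.sub(r'<[^>]+>', '', text)

raw = "Loved it!<br>Thanks"    # made-up example string
print(strip_punctuation(raw))  # Loved itbrThanks
print(strip_html_tags(raw))    # Loved it!Thanks
```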

### Creates the vocabulary by tokenizing the text, with and without stemming. Extracts features as frequency counts and as binary (present or absent) counts per document.

In [4]:
def count_words(rootdir):
    '''
    Parameter: root directory
    
    The function collects the training files, tokenizes the text into words, builds the unstemmed and
    stemmed vocabularies, and counts the occurrences of each word in the class (positive or negative).
    '''    
    #Porter stemmer for stemming the vocabulary
    ps = PorterStemmer()
    vocab=[]
    total_words=0
    stemmed_vocab=[]
    prior=0
    binary_count={}
    stemmed_binary_count={}
    #For each directory in the path
    for subdir, dirs, files in os.walk(rootdir):
        #For each file in the directory
        for file in files:
            f=open(os.path.join(subdir,file),'r',encoding='utf-8') #path of the file inside the walked directory
            lines = f.readlines()
            S = set()
            #For each line in the file
            for line in lines:
                #Tokenize the words using word tokenize of nltk
                document=word_tokenize(line)
                #For each word in the document 
                for i in range(0,len(document)):
                    #Normalize case for the word, convert capitalized letter to lower case
                    document[i]=normalize_case(document[i])
                    #Remove HTML tags
                    document[i]=remove_tags(document[i])
                    # Remove emoji
                    document[i]=remove_emoji(document[i])
                    if(document[i]!=''):
                        total_words+=1
                        #Stem the words and append to stemmed list
                        stemmed_vocab.append(ps.stem(document[i]))
                        #S is shared by stems and raw tokens, so a token that equals its own stem
                        #is added to S by the stemmed branch and is then skipped by the binary count below
                        if(not document[i] in S):
                            #Calculate stemmed binary count
                            stemmed_binary_count[ps.stem(document[i])]=stemmed_binary_count.get(ps.stem(document[i]),0)+1
                            S.add(ps.stem(document[i]))
                        vocab.append(document[i])
                        if(not document[i] in S):
                            #Calculate binary count
                            binary_count[document[i]]=binary_count.get(document[i],0)+1
                            S.add(document[i])
            f.close()
            prior+=1
    #Count frequency of words from respective vocabs
    
    count=dict(collections.Counter(vocab))
    stemmed_count=dict(collections.Counter(stemmed_vocab))
    return [count,stemmed_count,prior,binary_count,stemmed_binary_count,total_words]
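
A binary (presence) count is the number of documents a term appears in. One way to compute it is to de-duplicate the tokens of each document before counting, and to keep separate counters for raw tokens and for stems so that the two never share a "seen" set. The sketch below assumes the documents are already tokenized and cleaned. `document_frequencies` is a hypothetical helper, not part of the notebook, and lower-casing stands in for the Porter stemmer to keep the example free of dependencies.

```python
from collections import Counter

def document_frequencies(tokenized_docs, stem=None):
    """Count, for every term, the number of documents that contain it.

    tokenized_docs: list of token lists, one per document.
    stem: optional function applied to each token before counting.
    """
    counts = Counter()
    for tokens in tokenized_docs:
        if stem is not None:
            tokens = [stem(t) for t in tokens]
        # set() keeps one copy of each term, so a document adds at most 1 per term
        counts.update(set(tokens))
    return counts

# Made-up documents for illustration
docs = [["great", "flight", "great"], ["late", "flight"], ["GREAT", "crew"]]
binary_count = document_frequencies(docs)
stemmed_binary_count = document_frequencies(docs, stem=str.lower)  # stand-in for a stemmer
```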

In [6]:
# Return the unique words from two lists 
def Union(lst1, lst2): 
    final_list = list(set(lst1) | set(lst2)) 
    return final_list 

def nb():
    '''
    The function collects the counts for the unstemmed and stemmed vocabularies and assigns them to global variables.
    '''    
    global pos_count
    global neg_count
    global unique_pos_words
    global unique_neg_words
    global unique_words
    global stemmed_pos_count
    global stemmed_neg_count
    global positive_prior
    global negative_prior
    global pos_stemmed_binary_count
    global neg_stemmed_binary_count
    
    global pos_binary_count
    global neg_binary_count
    
    global total_pos_words
    global total_neg_words
    
    #Calculate parameters for positive documents
    rootdir= "train/positive/" 
    count=count_words(rootdir)
    pos_count=count[0]
    stemmed_pos_count=count[1]
    unique_pos_words=len(pos_count)
    positive_prior=count[2]
    pos_binary_count=count[3]
    pos_stemmed_binary_count=count[4]
    total_pos_words=count[5]
    
    #Calculate parameters for negative documents
    rootdir= "train/negative/"
    count=count_words(rootdir)
    neg_count=count[0]
    stemmed_neg_count=count[1]
    unique_neg_words=len(neg_count)
    negative_prior=count[2]
    neg_binary_count=count[3]
    neg_stemmed_binary_count=count[4]
    total_neg_words=count[5]
    
    # Calculate the unique number of words from the training set
    unique_words=len(Union(list(neg_count.keys()),list(pos_count.keys())))

### Training and Evaluation

In [7]:
def get_test(rootdir):
    '''
    Parameter: root directory of the test files

    The function collects the test data and returns, for every line, the cleaned tokens
    (bag of words) and the corresponding stemmed tokens.
    '''    
    ps = PorterStemmer()
    document=[]
    stemmed_document=[]
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            #Close each file as soon as it has been read
            with open(os.path.join(subdir,file),'r',encoding='utf-8') as f:
                lines = f.readlines()
            for line in lines:
                tokenized=word_tokenize(line)
                final=[]
                stemmed_final=[]
                for i in range(0,len(tokenized)):
                    tokenized[i]=normalize_case(tokenized[i])
                    tokenized[i]=remove_tags(tokenized[i])
                    tokenized[i]=remove_emoji(tokenized[i])
                    final.append(tokenized[i])
                    stemmed_final.append(ps.stem(tokenized[i]))
                document.append(final)
                stemmed_document.append(stemmed_final)
    return [document,stemmed_document]


In [8]:
nb()
def classify(test_files,stemmed_test_files):
    '''
    The function scores each test document with the class priors and the smoothed word likelihoods
    estimated from the training counts, then assigns the class with the higher score.
    '''
    doc_classification=[]
    binary_doc_classification=[]
    #Classify test documents using non-stemmed vocabulary
    for doc in test_files:
        pos_word_likelyhood=0
        neg_word_likelyhood=0
        pos_binary_likelyhood=1
        neg_binary_likelyhood=1

        for word in doc:
            #Add the add-one smoothed log-likelihood of the word under the positive and negative classes
            pos_word_likelyhood=pos_word_likelyhood+np.log((pos_count.get(word,0)+1)/(total_pos_words+unique_words))
            neg_word_likelyhood=neg_word_likelyhood+np.log((neg_count.get(word,0)+1)/(total_neg_words+unique_words))
            
            # Binary
            #Multiply in the likelihood only for words seen in the class during training; unseen words are skipped
            if(word in pos_binary_count):
                pos_binary_likelyhood=pos_binary_likelyhood*(((pos_binary_count.get(word,0)+1)/(len(pos_binary_count)+len(pos_binary_count)+len(neg_binary_count))))
            if(word in neg_binary_count):
                neg_binary_likelyhood=neg_binary_likelyhood*(((neg_binary_count.get(word,0)+1)/(len(neg_binary_count)+len(neg_binary_count)+len(pos_binary_count))))
        
        #Calculate posterior
        pos_class=(pos_word_likelyhood)+np.log(positive_prior/(positive_prior+negative_prior))
        neg_class=(neg_word_likelyhood)+np.log(negative_prior/(positive_prior+negative_prior))
        
        #Classify documents based on calculated values
        if(pos_class > neg_class):
            doc_classification.append('pos')
        else:
            #Ties go to the negative class, as in the stemmed branch below
            doc_classification.append('neg')
        if(pos_binary_likelyhood>neg_binary_likelyhood):
            binary_doc_classification.append('pos')
        else:
            binary_doc_classification.append('neg')
                       
    #Classify test documents using stemmed vocabulary
    stemmed_doc_classification=[]
    binary_doc_classification_stemmed=[]
    for doc in stemmed_test_files:
        pos_word_likelyhood=0
        neg_word_likelyhood=0
        pos_stemmed_binary_likelyhood=1
        neg_stemmed_binary_likelyhood=1
        
        for word in doc:
                
            pos_word_likelyhood=pos_word_likelyhood+np.log((stemmed_pos_count.get(word,0)+1)/(total_pos_words+unique_words))
            neg_word_likelyhood=neg_word_likelyhood+np.log((stemmed_neg_count.get(word,0)+1)/(total_neg_words+unique_words))
            
            if(word in pos_stemmed_binary_count):
                pos_stemmed_binary_likelyhood=pos_stemmed_binary_likelyhood*(((pos_stemmed_binary_count.get(word,0)+1)/(len(pos_stemmed_binary_count)+len(pos_stemmed_binary_count)+len(neg_stemmed_binary_count))))
            if(word in neg_stemmed_binary_count):
                neg_stemmed_binary_likelyhood=neg_stemmed_binary_likelyhood*(((neg_stemmed_binary_count.get(word,0)+1)/(len(neg_stemmed_binary_count)+len(neg_stemmed_binary_count)+len(pos_stemmed_binary_count))))
        
        pos_class=(pos_word_likelyhood)+np.log(positive_prior/(positive_prior+negative_prior))
        neg_class=(neg_word_likelyhood)+np.log(negative_prior/(positive_prior+negative_prior))
        if(pos_class > neg_class):
            stemmed_doc_classification.append('pos')
        elif(pos_class <= neg_class):
            stemmed_doc_classification.append('neg')
        if(pos_stemmed_binary_likelyhood>neg_stemmed_binary_likelyhood):
            binary_doc_classification_stemmed.append('pos')
        else:
            binary_doc_classification_stemmed.append('neg')
    return [doc_classification,binary_doc_classification,stemmed_doc_classification,binary_doc_classification_stemmed]
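
For comparison, a common textbook formulation of binary Naïve Bayes clips each word to one occurrence per document and then scores documents exactly like the frequency-count model: in log space, with add-one smoothing over a shared vocabulary, and with the class prior added. The sketch below is a minimal version under those assumptions. The function names and the toy data are hypothetical, and it is not the notebook's implementation.

```python
import math
from collections import Counter

def train_binary_nb(docs_by_class):
    """docs_by_class maps a class label to a list of token lists."""
    vocab = {t for docs in docs_by_class.values() for d in docs for t in d}
    n_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for label, docs in docs_by_class.items():
        counts = Counter()
        for d in docs:
            counts.update(set(d))          # each word counts once per document
        total = sum(counts.values())
        model[label] = {
            "log_prior": math.log(len(docs) / n_docs),
            "counts": counts,
            "denom": total + len(vocab),   # add-one smoothing denominator
        }
    return model, vocab

def classify_binary_nb(model, vocab, tokens):
    scores = {}
    for label, p in model.items():
        score = p["log_prior"]
        for t in set(tokens):              # clip test tokens as well
            if t in vocab:                 # words never seen in training are ignored
                score += math.log((p["counts"][t] + 1) / p["denom"])
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up training documents
train = {
    "pos": [["great", "crew"], ["great", "flight"]],
    "neg": [["late", "flight"], ["lost", "bag", "late"]],
}
model, vocab = train_binary_nb(train)
print(classify_binary_nb(model, vocab, ["great", "great", "crew"]))  # pos
print(classify_binary_nb(model, vocab, ["late", "bag"]))             # neg
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on longer documents, and smoothing every vocabulary word keeps a word that is missing from one class from being silently dropped from that class's score.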


In [9]:
# Use function "get_test" to get the bag of words and stemmed tokens for the POSITIVE test set
# the paths are relative to the folder that contains the notebook
test_files_pos,stemmed_test_files_pos=get_test("test/positive/")

# Use function "get_test" to get the bag of words and stemmed tokens for the NEGATIVE test set
test_files_neg,stemmed_test_files_neg=get_test("test/negative/")

# Use function "classify" once per test set; it returns the four combinations in this order:
# no stemming + frequency count, no stemming + binary, stemming + frequency count, stemming + binary
# gold standard POSITIVE
(no_stemming_frequency_count_pos,no_stemming_binary_pos,
 stemming_frequency_count_pos,stemming_binary_pos)=classify(test_files_pos,stemmed_test_files_pos)

# gold standard NEGATIVE
(no_stemming_frequency_count_neg,no_stemming_binary_neg,
 stemming_frequency_count_neg,stemming_binary_neg)=classify(test_files_neg,stemmed_test_files_neg)


### Accuracy Score and Confusion Matrix of Naïve Bayes Classifier

In [21]:
def metrics(gold_pos,gold_neg): # predicted labels for the gold standard positive and negative documents, respectively
    '''
    The function calculates and prints the accuracy and the confusion matrix.
    Rows are the predicted class (pos, neg); columns are the gold standard class (pos, neg).
    '''
    arr=np.zeros((2,2), dtype=float)
    for i in range(0,len(gold_pos)):
        if (gold_pos[i]=='pos'):
            arr[0][0]=arr[0][0]+1
        if (gold_pos[i]=='neg'):
            arr[1][0]=arr[1][0]+1            
    for i in range(0,len(gold_neg)):
        if (gold_neg[i]=='pos'):
            arr[0][1]=arr[0][1]+1
        if (gold_neg[i]=='neg'):
            arr[1][1]=arr[1][1]+1            
            
    accuracy=(arr[0][0]+arr[1][1])/(arr[0][0]+arr[1][1]+arr[0][1]+arr[1][0])
    print("Accuracy ",accuracy)
    print("Confusion Matrix:- ")
    print(arr)
    
# Compute accuracy and confusion matrix for all four combinations
# no stemming + frequency count
metrics(no_stemming_frequency_count_pos,no_stemming_frequency_count_neg)
# no stemming + binary
metrics(no_stemming_binary_pos,no_stemming_binary_neg)
# stemming + frequency count
metrics(stemming_frequency_count_pos,stemming_frequency_count_neg)
# stemming + binary
metrics(stemming_binary_pos,stemming_binary_neg)

Accuracy  0.8162878787878788
Confusion Matrix:- 
[[1088.  669.]
 [ 107. 2360.]]
Accuracy  0.4786931818181818
Confusion Matrix:- 
[[ 451. 1458.]
 [ 744. 1571.]]
Accuracy  0.842092803030303
Confusion Matrix:- 
[[1073.  545.]
 [ 122. 2484.]]
Accuracy  0.6633522727272727
Confusion Matrix:- 
[[ 233.  460.]
 [ 962. 2569.]]
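
The `metrics` function puts the predicted class in the rows and the gold standard class in the columns, so per-class precision and recall can be read off each matrix directly. The sketch below recomputes accuracy, precision, and recall for the positive class from the first matrix (no stemming + frequency count); the variable names are illustrative.

```python
import numpy as np

# First confusion matrix reported above: rows = predicted (pos, neg), columns = gold (pos, neg)
cm = np.array([[1088.0, 669.0],
               [107.0, 2360.0]])

tp, fp = cm[0, 0], cm[0, 1]
fn, tn = cm[1, 0], cm[1, 1]

accuracy = (tp + tn) / cm.sum()
precision_pos = tp / (tp + fp)   # share of predicted positives that are truly positive
recall_pos = tp / (tp + fn)      # share of gold positives that were found

print(f"accuracy={accuracy:.4f} precision={precision_pos:.4f} recall={recall_pos:.4f}")
```

For this run the positive class has high recall but noticeably lower precision, which reflects the 669 negative tweets that were predicted as positive.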


## Logistic Regression

### Removes mark-up, normalizes words with a capitalized first letter, and removes emoticons/emojis

In [10]:
#reading the text data
pos_train = os.listdir("train/positive/")
neg_train = os.listdir("train/negative/")
pos_test = os.listdir("test/positive/")
neg_test = os.listdir("test/negative/")

In [11]:
#Removing Emoticons
def remove_emoji(string):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251" 
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642" 
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # dingbats
        u"\u3030""]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string) # no emoji

In [12]:
#Removing HTML tags
def remove_tags(text):
    s = re.sub(r'<[^>]+>', '', text)
    return s

### Creates the vocabulary by tokenizing the text, with and without stemming. Extracts features for Bag of Words.

In [13]:
# Reading and cleaning the tweets (the first 500 files of each folder)
def create_list(file_names,split,label):
    return_list=[]
    for name in file_names[:500]:
        file1 = open(split+"/"+label+"/" + name)
        try:
            text = file1.read()
            text = remove_tags(text)
            text = remove_emoji(text)
            return_list.append(text)
        except UnicodeDecodeError:
            #Skip files that cannot be decoded with the default encoding
            pass
        file1.close()
    return return_list

In [14]:
#Creating the texts and labels from the positive and negative tweets
def createDataFrame(pos_train_list,neg_train_list):
    #Negative tweets are labelled 0
    df1 = pd.DataFrame(neg_train_list)
    target2 = [0] * len(neg_train_list)
    df1["target"] = target2
    df1 = df1.rename(columns={0: "text"})
    #Positive tweets are labelled 1
    df = pd.DataFrame(pos_train_list)
    target1 = [1] * len(pos_train_list)
    df["target"] = target1
    df = df.rename(columns={0: "text"})
    
    #Positive tweets come first, then negative; shuffling is left disabled
    data = pd.concat([df, df1])
    #data = data.sample(frac=1)
    x = list(data["text"])
    y=np.array(data["target"])

    return x,y

In [15]:
# Tokenizing the text
def clean(text):

    vocab=[]
    for j in word_tokenize(text):
        if (j != ''):
            if not j.islower() and not j.isupper():
                j = j.lower()
            vocab.append(j)

    return vocab

In [16]:
# Tokenizing the text and stemming it
def cleaningStemmed(text):

    ps = PorterStemmer()
    vocabulary_stemmed=[]
    for j in word_tokenize(text):
        if (j != ''):
            if not j.islower() and not j.isupper():
                j = j.lower()
            vocabulary_stemmed.append(ps.stem(j))

    return vocabulary_stemmed
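
Once every document is a list of tokens, the vocabulary can be built as the sorted union of all tokens and mapped to column indices. The sketch below shows one way to do this without modifying the input lists, followed by a binary bag-of-words matrix. `build_vocab_index` and `binary_bag_of_words` are hypothetical helpers, and the documents are made up.

```python
import numpy as np

def build_vocab_index(tokenized_docs):
    # Union of all tokens; the input lists are left untouched
    vocab = sorted(set().union(*tokenized_docs))
    return {token: column for column, token in enumerate(vocab)}

def binary_bag_of_words(tokenized_docs, vocab_index):
    matrix = np.zeros((len(tokenized_docs), len(vocab_index)), dtype=np.int64)
    for row, tokens in enumerate(tokenized_docs):
        for token in tokens:
            column = vocab_index.get(token)
            if column is not None:      # tokens outside the vocabulary are ignored
                matrix[row, column] = 1
    return matrix

# Made-up tokenized documents
docs = [["great", "flight"], ["late", "flight", "late"]]
vocab_index = build_vocab_index(docs)
features = binary_bag_of_words(docs, vocab_index)
print(vocab_index)  # {'flight': 0, 'great': 1, 'late': 2}
print(features)
```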

### Training Functions: Initializes the weights, uses cross-entropy as the loss function and batch gradient descent as the optimization algorithm, and applies a 0.5 threshold to the sigmoid output. Predicts a label for each sample, computes the cross-entropy and the gradient of the predictions against the gold standard labels, and updates the weights with the gradient scaled by the learning rate for a fixed number of iterations

In [17]:
#Defining sigmoid function
def sigmoid(x):
  return 1 / (1 + np.exp(-x))


# Batch gradient descent on the cross-entropy cost for the given number of iterations
def gradient_descent(X, y, params, learning_rate, iterations):

    m = len(y)
    cost_history = np.zeros((iterations,1))

    for i in range(iterations):
        params = params - (learning_rate/m) * (X.T @ (sigmoid(X @ params) - y))
        cost_history[i] = compute_cost(X, y, params)

    return (cost_history, params)


# The objective cost function
def compute_cost(X, y, theta):

    m = len(y)
    h = sigmoid(X @ theta)
    cost = (1 / m) * np.sum(-y.dot(np.log(h)) - (1 - y).dot(np.log(1 - h)))

    return cost

#Gradient descent that records the L2-regularised cost at every iteration
def gradient_descent_reg(X, y, params, learning_rate, iterations, lmbda):

    m = len(y)
    cost_history = np.zeros((iterations,1))

    for i in range(iterations):
        #The update uses the same gradient as gradient_descent; lmbda only enters the recorded cost
        params = params - (learning_rate/m) * (X.T @ (sigmoid(X @ params) - y))
        cost_history[i] = compute_cost_reg(X, y, params, lmbda)

    return (cost_history, params)

# The objective regularised cost function
def compute_cost_reg(X, y, theta, lmbda):

    m = len(y)
    h = sigmoid(X @ theta)
    temp = theta
    cost = (1 / m) * np.sum(-y.dot(np.log(h)) - (1 - y).dot(np.log(1 - h))) + (lmbda / (2 * m)) * np.sum(np.square(temp))

    return cost

# Final prediction function
def predict(X, params):
    return np.round(sigmoid(X @ params))
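
With an L2 penalty of `(lmbda / (2 * m)) * sum(theta**2)` in the cost, as in `compute_cost_reg`, the matching gradient gains a `(lmbda / m) * theta` term. The sketch below shows one update step that includes it. It penalizes every weight, intercept included, to mirror `compute_cost_reg`; many formulations leave the intercept unpenalized. `gradient_step_l2` and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def gradient_step_l2(X, y, theta, learning_rate, lmbda):
    """One batch gradient-descent step on the L2-regularised cross-entropy cost."""
    m = len(y)
    data_grad = (X.T @ (sigmoid(X @ theta) - y)) / m   # gradient of the cross-entropy term
    penalty_grad = (lmbda / m) * theta                 # gradient of (lmbda / (2m)) * sum(theta**2)
    return theta - learning_rate * (data_grad + penalty_grad)

# Toy data: a column of ones for the intercept plus two features
X = np.array([[1.0, 2.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 3.0, 0.0],
              [1.0, 0.0, 2.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])

theta_plain = np.zeros(3)
theta_reg = np.zeros(3)
for _ in range(200):
    theta_plain = gradient_step_l2(X, y, theta_plain, 0.1, lmbda=0.0)
    theta_reg = gradient_step_l2(X, y, theta_reg, 0.1, lmbda=1.0)

# The penalized weights stay smaller in norm than the unpenalized ones
print(np.linalg.norm(theta_plain), np.linalg.norm(theta_reg))
```

With `lmbda=0.0` the step reduces to the plain update used in `gradient_descent`.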


### Vectorizing function: binary, count, and TF-IDF representations

In [18]:
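# Vectorizing based on the vectorType argument: 1 = TF-IDF, 2 = count, anything else = binary
def vectorizer(X,vectorArr,dict_vocab,vectorType,row,col):

    if (vectorType==1):
        #Term counts per document
        for i in range(0, len(X)):
            for j in X[i]:
                if j in dict_vocab:
                    vectorArr[i, dict_vocab[j]] += 1

        #Weight per cell: log10(number of documents / term count), truncated to an integer by the int64 array
        idf= np.zeros((row, col), dtype=np.int64)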
        for i in range(0,len(vectorArr)):
            for j in range(0,col):
                if vectorArr[i][j] > 0:
                    idf[i][j]= math.log10(row / float(vectorArr[i][j]))

                else:
                    idf[i][j]=0
        vectorArr=np.multiply(vectorArr, idf)

    elif (vectorType==2):
        for i in range(0, len(X)):
            for j in X[i]:
                if j in dict_vocab:
                    vectorArr[i, dict_vocab[j]] += 1

    else:
        for i in range(0, len(X)):
            for j in X[i]:
                if j in dict_vocab:
                    vectorArr[i,dict_vocab[j]]=1

    return vectorArr
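
In the standard definition, the inverse document frequency of a term is log(N / df), where N is the number of documents and df is the number of documents that contain the term, and the weights are kept as floating-point values. The sketch below applies that definition to a count matrix. It is an illustration under those assumptions rather than a drop-in replacement for `vectorizer`, and `tfidf_from_counts` is a hypothetical name.

```python
import numpy as np

def tfidf_from_counts(counts):
    """counts: (documents x terms) array of raw term counts."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    doc_freq = (counts > 0).sum(axis=0)            # documents containing each term
    idf = np.zeros(counts.shape[1])
    seen = doc_freq > 0
    idf[seen] = np.log10(n_docs / doc_freq[seen])  # a term found in every document gets idf 0
    return counts * idf                            # broadcast idf across the rows

# Made-up count matrix: 4 documents, 3 terms
counts = np.array([[2, 1, 0],
                   [0, 1, 0],
                   [1, 1, 3],
                   [0, 1, 0]])
weights = tfidf_from_counts(counts)
print(weights)
```

When test documents are vectorized, the idf values would normally be the ones computed on the training set, so that both sets share the same feature scale.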

### Evaluation of models: F1 score, accuracy and confusion matrix


In [19]:
def accuracy_score(actual, predicted):
  correct = 0
  for i in range(len(actual)):
    #compare the element at index i between actual result vs prediction
    if actual[i] == predicted[i]:
      correct += 1
  return correct / float(len(actual)) * 100.0

def confusion_matrix(actual, predicted):
  #inputs are binary arrays of actual results in test data and prediction from the model
  actual_results = pd.Series(actual, name='Actual')
  predictions = pd.Series(predicted, name='Predicted')
  #cross-tabulation function in pandas: rows are actual labels, columns are predicted labels
  df_confusion = pd.crosstab(actual_results,predictions)
  return df_confusion

def f1_score(matrix):
  #matrix is the crosstab from confusion_matrix(), indexed as matrix[predicted][actual]:
  #[1][1] for TP, [1][0] for FP, [0][0] for TN, [0][1] for FN
  precision=matrix[1][1]/(matrix[1][1]+matrix[1][0])
  recall=matrix[1][1]/(matrix[1][1]+matrix[0][1])
  f1=2/((1/precision)+(1/recall))
  #accuracy is same as the result in accuracy_score
  #accuracy=100* (matrix[0][0]+matrix[1][1])/(matrix[1][1]+matrix[1][0]+matrix[0][1]+matrix[0][0])
  #prints the scores for the positive class and returns None
  print("Precision: ","{:.2f}".format(precision),"\nRecall: ","{:.2f}".format(recall), "\nF1 score (micro):","{:.2f}".format(f1))

### Main function to combine functions for Logistic Regression

In [20]:
def main(stemmed,vectorType,regularized):

    #random.seed(123) #set seed
    # loading the train set
    pos_train_list = create_list(pos_train,"train", "positive")
    neg_train_list = create_list(neg_train,"train", "negative")
    X,y=createDataFrame(pos_train_list,neg_train_list)

    #Checking Stemming/ Not Stemming data
    if stemmed == 1:
        for i in range(0, len(X)):
            X[i] = cleaningStemmed(X[i])

    else:
        for i in range(0, len(X)):
            X[i] = clean(X[i])

    # Creating vocabulary list (vocab starts as the same list object as X[0], so extend also lengthens X[0])
    vocab = X[0]
    for i in range(1, len(X)):
        vocab.extend(X[i])
    vocab = sorted(set(vocab))

    row = len(X)
    col = len(vocab)

    dict_vocab = {}
    for i, j in enumerate(vocab):
        dict_vocab[j] = i
    trainVector = np.zeros((row, col), dtype=np.int64)

    # Vectorizing the trainset
    trainVector=vectorizer(X,trainVector,dict_vocab,vectorType,row,col)
    m, n = trainVector.shape
    trainVector = np.concatenate([np.ones((m, 1)), trainVector], axis=1)
    #trainVector = preprocessing.scale(trainVector)

    initial_theta = np.zeros(n + 1)
    iterations = 1000
    learning_rate = 0.01

    # Logistic function
    if regularized==1:
        lmbda = 0.1
        (cost_history, params_optimal) = gradient_descent_reg(trainVector, y, initial_theta, learning_rate, iterations,lmbda)
    else :
        (cost_history, params_optimal) = gradient_descent(trainVector, y, initial_theta, learning_rate, iterations)

    #Loading the test set
    pos_test_list = create_list(pos_test, "test", "positive")
    neg_test_list = create_list(neg_test, "test", "negative")
    X_test, y_test = createDataFrame(pos_test_list, neg_test_list)

    # Stemming data
    if (stemmed == 1):
        for i in range(0, len(X_test)):
            X_test[i] = cleaningStemmed(X_test[i])

    else:
        for i in range(0, len(X_test)):
            X_test[i] = clean(X_test[i])

    row = len(X_test)
    col = len(vocab)

    testVector = np.zeros((row, col), dtype=np.int64)
    # Vectorizing the test data
    testVector = vectorizer(X_test,testVector,dict_vocab,vectorType,row,col)
    m, n = testVector.shape

    testVector=np.concatenate([np.ones((m, 1)), testVector], axis=1)
    #testVector = preprocessing.scale(testVector)

    # Final Prediction
    preds = predict(testVector , params_optimal)

    # Final values based on threshold value 0.5
    for i in range(0, len(preds)):
        if (preds[i] <= 0.5):
            preds[i] = 0
        else:
            preds[i] = 1

    if(stemmed==1):
        dataClean="Stemmed"
    else:
        dataClean = "Not Stemmed"

    if (vectorType == 1):
        type = " TF-IDF vectorizer"
    elif (vectorType==2):
        type = " Count vectorizer"
    else:
        type="Binary vectorizer"

    if (regularized==1):
        reg="Regularized"
    else:
        reg="Not regularized"

    # Output
    print("Data Clean: ",dataClean)
    print("Vectorization: ",type)
    print("LinReg costfunction: ",reg)
    #print("F1 SCore: ",f1_score(y_test,preds, average='macro'))
    print(f1_score(confusion_matrix(y_test,preds)))
    print("Accuracy: ","{:.2f}%".format(accuracy_score(y_test,preds)))
    print("Confusion Matrix:\n",confusion_matrix(y_test,preds))
if __name__ == "__main__" :
   
    for i in range(1, 3):
        for j in range(1,4):
            for k in range(1,3):
                main(i,j,k)

Data Clean:  Stemmed
Vectorization:   TF-IDF vectorizer
LinReg costfunction:  Regularized
Precision:  0.84 
Recall:  0.79 
F1 score (micro): 0.82
None
Accuracy:  82.55%
Confusion Matrix:
 Predicted  0.0  1.0
Actual             
0          421   70
1           98  374
Data Clean:  Stemmed
Vectorization:   TF-IDF vectorizer
LinReg costfunction:  Not regularized
Precision:  0.84 
Recall:  0.79 
F1 score (micro): 0.82
None
Accuracy:  82.55%
Confusion Matrix:
 Predicted  0.0  1.0
Actual             
0          421   70
1           98  374
Data Clean:  Stemmed
Vectorization:   Count vectorizer
LinReg costfunction:  Regularized
Precision:  0.82 
Recall:  0.75 
F1 score (micro): 0.78
None
Accuracy:  79.85%
Confusion Matrix:
 Predicted  0.0  1.0
Actual             
0          415   76
1          118  354
Data Clean:  Stemmed
Vectorization:   Count vectorizer
LinReg costfunction:  Not regularized
Precision:  0.82 
Recall:  0.75 
F1 score (micro): 0.78
None
Accuracy:  79.85%
Confusion Matrix:
 Pr

### Bonus Points

##### How would the results change if you used term frequency x inverse document frequency instead of binary representation for both logistic regression and Naïve Bayes (1 point)? 

For logistic regression, TF-IDF improved model accuracy over the binary representation for both the stemmed and unstemmed vocabularies. Without stemming, accuracy was 84.80% with TF-IDF and 82.50% with binary. With stemming, accuracy was 86.40% with TF-IDF and 83.90% with binary.


##### How do your results change if you regularize your logistic regression (1 point)?

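Adding regularization did not decrease the accuracy of our models for any of the three representations: binary, count, or TF-IDF. In the stemmed TF-IDF and count runs shown above, the regularized and unregularized models produced identical precision, recall, F1, and accuracy.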