## Logistic Regression Model and F1 Score Threshold Calibration

In this notebook, we use Logistic Regression to predict the __isPositive__ field of our final dataset, and look at how Precision-Recall curves and threshold selection can help improve the classifier's F1 score.

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __log_votes:__ Logarithm-adjusted votes log(1+votes)
* __isPositive:__ Rating of the review (1 for positive, 0 for negative)
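As a small illustration of the __log_votes__ transformation, the sketch below applies log(1+votes) with NumPy and inverts it again. The vote counts are made up for the example. The values 1.098612 and 2.639057 that show up in the sample rows later in this notebook correspond to 2 and 13 votes under this formula.

```python
import numpy as np

# Hypothetical raw vote counts, chosen only for illustration
votes = np.array([0, 2, 13])

# log(1 + votes); log1p is numerically stable for small values
log_votes = np.log1p(votes)
print(log_votes)

# expm1 is the inverse of log1p, so it recovers the raw counts
recovered = np.expm1(log_votes)
print(recovered)
```

Using log(1+votes) rather than log(votes) keeps reviews with zero votes at 0 instead of negative infinity.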


### 1. Reading the datasets

We will use the __pandas__ library to read our datasets.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

df_train = pd.read_csv('../../DATA/NLP/EMBK-NLP-FINAL-TRAIN-CSV.csv')
df_test = pd.read_csv('../../DATA/NLP/EMBK-NLP-FINAL-TEST-CSV.csv')

Let's look at the first five rows in the dataset.

In [2]:
df_train.head()

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,"PURCHASED FOR YOUNGSTER WHO\nINHERITED MY ""TOO...",IDEAL FOR BEGINNER!,True,1361836800,0.0,1.0
1,unable to open or use,Two Stars,True,1452643200,0.0,0.0
2,Waste of money!!! It wouldn't load to my system.,Dont buy it!,True,1433289600,0.0,0.0
3,I attempted to install this OS on two differen...,I attempted to install this OS on two differen...,True,1518912000,0.0,0.0
4,I've spent 14 fruitless hours over the past tw...,Do NOT Download.,True,1441929600,1.098612,0.0


In [3]:
df_test.head()

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,Kaspersky offers the best security for your co...,State of the art protection,True,1465516800,0.0,1.0
1,This Value was extremely discounted which I ap...,Quickbooks,True,1393632000,0.0,1.0
2,Some dufus probably got stock options by the t...,Sad,False,1228176000,2.639057,0.0
3,I have reviewed the software and it is beyond ...,Excellent product,True,1402531200,0.0,1.0
4,"Plain old simple you need Anti-Virus,I have tr...",A must have,True,1367539200,0.0,1.0


### 2. Exploratory Data Analysis and Missing Value Imputation

Let's look at the target distribution for our dataset.

In [4]:
df_train["isPositive"].value_counts()

1.0    43692
0.0    26308
Name: isPositive, dtype: int64

In [5]:
df_test["isPositive"].value_counts()

1.0    4980
0.0    3020
Name: isPositive, dtype: int64

Let's check the number of missing values in each column.

In [6]:
print(df_train.isna().sum())

reviewText    11
summary       14
verified       0
time           0
log_votes      0
isPositive     0
dtype: int64


In [7]:
print(df_test.isna().sum())

reviewText    2
summary       1
verified      0
time          0
log_votes     0
isPositive    0
dtype: int64


Let's fill in the missing text values with a placeholder.

In [8]:
# Assign the result back instead of using inplace=True on a column selection
df_train["reviewText"] = df_train["reviewText"].fillna("Missing")
df_test["reviewText"] = df_test["reviewText"].fillna("Missing")

### 3. Stop Word Removal and Stemming

We will apply the text processing methods discussed in the class. 

In [9]:
# Download the NLTK resources used for tokenization and stop word removal
import nltk

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
import nltk, re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

stop = stopwords.words('english')

excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't",
             'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 
             'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't",'shouldn', "shouldn't", 'wasn',
            "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer('english')

def process_text(texts): 
    final_text_list=[]
    for sent in texts:
        filtered_sentence=[]
        
        sent = sent.lower() # Lowercase 
        sent = sent.strip() # Remove leading/trailing whitespace
        sent = re.sub(r'\s+', ' ', sent) # Collapse runs of whitespace (spaces, tabs, newlines) into one space
        sent = re.sub(r'<.*?>', '', sent) # Remove HTML tags/markup
        
        for w in word_tokenize(sent):
            # Check if it is not numeric and its length>2 and not in stop words
            if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):  
                # Stem and add to filtered list
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence) #final string of cleaned words
 
        final_text_list.append(final_string)
    
    return final_text_list
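To see why negation words are excluded from the stop word list, here is a minimal sketch that uses only the standard library. The tiny stop list and the `clean` helper are hypothetical stand-ins for illustration. They are not NLTK's list or the `process_text` function above, and no stemming is applied.

```python
import re

# Toy stop list for illustration only; the notebook uses NLTK's English list
toy_stop_words = {"the", "is", "this", "and", "not", "it"}
negations = {"not"}  # words we want to keep because they flip the sentiment

def clean(text, stop_words):
    text = text.lower().strip()
    text = re.sub(r"<.*?>", "", text)  # drop HTML tags
    text = re.sub(r"\s+", " ", text)   # collapse whitespace
    return [w for w in text.split(" ") if w and w not in stop_words]

review = "This is <b>not</b>   good"

with_negation_removed = clean(review, toy_stop_words)
with_negation_kept = clean(review, toy_stop_words - negations)

print(with_negation_removed)  # ['good']
print(with_negation_kept)     # ['not', 'good']
```

With the full stop list, a negative review collapses to the single token "good". Keeping "not" preserves the signal the classifier needs.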

In [11]:
print("Pre-processing training reviewText")
df_train["reviewText"] = process_text(df_train["reviewText"].tolist())

print("Pre-processing test reviewText")
df_test["reviewText"] = process_text(df_test["reviewText"].tolist())


Pre-processing training reviewText
Pre-processing test reviewText


### 4. Splitting the training dataset into training and validation

The Sklearn library has a useful function for splitting datasets: __train_test_split()__. In the example below, we keep 90% of the data for training and hold out 10% for validation.

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df_train["reviewText"].tolist(), # Input field: processed review text
                                                  df_train["isPositive"].tolist(), # Target field
                                                  test_size=0.10, # 10% validation, 90% training
                                                  shuffle=True) # Shuffle the whole dataset
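The split above is random, so the class ratio in the validation set can drift a little and the results change from run to run. One possible variation, sketched below on made-up data, passes `stratify` to keep the class proportions and `random_state` to make the split repeatable. Both are standard `train_test_split` parameters, but using them here is a suggestion, not something the notebook does.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up data: 60 positive and 40 negative "reviews"
texts = [f"review {i}" for i in range(100)]
labels = [1] * 60 + [0] * 40

X_tr, X_va, y_tr, y_va = train_test_split(
    texts,
    labels,
    test_size=0.10,    # 10% validation, 90% training
    shuffle=True,
    stratify=labels,   # keep the 60/40 class ratio in both parts
    random_state=42,   # arbitrary seed so the split is repeatable
)

print(Counter(y_tr))  # 54 positives, 36 negatives
print(Counter(y_va))  # 6 positives, 4 negatives
```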


### 5. Computing Bag of Words Features

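We use binary bag-of-words features here: each feature records whether a word appears in the review, regardless of how often. Term frequency (TF) and TF-IDF are other options.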

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

X_test = df_test["reviewText"].tolist()
y_test = df_test["isPositive"].tolist()

# Initialize the binary count vectorizer (despite the variable name, this is not TF-IDF)
tfidf_vectorizer = CountVectorizer(binary=True,
                                   max_features=50    # Limit the vocabulary size
                                  )

X_train_text_vectors = tfidf_vectorizer.fit_transform(X_train) # Fit and transform
X_val_text_vectors = tfidf_vectorizer.transform(X_val)       # Only transform
X_test_text_vectors = tfidf_vectorizer.transform(X_test)       # Only transform
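The difference between binary, count (TF), and TF-IDF features is easiest to see on a tiny corpus. The two documents below are made up for illustration, and the vectorizers use default settings apart from `binary=True`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up corpus; "good" appears twice in the first document
docs = ["good good product", "bad product"]

binary_vec = CountVectorizer(binary=True)  # 1 if the word occurs, else 0
count_vec = CountVectorizer()              # raw term counts
tfidf_vec = TfidfVectorizer()              # counts reweighted by rarity, rows L2-normalized

binary = binary_vec.fit_transform(docs).toarray()
counts = count_vec.fit_transform(docs).toarray()
tfidf = tfidf_vec.fit_transform(docs).toarray()

# Column order follows the learned vocabulary: bad, good, product
print(sorted(binary_vec.vocabulary_, key=binary_vec.vocabulary_.get))
print(binary)  # [[0 1 1], [1 0 1]]
print(counts)  # [[0 2 1], [1 0 1]]
print(tfidf.round(3))
```

The binary and count matrices differ only in the "good" column of the first document, where the repeated word is recorded as 1 instead of 2.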


### 6. Fitting LogisticRegression and checking the validation performance

Let's fit __LogisticRegression__ from the Sklearn library and check its performance on the validation and test datasets.

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import make_scorer, accuracy_score, f1_score

lrClassifier = LogisticRegression(penalty = 'l2', C = 0.1, class_weight = 'balanced')
lrClassifier.fit(X_train_text_vectors, y_train)
lrClassifier_val_predictions = lrClassifier.predict(X_val_text_vectors)
lrClassifier_test_predictions = lrClassifier.predict(X_test_text_vectors)

print("LogisticRegression on Validation: Accuracy Score: %f, F1-score: %f" % (accuracy_score(y_val, lrClassifier_val_predictions), f1_score(y_val, lrClassifier_val_predictions)))
print("LogisticRegression on Test: Accuracy Score: %f, F1-score: %f" % (accuracy_score(y_test, lrClassifier_test_predictions), f1_score(y_test, lrClassifier_test_predictions)))




LogisticRegression on Validation: Accuracy Score: 0.745143, F1-score: 0.792799
LogisticRegression on Test: Accuracy Score: 0.748000, F1-score: 0.795080
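The classifier above uses `class_weight = 'balanced'`, which weights each class by `n_samples / (n_classes * count_of_class)`. The sketch below works this out for the class counts of the full training file (43692 positive, 26308 negative). This is an approximation: the model is actually fit on the 90% split, so its counts, and therefore its weights, differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels with the same class counts as the full training file
y = np.array([1.0] * 43692 + [0.0] * 26308)
classes = np.array([0.0, 1.0])

# Weights as computed by scikit-learn, in the order of `classes`
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same formula by hand: n_samples / (n_classes * count_of_class)
manual = len(y) / (len(classes) * np.array([26308, 43692]))

print(weights)  # roughly [1.3304, 0.8011]
print(manual)
```

The minority (negative) class gets the larger weight, so errors on negative reviews cost more during training.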


### 7. Picking the Probability Threshold
We will plot the Precision-Recall curve and pick the point with the highest F1 score. We can easily calculate F1 score using precision and recall.

In [15]:
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

val_predictions_probs = lrClassifier.predict_proba(X_val_text_vectors)
precisions, recalls, thresholds = precision_recall_curve(y_val, val_predictions_probs[:, 1])

# Let's plot the Precision-Recall curve, with precision on the y axis and recall on the x axis
# The dashed horizontal line marks precision = 0.5 as a visual reference
plt.plot([0, 1], [0.5, 0.5], linestyle='--')
plt.plot(recalls, precisions, marker='.')
plt.title('Precision-Recall Curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()

<Figure size 640x480 with 1 Axes>

Calculate the F1 score using the precision and recall from the curve above, and pick the threshold that resulted in the largest F1 score.

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

In [16]:
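highest_f1 = 0
threshold_highest_f1 = 0
# precisions and recalls have one more element than thresholds, so we iterate over thresholds
for idx, threshold in enumerate(thresholds):
    denominator = precisions[idx] + recalls[idx]
    if denominator == 0:
        continue  # F1 is undefined when precision and recall are both zero
    # Use a name that does not shadow the f1_score function imported from sklearn.metrics
    current_f1 = 2 * precisions[idx] * recalls[idx] / denominator
    if current_f1 > highest_f1:
        highest_f1 = current_f1
        threshold_highest_f1 = threshold
print("Highest F1 on Validation:", highest_f1, ", Threshold for the highest F1:", threshold_highest_f1)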


Highest F1 on Validation: 0.816023428511662 , Threshold for the highest F1: 0.311373488104497
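Once a threshold is chosen, it replaces the default cut-off of 0.5 that `predict()` uses. The sketch below shows one way to do this from end to end. It runs on synthetic data from `make_classification`, not on the review dataset, and the names `best_threshold` and `tuned_preds` are illustrative only. The F1 search is the same as the loop above, written with NumPy arrays.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the review dataset)
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.4, 0.6], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = clf.predict_proba(X_va)[:, 1]  # probability of the positive class

precisions, recalls, thresholds = precision_recall_curve(y_va, probs)

# The final precision/recall pair has no matching threshold, so drop it
p, r = precisions[:-1], recalls[:-1]
f1 = np.divide(2 * p * r, p + r, out=np.zeros_like(p), where=(p + r) > 0)

best = int(np.argmax(f1))
best_threshold = thresholds[best]

# Predict positive when the probability reaches the chosen threshold
tuned_preds = (probs >= best_threshold).astype(int)
default_preds = clf.predict(X_va)  # default cut-off at 0.5

tuned_f1 = f1_score(y_va, tuned_preds)
default_f1 = f1_score(y_va, default_preds)
print("Threshold:", best_threshold)
print("F1 at 0.5:", default_f1, "F1 at tuned threshold:", tuned_f1)
```

Because the threshold is picked on the validation set, the validation F1 at that threshold is an optimistic estimate. Applying the same threshold to the test set probabilities gives a fairer picture of the improvement.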
