# Capstone: Employee Review Monitoring

---

#### 03a: <b>Modeling - Machine Learning</b>

### Contents:
- [Library and data import](#Library-and-data-import)
- [Preprocessing](#Preprocessing)
- [Modeling with SMOTE](#Modeling-with-SMOTE)
- [Evaluating and Interpreting Results](#Evaluating-and-Interpreting-Results)

## Library and data import

In [1]:
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import WhitespaceTokenizer
from nltk.corpus import stopwords

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import make_scorer
from sklearn.metrics import recall_score

In [2]:
# Load data
df_modeling = pd.read_csv("../data/train_modeling.csv")

In [3]:
print(df_modeling.shape)
display(df_modeling.head())
print(df_modeling['v_sentiment'].value_counts())

(30099, 2)


Unnamed: 0,combined_text,v_sentiment
0,Best Company to work for People are smart and ...,1
1,"Moving at the speed of light, burn out is inev...",1
2,Great balance between big-company security and...,1
3,The best place I've worked and also the most d...,1
4,Execellent for engineers Impact driven. Best t...,1


 1    27294
-1     2578
 0      227
Name: v_sentiment, dtype: int64


## Preprocessing
---

### Changing 'Neutral' class to 'Negative'

We relabel the 'Neutral' class as 'Negative'. A neutral review made by an employee usually describes some organizational problems alongside its positive aspects, so treating these reviews as negative keeps those problems visible to the model.

In [4]:
df_modeling['v_sentiment'] = df_modeling['v_sentiment'].replace([0],-1)

In [5]:
df_modeling['v_sentiment'].value_counts()

 1    27294
-1     2805
Name: v_sentiment, dtype: int64

In [6]:
# Change 'Negative' to '0' label
df_modeling['v_sentiment'] = df_modeling['v_sentiment'].replace([-1],0)

df_modeling['v_sentiment'].value_counts()

1    27294
0     2805
Name: v_sentiment, dtype: int64

### Text normalization via Lemmatization

When we normalize text, we attempt to reduce its randomness, bringing it closer to a predefined “standard”. Lemmatization does this by mapping inflected word forms to a common base form, which reduces the number of distinct tokens the model has to deal with and therefore improves efficiency.

In [7]:
# Instantiate tokenizer and lemmatizer
w_tokenizer = WhitespaceTokenizer()
lemmatizer = WordNetLemmatizer()

# Function to lemmatize words in dataframe
def lemma_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

df_modeling['combined_text_lemma'] = df_modeling['combined_text'].\
                                        apply(lemma_text)

# Join all words with one spacing
df_modeling['combined_text_lemma'] = df_modeling['combined_text_lemma'].\
                                        apply(lambda x: ' '.join(x))

# Organize dataframe
df_modeling.drop('combined_text',axis=1,inplace = True)
df_modeling.head()

Unnamed: 0,v_sentiment,combined_text_lemma
0,1,Best Company to work for People are smart and ...
1,1,"Moving at the speed of light, burn out is inev..."
2,1,Great balance between big-company security and...
3,1,The best place I've worked and also the most d...
4,1,Execellent for engineer Impact driven. Best te...


### Customise stopwords

Based on the words that appear most frequently in the N-grams and WordCloud (done in EDA), we add custom stopwords to NLTK's existing English stopword list.

NLTK's stopword list was chosen over Sklearn's because it is less comprehensive (Sklearn's list has almost twice as many stopwords as NLTK's). Since sentiment analysis is involved, a more comprehensive stopword list has a greater chance of removing words that carry sentiment, which could change the sentiment of a sentence.

In [8]:
# NLTK's default English stopword list
stop_words = stopwords.words('english')  

# Add additional stopwords
additional_stopwords = {'job','work','people','employee','company',
                       'environment','life', 'wa','ha'}

# Create custom stopword list
my_stop_words = list(set().union(stop_words, additional_stopwords))

### Train-validation-split

In [9]:
# Create input and output variables
X = df_modeling['combined_text_lemma']
y = df_modeling['v_sentiment']

# Train-validation split
X_train, X_val, y_train, y_val = train_test_split(X,
                                                  y,
                                                  stratify=y,
                                                  random_state=42)

In [10]:
# Check the shape of training and validation vectors
print(" Train Set Shape ".center(27, "="))
print(f"Features X_train: {X_train.shape}")
print(f"Targets y_train:  {y_train.shape}")
print()
print(" Validation Set Shape ".center(26, "="))
print(f"Features X_val:   {X_val.shape}")
print(f"Targets y_val:    {y_val.shape}")

===== Train Set Shape =====
Features X_train: (22574,)
Targets y_train:  (22574,)

== Validation Set Shape ==
Features X_val:   (7525,)
Targets y_val:    (7525,)


### Baseline Accuracy

We calculate the baseline accuracy to tell whether our models do better than a null model that always predicts the majority class.

In [11]:
y_val.value_counts(normalize = True)

1    0.906844
0    0.093156
Name: v_sentiment, dtype: float64
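To make the baseline concrete, the sketch below works through a null model that always predicts the positive class. The class counts are the ones implied by the proportions above for a validation set of 7,525 reviews. The variable names are illustrative only.

```python
# Validation-set class counts implied by the proportions above (7,525 reviews)
n_negative = 701   # label 0
n_positive = 6824  # label 1

# Null model: always predict the majority (positive) class
true_negatives = 0            # it never predicts label 0
false_positives = n_negative  # every negative review is labelled positive

baseline_accuracy = n_positive / (n_positive + n_negative)
baseline_specificity = true_negatives / (true_negatives + false_positives)

print(f"Baseline accuracy:    {baseline_accuracy:.6f}")
print(f"Baseline specificity: {baseline_specificity:.1f}")
```

The null model reaches roughly 90.7% accuracy while catching no negative reviews at all, which is why accuracy alone is not a useful target here.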

## Modeling with SMOTE

The training data is heavily imbalanced (only about 9% of reviews are negative), so we apply the SMOTE technique to the training set to oversample the minority class and balance the target.
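SMOTE builds each synthetic minority sample by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The sketch below illustrates only that interpolation step in plain Python. It is not imblearn's implementation, and the function name and feature vectors are hypothetical.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between point and neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(42)

# Two hypothetical minority-class feature vectors (e.g. TF-IDF weights)
minority_point = [0.0, 0.5, 0.2]
nearest_neighbor = [0.4, 0.1, 0.2]

synthetic = smote_like_sample(minority_point, nearest_neighbor, rng)
print(synthetic)
```

Because SMOTE sits inside an imblearn `Pipeline` in this notebook, resampling happens only while fitting, so the validation data is never oversampled.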

### Model Preparation

**Workflow for this notebook:**
- Transform data using vectorizer and SMOTE
- Fit model (with hyperparameter tuning) to training data
- Generate predictions using validation data
- Evaluate model based on evaluation metrics
- Select best model

**Vectorizers used:**

`CountVectorizer` and `TfidfVectorizer`

**Models used:**

`Multinomial Naïve Bayes`, `Logistic Regression`, `SVM`

**Evaluation metrics used:**

For the machine learning models, we ideally want a high `specificity`, i.e. the share of actual negative reviews that the model correctly identifies as negative (TN / (TN + FP)), since being able to accurately predict negative reviews would give the company insights into organizational problems. We also want the reviews that the model labels as positive to be truly positive, hence a high `precision` for positive reviews (TP / (TP + FP)).

We will also be using `Cohen's Kappa Statistic`, which measures the agreement between the predicted and actual classes after correcting for the agreement expected by chance. It is well suited to evaluating classifiers on imbalanced datasets, where accuracy alone can be misleading. The closer the score is to one, the better the classifier.
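As a worked example of these three metrics, the sketch below computes specificity, precision and Cohen's kappa by hand from a confusion matrix. The counts are hypothetical and chosen only to keep the arithmetic easy to follow. They are not results from this notebook.

```python
# Hypothetical confusion-matrix counts (label 0 = negative, 1 = positive)
tn, fp, fn, tp = 60, 40, 50, 850
total = tn + fp + fn + tp

specificity = tn / (tn + fp)   # share of actual negatives that were caught
precision = tp / (tp + fp)     # share of predicted positives that are truly positive
accuracy = (tn + tp) / total   # observed agreement

# Agreement expected by chance, from the actual and predicted class totals
expected = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / total**2
kappa = (accuracy - expected) / (1 - expected)

print(f"specificity={specificity:.3f} precision={precision:.3f} kappa={kappa:.3f}")
```

Accuracy here is 0.91, yet kappa is only about 0.52, because most of that agreement would be expected by chance on such an imbalanced target.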

---

In notebook 3b, we will be running an `LSTM` model using TensorFlow Keras.

### Functions for modeling

In [12]:
# Function to fit RandomizedSearchCV pipeline and generate best parameters
def fit_rs(clf, params):
    """fits a RandomizedSearchCV to a classifier, prints best params and returns model"""
    
    # to optimize for scoring 'specificity' which is our primary metric
    scoring = make_scorer(recall_score,pos_label=0)
    rs = RandomizedSearchCV(clf, params, n_iter=100, cv=5, n_jobs=-1, scoring = scoring)
    
    # fit model
    rs.fit(X_train, y_train)
    
    print(" RandomizedSearchCV ".center(45, "="))
    print(f"Best Parameters: {rs.best_params_}")
    print(f"Best CV Score (Specificity): {rs.best_score_}")
    print()
    print(" Evaluation ".center(45, "="))
    print(
        f"Train Score: {rs.score(X_train, y_train)}"
    )
    print(
        f"Testing Score: {rs.score(X_val, y_val)}"
    )

    return rs
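
The scorer in `fit_rs` relies on the fact that recall computed for label 0 is the same quantity as specificity. The sketch below checks this on a small set of made-up labels. The toy data is an assumption for illustration only.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels (0 = negative, 1 = positive), for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1, 0, 1]

# Specificity from the confusion matrix: TN / (TN + FP)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)

# Recall with the negative class treated as the "positive" label
recall_of_class_0 = recall_score(y_true, y_pred, pos_label=0)

print(specificity, recall_of_class_0)
```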

In [39]:
# Function to generate train/test accuracy & evaluation metrics
def eval_model(model):
    """returns dataframe of evaluation metrics for classifier"""

    # Get predictions
    y_pred = model.predict(X_val)
#     y_proba = model.predict_proba(X_val)[:, 1]
    
    randomcv = model.best_score_
    
    # Metrics for evaluating classifier
    tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()

    accuracy = (tn + tp) / (tn + fp + fn + tp)
    
    print(classification_report(y_val, y_pred, digits =3))
    
#     if (tn + fp) != 0:
#         spec = tn / (tn + fp)
#     else:
#         spec = "NA"

    if (tp + fn) != 0:
        recall = tp / (tp + fn)
    else:
        recall = np.nan

    if (tp + fp) != 0:
        precision = tp / (tp + fp)
    else:
        precision = np.nan

    # np.nan never compares equal to anything, so test with np.isnan
    if np.isnan(recall) or np.isnan(precision) or (precision + recall) == 0:
        f1 = np.nan
    else:
        f1 = 2 * ((precision * recall) / (precision + recall))
    
    
    ck = cohen_kappa_score(y_val, y_pred)

    try:
        model_name = str(model.estimator)[:-2]
        if len(model_name) > 30:
            model_name = model_name[:13]
    except Exception:
        model_name = "Dummy"

    df = pd.DataFrame(
        [np.round([randomcv, accuracy, recall, 
                   precision, f1, 
                   ck, fp, 
                   fn], 3)],
        columns=[
            "Best CV Score (specificity)",
            "Accuracy",
            "Recall",
            "Precision",
            "F1",
            "Cohen Kappa",
            "False Positives",
            "False Negatives"
        ],
        index=[model_name],
        dtype="str",
    )
    return df

### Naïve Bayes [Benchmark Model]

The Naïve Bayes classifier is a probabilistic machine learning model that is commonly used in classification problems. It relies on Bayes' Theorem, which is a way of finding a probability when we know certain other probabilities. In this case, we want to calculate the probability that a review is classified as `Positive (1)` or `Negative (0)` given the words in the 'combined_text_lemma' column.

We will be using `Multinomial Naive Bayes` because it works with occurrence counts (features are non-negative integers), while Bernoulli Naive Bayes is designed for binary/boolean features.

However, a limitation of the Naive Bayes model is that it assumes all features are independent of one another, which is never true of text data (i.e. certain words can change the context of a sentence when used with other words).

Despite this assumption not being realistic with NLP data, we still use Naïve Bayes pretty frequently.
- It's a very fast modeling algorithm (which is great especially when we have lots of features and/or lots of data!).
- It is often an excellent classifier, outperforming more complicated models.
- Very useful for text processing

Hence, this would be our **benchmark model**.
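
To show how the word-count probabilities combine, here is a minimal from-scratch sketch of multinomial Naive Bayes with additive smoothing on two made-up reviews. It is a simplified illustration, not scikit-learn's `MultinomialNB`, and all names and data in it are hypothetical.

```python
import math
from collections import Counter

# Toy training reviews (made up): label 1 = positive, 0 = negative
train = [("great team great pay", 1), ("poor management poor pay", 0)]

word_counts = {0: Counter(), 1: Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

vocab = set(word_counts[0]) | set(word_counts[1])
alpha = 1  # additive (Laplace) smoothing, the role played by nb__alpha

def log_posterior(text, label):
    """Unnormalised log P(label | text) under the independence assumption."""
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / sum(doc_counts.values()))  # class prior
    for word in text.split():
        if word in vocab:  # words never seen in training are ignored
            likelihood = (word_counts[label][word] + alpha) / (
                total_words + alpha * len(vocab)
            )
            score += math.log(likelihood)
    return score

def predict(text):
    """Pick the label with the larger log posterior."""
    return max((0, 1), key=lambda label: log_posterior(text, label))

print(predict("great pay"), predict("poor management"))
```

Each word contributes its own log-likelihood term independently of its neighbours, which is exactly the independence assumption discussed above.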

### Multinomial NB (CountVectorizer)

In [14]:
# Create pipeline for vectorizer and estimator
# Pipeline is from imblearn and not sklearn, in order to apply SMOTE
pipe1 = Pipeline([
    ('cvec', CountVectorizer()),
    ('smote', SMOTE(random_state=42)),
    ('nb', MultinomialNB())
])

In [15]:
# Visualise all parameters available for tuning
pipe1.get_params()

{'memory': None,
 'steps': [('cvec', CountVectorizer()),
  ('smote', SMOTE(random_state=42)),
  ('nb', MultinomialNB())],
 'verbose': False,
 'cvec': CountVectorizer(),
 'smote': SMOTE(random_state=42),
 'nb': MultinomialNB(),
 'cvec__analyzer': 'word',
 'cvec__binary': False,
 'cvec__decode_error': 'strict',
 'cvec__dtype': numpy.int64,
 'cvec__encoding': 'utf-8',
 'cvec__input': 'content',
 'cvec__lowercase': True,
 'cvec__max_df': 1.0,
 'cvec__max_features': None,
 'cvec__min_df': 1,
 'cvec__ngram_range': (1, 1),
 'cvec__preprocessor': None,
 'cvec__stop_words': None,
 'cvec__strip_accents': None,
 'cvec__token_pattern': '(?u)\\b\\w\\w+\\b',
 'cvec__tokenizer': None,
 'cvec__vocabulary': None,
 'smote__k_neighbors': 5,
 'smote__n_jobs': None,
 'smote__random_state': 42,
 'smote__sampling_strategy': 'auto',
 'nb__alpha': 1.0,
 'nb__class_prior': None,
 'nb__fit_prior': True}

In [16]:
# Hyperparameter tuning
pipe_params1 = {
    'cvec__max_features': [None], # selecting top N words from entire corpus
    'cvec__min_df':[2,3,4], # word must occur in at least N documents in corpus
    'cvec__max_df':[0.5,0.6,0.4], # ignore words that occur in >N% of the documents in corpus
    'cvec__ngram_range': [(1,1), (1,2)], # words from unigram / words from unigram + bigram
    'cvec__stop_words': [my_stop_words, None], # NLTK stopwords + custom words, or no stopword removal
    'smote__k_neighbors': [4,5,6],
    'nb__fit_prior': [True, False],
    'nb__alpha': [1] # additive smoothing parameter
}

In [17]:
# Fit model
mnb_cvec = fit_rs(pipe1, pipe_params1)

Best Parameters: {'smote__k_neighbors': 6, 'nb__fit_prior': False, 'nb__alpha': 1, 'cvec__stop_words': ['before', 'she', "don't", "mustn't", 'than', 'they', 'further', 'more', 'too', 'll', 're', "weren't", "shouldn't", 'some', "doesn't", 'these', 'am', 'is', 'our', 'o', 'only', 'down', 'shouldn', 'at', 'no', 'doesn', 'yours', "wasn't", "haven't", 'each', 'aren', 'just', 'hers', 'mustn', 'shan', 'the', 'above', 'why', 'between', 'under', "isn't", 'that', 'ma', 'be', 'its', 'once', 'most', 'against', 'ourselves', 'until', 'so', 'my', 'off', "you'll", 'd', 'up', 'mightn', 's', 'on', 'him', 'themselves', 'ours', 'myself', 'very', 'hadn', "hasn't", 'you', 'from', "you'd", 'them', 'had', 'what', 'for', 'can', 'do', 'isn', 'by', 'was', 'having', "mightn't", 'couldn', "you've", 'to', "it's", 'yourself', 'itself', 'environment', "couldn't", 'work', 'other', 'ain', 'it', "wouldn't", 'life', 'were', 'people', 'm', 'how', 'now', 'should', 'your', 'wa', 'both', 'wouldn', 'over', 'hasn', 'did', 'i',

In [18]:
mnb_cvec_res = eval_model(mnb_cvec)
mnb_cvec_res

              precision    recall  f1-score   support

           0      0.362     0.499     0.420       701
           1      0.946     0.910     0.928      6824

    accuracy                          0.871      7525
   macro avg      0.654     0.704     0.674      7525
weighted avg      0.892     0.871     0.880      7525



Unnamed: 0,Best CV Score (specificity),Accuracy,Recall,Precision,F1,Cohen Kappa,False Positives,False Negatives
Pipeline(step,0.502,0.871,0.91,0.946,0.928,0.349,351.0,617.0


### Multinomial NB (TfidfVectorizer)

In [19]:
# Create pipeline for vectorizer and estimator
pipe2 = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('smote', SMOTE(random_state=42)),
    ('nb', MultinomialNB())
])

In [20]:
pipe_params2 = {
    'tvec__max_features': [None],
    'tvec__min_df':[3,4,5],
    'tvec__max_df':[0.5,0.6,0.4],
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': [my_stop_words, None],
    'smote__k_neighbors': [4,5,3],
    'nb__fit_prior': [True, False],
    'nb__alpha': [1]
}

In [21]:
# Fit model
mnb_tvec = fit_rs(pipe2, pipe_params2)

Best Parameters: {'tvec__stop_words': ['before', 'she', "don't", "mustn't", 'than', 'they', 'further', 'more', 'too', 'll', 're', "weren't", "shouldn't", 'some', "doesn't", 'these', 'am', 'is', 'our', 'o', 'only', 'down', 'shouldn', 'at', 'no', 'doesn', 'yours', "wasn't", "haven't", 'each', 'aren', 'just', 'hers', 'mustn', 'shan', 'the', 'above', 'why', 'between', 'under', "isn't", 'that', 'ma', 'be', 'its', 'once', 'most', 'against', 'ourselves', 'until', 'so', 'my', 'off', "you'll", 'd', 'up', 'mightn', 's', 'on', 'him', 'themselves', 'ours', 'myself', 'very', 'hadn', "hasn't", 'you', 'from', "you'd", 'them', 'had', 'what', 'for', 'can', 'do', 'isn', 'by', 'was', 'having', "mightn't", 'couldn', "you've", 'to', "it's", 'yourself', 'itself', 'environment', "couldn't", 'work', 'other', 'ain', 'it', "wouldn't", 'life', 'were', 'people', 'm', 'how', 'now', 'should', 'your', 'wa', 'both', 'wouldn', 'over', 'hasn', 'did', 'i', 'such', 've', 'there', 'again', 'any', 'me', 'if', 'where', 'not

In [22]:
mnb_tvec_res = eval_model(mnb_tvec)
mnb_tvec_res

              precision    recall  f1-score   support

           0      0.327     0.622     0.428       701
           1      0.957     0.868     0.911      6824

    accuracy                          0.845      7525
   macro avg      0.642     0.745     0.669      7525
weighted avg      0.898     0.845     0.866      7525



Unnamed: 0,Best CV Score (specificity),Accuracy,Recall,Precision,F1,Cohen Kappa,False Positives,False Negatives
Pipeline(step,0.608,0.845,0.868,0.957,0.911,0.349,265.0,899.0


### Logistic Regression (CountVectorizer)

In [23]:
# Create pipeline for vectorizer and estimator
pipe3 = Pipeline([
    ('cvec', CountVectorizer()),
    ('smote', SMOTE(random_state = 42)),
    ('logreg', LogisticRegression(random_state = 42))
])

In [27]:
pipe_params3 = {
    'cvec__max_features': [None],
    'cvec__min_df':[2,3,4],
    'cvec__max_df':[0.5,0.6,0.4],
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__stop_words': [my_stop_words, None],
    'smote__k_neighbors': [4,5,6],
    'logreg__C': [10e-1, 10e0, 10e1], # i.e. 1, 10, 100
    'logreg__penalty': ["l1", "l2"],
    'logreg__max_iter': [50, 100, 200],
    'logreg__solver': ["liblinear"]
}

In [28]:
# Fit model
logreg_cvec = fit_rs(pipe3, pipe_params3)

Best Parameters: {'smote__k_neighbors': 4, 'logreg__solver': 'liblinear', 'logreg__penalty': 'l1', 'logreg__max_iter': 50, 'logreg__C': 1.0, 'cvec__stop_words': None, 'cvec__ngram_range': (1, 1), 'cvec__min_df': 2, 'cvec__max_features': None, 'cvec__max_df': 0.5}
Best CV Score (Specificity): 0.5731919466123742

Train Score: 0.8236692015209125
Testing Score: 0.6034236804564908


In [29]:
logreg_cvec_res = eval_model(logreg_cvec)
logreg_cvec_res

              precision    recall  f1-score   support

           0      0.505     0.603     0.550       701
           1      0.958     0.939     0.949      6824

    accuracy                          0.908      7525
   macro avg      0.732     0.771     0.749      7525
weighted avg      0.916     0.908     0.912      7525



Unnamed: 0,Best CV Score (specificity),Accuracy,Recall,Precision,F1,Cohen Kappa,False Positives,False Negatives
Pipeline(step,0.573,0.908,0.939,0.958,0.949,0.499,278.0,414.0


### Logistic Regression (TfidfVectorizer)

In [30]:
# Create pipeline for vectorizer and estimator
pipe4 = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('smote', SMOTE(random_state = 42)),
    ('logreg', LogisticRegression(random_state = 42))
])

In [31]:
pipe_params4 = {
    'tvec__max_features': [None],
    'tvec__min_df':[3,4,5],
    'tvec__max_df':[0.5,0.6,0.4],
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': [my_stop_words, None],
    'smote__k_neighbors': [4,5,3],
    'logreg__C': [10e-1, 10e0, 10e1],
    'logreg__penalty': ["l1", "l2"],
    'logreg__max_iter': [50, 100, 200],
    'logreg__solver': ["liblinear"]
}

In [32]:
# Fit model
logreg_tvec = fit_rs(pipe4, pipe_params4)

Best Parameters: {'tvec__stop_words': None, 'tvec__ngram_range': (1, 2), 'tvec__min_df': 4, 'tvec__max_features': None, 'tvec__max_df': 0.5, 'smote__k_neighbors': 5, 'logreg__solver': 'liblinear', 'logreg__penalty': 'l1', 'logreg__max_iter': 200, 'logreg__C': 1.0}
Best CV Score (Specificity): 0.7542744033480375

Train Score: 0.9315589353612167
Testing Score: 0.7617689015691869


In [33]:
logreg_tvec_res = eval_model(logreg_tvec)
logreg_tvec_res

              precision    recall  f1-score   support

           0      0.507     0.762     0.609       701
           1      0.974     0.924     0.948      6824

    accuracy                          0.909      7525
   macro avg      0.741     0.843     0.779      7525
weighted avg      0.931     0.909     0.917      7525



Unnamed: 0,Best CV Score (specificity),Accuracy,Recall,Precision,F1,Cohen Kappa,False Positives,False Negatives
Pipeline(step,0.754,0.909,0.924,0.974,0.948,0.56,167.0,519.0


### Support Vector Machine Classifier (CountVectorizer)

In [34]:
# Create pipeline for vectorizer and estimator
pipe5 = Pipeline([
    ('cvec', CountVectorizer()),
    ('smote', SMOTE(random_state = 42)),
    ('svc', SVC(random_state = 42, max_iter = 2000))
])

In [35]:
pipe_params5 = {
    'cvec__max_features': [None],
    'cvec__min_df':[2,3,4],
    'cvec__max_df':[0.5,0.6,0.4],
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__stop_words': [my_stop_words, None],
    'smote__k_neighbors': [4,5,6],
    'svc__C': [10e-1, 10e0, 10e1],
    'svc__kernel': ['rbf','poly'],
    'svc__degree': [2,3]
}

In [36]:
# Fit model
svc_cvec = fit_rs(pipe5, pipe_params5)



Best Parameters: {'svc__kernel': 'rbf', 'svc__degree': 3, 'svc__C': 100.0, 'smote__k_neighbors': 5, 'cvec__stop_words': None, 'cvec__ngram_range': (1, 1), 'cvec__min_df': 4, 'cvec__max_features': None, 'cvec__max_df': 0.5}
Best CV Score (Specificity): 0.5874799230856238

Train Score: 0.9795627376425855
Testing Score: 0.6419400855920114


In [40]:
svc_cvec_res = eval_model(svc_cvec)
svc_cvec_res

              precision    recall  f1-score   support

           0      0.268     0.642     0.379       701
           1      0.957     0.820     0.883      6824

    accuracy                          0.804      7525
   macro avg      0.613     0.731     0.631      7525
weighted avg      0.893     0.804     0.836      7525



Unnamed: 0,Best CV Score (specificity),Accuracy,Recall,Precision,F1,Cohen Kappa,False Positives,False Negatives
Pipeline(step,0.587,0.804,0.82,0.957,0.883,0.285,251.0,1226.0


### Support Vector Machine Classifier (TfidfVectorizer)

In [41]:
# Create pipeline for vectorizer and estimator
pipe6 = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('smote', SMOTE(random_state = 42)),
    ('svc', SVC(random_state = 42, max_iter = 2000))
])

In [42]:
pipe_params6 = {
    'tvec__max_features': [None],
    'tvec__min_df':[3,4,5],
    'tvec__max_df':[0.5,0.6,0.4],
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': [my_stop_words, None],
    'smote__k_neighbors': [4,5,3],
    'svc__C': [10e-1, 10e0, 10e1],
    'svc__kernel': ['rbf','poly'],
    'svc__degree': [2,3]
}

In [43]:
# Fit model
svc_tvec = fit_rs(pipe6, pipe_params6)



Best Parameters: {'tvec__stop_words': None, 'tvec__ngram_range': (1, 2), 'tvec__min_df': 3, 'tvec__max_features': None, 'tvec__max_df': 0.4, 'svc__kernel': 'poly', 'svc__degree': 3, 'svc__C': 10.0, 'smote__k_neighbors': 5}
Best CV Score (Specificity): 1.0

Train Score: 1.0
Testing Score: 1.0


In [44]:
svc_tvec_res = eval_model(svc_tvec)
svc_tvec_res

              precision    recall  f1-score   support

           0      0.094     1.000     0.171       701
           1      1.000     0.005     0.010      6824

    accuracy                          0.098      7525
   macro avg      0.547     0.502     0.091      7525
weighted avg      0.916     0.098     0.025      7525



Unnamed: 0,Best CV Score (specificity),Accuracy,Recall,Precision,F1,Cohen Kappa,False Positives,False Negatives
Pipeline(step,1.0,0.098,0.005,1.0,0.01,0.001,0.0,6790.0


## Evaluating and Interpreting Results

### Summary of results for above models

In [60]:
# Combine evaluation results for all models (trained with SMOTE oversampling)
summary = pd.concat(
    [
        mnb_cvec_res,
        mnb_tvec_res,
        logreg_cvec_res,
        logreg_tvec_res,
        svc_cvec_res,
        svc_tvec_res
    ]
)

summary.reset_index(drop=True, inplace=True)


In [63]:
summary.rename({0: 'MNB-CVEC',
                1: 'MNB-TFIDF',
               2: 'LR-CVEC',
               3: 'LR-TFIDF',
               4: 'SVC-CVEC',
               5: 'SVC-TFIDF'}, axis='index', inplace=True)

In [67]:
summary.sort_values(by=['Best CV Score (specificity)'], ascending = False)

Unnamed: 0,Best CV Score (specificity),Accuracy,Recall,Precision,F1,Cohen Kappa,False Positives,False Negatives
SVC-TFIDF,1.0,0.098,0.005,1.0,0.01,0.001,0.0,6790.0
LR-TFIDF,0.754,0.909,0.924,0.974,0.948,0.56,167.0,519.0
MNB-TFIDF,0.608,0.845,0.868,0.957,0.911,0.349,265.0,899.0
SVC-CVEC,0.587,0.804,0.82,0.957,0.883,0.285,251.0,1226.0
LR-CVEC,0.573,0.908,0.939,0.958,0.949,0.499,278.0,414.0
MNB-CVEC,0.502,0.871,0.91,0.946,0.928,0.349,351.0,617.0
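
Sorting by CV specificity alone puts SVC-TFIDF on top, even though its accuracy of 0.098 and kappa of 0.001 show that it labels nearly every review as negative. The sketch below, using values copied from the table above, ranks the models by Cohen's kappa instead. This is one possible selection rule, not necessarily the one the notebook applied.

```python
import pandas as pd

# Values copied from the summary table above
summary = pd.DataFrame(
    {
        "Best CV Score (specificity)": [1.0, 0.754, 0.608, 0.587, 0.573, 0.502],
        "Accuracy": [0.098, 0.909, 0.845, 0.804, 0.908, 0.871],
        "Cohen Kappa": [0.001, 0.56, 0.349, 0.285, 0.499, 0.349],
    },
    index=["SVC-TFIDF", "LR-TFIDF", "MNB-TFIDF", "SVC-CVEC", "LR-CVEC", "MNB-CVEC"],
)

# Rank by kappa first, using CV specificity only to break ties
ranked = summary.sort_values(
    by=["Cohen Kappa", "Best CV Score (specificity)"], ascending=False
)
print(ranked)

best_model = ranked.index[0]
```

Under this ranking LR-TFIDF comes first, which matches the model saved at the end of the notebook. Note that `eval_model` builds its results with `dtype="str"`, so the real `summary` would need converting with `astype(float)` before a numeric sort like this one.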


In [69]:
import pickle

# Save the selected model (logistic regression with TF-IDF features)
with open("lr_tfidf.pkl", mode="wb") as pickle_out:
    pickle.dump(logreg_tvec, pickle_out)