## Creating a logistic regression model for sentiment analysis

Borrowed heavily (almost entirely from) https://kavita-ganesan.com/news-classifier-with-logistic-regression-in-python/#.Yui_xXbMKUl

Code also on git at https://github.com/kavgan/nlp-in-practice/blob/2d9e23c1d8ab56e9533be188c9ce7a0f6efc11e1/text-classification/notebooks/Text%20Classification%20with%20Logistic%20Regression.ipynb

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
reviews = pd.read_csv("comments_preproc.csv", index_col=0).sample(n=200000, random_state=0)
reviews.reset_index(drop=True, inplace=True)

reviews.head()

Unnamed: 0,lastName,comment,clarityRating,cleanedComment,sentiment
0,Williams,A great professor who really cares about his s...,5,great professor care student,1
1,Lehman,I don't know why people would say he does not ...,5,not know people not care class not accept b gi...,1
2,Glade,Even though the book you have to read is prett...,4,book read pretty difficult test easy answer di...,1
3,Heckelman,"Good teacher, although his lectures are very s...",4,good teacher lecture stale,1
4,Lipset,"Rude to students with questions, dry speech, r...",3,rude student question dry speech ramble easy c...,1


#### Interestingly, some comments are entirely empty after being sampled

They must have been entirely consistent of stop words that just got removed lmao

In [3]:
#reviews["sentiment"] = reviews["clarityRating"].apply(lambda x: 1 if x > 3 else 0 if x == 3 else -1)
reviews["sentiment"] = reviews["clarityRating"].apply(lambda x: 1 if x > 2.5 else 0)

In [4]:
reviews["cleanedComment"].isna().sum()

11

In [5]:
reviews.dropna(subset=["cleanedComment"], inplace=True)

Next few cells are function definitions

In [6]:
# Extract features using different available methods

def extract_features(df, train_data, test_data, type="binary", ngr=(1,1)):
    if "binary" in type:
        # binary feature representation
        
        cv = CountVectorizer(binary=True, max_df=0.95, ngram_range=ngr)
        cv.fit_transform(train_data["cleanedComment"].values)

        train_features = cv.transform(train_data["cleanedComment"].values)
        test_features = cv.transform(test_data["cleanedComment"].values)

        return train_features, test_features, cv
    
    elif "counts" in type:
        # count-based feature representation

        cv = CountVectorizer(binary=False, max_df=0.95, ngram_range=ngr)
        cv.fit_transform(train_data["cleanedComment"].values)

        train_features = cv.transform(train_data["cleanedComment"].values)
        test_features = cv.transform(test_data["cleanedComment"].values)

        return train_features, test_features, cv
    
    else:
        # TF-IDF based feature representation

        tfidf_vec = TfidfVectorizer(use_idf=True, max_df=0.95, ngram_range=ngr)
        tfidf_vec.fit_transform(train_data["cleanedComment"].values)

        train_features = tfidf_vec.transform(train_data["cleanedComment"].values)
        test_features = tfidf_vec.transform(test_data["cleanedComment"].values)

        return train_features, test_features, tfidf_vec

In [7]:
def get_top_k_predictions(model, X_test, k):
    probs = model.predict_proba(X_test)         # get probabilities instead of labels
    best_n = np.argsort(probs, axis=1)[:,-k:]   # get top k predictions by index (note: just index)

    preds = [[model.classes_[predicted_cat] for predicted_cat in prediction] for prediction in best_n]          # get category of predictions
    preds = [item[::-1] for item in preds]     # reverse categories, descending order of importance

    return preds

In [8]:
def collect_preds(Y_test, Y_preds):
    pred_gold_list=[[[Y_test[idx]],pred] for idx,pred in enumerate(Y_preds)]
    return pred_gold_list

In [9]:
def compute_accuracy(eval_items:list):
    correct = 0
    total = 0

    for item in eval_items:
        true_pred = item[0]
        machine_pred = set(item[1])

        for cat in true_pred:
            if cat in machine_pred:
                correct += 1
    accuracy = correct/float(len(eval_items))
    return accuracy

In [10]:
def _reciprocal_rank(true_labels:list, machine_preds:list):
    tp_pos_list = [(idx + 1) for idx, r in enumerate(machine_preds) if r in true_labels]
    
    rr = 0
    if len(tp_pos_list) > 0:
        first_pos_list = tp_pos_list[0]
        rr = 1 / float(first_pos_list)
    
    return rr

In [11]:
# compute mean reciprocal rank, which I understand as follows
""" this is admittedly much more useful when dealing with multiple categories
essentially, how many of the relevant categories appear in the top k predicted categories (or something to that effect, as it's typically shown as a percentage)
as our data currently has two possible categories, if we let top_k=2, everything comes out to 100% lol, and if top_k=1, accuracy and mrr are the same
"""

def compute_mrr_at_k(items:list):
    rr_total = 0

    for item in items:
        rr_at_k = _reciprocal_rank(item[0], item[1])
        rr_total += rr_at_k
        mrr = rr_total / 1/float(len(items))
    
    return mrr

Can tweak log reg model parameters to potentially improve accuracy, see https://levelup.gitconnected.com/regularization-in-machine-learning-59c619da4537 for more info

In [12]:
def train_model(df, field="cleanedComments", feature_rep="binary", top_k=1, ngr=(1,1)):
    train_data, test_data = train_test_split(df, random_state=0)            # get train-test split
    y_train = train_data["sentiment"].values                                # isolate labels in training and testing data
    y_test = test_data["sentiment"].values

    X_train, X_test, feature_transformer = extract_features(reviews, train_data, test_data, type=feature_rep, ngr=ngr)           # get features

    # NOTE: tweak parameters (especially C and penalty) to potentially improve log reg model
    log_reg = LogisticRegression(verbose=1, solver="liblinear", random_state=0, C=0.5, penalty="l2")       # create model and fit to training data
    model = log_reg.fit(X_train, y_train)

    preds = get_top_k_predictions(model, X_test, top_k)                 # get k most relevant predictions

    eval_items = collect_preds(y_test, preds)                           # get predicted values and ground into list of lists (for ease of evaluation)

    accuracy = compute_accuracy(eval_items)                             # get final stats on success rate of model
    mrr_at_k = compute_mrr_at_k(eval_items)

    return model, feature_transformer, accuracy, mrr_at_k

*Finally ready to start actually using the model*

In [38]:
def runLRModel(feature_rep="binary", ngr=(1,1), k=1):
    model, transformer, accuracy, mrr = train_model(reviews, "cleanedComments", feature_rep=feature_rep, top_k=k, ngr=ngr)
    print("\n*** USING " + feature_rep.upper() +" FEATURE REPRESENTATION ***")
    print("--- top " + str(k) + " feature(s) in n-gram range " + str(ngr) + " ---")
    print("Accuracy={0}; MRR={1}".format(accuracy * 100,mrr * 100))
    return model, transformer

Run model using every reasonable combination of feature representation, k-value, and n-gram range

In [39]:
# binary rep, top 1 feature, unigram
model, transformer = runLRModel(feature_rep="binary", k=1, ngr=(1,1))

[LibLinear]
*** USING BINARY FEATURE REPRESENTATION ***
--- top 1 feature(s) in n-gram range (1, 1) ---
Accuracy=86.1534461378455; MRR=86.1534461378455


In [57]:
# binary rep, top 2 features, unigram
model, transformer = runLRModel(feature_rep="binary", k=2, ngr=(1,1))

[LibLinear]
*** USING BINARY FEATURE REPRESENTATION ***
--- top 2 feature(s) in n-gram range (1, 1) ---
Accuracy=100.0; MRR=93.07672306892276


In [41]:
# binary rep, top 1 feature, bigram
model, transformer = runLRModel(feature_rep="binary", k=1, ngr=(2,2))

[LibLinear]
*** USING BINARY FEATURE REPRESENTATION ***
--- top 1 feature(s) in n-gram range (2, 2) ---
Accuracy=85.39941597663908; MRR=85.39941597663908


In [42]:
# binary rep, top 2 features, bigram
model, transformer = runLRModel(feature_rep="binary", k=2, ngr=(2,2))

[LibLinear]
*** USING BINARY FEATURE REPRESENTATION ***
--- top 2 feature(s) in n-gram range (2, 2) ---
Accuracy=100.0; MRR=92.69970798831953


In [43]:
# binary rep, top 1 feature, unigram + bigram
model, transformer = runLRModel(feature_rep="binary", k=1, ngr=(1,2))

[LibLinear]
*** USING BINARY FEATURE REPRESENTATION ***
--- top 1 feature(s) in n-gram range (1, 2) ---
Accuracy=87.24948997959918; MRR=87.24948997959918


In [44]:
# binary rep, top 2 features, unigram + bigram
model, transformer = runLRModel(feature_rep="binary", k=2, ngr=(1,2))

[LibLinear]
*** USING BINARY FEATURE REPRESENTATION ***
--- top 2 feature(s) in n-gram range (1, 2) ---
Accuracy=100.0; MRR=93.6247449897996


In [45]:
# count rep, top 1 feature, unigram
model, transformer = runLRModel(feature_rep="count", k=1, ngr=(1,1))

[LibLinear]
*** USING COUNT FEATURE REPRESENTATION ***
--- top 1 feature(s) in n-gram range (1, 1) ---
Accuracy=86.57946317852713; MRR=86.57946317852713


In [46]:
# count rep, top 2 features, unigram
model, transformer = runLRModel(feature_rep="count", k=2, ngr=(1,1))

[LibLinear]
*** USING COUNT FEATURE REPRESENTATION ***
--- top 2 feature(s) in n-gram range (1, 1) ---
Accuracy=100.0; MRR=93.28973158926357


In [47]:
# count rep, top 1 feature, bigram
model, transformer = runLRModel(feature_rep="count", k=1, ngr=(2,2))

[LibLinear]
*** USING COUNT FEATURE REPRESENTATION ***
--- top 1 feature(s) in n-gram range (2, 2) ---
Accuracy=84.07936317452697; MRR=84.07936317452697


In [48]:
# count rep, top 2 features, bigram
model, transformer = runLRModel(feature_rep="count", k=2, ngr=(2,2))

[LibLinear]
*** USING COUNT FEATURE REPRESENTATION ***
--- top 2 feature(s) in n-gram range (2, 2) ---
Accuracy=100.0; MRR=92.03968158726349


In [49]:
# count rep, top 1 feature, unigram + bigram
model, transformer = runLRModel(feature_rep="count", k=1, ngr=(1,2))

[LibLinear]
*** USING COUNT FEATURE REPRESENTATION ***
--- top 1 feature(s) in n-gram range (1, 2) ---
Accuracy=87.4234969398776; MRR=87.4234969398776


In [50]:
# count rep, top 2 features, unigram + bigram
model, transformer = runLRModel(feature_rep="count", k=2, ngr=(1,2))

[LibLinear]
*** USING COUNT FEATURE REPRESENTATION ***
--- top 2 feature(s) in n-gram range (1, 2) ---
Accuracy=100.0; MRR=93.7117484699388


In [51]:
# tfidf rep, top 1 feature, unigram
model, transformer = runLRModel(feature_rep="tfidf", k=1, ngr=(1,1))

[LibLinear]
*** USING TFIDF FEATURE REPRESENTATION ***
--- top 1 feature(s) in n-gram range (1, 1) ---
Accuracy=86.57946317852713; MRR=86.57946317852713


In [52]:
# tfidf rep, top 2 features, unigram
model, transformer = runLRModel(feature_rep="tfidf", k=2, ngr=(1,1))

[LibLinear]
*** USING TFIDF FEATURE REPRESENTATION ***
--- top 2 feature(s) in n-gram range (1, 1) ---
Accuracy=100.0; MRR=93.28973158926357


In [53]:
# tfidf rep, top 1 feature, bigram
model, transformer = runLRModel(feature_rep="tfidf", k=1, ngr=(2,2))

[LibLinear]
*** USING TFIDF FEATURE REPRESENTATION ***
--- top 1 feature(s) in n-gram range (2, 2) ---
Accuracy=84.07936317452697; MRR=84.07936317452697


In [54]:
# tfidf rep, top 2 features, bigram
model, transformer = runLRModel(feature_rep="tfidf", k=2, ngr=(2,2))

[LibLinear]
*** USING TFIDF FEATURE REPRESENTATION ***
--- top 2 feature(s) in n-gram range (2, 2) ---
Accuracy=100.0; MRR=92.03968158726349


In [55]:
# tfidf rep, top 1 feature, unigram + bigram
model, transformer = runLRModel(feature_rep="tfidf", k=1, ngr=(1,2))

[LibLinear]
*** USING TFIDF FEATURE REPRESENTATION ***
--- top 1 feature(s) in n-gram range (1, 2) ---
Accuracy=87.4234969398776; MRR=87.4234969398776


In [56]:
# tfidf rep, top 2 features, unigram + bigram
model, transformer = runLRModel(feature_rep="tfidf", k=2, ngr=(2,2))

[LibLinear]
*** USING TFIDF FEATURE REPRESENTATION ***
--- top 2 feature(s) in n-gram range (2, 2) ---
Accuracy=100.0; MRR=92.03968158726349
