In [1]:
from joblib import load
import re

### Load pipeline

In [2]:
sent_pipe = load("Attempts/NB-pipeline.joblib")

#### Useful functions

In [3]:
def basicPreproc(comments: list) -> list:
    """Strip markup, contact details, and non-letters from each comment, then lowercase it."""
    comments_proper = []

    for review in comments:
        review = re.sub(r'&([a-zA-Z]+|#\d+);', '', review)          # remove HTML entities (named or numeric)
        review = re.sub(r'&#63;?', '', review)                      # HTML code for question mark evades erasure on occasion, handle here
        review = re.sub(r'\s*https?://\S+(\s+|$)', ' ', review)     # remove links
        review = re.sub(r'(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}', ' ', review)   # remove phone numbers anywhere in the text
        review = re.sub(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+', ' ', review)      # remove email addresses

        review = re.sub(r'(.)\1\1+', r'\1', review)                 # collapse runs of three or more identical characters to one

        review = re.sub(r'[^a-zA-Z]+', ' ', review)                 # replace runs of non-alphabetic characters with a space

        review = re.sub(r'\s+', ' ', review)                        # collapse repeated whitespace
        review = review.lower()                                     # lowercase review for uniformity

        comments_proper.append(review)

    return comments_proper
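
The repeated-character step is the least obvious part of the preprocessing, so here is a minimal standalone sketch of it. The helper name `collapse_repeats` is hypothetical and used only for illustration; the pattern is the one from `basicPreproc`. A run of three or more identical characters shrinks to one, while legitimate doubles (as in "good") are left alone.

```python
import re

def collapse_repeats(text: str) -> str:
    # (.) captures one character; \1\1+ requires at least two more copies of it,
    # so only runs of three or more are replaced by a single copy.
    return re.sub(r'(.)\1\1+', r'\1', text)

print(collapse_repeats("sooooo good!!!"))   # so good!
print(collapse_repeats("hmmmm, ok"))        # hm, ok
```

Note that a word that genuinely needs a double letter but was typed with three or more ("goood") collapses to a single letter ("god"), so this step trades a little accuracy for a smaller vocabulary.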

In [4]:
def printResults(docs: list, preds: list, probs: list):
    """Print each comment with its predicted sentiment (0 = negative, 1 = positive) and confidence."""
    # now with fancy text coloring (works best in dark mode)
    for comm, pred, prob in zip(docs, preds, probs):
        print("\033[2;32mComment: \033[0;37m{0}".format(comm))

        if pred == 0:
            # negative sentiment: red label, confidence taken from column 0
            print("\033[0;31mSentiment: \033[0;31m{0}".format(pred))
            print("\t\033[0;35mConfidence:\033[1;35m %0.5f%%\033[0m" % (prob[0] * 100))
        else:
            # positive sentiment: blue label, confidence taken from column 1
            print("\033[0;34mSentiment: \033[0;34m{0}".format(pred))
            print("\t\033[0;35mConfidence:\033[1;35m %0.5f%%\033[0m" % (prob[1] * 100))

        print()
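
The color codes used in `printResults` are ANSI SGR escape sequences of the form `ESC [ attributes m`. As a rough sketch of how they work, the helper below wraps text in a sequence and resets the terminal attributes afterwards. `colorize` and `RESET` are hypothetical names for illustration and are not part of this notebook.

```python
RESET = "\033[0m"   # SGR 0: restore default colors and attributes

def colorize(text: str, sgr: str) -> str:
    # sgr is a ';'-separated attribute list, e.g. "0;31" (normal, red foreground)
    return f"\033[{sgr}m{text}{RESET}"

print(colorize("Sentiment: 0", "0;31"))   # red, as used for negative predictions
print(colorize("Sentiment: 1", "0;34"))   # blue, as used for positive predictions
```

Ending each string with the reset sequence keeps the coloring from leaking into whatever is printed next.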

### Begin testing pipeline

First, we need example reviews to classify.

These reviews were misclassified by some earlier models; let's see whether the pipeline improvements handle them better.

Also, these are for professors that we all had together :)

In [5]:
# new comments to predict on
docs = [
    "He has his own grading criteria, which may throw you off. Tests are divided into weekly quiz, which you can redo them for better grade. PAs are difficult and mimir grading provides limited info, but he do provide fast and helpful feedback via office hour or mail. I was too late when I realized that, so contact him quickly if PA is hurting you.",
    "Makes the course unnecessarily hard. Passing a test with a C is uncommon. Don't be tricked by how nice of a guy he is, he wants to watch the world burn.",
    "This Professor is a very helpful, but is extremely difficult. The homework is a lot of work and expect the final to very very difficult.",
    "Think it was his first time teaching 221 so he was disorganized on the syllabus and assignments. Quite a bit of work so wouldn't recommend taking his class before ETAM. He pushes you to develop good programming practices. You'll come out of this class a better individual programmer but probably not with the best grade.",
    "I have perspective transferring from TAMU to another University, this class is hard, and talking to graduate students from other institutions it's not hard to see why, there's typically graduate level DSA covered in this class. It's worth taking for that reason alone. The depth, and content covered makes this the most important class in TAMU CS.",
    "Insufferable lectures, but an easy grader. She may be the professor you want, but she is not the professor you need.",
    "Leyk is a decent teacher for 121. This was the first semester she taught the course I believe so it was kinda all over the place, but she did a good job. HW isn't impossible or unreasonable and there's plenty of help. Also only having a final project and no final exam was nice."
]

Preprocess comments to ensure uniformity

In [6]:
preproc_comments = basicPreproc(docs)

Using the pipeline, predict the sentiment of each comment, along with the model's confidence in each prediction

In [7]:
preds = sent_pipe.predict(preproc_comments)
probs = sent_pipe.predict_proba(preproc_comments).tolist()          # probs is a list of lists, shown below for clarity

Output results

In [8]:
printResults(docs, preds, probs)

Comment: He has his own grading criteria, which may throw you off. Tests are divided into weekly quiz, which you can redo them for better grade. PAs are difficult and mimir grading provides limited info, but he do provide fast and helpful feedback via office hour or mail. I was too late when I realized that, so contact him quickly if PA is hurting you.
Sentiment: 0
	Confidence: 99.99527%

Comment: Makes the course unnecessarily hard. Passing a test with a C is uncommon. Don't be tricked by how nice of a guy he is, he wants to watch the world burn.
Sentiment: 0
	Confidence: 96.02287%

Comment: This Professor is a very helpful, but is extremely difficult. The homework is a lot of work and expect the final to very very difficult.
Sentiment: 1
	Confidence: 99.91636%

Comment: Think it was his first time teaching 221 so he was disorganized on the syllab

#### Examining probability values

`probs` is a list of lists, where each inner list holds the class probabilities for a single comment: index 0 is the probability of *negative* sentiment and index 1 is the probability of *positive* sentiment.

For example, `probs[1][0]` (0.960228721961716) is the model's confidence that the second comment has *negative* sentiment, and the value next to it, `probs[1][1]` (0.03977127803830788), is the probability that it has *positive* sentiment.

In [9]:
probs

[[0.9999527030073255, 4.7296992719313765e-05],
 [0.960228721961716, 0.03977127803830788],
 [0.0008364494577630544, 0.9991635505422285],
 [3.0200411033148475e-05, 0.9999697995889829],
 [0.9999361542766583, 6.384572329533306e-05],
 [0.006200802878954854, 0.993799197121028],
 [0.8961740800108247, 0.10382591998917863]]
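
To read a predicted label and its confidence straight from these rows, one option is to take the index of the largest value in each inner list. The sketch below uses the first three rows printed above; `top_class` is a hypothetical helper, and it assumes the column order matches the labels 0 and 1 (in scikit-learn the columns follow the classifier's `classes_` attribute).

```python
# First three rows of probs, copied from the output above
probs = [
    [0.9999527030073255, 4.7296992719313765e-05],
    [0.960228721961716, 0.03977127803830788],
    [0.0008364494577630544, 0.9991635505422285],
]

def top_class(row):
    # Index of the largest probability doubles as the predicted label (assumed 0/1)
    label = max(range(len(row)), key=row.__getitem__)
    return label, row[label]

for i, row in enumerate(probs):
    label, conf = top_class(row)
    print(f"comment {i}: label={label}, confidence={conf:.5%}")
```

For these rows the labels come out as 0, 0, and 1, matching the sentiments printed by `printResults` for the first three comments.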