## Recreating my logistic regression model from project 3 and turning it into a predictor:

Import libraries:

In [3]:
import pandas as pd
import numpy as np
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, Pipeline


Read in the data, drop the extra column:

In [100]:
df = pd.read_csv('./data/redditcomments.csv')
df.drop(columns=['Unnamed: 0'],inplace=True)


Cleaning:

In [101]:
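df.dropna(inplace=True)
# Keep only comments that are 10 to 999 characters long
df['char_length'] = [len(comment) for comment in list(df['comment'])]
df = df[(df['char_length']>=10) & (df['char_length']<1000)].copy()  # .copy() avoids SettingWithCopyWarning on the next assignment
comments = list(df['comment'])
# Flag and drop comments that contain the standalone word 'removed'
df['flagged'] = [1 if ('removed' in comment.split(' ')) else 0 for comment in comments]
df = df[df['flagged']==0]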
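
To make the three cleaning filters easier to check, here is a minimal sketch that applies them to a tiny, invented DataFrame. Only the column names `comment` and `types` come from this notebook; the rows and the helper name `clean_comments` are made up for illustration.

```python
import pandas as pd

# Toy data, invented for illustration; only the column names come from the notebook.
toy = pd.DataFrame({
    "comment": [
        "too short",                                   # 9 characters: fails the length filter
        "This comment was removed by a moderator",     # contains the token 'removed'
        "A perfectly ordinary comment about dragons",  # should survive
        None,                                          # missing value: dropped by dropna
    ],
    "types": ["scifi", "scifi", "fantasy", "fantasy"],
})

def clean_comments(df):
    """Apply the same three filters as the cleaning cell, without mutating the input."""
    out = df.dropna().copy()
    out["char_length"] = out["comment"].str.len()
    out = out[(out["char_length"] >= 10) & (out["char_length"] < 1000)].copy()
    # Same test as the list comprehension: 'removed' must be a whole space-delimited token.
    out["flagged"] = out["comment"].apply(lambda c: int("removed" in c.split(" ")))
    return out[out["flagged"] == 0]

cleaned = clean_comments(toy)
print(cleaned["comment"].tolist())
```

One thing worth knowing about the token test: it matches only the bare word `removed`. A literal `[removed]` string would not be flagged by it, although at 9 characters it would be caught by the length filter instead.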

Split into X and y:

In [115]:
X = df['comment']
y = df['types']


Vectorize, with custom stopwords list:

In [116]:
stops_df = pd.read_csv('./data/stopwords.csv').drop(columns=['Unnamed: 0'])
stops = list(stops_df['0'])
cvec = CountVectorizer(stop_words=stops)


In [119]:
# test train split -- my data is pretty close to a 50/50 balance, but I will stratify just to be safe:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state = 42, stratify=y)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((13992,), (4664,), (13992,), (4664,))
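
As a quick illustration of what `stratify=y` does, the sketch below splits a deliberately imbalanced set of invented labels (70/30) and compares the class shares in each half. The data and variable names here are assumptions for the example, not the Reddit data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented, deliberately imbalanced labels (70/30) so the effect is visible.
y_toy = pd.Series(["scifi"] * 70 + ["fantasy"] * 30)
X_toy = pd.Series([f"comment {i}" for i in range(100)])

# Same call shape as the cell above: default 75/25 split, stratified on the labels.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42, stratify=y_toy)

train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(train_share, test_share, sep="\n")
```

Both halves should stay close to the original 70/30 mix, which is the property I want even though my real classes are nearly balanced.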

In [134]:
# Count Vectorize with the NEW stop_words list:
cv = CountVectorizer(stop_words=stops)
cv.fit(X_train,y_train)
X_train = cv.transform(X_train)
X_test  = cv.transform(X_test)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((13992, 23616), (4664, 23616), (13992,), (4664,))

In [135]:
model = LogisticRegression()
model.fit(X_train,y_train)
lr_train = model.score(X_train,y_train)
lr_test  = model.score(X_test,y_test)
print(f'Train: {lr_train}, Test: {lr_test}')

Train: 0.952544311034877, Test: 0.8072469982847341


In [122]:
text_list = ["Warhammer is a fustercluck in space", 
        "Yeah just like GOT is, but on a made up planet.",
        "I love warhammer, but my point is it’s too varied to really quantify or compare to other franchises. There’s GoT, there’s Predator, there’s Star Ship troopers, Star Wars, etc...",
        "It makes them similar, especially since star wars is fantasy in a sci-fi skirt.",
        "I came to recommend Piranesi! I love Susanna Clarke. And of course I adore Jonathan Strange and Mr Norrell, one of my all time favorite books... but reading that tome is a bit like reading a whole series of someone else’s books (that is to say it’s a commitment... might not be the quick change of pace the OP is looking for)",
        "Nearly anything by Patricia Mckillip. Forgotten Beasts of Eld is the standard rec; I'd also suggest Song for the Basilisk or the Book of Atrix Wolfe. Nearly anything by Robin McKinley. Try Sunshine or Chalice. Uprooted or Spinning Silver by Naomi Novik",    
        "Charisma. GLaDOS has no redeeming features. An amoral psychopath at the best of times, actively sadistic at others. But she has a great sense of humor, so it's easy to like her.",
        "Dark and doesn't hold back on mature themes? I'd recommend Sangrook Saga by Steve Thomas. I didn't even know he wrote horror until I read it."]    

In [123]:
comments = pd.DataFrame(text_list,columns = ['comment'])
comments = comments['comment']
comments.shape

(8,)

In [124]:
comments = cv.transform(comments)
comments.shape

(8, 23616)

In [130]:
preds = model.predict(comments)
probs = model.predict_proba(comments)

preds

array(['scifi', 'scifi', 'scifi', 'scifi', 'fantasy', 'fantasy',
       'fantasy', 'fantasy'], dtype=object)
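
The `probs` array has one column per class, and the columns follow the order of `model.classes_`. The sketch below shows that pairing on a tiny invented corpus; `toy_cv`, `toy_model`, and the example texts are assumptions for illustration and stand in for the real vectorizer and model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny invented corpus; stands in for the real training comments.
texts = ["spaceship laser planet", "alien planet spaceship",
         "dragon wizard sword", "wizard magic dragon"]
labels = ["scifi", "scifi", "fantasy", "fantasy"]

toy_cv = CountVectorizer()
toy_model = LogisticRegression().fit(toy_cv.fit_transform(texts), labels)

new_comment = toy_cv.transform(["a dragon on a spaceship planet"])
toy_probs = toy_model.predict_proba(new_comment)

# Columns of predict_proba are ordered like toy_model.classes_ (sorted labels).
by_class = dict(zip(toy_model.classes_, toy_probs[0]))
toy_pred = toy_model.predict(new_comment)[0]
print(by_class, toy_pred)
```

Zipping `classes_` with a row of `predict_proba` avoids hard-coding which column belongs to which subreddit.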

In [127]:
def predict_subreddit(comment):
    comments = pd.DataFrame([comment],columns = ['comment'])
    comments = comments['comment']
    comments = cv.transform(comments)
    preds = model.predict(comments)
    pred  = preds[0]
    probs = model.predict_proba(comments)
    prob  = np.round(np.max(probs[0])*100,2)
    print(comment)
    print(f'I am {prob}% confident that your comment belongs in r/{pred}.')
    print()
    return pred, prob


In [128]:
for comment in text_list:
    pred, prob = predict_subreddit(comment)

Warhammer is a fustercluck in space
I am 92.62% confident that your comment belongs in r/scifi.

Yeah just like GOT is, but on a made up planet.
I am 89.88% confident that your comment belongs in r/scifi.

I love warhammer, but my point is it’s too varied to really quantify or compare to other franchises. There’s GoT, there’s Predator, there’s Star Ship troopers, Star Wars, etc...
I am 99.03% confident that your comment belongs in r/scifi.

It makes them similar, especially since star wars is fantasy in a sci-fi skirt.
I am 55.07% confident that your comment belongs in r/scifi.

I came to recommend Piranesi! I love Susanna Clarke. And of course I adore Jonathan Strange and Mr Norrell, one of my all time favorite books... but reading that tome is a bit like reading a whole series of someone else’s books (that is to say it’s a commitment... might not be the quick change of pace the OP is looking for)
I am 99.38% confident that your comment belongs in r/fantasy.

Nearly anything by Patric

Make pickles:

In [140]:
filename1 = 'reddit_vectorizer.sav'
# Use a context manager so the file is closed once the vectorizer is written
with open(filename1, 'wb') as f:
    pickle.dump(cv, f)


In [141]:
filename2 = 'reddit_model.sav'
# Use a context manager so the file is closed once the model is written
with open(filename2, 'wb') as f:
    pickle.dump(model, f)
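
Since the vectorizer and the model always have to be used together, one alternative is to bundle them with `make_pipeline` (already imported at the top) and pickle that single object. This is not what the notebook does; the sketch below is a hedged illustration on a tiny invented corpus, and it round-trips through `pickle.dumps`/`pickle.loads` in memory rather than writing files.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented corpus; stands in for the real training comments.
texts = ["spaceship laser planet", "alien planet spaceship",
         "dragon wizard sword", "wizard magic dragon"]
labels = ["scifi", "scifi", "fantasy", "fantasy"]

# The pipeline takes raw strings: vectorizing and classifying happen in one call.
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
pipe.fit(texts, labels)

# Serialize and restore in memory; with files this would be pickle.dump / pickle.load.
blob = pickle.dumps(pipe)
restored = pickle.loads(blob)

# The restored pipeline should reproduce the original probabilities.
same = np.allclose(pipe.predict_proba(texts), restored.predict_proba(texts))
print(same)
```

With a pipeline, a prediction function only needs to load one file and can pass raw strings straight to `predict` and `predict_proba`.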


Check to make sure my pickles work:

(restart the kernel here)

In [1]:
import pandas as pd
import numpy as np
import pickle

text_list = ["Warhammer is a fustercluck in space", 
        "Yeah just like GOT is, but on a made up planet.",
        "I love warhammer, but my point is it’s too varied to really quantify or compare to other franchises. There’s GoT, there’s Predator, there’s Star Ship troopers, Star Wars, etc...",
        "It makes them similar, especially since star wars is fantasy in a sci-fi skirt.",
        "I came to recommend Piranesi! I love Susanna Clarke. And of course I adore Jonathan Strange and Mr Norrell, one of my all time favorite books... but reading that tome is a bit like reading a whole series of someone else’s books (that is to say it’s a commitment... might not be the quick change of pace the OP is looking for)",
        "Nearly anything by Patricia Mckillip. Forgotten Beasts of Eld is the standard rec; I'd also suggest Song for the Basilisk or the Book of Atrix Wolfe. Nearly anything by Robin McKinley. Try Sunshine or Chalice. Uprooted or Spinning Silver by Naomi Novik",    
        "Charisma. GLaDOS has no redeeming features. An amoral psychopath at the best of times, actively sadistic at others. But she has a great sense of humor, so it's easy to like her.",
        "Dark and doesn't hold back on mature themes? I'd recommend Sangrook Saga by Steve Thomas. I didn't even know he wrote horror until I read it."]    

In [2]:
filename1 = 'reddit_vectorizer.sav'
with open(filename1, 'rb') as f:
    cv_load = pickle.load(f)
filename2 = 'reddit_model.sav'
with open(filename2, 'rb') as f:
    model_load = pickle.load(f)

In [3]:
def predict_subreddit_new(comment):
    comments = pd.DataFrame([comment],columns = ['comment'])
    comments = comments['comment']
    comments = cv_load.transform(comments)
    preds = model_load.predict(comments)
    pred  = preds[0]
    probs = model_load.predict_proba(comments)
    prob  = np.round(np.max(probs[0])*100,2)
    print(comment)
    print(f'I am {prob}% confident that your comment belongs in r/{pred}.')
    print()
    return pred, prob



In [4]:
for comment in text_list:
    pred, prob = predict_subreddit_new(comment)

Warhammer is a fustercluck in space
I am 92.62% confident that your comment belongs in r/scifi.

Yeah just like GOT is, but on a made up planet.
I am 89.88% confident that your comment belongs in r/scifi.

I love warhammer, but my point is it’s too varied to really quantify or compare to other franchises. There’s GoT, there’s Predator, there’s Star Ship troopers, Star Wars, etc...
I am 99.03% confident that your comment belongs in r/scifi.

It makes them similar, especially since star wars is fantasy in a sci-fi skirt.
I am 55.07% confident that your comment belongs in r/scifi.

I came to recommend Piranesi! I love Susanna Clarke. And of course I adore Jonathan Strange and Mr Norrell, one of my all time favorite books... but reading that tome is a bit like reading a whole series of someone else’s books (that is to say it’s a commitment... might not be the quick change of pace the OP is looking for)
I am 99.38% confident that your comment belongs in r/fantasy.

Nearly anything by Patric

In [1]:
probs = [0.25, 0.75]

In [12]:
probs_dict = [{'subreddit': 'r/fantasy', 'probability':probs[0]},
              {'subreddit': 'r/scifi',   'probability':probs[1]}]
chart_data = pd.DataFrame(probs_dict).set_index(['subreddit'])
chart_data

           probability
subreddit             
r/fantasy         0.25
r/scifi           0.75


In [13]:
import matplotlib.pyplot as plt

In [17]:
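# set_index(['subreddit']) above moved 'subreddit' into the index, so
# chart_data['subreddit'] raises KeyError: 'subreddit'. Plot against the index instead:
plt.bar(chart_data.index, chart_data['probability']);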