# Post Here: Subreddit Predictor

## Recommendation API - 1.1

> aka: the MVP Classifier®

---
---

## Intro - MVP Classifier (Model #2)

The second iteration of the model for recommending (predicting) appropriate subreddit(s) will be built using a somewhat naive approach to text analysis and a multi-class classification model.

The model will be trained using the [reddit self-post classification task dataset](https://www.kaggle.com/mswarbrickjones/reddit-selfposts), available on Kaggle thanks to [Evolution AI](https://evolution.ai//blog/page/5/an-imagenet-like-text-classification-task-based-on-reddit-posts/).

The full dataset includes 1,013,000 rows (1000 records each from 1013 subreddits). To keep things manageable for the proof-of-concept, the first 100,000 records will be used to train the embeddings and models.

According to the dataset description on Kaggle, the rows have already been shuffled. Reading in the first 100k records should therefore be roughly equivalent to drawing a random sample, and should not introduce additional sampling bias.
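As a rough sketch of how such a subset could be read, `pandas.read_csv` accepts an `nrows` argument that stops parsing after the first N records. The helper name `load_first_rows` and the filename in the usage note are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def load_first_rows(path, n_rows=100_000, sep="\t"):
    """Read only the first `n_rows` records of a delimited text file."""
    # nrows makes pandas stop parsing after n_rows data rows (header excluded)
    return pd.read_csv(path, sep=sep, nrows=n_rows)
```

For example, `load_first_rows("rspct.tsv")` would return a 100,000-row DataFrame, assuming the full dataset is saved under that (hypothetical) filename.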

---

### Imports

In [2]:
# === General imports === #
import pandas as pd
import numpy as np
import os

In [3]:
# === sklearn imports === #
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB

---

### Load and preprocess the data

In [4]:
# === Load the dataset === #
df1 = pd.read_csv("rspct_100k.csv", sep="\t")
df1.head()

Unnamed: 0,id,subreddit,title,selftext
0,0,talesfromtechsupport,Remember your command line switches...,"Hi there, <lb>The usual. Long time lerker, fi..."
1,1,teenmom,"So what was Matt ""addicted"" to?",Did he ever say what his addiction was or is h...
2,2,Harley,No Club Colors,Funny story. I went to college in Las Vegas. T...
3,3,ringdoorbell,"Not door bell, but floodlight mount height.",I know this is a sub for the 'Ring Doorbell' b...
4,4,intel,Worried about my 8700k small fft/data stress r...,"Prime95 (regardless of version) and OCCT both,..."


In [5]:
# === First looks === #
print(df1.shape)
df1.head()

(100000, 4)




The MVP model will be trained on the selftext column only. Thus, we only need `selftext` (X) and `subreddit` (y).

In [6]:
# === Split up dataset into train and test === #

# 80% train, 20% test, stratified on the target
train, test = train_test_split(df1, test_size=0.2, stratify=df1["subreddit"])

train.shape, test.shape

((80000, 4), (20000, 4))
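To illustrate what stratifying on the target does, here is a minimal sketch on made-up toy data (the labels and sizes are invented for the example). With `stratify` set, each class keeps the same share in the train and test sets.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy data: 10 classes with 10 samples each (invented for illustration)
labels = [f"sub_{i}" for i in range(10) for _ in range(10)]
texts = [f"post {n} in {label}" for n, label in enumerate(labels)]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Count how many posts each class contributes to each split
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(sorted(set(train_counts.values())), sorted(set(test_counts.values())))
```

Because every toy class has exactly 10 posts, an 80/20 stratified split leaves 8 of each class in the training set and 2 in the test set.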

In [7]:
# === Arrange data into feature and target === #

X_train = train["selftext"]
X_test = test["selftext"]

y_train = train["subreddit"]
y_test = test["subreddit"]

In [8]:
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)

(80000,) (20000,)
(80000,) (20000,)


---
---

### Vectorization

The vectorizer that converts the words into numbers does not analyze the text for meaning - remember, this is the MVP model. We can get as crazy as we want after we have a working baseline.

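TF-IDF vectorization weights each word by how often it appears in a document (term frequency), scaled by how rare it is across all documents (inverse document frequency). Words that are frequent in one document but uncommon in the rest of the dataset receive the highest weights, which is how the method surfaces what is distinctive about each document. In this context, "document" refers to an individual reddit post.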
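To make the weighting concrete, here is a small pure-Python sketch of the smoothed formula that scikit-learn's `TfidfVectorizer` uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The toy corpus and the function name are made up for illustration, and tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter

# Toy corpus (invented): each string stands in for one reddit post
docs = [
    "my printer will not print",
    "my cat will not eat",
    "my printer ate my cat",
]


def tfidf_vectors(documents):
    """Return (vocabulary, idf weights, L2-normalized TF-IDF vectors)."""
    tokenized = [doc.split() for doc in documents]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(documents)

    # Document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw))
        vectors.append([w / norm for w in raw])
    return vocab, idf, vectors


vocab, idf, vectors = tfidf_vectors(docs)
# "my" occurs in every document, so it gets the minimum idf;
# "print" occurs in only one, so it is weighted more heavily
print(round(idf["my"], 3), round(idf["print"], 3))
```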

The vectorizer can be instantiated then "trained" on the dataset. One way to think about the training step is that it builds a vectorized vocabulary of the words in the dataset.
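A minimal sketch of the instantiate-then-fit pattern on a made-up two-post corpus. After fitting, the learned vocabulary is available as the `vocabulary_` attribute, a dict mapping each term to its column index.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (invented for illustration)
toy_posts = ["the cat sat", "the dog barked"]

toy_tfidf = TfidfVectorizer()  # instantiate with default settings
toy_tfidf.fit(toy_posts)       # "train": learn the vocabulary and idf weights

# Term -> column index, assigned in alphabetical order
print(toy_tfidf.vocabulary_)

# Transforming maps any text onto the learned columns;
# words that were not seen during fit are ignored
toy_matrix = toy_tfidf.transform(toy_posts)
unseen = toy_tfidf.transform(["a bird sang"])
print(toy_matrix.shape, unseen.nnz)
```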

The TF-IDF implementation that will be used in the MVP comes from scikit-learn:

> [sklearn.feature_extraction.text.TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

A custom tokenizer function can be passed into the vectorizer to increase the quality of the tokens (the units of text, such as words or lemmas, that get counted). The tokenizer function below uses the NLP library [spaCy](https://spacy.io/usage/).

However, for the MVP, we will go with the tokenizer built into the TfidfVectorizer class.

In [None]:
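# === Custom tokenizer function === #
import spacy

# Assumption: any English spaCy pipeline with a POS tagger works here;
# "en_core_web_sm" is used as an example and must be downloaded separately
nlp = spacy.load("en_core_web_sm")

def tokenize(doc):
    """
    Extracts nouns and adjectives from a string of text.
    Returns a list of lowercased lemma strings.
    """

    doc = nlp(doc)
    na_tokens = []

    for token in doc:
        # Keep nouns and adjectives, skipping stop words and punctuation
        if (
            not token.is_stop
            and not token.is_punct
            and token.pos_ in ("NOUN", "ADJ")
        ):
            na_tokens.append(token.lemma_.strip().lower())

    return na_tokens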
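
For reference, this is roughly how a custom tokenizer could be plugged in through the vectorizer's `tokenizer` parameter. To keep the sketch self-contained, a simple regex-based stand-in (`simple_tokenize`, a hypothetical helper) is used in place of the spaCy function above.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer


def simple_tokenize(doc):
    """Stand-in tokenizer: lowercase alphabetic words of 3+ characters."""
    return re.findall(r"[a-z]{3,}", doc.lower())


# token_pattern is unused when a tokenizer is supplied; passing None
# avoids the warning scikit-learn would otherwise emit
custom_tfidf = TfidfVectorizer(tokenizer=simple_tokenize, token_pattern=None)
custom_tfidf.fit(["My GPU is overheating!", "Is my CPU too old?"])

print(sorted(custom_tfidf.vocabulary_))
```

Short words such as "my" and "is" never reach the vocabulary here because the stand-in tokenizer drops them, which is the kind of control a custom tokenizer provides.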

In [9]:
# === Encode the target using LabelEncoder === #

# This process naively transforms each category of the target into a number
# from sklearn.preprocessing import LabelEncoder

le = LabelEncoder() # Instantiate a new encoder instance
le.fit(y_train)  # Fit it on training label data

# Transform both using the train fit
y_train = le.transform(y_train)
y_test  = le.transform(y_test)

y_train[:8]

array([650, 246, 191, 215, 426, 779, 400, 502])
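
A small sketch of what the encoder does, using a few made-up labels. `LabelEncoder` sorts the unique classes and assigns each one its index, and `inverse_transform` maps the integers back to names.

```python
from sklearn.preprocessing import LabelEncoder

toy_le = LabelEncoder()
toy_le.fit(["yoga", "intel", "Harley", "intel"])

# Classes are stored sorted; uppercase letters sort before lowercase
print(toy_le.classes_)

# Encode labels as integers, then map them back to the original names
encoded = toy_le.transform(["intel", "yoga"])
decoded = toy_le.inverse_transform(encoded)
print(encoded, decoded)
```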

In [10]:
# === Vectorize! === #

# Extract features from the text data using the bag-of-words approach (single words + bigrams).
# Uses tfidf weighting (helps a little for Naive Bayes in general).
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.feature_selection import chi2, SelectKBest

# TODO: use spacy's stopwords

tfidf = TfidfVectorizer(
    max_features=100000,
    min_df=5,
    ngram_range=(1,2),
    stop_words=None,
    token_pattern='(?u)\\b\\w+\\b',
)

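# Fit the vectorizer on the feature column to create vocab
# This process is split into component parts to make pickling the vocab simpler
# Note: fit() returns the fitted vectorizer itself, so `vocab` and `tfidf` refer to the same object
vocab = tfidf.fit(X_train)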

# Get sparse document-term matrix for training data
X_train_sparse = tfidf.transform(X_train)

# Get sparse document-term matrix for test data
X_test_sparse = tfidf.transform(X_test)

In [11]:
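# === Prune the resulting featureset to most useful features === #
from sklearn.feature_selection import chi2, SelectKBest

# Instantiate the selector (k must be passed by keyword in recent scikit-learn releases)
chi2_selector = SelectKBest(chi2, k=50000)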

# Fit on the training data
chi2_selector.fit(X_train_sparse, y_train) 

# Transform both train and test feature
X_train_select = chi2_selector.transform(X_train_sparse)
X_test_select  = chi2_selector.transform(X_test_sparse)

# Take a look at the result
X_train_select.shape, X_test_select.shape

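((810400, 50000), (202600, 50000))

Note: the row counts above (810,400 and 202,600) appear to come from an earlier run on the full dataset. With the 100k subset loaded in this notebook, the same cell would produce 80,000 training rows and 20,000 test rows.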
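
To show what the chi-squared selector keeps, here is a sketch on an invented count matrix in which only the first term's counts differ between the two classes.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy term counts (invented): 4 posts x 3 terms, two classes
X_toy = np.array([
    [5, 1, 2],
    [4, 1, 1],
    [0, 1, 2],
    [0, 1, 1],
])
y_toy = np.array([0, 0, 1, 1])

# Keep the single term whose counts depend most strongly on the class
toy_selector = SelectKBest(chi2, k=1)
X_toy_select = toy_selector.fit_transform(X_toy, y_toy)

# Boolean mask over the original columns, plus the reduced shape
print(toy_selector.get_support(), X_toy_select.shape)
```

Terms 1 and 2 are distributed identically across both classes, so their chi-squared scores are zero and only term 0 survives.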

In [11]:
# === Naive Bayes model === #

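# Instantiate and train the model
# Note: this fits on the full TF-IDF matrix (X_train_sparse),
# not on the chi2-selected features (X_train_select)
nb = MultinomialNB(alpha=0.1)
nb.fit(X_train_sparse, y_train)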

MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)

In [12]:
# === Create predictions on test feature === #
y_pred_proba = nb.predict_proba(X_test_sparse)

print(y_pred_proba.shape)
y_pred_proba[:10]

(20000, 1013)


array([[0.00025731, 0.0017611 , 0.0004326 , ..., 0.00104355, 0.00089922,
        0.00120952],
       [0.00024139, 0.00145366, 0.00139822, ..., 0.00128417, 0.0007256 ,
        0.00069661],
       [0.00055992, 0.00337738, 0.00059392, ..., 0.00074112, 0.00057163,
        0.00061933],
       ...,
       [0.00030182, 0.00194612, 0.00069209, ..., 0.00124686, 0.00065143,
        0.00065566],
       [0.00034963, 0.00143321, 0.00102417, ..., 0.00224432, 0.00052771,
        0.0008239 ],
       [0.00048969, 0.00347422, 0.00072534, ..., 0.00075197, 0.00061453,
        0.0021289 ]])

In [13]:
# === For each prediction, find the index with the highest probability === #
y_pred = np.argmax(y_pred_proba, axis=1)
y_pred[:10]

array([862, 345, 654, 453, 867, 530, 453, 835, 126, 311])

In [14]:
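# === Evaluate performance using precision-at-k === #

def precision_at_k(y_true, y_pred, k=5):
    """
    Fraction of samples whose true label is among the k classes
    with the highest predicted probability.
    """
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    # Class indices sorted by ascending probability, per row
    y_pred = np.argsort(y_pred, axis=1)
    # Reverse to descending order and keep the k most probable classes
    y_pred = y_pred[:, ::-1][:, :k]
    # For each sample, check whether the true label is in its top k
    arr = [y in s for y, s in zip(y_true, y_pred)]
    return np.mean(arr)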

print('precision@1 =', np.mean(y_test == y_pred))
print('precision@3 =', precision_at_k(y_test, y_pred_proba, 3))
print('precision@5 =', precision_at_k(y_test, y_pred_proba, 5))

# With all 1m records
# precision@1 = 0.6732724580454097
# precision@3 = 0.8058588351431392
# precision@5 = 0.8481391905231984

# With 100k records
# precision@1 = 0.45935
# precision@3 = 0.6075
# precision@5 = 0.665

precision@1 = 0.45935
precision@3 = 0.6075
precision@5 = 0.665
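
As a quick sanity check of the metric, here is a self-contained sketch on a hand-made probability matrix. The function name `top_k_hit_rate` is hypothetical; it mirrors the logic of `precision_at_k` above, and the numbers are invented.

```python
import numpy as np


def top_k_hit_rate(y_true, proba, k):
    """Share of rows whose true class index is among the k highest-probability columns."""
    top_k = np.argsort(proba, axis=1)[:, ::-1][:, :k]
    return np.mean([y in row for y, row in zip(y_true, top_k)])


# Invented example: 3 samples, 3 classes
toy_proba = np.array([
    [0.1, 0.6, 0.3],
    [0.5, 0.2, 0.3],
    [0.2, 0.3, 0.5],
])
toy_true = np.array([2, 1, 2])

for k in (1, 2, 3):
    print(k, top_k_hit_rate(toy_true, toy_proba, k))
```

Only the third sample is correct at k=1, the first one is recovered at k=2, and every sample is covered at k=3, so the score can only grow as k increases.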


---

### Predict subreddit from new input

Now that our model is trained and we have a baseline performance metric that is not half-bad, we can use the trained model to predict which subreddit a new piece of data (a post) belongs to.

In order to do this, the post has to be vectorized with the same fitted vectorizer that was used on the training data.
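
The flow can be sketched end to end on a tiny made-up corpus (the posts and labels below are invented). The key point is that the new post goes through `transform()` only, so its columns line up with the vocabulary the model was trained on.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (invented for illustration)
toy_posts = [
    "my laptop will not boot after the update",
    "which graphics card should I buy for gaming",
    "my sourdough starter is not rising",
    "how long should I bake this bread",
]
toy_labels = ["techsupport", "techsupport", "baking", "baking"]

toy_tfidf = TfidfVectorizer()
toy_nb = MultinomialNB(alpha=0.1)
toy_nb.fit(toy_tfidf.fit_transform(toy_posts), toy_labels)

# Vectorize the new post with the already-fitted vectorizer (no refitting)
new_post = ["my bread dough is not rising"]
new_sparse = toy_tfidf.transform(new_post)

# One row of probabilities, with one column per class
new_proba = toy_nb.predict_proba(new_sparse)
print(new_proba.shape, toy_nb.predict(new_sparse))
```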

In [15]:
# === Example post === #

# The example comes from 'r/learnprogramming'
post = """I am a new grad looking for a job and currently in the process with a company for a junior backend engineer role. I was under the impression that the position was Javascript but instead it is actually Java. My general programming and "leet code" skills are pretty good, but my understanding of Java is pretty shallow. How can I use the next three days to best improve my general Java knowledge? Most resources on the web seem to be targeting complete beginners. Maybe a book I can skim through in the next few days?

Edit:

A lot of people are saying "the company is a sinking ship don't even go to the interview". I just want to add that the position was always for a "junior backend engineer". This company uses multiple languages and the recruiter just told me the incorrect language for the specific team I'm interviewing for. I'm sure they're mainly interested in seeing my understanding of good backend principles and software design, it's not a senior lead Java position."""

In [16]:
# === Vectorize the example post using the trained vocab === #
post_sparse = tfidf.transform([post])

In [22]:
# === Transform using chi2 === #
post_select = chi2_selector.transform(post_sparse)
post_select

<1x50000 sparse matrix of type '<class 'numpy.float64'>'
	with 178 stored elements in Compressed Sparse Row format>

In [17]:
# === Generate prediction === #
post_pred_proba = nb.predict_proba(post_sparse)
post_pred_proba

array([[0.00021132, 0.00085588, 0.00038891, ..., 0.00134066, 0.00053571,
        0.00043089]])

In [18]:
# === Argsort the probabilities and take the first 10 indices === #
# np.argsort sorts ascending, so this slice holds the 10 LEAST probable classes
post_pred_proba_sorted = np.argsort(post_pred_proba[0])
pred_sorted_index = post_pred_proba_sorted[:10]
pred_sorted_index

array([197, 856, 407, 577, 400, 247, 874, 662, 993,  93])

In [19]:
# === Plug that into the label classes to get hydrated af === #
post_pred_top10 = le.classes_[pred_sorted_index]
print(post_pred_top10)

['KeybaseProofs' 'sharditkeepit' 'WouldYouRather' 'emojipasta'
 'WayfarersPub' 'MtvChallenge' 'soccer' 'ifttt' 'whatsthisplant'
 'DanLeBatardShow']


In [60]:
post_pred_top10.shape

(10,)

In [21]:
# === Test it out with another dummy post === #

# This one comes from r/suggestmeabook
post2 = """I've been dreaming about writing my own stort story for a while but I want to give it an unexpected ending. I've read lots of books, but none of them had the plot twist I want. I want to read books with the best plot twists, so that I can analyze what makes a good plot twist and write my own story based on that points. I don't like romance novels and I mostly enjoy sci-fi or historical books but anything beside romance novels would work for me, it doesn't have to be my type of novel. I'm open to experience after all. I need your help guys. Thanks in advance."""

# === Vectorize the example post using the trained vocab === #
post2_sparse = tfidf.transform([post2])

# === Generate prediction === #
post2_pred_proba = nb.predict_proba(post2_sparse)

In [20]:
# === Return the indices that would sort and return top 10 === #
pred2_sorted_index = np.argsort(post2_pred_proba[0])[:10]

In [20]:
# === Plug that into the label classes to get hydrated af === #
post2_pred_top10 = le.classes_[pred2_sorted_index]
# Careful: this prints post_pred_top10 (the first post), not post2_pred_top10
print(post_pred_top10)

['KeybaseProofs' 'sharditkeepit' 'WouldYouRather' 'emojipasta'
 'WayfarersPub' 'MtvChallenge' 'soccer' 'ifttt' 'whatsthisplant'
 'DanLeBatardShow']


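The predictions are identical to those for the first post, and none of them look remotely relevant, meaning something didn't work. Two things stand out in the cells above: the `print` call outputs `post_pred_top10` (the first post's result) rather than `post2_pred_top10`, and `np.argsort` sorts in ascending order, so slicing `[:10]` selects the ten least probable subreddits instead of the ten most probable.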

In [27]:
post2_pred_proba.shape

(1, 1013)

In [39]:
df_post2_pred = pd.DataFrame(post2_pred_proba.T, columns=["proba"])
df_post2_pred.sort_values(by="proba", ascending=False)

Unnamed: 0,proba
905,0.156121
579,0.029756
1005,0.027238
74,0.010433
101,0.008873
...,...
874,0.000056
247,0.000053
400,0.000037
856,0.000028
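
The sorted DataFrame suggests the probabilities themselves are sensible (one class clearly dominates), so the problem likely lies in how the top indices are picked. Since `np.argsort` returns indices in ascending order, one way to get the most probable labels is to reverse the order before slicing. A minimal sketch with made-up class names and probabilities; `top_k_labels` is a hypothetical helper:

```python
import numpy as np

# Invented stand-ins for le.classes_ and one row of predict_proba output
toy_classes = np.array(["Harley", "intel", "teenmom", "yoga", "zootopia"])
toy_proba = np.array([[0.05, 0.40, 0.10, 0.30, 0.15]])


def top_k_labels(proba_row, classes, k=3):
    """Return the k class names with the highest probability, best first."""
    # argsort is ascending, so reverse it before slicing
    top_idx = np.argsort(proba_row)[::-1][:k]
    return classes[top_idx]


print(top_k_labels(toy_proba[0], toy_classes))
```

Applied to the notebook's objects, the equivalent call would be something like `top_k_labels(post2_pred_proba[0], le.classes_, k=10)`.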


In [None]:
# === Sort the resulting array and return top 10 === #
post_pred_proba_sorted = np.sort(post_pred_proba, axis=None)
post_pred_proba_sorted

In [39]:
# Inverse transform
# (fails: inverse_transform expects a 1-D array of encoded labels, not a probability matrix)
post_pred_proba_inversed = le.inverse_transform(post_pred_proba)

ValueError: bad input shape (1, 1013)

In [28]:
# === Cast result to dataframe === #
df_ppp = pd.DataFrame(post_pred_proba.T).reset_index()
df_ppp.head()

Unnamed: 0,index,0
0,0,6.592764e-07
1,1,1.02292e-05
2,2,1.042813e-06
3,3,1.616618e-06
4,4,4.187258e-06


In [32]:
y_train.shape

(810400,)

In [31]:
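# === Look at the encoded labels to be sure it matches === #
# get_params() only returns constructor parameters (LabelEncoder has none);
# the fitted labels are stored in le.classes_
le.get_params()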

{}

In [None]:
# === Now to inverse transform the encoded labels === #

In [35]:
y_train_df = pd.DataFrame(y_train)
y_train_df.head()

Unnamed: 0,0
0,691
1,289
2,117
3,952
4,77


In [36]:
le.classes_

array(['13ReasonsWhy', '3Dprinting', '3d6', ..., 'yoga', 'yorku',
       'zootopia'], dtype=object)

---

### Picklization

In [61]:
# === Create pickle func to make pickling (a little) easier === #

def picklizer(to_pickle, filename, path):
    """
    Creates a pickle file.
    
    Parameters
    ----------
    to_pickle : Python object
        The trained / fitted instance of the 
        transformer or model to be pickled.
    filename : string
        The desired name of the output file,
        including the '.pkl' extension.
    path : string or path-like object
        The path to the desired output directory.
    """
    import os
    import pickle

    # Create the path to save location
    picklepath = os.path.join(path, filename)

    # Use context manager to open file
    with open(picklepath, "wb") as p:
        pickle.dump(to_pickle, p)

In [62]:
# === Picklize! === #
filepath = "../../models"

# Export vectorizer as pickle
picklizer(vocab, "02_vocab.pkl", filepath)

# Export chi2 selector as pickle
picklizer(chi2_selector, "02_chi.pkl", filepath)

# Export naive bayes model as pickle
picklizer(nb, "02_nb.pkl", filepath)
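
On the consuming side (for example, wherever the recommendations are served), the pickled objects can be loaded back with `pickle.load`. A minimal sketch; `unpicklizer` is a hypothetical counterpart to the function above, demonstrated here on a throwaway object rather than the real model files.

```python
import os
import pickle
import tempfile


def unpicklizer(filename, path):
    """Load and return the object stored in a pickle file."""
    with open(os.path.join(path, filename), "rb") as p:
        return pickle.load(p)


# Round-trip demo with a throwaway object in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "demo.pkl"), "wb") as p:
        pickle.dump({"alpha": 0.1}, p)
    restored = unpicklizer("demo.pkl", tmp_dir)

print(restored)
```

Pickles should only be loaded from trusted sources, and scikit-learn objects are generally expected to be unpickled with the same library version that created them.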