### Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

## preprocessing - NLP
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

## workflow
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV

## models
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

## metrics
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score, precision_score, f1_score, confusion_matrix

In [2]:
%run 00_Workflow_Functions.ipynb import na_only, api_call, data_wrangling

In [3]:
subs = pd.read_csv('../datasets/submissions_data.csv')
subs.shape

(9717, 10)

In [4]:
subs.head(5)

Unnamed: 0,author,author_fullname,created_utc,selftext,title,subreddit,is_video,num_comments,score,upvote_ratio
0,anonymousbrowzer,t2_14k10v,1648770000.0,,"When the smoke detector goes off from cooking,...",lifehacks,False,0.0,1.0,1.0
1,PlantBasedRedditor,t2_g4e0rfz,1648767000.0,,Use Goo Gone on scissors and blades to reduce ...,lifehacks,False,0.0,1.0,1.0
2,CryptographerFar5073,t2_ldjcr311,1648764000.0,,Bingo Bash,lifehacks,False,0.0,1.0,1.0
3,Giant_weiner_not_dog,t2_konlr4kt,1648763000.0,,How to troll someone,lifehacks,False,0.0,1.0,1.0
4,Giant_weiner_not_dog,t2_konlr4kt,1648762000.0,,what a nice way to have your meal( credit to u...,lifehacks,False,0.0,1.0,1.0


In [5]:
subs.tail(5)

Unnamed: 0,author,author_fullname,created_utc,selftext,title,subreddit,is_video,num_comments,score,upvote_ratio
9712,sscorpio77,t2_jugpp8rg,1646173000.0,[removed],LPT: You won’t have to constantly brake to avo...,LifeProTips,False,2.0,1.0,1.0
9713,CreatorVilla,t2_f1xkvrju,1646173000.0,,"LPT: If you want someone to trust you, approac...",LifeProTips,False,1.0,1.0,1.0
9714,TheNative93,t2_52ulcx1x,1646173000.0,"If they have a number on their website, or on ...",LPT Whenever you submit a resume don’t wait fo...,LifeProTips,False,1.0,1.0,1.0
9715,duskymk,t2_22umtf1d,1646173000.0,[removed],LPT: Always bring your phone with you in a pub...,LifeProTips,False,1.0,1.0,1.0
9716,shwarma_heaven,t2_ddvb4,1646172000.0,,LPT: Just because a car stops before a parking...,LifeProTips,False,1.0,1.0,1.0


In [6]:
subs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9717 entries, 0 to 9716
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   author           9717 non-null   object 
 1   author_fullname  9717 non-null   object 
 2   created_utc      9717 non-null   float64
 3   selftext         5557 non-null   object 
 4   title            9717 non-null   object 
 5   subreddit        9717 non-null   object 
 6   is_video         9717 non-null   bool   
 7   num_comments     9717 non-null   float64
 8   score            9717 non-null   float64
 9   upvote_ratio     9717 non-null   float64
dtypes: bool(1), float64(4), object(5)
memory usage: 692.8+ KB


In [7]:
na_only(subs)

selftext    4160
dtype: int64

When we wrangled the data, no NAs were reported. After exporting to CSV and re-importing, however, `selftext` now contains NAs, most likely because `read_csv` parses empty strings as NaN by default. There also appears to be a placeholder value, `[removed]`, indicating that a submission was deleted and its text redacted. We will need to drop all of these rows for NLP, as they are not useful to us.
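
As a minimal sketch of why NAs can appear after a CSV round trip, the snippet below writes a tiny, made-up frame with an empty `selftext` value and reads it back. The frame and its values are assumptions for illustration only; `keep_default_na=False` is shown as one possible way to keep empty strings as-is, not as what this notebook does.

```python
import io
import pandas as pd

# Hypothetical two-row frame; the first post has an empty (not missing) selftext.
original = pd.DataFrame({"title": ["a", "b"], "selftext": ["", "some text"]})
na_before = int(original["selftext"].isna().sum())

# Write to an in-memory CSV instead of a file on disk.
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Default parsing turns the empty field into NaN.
buffer.seek(0)
round_trip = pd.read_csv(buffer)
na_after = int(round_trip["selftext"].isna().sum())

# With keep_default_na=False the empty string survives the round trip.
buffer.seek(0)
kept = pd.read_csv(buffer, keep_default_na=False)

print(na_before, na_after, kept["selftext"].tolist())
```

Since empty posts are dropped in the next section anyway, the default behaviour is harmless here; it just explains the discrepancy with the wrangling step.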

In [8]:
subs['subreddit'].value_counts(normalize=True)

lifehacks      0.520119
LifeProTips    0.479881
Name: subreddit, dtype: float64

Prior to dropping NAs, our two classes are nearly balanced.

### Dropping erroneous data

Let's drop NAs from our dataset entirely, since only `selftext` data with content is relevant for us.

In [9]:
# dropping NAs
subs = subs.dropna()
#subs.reset_index(drop=True, inplace=True) #resetting index after dropping rows
subs.shape

(5557, 10)

In addition to NAs, let's remove `[removed]` and selftext posts that are fewer than `10` words long (a subjective choice). We do this because a post of only a few words may be referencing a multimedia file or hyperlink, which is noise that will affect our model performance.

In [10]:
# storing all (position, value) pairs of erroneous data in an array
text_filter = np.array([[i, text] for i, text in enumerate(subs['selftext']) if len(str(text).split()) < 10])

In [11]:
pd.DataFrame(text_filter[:, 1]).value_counts()

[removed]                                                                                                        2933
to avoid their insanely high prices this summer!                                                                    2
Original:\nhttps://m.youtube.com/watch?v=SQNtGoM3FVU\n\nFor download:\nhttps://m.yout.com/watch?v=SQNtGoM3FVU       2
\n\n[View Poll](https://www.reddit.com/poll/rjy7cm)                                                                 1
Went to make cookies and this did the trick.                                                                        1
                                                                                                                 ... 
It's the most commonly used utensil.                                                                                1
It’s less risky, and you can eat the evidence.                                                                      1
It’s simply satisfying.                                 

In [12]:
ind = list(text_filter[:, 0].astype(int)) # these positions contain the erroneous data
ind[:10]

[0, 3, 4, 6, 7, 8, 9, 10, 11, 12]

In [14]:
subs.iloc[ind, 3] = None

In [18]:
print(len(ind), subs.iloc[ind, 3].isna().sum())

3081 3081


Values match; we have replaced all erroneous text with NAs and will now drop them.

In [19]:
# dropping NAs
subs = subs.dropna()
subs.shape

(2476, 10)
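
The filtering rule used above (drop `[removed]` and any post shorter than 10 words) can also be written as a single predicate. This is a sketch, not the notebook's method: `is_usable` and the sample strings are hypothetical names and values introduced for illustration.

```python
def is_usable(text, min_words=10):
    """Return True if text is a real post body with at least min_words words."""
    # Missing values (None/NaN) are not strings, so they are rejected here.
    if not isinstance(text, str):
        return False
    # Redacted posts carry the literal placeholder "[removed]".
    # (It is also caught by the word count below; the check is kept for clarity.)
    if text.strip() == "[removed]":
        return False
    # Whitespace-delimited word count, as in the filtering cell above.
    return len(text.split()) >= min_words


# Hypothetical sample values, for illustration only.
samples = [
    "[removed]",
    None,
    "It's the most commonly used utensil.",
    "This is a longer post body that has clearly more than ten words in it.",
]
flags = [is_usable(s) for s in samples]
print(flags)
```

Applied to the frame, something like `subs[subs['selftext'].apply(is_usable)]` would keep only the usable rows in one step, without tracking positional indices or writing `None` back into the column.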

In [33]:
na_only(subs)

0

In [20]:
subs['subreddit'].value_counts(normalize=True)

LifeProTips    0.758885
lifehacks      0.241115
Name: subreddit, dtype: float64

After dropping NAs, `LifeProTips` has become our majority class.

### Train and Test Splits

In [22]:
# we are only interested in the self-text
X = subs[['selftext']]
y = subs['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=14, stratify=y)

In [23]:
print(X_train.shape, X_test.shape)

(1857, 1) (619, 1)


In [24]:
print(y_train.shape, y_test.shape)

(1857,) (619,)
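
No `test_size` was passed to `train_test_split`, so scikit-learn's default of 0.25 applies. As a quick arithmetic check (a sketch that assumes the test count is the ceiling of `test_size * n_samples` and the remainder goes to training), the shapes above follow from the 2476 cleaned rows:

```python
import math

n_samples = 2476   # rows left after cleaning, from subs.shape above
test_size = 0.25   # default when test_size and train_size are both omitted

# The test set takes the ceiling of the requested fraction.
n_test = math.ceil(test_size * n_samples)
# Everything else goes to the training set.
n_train = n_samples - n_test

print(n_train, n_test)
```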


In [25]:
y_test.value_counts(normalize=True) #verifying stratification worked

LifeProTips    0.759289
lifehacks      0.240711
Name: subreddit, dtype: float64

### Preprocessing

We encode our response as labels of 1 and 0, where 1 is `lifehacks` and 0 is `LifeProTips`.

In [26]:
le = LabelEncoder()

In [27]:
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)

In [28]:
pd.DataFrame(y_train_encoded).value_counts(normalize=True)

0    0.758751
1    0.241249
dtype: float64
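
`LabelEncoder` assigns integer codes in the sorted order of the class labels, and uppercase letters sort before lowercase ones, which is why `LifeProTips` becomes 0 and `lifehacks` becomes 1. The sketch below mimics that ordering in plain Python rather than calling scikit-learn itself:

```python
# Sort the distinct labels the way string comparison does (by code point),
# so the capital "L" in "LifeProTips" comes before the lowercase "l" in "lifehacks".
classes = sorted({"lifehacks", "LifeProTips"})

# Assign codes 0, 1, ... in that order.
mapping = {label: code for code, label in enumerate(classes)}

print(classes, mapping)
```

On the fitted encoder, `le.classes_` shows the same ordering, and `le.inverse_transform` maps codes back to subreddit names.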

Although we started with nearly balanced classes, after preprocessing our dataset is fairly imbalanced (roughly 76% `LifeProTips` to 24% `lifehacks`). This may affect our model performance down the road. We may be able to mitigate this by wrangling more data from the subreddits. We will proceed for now.
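
To make the effect of the imbalance concrete, the sketch below computes what a model that always predicts the majority class would score on the training set. The class counts are an inference from the 1857 training rows and the printed proportions, not values printed in the notebook.

```python
from collections import Counter

# Training-set class counts, reconstructed from 1857 rows and the
# proportions 0.758751 / 0.241249 shown above (an assumption, not printed output).
counts = Counter({"LifeProTips": 1409, "lifehacks": 448})
total = sum(counts.values())

# Always predicting the majority class gets every majority row right.
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total

# Balanced accuracy averages per-class recall: 1.0 on the majority class,
# 0.0 on the minority class.
baseline_balanced_accuracy = (1.0 + 0.0) / len(counts)

print(majority_label, round(baseline_accuracy, 3), baseline_balanced_accuracy)
```

Any model scoring close to 0.76 accuracy is therefore doing no better than this baseline, which is a reason to look at balanced accuracy, recall, and F1 alongside plain accuracy.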

Let's vectorize our training data.