# Preprocessing and Train-Test Split

In this notebook, I take a quick look at my cleaned dataset and binarize the target variable to 0 and 1. I also lemmatize the titles so that they are available as a feature if I need them later. Finally, I split the data into training and testing sets.

In [83]:
import pandas as pd
import reddit_functions as rf

from sklearn.model_selection import train_test_split


In [84]:
# import data
red_df = pd.read_csv('../data/all_posts_cleaned.csv')

In [85]:
red_df.isna().mean()

subreddit       0.0
title           0.0
selftext        0.0
created_utc     0.0
num_comments    0.0
post_length     0.0
dtype: float64

In [86]:
# check class balance before binarizing the target
red_df['subreddit'].value_counts()

TalesFromTheFrontDesk    4999
talesfromtechsupport     4886
Name: subreddit, dtype: int64

In [87]:
# assign the result back instead of using inplace=True on a column selection
red_df['subreddit'] = red_df['subreddit'].replace({'TalesFromTheFrontDesk': 0, 'talesfromtechsupport': 1})
red_df['subreddit'].value_counts()

0    4999
1    4886
Name: subreddit, dtype: int64
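Because the binarized labels are written back to the same CSV that is loaded at the top of the notebook, a re-run will see 0/1 values rather than subreddit names. The following is a minimal sketch of a guard for that case, not part of the original notebook: `binarize_labels` and `LABEL_MAP` are hypothetical names, and the toy Series stands in for `red_df['subreddit']`.

```python
import pandas as pd

# Hypothetical mapping, mirroring the one used in the cell above.
LABEL_MAP = {"TalesFromTheFrontDesk": 0, "talesfromtechsupport": 1}


def binarize_labels(labels: pd.Series, label_map: dict = LABEL_MAP) -> pd.Series:
    """Map subreddit names to 0/1, leaving already-binarized values alone."""
    allowed = set(label_map) | set(label_map.values())
    unexpected = set(labels.unique()) - allowed
    if unexpected:
        # Fail loudly rather than silently keeping an unmapped label.
        raise ValueError(f"Unexpected labels: {sorted(unexpected, key=str)}")
    return labels.map(lambda value: label_map.get(value, value)).astype(int)


# Toy stand-in for red_df['subreddit'].
toy = pd.Series(
    ["TalesFromTheFrontDesk", "talesfromtechsupport", "TalesFromTheFrontDesk"],
    name="subreddit",
)
binary = binarize_labels(toy)
rerun = binarize_labels(binary)  # second pass leaves the 0/1 values unchanged
```

Applying the helper twice gives the same result as applying it once, which keeps the cell safe to re-run after the file has been overwritten.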

In [88]:
# save binarized target to file (this overwrites the cleaned CSV loaded above)

red_df.to_csv('../data/all_posts_cleaned.csv', index=False)

### Lemmatize and remove stop words from titles

In [78]:
red_df['title_lemmatized'] = red_df['title'].apply(rf.lemmatize)

In [79]:
red_df.head()

|  | subreddit | title | selftext | created_utc | num_comments | post_length | title_lemmatized |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Update from my post about my coworker | Here is the original [post](https://www.reddit... | 1648499626 | 0 | 189 | Update post coworker |
| 1 | 0 | You’re forcing me! | So, for context, I work at a 3-star-ish busine... | 1648471124 | 0 | 501 | You’re forcing me! |
| 2 | 0 | “I actually have to read what I’m signing for?” | So this literally happened as I’m walking in a... | 1648445630 | 0 | 305 | “I actually read I’m signing for?” |
| 3 | 0 | The Straw(s) That Broke the Camel's Back | Hey guys, it's been a while since I've posted.... | 1648441064 | 0 | 1185 | The Straw(s) That Broke Camel's Back |
| 4 | 0 | DM Report Ideas? | Hello Front Desk,\n\nApologies if this is agai... | 1648435782 | 0 | 179 | DM Report Ideas? |
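The source of `rf.lemmatize` is not shown in this notebook, so the following is only a guess at the stop-word part of it, based on the `head()` output above. It assumes whitespace tokenization and a case-sensitive, lowercase stop list, which would explain why "The" and "for?”" survive while "the" and "from" are dropped. `remove_stop_words` and `STOP_WORDS` are hypothetical names, the stop list is a tiny illustrative one, and the sketch does no actual lemmatization.

```python
# Small illustrative stop list; an assumption, not the list used by rf.lemmatize.
STOP_WORDS = {"a", "about", "for", "from", "have", "my", "the", "to", "what"}


def remove_stop_words(title: str, stop_words: set = STOP_WORDS) -> str:
    """Drop tokens that exactly match a stop word (case-sensitive)."""
    tokens = title.split()  # split on whitespace; punctuation stays attached
    kept = [token for token in tokens if token not in stop_words]
    return " ".join(kept)


titles = [
    "Update from my post about my coworker",
    "The Straw(s) That Broke the Camel's Back",
    "DM Report Ideas?",
]
cleaned = [remove_stop_words(title) for title in titles]
```

Lowercasing and stripping punctuation before matching would catch tokens such as "The" and "for?”", at the cost of altering the title text that is kept.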


### Split and save train and test sets

In [80]:
X = red_df['selftext']
y = red_df['subreddit']

In [81]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y, test_size=0.3)
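One way to check that `stratify=y` preserved the class balance is to compare the share of class 1 in each split. The sketch below runs the same call on a made-up, perfectly balanced toy frame rather than on `red_df`, so the numbers it produces describe the toy data only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for red_df: 100 rows with perfectly balanced 0/1 labels.
toy = pd.DataFrame({
    "selftext": [f"post {i}" for i in range(100)],
    "subreddit": [0, 1] * 50,
})

X_train, X_test, y_train, y_test = train_test_split(
    toy["selftext"], toy["subreddit"],
    random_state=42, stratify=toy["subreddit"], test_size=0.3,
)

# Share of class 1 in each split; stratification should keep these close.
train_share = y_train.mean()
test_share = y_test.mean()
print(len(X_train), len(X_test))
print(train_share, test_share)
```

On the real data, the same two `mean()` calls on `y_train` and `y_test` should land close to the overall share of class 1 in `y`.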


In [82]:
X_train.to_csv('../data/train_test_sets/X_train.csv', index=False)
y_train.to_csv('../data/train_test_sets/y_train.csv', index=False)
X_test.to_csv('../data/train_test_sets/X_test.csv', index=False)
y_test.to_csv('../data/train_test_sets/y_test.csv', index=False)
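Each of these files holds a single column, because `X_train` and the others are Series, and `index=False` drops the original row positions. A downstream notebook therefore gets a one-column DataFrame back from `pd.read_csv`. The sketch below shows the round trip on a toy Series, using an in-memory buffer in place of the real file paths.

```python
import io

import pandas as pd

# Toy stand-in for X_train; the second entry contains a comma and a newline.
X_train = pd.Series(
    ["first post", "second post,\nwith a newline"],
    name="selftext",
)

buffer = io.StringIO()
X_train.to_csv(buffer, index=False)  # writes a 'selftext' header plus one row per post
buffer.seek(0)

# read_csv returns a DataFrame; select the column to get a Series again.
loaded = pd.read_csv(buffer)["selftext"]
```

The text survives the round trip, but the loaded Series has a fresh 0-based index, so rows can no longer be matched back to `red_df` by index.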