# 02 Initial Data Processing

### Purpose of Notebook
- Convert raw JSON data into a pandas DataFrame
- Remove duplicate posts
- Convert target variable from string to integer
- Light feature engineering
- Split data into train and test sets
- Export X and y data for use later in workflow

## Imports & Functions

In [17]:
import pandas as pd
import json
import pickle
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from nltk.stem import WordNetLemmatizer
import re
from bs4 import BeautifulSoup

This function converts a list of raw post JSON objects into a pandas DataFrame, keeping only the requested fields from each post's `data` entry.

In [18]:
def posts_to_df(posts, features=('subreddit', 'author', 'title', 'selftext', 'created_utc', 'num_comments')):
    # Build one row per post, keeping only the requested fields from post['data']
    rows = [{feat: post['data'][feat] for feat in features} for post in posts]
    return pd.DataFrame(rows)
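
As a rough illustration of the input shape this helper expects, here is a minimal sketch on made-up records. The sample posts and their values are hypothetical. The variant below also uses `.get()` so that a missing field becomes an empty value instead of raising `KeyError`, which is an assumption and not what the notebook's version does.

```python
import pandas as pd

def posts_to_df(posts, features=("subreddit", "title", "selftext")):
    # Same idea as the notebook's helper, but .get() tolerates missing fields
    rows = [{feat: post["data"].get(feat) for feat in features} for post in posts]
    return pd.DataFrame(rows)

# Hypothetical records: each post keeps its fields under the "data" key
sample_posts = [
    {"kind": "t3", "data": {"subreddit": "Jokes", "title": "A joke", "selftext": "punchline"}},
    {"kind": "t3", "data": {"subreddit": "confessions", "title": "A confession"}},
]

sample_df = posts_to_df(sample_posts)
print(sample_df)
```

The second record has no `selftext`, so that cell comes back as a missing value rather than stopping the conversion.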

## Pull in raw JSON data file

In [19]:
with open('../Data/raw.json', 'r') as f:
    raw = json.load(f)

In [20]:
feature_list = ['subreddit', 'author', 'title', 'selftext', 'created_utc', 'num_comments', 'score', 'over_18']
df = posts_to_df(raw, features=feature_list)

## Remove duplicates from data

In [21]:
df.drop_duplicates(inplace=True)
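
Called with no arguments, `drop_duplicates()` removes only rows that match on every column. The sketch below uses made-up data to show how a post captured twice with a different `num_comments` survives that check. It also shows the `subset` argument, which compares chosen columns only. That is an optional variation and not what the notebook does.

```python
import pandas as pd

# Hypothetical rows: the same post captured twice with a different comment count
posts = pd.DataFrame({
    "author": ["a1", "a1", "a2", "a2"],
    "title": ["same joke", "same joke", "other", "other"],
    "num_comments": [3, 5, 1, 1],
})

full_row = posts.drop_duplicates()                           # rows 0 and 1 differ, both kept
by_post = posts.drop_duplicates(subset=["author", "title"])  # one row per author/title pair

print(len(posts), len(full_row), len(by_post))  # 4 3 2
```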

## Remove numbers from text data

In [22]:
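# regex=True is required: recent pandas versions treat the pattern as a literal string by default
df['selftext'] = df['selftext'].str.replace(r'\d+', '', regex=True)
df['title'] = df['title'].str.replace(r'\d+', '', regex=True)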
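
A small sketch of the digit-stripping step on a made-up Series. It passes `regex=True` explicitly, because pandas 2.0 and later treat the pattern as a literal string unless told otherwise.

```python
import pandas as pd

# Hypothetical text values
s = pd.Series(["I am 25 years old", "no digits", "2 + 2 = 4"])

# Remove every run of digits; surrounding spaces are left untouched
cleaned = s.str.replace(r"\d+", "", regex=True)

print(cleaned.tolist())
```

Removing the digits leaves the surrounding whitespace in place, so doubled spaces can appear until the text is split and rejoined later.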

## Remove non-text data and lemmatize words

### This function does the following:
1. Uses Beautiful Soup to remove HTML markup
2. Uses regex to replace any remaining non-letter characters with spaces
3. Converts all text to lower case
4. Lemmatizes each word, converting it to its base (dictionary) form

In [23]:
lemmatizer = WordNetLemmatizer()
lemmatize_text = True  # set to False to skip lemmatization

def clean_text(raw_text):
    # Strip HTML markup, keeping only the visible text
    bs_text = BeautifulSoup(raw_text, 'lxml').get_text()
    # Replace every character that is not an ASCII letter with a space
    only_text = re.sub(r"[^a-zA-Z]", " ", bs_text)
    # Lower-case and split on whitespace
    words = only_text.lower().split()
    if lemmatize_text:
        words = [lemmatizer.lemmatize(word) for word in words]
    return " ".join(words)
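
To make the regex and lower-casing steps concrete, here is a minimal sketch of those two steps in isolation. The name `letters_only` is hypothetical, and HTML stripping and lemmatization are left out so the example runs with the standard library alone.

```python
import re

def letters_only(text):
    # Replace anything that is not an ASCII letter with a space, then normalise
    only_letters = re.sub(r"[^a-zA-Z]", " ", text)
    return " ".join(only_letters.lower().split())

sample = "Don't stop, it's 5 o'clock at the caf\u00e9!"
result = letters_only(sample)
print(result)  # don t stop it s o clock at the caf
```

The pattern keeps ASCII letters only, so it also drops digits, splits contractions into fragments, and discards accented characters.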

In [24]:
df['title'] = df['title'].map(clean_text)
df['selftext'] = df['selftext'].map(clean_text)
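
Beautiful Soup can emit a warning for each input that resembles a URL, which floods the cell output when it is mapped over thousands of posts. One possible way to keep the output readable is to silence `UserWarning` only around the cleaning call. In the sketch below, `noisy_parse` is a hypothetical stand-in for the parser call, so that the example runs without extra packages.

```python
import warnings

def noisy_parse(text):
    # Hypothetical stand-in for a parser call that warns on URL-like input
    if text.startswith("http"):
        warnings.warn(f'"{text}" looks like a URL.', UserWarning)
    return text.lower()

texts = ["https://example.com/post", "Just a sentence"]

# record=True is used here only so the example can count what got through
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore", category=UserWarning)  # applies inside this block only
    cleaned = [noisy_parse(t) for t in texts]

print(cleaned, len(caught))
```

The filter is restored when the `with` block exits, so warnings from later cells still appear.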

Note: this cell prints Beautiful Soup warnings for some posts, for example text that "looks like a URL. Beautiful Soup is not an HTTP client." The text is still parsed, so the warnings can be ignored for this cleaning step.

## Light feature engineering
- Calculate the number of characters in the cleaned title and body text
- Create binary flag title_only (1 when the post has no body text, otherwise 0)

In [25]:
df['title_len'] = df['title'].str.len()
df['text_len'] = df['selftext'].str.len()
df['title_only'] = (df['text_len'] == 0).astype(int)
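
A minimal sketch of these three features on made-up rows. The `demo` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical cleaned posts; an empty selftext means the post is title-only
demo = pd.DataFrame({
    "title": ["why did the chicken cross the road", "i have a confession"],
    "selftext": ["", "i never returned the library book"],
})

demo["title_len"] = demo["title"].str.len()
demo["text_len"] = demo["selftext"].str.len()
demo["title_only"] = (demo["text_len"] == 0).astype(int)

print(demo[["title_len", "text_len", "title_only"]])
```

The first row has an empty body, so its `text_len` is 0 and `title_only` is 1.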

## Convert target variable (subreddit) to integer

In [26]:
subreddit_map = {'confessions': 0, 'Jokes':1}
df['subreddit_int'] = df['subreddit'].map(subreddit_map)
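
`Series.map` with a dictionary returns `NaN` for any value that is not a key, and the keys are case-sensitive. A possible sanity check, shown on made-up labels, is to list whatever failed to map before going further:

```python
import pandas as pd

subreddit_map = {"confessions": 0, "Jokes": 1}

# Hypothetical labels; "jokes" (lower case) is not a key in the map
labels = pd.Series(["Jokes", "confessions", "jokes"])
mapped = labels.map(subreddit_map)

# Collect the original values that produced NaN
unmapped = labels[mapped.isna()].unique().tolist()
print(unmapped)
```

An empty list here means every row received an integer label.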

## Export Cleansed DataFrame for EDA

In [27]:
with open('../Data/df_clean.pkl', 'wb') as f:
    pickle.dump(df, f)

## Set up target variable y

In [28]:
y = list(df['subreddit_int'])

## Set up feature variables X

- subreddit and subreddit_int are dropped because they hold the target; author and created_utc are dropped because they are not good predictors of subreddit

In [29]:
X = df.drop(labels=['subreddit','subreddit_int','author','created_utc'], axis=1).copy()

## Train Test Split

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
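
A short sketch of what this call produces, using a made-up 10-row frame: 30% of the rows go to the test set, the same `random_state` reproduces the same split, and a list passed as `y` comes back as lists.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical feature frame and label list with 10 rows
X_demo = pd.DataFrame({"title_len": range(10)})
y_demo = [0, 1] * 5

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

print(len(X_tr), len(X_te))            # 7 3
print(X_te.index.equals(X_te2.index))  # True: same seed, same split
print(type(y_tr).__name__)             # list
```

If the two classes were noticeably imbalanced, passing `stratify=y` would be one option for keeping their proportions similar in both sets. The notebook does not do this.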

## Export y_train and y_test objects

In [31]:
with open('../Data/y_train.pkl', 'wb') as f:
    pickle.dump(y_train, f)
    
with open('../Data/y_test.pkl', 'wb') as f:
    pickle.dump(y_test, f)

## Export X_train and X_test objects

In [32]:
with open('../Data/X_train.pkl', 'wb') as f:
    pickle.dump(X_train, f)
    
with open('../Data/X_test.pkl', 'wb') as f:
    pickle.dump(X_test, f)
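
For later notebooks in the workflow, the exported objects can be read back with `pickle.load`. Below is a minimal round-trip sketch that writes to a temporary directory, not to `../Data`. The label list is a made-up stand-in for `y_train`.

```python
import os
import pickle
import tempfile

y_demo = [0, 1, 1, 0]  # hypothetical labels standing in for y_train

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "y_train.pkl")
    with open(path, "wb") as f:
        pickle.dump(y_demo, f)
    with open(path, "rb") as f:  # "rb": pickle files are binary
        y_loaded = pickle.load(f)

print(y_loaded == y_demo)  # True
```

Pickle files can execute code when loaded, so only unpickle files you created or trust.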