# Using Reddit's API for Predicting Which Subreddit?

## Notebook 2:  Data Preparation and Cleaning with REGEX

#### The data is imported from a JSON file and loaded into a DataFrame. Duplicates are then dropped and the text of the remaining posts is processed: body and title are combined; punctuation, numbers, and URLs are removed; common English words (stop words) are removed; and the text is split into individual words. The result is saved as a CSV and passed on to the next notebook.

In [18]:
import requests
import time
import pandas as pd
import json
import re
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
# sklearn.cross_validation was removed in scikit-learn 0.20; these helpers live in model_selection
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn import metrics
from sklearn.linear_model import LogisticRegression


In [19]:
with open('../data/test_dump3.json', 'r') as f:
    import_posts = json.load(f)

#### Divide data into target ("which subreddit") and dataframe of predictive features

In [20]:
def posts_as_DataFrame(posts, features = ['subreddit', 'author', 'id',
                                          'title', 'selftext',
                                          'created_utc', 'num_comments']):
    # Each post keeps its fields under the 'data' key; keep only the requested features
    feature_dicts = [{feature : post['data'][feature] for feature in features}
                     for post in posts]
    return pd.DataFrame(feature_dicts)
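
To make the expected input shape concrete, here is a minimal sketch of the same extraction on two made-up posts. The sample data, the shorter feature list, and the name `sample_posts` are assumptions for illustration; the only thing taken from the notebook is that each post stores its fields under a `'data'` key.

```python
import pandas as pd

def posts_as_DataFrame(posts, features=('subreddit', 'title', 'selftext')):
    # Pull only the requested fields out of each post's 'data' dictionary
    rows = [{feature: post['data'][feature] for feature in features}
            for post in posts]
    return pd.DataFrame(rows)

# Hypothetical posts; extra keys such as 'score' are simply ignored
sample_posts = [
    {'kind': 't3', 'data': {'subreddit': 'datascience', 'title': 'Book list',
                            'selftext': 'Post your favourites.', 'score': 10}},
    {'kind': 't3', 'data': {'subreddit': 'python', 'title': 'Regex help',
                            'selftext': '', 'score': 3}},
]

sample_df = posts_as_DataFrame(sample_posts)
print(sample_df)
```

A post that lacks one of the requested keys would raise a `KeyError` here, so the feature list has to match what the scrape actually returned.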

In [21]:
X_features_df = posts_as_DataFrame(import_posts)

In [22]:
len(X_features_df)

9988

In [23]:
X_features_df.drop_duplicates('id', inplace = True)

In [24]:
X_features_df.shape

(1870, 7)
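
The drop from 9988 rows to 1870 means most scraped rows shared an `id` with an earlier row. The following sketch, on made-up data, shows what `drop_duplicates` keeps and how to count what was removed. The `reset_index` call is an optional addition that the notebook does not use; without it, the surviving rows keep their original index labels, so the labels are no longer contiguous.

```python
import pandas as pd

# Hypothetical scrape in which two posts were fetched twice
posts = pd.DataFrame({'id': ['a1', 'b2', 'a1', 'c3', 'b2'],
                      'title': ['first', 'second', 'first', 'third', 'second']})

n_before = len(posts)
# Keep the first occurrence of each id, then renumber the rows 0..n-1
deduped = posts.drop_duplicates('id').reset_index(drop=True)
n_removed = n_before - len(deduped)

print(n_removed, list(deduped['id']))
```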

#### Prepare text data for analysis and join with other features
#### Use regex to remove URLs, whitespace, and punctuation, and convert to lowercase

In [25]:
# Combine title and body of post as X_text; drop body and title
X_features_df['text'] = X_features_df['selftext'] + X_features_df['title']

In [26]:
X_features_df.drop(columns = ['selftext', 'title'], inplace=True)
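
One thing to watch in the concatenation above: joining body and title with a bare `+` puts no separator between them, so the last word of the body can fuse with the first word of the title. The sketch below, on made-up data, shows a possible variant that adds a space and guards against missing values. Whether the scraped data contains missing bodies at all is an assumption.

```python
import pandas as pd

# Hypothetical posts: a normal post, one with a missing body, one with an empty body
df = pd.DataFrame({'selftext': ['Body ends here', None, ''],
                   'title': ['Title starts', 'Link post title', 'Only title']})

# Plain concatenation fuses the adjacent words
naive = df['selftext'] + df['title']

# Fill missing values, join with a space, and trim leftover whitespace
safe = (df['selftext'].fillna('') + ' ' + df['title'].fillna('')).str.strip()

print(naive[0])
print(list(safe))
```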

In [27]:
X_text = X_features_df['text']

In [28]:
# Remove markdown link targets of the form (http...)
re.sub(r"(\(http.*\))", ' ', X_text[0])

'The Mod Team has decided that it would be nice to put together a list of recommended books, similar to [the podcast list] .\n\n**Please post any books that you have found particularly interesting or helpful for learning during your career.  Include the title with either an author or link.**\n\nSome restrictions:\n\n* Must be directly related to data science\n* Non\\-fiction only\n* Must be an actual **book**, not a blog post, scientific article, or website\n* Nothing self\\-promotional\n\n ***** \n\nMy recommendations:\n\n* [Machine Learning: A Probabilistic Perspective] \n* [Computer Age Statistical Inference] \n* [Data Analysis Using Regression and Multilevel/Hierarchical Models] \n* [Design and Analysis of Experiments] \n* [Data Mining: Concepts and Techniques] \n* [Active Learning] \n* [All of Statistics: A Concise Course in Statistical Inference] DS Book Suggestions/Recommendations Megathread'
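
The pattern used here is greedy: `.*` runs to the last `)` on the line, so when a line holds two links, the text between them is removed as well. The sketch below compares it with a narrower pattern on a made-up sentence. The alternative pattern is a suggestion, not what the notebook uses, and it would cut short any URL that itself contains a `)`.

```python
import re

text = ("See [the list](http://example.com/a) and also "
        "[the wiki](http://example.com/b) for more.")

# Greedy: matches from the first "(http" to the last ")" on the line
greedy = re.sub(r"\(http.*\)", " ", text)

# Narrower: stop at the first ")" or whitespace after "(http"
targeted = re.sub(r"\(http[^)\s]*\)", " ", text)

print(greedy)
print(targeted)
```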

In [29]:
# Replace everything except letters (punctuation, digits, newlines) with spaces
X_text[0] = re.sub("[^a-zA-Z]", " ", X_text[0])

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
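
The warning appears because the cell assigns into `X_text[0]`, and pandas cannot tell whether that write is meant to reach the underlying DataFrame. One way to avoid it while experimenting is to keep the cleaned string in a local variable, and to write it back with `.loc` only when that is intended. This is a sketch on made-up data, and `first_post` is a hypothetical name.

```python
import re
import pandas as pd

df = pd.DataFrame({'text': ['Hello, world! 123', 'Second post...']})

# Work on a plain Python string; the DataFrame is not touched
first_post = re.sub(r"[^a-zA-Z]", " ", df['text'].iloc[0])
unchanged = df['text'].iloc[0]

# If the cleaned value should be stored, write it back explicitly
df.loc[df.index[0], 'text'] = first_post

print(first_post.split())
```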


In [32]:
# Convert to all lowercase, split into individual words
lowercase_X_text = X_text[0].lower()
X_words = lowercase_X_text.split()
X_words[0:5]

['the', 'mod', 'team', 'has', 'decided']

In [33]:
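# Remove stop words (commonly occurring words in the English language) from X_words
# Build the stop-word set once instead of reloading the list for every word
stops = set(stopwords.words("english"))
X_words = [w for w in X_words if w not in stops]
X_words[0:5]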

['mod', 'team', 'decided', 'would', 'nice']

In [34]:
# Combine all of the cleaning steps above into a single function
def post_to_words(raw_text):
    # Convert the raw text of one subreddit post to a string of words.
    # The input is a single string (obtained in the "get raw data" notebook), and
    # the output is a single string (a preprocessed reddit post).
    #
    # Remove markdown link targets of the form (http...)
    text = re.sub(r"(\(http.*\))", " ", raw_text)
    # Replace everything except letters (punctuation, digits, newlines) with spaces
    text = re.sub(r"[^a-zA-Z]", " ", text)
    #
    # Convert to all lowercase, split into individual words
    words = text.lower().split()
    #
    # In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words('english'))
    #
    # Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # Join the words back into one string separated by spaces,
    # and return the result.
    return " ".join(meaningful_words)
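
To see the whole pipeline on one input without needing the NLTK corpus, the sketch below repeats the same steps with a tiny hand-written stop-word set. `DEMO_STOPS`, `post_to_words_demo`, and the sample text are all made up for illustration; the real function uses `stopwords.words('english')`, which is a much longer list.

```python
import re

# Hypothetical stand-in for the NLTK English stop-word list
DEMO_STOPS = frozenset({'the', 'a', 'to', 'of', 'and', 'is', 'for', 'it', 'this'})

def post_to_words_demo(raw_text, stops=DEMO_STOPS):
    # Same steps as post_to_words: strip link targets, keep letters only,
    # lowercase, split, drop stop words, and rejoin
    no_urls = re.sub(r"\(http.*\)", " ", raw_text)
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_urls)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in stops)

sample = "Check [this guide](http://example.com/guide) for the 3 BEST tips!\nIt's great."
cleaned = post_to_words_demo(sample)
print(cleaned)
```

Because apostrophes are replaced with spaces, a contraction such as "It's" splits into two tokens. In this demo the leftover "s" survives, since the small stop-word set does not contain it.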

#### Apply the function above to the text of every post

In [35]:
# Get the number of posts from the number of rows in the dataframe
num_posts = X_features_df.shape[0]
num_posts

1870

In [36]:
X_features_df['text'] = X_features_df['text'].apply(post_to_words)

In [37]:
X_features_df.head()

| | author | created_utc | id | num_comments | subreddit | text |
|---|---|---|---|---|---|---|
| 0 | Omega037 | 1526405000.0 | 8jneyb | 42 | datascience | mod team decided would nice put together list ... |
| 1 | Omega037 | 1527799000.0 | 8nlsqi | 18 | datascience | weekly entering amp transitioning thread quest... |
| 2 | One_Last_Thyme | 1528047000.0 | 8oa4uy | 6 | datascience | hey guys looking input project working interns... |
| 3 | Hydralyze | 1528048000.0 | 8oa880 | 2 | datascience | lots google sheets work others spreadsheet gen... |
| 4 | FeelTheQuickening | 1528050000.0 | 8oaiac | 2 | datascience | hello industrial engineer experience dealing d... |
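
After cleaning, a post made up only of links, digits, or stop words ends up as an empty string. A quick check like the one below can show how many such rows exist before vectorizing. This is a sketch on made-up data; whether any of the 1870 posts actually become empty is not known from the notebook.

```python
import pandas as pd

# Hypothetical cleaned text column containing two empty results
cleaned = pd.Series(['mod team decided', '', 'hello industrial engineer', ''],
                    name='text')

# Flag rows whose cleaned text is empty or whitespace only
is_empty = cleaned.str.strip().eq('')
n_empty = int(is_empty.sum())
non_empty = cleaned[~is_empty]

print(n_empty, len(non_empty))
```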


In [38]:
# Save X_features_df as a csv for use in the next notebook:  Vectorizing and Model Building!
X_features_df.to_csv('../data/X_features_df3.csv')
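
Two details of the CSV round trip are worth knowing before the next notebook reads this file. With the default settings, `to_csv` also writes the index, which comes back from `read_csv` as an extra `Unnamed: 0` column, and empty strings come back as `NaN`. The sketch below shows both on made-up data, using an in-memory buffer instead of a file. Passing `index=False` and calling `fillna('')` are optional additions, not something the notebook does.

```python
import io
import pandas as pd

# Hypothetical frame: non-contiguous index and one empty cleaned text
df = pd.DataFrame({'id': ['8jneyb', '8nlsqi'],
                   'text': ['mod team decided', '']}, index=[0, 5])

# Default: the index is written and read back as an 'Unnamed: 0' column
buf_with_index = io.StringIO()
df.to_csv(buf_with_index)
buf_with_index.seek(0)
reloaded_with_index = pd.read_csv(buf_with_index)

# Alternative: skip the index, then restore empty strings after reading
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)
empty_became_nan = bool(reloaded['text'].isna().iloc[1])
reloaded['text'] = reloaded['text'].fillna('')

print(list(reloaded_with_index.columns), list(reloaded.columns))
```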
