# Data Collection & Preprocessing 

The code in this notebook scrapes the titles and bodies of the latest posts from a specific subreddit, converts them to a dataframe, and exports the result to a .csv file.


### Import Libraries

In [1]:
import pandas as pd
import praw
import regex as re
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer

### Reddit API Scrape

To use the code and function below, you must create a Reddit account [here](https://www.reddit.com/) and register for use of the API. The username and password come from your general Reddit account, while the client id, client secret, and user agent come from the API token you create and attach to that account.

In [2]:
# enter your personal account info below
reddit = praw.Reddit(
    client_id='YOUR CLIENT ID',
    client_secret='YOUR ACCOUNT CLIENT SECRET',
    password='YOUR ACCOUNT PASSWORD', 
    user_agent='YOUR ACCOUNT USER AGENT',
    username='YOUR ACCOUNT USERNAME'
)
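Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. One possible alternative, sketched below, reads them from environment variables. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are hypothetical and not part of PRAW; the returned dictionary simply uses the same keyword names as the `praw.Reddit(...)` call above.

```python
import os

# Hypothetical environment variable names; choose whatever suits your setup.
ENV_TO_KWARG = {
    'REDDIT_CLIENT_ID': 'client_id',
    'REDDIT_CLIENT_SECRET': 'client_secret',
    'REDDIT_PASSWORD': 'password',
    'REDDIT_USER_AGENT': 'user_agent',
    'REDDIT_USERNAME': 'username',
}


def load_reddit_credentials(environ=os.environ):
    """Collect the five credential values from the environment.

    Raises KeyError naming every missing variable, so setup problems
    surface before any API call is made.
    """
    missing = [name for name in ENV_TO_KWARG if name not in environ]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {kwarg: environ[name] for name, kwarg in ENV_TO_KWARG.items()}
```

With the variables set, the instance could then be created as `reddit = praw.Reddit(**load_reddit_credentials())`.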

The function below takes in the praw instance created above, the subreddit you would like to scrape as a string, and the number of latest posts you would like to collect as an integer.

It prints the number of posts successfully scraped and returns a dataframe in which each row holds a post's category, body, and title.

In [3]:
def reddit_scapper(praw_object, sub_reddit, num_posts):
    # create submissions object to iterate over
    submissions = praw_object.subreddit(sub_reddit).new(limit=num_posts)

    # create list of dictionaries for easy conversion to df
    records = []
    for post in submissions:
        records.append({
            # store the subreddit name as a plain string so it can be mapped later
            'category': str(post.subreddit),
            'title': post.title,
            'body': post.selftext
        })

    # show number of posts collected and output df
    print('You collected {} reddit comments about {}'.format(len(records), sub_reddit))
    return pd.DataFrame(records)
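The record-building step can be exercised without network access or API credentials. The sketch below follows the same pattern as the loop in the function above, but runs on stand-in objects built with `types.SimpleNamespace`; the helper name `posts_to_records` and the sample posts are made up for illustration and are not real PRAW submissions. The second stand-in mimics a link post, for which PRAW reports `selftext` as an empty string.

```python
from types import SimpleNamespace


def posts_to_records(posts):
    """Build one dict per post from the three attributes the scraper reads."""
    return [
        {'category': str(post.subreddit), 'title': post.title, 'body': post.selftext}
        for post in posts
    ]


# Stand-in objects exposing only the attributes used above (hypothetical data).
fake_posts = [
    SimpleNamespace(subreddit='history', title='A text post', selftext='Some body text'),
    SimpleNamespace(subreddit='history', title='A link post', selftext=''),
]

records = posts_to_records(fake_posts)
print(len(records))              # 2
print(records[1]['body'] == '')  # True
```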

In [4]:
# scrape 1,000 posts each from history and conspiracy to model
history_df = reddit_scapper(reddit, 'history', 1_000)
conspiracy_df = reddit_scapper(reddit, 'conspiracy', 1_000)

You collected 885 reddit comments about history
You collected 989 reddit comments about conspiracy


Combine into one dataframe for further preprocessing

In [None]:
# ignore_index=True gives the combined dataframe a fresh 0..n-1 index
df = pd.concat([history_df, conspiracy_df], axis=0, ignore_index=True)

### Preprocessing
The cells below explore the data and prepare it for modeling.

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1891 entries, 0 to 1890
Data columns (total 3 columns):
category    1891 non-null object
body        1292 non-null object
title       1891 non-null object
dtypes: object(3)
memory usage: 59.1+ KB


##### Create a binary target column

In [13]:
# making category column binary
df['category'] = df['category'].map({'conspiracy': 0,
                                     'history': 1})

##### Feature Engineering

A large number of conspiracy posts did not contain a body. To still make use of their text, I combined the text from both title and body in this feature.


The lambda function below replaces the null values with empty strings, combines the two text cells, and overwrites the entry in body.

In [14]:
# lambda function to treat nulls as empty text cells for new feature;
# the ' ' keeps the last title word and the first body word from merging
df['body'] = df['title'] + ' ' + df['body'].apply(lambda x: x if isinstance(x, str) else '')
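The per-cell logic can be checked in isolation with plain Python. In this minimal sketch, `combine_title_and_body` is a hypothetical helper name, and inserting a separating space between title and body is an assumption about the desired behaviour, made so that the two texts do not run together into one token.

```python
def combine_title_and_body(title, body):
    """Return the title followed by the body; non-string bodies count as empty."""
    # Missing values in pandas are typically float NaN, so anything that is
    # not a string is treated as an empty body.
    body_text = body if isinstance(body, str) else ''
    return (title + ' ' + body_text).strip()


print(combine_title_and_body('A title', float('nan')))  # A title
print(combine_title_and_body('A title', 'and a body'))  # A title and a body
```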

The function below takes in text, replaces every non-letter character (punctuation, digits, emojis) with a space, applies a Porter stemmer to each word, and returns the processed text.

In [15]:
def text_stemmer(raw_text_rows):
    # Replace non-letters with spaces.
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text_rows)

    # Tokenize the text into words.
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(letters_only)

    # Stem each word.
    p_stemmer = PorterStemmer()
    clean_stems = [p_stemmer.stem(w) for w in tokens]

    # Join the words back into one string separated by spaces.
    return " ".join(clean_stems)
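To see what the first step of the function does on its own, the sketch below applies the same substitution pattern to a made-up sample string using only the standard library. `str.split` stands in for the NLTK tokenizer here, which is an assumption that holds because only letters and spaces remain after the substitution; stemming is left out.

```python
import re

# Made-up sample containing an apostrophe, digits, punctuation, and an emoji.
sample = "Uncle Joe's hands in 2007 (🍕 gate)"

# Same pattern as in the function above: every non-letter becomes a space.
letters_only = re.sub("[^a-zA-Z]", " ", sample)
tokens = letters_only.split()

print(tokens)  # ['Uncle', 'Joe', 's', 'hands', 'in', 'gate']
```

Note that the year disappears entirely and the possessive splits into a stray `s` token, so numeric information is not available to any model trained on the stemmed text.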

Use the above function to create features that contain stemmed text for both title and body.

In [16]:
# create new column with transformed title
df['stemmed_title'] = [text_stemmer(text) for text in df['title']]

# create new column with transformed body data
df['stemmed_body'] = [text_stemmer(text) for text in df['body']]

In [17]:
df.head()

| Unnamed: 0 | category | body | title | stemmed_title | stemmed_body |
|---|---|---|---|---|---|
| 0 | 0 | George Popadopoulos Judiciary Committee Transc... | George Popadopoulos Judiciary Committee Transc... | georg popadopoulo judiciari committe transcrip... | georg popadopoulo judiciari committe transcrip... |
| 1 | 0 | Scientists Will Spray Particles Into the Sky t... | Scientists Will Spray Particles Into the Sky t... | scientist will spray particl into the ski to b... | scientist will spray particl into the ski to b... |
| 2 | 0 | We Are Change confronted Joe Biden in 2007 abo... | We Are Change confronted Joe Biden in 2007 abo... | We are chang confront joe biden in about hi me... | We are chang confront joe biden in about hi me... |
| 3 | 0 | Jeff Sessions swats creepy Uncle Joe's hands away | Jeff Sessions swats creepy Uncle Joe's hands away | jeff session swat creepi uncl joe s hand away | jeff session swat creepi uncl joe s hand away |
| 4 | 0 | NXIVM cultist (🍕 gate) admits to enslaving wom... | NXIVM cultist (🍕 gate) admits to enslaving wom... | nxivm cultist gate admit to enslav woman for y... | nxivm cultist gate admit to enslav woman for y... |


### Export Data
Export for modeling

In [18]:
df.to_csv('./data/preprocessed_data.csv')
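Two details of the CSV round trip are worth knowing before modeling, illustrated by the small sketch below on a made-up two-row dataframe. By default `to_csv` writes the row index as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`; passing `index=False` omits it. Empty strings are also read back as missing values. Whether either effect accounts for the `Unnamed: 0` column or the null bodies seen earlier in this notebook is only a guess.

```python
import io

import pandas as pd

# Made-up data: the second row has an empty body, like a link post.
demo = pd.DataFrame({'title': ['a', 'b'], 'body': ['text', '']})

# Default export: the index is written as an unnamed first column.
with_index = pd.read_csv(io.StringIO(demo.to_csv()))

# index=False leaves the index out of the file.
without_index = pd.read_csv(io.StringIO(demo.to_csv(index=False)))

print(list(with_index.columns))               # ['Unnamed: 0', 'title', 'body']
print(list(without_index.columns))            # ['title', 'body']
print(without_index['body'].isna().tolist())  # [False, True]
```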