# Data Collection & Preprocessing 

The code in this notebook scrapes the latest post titles and bodies, along with their comments, from a given subreddit, converts them to a dataframe, and exports the result to a .csv file.


### Import Libraries

In [12]:
import pandas as pd
import praw
import regex as re
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
import sensitive as sens
from tqdm.notebook import tqdm

### Reddit API Scrape

To use the code and function below, you must create a Reddit account [here](https://www.reddit.com/) and register for use of the API. The username and password come from your general Reddit account, while the client ID, client secret, and user agent come from the API token you create for that account.

In [6]:
# fill in your own account info; here it comes from the imported `sens` module
reddit = praw.Reddit(
    client_id=sens.client_id,
    client_secret=sens.client_secret,
    password=sens.password, 
    user_agent=sens.user_agent,
    username=sens.username
)
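
The `sensitive` module imported as `sens` is not shown in this notebook. Below is a minimal sketch of one way to supply those five values, assuming they are stored in environment variables. The variable names (`REDDIT_CLIENT_ID` and so on) and the `load_credentials` helper are hypothetical and not part of praw.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
ENV_NAMES = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "password": "REDDIT_PASSWORD",
    "user_agent": "REDDIT_USER_AGENT",
    "username": "REDDIT_USERNAME",
}


def load_credentials(env=None):
    """Return the five credential values as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {key: env[name] for key, name in ENV_NAMES.items()}


# Example with a stand-in mapping instead of the real environment.
fake_env = {name: f"value-for-{name}" for name in ENV_NAMES.values()}
creds = load_credentials(fake_env)
print(sorted(creds))
```

The resulting dictionary uses the same keyword names as the `praw.Reddit(...)` call above, so it could be passed as `praw.Reddit(**creds)`, which keeps secrets out of the notebook itself.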

The function below takes the praw instance created above, the name of the subreddit you would like to scrape as a string, and the number of posts you would like to collect as an integer. Posts are taken from the subreddit's `hot` listing.

It prints the number of posts scraped and returns two dataframes: one with a row per post (category, title, body, subreddit, post index, and comment count) and one with a row per comment (post index, text, and likes).

In [140]:
def reddit_scraper(praw_object, sub_reddit, num_posts):
    # create submissions object to iterate over
    submissions = praw_object.subreddit(sub_reddit).hot(limit=num_posts)

    # create lists of dictionaries for easy conversion to df
    posts = []
    comments = []
    for i, post in enumerate(submissions):
        # drop unexpanded "load more comments" placeholders, which have no body
        post.comments.replace_more(limit=0)
        comms = post.comments.list()
        posts.append({
            'category': post.subreddit.display_name,
            'title': post.title,
            'body': post.selftext,
            'sub': sub_reddit,
            'sub_post_id': i,
            'n_comments': len(comms)
        })
        for com in comms:
            comments.append({
                'sub_post_id': i,
                'text': com.body,
                # likes is the logged-in user's own vote (None if not voted);
                # com.score holds the comment's net score
                'likes': com.likes,
            })
    
    # show number of posts collected and return both dataframes
    print(f'You collected {len(posts)} reddit posts about {sub_reddit}')
    return pd.DataFrame(posts), pd.DataFrame(comments)

In [144]:
subs = ['datascience', 'datasciencejobs', 'machinelearning',
        'machinelearningjobs', 'learnmachinelearning',
        'learndatascience']
# map each subreddit name to its (posts_df, comments_df) tuple
scraped = {}
for sub in tqdm(subs):
    scraped[sub] = reddit_scraper(reddit, sub, 10)

# # scrape 1,000 posts on history and conspiracy to model
# history_df, _ = reddit_scraper(reddit, 'epicseven', 1_000)
# conspiracy_df, _ = reddit_scraper(reddit, 'lewdseven', 1_000)

You collected 10 reddit posts about datascience
You collected 10 reddit posts about datasciencejobs
You collected 10 reddit posts about machinelearning
You collected 10 reddit posts about machinelearningjobs
You collected 10 reddit posts about learnmachinelearning
You collected 10 reddit posts about learndatascience



In [149]:
posts_l = []
comms_l = []
# each value in `scraped` is a (posts_df, comments_df) tuple
for posts, comms in scraped.values():
    posts_l.append(posts)
    comms_l.append(comms)

In [151]:
posts_df = pd.concat(posts_l)
comms_df = pd.concat(comms_l)
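
One thing to watch in the combined frames: `sub_post_id` restarts at 0 for every subreddit, and the comment rows carry no subreddit column, so after concatenation a comment can no longer be traced back to its post. Below is a minimal sketch of one way around this, using small stand-in frames in place of the real `scraped` dictionary (the sample values and the `_demo` names are invented for illustration): tag each comments frame with its dictionary key before concatenating.

```python
import pandas as pd

# Stand-in for `scraped`: subreddit name -> (posts_df, comments_df).
scraped_demo = {
    "sub_a": (
        pd.DataFrame({"sub": ["sub_a"], "sub_post_id": [0], "title": ["post A0"]}),
        pd.DataFrame({"sub_post_id": [0, 0], "text": ["first", "second"]}),
    ),
    "sub_b": (
        pd.DataFrame({"sub": ["sub_b"], "sub_post_id": [0], "title": ["post B0"]}),
        pd.DataFrame({"sub_post_id": [0], "text": ["third"]}),
    ),
}

# Posts already carry a `sub` column; comments get one from the dict key.
posts_demo = pd.concat(
    [posts for posts, _ in scraped_demo.values()], ignore_index=True
)
comms_demo = pd.concat(
    [comms.assign(sub=sub) for sub, (_, comms) in scraped_demo.items()],
    ignore_index=True,
)

# (sub, sub_post_id) now identifies a post uniquely, so comments can be joined.
joined = comms_demo.merge(posts_demo, on=["sub", "sub_post_id"], how="left")
print(joined[["sub", "sub_post_id", "title", "text"]])
```

`ignore_index=True` also gives each combined frame a single running index instead of repeating the per-subreddit row numbers.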

In [152]:
posts_df

Unnamed: 0,category,title,body,sub,sub_post_id,n_comments
0,datascience,Weekly Entering & Transitioning Thread | 08 Ma...,_Bleep Bloop_. Welcome to this week's entering...,datascience,0,23
1,datascience,It’s never too early,,datascience,1,20
2,datascience,When to use statistical tests?,"Hello all, I'm wondering when do you want to u...",datascience,2,27
3,datascience,Would you hire me into an analytics/data scien...,I have a BS in Industrial Engineering. Got hir...,datascience,3,2
4,datascience,How to better explain my regression project in...,I had a project where I was finding race diffe...,datascience,4,0
5,datascience,i applied for data sci for Uni !! how do i get...,,datascience,5,0
6,datascience,Can a single person consultancy build data war...,"Hello everyone, I am currently an undergrad co...",datascience,6,0
7,datascience,How do you make a credit scorecard?,I am in a Junior Data Analyst in a small finan...,datascience,7,1
8,datascience,Awesome data science interview resources,I already shared a list of data science questi...,datascience,8,0
9,datascience,Good course(s) on analysis of time series?,"Hi, \n\n\nLooking into learning more about th...",datascience,9,8


In [153]:
comms_df

Unnamed: 0,sub_post_id,text,likes
0,0,"I'm a high school student(10th grade, 16) who'...",
1,0,Tomorrow I am having a coffee with a Senior Da...,
2,0,Learn SQL\n\nI’m always aghast at how many que...,
3,0,What salary range to give to a fortune 500 com...,
4,0,Upcoming (this May) graduate with two BS degre...,
...,...,...,...
7,0,When I reread what I typed I can totally see h...,
8,2,I will be following this video series. Thank y...,
9,2,Thank you! All the best to you too :),
10,6,Try github.,


# Note: the cells below may not be relevant for the DATA 607 project

Combine into one dataframe for further preprocessing

In [None]:
# stack both frames and renumber rows 0..n-1 so the index has no duplicates
df = pd.concat([history_df, conspiracy_df], axis=0, ignore_index=True)

### Preprocessing
The cells below explore the data and prepare it for modeling.

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1891 entries, 0 to 1890
Data columns (total 3 columns):
category    1891 non-null object
body        1292 non-null object
title       1891 non-null object
dtypes: object(3)
memory usage: 59.1+ KB


##### Create a binary target column

In [13]:
# making category column binary
df['category'] = df['category'].map({'conspiracy': 0,
                                     'history': 1})
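
`Series.map` with a dictionary turns any value that is not one of its keys into NaN, so a quick check after mapping can catch unexpected category labels. A small sketch on invented labels (not the real data):

```python
import pandas as pd

labels = pd.Series(["conspiracy", "history", "History"])

# Values missing from the dictionary (here the capitalized "History") map to NaN.
mapped = labels.map({"conspiracy": 0, "history": 1})
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)
```

If the count is greater than zero, the offending labels can be inspected with `labels[mapped.isna()]` before modeling.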

##### Feature Engineering

A large number of conspiracy posts did not contain a body. To still be able to use their text, I combined the text from both body and title in this feature.


The lambda function below replaces null values with empty strings so that the title and body text can be combined, and the result overwrites `body`.

In [14]:
# lambda function to treat nulls as empty text cells for new feature;
# the ' ' keeps the last title word from fusing with the first body word
df['body'] = df['title'] + ' ' + df['body'].apply(lambda x: x if isinstance(x, str) else '')
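
The null handling matters because string concatenation in pandas propagates missing values: a title added to a null body yields a null, not the title. A small sketch on invented sample rows (the column values are made up for illustration, and the two pieces are joined with a space so words at the boundary do not run together):

```python
import pandas as pd

demo = pd.DataFrame({
    "title": ["Moon landing", "Ancient Rome"],
    "body": [None, "Trade routes"],
})

# Without null handling, the first row's combined text is lost.
naive = demo["title"] + demo["body"]
print(naive.isna().tolist())

# Replacing nulls with empty strings first keeps the title text.
filled = demo["title"] + " " + demo["body"].fillna("")
print(filled.tolist())
```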

The function below takes in text, replaces every non-letter character (punctuation, digits, emojis) with a space, applies a Porter stemmer to each word, and returns the processed text.

In [15]:
def text_stemmer(raw_text_rows):
    # replace every non-letter character with a space
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text_rows)

    # tokenize the text into words
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(letters_only)

    # stem the words
    p_stemmer = PorterStemmer()
    clean_stems = [p_stemmer.stem(w) for w in tokens]

    # join the stems back into one string separated by spaces
    return " ".join(clean_stems)
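
To see what the first step of the function does on its own, the sketch below applies the same character class with the standard-library `re` module to an invented sample string. Digits are replaced along with punctuation and emojis, which is why years disappear from the stemmed text.

```python
import re

sample = "Joe Biden in 2007: what's next? 🍕"

# Every character outside a-z / A-Z becomes a single space.
letters_only = re.sub("[^a-zA-Z]", " ", sample)

# Splitting on whitespace approximates what the \w+ tokenizer sees afterwards.
tokens = letters_only.split()
print(tokens)
```

The apostrophe in "what's" is also replaced, so the word splits into two tokens, matching the stray `s` visible in stemmed rows such as `uncl joe s hand away`.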

The cell below uses the function above to create features containing stemmed text for both title and body.

In [16]:
# create new column with transformed title
df['stemmed_title'] = [text_stemmer(text) for text in df['title']]

# create new column with transformed body data
df['stemmed_body'] = [text_stemmer(text) for text in df['body']]

In [17]:
df.head()

Unnamed: 0,category,body,title,stemmed_title,stemmed_body
0,0,George Popadopoulos Judiciary Committee Transc...,George Popadopoulos Judiciary Committee Transc...,georg popadopoulo judiciari committe transcrip...,georg popadopoulo judiciari committe transcrip...
1,0,Scientists Will Spray Particles Into the Sky t...,Scientists Will Spray Particles Into the Sky t...,scientist will spray particl into the ski to b...,scientist will spray particl into the ski to b...
2,0,We Are Change confronted Joe Biden in 2007 abo...,We Are Change confronted Joe Biden in 2007 abo...,We are chang confront joe biden in about hi me...,We are chang confront joe biden in about hi me...
3,0,Jeff Sessions swats creepy Uncle Joe's hands away,Jeff Sessions swats creepy Uncle Joe's hands away,jeff session swat creepi uncl joe s hand away,jeff session swat creepi uncl joe s hand away
4,0,NXIVM cultist (🍕 gate) admits to enslaving wom...,NXIVM cultist (🍕 gate) admits to enslaving wom...,nxivm cultist gate admit to enslav woman for y...,nxivm cultist gate admit to enslav woman for y...


### Export Data
Export for modeling

In [18]:
df.to_csv('./data/preprocessed_data.csv')
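
By default `to_csv` also writes the dataframe index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. A minimal sketch of the difference, using in-memory buffers and an invented two-row frame rather than the real data:

```python
import io

import pandas as pd

demo = pd.DataFrame({"category": [0, 1], "stemmed_title": ["georg", "rome"]})

# Default export: the index is written as an extra, unnamed column.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```

Passing `index_col=0` to `read_csv` is another way to handle a file that was already exported with its index.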