# Project 3 - Classifying Reddit Posts by NLP

## Problem Statement
---

We are developing a new wellness app that prompts the User to write a short journal entry. The entry is analysed to determine the User's philosophical inclination, and the app returns a relevant message or thought of the day. 
Our app focuses on two philosophies: Stoicism and Buddhism.  
To understand the topics discussed by each philosophical group, we extract data from two subreddits, r/Stoicism and r/Buddhism. We will use this data to train a classifier that predicts the User's philosophical inclination from the journal entry.   

We believe the model can be adapted to other subreddits in the future.

## Import libraries 
---

In [1]:
import pandas as pd
import requests
import time

## Functions
---

In [3]:
def get_params(base_df, subreddit):
    """
    Build the Pushshift API parameters for the next batch of 100 posts,
    using the created_utc of the last post in base_df as the 'before' cursor
    """
    params = {
        'subreddit': subreddit, 
        'size': 100, 
        'before': base_df.loc[(base_df.shape[0] - 1), 'created_utc'] 
    }
    return params

In [4]:
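# Define a function that returns a list of dictionaries for the content of each post
def get_posts(params, baseurl='https://api.pushshift.io/reddit/search/submission'):
    res = requests.get(baseurl, params=params)
    # Raise instead of returning an error string, which would otherwise be
    # passed on to pd.DataFrame() and fail in a confusing way
    if res.status_code != 200:
        raise RuntimeError(f'Error code: {res.status_code}')
    # Each element of data['data'] is a dictionary describing one post
    return res.json()['data']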
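
Each call to the API is a network request that can fail transiently. A minimal sketch of a retry wrapper is shown here, assuming the wrapped function raises an exception on failure. The name `call_with_retries` and its parameters are hypothetical and not part of the original notebook.

```python
import time


def call_with_retries(func, *args, max_attempts=3, wait_seconds=1.0,
                      retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying when it raises.

    Hypothetical helper: waits a little longer after each failed attempt
    and re-raises the last error once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == max_attempts:
                raise
            # Linear back-off: 1x, 2x, 3x ... the base wait
            sleep(wait_seconds * attempt)
```

It could wrap the request helper as `call_with_retries(get_posts, params)`. In practice `retry_on` would be narrowed to the specific errors worth retrying.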

In [5]:
# Define a function to update the base DataFrame with the 100 succeeding posts
def update_df(base_df, subreddit):
    params = get_params(base_df, subreddit)
    posts = get_posts(params)
    df2 = create_new_df(posts)
    updated = pd.concat([base_df, df2], axis=0, ignore_index=True, sort=True)
    return updated

In [6]:
# Define a function to turn the list of posts into a DataFrame
def create_new_df(posts):
    return pd.DataFrame(posts)

## Data Extraction using Pushshift API
---

Reddit is a social news platform where Users can post content, images and links, and other Users can join the discussion. Posts are organized into User-created boards called 'subreddits', each of which caters to a specific topic or subject. For our problem statement, we will extract posts from two specific subreddits and compare the topics discussed in them. The two subreddits are listed below. They have 461k and 621k members respectively, and content is posted regularly, which is sufficient for data extraction.
> * https://www.reddit.com/r/Stoicism/
> * https://www.reddit.com/r/Buddhism/  

The Pushshift.io API provides a method of extracting data from subreddits. However, it is limited to 100 posts per request, so multiple requests are needed to gather enough data. We will use a for loop to achieve this, with each request asking for posts created before the oldest post already collected. We will gather around 10,000 posts from each subreddit to analyse.
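
The repeated-request idea can be sketched as a single loop that moves a `before` cursor back in time. This is a minimal illustration and not the code used in this notebook. `collect_posts` and `fetch_page` are hypothetical names, and `fetch_page` is assumed to take a params dict and return a list of post dictionaries that each contain `created_utc`.

```python
def collect_posts(fetch_page, subreddit, n_requests, size=100):
    """Collect posts by paging backwards in time with the 'before' cursor.

    fetch_page is a hypothetical callable: it takes a params dict and
    returns a list of post dicts, each with a 'created_utc' timestamp.
    """
    params = {'subreddit': subreddit, 'size': size}
    collected = []
    for _ in range(n_requests):
        posts = fetch_page(params)
        if not posts:
            break  # no older posts were returned, so stop early
        collected.extend(posts)
        # The oldest timestamp in this batch is the cursor for the next request
        oldest = min(post['created_utc'] for post in posts)
        params = {**params, 'before': oldest}
    return collected
```

Assuming `get_posts` returns a list of posts, the result could be turned into a DataFrame with `pd.DataFrame(collect_posts(get_posts, 'Stoicism', 100))`.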

### Extract first 100 posts from r/Stoicism

In [7]:
# define parameters for extraction
params_stocism = {'subreddit': 'Stoicism', 'size': 100}

In [8]:
# extract posts
posts_stocism = get_posts(params_stocism)

In [9]:
# Create a dataframe from the posts
df_stocism = create_new_df(posts_stocism)

In [10]:
print(df_stocism.shape)
df_stocism.head()

(100, 73)


Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,url_overridden_by_dest,whitelist_status,wls,media,media_embed,secure_media,secure_media_embed,removed_by_category,suggested_sort,poll_data
0,[],False,tylerdhenry,,[],,text,t2_6phe51a9,False,False,...,https://ogjre.com/episode/1836-ryan-holiday,all_ads,6,,,,,,,
1,[],False,Kriispy66,,[],,text,t2_5ygkrs6g,False,False,...,https://youtu.be/AzszJ4Ey2ws,all_ads,6,"{'oembed': {'author_name': 'Kriispy', 'author_...","{'content': '&lt;iframe width=""267"" height=""20...","{'oembed': {'author_name': 'Kriispy', 'author_...","{'content': '&lt;iframe width=""267"" height=""20...",,,
2,[],False,Tristan6252,,[],,text,t2_mzzjcxf,False,False,...,,all_ads,6,,,,,,,
3,[],False,Anarchist-monk,,[],,text,t2_7czy605j,False,False,...,,all_ads,6,,,,,automod_filtered,,
4,[],False,w0ntfix,,[],,text,t2_19rrs330,False,False,...,,all_ads,6,,,,,,,


### Extract first 100 posts from r/Buddhism

In [11]:
# define parameters for extraction
params_buddhism = {'subreddit': 'Buddhism', 'size': 100}

In [12]:
# extract posts
posts_buddhism = get_posts(params_buddhism)

In [13]:
# Create a dataframe from the posts
df_buddhism = create_new_df(posts_buddhism)

In [14]:
print(df_buddhism.shape)
df_buddhism.head()

(100, 80)


Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,whitelist_status,wls,author_flair_template_id,author_flair_text_color,crosspost_parent,crosspost_parent_list,poll_data,gallery_data,is_gallery,media_metadata
0,[],False,LabbaikYaRasulAllah,,[],,text,t2_mtaflvek,False,False,...,all_ads,6,,,,,,,,
1,[],False,HelpCurious,,[],,text,t2_a8202bhe,False,False,...,all_ads,6,,,,,,,,
2,[],False,comoestas969696,,[],,text,t2_e6jjoxd2,False,False,...,all_ads,6,,,,,,,,
3,[],False,comoestas969696,,[],,text,t2_e6jjoxd2,False,False,...,all_ads,6,,,,,,,,
4,[],False,Much_Walk1823,,[],,text,t2_azpxxie2,False,False,...,all_ads,6,,,,,,,,


### Extract remaining posts

* The Pushshift API is limited to 100 posts per request. We aim to extract 10,000 posts from each subreddit, so we will run the request 99 more times.

In [15]:
# extract remaining posts for r/Stoicism
for i in range(99):
    df_stocism = update_df(df_stocism, 'Stoicism')
print('Number of posts extracted: ',len(df_stocism))

Number of posts extracted:  9987


In [16]:
# extract remaining posts for r/Buddhism
for i in range(99):
    df_buddhism = update_df(df_buddhism, 'Buddhism')
print('Number of posts extracted: ',len(df_buddhism))

Number of posts extracted:  9992
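
Before saving, it may be worth checking that no post was collected more than once. The sketch below is one possible check and was not run as part of this notebook. It assumes each post has a unique `id` column, which is not visible in the truncated output above, and `drop_repeated_posts` is a hypothetical helper name.

```python
import pandas as pd


def drop_repeated_posts(df, key='id'):
    """Return a de-duplicated copy of df and the number of rows removed.

    Assumes `key` names a column that uniquely identifies a post.
    """
    deduped = df.drop_duplicates(subset=key, keep='first').reset_index(drop=True)
    return deduped, len(df) - len(deduped)
```

For example, `df_stocism, n_removed = drop_repeated_posts(df_stocism)` would keep the first copy of each post and report how many repeats were dropped.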


## Save extracted DataFrames to CSV files
---

In [18]:
# save to csv
df_stocism.to_csv('../data/df1.csv', index=False)
df_buddhism.to_csv('../data/df2.csv', index=False)