## Problem Statement

We are a group of data scientists working for a young yoga studio, Moga Yoga. Our team has been tasked with working alongside the marketing team to attract more members to the studio and get the most out of its ad spend.

As the pandemic rages on, an increased focus on mental well-being has driven demand for meditation and yoga apps and classes. This presents an opportunity to expand our customer base, and a need to tailor advertisements that promote yoga as a coping strategy and self-care tool.

To maximise the effectiveness of our marketing campaigns aimed at yoga enthusiasts, we examine submissions from 2 subreddits: r/yoga and r/Meditation. Yoga and meditation are commonly regarded as 2 different practices, even though meditation **is** an integral part of yoga. By capitalising on their similar nature yet different modes of practice, we hope to build an effective NLP classification model to better target yoga enthusiasts.

The model will help us to:
1. identify the top predictors for r/yoga, to investigate the needs and wants of yoga enthusiasts
2. identify trending words, to tailor advertisements for yoga enthusiasts and possibly extend their reach to encourage meditation seekers to take up yoga.

In [1]:
# imports
import pandas as pd
import requests
from datetime import datetime
import time

In [2]:
def scrap_posts(subreddit, n_posts, created_utc):
    """Scrape submissions from a subreddit via Pushshift, newest first.

    n_posts is the number of requests to make (up to 100 posts each);
    created_utc is the epoch cutoff: only older posts are fetched.
    """
    posts = []
    url = 'https://api.pushshift.io/reddit/search/submission'
    before = created_utc

    for _ in range(n_posts):
        params = {
            'subreddit': subreddit,
            'size': 100,
            'before': before,
        }

        res = requests.get(url, params=params)

        if res.status_code != 200:
            print(f'Error Code {res.status_code}, {res.reason}')
            break

        batch = res.json()['data']
        if not batch:
            # no older posts left to fetch
            break
        posts.extend(batch)

        # continue from the oldest post in this batch
        before = batch[-1]['created_utc']

        # pause between requests to avoid overloading the API
        time.sleep(3)

    print(f"r/{subreddit} - Code:{res.status_code}, Status:{res.reason}")

    # create dataframe for scraped posts
    df = pd.DataFrame(posts)
    # note: datetime.fromtimestamp returns local time, not UTC
    df['created'] = df['created_utc'].apply(datetime.fromtimestamp)

    # posts arrive newest first, so the first row is the latest post
    latest_post_stamped = datetime.fromtimestamp(df['created_utc'].iloc[0])
    last_post_stamped = datetime.fromtimestamp(df['created_utc'].iloc[-1])

    print(f"Scrapped {df.shape[0]} posts from {latest_post_stamped} to {last_post_stamped}")
    print()

    df.to_csv('./' + subreddit + '_submissions.csv', index=False)

    return df.head()
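
The function above pages backwards through a subreddit by reusing the `created_utc` of the oldest post in each batch as the next `before` value. The following is a minimal sketch of that cursor logic on its own. Both `paginate_before` and `fake_fetch` are hypothetical names introduced for illustration; `fake_fetch` is an in-memory stand-in for the API, so the example runs without network access.

```python
from typing import Callable, Dict, List


def paginate_before(fetch: Callable[[int], List[Dict]],
                    start_before: int,
                    max_requests: int) -> List[Dict]:
    """Collect records newest first, moving the `before` cursor back after each batch."""
    records: List[Dict] = []
    before = start_before
    for _ in range(max_requests):
        batch = fetch(before)
        if not batch:
            # nothing older than the cursor: stop early
            break
        records.extend(batch)
        # the oldest record in this batch becomes the next cursor
        before = batch[-1]['created_utc']
    return records


def fake_fetch(before: int, page_size: int = 3) -> List[Dict]:
    """Hypothetical stand-in for the API: timestamps 100, 90, ..., 10, newest first."""
    all_timestamps = range(100, 0, -10)
    older = [{'created_utc': t} for t in all_timestamps if t < before]
    return older[:page_size]


posts = paginate_before(fake_fetch, start_before=95, max_requests=10)
print(len(posts), posts[0]['created_utc'], posts[-1]['created_utc'])
```

Stopping on an empty batch matters because indexing `batch[-1]` on an empty list would raise an `IndexError` once the cursor moves past the oldest available post.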

In [3]:
scrap_posts('yoga', 15, 1640908800)

r/yoga - Code:200, Status:OK
Scrapped 1500 posts from 2021-12-31 07:55:04 to 2021-11-07 19:16:53



| | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | gallery_data | is_gallery | media_metadata | crosspost_parent | crosspost_parent_list | author_flair_template_id | author_cakeday | edited | banned_by | created |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | shibahuskymom | | [] | | text | t2_2kjzb7tv | False | False | ... | | | | | | | | | | 2021-12-31 07:55:04 |
| 1 | [] | False | fdrecordings | | [] | | text | t2_8lmsmdn3 | False | False | ... | | | | | | | | | | 2021-12-31 07:49:45 |
| 2 | [] | False | fdrecordings | | [] | | text | t2_8lmsmdn3 | False | False | ... | | | | | | | | | | 2021-12-31 07:49:26 |
| 3 | [] | False | meggriffinsglasses | | [] | | text | t2_dbgfda3k | False | False | ... | | | | | | | | | | 2021-12-31 06:33:38 |
| 4 | [] | False | Friendly_Popo | | [] | | text | t2_kjpgn | False | False | ... | | | | | | | | | | 2021-12-31 04:53:59 |


In [4]:
scrap_posts('meditation', 15, 1640908800)

r/meditation - Code:200, Status:OK
Scrapped 1500 posts from 2021-12-31 07:38:50 to 2021-12-03 11:39:22



| | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | author_flair_template_id | poll_data | live_audio | event_end | event_is_live | event_start | media_metadata | banned_by | edited | created |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | Extension_Mouse686 | | [] | | text | t2_7neqv2mr | False | False | ... | | | | | | | | | | 2021-12-31 07:38:50 |
| 1 | [] | False | saad2607 | | [] | | text | t2_3arwul82 | False | False | ... | | | | | | | | | | 2021-12-31 07:04:07 |
| 2 | [] | False | dshep9729 | | [] | | text | t2_1glxiccy | False | False | ... | | | | | | | | | | 2021-12-31 07:03:52 |
| 3 | [] | False | hartmanners | | [] | | text | t2_24pivn1q | False | False | ... | | | | | | | | | | 2021-12-31 06:35:51 |
| 4 | [] | False | Alina_1981 | | [] | | text | t2_bk6wx6vq | False | False | ... | | | | | | | | | | 2021-12-31 06:11:53 |


**1500** posts were scraped from each subreddit, working backwards from 2021-12-31 (New Year's Eve), for a total of 3000 submissions. The sample therefore includes redditors' year-end thoughts on both topics.
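
The cutoff passed to `scrap_posts` is a Unix epoch value, so it can help to confirm which moment it represents. The sketch below converts it with the standard library. The UTC+8 offset is an assumption, not something stated in the notebook; it is used only because it would be consistent with the latest scraped posts appearing shortly before 08:00 on 2021-12-31 in the outputs above.

```python
from datetime import datetime, timedelta, timezone

cutoff = 1640908800  # the `created_utc` argument passed to scrap_posts

# interpret the epoch value explicitly as UTC
cutoff_utc = datetime.fromtimestamp(cutoff, tz=timezone.utc)
print(cutoff_utc.isoformat())    # 2021-12-31T00:00:00+00:00

# Assumption: the notebook was run in a UTC+8 time zone
assumed_local_tz = timezone(timedelta(hours=8))
cutoff_local = cutoff_utc.astimezone(assumed_local_tz)
print(cutoff_local.isoformat())  # 2021-12-31T08:00:00+08:00
```

Passing `tz` explicitly avoids the ambiguity of `datetime.fromtimestamp(x)` without a time zone, which returns a naive datetime in the machine's local time.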

It is interesting to note that, although both samples contain 1500 posts, the r/yoga posts span about 54 days (2021-11-07 to 2021-12-31) while the r/Meditation posts span about 28 days (2021-12-03 to 2021-12-31). Over this period, r/Meditation therefore received submissions roughly twice as frequently as r/yoga.
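
The difference in posting frequency can be estimated directly from the timestamps printed by the two scraping runs. The helper `posts_per_day` below is a hypothetical name introduced for this sketch; it assumes each run returned exactly 1500 posts and that the first and last printed timestamps bound the sample.

```python
from datetime import datetime


def posts_per_day(n_posts, newest, oldest, fmt="%Y-%m-%d %H:%M:%S"):
    """Average posts per day between two timestamp strings."""
    span = datetime.strptime(newest, fmt) - datetime.strptime(oldest, fmt)
    days = span.total_seconds() / 86400  # 86400 seconds in a day
    return n_posts / days


# timestamps copied from the scraping output above
yoga_rate = posts_per_day(1500, "2021-12-31 07:55:04", "2021-11-07 19:16:53")
meditation_rate = posts_per_day(1500, "2021-12-31 07:38:50", "2021-12-03 11:39:22")

print(f"r/yoga: {yoga_rate:.1f} posts/day")
print(f"r/Meditation: {meditation_rate:.1f} posts/day")
print(f"ratio: {meditation_rate / yoga_rate:.2f}")
```

On these figures r/yoga averages roughly 28 posts per day and r/Meditation roughly 54, though a single window around the year end may not be representative of typical activity.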