## Background

Social media gives food and beverage (F&B) businesses a real-time platform to engage with their customers, including those outside their immediate vicinity. These interactions can yield a wealth of insights that help F&B business owners improve their businesses. However, for many small establishments, managing and responding effectively to customer posts can be challenging.

## Problem statement

As an analyst at an F&B consultancy, you are tasked with developing a model that can classify a text post by whether it is about tea or coffee. The target clients are cafe operators, and the model is intended to help business owners better manage their social media interactions and derive quick insights about their customers. This project uses data scraped from the tea and coffee subreddits to train a model that predicts which subreddit a post came from. That prediction serves as a proxy for the model's ability to determine whether a text post is about coffee or tea.

The tea and coffee subreddits can be found below:
1. https://www.reddit.com/r/tea/
2. https://www.reddit.com/r/Coffee
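
The proxy framing in the problem statement can be made concrete with a small sketch: the subreddit a post came from becomes a binary target. The toy dataframes and the `is_coffee` column name below are assumptions for illustration only; the actual labelling happens in the later notebooks and may differ.

```python
import pandas as pd

# Toy stand-ins for the scraped posts; the real data has many more columns.
toy_tea = pd.DataFrame({"subreddit": ["tea", "tea"],
                        "title": ["Oolong haul", "Gaiwan tips"]})
toy_coffee = pd.DataFrame({"subreddit": ["Coffee"],
                           "title": ["Aeropress recipe"]})


def label_posts(tea, coffee):
    """Stack the two sets of posts and add a binary target column."""
    combined = pd.concat([tea, coffee], ignore_index=True)
    # Hypothetical encoding: 1 for r/Coffee posts, 0 for r/tea posts.
    combined["is_coffee"] = (combined["subreddit"] == "Coffee").astype(int)
    return combined


labelled = label_posts(toy_tea, toy_coffee)
print(labelled[["title", "is_coffee"]])
```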


## Approach

This project is split across three Jupyter notebooks:
1. Defining problem statement and data collection
2. Data cleaning and preprocessing
3. Modeling and results

## Import Python modules

In [1]:
import requests 
import pandas as pd
import numpy as np
import time

## Download posts from subreddits

Posts from the subreddits r/tea and r/Coffee were downloaded from Reddit's public JSON listing endpoint (`https://www.reddit.com/r/<subreddit>.json`), paging through results with the `after` token. They were then stored in a dataframe and exported as a CSV file for use in the next notebook. Close to 3000 posts were scraped from each subreddit, working backwards from 23 May 2022.

In [3]:
def scrape_posts(subreddit):
    """Download posts from a subreddit's JSON listing and return them as a DataFrame."""
    posts = []
    after = None  # tag of the last post seen; None requests the first page

    for _ in range(120):  # repeat the download, one page of posts per request
        url = "https://www.reddit.com/r/" + str(subreddit) + ".json"
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after

        # send request to url
        res = requests.get(current_url, headers={'User-agent': 'GA_Project_3 1.0'})

        # stop if the request was not successful
        if res.status_code != 200:
            print('Status error', res.status_code)
            break

        # get posts and add them to the posts list
        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        posts.extend(current_posts)

        # get the tag of the last post on the page
        after = current_dict['data']['after']

        # sleep for a random 2 to 5 seconds between requests to look more 'natural'
        sleep_duration = np.random.randint(2, 6)
        time.sleep(sleep_duration)

    # convert the list of posts to a DataFrame and return it
    return pd.DataFrame(posts)
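
The pagination logic inside `scrape_posts` can be exercised without any network access. The sketch below pulls the URL building and response parsing into two helpers, `build_listing_url` and `parse_listing`; both names are hypothetical, as is the hand-made `fake_listing` dictionary, which only imitates the `data` / `children` / `after` shape that the function above reads.

```python
def build_listing_url(subreddit, after=None):
    """Return the listing URL for a subreddit, optionally continuing after a post tag."""
    base = "https://www.reddit.com/r/" + str(subreddit) + ".json"
    return base if after is None else base + "?after=" + after


def parse_listing(listing):
    """Extract the post dictionaries and the tag of the last post from one page."""
    posts = [child["data"] for child in listing["data"]["children"]]
    return posts, listing["data"]["after"]


# Hand-made page with the same nesting that scrape_posts expects (illustrative only).
fake_listing = {
    "data": {
        "children": [
            {"data": {"name": "t3_aaa", "title": "First post"}},
            {"data": {"name": "t3_bbb", "title": "Second post"}},
        ],
        "after": "t3_bbb",
    }
}

posts, after = parse_listing(fake_listing)
next_url = build_listing_url("tea", after)
print(len(posts), next_url)
```

Splitting the loop this way keeps the request and sleep logic separate from the parts that can be checked offline.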


In [4]:
%%time 
#download posts
df_tea = scrape_posts('tea')
df_coffee = scrape_posts('Coffee')

#Wall time: 17min 23s


In [6]:
#checking number of rows and columns of downloaded data
print(f'Rows and columns of tea dataframe: {df_tea.shape} \n') #2958, 118
print(f'Rows and columns of coffee dataframe: {df_coffee.shape} \n') #2985, 109

Rows and columns of tea dataframe: (2958, 118) 

Rows and columns of coffee dataframe: (2985, 109) 
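
Row counts alone do not show whether every row is a distinct post. A minimal sketch of a uniqueness check follows, assuming the downloaded data has a `name` column holding each post's unique tag; that column is not visible in the truncated outputs in this notebook, so treat it as an assumption. The helper names and the toy dataframe are hypothetical.

```python
import pandas as pd


def count_unique_posts(df, id_col="name"):
    """Count distinct posts based on an identifier column."""
    return df[id_col].nunique()


def drop_repeated_posts(df, id_col="name"):
    """Keep the first occurrence of each post and renumber the rows."""
    return df.drop_duplicates(subset=id_col).reset_index(drop=True)


# Toy data in which one post appears twice (illustrative only).
toy = pd.DataFrame({
    "name": ["t3_aaa", "t3_bbb", "t3_aaa"],
    "title": ["First post", "Second post", "First post"],
})

n_unique = count_unique_posts(toy)
deduped = drop_repeated_posts(toy)
print(f"{len(toy)} rows, {n_unique} unique posts")
```

In `scrape_posts`, an `after` value of `None` falls back to the first-page URL, so repeated posts are possible if a listing runs out before the loop ends. The outputs shown here do not reveal whether that happened.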



## Check results of download

In [7]:
#checking outputs of download - tea
df_tea.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,url_overridden_by_dest,preview,is_gallery,media_metadata,gallery_data,crosspost_parent_list,crosspost_parent,call_to_action,poll_data,author_cakeday
0,,tea,What are you drinking today? What questions h...,t2_6l4z3,False,,0,False,"What's in your cup? Daily discussion, question...","[{'e': 'text', 't': 'Recurring'}]",...,,,,,,,,,,
1,,tea,We realize there are lots of people involved i...,t2_6l4z3,False,,0,False,"Marketing Monday! - May 23, 2022","[{'e': 'text', 't': 'Recurring'}]",...,,,,,,,,,,
2,,tea,,t2_3jjsa,False,,0,False,I’m in London. I like tea. (Thank you for the ...,"[{'e': 'text', 't': 'Photo'}]",...,https://i.imgur.com/PcuYX6j.jpg,{'images': [{'source': {'url': 'https://extern...,,,,,,,,
3,,tea,,t2_2euurysf,False,,0,False,Gifted myself a set of Shui Xian oolong on Int...,"[{'e': 'text', 't': 'Photo'}]",...,https://www.reddit.com/gallery/uwdphw,,True,"{'ys7ebvd0cb191': {'status': 'valid', 'e': 'Im...","{'items': [{'media_id': 'ys7ebvd0cb191', 'id':...",,,,,
4,,tea,,t2_1a8frkgn,False,,0,False,"when you drink so much tea, you just buy it by...","[{'e': 'text', 't': 'Photo'}]",...,https://i.redd.it/umvbkb3tx8191.jpg,{'images': [{'source': {'url': 'https://previe...,,,,,,,,


In [8]:
#checking outputs of download - coffee
df_coffee.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,subreddit_subscribers,created_utc,num_crossposts,media,is_video,media_metadata,call_to_action,url_overridden_by_dest,is_gallery,gallery_data
0,,Coffee,\n\nWelcome to the daily [/r/Coffee](https://...,t2_a04m4,False,,0,False,[MOD] The Daily Question Thread,[],...,999603,1653300000.0,0,,False,,,,,
1,,Coffee,Welcome to the /r/Coffee deal and promotional ...,t2_a04m4,False,,0,False,[MOD] The Official Deal Thread,[],...,999603,1653261000.0,0,,False,,,,,
2,,Coffee,Went to make my annual summer toddy filter pur...,t2_6qnov,False,,0,False,TIL: The diameters of the Aeropress and Toddy ...,[],...,999603,1653337000.0,0,,False,,,,,
3,,Coffee,"Honduras may be a tiny country, but its coffee...",t2_l21xy1r1,False,,0,False,Coffee in Honduras,[],...,999603,1653335000.0,0,,False,,,,,
4,,Coffee,I wonder where you have tried your best coffee...,t2_eb78oggo,False,,0,False,Where have you tried your best coffee?,[],...,999603,1653297000.0,0,,False,,,,,


## Export downloaded posts to csv

In [9]:
#exporting to csv
df_tea.to_csv('data/tea.csv', index = False) #done
df_coffee.to_csv('data/coffee.csv', index = False) #done
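
The export above relies on the `data/` folder already existing. A small sketch of a helper that creates the folder first is shown below; `export_posts` is a hypothetical name, and the example writes to a temporary directory rather than the project's real `data/` folder.

```python
from pathlib import Path
import tempfile

import pandas as pd


def export_posts(df, path):
    """Write a dataframe to CSV, creating the parent folder if it is missing."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


# Round trip through a temporary folder to confirm the file can be read back.
with tempfile.TemporaryDirectory() as tmp:
    out = export_posts(pd.DataFrame({"title": ["Oolong haul"]}),
                       Path(tmp) / "data" / "tea.csv")
    reloaded = pd.read_csv(out)

print(reloaded.shape)
```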

End of notebook 1 on Problem Statement and Data Collection.