<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP

# Part 1 of 4

# Executive Summary 

As a team of young entrepreneurs and data scientists planning to start a new cafe, we have decided to scrape and analyze data from online forums to inform the business decisions behind the cafe's set-up. We start with Reddit, a popular platform where enthusiasts discuss trending topics ([source](https://backlinko.com/reddit-users)), and focus on its tea and coffee communities. A model trained on this Reddit data would later let us sort future feedback from the public into the respective categories for further analysis, on the assumption that key label words are not included in that feedback.

This is a classification problem. We will compare models such as logistic regression, a random forest classifier, and naive Bayes to identify the best one for future predictions.

Next, we will look into the public's general sentiment (positive, neutral, or negative) towards some of these key topics to gain further insights for the cafe menu or other business opportunities. This will be done using tools such as spaCy ([source](https://spacy.io/usage/spacy-101)) and the twitter-roberta-base-sentiment model ([source](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment)) from Hugging Face.

## Problem Statement

To create a data-driven tea and coffee menu for the cafe by gathering feedback from the public.

Key questions to look at include: 
1. Which key topics, discussions, and unique flavors can be identified from the popular online forum Reddit?
2. Which model is best at classifying posts as tea or coffee? The chosen model can later be applied to feedback received by the cafe.
3. What is the public's general sentiment towards coffee and tea?

## Contents: 
Part 1: Data Collection 

- [Web Scraping Subreddit Tea](#Webscrapping_subreddit_tea)
- [Web Scraping Subreddit Coffee](#Webscrapping_subreddit_coffee)
- [Summary](#Summary)

Part 2: EDA and Data Cleaning

Part 3: Modelling and Model Evaluation

Part 4: Sentiment Analysis, Recommendation and Conclusion

## Data Sets

* [`tea.csv`](./datasets/tea.csv): 14,971 posts scraped from the tea subreddit (target: 15,000)
* [`coffee.csv`](./datasets/coffee.csv): 14,979 posts scraped from the coffee subreddit (target: 15,000)
* [`df_export.csv`](./datasets/df_export.csv): Combined dataframe export after EDA

Two sets of roughly 15,000 posts each, one for tea and one for coffee, were scraped from Reddit using the Pushshift Reddit API, targeting posts created before Saturday, 1 October 2022 00:00:00 UTC.

# Data Collection

- Scrape 15,000 posts each from the tea and coffee subreddits
- Collect the posts using the Pushshift Reddit API
- Target posts created before Saturday, 1 October 2022 00:00:00
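
The cutoff is passed to the API as a Unix epoch value. A minimal sketch of that conversion, assuming the cutoff is read as midnight UTC (the variable names are illustrative and do not appear in the notebook):

```python
from datetime import datetime, timezone

# Cutoff: 1 October 2022 00:00:00, interpreted here as UTC (an assumption)
cutoff = datetime(2022, 10, 1, 0, 0, 0, tzinfo=timezone.utc)

# The `before` request parameter takes epoch seconds
cutoff_epoch = int(cutoff.timestamp())
print(cutoff_epoch)  # 1664582400
```

Under that reading, the result matches the `date='1664582400'` value passed to the scraping function later in this notebook.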

In [1]:
import requests
import pandas as pd
import time
import random

In [2]:
url_subs = "https://api.pushshift.io/reddit/search/submission"

<a id="Webscrapping_subreddit_tea"></a>
# Web Scraping
## Subreddit tea

In [34]:
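# Define function for extraction of data
def get_bfore_posts(url, subreddit, date, runs=150):
    """Page backwards through submissions created before `date` (epoch seconds)."""
    params = {'subreddit': subreddit, 'size': 100, 'before': date}
    reddit_subs = []
    for i in range(runs):
        res = requests.get(url, params=params)
        if res.status_code != 200:
            # Report the failing status; the same 'before' value is retried on the next run
            print(f"batch {i} failed with status {res.status_code}")
        else:
            batch = res.json()['data']
            if not batch:
                # No older posts left, so stop instead of repeating the same request
                print(f"batch {i} returned no posts, stopping")
                break
            reddit_subs += batch
            # Continue from the oldest post collected so far
            params['before'] = reddit_subs[-1]['created_utc']
            print(f"batch {i} completed")
        # Pause 1 to 5 seconds between requests, including after errors
        time.sleep(random.randint(1, 5))
    return reddit_subs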
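
The function above pages backwards in time: each request asks for posts older than the oldest post collected so far. The sketch below isolates that pagination idea without any network calls. `paginate_before` and `fake_fetch` are hypothetical names used only for illustration, and the stand-in data is made up.

```python
def paginate_before(fetch_page, start_before, max_runs=150):
    """Collect posts by repeatedly asking for the page older than the last one seen.

    `fetch_page(before)` is a hypothetical callable that returns a list of
    post dicts (newest first), each with a 'created_utc' key.
    """
    posts = []
    before = start_before
    for _ in range(max_runs):
        batch = fetch_page(before)
        if not batch:
            break  # nothing older is available
        posts.extend(batch)
        before = batch[-1]['created_utc']  # oldest post in this batch
    return posts


# Stand-in for the API: posts at timestamps 1..10, served in pages of at most 4
def fake_fetch(before, page_size=4):
    older = [{'created_utc': t} for t in range(10, 0, -1) if t < before]
    return older[:page_size]


collected = paginate_before(fake_fetch, start_before=11)
print(len(collected))  # 10
```

Because a page can hold fewer posts than the requested size, the total collected can end up below `runs * size`.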

In [5]:
tea_df = get_bfore_posts(url_subs, "tea", date='1664582400')

batch 0 completed
batch 1 completed
batch 2 completed
...
batch 52 completed
...

In [6]:
len(tea_df)

14971

In [7]:
tea_df = pd.DataFrame(tea_df)

In [8]:
# Data collected between Sunday, August 22, 2021 and Saturday, October 1, 2022
tea_df['created_utc']

0        1664578075
1        1664575731
2        1664573221
3        1664568074
4        1664559432
            ...    
14966    1629596349
14967    1629594998
14968    1629589895
14969    1629583662
14970    1629583267
Name: created_utc, Length: 14971, dtype: int64
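
The `created_utc` values are Unix epoch seconds. A small sketch of how the first and last values shown above translate into calendar dates. The UTC+8 offset is an assumption about the notebook author's local time zone and is not stated in the source.

```python
from datetime import datetime, timedelta, timezone

newest, oldest = 1664578075, 1629583267  # first and last values in the output above

utc8 = timezone(timedelta(hours=8))  # assumed local offset, not stated in the notebook

for ts in (oldest, newest):
    as_utc = datetime.fromtimestamp(ts, tz=timezone.utc)
    as_local = as_utc.astimezone(utc8)
    print(as_utc.isoformat(), '|', as_local.strftime('%A, %B %d, %Y'))
```

Under that assumed offset the oldest post falls on Sunday, August 22, 2021, which agrees with the comment in the cell above; in UTC the same post falls on August 21, 2021.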

In [28]:
print(tea_df.shape)
tea_df.head()

(14971, 85)


| | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | crosspost_parent | crosspost_parent_list | media | media_embed | secure_media | secure_media_embed | author_cakeday | banned_by | edited | collections |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | psydchicjohn | | [] | | text | t2_7j12i | False | False | ... | | | | | | | | | | |
| 1 | [] | False | piecesofagrippa | | [] | | text | t2_gepwr23i | False | False | ... | | | | | | | | | | |
| 2 | [] | False | SamGoldfield | | [] | | text | t2_3aw3j | False | False | ... | | | | | | | | | | |
| 3 | [] | False | yddandy | | [{'e': 'text', 't': 'keemunthusiast'}] | keemunthusiast | richtext | t2_5f4myn | False | False | ... | | | | | | | | | | |
| 4 | [] | False | BigBart123 | | [] | | text | t2_gn72mmfq | False | False | ... | | | | | | | | | | |


<a id="Webscrapping_subreddit_coffee"></a>
# Web Scraping
## Subreddit coffee

In [35]:
coffee_df = get_bfore_posts(url_subs, "Coffee", date='1664582400')

batch 0 completed
batch 1 completed
batch 2 completed
...
batch 52 completed
...

In [36]:
coffee_df = pd.DataFrame(coffee_df)

In [37]:
# Data collected between Monday, January 3, 2022 and Saturday, October 1, 2022
coffee_df['created_utc']

0        1664581292
1        1664579910
2        1664579037
3        1664578285
4        1664577472
            ...    
14974    1641175984
14975    1641175603
14976    1641173271
14977    1641172582
14978    1641171781
Name: created_utc, Length: 14979, dtype: int64

In [38]:
print(coffee_df.shape)
coffee_df.head()

(14979, 82)


| | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | author_flair_template_id | media | media_embed | secure_media | secure_media_embed | author_cakeday | edited | banned_by | call_to_action | category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | _FormerFarmer | | [] | | text | t2_97nkyu0 | False | False | ... | | | | | | | | | | |
| 1 | [] | False | Geegoriel9 | | [] | | text | t2_rxuc56yj | False | False | ... | | | | | | | | | | |
| 2 | [] | False | TheRealOsamaru | | [] | | text | t2_1wrhxlk5 | False | False | ... | | | | | | | | | | |
| 3 | [] | False | Complex_Secretary_14 | | [] | | text | t2_52g92zw8 | False | False | ... | | | | | | | | | | |
| 4 | [] | False | deemonstalker | | [] | | text | t2_e9205 | False | False | ... | | | | | | | | | | |
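
The two frames have different column counts (85 for tea, 82 for coffee), so it can help to see which columns are not shared before combining them. A minimal sketch using plain sets follows. The two lists are short illustrative samples taken from the visible tails of the `head()` outputs, not the full column lists, so the result only demonstrates the technique.

```python
# Sample column names from the visible tail of each head() output (not the full lists)
tea_cols = ['media', 'media_embed', 'crosspost_parent', 'crosspost_parent_list',
            'author_cakeday', 'banned_by', 'edited', 'collections']
coffee_cols = ['media', 'media_embed', 'author_flair_template_id',
               'author_cakeday', 'edited', 'banned_by', 'call_to_action', 'category']

only_tea = sorted(set(tea_cols) - set(coffee_cols))     # in the tea sample only
only_coffee = sorted(set(coffee_cols) - set(tea_cols))  # in the coffee sample only
shared = sorted(set(tea_cols) & set(coffee_cols))       # present in both samples

print(only_tea)
print(only_coffee)
print(shared)
```

With the real data, the same comparison can be run on `set(tea_df.columns)` and `set(coffee_df.columns)`.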


# Export to CSV

In [43]:
tea_df.to_csv("data/tea.csv", index=False)

In [44]:
coffee_df.to_csv("data/coffee.csv", index=False)

<a id="Summary"></a>
# Summary

An attempt to scrape 15,000 posts from the tea subreddit, covering Sunday, August 22, 2021 to Saturday, October 1, 2022, yielded 14,971 posts. Similarly, an attempt to scrape 15,000 posts from the coffee subreddit, covering Monday, January 3, 2022 to Saturday, October 1, 2022, yielded 14,979 posts.

More posts were scraped than are strictly needed, to allow for posts that cannot be used after data cleaning, removal of duplicates, and similar steps.
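
The duplicate removal mentioned above could be sketched as follows, assuming each scraped post is a dict with a unique `id` field. `drop_duplicate_posts` is a hypothetical helper and the sample posts are made up for illustration.

```python
def drop_duplicate_posts(posts, key='id'):
    """Keep the first occurrence of each post, identified by `key`."""
    seen = set()
    unique = []
    for post in posts:
        if post[key] not in seen:
            seen.add(post[key])
            unique.append(post)
    return unique


# Made-up sample: the first and third entries share the same id
sample = [
    {'id': 'a1', 'title': 'Oolong advice'},
    {'id': 'b2', 'title': 'Grinder help'},
    {'id': 'a1', 'title': 'Oolong advice'},
]

unique_posts = drop_duplicate_posts(sample)
print(len(unique_posts))  # 2
```

Once the posts are in a DataFrame, `DataFrame.drop_duplicates` with a `subset` argument achieves the same result.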