## `Problem statement`
---
Reddit has decided to change its business model and become the world's top conversation platform.
- Reddit now wants to display not only organic conversations from its members but also any publicly available conversation on the internet, each classified under one of its existing subreddits.

- They have hired us to run a "stage 0 test" to evaluate whether such a classification algorithm would be effective on posts from Reddit itself, before importing from other sources.

**The aim of our project is to create and train a model that can correctly classify a corpus into its correct subreddit**

We also want to be mindful of sentiment and will **create a sentiment filter** category that future Reddit users can apply to each post.


## `Project 3 - Part1: Fetching data`
---

In this notebook, we will:
1. [scrape Reddit using the Pushshift API](#defining-the-function-to-fetch-data-and-pass-it-to-df)
2. [reduce the dimensionality of the DataFrames](#data-selection)
3. [cleanse the data at a high level (lowercasing, emoji removal, dropping empty posts)](#data-wrangling-plan)

In [1]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


import requests
import regex


import emoji
from emoji import UNICODE_EMOJI
import re

### `Defining the function to fetch data and pass it to DF`

In [2]:
url = 'https://api.pushshift.io/reddit/search/submission'

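def fetch_reddit(url, size, subreddit, iterations, first_date):
    """Fetch `iterations` batches of up to `size` submissions from `subreddit`,
    walking backwards in time from the Unix timestamp `first_date`."""
    my_list = []
    params = {'subreddit': subreddit, 'size': size, 'before': first_date}
    for i in range(iterations):
        response = requests.get(url, params=params)
        response.raise_for_status()  # stop on an HTTP error instead of parsing a bad response
        response_extract = response.json()
        if not response_extract['data']:  # no older submissions left to fetch
            break
        my_list += response_extract['data']  # add this batch of submissions to the list
        params['before'] = my_list[-1]['created_utc']  # next batch: posts older than the last one fetched
    return my_list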
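
A minimal sketch of the `before`-cursor pagination, with the HTTP call swapped for a hypothetical `get_page` callable so the logic can be run and checked without network access. The names `fetch_pages`, `get_page`, and `fake_get_page`, as well as the fake posts, are assumptions made for illustration and are not part of the Pushshift API. The loop stops as soon as a page comes back empty, since reading the last element of an empty result would otherwise fail.

```python
def fetch_pages(get_page, size, iterations, first_date):
    """Collect posts page by page, moving the `before` cursor backwards in time.

    `get_page` is a hypothetical callable taking (before, size) and returning
    a list of post dicts, each with a `created_utc` key.
    """
    posts = []
    before = first_date
    for _ in range(iterations):
        page = get_page(before, size)
        if not page:  # nothing older is left, so stop early
            break
        posts.extend(page)
        before = page[-1]['created_utc']  # continue from the oldest post seen so far
    return posts


# Stand-in for the API: five fake posts with timestamps 500, 400, ..., 100
fake_posts = [{'created_utc': t} for t in range(500, 0, -100)]


def fake_get_page(before, size):
    """Return up to `size` fake posts strictly older than `before`, newest first."""
    older = [p for p in fake_posts if p['created_utc'] < before]
    return older[:size]


result = fetch_pages(fake_get_page, size=2, iterations=10, first_date=600)
print([p['created_utc'] for p in result])
```

Even though 10 iterations are requested, the sketch returns after the fake data runs out.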

Fetching data from 2 subreddits and passing it to a DataFrame. <br>
We request 6,000 posts per subreddit to make sure we have enough data to work with.<br>
In the step below, we are telling the API: <br>
*go fetch batches of 200 posts from this subreddit, starting on Aug 30th and working backwards, until you have done it 30 times*

In [3]:
%%time
function_1 = fetch_reddit('https://api.pushshift.io/reddit/search/submission/',200,'DunkinDonuts',30,1661867019) # Unix Epoch time =  Tuesday, August 30, 2022 1:43:39 PM
function_2 = fetch_reddit('https://api.pushshift.io/reddit/search/submission/',200,'starbucks',30,1661867019) ## Unix Epoch time =  Tuesday, August 30, 2022 1:43:39 PM
df_dunkin = pd.DataFrame(function_1) # this is the key to store function results in a dataframe!
df_starbucks = pd.DataFrame(function_2)

CPU times: total: 3.44 s
Wall time: 5min 46s
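
The `first_date` value passed above is a Unix timestamp. As a quick check, here is a small sketch using only the standard library that converts it to a readable date and back, assuming the date quoted in the code comment is expressed in UTC:

```python
from datetime import datetime, timezone

first_date = 1661867019  # value passed to fetch_reddit above

# Unix timestamp -> timezone-aware datetime in UTC
as_utc = datetime.fromtimestamp(first_date, tz=timezone.utc)
print(as_utc.isoformat())  # 2022-08-30T13:43:39+00:00

# datetime in UTC -> Unix timestamp
back = int(datetime(2022, 8, 30, 13, 43, 39, tzinfo=timezone.utc).timestamp())
print(back == first_date)
```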


In [4]:
df_dunkin.info() # checking the size for each dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5995 entries, 0 to 5994
Data columns (total 77 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  5995 non-null   object 
 1   allow_live_comments            5995 non-null   bool   
 2   author                         5995 non-null   object 
 3   author_flair_css_class         0 non-null      object 
 4   author_flair_richtext          5932 non-null   object 
 5   author_flair_text              0 non-null      object 
 6   author_flair_type              5932 non-null   object 
 7   author_fullname                5932 non-null   object 
 8   author_is_blocked              4233 non-null   object 
 9   author_patreon_flair           5932 non-null   object 
 10  author_premium                 5932 non-null   object 
 11  awarders                       5995 non-null   object 
 12  can_mod_post                   5995 non-null   b

In [5]:
df_starbucks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Data columns (total 78 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  6000 non-null   object 
 1   allow_live_comments            6000 non-null   bool   
 2   author                         6000 non-null   object 
 3   author_flair_background_color  1808 non-null   object 
 4   author_flair_css_class         1899 non-null   object 
 5   author_flair_richtext          5996 non-null   object 
 6   author_flair_template_id       1885 non-null   object 
 7   author_flair_text              1899 non-null   object 
 8   author_flair_text_color        1904 non-null   object 
 9   author_flair_type              5996 non-null   object 
 10  author_fullname                5996 non-null   object 
 11  author_is_blocked              6000 non-null   bool   
 12  author_patreon_flair           5996 non-null   o

### `Data selection`

77 features (78 for the starbucks data), but only 2 or 3 will be useful to predict the origin of the post:<br>
We select `selftext` and `title` as explanatory variables, plus the `subreddit` column as the label.<br>
We merge both DataFrames into one for easier data manipulation.

In [6]:
df_starbucks = df_starbucks[['subreddit','selftext','title']]
df_dunkin = df_dunkin[['subreddit','selftext','title']]

In [7]:
df_concat = pd.concat([df_dunkin,df_starbucks]) # merging both files

In [8]:
df_concat.shape # checking new df

(11995, 3)

### `Data wrangling plan`
---
- Transform to lower
- Remove empty posts
- Remove emojis

##### Transform to lower
Why we are doing this: a model might treat a word that is capitalized at the beginning of a sentence differently from the same word appearing later in the sentence without a capital letter. This can reduce accuracy.<br>
*https://stackoverflow.com/questions/45855160/nlp-when-to-lowercase-text-during-preprocessing*

In [9]:
df_concat['selftext'] = df_concat['selftext'].str.lower()
df_concat['title'] = df_concat['title'].str.lower()
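
As a small illustration of the reasoning above, here is a sketch with a made-up sentence (not taken from our corpus) showing how lowercasing merges the counts of a word that appears both capitalized and uncapitalized:

```python
from collections import Counter

text = "Coffee is great and I drink coffee daily"  # made-up example sentence

raw_counts = Counter(text.split())               # 'Coffee' and 'coffee' are separate tokens
lowered_counts = Counter(text.lower().split())   # both collapse into 'coffee'

print(raw_counts['Coffee'], raw_counts['coffee'])
print(lowered_counts['coffee'])
```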

##### Remove empty posts and posts under 100 characters

In [10]:
# we drop posts whose selftext is shorter than 100 characters, which eliminates '[removed]' placeholders
# as well as bare hyperlinks that do not provide useful information
df_concat = df_concat[df_concat['selftext'].str.len()>=100]

In [11]:
df_concat.shape

(5801, 3)
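
To make the effect of this filter concrete, here is a minimal sketch on a few made-up rows (the `sample` DataFrame is an assumption for illustration, not our data). Rows with a missing `selftext` get a length of NaN, and since NaN compared with 100 is False, they are dropped along with the short posts:

```python
import pandas as pd

# Made-up rows: a placeholder, a missing body, a bare link, and a long post
sample = pd.DataFrame({
    'subreddit': ['starbucks'] * 4,
    'selftext': ['[removed]', None, 'https://example.com/short-link', 'x' * 120],
})

lengths = sample['selftext'].str.len()  # NaN where selftext is missing
kept = sample[lengths >= 100]           # only the 120-character post survives

print(len(kept))
```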

In [12]:
df_concat.head() # we need to reset the index

Unnamed: 0,subreddit,selftext,title
0,DunkinDonuts,pumpkin small: $1.99\n\noriginal small: $2.29\...,how come the pumpkin coffee is less expensive ...
1,DunkinDonuts,a few weeks ago someone posted a comment about...,dunkin app zip code issue
6,DunkinDonuts,"our nearest dunkin is a bit of a drive, and th...",can you buy the unsweetened flavors dunkin uses?
7,DunkinDonuts,how do i ask for an extra shot of flavor in th...,mobile app ordering question
8,DunkinDonuts,i’ve used a can of monster and a variable amou...,monster energy punch at home


In [13]:
# the index now has gaps left by the filter (and duplicates inherited from the concat)
# resetting the index to be able to iterate over it
df_concat.reset_index(inplace=True)

In [14]:
df_concat.drop(columns=['index'], inplace=True)
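
The two steps above can also be written as a single call. A minimal sketch on made-up frames (the names `a`, `b`, and the derived variables are assumptions for illustration), showing how `ignore_index=True` renumbers rows at concat time and how `reset_index(drop=True)` renumbers without creating an extra `index` column:

```python
import pandas as pd

a = pd.DataFrame({'subreddit': ['DunkinDonuts'] * 2, 'selftext': ['a', 'b']})
b = pd.DataFrame({'subreddit': ['starbucks'] * 2, 'selftext': ['c', 'd']})

stacked = pd.concat([a, b])                         # index: 0, 1, 0, 1 (duplicates)
renumbered = pd.concat([a, b], ignore_index=True)   # index: 0, 1, 2, 3

filtered = renumbered[renumbered['selftext'] != 'b']  # index: 0, 2, 3 (gap)
clean = filtered.reset_index(drop=True)               # index: 0, 1, 2, no 'index' column

print(list(clean.index), list(clean.columns))
```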

In [15]:
df_concat.head()

Unnamed: 0,subreddit,selftext,title
0,DunkinDonuts,pumpkin small: $1.99\n\noriginal small: $2.29\...,how come the pumpkin coffee is less expensive ...
1,DunkinDonuts,a few weeks ago someone posted a comment about...,dunkin app zip code issue
2,DunkinDonuts,"our nearest dunkin is a bit of a drive, and th...",can you buy the unsweetened flavors dunkin uses?
3,DunkinDonuts,how do i ask for an extra shot of flavor in th...,mobile app ordering question
4,DunkinDonuts,i’ve used a can of monster and a variable amou...,monster energy punch at home


##### Removing emojis

We remove emojis because we are focusing on word importance rather than sentiment.<br>

In [16]:
# from https://gist.github.com/n1n9-jp/5857d7725f3b14cbc8ec3e878e4307ce
def remove_emoji(string):
    emoji_pattern = re.compile("["
        u"\U00002700-\U000027BF"  # Dingbats
        u"\U0001F600-\U0001F64F"  # Emoticons
        u"\U00002600-\U000026FF"  # Miscellaneous Symbols
        u"\U0001F300-\U0001F5FF"  # Miscellaneous Symbols And Pictographs
        u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        u"\U0001F680-\U0001F6FF"  # Transport and Map Symbols
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

In [17]:
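# returns 1 when exactly one emoji match is found in the text, 0 when none, False when more than one
def is_emoji(s):
    count = 0
    for symbol in UNICODE_EMOJI:
        count += s.count(symbol)
        if count > 1:
            return False  # more than one match: this post adds 0 to the sum below
    return count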

In [18]:
# counting the posts in our corpus with exactly one emoji match
df_concat['selftext'].apply(is_emoji).sum()

198

In [19]:
# #4 has one, let's print it out
df_concat['selftext'][4]

'i’ve used a can of monster and a variable amount of ice with a zero sugar mixed berry powerade and it tastes almost exactly like it 🤷🏽\u200d♂️'

In [20]:
# testing the replace function on index 4 
remove_emoji(df_concat['selftext'][4]) 

'i’ve used a can of monster and a variable amount of ice with a zero sugar mixed berry powerade and it tastes almost exactly like it \u200d️'
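
The output above still contains `\u200d` (zero-width joiner) followed by a variation selector, which are the invisible characters that glue emoji sequences together. A hedged sketch of a variant that strips them too: it keeps the same ranges as `remove_emoji` and adds those two code points. The names `EMOJI_AND_JOINERS` and `remove_emoji_sequences` are hypothetical, and removing the zero-width joiner everywhere is an assumption that suits English text but may not suit scripts that rely on it.

```python
import re

# Same ranges as remove_emoji, plus the joiner and variation selector
EMOJI_AND_JOINERS = re.compile(
    "["
    "\U00002700-\U000027BF"  # Dingbats
    "\U0001F600-\U0001F64F"  # Emoticons
    "\U00002600-\U000026FF"  # Miscellaneous Symbols
    "\U0001F300-\U0001F5FF"  # Miscellaneous Symbols And Pictographs
    "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
    "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
    "\U0001F680-\U0001F6FF"  # Transport and Map Symbols
    "\u200d"                 # zero-width joiner (added)
    "\ufe0f"                 # variation selector-16 (added)
    "]+"
)


def remove_emoji_sequences(text):
    """Remove emojis together with the invisible characters that join them."""
    return EMOJI_AND_JOINERS.sub('', text)


# Person shrugging + skin tone + joiner + male sign + variation selector
sample = 'tastes almost exactly like it \U0001F937\U0001F3FD\u200d\u2642\ufe0f'
cleaned = remove_emoji_sequences(sample)
print(repr(cleaned))
```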

In [21]:
%%time
# apply the cleaner to the whole column at once (avoids chained assignment on df_concat['selftext'][i])
df_concat['selftext'] = df_concat['selftext'].astype(str).apply(remove_emoji)


CPU times: total: 1.02 s
Wall time: 1.08 s


In [22]:
df_concat.head()

Unnamed: 0,subreddit,selftext,title
0,DunkinDonuts,pumpkin small: $1.99\n\noriginal small: $2.29\...,how come the pumpkin coffee is less expensive ...
1,DunkinDonuts,a few weeks ago someone posted a comment about...,dunkin app zip code issue
2,DunkinDonuts,"our nearest dunkin is a bit of a drive, and th...",can you buy the unsweetened flavors dunkin uses?
3,DunkinDonuts,how do i ask for an extra shot of flavor in th...,mobile app ordering question
4,DunkinDonuts,i’ve used a can of monster and a variable amou...,monster energy punch at home


In [23]:
df_concat['selftext'][4] # checking that it works:

'i’ve used a can of monster and a variable amount of ice with a zero sugar mixed berry powerade and it tastes almost exactly like it \u200d️'

### `Export dataframes to CSV`
---

In [28]:
df_concat.to_csv('transformed_csv/df_concat_part1.csv') # exporting the consolidated file (forward slash works on every OS)
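
The export fails if the target folder does not exist yet. A minimal sketch, assuming a temporary folder as a stand-in for the project directory and a one-row `df_demo` frame made up for illustration, that builds the path with `pathlib` and creates the folder first:

```python
import tempfile
from pathlib import Path

import pandas as pd

project_dir = Path(tempfile.mkdtemp())        # temporary stand-in for the project folder
out_dir = project_dir / 'transformed_csv'
out_dir.mkdir(parents=True, exist_ok=True)    # create the folder if it is missing
out_path = out_dir / 'df_concat_part1.csv'

df_demo = pd.DataFrame({'subreddit': ['starbucks'], 'selftext': ['demo'], 'title': ['t']})
df_demo.to_csv(out_path)                      # the index is written as an unnamed first column

reloaded = pd.read_csv(out_path, index_col=0)  # read that first column back as the index
print(list(reloaded.columns))
```

Reading the file back without `index_col=0` would surface the saved index as a column named `Unnamed: 0`.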

We have collected, organized, and cleaned the data to get ready for EDA. Part 2 will focus on finding trends and testing feature engineering before we model in Part 3.