![](./assets/images/reddit_code_banner.png)
[Image Source](https://preview.redd.it/k0ozkhhjubh31.jpg?width=2400&format=pjpg&auto=webp&s=6d44bf6a3a98bee16d1a70697b919fbd53a97796)

## Problem Statement

I am a data scientist at an e-commerce company looking to expand into Singapore and Malaysia. The marketing department has tasked me with looking at Reddit to find the hot topics of the day in both countries, and to judge whether the citizens of the two countries are similar enough that a single marketing strategy would suffice. If they are starkly different, I need to identify the differences so that the company can better target each country's populace. In short, the company wants to speak the lingo of Malaysian and Singaporean online forum users.

I'll be doing web scraping, EDA, and modelling with Naive Bayes, decision tree, logistic regression...

# Part 1 Data Collection and Cleaning

## Table of Contents

1. [Data Collection](#Data-Collection)
2. [Data Cleaning](#Data-Cleaning)









In [1]:
#Libraries
import requests
import pandas as pd
import numpy as np
import time
import regex as re

pd.options.mode.chained_assignment = None   # disable SettingWithCopyWarning: 
                                            # A value is trying to be set on a copy of a slice from a DataFrame

## Data Collection
Scrape Reddit with a function that loops 15 times, retrieving 100 posts each time, to collect 1,500 posts for each chosen subreddit, and then put each subreddit's posts into a dataframe.

In [2]:
# Function to get 1,500 posts from a subreddit and put them into a dataframe

def webscrape_reddit(subreddit):
    data = []
    url = 'https://api.pushshift.io/reddit/search/submission'
    header = {'User-agent':'GA DSIF-4 Student Project'}
    count = 0
    #starting from 0 loop 15 times to scrape 1,500 posts
    while count < 15: 
        # set the parameter 'before' to get subsequent posts after the first 100
        if count == 0:
            params = {
                'subreddit':subreddit,
                'size':100
            }
        else:
            params = {
                'subreddit':subreddit,
                'size':100,
                'before': before #check last post ['created_utc'] to get time/date
            }
        count+=1
        # send the request (the actual scraping)
        res = requests.get(url, params=params, headers=header)
        # on success, save the posts to the list `data` and record the last post's timestamp for the next request
        if res.status_code == 200:
            print('Status Code:',res.status_code,'of scrape count:',count)
            post = res.json()['data']
            if not post:  # no older posts left, so stop early instead of failing on post[-1]
                break
            before = post[-1]['created_utc']
            data.extend(post)
        else:
            print('Error. Something went wrong. Status code:',res.status_code)
            break
        time.sleep(1)
    #put results into a single dataframe    
    df = pd.DataFrame(data)
    print(f'Total number of post scrapped: {df.shape[0]}, from Subreddit: {subreddit}\n')
    return df
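
The `before` cursor is what lets each request pick up where the previous one stopped. The sketch below isolates that logic so it can be run without any network access. The names `paginate_by_timestamp`, `fake_fetch` and `FAKE_POSTS` are hypothetical and exist only for illustration; the fake fetcher stands in for the real API call and assumes posts come back newest first.

```python
from typing import Callable, Dict, List, Optional

def paginate_by_timestamp(fetch_page: Callable[[Optional[int], int], List[Dict]],
                          pages: int = 15, size: int = 100) -> List[Dict]:
    """Collect up to pages * size posts, walking backwards in time."""
    collected: List[Dict] = []
    before: Optional[int] = None           # no cursor on the first request
    for _ in range(pages):
        batch = fetch_page(before, size)
        if not batch:                      # nothing older is left
            break
        collected.extend(batch)
        before = batch[-1]['created_utc']  # oldest post seen so far
    return collected

# Hypothetical stand-in for the API: 250 posts, newest first.
FAKE_POSTS = [{'id': i, 'created_utc': 1_000 - i} for i in range(250)]

def fake_fetch(before: Optional[int], size: int) -> List[Dict]:
    older = [p for p in FAKE_POSTS if before is None or p['created_utc'] < before]
    return older[:size]

posts = paginate_by_timestamp(fake_fetch, pages=15, size=100)
print(len(posts))
```

Because the loop stops on an empty page, a subreddit with fewer posts than requested simply returns what is available.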

In [11]:
# This cell is commented out to preserve the dataset used

#scrape both subreddits

#df_singapore = webscrape_reddit('singapore')
#df_malaysia = webscrape_reddit('malaysia')

#save the original scraped datasets

#df_malaysia.to_csv('./assets/datasets/malaysia_raw.csv',index=False)
#df_singapore.to_csv('./assets/datasets/singapore_raw.csv',index=False)

Status Code: 200 of scrape count: 1
Status Code: 200 of scrape count: 2
Status Code: 200 of scrape count: 3
Status Code: 200 of scrape count: 4
Status Code: 200 of scrape count: 5
Status Code: 200 of scrape count: 6
Status Code: 200 of scrape count: 7
Status Code: 200 of scrape count: 8
Status Code: 200 of scrape count: 9
Status Code: 200 of scrape count: 10
Status Code: 200 of scrape count: 11
Status Code: 200 of scrape count: 12
Status Code: 200 of scrape count: 13
Status Code: 200 of scrape count: 14
Status Code: 200 of scrape count: 15
Total number of post scrapped: 1500, from Subreddit: singapore

Status Code: 200 of scrape count: 1
Status Code: 200 of scrape count: 2
Status Code: 200 of scrape count: 3
Status Code: 200 of scrape count: 4
Status Code: 200 of scrape count: 5
Status Code: 200 of scrape count: 6
Status Code: 200 of scrape count: 7
Status Code: 200 of scrape count: 8
Status Code: 200 of scrape count: 9
Status Code: 200 of scrape count: 10
Status Code: 200 of scrape co

## Data Cleaning

In [2]:
df_singapore = pd.read_csv('./assets/datasets/singapore_raw.csv')
df_malaysia = pd.read_csv('./assets/datasets/malaysia_raw.csv')

In [3]:
# Check total columns and rows for each dataset
datasets = {'Malaysia':df_malaysia,'Singapore':df_singapore}

for c,df in datasets.items():
    print(f'{c} Dataset rows: {df.shape[0]}, columns: {df.shape[1]}')

Malaysia Dataset rows: 1499, columns: 86
Singapore Dataset rows: 1500, columns: 81


The two datasets do not share the same set of columns. I shall drop every column in the Malaysia dataset that does not appear in the Singapore dataset, and then drop any column in the Singapore dataset that does not appear in the Malaysia dataset.

In [4]:
# Isolate the extra variables in df_malaysia and drop them
for col in list(df_malaysia.columns):
    if col not in list(df_singapore.columns):
        df_malaysia.drop(col,axis=1,inplace=True)
        print(f'dropped from Malaysia: {col}')

# Isolate the extra variables in df_singapore and drop them
for col in list(df_singapore.columns):
    if col not in list(df_malaysia.columns):
        df_singapore.drop(col,axis=1,inplace=True)
        print(f'dropped from Singapore: {col}')

dropped from Malaysia: crosspost_parent
dropped from Malaysia: crosspost_parent_list
dropped from Malaysia: poll_data
dropped from Malaysia: collections
dropped from Malaysia: call_to_action
dropped from Malaysia: category
dropped from Singapore: banned_by
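
A more compact way to reach the same end state is to keep only the columns both dataframes share. This is a minimal sketch on toy data rather than the notebook's method; `df_a` and `df_b` are hypothetical stand-ins for the two scraped datasets.

```python
import pandas as pd

# Toy frames: each has one column the other lacks.
df_a = pd.DataFrame({'title': ['x'], 'score': [1], 'poll_data': [None]})
df_b = pd.DataFrame({'title': ['y'], 'score': [2], 'banned_by': [None]})

# Columns present in both frames.
shared = df_a.columns.intersection(df_b.columns)

# Selecting by the same Index gives both frames identical columns in the same order.
df_a = df_a[shared]
df_b = df_b[shared]

print(list(df_a.columns), list(df_b.columns))
```

The loop version used in the notebook has the advantage of printing each dropped column, which is useful for auditing.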


In [5]:
# Check for duplicated posts by looking at the title and then dropping the duplicates
for df in datasets.values():
    df.drop_duplicates('title',keep='last',inplace=True)

# Check shape of both 
print('After dropping duplicated posts, we\'re left with:')
for c,df in datasets.items():
    print(f'{c} Dataset rows: {df.shape[0]}, columns: {df.shape[1]}')

After dropping duplicated posts, we're left with:
Malaysia Dataset rows: 1476, columns: 80
Singapore Dataset rows: 1437, columns: 80


In [6]:
#save the datasets
df_malaysia.to_csv('./assets/datasets/malaysia_cleaned.csv',index=False)
df_singapore.to_csv('./assets/datasets/singapore_cleaned.csv',index=False)

In [7]:
# Combine both sets into a single dataset for further cleaning
df_merge = pd.concat(datasets)

In [8]:
#Check for duplicates between Singapore and Malaysia
df_merge[['title','author']][df_merge.duplicated(['title'],keep=False)].sort_values(by=['title'])

| | | title | author |
|---|---|---|---|
| Malaysia | 585 | 1.FRESH VEGETABLES | Current_Green_6063 |
| Singapore | 514 | 1.FRESH VEGETABLES | Current_Green_6063 |
| Malaysia | 81 | University degree | tysonreddit |
| Singapore | 86 | University degree | tysonreddit |
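
To see how the cross-subreddit check behaves, here is a small sketch on toy data (the frames `my` and `sg` and their contents are made up). Passing a dict to `pd.concat` labels each row with its source, `duplicated(..., keep=False)` flags every copy of a repeated title, and `drop_duplicates(..., keep='last')` retains the copy from whichever frame was concatenated last.

```python
import pandas as pd

# Toy frames standing in for the two subreddits.
my = pd.DataFrame({'title': ['University degree', 'Nasi lemak'], 'author': ['u1', 'u2']})
sg = pd.DataFrame({'title': ['University degree', 'MRT delay'], 'author': ['u1', 'u3']})

# Dict keys become the outer level of the row index.
merged = pd.concat({'Malaysia': my, 'Singapore': sg})

# keep=False marks all copies, so both sides of each duplicate are visible.
cross = merged[merged.duplicated('title', keep=False)]
print(cross)

# keep='last' retains the last occurrence, here the Singapore copy.
deduped = merged.drop_duplicates('title', keep='last')
print(deduped)
```

Under this setup a post made to both subreddits ends up attributed to Singapore only, which is worth keeping in mind when comparing the two groups later.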


In [9]:
# Drop duplicates between Singapore and Malaysia
df_merge.drop_duplicates('title',keep='last',inplace=True) 

In [10]:
#Look at variables available and decide which to keep
df_merge.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_richtext', 'link_flair_text_color', 'link_flair_type',
       'locked', 'media_metadata', 'media_only', 'no_follow', 'num_comments',
       'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink',
       'pinned', 'pwls', 'retrieved_on', 'score', 'selftext', 'send_replies',
       'spoiler', 'stickied', 'subreddit', 'subreddit_id',
       'subreddit_subscribers', 'subreddit_type', 'thumbnail',
       't

There is no official documentation for these Reddit post fields, so one is left to make educated guesses about what each variable stands for from its name and contents.

Some helpful guides I found on the subject matter:
1. https://github.com/pushshift/api
2. https://www.reddit.com/r/redditdev/comments/a1dn2p/any_documentation_on_post_properties_such_as/

Some initial thoughts for EDA (are there patterns that distinguish the two groups of netizens?):
1. Does one side embed more media than the other?
2. Which side tends to upload more NSFW posts?
3. What is the average number of comments per post?
4. Which group tends to moderate more vigorously?

Based on the above considerations, I decided to retain the following variables for EDA and ML purposes.
* removed_by_category -> reason the post was removed
* num_comments -> number of comments the post received
* over_18 -> flags whether the post is NSFW
* selftext -> content of the post
* title -> title of the post
* media_embed -> indicates whether the post has embedded media
* subreddit -> identifies whether the post is from Singapore or Malaysia
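
The EDA questions map fairly directly onto these columns. The following is a rough sketch of one way to summarise them per subreddit, using a made-up four-row dataframe called `toy`; the numbers are illustrative only and say nothing about the real data.

```python
import pandas as pd

# Made-up rows with the retained columns (values are illustrative only).
toy = pd.DataFrame({
    'subreddit': ['malaysia', 'malaysia', 'singapore', 'singapore'],
    'num_comments': [0, 4, 2, 6],
    'over_18': [False, True, False, False],
    'media_embed': [1, 0, 0, 0],
    'removed_by_category': [None, 'automod_filtered', None, None],
})

# A post counts as removed when a removal reason is recorded.
toy['was_removed'] = toy['removed_by_category'].notna()

# The mean of a 0/1 or boolean column is the share of posts with that property.
summary = toy.groupby('subreddit').agg(
    avg_comments=('num_comments', 'mean'),
    nsfw_share=('over_18', 'mean'),
    media_share=('media_embed', 'mean'),
    removed_share=('was_removed', 'mean'),
)
print(summary)
```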

In [11]:
# Create final dataframe for part 2 EDA and ML
df_final = df_merge[['removed_by_category',
                     'num_comments','over_18','selftext',
                     'title','media_embed','subreddit']].copy()  # copy so later edits do not touch df_merge

df_final.reset_index(inplace=True,drop=True)

In [12]:
#check for null values
df_final.isnull().sum()

removed_by_category    2063
num_comments              0
over_18                   0
selftext               1799
title                     0
media_embed            2772
subreddit                 0
dtype: int64

Here a NaN means that no value was recorded for that attribute of the post, rather than that data is missing. I shall therefore fill every NaN in removed_by_category with 'still_live', and map media_embed to 0 (NaN, no embedded media) or 1 (media present).

In [13]:
df_final['removed_by_category'] = df_final['removed_by_category'].fillna('still_live')
df_final['media_embed'] = df_final['media_embed'].notna().astype(int)  # 1 if the post has embedded media, else 0
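
As a quick check of the two transformations, the sketch below applies them to a made-up two-row dataframe. It uses `notna()` for the 0/1 flag, which tests for missingness by value and so does not rely on each missing entry being the exact `np.nan` object. The `toy` frame and its contents are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows: one live post without media, one filtered post with media.
toy = pd.DataFrame({
    'removed_by_category': [np.nan, 'automod_filtered'],
    'media_embed': [np.nan, "{'content': '...'}"],
})

# A missing removal reason means the post is still up.
toy['removed_by_category'] = toy['removed_by_category'].fillna('still_live')

# 1 when something is recorded for media_embed, 0 when it is missing.
toy['media_embed'] = toy['media_embed'].notna().astype(int)

print(toy)
```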

Finally, I shall clean up the title and content (selftext) of the posts using regular expressions and lambda functions.

In [31]:
# Remove Special characters
df_final['title'] = df_final['title'].str.replace('[^0-9a-zA-Z]+',' ',regex=True)
df_final['selftext'] = df_final['selftext'].str.replace('[^0-9a-zA-Z]+',' ',regex=True)

# Remove digits
df_final['title'] = df_final['title'].apply(lambda x:''.join(i for i in x if not i.isdigit()))

# Remove white spaces at both ends
df_final['title'] = df_final['title'].str.strip()
df_final['selftext'] = df_final['selftext'].str.strip()

# Replace the word removed from selftext 
# As it is just an indication that the post was deleted and not the actual content
df_final['selftext'] = df_final['selftext'].apply(lambda x: '' if x =='removed' else x)

df_final.head()

| | removed_by_category | num_comments | over_18 | selftext | title | media_embed | subreddit |
|---|---|---|---|---|---|---|---|
| 0 | still_live | 0 | False | Why non Slavs still decide to be in the Russia... | Why non Slavic ethnic in the Russian Federatio... | 0 | malaysia |
| 1 | still_live | 0 | False | How much revenue can hardware shop generate Se... | How much revenue can hardware shop generate | 0 | malaysia |
| 2 | still_live | 0 | False | | PRN Johor Perdana Menteri Tinjau Keadaan Anggo... | 1 | malaysia |
| 3 | still_live | 0 | False | | PRN Johor Program Gotong Royong Bersama Pendud... | 1 | malaysia |
| 4 | automod_filtered | 0 | False | | Boleh pcaya ka puasa dan aidilfitri tak PKP | 0 | malaysia |
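
To make the effect of the cleaning steps concrete, here is a small sketch that runs the same special-character pattern on two made-up titles. It removes digits with a regular expression instead of a lambda, and adds one step that is not in the notebook: collapsing repeated spaces, since deleting digits can leave a double space behind.

```python
import pandas as pd

# Made-up titles with punctuation, digits and stray whitespace.
titles = pd.Series(['PRN Johor: Perdana Menteri (2022)!', '  1.FRESH VEGETABLES '])

cleaned = (
    titles.str.replace('[^0-9a-zA-Z]+', ' ', regex=True)  # special characters -> space
          .str.replace(r'\d+', '', regex=True)             # remove digits
          .str.replace(r'\s+', ' ', regex=True)            # extra step: collapse repeated spaces
          .str.strip()                                     # trim both ends
)
print(cleaned.tolist())
```

Note that the character class keeps only ASCII letters and digits, so any non-Latin script in a post would be replaced by spaces as well.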


In [32]:
# Save final dataframe for part 2
df_final.to_csv('./assets/datasets/combined_dataset.csv',index=False)