# Project 3: Scraping Reddit
---

Notebook Organisation:
1. **Webscraping (SavingMoney & Investing)**
2. EDA and Preprocessing
3. Model Tuning and Insights

## Introduction
---

Reddit is an American social news aggregation, web content rating, and discussion website. Registered members submit content to the site such as links, text posts, and images, which are then voted up or down by other members. Posts are organized by subject into user-created boards called "subreddits", which cover a variety of topics such as news, politics, science, movies, video games, music, books, sports, fitness, cooking, pets, and image-sharing. Submissions with more up-votes appear towards the top of their subreddit and, if they receive enough up-votes, ultimately on the site's front page. Despite strict rules prohibiting harassment, Reddit's administrators spend considerable resources on moderating the site.

This study examines two subreddits: (1) r/SavingMoney and (2) r/Investing. Both revolve around preparing for the future, with an emphasis on money. Although money is central to both, they differ in how it is used: one emphasizes the importance of saving, while the other is about growing wealth through investment. The goal of this project is therefore to find out how distinct these two concepts are from one another.

## Problem Statement
---

We, a consulting firm (Data Insights Pte Ltd), were recently **engaged by ABC bank** to predict whether a particular post within a given subreddit is savings related, as the bank wants to better engage its customers on savings based on ground sentiment. A classification model will be built and evaluated on its accuracy score.
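
As a rough illustration of the evaluation metric, accuracy is the share of posts whose predicted label matches the true label. The sketch below is plain Python; the function name `accuracy_score_simple` and the toy labels are made up for illustration and are not taken from the project's data.

```python
def accuracy_score_simple(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 1 = savings related, 0 = not savings related
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy_score_simple(y_true, y_pred))  # 4 correct out of 5
```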

## Executive Summary
---

In general, the two subreddits (r/Investing and r/SavingMoney) are quite distinct from one another: model accuracy typically ranges from 50% to 90%, with the best model reaching 95%. While both revolve around money, the way the money is used differs. Hence, the words that drive the prediction are not highly correlated between the two subreddits' posts, as shown in the EDA process.

During model selection, the combination of Naive Bayes and TfidfVectorizer predicted with an accuracy of 95% on the testing data and 98.6% on the training data. Among the three feature options (title, post, and both combined), the post alone seems to give the best results in terms of accuracy and computing time. While combining title and post may increase the accuracy, it is limited by the maximum number of features allowed into the model and requires longer computing time. Of all the models tried, this one proved to be the best, as it is not overfitted and has the highest test accuracy when identifying savings-related posts, as tasked by our client.
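
To give a feel for how a Naive Bayes text classifier separates the two kinds of post, here is a small self-contained sketch. It is not the project's model: it uses raw word counts instead of TF-IDF weights, a whitespace tokenizer, and four made-up training posts, and the names `tokenize`, `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train_nb(docs, labels):
    """Count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = tokenize(doc)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, doc, alpha=1.0):
    """Return the class with the highest log-probability under add-alpha smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(doc):
            if token not in vocab:
                continue  # ignore words never seen in training
            numerator = word_counts[label][token] + alpha
            score += math.log(numerator / (total_words + alpha * len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training posts: 1 = savings related, 0 = investing related
docs = [
    "save money on groceries with coupons",
    "budget tips to save on bills",
    "buy index funds and stocks",
    "stock dividends and etf portfolio",
]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)

print(predict_nb(model, "how to save on bills"))   # words seen mostly in class 1
print(predict_nb(model, "etf stocks portfolio"))   # words seen only in class 0
```

Words that appear in only one class pull the prediction strongly towards that class, which is also why a misleading word in an unrelated post can throw the prediction off.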

However, the model has some limitations. Currently, some words are misclassified by the model, which will throw the prediction off if they appear in a post that is not savings related. These words should be removed to increase the accuracy of the model.

## Content
---

- [Data Dictionary](#Data-Dictionary)
- [Library](#Library)
- [Webscraping](#Webscraping)
- [Understanding the Data](#Understanding-the-Data)
- [Initial Feature Selection](#Feature-Selection)

## Data Dictionary
---

**Data Dictionary for combined.csv (r/SavingMoney & r/Investing)**

|S/N|Feature|Data Type|Dataset|Description|
|---|---|---|---|---|
|1|**subreddit**|*int*|combine|Mapped as 1 for r/SavingMoney, 0 for r/Investing|
|2|**id**|*str*|combine|The unique identifier of the post|
|3|**title**|*str*|combine|The title of the subreddit post|
|4|**selftext**|*str*|combine|The body of the post|
|5|**title_len**|*float*|combine|The length of the title|
|6|**text_len**|*float*|combine|The length of the body|
|7|**title_cleaned**|*str*|combine|The processed title|
|8|**selftext_cleaned**|*str*|combine|The processed body of post|
|9|**titlepost**|*str*|combine|The combination of processed title and post|
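
The `subreddit` encoding in the table can be sketched as a simple lookup. This is a minimal illustration, assuming the raw value is the subreddit name as it appears in the scraped data (`SavingMoney` or `investing`); the names `LABELS` and `encode_subreddit` are hypothetical.

```python
# Hypothetical lookup: 1 for r/SavingMoney, 0 for r/Investing (as in the data dictionary)
LABELS = {"savingmoney": 1, "investing": 0}


def encode_subreddit(name):
    """Map a subreddit name to its integer label, ignoring case and an optional 'r/' prefix."""
    key = name.strip().lower()
    if key.startswith("r/"):
        key = key[2:]
    return LABELS[key]


print([encode_subreddit(s) for s in ["SavingMoney", "investing", "r/Investing"]])
```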

## Library
---

In [65]:
import requests
import time
import nltk
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
warnings.filterwarnings('ignore')
sns.set_style('ticks')
%matplotlib inline

## Webscraping
---

The [Reddit API](https://www.reddit.com/dev/api/) allows one to interact with Reddit remotely and scrape posts from a subreddit. The scraping was done by requesting the .json listing of each subreddit with a custom User-Agent header, and calling time.sleep() after each request so that the requests are not flagged as coming from a bot.

A function was created for scraping, and two subreddits (r/SavingMoney and r/Investing) were chosen for this project. While a subreddit post has 100+ features, only a few of them will be selected to build the prediction model.
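
Pagination works by passing the `after` token from one page of results into the request for the next page. The sketch below only builds the URL and makes no network call; `BASE_URL` and `build_listing_url` are hypothetical names used for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.reddit.com"


def build_listing_url(subreddit, after=None):
    """Build the .json listing URL for a subreddit such as 'r/investing'.

    `after` is the pagination token returned by the previous page,
    or None when requesting the first page.
    """
    url = f"{BASE_URL}/{subreddit.strip('/')}/.json"
    if after is not None:
        url += "?" + urlencode({"after": after})
    return url


print(build_listing_url("r/investing"))
print(build_listing_url("r/investing", after="t3_k3whgz"))
```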

### Function for Scraping

In [2]:
# The function 'reddit_to_csv' takes three arguments: 1. the subreddit being scraped; 2. the filename to give
# the csv file; and 3. the number of requests the user would like to make to reddit's API.

def reddit_to_csv(subreddit, filename, n_requests=40):
    
    #Create an empty list to be used later in function:
    posts = []
    
    #Establish that 'after' (a variable used later) is None type:
    after = None
    
    #Create User-Agent to avoid 429 res.status_code:
    headers = {'User-Agent': 'Pony Inc 1.0'}
    
    #for loop n_requests iterations (n_requests is established by user):
    for i in range(n_requests):
        
        #Assign 'url' to reddit's base url, plus whatever subreddit the user provides, plus .json for clean results:
        url = 'https://www.reddit.com/' + str(subreddit) + '/.json'
       
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after
        print(current_url)
        
        #Set my res variable equal to the results from requests.get, using the 'current_url' and 'headers' set above:
        res = requests.get(current_url, headers = headers)
        
        #Conditional statement to ensure access to the API is approved:
        if res.status_code == 200:
            
            current_dict = res.json()
            current_posts = [p['data'] for p in current_dict['data']['children']]
            posts.extend(current_posts)
            after = current_dict['data']['after']
            
        else:
            print('Status error', res.status_code)
            break
            
        if i > 0:
            prev_posts = pd.read_csv(str(filename))
            current_df = pd.DataFrame(current_posts)
            new_df = pd.concat([prev_posts, current_df])
            new_df.to_csv(str(filename), index = False)

        else:
            pd.DataFrame(posts).to_csv(str(filename), index = False)
            
        #Enter a random delay of 2 to 10 seconds between requests to reddit's API for good internet citizenship:
        sleep_duration = random.randint(2, 10)
        print(f'Resting Time: {sleep_duration}')
        time.sleep(sleep_duration)
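
One detail of the loop above: if a page comes back with `after` set to None (no further pages), the next iteration requests the first page again, which is one possible source of repeated posts. The sketch below shows a variant of the paging logic that stops in that case. It is an illustration only: `collect_posts` and `fetch_page` are hypothetical names, the fetcher is passed in so the logic can be tried without network access, and the fake pages use made-up ids.

```python
def collect_posts(fetch_page, n_requests=40):
    """Collect posts page by page.

    fetch_page(after) is a hypothetical callable that returns the parsed JSON
    of one listing page (a dict), given the pagination token (or None).
    """
    posts = []
    after = None
    for _ in range(n_requests):
        listing = fetch_page(after)["data"]
        posts.extend(child["data"] for child in listing["children"])
        after = listing["after"]
        # A missing token means there are no further pages; stop instead of
        # requesting the first page again.
        if after is None:
            break
    return posts


# Fake two-page listing standing in for the API response (made-up ids)
PAGES = {
    None: {"data": {"children": [{"data": {"id": "a1"}}, {"data": {"id": "a2"}}], "after": "t3_a2"}},
    "t3_a2": {"data": {"children": [{"data": {"id": "a3"}}], "after": None}},
}

posts = collect_posts(lambda after: PAGES[after], n_requests=5)
print([p["id"] for p in posts])
```

In a real run, `fetch_page` would wrap the `requests.get(...)` call and the sleep between requests.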

#### Scraping of r/Investing subreddit

In [528]:
reddit_to_csv(subreddit = 'r/investing',
              n_requests = 30,
              filename = 'data/investing_reddit_posts.csv')

https://www.reddit.com/r/investing/.json
Resting Time: 5
https://www.reddit.com/r/investing/.json?after=t3_k3whgz
Resting Time: 7
https://www.reddit.com/r/investing/.json?after=t3_k4b2gj
Resting Time: 2
https://www.reddit.com/r/investing/.json?after=t3_k42og2
Resting Time: 9
https://www.reddit.com/r/investing/.json?after=t3_k3uhkb
Resting Time: 9
https://www.reddit.com/r/investing/.json?after=t3_k3r79u
Resting Time: 7
https://www.reddit.com/r/investing/.json?after=t3_k37y63
Resting Time: 10
https://www.reddit.com/r/investing/.json?after=t3_k3e2eh
Resting Time: 2
https://www.reddit.com/r/investing/.json?after=t3_k28s7y
Resting Time: 4
https://www.reddit.com/r/investing/.json?after=t3_k1z1f5
Resting Time: 8
https://www.reddit.com/r/investing/.json?after=t3_k17zoo
Resting Time: 7
https://www.reddit.com/r/investing/.json?after=t3_k1chqh
Resting Time: 6
https://www.reddit.com/r/investing/.json?after=t3_k0ry8u
Resting Time: 8
https://www.reddit.com/r/investing/.json?after=t3_k0iur3
Resting T

#### Scraping of r/SavingMoney subreddit

In [529]:
reddit_to_csv(subreddit = 'r/SavingMoney',
              n_requests = 30,
              filename = 'data/saving_reddit_posts.csv')

https://www.reddit.com/r/SavingMoney/.json
Resting Time: 4
https://www.reddit.com/r/SavingMoney/.json?after=t3_jpu2fx
Resting Time: 4
https://www.reddit.com/r/SavingMoney/.json?after=t3_jbxs0f
Resting Time: 8
https://www.reddit.com/r/SavingMoney/.json?after=t3_j3yu1s
Resting Time: 6
https://www.reddit.com/r/SavingMoney/.json?after=t3_itv6yt
Resting Time: 6
https://www.reddit.com/r/SavingMoney/.json?after=t3_igyx13
Resting Time: 8
https://www.reddit.com/r/SavingMoney/.json?after=t3_i6qdih
Resting Time: 9
https://www.reddit.com/r/SavingMoney/.json?after=t3_hukl8q
Resting Time: 4
https://www.reddit.com/r/SavingMoney/.json?after=t3_hbhwnk
Resting Time: 2
https://www.reddit.com/r/SavingMoney/.json?after=t3_gt4xbc
Resting Time: 4
https://www.reddit.com/r/SavingMoney/.json?after=t3_gelvt7
Resting Time: 9
https://www.reddit.com/r/SavingMoney/.json?after=t3_g96hzn
Resting Time: 10
https://www.reddit.com/r/SavingMoney/.json?after=t3_fzcf7w
Resting Time: 3
https://www.reddit.com/r/SavingMoney/.js

## Understanding the Data
---

### r/SavingMoney Reddit

Within this subreddit, r/SavingMoney, a total of 748 posts were scraped. Each post has 107 columns/features; some columns contain only null values, and only notable columns such as title, selftext (body of post), votes, id and subreddit are filled. <br>
Of note, there are duplicated posts within the subreddit, which will be removed as they would affect the accuracy score of the prediction model. In this step, we will also try to identify potential false duplicates among the duplicates.
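
The idea of a true duplicate versus a false duplicate can be sketched in plain Python: a post counts as a duplicate only when both its title and its body repeat, while a post that merely shares a title is kept. The function name `drop_duplicate_posts` and the sample posts are made up for illustration.

```python
def drop_duplicate_posts(posts):
    """Keep the first occurrence of each (title, selftext) pair, preserving order."""
    seen = set()
    unique = []
    for post in posts:
        key = (post.get("title"), post.get("selftext"))
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique


# Made-up posts: the second is an exact repeat, the third shares only the title
posts = [
    {"id": "x1", "title": "Saving tips", "selftext": "Cook at home."},
    {"id": "x2", "title": "Saving tips", "selftext": "Cook at home."},
    {"id": "x3", "title": "Saving tips", "selftext": "Cancel unused subscriptions."},
]
print([p["id"] for p in drop_duplicate_posts(posts)])
```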

In [51]:
saving = pd.read_csv("data/saving_reddit_posts.csv")

In [67]:
saving.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,top_awarded_type,hide_score,name,quarantine,link_flair_text_color,upvote_ratio,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,content_categories,is_self,mod_note,created,link_flair_type,wls,removed_by_category,banned_by,author_flair_type,domain,allow_live_comments,selftext_html,likes,suggested_sort,banned_at_utc,view_count,archived,no_follow,is_crosspostable,pinned,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,treatment_tags,visited,removed_by,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,poll_data,crosspost_parent_list,url_overridden_by_dest,crosspost_parent,media_metadata
0,,SavingMoney,In order to minimize the constant referral pos...,t2_clauj,False,,0,False,Most Common Money Saving Tools: Do NOT Post Th...,[],r/SavingMoney,False,6,,0,,False,t3_calpl0,False,dark,1.0,,public,10,0,{},,False,[],,False,False,,{},,False,10,,False,,1562596387.0,,[],{},,True,,1562625000.0,text,6,,,text,self.SavingMoney,True,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,True,False,False,False,False,[],[],False,False,False,False,,[],False,,,moderator,t5_2qmsg,,,,calpl0,True,,likelyculprit,,3,False,all_ads,False,[],False,,/r/SavingMoney/comments/calpl0/most_common_mon...,all_ads,True,https://www.reddit.com/r/SavingMoney/comments/...,7229,1562596000.0,1,,False,,,,,
1,,SavingMoney,I just cleared out the mod queue and HOLY CRAP...,t2_clauj,False,,0,False,"Heads up: If you post a Yotta referral, I'm ju...",[],r/SavingMoney,False,6,,0,,False,t3_jth9il,False,dark,1.0,,public,13,0,{},,False,[],,False,False,,{},,False,13,,False,,False,,[],{},,True,,1605309000.0,text,6,,,text,self.SavingMoney,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,[],[],False,False,False,True,,[],False,,,,t5_2qmsg,,,,jth9il,True,,likelyculprit,,0,False,all_ads,False,[],False,,/r/SavingMoney/comments/jth9il/heads_up_if_you...,all_ads,True,https://www.reddit.com/r/SavingMoney/comments/...,7229,1605280000.0,0,,False,,,,,
2,,SavingMoney,Throughout my time working at a New York based...,t2_88c81w27,False,,0,False,Everything is Negotiable,[],r/SavingMoney,False,6,,0,,False,t3_k4tihv,False,dark,0.84,,public,4,0,{},,False,[],,False,False,,{},,False,4,,False,,1606868725.0,,[],{},,True,,1606884000.0,text,6,,,text,self.SavingMoney,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qmsg,,,,k4tihv,True,,Finance_and_Saving,,2,True,all_ads,False,[],False,,/r/SavingMoney/comments/k4tihv/everything_is_n...,all_ads,False,https://www.reddit.com/r/SavingMoney/comments/...,7229,1606855000.0,0,,False,,,,,
3,,SavingMoney,"I'm 20 and I never saved money for ""big projec...",t2_6dtd0u20,False,,0,False,Best currency and saving bank account? (Advice...,[],r/SavingMoney,False,6,,0,,False,t3_k51oue,False,dark,1.0,,public,1,0,{},,False,[],,False,False,,{},,False,1,,False,,False,,[],{},,True,,1606910000.0,text,6,,,text,self.SavingMoney,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qmsg,,,,k51oue,True,,__The__Dude,,1,True,all_ads,False,[],False,,/r/SavingMoney/comments/k51oue/best_currency_a...,all_ads,False,https://www.reddit.com/r/SavingMoney/comments/...,7229,1606881000.0,0,,False,,,,,
4,,SavingMoney,Which bank is the most reasonable in terms of ...,t2_3sjs2poy,False,,0,False,Looking to switch my bank,[],r/SavingMoney,False,6,,0,,False,t3_k3ejpc,False,dark,1.0,,public,3,0,{},,False,[],,False,False,,{},,False,3,,False,,False,,[],{},,True,,1606704000.0,text,6,,,text,self.SavingMoney,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qmsg,,,,k3ejpc,True,,JustPonsie,,2,True,all_ads,False,[],False,,/r/SavingMoney/comments/k3ejpc/looking_to_swit...,all_ads,False,https://www.reddit.com/r/SavingMoney/comments/...,7229,1606675000.0,0,,False,,,,,


In [53]:
saving.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Columns: 107 entries, approved_at_utc to media_metadata
dtypes: bool(26), float64(30), int64(10), object(41)
memory usage: 492.5+ KB


In [59]:
# Based on the scraping, some posts were scraped twice.
# Apart from that, some may be advertisements that are reposted with a similar title or body.
save_dup = saving[(saving.duplicated(subset = ['title']) == True) & (saving.duplicated(subset = ['selftext']) == True)]\
['title'].value_counts()

print(len(save_dup))
save_dup.head()

59


My opinion on Yotta savings                                    1
How to Save Money to Have More Money to Spend                  1
The Ultimate Guide To Saving Money In Your Home                1
Welp, I love saving money So I had to make a video about it    1
ISO advice                                                     1
Name: title, dtype: int64

In [61]:
saving.drop_duplicates(subset=['title', 'selftext'], inplace = True)

In [62]:
saving.shape

(689, 107)

In [66]:
saving.isnull().sum()

approved_at_utc                  689
subreddit                          0
selftext                         317
author_fullname                   14
saved                              0
mod_reason_title                 689
gilded                             0
clicked                            0
title                              0
link_flair_richtext                0
subreddit_name_prefixed            0
hidden                             0
pwls                               0
link_flair_css_class             689
downs                              0
top_awarded_type                 689
hide_score                         0
name                               0
quarantine                         0
link_flair_text_color              0
upvote_ratio                       0
author_flair_background_color    689
subreddit_type                     0
ups                                0
total_awards_received              0
media_embed                        0
author_flair_template_id         689
i

### r/Investing Reddit

In [87]:
investment = pd.read_csv("data/investing_reddit_posts.csv")

In [88]:
investment.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,top_awarded_type,hide_score,name,quarantine,link_flair_text_color,upvote_ratio,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,content_categories,is_self,mod_note,created,link_flair_type,wls,removed_by_category,banned_by,author_flair_type,domain,allow_live_comments,selftext_html,likes,suggested_sort,banned_at_utc,view_count,archived,no_follow,is_crosspostable,pinned,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,treatment_tags,visited,removed_by,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday
0,,investing,"Alright everyone, it looks like we had pretty ...",t2_p8vmm,False,,1,False,Formal posting guidelines for political topics...,[],r/investing,False,6,,0,,False,t3_cyee69,False,dark,0.95,,public,269,3,{},,False,[],,False,False,,{},,False,269,,False,,1567518369.0,,[],{'gid_2': 1},,True,,1567395000.0,text,6,,,text,self.investing,True,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,True,False,False,False,False,"[{'giver_coin_reward': None, 'subreddit_id': N...",[],False,False,False,True,,[],False,,,,t5_2qhhq,,,,cyee69,True,,MasterCookSwag,,0,True,all_ads,False,[],False,,/r/investing/comments/cyee69/formal_posting_gu...,all_ads,True,https://www.reddit.com/r/investing/comments/cy...,1217324,1567366000.0,2,,False,
1,,investing,"If your question is ""I have $10,000, what do I...",t2_6l4z3,False,,0,False,Daily Advice Thread - All basic help or advice...,[],r/investing,False,6,,0,,False,t3_k4jq8t,False,dark,0.87,,public,11,0,{},,False,[],,False,False,,{},,False,11,,True,,False,,[],{},,True,,1606854000.0,text,6,,,text,self.investing,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,new,,,False,False,False,False,False,[],[],False,False,False,False,,[],False,,,moderator,t5_2qhhq,,,,k4jq8t,True,,AutoModerator,,154,False,all_ads,False,[],False,,/r/investing/comments/k4jq8t/daily_advice_thre...,all_ads,True,https://www.reddit.com/r/investing/comments/k4...,1217324,1606825000.0,0,,False,
2,,investing,&gt; https://investor.salesforce.com/press-rel...,t2_4gluq,False,,0,False,Salesforce Signs Definitive Agreement to Acqui...,[],r/investing,False,6,,0,,False,t3_k4ue9o,False,dark,0.97,,public,904,0,{},,False,[],,False,False,,{},,False,904,,True,,False,,[],{},,True,,1606886000.0,text,6,,,text,self.investing,True,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2qhhq,,,,k4ue9o,True,,avamore,,269,True,all_ads,False,[],False,,/r/investing/comments/k4ue9o/salesforce_signs_...,all_ads,False,https://www.reddit.com/r/investing/comments/k4...,1217324,1606858000.0,0,,False,
3,,investing,https://tcrn.ch/33yjZxh\n\nCanadian electric t...,t2_3rl12np9,False,,0,False,Canadian electric truck and bus manufacturer t...,[],r/investing,False,6,,0,,False,t3_k4j98w,False,dark,0.97,,public,845,2,{},,False,[],,False,False,,{},,False,845,,False,,False,,[],{},,True,,1606851000.0,text,6,,,text,self.investing,True,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,"[{'giver_coin_reward': 0, 'subreddit_id': None...",[],False,False,False,False,,[],False,,,,t5_2qhhq,,,,k4j98w,True,,aLifel0ngLearner,,195,True,all_ads,False,[],False,,/r/investing/comments/k4j98w/canadian_electric...,all_ads,False,https://www.reddit.com/r/investing/comments/k4...,1217324,1606823000.0,2,,False,
4,,investing,Hopefully this quick summary video will be hel...,t2_yghxf0f,False,,0,False,Airbnb (ABNB) IPO S1 Filing Summary,[],r/investing,False,6,,0,,False,t3_k4w85m,False,dark,0.98,,public,99,2,{},,False,[],,False,False,,{},,False,99,,False,,False,,[],{},,True,,1606892000.0,text,6,,,text,self.investing,True,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,"[{'giver_coin_reward': None, 'subreddit_id': N...",[],False,False,False,False,,[],False,,,,t5_2qhhq,,,,k4w85m,True,,learner4f,,58,True,all_ads,False,[],False,,/r/investing/comments/k4w85m/airbnb_abnb_ipo_s...,all_ads,False,https://www.reddit.com/r/investing/comments/k4...,1217324,1606863000.0,0,,False,


In [89]:
investment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 745 entries, 0 to 744
Columns: 103 entries, approved_at_utc to author_cakeday
dtypes: bool(26), float64(31), int64(10), object(36)
memory usage: 467.2+ KB


In [90]:
#checking for duplicates
in_dup = investment[(investment.duplicated(subset = ['title']) == True) & (investment.duplicated(subset = ['selftext']) == True)] \
['title'].value_counts()

print(len(in_dup))
in_dup.head()

223


Daily Advice Thread - All basic help or advice questions must be posted here.                                             18
Anti-capitalist forms of investing?                                                                                        1
My elderly parents wants to put some of their savings into the stock market rather than just gaining interest in banks     1
Thoughts on £CEY Centamin                                                                                                  1
Realistically what is ARKK 5-year and 10-year realistic return?                                                            1
Name: title, dtype: int64

In [91]:
investment.drop_duplicates(subset=['title', 'selftext'], inplace = True)

In [92]:
investment.shape

(505, 103)
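
With both subreddits de-duplicated, the class balance gives a useful reference point for the accuracy scores: a model that always predicts the larger class would already be right for that share of posts. The arithmetic below uses the two row counts printed above; later preprocessing may change these counts, so the figure is only indicative.

```python
# Post counts after de-duplication, taken from the shapes printed above
n_saving = 689      # r/SavingMoney
n_investing = 505   # r/Investing

total = n_saving + n_investing
# Accuracy of always predicting the larger class (r/SavingMoney)
baseline = max(n_saving, n_investing) / total

print(f"total posts: {total}")
print(f"majority-class baseline accuracy: {baseline:.3f}")
```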

In [80]:
investment.isnull().sum()

approved_at_utc                  505
subreddit                          0
selftext                           0
author_fullname                    1
saved                              0
mod_reason_title                 505
gilded                             0
clicked                            0
title                              0
link_flair_richtext                0
subreddit_name_prefixed            0
hidden                             0
pwls                               0
link_flair_css_class             505
downs                              0
top_awarded_type                 505
hide_score                         0
name                               0
quarantine                         0
link_flair_text_color              0
upvote_ratio                       0
author_flair_background_color    505
subreddit_type                     0
ups                                0
total_awards_received              0
media_embed                        0
author_flair_template_id         505
i

## Feature Selection 
---

A number of the scraped features are either null or hold no predictive value.

- Examples of features with null values: mod_reason_title, approved_by (these features are moderation records for a post)
- Examples of features that hold no predictive value: awards or is_video (these features indicate whether the post has achieved a certain goal or whether the post is a video)

After filtering, the remaining features were selected based on their identification potential for the model.

- **subreddit** - To identify which subreddit a particular post originated from
- **id** - To identify each individual post
- **distinguished** - To differentiate whether the post is from a moderator or from others
- **title** - As a means for prediction within the model
- **selftext** - As a means for prediction within the model

Note: apart from the text itself, a later step shows how the length of the title or post may help in prediction.
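
The first filtering step described above (finding features that are null for every post) can be sketched as follows. This is a plain-Python illustration with a hypothetical helper name, `all_null_columns`, and toy rows that reuse a few column names from the scraped data.

```python
def all_null_columns(rows):
    """Return the column names whose value is None in every row."""
    if not rows:
        return []
    columns = list(rows[0].keys())
    return [c for c in columns if all(row.get(c) is None for row in rows)]


# Toy rows using column names from the scraped data
rows = [
    {"title": "Everything is Negotiable", "mod_reason_title": None, "approved_by": None},
    {"title": "Looking to switch my bank", "mod_reason_title": None, "approved_by": None},
]
print(all_null_columns(rows))
```

On a pandas DataFrame the same check could be written as `df.columns[df.isnull().all()]`.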

In [81]:
columns = ['subreddit', 'id', 'distinguished', 'title', 'selftext']

In [82]:
saving = saving[columns]
investment = investment[columns]

In [83]:
saving.head()

| |subreddit|id|distinguished|title|selftext|
|---|---|---|---|---|---|
|0|SavingMoney|calpl0|moderator|Most Common Money Saving Tools: Do NOT Post Th...|In order to minimize the constant referral pos...|
|1|SavingMoney|jth9il||Heads up: If you post a Yotta referral, I'm ju...|I just cleared out the mod queue and HOLY CRAP...|
|2|SavingMoney|k4tihv||Everything is Negotiable|Throughout my time working at a New York based...|
|3|SavingMoney|k51oue||Best currency and saving bank account? (Advice...|I'm 20 and I never saved money for "big projec...|
|4|SavingMoney|k3ejpc||Looking to switch my bank|Which bank is the most reasonable in terms of ...|


In [84]:
investment.head()

| |subreddit|id|distinguished|title|selftext|
|---|---|---|---|---|---|
|0|investing|cyee69||Formal posting guidelines for political topics...|Alright everyone, it looks like we had pretty ...|
|1|investing|k4jq8t|moderator|Daily Advice Thread - All basic help or advice...|If your question is "I have $10,000, what do I...|
|2|investing|k4ue9o||Salesforce Signs Definitive Agreement to Acqui...|&gt; https://investor.salesforce.com/press-rel...|
|3|investing|k4j98w||Canadian electric truck and bus manufacturer t...|https://tcrn.ch/33yjZxh\n\nCanadian electric t...|
|4|investing|k4w85m||Airbnb (ABNB) IPO S1 Filing Summary|Hopefully this quick summary video will be hel...|


In [85]:
saving.to_csv('data/saving_reddit_posts_selected.csv',index=False)

In [86]:
investment.to_csv('data/investment_reddit_posts_selected.csv',index=False)