# Project 3: Part I

- Problem Statement
- Executive Summary
- Data Collection
- Simple Cleaning of Data


# Problem Statement

Discussions on Reddit are organized into user-created areas of interest called "subreddits". There are about 138,000 active subreddits among a total of 1.2 million, as of July 2018. As the amount of subreddits will continue to grow, we would like build a classifier model to help identify posts from different subbredits and guide user traffic to the appropiate subbredit.

# Executive Summary:
The reason to create this classifier model is so that we can help administrators to summarise a whole thread, suppress redundant or irrelevant posts. For contributers, it might provide guidance for increasing the usefulness of a contribution. For moderators, it might to act as a base model to identify things such as spam posts, misplaced posts, flame wars, troll users or threads that need locking, merging or deleting.

In [29]:
# import the libraries
import requests
import pandas as pd
import time
import random

from bs4 import BeautifulSoup

# Data Collection

I picked the below 2 subbreddits on romance and relationships as they are both similar in topics and to see if the classifier is capable of differentiating between these 2 subreddits. 

In [2]:
# targetted urls for webscraping

url1 = "https://www.reddit.com/r/romance.json"
url2 = "https://www.reddit.com/r/relationships.json"

I will scrape 1000 posts from reddits API for each subbreddit

In [20]:
# Webscrape r/romance subreddit

posts = [] # list to store the posts
after = None

for a in range(40):
    if after == None:
        current_url = url1
    else:
        current_url = url1 + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'JP Inc 1.0'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
        
    # Read in posts for the current page
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    print("no. of post: " + str(len(current_posts)))
    posts.extend(current_posts)
    after = current_dict['data']['after']
    
    # if more than 1 page, read previous pages then merge with current page, write back to csv file
    if a > 0:
        prev_posts = pd.read_csv('romance.csv')
        current_df = pd.DataFrame(posts)
        combined_posts = pd.concat([prev_posts,current_df], ignore_index=True)
        combined_posts.to_csv('romance.csv', index = False)
        
    else:
        pd.DataFrame(posts).to_csv('romance.csv', index = False)

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/romance.json
no. of post: 25
2
https://www.reddit.com/r/romance.json?after=t3_gctr16
no. of post: 25
5
https://www.reddit.com/r/romance.json?after=t3_g4lkuz
no. of post: 25
6
https://www.reddit.com/r/romance.json?after=t3_fszgsb
no. of post: 25
5
https://www.reddit.com/r/romance.json?after=t3_flmf7c
no. of post: 25
2
https://www.reddit.com/r/romance.json?after=t3_fdqqhn
no. of post: 25
6
https://www.reddit.com/r/romance.json?after=t3_f7niat
no. of post: 25
5
https://www.reddit.com/r/romance.json?after=t3_f0494b
no. of post: 25
4
https://www.reddit.com/r/romance.json?after=t3_eslf7q
no. of post: 25
6
https://www.reddit.com/r/romance.json?after=t3_emcbqx
no. of post: 25
4
https://www.reddit.com/r/romance.json?after=t3_efpj5g
no. of post: 25
3
https://www.reddit.com/r/romance.json?after=t3_e7zbqa
no. of post: 25
2
https://www.reddit.com/r/romance.json?after=t3_e05nxt
no. of post: 25
3
https://www.reddit.com/r/romance.json?after=t3_ds95n2
no. of post: 25
5
https://

In [7]:
# Webscrape r/relationships subreddit

posts = []
after = None

for a in range(40):
    if after == None:
        current_url = url2
    else:
        current_url = url2 + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'JP Inc 1.0'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
        
    # Read in posts for the current page
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    print("no. of post: " + str(len(current_posts)))
    posts.extend(current_posts)
    after = current_dict['data']['after']
    
    # if more than 1 page, read previous pages then merge with current page, write back to csv file
    if a > 0:
        prev_posts = pd.read_csv('relationships.csv')
        current_df = pd.DataFrame(posts)
        combined_posts = pd.concat([prev_posts,current_df], ignore_index=True)
        combined_posts.to_csv('relationships.csv', index = False)
        
    else:
        pd.DataFrame(posts).to_csv('relationships.csv', index = False)

    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,6)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/relationships.json
no. of post: 26
4
https://www.reddit.com/r/relationships.json?after=t3_gi0s8d
no. of post: 25
4
https://www.reddit.com/r/relationships.json?after=t3_ghyzlo
no. of post: 25
6
https://www.reddit.com/r/relationships.json?after=t3_ghyp0t
no. of post: 25
6
https://www.reddit.com/r/relationships.json?after=t3_gi60e9
no. of post: 25
4
https://www.reddit.com/r/relationships.json?after=t3_gi3xu3
no. of post: 25
2
https://www.reddit.com/r/relationships.json?after=t3_ghuy77
no. of post: 25
4
https://www.reddit.com/r/relationships.json?after=t3_ghyur2
no. of post: 25
4
https://www.reddit.com/r/relationships.json?after=t3_ghwxud
no. of post: 25
4
https://www.reddit.com/r/relationships.json?after=t3_ghny0z
no. of post: 25
5
https://www.reddit.com/r/relationships.json?after=t3_ghsen0
no. of post: 25
3
https://www.reddit.com/r/relationships.json?after=t3_ghpzg2
no. of post: 25
5
https://www.reddit.com/r/relationships.json?after=t3_ghm62q
no. of post: 25
3
ht

## Simple Cleaning of data
I will now proceed to check the data and to remove any null or duplicate posts.

In [74]:
rom_df = pd.read_csv('romance.csv')
rel_df = pd.read_csv('relationships.csv')

In [60]:
rom_df.shape

(993, 104)

In [61]:
rel_df.shape

(979, 103)

Seems like there are some missing data in relationships dataframe or the posts have reached their limit. There also seem to be 1 missing column in relationship dataframe compared to romance dataframe. This should not be an issue as long as that missing column is not the column containing our posts. 

In [50]:
# Display to see dataframe max columns of 104
pd.options.display.max_columns = 104

In [51]:
rom_df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,hide_score,name,quarantine,link_flair_text_color,upvote_ratio,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,content_categories,is_self,mod_note,created,link_flair_type,wls,removed_by_category,banned_by,author_flair_type,domain,allow_live_comments,selftext_html,likes,suggested_sort,banned_at_utc,view_count,archived,no_follow,is_crosspostable,pinned,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,treatment_tags,visited,removed_by,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,crosspost_parent_list,crosspost_parent,author_cakeday
0,,romance,I am 21 years old and I’ve never been in love....,t2_6eot1ted,False,,0,False,I Desire Romance.,[],r/romance,False,,,0,False,t3_ghiik1,False,dark,0.81,,public,3,0,{},,False,[],,False,False,,{},,False,3,,False,,False,,[],{},,True,,1589208000.0,text,,,,text,self.romance,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2r6ie,,,,ghiik1,True,,Sinclair914,,5,True,,False,[],False,,/r/romance/comments/ghiik1/i_desire_romance/,,False,https://www.reddit.com/r/romance/comments/ghii...,7120,1589180000.0,0,,False,,,
1,,romance,I usually don't do this but my heart beats to ...,t2_nxyxdb2,False,,0,False,Going through a tough break up right now so de...,[],r/romance,False,,,0,False,t3_ghfzef,False,dark,1.0,,public,3,0,{},,False,[],,False,False,,{},,False,3,,False,,False,,[],{},,True,,1589197000.0,text,,,,text,self.romance,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2r6ie,,,,ghfzef,True,,chi310,,2,True,,False,[],False,,/r/romance/comments/ghfzef/going_through_a_tou...,,False,https://www.reddit.com/r/romance/comments/ghfz...,7120,1589168000.0,0,,False,,,
2,,romance,I know marriage is not exactly a necessity and...,t2_nowm36j,False,,0,False,It’s my biggest dream to get married and I hop...,[],r/romance,False,,,0,False,t3_gh1biw,False,dark,1.0,,public,16,0,{},,False,[],,False,False,,{},,False,16,,False,,False,,[],{},,True,,1589145000.0,text,,,,text,self.romance,True,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,False,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2r6ie,,,,gh1biw,True,,kikiswirline456,,2,True,,False,[],False,,/r/romance/comments/gh1biw/its_my_biggest_drea...,,False,https://www.reddit.com/r/romance/comments/gh1b...,7120,1589117000.0,0,,False,,,
3,,romance,I like you— more than your bravery and intelli...,t2_6eoh35vg,False,,0,False,To someone I like,[],r/romance,False,,,0,False,t3_ggxdcj,False,dark,1.0,,public,7,0,{},,False,[],,False,False,,{},,False,7,,False,,False,,[],{},,True,,1589126000.0,text,,,,text,self.romance,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2r6ie,,,,ggxdcj,True,,andriamei,,2,True,,False,[],False,,/r/romance/comments/ggxdcj/to_someone_i_like/,,False,https://www.reddit.com/r/romance/comments/ggxd...,7120,1589097000.0,0,,False,,,
4,,romance,I could usually go out of my way and flirt or ...,t2_4e8rqi12,False,,0,False,Is it actually possible to fall in love with a...,[],r/romance,False,,,0,False,t3_gh3r13,False,dark,1.0,,public,1,0,{},,False,[],,False,False,,{},,False,1,,False,,False,,[],{},,True,,1589154000.0,text,,,,text,self.romance,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2r6ie,,,,gh3r13,True,,ksapatupov,,1,True,,False,[],False,,/r/romance/comments/gh3r13/is_it_actually_poss...,,False,https://www.reddit.com/r/romance/comments/gh3r...,7120,1589125000.0,0,,False,,,


In [52]:
rel_df.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,hide_score,name,quarantine,link_flair_text_color,upvote_ratio,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,content_categories,is_self,mod_note,created,link_flair_type,wls,removed_by_category,banned_by,author_flair_type,domain,allow_live_comments,selftext_html,likes,suggested_sort,banned_at_utc,view_count,archived,no_follow,is_crosspostable,pinned,over_18,all_awardings,awarders,media_only,link_flair_template_id,can_gild,spoiler,locked,author_flair_text,treatment_tags,visited,removed_by,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,author_cakeday
0,,relationships,"Dear /r/relationships community,\n\nWe recogni...",t2_ahi9l,False,,0,False,"Coronavirus (SARS-CoV2, COVID-19) Mega-thread ...",[],r/relationships,False,6,meta meta,0,False,t3_g79ye3,False,dark,0.94,,public,85,0,{},,False,[],,False,False,,{},« Meta »,False,85,,True,,False,,[],{},,True,,1587769000.0,text,6,,,text,self.relationships,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,confidence,,,False,False,False,False,False,[],[],False,57db65d4-cb01-11e2-a792-12313d14782d,False,False,False,,[],False,,,moderator,t5_2qjvn,,,,g79ye3,True,,ResidentBlackGuy,,142,False,all_ads,False,[],False,,/r/relationships/comments/g79ye3/coronavirus_s...,all_ads,True,https://www.reddit.com/r/relationships/comment...,2862255,1587741000.0,0,,False,
1,,relationships,"I'm very openly bisexual, and my friend ""Lily""...",t2_6fzpu1aw,False,,0,False,My [21F] best friend [20F]'s bf [30M] won't le...,[],r/relationships,False,6,m-gn misc,0,False,t3_ghzarl,False,dark,0.97,,public,2180,0,{},,False,[],,False,False,,{},Non-Romantic,False,2180,,False,,1589258140.0,,[],{},,True,,1589269000.0,text,6,,,text,self.relationships,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,confidence,,,False,False,False,False,False,[],[],False,0dcfde48-cb01-11e2-bb61-12313d14782d,False,False,False,,[],False,,,,t5_2qjvn,,,,ghzarl,True,,TheMagusFowles,,258,True,all_ads,False,[],False,,/r/relationships/comments/ghzarl/my_21f_best_f...,all_ads,False,https://www.reddit.com/r/relationships/comment...,2862255,1589240000.0,0,,False,
2,,relationships,Link to the original post:\n\n [https://www.re...,t2_4d6cmw7i,False,,0,False,UPDATE: Coworker (M) is almost certainly recor...,[],r/relationships,False,6,m-it updates,0,False,t3_ghnm3z,False,dark,0.99,,public,3794,0,{},,False,[],,False,False,,{},Updates,False,3794,,False,,False,,[],{},,True,,1589233000.0,text,6,,,text,self.relationships,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,confidence,,,False,False,False,False,False,[],[],False,1df36664-cb01-11e2-aec7-12313d14782d,False,False,False,,[],False,,,,t5_2qjvn,,,,ghnm3z,True,,coworkerspythrowaway,,93,True,all_ads,False,[],False,,/r/relationships/comments/ghnm3z/update_cowork...,all_ads,False,https://www.reddit.com/r/relationships/comment...,2862255,1589204000.0,0,,False,
3,,relationships,We’ve been together for 3 years. Recently he’s...,t2_4zxugwnn,False,,0,False,He [22M] wouldn’t let me [21F] see his phone a...,[],r/relationships,False,6,m-be breakups,0,False,t3_gi7fuk,False,dark,0.89,,public,40,0,{},,False,[],,False,False,,{},Breakups,False,40,,False,,False,,[],{},,True,,1589302000.0,text,6,,,text,self.relationships,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,confidence,,,False,False,False,False,False,[],[],False,05d9361c-cb01-11e2-b494-12313d262949,False,False,False,,[],False,,,,t5_2qjvn,,,,gi7fuk,True,,SadAndEatingRamen,,49,True,all_ads,False,[],False,,/r/relationships/comments/gi7fuk/he_22m_wouldn...,all_ads,False,https://www.reddit.com/r/relationships/comment...,2862255,1589273000.0,0,,False,
4,,relationships,So this literally just happened. My husband wa...,t2_4qfxw6b7,False,,0,False,My (28F) husband (30M) just walked out. What now?,[],r/relationships,False,6,m-vt relationships,0,False,t3_gi2awb,False,dark,0.91,,public,104,0,{},,False,[],,False,False,,{},Relationships,False,104,,False,,False,,[],{},,True,,1589280000.0,text,6,,,text,self.relationships,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,confidence,,,False,False,False,False,False,[],[],False,c8ef68fc-cb00-11e2-8393-12313d262949,False,False,False,,[],False,,,,t5_2qjvn,,,,gi2awb,True,,artcritiquerealness,,47,True,all_ads,False,[],False,,/r/relationships/comments/gi2awb/my_28f_husban...,all_ads,False,https://www.reddit.com/r/relationships/comment...,2862255,1589251000.0,0,,False,


In [86]:
# check if string
rom_df['selftext'].dtype

dtype('O')

In [87]:
# check if string
rel_df['selftext'].dtype

dtype('O')

Columns that I would be interested to work on would be the main content of the posts which are in the 'selftext' column and I will keep the 'subreddit' column as a label to identify where my post originate from.  

In [75]:
# Check for null columns
rom_df['selftext'].isnull().sum()

68

In [76]:
# Check for null columns
rom_df['subreddit'].isnull().sum()

0

In [77]:
# Check for null columns
rel_df['selftext'].isnull().sum()

0

In [78]:
# Check for null columns
rel_df['subreddit'].isnull().sum()

0

Seems like there are missing data in rom_df['selftext']. I will drop those rows.

In [79]:
# Drop null rows
rom_df = rom_df.dropna(subset=['selftext', 'subreddit'])

In [80]:
rom_df.shape

(925, 104)

In [81]:
# Drop any duplicate entries
rom_df = rom_df.drop_duplicates(subset=['selftext'])

In [82]:
rom_df.shape

(924, 104)

In [83]:
rel_df.shape

(979, 103)

In [84]:
# Drop any duplicate entries
rel_df = rel_df.drop_duplicates(subset=['selftext'])

In [85]:
rel_df.shape

(506, 103)

For rom_df, there is a 1 duplicate entry which was dropped.<br>
For rel_df, there is a huge drop of 979-506 = 473 posts which were dropped. This is almost half my data. I decided to look further into the relationships.csv file to find out the reason.

I discover the below points which is causing the posts to be identified as duplicates.
- some of the posts have new line spaces at the start of the posts
- characters such as Iâ€™, Weâ€™ and &amp which are at the start of the posts

Theses posts that i examined visually, are not duplicates. 
Therefore, for relationships.csv, i have decided to use the data without processing it for duplicates.<br>
I will now export the clean data for 'selftext' and 'subreddit' to another csv file for data exploration and analysis.

In [90]:
clean_rom_df = pd.DataFrame(rom_df, columns=['selftext', 'subreddit'])

In [96]:
clean_rom_df.shape

(924, 2)

In [92]:
# export to new .csv
pd.DataFrame(clean_rom_df).to_csv('clean_romance.csv', index = False)

In [93]:
rel_df = pd.read_csv('relationships.csv')

In [98]:
clean_rel_df = pd.DataFrame(rel_df, columns=['selftext', 'subreddit'])

In [99]:
clean_rel_df.shape

(979, 2)

In [100]:
# export to new .csv
pd.DataFrame(clean_rel_df).to_csv('clean_relationships.csv', index = False)

EDA, model building and predictions will be carried out in the next jupyter notebook