## Project 3: Web Scraping and Classification: Writing and Blogging

#### The Blogging and Writing subreddits are very similar in nature: both are communities focused on writing.

However, the communities differ in size: the Blogging subreddit has 67.7K users, whereas the Writing subreddit has 1.4M.
Based on the following image, the Writing subreddit is also far more active in absolute terms, with 1.1k users online at the time of the snapshot compared with only 96 for Blogging. As a share of each community's users, however, that is roughly 0.08% for Writing and 0.14% for Blogging.
Lastly, both subreddits were created at around the same time, in Q1 2008.

What is blogging?
A blog (a truncation of "weblog") is a discussion or informational website published on the World Wide Web (www) consisting of discrete, often informal diary-style text entries (posts). 

What is writing?
Writing is a medium of human communication that involves the representation of a language with symbols. Writing systems are not themselves human languages (with the debatable exception of computer languages); they are means of rendering a language into a form that can be reconstructed by other humans separated by time and/or space. 

In other words, a blogger is also a writer, one who writes on the internet through weblogs ('blogs'). Writing, however, is an art in itself that emphasises communication through language, and a writer can write anywhere (newspapers, books, magazines, emails, blogs, etc.).

### Business Case

I am a data scientist on a team that works with online influencers and YouTubers to increase the views on their posts.

A rising trend in the market is for bloggers to use analytics to increase the views on their posts. Part of this trend is the rise of Reddit as a virtual community where bloggers ask questions and seek guidance from like-minded individuals.

Our team is working on a long-term project to help bloggers increase the views on their blogs. The first phase of this project targets the Reddit platform, which bloggers often visit to share ideas, get feedback and ask questions.

#### Phase A Part 1: Create a classification tool to help bloggers post their questions and experiences in the correct subreddit. (Current project!)
#### Phase A Part 2: Look into subreddit analytics to understand how to structure a Reddit post to maximise views (upvotes, comments).

### Problem Statement
To create a text classifier that determines whether a Reddit post belongs to the "Blogging" or the "Writing" subreddit.

### Executive Summary
#### EDA

From our analysis, the Blogging subreddit places a heavy emphasis on blog optimisation (common phrases: SEO, traffic flow, keyword search, etc.) and less on the technical elements of writing. The Writing subreddit is the opposite: posts focus on writing techniques, with many users asking questions and seeking help with their stories (common phrases after tokenisation: 'writing advice', 'don know', 'dont want', 'help writing', 'start writing').
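
Truncated phrases such as 'don know' are what a word-level tokeniser typically produces. As a minimal sketch, assuming a token pattern like scikit-learn's default `(?u)\b\w\w+\b` (the source does not state which tokeniser was used), bigram counts could be computed as follows. The function names `tokenize` and `top_bigrams` are hypothetical, and the two sample posts are made up for illustration.

```python
import re
from collections import Counter

# Assumption: keep words of two or more characters, as in scikit-learn's
# default token pattern. Apostrophes split words, so "don't" becomes "don".
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def top_bigrams(posts, n=5):
    """Count adjacent word pairs across all posts and return the n most common."""
    counts = Counter()
    for post in posts:
        tokens = tokenize(post)
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)


# Made-up posts for illustration only (not scraped data)
sample_posts = [
    "I don't know how to start writing my novel",
    "Need help writing: I don't know where to begin",
]
for bigram, count in top_bigrams(sample_posts, n=3):
    print(" ".join(bigram), count)
```

Under this assumption, "don't know" is counted as the bigram 'don know', while 'dont want' would come from users typing "dont" without an apostrophe.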

In terms of the overall tone of the words and phrases that commonly appear in each subreddit, Blogging appears more formal and professional, whilst Writing appears more casual and community-based. This could be because bloggers are more marketing- and promotion-oriented, whilst writers are more focused on the art of writing.

#### Modelling (Classification Model)

As seen from the models, both performed similarly in predicting whether a post falls under the Writing or Blogging subreddit, with an accuracy of approximately 93%.
From our logistic regression model, we were able to understand how it classifies posts based on the words that appear in them.


Interestingly, the words of greatest importance in classifying posts into the Blogging subreddit are: posts, content, website, niche, article, Google, SEO, link, etc. (web-analytics oriented),
whereas the words of greatest importance in classifying posts into the Writing subreddit are: story, character, book, novel, read, writer, plot, feel, chapter, etc. (traditional writing oriented).
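
To illustrate how a logistic regression exposes word importance, here is a from-scratch sketch on a made-up six-post corpus. It is not the model used in this project: the toy data, the learning rate, the number of epochs and the function names (`featurize`, `train_logreg`) are all assumptions chosen for illustration. Positive weights push a post towards Blogging (label 1) and negative weights towards Writing (label 0).

```python
import math

# Made-up toy corpus (not scraped data): label 1 = Blogging, 0 = Writing
docs = [
    ("seo tips for website traffic", 1),
    ("google seo and niche content", 1),
    ("link building for my website", 1),
    ("plot of my novel chapter", 0),
    ("character in my story", 0),
    ("novel plot and story chapter", 0),
]

vocab = sorted({word for text, _ in docs for word in text.split()})
index = {word: i for i, word in enumerate(vocab)}


def featurize(text):
    """Bag-of-words count vector over the toy vocabulary."""
    vec = [0.0] * len(vocab)
    for word in text.split():
        if word in index:
            vec[index[word]] += 1.0
    return vec


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(samples, epochs=200, lr=0.1):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    weights = [0.0] * len(vocab)
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(vocab)
        grad_b = 0.0
        for x, y in samples:
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


samples = [(featurize(text), label) for text, label in docs]
weights, bias = train_logreg(samples)

# Rank words from most Writing-leaning (negative) to most Blogging-leaning (positive)
ranked = sorted(zip(vocab, weights), key=lambda pair: pair[1])
print("Writing-leaning:", [word for word, _ in ranked[:3]])
print("Blogging-leaning:", [word for word, _ in ranked[-3:]])
```

Words that occur only in Blogging posts end up with positive weights and words that occur only in Writing posts with negative ones, which is the same reading applied to the word lists above.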

In [1]:
# Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Web scraping imports
import requests
import time
import random

In [2]:
write = 'https://www.reddit.com/r/writing.json'
blog = 'https://www.reddit.com/r/Blogging.json'

In [3]:
# Send a custom User-agent with each request
res_w = requests.get(write, headers={'User-agent': 'GAProj3'})
res_b = requests.get(blog, headers={'User-agent': 'GAProj3'})

In [4]:
res_w.status_code

200

In [5]:
res_b.status_code

200

In [6]:
write_dict = res_w.json()
blog_dict = res_b.json()

### First Round of Scraping (Popular Posts)

In [13]:
posts_w = []
after = None

for a in range(30):
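    # The first request has no cursor; later requests pass Reddit's 'after' token
    if after is None:
        current_url = write
    else:
        current_url = write + '?after=' + after
    print(current_url)
    res_w = requests.get(current_url, headers={'User-agent': 'GAProj3'})

    if res_w.status_code != 200:
        # Report the status code of the failed response
        print('Status error', res_w.status_code)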
        break
    
    write_dict = res_w.json()
    current_posts = [p['data'] for p in write_dict['data']['children']]
    posts_w.extend(current_posts)
    after = write_dict['data']['after']
    
    sleep_duration = random.randint(2,15)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/writing.json
14
https://www.reddit.com/r/writing.json?after=t3_taeq1r
2
https://www.reddit.com/r/writing.json?after=t3_ta00ss
8
https://www.reddit.com/r/writing.json?after=t3_t9l4rs
11
https://www.reddit.com/r/writing.json?after=t3_t92eko
15
https://www.reddit.com/r/writing.json?after=t3_t92t65
7
https://www.reddit.com/r/writing.json?after=t3_t8rore
3
https://www.reddit.com/r/writing.json?after=t3_t7w8oq
9
https://www.reddit.com/r/writing.json?after=t3_t7vv77
15
https://www.reddit.com/r/writing.json?after=t3_t7l90b
6
https://www.reddit.com/r/writing.json?after=t3_t6v6c8
13
https://www.reddit.com/r/writing.json?after=t3_t65i45
8
https://www.reddit.com/r/writing.json?after=t3_t63ogv
2
https://www.reddit.com/r/writing.json?after=t3_t5db2n
14
https://www.reddit.com/r/writing.json?after=t3_t5b5cl
15
https://www.reddit.com/r/writing.json?after=t3_t4ppko
4
https://www.reddit.com/r/writing.json?after=t3_t49u3j
13
https://www.reddit.com/r/writing.json?after=t3_t3ocnd
8


In [14]:
posts_b = []
after = None

for a in range(30):
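    # The first request has no cursor; later requests pass Reddit's 'after' token
    if after is None:
        current_url = blog
    else:
        current_url = blog + '?after=' + after
    print(current_url)
    res_b = requests.get(current_url, headers={'User-agent': 'GAProj3'})

    if res_b.status_code != 200:
        # Report the status code of the failed response
        print('Status error', res_b.status_code)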
        break
    blog_dict = res_b.json()    
    current_posts = [p['data'] for p in blog_dict['data']['children']]
    posts_b.extend(current_posts)
    after = blog_dict['data']['after']
    
    sleep_duration = random.randint(2,10)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/Blogging.json
3
https://www.reddit.com/r/Blogging.json?after=t3_t84glq
5
https://www.reddit.com/r/Blogging.json?after=t3_t5nre3
4
https://www.reddit.com/r/Blogging.json?after=t3_t2sp68
9
https://www.reddit.com/r/Blogging.json?after=t3_sx9g1s
5
https://www.reddit.com/r/Blogging.json?after=t3_stx2qu
4
https://www.reddit.com/r/Blogging.json?after=t3_sqerew
5
https://www.reddit.com/r/Blogging.json?after=t3_smlpc3
7
https://www.reddit.com/r/Blogging.json?after=t3_sjjout
7
https://www.reddit.com/r/Blogging.json?after=t3_sghmt6
7
https://www.reddit.com/r/Blogging.json?after=t3_sapfhe
8
https://www.reddit.com/r/Blogging.json?after=t3_s4nwx9
7
https://www.reddit.com/r/Blogging.json?after=t3_rz4hpg
6
https://www.reddit.com/r/Blogging.json?after=t3_ru9hg9
10
https://www.reddit.com/r/Blogging.json?after=t3_rqgc5v
6
https://www.reddit.com/r/Blogging.json?after=t3_rjhqly
8
https://www.reddit.com/r/Blogging.json?after=t3_re1nwm
6
https://www.reddit.com/r/Blogging.json?after=t

### Second Round of Scraping (New Posts)

In [15]:
posts_w2 = []
after = None

website_w2 = 'https://www.reddit.com/r/writing/new.json'
for a in range(15):
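    # The first request has no cursor; later requests pass Reddit's 'after' token
    if after is None:
        current_url = website_w2
    else:
        current_url = website_w2 + '?after=' + after
    print(current_url)
    res_w = requests.get(current_url, headers={'User-agent': 'GAProj3'})

    if res_w.status_code != 200:
        # Report the status code of the failed response
        print('Status error', res_w.status_code)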
        break
    
    write_dict = res_w.json()
    current_posts = [p['data'] for p in write_dict['data']['children']]
    posts_w2.extend(current_posts)
    after = write_dict['data']['after']
    
    
    
    sleep_duration = random.randint(2,15)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/writing/new.json
12
https://www.reddit.com/r/writing/new.json?after=t3_tafjym
2
https://www.reddit.com/r/writing/new.json?after=t3_ta46d1
13
https://www.reddit.com/r/writing/new.json?after=t3_t9l4rs
15
https://www.reddit.com/r/writing/new.json?after=t3_t97t4c
5
https://www.reddit.com/r/writing/new.json?after=t3_t8y24z
14
https://www.reddit.com/r/writing/new.json?after=t3_t8kcut
8
https://www.reddit.com/r/writing/new.json?after=t3_t84dyk
11
https://www.reddit.com/r/writing/new.json?after=t3_t7m5po
2
https://www.reddit.com/r/writing/new.json?after=t3_t76iui
13
https://www.reddit.com/r/writing/new.json?after=t3_t6qdct
15
https://www.reddit.com/r/writing/new.json?after=t3_t6hg0y
6
https://www.reddit.com/r/writing/new.json?after=t3_t5zjvw
13
https://www.reddit.com/r/writing/new.json?after=t3_t5kjku
13
https://www.reddit.com/r/writing/new.json?after=t3_t537ia
13


In [16]:
posts_b2 = []
after = None

website_b2 = 'https://www.reddit.com/r/blogging/new.json'
for a in range(15):
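    # The first request has no cursor; later requests pass Reddit's 'after' token
    if after is None:
        current_url = website_b2
    else:
        current_url = website_b2 + '?after=' + after
    print(current_url)
    res_b = requests.get(current_url, headers={'User-agent': 'GAProj3'})

    if res_b.status_code != 200:
        # Report the status code of the failed response
        print('Status error', res_b.status_code)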
        break
    
    blog_dict = res_b.json()
    current_posts = [p['data'] for p in blog_dict['data']['children']]
    posts_b2.extend(current_posts)
    after = blog_dict['data']['after']
    
    
    sleep_duration = random.randint(2,15)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/blogging/new.json
2
https://www.reddit.com/r/blogging/new.json?after=t3_t84z4v
13
https://www.reddit.com/r/blogging/new.json?after=t3_t5ryyb
9
https://www.reddit.com/r/blogging/new.json?after=t3_t2tbtl
5
https://www.reddit.com/r/blogging/new.json?after=t3_sxgovh
14
https://www.reddit.com/r/blogging/new.json?after=t3_styn02
10
https://www.reddit.com/r/blogging/new.json?after=t3_sr0g61
10
https://www.reddit.com/r/blogging/new.json?after=t3_smw24k
6
https://www.reddit.com/r/blogging/new.json?after=t3_sk278v
4
https://www.reddit.com/r/blogging/new.json?after=t3_sgvcrf
5
https://www.reddit.com/r/blogging/new.json?after=t3_sax3jb
3
https://www.reddit.com/r/blogging/new.json?after=t3_s55i8j
11
https://www.reddit.com/r/blogging/new.json?after=t3_s0dlyv
5
https://www.reddit.com/r/blogging/new.json?after=t3_ruc5qu
5
https://www.reddit.com/r/blogging/new.json?after=t3_rqlbf8
15
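
The scraping loops above share the same pagination logic, so they could be folded into one helper. The sketch below is one possible refactor, not the code that produced the outputs shown. `scrape_listing` and `fetch_with_requests` are hypothetical names, and the early exit when `after` is `None` assumes that Reddit returns a null cursor on the last page of a listing. The `fetch_json` parameter is injectable so that the paging logic can be exercised without network access.

```python
import random
import time


def fetch_with_requests(url):
    """Default fetcher: GET the URL and return the parsed JSON body."""
    import requests  # imported lazily so the paging logic has no hard dependency

    res = requests.get(url, headers={'User-agent': 'GAProj3'})
    res.raise_for_status()  # raise on any 4xx/5xx response
    return res.json()


def scrape_listing(base_url, max_pages, fetch_json=fetch_with_requests,
                   min_sleep=2, max_sleep=15):
    """Collect post dictionaries from a Reddit listing, following 'after' cursors."""
    posts = []
    after = None
    for page in range(max_pages):
        url = base_url if after is None else base_url + '?after=' + after
        listing = fetch_json(url)
        posts.extend(child['data'] for child in listing['data']['children'])
        after = listing['data']['after']
        if after is None:
            # No further pages; stopping here avoids re-requesting the first page
            break
        if page < max_pages - 1:
            # Random pause between requests, as in the original loops
            time.sleep(random.randint(min_sleep, max_sleep))
    return posts
```

With this helper, a call such as `scrape_listing('https://www.reddit.com/r/writing/new.json', max_pages=15)` would stand in for one of the loops above.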


### Merge all 

In [17]:
print(f'Scrape 1 Type (Blogging): {type(posts_b)}')
print(f'Scrape 2 Type (Blogging): {type(posts_b2)}')
print('\n')
print(f'Scrape 1 Type (Writing): {type(posts_w)}')
print(f'Scrape 2 Type (Writing): {type(posts_w2)}')

Scrape 1 Type (Blogging): <class 'list'>
Scrape 2 Type (Blogging): <class 'list'>


Scrape 1 Type (Writing): <class 'list'>
Scrape 2 Type (Writing): <class 'list'>


In [18]:
print(f'#Posts for Scrape 1 (Blogging): {len(posts_b)}')
print(f'#Posts for Scrape 2 (Blogging): {len(posts_b2)}')
print('\n')
print(f'#Posts for Scrape 1 (Writing): {len(posts_w)}')
print(f'#Posts for Scrape 2 (Writing): {len(posts_w2)}')

#Posts for Scrape 1 (Blogging): 737
#Posts for Scrape 2 (Blogging): 375


#Posts for Scrape 1 (Writing): 739
#Posts for Scrape 2 (Writing): 375


In [19]:
blogposts = posts_b
type(blogposts)

list

In [20]:
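# Append the second scrape to the first; extend() mutates posts_b in place,
# so re-running this cell would add the posts again
posts_b.extend(posts_b2)
len(posts_b)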

1112

In [21]:
writeposts = posts_w
type(writeposts)

list

In [22]:
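# Append the second scrape to the first; extend() mutates posts_w in place,
# so re-running this cell would add the posts again
posts_w.extend(posts_w2)
len(posts_w)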

1114

### Let's Check the Data!

In [23]:
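# Build DataFrames from the post dictionaries.
# Note: this rebinds `blog` and `write`, which previously held the listing URLs.
blog = pd.DataFrame(posts_b)
write = pd.DataFrame(posts_w)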

In [24]:
blog.shape

(1112, 107)

In [25]:
write.shape

(1114, 114)

In [26]:
write[['selftext','title']].duplicated().sum()

428

In [27]:
blog[['selftext','title']].duplicated().sum()

602

### Initial Observations

The results of our two rounds of web scraping show many duplicated posts in both the Writing and Blogging subreddits (428 and 602 rows respectively). This is partly due to combining the 'hot' and 'new' listings, which overlap. However, the number of duplicates in each subreddit exceeds the 375 posts collected in the second scrape, so some posts must also be repeated within a single scrape.

In [28]:
write.drop_duplicates(subset=['selftext','title'],inplace=True)
blog.drop_duplicates(subset=['selftext','title'],inplace=True)

In [29]:
print(write.shape)
print(blog.shape)

(686, 114)
(510, 107)
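
To see where the duplicates come from, the raw post lists could be inspected separately, before the two scrapes are merged with `extend`. The sketch below counts repeats on the same `(selftext, title)` key used above. `count_duplicates` is a hypothetical helper, and the sample posts are made up for illustration.

```python
def count_duplicates(posts, key_fields=('selftext', 'title')):
    """Return how many posts repeat an earlier post's key within the list."""
    seen = set()
    duplicates = 0
    for post in posts:
        key = tuple(post.get(field) for field in key_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


# Made-up posts for illustration only (not scraped data)
hot = [
    {'title': 'A', 'selftext': 'x'},
    {'title': 'B', 'selftext': 'y'},
    {'title': 'A', 'selftext': 'x'},  # repeat within the same scrape
]
new = [
    {'title': 'B', 'selftext': 'y'},  # also present in 'hot'
    {'title': 'C', 'selftext': 'z'},
]

within_hot = count_duplicates(hot)
within_new = count_duplicates(new)
total = count_duplicates(hot + new)
# Distinct posts that appear in both scrapes
across = total - within_hot - within_new
print(within_hot, within_new, across)
```

Splitting the total this way separates repeats inside one listing from the overlap between the 'hot' and 'new' listings.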


### Exporting the Data

In [71]:
blog.to_csv("blogging.csv")  # plain relative path, so it also works outside Windows

In [72]:
write.to_csv("write.csv")  # plain relative path, so it also works outside Windows