# Project 3: Web APIs & NLP

---
# Part 1 - Data Collection
### Notebook 1 - Contents:
[1.1 Context](#1.1-Context)<br>
[1.2 Problem Statement](#1.2-Problem-Statement)<br>
[1.3 Data Dictionary](#1.3-Data-Dictionary)<br>
[1.4 Data Collection](#1.4-Data-Collection)<br>
[1.5 Notebook Summary](#1.5-Notebook-Summary)

---
# 1.1 Context

The field of data science has witnessed remarkable growth and popularity as organizations increasingly rely on data-driven decision-making across diverse industries. This surge has led to the [emergence of new roles and specializations](https://www.datacamp.com/blog/data-scientist-vs-data-engineer) within the data domain, including data engineering. Data engineering is a critical function that involves the collection, processing, and organization of vast volumes of data, facilitating data scientists in extracting valuable insights and building impactful models.

However, it is increasingly hard to distinguish data science roles from data engineering roles, as their job requirements, skills, and educational qualifications often overlap. Job postings for these roles can be confusing, making it difficult to tell data science responsibilities apart from data engineering ones. [The lines between these two roles are often blurred](https://datascience.virginia.edu/news/data-science-vs-data-engineering), leading to a lack of clarity for both job seekers and employers. The confusion is even greater for newcomers trying to enter the data field.

# 1.2 Problem Statement


You are a member of the General Assembly (GA) team, responsible for evaluating the need for a new Data Engineering Immersive Course. To gain valuable insights and understand the demand for such a program, GA has plans to set up a discussion channel to gather feedback and queries from potential students. For internal analysis, the team hopes to categorize the comments and queries into the categories of data science or data engineering. This sorting and categorizing could be time-consuming and impractical if done manually.

To address this challenge, your objective is to **develop a natural language processing (NLP) model that can automatically classify the comments into two distinct categories: Data Science and Data Engineering**. This could help the team gain insights on the demand for each specialization and guide the decision-making process in setting up the new Data Engineering Immersive Course. We will train and test our NLP models on Reddit data from two subreddits, r/datascience and r/dataengineering. A further investigation, **sentiment analysis** of the comments, would also give us insights into the attitudes and perceptions of potential students towards each specialization.

### Model and Success Evaluation

To sort text-based comments/queries into two distinct categories, we will use several models that are appropriate for binary classification of text. The model selection process involves weighing the trade-offs between simplicity and complexity, considering factors like interpretability and computational efficiency. Models such as Logistic Regression, Naive Bayes, Random Forests, AdaBoost and Support Vector Machines will be used for this task.

We will then use accuracy and F1 score to evaluate the models. Model and success evaluation will be further discussed in notebook 3.
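
As a reminder of what the two metrics measure, here is a minimal sketch that computes accuracy and F1 by hand for a binary problem. The function name `accuracy_and_f1`, the label encoding and the toy labels are all made up for illustration; the actual evaluation is left to notebook 3.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, f1


# Toy labels: 1 = r/dataengineering, 0 = r/datascience (an arbitrary encoding)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
```

Accuracy counts every correct prediction equally, while F1 only looks at how well the positive class is recovered, which is why the two numbers can differ on the same predictions.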

---
# 1.3 Data Dictionary

Two subreddits:

* [Subreddit: r/datascience](https://www.reddit.com/r/datascience/)
* [Subreddit: r/dataengineering](https://www.reddit.com/r/dataengineering/)


The posts and comments were scraped separately for each subreddit, and exported into 4 csv files in total:
* [`data_sc.csv`](./data/data_sc.csv): Posts from subreddit r/datascience
* [`data_sc_comm.csv`](./data/data_sc_comm.csv): Comments from subreddit r/datascience
* [`data_engr.csv`](./data/data_engr.csv): Posts from subreddit r/dataengineering
* [`data_engr_comm.csv`](./data/data_engr_comm.csv): Comments from subreddit r/dataengineering

The cleaned data (from both subreddits) was exported for modelling:
* [`df_text.csv`](./data/df_text.csv): Cleaned and combined text data

From the subreddits, these attributes were pulled for EDA and modelling:

|Attribute|Description|
|----|----|
|title|The title of the submission.|
|id|ID of the submission.|
|author|Provides an instance of Redditor.|
|created_utc|Time the submission was created, represented in Unix Time.|
|edited|Whether or not the submission has been edited.|
|is_self|Whether or not the submission is a selfpost (text-only).|
|link_flair_text|The link flair’s text content, or None if not flaired.|
|num_comments|The number of comments on the submission.|
|saved|Whether or not the submission is saved.|
|score|The number of upvotes for the submission.|
|selftext|The submissions’ selftext - an empty string if a link post.|
|stickied|Whether or not the submission is stickied.|
|upvote_ratio|The percentage of upvotes from all votes on the submission.|
|url|The URL the submission links to, or the permalink if a selfpost.|
|subreddit|Provides an instance of Subreddit.|
|title_selftext|Feature Engineered column merging title and selftext.|
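
Two of the attributes above are easier to read with a small example: `created_utc` is a Unix timestamp, and `title_selftext` is a merged text column. The sketch below uses a made-up record and only the standard library. Joining with a single space is an assumption, since the exact merge used in the later notebooks is not shown here.

```python
from datetime import datetime, timezone

# Hypothetical record shaped like one row of the posts data (values are illustrative)
post = {
    "title": "Example post title",
    "selftext": "Example body text.",
    "created_utc": 1690776000.0,
}

# Unix time counts seconds since 1970-01-01 00:00:00 UTC
created = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)

# Join title and selftext, skipping empty parts so link posts (empty selftext)
# do not end up with a trailing space
title_selftext = " ".join(part for part in (post["title"], post["selftext"]) if part)

print(created.isoformat())
print(title_selftext)
```

In pandas, the same timestamp conversion can be done on a whole column with `pd.to_datetime(df['created_utc'], unit='s')`.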

---

# 1.4 Data Collection
## PRAW: The Python Reddit API Wrapper

PRAW is a Python library that provides a simple and convenient way to interact with the Reddit API. PRAW allows developers to access various functionalities of Reddit, such as retrieving posts and comments, submitting new posts, voting on content, and more.

For data collection, a function was created to pull all posts from r/datascience and r/dataengineering. It later turned out that Reddit limits the posts available from each subreddit listing to a maximum of 1000. I managed to pull 961 and 963 posts respectively.<br>
Another function was then created to pull comments from both subreddits. 12403 and 11740 comments were pulled for r/datascience and r/dataengineering respectively.

Import the libraries required for data collection:

In [1]:
import requests
import pandas as pd
import numpy as np
import time
import praw
from bs4 import BeautifulSoup
from tqdm import tqdm

Input Reddit API details for PRAW:

In [2]:
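import os

# Read the API credentials from environment variables instead of hard-coding them
# (the variable names REDDIT_CLIENT_ID / REDDIT_CLIENT_SECRET are placeholders)
user_agent = 'Scraper 1.0 by /u/glori-aaa'
reddit = praw.Reddit(
    client_id = os.environ['REDDIT_CLIENT_ID'],
    client_secret = os.environ['REDDIT_CLIENT_SECRET'],
    user_agent = user_agent
)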

---
## Subreddit - Column Selection

From the Reddit submission data dictionary (see https://praw.readthedocs.io/en/stable/code_overview/models/submission.html), I selected a preliminary set of attributes to extract. After pulling one batch of sample data, we can further refine which attributes to keep.

In [3]:
titles = ['title', 'id', 'author', 'author_flair_text', 'created_utc', 'comments', 'distinguished', 'edited', 'is_original_content', 'is_self', 'link_flair_text', 'name', 'num_comments', 'saved',
          'score', 'selftext', 'stickied', 'upvote_ratio', 'url', 'subreddit']

Create a dataframe with headings from the reddit submission data dictionary:

In [4]:
data_sc = pd.DataFrame(columns = titles)
data_sc

Unnamed: 0,title,id,author,author_flair_text,created_utc,comments,distinguished,edited,is_original_content,is_self,link_flair_text,name,num_comments,saved,score,selftext,stickied,upvote_ratio,url,subreddit


In [5]:
# Create an empty list to store the submissions data
data_sc = []

# Extract the attributes I've selected in "titles" from each submission and store them in a dictionary
for submission in reddit.subreddit('datascience').hot(limit=5):
    submission_data = {attr: getattr(submission, attr) for attr in titles}
    data_sc.append(submission_data)

# Convert the list of dictionaries into a DataFrame
data_sc = pd.DataFrame(data_sc)

In [6]:
data_sc.head(2)

Unnamed: 0,title,id,author,author_flair_text,created_utc,comments,distinguished,edited,is_original_content,is_self,link_flair_text,name,num_comments,saved,score,selftext,stickied,upvote_ratio,url,subreddit
0,Weekly Entering & Transitioning - Thread 31 Ju...,15e5iw6,AutoModerator,,1690776000.0,"(juawerx, jugdppb, ju6k0p5, ju7yvce, ju8lv71, ...",,False,False,True,,t3_15e5iw6,28,False,2,\n\nWelcome to this week's entering & transit...,True,1.0,https://www.reddit.com/r/datascience/comments/...,datascience
1,"R programmers, what are the greatest issues yo...",15g96dc,joaoareias,,1690984000.0,"(juhf0fy, juhmgan, juhe94q, juhnzx5, juhi6gh, ...",,False,False,True,Education,t3_15g96dc,262,False,171,I'm a Data Scientist with a computer science b...,False,0.96,https://www.reddit.com/r/datascience/comments/...,datascience


Take a closer look at some of the columns to refine my column selection:

In [7]:
data_sc['comments']

0    (juawerx, jugdppb, ju6k0p5, ju7yvce, ju8lv71, ...
1    (juhf0fy, juhmgan, juhe94q, juhnzx5, juhi6gh, ...
2    (juih9u5, juiepcp, jujgtxw, juivus3, juijms6, ...
3                          (jujsjmz, jujtngo, jujspge)
4    (jujftd1, jujgols, jujlc1b, jujc0bb, jujamvi, ...
Name: comments, dtype: object

I decided to drop a first batch of columns that I do not expect to be useful for further analysis (this also cuts down the time needed for pulling):
* comments - only gives the IDs of the comments. We will drop this and use another function to extract the comments later.
* author_flair_text - too detailed and not relevant to our study
* is_original_content - not relevant to our study
* name - repeats the id with a `t3_` prefix (e.g. `t3_15e5iw6` for id `15e5iw6`)

**Final list of columns to export for EDA:**

In [8]:
titles_export = ['title', 'id', 'author', 'created_utc', 'distinguished', 'edited', 'is_self', 'link_flair_text', 'num_comments', 'saved',
                 'score', 'selftext', 'stickied', 'upvote_ratio', 'url', 'subreddit']
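
The export list can also be derived from the preliminary list rather than retyped, which avoids accidental repeats. This is only a sketch: the names `dropped` and `titles_export_derived` are mine, and the preliminary list is written with a repeated `id` entry to show the de-duplication.

```python
# Preliminary attribute list, with 'id' deliberately repeated
titles = ['title', 'id', 'author', 'author_flair_text', 'created_utc', 'comments',
          'distinguished', 'edited', 'id', 'is_original_content', 'is_self',
          'link_flair_text', 'name', 'num_comments', 'saved', 'score', 'selftext',
          'stickied', 'upvote_ratio', 'url', 'subreddit']

# Columns judged not useful for further analysis
dropped = {'comments', 'author_flair_text', 'is_original_content', 'name'}

# dict.fromkeys keeps the first occurrence of each name and preserves order,
# so it removes the repeated 'id' before the dropped columns are filtered out
titles_export_derived = [attr for attr in dict.fromkeys(titles) if attr not in dropped]

print(len(titles_export_derived))
print(titles_export_derived)
```

The result has 16 attributes, which matches the 16 columns of the exported posts data.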

---

## Create PRAW function
Create a function that we can use on both our subreddits to extract required data:

In [9]:
def webscr_data(subreddit_name, titles_export, limit=5000):
    """Pull up to `limit` hot submissions from a subreddit into a DataFrame."""
    data_list = []

    # Extract the attributes specified in "titles_export" from each submission and store them in a dictionary
    # (Reddit makes at most about 1000 posts available per listing, so a larger limit simply pulls everything)
    for submission in reddit.subreddit(subreddit_name).hot(limit=limit):
        submission_data = {attr: getattr(submission, attr) for attr in titles_export}
        data_list.append(submission_data)

    # Convert the list of dictionaries into a DataFrame
    df = pd.DataFrame(data_list)

    return df
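
The dictionary comprehension with `getattr` is the core of the function above, so here it is in isolation on a stand-in object. `fake_submission` is hypothetical and no Reddit request is made. Passing `None` as a default to `getattr` is an optional variation (the function above does not use it) that avoids an `AttributeError` when an attribute is missing.

```python
from types import SimpleNamespace

# Stand-in for a PRAW submission; a real one is fetched from the Reddit API
fake_submission = SimpleNamespace(title="Example title", id="abc123", score=5)

wanted = ["title", "id", "score", "upvote_ratio"]

# Look up each attribute by name; missing attributes become None instead of raising
row = {attr: getattr(fake_submission, attr, None) for attr in wanted}

print(row)
```

Each such dictionary becomes one row once the list of dictionaries is passed to `pd.DataFrame`.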

Create another function to scrape comments:

* A `time.sleep(delay)` of 2 seconds was added to the code because I received an "HTTP 429 Too Many Requests" response code. The delay keeps our requests within the Reddit API rate limit.
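
A fixed delay is the approach used in this notebook. A common alternative for "Too Many Requests" errors is to retry with an increasing wait (exponential backoff). The sketch below is only an illustration of that idea with a fake request: `call_with_backoff`, `TooManyRequests` and `flaky_request` are hypothetical names and are not part of PRAW.

```python
import time


def call_with_backoff(func, max_attempts=4, base_delay=2.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


class TooManyRequests(Exception):
    """Hypothetical stand-in for an HTTP 429 error."""


attempts = {"count": 0}
waits = []  # records the requested waits instead of actually sleeping


def flaky_request():
    """Fake request that fails twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TooManyRequests()
    return "ok"


result = call_with_backoff(flaky_request, retry_on=(TooManyRequests,), sleep=waits.append)
print(result, waits)
```

The `sleep` parameter is injected here only so the demonstration runs instantly; in real use the default `time.sleep` would apply.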

In [10]:
titles_comm_export = ['body', 'body_html', 'subreddit_id', 'author', 'created_utc', 'distinguished', 'edited', 'id', 'saved', 
          'score', 'stickied', 'submission', 'subreddit']

In [11]:
def webscr_comment(subreddit_name, titles_comm_export, limit=8000, delay=2):
    """Pull the comments of up to `limit` hot submissions from a subreddit into a DataFrame."""
    comment_list = []

    for submission in reddit.subreddit(subreddit_name).hot(limit=limit):
        # Remove the "load more comments" placeholders so that only real comments remain
        submission.comments.replace_more(limit=0)
        # Extract the attributes specified in "titles_comm_export" from each comment and store them in a dictionary
        for comment in submission.comments.list():
            comment_data = {attr: getattr(comment, attr) for attr in titles_comm_export}
            comment_list.append(comment_data)

        # Pause once per submission (each submission triggers new API requests) to stay within the rate limit
        time.sleep(delay)

    # Convert the list of dictionaries into a DataFrame
    df = pd.DataFrame(comment_list)

    return df
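
Reddit comments form a tree (comments with nested replies), and the function above works on a flattened list of them. The sketch below shows what flattening a comment tree means, using a toy class and a breadth-first walk. `ToyComment` and `flatten` are hypothetical and this is not PRAW's own code.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyComment:
    """Hypothetical stand-in for a Reddit comment with nested replies."""
    id: str
    replies: list = field(default_factory=list)


def flatten(top_level):
    """Return every comment in the tree, breadth-first (top-level comments first)."""
    flat = []
    queue = deque(top_level)
    while queue:
        comment = queue.popleft()
        flat.append(comment)
        queue.extend(comment.replies)  # replies are visited after the current level
    return flat


thread = [
    ToyComment("a", [ToyComment("a1", [ToyComment("a1x")]), ToyComment("a2")]),
    ToyComment("b"),
]

flat_ids = [comment.id for comment in flatten(thread)]
print(flat_ids)
```

After flattening, every comment is one row regardless of how deeply it was nested, which is the shape needed for the comments DataFrame.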

---
## Subreddit r/datascience

In [12]:
data_sc = webscr_data('datascience', titles_export)

In [13]:
print(data_sc.shape)
data_sc.head()

(961, 16)


Unnamed: 0,title,id,author,created_utc,distinguished,edited,is_self,link_flair_text,num_comments,saved,score,selftext,stickied,upvote_ratio,url,subreddit
0,Weekly Entering & Transitioning - Thread 31 Ju...,15e5iw6,AutoModerator,1690776000.0,,False,True,,28,False,2,\n\nWelcome to this week's entering & transit...,True,1.0,https://www.reddit.com/r/datascience/comments/...,datascience
1,"R programmers, what are the greatest issues yo...",15g96dc,joaoareias,1690984000.0,,False,True,Education,262,False,169,I'm a Data Scientist with a computer science b...,False,0.96,https://www.reddit.com/r/datascience/comments/...,datascience
2,U.S. Hiring Managers: how diverse is your appl...,15gezo5,dantzigismyhero,1690997000.0,,False,True,Discussion,42,False,38,We are currently hiring for a mid-level DS and...,False,0.89,https://www.reddit.com/r/datascience/comments/...,datascience
3,Do I need a masters?,15gmte5,Odd-Company-1440,1691017000.0,,False,True,Education,4,False,5,I am going into my second year as a statistics...,False,1.0,https://www.reddit.com/r/datascience/comments/...,datascience
4,How do you describe your job when someone asks...,15gkc1y,briannalynn24,1691009000.0,,False,True,Discussion,24,False,5,I recently got a job as a data scientist and I...,False,0.86,https://www.reddit.com/r/datascience/comments/...,datascience


Export to csv:

In [14]:
import os # to work with files/directories
os.makedirs('../data', exist_ok=True) # create the folder if it does not exist yet

# Save the DataFrame to a CSV file
data_sc.to_csv('../data/data_sc.csv', index=False)

Pull the comments from the posts:

In [15]:
data_sc_comm = webscr_comment('datascience', titles_comm_export)

In [16]:
print(data_sc_comm.shape)
data_sc_comm.head()

(12403, 13)


Unnamed: 0,body,body_html,subreddit_id,author,created_utc,distinguished,edited,id,saved,score,stickied,submission,subreddit
0,Hello.\n\nI’m about to start my BS in Data Sci...,"<div class=""md""><p>Hello.</p>\n\n<p>I’m about ...",t5_2sptq,Competitive_Pay_9117,1690864000.0,,False,juawerx,False,2,False,15e5iw6,datascience
1,"Hey guys, I'm graduating from my bachelor's in...","<div class=""md""><p>Hey guys, I&#39;m graduatin...",t5_2sptq,Emperorofweirdos,1690962000.0,,False,jugdppb,False,2,False,15e5iw6,datascience
2,Can I get a job as a fresher in data science?\...,"<div class=""md""><p>Can I get a job as a freshe...",t5_2sptq,Luo-yi-,1690796000.0,,False,ju6k0p5,False,1,False,15e5iw6,datascience
3,Is it a good idea and possible to get a MS in ...,"<div class=""md""><p>Is it a good idea and possi...",t5_2sptq,slimjimmy1928,1690821000.0,,False,ju7yvce,False,0,False,15e5iw6,datascience
4,I am finishing my PhD program in which I did ~...,"<div class=""md""><p>I am finishing my PhD progr...",t5_2sptq,tunamouse,1690829000.0,,False,ju8lv71,False,0,False,15e5iw6,datascience


Export to csv:

In [17]:
import os # to work with files/directories
os.makedirs('../data', exist_ok=True) # create the folder if it does not exist yet

# Save the DataFrame to a CSV file
data_sc_comm.to_csv('../data/data_sc_comm.csv', index=False)

---
## Subreddit r/dataengineering

In [18]:
data_engr = webscr_data('dataengineering', titles_export)

In [19]:
print(data_engr.shape)
data_engr.head()

(963, 16)


Unnamed: 0,title,id,author,created_utc,distinguished,edited,is_self,link_flair_text,num_comments,saved,score,selftext,stickied,upvote_ratio,url,subreddit
0,Monthly General Discussion - Aug 2023,15fgn9y,AutoModerator,1690906000.0,,False,True,Discussion,2,False,3,This thread is a place where you can share thi...,True,1.0,https://www.reddit.com/r/dataengineering/comme...,dataengineering
1,Quarterly Salary Discussion - Jun 2023,13xldpd,AutoModerator,1685635000.0,,False,True,Career,213,False,83,This is a recurring thread that happens quarte...,True,1.0,https://www.reddit.com/r/dataengineering/comme...,dataengineering
2,What replaced cubes?,15gnctu,leaky_shrew,1691018000.0,,False,True,Discussion,31,False,32,I’m fairly old school with lots of on prem and...,False,0.95,https://www.reddit.com/r/dataengineering/comme...,dataengineering
3,Is traditional data modeling dead?,15gf97e,New-Ship-5404,1690998000.0,,False,True,Discussion,46,False,66,As someone who has worked in the data field fo...,False,0.9,https://www.reddit.com/r/dataengineering/comme...,dataengineering
4,Lots of people seem to hate data engineering. ...,15gizw0,InevitableTraining69,1691006000.0,,False,True,Discussion,37,False,30,There are a lot of engineering positions avail...,False,0.81,https://www.reddit.com/r/dataengineering/comme...,dataengineering


Export to csv:

In [20]:
import os # to work with files/directories
os.makedirs('../data', exist_ok=True) # create the folder if it does not exist yet

# Save the DataFrame to a CSV file
data_engr.to_csv('../data/data_engr.csv', index=False)

Pull the comments from the posts:

In [None]:
data_engr_comm = webscr_comment('dataengineering', titles_comm_export)

In [None]:
print(data_engr_comm.shape)
data_engr_comm.head()

Export to csv:

In [None]:
import os # to work with files/directories
os.makedirs('../data', exist_ok=True) # create the folder if it does not exist yet

# Save the DataFrame to a CSV file
data_engr_comm.to_csv('../data/data_engr_comm.csv', index=False)

---
# 1.5 Notebook Summary

In notebook 1, we used PRAW (the Python Reddit API Wrapper) to pull all available posts and comments from the subreddits r/datascience and r/dataengineering. We managed to pull the following:
* `r/datascience`
    * 961 posts
    * 12403 comments
* `r/dataengineering`
    * 963 posts
    * 11740 comments
 
This provides ample data to analyse further and to train our models.
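
Since accuracy is one of the planned metrics, it helps to know how balanced the two classes are. The sketch below uses the raw counts listed above to compute the accuracy of always predicting the larger class. The counts after cleaning may differ, so treat these numbers as a rough reference point only; `majority_share` is a name I made up.

```python
# Raw counts pulled in this notebook (before any cleaning)
counts = {
    "posts": {"datascience": 961, "dataengineering": 963},
    "comments": {"datascience": 12403, "dataengineering": 11740},
}


def majority_share(by_subreddit):
    """Share of rows in the larger class, i.e. the accuracy of always guessing it."""
    total = sum(by_subreddit.values())
    return max(by_subreddit.values()) / total


baseline = {kind: majority_share(by_sub) for kind, by_sub in counts.items()}

for kind, share in baseline.items():
    print(f"{kind}: majority-class baseline accuracy = {share:.3f}")
```

Both shares are close to 0.5, so the classes are roughly balanced and a model has to clearly beat about 50% accuracy to be useful.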