### Overview

One way to index data efficiently for later use is to tag each item before saving it to the dataset. On Reddit, for example, every post is tagged with its subreddit. Suppose the Reddit server could not assign that tag for some reason (say, a hardware limitation or unexpected behavior after an update); we would then need a classifier to process the data and tag each post. The goal of this project is to tag Reddit posts using NLP and machine learning. For this use case, posts are extracted from two subreddits:

samsunggalaxy and iphone

We therefore build a binary classifier that assigns each Reddit post to one of these two subreddits.

In [1]:
import requests
import time
import pandas as pd
import os

### Defining the scraper using the Reddit API

In [2]:
def extract_reddit_posts(target_url, count, bot_header, sleep):
    """Fetch up to `count` listing pages from a Reddit .json endpoint."""
    posts = []
    after = None

    for i in range(count):
        # The first request has no cursor; later ones continue from `after`
        params = {} if after is None else {'after': after}

        res = requests.get(url=target_url, params=params, headers=bot_header)

        if res.status_code != 200:
            # Report the failing status code and stop paginating
            print(res.status_code)
            break

        the_json = res.json()
        posts.extend(the_json['data']['children'])
        after = the_json['data']['after']

        if after is None:
            # No further pages; another request would restart from page one
            break

        time.sleep(sleep)

    return posts

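def prepare_df(raw_data, cols):
    """Build a DataFrame from raw posts and lift the selected fields into columns."""
    df = pd.DataFrame(raw_data)
    for col in cols:
        # Each row's 'data' cell is a dict holding the post's fields
        df[col] = df['data'].map(lambda x: x[col])
    return df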
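
To make the cursor-based pagination easier to follow, here is a minimal offline sketch. It assumes the listing shape that the scraper relies on (`data.children` and `data.after`). `fake_fetch` and `collect` are hypothetical names, and the two pages are made-up data standing in for real API responses.

```python
def fake_fetch(after=None):
    """Hypothetical stand-in for requests.get(...).json(); returns one listing page."""
    pages = {
        None: {"data": {"children": [
                            {"kind": "t3", "data": {"name": "t3_a", "title": "first"}},
                            {"kind": "t3", "data": {"name": "t3_b", "title": "second"}}],
                        "after": "t3_b"}},
        "t3_b": {"data": {"children": [
                              {"kind": "t3", "data": {"name": "t3_c", "title": "third"}}],
                          "after": None}},
    }
    return pages[after]


def collect(fetch, max_pages):
    """Follow the `after` cursor until it runs out or max_pages is reached."""
    posts, after = [], None
    for _ in range(max_pages):
        page = fetch(after)
        posts.extend(page["data"]["children"])
        after = page["data"]["after"]
        if after is None:
            # Last page reached; asking again without a cursor would return page one
            break
    return posts


posts = collect(fake_fetch, max_pages=10)
# Each child wraps the post fields in a 'data' dict, which prepare_df reads from
titles = [p["data"]["title"] for p in posts]
```

Passing the fetcher in as an argument keeps the loop testable without network access; with the real API, the fetcher would wrap the `requests.get` call.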

### Initial Setup

In [3]:
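url = ""  # set per subreddit in the cells below
header = {'User-agent': 'Bleep blorp bot 0.1'}
count = 200  # maximum number of listing pages to request per subreddit
sleep_value = 0.02  # pause between requests, in seconds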
selected_columns = ["subreddit", "title", "name", "selftext", "domain", "upvote_ratio",
                    "score", "subreddit_id", "is_robot_indexable", "author",
                    "num_comments", "send_replies", "is_video"]

### Extracting Samsung Galaxy Data

In [4]:
url = 'https://www.reddit.com/r/samsunggalaxy/.json'
# Step one: data extraction from reddit API
raw_scraped_galaxy = extract_reddit_posts(target_url=url, count=count, bot_header=header, sleep=sleep_value)
# Step two: convert the raw extracted data into a pandas DataFrame and add the selected columns
galaxy_df = prepare_df(raw_data=raw_scraped_galaxy, cols=selected_columns)

### Extracting iPhone Data

In [5]:
url = 'https://www.reddit.com/r/iphone/.json'
raw_scraped_iphone = extract_reddit_posts(target_url=url, count=count, bot_header=header, sleep=sleep_value)
iphone_df = prepare_df(raw_data=raw_scraped_iphone, cols=selected_columns)

### Saving data

In [14]:
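dataDirectory = "./data/"
# Create the output directory if needed; to_csv does not create it
os.makedirs(dataDirectory, exist_ok=True)
galaxy_df.to_csv(dataDirectory + "galaxy.csv", encoding='utf-8')
iphone_df.to_csv(dataDirectory + "iphone.csv", encoding='utf-8')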
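
As a rough sketch of how the two saved files might later be reloaded and combined into one labeled table for the binary classifier, the example below uses small in-memory stand-ins for `galaxy.csv` and `iphone.csv`. The rows are made up, the `is_iphone` column name is hypothetical, and the subreddit values are assumed to match the names in the URLs once lowercased.

```python
import io

import pandas as pd

# Tiny stand-ins for galaxy.csv and iphone.csv (made-up rows, not real posts)
galaxy_csv = io.StringIO(
    ",subreddit,title\n"
    "0,samsunggalaxy,Battery drain after update\n"
    "1,samsunggalaxy,Best case for my phone\n"
)
iphone_csv = io.StringIO(
    ",subreddit,title\n"
    "0,iphone,Face ID stopped working\n"
)

# to_csv() wrote the row index as an unnamed first column, so read it back as the index
galaxy = pd.read_csv(galaxy_csv, index_col=0)
iphone = pd.read_csv(iphone_csv, index_col=0)

# Stack both subreddits into one table with a fresh 0..n-1 index
combined = pd.concat([galaxy, iphone], ignore_index=True)

# Hypothetical binary target: 1 for iphone posts, 0 for samsunggalaxy posts
combined["is_iphone"] = (combined["subreddit"].str.lower() == "iphone").astype(int)
```

Reading with `index_col=0` avoids an extra `Unnamed: 0` column, which would otherwise appear because the files were written with the default index.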