## Web APIs & Classification

### Contents:
- [1. Problem Statement](#Problem_Statement)
- [2. Importing Libraries](#Importing_Libraries)
- [3. Getting posts from the subreddits](#Getting_posts_from_the_two_subreddits) 


(Continued in project3 Notebook_2)

## **1. Problem Statement**

### The challenge is to study the contents of a new post on Reddit and classify which of two subreddits the post should be tagged to.

### The subreddits being considered are 'r/vegetarian' and 'r/lowimpactlifestyle'. 

### Low Impact Lifestyle is an active group on Reddit working towards a sustainable future by sharing ideas and information to promote and enjoy a 'Low Impact Lifestyle'.

### The objective is to develop a machine learning model that can correctly identify posts about a low impact lifestyle, i.e., one that aims to create less waste and use fewer resources. Although some people who follow a vegetarian diet may be part of the movement towards a low impact lifestyle, others may be vegetarian for other reasons and not subscribe to this lifestyle as a form of sustainability. It is thus possible that their posts on the Low Impact Lifestyle subreddit were made in error and need to be redirected to the 'Vegetarian' subreddit. This redirection would also reduce confusion among members of both subreddits about the purpose of their respective pages.

### A classification model will be developed from the existing posts from the two subreddits. The model will be tested on new posts on the subreddits to check its efficacy.

### The details are covered in 2 Jupyter notebooks:
### - Notebook 1: collecting the posts from the subreddits and saving them as .csv files
### - Notebook 2: loading the data stored in the .csv files, cleaning and pre-processing it, then modelling, evaluating and selecting a model, and finally testing on a random sample of posts.

## 2. Importing Libraries

In [2]:
# Standard data science imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import requests
import time
import random

In [3]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# None means "no limit"; the older value -1 is deprecated in recent pandas versions
pd.set_option('display.max_colwidth', None)

## 3. Getting posts from the two subreddits - Vegetarian and Low Impact Lifestyle

### The posts are collected by requesting each subreddit's `.json` listing URL and parsing the JSON response. A custom User-agent name is sent with each request to mask the fact that the json file is being accessed by Python. In addition, a random sleep between requests is included so that the collection is spaced out irregularly and can continue undetected by the Reddit system.

In [5]:
# urls for the subreddits

# url_1 = 'https://www.reddit.com/r/vegetarian/.json'
# url_1 = 'https://www.reddit.com/r/vegetarian/new.json'

In [7]:
# defining function to get subreddits

def get_posts_reddit(url, times):
    """Request `times` pages of the listing at `url` and append each post to the global list `posts`."""

    after = None  # pagination token; None means "start from the first page"

    for _ in range(times):
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after
        print(current_url)

        # cover user-agent used to mask access of json file by python
        res = requests.get(current_url, headers={'User-agent': 'Sloppy Joe 2.0'})

        if res.status_code != 200:
            print('Status error', res.status_code)
            break

        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]

        # add this page's posts to the global list `posts`
        posts.extend(current_posts)
        # token for the next page; if it is None, the next request starts again from the first page
        after = current_dict['data']['after']

        # generate a random sleep duration to look more 'natural'
        sleep_duration = random.randint(2, 60)
        print(sleep_duration)
        time.sleep(sleep_duration)
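
The parsing and pagination steps inside `get_posts_reddit` can be tried offline, without contacting Reddit. The following is a minimal sketch under stated assumptions: `sample_listing` is a hand-written stand-in for one page of a listing response (its values are invented), and `parse_listing` and `next_page_url` are hypothetical helper names that do not appear in the notebook.

```python
# Hand-written stand-in for one page of a listing response (values are made up).
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_abc123",
        "children": [
            {"kind": "t3", "data": {"name": "t3_aaa111", "title": "Lentil soup"}},
            {"kind": "t3", "data": {"name": "t3_abc123", "title": "Tofu stir fry"}},
        ],
    },
}


def parse_listing(listing):
    """Return (list of post dicts, 'after' token) for one listing page."""
    page_posts = [child["data"] for child in listing["data"]["children"]]
    return page_posts, listing["data"]["after"]


def next_page_url(url, after):
    """Build the URL for the next request; the first page carries no token."""
    return url if after is None else url + "?after=" + after


page_posts, after = parse_listing(sample_listing)
print(len(page_posts), after)
print(next_page_url("https://www.reddit.com/r/vegetarian/.json", after))
```

Keeping these two steps separate from the network call makes it easy to check the logic on a saved response before running the full collection loop.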


### i. Collection of posts from subreddit 'Vegetarian'

In [1]:
# collection of posts from subreddit 'vegetarian'

posts = []
get_posts_reddit('https://www.reddit.com/r/vegetarian/.json', 60)

In [None]:
# checking number of posts collected
len(posts)
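
Paging through a listing can yield the same post more than once, for instance if the `after` token comes back as `None` and the next request starts again from the first page. The raw count above may therefore overstate the number of distinct posts. A minimal sketch of a duplicate check follows, assuming each post dictionary carries Reddit's `name` field (the post's unique "fullname"); `dedupe_posts` and the sample data are hypothetical and not part of the notebook.

```python
def dedupe_posts(posts, key="name"):
    """Return posts with duplicates removed, keeping the first occurrence of each key."""
    seen = set()
    unique = []
    for post in posts:
        if post[key] not in seen:
            seen.add(post[key])
            unique.append(post)
    return unique


# Made-up sample: the second 't3_a' entry stands for a post collected twice.
sample_posts = [
    {"name": "t3_a", "title": "first"},
    {"name": "t3_b", "title": "second"},
    {"name": "t3_a", "title": "first"},
    {"name": "t3_c", "title": "third"},
]

unique = dedupe_posts(sample_posts)
print(len(sample_posts), "collected,", len(unique), "unique")
```

In the notebook this would be called as `dedupe_posts(posts)` before building the dataframe.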

In [None]:
# conversion of posts to a dataframe (df) for easy handling 
# saving a .csv copy as backup 
# to enable working on further data exploration 

veg_posts = pd.DataFrame(posts)
veg_posts.to_csv('../datasets/vegetarian.csv', index=False)
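
Each post dictionary holds many fields, some of them nested, and all of them become columns in the saved .csv. If only a few text fields are needed later, the dictionaries can be trimmed before conversion. The sketch below is an illustration only: `name`, `subreddit`, `title` and `selftext` are standard Reddit post fields, but which fields Notebook 2 actually uses is an assumption, and `slim_posts` is a hypothetical helper.

```python
# Assumed selection of fields; adjust to whatever the modelling step needs.
KEEP_FIELDS = ["name", "subreddit", "title", "selftext"]


def slim_posts(posts, fields=KEEP_FIELDS):
    """Keep only the chosen fields from each post; missing fields become None."""
    return [{field: post.get(field) for field in fields} for post in posts]


# Made-up sample post with one nested field and one missing field.
sample_posts = [
    {
        "name": "t3_a",
        "subreddit": "vegetarian",
        "title": "Lentil soup",
        "preview": {"images": []},
    },
]

slim = slim_posts(sample_posts)
print(slim)
```

The trimmed list can be passed to `pd.DataFrame(...)` in the same way as the full `posts` list.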

In [None]:
# determining shape of df
veg_posts.shape

### ii. Collection of posts from subreddit 'Low Impact Lifestyle'

In [None]:
# collection of posts from subreddit 'low impact lifestyle'

posts = []
get_posts_reddit('https://www.reddit.com/r/lowimpactlifestyle/.json', 60)

In [13]:
len(posts)

1500

In [14]:
# conversion of posts to a dataframe and saving a .csv copy as backup
lil_posts = pd.DataFrame(posts)
lil_posts.to_csv('../datasets/lowimpactlifestyle.csv', index=False)