<a id="section__top"></a>

# Project 3 - Subreddit Classifier
## Get Data
General Assembly DSI CC7 Project 3
<br>Anne Kerr - SF<br>
Due April 5, 2019
#### Notebook Overview
This notebook contains the steps to gather data from Reddit. I selected four subreddits from which to gather data. The goal of the project is to build a model that predicts which subreddit a post belongs to. The project will use at least two of the four subreddits, but I gathered data from four so that I would have plenty of data to work with. I chose some of my favorite topics: Travel, Fitness, Gardening, and Wine.



#### Approach 
I used the Reddit API to request the posts. The data comes back in JSON form. After exploring the data, I decided to save eight data elements from the returned values. They are:


|Data Element|Description|
|-------|----------|
| subreddit | Name of the subreddit, e.g., 'travel', 'fitness', 'gardening', 'wine' |
| id | Unique identifier of the post |
| selftext | The text of the post. Not all posts have text; some are only images or videos. For this project, only posts with text were collected |
| title | Post title |
| author | Reddit ID of the author |
| created | Time the post was created, stored as a Unix timestamp |
| ups | Number of upvotes the post has received |
| downs | Number of downvotes the post has received |
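
As a rough illustration of how these eight elements sit inside a listing response, here is a minimal sketch. The nesting (`data` → `children` → `data`) follows the RedditPostReader source at the end of this notebook; the function name `extract_posts` and the sample values are hypothetical and made up for the example.

```python
# The eight fields kept for each post, as listed in the table above.
POST_FIELDS = ['subreddit', 'id', 'selftext', 'title',
               'author', 'created', 'ups', 'downs']


def extract_posts(listing):
    """Return one dict per post that has non-empty selftext."""
    rows = []
    for child in listing['data']['children']:
        post = child['data']
        if post.get('selftext'):  # skips posts with missing or empty text
            rows.append({field: post.get(field) for field in POST_FIELDS})
    return rows


# Hand-written sample with the same nesting as a listing response;
# every value here is invented for illustration.
sample_listing = {
    'data': {
        'after': 't3_example',
        'children': [
            {'data': {'subreddit': 'travel', 'id': 'aaa111',
                      'selftext': 'Some text', 'title': 'A text post',
                      'author': 'user_a', 'created': 1554372000.0,
                      'ups': 3, 'downs': 0}},
            {'data': {'subreddit': 'travel', 'id': 'bbb222',
                      'selftext': '', 'title': 'An image post',
                      'author': 'user_b', 'created': 1554376000.0,
                      'ups': 2, 'downs': 0}},
        ],
    }
}

rows = extract_posts(sample_listing)
print(rows)
```

Only the first sample post survives, because the second has an empty selftext value.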

###### RedditPostReader

To handle the interaction with Reddit, I defined a class called RedditPostReader in a file called reddit_posts.py. A copy of the code is included at the end of this notebook, and the file itself is in the code folder of the project. Once you instantiate the class, you call the gather_posts() method, passing a URL and the desired number of posts to gather. The method only gathers posts with non-empty selftext values. It returns a pandas DataFrame with at least n posts; the count can run slightly over n because the method finishes processing the current page of results before it stops.

This notebook iterates over the list of subreddits of interest and builds a list of DataFrame objects, one for each subreddit. These are saved to disk for archiving. They are then concatenated, duplicates are dropped, and the combined final DataFrame is stored to disk and saved to SQL. (Note: as of this version the SQL connection is failing and needs to be debugged. The code is left in, but commented out so that no errors appear.)
                        

In [1]:
import pandas as pd
from reddit_posts import *

In [2]:
import datetime
now = datetime.datetime.now()
current_time_stamp = f'{now.year}'
current_time_stamp += f'{now.year}'
current_time_stamp += f'{now.month}'
current_time_stamp += f'{now.day}'
current_time_stamp += f'{now.second}'
current_time_stamp += f'{now.microsecond}'
current_time_stamp

'201920194419162896'
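
The cell above appends the year twice and leaves out the hour and minute, and the parts are not zero-padded, so the stamp is hard to read back. A minimal alternative sketch, assuming a fixed-width stamp is acceptable for the file names, uses `strftime`. The helper name `make_time_stamp` and the example time of day are hypothetical.

```python
import datetime


def make_time_stamp(moment=None):
    """Return a fixed-width stamp: YYYYMMDDHHMMSS followed by microseconds."""
    if moment is None:
        moment = datetime.datetime.now()
    return moment.strftime('%Y%m%d%H%M%S%f')


# The date matches the run above; the hour and minute are invented.
example = make_time_stamp(datetime.datetime(2019, 4, 4, 16, 30, 19, 162896))
print(example)
```

Because every part is zero-padded, stamps built this way also sort in chronological order.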

###### Define subreddits to process

In [3]:
subreddits = ['travel', 'fitness', 'gardening', 'wine']

###### RedditPostReader
Create an instance of the RedditPostReader class.

In [4]:
r = RedditPostReader()

Function to call the gather_posts() method for each of the subreddits of interest.

In [5]:
def get_posts(n=2000):
    posts = []
    for i in range(len(subreddits)):
        sr = subreddits[i]
        url = f'https://www.reddit.com/r/{sr}.json'
        df = r.gather_posts(url,n)
        posts.append(df)
        filename = f'../data/{sr}_posts{current_time_stamp}.csv' 
        df.to_csv(filename, index=False)
        print(f'Gathered {n} posts from {sr}')
    return posts

Call the get_posts() function. Earlier tests showed that the number of duplicate posts is quite high: after dropping duplicates we were left with fewer than 50% of the original number, and for some subreddits it was closer to 25%. To get more than 1000 non-duplicate posts for the model, we would have to set the number to gather at two or three times that. This takes quite a while to run, because the reader skips every post with empty selftext and keeps searching until it has the requested number of non-empty posts from each subreddit. Perhaps choosing different subreddits might have been easier, but this is a good challenge.

A better approach might have been to iterate over the list of subreddits directly, which is a more Pythonic way to write the function. This can be improved in the next version. Since the code works as is, I left it this way for now so that I could move on to the EDA and analysis steps.
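
One way the more Pythonic version could look is sketched below. It is not the code that produced the results in this notebook. The function name `get_posts_by_name` and the `StubReader` class are hypothetical; the stub stands in for RedditPostReader so the sketch runs without network access.

```python
def get_posts_by_name(reader, subreddit_names, n=1000):
    """Return a dict mapping each subreddit name to what the reader gathers."""
    posts = {}
    for sr in subreddit_names:  # iterate over the names, no index needed
        url = f'https://www.reddit.com/r/{sr}.json'
        posts[sr] = reader.gather_posts(url, n)
    return posts


class StubReader:
    """Hypothetical stand-in for RedditPostReader so the sketch runs offline."""

    def gather_posts(self, url, n=100):
        return f'{n} posts requested from {url}'


result = get_posts_by_name(StubReader(), ['travel', 'wine'], n=5)
for name, value in result.items():
    print(name, '->', value)
```

Returning a dict keyed by subreddit name also removes the need to keep the list of DataFrames in the same order as the list of names.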


In [6]:
post_df_list = get_posts(1000)

Gathering posts from https://www.reddit.com/r/travel.json
Gathered 500 posts so far
Gathered 1000 posts so far
Gathered 1011 posts
skipped 233 posts with no selftext
Gathered 1000 posts from travel
Gathering posts from https://www.reddit.com/r/fitness.json
Gathered 500 posts so far
Gathered 1000 posts so far
Gathered 1006 posts
skipped 0 posts with no selftext
Gathered 1000 posts from fitness
Gathering posts from https://www.reddit.com/r/gardening.json
Gathered 500 posts so far
Gathered 1000 posts so far
Gathered 1006 posts
skipped 3551 posts with no selftext
Gathered 1000 posts from gardening
Gathering posts from https://www.reddit.com/r/wine.json
Gathered 500 posts so far
Gathered 1000 posts so far
Gathered 1004 posts
skipped 1005 posts with no selftext
Gathered 1000 posts from wine


Now that we have finished gathering the posts, let's check the DataFrames to confirm that we got what we intended.

In [7]:
def check_posts():
    for i in range(len(subreddits)):
        sr = subreddits[i]
        df = post_df_list[i]  
        print(f'Shape of {sr} DataFrame: {df.shape}')


In [8]:
check_posts()

Shape of travel DataFrame: (1011, 8)
Shape of fitness DataFrame: (1006, 8)
Shape of gardening DataFrame: (1006, 8)
Shape of wine DataFrame: (1004, 8)


In [9]:
df1 = post_df_list[0]
df2 = post_df_list[1]
df3 = post_df_list[2]
df4 = post_df_list[3]


In [10]:
df1.head()


| | subreddit | id | selftext | title | author | created | ups | downs |
|---|-------|-------|----------|----------|-------|-------|---|---|
| 0 | travel | b6i1po | Hey travellers!\n \nIn this weekly community d... | r/travel Topic of the Week: 'Action!' | AutoModerator | 1553775000.0 | 18 | 0 |
| 1 | travel | b9avjb | Hi, I'm travelling to Orlando this summer with... | Travel cards or currency? | scottishguyhere | 1554372000.0 | 3 | 0 |
| 2 | travel | b9bbtp | I’m aware that that if I exchange money at a l... | Question about exchange rates | KeydGV21 | 1554376000.0 | 2 | 0 |
| 3 | travel | b9b4f5 | I am thinking of travelling to Sri Lanka but r... | Illegal for women to buy alcohol in Sri Lanka? | SecondAccount404 | 1554374000.0 | 2 | 0 |
| 4 | travel | b9bkmw | I will be in France on April 24th (release dat... | Doe as France dub American movies in French? | purplewhitewine | 1554378000.0 | 1 | 0 |


In [11]:
df2.head()

| | subreddit | id | selftext | title | author | created | ups | downs |
|---|-------|-------|----------|----------|-------|-------|---|---|
| 0 | Fitness | b5d3q4 | Howdy!\n\nWelcome to r/Fitness Community Campf... | Community Campfire: Eating Less Sugar and Junk... | purplespengler | 1553533000.0 | 153 | 0 |
| 1 | Fitness | b9aufw | Welcome to the /r/Fitness Daily Simple Questio... | Daily Simple Questions Thread - April 04, 2019 | AutoModerator | 1554372000.0 | 4 | 0 |
| 2 | Fitness | b94tak | I am 25 years old 6'7" 250lb 20% body fat. I h... | Should heavy people run? | pastathehoagie | 1554331000.0 | 844 | 0 |
| 3 | Fitness | b92zqz | With all the recent news about it here I wante... | Do the performance benefits of coffee only com... | Work1Work2Work3 | 1554322000.0 | 125 | 0 |
| 4 | Fitness | b8w0kq | Welcome to Rant Wednesday: It's your time to l... | Rant Wednesday | AutoModerator | 1554286000.0 | 682 | 0 |


In [12]:
df3.head()

| | subreddit | id | selftext | title | author | created | ups | downs |
|---|-------|-------|----------|----------|-------|-------|---|---|
| 0 | gardening | b6x29k | This is the Friendly Friday Thread. \n\nNegat... | Friendly Friday Thread | AutoModerator | 1553865000.0 | 18 | 0 |
| 1 | gardening | b99ocn | I haven't left my bedroom in 2 weeks apart fro... | I've got a Peace Lily growing in the corner of... | SplashBandicoot | 1554363000.0 | 3 | 0 |
| 2 | gardening | b98wc3 | Now that spring has arrived (more or less), it... | Favorite Spray Nozzle | Fleemo17 | 1554357000.0 | 3 | 0 |
| 3 | gardening | b99f7a | I inherited some irises and I put them in betw... | Irises and blackberries help | kd5tdu | 1554361000.0 | 2 | 0 |
| 4 | gardening | b97hqr | I have some questions regarding mulching that ... | How to mulch? And other questions. | Writer_A | 1554347000.0 | 3 | 0 |


In [13]:
df4.head()

| | subreddit | id | selftext | title | author | created | ups | downs |
|---|-------|-------|----------|----------|-------|-------|---|---|
| 0 | wine | b6xh6l | Bottle porn without notes, random musings, off... | Free Talk Friday | CondorKhan | 1553867000.0 | 9 | 0 |
| 1 | wine | b89vf2 | Hi Everyone, so here we are at our April chall... | **Monthly Wine Challenge - April 2019 Selectio... | PhoenixRising20 | 1554156000.0 | 9 | 0 |
| 2 | wine | b995ws | i just opened a bottle of port, and I have bas... | what do you with wine you don't like ? | Maximilianne | 1554359000.0 | 2 | 0 |
| 3 | wine | b97ggk | My friend decided not to go anymore so I have ... | Extra general admission ticket to Wine Spectat... | aparice | 1554347000.0 | 2 | 0 |
| 4 | wine | b8yvdy | My wife and I are traveling with friends to Na... | Napa Valley Recommendations | Yoko_Loco | 1554303000.0 | 8 | 0 |


#### Post processing
- Concatenate the datasets into one DataFrame
- Drop duplicates
- Save to disk for archival purposes
- Write to the Postgres database

In [14]:
dfall = pd.concat(post_df_list)
dfall.shape

(4027, 8)

Check for duplicates

In [15]:
dfall.drop_duplicates(subset=['subreddit', 'id', 'title', 'author']).shape

(1800, 8)

In [16]:
dfall.subreddit.value_counts()

travel       1011
Fitness      1006
gardening    1006
wine         1004
Name: subreddit, dtype: int64

The first check shows that only 1800 of the 4027 rows are unique, so there are quite a few duplicates. Visual inspection of the data confirmed that they are duplicates, so we will drop them here. Further investigation is needed to understand whether this is an artifact of the Reddit API or whether something in the class can be changed to prevent it. For now, we will just drop the duplicates.
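
One possible cause, offered as an assumption and not a confirmed diagnosis: gather_posts() sends a request without an `after` parameter whenever `after` is None. If the API returns a null `after` once a listing is exhausted, the loop would start again from the first page and collect the same posts a second time. The sketch below shows a loop that stops at that point and also tracks the ids it has already seen. The names `gather_unique`, `fetch_page`, and `PAGES` are hypothetical, and the two-page listing is made up so the example runs offline.

```python
def gather_unique(fetch_page, max_posts):
    """Collect up to max_posts unique posts, stopping when the listing ends.

    fetch_page is a hypothetical callable: it takes an `after` token
    (None for the first page) and returns (posts, next_after).
    """
    seen_ids = set()
    collected = []
    after = None
    while len(collected) < max_posts:
        posts, after = fetch_page(after)
        for post in posts:
            if post['id'] not in seen_ids:
                seen_ids.add(post['id'])
                collected.append(post)
        if after is None:  # no further pages: stop instead of starting over
            break
    return collected[:max_posts]


# Made-up two-page listing standing in for the API.
PAGES = {
    None: ([{'id': 'a'}, {'id': 'b'}], 'page2'),
    'page2': ([{'id': 'b'}, {'id': 'c'}], None),
}

result = gather_unique(lambda after: PAGES[after], max_posts=10)
print([post['id'] for post in result])
```

With this approach a subreddit may return fewer posts than requested, which would need to be handled by the caller.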

In [17]:
dfall.drop_duplicates(subset=['subreddit', 'id', 'title', 'author'], inplace=True)
dfall.shape

(1800, 8)

Re-check the counts for each of the subreddits

In [18]:
dfall.subreddit.value_counts()

travel       815
wine         451
Fitness      318
gardening    216
Name: subreddit, dtype: int64

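###### How many posts remain?

After dropping duplicates, 1800 of the original 4027 posts remain. The split is uneven: travel keeps 815 posts, while gardening keeps only 216.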
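
The share of posts kept per subreddit can be worked out from the two value_counts() outputs above. The short calculation below copies those counts by hand; the variable names are only for illustration.

```python
# Counts copied from the value_counts() outputs before and after de-duplication.
before = {'travel': 1011, 'Fitness': 1006, 'gardening': 1006, 'wine': 1004}
after = {'travel': 815, 'wine': 451, 'Fitness': 318, 'gardening': 216}

# Percentage of gathered posts that survived de-duplication, per subreddit.
retained_pct = {name: 100 * after[name] / before[name] for name in before}
for name, pct in sorted(retained_pct.items(), key=lambda item: -item[1]):
    print(f'{name:<10} {after[name]:>4} of {before[name]:>4} kept ({pct:.1f}%)')

overall_pct = 100 * sum(after.values()) / sum(before.values())
print(f'overall    {sum(after.values())} of {sum(before.values())} kept ({overall_pct:.1f}%)')
```

Travel keeps about 81% of its posts, wine about 45%, Fitness about 32%, and gardening about 21%, for roughly 45% overall.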

#### Save the data

Write the final combined file to CSV. The data is also meant to be stored in a Postgres database running on an Amazon Web Services (AWS) instance; that code is commented out below.

In [19]:
dfall.to_csv(f'../data/final_posts_all{current_time_stamp}.csv', index=False)

In [20]:
# import warnings;
# warnings.simplefilter('ignore')

# ##Note: The dns value changes with each AWS session.

# dns = 'ec2-18-224-40-114.us-east-2.compute.amazonaws.com'

# from pandas.io import sql
# from sqlalchemy import create_engine

# ###engine = create_engine(f'postgres://postgres:pass@{dns}')
# engine = create_engine(f'postgres://postgres:Letmeinplease00@{dns}')

# table_name = 'reddit_posts'


# dfall.to_sql(table_name, con=engine, index=False, if_exists='replace')
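
When the connection is debugged, it may help to keep the host and password out of the notebook and read them from the environment. The sketch below only assembles the connection URL and does not open a connection. The helper name `build_postgres_url`, the environment variable names `PG_HOST` and `PG_PASSWORD`, and the default port and database name are all assumptions. It uses the `postgresql://` scheme, which newer SQLAlchemy releases expect in place of `postgres://`; whether that is related to the failure here has not been checked.

```python
import os
from urllib.parse import quote


def build_postgres_url(host, user='postgres', password='',
                       port=5432, database='postgres'):
    """Assemble a PostgreSQL connection URL, percent-encoding the password."""
    safe_password = quote(password, safe='')  # encodes '@', ':', '/', spaces
    return f'postgresql://{user}:{safe_password}@{host}:{port}/{database}'


# PG_HOST and PG_PASSWORD are hypothetical environment variable names.
url = build_postgres_url(
    host=os.environ.get('PG_HOST', 'localhost'),
    password=os.environ.get('PG_PASSWORD', ''),
)
print(url.split('@')[-1])  # print only the host part, never the password
```

The resulting string could then be passed to `create_engine()` in place of the hardcoded f-string.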

#### On to analysis

Having completed the data gathering phase, we can move on to notebook 3 for EDA and Analysis.


###### Source code for the RedditPostReader class

```

import requests
import time
import pandas as pd
import numpy as np
import warnings

class RedditPostReader:
    header = {'user-agent': 'anne'}
    post_cols = ['subreddit', 'id', 'selftext', 'title', 'author', 'created', 'ups', 'downs']
  
    post_count = 0
    url=''
   
    def __init__(self):
         pass

   
    def gather_posts(self, url, n=100):

        df = pd.DataFrame(columns=self.post_cols)
        warnings.simplefilter('ignore')
        max_posts = n
        self.url = url
        self.post_count = 0
        empty_count = 0
        after = None
        print(f'Gathering posts from {self.url}')
 
        while True:

            if after is None:
                params = {}
            else:
                params = {'after': after}
            rep = requests.get(self.url, params=params, headers=self.header)
 
            if rep.status_code == 200:
                pjson = rep.json()
                nposts =  len(pjson['data']['children'])  

                for i in range(0, nposts):
                    self_text = [pjson['data']['children'][i]['data']['selftext']]

                    if len(self_text) > 0 and len(self_text[0]) > 0:
                        pdict = {
                            'subreddit' : [pjson['data']['children'][i]['data']['subreddit']],
                            'id' : [pjson['data']['children'][i]['data']['id']],
                            'selftext' : [pjson['data']['children'][i]['data']['selftext']],
                            'title' : [pjson['data']['children'][i]['data']['title']],
                            'author' : [pjson['data']['children'][i]['data']['author']],
                            'created' : [pjson['data']['children'][i]['data']['created']],
                            'ups' : [pjson['data']['children'][i]['data']['ups']],
                            'downs' : [pjson['data']['children'][i]['data']['downs']],
                        }
                        self.post_count += 1
                        if (self.post_count % 500 == 0):
                            print(f'Gathered {self.post_count} posts so far')
                        df2 = pd.DataFrame(pdict)
                        # DataFrame.append was removed in pandas 2.0; concat is equivalent here
                        df = pd.concat([df, df2], ignore_index=True)
                    else:
                        empty_count += 1
                after = pjson['data']['after']
            else:
                print(rep.status_code)
                break

            if self.post_count < max_posts:
                time.sleep(3)
            else:
                print(f'Gathered {self.post_count} posts')
                print(f'skipped {empty_count} posts with no selftext')
                break
 
        return df
```