# Data Import
We are pulling data from the Pushshift API: 5000 rows of post data from the [r/bipolar](https://www.reddit.com/r/bipolar/) and [r/depression](https://www.reddit.com/r/depression/) subreddits.

These are the imports we will use. `requests` lets us call the API, and `pandas` holds the results as a dataframe:

In [1]:
import requests
import pandas as pd
pd.set_option('display.max_columns', 150)

### Create a Function to Pull Data
This function takes a subreddit name and a number of rows, and builds a dataframe with up to that many rows of posts from that subreddit. It stores the dataframe in a global variable named after the subreddit. The final line, which would also save the dataframe as a CSV file, is commented out here.

In [3]:
def subreddit_df_creator(subreddit, rows):
        
        
        # establish base url for pulling posts
        base_url = 'https://api.pushshift.io/reddit/search/submission'
        
        # establish a count so our iterations don't bring back the same data over and over
        count = 0
        
        # create a list that we will append with the reddit posts that meet our conditions
        json_data = []
        
        # create a while loop that will continue until we have the necessary amount of data
        while len(json_data) < (rows + rows*(.01)):
            
            # create parameters for our api requests so they will be less intensive
            new_params = {'subreddit': subreddit, 
                'fields': ['selftext', 'subreddit', 'title'], 
                
                # these parameters refer to the timeframe to pull posts from, 
                # which using our counter, goes back one day further after each iteration         
                'after': (str(count + 1) + 'd'), 
                'before': (str(count) + 'd')}
            
            # make the request to the api, passing our parameters as the query string
            request = requests.get(base_url, params=new_params)
            
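            # convert the response to JSON; if the body is not valid JSON,
            # skip this day and move on to the previous one
            try:
                payload = request.json()
            except ValueError:
                count = count + 1
                continue
            
            # get the list of posts nested under the 'data' key
            json_data_only = payload['data']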
            
            # keep only posts whose text is longer than 100 words
            for i in json_data_only:
                try:
                    if len(i['selftext'].split()) > 100:
                        json_data.append(i)
                except (KeyError, AttributeError):
                    # skip posts with no 'selftext' field or a non-string value
                    continue
            
            # increase the counter by 1 so we call on posts from the previous day in the next iteration
            count = count + 1
            
            
            # printing number of rows added so far to track progress
            print('Subreddit posts added to dataframe: ' + str(len(json_data)))
        
        
        # turn the list of json dictionaries into a pandas dataframe using json_normalize,
        # save it as a global variable, drop duplicates, and slice the dataframe to the desired size
        globals()[subreddit] = pd.json_normalize(json_data)
        globals()[subreddit] = globals()[subreddit].drop_duplicates()
        globals()[subreddit] = globals()[subreddit].iloc[:rows]
        
        # save dataframe as a csv file
        #globals()[subreddit].to_csv('../data/' + subreddit + '.csv', index=False)
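The two core steps of the function above, building the one-day request window and keeping only the longer posts, can be tried without calling the API. This is a minimal sketch rather than part of the original notebook: the helper names `build_window_params` and `keep_long_posts` and the `sample` posts are made up for illustration, and only the parameter keys and the 100-word cutoff come from the function above.

```python
def build_window_params(subreddit, day_offset):
    """Query parameters for the one-day window that ended `day_offset` days ago."""
    return {
        'subreddit': subreddit,
        'fields': ['selftext', 'subreddit', 'title'],
        'after': str(day_offset + 1) + 'd',   # older edge of the window
        'before': str(day_offset) + 'd',      # newer edge of the window
    }


def keep_long_posts(posts, min_words=100):
    """Keep posts whose 'selftext' has more than `min_words` whitespace-separated words."""
    kept = []
    for post in posts:
        text = post.get('selftext')
        if isinstance(text, str) and len(text.split()) > min_words:
            kept.append(post)
    return kept


# hypothetical posts, shaped like the items in the API's 'data' list
sample = [
    {'title': 'short', 'selftext': 'too short'},
    {'title': 'long', 'selftext': 'word ' * 150},
    {'title': 'no body'},
]

params = build_window_params('depression', 2)
long_posts = keep_long_posts(sample)
print(params['after'], params['before'], [p['title'] for p in long_posts])
```

With a `day_offset` of 2, the window runs from three days ago to two days ago, and only the 150-word sample post passes the filter.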

This is a demonstration of the output, limited to 100 posts here:

In [4]:
subreddit_df_creator('depression', 100)
subreddit_df_creator('bipolar', 100)

Subreddit posts added to dataframe: 0
Subreddit posts added to dataframe: 0
Subreddit posts added to dataframe: 14
Subreddit posts added to dataframe: 26
Subreddit posts added to dataframe: 36
Subreddit posts added to dataframe: 48
Subreddit posts added to dataframe: 58
Subreddit posts added to dataframe: 67
Subreddit posts added to dataframe: 77
Subreddit posts added to dataframe: 89
Subreddit posts added to dataframe: 97
Subreddit posts added to dataframe: 110
Subreddit posts added to dataframe: 11
Subreddit posts added to dataframe: 21
Subreddit posts added to dataframe: 28
Subreddit posts added to dataframe: 37
Subreddit posts added to dataframe: 48
Subreddit posts added to dataframe: 55
Subreddit posts added to dataframe: 61
Subreddit posts added to dataframe: 71
Subreddit posts added to dataframe: 77
Subreddit posts added to dataframe: 84
Subreddit posts added to dataframe: 90
Subreddit posts added to dataframe: 96
Subreddit posts added to dataframe: 101
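The progress counts above stop at 110 and 101 rather than at exactly 100. That follows from the loop condition in `subreddit_df_creator`, which keeps fetching until the list is at least 1% larger than `rows`, before duplicates are dropped and the dataframe is sliced down to `rows`. A small sketch of that check, where `needs_more` is a made-up name that mirrors the `while` condition:

```python
def needs_more(collected, rows):
    """Mirror of the loop condition: keep fetching until 1% above `rows`."""
    return collected < rows + rows * 0.01


# counts taken from the progress output for rows = 100
for collected in (97, 110, 96, 101):
    print(collected, needs_more(collected, 100))
```

For `rows = 100` the threshold is 101, so the loop continues at 97 and 96 posts and stops once it reaches 110 or 101.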


**Now that we have our data, we can take a look at it:**

**Up Next:**  
[Cleaning and Exploratory Analysis](./Cleaning_and_Exploratory_Analysis.ipynb)  
  
[Return to Read Me](../README.md)
