# Data Acquisition

### Notebook Summary

For this project I will source text data from Reddit. Because of the limits of the Reddit API, I will use the following code (lightly adapted from DSI-US-7 TA Chris Sinatra) to query the Pushshift API for thousands of posts from the r/Dodgers and r/baseball subreddits. The acquired data is then stored in JSON files for use in the following notebooks.

In [1]:
import requests, time, csv, json, re
import pandas as pd

The `filename_format_log` function defined in the following cell is called by the query function further below. When a query finishes, the query function calls `filename_format_log` to append the timestamped name and path of the saved file to a log file, along with a description built from the timestamp of the earliest post pulled in that query.

In [7]:
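def filename_format_log(file_path,
                        logfile='../assets/file_log.txt',
                        now=None,
                        file_description=None):
    # Resolve the default at call time; a default of round(time.time())
    # would be evaluated only once, when the function is defined.
    if now is None:
        now = round(time.time())

    # Require a file extension somewhere after the leading '..' of the path.
    if re.search(r'(?<!^)(?<!\.)\.(?!\.)', file_path) is None:
        raise ValueError('Please enter a relative path with a file extension.')

    # Find where the base name (e.g. 'raw_submissions') starts so the
    # timestamp can be inserted directly in front of it.
    match = re.search(r'(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path)
    if match is None:
        raise ValueError('Expected a lowercase base name such as "raw_submissions.json".')
    stamp = match.start()
    formatted_name = f'{file_path[:stamp]}{now}_{file_path[stamp:]}'
    if not file_description:
        file_description = f'Pull: {time.asctime(time.gmtime(now))}'
    # Append one line per saved file to the log.
    with open(logfile, 'a+') as f:
        f.write(f'{formatted_name}: {file_description}\n')
    return formatted_name, now, file_description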
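
To make the naming step concrete, here is a minimal sketch of just the timestamp insertion, without the logging. `stamp_filename` is a hypothetical helper written for illustration; it reuses the same regular expression as `filename_format_log` and assumes a lowercase base name of the form `word_word` followed by an extension.

```python
import re

def stamp_filename(file_path, now):
    """Hypothetical helper: insert `now` in front of the base name of `file_path`."""
    # Same pattern as in filename_format_log: a lowercase 'word_word'
    # base name that is immediately followed by the extension dot.
    match = re.search(r'(?<!^)(?<!\.)[a-z]+_[a-z]+(?=\.)', file_path)
    if match is None:
        raise ValueError('Expected a base name such as "raw_submissions.json".')
    stamp = match.start()
    return f'{file_path[:stamp]}{now}_{file_path[stamp:]}'

stamped = stamp_filename('../data/raw_submissions.json', 1499265759)
print(stamped)  # ../data/1499265759_raw_submissions.json
```

Because the timestamp goes in front of the base name, files from repeated pulls should not overwrite one another as long as their earliest posts differ.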

The following function takes a string argument identifying which subreddit to query. Unless a different value is passed for `n_samples`, it pulls 1,500 posts in total. It also accepts `before` and `after` arguments to restrict the pull to a window of time; otherwise it defaults to pulling the most recent `n_samples` posts.

When it runs, the function stores the current time in `last_post` and creates an empty list in which to store pulled posts. It then queries the designated subreddit. If the query returns any posts, the function assigns the original submission time of the earliest pulled post to `last_post`, adds the newly pulled posts to the storage list, and pauses for one second (so as not to overload the host server). It keeps iterating through this process until there are no more uncollected posts in the targeted subreddit or until the number of posts requested in the function call has been collected.

Finally, the function calls `filename_format_log` (defined above) to create a record of the successful query, and then stores the collected data in a JSON file. As the function runs, it prints the loop's status at the start of each iteration.

In [8]:
def reddit_query(subreddits, n_samples=1500, before=None, after=None):
    url = 'https://api.pushshift.io/reddit/search/submission'
    # Start from `before` if one was given, otherwise from the current time.
    last_post = before if before is not None else round(time.time())
    post_list = []
    timestamp = None

    run = 1
    while len(post_list) < n_samples:
        print(f'Starting query {run}')

        params = {
            'subreddit': subreddits,
            'sort': 'desc',
            'size': n_samples,
            'before': last_post - 1,
            'after': after,
        }

        response = None
        try:
            response = requests.get(url, params=params)
            posts = response.json()['data']
        except Exception:
            # `response` stays None if the request itself failed.
            if response is not None and response.status_code != 200:
                return f'Check status. Error code: {response.status_code}'
            return 'Error. Pull not completed.'

        if len(posts) == 0:
            # No older posts remain, so stop instead of re-querying forever.
            break

        # Results are sorted newest first, so the last post is the earliest.
        last_post = posts[-1]['created_utc']
        timestamp = last_post
        post_list.extend(posts)
        time.sleep(1)  # pause so as not to overload the host server
        run += 1

    if timestamp is None:
        return 'No posts found. Nothing was saved.'

    formatted_name, now, file_description = filename_format_log(
        file_path='../data/raw_submissions.json', now=timestamp)
    with open(formatted_name, 'w+') as f:
        json.dump(post_list, f)

    print(f'Saved and completed query and returned {len(post_list)} submissions.')
    print('Reddit text is ready for processing.')
    print(f'Last timestamp was {timestamp}.')
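
The backward-in-time paging that `reddit_query` relies on can be tried without touching the network. The sketch below is only an illustration of the loop's logic: `paginate`, `fake_fetch`, and `FAKE_POSTS` are hypothetical names that stand in for the real request, and the page size of 10 is an arbitrary assumption.

```python
def paginate(fetch_page, n_samples, start_time):
    """Collect up to n_samples posts, walking backward in time page by page."""
    collected = []
    before = start_time
    while len(collected) < n_samples:
        page = fetch_page(before)
        if not page:
            # Nothing older is left, so stop rather than loop forever.
            break
        collected.extend(page)
        # Pages are sorted newest first, so the last item is the earliest.
        before = page[-1]['created_utc']
    return collected[:n_samples]

# Hypothetical in-memory stand-in for the API: 25 posts with timestamps 25..1.
FAKE_POSTS = [{'created_utc': t} for t in range(25, 0, -1)]

def fake_fetch(before, size=10):
    """Return up to `size` posts strictly older than `before`, newest first."""
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    return older[:size]

posts = paginate(fake_fetch, n_samples=22, start_time=26)
print(len(posts), posts[0]['created_utc'], posts[-1]['created_utc'])  # 22 25 4
```

Asking for more posts than exist (for example `n_samples=100`) ends the loop on the first empty page and returns all 25.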

Now I will use the `reddit_query` function to collect 20,000 posts from the r/baseball subreddit.

In [None]:
reddit_query(subreddits='baseball', n_samples=20000)

Now I will use the `reddit_query` function to collect 20,000 posts from the r/Dodgers subreddit.

In [10]:
reddit_query(subreddits='Dodgers', n_samples=20000)

Starting query 1
Starting query 2
Starting query 3
Starting query 4
Starting query 5
Starting query 6
Starting query 7
Starting query 8
Starting query 9
Starting query 10
Starting query 11
Starting query 12
Starting query 13
Starting query 14
Starting query 15
Starting query 16
Starting query 17
Starting query 18
Starting query 19
Starting query 20
Saved and completed query and returned 20000 submissions.
Reddit text is ready for processing.
Last timestamp was 1499265759.
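
The last timestamp is a Unix epoch value in seconds. As a small sketch using only the standard library, it can be converted to a readable UTC date to see how far back the pull reached; the variable names here are just for illustration.

```python
from datetime import datetime, timezone

last_timestamp = 1499265759  # value printed by the r/Dodgers query above

# Interpret the epoch seconds as a timezone-aware UTC datetime.
earliest = datetime.fromtimestamp(last_timestamp, tz=timezone.utc)
print(earliest.isoformat())  # 2017-07-05T14:42:39+00:00
```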


Now that I have some data, I will move on to EDA and pre-processing in the next notebook.