## API
### 1. Choose an API
#### a). Choose an API and briefly describe the type of data you can obtain from it. Note: Please do not use any of the APIs we covered in lecture (e.g. NYTimes, Github etc.).
I chose the `Pushshift Reddit API`, which provides enhanced search capabilities for Reddit comments and submissions.

#### b). Provide a link to the API documentation.
The documentation is available at: https://github.com/pushshift/api

#### c). Provide the base URL of the API you intend to use.
The base URL of the API is: https://api.pushshift.io/
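
As a rough illustration of how the base URL combines with an endpoint path and query parameters, the sketch below assembles a comment-search URL with the standard library only. The endpoint path `reddit/search/comment/` and the parameter names are the ones used in the GET request in Q.3a; `build_query_url` is a hypothetical helper name, not part of the Pushshift API.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://api.pushshift.io/"
COMMENT_ENDPOINT = "reddit/search/comment/"  # endpoint path used in Q.3a


def build_query_url(base_url, endpoint, **params):
    """Join base URL and endpoint, then append URL-encoded query parameters."""
    url = urljoin(base_url, endpoint)
    return f"{url}?{urlencode(params)}"


# Hypothetical helper call; the parameter values mirror the request in Q.3a.
url = build_query_url(
    BASE_URL,
    COMMENT_ENDPOINT,
    q="china",
    subreddit="stocks",
    after="1000d",
    before="1d",
    size=500,
)
print(url)
```

Keeping the parameters in keyword form makes it easy to change one of them without editing a long query string by hand.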

### 2. Authentication
#### a) Briefly explain how the API authenticates the user. b) Apply for an API key if necessary and provide the information (with relevant URL) how that can be done. Do not include the API key in the assignment submission.

This API does not require authentication.

### 3. Send a Simple GET request
#### a) Execute a simple GET request to obtain a small amount of data from the API. Describe a few query parameters and add them to the query. If you have a choice of the output the API returns (e.g. XML or JSON), I suggest choosing JSON because it is easier to work with. Your output here should include the code for the GET request, including the query parameters, as well as a snippet of the output.


In [1]:
import requests
import os
import json

In [66]:
# Search Reddit comments in the r/stocks subreddit (the "subreddit" parameter)
# that mention China (the "q" parameter), restricted to a time window
# given by the "after" and "before" parameters; "size" is the number of results requested.
r = requests.get(
    'https://api.pushshift.io/reddit/search/comment/?q=china&subreddit=stocks&after=1000d&before=1d&size=500')

# Parse the JSON body and inspect the text of the first comment.
json_response = r.json()
example = json_response['data'][0]['body']
example

"Personally, I don't believe any numbers coming out of China. I have no doubt they are growing and becoming more powerful, but anything the government reports needs to be taken with a grain of salt. "

#### b) Check (and show) the status of the request.

In [14]:
r.status_code

200
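
A status code of 200 means the request succeeded. As a rough sketch of how other codes could be reported to a user, the hypothetical helper `describe_status` below looks up the standard reason phrase with Python's `http.HTTPStatus`. The wording of the messages is an assumption for illustration, not something the API returns.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # standard reason phrase, e.g. "OK"
    outcome = "success" if 200 <= code < 300 else "request failed"
    return f"{code} {phrase} ({outcome})"


# A few codes an API client is likely to encounter.
for code in (200, 404, 429):
    print(describe_status(code))
```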

#### c) Check (and show) the type of the response (e.g. XML, JSON, csv).

In [51]:
r.headers['Content-type']

'application/json; charset=UTF-8'
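
The header value above carries two pieces of information: the media type and the character set. A minimal sketch of separating them with the standard library is shown below; `parse_content_type` is a hypothetical helper name, and the header string is the one returned above.

```python
from email.message import Message


def parse_content_type(header_value):
    """Split a Content-Type header into (media type, charset)."""
    msg = Message()
    msg["Content-Type"] = header_value
    # Both accessors normalise their result to lower case.
    return msg.get_content_type(), msg.get_content_charset()


media_type, charset = parse_content_type("application/json; charset=UTF-8")
print(media_type, charset)
```

A client could use the media type to decide whether calling `r.json()` is appropriate before parsing the body.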

### 4. Parse the response and Create a dataset
#### a) Take the response returned by the API and turn it into a useful Python object (e.g. a list, vector, or pandas data frame). Show the code how this is done.

In [56]:
# First, check what Python object the GET request returns.
print(json_response.keys())

# This shows the attributes of each comment.
json_response['data'][0]

dict_keys(['data'])


{'all_awardings': [],
 'associated_award': None,
 'author': 'steatorrhoea',
 'author_flair_background_color': None,
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_template_id': None,
 'author_flair_text': None,
 'author_flair_text_color': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_56c85aon',
 'author_patreon_flair': False,
 'author_premium': False,
 'awarders': [],
 'body': 'No don’t fuck with China stocks man. Cheating is their culture',
 'collapsed_because_crowd_control': None,
 'created_utc': 1589646877,
 'gildings': {},
 'id': 'fqtvfrc',
 'is_submitter': False,
 'link_id': 't3_gkx2rj',
 'locked': False,
 'no_follow': True,
 'parent_id': 't3_gkx2rj',
 'permalink': '/r/stocks/comments/gkx2rj/should_i_invest_10k_i_have_for_my_3_year_old_in/fqtvfrc/',
 'retrieved_on': 1589646879,
 'score': 1,
 'send_replies': True,
 'stickied': False,
 'subreddit': 'stocks',
 'subreddit_id': 't5_2qjfk',
 'total_awards_received': 0,
 'treatment_tags': []}
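
Each record is a plain dictionary, so a smaller and more readable version can be built before anything is loaded into pandas. The sketch below keeps a handful of fields and converts `created_utc` (seconds since the Unix epoch) into an ISO timestamp. The `record` literal copies a few fields from the output above; `slim_record`, `KEEP`, and the added `created` key are hypothetical names chosen for illustration.

```python
from datetime import datetime, timezone

# A few fields copied from the record shown above (the rest are omitted).
record = {
    "author": "steatorrhoea",
    "created_utc": 1589646877,
    "score": 1,
    "subreddit": "stocks",
    "link_id": "t3_gkx2rj",
    "locked": False,
}

KEEP = ("author", "created_utc", "score", "subreddit")


def slim_record(rec, keep=KEEP):
    """Return only the chosen fields plus a readable UTC timestamp."""
    slim = {key: rec.get(key) for key in keep}
    created = datetime.fromtimestamp(rec["created_utc"], tz=timezone.utc)
    slim["created"] = created.isoformat()
    return slim


slim = slim_record(record)
print(slim)
```

A list of such slimmed dictionaries can be passed to `pd.DataFrame` in the same way as the full records.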

#### b) Using the API, create a dataset (in data frame format) for multiple records. I'd say a sample size greater than 100 is sufficient for the example but feel free to get more data if you feel ambitious and the API allows you to do that fairly easily. The dataset can include only a small subset of the returned data. Just choose some interesting features. There is no need to be inclusive here.

In [57]:
import pandas as pd

In [83]:
json_df = pd.DataFrame(json_response['data'])
json_df.head()

Unnamed: 0,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_patreon_flair,...,link_id,no_follow,parent_id,permalink,retrieved_on,score,send_replies,stickied,subreddit,subreddit_id
0,atdharris,,,[],,,,text,t2_7ki9f,False,...,t3_ask4rr,True,t1_egvvjjp,/r/stocks/comments/ask4rr/why_should_i_simply_...,1550682576,1,True,False,stocks,t5_2qjfk
1,coquinaa,,,[],,,,text,t2_s2rt9,False,...,t3_at2tqk,True,t3_at2tqk,/r/stocks/comments/at2tqk/stocks_directionless...,1550758712,1,True,False,stocks,t5_2qjfk
2,Sirrus_VG,,,[],,,,text,t2_vgf65cl,False,...,t3_at5s3j,True,t3_at5s3j,/r/stocks/comments/at5s3j/why_is_ea_still_drop...,1550770889,1,True,False,stocks,t5_2qjfk
3,Sirrus_VG,,,[],,,,text,t2_vgf65cl,False,...,t3_at4iwc,True,t3_at4iwc,/r/stocks/comments/at4iwc/thought_on_ea/egyu2x9/,1550770896,1,True,False,stocks,t5_2qjfk
4,Sirrus_VG,,,[],,,,text,t2_vgf65cl,False,...,t3_at4iwc,True,t1_egykiqe,/r/stocks/comments/at4iwc/thought_on_ea/egyuzqc/,1550771500,1,True,False,stocks,t5_2qjfk


In [69]:
json_df['body'].head()

0    Personally, I don't believe any numbers coming...
1    China is the biggest catalyst, too much uncert...
2    All you guys spreading misinformation about An...
3    All you guys spreading misinformation about An...
4     \n\nAll you guys spreading misinformation abo...
Name: body, dtype: object

#### c) Provide some summary statistics of the data. Include the data frame in a .csv file called data.csv with your submission for the grader.

In [91]:
# Dimensions of the dataset: (rows, columns).
json_df.shape

(100, 24)
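
The request in Q.3a asked for `size=500`, while the frame above has 100 rows, which suggests the server may cap the number of results per request. One possible way to collect more records is to page backwards in time with the `before` parameter, using the oldest `created_utc` seen so far. The sketch below is only an outline under that assumption: `collect_comments`, `make_fake_fetch`, and the injected `fetch_page` callable are hypothetical names, and a fake in-memory fetcher stands in for the real API so the loop can run offline.

```python
def collect_comments(fetch_page, target, page_size=100):
    """Collect up to `target` records by paging backwards in time.

    `fetch_page(before, size)` is assumed to return a list of dicts,
    newest first, each with a `created_utc` key.
    """
    collected = []
    before = None  # None means "start from the newest record"
    while len(collected) < target:
        size = min(page_size, target - len(collected))
        page = fetch_page(before=before, size=size)
        if not page:
            break  # no older records are available
        collected.extend(page)
        # Next page: only records strictly older than the last one seen.
        before = page[-1]["created_utc"]
    return collected


def make_fake_fetch(records):
    """Build an offline stand-in for the API from a newest-first list."""
    def fetch(before, size):
        older = [r for r in records
                 if before is None or r["created_utc"] < before]
        return older[:size]
    return fetch


# 250 fake records with timestamps 1000, 999, ..., 751.
fake_records = [{"created_utc": t} for t in range(1000, 750, -1)]
result = collect_comments(make_fake_fetch(fake_records), target=230)
print(len(result), result[-1]["created_utc"])
```

Because the comparison is strict, records that share the boundary timestamp with the last record of a page could be skipped; a real client would need to handle that case.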

In [98]:
# Check the average length (in characters) of the comments that mention China.
def check_len(series):
    """Return the mean string length of a pandas Series of text."""
    return series.str.len().mean()

check_len(json_df['body'])

337.76
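
The mean alone can hide a skewed distribution, for example a few very long comments. A slightly fuller summary can be sketched with the standard library as below; `length_summary` is a hypothetical helper, and the three sample strings are made-up stand-ins for comment bodies rather than data from the API.

```python
import statistics


def length_summary(texts):
    """Summarise the character lengths of a collection of strings."""
    lengths = [len(text) for text in texts]
    return {
        "count": len(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }


# Made-up sample texts with lengths 2, 4 and 6.
sample_texts = ["ab", "abcd", "abcdef"]
summary = length_summary(sample_texts)
print(summary)
```

With the data frame from Q.4b, the same function could be called as `length_summary(json_df['body'])`.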

In [99]:
# Export the dataframe.
json_df.to_csv('Reddit_China.csv')
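
Comment bodies can contain line breaks (one of the rows in Q.4b starts with `\n\n`), so it is worth confirming that such text survives a CSV round trip. The sketch below does this with the standard `csv` module and an in-memory buffer; `rows_to_csv`, `csv_to_rows`, and the sample row are hypothetical and only illustrate the quoting behaviour.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialise a list of dicts to CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def csv_to_rows(text):
    """Parse CSV text back into a list of dicts (all values become strings)."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


# Made-up row whose body contains embedded line breaks.
rows = [{"author": "example_user", "body": "line one\n\nline two", "score": 1}]
csv_text = rows_to_csv(rows, fieldnames=["author", "body", "score"])
restored = csv_to_rows(csv_text)
print(restored[0]["body"] == rows[0]["body"])
```

Fields with embedded newlines are quoted on writing, so the reader can reassemble them; numeric columns come back as strings and would need converting.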

### 5. API client
For your API function, try to create a simple function that does the following things:

- allows the user to specify some smallish set of query parameters (from Q.3a)
- run a GET request with these parameters
- check the status of the request the server returns and inform the user of any errors (from Q.3b)
- parse the response and return a Python object to the user of the function. You can choose whether returning a list (from Q.4a) or a data frame (from Q.4b) is best.
- Add docstrings to the API client function that explain the parameters, the output, and ideally include a quick example.

In [137]:
from requests.exceptions import HTTPError

# Allows the user to specify some smallish set of query parameters (from Q.3a)
def reddit_stock_china(q, subreddit, size=25):
    """
    Get Reddit comments from a subreddit that match a search term,
    e.g. comments in r/stocks that mention China.

    Parameters
    ----------
    q : str
      Search term, or a quoted string for phrases (here "china").
      The search is not case-sensitive.

    subreddit : str
      Restricts the search to a specific subreddit (here "stocks").

    size : int
      Number of results to return. Default is 25; accepted values are integers <= 500.

    Returns
    -------
    data_df : pandas.core.frame.DataFrame or None
      Data frame with one row per comment (author, body, creation time, etc.),
      or None if the request failed.

    Example
    -------
    >>> reddit_stock_china('china', 'stocks', 500)
    """
    base_url = 'https://api.pushshift.io/reddit/search/comment/'
    params = {'q': q, 'subreddit': subreddit, 'size': size}

    # Run a GET request with these parameters
    # Check the status of the request the server returns and inform the user of any errors (from Q.3b)
    try:
        r = requests.get(base_url, params=params)

        # If the response was successful, no exception will be raised
        r.raise_for_status()
    except HTTPError as http_err:
        print(f'HTTP error occurred: {http_err}')
        return None
    except Exception as err:
        print(f'Other error occurred: {err}')
        return None
    else:
        print('Success!')

    # Parse the response and return a data frame to the user of the function.
    json_response = r.json()
    data_df = pd.DataFrame(json_response['data'])

    return data_df

In [139]:
reddit_stock_china('china','stocks',500)

Success!


Unnamed: 0,all_awardings,archived,associated_award,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,...,stickied,subreddit,subreddit_id,subreddit_name_prefixed,subreddit_type,top_awarded_type,total_awards_received,treatment_tags,unrepliable_reason,author_cakeday
0,[],False,,HumanFromTexas,,,[],,,,...,False,stocks,t5_2qjfk,r/stocks,public,,0,[],,
1,[],False,,NPRjunkieDC,,,[],,,,...,False,stocks,t5_2qjfk,r/stocks,public,,0,[],,
2,[],False,,courseman5,,,[],,,,...,False,stocks,t5_2qjfk,r/stocks,public,,0,[],,
3,[],False,,harmlessloafofbread,,,[],,,,...,False,stocks,t5_2qjfk,r/stocks,public,,0,[],,
4,[],False,,spankyiloveyou,,,[],,,,...,False,stocks,t5_2qjfk,r/stocks,public,,0,[],,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,[],False,,callmecrude,,,[],,,,...,False,stocks,t5_2qjfk,r/stocks,public,,0,[],,
96,[],False,,OweHen,,,[],,,,...,False,stocks,t5_2qjfk,r/stocks,public,,0,[],,
97,[],False,,jesperbj,,,[],,,,...,False,stocks,t5_2qjfk,r/stocks,public,,0,[],,
98,[],False,,Androgogy,,,[],,,,...,False,stocks,t5_2qjfk,r/stocks,public,,0,[],,
