# Workflow and Automation Functions

This notebook defines helper functions that support EDA, modeling, performance tuning, and interpretation of results for this project.

In [1]:
# importing relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests

## metrics
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score, precision_score, f1_score, confusion_matrix

### Find NA values

This function counts the NA values in each column and returns the counts in descending order, keeping **only** the columns that contain at least one NA. If no column contains NAs, it returns `0`.

**Input:** Dataframe <br />
**Output:** Series of NA counts for each column that has NAs, or `0` if there are none

In [2]:
def na_only(df):
    na_ser = df.isna().sum().sort_values(ascending=False)[lambda x: x > 0]
    if na_ser.empty:
        return 0
    else:
        return na_ser
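
As a quick illustration of the two return paths described above, here is a minimal sketch on a toy dataframe. The column names and values are hypothetical, and `na_only` is repeated in condensed form only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd


def na_only(df):
    # Condensed copy of the function above, repeated so this sketch is self-contained.
    na_ser = df.isna().sum().sort_values(ascending=False)[lambda x: x > 0]
    return 0 if na_ser.empty else na_ser


# Hypothetical toy data: "title" has no NAs, "score" has one, "selftext" has two.
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "score": [1.0, np.nan, 3.0],
    "selftext": [None, None, "text"],
})

# Only the columns with missing values are reported, largest count first.
na_counts = na_only(toy)
print(na_counts)

# A frame with no missing values yields the integer 0, not an empty Series.
no_na = na_only(toy[["title"]])
print(no_na)
```

Because the return type differs between the two cases (Series versus `int`), callers that chain further Series methods may want to check for `0` first.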

### API Call

Using the Pushshift API, this function requests Reddit submission data.

**Input**: subreddit parameter, size parameter, before date parameter <br />
**Output**: the `data` field of the JSON response (the returned submissions), or an error string if the status code is not 200.

In [3]:
def api_call(subreddit, size=25, before=''):
    url = f"https://api.pushshift.io/reddit/search/submission/?subreddit={subreddit}&size={size}&metadata=True&is_video=False&before={before}"
    req = requests.get(url)
    
    if req.status_code != 200:
        return "Error: API call failed."
    else:
        call = req.json()
        return call['data']
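
To make the mapping from function arguments to query parameters explicit, the sketch below builds the same URL with the standard library's `urlencode`, which also escapes any special characters in the values. The helper name `build_url` and the example arguments are hypothetical and not part of the notebook; no request is sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_url(subreddit, size=25, before=""):
    # Hypothetical helper: mirrors the query parameters used in api_call.
    params = {
        "subreddit": subreddit,
        "size": size,
        "metadata": True,
        "is_video": False,
        "before": before,
    }
    # urlencode converts each value with str() and percent-encodes it.
    return f"{BASE_URL}?{urlencode(params)}"


# Example values are illustrative only.
url = build_url("python", size=100, before=1600000000)
print(url)
```

The `requests` library can do the same encoding if the dictionary is passed through its `params` argument instead of being formatted into the URL string.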

## Data Wrangling

This function takes a JSON object and parses the relevant data into a dictionary.

**Input**: dictionary to store data, JSON object keys, JSON object storing API response data <br />
**Output**: dictionary holding the stored data under `data` and a list of messages for missing keys under `error_log`

In [4]:
def data_wrangling(dict_, keys, api_call):
    error_log = [] #used to capture indices with missing data
    for i in range(len(api_call)):
        for key in keys:
            try:
                dict_[key].append(api_call[i][key])
            except (KeyError, TypeError): #key is missing, or the record is not a dictionary
                error_log.append(f"Error on index: {i}\nkey \"{key}\" not found.")
                dict_[key].append(None) #if there is no data, set it to null
    return {'data': dict_, 'error_log': error_log}
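
A short usage sketch may help show what the returned dictionary looks like. The records below are hypothetical stand-ins for an API response, and the function is repeated in condensed form (catching only `KeyError`) so the snippet runs on its own. Note that the storage dictionary needs an empty list for every key before the call, since the function appends to `dict_[key]` directly.

```python
def data_wrangling(dict_, keys, api_call):
    # Condensed copy of the function above, repeated so this sketch is self-contained.
    error_log = []
    for i in range(len(api_call)):
        for key in keys:
            try:
                dict_[key].append(api_call[i][key])
            except KeyError:
                error_log.append(f"Error on index: {i}\nkey \"{key}\" not found.")
                dict_[key].append(None)
    return {'data': dict_, 'error_log': error_log}


# Hypothetical records; the second one has no "score" field.
records = [
    {"title": "First post", "score": 10},
    {"title": "Second post"},
]

keys = ["title", "score"]
store = {key: [] for key in keys}  # one empty list per key, created up front

result = data_wrangling(store, keys, records)
print(result["data"])
print(result["error_log"])
```

Every list in `result["data"]` ends up with the same length, so the dictionary can be passed straight to `pd.DataFrame`.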

## Model Instantiation and Performance

This function fits an estimator and prints various model performance metrics. The recall, specificity, precision, and F1 calls assume binary labels coded as 0 and 1.

**Input:** estimator, feature train data, feature test data, response train data, response test data <br/>
**Output:** Printout of all metrics.

In [5]:
def make_model(estimator, X_train, X_test, y_train, y_test):
    estimator.fit(X_train, y_train)
    preds = estimator.predict(X_test)
    
    print(f'''
        Training Accuracy Score: {estimator.score(X_train, y_train)}
        Test Accuracy Score: {estimator.score(X_test, y_test)}
        
        --- Performance on unseen data ----
        Recall (Sensitivity): {recall_score(y_test, preds)}
        Specificity: {recall_score(y_test, preds, pos_label=0)}
        Precision: {precision_score(y_test, preds)}
        
        Balanced Accuracy: {balanced_accuracy_score(y_test, preds)}
        F1 Score: {f1_score(y_test, preds)}
        ''')
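
The printed metrics can also be written out from confusion-matrix counts, which shows why specificity is obtained by calling `recall_score` with `pos_label=0`. The following is a minimal pure-Python sketch for the binary case; the labels are hypothetical and the snippet does not call scikit-learn, it only spells out the standard definitions.

```python
# Hypothetical binary labels (1 = positive class, 0 = negative class).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

recall = tp / (tp + fn)        # sensitivity: recall of the positive class
specificity = tn / (tn + fp)   # recall of the negative class (pos_label=0)
precision = tp / (tp + fp)
balanced_accuracy = (recall + specificity) / 2
f1 = 2 * precision * recall / (precision + recall)

print(f"Recall: {recall:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Balanced Accuracy: {balanced_accuracy:.3f}")
print(f"F1 Score: {f1:.3f}")
```

Balanced accuracy is the mean of the two per-class recalls, so it is less flattering than plain accuracy when one class dominates the test set.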