This notebook imports functions from the public_acn_psm module, which contains all of the functions the notebook needs in order to run.

The code uses the following packages:
- pandas
- datetime
- xgboost (imported by public_acn_psm)



In [None]:
# install xgboost, which is imported in public_acn_psm
%pip install xgboost

In [None]:
# import packages/functions
import public_acn_psm
import pandas as pd
from datetime import datetime
pd.set_option('display.max_columns',None)

The following code reads the csv file, model_features.csv, that is produced by the code in the file public_psm_feature_engineering. Note that the cell below reads the file under the name model_features_sunny_test.csv.

Unnecessary columns are dropped. In addition, the columns "Officer ID" and "Officer Squad" are converted to type 'str' to prevent errors. Rows where the "Precinct" value is equal to 'OOJ' are dropped because they caused problems with the model.

An important note is that the dataframe created from model_features.csv is cut down to its first 6000 rows. This lets the model run in a much shorter time, but it also means the results cover only that subset of the data.

In [None]:
# create dataframe from model_features.csv
df_orig = pd.read_csv("model_features_sunny_test.csv")
# drop the unnecessary 'Unnamed: 0' column
df_output_all = df_orig.drop(columns = 'Unnamed: 0')
# convert identifier columns to str to prevent errors
df_output_all['Officer ID'] = df_output_all['Officer ID'].astype(str)
df_output_all['Officer Squad'] = df_output_all['Officer Squad'].astype(str)
# remove rows with Precinct 'OOJ', which caused problems with the model
df_output_all = df_output_all[df_output_all['Precinct'] != 'OOJ']
# decrease dataset size to decrease runtime
df_output_all = df_output_all.iloc[:6000]
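
The cleanup steps above can be tried on a small made-up frame before running them on the real file. This is only an illustrative sketch: the column names come from this notebook, but every value in `toy` is invented, and the row cap is set to 2 instead of 6000 so the effect is visible.

```python
import pandas as pd

# Toy stand-in for the features file; all values are invented for illustration.
toy = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2, 3],
    'Officer ID': [7501, 7502, 7503, 7504],
    'Officer Squad': [11, 12, 11, 13],
    'Precinct': ['North', 'OOJ', 'South', 'North'],
})

cleaned = toy.drop(columns='Unnamed: 0')
cleaned = cleaned.astype({'Officer ID': str, 'Officer Squad': str})
cleaned = cleaned[cleaned['Precinct'] != 'OOJ']
cleaned = cleaned.iloc[:2]  # same idea as the 6000-row cap, with a smaller limit

print(cleaned.dtypes)
print(cleaned)
```

Because the cap is applied after the 'OOJ' filter, the rows that remain are the first rows that survive the filter, not the first rows of the raw file.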

The code below creates a list of dataframes, one for each unique combination of values in the following columns:
- "Precinct"
- "observation_year_d"
- "observation_month_d"

As a result, every row within a given dataframe has the same values for those three columns.

In [None]:
unique_dfs = [group for _, group in df_output_all.groupby(['Precinct', 'observation_year_d', 'observation_month_d'])]
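
To make the grouping concrete, here is a small sketch on invented data. The three key column names are the ones used in this notebook; `stop_id` and all of the values are hypothetical and exist only for the illustration.

```python
import pandas as pd

# Invented rows; only the three key column names come from the notebook.
toy = pd.DataFrame({
    'Precinct': ['North', 'North', 'South', 'North'],
    'observation_year_d': [2020, 2020, 2020, 2021],
    'observation_month_d': [1, 1, 1, 1],
    'stop_id': [101, 102, 103, 104],  # hypothetical extra column
})

keys = ['Precinct', 'observation_year_d', 'observation_month_d']
toy_groups = [group for _, group in toy.groupby(keys)]

for group in toy_groups:
    # Every key column holds exactly one distinct value inside a group.
    assert (group[keys].nunique() == 1).all()
    print(group[keys].iloc[0].tolist(), len(group))
```

One thing to keep in mind: by default `groupby` leaves out rows whose key columns contain missing values, so such rows would not appear in any of the resulting dataframes.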

This block of code has several parts:
- Iterates through each dataframe in unique_dfs.
- Reads the values shared by every row of the group from its first row, assigns them to the following variables, and adds a lowercase 'precinct' column to the dataframe 'df':
    - precinct
    - year
    - month
- Runs the psm model using the function **evaluate_all_stops** and dataframe 'df'. The model returns the following:
    - eval_flag
    - summaries
    - propensities
- If the sample size is large enough (eval_flag is true), the columns of dataframe 'propensities' are filtered, renamed, and added to the combined propensities dataframe. In addition, the dataframe 'summaries' is added to the combined summaries dataframe. Otherwise, the group is skipped and a message is printed.

In [None]:
combinedSummaries = pd.DataFrame()
combinedPropensities = pd.DataFrame()
for df_all in unique_dfs:
    # work on a copy so that adding a column does not modify the grouped data
    df = df_all.copy()
    # every row in a group shares these values, so read them from the first row
    precinct = df['Precinct'].iloc[0]
    year = df['observation_year_d'].iloc[0]
    month = df['observation_month_d'].iloc[0]
    df['precinct'] = precinct
    print(f'read features data frame for precinct: {precinct}, month: {year}-{month}', df.shape)
    # run psm model
    eval_flag, summaries, propensities = public_acn_psm.evaluate_all_stops(df)
    if eval_flag:
        # filter columns 
        propensities = propensities[
            [
                'watch_d', 'Precinct_watch_d', 'observation_datetime_d', 'observation_day_d', 'observation_week_d', 'observation_week_of_month_d',
                'Officer Race', 'Subject Perceived Race', 'label', 
                'prediction', 'probability_0', 'probability_1', 'weight', 'mean_control', 
                'mean_treat', 'mean_control_sum', 'mean_treat_sum', 'absolute_difference', 
                'run_timestamp', 'Frisk Flag', 'Subject Age Group'
            ]
        ] 

        # rename prediction set of columns
        propensities = propensities.rename(columns = {
            'label': 'subject_race_label_d',
            'prediction': 'subject_race_label_pred',
            'probability_0': 'probability_0_pred',
            'probability_1': 'probability_1_pred',
            'weight': 'weight_pred',
            'mean_control': 'mean_control_pred',
            'mean_treat': 'mean_treat_pred',
            'mean_control_sum': 'mean_control_sum_pred',
            'mean_treat_sum': 'mean_treat_sum_pred',
            'absolute_difference': 'disparity_pred'
        })
        print('evaluation complete')
        combinedSummaries = pd.concat([combinedSummaries, summaries])
        combinedPropensities = pd.concat([combinedPropensities, propensities])
    else:
        print('disparity not evaluated due to small sample size', df.shape[0])
        # status code for a skipped group (not used elsewhere in this notebook)
        response_code = 2
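
The control flow of the loop can be sketched without the real model. In the example below, `fake_evaluate` is a hypothetical stand-in written only for this illustration: it is not `public_acn_psm.evaluate_all_stops`, its `min_rows` threshold is invented, and the frames it returns are placeholders. The sketch also collects the per-group results in lists and concatenates once at the end, which is a common alternative to calling `pd.concat` inside the loop.

```python
import pandas as pd

def fake_evaluate(df, min_rows=2):
    """Hypothetical stand-in for evaluate_all_stops; not the real model."""
    if len(df) < min_rows:
        return False, None, None
    summaries = pd.DataFrame({'n_rows': [len(df)]})
    propensities = pd.DataFrame({'label': [0] * len(df),
                                 'prediction': [0] * len(df)})
    return True, summaries, propensities

# Three invented groups; the middle one is too small to evaluate.
groups = [pd.DataFrame({'x': [1, 2, 3]}),
          pd.DataFrame({'x': [4]}),
          pd.DataFrame({'x': [5, 6]})]

summary_parts, propensity_parts = [], []
for df in groups:
    eval_flag, summaries, propensities = fake_evaluate(df)
    if not eval_flag:
        continue  # group too small, nothing to collect
    propensities = propensities.rename(columns={
        'label': 'subject_race_label_d',
        'prediction': 'subject_race_label_pred',
    })
    summary_parts.append(summaries)
    propensity_parts.append(propensities)

# Concatenate once; fall back to an empty frame if no group was evaluated.
all_summaries = (pd.concat(summary_parts, ignore_index=True)
                 if summary_parts else pd.DataFrame())
all_propensities = (pd.concat(propensity_parts, ignore_index=True)
                    if propensity_parts else pd.DataFrame())
print(all_summaries)
print(all_propensities.shape)
```

Whether this pattern is worth adopting here depends on how many groups there are; with only a handful of groups the difference from concatenating inside the loop is likely negligible.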


The indexes of both resulting dataframes are reset. Then, each dataframe is exported to its own csv file.

In [None]:
combinedSummaries = combinedSummaries.reset_index(drop = True)
combinedPropensities = combinedPropensities.reset_index(drop = True)

In [None]:
combinedSummaries.to_csv('summaries_results.csv', index = False)
combinedPropensities.to_csv('propensities_results.csv', index = False)
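
The effect of the two steps above can be seen on a small invented example. The frames below are made up for illustration (only the column name disparity_pred is borrowed from this notebook), and the csv is written to an in-memory buffer rather than to a file.

```python
import io
import pandas as pd

# Two invented per-group results, each with its own 0-based index.
part_a = pd.DataFrame({'disparity_pred': [0.1, 0.2]})
part_b = pd.DataFrame({'disparity_pred': [0.3]})

combined = pd.concat([part_a, part_b])
print(combined.index.tolist())  # index labels repeat across the parts

combined = combined.reset_index(drop=True)
print(combined.index.tolist())  # one continuous index

# Write to an in-memory buffer instead of a file for this illustration.
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Writing with `index = False` also means the saved file has no index column, so reading it back with `pd.read_csv` does not produce an extra 'Unnamed: 0' column like the one dropped earlier in this notebook.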

The two resulting dataframes can be seen below.

In [None]:
combinedSummaries

In [None]:
combinedPropensities