# <font color="#703bdb">Part 2. Data Integration and Processing : Automation Pipeline</font> <hr>

<a href="http://policingequity.org/">Center for Policing Equity</a> (CPE) is a research and action think tank that works collaboratively with law enforcement, communities, and political stakeholders to identify ways to strengthen relationships with the communities they serve. CPE is also the home of the nation’s first and largest <a href="http://policingequity.org/national-justice-database/">database</a> tracking national statistics on police behavior. 

The main aim of CPE is to bridge the divide created by communication problems, suffering and generational mistrust, and forge a path towards public safety, community trust, and racial equity. This kernel series is my contribution to the <a href="https://www.kaggle.com/center-for-policing-equity/data-science-for-good">Data Science for Good: Center for Policing Equity</a>. The contribution focuses on providing a generic, robust, and automated approach to integrate and standardize the data, diagnose disparities in policing, shed light on police behavior, and provide actionable recommendations. 

The kernels in this submission are, in order:  

<ul>
    <li><a href="https://www.kaggle.com/shivamb/1-solution-workflow-science-of-policing-equity/">Part 1: Solution Workflow - The Science of Policing Equity </a>  </li>
    <li><a href="https://www.kaggle.com/shivamb/2-automation-pipeline-integration-processing">Part 2: Data Integration and Processing : Automation Pipeline</a>  </li>
    <li><a href="https://www.kaggle.com/shivamb/3-example-runs-of-automation-pipeline">Part 3: Example Runs of Automation Pipeline </a>  </li> 
    <li><a href="https://www.kaggle.com/shivamb/4-1-analysis-report-minneapolis-24-00013">Part 4.1: Analysis Report - Minneapolis Police Department (24-00013) </a>   </li>
    <li><a href="https://www.kaggle.com/shivamb/4-2-analysis-report-lapd-49-00033">Part 4.2: Analysis Report - Los Angeles Police Department (49-00033) </a>   </li>
    <li><a href="https://www.kaggle.com/shivamb/4-3-analysis-report-officer-level-analysis">Part 4.3: Analysis Report - Indianapolis Officer Level Analysis (23-00089) </a>   </li></ul>

The complete overview of the solution is shared in the *first kernel*. It explains the process and flow of automation, standardization, processing, and analysis of data. In the *second kernel*, the first component of the solution pipeline, data integration and processing, is implemented. It processes both core level data and department level data. In the *third kernel*, this pipeline is run for several departments. Once the standardized and clean data is produced, it is analysed with different formats of the Analysis Framework in kernels 4.1, 4.2 and 4.3. In *kernel 4.1*, the core analysis is linked with crime rate and poverty data. In *kernel 4.2*, the core analysis is combined with statistical analysis. In *kernel 4.3*, officer level analysis is done. 

<hr>

This kernel is the second of the series. It implements the first two components of the entire pipeline. 

## <font color="#703bdb">Kernel Contents </font> 

### <a href="#a">Component A - Core Data Integration and Processing  </a>

<ul>
    <li><a href="#a1">Step1 : Setting up the global config parameters - general  </a>  </li>
    <li><a href="#a2">Step2: Structured repository creation    </a>  </li>
    <li><a href="#a3">Step3: Standardization of police shape files  </a>  </li>
    <li><a href="#a4">Step4: Standardization of ACS / Census data   </a>  </li>
    <li><a href="#a5">Step5: Trigger function for this component   </a>  </li>
</ul>

### <a href="#b">Component B - Department Level Processing Pipeline  </a>

<ul>
    <li><a href="#b1">Step1 : Set Global Config Parameters - ShapeFiles  </a>  </li>
    <li><a href="#b2">Step2 : Find Overlapping Census Tracts with Department Districts    </a>  </li>
    <li><a href="#b3">Step3 : Setting up global config parameters - ACS  </a>  </li>
    <li><a href="#b4">Step 4 : Enrich ACS Information in Overlapped Districts   </a>  </li>
    <li><a href="#b5">Step 5: Standardize Police Incidents Data </a>  </li>
    <li><a href="#b6">Step 6: Extending Police Data using External Datasets </a>  </li>
    <li><a href="#b7">Step 7: Save Final Cleaned Datasets </a>  </li>
    <li><a href="#b8">Step 8: Component B Trigger Function</a>  </li>
</ul>

First, load the libraries used in the overall implementation.

In [None]:
import shutil, os, folium, warnings
from shapely.geometry import Point
import pandas as pd, numpy as np 
from collections import Counter
from statistics import median
import geopandas as gpd
warnings.filterwarnings('ignore')

<a id="a"></a>
## <font color="#703bdb">Component A : Core Data Integration and Processing </font> <hr>

This is the first component of the overall pipeline. Its major focus is to integrate and process the two main datasets used in the complete solution: the ACS data for different regions and the shape files of the different police departments. The following key tasks are executed in this component. 

1. Integration of data from multiple data sources     
2. Creation of Structured Repository     
3. Processing of Police Department Shape Files    
&nbsp;&nbsp;&nbsp;&nbsp; 3.1 File naming conventions   
&nbsp;&nbsp;&nbsp;&nbsp; 3.2 Error handling   
&nbsp;&nbsp;&nbsp;&nbsp; 3.3 Consistent parameters     
4. Processing of ACS Data    
&nbsp;&nbsp;&nbsp;&nbsp; 4.1 File naming conventions  
&nbsp;&nbsp;&nbsp;&nbsp; 4.2 Cleanup of metric and meta files   

Here is the overview of this component:  

<br>

![](https://i.imgur.com/2d4b7BG.png)


<br>

<a id="a1"></a>
## <font color="#703bdb">Step 1. Setting up global config parameters</font><hr>

In the first step of the pipeline, we define the global configuration used throughout the pipeline. The user can change it whenever different types or sources of data are used, so the config acts as a controller for how the pipeline is executed. The following parameters need to be defined: 

> **_base_dir** : The base directory path containing all the raw data  
> **_root_dir** : The new directory path which will contain all the cleaned and structured data  
> **ct_base_path** : The base path containing the census-tracts shape files   
> **external_datasets_path** : The base path of any external data to be used 

In [None]:
ct_base_path = "../input/census-tracts/cb_2017_<NUM>_tract_500k/cb_2017_<NUM>_tract_500k.shp"
external_datasets_path = "../input/external-datasets-cpe/"
_base_dir = "../input/data-science-for-good/cpe-data/"
_root_dir = "CPE_ROOT/"

The raw data may contain shape files that are not needed, so we define the mandatory shape file extensions; the rest are ignored. We also define the names of the new directories to be created.

In [None]:
## define the new directory names and mandatory shape files 
mandatory_shapefiles = ["shp", "shx", "dbf", "prj"]
new_dirs = ["shapefiles", "events", "metrics", "metrics_meta"]

<a id="a2"></a>
## <font color="#703bdb">Step 2. Structured Repository Creation </font><hr>

The next step of the pipeline creates a new, well-defined repository structure with consistent naming conventions. After this pipeline component is executed, a structured set of directories exists for every department. <br><br>

![](https://i.imgur.com/XSTXkaF.png)

<br>

Their structure is as follows: 

> **shapefiles** : Contains the four mandatory shapefile types with filenames standardized (example : "department.shp")  
> **events** : Contains police level data : use-of-force / arrests / vehicle stops etc.  
> **metrics** : Contains ACS / census level metrics data with filenames standardized (example : "education.csv")  
> **metrics_meta** : Contains the corresponding metadata for every metric  

We define two main functions for this step. 

> **_cleanup_environment() :** This function is used to cleanup the environment, removes any unnecessary repository in the path         
> **_create_repository_structure():** This function is used to create the new repositories that will save the cleaned up data.  

In [None]:
## Utility function to cleanup the environment
def _cleanup_environment():
    ## remove any previously generated output directory
    if os.path.exists(_root_dir):
        shutil.rmtree(_root_dir)
    return None

## Function to create a new repository structure 
def _create_repository_structure():            
    ## refresh environment 
    _cleanup_environment()
    
    ## list of all departments whose raw data is available
    depts = [_ for _ in os.listdir(_base_dir) if "Dept" in _]
    
    ## master folder
    os.mkdir(_root_dir) 
    for dept in depts:

        ## every department folder 
        os.mkdir(_root_dir + "/" + dept)         
        for _dir in new_dirs:
        
            ## sub directories for - shapefiles, events, metrics, metrics_meta
            os.mkdir(_root_dir + "/" + dept + "/" + _dir + "/")            
    print ("Status : Directory Structure Created")
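As a minimal sketch of the layout this step produces, the snippet below builds the same folder tree inside a temporary directory using `os.makedirs`. The helper name `build_structure` and the two sample department names are assumptions made for illustration; they are not part of the pipeline itself.

```python
import os
import tempfile

# sub-directories created for every department (same names as new_dirs above)
SUB_DIRS = ["shapefiles", "events", "metrics", "metrics_meta"]

def build_structure(root, depts, sub_dirs=SUB_DIRS):
    """Create root/<dept>/<sub_dir>/ for every department (hypothetical helper)."""
    for dept in depts:
        for sub_dir in sub_dirs:
            # exist_ok=True makes the call safe to repeat
            os.makedirs(os.path.join(root, dept, sub_dir), exist_ok=True)
    return sorted(os.listdir(root))

# illustrative run in a throwaway directory
demo_root = tempfile.mkdtemp()
created = build_structure(demo_root, ["Dept_11-00091", "Dept_23-00089"])
print(created)
print(sorted(os.listdir(os.path.join(demo_root, "Dept_11-00091"))))
```

Using `os.makedirs` with `exist_ok=True` creates the parent and child folders in one call, so a sketch like this does not need a separate cleanup step before re-running.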

<a id="a3"></a>
## <font color="#703bdb">Step 3. Standardization of Police Shape Files </font><hr>

In this step, standardization of police shape files is performed. Following tasks are executed in this step: 

**1. File Naming Conventions:** The shape files supplied for each department use different names, so it is important to standardize them and maintain consistent naming. In this step, the shapefiles of every department are standardized: only the mandatory files are picked, and they are renamed as follows: 

![](https://multimedia.journalism.berkeley.edu/media/upload/tutorials/qgis-basics/shp.jpg)

> - department.shp
> - department.shx 
> - department.prj  
> - department.dbf  

<br>
**2. Missing Files Error Handling:** Additionally, some departments may have missing or erroneous files, so this step also performs error handling. It uses a config in which the user can define what error handling needs to be performed. A global variable (config), **missing_shape_meta**, holds key-value pairs for every department that needs to be fixed. For example, in the given data, the "prj" file is missing for Department 37-00027, so its content is supplied manually in this config. 

> **missing_shape_meta** = { "Dept_37-00027" : {"prj" : dept_37_27_prj} }

We then define a function that performs the corresponding error handling on the shape files. To fix the issues, it performs two steps:
- Step 1 : Add the missing prj content   
- Step 2 : Fix the CRS of the shape file  

<br>
**3. Consistent Coordinate System (CRS):**

Another important part of this challenge is to automatically identify the coordinate systems of the shape files and make them consistent for further analysis. The PRJ file contains data about the projected coordinate system: its name, the geographic coordinate system, the projection, and all the parameters needed for the projection. Tools such as ArcGIS and MapInfo can definitely help with this, but doing it programmatically is a bit challenging. In my current implementation, I convert the given shapefile into one standard CRS: "epsg=4326". Later, this part can be changed and made dynamic. 

<hr>

### Implementation 

For the implementation, we define the function that performs the standardization and cleanup of shapefiles. Additionally, the cleaned files are copied to the new location. The following main steps are performed in this function: 

Step 1 : Configure the old and new paths   
Step 2 : Standardize the file names and move to new path  
Step 3 : Fix erroneous shape files  

A current limitation of this function is that it does not handle directories in which more than one shapefile is present. To handle this, the most relevant police shape files should be kept at the base level of the raw data. 

In [None]:
## Function to standardize the shape files
def _standardize_shapefiles():
    depts = [_ for _ in os.listdir(_base_dir) if "Dept" in _]
    for dept in depts:    
        ## Step1: Configure the old and new path
        shp_dir = dept.replace("Dept_","") + "_Shapefiles/"
        old_pth = _base_dir + dept + "/" + shp_dir
        new_pth = _root_dir + dept + "/" + "shapefiles/"

        ## Step2: Standardize the file names and move to new path 
        _files = os.listdir(old_pth)
        for _file in _files:
            if _file[-3:].lower() not in mandatory_shapefiles:
                continue
            ext = ".".join(_file.split(".")[1:]).lower()
            new_name = "department." + ext
            shutil.copy(old_pth+_file, new_pth+new_name)

        ## Step3: Fix erroneous shapefiles
        fix_flag = _fix_errors_shapefiles(new_pth, dept)
        
    print ("Status : Shapefile Standardization Complete")
    return None
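The renaming rule in Step 2 can be checked in isolation. The sketch below is a simplified variant, not the function above: it uses `os.path.splitext`, which keeps only the final extension, and the helper name `standardized_name` and the sample filenames are hypothetical.

```python
import os

# extensions that make up a usable shapefile (same list as mandatory_shapefiles)
MANDATORY = ["shp", "shx", "dbf", "prj"]

def standardized_name(filename, mandatory=MANDATORY):
    """Return 'department.<ext>' for mandatory shapefile parts, else None."""
    # splitext keeps only the last extension, e.g. ".xml" for "a.shp.xml"
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if ext not in mandatory:
        return None
    return "department." + ext

# hypothetical raw filenames, including two that should be skipped
raw_files = ["APD_DIST.shp", "APD_DIST.SHX", "APD_DIST.shp.xml", "APD_DIST.cpg"]
renamed = {name: standardized_name(name) for name in raw_files}
print(renamed)
```

Sidecar files such as `.shp.xml` or `.cpg` map to `None` here, which mirrors the `continue` branch in the loop above.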

Next, we define the function for error handling.

In [None]:
dept_37_27_prj = 'PROJCS["NAD_1983_StatePlane_Texas_Central_FIPS_4203_Feet",GEOGCS["GCS_North_American_1983",DATUM["North_American_Datum_1983",SPHEROID["GRS_1980",6378137,298.257222101]],PRIMEM["Greenwich",0],UNIT["Degree",0.017453292519943295]],PROJECTION["Lambert_Conformal_Conic_2SP"],PARAMETER["False_Easting",2296583.333333333],PARAMETER["False_Northing",9842499.999999998],PARAMETER["Central_Meridian",-100.3333333333333],PARAMETER["Standard_Parallel_1",30.11666666666667],PARAMETER["Standard_Parallel_2",31.88333333333333],PARAMETER["Latitude_Of_Origin",29.66666666666667],UNIT["Foot_US",0.30480060960121924],AUTHORITY["EPSG","102739"]]'

In [None]:
## create a config to handle the errors in raw shape files
missing_shape_meta = { "Dept_37-00027" : {"prj" : dept_37_27_prj} }

## Function to fix / cleanup the errors in shapefile types
def _fix_errors_shapefiles(_path, dept):
    """
    :params:
    _path : root path containing the shape files 
    dept : selected dept if it is called only for a particular department
    """
    
    if dept not in missing_shape_meta:
        return False
    
    ## Fix the errors in raw corresponding shape files
    for extension, content in missing_shape_meta[dept].items():
        if extension == "prj": 
            # Step1: Add missing prj file
            with open(_path + "department.prj", 'w') as outfile:
                outfile.write(content)
            
            # Step2: Fix CRS of shape file
            df = gpd.read_file(_path + 'department.shp')
            df.to_file(filename = _path + 'department.shp', 
                       driver='ESRI Shapefile', crs_wkt = content)

        elif extension == "shx":
            ## This function can be extended for other shape filetypes
            ## the corresponding logic can be added in these blocks 
            pass
    return True

<a id="a4"></a>
## <font color="#703bdb">Step 4. Standardization of ACS / Census Data </font><hr>

In the next step, the ACS / Census data is cleaned up and copied to the new path. The names of the ACS files are standardized, and their corresponding meta files are stored with the same naming convention. Data for several metrics (education, housing, income etc.) is provided, and this part can be extended with more data. The following steps are performed: 

1 : Configure the old and new paths  
2 : Move all the ACS datafiles  
3 : Standardize / Cleanup the names  
4.1 : Move the data files - Metric Data  
4.2 : Move the meta files - Metric Meta  

In [None]:
## cleaned names corresponding to given raw metric names
acs_metrics_dic = { 'owner-occupied-housing' : 'housing', 'education-attainment' : 'education', 'employment' : 'employment', 'education-attainment-over-25' : 'education25', 'race-sex-age' : 'race-sex-age', 'poverty' : 'poverty', 'income' : 'income' }
metrics_names = list(acs_metrics_dic.values())

## function to cleanup and move the ACS data
def _standardize_acs():
    depts = [_ for _ in os.listdir(_base_dir) if "Dept" in _]
    for dept in depts:  
        ## Step1: Configure the old and new path
        acs_dir = dept.replace("Dept_","") + "_ACS_data"
        old_dirs = os.listdir(_base_dir + dept +"/"+ acs_dir)
        new_dirs = [f.replace(dept.replace("Dept_",""),"") for f in old_dirs]
        new_dirs = [f.replace("_ACS_","") for f in new_dirs]
        
        ## Step2: Move all ACS datafiles
        for j, metric in enumerate(old_dirs):
            metric_files = os.listdir(_base_dir + dept +"/"+ acs_dir +"/"+ metric)
            _file = [f for f in metric_files if "metadata" not in f][0]
            _meta = [f for f in metric_files if "metadata" in f][0]

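            ## Step3: Standardize / Cleanup the name 
            ## (later keys win, so "education-attainment-over-25" 
            ## overrides the shorter "education-attainment" key)
            cname = None
            for name, clean_name in acs_metrics_dic.items():
                if name in metric:
                    cname = clean_name     
            if cname is None:
                continue ## skip folders that match no known metric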

            ## Step4.1 : Move Metric File
            old_path = _base_dir + dept +"/"+ acs_dir +"/"+ metric +"/"+ _file
            new_path = _root_dir + dept +"/metrics/" + cname + ".csv"
            shutil.copy(old_path, new_path)

            ## Step4.2 : Move Metrics meta files
            old_path = _base_dir + dept +"/"+ acs_dir +"/"+ metric +"/"+ _meta
            new_path = _root_dir + dept +"/metrics_meta/" + cname + ".csv"
            shutil.copy(old_path, new_path)

    print ("Status : Standardization of Metrics complete")
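The name cleanup in Step 3 comes down to matching a raw folder name against the keys of `acs_metrics_dic`. The sketch below shows one way to make that matching independent of dictionary order by preferring the longest matching key. The helper `clean_metric_name` and the sample folder names are assumptions for illustration.

```python
# raw ACS folder keywords mapped to clean metric names (copied from the config)
ACS_METRICS = {
    "owner-occupied-housing": "housing",
    "education-attainment": "education",
    "employment": "employment",
    "education-attainment-over-25": "education25",
    "race-sex-age": "race-sex-age",
    "poverty": "poverty",
    "income": "income",
}

def clean_metric_name(folder_name, mapping=ACS_METRICS):
    """Return the clean name for the longest raw key found in folder_name."""
    matches = [key for key in mapping if key in folder_name]
    if not matches:
        return None  # unknown folder: let the caller decide what to do
    return mapping[max(matches, key=len)]

# hypothetical folder names following a "<dept>_ACS_<metric>" pattern
folders = [
    "11-00091_ACS_education-attainment",
    "11-00091_ACS_education-attainment-over-25",
    "11-00091_ACS_poverty",
]
cleaned = [clean_metric_name(folder) for folder in folders]
print(cleaned)
```

Choosing the longest match avoids the ambiguity between "education-attainment" and "education-attainment-over-25" without a special case for the string "25".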

<a id="a5"></a>
## <font color="#703bdb">Step 5. Trigger Function : Component A  </font><hr>

Next, we combine all the functions of Component A and trigger them. This creates the well-defined data source that makes the analysis quick and accessible. 

In [None]:
def _run_standardization_pipeline():
    _create_repository_structure()
    _standardize_shapefiles()
    _standardize_acs()

_run_standardization_pipeline()

After the successful run of this part of the pipeline, a new, well-defined structured repository is created. This layout makes it consistent to work with different departments and makes the analysis and modelling easier and more accessible.  

<br>

<a id="b"></a>
# <font color="#703bdb">Component B - Department Level Processing  </font><hr>

In the previous component, the core level processing, integration, and standardization was performed in which all the universal data such as ACS information, Police Shape File Information was processed. Now, in the next component, a particular department is selected, and its department level information is processed. The key tasks are:

 - Find the overlap between the department shape files and the census tracts  
 - Enrich the overlapped districts with the ACS level information  
 - Standardize the police incidents 
 - Further enrich the overlapped districts with the police incidents  
 
The overview of Part B is shown below: 

![](https://i.imgur.com/XBBolXR.png)

<br> 

<a id="b1"></a>
## <font color="#703bdb">Step 1 : Set Global Config Parameters - ShapeFiles  </font><hr>

As the first step, we define a global config in which we store two essential pieces of information: 

- **_rowid:** Represents the unique identifier present in the corresponding shape file of a department  
- **ct_num:** Represents the Corresponding Census Tract State Number which can be used to map the census tracts shape files with the unique department shape files. 

I have created the following depts_config, in which I have added the details of the given departments. When new departments are added to this data, the same config can be updated. 



In [None]:
## Provide the config file for the departments
depts_config = {
    'Dept_23-00089' : {'_rowid' : "DISTRICT", "ct_num" : "18"},  
    'Dept_49-00035' : {'_rowid' : "pol_dist", "ct_num" : "06"},  
    'Dept_24-00013' : {'_rowid' : "OBJECTID", "ct_num" : "27"},  
    'Dept_24-00098' : {'_rowid' : "gridnum",  "ct_num" : "27"},   
    'Dept_49-00033' : {'_rowid' : "number",   "ct_num" : "06"},    
    'Dept_11-00091' : {'_rowid' : "ID",       "ct_num" : "25"},         
    'Dept_49-00081' : {'_rowid' : "company",  "ct_num" : "06"},   
    'Dept_37-00049' : {'_rowid' : "Name",     "ct_num" : "48"},      
    'Dept_37-00027' : {'_rowid' : "CODE",     "ct_num" : "48"},     
    'Dept_49-00009' : {'_rowid' : "objectid", "ct_num" : "53"}, 
}
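A small sketch of how the `ct_num` value can be consumed: it fills the `<NUM>` placeholder in `ct_base_path` to locate the census tract shapefile for the department's state (the values in the config are state FIPS codes, e.g. 27 for Minnesota and 06 for California). The helper name `census_tract_path` and the two-entry `SAMPLE_CONFIG` are assumptions for illustration.

```python
# path template with a <NUM> placeholder for the state number (from Step 1)
CT_BASE_PATH = "../input/census-tracts/cb_2017_<NUM>_tract_500k/cb_2017_<NUM>_tract_500k.shp"

# two entries copied from depts_config for illustration
SAMPLE_CONFIG = {
    "Dept_24-00013": {"_rowid": "OBJECTID", "ct_num": "27"},
    "Dept_49-00033": {"_rowid": "number", "ct_num": "06"},
}

def census_tract_path(dept, config=SAMPLE_CONFIG, base_path=CT_BASE_PATH):
    """Fill the <NUM> placeholder with the department's state number."""
    if dept not in config:
        # fail early with a readable message instead of a bare KeyError
        raise KeyError("No config entry for " + dept)
    return base_path.replace("<NUM>", config[dept]["ct_num"])

path_mn = census_tract_path("Dept_24-00013")
path_ca = census_tract_path("Dept_49-00033")
print(path_mn)
print(path_ca)
```

Keeping the state number as a zero-padded string ("06" rather than 6) matters here, because it is substituted directly into the file name.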

<a id="b2"></a>
## <font color="#703bdb">Step 2 : Find Overlapping Census Tracts with Department Districts  </font><hr>

This is one of the most essential steps of this component. In this step, the overlapping census tracts are found along with their percentage of overlap. Before writing the core function, we first define a few utility functions that help us process the department level shape files:

> **_read_shape_gdf():**  Function to read shape files for a department and return the corresponding shape file geodataframe  
> **_read_ctfile():**  Function to read the corresponding census tract file for a department  
> **_plot_shapefile_base():**  Function to plot the base / overlapped shape file on a map   


In [None]:
## Function to read a shapefile
def _read_shape_gdf(_dept):
    shape_pth = _root_dir + _dept + "/shapefiles/department.shp"
    ## ensure that CRS are consistent
    shape_gdf = gpd.read_file(shape_pth).to_crs(epsg=4326)
    return shape_gdf

## Read the CT File
def _read_ctfile(_dept):
    ## find the corresponding CT number from the config
    _ct = depts_config[_dept]["ct_num"]
    ## generate the base CT path 
    ct_path = ct_base_path.replace("<NUM>", _ct)
    ## load the geo data frame for CT 
    state_cts = gpd.read_file(ct_path).to_crs(epsg=4326)
    return state_cts

## Function to convert a shapely Point into a [latitude, longitude] pair
def _get_latlong_point(point):
    ## shapely stores x = longitude and y = latitude; folium expects [lat, long]
    return [point.y, point.x]

## Function to plot a shapefile
def _plot_shapefile_base(shape_gdf, _dept, overlapped_cts = {}):
    ## obtain the center most point of the map 
    
    if "center_ll" not in depts_config[_dept]:
        center_pt = shape_gdf.geometry.centroid[0]
        center_pt = _get_latlong_point(center_pt)
    else:
        center_pt = depts_config[_dept]["center_ll"]
    
    ## initialize the folium map 
    mapa = folium.Map(center_pt,  zoom_start=10, tiles='CartoDB dark_matter')
    if len(overlapped_cts) == 0:
        ## only the base map
        folium.GeoJson(shape_gdf).add_to(mapa)
    else:
        ## overlapped map
        ct_style = {'fillColor':"red",'color':"red",'weight':1,'fillOpacity':0.5}
        base_style = {'fillColor':"blue",'color':"blue",'weight':1,'fillOpacity':0.5}
        folium.GeoJson(overlapped_cts, style_function = lambda feature: ct_style).add_to(mapa)
        folium.GeoJson(shape_gdf, style_function = lambda feature: base_style).add_to(mapa)
    return mapa

**Overlapping CTs Percentage Calculations**

Next, define the function to find the census tracts that overlap with the department shape files. The function is executed in four main steps: 

Step 1: Initialize the overlapping percentage dictionary   
Step 2: Find overlap between district and CT layers   
Step 3: Calculate and save the overlapping percentage    
Step 4: Find the unique overlapping census tracts separately  

In [None]:
## Find Overlapping Census Tracts
def find_overlapping_cts(dept_gdf, state_cts, _identifier, _threshold = 10.0):
    """
    :params:
    dept_gdf : the geo dataframe loaded from shape file for the department 
    state_cts : the geo dataframe of the corresponding ct file
    _identifier : the unique row identifier for the department 
    _threshold : the overlapping threshold percentage to consider 
    """
    
    
    ## Step 1: Initialize
    olaps_percentages, overlapped_idx = {}, []
    for i, row in dept_gdf.iterrows():
        if row[_identifier] not in olaps_percentages: 
            olaps_percentages[row[_identifier]] = {}

        ## Step 2: Find overlap bw district and ct layer
        layer1 = row["geometry"] # district layer
        for j, row2 in state_cts.iterrows():
            layer2 = row2["geometry"] # ct layer
            layer3 = layer1.intersection(layer2) # overlapping layer
            
            ## Step 3: Save overlapping percentage
            overlap_percent = layer3.area / layer2.area * 100
            if overlap_percent >= _threshold: 
                olaps_percentages[row[_identifier]][row2["GEOID"]] = overlap_percent
                overlapped_idx.append(j)
    
    ## Step 4: Find unique overlapping census tracts
    ## (iterrows yields index labels, so select with .loc rather than .iloc)
    overlapped_idx = list(set(overlapped_idx))
    overlapped_cts = state_cts.loc[overlapped_idx]
    return overlapped_cts, olaps_percentages

## function to convert overlapping percentages dictionary to a dataframe 
def _prepare_olaps_df(olaps_percentages):
    temp = pd.DataFrame()
    distid, ct, pers = [], [], []
    for k, vals in olaps_percentages.items():
        for v, per in vals.items():
            distid.append (k)
            ct.append(v)
            pers.append(round(per, 2))
    temp["DistId"] = distid
    temp["CensusTract"] = ct
    temp["Overlap %"] = pers
    return temp
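To make the percentage calculation concrete, here is a minimal sketch that replaces real polygons with axis-aligned rectangles so it runs without any geometry library. This is a simplification and not the shapely-based method above; the rectangle tuples, the helper names, and the tract ids "T1" to "T3" are all hypothetical.

```python
def rect_area(rect):
    """Area of an axis-aligned rectangle given as (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect
    # a negative width or height means the rectangle is empty
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)

def rect_intersection(a, b):
    """Intersection rectangle of a and b (zero area if they do not overlap)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))

def overlap_percentages(district, tracts, threshold=10.0):
    """Percent of each tract's area covered by the district, above a threshold."""
    result = {}
    for geoid, tract in tracts.items():
        shared = rect_area(rect_intersection(district, tract))
        percent = shared / rect_area(tract) * 100
        if percent >= threshold:
            result[geoid] = percent
    return result

district = (0.0, 0.0, 4.0, 4.0)
tracts = {
    "T1": (0.0, 0.0, 2.0, 2.0),    # fully inside the district
    "T2": (3.0, 0.0, 5.0, 2.0),    # half of the tract is inside
    "T3": (3.0, 3.0, 23.0, 4.0),   # only 5% inside, below the threshold
}
olaps = overlap_percentages(district, tracts)
print(olaps)
```

As in `find_overlapping_cts`, the percentage is taken relative to the census tract's area, not the district's, so a small tract that lies entirely inside a large district scores 100%.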

<a id="b3"></a>
## <font color="#703bdb">Step 3 : Setting up global config parameters - ACS  </font><hr>

As the next step, define the parameters related to ACS data. Only two variables need to be defined: 

> - **metrics_config:** Dictionary that defines which ACS metrics need to be processed. An important field, **measure**, states the type of computation to perform. For example, proportion is used to estimate population counts (Black, White, Hispanic, etc.), median is used to find the median income of a population, and mean is used for the employment ratio and unemployment ratio. 
> - **_column_names:** Dictionary that maps the actual column names given in the ACS data to cleaned, human-readable column names. 

In [None]:
## Specific Metrics and their measures 
metrics_config = {
            'race-sex-age': {'metrics':['race','age','sex'], "measure":"proportion"},
            'income':       {'metrics':['median_income'],    "measure":"median"},
            'poverty':      {'metrics':['below_poverty'],    "measure":"proportion"},
            'employment':   {'metrics':['ep_ratio', 'unemp_ratio'], "measure" : "mean"}
            }

## Cleaned Column Names 
_column_names = {"race" : { "HC01_VC43" : "total_pop",
                            "HC01_VC49" : "white_pop",
                            "HC01_VC50" : "black_pop",
                            "HC01_VC56" : "asian_pop",
                            "HC01_VC88" : "hispanic_pop"},
                "age" : {
                            "HC01_VC12" : "20_24_pop", 
                            "HC01_VC13" : "25_34_pop", 
                            "HC01_VC14" : "35_44_pop", 
                            "HC01_VC15" : "45_54_pop", 
                            "HC01_VC16" : "55_59_pop", 
                },
                "sex": {
                            "HC01_VC04" : "male_pop",
                            "HC01_VC05" : "female_pop",
                },
                "median_income" : {
                            "HC02_EST_VC02" : "pop_income",
                            "HC02_EST_VC04" : "whites_income",
                            "HC02_EST_VC05" : "blacks_income",
                            "HC02_EST_VC07" : "asian_income",
                            "HC02_EST_VC12" : "hispanic_income",
                },
                "below_poverty" : {
                            "HC02_EST_VC01" : "below_pov_pop"},
                 "ep_ratio" : {
                             "HC03_EST_VC15" : "whites_ep_ratio",
                             "HC03_EST_VC16" : "blacks_ep_ratio"
                  },
                 "unemp_ratio" : {
                             "HC04_EST_VC15" : "whites_unemp_ratio",
                             "HC04_EST_VC16" : "blacks_unemp_ratio"}
                }

Additionally, we write the utility functions to be used for this step.

> - **_cleanup_metrics_data()**:  Function to perform basic pre-processing on metrics data: it loads all the metrics data, saves it in a dictionary, and returns it.  
> - **_flatten_gdf()**:  Function to flatten the dictionary-valued columns of the dataframe into separate columns. 

In [None]:
## Function to perform basic pre-processing on metrics data 
def _cleanup_metrics_data(_dept):
    metrics_df = {}
    for _metric in metrics_names: ## metrics_names is defined in config 
        mpath = _root_dir + _dept + "/metrics/" + _metric + ".csv"
        mdf = pd.read_csv(mpath, low_memory=False).iloc[1:]
        mdf = mdf.reset_index(drop=True).rename(columns={'GEO.id2':'GEOID'})
        metrics_df[_metric] = mdf
    
    ## returns metrics_df that contains all the dataframe for ACS metrics 
    return metrics_df

## Function to Flatten the details
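def _flatten_gdf(df, _identifier):
    relevant_cols = [_identifier]
    ## copy to avoid modifying a slice of the original dataframe
    flatten_df = df[relevant_cols].copy()
    for c in df.columns:
        ## only columns prefixed with "_" hold the per-metric dictionaries
        if not c.startswith("_"):
            continue
        ## take the keys from the first row that actually holds a dictionary
        _dicts = [x for x in df[c] if isinstance(x, dict)]
        _new_cols = list(_dicts[0].keys()) if _dicts else []
        for _new_col in _new_cols:
            _clean_colname = _column_names[c[1:]][_new_col]
            flatten_df[_clean_colname] = df[c].apply(
                lambda x : x[_new_col] if isinstance(x, dict) else 0.0)
            relevant_cols.append(_clean_colname)
    return flatten_df[relevant_cols]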

<a id="b4"></a>
## <font color="#703bdb">Step 4 : Enrich ACS Information in Overlapped Districts  </font><hr>

Here we will define the final function that will use the overlapped percentages, ACS information and perform the necessary calculation to generate the estimated information linked with the police department zones. The enrichment process performs different calculations to find the estimated numbers associated with a district of a department. 

- **Proportion:** For population estimates such as black population, white population, total population, etc., proportion is used to get the estimated number. The actual number is multiplied by the overlap percentage, expressed as a fraction, to get the estimated number.  
    
        estimated_value = true_value * overlapping_percentage / 100

- **Median:** Median is used to compute the metrics such as median income of the overlapped groups.
- **Mean:** Mean is used to compute the average estimates of the overlapped population, for example - average unemployment rate of the overlapped population.  
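The three measures can be illustrated with a small worked example. The numbers below are made up for illustration and the helper names are hypothetical; only the arithmetic mirrors the description above.

```python
from statistics import median

# (value reported for a census tract, overlap percentage with the district)
population = [(1000.0, 50.0), (2000.0, 25.0)]
# per-tract values for metrics that are not scaled by the overlap
incomes = [40000.0, 60000.0, 50000.0]
unemployment = [5.0, 7.0]

def proportion_estimate(pairs):
    """Sum of tract values, each scaled by its overlap percentage."""
    return sum(value * percent / 100 for value, percent in pairs)

def mean_estimate(values):
    """Plain average of the overlapping tracts' values."""
    return sum(values) / len(values)

est_population = proportion_estimate(population)   # 500 + 500
est_income = median(incomes)                       # middle of the three values
est_unemployment = mean_estimate(unemployment)     # (5 + 7) / 2
print(est_population, est_income, est_unemployment)
```

Note that only the proportion measure uses the overlap percentage as a weight; the median and mean treat every overlapping tract equally.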

In [None]:
## Function that enriches the information using overlapped percentage
def _enrich_info(idf, percentages, m_df, columns, m_measure):
    """
    :params:
    idf : unique identifier for the police department information
    percentages : The overlapped CTs and their percentages
    m_df : the dataframe of the metric containing all the information
    columns : the corresponding column names of the metric, defined in config
    m_measure : the measure (mean, median, proportion) to perform
    """
    
    ## define the updated_metrics object that will store the estimated information
    updated_metrics = {}
    
    ## return an empty tuple if there are no overlapping CTs
    if len(percentages[idf]) == 0:
        return ()
    
    ## Iterate in all Districts with the overlapped CTs and percentage
    for idd, percentage in percentages[idf].items(): 
        ## find the corresponding row for an overlapped CT in the metric data 
        ct_row = m_df[m_df["GEOID"] == idd]
        for rcol in columns:
            if rcol not in updated_metrics:
                updated_metrics[rcol] = []
            
            ## Perform the necessary calculation to find the estimated number 
            try:
                actual_value = ct_row[rcol].iloc[0].replace("-","")
                actual_value = actual_value.replace(",","")
                actual_value = float(actual_value.replace("+",""))
                if m_measure == "proportion":
                    updated_value = actual_value * percentage / 100
                else:
                    updated_value = actual_value
                updated_metrics[rcol].append(updated_value)
            except Exception:
                ## skip values that are missing or cannot be parsed as numbers
                pass
        
    ## Update the information in updated_metrics
    for rcol in columns:
        if len(updated_metrics[rcol]) == 0:
            updated_metrics[rcol] = 0
        else:
            if m_measure == "proportion":
                updated_metrics[rcol] = sum(updated_metrics[rcol])
            elif m_measure == "median":
                updated_metrics[rcol] = median(updated_metrics[rcol])
            elif m_measure == "mean":
                _mean = float(sum(updated_metrics[rcol])) / len(updated_metrics[rcol])
                updated_metrics[rcol] = _mean
    return updated_metrics

We will define another function that will call the enrich information function for different metrics. 

In [None]:
## Master Function to process the ACS info in dept df
def _process_metric(metrics_df, dept_df, _identifier, olaps_percentages, metric_name):
    """
    :params:
    metrics_df : the complete dataframe containing the metrics data
    dept_df : the geodataframe for police shape files 
    _identifier : the row identifier column corresponding to the police dept shape file 
    olaps_percentages : the overlapping percentage object calculated in previous step
    metric_name : Name of the metric, example - education / poverty / income 
    """
    
    m_df = metrics_df[metric_name]
    m_measure = metrics_config[metric_name]["measure"]
    for flag in metrics_config[metric_name]['metrics']:
        cols = list(_column_names[flag].keys())
        dept_df["_"+flag] = dept_df[_identifier].apply(lambda x : \
                            _enrich_info(x, olaps_percentages, m_df, cols, m_measure))
    return dept_df 

The enriched data now contains information about: 

- the overlapped census tracts  
- the percentage of overlap  
- the estimated / calculated numbers for department districts  

<a id="b5"></a>
## <font color="#703bdb">Step 5: Standardize Police Incidents Data </font><hr>

In this step we add our target information, which in this case is the police incidents data. This data is part of the <a href="http://policingequity.org/national-justice-database/">National Justice Database</a>, which ingests data from different departments. The biggest challenge is that the departments follow no common standard. For standardization, I have considered the following points: 

1. Standardization of File Names   
2. Standardization of Key Fields  
    - Subject Race
    - Subject Gender
    - Incident Date   
3. Standardization of Column Names  
    
    
### Standardization Process 
   
The standardization process has four steps: 

1. Obtain all the unique values of the column to standardize from every police department incident file.  
2. Combine all the values together and remove duplicates.  
3. Paste them into a spreadsheet (see the example file [here](https://docs.google.com/spreadsheets/d/1mM9c6CYt7KRR9NK0QuulW5G66cXU_Q0geC1xAgE5Vsc/edit?usp=sharing)) and fill in the standardized_column column.  
4. Export the file as CSV and pass it to this pipeline, which performs the standardization automatically. The pipeline also handles the standardization of column names. For instance, in department "Dept_23-00089" the race column is named "SUBJECT_RACT" instead of "SUBJECT_RACE". 
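
The first two steps can be sketched with the standard library as follows. This is only an illustration under assumptions: the two in-memory files and the helper name `collect_unique_values` are hypothetical and not part of the pipeline.

```python
import csv
import io

# Hypothetical stand-ins for the incident files of two departments
file_a = io.StringIO("SUBJECT_RACE,SUBJECT_GENDER\nW,M\nB,F\nW,F\n")
file_b = io.StringIO("SUBJECT_RACT,SUBJECT_GENDER\nWhite,Male\nUnk,Male\n")

def collect_unique_values(files, column_variations):
    """Gather the distinct raw values of one field across many files."""
    values = set()
    for handle in files:
        for row in csv.DictReader(handle):
            for col in column_variations:
                value = row.get(col)
                if value:
                    values.add(value)
    return sorted(values)

raw_values = collect_unique_values([file_a, file_b],
                                   ["SUBJECT_RACE", "SUBJECT_RACT"])

# Write a two-column sheet; the second column is then filled in by hand
sheet = io.StringIO()
writer = csv.writer(sheet)
writer.writerow(["raw_value", "standardized_column"])
writer.writerows([value, ""] for value in raw_values)
print(sheet.getvalue())
```

Passing both column name variations lets a misspelled header such as "SUBJECT_RACT" contribute its values to the same sheet.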


<img src="https://i.imgur.com/nTItNxI.png" height=500 width=500>

<br>

The current implementation of the pipeline integrates subject race and subject gender; more columns can be added by updating the column config as shown below: 

> column_config = { <br>
    "SUBJECT_RACE" : { "variations": ["SUBJECT_RACT"],  "values_map" : subject_race_map }, <br>
    "SUBJECT_GENDER" : { "variations": [],  "values_map" : subject_gender_map }  <br>
    }  <br>
    
So we write the utility functions to perform this standardization process: 


In [None]:
subject_race_csv_content = """W	White
W(White)	White
White	White
B	Black
B(Black)	Black
Black	Black
Black or African American	Black
Black, Black	Black
Unk	Unknown
Unknown	Unknown
UNKNOWN	Unknown
No Data	Unknown
NO DATA ENTERED	Unknown
not recorded	Unknown
Not Specified	Unknown
P	Pacific Islander
Pacific Islander	Pacific Islander
O	Other
Other	Other
Other / Mixed Race	Other
Native Am	Native American
Native Amer	Native American
Native American	Native American
Latino	Latino
H	Hispanic
H(Hispanic)	Hispanic
Hispanic	Hispanic
Hispanic or Latino	Hispanic
A	Asian
A(Asian or Pacific Islander)	Asian
Asian	Asian
Asian or Pacific islander	Asian
American Ind	American Indian
American Indian/Alaska Native	American Indian"""

subject_gender_csv_content = """F	Female
Female	Female
FEMALE	Female
M	Male
M, M	Male
Male	Male
MALE	Male
No Data	Unknown
not recorded	Unknown
Not Specified	Unknown
Unk	Unknown
Unknown	Unknown
UNKNOWN	Unknown
-	Unknown"""

In [None]:
## utility function to get the map of raw -> standardized
def _get_map(content):
    _map = {}
    for line in content.split("\n"):
        raw = line.split("	")[0]
        standardized = line.split("	")[1]
        _map[raw] = standardized
    return _map

## utility function to get the frequency count of elements 
def _get_count(x):
    return dict(Counter("|".join(x).split("|")))

## utility function to cleanup the name 
def _cleanup_dist(x):
    try:
        x = str(int(float(x)))
    except Exception as E:
        x = "NA"
    return x 

## Create the raw-standardized maps after reading the csv content as shown in image above 
subject_race_map = _get_map(subject_race_csv_content)
subject_gender_map = _get_map(subject_gender_csv_content)

column_config = {
    "SUBJECT_RACE" : { "variations": ["SUBJECT_RACT"],  "values_map" : subject_race_map },
    "SUBJECT_GENDER" : { "variations": [],  "values_map" : subject_gender_map },
    }

Now, we write our master function to standardize the police files, column names, and values. 

In [None]:
## master function to standardize the column names and values
def _standardize_columns(datadf):
    for col, col_dict in column_config.items():
        ## build a new list so that the global config is not mutated on every call
        variations = col_dict["variations"] + [col]
        _map = col_dict["values_map"]
        for colname in variations:
            if colname in datadf.columns:
                datadf[col] = datadf[colname].apply(lambda x : _map[x] if x in _map else "-")
                
    ## Standardize Date Column, add Year and Month
    if "INCIDENT_DATE" in datadf.columns:
        datadf["INCIDENT_DATE"] = pd.to_datetime(datadf["INCIDENT_DATE"])
        datadf["INCIDENT_YEAR"] = datadf["INCIDENT_DATE"].dt.year
        datadf["INCIDENT_MONTH"] = datadf["INCIDENT_DATE"].dt.month
    
    if "LOCATION_DISTRICT" in datadf.columns:
        datadf["LOCATION_DISTRICT"] = datadf["LOCATION_DISTRICT"].astype(str)    

    return datadf

## Function to standardize the events data file
def _standardize_filename(_dept):
    _file = [f for f in os.listdir(_base_dir + _dept) if f.endswith(".csv")][0]
    old_path = _base_dir + _dept + "/" + _file
    new_path = _root_dir + _dept + "/events/" + _file
    shutil.copy(old_path, new_path)
    return _file
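
Because values that are missing from the map fall back to "-", it can help to list them before running the standardization, so that the mapping sheet can be extended. The following is a minimal sketch of such a check; the helper `find_unmapped`, the map excerpt and the sample column are hypothetical.

```python
from collections import Counter

def find_unmapped(values, values_map):
    """Count the raw values that have no entry in the mapping."""
    return Counter(v for v in values if v not in values_map)

# Hypothetical excerpt of a raw -> standardized map
race_map_excerpt = {"W": "White", "B": "Black", "Unk": "Unknown"}

# Hypothetical raw column containing two spellings the map does not know
raw_column = ["W", "B", "w", "Unk", "Wht", "w"]

unmapped = find_unmapped(raw_column, race_map_excerpt)
print(unmapped.most_common())
```

The lookup is case sensitive, which is why the map lists variants such as "Unknown" and "UNKNOWN" separately.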

Now, we define the function to process the police level information.  

In [None]:
def _process_events(pol_config):
    ## load the given police incidents file; [1:] drops the first data row
    ppath = _root_dir + _dept + "/events/" + pol_config["police_file"]
    events_df = pd.read_csv(ppath, low_memory=False)[1:]
    events_df = _standardize_columns(events_df)

    ## Slice the data for the given years, if given by user
    years_to_process = pol_config["years_to_process"]
    if len(years_to_process) != 0: 
        events_df = events_df[events_df['INCIDENT_YEAR'].isin(years_to_process)]
    
    ## Aggregate the events by every district of the department
    police_df = events_df.groupby("LOCATION_DISTRICT")

    ## [Extendable] Obtain the distribution by gender, race etc
    police_df = police_df.agg({"SUBJECT_GENDER" : lambda x : _get_count(x),\
                               "SUBJECT_RACE"   : lambda x : _get_count(x)})
    police_df = police_df.reset_index()
    police_df = police_df.rename(columns={
                    "SUBJECT_GENDER" : pol_config['event_type'] + "_sex",\
                    "SUBJECT_RACE" : pol_config['event_type'] + "_race"})
    return police_df, events_df 
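
To illustrate the shape of the aggregated output, here is a plain-Python sketch of the same idea: one frequency dictionary per district and field. The sample events and the helper `aggregate_by_district` are hypothetical; the pipeline itself uses pandas `groupby` and `agg`.

```python
from collections import Counter, defaultdict

# Hypothetical standardized events
events = [
    {"LOCATION_DISTRICT": "1", "SUBJECT_RACE": "Black", "SUBJECT_GENDER": "Male"},
    {"LOCATION_DISTRICT": "1", "SUBJECT_RACE": "White", "SUBJECT_GENDER": "Male"},
    {"LOCATION_DISTRICT": "1", "SUBJECT_RACE": "Black", "SUBJECT_GENDER": "Female"},
    {"LOCATION_DISTRICT": "2", "SUBJECT_RACE": "Hispanic", "SUBJECT_GENDER": "Female"},
]

def aggregate_by_district(events, fields):
    """Return {district: {field: {value: count}}}."""
    grouped = defaultdict(lambda: {f: Counter() for f in fields})
    for event in events:
        for f in fields:
            grouped[event["LOCATION_DISTRICT"]][f][event[f]] += 1
    # Convert the Counters to plain dicts for a simpler final structure
    return {district: {f: dict(c) for f, c in counts.items()}
            for district, counts in grouped.items()}

summary = aggregate_by_district(events, ["SUBJECT_RACE", "SUBJECT_GENDER"])
print(summary["1"]["SUBJECT_RACE"])
```

Counting the values directly avoids joining them on "|", which would fail if a group contained a non-string value such as NaN.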

<a id="b6"></a>
## <font color="#703bdb">Step 6: Extending Police Data using External Datasets </font><hr>

External data can sometimes be useful for measuring police behaviour, so we define a function to load any external dataset. 

In [None]:
def _load_external_dataset(pol_config):
    ## load the dataset 
    _path = external_datasets_path + pol_config["path"]
    events2 = pd.read_csv(_path, parse_dates=[pol_config["date_col"]])

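    ## basic standardization
    events2['year'] = events2[pol_config["date_col"]].dt.year
    years_to_process = pol_config["years_to_process"]
    ## slice the data for the given years, if given by user
    if len(years_to_process) != 0:
        events2 = events2[events2['year'].isin(years_to_process)]
    ## work on a copy so the assignments below do not hit a filtered view
    events2 = events2.copy()
    events2[pol_config["race_col"]] = events2[pol_config["race_col"]].fillna("")
    events2[pol_config["gender_col"]] = events2[pol_config["gender_col"]].fillna("")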
    
    ## Aggregate and cleanup
    events2["LOCATION_DISTRICT"] = events2[pol_config['identifier']].apply(
                                                lambda x : _cleanup_dist(x))
    temp_df = events2.groupby("LOCATION_DISTRICT").agg({
                                pol_config['gender_col'] : lambda x : _get_count(x),\
                                pol_config['race_col'] : lambda x : _get_count(x)})
    
    ## cleanup the column names
    temp_df = temp_df.reset_index().rename(columns={
                                pol_config['gender_col'] : pol_config["event_type"]+"_sex", 
                                pol_config['race_col'] : pol_config["event_type"]+"_race"})
    return temp_df
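
The district identifier is cleaned before grouping so that the external data can later be merged with the police data on "LOCATION_DISTRICT". The sketch below shows how this kind of cleanup behaves on a few hypothetical raw identifiers; `cleanup_dist` is a standalone variant written for illustration that catches specific exceptions instead of a blanket `Exception`.

```python
def cleanup_dist(x):
    """Normalize a district identifier to an integer-like string, else "NA"."""
    try:
        return str(int(float(x)))
    except (TypeError, ValueError, OverflowError):
        return "NA"

# Hypothetical raw identifiers as they might appear in an external file
raw_ids = ["7.0", 12, " 3 ", "Central", None, float("nan")]
cleaned = [cleanup_dist(v) for v in raw_ids]
print(cleaned)
```

Identifiers that are not numeric all collapse into the single "NA" group, so departments with named districts would need a different cleanup rule.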

<a id="b7"></a>
## <font color="#703bdb">Step 7: Save Final Cleaned Datasets  </font><hr>

The output of the previous six steps is the enriched data containing the overlapped census tracts with the corresponding ACS numbers. Along with this, the police incidents data (or any external data) is also loaded. In this step, we write a function to save the final data frames to disk (a database could be used instead).

In [None]:
def _save_final_data(enriched_df, police_df, events_df):
    enriched_df.to_csv(_root_dir +"/"+ _dept + "/enriched_df.csv", index = False)
    police_df.to_csv(_root_dir +"/"+ _dept + "/police_df.csv", index = False)
    events_df.to_csv(_root_dir +"/"+ _dept + "/events/events_df.csv", index = False)
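
One point to keep in mind when reading these files back: the race and sex columns hold dictionaries, and as far as I am aware a CSV writer stores such cells as their string representation. The sketch below uses the standard `csv` module as a stand-in for `to_csv` / `read_csv` and shows one way to recover the dictionaries with `ast.literal_eval`; the sample row is hypothetical.

```python
import ast
import csv
import io

# Hypothetical aggregated row with a dictionary-valued column
rows = [{"LOCATION_DISTRICT": "1", "arrest_race": {"Black": 3, "White": 2}}]

# Write the row to an in-memory CSV; the dict is stored as its string form
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["LOCATION_DISTRICT", "arrest_race"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the cell is now a plain string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_cell = loaded[0]["arrest_race"]

# Parse the string back into a dictionary
parsed = ast.literal_eval(raw_cell)
print(type(raw_cell).__name__, parsed)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this purpose.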

<a id="b8"></a>
## <font color="#703bdb">Step 8: Component B Trigger Function </font><hr>

Finally, we write a function that triggers component B of this pipeline and executes all the steps, producing the enriched data and police data, which can easily be used for analysis and modelling. 

In [None]:
def _execute_district_pipeline(_dept, _police_config1, _police_config2=None):
    print ("Selected Department: ", _dept)
    
    ## department shape file
    print (". Loading Shape File Data")
    dept_shape_gdf = _read_shape_gdf(_dept)
    base_plot = _plot_shapefile_base(dept_shape_gdf, _dept, overlapped_cts = {})    

    ## finding overlapped CTs percentages
    print (".. Finding Overlapping CTs")
    _identifier = depts_config[_dept]["_rowid"]
    state_cts = _read_ctfile(_dept)
    overlapped_cts, olaps_percentages = find_overlapping_cts(dept_shape_gdf, state_cts, _identifier)
    overlapped_plot = _plot_shapefile_base(dept_shape_gdf, _dept, overlapped_cts)
    
    ## Adding the Metrics Data
    print ("... Loading ACS Metrics Data")
    metrics_df = _cleanup_metrics_data(_dept)

    ## Add Metrics to the dept df
    print (".... Enrichment of ACS Metrics with Overlapped Data")
    dept_enriched_gdf = dept_shape_gdf.copy(deep=True)
    for metric_name in metrics_config.keys():
        dept_enriched_gdf = _process_metric(metrics_df, dept_enriched_gdf, _identifier, 
                                            olaps_percentages, metric_name=metric_name)
    
    ## Find Enriched DF
    enriched_df = _flatten_gdf(dept_enriched_gdf, _identifier)
    enriched_df = enriched_df.rename(columns={_identifier : "LOCATION_DISTRICT"})
    
    ## Processing Police DF
    if _police_config1 is not None:
        print ("..... Standardizing the Police Events")
        police_file1 = _standardize_filename(_dept)
        _police_config1["police_file"] = police_file1
        police_df, events_df = _process_events(_police_config1)
    else:
        police_df, events_df = pd.DataFrame(), pd.DataFrame()
    
    ## Adding any other external Police Data 
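    if _police_config2 is not None:
        print ("..... Standardizing the External Data")
        external_df = _load_external_dataset(_police_config2)
        ## an empty police_df has no LOCATION_DISTRICT column to merge on
        if police_df.empty:
            police_df = external_df
        else:
            police_df = police_df.merge(external_df, on="LOCATION_DISTRICT")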
    
    ## Save Final Data
    print ("...... Saving the Final Data in New Repository")
    _save_final_data(enriched_df, police_df, events_df)
    
    response = {
                "dept_shape_gdf" : dept_shape_gdf,
                "base_plot" : base_plot,
                "olaps_percentages" : _prepare_olaps_df(olaps_percentages),
                "overlapped_plot" : overlapped_plot,
                "dept_enriched_gdf" : dept_enriched_gdf,
                "enriched_df" : enriched_df,
                "police_df" : police_df,
                "events_df" : events_df
                }
    return response

This completes component B. Now, let's run the pipeline for one of the departments. In the next kernel, the pipeline is run for multiple departments. 

<a id="c"></a>
## <font color="#703bdb">Example Run of the Pipeline </font><hr>

### <a href="https://www.kaggle.com/shivamb/3-example-runs-data-processing-pipeline">NEXT KERNEL:</a> Pipeline Run for 8 departments 
<br>

In this kernel, let's run it for a single department only, Department 49-00033. We define three inputs: 

> **_dept :** "Dept_49-00033"  
> **police_config1:** config file for given police incidents data  
> **police_config2:** any external police incidents data to be integrated  

In [None]:
## select department 
_dept = "Dept_49-00033"

## given police data config 
_police_config1 = { 'event_type' : 'arrest', "years_to_process" : []}

## external police data config
_police_config2 = {  'path' : "la_stops/vehicle-and-pedestrian-stop-data-2010-to-present.csv", 
                     'event_type' : 'vstops',
                     'identifier' : "Officer 1 Division Number" , 
                     'gender_col' : 'Sex Code', 
                     'race_col' : 'Descent Code', 
                     'date_col' : "Stop Date", 
                     'years_to_process' : [2015] }

## call the trigger for the given department and their configurations
pipeline_resp = _execute_district_pipeline(_dept, _police_config1, _police_config2)

At the end, a well-structured flat dataset is produced, which can easily be analyzed to measure racial bias or to support other analyses. This gives users quick access to otherwise hidden information, and they can quickly generate custom reports. 
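
As a small illustration of how the flat output might be consumed, the sketch below turns the per-district count dictionaries into percentage shares. The rows, their numbers and the helper `to_shares` are hypothetical; only the column naming (`event_type` followed by `_race`) follows the pipeline.

```python
def to_shares(counts):
    """Convert a {category: count} dict into percentage shares."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: round(100 * value / total, 1) for key, value in counts.items()}

# Hypothetical rows shaped like the aggregated police data
police_rows = [
    {"LOCATION_DISTRICT": "1", "arrest_race": {"Black": 30, "White": 50, "Hispanic": 20}},
    {"LOCATION_DISTRICT": "2", "arrest_race": {}},
]

shares = {row["LOCATION_DISTRICT"]: to_shares(row["arrest_race"])
          for row in police_rows}
print(shares)
```

Such shares could then be set against the estimated population shares of the same district from the enriched data.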

<a id="d"></a>
## <font color="#703bdb">Features of this pipeline :  </font><hr>

1. **Scalable:** The next kernel shows example runs of this pipeline for several departments, each producing well-structured, cleaned datasets for analysis. 
2. **Robust:** Performs different levels of error handling throughout the flow.  
3. **Automated:** Minimal human intervention; users only need to define **"_config objects"**  

<a href="https://www.kaggle.com/shivamb/3-example-runs-of-automation-pipeline"> Next Kernel </a> - The outputs and a showcase of these pipeline features.