### Task
Step two: merge the disciplinary action database with the BPD financial contributions data, verifying common names against the Race/Ethnicity BPD personnel dataset.
First merge the disciplinary action database with the entire "All_Police_Contributions.csv" dataset, then filter for "Boston Police" under the "Employer" column.

### Merge with Fuzzy Matching using Fuzzy merge template

In [1]:
import pandas as pd
import numpy as np
import dask.dataframe as dd
from fuzzywuzzy import fuzz 



In [2]:
# If the Dask import above fails, install Dask with the following code, then re-run the imports.
# Dask is a library that parallelizes pandas-style DataFrame operations, which can speed up merging and other data processing.

import sys
!{sys.executable} -m pip install "dask[complete]"



In [3]:
# Compute a 0-100 string similarity score between the 'Name' and 'Contributor' fields of a merged row

def fuzzySimilarity(row):
    name1 = row['Name']
    name2 = row['Contributor']
    fuzzyRatio = fuzz.ratio(name1, name2)
    return fuzzyRatio
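
To get a feel for the 0-100 scale before choosing a cutoff, it can help to score a few pairs by hand. The sketch below is an illustration only: it uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio`, so its scores may differ slightly from fuzzywuzzy's, and the helper name `similarity_score` is hypothetical.

```python
from difflib import SequenceMatcher


def similarity_score(name1: str, name2: str) -> int:
    """Return a 0-100 similarity score (difflib stand-in, not fuzz.ratio itself)."""
    return round(100 * SequenceMatcher(None, name1, name2).ratio())


# Name pairs assembled from names printed elsewhere in this notebook,
# plus one variant without the middle initial.
pairs = [
    ("joseph abasciano", "joseph abasciano"),  # identical strings
    ("robert m zingg", "robert zingg"),        # middle initial missing
    ("michael linskey", "donna colbert"),      # unrelated names
]

for a, b in pairs:
    print(f"{a!r} vs {b!r}: {similarity_score(a, b)}")
```

Identical strings score 100, a dropped middle initial still scores high, and unrelated names score low, which is the behavior the threshold in the merge cell relies on.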

In [4]:
# Create a new column, lastNameCh, that holds the first letter of each person's last name

def getLastCh(s):
    s_list = s.split()
    suffixes = ['jr','jr.','sr','sr.','i','ii','iii', '3rd']

    # remove suffixes in s_list
    for i in reversed(range(len(s_list))):
        if s_list[i] in suffixes:
            s_list.pop(i)

    lastName = s_list[-1]
    firstCh = lastName[0]
    return firstCh

In [5]:
bostonDF = pd.read_csv("processedBostonPoliceInternalAffairs.csv")
policeDF = pd.read_csv("processedPoliceContributions.csv")

In [6]:
bostonDF['lastNameCh'] = [getLastCh(s) for s in bostonDF['Name']]
policeDF['lastNameCh'] = [getLastCh(s) for s in policeDF['Contributor']]

In [7]:
print(bostonDF[['Name','lastNameCh']].head())

                  Name lastNameCh
0     joseph abasciano          a
1     joseph abasciano          a
2     joseph abasciano          a
3     joseph abasciano          a
4  ramadani abdul-aziz          a


In [8]:
print(policeDF[['Contributor', 'lastNameCh']].head())

          Contributor lastNameCh
0  allan l ciccone jr          c
1     michael linskey          l
2     william haffner          h
3      matthew maglio          m
4       donna colbert          c


Things to take note of:

1. Some names were entered in the wrong format. For example, "Gannetti, iii, Salvatore" was entered as "Gannetti, Salvatore iii", which preprocessing turned into "salvatore iii gannetti".
2. Some suffixes are followed by a period and others are not (e.g. jr and jr.).

We removed suffixes before identifying the last name character.
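
The two notes above can be handled in one place by normalizing each token before comparing it to the suffix list. This is a minimal sketch, not the notebook's `getLastCh`: the name `last_initial` is hypothetical, and it additionally assumes that lowercasing and returning an empty string for blank names are acceptable.

```python
# Suffixes stored without trailing periods; tokens are normalized before lookup.
SUFFIXES = {"jr", "sr", "i", "ii", "iii", "3rd"}


def last_initial(name: str) -> str:
    """Return the first letter of the last name, ignoring suffixes anywhere in the name."""
    # Strip surrounding periods/commas and lowercase, so "Jr." and "jr" are treated alike.
    tokens = [token.strip(".,").lower() for token in name.split()]
    tokens = [token for token in tokens if token and token not in SUFFIXES]
    # Guard against blank names instead of raising IndexError.
    return tokens[-1][0] if tokens else ""


print(last_initial("allan l ciccone jr"))      # suffix at the end
print(last_initial("Allan L Ciccone Jr."))     # capitalized, with a period
print(last_initial("salvatore iii gannetti"))  # suffix in the middle
```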

In [9]:
# Merge the two dataframes in batches, one per last-name initial, so that only names sharing an initial are paired.
# Then compute a string similarity score for each paired row and keep the rows above a threshold as name matches.

unique_names = list(bostonDF['lastNameCh'].unique())

for name in unique_names:
    df1_sub_zip = bostonDF[bostonDF['lastNameCh'] == name]
    df2_sub_zip = policeDF[policeDF['lastNameCh'] == name]

    df_merge = dd.merge(df1_sub_zip, df2_sub_zip, how='left', left_on='lastNameCh', right_on='lastNameCh')
    
    df_merge['Fuzzy Similarity'] = df_merge.apply(lambda row: fuzzySimilarity(row), axis=1)
    
    # You can adjust this number for a more selective fuzzy similarity merge
    Fuzzy_Filter = df_merge[df_merge['Fuzzy Similarity'] > 85]
    
    title = "./fuzzyDatasets/merge_df_name_" + name + ".csv"
    Fuzzy_Filter.to_csv(title, encoding = "utf-8")
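
Batching by last-name initial is a form of blocking: only names that share an initial are ever compared. The sketch below, using made-up initials and a hypothetical helper `count_candidate_pairs`, counts how many pairs are compared with and without blocking. The trade-off is that two records whose last-name initials differ (for example because of a typo in the first letter) are never compared at all.

```python
from collections import Counter


def count_candidate_pairs(left_keys, right_keys):
    """Return (pairs compared with blocking, pairs compared without blocking)."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Within each block, every left row is paired with every right row.
    blocked = sum(n * right_counts[key] for key, n in left_counts.items())
    unblocked = len(left_keys) * len(right_keys)
    return blocked, unblocked


# Toy initials, not taken from the real datasets.
left_initials = ["a", "a", "b", "c", "z"]
right_initials = ["a", "c", "c", "l", "h", "m"]

blocked, unblocked = count_candidate_pairs(left_initials, right_initials)
print(f"with blocking: {blocked} pairs, without: {unblocked} pairs")
```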

In [10]:
# Build the list of batch CSV paths written by the previous cell

list_of_csv_titles = []

for name in unique_names:
    title = "./fuzzyDatasets/merge_df_name_" + name + ".csv"
    list_of_csv_titles.append(title)
    
print(list_of_csv_titles)

['./fuzzyDatasets/merge_df_name_a.csv', './fuzzyDatasets/merge_df_name_b.csv', './fuzzyDatasets/merge_df_name_c.csv', './fuzzyDatasets/merge_df_name_d.csv', './fuzzyDatasets/merge_df_name_s.csv', './fuzzyDatasets/merge_df_name_e.csv', './fuzzyDatasets/merge_df_name_f.csv', './fuzzyDatasets/merge_df_name_g.csv', './fuzzyDatasets/merge_df_name_h.csv', './fuzzyDatasets/merge_df_name_i.csv', './fuzzyDatasets/merge_df_name_j.csv', './fuzzyDatasets/merge_df_name_l.csv', './fuzzyDatasets/merge_df_name_k.csv', './fuzzyDatasets/merge_df_name_m.csv', './fuzzyDatasets/merge_df_name_n.csv', './fuzzyDatasets/merge_df_name_o.csv', './fuzzyDatasets/merge_df_name_p.csv', './fuzzyDatasets/merge_df_name_q.csv', './fuzzyDatasets/merge_df_name_r.csv', './fuzzyDatasets/merge_df_name_t.csv', './fuzzyDatasets/merge_df_name_v.csv', './fuzzyDatasets/merge_df_name_w.csv', './fuzzyDatasets/merge_df_name_x.csv', './fuzzyDatasets/merge_df_name_y.csv', './fuzzyDatasets/merge_df_name_z.csv']


In [11]:
# Merging all the batches

# Read every batch and concatenate once.
# (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.)
batches = [pd.read_csv(path) for path in list_of_csv_titles]
df_merge_final = pd.concat(batches)

print(df_merge_final)

   Unnamed: 0 Unnamed: 0_x              Name            Rank   Race  Year  \
0           8            0  joseph abasciano  Police Officer  White  2011   
1          28            0  joseph abasciano  Police Officer  White  2011   
2          40            0  joseph abasciano  Police Officer  White  2011   
3          42            0  joseph abasciano  Police Officer  White  2011   
4          47            0  joseph abasciano  Police Officer  White  2011   
..        ...          ...               ...             ...    ...   ...   
0           0         5638   vladimir xavier  Police Officer  Black  2012   
1           4         5639   vladimir xavier  Police Officer  Black  2012   
2           8         5640   vladimir xavier  Police Officer  Black  2014   
0         530         5657    robert m zingg       Detective  White  2011   
1         599         5658    robert m zingg       Detective  White  2012   

          CaseID        TypeOfMisconduct                   Allegation  \
0 

In [12]:
df_merge_final.to_csv("merged.csv")
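
The task also asks for a filter on "Boston Police" under the "Employer" column, which the cells above do not perform. The following is a sketch of that step on made-up rows. It assumes the merged data carries an "Employer" column from the contributions file (the printed output above is truncated, so this is not confirmed) and that a case-insensitive substring match is the intended filter; the variable names are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for the merged data; names and employers are placeholders.
merged = pd.DataFrame({
    "Name": ["officer a", "officer b", "officer c", "officer d"],
    "Contributor": ["officer a", "officer b", "officer c", "officer d"],
    "Employer": ["Boston Police", "boston police dept", "Some Other Employer", None],
})

# Case-insensitive substring match; na=False treats a missing employer as a non-match.
is_bpd = merged["Employer"].str.contains("boston police", case=False, na=False)
bpd_only = merged[is_bpd]

print(bpd_only)
```

A substring match also keeps variants such as "Boston Police Dept"; an exact comparison (`merged["Employer"] == "Boston Police"`) would be stricter if only that literal value should pass.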