## Multiple Imputation
Leveraging Multiple Imputation as a portion of the data preprocessing to ensure the Age_Range *null* values are handled accordingly before the feature engineering portion of the project. 

### Libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

### Read CSV & Check Null Values

In [6]:
csv_file_path = 'Data/Data_Sets/KCPD_5_Year_Analysis_Cleaned.csv'
df = pd.read_csv(csv_file_path)
print(df.head())

print(df.isnull().sum())

    Report_No Reported_Date Reported_Time  Year  Quarter  Month Day_of_Week  \
0  KC19020397       3/20/19      13:55:00  2019        1  March   Wednesday   
1  KC19025235        4/7/19      15:52:00  2019        2  April      Sunday   
2  KC19036511       5/17/19      20:27:00  2019        2    May    Thursday   
3  KC19024315        4/4/19      04:20:00  2019        2  April   Wednesday   
4  KC19035992       5/16/19      08:17:00  2019        2    May    Thursday   

  From_Date From_Time Adjusted_To_Date  ... Zip_Code Area  Involvement Race  \
0   3/20/19  09:00:00          3/20/19  ...  64118.0  NPD      CMP VIC    W   
1    4/7/19  15:45:00           4/7/19  ...      NaN  EPD          VIC    B   
2   5/16/19  20:30:00          5/16/19  ...  64138.0  MPD          VIC    B   
3    4/3/19  21:30:00           4/4/19  ...  64120.0  NaN  ARR SUS CHA    B   
4   5/16/19  08:15:00          5/16/19  ...  64119.0  SCP          VIC    W   

  Sex   Age Age_Range                             

### Custom Imputation Function
Replacing imputes that are missing **'Age_Range'** values and it is based on the most common **'Age_Range'** within each category. It is worth noting that only 73,348 values are missing out of the 445,336 rows, which is about 16% of the dataset. 

In [14]:
# Replace 'Null' strings with np.nan
df['Age_Range'].replace('Null',np.nan, inplace=True)

def impute_age_range(row, offense_age_map):
    # Check if age_range is NaN
    if pd.isnull(row['Age_Range']):
        if row['General _Offense_Categorization'] in offense_age_map:
            return offense_age_map[row['General _Offense_Categorization']]
        else: 
            return np.nan
    else:
        # if age_range is not NaN, return it as is
        return row['Age_Range']

# creating a map of the most frequent age_range
offense_age_map = df.dropna(subset=['Age_Range']).groupby('General _Offense_Categorization')['Age_Range'].agg(lambda x: x.mode().iloc[0])

# apply the custom imputation function
df['Age_Range'] = df.apply(impute_age_range,offense_age_map=offense_age_map,axis=1)

### Review DataFrame Values After Custom Imputation

In [15]:
# check for missing values
print(df.isnull().sum())

# inspect df values
print(df[['General _Offense_Categorization','Age_Range']].head(25))

Report_No                               0
Reported_Date                           0
Reported_Time                           0
Year                                    0
Quarter                                 0
Month                                   0
Day_of_Week                             0
From_Date                               0
From_Time                               0
Adjusted_To_Date                        0
Adjusted_To_Time                        0
Offense                                 0
Description                         24117
General _Offense_Categorization         0
Type_of_Crime                           0
UCR_Offense_Classification              0
NIBRS                                   0
NIBRS Offense Group                     0
Address                                 0
City                                    0
Zip_Code                            35523
Area                                 2094
Involvement                             0
Race                              

### Remove Columns/Renaming Columns
After determining the necessary values and reviewing the relationships in the EDA process; I'll be removing more columns and renaming as well before the feature engineering portion.

In [17]:
# define columns to remove
columns_to_remove = ['Reported_Date','Reported_Time','Offense','Description','Address','City','Zip_Code','Area','Involvement','Race','Sex','Age','Location','Firearm_Flag','DVFlag'] 
df = df.drop(columns=columns_to_remove)

# rename columns
rename_columns = {
    'General _Offense_Categorizatio': 'General_Offense_Categorization'
}
df = df.rename(columns=rename_columns)

In [18]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 445289 entries, 0 to 445288
Data columns (total 15 columns):
 #   Column                           Non-Null Count   Dtype 
---  ------                           --------------   ----- 
 0   Report_No                        445289 non-null  object
 1   Year                             445289 non-null  int64 
 2   Quarter                          445289 non-null  int64 
 3   Month                            445289 non-null  object
 4   Day_of_Week                      445289 non-null  object
 5   From_Date                        445289 non-null  object
 6   From_Time                        445289 non-null  object
 7   Adjusted_To_Date                 445289 non-null  object
 8   Adjusted_To_Time                 445289 non-null  object
 9   General _Offense_Categorization  445289 non-null  object
 10  Type_of_Crime                    445289 non-null  object
 11  UCR_Offense_Classification       445289 non-null  object
 12  NIBRS           

### Export Corrected Data to New CSV

In [19]:
import pandas as pd

# file path for the new file
file_path = 'Data/Data_Sets/KCPD-5-Year-Analysis-Feature-Eng.CSV'

# export the dataframe
df.to_csv(file_path,index=False)