# 3 Pre-processing & Training Data Development <a id='3_Pre-processing_&_training_data_development'></a>


## 3.1 Contents <a id='3.1_Contents'></a>

- [3.1 Contents](#3.1_Contents)
- [3.2 Introduction](#3.2_Introduction)
- [3.3 Imports](#3.3_Imports)
- [3.4 Load The Data](#3.4_Load_The_Data)
- [3.5 Data Cleaning](#3.5_Data_cleaning)
    - [3.5.1 Encoding Categorical Features](#3.5.1_encoding)

## 3.2 Introduction <a id="3.2_Introduction"></a>

This notebook continues "2.0-faa-exploratory-data-analysis.ipynb" and focuses on feature engineering, training, and model selection.

Goals: impute missing values, scale the data, encode categorical features, perform a train/test split, build a pipeline, and select a model.

### **Problem Statement:**
This data science project aims to predict the age and sex of crime victims from crime data and, potentially, other relevant variables. By analyzing patterns in the crime data, we aim to develop predictive models that estimate victim age and sex. Such models have applications in law enforcement and victim support, and they can help victim service providers target relevant areas.
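
The goals listed above could eventually be combined along the following lines. This is only a minimal sketch on synthetic rows, assuming scikit-learn is available (it is not imported anywhere in this notebook). The choice of `LogisticRegression` and the feature subset are placeholders, not the model or features selected for this project.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in rows; the values are made up for illustration only.
rng = np.random.default_rng(0)
n_rows = 200
toy = pd.DataFrame({
    "Time Occurred": (rng.integers(0, 24, size=n_rows) * 100).astype(float),
    "Area Name": rng.choice(["Central", "Mission", "Southwest"], size=n_rows),
    "Victim Sex": rng.choice(["F", "M"], size=n_rows),
})
toy.loc[::25, "Time Occurred"] = np.nan  # inject a few missing values

X = toy[["Time Occurred", "Area Name"]]
y = toy["Victim Sex"]

# Numeric columns: impute with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: impute with the mode, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["Time Occurred"]),
    ("cat", categorical, ["Area Name"]),
])
# Placeholder classifier; model selection is a separate step.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Keeping the imputation, scaling, and encoding inside the pipeline means they are fitted on the training split only, which avoids leaking information from the test split.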


## 3.3 Imports <a id='3.3_Imports'></a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#import seaborn as sns
#from datetime import datetime
#from scipy import stats
#from math import trunc
#import os

## 3.4 Load The Data <a id='3.4_Load_The_Data'></a> 

In [2]:
# Store the file path in a variable, then use pd.read_csv() to load the data into the crimedf dataframe

dataFilePath = "/Users/frankyaraujo/Development/Springboard_Main/Capstone Two/\
Springboard-Capstone-Two/src/data/2010-2023 Crime_Traffic_Collisions_Data_R2 .csv"
crimedf = pd.read_csv(dataFilePath, low_memory = False)
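
The absolute path above only resolves on one machine. As a hedged alternative, the sketch below builds the path with `pathlib` and fails early if the file is missing. The helper name `load_crime_data` and the relative directory in the usage comment are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_crime_data(data_dir, filename):
    """Read a CSV from data_dir, raising a clear error if the file is missing."""
    csv_path = Path(data_dir) / filename
    if not csv_path.is_file():
        raise FileNotFoundError(f"Expected data file at {csv_path}")
    # low_memory=False parses the file in one pass so column dtypes are inferred consistently
    return pd.read_csv(csv_path, low_memory=False)


# Hypothetical usage; the relative directory is an assumption about the repo layout:
# crimedf = load_crime_data("../src/data", "2010-2023 Crime_Traffic_Collisions_Data_R2 .csv")
```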

In [3]:
# Review the first rows of the data using .head()

crimedf.head()

,Unnamed: 0,DR Number,Date Reported,Date Occurred,Time Occurred,Area ID,Area Name,Reporting District,Crime Code,Crime Code Description,...,Status,Status Desc,Crm Cd 1,Crm Cd 2,Crm Cd 3,Crm Cd 4,Address,Cross Street,LAT,LON
0,0,10304468,2020-01-08,2020-01-08,2230,3,Southwest,377,624,BATTERY - SIMPLE ASSAULT,...,AO,Adult Other,624.0,,,,1100 W 39TH PL,,34.0141,-118.2978
1,1,190101086,2020-01-02,2020-01-01,330,1,Central,163,624,BATTERY - SIMPLE ASSAULT,...,IC,Invest Cont,624.0,,,,700 S HILL ST,,34.0459,-118.2545
2,2,200110444,2020-04-14,2020-02-13,1200,1,Central,155,845,SEX OFFENDER REGISTRANT OUT OF COMPLIANCE,...,AA,Adult Arrest,845.0,,,,200 E 6TH ST,,34.0448,-118.2474
3,3,191501505,2020-01-01,2020-01-01,1730,15,N Hollywood,1543,745,VANDALISM - MISDEAMEANOR ($399 OR UNDER),...,IC,Invest Cont,745.0,998.0,,,5400 CORTEEN PL,,34.1685,-118.4019
4,4,191921269,2020-01-01,2020-01-01,415,19,Mission,1998,740,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",...,IC,Invest Cont,740.0,,,,14400 TITUS ST,,34.2198,-118.4468


## 3.5 Data Cleaning <a id='3.5_Data_cleaning'></a>

The dataset still has missing values, categorical data, and potentially useless or redundant features, so this section focuses on cleaning the data.

In [4]:
# Quick look at the object-dtype (string) columns

crimedf_obj = crimedf.select_dtypes(include=[object])
crimedf_obj.head()

,Date Reported,Date Occurred,Area Name,Crime Code Description,MO Codes,Victim Sex,Victim Descent,Premise Description,Weapon Desc,Status,Status Desc,Address,Cross Street
0,2020-01-08,2020-01-08,Southwest,BATTERY - SIMPLE ASSAULT,0444 0913,F,Black,SINGLE FAMILY DWELLING,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",AO,Adult Other,1100 W 39TH PL,
1,2020-01-02,2020-01-01,Central,BATTERY - SIMPLE ASSAULT,0416 1822 1414,M,Hispanic/Latin/Mexican,SIDEWALK,UNKNOWN WEAPON/OTHER WEAPON,IC,Invest Cont,700 S HILL ST,
2,2020-04-14,2020-02-13,Central,SEX OFFENDER REGISTRANT OUT OF COMPLIANCE,1501,X,Unknown,POLICE FACILITY,,AA,Adult Arrest,200 E 6TH ST,
3,2020-01-01,2020-01-01,N Hollywood,VANDALISM - MISDEAMEANOR ($399 OR UNDER),0329 1402,F,White,"MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)",,IC,Invest Cont,5400 CORTEEN PL,
4,2020-01-01,2020-01-01,Mission,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",0329,X,Unknown,BEAUTY SUPPLY STORE,,IC,Invest Cont,14400 TITUS ST,


In [5]:
crimedf_obj.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1375881 entries, 0 to 1375880
Data columns (total 13 columns):
 #   Column                  Non-Null Count    Dtype 
---  ------                  --------------    ----- 
 0   Date Reported           1375881 non-null  object
 1   Date Occurred           1375881 non-null  object
 2   Area Name               1375881 non-null  object
 3   Crime Code Description  1375881 non-null  object
 4   MO Codes                1181546 non-null  object
 5   Victim Sex              1375881 non-null  object
 6   Victim Descent          1375881 non-null  object
 7   Premise Description     1374460 non-null  object
 8   Weapon Desc             271192 non-null   object
 9   Status                  779803 non-null   object
 10  Status Desc             779803 non-null   object
 11  Address                 1375881 non-null  object
 12  Cross Street            692955 non-null   object
dtypes: object(13)
memory usage: 136.5+ MB


A few things can already be flagged:  
- The 'Status' and 'Status Desc' columns are redundant, so one of them can be dropped.
- 'MO Codes' can hold multiple codes per record, so the column needs review because encoding may not work as-is.
- Missing values will need to be resolved unless their absence itself carries relevant information.
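
Each of these flags can be checked before acting on it. The sketch below runs the three checks on a small toy frame whose values are copied from the `head()` preview above. The variable names are illustrative, and the results only describe the toy rows, not the full dataset.

```python
import pandas as pd

# Toy rows copied from the head() preview; not the full dataset.
sample = pd.DataFrame({
    "Status": ["AO", "IC", "AA", "IC", "IC"],
    "Status Desc": ["Adult Other", "Invest Cont", "Adult Arrest", "Invest Cont", "Invest Cont"],
    "MO Codes": ["0444 0913", "0416 1822 1414", "1501", "0329 1402", "0329"],
    "Weapon Desc": [
        "STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",
        "UNKNOWN WEAPON/OTHER WEAPON",
        None,
        None,
        None,
    ],
})

# 1) Redundancy: each Status code should map to exactly one Status Desc, and vice versa.
desc_per_code = sample.groupby("Status")["Status Desc"].nunique()
code_per_desc = sample.groupby("Status Desc")["Status"].nunique()
is_one_to_one = bool((desc_per_code == 1).all() and (code_per_desc == 1).all())

# 2) MO Codes: count how many space-separated codes each record carries.
codes_per_record = sample["MO Codes"].str.split().str.len()

# 3) Missing values: share of missing entries per column, largest first.
missing_share = sample.isna().mean().sort_values(ascending=False)

print(is_one_to_one)
print(codes_per_record.tolist())
print(missing_share)
```

If the one-to-one check holds on the full data, dropping either column loses no information. A high missing share in a column such as 'Weapon Desc' may itself be informative (for example, no weapon involved), which is an assumption worth verifying before imputing.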

In [6]:
# Drop the Status column; Status Desc carries the same information in readable form

crimedf.drop(columns="Status", inplace=True)

In [7]:
# The "Unnamed: 0" column carries no information because it duplicates the index

crimedf.drop(columns="Unnamed: 0",inplace=True)

### 3.5.1 Encoding Categorical Features <a id='3.5.1_encoding'></a>

In [13]:
# Before encoding the categorical features, review the number of distinct values per feature

categorical_feature_names = ["Area Name", "Crime Code Description", "MO Codes", 
                             "Victim Sex", "Victim Descent", "Premise Description", 
                             "Weapon Desc", "Status Desc", "Address", "Cross Street"]

# The date columns are excluded because they are temporal variables, not categorical ones
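
Because the date columns are left out of the categorical list, they need separate handling. One possible approach, sketched below on three rows copied from the `head()` preview, is to parse them as datetimes and derive numeric features. The new column names are illustrative, and reading 'Time Occurred' as a 24-hour HHMM integer is an assumption based on values such as 2230 and 330.

```python
import pandas as pd

# Three rows copied from the head() preview; not the full dataset.
sample = pd.DataFrame({
    "Date Reported": ["2020-01-08", "2020-01-02", "2020-04-14"],
    "Date Occurred": ["2020-01-08", "2020-01-01", "2020-02-13"],
    "Time Occurred": [2230, 330, 1200],
})

# Parse the object-dtype date strings into real datetimes.
for col in ["Date Reported", "Date Occurred"]:
    sample[col] = pd.to_datetime(sample[col], format="%Y-%m-%d")

# Derive numeric features from the temporal columns.
sample["occurred_month"] = sample["Date Occurred"].dt.month
sample["occurred_dayofweek"] = sample["Date Occurred"].dt.dayofweek  # Monday = 0
sample["occurred_hour"] = sample["Time Occurred"] // 100  # assumes HHMM, e.g. 2230 -> 22
sample["report_delay_days"] = (sample["Date Reported"] - sample["Date Occurred"]).dt.days

print(sample[["occurred_month", "occurred_dayofweek", "occurred_hour", "report_delay_days"]])
```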

In [15]:
# nunique() ignores missing values by default
for feature in categorical_feature_names:
    print(f"Feature {feature} has {crimedf[feature].nunique()} unique values")

Feature Area Name has 21 unique values
Feature Crime Code Description has 139 unique values
Feature MO Codes has 370931 unique values
Feature Victim Sex has 3 unique values
Feature Victim Descent has 19 unique values
Feature Premise Description has 307 unique values
Feature Weapon Desc has 79 unique values
Feature Status Desc has 6 unique values
Feature Address has 71414 unique values
Feature Cross Street has 22588 unique values
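
The very large count for 'MO Codes' reflects distinct combinations of codes rather than distinct codes, since each record can list several codes separated by spaces. A hedged sketch of one way to handle this is to split the strings and multi-hot encode the individual codes. The toy values below come from the `head()` preview plus one missing entry. The sample is too small to show how much the vocabulary shrinks on the full data, which has not been measured here.

```python
import pandas as pd

# Toy values from the head() preview, plus one missing entry.
mo_codes = pd.Series(
    ["0444 0913", "0416 1822 1414", "1501", "0329 1402", "0329", None],
    name="MO Codes",
)

# Treated as whole strings, every distinct combination is its own category.
n_combinations = mo_codes.nunique()

# Splitting on whitespace exposes the vocabulary of individual codes.
individual_codes = mo_codes.dropna().str.split().explode()
n_individual = individual_codes.nunique()

# Multi-hot encoding: one 0/1 column per individual code.
# Records with no MO codes become rows of all zeros.
mo_multi_hot = mo_codes.fillna("").str.get_dummies(sep=" ")

print(n_combinations, n_individual)
print(mo_multi_hot)
```

Features such as 'Address' and 'Cross Street' also have tens of thousands of distinct values, so plain one-hot encoding would likely be impractical for them as well. Whether to drop, group, or otherwise encode them is left open here.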

