# 3 Pre-processing & Training Data Development <a id='3_Pre-processing_&_training_data_development'></a>


## 3.1 Contents <a id='31-contents'></a>

- [3.1 Contents](#31-contents)
- [3.2 Introduction](#32-introduction)
- [3.3 Imports](#33-imports)
- [3.4 Load The Data](#34-load-the-data)
- [3.5 Data Cleaning](#3.5_Data_cleaning)
    - [3.5.1 Imputing/Removing Missing Values](#351-imputing-missing-values)
- [3.6 Train/Test Split](#36-traintest-split)
- [3.7 Encoding Categorical Features](#37-encoding-categorical-features)
    - [3.7.1 Encoding with get_dummies](#371-encoding-with-get-dummies)
    - [3.7.2 Encoding with word2vec](#371-encoding-with-word2vec)
- [3.8 Scale the Data](#38-scale-the-data)
- [3.9 Train/Predict with a "Baseline Model"](#39-trainpredict-with-a-baseline-model)
- [3.10 Setting up Pipelines](#310-setting-up-pipelines)
    - [3.10.1 Define](#3101-define)
- [3.11 Fit/Train/Predict and Assess Models](#3102-fit-train-predict-and-assess)
- [3.12 Final Model Selection](#314-final-model-selection)
    - [3.12.1 Logistic Regression Model Performance](#3141-logistic-regression-model-performance)
    - [3.12.2 Random Forest Regression Model Performance](#3142-random-forest-regression-model-performance)
- [3.13 Conclusion](#315-conclusion)
 

## 3.2 Introduction <a id='32-introduction'></a>

This notebook continues "2.0-faa-exploratory-data-analysis.ipynb", focusing on feature engineering, training, and model selection.

Goals: impute missing values, scale the data, encode categorical features, perform a train/test split, create a pipeline, and select a model.

### **Problem Statement:**
This data science project aims to predict the age and sex of individuals who become victims of crime, using crime data and potentially other relevant variables. By analyzing patterns within crime data, we aim to develop predictive models that estimate the age and sex of victims. Such models have potential applications in law enforcement and victim support, and could help victim service providers target relevant areas.


## 3.3 Imports <a id='33-imports'></a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string 

from sklearn.preprocessing import StandardScaler, MinMaxScaler, FunctionTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import os
import pickle
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.compose import ColumnTransformer

from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.model_selection import cross_validate, GridSearchCV, learning_curve
from sklearn.dummy import DummyRegressor, DummyClassifier
from sklearn.linear_model import LinearRegression

from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier,\
                        GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.metrics import accuracy_score, precision_score,recall_score, f1_score,\
                                                        multilabel_confusion_matrix

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import datetime



## 3.4 Load The Data <a id='34-load-the-data'></a>

In [2]:
# Storing the file path in a variable and then using pd.read_csv() to load the data as a dataframe into crimedf

dataFilePath = "/Users/frankyaraujo/Development/Springboard_Main/Capstone Two/\
Springboard-Capstone-Two/src/data/2010-2023 Crime_Traffic_Collisions_Data_R2 .csv"
crimedf = pd.read_csv(dataFilePath, low_memory = False)

In [18]:
crimedf[(crimedf["Crime Code"]<900) & (crimedf["Crime Code"]>799)]\
["Crime Code Description"].unique()

array(['SEX OFFENDER REGISTRANT OUT OF COMPLIANCE',
       'BATTERY WITH SEXUAL CONTACT', 'TRESPASSING',
       'DISTURBING THE PEACE',
       'SEX,UNLAWFUL(INC MUTUAL CONSENT, PENETRATION W/ FRGN OBJ',
       'CRM AGNST CHLD (13 OR UNDER) (14-15 & SUSP 10 YRS OLDER)',
       'SEXUAL PENETRATION W/FOREIGN OBJECT', 'FAILURE TO YIELD',
       'INDECENT EXPOSURE', 'ORAL COPULATION',
       'SODOMY/SEXUAL CONTACT B/W PENIS OF ONE PERS TO ANUS OTH',
       'CHILD ANNOYING (17YRS & UNDER)', 'PIMPING',
       'HUMAN TRAFFICKING - COMMERCIAL SEX ACTS', 'CHILD PORNOGRAPHY',
       'PANDERING', 'DISRUPT SCHOOL', 'DRUGS, TO A MINOR',
       'CHILD ABANDONMENT',
       'BEASTIALITY, CRIME AGAINST NATURE SEXUAL ASSLT WITH ANIM',
       'FAILURE TO DISPERSE',
       'INCEST (SEXUAL ACTS BETWEEN BLOOD RELATIVES)', 'INCITING A RIOT'],
      dtype=object)

In [4]:
# Review of the data using .head() and .info()

crimedf.head()

Unnamed: 0.1,Unnamed: 0,DR Number,Date Reported,Date Occurred,Time Occurred,Area ID,Area Name,Reporting District,Crime Code,Crime Code Description,...,Status,Status Desc,Crm Cd 1,Crm Cd 2,Crm Cd 3,Crm Cd 4,Address,Cross Street,LAT,LON
0,0,10304468,2020-01-08,2020-01-08,2230,3,Southwest,377,624,BATTERY - SIMPLE ASSAULT,...,AO,Adult Other,624.0,,,,1100 W 39TH PL,,34.0141,-118.2978
1,1,190101086,2020-01-02,2020-01-01,330,1,Central,163,624,BATTERY - SIMPLE ASSAULT,...,IC,Invest Cont,624.0,,,,700 S HILL ST,,34.0459,-118.2545
2,2,200110444,2020-04-14,2020-02-13,1200,1,Central,155,845,SEX OFFENDER REGISTRANT OUT OF COMPLIANCE,...,AA,Adult Arrest,845.0,,,,200 E 6TH ST,,34.0448,-118.2474
3,3,191501505,2020-01-01,2020-01-01,1730,15,N Hollywood,1543,745,VANDALISM - MISDEAMEANOR ($399 OR UNDER),...,IC,Invest Cont,745.0,998.0,,,5400 CORTEEN PL,,34.1685,-118.4019
4,4,191921269,2020-01-01,2020-01-01,415,19,Mission,1998,740,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",...,IC,Invest Cont,740.0,,,,14400 TITUS ST,,34.2198,-118.4468


## 3.5 Data Cleaning <a id='3.5_Data_cleaning'></a>

The dataset still has missing values and potentially useless or redundant features, so this section focuses on cleaning the data. In addition, the categorical variables are reviewed to determine the most appropriate encoding technique.

In [5]:
# look at the missing values 

pd.DataFrame(crimedf.isnull().sum()/len(crimedf)*100).sort_values(by=0, ascending=False)

Unnamed: 0,0
Crm Cd 4,99.995857
Crm Cd 3,99.860526
Crm Cd 2,95.819115
Weapon Desc,80.289574
Weapon Used Cd,80.289574
Cross Street,49.635543
Crm Cd 1,43.324096
Status Desc,43.323369
Status,43.323369
Victim Age,20.101084


- Over 90% of Crm Cd 2, 3 and 4 is missing, so those columns can be dropped.

- About 80% of 'Weapon Used Cd' and 'Weapon Desc' is missing, so both columns can be dropped. The two columns have identical missing percentages, which suggests they describe the same information (a code and its description).
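
The drop decision above can also be expressed as a threshold rule instead of a hand-written column list. The following is a minimal sketch on a made-up toy frame; the helper name `drop_sparse_columns` and the 50% cutoff are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing_pct=50.0):
    """Return (copy of df without sparse columns, list of dropped column names)."""
    missing_pct = df.isnull().mean() * 100  # percentage of NaN per column
    to_drop = missing_pct[missing_pct >= max_missing_pct].index.tolist()
    return df.drop(columns=to_drop), to_drop


# Toy frame standing in for crimedf (values are made up for illustration)
toy = pd.DataFrame({
    "Crime Code": [624, 845, 745, 740, 121],
    "Crm Cd 2": [np.nan, np.nan, np.nan, 998.0, np.nan],  # 4 of 5 values missing
    "Victim Age": [36.0, np.nan, 40.0, 76.0, 31.0],       # 1 of 5 values missing
})
cleaned, dropped = drop_sparse_columns(toy, max_missing_pct=50.0)
print(dropped)
print(cleaned.columns.tolist())
```

A rule like this makes the cutoff explicit, but it should still be reviewed by hand: a target such as Victim Age should never be dropped automatically.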

In [6]:
# dropping columns with the majority of information missing 

crimedf.drop(columns=["Crm Cd 4",
"Crm Cd 3",
"Crm Cd 2",
"Weapon Desc",
"Weapon Used Cd"], inplace=True)

<a id='351-imputing-missing-values'></a>
### 3.5.1 Imputing/Removing Missing Values


The following features have missing values: 
- Cross Street, 49% Missing
- Crm Cd 1, 43% Missing
- Status Desc, 43% Missing
- Victim Age, 20% Missing
- MO Codes, 14% Missing
- Premise Description, <1% Missing
- Premise Code, <1% Missing

Note: Victim Age is a target variable, so it is kept out of the categorical imputation and handled on its own (it is mean-imputed separately in a later cell).

In [7]:
# quick look at cross street values
crimedf[crimedf["Cross Street"].notna()]["Cross Street"]

10                                   OLIVE
17                                 VERMONT
19                                    HILL
27                             LOS ANGELES
34                              SAN JULIAN
                        ...               
1375876    SATICOY                      ST
1375877    GUTHRIE                      AV
1375878    FULTON                       AV
1375879    OXNARD                       ST
1375880    LA TUNA CANYON               RD
Name: Cross Street, Length: 692955, dtype: object

Cross Street is a categorical feature, and LAT and LON have no missing values, so the location of the crime is already captured by existing features. Cross Street will be dropped.

Status and Status Desc are also redundant with each other, so Status will be dropped.

In [8]:
# dropping columns
crimedf.drop(columns=["Cross Street","Status"], inplace=True)

In [9]:
# Look at Status Desc values
crimedf["Status Desc"].value_counts()

Invest Cont     624160
Adult Other      83749
Adult Arrest     68087
Juv Arrest        2490
Juv Other         1314
UNK                  3
Name: Status Desc, dtype: int64

In [10]:
# Fill missing values with 'UNK' as that is an option for a label
# (assigning the result back avoids chained in-place modification warnings)
crimedf["Status Desc"] = crimedf["Status Desc"].fillna("UNK")

In [11]:
# Look at the unique values of Crm Cd 1
crimedf["Crm Cd 1"].unique()

array([624., 845., 745., 740., 121., 442., 946., 341., 330., 930., 648.,
       626., 440., 354., 210., 230., 310., 510., 420., 761., 236., 662.,
       350., 860., 480., 623., 956., 900., 888., 331., 901., 886., 421.,
       647., 940., 810., 922., 812., 220., 625., 755., 649., 434., 815.,
       251., 320., 890., 850., 668., 902., 664., 920., 343., 437., 753.,
       928., 910., 760., 762., 661., 351., 821., 237., 903., 813., 666.,
       820., 627., 805., 763., 441., 122., 443., 450., 520., 410., 352.,
       670., 951., 660., 654., 250., 110., 652., 933., 950., 231., 345.,
       822., 814., 932., 622., 471., 235., 470., 921., 906., 433., 651.,
       806., 943., 653., 436., 949., 446., 113., 487., 438., 451., 521.,
       439., 485., 944., 954., 756., 942.,  nan, 473., 347., 435., 880.,
       444., 475., 474., 931., 865., 349., 430., 353., 452., 870., 522.,
       924., 840., 948., 884., 904., 830., 432., 882., 445.])

According to the data source, Crm Cd 1 is the same as Crime Code, so this feature can be dropped.

In [12]:
# dropping Crm Cd 1
crimedf.drop(columns="Crm Cd 1", inplace=True)

In [13]:
# Finally -> MO Codes, Premise Description, and Premise Code
crimedf.loc[:,["MO Codes", "Premise Description", "Premise Code"]].\
sort_values(by="MO Codes").head(10)

Unnamed: 0,MO Codes,Premise Description,Premise Code
377002,100,SINGLE FAMILY DWELLING,501.0
638898,100,SINGLE FAMILY DWELLING,501.0
419782,100,CYBERSPACE,750.0
560308,100,SINGLE FAMILY DWELLING,501.0
182583,100,SINGLE FAMILY DWELLING,501.0
726725,100,OTHER RESIDENCE,504.0
38305,100,"MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)",502.0
118566,100,"MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)",502.0
182529,100,SINGLE FAMILY DWELLING,501.0
560383,100,"MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)",502.0


Premise Description and Premise Code carry the same information, so both features are not needed. Premise Code can be dropped because Premise Description is easier to interpret. Less than 1% of these values are missing, so dropping the affected rows would be the easiest way to handle them.

Lastly, about 14% of the MO Codes are missing. There may be some correlation between MO Codes and Premise Codes/Descriptions. The next step would be to determine whether imputation can be done based on the Premise features.

MO Codes values are object types whereas Premise Codes are floats. Vectorizing MO Codes will be necessary before determining correlation.

In [14]:
# Exploration of the MO Codes will be done before dropping any more features

mo_codes_obj = crimedf["MO Codes"].fillna("") # fill with empty strings for now
mo_codes_obj_sep = [i.split(' ') for i in mo_codes_obj]

mo_code_df = pd.DataFrame(mo_codes_obj_sep).fillna("")

In [15]:
# pulling all unique values per mo code column created previously
all_mo_codes=[]
for i in range(len(mo_code_df.columns)):
    all_mo_codes.extend(mo_code_df[i].unique())

In [16]:
# getting the unique values from the list of all mo codes created previously
unique_mo_codes = set(all_mo_codes)
len(unique_mo_codes)

760

There are 760 unique MO codes, so one-hot encoding would create up to 760 dimensions; this will be avoided. Determining the correlation between MO Codes and Premise Codes would require a more involved approach, and there isn't much value in pursuing it as they are only two of many features.

Instead, the missing values in MO Codes will be imputed with empty strings and the feature will be vectorized later.
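
For reference, the unique-code count can also be obtained without building a wide intermediate dataframe. This is a minimal sketch on made-up values, assuming MO Codes are whitespace-separated strings as in the sample rows. Because empty strings are dropped here, the count can be one lower than a count that treats the empty-string placeholder as a code.

```python
import pandas as pd

# Toy MO Codes column; real values are space-separated code strings
mo_codes = pd.Series(["0444 0913", "0416 1822 1414", None, "0329", "0329 0444"])

tokens = (
    mo_codes.fillna("")  # missing -> empty string, as in the notebook
    .str.split()         # split on whitespace; "" becomes an empty list
    .explode()           # one row per individual code
    .dropna()            # empty lists explode to NaN, so drop them
)
unique_codes = sorted(tokens.unique())
print(len(unique_codes), unique_codes)
```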

In [17]:
# the column was filled with empty strings above so it just needs to replace the original col
crimedf["MO Codes"] = mo_codes_obj

In [18]:
# missing values are imputed before train_test_split: Victim Age with the mean, the rest with 'unknown'
from sklearn.impute import SimpleImputer

num_imputer = SimpleImputer(strategy='mean')
df_victim_age = num_imputer.fit_transform(np.array(crimedf["Victim Age"]).reshape(-1, 1))

cat_imputer =  SimpleImputer(strategy='constant', fill_value="unknown")
df_imputed = cat_imputer.fit_transform(crimedf.drop(columns="Victim Age"))


In [19]:
# ensuring this feature keeps its name
df_victim_age = pd.DataFrame(df_victim_age,columns=["Victim Age"])

In [20]:
# ensuring these features keep their names
df_imputed = pd.DataFrame(df_imputed,columns=crimedf.drop(columns="Victim Age").columns)

In [21]:
# dropping columns with no information
df_imputed.drop(columns="Unnamed: 0", inplace=True)

In [22]:
# concatenating dataframes after imputing
crime_df = pd.concat([df_imputed,df_victim_age], ignore_index=False, axis=1)

In [23]:
crime_df.head()

Unnamed: 0,DR Number,Date Reported,Date Occurred,Time Occurred,Area ID,Area Name,Reporting District,Crime Code,Crime Code Description,MO Codes,Victim Sex,Victim Descent,Premise Code,Premise Description,Status Desc,Address,LAT,LON,Victim Age
0,10304468,2020-01-08,2020-01-08,2230,3,Southwest,377,624,BATTERY - SIMPLE ASSAULT,0444 0913,F,Black,501.0,SINGLE FAMILY DWELLING,Adult Other,1100 W 39TH PL,34.0141,-118.2978,36.0
1,190101086,2020-01-02,2020-01-01,330,1,Central,163,624,BATTERY - SIMPLE ASSAULT,0416 1822 1414,M,Hispanic/Latin/Mexican,102.0,SIDEWALK,Invest Cont,700 S HILL ST,34.0459,-118.2545,25.0
2,200110444,2020-04-14,2020-02-13,1200,1,Central,155,845,SEX OFFENDER REGISTRANT OUT OF COMPLIANCE,1501,X,Unknown,726.0,POLICE FACILITY,Adult Arrest,200 E 6TH ST,34.0448,-118.2474,40.387163
3,191501505,2020-01-01,2020-01-01,1730,15,N Hollywood,1543,745,VANDALISM - MISDEAMEANOR ($399 OR UNDER),0329 1402,F,White,502.0,"MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)",Invest Cont,5400 CORTEEN PL,34.1685,-118.4019,76.0
4,191921269,2020-01-01,2020-01-01,415,19,Mission,1998,740,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",0329,X,Unknown,409.0,BEAUTY SUPPLY STORE,Invest Cont,14400 TITUS ST,34.2198,-118.4468,31.0


In [24]:
# Premise Code and Premise Description are redundant, so Premise Code is dropped
crime_df.drop(columns="Premise Code", inplace=True)

In [25]:
# all missing values imputed
crime_df.isnull().sum()

DR Number                 0
Date Reported             0
Date Occurred             0
Time Occurred             0
Area ID                   0
Area Name                 0
Reporting District        0
Crime Code                0
Crime Code Description    0
MO Codes                  0
Victim Sex                0
Victim Descent            0
Premise Description       0
Status Desc               0
Address                   0
LAT                       0
LON                       0
Victim Age                0
dtype: int64

## 3.6 Train/Test Split  <a id="36-traintest-split"></a> 

In [27]:
crime_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1375881 entries, 0 to 1375880
Data columns (total 18 columns):
 #   Column                  Non-Null Count    Dtype  
---  ------                  --------------    -----  
 0   DR Number               1375881 non-null  object 
 1   Date Reported           1375881 non-null  object 
 2   Date Occurred           1375881 non-null  object 
 3   Time Occurred           1375881 non-null  object 
 4   Area ID                 1375881 non-null  object 
 5   Area Name               1375881 non-null  object 
 6   Reporting District      1375881 non-null  object 
 7   Crime Code              1375881 non-null  object 
 8   Crime Code Description  1375881 non-null  object 
 9   MO Codes                1375881 non-null  object 
 10  Victim Sex              1375881 non-null  object 
 11  Victim Descent          1375881 non-null  object 
 12  Premise Description     1375881 non-null  object 
 13  Status Desc             1375881 non-null  object 
 14  Ad

In [28]:
# Ensure the proper data types before splitting

crime_df["Date Reported"]=pd.to_datetime(crime_df["Date Reported"])
crime_df["Date Occurred"]=pd.to_datetime(crime_df["Date Occurred"])
crime_df['Time Occurred']=pd.to_datetime(crime_df['Time Occurred'], format='%H%M', \
                                         errors='coerce').dt.time

int_type_features=["DR Number","Area ID","Reporting District","Crime Code","Victim Age"]
for i in int_type_features:
    crime_df[i]=crime_df[i].astype(int, errors="ignore")
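
Times stored as integers lose their leading zeros (for example 330 for 03:30), which can make `%H%M` parsing ambiguous or fail for short values. The sketch below shows one way to normalise them first, assuming 'Time Occurred' holds military-time values like those in the sample rows; the toy series and its "unknown" entry are made up for illustration.

```python
import pandas as pd

# Toy values mimicking 'Time Occurred' (military time without leading zeros)
time_raw = pd.Series([2230, 330, 1200, 5, "unknown"])

padded = time_raw.astype(str).str.zfill(4)  # 330 -> "0330", 5 -> "0005"
parsed = pd.to_datetime(padded, format="%H%M", errors="coerce")  # invalid -> NaT

hours = parsed.dt.hour      # NaN where parsing failed
minutes = parsed.dt.minute
print(pd.DataFrame({"raw": time_raw, "hour": hours, "minute": minutes}))
```

Extracting the hour and minute directly from the parsed datetimes also avoids a second round of parsing later on.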

In [29]:
# datetime values will not work with certain models so let's pull useful information
crime_df['Year_Reported'] = crime_df['Date Reported'].dt.year
crime_df['Month_Reported'] = crime_df['Date Reported'].dt.month
crime_df['Day_Reported'] = crime_df['Date Reported'].dt.day

crime_df['Year_Occurred'] = crime_df['Date Occurred'].dt.year
crime_df['Month_Occurred'] = crime_df['Date Occurred'].dt.month
crime_df['Day_Occurred'] = crime_df['Date Occurred'].dt.day

# Extract hour and minute from the 'Time Occurred' column
crime_df['Hour_Occurred'] = pd.to_datetime(crime_df['Time Occurred'], format='%H:%M:%S').dt.hour
crime_df['Minute_Occurred'] = pd.to_datetime(crime_df['Time Occurred'], format='%H:%M:%S').dt.minute

In [30]:
# After adding the new columns, missing values were found within Time Occurred, which
# means missing values in Hour_Occurred and Minute_Occurred as well

# impute using SimpleImputer

dt_imputer =  SimpleImputer(strategy='mean')
crime_df['Hour_Occurred'] = dt_imputer.fit_transform(
    np.array(crime_df['Hour_Occurred']).reshape(-1,1) )
crime_df['Minute_Occurred'] = dt_imputer.fit_transform(
    np.array(crime_df['Minute_Occurred']).reshape(-1,1))

In [31]:
# we extracted the information from the datetime values as numerical data
# initial datetime columns can be dropped 

crime_df.drop(columns=["Date Reported","Date Occurred","Time Occurred" ],inplace=True)

In [32]:
crime_df.head()

Unnamed: 0,DR Number,Area ID,Area Name,Reporting District,Crime Code,Crime Code Description,MO Codes,Victim Sex,Victim Descent,Premise Description,...,LON,Victim Age,Year_Reported,Month_Reported,Day_Reported,Year_Occurred,Month_Occurred,Day_Occurred,Hour_Occurred,Minute_Occurred
0,10304468,3,Southwest,377,624,BATTERY - SIMPLE ASSAULT,0444 0913,F,Black,SINGLE FAMILY DWELLING,...,-118.2978,36,2020,1,8,2020,1,8,22.0,30.0
1,190101086,1,Central,163,624,BATTERY - SIMPLE ASSAULT,0416 1822 1414,M,Hispanic/Latin/Mexican,SIDEWALK,...,-118.2545,25,2020,1,2,2020,1,1,3.0,30.0
2,200110444,1,Central,155,845,SEX OFFENDER REGISTRANT OUT OF COMPLIANCE,1501,X,Unknown,POLICE FACILITY,...,-118.2474,40,2020,4,14,2020,2,13,12.0,0.0
3,191501505,15,N Hollywood,1543,745,VANDALISM - MISDEAMEANOR ($399 OR UNDER),0329 1402,F,White,"MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)",...,-118.4019,76,2020,1,1,2020,1,1,17.0,30.0
4,191921269,19,Mission,1998,740,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",0329,X,Unknown,BEAUTY SUPPLY STORE,...,-118.4468,31,2020,1,1,2020,1,1,4.0,15.0


In [33]:
# train_test_split data for target variable: victim age
X = crime_df.drop(columns=["Victim Sex","Victim Age"])
y_sex = crime_df["Victim Sex"]
y_age = crime_df["Victim Age"]

# split for Victim Age target variable
X_train_va, X_test_va, y_train_va, y_test_va = train_test_split(
    X, y_age, test_size=0.3, stratify=y_age, random_state=42)

In [34]:
# train_test_split data for target variable: victim sex
# In order to apply train_test_split, the Victim Sex feature needs to be encoded
l_encoder = LabelEncoder()
y_vs_encoded = l_encoder.fit_transform(y_sex.values)

X_train_vs, X_test_vs, y_train_vs, y_test_vs = train_test_split(
    X, y_vs_encoded, test_size=0.3, stratify=y_vs_encoded, random_state=42)

In [35]:
sex_class_names = l_encoder.classes_ # 0->F, 1->M, and 2->X (Unknown)
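
The integer-to-label mapping can be read off the encoder rather than tracked in a comment. A minimal sketch on made-up labels follows; the real column may contain additional categories.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'Victim Sex' column
y_sex_toy = ["F", "M", "X", "M", "F"]

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_sex_toy)

# classes_ is sorted, so position i is the label encoded as integer i
mapping = {code: label for code, label in enumerate(encoder.classes_)}
decoded = list(encoder.inverse_transform([2, 0]))
print(mapping)
print(decoded)
```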

## 3.7 Encoding Categorical Features <a id="37-encoding-categorical-features"></a> 

Since there are two target variables, there will be two predictive models, so each encoding step is applied to both sets of training/test data. In this section, the categorical variables are encoded so that all features are numeric.

In [36]:
''' These are the following variables:
X_train_va,X_test_va,y_train_va,y_test_va to predict Victim Age
X_train_vs, X_test_vs, y_train_vs, y_test_vs to predict Victim Sex 
'''
list_of_x_train_test_vars = [X_train_va,X_test_va,X_train_vs, X_test_vs]

X_train_va.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 963116 entries, 1212026 to 623061
Data columns (total 21 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   DR Number               963116 non-null  int64  
 1   Area ID                 963116 non-null  int64  
 2   Area Name               963116 non-null  object 
 3   Reporting District      963116 non-null  int64  
 4   Crime Code              963116 non-null  int64  
 5   Crime Code Description  963116 non-null  object 
 6   MO Codes                963116 non-null  object 
 7   Victim Descent          963116 non-null  object 
 8   Premise Description     963116 non-null  object 
 9   Status Desc             963116 non-null  object 
 10  Address                 963116 non-null  object 
 11  LAT                     963116 non-null  object 
 12  LON                     963116 non-null  object 
 13  Year_Reported           963116 non-null  int64  
 14  Month_Reported

In [37]:
# Before encoding all categorical features, the number of distinct values per feature will
# be reviewed to ensure that we are considering dimensionality

categorical_feature_names = X_train_va.select_dtypes(include='object').columns

In [38]:
# used X_train_va only since X_train_vs would have the same features and unique values
for i in categorical_feature_names:
    print(i,"has",X_train_va[i].nunique(),"unique values")
  

Area Name has 21 unique values
Crime Code Description has 137 unique values
MO Codes has 275184 unique values
Victim Descent has 19 unique values
Premise Description has 305 unique values
Status Desc has 6 unique values
Address has 65646 unique values
LAT has 5479 unique values
LON has 5028 unique values


One-hot encoding will be applied to the following variables because they have a low number of unique values:
1. Area Name
2. Victim Descent
3. Status Desc

The following variables will be preprocessed as text and then vectorized with word2vec because of their high number of unique values:
1. Crime Code Description 
2. MO Codes 
3. Premise Description 
4. Address 


#### **3.7.1 Encoding - get_dummies: Area Name, Victim Descent, Status Desc** <a id='371-encoding-with-get-dummies'></a>  

In [39]:
# Encoding Area Name, Victim Descent, Status Desc
area_descent_status = ["Area Name","Victim Descent","Status Desc"]

for index in range(len(list_of_x_train_test_vars)):
    list_of_x_train_test_vars[index] = pd.get_dummies(list_of_x_train_test_vars[index],\
                                        columns=area_descent_status,drop_first=True)
    

#### **3.7.2 Encoding - word2vec** <a id='371-encoding-with-word2vec'></a>

In [40]:
# Need to download data from NLTK to use stopwords.words('english') and word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# function definitions
# purpose: clean text by lower-casing, stripping punctuation and removing stop words

def preprocess(text):
    text = text.lower() # make strings lower case
    text = ''.join([char for char in text if char not in string.punctuation]) # remove punctuation
    tokens = word_tokenize(text) # split the text into tokens
    tokens = [word for word in tokens if word not in stop_words] # ensure no stopwords are present
    return ' '.join(tokens) # return complete text since tokenizing split the text

# function to vectorize the text: average the word vectors of a sentence
# note: relies on the global w2v_model trained in the next cell
def vectorize(sentence):
    words = sentence.split()
    words_vecs = [w2v_model.wv[word] for word in words if word in w2v_model.wv]
    if len(words_vecs) == 0:
        return np.zeros(w2v_model.vector_size) # no known words -> zero vector
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/frankyaraujo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/frankyaraujo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [41]:
# variables not encoded in 3.7.1
longer_text_columns = ["Crime Code Description","MO Codes","Premise Description", "Address"]  

feature_name_mapping = {}
for i in longer_text_columns: # loop through relevant features
    for index in range(len(list_of_x_train_test_vars)): # loop through list of train/test data

        original_feature_name = f"{i}_{index}"

        # preprocess text data
        list_of_x_train_test_vars[index][i] = list_of_x_train_test_vars[index][i]\
        .apply(preprocess)

        # train the word2vec model
        sentences = [sentence.split() for sentence in list_of_x_train_test_vars[index][i]]
        w2v_model = Word2Vec(sentences, vector_size=50, window=5, workers=4)

        # vectorize text data and replace column
        list_of_x_train_test_vars[index][i] = np.array([
            vectorize(sentence) for sentence in list_of_x_train_test_vars[index][i]])

        # Store new column names after vectorization
        new_feature_names = [f"{i}_vec_{j}" for j in range(w2v_model.vector_size)]
        feature_name_mapping[original_feature_name] = new_feature_names
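
The averaged sentence vector has one value per embedding dimension, so it is usually stored as one numeric column per dimension rather than inside a single column. The sketch below illustrates that step with a hypothetical two-dimensional lookup dictionary standing in for a trained model's `.wv`; the helpers `mean_pool` and `expand_text_column` are illustrative names, not part of the notebook. In practice the embedding would typically be trained on the training split only and reused for the test split so that both share one vector space.

```python
import numpy as np
import pandas as pd


def mean_pool(sentence, vectors, dim):
    """Average the vectors of known tokens; all-zero vector if none are known."""
    known = [vectors[tok] for tok in sentence.split() if tok in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)


def expand_text_column(df, column, vectors, dim):
    """Replace one text column with dim numeric columns named <column>_vec_<j>."""
    matrix = np.vstack([mean_pool(s, vectors, dim) for s in df[column]])
    names = [f"{column}_vec_{j}" for j in range(dim)]
    vec_df = pd.DataFrame(matrix, columns=names, index=df.index)
    return pd.concat([df.drop(columns=column), vec_df], axis=1)


# Hypothetical 2-dimensional lookup standing in for a word2vec model's .wv
toy_vectors = {"0444": np.array([1.0, 0.0]), "0913": np.array([0.0, 1.0])}
toy = pd.DataFrame({"MO Codes": ["0444 0913", "", "0444"], "Area ID": [3, 1, 1]})
out = expand_text_column(toy, "MO Codes", toy_vectors, dim=2)
print(out)
```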


In [42]:
# make a copy since this version has all column names intact
from copy import deepcopy

list_of_x_train_test_vars_cpy = deepcopy(list_of_x_train_test_vars)

In [66]:
# store version of preprocessed data that is not scaled
victim_age_feature_csv_unscaled = '/Users/frankyaraujo/Development/springboard_main/Capstone Two/\
Springboard-Capstone-Two/src/data/victim_age_feature_data_unscaled.csv'
victim_age_target_csv_unscaled = '/Users/frankyaraujo/Development/springboard_main/Capstone Two/\
Springboard-Capstone-Two/src/data/victim_age_target_data_unscaled.csv'

victim_sex_feature_csv_unscaled = '/Users/frankyaraujo/Development/springboard_main/Capstone Two/\
Springboard-Capstone-Two/src/data/victim_sex_feature_data_unscaled.csv'
victim_sex_target_csv_unscaled = '/Users/frankyaraujo/Development/springboard_main/Capstone Two/\
Springboard-Capstone-Two/src/data/victim_sex_target_data_unscaled.csv'

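# Write DataFrames to CSV files
# note: indices 0 and 1 hold the train and test feature sets for Victim Age, indices 2 and 3 those for Victim Sex;
# the files named *_target_* therefore contain the test-split features, not the target values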
list_of_x_train_test_vars_cpy[0].to_csv(victim_age_feature_csv_unscaled, index=False)
list_of_x_train_test_vars_cpy[1].to_csv(victim_age_target_csv_unscaled, index=False)

list_of_x_train_test_vars_cpy[2].to_csv(victim_sex_feature_csv_unscaled, index=False)
list_of_x_train_test_vars_cpy[3].to_csv(victim_sex_target_csv_unscaled, index=False)

## 3.8 Scale the Data <a id='38-scale-the-data'></a>

Since the data is a mixture of types (i.e. date/time parts, class labels, and continuous variables), each type needs its own approach. ColumnTransformer allows different column types to be handled according to their specific characteristics during scaling. In the cell below, the numeric columns are standardized with StandardScaler.

In [43]:
'''
LAT                                  
LON                              
Year_Reported                         
Month_Reported                        
Day_Reported                           
Year_Occurred                          
Month_Occurred                         
Day_Occurred                          
Hour_Occurred                          
Minute_Occurred 
'''
numerical_cols = ["LAT", "LON", "Year_Reported",
                  "Month_Reported", "Day_Reported",
                  "Year_Occurred", "Month_Occurred",
                  "Day_Occurred", "Hour_Occurred", "Minute_Occurred"]
num_scaler = StandardScaler()

for i in range(len(list_of_x_train_test_vars)):
    for j in numerical_cols:
        list_of_x_train_test_vars[i][j] = num_scaler.fit_transform(
            np.array(list_of_x_train_test_vars[i][j]).reshape(-1,1,))
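
Fitting the scaler separately on each split means the test data is standardized with its own mean and standard deviation. A common alternative is to estimate the statistics on the training split and reuse them for the test split. The following is a minimal sketch with `ColumnTransformer` on made-up toy frames; the column subset and values are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

numeric = ["LAT", "Hour_Occurred"]
train = pd.DataFrame({"LAT": [34.0, 34.1, 34.2, 34.3],
                      "Hour_Occurred": [0.0, 6.0, 12.0, 18.0],
                      "Area ID": [1, 3, 15, 19]})
test = pd.DataFrame({"LAT": [34.15], "Hour_Occurred": [9.0], "Area ID": [3]})

scaler = ColumnTransformer(
    [("num", StandardScaler(), numeric)],
    remainder="passthrough",  # leave the other columns untouched
)
train_scaled = scaler.fit_transform(train)  # mean/std estimated on the training split
test_scaled = scaler.transform(test)        # the same mean/std reused for the test split
print(test_scaled)
```

The scaled columns come first in the output array, followed by the passthrough columns.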


In [44]:
# take the encoded train/test data from list_of_x_train_test_vars and update the initial train/test variables

X_train_va,X_test_va,X_train_vs, X_test_vs = list_of_x_train_test_vars

## 3.9 Train/Predict with a "Baseline Model" <a id='39-trainpredict-with-a-baseline-model'></a>

#### Fit the dummy regressor

In [45]:
#Fit the dummy regressor on the training data - Victim Age
dumb_reg = DummyRegressor(strategy='mean')
dumb_reg.fit(X_train_va, y_train_va)

#### Assess dummy regressor performance

In [46]:
# Obtain predictions from the Dummy Regressor
y_pred_dummy_va = dumb_reg.predict(X_test_va)

# Calculate Mean Absolute Error (MAE)
mae_dummy = mean_absolute_error(y_test_va, y_pred_dummy_va)
print("Dummy Regressor - Mean Absolute Error:", mae_dummy)

# Calculate Mean Squared Error (MSE)
mse_dummy = mean_squared_error(y_test_va, y_pred_dummy_va)
print("Dummy Regressor - Mean Squared Error:", mse_dummy)

# Calculate Root Mean Squared Error (RMSE)
rmse_dummy = np.sqrt(mse_dummy)
print("Dummy Regressor - Root Mean Squared Error:", rmse_dummy)

# Calculate R-squared
r2_dummy = r2_score(y_test_va, y_pred_dummy_va)
print("Dummy Regressor - R-squared:", r2_dummy)

Dummy Regressor - Mean Absolute Error: 10.417887179842625
Dummy Regressor - Mean Squared Error: 206.23319913904604
Dummy Regressor - Root Mean Squared Error: 14.360821673534074
Dummy Regressor - R-squared: -2.3358559531061474e-10



Reminder: MAE measures the average absolute difference between the true values and the predictions.
- The Dummy Regressor has an MAE of approximately 10.42, which means that, on average, the predictions are off by around 10.42 years from the true victim age.

Reminder: MSE measures the average squared difference between the true values and the predictions.
- The Dummy Regressor has an MSE of approximately 206.23.

Reminder: RMSE is the square root of the MSE and provides a more interpretable scale.
- The Dummy Regressor has an RMSE of approximately 14.36. Like MAE, it is in the same units as the target variable, but it gives more weight to larger errors.

Reminder: R-squared measures the proportion of the variance in the dependent variable that is predictable from the independent variables.
- The Dummy Regressor has an R-squared of essentially zero (approximately -2.34e-10). This is expected: a model that always predicts the training mean explains none of the variance. R-squared is at most 1 and equals 0 for a model that predicts the mean of the evaluated targets; the tiny negative value arises because the training-set mean differs very slightly from the test-set mean.

Reminder: If the performance of an actual model is not significantly better than the Dummy Regressor, the model might not be learning meaningful patterns from the data.
- This Dummy Regressor serves as a baseline model. The following regression models will aim to outperform these metrics.
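
The relationship between these metrics can be checked on a tiny example. The sketch below uses made-up ages chosen so that the training mean equals the test mean, in which case R-squared, defined as 1 - SS_res / SS_tot, comes out as zero for the mean predictor.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up ages for illustration
y_train = np.array([20.0, 30.0, 40.0, 50.0])
y_test = np.array([25.0, 35.0, 45.0])
X_train = np.zeros((len(y_train), 1))  # features are ignored by the dummy model
X_test = np.zeros((len(y_test), 1))

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
pred = baseline.predict(X_test)  # every prediction equals mean(y_train) = 35

mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
print(mae, rmse, r2)
```

Here the RMSE is larger than the MAE because squaring emphasises the two larger errors over the zero error.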

#### Fit the dummy classifier

In [47]:
# Create and fit a DummyClassifier - Victim Sex
dummy_classifier = DummyClassifier(strategy='stratified')  # Other strategies: 'most_frequent', 'prior', 'uniform', 'constant'
dummy_classifier.fit(X_train_vs, y_train_vs)

#### Assess dummy classifier performance

In [48]:
# Obtain predictions from the Dummy Classifier
y_pred_dummy = dummy_classifier.predict(X_test_vs)

# Calculate Accuracy
accuracy = accuracy_score(y_test_vs, y_pred_dummy)
print("Dummy Classifier - Accuracy:", accuracy)

# Calculate micro-averaged Precision, Recall, and F1 Score (aggregated over all classes)
precision = precision_score(y_test_vs, y_pred_dummy, average='micro')
recall = recall_score(y_test_vs, y_pred_dummy, average='micro')
f1 = f1_score(y_test_vs, y_pred_dummy, average='micro')

print("Dummy Classifier - Precision:", precision)
print("Dummy Classifier - Recall:", recall)
print("Dummy Classifier - F1 Score:", f1)

# Multilabel Confusion Matrix
conf_matrix = multilabel_confusion_matrix(y_test_vs, y_pred_dummy)
print("Dummy Classifier - Multilabel Confusion Matrix:")
print(conf_matrix)

Dummy Classifier - Accuracy: 0.39445447167274356
Dummy Classifier - Precision: 0.39445447167274356
Dummy Classifier - Recall: 0.39445447167274356
Dummy Classifier - F1 Score: 0.39445447167274356
Dummy Classifier - Multilabel Confusion Matrix:
[[[164439  96094]
  [ 96395  55837]]

 [[107744 103227]
  [103269  98525]]

 [[303399  50627]
  [ 50284   8455]]]


The accuracy is approximately 39.45%, meaning the Dummy Classifier predicts the correct class for about 39.45% of the test instances. This is the baseline performance level.

Reminder: Precision is the proportion of true positives among all predicted positives, recall is the proportion of true positives among all actual positives, and F1 score is the harmonic mean of precision and recall. 

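All three values are 39.45%, identical to the accuracy. This is expected: with `average='micro'` on a single-label multiclass target, precision, recall, and F1 all reduce to accuracy, so they add no information beyond the accuracy score here.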

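The Multilabel Confusion Matrix breaks down the number of true negatives, false positives, false negatives, and true positives for each class. Each 2x2 block corresponds to one class treated as "positive" against all the others and is laid out as [[TN, FP], [FN, TP]].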
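
The printed matrices can be unpacked by hand to check the scores above. The sketch below copies the three 2x2 blocks from the output and assumes scikit-learn's documented layout of `[[TN, FP], [FN, TP]]` per class. The blocks are indexed by position because the output does not show which `Victim Sex` value each one belongs to.

```python
# One 2x2 block per class, copied from the output above: [[TN, FP], [FN, TP]]
conf_matrix = [
    [[164439, 96094], [96395, 55837]],
    [[107744, 103227], [103269, 98525]],
    [[303399, 50627], [50284, 8455]],
]

total_tp = total_fp = total_fn = 0
per_class = []
for (tn, fp), (fn, tp) in conf_matrix:
    per_class.append({
        "precision": tp / (tp + fp),  # correct among predicted positives
        "recall": tp / (tp + fn),     # correct among actual positives
        "support": tp + fn,           # number of test rows in this class
    })
    total_tp += tp
    total_fp += fp
    total_fn += fn

# Micro-averaging pools the counts over all classes before dividing
micro_precision = total_tp / (total_tp + total_fp)
micro_recall = total_tp / (total_tp + total_fn)

# Every test row belongs to exactly one class, so the supports sum to the test size
n_samples = sum(c["support"] for c in per_class)
accuracy = total_tp / n_samples

for i, c in enumerate(per_class):
    print(f"class {i}: precision={c['precision']:.4f} "
          f"recall={c['recall']:.4f} support={c['support']}")
print(f"micro precision={micro_precision:.4f} "
      f"micro recall={micro_recall:.4f} accuracy={accuracy:.4f}")
```

The pooled false positives and pooled false negatives are the same count, because each wrong prediction is a false positive for one class and a false negative for another. That is why micro precision, micro recall, and accuracy coincide, while the per-class values differ.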

With this baseline performance from the Dummy Classifier, we can proceed to evaluate other models and/or fine-tune hyperparameters to improve performance.

In [49]:
X_train_va, X_test_va, y_train_va, y_test_va
X_train_vs, X_test_vs, y_train_vs, y_test_vs

(         DR Number  Area ID  Reporting District  Crime Code  \
 455401   221114436       11                1181         420   
 84164    200300865        3                 357         210   
 1257118  171208610       12                1273         997   
 449189   221911252       19                1967         440   
 1193558  160908840        9                 909         997   
 ...            ...      ...                 ...         ...   
 946235   110515754        5                 562         997   
 773242   230306639        3                 311         420   
 425353   221004265       10                1023         236   
 1253796  171014776       10                1008         997   
 800819   191425336       14                1435         997   
 
          Crime Code Description  MO Codes  Premise Description   Address  \
 455401                 0.505616  1.193725            -0.001072 -0.097334   
 84164                 -0.772781  0.705511             0.799432  0.899680   

In [61]:
# combining and converting to dataframes to store the preprocessed data
victim_age_feature_data_df = pd.DataFrame(np.concatenate((X_train_va, X_test_va),axis=0),
                                         columns=X_train_va.columns)
victim_age_target_data_df = pd.DataFrame(np.concatenate((y_train_va, y_test_va),axis=0),
                                                        columns=["Victim Age"])

victim_sex_feature_data_df = pd.DataFrame(np.concatenate((X_train_vs, X_test_vs),axis=0), 
                                                         columns=X_train_vs.columns)
victim_sex_target_data_df = pd.DataFrame(np.concatenate((y_train_vs, y_test_vs),axis=0),
                                                        columns=["Victim Sex"])

In [62]:
# store the preprocessed data
victim_age_feature_csv = '/Users/frankyaraujo/Development/springboard_main/Capstone Two/\
Springboard-Capstone-Two/src/data/victim_age_feature_data.csv'
victim_age_target_csv = '/Users/frankyaraujo/Development/springboard_main/Capstone Two/\
Springboard-Capstone-Two/src/data/victim_age_target_data.csv'

victim_sex_feature_csv = '/Users/frankyaraujo/Development/springboard_main/Capstone Two/\
Springboard-Capstone-Two/src/data/victim_sex_feature_data.csv'
victim_sex_target_csv = '/Users/frankyaraujo/Development/springboard_main/Capstone Two/\
Springboard-Capstone-Two/src/data/victim_sex_target_data.csv'

# Write DataFrames to CSV files
victim_age_feature_data_df.to_csv(victim_age_feature_csv, index=False)
victim_age_target_data_df.to_csv(victim_age_target_csv, index=False)

victim_sex_feature_data_df.to_csv(victim_sex_feature_csv, index=False)
victim_sex_target_data_df.to_csv(victim_sex_target_csv, index=False)

## 3.10 Setting up Pipelines  <a id="310-setting-up-pipelines"></a>

Managing two target variables, one for regression (Victim Age) and one for classification (Victim Sex), requires a separate set of models for each. Because the data has already been scaled and transformed, the pipelines only need to hold the estimators being compared.

#### 3.10.1 Define Pipelines <a id="3101-define"></a>

In [63]:
# Define the pipelines: each entry is a (name, estimator) pair with default hyperparameters

# Pipelines for regression
regression_pipelines = [
    ('RandomForest', RandomForestRegressor()),
    ('GradientBoosting', GradientBoostingRegressor()),
    ('LinearRegression', LinearRegression())
]

# Pipelines for classification
classification_pipelines = [
    ('RandomForest', RandomForestClassifier()),
    ('GradientBoosting', GradientBoostingClassifier()),
    ('LogisticRegression', LogisticRegression())
]
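
The lists above hold bare estimators because scaling was done earlier in the notebook. As a hedged alternative, the sketch below shows how a scaler and an estimator could be bundled in a scikit-learn `Pipeline`, so that the scaler is fitted on the training split only. The data is synthetic and the step names `"scaler"` and `"model"` are illustrative choices, not names from this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the crime dataset): y = 3*x0 - 2*x1 + small noise
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Simple ordered split: first 150 rows for training, last 50 held out
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# fit() learns the scaling statistics from X_train only, then fits the model;
# score() reuses those statistics to transform X_test before predicting
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
pipe.fit(X_train, y_train)
r2 = pipe.score(X_test, y_test)
print(f"R-squared on held-out rows: {r2:.3f}")
```

A `Pipeline` object can be used anywhere a bare estimator is used, so it could replace an entry in `regression_pipelines` or `classification_pipelines` without changing the evaluation loop.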


## 3.11 Fit/Train/Predict and Assess Models  <a id="3102-fit-train-predict-and-assess"></a>

In [64]:
import time
# Record start time
start_time = time.time()

# Note: despite the *_sample names, no subsampling is done here; the full splits are used
X_train_va_sample = pd.DataFrame(X_train_va)
y_train_va_sample = pd.DataFrame(y_train_va).values.ravel()  # Convert to 1D

X_test_va_sample = pd.DataFrame(X_test_va)
y_test_va_sample = pd.DataFrame(y_test_va).values.ravel()  # Convert to 1D 

X_train_vs_sample = pd.DataFrame(X_train_vs)
y_train_vs_sample = y_train_vs 

X_test_vs_sample = pd.DataFrame(X_test_vs)
y_test_vs_sample = y_test_vs

# Results for regression
results_va = []
for name, model in regression_pipelines:
    model.fit(X_train_va_sample, y_train_va_sample)
    y_pred_va = model.predict(X_test_va_sample)
    mse_va = mean_squared_error(y_test_va_sample, y_pred_va)
    results_va.append({'Model': name, 'MSE': mse_va})

# Results for classification
results_vs = []
for name, model in classification_pipelines:
    model.fit(X_train_vs_sample, y_train_vs_sample)
    y_pred_vs = model.predict(X_test_vs_sample)
    accuracy_vs = accuracy_score(y_test_vs_sample, y_pred_vs)
    results_vs.append({'Model': name, 'Accuracy': accuracy_vs})

# Display results for regression
results_df_va = pd.DataFrame(results_va)
print("Regression Results:")
print(results_df_va)

# Display results for classification
results_df_vs = pd.DataFrame(results_vs)
print("\nClassification Results:")
print(results_df_vs)


end_time = time.time()
# Calculate elapsed time
elapsed_time = end_time - start_time
print("\nElapsed Time:", elapsed_time, "seconds")

Regression Results:
              Model         MSE
0      RandomForest  211.419078
1  GradientBoosting  214.119721
2  LinearRegression  212.264525

Classification Results:
                Model  Accuracy
0        RandomForest  0.619956
1    GradientBoosting  0.549339
2  LogisticRegression  0.488884

Elapsed Time: 9611.328473091125 seconds
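
The elapsed time above covers all six models together. A minimal sketch of how the loop could record the fitting time of each model separately follows. The helper `evaluate_models` and the toy `MeanModel` class are hypothetical names introduced for this example; any object with `fit` and `predict` methods, including the scikit-learn estimators above, should work in their place.

```python
import time

class MeanModel:
    """Toy estimator with a fit/predict interface; predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)

def mean_squared_error_py(y_true, y_pred):
    """Plain-Python MSE so the example has no third-party dependencies."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def evaluate_models(models, X_train, y_train, X_test, y_test, metric, metric_name):
    """Fit each (name, model) pair, score it, and record the fit time in seconds."""
    results = []
    for name, model in models:
        start = time.perf_counter()
        model.fit(X_train, y_train)
        fit_seconds = time.perf_counter() - start
        score = metric(y_test, model.predict(X_test))
        results.append({"Model": name, metric_name: score, "Fit seconds": fit_seconds})
    return results

# Tiny made-up data: training mean is 25, so both test errors are 5 and MSE is 25
results = evaluate_models(
    [("MeanBaseline", MeanModel())],
    X_train=[[0], [1], [2], [3]], y_train=[10, 20, 30, 40],
    X_test=[[4], [5]], y_test=[20, 30],
    metric=mean_squared_error_py, metric_name="MSE",
)
print(results)
```

Per-model timings make it easier to see which estimator dominates a long run before deciding what to tune or subsample.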


## 3.12 Final Model Selection <a id='314-final-model-selection'></a>

#### 3.12.1 Regression Model Performance <a id='3141-logistic-regression-model-performance'></a>
   

In [65]:
results_df_va

|   | Model | MSE |
|---|-------|-----|
| 0 | RandomForest | 211.419078 |
| 1 | GradientBoosting | 214.119721 |
| 2 | LinearRegression | 212.264525 |



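Among the three regression models, Random Forest has the lowest Mean Squared Error (211.42), but the differences between the models are small. All three MSE values are also higher than the Dummy Regressor baseline reported in section 3.9 (206.23), so none of the regression models currently improves on the baseline.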
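
The comparison with the baseline can be checked with a few lines of arithmetic. The sketch below uses only the MSE values printed in this notebook and assumes the Dummy Regressor and the three models were scored on the same test split.

```python
import math

dummy_mse = 206.23319913904604  # Dummy Regressor MSE from section 3.9
model_mse = {
    "RandomForest": 211.419078,
    "GradientBoosting": 214.119721,
    "LinearRegression": 212.264525,
}

comparison = []
for name, mse in model_mse.items():
    comparison.append({
        "Model": name,
        "RMSE": math.sqrt(mse),  # same units as the target
        "MSE vs dummy (%)": 100.0 * (mse - dummy_mse) / dummy_mse,
    })

for row in comparison:
    print(f"{row['Model']:<18} RMSE={row['RMSE']:.2f} "
          f"MSE vs dummy={row['MSE vs dummy (%)']:+.2f}%")

# Models whose MSE is lower than the baseline (negative percentage)
beats_baseline = [r["Model"] for r in comparison if r["MSE vs dummy (%)"] < 0]
print("Models beating the baseline:", beats_baseline)
```

A positive percentage means the model's error is larger than the baseline's, which is the case for all three models here.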

#### 3.12.2 Classifier Model Performance <a id='3142-random-forest-regression-model-performance'></a>


In [66]:
results_df_vs

|   | Model | Accuracy |
|---|-------|----------|
| 0 | RandomForest | 0.619956 |
| 1 | GradientBoosting | 0.549339 |
| 2 | LogisticRegression | 0.488884 |


RandomForest has the highest accuracy (0.620), followed by GradientBoosting (0.549) and LogisticRegression (0.489). All three exceed the Dummy Classifier baseline from section 3.9 (0.394). RandomForest is therefore the leading candidate for the classification model, with GradientBoosting as an alternative.
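
The stratified Dummy Classifier is a fairly weak baseline, so it helps to see what always predicting the largest class would score. The sketch below derives the test-set class counts from the multilabel confusion matrix printed in section 3.9 (FN + TP per class). It assumes the class proportions in the training split are close to those in the test split, which is what makes the sum of squared proportions a reasonable estimate for the stratified strategy.

```python
# Test-set class counts (FN + TP per class) from the confusion matrix in section 3.9
supports = [96395 + 55837, 103269 + 98525, 50284 + 8455]
n = sum(supports)
proportions = [s / n for s in supports]

# Always predicting the largest class is correct for exactly that class's share of rows
majority_accuracy = max(proportions)

# Random guessing in proportion to class frequency is right with probability p_i
# on a row of class i, giving an expected accuracy of sum(p_i ** 2)
stratified_expected = sum(p * p for p in proportions)

model_accuracy = {
    "RandomForest": 0.619956,
    "GradientBoosting": 0.549339,
    "LogisticRegression": 0.488884,
}
lift = {name: acc - majority_accuracy for name, acc in model_accuracy.items()}

print(f"majority-class accuracy:      {majority_accuracy:.6f}")
print(f"expected stratified accuracy: {stratified_expected:.4f}")
for name, gain in lift.items():
    print(f"{name:<20} lift over majority class: {gain:+.6f}")
```

The majority-class share matches the LogisticRegression accuracy to the six decimals shown, which suggests that model may be predicting the largest class for nearly every row. The gains of RandomForest and GradientBoosting over this stronger baseline are smaller than their gains over the stratified dummy.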

## 3.13 Conclusion <a id='315-conclusion'></a>
   

These results are not conclusive. Model selection should also take into account interpretability, computational resources (the run above took roughly 9,611 seconds, about 2.7 hours), and the specific requirements of the application. In the next notebook we will fine-tune hyperparameters and potentially explore other models to ensure the robustness of our model selection.