# 3 Pre-processing & Training Data Development <a id='3_Pre-processing_&_training_data_development'></a>


## 3.1 Contents <a id='3.1_Contents'></a>

- [3.1 Contents](#3.1_Contents)
- [3.2 Introduction](#3.2_Introduction)
- [3.3 Imports](#3.3_Imports)
- [3.4 Load The Data](#3.4_Load_The_Data)
- [3.5 Data Cleaning](#3.5_Data_cleaning)
    - [3.5.1 Encoding Categorical Features](#3.5.1_encoding)
        - [3.5.1.1 Encoding with get_dummies](#3.5.1.1_Encoding_with_get_Dummies)
        - [3.5.1.2 Encoding with word2vec](#3.5.1.2_Encoding_with_word2vec)
    - [3.5.2 Imputing Missing Values](#3.5.2_Imputing_Missing_Values)
    

## 3.2 Introduction <a id="3.2_Introduction"></a>

This is a continuation of "2.0-faa-exploratory-data-analysis.ipynb" focusing on feature engineering, training and model selection. 

Goals: Impute missing values, scale the data, encode categorical features, create a train/test split, build a pipeline, and select a model.

### **Problem Statement:**
This data science project aims to predict the age and sex of crime victims from crime data and, potentially, other relevant variables. By analyzing patterns in the crime data, we aim to develop predictive models that estimate victim age and sex. Such models have applications in law enforcement and victim support, including helping victim service providers target relevant areas.


## 3.3 Imports <a id='3.3_Imports'></a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string 

from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


#import seaborn as sns
#from datetime import datetime
#from scipy import stats
#from math import trunc
#import os

## 3.4 Load The Data <a id='3.4_Load_The_Data'></a> 

In [2]:
# Store the file path in a variable, then use pd.read_csv() to load the data into the crimedf DataFrame

dataFilePath = "/Users/frankyaraujo/Development/Springboard_Main/Capstone Two/\
Springboard-Capstone-Two/src/data/2010-2023 Crime_Traffic_Collisions_Data_R2 .csv"
crimedf = pd.read_csv(dataFilePath, low_memory = False)

In [3]:
# Review the data using .head()

crimedf.head()

Unnamed: 0.1,Unnamed: 0,DR Number,Date Reported,Date Occurred,Time Occurred,Area ID,Area Name,Reporting District,Crime Code,Crime Code Description,...,Status,Status Desc,Crm Cd 1,Crm Cd 2,Crm Cd 3,Crm Cd 4,Address,Cross Street,LAT,LON
0,0,10304468,2020-01-08,2020-01-08,2230,3,Southwest,377,624,BATTERY - SIMPLE ASSAULT,...,AO,Adult Other,624.0,,,,1100 W 39TH PL,,34.0141,-118.2978
1,1,190101086,2020-01-02,2020-01-01,330,1,Central,163,624,BATTERY - SIMPLE ASSAULT,...,IC,Invest Cont,624.0,,,,700 S HILL ST,,34.0459,-118.2545
2,2,200110444,2020-04-14,2020-02-13,1200,1,Central,155,845,SEX OFFENDER REGISTRANT OUT OF COMPLIANCE,...,AA,Adult Arrest,845.0,,,,200 E 6TH ST,,34.0448,-118.2474
3,3,191501505,2020-01-01,2020-01-01,1730,15,N Hollywood,1543,745,VANDALISM - MISDEAMEANOR ($399 OR UNDER),...,IC,Invest Cont,745.0,998.0,,,5400 CORTEEN PL,,34.1685,-118.4019
4,4,191921269,2020-01-01,2020-01-01,415,19,Mission,1998,740,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",...,IC,Invest Cont,740.0,,,,14400 TITUS ST,,34.2198,-118.4468


## 3.5 Data Cleaning <a id='3.5_Data_cleaning'></a>

The dataset still has missing values, categorical data, and potentially useless or redundant features, so this section focuses on cleaning the data.

In [4]:
# quick look at the object data 

crimedf_obj = crimedf.select_dtypes(include=[object])
crimedf_obj.head()

Unnamed: 0,Date Reported,Date Occurred,Area Name,Crime Code Description,MO Codes,Victim Sex,Victim Descent,Premise Description,Weapon Desc,Status,Status Desc,Address,Cross Street
0,2020-01-08,2020-01-08,Southwest,BATTERY - SIMPLE ASSAULT,0444 0913,F,Black,SINGLE FAMILY DWELLING,"STRONG-ARM (HANDS, FIST, FEET OR BODILY FORCE)",AO,Adult Other,1100 W 39TH PL,
1,2020-01-02,2020-01-01,Central,BATTERY - SIMPLE ASSAULT,0416 1822 1414,M,Hispanic/Latin/Mexican,SIDEWALK,UNKNOWN WEAPON/OTHER WEAPON,IC,Invest Cont,700 S HILL ST,
2,2020-04-14,2020-02-13,Central,SEX OFFENDER REGISTRANT OUT OF COMPLIANCE,1501,X,Unknown,POLICE FACILITY,,AA,Adult Arrest,200 E 6TH ST,
3,2020-01-01,2020-01-01,N Hollywood,VANDALISM - MISDEAMEANOR ($399 OR UNDER),0329 1402,F,White,"MULTI-UNIT DWELLING (APARTMENT, DUPLEX, ETC)",,IC,Invest Cont,5400 CORTEEN PL,
4,2020-01-01,2020-01-01,Mission,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",0329,X,Unknown,BEAUTY SUPPLY STORE,,IC,Invest Cont,14400 TITUS ST,


In [5]:
crimedf_obj.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1375881 entries, 0 to 1375880
Data columns (total 13 columns):
 #   Column                  Non-Null Count    Dtype 
---  ------                  --------------    ----- 
 0   Date Reported           1375881 non-null  object
 1   Date Occurred           1375881 non-null  object
 2   Area Name               1375881 non-null  object
 3   Crime Code Description  1375881 non-null  object
 4   MO Codes                1181546 non-null  object
 5   Victim Sex              1375881 non-null  object
 6   Victim Descent          1375881 non-null  object
 7   Premise Description     1374460 non-null  object
 8   Weapon Desc             271192 non-null   object
 9   Status                  779803 non-null   object
 10  Status Desc             779803 non-null   object
 11  Address                 1375881 non-null  object
 12  Cross Street            692955 non-null   object
dtypes: object(13)
memory usage: 136.5+ MB


There are a few things that can be flagged already:
- The 'Status' and 'Status Desc' columns are redundant, so one of them can be dropped
- A single record can hold multiple MO codes, so this column will have to be reviewed as encoding may not work as-is
- Missing values will need to be resolved unless the missing values provide relevant information
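
The first and third flags above can be checked directly. The following is a minimal sketch on a hypothetical toy frame (`toy` and its rows are made up for illustration; in the notebook the same calls would run on `crimedf`). It tests whether every status code maps to a single description and reports the share of missing values per column.

```python
import pandas as pd

# Hypothetical toy rows; the real notebook would use crimedf instead.
toy = pd.DataFrame({
    "Status": ["AO", "IC", "AA", "IC", None],
    "Status Desc": ["Adult Other", "Invest Cont", "Adult Arrest", "Invest Cont", None],
    "Weapon Desc": [None, "UNKNOWN WEAPON/OTHER WEAPON", None, None, None],
})

# Number of distinct descriptions per status code.
# If every code has exactly one description, one of the two columns is redundant.
desc_per_code = toy.groupby("Status")["Status Desc"].nunique()
is_redundant = bool((desc_per_code == 1).all())

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

print(is_redundant)
print(missing_share)
```

A column with a large missing share, such as 'Weapon Desc' here, may be a case where the absence of a value is itself informative.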

In [6]:
# drop Status column 

crimedf.drop(columns="Status", inplace=True)

In [7]:
# the "Unnamed: 0" column carries no information because it duplicates the index

crimedf.drop(columns="Unnamed: 0",inplace=True)

In [8]:
# convert date columns to datetime

crimedf["Date Reported"]=pd.to_datetime(crimedf["Date Reported"])
crimedf["Date Occurred"]=pd.to_datetime(crimedf["Date Occurred"])
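
Once both columns are datetimes, date arithmetic and the `.dt` accessor become available. The sketch below uses three date pairs copied from the `.head()` output above. The derived column names (`Days To Report`, `Occurred Month`) are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Date pairs taken from the first rows shown by crimedf.head()
toy = pd.DataFrame({
    "Date Reported": ["2020-01-08", "2020-01-02", "2020-04-14"],
    "Date Occurred": ["2020-01-08", "2020-01-01", "2020-02-13"],
})

for col in ["Date Reported", "Date Occurred"]:
    toy[col] = pd.to_datetime(toy[col])

# Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days.
toy["Days To Report"] = (toy["Date Reported"] - toy["Date Occurred"]).dt.days

# Calendar components can be pulled out as plain integers.
toy["Occurred Month"] = toy["Date Occurred"].dt.month

print(toy)
```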

### 3.5.1 Encoding Categorical Features <a id='3.5.1_encoding'></a>

In [9]:
# Before encoding all categorical features, review the number of distinct values per feature
# so that the dimensionality added by encoding is taken into account

categorical_feature_names = ["Area Name", "Crime Code Description", "MO Codes", 
                             "Victim Sex", "Victim Descent", "Premise Description", 
                             "Weapon Desc", "Status Desc", "Address", "Cross Street"]

# Date column names were not included as they are temporal variables 

In [10]:
# enumerate keeps track of the # of variables we're looking at
for j, i in enumerate(categorical_feature_names, start=1):
    print(j,"Feature",i,"has",crimedf[i].nunique(),"unique values")

1 Feature Area Name has 21 unique values
2 Feature Crime Code Description has 139 unique values
3 Feature MO Codes has 370931 unique values
4 Feature Victim Sex has 3 unique values
5 Feature Victim Descent has 19 unique values
6 Feature Premise Description has 307 unique values
7 Feature Weapon Desc has 79 unique values
8 Feature Status Desc has 6 unique values
9 Feature Address has 71414 unique values
10 Feature Cross Street has 22588 unique values


One-hot encoding will be applied to the following variables because they have few unique values:
1. Area Name
2. Victim Sex
3. Victim Descent
4. Status Desc

The following variables will be cleaned as text and then encoded with a vectorizer because of their high number of unique values:
1. Crime Code Description 
2. MO Codes 
3. Premise Description 
4. Address 
5. Cross Street 
6. Weapon Desc
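
The split above can be reproduced with a simple cardinality rule. This is a sketch only: the counts are copied from the output above, and the threshold `MAX_ONE_HOT_LEVELS = 25` is a hypothetical cutoff chosen to match the two lists. The notebook does not state an explicit threshold.

```python
# Unique-value counts copied from the printed output above
unique_counts = {
    "Area Name": 21,
    "Crime Code Description": 139,
    "MO Codes": 370931,
    "Victim Sex": 3,
    "Victim Descent": 19,
    "Premise Description": 307,
    "Weapon Desc": 79,
    "Status Desc": 6,
    "Address": 71414,
    "Cross Street": 22588,
}

MAX_ONE_HOT_LEVELS = 25  # hypothetical cutoff, not taken from the notebook

# Low-cardinality features are cheap to one-hot encode.
one_hot_columns = [c for c, n in unique_counts.items() if n <= MAX_ONE_HOT_LEVELS]

# High-cardinality features would add too many dummy columns.
vectorized_columns = [c for c, n in unique_counts.items() if n > MAX_ONE_HOT_LEVELS]

print(one_hot_columns)
print(vectorized_columns)
```

One-hot encoding 'MO Codes' alone would add several hundred thousand columns, which is why the high-cardinality group is handled differently.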

**3.5.1.1 Straightforward One-Hot Encoding: Area Name, Victim Sex, Victim Descent, Status Desc** <a id='3.5.1.1_Encoding_with_get_Dummies'></a>

In [95]:
# Encoding Area Name, Victim Sex, Victim Descent, Status Desc

encoded_crimedf = pd.get_dummies(crimedf,
                        columns=["Area Name","Victim Sex","Victim Descent","Status Desc"],
                        drop_first=True)

In [12]:
encoded_crimedf.head()

Unnamed: 0,DR Number,Date Reported,Date Occurred,Time Occurred,Area ID,Reporting District,Crime Code,Crime Code Description,MO Codes,Victim Age,...,Victim Descent_Pacific Islander,Victim Descent_Samoan,Victim Descent_Unknown,Victim Descent_Vietnamese,Victim Descent_White,Status Desc_Adult Other,Status Desc_Invest Cont,Status Desc_Juv Arrest,Status Desc_Juv Other,Status Desc_UNK
0,10304468,2020-01-08,2020-01-08,2230,3,377,624,BATTERY - SIMPLE ASSAULT,0444 0913,36.0,...,0,0,0,0,0,1,0,0,0,0
1,190101086,2020-01-02,2020-01-01,330,1,163,624,BATTERY - SIMPLE ASSAULT,0416 1822 1414,25.0,...,0,0,0,0,0,0,1,0,0,0
2,200110444,2020-04-14,2020-02-13,1200,1,155,845,SEX OFFENDER REGISTRANT OUT OF COMPLIANCE,1501,,...,0,0,1,0,0,0,0,0,0,0
3,191501505,2020-01-01,2020-01-01,1730,15,1543,745,VANDALISM - MISDEAMEANOR ($399 OR UNDER),0329 1402,76.0,...,0,0,0,0,1,0,1,0,0,0
4,191921269,2020-01-01,2020-01-01,415,19,1998,740,"VANDALISM - FELONY ($400 & OVER, ALL CHURCH VA...",0329,31.0,...,0,0,1,0,0,0,1,0,0,0
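
The effect of `drop_first=True` is easiest to see on a small column. The sketch below uses a hypothetical four-row frame (`toy`), not the notebook's data. With `drop_first=True`, a feature with k levels yields k-1 dummy columns, and the dropped level is represented by a row of zeros.

```python
import pandas as pd

# Hypothetical toy column using the three Victim Sex values seen in the data
toy = pd.DataFrame({"Victim Sex": ["F", "M", "X", "F"]})

# Default behaviour: one dummy column per level (3 columns)
full = pd.get_dummies(toy, columns=["Victim Sex"])

# drop_first=True: the first level ("F") is dropped and becomes the baseline
reduced = pd.get_dummies(toy, columns=["Victim Sex"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level avoids perfectly collinear dummy columns, which matters mostly for linear models such as the imported `LogisticRegression`.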


**3.5.1.2 Encoding with word2vec** <a id='3.5.1.2_Encoding_with_word2vec'></a>

In [124]:
# Download the NLTK data needed for stopwords.words('english') and word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# function to clean the text data -> lower-case, strip punctuation, remove stop words
def preprocess(text):
    text = text.lower() # make strings lower case
    text = ''.join([char for char in text if char not in string.punctuation]) # remove punctuation
    tokens = word_tokenize(text) # split the text into tokens
    tokens = [word for word in tokens if word not in stop_words] # ensure no stopwords are present
    return ' '.join(tokens) # rejoin the tokens into a single string

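# function to vectorize the text: average the word vectors of a sentence
# (relies on the global w2v_model trained in the next cell)
def vectorize(sentence):
    words = sentence.split()
    words_vecs = [w2v_model.wv[word] for word in words if word in w2v_model.wv]
    if len(words_vecs) == 0:
        return np.zeros(50) # no known words: zero vector matching vector_size=50
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)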

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/frankyaraujo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/frankyaraujo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
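
The averaging done by `vectorize` can be illustrated without training a model. In this sketch, `toy_vectors` is a hypothetical dictionary of 3-dimensional vectors standing in for `w2v_model.wv`, and `mean_vector` is an illustrative stand-in for `vectorize` that takes the lookup table as an argument instead of reading a global.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for w2v_model.wv
toy_vectors = {
    "simple": np.array([1.0, 0.0, 2.0]),
    "assault": np.array([3.0, 2.0, 0.0]),
}

def mean_vector(sentence, vectors, size):
    """Average the vectors of the known words; fall back to zeros if none are known."""
    found = [vectors[word] for word in sentence.split() if word in vectors]
    if not found:
        return np.zeros(size)
    return np.mean(found, axis=0)

# "battery" is not in the vocabulary, so only "simple" and "assault" are averaged.
known = mean_vector("battery simple assault", toy_vectors, size=3)

# No known words at all: the result is the zero vector.
unknown = mean_vector("battery", toy_vectors, size=3)

print(known)
print(unknown)
```

Averaging discards word order, which is acceptable for short descriptions but means two different texts can map to the same vector.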


In [135]:
text_columns = ["Address", "Weapon Desc", "MO Codes", "Crime Code Description",
                "Premise Description", "Cross Street"]

# fill missing text with a placeholder; the result is assigned back because
# .loc[...] returns a copy, so an inplace fillna on it would not reach encoded_crimedf
encoded_crimedf[text_columns] = encoded_crimedf[text_columns].fillna("Unknown")

for i in text_columns:
    # preprocess text data
    processed_feature = encoded_crimedf[i].astype(str).apply(preprocess)

    # train the word2vec model
    sentences = [sentence.split() for sentence in processed_feature]
    w2v_model = Word2Vec(sentences, vector_size=50, window=5)

    # vectorize text data and replace column
    # (each row of encoded_feature is a 50-dimensional vector)
    encoded_feature = np.array([vectorize(sentence) for sentence in processed_feature])
    encoded_crimedf[i] = encoded_feature
    

In [137]:
encoded_crimedf.head()

Unnamed: 0,DR Number,Date Reported,Date Occurred,Time Occurred,Area ID,Reporting District,Crime Code,Crime Code Description,MO Codes,Victim Age,...,Victim Descent_Pacific Islander,Victim Descent_Samoan,Victim Descent_Unknown,Victim Descent_Vietnamese,Victim Descent_White,Status Desc_Adult Other,Status Desc_Invest Cont,Status Desc_Juv Arrest,Status Desc_Juv Other,Status Desc_UNK
0,10304468,2020-01-08,2020-01-08,2230,3,377,624,-0.071429,0.703382,36.0,...,0,0,0,0,0,1,0,0,0,0
1,190101086,2020-01-02,2020-01-01,330,1,163,624,-0.071429,0.298795,25.0,...,0,0,0,0,0,0,1,0,0,0
2,200110444,2020-04-14,2020-02-13,1200,1,155,845,0.888687,-0.606975,,...,0,0,1,0,0,0,0,0,0,0
3,191501505,2020-01-01,2020-01-01,1730,15,1543,745,-0.489959,0.786488,76.0,...,0,0,0,0,1,0,1,0,0,0
4,191921269,2020-01-01,2020-01-01,415,19,1998,740,0.284049,0.784958,31.0,...,0,0,1,0,0,0,1,0,0,0
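
`vectorize` returns a 50-dimensional vector for every row, so `encoded_feature` is a two-dimensional array with 50 columns, while a single DataFrame column holds one value per row. One way to keep every dimension, sketched here as an assumption and not as what the notebook does, is to expand each text feature into one column per dimension. The frame `toy`, the random `embeddings` array (4 dimensions instead of 50) and the `MO Codes_w2v_k` column names are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one text feature
toy = pd.DataFrame({
    "DR Number": [1, 2, 3],
    "MO Codes": ["0444 0913", "0416 1822 1414", "1501"],
})

# Stand-in for np.array([vectorize(s) for s in processed_feature]);
# random values and 4 dimensions are used here only to keep the example small.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(toy), 4))

# One column per embedding dimension, aligned on the original index
embedding_columns = [f"MO Codes_w2v_{k}" for k in range(embeddings.shape[1])]
embedding_df = pd.DataFrame(embeddings, columns=embedding_columns, index=toy.index)

# Replace the raw text column with its embedding columns
expanded = pd.concat([toy.drop(columns="MO Codes"), embedding_df], axis=1)

print(expanded.shape)
```

With six text features and 50 dimensions each, this would add 300 numeric columns, so a smaller `vector_size` may be worth considering.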


In [138]:
# check that no string columns remain (pandas stores strings with the object dtype)

(encoded_crimedf.dtypes == object).sum()

0

### 3.5.2 Imputing Missing Values <a id='3.5.2_Imputing_Missing_Values'></a>


In [148]:
encoded_crimedf.head()

Unnamed: 0,DR Number,Date Reported,Date Occurred,Time Occurred,Area ID,Reporting District,Crime Code,Crime Code Description,MO Codes,Victim Age,...,Victim Descent_Pacific Islander,Victim Descent_Samoan,Victim Descent_Unknown,Victim Descent_Vietnamese,Victim Descent_White,Status Desc_Adult Other,Status Desc_Invest Cont,Status Desc_Juv Arrest,Status Desc_Juv Other,Status Desc_UNK
0,10304468,2020-01-08,2020-01-08,2230,3,377,624,-0.071429,0.703382,36.0,...,0,0,0,0,0,1,0,0,0,0
1,190101086,2020-01-02,2020-01-01,330,1,163,624,-0.071429,0.298795,25.0,...,0,0,0,0,0,0,1,0,0,0
2,200110444,2020-04-14,2020-02-13,1200,1,155,845,0.888687,-0.606975,,...,0,0,1,0,0,0,0,0,0,0
3,191501505,2020-01-01,2020-01-01,1730,15,1543,745,-0.489959,0.786488,76.0,...,0,0,0,0,1,0,1,0,0,0
4,191921269,2020-01-01,2020-01-01,415,19,1998,740,0.284049,0.784958,31.0,...,0,0,1,0,0,0,1,0,0,0


The following variables have now been processed: Area Name, Victim Sex, Victim Descent, Status Desc, Crime Code Description, MO Codes, Premise Description, Address, Cross Street, and Weapon Desc. These were the object variables from the original dataset, so the non-object variables will be processed next.
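
As an illustration of what imputing a numeric variable could look like, the sketch below fills missing 'Victim Age' values with the median and keeps a flag marking which rows were imputed. The median strategy and the `Victim Age Missing` column name are assumptions made for this example, not choices stated in the notebook. The five ages are the ones shown in the `.head()` output above.

```python
import numpy as np
import pandas as pd

# Victim Age values from the first five rows shown above (row 2 is missing)
toy = pd.DataFrame({"Victim Age": [36.0, 25.0, np.nan, 76.0, 31.0]})

# Record which rows were missing before filling, in case missingness is informative
toy["Victim Age Missing"] = toy["Victim Age"].isna().astype(int)

# Median of the observed values (NaN is skipped by default)
median_age = toy["Victim Age"].median()

# Fill the gaps with the median
toy["Victim Age"] = toy["Victim Age"].fillna(median_age)

print(median_age)
print(toy)
```

In a modeling workflow the median would normally be computed on the training split only and then applied to the test split, to avoid leaking information across the split.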

In [23]:
unprocessed_var_names = ['DR Number', 'Date Reported', 'Date Occurred', 'Time Occurred', 'Area ID',
                         'Reporting District', 'Crime Code', 'Victim Age', 'Premise Code', 'Weapon Used Cd',
                         'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4', 'LAT', 'LON']

In [140]:
encoded_crimedf_1 = encoded_crimedf.copy()

In [150]:
encoded_crimedf_1.head()

Unnamed: 0,DR Number,Date Reported,Date Occurred,Time Occurred,Area ID,Reporting District,Crime Code,Crime Code Description,MO Codes,Victim Age,...,Victim Descent_Pacific Islander,Victim Descent_Samoan,Victim Descent_Unknown,Victim Descent_Vietnamese,Victim Descent_White,Status Desc_Adult Other,Status Desc_Invest Cont,Status Desc_Juv Arrest,Status Desc_Juv Other,Status Desc_UNK
0,10304468,2020-01-08,2020-01-08,2230,3,377,624,-0.071429,-0.579764,0.084483,...,0,0,0,0,0,1,0,0,0,0
1,190101086,2020-01-02,2020-01-01,330,1,163,624,-0.071429,-0.579764,0.084483,...,0,0,0,0,0,0,1,0,0,0
2,200110444,2020-04-14,2020-02-13,1200,1,155,845,0.888687,0.768959,-0.724932,...,0,0,1,0,0,0,0,0,0,0
3,191501505,2020-01-01,2020-01-01,1730,15,1543,745,-0.489959,-0.26384,0.756786,...,0,0,0,0,1,0,1,0,0,0
4,191921269,2020-01-01,2020-01-01,415,19,1998,740,0.284049,-1.905684,-1.771404,...,0,0,1,0,0,0,1,0,0,0
