# Logistic Model

A logistic regression model is trained as a baseline for comparison. It uses the crime type (primary_type) as the target and a subset of the remaining columns (time of the crime, ward, community area, and coordinates) as features.

Here, we load the data for the logistic regression model, which uses some characteristics of each crime to predict its type.

In [None]:
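#Load Data
import pickle

# Download the pickled dataset from Google Drive (saved locally as 'CrimeData')
!gdown https://drive.google.com/uc?id=1xuQkW1U3jAQ_OcoJZ53to3rdyhDhxS-f

# Unpickle the dataset and display it
file_path = 'CrimeData'
with open(file_path, 'rb') as file:
    data = pickle.load(file)
data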

Downloading...
From: https://drive.google.com/uc?id=1xuQkW1U3jAQ_OcoJZ53to3rdyhDhxS-f
To: /content/CrimeData
100% 15.5M/15.5M [00:00<00:00, 109MB/s]


Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,...,ward,community_area,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude,location
0,13650227,JH491940,2024-11-01 00:00:00,089XX S LAFLIN ST,0560,ASSAULT,SIMPLE,RESIDENCE,False,True,...,21,73,08A,1167901,1845454,2024,2024-11-08 15:41:24,41.731462584,-87.660503907,"{'latitude': '41.731462584', 'longitude': '-87..."
1,13649850,JH491289,2024-11-01 00:00:00,003XX W OHIO ST,0810,THEFT,OVER $500,STREET,False,False,...,42,8,06,1174007,1904136,2024,2024-11-08 15:41:24,41.892358446,-87.636391922,"{'latitude': '41.892358446', 'longitude': '-87..."
2,13650185,JH491764,2024-11-01 00:00:00,015XX W CARROLL AVE,0820,THEFT,$500 AND UNDER,STREET,False,False,...,27,28,06,1166235,1902273,2024,2024-11-08 15:41:24,41.88741592,-87.664988379,"{'latitude': '41.88741592', 'longitude': '-87...."
3,13654954,JH497594,2024-11-01 00:00:00,043XX W MONTROSE AVE,0710,THEFT,THEFT FROM MOTOR VEHICLE,PARKING LOT / GARAGE (NON RESIDENTIAL),False,False,...,39,16,06,1146714,1928893,2024,2024-11-08 15:41:24,41.960858577,-87.735994826,"{'latitude': '41.960858577', 'longitude': '-87..."
4,13650108,JH491685,2024-11-01 00:00:00,048XX S MICHIGAN AVE,0460,BATTERY,SIMPLE,LIBRARY,False,False,...,3,38,08B,1177972,1873066,2024,2024-11-08 15:41:24,41.807010932,-87.622774469,"{'latitude': '41.807010932', 'longitude': '-87..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,12149588,JD348492,2020-08-28 16:30:00,0000X W ERIE ST,0820,THEFT,$500 AND UNDER,STREET,False,False,...,42,8,06,1176097,1904775,2020,2020-09-04 15:40:59,41.894065054,-87.628697034,"{'latitude': '41.894065054', 'longitude': '-87..."
49996,12152992,JD351499,2020-08-28 16:30:00,079XX S INGLESIDE AVE,0850,THEFT,ATTEMPT THEFT,RESIDENCE,False,False,...,8,44,06,1183939,1852562,2020,2020-09-04 15:40:59,41.750608654,-87.601529598,"{'latitude': '41.750608654', 'longitude': '-87..."
49997,12149625,JD348456,2020-08-28 16:30:00,055XX N CLARK ST,0820,THEFT,$500 AND UNDER,GROCERY FOOD STORE,False,False,...,40,77,06,1165024,1936693,2020,2020-09-04 15:40:59,41.981892081,-87.668455248,"{'latitude': '41.981892081', 'longitude': '-87..."
49998,12153330,JD352845,2020-08-28 16:30:00,027XX W 111TH ST,0620,BURGLARY,UNLAWFUL ENTRY,RESIDENCE,False,False,...,19,75,05,1160179,1830908,2020,2020-09-04 15:40:59,41.691708011,-87.689191338,"{'latitude': '41.691708011', 'longitude': '-87..."


Below, we preprocess the data into a form the model can use. We extract the hour of the day, month of the year, and day of the week from the timestamp as separate features, since each can relate to crime type differently. We also encode the crime type column as integer labels for the model.

In [None]:
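#Preprocessing + Dataset ID
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Split the timestamp into separate time features
data['hour'] = data['date'].dt.hour
data['year_month'] = data['date'].dt.to_period('M')
data['month'] = data['year_month'].dt.month
data['day_of_week'] = data['date'].dt.dayofweek  # Monday=0 ... Sunday=6

# Encode crime type as integer labels (classes are sorted alphabetically)
le = LabelEncoder()
data['primary_type_encoded'] = le.fit_transform(data['primary_type'])
print(le.classes_)

# Features and target. y is a one-column DataFrame here, so sklearn warns and
# flattens it at fit time; data['primary_type_encoded'] would be 1-D instead.
X = data[['hour', 'month', 'day_of_week', 'year', 'ward', 'community_area', 'latitude', 'longitude']]
y = data[['primary_type_encoded']]

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=8)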

['ARSON' 'ASSAULT' 'BATTERY' 'BURGLARY'
 'CONCEALED CARRY LICENSE VIOLATION' 'CRIMINAL DAMAGE'
 'CRIMINAL SEXUAL ASSAULT' 'CRIMINAL TRESPASS' 'DECEPTIVE PRACTICE'
 'GAMBLING' 'HOMICIDE' 'HUMAN TRAFFICKING'
 'INTERFERENCE WITH PUBLIC OFFICER' 'INTIMIDATION' 'KIDNAPPING'
 'LIQUOR LAW VIOLATION' 'MOTOR VEHICLE THEFT' 'NARCOTICS' 'OBSCENITY'
 'OFFENSE INVOLVING CHILDREN' 'OTHER NARCOTIC VIOLATION' 'OTHER OFFENSE'
 'PROSTITUTION' 'PUBLIC INDECENCY' 'PUBLIC PEACE VIOLATION' 'ROBBERY'
 'SEX OFFENSE' 'STALKING' 'THEFT' 'WEAPONS VIOLATION']
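
To make the extracted features concrete, the following minimal sketch applies the same pandas accessors and a LabelEncoder to a tiny hand-made frame. The frame `toy`, its three rows, and the name `toy_le` are assumptions for illustration only (two of the timestamps are copied from the rows displayed earlier); this is not the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical three-row stand-in for the crime data (not the real dataset)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2024-11-01 00:00:00',
                            '2020-08-28 16:30:00',
                            '2023-01-15 09:45:00']),
    'primary_type': ['THEFT', 'BURGLARY', 'THEFT'],
})

# Same datetime accessors as in the preprocessing cell
toy['hour'] = toy['date'].dt.hour
toy['month'] = toy['date'].dt.month
toy['day_of_week'] = toy['date'].dt.dayofweek  # Monday=0 ... Sunday=6

# LabelEncoder assigns integer codes in sorted order of the class names
toy_le = LabelEncoder()
toy['primary_type_encoded'] = toy_le.fit_transform(toy['primary_type'])

print(toy[['hour', 'month', 'day_of_week', 'primary_type_encoded']])
print(list(toy_le.classes_))
```

Calling `toy_le.inverse_transform` on the integer codes maps them back to the crime type names, which is useful when reading model predictions.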


Here, we train the model. Without class weighting, the model predicts theft or battery in all cases (the two most common crime types according to our data exploration). Because our dataset is imbalanced, these predictions are still 'correct' roughly 23-25% of the time.

In [None]:
#Train Model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# One-vs-rest logistic regression with default (uniform) class weights
lr = LogisticRegression(max_iter=1000, multi_class='ovr')
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
# confusion_matrix expects the true labels first: rows = true, columns = predicted
# print(confusion_matrix(y_test, y_pred))

  y = column_or_1d(y, warn=True)


Accuracy: 23.88%
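
One way to put this accuracy in context is to compare it with a classifier that ignores the features entirely. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels (the arrays `X_toy` and `y_toy` are assumptions for illustration, not the crime data); its accuracy equals the share of the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up imbalanced labels: class 0 six times, class 1 three times, class 2 once
y_toy = np.array([0] * 6 + [1] * 3 + [2] * 1)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit
majority = DummyClassifier(strategy='most_frequent')
majority.fit(X_toy, y_toy)
majority_acc = majority.score(X_toy, y_toy)

# Count how often each class is predicted
pred_classes, pred_counts = np.unique(majority.predict(X_toy), return_counts=True)

print(f"Majority-class accuracy: {majority_acc:.2f}")
print(dict(zip(pred_classes.tolist(), pred_counts.tolist())))
```

The same `np.unique(..., return_counts=True)` call on the logistic model's `y_pred` would show which classes it actually predicts on the test split.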


When we set class_weight='balanced', which weights each class inversely to its frequency in the training data, the model accuracy drops to roughly 2-3%. This represents a more realistic, if less impressive, baseline.
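
For reference, scikit-learn's 'balanced' mode gives each class the weight n_samples / (n_classes * count of that class). The following minimal sketch checks that formula against scikit-learn's `compute_class_weight` helper on made-up labels (`y_demo` is an assumption for illustration, not the crime data).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: class 0 six times, class 1 three times, class 2 once
y_demo = np.array([0] * 6 + [1] * 3 + [2] * 1)
classes = np.unique(y_demo)

# Manual formula: n_samples / (n_classes * per-class count)
manual_weights = len(y_demo) / (len(classes) * np.bincount(y_demo))

# scikit-learn helper that implements the same heuristic
balanced_weights = compute_class_weight('balanced', classes=classes, y=y_demo)

for cls, w in zip(classes, balanced_weights):
    print(f"class {cls}: weight {w:.3f}")
```

Rare classes receive larger weights, so errors on them cost more during training than errors on common classes.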

In [None]:
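#Train Model (balanced class weights)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Same model as before, but each class is weighted inversely to its frequency
lr = LogisticRegression(max_iter=1000, multi_class='ovr', class_weight='balanced')
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred) * 100:.2f}%")
# confusion_matrix expects the true labels first: rows = true, columns = predicted
# print(confusion_matrix(y_test, y_pred))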

  y = column_or_1d(y, warn=True)


Accuracy: 2.85%
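
Because plain accuracy rewards always predicting the common classes, per-class metrics can make the two baselines easier to compare. The sketch below, on made-up labels (`y_true_toy` and `y_pred_toy` are assumptions for illustration), contrasts accuracy with balanced accuracy (the mean of per-class recall) and builds a confusion matrix with the true labels passed first.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Made-up labels: the "model" predicts the majority class (0) for every sample
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 2])
y_pred_toy = np.zeros_like(y_true_toy)

acc = accuracy_score(y_true_toy, y_pred_toy)               # fraction of exact matches
bal_acc = balanced_accuracy_score(y_true_toy, y_pred_toy)  # mean recall over classes

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1, 2])

print(f"Accuracy: {acc:.3f}")
print(f"Balanced accuracy: {bal_acc:.3f}")
print(cm)
```

In this toy case, accuracy looks moderate while balanced accuracy exposes that two of the three classes are never predicted correctly.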
