#### The assignment says to use the cleaned data as training data and last week's data as testing data. However, last week we worked with the cleaned data most of the time. Using the cleaned data for both training and testing would be strange, so I use the original data (also used last week, in question 1) as the testing data.

In [80]:
import pandas as pd

# load data 
ori = pd.read_csv("data/nyc_crashes_202301.csv")
clean = pd.read_csv("data/nyc_crashes_202301_cleaned.csv")

# Create a binary variable injury: 1 if at least one person was injured, else 0.
clean['injury'] = clean['NUMBER OF PERSONS INJURED'].\
    apply(lambda x: 1 if x >= 1 else 0)
ori['injury'] = ori['NUMBER OF PERSONS INJURED'].\
    apply(lambda x: 1 if x >= 1 else 0)

# Construct an hour variable with integer values from 0 to 23
clean['hour'] = pd.to_datetime(clean['CRASH TIME']).dt.hour
ori['hour'] = pd.to_datetime(ori['CRASH TIME']).dt.hour

In [81]:
# "clean" as training data and "ori" as testing data
X_train = clean[['hour', 'CRASH DATE', 'BOROUGH']].values
y_train = clean['injury'].values
X_test = ori[['hour', 'CRASH DATE', 'BOROUGH']].values
y_test = ori['injury'].values
print(X_train.shape)
print(X_test.shape)

(7244, 3)
(7189, 3)


In [82]:
from sklearn.preprocessing import OneHotEncoder

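# Encode categorical variables. Fit the encoder on the training data only and
# reuse it on the test data so both matrices have the same columns.
# handle_unknown='ignore' encodes categories unseen in training as all zeros.
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_train_encoded = encoder.fit_transform(X_train)
X_test_encoded = encoder.transform(X_test)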

In [114]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Fit SVM
svm = SVC()
svm.fit(X_train_encoded, y_train)

# Fit logistic regression
logreg = LogisticRegression()
logreg.fit(X_train_encoded, y_train)

In [115]:
from sklearn.model_selection import GridSearchCV

# define the hyperparameter space to search over for the SVM
# (gamma is ignored by the linear kernel)
param_grid = {'C': [0.1, 1, 10],
              'gamma': [0.01, 0.1, 1, 'scale', 'auto'],
              'kernel': ['linear', 'rbf', 'sigmoid']}

# perform cross-validation with GridSearchCV
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='f1')

# fit the GridSearchCV object to the training data
grid_search.fit(X_train_encoded, y_train)

# print the best hyperparameters found
print("Best hyperparameters: ", grid_search.best_params_)

Best hyperparameters:  {'C': 1, 'gamma': 1, 'kernel': 'sigmoid'}


In [118]:
# Fit SVM using the best hyperparameters
svm2 = SVC(C=1, gamma=1, kernel='sigmoid')
svm2.fit(X_train_encoded, y_train)

In [166]:
# define the hyperparameter grid for logistic regression
param_grid_r = {
    'C': [0.01, 0.1, 1],  # regularization parameter
    'penalty': ['l1', 'l2'],  # regularization type
    'solver': ['liblinear', 'saga' ]}  # optimization algorithm

# perform cross-validation with GridSearchCV
grid = GridSearchCV(logreg, param_grid_r, cv=5, scoring='recall')

# fit the GridSearchCV object to the training data
grid.fit(X_train_encoded, y_train)

# print the best hyperparameters found
print("Best hyperparameters: ", grid.best_params_)

Best hyperparameters:  {'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}


In [167]:
# Fit logistic regression using the best hyperparameters
logreg2 = LogisticRegression(C=1, penalty='l2', solver='liblinear')
logreg2.fit(X_train_encoded, y_train)

In [168]:
# calculate the predicted class labels on the test data
svm_pred = svm2.predict(X_test_encoded)
logreg_pred = logreg2.predict(X_test_encoded)

In [169]:
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, roc_auc_score)

# Confusion matrix
svm_cm = confusion_matrix(y_test, svm_pred)
logreg_cm = confusion_matrix(y_test, logreg_pred)

# Accuracy
svm_acc = accuracy_score(y_test, svm_pred)
logreg_acc = accuracy_score(y_test, logreg_pred)

# Precision
svm_precision = precision_score(y_test, svm_pred)
logreg_precision = precision_score(y_test, logreg_pred)

# Recall
svm_recall = recall_score(y_test, svm_pred)
logreg_recall = recall_score(y_test, logreg_pred)

# F1-score
svm_f1 = f1_score(y_test, svm_pred)
logreg_f1 = f1_score(y_test, logreg_pred)

# AUC (computed here from the hard 0/1 predictions, not from continuous scores)
svm_auc = roc_auc_score(y_test, svm_pred)
logreg_auc = roc_auc_score(y_test, logreg_pred)

In [170]:
print("SVM results:")
print("Confusion matrix:")
print(svm_cm)
print("Accuracy:", svm_acc)
print("Precision:", svm_precision)
print("Recall:", svm_recall)
print("F1-score:", svm_f1)
print("AUC:", svm_auc)

print("Logistic regression results:")
print("Confusion matrix:")
print(logreg_cm)
print("Accuracy:", logreg_acc)
print("Precision:", logreg_precision)
print("Recall:", logreg_recall)
print("F1-score:", logreg_f1)
print("AUC:", logreg_auc)

SVM results:
Confusion matrix:
[[2797 1611]
 [1748 1033]]
Accuracy: 0.5327583808596467
Precision: 0.39069591527987896
Recall: 0.37144911902193456
F1-score: 0.38082949308755765
AUC: 0.5029886248467206
Logistic regression results:
Confusion matrix:
[[4198  210]
 [2589  192]]
Accuracy: 0.610655167617193
Precision: 0.47761194029850745
Recall: 0.06903991370010787
F1-score: 0.12064090480678603
AUC: 0.510699630171288


## question 1

For the SVM model, I used C = 1, gamma = 1, kernel = 'sigmoid' as the hyperparameters. This means the SVM uses a sigmoid kernel with a kernel coefficient gamma of 1. C is a tuning parameter that controls "the hardness of the margin". Specifically, for very large C the margin is hard and points cannot lie in it; for smaller C the margin is softer and can grow to encompass some points. Since commonly searched values of C range from about 0.1 to 100, C = 1 is not a large value. gamma is the kernel coefficient for the non-linear kernels ('rbf', 'poly', and 'sigmoid'). For the sigmoid kernel it scales the inner product inside the tanh function. The usual intuition comes from the RBF kernel, where gamma defines how far the influence of a single training example reaches, with low values meaning 'far' and high values meaning 'close'. When gamma is low the decision boundary is smoother, while for high gamma values the decision boundary is more complex and more prone to overfitting.
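
To make the role of gamma concrete, here is a minimal sketch of the sigmoid kernel, K(x, z) = tanh(gamma * <x, z> + coef0). The helper name `sigmoid_kernel_value` and the two toy rows are assumptions for illustration only; coef0 = 0 is assumed because the model above does not set it.

```python
import math

def sigmoid_kernel_value(x, z, gamma=1.0, coef0=0.0):
    """Sigmoid kernel tanh(gamma * <x, z> + coef0) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, z))
    return math.tanh(gamma * dot + coef0)

# Two one-hot-style rows that share exactly two active features (hypothetical data).
row_a = [1, 0, 1, 0, 1]
row_b = [1, 0, 1, 1, 0]

# Same gamma values as the numeric entries of the search grid.
for gamma in (0.01, 0.1, 1):
    print(gamma, round(sigmoid_kernel_value(row_a, row_b, gamma=gamma), 4))
```

With a small gamma the kernel value is almost proportional to the number of shared features; with gamma = 1 the tanh is already close to saturation for two shared features.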

For the logistic regression, I used C = 1, penalty = 'l2', solver = 'liblinear' as the hyperparameters. This means the logistic regression model uses 'l2' as the regularization type with a regularization parameter of C = 1. C is the inverse of the regularization strength (smaller values specify stronger regularization). 'l2' regularization shrinks the coefficients toward zero but generally leaves all of them non-zero, whereas 'l1' regularization can set some coefficients exactly to zero and so acts as a form of feature selection. solver is the algorithm used for optimization during the training of the model. solver = 'liblinear' means that we are using the LIBLINEAR optimizer, which uses a coordinate descent algorithm.
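
As a sketch of why C acts as an inverse regularization strength, the 'l2'-penalized objective can be written as below. This follows the formulation in the scikit-learn user guide, with labels coded as $y_i \in \{-1, +1\}$; the symbols $w$, $b$, and $x_i$ are my own notation, not names from the code above.

$$
\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert_2^2 + C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i\,(x_i^\top w + b)\right)\right)
$$

Because C multiplies the data-fit term, a smaller C gives the penalty relatively more weight. With penalty = 'l1' the first term is replaced by $\lVert w\rVert_1$.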

All of these hyperparameters were chosen by 5-fold cross-validation with GridSearchCV. I tried every setting in each grid and kept the combination with the best cross-validated score (F1 for the SVM, recall for the logistic regression).
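
As a rough illustration of how much work each search does, the sketch below enumerates the two grids used above and counts the model fits under 5-fold cross-validation. The helper `count_fits` is a hypothetical name, and the count ignores the final refit on the full training data.

```python
from itertools import product

def count_fits(param_grid, cv):
    """Return (number of parameter combinations, number of fits during the search)."""
    n_combos = len(list(product(*param_grid.values())))
    return n_combos, n_combos * cv

svm_grid = {'C': [0.1, 1, 10],
            'gamma': [0.01, 0.1, 1, 'scale', 'auto'],
            'kernel': ['linear', 'rbf', 'sigmoid']}
logreg_grid = {'C': [0.01, 0.1, 1],
               'penalty': ['l1', 'l2'],
               'solver': ['liblinear', 'saga']}

print(count_fits(svm_grid, cv=5))     # (45, 225)
print(count_fits(logreg_grid, cv=5))  # (12, 60)
```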

## question 2

The confusion matrix from the SVM model is [[2797 1611] [1748 1033]]. 2797 is the count of true negatives (TN): crashes the model correctly predicted as "not injured". 1611 is the count of false positives (FP): crashes the model predicted as "injured" when the actual class was "not injured". 1748 is the count of false negatives (FN): crashes the model predicted as "not injured" when the actual class was "injured". 1033 is the count of true positives (TP): crashes the model correctly predicted as "injured".

The confusion matrix from the logistic regression model is [[4198 210] [2589 192]]. 4198 is the count of true negatives (TN): crashes the model correctly predicted as "not injured". 210 is the count of false positives (FP): crashes the model predicted as "injured" when the actual class was "not injured". 2589 is the count of false negatives (FN): crashes the model predicted as "not injured" when the actual class was "injured". 192 is the count of true positives (TP): crashes the model correctly predicted as "injured".
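
The four counts determine the accuracy, precision, recall, and F1-score reported above. The sketch below recomputes them directly from the two confusion matrices; `metrics_from_confusion` is a hypothetical helper, and the row/column layout assumed is the [[TN, FP], [FN, TP]] order described in this answer.

```python
def metrics_from_confusion(cm):
    """Compute accuracy, precision, recall and F1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

svm_cm = [[2797, 1611], [1748, 1033]]
logreg_cm = [[4198, 210], [2589, 192]]

for name, cm in (("SVM", svm_cm), ("Logistic regression", logreg_cm)):
    print(name, metrics_from_confusion(cm))
```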


## question 3

Comparing the SVM and logistic regression results, we can see that the logistic regression model has higher accuracy (0.611 vs. 0.533) and precision (0.478 vs. 0.391), but lower recall (0.069 vs. 0.371) and F1-score (0.121 vs. 0.381). First, the logistic regression model makes a higher percentage of correct predictions overall. Second, it also has a higher percentage of true positives (correctly predicted positive instances) among all predicted positives. Third, the SVM has a higher percentage of true positives among all actual positive instances. Fourth, the SVM has a better balance between precision and recall. In addition, both models have similar AUC scores (both about 0.5, with logistic regression slightly higher). Because the AUC here was computed from the predicted class labels rather than from continuous scores, it equals the average of sensitivity and specificity and does not measure how well the models rank observations. Unfortunately, a value of 0.5 is equivalent to random guessing, which suggests that neither model is useful for predicting the target variable.
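
When `roc_auc_score` receives hard 0/1 predictions, the ROC curve has a single interior point and the area under it reduces to (TPR + TNR) / 2. The sketch below checks this against the reported AUC values using only the confusion matrices; `auc_from_hard_labels` is a hypothetical helper name.

```python
def auc_from_hard_labels(cm):
    """AUC of a single-threshold classifier from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    tpr = tp / (tp + fn)  # sensitivity (recall)
    tnr = tn / (tn + fp)  # specificity
    return (tpr + tnr) / 2

svm_cm = [[2797, 1611], [1748, 1033]]
logreg_cm = [[4198, 210], [2589, 192]]

print("SVM AUC:", auc_from_hard_labels(svm_cm))
print("Logistic regression AUC:", auc_from_hard_labels(logreg_cm))
```

A ranking-based AUC would instead pass continuous scores as the second argument, for example `svm2.decision_function(X_test_encoded)` or `logreg2.predict_proba(X_test_encoded)[:, 1]`. Those values were not computed in this notebook, so the result may differ from the numbers above.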

Overall, the choice of which model to use would depend on the specific goals and constraints of the problem. If precision is more important (e.g., false positives are more costly than false negatives), then the logistic regression model may be preferred. If recall is more important (e.g., false negatives are more costly than false positives), then the SVM model may be preferred.
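
One way to make this trade-off explicit is the F-beta score, which weights recall beta times as heavily as precision. The sketch below computes it from the confusion matrices above. The helper `f_beta_from_confusion` and the choices beta = 2 and beta = 0.5 are assumptions for illustration, not part of the assignment.

```python
def f_beta_from_confusion(cm, beta):
    """F-beta score from [[TN, FP], [FN, TP]]; beta > 1 favors recall, beta < 1 favors precision."""
    (_, fp), (fn, tp) = cm
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

svm_cm = [[2797, 1611], [1748, 1033]]
logreg_cm = [[4198, 210], [2589, 192]]

for beta in (0.5, 1, 2):
    print("beta =", beta,
          "SVM:", round(f_beta_from_confusion(svm_cm, beta), 3),
          "LogReg:", round(f_beta_from_confusion(logreg_cm, beta), 3))
```

With beta = 1 this reduces to the F1-score reported earlier. Precision on its own still favors the logistic regression, but any F-beta that gives recall some weight is pulled down by its very low recall.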