## **Getting the Data** 

In [0]:
import pandas as pd

#Function to load full crash data
def load_crash_data(path="Full_Crash.csv"):
    return pd.read_csv(path)

#Function to save updated crash data after extracting useful features
def save_crash_data(data_frame):
    data_frame.to_csv('Updated_Crash_Data.csv', index=False)

In [0]:
#Open Full crash data file
crash_data = load_crash_data()
crash_data.info()

In [0]:
crash_data.head()

In [0]:
#Create new table containing only useful features
new_crash_data = crash_data[['Carspeedlimit','Alcohol_Notalcohol', 'DAY_OF_WEEK', 'NIGHT', 'Weather_Condition', 'Young_Notyoung', 'Light_Condition', 'INTERSECTION_TYPE', 'Collision_Type', 'Time_Slicing_Used', 'RURALURBANDESC']].copy()

In [0]:
new_crash_data.head(5)

In [0]:
#check for incomplete rows
incomplete_rows = new_crash_data[new_crash_data.isnull().any(axis=1)].head()
incomplete_rows


In [0]:
#Only 5 rows have Null values so just drop them from dataset
crash_cleaned = new_crash_data.dropna(subset=["Carspeedlimit"]) 

In [0]:
#Re-check for incomplete rows, building the mask from the cleaned frame so the indexes align
incomplete_rows = crash_cleaned[crash_cleaned.isnull().any(axis=1)].head()
incomplete_rows
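
Because `.head()` shows at most five rows, it cannot confirm how many rows are incomplete. The sketch below counts them directly. The `toy` frame is a hypothetical stand-in for `new_crash_data`, with made-up values used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for new_crash_data (values are invented)
toy = pd.DataFrame({
    "Carspeedlimit": [25, np.nan, 55, 35, np.nan, 45, 65],
    "NIGHT": ["Yes", "No", None, "No", "Yes", "No", "Yes"],
})

# Count every row with at least one missing value (no .head() truncation)
n_incomplete = int(toy.isnull().any(axis=1).sum())

# Missing values per column show which features are affected
nulls_per_column = toy.isnull().sum()

# Dropping on a single column can leave nulls behind in the other columns
partly_cleaned = toy.dropna(subset=["Carspeedlimit"])
n_still_incomplete = int(partly_cleaned.isnull().any(axis=1).sum())

print(n_incomplete, n_still_incomplete)
print(nulls_per_column)
```

If `n_still_incomplete` is not zero on the real data, calling `dropna()` without `subset` would remove the remaining incomplete rows as well.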

In [0]:
#Save new dataframe into csv_file so Full_Crash doesn't need to be opened
save_crash_data(crash_cleaned)

## **Data Visualization**

In [0]:
lng_lat = crash_data[['X', 'Y', 'Carspeedlimit']].copy()
lng_lat.head()

In [0]:
#Drop Null values from dataset
lng_lat_prepared = lng_lat.dropna()

In [0]:
lng_lat_prepared.info()

The first map plots the longitude and latitude of each crash. Notably, the points are concentrated around major cities and roadways.

In [0]:
import matplotlib.pyplot as plt

lng_lat_prepared.plot(kind='scatter', x='X', y='Y', alpha = 0.1,
            figsize=(10,4))

plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Car Crashes in Virginia')

plt.show()

The next map uses a heatmap-style scatter plot that colors each crash by the speed limit recorded for it (the Carspeedlimit feature). The densest clusters of crashes appear at lower speed limits.

In [0]:
import matplotlib.pyplot as plt

#Scatter plot of crash locations colored by speed limit
lng_lat_prepared.plot(kind='scatter', x='X', y='Y', alpha = 0.3,
            figsize=(10,4), c='Carspeedlimit', cmap=plt.get_cmap("hot"), colorbar=True)

plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Heatmap of Car Crashes in Virginia by Speed Limit')

plt.show()
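
A colored scatter plot only suggests where crashes are dense, so a per-speed-limit count can help check the reading of the map. The sketch below uses a hypothetical `toy` frame with invented values. On the real data the same call would be made on `lng_lat_prepared['Carspeedlimit']`.

```python
import pandas as pd

# Hypothetical speed limits standing in for lng_lat_prepared['Carspeedlimit']
toy = pd.DataFrame({"Carspeedlimit": [25, 25, 35, 25, 55, 35, 65, 25]})

# Number of crashes recorded at each speed limit, ordered by speed limit
crashes_per_limit = toy["Carspeedlimit"].value_counts().sort_index()

# Speed limit with the most crashes in this toy sample
most_common_limit = crashes_per_limit.idxmax()

print(crashes_per_limit)
print(most_common_limit)
```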

In [0]:
import numpy as np
from PIL import Image

#Overlay scatter plot on a highway map of Virginia
im = plt.imread('va_highways.jpg')
implot = plt.imshow(im)

pic = Image.open('va_highways.jpg')

#Convert from original range to picture size range
OldRange_X = np.max(lng_lat_prepared['X']) - np.min(lng_lat_prepared['X']) 
OldRange_Y = np.max(lng_lat_prepared['Y']) - np.min(lng_lat_prepared['Y']) 
NewRange_X = pic.size[0]
NewRange_Y = -pic.size[1]
NewValue_X = (((lng_lat_prepared['X'] - np.min(lng_lat_prepared['X'])) * NewRange_X) / OldRange_X)
NewValue_Y = (((lng_lat_prepared['Y'] - np.min(lng_lat_prepared['Y'])) * NewRange_Y) / OldRange_Y) - NewRange_Y

plt.scatter(NewValue_X, NewValue_Y, alpha=0.01)
plt.xlabel('Longitude Scaled')
plt.ylabel('Latitude Scaled')
plt.title('Car Crashes in Virginia Overlaid on Virginia Highway Map')

plt.show()

The above graph gives insight into where the majority of the crashes occur. Most crashes occur on the major highways, which carry a large share of daily traffic. Crashes are also concentrated around cities, where the higher population density means more cars on the road. This suggests that features pertaining to road type and population density are important, which is why the RURALURBANDESC and Carspeedlimit features were selected.
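
The coordinate conversion used for the overlay can be written as a small helper. This is a minimal sketch that assumes, as the cell above does, that the image covers exactly the bounding box of the crash coordinates. The name `scale_to_pixels` and the sample coordinates are hypothetical.

```python
import numpy as np

def scale_to_pixels(lon, lat, width, height):
    """Min-max scale longitude/latitude to image pixel coordinates.

    Assumes the image spans exactly the bounding box of the points.
    """
    lon = np.asarray(lon, dtype=float)
    lat = np.asarray(lat, dtype=float)
    # Longitude grows left to right, like pixel columns
    px = (lon - lon.min()) / (lon.max() - lon.min()) * width
    # Pixel rows grow downward, so latitude has to be flipped
    py = height - (lat - lat.min()) / (lat.max() - lat.min()) * height
    return px, py

# Hypothetical coordinates and image size, for illustration only
px, py = scale_to_pixels([-83.0, -79.0, -75.0], [36.5, 38.0, 39.5],
                         width=800, height=400)
print(px, py)
```

The flip on the vertical axis is the reason `NewRange_Y` is negative in the original cell: image row 0 is at the top, while the smallest latitude belongs at the bottom.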

## **Data Cleaning**

In [0]:
import pandas as pd
#Function to load updated crash data
def load_updated_crash_data(path="Updated_Crash_Data.csv"):
    return pd.read_csv(path)

#Open updated crash data
crash_data = load_updated_crash_data()
crash_data.info()

In [0]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

#Split data into categorical and numerical features for processing
crash_num = crash_data['Carspeedlimit']
crash_cat = crash_data.drop('Carspeedlimit', axis=1)

num_attribs = ["Carspeedlimit"]
cat_attribs = list(crash_cat)

num_pipeline = Pipeline([
        ('std_scaler', StandardScaler()),
    ])

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

crash_prepared = full_pipeline.fit_transform(crash_data)

In [0]:
#Split data into train and test set
from sklearn.model_selection import train_test_split

small_crash_set = crash_prepared[:8600] # for hyperparameter tuning
train_set, test_set = train_test_split(crash_prepared, test_size=0.1, random_state=42)
train_small, test_small = train_test_split(small_crash_set, test_size=0.1, random_state=42)

## **One Class SVM**

In [0]:
from sklearn.svm import OneClassSVM
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
import numpy as np

#Standard GridSearch can't be performed with OneClassSVM so use custom function
#Function returns best set of parameters from input grid
def gridTune(model, train, test, grid):
  best_params = None
  best_accuracy = 0
  for z in ParameterGrid(grid):
    model.set_params(**z)
    model.fit(train)
    y_pred = model.predict(test)
    #Every test sample is a crash, so the true label is always 1 (inlier)
    if 1. in y_pred:
      accScore = accuracy_score([1] * len(y_pred), y_pred)
      #Ties go to the parameter set tried last
      if (accScore >= best_accuracy):
        best_params = z
        best_accuracy = accScore
  return best_params
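
Since every true label is 1, the accuracy computed in `gridTune` is simply the fraction of samples predicted as inliers. The sketch below illustrates this with a hypothetical helper `inlier_fraction` and made-up predictions. The same quantity is what the accuracy cells further down compute by counting the predictions equal to 1.

```python
import numpy as np

def inlier_fraction(y_pred):
    """Fraction of predictions equal to 1 (inlier)."""
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_pred == 1))

# Hypothetical OneClassSVM output: 1 = inlier, -1 = outlier
y_pred = np.array([1, 1, -1, 1, -1, 1, 1, 1, -1, 1])
score = inlier_fraction(y_pred)
print(score)
```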

**RBF Kernel**

In [0]:
clf_rbf_grid = OneClassSVM(kernel='rbf')

#Parameter Grid: gamma, nu
grid = {'gamma' : np.logspace(-4,3,5),
        'nu' : np.linspace(0.3,0.99,5)}

bestParameters = gridTune(clf_rbf_grid, train_small, test_small, grid)
print(bestParameters)

In [0]:
#Train svm with best parameters
clf_rbf = OneClassSVM(kernel="rbf", gamma=bestParameters['gamma'], nu=bestParameters['nu']) 
clf_rbf.fit(train_set)

In [0]:
#RBF Accuracy
y_pred_train_rbf = clf_rbf.predict(train_set)
y_pred_test_rbf = clf_rbf.predict(test_set)
rbf_train_acc = y_pred_train_rbf[y_pred_train_rbf == 1].size
rbf_test_acc = y_pred_test_rbf[y_pred_test_rbf == 1].size

print("Training Accuracy: ", rbf_train_acc / len(y_pred_train_rbf)) #Training accuracy
print("Testing Accuracy: ", rbf_test_acc / len(y_pred_test_rbf)) #Testing accuracy

In [0]:
#RBF confusion matrix
confusion_matrix([1] * len(y_pred_test_rbf), y_pred_test_rbf)

The confusion matrix for this classification looks slightly different from the confusion matrices we are used to seeing. This is because every sample in the dataset represents a car crash, so the true label is always positive. Thus the only possible outcomes are true positives (bottom right) and false negatives (bottom left).
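
The layout described above can be reproduced on a small example. The predictions below are made up for illustration. Passing `labels=[-1, 1]` is an addition not present in the cells above: it keeps the matrix 2x2 even when a model predicts only one class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical predictions: 1 = inlier (crash), -1 = outlier
y_pred = [1, 1, -1, 1, -1, 1, 1, 1, -1, 1]
# Every sample is a crash, so every true label is 1
y_true = [1] * len(y_pred)

# Rows are true labels, columns are predicted labels, both ordered [-1, 1]
cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])

# The top row is empty because no sample has a true label of -1
false_negatives, true_positives = cm[1]
print(cm)
```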

**Polynomial Kernel**

In [0]:
clf_poly_grid = OneClassSVM(kernel='poly')

#Parameter Grid: gamma, nu, coef0
grid = {'gamma' : np.logspace(-4,3,5),
        'nu' : np.linspace(0.3,0.99,5),
        'coef0' : (0,1)}

#Tune the polynomial-kernel model (not the RBF one)
bestParametersPoly = gridTune(clf_poly_grid, train_small, test_small, grid)
print(bestParametersPoly)

In [0]:
#Train svm with best parameters
clf_poly = OneClassSVM(kernel="poly", gamma=bestParametersPoly['gamma'], 
                       nu=bestParametersPoly['nu'], coef0=bestParametersPoly['coef0']) 
clf_poly.fit(train_set)

In [0]:
#Polynomial Accuracy
y_pred_train_poly = clf_poly.predict(train_set)
y_pred_test_poly = clf_poly.predict(test_set)
poly_train_acc = y_pred_train_poly[y_pred_train_poly == 1].size
poly_test_acc = y_pred_test_poly[y_pred_test_poly == 1].size

print("Training Accuracy: ", poly_train_acc / len(y_pred_train_poly)) #Training accuracy
print("Testing Accuracy: ", poly_test_acc / len(y_pred_test_poly)) #Testing accuracy

In [0]:
#Polynomial confusion matrix
confusion_matrix([1] * len(y_pred_test_poly), y_pred_test_poly)

**Linear Kernel**

In [0]:
clf_lin_grid = OneClassSVM(kernel='linear')

#Parameter Grid: gamma, nu (gamma has no effect with the linear kernel)
grid = {'gamma' : np.logspace(-4,3,5),
        'nu' : np.linspace(0.3,0.99,5)}

bestParametersLin = gridTune(clf_lin_grid, train_small, test_small, grid)
print(bestParametersLin)

In [0]:
#Train svm with best parameters
clf_lin = OneClassSVM(kernel="linear", gamma=bestParametersLin['gamma'], 
                       nu=bestParametersLin['nu']) 
clf_lin.fit(train_set)

In [0]:
#Linear Accuracy
y_pred_train_lin = clf_lin.predict(train_set)
y_pred_test_lin = clf_lin.predict(test_set)
lin_train_acc = y_pred_train_lin[y_pred_train_lin == 1].size
lin_test_acc = y_pred_test_lin[y_pred_test_lin == 1].size

print("Training Accuracy: ", lin_train_acc / len(y_pred_train_lin)) #Training accuracy
print("Testing Accuracy: ", lin_test_acc / len(y_pred_test_lin)) #Testing accuracy

In [0]:
#Linear confusion matrix
confusion_matrix([1] * len(y_pred_test_lin), y_pred_test_lin)

After training with the optimal parameters on the entire dataset, the RBF and linear kernels achieved the same accuracy of 69.74% on the testing set. It should be noted that the RBF kernel performed slightly better than the linear kernel on the training set.

## **K-Nearest Neighbors**

K-nearest-neighbors methods can be used for novelty detection, that is, for deciding whether a new datapoint is an outlier. The Local Outlier Factor used here compares the local density of a point with that of its nearest neighbors. This is useful with our dataset, since we are trying to draw a boundary around the data that acts as a threshold within which positive samples should fall.
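
As a rough illustration of novelty detection with `LocalOutlierFactor`, the sketch below fits the model on a hypothetical 5x5 grid of points and then classifies one query inside the grid and one far away from it. The toy data and `n_neighbors=5` are assumptions chosen for the example, not settings from this notebook.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical training data: a regular 5x5 grid of 2-D points
xs, ys = np.meshgrid(np.arange(5), np.arange(5))
train = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)

# novelty=True allows predict() to be called on unseen data
lof = LocalOutlierFactor(novelty=True, n_neighbors=5, contamination='auto')
lof.fit(train)

# One query inside the grid, one far outside it
queries = np.array([[2.5, 2.5], [50.0, 50.0]])
labels = lof.predict(queries)  # 1 = inlier, -1 = outlier
print(labels)
```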

In [0]:
from sklearn.neighbors import LocalOutlierFactor
clf_LOF = LocalOutlierFactor(novelty=True, n_neighbors=20, contamination='auto')  #Need to set novelty to True
clf_LOF.fit(train_set)

In [0]:
#kNeighbors Accuracy
y_pred_LOF = clf_LOF.predict(test_set)
LOF_test_acc = y_pred_LOF[y_pred_LOF == 1].size
print("Testing Accuracy: ", LOF_test_acc / len(y_pred_LOF)) #Testing accuracy

In [0]:
#kNeighbors confusion matrix
confusion_matrix([1] * len(y_pred_LOF), y_pred_LOF)

## **Isolation Forest**

In [0]:
from sklearn.ensemble import IsolationForest
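#The deprecated `behaviour` argument was removed in newer scikit-learn releases, so it is omitted
#n_jobs=-1 uses all available cores
clf_iso = IsolationForest(n_estimators=10, warm_start=True, verbose=1, n_jobs=-1)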
clf_iso.fit(train_set)

In [0]:
#isoForest Accuracy
y_pred_iso = clf_iso.predict(test_set)
iso_test_acc = y_pred_iso[y_pred_iso == 1].size
print("Testing Accuracy: ", iso_test_acc / len(y_pred_iso)) #Testing accuracy

In [0]:
#isoForest confusion matrix
confusion_matrix([1] * len(y_pred_iso), y_pred_iso)

In [0]:
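#IsolationForest does not provide feature_importances_; inspect the per-sample anomaly scores instead
clf_iso.score_samples(test_set)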