## Deep Neural Networks Project

In this project, you will work with a real-world dataset from the Las Vegas Metropolitan Police Department. The dataset contains information about reported incidents, including the time and location of the crime, the type of incident, and the number of persons involved.

The dataset is downloaded from the public docket at:
https://opendata-lvmpd.hub.arcgis.com

Let's read the CSV file and transform the data:

In [120]:
import torch
import torch.nn as nn
import pandas as pd
from torch.utils.data import DataLoader, Dataset, TensorDataset
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [37]:
orig_df = pd.read_csv('../../datasets/LVMPD-Stats.csv', parse_dates=['ReportedOn'])

In [111]:
df = pd.read_csv('datasets/LVMPD-Stats.csv', parse_dates=['ReportedOn'],
                 usecols=['X', 'Y', 'ReportedOn',
                          'Area_Command', 'NIBRSOffenseCode',
                          'VictimCount'])

# Derive the day of the week and the hour of the day from the report timestamp
df['DayOfWeek'] = df['ReportedOn'].dt.day_name()
df['Time'] = df['ReportedOn'].dt.hour
df.drop(columns='ReportedOn', inplace=True)

In [112]:

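# X and Y are left unchanged here; they are standardized in Task 1.
# Encode every other column as integer codes. pd.factorize numbers the
# distinct values in order of first appearance and maps missing values to -1.
df['Time'] = pd.factorize(df['Time'])[0]
df['DayOfWeek'] = pd.factorize(df['DayOfWeek'])[0]
df['Area_Command'] = pd.factorize(df['Area_Command'])[0]
df['VictimCount'] = pd.factorize(df['VictimCount'])[0]
df['NIBRSOffenseCode'] = pd.factorize(df['NIBRSOffenseCode'])[0]
# After factorizing, only X and Y can still hold NaN,
# so this drops rows with missing coordinates.
df.dropna(inplace=True)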

In [113]:
df = df[['X', 'Y', 'Area_Command', 'NIBRSOffenseCode',
         'DayOfWeek', 'Time', 'VictimCount']]

In [42]:
df.values.shape

(275, 7)

## Goal
The goal is to build a predictive model that is trained on the following data:
* Latitude and longitude (location)
* Hour of the day
* Day of the week
* Area-of-command code: the police designation of the bureau of the operation
* Classification code for the crime committed

The predicted variable is the number of persons involved in the incident.


## Task 1
* Print a few rows of the values in the dataframe ``df`` and explain what each column of data means.
* Identify the input and target variables.
* What is the range of values in each column? Do you need to scale, shift or normalize your data?


In [114]:
'''
X is longitude, Y is latitude (continuous)
  Will be standardized to zero mean and unit variance

Area_Command is the police designation of the bureau of the operation, encoded as an integer
  Categorical data; could be one-hot encoded, but that is not strictly needed here

NIBRSOffenseCode is the code for the crime committed, encoded as an integer with a range of 2
  Categorical data, same as above; already encoded as integers

DayOfWeek is the day of the week, encoded as an integer with a range of 7
  Categorical data, same as above; the range of values is small

Time is the hour of the day, encoded as an integer
  These are pd.factorize codes, so the integer is not the clock hour itself

VictimCount is the target; the other six columns are the inputs
'''
df['X'] = (df['X']-df['X'].mean())/df['X'].std()
df['Y'] = (df['Y']-df['Y'].mean())/df['Y'].std()

print(df.head())

          X         Y  Area_Command  NIBRSOffenseCode  DayOfWeek  Time  \
0  0.708907  0.619351             0                 0          0     0   
1 -0.798132  0.391269             1                 1          1     1   
2  0.160300  0.320637             2                 1          2     0   
3 -0.648490 -0.217249             3                 1          1     2   
4 -0.171602 -0.400214             4                 1          1     3   

   VictimCount  
0            0  
1            0  
2            1  
3            2  
4            0  
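
Task 1 also asks for the range of values in each column. One way this could be checked is sketched below. The frame `toy` and its values are hypothetical stand-ins for `df`, used only so the snippet runs on its own; they are not taken from the LVMPD data.

```python
import pandas as pd

# Hypothetical stand-in for `df`; the values are illustrative only.
toy = pd.DataFrame({
    'X': [-115.2, -115.1, -115.3],
    'Y': [36.1, 36.2, 36.0],
    'DayOfWeek': [0, 1, 2],
    'VictimCount': [0, 1, 0],
})

# Per-column minimum and maximum in a single table.
ranges = toy.agg(['min', 'max'])
print(ranges)

# Standardize only the continuous coordinates; the integer-coded
# categorical columns are left as they are.
for col in ['X', 'Y']:
    toy[col] = (toy[col] - toy[col].mean()) / toy[col].std()
```

On the real data, the same `agg(['min', 'max'])` call on `df` would show which columns span a wide numeric range and are therefore candidates for scaling.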


## Task 2

* Create two `DataLoader` objects for training and testing based on the input and output variables. Pick a reasonable batch size and verify the shape of the data by iterating over one of the datasets and printing the shape of each batch.

In [126]:
batch_size = 40
n_iters = 3000
train_dataframe, test_dataframe = train_test_split(df, test_size=0.3)


# Split inputs and target. drop() without inplace returns a new frame,
# so the frames produced by train_test_split are not modified.
train_x = train_dataframe.drop(columns=['VictimCount'])
train_y = train_dataframe[['VictimCount']]

tensor_train_x = torch.tensor(train_x.to_numpy(), dtype=torch.float32)
tensor_train_y = torch.tensor(train_y.to_numpy(), dtype=torch.float32)
train_dataset = TensorDataset(tensor_train_x, tensor_train_y)

# Shuffle the training batches on every epoch
train_loader = DataLoader(dataset=train_dataset,
                          batch_size=batch_size,
                          shuffle=True)


test_x = test_dataframe.drop(columns=['VictimCount'])
test_y = test_dataframe[['VictimCount']]

tensor_test_x = torch.tensor(test_x.to_numpy(), dtype=torch.float32)
tensor_test_y = torch.tensor(test_y.to_numpy(), dtype=torch.float32)
test_dataset = TensorDataset(tensor_test_x, tensor_test_y)

# Keep the test batches in a fixed order
test_loader = DataLoader(dataset=test_dataset,
                         batch_size=batch_size,
                         shuffle=False)

for batch in test_loader:
    x, y = batch
    print(x.shape, y.shape)

print('\n')

for batch in train_loader:
    x, y = batch
    print(x.shape, y.shape)


torch.Size([40, 6]) torch.Size([40, 1])
torch.Size([40, 6]) torch.Size([40, 1])
torch.Size([3, 6]) torch.Size([3, 1])


torch.Size([40, 6]) torch.Size([40, 1])
torch.Size([40, 6]) torch.Size([40, 1])
torch.Size([40, 6]) torch.Size([40, 1])
torch.Size([40, 6]) torch.Size([40, 1])
torch.Size([32, 6]) torch.Size([32, 1])


## Task 3
In this task you will try to predict number of crime victims as a **real number**. Therefore the machine learning problem is a **regression** problem.

* Define the proper loss function for this task.
* What should the size of the predicted output be?
* Explain your choice of architecture, including how many layers you will be using.
* Define an optimizer for training this model and choose a proper learning rate.
* Write a training loop that obtains a batch out of the training data and calculates the forward and backward passes over the neural network. Call the optimizer to update the weights of the neural network.
* Write a for loop that continues the training over a number of epochs. At the end of each epoch, calculate the ``MSE`` error on the test data and print it.
* Is your model training well? Adjust the learning rate, hidden size of the network, and try different activation functions and number of layers to achieve the best accuracy and report it.

In [None]:
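# MSE loss, since this is regression: each prediction is compared to the expected value
loss_function = nn.MSELoss()

# The output is a single scalar per sample: the predicted victim count.
# Assumed placeholder architecture (6 inputs, one hidden layer, 1 output); adjust as needed.
model = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 1))

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)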
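
The training and evaluation steps listed for Task 3 could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a tuned solution: the data is random and only mimics the batch shapes seen in Task 2 (6 features, 1 target), and the two-hidden-layer network, the hidden sizes, the learning rate and the 5 epochs are all arbitrary choices.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data (assumption): 6 input features and one
# real-valued target per sample, matching the shapes from Task 2.
train_x, train_y = torch.randn(192, 6), torch.randn(192, 1)
test_x, test_y = torch.randn(83, 6), torch.randn(83, 1)
train_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=40, shuffle=True)
test_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=40, shuffle=False)

# Assumed architecture: two hidden layers and a single scalar output.
model = nn.Sequential(
    nn.Linear(6, 32), nn.ReLU(),
    nn.Linear(32, 16), nn.ReLU(),
    nn.Linear(16, 1),
)
loss_function = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

test_mse_history = []
for epoch in range(5):
    # Training: forward pass, backward pass and weight update per batch.
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = loss_function(model(x), y)
        loss.backward()
        optimizer.step()

    # Evaluation: sum squared errors over all test batches, then divide
    # by the number of samples so the uneven last batch is weighted correctly.
    model.eval()
    squared_error, n_samples = 0.0, 0
    with torch.no_grad():
        for x, y in test_loader:
            squared_error += ((model(x) - y) ** 2).sum().item()
            n_samples += y.numel()
    test_mse_history.append(squared_error / n_samples)
    print(f'epoch {epoch + 1}: test MSE = {test_mse_history[-1]:.4f}')
```

With the real loaders, the synthetic tensors would be dropped and `train_loader` and `test_loader` from Task 2 used directly. Accumulating the summed squared error instead of averaging per-batch losses avoids giving the final 3-sample test batch the same weight as a full batch.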




## Task 4

In this task, you will try to predict the number of crime victims as a **class number**. Therefore the machine learning problem is a **classification** problem.

* Repeat all the steps in task 3. Specifically, pay attention to the differences with regression.
* How would you find the number of classes on the output data?
* How is the architecture different?
* How is the loss function different?
* Calculate the accuracy on the test data in each epoch as the number of correctly classified outputs divided by the total number of test samples. Report it at the end of each epoch.
* Try a few variations of learning rate, hidden dimensions, layers, etc. What is the best accuracy that you can get?
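
A rough sketch of how the classification variant could differ from the regression setup is given below. The assumptions are: random stand-in data, labels that are integer codes `0..K-1` (as `pd.factorize` produces), and an arbitrary small network. Three differences are illustrated: the output layer has one logit per class, the loss is `nn.CrossEntropyLoss`, and the targets are `int64` tensors of shape `(N,)` instead of floats of shape `(N, 1)`.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data (assumption): 6 features and integer class labels
# playing the role of the factorized VictimCount column.
train_x, train_y = torch.randn(192, 6), torch.randint(0, 4, (192,))
test_x, test_y = torch.randn(83, 6), torch.randint(0, 4, (83,))

# Labels are codes 0..K-1, so the number of classes is the largest code plus one.
num_classes = int(torch.cat([train_y, test_y]).max()) + 1

train_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=40, shuffle=True)
test_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=40, shuffle=False)

# Assumed architecture: one hidden layer, one output logit per class.
model = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, num_classes))
loss_function = nn.CrossEntropyLoss()  # takes raw logits and int64 labels
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

accuracy_history = []
for epoch in range(5):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = loss_function(model(x), y)
        loss.backward()
        optimizer.step()

    # Accuracy: correctly classified test samples / total test samples.
    model.eval()
    correct = 0
    with torch.no_grad():
        for x, y in test_loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
    accuracy_history.append(correct / len(test_loader.dataset))
    print(f'epoch {epoch + 1}: test accuracy = {accuracy_history[-1]:.3f}')
```

On random labels the accuracy should hover around chance level, so the numbers printed here say nothing about the real data. With the Task 2 tensors, the targets would first need to be converted, for example with `tensor_train_y.squeeze(1).long()`.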

## Task 5

### Reflect on your results

* Write a paragraph about your experience with tasks 3 and 4. How do you compare the results? Which one worked better? Why?
* Write a piece of code that finds an example of a misclassification. Calculate the probabilities for the output classes and plot them in a bar chart. Also, indicate the correct class label.
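
The misclassification search could look something like the sketch below. It assumes a classifier that outputs one logit per class. Here `model` is an untrained, hypothetical network and the test tensors are random, so the example only demonstrates the mechanics: softmax over the logits, a comparison against the labels, and a bar chart for the first mismatch.

```python
import matplotlib.pyplot as plt
import torch
import torch.nn as nn

torch.manual_seed(0)

num_classes = 4
# Hypothetical classifier and test data (random weights, random labels).
model = nn.Sequential(nn.Linear(6, 16), nn.ReLU(), nn.Linear(16, num_classes))
test_x = torch.randn(83, 6)
test_y = torch.randint(0, num_classes, (83,))

# Class probabilities: softmax over the logits of each sample.
model.eval()
with torch.no_grad():
    probs = torch.softmax(model(test_x), dim=1)
predicted = probs.argmax(dim=1)

# Indices of all test samples whose predicted class differs from the label.
wrong = (predicted != test_y).nonzero(as_tuple=True)[0]

if len(wrong) > 0:
    i = wrong[0].item()
    true_class = test_y[i].item()
    pred_class = predicted[i].item()

    bars = plt.bar(range(num_classes), probs[i].numpy(), color='tab:blue')
    bars[true_class].set_color('tab:green')  # highlight the correct class
    plt.xticks(range(num_classes))
    plt.xlabel('class')
    plt.ylabel('predicted probability')
    plt.title(f'sample {i}: predicted {pred_class}, correct {true_class}')
```

In a notebook with the inline backend the figure renders at the end of the cell; in a plain script, a call to `plt.show()` or `plt.savefig(...)` would be needed.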

## Task 6: Exploring the patterns in raw data

* Plot the crime incidents as a `scatter` plot using the coordinates. Use the color property of each data point to indicate the day of the week. Is there a pattern in the plot?
* Now make a new scatter plot and use the color property of each data point to indicate the number of persons involved in the incident. Is there a pattern here?
* Use numpy (or pandas if you like) to count the crimes reported on each day of the week and sort the counts. Which days are most frequent?
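
The two scatter plots and the per-day counts could be produced along the following lines. The frame `toy` is a randomly generated, hypothetical stand-in for `df` with the same column names, so any pattern (or lack of one) in its plots is meaningless; only the plotting and counting calls are the point.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for `df`; random values, not LVMPD data.
toy = pd.DataFrame({
    'X': rng.normal(size=50),
    'Y': rng.normal(size=50),
    'DayOfWeek': rng.integers(0, 7, size=50),
    'VictimCount': rng.integers(0, 4, size=50),
})

# One scatter plot per coloring variable, side by side.
fig, axes = plt.subplots(1, 2, figsize=(11, 4))
for ax, col in zip(axes, ['DayOfWeek', 'VictimCount']):
    points = ax.scatter(toy['X'], toy['Y'], c=toy[col], cmap='viridis', s=15)
    fig.colorbar(points, ax=ax, label=col)
    ax.set_xlabel('X (longitude)')
    ax.set_ylabel('Y (latitude)')

# value_counts() counts the rows per day code and sorts the result
# from most frequent to least frequent.
day_counts = toy['DayOfWeek'].value_counts()
print(day_counts)
```

Because `DayOfWeek` was encoded with `pd.factorize`, the codes follow the order of first appearance in the file. To report day names, the second return value of `pd.factorize` (the array of unique values) would have to be kept and used to map codes back to names.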
