##  Deep Neural Networks Project
_<span style="float: right">Norine NDOUDI</span>_

In this project, you will be working with a real-world dataset from the Las Vegas Metropolitan Police Department. The dataset contains information about reported incidents, including the time and location of the crime, the type of incident, and the number of persons involved.

The dataset is downloaded from the public docket at: 
https://opendata-lvmpd.hub.arcgis.com

Let's read the CSV file and transform the data:

In [13]:
import torch
import pandas as pd
from torch.utils.data import DataLoader, Dataset
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [14]:
orig_df = pd.read_csv('../../datasets/LVMPD-Stats.csv', parse_dates=['ReportedOn'])

In [15]:
df = pd.read_csv('../../datasets/LVMPD-Stats.csv', parse_dates=['ReportedOn'],
                 usecols = ['X', 'Y', 'ReportedOn',
                            'Area_Command','NIBRSOffenseCode',
                            'VictimCount' ] )

df['DayOfWeek'] = df['ReportedOn'].dt.day_name()
df['Time' ]     = df['ReportedOn'].dt.hour
df.drop(columns = 'ReportedOn', inplace=True)

In [16]:

# X (longitude) and Y (latitude) are kept as raw coordinates.
# pd.factorize replaces each distinct value with an integer code,
# numbered in order of first appearance.
df['Time'] = pd.factorize(df['Time'])[0]
df['DayOfWeek'] = pd.factorize(df['DayOfWeek'])[0]
df['Area_Command'] = pd.factorize(df['Area_Command'])[0]
df['VictimCount'] = pd.factorize(df['VictimCount'])[0]
df['NIBRSOffenseCode'] = pd.factorize(df['NIBRSOffenseCode'])[0]
df.dropna(inplace=True)

In [17]:
df= df[['X', 'Y', 'Area_Command', 'NIBRSOffenseCode',
       'DayOfWeek', 'Time','VictimCount']]

In [18]:
df.values.shape

(275, 7)
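
To make the encoding step concrete, here is a minimal sketch of what `pd.factorize` does. The `hours` series is made up for illustration and is not taken from the dataset. It also shows one thing to keep in mind: by default, missing values receive the code `-1` instead of staying `NaN`, so a `dropna` that runs after the factorization would not remove them.

```python
import numpy as np
import pandas as pd

# Hypothetical hours of the day, in the order they might appear in a file
hours = pd.Series([14, 9, 14, 23, np.nan, 9])

# codes: one integer per row; uniques: distinct values in order of first appearance
codes, uniques = pd.factorize(hours)

print(codes)    # missing values are encoded as -1
print(uniques)

# Map each code back to the original value (None for missing entries)
decoded = [uniques[c] if c >= 0 else None for c in codes]
print(decoded)
```

The code assigned to a value depends only on when it first appears, so a code of `0` means "the first distinct value seen", not "hour 0".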

# Goal
The goal is to build a predictive model that is trained on the following data:
* latitude and longitude (location)
* Hour of the day
* Day of the week
* Area-of-command code: The police designation of the bureau of the operation.
* Classification code for the crime committed
  
The predicted variable is the number of persons involved in the incident.


## Task 1
* print a few rows of the values in the dataframe ``df`` and explain what each column of data means. 
* identify the input and target variables
* what is the range of values in each column? Do you need to scale, shift or normalize your data? 


In [8]:
#Print a few rows of the values in the dataframe df
df

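| | X | Y | Area_Command | NIBRSOffenseCode | DayOfWeek | Time | VictimCount |
|---|---|---|---|---|---|---|---|
| 0 | -115.087518 | 36.216702 | 0 | 0 | 0 | 0 | 0 |
| 1 | -115.240172 | 36.189693 | 1 | 1 | 1 | 1 | 0 |
| 2 | -115.143088 | 36.181329 | 2 | 1 | 2 | 0 | 1 |
| 3 | -115.225014 | 36.117633 | 3 | 1 | 1 | 2 | 2 |
| 4 | -115.176708 | 36.095967 | 4 | 1 | 1 | 3 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 270 | -115.114739 | 36.119592 | 5 | 1 | 5 | 18 | 0 |
| 271 | -115.080764 | 36.162648 | 0 | 1 | 5 | 17 | 0 |
| 272 | -115.172073 | 36.123012 | 4 | 1 | 1 | 16 | 2 |
| 273 | -115.152593 | 36.066073 | 5 | 1 | 6 | 23 | 0 |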


#### Explain what each column of data means.
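* `X` and `Y` give the location of the incident: `X` is the longitude and `Y` is the latitude.
* `Area_Command` is the police designation of the bureau of the operation, encoded as an integer code.
* `NIBRSOffenseCode` is the classification code for the crime committed, that is, the type of offense, encoded as an integer code.
* `DayOfWeek` is the day of the week (Monday, Tuesday, Wednesday, ...) on which the incident was reported, encoded as an integer code.
* `Time` is the hour of the day at which the incident was reported (0 for midnight up to 23). Because the column goes through `pd.factorize`, the stored value is an integer code assigned in order of first appearance, not the hour itself.
* `VictimCount` represents the number of victims associated with each reported crime. It also goes through `pd.factorize`, so the stored value is a code for each distinct count.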

#### Identify the input and target variables
As explained in the goal of this assignment, the input variables are:
* Latitude and longitude (location)
* Classification code for the crime committed
* Area-of-command code
* Hour of the day
* Day of the week

The target variable is the **number of victims**.

#### what is the range of values in each column? Do you need to scale, shift or normalize your data? 
* `X` holds longitudes and `Y` holds latitudes. In general a longitude lies in [-180, 180] and a latitude in [-90, 90]; in the rows shown above, `X` is close to -115 and `Y` is close to 36.
* `Area_Command` holds integer codes in [0, number of distinct area commands - 1]. 0 represents one area, 1 another, and so on.
* `NIBRSOffenseCode` works the same way as `Area_Command`: its values are in [0, number of distinct offense codes - 1].
* `DayOfWeek` has values in [0, 6]. 0 represents one day, 1 another, and so on for the seven days of the week.
* `Time` has values in [0, 23].
* `VictimCount` has values $n \in \mathbb{N}$, from 0 up to the number of distinct victim counts minus 1.

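We should **normalize our data** so that all input features lie between 0 and 1. `X` (about -115) and `Y` (about 36) are on a very different scale from the small integer codes in the other columns, and putting every input on the same scale keeps a single feature from dominating the training.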
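
One possible way to do this is min-max scaling, sketched below with NumPy. The helper name `min_max_scale` is our own, and the three sample rows are the input columns of the first rows printed above. This is only one option; standardization (zero mean, unit variance) would be another reasonable choice.

```python
import numpy as np

def min_max_scale(features):
    """Rescale each column to [0, 1] using its own minimum and maximum."""
    features = np.asarray(features, dtype=np.float64)
    col_min = features.min(axis=0)
    col_max = features.max(axis=0)
    span = col_max - col_min
    span[span == 0] = 1.0  # a constant column is mapped to 0 instead of dividing by zero
    return (features - col_min) / span

# Columns: X, Y, Area_Command, NIBRSOffenseCode, DayOfWeek, Time
sample = np.array([
    [-115.087518, 36.216702, 0, 0, 0, 0],
    [-115.240172, 36.189693, 1, 1, 1, 1],
    [-115.143088, 36.181329, 2, 1, 2, 0],
])

scaled = min_max_scale(sample)
print(scaled.round(3))
```

When the data is split into training and test sets, the minimum and maximum are usually computed on the training set only and then reused for the test set.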

## Task 2 

* Create two `DataLoader` objects for training and testing based on the input and output variables. Pick a reasonable batch size and verify the shape of the data by iterating over one of the datasets and printing the shape of the batched data. 

With the first part, we have identified our inputs and the target variables. 

In [19]:
#Separate the input and the output 
X = df[['X', 'Y', 'Area_Command', 'NIBRSOffenseCode', 'DayOfWeek', 'Time']].values
Y = df['VictimCount'].values
print(X.shape)
print(Y.shape)

(275, 6)
(275,)


Next, we use PyTorch's `DataLoader` class to batch the data. A `DataLoader` wraps a dataset object, so we first convert the NumPy arrays (`X` and `Y`) into `torch.Tensor` objects and combine them with `TensorDataset`.  
*<span style="float:right;">With help from [the documentation](https://pytorch.org/docs/stable/data.html) and AI</span>*

In [30]:
from torch.utils.data import DataLoader, TensorDataset

# Conversion to tensors
X_tensor = torch.tensor(X, dtype=torch.float32)
Y_tensor = torch.tensor(Y, dtype=torch.float32)

# Creation of the dataset
dataset = TensorDataset(X_tensor, Y_tensor)

# Pick a reasonable batch size (a power of 2)
batch_size = 64

# Creation of the DataLoader objects.
# Note: both loaders wrap the same full dataset, so the test loader
# does not hold out unseen samples; only the shuffling differs.
train_loader = DataLoader(dataset=dataset,
                          batch_size=batch_size,
                          shuffle=True)

test_loader = DataLoader(dataset=dataset,
                         batch_size=batch_size,
                         shuffle=False)

#Iterating over the train dataset and printing the shape of the batched data.
for batch, (X_batch, Y_batch) in enumerate(train_loader):
    print(f"Batch {batch + 1} - X_batch shape: {X_batch.shape}, Y_batch shape: {Y_batch.shape}")

Batch 1 - X_batch shape: torch.Size([64, 6]), Y_batch shape: torch.Size([64])
Batch 2 - X_batch shape: torch.Size([64, 6]), Y_batch shape: torch.Size([64])
Batch 3 - X_batch shape: torch.Size([64, 6]), Y_batch shape: torch.Size([64])
Batch 4 - X_batch shape: torch.Size([64, 6]), Y_batch shape: torch.Size([64])
Batch 5 - X_batch shape: torch.Size([19, 6]), Y_batch shape: torch.Size([19])
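
In the cell above, both loaders are built from the same `dataset`, so the test loader sees exactly the samples used for training. A possible alternative, sketched here with `torch.utils.data.random_split`, is to hold out part of the data for testing. The 80/20 ratio, the seed, and the random stand-in tensors are assumptions made for this example, not values from the notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Random stand-in data with the same shape as the notebook's X and Y
features = torch.randn(275, 6)
targets = torch.randint(0, 7, (275,)).float()
dataset = TensorDataset(features, targets)

# Hold out 20% of the samples for testing (assumed ratio)
n_test = int(0.2 * len(dataset))
n_train = len(dataset) - n_test

# A seeded generator makes the split reproducible
generator = torch.Generator().manual_seed(0)
train_set, test_set = random_split(dataset, [n_train, n_test], generator=generator)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = DataLoader(test_set, batch_size=64, shuffle=False)

print(len(train_set), len(test_set))
```

With this split, the metric computed on `test_loader` measures how the model behaves on samples it has not been trained on.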


## Task 3
In this task you will try to predict number of crime victims as a **real number**. Therefore the machine learning problem is a **regression** problem. 

* Define the proper loss function for this task
* what should the size of the predicted output be?
* explain your choice of architecture, including how many layers you will be using
* define an optimizer for training this model, choose a proper learning rate 
* write a training loop that obtains a batch out of the training data and calculates the forward and backward passes over the neural network. Call the optimizer to update the weights of the neural network.
* write a for loop that continues the training over a number of epochs. At the end of each epoch, calculate the ``MSE`` error on the test data and print it.
* is your model training well? Adjust the learning rate, hidden size of the network, and try different activation functions and number of layers to achieve the best accuracy and report it. 
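
The steps listed above could be sketched as follows. This is a minimal illustration on synthetic data, not a tuned solution: the layer sizes, the learning rate, the number of epochs, and the stand-in tensors are all assumptions. The model outputs one value per sample, so the target is given the shape `(N, 1)` to match. With a target of shape `(N,)`, as in the batches printed in Task 2, `Y_batch.unsqueeze(1)` would be needed before computing the loss.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: 6 input features, one real-valued target of shape (N, 1)
X = torch.rand(200, 6)
y = X.sum(dim=1, keepdim=True)
train_loader = DataLoader(TensorDataset(X[:160], y[:160]), batch_size=32, shuffle=True)
test_loader = DataLoader(TensorDataset(X[160:], y[160:]), batch_size=32)

# Small fully connected network with a single output unit (regression)
model = nn.Sequential(
    nn.Linear(6, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 1),
)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

def evaluate_mse(model, loader):
    """Mean squared error over every sample served by the loader."""
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for xb, yb in loader:
            total += nn.functional.mse_loss(model(xb), yb, reduction="sum").item()
            count += yb.numel()
    return total / count

history = []
for epoch in range(20):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()          # clear gradients from the previous step
        loss = loss_fn(model(xb), yb)  # forward pass
        loss.backward()                # backward pass
        optimizer.step()               # weight update
    test_mse = evaluate_mse(model, test_loader)
    history.append(test_mse)
    print(f"Epoch {epoch + 1}: test MSE = {test_mse:.4f}")
```

Watching whether the test MSE keeps decreasing across epochs is a simple way to judge if the model is training well before changing the learning rate or the architecture.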

## Task 4 

In this task, you will try to predict the number of crime victims as a **class number**. Therefore the machine learning problem is a **classification** problem. 

* Repeat all the steps in task 3. Specifically, pay attention to the differences with regression.
* How would you find the number of classes on the output data?
* How is the architecture different?
* How is the loss function different?
* Calculate the accuracy on the test data as the number of correctly classified outputs divided by the total number of test samples in each epoch. Report it at the end of each epoch.
* Try a few variations of learning rate, hidden dimensions, layers, etc. What is the best accuracy that you can get? 
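
The differences with regression can be illustrated with a short sketch, again on synthetic data and with assumed hyperparameters. The number of classes is taken as the number of distinct labels, which assumes the labels are the integers 0 to K-1. The output layer has one unit per class, the loss is `nn.CrossEntropyLoss` (which expects raw logits and integer labels of type `long`), and the metric is accuracy.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: 6 input features, integer class labels 0, 1, 2
X = torch.rand(200, 6)
y = (X[:, 0] * 3).long()

# Number of classes = number of distinct labels (assumes labels are 0..K-1)
num_classes = torch.unique(y).numel()

train_loader = DataLoader(TensorDataset(X[:160], y[:160]), batch_size=32, shuffle=True)
test_loader = DataLoader(TensorDataset(X[160:], y[160:]), batch_size=32)

# One output unit (logit) per class instead of a single real-valued output
model = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, num_classes))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

def evaluate_accuracy(model, loader):
    """Fraction of samples whose highest-scoring class matches the label."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for xb, yb in loader:
            predicted = model(xb).argmax(dim=1)
            correct += (predicted == yb).sum().item()
            total += yb.size(0)
    return correct / total

accuracies = []
for epoch in range(20):
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
    accuracy = evaluate_accuracy(model, test_loader)
    accuracies.append(accuracy)
    print(f"Epoch {epoch + 1}: test accuracy = {accuracy:.3f}")
```

In the notebook, `Y_tensor` is created with `dtype=torch.float32`, so it would have to be converted with `.long()` before being passed to `nn.CrossEntropyLoss` as class labels.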

## Task 5

### Reflect on your results

* Write a paragraph about your experience with tasks 3 and 4. How do you compare the results? Which one worked better? Why?
* Write a piece of code that finds an example of a misclassification. Calculate the probabilities for the output classes and plot them in a bar chart. Also, indicate what the correct class label is.
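
One way to approach the second point is sketched below. The model here is an untrained stand-in and the data is random, both assumptions made only so that the example is self-contained; in the notebook, the trained classifier and the test data would be used instead. The class probabilities are obtained by applying a softmax to the logits.

```python
import torch
from torch import nn
import matplotlib.pyplot as plt

torch.manual_seed(0)

# Random stand-in data and an untrained stand-in classifier with 4 classes
X = torch.rand(50, 6)
y = torch.randint(0, 4, (50,))
model = nn.Sequential(nn.Linear(6, 16), nn.ReLU(), nn.Linear(16, 4))

model.eval()
with torch.no_grad():
    probs = torch.softmax(model(X), dim=1)  # each row sums to 1

predicted = probs.argmax(dim=1)
wrong = (predicted != y).nonzero(as_tuple=True)[0]  # indices of misclassified samples

if wrong.numel() > 0:
    i = wrong[0].item()
    fig, ax = plt.subplots()
    ax.bar(range(probs.shape[1]), probs[i].numpy())
    ax.set_xlabel("Class")
    ax.set_ylabel("Predicted probability")
    ax.set_title(f"Sample {i}: predicted {predicted[i].item()}, correct {y[i].item()}")
else:
    print("No misclassified sample found.")
```

The bar for the correct class shows how much probability the model assigned to the right answer, which helps tell a near miss from a confident mistake.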

## Task 6: Exploring the patterns in raw data

* Plot the crime incidents as a `scatter` plot using the coordinates. Use the color property of each datapoint to indicate the day of the week. Is there a pattern in the plot?
* Now make a new scatter plot and use the color property of each datapoint to indicate the number of persons involved in the incident. Is there a pattern here?
* Use NumPy (or pandas if you like) to sort the number of crimes reported by the day of the week. What days are most frequent?
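
A possible starting point for these three questions is sketched below with pandas and matplotlib. The `demo` dataframe is filled with random values and only mimics the column names of `df`; with the real data, `df` would take its place.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Random stand-in data that mimics the columns of df
rng = np.random.default_rng(0)
n = 100
demo = pd.DataFrame({
    "X": rng.uniform(-115.3, -115.0, n),
    "Y": rng.uniform(36.0, 36.3, n),
    "DayOfWeek": rng.integers(0, 7, n),
    "VictimCount": rng.integers(0, 5, n),
})

# Two scatter plots of the coordinates, colored by a different column each
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, column in zip(axes, ["DayOfWeek", "VictimCount"]):
    points = ax.scatter(demo["X"], demo["Y"], c=demo[column], cmap="viridis", s=15)
    fig.colorbar(points, ax=ax, label=column)
    ax.set_xlabel("X (longitude)")
    ax.set_ylabel("Y (latitude)")
    ax.set_title(f"Incidents colored by {column}")

# Number of incidents per day of the week, most frequent first
day_counts = demo["DayOfWeek"].value_counts()
print(day_counts)
```

Since `DayOfWeek` is stored as integer codes in the notebook, the codes would need to be mapped back to day names (for example with the second value returned by `pd.factorize`) to answer which days are most frequent.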
