# IADS midterm

Please ensure all code is executed and the corresponding outputs are included. Write the code directly in this notebook rather than creating a new one.

## Part 1: Multiple choice and theoretical questions
Please write your answer after each question.

### Question 1. What would a p-value of 0.04 mean for a t-test comparing two samples of observations (select all that apply):
A) the sample averages are at least 4% different

B) the samples follow underlying distributions with the same mean

C) the samples follow underlying distributions with different means

D) one can reject the null hypothesis that the samples follow underlying distributions with the same mean at the 5% significance level (or with 95% confidence), since the p-value is below 0.05

E) one can't reject the null hypothesis that the samples follow underlying distributions with the same mean at the 5% significance level (or with 95% confidence), since the p-value does not reach 0.05

F) one can reject the null hypothesis that the samples follow underlying distributions with different means at the 5% significance level (or with 95% confidence)

G) the probability that the two samples have the same mean is 4%

Answer: 

### Question 2. What is true regarding normal and log-normal distributions:
A) Quantities following log-normal distributions have higher probabilities of outliers than quantities following normal distributions

B) Outliers significantly different from the mean are more common for normally distributed variables than for log-normally distributed variables

C) The logarithm of a normally distributed quantity follows a log-normal distribution

D) The logarithm of a log-normally distributed quantity follows a normal distribution

E) The probability density function of a log-normally distributed variable equals the logarithm of the probability density function of a normally distributed variable

Answer: 

### Question 3. 
Imagine training a model that considers multiple satellite images of urban traffic and tries to find groups of typical
(repeated with minor deviations) scenarios. How would you classify this problem from a Machine Learning perspective?

A) Supervised learning;

B) Unsupervised learning;

C) Semi-supervised learning;

D) Reinforcement learning.

Explain your choice:

Answer: 

### Question 4. 
Please explain why you would need separate training, validation and test samples to learn a model. In which cases might you need all three, including a validation sample?



Answer: 

In [None]:
# !pip install rtree
# !pip install pygeos
# !pip install geopandas

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime
from dateutil import parser
import seaborn as sns
from scipy.stats import norm
from scipy.stats import genextreme as gev
from scipy.stats import pareto 
from scipy import stats
import geopandas as gpd
from shapely.geometry import Point
from matplotlib import pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

## Part 2: NYPD data analysis

In this part, you need to download New York Police Department (NYPD) complaints data for 2019 and write code for the following three sections (each with its own subsections): Data cleaning, Exploratory analysis and Hypothesis testing.

### download NYPD complaints data:
Two options:
1. download with curl or urllib methods
2. download with the API

In [None]:
!curl "https://data.cityofnewyork.us/api/views/qgea-i56i/rows.csv?accessType=DOWNLOAD" > NYPD_data.csv

In [None]:
# !wget https://data.cityofnewyork.us/api/views/qgea-i56i/rows.csv?accessType=DOWNLOAD

In [None]:
# !wget https://www.dropbox.com/s/u78fk8g0wkf3xwu/NYPD_data.csv?dl=0

Data dictionary: https://data.cityofnewyork.us/api/views/qgea-i56i/files/b21ec89f-4d7b-494e-b2e9-f69ae7f4c228?download=true&filename=NYPD_Complaint_Incident_Level_Data_Footnotes.pdf

### read data

In [None]:
data = pd.read_csv('NYPD_data.csv')
data.head()

In [None]:
data.shape

#### the shape of the data frame should be (8914838, 35)

In [None]:
data.OFNS_DESC.unique()

In [None]:
data.columns

The complete data dictionary link is provided above. This notebook focuses on the columns 'CMPLNT_NUM', 'CMPLNT_FR_DT', 'CMPLNT_FR_TM', 'OFNS_DESC', 'BORO_NM', 'PARKS_NM', 'Latitude' and 'Longitude'.

'CMPLNT_NUM' is a unique id for each complaint, 'CMPLNT_FR_DT' and 'CMPLNT_FR_TM' are the date and time of the complaint respectively, 'OFNS_DESC' is the type of offence reported, 'BORO_NM' is the name of the borough where the complaint was reported, 'PARKS_NM' is the name of the park where the complaint was recorded (if any), and 'Latitude' and 'Longitude' give the location of the complaint.
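
As an optional illustration, the sketch below shows how `pd.read_csv` can be limited to the columns listed above through its `usecols` argument. It runs on a tiny in-memory CSV with made-up values (the `EXTRA_COL` column is hypothetical), not on the real file. The cells in this notebook read the full file, so this is only a memory-saving variant, not the required approach.

```python
import io
import pandas as pd

# Columns named in the text above; all other columns are skipped at parse time.
COLUMNS_OF_INTEREST = ['CMPLNT_NUM', 'CMPLNT_FR_DT', 'CMPLNT_FR_TM', 'OFNS_DESC',
                       'BORO_NM', 'PARKS_NM', 'Latitude', 'Longitude']

# Toy stand-in for NYPD_data.csv (values are made up; EXTRA_COL is hypothetical).
toy_csv = io.StringIO(
    "CMPLNT_NUM,CMPLNT_FR_DT,CMPLNT_FR_TM,OFNS_DESC,BORO_NM,PARKS_NM,Latitude,Longitude,EXTRA_COL\n"
    "1,01/05/2019,13:30:00,ROBBERY,BRONX,(null),40.85,-73.88,x\n"
    "2,02/10/2019,08:15:00,BURGLARY,QUEENS,(null),40.73,-73.79,y\n"
)

# usecols keeps only the requested columns, in the order they appear in the file.
subset = pd.read_csv(toy_csv, usecols=COLUMNS_OF_INTEREST)
print(subset.columns.tolist())
print(subset.shape)
```

Note that the shape checkpoint given earlier in the notebook refers to the full 35-column read, so it would not match a frame loaded this way.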


## Section 1 - Data cleaning tasks 
#### We have completed the majority of the data cleaning tasks, but there are still a few remaining items for you to address (marked as 'todo').
1. Drop rows with a) a missing/wrong complaint date or time, b) a missing borough name and c) a duplicate complaint number ('CMPLNT_NUM' column)
2. Filter out data where the incident occurred in a park or greenspace. Next, keep data for 2019 and after.
3. Keep specific crime categories (type 1 crimes as defined by the FBI). The list is given here: https://ucr.fbi.gov/crime-in-the-u.s/2011/crime-in-the-u.s.-2011/offense-definitions
4. Filter by area (drop rows with a location outside NYC)

### 1. filter out missing/wrong date and times, missing borough name and duplicate complaints from the data

In [None]:
data.isna().sum()

In [None]:
data['CMPLNT_FR_DT'] = pd.to_datetime(data['CMPLNT_FR_DT'], errors='coerce')

In [None]:
data['CMPLNT_FR_TM'] = pd.to_datetime(data['CMPLNT_FR_TM'], format='%H:%M:%S', errors='coerce').dt.time

In [None]:
print(data.CMPLNT_FR_DT.isna().sum())
print(data.CMPLNT_FR_TM.isna().sum())

In [None]:
data.dropna(subset=['CMPLNT_FR_DT', 'CMPLNT_FR_TM'], inplace=True)
data.shape

In [None]:
data.drop_duplicates(subset=['CMPLNT_NUM'], inplace=True)
data.shape

In [None]:
data.BORO_NM.unique()

In [None]:
data = data[~data.BORO_NM.isna()]
data = data[data.BORO_NM != '(null)']
data.shape

In [None]:
data.BORO_NM.unique()

### 2. Remove rows where location is parks or greenspace and filter for 2019 and after
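
The two filters in this step are both boolean masks on a single column. The sketch below illustrates the pattern on a four-row toy frame with made-up values (the park name is only a placeholder), assuming the date column has already been converted with `pd.to_datetime` as in step 1. It is an illustration of the technique, not the expected solution on the full data.

```python
import pandas as pd

# Toy frame with made-up values; column names follow the notebook.
toy = pd.DataFrame({
    'CMPLNT_FR_DT': pd.to_datetime(['2018-12-31', '2019-01-01', '2019-06-15', '2020-03-02']),
    'PARKS_NM': ['(null)', '(null)', 'CENTRAL PARK', '(null)'],
})

# Keep complaints dated 2019-01-01 or later (boolean mask on the datetime column).
recent = toy[toy['CMPLNT_FR_DT'] >= pd.Timestamp('2019-01-01')]

# Keep rows with no park recorded, using the '(null)' placeholder string.
outside_parks = recent[recent['PARKS_NM'] == '(null)']
print(outside_parks)
```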

In [None]:
## check the timeline of data
print(data.sort_values(by='CMPLNT_FR_DT', ascending=True).head(3)['CMPLNT_FR_DT'])
print(data.sort_values(by='CMPLNT_FR_DT', ascending=False).head(3)['CMPLNT_FR_DT'])

In [None]:
# Todo: filter out the data before 2019-1-1

In [None]:
print(data.sort_values(by='CMPLNT_FR_DT', ascending=True).head(1)['CMPLNT_FR_DT'].values[0])
print(data.sort_values(by='CMPLNT_FR_DT', ascending=False).head(1)['CMPLNT_FR_DT'].values[0])

In [None]:
data.shape

In [None]:
data.PARKS_NM.unique()

In [None]:
data = data[data.PARKS_NM == '(null)']
data.shape

#### Checkpoint: We should have around 2.38M records after this step

### 3. keep type 1 crimes as defined by the FBI:
https://ucr.fbi.gov/crime-in-the-u.s/2011/crime-in-the-u.s.-2011/offense-definitions

The crime type is present in the 'OFNS_DESC' column. You just need to keep the following categories: "'ARSON', 'BURGLARY', 'FELONY ASSAULT', 'GRAND LARCENY' ,'GRAND LARCENY OF MOTOR VEHICLE',
                'MURDER & NON-NEGL. MANSLAUGHTER', 'RAPE', 'ROBBERY'"

In [None]:
data.OFNS_DESC.unique()

In [None]:
data_type1 = data[data.OFNS_DESC.isin(['ARSON', 'BURGLARY', 'FELONY ASSAULT', 'GRAND LARCENY' ,'GRAND LARCENY OF MOTOR VEHICLE',
                'MURDER & NON-NEGL. MANSLAUGHTER', 'RAPE', 'ROBBERY'])]
data_type1.reset_index(drop=True, inplace=True)
data_type1.shape

In [None]:
data_type1.head()

### 4. keep rows with location within NYC

The zip codes file is available in the GitHub 'Data' repository as "ZIPCODE.zip". We have also already used it in homework 2.

Do a spatial join to keep only rows within NYC.
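
A minimal sketch of a point-in-polygon spatial join is shown below on toy geometries. It assumes geopandas 0.10 or later (where `gpd.sjoin` takes a `predicate` argument; older versions call it `op`) and assumes the 'Latitude'/'Longitude' columns are WGS84 degrees (EPSG:4326). The polygons, zip codes and coordinates are made up. With the real shapefile, the points may need to be reprojected to the shapefile's CRS first, which `to_crs` handles.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Toy polygons standing in for the zip code shapes (coordinates are made up).
toy_zips = gpd.GeoDataFrame(
    {'ZIPCODE': ['00001', '00002']},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs='EPSG:4326',
)

# Toy complaints; the last point lies outside both polygons.
toy_crimes = pd.DataFrame({
    'CMPLNT_NUM': [10, 11, 12],
    'Longitude': [0.5, 1.5, 5.0],
    'Latitude': [0.5, 0.5, 5.0],
})

# Build point geometries from longitude (x) and latitude (y).
points = gpd.GeoDataFrame(
    toy_crimes,
    geometry=gpd.points_from_xy(toy_crimes['Longitude'], toy_crimes['Latitude']),
    crs='EPSG:4326',  # assumption: coordinates are WGS84 degrees
)

# Match the polygons' CRS before joining (a no-op here, needed if the CRSs differ).
points = points.to_crs(toy_zips.crs)

# An inner join keeps only points that fall within some polygon.
inside = gpd.sjoin(points, toy_zips, how='inner', predicate='within')
print(inside[['CMPLNT_NUM', 'ZIPCODE']])
```

An inner join drops unmatched points, which is what filters out locations outside the polygons. A point that falls within more than one polygon would appear once per match, so checking for duplicated complaint numbers after the join is a reasonable precaution.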

In [None]:
## zip codes map
# zips = gpd.read_file('Data/ZIPCODE/ZIP_CODE_040114.shp')
zips = gpd.read_file('ZIPCODE/ZIP_CODE_040114.shp')
zips.head()

Note: 'ZIPCODE' column has unique codes. The borough name is given in 'COUNTY' column. The counties and boroughs are synonymous in NYC. 'New York' county corresponds to Manhattan, 'Kings' to Brooklyn, 'Richmond' to Staten Island

In [None]:
zips.COUNTY.unique()

In [None]:
zips.plot(figsize=(8,8))

In [None]:
# Todo: filter out crime points outside NYC

In [None]:
# YourDataframe.to_csv('NYC_crimes/crimes_NYC.csv')

## Section 2 - Exploratory analysis tasks

1. Visualize the time series of the total number of type 1 crimes for the whole city per day.
2. Visualize type 1 crimes grouped at a) the borough level as a bar plot and b) the zip code level as a heatmap normalized by population (per 100,000). Use a quantiles color scheme.
3. Plot the following bar plots: the total number of type 1 crimes by a) month, b) day of week (use weekday names for labels) and c) hour of day.
4. Plot two bar plots: day of the week and hour of the day timelines for felony assault vs grand larceny (normalized per 100,000 population, comparing these two types of crime on the same bar plots)
5. Compare the percentage breakdown of type 1 crimes by category of crime within different boroughs by plotting pie charts for each borough

### 1. time series plotting

In [None]:
# Todo: group total crimes by daily numbers


In [None]:
# Todo: plot as a time series


### 2. plotting on borough and zip code level normalized by population
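
The per-100,000 normalization can be sketched as follows on toy numbers. Several names here are assumptions rather than facts about the real files: the population column is assumed to be called 'POPULATION', the borough names in 'BORO_NM' are assumed to be upper case, and the 'Bronx' and 'Queens' county labels are assumed to match the borough names. All counts and populations are made up.

```python
import pandas as pd

# Toy zip-level table (assumed column names; populations are made up).
toy_zips = pd.DataFrame({
    'ZIPCODE': ['00001', '00002', '00003'],
    'COUNTY': ['New York', 'New York', 'Kings'],
    'POPULATION': [50_000, 150_000, 100_000],
})

# Toy complaints: 4 in Manhattan, 3 in Brooklyn.
toy_crimes = pd.DataFrame({'BORO_NM': ['MANHATTAN'] * 4 + ['BROOKLYN'] * 3})

# County-to-borough mapping (spelling of the borough labels is an assumption).
county_to_boro = {
    'New York': 'MANHATTAN',
    'Kings': 'BROOKLYN',
    'Richmond': 'STATEN ISLAND',
    'Bronx': 'BRONX',
    'Queens': 'QUEENS',
}

# Sum zip-level population up to borough level.
boro_population = toy_zips.groupby(toy_zips['COUNTY'].map(county_to_boro))['POPULATION'].sum()

# Count complaints per borough.
boro_counts = toy_crimes.groupby('BORO_NM').size()

# Division aligns on the borough labels; scale to a rate per 100,000 residents.
rate_per_100k = boro_counts / boro_population * 100_000
print(rate_per_100k)
```

Calling `rate_per_100k.plot(kind='bar')` would then draw the bar plot. The same pattern applies at the zip code level, with the zip code as the grouping key.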

In [None]:
# Todo: group crime numbers by borough and normalize by their population (per 100,000). Population is given in the zips shapefile


In [None]:
# Todo: plot as a bar plot


In [None]:
# Todo: now group by zip codes, normalize by their population


In [None]:
# Todo: plot as a heatmap with quantiles color scheme


### 3. bar plot of total crimes vs a) months b) day of week and c) hour of day
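
One way to derive the three groupings is sketched below on a toy frame with made-up values. It assumes the columns have the types produced by the cleaning step above: 'CMPLNT_FR_DT' as pandas datetimes and 'CMPLNT_FR_TM' as `datetime.time` objects.

```python
from datetime import time
import pandas as pd

# Toy frame with the dtypes produced by the cleaning step (values are made up).
toy = pd.DataFrame({
    'CMPLNT_FR_DT': pd.to_datetime(['2019-01-07', '2019-01-07', '2019-01-08', '2019-02-02']),
    'CMPLNT_FR_TM': [time(9, 30), time(23, 5), time(9, 0), time(14, 45)],
})

# a) counts per calendar month (1-12)
by_month = toy.groupby(toy['CMPLNT_FR_DT'].dt.month).size()

# b) counts per weekday name, reindexed so the bars follow calendar order
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']
by_weekday = (toy['CMPLNT_FR_DT'].dt.day_name()
              .value_counts()
              .reindex(weekday_order, fill_value=0))

# c) counts per hour of day; the column holds datetime.time objects,
#    so the hour is read from each object individually
by_hour = toy['CMPLNT_FR_TM'].apply(lambda t: t.hour).value_counts().sort_index()

print(by_month, by_weekday, by_hour, sep='\n\n')
```

Each result is a Series, so `by_weekday.plot(kind='bar')` and similar calls give the bar plots. The explicit reindex matters because `value_counts` orders by frequency rather than by day of week.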

In [None]:
# Todo: code here

### 4. Bar plots: Felony assault vs grand larceny grouped by a) day of week and b) hour of day

In [None]:
# Todo: filter data for above crime types


In [None]:
# Todo: group the numbers and normalize by total city population (per 100,000)


In [None]:
# Todo: plot two bar plots: one for day of week and the other for hour of day
# each plot should compare the (normalized) numbers of the two crime types by weekday and hour respectively


## Section 3 - Hypothesis testing tasks

1. Plot the distribution (density plot) of the daily number of total type 1 crimes for 2019.
Test the hypothesis that the distribution is normal.

2. Plot the distributions (density plots) of the daily number of total type 1 crimes for weekdays and weekends (normalized by population) and perform a) the t-test for the hypothesis that the average daily crime over weekdays and weekends is the same, b) the KS-test for the hypothesis that the weekday and weekend daily crime numbers follow the same distribution. Can you reject either hypothesis at the 10% significance level?
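
The mechanics of the three tests can be sketched with `scipy.stats` on synthetic data. The Poisson counts below are made up and carry no information about the real crime numbers. The choice of `normaltest` (D'Agostino-Pearson) for normality and of the Welch variant of the t-test (`equal_var=False`) are assumptions of this sketch; other choices, such as a KS test against a fitted normal or the pooled-variance t-test, are equally possible.

```python
import numpy as np
from scipy import stats

# Synthetic daily counts (made-up Poisson data, not the NYPD numbers).
rng = np.random.default_rng(0)
weekday_counts = rng.poisson(lam=300, size=260)
weekend_counts = rng.poisson(lam=280, size=104)
all_counts = np.concatenate([weekday_counts, weekend_counts])

# Normality: H0 = the sample comes from a normal distribution.
norm_stat, norm_p = stats.normaltest(all_counts)

# Two-sample t-test: H0 = both samples have the same mean.
# equal_var=False selects the Welch variant, which does not assume equal variances.
t_stat, t_p = stats.ttest_ind(weekday_counts, weekend_counts, equal_var=False)

# Two-sample KS test: H0 = both samples come from the same distribution.
ks_stat, ks_p = stats.ks_2samp(weekday_counts, weekend_counts)

# Compare each p-value with the chosen significance level.
alpha = 0.10
for name, p in [('normality', norm_p), ('t-test', t_p), ('KS-test', ks_p)]:
    print(f'{name}: p = {p:.4g}, reject H0 at {alpha:.0%}: {p < alpha}')
```

In every case the decision rule is the same: the null hypothesis is rejected at level `alpha` when the p-value falls below `alpha`, and is not rejected otherwise.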

In [None]:
#introduce a custom function performing distribution analysis
def distribution_analysis(x, log_scale = False, fit_distribution = 'None', bins = 50, vis_means = True, vis_curve = True, print_outputs = True):
    #x - array of observations
    #log_scale - analyze distribution of log(x) if True
    #fit_distribution - fit the distribution ('normal', 'gev' or 'pareto') or do nothing if 'None'
    #bins - how many bins to use for binning the data
    #vis_means - show mean and std lines if True
    #vis_curve - show interpolated distribution curve over the histogram bars if True
    #print_outputs - print mean, std and percentiles
    
    if log_scale: 
        x1 = np.log10(x) #convert data to decimal logarithms
        xlabel = 'log(values)' #reflect in x labels
    else:
        x1 = x #leave original scale 
        xlabel = 'values'
    mu = x1.mean() #compute the mean
    sigma = x1.std() #compute the standard deviation
    if print_outputs: #print mean, std and percentiles only when requested
        if log_scale: #if logscale, output all three - log mean, its original scale and original scale mean
            print('Log mean = {:.2f}({:.2f}), mean = {:.2f}'.format(mu,10**mu,x.mean()))
        else:
            print('Mean = {:.2f}'.format(mu)) #otherwise print mean
        print('Standard deviation = {:.2f}'.format(sigma))
        for p in [1,5,25,50,75,95,99]: #output percentile values
            print('{:d} percentile = {:.2f}'.format(p,np.percentile(x,p)))
        
    #visualize histogram and the interpolated line (if vis_curve=True) using seaborn
    sns.distplot(x1, hist=True, kde=vis_curve, 
        bins=bins,color = 'darkblue', 
        hist_kws={'edgecolor':'black'},
        kde_kws={'linewidth': 4})
    
    #show vertical lines for mean and std if vis_means = True
    if vis_means:
        plt.axvline(mu, color='r', ls='--', lw=2.0)
        plt.axvline(mu-sigma, color='g', ls='--', lw=2.0)
        plt.axvline(mu+sigma, color='g', ls='--', lw=2.0)
        
    ylim = plt.gca().get_ylim() #keep the y-range of original distribution density values 
    #(to make sure the fitted distribution would not affect it)
    
    h = np.arange(mu - 3 * sigma, mu + 3 * sigma, sigma / 100) #3-sigma visualization range for the fitted distribution
    pars = None #fitted distribution parameters
    
    #fit and visualize the theoretic distribution
    if fit_distribution == 'normal':
        pars = norm.fit(x1)
        plt.plot(h,norm.pdf(h,*pars),'r')
    elif fit_distribution == 'gev':
        pars = gev.fit(x1)
        plt.plot(h,gev.pdf(h,*pars),'r')
    elif fit_distribution == 'pareto':
        pars = pareto.fit(x1)
        plt.plot(h,pareto.pdf(h,*pars),'r')
    
    plt.xlabel(xlabel) #add x label 
    plt.ylim(ylim) #restore the y-range of original distribution density values 
    plt.show()
    return pars

### 1. plotting distributions and normality test

In [None]:
# Todo: group type 1 crime numbers per day for 2019


In [None]:
# Todo: plot the distribution (density plot)


In [None]:
# Todo: normality test


### weekdays vs weekend distribution

In [None]:
# Todo: create dataframes for weekdays and weekends


In [None]:
# Todo: group daily numbers for weekdays and weekends


In [None]:
# Todo: plot distribution (density plot)


In [None]:
# Todo: t-test


In [None]:
# Todo: k-s test
