## Lab 13c - Crime Lab using library

For this lab, we ask you to find the Victoria crime **most heavily linked to temperature**.

For every crime type in the crime database, compute the average number of crimes per day, indexed by the daily temperature reading.

i.e. given a **crime type** $\alpha$ (a value of the **incident_type_primary** field in **clean_dat**), the quantity

$$D_\alpha(\tau)$$

will be the **average** number of crimes on the days with temperature reading $\tau$.
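
As a minimal sketch of this definition, the cell below computes $D_\alpha(\tau)$ on made-up toy data. The names `day_temp`, `day_count` and `average_by_temp` are hypothetical and are not part of **datlib**; the point is only that the total count at each temperature is divided by the number of days with that temperature, including days with zero incidents.

```python
import collections as co

# Hypothetical toy data (not from datlib): one rounded temperature per day ...
day_temp = {"d1": 10, "d2": 10, "d3": 12, "d4": 12, "d5": 12}
# ... and the number of incidents of a single crime type on each day.
day_count = {"d1": 2, "d2": 4, "d3": 0, "d4": 3, "d5": 3}

def average_by_temp(day_temp, day_count):
    """Return {temperature: average incidents per day at that temperature}."""
    total = co.defaultdict(int)   # total incidents seen at each temperature
    ndays = co.defaultdict(int)   # number of days with each temperature
    for day, temp in day_temp.items():
        total[temp] += day_count.get(day, 0)
        ndays[temp] += 1
    return {t: total[t] / ndays[t] for t in total}

D = average_by_temp(day_temp, day_count)
print(D)
```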


 * * *

First we *suppose* there is some linear relationship 

$$D_\alpha(\tau) = M_{\alpha} \tau + C_{\alpha}$$

where $\tau$ is the temperature reading for the day. As you will see, this relationship does not hold exactly, but we ask you to find a *best fit* for $M_{\alpha}$ and $C_{\alpha}$ using least squares.
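
One possible way to obtain such a best fit is `np.linalg.lstsq`, sketched below on made-up points that lie exactly on a line. The helper name `fit_line` and the sample data are assumptions for illustration only. Each row of the design matrix is $[\tau, 1]$, so the solution vector is $(M, C)$.

```python
import numpy as np

def fit_line(temps, avgs):
    """Least-squares fit of avgs ~ M * temps + C; returns (M, C)."""
    temps = np.asarray(temps, dtype=float)
    avgs = np.asarray(avgs, dtype=float)
    # Each row is [tau, 1], so the unknown vector is (M, C).
    A = np.column_stack([temps, np.ones(len(temps))])
    (M, C), *_ = np.linalg.lstsq(A, avgs, rcond=None)
    return M, C

# Made-up data lying exactly on the line 2 * tau + 1.
M, C = fit_line([0, 1, 2, 3], [1.0, 3.0, 5.0, 7.0])
print(M, C)
```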

Once you have solved for $M_{\alpha}$ for all crime types $\alpha$, sort the list of pairs: $$[ (M_{\alpha}, \alpha) \text{ for } \alpha \text{ a crime type } ]$$

by the absolute value of $M_\alpha$, from smallest to largest, and print out the last four elements of this sorted list. That is, we want to know the four crime types with the largest $|M_{\alpha}|$, together with the value of $M_{\alpha}$ for each of them.
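
The sorting step can be sketched as follows. The crime-type names and slopes in `slopes` are invented for illustration. Note that the sort key is $|M|$, but the signed value of $M$ is kept in the pair.

```python
# Hypothetical slopes keyed by (made-up) crime type names.
slopes = {"A": 0.05, "B": -0.30, "C": 0.12, "D": -0.01, "E": 0.20}

# Build the (M, crime type) pairs and sort on |M|, smallest to largest.
pairs = [(M, name) for name, M in slopes.items()]
pairs.sort(key=lambda p: abs(p[0]))

# The last four entries have the largest |M|; the sign of M is preserved.
top4 = pairs[-4:]
print(top4)
```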

Do this lab for all three temperature readings: daily max temp, daily min temp, and daily mean temp. There will be a grading script that will search for your answers in the file **mp248/Labs/Lab.13c.ipynb**. The feedback will appear in the **Commentary 2** field on CourseSpaces, i.e. this lab will not count towards your course grade.

Due to the limited amount of data we have, please round the temperature readings from **datlib.vicdict** (see Notebook 13b) to the nearest degree.
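
A small aside on rounding: Python 3's built-in `round` rounds exact halves to the nearest even integer, which may or may not be what you expect from "nearest degree". The sketch below compares it with a hypothetical helper `round_half_up`. The solution cells in this notebook use `int(round(...))`.

```python
import math

def round_half_up(t):
    """Round to the nearest integer, with exact halves rounded upward."""
    return math.floor(t + 0.5)

for t in [12.4, 12.5, 13.5, -0.5]:
    # Built-in round sends exact halves to the nearest even integer.
    print(t, round(t), round_half_up(t))
```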

* * *

*Hint:* (1) Perhaps make some plots as a *sanity check* to see how useful your least-squares approximations are.

   (2) There will be some crime types for which there is not enough data to perform least-squares. Skip those crime types, but also print out their names.
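
One way to act on hint (2) without hard-coding crime names is to check how many distinct temperature readings a crime type has before fitting. A line needs at least two. The sketch below assumes a dict mapping temperature to average count; `try_fit` and the sample dict `data` are hypothetical.

```python
import numpy as np

def try_fit(points):
    """points maps temperature -> average count. Returns (M, C), or None
    when there are fewer than two distinct temperatures to fit a line."""
    if len(points) < 2:
        return None
    ts = sorted(points)
    temps = np.array(ts, dtype=float)
    avgs = np.array([points[t] for t in ts], dtype=float)
    A = np.column_stack([temps, np.ones_like(temps)])
    (M, C), *_ = np.linalg.lstsq(A, avgs, rcond=None)
    return M, C

# Made-up crime types: one with enough data, two without.
data = {"ENOUGH": {0: 1.0, 1: 2.0, 2: 3.0}, "ONE POINT": {5: 1.0}, "EMPTY": {}}
skipped = [name for name, pts in data.items() if try_fit(pts) is None]
print("Skipped (not enough data):", skipped)
```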

In [45]:
import os
import collections as co
from datlib import clean_dat, ctree, ncl, vicdict, coml
import pprint as pp
import datetime as dt

In [133]:
## this function returns a dict mapping temperature (rounded to int) to the
## average number of crimes of type itp on days with that temperature reading
## itp is the incident_type_primary for the crime type
## k is the index to ncl, i.e. the temperature reading
def count_crime_temp(itp, k):
    ## indexed by temperature (rounded to int)
    totcrimes = co.defaultdict(int)
    
    ## run through all crimes, keeping those with incident_type_primary == itp
    ## whose day appears in coml; count them per timestamp using defaultdict
    ## (timestamps are truncated to the day when the temperatures are looked up)
    dayct = co.defaultdict(int)
    for x in clean_dat:
        if x['incident_type_primary']==itp and\
           x['incident_datetime'].replace(hour=0, minute=0, second=0, microsecond=0) in coml:
            dayct[x['incident_datetime']]+=1

    ## now run through vicdict
    for i,v in dayct.items():
        totcrimes[ int(round(vicdict[i.replace(hour=0, minute=0, second=0, microsecond=0)][ncl[k]])) ] += v

    ## count the number of days in coml with each temperature reading,
    ## then divide the crime totals by those day counts to get averages.
    totdays = co.defaultdict(int)
    for i in totcrimes.keys():
        ## i is a temperature reading
        for a in coml:
            if int(round(vicdict[a][ncl[k]]))==i:
                totdays[i]+=1
    for i in totcrimes.keys():
        totcrimes[i]/=totdays[i]
        
    return totcrimes



In [134]:
import matplotlib.pyplot as plt
%matplotlib inline

In [135]:
import numpy as np

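## i is the index to ncl, i.e. which temperature reading to use
## returns a dict mapping (category, crime type) to [C, M] (intercept, slope)
def lsq_cof(i):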
    retval = dict()
    #print("Computing for: ", end='')
    for k,v in ctree.items():
        for w in v.keys():
            #print(w, ' ', end='')
            ## crime types without enough data for least squares are skipped
            if w in ['FRAUD-CREDIT/DEBIT CARD', 'FRAUD-CHEQUE', 'POSSESS STLN PROPERTY O/$5000',\
                 'POSSESS STLN PROPERTY U/$5000', 'TRAFFICKING-OTH SCHED IV CDSA', 'TRAFFICKING-MORPHINE',\
                 'ASSAULT-COMMON OR TRESPASS', 'CITIZEN ASSIST']:
                continue
            X = count_crime_temp(w, i)
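            ## least squares on X via the normal equations:
            ## A has a row of ones and a row of temperatures, y holds the averages
            A = np.matrix([[1.0, x[0]] for x in X.items()]).T
            y = np.matrix([x[1] for x in X.items()]).T
            c = np.linalg.inv(A*A.T)*A*y
            ## store [C, M] (intercept, slope) for this crime type
            retval[(k,w)] = [c[0,0], c[1,0]]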
    return retval


In [138]:
import operator as op

for i in [0,1,2]:
    print(ncl[i], ' ', end='')
    X = lsq_cof(i)
    Xs = [(k, v[1], abs(v[1])) for k,v in X.items()]
    XS = sorted(Xs, key=op.itemgetter(2) )
    pp.pprint(XS[-4:])

Max Temp (°C)  [(('Other', 'BYLAW-NOISE'), 0.05588234847256847, 0.05588234847256847),
 (('Theft', 'THEFT BICYCLE UNDER $5000'),
  0.0696708912998867,
  0.0696708912998867),
 (('Liquor', 'LIQUOR-INTOX IN PUBLIC PLACE'),
  0.0854710859564018,
  0.0854710859564018),
 (('Other', 'SUSPICIOUS PERS/VEH/OCCURRENCE'),
  0.17723588649532754,
  0.17723588649532754)]
Min Temp (°C)  [(('Other', 'BYLAW-NOISE'), 0.06733589491097552, 0.06733589491097552),
 (('Theft', 'THEFT BICYCLE UNDER $5000'),
  0.08966904496571158,
  0.08966904496571158),
 (('Theft', 'THEFT-OTHER UNDER $5000'),
  0.09580518535760048,
  0.09580518535760048),
 (('Other', 'SUSPICIOUS PERS/VEH/OCCURRENCE'),
  0.2000846405822861,
  0.2000846405822861)]
Mean Temp (°C)  [(('Liquor', 'LIQUOR-INTOX IN PUBLIC PLACE'),
  0.07380196369052848,
  0.07380196369052848),
 (('Theft', 'THEFT BICYCLE UNDER $5000'),
  0.07773771340488556,
  0.07773771340488556),
 (('Theft', 'THEFT-OTHER UNDER $5000'),
  0.08354645397468002,
  0.08354645397468002),
 ((

In [None]:
## Suspicious person/veh/occ has the most temperature dependence. 

In [139]:
d1 = {1:2, 2:3, 3:4}

In [140]:
print(len(d1))

3
