# Week 6 - Machine learning

## Part 1: Lightning intro to machine learning

***Exercise 1.1***

*What do we mean by a 'feature' in a machine learning model?*

***Answer:*** *A feature is an individual measurable characteristic of the thing being modelled, such as the sex of a person. Another way to look at it is to compare it to a facial feature: having blue eyes is one characteristic of one person's face. A model takes a set of such features as its input.*
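
As a small illustration of the idea, the sketch below turns three made-up people into a feature matrix with one row per person and one column per feature. The records and the choice of columns are invented for this example.

```python
# Hypothetical records; every key is one measurable characteristic (a feature).
people = [
    {"sex": "F", "age": 34, "blue_eyes": True},
    {"sex": "M", "age": 51, "blue_eyes": False},
    {"sex": "F", "age": 27, "blue_eyes": False},
]

# Encode each record as a numeric row: [is_female, age, has_blue_eyes].
X = [[int(p["sex"] == "F"), p["age"], int(p["blue_eyes"])] for p in people]

for row in X:
    print(row)
```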
___________

***Exercise 1.2***

*What is the problem with overfitting?*

***Answer:*** *An overfitted model predicts the training set extremely well, but fails to generalize to new data of the same kind. In other words, the model has picked up noise and fluctuations that do not represent the underlying trend of the data.*
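
A minimal sketch of this effect, assuming synthetic data (a noisy sine curve) and polynomial fits of increasing degree; none of it comes from the reading. The training error can only go down as the degree grows, while the error on fresh data typically gets worse for the most flexible fit.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sine(n):
    """Sample n points from sin(2*pi*x) plus Gaussian noise (synthetic data)."""
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)

x_train, y_train = noisy_sine(15)
x_test, y_test = noisy_sine(200)

def mse(coeffs, x, y):
    """Mean squared error of a polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_mse, test_mse = {}, {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = mse(coeffs, x_train, y_train)
    test_mse[degree] = mse(coeffs, x_test, y_test)
    print(degree, round(train_mse[degree], 3), round(test_mse[degree], 3))
```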
___________

***Exercise 1.3***

*Explain the connection between the bias-variance trade-off and overfitting/underfitting*

***Answer:*** *I found this very interesting [infographic](https://elitedatascience.com/bias-variance-tradeoff).*

- *A model that has high bias and low variance is generally consistent, but inaccurate*
- *A model with high variance and low bias is more accurate on average, but inconsistent*

***What is the tradeoff then?*** *If you have a high bias, then you will tend to have a low variance. The same goes the other way: if you have a high variance, then you will tend to have a low bias. A high-bias model underfits, while a high-variance model overfits, so the goal is a balance between the two. Therefore the training/test sets should be varied, to minimize bias, and the model should be tuned with impactful parameters. For example, in a textual ML algorithm you could sometimes exclude outliers, as they tend not to add prediction accuracy.*
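
For squared-error loss, the trade-off is commonly written as the decomposition below. This is the standard textbook form, not something taken from the infographic; the notation is introduced here: $y = f(x) + \varepsilon$ with noise variance $\sigma^2$, and $\hat{f}$ is the model fitted on a random training set.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
$$

An underfitted model is dominated by the first term and an overfitted model by the second.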
___________

***Exercise 1.4*** 

*The Luke is for leukemia on page 145 in the reading is a great example of why accuracy is not a good measure in very unbalanced problems. You know about the incidents dataset we've been working with. Try to come up with a similar example based on the data we've been working with today.*

***Answer:***
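
One way to make the accuracy problem concrete is sketched below. The class counts are made up for illustration and are not the real counts from the incidents dataset; the "model" simply always answers with the most common category.

```python
# Made-up class counts standing in for an unbalanced incidents sample.
counts = {
    "DRUG/NARCOTIC": 950,
    "PROSTITUTION": 40,
    "DRIVING UNDER THE INFLUENCE": 10,
}
total = sum(counts.values())

# A "model" that always predicts the most common category.
majority = max(counts, key=counts.get)
accuracy = counts[majority] / total

# Recall per class: 1.0 for the majority class, 0.0 for every other class.
recall = {c: (1.0 if c == majority else 0.0) for c in counts}

print(majority, accuracy)
print(recall)
```

The accuracy looks high even though the two rarer categories are never predicted at all.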
___________

## Part 2: Scikit-learn

__________
***Run through the first three sections of this [tutorial](https://scikit-learn.org/stable/tutorial/basic/tutorial.html)***

**Loading an example dataset**

*We start off by importing the datasets we need for the tutorial*

In [None]:
from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()

The features are stored in `digits.data`:

In [None]:
print(digits.data)

In [None]:
print(digits.DESCR)

For supervised machine learning, the target (the digit that each sample represents) is stored in `digits.target`:

In [None]:
print(digits.target)

**Learning and predicting**

*In scikit-learn we can create an estimator with:*

In [None]:
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)

To let the classifier learn from the data, we pass the training set (every sample except the last) to its `fit` method:

In [None]:
clf.fit(digits.data[:-1], digits.target[:-1])

Now that the classifier has been fitted, we can predict which digit is written in the last image, which was left out of the training set:

In [None]:
clf.predict(digits.data[-1:])

*From the image, it is hard to make out if it is actually the number 8 that has been written, due to poor resolution.*
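
A single prediction says little about how well the classifier generalizes. A minimal sketch of a hold-out evaluation follows; the use of `train_test_split`, the 25% test share and the `random_state` are choices made here, not steps from the tutorial sections above.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Hold out a quarter of the samples that the classifier never sees during fit.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf_holdout = svm.SVC(gamma=0.001, C=100.)
clf_holdout.fit(X_train, y_train)

# Fraction of held-out digits that are classified correctly.
accuracy = clf_holdout.score(X_test, y_test)
print(accuracy)
```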
__________

***Exercise 2.1***

*Did you read the text?*

- ***Describe in your own words how data is organized in sklearn (how does a dataset work according to the tutorial)***
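    - *A dataset is a dictionary-like object. Its `.data` member is an array of shape (n_samples, n_features) that holds the features, and for supervised problems its `.target` member holds the value that you would like the model to be able to predict for each sample*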
- ***What is the dimensionality of the .data part of a dataset and what is the size of each dimension?***
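    - *For the digits dataset, the .data part is two-dimensional with shape (1797, 64): 1797 samples with 64 attributes each (the pixels of an 8x8 image). Each value is an integer from 0-16*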
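
The shapes can be checked directly. A short sketch using the same `load_digits` dataset as above:

```python
from sklearn import datasets

digits = datasets.load_digits()

print(digits.data.shape)    # (n_samples, n_features)
print(digits.images.shape)  # the same samples as 8x8 images
print(digits.target.shape)  # one label per sample
print(digits.data.min(), digits.data.max())  # range of the pixel values
```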
_____

***We won't do the whole tutorial. Try it out: I'd like you to work through this [tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) up to and including the section Building a pipeline.***

**Tutorial setup**

In [None]:
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']

In [None]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train',
    categories=categories, shuffle=True, random_state=42)

*We can find the target names*

In [None]:
twenty_train.target_names

*We check the size of the dataset*

In [None]:
len(twenty_train.data)

In [None]:
len(twenty_train.filenames)

*We print the first lines of the first loaded file:*

In [None]:
print("\n".join(twenty_train.data[0].split("\n")[:3]))
print(twenty_train.target_names[twenty_train.target[0]])

*For efficiency, the target categories have been stored as integers instead of strings*

In [None]:
twenty_train.target[:10]

*We can get the category names by the following method:*

In [None]:
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])

**Extracting features from text files**

*Tokenizing text with `scikit-learn`*

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# Initialise the count vectorizer
count_vect = CountVectorizer()
# Fit the vectorizer on the training data and transform the data into a count matrix
X_train_counts = count_vect.fit_transform(twenty_train.data)
# Display the shape of the count matrix (2257 rows and 35788 columns)
X_train_counts.shape

*`vocabulary_` maps every word in the vocabulary to its column index in the count matrix. Looking up 'algorithm' with `vocabulary_.get()` returns 4690, which is the index of that word's column. The feature set itself is not reduced; it still has all 35788 columns.*

In [None]:
count_vect.vocabulary_.get(u'algorithm')
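
A hedged toy example of what `vocabulary_` holds, using a made-up two-document corpus instead of the newsgroup data:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["the cat sat", "the dog sat on the cat"]  # made-up corpus
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each token to its column index in the count matrix.
print(sorted(toy_vect.vocabulary_.items(), key=lambda kv: kv[1]))

# One row per document, one column per token.
print(toy_counts.toarray())

# Column of the word 'cat', and how often it occurs in each document.
cat_col = toy_vect.vocabulary_.get("cat")
print(cat_col, toy_counts[:, cat_col].toarray().ravel())
```

The lookup returns a position in the matrix, not a count and not a filtered feature set.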

*Longer documents have higher word counts than shorter ones, even when they are about the same topic, which creates a bias. To account for this, the word counts are converted into term frequencies that are scaled by the length of the document*

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

In [None]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
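
To see the length normalisation at work, the sketch below uses a made-up corpus in which the second document is the first one repeated twice. Note that with its default settings `TfidfTransformer` scales each row to unit Euclidean length (`norm='l2'`) rather than dividing by the number of words.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up corpus: the second document is the first one repeated twice.
toy_docs = ["the cat sat", "the cat sat the cat sat"]
toy_counts = CountVectorizer().fit_transform(toy_docs)

toy_tf = TfidfTransformer(use_idf=False).fit_transform(toy_counts)

# The raw counts differ by a factor of two, the normalised rows are identical.
print(toy_counts.toarray())
print(toy_tf.toarray())
print(np.linalg.norm(toy_tf.toarray(), axis=1))
```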

**Training a classifier**

In [None]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

In [None]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
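
The content of `predicted` is easiest to see on a tiny example. The sketch below trains the same kind of model on four made-up sentences with two hypothetical categories; the sentences and the names in `toy_names` are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training data: integer targets index into toy_names.
toy_docs = ["god loves you", "church and faith",
            "gpu renders graphics", "opengl graphics card"]
toy_target = [0, 0, 1, 1]
toy_names = ["religion", "graphics"]

toy_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
]).fit(toy_docs, toy_target)

new_docs = ["faith in god", "fast graphics on the gpu"]
toy_predicted = toy_clf.predict(new_docs)

print(toy_predicted)  # an array with one integer per input document
for doc, category in zip(new_docs, toy_predicted):
    print('%r => %s' % (doc, toy_names[category]))
```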

**Building a pipeline**

In [None]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

In [None]:
text_clf.fit(twenty_train.data, twenty_train.target)

__________

***Exercise 2.2***

*Did you do the work?*

- ***Describe in your own words the dataset used in the tutorial.***
    - *The dataset is a set of posts from newsgroups. It consists of a training and test set. The target says what category a text belongs to, for instance `soc.religion.christian`.*
- ***Investigate further: what kind of folder/file structure does the sklearn.datasets.load_files function expect?***
    - *The function takes a folder where the different categories are its own subfolder.*
- ***What is the "bag-of-words" representation of text? How does this strategy turn text into data of the kind described above?***
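    - *It collects every word that occurs in any of the texts into one vocabulary, and represents each text by how many times each of those words occurs in it. Every text thereby becomes a row of numbers with one feature per word. One could exclude words that carry little information, such as words that occur only a few times or very common words like 'and'.*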
- ***Once you've built the classifier, play around with it a bit. Describe the content of the predicted variable.***
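    - *The predicted variable is an array with one integer per input document. Each integer is the index of the predicted category in `target_names`, so it tells which subject a sentence or word belongs to, like computer graphics, religion etc.*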
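
The folder layout that `load_files` expects can be tried out on a throwaway directory. In the sketch below the folder names, file names and texts are all made up; only the structure matters: one subfolder per category, one file per document.

```python
import tempfile
from pathlib import Path
from sklearn.datasets import load_files

# Build a temporary folder tree: <root>/<category>/<document file>.
root = Path(tempfile.mkdtemp())
toy_files = {
    "comp.graphics/post1.txt": "OpenGL on the GPU is fast",
    "comp.graphics/post2.txt": "rendering triangles",
    "sci.med/post1.txt": "the patient has a fever",
}
for rel_path, text in toy_files.items():
    path = root / rel_path
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")

bunch = load_files(str(root), encoding="utf-8")

print(bunch.target_names)  # the subfolder names become the category names
for doc, label in zip(bunch.data, bunch.target):
    print(bunch.target_names[label], "=>", doc)
```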
__________

## Part 3: KNN

***Exercise 3.1***

*How does K-nearest-neighbors work? Explain in your own words. Explain in your own words: What is the curse of dimensionality? Use figure 12-6 in DSFS as part of your explanation.*

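***Answer:*** *K-nearest-neighbors classifies a new point by finding the K training points that are closest to it and letting them vote: the new point gets the label that the majority of those neighbors have. The curse of dimensionality is that points drift apart as dimensions are added. In figure 12-6, it can be seen that the more dimensions you add to a model for KNN, the greater the average distance between the points becomes. This is inevitable: every added dimension contributes a non-negative term to the summed distance, so the distance can only grow. The "nearest" neighbors therefore end up far away, which makes them less informative.*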
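
The effect can be reproduced with a small simulation. This is a sketch assuming uniformly random points in the unit cube and 2000 random pairs per dimension; the numbers are choices made here and are not taken from DSFS.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_stats(dim, n_pairs=2000):
    """Average and minimum Euclidean distance over random point pairs in the unit cube."""
    a = rng.random((n_pairs, dim))
    b = rng.random((n_pairs, dim))
    dists = np.linalg.norm(a - b, axis=1)
    return dists.mean(), dists.min()

stats = {dim: distance_stats(dim) for dim in (1, 2, 10, 100)}
for dim, (avg, closest) in stats.items():
    # Columns: dimension, average distance, closest pair, ratio of the two.
    print(dim, round(avg, 3), round(closest, 3), round(closest / avg, 3))
```

As the dimension grows, the average distance increases and the closest pair is no longer much closer than an average pair.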
_____

***We know from last week's exercises that the focus crimes PROSTITUTION, DRUG/NARCOTIC and DRIVING UNDER THE INFLUENCE tend to be concentrated in certain neighborhoods, so we focus on those crime types since they will make the most sense for a KNN map.***

***Exercise 3.2***

*Begin by using folium (see Week4) to plot all incidents of the three crime types on their own map. This will give you an idea of how the various crimes are distributed across the city.*

***Answer:***

*We start by importing folium and pandas*

In [None]:
import folium
import pandas as pd

*We import the policing dataframe and parse the date and time to a datetime column*

In [None]:
policing_dataframe = pd.read_csv('../rawdata/Police_Department_Incident_Reports__Historical_2003_to_May_2018.csv', parse_dates=[['Date', 'Time']])

In [None]:
# Select one category with an exact match (==) and restrict it to June through October 2017
mask_p_df_prostitution = (policing_dataframe['Category'] == "PROSTITUTION") & (policing_dataframe['Date_Time'] > "2017-06-01") & (policing_dataframe['Date_Time'] < "2017-11-01")
mask_p_df_drug = (policing_dataframe['Category'] == "DRUG/NARCOTIC") & (policing_dataframe['Date_Time'] > "2017-06-01") & (policing_dataframe['Date_Time'] < "2017-11-01")
mask_p_df_dui = (policing_dataframe['Category'] == "DRIVING UNDER THE INFLUENCE") & (policing_dataframe['Date_Time'] > "2017-06-01") & (policing_dataframe['Date_Time'] < "2017-11-01")

In [None]:
p_df_prostitution = policing_dataframe.loc[mask_p_df_prostitution].reset_index(drop=True)
p_df_drug = policing_dataframe.loc[mask_p_df_drug].reset_index(drop=True)
p_df_dui = policing_dataframe.loc[mask_p_df_dui].reset_index(drop=True)
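
Filtering on a category needs an exact match. The toy example below, with a made-up three-row dataframe, shows how an equality test and a `>=` comparison differ on strings:

```python
import pandas as pd

toy = pd.DataFrame({"Category": ["ASSAULT", "PROSTITUTION", "VANDALISM"]})

# Exact match: only the requested category.
exact = toy[toy["Category"] == "PROSTITUTION"]
# Lexicographic comparison: every category that sorts at or after the string.
lexical = toy[toy["Category"] >= "PROSTITUTION"]

print(exact["Category"].tolist())
print(lexical["Category"].tolist())
```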

*We start by getting the (lat, lon) pairs from the dataframes; column `Y` holds the latitude and column `X` the longitude*

In [None]:
x_y_prostitution = list(zip(list(p_df_prostitution["Y"]), list(p_df_prostitution["X"])))
x_y_drug = list(zip(list(p_df_drug["Y"]), list(p_df_drug["X"])))
x_y_dui = list(zip(list(p_df_dui["Y"]), list(p_df_dui["X"])))

*We first plot `PROSTITUTION`:*

In [None]:
SF_map1 = folium.Map([37.77919, -122.41914], zoom_start=13, tiles = "Stamen Toner")
folium.Marker([37.77919, -122.41914], popup='SF City Hall').add_to(SF_map1)

for x,y in x_y_prostitution:
    folium.CircleMarker([x, y],
                    radius=1,
                    color='red',
                    ).add_to(SF_map1)

SF_map1

*We now plot `DRUG/NARCOTIC`:*

In [None]:
SF_map2 = folium.Map([37.77919, -122.41914], zoom_start=13, tiles = "Stamen Toner")
folium.Marker([37.77919, -122.41914], popup='SF City Hall').add_to(SF_map2)

for x,y in x_y_drug:
    folium.CircleMarker([x, y],
                    radius=1,
                    color='blue',
                    ).add_to(SF_map2)

SF_map2

*We now plot `DRIVING UNDER THE INFLUENCE`*

In [None]:
SF_map3 = folium.Map([37.77919, -122.41914], zoom_start=13, tiles = "Stamen Toner")
folium.Marker([37.77919, -122.41914], popup='SF City Hall').add_to(SF_map3)

for x,y in x_y_dui:
    folium.CircleMarker([x, y],
                    radius=1,
                    color='green',
                    ).add_to(SF_map3)

SF_map3

_____

***Exercise 3.3***

*Next, it's time to set up your model based on the actual data. I recommend that you try out sklearn's KNeighborsClassifier. For an intro, start with this tutorial and follow the link to get a sense of the usage.*

- *You don't have to think a lot about testing/training and accuracy for this exercise. We're mostly interested in creating a map that's not too problematic. But do calculate the number of observations of each crime-type respectively. You'll find that the levels of each crime vary (lots of drug arrests, an intermediate amount of prostitution registered, and very little drunk driving in the dataset). Since the algorithm classifies each point according to its neighbors, what could a consequence of this imbalance in the number of examples from each class be for your map?*
- *You can make the dataset 'balanced' by grabbing an equal number of examples from each crime category.*
    - *How do you expect that will change the KNN result?*
    - *In which situations is the balanced map useful?*
    - *When is the map where data is in proportion to occurrences useful?*
    - *Choose which map you will work on in the following.*

***Answer:***

*We calculate the number of observations for each crime type:*

In [None]:
print("The number of observations for prostitution is:", len(p_df_prostitution))
print("The number of observations for drug/narcotic is:", len(p_df_drug))
print("The number of observations for driving under the influence is:", len(p_df_dui))

*What consequence could it have that there are more observations in one group than another?*
- *This could lead to the model being more likely to predict the category with the highest number of cases, no matter where you are on the map (creating a bias towards the majority class)*
- *We expect balancing the dataset to reduce this bias, so that the less frequent crime types show up in the areas where they are relatively concentrated*
- *The balanced map is useful when you want to see which crime type is most characteristic of an area, independent of how common each type is overall*
- *The map where data is in proportion to occurrences is useful when you need to predict which crime type is actually most likely to happen at a given location*
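
The effect of the imbalance can be sketched with synthetic data. Below, both classes are spread over the same unit square, so location carries no information, but one class has 20 times as many points; the class sizes and K are choices made for this illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: same spatial distribution, very different class sizes.
X_major = rng.random((1000, 2))
X_minor = rng.random((50, 2))

def share_of_class_0(X0, X1, k=5):
    """Fit KNN and return the fraction of a 50x50 grid predicted as class 0."""
    X = np.vstack([X0, X1])
    y = np.array([0] * len(X0) + [1] * len(X1))
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    axis = np.linspace(0.0, 1.0, 50)
    grid = np.array([[a, b] for a in axis for b in axis])
    return float(np.mean(knn.predict(grid) == 0))

imbalanced_share = share_of_class_0(X_major, X_minor)
balanced_share = share_of_class_0(X_major[:50], X_minor)  # equal number per class
print(imbalanced_share, balanced_share)
```

With the imbalanced data nearly the whole grid is assigned to the larger class, while the balanced sample gives the smaller class a comparable share.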

_____

***Exercise 3.4***

*Now create an approximately square grid of points that runs over SF. You get to decide the grid-size, but I recommend somewhere between $50\times50$ and $100\times100$ points. I recommend using folium for this task.*

***Answer:***

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

In [None]:
def get_geojson_grid(upper_right, lower_left, n=6):
    """Returns an (n,n) grid of geojson rectangles that covers a bounding box.

    Parameters
    ----------
    upper_right: array_like
        The upper right hand corner [lat, lon] of the bounding box.

    lower_left: array_like
        The lower left hand corner [lat, lon] of the bounding box.

    n: integer
        The number of rows/columns in the (n,n) grid.

    Returns
    -------

    list
        List of "geojson style" dictionary objects, one per grid cell
    """

    all_boxes = []

    lat_steps = np.linspace(lower_left[0], upper_right[0], n+1)
    lon_steps = np.linspace(lower_left[1], upper_right[1], n+1)

    lat_stride = lat_steps[1] - lat_steps[0]
    lon_stride = lon_steps[1] - lon_steps[0]

    for lat in lat_steps[:-1]:
        for lon in lon_steps[:-1]:
            # Define dimensions of box in grid
            upper_left = [lon, lat + lat_stride]
            upper_right = [lon + lon_stride, lat + lat_stride]
            lower_right = [lon + lon_stride, lat]
            lower_left = [lon, lat]

            # Define json coordinates for polygon
            coordinates = [
                upper_left,
                upper_right,
                lower_right,
                lower_left,
                upper_left
            ]

            geo_json = {"type": "FeatureCollection",
                        "properties":{
                            "lower_left": lower_left,
                            "upper_right": upper_right
                        },
                        "features":[]}

            grid_feature = {
                "type":"Feature",
                "geometry":{
                    "type":"Polygon",
                    "coordinates": [coordinates],
                }
            }

            geo_json["features"].append(grid_feature)

            all_boxes.append(geo_json)

    return all_boxes

In [None]:
lower_left = [37.713995, -122.517391]
upper_right = [37.814491, -122.352865]
m = folium.Map(location=[37.77919, -122.41914], zoom_start=12, tiles = "Stamen Toner")
grid = get_geojson_grid(upper_right, lower_left , n=50)

for i, geo_json in enumerate(grid):

    color = plt.cm.Reds(i / len(grid))
    color = mpl.colors.to_hex(color)

    gj = folium.GeoJson(geo_json,
                        style_function=lambda feature, color=color: {
                                                                        'fillColor': color,
                                                                        'color':"black",
                                                                        'weight': 1,
                                                                        'fillOpacity': 0.55,
                                                                    })
    popup = folium.Popup("example popup {}".format(i))
    gj.add_child(popup)

    m.add_child(gj)
m

In [None]:
def meters_to_degrees(meters):
    degrees_0_0001_in_meters = 11.1
    return round((meters / degrees_0_0001_in_meters) * 0.0001, 9)

size_of_grid = 200
degrees = meters_to_degrees(size_of_grid)
to_bin = lambda x: np.floor(x / degrees) * degrees
policing_dataframe["latbin"] = policing_dataframe.Y.map(to_bin)
policing_dataframe["lonbin"] = policing_dataframe.X.map(to_bin)

In [None]:
policing_dataframe["latbin"].unique()

In [None]:
SF_map4 = folium.Map([37.77919, -122.41914], zoom_start=13, tiles = "Stamen Toner")
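# Keep each observed (latbin, lonbin) pair together; zipping the two unique() arrays would pair unrelated bins
x_y_policing_dataframe = list(policing_dataframe[["latbin", "lonbin"]].drop_duplicates().itertuples(index=False, name=None))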

for x,y in x_y_policing_dataframe:
    folium.CircleMarker([x, y],
                    radius=1,
                    color='green',
                    ).add_to(SF_map4)

SF_map4
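
A regular grid of points, independent of where incidents happen to fall, can also be built with NumPy alone. A minimal sketch, reusing the bounding-box corners from above and assuming a 50 by 50 grid:

```python
import numpy as np

lower_left = [37.713995, -122.517391]   # [lat, lon], same corners as above
upper_right = [37.814491, -122.352865]
n = 50

lats = np.linspace(lower_left[0], upper_right[0], n)
lons = np.linspace(lower_left[1], upper_right[1], n)

# meshgrid pairs every latitude with every longitude: n * n points in total.
lon_grid, lat_grid = np.meshgrid(lons, lats)
grid_points = np.column_stack([lat_grid.ravel(), lon_grid.ravel()])

print(grid_points.shape)
print(grid_points[0], grid_points[-1])
```

Each row of `grid_points` is one (lat, lon) pair that can be passed to a classifier or drawn on a map.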

_____

***Exercise 3.5***

*Visualize your model by coloring the grid, coloring each grid point according to its category. Create a plot of this kind for models where each point is colored according to the majority of its $5$, $10$, and $30$ nearest neighbors. Describe what happens to the map as you increase the number of neighbors, K*

***Answer:***
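
A possible starting point is sketched below. It uses synthetic clusters as a stand-in for the real incident coordinates, so the output says nothing about San Francisco; it also treats raw latitude and longitude as plain Euclidean coordinates, which is an assumption.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

# Synthetic stand-in for the (lat, lon) of three crime types: three overlapping clusters.
centers = np.array([[37.76, -122.42], [37.78, -122.41], [37.75, -122.45]])
X = np.vstack([c + rng.normal(0.0, 0.015, (200, 2)) for c in centers])
y = np.repeat([0, 1, 2], 200)

# 50x50 grid of (lat, lon) points over the bounding box used for the grid above.
lats = np.linspace(37.713995, 37.814491, 50)
lons = np.linspace(-122.517391, -122.352865, 50)
lon_grid, lat_grid = np.meshgrid(lons, lats)
grid_points = np.column_stack([lat_grid.ravel(), lon_grid.ravel()])

labels_by_k = {}
for k in (5, 10, 30):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    labels_by_k[k] = knn.predict(grid_points).reshape(50, 50)
    # Number of grid points assigned to each of the three classes.
    print(k, np.bincount(labels_by_k[k].ravel(), minlength=3))
```

Each entry of `labels_by_k[k]` can then be mapped to a colour for the corresponding grid cell. Larger values of K typically give smoother region boundaries.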

_____

***Exercise 3.6***

*To see an example, click here. This one is a 100x100 grid based on crimes from 1st January 2017 until the end of 2018. And the categories are narcotics, prostitution and vehicle theft.*

***Answer:***