# Anomaly Detection - K-means

In [None]:
import turicreate as tc
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
plt.style.use('ggplot')

Databases and datasets are often published in web repositories. The most common way to access them is through an API, but in many other cases the data can be downloaded directly from a URL.

We use the Dutch portal for the official distribution of government data sources, part of the
[Ministry of the Interior and Kingdom Relations](https://data.overheid.nl/data/dataset) web page, which hosts a large number of datasets in different formats for many distinct applications.

It is important to automate data collection. A good approach is to write a function that makes the [web service](https://en.wikipedia.org/wiki/Web_service) request.
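As a minimal sketch of such a function, the helper below downloads a file once and reuses the local copy on later runs. The name `fetch_dataset`, the `timeout` default, and the caching behaviour are assumptions made for illustration; they are not part of the original notebook, which reads the URLs directly with `pd.read_csv`.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_dataset(url, dest, timeout=30):
    """Download `url` to `dest` unless a local copy already exists (hypothetical helper)."""
    dest = Path(dest)
    if dest.exists():
        return dest  # reuse the cached copy instead of requesting it again
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response body straight into the destination file
    with urllib.request.urlopen(url, timeout=timeout) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

In this notebook the call might look like `pd.read_csv(fetch_dataset(url, 'data/crimes-netherlands.csv'))`, where the `data/` cache directory is a hypothetical choice.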

For convenience, we are going to focus on crime data in this example.

### Deaths; murder and manslaughter, crime scene in The Netherlands

This table contains the number of persons who died as a result of murder or manslaughter where the crime scene was located in the Netherlands. The victims can be residents or non-residents. The data can be split by location of the crime, method, age, and sex. The date of death is the criterion for inclusion, so the act itself may have taken place in the previous year. The ICD-10 codes for murder and manslaughter are X85-Y09.

[Open Data Source](https://data.overheid.nl/data/dataset/deaths-murder-and-manslaughter-crime-scene-in-the-netherlands)  
[License](https://data.overheid.nl/licenties-voor-hergebruik) CC-BY 4.0

In [None]:
df_crimes = pd.read_csv('https://s3.eu-west-3.amazonaws.com/pedrohserrano-datasets/crimes-netherlands.csv', sep=',')

The municipality names, regions, and populations are missing, so we load them from an additional source.

In [None]:
# Sometimes you just need a little imagination: a list of municipalities
df_population = pd.read_csv('https://s3.eu-west-3.amazonaws.com/pedrohserrano-datasets/population-netherlands.tsv', sep='\t')

### Explore Distribution

How are crime and the types of crime related to population?

***Try .head() .describe()***

In [None]:
plt.figure(figsize=[16, 8])
sns.kdeplot(df_population['Population'], shade=True, color="r", label='Population')
plt.title('Distribution of Population in the Netherlands')
plt.legend()

### Datasets Merge

In [None]:
# Merge the crimes table and population table
df_crime_pop = pd.merge(df_crimes, df_population, on='CBScode', how='left')
df_crime_pop = df_crime_pop.replace([np.inf, -np.inf], np.nan)
df_crime_pop.fillna(0, inplace=True)
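Because the merge is a left join followed by `fillna(0)`, any municipality without a match in the population table ends up with a population of 0 rather than a visible gap. A quick check of which codes fail to match can catch this early. The sketch below uses a hypothetical helper, `unmatched_keys`, and made-up codes; it is not part of the original notebook.

```python
def unmatched_keys(left_keys, right_keys):
    """Return the keys present in `left_keys` but absent from `right_keys`, sorted."""
    return sorted(set(left_keys) - set(right_keys))


# Toy codes for illustration only, not real CBS codes
crime_codes = ["GM0001", "GM0002", "GM0003"]
population_codes = ["GM0001", "GM0003", "GM0004"]

print(unmatched_keys(crime_codes, population_codes))  # ['GM0002']
```

With the notebook's tables, the same check would be `unmatched_keys(df_crimes['CBScode'], df_population['CBScode'])`.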

In [None]:
df_crime_pop.sort_values('CBScode').head(10)

In [None]:
df_crime_pop.describe()

Next, let's create a **scatterplot matrix**. A scatterplot matrix plots the distribution of each column along the diagonal and a scatterplot for every pair of variables off the diagonal. It is an efficient tool for spotting errors in our data.

We can even have the plotting package color each entry by its class to look for trends within the classes.

In [None]:
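df_plot = df_crime_pop[['HIC: Violent Crime', 'HIC: Street Roof', 'HIC: Robberies', 'Population_density(p/km)', 'Province']]
# pairplot returns a grid of subplots and draws its own legend for `hue`,
# so the title is set on the whole figure instead of on the last subplot
g = sns.pairplot(df_plot, hue="Province")
g.fig.suptitle('Correlation of High Impact Crime', y=1.02)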

#### Correlation Violent Crime vs Arms Trade

In [None]:
df_plot = df_crime_pop[['HIC: Violent Crime','Arms Trade']]

In [None]:
df_plot = df_plot[(df_plot['Arms Trade'] < 10 ) & (df_plot['HIC: Violent Crime'] < 200) ]

In [None]:
plt.figure(figsize=[14, 7])
sns.kdeplot(df_plot['Arms Trade'], df_plot['HIC: Violent Crime'], cmap="Reds", shade=True, shade_lowest=False)
plt.title('Arms Trade vs HIC: Violent Crime')

---
## K-means Method

The most basic usage of K-means clustering requires only a choice for the number of clusters, $K$. We rarely know the correct number of clusters a priori, but the following simple heuristic sometimes works well:

$$K \approx \sqrt{n/2}$$

where $n$ is the number of rows in your dataset. By default, the maximum number of iterations is 10, and all features in the input dataset are used.

In [None]:
# Keep only the numeric feature columns, excluding 'Population'
sf = tc.SFrame(data=df_crime_pop.drop(columns='Population').select_dtypes(include='number'))

***Write the formula for k***
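One possible answer to the prompt above, as a sketch: assuming the intended rule of thumb is K ≈ sqrt(n / 2), with n the number of rows, the helper below turns it into an integer. The function name `heuristic_k` and the decision to round (rather than truncate) are assumptions made here.

```python
import math


def heuristic_k(n_rows):
    """Rule-of-thumb number of clusters, K ≈ sqrt(n / 2), as an integer of at least 1."""
    if n_rows < 1:
        raise ValueError("n_rows must be a positive integer")
    # Rounding is a choice made for this sketch; truncating with int() is equally defensible
    return max(1, round(math.sqrt(n_rows / 2)))


print(heuristic_k(200))  # 200 rows -> sqrt(100) = 10
```

In the notebook this could be used as `K = heuristic_k(sf.num_rows())` before the model is created.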

In [None]:
print('Number of Clusters K: {}'.format(K))

In [None]:
kmeans_model = tc.kmeans.create(sf, num_clusters=K)

In [None]:
kmeans_model.summary()

The model summary shows the usual fields about model schema, training time, and training iterations. It also shows that the K-means results are returned in two SFrames contained in the model: `cluster_id` and `cluster_info`. The `cluster_info` SFrame indicates the final cluster centers, one per row, in terms of the same features used to create the model.

The last three columns of the `cluster_info` SFrame indicate metadata about the corresponding cluster: ID number, number of points in the cluster, and the within-cluster sum of squared distances to the center.

In [None]:
kmeans_model.cluster_info[['cluster_id', 'size', 'sum_squared_distance']].print_rows(num_rows=14, num_columns=3)
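One way to read this table for anomaly detection is to look for very small clusters, since a handful of points that needed their own center may be unusual. The sketch below illustrates the idea in plain Python on made-up numbers. The helper `small_clusters` and the cutoff `max_size=2` are assumptions, and it presumes each row can be read like a dictionary keyed by column name.

```python
def small_clusters(cluster_rows, max_size=2):
    """Return cluster summaries whose size is at most `max_size`, smallest first."""
    flagged = [row for row in cluster_rows if row["size"] <= max_size]
    return sorted(flagged, key=lambda row: row["size"])


# Toy summary mimicking the three metadata columns; the values are made up
toy_info = [
    {"cluster_id": 0, "size": 180, "sum_squared_distance": 5200.0},
    {"cluster_id": 1, "size": 1, "sum_squared_distance": 0.0},
    {"cluster_id": 2, "size": 2, "sum_squared_distance": 310.5},
    {"cluster_id": 3, "size": 95, "sum_squared_distance": 4100.0},
]

for row in small_clusters(toy_info):
    print(row["cluster_id"], row["size"])
```

The cutoff is arbitrary; in practice it is worth inspecting the full distribution of cluster sizes before choosing one.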

The `cluster_id` field of the model shows the cluster assignment for each input data point, along with the Euclidean distance from the point to its assigned cluster's center.

In [None]:
clusters = kmeans_model.cluster_id
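The distance column offers a second, point-level view: rows that sit unusually far from their assigned center are candidates for anomalies. The sketch below flags such rows with a robust threshold, the median plus a multiple of the median absolute deviation (MAD). The helper `far_points`, the multiplier `n_mads=3.0`, and the toy rows are assumptions; a single global threshold is used here, although a per-cluster threshold would be a reasonable alternative.

```python
import statistics


def far_points(point_rows, n_mads=3.0):
    """Flag rows whose distance exceeds median + n_mads * MAD of all distances."""
    rows = list(point_rows)
    distances = [row["distance"] for row in rows]
    median = statistics.median(distances)
    # Median absolute deviation: a spread estimate that outliers barely affect
    mad = statistics.median(abs(d - median) for d in distances)
    threshold = median + n_mads * mad
    return [row for row in rows if row["distance"] > threshold]


# Toy assignments with made-up values: a row id, a cluster id, and a distance
toy_points = [
    {"row_id": 0, "cluster_id": 0, "distance": 1.0},
    {"row_id": 1, "cluster_id": 0, "distance": 1.2},
    {"row_id": 2, "cluster_id": 1, "distance": 0.9},
    {"row_id": 3, "cluster_id": 1, "distance": 1.1},
    {"row_id": 4, "cluster_id": 0, "distance": 9.5},
]

for row in far_points(toy_points):
    print(row["row_id"], row["distance"])
```

The flagged row ids can then be used to look up the corresponding municipalities in `df_crime_pop`.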

***Which are the anomalous points?***

***What do the points within anomalous clusters have in common?***