<h2 style='height:40px;text-align:center;font-size:30px;background-color:red;border:20px;color:white'>Violent Crime Rates By US State</h2>


Taken from Kaggle

## Content
This data set contains statistics, in arrests per 100,000 residents
for assault, murder, and rape in each of the 50 US states in 1973.
Also given is the percent of the population living in urban areas. This notebook takes a systematic approach to identifying and analyzing patterns and trends in crime using the USArrests dataset.

## What is Hierarchical Clustering?

Hierarchical clustering is a type of unsupervised machine learning algorithm used to cluster unlabeled data points. Like K-means clustering, hierarchical clustering groups together data points with similar characteristics. One of the major considerations in using the K-means algorithm is deciding the value of K beforehand. The hierarchical clustering algorithm does not have this restriction. The output of the hierarchical clustering algorithm is also quite different from that of the K-means algorithm: it results in an inverted tree-shaped structure, called the **dendrogram**.

## Types Of Hierarchical Clustering:

There are two types of hierarchical clustering: 

* **Agglomerative**: The data points are clustered using a bottom-up approach starting with individual data points.
* **Divisive**: The top-down approach is followed where all the data points are treated as one big cluster and the clustering process involves dividing the one big cluster into several small clusters.

## Steps to Perform Hierarchical Clustering:

Following are the steps involved in **agglomerative clustering**:

* At the start, treat each data point as its own cluster. The number of clusters at the start is therefore K, where K is the number of data points.
* Form a cluster by joining the two closest data points, resulting in K-1 clusters.
* Form more clusters by joining the two closest clusters, resulting in K-2 clusters.
* Repeat the previous step until one big cluster is formed.
* Once a single cluster is formed, the dendrogram is used to divide it into multiple clusters depending upon the problem. We will study the concept of the dendrogram in detail in an upcoming section.
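
The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not how scipy or scikit-learn implement it: the function name `naive_agglomerative`, the toy points, and the choice of single linkage (closest pair of points) as the cluster distance are all assumptions made for the example.

```python
import numpy as np

def naive_agglomerative(points, n_clusters=1):
    """Merge the two closest clusters until n_clusters remain (single linkage)."""
    points = np.asarray(points, dtype=float)
    # Start: every data point is its own cluster, so there are K clusters.
    clusters = [[i] for i in range(len(points))]
    merges = []  # history of (members_a, members_b, distance)

    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: closest pair of points across the two clusters.
                diffs = points[clusters[a]][:, None, :] - points[clusters[b]][None, :, :]
                d = np.sqrt((diffs ** 2).sum(axis=-1)).min()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((list(clusters[a]), list(clusters[b]), d))
        # Each merge reduces the number of clusters by one.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

# Hypothetical toy points, chosen only for illustration.
toy = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]]
final_clusters, merge_history = naive_agglomerative(toy, n_clusters=3)
print(final_clusters)
```

Stopping at `n_clusters=3` instead of merging all the way to one cluster plays the role of cutting the dendrogram. The double loop is quadratic in the number of clusters at every step, so this sketch is only suitable for small examples.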

## Import the Desired Libraries:

In [None]:
import warnings
# Suppress library warnings to keep the notebook output readable.
warnings.filterwarnings('ignore')

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

sns.set()
%matplotlib inline

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()  # standardizes each feature to zero mean and unit variance

In [None]:
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering

## Reading and Understanding the Data

In [None]:
crime = pd.read_csv("USArrests.csv")

In [None]:
#peeking at the dataset

crime.head(5)

### Our dataset consists of arrest rates for Murder, Assault and Rape, plus UrbanPop (the percent of the population living in urban areas)

In [None]:
# Let's see how many rows and columns we got!

crime.shape

In [None]:
#Let's see some facts here

crime.info()

### We have 50 rows and 5 columns.

In [None]:
# Let's get some statistics summary

crime.describe()

### Let's check for missing values.

In [None]:
crime.isnull().sum()

### We have no missing values!

In [None]:
crime.head()

## 1) Murder Rate

In [None]:
plt.figure(figsize=(20,5))
crime.groupby('State')['Murder'].max().plot(kind='bar')

## **Observations**:

* Highest Murder Rate: Georgia and Mississippi.
* Lowest Murder Rate: Idaho, Iowa, Maine, New Hampshire, North Dakota, Vermont and Wisconsin.

## 2) Assault Rate

In [None]:
plt.figure(figsize=(20,5))
crime.groupby('State')['Assault'].max().plot(kind='bar')

## **Observations**:

* Highest Assault Rate: Florida and North Carolina.
* Lowest Assault Rate: Hawaii, North Dakota, Vermont, New Hampshire and Wisconsin.

## 3) Rape Rate

In [None]:
plt.figure(figsize=(20,5))
crime.groupby('State')['Rape'].max().plot(kind='bar')

## **Observations**:

* Highest Rape Rate: Nevada and Alaska.
* Lowest Rape Rate: Maine, North Dakota, Vermont, Connecticut, New Hampshire, Wisconsin, Rhode Island and West Virginia.

## 4) UrbanPop : Percent Urban Population

In [None]:
plt.figure(figsize=(20,5))
crime.groupby('State')['UrbanPop'].max().plot(kind='bar')

In [None]:
plt.figure(figsize=(10,5))
plt.scatter('UrbanPop','Murder',data=crime)
plt.xlabel('Urban Population')
plt.ylabel('Murder Rate')

In [None]:
plt.figure(figsize=(10,5))
plt.scatter('UrbanPop','Rape',data=crime)
plt.xlabel('Urban Population')
plt.ylabel('Rape Rate')

In [None]:
plt.figure(figsize=(10,5))
plt.scatter('UrbanPop','Assault',data=crime)
plt.xlabel('Urban Population')
plt.ylabel('Assault Rate')

In [None]:
data = crime.iloc[:,1:].values

In [None]:
scaled_data = scaler.fit_transform(data)

## Types of Linkages:

## 1) Single Linkage:

The distance between two clusters is defined as the shortest distance between any two points, one from each cluster.

In [None]:
plt.figure(figsize=(20,5))
plt.title("Crime Rate Dendrograms")
dend = sch.dendrogram(sch.linkage(scaled_data, method='single'))
plt.xlabel('States (row index)')  # each leaf of the dendrogram is one state
plt.ylabel('Euclidean distances')

#### The `scipy.cluster.hierarchy` module has a `dendrogram` function which takes the value returned by the `linkage` function of the same module. The `linkage` function takes the dataset and the linkage method (how the distance between clusters is measured) as parameters.
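
To make the value passed from `linkage` to `dendrogram` less opaque, the sketch below inspects it on a small example. The toy points are assumptions made for illustration; `no_plot=True` asks `dendrogram` to return its layout information without drawing anything.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Hypothetical toy points, chosen only for illustration.
toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

# For n observations, linkage returns an (n - 1) x 4 array with one row per merge:
# [index of cluster 1, index of cluster 2, merge distance, size of the new cluster].
Z = sch.linkage(toy, method='single')
print(Z.shape)
print(Z)

# dendrogram turns the merge history into a tree layout;
# 'leaves' lists the observations in the order they appear along the x-axis.
info = sch.dendrogram(Z, no_plot=True)
print(info['leaves'])
```

Indices below `n` in the first two columns refer to original observations, while indices of `n` and above refer to clusters created by earlier rows.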

## 2) Complete Linkage:

The distance between two clusters is defined as the maximum distance between any two points, one from each cluster.

In [None]:
plt.figure(figsize=(20,5))
plt.title("Crime Rate Dendrograms")
dend = sch.dendrogram(sch.linkage(scaled_data, method='complete'))
plt.xlabel('States (row index)')  # each leaf of the dendrogram is one state
plt.ylabel('Euclidean distances')

## 3) Average Linkage: 
    
The distance between two clusters is defined as the average distance between every point of one cluster and every point of the other cluster.
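
The single, complete and average linkage definitions can be checked by hand on two small clusters. The points below are hypothetical and were picked so that the distances are easy to verify; `cdist` computes all pairwise Euclidean distances between the two clusters.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters lying on the x-axis.
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

# Rows are points of cluster_a, columns are points of cluster_b.
pairwise = cdist(cluster_a, cluster_b)  # Euclidean distance by default

single_d = pairwise.min()     # single linkage: closest pair
complete_d = pairwise.max()   # complete linkage: farthest pair
average_d = pairwise.mean()   # average linkage: mean over all pairs

print(single_d, complete_d, average_d)
```

The same two clusters are therefore 3, 6 or 4.5 units apart depending on the linkage, which is why the dendrograms in this section differ.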

In [None]:
plt.figure(figsize=(20,5))
plt.title("Crime Rate Dendrograms")
dend = sch.dendrogram(sch.linkage(scaled_data, method='average'))
plt.xlabel('States (row index)')  # each leaf of the dendrogram is one state
plt.ylabel('Euclidean distances')

#### Single linkage tends to chain points together one at a time, producing a dendrogram without a clear structure, whereas complete or average linkage tends to produce more balanced clusters with a proper tree-like structure.

In [None]:
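# With Ward method
# Note: this uses the unscaled `data`, so Assault (the feature with the
# largest range) has the most influence on the distances.
plt.figure(figsize=(20,8))
dendrogram = sch.dendrogram(sch.linkage(data, method="ward"))
plt.title('Dendrogram')
plt.xlabel('States (row index)')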
plt.ylabel('Euclidean distances')
plt.show()

## How do we determine the optimal number of clusters from this diagram? 

We look for the largest vertical distance that we can travel without crossing any horizontal line; this is the red framed line on the above diagram. We then draw a horizontal line through that gap and count the number of vertical lines it crosses to determine the optimal number of clusters. The cluster number will be 3 for this dataset.
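
The same rule of thumb can be approximated numerically by looking for the largest gap between consecutive merge heights in the linkage matrix. The sketch below is only one possible reading of the rule, run on hypothetical synthetic data (three well-separated blobs) rather than on the crime data; the variable names are made up for the example.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Hypothetical data: three well-separated blobs of 10 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 10.0]])
toy = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

Z = sch.linkage(toy, method='ward')
heights = Z[:, 2]            # merge distances, in increasing order
gaps = np.diff(heights)      # vertical distance between consecutive merges
i = int(np.argmax(gaps))     # the largest gap lies between merge i and merge i + 1

# After merge i has happened, len(toy) - (i + 1) clusters are left.
n_clusters = len(toy) - (i + 1)

# Cutting the tree in the middle of that gap gives the cluster labels.
cut_height = (heights[i] + heights[i + 1]) / 2
labels = sch.fcluster(Z, t=cut_height, criterion='distance')

print(n_clusters)
```

This heuristic is a guide rather than a guarantee: on real data the largest gap is often less clear-cut than on well-separated blobs.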

In [None]:
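# Create the Agglomerative Clustering model
# Ward linkage only supports Euclidean distance, which is the default, so the
# `affinity` argument (removed in recent scikit-learn releases) is not needed.
AC = AgglomerativeClustering(n_clusters=3, linkage='ward')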

The **Ward** method tries to minimize the variance within each cluster. It is similar to K-means, where we minimized the WCSS (within-cluster sum of squares) to plot our elbow chart; the only difference is that here, at each step, we merge the two clusters whose merger gives the smallest increase in within-cluster variance.
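
To see what "increase in within-cluster variance" means for a single merge, the sketch below compares the within-cluster sum of squares before and after merging two clusters. The clusters are hypothetical toy points, and `sse` is a helper defined only for this example. The increase matches the closed-form expression n_a * n_b / (n_a + n_b) times the squared distance between the two centroids.

```python
import numpy as np

def sse(points):
    """Within-cluster sum of squared distances to the centroid."""
    return ((points - points.mean(axis=0)) ** 2).sum()

# Two hypothetical clusters.
cluster_a = np.array([[0.0, 0.0], [2.0, 0.0]])
cluster_b = np.array([[5.0, 0.0], [5.0, 2.0], [5.0, 4.0]])

# Cost of the merge: how much the total within-cluster sum of squares grows.
merged = np.vstack([cluster_a, cluster_b])
increase = sse(merged) - sse(cluster_a) - sse(cluster_b)

# Closed form: n_a * n_b / (n_a + n_b) * squared distance between centroids.
n_a, n_b = len(cluster_a), len(cluster_b)
centroid_gap = cluster_a.mean(axis=0) - cluster_b.mean(axis=0)
formula = n_a * n_b / (n_a + n_b) * (centroid_gap ** 2).sum()

print(increase, formula)
```

At each step, Ward linkage picks the pair of clusters for which this increase is smallest.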

In [None]:
# Fit and predict to have the cluster labels.
y_pred = AC.fit_predict(data)
y_pred

In [None]:
# Store the cluster labels in the DataFrame
crime['cluster labels'] = y_pred

In [None]:
# Let's see which State falls in which cluster
crime[['State','cluster labels']]

In [None]:
plt.figure(figsize=(10,5))
sns.boxplot(x='cluster labels', y='Murder', data=crime)

In [None]:
plt.figure(figsize=(10,5))
sns.boxplot(x='cluster labels', y='Rape', data=crime)

In [None]:
plt.figure(figsize=(10,5))
sns.boxplot(x='cluster labels', y='Assault', data=crime)

## **Observations**:

* The states in Cluster-0 seem to be a Safe-Zone, with relatively few murders, assaults and rapes.
* The states in Cluster-1 seem to have higher crime rates and can be regarded as a Danger-Zone.
* The states in Cluster-2 seem to have moderate crime rates compared to the other zones and can be called a Moderate-Zone.
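
Cluster label numbers carry no meaning by themselves, so it can help to derive the zone names from the per-cluster averages instead of reading them off the plots. The sketch below shows one way to do that; the toy values are hypothetical (not taken from USArrests), and ranking by mean Assault is just one possible choice.

```python
import pandas as pd

# Hypothetical toy values, not taken from the USArrests data.
toy = pd.DataFrame({
    'Murder': [2.0, 3.0, 13.0, 15.0, 7.0, 8.0],
    'Assault': [50, 60, 280, 300, 150, 170],
    'cluster labels': [0, 0, 1, 1, 2, 2],
})

# Average rate per cluster.
cluster_means = toy.groupby('cluster labels')[['Murder', 'Assault']].mean()

# Rank clusters from lowest to highest mean Assault and name them accordingly.
ranked = cluster_means['Assault'].sort_values().index.tolist()
zone_names = dict(zip(ranked, ['Safe_Zone', 'Moderate_Zone', 'Danger_Zone']))

print(cluster_means)
print(zone_names)
```

With this mapping the zone names stay correct even if the clustering assigns the label numbers in a different order.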


In [None]:
Safe_Zone= crime.groupby('cluster labels')['State'].unique()[0]
Safe_Zone

In [None]:
Danger_Zone= crime.groupby('cluster labels')['State'].unique()[1]
Danger_Zone

In [None]:
Moderate_Zone= crime.groupby('cluster labels')['State'].unique()[2]
Moderate_Zone

In [None]:
plt.figure(figsize=(10,5))
plt.scatter(data[y_pred==0, 0], data[y_pred==0, 1], s=100, c='red', label ='Safe_Zone')
plt.scatter(data[y_pred==1, 0], data[y_pred==1, 1], s=100, c='blue', label ='Danger_Zone')
plt.scatter(data[y_pred==2, 0], data[y_pred==2, 1], s=100, c='green', label ='Moderate_Zone')
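plt.xlabel('Murder Rate')   # first column of `data`
plt.ylabel('Assault Rate')  # second column of `data`
plt.legend()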
plt.show()