# Detect Anomalies Using Density Based Clustering


## Clustering-Based Anomaly Detection

- Assumption: Similar data points tend to belong to the same group or cluster, as determined by their distance from local centroids. Normal data points sit in dense neighborhoods, while anomalies lie far away from them.

- Using density-based clustering, such as DBSCAN, we can treat the data points that do not fall into any cluster as the anomalies.
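
As a minimal sketch of that idea, the cell below runs DBSCAN on synthetic data (an assumption for illustration, not the curriculum log). The `eps` and `min_samples` values here are arbitrary choices for the toy data. scikit-learn assigns the label `-1` to points that do not belong to any cluster.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data (an assumption for illustration): 50 points packed
# tightly around (0.5, 0.5), plus two isolated points far from the group.
rng = np.random.default_rng(0)
dense = 0.5 + 0.01 * rng.standard_normal((50, 2))
isolated = np.array([[0.0, 1.0], [1.0, 0.0]])
X = np.vstack([dense, isolated])

# Points that cannot be attached to any dense region get the label -1.
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)
anomaly_idx = np.where(labels == -1)[0]
print(anomaly_idx)
```

The two isolated rows were appended last, so their positions in `X` are 50 and 51.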


In [None]:
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# DBSCAN import
from sklearn.cluster import DBSCAN
# Scaler import
from sklearn.preprocessing import MinMaxScaler


In [None]:
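# Import .txt file and convert it to a DataFrame object.
# The separator is a regular expression, so use a raw string and the
# python parsing engine (the C engine does not support regex separators).
df = pd.read_table("anonymized-curriculum-access.txt", sep=r'\s', header=None,
                   engine='python',
                   names=['date', 'time', 'page', 'id', 'cohort', 'ip'])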


In [None]:
# let's examine the head of the dataframe to make sure it's
# what we were expecting
df.head(5)

In [None]:
df.info()

#### What do we know about this data set?
 - We are examining curriculum access at codeup.com
 - What do we want to find out about this data set?
     - high density page views on certain dates?
     - users that have a high number of page views?
     - cohorts that have unusual activity?

In [None]:
# I know that I have the following columns here:
# id: the number associated with the person viewing the curriculum
# page: the url endpoint associated with the page being visited
# cohort: number associated with the cohort that the id is associated with

Explore

In [None]:
# let's do a little aggregation based on the student ids in the data set,
# focusing on the number of unique hits
# (selecting several columns from a groupby requires a list, hence the double brackets)
id_counts = df.groupby(['id'])[['date', 'page', 'cohort']].nunique()

In [None]:
# my index is the user id
# and my aggregated columns represent the unique
# number of days, pages, and cohorts associated with each
id_counts.head()

In [None]:
id_counts.date.hist()

### Takeaways:
- id #1 is likely a curriculum developer or someone involved on the Codeup side.
- At the low end, some users accessed the curriculum on only a single day, or on fewer than a week's worth of days.
- What different values can we associate with multiple cohort assignments?

In [None]:
# let's observe unique hits based on cohort
# (again, a list of columns is required after the groupby)
cohort_counts = df.groupby('cohort')[['id', 'date', 'page']].nunique()

In [None]:
cohort_counts

 - I want to observe the initial visit per user in this data set.
     - How am I going to do this?

In [None]:
# let's go back to our original dataframe and check the dtypes:
# the date column will need to be converted to a datetime
df.info()

In [None]:
# convert our date to a pandas datetime so we can take the minimum
# value
df['date'] = pd.to_datetime(df['date'])

In [None]:
df.info()

In [None]:
first_access = df.groupby('id')['date'].min()

In [None]:
first_access

### Thoughts: 
- Can I use this series to examine when cohorts potentially start inside this data set? 
- Let's use this series to break out that index and then regroup based on the first access date for each user
    - Utilize the existing index by turning it back into a column

In [None]:
# let's take the index that holds the id, turn it back into
# a regular column, and then proceed with observing
# high volume dates

In [None]:
id_by_first_access_date = pd.DataFrame({'first_access_date': first_access}).reset_index()
id_by_first_access_date

In [None]:
id_by_first_access_date = id_by_first_access_date.groupby('first_access_date').count()\
.rename(columns={'id':'count_of_unique_ids'})
id_by_first_access_date

In [None]:
plt.plot(id_by_first_access_date)
plt.xticks(rotation=90)
plt.title('Number of First Access Users by Date')
plt.show()

 - Takeaways:
     - There appears to be a fairly clear pattern: many users start on the same dates, separated by long quiet periods. We could likely determine when cohorts start from this information, then corroborate with outside sources to check whether any mass curriculum access happened outside of the expected window.

Could someone be stealing the content of our curriculum for their benefit beyond personal education? If so, we would probably see them accessing a large number of unique pages. I would imagine they wouldn't spend much time on each page, perhaps taking screenshots, copying and pasting, or downloading the content. Let's take a look.

Aggregate and compute 2 features...number of unique pages and total page views. 

In [None]:
# let's make an examination:
# we want to look at individual users,
# and I want to know how they interact with pages in the curriculum,
# the number of unique pages and the number of total pages
page_views = df.groupby(['id'])['page'].agg(['count', 'nunique'])

In [None]:
page_views.sort_values(by='count', ascending=False)

In [None]:
plt.subplot(211)
page_views['count'].hist(bins=50)
plt.title('Distribution of Total Page Views Per User')

plt.subplot(212)
page_views['nunique'].hist(bins=50)
plt.title('Distribution of Unique Page Views Per User')

plt.tight_layout()
plt.show()

In [None]:
# let's narrow down the scope
# if we want to examine the users that have a lower count
# but a high nunique,
# we observed with our value counts and histograms previously
# that we had a range of approximately 200 for each of those features
# per user
# let's narrow down to users that have fewer than 600 total page views
# but more than 190 unique pages
page_views[(page_views['count'] < 600) & (page_views['nunique'] > 190)]

In a sense, we have clustered our data using a rule derived from visual observation of the distributions of `count` and `nunique`. Our clusters might be broadly defined as:
1. Anomalous (View Count < 600 & Unique Pages Viewed > 190)
2. Non-anomalous (Everything else)

Based on our criteria, we identified two anomalous users.

But what if we are limited in domain expertise and/or would prefer to use a more sophisticated algorithm to segment/cluster our data? 

We can use **DBSCAN**

Scale each attribute linearly. 
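
Scaling matters here because DBSCAN applies a single radius across all features, so a feature with a much larger range would otherwise dominate the distances. As a rough sketch, min-max scaling maps each column to the range 0 to 1 via `(x - min) / (max - min)`, which is what `MinMaxScaler` computes with its default `feature_range`. The array below is hypothetical data, not taken from the access log.

```python
import numpy as np

# Hypothetical (count, nunique) rows for four users.
X = np.array([[10.0, 5.0],
              [200.0, 50.0],
              [1000.0, 120.0],
              [4000.0, 200.0]])

# Column-wise minimum and maximum.
col_min = X.min(axis=0)
col_max = X.max(axis=0)

# Each column now runs from 0 (its smallest value) to 1 (its largest).
X_scaled = (X - col_min) / (col_max - col_min)
print(X_scaled.round(3))
```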

In [None]:
# create the scaler
scaler = MinMaxScaler().fit(page_views)
# use the scaler
scaled_page_views = scaler.transform(page_views)

In [None]:
scaled_page_views[0:5]

In [None]:
# whip up some new column names
scaled_cols = [col + '_scaled' for col in page_views.columns]

In [None]:
scaled_cols

In [None]:
scaled_page_views_df = pd.DataFrame(scaled_page_views, columns=scaled_cols, index=page_views.index)

In [None]:
scaled_page_views_df

`count_scaled` and `nunique_scaled` can be plotted against each other:

In [None]:
plt.figure(figsize=(5, 2))
sns.scatterplot(data = scaled_page_views_df, x ='count_scaled', y = 'nunique_scaled')

Construct a DBSCAN object that requires a minimum of 20 data points in a neighborhood of radius 0.1 to be considered a core point.
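
To make the core-point definition concrete, here is a small NumPy sketch that counts neighbors by hand. The helper `core_point_mask` and the sample array are hypothetical names and data for illustration only. As in scikit-learn, a point counts as its own neighbor.

```python
import numpy as np

def core_point_mask(X, eps, min_samples):
    """Return a boolean mask marking the core points of X."""
    # Pairwise Euclidean distances, shape (n, n).
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Number of points within eps of each point (the point itself included).
    neighbor_counts = (dists <= eps).sum(axis=1)
    return neighbor_counts >= min_samples

# Hypothetical scaled data: four points close together and one far away.
X = np.array([[0.10, 0.10], [0.12, 0.10], [0.10, 0.13],
              [0.11, 0.11], [0.90, 0.90]])
mask = core_point_mask(X, eps=0.1, min_samples=3)
print(mask)
```

The four nearby points each have at least three neighbors within the radius, so they are core points; the distant point has only itself and is not.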

In [None]:
# create the object first
dbsc = DBSCAN(min_samples=20, eps=0.1)

In [None]:
dbsc

In [None]:
# fit the object like we normally would with sklearn
dbsc.fit(scaled_page_views_df)

In [None]:
dbsc.labels_

In [None]:
# Combine the scaled and non-scaled values into one dataframe.
# Both frames share the same id index, so join aligns them row by row
# and keeps the user id as the index.
page_views_total = page_views.join(scaled_page_views_df)

In [None]:
# sanity check for the shape of the df:
page_views_total.shape

In [None]:
page_views_total

In [None]:
# let's apply the dbscan labels

In [None]:
page_views_total['labels'] = dbsc.labels_

In [None]:
page_views_total.head()

In [None]:
page_views_total.labels.value_counts()

In [None]:
page_views_total[page_views_total.labels == -1]

In [None]:
page_views = page_views_total

In [None]:
# Let's look at the descriptive stats for the entire population, the inliers, then the outliers/anomalies
# DBSCAN labels noise points as -1; every other label is a cluster
print("Full Dataset")
print(page_views.describe())
print("-------------")
print("Inliers Only")
print(page_views[page_views.labels != -1].describe())
print("-------------")
print("Outliers Only")
print(page_views[page_views.labels == -1].describe())

In [None]:
plt.scatter(page_views['count'],
           page_views['nunique'],
           c=page_views['labels'])
plt.xlabel('Count of pages viewed')
plt.ylabel('Number of unique instances')
plt.title('Cluster assignment by page count and unique count per user')
plt.show()

### Follow up questions:
- These unusual users don't all seem to be unusual in the same way.
- What's different about these users specifically?
- Examine each user based on these parameters.
- See which url endpoints they are visiting.
- Examine the cohorts those users are associated with.
- Examine the dates associated with first access for those users.
- Determine if these users may be employees, instructors, students that belonged to one or more cohorts, students that went through both programs, etc.
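
One possible way to start on these questions is to filter the access log to the flagged ids and summarize each user. The sketch below uses a hypothetical miniature log and a hypothetical list of flagged ids; with the real data, the log would be `df` and the ids would come from the rows labeled `-1`.

```python
import pandas as pd

# Hypothetical miniature access log with the same column names as df.
log = pd.DataFrame({
    'id': [1, 1, 2, 2, 2, 3],
    'page': ['/', 'java-i', '/', 'spring', 'spring', 'mysql'],
    'cohort': [8.0, 8.0, 22.0, 22.0, 22.0, 23.0],
    'date': pd.to_datetime(['2018-01-26', '2018-01-27', '2018-02-01',
                            '2018-02-01', '2018-02-03', '2018-03-05']),
})

# Hypothetical ids that DBSCAN labeled as noise (-1).
outlier_ids = [2, 3]

# Keep only the flagged users, then summarize each one.
flagged = log[log['id'].isin(outlier_ids)]
summary = flagged.groupby('id').agg(
    first_access=('date', 'min'),      # earliest date seen for the user
    cohorts=('cohort', 'nunique'),     # number of distinct cohorts
    top_page=('page', lambda s: s.value_counts().idxmax()),  # most visited endpoint
)
print(summary)
```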
    

## Experiment with the DBSCAN properties
- Read up on the epsilon and min_samples arguments to DBSCAN at https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
- Experiment with altering the epsilon value (the `eps` argument, which holds the distance threshold). Run the models and visualize the results. What has changed? Why do you think that is?
- Double the `min_samples` parameter. Run your model and visualize the results. Consider what changed and why.
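
A possible scaffold for these experiments is a loop that refits DBSCAN for several `eps` values and records how many clusters and noise points result. The data below is a synthetic stand-in (an assumption, not the real scaled features), and the list of `eps` values is arbitrary; swap in `scaled_page_views_df` to run it on the curriculum data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in for the scaled features (an assumption for illustration).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.uniform(0.0, 0.2, size=(200, 2)),   # dense bulk of users
    rng.uniform(0.0, 1.0, size=(15, 2)),    # sparse, scattered users
])

results = {}
for eps in (0.02, 0.05, 0.1, 0.2):
    labels = DBSCAN(eps=eps, min_samples=20).fit(X).labels_
    n_noise = int((labels == -1).sum())
    # The noise label -1 is not a cluster, so exclude it from the count.
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: clusters={n_clusters}, noise points={n_noise}")
```

The same loop can be repeated with `min_samples=40` to compare against the doubled setting.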

# Exercise

**file name:** clustering_anomaly_detection.py or clustering_anomaly_detection.ipynb


### Clustering - DBSCAN

Ideas: 

Use DBSCAN to detect anomalies in curriculum access. 

Use DBSCAN to detect anomalies in other products from the customers dataset. 

Use DBSCAN to detect anomalies in number of bedrooms and finished square feet of property for the filtered dataset you used in the clustering project (single unit properties with a logerror).
