# Time Series Anomaly Detection Exercises
### Kwame V. Taylor


* Discover users who are accessing our curriculum pages well after the end of their time at Codeup. What would the dataframe look like? What are two hypotheses you can test? Use a time series method for detecting anomalies, such as an exponential moving average with %b.

**Bonus:**
 * Can you label students who are viewing both the web dev and data science curriculum?
 * Can you label students by the program they are in?
 * Can you label users by student vs. staff?
 * What are Zach, Maggie, Faith, and Ryan's ids?
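
As a reference point for the "exponential moving average with %b" method named above, here is a minimal sketch on made-up daily counts. The function name `percent_b`, the `span` and `k` values, and the toy series are all assumptions for illustration; none of them come from the access log. %b is 0 at the lower band, 1 at the upper band, and above 1 when a value exceeds the upper band.

```python
import pandas as pd

def percent_b(series: pd.Series, span: int = 14, k: float = 2.0) -> pd.DataFrame:
    """Return the EMA midband, the upper/lower bands, and %b for a numeric series."""
    midband = series.ewm(span=span).mean()   # exponential moving average
    stdev = series.ewm(span=span).std()      # exponentially weighted std (NaN on the first row)
    upper = midband + k * stdev
    lower = midband - k * stdev
    pct_b = (series - lower) / (upper - lower)
    return pd.DataFrame({"value": series, "midband": midband,
                         "upper": upper, "lower": lower, "pct_b": pct_b})

# Made-up daily page-view counts: alternating 12/10 with one spike on the last day.
dates = pd.date_range("2020-01-01", periods=60, freq="D")
counts = pd.Series(10.0, index=dates)
counts.iloc[::2] = 12.0
counts.iloc[-1] = 100.0

bands = percent_b(counts, span=14, k=2.0)
anomalies = bands[bands.pct_b > 1]   # values above the upper band
print(anomalies)
```

Because the spike itself feeds into the moving average and standard deviation, a short `span` or a large `k` can keep even a big jump inside the bands, so both are worth tuning per user.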

### Imports

In [1]:
import numpy as np
import pandas as pd
import math
from sklearn import metrics

from scipy.stats import entropy

import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import matplotlib.dates as mdates #to format dates on our plots
%matplotlib inline
import seaborn as sns

# Register pandas datetime converters so matplotlib doesn't raise:
# "TypeError: float() argument must be a string or a number, not 'Timestamp'"
pd.plotting.register_matplotlib_converters()

### Acquire

In [2]:
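# 'request_method' holds the requested page path (e.g. '/', 'java-ii'), not an HTTP verb
colnames = ['date', 'timestamp', 'request_method', 'ip']

# sep=r'\s+' splits on whitespace; it replaces delim_whitespace=True, which is deprecated in pandas 2.2+
df = pd.read_csv('anonymized-curriculum-access.txt', header=None, index_col=False,
                 names=colnames, sep=r'\s+', na_values='"-"',
                 usecols=[0, 1, 2, 5])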
df.head()

,date,timestamp,request_method,ip
0,2018-01-26,09:55:03,/,97.105.19.61
1,2018-01-26,09:56:02,java-ii,97.105.19.61
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,97.105.19.61
3,2018-01-26,09:56:06,slides/object_oriented_programming,97.105.19.61
4,2018-01-26,09:56:24,javascript-i/conditionals,97.105.19.61


### Prepare

In [3]:
# merge date and timestamp
df["ds"] = df["date"] +" "+ df["timestamp"]
df.head()

,date,timestamp,request_method,ip,ds
0,2018-01-26,09:55:03,/,97.105.19.61,2018-01-26 09:55:03
1,2018-01-26,09:56:02,java-ii,97.105.19.61,2018-01-26 09:56:02
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,97.105.19.61,2018-01-26 09:56:05
3,2018-01-26,09:56:06,slides/object_oriented_programming,97.105.19.61,2018-01-26 09:56:06
4,2018-01-26,09:56:24,javascript-i/conditionals,97.105.19.61,2018-01-26 09:56:24


In [4]:
# drop date and timestamp
df = df.drop(columns=['date', 'timestamp'])
df.head()

,request_method,ip,ds
0,/,97.105.19.61,2018-01-26 09:55:03
1,java-ii,97.105.19.61,2018-01-26 09:56:02
2,java-ii/object-oriented-programming,97.105.19.61,2018-01-26 09:56:05
3,slides/object_oriented_programming,97.105.19.61,2018-01-26 09:56:06
4,javascript-i/conditionals,97.105.19.61,2018-01-26 09:56:24


In [5]:
# convert the ds column from string to datetime type
df['ds'] = pd.to_datetime(df['ds'])
df.dtypes

request_method            object
ip                        object
ds                datetime64[ns]
dtype: object

In [6]:
# set ds as the index and sort it
# this is a very important step: resampling and date-range slicing rely on a sorted DatetimeIndex
df = df.set_index('ds').sort_index()
df.head()

ds,request_method,ip
2018-01-26 09:55:03,/,97.105.19.61
2018-01-26 09:56:02,java-ii,97.105.19.61
2018-01-26 09:56:05,java-ii/object-oriented-programming,97.105.19.61
2018-01-26 09:56:06,slides/object_oriented_programming,97.105.19.61
2018-01-26 09:56:24,javascript-i/conditionals,97.105.19.61
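
A short sketch of what the sorted `DatetimeIndex` makes possible: slicing by date range and resampling into daily hit counts, which is the kind of series an anomaly detector would run on. The `toy` frame below is made up in the same shape as `df`; its rows are illustrative and are not taken from the log.

```python
import pandas as pd

# Made-up rows in the same shape as df, deliberately out of order.
toy = pd.DataFrame({
    "ds": pd.to_datetime([
        "2018-01-27 10:00:00",
        "2018-01-26 09:56:02",
        "2018-01-26 09:55:03",
        "2018-01-29 08:30:00",
    ]),
    "request_method": ["toc", "java-ii", "/", "javascript-i"],
    "ip": ["97.105.19.61"] * 4,
})
toy = toy.set_index("ds").sort_index()

# Date-range slicing with partial date strings (both endpoints inclusive).
first_two_days = toy.loc["2018-01-26":"2018-01-27"]

# Daily number of requests; days with no rows get a count of 0.
daily_hits = toy["request_method"].resample("D").count()
print(daily_hits)
```

Days with no activity show up as zeros rather than being skipped, which keeps the daily series evenly spaced for moving-average methods.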


### Explore

In [8]:
df.request_method.value_counts()

/                                                                 40122
search/search_index.json                                          15393
javascript-i                                                      14551
toc                                                               14018
java-iii                                                          10835
                                                                  ...  
capsones/151                                                          1
html-css/css-i/positioning/specimen/MaterialIcons-Regular.woff        1
appendix/cli/more-topics                                              1
8_ts_split                                                            1
8.8_Project                                                           1
Name: request_method, Length: 2154, dtype: int64

In [9]:
df.ip.value_counts()

97.105.19.58       268648
97.105.19.61        60530
192.171.117.210      8896
71.150.217.33        4919
12.106.208.194       4262
                    ...  
107.77.105.20           1
184.226.114.60          1
172.85.210.26           1
99.203.27.111           1
208.54.83.222           1
Name: ip, Length: 4064, dtype: int64
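
The counts above are dominated by two addresses, with a long tail of IPs seen only once. One possible next step, sketched here on made-up data, is to turn the counts into relative frequencies and look at the rare addresses. The `ips` series and the 0.2 cutoff are assumptions for illustration only; a real cutoff for this log would need to be far smaller.

```python
import pandas as pd

# Made-up IP column: two frequent addresses and one seen a single time.
ips = pd.Series(["97.105.19.58"] * 6 + ["97.105.19.61"] * 3 + ["107.77.105.20"],
                name="ip")

ip_summary = pd.DataFrame({
    "count": ips.value_counts(),                # raw number of requests per IP
    "proba": ips.value_counts(normalize=True),  # share of all requests
})

# Hypothetical cutoff: flag addresses that account for under 20% of requests.
rare = ip_summary[ip_summary["proba"] < 0.2]
print(rare)
```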