# Time Series Anomaly Detection
## Codeup's Curriculum Access Logs

### Programs:
- Data Science
- Web Development

### Goals for each program:
1. Find the lessons that receive *the most traffic*
2. Is there a cohort that referred to a lesson more than any other?
3. Are there any students, while active, who didn't access the curriculum? 
    - If so, what can be said about these students?
4. Is there any suspicious activity, such as entities accessing the curriculum who aren't authorized? 
    - Does it appear that any web-scraping is happening? 
    - Are there any suspicious IP addresses?
    - Any odd user-agents?
5. At some point in the last year, the ability for students and alumni to cross-access the curriculum (web dev to ds, ds to web dev) should have been shut off.
    - Do you see any evidence of that happening? 
    - Did cross-access happen before the shutoff?
6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?
    - Which lessons are least accessed? 
7. Anything else anomalous? 

In [1]:
import numpy as np
import pandas as pd
import math
from sklearn import metrics

from scipy.stats import entropy

import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import matplotlib.dates as mdates  # to format dates on our plots
%matplotlib inline
import seaborn as sns

from wrangle import wrangle_logs

# Register pandas' datetime converters so matplotlib can plot Timestamps. This avoids:
# "TypeError: float() argument must be a string or a number, not 'Timestamp'"
pd.plotting.register_matplotlib_converters()

# Wrangling

In [8]:
df, no_id = wrangle_logs()

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 674618 entries, 2018-01-26 09:55:03 to 2020-11-02 16:48:47
Data columns (total 8 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   page        674618 non-null  object        
 1   user_id     674618 non-null  int64         
 2   cohort_id   674618 non-null  int64         
 3   ip          674618 non-null  object        
 4   name        674618 non-null  object        
 5   start_date  674618 non-null  datetime64[ns]
 6   end_date    674618 non-null  datetime64[ns]
 7   program_id  674618 non-null  int64         
dtypes: datetime64[ns](2), int64(3), object(3)
memory usage: 46.3+ MB


In [6]:
no_id.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 44840 entries, 2018-01-26 16:46:16 to 2020-11-02 16:30:49
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   page       44840 non-null  object 
 1   user_id    44840 non-null  int64  
 2   cohort_id  44840 non-null  float64
 3   ip         44840 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 1.7+ MB


In [15]:
print('Curriculum Access History')
print(df.index.min()) 
print(df.index.max(), '\n')

print('Cohorts with no ID')
print(no_id.index.min()) 
print(no_id.index.max())

Curriculum Access History
2018-01-26 09:55:03
2020-11-02 16:48:47 

Cohorts with no ID
2018-01-26 16:46:16
2020-11-02 16:30:49


**Summary**:
- 44840 entries had no `cohort_id`
    - separated into the `no_id` dataframe
- 674618 entries have a `cohort_id` and are kept in `df`
    - each entry is extended with cohort details:
        - cohort `name`
        - `start_date`
        - `end_date`
        - `program_id`
- Both dataframes span 2018-01-26 to 2020-11-02
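
The split summarized above can be sketched on toy data. This is only an illustration of the idea, not the actual `wrangle_logs` implementation, which lives in `wrangle.py` and is not shown here. The function name `split_and_enrich`, the `timestamp` index label, and all sample values (pages, IPs, cohort names, dates) are hypothetical.

```python
import pandas as pd

# Toy access log with made-up values; one row has no cohort_id.
logs = pd.DataFrame(
    {
        "page": ["/", "java-ii", "appendix", "spring"],
        "user_id": [1, 1, 2, 3],
        "cohort_id": [8.0, 8.0, float("nan"), 22.0],
        "ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.3"],
    },
    index=pd.to_datetime(
        [
            "2018-01-26 09:55:03",
            "2018-01-26 09:56:02",
            "2018-01-26 16:46:16",
            "2018-01-27 10:00:00",
        ]
    ),
)

# Toy cohort lookup table (hypothetical names and dates).
cohorts = pd.DataFrame(
    {
        "cohort_id": [8, 22],
        "name": ["cohort_a", "cohort_b"],
        "start_date": pd.to_datetime(["2016-01-01", "2018-01-08"]),
        "end_date": pd.to_datetime(["2016-05-01", "2018-05-17"]),
        "program_id": [1, 2],
    }
)


def split_and_enrich(logs, cohorts):
    """Split rows by presence of cohort_id and attach cohort details."""
    missing = logs["cohort_id"].isna()
    no_id = logs[missing].copy()
    with_id = logs[~missing].copy()
    # cohort_id is float only because of the NaNs; restore integers.
    with_id["cohort_id"] = with_id["cohort_id"].astype("int64")
    # merge() discards the index, so carry the timestamps through as a column.
    enriched = (
        with_id.rename_axis("timestamp")
        .reset_index()
        .merge(cohorts, on="cohort_id", how="left")
        .set_index("timestamp")
    )
    return enriched, no_id


df_demo, no_id_demo = split_and_enrich(logs, cohorts)
print(df_demo.shape, no_id_demo.shape)
```

A left merge keeps every log row that has a `cohort_id`, even if the lookup table lacks that cohort, so rows are not silently dropped during enrichment.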