# Time Series Anomaly Detection
## Codeup's Curriculum Access Logs

### Programs:
- Data Science
- Web Development

### Goals for each program:
1. Find the lessons that receive the most traffic.
2. Is there a cohort that referred to a lesson more than any other cohort did?
3. Are there any students who, while active, rarely accessed the curriculum?
    - If so, what can be said about these students?
4. Is there any suspicious activity, such as entities accessing the curriculum who aren't authorized? 
    - Does it appear that any web-scraping is happening? 
    - Are there any suspicious IP addresses?
    - Any odd user-agents?
5. At some point in the last year, the ability for students and alumni to cross-access the curriculum (web dev to ds, ds to web dev) should have been shut off.
    - Do you see any evidence of that happening? 
    - Did it happen before? 
6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?
    - Which lessons are least accessed? 
7. Anything else anomalous? 

In [1]:
import numpy as np
import pandas as pd
import math
from sklearn import metrics

from scipy.stats import entropy

import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import matplotlib.dates as mdates #to format dates on our plots
%matplotlib inline
import seaborn as sns

from wrangle import wrangle_logs
import requests

# Register pandas' datetime converters so that matplotlib does not raise:
# "TypeError: float() argument must be a string or a number, not 'Timestamp'"
pd.plotting.register_matplotlib_converters()

# Wrangling

In [2]:
df, no_id = wrangle_logs()

In [None]:
df.info()

In [None]:
no_id.info()

In [None]:
print('Curriculum Access History')
print(df.index.min()) 
print(df.index.max(), '\n')

print('Cohorts with no ID')
print(no_id.index.min()) 
print(no_id.index.max())

#### Web Development and Data Science

In [None]:
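# Drop all observations where the page is '/' (the homepage). TODO: move this into wrangle.py.
# A boolean mask is used instead of drop(...index): the index appears to be the access
# timestamp, so dropping by label would also remove other rows that share a timestamp.
df = df[df.page != '/']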

In [None]:
# split the logs into one dataframe per program (program_id 3 is data science)
ds = df[df.program_id == 3]
wd = df[df.program_id != 3]

**Summary**:
- 44840 null values in `cohort_id`
    - separated into "no_id" dataframe
- 674618 total entries with `cohort_id`
    - prepared with extra columns for
        - cohort `name`
        - `start_date`
        - `end_date`
        - `program_id`

# Exploration

1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?

### Web Development

In [None]:
pd.DataFrame(wd.page.value_counts().head())

### Data Science

In [None]:
pd.DataFrame(ds.page.value_counts().head())

2. Is there a cohort that referred to a lesson significantly more often than other cohorts, which seemed to gloss over it?

### Web Development

In [None]:
pd.DataFrame(wd[wd.page != '/'].groupby('name').page.value_counts().sort_values(ascending=False).head(10))

In [None]:
pd.DataFrame(wd[wd.page == 'toc'].groupby('name').page.value_counts().sort_values().head(10))

In [None]:
pd.DataFrame(wd[wd.page == 'search/search_index.json'].groupby('name').page.value_counts().sort_values().head(10))

### Data Science

In [None]:
pd.DataFrame(ds.groupby('name').page.value_counts().sort_values(ascending=False).head(10))

In [None]:
pd.DataFrame(ds[ds.page == 'classification/overview'].groupby('name').page.value_counts().sort_values().head(10))

In [None]:
pd.DataFrame(ds[ds.page == '1-fundamentals/modern-data-scientist.jpg'].groupby('name').page.value_counts().sort_values().head(10))

In [None]:
pd.DataFrame(ds[ds.page == '6-regression/1-overview'].groupby('name').page.value_counts().sort_values().head(10))
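
Raw counts favor large or long-running cohorts, so one way to make the cohort comparison fairer is to look at each page's share of a cohort's own traffic relative to its share of all traffic. The sketch below is only an illustration under that assumption: `cohort_page_share` and the `lift` column are hypothetical names, the toy cohort names are made up, and only the `name` and `page` columns from the dataframes above are used.

```python
import pandas as pd

def cohort_page_share(logs: pd.DataFrame) -> pd.DataFrame:
    """Compare each page's share of a cohort's traffic with its share of all traffic."""
    # hits per (cohort, page)
    counts = logs.groupby(['name', 'page']).size().rename('hits').reset_index()
    # fraction of the cohort's own hits that went to this page
    counts['cohort_share'] = counts.hits / counts.groupby('name').hits.transform('sum')
    # fraction of all hits (every cohort combined) that went to this page
    overall = logs.page.value_counts(normalize=True).rename('overall_share')
    counts = counts.join(overall, on='page')
    # lift > 1 means the cohort leaned on the page more than the average cohort did
    counts['lift'] = counts.cohort_share / counts.overall_share
    return counts.sort_values('lift', ascending=False)

# toy data standing in for ds or wd (cohort names are placeholders)
toy = pd.DataFrame({
    'name': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'page': ['x', 'x', 'y', 'y', 'y', 'y', 'y', 'x'],
})
result = cohort_page_share(toy)
print(result)
```

Calling `cohort_page_share(ds)` or `cohort_page_share(wd)` would rank cohort and page pairs by lift; pages with very few hits can produce large lifts, so a minimum `hits` filter may be worth adding.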

3. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students? 

In [None]:
# filters df for observations only during the time the student 
# was 'active' for each program
active_access_wd = wd.loc[(wd.index >= wd.start_date) & (wd.index <= wd.end_date)]
active_access_ds = ds.loc[(ds.index >= ds.start_date) & (ds.index <= ds.end_date)]

### Web Development

In [None]:
# counts page accesses per active web dev user and keeps the five lowest
low_access_wd = active_access_wd.groupby('user_id').size().sort_values().head()
low_access_wd

In [None]:
# accessed pages of active web dev users from the list above
active_access_wd[active_access_wd.user_id.isin(low_access_wd.index)]

In [None]:
# ip: 97.105.19.58
url = 'http://ip-api.com/csv/97.105.19.58'
response = requests.get(url)
location = response.text
location

In [None]:
# ip: 107.77.217.9
url = 'http://ip-api.com/csv/107.77.217.9'
response = requests.get(url)
location = response.text
location

### Data Science

In [None]:
# counts page accesses per active data science user and keeps the five lowest
low_access_ds = active_access_ds.groupby('user_id').size().sort_values().head()
low_access_ds

In [None]:
# accessed pages of active data science users from the list above
active_access_ds[active_access_ds.user_id.isin(low_access_ds.index.tolist())]

In [None]:
# IP addresses used by those low-access active data science users
active_access_ds[active_access_ds.user_id.isin(low_access_ds.index.tolist())].ip.value_counts()

In [None]:
# ip: 99.132.128.255
url = 'http://ip-api.com/csv/99.132.128.255'
response = requests.get(url)
location = response.text
location

**Summary**
- For web dev:
    - The same IP address appears for multiple `user_id`s across multiple cohorts, which suggests that it is Codeup's own location.
    - One user also connected from a Houston-based IP address, in addition to a single access from Codeup's IP address (see the first two rows of the web dev dataframe).
- For data sci:
    - The IP addresses are more spread out. This makes sense given the work-from-home environment since the pandemic started.
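
The shared-IP observation can be checked more directly by counting how many distinct users and cohorts sit behind each address. This is a minimal sketch, assuming the `ip`, `user_id`, `name`, and `page` columns seen above; `ip_summary` is a hypothetical helper and the toy rows use reserved documentation addresses rather than real log data.

```python
import pandas as pd

def ip_summary(logs: pd.DataFrame) -> pd.DataFrame:
    """Count distinct users, distinct cohorts, and total hits behind each IP address."""
    return (
        logs.groupby('ip')
            .agg(users=('user_id', 'nunique'),
                 cohorts=('name', 'nunique'),
                 hits=('page', 'size'))
            .sort_values('users', ascending=False)
    )

# toy rows; 192.0.2.x is reserved for documentation
toy = pd.DataFrame({
    'ip': ['192.0.2.1', '192.0.2.1', '192.0.2.1', '192.0.2.7'],
    'user_id': [1, 2, 3, 3],
    'name': ['Alpha', 'Alpha', 'Beta', 'Beta'],
    'page': ['toc', 'toc', 'toc', 'toc'],
})
summary = ip_summary(toy)
print(summary)
```

An address shared by many users and cohorts is more likely a classroom network than an individual, which is the reading given in the summary above.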

4. Is there any suspicious activity, such as users/machines/etc accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses? Any odd user-agents? 
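
One possible starting point for the scraping question is to look for bursts: an IP address requesting far more pages in a short window than a person could read. The sketch below assumes the dataframe has a timestamp index and an `ip` column, as in the cells above; `burst_requests` and its default `threshold` of 30 requests per minute are hypothetical choices, not values derived from the logs.

```python
import pandas as pd

def burst_requests(logs: pd.DataFrame, freq: str = '1min', threshold: int = 30) -> pd.Series:
    """Return (ip, window) pairs whose request count exceeds `threshold`."""
    # pd.Grouper(freq=...) bins the DatetimeIndex into fixed windows
    per_window = logs.groupby(['ip', pd.Grouper(freq=freq)]).size().rename('hits')
    return per_window[per_window > threshold].sort_values(ascending=False)

# toy log: five requests inside one minute from one address, one from another
times = pd.to_datetime([
    '2020-01-01 09:00:01', '2020-01-01 09:00:05', '2020-01-01 09:00:09',
    '2020-01-01 09:00:20', '2020-01-01 09:00:40', '2020-01-01 09:05:00',
])
toy = pd.DataFrame({
    'ip': ['192.0.2.1'] * 5 + ['192.0.2.2'],
    'page': ['a', 'b', 'c', 'd', 'e', 'a'],
}, index=times)
flagged = burst_requests(toy, threshold=3)
print(flagged)
```

Running `burst_requests(df)` with a threshold tuned to the data would give a short list of addresses and times to inspect by hand. A shared classroom IP can trip the same rule, so flagged rows are candidates rather than confirmed scrapers.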

5. At some point in the last year, the ability for students and alumni to cross-access the curriculum (web dev to ds, ds to web dev) should have been shut off. Do you see any evidence of that happening? Did it happen before? 
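
A rough way to look for the shut-off is to count, month by month, how often web dev users hit data science pages; a drop to zero would be consistent with access being removed. The sketch below assumes that `program_id == 3` marks data science (as in the split above) and that a page's program can be guessed from its first path segment. `DS_SECTIONS` is a hypothetical, incomplete set built only from page names that appear earlier in this notebook, and only the web dev to data science direction is covered.

```python
import pandas as pd

# Hypothetical and incomplete: top-level sections taken from pages shown earlier.
DS_SECTIONS = {'classification', '1-fundamentals', '6-regression'}

def cross_access_by_month(logs: pd.DataFrame, ds_sections=DS_SECTIONS) -> pd.Series:
    """Count monthly hits on data science pages by users outside the data science program."""
    section = logs.page.str.split('/').str[0]   # first path segment of each page
    is_ds_page = section.isin(ds_sections)
    is_ds_user = logs.program_id == 3
    crossed = logs[is_ds_page & ~is_ds_user]
    # group by calendar month of the timestamp index
    return crossed.groupby(crossed.index.to_period('M')).size()

# toy log with a timestamp index
times = pd.to_datetime(['2020-01-05', '2020-01-20', '2020-03-02', '2020-03-03', '2020-03-04'])
toy = pd.DataFrame({
    'page': ['classification/overview', '6-regression/1-overview',
             'classification/overview', 'classification/overview', 'toc'],
    'program_id': [2, 2, 2, 3, 2],
}, index=times)
monthly = cross_access_by_month(toy)
print(monthly)
```

The reverse direction would need an equivalent set of web dev sections, which is not listed anywhere in this notebook.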

6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?
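
For the post-graduation question, a minimal sketch is to keep only accesses that occur after a user's cohort `end_date` and then rank pages within each program. It assumes a timestamp index and a datetime `end_date` column, as prepared by the wrangle step; `post_grad_top_pages` is a hypothetical helper name and the toy dates are invented.

```python
import pandas as pd

def post_grad_top_pages(logs: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n most-visited pages per program among accesses after the cohort end date."""
    # compare timestamps position by position to avoid index alignment issues
    after_grad = logs.index.to_numpy() > logs.end_date.to_numpy()
    grads = logs[after_grad]
    counts = grads.groupby('program_id').page.value_counts()
    # value_counts is sorted within each program, so head(n) keeps the top n per group
    return counts.groupby(level=0).head(n)

# toy log: the first row falls before graduation and is excluded
times = pd.to_datetime(['2020-01-15', '2020-02-01', '2020-02-10', '2020-02-11', '2020-03-01'])
toy = pd.DataFrame({
    'page': ['toc', 'classification/overview', 'classification/overview',
             '6-regression/1-overview', 'toc'],
    'program_id': [3, 3, 3, 3, 2],
    'end_date': pd.to_datetime(['2020-01-31'] * 4 + ['2020-02-15']),
}, index=times)
top = post_grad_top_pages(toy, n=1)
print(top)
```

Navigation pages such as `toc` are likely to dominate the result, so filtering them out first may give a clearer picture of the topics themselves.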

7. Which lessons are least accessed? 
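
The least-accessed lessons can be sketched by reversing the earlier `value_counts` calls, after removing requests that are assets rather than lessons (for example the `.jpg` and `.json` pages seen above). `least_accessed` and the extension list in `asset_pattern` are hypothetical; the list would need adjusting to whatever actually appears in the logs.

```python
import pandas as pd

def least_accessed(logs: pd.DataFrame, n: int = 5,
                   asset_pattern: str = r'\.(?:jpg|jpeg|png|gif|svg|json|css|js|ico)$') -> pd.Series:
    """Return the n pages with the fewest hits, ignoring pages that look like static assets."""
    pages = logs.page.dropna()
    lessons = pages[~pages.str.contains(asset_pattern, case=False, regex=True)]
    return lessons.value_counts().sort_values().head(n)

# toy log built from page names that appear earlier in this notebook
toy = pd.DataFrame({'page': [
    'toc', 'toc', 'toc',
    'classification/overview',
    '1-fundamentals/modern-data-scientist.jpg',
    'search/search_index.json', 'search/search_index.json',
]})
rare = least_accessed(toy, n=1)
print(rare)
```

Pages with a single hit are often typos or retired URLs, so the bottom of this list is worth reading before treating it as a ranking of unpopular lessons.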

8. Others? 