# Time Series Anomaly Detection
### *Codeup's Curriculum Access Logs*

#### Programs:
- Data Science
- Web Development

#### Goals for each program:
>1. Find the lessons where *the most traffic occurs*
2. Is there a cohort that referred to a lesson more than any other?
3. Are there any students, while active, who didn't access the curriculum much? 
    - If so, what can be said about these students?
4. Is there any suspicious activity, such as entities accessing the curriculum who aren't authorized? 
    - Does it appear that any web-scraping is happening? 
    - Are there any suspicious IP addresses?
    - Any odd user-agents?
5. At some point in the last year, the ability for students and alumni to cross-access curriculum (web dev to ds, ds to web dev) should have been shut off. 
    - Do you see any evidence of that happening? 
    - Did it happen before? 
6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?
    - Which lessons are least accessed?
7. Anything else anomalous? 

---

In [1]:
import numpy as np
import pandas as pd
import math
from sklearn import metrics

from scipy.stats import entropy

import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import matplotlib.dates as mdates #to format dates on our plots
%matplotlib inline
import seaborn as sns

from wrangle import wrangle_logs
import requests

# Register pandas datetime converters so matplotlib doesn't raise:
# "TypeError: float() argument must be a string or a number, not 'Timestamp'"
pd.plotting.register_matplotlib_converters()

---
## Wrangling

`df`: all logs that have a cohort_id

`no_id`: logs that don't have a cohort_id

In [2]:
# wrangle data from local file 
df, no_id = wrangle_logs()

Summarize `df` & `no_id`

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 581754 entries, 2018-01-26 09:56:02 to 2020-11-02 16:48:47
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   page            581754 non-null  object        
 1   user_id         581754 non-null  int64         
 2   cohort_id       581754 non-null  int64         
 3   ip              581754 non-null  object        
 4   name            581754 non-null  object        
 5   start_date      581754 non-null  datetime64[ns]
 6   end_date        581754 non-null  datetime64[ns]
 7   program_id      581754 non-null  int64         
 8   days_active     581754 non-null  int64         
 9   program_length  581754 non-null  int64         
 10  post_access     581754 non-null  int64         
dtypes: datetime64[ns](2), int64(6), object(3)
memory usage: 53.3+ MB


In [7]:
no_id.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 44840 entries, 2018-01-26 16:46:16 to 2020-11-02 16:30:49
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   page       44840 non-null  object 
 1   user_id    44840 non-null  int64  
 2   cohort_id  44840 non-null  float64
 3   ip         44840 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 1.7+ MB


In [8]:
# print the min and max dates
print('Curriculum Access History')
print(df.index.min()) 
print(df.index.max(), '\n')

print('Cohorts with no ID')
print(no_id.index.min()) 
print(no_id.index.max())

Curriculum Access History
2018-01-26 09:56:02
2020-11-02 16:48:47 

Cohorts with no ID
2018-01-26 16:46:16
2020-11-02 16:30:49


Web Development and Data Science Dataframes

In [9]:
# split the original df by program: program_id 3 goes to ds, every other program_id goes to wd
ds = df[df.program_id == 3]
wd = df[df.program_id != 3]

>**Summary**:
- 44840 null values in `cohort_id`
    - separated into **no_id** dataframe
- 581754 total entries with `cohort_id`
    - prepared with extra columns for
        - cohort `name`
        - `start_date`
        - `end_date`
        - `program_id`
- wrangle.py
    - read data from 2 sources
    - change data types for exploration
    - index set to date
    - created 2 dataframes
    - merged data for cohorts
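
`wrangle.py` is not shown in this notebook, so the following is only a minimal sketch of the steps listed above, assuming two small in-memory tables in place of the real sources. The function name `wrangle_sketch`, the toy rows, and the cohort called `Example` are all hypothetical.

```python
import pandas as pd

# Hypothetical toy inputs standing in for the two real sources
logs = pd.DataFrame({
    "date": ["2018-01-26 09:56:02", "2018-01-26 09:56:05", "2018-01-26 16:46:16"],
    "page": ["/", "java-ii", "index.html"],
    "user_id": [1, 1, 48],
    "cohort_id": [8.0, 8.0, None],
    "ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
})
cohorts = pd.DataFrame({
    "cohort_id": [8],
    "name": ["Example"],            # made-up cohort
    "start_date": ["2018-01-08"],   # made-up dates
    "end_date": ["2018-05-17"],
    "program_id": [2],
})

def wrangle_sketch(logs, cohorts):
    """Return (df, no_id): logs with cohort details, and logs without a cohort_id."""
    logs = logs.copy()
    # change data types and set the index to the date
    logs["date"] = pd.to_datetime(logs["date"])
    logs = logs.set_index("date").sort_index()

    # create two dataframes, split on whether cohort_id is missing
    missing = logs["cohort_id"].isna()
    no_id = logs[missing]
    with_id = logs[~missing].astype({"cohort_id": "int64"})

    cohorts = cohorts.copy()
    for col in ("start_date", "end_date"):
        cohorts[col] = pd.to_datetime(cohorts[col])

    # merge cohort details, then restore the datetime index
    df = (with_id.reset_index()
                 .merge(cohorts, on="cohort_id", how="left")
                 .set_index("date"))
    return df, no_id

df_sketch, no_id_sketch = wrangle_sketch(logs, cohorts)
```

The real `wrangle_logs` also produces `days_active`, `program_length`, and `post_access`; how it computes them is not shown in this notebook, so the sketch leaves them out.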

---
## Exploration

<div class="alert alert-block alert-info">1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?</div>

#### Web Development

In [10]:
# top total page views
pd.DataFrame(wd.page.value_counts().head()).rename(columns={'page':'count'})

Unnamed: 0,count
toc,12740
javascript-i,12608
search/search_index.json,11152
java-iii,9681
html-css,9351


In [11]:
# top page views per cohort
pd.DataFrame(wd.groupby(['page', 'name']).size().sort_values(ascending=False)).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,0
page,name,Unnamed: 2_level_1
toc,Zion,1455
search/search_index.json,Ceres,1372
search/search_index.json,Apex,1358
toc,Fortuna,1261
search/search_index.json,Ganymede,1049


#### Data Science

In [12]:
# top total page views
pd.DataFrame(ds.page.value_counts().head()).rename(columns={'page':'count'})

Unnamed: 0,count
1-fundamentals/modern-data-scientist.jpg,1560
1-fundamentals/AI-ML-DL-timeline.jpg,1554
1-fundamentals/1.1-intro-to-data-science,1532
search/search_index.json,1330
6-regression/1-overview,1122


In [13]:
# top page views per cohort
pd.DataFrame(ds.groupby(['page', 'name']).size().sort_values(ascending=False)).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,0
page,name,Unnamed: 2_level_1
classification/overview,Darden,756
1-fundamentals/modern-data-scientist.jpg,Bayes,625
1-fundamentals/AI-ML-DL-timeline.jpg,Bayes,623
1-fundamentals/1.1-intro-to-data-science,Bayes,614
6-regression/1-overview,Curie,594


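> **Summary**:
- WD: 
    - highest traffic pages overall: 
        - 'toc' (12,740 views)
        - 'javascript-i' (12,608 views)
        - 'search/search_index.json' (11,152 views)
    - cohorts accessing 'toc' and 'search/search_index.json' the most:
        - Zion
        - Ceres
        - Apex
        - Fortuna
        - Ganymede
- DS:
    - highest traffic pages overall: 
        - '1-fundamentals/modern-data-scientist.jpg' (1,560 views)
        - '1-fundamentals/AI-ML-DL-timeline.jpg' (1,554 views)
        - '1-fundamentals/1.1-intro-to-data-science' (1,532 views)
    - highest page counts for a single cohort:
        - 'classification/overview' (Darden, 756 views)
        - the three '1-fundamentals' pages above (Bayes, 614 to 625 views each)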
---
<div class="alert alert-block alert-info">2. Is there a cohort that referred to a lesson significantly more than other cohorts, which seemed to gloss over it?</div>

### Web Development

Top Lessons:

- toc
- search/search_index.json

In [14]:
# WD: grouped by cohort, sorted by top page view
pd.DataFrame(wd.groupby('name').page.value_counts().sort_values(ascending=False).head(10))

Unnamed: 0_level_0,Unnamed: 1_level_0,page
name,page,Unnamed: 2_level_1
Zion,toc,1455
Ceres,search/search_index.json,1372
Apex,search/search_index.json,1358
Fortuna,toc,1261
Ganymede,search/search_index.json,1049
Fortuna,search/search_index.json,985
Wrangell,toc,982
Ceres,javascript-i,975
Hyperion,toc,972
Europa,toc,944


In [15]:
# views of 'toc' per cohort, fewest first
pd.DataFrame(wd[wd.page == 'toc'].groupby('name').page.value_counts().sort_values().head(10))

Unnamed: 0_level_0,Unnamed: 1_level_0,page
name,page,Unnamed: 2_level_1
Mammoth,toc,1
Badlands,toc,2
Joshua,toc,3
Hampton,toc,5
Ike,toc,6
Niagara,toc,6
Lassen,toc,10
Pinnacles,toc,11
Quincy,toc,12
Glacier,toc,13


In [16]:
# views of 'search/search_index.json' per cohort, fewest first
pd.DataFrame(wd[wd.page == 'search/search_index.json'].groupby('name').page.value_counts().sort_values().head(10))

Unnamed: 0_level_0,Unnamed: 1_level_0,page
name,page,Unnamed: 2_level_1
Mammoth,search/search_index.json,1
Ike,search/search_index.json,1
Glacier,search/search_index.json,4
Quincy,search/search_index.json,6
Pinnacles,search/search_index.json,6
Kings,search/search_index.json,6
Niagara,search/search_index.json,7
Hampton,search/search_index.json,9
Olympic,search/search_index.json,17
Lassen,search/search_index.json,31


### Data Science

In [17]:
# DS: grouped by cohort, sorted by top page view
pd.DataFrame(ds.groupby('name').page.value_counts().sort_values(ascending=False).head(10))

Unnamed: 0_level_0,Unnamed: 1_level_0,page
name,page,Unnamed: 2_level_1
Darden,classification/overview,756
Bayes,1-fundamentals/modern-data-scientist.jpg,625
Bayes,1-fundamentals/AI-ML-DL-timeline.jpg,623
Bayes,1-fundamentals/1.1-intro-to-data-science,614
Curie,6-regression/1-overview,594
Darden,classification/scale_features_or_not.svg,589
Bayes,search/search_index.json,550
Bayes,6-regression/1-overview,521
Darden,sql/mysql-overview,513
Curie,search/search_index.json,480


In [18]:
# views of 'classification/overview' per cohort, fewest first
pd.DataFrame(ds[ds.page == 'classification/overview'].groupby('name').page.value_counts().sort_values().head(10))

Unnamed: 0_level_0,Unnamed: 1_level_0,page
name,page,Unnamed: 2_level_1
Bayes,classification/overview,10
Curie,classification/overview,90
Darden,classification/overview,756


In [19]:
# views of '1-fundamentals/1.1-intro-to-data-science' per cohort, fewest first
pd.DataFrame(ds[ds.page == '1-fundamentals/1.1-intro-to-data-science'].groupby('name').page.value_counts().sort_values().head(10))

Unnamed: 0_level_0,Unnamed: 1_level_0,page
name,page,Unnamed: 2_level_1
Darden,1-fundamentals/1.1-intro-to-data-science,458
Curie,1-fundamentals/1.1-intro-to-data-science,460
Bayes,1-fundamentals/1.1-intro-to-data-science,614


In [20]:
# views of '6-regression/1-overview' per cohort, fewest first
pd.DataFrame(ds[ds.page == '6-regression/1-overview'].groupby('name').page.value_counts().sort_values().head(10))

Unnamed: 0_level_0,Unnamed: 1_level_0,page
name,page,Unnamed: 2_level_1
Darden,6-regression/1-overview,7
Bayes,6-regression/1-overview,521
Curie,6-regression/1-overview,594


> **Summary**:
- WD: 
    - highest traffic lessons: 
        - 'toc' 
        - 'search/search_index.json'
    - cohorts hardly accessing these lessons:
        - Mammoth
        - Ike
        - Hampton
        - Niagara
        - Glacier
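- DS:
    - highest traffic lessons: 
        - 'classification/overview' 
        - '1-fundamentals/1.1-intro-to-data-science'
    - cohorts hardly accessing a lesson that other cohorts used heavily:
        - 'classification/overview': Bayes (10 views) and Curie (90 views), versus Darden (756 views)
        - '1-fundamentals/1.1-intro-to-data-science': none; all three cohorts viewed it 458 to 614 times
        - '6-regression/1-overview': Darden (7 views), versus Bayes (521 views) and Curie (594 views)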
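
Raw counts favor cohorts that generated more log entries overall, so comparing each page's share of a cohort's total views can make the contrast between cohorts easier to read. The following is a minimal sketch using `pd.crosstab` on invented toy data; it is an alternative view, not the method used in the cells above.

```python
import pandas as pd

# Toy stand-in for `ds`; the rows are invented
toy = pd.DataFrame({
    "name": ["Darden"] * 4 + ["Bayes"] * 4,
    "page": ["classification/overview"] * 3 + ["6-regression/1-overview"]
            + ["6-regression/1-overview"] * 3 + ["classification/overview"],
})

# Rows are cohorts, columns are pages, values are the fraction of
# that cohort's views that went to each page (each row sums to 1)
share = pd.crosstab(toy["name"], toy["page"], normalize="index")
```

A page whose share is large for one cohort and near zero for the others is a candidate answer to this question.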

---
<div class="alert alert-block alert-info">3. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?</div>

In [21]:
# filters df for observations only during the time the student 
# was 'active' for each program
active_access_wd = wd.loc[(wd.index >= wd.start_date) & (wd.index <= wd.end_date)]
active_access_ds = ds.loc[(ds.index >= ds.start_date) & (ds.index <= ds.end_date)]

### Web Development

In [22]:
# groups active web dev users by id and aggregates by page count and sorts them from lowest page count
low_access_wd = active_access_wd.groupby('user_id').size().sort_values().head()
low_access_wd = pd.DataFrame(low_access_wd, columns=['count'])

In [23]:
# accessed pages of active web dev users from the list above
active_obs_wd = pd.DataFrame(active_access_wd[active_access_wd.user_id.isin(low_access_wd.index.tolist())])

In [24]:
# gets a df of the cohort name of each user
cohort_name_wd = pd.DataFrame(active_obs_wd.groupby('user_id').name.value_counts())\
    .rename(columns={'name':'count'}).reset_index(level=['name']).drop(columns=['count'])

In [25]:
# shows how many days the user has been active based on latest access
# active_obs_wd.groupby('user_id').days_active.max()

In [26]:
# shows how many days the user has been active during their program, 
# program length in days, and total pages viewed(count).
# and name of the cohort they belong to
pd.DataFrame(active_obs_wd.groupby('user_id').days_active.max())\
        .merge(active_obs_wd.groupby('user_id').program_length.max(), on='user_id')\
        .merge(low_access_wd, on='user_id')\
        .merge(cohort_name_wd, on='user_id')

Unnamed: 0_level_0,days_active,program_length,count,name
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
64,78,165,14,Europa
278,121,135,3,Voyageurs
388,1,134,7,Andromeda
539,0,165,4,Europa
572,1,162,11,Fortuna


### Data Science

In [27]:
# groups active data sci users by id, counts their page views, and sorts them from lowest page count
low_access_ds = active_access_ds.groupby('user_id').size().sort_values().head()
low_access_ds = pd.DataFrame(low_access_ds, columns=['count'])

In [28]:
# accessed pages of active data sci users from the list above
active_obs_ds = pd.DataFrame(active_access_ds[active_access_ds.user_id.isin(low_access_ds.index.tolist())])

In [29]:
# gets a df of the cohort name of each user
cohort_name_ds = pd.DataFrame(active_obs_ds.groupby('user_id').name.value_counts())\
    .rename(columns={'name':'count'}).reset_index(level=['name']).drop(columns=['count'])

In [30]:
# shows how many days the user has been active based on latest access
# active_obs_ds.groupby('user_id').days_active.max()

In [31]:
# shows how many days the user has been active during their program, 
# program length in days, and total pages viewed(count).
# and name of the cohort they belong to
pd.DataFrame(active_obs_ds.groupby('user_id').days_active.max())\
        .merge(active_obs_ds.groupby('user_id').program_length.max(), on='user_id')\
        .merge(low_access_ds, on='user_id')\
        .merge(cohort_name_ds, on='user_id')

Unnamed: 0_level_0,days_active,program_length,count,name
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
487,11,164,16,Bayes
679,1,183,10,Darden
697,0,183,12,Darden
780,112,183,44,Darden
785,112,183,29,Darden


>**Summary**
- For web dev:
    - Users 388, 539, and 572 may have dropped out within the first day (`days_active` of 0 or 1)
    - User 64 seems to have stopped accessing the curriculum midway (day 78 of 165)
    - User 278 seems to have stopped accessing the curriculum two weeks before graduation (day 121 of 135)
- For data sci:
    - Users 679 and 697 seem to have dropped out early (`days_active` of 1 and 0)
    - User 487 was only active for 11 days
    - Users 780 and 785 may still be active, since Darden is still in session (day 112 of 183 at their latest access)
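
The chain of merges used to build the tables above can also be written as a single `groupby().agg()` with named aggregations. This is a sketch on invented toy data, assuming each user belongs to only one cohort within the active window (it takes the first cohort name it sees); the output column names `page_views` and `cohort` are chosen for illustration.

```python
import pandas as pd

# Toy stand-in for `active_access_wd`; the rows are invented
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "name": ["Europa", "Europa", "Europa", "Europa", "Fortuna", "Fortuna"],
    "days_active": [0, 40, 78, 1, 0, 1],
    "program_length": [165, 165, 165, 165, 162, 162],
    "page": ["toc", "java-i", "spring", "toc", "toc", "html-css"],
})

# One row per user: latest active day, program length, page views, cohort
summary = (toy.groupby("user_id")
              .agg(days_active=("days_active", "max"),
                   program_length=("program_length", "max"),
                   page_views=("page", "size"),
                   cohort=("name", "first"))
              .sort_values("page_views"))

# The users with the fewest page views while active
low_access = summary.head(2)
```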

---
<div class="alert alert-block alert-info">4. Is there any suspicious activity, such as users/machines/etc accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses? Any odd user-agents?</div>

In [32]:
no_id.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 44840 entries, 2018-01-26 16:46:16 to 2020-11-02 16:30:49
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   page       44840 non-null  object 
 1   user_id    44840 non-null  int64  
 2   cohort_id  44840 non-null  float64
 3   ip         44840 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 1.7+ MB


In [33]:
# users with no cohort_id
no_id_users = no_id.groupby('user_id').size().index.to_list()
no_id_users

[48,
 54,
 58,
 59,
 61,
 62,
 63,
 64,
 73,
 74,
 78,
 79,
 86,
 88,
 89,
 100,
 103,
 111,
 137,
 166,
 176,
 213,
 247,
 317,
 346,
 349,
 350,
 351,
 352,
 353,
 354,
 355,
 356,
 357,
 358,
 359,
 360,
 361,
 362,
 363,
 364,
 365,
 366,
 367,
 368,
 369,
 372,
 375,
 403,
 406,
 429,
 544,
 644,
 663,
 713,
 714,
 715,
 716,
 717,
 718,
 719,
 720,
 721,
 722,
 723,
 724,
 725,
 726,
 727,
 728,
 729,
 731,
 736,
 744,
 782]

In [34]:
# users in original df that coincide with user_ids from the unknown cohort df
weird_users = df[df.user_id.isin(no_id_users)].user_id.value_counts().index.to_list()
weird_users

[64, 644, 375, 346, 358, 88, 663]

In [35]:
# users from above respective cohorts
pd.DataFrame(df[df.user_id.isin(weird_users)].groupby(['user_id', 'name']).size())

Unnamed: 0_level_0,Unnamed: 1_level_0,0
user_id,name,Unnamed: 2_level_1
64,Arches,3538
64,Europa,14
88,Glacier,326
88,Ike,5
88,Joshua,9
346,Sequoia,49
346,Zion,1470
358,Bayes,1051
375,Andromeda,1553
644,Ganymede,1622


>**Summary**
- I checked whether the users without a cohort_id shared a user_id with any observations that do have cohort information.
- I found 7 users without a cohort_id whose user_id also appears with a cohort_id.
- Users 64, 88, and 346 each appear under two or three different cohorts (dataframe above).
- All of these cohorts are web dev, except one (Bayes).

---
<div class="alert alert-block alert-info">5. At some point in the last year, ability for students and alumni to cross-access curriculum (web dev to ds, ds to web dev) should have been shut off. Do you see any evidence of that happening? Did it happen before?</div>

In [36]:
# how many pages were cross-accessed in the last year
wd['2020':].groupby('page').size().index.isin(ds['2020':].groupby('page').size().index).sum()

160

In [37]:
# put all unique pages into a list for wd and ds
a = wd['2020':].groupby('page').size().index.to_list()
b = ds['2020':].groupby('page').size().index.to_list()

In [38]:
# use set intersection to keep only the pages that are cross-referenced
cross_access = list(set(a) & set(b))
# check the number of pages to clarify
len(cross_access), type(cross_access)

(160, list)

In [39]:
# all observations (df) from 2020 to the most recent date whose page was viewed by both programs
recent = df.loc['2020':]
cross_obs = recent[recent.page.isin(cross_access)]
# remove staff
cross_obs = cross_obs[cross_obs.name != 'Staff']
cross_obs

Unnamed: 0_level_0,page,user_id,cohort_id,ip,name,start_date,end_date,program_id,days_active,program_length,post_access
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
2020-01-01 15:35:12,search/search_index.json,476,34,136.50.49.145,Bayes,2019-08-19,2020-01-30,3,135,164,-29
2020-01-01 15:36:03,search/search_index.json,476,34,136.50.49.145,Bayes,2019-08-19,2020-01-30,3,135,164,-29
2020-01-01 16:39:53,search/search_index.json,476,34,136.50.49.145,Bayes,2019-08-19,2020-01-30,3,135,164,-29
2020-01-01 21:23:47,4-python/7.2-intro-to-matplotlib,18,22,45.20.117.182,Teddy,2018-01-08,2018-05-17,2,723,129,594
2020-01-01 21:24:04,4-python/7.3-intro-to-numpy,18,22,45.20.117.182,Teddy,2018-01-08,2018-05-17,2,723,129,594
...,...,...,...,...,...,...,...,...,...,...,...
2020-11-02 15:47:40,search/search_index.json,733,61,107.77.220.169,Bash,2020-07-20,2021-01-21,2,105,185,-80
2020-11-02 15:57:25,search/search_index.json,616,55,70.114.9.241,Curie,2020-02-03,2020-07-07,3,273,155,118
2020-11-02 15:57:29,fundamentals/git,616,55,70.114.9.241,Curie,2020-02-03,2020-07-07,3,273,155,118
2020-11-02 16:12:31,search/search_index.json,760,62,72.190.235.36,Jupiter,2020-09-21,2021-03-30,2,42,190,-148


In [40]:
# users cross referencing and count
cross_obs.user_id.value_counts()

685    1795
689    1271
698    1200
581    1197
688    1002
       ... 
451       1
333       1
607       1
439       1
470       1
Name: user_id, Length: 330, dtype: int64

In [41]:
# cohorts
cross_obs.name.value_counts()

Darden        16176
Curie          5357
Apex           1922
Ganymede       1595
Fortuna        1487
Bayes          1442
Hyperion       1025
Europa          892
Bash            662
Deimos          565
Ceres           235
Jupiter         231
Betelgeuse       83
Andromeda        68
Teddy            47
Zion             30
Voyageurs        18
Sequoia          15
Xanadu           15
Yosemite          7
Ulysses           5
Pinnacles         3
Wrangell          3
Olympic           1
Lassen            1
Name: name, dtype: int64
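
A page viewed by both programs is not necessarily cross-access, since shared pages such as 'search/search_index.json' belong to both. One way to look for a shut-off date is to count cross-program views per month and see whether they fall to zero. The sketch below swaps in a different heuristic from the cells above: it assumes a page "belongs" to whichever program accounts for most of its views. The toy rows and that majority rule are assumptions.

```python
import pandas as pd

# Toy stand-in for `df`; the rows are invented
toy = pd.DataFrame(
    {
        "page": ["classification/overview", "classification/overview",
                 "classification/overview", "java-i", "java-i", "java-i",
                 "classification/overview", "java-i", "classification/overview"],
        "program_id": [3, 3, 2, 2, 2, 3, 2, 2, 3],
    },
    index=pd.to_datetime([
        "2019-11-05", "2019-11-06", "2019-11-20", "2019-12-01", "2019-12-02",
        "2019-12-15", "2019-12-20", "2020-01-10", "2020-01-11",
    ]),
)

# True where the viewer is in the data science program (program_id 3)
is_ds = toy["program_id"] == 3

# Assumption: a page belongs to DS if more than half of its views come from DS
ds_share = is_ds.groupby(toy["page"]).mean()
page_is_ds = ds_share > 0.5

# A view is cross-access when the viewer's program differs from the page's program
toy["cross"] = toy["page"].map(page_is_ds) != is_ds

# Count cross-access views per calendar month
monthly = toy["cross"].groupby(toy.index.to_period("M")).sum()
```

A sustained drop to zero in `monthly` would be consistent with access being shut off, while earlier gaps followed by a rebound would suggest it happened before.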

---
<div class="alert alert-block alert-info">6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?</div>

In [42]:
# filters df for observations only after graduation date
post_access_wd = wd[(wd.index > wd.end_date)]
post_access_ds = ds[(ds.index > ds.end_date)]

### Web Development

In [43]:
post_access_wd.page.value_counts().head()

javascript-i    2541
spring          2455
html-css        1948
java-i          1901
java-ii         1837
Name: page, dtype: int64

### Data Science

In [44]:
# skip the top three entries and show the next five pages
post_access_ds.page.value_counts().iloc[3:8]

sql/mysql-overview                          102
1-fundamentals/1.1-intro-to-data-science    102
classification/overview                     100
6-regression/1-overview                      86
10-anomaly-detection/1-overview              69
Name: page, dtype: int64

---
<div class="alert alert-block alert-info">7. Which lessons are least accessed?</div>

### Web Development

In [45]:
wd.page.value_counts().tail(10)

advanced-topics/styling-webpages    1
capstone/54                         1
sql/temporary-tables                1
classification/knn                  1
appendix/cli/2-listing-files        1
4-python/intro-to-sklearn           1
4_Matplotlib_Styles                 1
student/create                      1
nlp/overview                        1
7.04.01_Partitioning                1
Name: page, dtype: int64

### Data Science

In [46]:
ds.page.value_counts().tail(10)

12-distributed-ml/3-getting-started               1
12-distributed-ml/6.3-prepare-part-3              1
Hospital-Distance-Clusters.jpg                    1
appendix/open_data/www.databasefootball.com       1
appendix/open_data/www.flickr.com/services/api    1
stats-assessment                                  1
database-design                                   1
itc%20-%20ml                                      1
b-clustering/project                              1
grades                                            1
Name: page, dtype: int64

---
<div class="alert alert-block alert-info">8. Others?</div>

In [47]:
# users in original df that coincide with user_ids from the unknown cohort df
weird_users

[64, 644, 375, 346, 358, 88, 663]

In [48]:
# non_users: user_ids from the unknown-cohort data that never appear
# with a cohort in the original df (i.e., not in weird_users)
non_users = [n for n in no_id_users if n not in weird_users]
# number of users of each type
len(no_id_users), len(non_users), len(weird_users)

(75, 68, 7)

In [49]:
# filter original df to only include users' activity from the unknown df 
weird_activity = df[df.user_id.isin(weird_users)]

In [50]:
# looking at the users with what cohort they'd coincide with
pd.DataFrame(weird_activity.groupby(['user_id', 'name']).size())

Unnamed: 0_level_0,Unnamed: 1_level_0,0
user_id,name,Unnamed: 2_level_1
64,Arches,3538
64,Europa,14
88,Glacier,326
88,Ike,5
88,Joshua,9
346,Sequoia,49
346,Zion,1470
358,Bayes,1051
375,Andromeda,1553
644,Ganymede,1622


In [51]:
# shows the number of unique ip addresses for each of the weird users (users that have matching cohort information)
weird_activity.groupby('user_id').ip.nunique().sort_values()

user_id
644     1
663     2
358     6
64     10
88     16
375    20
346    21
Name: ip, dtype: int64

In [52]:
# shows the number of unique ip addresses for each of the non-users (users that have no other information)
no_id[no_id.user_id.isin(non_users)].groupby('user_id').ip.nunique().sort_values().tail(15)

user_id
349     9
403     9
86     10
716    10
58     11
79     12
367    12
362    12
354    12
368    14
355    14
353    14
429    16
369    19
111    29
Name: ip, dtype: int64

>**Summary**
- Looking further into the data with no cohort_id, I discovered that many of these users have multiple IP addresses (ranging from 1 to 29). More than 5 IP addresses for one user seems suspicious, so the users with many addresses may be web scraping the curriculum.
- I split these candidates into two groups
    - weird users (7): 
        - users that have matching cohort information within the 'df' dataframe
    - non-users (68): 
        - users that have no other information besides user_id, IP address, page viewed, and the timestamp
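
The "more than 5 IP addresses" rule of thumb above can be applied directly as a filter. This is a minimal sketch on invented toy data; the constant name `IP_THRESHOLD` is hypothetical, and its value of 5 comes from the summary rather than from any established standard.

```python
import pandas as pd

# Toy stand-in for `no_id`; the rows are invented
toy = pd.DataFrame({
    "user_id": [111] * 7 + [644] * 3,
    "ip": [f"10.0.0.{i}" for i in range(7)] + ["10.0.1.1"] * 3,
})

IP_THRESHOLD = 5  # assumption taken from the summary above

# Number of distinct IP addresses seen for each user
ip_counts = toy.groupby("user_id")["ip"].nunique()

# Users whose distinct IP count exceeds the threshold, most addresses first
flagged = ip_counts[ip_counts > IP_THRESHOLD].sort_values(ascending=False)
```

A high IP count alone is weak evidence, so flagged users are better treated as candidates for a closer look at their request timing and pages visited.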