In [1]:
import itertools
import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import matplotlib.dates as mdates

import numpy as np
import pandas as pd
import math
from sklearn import metrics
from random import randint
from matplotlib import style
import seaborn as sns

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler

import acquire as a
import prepare as p
import explore as e

## Anamoly Detection: Curriculum Access Logs

### Executive Summary

The purpose of this notebook is to find anomalous activity in a dataset that monitors the Codeup curriculum access. We will be using pages accessed, user_id, timestamps, and other columns to find anomalous activity

#### Key Findings:
 
##### Lesson Traffic
 - Most Accessed Lessons:
     - MySQL: Tables Lesson: 7922 pings
     - Javascript I: Introduction Working With Data Types Operators and Variables Lesson: 8302 pings
     - Javasvript I: Javascrip-with-html : 8199 pings
 
 - Least Accessed Lessons:
      - anomaly-detection/discrete-probabilistic-methods
     - storytelling/create
     - clustering/kmeans-part1
     - python/series
     - slides/exceptions_and_error_handling

##### Suspicious Activity
 - Supsicious IP Adress: 204.44.112.76
     - Accessed Several pages at machine like speeds according to time stamps corresponding with pages accessed

##### Data Science Cohort Lesson Differences

Looking at the lessons accessed by the data science cohorts from Bayes to Florence it was determined that pre Easley the most accessed lessons were for the regression methodology. Where as Easley and Florence glossed over regression more and spent more time with classifcation.

##### Inactive Students

We were also able to pinpoint students that had the least curriculum activity and were able to determine those students most likely left the program at an early date because of their access pings consisted of one day

### Acquire Curriculum Logs

In [2]:
df = a.acquire_logs() ## <-- using function to acquire curriculum logs

df.head() ## previewing our dataframe

Unnamed: 0,date,time,page,user_id,cohort_id,source_ip
0,2018-01-26,09:55:03,/,1,8.0,97.105.19.61
1,2018-01-26,09:56:02,java-ii,1,8.0,97.105.19.61
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61
4,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61


In [3]:
a.df_summary(df) ## <-- using our function to give a quick data frame summary

The shape of our dataframe:

(1018810, 6)
----------------------------------

Looking at our dataframe column and datatypes:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1018810 entries, 0 to 1018809
Data columns (total 6 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   date       1018810 non-null  object 
 1   time       1018810 non-null  object 
 2   page       1018809 non-null  object 
 3   user_id    1018810 non-null  int64  
 4   cohort_id  965313 non-null   float64
 5   source_ip  1018810 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 46.6+ MB
None
----------------------------------

Value Counts For date Column:

2021-06-15    3357
2021-06-21    3272
2021-03-19    3104
2021-06-18    3026
2021-06-16    2562
              ... 
2018-12-29      32
2018-12-22      30
2018-12-30      21
2019-07-04      16
2018-12-23      10
Name: date, Length: 1267, dtype: int64
-------------------------------

Value Coun

### Prepare Curriculum Logs

In [4]:
df = p.prep_logs(df) ## <-- using function to prepare our curriculum logs

In [5]:
df.head() ## <-- looking at our dataframe

Unnamed: 0,date,time,page,user_id,cohort_id,source_ip
0,2018-01-26,09:55:03,/,1,8.0,97.105.19.61
1,2018-01-26,09:56:02,java-ii,1,8.0,97.105.19.61
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61
4,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61


In [6]:
df.isna().sum() ## <-- making sure our nulls were removed

date         0
time         0
page         0
user_id      0
cohort_id    0
source_ip    0
dtype: int64

### Exploring Curriculum Logs

#### Which Lesson appears to attract the most traffic consistently across cohorts?

In [7]:
## narrowing down dataframe to look at pages with /'s because those are most likely to be 
## lessons within the curriculumn 
df_lesson = df[df.page.str.contains('/')]
df_lesson.head()

Unnamed: 0,date,time,page,user_id,cohort_id,source_ip
0,2018-01-26,09:55:03,/,1,8.0,97.105.19.61
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61
4,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61
5,2018-01-26,09:56:41,javascript-i/loops,2,22.0,97.105.19.61


In [8]:
## grouping by page and doing an overall count of occurences
## per page to figure which lesson has the most overall traffic
page_views = df_lesson.groupby(['page'])['user_id'].agg(['count','nunique'])
observed = page_views.sort_values(by = 'count', ascending = False)
observed.head(15)

Unnamed: 0_level_0,count,nunique
page,Unnamed: 1_level_1,Unnamed: 2_level_1
/,51017,993
search/search_index.json,20323,744
javascript-i/introduction/working-with-data-types-operators-and-variables,8302,659
javascript-i/javascript-with-html,8199,680
mysql/tables,7922,544
javascript-i/functions,7901,680
html-css/elements,7444,676
java-iii/jsp-and-jstl,7320,517
javascript-i/loops,7313,664
java-iii/servlets,7283,526


#### High Traffic Lesson Per Program Takeaways

After narrowing down the dataframe to look at pages with only /'s because those are most likely to be lessons within the curriculumn. We can see that top most accessed lessons per program at Codeup:
 - Data Science
     - MySQL: Tables Lesson: 7922 pings
 - Software Development
     - Javascript I: Introduction Working With Data Types Operators and Variables Lesson: 8302 pings
 - Web Development
     - Javasvript I: Javascrip-with-html : 8199 pings

#### Is there any suspicious activity, such as users/machines/etc accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses?

In [9]:
df.head(3) ## <-- looking at our dataframe

Unnamed: 0,date,time,page,user_id,cohort_id,source_ip
0,2018-01-26,09:55:03,/,1,8.0,97.105.19.61
1,2018-01-26,09:56:02,java-ii,1,8.0,97.105.19.61
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61


In [10]:
## Testing on a single user

user = 1
span = 30
weight = 6
user_df = e.find_anomalies(df, user, span, weight)

anomalies = pd.DataFrame()
user_df = e.find_anomalies(df, user, span, weight)
anomalies = pd.concat([anomalies, user_df], axis=0)

In [11]:
## looping through all the users 

span = 30
weight = 3.5

anomalies = pd.DataFrame()
for u in list(df.user_id.unique()):
    user_df = e.find_anomalies(df, u, span, weight)
    anomalies = pd.concat([anomalies, user_df], axis=0)

In [12]:
## this value counts shows us the number of users on the right that have accessed 
## the value of pages on the left in total

anomalies.pages.value_counts(sort=False) 

1      50
2      38
3      77
4      84
5      56
       ..
179     1
192     1
198     1
272     1
343     1
Name: pages, Length: 103, dtype: int64

Let's Look at the two users that accessed the highest number of pages (272 and 343)

In [13]:
anomalies[anomalies.pages==272] ## finding the specific user id for who accessed 272 pages

Unnamed: 0_level_0,pages,midband,ub,lb,pct_b,user_id
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2019-03-03,272,24.721632,266.780128,-217.336864,1.010782,341


In [14]:
df[df.user_id==341] ## looking at the data frame for user 341

Unnamed: 0,date,time,page,user_id,cohort_id,source_ip
181808,2019-01-22,15:23:24,/,341,29.0,97.105.19.58
181826,2019-01-22,15:25:51,toc,341,29.0,97.105.19.58
181840,2019-01-22,15:26:44,html-css,341,29.0,97.105.19.58
181862,2019-01-22,15:28:33,html-css/introduction,341,29.0,97.105.19.58
181870,2019-01-22,15:29:29,html-css/elements,341,29.0,97.105.19.58
...,...,...,...,...,...,...
817183,2021-02-09,21:02:55,search/search_index.json,341,29.0,172.124.70.146
817184,2021-02-09,21:03:20,appendix/further-reading/spring/seeder,341,29.0,172.124.70.146
817268,2021-02-10,08:31:18,appendix/code-standards/mysql,341,29.0,172.124.70.146
817269,2021-02-10,08:31:29,appendix/further-reading/spring/pagination,341,29.0,172.124.70.146


In [15]:
anomalies[anomalies.pages==343] #3 finding the specific user id for who accessed 343 pages

Unnamed: 0_level_0,pages,midband,ub,lb,pct_b,user_id
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2021-06-21,343,22.370564,322.155049,-277.413921,1.034767,804


In [16]:
df[df.user_id==804] ## looking at the data frame for user 804

Unnamed: 0,date,time,page,user_id,cohort_id,source_ip
719859,2020-11-03,10:15:51,javascript-i,804,132.0,69.91.64.132
720925,2020-11-04,09:00:55,javascript-i,804,132.0,69.91.64.132
721318,2020-11-04,11:40:16,javascript-i/javascript-with-html,804,132.0,69.91.64.132
721321,2020-11-04,11:40:28,javascript-i,804,132.0,69.91.64.132
721324,2020-11-04,11:40:33,javascript-i/introduction/primitive-types,804,132.0,69.91.64.132
...,...,...,...,...,...,...
987643,2021-06-21,14:17:37,appendix/further-reading/pagination,804,132.0,66.69.1.31
987644,2021-06-21,14:17:37,appendix/further-reading/authorization,804,132.0,66.69.1.31
987645,2021-06-21,14:17:58,appendix/further-reading/security-use-cases,804,132.0,66.69.1.31
987646,2021-06-21,14:17:58,appendix/further-reading/spring,804,132.0,66.69.1.31


Now that we have identified two users with irregularly high usage lets build some dataframes and explore this more.

In [17]:
user_341 = df[df.user_id==341] ## making our dataframe for user 341
user_804 = df[df.user_id==804] ## making our dataframe for user 804

##### User 341

In [18]:
user_341.head(3) ## previewing our user dataframe

Unnamed: 0,date,time,page,user_id,cohort_id,source_ip
181808,2019-01-22,15:23:24,/,341,29.0,97.105.19.58
181826,2019-01-22,15:25:51,toc,341,29.0,97.105.19.58
181840,2019-01-22,15:26:44,html-css,341,29.0,97.105.19.58


In [19]:
## makeing a page_views dataframe that counts the number of pages viewed by date 
## for user 341 than creating a sorted observed dataframe for easier reading

page_views = user_341.groupby(['date'])['page'].agg(['count','nunique'])
observed = page_views.sort_values(by = 'count', ascending = False)
observed.head(15)

Unnamed: 0_level_0,count,nunique
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2019-03-03,272,170
2020-04-21,109,13
2020-07-13,52,37
2019-04-12,46,35
2019-03-20,39,32
2019-02-14,35,30
2019-03-11,31,12
2019-03-18,30,18
2019-02-20,26,12
2019-02-06,23,10


High amount of page access on March 3rd, 2019 let's check it out

In [20]:
high_pages = user_341[user_341['date'] == '2019-03-03'] ## Exploring the high access day further
observed = high_pages.sort_values(by = 'time', ascending = False)
high_pages.head(30)

## we sorted by time to see if the user is accessing the curriculumn in a human readable
## amount of time

Unnamed: 0,date,time,page,user_id,cohort_id,source_ip
211312,2019-03-03,21:14:08,/,341,29.0,173.174.243.231
211313,2019-03-03,21:14:14,jquery,341,29.0,173.174.243.231
211314,2019-03-03,21:14:27,jquery/essential-methods/traversing,341,29.0,173.174.243.231
211315,2019-03-03,21:15:25,jquery/effects,341,29.0,173.174.243.231
211340,2019-03-03,22:52:05,html-css,341,29.0,204.44.112.76
211341,2019-03-03,22:52:06,javascript-i,341,29.0,204.44.112.76
211342,2019-03-03,22:52:06,java-i,341,29.0,204.44.112.76
211343,2019-03-03,22:52:06,java-ii,341,29.0,204.44.112.76
211344,2019-03-03,22:52:06,javascript-ii,341,29.0,204.44.112.76
211345,2019-03-03,22:52:06,jquery,341,29.0,204.44.112.76


In [21]:
high_pages.tail(30) ## looking at the end of the df to compare timestamp access

Unnamed: 0,date,time,page,user_id,cohort_id,source_ip
211588,2019-03-03,23:20:32,java-i/methods,341,29.0,173.174.243.231
211589,2019-03-03,23:20:38,java-ii,341,29.0,173.174.243.231
211590,2019-03-03,23:20:49,java-ii/object-oriented-programming,341,29.0,173.174.243.231
211591,2019-03-03,23:20:54,java-ii/arrays,341,29.0,173.174.243.231
211592,2019-03-03,23:20:57,java-ii/inheritance-and-polymorphism,341,29.0,173.174.243.231
211593,2019-03-03,23:21:01,java-ii/interfaces-and-abstract-classes,341,29.0,173.174.243.231
211594,2019-03-03,23:21:05,java-ii/collections,341,29.0,173.174.243.231
211595,2019-03-03,23:21:08,java-ii/annotations,341,29.0,173.174.243.231
211596,2019-03-03,23:21:12,java-ii/exceptions-and-error-handling,341,29.0,173.174.243.231
211597,2019-03-03,23:21:15,java-ii/file-io,341,29.0,173.174.243.231


In [22]:
observed.source_ip.value_counts() ## looking at out suspicious IP adresses

204.44.112.76      180
173.174.243.231     92
Name: source_ip, dtype: int64

##### User 804

In [23]:
user_804.head(3) ## previewing our user dataframe

Unnamed: 0,date,time,page,user_id,cohort_id,source_ip
719859,2020-11-03,10:15:51,javascript-i,804,132.0,69.91.64.132
720925,2020-11-04,09:00:55,javascript-i,804,132.0,69.91.64.132
721318,2020-11-04,11:40:16,javascript-i/javascript-with-html,804,132.0,69.91.64.132


In [24]:
## makeing a page_views dataframe that counts the number of pages viewed by date 
## for user 804 than creating a sorted observed dataframe for easier reading

page_views = user_804.groupby(['date'])['page'].agg(['count','nunique'])
observed = page_views.sort_values(by = 'count', ascending = False)
observed.head(15)

Unnamed: 0_level_0,count,nunique
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2021-06-21,343,334
2021-01-20,74,39
2020-11-23,63,32
2020-11-17,57,37
2020-11-20,50,21
2021-03-26,48,20
2020-11-24,47,27
2020-11-05,45,18
2021-03-01,39,24
2020-11-19,33,19


High amount of access on June 21st, 2021 let's check it out

In [25]:
high_pages = user_804[user_804['date'] == '2021-06-21'] ## Exploring the high access day further
observed = high_pages.sort_values(by = 'time', ascending = False)
high_pages.head(30)

Unnamed: 0,date,time,page,user_id,cohort_id,source_ip
986984,2021-06-21,13:16:13,/,804,132.0,66.69.1.31
986985,2021-06-21,13:16:15,html-css,804,132.0,66.69.1.31
987108,2021-06-21,13:48:49,/,804,132.0,66.69.1.31
987110,2021-06-21,13:49:06,toc,804,132.0,66.69.1.31
987129,2021-06-21,13:51:13,main-pages_xXxXx.html,804,132.0,66.69.1.31
987155,2021-06-21,13:57:16,/,804,132.0,66.69.1.31
987178,2021-06-21,13:58:39,/,804,132.0,66.69.1.31
987179,2021-06-21,13:58:41,.,804,132.0,66.69.1.31
987180,2021-06-21,13:58:42,html-css,804,132.0,66.69.1.31
987181,2021-06-21,13:58:42,javascript-i,804,132.0,66.69.1.31


In [26]:
high_pages.tail(30) ## looking at the end to compare time stamps of access

Unnamed: 0,date,time,page,user_id,cohort_id,source_ip
987602,2021-06-21,14:15:26,further-reading/mysql/host-wildcards,804,132.0,66.69.1.31
987603,2021-06-21,14:15:26,extra-challenges/mysql/mysql-extra-exercises,804,132.0,66.69.1.31
987604,2021-06-21,14:15:28,further-reading/java/intellij-tomcat-configura...,804,132.0,66.69.1.31
987606,2021-06-21,14:15:47,further-reading/spring/pagination,804,132.0,66.69.1.31
987607,2021-06-21,14:15:48,further-reading/spring/authorization,804,132.0,66.69.1.31
987608,2021-06-21,14:15:48,further-reading/spring/security-use-cases,804,132.0,66.69.1.31
987609,2021-06-21,14:15:50,further-reading/spring/seeder,804,132.0,66.69.1.31
987610,2021-06-21,14:16:09,further-reading/spring/devtools-configuration,804,132.0,66.69.1.31
987611,2021-06-21,14:16:10,slides,804,132.0,66.69.1.31
987612,2021-06-21,14:16:10,pair-programming,804,132.0,66.69.1.31


User 804 seems to be accessing the curriculumn at a normal human readable rate (within multiple seconds and minutes between page) this is not deemed suspicious, possibly just a dedicated student that wanted to view the whole curriculumn to see what they are diving into.

#### Supscious Activity Takeaways

Looking at user 341 there is a suspicious IP address (204.44.112.76) that accesses 180 pages. It is suspicious because it is accessing pages at machine level speeds an example would be 15 pages in one second at the timestamp of 22:52:06 on March 3rd, 2019.

This could be evidence of web scraping and the IP address is suspicous because the 204 IP address was not the user's only IP address on March 3rd, the other IP address seemed to accessing pages at a human readable pace, but when the IP address switched to the one beggining with 204 the access speed per page was ramped up to machine like speeds.


#### Is there a cohort that referred to a lesson significantly more than other cohorts seemed to gloss over?

Were going to look at specifically data science cohorts for this question because Carl and I are in the data science program.

Using our domain expertise we know that these cohorts are Ada, Bayes, Curie, Darden, Easley, and Florence

And there corresponding cohort ID's are: 30, 34, 55, 59, 133, 137 we can obtain from the SQL database

In [27]:
df_lesson.head(3) ## <-- looking at lesson dataframe that is filtered for /'s

Unnamed: 0,date,time,page,user_id,cohort_id,source_ip
0,2018-01-26,09:55:03,/,1,8.0,97.105.19.61
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61


In [28]:
ds_df = df_lesson[(df_lesson.cohort_id == 30.0) | (df_lesson.cohort_id == 34.0)
                           | (df_lesson.cohort_id == 55.0) | (df_lesson.cohort_id == 59.0)
                           | (df_lesson.cohort_id == 133.0) | (df_lesson.cohort_id == 137.0)]

ds_df.cohort_id.value_counts()

## unfortunately the data doesn't have the Ada cohort but that is okay. We have plenty 
## of data from the other cohorts we can look at.

59.0     32657
34.0     27096
55.0     22022
137.0    21215
133.0    18308
Name: cohort_id, dtype: int64

In [29]:
## filtering down for more things that don't seem like lessons within the curriculumn 

ds_df = ds_df[~(ds_df['page'].str.contains('appendix|cohorts|examples|caps|github|coding-challenges \
                                                            |advanced-topics|extra|jpeg|ico|csv|project'))]

ds_df.cohort_id.value_counts() ## <-- making sure things were filtered down

59.0     29288
34.0     23443
137.0    19051
55.0     18927
133.0    15819
Name: cohort_id, dtype: int64

##### Bayes Cohort ID 34

In [30]:
bayes_df = ds_df[ds_df.cohort_id == 34.0] ## making bayes dataframe
bayes_df.cohort_id.value_counts() ## <-- quality assurance check

34.0    23443
Name: cohort_id, dtype: int64

In [31]:
## grouping by page and doing an overall count of occurences
## per page to figure which lesson has the most overall traffic in bayes
page_views = bayes_df.groupby(['page'])['user_id'].agg(['count','nunique'])
observed = page_views.sort_values(by = 'count', ascending = False)
observed.head(15)

Unnamed: 0_level_0,count,nunique
page,Unnamed: 1_level_1,Unnamed: 2_level_1
/,2051,23
1-fundamentals/modern-data-scientist.jpg,653,21
1-fundamentals/AI-ML-DL-timeline.jpg,651,21
1-fundamentals/1.1-intro-to-data-science,643,21
search/search_index.json,608,19
6-regression/1-overview,521,21
10-anomaly-detection/1-overview,384,21
6-regression/5.0-evaluate,333,21
5-stats/3-probability-distributions,320,21
5-stats/4.2-compare-means,316,21


In [32]:
observed.describe()

Unnamed: 0,count,nunique
count,398.0,398.0
mean,58.90201,8.007538
std,138.53835,8.003617
min,1.0,1.0
25%,4.0,2.0
50%,12.0,4.0
75%,61.25,19.0
max,2051.0,23.0


We are going to look at lessons clicked more than 12 times to view the low traffic (glossed over lessons) after looking at the numerical statistics for Bayes page view count

In [33]:
observed[observed['count'] > 12].tail(15)

Unnamed: 0_level_0,count,nunique
page,Unnamed: 1_level_1,Unnamed: 2_level_1
clustering/now_what.jpg,15,3
4-python/matplotlib-styles,15,7
4-python/8.4.3-dataframes,15,2
classification/tidy-data,15,2
4-python/handling-duplicate-values,14,9
fundamentals/spreadsheets-overview,14,2
10-anomaly-detection/2-detecting-through-probability,14,10
timeseries/prep,14,4
sql/functions,14,1
clustering/Hospital-Distance-Clusters.jpg,14,5


##### Curie Cohort ID 55

In [34]:
curie_df = ds_df[ds_df.cohort_id == 55.0] ## making bayes dataframe
curie_df.cohort_id.value_counts() ## <-- quality assurance check

55.0    18927
Name: cohort_id, dtype: int64

In [35]:
## grouping by page and doing an overall count of occurences
## per page to figure which lesson has the most overall traffic in curie
page_views = curie_df.groupby(['page'])['user_id'].agg(['count','nunique'])
observed = page_views.sort_values(by = 'count', ascending = False)
observed.head(15)

Unnamed: 0_level_0,count,nunique
page,Unnamed: 1_level_1,Unnamed: 2_level_1
/,1779,21
6-regression/1-overview,595,19
search/search_index.json,584,19
1-fundamentals/modern-data-scientist.jpg,467,19
1-fundamentals/AI-ML-DL-timeline.jpg,465,19
1-fundamentals/1.1-intro-to-data-science,461,19
3-sql/1-mysql-overview,441,19
10-anomaly-detection/1-overview,345,19
4-python/8.4.3-dataframes,260,19
4-python/8.4.4-advanced-dataframes,246,19


In [36]:
observed.describe()

Unnamed: 0,count,nunique
count,319.0,319.0
mean,59.332288,9.275862
std,127.600605,6.631288
min,1.0,1.0
25%,4.0,3.0
50%,20.0,8.0
75%,83.0,17.0
max,1779.0,21.0


We are going to look at lessons clicked more than 20 times to view the low traffic (glossed over lessons) after looking at the numerical statistics for Curie page view count

In [37]:
observed[observed['count'] > 20].tail(15)

Unnamed: 0_level_0,count,nunique
page,Unnamed: 1_level_1,Unnamed: 2_level_1
classification/acquire,25,8
python/advanced-dataframes,24,8
classification/explore,24,9
2-storytelling/2.4-present,24,10
timeseries/explore,23,7
python/introduction-to-python,23,9
stats/compare-means,22,9
timeseries/modeling-lesson1,22,4
stats/probability-distributions,22,7
sql/databases,21,5


##### Darden Cohort ID 59

In [38]:
darden_df = ds_df[ds_df.cohort_id == 59.0] ## making bayes dataframe
darden_df.cohort_id.value_counts() ## <-- quality assurance check

59.0    29288
Name: cohort_id, dtype: int64

In [39]:
## grouping by page and doing an overall count of occurences
## per page to figure which lesson has the most overall traffic in curie
page_views = curie_df.groupby(['page'])['user_id'].agg(['count','nunique'])
observed = page_views.sort_values(by = 'count', ascending = False)
observed.head(15)

Unnamed: 0_level_0,count,nunique
page,Unnamed: 1_level_1,Unnamed: 2_level_1
/,1779,21
6-regression/1-overview,595,19
search/search_index.json,584,19
1-fundamentals/modern-data-scientist.jpg,467,19
1-fundamentals/AI-ML-DL-timeline.jpg,465,19
1-fundamentals/1.1-intro-to-data-science,461,19
3-sql/1-mysql-overview,441,19
10-anomaly-detection/1-overview,345,19
4-python/8.4.3-dataframes,260,19
4-python/8.4.4-advanced-dataframes,246,19


In [40]:
observed.describe()

Unnamed: 0,count,nunique
count,319.0,319.0
mean,59.332288,9.275862
std,127.600605,6.631288
min,1.0,1.0
25%,4.0,3.0
50%,20.0,8.0
75%,83.0,17.0
max,1779.0,21.0


We are going to look at lessons clicked more than 66 times to view the low traffic (glossed over lessons) after looking at the numerical statistics for Bayes page view count

In [41]:
observed[observed['count'] > 66].tail(15)

Unnamed: 0_level_0,count,nunique
page,Unnamed: 1_level_1,Unnamed: 2_level_1
2-storytelling/1-overview,76,19
3-sql/9.2-indexes,74,17
8-clustering/1-overview,73,16
1-fundamentals/2.1-spreadsheets-overview,73,17
4-python/8.4.1-pandas-overview,72,18
8-clustering/Hospital-Distance-Clusters.jpg,72,17
9-timeseries/1-overview,71,17
2-storytelling/2.1-understand,71,19
5-stats/Selecting_a_hypothesis_test.svg,69,16
10-anomaly-detection/3-discrete-probabilistic-methods,68,16


##### Easley Cohort ID 133

In [42]:
easley_df = ds_df[ds_df.cohort_id == 133.0] ## making bayes dataframe
easley_df.cohort_id.value_counts() ## <-- quality assurance check

133.0    15819
Name: cohort_id, dtype: int64

In [43]:
## grouping by page and doing an overall count of occurences
## per page to figure which lesson has the most overall traffic in easley
page_views = easley_df.groupby(['page'])['user_id'].agg(['count','nunique'])
observed = page_views.sort_values(by = 'count', ascending = False)
observed.head(15)

Unnamed: 0_level_0,count,nunique
page,Unnamed: 1_level_1,Unnamed: 2_level_1
/,1543,17
classification/scale_features_or_not.svg,561,17
classification/overview,540,17
fundamentals/AI-ML-DL-timeline.jpg,409,17
fundamentals/modern-data-scientist.jpg,408,17
fundamentals/intro-to-data-science,401,17
search/search_index.json,359,15
sql/mysql-overview,338,15
anomaly-detection/overview,258,16
stats/compare-means,226,16


In [44]:
observed.describe()

Unnamed: 0,count,nunique
count,179.0,179.0
mean,88.374302,12.089385
std,142.732009,6.031085
min,1.0,1.0
25%,9.5,7.0
50%,63.0,16.0
75%,114.5,16.0
max,1543.0,17.0


We are going to look at lessons clicked more than 63 times to view the low traffic (glossed over lessons) after looking at the numerical statistics for Easly page view count

In [45]:
observed[observed['count'] > 63].tail(15)

Unnamed: 0_level_0,count,nunique
page,Unnamed: 1_level_1,Unnamed: 2_level_1
nlp/model,80,16
timeseries/acquire,78,16
python/introduction-to-python,78,16
sql/order-by,78,16
nlp/explore,78,16
storytelling/overview,77,15
storytelling/tableau,76,15
anomaly-detection/detecting-timeseries-anomalies,74,14
timeseries/overview,72,17
sql/indexes,72,16


##### Florence Cohort ID 137

In [46]:
florence_df = ds_df[ds_df.cohort_id == 137.0] ## making bayes dataframe
florence_df.cohort_id.value_counts() ## <-- quality assurance check

137.0    19051
Name: cohort_id, dtype: int64

In [47]:
## grouping by page and doing an overall count of occurences
## per page to figure which lesson has the most overall traffic in florence
page_views = florence_df.groupby(['page'])['user_id'].agg(['count','nunique'])
observed = page_views.sort_values(by = 'count', ascending = False)
observed.head(15)

Unnamed: 0_level_0,count,nunique
page,Unnamed: 1_level_1,Unnamed: 2_level_1
/,1406,22
fundamentals/modern-data-scientist.jpg,758,21
fundamentals/intro-to-data-science,755,21
fundamentals/AI-ML-DL-timeline.jpg,752,21
search/search_index.json,684,20
classification/scale_features_or_not.svg,584,22
classification/overview,549,22
sql/mysql-overview,400,22
python/data-types-and-variables,271,21
classification/evaluation,263,22


In [48]:
observed.describe()

Unnamed: 0,count,nunique
count,187.0,187.0
mean,101.877005,13.176471
std,163.647802,8.700876
min,1.0,1.0
25%,3.5,3.0
50%,66.0,19.0
75%,146.0,21.0
max,1406.0,22.0


We are going to look at lessons clicked more than 66 times to view the low traffic (glossed over lessons) after looking at the numerical statistics for Florence page view count

In [49]:
observed[observed['count'] > 66].tail(15)

Unnamed: 0_level_0,count,nunique
page,Unnamed: 1_level_1,Unnamed: 2_level_1
timeseries/working-with-time-series-data,87,19
timeseries/acquire,86,20
sql/more-exercises,85,17
clustering/Hospital-Distance-Clusters.jpg,83,20
timeseries/modeling-lesson1,82,19
clustering/overview,82,20
sql/relationships-overview,80,19
fundamentals/cli/overview,77,21
storytelling/tableau,71,20
fundamentals/cli/intro,71,20


#### Cohort Takeaways
 - Bayes Most Common Viewed Lessons:
     - 1-fundamentals/modern-data-scientist.jpg
     - 1-fundamentals/1.1-intro-to-data-science
 - Curie Most Common Viewed Lessons: 
     - 6-regression/1-overview
     - 1-fundamentals/modern-data-scientist.jpg
 - Darden Most Common Viewed Lessons: 
     - 6-regression/1-overview
     - 1-fundamentals/modern-data-scientist.jpg	
 - Easley Most Common Viewed Lessons:
     - classification/scale_features_or_not.svg
     - classification/overview
 - Florence Most Common Viewed Lessons:
     - fundamentals/modern-data-scientist.jpg
     - classification/scale_features_or_not.svg
     

#### Significance in Data Science Cohort Lesson Activity Difference

Based on the filtering for lessons that was done and dividing each cohort into their own respective dataframe.

We can see that the earlier data science cohorts (Bayes, Curie, & Darden) spent a lot of time looking at the regresison lessons where as the most recent cohorts (Easley & Florence) spent a lot more time looking at the classification lessons specifically scaling.

This could be because the curriculumn is constantly adjusting and maybe the regression section for modeling used to fall before the classification program. Seeing we are Florence students we know that the classification project was our first project in the program so we all wanted to get it right by accessing all parts of that Methodology. 

The Earlier cohorts may have had regression as their first project given these understandings.

#### Which Lessons are Least Accessed

In [50]:
## making an observed dataframe by grouping by page and user_id with a count 
## aggregate to look at the least active lessons

page_views = df.groupby(['page'])['user_id'].agg(['count','nunique'])
observed = page_views.sort_values(by = 'count', ascending = True)
observed.head(5)

Unnamed: 0_level_0,count,nunique
page,Unnamed: 1_level_1,Unnamed: 2_level_1
%20https://github.com/RaulCPena,1,1
content/appendix/javascript/functions/scope.html,1,1
content/appendix/javascript/functions/templating.html,1,1
content/appendix/javascript/javascript/functions/scope.html,1,1
content/conditionals.html,1,1


In [51]:
observed = observed.reset_index() ## <-- resetting the index

In [52]:
## filtering down using keywords to eliminate things that don't appear to be lessons

observed = observed[~(observed['page'].str.contains('appendix|cohorts|examples|caps|github|coding-challenges \
                                                            |advanced-topics|extra|jpeg|ico|csv|project'))]

In [53]:
## filtering to look for /'s because we are looking for lessons within methodologies
## which means they will most likely have a dash

observed = observed[observed['page'].str.contains('/')]

observed.shape ## <-- looking at our shape

(1186, 3)

Let's get rid of the noise by calculating the upper and lower bounds to locate outliers

In [54]:
e.outlier_calculation(observed, 'count')

For count the lower bound is -419.625 and  upper bound is 707.375


Unnamed: 0,page,count,nunique
4,content/conditionals.html,1,1
5,content/control-structures-ii,1,1
38,classification/knn.md,1,1
42,cli/4-navigating-the-filesystem,1,1
46,coding-challenges/intermediate,1,1
...,...,...,...
2109,regression/feature-engineering,667,76
2110,sql/basic-statements,669,96
2111,python/intro-to-matplotlib,682,86
2112,clustering/wrangle,688,80


In [55]:
observed['count'].eq(1).sum()

209

#### Least accessed lessons

In [56]:
# lower bound absolute 418 - 200 (number of count = 1) = 218
observed.loc[observed['count'] > 218].head(5)

Unnamed: 0,page,count,nunique
1875,10-anomaly-detection/3-discrete-probabilistic-...,219,41
1877,2-storytelling/2.2-create,219,50
1879,8-clustering/5.1-kmeans-part-1,221,31
1881,4-python/7.4.2-series,222,24
1882,slides/exceptions_and_error_handling,222,95


After calculating a lower and upper bound for the dataset eliminating some of the pages that were accessed lower than the lower bound those are most likely not lessons.

Therefore we can conclude the top 5 least accessed lesson are:
 - anomaly-detection/discrete-probabilistic-methods
 - storytelling/create
 - clustering/kmeans-part1
 - python/series
 - slides/exceptions_and_error_handling

#### Are there students who, when active, hardly accessed the curriculum? If so, what information do you have about these students?¶


To answer this question we are going to use the SQL database because it has more information

In [57]:
df = a.acquire() ## <-- getting our SQL dataframe with our function

In [58]:
df.head() ## <-- previewing our dataframe sample

Unnamed: 0,date,endpoint,user_id,cohort_id,source_ip,name,start_date,end_date,created_at,updated_at,program_id
0,2018-01-26,/,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,1.0
1,2018-01-26,java-ii,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,1.0
2,2018-01-26,java-ii/object-oriented-programming,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,1.0
3,2018-01-26,slides/object_oriented_programming,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,1.0
4,2018-01-26,javascript-i/conditionals,2,22.0,97.105.19.61,Teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,2.0


In [59]:
## converting time columns to datetime

df.date = pd.to_datetime(df.date)
df.start_date = pd.to_datetime(df.start_date)
df.end_date = pd.to_datetime(df.end_date)

df.info() ## <-- quality assurance check

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1018810 entries, 0 to 1018809
Data columns (total 11 columns):
 #   Column      Non-Null Count    Dtype         
---  ------      --------------    -----         
 0   date        1018810 non-null  datetime64[ns]
 1   endpoint    1018809 non-null  object        
 2   user_id     1018810 non-null  int64         
 3   cohort_id   965313 non-null   float64       
 4   source_ip   1018810 non-null  object        
 5   name        954313 non-null   object        
 6   start_date  954313 non-null   datetime64[ns]
 7   end_date    954313 non-null   datetime64[ns]
 8   created_at  954313 non-null   object        
 9   updated_at  954313 non-null   object        
 10  program_id  954313 non-null   float64       
dtypes: datetime64[ns](3), float64(2), int64(1), object(5)
memory usage: 93.3+ MB


In [60]:
# Create a column to determine if the student was active or not when they accessed the curriculum
df['is_active'] = (df.date > df.start_date) & (df.date < df.end_date)

In [62]:
# Separate out the students who were active when they accessed the curriculum
df2 = df[(df['is_active'])]

df2.is_active.value_counts() ## <-- quality assurance check

True    714407
Name: is_active, dtype: int64

In [63]:
df2 = df2.sort_values('user_id') ## <-- sorting our active user dataframe by users

In [71]:
# What are the student_ids who hardly accessed the curriculum while being an active student?
df2['user_id'].value_counts().tail(5)

956    5
278    4
832    3
679    3
879    1
Name: user_id, dtype: int64

#### These students hardly accessed the curriculum.¶


In [65]:
df2.loc[df2['user_id'] == 956] ## looking at user 956

Unnamed: 0,date,endpoint,user_id,cohort_id,source_ip,name,start_date,end_date,created_at,updated_at,program_id,is_active
891694,2021-04-15,toc,956,139.0,162.200.114.251,Oberon,2021-04-12,2021-10-01,2021-04-12 18:07:21,2021-04-12 18:07:21,2.0,True
891690,2021-04-15,/,956,139.0,162.200.114.251,Oberon,2021-04-12,2021-10-01,2021-04-12 18:07:21,2021-04-12 18:07:21,2.0,True
891724,2021-04-15,javascript-i/introduction/primitive-types,956,139.0,162.200.114.251,Oberon,2021-04-12,2021-10-01,2021-04-12 18:07:21,2021-04-12 18:07:21,2.0,True
891897,2021-04-15,javascript-i/introduction/operators,956,139.0,162.200.114.251,Oberon,2021-04-12,2021-10-01,2021-04-12 18:07:21,2021-04-12 18:07:21,2.0,True
891710,2021-04-15,javascript-i/introduction/operators,956,139.0,162.200.114.251,Oberon,2021-04-12,2021-10-01,2021-04-12 18:07:21,2021-04-12 18:07:21,2.0,True


In [66]:
df2.loc[df2['user_id'] == 278] ## looking at user 278

Unnamed: 0,date,endpoint,user_id,cohort_id,source_ip,name,start_date,end_date,created_at,updated_at,program_id,is_active
131804,2018-09-27,java-ii/collections,278,24.0,107.77.217.9,Voyageurs,2018-05-29,2018-10-11,2018-05-25 22:25:57,2018-05-25 22:25:57,2.0,True
131788,2018-09-27,java-ii/arrays,278,24.0,107.77.217.9,Voyageurs,2018-05-29,2018-10-11,2018-05-25 22:25:57,2018-05-25 22:25:57,2.0,True
131802,2018-09-27,java-ii/arrays,278,24.0,107.77.217.9,Voyageurs,2018-05-29,2018-10-11,2018-05-25 22:25:57,2018-05-25 22:25:57,2.0,True
131699,2018-09-27,/,278,24.0,97.105.19.58,Voyageurs,2018-05-29,2018-10-11,2018-05-25 22:25:57,2018-05-25 22:25:57,2.0,True


In [67]:
df2.loc[df2['user_id'] == 832] ## looking at user 832

Unnamed: 0,date,endpoint,user_id,cohort_id,source_ip,name,start_date,end_date,created_at,updated_at,program_id,is_active
754195,2020-12-07,/,832,62.0,69.154.52.98,Jupiter,2020-09-21,2021-03-30,2020-09-21 18:06:27,2020-09-21 18:06:27,2.0,True
754204,2020-12-07,javascript-i,832,62.0,69.154.52.98,Jupiter,2020-09-21,2021-03-30,2020-09-21 18:06:27,2020-09-21 18:06:27,2.0,True
754206,2020-12-07,html-css,832,62.0,69.154.52.98,Jupiter,2020-09-21,2021-03-30,2020-09-21 18:06:27,2020-09-21 18:06:27,2.0,True


In [68]:
df2.loc[df2['user_id'] == 679] ## looking at user 679

Unnamed: 0,date,endpoint,user_id,cohort_id,source_ip,name,start_date,end_date,created_at,updated_at,program_id,is_active
597687,2020-07-14,1-fundamentals/modern-data-scientist.jpg,679,59.0,24.28.146.155,Darden,2020-07-13,2021-01-12,2020-07-13 18:32:19,2020-07-13 18:32:19,3.0,True
597685,2020-07-14,1-fundamentals/1.1-intro-to-data-science,679,59.0,24.28.146.155,Darden,2020-07-13,2021-01-12,2020-07-13 18:32:19,2020-07-13 18:32:19,3.0,True
597686,2020-07-14,1-fundamentals/AI-ML-DL-timeline.jpg,679,59.0,24.28.146.155,Darden,2020-07-13,2021-01-12,2020-07-13 18:32:19,2020-07-13 18:32:19,3.0,True


In [69]:
df2.loc[df2['user_id'] == 879] ## looking at user 879

Unnamed: 0,date,endpoint,user_id,cohort_id,source_ip,name,start_date,end_date,created_at,updated_at,program_id,is_active
799236,2021-01-26,/,879,135.0,136.50.50.187,Marco,2021-01-25,2021-07-19,2021-01-20 21:31:11,2021-01-20 21:31:11,2.0,True


#### Least Active Students Takeaways:

Given these students only had curriculum access logs for one day it is concluded that these students had to exit the program or did not complete it.

### Final Conclusions

Overall after looking at the curriculum logs we have answered several questions pertaining to anomaly detection.

We found a suspicious user that had a suspicious IP address that had clear evidence of web-scraping with machine like access rates of ds.codeup content. 

We determined that the previous data science cohorts may have had a different curriculum order because the earlier cohorts viewed several regression project lessons the most while the two most previous cohorts looked at classification the most because that was there first project. Leading us to believe regression used to be the first project for older cohorts.

We were able to pinpoint students that accessed the curriculum the least and that look like may have quit the program early.

We were also able to determine which lessons were accessed the most throughout all users and least througout all users.