# Anomaly detection project

Hello,


I have some questions for you that I need to be answered before the board meeting Thursday afternoon. I need to be able to speak to the following questions. I also need a single slide that I can incorporate into my existing presentation (Google Slides) that summarizes the most important points. My questions are listed below; however, if you discover anything else important that I didn’t think to ask, please include that as well.


1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?
2. Is there a cohort that referred to a lesson significantly more than other cohorts seemed to gloss over?
3. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?
4. Is there any suspicious activity, such as users/machines/etc accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses?
5. At some point in 2019, the ability for students and alumni to access both curriculums (web dev to ds, ds to web dev) should have been shut off. Do you see any evidence of that happening? Did it happen before?
6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?
7. Which lessons are least accessed?
8. Anything else I should be aware of?


Thank you

In [1]:
#Import dependencies
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from env import host, username, password

In [2]:
#Define a function that creates the database url
def url_creator(host, username, password, db_name):
    return f'mysql+pymysql://{username}:{password}@{host}/{db_name}'

In [3]:
#Create the url
url = url_creator(host, username, password, 'curriculum_logs')

In [4]:
#Define the SQL query
query = '''
        SELECT date,
        time, path as endpoint,
        user_id,
        cohort_id,
        ip, name,
        slack, start_date,
        end_date, created_at,
        updated_at, deleted_at,
        program_id
        FROM logs
        LEFT JOIN cohorts ON logs.cohort_id = cohorts.id
        '''

In [5]:
#Read in the dataframe
df = pd.read_sql(query, url)
df.head()

Unnamed: 0,date,time,endpoint,user_id,cohort_id,ip,name,slack,start_date,end_date,created_at,updated_at,deleted_at,program_id
0,2018-01-26,09:55:03,/,1,8.0,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,1.0
1,2018-01-26,09:56:02,java-ii,1,8.0,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,1.0
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,1.0
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,1.0
4,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61,Teddy,#teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,,2.0


In [6]:
#Check the shape
df.shape

(900223, 14)

In [7]:
#Look at the info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900223 entries, 0 to 900222
Data columns (total 14 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   date        900223 non-null  object 
 1   time        900223 non-null  object 
 2   endpoint    900222 non-null  object 
 3   user_id     900223 non-null  int64  
 4   cohort_id   847330 non-null  float64
 5   ip          900223 non-null  object 
 6   name        847330 non-null  object 
 7   slack       847330 non-null  object 
 8   start_date  847330 non-null  object 
 9   end_date    847330 non-null  object 
 10  created_at  847330 non-null  object 
 11  updated_at  847330 non-null  object 
 12  deleted_at  0 non-null       object 
 13  program_id  847330 non-null  float64
dtypes: float64(2), int64(1), object(11)
memory usage: 96.2+ MB


I have some nulls in my dataframe that I would like to investigate before I begin answering the questions. Many columns have the same non-null value counts. Are these columns all missing information for the same group of observations?

In [8]:
#Drop the column with all null values
df.drop(columns=['deleted_at'], inplace=True)

In [9]:
#Rows with null cohort id
df[df['cohort_id'].isna()]

Unnamed: 0,date,time,endpoint,user_id,cohort_id,ip,name,slack,start_date,end_date,created_at,updated_at,program_id
411,2018-01-26,16:46:16,/,48,,97.105.19.61,,,,,,,
412,2018-01-26,16:46:24,spring/extra-features/form-validation,48,,97.105.19.61,,,,,,,
425,2018-01-26,17:54:24,/,48,,97.105.19.61,,,,,,,
435,2018-01-26,18:32:03,/,48,,97.105.19.61,,,,,,,
436,2018-01-26,18:32:17,mysql/relationships/joins,48,,97.105.19.61,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
899897,2021-04-21,12:49:00,javascript-ii,717,,136.50.102.126,,,,,,,
899898,2021-04-21,12:49:02,javascript-ii/es6,717,,136.50.102.126,,,,,,,
899899,2021-04-21,12:51:27,javascript-ii/map-filter-reduce,717,,136.50.102.126,,,,,,,
899900,2021-04-21,12:52:37,javascript-ii/promises,717,,136.50.102.126,,,,,,,


In [10]:
#Are the nulls from all columns present in these observations?
df[df['cohort_id'].isna()].isna().sum()

date              0
time              0
endpoint          0
user_id           0
cohort_id     52893
ip                0
name          52893
slack         52893
start_date    52893
end_date      52893
created_at    52893
updated_at    52893
program_id    52893
dtype: int64

In [11]:
#Check out the one endpoint null
df[df.endpoint.isna()]

Unnamed: 0,date,time,endpoint,user_id,cohort_id,ip,name,slack,start_date,end_date,created_at,updated_at,program_id
506305,2020-04-08,09:25:18,,586,55.0,72.177.240.51,Curie,#curie,2020-02-03,2020-07-07,2020-02-03 19:31:51,2020-02-03 19:31:51,3.0


In [12]:
#Create a module column with the first portion of the endpoint
df['module'] = df['endpoint'].str.split('/').str[0]

In [13]:
#Create a lesson column with the last portion of the endpoint
df['lesson'] = df['endpoint'].str.split('/').str[-1]

In [14]:
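#Flag endpoints whose last segment differs from the module, i.e. pages below a module's landing page
df['is_lesson'] = (df['module'] != df['lesson'])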
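
To make the three derived columns concrete, here is a small sketch on a handful of endpoints taken from the log sample above. It only illustrates how the split behaves; the `toy` frame is a stand-in for `df`.

```python
import pandas as pd

# Toy stand-in for df, using endpoints that appear in the log sample above
toy = pd.DataFrame({'endpoint': ['/', 'java-ii',
                                 'java-ii/object-oriented-programming',
                                 'jquery/ajax/weather-map']})

parts = toy['endpoint'].str.split('/')
toy['module'] = parts.str[0]    # first path segment
toy['lesson'] = parts.str[-1]   # last path segment
toy['is_lesson'] = toy['module'] != toy['lesson']

print(toy)
```

The root page `/` splits into two empty strings, so its module and lesson are both blank. That is where the unlabeled first row in the `value_counts()` outputs comes from.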

In [15]:
df

Unnamed: 0,date,time,endpoint,user_id,cohort_id,ip,name,slack,start_date,end_date,created_at,updated_at,program_id,module,lesson,is_lesson
0,2018-01-26,09:55:03,/,1,8.0,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,1.0,,,False
1,2018-01-26,09:56:02,java-ii,1,8.0,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,1.0,java-ii,java-ii,False
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,1.0,java-ii,object-oriented-programming,True
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,1.0,slides,object_oriented_programming,True
4,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61,Teddy,#teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,2.0,javascript-i,conditionals,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
900218,2021-04-21,16:41:51,jquery/personal-site,64,28.0,71.150.217.33,Staff,#,2014-02-04,2014-02-04,2018-12-06 17:04:19,2018-12-06 17:04:19,2.0,jquery,personal-site,True
900219,2021-04-21,16:42:02,jquery/mapbox-api,64,28.0,71.150.217.33,Staff,#,2014-02-04,2014-02-04,2018-12-06 17:04:19,2018-12-06 17:04:19,2.0,jquery,mapbox-api,True
900220,2021-04-21,16:42:09,jquery/ajax/weather-map,64,28.0,71.150.217.33,Staff,#,2014-02-04,2014-02-04,2018-12-06 17:04:19,2018-12-06 17:04:19,2.0,jquery,weather-map,True
900221,2021-04-21,16:44:37,anomaly-detection/discrete-probabilistic-methods,744,28.0,24.160.137.86,Staff,#,2014-02-04,2014-02-04,2018-12-06 17:04:19,2018-12-06 17:04:19,2.0,anomaly-detection,discrete-probabilistic-methods,True


## Question 1

Which lesson appears to attract the most traffic consistently across cohorts (per program)?

In [16]:
df['program_id'].value_counts()

2.0    713365
3.0    103412
1.0     30548
4.0         5
Name: program_id, dtype: int64

Four unique program IDs. I need to map real programs to the appropriate ID using alumni.codeup.com. I will accomplish this by looking up the cohorts associated with each ID and mapping the program name to all cohorts with the same program ID.

In [17]:
for i in range(1, 5):
    print()
    print('--------------------------------')
    print()
    print('Program ID {} corresponds to these cohorts.'.format(i))
    print()
    print(df[df['program_id'] == i]['name'].unique())


--------------------------------

Program ID 1 corresponds to these cohorts.

['Hampton' 'Arches' 'Quincy' 'Kings' 'Lassen' 'Glacier' 'Denali' 'Joshua'
 'Olympic' 'Badlands' 'Ike' 'Franklin' 'Everglades']

--------------------------------

Program ID 2 corresponds to these cohorts.

['Teddy' 'Sequoia' 'Niagara' 'Pinnacles' 'Mammoth' 'Ulysses' 'Voyageurs'
 'Wrangell' 'Xanadu' 'Yosemite' 'Staff' 'Zion' 'Andromeda' 'Betelgeuse'
 'Ceres' 'Deimos' 'Europa' 'Fortuna' 'Apex' 'Ganymede' 'Hyperion' 'Bash'
 'Jupiter' 'Kalypso' 'Luna' 'Marco' 'Neptune' 'Oberon']

--------------------------------

Program ID 3 corresponds to these cohorts.

['Bayes' 'Curie' 'Darden' 'Easley' 'Florence']

--------------------------------

Program ID 4 corresponds to these cohorts.

['Apollo']


Programs with ID 1 map to "Full Stack PHP Program." These programs represent early cohorts at Codeup that learned web development. Cohorts belonging to this program ID will be labeled "PHP".

Programs with ID 2 map to "Full Stack Java Program." These programs also train web developers, but are more recent cohorts in Codeup's history. These cohorts will be labeled "Java" moving forward.

Programs with ID 3 map to "Data Science Program." These cohorts will be labeled "DS".

Programs with ID 4 map to "Front-End Program." I will label this lone cohort "Front".

In [18]:
program_dict = {1.0: 'PHP', 2.0: 'Java', 3.0: 'DS', 4.0: 'Front'}

df['program_type'] = df['program_id'].map(program_dict)

In [19]:
#The program_id column is now redundant, so drop it
df.drop(columns=['program_id'], inplace=True)

In [20]:
program_list = ['PHP', 'Java', 'DS', 'Front']

for program in program_list:
    print()
    print('--------------------------------')
    print()
    print('Most visited pages by members of the {} program.'.format(program))
    print()
    print(df[df['program_type'] == program]['lesson'].value_counts().head(10))


--------------------------------

Most visited pages by members of the PHP program.

                1681
favicon.ico     1239
index.html      1020
html-css         814
javascript-i     736
introduction     562
spring           501
java-iii         479
appendix         469
java-ii          454
Name: lesson, dtype: int64

--------------------------------

Most visited pages by members of the Java program.

                     35814
introduction         20114
javascript-i         17457
toc                  17428
search_index.json    16976
html-css             12913
java-iii             12684
java-ii              11719
spring               11385
jquery               10710
Name: lesson, dtype: int64

--------------------------------

Most visited pages by members of the DS program.

                             8358
overview                     4267
1-overview                   3519
modern-data-scientist.jpg    3152
AI-ML-DL-timeline.jpg        3152
project                      3140
sear

The PHP and Java programs have many topics in common: java, javascript, spring, and html. The data science students often refer to classification and fundamentals material. Members of the Front program account for only 5 hits in total.

I don't think this answers the question in its entirety. The question asks about consistent traffic across cohorts. I will need to group by cohort within each program to answer the question completely.
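
One way to do this grouping without explicit loops, as a rough sketch: count hits per cohort and lesson, keep each cohort's top lessons, then count how many cohorts each lesson shows up for. The function name `lessons_in_most_top_fives` and the toy data are my own illustration. Unlike the functions used later in this notebook, it does not skip cohorts with fewer than five lessons.

```python
import pandas as pd

def lessons_in_most_top_fives(frame, n=5):
    """Count, for each lesson, how many cohorts have it among their n most-visited lessons."""
    # Hits per (cohort, lesson) pair
    hits = (frame.groupby(['name', 'lesson'])
                 .size()
                 .reset_index(name='hits'))
    # Keep the n most-visited lessons within each cohort
    top = (hits.sort_values(['name', 'hits'], ascending=[True, False])
               .groupby('name')
               .head(n))
    return top['lesson'].value_counts()

# Toy data: three cohorts, columns named as in df
toy = pd.DataFrame({
    'name':   ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'lesson': ['arrays', 'arrays', 'loops', 'arrays', 'arrays', 'forms', 'forms', 'forms'],
})
result = lessons_in_most_top_fives(toy, n=1)
print(result)
```

With `n=1`, `arrays` is the top lesson for two of the three toy cohorts and `forms` for one.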

In [21]:
#Create separate dataframes for each program
df_php = df[df['program_type'] == 'PHP']
df_java = df[df['program_type'] == 'Java']
df_ds = df[df['program_type'] == 'DS']

In [22]:
df_php[df_php['name'] == 'Quincy'].module.value_counts().iloc[:5]

content    650
           151
mysql       97
java-i      50
slides      49
Name: module, dtype: int64

In [23]:
def most_visited_modules(df):
    
    mv = pd.DataFrame()
    
    for cohort in df['name'].unique():
        
        #Five most visited modules for this cohort
        top = df[df['name'] == cohort].module.value_counts().iloc[:5].index
        
        #Skip cohorts with fewer than five distinct modules
        if len(top) < 5:
            continue
        
        mv[cohort] = top
        
    return mv

In [24]:
def most_popular_modules(df):
    
    mv = most_visited_modules(df)
    
    top_five = mv.melt(var_name='columns', value_name='index')['index'].value_counts().head(5)
    
    return top_five

In [25]:
most_popular_modules(df_php)

content         8
                7
javascript-i    7
html-css        6
java-iii        4
Name: index, dtype: int64

In [26]:
most_popular_modules(df_java)

javascript-i    27
html-css        25
mysql           24
jquery          14
spring          13
Name: index, dtype: int64

In [27]:
most_popular_modules(df_ds)

                  4
1-fundamentals    3
classification    3
sql               3
python            3
Name: index, dtype: int64

For PHP cohorts, the content, javascript-i, and html-css modules are the most popular.

For Java cohorts, the javascript-i, html-css, and mysql modules are the most popular.

For DS cohorts, the fundamentals, classification, sql, and python modules are the most popular.

In [28]:
df_php[df_php['name'] == 'Quincy'].loc[df_php.is_lesson].lesson.value_counts().head(10)

favicon.ico                   92
intro                         38
html-css                      33
environment.html              31
application-structure.html    20
intro-to-mysql                19
php_iii                       19
introduction                  17
conditionals                  16
collections                   15
Name: lesson, dtype: int64

In [29]:
def most_visited_lessons(df):
    
    mv = pd.DataFrame()
    
    for cohort in df['name'].unique():
        
        #Five most visited lessons for this cohort, ignoring module landing pages
        cohort_df = df[(df['name'] == cohort) & df['is_lesson']]
        top = cohort_df.lesson.value_counts().iloc[:5].index
        
        #Skip cohorts with fewer than five distinct lessons
        if len(top) < 5:
            continue
        
        mv[cohort] = top
        
    return mv

In [30]:
def most_popular_lessons(df):
    
    mv = most_visited_lessons(df)
    
    top_five = mv.melt(var_name='columns', value_name='index')['index'].value_counts().head(5)
    
    return top_five

In [31]:
most_popular_lessons(df_php)

favicon.ico          7
html-css             5
introduction         5
intro                2
introduction.html    2
Name: index, dtype: int64

In [32]:
most_popular_lessons(df_java)

introduction         26
functions            23
search_index.json    22
arrays               19
servlets              4
Name: index, dtype: int64

In [33]:
most_popular_lessons(df_ds)

project                      4
AI-ML-DL-timeline.jpg        4
modern-data-scientist.jpg    3
1-overview                   2
overview                     2
Name: index, dtype: int64

For PHP, the most popular lessons across all cohorts are html-css and introductions.

For Java, the most popular lessons across all cohorts are introductions, functions, and arrays.

For DS, the most popular entries across all cohorts are the project pages and two images, AI-ML-DL-timeline.jpg and modern-data-scientist.jpg.
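
Entries such as `favicon.ico`, `search_index.json`, and the `.jpg` files look like assets that the browser requests alongside a page, not lessons. A possible refinement, sketched here with an extension list that I picked by hand from the outputs above, is to drop them before ranking. The helper name `drop_assets` is hypothetical.

```python
import pandas as pd

# Extensions seen in the outputs above that look like assets, not lessons (hand-picked assumption)
ASSET_PATTERN = r'\.(?:ico|json|jpg|jpeg|svg|png)$'

def drop_assets(frame):
    """Return only the rows whose lesson does not end in a known asset extension."""
    is_asset = frame['lesson'].str.contains(ASSET_PATTERN, case=False, regex=True, na=False)
    return frame[~is_asset]

toy = pd.DataFrame({'lesson': ['favicon.ico', 'arrays', 'search_index.json',
                               'AI-ML-DL-timeline.jpg', 'introduction.html']})
kept = drop_assets(toy)['lesson'].tolist()
print(kept)
```

Pages ending in `.html` are kept on purpose, since they appear to be real pages in the older curriculum.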

## Question 2

Is there a cohort that referred to a lesson significantly more than other cohorts seemed to gloss over?

In [49]:
df_php[df_php.is_lesson].lesson.value_counts().head(10)

favicon.ico                                        1239
introduction                                        560
arrays                                              346
functions                                           322
html-css                                            272
servlets                                            272
mvc                                                 259
jdbc                                                243
intro                                               241
working-with-data-types-operators-and-variables     232
Name: lesson, dtype: int64

In [52]:
php_mp_list = ['arrays', 'functions', 'html-css', 'servlets', 'mvc', 'jdbc',
               'working-with-data-types-operators-and-variables']

for pop in php_mp_list:
    
    print()
    print('--------------------------------')
    print()
    print('How many times each cohort accessed the {} lesson.'.format(pop))
    print()
    print(df_php[df_php['lesson'] == pop].name.value_counts())


--------------------------------

How many times each cohort accessed the arrays lesson.

Arches     143
Lassen     118
Olympic     59
Hampton      9
Quincy       8
Kings        7
Joshua       1
Glacier      1
Name: name, dtype: int64

--------------------------------

How many times each cohort accessed the functions lesson.

Lassen      109
Arches      107
Olympic      54
Hampton      20
Glacier      20
Quincy        5
Kings         3
Ike           3
Franklin      1
Name: name, dtype: int64

--------------------------------

How many times each cohort accessed the html-css lesson.

Lassen      237
Arches      227
Olympic     129
Kings        55
Glacier      50
Quincy       45
Joshua       24
Hampton      22
Ike          21
Badlands      3
Franklin      1
Name: name, dtype: int64

--------------------------------

How many times each cohort accessed the servlets lesson.

Lassen      120
Arches       79
Hampton      38
Olympic      29
Kings         4
Glacier       1
Franklin      1
Na

At first glance, Arches and Lassen are dominating the number of hits on each lesson page. What could be driving this behavior?

In [39]:
df_php.name.value_counts()

Lassen        9587
Arches        8890
Olympic       4954
Kings         2845
Hampton       1712
Quincy        1237
Glacier        598
Joshua         302
Ike            253
Badlands        93
Franklin        72
Denali           4
Everglades       1
Name: name, dtype: int64

Lassen and Arches also dominate the overall number of impressions on the curriculum website. Given their activity relative to the other cohorts, I don't see much anomalous behavior across the most popular lessons for PHP developers. The only thing that catches my eye is how often the Olympic cohort accessed the working-with-data-types-operators-and-variables lesson: they beat out Lassen and almost took the top spot.

In [51]:
df_java[df_java.is_lesson].lesson.value_counts().head(10)

introduction                                       20069
search_index.json                                  16976
arrays                                             10504
functions                                          10174
working-with-data-types-operators-and-variables     7109
tables                                              7027
javascript-with-html                                6824
servlets                                            6622
databases                                           6574
elements                                            6522
Name: lesson, dtype: int64

In [53]:
java_mp_list = ['arrays', 'functions', 'working-with-data-types-operators-and-variables', 'tables',
              'javascript-with-html', 'servlets', 'databases', 'elements']

for pop in java_mp_list:
    
    print()
    print('--------------------------------')
    print()
    print('How many times each cohort accessed the {} lesson.'.format(pop))
    print()
    print(df_java[df_java['lesson'] == pop].name.value_counts())


--------------------------------

How many times each cohort accessed the arrays lesson.

Ceres         733
Fortuna       701
Voyageurs     628
Staff         580
Teddy         573
Zion          555
Betelgeuse    554
Ulysses       553
Jupiter       527
Apex          491
Ganymede      490
Deimos        428
Hyperion      417
Xanadu        414
Kalypso       390
Andromeda     390
Wrangell      370
Europa        344
Luna          299
Yosemite      288
Bash          278
Marco         276
Neptune       137
Sequoia        64
Pinnacles      14
Mammoth         7
Niagara         6
Oberon          3
Name: name, dtype: int64

--------------------------------

How many times each cohort accessed the functions lesson.

Staff         681
Ceres         614
Zion          550
Teddy         543
Jupiter       526
Ganymede      467
Voyageurs     465
Ulysses       460
Xanadu        459
Betelgeuse    451
Fortuna       447
Deimos        445
Wrangell      419
Apex          406
Hyperion      404
Marco         38

There are no Java cohorts accessing a particular lesson significantly more than the others. I will look at the overall number of hits by each cohort to be sure.
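
"Significantly more" can be given a rough numeric meaning. The sketch below turns each cohort's hits on a lesson into a share of that cohort's total traffic, then flags shares that sit far from the median using a median-absolute-deviation score. The 3.5 threshold is a common rule of thumb, and the function name and toy numbers are illustrative assumptions, not results from the logs.

```python
import pandas as pd

def flag_unusual_shares(lesson_hits, total_hits, threshold=3.5):
    """Flag cohorts whose share of traffic on one lesson is far from the median share."""
    # Cohorts missing from either Series become NaN and are dropped
    share = (lesson_hits / total_hits).dropna()
    median = share.median()
    mad = (share - median).abs().median()
    if mad == 0:
        # No spread around the median, so nothing can be flagged
        return pd.Series(False, index=share.index)
    # Modified z-score based on the median absolute deviation
    robust_z = 0.6745 * (share - median) / mad
    return robust_z.abs() > threshold

# Toy numbers: cohort 'E' spends a much larger share of its traffic on the lesson
lesson_hits = pd.Series({'A': 50, 'B': 55, 'C': 45, 'D': 52, 'E': 300})
total_hits = pd.Series({'A': 1000, 'B': 1000, 'C': 1000, 'D': 1000, 'E': 1000})
flags = flag_unusual_shares(lesson_hits, total_hits)
print(flags)
```

In the notebook, `lesson_hits` would come from a per-lesson `name.value_counts()` and `total_hits` from the per-cohort totals.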

In [54]:
df_java.name.value_counts()

Staff         84031
Ceres         40730
Zion          38096
Jupiter       37109
Fortuna       36902
Voyageurs     35636
Ganymede      33844
Apex          33568
Deimos        32888
Teddy         30926
Hyperion      29855
Betelgeuse    29356
Ulysses       28534
Europa        28033
Xanadu        27749
Wrangell      25586
Andromeda     25359
Kalypso       23691
Yosemite      20743
Bash          17713
Luna          16623
Marco         16397
Sequoia        7444
Neptune        7276
Pinnacles      2158
Oberon         1672
Niagara         755
Mammoth         691
Name: name, dtype: int64

Interestingly enough, the Codeup staff dominate the overall impressions on the Java curriculum website. This makes sense, because staff members are constantly accessing the curriculum to answer questions, prepare for lectures, and improve the material. Notably, the PHP subset has no cohort labeled Staff.

In [55]:
df_ds[df_ds.is_lesson].lesson.value_counts().head(10)

overview                        4259
1-overview                      3517
project                         3137
modern-data-scientist.jpg       3094
AI-ML-DL-timeline.jpg           3094
search_index.json               2204
1.1-intro-to-data-science       1633
scale_features_or_not.svg       1590
explore                         1585
AnomalyDetectionCartoon.jpeg    1583
Name: lesson, dtype: int64

The data science curriculum has less descriptive lesson names than the other programs. "Overview", "Project", and "Explore" likely exist under multiple modules. I will disregard them for this analysis because I can't pinpoint their content.
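
That assumption is easy to check. A minimal sketch, using a toy frame in place of `df_ds` (the module and lesson pairs are partly invented): count the distinct modules each lesson name appears under, and treat any name with more than one module as ambiguous.

```python
import pandas as pd

# Toy stand-in for df_ds
toy = pd.DataFrame({
    'module': ['classification', 'regression', 'clustering', '1-fundamentals'],
    'lesson': ['overview', 'overview', 'explore', '1.1-intro-to-data-science'],
})

# Number of distinct modules each lesson name appears under
modules_per_lesson = toy.groupby('lesson')['module'].nunique()
ambiguous = modules_per_lesson[modules_per_lesson > 1].index.tolist()
print(ambiguous)
```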

In [56]:
ds_mp_list = ['modern-data-scientist.jpg', 'AI-ML-DL-timeline.jpg', '1.1-intro-to-data-science',
              'scale_features_or_not.svg', 'AnomalyDetectionCartoon.jpeg']

for pop in ds_mp_list:
    
    print()
    print('--------------------------------')
    print()
    print('How many times each cohort accessed the {} lesson.'.format(pop))
    print()
    print(df_ds[df_ds['lesson'] == pop].name.value_counts())


--------------------------------

How many times each cohort accessed the modern-data-scientist.jpg lesson.

Darden      808
Florence    704
Bayes       675
Curie       568
Easley      397
Name: name, dtype: int64

--------------------------------

How many times each cohort accessed the AI-ML-DL-timeline.jpg lesson.

Darden      812
Florence    701
Bayes       673
Curie       567
Easley      399
Name: name, dtype: int64

--------------------------------

How many times each cohort accessed the 1.1-intro-to-data-science lesson.

Bayes       640
Curie       461
Darden      460
Florence     64
Easley        8
Name: name, dtype: int64

--------------------------------

How many times each cohort accessed the scale_features_or_not.svg lesson.

Darden      944
Easley      463
Curie        96
Florence     60
Bayes        31
Name: name, dtype: int64

--------------------------------

How many times each cohort accessed the AnomalyDetectionCartoon.jpeg lesson.

Darden      635
Curie       421

The Darden cohort accessed almost all popular lessons much more frequently than the other cohorts. Bayes stands out on the intro to data science lesson. Could Darden's dominance be explained by overall greater activity on the website?

In [57]:
df_ds.name.value_counts()

Darden      32015
Bayes       26538
Curie       21582
Easley      14715
Florence     8562
Name: name, dtype: int64

Darden does have the most overall impressions on the website, which explains at least part of its lead on individual lessons.
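
Raw counts favor the busiest cohort, so a fairer comparison divides each cohort's hits on a lesson by its total hits. As a worked check, using only the counts printed above for `scale_features_or_not.svg` and the cohort totals:

```python
import pandas as pd

# Counts copied from the outputs above
lesson_hits = pd.Series({'Darden': 944, 'Easley': 463, 'Curie': 96, 'Florence': 60, 'Bayes': 31})
total_hits = pd.Series({'Darden': 32015, 'Bayes': 26538, 'Curie': 21582,
                        'Easley': 14715, 'Florence': 8562})

# Share of each cohort's traffic that went to this file, as a percentage
share_pct = (lesson_hits / total_hits * 100).sort_values(ascending=False).round(2)
print(share_pct)
```

On this file, Easley's share comes out slightly higher than Darden's (roughly 3.1% versus 2.9%), so Darden's lead in raw hits here looks like a product of its overall activity. The same calculation could be repeated for the other lessons before drawing a conclusion.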

## Question 3

Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?

In [60]:
df.groupby('user_id').endpoint.count().mean()

917.65749235474

In [61]:
df.groupby('user_id').endpoint.count().median()

692.0

The average student has around 900 impressions. The mean appears to be skewed toward the high end by a few users who access the curriculum thousands of times; the median is around 700. I will pick an arbitrary cutoff: students with 10 or fewer page hits while active count as hardly accessing the curriculum.
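
The gap between the mean and the median is typical of a right-skewed distribution. A tiny made-up example shows the effect: a couple of very heavy users raise the mean well above the median, while the median stays close to the typical user.

```python
import pandas as pd

# Hypothetical hits per user: most are moderate, two are very heavy, one is nearly inactive
hits_per_user = pd.Series([3, 400, 650, 700, 720, 900, 5000, 9000])

mean_hits = hits_per_user.mean()      # pulled upward by the two heavy users
median_hits = hits_per_user.median()  # middle of the distribution
below_cutoff = (hits_per_user <= 10).sum()

print(mean_hits, median_hits, below_cutoff)
```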

In [65]:
#Convert the date column to datetime (infer_datetime_format is deprecated in pandas 2.0 and not needed for ISO dates)
df['date'] = pd.to_datetime(df['date'])

In [66]:
#Convert the cohort start and end dates as well
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

In [67]:
df[(df['date'] > df['start_date']) & (df['date'] < df['end_date'])]

Unnamed: 0,date,time,endpoint,user_id,cohort_id,ip,name,slack,start_date,end_date,created_at,updated_at,module,lesson,is_lesson,program_type
4,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61,Teddy,#teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,javascript-i,conditionals,True,Java
5,2018-01-26,09:56:41,javascript-i/loops,2,22.0,97.105.19.61,Teddy,#teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,javascript-i,loops,True,Java
6,2018-01-26,09:56:46,javascript-i/conditionals,3,22.0,97.105.19.61,Teddy,#teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,javascript-i,conditionals,True,Java
7,2018-01-26,09:56:48,javascript-i/functions,3,22.0,97.105.19.61,Teddy,#teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,javascript-i,functions,True,Java
8,2018-01-26,09:56:59,javascript-i/loops,2,22.0,97.105.19.61,Teddy,#teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,javascript-i,loops,True,Java
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
900210,2021-04-21,16:36:09,jquery/personal-site,869,135.0,136.50.98.51,Marco,#marco,2021-01-25,2021-07-19,2021-01-20 21:31:11,2021-01-20 21:31:11,jquery,personal-site,True,Java
900211,2021-04-21,16:36:34,html-css/css-ii/bootstrap-grid-system,948,138.0,104.48.214.211,Neptune,#neptune,2021-03-15,2021-09-03,2021-03-15 19:57:09,2021-03-15 19:57:09,html-css,bootstrap-grid-system,True,Java
900212,2021-04-21,16:37:48,java-iii,834,134.0,67.11.50.23,Luna,#luna,2020-12-07,2021-06-08,2020-12-07 16:58:43,2020-12-07 16:58:43,java-iii,java-iii,False,Java
900213,2021-04-21,16:38:14,java-iii/servlets,834,134.0,67.11.50.23,Luna,#luna,2020-12-07,2021-06-08,2020-12-07 16:58:43,2020-12-07 16:58:43,java-iii,servlets,True,Java


In [68]:
df_active = df[(df['date'] > df['start_date']) & (df['date'] < df['end_date'])]

In [95]:
df_active.shape

(644386, 16)

In [136]:
df_active.groupby('user_id').user_id.count()

user_id
2      1541
3      1514
4       692
5      1621
6      1311
       ... 
976      25
977      76
978      49
979     104
981      42
Name: user_id, Length: 724, dtype: int64

In [117]:
df_active.groupby('user_id').user_id.count() <= 10

user_id
2      False
3      False
4      False
5      False
6      False
       ...  
976    False
977    False
978    False
979    False
981    False
Name: user_id, Length: 724, dtype: bool

In [119]:
df_active.groupby('user_id').user_id.count()[df_active.groupby('user_id').user_id.count() <= 10]

user_id
278    4
388    8
679    3
812    7
832    3
879    1
956    5
Name: user_id, dtype: int64

In [126]:
df_active[df_active['user_id'] == 679]

Unnamed: 0,date,time,endpoint,user_id,cohort_id,ip,name,slack,start_date,end_date,created_at,updated_at,module,lesson,is_lesson,program_type
597685,2020-07-14,08:05:15,1-fundamentals/1.1-intro-to-data-science,679,59.0,24.28.146.155,Darden,#darden,2020-07-13,2021-01-12,2020-07-13 18:32:19,2020-07-13 18:32:19,1-fundamentals,1.1-intro-to-data-science,True,DS
597686,2020-07-14,08:05:15,1-fundamentals/AI-ML-DL-timeline.jpg,679,59.0,24.28.146.155,Darden,#darden,2020-07-13,2021-01-12,2020-07-13 18:32:19,2020-07-13 18:32:19,1-fundamentals,AI-ML-DL-timeline.jpg,True,DS
597687,2020-07-14,08:05:15,1-fundamentals/modern-data-scientist.jpg,679,59.0,24.28.146.155,Darden,#darden,2020-07-13,2021-01-12,2020-07-13 18:32:19,2020-07-13 18:32:19,1-fundamentals,modern-data-scientist.jpg,True,DS


In [139]:
#Count hits per user once, then keep the users at or below the cutoff
hits_per_user = df_active.groupby('user_id').user_id.count()
inactive_list = list(hits_per_user[hits_per_user <= 10].index)

inactive_users = pd.DataFrame()

#Collect every log row that belongs to a low-activity user
for user in inactive_list:
    
    temp_df = df_active[df_active['user_id'] == user]
    
    inactive_users = pd.concat([inactive_users, temp_df], axis=0)
    
inactive_users.reset_index(drop=True, inplace=True)

inactive_users

Unnamed: 0,date,time,endpoint,user_id,cohort_id,ip,name,slack,start_date,end_date,created_at,updated_at,module,lesson,is_lesson,program_type
0,2018-09-27,13:57:44,/,278,24.0,97.105.19.58,Voyageurs,#voyageurs,2018-05-29,2018-10-11,2018-05-25 22:25:57,2018-05-25 22:25:57,,,False,Java
1,2018-09-27,14:47:37,java-ii/arrays,278,24.0,107.77.217.9,Voyageurs,#voyageurs,2018-05-29,2018-10-11,2018-05-25 22:25:57,2018-05-25 22:25:57,java-ii,arrays,True,Java
2,2018-09-27,14:58:48,java-ii/arrays,278,24.0,107.77.217.9,Voyageurs,#voyageurs,2018-05-29,2018-10-11,2018-05-25 22:25:57,2018-05-25 22:25:57,java-ii,arrays,True,Java
3,2018-09-27,14:59:07,java-ii/collections,278,24.0,107.77.217.9,Voyageurs,#voyageurs,2018-05-29,2018-10-11,2018-05-25 22:25:57,2018-05-25 22:25:57,java-ii,collections,True,Java
4,2019-03-19,09:50:19,/,388,31.0,97.105.19.58,Andromeda,#andromeda,2019-03-18,2019-07-30,2019-03-18 20:35:06,2019-03-18 20:35:06,,,False,Java
5,2019-03-19,09:50:23,html-css,388,31.0,97.105.19.58,Andromeda,#andromeda,2019-03-18,2019-07-30,2019-03-18 20:35:06,2019-03-18 20:35:06,html-css,html-css,False,Java
6,2019-03-19,09:50:28,html-css/elements,388,31.0,97.105.19.58,Andromeda,#andromeda,2019-03-18,2019-07-30,2019-03-18 20:35:06,2019-03-18 20:35:06,html-css,elements,True,Java
7,2019-03-19,10:04:11,html-css/elements,388,31.0,97.105.19.58,Andromeda,#andromeda,2019-03-18,2019-07-30,2019-03-18 20:35:06,2019-03-18 20:35:06,html-css,elements,True,Java
8,2019-03-19,10:19:32,html-css/elements,388,31.0,97.105.19.58,Andromeda,#andromeda,2019-03-18,2019-07-30,2019-03-18 20:35:06,2019-03-18 20:35:06,html-css,elements,True,Java
9,2019-03-19,11:11:51,html-css/forms,388,31.0,97.105.19.58,Andromeda,#andromeda,2019-03-18,2019-07-30,2019-03-18 20:35:06,2019-03-18 20:35:06,html-css,forms,True,Java


In [143]:
inactive_users.name.unique()

array(['Voyageurs', 'Andromeda', 'Darden', 'Hyperion', 'Jupiter', 'Marco',
       'Oberon'], dtype=object)

In [145]:
inactive_users.name.nunique(), inactive_users.user_id.nunique()

(7, 7)

There are seven users and seven cohorts in this dataframe, so each low-activity user belongs to a different cohort.

In [148]:
inactive_users.groupby('user_id').program_type.unique().value_counts()

[Java]    6
[DS]      1
Name: program_type, dtype: int64

Six of the accounts are associated with Java web development programs and one account is associated with the data science program.

In [151]:
inactive_users.groupby('user_id').date.first(), inactive_users.groupby('user_id').date.last()

(user_id
 278   2018-09-27
 388   2019-03-19
 679   2020-07-14
 812   2020-11-08
 832   2020-12-07
 879   2021-01-26
 956   2021-04-15
 Name: date, dtype: datetime64[ns],
 user_id
 278   2018-09-27
 388   2019-03-19
 679   2020-07-14
 812   2020-11-08
 832   2020-12-07
 879   2021-01-26
 956   2021-04-15
 Name: date, dtype: datetime64[ns])

The first and last dates match for every user in this group. Curious...

In [152]:
inactive_users.groupby('user_id').date.last() - inactive_users.groupby('user_id').date.first()

user_id
278   0 days
388   0 days
679   0 days
812   0 days
832   0 days
879   0 days
956   0 days
Name: date, dtype: timedelta64[ns]

Sure enough! All users in this group are accessing the curriculum on one day and one day only!

It doesn't make sense for a student enrolled in a five- to six-month course to access the curriculum on only one day.
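As a cross-check on the single-day finding, here is a minimal sketch that computes each user's first date, last date, and number of distinct active days in one pass. It uses `min`/`max` rather than `first`/`last`, so the result does not depend on the rows being sorted by date. The toy `logs` frame and the `activity_span` helper are assumptions for illustration; user 500 is a made-up contrast case and not from the real data.

```python
import pandas as pd

# Hypothetical toy log; column names mirror the notebook's dataframe.
# Users 278 and 388 use dates from the output above; user 500 is invented.
logs = pd.DataFrame({
    "user_id": [278, 278, 388, 388, 500, 500],
    "date": pd.to_datetime([
        "2018-09-27", "2018-09-27",
        "2019-03-19", "2019-03-19",
        "2019-03-19", "2019-06-01",
    ]),
})

def activity_span(df):
    """Return first date, last date, distinct active days, and span per user."""
    span = df.groupby("user_id")["date"].agg(
        first_seen="min",       # earliest access, independent of row order
        last_seen="max",        # latest access, independent of row order
        active_days="nunique",  # number of distinct calendar days with traffic
    )
    span["span_days"] = (span["last_seen"] - span["first_seen"]).dt.days
    return span

span = activity_span(logs)
single_day_users = span.index[span["active_days"] == 1].tolist()
print(span)
print(single_day_users)
```

Filtering on `active_days == 1` is a slightly stricter test than a zero-day span, although for this question the two should agree.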

In [153]:
inactive_users.groupby('ip').name.unique()

ip
107.77.217.9                  [Voyageurs]
136.50.50.187                     [Marco]
162.200.114.251                  [Oberon]
24.243.49.105                  [Hyperion]
24.28.146.155                    [Darden]
69.154.52.98                    [Jupiter]
97.105.19.58       [Voyageurs, Andromeda]
Name: name, dtype: object

One IP address (97.105.19.58) is associated with two different cohorts, Voyageurs and Andromeda! My theory is that these are "dummy" student accounts created by instructors to check how the curriculum looks and behaves from the students' point of view.
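Rather than spotting the shared address by eye, a small sketch like the following could list every IP used by more than one account. The toy `logs` frame reuses a few rows from the output above, and `shared_ips` with its `min_users` parameter is a hypothetical helper, not something defined earlier in this notebook.

```python
import pandas as pd

# Toy subset of the inactive-user rows shown above (one row per user/IP pair).
logs = pd.DataFrame({
    "ip": ["97.105.19.58", "97.105.19.58", "107.77.217.9", "24.28.146.155"],
    "user_id": [278, 388, 278, 679],
    "name": ["Voyageurs", "Andromeda", "Voyageurs", "Darden"],
})

def shared_ips(df, min_users=2):
    """Return IPs seen with at least `min_users` distinct user accounts."""
    per_ip = df.groupby("ip").agg(
        n_users=("user_id", "nunique"),   # distinct accounts per IP
        n_cohorts=("name", "nunique"),    # distinct cohorts per IP
    )
    return per_ip[per_ip["n_users"] >= min_users]

shared = shared_ips(logs)
print(shared)
```

Counting distinct users as well as distinct cohorts helps separate "one person on a shared network" from "several accounts behind one address".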

In [155]:
inactive_users['ip'].isin(df[df['name'] == 'Staff']['ip'])

0      True
1     False
2     False
3     False
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
Name: ip, dtype: bool

In [156]:
inactive_users[inactive_users['ip'].isin(df[df['name'] == 'Staff']['ip'])]

Unnamed: 0,date,time,endpoint,user_id,cohort_id,ip,name,slack,start_date,end_date,created_at,updated_at,module,lesson,is_lesson,program_type
0,2018-09-27,13:57:44,/,278,24.0,97.105.19.58,Voyageurs,#voyageurs,2018-05-29,2018-10-11,2018-05-25 22:25:57,2018-05-25 22:25:57,,,False,Java
4,2019-03-19,09:50:19,/,388,31.0,97.105.19.58,Andromeda,#andromeda,2019-03-18,2019-07-30,2019-03-18 20:35:06,2019-03-18 20:35:06,,,False,Java
5,2019-03-19,09:50:23,html-css,388,31.0,97.105.19.58,Andromeda,#andromeda,2019-03-18,2019-07-30,2019-03-18 20:35:06,2019-03-18 20:35:06,html-css,html-css,False,Java
6,2019-03-19,09:50:28,html-css/elements,388,31.0,97.105.19.58,Andromeda,#andromeda,2019-03-18,2019-07-30,2019-03-18 20:35:06,2019-03-18 20:35:06,html-css,elements,True,Java
7,2019-03-19,10:04:11,html-css/elements,388,31.0,97.105.19.58,Andromeda,#andromeda,2019-03-18,2019-07-30,2019-03-18 20:35:06,2019-03-18 20:35:06,html-css,elements,True,Java
8,2019-03-19,10:19:32,html-css/elements,388,31.0,97.105.19.58,Andromeda,#andromeda,2019-03-18,2019-07-30,2019-03-18 20:35:06,2019-03-18 20:35:06,html-css,elements,True,Java
9,2019-03-19,11:11:51,html-css/forms,388,31.0,97.105.19.58,Andromeda,#andromeda,2019-03-18,2019-07-30,2019-03-18 20:35:06,2019-03-18 20:35:06,html-css,forms,True,Java
10,2019-03-19,11:12:02,html-css/elements,388,31.0,97.105.19.58,Andromeda,#andromeda,2019-03-18,2019-07-30,2019-03-18 20:35:06,2019-03-18 20:35:06,html-css,elements,True,Java
11,2019-03-19,12:19:23,html-css/elements,388,31.0,97.105.19.58,Andromeda,#andromeda,2019-03-18,2019-07-30,2019-03-18 20:35:06,2019-03-18 20:35:06,html-css,elements,True,Java


We have a match on the accounts that share an IP! Users 278 (Voyageurs) and 388 (Andromeda) both accessed the curriculum from 97.105.19.58, an address that also appears in the Staff traffic.
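The row-level boolean mask is hard to read, so a per-user summary may be easier to speak to. The sketch below counts, for each account of interest, how many requests came from an address also seen in Staff traffic. The `all_logs` frame is a toy stand-in for `df`; the Staff user id `1` and the address `10.0.0.1` are invented for illustration, and `staff_ip_overlap` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in for the full log. Staff user id 1 and IP 10.0.0.1 are made up.
all_logs = pd.DataFrame({
    "user_id": [1, 1, 278, 278, 388, 679],
    "name": ["Staff", "Staff", "Voyageurs", "Voyageurs", "Andromeda", "Darden"],
    "ip": [
        "97.105.19.58", "10.0.0.1",
        "97.105.19.58", "107.77.217.9",
        "97.105.19.58", "24.28.146.155",
    ],
})

def staff_ip_overlap(df, users, staff_label="Staff"):
    """Per user: requests from IPs that also appear in Staff traffic vs. all requests."""
    staff_ips = set(df.loc[df["name"] == staff_label, "ip"])
    subset = df[df["user_id"].isin(users)]
    hits = subset["ip"].isin(staff_ips)  # True where the request IP is a Staff IP
    return hits.groupby(subset["user_id"]).agg(staff_hits="sum", total_hits="count")

overlap = staff_ip_overlap(all_logs, users=[278, 388, 679])
print(overlap)
```

A summary like this makes it clear when an account has only some of its requests on a Staff address, as user 278 does in the tables here.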

In [160]:
inactive_users[~inactive_users['ip'].isin(df[df['name'] == 'Staff']['ip'])]

Unnamed: 0,date,time,endpoint,user_id,cohort_id,ip,name,slack,start_date,end_date,created_at,updated_at,module,lesson,is_lesson,program_type
1,2018-09-27,14:47:37,java-ii/arrays,278,24.0,107.77.217.9,Voyageurs,#voyageurs,2018-05-29,2018-10-11,2018-05-25 22:25:57,2018-05-25 22:25:57,java-ii,arrays,True,Java
2,2018-09-27,14:58:48,java-ii/arrays,278,24.0,107.77.217.9,Voyageurs,#voyageurs,2018-05-29,2018-10-11,2018-05-25 22:25:57,2018-05-25 22:25:57,java-ii,arrays,True,Java
3,2018-09-27,14:59:07,java-ii/collections,278,24.0,107.77.217.9,Voyageurs,#voyageurs,2018-05-29,2018-10-11,2018-05-25 22:25:57,2018-05-25 22:25:57,java-ii,collections,True,Java
12,2020-07-14,08:05:15,1-fundamentals/1.1-intro-to-data-science,679,59.0,24.28.146.155,Darden,#darden,2020-07-13,2021-01-12,2020-07-13 18:32:19,2020-07-13 18:32:19,1-fundamentals,1.1-intro-to-data-science,True,DS
13,2020-07-14,08:05:15,1-fundamentals/AI-ML-DL-timeline.jpg,679,59.0,24.28.146.155,Darden,#darden,2020-07-13,2021-01-12,2020-07-13 18:32:19,2020-07-13 18:32:19,1-fundamentals,AI-ML-DL-timeline.jpg,True,DS
14,2020-07-14,08:05:15,1-fundamentals/modern-data-scientist.jpg,679,59.0,24.28.146.155,Darden,#darden,2020-07-13,2021-01-12,2020-07-13 18:32:19,2020-07-13 18:32:19,1-fundamentals,modern-data-scientist.jpg,True,DS
15,2020-11-08,01:45:34,html-css/css-i/selectors-and-properties,812,58.0,24.243.49.105,Hyperion,#hyperion,2020-05-26,2020-11-10,2020-05-26 19:22:44,2020-05-26 19:22:44,html-css,selectors-and-properties,True,Java
16,2020-11-08,01:45:41,html-css/elements,812,58.0,24.243.49.105,Hyperion,#hyperion,2020-05-26,2020-11-10,2020-05-26 19:22:44,2020-05-26 19:22:44,html-css,elements,True,Java
17,2020-11-08,01:45:56,html-css/css-i/introduction,812,58.0,24.243.49.105,Hyperion,#hyperion,2020-05-26,2020-11-10,2020-05-26 19:22:44,2020-05-26 19:22:44,html-css,introduction,True,Java
18,2020-11-08,01:46:01,html-css/css-i,812,58.0,24.243.49.105,Hyperion,#hyperion,2020-05-26,2020-11-10,2020-05-26 19:22:44,2020-05-26 19:22:44,html-css,css-i,True,Java


Although these requests don't come from IP addresses seen in Staff traffic, they follow a similar pattern of behavior: the account is created around the start of a cohort, it's used on a single day during the class, and then it's never active again. Note that user 278 appears in both tables, with one request from the Staff-linked address and three from a different one. Based on the account activity, I believe instructors are using these accounts to check that certain pages of the curriculum display correctly for students, though the logs alone can't confirm this.
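To check the "used for one day during the class" part of that pattern, a minimal sketch could place each account's single active day relative to its cohort's start and end dates. The `visits` frame below copies one row per user from the tables above; `position_in_cohort` is a hypothetical helper name.

```python
import pandas as pd

# One row per user, with dates taken from the tables above.
visits = pd.DataFrame({
    "user_id": [278, 388, 679, 812],
    "date": pd.to_datetime(["2018-09-27", "2019-03-19", "2020-07-14", "2020-11-08"]),
    "start_date": pd.to_datetime(["2018-05-29", "2019-03-18", "2020-07-13", "2020-05-26"]),
    "end_date": pd.to_datetime(["2018-10-11", "2019-07-30", "2021-01-12", "2020-11-10"]),
})

def position_in_cohort(df):
    """Locate each user's active day relative to the cohort start and end dates."""
    out = df.drop_duplicates("user_id").set_index("user_id").copy()
    out["days_since_start"] = (out["date"] - out["start_date"]).dt.days
    out["days_to_end"] = (out["end_date"] - out["date"]).dt.days
    out["during_class"] = out["date"].between(out["start_date"], out["end_date"])
    return out[["days_since_start", "days_to_end", "during_class"]]

position = position_in_cohort(visits)
print(position)
```

For these four accounts the active day falls inside the cohort window, but not always near the start: users 388 and 679 are active the day after their cohort begins, while user 812 is active two days before its cohort ends.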

## Question 4

Is there any suspicious activity, such as users/machines/etc accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses?