# Anomaly detection project

Hello,


I have some questions for you that I need to be answered before the board meeting Thursday afternoon. I need to be able to speak to the following questions. I also need a single slide that I can incorporate into my existing presentation (Google Slides) that summarizes the most important points. My questions are listed below; however, if you discover anything else important that I didn’t think to ask, please include that as well.


1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?
2. Is there a cohort that referred to a lesson significantly more than other cohorts seemed to gloss over?
3. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?
4. Is there any suspicious activity, such as users/machines/etc accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses?
5. At some point in 2019, the ability for students and alumni to access both curriculums (web dev to ds, ds to web dev) should have been shut off. Do you see any evidence of that happening? Did it happen before?
6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?
7. Which lessons are least accessed?
8. Anything else I should be aware of?


Thank you

In [1]:
#Import dependencies
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from env import host, username, password

In [2]:
#Define a function that creates the database url
def url_creator(host, username, password, db_name):
    return f'mysql+pymysql://{username}:{password}@{host}/{db_name}'

In [3]:
#Create the url
url = url_creator(host, username, password, 'curriculum_logs')

In [4]:
#Define the SQL query
query = '''
        SELECT date,
        time, path as endpoint,
        user_id,
        cohort_id,
        ip, name,
        slack, start_date,
        end_date, created_at,
        updated_at, deleted_at,
        program_id
        FROM logs
        LEFT JOIN cohorts ON logs.cohort_id = cohorts.id
        '''

In [5]:
#Read in the dataframe
df = pd.read_sql(query, url)
df.head()

Unnamed: 0,date,time,endpoint,user_id,cohort_id,ip,name,slack,start_date,end_date,created_at,updated_at,deleted_at,program_id
0,2018-01-26,09:55:03,/,1,8.0,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,1.0
1,2018-01-26,09:56:02,java-ii,1,8.0,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,1.0
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,1.0
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,1.0
4,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61,Teddy,#teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,,2.0


In [6]:
#Check the shape
df.shape

(900223, 14)

In [7]:
#Look at the info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900223 entries, 0 to 900222
Data columns (total 14 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   date        900223 non-null  object 
 1   time        900223 non-null  object 
 2   endpoint    900222 non-null  object 
 3   user_id     900223 non-null  int64  
 4   cohort_id   847330 non-null  float64
 5   ip          900223 non-null  object 
 6   name        847330 non-null  object 
 7   slack       847330 non-null  object 
 8   start_date  847330 non-null  object 
 9   end_date    847330 non-null  object 
 10  created_at  847330 non-null  object 
 11  updated_at  847330 non-null  object 
 12  deleted_at  0 non-null       object 
 13  program_id  847330 non-null  float64
dtypes: float64(2), int64(1), object(11)
memory usage: 96.2+ MB


I have some nulls in my dataframe that I would like to investigate before I begin answering the questions. Many columns have the same non-null value counts. Are these columns all missing information for the same group of observations?

In [8]:
#Drop the column with all null values
df.drop(columns=['deleted_at'], inplace=True)

In [9]:
#Rows with null cohort id
df[df['cohort_id'].isna()]

Unnamed: 0,date,time,endpoint,user_id,cohort_id,ip,name,slack,start_date,end_date,created_at,updated_at,program_id
411,2018-01-26,16:46:16,/,48,,97.105.19.61,,,,,,,
412,2018-01-26,16:46:24,spring/extra-features/form-validation,48,,97.105.19.61,,,,,,,
425,2018-01-26,17:54:24,/,48,,97.105.19.61,,,,,,,
435,2018-01-26,18:32:03,/,48,,97.105.19.61,,,,,,,
436,2018-01-26,18:32:17,mysql/relationships/joins,48,,97.105.19.61,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
899897,2021-04-21,12:49:00,javascript-ii,717,,136.50.102.126,,,,,,,
899898,2021-04-21,12:49:02,javascript-ii/es6,717,,136.50.102.126,,,,,,,
899899,2021-04-21,12:51:27,javascript-ii/map-filter-reduce,717,,136.50.102.126,,,,,,,
899900,2021-04-21,12:52:37,javascript-ii/promises,717,,136.50.102.126,,,,,,,


In [10]:
#Are the nulls from all columns present in these observations?
df[df['cohort_id'].isna()].isna().sum()

date              0
time              0
endpoint          0
user_id           0
cohort_id     52893
ip                0
name          52893
slack         52893
start_date    52893
end_date      52893
created_at    52893
updated_at    52893
program_id    52893
dtype: int64

In [11]:
#Check out the one endpoint null
df[df.endpoint.isna()]

Unnamed: 0,date,time,endpoint,user_id,cohort_id,ip,name,slack,start_date,end_date,created_at,updated_at,program_id
506305,2020-04-08,09:25:18,,586,55.0,72.177.240.51,Curie,#curie,2020-02-03,2020-07-07,2020-02-03 19:31:51,2020-02-03 19:31:51,3.0


In [12]:
#Create a module column with the first portion of the endpoint
df['module'] = df['endpoint'].str.split('/').str[0]

In [13]:
#Create a lesson column with the last portion of the endpoint
df['lesson'] = df['endpoint'].str.split('/').str[-1]

In [30]:
#Flag rows whose endpoint goes deeper than the module's landing page
df['is_lesson'] = (df['module'] != df['lesson'])
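
To make the three derived columns concrete, here is a small sketch that applies the same split to a few endpoints copied from the rows shown in this notebook. The `sample` dataframe is only for illustration and is not part of the analysis.

```python
import pandas as pd

# Endpoints copied from log rows displayed elsewhere in this notebook.
sample = pd.DataFrame({
    'endpoint': [
        '/',
        'java-ii',
        'java-ii/object-oriented-programming',
        'jquery/ajax/weather-map',
    ]
})

parts = sample['endpoint'].str.split('/')
sample['module'] = parts.str[0]    # first path segment
sample['lesson'] = parts.str[-1]   # last path segment
# A row counts as a lesson only when the path goes past the module itself.
sample['is_lesson'] = sample['module'] != sample['lesson']

print(sample)
```

The root path `/` splits into two empty strings, so both `module` and `lesson` are blank and `is_lesson` is False. That is why a blank entry shows up in the value counts.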

In [31]:
df

Unnamed: 0,date,time,endpoint,user_id,cohort_id,ip,name,slack,start_date,end_date,created_at,updated_at,module,lesson,is_lesson,program_type
0,2018-01-26,09:55:03,/,1,8.0,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,,False,PHP
1,2018-01-26,09:56:02,java-ii,1,8.0,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,java-ii,java-ii,False,PHP
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,java-ii,object-oriented-programming,True,PHP
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,slides,object_oriented_programming,True,PHP
4,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61,Teddy,#teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,javascript-i,conditionals,True,Java
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
900218,2021-04-21,16:41:51,jquery/personal-site,64,28.0,71.150.217.33,Staff,#,2014-02-04,2014-02-04,2018-12-06 17:04:19,2018-12-06 17:04:19,jquery,personal-site,True,Java
900219,2021-04-21,16:42:02,jquery/mapbox-api,64,28.0,71.150.217.33,Staff,#,2014-02-04,2014-02-04,2018-12-06 17:04:19,2018-12-06 17:04:19,jquery,mapbox-api,True,Java
900220,2021-04-21,16:42:09,jquery/ajax/weather-map,64,28.0,71.150.217.33,Staff,#,2014-02-04,2014-02-04,2018-12-06 17:04:19,2018-12-06 17:04:19,jquery,weather-map,True,Java
900221,2021-04-21,16:44:37,anomaly-detection/discrete-probabilistic-methods,744,28.0,24.160.137.86,Staff,#,2014-02-04,2014-02-04,2018-12-06 17:04:19,2018-12-06 17:04:19,anomaly-detection,discrete-probabilistic-methods,True,Java


## Question 1

Which lesson appears to attract the most traffic consistently across cohorts (per program)?

In [16]:
df['program_id'].value_counts()

2.0    713365
3.0    103412
1.0     30548
4.0         5
Name: program_id, dtype: int64

There are four unique program IDs. I need to map each ID to a real program using alumni.codeup.com. To do this, I will look up the cohorts associated with each ID and apply the matching program name to every cohort that shares that ID.

In [17]:
for i in range(1, 5):
    print()
    print('--------------------------------')
    print()
    print('Program ID {} corresponds to these cohorts.'.format(i))
    print()
    print(df[df['program_id'] == i]['name'].unique())


--------------------------------

Program ID 1 corresponds to these cohorts.

['Hampton' 'Arches' 'Quincy' 'Kings' 'Lassen' 'Glacier' 'Denali' 'Joshua'
 'Olympic' 'Badlands' 'Ike' 'Franklin' 'Everglades']

--------------------------------

Program ID 2 corresponds to these cohorts.

['Teddy' 'Sequoia' 'Niagara' 'Pinnacles' 'Mammoth' 'Ulysses' 'Voyageurs'
 'Wrangell' 'Xanadu' 'Yosemite' 'Staff' 'Zion' 'Andromeda' 'Betelgeuse'
 'Ceres' 'Deimos' 'Europa' 'Fortuna' 'Apex' 'Ganymede' 'Hyperion' 'Bash'
 'Jupiter' 'Kalypso' 'Luna' 'Marco' 'Neptune' 'Oberon']

--------------------------------

Program ID 3 corresponds to these cohorts.

['Bayes' 'Curie' 'Darden' 'Easley' 'Florence']

--------------------------------

Program ID 4 corresponds to these cohorts.

['Apollo']


Programs with ID 1 map to "Full Stack PHP Program." These programs represent early cohorts at Codeup that learned web development. Cohorts belonging to this program ID will be labeled "PHP".

Programs with ID 2 map to "Full Stack Java Program." These programs also train web developers, but are more recent cohorts in Codeup's history. These cohorts will be labeled "Java" moving forward.

Programs with ID 3 map to "Data Science Program." These cohorts will be labeled "DS".

Programs with ID 4 map to "Front-End Program." I will label this lone cohort "Front".

In [18]:
program_dict = {1.0: 'PHP', 2.0: 'Java', 3.0: 'DS', 4.0: 'Front'}

df['program_type'] = df['program_id'].map(program_dict)

In [19]:
#The program id column is now worthless
df.drop(columns=['program_id'], inplace=True)

In [20]:
program_list = ['PHP', 'Java', 'DS', 'Front']

for program in program_list:
    print()
    print('--------------------------------')
    print()
    print('Most visited pages by members of the {} program.'.format(program))
    print()
    print(df[df['program_type'] == program]['lesson'].value_counts().head(10))


--------------------------------

Most visited pages by members of the PHP program.

                1681
favicon.ico     1239
index.html      1020
html-css         814
javascript-i     736
introduction     562
spring           501
java-iii         479
appendix         469
java-ii          454
Name: lesson, dtype: int64

--------------------------------

Most visited pages by members of the Java program.

                     35814
introduction         20114
javascript-i         17457
toc                  17428
search_index.json    16976
html-css             12913
java-iii             12684
java-ii              11719
spring               11385
jquery               10710
Name: lesson, dtype: int64

--------------------------------

Most visited pages by members of the DS program.

                             8358
overview                     4267
1-overview                   3519
modern-data-scientist.jpg    3152
AI-ML-DL-timeline.jpg        3152
project                      3140
sear

The PHP and Java programs have many topics in common: java, javascript, spring, and html. The data science students often refer to classification and fundamentals material. Members of the Front program account for only 5 page hits in total.

I don't think this answers the question in its entirety. The question asks about consistent traffic across cohorts. I will need to group by cohort within each program to answer the question completely.
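
The rankings above include the blank landing page and files such as favicon.ico, search_index.json, and the .jpg images. One possible refinement is to drop those before counting. This is a sketch under my own assumption about which extensions count as static assets; it is not a step the analysis above performs, and the toy values below only reuse names from the outputs.

```python
import pandas as pd

# Toy lesson values reusing names from the outputs above; frequencies are made up.
lessons = pd.Series([
    '', 'favicon.ico', 'index.html', 'html-css', 'introduction',
    'search_index.json', 'modern-data-scientist.jpg', 'introduction',
])

# Hypothetical list of extensions treated as static assets (an assumption).
ASSET_PATTERN = r'\.(?:ico|json|jpg|jpeg|svg|png)$'

is_asset = lessons.str.contains(ASSET_PATTERN, case=False, regex=True)
is_blank = lessons.eq('')

# Keep only rows that look like real curriculum pages.
clean = lessons[~is_asset & ~is_blank]
print(clean.value_counts())
```

Pages ending in .html are kept here because the PHP-era curriculum appears to serve lessons under that extension.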

In [38]:
#Create separate dataframes for each program
df_php = df[df['program_type'] == 'PHP']
df_java = df[df['program_type'] == 'Java']
df_ds = df[df['program_type'] == 'DS']

In [65]:
df_php[df_php['name'] == 'Quincy'].module.value_counts().iloc[:5]

content    650
           151
mysql       97
java-i      50
slides      49
Name: module, dtype: int64

In [62]:
def most_visited_modules(df):
    #Collect the five most visited modules of each cohort, one column per cohort
    mv = pd.DataFrame()

    for cohort in df['name'].unique():
        top_modules = df[df['name'] == cohort]['module'].value_counts().iloc[:5].index

        #Skip cohorts that visited fewer than five distinct modules
        if len(top_modules) < 5:
            continue

        mv[cohort] = top_modules

    return mv

In [63]:
def most_popular_modules(df):
    
    mv = most_visited_modules(df)
    
    top_five = mv.melt(var_name='columns', value_name='index')['index'].value_counts().head(5)
    
    return top_five

In [66]:
most_popular_modules(df_php)

content         8
                7
javascript-i    7
html-css        6
java-iii        4
Name: index, dtype: int64

In [67]:
most_popular_modules(df_java)

javascript-i    27
html-css        25
mysql           24
jquery          14
spring          13
Name: index, dtype: int64

In [68]:
most_popular_modules(df_ds)

                  4
1-fundamentals    3
classification    3
sql               3
python            3
Name: index, dtype: int64

For PHP cohorts, the content, javascript-i, and html-css modules are the most popular.

For Java cohorts, the javascript-i, html-css, and mysql modules are the most popular.

For DS cohorts, the fundamentals, classification, sql, and python modules are the most popular.
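
The same tally (how many cohorts have a given module in their top five) can also be sketched with `groupby` instead of a loop. The function name `top_n_tally` and the toy data are hypothetical. This is an alternative formulation under those assumptions, not the code that produced the results above.

```python
import pandas as pd

def top_n_tally(df, column='module', group='name', n=5):
    """Count in how many cohorts each value ranks among that cohort's top n."""
    counts = (
        df.groupby([group, column])
          .size()
          .rename('visits')
          .reset_index()
          .sort_values([group, 'visits'], ascending=[True, False])
    )
    top = counts.groupby(group).head(n)
    # Mirror the loop-based functions: ignore cohorts with fewer than n distinct values.
    full = top.groupby(group)[column].transform('count') >= n
    return top.loc[full, column].value_counts()

# Toy data: cohort C visits only one module, so it is skipped when n=2.
toy = pd.DataFrame({
    'name':   ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
    'module': ['mysql', 'mysql', 'mysql', 'html-css', 'html-css', 'spring',
               'jquery', 'jquery', 'mysql', 'mysql'],
})
result = top_n_tally(toy, n=2)
print(result)
```

Called as `top_n_tally(df_php)`, it would tally modules for the PHP cohorts; passing `column='lesson'` on a dataframe filtered to `is_lesson` rows would do the same for lessons.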

In [45]:
df_php[df_php['name'] == 'Quincy'].loc[df_php.is_lesson].lesson.value_counts().head(10)

favicon.ico                   92
intro                         38
html-css                      33
environment.html              31
application-structure.html    20
intro-to-mysql                19
php_iii                       19
introduction                  17
conditionals                  16
collections                   15
Name: lesson, dtype: int64

In [60]:
def most_visited_lessons(df):
    #Collect the five most visited lessons of each cohort, one column per cohort
    mv = pd.DataFrame()

    for cohort in df['name'].unique():
        top_lessons = df[(df['name'] == cohort) & df['is_lesson']]['lesson'].value_counts().iloc[:5].index

        #Skip cohorts that visited fewer than five distinct lessons
        if len(top_lessons) < 5:
            continue

        mv[cohort] = top_lessons

    return mv

In [61]:
def most_popular_lessons(df):
    
    mv = most_visited_lessons(df)
    
    top_five = mv.melt(var_name='columns', value_name='index')['index'].value_counts().head(5)
    
    return top_five

In [56]:
most_popular_lessons(df_php)

favicon.ico          7
html-css             5
introduction         5
intro                2
introduction.html    2
Name: index, dtype: int64

In [57]:
most_popular_lessons(df_java)

introduction         26
functions            23
search_index.json    22
arrays               19
servlets              4
Name: index, dtype: int64

In [58]:
most_popular_lessons(df_ds)

project                      4
AI-ML-DL-timeline.jpg        4
modern-data-scientist.jpg    3
1-overview                   2
overview                     2
Name: index, dtype: int64

For PHP, the most popular lessons across all cohorts are html-css and introductions.

For Java, the most popular lessons across all cohorts are introductions, functions, and arrays.

For DS, the most popular lessons across all cohorts are projects, the timeline, and the modern data scientist.

## Question 2

Is there a cohort that referred to a lesson significantly more than other cohorts seemed to gloss over?

In [72]:
pd.DataFrame(df_php[df_php['name'] == 'Quincy'].loc[df_php.is_lesson].lesson.value_counts()).reset_index()

Unnamed: 0,index,lesson
0,favicon.ico,92
1,intro,38
2,html-css,33
3,environment.html,31
4,application-structure.html,20
...,...,...
212,arithmetic-operators.html,1
213,logical-operators.html,1
214,aliases,1
215,constructors-destructors.html,1


In [114]:
def popular_lesson_counter(df):
    #Stack the five most visited lessons of every cohort into one long dataframe
    frames = []

    for cohort in df['name'].unique():
        counts = df[(df['name'] == cohort) & df['is_lesson']]['lesson'].value_counts().head()

        #Build the columns explicitly so the result does not depend on how
        #reset_index names them in a given pandas version
        frames.append(pd.DataFrame({'lesson': counts.index, 'count': counts.values}))

    return pd.concat(frames, ignore_index=True)

In [115]:
php_mv = popular_lesson_counter(df_php)
php_mv.head()

Unnamed: 0,lesson,count
0,servlets,38
1,jdbc,35
2,mvc,29
3,favicon.ico,28
4,jsp-and-jstl,27


In [122]:
php_mv.groupby('lesson')['count'].sum().sort_values(ascending=False).head()

lesson
favicon.ico     1176
introduction     516
arrays           261
servlets         158
html-css         135
Name: count, dtype: int64

In [128]:
df_php[df_php['lesson'] == 'arrays'].name.value_counts()

Arches     143
Lassen     118
Olympic     59
Hampton      9
Quincy       8
Kings        7
Joshua       1
Glacier      1
Name: name, dtype: int64

In [130]:
df_php.name.value_counts()

Lassen        9587
Arches        8890
Olympic       4954
Kings         2845
Hampton       1712
Quincy        1237
Glacier        598
Joshua         302
Ike            253
Badlands        93
Franklin        72
Denali           4
Everglades       1
Name: name, dtype: int64
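
Raw counts favor the larger cohorts, so one way to compare them is each cohort's visits to `arrays` as a share of its total traffic. The sketch below copies the numbers from the two outputs above into plain Series; the normalization itself is my suggestion, not a step taken in the original cells.

```python
import pandas as pd

# Visits to the 'arrays' lesson per PHP cohort (copied from the output above).
arrays_visits = pd.Series({
    'Arches': 143, 'Lassen': 118, 'Olympic': 59, 'Hampton': 9,
    'Quincy': 8, 'Kings': 7, 'Joshua': 1, 'Glacier': 1,
})

# Total logged requests per PHP cohort (copied from the output above).
total_visits = pd.Series({
    'Lassen': 9587, 'Arches': 8890, 'Olympic': 4954, 'Kings': 2845,
    'Hampton': 1712, 'Quincy': 1237, 'Glacier': 598, 'Joshua': 302,
    'Ike': 253, 'Badlands': 93, 'Franklin': 72, 'Denali': 4, 'Everglades': 1,
})

# Cohorts with no 'arrays' visits get a share of 0 after reindexing.
share = arrays_visits.reindex(total_visits.index, fill_value=0) / total_visits

# Show the shares as percentages, largest first.
print(share.sort_values(ascending=False).mul(100).round(2))
```

On these figures Arches has the highest share (143 of 8,890 requests, about 1.6%), while the smallest cohorts have too few requests for their shares to mean much.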

In [86]:
cohort_comparer(df_java)

Unnamed: 0,Teddy_mv,Teddy_visits,Sequoia_mv,Sequoia_visits,Niagara_mv,Niagara_visits,Pinnacles_mv,Pinnacles_visits,Mammoth_mv,Mammoth_visits,...,Kalypso_mv,Kalypso_visits,Luna_mv,Luna_visits,Marco_mv,Marco_visits,Neptune_mv,Neptune_visits,Oberon_mv,Oberon_visits
0,search_index.json,698,search_index.json,213,jdbc,14,introduction,57,favicon.ico,49,...,introduction,740,introduction,700,introduction,711,working-with-data-types-operators-and-variables,331,operators,179
1,introduction,578,views,207,servlets,14,npm,51,controllers,12,...,search_index.json,467,search_index.json,405,working-with-data-types-operators-and-variables,511,introduction,312,primitive-types,177
2,arrays,573,introduction,198,controllers,13,functions,37,views,11,...,arrays,390,working-with-data-types-operators-and-variables,319,javascript-with-html,411,javascript-with-html,311,functions,171
3,functions,543,controllers,193,setup,13,introduction-to-java,30,introduction,10,...,dom,358,functions,311,functions,384,bootstrap-grid-system,256,working-with-data-types-operators-and-variables,146
4,jdbc,395,repositories,172,repositories,11,setup,29,deployment-and-dependencies,9,...,functions,331,arrays,296,loops,312,bootstrap-introduction,227,conditionals,144


In [87]:
cohort_comparer(df_ds)

Unnamed: 0,Bayes_mv,Bayes_visits,Curie_mv,Curie_visits,Darden_mv,Darden_visits,Easley_mv,Easley_visits,Florence_mv,Florence_visits
0,1-overview,1824,1-overview,1573,overview,2549,overview,1074,modern-data-scientist.jpg,688
1,project,1024,project,660,explore,967,project,534,AI-ML-DL-timeline.jpg,685
2,modern-data-scientist.jpg,674,modern-data-scientist.jpg,567,scale_features_or_not.svg,943,scale_features_or_not.svg,463,intro-to-data-science,615
3,AI-ML-DL-timeline.jpg,672,AI-ML-DL-timeline.jpg,566,project,917,explore,460,functions,305
4,4-explore,652,search_index.json,538,AI-ML-DL-timeline.jpg,783,classical_programming_vs_machine_learning.jpeg,432,data-types-and-variables,258
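
`cohort_comparer` is called above but its definition does not appear in the cells shown. Judging only from the output columns (`<cohort>_mv` and `<cohort>_visits`, five rows each), a function along the following lines would produce a table of the same shape. This is a hypothetical stand-in, including the name `cohort_comparer_sketch` and the `is_lesson` filter, and may differ from the original.

```python
import pandas as pd

def cohort_comparer_sketch(df, n=5):
    """Hypothetical stand-in: top-n lessons and their visit counts per cohort, side by side."""
    frames = []
    for cohort in df['name'].dropna().unique():
        # Assumption: only rows flagged as lessons are ranked.
        subset = df[(df['name'] == cohort) & df['is_lesson']]
        top = subset['lesson'].value_counts().head(n)
        frames.append(pd.DataFrame({
            f'{cohort}_mv': top.index.to_list(),
            f'{cohort}_visits': top.to_list(),
        }))
    # Cohorts with fewer than n lessons are padded with NaN by the column-wise concat.
    return pd.concat(frames, axis=1)

# Toy data reusing cohort and lesson names from the outputs above.
toy = pd.DataFrame({
    'name': ['Bayes', 'Bayes', 'Bayes', 'Curie', 'Curie'],
    'lesson': ['project', 'project', 'explore', 'overview', 'overview'],
    'is_lesson': [True, True, True, True, True],
})
result = cohort_comparer_sketch(toy, n=2)
print(result)
```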
