# Anomaly Detection Project

# Imports 

In [1]:
#standard DS imports
import numpy as np
import pandas as pd
import math
import os

#custom imports
import env
from env import get_connection
import acquire
import prepare
import functions

#filter out any noisy warning flags
import warnings
warnings.filterwarnings("ignore")

# Acquire

- Data was acquired from the Codeup curriculum_logs database
- It contained 847,330 rows and 15 columns before cleaning
- Each row represents one logged access to Codeup's website and curriculum
- Each column holds information about that access and the user who made it

In [2]:
# acquire the data
df = acquire.offline_lesson_kernel_restart()

# Prepare

Prepare Actions:

- Added a column (fixed_date) that combined the date and time columns
- Set the index to the fixed_date column
- Dropped unnecessary and redundant columns
- Added four columns to identify the Codeup programs
- Added a column to identify whether a login occurred while the Codeup student was active
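
The prepare actions above can be sketched roughly as follows. This is not the code in `prepare.prep_curr_logs`; it is a minimal illustration that assumes the raw data has string `date` and `time` columns, and that program ids 1 to 4 map to the `php`, `web`, `data`, and `front_end` flags (the mapping for 1 and 2 matches the sample rows shown later; 3 and 4 are inferred from the takeaways). The function name `prep_sketch` is hypothetical.

```python
import pandas as pd


def prep_sketch(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative version of the prepare steps (not the project's actual code)."""
    df = raw.copy()

    # Combine the separate date and time strings into one timestamp column.
    df["fixed_date"] = pd.to_datetime(df["date"] + " " + df["time"])
    df = df.drop(columns=["date", "time"])

    # One boolean flag per program (assumed id-to-program mapping).
    program_flags = {"php": 1, "web": 2, "data": 3, "front_end": 4}
    for col, program_id in program_flags.items():
        df[col] = df["program_id"] == program_id

    # A row is "active" when the access falls inside the cohort's date range.
    start = pd.to_datetime(df["start_date"])
    end = pd.to_datetime(df["end_date"])
    df["is_active"] = df["fixed_date"].between(start, end).astype(int)

    # Index by timestamp while keeping the column, as in the output shown later.
    return df.set_index("fixed_date", drop=False)
```

Keeping `fixed_date` as both index and column mirrors the displayed dataframe; dropping the column would work equally well for time-based slicing.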

In [3]:
# prepare the data
df = prepare.prep_curr_logs(df)

# A brief look at the data

In [4]:
df.head()

fixed_date,path,user_id,cohort_id,ip,name,start_date,end_date,created_at,updated_at,program_id,fixed_date,data,web,php,front_end,is_active
2018-01-26 09:55:03,/,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,1,2018-01-26 09:55:03,False,False,True,False,0
2018-01-26 09:56:02,java-ii,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,1,2018-01-26 09:56:02,False,False,True,False,0
2018-01-26 09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,1,2018-01-26 09:56:05,False,False,True,False,0
2018-01-26 09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61,Hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,1,2018-01-26 09:56:06,False,False,True,False,0
2018-01-26 09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61,Teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,2,2018-01-26 09:56:24,False,True,False,False,1


# Explore

# 1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?

In [5]:
#function displays the programs and most accessed lessons
functions.q_one(df)

                               path  count
program_id                                
1                      javascript-i    736
2                      javascript-i  17457
3           classification/overview   1785
4                  content/html-css      2


***Takeaways***

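- Each Codeup program's most-accessed lesson matches its core curriculum: both web development programs (1 and 2) access javascript-i the most, Data Science (3) accesses classification/overview, and Front End (4) accesses content/html-css, although with only 2 recorded accesses.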
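
For reference, a per-program "most accessed lesson" table like the one above could be built along these lines. This is a sketch, not the body of `functions.q_one`; the function name `top_path_per_program` and the choice to exclude the home page path `/` are assumptions.

```python
import pandas as pd


def top_path_per_program(df: pd.DataFrame, exclude=("/",)) -> pd.DataFrame:
    """Return the single most-accessed path for each program_id (illustrative)."""
    # Drop non-lesson paths such as the home page before counting (assumption).
    pages = df[~df["path"].isin(exclude)]

    # Count accesses for every (program, path) pair.
    counts = (
        pages.groupby(["program_id", "path"])
        .size()
        .reset_index(name="count")
    )

    # Keep the row with the highest count within each program.
    top = counts.loc[counts.groupby("program_id")["count"].idxmax()]
    return top.set_index("program_id")
```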

# 2. Is there a cohort that referred to a lesson significantly more than other cohorts seemed to gloss over?

In [6]:
#function displaying the cohorts and lessons accessed
new_df2 = functions.q_two(df)
new_df2.head()

cohort_id,path,count
1.0,javascript-i,294
2.0,content/php_ii/command-line,6
4.0,mkdocs/search_index.json,1
6.0,javascript-ii/es6,10
7.0,content/html-css,29


***Takeaways***

 - The function returns the most-accessed lesson for each cohort.
     - For some of the Web Dev cohorts, the least-accessed lesson is related to data science.

# 3. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?

In [7]:
# function returning users with 50 or fewer logins while active
new_df3 = functions.q_three(df)
new_df3.head()

user_id,ip
918,1
879,1
940,1
619,1
832,3


***Takeaways***

- There are 37 user ids with 50 or fewer logins, and four with only one login.
- The single logins occurred on the same day as the user's start date, and the IPs were all within the Texas area.
- After exploring the users with fewer than 50 logins, it appears that those students either received a new user id or, because of COVID, may have started working from home and received a new user id.
- These users may also not have finished the course, but further investigation is necessary to better understand these limited logins.
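
A count like the one above could be produced as sketched below. This is an illustration rather than the code in `functions.q_three`; it assumes `is_active` is 1 for accesses made during the cohort's dates and that counting `ip` values per user is a reasonable proxy for logins. The name `low_activity_users` is hypothetical.

```python
import pandas as pd


def low_activity_users(df: pd.DataFrame, max_logins: int = 50) -> pd.DataFrame:
    """Users with at most `max_logins` logged accesses while active (illustrative)."""
    # Only consider accesses made while the student was enrolled.
    active = df[df["is_active"] == 1]

    # Number of logged accesses per user, counted on the ip column.
    logins = active.groupby("user_id")["ip"].count()

    # Keep the low-activity users, lowest counts first.
    return logins[logins <= max_logins].sort_values().to_frame()
```

Note that users with no active accesses at all never appear in the result, so a complete audit would also need a roster of enrolled students to compare against.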

# 6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?

In [10]:
#function displays the most accessed topics per graduated users
most_accessed_path_by_program = functions.q_six(df)
most_accessed_path_by_program

program_id,path,count
1,index.html,1011
2,javascript-i,4233
3,sql/mysql-overview,275
4,content/html-css,2


***Takeaways***

- SQL is the most accessed topic for data science grads, which makes sense since SQL is a highly sought-after skill.
- Web Dev grads most often accessed JavaScript (javascript-i), which makes sense since it's their densest toolset.
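
One way to isolate post-graduation traffic is to keep only accesses that occur after the cohort's `end_date` and then count paths per program. The sketch below is an assumption about the approach, not the contents of `functions.q_six`; the name `top_path_after_graduation` is hypothetical.

```python
import pandas as pd


def top_path_after_graduation(df: pd.DataFrame) -> pd.DataFrame:
    """Most-accessed path per program among post-graduation accesses (illustrative)."""
    accessed = pd.to_datetime(df["fixed_date"])
    graduated = accessed > pd.to_datetime(df["end_date"])

    # Use a plain boolean array so a duplicated timestamp index cannot interfere.
    grads = df.loc[graduated.to_numpy()]

    counts = (
        grads.groupby(["program_id", "path"])
        .size()
        .reset_index(name="count")
    )
    top = counts.loc[counts.groupby("program_id")["count"].idxmax()]
    return top.set_index("program_id")
```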

# 7. Which lessons are least accessed?

In [12]:
#function displaying cohorts and the least accessed lessons
newer_df = functions.q_seven(df)
newer_df.head()

cohort_id,path,count
1.0,2.0_Intro_Stats,1
2.0,prework/fundamentals/loop,1
4.0,prework/versioning/github,1
6.0,11._DistributedML,1
7.0,content/examples/gitbook/images/favicon.ico,1


In [13]:
# function displaying the least accessed lessons by program
newer_df2 = functions.q_seven_two(df)
newer_df2

program_id,path,count
1,cohorts/24/grades,1
2,capstones,1
3,3-sql,1
4,content/html-css/introduction.html,1


***Takeaways***

- The returned dataframe shows the least accessed lessons per program.
     - I returned the 3 least accessed for each program to get a broader view of the lessons.
     - I also created a function to return the least accessed lessons per cohort.
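
A "least accessed" table for either grouping could be sketched as follows. This is not the code behind `functions.q_seven` or `functions.q_seven_two`; the function name `least_accessed` and its `group_col` and `n` parameters are hypothetical.

```python
import pandas as pd


def least_accessed(df: pd.DataFrame, group_col: str = "program_id", n: int = 3) -> pd.DataFrame:
    """Return the n least-accessed paths within each group (illustrative)."""
    counts = (
        df.groupby([group_col, "path"])
        .size()
        .reset_index(name="count")
    )

    # Sort by count within each group; path breaks ties so the result is stable.
    ranked = counts.sort_values([group_col, "count", "path"])

    # Take the first n rows of every group.
    return ranked.groupby(group_col).head(n).set_index(group_col)
```

Calling `least_accessed(df, group_col="cohort_id", n=1)` would give the per-cohort view. Because many paths tie at a count of 1, and some of them are assets such as `favicon.ico` rather than lessons, filtering non-lesson paths first would likely make the result more meaningful.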

# 8. Anything else I should be aware of?

In [11]:
# function displays Denali cohort and their information
four = functions.fourth_cohort(df)
four

fixed_date,path,user_id,cohort_id,ip,name,start_date,end_date,created_at,updated_at,program_id,fixed_date,data,web,php,front_end,is_active
2018-02-04 11:21:38,/,85,4.0,66.42.139.162,Denali,2014-10-20,2015-01-18,2016-06-14 19:52:26,2016-06-14 19:52:26,1,2018-02-04 11:21:38,False,False,True,False,0
2018-02-04 11:23:27,mkdocs/search_index.json,85,4.0,66.42.139.162,Denali,2014-10-20,2015-01-18,2016-06-14 19:52:26,2016-06-14 19:52:26,1,2018-02-04 11:23:27,False,False,True,False,0
2018-02-04 11:24:05,prework/databases,85,4.0,66.42.139.162,Denali,2014-10-20,2015-01-18,2016-06-14 19:52:26,2016-06-14 19:52:26,1,2018-02-04 11:24:05,False,False,True,False,0
2018-02-04 11:24:56,prework/versioning/github,85,4.0,66.42.139.162,Denali,2014-10-20,2015-01-18,2016-06-14 19:52:26,2016-06-14 19:52:26,1,2018-02-04 11:24:56,False,False,True,False,0


***Takeaways***

- Although the fourth Codeup cohort (Denali) has many students, the logs show that only one user accessed the lessons.
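
A quick way to spot cohorts like this one is to count distinct users per cohort name. The helper below is a minimal sketch with a hypothetical name (`users_per_cohort`); it assumes the `name` column holds the cohort name, as in the rows above.

```python
import pandas as pd


def users_per_cohort(df: pd.DataFrame) -> pd.Series:
    """Number of distinct user ids seen in the logs for each cohort (illustrative)."""
    return df.groupby("name")["user_id"].nunique().sort_values()
```

Cohorts at the top of the result have the fewest distinct users in the logs and are candidates for a closer look.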