## Anomaly Detection Project
### Murphy and Applegate, Florence Cohort, 2021_07_22
#### First Draft Final Notebook

In [1]:
from __future__ import division
import itertools
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import pandas as pd
import math
from sklearn import metrics
from random import randint
from matplotlib import style
import seaborn as sns

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler

from env import host, user, password
import acquire
import prepare
import explore

## Project Planning

## Executive Summary

## Data Acquisition

In [2]:
# Bring the data in
df = acquire.get_cohort_curr_data()

In [3]:
# What does it look like?
df.head()

Unnamed: 0,date,time,path,user_id,cohort_id,ip,id,name,slack,start_date,end_date,created_at,updated_at,deleted_at,program_id
0,2018-01-26,09:55:03,/,1,8.0,97.105.19.61,8,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,1
1,2018-01-26,09:56:02,java-ii,1,8.0,97.105.19.61,8,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,1
2,2018-01-26,09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61,8,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,1
3,2018-01-26,09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61,8,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,,1
4,2018-01-26,09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61,22,Teddy,#teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,,2


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 847330 entries, 0 to 847329
Data columns (total 15 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   date        847330 non-null  object 
 1   time        847330 non-null  object 
 2   path        847329 non-null  object 
 3   user_id     847330 non-null  int64  
 4   cohort_id   847330 non-null  float64
 5   ip          847330 non-null  object 
 6   id          847330 non-null  int64  
 7   name        847330 non-null  object 
 8   slack       847330 non-null  object 
 9   start_date  847330 non-null  object 
 10  end_date    847330 non-null  object 
 11  created_at  847330 non-null  object 
 12  updated_at  847330 non-null  object 
 13  deleted_at  0 non-null       float64
 14  program_id  847330 non-null  int64  
dtypes: float64(2), int64(3), object(10)
memory usage: 103.4+ MB


### Data Acquisition Key Findings, Takeaways, & Next Steps:
- Initial data set is 847,330 rows by 15 columns
- One null value in 'path', and the 'deleted_at' column is entirely null
- Data Preparation To-Do:
    - Concatenate 'date' and 'time', convert to datetime, and reset as index.
    - Convert all time-bound variables to datetime format
    - Drop unnecessary columns
    - Drop null values
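
The to-do list above could be sketched roughly as follows. This is a minimal illustration, not the actual `prepare.initial_prep` (which is not shown in this notebook): `initial_prep_sketch` and the sample frame are hypothetical. The column renames are inferred from the prepared output, and adding the 'program' column is left out.

```python
import pandas as pd


def initial_prep_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for prepare.initial_prep (the real function is not shown)."""
    df = df.copy()
    # Concatenate 'date' and 'time', convert to datetime, and set as the index
    df["dt"] = pd.to_datetime(df["date"] + " " + df["time"])
    df = df.drop(columns=["date", "time"]).set_index("dt")
    # Convert the remaining time-bound columns to datetime
    for col in ["start_date", "end_date", "created_at", "updated_at"]:
        df[col] = pd.to_datetime(df[col])
    # Drop the all-null column first, then any rows that still contain nulls
    df = df.drop(columns=["deleted_at"]).dropna()
    # Assumed renames, matching the column names seen after preparation
    return df.rename(columns={"path": "endpoint", "name": "cohort"})


# Tiny made-up sample: the second row has a null path and should be dropped
sample = pd.DataFrame({
    "date": ["2018-01-26", "2018-01-26"],
    "time": ["09:55:03", "09:56:02"],
    "path": ["/", None],
    "name": ["Hampton", "Hampton"],
    "start_date": ["2015-09-22", "2015-09-22"],
    "end_date": ["2016-02-06", "2016-02-06"],
    "created_at": ["2016-06-14 19:52:26", "2016-06-14 19:52:26"],
    "updated_at": ["2016-06-14 19:52:26", "2016-06-14 19:52:26"],
    "deleted_at": [None, None],
})
prepped = initial_prep_sketch(sample)
print(prepped.shape)
```

Dropping 'deleted_at' before calling `dropna()` matters: with the all-null column still present, `dropna()` would remove every row.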

## Data Preparation

In [5]:
# initial_prep function takes care of the Data Preparation To-Do List
df = prepare.initial_prep(df)

In [6]:
# What does it look like now?
df.head()

dt,endpoint,user_id,cohort_id,ip,id,cohort,slack,start_date,end_date,created_at,updated_at,program_id,program
2018-01-26 09:55:03,/,1,8.0,97.105.19.61,8,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,1,full_stack_php
2018-01-26 09:56:02,java-ii,1,8.0,97.105.19.61,8,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,1,full_stack_php
2018-01-26 09:56:05,java-ii/object-oriented-programming,1,8.0,97.105.19.61,8,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,1,full_stack_php
2018-01-26 09:56:06,slides/object_oriented_programming,1,8.0,97.105.19.61,8,Hampton,#hampton,2015-09-22,2016-02-06,2016-06-14 19:52:26,2016-06-14 19:52:26,1,full_stack_php
2018-01-26 09:56:24,javascript-i/conditionals,2,22.0,97.105.19.61,22,Teddy,#teddy,2018-01-08,2018-05-17,2018-01-08 13:59:10,2018-01-08 13:59:10,2,java


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 847329 entries, 2018-01-26 09:55:03 to 2021-04-21 16:44:39
Data columns (total 13 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   endpoint    847329 non-null  object        
 1   user_id     847329 non-null  int64         
 2   cohort_id   847329 non-null  float64       
 3   ip          847329 non-null  object        
 4   id          847329 non-null  int64         
 5   cohort      847329 non-null  object        
 6   slack       847329 non-null  object        
 7   start_date  847329 non-null  datetime64[ns]
 8   end_date    847329 non-null  datetime64[ns]
 9   created_at  847329 non-null  datetime64[ns]
 10  updated_at  847329 non-null  datetime64[ns]
 11  program_id  847329 non-null  int64         
 12  program     847329 non-null  object        
dtypes: datetime64[ns](4), float64(1), int64(3), object(5)
memory usage: 90.5+ MB


### Data Preparation Key Findings, Takeaways, & Next Steps:
- After initial_prep, the data set is 847,329 rows by 13 columns
- 'date' and 'time' have been concatenated, converted to datetime, and set as the index
- All time-bound variables have been converted to datetime
- 'deleted_at' was dropped, since it only contained null values
- 'program' was added to give a name to each program_id

## Data Exploration

### 1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?

In [8]:
# split into four dataframes, one for each program
fsp, jv, ds, fep = explore.split_by_program(df)
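
One way such a split could work is to group on the 'program' column. This is a sketch only: `split_by_program_sketch` is a hypothetical name, and the real `explore.split_by_program` returns four DataFrames rather than a dictionary. Grouping on the column avoids hard-coding program labels that are not shown above.

```python
import pandas as pd


def split_by_program_sketch(df: pd.DataFrame) -> dict:
    """Return one DataFrame per distinct value in the 'program' column."""
    return {name: group for name, group in df.groupby("program")}


# Made-up rows using the two program labels visible in df.head()
sample = pd.DataFrame({
    "endpoint": ["/", "java-ii", "javascript-i/conditionals"],
    "program": ["full_stack_php", "full_stack_php", "java"],
})
by_program = split_by_program_sketch(sample)
print({name: len(frame) for name, frame in by_program.items()})
```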

#### Full Stack Program

In [10]:
# uses our prep_one function to get data ready for exploration
cohort_df, cohort_list = explore.prep_one(fsp)

In [11]:
# uses our print_one function to visualize results
explore.print_one(cohort_df, cohort_list)

-----------------
cohort  endpoint    
Lassen  index.html      877
        javascript-i    233
        java-iii        224
Name: endpoint, dtype: int64
-----------------
-----------------
cohort  endpoint     
Arches  javascript-i     294
        html-css         215
        javascript-ii    204
Name: endpoint, dtype: int64
-----------------
-----------------
cohort   endpoint    
Olympic  javascript-i    128
         java-i           76
         jquery           71
Name: endpoint, dtype: int64
-----------------
-----------------
cohort  endpoint                                        
Kings   index.html                                          84
        content/laravel/intro                               83
        content/laravel/intro/application-structure.html    63
Name: endpoint, dtype: int64
-----------------
-----------------
cohort   endpoint
Hampton  java-iii    57
         appendix    55
         java-i      46
Name: endpoint, dtype: int64
-----------------
----------------

#### Full Stack Program Key Findings
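- In the output shown above, javascript-i is the most frequent lesson (in the top three for Lassen, Arches, and Olympic); java-i appears in the top three for Olympic and Hampton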
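
The per-cohort top-three listings above can be approximated with a grouped `value_counts`. This is a sketch assuming the prepared frame has 'cohort' and 'endpoint' columns. `top_endpoints_per_cohort` is a hypothetical helper, not the `prep_one` or `print_one` functions used in this notebook.

```python
import pandas as pd


def top_endpoints_per_cohort(df: pd.DataFrame, n: int = 3) -> pd.Series:
    """Count hits per (cohort, endpoint) and keep the n largest per cohort."""
    # value_counts on a grouped Series sorts descending within each cohort
    counts = df.groupby("cohort")["endpoint"].value_counts()
    # head(n) per cohort keeps only the top n endpoints of each group
    return counts.groupby(level=0).head(n)


# Made-up access log with distinct counts so the ranking is unambiguous
sample = pd.DataFrame({
    "cohort": ["A"] * 10 + ["B"] * 3,
    "endpoint": ["x"] * 4 + ["y"] * 3 + ["z"] * 2 + ["w"] + ["p"] * 2 + ["q"],
})
top = top_endpoints_per_cohort(sample)
print(top)
```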

#### Java Program

In [12]:
# uses our prep_one function to get data ready for exploration
cohort_df, cohort_list = explore.prep_one(jv)

In [13]:
# uses our print_one function to visualize results
explore.print_one(cohort_df, cohort_list)

-----------------
cohort  endpoint    
Staff   javascript-i    1817
        spring          1403
        java-iii        1393
Name: endpoint, dtype: int64
-----------------
-----------------
cohort  endpoint                
Ceres   search/search_index.json    1380
        javascript-i                1003
        toc                          911
Name: endpoint, dtype: int64
-----------------
-----------------
cohort  endpoint    
Zion    toc             1465
        javascript-i     897
        java-iii         753
Name: endpoint, dtype: int64
-----------------
-----------------
cohort   endpoint                
Jupiter  toc                         1866
         search/search_index.json     998
         javascript-i                 926
Name: endpoint, dtype: int64
-----------------
-----------------
cohort   endpoint                
Fortuna  toc                         1293
         search/search_index.json    1020
         java-iii                     786
Name: endpoint, dtype: int64
-

#### Java Program Key Findings
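- javascript-i appears to be the most frequent lesson; it is in the top three for Staff, Ceres, Zion, and Jupiter (toc and search/search_index.json rank higher in some cohorts but are navigation and search endpoints, not lessons)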
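
To make "consistently across cohorts" more concrete, one option is to count how many cohorts have each endpoint in their top three. This is an illustrative sketch, not part of the `explore` module. `top_n_coverage` is a hypothetical helper and the sample data is made up.

```python
import pandas as pd


def top_n_coverage(df: pd.DataFrame, n: int = 3) -> pd.Series:
    """Number of cohorts in which each endpoint ranks among the top n."""
    counts = df.groupby("cohort")["endpoint"].value_counts()
    top = counts.groupby(level=0).head(n)
    # Level 1 of the index holds the endpoint; count its appearances across cohorts
    return pd.Series(top.index.get_level_values(1)).value_counts()


# Made-up access log for two cohorts
sample = pd.DataFrame({
    "cohort": ["A"] * 10 + ["B"] * 6,
    "endpoint": (["x"] * 4 + ["y"] * 3 + ["z"] * 2 + ["w"]
                 + ["x"] * 3 + ["q"] * 2 + ["r"]),
})
coverage = top_n_coverage(sample)
print(coverage)
```

An endpoint that makes the top three in most cohorts is a stronger candidate for "most traffic consistently" than one with a single large count in one cohort.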

#### Data Science Program

In [14]:
# uses our prep_one function to get data ready for exploration
cohort_df, cohort_list = explore.prep_one(ds)

In [15]:
# uses our print_one function to visualize results
explore.print_one(cohort_df, cohort_list)

-----------------
cohort  endpoint                                
Darden  classification/overview                     1109
        classification/scale_features_or_not.svg     943
        sql/mysql-overview                           774
Name: endpoint, dtype: int64
-----------------
-----------------
cohort  endpoint                                
Bayes   1-fundamentals/modern-data-scientist.jpg    650
        1-fundamentals/AI-ML-DL-timeline.jpg        648
        1-fundamentals/1.1-intro-to-data-science    640
Name: endpoint, dtype: int64
-----------------
-----------------
cohort  endpoint                                
Curie   6-regression/1-overview                     595
        search/search_index.json                    538
        1-fundamentals/modern-data-scientist.jpg    467
Name: endpoint, dtype: int64
-----------------
-----------------
cohort  endpoint                                                     
Easley  classification/scale_features_or_not.svg               

#### Data Science Program Key Findings
- Fundamentals was the most frequently occurring lesson among the cohorts
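
Several of the top endpoints above are static assets (`.jpg`, `.svg`, `.json`) rather than lesson pages. A possible refinement is to filter them out before ranking. The suffix list and the helper name `drop_asset_endpoints` are assumptions for illustration, not something the notebook's own functions are known to do.

```python
import pandas as pd

# Assumed pattern for non-lesson asset requests; adjust to the actual data
ASSET_PATTERN = r"\.(?:jpe?g|png|svg|ico|json)$"


def drop_asset_endpoints(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows whose endpoint ends in an image or JSON file extension."""
    is_asset = df["endpoint"].str.contains(ASSET_PATTERN, case=False, regex=True, na=False)
    return df[~is_asset]


# Endpoints taken from the output above
sample = pd.DataFrame({
    "endpoint": [
        "classification/overview",
        "classification/scale_features_or_not.svg",
        "search/search_index.json",
        "1-fundamentals/modern-data-scientist.jpg",
        "toc",
    ]
})
lessons_only = drop_asset_endpoints(sample)
print(lessons_only["endpoint"].tolist())
```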

#### Front End Program

In [16]:
# uses our prep_one function to get data ready for exploration
cohort_df, cohort_list = explore.prep_one(fep)

In [17]:
# uses our print_one function to visualize results
explore.print_one(cohort_df, cohort_list)

-----------------
cohort  endpoint                                   
Apollo  content/html-css                               2
        content/html-css/gitbook/images/favicon.ico    1
        content/html-css/introduction.html             1
Name: endpoint, dtype: int64
-----------------


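#### Front End Program Key Findings
- content/html-css is the most frequently occurring lesson, but Apollo logged at most 2 requests per endpoint, so this finding carries little weight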