# Access Not Found: Anomaly Detection Project

## Goal
* Detect outliers in the curriculum access logs and email the findings to the boss ahead of the board meeting on Friday morning.
* Answer the questions listed below and surface any other notable findings in the observed request counts.

## Imports

In [1]:
#.py modules
import wrangle as wr
import explore as ex

#numbers
import pandas as pd
import numpy as np

#vizzes
import matplotlib.pyplot as plt
import seaborn as sns

Imports Successful


> # `Wrangle`

### Acquire
* Data acquired from the Codeup MySQL server using `env.py` credentials
* Each row represents one visitor request to the curriculum
* Each column represents a feature of that request
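
The acquire step can be sketched as below. This is only an illustration of the "query once, then reuse a cached CSV" pattern; the code behind `wr.get_logs` is not shown in this notebook. The helper names `build_url` and `get_logs_cached`, the `pymysql` driver, and the argument names are assumptions.

```python
import os

import pandas as pd


def build_url(user, password, host, db):
    """Build a SQLAlchemy-style MySQL URL (pymysql driver is an assumption)."""
    return f"mysql+pymysql://{user}:{password}@{host}/{db}"


def get_logs_cached(filename, query, url):
    """Read the cached CSV if it exists; otherwise query the server and cache it."""
    if os.path.exists(filename):
        return pd.read_csv(filename, index_col=0)
    # Only reached when no cache exists; needs SQLAlchemy and a MySQL driver.
    df = pd.read_sql(query, url)
    df.to_csv(filename)
    return df
```

The credentials passed to `build_url` would come from `env.py`, which keeps them out of the notebook itself.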

### Prepare
* Checked column data types
    * changed total charges from an object to a float
* Checked for nulls
    * total charges contained 11 nulls for new customers
    * imputed the corresponding monthly charges value
* Encoded categoricals
* Split data into train, validate and test (60/20/20)
    * target = 'churn'
* Outliers have not been removed for this iteration of the project

### Data Dictionary

| Feature | Definition |
|:--------|:-----------|
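| path | Requested curriculum page path |
| user_id | Numeric ID of the user who made the request |
| cohort_id | Numeric ID of the user's cohort |
| ip | IP address the request came from |
| cohort | Cohort name (e.g. Hampton, Teddy) |
| start_date | Cohort start date |
| end_date | Cohort end (graduation) date |
| created | Record creation timestamp from the source database |
| updated | Record update timestamp from the source database |
| program_id | Numeric ID of the program |
| access_date | Date and time of the request |
| program | Program name (e.g. web dev, data science, frontend) |
| lesson | Leading segment of a multi-segment `path` (empty for `/` and single-segment paths) |
| endpoint | Final segment of `path` |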

In [2]:
# acquire the data
df = wr.get_logs('logs.csv')

# clean the data and save it back to df
df = wr.prep_logs(df)

In [3]:
# preview the first rows
df.head()

| | path | user_id | cohort_id | ip | cohort | start_date | end_date | created | updated | program_id | access_date | program | lesson | endpoint |
|:--|:-----|--------:|----------:|:---|:-------|:-----------|:---------|:--------|:--------|-----------:|:------------|:--------|:-------|:---------|
| 0 | / | 1 | 8 | 97.105.19.61 | Hampton | 2015-09-22 | 2016-02-06 | 2016-06-14 19:52:26 | 2016-06-14 19:52:26 | 1 | 2018-01-26 09:55:03 | web dev | | |
| 1 | java-ii | 1 | 8 | 97.105.19.61 | Hampton | 2015-09-22 | 2016-02-06 | 2016-06-14 19:52:26 | 2016-06-14 19:52:26 | 1 | 2018-01-26 09:56:02 | web dev | | java-ii |
| 2 | java-ii/object-oriented-programming | 1 | 8 | 97.105.19.61 | Hampton | 2015-09-22 | 2016-02-06 | 2016-06-14 19:52:26 | 2016-06-14 19:52:26 | 1 | 2018-01-26 09:56:05 | web dev | java-ii | object-oriented-programming |
| 3 | slides/object_oriented_programming | 1 | 8 | 97.105.19.61 | Hampton | 2015-09-22 | 2016-02-06 | 2016-06-14 19:52:26 | 2016-06-14 19:52:26 | 1 | 2018-01-26 09:56:06 | web dev | slides | object_oriented_programming |
| 4 | javascript-i/conditionals | 2 | 22 | 97.105.19.61 | Teddy | 2018-01-08 | 2018-05-17 | 2018-01-08 13:59:10 | 2018-01-08 13:59:10 | 2 | 2018-01-26 09:56:24 | web dev | javascript-i | conditionals |


In [4]:
# summary statistics for the numeric columns
df.describe()

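| | user_id | cohort_id | program_id |
|:------|--------:|----------:|-----------:|
| count | 847330.0 | 847330.0 | 847330.0 |
| mean | 456.707344 | 48.501049 | 2.086004 |
| std | 250.734201 | 32.795482 | 0.388231 |
| min | 1.0 | 1.0 | 1.0 |
| 25% | 263.0 | 28.0 | 2.0 |
| 50% | 476.0 | 33.0 | 2.0 |
| 75% | 648.0 | 57.0 | 2.0 |
| max | 981.0 | 139.0 | 4.0 |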


> # `Explore`

### Questions To Answer:
1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?
2. Is there a cohort that referred to a lesson significantly more than other cohorts, which seemed to gloss over it?
3. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?
4. What topics are grads continuing to reference after graduation and into their jobs (for each program)?
5. Which lessons are least accessed?
6. Anything else of note...

## 1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?

In [5]:
#hypothesize

In [6]:
#analyze

In [7]:
#visualize
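
One possible way to approach this question is sketched below; it is not the analysis performed in this notebook. It assumes "consistently" means "ranks near the top in every cohort of the program", and the function name `top_lesson_per_program` is hypothetical. Only the `program`, `cohort`, and `lesson` columns shown in `df.head()` are used.

```python
import pandas as pd


def top_lesson_per_program(df):
    """Per program, return the lesson with the best mean rank across cohorts."""
    # Drop requests with no lesson (e.g. the root path).
    logs = df[df["lesson"].notna() & (df["lesson"] != "")]
    result = {}
    for program, group in logs.groupby("program"):
        # lesson x cohort request counts; 0 where a cohort never opened a lesson
        counts = group.groupby(["lesson", "cohort"]).size().unstack(fill_value=0)
        # Rank lessons within each cohort column (1 = most requested).
        ranks = counts.rank(ascending=False, method="min")
        # The lowest mean rank marks the most consistently popular lesson.
        result[program] = ranks.mean(axis=1).idxmin()
    return pd.Series(result, name="top_lesson")
```

Ranking within each cohort, rather than summing raw counts, keeps one very large or very active cohort from deciding the answer on its own.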

## 2. Is there a cohort that referred to a lesson significantly more than other cohorts, which seemed to gloss over it?

In [8]:
#hypothesize


In [9]:
#analyze
# lesson x cohort table of request counts (0 where a cohort never opened a lesson)
observed_counts = df.groupby(['lesson', 'cohort']).size().unstack(fill_value=0)
outliers, non_outliers = ex.detect_outliers(observed_counts)
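
The body of `ex.detect_outliers` is not shown here, so the block below is only a sketch of one rule that would produce a similar two-way split. It assumes an IQR fence computed per lesson: a lesson counts as an outlier when at least one cohort's request count lies above `Q3 + k * IQR` of that lesson's counts across cohorts. The name `split_outlier_rows` and the default `k=1.5` are assumptions.

```python
import pandas as pd


def split_outlier_rows(counts, k=1.5):
    """Split a lesson x cohort count table by a per-lesson upper IQR fence."""
    # Quartiles of each lesson's counts across cohorts (one value per row).
    q1 = counts.quantile(0.25, axis=1)
    q3 = counts.quantile(0.75, axis=1)
    upper = q3 + k * (q3 - q1)
    # True where a cohort's count exceeds the fence for that lesson.
    exceeds = counts.gt(upper, axis=0)
    has_outlier = exceeds.any(axis=1)
    return counts[has_outlier], counts[~has_outlier]
```

Applied to `observed_counts`, `exceeds` would also show which cohort triggered the flag, which is the detail the question asks about.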

In [30]:
outliers.sample(10)

| lesson | Andromeda | Apex | Apollo | Arches | Badlands | Bash | Bayes | Betelgeuse | Ceres | Curie | ... | Quincy | Sequoia | Staff | Teddy | Ulysses | Voyageurs | Wrangell | Xanadu | Yosemite | Zion |
|:-------|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| java-i | 992 | 1641 | 0 | 290 | 0 | 671 | 1 | 1500 | 1615 | 0 | ... | 36 | 300 | 1714 | 1140 | 1149 | 1658 | 1195 | 1312 | 788 | 1732 |
| java-iii | 1684 | 1952 | 0 | 543 | 0 | 994 | 0 | 1728 | 2176 | 0 | ... | 0 | 157 | 2960 | 2049 | 1535 | 2387 | 1446 | 1748 | 1336 | 2091 |
| prework | 4 | 5 | 0 | 15 | 2 | 22 | 0 | 4 | 6 | 0 | ... | 0 | 2 | 13 | 7 | 8 | 31 | 2 | 0 | 3 | 6 |
| search | 318 | 1497 | 0 | 45 | 0 | 660 | 588 | 761 | 1380 | 538 | ... | 6 | 45 | 1349 | 103 | 142 | 328 | 504 | 577 | 361 | 700 |
| css-i | 866 | 642 | 0 | 241 | 0 | 410 | 0 | 782 | 1221 | 0 | ... | 18 | 84 | 1005 | 111 | 737 | 796 | 819 | 735 | 582 | 1077 |
| classification | 0 | 0 | 0 | 0 | 0 | 0 | 260 | 0 | 0 | 400 | ... | 0 | 0 | 2829 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| introduction | 815 | 794 | 0 | 212 | 0 | 450 | 0 | 1075 | 1236 | 0 | ... | 17 | 111 | 825 | 466 | 809 | 905 | 947 | 785 | 534 | 1015 |
| images | 46 | 0 | 1 | 54 | 1 | 0 | 0 | 55 | 10 | 0 | ... | 92 | 16 | 135 | 72 | 80 | 63 | 17 | 14 | 33 | 7 |
| versioning | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| javascript-i | 2052 | 2074 | 0 | 578 | 0 | 1526 | 1 | 2640 | 3342 | 0 | ... | 22 | 332 | 2596 | 2235 | 2164 | 2257 | 2173 | 1996 | 1640 | 2570 |


In [29]:
non_outliers.sample(10)

| lesson | Andromeda | Apex | Apollo | Arches | Badlands | Bash | Bayes | Betelgeuse | Ceres | Curie | ... | Quincy | Sequoia | Staff | Teddy | Ulysses | Voyageurs | Wrangell | Xanadu | Yosemite | Zion |
|:-------|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| elements | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| extra-features | 314 | 222 | 0 | 64 | 0 | 144 | 1 | 209 | 184 | 0 | ... | 0 | 133 | 328 | 347 | 191 | 113 | 140 | 124 | 169 | 122 |
| servlets | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| flexbox | 8 | 427 | 0 | 0 | 0 | 265 | 0 | 31 | 749 | 0 | ... | 0 | 9 | 667 | 17 | 3 | 0 | 1 | 45 | 4 | 67 |
| Classification_DecisionTree_files | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| control-statements-and-loops | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 26 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| tools | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| javascript | 193 | 194 | 0 | 64 | 0 | 134 | 0 | 217 | 267 | 0 | ... | 7 | 47 | 224 | 375 | 209 | 349 | 228 | 214 | 222 | 330 |


In [10]:
#visualize

## 3. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?

In [11]:
#hypothesize

In [12]:
#analyze

In [13]:
#visualize
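
A minimal sketch of one way to look for low-activity students, not the analysis run in this notebook. It assumes "active" means the request falls between the cohort's `start_date` and the end of its `end_date`, and the name `least_active_students` is hypothetical.

```python
import pandas as pd


def least_active_students(df, n=10):
    """Count requests made while the cohort was in session; return the n lowest."""
    access = pd.to_datetime(df["access_date"])
    start = pd.to_datetime(df["start_date"])
    end = pd.to_datetime(df["end_date"])
    # Keep requests from the cohort start through the end of the last day.
    active = df[(access >= start) & (access < end + pd.Timedelta(days=1))]
    hits = active.groupby(["user_id", "cohort"]).size().rename("requests")
    return hits.nsmallest(n).reset_index()
```

Users with no requests at all during their cohort dates never appear in the logs for that window, so they are absent from this result rather than listed with a count of zero.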

## 4. What topics are grads continuing to reference after graduation and into their jobs (for each program)?

In [14]:
#hypothesize

In [15]:
#analyze web dev program: requests made after the cohort end date
webdev_grad_mask = (df['end_date'] < df['access_date']) & (df['program'] == 'web dev')
grads_webdev = df[webdev_grad_mask].groupby(['program', 'lesson']).size().sort_values(ascending=False)
grads_webdev.head(5)

program  lesson      
web dev                  14131
         fundamentals    11468
         mysql            8662
         javascript-i     8098
         java-ii          7510
dtype: int64

In [16]:
#analyze data science program: requests made after the cohort end date
ds_grad_mask = (df['end_date'] < df['access_date']) & (df['program'] == 'data science')
grads_ds = df[ds_grad_mask].groupby(['program', 'lesson']).size().sort_values(ascending=False)
grads_ds.head(5)

program       lesson        
data science                    1446
              sql               1046
              classification    1036
              fundamentals       972
              python             615
dtype: int64

In [17]:
#analyze frontend program: requests made after the cohort end date
fend_grad_mask = (df['end_date'] < df['access_date']) & (df['program'] == 'frontend')
grads_fend = df[fend_grad_mask].groupby(['program', 'lesson']).size().sort_values(ascending=False)
grads_fend

program   lesson  
frontend  content     2
                      1
          html-css    1
          images      1
dtype: int64
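
The three per-program cells above repeat the same filter, and each result is topped or joined by a row with an empty `lesson`. A possible consolidation is sketched below; the function name `top_grad_lessons` is hypothetical, and dropping empty lessons is an assumption about what is useful to report.

```python
import pandas as pd


def top_grad_lessons(df, n=5):
    """Top n lessons per program among requests made after the cohort end date."""
    after_grad = pd.to_datetime(df["access_date"]) > pd.to_datetime(df["end_date"])
    # Skip requests that carry no lesson name.
    has_lesson = df["lesson"].notna() & (df["lesson"] != "")
    counts = (
        df[after_grad & has_lesson]
        .groupby(["program", "lesson"])
        .size()
        .rename("requests")
        .reset_index()
        .sort_values(["program", "requests"], ascending=[True, False])
    )
    # Rows are already sorted, so head(n) keeps the n busiest lessons per program.
    return counts.groupby("program").head(n).reset_index(drop=True)
```

Calling `top_grad_lessons(df)` would return one table covering every program, which is easier to paste into an email than three separate outputs.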

In [18]:
#visualize

## 5. Which lessons are least accessed?

In [19]:
#hypothesize

In [20]:
#analyze
least_lessons = df.groupby('lesson').size().sort_values(ascending=True)
least_lessons.head(20)

lesson
servlets                  1
style                     1
743                       1
882                       1
912                       1
918                       1
A-clustering              1
PreWork                   1
sgithubtudents            1
services                  1
b-clustering              1
bayes-capstones           1
requests-and-responses    1
quize                     1
capsones                  1
quic                      1
query                     1
project                   1
cls                       1
loops                     1
dtype: int64
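
Many of the single-hit names above look like typos or numeric paths (`743`, `quize`, `capsones`) rather than real lessons. The sketch below shows one way to screen them out before ranking; the thresholds and the name `least_accessed_lessons` are assumptions, not part of the original analysis.

```python
import pandas as pd


def least_accessed_lessons(df, min_users=2, n=20):
    """Least-requested lessons, skipping numeric names and one-user lessons."""
    lessons = df[df["lesson"].notna() & (df["lesson"] != "")]
    # Numeric-only names such as "743" are unlikely to be real lessons.
    lessons = lessons[~lessons["lesson"].astype(str).str.fullmatch(r"\d+")]
    stats = lessons.groupby("lesson").agg(
        requests=("user_id", "size"),
        users=("user_id", "nunique"),
    )
    # Require several distinct users so a single mistyped URL is ignored.
    return stats[stats["users"] >= min_users].sort_values("requests").head(n)
```

Requiring at least two distinct users is a blunt filter; a lesson opened by a single user is dropped along with the typos, so `min_users` may need tuning.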

In [21]:
#visualize

## 6. Anything else of note...

In [22]:
#hypothesize

In [23]:
#analyze

In [24]:
#visualize

> # `Conclusion`

## Explore

* 

* 


## Recommendations

* 