# Anomaly Mini Project

## Project Goals
Answer these questions:

1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?
2. Is there a cohort that referred to a lesson significantly more often than other cohorts, which seemed to gloss over it?
3. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?
4. Is there any suspicious activity, such as users, machines, etc. accessing the curriculum who shouldn’t be? Does it appear that any web scraping is happening? Are there any suspicious IP addresses?
5. At some point in 2019, the ability for students and alumni to access both curricula (web dev to ds, ds to web dev) should have been shut off. Do you see any evidence of that happening? Did it happen before?
6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?
7. Which lessons are least accessed?
8. Anything else I should be aware of?

In [1]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

import env
import os
import wrangle

# DBSCAN import
from sklearn.cluster import DBSCAN

# Scaler import
from sklearn.preprocessing import MinMaxScaler

In [2]:
# load data
df = wrangle.prep_logs()

## User Defined Functions

In [4]:
def one_program_df_prep(df, program_name):
    '''
    This function returns a series of daily page-view counts for a single program, excluding staff activity
    '''
    df = df[df.program == program_name]
    df = df[df.name != 'Staff']
    # resample() below requires a DatetimeIndex; this assumes wrangle.prep_logs() already set one
    pages_one_program = df['path'].resample('d').count()
    return pages_one_program

def compute_pct_b(pages_one_program, span, weight, program_name):
    '''
    This function computes the Bollinger bands and %b for the daily page views of a single program
    '''
    # Calculate upper and lower bollinger band
    midband = pages_one_program.ewm(span=span).mean()
    stdev = pages_one_program.ewm(span=span).std()
    ub = midband + stdev*weight
    lb = midband - stdev*weight
    
    # Add upper and lower band values to dataframe
    bb = pd.concat([ub, lb], axis=1)
    
    # Combine all data into a single dataframe
    my_df = pd.concat([pages_one_program, midband, bb], axis=1)
    my_df.columns = ['pages_one_program', 'midband', 'ub', 'lb']
    
    # Calculate percent b and add the relevant program name to the dataframe
    my_df['pct_b'] = (my_df['pages_one_program'] - my_df['lb'])/(my_df['ub'] - my_df['lb'])
    my_df['program'] = program_name
    return my_df

def plot_bands(my_df, program):
    '''
    This function plots the Bollinger bands of the daily page views for a single program
    '''
    fig, ax = plt.subplots(figsize=(12,8))
    ax.plot(my_df.index, my_df.pages_one_program, label='Number of Pages, Program: '+str(program))
    ax.plot(my_df.index, my_df.midband, label = 'EMA/midband')
    ax.plot(my_df.index, my_df.ub, label = 'Upper Band')
    ax.plot(my_df.index, my_df.lb, label = 'Lower Band')
    ax.legend(loc='best')
    ax.set_ylabel('Number of Pages')
    plt.show()
    
def find_anomalies(df, program, span, weight, plot=False):
    '''
    This function returns the records where a program's daily activity exceeded the upper limit of a Bollinger band range
    '''
    
    # Reduce dataframe to daily page counts for a single program
    pages_one_program = one_program_df_prep(df, program)
    
    # Add bollinger band data to dataframe
    my_df = compute_pct_b(pages_one_program, span, weight, program)
    
    # Plot data if requested (plot=True)
    if plot:
        plot_bands(my_df, program)
    
    # Return only records that sit outside of bollinger band upper limit
    return my_df[my_df.pct_b>1]
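
To make the %b logic above concrete, here is a minimal, self-contained sketch that applies the same calculation to synthetic data. The series `pages`, the injected spike on day 45, and the values `span=30` and `weight=3` are all assumptions chosen for illustration; they are not taken from the curriculum logs.

```python
import pandas as pd

# Hypothetical data: 60 days of page counts following a simple weekly pattern
# (100, 105, ..., 130), with one artificial spike injected on day 45.
idx = pd.date_range("2019-01-01", periods=60, freq="D")
pages = pd.Series([100 + 5 * (i % 7) for i in range(60)], index=idx, name="pages")
pages.iloc[45] = 1000

# Illustrative parameters (assumed, not the values used in the analysis)
span, weight = 30, 3

# Same steps as compute_pct_b: exponentially weighted mean and std, then bands
midband = pages.ewm(span=span).mean()
stdev = pages.ewm(span=span).std()
ub = midband + weight * stdev
lb = midband - weight * stdev

# %b is 0 at the lower band and 1 at the upper band; values above 1 sit
# outside the upper band
pct_b = (pages - lb) / (ub - lb)

# Same filter as find_anomalies
anomalies = pages[pct_b > 1]
print(anomalies)
```

Because the exponentially weighted standard deviation includes the current day's value, a spike inflates its own upper band. A day is flagged only when it is large relative to the recent history, which is why `weight` and `span` need tuning per program.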

## 1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?