# [Project - Anomaly Detection](https://classroom.google.com/u/1/c/NTI2MzQ4Nzc3NTk3/a/NTI2MzQ4Nzc3NjUx/details)

In [None]:
#################################################
#################### Imports ####################
#################################################

# ---------------- #
# Common Libraries #
# ---------------- #
      
# Standard Imports
import os
import requests
import numpy as np
import pandas as pd

# Working with Dates & Times
from sklearn.model_selection import TimeSeriesSplit
from datetime import timedelta, datetime

# Working with Math & Stats
import statsmodels.api as sm
import scipy.stats as stats

# to evaluate performance using rmse
from sklearn.metrics import mean_squared_error
from math import sqrt 

# holt's linear trend model. 
from statsmodels.tsa.api import Holt

# Plots, Graphs, & Visualization
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
from matplotlib.dates import DateFormatter

# plotting defaults
plt.rc('figure', figsize=(13, 7))
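# 'seaborn-whitegrid' was renamed in matplotlib 3.6; on older versions use 'seaborn-whitegrid'
plt.style.use('seaborn-v0_8-whitegrid')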
plt.rc('font', size=16)

# -------------- #
# Action Imports #
# -------------- #

# Warnings 
import warnings 
warnings.filterwarnings("ignore")

# ------------ #
# JUPYTER ONLY #
# ------------ #
    
# Disable autosave
%autosave 0

# Increases Display Resolution for Graphs 
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'

# Left Align Tables in Jupyter Notebook
from IPython.display import HTML
table_css = 'table {align:left;display:block}'
# Wrap the CSS in a <style> tag so the notebook applies it instead of printing it
HTML('<style>{}</style>'.format(table_css))

# ------------- #
# Local Imports #
# ------------- #

# importing sys
import sys

# adding 00_helper_files to the system path
sys.path.insert(0, '/Users/qmcbt/codeup-data-science/00_helper_files')

# env containing sensitive access credentials
import env
from env import user, password, host

# Import Helper Modules
import QMCBT_00_quicktips as qt
import QMCBT_01_acquire as acq
import QMCBT_02_prepare as prep
import QMCBT_03_explore as exp
import QMCBT_04_visualize as viz
import QMCBT_05_model as mod
import QMCBT_wrangle as w

# SYLLABUS
* To earn 100 on this project, you only need to answer 5 of the 7 questions (along with the other deliverables listed below, i.e., the slide, your notebook, etc.).
* Send your email to datascience@codeup.com before the due date and time (only one team member needs to do this on behalf of the whole team).
* Submit a link to a final notebook on GitHub that asks and answers questions; document the work you do to justify your findings.
* Compose an email with the answers to the questions and your findings; in the email, include the link to your notebook on GitHub and attach your slide.
* You will not present this, so be sure that the details your leader needs to convey and understand are clearly communicated in the email.
* Your slide should read like an executive summary and be ready to present.
* Continue to use the best practices of acquire.py, prepare.py, etc.
* Since there is no modeling to be done for this project, there is no need to split the data into train/validate/test.
* The cohort schedule is in the SQL database, and alumni.codeup.com has info as well.
* The "Teamwork with Git" handout is posted in Google Classroom.

I have some questions that I need answered before the board meeting Wednesday afternoon, and I need to be able to speak to each of them. I also need a single slide that I can incorporate into my existing presentation (Google Slides) that summarizes the most important points. My questions are listed below; however, if you discover anything else important that I didn’t think to ask, please include that as well.

In [2]:
# Download the data
url = 'https://drive.google.com/u/1/uc?id=1phD962Wrt8fetpvX-ersybPcZW3_54ma&export=download'

In [3]:
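# Python 3 replacements for the Python 2 modules urllib2 and StringIO
import gzip
import io
import urllib.request

def download_gz(url, out_file_path=None):
    # Download SEED database
    # Default output name: last URL segment with its 3-character ".gz" suffix removed
    if out_file_path is None:
        out_file_path = url.split("/")[-1][:-3]
    print('Downloading SEED Database from: {}'.format(url))
    with urllib.request.urlopen(url) as response:
        # The response body is bytes, so buffer it with BytesIO rather than StringIO
        compressed_file = io.BytesIO(response.read())
    decompressed_file = gzip.GzipFile(fileobj=compressed_file)

    # Extract SEED database (binary mode, because GzipFile.read() returns bytes)
    with open(out_file_path, 'wb') as outfile:
        outfile.write(decompressed_file.read())

    # Filter SEED database
    # ...
    return

if __name__ == "__main__":
    # Call the function defined above (the original called an undefined download())
    download_gz(url)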

1. Which lesson appears to attract the most traffic consistently across cohorts (per program)?
2. Is there a cohort that referred to a lesson significantly more often, while other cohorts seemed to gloss over it?
3. Are there students who, when active, hardly access the curriculum? If so, what information do you have about these students?
4. Is there any suspicious activity, such as users/machines/etc accessing the curriculum who shouldn’t be? Does it appear that any web-scraping is happening? Are there any suspicious IP addresses?
5. At some point in 2019, the ability for students and alumni to access both curriculums (web dev to ds, ds to web dev) should have been shut off. Do you see any evidence of that happening? Did it happen before?
6. What topics are grads continuing to reference after graduation and into their jobs (for each program)?
7. Which lessons are least accessed?
8. Anything else I should be aware of?
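
Several of these questions reduce to counting requests in the access log. The block below is a minimal sketch of two such counts, not the project's actual analysis: the most-requested lesson per program (question 1) and IP addresses that appear only rarely (one possible starting point for question 4). The column names `program`, `path`, and `ip`, the toy rows, and the helper names `top_lesson_per_program` and `rare_ips` are all assumptions made for illustration; the real log schema may differ, and this version counts total hits only rather than checking consistency across cohorts.

```python
import pandas as pd

# Toy access log; every column name and value here is hypothetical.
logs = pd.DataFrame({
    "program": ["web_dev", "web_dev", "web_dev", "ds", "ds", "ds", "ds"],
    "path": ["javascript-i", "javascript-i", "javascript-i",
             "classification", "sql", "classification", "classification"],
    "ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2",
           "10.0.0.3", "10.0.0.3", "10.0.0.4", "203.0.113.9"],
})

def top_lesson_per_program(df):
    """Return a dict mapping each program to its most-requested path."""
    hits = df.groupby(["program", "path"]).size().rename("hits").reset_index()
    # Sort so the busiest path comes first within each program, then keep it.
    hits = hits.sort_values(["program", "hits"], ascending=[True, False])
    return hits.drop_duplicates("program").set_index("program")["path"].to_dict()

def rare_ips(df, max_hits=1):
    """Return the sorted IPs seen at most `max_hits` times, as candidates for review."""
    counts = df["ip"].value_counts()
    return sorted(counts[counts <= max_hits].index)

print(top_lesson_per_program(logs))
print(rare_ips(logs))
```

A rarely seen IP is not suspicious by itself; it is only a shortlist to inspect by hand, and the `max_hits` threshold would need tuning against the real data.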