# Quickly Detecting Anomalies in Web Site Traffic

---

***Final Project, FAES BIOF 309 Introduction to Python, Fall 2018***

**Marie Gallagher, mgallagher@mail.nih.gov**

## Overview

The purpose of this project is to provide a faster way to visualize
data than (repeatedly) importing data into Excel and creating a 
chart.  This will save me time!

## Background

One of my work tasks is to provide statistics about Web site use.
Over the years, my institute has mandated various Web analytics software,
including AWStats, WebTrends and Google Analytics.

Each piece of software analyzes and displays results differently.
So switching to different Web analytics software usually creates the 
appearance of a massive increase or decrease in traffic.

I need a way to tell whether there has been a legitimate unexpected
change in Web site traffic.  The number of unique IP addresses per day
turns out to be a reasonably reliable indicator.

Anomalies can also happen if the Web site is the subject of a denial-of-service
attack or if an event of national or international interest
is related to content on the Web site.  Having a way to quickly visualize 
the number of unique IP addresses daily over time will help me quickly 
spot anomalies.

## Demo

My data for this project contains the number of unique IP addresses 
accessing a Web site each day.  A few lines of data follow:
```
06/01/2018|5565|515120|515120|515120
06/02/2018|4801|518657|518657|518657
06/03/2018|4069|451881|451881|451881
06/04/2018|4859|493762|493762|493762
06/05/2018|4816|514587|514587|514587
```

- We are interested in the first two columns.  We want to ignore the rest of the columns.

- The first column contains dates, but they are in the form MM/DD/YYYY rather than YYYY-MM-DD.

- There are no column headers in this data.

- The columns are separated by the "pipe" character rather than commas, spaces or tabs.
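
The four observations above map directly onto options of `pandas.read_csv` and `pandas.to_datetime`. As a minimal sketch, the sample lines shown earlier can be parsed from an in-memory buffer instead of a real log file (the names `sample` and `preview` are illustrative only and are not part of the project):

```python
import io

import pandas as pd

# The five sample lines from above, held in memory instead of a file.
sample = """06/01/2018|5565|515120|515120|515120
06/02/2018|4801|518657|518657|518657
06/03/2018|4069|451881|451881|451881
06/04/2018|4859|493762|493762|493762
06/05/2018|4816|514587|514587|514587
"""

preview = pd.read_csv(
    io.StringIO(sample),
    sep="|",         # columns are separated by the pipe character
    header=None,     # the data has no header row
    usecols=[0, 1],  # keep the first two columns, ignore the rest
)
preview.columns = ["date1", "unique_ips"]

# Convert the MM/DD/YYYY text into datetime values.
preview["date1"] = pd.to_datetime(preview["date1"], format="%m/%d/%Y")

print(preview)
```

Passing an explicit `format` avoids any ambiguity between month-first and day-first dates.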

In [None]:
# Import the necessary packages
import os
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime

In [None]:
# Optional: display the names of the files in the raw data directory.
print(os.listdir("../data/raw/"))

In [None]:
# Choose a data file to plot.
daily_ip_file = "../data/raw/log_daily_ip_201806a.txt"

In [None]:
# Optional:  display the first few lines of the daily_ip_file.
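with open(daily_ip_file) as file:
    for _ in range(5):
        # Strip the trailing newline so print() does not add a blank line.
        print(file.readline().rstrip("\n"))
# No explicit close is needed: the with block closes the file automatically.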

In [None]:
# Read a data file into a pandas dataframe, df.
# The columns separator is '|'.
# There is no header row.
# We only need the first two columns.
df = pd.read_csv(daily_ip_file, sep="|", header=None, usecols=[0,1])

In [None]:
# Optional: display the first few rows.  There should now be only two columns, not five.
print(df.head(5))

In [None]:
# Give the dataframe's columns descriptive names.
df.columns = ["date1", "unique_ips"]

In [None]:
# Optional: make sure the column names changed.
print(df.head(5))

In [None]:
# Optional: what are the types of the columns in df?
print(df.dtypes)

In [None]:
# Convert the MM/DD/YYYY date text into Python datetime objects (displayed as YYYY-MM-DD).
dates_datetime = [datetime.strptime(date_text, "%m/%d/%Y") for date_text in df.date1]
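
With the dates converted, the daily counts can be drawn as a time series. The sketch below is one way to do that and is not necessarily the plot used in the original notebook. It uses the five sample rows shown earlier, and the names `date_texts`, `unique_ips`, `fig` and `ax` are illustrative assumptions.

```python
from datetime import datetime

import matplotlib.pyplot as plt

# Sample values copied from the data excerpt shown earlier.
date_texts = ["06/01/2018", "06/02/2018", "06/03/2018", "06/04/2018", "06/05/2018"]
unique_ips = [5565, 4801, 4069, 4859, 4816]

# Same conversion as above: MM/DD/YYYY text to datetime objects.
dates = [datetime.strptime(text, "%m/%d/%Y") for text in date_texts]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(dates, unique_ips, marker="o")
ax.set_xlabel("Date")
ax.set_ylabel("Unique IP addresses")
ax.set_title("Unique IP addresses per day")
fig.autofmt_xdate()  # slant the date labels so they do not overlap
```

In a Jupyter notebook the figure renders when the cell finishes; in a plain script, call `plt.show()` afterwards. A sudden spike or drop in the line is the kind of change worth investigating.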

## Future

1. I will incorporate this project into my work immediately.  (Until now, I imported the data into Excel and made a graph.)

2. Break my program into functions and restructure my project files

3. Scrub IP addresses from raw log files and extract data from them

4. List the most accessed URLs on days with high IP address counts

5. List the top referrers on days with high IP address counts

## Acknowledgments and Thanks!

BIOF 309 Instructors
* Martin Skarzynski
* Jinping Liu
* Michael Chambers

BIOF 309 Class
* Helpful questions

NIAID Scientific Programming Seminars through CIT
* Burke Squires