# Project 1 - Anomaly Detection
## Thomas Clark, Jacob Coyle
***
#### The file participants.csv contains meeting attendance data reported by Zoom for the first five weeks of a course. Each row contains the name of a student along with the number of minutes that the student was logged in to the course Zoom meeting. (The names of students have been changed to protect the innocent.)

In [64]:
#Load up the data with a csv.reader object
import csv

fn = "./participants.csv"

fields = []
rows = []

with open(fn, 'r', newline='') as csvfile:
    csvreader = csv.reader(csvfile)
    
    fields = next(csvreader)
    for row in csvreader:
        rows.append(row)
        
def tableform(a):
    for units in a:
        # Truncate each field to 14 characters so long names fit the 15-wide column
        print('{:15.14}'.format(units), end = '|')
    print()
        
tableform(fields)
for group in rows:
    tableform(group)

Student Name   |Week 1         |Week 2         |Week 3         |Week 4         |Week 5         |
Adrian Ellison |77             |154            |4              |170            |175            |
Ophelia Mcphee |179            |151            |164            |173            |171            |
Yasir Fenton   |180            |47             |164            |168            |169            |
Benny Arias    |180            |152            |161            |170            |170            |
Tamara Cottrel |183            |79             |161            |173            |168            |
Jada Calhoun   |178            |147            |162            |171            |175            |
Marlene Parry  |174            |150            |159            |171            |170            |
Jazmin Foreman |181            |103            |157            |172            |35             |
Bear Zuniga    |180            |131            |160            |170            |182            |
Sanjay Edwards |173           

***
#### Load the statistics module and use it to find the mean and median of Week 1’s data.

In [65]:
import statistics as stats

#names to track who we're looking at
week1 = []
names = []
for group in rows:
    week1.append(int(group[1]))
    names.append(group[0])
    
# week1
mean1 = stats.mean(week1)
print("Mean =", mean1)
print("Median =", stats.median(week1))

Mean = 161
Median = 175


***
#### Find the quartiles for Week 1.

In [66]:
quartiles = stats.quantiles(week1, n=4)
print("Quartiles =", quartiles)

Quartiles = [174.0, 175.0, 179.0]
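
`statistics.quantiles` defaults to `method='exclusive'`, which treats the data as a sample from a larger population. With `method='inclusive'` the data are treated as the whole population, and the cut points can differ. A minimal sketch on made-up numbers (the `sample` list is hypothetical and not taken from participants.csv):

```python
import statistics as stats

# Hypothetical sample, not taken from participants.csv
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Default method: the data are treated as a sample from a larger population
exclusive = stats.quantiles(sample, n=4)
# Alternative method: the data are treated as the entire population
inclusive = stats.quantiles(sample, n=4, method='inclusive')

print("exclusive:", exclusive)
print("inclusive:", inclusive)
```

The median is the same either way, but the first and third quartiles shift, so the choice of method can move the Tukey fences that are built from them.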


***
#### In order to record attendance, we want to find the students who logged into the Zoom meeting but did not attend the entire lecture. To do this, we can look for outliers in the data.
#### Tukey’s fences are a simple method to define outliers in terms of the interquartile range. (In fact, they are usually included as whiskers in box plots in order to visualize outliers.)
Tukey’s fences: $ {\big [}Q_{1}-k(Q_{3}-Q_{1}),Q_{3}+k(Q_{3}-Q_{1}){\big ]} $
#### Use this method with k = 1.5 to find the outliers in the Week 1 attendance data.
For our example: $ {\big [}174.0 - 1.5(179.0 - 174.0), 179.0 + 1.5(179.0 - 174.0){\big ]} $
$$ = [174 - 7.5, 179 + 7.5] $$
$$ = [166.5, 186.5] $$
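
The arithmetic above can be checked with a small helper. This is only a sketch: `tukey_fences` is a hypothetical name introduced here, not part of the `statistics` module.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) Tukey fences for quartiles q1 and q3."""
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr

# Week 1 quartiles reported above: Q1 = 174.0, Q3 = 179.0
lower, upper = tukey_fences(174.0, 179.0)
print(lower, upper)
```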

In [67]:
outliers = {}

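# Fence width: k = 1.5 times the interquartile range (Q3 - Q1), which is 7.5 here
fence = 1.5 * (quartiles[2] - quartiles[0])

for i, attendance in enumerate(week1):
    if attendance < quartiles[0] - fence or attendance > quartiles[2] + fence:
        outliers[names[i]] = attendance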

for keys in outliers:
    print('{:20.19}'.format(keys), '{:5}'.format(outliers[keys]))

Adrian Ellison          77
Tayla Sparrow           51
Owain Emerson            9
Alaya Dickinson         24


***
#### Recall that in a normal distribution, 99.7% of the values lie within three standard deviations from the mean. If we assume that our data are normally distributed, this gives us another way to find outliers.
#### Compute the standard deviation for the Week 1 attendance data, then use this method to find the outliers. Do your results agree with experiment (4)?

In [68]:
outliers.clear()
stdv = stats.stdev(week1)

# Flag values more than three standard deviations below or above the mean
for i, attendance in enumerate(week1):
    if attendance < mean1 - (3 * stdv) or attendance > mean1 + (3 * stdv):
        outliers[names[i]] = attendance

for keys in outliers:
    print('{:20.19}'.format(keys), '{:5}'.format(outliers[keys]))

Owain Emerson            9
Alaya Dickinson         24


This does not fully agree with the results from experiment (4). The three-standard-deviation rule flags only the two most extreme values (9 and 24 minutes), while Tukey’s fences also flag 51 and 77 minutes.
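
One likely reason for the gap is that very low values pull the mean down and inflate the standard deviation, which widens the three-sigma band, while the quartiles barely move. A rough sketch on invented numbers (the `minutes` list is hypothetical and is not the course data; only the lower bounds are compared):

```python
import statistics as stats

# Hypothetical minutes: twenty near-full sessions plus two very short ones
minutes = list(range(170, 190)) + [9, 24]

# Lower Tukey fence from the first and third quartiles
q1, _, q3 = stats.quantiles(minutes, n=4)
tukey_low = q1 - 1.5 * (q3 - q1)

# Lower three-sigma bound from the mean and sample standard deviation
sigma_low = stats.mean(minutes) - 3 * stats.stdev(minutes)

tukey_flagged = sorted(m for m in minutes if m < tukey_low)
sigma_flagged = sorted(m for m in minutes if m < sigma_low)

print("Tukey lower fence:", tukey_low, "flags", tukey_flagged)
print("3-sigma lower bound:", round(sigma_low, 1), "flags", sigma_flagged)
```

In this toy data the two short sessions drag the three-sigma bound far below the Tukey fence, so only the shortest session is flagged by the standard deviation rule.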
***
#### Define a function tardy_iqr() to make experiment (4) repeatable. This function should take the name of a column (e.g. 'Week 1') and return a list of names for whom the number of minutes is below the lower Tukey fence (e.g. ['Alaya Dickinson', 'Owain Emerson']). Verify that this function returns the same results as experiment (4).

In [69]:
# Use a regular expression to extract the week number from the column name
import re

def tardy_iqr(a):
    
    # The week number doubles as the column index, e.g. "week 1" -> 1
    col = int(re.findall(r"\d+", a)[0])
    values = []
    outliers.clear()
    
    for items in rows:
        values.append(int(items[col]))
    
    quartiles = stats.quantiles(values, n=4)
    fence = 1.5 * (quartiles[2] - quartiles[0])
    
    for i, attendance in enumerate(values):
        if attendance > quartiles[2] + fence or attendance < quartiles[0] - fence:
            outliers[names[i]] = attendance
            
    for keys in outliers:
        print('{:20.19}'.format(keys), '{:5}'.format(outliers[keys]))
    
tardy_iqr("week 1")

Adrian Ellison          77
Tayla Sparrow           51
Owain Emerson            9
Alaya Dickinson         24
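
The exercise asks for a function that returns a list of names below the lower fence, whereas `tardy_iqr()` above prints names and minutes and checks both fences. A possible variant is sketched below. The name `tardy_iqr_names`, its extra `fields` and `rows` parameters, and the demo data are all assumptions made for illustration. It looks the column up by its exact header text (so `'Week 1'`, not `'week 1'`) instead of parsing the week number.

```python
import statistics as stats

def tardy_iqr_names(fields, rows, column, k=1.5):
    """Return names whose minutes in `column` fall below the lower Tukey fence."""
    col = fields.index(column)  # find the column by its header text
    values = [int(row[col]) for row in rows]
    q1, _, q3 = stats.quantiles(values, n=4)
    lower_fence = q1 - k * (q3 - q1)
    # Only the lower fence is checked: long sessions are not "tardy"
    return [row[0] for row in rows if int(row[col]) < lower_fence]

# Hypothetical data in the same shape as participants.csv
demo_fields = ["Student Name", "Week 1"]
demo_minutes = list(range(170, 189)) + [9, 24, 400]
demo_rows = [[f"Student {i}", str(m)] for i, m in enumerate(demo_minutes)]

print(tardy_iqr_names(demo_fields, demo_rows, "Week 1"))
```

Returning a list keeps the function free of shared state such as the global `outliers` dictionary, and the 400-minute session in the demo data is ignored because it lies above the upper fence.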


**Define a second function, tardy_stdev(), with the same interface as experiment (6) but using the method of experiment (5) and verify that its results match that experiment.**

In [70]:
def tardy_stdev(a):

    col = int(re.findall(r"\d+", a)[0])  # e.g. "week 1" -> column 1
    values = []
    outliers.clear()
    
    
    for items in rows:
        values.append(int(items[col]))
    
    stdv = stats.stdev(values)
    meanweek = stats.mean(values)
    
    for i, attendance in enumerate(values):
        if attendance < meanweek - (3 * stdv) or attendance > meanweek + (3 * stdv):
            outliers[names[i]] = attendance

    for keys in outliers:
        print('{:20.19}'.format(keys), '{:5}'.format(outliers[keys]))


tardy_stdev("week 1")

Owain Emerson            9
Alaya Dickinson         24


Calling this function with "week 1" reproduces the results of experiment (5). Passing a different week finds the values that lie more than three standard deviations from that week's mean, that is, outside the range that would contain 99.7% of normally distributed data.

**Compare the results of tardy_iqr() and tardy_stdev() on Weeks 2-5.**

In [71]:
# Skip the name column and Week 1; compare the two methods on Weeks 2-5
for name in fields[2:]:
    print(name, ": iqr")
    tardy_iqr(name)
    print("__________________________")
    print(name, ": stdev")
    tardy_stdev(name)
    print("--------------------------")


Week 2 : iqr
Yasir Fenton            47
Tamara Cottrell         79
Jazmin Foreman         103
Bear Zuniga            131
Miles Lyons              2
Owain Emerson          290
__________________________
Week 2 : stdev
Miles Lyons              2
Owain Emerson          290
--------------------------
Week 3 : iqr
Adrian Ellison           4
Adeline Jordan         105
Jaye Sweeney           121
__________________________
Week 3 : stdev
Adrian Ellison           4
--------------------------
Week 4 : iqr
Dora Delacruz          316
Shaquille Wood         184
__________________________
Week 4 : stdev
Dora Delacruz          316
--------------------------
Week 5 : iqr
Jazmin Foreman          35
Sanjay Edwards          66
Alfie-James Pierce      74
Adeline Jordan         195
Saffa Brook            143
__________________________
Week 5 : stdev
Jazmin Foreman          35
--------------------------


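From these results, we notice a pattern: in every week the IQR method (Tukey’s fences) flags more outliers than the stdev method (three standard deviations). The standard deviation method flags only the most extreme values, in part because extreme values inflate the standard deviation and so widen its thresholds, whereas the quartiles are barely affected by them. Furthermore, every outlier flagged by the stdev method in Weeks 2-5 is also flagged by the IQR method. Note that both functions also flag unusually long sessions (for example, 290 and 316 minutes), which lie above the upper bound and are not cases of leaving early.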
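
The subset relationship can be checked mechanically with Python sets. The sketch below uses the names transcribed by hand from the Weeks 2-5 output above; the dictionary names `iqr_flagged` and `stdev_flagged` are introduced here for illustration.

```python
# Names transcribed from the Weeks 2-5 output above
iqr_flagged = {
    "Week 2": {"Yasir Fenton", "Tamara Cottrell", "Jazmin Foreman",
               "Bear Zuniga", "Miles Lyons", "Owain Emerson"},
    "Week 3": {"Adrian Ellison", "Adeline Jordan", "Jaye Sweeney"},
    "Week 4": {"Dora Delacruz", "Shaquille Wood"},
    "Week 5": {"Jazmin Foreman", "Sanjay Edwards", "Alfie-James Pierce",
               "Adeline Jordan", "Saffa Brook"},
}
stdev_flagged = {
    "Week 2": {"Miles Lyons", "Owain Emerson"},
    "Week 3": {"Adrian Ellison"},
    "Week 4": {"Dora Delacruz"},
    "Week 5": {"Jazmin Foreman"},
}

for week in iqr_flagged:
    # <= on sets tests whether the left side is a subset of the right side
    is_subset = stdev_flagged[week] <= iqr_flagged[week]
    only_iqr = sorted(iqr_flagged[week] - stdev_flagged[week])
    print(week, "| stdev within iqr:", is_subset, "| iqr only:", only_iqr)
```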