# **Instructions**

This document is a template, and you are not required to follow it exactly. However, the kinds of questions we ask here are the kinds of questions we want you to focus on. While you might have answered similar questions to these in your project presentations, we want you to go into a lot more detail in this write-up; you can refer to the Lab homeworks for ideas on how to present your data or results.

You don't have to answer every question in this template, but you should answer roughly this many questions. Your answers to such questions should be paragraph-length, not just a bullet point. You likely still have questions of your own -- that's okay! We want you to convey what you've learned, how you've learned it, and demonstrate that the content from the course has influenced how you've thought about this project.

# Project Name
Project mentor: Edward Wang

Nick Geissler <ngeissl2@jh.edu>, Annie Wang <awang105@jh.edu>, Jonathan Ye <jye41@jh.edu>, Evan Zhu <ezhu13@jh.edu>

Link_to_git_repo (private; Edward advised against making it public)

# Outline and Deliverables

List the deliverables from your project proposal. For each uncompleted deliverable, please include a sentence or two on why you weren't able to complete it (e.g. "decided to use an existing implementation instead" or "ran out of time"). For each completed deliverable, indicate which section of this notebook covers what you did.

If you spent substantial time on any aspects that weren't deliverables in your proposal, please list those under "Additional Work" and indicate where in the notebook you discuss them.

### Uncompleted Deliverables
1. "Expect to complete #2": we decided to use an existing implementation for our SVM
2. ...


### Completed Deliverables
1. "Must complete #1": We discuss our dataset pre-processing [in "Dataset" below](#scrollTo=zFq-_D0khnhh&line=10&uniqifier=1).
2. "Must complete #2": We discuss training our logistic regression baseline [in "Baselines" below](#scrollTo=oMyqHUa0jUw7&line=5&uniqifier=1).
3. ...


### Additional Deliverables
1. We decided to add a second baseline using the published model from this paper. We discuss this [in "Baselines" below](#scrollTo=oMyqHUa0jUw7&line=5&uniqifier=1).
2. ...

# Preliminaries

## What problem were you trying to solve or understand?

Q. What are the real-world implications of this data and task?

A. 

Q. How is this problem similar to others we’ve seen in lectures, breakouts, and homeworks?

A. 

Q. What makes this problem unique?

A. 

Q. What ethical implications does this problem have?

A. What if the scammers get laid off from their cushy scamming center jobs?

## Dataset(s)

Describe the dataset(s) you used.

How were they collected?

Why did you choose them?

How many examples in each?
* In total, we had ___ labeled emails and ___ features



In [1]:
# Imports
import pandas as pd
import numpy as np
from scipy import sparse
# from AdaBoostClassifier import AdaBoostClassifier
#%run AdaBoostWeak.py
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

In [9]:
# Load your data and print 2-3 examples
df1 = pd.read_csv("data/spam.csv", usecols=[0,1])
print(df1.iloc[0:3,:])


  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...


## Pre-processing

Please see our preprocessing2 Jupyter notebook: {insert link}.

### Questions

What features did you use or choose not to use? Why?

If you have categorical labels, were your datasets class-balanced?

How did you deal with missing data? What about outliers?

What approach(es) did you use to pre-process your data? Why?

Are your features continuous or categorical? How do you treat these features differently?

In [None]:
# For those same examples above, what do they look like after being pre-processed?
with open('data/column_names.txt', 'r') as f:
    column_names = [line.strip() for line in f]  # feature labels

sparse_dat = sparse.load_npz("data/sparse_df.npz")
labels = sparse_dat[:, 0]  # ground truth labels (first column)
X = sparse_dat[:, 1:]  # assumption: every remaining column is a feature
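
To actually look at a pre-processed example, one option is to list the nonzero entries of a sparse row next to their column names. The sketch below uses a tiny made-up matrix, and `nonzero_features` is a hypothetical helper; it assumes the names in `column_names` line up one-to-one with the matrix columns, which we have not verified here.

```python
import numpy as np
from scipy import sparse

def nonzero_features(matrix, row_idx, names):
    """Return {column_name: value} for the nonzero entries of one sparse row."""
    row = matrix.getrow(row_idx).tocoo()
    return {names[j]: v for j, v in zip(row.col, row.data)}

# Toy stand-in for sparse_dat: column 0 is the label, the rest are word counts.
toy_names = ["label", "free", "win", "ok"]
toy = sparse.csr_matrix(np.array([
    [-1, 0, 0, 2],
    [1, 1, 3, 0],
]))

for i in range(toy.shape[0]):
    print(i, nonzero_features(toy, i, toy_names))
```

On the real data the same call would be `nonzero_features(sparse_dat, i, column_names)` for the first few rows.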


In [None]:
# Visualize the distribution of your data before and after pre-processing.
#   You may borrow from how we visualized data in the Lab homeworks.

# Models and Evaluation

## Experimental Setup

### Evaluation metrics

Q. How did you evaluate your methods? Why is that a reasonable evaluation metric for the task?

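A. We report F1 score, type 1 error (ham classified as spam), type 2 error (spam classified as ham), and accuracy. Type 2 error matters most for our goal, since we want to reduce the amount of spam that gets through without sacrificing too much accuracy.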
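
As a minimal sketch of how these four metrics relate, the snippet below computes them from a confusion matrix. It assumes labels in {-1, +1} with +1 = spam, and it takes type 1 and type 2 error to be the false positive and false negative rates; `summarize_errors` is a hypothetical helper and the arrays are made-up toy data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

def summarize_errors(y_true, y_pred):
    """Return accuracy, F1, type 1 error rate and type 2 error rate.

    Assumes labels are -1 (ham) and +1 (spam); illustrative only.
    """
    # With labels=[-1, 1], ravel() yields tn, fp, fn, tp in that order.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, pos_label=1),
        "type1_error": fp / (fp + tn),  # ham flagged as spam
        "type2_error": fn / (fn + tp),  # spam let through as ham
    }

# Toy data: 3 spam and 5 ham, with one missed spam and one false alarm.
y_true = np.array([1, 1, 1, -1, -1, -1, -1, -1])
y_pred = np.array([1, 1, -1, -1, -1, -1, -1, 1])
summary = summarize_errors(y_true, y_pred)
print(summary)
```

Accuracy alone can look fine on an imbalanced dataset even when a large share of spam is missed, which is why the two error rates are reported separately.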


### Custom loss function

Q. What did you use for your loss function to train your models? Did you try multiple loss functions? Why or why not?

A. Our loss function is shown in the code below. -INSERT DETAILS- We also tried a couple of other loss functions, including one of the form `np.exp(-proportion)`, but they performed worse.

Furthermore, we implemented our own AdaBoost algorithm to leverage this custom loss function. Link:

In [None]:
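# Code for loss functions, evaluation metrics or link to Git repo

def compute_error(y, y_pred, w_i, type2penalty, pen_factor):
    '''
    Calculate the weighted error rate of a weak classifier m. Arguments:
    y: actual target values (-1 or 1)
    y_pred: values predicted by the weak classifier (-1 or 1)
    w_i: individual weights for each observation
    type2penalty: if True, weight type 2 errors by pen_factor
    pen_factor: cost of a type 2 error relative to a type 1 error

    Note that all arrays should be the same length. Convert sparse arrays to regular arrays before calling.
    '''
    if type2penalty:
        error = np.sum(w_i * t2_pred_err_vec(y, y_pred, pen_factor).astype(float)) / np.sum(w_i)
    else:
        error = np.sum(w_i * np.not_equal(y, y_pred).astype(int)) / np.sum(w_i)

    return error


def t2_pred_err_vec(y, y_pred, pen_factor):
    '''Per-observation cost: pen_factor for a type 2 error, 1 for a type 1 error, 0 otherwise.'''
    # Type 2 error: actual spam (y == 1) predicted as ham (y_pred == -1)
    type2_err_vec = ((y_pred == -1) & (y == 1)) * pen_factor
    # Type 1 error: actual ham (y == -1) predicted as spam (y_pred == 1)
    type1_err_vec = (y_pred == 1) & (y == -1)
    pred_err_vec = type2_err_vec + type1_err_vec
    if np.isnan(pred_err_vec).any():
        print("Warning: NaN found in penalized error vector")
    return pred_err_vec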
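
A small worked example may help show what the penalty does to the weighted error. The snippet is a self-contained sketch: `weighted_error` is a hypothetical stand-in that mirrors the penalized branch of `compute_error`, and the four observations are made up.

```python
import numpy as np

def weighted_error(y, y_pred, w, pen_factor=1.0):
    """Weighted error where a missed spam (y=1, y_pred=-1) costs pen_factor."""
    missed_spam = (y == 1) & (y_pred == -1)   # type 2 error
    false_alarm = (y == -1) & (y_pred == 1)   # type 1 error
    cost = missed_spam * pen_factor + false_alarm * 1.0
    return np.sum(w * cost) / np.sum(w)

y = np.array([1, 1, -1, -1])
y_pred = np.array([-1, 1, 1, -1])   # one missed spam, one false alarm
w = np.full(4, 0.25)                # uniform observation weights

plain = weighted_error(y, y_pred, w)                     # (0.25 + 0.25) / 1.0
penalized = weighted_error(y, y_pred, w, pen_factor=2)   # (0.50 + 0.25) / 1.0
print(plain, penalized)  # 0.5 0.75
```

Because a single observation can now cost more than 1, the penalized error is no longer guaranteed to stay below 1, so any learner weight computed from it (for example `log((1 - err) / err)`) needs a guard for large error values.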


### Train/Test split

Q. How did you split your data into train and test sets? Why?

A. We used a stratified 80/20 split on the y labels so that the train and test sets contain a similar proportion of spam.

In [None]:
# Train/test split

# Stratify on the labels so both splits keep a similar proportion of spam.
# X is the feature matrix; labels is the sparse label column loaded above.
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, shuffle=True, stratify=labels.toarray().ravel())

# Flatten the sparse label columns into 1-D arrays for scikit-learn.
y_train_flat = y_train.copy().toarray().ravel()
# y_train_flat[y_train_flat == -1] = 0  # try uncommenting this line if results look off

y_test_flat = y_test.copy().toarray().ravel()
# y_test_flat[y_test_flat == -1] = 0  # same here
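
One quick sanity check on a stratified split is to compare the spam rate in each half. The sketch below does this on synthetic data (100 made-up rows, 20% labeled +1), not on our dataset; the `random_state` is added only to make the toy example repeatable.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
toy_y = np.array([1] * 20 + [-1] * 80)  # 20% "spam"
toy_X = sparse.csr_matrix(rng.integers(0, 3, size=(100, 5)))

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, shuffle=True, stratify=toy_y, random_state=0
)

# With stratification, both splits should keep roughly the same spam rate.
train_rate = np.mean(y_tr == 1)
test_rate = np.mean(y_te == 1)
print(train_rate, test_rate)
```

The same two `np.mean` lines can be run on `y_train_flat` and `y_test_flat` to confirm the real split is balanced the same way.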


## Baselines

Q. What baselines did you compare against? Why are these reasonable?

A. scikit-learn's SVM and Random Forest classifiers, both with default parameters. -INSERT WHY THESE ARE REASONABLE-

Q. Did you look at related work to contextualize how others' methods or baselines have performed on this dataset/task? If so, how did those methods do?

A. Yes; the related methods we looked at perform very well on this task. -INSERT DETAILS-

In [None]:
# SVM

from sklearn import svm, metrics
svm_example = svm.SVC(probability=True)  # probability=True enables predict_proba
svm_example.fit(X_train, y_train_flat)

# For ROC: use the predicted probability of the positive class (column 1)
scores = svm_example.predict_proba(X_test)
svm_fpr, svm_tpr, thresholds = metrics.roc_curve(y_test_flat, scores[:, 1])

# TODO: also compute type 2 error

In [None]:
# Random Forest

from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train_flat)

# For ROC: use the predicted probability of the positive class (column 1)
scores = rf.predict_proba(X_test)
rf_fpr, rf_tpr, thresholds = metrics.roc_curve(y_test_flat, scores[:, 1])

# TODO: also compute type 2 error
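
For the type 2 error of the baselines, one possible route is to read it straight off the ROC output: the type 2 error rate at a threshold is `1 - tpr`, and the type 1 error rate is `fpr`. The sketch below assumes +1 = spam; `type2_at_type1_budget` is a hypothetical helper and the scores are made-up toy values, not results from our models.

```python
import numpy as np
from sklearn import metrics

def type2_at_type1_budget(y_true, spam_scores, max_type1=0.05):
    """Lowest type 2 error rate reachable while keeping type 1 error <= max_type1."""
    fpr, tpr, thresholds = metrics.roc_curve(y_true, spam_scores, pos_label=1)
    type2 = 1.0 - tpr            # share of spam classified as ham
    allowed = fpr <= max_type1   # the first ROC point has fpr == 0, so never empty
    best = np.argmin(np.where(allowed, type2, np.inf))
    return type2[best], fpr[best], thresholds[best]

# Toy scores: higher means "more likely spam".
y_true = np.array([-1, -1, -1, -1, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9])

t2, t1, thr = type2_at_type1_budget(y_true, scores, max_type1=0.0)
print(t2, t1, thr)
```

With the baselines above, the call would take `y_test_flat` and `scores[:, 1]`, which makes the SVM and Random Forest comparable at the same type 1 error budget.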

## Methods

Q. What methods did you choose? Why did you choose them?

A. 

Q. How did you train these methods, and how did you evaluate them? Why?

A. 

Q. Which methods were easy/difficult to implement and train? Why?

A. Implementation was fairly difficult. The biggest obstacle was getting the weak learners to differ from one another. Maybe also talk about boosting RF? Link GitHub.


Q. For each method, what hyperparameters did you evaluate? How sensitive was your model's performance to different hyperparameter settings?

A. First, we ran a grid search to find the best set of parameters for our goal: reducing type 2 error (actual spam emails classified as ham) without sacrificing too much accuracy. Details are in the Grid Search section.

## Grid Search

TODO (Evan): add details (explain the validation set, cross-validation folds, etc.) and, most importantly, paste the grid search figure in here.

In [None]:
# do something here
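
Until the real details are filled in, here is a rough sketch of what a cross-validated grid search for this goal could look like. It is not our actual search: a single `DecisionTreeClassifier` with `class_weight` stands in for the boosted model and its penalty factor, the data is synthetic, and the selection rule (lowest type 2 error among settings above a hypothetical `min_accuracy`) is an assumption.

```python
import itertools
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def grid_search(X, y, pen_factors, depths, n_splits=5, min_accuracy=0.8):
    """Return (best_params, results); results maps (pen, depth) -> (type2, accuracy)."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    results = {}
    for pen, depth in itertools.product(pen_factors, depths):
        type2, acc = [], []
        for train_idx, val_idx in cv.split(X, y):
            # Stand-in model: class_weight plays the role of the penalty factor.
            model = DecisionTreeClassifier(
                max_depth=depth, class_weight={1: pen, -1: 1}, random_state=0
            )
            model.fit(X[train_idx], y[train_idx])
            pred = model.predict(X[val_idx])
            # Type 2 error rate = 1 - recall on the spam class.
            type2.append(1.0 - recall_score(y[val_idx], pred, pos_label=1))
            acc.append(accuracy_score(y[val_idx], pred))
        results[(pen, depth)] = (np.mean(type2), np.mean(acc))
    # Assumed rule: lowest type 2 error among sufficiently accurate settings.
    eligible = {k: v for k, v in results.items() if v[1] >= min_accuracy} or results
    best = min(eligible, key=lambda k: eligible[k][0])
    return best, results

# Synthetic, imbalanced toy data with labels mapped to {-1, +1}.
X_toy, y01 = make_classification(n_samples=300, n_features=10, weights=[0.8], random_state=0)
y_toy = np.where(y01 == 1, 1, -1)

best_params, results = grid_search(X_toy, y_toy, pen_factors=[1, 2, 4], depths=[1, 3, 5])
print(best_params, results[best_params])
```

Each setting is scored only on held-out folds, so the test set stays untouched until the final evaluation.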

## Training the model

Using our choice of parameters from grid search:

* penalty factor = 2
* decision tree depth = 5 (for the sake of speed)
* number of boosting rounds = 200 (for the sake of speed)


In [None]:
# Code for training models, or link to your Git repository
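
As a placeholder until the repository link is added, the sketch below shows one way a boosted model with these settings could be trained. It is not our implementation: it is a plain discrete AdaBoost loop over scikit-learn decision trees in which only the error used for the learner weight is penalized, the weight update is the standard one, and the stopping rules are assumptions. All function names are hypothetical and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def penalized_error(y, y_pred, w, pen_factor):
    """Weighted error in which a missed spam (y=1, y_pred=-1) costs pen_factor."""
    cost = ((y == 1) & (y_pred == -1)) * pen_factor + ((y == -1) & (y_pred == 1)) * 1.0
    return np.sum(w * cost) / np.sum(w)

def fit_boosted_trees(X, y, n_rounds=200, max_depth=5, pen_factor=2):
    """Discrete AdaBoost with depth-limited trees and a penalized error (sketch)."""
    w = np.full(len(y), 1.0 / len(y))
    learners, alphas = [], []
    for _ in range(n_rounds):
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        tree.fit(X, y, sample_weight=w)
        y_pred = tree.predict(X)
        err = penalized_error(y, y_pred, w, pen_factor)
        if err >= 0.5 and learners:
            break  # no better than chance under the penalized cost
        err = np.clip(err, 1e-10, 1 - 1e-10)  # keep the log finite
        alpha = np.log((1 - err) / err)
        learners.append(tree)
        alphas.append(alpha)
        miss = y_pred != y
        if not miss.any():
            break  # perfect fit: further rounds would repeat the same tree
        w = w * np.exp(alpha * miss)  # up-weight misclassified observations
        w = w / w.sum()
    return learners, alphas

def predict_boosted_trees(learners, alphas, X):
    """Weighted vote of the weak learners; ties go to +1."""
    votes = sum(a * t.predict(X) for a, t in zip(alphas, learners))
    return np.where(votes >= 0, 1, -1)

# Synthetic, imbalanced toy data with labels mapped to {-1, +1}.
X_toy, y01 = make_classification(n_samples=300, n_features=10, weights=[0.8], random_state=0)
y_toy = np.where(y01 == 1, 1, -1)

learners, alphas = fit_boosted_trees(X_toy, y_toy)
train_acc = np.mean(predict_boosted_trees(learners, alphas, X_toy) == y_toy)
print(len(learners), train_acc)
```

The defaults mirror the settings listed above (penalty factor 2, depth 5, 200 rounds); on the real data the call would take `X_train` and `y_train_flat`.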

In [None]:
# Show plots of how these models performed during training.
#  For example, plot train loss and train accuracy (or other evaluation metric) on the y-axis,
#  with number of iterations or number of examples on the x-axis.

## Results

Show tables comparing your methods to the baselines.

What about these results surprised you? Why?

Did your models over- or under-fit? How can you tell? What did you do to address these issues?

What does the evaluation of your trained models tell you about your data? How do you expect these models might behave differently on different data?  

In [None]:
# Show plots or visualizations of your evaluation metric(s) on the train and test sets.
#   What do these plots show about over- or under-fitting?
#   You may borrow from how we visualized results in the Lab homeworks.
#   Are there aspects of your results that are difficult to visualize? Why?

# Discussion

## What you've learned

*Note: you don't have to answer all of these, and you can answer other questions if you'd like. We just want you to demonstrate what you've learned from the project.*

What concepts from lecture/breakout were most relevant to your project? How so?

What aspects of your project did you find most surprising?

What lessons did you take from this project that you want to remember for the next ML project you work on? Do you think those lessons would transfer to other datasets and/or models? Why or why not?

What was the most helpful feedback you received during your presentation? Why?

If you had two more weeks to work on this project, what would you do next? Why?