# INFO-4604/5604 HW2: Linear Classification 

### Solution by: *YOUR NAME* (and list any partners)


## Assignment overview

News agencies, governments and corporations sometimes track social media during natural disasters to monitor unfolding events. But because no single person or group of people can read all available Twitter data, organizations may turn to natural language processing methods to try to understand what is happening as disasters unfold. 

While this approach is powerful, inferring events from NLP can be tricky. For instance, say a person [tweets](https://twitter.com/AnyOtherAnnaK/status/629195955506708480) that "LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE." This tweet includes the word "ablaze", which may signal to a computer that there is an unfolding disaster. However, in this particular case, the person is speaking metaphorically. A simple computer system using keywords (e.g. ablaze) might be fooled into thinking the tweet is reporting an actual fire.

In this assignment, you will predict if a given tweet actually refers to a natural disaster. This exercise is motivated by real-world disaster monitoring systems, and can help you to gain practice with supervised binary classification and natural language processing.

__Note__: This dataset originally comes from [Kaggle](https://www.kaggle.com/c/nlp-getting-started/overview), but it has been modified for this problem set. Information about the data from this problem set that you find on Kaggle will almost certainly be wrong.

### What to hand in

You will submit the assignment on Canvas. Submit a single Jupyter notebook named `hw2lastname.ipynb`, where lastname is replaced with your last name. **Please also submit a PDF or HTML version of your notebook to Canvas**.

Please clearly mark all deliverables. You are encouraged to create additional cells in whatever way makes the presentation more organized and easy to follow. You are allowed to import additional Python libraries.

### Submission policies

- **Collaboration:** You are allowed to work with one partner. You are still expected to write up your own solution. Each individual must turn in their own submission and list their collaborators after their name.
- **Late submissions:** Each student may use up to 5 late days over the semester. You have late days, not late hours. This means that if your submission is late by any amount of time past the deadline, then this will use up a late day. If it is late by any amount beyond 24 hours past the deadline, then this will use a second late day, and so on. Once you have used up all late days, late assignments will be given at most 80% credit after one day and 60% credit after two days.


## Getting started

In this assignment, you will experiment with perceptron and logistic regression in `sklearn`. Much of the code has already been written for you. We will use a class called `SGDClassifier` (which you should read about in the [sklearn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)). It implements stochastic gradient descent (SGD) for a variety of loss functions, including both perceptron and logistic regression, which makes it easy to move between the two classifiers.

The code below will load the datasets. There are two data collections: the "training" data, which contains the tweets that you will use for training the classifiers, and the "testing" data, which are tweets that you will use to measure the classifier accuracy. The test tweets are instances the classifier has never seen during training, so they show how the classifier will behave on new data. Because we still know the labels of the test tweets, we can measure the accuracy.

For this problem, we will use what are called "bag of words" features, which are commonly used when doing classification with text. Each feature is a word, and the value of a feature for a particular tweet is the number of times the word appears in the tweet (with value $0$ if the word does not appear in the tweet).
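
As a small illustration of bag-of-words counts, the sketch below runs `CountVectorizer` on two made-up sentences (these are not from the assignment data, and the `toy_` names are hypothetical). It assumes the default `CountVectorizer` settings, which lowercase the text and keep tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up "tweets" (not from the assignment data).
toy_tweets = ["The sky is ablaze", "Fire fire in the forest"]

toy_vec = CountVectorizer()
toy_X = toy_vec.fit_transform(toy_tweets)  # sparse matrix: rows = tweets, columns = words

# vocabulary_ maps each (lowercased) word to its column index.
fire_col = toy_vec.vocabulary_["fire"]
sky_col = toy_vec.vocabulary_["sky"]

print(toy_X.shape)           # (number of tweets, number of distinct words)
print(toy_X[1, fire_col])    # "fire" appears twice in the second tweet
print(toy_X[0, fire_col])    # and not at all in the first tweet
```

Each row is one tweet and each column is one word, so most entries are zero; this is why the result is stored as a sparse matrix.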

A note on labels: **A value of 1 in `Y_train` or `Y_test` means the tweet refers to a real disaster; a value of 0 means the tweet does not refer to a real disaster.** 

Run the block of code below to load the data. You don't need to do anything yet. Move on to "Problem 1" next.

In [9]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

df_train = pd.read_csv('train.csv')

Y_train = df_train["target"]
text_train = df_train["text"]

vec = CountVectorizer()
X_train = vec.fit_transform(text_train)
feature_names = np.asarray(vec.get_feature_names())

df_test = pd.read_csv('test.csv')
Y_test = df_test["target"]
text_test = df_test["text"]

X_test = vec.transform(text_test)


## Problem 1: Understand the data [3 points]

Before doing anything else, take time to understand the code above.

The variables `df_train` and `df_test` are dataframes that store the training (and testing) datasets, which are contained in comma-separated files where the first column is the label and the second column is the text of the tweet.

The [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class converts the raw text into a bag-of-words feature vector representation that `sklearn` can use.

You should print out the values of the variables and write any other code needed to answer the following questions.

#### Deliverable 1.1

How many training instances are in the dataset? How many test instances?

[your answer here]

#### Deliverable 1.2

How many features are in the training data?

[your answer here]

#### Deliverable 1.3

What is the distribution of labels in the training data? That is, what percentage of instances are about actual disasters?

[your answer here]

## Problem 2: Perceptron [3 points]

The code below trains an [`SGDClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) using the perceptron loss, then it measures the accuracy of the classifier on the test data, using `sklearn`'s [`accuracy_score`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) function. 

The `fit` function trains the classifier. The feature weights are stored in the `coef_` attribute after training. The `predict` function of the trained `SGDClassifier` outputs the predicted label for a given instance or list of instances.
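
To make the link between `coef_` and `predict` concrete, here is a minimal sketch on a tiny synthetic dataset (the data and the `toy_` names are assumptions for illustration). It recomputes the predictions by hand from the learned weights and bias, using the rule that a binary linear classifier predicts class 1 when the score is positive.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Tiny synthetic dataset (hypothetical): 2 features, labels 0/1.
toy_X = np.array([[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]])
toy_y = np.array([1, 1, 0, 0])

toy_clf = SGDClassifier(loss='perceptron', max_iter=1000, tol=1.0e-12, random_state=123)
toy_clf.fit(toy_X, toy_y)

# One weight per feature, plus a bias (intercept) term.
w = toy_clf.coef_[0]
b = toy_clf.intercept_[0]

# Predict class 1 when the score w.x + b is positive, otherwise class 0.
scores = toy_X @ w + b
manual_pred = (scores > 0).astype(int)

print(manual_pred)
print(toy_clf.predict(toy_X))
```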

Additionally, this code displays the features and their weights in sorted order, which you may want to examine to understand what the classifier is learning. In general, in binary classification, the 0 class is considered the "negative" class.
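
The sorted display relies on `np.argsort`, which returns the indices that would sort an array in ascending order. A minimal sketch with made-up weights and feature names (all values here are hypothetical, not output from the classifier):

```python
import numpy as np

# Hypothetical weights and feature names, for illustration only.
toy_names = np.asarray(["calm", "fire", "song", "flood"])
toy_weights = np.array([-0.2, 0.7, -0.5, 0.4])

order = np.argsort(toy_weights)   # indices from most negative to most positive weight
for i in order:
    print(" %s: %0.4f" % (toy_names[i], toy_weights[i]))

most_negative = toy_names[order[0]]    # pushes the score hardest toward class 0
most_positive = toy_names[order[-1]]   # pushes the score hardest toward class 1
```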

There are 3 keyword arguments that have been added to the code below. It is important you keep the same values of these arguments whenever you create an `SGDClassifier` instance in this assignment so that you get consistent results. They are:

- `max_iter` is one of the stopping criteria, which is the maximum number of iterations/epochs the algorithm will run for.

- `tol` is the other stopping criterion, which is how small the difference between the current loss and previous loss should be before stopping.

- `random_state` is a seed for pseudorandom number generation. The algorithm uses randomness in the way the training data are sorted, which will affect the solution that is learned, and even the accuracy of that solution.
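
The role of `random_state` can be checked with a short experiment. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, not the tweet data) and a hypothetical helper `fit_weights`. On a single machine, training twice with the same seed should give identical weights, while a different seed may give different ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data (an assumption for illustration, not the tweet data).
toy_X, toy_y = make_classification(n_samples=200, n_features=10, random_state=0)

def fit_weights(seed):
    """Train a perceptron with the given seed and return a copy of its weights."""
    clf = SGDClassifier(loss='perceptron', max_iter=1000, tol=1.0e-12, random_state=seed)
    clf.fit(toy_X, toy_y)
    return clf.coef_[0].copy()

w_a = fit_weights(123)
w_b = fit_weights(123)   # same seed: same shuffling order, same weights
w_c = fit_weights(7)     # different seed: the weights may differ

print(np.array_equal(w_a, w_b))
print(np.allclose(w_a, w_c))
```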

Note: *Wait a minute $-$ in class we learned that the loss function is convex, so the algorithm will find the same minimum regardless of how it is trained. Why is there random variation in the output? The reason is that even though there is only one minimum value of the loss, there may be different weights that result in the same loss, so randomness is a matter of tie-breaking. What's more, while different weights may have the same loss, they could lead to different classification accuracies, because the loss function is not the same as accuracy. (Unless accuracy was your loss function... which is possible, but uncommon because it turns out to be a difficult function to optimize.)
Note that different computers may still give different answers, despite keeping these settings the same, because of how pseudorandom numbers are generated with different operating systems and Python environments.*

To begin, run the code in the cell below without modification.

In [10]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, precision_score

classifier = SGDClassifier(loss='perceptron', max_iter=1000, tol=1.0e-12, random_state=123, eta0=100)
classifier.fit(X_train, Y_train)

print("Number of SGD iterations: %d" % classifier.n_iter_)
print("Training accuracy: %0.6f" % accuracy_score(Y_train, classifier.predict(X_train)))
print("Testing accuracy: %0.6f" % accuracy_score(Y_test, classifier.predict(X_test)))

print("\nFeature weights:")
args = np.argsort(classifier.coef_[0])
for a in args:
    print(" %s: %0.4f" % (feature_names[a], classifier.coef_[0][a]))

Number of SGD iterations: 35
Training accuracy: 0.987908
Testing accuracy: 0.781648

Feature weights:
 zy3hpdjnwg: -0.7900
 qzlpfhpwdo: -0.6970
 better: -0.5112
 f7wqpcekg2: -0.5112
 sun: -0.5112
 t5trhjuau0: -0.5112
 august: -0.4647
 seat: -0.4647
 permanently: -0.4647
 desire: -0.4647
 hrqcjdovjz: -0.4647
 qzqc8wwwcn: -0.4647
 poll: -0.4647
 appears: -0.4182
 move: -0.4182
 full: -0.4182
 3yu26v19zh: -0.4182
 dlaub2nvtn: -0.4182
 jobs: -0.4182
 6gldwx71da: -0.4182
 scared: -0.4182
 song: -0.4182
 themselves: -0.3717
 used: -0.3717
 greenlacey: -0.3717
 booksbyroger: -0.3717
 l9mb2j5pxg: -0.3717
 myself: -0.3717
 zdj2hyf6ro: -0.3717
 wrong: -0.3717
 bags: -0.3717
 lot: -0.3717
 hands: -0.3717
 show: -0.3717
 x8moyevjsj: -0.3717
 m5djllxozp: -0.3717
 nzrwddlntp: -0.3717
 market: -0.3717
 mickinyman: -0.3717
 theatlantic: -0.3717
 esemjrn5cc: -0.3717
 lets: -0.3717
 ruin: -0.3717
 jlczidz7vu: -0.3717
 bejftygjil: -0.3717
 best: -0.3717
 long: -0.3717
 book: -0.3717
 wackoes: -0.3717
 ex

 aoesbvns45: -0.0929
 emails: -0.0929
 engle: -0.0929
 holt: -0.0929
 earners: -0.0929
 ashestoashes: -0.0929
 kristyleemusic: -0.0929
 january: -0.0929
 attic: -0.0929
 mattcohen4fake: -0.0929
 smokey: -0.0929
 grazed: -0.0929
 de: -0.0929
 complaints: -0.0929
 iglnqpgbnw: -0.0929
 cocker: -0.0929
 hikdc1fm2f: -0.0929
 i7eskymoec: -0.0929
 hlongwane: -0.0929
 welshninja87: -0.0929
 dancers: -0.0929
 margarita: -0.0929
 softenza: -0.0929
 flawless: -0.0929
 y33qckq7qd: -0.0929
 0rsverlztm: -0.0929
 lucymayofficial: -0.0929
 boss: -0.0929
 ebrointheam: -0.0929
 roh3: -0.0929
 throwback: -0.0929
 faceless: -0.0929
 rgtyzbnkeo: -0.0929
 bennycapricon: -0.0929
 wmoyibwec1: -0.0929
 gidiexclusixe: -0.0929
 spots: -0.0929
 biggie: -0.0929
 wineisdumb: -0.0929
 2a: -0.0929
 sonyprousa: -0.0929
 equipments: -0.0929
 drumming: -0.0929
 oppose: -0.0929
 thesewphist: -0.0929
 cufi: -0.0929
 9q9rk3fof7: -0.0929
 objective: -0.0929
 theboyofmasks: -0.0929
 okay: -0.0929
 burford: -0.0929
 fuckface:

 recognised: -0.0465
 1716: -0.0465
 sharing: -0.0465
 everybody: -0.0465
 hopped: -0.0465
 mylifestory: -0.0465
 smile: -0.0465
 corner: -0.0465
 4playthursdays: -0.0465
 lsjowgyvqh: -0.0465
 backty: -0.0465
 psychic: -0.0465
 cq7jj6yjfz: -0.0465
 brookesddl: -0.0465
 simon: -0.0465
 crucial: -0.0465
 swayoung01: -0.0465
 chriscesq: -0.0465
 freestyles: -0.0465
 chronicillness: -0.0465
 8rabhqrth5: -0.0465
 unconsciously: -0.0465
 blaawhysct: -0.0465
 1cvegtizog: -0.0465
 tomdean86: -0.0465
 snapping: -0.0465
 alright: -0.0465
 boundaries: -0.0465
 easily: -0.0465
 gesture: -0.0465
 auction: -0.0465
 willian: -0.0465
 ddr0zjxvqn: -0.0465
 priorities: -0.0465
 thoughts: -0.0465
 shifted: -0.0465
 arms: -0.0465
 9stlkh59fb: -0.0465
 o1enhjrkjd: -0.0465
 inkuv5dntx: -0.0465
 raniakhalek: -0.0465
 pumpkins: -0.0465
 alexjacobsonpfs: -0.0465
 ensuring: -0.0465
 berlin: -0.0465
 sonofbaldwin: -0.0465
 celestial: -0.0465
 shite: -0.0465
 joelsherman1: -0.0465
 gnorijnsva: -0.0465
 yo3t8qho9h

 jaycootchi: -0.0465
 creativity: -0.0465
 darkndtatted: -0.0465
 acb0ryenuo: -0.0465
 breaches: -0.0465
 nyty7fcqo6: -0.0465
 1pulaekxcq: -0.0465
 seashore: -0.0465
 cantmakeitup: -0.0465
 josephus: -0.0465
 leaks: -0.0465
 watched: -0.0465
 cracking: -0.0465
 bridgework: -0.0465
 themaine: -0.0465
 yn6nxoucr1: -0.0465
 joints: -0.0465
 growingupblack: -0.0465
 joegoodmanjr: -0.0465
 indiakomuntorjawabdo: -0.0465
 underwood: -0.0465
 ek6kyhxpe9: -0.0465
 tpw5gpmhq4: -0.0465
 marquez: -0.0465
 xxhjesc: -0.0465
 colts: -0.0465
 beth: -0.0465
 antiochus: -0.0465
 margaret: -0.0465
 tvshowtime: -0.0465
 wheatley: -0.0465
 s01e09: -0.0465
 disturbances: -0.0465
 hugged: -0.0465
 chicken: -0.0465
 6vja8r4yxa: -0.0465
 wisdomwed: -0.0465
 satellite: -0.0465
 fwj9ccyw6k: -0.0465
 idgaf: -0.0465
 drayesha4: -0.0465
 ciyty0fgpr: -0.0465
 hieroglyphics: -0.0465
 starflame_girl: -0.0465
 eliminate: -0.0465
 serene: -0.0465
 dolls: -0.0465
 sgxb6e5yda: -0.0465
 hired: -0.0465
 holmes: -0.0465
 ski

 wowsavannah: 0.0000
 bosphore: 0.0000
 bestie: 0.0000
 1wopsgbvvv: 0.0000
 poignant: 0.0000
 pogo: 0.0000
 bestcomedyvine: 0.0000
 pochette: 0.0000
 xdojjjj: 0.0000
 1965: 0.0000
 boobs: 0.0000
 pone: 0.0000
 poss: 0.0000
 x5yeuylt1x: 0.0000
 poses: 0.0000
 ramp: 0.0000
 boltåêcyclone: 0.0000
 1vz3rmjhy4: 0.0000
 bomairinge: 0.0000
 portable: 0.0000
 x6asgrjswc: 0.0000
 porcini: 0.0000
 bomd: 0.0000
 1967: 0.0000
 popcorn: 0.0000
 bong: 0.0000
 rams: 0.0000
 bestseller: 0.0000
 bonhomme37: 0.0000
 bonnieg434: 0.0000
 bonsai: 0.0000
 prem: 0.0000
 poc: 0.0000
 premonitions: 0.0000
 wq3wjsgphl: 0.0000
 wwfacu6nft: 0.0000
 blitzes: 0.0000
 prolly: 0.0000
 projeavg8t: 0.0000
 1funemes7m: 0.0000
 blizzard_draco: 0.0000
 blizzard_fans: 0.0000
 blizzard_gamin: 0.0000
 promo: 0.0000
 blizzardcs: 0.0000
 1fhrrhcimh: 0.0000
 profittothepeople: 0.0000
 blizzheroes: 0.0000
 producthunt: 0.0000
 producer: 0.0000
 prod: 0.0000
 painthey: 0.0000
 bfes5twbzt: 0.0000
 wwgadpffkw: 0.0000
 prmtxjjdue: 0

 widda16: 0.0000
 widout: 0.0000
 satisfying: 0.0000
 2jr3yo55dr: 0.0000
 attended: 0.0000
 attraction: 0.0000
 sarcastic: 0.0000
 au5jwgt0ar: 0.0000
 auckland: 0.0000
 audaciousspunk: 0.0000
 sapphirescallop: 0.0000
 b6nphxorzg: 0.0000
 rossum: 0.0000
 rossmartin7: 0.0000
 b6wwq2nyqi: 0.0000
 restoring: 0.0000
 resting: 0.0000
 wnamtxlfmt: 0.0000
 21hsrrqzou: 0.0000
 resque: 0.0000
 21b6skpdur: 0.0000
 batter: 0.0000
 respecting: 0.0000
 respected: 0.0000
 vacancies: 0.0000
 restoringpaths: 0.0000
 battered: 0.0000
 resoluteshield: 0.0000
 batters: 0.0000
 resigninshame: 0.0000
 battery: 0.0000
 battle_dom: 0.0000
 reserves: 0.0000
 reserved: 0.0000
 battlerapchris: 0.0000
 requiem: 0.0000
 request: 0.0000
 resolutevanity: 0.0000
 requa: 0.0000
 restrospect: 0.0000
 resxavgpyj: 0.0000
 barry: 0.0000
 rg9yaybosa: 0.0000
 barthubbuch: 0.0000
 rg3bndkxjx: 0.0000
 rfvjh58ef2: 0.0000
 rfb3jxbiej: 0.0000
 rey: 0.0000
 reworked: 0.0000
 baseballquotes1: 0.0000
 wmur9: 0.0000
 results: 0.0000

 ltvvpflsg8: 0.0000
 epa: 0.0000
 luzukokoti: 0.0000
 err: 0.0000
 flurry: 0.0000
 flunkie: 0.0000
 concluded: 0.0000
 lubbock: 0.0000
 fleetwood: 0.0000
 lubrication: 0.0000
 lunch: 0.0000
 flesh: 0.0000
 ltz: 0.0000
 eo2f96wxpz: 0.0000
 lynchburg: 0.0000
 enzasbargains: 0.0000
 losers: 0.0000
 finite: 0.0000
 m4jdzmgjow: 0.0000
 losses: 0.0000
 m416: 0.0000
 fowlers: 0.0000
 foutpwgfwy: 0.0000
 connecticut: 0.0000
 fouseytube: 0.0000
 m4: 0.0000
 m3: 0.0000
 lotz: 0.0000
 m2yuxnqlqy: 0.0000
 estates: 0.0000
 fousey: 0.0000
 firefighte: 0.0000
 enpjcfma8l: 0.0000
 enqgtbaxuj: 0.0000
 conklin: 0.0000
 loseit: 0.0000
 fp64yosjwx: 0.0000
 fpaoulwu3n: 0.0000
 fph01u3eii: 0.0000
 m8ciks60bx: 0.0000
 etg0prbp4g: 0.0000
 fqv47ob8ge: 0.0000
 competitor: 0.0000
 fqsk7qcawo: 0.0000
 eternity: 0.0000
 lore: 0.0000
 completed: 0.0000
 fqj0squ3lg: 0.0000
 founding: 0.0000
 fqcdphccg7: 0.0000
 fprt7nwrot: 0.0000
 engineers: 0.0000
 m6lvkxl9ii: 0.0000
 losangelestimes: 0.0000
 engines: 0.0000
 m5sbf

 loupascale: 0.0465
 gqpi7jmkan: 0.0465
 worldnews: 0.0465
 bigsim50: 0.0465
 refuses: 0.0465
 1000s: 0.0465
 wrightsboro: 0.0465
 wqy3jokumh: 0.0465
 dduxthvvnr: 0.0465
 pirates: 0.0465
 pst5bbq0av: 0.0465
 croat: 0.0465
 igxrqpotm7: 0.0465
 landi: 0.0465
 cost: 0.0465
 acquired: 0.0465
 icttz0divr: 0.0465
 nzifztcugl: 0.0465
 khulna: 0.0465
 plunging: 0.0465
 limestone: 0.0465
 mligpuhvoh: 0.0465
 tomfromireland: 0.0465
 teams: 0.0465
 gzxipmoknb: 0.0465
 haji_hunter762: 0.0465
 ûïwhen: 0.0465
 ln: 0.0465
 fx0w2sq05f: 0.0465
 missiles: 0.0465
 1w58ehv9s1: 0.0465
 payback: 0.0465
 creationsbykole: 0.0465
 p8ih0hni3l: 0.0465
 p769eo49fj: 0.0465
 sanjaynirupam: 0.0465
 sourmashnumber7: 0.0465
 conquest: 0.0465
 rfcgeom66: 0.0465
 fieldstone: 0.0465
 succeed: 0.0465
 middleeasteye: 0.0465
 snooker: 0.0465
 sneaks: 0.0465
 sunflower: 0.0465
 pqhq4jnztt: 0.0465
 addict: 0.0465
 muzzies: 0.0465
 û_turns: 0.0465
 uorxff0nfx: 0.0465
 careemergencies: 0.0465
 childfund: 0.0465
 hungry: 0.0465


 arabian: 0.0929
 flying: 0.0929
 arrived: 0.0929
 marin: 0.0929
 smeared: 0.0929
 guides: 0.0929
 nxs3z1kxid: 0.0929
 sheer: 0.0929
 wa5c77f8vq: 0.0929
 rar: 0.0929
 summertime: 0.0929
 mp3: 0.0929
 strutting: 0.0929
 alley: 0.0929
 hope: 0.0929
 sahbhlxssh: 0.0929
 hcyajsacfj: 0.0929
 misses: 0.0929
 aelinrhee: 0.0929
 league: 0.0929
 flow: 0.0929
 kezi9: 0.0929
 bowling: 0.0929
 lh9mrypdrj: 0.0929
 pqhuthss3i: 0.0929
 udkmadkuzy: 0.0929
 mascara: 0.0929
 closed: 0.0929
 0iw6drf5x9: 0.0929
 4ceeuzwhvf: 0.0929
 preschool: 0.0929
 afloat: 0.0929
 pension: 0.0929
 somehow: 0.0929
 6m0ynjwbc9: 0.0929
 picked: 0.0929
 78: 0.0929
 toes: 0.0929
 tattoos: 0.0929
 disappears: 0.0929
 itfbbz9xyc: 0.0929
 anchorage: 0.0929
 marcoarment: 0.0929
 apd: 0.0929
 sacramento: 0.0929
 messi: 0.0929
 domination: 0.0929
 jihadis: 0.0929
 unity: 0.0929
 freedomoutpost: 0.0929
 ronaldo: 0.0929
 start: 0.0929
 qyccvuubkr: 0.0929
 blevins: 0.0929
 command: 0.0929
 bro: 0.0929
 mathew_is_angry: 0.0929
 saladi

 worried: 0.2323
 hundreds: 0.2323
 zourryart: 0.2323
 passengers: 0.2323
 thankful: 0.2323
 alexalltimelow: 0.2323
 cuties: 0.2323
 gevrmbvznb: 0.2323
 activated: 0.2323
 purple: 0.2323
 anthrax: 0.2323
 cee: 0.2323
 mkayla: 0.2323
 wmata: 0.2323
 childish: 0.2323
 utter: 0.2323
 petty: 0.2323
 jeez: 0.2323
 fart: 0.2323
 cd: 0.2323
 blakeshelton: 0.2323
 jan: 0.2323
 atomic: 0.2323
 bay: 0.2323
 xpddwh5tem: 0.2323
 mumbai: 0.2323
 bettyfreedoms: 0.2323
 fyi: 0.2323
 abninfvet: 0.2323
 n2qzbmzuly: 0.2323
 greer: 0.2323
 unawares: 0.2323
 _gaabyx: 0.2323
 activist: 0.2323
 trolley: 0.2323
 disrupts: 0.2323
 byvubg0wye: 0.2323
 dumb: 0.2323
 sandiego: 0.2323
 building: 0.2323
 granted: 0.2323
 several: 0.2323
 city: 0.2323
 animal: 0.2323
 threatens: 0.2323
 volcanoes: 0.2323
 remodeled: 0.2323
 event: 0.2323
 jab541hhk0: 0.2323
 unsensibly: 0.2323
 scott: 0.2323
 je6zjwh5ub: 0.2323
 fdxgmiwaeh: 0.2323
 professionally: 0.2323
 curb: 0.2323
 outrun: 0.2323
 jq2ib1ob1x: 0.2323
 streak: 0.

#### Deliverable 2.1

Based on the training accuracy, do you conclude that the data are (mostly) linearly separable? Why or why not?

[your answer here]

#### Deliverable 2.2

Which feature most increases the likelihood that the tweet does not refer to a real disaster, and which feature most increases the likelihood that the tweet refers to a real disaster? 

[your answer here]

#### Deliverable 2.3 
One technique for improving the resulting model with perceptron (or stochastic gradient descent learning in general) is to take an average of the weight vectors learned at different iterations of the algorithm, rather than only using the final weights that minimize the loss. That is, calculate $\bar{\mathbf{w}} = \frac{1}{T} \sum_{t=1}^T \mathbf{w}^{(t)}$ where $\mathbf{w}^{(t)}$ is the weight vector at iteration $t$ of the algorithm and $T$ is the number of iterations, and then use $\bar{\mathbf{w}}$ when making classifications on new data.
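
The idea of averaging can be sketched with a hand-written perceptron on two made-up points. This is only an illustration of the averaging idea under simple assumptions (labels coded as $+1$/$-1$, learning rate 1, no bias term); it is not a description of how `sklearn` implements `average=True` internally.

```python
import numpy as np

# Two made-up training points with labels coded as +1 / -1.
toy_X = np.array([[1.0, 1.0], [-1.0, 1.0]])
toy_y = np.array([1, -1])

w = np.zeros(2)       # current weight vector
w_sum = np.zeros(2)   # running sum of the weight vector after every step
T = 0                 # number of steps taken

for epoch in range(2):
    for x_i, y_i in zip(toy_X, toy_y):
        if y_i * (w @ x_i) <= 0:   # misclassified (or on the boundary): update
            w = w + y_i * x_i
        w_sum += w
        T += 1

w_avg = w_sum / T     # average of the weight vectors over all steps

print("final weights:   ", w)
print("averaged weights:", w_avg)
```

The averaged vector gives some weight to the earlier iterates, so it is less sensitive to the last few updates than the final weights are.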

To use this technique in your classifier, add the keyword argument `average=True` to the `SGDClassifier` function. Try it now using the cells below.

Compare the initial training/test accuracies to the training/test accuracies after doing averaging. What happens? Why do you think averaging the weights from different iterations has this effect?

[your answer here]

## Problem 3: Logistic regression [4 points]

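For this problem, create a new `SGDClassifier`, this time setting the `loss` argument to `'log'`, which will train a logistic regression classifier. (In scikit-learn 1.1 and later this loss is named `'log_loss'`, and the name `'log'` was removed in version 1.3.) Set `average=False` for the remaining problems.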

Once you have trained the classifier, you can use the `predict` function to get the classifications, as with perceptron. Additionally, logistic regression provides probabilities for the predictions. You can get the probabilities by calling the `predict_proba` function. For each instance, this gives two numbers: the first is the probability that the class is $0$ and the second is the probability that the class is $1$.
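
For intuition about where the two probabilities come from, the sketch below computes them by hand with the logistic (sigmoid) function, $P(y=1 \mid \mathbf{x}) = 1 / (1 + e^{-(\mathbf{w} \cdot \mathbf{x} + b)})$. The weights, bias, and instances are made up for illustration; only NumPy is used, so this shows the math rather than the `sklearn` code path.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and instances (not learned from the tweet data).
w = np.array([1.0, -2.0])
b = 0.5
toy_X = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.25]])

p1 = sigmoid(toy_X @ w + b)               # probability of class 1 for each row
probs = np.column_stack([1.0 - p1, p1])   # columns: [P(class 0), P(class 1)]

print(probs)
```

A score of exactly zero gives probability $0.5$ for each class, and larger positive scores push the class-1 probability toward $1$.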


For the first task, add the keyword argument `alpha` to the `SGDClassifier` function. This is the regularization strength, called $\lambda$ in lecture. If you don't specify `alpha`, it defaults to $0.0001$. Experiment with other values and see how this affects the outcome.

#### Deliverable 3.1: 

Calculate the training and testing accuracy when `alpha` is one of $[0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]$. Create a plot where the x-axis is `alpha` and the y-axis is accuracy, with two lines (one for training and one for testing). You can borrow the code from HW1 for generating plots in Python. Use [a log scale for the x-axis](https://matplotlib.org/examples/pylab_examples/log_demo.html) so that the `alpha` values are spaced evenly.

[your solution should be plotted below]

In [3]:
for alpha in [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]:
    classifier = SGDClassifier(loss='log', max_iter=1000, alpha=alpha, tol=1.0e-12, random_state=123, eta0=100, average=False)
    classifier.fit(X_train, Y_train)
    print("Training accuracy: %0.6f" % accuracy_score(Y_train, classifier.predict(X_train)))
    print("Testing accuracy: %0.6f" % accuracy_score(Y_test, classifier.predict(X_test)))

Training accuracy: 0.983333
Testing accuracy: 0.983333
Training accuracy: 0.901634
Testing accuracy: 0.901634
Training accuracy: 0.808660
Testing accuracy: 0.808660
Training accuracy: 0.710784
Testing accuracy: 0.710784
Training accuracy: 0.674183
Testing accuracy: 0.674183
Training accuracy: 0.657026
Testing accuracy: 0.657026
Training accuracy: 0.567484
Testing accuracy: 0.567484


#### Deliverable 3.2

Examine the classifier probabilities using the `predict_proba` function when training with different values of `alpha`. What do you observe? How does `alpha` affect the prediction probabilities, and why do you think this happens?

[your answer here]

#### Deliverable 3.2.1

Now remove the `alpha` argument so that it goes back to the default value. We'll now look at the effect of the learning rate. By default, `sklearn` uses an "optimal" learning rate based on some heuristics that work well for many problems. However, it can be good to see how the learning rate can affect the algorithm.

For this task, add the keyword argument `learning_rate` to the `SGDClassifier` function and set the value to `invscaling`. This defines the learning rate at iteration $t$ as: $\eta_t = \frac{\eta_0}{t^a}$, where $\eta_0$ and $a$ are both arguments you have to define in the `SGDClassifier` function, called `eta0` and `power_t`, respectively. Experiment with different values of `eta0` and `power_t` and see how they affect the number of iterations it takes the algorithm to converge. You will often find that it will not finish within the maximum of $1000$ iterations.
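
To get a feel for the formula before running experiments, the schedule can be evaluated directly. The helper `invscaling_rate` below is a hypothetical name that implements $\eta_t = \eta_0 / t^a$ exactly as written above, outside of `sklearn`.

```python
def invscaling_rate(eta0, power_t, t):
    """Learning rate at step t under the inverse-scaling schedule eta0 / t**power_t."""
    return eta0 / (t ** power_t)

# Compare a slowly decaying schedule (power_t=0.5) with a fast one (power_t=2.0).
for t in [1, 4, 100]:
    slow = invscaling_rate(100.0, 0.5, t)
    fast = invscaling_rate(100.0, 2.0, t)
    print("t=%d  power_t=0.5: %.4f  power_t=2.0: %.4f" % (t, slow, fast))
```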

#### Deliverable 3.3: 

Fill in the table below with the number of iterations for values of `eta0` in $[10.0, 100.0, 1000.0, 10000.0]$ and values of `power_t` in $[0.5, 1.0, 2.0]$. You may find it easier to write Python code that outputs the markdown for the table; if you do, place the output here. If it does not converge within the maximum number of iterations (set to $1000$ by `max_iter`), record $1000$ as the number of iterations. You will need to read the documentation for this class to learn how to recover the actual number of iterations before reaching the stopping criterion.

| `eta0`   | `power_t` | # Iterations |
|-----------:|-----------|--------------|
| $10.0$    | $0.5$     |              |
| $10.0$    | $1.0$     |              |
| $10.0$    | $2.0$     |              |
| $100.0$   | $0.5$     |              |
| $100.0$   | $1.0$     |              |
| $100.0$   | $2.0$     |              |
| $1000.0$  | $0.5$     |              |
| $1000.0$  | $1.0$     |              |
| $1000.0$  | $2.0$     |              |
| $10000.0$ | $0.5$     |              |
| $10000.0$ | $1.0$     |              |
| $10000.0$ | $2.0$     |              |

#### Deliverable 3.4

Describe how `eta0` and `power_t` affect the learning rate based on the formula (e.g., if you increase `power_t`, what will this do to the learning rate?), and connect this to what you observe in the table above.

[your answer here]
  
#### Deliverable 3.4.1

Now remove the `learning_rate`, `eta0`, and `power_t` arguments so that the learning rate returns to the default setting. For this final task, we will experiment with how high the probability must be before an instance is classified as positive.

The code below includes a function called `threshold` which takes as input the classification probabilities of the data (called `probs`, which is given by the function `predict_proba`) and a threshold (called `tau`, a scalar that should be a value between $0$ and $1$). It will classify each instance as $1$ if the probability of being $1$ is greater than `tau`, otherwise it will classify the instance as $0$. Note that if you set `tau` to $0.5$, the `threshold` function should give you exactly the same output as the classifier `predict` function.

You should find that increasing the threshold causes the accuracy to drop. This makes sense, because you are classifying some things as 0 even though it's more probable that they are 1. So why do this? Suppose you care more about being right when you flag a tweet as a disaster than about catching every disaster tweet (e.g., maybe you forward these tweets to first responders). You thus want to be confident that when you classify a tweet as 1, it really is 1.

There is a metric called _precision_ which measures something like accuracy but for one specific class. Whereas accuracy is the percentage of tweets that were correctly classified, the precision of 1 would be the percentage of tweets classified as 1 that were correctly classified. (In other words, the number of tweets classified as 1 whose correct label was 1, divided by the number of tweets classified as 1.)

You can use the [`precision_score`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score) function from `sklearn` to calculate the precision. It works much like the `accuracy_score` function.
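
The definition of precision can be checked by hand on a few made-up values. In the sketch below, the labels and probabilities are hypothetical, `toy_threshold` mirrors the thresholding rule described above, and `manual_precision` is an illustrative helper that assumes at least one instance is classified as 1.

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up true labels and predicted probabilities (columns: class 0, class 1).
y_true = np.array([1, 0, 1, 1, 0])
toy_probs = np.array([[0.10, 0.90],
                      [0.40, 0.60],
                      [0.45, 0.55],
                      [0.70, 0.30],
                      [0.80, 0.20]])

def toy_threshold(probs, tau):
    """Classify as 1 only when the class-1 probability exceeds tau."""
    return np.where(probs[:, 1] > tau, 1, 0)

def manual_precision(y, pred):
    """Fraction of instances classified as 1 whose true label is 1."""
    predicted_pos = (pred == 1)
    return np.sum(y[predicted_pos] == 1) / np.sum(predicted_pos)

for tau in [0.5, 0.7]:
    pred = toy_threshold(toy_probs, tau)
    print(tau, manual_precision(y_true, pred), precision_score(y_true, pred))
```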

In [4]:
# your answer here

#### Deliverable 3.5

Calculate the testing precision when the value of `tau` for thresholding is one of $[0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]$. Create a plot where the x-axis is `tau` and the y-axis is precision.

[your solution should be plotted below]

In [5]:
# use this function for deliverable 3.5
def threshold(probs, tau):
    return np.where(probs[:,1] > tau, 1, 0)

# your logistic regression code here

classifier = SGDClassifier(loss='log', max_iter=1000, tol=1.0e-12, random_state=123)
classifier.fit(X_train, Y_train)

SGDClassifier(loss='log', random_state=123, tol=1e-12)

#### Deliverable 3.6

Describe what you observe with thresholding (e.g., what happens to precision as the threshold increases?), and explain why you think this happens.

[your answer here]

## Problem 4: Sparse learning [5604: 5 points; 4604: +3 EC points]

Add the `penalty` argument to `SGDClassifier` and set the value to `'l1'`, which tells the algorithm to use L1 regularization instead of the default L2. Recall from lecture that L1 regularization encourages weights to stay at exactly $0$, resulting in a more "sparse" model than L2. You should see this effect if you examine the values of `classifier.coef_`.
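
The sparsity effect can be previewed on synthetic data before trying it on the tweets. The sketch below is an illustration under several assumptions: the data come from `make_classification`, the loss is the `SGDClassifier` default (hinge) rather than logistic regression, `alpha=0.01` is an arbitrary choice, and `fit_coef` is a hypothetical helper. It counts how many weights are exactly zero under each penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data: 50 features, of which only a few are informative.
toy_X, toy_y = make_classification(n_samples=300, n_features=50,
                                   n_informative=5, random_state=0)

def fit_coef(penalty):
    """Train with the given penalty ('l1' or 'l2') and return the weight vector."""
    clf = SGDClassifier(penalty=penalty, alpha=0.01, max_iter=1000,
                        tol=1.0e-12, random_state=123)
    clf.fit(toy_X, toy_y)
    return clf.coef_[0]

coef_l2 = fit_coef('l2')
coef_l1 = fit_coef('l1')

zeros_l2 = int(np.sum(coef_l2 == 0))
zeros_l1 = int(np.sum(coef_l1 == 0))
print("weights exactly zero with L2:", zeros_l2)
print("weights exactly zero with L1:", zeros_l1)
```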

#### Deliverable 4.1

Write a function to calculate the number of features whose weights are nonzero when using L1 regularization. Calculate the number of nonzero feature weights when `alpha` is one of $[0.00001, 0.0001, 0.001, 0.01, 0.1]$. Create a plot where the x-axis is `alpha` and the y-axis is the number of nonzero weights, using a log scale for the x-axis.

[your solution should be plotted below]

In [6]:
# your code here
