# INFO-4604/5604 HW2: Linear Classification 

### Solution by: *YOUR NAME* (and list any partners)


## Assignment overview

News agencies, governments and corporations sometimes track social media during natural disasters to try to monitor unfolding events. Because no single person or group of people can read all available Twitter data, organizations may turn to natural language processing methods to try and understand what is happening as disasters unfold. 

While this approach is powerful, inferring events from NLP can be tricky. For instance, say a person [tweets](https://twitter.com/AnyOtherAnnaK/status/629195955506708480) that "LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE." This tweet includes the word "ablaze", which may signal to a computer that there is an unfolding disaster. However, in this particular case, the person is speaking metaphorically. A simple computer system using keywords (e.g. ablaze) might be fooled into thinking the tweet is reporting an actual fire.

In principle, machine learning methods can balance multiple sources of evidence, which might do a better job and inferring which text refers to actual disasters. Therefore, in this assignment, you will thus predict if a given tweet actually refers to a natural disaster. This dataset originally comes from [Kaggle](https://www.kaggle.com/c/nlp-getting-started/overview).


### What to hand in

You will submit the assignment on Canvas. Submit a single Jupyter notebook named `hw2lastname.ipynb`, where lastname is replaced with your last name. **Please also submit a PDF or HTML version of your notebook to Canvas**.

Please clearly mark all deliverables. You are encouraged to create additional cells in whatever way makes the presentation more organized and easy to follow. You are allowed to import additional Python libraries.

### Submission policies

- **Collaboration:** You are allowed to work with one partner. You are still expected to write up your own solution. Each individual must turn in their own submission, and list your collaborators after your name.
- **Late submissions:** Each student may use up to 5 late days over the semester. You have late days, not late hours. This means that if your submission is late by any amount of time past the deadline, then this will use up a late day. If it is late by any amount beyond 24 hours past the deadline, then this will use a second late, and so on. Once you have used up all late days, late assignments will be given at most 80% credit after one day and 60% credit after two days.


## Getting started

In this assignment, you will experiment with perceptron and logistic regression in `sklearn`. Much of the code has already been written for you. We will use a class called `SGDClassifier` (which you should read about in the [sklearn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)), which  implements stochastic gradient descent (SGD) for a variety of loss functions, including both perceptron and logistic regression, so this will be a way to easily move between the two classifiers.

The code below will load the datasets. There are two data collections: the "training" data, which contains the tweets that you will use for training the classifiers, and the "testing" data, which are tweets that you will use to measure the classifier accuracy. The test tweets are instances the classifier has never seen before, so they are a good way to see how the classifier will behave on data it hasn't seen before. However, we still know the labels of the test tweets, so we can measure the accuracy.

For this problem, we will use what are called "bag of words" features, which are commonly used when doing classification with text. Each feature is a word, and the value of a feature for a particular tweet is number of times the word appears in the tweet (with value $0$ if the word does not appear in the tweet).

A note on labels: **If `Y_train` or `Y_test` are 1 this means the tweet refers to a real disaster; if the values are 0, it means the tweet does not refer to a real disaster** 

Run the block of code below to load the data. You don't need to do anything yet. Move on to "Problem 1" next.

In [12]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

df_train = pd.read_csv('train.csv')

Y_train = df_train["target"]
text_train = df_train["text"]

vec = CountVectorizer()
X_train = vec.fit_transform(text_train)
feature_names = np.asarray(vec.get_feature_names())

df_test = pd.read_csv('test.csv')
Y_test = df_train["target"]
text_test = df_train["text"]

X_test = vec.transform(text_test)


## Problem 1: Understand the data [6 points]

Before doing anything else, take time to understand the code above.

The variables `df_train` and `df_test` are dataframes that store the training (and testing) datasets, which are contained in tab-separated files where the first column is the label and the second column is the text of the tweet.

The [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class converts the raw text into a bag-of-words into a feature vector representation that `sklearn` can use.

You should print out the values of the variables and write any other code needed to answer the following questions.

#### Deliverable 1.1

How many training instances are in the dataset? How many test instances?

[your answer here]

#### Deliverable 1.2

How many features are in the training data?

[your answer here]

#### Deliverable 1.3

What is the distribution of labels in the training data? That is, what percentage of instances are about actual disasters?

[your answer here]

## Problem 2: Perceptron [6 points]

The code below trains an [`SGDClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) using the perceptron loss, then it measures the accuracy of the classifier on the test data, using `sklearn`'s [`accuracy_score`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) function. 

The `fit` function trains the classifier. The feature weights are stored in the `coef_` variable after training. The `predict` function of the trained `SGDClassifier` outputs the predicted label for a given instance or list of instances.

Additionally, this code displays the features and their weights in sorted order, which you may want to examine to understand what the classifier is learning. In general, in binary classification, the 0 class is considered the "negative" class.

There are 3 keyword arguments that have been added to the code below. It is important you keep the same values of these arguments whenever you create an `SGDClassifier` instance in this assignment so that you get consistent results. They are:

- `max_iter` is one of the stopping criteria, which is the maximum number of iterations/epochs the algorithm will run for.

- `tol` is the other stopping criterion, which is how small the difference between the current loss and previous loss should be before stopping.

- `random_state` is a seed for pseudorandom number generation. The algorithm uses randomness in the way the training data are sorted, which will affect the solution that is learned, and even the accuracy of that solution.

Note: *Wait a minute $-$ in class we learned that the loss function is convex, so the algorithm will find the same minimum regardless of how it is trained. Why is there random variation in the output? The reason is that even though there is only one minimum value of the loss, there may be different weights that result in the same loss, so randomness is a matter of tie-breaking. What's more, while different weights may have the same loss, they could lead to different classification accuracies, because the loss function is not the same as accuracy. (Unless accuracy was your loss function... which is possible, but uncommon because it turns out to be a difficult function to optimize.)
Note that different computers may still give different answers, despite keeping these settings the same, because of how pseudorandom numbers are generated with different operating systems and Python environments.*

To begin, run the code in the cell below without modification.

In [16]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, precision_score

classifier = SGDClassifier(loss='perceptron', max_iter=1000, tol=1.0e-12, random_state=123, eta0=100)
classifier.fit(X_train, Y_train)

print("Number of SGD iterations: %d" % classifier.n_iter_)
print("Training accuracy: %0.6f" % accuracy_score(Y_train, classifier.predict(X_train)))
print("Testing accuracy: %0.6f" % accuracy_score(Y_test, classifier.predict(X_test)))

print("\nFeature weights:")
args = np.argsort(classifier.coef_[0])
for a in args:
    print(" %s: %0.4f" % (feature_names[a], classifier.coef_[0][a]))

Number of SGD iterations: 41
Training accuracy: 0.985026
Testing accuracy: 0.985026

Feature weights:
 zy3hpdjnwg: -0.5429
 qzlpfhpwdo: -0.4790
 song: -0.4152
 desire: -0.4152
 pile: -0.3832
 permanently: -0.3832
 qzqc8wwwcn: -0.3832
 mode: -0.3513
 significance: -0.3513
 better: -0.3513
 f7wqpcekg2: -0.3513
 phone: -0.3513
 poll: -0.3513
 t5trhjuau0: -0.3513
 myself: -0.3513
 6gldwx71da: -0.3513
 throw: -0.3513
 eyes: -0.3513
 nature: -0.3194
 ebay: -0.3194
 answer: -0.3194
 scared: -0.3194
 market: -0.3194
 dianneg: -0.3194
 selfies: -0.3194
 m5djllxozp: -0.3194
 hrqcjdovjz: -0.3194
 hands: -0.3194
 themselves: -0.3194
 tickets: -0.3194
 join: -0.3194
 long: -0.2874
 practice: -0.2874
 6lojoorouk: -0.2874
 vet: -0.2874
 mickinyman: -0.2874
 theatlantic: -0.2874
 pissed: -0.2874
 donnie: -0.2874
 art: -0.2874
 feels: -0.2874
 open: -0.2874
 quick: -0.2874
 fact: -0.2874
 statistically: -0.2874
 dlaub2nvtn: -0.2874
 wackoes: -0.2874
 lethal: -0.2874
 zdj2hyf6ro: -0.2874
 complete: -0.2

 hashtags: -0.0958
 links: -0.0958
 location: -0.0958
 objective: -0.0958
 strickskin: -0.0958
 zumiez: -0.0958
 crabbycale: -0.0958
 dylanmcclure55: -0.0958
 reunite: -0.0958
 zachlowe_nba: -0.0958
 v1mtr517ue: -0.0958
 annoying: -0.0958
 fxux987vzx: -0.0958
 constantly: -0.0958
 vannuyscouncil: -0.0958
 nuys: -0.0958
 1ki8lgvay4: -0.0958
 ghetto: -0.0958
 anp9g6njfd: -0.0958
 acfi2rhz4n: -0.0958
 scourge: -0.0958
 selling: -0.0958
 1017: -0.0958
 lw9o2kdk18: -0.0958
 ryobyqjfce: -0.0958
 farmr: -0.0958
 fanged: -0.0958
 foragesecret: -0.0958
 mart: -0.0958
 morels: -0.0958
 bitches: -0.0958
 mushroom: -0.0958
 photoshop: -0.0958
 whedonesque: -0.0958
 ambleside: -0.0958
 kisses: -0.0958
 round: -0.0958
 soak: -0.0958
 q0jhdcu6ly: -0.0958
 chill: -0.0958
 inning: -0.0958
 sundays: -0.0958
 skies: -0.0958
 alisonannyoung: -0.0958
 nurse: -0.0958
 your: -0.0958
 testy: -0.0958
 valdes1978: -0.0958
 seasonfrom: -0.0958
 onshit: -0.0958
 meter: -0.0958
 happy: -0.0958
 springs: -0.0958
 i

 dnbheaven: -0.0639
 utahgrizz: -0.0639
 proper: -0.0639
 vnzybfgzcm: -0.0639
 akgovbillwalker: -0.0639
 stony: -0.0639
 monthly: -0.0639
 drafted: -0.0639
 4dbhono3rk: -0.0639
 promised: -0.0639
 85v: -0.0639
 owners: -0.0639
 djx5elbrv1: -0.0639
 guards: -0.0639
 0jmkdtcymj: -0.0639
 felons: -0.0639
 cleaning: -0.0639
 kro: -0.0639
 taxreturn: -0.0639
 nukes: -0.0639
 collideworship_: -0.0639
 pir: -0.0639
 overall: -0.0639
 njvpxzmj5v: -0.0639
 bookslast: -0.0639
 bcfcticketlady: -0.0639
 0wbecdmhqo: -0.0639
 ironically: -0.0639
 cream: -0.0639
 conservation: -0.0639
 cri: -0.0639
 fatality_us: -0.0639
 nova: -0.0639
 sonofbaldwin: -0.0639
 265v: -0.0639
 solicitor: -0.0639
 wbogs8ejsj: -0.0639
 title: -0.0639
 06jst: -0.0639
 sportwatch: -0.0639
 bromleythe: -0.0639
 tren: -0.0639
 req: -0.0639
 lover: -0.0639
 rigour: -0.0639
 begun: -0.0639
 serve: -0.0639
 beetroot: -0.0639
 thpbdpdj35: -0.0639
 orianna: -0.0639
 kwislo: -0.0639
 _animaladvocate: -0.0639
 prove: -0.0639
 learned

 f2gwxeprak: -0.0319
 depreciations: -0.0319
 sholt87: -0.0319
 iphoto: -0.0319
 staged: -0.0319
 wgqkxmby3b: -0.0319
 hirsch: -0.0319
 schism: -0.0319
 inbetween: -0.0319
 lowered: -0.0319
 401ks: -0.0319
 usd: -0.0319
 entretenimento: -0.0319
 z87zmi3ozs: -0.0319
 glosblue66: -0.0319
 politicians: -0.0319
 canceling: -0.0319
 lightman: -0.0319
 since1970the: -0.0319
 carlsbadbugkil1: -0.0319
 oped: -0.0319
 1pack: -0.0319
 po_st: -0.0319
 viscous: -0.0319
 badu: -0.0319
 oceans: -0.0319
 dambisa: -0.0319
 oracle: -0.0319
 definition: -0.0319
 amazingness: -0.0319
 picthis: -0.0319
 cllwud4wsu: -0.0319
 tanslash: -0.0319
 australians: -0.0319
 br7gmmh5ek: -0.0319
 l5g2zj3kgg: -0.0319
 metroid: -0.0319
 2pack: -0.0319
 slithering: -0.0319
 yeahs: -0.0319
 marble: -0.0319
 tloz: -0.0319
 teenfiction: -0.0319
 boots: -0.0319
 moyo: -0.0319
 agdq: -0.0319
 7zb9gm5z0h: -0.0319
 g891m9gh4r: -0.0319
 devia: -0.0319
 ler: -0.0319
 fleek: -0.0319
 prosyn: -0.0319
 erykah: -0.0319
 wattys2015: 

 outnumbering: -0.0319
 renunciedilma: -0.0319
 77: -0.0319
 ûïi: -0.0319
 zcvfc500yy: -0.0319
 viralspell: -0.0319
 danielsahyounie: -0.0319
 ddhwori5w1: -0.0319
 campanha: -0.0319
 ifqqpur99x: -0.0319
 sje59u2nnm: -0.0319
 y217ceeemd: -0.0319
 37592: -0.0319
 cakes: -0.0319
 anxiety: -0.0319
 quests: -0.0319
 stare: -0.0319
 teeth: -0.0319
 piercings: -0.0319
 x2: -0.0319
 c1xhizprad: -0.0319
 0h7oua1pns: -0.0319
 pxnatosil: -0.0319
 grand: -0.0319
 weudlkc4o4: -0.0319
 freed: -0.0319
 trl1dskf81: -0.0319
 backed: -0.0319
 surgical: -0.0319
 associated: -0.0319
 replacement: -0.0319
 elgeotaofeeq: -0.0319
 yahoonews: -0.0319
 rz0adzursw: -0.0319
 captives: -0.0319
 rave: -0.0319
 adani: -0.0319
 ought: -0.0319
 pk8dgvripw: -0.0319
 ambition: -0.0319
 hfvnyft78c: -0.0319
 format: -0.0319
 robbie: -0.0319
 grzchkdf37: -0.0319
 derby: -0.0319
 breaks: -0.0319
 patrol: -0.0319
 shaping: -0.0319
 hgf52611: -0.0319
 dogg: -0.0319
 techniqu: -0.0319
 nate: -0.0319
 eminem: -0.0319
 redsox: 

 ldywd4ydt9: 0.0000
 tinderbox: 0.0000
 tithenai: 0.0000
 theft: 0.0000
 thegame: 0.0000
 theghostparty: 0.0000
 ûïmake: 0.0000
 leaning: 0.0000
 tire: 0.0000
 ûïnobody: 0.0000
 tipster: 0.0000
 ûïparties: 0.0000
 thejonesesvoice: 0.0000
 ûïplans: 0.0000
 leagues: 0.0000
 themale_madonna: 0.0000
 theme: 0.0000
 tiny: 0.0000
 tinted: 0.0000
 thepartyofmeanness: 0.0000
 tindering: 0.0000
 laylovetournay: 0.0000
 thedayct: 0.0000
 thighs: 0.0000
 lawx: 0.0000
 laughs: 0.0000
 ûò800000: 0.0000
 ûòthe: 0.0000
 tigersjostun: 0.0000
 ûòåêcnbc: 0.0000
 tiffanyfrizzell: 0.0000
 throwingknifes: 0.0000
 thrusts: 0.0000
 tight: 0.0000
 thugging: 0.0000
 thurlow: 0.0000
 ticklemeshawn: 0.0000
 thursd: 0.0000
 latinoand: 0.0000
 latino: 0.0000
 ûónegligence: 0.0000
 latina: 0.0000
 thyroid: 0.0000
 thundering: 0.0000
 threesome: 0.0000
 til_now: 0.0000
 thranduil: 0.0000
 thinkpink: 0.0000
 thirst: 0.0000
 lawsonofficial: 0.0000
 thirty: 0.0000
 thisisperidot: 0.0000
 lawn: 0.0000
 thisispublichealt

 haw: 0.0000
 cools: 0.0000
 havnt: 0.0000
 hav: 0.0000
 avi: 0.0000
 coolest: 0.0000
 hater: 0.0000
 cookie: 0.0000
 examples: 0.0000
 exc: 0.0000
 hastle: 0.0000
 hardline: 0.0000
 ava: 0.0000
 awn: 0.0000
 contruction: 0.0000
 azl4xydvzk: 0.0000
 hannomottola: 0.0000
 hanneman: 0.0000
 azovgv4sb6: 0.0000
 hannaph: 0.0000
 azusa: 0.0000
 b0es3ziork: 0.0000
 hannah: 0.0000
 hanna_brooksie: 0.0000
 hao: 0.0000
 b0zwi0qptu: 0.0000
 hank: 0.0000
 b19z8vi3td: 0.0000
 b1bx0eruep: 0.0000
 b2: 0.0000
 handi: 0.0000
 b24fowler: 0.0000
 continuing: 0.0000
 b2b: 0.0000
 b2bagency: 0.0000
 hanna: 0.0000
 exil1bkzmp: 0.0000
 az: 0.0000
 contr: 0.0000
 awol: 0.0000
 harder: 0.0000
 hardball: 0.0000
 awtscucbbv: 0.0000
 harda: 0.0000
 awxr24zsqh: 0.0000
 axhclfersu: 0.0000
 axk9xno6yz: 0.0000
 axvqldtehc: 0.0000
 axxdcakzty: 0.0000
 harbhajan_singh: 0.0000
 ay49mtyyl8: 0.0000
 ay6zzcupnz: 0.0000
 ayekoradio: 0.0000
 ayfdjeb7hy: 0.0000
 hara: 0.0000
 controller: 0.0000
 happing: 0.0000
 ayhhhhhdjjfj

 distribution: 0.0000
 13pm: 0.0000
 dramatically: 0.0000
 dloesch: 0.0000
 120000: 0.0000
 kqjevyqzlv: 0.0000
 dr_johanfranzen: 0.0000
 kttape: 0.0000
 ktfounder: 0.0000
 ktd5ig9m5o: 0.0000
 10pm: 0.0000
 11000: 0.0000
 djeddygnj: 0.0000
 11000000: 0.0000
 drains: 0.0000
 ksbynews: 0.0000
 ksawlyux02: 0.0000
 krsy54xmmc: 0.0000
 krnw0wxhe5: 0.0000
 kristenkoin6: 0.0000
 drako: 0.0000
 drama: 0.0000
 kqvn1utpmm: 0.0000
 dramaa_llama: 0.0000
 119000: 0.0000
 11am: 0.0000
 dixon: 0.0000
 kq9ae6ap2b: 0.0000
 downfall: 0.0000
 l5awtundhm: 0.0000
 dlp8kpkt2k: 0.0000
 landslides: 0.0000
 0abgfglh7x: 0.0000
 0btniwagt1: 0.0000
 dolla: 0.0000
 0c1y8g7e9p: 0.0000
 0cr74m1uxm: 0.0000
 doris: 0.0000
 0cxm5tkz8y: 0.0000
 dofrh5yb01: 0.0000
 dorling: 0.0000
 landolina: 0.0000
 0iyuntxduv: 0.0000
 landi: 0.0000
 dobzc3pitm: 0.0000
 lancasteronline: 0.0000
 lancaster: 0.0000
 lana: 0.0000
 lan76zqkxg: 0.0000
 lan: 0.0000
 0keh2treny: 0.0000
 0krw1zyahm: 0.0000
 dorett: 0.0000
 dnwwo1ybrk: 0.0000
 dom

 wendell: 0.0319
 intact: 0.0319
 edmond: 0.0319
 tn1ax1xmbb: 0.0319
 6jjvcdn4ti: 0.0319
 s4srgrmqcz: 0.0319
 fa07af174a71408: 0.0319
 gulf: 0.0319
 mdash: 0.0319
 berry: 0.0319
 wkpzp1jcau: 0.0319
 mjtn3qbgos: 0.0319
 rzagdnwtah: 0.0319
 effected: 0.0319
 showing: 0.0319
 goldstein: 0.0319
 temperature: 0.0319
 yield: 0.0319
 lkwxu8qv7n: 0.0319
 redistribute: 0.0319
 bronville: 0.0319
 fundraiser: 0.0319
 ys3nmwwyvc: 0.0319
 lifts: 0.0319
 bust: 0.0319
 ricin: 0.0319
 zionists: 0.0319
 peasants: 0.0319
 11juzhlgmt: 0.0319
 revere: 0.0319
 cali: 0.0319
 fymp4i2wp5: 0.0319
 les: 0.0319
 kxslftz2i5: 0.0319
 southeast: 0.0319
 carmqivkwu: 0.0319
 stalin: 0.0319
 cartel: 0.0319
 leak: 0.0319
 chandanee: 0.0319
 ld0uniyw4k: 0.0319
 floorburnt: 0.0319
 ûªa: 0.0319
 doppler: 0.0319
 cg579wldne: 0.0319
 repression: 0.0319
 cbsdenver: 0.0319
 fortworth: 0.0319
 recycling: 0.0319
 260th: 0.0319
 digit: 0.0319
 manzanita: 0.0319
 rs22lj4qfp: 0.0319
 rat: 0.0319
 mandem: 0.0319
 rp: 0.0319
 orchar

 brutal: 0.0639
 carbondale: 0.0639
 noranda: 0.0639
 exum: 0.0639
 shedid: 0.0639
 irons: 0.0639
 erodes: 0.0639
 participating: 0.0639
 dundee: 0.0639
 aul: 0.0639
 appointment: 0.0639
 helsinki: 0.0639
 hobo: 0.0639
 nixon: 0.0639
 jury: 0.0639
 cbxnhhz6kd: 0.0639
 jxwojxqndc: 0.0639
 physically: 0.0639
 lovefood: 0.0639
 pledged: 0.0639
 smithereens: 0.0639
 kaduna: 0.0639
 latest: 0.0639
 toes: 0.0639
 therefore: 0.0639
 gbbo2015: 0.0639
 exam: 0.0639
 arkansas: 0.0639
 gasped: 0.0639
 helo: 0.0639
 excite: 0.0639
 thx: 0.0639
 zomatoaus: 0.0639
 dayton: 0.0639
 7lhkjz0ivo: 0.0639
 nola: 0.0639
 process: 0.0639
 marcoarment: 0.0639
 g1bwl3dqqk: 0.0639
 invested: 0.0639
 north: 0.0639
 projectiles: 0.0639
 travelelixir: 0.0639
 novel: 0.0639
 theatershooting: 0.0639
 reacted: 0.0639
 ûóher: 0.0639
 volunteer: 0.0639
 icymagistrate: 0.0639
 icicle: 0.0639
 04: 0.0639
 ûïdetonate: 0.0639
 ij0wq490cs: 0.0639
 0dqjeretxu: 0.0639
 indie: 0.0639
 hunterston: 0.0639
 href: 0.0639
 friggin

 mqydxrlae7: 0.0958
 strawberries: 0.0958
 athens: 0.0958
 mexico: 0.0958
 starmade: 0.0958
 stardate: 0.0958
 hamilton: 0.0958
 i2hhviumtm: 0.0958
 planetary: 0.0958
 214904: 0.0958
 139055: 0.0958
 meters: 0.0958
 ap: 0.0958
 hilversum: 0.0958
 nuke: 0.0958
 temptation: 0.0958
 able: 0.0958
 mothernature: 0.0958
 postering: 0.0958
 killhard: 0.0958
 yycfringe: 0.0958
 calgaryfringe: 0.0958
 giant: 0.0958
 home: 0.0958
 deadly: 0.0958
 rosters: 0.0958
 jlester34: 0.0958
 xgnjgle9eq: 0.0958
 stemming: 0.0958
 bftou2nybw: 0.0958
 comingsoon: 0.0958
 arizzo44: 0.0958
 bowling: 0.0958
 ritzy_jewels: 0.0958
 ary: 0.0958
 cowardly: 0.0958
 lodisilverado: 0.0958
 rvfriedmann: 0.0958
 suffield: 0.0958
 buffalo: 0.0958
 bptmlf4p10: 0.0958
 swiftly: 0.0958
 send: 0.0958
 wwu070tjej: 0.0958
 nashvillefd: 0.0958
 nativehuman: 0.0958
 myreligion: 0.0958
 haiyan: 0.0958
 vigilent: 0.0958
 liberties: 0.0958
 gilmanrocks7: 0.0958
 1965: 0.0958
 civil: 0.0958
 ne: 0.0958
 sicily: 0.0958
 behavior: 0.0

 v8aftd9zez: 0.2555
 alil: 0.2555
 skin: 0.2555
 officials: 0.2555
 hijacker: 0.2555
 fam: 0.2555
 omfgv9ma1w: 0.2555
 abc: 0.2555
 plane: 0.2555
 imagine: 0.2555
 ready: 0.2555
 point: 0.2555
 tides: 0.2555
 8jxql8cv8z: 0.2555
 survivors: 0.2555
 depressing: 0.2555
 leader: 0.2555
 apollo: 0.2555
 watch: 0.2555
 stealing: 0.2555
 range: 0.2555
 simply: 0.2555
 adamrubinespn: 0.2555
 bound: 0.2555
 bilsko: 0.2555
 libya: 0.2555
 humofthecity: 0.2555
 godsfirstson1: 0.2555
 humanity: 0.2555
 practically: 0.2555
 cab: 0.2555
 suddenly: 0.2555
 pyrbliss: 0.2555
 heat: 0.2555
 dolphin: 0.2555
 meeting: 0.2555
 mo: 0.2555
 clearly: 0.2555
 hope: 0.2555
 sport_en: 0.2555
 smoke: 0.2555
 theashes: 0.2555
 control: 0.2555
 sake: 0.2555
 crisis: 0.2555
 case: 0.2555
 awareness: 0.2555
 talking: 0.2555
 chick: 0.2555
 forget: 0.2555
 peacefully: 0.2555
 rep: 0.2555
 call: 0.2555
 101: 0.2555
 myanmar: 0.2555
 screaming: 0.2555
 wild: 0.2555
 hostages: 0.2555
 nuclear: 0.2555
 region: 0.2555
 one

#### Deliverable 2.1

Based on the training accuracy, do you conclude that the data are linearly separable? Why or why not?

[your answer here]

#### Deliverable 2.2

Which feature most increases the likelihood that the tweet does not refer to a real disaster, and which feature most increases the likelihood that the tweet refers to a real disaster? 

[your answer here]

#### Deliverable 2.3 
One technique for improving the resulting model with perceptron (or stochastic gradient descent learning in general) is to take an average of the weight vectors learned at different iterations of the algorithm, rather than only using the final weights that minimize the loss. That is, calculate $\bar{\mathbf{w}} = \sum_{t=1}^T \mathbf{w}^{(t)}$ where $\mathbf{w}^{(t)}$ is the weight vector at iteration $t$ of the algorithm and $T$ is the number of iterations, and then use $\bar{\mathbf{w}}$ when making classifications on new data.

To use this technique in your classifier, add the keyword argument `average=True` to the `SGDClassifier` function. Try it now using the cell below.

Compare the initial training/test accuracies to the training/test accuracies after doing averaging. What happens? Why do you think averaging the weights from different iterations has this effect?

[your answer here]

## Problem 3: Logistic regression [15 points]

For this problem, create a new `SGDClassifier`, this time setting the `loss` argument to `'log'`, which will train a logistic regression classifier. Set `average=False` for the remaining problems.

Once you have trained the classifier, you can use the `predict` function to get the classifications, as with perceptron. Additionally, logistic regression provides probabilities for the predictions. You can get the probabilities by calling the `predict_proba` function. This will give a list of two numbers; the first is the probability that the class is $0$ and the second is the probability that the class is $1$.


For the first task, add the keyword argument `alpha` to the `SGDClassifier` function. This is the regularization strength, called $\lambda$ in lecture. If you don't specify `alpha`, it defaults to $0.0001$. Experiment with other values and see how this affects the outcome.

#### Deliverable 3.1: 

Calculate the training and testing accuracy when `alpha` is one of $[0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]$. Create a plot where the x-axis is `alpha` and the y-axis is accuracy, with two lines (one for training and one for testing). You can borrow the code from HW1 for generating plots in Python. Use [a log scale for the x-axis](https://matplotlib.org/examples/pylab_examples/log_demo.html) so that the `alpha` values are spaced evenly.

[your solution should be plotted below]

In [None]:
# your code here

#### Deliverable 3.2

Examine the classifier probabilities using the `predict_proba` function when training with different values of `alpha`. What do you observe? How does `alpha` affect the prediction probabilities, and why do you think this happens?

[your answer here]

#### Deliverable 3.2.1

Now remove the `alpha` argument so that it goes back to the default value. We'll now look at the effect of the learning rate. By default, `sklearn` uses an "optimal" learning rate based on some heuristics that work well for many problems. However, it can be good to see how the learning rate can affect the algorithm.

For this task, add the keyword argument `learning_rate` to the `SGDClassifier` function and set the value to `invscaling`. This defines the learning rate at iteration $t$ as: $\eta_t = \frac{\eta_0}{t^a}$, where $\eta_0$ and $a$ are both arguments you have to define in the `SGDClassifier` function, called `eta0` and `power_t`, respectively. Experiment with different values of `eta0` and `power_t` and see how they affect the number of iterations it takes the algorithm to converge. You will often find that it will not finish within the maximum of $1000$ iterations.

#### Deliverable 3.3: 

Fill in the table below with the number of iterations for values of `eta0` in $[10.0, 100.0, 1000.0, 10000.0]$ and values of `power_t` in $[0.5, 1.0, 2.0]$. You may find it easier to write python code that can output the markdown for the table, but if you do that place the output here. If it does not converge within the maximum number of iterations (set to $1000$ by `max_iter`), record $1000$ as the number of iterations. You will need to read the documentation for this class to learn how to recover the actual number of iterations before reaching the stopping criterion.

| `eta0`   | `power_t` | # Iterations |
|-----------|-----------|--------------|
| $10.0$    | $0.5$     |              |
| $10.0$    | $1.0$     |              |
| $10.0$    | $2.0$     |              |
| $100.0$   | $0.5$     |              |
| $100.0$   | $1.0$     |              |
| $100.0$   | $2.0$     |              |
| $1000.0$  | $0.5$     |              |
| $1000.0$  | $1.0$     |              |
| $1000.0$  | $2.0$     |              |
| $10000.0$ | $0.5$     |              |
| $10000.0$ | $1.0$     |              |
| $10000.0$ | $2.0$     |              |

#### Deliverable 3.4

Describe how `eta0` and `power_t` affect the learning rate based on the formula (e.g., if you increase `power_t`, what will this do to the learning rate?), and connect this to what you observe in the table above.

[your answer here]
  
#### Deliverable 3.4.1

Now remove the `learning_rate`, `eta0`, and `power_t` arguments so that the learning rate returns to the default setting. For this final task, we will experiment with how high the probabiity must be before an instance is classified as positive.

The code below includes a function called `threshold` which takes as input the classification probabilities of the data (called `probs`, which is given by the function `predict_proba`) and a threshold (called `tau`, a scalar that should be a value between $0$ and $1$). It will classify each instance as $1$ if the probability of being $\textrm{Android}$ is greater than `tau`, otherwise it will classify the instance as $0$. Note that if you set `tau` to $0.5$, the `threshold` function should give you exactly the same output as the classifier `predict` function.

You should find that increasing the threshold causes the accuracy to drop. This makes sense, because you are classifying some things as 0 even though it's more probable that they are 1. So why do this? Suppose you care more about accurately identifying the 1 tweets and you don't care as much about 0 tweets. You want to be confident that when you classify a tweet as 1 that it really is 1.

There is a metric called _precision_ which measures something like accuracy but for one specific class. Whereas accuracy is the percentage of tweets that were correctly classified, the precision of 1 would be the percentage of tweets classified as 1 that were correctly classified. (In other words, the number of tweets classified as 1 whose correct label was 1, divided by the number of tweets classified as 1.)

You can use the [`precision_score`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score) function from `sklearn` to calculate the precision. It works just like the `accuracy_score` function, except you have to add an additional keyword argument, `pos_label='Android'`, which tells it that $\textrm{Android}$ is the class you want to calculate the precision of.

In [None]:
# your answer here

#### Deliverable 3.5

Calculate the testing precision when the value of `tau` for thresholding is one of $[0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]$. Create a plot where the x-axis is `tau` and the y-axis is precision.

[your solution should either be plotted below, or included in a separate PDF]

In [None]:
# use this function for deliverable 3.5
def threshold(probs, tau):
    return np.where(probs[:,0] > tau, 1, 0)

# your logistic regression code here

classifier = SGDClassifier(loss='log', max_iter=1000, tol=1.0e-12, random_state=123)
classifier.fit(X_train, Y_train)

#### Deliverable 3.6

Describe what you observe with thresholding (e.g., what happens to precision as the threshold increases?), and explain why you think this happens.

[your answer here]

## Problem 4: Sparse learning [5604: 5 points; 4604: +3 EC points]

Add the `penalty` argument to `SGDClassifier` and set the value to `'l1'`, which tells the algorithm to use L1 regularization instead of the default L2. Recall from lecture that L1 regularization encourages weights to stay at exactly $0$, resulting in a more "sparse" model than L2. You should see this effect if you examine the values of `classifier.coef_`.

#### Deliverable 4.1: Write a function to calculate the number of features whose weights are nonzero when using L1 regularization. Calculate the number of nonzero feature weights when `alpha` is one of $[0.00001, 0.0001, 0.001, 0.01, 0.1]$. Create a plot where the x-axis is `alpha` and the y-axis is the number of nonzero weights, using a log scale for the x-axis.

[your solution should be plotted below]

In [None]:
# your code here
