# Python as a Calculator

Blank notebook to be used for class exercises.

## Exercise 1

The tab-separated (\t) file "sentiment-twitter-data.tsv" contains tweets annotated for sentiment. Load the data, then do the following:

- Split it into train and test sets.
- Create a bag-of-words feature representation for the tweets using the CountVectorizer.
- Use grid search with cross-validation (GridSearchCV) on the train split to find the best C parameter for an SVC classifier (you can also try different kernels).
- Report (print) the accuracy of the final classifier on the test data and the train data.
- How many features were created with the bag-of-words representation?

file path: ../data/sentiment-twitter-data.tsv
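
To make the bag-of-words step concrete, here is a minimal sketch on a made-up three-document corpus (the documents are illustrative and not taken from the tweet file). It assumes scikit-learn is installed and uses CountVectorizer with its default settings. The vocabulary is learned from the training documents only, so a word that appears only in the test documents gets no column.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration (not from the tweet file).
train_docs = ["good movie", "bad movie", "good good day"]
test_docs = ["great movie"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)  # learn the vocabulary on train only
X_test = vec.transform(test_docs)        # reuse it; unseen words are ignored

# Vocabulary terms ordered by column index.
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(vocab)               # one term per column
print(X_train.toarray())   # one row per document, one count per term
print(X_train.shape[1])    # number of bag-of-words features
print(X_test.toarray())    # "great" was never seen, so only "movie" counts
```

The number of features is the number of columns of the matrix, which equals the size of the learned vocabulary.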

In [3]:
!head ../data/sentiment-twitter-data.tsv

264183816548130816	15140428	positive	Gas by my house hit $3.39!!!! I'm going to Chapel Hill on Sat. :)
264249301910310912	18516728	negative	Iranian general says Israel's Iron Dome can't deal with their missiles (keep talking like that and we may end up finding out)
264105751826538497	147088367	positive	with J Davlar 11th. Main rivals are team Poland. Hopefully we an make it a successful end to a tough week of training tomorrow.
264094586689953794	332474633	negative	Talking about ACT's &amp;&amp; SAT's, deciding where I want to go to college, applying to colleges and everything about college stresses me out.
254941790757601280	557103111	negative	They may have a SuperBowl in Dallas, but Dallas ain't winning a SuperBowl. Not with that quarterback and owner. @S4NYC @RasmussenPoll
264169034155696130	382403760	neutral	Im bringing the monster load of candy tomorrow, I just hope it doesn't get all squiched
263192091700654080	344222239	objective-OR-neutral	Apple software, retail chiefs out in o
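
Each row shown above has four tab-separated columns: a tweet id, a second numeric id, the sentiment label, and the tweet text. The sketch below parses two made-up rows in that layout from an in-memory string. One point worth hedging against: with its default settings, `csv.reader` treats a double quote at the start of a field as opening a quoted field, so a tweet that begins with `"` may be merged with the lines that follow. Whether the real file contains such tweets cannot be told from the sample, so passing `quoting=csv.QUOTE_NONE` is a precaution, not a confirmed fix.

```python
import csv
import io

# Two made-up rows in the same four-column layout as the file:
# tweet id, second numeric id, sentiment label, tweet text.
sample = (
    '1\t10\tpositive\t"Great day\n'
    '2\t20\tnegative\tBad "day" today\n'
)

# QUOTE_NONE makes the reader treat double quotes as ordinary characters.
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

texts = [row[-1] for row in rows]   # last column: tweet text
labels = [row[2] for row in rows]   # third column: sentiment label
print(labels)
print(texts)
```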

In [6]:
import csv
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score
import numpy as np

X_txt = []
y = []
with open('../data/sentiment-twitter-data.tsv') as in_file:
    iCSV = csv.reader(in_file, delimiter='\t')
    for row in iCSV:
        X_txt.append(row[-1])
        y.append(row[2])
        
X_txt = np.array(X_txt)
y = np.array(y)

X_train_txt, X_test_txt, y_train, y_test = train_test_split(X_txt, y)

vec = CountVectorizer(ngram_range=(1,1), min_df = 1)
vec.fit(X_train_txt)
X_train = vec.transform(X_train_txt)
X_test = vec.transform(X_test_txt)

params = {"C":[0.0001, 0.001, 0.01, 0.1, 1., 10.]}
svc = LinearSVC()
clf = GridSearchCV(svc, params, scoring='f1_macro', cv=5)
clf.fit(X_train, y_train)

# Macro F1 is reported instead of accuracy because F1 is used in the homework.
train_f1 = f1_score(y_train, clf.predict(X_train), average='macro')
dev_f1 = clf.best_score_  # mean cross-validated macro F1 of the best C
test_f1 = f1_score(y_test, clf.predict(X_test), average='macro')

print("Train F1: {:.4f} Dev F1: {:.4f} Test F1: {:.4f}".format(train_f1, dev_f1, test_f1))

# One column per vocabulary term, so the column count is the number of features.
# (vec.get_feature_names() was removed in scikit-learn 1.2.)
print("Num Features: {}".format(X_train.shape[1]))

Train F1: 0.9824 Dev F1: 0.3738 Test F1: 0.3599
Num Features: 17267
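
The scores above use `average='macro'`, which computes F1 separately for each class and then takes the unweighted mean. The sketch below is a plain-Python illustration of that definition on made-up labels (it is not the scikit-learn implementation). It shows how a classifier that mostly predicts one class can have moderate accuracy but a low macro F1, because every class it never predicts correctly contributes an F1 of 0. This is only one possible contributor to a low score; the sketch does not diagnose the results above.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted p, but it was wrong
            fn[t] += 1   # true class t was missed
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class has no TP, FP or FN.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up labels: the classifier always answers "positive".
y_true = ["positive", "positive", "negative", "neutral"]
y_pred = ["positive", "positive", "positive", "positive"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy: {:.4f}".format(accuracy))
print("macro F1: {:.4f}".format(macro_f1(y_true, y_pred)))
```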


## Exercise 2

Expand on Exercise 1 to include CountVectorizer parameters in the GridSearchCV. Include the following CountVectorizer parameters: stop_words=['english', None], lowercase=[False, True], and min_df=[1, 5, 10].

- What are the parameters that produce the best macro F1?

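Solution note: these parameters don't seem to improve the model :( The best vectorizer settings found are the CountVectorizer defaults, and the scores match Exercise 1.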
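
To search over CountVectorizer parameters, the vectorizer and the classifier are wrapped in a Pipeline, and each grid key is written as the step name, two underscores, and the parameter name. The sketch below only illustrates that naming convention and how quickly the grid grows; the step names `vect` and `clf` and the C values are choices made here for illustration. It assumes scikit-learn is installed and does not fit anything.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([("vect", CountVectorizer()), ("clf", LinearSVC())])

# Grid keys use "<step name>__<parameter name>".
params = {
    "vect__stop_words": ["english", None],
    "vect__lowercase": [False, True],
    "vect__min_df": [1, 5, 10],
    "clf__C": [0.01, 0.1, 1.0],  # illustrative values
}

# Every key should be a parameter the pipeline actually exposes.
valid = pipe.get_params()
unknown = [key for key in params if key not in valid]
print("unknown keys:", unknown)

# Number of parameter combinations; with cv=5 each one is fitted five times.
n_candidates = len(ParameterGrid(params))
print(n_candidates, "candidates,", n_candidates * 5, "fits with cv=5")

# The same naming works for setting a single configuration by hand.
pipe.set_params(vect__min_df=5, clf__C=0.1)
print(pipe.named_steps["vect"].min_df, pipe.named_steps["clf"].C)
```

Because the vectorizer is inside the pipeline, GridSearchCV refits the vocabulary on the training part of every fold, so the held-out part of each fold does not influence the features.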

In [11]:
from sklearn.pipeline import Pipeline

pipe = Pipeline([("vect", CountVectorizer()), ("clf", LinearSVC())])

params = {"vect__stop_words":["english", None],
          "vect__min_df":[1,5,10],
          "vect__ngram_range":[(1,1), (1,2)],
          "vect__lowercase":[True, False],
          "clf__C":[0.001,0.01,0.1, 1., 10.]}

clf = GridSearchCV(pipe, params, scoring='f1_macro', cv=5)
clf.fit(X_train_txt, y_train)

train_f1 = f1_score(y_train, clf.predict(X_train_txt), average='macro')
dev_f1 = clf.best_score_
test_f1 = f1_score(y_test, clf.predict(X_test_txt), average='macro')

print("Best Params: {}".format(clf.best_params_))
print("Train F1: {:.4f} Dev F1: {:.4f} Test F1: {:.4f}".format(train_f1, dev_f1, test_f1))

Best Params: {'clf__C': 0.1, 'vect__lowercase': True, 'vect__min_df': 1, 'vect__ngram_range': (1, 1), 'vect__stop_words': None}
Train F1: 0.9824 Dev F1: 0.3738 Test F1: 0.3599


## Exercise 3

Expanding on Exercise 2, and using the same data, do the following:

- Apply SelectKBest() with chi2 to the training data. What are the most important features with k=10?
- Add SelectKBest() to the grid-search pipeline from Exercise 2 and grid-search over several values of K of your own choosing.
    - What is the best K?
    - Did the performance improve over the model trained on the entire set of features?
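
As a small illustration of what SelectKBest with chi2 does, the sketch below runs it on four made-up documents (not from the tweet file; labels are encoded as 1 for positive and 0 for negative purely for this example). Terms whose counts are concentrated in one class receive a higher chi-squared score, and the k highest-scoring columns are kept. It assumes scikit-learn is installed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy data, made up for illustration: 1 = positive, 0 = negative.
docs = ["good great", "good fun", "bad awful", "bad boring"]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Vocabulary terms ordered by column index.
names = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Keep the two columns with the highest chi-squared score.
skb = SelectKBest(chi2, k=2).fit(X, labels)
scores = [round(float(s), 2) for s in skb.scores_]
selected = [names[i] for i in skb.get_support(indices=True)]

print(list(zip(names, scores)))
print(selected)
```

Here "good" and "bad" each occur twice and only in one class, so they score higher than the words that occur once.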

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

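skb = SelectKBest(chi2, k=10)
skb.fit(X_train, y_train)

# Map the selected column indices back to vocabulary terms.
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names().
feature_names = np.array(vec.get_feature_names_out())
print(feature_names[skb.get_support(indices=True)])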

In [20]:
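# SelectKBest() with no arguments scores features with its default, f_classif; pass chi2 to use chi-squared instead.
pipe = Pipeline([("vect", CountVectorizer()), ('skb', SelectKBest()), ("clf", LinearSVC())])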

params = {"vect__stop_words":["english", None],
          "vect__min_df":[1,5,10],
          "vect__ngram_range":[(1,1), (1,2)],
          "vect__lowercase":[True, False],
          "clf__C":[0.001,0.01,0.1, 1., 10.],
          "skb__k":[100, 500,  'all']}

clf = GridSearchCV(pipe, params, scoring='f1_macro', cv=5)
clf.fit(X_train_txt, y_train)

train_f1 = f1_score(y_train, clf.predict(X_train_txt), average='macro')
dev_f1 = clf.best_score_
test_f1 = f1_score(y_test, clf.predict(X_test_txt), average='macro')

print("Best Params: {}".format(clf.best_params_))
print("Train F1: {:.4f} Dev F1: {:.4f} Test F1: {:.4f}".format(train_f1, dev_f1, test_f1))

Best Params: {'clf__C': 0.1, 'skb__k': 'all', 'vect__lowercase': True, 'vect__min_df': 1, 'vect__ngram_range': (1, 1), 'vect__stop_words': None}
Train F1: 0.9824 Dev F1: 0.3738 Test F1: 0.3599
