# Experimental Methodology in Natural Language Processing

#### Exercise 1

Try the `stratified` and `most_frequent` strategies of `DummyClassifier` and compare their performance.

In [2]:
from sklearn.dummy import DummyClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
data = load_iris()
stratified_split = StratifiedKFold(n_splits=5, shuffle=True)

# Compare the two baseline strategies: 'stratified' and 'most_frequent'
for strat in ['stratified', 'most_frequent']:
    print('Strategy:\033[1m', strat.upper(), '\033[0m')  # ANSI codes print the name in bold
    dummy_clf = DummyClassifier(strategy=strat)
    for train_index, test_index in stratified_split.split(data.data, data.target):
        dummy_clf.fit(data.data[train_index], data.target[train_index])
        # score() predicts on the test fold and returns the accuracy
        accuracy = dummy_clf.score(data.data[test_index], data.target[test_index])
        print(" Accuracy: {:.3}".format(accuracy))
    print(' ')

Strategy: STRATIFIED
 Accuracy: 0.4
 Accuracy: 0.367
 Accuracy: 0.367
 Accuracy: 0.433
 Accuracy: 0.3
 
Strategy: MOST_FREQUENT
 Accuracy: 0.333
 Accuracy: 0.333
 Accuracy: 0.333
 Accuracy: 0.333
 Accuracy: 0.333
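
The `most_frequent` baseline scores 0.333 on every fold because Iris has three equally sized classes, so always predicting one class is right a third of the time. To compare the two baselines more compactly, the per-fold accuracies can be summarised by their mean and standard deviation. The following is a minimal sketch; the fixed `random_state=42` values and the `baseline_summary` dictionary are assumptions added here for repeatability and are not part of the exercise.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

data = load_iris()
# Assumption: fixed seeds so that the folds and the random baseline are repeatable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

baseline_summary = {}
for strat in ['stratified', 'most_frequent']:
    clf = DummyClassifier(strategy=strat, random_state=42)
    # One accuracy value per fold
    fold_acc = cross_val_score(clf, data.data, data.target, cv=cv, scoring='accuracy')
    baseline_summary[strat] = (fold_acc.mean(), fold_acc.std())
    print('{}: {:.3f} +/- {:.3f}'.format(strat, fold_acc.mean(), fold_acc.std()))
```

Any real classifier should be compared against these numbers: an accuracy close to 0.333 on Iris means the model has learned nothing beyond the class distribution.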
 


#### Exercise 2
- Read the [documentation](https://scikit-learn.org/stable/modules/model_evaluation.html) on model evaluation
- Try different evaluation scores
    - For instance, replace `f1_macro` with `f1_micro` or `f1_weighted`

In [3]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_validate


clf = GaussianNB()
for metric in ['f1_macro', 'f1_micro', 'f1_weighted']:
    scores = cross_validate(clf, data.data, data.target, cv=stratified_split, scoring=[metric])
    # Average the per-fold test scores for this metric
    print(metric.upper(), ':', round(scores['test_' + metric].mean(), 2))

F1_MACRO : 0.96
F1_MICRO : 0.95
F1_WEIGHTED : 0.96
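
On Iris the three averages are almost identical because the classes are balanced. The small gap above probably comes from the folds being reshuffled on each `cross_validate` call, since `stratified_split` has no fixed `random_state`. The averaging schemes only diverge on imbalanced data. The sketch below uses a hypothetical toy label set (`y_true`, `y_pred` are invented for illustration) to show how each average is derived from the per-class F1 scores.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical imbalanced labels: class 0 dominates
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 2])

per_class = f1_score(y_true, y_pred, average=None)  # one F1 per class
support = np.bincount(y_true)                       # number of true items per class

macro = f1_score(y_true, y_pred, average='macro')        # unweighted mean of per-class F1
micro = f1_score(y_true, y_pred, average='micro')        # computed from global TP/FP/FN counts
weighted = f1_score(y_true, y_pred, average='weighted')  # per-class F1 weighted by support

print('per class:', np.round(per_class, 3))
print('macro: {:.3f}  micro: {:.3f}  weighted: {:.3f}'.format(macro, micro, weighted))
print('accuracy: {:.3f}'.format(accuracy_score(y_true, y_pred)))
```

In single-label multiclass problems such as this one, micro-averaged F1 equals accuracy, while macro-averaged F1 gives rare classes the same weight as frequent ones.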


## Last Exercise: Text Classification

- Using the 20 Newsgroups corpus from `scikit-learn`, train and evaluate a linear SVM (`LinearSVC`) on the topic classification task
    - [Corpus access and description](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html)
- Experiment with different vectorization methods and parameters:
    - Count vectorization with the `binary` option (CountVect)
    - TF-IDF transformation (TF-IDF)
    - TF-IDF with:
        - minimum and maximum frequency cut-offs (CutOff)
        - stop words removed (WithoutStopWords)
        - lowercasing disabled (NoLowercase)
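
One possible way to set up the vectorization variants listed above is sketched here on a tiny invented corpus (`toy_docs`), so that the effect of each option on the vocabulary is easy to see. The specific cut-off values (`min_df=2`, `max_df=0.75`) are illustrative assumptions, not values prescribed by the exercise; on the real corpus they need tuning.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only to illustrate the options
toy_docs = [
    "The GPU renders the image",
    "The team won the hockey game",
    "A new GPU driver fixes the image bug",
    "The hockey season starts in the fall",
]

vectorizers = {
    'CountVect': CountVectorizer(binary=True),                 # presence/absence instead of counts
    'TF-IDF': TfidfVectorizer(),                               # default TF-IDF weighting
    'CutOff': TfidfVectorizer(min_df=2, max_df=0.75),          # drop very rare and very common terms
    'WithoutStopWords': TfidfVectorizer(stop_words='english'), # built-in English stop-word list
    'NoLowercase': TfidfVectorizer(lowercase=False),           # keep the original casing
}

matrices = {}
vocab_sizes = {}
for name, vec in vectorizers.items():
    X = vec.fit_transform(toy_docs)
    matrices[name] = X
    vocab_sizes[name] = X.shape[1]
    print('{:<17} vocabulary size: {}'.format(name, X.shape[1]))
```

An integer `min_df` is an absolute document count, while a float `max_df` is a proportion of documents. Each vectorizer must be fitted on the training data only and then applied to the dev and test sets with `transform`.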


**Note**:
If the SVM does not converge, try adjusting the $C$ hyperparameter (starting from a small value such as 1e-4).
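
As a rough illustration of the note above, the sketch below sweeps a few values of `C` and scores each model on a held-out set. The documents, labels, the grid of `C` values and the `max_iter=10000` setting are all assumptions made for this example; raising `max_iter` is another common way to deal with convergence warnings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the real train/dev split
train_docs = [
    "the gpu renders the image",
    "new gpu driver released",
    "image rendering on the gpu is fast",
    "the team won the hockey game",
    "hockey season starts soon",
    "the goalie saved the game",
]
train_labels = ['graphics', 'graphics', 'graphics', 'hockey', 'hockey', 'hockey']
dev_docs = ["gpu image quality", "hockey game tonight"]
dev_labels = ['graphics', 'hockey']

dev_accuracy = {}
for C in [1e-4, 1e-2, 1, 100]:
    # The pipeline fits the vectorizer on the training documents only
    model = make_pipeline(TfidfVectorizer(), LinearSVC(C=C, max_iter=10000))
    model.fit(train_docs, train_labels)
    dev_accuracy[C] = model.score(dev_docs, dev_labels)
    print('C={:<7} dev accuracy: {:.2f}'.format(C, dev_accuracy[C]))
```

Smaller values of `C` mean stronger regularisation. The value should be chosen on the dev set, never on the test set.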

In [None]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
# You have to generate a dev set from the training set
newsgroups_test = fetch_20newsgroups(subset='test')
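
A dev set can be carved out of the training data with a stratified split, which keeps the topic proportions similar in both parts. The helper `make_dev_split`, its default `dev_size=0.2` and the seed are hypothetical choices for this sketch. It is shown on toy data so that it runs without downloading the corpus; with the real data one would pass `newsgroups_train.data` and `newsgroups_train.target`.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

def make_dev_split(texts, labels, dev_size=0.2, seed=42):
    """Hypothetical helper: hold out a stratified dev set from the training data."""
    return train_test_split(
        texts, labels, test_size=dev_size, stratify=labels, random_state=seed
    )

# Toy stand-in for newsgroups_train.data / newsgroups_train.target
toy_texts = ['doc {}'.format(i) for i in range(20)]
toy_labels = [i % 2 for i in range(20)]

train_texts, dev_texts, train_labels, dev_labels = make_dev_split(toy_texts, toy_labels)
print(len(train_texts), len(dev_texts))
print(Counter(dev_labels))
```

Hyperparameters and vectorization settings are then compared on the dev set, and the test set is used once for the final evaluation.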