# Template for the homework


## Task
- Classify each document into one of 20 categories.
- The objective is to obtain the best possible accuracy on the test set. You can use any library and model explained in the course.
- The deliverable is a single Jupyter notebook containing all the code. It must run in the course Anaconda environment; do not use additional libraries.
- Send the notebook, named homework\_[name]\_[surname].ipynb, to sueiras@gmail.com before November 20th.

## Template structure

- A Jupyter notebook template is provided for the task. Its structure is:
  - Read the train and validation data.
  - Transform to generate numerical features: build your transformations here.
  - Model: build your model or models here. Check the accuracy on the validation set.
  - Evaluate results: build your scoring function here and apply it to the test set.
- You need to complete the transform and model steps to achieve the best result on the evaluation metric, accuracy, on the test set.
- Loading or using the test set is strictly forbidden, except once in the final evaluate results step.

## Evaluation

- The exercise is graded on a scale of 0 to 10 points.
- To obtain 5 points you must deliver a notebook that runs without errors and provides a solution with a minimum accuracy of 67%.
- An accuracy above 87% earns 10 points.
- Accuracies between 67% and 87% earn intermediate points proportionally. Depending on the quality of the work, the points assigned automatically by accuracy may be reduced or increased by up to 2.
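
One way to read the proportional rule above is as a linear interpolation between the two thresholds. The sketch below assumes exactly that. The function name `points_from_accuracy` is hypothetical, the behaviour below 67% is not specified in this document (so the function returns `None` there), and the manual quality adjustment of up to 2 points is left out.

```python
def points_from_accuracy(accuracy, low=0.67, high=0.87):
    """Hypothetical helper: map test accuracy to the automatic score."""
    if accuracy < low:
        return None  # not specified by the grading rules
    if accuracy >= high:
        return 10.0
    # Assumed linear interpolation: 5 points at `low`, 10 points at `high`
    return 5.0 + 5.0 * (accuracy - low) / (high - low)

halfway = points_from_accuracy(0.77)  # midpoint between the two thresholds
```

Under this assumption, an accuracy halfway between the thresholds maps to a score halfway between 5 and 10.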
 

In [1]:
# Header
from __future__ import print_function

import re
import nltk
import numpy as np
import pandas as pd
import string
import spacy
en_nlp = spacy.load('en_core_web_md')

## 01 Load data

In [2]:
from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(subset='train',
                 shuffle=True, random_state=42)

print(twenty_train.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


## 02 Text encoding

In [3]:
# ------------------------------------
# Define your own encoding process here
# ------------------------------------
def remove_empty(text_tr, labels):
    filtered_text = [] 
    filtered_labels = []    
    for doc, label in zip(text_tr, labels):       
        if doc.strip():            
            filtered_text.append(doc)            
            filtered_labels.append(label)
    return filtered_text, filtered_labels

def texto_parser(texto):
    '''
    Lemmatize each document with spaCy and drop stopwords, punctuation,
    pronoun placeholders and layout tokens.
        Input: list of raw documents
        Output: list of cleaned, lowercased strings (one per document)
    '''
    punctuation = set(string.punctuation)
    useless = {"-PRON-"}  # pronoun lemma placeholder produced by spaCy 2.x
    text_l = []
    for doc in texto:
        lemmas = [token.lemma_ for token in en_nlp(doc)]
        tokens = [t for t in lemmas
                  if t not in stopwords
                  and t not in punctuation
                  and t not in useless
                  and "--" not in t and '\n' not in t]
        text = " ".join(tokens).lower()
        # Replace runs of brackets, slashes, dots and similar symbols with a space
        text_l.append(re.sub(r'[\[\]/{}⋅−_(...)><\|]+', ' ', text))
    return text_l


# EXAMPLE OF CODE
from sklearn.feature_extraction.text import TfidfVectorizer

stopwords = nltk.corpus.stopwords.words('english')

texto_no_empty, y_trn = remove_empty(twenty_train.data, twenty_train.target)
texto_clean = texto_parser(texto_no_empty)

# Extract word occurrences: fit TF-IDF on whitespace-separated tokens
vector_tf = TfidfVectorizer(token_pattern=r"\S+")
x_train_vec = vector_tf.fit(texto_clean)  # fitted vectorizer, reused by encoding_text

def encoding_text(text):
    '''
    Encoding function
        Input: list of cleaned documents (output of texto_parser)
        Output: TF-IDF features to train the model
    '''
    text_tf = x_train_vec.transform(text)
    return text_tf

# Encode train
X_trn = encoding_text(texto_clean)
print(X_trn.shape)
# END OF EXAMPLE OF CODE


(11314, 160394)
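
To make the last two steps of the encoding more concrete, the sketch below applies the same clean-up regular expression used in `texto_parser` to a made-up string and then splits it with the `\S+` pattern passed to `TfidfVectorizer`. It uses only the standard `re` module to mimic the tokenization. The sample string is an assumption for illustration, and the real vectorizer additionally lowercases the text and computes TF-IDF weights.

```python
import re

# Same character class as in texto_parser: brackets, slashes, braces,
# dots, parentheses, underscores and a few other symbols
CLEANUP = re.compile(r'[\[\]/{}⋅−_(...)><\|]+')

sample = "foo_bar [baz] a.b"  # made-up input for illustration
cleaned = CLEANUP.sub(' ', sample)

# token_pattern=r"\S+" keeps every run of non-whitespace characters as a token
tokens = re.findall(r"\S+", cleaned)
print(tokens)  # ['foo', 'bar', 'baz', 'a', 'b']
```

Because every run of non-whitespace characters becomes a token, single-character tokens such as `a` are kept, unlike with the vectorizer's default pattern, which requires at least two word characters.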


## 03 Model and score function

In [4]:
# ------------------------------------
# Put your model or models here
# ------------------------------------

# EXAMPLE OF CODE
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import PassiveAggressiveClassifier

# Define the candidate classifiers (they are fitted later, during cross-validation)
clf_nb = MultinomialNB(alpha= 0.01)
clf_sv = LinearSVC(C=1, multi_class='ovr', dual=True, max_iter=100)
clf_pa = PassiveAggressiveClassifier(max_iter=100)
# END OF EXAMPLE OF CODE


In [5]:
# Score function
def score_function(data):
    '''
    score_function
        Input: text data already cleaned with texto_parser
        Output: predicted category for each text
    '''

    # ------------------------------------
    # Define your own score function
    # ------------------------------------
    
    # EXAMPLE OF CODE
    # Transformation steps
    X_test_tf = encoding_text(data)
    # Prediction steps
    
    predicted = clf.predict(X_test_tf)
    # END OF EXAMPLE OF CODE

    return predicted
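
The important point in `score_function` is that new text goes through the vectorizer that was already fitted on the training data, so the feature columns match what the classifier saw during training. A minimal sketch of that idea on made-up toy documents follows. The documents, labels and the name `toy_score` are assumptions for illustration, not part of the homework data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up training data for illustration only
train_docs = ["the rocket reached orbit", "nasa launched a new satellite",
              "the goalie saved the puck", "the team won the hockey game"]
train_labels = ["space", "space", "hockey", "hockey"]

vectorizer = TfidfVectorizer(token_pattern=r"\S+")
X_train = vectorizer.fit_transform(train_docs)  # vocabulary is learned here only
model = LinearSVC(C=1).fit(X_train, train_labels)

def toy_score(docs):
    # transform (not fit) so the columns match the training vocabulary
    return model.predict(vectorizer.transform(docs))

new_docs = ["a satellite in orbit", "hockey puck"]
X_new = vectorizer.transform(new_docs)
predictions = toy_score(new_docs)
print(X_train.shape[1] == X_new.shape[1], len(predictions))
```

Words that never appeared in the training documents (here `in`) are simply ignored at prediction time, because the fitted vocabulary has no column for them.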



### We will evaluate three models in order to pick the best one:
 * Support Vector Classifier
 * MultinomialNB  Classifier
 * Passive Aggressive Classifier

In [6]:
from sklearn.metrics import accuracy_score, confusion_matrix
from scipy.stats import sem
from sklearn.model_selection import cross_val_score

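clfs = [clf_nb, clf_sv, clf_pa]
cv = 3
# The loop variable is not called `clf` to avoid shadowing the final model used by score_function
for candidate in clfs:
    # 3-fold cross-validated accuracy on the training data
    scores = cross_val_score(candidate, X_trn, y_trn, cv=cv)
    print(scores)
    print("Mean score: {0:.3f} (+/-{1:.3f})".format(
        np.mean(scores), sem(scores)))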

[0.90701987 0.90432017 0.90467339]
Mean score: 0.905 (+/-0.001)
[0.91364238 0.91518685 0.91768455]
Mean score: 0.916 (+/-0.001)
[0.90807947 0.90909091 0.91290494]
Mean score: 0.910 (+/-0.001)
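
The `+/-` value printed next to each mean is the standard error of the mean returned by `scipy.stats.sem`, which by default is the sample standard deviation (with `ddof=1`) divided by the square root of the number of folds. As a sanity check, this sketch recomputes both numbers for the first row of fold scores above using only the standard library.

```python
import math
import statistics

fold_scores = [0.90701987, 0.90432017, 0.90467339]  # first row printed above

mean_score = statistics.mean(fold_scores)
# statistics.stdev uses n - 1 in the denominator, matching ddof=1
std_error = statistics.stdev(fold_scores) / math.sqrt(len(fold_scores))

print("Mean score: {0:.3f} (+/-{1:.3f})".format(mean_score, std_error))
# Mean score: 0.905 (+/-0.001)
```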


##### The best model is LinearSVC, so we will choose it

## 04 Evaluate valid data

#### Since we do not have a separate train/validation split, we will tune the estimator with RandomizedSearchCV and cross-validation

In [7]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix

params =  {'C':[ 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2, 10], 
           "dual":[True,False]}
random_grid = RandomizedSearchCV(clf_sv, param_distributions=params, cv= 3)

random_grid.fit(X_trn, y_trn)

RandomizedSearchCV(cv=3, error_score='raise',
          estimator=LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=100,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
          fit_params=None, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'C': [0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2, 10], 'dual': [True, False]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring=None, verbose=0)

In [8]:
random_grid.best_score_

0.9161216192328089
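
The parameter grid above contains 9 values of `C` and 2 values of `dual`, that is 18 combinations, while the search output shows `n_iter=10`, so only 10 of them are evaluated. The sketch below mimics that sampling with the standard library (it is not scikit-learn's own sampler). The seed is an arbitrary assumption added so the illustration is repeatable.

```python
import itertools
import random

params = {'C': [0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2, 10],
          'dual': [True, False]}

# Every combination that an exhaustive grid search would try
all_combinations = [dict(zip(params, values))
                    for values in itertools.product(*params.values())]

# A randomized search with n_iter=10 evaluates only a random subset
rng = random.Random(0)  # arbitrary seed, for a repeatable illustration
sampled = rng.sample(all_combinations, 10)

print(len(all_combinations), len(sampled))  # 18 10
```

Since the grid is this small, an exhaustive `GridSearchCV` over all 18 combinations would also be feasible.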

#### Finally, we build the classifier from the best estimator

In [9]:
clf = random_grid.best_estimator_.fit(X_trn, y_trn)

## 05 Evaluate test data
- Don't edit after this!!!
- Execute only ONCE, with the optimal model selected based on the validation accuracy computed over multiple experiments.

In [10]:
# Test Accuracy
twenty_test = fetch_20newsgroups(subset='test')

predicted = score_function(texto_parser(twenty_test.data))
    
print('Accuracy test: ', accuracy_score(twenty_test.target, predicted))


Accuracy test:  0.8506372809346787
