# Exercise "Lecture 13: Classification"


In this set of exercises, we will train a classifier to assign Wikipedia articles to one of 16 categories.


The exercises cover the following points:

* Storing the data in a pandas dataframe and inspecting the data
* Converting the corpus into a tf-idf document-token matrix
* Learning a perceptron model from the data 
* Inspecting the results

Data: wkp_sorted.zip      

Python libraries
- sklearn
- pandas 
- nltk
- numpy

Cheat sheets
- classification_cheat_sheet.ipynb

## Loading the data

**Exercise 1**

Create a pandas dataframe containing the Wikipedia data

* The data file is in "wkp_sorted.zip"
* Use the [load_files](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html) method from sklearn.datasets to load all files 
* load_files returns a dictionary-like object with keys "data" (a list of strings, one string per file), "target" (a list of label indexes, one label index per file) and "target_names" (the list of categories).
* Use the [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) constructor to create a dataframe whose headers are "text" and "label_idx". Each row aligns the content of a file (the "text" column) with its corresponding label index (the "label_idx" column).
* Use  "target_names" to create a dictionary idx_to_label mapping the label indices (0 to 15) to the corresponding Wikipedia categories.

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_files
import os
from sklearn.model_selection import train_test_split
import nltk
from nltk.tokenize import word_tokenize
from sklearn.metrics import ConfusionMatrixDisplay

In [2]:
os.chdir('/experiments/cours nlp/data science/lecture13/')
d = load_files("wkp_sorted/", encoding = "utf8")
df = pd.DataFrame(zip(d['data'], d['target'], d['filenames']),  columns=['Text','Target', 'Filenames'])
df.Filenames = df.Filenames.apply(os.path.basename)
df['Target_name'] = df.Target.apply(lambda x : d['target_names'][x])
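
The solution above adds a `Target_name` column but does not build the `idx_to_label` dictionary requested in the exercise. The following is a minimal sketch of the whole exercise on a throwaway corpus, assuming one sub-folder per category as `load_files` expects. The toy categories, texts, and the names `toy`, `bunch` and `df_toy` are made up for illustration and are not part of wkp_sorted.zip.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.datasets import load_files

# Hypothetical toy corpus: one folder per category, one file per document
toy = {
    "Airports": ["An airport with two runways.", "A small regional airport."],
    "Foods": ["A fried cheese sandwich."],
}

with tempfile.TemporaryDirectory() as root:
    for category, docs in toy.items():
        folder = Path(root) / category
        folder.mkdir()
        for i, text in enumerate(docs):
            (folder / f"{i}.txt").write_text(text, encoding="utf8")
    # Files are read while the temporary directory still exists
    bunch = load_files(root, encoding="utf8")

# One row per file: the text and its label index
df_toy = pd.DataFrame({"text": bunch["data"], "label_idx": bunch["target"]})

# target_names is ordered by label index, so enumerate gives idx -> label
idx_to_label = dict(enumerate(bunch["target_names"]))
df_toy["label"] = df_toy["label_idx"].map(idx_to_label)
print(idx_to_label)
print(df_toy)
```

The same `dict(enumerate(...))` call should work on `d['target_names']` from the cell above.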

**Exercise 2**

- Shuffle the data 

In [3]:
df1 = df.drop(columns="Filenames")

In [4]:
df1 = df1.sample(frac=1)

| | Text | Target | Target_name |
|---|---|---|---|
| 134 | Hubble search for transition comets (Transitio... | 3 | Astronomical_objects |
| 44 | Carniny Amateur & Youth FC is a junior-level f... | 11 | Sports_teams |
| 37 | Joe Petagno (born January 1, 1948) is an Ameri... | 1 | Artists |
| 65 | Al-Tilmiz (Arabic: التلميذ, 'The Pupil') was a... | 15 | Written_communication |
| 111 | An airport bus, or airport shuttle bus or airp... | 0 | Airports |
| ... | ... | ... | ... |
| 3 | Ajman International Airport (Arabic: مطار عجما... | 0 | Airports |
| 30 | Al Anwa Aviation is a charter airline based ou... | 13 | Transport |
| 60 | A carrozza, also referred to as mozzarella in ... | 8 | Foods |
| 98 | Gre-No-Li is a contraction of the surnames of ... | 12 | Sportspeople |
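
As a hedged variation on the shuffle, passing `random_state` makes the order reproducible across runs, and `reset_index(drop=True)` replaces the shuffled index with 0..n-1. The toy dataframe `df_toy` below is an assumption used only to keep the sketch self-contained.

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe
df_toy = pd.DataFrame(
    {"Text": ["doc a", "doc b", "doc c", "doc d"], "Target": [0, 1, 2, 3]}
)

# frac=1 samples every row once, i.e. a full shuffle;
# random_state fixes the permutation so the run can be repeated
shuffled = df_toy.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled)
```

Each text stays paired with its own label because whole rows are permuted.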


## Vectorizing the input texts

**Exercise 3**

* Extract $X$ and $Y$ from the dataframe 
* $X$ = the features used for classification. The features of an article are the tokens it contains. We hope that words can help classify articles into the correct category
* $Y$ = the category (Astronaut, etc.) of each Wikipedia article


In [5]:
X = df1['Text']
y = df1.Target

**Exercise 4**

Create train and test data
* Use sklearn [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) method 

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
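
With 16 categories and a small corpus, a purely random split can leave some categories under-represented in the test set. A possible refinement, sketched here on made-up data, is to pass `stratify` so that both splits keep the class proportions, and `random_state` so the split is repeatable. The lists `texts` and `labels` are placeholders, not the exercise data.

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 documents spread evenly over 4 classes
texts = [f"document number {i}" for i in range(20)]
labels = [i % 4 for i in range(20)]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts,
    labels,
    test_size=0.2,     # 4 of the 20 documents go to the test set
    random_state=0,    # fixed seed for a repeatable split
    stratify=labels,   # keep class proportions in both splits
)
print(len(X_tr), len(X_te), sorted(y_te))
```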

**Exercise 5**

Vectorize the input (X)

Use sklearn's [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class to turn the articles into a TF-IDF matrix where each row represents an article, each column a token, and each cell the tf-idf score of that token in that article.

* Import the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) class from sklearn
* Create a tf-idf vectorizer. The maximum number of features should be set to 400. Set use_idf to True, stop_words to "english" and the tokenizer to nltk.word_tokenize.
* Apply tfidf_vectorizer.fit_transform to X_train and tfidf_vectorizer.transform to X_test, so that the vocabulary and idf weights are learned from the training texts only

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=400,
                                   use_idf=True,
                                   stop_words='english',
                                   tokenizer=nltk.word_tokenize,
                                   ngram_range=(1, 3))
X_train_vec = tfidf_vectorizer.fit_transform(X_train)
X_test_vec = tfidf_vectorizer.transform(X_test)

**Exercise 6**

- Use the [get_feature_names_out](https://scikit-learn.org/stable/modules/feature_extraction.html) method to print out the features
- Look at the features: are they all useful, or would further preprocessing help eliminate uninformative tokens?

In [10]:
features = tfidf_vectorizer.get_feature_names_out()
print(len(features))

400


## Training a perceptron classifier


**Exercise 7**

* Import the [Perceptron](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html) class
* Create an object of the class Perceptron
* Train the model using the [fit](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html#sklearn.linear_model.Perceptron.fit) method
* Test the model using the [predict](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html#sklearn.linear_model.Perceptron.predict) method
* Print out expected values and predictions
* Print out accuracy using the [sklearn accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) method
* Print out the [confusion matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) and the [classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

In [11]:
from sklearn.linear_model import Perceptron
perceptron_clf = Perceptron()
perceptron_clf.fit(X_train_vec, y_train)

Perceptron()

In [12]:
y_pred = perceptron_clf.predict(X_test_vec)

In [13]:
# First five predictions next to the first five expected test labels
print(y_pred[:5], y_test[:5].tolist())


In [14]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

In [15]:
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

Accuracy: 0.53125
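
Exercise 7 also asks for a confusion matrix and a classification report. A minimal sketch on invented labels is given below; `y_true`, `y_hat` and `names` are placeholders, and in the notebook they would presumably be `y_test`, `y_pred` and `d['target_names']`.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical expected and predicted label indexes for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 1, 1, 1, 2, 0]
names = ["Airports", "Artists", "Foods"]  # illustrative subset of categories

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
print(cm)

# Per-class precision, recall and F1; zero_division=0 avoids warnings
# for classes that are never predicted
report = classification_report(y_true, y_hat, target_names=names, zero_division=0)
print(report)
print("Accuracy:", accuracy_score(y_true, y_hat))
```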


**Exercise 8**

* sklearn's tfidf_vectorizer stores a vocabulary dictionary (vocabulary_) whose items are (k, v) pairs, where k is a token and v is its column index (an integer)
   - Create a dictionary ix_to_tokens mapping each index to the corresponding token and a dictionary tag_to_ix mapping each token to the corresponding index
* The [coef_](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html#sklearn.linear_model.Perceptron.fit) attribute contains the learned weights for each feature. Its shape is (number of classes, number of features).
* Save the feature weights of each class in a dictionary where key = token index, value = weight
* Define a function that derives a sorted list of (tokenIndex, weight) pairs
* For each class,
   - Get the feature weights for that class
   - Sort the weights
   - Print out the first 6 token:weight pairs (replace token indices by the corresponding token)

To better see whether the top words of each class match the corresponding class, use the "idx_to_label" dictionary defined in Exercise 1 to rewrite each class idx to the corresponding label. 

In [17]:
vocab = tfidf_vectorizer.vocabulary_
ix_to_tokens = { v:k for k,v in vocab.items() }
tag_to_ix = { v:k for k,v in ix_to_tokens.items() }

In [25]:
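# coef_ has shape (n_classes, n_features): build one {token index: weight} dictionary per class
coef_weight = {c: dict(enumerate(w)) for c, w in zip(perceptron_clf.classes_, perceptron_clf.coef_)}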

In [30]:
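def srt_coef(weights):
    # Return the (token index, weight) pairs of one class, largest weight first
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)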

### Performing grid search to find the best possible score and best alpha value (PROVIDED)

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Tuning using grid search cross-validation
# Create an object GridSearchCV

parameters = [a for a in np.linspace(0.01,1,11)]
clf = GridSearchCV( estimator=MultinomialNB(), 
                   param_grid={'alpha':parameters},
                   scoring='accuracy',
                   return_train_score=True,
                   cv=5
                  )

# Start the search over the hyper-parameters by calling the fit function over X_train
clf.fit(X_train_vec, y_train)

# Print the results of the CV using the attribute *cv_results_*
cv_res = pd.DataFrame(clf.cv_results_)
cv_res

In [None]:
# Printing out the best score
print("Best score: %0.3f" % clf.best_score_)

# Printing out the best alpha value
best_parameters = clf.best_estimator_.get_params()
print("Best alpha", best_parameters['alpha'])
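
A possible variation on the grid search, offered as a sketch and not as part of the provided solution, is to wrap the vectorizer and the classifier in a sklearn `Pipeline`. The tf-idf vocabulary is then refit inside each cross-validation fold, so held-out folds do not influence the features. The corpus, the step names `tfidf` and `clf`, and `cv=3` are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical two-class corpus, six documents per class
texts = [
    "the airport has two runways",
    "flights depart from the terminal",
    "the airline operates charter flights",
    "a new runway opened at the airport",
    "passengers wait in the airport terminal",
    "the airline serves regional airports",
    "the dish is made with cheese",
    "fried bread with mozzarella cheese",
    "a popular street food dish",
    "the recipe uses bread and eggs",
    "cheese is melted over the bread",
    "the dish is served with sauce",
]
labels = [0] * 6 + [1] * 6

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
search = GridSearchCV(
    pipe,
    param_grid={"clf__alpha": np.linspace(0.01, 1, 11)},
    scoring="accuracy",
    cv=3,
)
search.fit(texts, labels)

print("Best score: %0.3f" % search.best_score_)
print("Best alpha", search.best_params_["clf__alpha"])
```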
