# Text classification

## *"Words. I know words. I have the best words!"*
*- Noam Chomsky*

# Overview

To train a machine learning model to classify text, we need:
1. a way to preprocess text
2. a label for each text, represented as a number
3. a way to represent each text as a vector input
4. a model to learn a function $f(input) = label$
5. a way to evaluate how well the model works
6. a way to predict new data

As an example, we will use movie review data and try to classify each review as $positive$ or $negative$, based only on its text.

The same method can be used for any other data, including more labels and other dependent variables (e.g., age or gender of the text author, social constructs expressed in the text, etc.).

# 1. Data

In [1]:
import pandas as pd

data = pd.read_csv('sa_train.csv', quoting=0)
print(len(data), data['output'].unique())
data.head(2)

1800 ['neg' 'pos']


Unnamed: 0,input,output
0,shakespeare in love is quite possibly the most...,neg
1,wizards is an animated feature that begins wit...,neg


## Preprocessing

Text is messy. The goal of preprocessing is to reduce the amount of noise (= unnecessary variation), while maintaining the signal. There is no one-size-fits-all solution, but a good approximation is the following:

In [2]:
import spacy
# 'en' was a shortcut link in spaCy 2; spaCy 3 requires the full model name
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

In [3]:
def clean_text(text):
    '''reduce text to lower-case lexicon entry'''
    lemmas = [token.lemma_ for token in nlp(text) 
              if token.pos_ in {'NOUN', 'VERB', 'ADJ', 'ADV', 'PROPN'}]
    return ' '.join(lemmas)

clean_text('This is a test sentence. And here comes another one... Go me!')

'test sentence here come one go'

Let's clean up the input data. This can take a while, so it's good to save it.

In [4]:
data['clean_text'] = data['input'].apply(clean_text)
data['clean_text'].head()

0    shakespeare love quite possibly most enjoyable...
1    wizard animate feature begin narration epic pr...
2    gun wielding arnold schwarzenegger change hear...
3    keep jane austen sense sensibility pride preju...
4    hollywood pimp fat cigar smoking chump wear fu...
Name: clean_text, dtype: object

In [5]:
data.head()

Unnamed: 0,input,output,clean_text
0,shakespeare in love is quite possibly the most...,neg,shakespeare love quite possibly most enjoyable...
1,wizards is an animated feature that begins wit...,neg,wizard animate feature begin narration epic pr...
2,gun wielding arnold schwarzenegger has a chang...,neg,gun wielding arnold schwarzenegger change hear...
3,"if this keeps up , jane austen ( sense and sen...",pos,keep jane austen sense sensibility pride preju...
4,"hollywood is a pimp . a fat , cigar - smoking ...",pos,hollywood pimp fat cigar smoking chump wear fu...


# 2. Labels

Here, we assume that we already have the labels. (In your task, you will have to label them yourself! Hint: use `input()` or a spreadsheet).

However, in order for the machine learning model to work with the labels, we need to translate them into a vector of numbers. We can use `sklearn.preprocessing.LabelEncoder`:

In [6]:
from sklearn.preprocessing import LabelEncoder

# transform labels into numbers
labels2numbers = LabelEncoder()

y = labels2numbers.fit_transform(data['output'])
print(data['output'][:10], y[:10], len(y))

0    neg
1    neg
2    neg
3    pos
4    pos
5    neg
6    pos
7    pos
8    neg
9    neg
Name: output, dtype: object [0 0 0 1 1 0 1 1 0 0] 1800


To get the original names back, use `inverse_transform()`:

In [7]:
labels2numbers.inverse_transform([1,1,1,0,0,1])

array(['pos', 'pos', 'pos', 'neg', 'neg', 'pos'], dtype=object)
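
Conceptually, the encoder sorts the distinct labels and uses each label's position in that order as its number. The following is a minimal pure-Python sketch of that idea; the helper names `fit_labels`, `encode`, and `decode` are made up for illustration and are not part of `sklearn`.

```python
def fit_labels(labels):
    """Return the sorted distinct labels; a label's position is its number."""
    return sorted(set(labels))

def encode(classes, labels):
    """Map each label to its position in `classes`."""
    lookup = {label: i for i, label in enumerate(classes)}
    return [lookup[label] for label in labels]

def decode(classes, numbers):
    """Map numbers back to the original label names."""
    return [classes[i] for i in numbers]

classes = fit_labels(['neg', 'neg', 'pos', 'neg'])
encoded = encode(classes, ['neg', 'pos', 'pos'])
decoded = decode(classes, encoded)
print(classes, encoded, decoded)
```

Because the mapping is stored in `classes`, the same object has to be reused later to translate labels of new data.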

# 3. Representing text

First, we need to transform the texts into a matrix, where each row represents one text instance. The columns are the **features**.


In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1,2), 
                             min_df=0.001, 
                             max_df=0.75, 
                             stop_words='english')

X = vectorizer.fit_transform(data['clean_text'])
print(X.shape)

(1800, 66808)
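
To make the numbers in this matrix less opaque, here is a small hand-rolled sketch of a TF-IDF weighting: term count times a smoothed inverse document frequency, followed by length normalization of each row. It assumes the smoothed variant $\ln\frac{1+n}{1+df} + 1$ and ignores n-grams, `min_df`, `max_df`, and stop words; the three documents are made up.

```python
import math
from collections import Counter

docs = [
    'bad movie bad plot',
    'good movie',
    'good plot good cast',
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# document frequency: in how many documents does each term occur?
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return a dict term -> weight with unit Euclidean length."""
    counts = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first = tfidf(tokenized[0])
print(first)
```

In the first document, 'bad' gets a higher weight than 'movie': it occurs twice there and appears in fewer documents overall.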


We can now translate back and forth between columns and words:

In [9]:
vectorizer.vocabulary_['bad']

3786

In [10]:
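# in scikit-learn >= 1.0 the method is get_feature_names_out(); get_feature_names() was removed in 1.2
vectorizer.get_feature_names_out()[3786]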

'bad'

Let's see how many reviews contain that string (note that `str.contains` matches substrings, so 'badly' counts as well):

In [11]:
len(data[data.input.str.contains('bad')])

749

In [12]:
data[data.input.str.contains('bad')]

Unnamed: 0,input,output,clean_text
0,shakespeare in love is quite possibly the most...,neg,shakespeare love quite possibly most enjoyable...
1,wizards is an animated feature that begins wit...,neg,wizard animate feature begin narration epic pr...
4,"hollywood is a pimp . a fat , cigar - smoking ...",pos,hollywood pimp fat cigar smoking chump wear fu...
6,films adapted from comic books have had plenty...,pos,film adapt comic book have plenty success supe...
9,tommy lee jones chases an innocent victim arou...,neg,tommy lee jones chase innocent victim america ...
...,...,...,...
1791,synopsis : captain picard and the crew of the ...,pos,synopsis captain picard crew starship enterpri...
1794,"ladies and gentlemen , 1997 ' s independence d...",pos,lady gentleman s independence day here title s...
1795,terrence malick made an excellent 90 minute fi...,neg,terrence malick make excellent minute film ada...
1798,a movie about divorce and custody in 1995 seem...,neg,movie divorce custody seem about as timely mov...


In [13]:
len(data[data.clean_text.str.contains('bad')])

895
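
The counts above come from substring matching. As a rough sketch of the difference, the following compares substring matching with whole-word matching on a few made-up strings; the same regular expression could be passed to `str.contains`, which interprets its pattern as a regular expression by default.

```python
import re

texts = ['a bad film', 'she badly wants it', 'not too bad .', 'badminton']

# substring match: every string containing the letters 'bad'
substring_hits = [t for t in texts if 'bad' in t]

# whole-word match: \b marks a word boundary on both sides
word_hits = [t for t in texts if re.search(r'\bbad\b', t)]

print(len(substring_hits), len(word_hits))
```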

# 4. Learning a classification model

A classification model is simply a function that takes a text representation as input, and returns an output label.

Inside that function is normally a set of weights. By multiplying the weight vector with the input vector (a dot product), we get a score, which the model then turns into a label.
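
As a minimal sketch of this idea for a two-class logistic regression: the dot product of weights and input (plus a bias) is squashed into a probability by the logistic function and then thresholded. The weights, the bias, and the three-word vocabulary below are made up for illustration.

```python
import math

def predict_proba(weights, bias, x):
    """Probability of the positive class for one input vector."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-score))

def predict(weights, bias, x, threshold=0.5):
    """Turn the probability into a 0/1 label."""
    return 1 if predict_proba(weights, bias, x) >= threshold else 0

# hypothetical weights for the features 'bad', 'good', 'movie'
weights = [-3.5, 0.85, 0.0]
bias = 0.0

x_neg = [0.6, 0.0, 0.8]  # a text that mentions 'bad' and 'movie'
x_pos = [0.0, 0.7, 0.7]  # a text that mentions 'good' and 'movie'

print(predict(weights, bias, x_neg), predict(weights, bias, x_pos))
```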

## 4.1: Fitting a model

Fitting a model is the process of finding the right weights to map the training inputs to the training outputs. Fitting to data in `sklearn` is easy: we use the `fit()` function, giving it the input matrix and output vector.

In [14]:
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(n_jobs=-1, class_weight='balanced')
%time classifier.fit(X, y)
print(classifier)

CPU times: user 38.3 ms, sys: 44.2 ms, total: 82.5 ms
Wall time: 1.13 s
LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=-1, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)


The resulting fitted model has coefficients (betas) for each word/feature in our vocabulary:

In [15]:
coefs = classifier.coef_
coefs

array([[ 0.00986428, -0.06329059, -0.03779013, ...,  0.06591883,
         0.02520406, -0.00039513]])

We can now examine the weights/coefficients/betas for the individual words (note that each word has an ID):

In [16]:
k = vectorizer.vocabulary_['bad'] # column position for the word
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
print(vectorizer.get_feature_names_out()[k], classifier.coef_[0, k])
# negative value is related to the negative class --> 'bad' indicates the negative class

bad -3.493747454624923


In [17]:
k = vectorizer.vocabulary_['good'] # column position for the word
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
print(vectorizer.get_feature_names_out()[k], classifier.coef_[0, k])
# positive value is related to the positive class --> 'good' indicates the positive class

good 0.8523441444658751


NB: in a two-class problem, our coefficients form a single vector: positive values indicate the positive class, negative values the other class.
In a multi-class problem, we have one **row** of coefficients for each class: positive values indicate that this feature contributes to the class, negative values indicate that it contributes to other classes.

# 5. Evaluating models

Having a model is great, but how well does it do? Can it classify what it has seen? We need a way to estimate how well the model will work on new data.

We need a metric to measure performance and a way to simulate new data.

## 5.1: Metrics

We use three measures:
1. precision
2. recall
3. F1 (usually the most informative of the three, since it combines the other two)

### Precision

Precision measures how many of our model's positive predictions were correct. We divide the number of true positives by the number of all predicted positives (true positives plus false positives).

$$
p = \frac{tp}{tp+fp}
$$

### Recall

Recall measures how many of the correct answers in the data our model managed to find. We divide the number of true positives (the instances our model got) by the sum of true positives and false negatives (the instances our model *should* have gotten but missed).

$$
r = \frac{tp}{tp+fn}
$$

### F1

A model that classified everything as, say, "positive" would get a perfect recall (it does, after all, find all positive examples). However, such a model would obviously be useless, since its precision is bad.

We want to balance the two against each other. F1 does exactly that, by taking the harmonic mean.

$$
F_1 = \frac{2 \cdot p\cdot r}{p+r}
$$
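
To make the three definitions concrete, here is a small sketch that computes them by hand from a list of gold labels and a list of predictions, with F1 as the harmonic mean $2pr/(p+r)$. The function name and the toy labels are made up; the first prediction list is the "everything is positive" model described above.

```python
def precision_recall_f1(gold, predicted, positive=1):
    """Compute precision, recall, and F1 for one class from two label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [1, 0, 0, 1, 0, 0, 0, 1]
always_positive = [1] * len(gold)
mixed = [1, 0, 1, 1, 0, 0, 0, 0]

print(precision_recall_f1(gold, always_positive))
print(precision_recall_f1(gold, mixed))
```

The always-positive model reaches perfect recall, but its precision is only the share of positive instances in the data, and F1 reflects that.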

Luckily, all of these metrics are implemented in `sklearn`. All we have to provide are the predictions of our model, and the actual correct answers (called the *gold standard*). 

In [18]:
from sklearn.metrics import classification_report

## 5.2: Cross-validation

How do we measure performance on new data, if we don't know what the correct outputs for those new data points are?

In **$k$-fold cross-validation**, we simulate new data by fitting our model on one part of the data and evaluating it on the rest. We can thereby measure the performance on the held-out part.

However, we have now reduced the amount of data we use to fit the model. In order to address this, we simply repeat the process $k$ times.
We separate the data into $k$ parts, fit the model on $k-1$ parts, and evaluate on the $k$th part. In the end, we have performance scores from $k$ models. The average of them tells us how well the model would work on new data.
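
The splitting itself can be sketched in a few lines of plain Python. This is only an illustration of the idea with a made-up helper name; it splits the instances in order, whereas `sklearn` additionally keeps the class proportions similar across folds (stratification) when it cross-validates a classifier with an integer `cv`.

```python
def k_fold_indices(n_instances, k):
    """Yield (train, test) index lists; each instance is tested exactly once."""
    indices = list(range(n_instances))
    # distribute the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_instances // k + (1 if i < n_instances % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(k_fold_indices(10, 3))
for train, test in splits:
    print(len(train), len(test), test)
```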



In [19]:
from sklearn.model_selection import cross_val_score

for k in [2,3,5,10]:
    cv = cross_val_score(LogisticRegression(), X, y=y, cv=k, n_jobs=-1, scoring="f1_micro")
    fold_size = X.shape[0]/k
    
    print("F1 with {} folds for bag-of-words is {}".format(k, cv.mean()))
    print("Training on {} instances/fold, testing on {}".format(fold_size*(k-1), fold_size))
    print()

F1 with 2 folds for bag-of-words is 0.8083333333333333
Training on 900.0 instances/fold, testing on 900.0

F1 with 3 folds for bag-of-words is 0.8172222222222222
Training on 1200.0 instances/fold, testing on 600.0

F1 with 5 folds for bag-of-words is 0.828888888888889
Training on 1440.0 instances/fold, testing on 360.0

F1 with 10 folds for bag-of-words is 0.8305555555555555
Training on 1620.0 instances/fold, testing on 180.0



In [20]:
cv

array([0.84444444, 0.82777778, 0.78333333, 0.8       , 0.80555556,
       0.85      , 0.81666667, 0.85      , 0.86111111, 0.86666667])

## Baselines
So, is that performance good? Let's compare to a **baseline**, i.e., a null hypothesis. The simplest one is that all instances belong to the most frequent class in the data.

In [21]:
from sklearn.dummy import DummyClassifier

most_frequent = DummyClassifier(strategy='most_frequent')

print(cross_val_score(most_frequent, X, y=y, cv=5, n_jobs=-1, scoring="f1_micro").mean())


0.5061111111111111
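
What the dummy classifier does can be sketched by hand: find the majority label in the training part and predict it for every test instance. With a single label per instance, micro-averaged F1 equals accuracy, so the share of correct predictions is the number to compare. The helper name and the toy labels are made up.

```python
from collections import Counter

def most_frequent_baseline(train_labels, test_labels):
    """Predict the majority training label everywhere; return it and the accuracy."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority)
    return majority, correct / len(test_labels)

train_labels = ['neg'] * 6 + ['pos'] * 4
test_labels = ['neg', 'pos', 'pos', 'neg', 'neg']

majority, accuracy = most_frequent_baseline(train_labels, test_labels)
print(majority, accuracy)
```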


# Exercise

See whether you can apply the previous steps to a new data set, a collection of wine descriptions. Choose any of the descriptor columns as the target variable. The text is already preprocessed, to save time.

In [22]:
wine = pd.read_excel('wine_reviews_small.xlsx', nrows=10000)
wine.head()
#predict the country given the description

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,country,description,designation,points,price,province,region_1,region_2,variety,winery,description_cleaned
0,0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,tremendous varietal wine hail be age year oak ...
1,1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez,ripe aroma fig blackberry cassis be soften swe...
2,2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley,honor memory wine once make his mother tremend...
3,3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi,spend month new french oak incorporate fruit v...
4,4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude,be top wine name high point vineyard foot have...


In [23]:
wine_vectorizer = TfidfVectorizer(ngram_range=(1,2), 
                             min_df=0.001, 
                             max_df=0.75, 
                             stop_words='english')

wine_X = wine_vectorizer.fit_transform(wine['description_cleaned'])
wine_labels2numbers = LabelEncoder()
wine_y = wine_labels2numbers.fit_transform(wine['country'])

wine_classifier = LogisticRegression(n_jobs=-1, class_weight='balanced')
%time wine_classifier.fit(wine_X, wine_y)
print(wine_classifier)

CPU times: user 37.2 ms, sys: 8.96 ms, total: 46.2 ms
Wall time: 5.13 s
LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='auto', n_jobs=-1, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)


In [24]:
wine_most_frequent = DummyClassifier(strategy='most_frequent')

print(cross_val_score(wine_most_frequent, wine_X, y=wine_y, cv=5, n_jobs=-1, scoring="f1_micro").mean())

0.4091




In [25]:
for k in [2,3,5,10]:
    cv = cross_val_score(LogisticRegression(), wine_X, y=wine_y, cv=k, n_jobs=-1, scoring="f1_micro")
    fold_size = wine_X.shape[0]/k
    
    print("F1 with {} folds for bag-of-words is {}".format(k, cv.mean()))
    print("Training on {} instances/fold, testing on {}".format(fold_size*(k-1), fold_size))
    print()



F1 with 2 folds for bag-of-words is 0.7711000000000001
Training on 5000.0 instances/fold, testing on 5000.0

F1 with 3 folds for bag-of-words is 0.7825996156904232
Training on 6666.666666666667 instances/fold, testing on 3333.3333333333335

F1 with 5 folds for bag-of-words is 0.7929000000000002
Training on 8000.0 instances/fold, testing on 2000.0

F1 with 10 folds for bag-of-words is 0.7986000000000001
Training on 9000.0 instances/fold, testing on 1000.0



The warning raised by the previous cell (omitted here) occurs because some classes are so rare that, during $k$-fold cross-validation, they appear only in the test fold and never in the training folds.

# 6. Held-out data

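Classifying new (**held-out**) data is called **prediction**. We reuse the weights we learned before on a new data matrix to predict the new outcomes.
Important: the new data needs to have the same features, in the same column order, as the training data! That is why we reuse the vectorizer fitted on the training data instead of fitting a new one.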
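
A simplified sketch of why the fitted vocabulary matters, using plain word counts instead of TF-IDF and made-up helper names: the column of each word is fixed when the vocabulary is built from the training texts, and words that only occur in new texts have no column and are dropped.

```python
def fit_vocabulary(texts):
    """Assign each distinct training word a fixed column index."""
    vocab = {}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_counts(vocab, text):
    """Represent a text as counts over the training vocabulary."""
    row = [0] * len(vocab)
    for word in text.split():
        if word in vocab:  # unseen words have no column and are ignored
            row[vocab[word]] += 1
    return row

train_texts = ['good movie', 'bad movie']
vocab = fit_vocabulary(train_texts)
new_row = to_counts(vocab, 'surprisingly good good cast')
print(vocab, new_row)
```

The new row has exactly as many entries as the training vocabulary, so the learned weights line up with it.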

In [26]:
# read in new data set
new_data = pd.read_csv('sa_test.csv')
print(len(new_data))
new_data.head()

200


Unnamed: 0,input,output
0,robert redford ' s a river runs through it is ...,pos
1,if the 70 ' s nostalgia didn ' t make you feel...,neg
2,you think that these people only exist in the ...,neg
3,""" knock off "" is exactly that : a cheap knock ...",neg
4,brian depalma needs a hit * really * badly . s...,pos


Don't forget to clean it!

In [27]:
%time new_data['clean_text'] = new_data.input.apply(clean_text)

CPU times: user 19.9 s, sys: 315 ms, total: 20.2 s
Wall time: 20.3 s


Let's see how well we do on this data:

In [28]:
# transform text into TF-IDF vectors
# IMPORTANT: use same vectorizer we fit on training data to create vectors!
new_X = vectorizer.transform(new_data['clean_text'])

# translate labels
new_y = labels2numbers.transform(new_data['output'])


# use the old classifier to predict and evaluate
new_predictions = classifier.predict(new_X)
print(new_predictions)

[1 0 0 0 1 0 0 0 0 0 1 1 1 0 1 1 1 1 0 1 0 1 1 1 0 0 0 0 1 0 1 0 0 0 0 0 1
 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 1 0 0 1 1 0 0 0 1 0 1 0 1
 1 0 1 0 0 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 0 0 1 0 1 1 1 0 0 0 0 1 0 0 0 0
 0 0 1 1 0 1 1 0 1 0 0 1 0 1 0 0 1 1 0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 1 1 0 0
 1 1 1 0 1 1 0 1 0 1 1 0 0 1 1 0 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 1 0 1 1 0 0
 0 0 0 1 1 1 1 0 1 0 1 0 0 0 1]


In [29]:
print(classification_report(new_y, new_predictions))

              precision    recall  f1-score   support

           0       0.83      0.84      0.83       111
           1       0.80      0.79      0.79        89

    accuracy                           0.81       200
   macro avg       0.81      0.81      0.81       200
weighted avg       0.81      0.81      0.81       200



Alternatively, we can predict the probability of belonging to each class:

In [30]:
new_probabilities = classifier.predict_proba(new_X)
print(new_probabilities)

[[0.30745711 0.69254289]
 [0.55629129 0.44370871]
 [0.5346982  0.4653018 ]
 [0.74681023 0.25318977]
 [0.36336666 0.63663334]
 [0.63447098 0.36552902]
 [0.52763496 0.47236504]
 [0.53041422 0.46958578]
 [0.62989898 0.37010102]
 [0.52398668 0.47601332]
 [0.39571583 0.60428417]
 [0.27701891 0.72298109]
 [0.36529535 0.63470465]
 [0.6180369  0.3819631 ]
 [0.37767803 0.62232197]
 [0.34803811 0.65196189]
 [0.30380553 0.69619447]
 [0.3705448  0.6294552 ]
 [0.50506119 0.49493881]
 [0.46744111 0.53255889]
 [0.71648016 0.28351984]
 [0.48074602 0.51925398]
 [0.34768242 0.65231758]
 [0.40778423 0.59221577]
 [0.51954813 0.48045187]
 [0.58126258 0.41873742]
 [0.69281727 0.30718273]
 [0.67484025 0.32515975]
 [0.46197409 0.53802591]
 [0.5516414  0.4483586 ]
 [0.29534742 0.70465258]
 [0.64829551 0.35170449]
 [0.71811139 0.28188861]
 [0.81374881 0.18625119]
 [0.66188463 0.33811537]
 [0.73474834 0.26525166]
 [0.30462163 0.69537837]
 [0.58988807 0.41011193]
 [0.59001609 0.40998391]
 [0.64077973 0.35922027]


For each instance (=row), we get a probability distribution over the classes (=columns).
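
One thing the probabilities allow, sketched below with made-up numbers, is to separate confident predictions from borderline ones. The predicted label is the column with the highest probability; the `min_confidence` threshold is a hypothetical parameter, not something `sklearn` provides.

```python
class_names = ['neg', 'pos']

probabilities = [
    [0.31, 0.69],
    [0.56, 0.44],
    [0.51, 0.49],
]

def decide(row, min_confidence=0.6):
    """Return the most probable label and whether it clears the threshold."""
    best = max(range(len(row)), key=row.__getitem__)
    return class_names[best], row[best] >= min_confidence

decisions = [decide(row) for row in probabilities]
print(decisions)
```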

## 6.1 Regularization

Typically, performance is lower on unseen data, because our model **overfit** the training data: it expects the new data to look *exactly* the same as the training data. That is almost never true.

In order to prevent the model from overfitting, we need to **regularize** it. Essentially, we make it harder to learn the training data.

A simple example of regularization is to "corrupt" the training data by adding a little bit of noise to each training instance. Since the noise is irregular, it becomes harder for the model to learn any patterns.

In [31]:
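from scipy.sparse import random as sparse_random

num_instances, num_features = X.shape

for i in range(5):
    # add sparse random noise to about 1% of the cells of the training matrix
    X_regularized = X + sparse_random(num_instances, num_features, density=0.01)

    # 10-fold cross-validation on the noisy data
    print(cross_val_score(LogisticRegression(), X_regularized, y=y, cv=10, n_jobs=-1, scoring="f1_micro").mean())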

0.5144444444444444
0.5005555555555554
0.5105555555555555
0.526111111111111
0.5344444444444444


If you run the previous cell several times, you see different results (they vary even more if you change `density`). This variation arises because we add **random** noise, and here the noise drowns out the signal: the scores drop to roughly the level of the most-frequent-class baseline.

Instead, it makes sense to force the model to spread the weights more evenly over all features, rather than bet on a few features, which might not be present in future data.

We can do this with the `C` parameter of `LogisticRegression`, the inverse of the regularization strength. The default is `1`; lower values mean stricter regularization.
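
To see why lower `C` means stricter regularization, it helps to look at the objective being minimized. One common way to write L2-penalized logistic regression (a sketch, with labels coded as $y_i \in \{-1, 1\}$, weights $w$, and intercept $b$; the exact scaling conventions differ between library versions) is:

$$
\min_{w, b} \; \frac{1}{2} w^\top w + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i (x_i^\top w + b)\right)\right)
$$

The first term penalizes large weights, the second measures how badly the model fits the training data. `C` multiplies the data-fit term, so a small `C` lets the penalty dominate and pushes all weights towards zero.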

In [32]:
from sklearn.metrics import f1_score

best_c = None
best_f1_score = 0.0

for c in [50, 20, 10, 1.0, 0.5, 0.1, 0.05, 0.01]:
    clf = LogisticRegression(C=c, n_jobs=-1)
    # mean score over the 5 folds for this value of C
    cv_reg = cross_val_score(clf, X, y=y, cv=5, n_jobs=-1, scoring="f1_micro").mean()

    print("5-CV on train at C={}: {}".format(c, cv_reg))
    print()

    if cv_reg > best_f1_score:
        best_f1_score = cv_reg
        best_c = c
        
print("best C parameter: {}".format(best_c))

5-CV on train at C=50: 0.8477777777777777

5-CV on train at C=20: 0.8488888888888889

5-CV on train at C=10: 0.8488888888888889

5-CV on train at C=1.0: 0.828888888888889

5-CV on train at C=0.5: 0.8183333333333334

5-CV on train at C=0.1: 0.788888888888889

5-CV on train at C=0.05: 0.7311111111111112

5-CV on train at C=0.01: 0.5077777777777778

best C parameter: 20


In [33]:
reg_clf = LogisticRegression(C=best_c, n_jobs=-1)
reg_clf.fit(X, y)
reg_preds = reg_clf.predict(new_X)

print(classification_report(new_y, reg_preds))

              precision    recall  f1-score   support

           0       0.86      0.85      0.85       111
           1       0.81      0.83      0.82        89

    accuracy                           0.84       200
   macro avg       0.84      0.84      0.84       200
weighted avg       0.84      0.84      0.84       200



# Better features = better performance


We now have **a lot** of features! More than we have actual examples...

Not all of them will be helpful, though. Let's select the top 1500 based on how well they predict the outcome of the training data.

We use two tools from `sklearn.feature_selection`: `SelectKBest` (the selection algorithm) and `chi2` (the selection criterion).

In [34]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

selector = SelectKBest(chi2, k=1500).fit(X, y)
X_sel = selector.transform(X)
print(X_sel.shape)

(1800, 1500)


In [35]:
X.shape

(1800, 66808)

Let's see how well this new representation performs, by looking at the 5-fold cross-validation. We keep the best regularization value from before.

In [36]:
clf = LogisticRegression(C=best_c, n_jobs=-1)

cv_reg = cross_val_score(clf, X_sel, y=y, cv=5, n_jobs=-1, scoring="f1_micro")
print("5-CV on train: {}".format(cv_reg.mean()))

5-CV on train: 0.8955555555555555


Not too bad! We have handily beaten our previous best! Let's fit a classifier on the whole data now.

In [37]:
clf.fit(X_sel, y)

LogisticRegression(C=20, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=-1, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

Now, let's apply it to the held-out data set. 
We need to:
* vectorize the data with our vectorizer from before (otherwise, we get different features)
* select the top features (using our previously fitted selector)

In [38]:
# select features for new data
new_X_sel = selector.transform(new_X)
print(new_X_sel.shape)

(200, 1500)


Finally, we can use our new classifier to predict the new data labels, and compare them to the truth.

In [39]:
new_predictions_regularized = clf.predict(new_X_sel)
prediction_df = pd.DataFrame(data={'input': new_data['input'], 'prediction': labels2numbers.inverse_transform(new_predictions_regularized), 'truth':new_data['output']})
prediction_df

Unnamed: 0,input,prediction,truth
0,robert redford ' s a river runs through it is ...,pos,pos
1,if the 70 ' s nostalgia didn ' t make you feel...,neg,neg
2,you think that these people only exist in the ...,neg,neg
3,""" knock off "" is exactly that : a cheap knock ...",neg,neg
4,brian depalma needs a hit * really * badly . s...,pos,pos
...,...,...,...
195,i won  t even pretend that i have seen the ot...,pos,neg
196,the cartoon is way better . that ' s the botto...,neg,neg
197,"dr . alan grant ( sam neill , "" jurassic park ...",neg,neg
198,of course i knew this going in . why is it tha...,neg,neg


In [40]:
print(classification_report(new_y, new_predictions_regularized))

              precision    recall  f1-score   support

           0       0.83      0.80      0.82       111
           1       0.76      0.80      0.78        89

    accuracy                           0.80       200
   macro avg       0.80      0.80      0.80       200
weighted avg       0.80      0.80      0.80       200
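
The held-out score (0.80) is clearly lower than the cross-validation estimate on the selected features (about 0.90). One likely contributor is that the selector was fitted on all training data before cross-validating, so each test fold had already influenced which features were kept. A sketch of a way to avoid this is to put vectorizer, selector, and classifier into one pipeline, so that all three are refitted on the training part of every fold. The tiny corpus and `k=5` below are made up to keep the example self-contained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# made-up toy corpus: 20 positive and 20 negative texts
texts = ['good fun film', 'great cast good plot', 'fun and great', 'good great fun',
         'bad boring film', 'boring plot bad cast', 'bad and dull', 'dull boring bad'] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

pipeline = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(chi2, k=5),      # selection happens inside each training fold
    LogisticRegression(C=20),
)

# the pipeline receives raw texts; every step is fitted on the training folds only
scores = cross_val_score(pipeline, texts, labels, cv=5, scoring='f1_micro')
print(scores.mean())
```

After `pipeline.fit(texts, labels)`, the same object can be applied to new raw texts with `pipeline.predict(...)`, which also guarantees that the new data passes through the same vectorizer and selector.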



## Getting insights

In order to explore which features are most indicative, we need some code:

In [41]:
features = vectorizer.get_feature_names_out() # get the names of the features (get_feature_names() before scikit-learn 1.0)
top_scores = selector.get_support(indices=True) # column indices kept by the selector, in ascending order
best_indicator_terms = [features[i] for i in top_scores] # feature names in the same order as the columns of clf.coef_

top_indicator_scores = pd.DataFrame(data={'feature': best_indicator_terms, 'coefficient': clf.coef_[0]})
top_indicator_scores.sort_values('coefficient')

Unnamed: 0,feature,coefficient
84,bad,-12.780473
1449,waste,-8.680148
73,attempt,-8.436183
1307,suppose,-8.139754
154,boring,-7.687207
...,...,...
1004,perfectly,6.340445
311,definitely,6.399306
1005,performance,6.453414
579,hilarious,6.495110


# Exercise

Try to test the model trained on the sentiment analysis dataset on the wine reviews.

In [42]:
new_data = pd.read_excel('wine_reviews_small.xlsx')
print(len(new_data))
new_data.head()

20000


Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,country,description,designation,points,price,province,region_1,region_2,variety,winery,description_cleaned
0,0,0,US,This tremendous 100% varietal wine hails from ...,Martha's Vineyard,96,235.0,California,Napa Valley,Napa,Cabernet Sauvignon,Heitz,tremendous varietal wine hail be age year oak ...
1,1,1,Spain,"Ripe aromas of fig, blackberry and cassis are ...",Carodorum Selección Especial Reserva,96,110.0,Northern Spain,Toro,,Tinta de Toro,Bodega Carmen Rodríguez,ripe aroma fig blackberry cassis be soften swe...
2,2,2,US,Mac Watson honors the memory of a wine once ma...,Special Selected Late Harvest,96,90.0,California,Knights Valley,Sonoma,Sauvignon Blanc,Macauley,honor memory wine once make his mother tremend...
3,3,3,US,"This spent 20 months in 30% new French oak, an...",Reserve,96,65.0,Oregon,Willamette Valley,Willamette Valley,Pinot Noir,Ponzi,spend month new french oak incorporate fruit v...
4,4,4,France,"This is the top wine from La Bégude, named aft...",La Brûlade,95,66.0,Provence,Bandol,,Provence red blend,Domaine de la Bégude,be top wine name high point vineyard foot have...


In [43]:
new_data['clean_description'] = new_data.description.apply(clean_text)
#print(new_data.head(5))
new_X = vectorizer.transform(new_data['clean_description'])

new_predictions = classifier.predict(new_X)
print(new_predictions)

[1 1 1 ... 1 1 0]


As before, the new texts have to go through `clean_text` before they are vectorized.

# Italian classifier

In our lab, we developed an Italian emotion and sentiment classifier, available at https://github.com/MilaNLProc/feel-it

In [44]:
! pip install -U feel-it


In [45]:
from feel_it import EmotionClassifier, SentimentClassifier

emotion_classifier = EmotionClassifier()

emotion_classifier.predict(["sono molto felice", "ma che cazzo vuoi", "sono molto triste"])



['joy', 'anger', 'sadness']

In [46]:
sentiment_classifier = SentimentClassifier()

sentiment_classifier.predict(["sono molto felice", "ma che cazzo vuoi", "sono molto triste"])

['positive', 'negative', 'negative']

# Exercise

Download a set of tweets with a specific hashtag in Italian and try to run the Emotion and Sentiment Classifier.

In [47]:
tweets = pd.read_csv('betsentiment-IT-tweets-sentiment-players.csv', encoding='cp1252')
print(len(tweets))
tweets = tweets[:2000]
tweets.head()

165815


Unnamed: 0,tweet_date_created,tweet_id,tweet_text,language,sentiment,sentiment_score
0,2018-12-10T10:45:40.865000,1072079899224100865,"@juventusfc @G_Higuain Auguri pipita,sempre co...",it,NEUTRAL,"{""Neutral"":0.858726024627685546875,""Negative"":..."
1,2018-06-30T18:41:02.817000,1013130303454498817,@realvarriale @massimozampini @KMbappe @G_Higu...,it,NEUTRAL,"{""Neutral"":0.973993778228759765625,""Negative"":..."
2,2018-12-26T19:40:44.308000,1078012758069858308,@OfficialASRoma @D_10Perotti @Hyundai_Italia S...,it,NEUTRAL,"{""Neutral"":0.9844334125518798828125,""Negative""..."
3,2018-07-12T11:26:37.766000,1017369635816181766,@CarloRienzi @Cristiano @juventusfc Populismo ...,it,NEUTRAL,"{""Neutral"":0.993849217891693115234375,""Negativ..."
4,2018-06-12T19:20:36,1006617282779451392,@OfficialASRoma @OfficialRadja @leonsfdo non c...,it,NEUTRAL,"{""Neutral"":0.87245881557464599609375,""Negative..."


In [48]:
import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# NOTE: this is the English stop word list; for Italian tweets,
# stopwords.words('italian') would be the matching list
stopwords_ = set(stopwords.words('english'))

url_pattern = re.compile(r'https?://\S+')
user_names_pattern = re.compile(r'@\w+')
hashtag_pattern = re.compile(r'#\w*')
emojis_pattern = re.compile("["
          u"\U0001F600-\U0001F64F"  # emoticons
          u"\U0001F300-\U0001F5FF"  # symbols & pictographs
          u"\U0001F680-\U0001F6FF"  # transport & map symbols
          u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
          u"\u2600-\u26FF\u2700-\u27BF"
                            "]+", flags=re.UNICODE)
numbers_pattern = re.compile(r'[0-9]+\w*')
# every ASCII punctuation character, escaped so that '-' is not read as a range
punctuation_pattern = re.compile('[' + re.escape(string.punctuation) + ']')

def preprocessing(tweet):
  '''lower-case a tweet and remove stop words, links, mentions, hashtags, emojis, numbers, and punctuation'''
  words = [w.lower() for w in tweet.split()]
  tweet = ' '.join(w for w in words if w not in stopwords_)
  for pattern in (url_pattern, user_names_pattern, hashtag_pattern,
                  emojis_pattern, numbers_pattern, punctuation_pattern):
    tweet = pattern.sub('', tweet)
  return tweet

tweets['cleaned_tweets'] = tweets.tweet_text.apply(preprocessing)
tweets.head(5)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Unnamed: 0,tweet_date_created,tweet_id,tweet_text,language,sentiment,sentiment_score,cleaned_tweets
0,2018-12-10T10:45:40.865000,1072079899224100865,"@juventusfc @G_Higuain Auguri pipita,sempre co...",it,NEUTRAL,"{""Neutral"":0.858726024627685546875,""Negative"":...",auguri pipitasempre con noi
1,2018-06-30T18:41:02.817000,1013130303454498817,@realvarriale @massimozampini @KMbappe @G_Higu...,it,NEUTRAL,"{""Neutral"":0.973993778228759765625,""Negative"":...",nano puttano
2,2018-12-26T19:40:44.308000,1078012758069858308,@OfficialASRoma @D_10Perotti @Hyundai_Italia S...,it,NEUTRAL,"{""Neutral"":0.9844334125518798828125,""Negative""...",stasera bravissimi tutti una menzione per i...
3,2018-07-12T11:26:37.766000,1017369635816181766,@CarloRienzi @Cristiano @juventusfc Populismo ...,it,NEUTRAL,"{""Neutral"":0.993849217891693115234375,""Negativ...",populismo di basso spessore
4,2018-06-12T19:20:36,1006617282779451392,@OfficialASRoma @OfficialRadja @leonsfdo non c...,it,NEUTRAL,"{""Neutral"":0.87245881557464599609375,""Negative...",non ce posso pensa’ non giochiamo non facc...


In [49]:
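# `predict` expects a collection of texts, so each tweet is wrapped in a one-element list;
# every cell therefore holds a one-element array such as `[negative]`
tweets["sentiment_classification"] = [sentiment_classifier.predict([tweet]) for tweet in tweets.cleaned_tweets]
tweets["emotion_classification"] = [emotion_classifier.predict([tweet]) for tweet in tweets.cleaned_tweets]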
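
Calling `predict` once per tweet works, but scikit-learn estimators also accept a whole collection of texts in one call, which returns one plain label per text instead of a one-element array. The sketch below assumes the classifiers are scikit-learn pipelines that take raw strings (their definitions are not shown in this part of the notebook); `toy_classifier`, `toy_tweets` and the four training texts are hypothetical stand-ins.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for `sentiment_classifier`: a pipeline that takes raw text.
toy_classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_classifier.fit(
    ['great match', 'great goal', 'terrible referee', 'terrible defeat'],
    ['positive', 'positive', 'negative', 'negative'],
)

toy_tweets = pd.DataFrame({'cleaned_tweets': ['great team', 'terrible game']})

# One call for the whole column: returns an array with one label per row.
toy_tweets['sentiment_classification'] = toy_classifier.predict(toy_tweets['cleaned_tweets'])

# Class probabilities, one column per label, in the order given by `classes_`.
probabilities = toy_classifier.predict_proba(toy_tweets['cleaned_tweets'])

print(toy_tweets)
print(toy_classifier.classes_, probabilities.round(2))
```

Besides being shorter, the batch call vectorizes all tweets at once, which tends to be noticeably faster on large data sets.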

In [51]:
tweets.head(20)

Unnamed: 0,tweet_date_created,tweet_id,tweet_text,language,sentiment,sentiment_score,cleaned_tweets,sentiment_classification,emotion_classification
0,2018-12-10T10:45:40.865000,1072079899224100865,"@juventusfc @G_Higuain Auguri pipita,sempre co...",it,NEUTRAL,"{""Neutral"":0.858726024627685546875,""Negative"":...",auguri pipitasempre con noi,[negative],[joy]
1,2018-06-30T18:41:02.817000,1013130303454498817,@realvarriale @massimozampini @KMbappe @G_Higu...,it,NEUTRAL,"{""Neutral"":0.973993778228759765625,""Negative"":...",nano puttano,[negative],[anger]
2,2018-12-26T19:40:44.308000,1078012758069858308,@OfficialASRoma @D_10Perotti @Hyundai_Italia S...,it,NEUTRAL,"{""Neutral"":0.9844334125518798828125,""Negative""...",stasera bravissimi tutti una menzione per i...,[positive],[joy]
3,2018-07-12T11:26:37.766000,1017369635816181766,@CarloRienzi @Cristiano @juventusfc Populismo ...,it,NEUTRAL,"{""Neutral"":0.993849217891693115234375,""Negativ...",populismo di basso spessore,[negative],[anger]
4,2018-06-12T19:20:36,1006617282779451392,@OfficialASRoma @OfficialRadja @leonsfdo non c...,it,NEUTRAL,"{""Neutral"":0.87245881557464599609375,""Negative...",non ce posso pensa’ non giochiamo non facc...,[positive],[anger]
5,2018-06-06T10:44:28,1004313065771487232,@BAILALASAMBA @Milanistiblog @ECavaniOfficial ...,it,NEUTRAL,"{""Neutral"":0.812459886074066162109375,""Negativ...",stop altrimenti tanto vale prendere immobi...,[negative],[anger]
6,2018-07-15T08:30:26.400000,1018412461836390400,@95Luk @calciomercatoit @G_Higuain È gli altri...,it,NEUTRAL,"{""Neutral"":0.849590957164764404296875,""Negativ...",è gli altri dove li prendi,[negative],[anger]
7,2018-05-27T13:40:41,1000733532707344385,"@SergioRamos non sei un cane, sei una piccola ...",it,NEUTRAL,"{""Neutral"":0.552955687046051025390625,""Negativ...",non sei un cane sei una piccola cagna spavent...,[negative],[anger]
8,2018-06-11T11:14:55,1006132664888446978,@DSportiva Ma @PauDybala_JR non si tocca.. res...,it,NEUTRAL,"{""Neutral"":0.727782666683197021484375,""Negativ...",ma non si tocca restasse da noi tutta la vita,[positive],[sadness]
9,2018-07-11T04:38:23.376000,1016904512999653376,@antoniusromano @Cristiano @antoniusromano ma ...,it,NEUTRAL,"{""Neutral"":0.977986395359039306640625,""Negativ...",ritornerete con il nostro canale ufficiale,[positive],[joy]


# Checklist: how to classify my data

1. label ***at least 2000*** tweets in your data set as `positive`, `negative`, or `neutral`
2. preprocess the text of *all* tweets in your data (labeled and unlabeled)
3. read in the labeled tweets and their labels
4. transform the labels into numbers
5. use `TfidfVectorizer` to extract the features and transform them into feature vectors
6. select the top $N$ features (where $N$ is smaller than the number of labeled tweets)
7. create a classifier
8. use 5-fold cross-validation to choose the regularization parameter and the number $N$ of selected features, and optionally to compare feature generation and preprocessing choices
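
Steps 3 to 8 can be sketched with a scikit-learn `Pipeline` and `GridSearchCV`. This is a minimal sketch, not the only way to do it: the toy texts and labels, the choice of `LogisticRegression` and of `chi2` as the scoring function, and the grid values are all assumptions made for illustration. With real data you would read the labeled tweets from your own file.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Step 3: toy stand-in for the labeled tweets (hypothetical data).
texts = ['great goal', 'great match', 'great win', 'great save', 'great team',
         'awful foul', 'awful match', 'awful loss', 'awful miss', 'awful team'] * 2
labels = (['positive'] * 5 + ['negative'] * 5) * 2

# Step 4: transform the labels into numbers.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(labels)

# Steps 5-7: vectorizer, top-N feature selection, classifier.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('select', SelectKBest(chi2)),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Step 8: 5-fold CV over N (`select__k`) and the regularization strength (`clf__C`).
param_grid = {'select__k': [2, 5], 'clf__C': [0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_macro')
search.fit(texts, y)

print(search.best_params_, round(search.best_score_, 2))
```

Putting the vectorizer and the selector inside the pipeline means they are refit on the training part of every fold, so the held-out fold never leaks into the features.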

Once you are satisfied with the results:
9. read in the rest of the (unlabeled) tweets
10. use the fitted `TfidfVectorizer` from step 5 to transform the new data into vectors (call `transform`, not `fit_transform`)
11. use the fitted `SelectKBest` selector from step 6 to keep the same top $N$ features
12. use the classifier from step 7 to predict the labels for the new data
13. save the predicted labels or probabilities to your database or an Excel file
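
Steps 9 to 13 reuse the objects that were fitted on the labeled data, so the new tweets end up in the same feature space the classifier was trained on. A minimal sketch, assuming toy data, `LogisticRegression`, `chi2`, $N = 2$, and a hypothetical output file name `predictions.csv`:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

# Fitted objects from steps 5-7 (trained here on hypothetical toy data).
train_texts = ['great goal', 'great match', 'awful foul', 'awful match']
train_labels = ['positive', 'positive', 'negative', 'negative']

vectorizer = TfidfVectorizer()
selector = SelectKBest(chi2, k=2)
classifier = LogisticRegression()

X_train = selector.fit_transform(vectorizer.fit_transform(train_texts), train_labels)
classifier.fit(X_train, train_labels)

# Step 9: the unlabeled tweets (hypothetical stand-in for reading your own file).
new_data = pd.DataFrame({'cleaned_tweets': ['great save', 'awful defeat']})

# Steps 10-11: transform only; the vectorizer and the selector are not refit.
X_new = selector.transform(vectorizer.transform(new_data['cleaned_tweets']))

# Step 12: predict labels and class probabilities.
new_data['predicted_label'] = classifier.predict(X_new)
probabilities = classifier.predict_proba(X_new)
for column, label in enumerate(classifier.classes_):
    new_data['p_' + label] = probabilities[:, column]

# Step 13: save the predictions (use `to_excel` for an Excel file).
new_data.to_csv('predictions.csv', index=False)
```

Words that never occurred in the labeled tweets (here `save` and `defeat`) are silently ignored by the vectorizer, which is one reason to label a sample that is representative of the full data set.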
