## Classifying text

In [11]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV as gs
import sklearn.feature_extraction.text as text
import sklearn.naive_bayes as nb
import matplotlib.pyplot as plt
from sklearn.metrics import precision_score, recall_score, accuracy_score
%matplotlib inline

We now turn to applying machine learning classification methods to text. No new
principles are involved: everything works the same way as it did when we
learned how to classify irises.

1.  We need to find labeled data; each of the exemplars in the data should be represented with a fixed set of features.
2.  We need to split our data into training and test data.
3.  We need to train a learner on the training data and evaluate it (test it) on the test data.

The problem is that text data is not in a form that is compatible with
what we have learned about classifiers.  The text must be put in a suitable
form before a linear model can be trained on it.

**Training**

1.  Labeled data must be loaded (into Python).  It should be a sequence of documents T accompanied by a sequence of labels L.
2.  Split T and L into training and test groups, yielding T1 and T2 as well as L1 and L2.
3.  Train a **feature model** on the training data T1 (or, in scikit-learn terminology, **fit** the model **to** the training data).  The feature model takes the text sequence as input and outputs a **term-document** matrix suitable for training a linear classifier.  The feature model is called a **vectorizer** (because it turns a document into a vector, a column of numbers).
4.  Using the trained vectorizer, transform T1 into a term-document matrix M1.
5.  Train a linear model $\mu$ on M1 and L1.

**Evaluation**

1.  Transform the test data T2 into a term-document matrix M2 using the vectorizer that was fit on T1 during training; in particular, this means that words in T2 that were never seen during training are ignored in building M2.
2.  Use $\mu$ to classify the texts represented in M2; that is, produce a set of predicted labels P2.
3.  Compare the actual labels L2 with the predicted labels P2 using standard evaluation metrics such as precision, accuracy, and recall.
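
The training and evaluation steps above can be sketched end to end on a toy example. This is only an illustration: the four training sentences and two test sentences are made up, and `BernoulliNB` stands in for whatever classifier you choose. The second test document contains only words that never occur in the training data, so its row in M2 is all zeros.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

# Made-up training documents T1 with labels L1 (1 = insult, 0 = not an insult)
T1 = ["you are a moron", "what a lovely day", "you idiot", "have a nice day"]
L1 = [1, 0, 1, 0]
# Made-up test documents T2 with labels L2
T2 = ["you are an idiot", "zebras gallop quietly"]
L2 = [1, 0]

# Fit the vectorizer on the training documents only, then transform them
vectorizer = TfidfVectorizer()
M1 = vectorizer.fit_transform(T1)

# Train the classifier on the training matrix and the training labels
clf = BernoulliNB()
clf.fit(M1, L1)

# Transform the test documents with the already-fitted vectorizer (no refitting)
M2 = vectorizer.transform(T2)
P2 = clf.predict(M2)

# Both matrices have one column per word in the training vocabulary
print(M1.shape, M2.shape)
# The all-unseen test document contributes no nonzero entries
print(M2.toarray()[1].sum())
# Compare the actual labels with the predicted labels
print(accuracy_score(L2, P2))
```

Note that the true labels are passed first and the predictions second when calling the metric.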


## Review the steps with insult detection

We looked at the insult detection data in  the text classification notebook.

### Training step 1: Loading the data

Let's load the CSV file.

In [12]:
import os.path
site = 'https://raw.githubusercontent.com/gawron/python-for-social-science/master/'\
'text_classification/'
#site = 'https://gawron.sdsu.edu/python_for_ss/course_core/book_draft/_static/'
df = pd.read_csv(os.path.join(site,"troll.csv"))

Each row is a comment  taken from a blog or online forum. There are three columns: whether the comment is insulting (1) or not (0), the date, and the comment.

In [13]:
df.tail()

Unnamed: 0,Insult,Date,Comment
3942,1,20120502172717Z,"""you are both morons and that is never happening"""
3943,0,20120528164814Z,"""Many toolbars include spell check, like Yahoo..."
3944,0,20120620142813Z,"""@LambeauOrWrigley\xa0\xa0@K.Moss\xa0\nSioux F..."
3945,0,20120528205648Z,"""How about Felix? He is sure turning into one ..."
3946,0,20120515200734Z,"""You're all upset, defending this hipster band..."


Now we define the text sequence $\mathbf{T}$ and the label sequence $\mathbf{L}$.

In [14]:
T = df['Comment']

In [15]:
L = df['Insult']

### Training step 2: Splitting the data and labels into training and test groups

In [16]:
T1, T2, L1, L2 = train_test_split(T,L)

### Training steps 3 and 4: Fitting the feature model (vectorizer) to the training data and transforming it

In [17]:
tf = text.TfidfVectorizer()
# Scikit learn has one function that does both fitting and transforming.
# M1 is the transformed data
# tf is the trained feature model (which will be used to transform the test data)
M1 = tf.fit_transform(T1)#.toarray()

### Training step 5: Training the classifier

Now we are going to train a classifier as usual. We have already split the data into a training set and a test set, so we can train directly on M1 and L1.

We use a **Bernoulli Naive Bayes classifier**.

In [18]:
# Create classifier
bnb =nb.BernoulliNB()
#bnb= nb.MultinomialNB()
#bnb =nb.GaussianNB()

# Fit (train) the classifier  using the training data and labels
bnb.fit(M1, L1);
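
One detail worth knowing when thinking about this model: `BernoulliNB` has a `binarize` parameter whose default value is `0.0`, so any feature value greater than zero is treated as "word present". The sketch below, on made-up sentences, suggests that under those defaults the tf-idf weights do not affect the fitted model: training on a tf-idf matrix and training on a plain presence/absence matrix give the same learned parameters.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up documents and labels (1 = insult, 0 = not an insult)
docs = ["you are a moron", "what a lovely day", "you idiot", "have a nice day"]
labels = [1, 0, 1, 0]

# Same tokenization defaults, so both matrices share the same columns
M_tfidf = TfidfVectorizer().fit_transform(docs)               # real-valued weights
M_binary = CountVectorizer(binary=True).fit_transform(docs)   # 0/1 presence

bnb_tfidf = BernoulliNB().fit(M_tfidf, labels)
bnb_binary = BernoulliNB().fit(M_binary, labels)

# Compare the per-class log probabilities of each word being present
same = np.allclose(bnb_tfidf.feature_log_prob_, bnb_binary.feature_log_prob_)
print(same)
```

`MultinomialNB`, one of the commented-out alternatives in the cell above, does use the magnitudes of the feature values.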

This function collects what we've done so far, plus it produces predictions for the test data.

In [19]:
def split_vectorize_and_fit(docs,labels,clf,**params):
    """
    Given labeled data (docs, labels) and a classifier class,
    do the train/test split.  Train the vectorizer and the classifier.
    Transform the test data and return the predicted labels
    for the test data together with the actual test labels.
    Keyword arguments (params) are passed on to TfidfVectorizer.
    """
    T_train,T_test, y_train,y_test = train_test_split(docs,labels)
    tf = text.TfidfVectorizer(**params)
    X_train = tf.fit_transform(T_train)
    clf_inst = clf()
    clf_inst.fit(X_train, y_train)
    X_test = tf.transform(T_test)
    return clf_inst.predict(X_test), y_test

### Evaluation

Evaluate the classifier, first using accuracy (what `.score()` returns).

In [20]:
# Vectorize the test data using the vectorizer trained on T1.
# Notice we DON'T call .fit_transform() because that would retrain the vectorizer on the test data.
# We call .transform(), which uses the trained model to transform the new data.
# Words not seen during training will be ignored.
M2 = tf.transform(T2)#.toarray()
# Classify the data using the trained classifier and report the accuracy
bnb.score(M2, L2)

0.7750759878419453

Now try re-executing steps 2 through 5.  (Just re-execute the cells)  The results should be the same, right?

Well, are they?  

What happens: each train/test split produces a different set of test data.  Sometimes the test is harder.
Sometimes it's easier.  Or, looking at it another way: sometimes the training data is better preparation for the test than at other times.

To get a realistic view of how our classifier is doing, we take the average performance over a number of
train/test splits.  This is called **cross validation**.  We return to that point below.
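
As a rough sketch of averaging over several random splits, scikit-learn's `ShuffleSplit` can generate the splits and `cross_val_score` can run the fit-and-score loop. Wrapping the vectorizer and classifier in a pipeline ensures the vectorizer is refit on each training portion only. The twelve sentences below are made up for illustration; `ShuffleSplit` is one of several available splitters (`KFold` is another common choice).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Made-up documents: six insults followed by six non-insults
docs = [
    "you are a complete moron", "what an idiot you are",
    "shut up you fool", "you people are pathetic losers",
    "nobody likes you loser", "stupid idiot go away",
    "thanks for the helpful link", "great game last night",
    "I agree with this article", "the weather is lovely today",
    "see you at the meeting", "interesting point about taxes",
]
labels = [1] * 6 + [0] * 6

# The pipeline vectorizes and then classifies; it is refit on every split
pipe = make_pipeline(TfidfVectorizer(), BernoulliNB())

# Ten independent random train/test splits, each holding out 25% for testing
cv = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
scores = cross_val_score(pipe, docs, labels, cv=cv, scoring="accuracy")

print(scores.round(2))   # one accuracy per split; note how much they vary
print(scores.mean())     # the averaged estimate
```

Passing `scoring="precision"` or `scoring="recall"` instead gives the corresponding averages for the other two metrics.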

#### Using all three evaluation metrics

First let's get more evaluation numbers, in particular precision and recall.  We do
that by calling a method that returns the predicted labels P2, so we can compare
L2 and P2 using different evaluation metrics.

In [21]:
P2 = bnb.predict(M2)
# scikit-learn metric functions take the actual labels first, then the predictions
scores = np.array([accuracy_score(L2, P2),
                   precision_score(L2, P2),
                   recall_score(L2, P2)])
print(f'Accuracy: {scores[0]:.2f} Precision: {scores[1]:.2f} Recall: {scores[2]:.2f}')

Accuracy: 0.78 Precision: 0.95 Recall: 0.14


We see that the accuracy is a bit misleading.  There is a serious recall problem.

What does that mean in the setting of insult detection?  It means the BNB classifier is too
reluctant to call something an insult.  When it flags something as an insult, it
is right 95% of the time, but it finds only 14% of the actual insults.

Why would that be?  Think about how the model is trained and what its weakness might be.
This is what it means to try to interpret or discuss a model's performance.  Zoom
in on the model's weakness. Talk about where that weakness comes from.
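
To keep the two metrics straight, it can help to compute them by hand from the counts of true positives, false positives, and false negatives. The sketch below uses made-up labels for ten comments. It also shows that scikit-learn's metric functions expect the actual labels first and the predictions second; swapping the two arguments exchanges precision and recall.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up actual labels and predictions for ten comments (1 = insult)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1])

tp = np.sum((y_true == 1) & (y_pred == 1))   # insults correctly flagged
fp = np.sum((y_true == 0) & (y_pred == 1))   # non-insults wrongly flagged
fn = np.sum((y_true == 1) & (y_pred == 0))   # insults that were missed
print(tp, fp, fn)   # 1 1 3

precision = tp / (tp + fp)   # of the comments flagged, how many were insults?
recall = tp / (tp + fn)      # of the actual insults, how many were flagged?
print(precision, recall)     # 0.5 0.25

# The library agrees when called as metric(actual, predicted) ...
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
# ... and swapping the arguments swaps the two metrics
assert np.isclose(precision_score(y_pred, y_true), recall)
```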

#### Basic train and test loop

See the Insults with Naive Bayes Notebook.

## Homework

Work through the Insults Detection notebook about text classification and
insult detection. Focus on the use of `scikit-learn`, especially the
`TfidfVectorizer`. For this assignment you will be turning in a Python notebook (extension `.ipynb`, **not** a `.py` file).  Turn in this notebook with all the code needed to run your classifier.  If it doesn't run, your score will suffer.

For Parts One and Two, try two different classifiers on the movie review data: the one used in the textbook, an SVM called `LinearSVC`, and the Bernoulli Naive Bayes model used above. Some points of emphasis:

#### PART ONE

1.  Be sure to get the average of at least 10 runs for **both** classifiers.  2 points
2.  Be sure to get average accuracy, precision, and recall for both classifiers on those multiple runs. You will probably find `split_vectorize_and_fit` defined above useful, but you will need to modify it.  2 points.
3.  Discuss which of the two classifiers does better.  Discuss which metric the best classifier does the worst at and speculate as to why (this will require reviewing the definitions of precision and recall and thinking about what they mean in a movie review setting). 3 points.
4. Do a new training/test split on the data and train and test an SVM model.  Choose one false positive and one false negative from the test set.  Call these documents $j$ and $k$ and call their functional margins $c_j$ and $c_k$ (see the SVM notebook).  Find

$$
\frac{c_{j}}{c_{max}-c_{min}}
$$

and

$$
\frac{c_{k}}{c_{max}-c_{min}},
$$

where $c_{max}$ and $c_{min}$ are the maximum and minimum functional margins for the training set.  Are documents $j$ and $k$ misclassified with high confidence?  Of course getting credit for this part means submitting the code you used to compute these quantities.  For the computation of functional margins, it will be convenient to relabel positive and negative classes 1 and -1 respectively. 5 points
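
For orientation only, here is a sketch of the kind of computation involved, on made-up review snippets. It assumes the functional margin of document $i$ is $c_i = y_i(\mathbf{w}\cdot\mathbf{x}_i + b)$ with labels in $\{1, -1\}$; check that against the definition in the SVM notebook before relying on it. The names `train_docs`, `test_docs`, and `vec` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up training and test snippets; positive = 1, negative = -1
train_docs = ["a wonderful and moving film",
              "great acting and a great story",
              "truly enjoyable from start to finish",
              "a dull and boring mess",
              "terrible plot and awful acting",
              "boring"]
y_train = np.array([1, 1, 1, -1, -1, -1])
test_docs = ["a great and moving story", "awful boring film"]
y_test = np.array([1, -1])

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

svm = LinearSVC()
svm.fit(X_train, y_train)

# decision_function returns w.x + b for each document.
# ASSUMPTION: functional margin = label * (w.x + b), so it is
# negative exactly when the document is misclassified.
train_margins = y_train * svm.decision_function(X_train)
test_margins = y_test * svm.decision_function(X_test)

# Scale test margins by the range of the TRAINING margins
c_range = train_margins.max() - train_margins.min()
scaled = test_margins / c_range
print(scaled)
```

A scaled value that is negative and large in magnitude would indicate a document misclassified with high confidence; a value near zero indicates a document close to the decision boundary.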

In [22]:
import nltk
from nltk.corpus import movie_reviews as mr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, precision_score, recall_score
from random import shuffle
import numpy as np

nltk.download('movie_reviews')

def get_file_strings(corpus, file_ids):
    return [corpus.raw(file_id) for file_id in file_ids]

pos_reviews = get_file_strings(mr, mr.fileids('pos'))
neg_reviews = get_file_strings(mr, mr.fileids('neg'))

data = [(review, 1) for review in pos_reviews] + [(review, 0) for review in neg_reviews]
shuffle(data)
texts, labels = zip(*data)

def split_vectorize_and_fit(texts, labels, classifier):
    accuracies, precisions, recalls = [], [], []
    for _ in range(10):
        X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=None)
        vectorizer = TfidfVectorizer()
        X_train_vec = vectorizer.fit_transform(X_train)
        X_test_vec = vectorizer.transform(X_test)
        clf = classifier()
        clf.fit(X_train_vec, y_train)
        y_pred = clf.predict(X_test_vec)
        accuracies.append(accuracy_score(y_test, y_pred))
        precisions.append(precision_score(y_test, y_pred))
        recalls.append(recall_score(y_test, y_pred))
    return np.mean(accuracies), np.mean(precisions), np.mean(recalls)

svc_metrics = split_vectorize_and_fit(texts, labels, LinearSVC)
nb_metrics = split_vectorize_and_fit(texts, labels, BernoulliNB)

print("LinearSVC Metrics: Accuracy, Precision, Recall")
print(svc_metrics)
print("BernoulliNB Metrics: Accuracy, Precision, Recall")
print(nb_metrics)

def calculate_functional_margins(texts, labels):
    X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)
    vectorizer = TfidfVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)
    svm = LinearSVC()
    svm.fit(X_train_vec, y_train)
    y_pred = svm.predict(X_test_vec)
    false_pos = [i for i, (y, pred) in enumerate(zip(y_test, y_pred)) if y == 0 and pred == 1]
    false_neg = [i for i, (y, pred) in enumerate(zip(y_test, y_pred)) if y == 1 and pred == 0]
    if false_pos and false_neg:
        fp_idx = false_pos[0]
        fn_idx = false_neg[0]
        margins = svm.decision_function(X_test_vec)
        c_max = np.max(margins)
        c_min = np.min(margins)
        c_fp = margins[fp_idx]
        c_fn = margins[fn_idx]
        print(f"False Positive Margin: {(c_fp - c_min) / (c_max - c_min)}")
        print(f"False Negative Margin: {(c_fn - c_min) / (c_max - c_min)}")
    else:
        print("No false positives or negatives found in this split.")

calculate_functional_margins(texts, labels)

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


LinearSVC Metrics: Accuracy, Precision, Recall
(0.8355, 0.8207656497460223, 0.8477206964492631)
BernoulliNB Metrics: Accuracy, Precision, Recall
(0.7842500000000001, 0.8774438442263344, 0.6791984696886508)
False Positive Margin: 0.5066834761051844
False Negative Margin: 0.35242522331354037


#### PART TWO

5. Using the SVM classifier and training on **all** the data find the 50 most important Positive features for the Movie Reviews Data.  They should differ significantly from the most important features in Insult Detection. The function `print_topn` (from the Insult Detection Notebook) should be of help. 4 Points
6.  Find the 100 most important Negative features for the Movie Reviews Data. Note that, given the way two-class problems work with SVMs, there is only one set of weights to look at, so it won't work to pass more than one class name to the `class_labels` parameter of `print_topn` (it would work with a Naive Bayes classifier).  In particular, you need the **lowest** weighted features if you want to look at the more fun word set that best characterizes bad reviews. You will have to modify `print_topn` (from the Insult Detection Notebook) to do that. Try to do so in such a way that with one set of parameters it prints the most positive words, and with another it prints the most negative words. You may notice the names of a few actors appearing in this feature set. Try not to laugh as the meaning of this dawns on you.  For an extra bit of approval from your instructor, while you're at it, modify it so that it returns a list of words in addition to printing them.  You should probably change the name of the function to `get_topn` if you succeed.  4 Points.

#### Help with getting the movie reviews data.

Execute the next two cells to get the movie review data.

In [24]:
import nltk
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [25]:
from nltk.corpus import movie_reviews as mr

def get_file_strings (corpus, file_ids):
    return [corpus.raw(file_id) for file_id in file_ids]

data = dict(pos = mr.fileids('pos'),
            neg = mr.fileids('neg'))

pos_file_ids = data['pos']
neg_file_ids = data['neg']

# Get all the positive and negative reviews.
pos_file_reviews = get_file_strings (mr, pos_file_ids)
neg_file_reviews = get_file_strings (mr, neg_file_ids)

Each review is a string.  In principle, a list of strings like `pos_file_reviews` can be passed to `text.TfidfVectorizer()` via the `fit_transform` method to train a vectorizer for machine learning.
You could code that up.

What you'd really like to do is use `split_vectorize_and_fit`, defined above, which does a lot of the work for you.

But hold on. You have a coding problem. You don't have a sequence of documents and a sequence of labels.  Instead you have
one sequence of positive documents and another sequence of negative documents.

So you will need to turn those two sequences into a sequence of documents and a sequence of labels
because that's what `split_vectorize_and_fit` wants.  You also want the doc sequence
to contain a random mixture of positive and negative documents, because some machine
learning algorithms are sensitive to the order in which training data is presented to
them.

The next cell does **not** do that for you.  But it illustrates an approach using
two sets of English letters in place of two sets of English documents.

In [26]:
list(zip(*[('a',1),('b',2),('c',3)]))

[('a', 'b', 'c'), (1, 2, 3)]

In [27]:
# Let's work on letters instead of documents
# There are 2 classes, letters from the first half of the
# alphabet ('f') and letters from the last half ('l')

from random import shuffle
from string import ascii_lowercase

#Class 1 of the letters: the f_lets
f_lets = ascii_lowercase[:13]
print(f_lets)
#Class2 of the letters: the l_lets
l_lets = ascii_lowercase[13:]
print(l_lets)

# Now get pairs of letters and labels
f_pairs = [(let,'f') for let in f_lets]
l_pairs = [(let,'l') for let in l_lets]

###########  Shuffling  ###########################
# Way too orderly, the classes aren't mixed yet.
data = f_pairs + l_pairs
shuffle(data)
###################  Now they're shuffled! ###############

# Separate the letters from their labels
lets, lbls = zip(*data)
print(lets)
print(lbls)

abcdefghijklm
nopqrstuvwxyz
('v', 'l', 't', 'w', 'p', 'q', 'c', 'a', 'k', 'f', 'g', 'b', 'e', 'i', 'z', 'y', 'm', 'h', 'o', 'n', 'd', 'r', 's', 'j', 'u', 'x')
('l', 'f', 'l', 'l', 'l', 'l', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'f', 'l', 'l', 'f', 'f', 'l', 'l', 'f', 'l', 'l', 'f', 'l', 'l')


In [28]:
list(zip(*[(1,"a"),(2,"b"),(3,"c")]))

[(1, 2, 3), ('a', 'b', 'c')]

In [31]:
import nltk
from nltk.corpus import movie_reviews as mr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
import numpy as np
from random import shuffle

nltk.download('movie_reviews')

def get_file_strings(corpus, file_ids):
    return [corpus.raw(file_id) for file_id in file_ids]

pos_reviews = get_file_strings(mr, mr.fileids('pos'))
neg_reviews = get_file_strings(mr, mr.fileids('neg'))

data = [(review, 1) for review in pos_reviews] + [(review, 0) for review in neg_reviews]
shuffle(data)
texts, labels = zip(*data)

vectorizer = TfidfVectorizer()
X_all = vectorizer.fit_transform(texts)

svm = LinearSVC()
svm.fit(X_all, labels)

def get_topn_features(vectorizer, classifier, n=50, positive=True):
    feature_names = np.array(vectorizer.get_feature_names_out())
    coefficients = classifier.coef_.flatten()  #access the coefficients
    if positive:
        top_indices = np.argsort(coefficients)[-n:]
    else:
        top_indices = np.argsort(coefficients)[:n]
    top_features = feature_names[top_indices]
    top_coefficients = coefficients[top_indices]
    result = list(zip(top_features, top_coefficients))
    print(f"Top {'Positive' if positive else 'Negative'} Features:")
    for feature, coefficient in reversed(result):
        print(f"{feature}: {coefficient:.4f}")
    return result

positive_features = get_topn_features(vectorizer, svm, n=50, positive=True)
negative_features = get_topn_features(vectorizer, svm, n=100, positive=False)

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


Top Positive Features:
and: 1.9328
great: 1.4733
fun: 1.4168
well: 1.3296
quite: 1.2249
life: 1.1709
is: 1.1249
seen: 1.1234
also: 1.1016
as: 1.0960
very: 1.0673
see: 1.0463
many: 1.0233
most: 1.0108
excellent: 1.0045
job: 0.9973
hilarious: 0.9673
terrific: 0.9649
truman: 0.9369
memorable: 0.9352
perfectly: 0.9286
american: 0.9233
true: 0.9173
overall: 0.9031
definitely: 0.8988
pulp: 0.8985
trek: 0.8981
especially: 0.8858
matrix: 0.8675
mulan: 0.8665
sometimes: 0.8576
rocky: 0.8533
will: 0.8438
performances: 0.8403
perfect: 0.8247
he: 0.8210
different: 0.8119
back: 0.8106
best: 0.8104
cameron: 0.8057
others: 0.7926
movies: 0.7922
war: 0.7906
bowfinger: 0.7736
family: 0.7680
enjoyed: 0.7639
beavis: 0.7608
enjoyable: 0.7560
you: 0.7544
yet: 0.7396
Top Negative Features:
saved: -0.6404
unfunny: -0.6406
hurlyburly: -0.6419
joke: -0.6462
actors: -0.6498
dialogue: -0.6600
annoying: -0.6671
weak: -0.6711
clich: -0.6712
filmmakers: -0.6745
laughable: -0.6758
movie: -0.6788
flat: -0.6809
dutch: