## Whose Text Is This?

This notebook contains a simple classifier that predicts the author of a text (a string) written in a restricted language. An accompanying Excel file generates texts, each labeled with an author. The generator produces biased instances: each word is drawn according to the probability that a particular author uses it. As such, the generator is both a corpus generator and a means to check whether the trained classifier makes sense.

First, import the text corpus and create a sentence list and its label list.

In [2]:
import csv

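# Each row holds the author in the first column, followed by the words;
# shorter sentences are padded with empty cells at the end.
with open("ML06_Corpus.csv", newline="") as f:
    corpus = [row for row in csv.reader(f, delimiter=';')]

# Join the words of each row into one string and strip the trailing padding.
X = [" ".join(lst[1:]).strip(" ") for lst in corpus]
y = [lst[0] for lst in corpus]
print(corpus[:2])

print()

print(X[:2])
print(y[:2])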

[['Gerrit', 'soccer', 'soccer', 'car', 'difficult', 'difficult', 'soccer', 'difficult', 'soccer', 'car', 'difficult', 'car', '', '', '', ''], ['Gerrit', 'soccer', 'boring', 'car', 'boring', 'contract', 'boring', 'difficult', 'soccer', 'car', 'soccer', 'contract', 'boring', 'car', 'energy', 'soccer']]

['soccer soccer car difficult difficult soccer difficult soccer car difficult car', 'soccer boring car boring contract boring difficult soccer car soccer contract boring car energy soccer']
['Gerrit', 'Gerrit']


Suppose you don't know anything about machine learning. What would be an approach to determine the author of a sentence, given this set of labeled sentences?

In [3]:
from collections import Counter

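# Assumes the first 50 rows belong to Gerrit and the remaining rows to Truus.
# Note that each row still starts with the author name, so the name itself
# is counted as a word (it shows up in the output below).
words_G = [word for row in corpus[:50] for word in row if word != '']
words_T = [word for row in corpus[50:] for word in row if word != '']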

counts_G = Counter(words_G)
counts_T = Counter(words_T)

total_G = sum(counts_G.values())
total_T = sum(counts_T.values())

perc_G = {key: value / total_G for key, value in counts_G.items()}
perc_T = {key: value / total_T for key, value in counts_T.items()}

print(perc_G)
print(perc_T)

{'Gerrit': 0.09615384615384616, 'soccer': 0.2153846153846154, 'car': 0.19615384615384615, 'difficult': 0.1076923076923077, 'boring': 0.06923076923076923, 'contract': 0.13076923076923078, 'energy': 0.028846153846153848, 'sweet': 0.15576923076923077}
{'Truus': 0.09157509157509157, 'boring': 0.2893772893772894, 'energy': 0.25457875457875456, 'soccer': 0.07326007326007326, 'car': 0.11538461538461539, 'contract': 0.08058608058608059, 'sweet': 0.05311355311355311, 'difficult': 0.04212454212454213}


In [4]:
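def determine_author(sentence, perc1, perc2):
    """Return True if the words are more likely under perc1 than under perc2."""
    # Multiply the per-word frequencies, treating the words as independent.
    prod1 = prod2 = 1.0
    for word in sentence:
        prod1 *= perc1[word]
        prod2 *= perc2[word]
        # print(prod1, prod2)
    return prod1 > prod2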

# determine_author(X[1].split(" "), perc_G, perc_T)
for sentence in X[:50]:
    print(determine_author(sentence.split(" "), perc_G, perc_T), end=';')

print("\n=====")

for sentence in X[50:]:
    print(determine_author(sentence.split(" "), perc_G, perc_T), end=';')

True;False;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;False;True;True;True;True;True;True;False;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;
=====
False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;True;False;False;False;True;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;

Well, that's interesting and perhaps a bit embarrassing at the same time. I would have expected the first set of 50 to be all True (author = 'Gerrit') and the subsequent 50 all False. But in both sets some unexpected results slip through. The only explanation I can think of right now is that the generator has produced a string attributed to Gerrit that is more likely to have been written by Truus (and vice versa). This could happen because of the randomization in the generation.
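
One way to probe this is to compare the two scores directly. The sketch below recomputes the scores of the first two sentences in log space (a sum of logarithms instead of a product, which also avoids underflow for long texts). The word frequencies are copied from the output above, rounded to four decimals, with the author-name entries left out. The names `freq_G`, `freq_T` and `log_likelihood` are made up for this illustration.

```python
from math import log

# Word frequencies copied from the printed output above, rounded to 4 decimals.
# The author-name entries ('Gerrit', 'Truus') are left out on purpose.
freq_G = {'soccer': 0.2154, 'car': 0.1962, 'difficult': 0.1077, 'boring': 0.0692,
          'contract': 0.1308, 'energy': 0.0288, 'sweet': 0.1558}
freq_T = {'boring': 0.2894, 'energy': 0.2546, 'soccer': 0.0733, 'car': 0.1154,
          'contract': 0.0806, 'sweet': 0.0531, 'difficult': 0.0421}

def log_likelihood(sentence, freq):
    """Sum of log word frequencies, i.e. the log of the product of frequencies."""
    return sum(log(freq[word]) for word in sentence.split())

# The first two sentences of the corpus, both labeled 'Gerrit'.
sentences = [
    'soccer soccer car difficult difficult soccer difficult soccer car difficult car',
    'soccer boring car boring contract boring difficult soccer car soccer contract '
    'boring car energy soccer',
]

for s in sentences:
    ll_G = log_likelihood(s, freq_G)
    ll_T = log_likelihood(s, freq_T)
    winner = 'Gerrit' if ll_G > ll_T else 'Truus'
    print("Gerrit: {:.2f}  Truus: {:.2f}  -> {}".format(ll_G, ll_T, winner))
```

For the second sentence the two scores are nearly equal, with Truus slightly ahead, which is consistent with the second `False` in the output above: the sentence is labeled Gerrit but contains 'boring' four times, a word that Truus uses far more often.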

_Question_: Can you think of a better explanation?

Next, we define the ``evaluate_cross_validation()`` function as we did in previous notebooks. Since we use this function in many notebooks, it would be a good idea to put it (together with other helpers) in a module. For now, we just repeat ourselves.

In [5]:
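# Note: sklearn.cross_validation was removed in scikit-learn 0.20. Newer versions
# provide cross_val_score and KFold in sklearn.model_selection, where KFold takes
# n_splits instead of the number of samples.
from sklearn.cross_validation import cross_val_score, KFold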
from numpy import mean
from scipy.stats import sem

def evaluate_cross_validation(clf, X, y, K):
    # Create a K-fold cross-validation iterator (shuffled, with a fixed seed)
    cv = KFold(len(y), K, shuffle=True, random_state=0)
    # By default the score used is the one returned by score method of the estimator (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    print(scores)
    print("Mean score: {0:.3f}".format(mean(scores)))
    print("Standard error of the mean: (+/-{0:.3f})".format(sem(scores)))



Note that we create our validation folds from the complete dataset, in contrast to the SVM notebook, where we created the folds from the training set ...
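
To make the idea of folds concrete, here is a minimal pure-Python sketch of a shuffled K-fold split over the indices of a dataset. It is not scikit-learn's implementation, and its shuffling will not reproduce the folds used above; the function name `k_fold_indices` is made up for this illustration.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 100 samples and 5 folds, as in this notebook.
folds = list(k_fold_indices(100, 5))
for train, test in folds:
    print(len(train), len(test))
```

Every sample ends up in exactly one test fold, so each of the five scores is computed on 20 sentences that the model did not see during that round of training.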

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
print(X[:2])

print()

print(vec.fit_transform(X).toarray()[:2])

print()

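# Note: newer scikit-learn versions provide get_feature_names_out() instead;
# get_feature_names() was removed in version 1.2.
print(vec.get_feature_names())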

['soccer soccer car difficult difficult soccer difficult soccer car difficult car', 'soccer boring car boring contract boring difficult soccer car soccer contract boring car energy soccer']

[[0 3 0 4 0 4 0]
 [4 3 2 1 1 4 0]]

['boring', 'car', 'contract', 'difficult', 'energy', 'soccer', 'sweet']


What's that vectorizer thing doing? Well, our Naive Bayes classifier (see below) can only deal with numeric data. So we have to map the texts to numeric data. That's in short what the vectorizer does: it creates a vector of features that we can give numeric values. The CountVectorizer is one of the simplest vectorizers available: it just creates a feature for each unique word in the corpus and then counts the occurrences of each word in a text.
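
The counting step can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the CountVectorizer itself, which does more (such as lowercasing and regex-based tokenization). The function name `count_vectorize` is made up; the three sentences are taken from outputs elsewhere in this notebook.

```python
from collections import Counter

def count_vectorize(texts):
    """Return a sorted vocabulary and one row of word counts per text."""
    vocabulary = sorted({word for text in texts for word in text.split()})
    rows = []
    for text in texts:
        counts = Counter(text.split())
        # A Counter returns 0 for words that do not occur in this text.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

texts = [
    'soccer soccer car difficult difficult soccer difficult soccer car difficult car',
    'soccer boring car boring contract boring difficult soccer car soccer contract '
    'boring car energy soccer',
    'car sweet contract sweet sweet car soccer sweet',
]

vocabulary, rows = count_vectorize(texts)
print(vocabulary)
for row in rows:
    print(row)
```

The first two rows agree with the CountVectorizer output shown above.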

There are also other vectorizers available in sklearn, such as the HashingVectorizer. The HashingVectorizer hashes each word to one of a fixed number of buckets, and the buckets form the feature space. Choosing fewer buckets than there are unique words gives a smaller feature space, at the price that different words may be hashed to the same bucket. Vectorizers also have parameters; see the sklearn docs for the ones you can use.
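
The hashing idea can be illustrated with a small pure-Python sketch. It uses `zlib.crc32` as a stand-in hash (scikit-learn uses a different hash function) and an arbitrary choice of 4 buckets, fewer than our 7 words, so that collisions are guaranteed. The function name `hash_vectorize` is made up for this illustration.

```python
import zlib

def hash_vectorize(text, n_buckets):
    """Count words per bucket, where the bucket is a hash of the word."""
    vector = [0] * n_buckets
    for word in text.split():
        bucket = zlib.crc32(word.encode("utf-8")) % n_buckets
        vector[bucket] += 1
    return vector

words = ['boring', 'car', 'contract', 'difficult', 'energy', 'soccer', 'sweet']

# With 7 words and only 4 buckets, at least two words must share a bucket.
buckets = {word: zlib.crc32(word.encode("utf-8")) % 4 for word in words}
print(buckets)

print(hash_vectorize("soccer soccer car difficult", 4))
```

No vocabulary has to be stored, but words that land in the same bucket can no longer be told apart by the classifier.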

In [14]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

clf = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])

evaluate_cross_validation(clf, X, y, 5)

[ 0.95  0.8   0.95  0.95  1.  ]
Mean score: 0.930
Standard error of the mean: (+/-0.034)


Wow, that's pretty accurate. Of course, that's mainly because we carefully crafted our dataset and took care to create authors (Gerrit and Truus) with really different writing styles. As an exercise, create authors with less distinct writing styles. You will probably see a lower accuracy.
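
To get a feeling for what the `MultinomialNB` step estimates, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier with additive (Laplace) smoothing. It is a simplified illustration, not scikit-learn's code: the function names `train_nb` and `predict_nb` and the four training sentences are made up, and `alpha=1.0` is chosen to mirror the default smoothing of `MultinomialNB`.

```python
from collections import Counter
from math import log

def train_nb(texts, labels, alpha=1.0):
    """Estimate a log prior and smoothed log word probabilities per author."""
    vocabulary = sorted({w for t in texts for w in t.split()})
    word_counts = {label: Counter() for label in set(labels)}
    for text, label in zip(texts, labels):
        word_counts[label].update(text.split())
    model = {}
    for label, counts in word_counts.items():
        # Smoothing adds alpha to every word count, so no probability is zero.
        total = sum(counts.values()) + alpha * len(vocabulary)
        log_prior = log(labels.count(label) / len(labels))
        log_probs = {w: log((counts[w] + alpha) / total) for w in vocabulary}
        model[label] = (log_prior, log_probs)
    return model

def predict_nb(model, text):
    """Return the author with the highest log posterior score."""
    scores = {}
    for label, (log_prior, log_probs) in model.items():
        # Words outside the training vocabulary are ignored.
        scores[label] = log_prior + sum(
            log_probs[w] for w in text.split() if w in log_probs)
    return max(scores, key=scores.get)

# Hypothetical toy corpus, invented for this sketch only.
toy_texts = ["soccer car soccer difficult", "car soccer contract sweet",
             "boring energy boring energy", "energy boring car boring"]
toy_labels = ["Gerrit", "Gerrit", "Truus", "Truus"]

model = train_nb(toy_texts, toy_labels)
print(predict_nb(model, "boring energy soccer"))
print(predict_nb(model, "soccer car"))
```

Compared with the hand-rolled `determine_author()` above, this sketch adds a prior per author and smoothing, so a word that an author never used in training does not force the score to zero.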

In [15]:
# Note: in scikit-learn 0.18 and later, train_test_split lives in sklearn.model_selection.
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
print(X_train[:2])
print(y_train[:2])

['car sweet contract sweet sweet car soccer sweet', 'soccer difficult sweet contract energy soccer difficult soccer car car boring contract sweet']
['Gerrit', 'Gerrit']


In [16]:
from sklearn import metrics

def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    
    clf.fit(X_train, y_train)
    
    print("Accuracy on training set:")
    print(clf.score(X_train, y_train))
    print("Accuracy on testing set:")
    print(clf.score(X_test, y_test))
    
    y_pred = clf.predict(X_test)
    
    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))
    print("Confusion Matrix:")
    print(metrics.confusion_matrix(y_test, y_pred))

Now, let's fit our model on the training set and test it against the test set. Explain why our test set contains 25 measurements.

In [17]:
train_and_evaluate(clf, X_train, X_test, y_train, y_test)

Accuracy on training set:
0.96
Accuracy on testing set:
0.92
Classification Report:
             precision    recall  f1-score   support

     Gerrit       0.80      1.00      0.89         8
      Truus       1.00      0.88      0.94        17

avg / total       0.94      0.92      0.92        25

Confusion Matrix:
[[ 8  0]
 [ 2 15]]
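
The numbers in the classification report follow directly from the confusion matrix, in which rows are the true authors and columns the predicted authors. The sketch below recomputes them by hand; the helper name `per_class_metrics` is made up for this illustration.

```python
labels = ['Gerrit', 'Truus']
# Rows are true labels, columns are predicted labels (values from the output above).
confusion = [[8, 0],
             [2, 15]]

def per_class_metrics(confusion, index):
    """Return (precision, recall, f1) for the class at the given index."""
    true_positive = confusion[index][index]
    predicted = sum(row[index] for row in confusion)  # column sum
    actual = sum(confusion[index])                    # row sum
    precision = true_positive / predicted
    recall = true_positive / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for i, label in enumerate(labels):
    p, r, f = per_class_metrics(confusion, i)
    print("{:>6}  precision={:.2f}  recall={:.2f}  f1={:.2f}".format(label, p, r, f))

# Accuracy is the share of sentences on the diagonal.
accuracy = sum(confusion[i][i] for i in range(len(labels))) / sum(map(sum, confusion))
print("accuracy={:.2f}".format(accuracy))
```

The two sentences by Truus that were predicted as Gerrit lower the precision for Gerrit (8 out of 10) and the recall for Truus (15 out of 17).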


We already tested the sentence below in our Excel sheet, which identified Truus as the author. Sure enough, our classifier also predicts Truus as the original author.

In [18]:
clf.predict(["contract energy contract sweet contract soccer contract energy difficult energy"])

array(['Truus'], 
      dtype='<U6')

## Assignment

Now it's your turn to create a sentence generator for a small language (say, consisting of 10 words). You may get inspiration from the Excel generator, but of course you will use Python to create the generator. Generate a dataset (corpus) of 100 sentences attributed evenly to 2 authors. Your generator should take each author's word preferences into account. Show that the more distinct the preferences are, the more accurate your classifier is, and vice versa: the less distinct the word preferences, the less accurate your classifier will be.

This is quite a challenging assignment, but since we are almost halfway through the course, we think you should be able to succeed.