## Whose text is this?

This notebook contains a simple classifier that predicts the author of a text (a string of words) written in a restricted language. An accompanying Excel file generates instances of texts, each labeled with an author. The generator produces its instances from predefined word probabilities: the probability that a particular author uses a particular word is taken into account. As such, the generator is both a corpus generator and a means to check whether the calculated classifier makes sense.

First, import the text corpus and create a sentence list and its label list.

In [22]:
import csv

# Read the semicolon-separated corpus; the first field of each row is the author.
with open("ML06_Corpus1.csv") as f:
    corpus = [row for row in csv.reader(f, delimiter=';')]
X = [" ".join(lst[1:]).strip(" ") for lst in corpus]
y = [lst[0] for lst in corpus]
print(corpus[:2])
print(X[:2])
print(y[:2])

[['Gerrit', 'soccer', 'soccer', 'car', 'difficult', 'difficult', 'soccer', 'difficult', 'soccer', 'car', 'difficult', 'car', '', '', '', ''], ['Gerrit', 'soccer', 'boring', 'car', 'boring', 'contract', 'boring', 'difficult', 'soccer', 'car', 'soccer', 'contract', 'boring', 'car', 'energy', 'soccer']]
['soccer soccer car difficult difficult soccer difficult soccer car difficult car', 'soccer boring car boring contract boring difficult soccer car soccer contract boring car energy soccer']
['Gerrit', 'Gerrit']


Suppose you don't know anything about machine learning. What would be an approach to determine the author of a sentence, given this set of labeled sentences?

In [23]:
from collections import Counter

words_G = [word for row in corpus[:50] for word in row if word != '']
words_T = [word for row in corpus[50:] for word in row if word != '']
counts_G = Counter(words_G)
counts_T = Counter(words_T)
total_G = sum(counts_G.values())
total_T = sum(counts_G.values())
perc_G = {key: value/total_G for  (key, value) in counts_G.items()}
perc_T = {key: value/total_T for  (key, value) in counts_T.items()}

print(perc_G)
print(perc_T)

{'boring': 0.06923076923076923, 'soccer': 0.2153846153846154, 'Gerrit': 0.09615384615384616, 'sweet': 0.15576923076923077, 'contract': 0.13076923076923078, 'car': 0.19615384615384615, 'difficult': 0.1076923076923077, 'energy': 0.028846153846153848}
{'boring': 0.3038461538461538, 'car': 0.12115384615384615, 'soccer': 0.07692307692307693, 'sweet': 0.05576923076923077, 'energy': 0.2673076923076923, 'Truus': 0.09615384615384616, 'difficult': 0.04423076923076923, 'contract': 0.08461538461538462}


In [24]:
def determine_author(sentence, perc1, perc2):
    prod1 = prod2 = 1.0
    for word in sentence:
        prod1 *= perc1[word]
        prod2 *= perc2[word]
        # print(prod1, prod2)
    return prod1 > prod2

# determine_author(X[1].split(" "), perc_G, perc_T)

for sentence in X[:50]:
    print(determine_author(sentence.split(" "), perc_G, perc_T), end=';')

print("\n=====")

for sentence in X[50:]:
    print(determine_author(sentence.split(" "), perc_G, perc_T), end=';')

True;False;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;False;True;True;True;True;True;True;False;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;
=====
False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;True;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;

Well, that's interesting and perhaps a bit embarrassing at the same time. I would have expected the first set of 50 to be all True (author = 'Gerrit') and the subsequent 50 to be all False. But in both sets some unexpected results slip through. The only explanation I can think of right now is that the generator has produced a string attributed to Gerrit that is more likely to have been written by Truus, or the other way around. This could happen because of the randomization in the generation.

_Question_: can you think of a better explanation?
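
One check that needs nothing beyond the printed tables is whether each of them is a proper probability distribution, that is, whether its values sum to 1. The sketch below copies the values from the output above; nothing else is assumed.

```python
# Word fractions copied from the printed output of the cell above.
perc_G = {'boring': 0.06923076923076923, 'soccer': 0.2153846153846154,
          'Gerrit': 0.09615384615384616, 'sweet': 0.15576923076923077,
          'contract': 0.13076923076923078, 'car': 0.19615384615384615,
          'difficult': 0.1076923076923077, 'energy': 0.028846153846153848}
perc_T = {'boring': 0.3038461538461538, 'car': 0.12115384615384615,
          'soccer': 0.07692307692307693, 'sweet': 0.05576923076923077,
          'energy': 0.2673076923076923, 'Truus': 0.09615384615384616,
          'difficult': 0.04423076923076923, 'contract': 0.08461538461538462}

# A table of word probabilities should add up to 1.
sum_G = sum(perc_G.values())
sum_T = sum(perc_T.values())
print("Gerrit: {0:.3f}".format(sum_G))
print("Truus:  {0:.3f}".format(sum_T))
```

The Truus table adds up to about 1.05 instead of 1, which is what you would see if `total_T` were computed from `counts_G` rather than from `counts_T`. Both tables also contain the author's own name as a "word", because the label column is counted along with the words. Either effect may contribute to the unexpected results, next to the randomization.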

Next, we define the evaluate_cross_validation() function as we did in previous notebooks. Since we use this function in many notebooks, it would be a good idea to put it (together with other goodies) in a module. For now, we just repeat ourselves.

In [25]:
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score, KFold
from numpy import mean
from scipy.stats import sem

def evaluate_cross_validation(clf, X, y, K):
    # create a K-fold cross-validation iterator
    cv = KFold(n_splits=K, shuffle=True, random_state=0)
    # by default the score used is the one returned by score method of the estimator (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    print(scores)
    print("Mean score: {0:.3f}".format(mean(scores)))
    print("Standard error of the mean: (+/-{0:.3f})".format(sem(scores)))

Note that we create our validation folds from the complete dataset, in contrast to the SVM notebook, where we created the folds from the training set...

The classifier we build below contains a vectorizer. What is it doing there? Well, our Naive Bayes classifier can only deal with numeric data, so we have to map the texts to numbers. That is, in short, what the vectorizer does: it creates a vector of features to which we can assign numeric values. The CountVectorizer is one of the simplest vectorizers available: it creates a feature for each unique word in the corpus and then counts the occurrences of each word in a text.
Other vectorizers are available in sklearn, such as the HashingVectorizer. It hashes each word to one of a fixed number of buckets, and the buckets form the feature space. Because different words may be hashed to the same bucket, the feature space can be smaller than the vocabulary. Vectorizers also have parameters; see the sklearn docs for the ones you can use.
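
To make the difference concrete, here is a minimal pure-Python sketch of both ideas, applied to the first two sentences of the corpus. It is not how sklearn implements them: `count_vector`, `hashed_vector` and `N_BUCKETS` are names made up for this illustration, and CRC32 merely stands in for the hash function that the HashingVectorizer uses internally.

```python
import zlib
from collections import Counter

sentences = [
    "soccer soccer car difficult difficult soccer difficult soccer car difficult car",
    "soccer boring car boring contract boring difficult soccer car soccer contract boring car energy soccer",
]

# Count vectorization: one column per unique word, in alphabetical order.
vocabulary = sorted({word for s in sentences for word in s.split()})

def count_vector(sentence):
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]

# Hashing trick: the column index is a hash of the word modulo a fixed
# number of buckets, so no vocabulary has to be stored.
N_BUCKETS = 4  # illustrative value, deliberately smaller than the vocabulary

def hashed_vector(sentence, n_buckets=N_BUCKETS):
    vector = [0] * n_buckets
    for word in sentence.split():
        bucket = zlib.crc32(word.encode("utf-8")) % n_buckets
        vector[bucket] += 1
    return vector

print(vocabulary)
for s in sentences:
    print(count_vector(s), hashed_vector(s))
```

With fewer buckets than words, at least two words must share a bucket, so the hashed vector is shorter but can no longer be mapped back to individual words.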

In [26]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
print(X[:2])
print(vec.fit_transform(X).toarray()[:2])
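# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() returns an array
print(vec.get_feature_names_out().tolist())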

['soccer soccer car difficult difficult soccer difficult soccer car difficult car', 'soccer boring car boring contract boring difficult soccer car soccer contract boring car energy soccer']
[[0 3 0 4 0 4 0]
 [4 3 2 1 1 4 0]]
['boring', 'car', 'contract', 'difficult', 'energy', 'soccer', 'sweet']


Now, what do you think the output from the vectorizer means? Hint: the sum of each row is equal to the number of words in the sentence with the same index.

In [27]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer

clf = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])

evaluate_cross_validation(clf, X, y, 5)

[ 0.95  0.8   0.95  0.95  1.  ]
Mean score: 0.930
Standard error of the mean: (+/-0.034)


Wow, that's pretty accurate. Of course, that's mainly because we've carefully crafted our dataset and took care to create authors (Gerrit and Truus) with really different writing styles. As an exercise, you should create authors with less distinct writing styles. You would probably see a lower accuracy.
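
For reference, the mean score and the standard error of the mean reported above can be reproduced by hand from the five fold scores. The sketch below uses only the standard library and the scores printed in the output; it assumes the standard error is computed with the sample standard deviation (n - 1 in the denominator), which is the default of `scipy.stats.sem`.

```python
from math import sqrt
from statistics import mean, stdev

scores = [0.95, 0.8, 0.95, 0.95, 1.0]  # fold accuracies printed above

mean_score = mean(scores)
# Standard error of the mean: sample standard deviation divided by the
# square root of the number of folds.
sem_score = stdev(scores) / sqrt(len(scores))

print("Mean score: {0:.3f}".format(mean_score))
print("Standard error of the mean: (+/-{0:.3f})".format(sem_score))
```

With only five folds of 20 sentences each, a single misclassified sentence changes a fold score by 0.05, so the standard error is a rough indication at best.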

In [28]:
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
print(X_train[:2])
print(y_train[:2])

['car sweet contract sweet sweet car soccer sweet', 'soccer difficult sweet contract energy soccer difficult soccer car car boring contract sweet']
['Gerrit', 'Gerrit']


In [29]:
from sklearn import metrics

def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    
    clf.fit(X_train, y_train)
    
    print("Accuracy on training set:")
    print(clf.score(X_train, y_train))
    print("Accuracy on testing set:")
    print(clf.score(X_test, y_test))
    
    y_pred = clf.predict(X_test)
    
    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))
    print("Confusion Matrix:")
    print(metrics.confusion_matrix(y_test, y_pred))

Now, let's fit our model on the training set and test it against the test set. Explain why our test set contains 25 measurements.

In [30]:
train_and_evaluate(clf, X_train, X_test, y_train, y_test)

Accuracy on training set:
0.96
Accuracy on testing set:
0.92
Classification Report:
             precision    recall  f1-score   support

     Gerrit       0.80      1.00      0.89         8
      Truus       1.00      0.88      0.94        17

avg / total       0.94      0.92      0.92        25

Confusion Matrix:
[[ 8  0]
 [ 2 15]]
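
The numbers in the classification report follow directly from the confusion matrix. As a small worked example, the sketch below recomputes precision, recall and F1 per author from the matrix printed above, reading rows as the true authors and columns as the predicted authors (the sklearn convention). The `results` dictionary is just a helper name for this illustration.

```python
labels = ["Gerrit", "Truus"]
# Rows are the true authors, columns the predicted authors.
confusion = [[8, 0],
             [2, 15]]

results = {}
for i, label in enumerate(labels):
    true_positive = confusion[i][i]
    predicted = sum(row[i] for row in confusion)  # column total
    actual = sum(confusion[i])                    # row total (support)
    precision = true_positive / predicted
    recall = true_positive / actual
    f1 = 2 * precision * recall / (precision + recall)
    results[label] = (precision, recall, f1, actual)
    print("{0:>8}  {1:.2f}  {2:.2f}  {3:.2f}  {4}".format(label, precision, recall, f1, actual))

# Accuracy is the diagonal divided by the total number of test sentences.
total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(labels))) / total
print("accuracy: {0:.2f}".format(accuracy))
```

Two sentences by Truus were attributed to Gerrit, which lowers the precision for Gerrit and the recall for Truus; no sentence by Gerrit was attributed to Truus.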


We already tested the sentence below in our Excel sheet, which identified Truus as the author. Sure enough, our classifier also predicts Truus as the original author.

In [31]:
clf.predict(["contract energy contract sweet contract soccer contract energy difficult energy"])

array(['Truus'], 
      dtype='<U6')

__Assignment__: Now it's your turn to create a sentence generator for a small language (say, consisting of 10 words). You may get inspiration from the Excel generator, but of course you will use Python to create the generator. Generate a dataset (or: corpus) of 100 sentences attributed evenly to 2 authors. Your generator should take the word preferences of each author into account. Show that the more distinct the preferences are, the more accurate your classifier is. And vice versa: the less distinct the word preferences are, the less accurate your classifier will be.

This is quite a challenging assignment, but being almost halfway through the course, we think you should be able to succeed.

In [32]:
import itertools
import random
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from collections import Counter

# Make a list for every author, with the words he has used and how many times that word is used.
authors = [
    ('Henk-Jan', [('yes', 30), ('like', 5), ('about', 40), ('again', 20), ('but', 10), ('because', 20), ('nice', 10)]),
    ('Geert', [('yes', 10), ('like', 20), ('about', 20), ('again', 5), ('but', 30), ('because', 2), ('nice', 50)])
]

parsed_authors = []

# Now parse the above list: expand each (word, count) pair into a pool that contains the word `count` times.
for a in authors:
    new = []
    for (word, count) in a[1]:
        x = [word] * count
        new.append(x)
    parsed_authors.append((a[0], list(itertools.chain(*new))))

# Show a list of all the words each author has said.
parsed_authors

[('Henk-Jan',
  ['yes', ..., 'like', ..., 'about', ..., 'again', ...
(output truncated: 30 × 'yes', 5 × 'like', 40 × 'about', followed by the 'again' entries)

In [33]:
sentence_authors = []
author_per_sentence = []

for a in parsed_authors:
    sentenceList = []
    for i in range(50):
        numberWords = random.randint(5, 15)
        sentence = [random.choice(a[1]) for i in range(numberWords)]
        sentenceList.append(sentence)
        author_per_sentence.append(a[0])
    sentence_authors.append((a[0], sentenceList))

# Show 50 random sentences per author
sentence_authors

[('Henk-Jan',
  [['but',
    'nice',
    'but',
    'yes',
    'but',
    'about',
    'nice',
    'but',
    'yes',
    'because',
    'again'],
   ['about', 'about', 'yes', 'nice', 'but', 'about', 'but', 'again', 'again'],
   ['because',
    'about',
    'but',
    'yes',
    'nice',
    'like',
    'again',
    'about',
    'nice',
    'because',
    'but',
    'about',
    'again',
    'yes'],
   ['about', 'again', 'about', 'yes', 'about', 'about'],
   ['because',
    'because',
    'about',
    'yes',
    'again',
    'about',
    'about',
    'nice',
    'because',
    'about',
    'like'],
   ['because',
    'about',
    'because',
    'about',
    'about',
    'again',
    'again',
    'because',
    'nice',
    'about'],
   ['again', 'because', 'yes', 'about', 'yes', 'about', 'but'],
   ['about',
    'yes',
    'yes',
    'because',
    'because',
    'again',
    'nice',
    'again',
    'about',
    'because',
    'because',
    'but',
    'again',
    'because'],
   ['about
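
As an aside, the same kind of corpus can be generated without first expanding every word into a pool, by letting `random.choices` draw words according to their weights. This is only a sketch of an alternative: `generate_corpus` and its parameters are hypothetical names, and the `seed` argument is added here just to make runs repeatable.

```python
import random

# Same word weights as in the `authors` list above.
authors = [
    ('Henk-Jan', [('yes', 30), ('like', 5), ('about', 40), ('again', 20),
                  ('but', 10), ('because', 20), ('nice', 10)]),
    ('Geert', [('yes', 10), ('like', 20), ('about', 20), ('again', 5),
               ('but', 30), ('because', 2), ('nice', 50)]),
]

def generate_corpus(authors, n_sentences=50, min_words=5, max_words=15, seed=None):
    """Return (sentences, labels); each sentence is a list of words."""
    rng = random.Random(seed)
    sentences, labels = [], []
    for name, weighted_words in authors:
        words = [word for word, _ in weighted_words]
        weights = [weight for _, weight in weighted_words]
        for _ in range(n_sentences):
            length = rng.randint(min_words, max_words)
            # Draw `length` words; each word's chance is proportional to its weight.
            sentences.append(rng.choices(words, weights=weights, k=length))
            labels.append(name)
    return sentences, labels

sentences, labels = generate_corpus(authors, seed=42)
print(len(sentences), labels[0], labels[-1])
```

Making the two weight lists more similar is then an easy way to experiment with less distinct writing styles.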

In [34]:
# Insert the author's name in each sentence
# TODO: I have no idea why I would do this??
# for a in sentence_authors:
#     for ls in a[1]:
#         ls.insert(0, a[0])
        
# And combine them.
flatten_sentences = [x[1] for x in sentence_authors]
flatten_sentences = list(itertools.chain(*flatten_sentences))

flatten_sentences

[['but',
  'nice',
  'but',
  'yes',
  'but',
  'about',
  'nice',
  'but',
  'yes',
  'because',
  'again'],
 ['about', 'about', 'yes', 'nice', 'but', 'about', 'but', 'again', 'again'],
 ['because',
  'about',
  'but',
  'yes',
  'nice',
  'like',
  'again',
  'about',
  'nice',
  'because',
  'but',
  'about',
  'again',
  'yes'],
 ['about', 'again', 'about', 'yes', 'about', 'about'],
 ['because',
  'because',
  'about',
  'yes',
  'again',
  'about',
  'about',
  'nice',
  'because',
  'about',
  'like'],
 ['because',
  'about',
  'because',
  'about',
  'about',
  'again',
  'again',
  'because',
  'nice',
  'about'],
 ['again', 'because', 'yes', 'about', 'yes', 'about', 'but'],
 ['about',
  'yes',
  'yes',
  'because',
  'because',
  'again',
  'nice',
  'again',
  'about',
  'because',
  'because',
  'but',
  'again',
  'because'],
 ['about',
  'yes',
  'about',
  'yes',
  'yes',
  'about',
  'yes',
  'yes',
  'but',
  'yes',
  'about'],
 ['yes', 'like', 'nice', 'yes', 'again', '

In [35]:
# Compute each author's word usage as a fraction of that author's total word count.
percents = []

for author, sentences in sentence_authors:
    words = list(itertools.chain(*sentences))
    count = Counter(words)
    percent = {key: value/len(words) for (key, value) in count.items()}
    print('\n' + author + ':\n')
    print(percent)
    percents.append((author, percent))


Henk-Jan:

{'nice': 0.061052631578947365, 'but': 0.08421052631578947, 'yes': 0.20842105263157895, 'again': 0.1431578947368421, 'because': 0.14947368421052631, 'about': 0.30736842105263157, 'like': 0.04631578947368421}

Geert:

{'nice': 0.33399602385685884, 'but': 0.2504970178926441, 'yes': 0.061630218687872766, 'again': 0.03180914512922465, 'because': 0.02186878727634195, 'about': 0.15109343936381708, 'like': 0.14910536779324055}


In [36]:
def determine_author(sentence, perc1, perc2):
    prod1 = prod2 = 1.0
    for word in sentence:
        prod1 *= perc1[word]
        prod2 *= perc2[word]
    return prod1 > prod2

# Test that the first 50 sentences from the first author are correctly identified
for sentence in flatten_sentences[:50]:
    determine = determine_author(sentence, percents[0][1], percents[1][1])
    print(determine, end=';')
    
print('\n')
    
# Test that the other 50 sentences are not identified as from the first author
for sentence in flatten_sentences[50:]:
    determine = determine_author(sentence, percents[0][1], percents[1][1])
    print(determine, end=';')

False;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;True;False;True;True;True;True;

False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;False;True;False;False;False;False;True;False;False;False;False;False;False;False;False;False;True;False;False;False;False;False;False;False;False;
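
Multiplying raw fractions works for these short sentences, but it has two weak spots: the product shrinks towards zero for long texts, and a word that an author never used raises a KeyError. A common remedy, sketched below, is to add up log-probabilities and to apply add-one (Laplace) smoothing. The function names and the tiny corpus are made up for this illustration; this is not the code used elsewhere in the notebook.

```python
from collections import Counter
from math import log

def word_log_probs(sentences, vocabulary, alpha=1.0):
    """Smoothed log-probability of each vocabulary word for one author."""
    counts = Counter(word for sentence in sentences for word in sentence)
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {word: log((counts[word] + alpha) / total) for word in vocabulary}

def determine_author_log(sentence, logp1, logp2):
    """Return True if the sentence is more likely under the first table."""
    # Words outside the vocabulary are skipped instead of raising KeyError.
    score1 = sum(logp1[word] for word in sentence if word in logp1)
    score2 = sum(logp2[word] for word in sentence if word in logp2)
    return score1 > score2

# Tiny made-up corpus, only to show the calls.
first = [['about', 'yes', 'about'], ['yes', 'again', 'about']]
second = [['nice', 'but', 'nice'], ['but', 'like', 'nice']]
vocabulary = {word for s in first + second for word in s}

logp_first = word_log_probs(first, vocabulary)
logp_second = word_log_probs(second, vocabulary)
print(determine_author_log(['about', 'yes', 'nice'], logp_first, logp_second))
print(determine_author_log(['nice', 'but', 'unknown'], logp_first, logp_second))
```

Because both tables are built over the same vocabulary, skipping an unknown word affects both scores equally and cannot tip the decision.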

In [37]:
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score, KFold
from numpy import mean
from scipy.stats import sem

def evaluate_cross_validation(clf, X, y, K):
    # create a K-fold cross-validation iterator
    cv = KFold(n_splits=K, shuffle=True, random_state=0)
    # by default the score used is the one returned by score method of the estimator (accuracy)
    scores = cross_val_score(clf, X, y, cv=cv)
    print(scores)
    print("Mean score: {0:.3f}".format(mean(scores)))
    print("Standard error of the mean: (+/-{0:.3f})".format(sem(scores)))
    
    
# Convert to real sentences
real_sentences = [" ".join(lst[:]).strip(" ") for lst in flatten_sentences]
print(real_sentences[:2])

vec = CountVectorizer()
vec.fit_transform(real_sentences).toarray()[:5]

['but nice but yes but about nice but yes because again', 'about about yes nice but about but again again']


array([[1, 1, 1, 4, 0, 2, 2],
       [3, 2, 0, 2, 0, 1, 1],
       [3, 2, 2, 2, 1, 2, 2],
       [4, 1, 0, 0, 0, 0, 1],
       [4, 1, 3, 0, 1, 1, 1]], dtype=int64)

We now have the corpus in a form the classifier can work with. Let's build the classifier and see what the accuracy is.

In [38]:
clf = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
# just for clarity
X = real_sentences
y = author_per_sentence

evaluate_cross_validation(clf, X, y, 5)

[ 1.    0.9   0.9   1.    0.95]
Mean score: 0.950
Standard error of the mean: (+/-0.022)


In [39]:
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
print(X_train[:2])
print(y_train[:2])

['because about yes about again yes again', 'again about about about yes']
['Henk-Jan', 'Henk-Jan']


In [40]:
from sklearn import metrics

def train_and_evaluate(clf, X_train, X_test, y_train, y_test):
    
    clf.fit(X_train, y_train)
    
    print("Accuracy on training set:")
    print(clf.score(X_train, y_train))
    print("Accuracy on testing set:")
    print(clf.score(X_test, y_test))
    
    y_pred = clf.predict(X_test)
    
    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))
    print("Confusion Matrix:")
    print(metrics.confusion_matrix(y_test, y_pred))
    
train_and_evaluate(clf, X_train, X_test, y_train, y_test)

Accuracy on training set:
0.96
Accuracy on testing set:
0.96
Classification Report:
             precision    recall  f1-score   support

      Geert       1.00      0.94      0.97        17
   Henk-Jan       0.89      1.00      0.94         8

avg / total       0.96      0.96      0.96        25

Confusion Matrix:
[[16  1]
 [ 0  8]]


The accuracy seems okay. Now we want to test it on some real sentences.

In [41]:
clf.predict(["yes, because I like about five to six animals."])

array(['Henk-Jan'], 
      dtype='<U8')

In [42]:
clf.predict(["No, I like two animals but that doesn't matter because they are nice."])

array(['Geert'], 
      dtype='<U8')

In [43]:
clf.predict(["The nice thing of this is, like three out of four people don't care what you say."])

array(['Geert'], 
      dtype='<U8')
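
When reading these predictions, it helps to know which words of a real sentence the classifier actually sees. The sketch below assumes the default settings of CountVectorizer (lowercasing, and tokens of two or more word characters) and that words outside the training vocabulary are ignored; `known_tokens` is a helper name made up for this illustration.

```python
import re

# The seven words used by the generator above.
vocabulary = {'yes', 'like', 'about', 'again', 'but', 'because', 'nice'}

# Assumed to mirror CountVectorizer's default token pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def known_tokens(sentence):
    """Return the tokens of a sentence that occur in the vocabulary."""
    tokens = TOKEN_PATTERN.findall(sentence.lower())
    return [token for token in tokens if token in vocabulary]

real_examples = [
    "yes, because I like about five to six animals.",
    "No, I like two animals but that doesn't matter because they are nice.",
    "The nice thing of this is, like three out of four people don't care what you say.",
]

for sentence in real_examples:
    print(known_tokens(sentence))
```

Under these assumptions, only a handful of words per sentence contribute to each prediction, so the results say more about those few words than about the sentences as a whole.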