In [1]:
from sklearn import datasets

iris = datasets.load_iris()
digits = datasets.load_digits()

print(digits.data)

[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ..., 
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]


In [4]:
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-1], digits.target[:-1])
clf.predict(digits.data[-1:])

array([8])

In [15]:
dl = len(digits.images) // 2  # integer division, so dl can be used as a slice index
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:dl], digits.target[:dl])
print(clf.predict(digits.data[-5:]))

[9 0 8 9 8]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


In [26]:
clf = svm.SVC()
clf.fit(iris.data, iris.target_names[iris.target])
list(clf.predict(iris.data[:10]))

['setosa',
 'setosa',
 'setosa',
 'setosa',
 'setosa',
 'setosa',
 'setosa',
 'setosa',
 'setosa',
 'setosa']

## Working With Text Data

My next scikit-learn example works with text data! In this next bit I load about two thousand newsgroup posts from four categories, then use CountVectorizer to turn each post into a row of word counts, with one column per distinct word.

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
count_vect = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse (documents x words) count matrix
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35788)
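
To convince myself of what the count matrix holds, here is a rough pure-Python sketch of the same idea on three made-up phrases. The `tokenize` helper and the tiny `docs` list are my own assumptions for illustration; this is not how scikit-learn implements CountVectorizer, only the general shape of the result (rows are documents, columns are vocabulary words).

```python
import re
from collections import Counter

# Three made-up documents (hypothetical, not from the newsgroup data).
docs = ["God is love", "OpenGL on the GPU is fast", "love is love"]

def tokenize(text):
    # Lowercase and keep words of two or more characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# The vocabulary is every distinct word, sorted so each word gets a fixed column.
vocabulary = sorted({tok for d in docs for tok in tokenize(d)})

# One row per document, one column per vocabulary word.
matrix = []
for d in docs:
    word_counts = Counter(tokenize(d))
    matrix.append([word_counts[word] for word in vocabulary])

print(vocabulary)
print((len(matrix), len(vocabulary)))  # same kind of shape as X_train_counts.shape
```

With the real data the shape is (2257, 35788): 2257 posts and 35788 distinct words.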


We convert occurrences to frequencies using a tf-idf transformer.

tf-idf stands for “Term Frequency times Inverse Document Frequency”.

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
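
Here is a small sketch of the arithmetic I believe the transformer does with its default settings: a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, multiplied by the raw counts, then each row scaled to unit length. The `counts` table is made up, and this is my hand-rolled version for understanding, not scikit-learn's code.

```python
import math

# Made-up word counts: one row per document, one column per word.
counts = [
    [2, 1, 0],
    [0, 1, 3],
    [1, 1, 0],
]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: how many documents contain each word at least once.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency: rarer words get a larger weight.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    # Scale each row to unit Euclidean length (rows of all zeros stay zero).
    tfidf.append([v / norm if norm else 0.0 for v in weighted])

print(idf)
print(tfidf)
```

The middle word appears in every document, so its idf is exactly 1, the lowest possible weight. The last word appears in only one document, so it gets the highest weight.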

Time for actually making the classifier! Well, technically it is already coded and everything, because sklearn has it.

We are using a multinomial Naive Bayes classifier, because the example tells us to! It is a common starting point for word-count features, but I do not yet know how it compares to other classifiers. (I should probably figure that out.)

In [None]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

First, I have to be honest: all of the code for this example is copied from the scikit-learn tutorial, but I think they want us to do that, so I should be fine.

Next, we create new "documents", which are really just short phrases, and run them through the same steps as before: words to word counts, then counts to frequencies. Note that this time we call transform rather than fit_transform, so the vocabulary and weights learned from the training data are reused. We then use our newly trained classifier to predict which category each phrase belongs to!

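<span style="font-size: 50%;">I was not sure at first what the print statement does: zip pairs each phrase with its predicted category number, and twenty_train.target_names turns that number into a category name. %r prints the phrase with quotes around it and %s prints the name.</span>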

In [None]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
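
To see what the print loop is doing without needing the classifier, here is a standalone sketch. The `predicted` values and the `target_names` list are stand-ins I wrote by hand (assuming the category names are stored in alphabetical order); in the real cell they come from `clf.predict` and `twenty_train.target_names`.

```python
docs_new = ['God is love', 'OpenGL on the GPU is fast']

# Hand-written stand-ins, assumed for illustration only.
target_names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
predicted = [3, 1]

lines = []
# zip pairs the first phrase with the first prediction, the second with the second, and so on.
for doc, category in zip(docs_new, predicted):
    # %r shows the phrase with its quotes; %s shows the category name looked up by index.
    lines.append('%r => %s' % (doc, target_names[category]))

print('\n'.join(lines))
# 'God is love' => soc.religion.christian
# 'OpenGL on the GPU is fast' => comp.graphics
```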

Honestly, the only reason I am writing these comments is so that I understand the examples myself and can practice my Markdown and writing style.

Anyway, we can streamline the whole text => vectorizer => transformer => classifier process by using a pipeline, as shown below, and then train it with a single call to fit.

In [None]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
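
The nice part of a pipeline is that prediction also becomes one call on raw text, with no separate transform steps. Here is a minimal sketch on a tiny made-up corpus, so it runs without downloading the newsgroup data. The names `toy_docs`, `toy_labels`, `toy_names`, and `toy_clf` are my own and only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up training set (an assumption, not the newsgroup data).
toy_docs = [
    'god love faith church',
    'faith church prayer god',
    'opengl gpu render graphics',
    'gpu graphics shader opengl',
]
toy_labels = [0, 0, 1, 1]
toy_names = ['religion', 'graphics']

toy_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
toy_clf.fit(toy_docs, toy_labels)

# Raw strings go straight in; the pipeline counts, reweights, and classifies.
new_docs = ['God is love', 'OpenGL on the GPU is fast']
toy_predicted = toy_clf.predict(new_docs)
for doc, category in zip(new_docs, toy_predicted):
    print('%r => %s' % (doc, toy_names[category]))
```

On the real data the equivalent call would be `text_clf.predict(docs_new)`.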

The tutorial keeps adding more steps and options from here, and I understand the basics well enough, so I will stop adding new parts of this example to the Jupyter notebook.