# Multi-Label Classification Notes

From https://www.youtube.com/watch?v=nNDqbUhtIRg

In [6]:
import sklearn
import pandas as pd
from sklearn import metrics

In [5]:
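# sklearn metrics package to measure efficacy of the model
# average='micro' pools the counts across all labels; a missing prediction
# counts as a false negative, so predicting beats not predicting at all
from sklearn.metrics import f1_score

f1 = f1_score(testY, y_pred, average='micro')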
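To make the effect of `average='micro'` concrete, here is a minimal sketch on a made-up 3-sample, 3-label indicator matrix. The arrays `y_true` and `y_hat` are illustrative assumptions, not data from the talk. Micro-averaging pools true positives, false positives and false negatives over all labels before computing F1, while macro-averaging computes F1 per label and then averages.

```python
import numpy as np
from sklearn.metrics import f1_score

# Rows are samples, columns are labels (1 = tag applies). Illustrative data only.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_hat = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [1, 0, 0]])

# Pooled over all labels: 3 true positives, 0 false positives, 2 false negatives
micro = f1_score(y_true, y_hat, average='micro')

# Per-label F1 averaged; the third label is never predicted, so its F1 is 0
macro = f1_score(y_true, y_hat, average='macro', zero_division=0)

print(round(micro, 3), round(macro, 3))
```

In this sketch the never-predicted label pulls the macro score well below the micro score, which is one reason the choice of averaging matters when there are many rare tags.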



# Approach:
1.  multi-class, multi-label problem
2.  OneVsRestClassifier + SGD classifier
3.  With 42k unique classes (tons of classes)
4.  Train 42k classifiers to distinguish amongst known tags

### SGD classifier:
1. Quick and dirty
2. Cheap computationally
3. Good for more than 100k examples

# Pipeline:
1.  Ingest Training Data
2.  Build Feature Vector
        2.1  3x title + body + tag parts
        2.2  tfidf vectorizer from CBOW
        2.3  200k x 100k sparse float for training
        2.4  Fit with OneVsRestClassifier
        2.5  Pickle Classifier
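
The pipeline steps above can be sketched roughly as follows. This is a toy version under stated assumptions: the `posts` list is invented, "3x title" is read as repeating the title three times to upweight it, the "tag parts" feature is left out, and `build_text` is a hypothetical helper rather than anything shown in the talk.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy posts; the real training data is not part of these notes.
posts = [
    {"title": "pandas merge error", "body": "dataframe join fails in python", "tags": ["python", "pandas"]},
    {"title": "python list sort", "body": "how to sort a list in python", "tags": ["python"]},
    {"title": "java null pointer", "body": "exception thrown in java class", "tags": ["java"]},
    {"title": "spring bean config", "body": "java spring context fails to load", "tags": ["java", "spring"]},
]


def build_text(post):
    # Assumption: repeat the title three times so its words weigh more than the body's.
    return " ".join([post["title"]] * 3 + [post["body"]])


# Build the sparse tf-idf feature matrix
texts = [build_text(p) for p in posts]
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)

# Turn the tag lists into a binary indicator matrix, one column per tag
binarizer = MultiLabelBinarizer()
labels = binarizer.fit_transform([p["tags"] for p in posts])

# One binary SGD classifier per tag
classifier = OneVsRestClassifier(SGDClassifier(random_state=0)).fit(features, labels)

# Pickle everything needed to predict later, then restore it
blob = pickle.dumps((vectorizer, binarizer, classifier))
vec2, bin2, clf2 = pickle.loads(blob)
restored_pred = clf2.predict(vec2.transform(texts))
```

Pickling the vectorizer and label binarizer alongside the classifier keeps the feature columns and tag order consistent at prediction time.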

In [None]:
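from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

# Candidate grid for a hyperparameter search (e.g. GridSearchCV); the
# 'estimator__' prefix routes each setting to the wrapped SGDClassifier.
# The 'log' loss was renamed 'log_loss' in scikit-learn 1.1.
parameters = {'estimator__loss':('hinge', 'log_loss', 'perceptron'),
             'estimator__alpha':(0.1, 0.001, 0.00001, 0.0000001),
             'estimator__penalty':('l1', 'l2', 'elasticnet')}

classifier = OneVsRestClassifier( SGDClassifier( random_state = 0,
                                               loss = 'hinge',
                                               alpha = 0.00001,
                                               penalty = 'elasticnet')).fit(features, labels)

y_pred = classifier.predict(X_test)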
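
A parameter dictionary with `estimator__` prefixes is the shape that `GridSearchCV` expects when tuning the estimator wrapped by `OneVsRestClassifier`. The notes do not show the search itself, so the following is only a sketch of how it might look. It assumes synthetic data from `make_multilabel_classification`, a reduced grid (`param_grid`) to keep the run short, and micro-averaged F1 as the score.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data standing in for the real features and tags
X_demo, Y_demo = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=0)

# Reduced grid for illustration; the prefix targets the inner SGDClassifier
param_grid = {'estimator__alpha': (0.001, 0.00001),
              'estimator__penalty': ('l2', 'elasticnet')}

search = GridSearchCV(OneVsRestClassifier(SGDClassifier(random_state=0)),
                      param_grid, scoring='f1_micro', cv=3)
search.fit(X_demo, Y_demo)

print(search.best_params_, round(search.best_score_, 3))
```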

# Clustering
1.  Intuition is that similarly tagged posts will group together
2.  Not necessarily the case, but still an interesting experiment
3.  Turned out to be slower, no gain in accuracy
        3.1 450 topics clustered with several hundred tags each
        3.2 Expensive SVD step
        3.3 Uneven clusters

In [None]:
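from sklearn.cluster import MiniBatchKMeans

kmeans = MiniBatchKMeans(init ='k-means++',
                        n_clusters = NUM_CLUSTERS,
                        n_init=30,
                        batch_size=2000)

# Clustering is unsupervised: fit() accepts y only for API consistency and ignores it
clusterf = kmeans.fit(X, y)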

test_clusters = kmeans.predict(X_test)
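
For a self-contained picture of the clustering experiment, here is a small sketch that chains tf-idf, an SVD reduction and `MiniBatchKMeans`. The notes mention an SVD step but not which implementation was used, so `TruncatedSVD` is an assumption, as are the toy documents and the tiny cluster count.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented documents standing in for tagged posts
train_docs = ['python pandas dataframe merge', 'pandas dataframe groupby python',
              'python list comprehension', 'java spring bean error',
              'java null pointer exception', 'spring boot java config']
test_docs = ['pandas merge in python', 'java spring exception']

tfidf = TfidfVectorizer()
svd = TruncatedSVD(n_components=2, random_state=0)  # assumed SVD implementation

# Reduce the sparse tf-idf matrix to a few dense dimensions before clustering
X_reduced = svd.fit_transform(tfidf.fit_transform(train_docs))

km = MiniBatchKMeans(init='k-means++', n_clusters=2, n_init=3,
                     batch_size=4, random_state=0)
km.fit(X_reduced)

# New documents go through the same fitted transforms before predict()
demo_clusters = km.predict(svd.transform(tfidf.transform(test_docs)))
print(demo_clusters)
```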

# TF-IDF
Short for term frequency-inverse document frequency. It is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is practical for weighting terms in a bag of words and for gaining intuition about which terms are important and which are less so.
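
As a quick illustration of the weighting, the sketch below fits scikit-learn's `TfidfVectorizer` with its default settings on a made-up three-document corpus and reads back the learned inverse document frequencies. With the default smoothing, scikit-learn computes idf as `ln((1 + n) / (1 + df)) + 1`, so a word that appears in every document gets the minimum weight of 1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative corpus: 'the' is in all 3 documents, 'sat' in 2, 'cat' in 1
corpus = ['the cat sat', 'the dog sat', 'the bird flew']

tfidf = TfidfVectorizer()  # defaults: use_idf=True, smooth_idf=True, norm='l2'
weights = tfidf.fit_transform(corpus)

# Map each term to its learned idf value
idf = {term: tfidf.idf_[col] for term, col in tfidf.vocabulary_.items()}

for term in ('the', 'sat', 'cat'):
    print(term, round(idf[term], 3))
```

The rarer the term, the larger its idf, which is what pushes distinctive words ahead of common ones in the final feature vector.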

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=0.00009,
                            max_features=200000,
                            stop_words='english',
                            smooth_idf=True,
                            norm='l2',
                            sublinear_tf=False,
                            use_idf=True,
                            ngram_range=(1,3))

# ngram_range is really important here. If you start with 'windows xp crashed',
# ngram_range=(1,3) will produce 6 features: the single words, the bigrams, and the trigram.

X = vectorizer.fit_transform(train_examples)
X_test = vectorizer.transform(test_examples)

# Can use this to look at different n-grams and get a sense for where
# synonyms can be rolled in as synthetic data
# (get_feature_names() was removed in scikit-learn 1.2)
print(vectorizer.get_feature_names_out())
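
The 'windows xp crashed' case can be checked directly. This small sketch fits a separate vectorizer (named `demo` here to avoid clobbering `vectorizer`) on that single phrase and lists the n-grams it learned.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

demo = TfidfVectorizer(ngram_range=(1, 3))
demo.fit(['windows xp crashed'])

# vocabulary_ maps each learned n-gram to its column index
ngrams = sorted(demo.vocabulary_)
print(len(ngrams))
print(ngrams)
```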

# Putting it all together
1. Classifiers won't predict on every test item (where 'micro' comes into play)
2. Predicting is better than not predicting (duh)
3. Stacked different classifiers
        3.1 Combined multiple methods for best predictions
        3.2 Didn't weight; sequenced for expediency
        3.3 Wound up with 10 prediction files and merged them
        3.4 The 10 files: dupes + 2 OvRClassifiers + Index + 5 Clusters + default
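
One way to read "didn't weight, sequenced" is a simple fallback chain: for each test item, take the tags from the first source that predicted anything, and fall back to a default otherwise. The notes do not show the merge code, so the sketch below is an assumption about its shape. `merge_predictions`, the source dictionaries and the default tags are all hypothetical.

```python
def merge_predictions(sources, default_tags):
    """Take each item's tags from the first source that predicted something."""
    item_ids = sorted({item for source in sources for item in source})
    merged = {}
    for item in item_ids:
        merged[item] = default_tags
        for source in sources:      # sources are ordered from most to least trusted
            tags = source.get(item)
            if tags:                # skip missing or empty predictions
                merged[item] = tags
                break
    return merged


# Hypothetical prediction files keyed by test item id
dupes = {1: ['python']}
ovr = {1: ['java'], 2: ['c#', '.net'], 3: []}
clusters = {3: ['php'], 4: []}

merged = merge_predictions([dupes, ovr, clusters], default_tags=['javascript'])
print(merged)
```

Sequencing avoids having to calibrate and weight the different models against each other, at the cost of ignoring agreement between later sources.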
        
# Lessons:
1. pre-analyze data
2. reality check against literature and competitors
3. avoid premature optimization
4. save innovation and complexity until you have a great baseline score

# Notes from Machine Learning With Text
From: https://www.youtube.com/watch?v=ZiKMIuYidY0

1.  All features must be numeric, ML models can't deal with text explicitly
2.  Every observation must have the same features in the same order
3.  In order to make a prediction, the new observation must have the same features as the training observations, both in number and meaning
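
The third point can be illustrated with `CountVectorizer`, which these notes cover further down. In this sketch a vectorizer is fitted on three training messages and then applied to a new message (`simple_test`, an illustrative example). The new row has exactly the same columns as the training matrix, and words that were never seen in training are silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'call me a cab', 'what you talking about willis?']
vect = CountVectorizer().fit(simple_train)

# A new observation: 'please' and 'don' are not in the learned vocabulary
simple_test = ["please don't call me"]
simple_test_dtm = vect.transform(simple_test)

print(simple_test_dtm.shape)      # same 9 columns as the training vocabulary
print(simple_test_dtm.toarray())  # only 'call' and 'me' are counted
```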


In [None]:
# Import the class
from sklearn.neighbors import KNeighborsClassifier

# Instantiate the model with default parameters
knn = KNeighborsClassifier()

# Fit the model with data; fit() trains in place, so no assignment is needed
knn.fit(X,y)

In [1]:
# Using CountVectorizer to vectorize the vocabulary
# Converts text into a matrix of token counts
from sklearn.feature_extraction.text import CountVectorizer

simple_train = ['call you tonight', 'call me a cab', 'what you talking about willis?']

vect = CountVectorizer()

# Learns vocabulary of the training data
vect.fit(simple_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [2]:
# Examines the fitted vocabulary
# (get_feature_names() was removed in scikit-learn 1.2)
vect.get_feature_names_out().tolist()

['about', 'cab', 'call', 'me', 'talking', 'tonight', 'what', 'willis', 'you']

In [3]:
# Transform training data into a 'document term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm

<3x9 sparse matrix of type '<class 'numpy.int64'>'
	with 11 stored elements in Compressed Sparse Row format>

In [4]:
# Convert sparse matrix to a dense matrix
simple_train_dtm.toarray()

array([[0, 0, 1, 0, 0, 1, 0, 0, 1],
       [0, 1, 1, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 1, 0, 1, 1, 1]])

# What This Is:

A scheme for representing text as numerical data, where each individual token occurrence frequency is treated as a feature. The vector of all the token frequencies for a given document is considered a multivariate sample.

A corpus of documents can thus be represented by a matrix with one row per document and one column per token occurring in the corpus. We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the 'Bag of Words' or 'Bag of n-grams' representation.

Bag of words does not keep track of word order, so the matrix cannot be used to reconstruct the original messages.
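
A two-line experiment shows the loss of word order. The two made-up sentences below mean opposite things but contain the same words, so they produce identical rows in the document-term matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['dog bites man', 'man bites dog']
counts = CountVectorizer().fit_transform(docs).toarray()

# Both rows hold the same counts, so the original order cannot be recovered
same = bool((counts[0] == counts[1]).all())
print(counts)
print(same)
```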



In [8]:
pd.DataFrame(simple_train_dtm.toarray(), columns = vect.get_feature_names_out())

   about  cab  call  me  talking  tonight  what  willis  you
0      0    0     1   0        0        1     0       0    1
1      0    1     1   1        0        0     0       0    0
2      1    0     0   0        1        0     1       1    1


In [9]:
# Show sparse matrix representation
print(simple_train_dtm)

  (0, 2)	1
  (0, 5)	1
  (0, 8)	1
  (1, 1)	1
  (1, 2)	1
  (1, 3)	1
  (2, 0)	1
  (2, 4)	1
  (2, 6)	1
  (2, 7)	1
  (2, 8)	1
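
Each line of that printout is a `(row, column)` coordinate followed by the stored value; zeros are simply not stored. As a rough check, the sketch below rebuilds a sparse matrix from the dense array shown earlier with `scipy.sparse.csr_matrix` and lists the coordinates of its non-zero entries.

```python
import numpy as np
from scipy.sparse import csr_matrix

# The dense document-term matrix from the CountVectorizer example
dense = np.array([[0, 0, 1, 0, 0, 1, 0, 0, 1],
                  [0, 1, 1, 1, 0, 0, 0, 0, 0],
                  [1, 0, 0, 0, 1, 0, 1, 1, 1]])

sparse = csr_matrix(dense)

# Coordinates of the stored (non-zero) entries
rows, cols = sparse.nonzero()
coords = list(zip(rows.tolist(), cols.tolist()))

print(sparse.nnz, 'stored values out of', dense.size, 'cells')
print(coords)
```

On a real corpus the vocabulary has many thousands of columns and each message uses only a handful of them, which is why the sparse format saves so much memory.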


# Code Below:
The code below will not run as-is because the relevant data set is not loaded. For context, the data set is a large collection of SMS messages that have been hand-labeled as ham or spam.

In [None]:
# Converts labels to a numerical value
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})

In [None]:
# Data needs to be a one dimensional object so we can pass it to count vectorizer
X = sms.message
y = sms.label_num
print(X.shape)
print(y.shape)

In [None]:
from sklearn.model_selection import train_test_split

# Uses sklearn to build a test/train split of the data
# (sklearn.cross_validation was removed in scikit-learn 0.20)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)

# Check shapes
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
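
Since the SMS data is not loaded here, the split can be tried on a stand-in. The sketch below uses eight invented messages (`messages`, `targets`) to show that `train_test_split` holds out 25% of the rows by default and keeps the messages and their labels aligned.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented stand-ins for sms.message and sms.label_num
messages = pd.Series([f'message {i}' for i in range(8)])
targets = pd.Series([0, 1, 0, 0, 1, 0, 0, 1])

# Default test_size is 0.25; random_state makes the shuffle reproducible
m_train, m_test, t_train, t_test = train_test_split(messages, targets, random_state=1)

print(m_train.shape, m_test.shape)
print(t_train.shape, t_test.shape)
```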


In [None]:
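# Vectorize the vocabulary: learn it from the training messages only,
# then reuse the fitted vectorizer on the test messages
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm

X_test_dtm = vect.transform(X_test)
X_test_dtm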

# Building and Evaluating the Model:
Using multinomial Naive Bayes:
1.  Suitable for classification with discrete features (e.g., word counts for text classification).
2.  The Multinomial distribution normally requires integer feature counts.
3.  In practice, fractional counts such as tf-idf may work as well
4.  Naive Bayes is really fast and great to start with until you know how much time you have for other approaches
5.  For every token, it calculates the conditional probability of that token given each class
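
To see those per-token conditional probabilities, here is a minimal sketch on six invented messages (three spam, three ham). The training texts and variable names are assumptions for illustration. After fitting, `feature_log_prob_` holds the log of P(token | class) with one row per class, and the default Laplace smoothing (`alpha=1`) keeps unseen tokens from getting zero probability.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented training data: 1 = spam, 0 = ham
train_text = ['win a free prize now', 'free cash prize click now', 'claim your free prize',
              'are we meeting for lunch', 'see you at lunch tomorrow', 'call me when you are home']
train_label = [1, 1, 1, 0, 0, 0]

vec = CountVectorizer()
dtm = vec.fit_transform(train_text)
model = MultinomialNB().fit(dtm, train_label)

# Rows of feature_log_prob_ follow model.classes_, here [0, 1] = [ham, spam]
free_col = vec.vocabulary_['free']
p_free_given_ham, p_free_given_spam = np.exp(model.feature_log_prob_[:, free_col])
print(round(p_free_given_ham, 4), round(p_free_given_spam, 4))

# Class predictions for two new messages
preds = model.predict(vec.transform(['free prize now', 'see you at lunch']))
print(preds)
```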

In [None]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()

In [None]:
# train the model using the training data/sparse matrix
# time magic command should give a rough estimate of runtime
%time nb.fit(X_train_dtm, y_train)

In [None]:
# Makes class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [None]:
# Calculate accuracy of class predictions
from sklearn import metrics

metrics.accuracy_score(y_test, y_pred_class)

In [None]:
# Print the confusion matrix
# upper left is true negative, bottom right is true positive, 
# upper right is false positive, bottom left is false negative

metrics.confusion_matrix(y_test, y_pred_class)
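
The layout described in the comments can be verified on a small hand-made example (`y_true_demo` and `y_pred_demo` are illustrative). Flattening the 2x2 matrix with `ravel()` yields the four counts in the order true negatives, false positives, false negatives, true positives.

```python
from sklearn import metrics

# Illustrative labels: 0 = ham, 1 = spam
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of the diagonal
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(cm)
print(accuracy)
```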

In [None]:
# How to print the text message from false positives
X_test[(y_pred_class==1) & (y_test==0)]

# For false negatives
X_test[(y_pred_class==0) & (y_test==1)]
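
These selections work because combining the prediction array with the `y_test` Series produces a boolean Series that shares the index of `X_test`. A small sketch with invented messages (`X_demo`, `y_demo`, `pred_demo`) shows the same pattern end to end.

```python
import numpy as np
import pandas as pd

# Invented test messages with a non-default index, as after a shuffled split
X_demo = pd.Series(['free prize', 'lunch today?', 'call me', 'win cash now'],
                   index=[10, 11, 12, 13])
y_demo = pd.Series([1, 0, 0, 1], index=X_demo.index)  # true labels
pred_demo = np.array([1, 1, 0, 0])                    # predictions, as returned by predict()

# Predicted spam but actually ham
false_positives = X_demo[(pred_demo == 1) & (y_demo == 0)]

# Predicted ham but actually spam
false_negatives = X_demo[(pred_demo == 0) & (y_demo == 1)]

print(false_positives)
print(false_negatives)
```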