# Capstone 2: Supervised learning on 20 newsgroups

**Objective 1:** Compare the performance of various models classifying text-based models. Performance is defined as run time, accuracy, recall, precision and f1 scores.

**Objective 2:** Apply these learnings to a data held by an editing firm to test their current classification performance. These results will not be shown here.

**The dataset:** I use the 20 newsgroups dataset. It is a text-based dataset built into the scikit learn library comprises around 18000 newsgroups posts on 20 topics. The main page and instructions for downloading can be found [here]( http://scikit-learn.org/stable/datasets/twenty_newsgroups.html). This dataset does have several [tutorials](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) and [examples]( http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html). These resources were used as a good introduction, but I made several adjustments (how the data was processed and the classifiers that I used).

**What I learnt:** Through this project I improved my ability and understanding of text processing and I learnt two new models (xgboost and stochastic gradient descent) and how to implement these.


## Summary of results

In [177]:
# Reproduced from conclusion
print(summary.round(2))    

     Run time  Accuracy  Precision  Recall    f1
nb       0.03      0.82       0.82    0.82  0.82
SVC     19.06      0.78       0.79    0.78  0.79
XGC     19.53      0.74       0.75    0.74  0.74
SGD      0.18      0.81       0.81    0.81  0.81


Surprisingly the Naïve Bayes classifier performed best. A more hands-on approach to the text cleaning and optimizing the models only led to modest gains against the external benchmark based solely on the scikit learn library (see below). 

## An external baseline

Following the methodology on scikit learn’s [website]( http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) provides a good external baseline to measure performance against. This is reproduced below. Note that in loading the data headers, footers, and quotes are excluded since in the final use of this model (objective 2) this type of information may not necessarily be available.  Excluding this information reduces the predictive ability of the model. If this is not excluded ‘edu’ and ‘gov’ (from email addresses) become important features.

In [121]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB

# Note categories are changed from example to match categories below
categories = ['comp.sys.mac.hardware','comp.windows.x',
              'misc.forsale', 'rec.autos','rec.sport.baseball', 'rec.sport.hockey',
              'sci.crypt','sci.electronics']

# Note, headers, footers and quotes are removed
twenty_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'),
                                  categories=categories,
                                  shuffle=True, random_state=42)

twenty_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'),
                                 categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data

# Naive Bayes
# Create pipeline to process and create model instance
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                    ])
# Fit
text_clf.fit(twenty_train.data, twenty_train.target)

#Metric
predicted = text_clf.predict(docs_test)
print('Naive Bayes accuracy: %0.3f' %np.mean(predicted == twenty_test.target))

# Stochastic Gradient Descent

# Create pipeline to process and create model instance
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, random_state=42))])
text_clf.fit(twenty_train.data, twenty_train.target)  
predicted = text_clf.predict(docs_test)
print('Stochastic gradient descent accuracy: %0.3f' % np.mean(predicted == twenty_test.target)) 


  if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.float):


Naive Bayes accuracy: 0.784


  if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.float):


Stochastic gradient descent accuracy: 0.798


## Outline 

The outline for this project is:
1. Setting up the data: Clean and process the text
2. Create vectors based on term frequencies (weighted with inverse document frequencies)
3. Compare the results of baseline and optimized (through Gridsearch) models <br>
    3.1 Naïve Bayes Classifier <br>
    3.2 Support Vector Classifier <br>
    3.3 Extreme Gradient Boosting classifier (xbgoost) <br>
    3.4 Stochastic Gradient Descent Classifier <br>
4. Conclusion

## 1. Setting up the data

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
%matplotlib inline
import seaborn as sns
import re
from nltk.corpus import stopwords
import string
import random
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
import timeit

In [96]:
from sklearn.datasets import fetch_20newsgroups

# Full dataset categories
categories = ['comp.graphics','comp.os.ms-windows.misc',
                  'comp.sys.ibm.pc.hardware','comp.sys.mac.hardware',
                  'comp.windows.x', 'rec.autos','rec.motorcycles',
                  'rec.sport.baseball','rec.sport.hockey', 'sci.crypt',
                  'sci.electronics','sci.med','sci.space',
                  'misc.forsale', 'talk.politics.misc',
                  'talk.politics.guns','talk.politics.mideast', 'talk.religion.misc',
                  'alt.atheism','soc.religion.christian']

# Reduce size of dataset to improve speed
random.seed(13)
categories_small = random.sample(categories, 8)

# Import dataset (type Bunch)
# Remove headers, footers, and quotes so that classifiers only work on text
dataset = fetch_20newsgroups(subset='train', categories=categories_small,
                             remove=('headers', 'footers', 'quotes'),
                             shuffle=True, random_state=42)


dataset_test = fetch_20newsgroups(subset='test', categories=categories_small,
                                  remove=('headers', 'footers', 'quotes'),
                                  shuffle=True, random_state=42)

# Also import full dataset for cv
dataset_full = fetch_20newsgroups(subset='all', categories=categories_small,
                                  remove=('headers', 'footers', 'quotes'),
                                  shuffle=True, random_state=42)

# Convert to dataframe
news = pd.DataFrame(dataset.data, columns=['Text'])
news_test = pd.DataFrame(dataset_test.data, columns=['Text'])
news_full = pd.DataFrame(dataset_full.data, columns=['Text'])

# Set outcome variable
y = dataset.target
y_test = dataset_test.target
y_full = dataset_full.target

In [97]:
# Check for class balance

def targetcounts(alist):
    '''Count of items in list
    Returns dictionary containing count'''
    adict = {}
    for key in alist:
        try:
            adict[key] += 1
        except KeyError:
            adict[key] = 1
    return adict 

counts = targetcounts(dataset.target)
print(counts)

{4: 597, 5: 600, 6: 595, 7: 591, 1: 593, 3: 594, 0: 578, 2: 585}


In [99]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

stopWords = set(stopwords.words('english'))

def textcleaner(text):
    ''' Takes in raw unformatted text and strips punctuation, removes whitespace,
    strips numbers, tokenizes and stems.
    Returns string of processed text to be used into CountVectorizer
    '''
    # Lowercase and strip everything except words
    cleaner = re.sub(r"[^a-zA-Z ]+", ' ', text.lower())
    # Tokenize
    cleaner = word_tokenize(cleaner)
    ps = PorterStemmer()
    clean = []
    for w in cleaner:
        # filter out stopwords
        if w not in stopWords:
            # filter out short words
            if len(w)>2:
                # Stem 
                clean.append(ps.stem(w))
    return ' '.join(clean)

In [100]:
# Running cleaning function on train and test df
news['Clean_text'] = news.Text.apply(lambda x: textcleaner(x))
print('Done news')
news_test['Clean_text'] = news_test.Text.apply(lambda x: textcleaner(x))
print('Done test')
# On full dataset
news_full['Clean_text'] = news_full.Text.apply(lambda x: textcleaner(x))

Done news
Done test


In [101]:
# Drop unprocessed text
news.drop(['Text'], inplace=True, axis=1)
news_test.drop(['Text'], inplace=True, axis=1)
news_full.drop(['Text'], inplace=True, axis=1)

In [102]:
# Save to file
S_news = news.loc[:,:]
S_news['y'] = y
S_news_test = news_test.loc[:,:]
S_news_test['y'] = y_test
S_news_full = news_full.loc[:,:]
S_news_full['y'] = y_full

S_news.to_csv(r'C:\Users\User\Documents\Python_scripts\Thinkful\news.csv')
S_news_test.to_csv(r'C:\Users\User\Documents\Python_scripts\Thinkful\news_test.csv')
S_news_full.to_csv(r'C:\Users\User\Documents\Python_scripts\Thinkful\news_full.csv')

In [45]:
# Unhash to reload
#news = pd.read_csv(r'C:\Users\User\Documents\Python_scripts\Thinkful\news.csv', encoding='latin-1')
#news.drop('Unnamed: 0', inplace=True, axis=1)
#news_test = pd.read_csv(r'C:\Users\User\Documents\Python_scripts\Thinkful\news_test.csv', encoding='latin-1')
#news_test.drop('Unnamed: 0', inplace=True, axis=1)

# Drop na
#news.dropna(inplace=True)
#news_test.dropna(inplace=True)

#Split out y
#y = news.loc[:, 'y']
#y_test = news.loc[:,'y']

In [124]:
# Create tf-idf matrix on train data
vectorizer = TfidfVectorizer(min_df=6, strip_accents='ascii', analyzer='word', lowercase=True,
                             ngram_range=(1,2))
X_train = vectorizer.fit_transform(news.loc[:, 'Clean_text'])

# Apply model to test data
X_test = vectorizer.transform(news_test.loc[:, 'Clean_text'])

In [106]:
X_train.shape

(4733, 8474)

In [107]:
indices = np.argsort(vectorizer.idf_)[::-1]
features = vectorizer.get_feature_names()
top_n = 100
top_features = [features[i] for i in indices[:top_n]]
print(top_features)

['seagat', 'run sun', 'end user', 'salesman', 'sale price', 'enforc prohibit', 'safeti effect', 'sacrific', 'enough make', 'rychel', 'entiti respons', 'entitl unbreak', 'russel', 'entropi', 'runtim', 'run wire', 'run win', 'end get', 'sampl program', 'end car', 'encrypt busi', 'say someth', 'say one', 'say got', 'say get', 'say everi', 'say could', 'encrypt clipper', 'encrypt unit', 'encrypt differ', 'encrypt file', 'encrypt need', 'encrypt polici', 'sand', 'encrypt threaten', 'run unix', 'equil', 'serial line', 'er', 'event mask', 'ron franci', 'ever get', 'ever sinc', 'role infrastructur', 'exampl say', 'exce', 'expect discuss', 'expect see', 'expedit', 'expert outsid', 'explor new', 'righti', 'right want', 'right reserv', 'even without', 'even best', 'rot', 'rsa data', 'erickson', 'run game', 'error code', 'error fail', 'esa tikkanen', 'rsa patent', 'royalti', 'evalu algorithm', 'especi sinc', 'essensa', 'etc know', 'round pick', 'euro', 'rotten', 'employ voic', 'email thank', 'emai

In [None]:
# To look at score by sample
feature_names = vectorizer.get_feature_names()
doc = 1
feature_index = X_train[doc,:].nonzero()[1]
tfidf_scores = zip(feature_index, [X_train[doc, x] for x in feature_index])
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
    print(w, s)

## 2. Naive Bayes Classifier

The multinomial Naive Bayes classifier is a good baseline for classification with discrete features (in this case a word count). While the multinomial distribution normally requires integer feature counts, according to [Scikitlearn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html), fractional counts such as tf-idf may also work.

### Baseline model

In [125]:
from sklearn.naive_bayes import MultinomialNB

# Start timing
start = timeit.default_timer()

#Initialize and fit
nb = MultinomialNB()
nb.fit(X_train, y)

# Apply to testing data
y_hat = nb.predict(X_test)

# Stop timing
stop = timeit.default_timer()
nb_time = stop-start
print("Run time: %0.3f" % (nb_time))

# Showing model performance
print("Accuracy is: %0.3f" % nb.score(X_test, y_test))
print(metrics.classification_report(y_test, y_hat,
                                    target_names=dataset_test.target_names))

Run time: 0.032
Accuracy is: 0.811
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.77      0.80      0.78       385
       comp.windows.x       0.86      0.88      0.87       395
         misc.forsale       0.84      0.81      0.83       390
            rec.autos       0.83      0.79      0.81       396
   rec.sport.baseball       0.92      0.83      0.87       397
     rec.sport.hockey       0.73      0.92      0.82       399
            sci.crypt       0.78      0.85      0.81       396
      sci.electronics       0.79      0.61      0.68       393

          avg / total       0.81      0.81      0.81      3151



In [109]:
def show_top10(classifier, vectorizer, categories):
    feature_names = np.asarray(vectorizer.get_feature_names())
    for i, category in enumerate(categories):
        top10 = np.argsort(classifier.coef_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))
        
show_top10(nb, vectorizer, dataset.target_names)

comp.sys.mac.hardware: monitor get one card simm use problem drive appl mac
comp.windows.x: applic display file run program motif widget server use window
misc.forsale: use pleas price condit new includ sell ship offer sale
rec.autos: good look dealer drive one engin get would like car
rec.sport.baseball: win think run basebal player hit pitch team game year
rec.sport.hockey: would playoff nhl year season player hockey play team game
sci.crypt: peopl system would secur use govern clipper chip encrypt key
sci.electronics: work get know like power anyon circuit would one use


## Optimized model

In [144]:
from sklearn import model_selection

params = {'alpha': [0.01, 0.1, 0.3, 0.4, 0.5, 0.75, 1]}

# Initialize the model
nb = MultinomialNB()

# Apply GridSearch to the model
grid = model_selection.GridSearchCV(nb, params)

# Fit to data
grid.fit(X_train, y)

# For use in CV later
nb_best = grid.best_estimator_

# Metrics 
print(grid.best_estimator_)
print("Accuracy is: %0.3f" % grid.score(X_test, y_test))
y_hat = grid.predict(X_test)
print(metrics.classification_report(y_test, y_hat,
                                    target_names=dataset_test.target_names))
cross = pd.crosstab(y_hat, y_test)
print(cross)

MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)
Accuracy is: 0.817
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.77      0.79      0.78       385
       comp.windows.x       0.90      0.87      0.89       395
         misc.forsale       0.84      0.81      0.83       390
            rec.autos       0.80      0.81      0.81       396
   rec.sport.baseball       0.91      0.86      0.88       397
     rec.sport.hockey       0.74      0.92      0.82       399
            sci.crypt       0.84      0.82      0.83       396
      sci.electronics       0.76      0.65      0.70       393

          avg / total       0.82      0.82      0.82      3151

col_0    0    1    2    3    4    5    6    7
row_0                                        
0      306   10   26    6    1    1   10   39
1        8  345    2    1    5    3    9   11
2       19    6  315   11    2    1    3   16
3        7    7   16  320    5    7    9   27
4        1   

In [145]:
from sklearn import metrics

# Summary stats
# Time, accuracy, precision, recall, f1
compareModels = {}
compareModels['nb'] = [nb_time, metrics.accuracy_score(y_test, y_hat),
                     metrics.precision_score(y_test, y_hat, average = 'macro'),
                     metrics.recall_score(y_test, y_hat, average = 'macro'),
                     metrics.f1_score(y_test, y_hat, average = 'macro')]                     

In [130]:
from sklearn import model_selection

# Calculating CV using KFolds
count = 0
kf = model_selection.KFold(n_splits=6, shuffle=True)

for train_index, test_index in kf.split(news_full, y_full):    
    Xk_train = vectorizer.fit_transform(news_full.iloc[train_index,0])
    Xk_test =  vectorizer.transform(news_full.iloc[test_index, 0])
    yk_train, yk_test = y_full[train_index], y_full[test_index]
    # Create instance based on GridCV
    nb = nb_best
    nb.fit(Xk_train, yk_train)
    print('Score for iteration {} is {}'.format(count, nb.score(Xk_test, yk_test)))
    count += 1

Score for iteration 0 is 0.8538812785388128
Score for iteration 1 is 0.8622526636225266
Score for iteration 2 is 0.8508371385083714
Score for iteration 3 is 0.8356164383561644
Score for iteration 4 is 0.8493150684931506
Score for iteration 5 is 0.8432267884322678


## 3.2 Support Vector Machine Classification

### Basline

In [113]:
from sklearn.svm import SVC

# Start timing
start = timeit.default_timer()

# Create instance and fit
sv = SVC(kernel='linear')
sv.fit(X_train, y)

# Apply to testing data
y_hat = sv.predict(X_test)

# Stop timing
stop = timeit.default_timer()
sv_time = stop - start
print("Run time:%0.3f" %sv_time)

# Showing model performance
cross = pd.crosstab(y_hat, y_test)
print("Accuracy is: %0.3f" % sv.score(X_test, y_test))
print(metrics.classification_report(y_test, y_hat,
                                    target_names=dataset_test.target_names))

Run time:19.060
Accuracy is: 0.784
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.74      0.74      0.74       385
       comp.windows.x       0.90      0.82      0.86       395
         misc.forsale       0.83      0.80      0.81       390
            rec.autos       0.64      0.84      0.73       396
   rec.sport.baseball       0.81      0.81      0.81       397
     rec.sport.hockey       0.87      0.84      0.86       399
            sci.crypt       0.88      0.75      0.81       396
      sci.electronics       0.68      0.66      0.67       393

          avg / total       0.79      0.78      0.79      3151



### Optimized

In [163]:
params = {'C': [0.1, 1, 10],
          'kernel': ['linear'],
          'class_weight': ['balanced', None]}

# Initialize the model
sv = SVC()

# Apply GridSearch to the model
grid = model_selection.GridSearchCV(sv, params)
grid.fit(X_train, y)

# Save model for use in CV
sv_best = grid.best_estimator_
print(grid.best_estimator_)

print("Accuracy is: %0.3f" % grid.score(X_test, y_test))

y_hat = grid.predict(X_test)
print(metrics.classification_report(y_test, y_hat,
                                    target_names=dataset_test.target_names))
cross = pd.crosstab(y_hat, y_test)
print(cross)

SVC(C=1, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
Accuracy is: 0.784
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.74      0.75      0.74       385
       comp.windows.x       0.90      0.82      0.86       395
         misc.forsale       0.83      0.80      0.81       390
            rec.autos       0.64      0.84      0.72       396
   rec.sport.baseball       0.81      0.81      0.81       397
     rec.sport.hockey       0.87      0.84      0.86       399
            sci.crypt       0.88      0.75      0.81       396
      sci.electronics       0.68      0.66      0.67       393

          avg / total       0.79      0.78      0.79      3151

col_0    0    1    2    3    4    5    6    7
row_0                                        
0      287   15   31    4

In [164]:
# Summary stats
# Time, accuracy, precision, recall, f1
compareModels['SVC'] = [sv_time, metrics.accuracy_score(y_test, y_hat),
                     metrics.precision_score(y_test, y_hat, average = 'macro'),
                     metrics.recall_score(y_test, y_hat, average = 'macro'),
                     metrics.f1_score(y_test, y_hat, average = 'macro')]

In [116]:
# Calculating CV using KFolds
count = 0
kf = model_selection.KFold(n_splits=6, shuffle=True)

for train_index, test_index in kf.split(news_full, y_full):    
    Xk_train = vectorizer.fit_transform(news_full.iloc[train_index,0])
    Xk_test =  vectorizer.transform(news_full.iloc[test_index, 0])
    yk_train, yk_test = y_full[train_index], y_full[test_index]
    # Create instance based on GridCV
    sv = sv_best
    sv.fit(Xk_train, yk_train)
    print('Score for iteration {} is {}'.format(count, sv.score(Xk_test, yk_test)))
    count += 1

  if hasattr(X, 'dtype') and np.issubdtype(X.dtype, np.float):


Score for iteration 0 is 0.8135464231354642
Score for iteration 1 is 0.8401826484018264
Score for iteration 2 is 0.8401826484018264
Score for iteration 3 is 0.821917808219178
Score for iteration 4 is 0.845509893455099
Score for iteration 5 is 0.837138508371385


## 3.3 Extreme gradient boosting (xgboost)

### Baseline

In [127]:
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn.datasets import dump_svmlight_file

# Start timing
start = timeit.default_timer()

# Initialize
xgc = xgb.XGBClassifier(objective= 'multi:softprob')

#Fit training 
xgc.fit(X_train, y)

# Apply to test
y_hat = xgc.predict(X_test)

# Stop timing
stop = timeit.default_timer()
xgc_time = stop - start 
print("Run time: %0.3f" % xgc_time)

# Showing model performance
print("Accuracy is: %0.3f" % xgc.score(X_test, y_test))

print(metrics.classification_report(y_test, y_hat,
                                    target_names=dataset_test.target_names))

Run time: 19.526
Accuracy is: 0.705
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.76      0.69      0.72       385
       comp.windows.x       0.87      0.73      0.80       395
         misc.forsale       0.77      0.72      0.74       390
            rec.autos       0.80      0.63      0.70       396
   rec.sport.baseball       0.49      0.82      0.61       397
     rec.sport.hockey       0.84      0.78      0.81       399
            sci.crypt       0.86      0.67      0.75       396
      sci.electronics       0.53      0.60      0.56       393

          avg / total       0.74      0.71      0.71      3151



  if diff:
  if diff:


### Optimized

In [148]:
params = {'objective': ['multi:softprob'],
          'n_estimators': [100, 1000],
          #'gamma': [0, 0.001],
          #'reg_alpha': [0,1]
         }

# Initialize the model
xgc = xgb.XGBClassifier()

# Apply GridSearch to the model
grid = model_selection.GridSearchCV(xgc, params)
grid.fit(X_train, y)

xgc_best = grid.best_estimator_
print(grid.best_estimator_)

print("Accuracy is %0.3f " % grid.score(X_test, y_test))

y_hat = grid.predict(X_test)
print(metrics.classification_report(y_test, y_hat,
                                    target_names=dataset_test.target_names))
cross = pd.crosstab(y_hat, y_test)
print(cross)

  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:
  if diff:


XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=1000, nthread=-1,
       objective='multi:softprob', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)


  if diff:


Accuracy is 0.743 
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.75      0.73      0.74       385
       comp.windows.x       0.85      0.80      0.82       395
         misc.forsale       0.78      0.77      0.78       390
            rec.autos       0.72      0.69      0.71       396
   rec.sport.baseball       0.62      0.83      0.71       397
     rec.sport.hockey       0.82      0.79      0.81       399
            sci.crypt       0.84      0.73      0.78       396
      sci.electronics       0.62      0.60      0.61       393

          avg / total       0.75      0.74      0.74      3151

col_0    0    1    2    3    4    5    6    7
row_0                                        
0      281   15   26    7    3    3    9   30
1       12  315    5    5    2    3    9   21
2       16   15  302   16    5    4    9   19
3       14    3   17  273   12    9   12   37
4       23   14   22   44  330   52   23   22
5        4    7    3    8 

  if diff:


In [151]:
# Summary stats
# Time, accuracy, precision, recall, f1
compareModels['XGC'] = [xgc_time, metrics.accuracy_score(y_test, y_hat),
                     metrics.precision_score(y_test, y_hat, average = 'macro'),
                     metrics.recall_score(y_test, y_hat, average = 'macro'),
                     metrics.f1_score(y_test, y_hat, average = 'macro')]

In [133]:
# Calculating CV using KFolds
count = 0
kf = model_selection.KFold(n_splits=6, shuffle=True)

for train_index, test_index in kf.split(news_full, y_full):    
    Xk_train = vectorizer.fit_transform(news_full.iloc[train_index,0])
    Xk_test =  vectorizer.transform(news_full.iloc[test_index, 0])
    yk_train, yk_test = y_full[train_index], y_full[test_index]
    # Create instance based on GridCV
    xgc = sv_best
    xgc.fit(Xk_train, yk_train)
    print('Score for iteration {} is {}'.format(count, xgc.score(Xk_test, yk_test)))
    count += 1

Score for iteration 0 is 0.834855403348554
Score for iteration 1 is 0.8333333333333334
Score for iteration 2 is 0.8234398782343988
Score for iteration 3 is 0.850076103500761
Score for iteration 4 is 0.8287671232876712
Score for iteration 5 is 0.8287671232876712


## 3.4 Stochastic Gradient Descent

## Baseline

In [134]:
from sklearn.linear_model import SGDClassifier

# Start timing
start = timeit.default_timer()

# Create instance and fit
sgdc = SGDClassifier(loss = 'hinge')
sgdc.fit(X_train, y)

# Apply to testing data
y_hat = sgdc.predict(X_test)

# Stop timing
stop = timeit.default_timer()
sgd_time = stop - start
print("Run time: %0.3f" % sgd_time)

# Showing model performance
cross = pd.crosstab(y_hat, y_test)
print("Accuracy is: %0.3f" % sgdc.score(X_test, y_test))
print(metrics.classification_report(y_test, y_hat,
                                    target_names=dataset_test.target_names))
print(cross)

Run time: 0.178
Accuracy is: 0.789
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.73      0.77      0.75       385
       comp.windows.x       0.90      0.84      0.87       395
         misc.forsale       0.78      0.83      0.80       390
            rec.autos       0.79      0.76      0.77       396
   rec.sport.baseball       0.70      0.82      0.76       397
     rec.sport.hockey       0.88      0.88      0.88       399
            sci.crypt       0.85      0.79      0.82       396
      sci.electronics       0.70      0.63      0.66       393

          avg / total       0.79      0.79      0.79      3151

col_0    0    1    2    3    4    5    6    7
row_0                                        
0      295   13   26   11    7    3    8   42
1        8  332    0    3    4    4    4   14
2       18    7  322   18    7    0   14   28
3        8    5   11  299   11   10   13   21
4       17   13   14   35  327   19   22   17
5        4

## Optimized

In [157]:
params = {'loss': ['hinge', 'log'],
          'penalty': ['l1', 'l2'],
          'alpha': [0.0001, 0.00001, 0.00000001],
          'average': [True, False],
          'class_weight': ['balanced', None],
          'learning_rate':['optimal', 'invscaling'],
          # Tried 0.5 in previous grid
          'power_t': [1.5],
          'eta0': [1],
          'n_iter': [5, 100]
          #'tol': [0.001, 0.0001],
         }


# Initialize the model
sgdc = SGDClassifier()

# Apply GridSearch to the model
grid = model_selection.GridSearchCV(sgdc, params)
grid.fit(X_train, y)

# Save instance for CV
sgd_best = grid.best_estimator_
print(grid.best_estimator_)

# Metrics
print("Accuracy is: ", grid.score(X_test, y_test))

y_hat = grid.predict(X_test)
print(metrics.classification_report(y_test, y_hat,
                                    target_names=dataset_test.target_names))
cross = pd.crosstab(y_hat, y_test)

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=1, fit_intercept=True, l1_ratio=0.15, learning_rate='optimal',
       loss='log', n_iter=5, n_jobs=1, penalty='l2', power_t=1.5,
       random_state=None, shuffle=True, verbose=0, warm_start=False)
Accuracy is:  0.8064106632814979
                       precision    recall  f1-score   support

comp.sys.mac.hardware       0.79      0.77      0.78       385
       comp.windows.x       0.90      0.86      0.88       395
         misc.forsale       0.82      0.81      0.81       390
            rec.autos       0.79      0.79      0.79       396
   rec.sport.baseball       0.70      0.87      0.78       397
     rec.sport.hockey       0.90      0.89      0.89       399
            sci.crypt       0.87      0.79      0.83       396
      sci.electronics       0.71      0.66      0.69       393

          avg / total       0.81      0.81      0.81      3151



In [158]:
# Summary stats
# Time, accuracy, precision, recall, f1
compareModels['SGD'] = [sgd_time, metrics.accuracy_score(y_test, y_hat),
                     metrics.precision_score(y_test, y_hat, average = 'macro'),
                     metrics.recall_score(y_test, y_hat, average = 'macro'),
                     metrics.f1_score(y_test, y_hat, average = 'macro')]

In [138]:
# Calculating CV using KFolds
count = 0
kf = model_selection.KFold(n_splits=6, shuffle=True)

for train_index, test_index in kf.split(news_full, y_full):    
    Xk_train = vectorizer.fit_transform(news_full.iloc[train_index,0])
    Xk_test =  vectorizer.transform(news_full.iloc[test_index, 0])
    yk_train, yk_test = y_full[train_index], y_full[test_index]
    # Create instance based on GridCV
    sgd = sgd_best
    sgd.fit(Xk_train, yk_train)
    print('Score for iteration {} is {}'.format(count, sgd.score(Xk_test, yk_test)))
    count += 1

Score for iteration 0 is 0.8394216133942162
Score for iteration 1 is 0.8584474885844748
Score for iteration 2 is 0.8447488584474886
Score for iteration 3 is 0.8318112633181126
Score for iteration 4 is 0.852359208523592
Score for iteration 5 is 0.845509893455099


## 4. Conclusion

Below is a summary of the results

In [176]:
summary = pd.DataFrame.from_dict(compareModels, orient='index')
summary.columns = ['Run time', 'Accuracy', 'Precision', 'Recall', 'f1']
print(summary.round(2))    

     Run time  Accuracy  Precision  Recall    f1
nb       0.03      0.82       0.82    0.82  0.82
SVC     19.06      0.78       0.79    0.78  0.79
XGC     19.53      0.74       0.75    0.74  0.74
SGD      0.18      0.81       0.81    0.81  0.81


Surprisingly the Naïve Bayes classifier performed best. It has the highest metrics and the model runs orders of magnitude faster than the more complex models. It is possible that better performance could be obtained from the extreme gradient descent and stochastic gradient descent models by allowing more iterations (for example the optimized xgboost model used n_estimators = 1000 – which was the maximum option fed into Gridsearch). However, this would slow down the run time performance even further and gains are likely to be limited (in the xgboost example the final accuracy was only marginally improved going from n_estimators=10 to n_estimators=1000).