# Text Classification

### We will explore Text Classification using nltk, scikit-learn, and gensim

We will use a newsgroups dataset: https://scikit-learn.org/stable/datasets/index.html#the-20-newsgroups-text-dataset (more than 18000 newsgroup posts, across 20 news categories)

#### Goal: Build ML models that predict the category of a newsgroup post, based on the text of the post

Prof Evan Katsamakas,
Gabelli School, 2/2019


In [2]:
import numpy as np
import pandas as pd  
import nltk
import gensim 
from sklearn.datasets import fetch_20newsgroups 
from sklearn.model_selection import train_test_split

# Get the data and take a look

In [3]:
#这就是个定义拿data 的function
categories = ['talk.politics.guns','rec.sport.baseball'] # We focus on 2 news categories
def get_data():
    data = fetch_20newsgroups(subset='all',
                              shuffle=True,
                              categories=categories,
                              remove=('headers', 'footers', 'quotes'))
    return data

In [4]:
# get text data and their labels
dataset = get_data()
print(dataset.target_names)

corpus, labels = dataset.data, dataset.target

print('Sample document:', corpus[10])
print('Class label:',labels[10])
print('Actual class label:', dataset.target_names[labels[10]])

# split training dataset and testing dataset
train_corpus, test_corpus, train_labels, test_labels = train_test_split(corpus,
                                                                        labels,
                                                                        test_size=0.3)

['rec.sport.baseball', 'talk.politics.guns']
Sample document: For those who didn't figure it out, the below message was a reply to another
in sci.crypt, for which the poster put t.p.g. in the Followup-To line. I
didn't notice that. Apologies to those who were confused.

The substance makes little sense unless one reads the prior messages.

However, I don't wish to enter into this discussion here, as it will be yet
another rehearsal of a long-tired set of arguments. Suffice it to say that I
disagree both with the interpretation of "well-regulated" in the Second
Amendment offered by gun lovers, and what I think to be their distortion of
the same phrase in the associated Federalist papers. My Webster and my
reading of the language convinces me that the word meant both under control,
and disciplined, and not 'of good marksmanship'. I think the latter a
special interest pleading. No one has yet shown a contemporateous reference
in which "well regulated" unambiguously meant 'of good marksman

In [5]:
len(corpus),len(labels) #1904 条新闻



(1904, 1904)

In [6]:
set(labels) # 这是一个binary 的问题



{0, 1}

# Prepape features for ML 
### {bow, tfidf, word2vec}

In [7]:
#bow features
from sklearn.feature_extraction.text import CountVectorizer #tokenizes and counts words

# build bag of words features' vectorizer and get features
bow_vectorizer=CountVectorizer(min_df=1, ngram_range=(1,1))


bow_train_features = bow_vectorizer.fit_transform(train_corpus)
bow_test_features = bow_vectorizer.transform(test_corpus) 


bow_train_features.A # Array 形式 
dftr_count = pd.DataFrame(data=bow_train_features.A, columns=bow_vectorizer.get_feature_names())
dftr_count



Unnamed: 0,00,000,000152,000th,001,001319,002,003,004,005,...,zwrm,zx,zx6wre,zxp,zxqi,zy,zyg,zz,zz_g9q3,zzzzzzt
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
# tfidf features
from sklearn.feature_extraction.text import TfidfVectorizer #alternatively, use TfidfTransformer()

tfidf_vectorizer=TfidfVectorizer(min_df=1, 
                                 norm='l2',
                                 smooth_idf=True,
                                 use_idf=True,
                                 ngram_range=(1,1))
tfidf_train_features = tfidf_vectorizer.fit_transform(train_corpus)  
tfidf_test_features = tfidf_vectorizer.transform(test_corpus) 
dftr_tfidf = pd.DataFrame(data=tfidf_train_features.A, columns=tfidf_vectorizer.get_feature_names())
dftr_tfidf

Unnamed: 0,00,000,000152,000th,001,001319,002,003,004,005,...,zwrm,zx,zx6wre,zxp,zxqi,zy,zyg,zz,zz_g9q3,zzzzzzt
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [9]:
### 这里讲的是word to vect了 


# tokenize documents for word2vec
tokenized_train = [nltk.word_tokenize(text)
                   for text in train_corpus]
tokenized_test = [nltk.word_tokenize(text)
                   for text in test_corpus]  

In [10]:
#
tokenized_train


[['The',
  'BATF',
  'got',
  'sat',
  'on',
  'pretty',
  'early',
  'on',
  '.',
  'After',
  'the',
  'initial',
  'shooting',
  'was',
  'over',
  ',',
  'it',
  'pretty',
  'much',
  'become',
  'the',
  'FBI',
  "'s",
  'show',
  '.',
  '(',
  'Even',
  'that',
  'BATF',
  'guy',
  'stopped',
  'showing',
  'up',
  'next',
  'to',
  'the',
  'speaker',
  'at',
  'the',
  'daily',
  'press',
  'conferences',
  ')',
  '.'],
 [':',
  'In',
  'article',
  '<',
  'C4u3x5.Fw7',
  '@',
  'magpie.linknet.com',
  '>',
  'manes',
  '@',
  'magpie.linknet.com',
  '(',
  'Steve',
  ':',
  '[',
  '...',
  ']',
  ':',
  '>',
  'I',
  'do',
  "n't",
  'know',
  'how',
  'anyone',
  'can',
  'state',
  'that',
  'gun',
  'control',
  'could',
  'have',
  'NO',
  ':',
  '>',
  'effect',
  'on',
  'homicide',
  'rates',
  '.',
  'There',
  'were',
  'over',
  '250',
  '>',
  'accidental',
  '<',
  'handgun',
  ':',
  '>',
  'homicides',
  'in',
  'America',
  'in',
  '1990',
  ',',
  'most',
  'wi

In [11]:
# build word2vec model                   
wv_model = gensim.models.Word2Vec(tokenized_train,
                               size=200,                          #set the size or dimension for the word vectors 
                               window=60,                        #specify the length of the window of words taken as context
                               min_count=10)                   #ignores all words with total frequency lower than                     

In [12]:
wv_model[list(wv_model.wv.vocab)[4]]



  """Entry point for launching an IPython kernel.


array([-0.08986105, -0.20418534,  0.35222358,  0.17236538, -0.04799503,
        0.4961919 , -0.11205919,  0.4807649 ,  0.363684  ,  0.08552857,
        0.26693577,  0.54115826,  0.01938818,  0.44384512,  0.11524207,
        0.47642446, -0.3773297 ,  0.25556535, -0.00133368, -0.03182231,
        0.05118736, -0.08417726, -0.33354655, -0.24575463,  0.14586335,
       -0.06801442,  0.32925105,  0.1033793 ,  0.40896392, -0.21387452,
       -0.13252494,  0.13947971,  0.46354836, -0.21637121,  0.11697112,
       -0.1882964 ,  0.14296302,  0.10859075, -0.17840178, -0.10203391,
       -0.12259667, -0.2701688 , -0.18931794, -0.03586711, -0.521591  ,
        0.7364915 , -0.06244168, -0.24275516, -0.02080456, -0.13380168,
       -0.06716608,  0.22104783,  0.08458999,  0.39461988,  0.11219265,
        0.37888947, -0.06115805,  0.2436359 ,  0.01708205, -0.3272379 ,
        0.36580136, -0.06931154, -0.10814975,  0.40662608, -0.0597943 ,
       -0.02859221, -0.10348104, -0.568948  ,  0.04234322, -0.67

In [13]:
def average_word_vectors(words,
                         model, 
                         vocabulary, 
                         num_features): 
    
    feature_vector = np.zeros((num_features,),dtype="float64")
    nwords = 0.
    
    for word in words:
        if word in vocabulary: 
            nwords = nwords + 1.
            feature_vector = np.add(feature_vector, model[word])
    
    if nwords:
        feature_vector = np.divide(feature_vector, nwords)
        
    return feature_vector 
   

def averaged_word_vectorizer(corpus, model, num_features):
    vocabulary = set(model.wv.index2word)
    
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus] 
    return np.array(features)




In [14]:
# averaged word vector features from word2vec
avg_wv_train_features = averaged_word_vectorizer(corpus=tokenized_train,
                                                 model=wv_model,
                                                 num_features=200)                   
avg_wv_test_features = averaged_word_vectorizer(corpus=tokenized_test,
                                                model=wv_model,
                                                num_features=200) 

  if sys.path[0] == '':


In [15]:
print(avg_wv_train_features.shape)




(1332, 200)


# Define metrics for evaluation

In [16]:
from sklearn import metrics

# define a function to evaluate our classification models based on four metrics
def get_metrics(true_labels, predicted_labels):
    
    print ('Accuracy:', np.round(                                                    
                        metrics.accuracy_score(true_labels, 
                                               predicted_labels),
                        2))
    print ('Precision:', np.round(
                        metrics.precision_score(true_labels, 
                                               predicted_labels),
                        2))
    print ('Recall:', np.round(
                        metrics.recall_score(true_labels, 
                                               predicted_labels),
                        2))
    print ('F1 Score:', np.round(
                        metrics.f1_score(true_labels, 
                                               predicted_labels),
                        2))
                        

# Define how to train and evaluate classifier

In [17]:
# define a function that trains the model, performs predictions and evaluates the predictions
def train_predict_evaluate_model(classifier, 
                                 train_features, train_labels, 
                                 test_features, test_labels):
    # build model    
    classifier.fit(train_features, train_labels)
    # predict using model
    predictions = classifier.predict(test_features) 
    # evaluate model prediction performance   
    get_metrics(true_labels=test_labels, 
                predicted_labels=predictions)
    return predictions    

# Train and evaluate {mnb, svm} with bow features

In [18]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier

mnb = MultinomialNB()
svm = SGDClassifier(loss='hinge', max_iter=100)

# Multinomial Naive Bayes with bag of words features
mnb_bow_predictions = train_predict_evaluate_model(classifier=mnb,
                                           train_features=bow_train_features,
                                           train_labels=train_labels,
                                           test_features=bow_test_features,
                                           test_labels=test_labels)

Accuracy: 0.95
Precision: 0.94
Recall: 0.94
F1 Score: 0.94


In [19]:
#Support Vector Machine with bag of words features
svm_bow_predictions = train_predict_evaluate_model(classifier=svm,
                                           train_features=bow_train_features,
                                           train_labels=train_labels,
                                           test_features=bow_test_features,
                                           test_labels=test_labels)

Accuracy: 0.91
Precision: 0.89
Recall: 0.91
F1 Score: 0.9




# Train and evaluate {svm} with {tfidf, word2vec} features


In [20]:
# Support Vector Machine with tfidf features
svm_tfidf_predictions = train_predict_evaluate_model(classifier=svm,
                                           train_features=tfidf_train_features,
                                           train_labels=train_labels,
                                           test_features=tfidf_test_features,
                                           test_labels=test_labels)

Accuracy: 0.92
Precision: 0.96
Recall: 0.86
F1 Score: 0.91


In [21]:
# Support Vector Machine with averaged word vector features
svm_avgwv_predictions = train_predict_evaluate_model(classifier=svm,
                                           train_features=avg_wv_train_features,
                                           train_labels=train_labels,
                                           test_features=avg_wv_test_features,
                                           test_labels=test_labels)

Accuracy: 0.82
Precision: 0.8
Recall: 0.81
F1 Score: 0.8


## CONFUSION MATRIX (for SVM TFIDF) 

In [22]:
# build confusion matrix for SVM TF-IDF-based model
cm = metrics.confusion_matrix(test_labels, svm_tfidf_predictions)
pd.DataFrame(cm, index=range(0,2), columns=range(0,2))  

Unnamed: 0,0,1
0,307,10
1,35,220


In [23]:
# Observe false positive output
class_names = dataset.target_names
print (class_names[0], '->', class_names[1])

rec.sport.baseball -> talk.politics.guns


In [24]:
# Look at some misclassified documents in detail
import re

num = 0
for document, label, predicted_label in zip(test_corpus, test_labels, svm_tfidf_predictions):
    if label == 0 and predicted_label == 1:
        print ('Actual Label:', class_names[label])
        print ('Predicted Label:', class_names[predicted_label])
        print ('Document:-')
        print (re.sub('\n', ' ', document))
        num += 1
        if num == 4:
            break

Actual Label: rec.sport.baseball
Predicted Label: talk.politics.guns
Document:-
 Always has been??????  Even before he was even conceived of? That's a neat trick.  Always will be??????  We leave a lot of room for error don't we.  Hopefully I missed an earlier post that this was with regard to otherwise ... well I leave that to the individual to fill in but I will say what about Gehrig! (shortened and not capitalized for the ease of the reader) 
Actual Label: rec.sport.baseball
Predicted Label: talk.politics.guns
Document:-
The Royals are darkness.  They are the void of our time. When they play, shame descends upon the land like a cold front from Canada.   They are a humiliation to all who have lived and all who shall ever live.   They are utterly and completely doomed.  Other than that, I guess they're OK.  -- 
Actual Label: rec.sport.baseball
Predicted Label: talk.politics.guns
Document:-
 Look, asshole, I got him confused with somebody else.  I didn't flame you, and I would appreciat

## Tasks 2 Better Solutions

In [25]:
#Model！
import xgboost as xgb
from sklearn.linear_model import LogisticRegression # logistics 

from sklearn.linear_model import Perceptron # perceptron

from sklearn.naive_bayes import GaussianNB # gaussion 

from sklearn.ensemble import RandomForestClassifier # 

from sklearn.svm import LinearSVC #linear

from sklearn.svm import SVC #

from sklearn.neighbors import KNeighborsClassifier # 

from sklearn.model_selection import KFold #



from sklearn.model_selection import GridSearchCV #
from sklearn.metrics import confusion_matrix

from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import roc_auc_score
import csv
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import RandomizedSearchCV
from keras.models import Sequential
from keras.layers import Dense

Using TensorFlow backend.


In [29]:
from sklearn.neural_network import MLPClassifier

nn = MLPClassifier()
# NN with bag of words features
NN_bow_predictions = train_predict_evaluate_model(classifier=nn,
                                           train_features=bow_train_features.A,
                                           train_labels=train_labels,
                                           test_features=bow_test_features.A,
                                           test_labels=test_labels)

# NN with tfidf features
NN_tfidf_predictions = train_predict_evaluate_model(classifier=nn,
                                           train_features=tfidf_train_features,
                                           train_labels=train_labels,
                                           test_features=tfidf_test_features,
                                           test_labels=test_labels)


Accuracy: 0.92
Precision: 0.95
Recall: 0.86
F1 Score: 0.9
Accuracy: 0.95
Precision: 0.97
Recall: 0.93
F1 Score: 0.95


In [28]:
# NN with averaged word vector features
NN_avgwv_predictions = train_predict_evaluate_model(classifier=svm,
                                           train_features=avg_wv_train_features,
                                           train_labels=train_labels,
                                           test_features=avg_wv_test_features,
                                           test_labels=test_labels)

Accuracy: 0.83
Precision: 0.84
Recall: 0.77
F1 Score: 0.8




In [37]:
# build confusion matrix for NN BOW based model
NB = metrics.confusion_matrix(test_labels, NN_bow_predictions)
pd.DataFrame(NB, index=['Predict_Sport','Predict_Politic'], columns=['Actual_Sport','Actual_Politic'])  

Unnamed: 0,Actual_Sport,Actual_Politic
Predict_Sport,305,12
Predict_Politic,35,220


In [38]:
# build confusion matrix for NN TF-IDF-based model
NT = metrics.confusion_matrix(test_labels, NN_tfidf_predictions)
pd.DataFrame(NT, index=['Predict_Sport','Predict_Politic'], columns=['Actual_Sport','Actual_Politic'])  

Unnamed: 0,Actual_Sport,Actual_Politic
Predict_Sport,310,7
Predict_Politic,19,236


In [39]:
# build confusion matrix for NN AVG based model
NA = metrics.confusion_matrix(test_labels, NN_avgwv_predictions)
pd.DataFrame(NA, index=['Predict_Sport','Predict_Politic'], columns=['Actual_Sport','Actual_Politic'])  

Unnamed: 0,Actual_Sport,Actual_Politic
Predict_Sport,280,37
Predict_Politic,59,196


In [52]:
## tuning BOW
parameter_space = {
    'hidden_layer_sizes': [(50,100,50)],
    'activation': ['relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}

NNCV = GridSearchCV(nn, parameter_space, n_jobs=-1, cv=3)
NNCV.fit(bow_train_features,train_labels)
print("Tuned Decision Per Parameters: {}".format(NNCV.best_params_))
print("Best score is {}".format(NNCV.best_score_))

Tuned Decision Per Parameters: {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 100, 50), 'learning_rate': 'constant', 'solver': 'adam'}
Best score is 0.9301801801801802


In [51]:
## tuning TFIDF
parameter_space = {
    'hidden_layer_sizes': [(50,100,50)],
    'activation': ['relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001],
    'learning_rate': ['constant','adaptive'],
}

NNCV = GridSearchCV(nn, parameter_space, n_jobs=-1, cv=3)
NNCV.fit(tfidf_train_features,train_labels)
print("Tuned Decision Per Parameters: {}".format(NNCV.best_params_))
print("Best score is {}".format(NNCV.best_score_))

Tuned Decision Per Parameters: {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 100, 50), 'learning_rate': 'constant', 'solver': 'adam'}
Best score is 0.954954954954955


In [49]:
## tuning W2V
parameter_space = {
    'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}

NNCV = GridSearchCV(nn, parameter_space, n_jobs=-1, cv=3)
NNCV.fit(avg_wv_train_features,train_labels)
print("Tuned Decision Per Parameters: {}".format(NNCV.best_params_))
print("Best score is {}".format(NNCV.best_score_))

Tuned Decision Per Parameters: {'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 100, 50), 'learning_rate': 'constant', 'solver': 'sgd'}
Best score is 0.8596096096096096




In [57]:
#tuned nn for bow
#{'activation': 'relu', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 100, 50), 'learning_rate': 'constant', 'solver': 'adam'}
nn_bow_tuned = MLPClassifier(activation='relu', alpha=0.0001, hidden_layer_sizes=(50, 100, 50), learning_rate='constant', solver='adam')
NN_tune_bow_predictions = train_predict_evaluate_model(classifier=nn_bow_tuned,
                                           train_features=bow_train_features.A,
                                           train_labels=train_labels,
                                           test_features=bow_test_features.A,
                                           test_labels=test_labels)

Accuracy: 0.92
Precision: 0.98
Recall: 0.83
F1 Score: 0.9


In [58]:
#tuned nn for tfidf
nn_tf_tuned = MLPClassifier(activation='relu', alpha=0.0001, hidden_layer_sizes=(50, 100, 50), learning_rate='constant', solver='adam')
NN_tune_tf_predictions = train_predict_evaluate_model(classifier=nn_tf_tuned,
                                           train_features=tfidf_train_features.A,
                                           train_labels=train_labels,
                                           test_features=tfidf_test_features.A,
                                           test_labels=test_labels)

Accuracy: 0.95
Precision: 0.97
Recall: 0.92
F1 Score: 0.95


In [54]:
#tuned nn for w2v
nn_w2v_tuned = MLPClassifier(activation='relu', alpha=0.0001, hidden_layer_sizes=(50, 100, 50), learning_rate='constant', solver='sgd')
NN_tune_w2v_predictions = train_predict_evaluate_model(classifier=nn_w2v_tuned,
                                           train_features=avg_wv_train_features,
                                           train_labels=train_labels,
                                           test_features=avg_wv_test_features,
                                           test_labels=test_labels)

Accuracy: 0.82
Precision: 0.78
Recall: 0.83
F1 Score: 0.8




In [60]:
# build confusion matrix for NN BOW based model
tNB = metrics.confusion_matrix(test_labels, NN_tune_bow_predictions)
pd.DataFrame(tNB, index=['Predict_Sport','Predict_Politic'], columns=['Actual_Sport','Actual_Politic'])  

Unnamed: 0,Actual_Sport,Actual_Politic
Predict_Sport,312,5
Predict_Politic,43,212


In [61]:
# build confusion matrix for NN TF-IDF-based model
tNT = metrics.confusion_matrix(test_labels, NN_tune_tf_predictions)
pd.DataFrame(tNT, index=['Predict_Sport','Predict_Politic'], columns=['Actual_Sport','Actual_Politic'])  

Unnamed: 0,Actual_Sport,Actual_Politic
Predict_Sport,310,7
Predict_Politic,20,235


In [62]:
# build confusion matrix for NN AVG based model
tNA = metrics.confusion_matrix(test_labels, NN_tune_w2v_predictions)
pd.DataFrame(tNA, index=['Predict_Sport','Predict_Politic'], columns=['Actual_Sport','Actual_Politic'])  

Unnamed: 0,Actual_Sport,Actual_Politic
Predict_Sport,258,59
Predict_Politic,44,211


## Task 3 Other Classifier

In [64]:
# Classifiers that we are going to use
clf1 = GaussianNB()

#perceptron
clf2 = Perceptron(tol=1e-3, random_state=42,alpha= 0.0001, n_iter= 15)

#Tuned Decision Tree Parameters: {'min_samples_leaf': 7, 'max_features': 3, 'criterion': 'entropy', 'max_depth': 8
clf3 = RandomForestClassifier(n_estimators='warn', criterion='entropy', max_depth=8, 
                          min_samples_split=2, min_samples_leaf=7, 
                          min_weight_fraction_leaf=0.0, max_features=3, 
                          max_leaf_nodes=None, min_impurity_decrease=0.0, 
                          min_impurity_split=None, bootstrap=True, oob_score=False, 
                          n_jobs=None, random_state=None, verbose=0, warm_start=False, 
                          class_weight=None)

#linear SVM
clf4_5 = LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=0, tol=1e-05, verbose=3)
#SVM {'C': 10, 'degree': 1, 'gamma': 0.01, 'kernel': 'poly'}
clf4 = SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=1, gamma=0.01,
  kernel='poly', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=3)

#k-nearest neighbor
clf5 = KNeighborsClassifier(n_neighbors= 41)
#'penalty': 'l2', 'C': 0.05179474679231213

#Logistics regresion
clf6 = LogisticRegression(solver='liblinear', C=0.4393970560760795,penalty='l2')


In [159]:
# Perceptron with bag of words features
Per_bow_predictions = train_predict_evaluate_model(classifier=clf2,
                                           train_features=bow_train_features.A,
                                           train_labels=train_labels,
                                           test_features=bow_test_features.A,
                                           test_labels=test_labels)
print(

)
# per with tfidf features
per_tfidf_predictions = train_predict_evaluate_model(classifier=clf2,
                                           train_features=tfidf_train_features,
                                           train_labels=train_labels,
                                           test_features=tfidf_test_features,
                                           test_labels=test_labels)



Accuracy: 0.91
Precision: 0.92
Recall: 0.9
F1 Score: 0.91

Accuracy: 0.92
Precision: 0.93
Recall: 0.91
F1 Score: 0.92




In [116]:
# Random_forest with bag of words features
rdf_bow_predictions = train_predict_evaluate_model(classifier=clf3,
                                           train_features=bow_train_features.A,
                                           train_labels=train_labels,
                                           test_features=bow_test_features.A,
                                           test_labels=test_labels)

# rdf with tfidf features
rdf_tfidf_predictions = train_predict_evaluate_model(classifier=clf3,
                                           train_features=tfidf_train_features,
                                           train_labels=train_labels,
                                           test_features=tfidf_test_features,
                                           test_labels=test_labels)

Accuracy: 0.55
Precision: 0.97
Recall: 0.1
F1 Score: 0.18
Accuracy: 0.56
Precision: 0.97
Recall: 0.13
F1 Score: 0.23


In [192]:
# SVM with bag of words features
svm_bow_predictions = train_predict_evaluate_model(classifier=clf4,
                                           train_features=bow_train_features.A,
                                           train_labels=train_labels,
                                           test_features=bow_test_features.A,
                                           test_labels=test_labels)

# Support Vector Machine with tfidf features
svm_tfidf_predictions = train_predict_evaluate_model(classifier=clf4,
                                           train_features=tfidf_train_features,
                                           train_labels=train_labels,
                                           test_features=tfidf_test_features,
                                           test_labels=test_labels)

[LibSVM]Accuracy: 0.9
Precision: 0.95
Recall: 0.84
F1 Score: 0.89
[LibSVM]Accuracy: 0.82
Precision: 0.98
Recall: 0.65
F1 Score: 0.78


In [144]:
# Linear svm with bag of words features
lsvm_bow_predictions = train_predict_evaluate_model(classifier=clf4_5,
                                           train_features=bow_train_features.A,
                                           train_labels=train_labels,
                                           test_features=bow_test_features.A,
                                           test_labels=test_labels)

lsvm_tfidf_predictions = train_predict_evaluate_model(classifier=clf4_5,
                                           train_features=tfidf_train_features,
                                           train_labels=train_labels,
                                           test_features=tfidf_test_features,
                                           test_labels=test_labels)

[LibLinear]Accuracy: 0.89
Precision: 0.94
Recall: 0.84
F1 Score: 0.89
[LibLinear]Accuracy: 0.94
Precision: 0.97
Recall: 0.91
F1 Score: 0.94


In [166]:
# KNN with bag of words features
KNN_bow_predictions = train_predict_evaluate_model(classifier=clf5,
                                           train_features=bow_train_features.A,
                                           train_labels=train_labels,
                                           test_features=bow_test_features.A,
                                           test_labels=test_labels)

# KNN with tfidf features
KNN_tfidf_predictions = train_predict_evaluate_model(classifier=clf5,
                                           train_features=tfidf_train_features,
                                           train_labels=train_labels,
                                           test_features=tfidf_test_features,
                                           test_labels=test_labels)

Accuracy: 0.67
Precision: 0.79
Recall: 0.46
F1 Score: 0.58
Accuracy: 0.5
Precision: 0.0
Recall: 0.0
F1 Score: 0.0


  'precision', 'predicted', average, warn_for)
  'precision', 'predicted', average, warn_for)


In [121]:
# Logistics with bag of words features
log_bow_predictions = train_predict_evaluate_model(classifier=clf6,
                                           train_features=bow_train_features.A,
                                           train_labels=train_labels,
                                           test_features=bow_test_features.A,
                                           test_labels=test_labels)

# Support Vector Machine with tfidf features
log_tfidf_predictions = train_predict_evaluate_model(classifier=clf6,
                                           train_features=tfidf_train_features,
                                           train_labels=train_labels,
                                           test_features=tfidf_test_features,
                                           test_labels=test_labels)

Accuracy: 0.92
Precision: 0.96
Recall: 0.87
F1 Score: 0.91
Accuracy: 0.91
Precision: 0.97
Recall: 0.85
F1 Score: 0.91


## Tuning

In [106]:
from scipy.stats import randint
# random forest
# Setup the parameters and distributions to sample from: param_dist
#Random Forest Tunning Params
c45_param = {'randomforestclassifier__n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}


# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(clf3, param_dist, cv=10)

# Fit it to the data
tree_cv.fit(bow_train_features,train_labels)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))

Tuned Decision Tree Parameters: {'criterion': 'gini', 'max_depth': 5, 'max_features': 8, 'min_samples_leaf': 2}
Best score is 0.6216216216216216


In [160]:
C = [ 0.01, 0.1, 1, 10, 100] 
degree = [1,2]
gamma = [.001, .01]
kernel = ("rbf","poly")



parameters_svm = {'C':C, 'degree': degree, 'gamma':gamma,'kernel':kernel }

# Instantiate the RandomizedSearchCV object: tree_cv
svm_cv = GridSearchCV(clf4, parameters_svm, cv=10)

# Fit it to the data
svm_cv.fit(bow_train_features.A ,train_labels)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(svm_cv.best_params_))
print("Best score is {}".format(svm_cv.best_score_))

[LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM]

In [164]:
#KNN Tuning Params
k = [41, 61, 81]
parameters_knn = {'n_neighbors': [41, 61, 81] }

# Instantiate the RandomizedSearchCV object: tree_cv
knn_cv = GridSearchCV(clf5, parameters_knn, cv=10)

# Fit it to the data
knn_cv.fit(bow_train_features.A ,train_labels)

# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(knn_cv.best_params_))
print("Best score is {}".format(knn_cv.best_score_))

Tuned Decision Tree Parameters: {'n_neighbors': 41}
Best score is 0.6981981981981982


In [155]:
#Percept Tuning Params

# Instantiate the Per object: PER_cv
per_cv = GridSearchCV(clf2, {'alpha': [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3],'n_iter': [5, 10, 15, 20, 50]}, cv=5)

# Fit it to the data
per_cv.fit(bow_train_features,train_labels)

# Print the tuned parameters and score
print("Tuned Decision Per Parameters: {}".format(per_cv.best_params_))
print("Best score is {}".format(per_cv.best_score_))










Tuned Decision Per Parameters: {'alpha': 0.0001, 'n_iter': 15}
Best score is 0.9121621621621622




In [156]:
print("Tuned Decision Per Parameters: {}".format(per_cv.best_params_))
print("Best score is {}".format(per_cv.best_score_))


Tuned Decision Per Parameters: {'alpha': 0.0001, 'n_iter': 15}
Best score is 0.9121621621621622


In [112]:
##Tuning Logis
# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(clf6, param_grid, cv=10)

# Fit it to the data
logreg_cv.fit(bow_train_features,train_labels)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))

Tuned Logistic Regression Parameters: {'C': 0.4393970560760795}
Best score is 0.9114114114114115


## Ensemble Method Majorty vote



In [71]:
def Majority_vote(data,clfs):
    X_train, y_train, X_test, y_test = data
    vote = np.zeros(X_test.shape[0])
    for clf in clfs:
        clf.fit(X_train,y_train)
        predictions = clf.predict(X_test)
        vote = np.add(vote,predictions)
        print(clf)
        print(vote)
    answer = np.array([1 if n >= np.round(len(clfs)) else 0 for n in vote])
    pred = get_metrics(true_labels=y_test, 
                predicted_labels=answer)
    data = metrics.confusion_matrix(y_test, answer)
    pd.DataFrame(data, index=['Predict_Sport','Predict_Politic'], columns=['Actual_Sport','Actual_Politic']) 
    return pred,data
            
    

In [67]:
#data preparing
clf_list = [mnb,clf2,clf1,clf4,svm,clf6,nn_bow_tuned] #7 classifiers
bow = (bow_train_features.A,train_labels,bow_test_features.A,test_labels)


tif = (tfidf_train_features.A,train_labels,tfidf_test_features.A,test_labels)
w2v = (avg_wv_train_features,train_labels,avg_wv_test_features,test_labels)

In [72]:
x,y = Majority_vote(bow,clfs=clf_list)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
[1. 1. 1. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0.
 1. 1. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0. 1. 1. 1. 1. 0. 1. 0. 1.
 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 0. 0. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 0. 0.
 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1.
 0. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1.
 0. 1. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 0.
 1. 1. 1. 0. 1. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1.
 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1.
 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 1. 1. 0. 1. 1. 0. 0. 0. 1. 1.
 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 0. 1. 0. 1. 0. 0. 0. 1. 0. 1. 1.
 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 1. 0. 1. 1. 1. 0.
 1. 1. 1. 0. 0. 1. 1. 0. 1. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1



Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0,
      fit_intercept=True, max_iter=None, n_iter=15, n_iter_no_change=5,
      n_jobs=None, penalty=None, random_state=42, shuffle=True, tol=0.001,
      validation_fraction=0.1, verbose=0, warm_start=False)
[2. 2. 2. 1. 0. 2. 0. 2. 0. 0. 2. 0. 0. 0. 0. 2. 2. 0. 0. 0. 0. 0. 2. 0.
 2. 2. 0. 1. 2. 2. 0. 0. 0. 0. 2. 0. 2. 2. 0. 0. 2. 2. 1. 2. 0. 2. 0. 2.
 2. 1. 2. 1. 0. 2. 0. 0. 2. 2. 2. 0. 2. 0. 0. 2. 2. 2. 0. 0. 0. 2. 0. 0.
 0. 0. 0. 0. 0. 0. 2. 2. 0. 0. 2. 2. 2. 2. 0. 2. 2. 0. 1. 2. 2. 2. 0. 0.
 0. 2. 0. 0. 2. 1. 0. 0. 2. 0. 2. 0. 0. 0. 1. 0. 1. 2. 2. 0. 0. 0. 0. 2.
 1. 0. 2. 0. 2. 2. 0. 0. 0. 2. 0. 0. 0. 2. 0. 0. 2. 0. 2. 0. 0. 2. 0. 2.
 0. 2. 0. 0. 2. 0. 2. 2. 0. 0. 0. 0. 2. 0. 0. 0. 2. 0. 0. 2. 2. 0. 2. 0.
 1. 2. 2. 0. 2. 2. 2. 0. 0. 0. 2. 1. 0. 0. 2. 2. 0. 0. 0. 0. 2. 0. 0. 2.
 0. 0. 0. 2. 0. 0. 0. 2. 0. 0. 0. 1. 2. 2. 2. 2. 2. 0. 0. 0. 0. 0. 0. 2.
 2. 1. 0. 0. 2. 0. 0. 0. 0. 0. 1. 1. 1. 0. 2. 2. 1. 2. 2. 1.



SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=100,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)
[5. 3. 5. 2. 0. 5. 0. 5. 0. 0. 3. 0. 0. 0. 0. 5. 3. 0. 0. 0. 0. 0. 5. 0.
 5. 5. 0. 2. 5. 5. 0. 0. 2. 0. 5. 0. 5. 5. 1. 0. 5. 3. 1. 5. 0. 5. 0. 5.
 5. 1. 5. 4. 0. 5. 0. 0. 5. 5. 5. 0. 5. 0. 0. 5. 5. 3. 0. 1. 1. 4. 0. 0.
 0. 2. 0. 1. 0. 0. 5. 4. 0. 0. 5. 5. 5. 5. 0. 4. 5. 0. 2. 5. 4. 5. 1. 0.
 0. 5. 0. 0. 5. 2. 0. 0. 5. 0. 5. 1. 0. 1. 2. 0. 3. 4. 5. 0. 0. 0. 0. 5.
 2. 0. 5. 0. 5. 5. 0. 0. 0. 4. 0. 0. 0. 4. 0. 0. 5. 0. 5. 0. 0. 5. 0. 5.
 1. 4. 0. 0. 5. 2. 5. 5. 0. 0. 0. 0. 5. 0. 0. 1. 4. 0. 0. 4. 5. 0. 5. 0.
 3. 5. 5. 0. 5. 3. 5. 0. 0. 0. 5. 2. 1. 0. 3. 5. 0. 0. 0. 1. 5. 0. 0. 5.
 0. 0. 0. 5. 0. 

In [74]:
# build confusion matrix for Boosting bow based model
pd.DataFrame(y, index=['Predict_Sport','Predict_Politic'], columns=['Actual_Sport','Actual_Politic'])  

Unnamed: 0,Actual_Sport,Actual_Politic
Predict_Sport,316,1
Predict_Politic,61,194


In [75]:
xt,yt = Majority_vote(tif,clfs=clf_list)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
[1. 1. 1. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0.
 1. 1. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 1. 0. 1.
 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 0. 0. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 0. 0.
 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1.
 0. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 1. 0. 1.
 0. 1. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0.
 1. 1. 1. 0. 1. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1.
 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1.
 1. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 0. 0. 1. 1.
 0. 0. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 0. 0. 0. 1. 0. 1. 1.
 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 1. 0. 1. 1. 1. 0.
 1. 1. 1. 0. 0. 1. 1. 1. 1. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1



Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0,
      fit_intercept=True, max_iter=None, n_iter=15, n_iter_no_change=5,
      n_jobs=None, penalty=None, random_state=42, shuffle=True, tol=0.001,
      validation_fraction=0.1, verbose=0, warm_start=False)
[2. 1. 2. 1. 0. 2. 0. 2. 0. 0. 2. 0. 0. 0. 0. 2. 1. 0. 0. 0. 0. 0. 1. 0.
 2. 2. 0. 0. 2. 2. 0. 0. 0. 0. 2. 0. 2. 2. 0. 0. 2. 1. 0. 2. 0. 2. 0. 2.
 2. 1. 2. 1. 0. 2. 0. 0. 1. 2. 2. 0. 2. 0. 0. 2. 2. 1. 0. 0. 0. 2. 0. 0.
 0. 0. 0. 0. 0. 0. 2. 2. 0. 0. 2. 2. 2. 2. 0. 1. 2. 0. 1. 2. 1. 2. 0. 0.
 0. 2. 0. 0. 2. 1. 0. 0. 2. 0. 2. 0. 0. 0. 0. 0. 0. 1. 2. 0. 0. 0. 0. 2.
 0. 0. 1. 0. 2. 2. 0. 0. 0. 2. 0. 0. 0. 1. 0. 0. 2. 0. 2. 0. 0. 2. 0. 1.
 0. 2. 0. 0. 2. 0. 2. 2. 0. 0. 0. 0. 2. 0. 0. 1. 1. 0. 0. 2. 2. 0. 2. 0.
 1. 2. 2. 0. 2. 1. 2. 0. 0. 0. 2. 0. 0. 0. 1. 2. 0. 0. 0. 0. 2. 0. 0. 2.
 0. 0. 0. 2. 0. 0. 0. 2. 0. 0. 0. 0. 2. 2. 2. 2. 2. 0. 0. 0. 0. 0. 0. 1.
 2. 0. 0. 0. 2. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 2. 2. 2. 2. 0.



SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=100,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)
[5. 3. 5. 3. 0. 5. 0. 5. 0. 0. 5. 0. 0. 0. 0. 5. 4. 0. 0. 0. 0. 0. 4. 0.
 5. 5. 0. 1. 5. 5. 0. 0. 1. 0. 5. 1. 5. 5. 1. 0. 5. 2. 0. 5. 0. 5. 0. 5.
 5. 1. 5. 3. 0. 4. 0. 0. 3. 5. 5. 0. 5. 0. 0. 5. 5. 2. 0. 1. 0. 5. 0. 0.
 0. 1. 0. 0. 0. 0. 5. 4. 0. 0. 5. 5. 5. 5. 0. 2. 5. 0. 2. 5. 3. 5. 1. 0.
 0. 5. 0. 0. 5. 2. 0. 1. 5. 0. 5. 1. 0. 1. 0. 1. 1. 2. 5. 0. 0. 0. 0. 5.
 1. 0. 4. 0. 4. 5. 0. 0. 0. 5. 0. 0. 0. 2. 0. 0. 5. 0. 5. 0. 0. 5. 0. 3.
 0. 4. 0. 0. 5. 0. 5. 5. 0. 0. 1. 0. 5. 0. 0. 3. 4. 0. 0. 4. 5. 0. 5. 0.
 2. 5. 5. 0. 5. 2. 5. 0. 0. 0. 5. 0. 1. 0. 4. 5. 0. 0. 0. 1. 4. 0. 0. 5.
 0. 0. 0. 4. 0. 

In [76]:
# build confusion matrix for Boosting tfidf based model
pd.DataFrame(yt, index=['Predict_Sport','Predict_Politic'], columns=['Actual_Sport','Actual_Politic']) 

Unnamed: 0,Actual_Sport,Actual_Politic
Predict_Sport,317,0
Predict_Politic,93,162
