# **LAB 3. TEXT CLASSIFICATION** 

### **<font color=green>INSTRUCTIONS:</font>** <br> 

**<font color=green> 1. Look for EXERCISES in the script.</font>** <br>

**<font color=green> 2. Each student INDIVIDUALLY uploads this script with their answers embedded to Canvas by the end of the day on Wednesday.</font>** 

### **Lab Objective**
Our objective is to classify consumer messages based on the topic of the message. Today, you will:<br>
1. Learn how to vectorize *training* data and *testing* data (**watch out for very important nuances in the script!**)
2. Do feature selection using **filter methods**: chi-squared statistics and a variation of the entropy method (mutual information)
3. Train and test a **Naive Bayes classifier** for text data (you can use other methods, such as SVM, etc.)


### **Session Prep**
Install the modules we'll need:

In [None]:
import sys

#!{sys.executable} -m pip install numpy
import numpy as np

#!{sys.executable} -m pip install scikit-learn
from sklearn import metrics

#!{sys.executable} -m pip install pandas
import pandas as pd

#!{sys.executable} -m pip install nltk
import nltk

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

#!{sys.executable} -m pip install nbformat
!{sys.executable} -m pip install pattern3

Next, we will define the text normalization function. The function lives in a separate file called Text_Normalization_Function.ipynb (the same file we used in Lab 2).

---
**IMPORTANT**: 

1.   If you work locally on your own machine: make sure the Text_Normalization_Function.ipynb file is in the same directory as the current notebook
2.   If you work on Google Colab: upload the Text_Normalization_Function.ipynb file to session storage

---

Let's execute the file:




In [None]:
# define the text normalization function (kept on its own line so the comment is not passed to %run as an argument)
%run ./Text_Normalization_Function.ipynb

### **Downloading and Exploring Data**
Download the dataset from scikit-learn's collection of datasets (sklearn.datasets). The loader function we need is called fetch_20newsgroups:<br><br>
*Note: You can check out the documentation for the dataset here: https://bit.ly/3aM5tUo*

In [12]:
from sklearn.datasets import fetch_20newsgroups
warnings.filterwarnings("ignore", category=DeprecationWarning) #suppress deprecation warnings, if any

Out of the 20 newsgroups (topics) available, we will use posts on **4** topics: atheism, religion, computer graphics, and science (space). You can refer to those newsgroups as **classes** or **categories**. Note right away that the "atheism" and "religion" classes are likely **similar** to each other: people might use similar words when they talk about atheism and religion.

In [13]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

Now, let's create our **training** and **testing** datasets. We do that by picking the posts that belong to the selected classes from the subsets that the dataset marks as "train" and "test". We remove headers (likely email or letter headers), footers (containing authors' signatures, etc.), and quotes:

In [14]:
corpus_train = fetch_20newsgroups(categories = categories,
                                  subset = 'train', 
                                  remove = ('headers', 'footers', 'quotes')) 

corpus_test = fetch_20newsgroups(categories = categories,
                                 subset='test', 
                                 remove=('headers', 'footers', 'quotes')) 

Let's inspect the data by looking at the training text corpus. First, have a look at one of the posts:

In [15]:
print(corpus_train.data[7])       


Acorn Replay running on a 25MHz ARM 3 processor (the ARM 3 is about 20% slower
than the ARM 6) does this in software (off a standard CD-ROM). 16 bit colour at
about the same resolution (so what if the computer only has 8 bit colour
support, real-time dithering too...). The 3D0/O is supposed to have a couple of
DSPs - the ARM being used for housekeeping.


A 25MHz ARM 6xx should clock around 20 ARM MIPS, say 18 flat out. Depends
really on the surrounding system and whether you are talking ARM6x or ARM6xx
(the latter has a cache, and so is essential to run at this kind of speed with
slower memory).

I'll stop saying things there 'cos I'll hopefully be working for ARM after
graduation...

Mike

PS Don't pay heed to what reps from Philips say; if the 3D0/O doesn't beat the
   pants off 3DI then I'll eat this postscript.


The class labels for each message, which you will use for training and testing, are encoded as numbers and can be accessed via the attribute **.target**; their names can be accessed via the attribute **.target_names**:

In [16]:
print("Category names: ", corpus_train.target_names)    
print("Categories for first 10 observations: ", corpus_train.target[:10])     
print("Number of posts in the training dataset: ", corpus_train.filenames.shape[0]) 

Category names:  ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
Categories for first 10 observations:  [1 3 2 0 2 0 2 1 2 1]
Number of posts in the training dataset:  2034
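
To make the numeric labels easier to read, each label can be mapped back to its newsgroup name. The snippet below is a minimal sketch: the category names and the first 10 labels are copied from the output above rather than loaded from the dataset, and the variable names are illustrative.

```python
# Values copied from the printed output above; in the notebook you would use
# corpus_train.target_names and corpus_train.target[:10] instead.
target_names = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
first_10_targets = [1, 3, 2, 0, 2, 0, 2, 1, 2, 1]

# A numeric label is simply an index into the list of category names.
first_10_names = [target_names[label] for label in first_10_targets]

for position, name in enumerate(first_10_names):
    print(position, name)
```

Post 7, the one printed earlier, carries label 1, which corresponds to comp.graphics.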


Let's now define a function called **fmat_descr_fun**, to be used later to describe the feature matrix (vectorized corpus). The function prints out the dimensions of the matrix, the share of non-zero elements, and so on. It takes two inputs: the feature matrix and the fitted vectorizer that produced it. The code below shows the descriptives that the function provides:

In [17]:
def fmat_descr_fun(your_feature_matrix,your_vectorizer):
    print("Dimensions (number of posts, number of features): ", your_feature_matrix.shape)  
    print("The first 5 features - names: ", your_vectorizer.get_feature_names()[0:5]) 
    print("Share of non-zero elements in the matrix: ", 
          your_feature_matrix.nnz / (float(your_feature_matrix.shape[0]) * float(your_feature_matrix.shape[1]))) #nnz: Get the count of explicitly-stored values (nonzeros)
    print("Average number of features present, per post: ", 
          round(your_feature_matrix.nnz/float(your_feature_matrix.shape[0]),1))
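
A note on versions: newer scikit-learn releases replace `get_feature_names()` with `get_feature_names_out()`, so the function above may fail depending on the installed version. The helper below is a sketch of one way to stay compatible with both; `feature_names` and the toy sentences are hypothetical names used for illustration only, not part of the lab.

```python
from sklearn.feature_extraction.text import CountVectorizer

def feature_names(vectorizer):
    """Return the fitted vocabulary as a list of plain strings (hypothetical helper)."""
    if hasattr(vectorizer, "get_feature_names_out"):
        # newer scikit-learn versions
        return [str(name) for name in vectorizer.get_feature_names_out()]
    # older scikit-learn versions
    return [str(name) for name in vectorizer.get_feature_names()]

# Toy corpus, used only to show the helper in action.
toy_vectorizer = CountVectorizer().fit(["the rocket reached orbit",
                                        "the shader renders pixels"])
names = feature_names(toy_vectorizer)
print(names[:5])
```

If you hit an `AttributeError` in the cells that follow, calling such a helper instead of `get_feature_names()` is one possible workaround.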

### **Feature Extraction for TRAINING Data**

Let's do feature extraction for our **TRAINING** data using the **"Bag-of-words"** method and **TF-IDF** method (optional).
<br><br>

### **"Bag-of-words" Vectorization for TRAINING data**

As you remember from Lab 2, to do the Bag-of-Words vectorization, we can use the **CountVectorizer()** function from the sklearn package: 

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

Let's define our Bag-of-Words vectorizer:

In [19]:
bow_vectorizer = CountVectorizer()

Now we can apply the vectorizer we defined, bow_vectorizer, to our training dataset **without normalizing** it first (though tokenization will be done by the vectorizer). Remember, to create the corpus vocabulary and to vectorize the data according to that vocabulary, we use the **.fit_transform** method:

In [20]:
corpus_train_bow = bow_vectorizer.fit_transform(corpus_train.data)
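
To see what **.fit_transform** produces, it can help to run it on a corpus small enough to read by eye. The sketch below uses two made-up sentences (not taken from the newsgroups data); the names `toy_docs`, `toy_vectorizer`, and `toy_counts` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents; "graphics" occurs twice in the second one.
toy_docs = ["space shuttle launch", "graphics card renders graphics"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)   # learn vocabulary, then count

# vocabulary_ maps each word to its column index; sort the words by that index.
vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocabulary)
print(toy_counts.toarray())   # one row per document, one column per word
```

Each row is a document and each column a word from the learned vocabulary, so the second row holds a 2 in the "graphics" column.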

Describe the Bag-of-Words matrix that you got by applying the function **fmat_descr_fun** we defined above:

In [21]:
fmat_descr_fun(corpus_train_bow, bow_vectorizer)

Dimensions (number of posts, number of features):  (2034, 26879)
The first 5 features - names:  ['00', '000', '0000', '00000', '000000']
Share of non-zero elements in the matrix:  0.0035978272269590263
Average number of features present, per post:  96.7
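
The descriptives are tied together by simple arithmetic: the share of non-zero elements equals the number of stored non-zero values divided by the total number of cells, and the per-post average equals that same count divided by the number of posts. The sketch below reproduces the relationship from the numbers printed above (the variable names are illustrative).

```python
# Numbers copied from the output above.
n_posts, n_features = 2034, 26879
share_nonzero = 0.0035978272269590263

n_cells = n_posts * n_features                    # total cells in the matrix
nnz_estimate = share_nonzero * n_cells            # stored non-zero values
avg_features_per_post = nnz_estimate / n_posts    # same as share_nonzero * n_features

print("Total cells:", n_cells)
print("Non-zero cells (approx.):", round(nnz_estimate))
print("Average features per post:", round(avg_features_per_post, 1))
print("Share of zero cells:", round(1 - share_nonzero, 4))
```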


Let's have a look at the resulting matrix (the notebook displays its first and last rows):

In [23]:
#convert vectorized data to a dataframe and give columns their names
corpus_train_bow_table = pd.DataFrame(data = corpus_train_bow.todense(), columns = bow_vectorizer.get_feature_names())
corpus_train_bow_table

Unnamed: 0,00,000,0000,00000,000000,000005102000,000062david42,0001,000100255pixel,00041032,...,zurich,zurvanism,zus,zvi,zwaartepunten,zwak,zwakke,zware,zwarte,zyxel
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2029,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2030,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2031,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2032,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### **<font color=green>EXERCISE 1:</font>**
**<font color=green> Answer the following questions using the descriptives of the Bag-of-Words matrix:</font>** <br><br>
**<font color=green>1.1. How many features does the Bag-of-Words matrix contain?</font>** <br><br> 
Your answer:

26879

**<font color=green>1.2. Is the Bag-of-Words matrix sparse? Explain your answer: </font>** <br><br> 
Your answer:

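Yes. Only about 0.36% of the entries are non-zero: on average, a post contains about 97 of the 26,879 features, so almost all entries in the matrix are 0.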

Now, let's normalize our training data and call the normalized corpus **NORM_corpus_train** and then vectorize it using the same Bag-of-Words approach to vectorization (note that it will take some time for the normalization function to finish its job):

In [24]:
NORM_corpus_train = normalize_corpus(corpus_train.data)

### **<font color=green>EXERCISE 2:</font>**
**<font color=green>Vectorize the normalized training corpus using the Bag-of-Words approach and name the vectorized corpus NORM_corpus_train_bow. Describe the normalized training corpus using the fmat_descr_fun function. <br><br> 2.1. What differences do you see between normalized and non-normalized feature matrices?</font>** <br><br>
Your answer:

* The dimension of the matrix has been reduced: the normalized matrix has 21,061 columns, while the non-normalized matrix has 26,879.

* The share of non-zero elements in the matrix has dropped as well, from about 0.0036 to about 0.0030.

* The average number of features present per post has dropped from 96.7 to 62.3, presumably because normalization removes uninformative tokens and merges different forms of the same word. The remaining features should therefore carry less noise for classification.

In [25]:
bow_vectorizer = CountVectorizer()
NORM_corpus_train_bow = bow_vectorizer.fit_transform(NORM_corpus_train)
fmat_descr_fun(NORM_corpus_train_bow, bow_vectorizer)
corpus_train_bow_table = pd.DataFrame(data = NORM_corpus_train_bow.todense(), columns = bow_vectorizer.get_feature_names())
corpus_train_bow_table

Dimensions (number of posts, number of features):  (2034, 21061)
The first 5 features - names:  ['000062david42', '000100255pixel', '000usd', '001200201pixel', '00index']
Share of non-zero elements in the matrix:  0.002958676433492318
Average number of features present, per post:  62.3


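| | 000062david42 | 000100255pixel | 000usd | 001200201pixel | 00index | 00pm | 01a | 023b | 04g | 054589e | ... | zurich | zurvanism | zus | zvi | zwaartepunten | zwak | zwakke | zware | zwarte | zyxel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2029 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2030 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2031 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2032 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |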


## **Feature Selection for TRAINING Data** 
Let's select the best features for your normalized training corpus, **NORM_corpus_train_bow**, using the **chi-squared statistic** and **mutual information (MI)**, which is based on the idea of **entropy**.

---

**IMPORTANT** <br> You need to be done with the previous EXERCISE to be able to continue. 

---

We need to import the feature selection class **SelectKBest** first. It selects the k best features based on the scores of a test (in our case, we will use chi-squared and mutual information). We also need the **chi2** and **mutual_info_classif** scoring functions:

In [26]:
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

First, we need to specify that we will use SelectKBest function with the chi-squared statistic for feature selection (by setting parameter "score_func") and indicate the number of best features we want to find (by setting parameter "k"). We'll select 10,000 best features, so **k = 10,000**:<br><br>
*Look up the documentation for SelectKBest if needed: https://bit.ly/2Re54Ch*

### Chi-squared statistic

In [27]:
chi2_kbest = SelectKBest(score_func = chi2, k = 10000)

Now, let's find the best features. To do that, we call the **fit_transform** method of the selector we defined. The fit_transform method takes **2 inputs**: the vectorized representation of the data (NORM_corpus_train_bow) and the array of class labels (corpus_train.target). To see the chi-squared scores for the features, use the attribute **scores_**:

In [28]:
NORM_corpus_train_bow_chi2_BEST = chi2_kbest.fit_transform(NORM_corpus_train_bow, corpus_train.target)
chi2_kbest.scores_

array([2.43001686, 2.48287671, 4.96575342, ..., 2.43001686, 4.86003373,
       4.96575342])
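
To get a feeling for where such scores come from, the chi-squared score of a single feature can be computed by hand: compare the observed total count of the word in each class with the count expected if the word were independent of the class. The sketch below does this for made-up toy data (not the newsgroups corpus) and compares the result with scikit-learn's `chi2`; all `*_toy` names are illustrative.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy data: counts of one word in 4 documents, two documents per class.
X_toy = np.array([[2], [1], [0], [0]])
y_toy = np.array([0, 0, 1, 1])

# Observed: total count of the word inside each class.
observed = np.array([X_toy[y_toy == c].sum() for c in (0, 1)], dtype=float)

# Expected under independence: class share times the word's overall count.
class_share = np.array([(y_toy == c).mean() for c in (0, 1)])
expected = class_share * X_toy.sum()

# Chi-squared statistic: sum over classes of (observed - expected)^2 / expected.
manual_score = ((observed - expected) ** 2 / expected).sum()

sklearn_scores, p_values = chi2(X_toy, y_toy)
print("manual:", manual_score, "sklearn:", sklearn_scores[0])
```

The word occurs only in class 0, so the observed counts are far from the expected ones and the score is relatively high; a word spread evenly across the classes would score close to 0.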

So, which features are best? The **get_support** method with the parameter *indices* set to True will return the indices of the best k features:

In [31]:
chi2_best_features_ind = chi2_kbest.get_support(indices=True)
chi2_best_features_ind

array([    2,     5,     7, ..., 21052, 21059, 21060])

What are the names of the best features, according to the chi-squared statistic?

In [32]:
chi2_best_features_names = np.array(bow_vectorizer.get_feature_names())[chi2_best_features_ind]

Let's have a look at the data vectorized with the best features selected by the chi-squared statistic:

In [37]:
#convert vectorized data to a dataframe and give columns their names
X_train_bow_chi2_BEST_table = pd.DataFrame(data = NORM_corpus_train_bow_chi2_BEST.todense(), columns = chi2_best_features_names)
X_train_bow_chi2_BEST_table.head(5)

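| | 000usd | 00pm | 023b | 0x | 1024x768 | 10bps | 10km | 10m | 110m | 115m | ... | zoroaster | zoroastrian | zoroastrianism | zoroastrians | zubin | zuck | zullen | zurvanism | zwarte | zyxel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |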


### MI

### **<font color=green>EXERCISE 3:</font>**
**<font color=green>3.1. Select best features using the mutual information statistic. Follow the steps we used for the chi-squared statistic. The first line, showing how to specify the SelectKBest function using mutual information, is provided. You need to complete the script.</font>** <br><br>
**<font color=green>3.2. Do the mutual information approach and the chi-squared statistic select the same best features?</font>** 

*Note: you can check out the documentation for mutual information function for categorical data here: https://bit.ly/2JGgeeT* <br><br>
Your answer (you need to add more lines of Python code to the cell below):

No, the two methods do not select exactly the same best features, but 85.92% of the selected features are the same.

In [38]:
MI_kbest = SelectKBest(score_func = mutual_info_classif, k = 10000)
NORM_corpus_train_bow_mutual_BEST = MI_kbest.fit_transform(NORM_corpus_train_bow, corpus_train.target)
mutual_best_features_ind = MI_kbest.get_support(indices=True)
mutual_best_features_names = np.array(bow_vectorizer.get_feature_names())[mutual_best_features_ind]
X_train_bow_mutual_BEST_table = pd.DataFrame(data = NORM_corpus_train_bow_mutual_BEST.todense(), columns = mutual_best_features_names)
X_train_bow_mutual_BEST_table.head(5)

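| | 023b | 0x | 1024x768 | 10km | 10m | 13h | 15m | 15rpm | 17th | 18084tm | ... | zorastrian | zoro | zoroaster | zoroastrian | zoroastrianism | zoroastrians | zubin | zuck | zurvanism | zyxel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |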


In [39]:
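def intersection(lst1, lst2):
    """Return the values of lst1 that also appear in lst2 (order of lst1 is kept)."""
    lookup = set(lst2)  # set membership is fast; `in` on an array rescans it every time
    return [value for value in lst1 if value in lookup]

# share of the 10,000 selected features on which both methods agree
len(intersection(chi2_best_features_ind, mutual_best_features_ind)) / 10000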

0.8592

### **Feature Extraction for TEST Data**

Let's normalize and vectorize the **TEST** text corpus (corpus_test.data) using the Bag-of-Words method. <br><br>First, we normalize the testing corpus and call it NORM_corpus_test:

In [40]:
NORM_corpus_test = normalize_corpus(corpus_test.data)

Now let's vectorize the normalized test corpus, NORM_corpus_test. <br><br>


---


**IMPORTANT**<br>
To transform the test data, you'll use the features extracted from the training corpus [You do NOT want to create new features based on your test data]. <br>Therefore: <br> 1) do **not** define a new vectorizer, use the one fitted on the training data <br>2) use the method **.transform** (not .fit_transform) with your vectorizer to vectorize the test data.


---



In [41]:
NORM_corpus_test_bow = bow_vectorizer.transform(NORM_corpus_test)
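
The effect of using **.transform** with the already fitted vectorizer is easiest to see on a tiny example. In the sketch below (made-up sentences, illustrative `toy_*` names), the test document contains a word that never occurs in the training documents: the word is silently ignored and the test matrix keeps exactly the training columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_train = ["rocket launch orbit", "pixel shader render"]
toy_test = ["rocket shader warpdrive"]   # "warpdrive" never occurs in training

toy_vectorizer = CountVectorizer()
toy_train_bow = toy_vectorizer.fit_transform(toy_train)  # vocabulary comes from training data only
toy_test_bow = toy_vectorizer.transform(toy_test)        # reuse that vocabulary, learn nothing new

print(toy_train_bow.shape, toy_test_bow.shape)   # same number of columns
print(toy_test_bow.toarray())                    # no column exists for "warpdrive"
```

Had we called .fit_transform on the test data instead, the columns would describe a different vocabulary and a model trained on the training matrix could not be applied to them.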

We'll use the best features selected by the **chi-squared statistic**. Now we need to pick from the Bag-of-Words matrix above exactly the same best features that we selected for our training dataset:

In [42]:
NORM_corpus_test_bow_chi2_BEST = chi2_kbest.transform(NORM_corpus_test_bow)
NORM_corpus_test_bow_chi2_BEST.shape

(1353, 10000)

### **Text Classification**

We'll train a text classification model that categorizes documents into 4 classes: religion, atheism, science and computer graphics. We will use a Naive Bayes classifier; you can optionally try Support Vector Machines (SVM) as well.
<br><br>
#### **Naive Bayes Classifier**

Make the **MultinomialNB** class available:

In [43]:
from sklearn.naive_bayes import MultinomialNB

Define the Naive Bayes classifier by specifying the hyperparameter alpha and call the classifier NB_tc:<br><br>
*Note: you can tune the hyperparameter alpha by trying different values > 0. With alpha = 0, your model will assign a probability of zero to a class for any test document that contains a feature never seen in the training documents of that class.*

In [44]:
NB_tc = MultinomialNB(alpha=0.1) 
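
The role of alpha can be illustrated with the additive smoothing formula used for multinomial Naive Bayes, P(word | class) = (count + alpha) / (total + alpha * V), where V is the vocabulary size. The sketch below plugs in made-up numbers; the function name and the counts are hypothetical and only meant to show the effect.

```python
def smoothed_word_probability(word_count, total_count, vocabulary_size, alpha):
    """Estimate P(word | class) with additive (Lidstone) smoothing."""
    return (word_count + alpha) / (total_count + alpha * vocabulary_size)

# Hypothetical class: 50 word occurrences in total, a vocabulary of 100 words,
# and a word that never occurred in this class during training (count 0).
unsmoothed = smoothed_word_probability(0, 50, 100, alpha=0.0)
smoothed = smoothed_word_probability(0, 50, 100, alpha=0.1)

print("alpha = 0.0:", unsmoothed)   # exactly zero, which wipes out the whole product
print("alpha = 0.1:", smoothed)     # small but positive
```

Because Naive Bayes multiplies the word probabilities of a document, a single zero would force the class probability to zero no matter what the other words say; a small positive alpha avoids that.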

Let's train the model using the best features selected by the **chi-squared statistic**:

In [45]:
NB_tc.fit(NORM_corpus_train_bow_chi2_BEST, corpus_train.target)

In [46]:
predicted_nb_chi2_best = NB_tc.predict(NORM_corpus_test_bow_chi2_BEST)
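
The vectorize, select, and classify steps can also be chained in a scikit-learn `Pipeline`, which fits every step on the training data only and then reuses the fitted steps on the test data. This is not how the lab is set up; it is a sketch of an alternative on made-up toy sentences, with illustrative names and a deliberately small k.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB

# Toy data: class 0 is about space, class 1 about graphics.
toy_train = ["rocket launch orbit", "orbit satellite rocket",
             "pixel shader render", "render texture pixel"]
toy_labels = [0, 0, 1, 1]
toy_test = ["rocket orbit", "pixel render"]

toy_pipeline = Pipeline([
    ("vectorize", CountVectorizer()),                # Bag-of-Words counts
    ("select", SelectKBest(score_func=chi2, k=4)),   # keep the 4 best features
    ("classify", MultinomialNB(alpha=0.1)),          # same alpha as in the lab
])

toy_pipeline.fit(toy_train, toy_labels)           # fit_transform happens on training data only
toy_predictions = toy_pipeline.predict(toy_test)  # test data is only transformed
print(toy_predictions)
```

The design benefit is that the rule "fit on training data, only transform test data" is enforced by the pipeline rather than by remembering which method to call.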

Evaluate the predictive power for the Naive Bayes classifier using chi-squared k=10,000 best features:

In [47]:
cm_chi2_best = metrics.confusion_matrix(corpus_test.target, predicted_nb_chi2_best)
print("Confusion matrix: \n", pd.DataFrame(data = cm_chi2_best , 
                                           columns = corpus_train.target_names,
                                           index = corpus_train.target_names),"\n")
print("Accuracy rate: ", metrics.accuracy_score(corpus_test.target, predicted_nb_chi2_best),"\n") 

Confusion matrix: 
                     alt.atheism  comp.graphics  sci.space  talk.religion.misc
alt.atheism                 231              7         25                  56
comp.graphics                15            349         23                   2
sci.space                    20             18        349                   7
talk.religion.misc           77              5         19                 150 

Accuracy rate:  0.7974870657797487 
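
The accuracy rate can be recovered from the confusion matrix itself: it is the sum of the diagonal (correctly classified posts) divided by the total number of test posts. The sketch below copies the matrix printed above and also computes the share of correctly classified posts per true class (recall); the variable names are illustrative.

```python
# Rows are true classes, columns are predicted classes (copied from the output above).
target_names = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
confusion = [
    [231,   7,  25,  56],
    [ 15, 349,  23,   2],
    [ 20,  18, 349,   7],
    [ 77,   5,  19, 150],
]

n_test_posts = sum(sum(row) for row in confusion)
n_correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = n_correct / n_test_posts
print("Accuracy:", accuracy)

# Recall per class: correctly classified posts divided by all posts of that class.
recalls = {}
for i, name in enumerate(target_names):
    recalls[name] = confusion[i][i] / sum(confusion[i])
    print(f"Recall for {name}: {recalls[name]:.3f}")
```

The per-class numbers make the weak spot visible: the two religion-related classes are recognized far less reliably than comp.graphics and sci.space, mostly because they are confused with each other.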



### **<font color=green>EXERCISE 4:</font>**
**<font color=green>4.1. Train the Naive Bayes classifier without doing feature selection, that is, use all the features available in the normalized corpus. What accuracy do you get? If the classifier made a mistake and misclassified a "Computer Graphics" post, to which class was such a post typically assigned? What about a post on the "Atheism" topic? </font>** <br><br>


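Without feature selection, the result is slightly less accurate: the accuracy rate is 0.7916, compared with 0.7975 with the 10,000 best chi-squared features. A misclassified "Computer Graphics" post is typically assigned to the "science" class (sci.space, 23 posts), and a misclassified "Atheism" post is typically assigned to the "religion" class (talk.religion.misc, 62 posts).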

In [52]:
NB_tc = MultinomialNB(alpha=0.1) 
NB_tc.fit(NORM_corpus_train_bow, corpus_train.target)
predicted_nb = NB_tc.predict(NORM_corpus_test_bow)

cm = metrics.confusion_matrix(corpus_test.target, predicted_nb)
print("Confusion matrix: \n", pd.DataFrame(data = cm , 
                                           columns = corpus_train.target_names,
                                           index = corpus_train.target_names),"\n")
print("Accuracy rate: ", metrics.accuracy_score(corpus_test.target, predicted_nb),"\n") 

Confusion matrix: 
                     alt.atheism  comp.graphics  sci.space  talk.religion.misc
alt.atheism                 227              6         24                  62
comp.graphics                12            350         23                   4
sci.space                    19             18        348                   9
talk.religion.misc           79              8         18                 146 

Accuracy rate:  0.7915742793791575 



**<font color=green>4.2. (OPTIONAL) Train the Naive Bayes classifier with feature selection based on mutual information (MI). What accuracy do you get?</font>** <br><br>
Your answer:

In [51]:
NORM_corpus_test_bow_mi_BEST = MI_kbest.transform(NORM_corpus_test_bow)

NB_tc = MultinomialNB(alpha=0.1) 
NB_tc.fit(NORM_corpus_train_bow_mutual_BEST, corpus_train.target)
predicted_nb_mi_best = NB_tc.predict(NORM_corpus_test_bow_mi_BEST)

cm_mi_best = metrics.confusion_matrix(corpus_test.target, predicted_nb_mi_best)
print("Confusion matrix: \n", pd.DataFrame(data = cm_mi_best , 
                                           columns = corpus_train.target_names,
                                           index = corpus_train.target_names),"\n")
print("Accuracy rate: ", metrics.accuracy_score(corpus_test.target, predicted_nb_mi_best),"\n") 

Confusion matrix: 
                     alt.atheism  comp.graphics  sci.space  talk.religion.misc
alt.atheism                 237              7         24                  51
comp.graphics                10            351         24                   4
sci.space                    20             18        349                   7
talk.religion.misc           84              7         19                 141 

Accuracy rate:  0.7967479674796748 



### **<font color=green>EXERCISE 5 (OPTIONAL):</font>**
**<font color=green>5.1. Vectorize the data using the TF-IDF approach, with and without feature selection, and train and test the Naive Bayes classifier. What are your results? </font>** <br><br>

Your answer:

#### without feature selection

In [53]:
from sklearn.feature_extraction.text import TfidfVectorizer 

vectorizer_TF_IDF = TfidfVectorizer(norm = 'l2', smooth_idf = True)
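# keep the matrices sparse: MultinomialNB and chi2 accept sparse input, and recent
# scikit-learn versions reject the np.matrix objects that .todense() returns
NORM_corpus_train_tf_idf = vectorizer_TF_IDF.fit_transform(NORM_corpus_train)
NORM_corpus_test_tf_idf = vectorizer_TF_IDF.transform(NORM_corpus_test)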

NB_tc = MultinomialNB(alpha=0.1) 
NB_tc.fit(NORM_corpus_train_tf_idf, corpus_train.target)
predicted_nb = NB_tc.predict(NORM_corpus_test_tf_idf)

cm = metrics.confusion_matrix(corpus_test.target, predicted_nb)
print("Confusion matrix: \n", pd.DataFrame(data = cm , 
                                           columns = corpus_train.target_names,
                                           index = corpus_train.target_names),"\n")
print("Accuracy rate: ", metrics.accuracy_score(corpus_test.target, predicted_nb),"\n") 

Confusion matrix: 
                     alt.atheism  comp.graphics  sci.space  talk.religion.misc
alt.atheism                 224              7         39                  49
comp.graphics                10            350         27                   2
sci.space                    19             18        354                   3
talk.religion.misc           84              8         20                 139 

Accuracy rate:  0.7886178861788617 
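
For reference, the two TfidfVectorizer settings used above can be checked on a toy corpus. With `smooth_idf = True`, scikit-learn computes idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing term t; with `norm = 'l2'`, every document vector is scaled to unit length. The sketch below uses made-up sentences and illustrative names.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["space space rocket", "rocket pixel"]

toy_tfidf = TfidfVectorizer(norm='l2', smooth_idf=True)
toy_matrix = toy_tfidf.fit_transform(toy_docs)

# Columns are sorted alphabetically: pixel, rocket, space.
n_docs = len(toy_docs)
doc_freq = np.array([1, 2, 1])   # number of documents containing each term
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print("sklearn idf:", toy_tfidf.idf_)
print("manual idf: ", manual_idf)

# With norm='l2', the Euclidean length of every document vector is 1.
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)
print("row norms:", row_norms)
```

"rocket" appears in both documents and therefore gets the lowest idf weight, which is the intended effect: terms that occur everywhere say little about the class.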



#### with feature selection

##### chi-square statistic

In [56]:
chi2_kbest = SelectKBest(score_func = chi2, k = 10000)
NORM_corpus_train_tf_idf_chi2_BEST = chi2_kbest.fit_transform(NORM_corpus_train_tf_idf, corpus_train.target)
NORM_corpus_test_tf_idf_chi2_BEST = chi2_kbest.transform(NORM_corpus_test_tf_idf)

NB_tc = MultinomialNB(alpha=0.1) 
NB_tc.fit(NORM_corpus_train_tf_idf_chi2_BEST, corpus_train.target)
predicted_nb_chi2_best = NB_tc.predict(NORM_corpus_test_tf_idf_chi2_BEST)

cm_chi2_best = metrics.confusion_matrix(corpus_test.target, predicted_nb_chi2_best)
print("Confusion matrix: \n", pd.DataFrame(data = cm_chi2_best, 
                                           columns = corpus_train.target_names,
                                           index = corpus_train.target_names),"\n")
print("Accuracy rate: ", metrics.accuracy_score(corpus_test.target, predicted_nb_chi2_best),"\n") 

Confusion matrix: 
                     alt.atheism  comp.graphics  sci.space  talk.religion.misc
alt.atheism                 221              8         38                  52
comp.graphics                11            351         26                   1
sci.space                    19             16        357                   2
talk.religion.misc           79              7         21                 144 

Accuracy rate:  0.7930524759793053 

