
# **ICE-4: Text Classification**

Text classification terminologies:

* **Training:** Given a set of documents and a set of classes, learn from (x, y) pairs in which each document x carries a hand-labeled or system-generated class y
* **Inference:** Given a new, unseen document, predict the class it belongs to


In [None]:
# to avoid NumPy's truncation of outputs when certain code blocks are generated
import sys
import numpy
numpy.set_printoptions(threshold=sys.maxsize)

## **(Tutorial) Performing Naive Bayes Classification using Scikit-Learn**

### **Bag-of-Words (BoW)**

***Bag-of-Words*** is one of many approaches for extracting features (inputs to the learning algorithm) from text data. A Bag-of-Words representation records which vocabulary words occur in a text, without regard to their order; what exactly is recorded depends on the measure used.

One measure uses ***information about the presence/absence of words in the text***: **0** indicates the word is absent, while **1** indicates the word is present.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# 'binary' parameter set to True indicates the encoding measure is the presence/absence of words
vectorizer1 = CountVectorizer(binary=True)

# super small corpus
corpus = [
     'This is the first Document.',
     'This is the second second document.',
     'And the third one.',
     'Is this the first document?',
]

# fit the vectorizer on the corpus and then encode the data
data = vectorizer1.fit_transform(corpus)
data

<4x9 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [None]:
print(vectorizer1.get_feature_names())    # returns the features extracted from the text (vocabulary); in scikit-learn 1.2+ use get_feature_names_out() instead

data.toarray()    # returns encoded representations


['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])

A second measure that is commonly used while creating a Bag-of-Words representation is to use information about the ***term frequency***. *Term frequency* (TF) refers to the number of times a term (i.e. a word) is seen in the text.

In [None]:
# 'binary' parameter when not set indicates the encoding measure is term frequency
vectorizer2 = CountVectorizer()

# again, an example corpus
corpus = [
     'This is the first Document.',
     'This is the second second document.',
     'And the third one is the document.',
     'Is this the first document document document?',
]

# same as before...
data = vectorizer2.fit_transform(corpus)
data


<4x9 sparse matrix of type '<class 'numpy.int64'>'
	with 21 stored elements in Compressed Sparse Row format>

In [None]:
print(vectorizer2.get_feature_names())

data.toarray()


['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 1, 0, 1, 1, 0, 2, 1, 0],
       [0, 3, 1, 1, 0, 0, 1, 0, 1]])

### **Using Naive Bayes in scikit-learn**

#### **Step 1. Create/Collect the Data**

In [None]:
# importing an existing dataset from scikit-learn for demonstrating use of Naive Bayes
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups()


# scikit-learn provides helpful utilities for out-of-the-box datasets
# what are the various categories of documents in the above dataset?
dataset.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [None]:
# what are the inputs/attributes in the above dataset?
dataset.data      # returns a list

# print(f"Number of articles (inputs): {len(dataset.data)}")

["From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",
 "From: guykuo@carson.u.washington.edu (Guy Kuo)\nSubject: SI Clock Poll - Final Call\nSummary: Final call for SI clock reports\nKeywords: SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization: University of Washington\nLines: 

In [None]:
# what are the corresponding outputs/targets in the above dataset?
dataset.target      # returns a 1-D NumPy array of class indices

# print(f"Number of targets (outputs): {dataset.target.shape}")

array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4,  8, 19,  4, 14,  6,  0,  1,
        7, 12,  5,  0, 10,  6,  2,  4,  1, 12,  9, 15,  7,  6, 13, 12, 17,
       18, 10,  8, 11,  8, 16,  9,  4,  3,  9,  9,  4,  4,  8, 12, 14,  5,
       15,  2, 13, 17, 11,  7, 10,  2, 14, 12,  5,  4,  6,  7,  0, 11, 16,
        0,  6, 17,  7, 12,  7,  3, 12, 11,  7,  2,  2,  0, 16,  1,  2,  7,
        3,  2,  1, 10, 12, 12, 17, 12,  2,  8,  8, 18,  5,  0,  1,  6, 12,
        8,  4, 17, 12, 12, 12,  1,  6, 18,  4,  3, 10,  9,  0, 13, 11,  5,
       14, 15,  8,  4, 15, 15,  1,  0, 16,  9,  8,  6, 13,  6, 17, 14,  0,
        9,  1,  2, 15, 13,  9,  2,  8,  2, 13,  2,  0, 15, 14,  1, 14, 17,
       14,  4,  4,  7, 19,  1, 15, 17, 16,  2, 15,  9, 12,  6,  9,  6,  6,
       18,  1, 10,  6, 10,  5,  2, 13,  3,  9, 13, 12, 13,  8,  4,  3,  9,
        1, 12,  4,  2,  2, 11, 13,  4,  1, 12,  0, 16, 12, 16,  7, 17, 15,
       11, 14,  2,  7, 10, 14, 15,  5, 16, 11,  4, 13,  7,  4, 13, 17,  1,
       15, 17, 17,  9, 16

In [None]:
# for the demonstration, let's use only a small subset from the 20 categories
categories = ['talk.religion.misc', 'soc.religion.christian', 'sci.space', 'comp.graphics']

# get the training and testing data and have them ready to use later
train_data = fetch_20newsgroups(subset='train', categories=categories)
test_data = fetch_20newsgroups(subset='test', categories=categories)
print(train_data.data[5])

From: dmcgee@uluhe.soest.hawaii.edu (Don McGee)
Subject: Federal Hearing
Originator: dmcgee@uluhe
Organization: School of Ocean and Earth Science and Technology
Distribution: usa
Lines: 10


Fact or rumor....?  Madalyn Murray O'Hare an atheist who eliminated the
use of the bible reading and prayer in public schools 15 years ago is now
going to appear before the FCC with a petition to stop the reading of the
Gospel on the airways of America.  And she is also campaigning to remove
Christmas programs, songs, etc from the public schools.  If it is true
then mail to Federal Communications Commission 1919 H Street Washington DC
20054 expressing your opposition to her request.  Reference Petition number

2493.



#### **Step 2(a): Prepare the inputs for modeling**

In [None]:
# create BoW representations for the training data (excluding targets); aka feature extraction
from sklearn.feature_extraction.text import CountVectorizer
bow_vectorizer = CountVectorizer()

# first, build the training vocabulary, fit()
# then, use the vocabulary to transform the training data into a document-term matrix, transform()
# results in a document-term matrix structure
X_train = bow_vectorizer.fit_transform(train_data.data)

# let's check the document-term matrix
#X_train
print(f"X_train array size: {X_train.shape}")
# convert sparse matrix to dense matrix
# X_train.toarray()

X_train array size: (2153, 35329)


#### **Step 2(b): Prepare the outputs for modeling**

In [None]:
# store the outputs (corresponding to the news articles in the training set) for easy access
y_train = train_data.target     # returns a 1-D NumPy array of class indices
#y_train

print(f"y_train array size: {y_train.shape}")

y_train array size: (2153,)


##### **Transforming the outputs for a supervised learning task**

**PLEASE READ:** Some datasets contain non-numeric labels for the targets (the outputs in a supervised learning problem). scikit-learn includes utilities, such as `LabelEncoder`, that convert (transform) non-numeric labels to numeric ones.

In [None]:
from sklearn import preprocessing
tgt_enc = preprocessing.LabelEncoder()

# assume the following is the list of targets (class labels) in your data
some_data_targets = ["paris", "paris", "tokyo", "amsterdam", "paris", "tokyo", "tokyo", "tokyo", "amsterdam", "england"]

# fit your targets of the training data to the LabelEncoder instance
tgt_enc.fit(some_data_targets)

# get the set of unique classes
print(f"Unique categories: {list(tgt_enc.classes_)}")

# encode the targets as numerical labels
encoded_tgts = tgt_enc.transform(["tokyo", "tokyo", "paris"])
print(f"Encoded labels: {encoded_tgts}")

Unique categories: ['amsterdam', 'england', 'paris', 'tokyo']
Encoded labels: [3 3 2]
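The reverse mapping is often needed too, for example to show a prediction as a readable label. The following is a minimal sketch using the same toy targets; the variable names `enc`, `encoded` and `decoded` are chosen here for illustration only.

```python
from sklearn.preprocessing import LabelEncoder

# same toy targets as in the cell above
some_data_targets = ["paris", "paris", "tokyo", "amsterdam", "paris",
                     "tokyo", "tokyo", "tokyo", "amsterdam", "england"]

enc = LabelEncoder().fit(some_data_targets)

# numeric labels follow the sorted order of the unique classes
encoded = enc.transform(["tokyo", "tokyo", "paris"])

# inverse_transform maps numeric labels back to the original strings
decoded = enc.inverse_transform(encoded)
print(list(encoded), list(decoded))
```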


#### **Step 3: Building a Learning Model Using Naive Bayes Algorithm**

**IMPORTANT!** Regardless of the task, when building a learning model, always make sure that ONLY data from the training set is used to train the model. ***The testing set MUST NEVER be used to train/build the model***. The testing set is used only to report the results of your model, which is the last step of the process (after the model is trained and you have selected the best model).

**Note:** Building the model is also referred to as training the model.
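The same rule applies to feature extraction: the vocabulary should be learned from the training set and then reused, unchanged, for the test set. Below is a small sketch of that pattern; the two toy document lists are hypothetical and only serve as an illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical toy documents, for illustration only
train_docs = ["the rocket reached orbit", "the church held a service"]
test_docs = ["the rocket launch was delayed"]

vec = CountVectorizer()
X_tr = vec.fit_transform(train_docs)   # vocabulary is learned from the training set only
X_te = vec.transform(test_docs)        # test set is encoded with that same vocabulary

# both matrices share the same columns; words seen only in the test set are ignored
print(X_tr.shape, X_te.shape)
print(sorted(vec.vocabulary_))
```

Calling `fit_transform` on the test documents instead would build a second, different vocabulary, and the columns would no longer line up with what the model was trained on.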

In [None]:
from sklearn.naive_bayes import MultinomialNB
mnb_model = MultinomialNB()
mnb_model.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

#### **Step 4(a): Make Predictions Using the Trained Model**
**Note:** We are assuming that the model trained above is the best model for the data under consideration. *In reality, the best model is chosen by tuning hyperparameters on a separate tuning/validation data set*. While hyperparameter tuning is out of scope for this notebook, you can always look up articles and blog posts about the topic.
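For readers who want a starting point anyway, here is a rough sketch of one common way to tune the smoothing parameter `alpha` of `MultinomialNB`. It uses cross-validation on the training data (via `GridSearchCV`) in place of a fixed validation set; the toy corpus, the labels and the grid of `alpha` values are all assumptions made for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# hypothetical toy corpus with two classes, for illustration only
docs = ["rocket orbit launch", "orbit satellite rocket", "launch satellite moon",
        "church prayer faith", "faith bible church", "prayer bible sermon"]
labels = [0, 0, 0, 1, 1, 1]

# the pipeline refits the vectorizer inside every fold, so no fold leaks into another
pipe = Pipeline([("bow", CountVectorizer()), ("nb", MultinomialNB())])

# candidate smoothing values; the grid itself is an arbitrary choice
search = GridSearchCV(pipe, {"nb__alpha": [0.1, 0.5, 1.0]}, cv=3)
search.fit(docs, labels)
print(search.best_params_)
```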

In [None]:
# before testing the model, ensure your test data, both inputs and outputs, is ready for use:

# 1. preprocess your testing data into a document-term matrix (using the training vocabulary)
X_test = bow_vectorizer.transform(test_data.data)
X_test

<1432x35329 sparse matrix of type '<class 'numpy.int64'>'
	with 230051 stored elements in Compressed Sparse Row format>

In [None]:
# 2. store the known outputs corresponding to the articles in the test set
y_test = test_data.target

print(f"y_test array size: {y_test.shape}")

y_test array size: (1432,)


In [None]:
# finally, apply the model on the test set to make predictions
# in this case, predictions are classification labels
predictions = mnb_model.predict(X_test)

In [None]:
# using the same model but predictions are probabilities
y_pred_prob = mnb_model.predict_proba(X_test)
print(y_pred_prob)

[[2.75544189e-101 1.35696571e-093 9.04229140e-034 1.00000000e+000]
 [1.00000000e+000 8.11913323e-031 5.82342828e-045 1.00498728e-040]
 [7.23391406e-081 1.00000000e+000 6.57667626e-091 2.33963648e-074]
 [1.00000000e+000 2.06238640e-062 6.13232665e-085 3.97450971e-090]
 [4.78607486e-029 9.99999460e-001 5.40232163e-007 7.64770874e-016]
 [1.97449728e-016 3.80488166e-013 1.00000000e+000 1.31095855e-013]
 [1.00746827e-222 1.00000000e+000 1.75010680e-256 2.36523385e-255]
 [1.24078514e-069 1.00000000e+000 1.14014415e-084 1.21604211e-078]
 [1.00000000e+000 3.13263274e-015 1.72011457e-017 6.44230160e-014]
 [3.39809835e-025 1.00000000e+000 2.57276712e-039 1.83202609e-037]
 [9.21847391e-033 3.74504638e-027 2.58294255e-015 1.00000000e+000]
 [4.44624717e-012 8.12823934e-012 9.99999268e-001 7.32111989e-007]
 [1.15879118e-054 1.88240173e-035 1.20272394e-031 1.00000000e+000]
 [2.98126336e-165 2.98083778e-158 5.95385832e-006 9.99994046e-001]
 [6.53315939e-051 1.25933135e-047 9.99999955e-001 4.51282503e-
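Each row of the `predict_proba` output is a probability distribution over the classes, and the label returned by `predict` is the class with the largest probability in that row. The sketch below checks this on a tiny, hypothetical data set (the documents and labels are made up for illustration).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical toy data, for illustration only
docs = ["rocket orbit launch", "orbit satellite", "church prayer", "faith church bible"]
labels = np.array([0, 0, 1, 1])

vec = CountVectorizer()
model = MultinomialNB().fit(vec.fit_transform(docs), labels)

new_X = vec.transform(["satellite launch", "bible prayer"])
proba = model.predict_proba(new_X)

# each row sums to 1 (one probability per class in model.classes_)
print(proba.sum(axis=1))
# the predicted label is the class with the highest probability
print(model.classes_[proba.argmax(axis=1)], model.predict(new_X))
```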

#### **Step 4(b): Evaluate the Model**

In [None]:
# importing the metrics module from sklearn
from sklearn import metrics

# use the accuracy_score metric to calculate accuracy of the model
# you evaluate a model by comparing its predictions against the known outputs of the test set
accuracy = metrics.accuracy_score(y_test, predictions)
print(accuracy)

0.9168994413407822


In [None]:
# use the confusion matrix metric to understand the predictive power of the model
print(metrics.confusion_matrix(y_test, predictions))

[[371  11   2   5]
 [ 11 377   5   1]
 [  5   4 379  10]
 [  5  11  49 186]]
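The confusion matrix carries more information than the single accuracy number. As a sketch, the matrix printed above can be copied into NumPy to recover the accuracy and to compute per-class recall and precision (scikit-learn puts true classes on the rows and predicted classes on the columns).

```python
import numpy as np

# confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = np.array([[371,  11,   2,   5],
               [ 11, 377,   5,   1],
               [  5,   4, 379,  10],
               [  5,  11,  49, 186]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)       # per true class
precision = np.diag(cm) / cm.sum(axis=0)    # per predicted class

print(f"accuracy: {accuracy:.4f}")
print("recall per class:   ", np.round(recall, 3))
print("precision per class:", np.round(precision, 3))
```

The diagonal sums to 1313 of 1432 test documents, which matches the accuracy reported above. The last class is recalled far less often than the others, mostly because 49 of its documents are predicted as the third class.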




---



## **Task: Classifying News Articles using Naive Bayes**

### **1. Create Dataset**

We will be using the modified version of the [BBC news dataset](http://mlg.ucd.ie/files/datasets/bbcsport-fulltext.zip) for this task. The zip file containing the raw data is made available on Canvas. Download the zip file and make sure that the file is available within your notebook session. 

**Instructions:** 
* Read all the .txt files in the bbc-updated.zip file 
* Text files read from the zip file must be stored in a Pandas dataframe along with the category the news article belongs to
  * You can use the subdirectory names while reading the files to store category names as corresponding targets in the dataframe
* The dataframe should consist of two columns: 'Text' and 'Category'. Here, ***Text*** column is an attribute and ***Category*** is the target corresponding to the attribute. 
* Use ```pandas.DataFrame.shape``` to print the size of the dataframe after the dataframe is created (useful in verifying all the text files are read)
* You can also use ```pandas.DataFrame.head(n)``` to view the first 'n' examples in the dataframe (useful for verifying that the raw data has been processed as expected)

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
!ls -ltr /content/drive/MyDrive/School/CSCE5290/Final\ Project/enron_dataframe.pkl

-rw------- 1 root root 48510257 Nov 13 02:31 '/content/drive/MyDrive/School/CSCE5290/Final Project/enron_dataframe.pkl'


In [3]:
!pip3 install --upgrade pandas==1.3.1



In [4]:
# add your code below this comment
import os
import pandas as pd

df = pd.read_pickle("/content/drive/MyDrive/School/CSCE5290/Final Project/enron_dataframe.pkl")
print(df.head())
print(df.shape)

                                                Text    Person
0                                                     arnold-j
1                            let's push until monday  arnold-j
2                                        what's pdx?  arnold-j
3  BMO wants to do this sleave trade. Duke, Dyneg...  arnold-j
4  I'm big seller of interventions. they tend not...  arnold-j
(95573, 2)


In [5]:
import numpy as np

persons = np.unique(df['Person'])

In [6]:
persons

array(['allen-p', 'arnold-j', 'arora-h', 'badeer-r', 'bailey-s', 'bass-e',
       'baughman-d', 'beck-s', 'benson-r', 'blair-l', 'brawner-s',
       'buy-r', 'campbell-l', 'carson-m', 'cash-m', 'causholli-m',
       'corman-s', 'crandell-s', 'cuilla-m', 'dasovich-j', 'davis-d',
       'dean-c', 'delainey-d', 'derrick-j', 'dickson-s', 'donoho-l',
       'donohoe-t', 'dorland-c', 'ermis-f', 'farmer-d', 'fischer-m',
       'forney-j', 'fossum-d', 'gang-l', 'gay-r', 'geaccone-t',
       'germany-c', 'gilbertsmith-d', 'giron-d', 'griffith-j',
       'grigsby-m', 'haedicke-m', 'hayslett-r', 'heard-m',
       'hendrickson-s', 'hernandez-j', 'hodge-j', 'holst-k', 'horton-s',
       'hyatt-k', 'hyvl-d', 'jones-t', 'kaminski-v', 'kean-s', 'keavey-p',
       'keiser-k', 'king-j', 'kitchen-l', 'kuykendall-t', 'lavorato-j',
       'lay-k', 'lenhart-m', 'lewis-a', 'lokay-m', 'lokey-t', 'love-p',
       'lucci-p', 'maggi-m', 'mann-k', 'martin-t', 'may-l', 'mccarty-d',
       'mcconnell-m', 'mckay-b',

**Instructions:**
* Once the dataframe is created, split the data into two sets: (1) train set and (2) test set
* Split 30% of the data as test set
* Use ```sklearn.model_selection.train_test_split()``` for easy splitting of data
    * Set ```random_state``` parameter value to **237** to ensure reproducible results
* Use ```pandas.DataFrame.shape``` to print the sizes of the train and test sets after splitting the data

Reference documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html


In [7]:
# add your code below this comment
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df['Text'], df['Person'], test_size=0.1, random_state=237)



### **2. Feature Extraction (Prepare Inputs)**

**Question 1(a): Given the vocabulary *V*, what is the corresponding BoW representation for sentence *S* using the word absence/presence measure?**

***V: { fishing, likes, at, too, Beth, family, campfire, Adam, beach, the, vacation, lake, enjoys }***

***S: "Adam enjoys fishing at the lake. Beth likes fishing too."***

**Instructions:** This question is to be answered without using any code. **DO NOT** use scikit-learn or other NLP libraries to answer this question.

**Answer for Question 1(a):**

* [1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1] - only the presence of each vocabulary word is recorded

**Question 1(b): Given the same vocabulary *V*, what is the corresponding BoW representation for sentence *S* using the term frequency measure?**

***V: { fishing, likes, at, too, Beth, family, campfire, Adam, beach, the, vacation, lake, enjoys }***

***S: "Adam enjoys fishing at the lake. Beth likes fishing too."***

**Instructions:** This question is to be answered without using any code and/or scikit-learn. **DO NOT** use scikit-learn or other NLP libraries to answer this question.

**Answer for Question 1(b):**

* [2, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1] - the number of occurrences of each vocabulary word is recorded ("fishing" occurs twice)
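The two hand-derived answers above can be double-checked with a few lines of plain Python (no NLP library). This is only a sanity-check sketch; it assumes that words are split on non-word characters and matched against the vocabulary case-sensitively.

```python
import re
from collections import Counter

vocab = ["fishing", "likes", "at", "too", "Beth", "family", "campfire",
         "Adam", "beach", "the", "vacation", "lake", "enjoys"]
sentence = "Adam enjoys fishing at the lake. Beth likes fishing too."

# assumption: split on word characters and match the vocabulary case-sensitively
counts = Counter(re.findall(r"\w+", sentence))

frequency = [counts[word] for word in vocab]          # term-frequency measure
presence = [int(counts[word] > 0) for word in vocab]  # presence/absence measure

print(presence)
print(frequency)
```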




---



**Question 2: Given the same vocabulary (as in Questions 1a. and 1b.), write the BoW representations for the following sentence S using both measures (presence/absence of words and term frequency information):**

***S: He wanted to bring peace to his kingdom but his enemies killed him.***

**Instructions:** This question is to be answered without using any code and/or scikit-learn.  **DO NOT** use scikit-learn or other NLP libraries to answer this question.

**Answer for Question 2:** 

* Presence/Absence: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
* Frequency: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]



---



**Question 3: Consider the following code snippet:**

```
bow_vect = CountVectorizer()

some_corpus = [
     'It is raining heavily today.',
     'And the weather is unpredictable.',
     'Weather forecast for tomorrow says sunny.',
     'Is it raining heavily today?',
]

res_data = bow_vect.fit_transform(some_corpus)

bow_vect.get_feature_names()
>>> ['and', 'for', 'forecast', 'heavily', 'is', 'it', 'raining', 'says', 'sunny', 'the', 'today', 'tomorrow', 'unpredictable', 'weather']


res_data.toarray()
>>> array([ [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0],
            [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1],
            [0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1],
            [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0] ])

```

**In the above snippet, '>>>' indicates the output corresponding to the previous line of code. Something interesting happens after the sentences are converted to their BoW representations. Can you identify it? Then describe what might be affected when the encoded representations shown in the snippet are used.**

**Instructions:** You do not have to run the above code again, since the outputs that you need to answer the question are already provided.


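**Answer for Question 3:** The first and last sentences receive identical BoW representations (the first and fourth rows of the array match), even though the first is a statement and the last is a question. The vectorizer lowercases the text and discards punctuation and word order, so this representation loses part of the meaning of a sentence, and a model working from these encodings cannot tell the two sentences apart. It is also interesting that the vocabulary is sorted alphabetically.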
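A short check of this observation, as a sketch that reuses the corpus from the question (the variable name `res` is introduced here for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

some_corpus = [
    'It is raining heavily today.',
    'And the weather is unpredictable.',
    'Weather forecast for tomorrow says sunny.',
    'Is it raining heavily today?',
]

res = CountVectorizer().fit_transform(some_corpus).toarray()

# the statement (row 0) and the question (row 3) are indistinguishable
print(np.array_equal(res[0], res[3]))
```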



---



**Question 4: For the given task, use the term frequency measure to compute the BoW representations for the text documents in the modified version of the BBC dataset.**

**Instructions:** 
* This question is to be answered by using scikit-learn (similar to what was demonstrated in the tutorial section)
* Create BoW representations for train and test sets.

In [8]:
# add your code below this comment
import sys
import numpy
from sklearn.feature_extraction.text import CountVectorizer

numpy.set_printoptions(threshold=sys.maxsize)
# leaving the 'binary' parameter unset makes the encoding measure term frequency
vectorizer = CountVectorizer()
# learn the vocabulary on the train set and encode the train set
data_train = vectorizer.fit_transform(X_train)
print(vectorizer.get_feature_names())
# encode the test set with the train set vocabulary (transform, not fit_transform)
data_test = vectorizer.transform(X_test)
print(data_test.shape)





### **3. Prepare Outputs/Labels**

**Instructions:**
* Check to see how the outputs are in the data
* If categories are non-numeric, then encode them as numeric labels (similar to what was discussed in the tutorial demonstration)
* You will have to perform encoding for targets in both train and test sets
* Make sure to perform the same encoding that was done for the targets in the training set when encoding the targets in the test set.
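The last point can be sketched as follows: fit the encoder once, then call `transform` for both sets so that every label maps to the same integer. The target lists below are hypothetical and only illustrate the pattern.

```python
from sklearn.preprocessing import LabelEncoder

# hypothetical targets, for illustration only
y_tr = ["sport", "tech", "sport", "politics"]
y_te = ["tech", "sport"]

enc = LabelEncoder().fit(y_tr)       # learn the label-to-integer mapping once
y_tr_enc = enc.transform(y_tr)
y_te_enc = enc.transform(y_te)       # reuse the same mapping for the test set

print(list(enc.classes_), list(y_tr_enc), list(y_te_enc))

# a label the encoder has never seen cannot be encoded
try:
    enc.transform(["business"])
except ValueError as err:
    print("unseen label:", err)
```

If the test set may contain labels that are absent from the training targets, one option is to fit the encoder on the full list of known classes instead.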

In [9]:
## converting to numeric labels
from sklearn import preprocessing
tgt_enc = preprocessing.LabelEncoder()

# fit your targets of the training data to the LabelEncoder instance
tgt_enc.fit(persons)

# get the set of unique classes
print(f"Unique categories: {list(tgt_enc.classes_)}")

Unique categories: ['allen-p', 'arnold-j', 'arora-h', 'badeer-r', 'bailey-s', 'bass-e', 'baughman-d', 'beck-s', 'benson-r', 'blair-l', 'brawner-s', 'buy-r', 'campbell-l', 'carson-m', 'cash-m', 'causholli-m', 'corman-s', 'crandell-s', 'cuilla-m', 'dasovich-j', 'davis-d', 'dean-c', 'delainey-d', 'derrick-j', 'dickson-s', 'donoho-l', 'donohoe-t', 'dorland-c', 'ermis-f', 'farmer-d', 'fischer-m', 'forney-j', 'fossum-d', 'gang-l', 'gay-r', 'geaccone-t', 'germany-c', 'gilbertsmith-d', 'giron-d', 'griffith-j', 'grigsby-m', 'haedicke-m', 'hayslett-r', 'heard-m', 'hendrickson-s', 'hernandez-j', 'hodge-j', 'holst-k', 'horton-s', 'hyatt-k', 'hyvl-d', 'jones-t', 'kaminski-v', 'kean-s', 'keavey-p', 'keiser-k', 'king-j', 'kitchen-l', 'kuykendall-t', 'lavorato-j', 'lay-k', 'lenhart-m', 'lewis-a', 'lokay-m', 'lokey-t', 'love-p', 'lucci-p', 'maggi-m', 'mann-k', 'martin-t', 'may-l', 'mccarty-d', 'mcconnell-m', 'mckay-b', 'mckay-j', 'mclaughlin-e', 'meyers-a', 'mims-thurston-p', 'motley-m', 'neal-s', 'nem

In [10]:
labels_persons = tgt_enc.transform(persons)
print(f"Unique categories: {labels_persons}")


Unique categories: [  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35
  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53
  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71
  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89
  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107
 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143]


In [11]:
# add your code below this comment

# encode the targets of both sets with the encoder fitted above
output_train_data = tgt_enc.transform(y_train)
output_test_data = tgt_enc.transform(y_test)

# build the vocabulary on the train set only, then encode the test set with it
input_train_data = vectorizer.fit_transform(X_train)
input_test_data = vectorizer.transform(X_test)

# optional sanity check: inputs and outputs must have the same number of rows
#print(input_train_data.shape, output_train_data.shape)
#print(input_test_data.shape, output_test_data.shape)

### **4. Model Training and Evaluation - Multinomial Naive Bayes**


In [12]:
# add your code below this comment
# train a Multinomial Naive Bayes model on the encoded train set
from sklearn.naive_bayes import MultinomialNB
mnb_model = MultinomialNB()
mnb_model.fit(input_train_data, output_train_data)

MultinomialNB()

In [14]:
# finally, apply the model on the test set to make predictions
# in this case, predictions are classification labels
predictions = mnb_model.predict(input_test_data)
#y_pred_prob = mnb_model.predict_proba(input_test_data)
#print(y_pred_prob)

# importing the metrics module from sklearn
from sklearn import metrics

# use the accuracy_score metric to calculate accuracy of the model
# you evaluate a model by comparing its predictions against the known outputs of the test set
accuracy = metrics.accuracy_score(output_test_data, predictions)
print("Accuracy: ", accuracy)

# use the confusion matrix metric to understand the predictive power of the model
print("Confusion matrix: \n", metrics.confusion_matrix(output_test_data, predictions))

Accuracy:  0.3878426449047918
Confusion matrix: 
 [[  2   0   0   0   0   0   5   0   0   0   0   0   0   0   0   0   0   0
   29   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   3
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   2  12   1   0
    0   0   0   0   0   0   0   0   0   0   0   0   0  12   0   0   0   0
    0   0   0   0   0   0   0   1   0   0   0   1   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   6   0
    4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0]
 [  0  34   0   0   3   0   2   0   0   0   0   0   0   0   0   0   0   0
   38   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   4
    0   0   0   0   0   0   0   0   0   0   0   0   0   0  11  19   1   0
    0   0   0   0   0   0   3   0   0   0   0   0   0  20   0   0   0   0
    0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0

**MLPClassifier**

In [16]:
from sklearn.neural_network import MLPClassifier

mlp_model = MLPClassifier(random_state=237, max_iter=5, verbose=True)
#pfit = mlp_model.partial_fit(input_train_data, output_train_data, labels_persons)
pfit = mlp_model.fit(input_train_data, output_train_data)

Iteration 1, loss = 2.92973078
Iteration 2, loss = 1.77103694
Iteration 3, loss = 1.44205537
Iteration 4, loss = 1.24649223
Iteration 5, loss = 1.11819174




In [18]:
import pickle

# the model to save is the MLP trained above (pfit)

# save it to a file on the mounted Google Drive
pkl_filename = "/content/drive/MyDrive/School/CSCE5290/Final Project/mlp_enron_model_5e.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(pfit, file)
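Loading the model back is the mirror image of saving it. The sketch below shows a full save/load round trip on a tiny stand-in model; the data, the model and the temporary file name are assumptions made for this example, not the Enron model saved above.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# hypothetical tiny model, for illustration only
X = np.array([[2, 0], [0, 3], [1, 0], [0, 1]])
y = np.array([0, 1, 0, 1])
model = MultinomialNB().fit(X, y)

# hypothetical file name in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# the restored model gives the same predictions as the original
print(np.array_equal(model.predict(X), restored.predict(X)))
```

To reuse a saved text classifier in a new session, the fitted vectorizer and label encoder would need to be saved as well, since new text must be encoded with the same vocabulary and label mapping.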

In [19]:
# finally, apply the model on the test set to make predictions
# in this case, predictions are classification labels
mlp_predictions = pfit.predict(input_test_data)
#y_pred_prob = mnb_model.predict_proba(input_test_data)
#print(y_pred_prob)

# importing the metrics module from sklearn
from sklearn import metrics

# use the accuracy_score metric to calculate accuracy of the model
# you evaluate a model by comparing its predictions against the known outputs of the test set
mlp_accuracy = metrics.accuracy_score(output_test_data, mlp_predictions)
print("Accuracy: ", mlp_accuracy)

# use the confusion matrix metric to understand the predictive power of the model
#print("Confusion matrix: \n", metrics.confusion_matrix(output_test_data, mlp_predictions))

Accuracy:  0.6659342958777987


In [20]:
print(persons[mlp_predictions[15]])

germany-c


In [21]:
list(X_test)[15]

'We have a new Transco 6-6 contract. Term 9/6 - 9/30 Contract # 3.6878 MDQ 4,752 Rec Leidy #6161 Del Mainline BG&E #7221 (Non New York) Please enter this on the morning sheet. Thanks'

**This is a good example of a sent email where the sender does not identify him/herself.**

### **Establishing a baseline model**

Baseline models are helpful for easy comparison of the models you build. These models are trained using simple heuristics or rules.

**Instructions:**
* All you have to do is run the following block of code. Report the accuracy of your model (the Naive Bayes one) in comparison to the baseline model created in the following code block
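To make the "most frequent" heuristic concrete, here is a small sketch on hypothetical labels. It shows that the baseline ignores the inputs and always predicts the majority class of the training targets, so its test accuracy is simply the share of that class in the test set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# hypothetical labels, for illustration only; features are ignored by this strategy
y_tr = np.array([0, 0, 0, 1, 2])
y_te = np.array([0, 1, 0, 2])
X_tr = np.zeros((len(y_tr), 1))
X_te = np.zeros((len(y_te), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
preds = baseline.predict(X_te)

# equivalent by hand: always predict the majority class of the training labels
majority = np.bincount(y_tr).argmax()
print(preds, majority, (y_te == majority).mean())
```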

In [None]:
# Baselines are simple heuristics to make predictions for a given task
# just execute this code block; nothing needs to be added/modified
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
# choose 'most-frequent class' as the baseline method
baseline_model = DummyClassifier(strategy="most_frequent")

# fit the baseline model on the training data
baseline_model.fit(input_train_data, output_train_data)

# make predictions on the test data using the created baseline model
baseline_preds = baseline_model.predict(input_test_data)

# compute the accuracy of the baseline model
print(accuracy_score(output_test_data, baseline_preds))

0.05580357142857143


**Report the accuracies for the baseline and NB models here. Type your answer below! Indicate clearly the numbers corresponding to the models.**

In the outputs shown above, the baseline model (the Dummy Classifier with the most-frequent strategy) has a very low accuracy of about 0.056, whereas the Multinomial Naive Bayes model has an accuracy of about 0.388.



---



# **References**

* D. Greene and P. Cunningham. "Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering", Proc. ICML 2006.
* [Datasets (scikit-learn)](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets)
* [All about Naive Bayes (from a scikit-learn perspective)](https://scikit-learn.org/stable/modules/naive_bayes.html)
* [Multinomial Naive Bayes API](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn-naive-bayes-multinomialnb)
* [Evaluation metrics](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics)
* [Feature extraction module](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction)
* [Transforming prediction targets using LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)


