# Classification and the Naive Bayes Algorithm
## Mastering Machine Learning on AWS
### Section 2, Chapter 2

In this notebook we use pre-labelled text documents from Kaggle, found [here](https://www.kaggle.com/datasets/dipankarsrirag/topic-modelling-on-emails), for classification. The dataset has four categories: `Crime` (1100), `Entertainment` (1053), `Politics` (3001), and `Science` (4000). We train a binary classifier on `Politics` versus `Science`. After duplicate and unreadable files are removed, each class contributes 2435 documents for training, 150 for validation, and 150 for testing. Our evaluation metric is model accuracy.

# Contents 

1. [Modules and Libraries](#Modules-and-Libraries)  
2. [Download Data](#Download-Data)  
3. [Define data variables and Import Data](#Define-data-variables-and-Import-Data)  
    1. [Importing Data with `read_text_files`](#Importing-Data-with-read_text_files)
    2. ~[Split Data, creating text files for later analysis](#Split-Data,-creating-text-files-for-later-analysis)~  
4. [Data Exploration](#Data-Exploration)
    1. [Crime Data](#Crime-Data)
    2. [Entertainment Data](#Entertainment-Data)
    3. [Science Data](#Science-Data)
    4. [Politics Data](#Politics-Data)
5. [Duplicate Data Check](#Duplicate-Data-Check)
    1. [Crime vs Entertainment](#Crime-vs-Entertainment)
    2. [Science vs Politics](#Science-vs-Politics)
    3. [Science vs Crime](#Science-vs-Crime)
    4. [Politics vs Crime](#Politics-vs-Crime)  
6. [Naive Bayes Classifier](#Naive-Bayes-Classifier)
    1. [Split the data: Training, Validation, and Testing sets](#Split-the-data:-Training,-Validation,-and-Testing-sets)
    2. [Politics vs Science](#Politics-vs-Science)

## Modules and Libraries 

In [1]:
import numpy as np
import pandas as pd
import matplotlib as mlp
import matplotlib.pyplot as plt

import praw   # Python Reddit API Wrapper
import datetime as dt
import time
import requests

import nltk
from nltk.stem import WordNetLemmatizer
# nltk.download('stopwords') 
# nltk.download('punkt_tab')
# nltk.download("wordnet")
# nltk.download("omw-1.4")

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import AdaBoostClassifier

from sklearn.metrics import confusion_matrix

%matplotlib inline

class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'

import bnaivebayes as bnb
import readtextfunctions as rtf

## Download Data

The following script downloads the data and prints where it is stored.
```python
import kagglehub

# Download latest version
path = kagglehub.dataset_download("dipankarsrirag/topic-modelling-on-emails")

print("Path to dataset files:", path)
```
Output: `Path to dataset files: /Users/blakewallace/.cache/kagglehub/datasets/dipankarsrirag/topic-modelling-on-emails/versions/1`

## Define data variables and Import Data

In [2]:
import os

def read_text_files(folder_path):
    """
    Reads all text files in a specified folder and returns their content.

    Args:
        folder_path (str): The path to the folder containing the text files.

    Returns:
        dict: A dictionary where keys are file names and values are file contents.
               Returns an empty dictionary if the folder does not exist or 
               no text files are found.
    """
    file_contents = {}
    if not os.path.exists(folder_path):
        print(f"Error: Folder '{folder_path}' not found.")
        return file_contents
    
    for filename in os.listdir(folder_path):
        if filename.endswith(".txt"):
            file_path = os.path.join(folder_path, filename)
            try:
                with open(file_path, 'r') as file:
                    file_contents[filename] = file.read()
            except Exception as e:
                 print(f"Error reading {filename}: {e}")
    return file_contents

# Example Usage

# if text_files_data:
    # for filename, content in text_files_data.items():
        # print(f"Contents of {filename}:\n{content}\n---")
# else:
    # print("No text files found or an error occurred.")

### Importing Data with `read_text_files`

The data is read from the location it was downloaded to in the [Download Data](#Download-Data) section.

In [3]:
# utilize the `read_text_files` function in the readtextfunctions file
for directory in ['Crime', 'Entertainment', 'Politics', 'Science']:
    folder_path = "/Users/blakewallace/.cache/kagglehub/datasets/dipankarsrirag/topic-modelling-on-emails/versions/1/Data/" + directory + "/"  # Replace with the actual path to your folder
    print(folder_path)
    globals()['text_files_data_{}'.format(directory)] = rtf.read_text_files(folder_path)
    print()


/Users/blakewallace/.cache/kagglehub/datasets/dipankarsrirag/topic-modelling-on-emails/versions/1/Data/Crime/
Error reading 15672.txt: 'utf-8' codec can't decode byte 0xa0 in position 896: invalid start byte

/Users/blakewallace/.cache/kagglehub/datasets/dipankarsrirag/topic-modelling-on-emails/versions/1/Data/Entertainment/
Error reading 15672.txt: 'utf-8' codec can't decode byte 0xa0 in position 896: invalid start byte

/Users/blakewallace/.cache/kagglehub/datasets/dipankarsrirag/topic-modelling-on-emails/versions/1/Data/Politics/

/Users/blakewallace/.cache/kagglehub/datasets/dipankarsrirag/topic-modelling-on-emails/versions/1/Data/Science/
Error reading 15672.txt: 'utf-8' codec can't decode byte 0xa0 in position 896: invalid start byte
Error reading 53721.txt: 'utf-8' codec can't decode byte 0xfd in position 195: invalid start byte
Error reading 53803.txt: 'utf-8' codec can't decode byte 0xfd in position 285: invalid start byte
Error reading 53883.txt: 'utf-8' codec can't decode by
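
The decode errors above occur because `read_text_files` opens each file with the default encoding (UTF-8 on this system), and some files contain bytes such as `0xa0` that are not valid UTF-8. One possible way to keep those files, shown only as a sketch, is to retry with Latin-1, which maps every byte to a character. The helper name `read_text_with_fallback` and the choice of Latin-1 are assumptions and are not part of the `readtextfunctions` module; the counts reported in the rest of this notebook were produced without it.

```python
from pathlib import Path

def read_text_with_fallback(file_path, encodings=("utf-8", "latin-1")):
    """Return the file's text, trying each encoding in order.

    Hypothetical helper: Latin-1 never raises a decode error, so it acts
    as a last resort, but it may mis-render some characters.
    """
    raw = Path(file_path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue                      # try the next encoding in the list
    raise ValueError(f"none of {encodings} could decode {file_path}")
```

Using this inside the reading loop would change the document counts, so the duplicate checks and splits below would need to be rerun.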

## Data Exploration

### Crime Data

**There are 1099 readable files in the `Crime` folder.**

In [4]:
len(text_files_data_Crime)

1099

**We need the data to be in lists for some of the functions that will process the data.**

Create a list and a numpy array housing the Crime data.

In [5]:
Crime_list = list()
for key in text_files_data_Crime:
    Crime_list.append(text_files_data_Crime[key])

Crime_array = np.array(Crime_list)
# Crime_array

In [6]:
len(Crime_array)

1099

### Entertainment Data

**There are 1052 readable files in the `Entertainment` folder.**

In [7]:
len(text_files_data_Entertainment)

1052

Create a list and a numpy array housing the Entertainment data.

In [42]:
Entertainment_list = list()
for key in text_files_data_Entertainment:
    Entertainment_list.append(text_files_data_Entertainment[key])

Entertainment_array = np.array(Entertainment_list)
# Entertainment_array

In [9]:
len(Entertainment_array)

1052

### Science Data

In [10]:
len(text_files_data_Science)

3987

In [11]:
Science_list = list()
for key in text_files_data_Science:
    Science_list.append(text_files_data_Science[key])
# text_files_data_Crime[list(text_files_data_Crime.keys())[0]]

Science_array = np.array(Science_list)
# Science_array

In [12]:
len(Science_array)

3987

### Politics Data

In [13]:
len(text_files_data_Politics)

3001

In [14]:
Politics_list = list()
for key in text_files_data_Politics:
    Politics_list.append(text_files_data_Politics[key])
# text_files_data_Crime[list(text_files_data_Crime.keys())[0]]

Politics_array = np.array(Politics_list)
# Politics_array

In [15]:
len(Politics_array)

3001

## Duplicate Data Check

### Crime vs Entertainment

**Entertainment and Crime data are corrupt.  We will not use them in our final analysis.**

We can see from the following analysis that every document in the `Entertainment` folder is also in the `Crime` folder. We will not spend time deciding which dataset is corrupt. As will be shown shortly, neither dataset will be used for the final model training phase.

**First, attempt to train a model using the two sets of data.  This yields an accuracy of zero.**

A note on functions: we are using our own custom functions. See the accompanying file for more details.

The Naive Bayes model here removes stopwords and uses NLTK's built-in word lemmatizer (`nltk.stem.WordNetLemmatizer()`). However, at this point the analysis does not consider parts of speech (POS), and no POS-based weighting of term importance is applied.

In [43]:
# train a `Naive Bayes` model on 1000 elements from each dataset
crimeVSentertainment_weights = bnb.trainNaiveBayes(Crime_array[:1000], Entertainment_array[:1000]) 

There are **14883** terms in this model.

In [20]:
len(crimeVSentertainment_weights)

14883

In [21]:
# training accuracy
bnb.PredAccuracy(Crime_array[:1000], Entertainment_array[:1000], crimeVSentertainment_weights)

0.0

**Looking at an example element in the `Crime` data reveals that the two datasets might have the same data.**

The `Score` function determines which class label to assign to each instance. The left entry in the output list is the score for `Crime`, and the right entry is the score for `Entertainment`. Each score is non-positive, and the label with the higher score is assigned. In this example the two scores are identical, so the model cannot assign a label. This happens because each feature occurs the same number of times under both labels in the training data, so the feature probabilities, and therefore the weights, are identical.

The data is likely identical in both data sets.
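
The internals of `bnb.trainNaiveBayes` and `bnb.Score` are not shown in this notebook, so the following is only a minimal sketch of a multinomial Naive Bayes scorer under assumed details: whitespace tokenization, add-one smoothing, no class prior, and the hypothetical names `train_log_weights` and `score`. It illustrates why training two labels on identical documents gives identical weights and tied scores.

```python
import math
from collections import Counter

def train_log_weights(docs_a, docs_b):
    """Return {word: [log P(word | A), log P(word | B)]} with add-one smoothing."""
    counts_a = Counter(w for doc in docs_a for w in doc.lower().split())
    counts_b = Counter(w for doc in docs_b for w in doc.lower().split())
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + len(vocab)   # smoothed denominator, class A
    total_b = sum(counts_b.values()) + len(vocab)   # smoothed denominator, class B
    return {w: [math.log((counts_a[w] + 1) / total_a),
                math.log((counts_b[w] + 1) / total_b)] for w in vocab}

def score(doc, weights):
    """Sum the log-weights of the document's known words for each class."""
    totals = [0.0, 0.0]
    for w in doc.lower().split():
        if w in weights:
            totals[0] += weights[w][0]
            totals[1] += weights[w][1]
    return totals

# Identical training data for both labels produces a tie on every document.
docs = ["public key encryption", "mail the key"]
tied_weights = train_log_weights(docs, docs)
print(score("public key", tied_weights))
```

Because every log-probability is at most zero, each score is non-positive, and a document with no known words scores zero for both classes.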

In [22]:
# similar score for each of the class labels
bnb.Score(Crime_array[4], crimeVSentertainment_weights)

[-268.8481402335173, -268.8481402335173]

**Notice, all of the scores look to be identical in the weights generated by the Naive Bayes model.  This hints towards duplicate data.**

In [23]:
# first 20 elements in the weights dictionary
bnb.print_first_n_items(crimeVSentertainment_weights, 20)

archivename: [-9.230412570777837, -9.230412570777837]
ripemfaq: [-10.904389004349508, -10.904389004349508]
lastupdate: [-10.904389004349508, -10.904389004349508]
sun: [-8.28942922631331, -8.28942922631331]
mar: [-10.616706931897728, -10.616706931897728]
post: [-6.203908638557092, -6.203908638557092]
still: [-6.826851560443789, -6.826851560443789]
rather: [-6.891013504661075, -6.891013504661075]
rough: [-9.805776715681398, -9.805776715681398]
list: [-6.879037313614359, -6.879037313614359]
likely: [-7.275613474305277, -7.275613474305277]
question: [-6.543415778873459, -6.543415778873459]
information: [-5.794411266920989, -5.794411266920989]
ripem: [-7.040156662757711, -7.040156662757711]
program: [-6.336574604905185, -6.336574604905185]
public: [-5.871774803534477, -5.871774803534477]
key: [-4.474669526310371, -4.474669526310371]
mail: [-6.556263921351308, -6.556263921351308]
encryption: [-5.152875126872117, -5.152875126872117]
faq: [-7.368272304787982, -7.368272304787982]


**We can confirm that all of the weights are identical.**

**Number of features in the Naive Bayes model**

In [25]:
len(crimeVSentertainment_weights)

14883

**Generate a comparison count between the label scores in the set of weights.**

In [26]:
number_different = 0                                       # count the number that are different
number_checked = 0                                         # count the number that are checked
for key, value in crimeVSentertainment_weights.items():
    if value[0] == value[1]:                               # compare each entry
        number_checked += 1
    else:
        number_different += 1
        number_checked += 1

print('Number different:' + color.BOLD + f' {number_different}' + color.END)
print('Number checked:' + color.BOLD + f' {number_checked}' + color.END)
print('Number identical:' + color.BOLD + f' {number_checked - number_different}' + color.END)

Number different: 0
Number checked: 14883
Number identical: 14883


**The intersection of `Crime` and `Entertainment` has 1052 elements, which is the exact number of elements in the `Entertainment` file.  Thus, everything in the `Entertainment` file is in the `Crime` file.  Duplicate data.**

In [27]:
# Intersection, Crime vs Entertainment
len(set(text_files_data_Crime.keys()).intersection(set(text_files_data_Entertainment.keys())))

1052
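
The intersection above compares file names only. As a hedged sketch, the contents of the shared files can be compared as well. `identical_shared_documents` is a hypothetical helper, and the toy dictionaries below stand in for `text_files_data_Crime` and `text_files_data_Entertainment`.

```python
def identical_shared_documents(docs_a, docs_b):
    """Return the shared file names whose contents also match exactly."""
    shared = docs_a.keys() & docs_b.keys()          # file names present in both
    return sorted(name for name in shared if docs_a[name] == docs_b[name])

# Toy stand-ins for two {file name: content} dictionaries
crime = {"1.txt": "key escrow", "2.txt": "encryption faq", "3.txt": "only here"}
entertainment = {"1.txt": "key escrow", "2.txt": "different text"}

print(identical_shared_documents(crime, entertainment))
```

If the returned list is as long as the name intersection, the shared files are true duplicates and not just name collisions.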

**47 elements are present in the `Crime` file and not the `Entertainment` file**

In [28]:
# Show the number of elements in the Crime file that are not in the Entertainment file
len(set(text_files_data_Crime.keys()) - set(text_files_data_Entertainment.keys()))

47

**Here are the 47 document keys present in the Crime file that are not present in the Entertainment file.**

In [29]:
# Crime and Not Entertainment
print(list(set(text_files_data_Crime.keys()) - set(text_files_data_Entertainment.keys())))

['52808.txt', '52778.txt', '52773.txt', '52769.txt', '52775.txt', '52798.txt', '52811.txt', '52784.txt', '52777.txt', '52789.txt', '52782.txt', '52787.txt', '52767.txt', '52796.txt', '52794.txt', '52779.txt', '52803.txt', '52809.txt', '52812.txt', '52770.txt', '52776.txt', '52791.txt', '52795.txt', '52772.txt', '52804.txt', '52810.txt', '52788.txt', '52783.txt', '52805.txt', '52806.txt', '52771.txt', '52790.txt', '52793.txt', '52781.txt', '52768.txt', '52792.txt', '52797.txt', '52802.txt', '52800.txt', '52799.txt', '52780.txt', '52801.txt', '52785.txt', '52813.txt', '52786.txt', '52774.txt', '52807.txt']


**We will look more closely at the other sets of documents to make sure there are no duplicates in them.**

### Science vs Politics 

**266** documents appear in both the `Science` and `Politics` sets.

In [31]:
# Science, number of elements
len(text_files_data_Science)

3987

In [32]:
# Politics, number of elements
len(text_files_data_Politics)

3001

In [33]:
# Intersection, Science vs Politics
len(set(text_files_data_Science.keys()).intersection(set(text_files_data_Politics.keys())))

266

**266 documents are in both Science and Politics.**

In [34]:
# create a list of shared keys
shared_keys_SvsP = list()
for key in text_files_data_Science.keys():
    if key in text_files_data_Politics:
        shared_keys_SvsP.append(key)

# number of shared keys
print('Number of shared keys:' + color.BOLD + f' {len(shared_keys_SvsP)}' + color.END + '\n')

# the first five document keys
print(f'The first five shared documents: {shared_keys_SvsP[:5]}.')

Number of shared keys: 266

The first five shared documents: ['54116.txt', '54117.txt', '54118.txt', '54119.txt', '54120.txt'].


### Science vs Crime

In [38]:
# Science, number of elements
len(text_files_data_Science)

3987

In [39]:
# Intersection, Science vs Crime
len(set(text_files_data_Science.keys()).intersection(set(text_files_data_Crime.keys())))

1099

In [40]:
# Crime, number of elements
len(text_files_data_Crime)

1099

**Every element in the Crime dataset is in the Science dataset.**

### Politics vs Crime

In [41]:
# Intersection, Politics vs Crime
len(set(text_files_data_Politics.keys()).intersection(set(text_files_data_Crime.keys())))

0

**There is no overlap between the `Politics` and `Crime` datasets.  A classifier could be trained between them.**

## Naive Bayes Classifier

It has been found that everything in Entertainment is in Crime, and everything in Crime is in Science.  Thus, any attempt to train a classifier between Science and Crime or Entertainment will not work. 

There is no overlap between the Politics and Crime datasets, and there is a slight overlap between the Politics and Science sets. We could look at two models, one classifying between Politics and Crime and the other between Politics and Science. However, the Crime set is a subset of the Science set, so a model between Politics and Crime is the same as a model between Politics and a subset of Science. We will not build it. Instead, we focus exclusively on a model between Politics and Science.

## Politics vs Science 

Recall, there were **266** shared elements between these two datasets.  We will begin by removing these duplicates from each dataset.  

In [33]:
len(shared_keys_SvsP)

266

In [34]:
len(text_files_data_Politics.keys())

3001

In [35]:
# remove the shared keys from Politics
for key in shared_keys_SvsP:
    if key in text_files_data_Politics:
        del text_files_data_Politics[key]

len(text_files_data_Politics)

2735

In [36]:
# remove the shared keys from Science
for key in shared_keys_SvsP:
    if key in text_files_data_Science:
        del text_files_data_Science[key]

len(text_files_data_Science)

3721

### Split the data: Training, Validation, and Testing sets

#### New lists and arrays of data

In [37]:
print('Politics set size:' + color.BOLD + f' {len(text_files_data_Politics)}' + color.END)
print('Science set size:' + color.BOLD + f' {len(text_files_data_Science)}' + color.END)

Politics set size: 2735
Science set size: 3721


In [140]:
# Politics
Politics_list = list()
for key in text_files_data_Politics:
    Politics_list.append(text_files_data_Politics[key])

Politics_array = np.array(Politics_list)

# Science
Science_list = list()
for key in text_files_data_Science:
    Science_list.append(text_files_data_Science[key])

Science_array = np.array(Science_list)

In [39]:
print('Politics set size:' + color.BOLD + f' {len(Politics_array)}' + color.END)
print('Science set size:' + color.BOLD + f' {len(Science_array)}' + color.END)

Politics set size: 2735
Science set size: 3721


#### Train, Validation, Test Split

**<u>Instances from each set</u>:**  
Training: 2435 instances  
Validation: 150 instances  
Testing: 150 instances  

These numbers use all of the instances in the Politics set.

**Politics**

In [154]:
np.random.seed(42)
# np.random.shuffle(Politics_array)

Politics_copy = np.empty_like(Politics_array)
np.copyto(Politics_copy, Politics_array)

np.random.shuffle(Politics_copy)

In [155]:
train_p = list(Politics_copy[:2435])
val_p = list(Politics_copy[2435:2585])
test_p = list(Politics_copy[2585:])

In [156]:
print('Politics Training size:' + color.BOLD + f' {len(train_p)}' + color.END)
print('Politics Validation size:' + color.BOLD + f' {len(val_p)}' + color.END)
print('Politics Testing size:' + color.BOLD + f' {len(test_p)}' + color.END)

Politics Training size: 2435
Politics Validation size: 150
Politics Testing size: 150


**Science**

In [157]:
np.random.seed(42)
# np.random.shuffle(Science_array)

# np.random.seed(42)
# np.random.shuffle(Politics_array)

Science_copy = np.empty_like(Science_array)
np.copyto(Science_copy, Science_array)

np.random.shuffle(Science_copy)

In [158]:
train_s = list(Science_copy[:2435])
val_s = list(Science_copy[2435:2585])
test_s = list(Science_copy[2585:2735])

In [159]:
print('Science Training size:' + color.BOLD + f' {len(train_s)}' + color.END)
print('Science Validation size:' + color.BOLD + f' {len(val_s)}' + color.END)
print('Science Testing size:' + color.BOLD + f' {len(test_s)}' + color.END)

Science Training size: 2435
Science Validation size: 150
Science Testing size: 150
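
The shuffle-and-slice steps above are repeated for each class. As a sketch, they could be wrapped in one helper. The name `three_way_split` is hypothetical, and it uses `np.random.default_rng` in place of `np.random.seed`, so the shuffled order (and therefore the exact scores below) would differ from what this notebook reports.

```python
import numpy as np

def three_way_split(items, n_train, n_val, n_test, seed=42):
    """Shuffle a copy of `items` and slice it into train/validation/test lists."""
    if n_train + n_val + n_test > len(items):
        raise ValueError("requested more items than are available")
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(items))          # shuffled indices; `items` is untouched
    shuffled = [items[i] for i in order]
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# Toy usage; the notebook's sizes would be (2435, 150, 150) per class
toy_docs = [f"document {i}" for i in range(20)]
train, val, test = three_way_split(toy_docs, 14, 3, 3)
print(len(train), len(val), len(test))
```

Slicing the test set with an explicit upper bound mirrors the Science cell above, where only the first 2735 shuffled documents are used so that both classes have equal sizes.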


### Politics vs Science

In [160]:
# Training
PvsS_weights = bnb.NaiveBayes(train_p, train_s)
bnb.LogWeights(PvsS_weights)

In [161]:
# There are 50617 features in the trained model
len(PvsS_weights)

50617

In [162]:
# Training Score
train_score = bnb.PredAccuracy(train_p, train_s, PvsS_weights)
train_score

0.9936344969199179

In [163]:
# Validation Score
val_score = bnb.PredAccuracy(val_p, val_s, PvsS_weights)
val_score

0.9733333333333334

In [164]:
# Training - Validation
print(train_score - val_score)
print(str(round(train_score - val_score, 5)*100) + '%')

0.020301163586584514
2.03%


In [165]:
# Testing Score
bnb.PredAccuracy(test_p, test_s, PvsS_weights)  

0.9833333333333333

## Misclassified Training cases explored

### Number Misclassified

In [166]:
print('Training score:' + color.BOLD + f' {train_score}' + color.END)
print('Training set size:' + color.BOLD + f' {len(train_p) + len(train_s)}' + color.END)
print('Total Misclassified:' + color.BOLD + f' {round((1 - train_score) * (len(train_p) + len(train_s)))}' + color.END)

Training score: 0.9936344969199179
Training set size: 4870
Total Misclassified: 31


### Misclassified Cases located

#### Cases with empty string

In [167]:
Politics_list[1:2]

['']

**There are 19 empty cases in the training data.**

7 in Politics  
12 in Science

In [168]:
# Politics set, number empty
number_empty_p = 0
for item in train_p:
    if item == '':
        number_empty_p += 1

# Science set, number empty
number_empty_s = 0
for item in train_s:
    if item == '':
        number_empty_s += 1

print('Politics, number empty:' + color.BOLD + f' {number_empty_p}' + color.END)
print('Science, number empty:' + color.BOLD + f' {number_empty_s}' + color.END)

Politics, number empty: 7
Science, number empty: 12


**These are being misclassified because the `Score` function returns equal scores for each class.**

In [169]:
bnb.Score(Politics_list[1:2][0], PvsS_weights)

[0, 0]

**These 19 empty cases account for 19 of the 31 errors in the training score.**

In [170]:
len(train_p)

2435

In [171]:
for i in range(7):
    train_p.remove('')

In [172]:
len(train_p)

2428

In [173]:
len(train_s)

2435

In [174]:
for i in range(12):
    train_s.remove('')

In [175]:
len(train_s)

2423

In [176]:
# Politics set, number empty
number_empty_p = 0
for item in train_p:
    if item == '':
        number_empty_p += 1

# Science set, number empty
number_empty_s = 0
for item in train_s:
    if item == '':
        number_empty_s += 1

print('Politics, number empty:' + color.BOLD + f' {number_empty_p}' + color.END)
print('Science, number empty:' + color.BOLD + f' {number_empty_s}' + color.END)

Politics, number empty: 0
Science, number empty: 0
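
The removal loops above hard-code the counts 7 and 12. A sketch that needs no counts is shown here. It assumes that whitespace-only documents should be dropped too, which goes beyond the notebook (the loops above remove only exact `''` matches), and `drop_empty_documents` is a hypothetical name.

```python
def drop_empty_documents(documents):
    """Return the non-empty documents and the number that were dropped.

    A document counts as empty if it contains only whitespace.
    """
    kept = [doc for doc in documents if doc.strip() != ""]
    return kept, len(documents) - len(kept)

# Toy usage with one empty and one whitespace-only document
kept, dropped = drop_empty_documents(["white house e-mail", "", "  \n", "thanks"])
print(kept, dropped)
```

Applying the same filter to the validation and test lists would keep tied `[0, 0]` scores out of those accuracy figures as well.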


In [177]:
# training accuracy, recomputed
train_score_1 = bnb.PredAccuracy(train_p, train_s, PvsS_weights)

In [178]:
print('Training score:' + color.BOLD + f' {train_score_1}' + color.END)
print('Training set size:' + color.BOLD + f' {len(train_p) + len(train_s)}' + color.END)
print('Total Misclassified:' + color.BOLD + f' {round((1 - train_score_1) * (len(train_p) + len(train_s)))}' + color.END)

Training score: 0.9975262832405689
Training set size: 4851
Total Misclassified: 12


#### Non-empty misclassified cases

**Locate misclassified Politics cases**

In [191]:
index_missclassified = list()
for i in range(len(train_p)):
    if bnb.Score(train_p[i], PvsS_weights)[0] < bnb.Score(train_p[i], PvsS_weights)[1]:
        index_missclassified.append(i)

index_missclassified

[141, 2107, 2277, 2323, 2391]

In [192]:
bnb.Score(train_p[141], PvsS_weights)

[-66.76779799996136, -62.97423908335089]

In [195]:
# 141
print(train_p[141])


   

does anyone have the e-mail address for the white house. if so please send it to
me thanks a lot.


    



In [196]:
# 2107
print(train_p[2107])


Thanks to everyone who sent replies regarding this case.  A few of them were
very informative and helped very much. 


                     Once again.
 THANKS!                                                  T.C.



In [197]:
# 2277
print(train_p[2277])



	If you look through this newsgroup, you should be 
	able to find Clinton's proposed "Wiretapping" Initiative
	for our computer networks and telephone systems.

	This 'initiative" has been up before Congress for at least
	the past 6 months, in the guise of the "FBI Wiretapping"
	bill.

	I strongly urge you to begin considering your future.

	I strongly urge you to get your application for a passport
	in the mail soon.

	I strongly urge you to consider moving any savings you 
	have overseas, into protected bank accounts, while 
	you are still able.





In [209]:
print(bnb.Score(train_p[2277], PvsS_weights))
bnb.FeatureFunction(train_p[2277])

[-373.5545973640566, -370.74823733323814]


[('look', 1),
 ('newsgroup', 1),
 ('able', 2),
 ('find', 1),
 ('clintons', 1),
 ('propose', 1),
 ('wiretapping', 2),
 ('initiative', 2),
 ('computer', 1),
 ('network', 1),
 ('telephone', 1),
 ('systems', 1),
 ('congress', 1),
 ('least', 1),
 ('past', 1),
 ('months', 1),
 ('guise', 1),
 ('fbi', 1),
 ('bill', 1),
 ('strongly', 3),
 ('urge', 3),
 ('begin', 1),
 ('consider', 2),
 ('future', 1),
 ('get', 1),
 ('application', 1),
 ('passport', 1),
 ('mail', 1),
 ('soon', 1),
 ('move', 1),
 ('savings', 1),
 ('overseas', 1),
 ('protect', 1),
 ('bank', 1),
 ('account', 1),
 ('still', 1)]

In [215]:
print(bnb.Score(train_p[2277], PvsS_weights))
print()
for index, count in bnb.FeatureFunction(train_p[2277]):
    print(index+':', PvsS_weights[index], count)
print()
print(bnb.Score(train_p[2277], PvsS_weights))

[-373.5545973640566, -370.74823733323814]

look: [-6.453388447208584, -6.3206198582851485] 1
newsgroup: [-9.068348225244781, -8.390399489053248] 1
able: [-7.539245460901568, -7.258170589586153] 2
find: [-6.369698798427278, -6.154063781600403] 1
clintons: [-9.104066307846862, -9.802669336576399] 1
propose: [-8.600539986562483, -8.055760433513695] 1
wiretapping: [-10.713504220280962, -9.751376042188848] 2
initiative: [-9.240198482171442, -8.896960714032781] 2
computer: [-8.70468024581508, -6.994535676917207] 1
network: [-8.622763123347193, -7.548611284477014] 1
telephone: [-9.527880554623222, -8.102717416601466] 1
systems: [-8.780666152793001, -7.0981340775816335] 1
congress: [-7.630760569737341, -8.527600610566733] 1
least: [-7.206037645024762, -7.03337551023347] 1
past: [-7.838399934515586, -8.182760124275003] 1
months: [-8.067974376160086, -7.819854630585635] 1
guise: [-10.546450135617796, -11.360813954622948] 1
fbi: [-6.683294220394736, -8.527600610566733] 1
bill: [-7.095242333380979

In [223]:
# there are words we could potentially remove from the training features
non_discrimitive = ['still', 'soon', 'get', 'past', 'least', 'find', 'able', 'look']
non_discrimitive

['still', 'soon', 'get', 'past', 'least', 'find', 'able', 'look']
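
One way to make "non-discriminative" concrete, offered as a sketch and not as the notebook's method, is to measure the gap between a word's two log-weights: a gap near zero means the word shifts both class scores almost equally. `weight_gaps` is a hypothetical helper, and the sample values are copied from the `PvsS_weights` entries printed above, ordered as `[Politics, Science]`.

```python
def weight_gaps(weights, words):
    """Return (word, |log-weight gap|) pairs, smallest gap first."""
    gaps = [(word, abs(weights[word][0] - weights[word][1]))
            for word in words if word in weights]
    return sorted(gaps, key=lambda pair: pair[1])

# Values copied from the weights printed above: [Politics, Science]
sample_weights = {
    'look': [-6.453388447208584, -6.3206198582851485],
    'find': [-6.369698798427278, -6.154063781600403],
    'fbi': [-6.683294220394736, -8.527600610566733],
    'computer': [-8.70468024581508, -6.994535676917207],
}

for word, gap in weight_gaps(sample_weights, sample_weights):
    print(f'{word}: {gap:.3f}')
```

In this sample, `look` and `find` have much smaller gaps than `fbi` and `computer`, which is consistent with treating them as candidates for removal.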

In [221]:
from nltk.corpus import stopwords
STOP_WORDS = set(stopwords.words('english'))
# STOP_WORDS

stop_words = set()
for word in STOP_WORDS:
    new_word = str()                                     # create a string without punctuation
    for char in word:                                    # look at each character in the word
        if char.isalpha():                               # determine if the character is alphabetic
            new_word += char                             # if it is alphabetic, add it to the new word
    if new_word != '':
        stop_words.add(new_word)

print(len(STOP_WORDS))
print(len(stop_words))
print(stop_words)

198
196
{'ourselves', 'how', 'again', 'did', 'didnt', 'theyre', 'i', 'few', 'weve', 'just', 'me', 'other', 'thatll', 'against', 'whom', 'wed', 'too', 'your', 'have', 'the', 'can', 'do', 'wouldnt', 'youd', 'and', 'she', 'himself', 'here', 'both', 'between', 'who', 'into', 'them', 'off', 'all', 'does', 'now', 'should', 'o', 'itself', 'will', 'is', 'by', 're', 'up', 'arent', 'or', 'further', 'doesn', 'we', 'theyve', 'they', 'you', 'hers', 'youre', 'that', 'shant', 'out', 'hed', 'this', 'themselves', 'ill', 'no', 'hasn', 'nor', 'with', 'itd', 'until', 'hes', 'through', 'at', 'such', 'down', 'yourself', 'ours', 'dont', 've', 'don', 'as', 'be', 'isnt', 'mightnt', 'any', 'wasn', 'didn', 'y', 's', 'hell', 'a', 'what', 'before', 'where', 'wasnt', 'couldn', 't', 'werent', 'was', 'these', 'wouldn', 'wont', 'of', 'so', 'll', 'shed', 'youll', 'ive', 'during', 'youve', 'more', 'over', 'needn', 'havent', 'aren', 'mightn', 'yourselves', 'if', 'same', 'their', 'why', 'theirs', 'being', 'which', 'itll',

In [225]:
for word in non_discrimitive:
    print(word in stop_words)

False
False
False
False
False
False
False
False


In [383]:
docs = np.array(['The sun is shining.', 
                 'The weather is sweet.', 
                 'The sun is shining, the weather is sweet, and one and one is two.'])
docs

array(['The sun is shining.', 'The weather is sweet.',
       'The sun is shining, the weather is sweet, and one and one is two.'],
      dtype='<U65')

In [306]:
docs = list(['The sun is shining.', 
                 'The weather is sweet.', 
                 'The sun is shining, the weather is sweet, and one and one is two.'])
docs

['The sun is shining.',
 'The weather is sweet.',
 'The sun is shining, the weather is sweet, and one and one is two.']

In [307]:
for item in docs:
    print(type(item))

<class 'str'>
<class 'str'>
<class 'str'>


In [687]:
def vocabulary(documents: list[str]) -> dict[str, int]:
    '''Create the vocabulary given a list of documents.

    Maps each lemmatized, non-stop word to its index in alphabetical order.
    '''
    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import stopwords

    # Build the stop-word set with punctuation stripped (e.g. "don't" -> "dont")
    stop_words = set()
    for word in stopwords.words('english'):
        new_word = ''.join(char for char in word if char.isalpha())
        if new_word != '':
            stop_words.add(new_word)

    vocab_set = set()                                        # unique words seen so far

    # Initialize wordnet lemmatizer
    wnl = WordNetLemmatizer()

    for doc in documents:
        for word in doc.strip().lower().split():
            # keep only the alphabetic characters of the word
            new_word = ''.join(char for char in word if char.isalpha())
            if new_word != '':                               # make sure the new string is not empty
                newer_word = wnl.lemmatize(new_word, pos="v")    # lemmatize the word as a verb
                if newer_word not in stop_words:             # skip common words listed in the stop words
                    vocab_set.add(newer_word)

    # order the words alphabetically
    vocab_array = np.array(sorted(vocab_set))

    # map each word to its position in the sorted array
    vocab_dict = dict()
    for index in range(len(vocab_array)):
        vocab_dict[vocab_array[index]] = index

    return vocab_dict
    

In [317]:
vocabulary(docs)

{np.str_('one'): 0,
 np.str_('shin'): 1,
 np.str_('sun'): 2,
 np.str_('sweet'): 3,
 np.str_('two'): 4,
 np.str_('weather'): 5}

In [312]:
docs

['The sun is shining.',
 'The weather is sweet.',
 'The sun is shining, the weather is sweet, and one and one is two.']

In [321]:
training_vocabulary = vocabulary(train_p+train_s)

In [672]:
train_p_vocabulary = vocabulary(train_p)

In [674]:
len(train_p_vocabulary)

30185

In [673]:
train_s_vocabulary = vocabulary(train_s)

In [675]:
len(train_s_vocabulary)

31418

In [322]:
len(training_vocabulary)

50617

In [323]:
len(PvsS_weights)

50617

In [709]:
def remove_punctuation(word: str) -> str:
    '''Return the word with every non-alphabetic character (punctuation and digits) removed.'''

    new_word = str()                                     # create a string without punctuation
    for char in word:                                    # look at each character in the word
        if char.isalpha():                               # determine if the character is alphabetic
            new_word += char                             # if it is alphabetic, add it to the new word

    return new_word

def lemmatize(word: str) -> str:
    from nltk.stem import WordNetLemmatizer
    wnl = WordNetLemmatizer()
    new_word = wnl.lemmatize(word, pos="v")
    return new_word

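def create_lemmatized_stopwords() -> set[str]:
    '''Return the NLTK English stop words with non-alphabetic characters removed (no lemmatization is applied).'''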
    from nltk.corpus import stopwords

    # Create a set of stop words 
    STOP_WORDS = set(stopwords.words('english'))

    stop_words = set()
    for word in STOP_WORDS:
        new_word = str()                                     # create a string without punctuation
        for char in word:                                    # look at each character in the word
            if char.isalpha():                               # determine if the character is alphabetic
                new_word += char                             # if it is alphabetic, add it to the new word
        if new_word != '':
            stop_words.add(new_word)

    return stop_words

    
    # if new_word != '':                                   # make sure the new string is not empty

    #     newer_word = wnl.lemmatize(new_word, pos="v")    # lemmatize the word
    #     if newer_word not in stop_words:                 # check that the new word is not a common word listed in the stop words
    #         return newer_word
        
    

In [710]:
len(create_lemmatized_stopwords())

196

In [712]:
remove_punctuation('shining')

'shining'

In [713]:
docs = np.array(['The sun is shining.', 
                 'The weather is sweet.', 
                 'The sun is shining, the weather is sweet, and one and one is two.'])
docs[:3]

array(['The sun is shining.', 'The weather is sweet.',
       'The sun is shining, the weather is sweet, and one and one is two.'],
      dtype='<U65')

In [470]:
docs = train_p+train_s

In [714]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
new_docs = list()
for doc in docs[:3]:
    doc_list = doc.strip().lower().split()
    # print(doc_list)
# print(docs[0].strip().lower().split())
# print()
    # new_doc_list = [remove_punctuation(word) for word in doc_list]
    # new_doc_list = [remove_punctuation(word) for word in doc_list]
    # new_doc_list = [wnl.lemmatize(remove_punctuation(word), pos="v") for word in doc_list]
    new_doc_list = [lemmatize(remove_punctuation(word)) for word in doc_list]
    # print(new_doc_list)
    new_docs.append(new_doc_list)

print(new_docs)

[['the', 'sun', 'be', 'shin'], ['the', 'weather', 'be', 'sweet'], ['the', 'sun', 'be', 'shin', 'the', 'weather', 'be', 'sweet', 'and', 'one', 'and', 'one', 'be', 'two']]


In [715]:
stop_words = create_stopwords()
from scipy.sparse import csr_matrix
# docs = [["hello", "world", "hello"], ["goodbye", "cruel", "world"]]
# docs = [['the', 'sun', 'be', 'shin'], ['the', 'sun', 'be', 'shin'],['the', 'sun', 'be', 'shin']]
indptr = [0]
indices = []
data = []
vocabulary = {}
for d in new_docs:
    for term in d:
        if term in stop_words:
            continue
        elif term == '':
            continue
        else:
            index = vocabulary.setdefault(term, len(vocabulary))
            indices.append(index)
            data.append(1)
    indptr.append(len(indices))
term_frequency_vectors = csr_matrix((data, indices, indptr), dtype=int).toarray()

In [716]:
data

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

In [717]:
indices

[0, 1, 2, 3, 0, 1, 2, 3, 4, 4, 5]

In [718]:
vocabulary

{'sun': 0, 'shin': 1, 'weather': 2, 'sweet': 3, 'one': 4, 'two': 5}

In [719]:
indptr

[0, 2, 4, 11]

In [720]:
term_frequency_vectors

array([[1, 1, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0],
       [1, 1, 1, 1, 2, 1]])
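
The three lists built above follow the compressed sparse row (CSR) layout: row `i` owns the slice `indices[indptr[i]:indptr[i+1]]`, and a column index that repeats within a row is summed when the matrix is converted to a dense array. The sketch below decodes the lists by hand to make that explicit. The helper name `decode_csr` is hypothetical and not part of the notebook; the input values are copied from the cell outputs shown here.

```python
import numpy as np

def decode_csr(data, indices, indptr, n_cols):
    """Rebuild a dense count matrix from CSR lists, summing repeated columns."""
    n_rows = len(indptr) - 1
    dense = np.zeros((n_rows, n_cols), dtype=int)
    for row in range(n_rows):
        # the entries of this row live between two consecutive indptr values
        start, stop = indptr[row], indptr[row + 1]
        for col, value in zip(indices[start:stop], data[start:stop]):
            dense[row, col] += value      # a repeated column accumulates a count
    return dense

# values copied from the cells for the three example sentences
data = [1] * 11
indices = [0, 1, 2, 3, 0, 1, 2, 3, 4, 4, 5]
indptr = [0, 2, 4, 11]

dense = decode_csr(data, indices, indptr, n_cols=6)
print(dense)
```

The third row receives column 4 (`'one'`) twice, which is why its count is 2 in `term_frequency_vectors`.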

In [721]:
len(vocabulary)

6

In [722]:
vocabulary

{'sun': 0, 'shin': 1, 'weather': 2, 'sweet': 3, 'one': 4, 'two': 5}

In [723]:
sorted_vocabulary = dict(sorted(vocabulary.items()))
len(sorted_vocabulary)

6

In [474]:
len(PvsS_weights)

50617

In [591]:
vocabulary

{'sun': 0, 'shin': 1, 'weather': 2, 'sweet': 3, 'one': 4, 'two': 5}

In [592]:
term_frequency_vectors

array([[1, 1, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0],
       [1, 1, 1, 1, 2, 1]])

In [724]:
def tf_idf(termFrequencyVectors: list[list[int]], vocabulary: dict[str, int]) -> list[list[float]]:
    '''Return one TF-IDF vector per document, using log(1 + count) * log((1 + N) / (1 + df)).'''
    num_documents = len(termFrequencyVectors)

    # inverse document frequency, stored at each term's vocabulary index
    idf = [0.0] * len(vocabulary)
    for term, index in vocabulary.items():
        # count the documents that contain the term at least once
        document_frequency = 0
        for document in termFrequencyVectors:
            if document[index] != 0:
                document_frequency += 1
        idf[index] = np.log((1 + num_documents) / (1 + document_frequency))

    # weight the log-scaled term frequencies of each document by the idf values
    tfIdf = list()
    for item in termFrequencyVectors:
        tf = [np.log(1 + count) for count in item]
        weighted = [tf[index] * idf[index] for index in range(len(tf))]   # local name avoids shadowing tf_idf
        tfIdf.append(weighted)

    return tfIdf

In [725]:
for document in tf_idf(term_frequency_vectors, vocabulary):
    print(document)
    print()

[np.float64(0.1994060174175938), np.float64(0.1994060174175938), np.float64(0.0), np.float64(0.0), np.float64(0.0), np.float64(0.0)]

[np.float64(0.0), np.float64(0.0), np.float64(0.1994060174175938), np.float64(0.1994060174175938), np.float64(0.0), np.float64(0.0)]

[np.float64(0.1994060174175938), np.float64(0.1994060174175938), np.float64(0.1994060174175938), np.float64(0.1994060174175938), np.float64(0.761500010418809), np.float64(0.4804530139182014)]
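
The same weighting can be written with NumPy broadcasting, which tends to matter once the vocabulary grows to tens of thousands of terms. This is a minimal sketch under the same smoothing used in `tf_idf` above, namely `log(1 + count)` for the term frequency and `log((1 + N) / (1 + df))` for the inverse document frequency. The function name `tf_idf_matrix` is hypothetical, and the input is the small example matrix from this notebook.

```python
import numpy as np

def tf_idf_matrix(counts):
    """Vectorized TF-IDF: log(1 + count) * log((1 + N) / (1 + df))."""
    counts = np.asarray(counts, dtype=float)
    num_documents = counts.shape[0]
    # document frequency: number of rows in which each column is nonzero
    document_frequency = np.count_nonzero(counts, axis=0)
    idf = np.log((1 + num_documents) / (1 + document_frequency))
    tf = np.log1p(counts)                 # log(1 + count), elementwise
    return tf * idf                       # idf broadcasts across the rows

term_counts = [[1, 1, 0, 0, 0, 0],
               [0, 0, 1, 1, 0, 0],
               [1, 1, 1, 1, 2, 1]]

weights = tf_idf_matrix(term_counts)
print(np.round(weights, 4))
```

As a hand check, `'sun'` appears once in the first document and in 2 of the 3 documents, so its weight is `log(2) * log(4/3)`, which is approximately 0.1994 and agrees with the loop-based output.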



In [726]:
vocabulary

{'sun': 0, 'shin': 1, 'weather': 2, 'sweet': 3, 'one': 4, 'two': 5}

In [728]:
example_weights = bnb.NaiveBayes(docs[:2], docs[:3])
example_weights

{'sun': [0.25, 0.18181818181818182],
 'shin': [0.25, 0.18181818181818182],
 'weather': [0.25, 0.18181818181818182],
 'sweet': [0.25, 0.18181818181818182],
 'one': [0.0, 0.18181818181818182],
 'two': [0.0, 0.09090909090909091]}

In [730]:
def tfIdf_Score(review: str, tfIdf_vector: list, weights: dict[str, list[float]], vocabulary: dict[str, int]) -> list[float]:
    ''' Calculate the positive and negative score of the given review given the weights '''

    # create the bag of words associated with the review instance
    bag_of_words = bnb.FeatureFunction(review)

    # find the positive class and negative class scores
    positive = 0.0
    negative = 0.0
    for word, count in bag_of_words:
        try:
            positive += weights[word][0]*tfIdf_vector[vocabulary[word]]
            negative += weights[word][1]*tfIdf_vector[vocabulary[word]]
        except KeyError:
            # skip words that are missing from the weights or the vocabulary
            pass

    return [positive, negative]

In [733]:
tfIdf_Score(docs[2], tf_idf(term_frequency_vectors, vocabulary)[2], example_weights, vocabulary) 

[np.float64(0.1994060174175938), np.float64(0.32715465219059725)]

In [None]:
bnb.PredAccuracy()

In [348]:
word = 'shin'
try:
    print(training_vocabulary[word])
except:
    print(f'The word "{word}" is not in the dictionary.')

41227


In [350]:
len(training_vocabulary)

50617

In [359]:
import scipy.sparse as sp

bag_of_words = sp.csr_matrix((1,len(training_vocabulary)))
feature_array = np.zeros(len(training_vocabulary))
index = 0
for word in [wnl.lemmatize(remove_punctuation(word), pos="v") for word in docs[0].strip().lower().split()]:
    try:
        print(training_vocabulary[word])
        index = training_vocabulary[word]
        feature_array[index] = 1
    except:
        print(f'The word "{word}" is not in the dictionary.')    

print(feature_array[43816])
print(feature_array[41227])

The word "the" is not in the dictionary.
43816
The word "be" is not in the dictionary.
41227
1.0
1.0


In [360]:
bag_of_words

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 0 stored elements and shape (1, 50617)>

In [353]:
feature_array = np.zeros(len(training_vocabulary))
feature_array[41227] = 1

feature_array[41227]

np.float64(1.0)

In [362]:
from scipy.sparse import csr_matrix
docs = [["hello", "world", "hello"], ["goodbye", "cruel", "world"]]
indptr = [0]
indices = []
data = []
vocabulary = {}
for d in docs:
    for term in d:
        index = vocabulary.setdefault(term, len(vocabulary))
        indices.append(index)
        data.append(1)
    indptr.append(len(indices))
csr_matrix((data, indices, indptr), dtype=int).toarray()

array([[2, 1, 0, 0],
       [0, 1, 1, 1]])

In [363]:
data

[1, 1, 1, 1, 1, 1]

In [364]:
indices

[0, 1, 0, 2, 3, 1]

In [365]:
indptr

[0, 3, 6]

In [366]:
vocabulary

{'hello': 0, 'world': 1, 'goodbye': 2, 'cruel': 3}

In [368]:
dict().setdefault?

Object `setdefault` not found.


In [372]:
train_p[:2]

[np.str_('\nIsrael - Happy 45th Birthday!\n\n'),
 np.str_('\nIn article <SHAIG.93Apr15220200@composer.think.com>, shaig@composer.think.com (Shai Guday) writes:\n\n|>    [snip]\n|>    imagine ????  It is NOT a "terrorist camp" as you and the Israelis like \n|>    to view the villages they are small communities with kids playing soccer\n|>    in the streets, women preparing lunch, men playing cards, etc.....\n|> \n|> I would not argue that all or even most of the villages are "terrorist\n|> camps".  There are however some which come very close to serving that\n|> purpose and that is not to say that other did not function in that way\n|> prior to the invasion. \n\nThe village I described was actually the closest I could come to\ndescribing mine.  I agree there may be other villages where the civilian\npopulation has deserted because it is too close to Israeli lines and\nthus gets bombed more often.  In such villages often the only remaining \ninhabitants are guerillas and some elderly who

Let $x$ be a term, and let $N$ be the number of documents in the corpus.  Let $N_{0}$ and $N_{1}$ be the number of documents in class 0 and class 1, respectively, in which the term appears.  We define the **Discriminative Coefficient of the term $x$** as the following product of logarithmic factors:

$$\left|\log\frac{1 + N_{0}}{1 + N_{1}}\right| \cdot \log\frac{(N - N_0)(N - N_1)}{\left|N_0 - N_1\right| + 1}.$$

Because the left factor is an absolute value, it is unchanged when the fraction is inverted: it equals $\log\frac{1 + N_1}{1 + N_0}$ when $N_1 > N_0$.
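
The formula can be evaluated directly for a single term once its two document counts are known. The sketch below is one possible reading of the definition, not the notebook's final implementation: the function name `discriminative_coefficient` and the example counts are hypothetical, and the check that $N - N_0$ and $N - N_1$ are positive is an added assumption so that the second logarithm is defined.

```python
import math

def discriminative_coefficient(n0: int, n1: int, n_docs: int) -> float:
    """Evaluate |log((1+N0)/(1+N1))| * log((N-N0)(N-N1) / (|N0-N1|+1)) for one term."""
    # assumption: the term is absent from at least one document in total,
    # so both (N - N0) and (N - N1) are positive and the logarithm is defined
    if n0 >= n_docs or n1 >= n_docs:
        raise ValueError("n0 and n1 must both be smaller than n_docs")
    left = abs(math.log((1 + n0) / (1 + n1)))
    right = math.log((n_docs - n0) * (n_docs - n1) / (abs(n0 - n1) + 1))
    return left * right

# hypothetical counts: a corpus of 10 documents
only_class_0 = discriminative_coefficient(n0=3, n1=0, n_docs=10)
only_class_1 = discriminative_coefficient(n0=0, n1=3, n_docs=10)
balanced = discriminative_coefficient(n0=3, n1=3, n_docs=10)
print(only_class_0, only_class_1, balanced)
```

A term that appears equally often in both classes gets a coefficient of zero because the left factor vanishes, and swapping the two classes leaves the value unchanged because of the absolute values.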

In [696]:
def vocabulary(documents: list[str]) -> dict[str, int]:
    '''Create the vocabulary given a list of documents.'''

    # create array of unique words (put in numpy array)

    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import stopwords

    # Create a set of stop words 
    STOP_WORDS = set(stopwords.words('english'))

    stop_words = set()
    for word in STOP_WORDS:
        new_word = str()                                     # create a string without punctuation
        for char in word:                                    # look at each character in the word
            if char.isalpha():                               # determine if the character is alphabetic
                new_word += char                             # if it is alphabetic, add it to the new word
        if new_word != '':
            stop_words.add(new_word)
        
    # instantiate important data types for storing processed data
    # documents_array = documents
    # documents_array = np.array(list(documents).strip().lower().split())  # list of potential words from string
    vocab_list = list()                                                # list of the unique words seen so far

    # Initialize wordnet lemmatizer
    wnl = WordNetLemmatizer()
    
    # consider each word, determine if any of its characters are not alphabetic
    for doc in documents:
        # print(doc)
        for word in doc.strip().lower().split():
            # print(word)
            new_word = str()                                     # create a string without punctuation
            for char in word:                                    # look at each character in the word
                if char.isalpha():                               # determine if the character is alphabetic
                    new_word += char                             # if it is alphabetic, add it to the new word
            if new_word != '':                                   # make sure the new string is not empty

                newer_word = wnl.lemmatize(new_word, pos="v")    # lemmatize the word as a verb
                if newer_word not in stop_words:                 # skip common words listed in the stop words
                    # add the word to the vocabulary the first time it is seen
                    if newer_word not in vocab_list:
                        vocab_list.append(newer_word)
                
    # output = list()                                          # list of (word, count) tuples
    # for word in dict_counts:
    #     output.append((word, dict_counts[word]))


    # order alphabetically the elements in the array
    vocab_array = np.array(vocab_list)
    vocab_array.sort()

    # create a dictionary of the elements
    vocab_dict = dict()
    for index in range(len(vocab_array)):
        vocab_dict[vocab_array[index]] = index
    
    return vocab_dict
    
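def discriminative_coefficients(pos_reviews: list[str], neg_reviews: list[str], vocab: dict[str, int]) -> dict[str, int]:
    '''Work in progress: builds the per-class vocabularies and, for now, returns only the positive-class one.'''
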
    N = len(pos_reviews+neg_reviews)
    # create a vocabulary for pos_reviews
    pos_vocabulary = vocabulary(pos_reviews)
    print(pos_vocabulary)

    # create a vocabulary for neg_reviews
    neg_vocabulary = vocabulary(neg_reviews)
    # print(neg_vocabulary)
    
    return pos_vocabulary

In [697]:
coefficients = discriminative_coefficients(train_p[:1], train_s[:1], vocabulary)
coefficients

{np.str_('birthday'): 0, np.str_('happy'): 1, np.str_('israel'): 2, np.str_('th'): 3}


{np.str_('birthday'): 0,
 np.str_('happy'): 1,
 np.str_('israel'): 2,
 np.str_('th'): 3}

In [648]:
def FeatureList(documents: list[str]) -> list[list[str]]:
    '''Return one list of cleaned, lemmatized, non-stop-word tokens per document.'''
    stop_words = create_stopwords()
    docs_list = list()
    for item in documents:
        new_feature_list = list()                     # start a fresh token list for each document
        for word in item.strip().lower().split():
            new_word = lemmatize(remove_punctuation(word))
            if new_word not in stop_words and new_word != '':
                new_feature_list.append(new_word)
        docs_list.append(new_feature_list)            # one token list per document
    
    return docs_list

In [652]:
print(train_s[0])


In article <rdippold.735426379@qualcom> 
(Ron "Asbestos" Dippold) writes: 
  ...
> The only thing that worries me is that 2:1 compression - the
> SoundBlaster can do it automatically in hardware, but other than that
> I don't have a good feel for how processor intensive it is, so I can't
> estimate how fast a PC you'd need.

        There's a better way. Doesn't Qualcom have a secure design
        that it decided not to market?  Since they aren't going to
        use it, wouldn't the patriotic thing be to put the design in
        the public domain? How about selling a "Cryptography
        Educational Kit" with the critical parts? Something that could
        end up as a PC option board with two phone jacks?

        Cheers,
                Marc

---
 Marc Thibault                             | marc@tanda.isis.org
 Automation Architect                      | CIS:71441,2226
 R.R.1, Oxford Mills, Ontario, Canada      | NC FreeNet: aa185

-----BEGIN PGP PUBLIC KEY BLOCK-----
mQBNAiqxYT

In [663]:
PvsS_weights['pghcmnadgfuzgeuaxnpcyvcmc']

[-10000, -10.955348846514784]

In [664]:
bnb.Score(train_s[0], PvsS_weights)

[-150590.2003700718, -696.6277004778442]

In [659]:
len(training_vocabulary)

50617

In [662]:
training_vocabulary['pghcmnadgfuzgeuaxnpcyvcmc']

33762

In [651]:
FeatureList(train_s[:1])

[['article',
  'rdippoldqualcom',
  'ron',
  'asbestos',
  'dippold',
  'write',
  'thing',
  'worry',
  'compression',
  'soundblaster',
  'automatically',
  'hardware',
  'good',
  'feel',
  'processor',
  'intensive',
  'cant',
  'estimate',
  'fast',
  'pc',
  'need',
  'theres',
  'better',
  'way',
  'qualcom',
  'secure',
  'design',
  'decide',
  'market',
  'since',
  'go',
  'use',
  'patriotic',
  'thing',
  'put',
  'design',
  'public',
  'domain',
  'sell',
  'cryptography',
  'educational',
  'kit',
  'critical',
  'part',
  'something',
  'could',
  'end',
  'pc',
  'option',
  'board',
  'two',
  'phone',
  'jack',
  'cheer',
  'marc',
  'marc',
  'thibault',
  'marctandaisisorg',
  'automation',
  'architect',
  'cis',
  'rr',
  'oxford',
  'mill',
  'ontario',
  'canada',
  'nc',
  'freenet',
  'aa',
  'begin',
  'pgp',
  'public',
  'key',
  'block',
  'mqbnaiqxytkaaaecalfehypycsscfvjspjescaohihtnefrrnvuecsavh',
  'aauwpiugyvnnlftpnnlcmscpjupykviabrgihcmmgvghpymfbhq

In [646]:
bnb.FeatureFunction(train_p[0])

[('israel', 1), ('happy', 1), ('th', 1), ('birthday', 1)]

In [630]:
train_s[:1]

[np.str_('\nIn article <rdippold.735426379@qualcom> \n(Ron "Asbestos" Dippold) writes: \n  ...\n> The only thing that worries me is that 2:1 compression - the\n> SoundBlaster can do it automatically in hardware, but other than that\n> I don\'t have a good feel for how processor intensive it is, so I can\'t\n> estimate how fast a PC you\'d need.\n\n        There\'s a better way. Doesn\'t Qualcom have a secure design\n        that it decided not to market?  Since they aren\'t going to\n        use it, wouldn\'t the patriotic thing be to put the design in\n        the public domain? How about selling a "Cryptography\n        Educational Kit" with the critical parts? Something that could\n        end up as a PC option board with two phone jacks?\n\n        Cheers,\n                Marc\n\n---\n Marc Thibault                             | marc@tanda.isis.org\n Automation Architect                      | CIS:71441,2226\n R.R.1, Oxford Mills, Ontario, Canada      | NC FreeNet: aa185\n\n-----B

In [631]:
def remove_punctuation(word: str) -> str:
    '''Return the word with every non-alphabetic character (punctuation and digits) removed.'''

    new_word = str()                                     # create a string without punctuation
    for char in word:                                    # look at each character in the word
        if char.isalpha():                               # determine if the character is alphabetic
            new_word += char                             # if it is alphabetic, add it to the new word

    return new_word

def lemmatize(word: str) -> str:
    from nltk.stem import WordNetLemmatizer
    wnl = WordNetLemmatizer()
    new_word = wnl.lemmatize(word, pos="v")
    return new_word

def create_stopwords() -> set[str]:
    '''Return the NLTK English stop words with non-alphabetic characters removed.'''
    from nltk.corpus import stopwords

    # Create a set of stop words 
    STOP_WORDS = set(stopwords.words('english'))

    stop_words = set()
    for word in STOP_WORDS:
        new_word = str()                                     # create a string without punctuation
        for char in word:                                    # look at each character in the word
            if char.isalpha():                               # determine if the character is alphabetic
                new_word += char                             # if it is alphabetic, add it to the new word
        if new_word != '':
            stop_words.add(new_word)

    return stop_words

    
    # if new_word != '':                                   # make sure the new string is not empty

    #     newer_word = wnl.lemmatize(new_word, pos="v")    # lemmatize the word
    #     if newer_word not in stop_words:                 # check that the new word is not a common word listed in the stop words
    #         return newer_word
        
    

In [699]:
lemmatize("cats")

'cat'

In [705]:
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [706]:
lemmatizer.lemmatize(word = "cats")

'cat'

In [707]:
lemmatize('cats')

'cat'

In [738]:
bnb.FeatureFunction(train_p[0])

[('israel', 1), ('happy', 1), ('th', 1), ('birthday', 1)]

In [740]:
import spacy
# train_p[0]

ModuleNotFoundError: No module named 'spacy'

In [741]:
spacy.download()

NameError: name 'spacy' is not defined

In [742]:
import sys
sys.executable

'/Users/blakewallace/anaconda3/envs/linear_regression_env/bin/python'

In [744]:
!/Users/blakewallace/anaconda3/envs/linear_regression_env/bin/python -m pip install spacy

Collecting spacy
  Using cached spacy-3.8.2.tar.gz (1.3 MB)
  Installing build dependencies ... canceled
ERROR: Operation cancelled by user