<a href="https://colab.research.google.com/github/ANGELLOPARR/csc466-project/blob/main/NLP_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [39]:
import nltk
import spacy
nltk.download('punkt')
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

!python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [30]:
!rm -rf csc466-project
!git clone https://github.com/ANGELLOPARR/csc466-project.git

Cloning into 'csc466-project'...
remote: Enumerating objects: 77, done.
remote: Counting objects: 100% (77/77), done.
remote: Compressing objects: 100% (71/71), done.
remote: Total 77 (delta 30), reused 25 (delta 4), pack-reused 0
Unpacking objects: 100% (77/77), done.


In [31]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

dataset = pd.read_csv(r'csc466-project/emails.csv')
dataset.columns #Index(['text', 'spam'], dtype='object')
dataset.shape  #(5728, 2)
dataset.iloc[-1]['text']

'Subject: news : aurora 5 . 2 update  aurora version 5 . 2  - the fastest model just got faster -  epis announces the release of aurora , version 5 . 2  aurora the electric market price forecasting tool is already  legendary for power and speed . we \' ve combined a powerful chronological  dispatch model with the capability to simulate the market from 1  day to 25 + years . add to that a risk analysis section , powered by user  selectable monte carlo & / or latin hypercube modeling , enough  portfolio analysis power to please the toughest critic , & inputs and  outputs from standard excel & access tables and you \' ve got one of most  powerful tools in the market .  just a few months ago we expanded our emissions modeling  capabilities , added our quarterly database update , increased the speed  of the entire model , and made  but that wasn \' t enough .  we \' ve done it again . some of the operations that we \' ve  included . . .  two new reporting enhancements .  the first is margin

Data cleanup

In [32]:
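# Literal (non-regex) replacement; note this removes every occurrence, not only the leading prefix
dataset['text'] = dataset['text'].str.replace('Subject: ', '', regex=False)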
dataset

Unnamed: 0,text,spam
0,naturally irresistible your corporate identity...,1
1,the stock trading gunslinger fanny is merrill...,1
2,unbelievable new homes made easy im wanting t...,1
3,4 color printing special request additional i...,1
4,"do not have money , get software cds from here...",1
...,...,...
5723,re : research and development charges to gpg ...,0
5724,"re : receipts from visit jim , thanks again ...",0
5725,re : enron case study update wow ! all on the...,0
5726,"re : interest david , please , call shirley ...",0
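
The `str.replace` call above removes every occurrence of `Subject: `, including any that appear inside a message body. A minimal sketch of stripping only the leading prefix, using the standard `re` module, is shown here. The helper name `strip_subject_prefix` and the sample strings are hypothetical and not part of the original notebook.

```python
import re

# Matches "Subject: " only at the very start of the string.
SUBJECT_PREFIX = re.compile(r'^Subject: ')

def strip_subject_prefix(text):
    """Remove a leading 'Subject: ' and leave later occurrences untouched."""
    return SUBJECT_PREFIX.sub('', text, count=1)

# Hypothetical sample strings, not taken from emails.csv.
samples = [
    'Subject: hello world',
    'Subject: fwd : see Subject: line below',
    'no prefix here',
]
cleaned = [strip_subject_prefix(s) for s in samples]
```

On the DataFrame itself, `dataset['text'].str.replace(r'^Subject: ', '', regex=True)` should have the same effect.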


In [33]:
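stemmer = PorterStemmer()  # create once and reuse across calls

def stem_text(text):
    """Tokenize text and return its Porter-stemmed tokens joined by spaces."""
    tokenized = word_tokenize(text)
    stemmed = [stemmer.stem(word) for word in tokenized]
    return ' '.join(stemmed)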

In [34]:
stem_text(dataset.iloc[0]['text'])

"natur irresist your corpor ident lt is realli hard to recollect a compani : the market is full of suqgest and the inform isoverwhelminq ; but a good catchi logo , stylish statloneri and outstand websit will make the task much easier . we do not promis that havinq order a iogo your compani will automaticaili becom a world ieader : it isguit ciear that without good product , effect busi organ and practic aim it will be hotat nowaday market ; but we do promis that your market effort will becom much more effect . here is the list of clear benefit : creativ : hand - made , origin logo , special done to reflect your distinct compani imag . conveni : logo and stationeri are provid in all format ; easi - to - use content manag system letsyou chang your websit content and even it structur . prompt : you will see logo draft within three busi day . afford : your market break - through shouldn ' t make gap in your budget . 100 % satisfact guarante : we provid unlimit amount of chang with no extra

In [35]:
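def lemmatize_text(text):
    """Return the spaCy lemma of every token in text, joined by spaces."""
    doc = nlp(text)
    # spaCy v2 returns the placeholder '-PRON-' as the lemma of pronouns;
    # keep the original token text in that case (the check is a no-op on spaCy v3).
    lemmed = [token.lemma_ if token.lemma_ != '-PRON-' else token.text for token in doc]
    return ' '.join(lemmed)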

In [36]:
lemmatize_text(dataset.iloc[0]['text'])

"naturally irresistible your corporate identity   lt be really hard to recollect a company : the   market be full of suqgestion and the information isoverwhelminq ; but a good   catchy logo , stylish statlonery and outstanding website   will make the task much easy .   we do not promise that havinq order a iogo your   company will automaticaily become a world ieader : it isguite ciear that   without good product , effective business organization and practicable aim it   will be hotat nowadays market ; but we do promise that your marketing effort   will become much more effective . here be the list of clear   benefit : creativeness : hand - make , original logo , specially do   to reflect your distinctive company image . convenience : logo and stationery   be provide in all format ; easy - to - use content management system letsyou   change your website content and even its structure . promptness : you   will see logo draft within three business day . affordability : your   marketing br

In [None]:
corpus = dataset['text'].values
stemmed_corpus = [stem_text(document) for document in corpus]
lemmed_corpus = [lemmatize_text(document) for document in corpus]

In [56]:
stemmed_df = dataset.copy(deep=True)
stemmed_df['stemmed'] = stemmed_corpus
stemmed_df.drop(['text'], axis=1, inplace=True)

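# Despite the names, these are the documents and labels for the full dataset;
# no train/test split is made here.
training_stemmed = stemmed_df['stemmed']  # documents (features)
testing_stemmed = stemmed_df['spam']      # labels (the 'spam' column)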

In [57]:
lemmed_df = dataset.copy(deep=True)
lemmed_df['lemmed'] = lemmed_corpus
lemmed_df.drop(['text'], axis=1, inplace=True)

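# Despite the names, these are the documents and labels for the full dataset;
# no train/test split is made here.
training_lemmed = lemmed_df['lemmed']  # documents (features)
testing_lemmed = lemmed_df['spam']     # labels (the 'spam' column)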
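
If a held-out test set is wanted before vectorizing, one option is a stratified split that keeps the share of each label roughly equal in both parts. The following is a minimal sketch in plain Python. `stratified_split`, `test_fraction`, and the toy labels are hypothetical. In practice scikit-learn's `train_test_split` with its `stratify` argument would be the usual choice.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each label split in the same proportion."""
    rng = random.Random(seed)
    # Group row indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels for illustration only: five spam (1) and five non-spam (0) rows.
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
train_idx, test_idx = stratified_split(toy_labels, test_fraction=0.2)
```

The returned indices could then be used with `.iloc` to select matching rows of documents and labels.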

In [46]:
stemmed_vectorizer = TfidfVectorizer(stop_words='english')
stemmed_vectorizer.fit(training_stemmed)
# transform() takes only the documents; labels play no part in building TF-IDF features
X_stemmed = stemmed_vectorizer.transform(training_stemmed)
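
To make the weights in the TF-IDF matrix less opaque, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each row. It is an illustration, not the library's implementation. It skips stop-word removal and scikit-learn's own tokenization, and the function name `tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalised)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Scale the row to unit Euclidean length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Toy, already-tokenized documents for illustration only.
toy_docs = [
    'cheap logo cheap website'.split(),
    'meeting logo update'.split(),
    'meeting agenda'.split(),
]
weights = tfidf(toy_docs)
```

In the first toy document, `cheap` gets the largest weight because it occurs twice and appears in no other document, while `logo` is down-weighted because it also appears in the second document.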

In [48]:
lemmed_vectorizer = TfidfVectorizer(stop_words='english')
Y = lemmed_vectorizer.fit_transform(lemmed_corpus)

Unnamed: 0,text,spam,lemmed
0,naturally irresistible your corporate identity...,1,naturally irresistible your corporate identity...
1,the stock trading gunslinger fanny is merrill...,1,the stock trading gunslinger fanny be merril...
2,unbelievable new homes made easy im wanting t...,1,unbelievable new home make easy i be want to...
3,4 color printing special request additional i...,1,4 color printing special request additional ...
4,"do not have money , get software cds from here...",1,"do not have money , get software cd from here ..."
...,...,...,...
5723,re : research and development charges to gpg ...,0,re : research and development charge to gpg ...
5724,"re : receipts from visit jim , thanks again ...",0,"re : receipt from visit jim , thank again ..."
5725,re : enron case study update wow ! all on the...,0,re : enron case study update wow ! all on th...
5726,"re : interest david , please , call shirley ...",0,"re : interest david , please , call shirle..."
