# TEXT ANALYTICS

### Pre-processing text data

Cleaning the text data highlights the attributes that you want your machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of the following steps:
1. **Remove punctuation**
2. **Tokenization**
3. **Remove stopwords**
4. **Lemmatize/Stem**


##### Natural Language Processing

NLP is a field, closely tied to machine learning, that deals with the ability of a computer to understand, analyze,
manipulate, and potentially generate human language.

#### What is NLTK?
NLTK (Natural Language Toolkit) is a popular open-source package for Python.
Rather than making you build every tool from scratch, NLTK provides implementations of many common NLP tasks.

In [19]:
import nltk
# nltk.download()
from nltk.corpus import stopwords

In [20]:
#nltk.data.path

In [21]:
import pandas as pd

data = pd.read_csv("spam.csv",encoding='latin1')


data.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives around here though",,,


In [22]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


In [23]:
data = data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)


In [24]:
data.columns = ['label', 'body_text']

In [25]:
data.head()

Unnamed: 0,label,body_text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


### Remove punctuation

The punctuation marks to be removed are the characters in Python's `string.punctuation`.

In [26]:
import pandas as pd
import re
import string
import nltk

# Download NLTK resources if not already downloaded
nltk.download('stopwords')
nltk.download('wordnet')




[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Mukesh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Mukesh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [27]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [28]:
# The two strings below mean the same thing, but Python treats them as different because of the punctuation mark
"I like NLP." == "I like NLP"

False

In [29]:
s = 'I like NLP.'



In [30]:
string.punctuation?

Type:        str
String form: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Length:      32
Docstring:  
str(object='') -> str
str(bytes_or_buffer[, encoding[, errors]]) -> str

Create a new string object from the given object. If encoding or
errors is specified, then the object must expose a data buffer
that will be decoded using the given encoding and error handler.
Otherwise, returns the result of object.__str__() (if defined)
or repr(object).
encoding defaults to sys.getdefaultencoding().
errors defaults to 'strict'.

In [31]:
"ANANYA" =="Ananya"

False

In [32]:
# User-defined function (UDF) that removes punctuation characters from a string

def remove_punct(text):
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    return text_nopunct

In [33]:
my_text = "I've been @$ se@rching for the right words to that"

In [34]:
remove_punct(my_text)

'Ive been  serching for the right words to that'
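The same result can be obtained with `str.translate`, which avoids building a list of characters. The sketch below is an alternative, not the notebook's method; the names `PUNCT_TABLE` and `remove_punct_translate` are hypothetical.

```python
import string

# Build the translation table once; it maps every punctuation character to None.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punct_translate(text):
    # Hypothetical helper name; behaves like the list-comprehension version above.
    return text.translate(PUNCT_TABLE)

print(remove_punct_translate("I've been @$ se@rching for the right words to that"))
```

Building the table outside the function means it is created once rather than on every call, which may matter when the function is applied to thousands of rows.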

##### Function to remove punctuation from text 

In [35]:
# Set maximum column width for better display
pd.set_option('display.max_colwidth', 100)



import string

def remove_punctuation(text):
    return "".join([char for char in text if char not in string.punctuation])

# Applying the function to the body_text column
data['body_text_no_punct'] = data['body_text'].apply(lambda x: remove_punctuation(x))






In [36]:
# Display the cleaned data
data.head()

Unnamed: 0,label,body_text,body_text_no_punct
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amo...
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though


### Tokenization

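Tokenization takes a text, or a set of texts, and breaks it up into its individual words.

Lowercasing is a related step, because "Hello" and "hello" compare as different strings even though they mean the same thing. The tokenizer in this section leaves the case unchanged; lowercasing is applied later, in the consolidated `clean_text` function.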

In [37]:
import re

def tokenize(text):
    # Raw string so that \W is passed to the regex engine unchanged
    return re.split(r'\W+', text)

# Applying the function to the body_text_no_punct column
data['body_text_tokenized'] = data['body_text_no_punct'].apply(lambda x: tokenize(x))

# Display the data with tokenized text
data.head()


Unnamed: 0,label,body_text,body_text_no_punct,body_text_tokenized
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amo...,"[Go, until, jurong, point, crazy, Available, only, in, bugis, n, great, world, la, e, buffet, Ci..."
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[Ok, lar, Joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[Free, entry, in, 2, a, wkly, comp, to, win, FA, Cup, final, tkts, 21st, May, 2005, Text, FA, to..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[U, dun, say, so, early, hor, U, c, already, then, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[Nah, I, dont, think, he, goes, to, usf, he, lives, around, here, though]"
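One edge case of splitting on `\W+` is worth knowing: when a string starts or ends with non-word characters, `re.split` returns an empty string at that end. The sketch below contrasts the notebook's approach with a hypothetical variant, `tokenize_nonempty`, that collects runs of word characters instead.

```python
import re

def tokenize(text):
    # Same approach as the notebook: split on runs of non-word characters.
    return re.split(r"\W+", text)

def tokenize_nonempty(text):
    # Hypothetical variant: keep only runs of word characters, so no empty tokens appear.
    return re.findall(r"\w+", text)

print(tokenize("Ok lar "))           # ['Ok', 'lar', '']
print(tokenize_nonempty("Ok lar "))  # ['Ok', 'lar']
```

Empty tokens of this kind can end up as an empty-string entry in a vectorizer's vocabulary, so filtering them out is a reasonable option.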


In [None]:
'NLP' == 'nlp'

### Remove stopwords

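Stopwords are very common words that carry little meaning on their own, such as 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', and many more. Removing them is an important step because it leaves the words that actually distinguish one text from another.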

In [7]:
# NLTK (Natural Language Toolkit) is a library for analyzing text
import nltk
nltk.download('stopwords')

# A corpus is a collection of texts; the stopwords corpus holds one word list per language
#stopword = nltk.corpus.stopwords.words('english')


In [40]:
import nltk

# Download NLTK resources if not already downloaded
nltk.download('stopwords')

stopwords = nltk.corpus.stopwords.words('english')

def remove_stopwords(tokenized_list):
    return [word for word in tokenized_list if word.lower() not in stopwords]

# Applying the function to the body_text_tokenized column
data['body_text_no_stopwords'] = data['body_text_tokenized'].apply(lambda x: remove_stopwords(x))

# Display the data with stopwords removed
data.head()


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Mukesh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,label,body_text,body_text_no_punct,body_text_tokenized,body_text_no_stopwords
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",Go until jurong point crazy Available only in bugis n great world la e buffet Cine there got amo...,"[Go, until, jurong, point, crazy, Available, only, in, bugis, n, great, world, la, e, buffet, Ci...","[Go, jurong, point, crazy, Available, bugis, n, great, world, la, e, buffet, Cine, got, amore, wat]"
1,ham,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[Ok, lar, Joking, wif, u, oni]","[Ok, lar, Joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005 Text FA to 87121 to receive e...,"[Free, entry, in, 2, a, wkly, comp, to, win, FA, Cup, final, tkts, 21st, May, 2005, Text, FA, to...","[Free, entry, 2, wkly, comp, win, FA, Cup, final, tkts, 21st, May, 2005, Text, FA, 87121, receiv..."
3,ham,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[U, dun, say, so, early, hor, U, c, already, then, say]","[U, dun, say, early, hor, U, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though",Nah I dont think he goes to usf he lives around here though,"[Nah, I, dont, think, he, goes, to, usf, he, lives, around, here, though]","[Nah, dont, think, goes, usf, lives, around, though]"
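Membership tests against a Python list scan the list element by element, whereas a set uses hashing. A minimal sketch of the same filter with a set follows; the stopword list here is a short, hypothetical stand-in for NLTK's much longer English list.

```python
# Hypothetical, shortened stopword list; NLTK's English list is much longer.
STOPWORDS = {"i", "he", "to", "so", "then", "here"}

def remove_stopwords(tokens, stopword_set=STOPWORDS):
    # Lowercase only for the comparison so the original casing is preserved.
    return [tok for tok in tokens if tok.lower() not in stopword_set]

tokens = ["Nah", "I", "dont", "think", "he", "goes", "to", "usf",
          "he", "lives", "around", "here", "though"]
print(remove_stopwords(tokens))
```

With the real NLTK list, the equivalent change would be wrapping the list in `set(...)` once before applying the function to the column.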


###### Consolidating punctuation removal, tokenization, and stopword removal into a single function

In [44]:
import pandas as pd
import re
import string

stopwords = nltk.corpus.stopwords.words('english')

data = pd.read_csv("spam.csv", encoding='Latin1')

data = data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
data.columns = ['label', 'body_text']

data.head()

Unnamed: 0,label,body_text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


In [45]:
def clean_text(text):
    # Drop punctuation characters, split on non-word characters, then filter out stopwords
    text = "".join([char for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [word for word in tokens if word not in stopwords]
    return text

data['body_text_nostop'] = data['body_text'].apply(lambda x: clean_text(x.lower()))

data.head()

Unnamed: 0,label,body_text,body_text_nostop
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]"
1,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
3,ham,U dun say so early hor... U c already then say...,"[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]"


# Stemming and Lemmatization

#### Main differences between stemming and lemmatization:

Stemming and lemmatization differ in how they work and therefore in the results they return.

Stemming algorithms work by cutting off the end or the beginning of a word, taking into account a list of common prefixes and suffixes found in inflected words. This indiscriminate cutting succeeds in some cases but not in others, which is the main limitation of the approach. The Porter stemmer examples later in this section illustrate it on English words.

Lemmatization, on the other hand, takes the morphological analysis of the word into account. To do so, it needs detailed dictionaries that the algorithm can look through to link an inflected form back to its lemma.

Another important difference is that a lemma is the base form of all its inflectional forms, whereas a stem is not.
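To make the contrast concrete, here is a deliberately simplified sketch. `toy_stem` strips a few suffixes and is not the Porter algorithm; `toy_lemmatize` looks words up in a tiny hand-written dictionary and is not WordNet. All names and entries are hypothetical and chosen for illustration only.

```python
# Toy suffix-stripping stemmer (NOT the Porter algorithm).
SUFFIXES = ("ing", "ies", "es", "s")

def toy_stem(word):
    # Cut the first matching suffix, as long as at least three characters remain.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemma dictionary (hypothetical entries, NOT WordNet).
LEMMAS = {"studies": "study", "studying": "study", "geese": "goose"}

def toy_lemmatize(word):
    # Dictionary lookup; unknown words are returned unchanged.
    return LEMMAS.get(word, word)

for w in ["studies", "studying", "geese"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stemmer maps "studies" and "studying" to two different strings, one of which is not a word, and it cannot handle the irregular plural "geese". The dictionary lookup returns a real base form in each case, but only for words it knows about.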

### Supplemental Data Cleaning: Using Stemming


### Test out Porter stemmer

In [46]:
import nltk

# Porter Stemmer - It's a stemmer available in nltk

ps = nltk.PorterStemmer()

In [47]:
print(ps.stem('grows'))
print(ps.stem('growing'))
print(ps.stem('grow'))

grow
grow
grow


In [48]:
print(ps.stem('studies'))
print(ps.stem('studying'))


studi
studi


In [50]:
import nltk

wn = nltk.WordNetLemmatizer()
ps = nltk.PorterStemmer()

In [52]:
print(ps.stem('run'))
print(ps.stem('running'))
print(ps.stem('runner'))

run
run
runner


### Stem text

In [53]:
def stemming(tokenized_text):
    text = [ps.stem(word) for word in tokenized_text]
    return text

data['body_text_stemmed'] = data['body_text_nostop'].apply(lambda x: stemming(x))

data.head()

Unnamed: 0,label,body_text,body_text_nostop,body_text_stemmed
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]","[go, jurong, point, crazi, avail, bugi, n, great, world, la, e, buffet, cine, got, amor, wat]"
1,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]","[ok, lar, joke, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv...","[free, entri, 2, wkli, comp, win, fa, cup, final, tkt, 21st, may, 2005, text, fa, 87121, receiv,..."
3,ham,U dun say so early hor... U c already then say...,"[u, dun, say, early, hor, u, c, already, say]","[u, dun, say, earli, hor, u, c, alreadi, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]","[nah, dont, think, goe, usf, live, around, though]"


## Supplemental Data Cleaning: Using a Lemmatizer

### Test out WordNet lemmatizer (read more about WordNet [here](https://wordnet.princeton.edu/))

In [54]:
import nltk

wn = nltk.WordNetLemmatizer()
ps = nltk.PorterStemmer()

### Clean up text

In [56]:
import pandas as pd
import re
import string

stopwords = nltk.corpus.stopwords.words('english')


data = pd.read_csv("spam.csv", encoding='Latin1')

data = data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
data.columns = ['label', 'body_text']

data.head()


def clean_text(text):
    # Drop punctuation characters, split on non-word characters, then filter out stopwords
    text = "".join([char for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [word for word in tokens if word not in stopwords]
    return text

data['body_text_nostop'] = data['body_text'].apply(lambda x: clean_text(x.lower()))

data.head()

Unnamed: 0,label,body_text,body_text_nostop
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...","[go, jurong, point, crazy, available, bugis, n, great, world, la, e, buffet, cine, got, amore, wat]"
1,ham,Ok lar... Joking wif u oni...,"[ok, lar, joking, wif, u, oni]"
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,"[free, entry, 2, wkly, comp, win, fa, cup, final, tkts, 21st, may, 2005, text, fa, 87121, receiv..."
3,ham,U dun say so early hor... U c already then say...,"[u, dun, say, early, hor, u, c, already, say]"
4,ham,"Nah I don't think he goes to usf, he lives around here though","[nah, dont, think, goes, usf, lives, around, though]"


### Lemmatize text

In [58]:
# Inspect the attributes and methods of the stopwords corpus reader (for example, words and fileids)
print(dir(nltk.corpus.stopwords))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_citation', '_encoding', '_fileids', '_get_root', '_license', '_readme', '_root', '_tagset', '_unload', 'abspath', 'abspaths', 'citation', 'encoding', 'ensure_loaded', 'fileids', 'license', 'open', 'raw', 'readme', 'root', 'words']


### Vectorizers output sparse matrices

_**Sparse Matrix**: A matrix in which most entries are 0. In the interest of efficient storage, a sparse matrix will be stored by only storing the locations of the non-zero elements._
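As a rough illustration of the idea, the sketch below stores only the non-zero cells of a small matrix in a dictionary keyed by (row, column). This is a simplified model; the matrices returned by scikit-learn vectorizers use SciPy's compressed formats rather than a dictionary.

```python
# A small dense matrix in which most entries are zero.
dense = [
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
]

# Keep only the non-zero cells, keyed by their (row, column) position.
sparse = {
    (r, c): value
    for r, row in enumerate(dense)
    for c, value in enumerate(row)
    if value != 0
}

def get_cell(sparse_matrix, r, c):
    # Any position that was not stored is implicitly zero.
    return sparse_matrix.get((r, c), 0)

print(sparse)
print(get_cell(sparse, 0, 2), get_cell(sparse, 1, 1))
```

Here 3 of the 12 cells are stored; with thousands of vocabulary columns and short messages, the saving is far larger.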

# Vectorizing Raw Data: Count Vectorization

### Count vectorization 

Creates a document-term matrix where the entry of each cell will be a count of the number of times that word occurred in that document.
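A minimal sketch of the idea using only the standard library, assuming three short, already-cleaned documents (hypothetical data). It is meant to show what the matrix contains, not how `CountVectorizer` is implemented.

```python
from collections import Counter

# Hypothetical, already-cleaned documents.
docs = ["free entry win free", "ok lar", "win cup final"]
tokenized = [doc.split() for doc in docs]

# The vocabulary is the sorted set of all terms; each term becomes one column.
vocab = sorted({term for doc in tokenized for term in doc})

# One row per document, one count per vocabulary term.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[term] for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```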

In [59]:

data = pd.read_csv("spam.csv", encoding='Latin1')

data = data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
data.columns = ['label', 'body_text']

data.head()

Unnamed: 0,label,body_text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


In [60]:
data_sample = data[0:20]

data_sample

Unnamed: 0,label,body_text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"
5,spam,FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for ...
6,ham,Even my brother is not like to speak with me. They treat me like aids patent.
7,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call...
8,spam,WINNER!! As a valued network customer you have been selected to receivea å£900 prize reward! To ...
9,spam,Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with came...


In [61]:
import pandas as pd
import re
import string
import nltk

pd.set_option('display.max_colwidth', 100)

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()


data = pd.read_csv("spam.csv", encoding='Latin1')

data = data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
data.columns = ['label', 'body_text']

data.head()

Unnamed: 0,label,body_text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


In [62]:
def clean_text(text):
    # Lowercase and drop punctuation, split on non-word characters, then stem the non-stopwords
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

### Apply CountVectorizer

In [63]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(analyzer=clean_text)
X_counts = count_vect.fit_transform(data['body_text'])
print(X_counts.shape)

(5572, 8060)


### Apply CountVectorizer to smaller sample

In [64]:
data_sample = data[0:20]

count_vect_sample = CountVectorizer(analyzer=clean_text)
X_counts_sample = count_vect_sample.fit_transform(data_sample['body_text'])
print(X_counts_sample.shape)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(count_vect_sample.get_feature_names_out())

(20, 223)
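Older scikit-learn releases expose `get_feature_names()`, while newer ones provide `get_feature_names_out()`. If a notebook has to run on both, a small helper can pick whichever method exists. The helper name `feature_names` and the two stand-in classes below are hypothetical; the stand-ins only mimic the two method names so the sketch runs without scikit-learn installed.

```python
def feature_names(vectorizer):
    # Prefer the newer method name and fall back to the older one if needed.
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())

# Stand-in classes (hypothetical) that mimic the two API generations.
class OldStyleVectorizer:
    def get_feature_names(self):
        return ["free", "win"]

class NewStyleVectorizer:
    def get_feature_names_out(self):
        return ["free", "win"]

print(feature_names(OldStyleVectorizer()))
print(feature_names(NewStyleVectorizer()))
```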

In [32]:
X_counts_df = pd.DataFrame(X_counts_sample.toarray())
X_counts_df.columns = count_vect_sample.get_feature_names_out()
X_counts_df

Unnamed: 0,08002986030,08452810075over18,09061701461,1,100,100000,11,12,150pday,16,...,wet,win,winner,wkli,word,wwwdbuknet,xxxmobilemovieclub,xxxmobilemovieclubcomnqjkgighjjgcbl,ye,ü
0,0,1,0,0,0,0,0,0,0,0,...,0,1,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,1,0,0,0,0,1,0,0,...,0,0,1,0,0,0,0,0,0,0
6,1,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,1,0,0,0,1,1,...,0,1,0,0,0,0,0,0,0,0
9,0,0,0,1,0,1,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0


# Vectorizing Raw Data: N-Grams

### N-Grams 

Creates a document-term matrix where counts still occupy the cell but instead of the columns representing single terms, they represent all combinations of adjacent words of length n in your text.

"NLP is an interesting topic"

| n | Name      | Tokens                                                         |
|---|-----------|----------------------------------------------------------------|
| 2 | bigram    | ["nlp is", "is an", "an interesting", "interesting topic"]      |
| 3 | trigram   | ["nlp is an", "is an interesting", "an interesting topic"] |
| 4 | four-gram | ["nlp is an interesting", "is an interesting topic"]    |
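The rows of the table can be reproduced with a short sliding-window function. This is a sketch of the concept with a hypothetical helper named `ngrams`; `CountVectorizer` builds its n-grams internally through the `ngram_range` parameter.

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list and join each window with spaces.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "NLP is an interesting topic".lower().split()

for n in (2, 3, 4):
    print(n, ngrams(tokens, n))
```

A sentence with k tokens yields k - n + 1 n-grams, which is why the lists get shorter as n grows.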

In [65]:
import pandas as pd
import re
import string
import nltk
pd.set_option('display.max_colwidth', 100)

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()


data = pd.read_csv("spam.csv", encoding='Latin1')

data = data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
data.columns = ['label', 'body_text']

data.head()

def clean_text(text):
    # Same cleaning steps as before, but joined back into one string for the n-gram vectorizer
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = " ".join([ps.stem(word) for word in tokens if word not in stopwords])
    return text

data['cleaned_text'] = data['body_text'].apply(lambda x: clean_text(x))
data.head()

Unnamed: 0,label,body_text,cleaned_text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g...",go jurong point crazi avail bugi n great world la e buffet cine got amor wat
1,ham,Ok lar... Joking wif u oni...,ok lar joke wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,free entri 2 wkli comp win fa cup final tkt 21st may 2005 text fa 87121 receiv entri questionstd...
3,ham,U dun say so early hor... U c already then say...,u dun say earli hor u c alreadi say
4,ham,"Nah I don't think he goes to usf, he lives around here though",nah dont think goe usf live around though


### Apply CountVectorizer (w/ N-Grams)

In [67]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer with ngram_range=(1,2)
ngram_vect = CountVectorizer(ngram_range=(1,2))

# Fit and transform the cleaned_text data
X_counts = ngram_vect.fit_transform(data['cleaned_text'])
print(X_counts.shape)

# Access feature names using get_feature_names_out()
feature_names = ngram_vect.get_feature_names_out()
print(feature_names)


(5572, 39168)
['008704050406' '008704050406 sp' '0089mi' ... 'ûò sound' 'ûówell'
 'ûówell done']


### Apply CountVectorizer (w/ N-Grams) to smaller sample

In [69]:
from sklearn.feature_extraction.text import CountVectorizer

data_sample = data[0:20]

ngram_vect_sample = CountVectorizer(ngram_range=(2,2))
X_counts_sample = ngram_vect_sample.fit_transform(data_sample['cleaned_text'])
print(X_counts_sample.shape)
print(ngram_vect_sample.get_feature_names_out())


(20, 226)
['09061701461 claim' '100 20000' '100000 prize' '11 month' '12 hour'
 '150 rcv' '150pday 6day' '16 tsandc' '20000 pound' '2005 text' '21st may'
 '4txtì¼120 poboxox36504w45wq' '6day 16' '81010 tc' '87077 eg'
 '87077 trywal' '87121 receiv' '87575 cost' '900 prize' 'aid patent'
 'alreadi say' 'amor wat' 'anymor tonight' 'appli 08452810075over18'
 'appli repli' 'around though' 'avail bugi' 'back id' 'bless time'
 'breather promis' 'brother like' 'buffet cine' 'bugi great'
 'call 09061701461' 'call mobil' 'caller press' 'callertun caller'
 'camera free' 'cash 100' 'chanc win' 'chg send' 'cine got' 'claim 81010'
 'claim call' 'claim code' 'click httpwap' 'click wap' 'co free'
 'code kl341' 'colour mobil' 'comp win' 'copi friend' 'cost 150pday'
 'crazi avail' 'credit click' 'cri enough' 'csh11 send' 'cup final'
 'custom select' 'darl week' 'date sunday' 'dont miss' 'dont think'
 'dont want' 'dun say' 'earli hor' 'eg england' 'eh rememb'
 'england 87077' 'england macedonia' 'enough t

In [71]:
import pandas as pd



# Convert the sparse matrix to a DataFrame
X_counts_df = pd.DataFrame(X_counts_sample.toarray())

# Get the feature names (the bigrams) learned by the vectorizer
feature_names = ngram_vect_sample.get_feature_names_out()

# Assign feature names to the DataFrame columns
X_counts_df.columns = feature_names

# Display the DataFrame
print(X_counts_df)


    09061701461 claim  100 20000  100000 prize  11 month  12 hour  150 rcv  \
0                   0          0             0         0        0        0   
1                   0          0             0         0        0        0   
2                   0          0             0         0        0        0   
3                   0          0             0         0        0        0   
4                   0          0             0         0        0        0   
5                   0          0             0         0        0        1   
6                   0          0             0         0        0        0   
7                   0          0             0         0        0        0   
8                   1          0             0         0        1        0   
9                   0          0             0         1        0        0   
10                  0          0             0         0        0        0   
11                  0          1             0         0        

# Vectorizing Raw Data: TF-IDF

### TF-IDF

Creates a document-term matrix where the columns represent single unique terms (unigrams) but the cell represents a weighting meant to represent how important a word is to a document.
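One common textbook form of the weighting is `w(t, d) = tf(t, d) * log(N / df(t))`, where `tf(t, d)` is the count of term `t` in document `d`, `N` is the number of documents, and `df(t)` is the number of documents containing `t`. The sketch below implements that form on hypothetical toy data. Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized documents.
docs = [["free", "win", "free"], ["ok", "lar"], ["win", "cup"]]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    # Textbook weighting: raw term count times log(N / df).
    term_counts = Counter(doc)
    return {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in term_counts.items()
    }

for doc in docs:
    print(tfidf(doc))
```

In the first document, "free" appears twice and in no other document, so it gets a higher weight than "win", which appears once and is shared with another document.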



### Read in text

In [72]:
import pandas as pd
import re
import string
import nltk
pd.set_option('display.max_colwidth', 100)

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("spam.csv", encoding='Latin1')

data = data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
data.columns = ['label', 'body_text']

data.head()

Unnamed: 0,label,body_text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


### Create function to remove punctuation, tokenize, remove stopwords, and stem

In [73]:
def clean_text(text):
    # Lowercase and drop punctuation, split on non-word characters, then stem the non-stopwords
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

### Apply TfidfVectorizer

In [75]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Assuming you have already defined the clean_text function and loaded your data

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text'])
print(X_tfidf.shape)

# Get the feature names (the terms) learned by the vectorizer
feature_names = tfidf_vect.get_feature_names_out()
print(feature_names)


(5572, 8060)
['' '0' '008704050406' ... 'ûïharri' 'ûò' 'ûówell']


### Apply TfidfVectorizer to smaller sample

In [77]:
data_sample = data[0:20]

tfidf_vect_sample = TfidfVectorizer(analyzer=clean_text)
X_tfidf_sample = tfidf_vect_sample.fit_transform(data_sample['body_text'])
print(X_tfidf_sample.shape)

# Get the feature names (the terms) learned by the vectorizer
feature_names = tfidf_vect_sample.get_feature_names_out()
print(feature_names)

(20, 223)
['08002986030' '08452810075over18' '09061701461' '1' '100' '100000' '11'
 '12' '150' '150pday' '16' '2' '20000' '2005' '21st' '3' '4'
 '4403ldnw1a7rw18' '4txtì¼120' '6day' '81010' '87077' '87121' '87575' '9'
 '900' 'aid' 'alreadi' 'amor' 'anymor' 'appli' 'around' 'avail' 'b' 'back'
 'bless' 'breather' 'brother' 'buffet' 'bugi' 'c' 'call' 'caller'
 'callertun' 'camera' 'cash' 'chanc' 'chg' 'cine' 'claim' 'click' 'co'
 'code' 'colour' 'comp' 'copi' 'cost' 'crazi' 'credit' 'cri' 'csh11' 'cup'
 'custom' 'darl' 'date' 'dont' 'dun' 'e' 'earli' 'eg' 'eh' 'england'
 'enough' 'entitl' 'entri' 'even' 'fa' 'feel' 'final' 'fine' 'free'
 'freemsg' 'friend' 'fulfil' 'fun' 'go' 'goalsteam' 'goe' 'gonna' 'got'
 'gota' 'grant' 'great' 'help' 'hey' 'hl' 'home' 'hor' 'hour' 'httpwap'
 'id' 'im' 'info' 'ive' 'jackpot' 'joke' 'jurong' 'k' 'kim' 'kl341' 'la'
 'lar' 'latest' 'lccltd' 'like' 'link' 'live' 'macedonia' 'make' 'may'
 'mell' 'membership' 'messag' 'minnaminungint' 'miss' 'mobil' 'month' 

In [79]:


X_tfidf_df = pd.DataFrame(X_tfidf_sample.toarray())
X_tfidf_df.columns = feature_names
X_tfidf_df

Unnamed: 0,08002986030,08452810075over18,09061701461,1,100,100000,11,12,150,150pday,...,wonder,wont,word,world,wwwdbuknet,xxx,xxxmobilemovieclub,xxxmobilemovieclubcomnqjkgighjjgcbl,ye,å
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.198423,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.23345,0.0,...,0.0,0.0,0.185167,0.0,0.0,0.23345,0.0,0.0,0.0,0.185167
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.227832,0.0,0.0,0.0,0.0,0.227832,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.18071
9,0.1971,0.0,0.0,0.0,0.0,0.0,0.1971,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Sentiment Analysis

##### Importing Data

In [80]:
import pandas as pd
import re
import string
import nltk
pd.set_option('display.max_colwidth', 100)

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("spam.csv", encoding='Latin1')

data = data.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
data.columns = ['label', 'body_text']

data.head()

Unnamed: 0,label,body_text
0,ham,"Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there g..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives around here though"


##### Install vaderSentiment to perform Sentiment Analysis

In [82]:
!pip install vaderSentiment


Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
     -------------------------------------- 126.0/126.0 kB 3.7 MB/s eta 0:00:00
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [83]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

#### Defining functions to extract the positive, negative, neutral, and compound scores

In [84]:
analyser = SentimentIntensityAnalyzer()
def sentiment_pos(sentence):
    snt = analyser.polarity_scores(sentence)
    return snt['pos']

def sentiment_neg(sentence):
    snt = analyser.polarity_scores(sentence)
    return snt['neg']

def sentiment_neu(sentence):
    snt = analyser.polarity_scores(sentence)
    return snt['neu']

def sentiment_com(sentence):
    snt = analyser.polarity_scores(sentence)
    return snt['compound']

In [86]:
data['Positivity'] = data['body_text'].apply(sentiment_pos)
data['Negative'] = data['body_text'].apply(sentiment_neg)
data['Neutral'] = data['body_text'].apply(sentiment_neu)
data['Compound'] = data['body_text'].apply(sentiment_com)

In [62]:
df.head()

Unnamed: 0,label,body_text,Positivity,Negative,Neutral,Compound
0,spam,Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive ...,0.228,0.0,0.772,0.7964
1,ham,"Nah I don't think he goes to usf, he lives around here though",0.0,0.113,0.887,-0.1027
2,ham,Even my brother is not like to speak with me. They treat me like aids patent.,0.136,0.212,0.653,-0.1331
3,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,0.0,0.0,1.0,0.0
4,ham,As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your call...,0.11,0.0,0.89,0.4767
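The compound score is often turned into a categorical label. The sketch below assumes the cut-offs of +0.05 and -0.05 that are commonly suggested for VADER; both the threshold and the helper name `label_from_compound` are assumptions, not part of the notebook, and the cut-offs may need tuning for SMS data.

```python
def label_from_compound(compound, threshold=0.05):
    # Assumed cut-offs: at or above +threshold is positive, at or below -threshold is negative.
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores taken from the table above.
for score in (0.7964, -0.1027, 0.0):
    print(score, label_from_compound(score))
```

Applied to a DataFrame, the function could be passed to `apply` on the `Compound` column to create a label column.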
