# NLP Modeling

How do we quantify a document?

- [Setup](#setup)
- [Data Representation](#data-representation)
    - [Bag of Words](#bag-of-words)
    - [TF-IDF](#tf-idf)
    - [Bag Of Ngrams](#bag-of-ngrams)
- [Modeling](#modeling)
    - [Modeling Results](#modeling-results)
- [Next Steps](#next-steps)

## Setup

In [1]:
from pprint import pprint
import pandas as pd
import nltk
import re

def clean(text: str) -> list:
    'A simple function to clean up text data'
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = set(nltk.corpus.stopwords.words('english'))
    text = (text.encode('ascii', 'ignore')
             .decode('utf-8', 'ignore')
             .lower())
    words = re.sub(r'[^\w\s]', '', text).split() # tokenization
    return [wnl.lemmatize(word) for word in words if word not in stopwords]

## Data Representation

Simple data for demonstration.

In [2]:
data = [
    'Python is pretty cool',
    'Python is a nice programming language with nice syntax',
    'I think SQL is cool too',
]

In [4]:
print(data)

['Python is pretty cool', 'Python is a nice programming language with nice syntax', 'I think SQL is cool too']


In [5]:
pprint(data)

['Python is pretty cool',
 'Python is a nice programming language with nice syntax',
 'I think SQL is cool too']


### Bag of Words

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

In [6]:
df = pd.read_csv('spam_clean.csv')

In [7]:
cv = CountVectorizer()

In [8]:
bag_o_words = cv.fit_transform(data)

In [9]:
bag_o_words

<3x12 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [11]:
bag_o_words.todense()

matrix([[1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
        [0, 1, 1, 2, 0, 1, 1, 0, 1, 0, 0, 1],
        [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0]])

Here `bag_o_words` is a **sparse matrix**. Usually you should keep it in that
form, but for demonstration we convert it to a dense matrix to view the data within.
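
To make the sparse representation concrete, the sketch below rebuilds the same counts in plain Python and stores only the non-zero cells. It assumes `CountVectorizer`'s default behaviour of lowercasing the text and keeping only tokens of two or more word characters (which is why `a` and `I` never show up in the vocabulary). The names `tokenize` and `sparse_counts` are made up for this illustration and are not part of scikit-learn.

```python
import re
from collections import Counter

data = [
    'Python is pretty cool',
    'Python is a nice programming language with nice syntax',
    'I think SQL is cool too',
]

def tokenize(text):
    # Approximates CountVectorizer's defaults: lowercase, tokens of 2+ word characters
    return re.findall(r'\b\w\w+\b', text.lower())

# Sorted vocabulary, mapping each word to a column index
all_words = sorted({word for doc in data for word in tokenize(doc)})
vocabulary = {word: i for i, word in enumerate(all_words)}

# Keep only the non-zero cells: (row, column) -> count
sparse_counts = {}
for row, doc in enumerate(data):
    for word, count in Counter(tokenize(doc)).items():
        sparse_counts[(row, vocabulary[word])] = count

n_cells = len(data) * len(vocabulary)
print(f'{len(data)}x{len(vocabulary)} matrix, '
      f'{len(sparse_counts)} of {n_cells} cells are non-zero')
```

Even with three short documents, fewer than half of the cells hold a count. On a real corpus with thousands of distinct words the share of non-zero cells is typically far smaller, which is why the sparse format is worth keeping.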

In [13]:
cv.get_feature_names_out()

array(['cool', 'is', 'language', 'nice', 'pretty', 'programming',
       'python', 'sql', 'syntax', 'think', 'too', 'with'], dtype=object)

In [16]:
cv.vocabulary_

{'python': 6,
 'is': 1,
 'pretty': 4,
 'cool': 0,
 'nice': 3,
 'programming': 5,
 'language': 2,
 'with': 11,
 'syntax': 8,
 'think': 9,
 'sql': 7,
 'too': 10}

In [20]:
# Taking a look at the bag of words transformation for education and diagnostics.
# In practice this is not necessary, and the resulting data might be too big to be reasonably helpful.
bow = pd.DataFrame(bag_o_words.todense())
bow.columns = cv.get_feature_names_out()

In [22]:
bow.T

Unnamed: 0,0,1,2
cool,1,0,1
is,1,1,1
language,0,1,0
nice,0,2,0
pretty,1,0,0
programming,0,1,0
python,1,1,0
sql,0,0,1
syntax,0,1,0
think,0,0,1
too,0,0,1
with,0,1,0


In [23]:
bow

Unnamed: 0,cool,is,language,nice,pretty,programming,python,sql,syntax,think,too,with
0,1,1,0,0,1,0,1,0,0,0,0,0
1,0,1,1,2,0,1,1,0,1,0,0,1
2,1,1,0,0,0,0,0,1,0,1,1,0


In [24]:
bow.apply(lambda row: row / row.sum(), axis=1)

Unnamed: 0,cool,is,language,nice,pretty,programming,python,sql,syntax,think,too,with
0,0.25,0.25,0.0,0.0,0.25,0.0,0.25,0.0,0.0,0.0,0.0,0.0
1,0.0,0.125,0.125,0.25,0.0,0.125,0.125,0.0,0.125,0.0,0.0,0.125
2,0.2,0.2,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.2,0.2,0.0


### TF-IDF

- term frequency - inverse document frequency
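- $\text{tf-idf} = \text{tf} \times \text{idf}$; conceptually idf behaves like $\frac{1}{\text{df}}$, though in practice it is log-scaled (scikit-learn's default is $\text{idf} = \ln\frac{1 + n}{1 + \text{df}} + 1$ for $n$ documents)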
- a measure that helps identify how important a word is in a document
- combination of how often a word appears in a document (**tf**) and how unique the word
  is among documents (**idf**)
- used by search engines
- naturally down-weights stopwords, since they appear in nearly every document
- tf is for a single document, idf is for a corpus

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
bag_of_words = tfidf.fit_transform(data)
pprint(data)
pd.DataFrame(bag_of_words.todense(),
             columns=tfidf.get_feature_names_out())

['Python is pretty cool',
 'Python is a nice programming language with nice syntax',
 'I think SQL is cool too']


Unnamed: 0,cool,is,language,nice,pretty,programming,python,sql,syntax,think,too,with
0,0.480458,0.373119,0.0,0.0,0.631745,0.0,0.480458,0.0,0.0,0.0,0.0,0.0
1,0.0,0.197673,0.334689,0.669378,0.0,0.334689,0.25454,0.0,0.334689,0.0,0.0,0.334689
2,0.38377,0.298032,0.0,0.0,0.0,0.0,0.0,0.504611,0.0,0.504611,0.504611,0.0


To get the idf score for each word (these aren't terribly useful by themselves):

In [27]:
pd.Series(dict(zip(tfidf.get_feature_names_out(), tfidf.idf_)))

cool           1.287682
is             1.000000
language       1.693147
nice           1.693147
pretty         1.693147
programming    1.693147
python         1.287682
sql            1.693147
syntax         1.693147
think          1.693147
too            1.693147
with           1.693147
dtype: float64
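
These numbers can be reproduced by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer`: raw counts for tf, a smoothed idf of $\ln\frac{1 + n}{1 + \text{df}} + 1$ (with $n$ the number of documents), and each row scaled to unit length. The helper names `tokenize` and `tfidf_row` are made up for the illustration.

```python
import math
import re
from collections import Counter

data = [
    'Python is pretty cool',
    'Python is a nice programming language with nice syntax',
    'I think SQL is cool too',
]

def tokenize(text):
    # Approximates the vectorizer's defaults: lowercase, tokens of 2+ word characters
    return re.findall(r'\b\w\w+\b', text.lower())

docs = [tokenize(doc) for doc in data]
n_docs = len(docs)

# Document frequency: the number of documents each word appears in
doc_freq = Counter(word for doc in docs for word in set(doc))

# Smoothed idf (assumed to match scikit-learn's default, smooth_idf=True)
idf = {word: math.log((1 + n_docs) / (1 + count)) + 1
       for word, count in doc_freq.items()}

def tfidf_row(doc):
    # tf * idf for each word, then scale the row to unit (L2) length
    raw = {word: tf * idf[word] for word, tf in Counter(doc).items()}
    norm = math.sqrt(sum(value ** 2 for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

print({word: round(idf[word], 6) for word in ('is', 'cool', 'pretty')})
print({word: round(score, 6) for word, score in tfidf_row(docs[0]).items()})
```

A word that appears in every document (`is`) gets the minimum idf of 1, while a word found in only one document gets the largest value. The row normalization is why the scores in the TF-IDF table are all below 1.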

### Bag Of Ngrams

For either `CountVectorizer` or `TfidfVectorizer`, you can set the `ngram_range`
parameter.

In [28]:
cv = CountVectorizer(ngram_range=(2,4))
grams_bag = cv.fit_transform(data)

In [31]:
model_df = pd.DataFrame(grams_bag.todense(), columns=cv.get_feature_names_out())
model_df

Unnamed: 0,cool too,is cool,is cool too,is nice,is nice programming,is nice programming language,is pretty,is pretty cool,language with,language with nice,...,python is pretty,python is pretty cool,sql is,sql is cool,sql is cool too,think sql,think sql is,think sql is cool,with nice,with nice syntax
0,0,0,0,0,0,0,1,1,0,0,...,1,1,0,0,0,0,0,0,0,0
1,0,0,0,1,1,1,0,0,1,1,...,0,0,0,0,0,0,0,0,1,1
2,1,1,1,0,0,0,0,0,0,0,...,0,0,1,1,1,1,1,1,0,0
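
As a rough illustration of what `ngram_range=(2, 4)` produces, the sketch below builds word n-grams in plain Python. It assumes the vectorizer's default tokenization (lowercased tokens of two or more word characters); the function `word_ngrams` is hypothetical and not part of scikit-learn.

```python
import re

def word_ngrams(text, ngram_range=(2, 4)):
    'Return every word n-gram of text for n in ngram_range (inclusive)'
    tokens = re.findall(r'\b\w\w+\b', text.lower())
    low, high = ngram_range
    return [
        ' '.join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]

for gram in word_ngrams('Python is pretty cool'):
    print(gram)
```

Each document contributes every run of 2, 3, and 4 consecutive words, so the number of columns grows quickly as the range widens.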


## Modeling

In [32]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df

Unnamed: 0,id,label,text,text_clean,message_length,word_count,sentiment
0,0,ham,"Go until jurong point, crazy.. Available only ...","['go', 'jurong', 'point', 'crazy', 'available'...",111,16,0.6249
1,1,ham,Ok lar... Joking wif u oni...,"['ok', 'lar', 'joking', 'wif', 'oni']",29,5,0.4767
2,2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"['free', 'entry', 'wkly', 'comp', 'win', 'fa',...",155,22,0.7964
3,3,ham,U dun say so early hor... U c already then say...,"['dun', 'say', 'early', 'hor', 'c', 'already',...",49,7,0.0000
4,4,ham,"Nah I don't think he goes to usf, he lives aro...","['nah', 'dont', 'think', 'go', 'usf', 'life', ...",61,8,-0.1027
...,...,...,...,...,...,...,...
5567,5567,spam,This is the 2nd time we have tried 2 contact u...,"['2nd', 'time', 'tried', 'contact', 'a750', 'p...",161,16,0.8805
5568,5568,ham,Will Ì_ b going to esplanade fr home?,"['i_', 'b', 'going', 'esplanade', 'fr', 'home']",37,6,0.0000
5569,5569,ham,"Pity, * was in mood for that. So...any other s...","['pity', 'mood', 'soany', 'suggestion']",57,4,-0.2960
5570,5570,ham,The guy did some bitching but I acted like i'd...,"['guy', 'bitching', 'acted', 'like', 'id', 'in...",125,14,0.8934


In [35]:
df['text_cleaned'] = df.text.apply(clean).apply(' '.join)

In [36]:
df

Unnamed: 0,id,label,text,text_clean,message_length,word_count,sentiment,text_cleaned
0,0,ham,"Go until jurong point, crazy.. Available only ...","['go', 'jurong', 'point', 'crazy', 'available'...",111,16,0.6249,go jurong point crazy available bugis n great ...
1,1,ham,Ok lar... Joking wif u oni...,"['ok', 'lar', 'joking', 'wif', 'oni']",29,5,0.4767,ok lar joking wif u oni
2,2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,"['free', 'entry', 'wkly', 'comp', 'win', 'fa',...",155,22,0.7964,free entry 2 wkly comp win fa cup final tkts 2...
3,3,ham,U dun say so early hor... U c already then say...,"['dun', 'say', 'early', 'hor', 'c', 'already',...",49,7,0.0000,u dun say early hor u c already say
4,4,ham,"Nah I don't think he goes to usf, he lives aro...","['nah', 'dont', 'think', 'go', 'usf', 'life', ...",61,8,-0.1027,nah dont think go usf life around though
...,...,...,...,...,...,...,...,...
5567,5567,spam,This is the 2nd time we have tried 2 contact u...,"['2nd', 'time', 'tried', 'contact', 'a750', 'p...",161,16,0.8805,2nd time tried 2 contact u u 750 pound prize 2...
5568,5568,ham,Will Ì_ b going to esplanade fr home?,"['i_', 'b', 'going', 'esplanade', 'fr', 'home']",37,6,0.0000,_ b going esplanade fr home
5569,5569,ham,"Pity, * was in mood for that. So...any other s...","['pity', 'mood', 'soany', 'suggestion']",57,4,-0.2960,pity mood soany suggestion
5570,5570,ham,The guy did some bitching but I acted like i'd...,"['guy', 'bitching', 'acted', 'like', 'id', 'in...",125,14,0.8934,guy bitching acted like id interested buying s...


In [37]:
X = df.text_cleaned
y = df.label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1349)

In [38]:
X_train

2302                                     make baby yo tho
4304                                  yo come carlos soon
4925                 oh yes like torture watching england
1753                  jus came back fr lunch wif si u leh
3211                                 got divorce lol shes
                              ...                        
1435                                       dad went oredi
227                         hey company elama po mudyadhu
2724                                             nope c _
4990    made eta taunton 1230 planned hope thats still...
2834    ya well fine bbdpooja full pimpleseven become ...
Name: text_cleaned, Length: 4457, dtype: object

Iterate:

- try out the bag of ngrams
- try out different ways of text prep (stem vs lemmatize)
- etc...

In [39]:
# Whatever transformations we apply to X_train need to be applied to X_test
cv = CountVectorizer()
X_bow = cv.fit_transform(X_train)

tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_bow, y_train)
tree.score(X_bow, y_train)

0.9351581781467355

In [40]:
X_test_bow = cv.transform(X_test)

In [41]:
tree.score(X_test_bow, y_test)

0.9219730941704036
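
One way to make sure the test data always goes through the same transformation as the training data is to bundle the vectorizer and the model together. The sketch below uses scikit-learn's `make_pipeline` on a handful of made-up messages; `train_messages`, `train_labels`, and `test_messages` are hypothetical stand-ins for `X_train`, `y_train`, and `X_test`. It is an optional convenience, not what the cells above do.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for X_train / y_train / X_test
train_messages = [
    'free prize call now',
    'call to claim your free prize',
    'txt win to claim cash',
    'see you at lunch',
    'ok will call you later',
    'running late see you soon',
]
train_labels = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham']
test_messages = ['claim your free cash', 'lunch later']

model = make_pipeline(
    CountVectorizer(),
    DecisionTreeClassifier(max_depth=5, random_state=1349),
)

# fit() learns the vocabulary and the tree from the training messages only
model.fit(train_messages, train_labels)

# predict() reuses the fitted vocabulary; words unseen during training are ignored
predictions = model.predict(test_messages)
print(list(zip(test_messages, predictions)))
```

Because the vectorizer lives inside the pipeline, there is no separate `cv.transform(X_test)` step to forget, and `model.score(X_test, y_test)` would work directly on the raw text.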

### Modeling Results

A super-useful feature of decision trees and linear models is that they do some
built-in feature selection through the coefficients or feature importances:

In [42]:
tree.feature_importances_

array([0., 0., 0., ..., 0., 0., 0.])

In [45]:
pd.Series(dict(zip(cv.get_feature_names_out(), tree.feature_importances_))).sort_values().tail(15)

eg23f       0.000000
effect      0.000000
let         0.003451
finish      0.006010
bak         0.006246
probably    0.006495
stop        0.025823
mobile      0.035174
im          0.036523
reply       0.055870
claim       0.056853
later       0.062218
text        0.086997
txt         0.266384
call        0.351957
dtype: float64

## Next Steps

- Try other model types

    [Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
    ([`sklearn`
    docs](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html))
    is a very popular classifier for NLP tasks.

- Look at other metrics. Is accuracy the best choice here?

- Try ngrams instead of single words

- Try a combination of ngrams and words (`ngram_range=(1, 2)` for words and
  bigrams)

- Try using tf-idf instead of bag of words

- Combine the top `n` performing words with the other features that you have
  engineered (the `CountVectorizer` and `TfidfVectorizer` have a `vocabulary`
  argument you can use to restrict the words used)
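
On the question of metrics: accuracy can look good even when a model misses most of the spam, if ham is much more common. The sketch below computes accuracy, precision, and recall for the spam class by hand. The `actual` and `predicted` labels are made up for the illustration and are not results from the model above.

```python
# Hypothetical labels for ten messages; not output from the model above
actual = ['ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'spam', 'spam']
predicted = ['ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham']

pairs = list(zip(actual, predicted))
true_pos = sum(a == 'spam' and p == 'spam' for a, p in pairs)
false_pos = sum(a == 'ham' and p == 'spam' for a, p in pairs)
false_neg = sum(a == 'spam' and p == 'ham' for a, p in pairs)

accuracy = sum(a == p for a, p in pairs) / len(pairs)
precision = true_pos / (true_pos + false_pos)  # of messages flagged as spam, how many were spam
recall = true_pos / (true_pos + false_neg)     # of actual spam, how many were caught

print(f'accuracy:  {accuracy:.2f}')
print(f'precision: {precision:.2f}')
print(f'recall:    {recall:.2f}')
```

In this toy case accuracy is 0.80 even though only one of the three spam messages is caught, so recall tells a very different story from accuracy.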