## What are we going to cover?

So far we have looked at the `Preprocessing` task of the Pipeline. Now it's time to dig into `Feature Engineering`.

In fact, we have already seen one way to extract information from a document: `Regular Expressions`.

Now we will see another approach: `Machine Learning`.

The main goal is to create vectors that describe the words of a corpus numerically (`word embeddings`) in a meaningful way, so that similar words have similar or even identical vector representations. It is arguably the hardest and the most important task of the entire Pipeline.

Models that use vectors to describe and represent words are called `Vector Space Models`.

The first thing we can do, before we even start creating the model and computing those embeddings, is to create a `vocabulary`.

A vocabulary is a list or a dictionary that captures all the words of our dataset samples. Before adding a word to the vocabulary, we first need to perform some Preprocessing, such as Tokenization, Stemming and Lemmatization.
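To make the idea concrete, here is a minimal sketch of building a vocabulary by hand. The `tokenize` and `build_vocabulary` helpers are hypothetical names introduced only for this illustration, the tokenizer is a simple regular expression rather than a real stemmer or lemmatizer, and the second sample is made up.

```python
import re


def tokenize(text):
    # Lowercase the text and keep runs of letters, digits and apostrophes
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(samples):
    # Map every distinct token to the position at which it was first seen
    vocabulary = {}
    for sample in samples:
        for token in tokenize(sample):
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


samples = ["Ok lar... Joking wif u oni...", "Ok, see u later"]
vocab = build_vocabulary(samples)
print(vocab)
```

Each word gets an integer index, which is what later lets us place its count in a fixed column of a vector.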

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer # For creating the BOW representation
from sklearn.feature_extraction.text import TfidfVectorizer # For tf-idf representation
from sklearn.naive_bayes import MultinomialNB               # For Naive Bayes Classifier
from sklearn.neighbors import KNeighborsClassifier          # For K-Neighbours Classifier
from sklearn.ensemble import RandomForestClassifier         # For Random Forest Classifier
from sklearn.metrics import classification_report           # For evaluating metric

## Understanding the Dataset

In [None]:
df = pd.read_csv("sms_spam.csv", encoding="ISO-8859-1")[["v1", "v2"]]

df.head(3)

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...


### Changing the value of "ham" into 0 and "spam" into 1

In [None]:
df["spam"] = df["v1"].apply(lambda x: 1 if x == "spam" else 0)

### Dropping the first column

In [None]:
# Keep only the two columns we need and give the text column a readable name
df = df[["v2", "spam"]].rename(columns={"v2": "Message"})

df.head(3)

Unnamed: 0,Message,spam
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1


### Printing the Number of Spam and Ham SMSs

In [None]:
df.value_counts("spam")

spam
0    4825
1     747
dtype: int64

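We can see that this dataset is heavily imbalanced. One way to solve that problem is to undersample the majority class, that is, to randomly discard ham messages until both classes have the same size. This method is not the most effective, and there are a lot of other ways to deal with the problem, but it is the easiest. Note that the balancing cell below is commented out, so the class counts stay unchanged for the rest of this section.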

In [None]:
# # Balancing the DataFrame
# g = df.groupby("spam", group_keys=False)
# df = g.apply(lambda x: x.sample(g.size().min()).reset_index(drop=True))

# df.head(3)

In [None]:
df.value_counts("spam")

spam
0    4825
1     747
dtype: int64
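As a small illustration of undersampling, the sketch below balances a made-up toy DataFrame (the messages are hypothetical, not taken from the dataset). It samples as many rows from each class as the smallest class contains.

```python
import pandas as pd

# Hypothetical toy data: 4 ham messages (0) and 2 spam messages (1)
toy = pd.DataFrame({
    "Message": ["hi", "call me", "see you", "lunch?", "WIN cash now", "FREE prize"],
    "spam": [0, 0, 0, 0, 1, 1],
})

# Size of the smallest class
minority_size = toy["spam"].value_counts().min()

# Draw `minority_size` random rows from every class and stack them again
balanced = pd.concat(
    [group.sample(minority_size, random_state=0) for _, group in toy.groupby("spam")],
    ignore_index=True,
)

print(balanced["spam"].value_counts())
```

The price of this simplicity is that we throw away most of the majority class, which is why it is usually not the preferred approach on small datasets.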

### Splitting the Dataset into Training and Test Sets

In [None]:
# We are splitting the dataset with 20% test size
x_train, x_test, y_train, y_test = train_test_split(df.Message, df.spam, test_size=0.2)

In [None]:
# The resulting x and y are pandas Series
print(type(x_train), type(y_test))

<class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>


In [None]:
# Printing the first 3 elements of the testing data
print(x_test[:3])
print(y_test[:3])

5431                   If I was I wasn't paying attention
3937    WHEN THE FIRST STRIKE IS A RED ONE. THE BIRD +...
877     Sunshine Quiz Wkly Q! Win a top Sony DVD playe...
Name: Message, dtype: object
5431    0
3937    0
877     1
Name: spam, dtype: int64


In [None]:
# Printing the size of x and y
print(len(x_train), len(y_train))
print(len(x_test), len(y_test))

4457 4457
1115 1115
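Because the classes are imbalanced, it can help to keep the same class ratio in both sets. The sketch below uses the `stratify` and `random_state` parameters of `train_test_split` on made-up data; it is only an illustration and is not what the cells above do.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 5 spam (1) and 15 ham (0) messages
messages = [f"message {i}" for i in range(20)]
labels = [1] * 5 + [0] * 15

# `stratify` keeps the spam ratio equal in both sets,
# `random_state` makes the split reproducible
x_tr, x_te, y_tr, y_te = train_test_split(
    messages, labels, test_size=0.2, stratify=labels, random_state=42
)

print("spam ratio in train:", sum(y_tr) / len(y_tr))
print("spam ratio in test: ", sum(y_te) / len(y_te))
```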


In [None]:
print(x_train[0])

Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...


## BOW

The first method we will examine for representing text as a vector is `Bag of Words` (BOW).

We will evaluate this approach on a simple binary classification example: spam versus ham SMS messages.

### Initialize the BOW object

In [None]:
bow_v = CountVectorizer()

### Creating the BOWs

In [None]:
x_train_cv = bow_v.fit_transform(x_train.values) # `.values` converts the training samples from a Series into an ndarray (the vectorizer also accepts the Series directly)

x_train_cv

<4457x7655 sparse matrix of type '<class 'numpy.int64'>'
	with 59159 stored elements in Compressed Sparse Row format>

### Discovering the Results

In [None]:
# We can convert this representation into a ndarray
x_train_np = x_train_cv.toarray()
x_train_np

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [None]:
# The shape of this array
x_train_np.shape

(4457, 7655)

The vocabulary contains 7655 unique words, one per column of the array.

In [None]:
# We can see all the words using:
print(bow_v.get_feature_names_out())
print(bow_v.get_feature_names_out().shape)

['00' '000' '000pes' ... 'ûªve' 'ûò' 'ûówell']
(7655,)


In [None]:
# We can also see the vocabulary in a dictionary format
print(bow_v.vocabulary_)



### Converting the Vector Representation into Text

In [None]:
source_vector = x_train_np[0]
source_vector

array([0, 0, 0, ..., 0, 0, 0])

In [None]:
non_zero_indexes = np.where(source_vector!=0)[0]
non_zero_indexes

array([ 946, 1576, 2833, 2841, 2920, 2942, 3389, 4139, 4359, 4679, 6845,
       6967, 7577, 7604])

In [None]:
for index in non_zero_indexes[:3]:
    print(bow_v.get_feature_names_out()[index])

am
busy
finally
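The same lookup can also be done with the vectorizer's `inverse_transform` method, which returns the vocabulary words that have a non-zero count in each row. The sketch below uses made-up sentences rather than the SMS dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical messages
docs = ["free prize call now", "see you at lunch"]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)

# Recover the words present in the first document
words_per_doc = vectorizer.inverse_transform(matrix[0])
print(words_per_doc[0])
```

Only the set of words is recovered; their original order in the message is not stored in the BOW vector.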


### Creating the Machine Learning Model to Evaluate this Vector Representation Method

In [None]:
model = MultinomialNB()

### Training the Model

In [None]:
model.fit(x_train_cv, y_train)

### Evaluating the Model

In [None]:
x_test_cv = bow_v.transform(x_test.values)

In [None]:
y_pred = model.predict(x_test_cv)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       968
           1       0.95      0.93      0.94       147

    accuracy                           0.98      1115
   macro avg       0.97      0.96      0.97      1115
weighted avg       0.98      0.98      0.98      1115



### Making Predictions on custom Data

In [None]:
sms = [
    "hey dad, can we get together to watch football?",
    "Upto 20% discount parking. Don't miss this reward!"
]

sms_cv = bow_v.transform(sms)
model.predict(sms_cv)

array([0, 1])

### Another Way: Using a Pipeline

In [None]:
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB())
])

clf.fit(x_train, y_train)

print(classification_report(y_test, clf.predict(x_test)))
print(clf.predict(sms))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       968
           1       0.95      0.93      0.94       147

    accuracy                           0.98      1115
   macro avg       0.97      0.96      0.97      1115
weighted avg       0.98      0.98      0.98      1115

[0 1]


### Stop Words

By default, the BOW approach creates a sparse matrix that, besides the relevant information, also contains words like "the", "or", "to", "a", etc. Those words are called `Stop Words`, and they don't give as much information about the text as we would like.

So, in order to make our model simpler and yet more powerful, we could remove them from the representation.

This is another `Preprocessing` task: `Stop Word Removal`.

Although it is useful to remove those words, in some downstream tasks like sentiment analysis and language translation, words like "and", "not", "nor" actually give us a lot of information about the sentiment of the document. So we need to be careful when we use `Stop Word Removal`.
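As a quick sketch of the effect, `CountVectorizer` can drop stop words itself through its `stop_words` parameter. The example below uses scikit-learn's built-in English list, which is not the same list as spaCy's, and two made-up sentences.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical reviews
docs = ["the movie was not good", "the food was good"]

with_stops = CountVectorizer().fit(docs)
without_stops = CountVectorizer(stop_words="english").fit(docs)

print("with stop words:   ", sorted(with_stops.vocabulary_))
print("without stop words:", sorted(without_stops.vocabulary_))

# Check whether the negation survived the removal
print("'not' kept:", "not" in without_stops.vocabulary_)
```

If the negation is removed, both reviews end up looking positive to the model, which is exactly the risk described above for sentiment analysis.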

In [None]:
!python -m spacy download en_core_web_sm -q

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

We can see all the stop words that spaCy defines:

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS

STOP_WORDS

{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
 'across',
 'after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'amount',
 'an',
 'and',
 'another',
 'any',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anywhere',
 'are',
 'around',
 'as',
 'at',
 'back',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'below',
 'beside',
 'besides',
 'between',
 'beyond',
 'both',
 'bottom',
 'but',
 'by',
 'ca',
 'call',
 'can',
 'cannot',
 'could',
 'did',
 'do',
 'does',
 'doing',
 'done',
 'down',
 'due',
 'during',
 'each',
 'eight',
 'either',
 'eleven',
 'else',
 'elsewhere',
 'empty',
 'enough',
 'even',
 'ever',
 'every',
 'everyone',
 'everything',
 'everywhere',
 'except',
 'few',
 'fifteen',
 'fifty',
 'first',
 'five',
 'for',
 'former',
 'formerly',
 'forty',
 'four',
 'from',
 'fron

In [None]:
# We can also see if a word token is a stop word
doc = nlp("We just opened out wings, the flying part is comming soon")

stops = [token for token in doc if token.is_stop]
stops

[We, just, out, the, part, is]

## TF-IDF

### A First Example to Demonstrate this Vector Representation

In [None]:
corpus = [
    "Thor eating pizza, Loki is eating pizza, Ironman ate pizza already",
    "Apple is announcing new iphone tomorrow",
    "Tesla is announcing new model-3 tomorrow",
    "Google is announcing new pixel-6 tomorrow",
    "Microsoft is announcing new surface tomorrow",
    "Amazon is announcing new eco-dot tomorrow",
    "I am eating biryani and you are eating grapes"
]

tf_idf_v = TfidfVectorizer()
transformed_output = tf_idf_v.fit_transform(corpus)
print(tf_idf_v.vocabulary_)

{'thor': 25, 'eating': 10, 'pizza': 22, 'loki': 17, 'is': 16, 'ironman': 15, 'ate': 7, 'already': 0, 'apple': 5, 'announcing': 4, 'new': 20, 'iphone': 14, 'tomorrow': 26, 'tesla': 24, 'model': 19, 'google': 12, 'pixel': 21, 'microsoft': 18, 'surface': 23, 'amazon': 2, 'eco': 11, 'dot': 9, 'am': 1, 'biryani': 8, 'and': 3, 'you': 27, 'are': 6, 'grapes': 13}


In [None]:
# We can convert this dictionary into a numpy array
tf_idf_v.get_feature_names_out()

array(['already', 'am', 'amazon', 'and', 'announcing', 'apple', 'are',
       'ate', 'biryani', 'dot', 'eating', 'eco', 'google', 'grapes',
       'iphone', 'ironman', 'is', 'loki', 'microsoft', 'model', 'new',
       'pixel', 'pizza', 'surface', 'tesla', 'thor', 'tomorrow', 'you'],
      dtype=object)

In [None]:
# Printing the idf score of each word
for word in tf_idf_v.get_feature_names_out():
    print(word, " | ", tf_idf_v.idf_[tf_idf_v.vocabulary_[word]])

already  |  2.386294361119891
am  |  2.386294361119891
amazon  |  2.386294361119891
and  |  2.386294361119891
announcing  |  1.2876820724517808
apple  |  2.386294361119891
are  |  2.386294361119891
ate  |  2.386294361119891
biryani  |  2.386294361119891
dot  |  2.386294361119891
eating  |  1.9808292530117262
eco  |  2.386294361119891
google  |  2.386294361119891
grapes  |  2.386294361119891
iphone  |  2.386294361119891
ironman  |  2.386294361119891
is  |  1.1335313926245225
loki  |  2.386294361119891
microsoft  |  2.386294361119891
model  |  2.386294361119891
new  |  1.2876820724517808
pixel  |  2.386294361119891
pizza  |  2.386294361119891
surface  |  2.386294361119891
tesla  |  2.386294361119891
thor  |  2.386294361119891
tomorrow  |  1.2876820724517808
you  |  2.386294361119891
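These scores can be reproduced by hand. With its default settings, `TfidfVectorizer` uses a smoothed idf, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents that contain the term. The sketch below plugs in the document frequencies counted from the 7-sentence corpus above; `smoothed_idf` is a helper name introduced for this illustration.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    # Smoothed inverse document frequency (scikit-learn default, smooth_idf=True)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n_docs = 7

# (word, number of corpus sentences that contain it)
doc_freqs = [("is", 6), ("announcing", 5), ("eating", 2), ("pizza", 1)]

for word, doc_freq in doc_freqs:
    print(word, " | ", smoothed_idf(n_docs, doc_freq))
```

Words that appear in almost every sentence, such as "is", get the lowest score, while words that appear in a single sentence get the highest.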


In [None]:
# Printing the vector representation of the first sentence
transformed_output.toarray()[0]

array([0.24266547, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.24266547, 0.        , 0.        ,
       0.40286636, 0.        , 0.        , 0.        , 0.        ,
       0.24266547, 0.11527033, 0.24266547, 0.        , 0.        ,
       0.        , 0.        , 0.72799642, 0.        , 0.        ,
       0.24266547, 0.        , 0.        ])
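The non-zero values of this vector can also be recomputed manually: each word count in the first sentence is multiplied by the word's idf, and the resulting vector is scaled to unit Euclidean length (the default `norm="l2"` of `TfidfVectorizer`). The counts and document frequencies below were read off the corpus by hand, and `smoothed_idf` is a helper name introduced for this sketch.

```python
import math

import numpy as np


def smoothed_idf(n_docs, doc_freq):
    # Smoothed inverse document frequency (scikit-learn default)
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n_docs = 7

# word -> (count in the first sentence, number of sentences containing the word)
terms = {
    "already": (1, 1),
    "ate": (1, 1),
    "eating": (2, 2),
    "ironman": (1, 1),
    "is": (1, 6),
    "loki": (1, 1),
    "pizza": (3, 1),
    "thor": (1, 1),
}

# Raw tf-idf: term count times idf
raw = np.array([count * smoothed_idf(n_docs, df_t) for count, df_t in terms.values()])

# L2 normalisation so that the vector has length 1
weights = raw / np.linalg.norm(raw)

for word, weight in zip(terms, weights):
    print(word, " | ", round(float(weight), 8))
```

"pizza" dominates the vector because it is both frequent in the sentence and rare in the rest of the corpus.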

### Loading the Dataset

In [2]:
from pathlib import Path
import zipfile


zip_path = Path("/content/ecommerce_classification.zip")
dest_dir = Path("/content")

csv_path = dest_dir / "ecommerceDataset.csv"

# Extract the archive only if the CSV file is not already there
if not csv_path.is_file():
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        print(f"[INFO] Unzipping dataset `{zip_path}` to `{dest_dir}`...")
        zip_ref.extractall(dest_dir)

print(f"[INFO] Dataset successfully extracted to `{dest_dir}`..")

[INFO] Unzipping dataset `/content/ecommerce_classification.zip` to `/content`...
[INFO] Dataset successfully extracted to `/content`..


### Preprocessing the Dataset

In [3]:
df = pd.read_csv("/content/ecommerceDataset.csv")

df.head(3)

Unnamed: 0,Household,"Paper Plane Design Framed Wall Hanging Motivational Office Decor Art Prints (8.7 X 8.7 inch) - Set of 4 Painting made up in synthetic frame with uv textured print which gives multi effects and attracts towards it. This is an special series of paintings which makes your wall very beautiful and gives a royal touch. This painting is ready to hang, you would be proud to possess this unique painting that is a niche apart. We use only the most modern and efficient printing technology on our prints, with only the and inks and precision epson, roland and hp printers. This innovative hd printing technique results in durable and spectacular looking prints of the highest that last a lifetime. We print solely with top-notch 100% inks, to achieve brilliant and true colours. Due to their high level of uv resistance, our prints retain their beautiful colours for many years. Add colour and style to your living space with this digitally printed painting. Some are for pleasure and some for eternal bliss.so bring home this elegant print that is lushed with rich colors that makes it nothing but sheer elegance to be to your friends and family.it would be treasured forever by whoever your lucky recipient is. Liven up your place with these intriguing paintings that are high definition hd graphic digital prints for home, office or any room."
0,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
1,Household,SAF 'UV Textured Modern Art Print Framed' Pain...
2,Household,"SAF Flower Print Framed Painting (Synthetic, 1..."


In [4]:
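# The CSV file has no header row, so pandas used the first sample as column names.
# Rebuild the DataFrame with that sample as its first row and with proper column names.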
first_label = df.columns[0]
first_text = df.columns[1]

h1 = pd.DataFrame(np.array([first_label]))
h2 = pd.DataFrame(np.array([first_text]))

g1 = pd.DataFrame(df[first_label].values)
g2 = pd.DataFrame(df[first_text].values)

df = pd.concat([h1, h2], axis=1, ignore_index=True)
df = pd.concat([df, pd.concat([g1, g2], axis=1, ignore_index=True)], axis=0, ignore_index=True)

df.rename(columns={0: "label", 1: "text"}, inplace=True)

df.head(3)

Unnamed: 0,label,text
0,Household,Paper Plane Design Framed Wall Hanging Motivat...
1,Household,"SAF 'Floral' Framed Painting (Wood, 30 inch x ..."
2,Household,SAF 'UV Textured Modern Art Print Framed' Pain...
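A shorter alternative, shown here only as a sketch on an in-memory CSV with made-up rows, is to tell `pd.read_csv` that the file has no header and to supply the column names directly.

```python
import io

import pandas as pd

# Hypothetical two-row CSV without a header line
csv_text = 'Household,"Framed painting"\nBooks,"The Art of War"\n'

# Default behaviour: the first row is consumed as the header
toy_default = pd.read_csv(io.StringIO(csv_text))

# header=None keeps the first row as data, names= sets the column names
toy_named = pd.read_csv(io.StringIO(csv_text), header=None, names=["label", "text"])

print(toy_default.shape)
print(toy_named.shape)
print(toy_named)
```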


In [5]:
print(df.shape)
df.value_counts("label")

(50425, 2)


label
Household                 19313
Books                     11820
Electronics               10621
Clothing & Accessories     8671
dtype: int64

### Fixing the imbalance and also lowering the dataset size

In [6]:
# Creating list that contains the labels
labels_l = list(set(label for label in df["label"]))

df_l = [df[df["label"] == label].sample(df.value_counts("label").min()) for label in labels_l]

df = pd.concat(df_l, axis=0, ignore_index=True)

df

Unnamed: 0,label,text
0,Clothing & Accessories,Disney Boys Brief Pack of 6 from Bodycare Body...
1,Clothing & Accessories,Chromozome Men's Cotton Track Pants Chromozome...
2,Clothing & Accessories,Amazon Brand - Symbol Men's Round Neck Sports ...
3,Clothing & Accessories,Van Heusen Men's Cotton Rich Lounge Shorts Don...
4,Clothing & Accessories,Softskin Women's Cotton Seamless Non Wired T-s...
...,...,...
34679,Books,Complete Course Biology: Maharashtra Board Cla...
34680,Books,"The Prophet About the Author Kahlil Gibran, au..."
34681,Books,The Art of War
34682,Books,A Pelican Book: Object-Oriented Ontology (Peli...


In [7]:
print(df.shape)
df.value_counts("label")

(34684, 2)


label
Books                     8671
Clothing & Accessories    8671
Electronics               8671
Household                 8671
dtype: int64

In [8]:
df["Label"] = df["label"].map({
    "Household": 0,
    "Books": 1,
    "Electronics": 2,
    "Clothing & Accessories": 3
})

df.rename(columns={"text": "Text"}, inplace=True)
df = df[["Text", "Label"]]

df.head(3)

Unnamed: 0,Text,Label
0,Disney Boys Brief Pack of 6 from Bodycare Body...,3
1,Chromozome Men's Cotton Track Pants Chromozome...,3
2,Amazon Brand - Symbol Men's Round Neck Sports ...,3


In [9]:
# Eliminating NaN values
df = df.dropna(subset=["Text"])
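To see why this filter works, recall that `NaN` is never equal to itself, so comparing a column with itself is `False` exactly on the missing rows. The sketch below, on a made-up toy DataFrame, shows that this comparison and `dropna` keep the same rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing text
toy = pd.DataFrame({"Text": ["a book", np.nan, "a lamp"], "Label": [1, 0, 0]})

# NaN != NaN, so the mask is False only where the text is missing
by_comparison = toy[toy["Text"] == toy["Text"]]

# The explicit pandas way of dropping rows with a missing text
by_dropna = toy.dropna(subset=["Text"])

print(by_comparison.equals(by_dropna))
print(len(toy), "->", len(by_dropna))
```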

### Creating the Training and Testing Sets

In [10]:
test_prop = 0.2

x_train, x_test, y_train, y_test = train_test_split(
    df["Text"].values,
    df["Label"].values,
    test_size = test_prop,
    stratify=df["Label"]
)

print(len(x_train), len(y_train), len(x_test), len(y_test))

27746 27746 6937 6937


### KNN Model

In [None]:
clf = Pipeline([
    ("vectorizer_tf-idf", TfidfVectorizer()),
    ("KNN", KNeighborsClassifier())
])

clf.fit(x_train, y_train)

print(classification_report(y_test, clf.predict(x_test)))

### Naive Bayes Model

In [11]:
clf = Pipeline([
    ("vectorizer_tf-idf", TfidfVectorizer()),
    ("Multi NB", MultinomialNB())
])

clf.fit(x_train, y_train)

print(classification_report(y_test, clf.predict(x_test)))

              precision    recall  f1-score   support

           0       0.91      0.95      0.93      1734
           1       0.97      0.92      0.94      1735
           2       0.95      0.95      0.95      1734
           3       0.97      0.98      0.97      1734

    accuracy                           0.95      6937
   macro avg       0.95      0.95      0.95      6937
weighted avg       0.95      0.95      0.95      6937



### Random Forest

In [12]:
clf = Pipeline([
    ("vectorizer_tf-idf", TfidfVectorizer()),
    ("RFC", RandomForestClassifier())
])

clf.fit(x_train, y_train)

print(classification_report(y_test, clf.predict(x_test)))

              precision    recall  f1-score   support

           0       0.93      0.95      0.94      1734
           1       0.97      0.96      0.97      1735
           2       0.97      0.95      0.96      1734
           3       0.98      0.99      0.98      1734

    accuracy                           0.96      6937
   macro avg       0.96      0.96      0.96      6937
weighted avg       0.96      0.96      0.96      6937



An even better practice is to also apply `Preprocessing` to the samples before vectorizing them, in order to remove irrelevant information.
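One lightweight way to do this, sketched below under simplifying assumptions, is to pass a custom function through the `preprocessor` parameter of `TfidfVectorizer`. The `preprocess` function is a hypothetical example that only lowercases the text and strips digits and punctuation; a real pipeline might also remove stop words or lemmatize. The two product descriptions are made up.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer


def preprocess(text):
    # Lowercase, replace everything except letters and whitespace, collapse spaces
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical product descriptions
docs = ["Framed Painting (Wood, 30 inch)", "Cotton Track Pants, Pack of 6!!!"]

raw_v = TfidfVectorizer().fit(docs)
clean_v = TfidfVectorizer(preprocessor=preprocess).fit(docs)

print(sorted(raw_v.vocabulary_))
print(sorted(clean_v.vocabulary_))
```

The same vectorizer can be dropped into any of the `Pipeline` objects above in place of the plain `TfidfVectorizer()`.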