Most classic machine learning algorithms cannot take in raw text.
Instead, feature extraction must first be performed on the raw text so that numerical features can be passed to the algorithm.

For example, we could count the occurrences of each word to map a text to a vector of numbers.
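
As a minimal sketch of the word-counting idea, assuming a made-up sentence and a very crude tokenizer (lowercase, split on whitespace, strip a few punctuation marks), the standard library's `Counter` is enough:

```python
from collections import Counter

# Made-up example sentence (not taken from any dataset)
text = "This is a story about cats. Cats are furry animals."

# Lowercase and strip trailing punctuation so "cats." and "Cats" count as the same word
words = [w.strip('.,!?') for w in text.lower().split()]

# Map each distinct word to the number of times it occurs
counts = Counter(words)
print(counts)
```

Each distinct word becomes one feature, and its count becomes the value of that feature.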


# Count Vectorization and TF-IDF

TF: Term Frequency, how often a term occurs in a document.

IDF: Inverse Document Frequency, a weight that shrinks as a term appears in more documents.

In [1]:
%%writefile 1.txt
This is a story about cats
our feline pets
Cats are furry animals

Overwriting 1.txt


In [2]:
%%writefile 2.txt
This story is about surfing
Catching waves is fun
Surfing is a popular water sport

Overwriting 2.txt


# Building a vocabulary

The goal here is to assign a numerical index to every distinct word that appears in any of the documents.

Later, we create a count vector for each individual document.


In [3]:
vocab = {}
i = 1

with open('1.txt') as f:
    x = f.read().lower().split()

# Assign the next free index to each word not seen before
for word in x:
    if word not in vocab:
        vocab[word] = i
        i += 1

print(vocab)

{'this': 1, 'is': 2, 'a': 3, 'story': 4, 'about': 5, 'cats': 6, 'our': 7, 'feline': 8, 'pets': 9, 'are': 10, 'furry': 11, 'animals': 12}


In [4]:
# Keep extending the same vocabulary with the words from 2.txt.
# Do not reset vocab or i here, otherwise the words from 1.txt are lost.
with open('2.txt') as f:
    x = f.read().lower().split()

for word in x:
    if word not in vocab:
        vocab[word] = i
        i += 1

print(vocab)

{'this': 1, 'is': 2, 'a': 3, 'story': 4, 'about': 5, 'cats': 6, 'our': 7, 'feline': 8, 'pets': 9, 'are': 10, 'furry': 11, 'animals': 12, 'surfing': 13, 'catching': 14, 'waves': 15, 'fun': 16, 'popular': 17, 'water': 18, 'sport': 19}


# Feature Extraction

In [5]:
# Create an empty vector with space for each word in the vocabulary:

one = ['1.txt'] + [0]*len(vocab)
one

['1.txt', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [6]:
# Map the frequency of each word in 1.txt to our vector:
with open('1.txt') as f:
    x = f.read().lower().split()

for word in x:
    one[vocab[word]] += 1

In [7]:
one

['1.txt', 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

In [20]:
# Do the same for 2.txt:
two = ['2.txt'] + [0]*len(vocab)

with open('2.txt') as f:
    x = f.read().lower().split()

for word in x:
    two[vocab[word]] += 1

In [21]:
two

['2.txt', 1, 3, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 1, 1, 1]

Comparing the two vectors shows that some words are common to both documents, while others appear only in 1.txt or only in 2.txt.
Extending this logic to tens of thousands of documents, the vocabulary dictionary would grow to hundreds of thousands of words,
and each vector would consist mostly of zeros. Stacked together, such vectors form a sparse matrix.

In [22]:
# Compare the two vectors
print(f'{one}\n{two}')

['1.txt', 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
['2.txt', 1, 3, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 1, 1, 1]
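
To illustrate why mostly-zero vectors are usually stored in a sparse form, here is a small sketch that keeps only the non-zero counts for each document. It reuses the text of 1.txt and 2.txt as inline strings; the helper `to_sparse` and the 0-based column indices are assumptions made for this illustration, not part of the notebook above.

```python
# Text of the two toy documents, inlined so the example is self-contained
docs = {
    '1.txt': "This is a story about cats\nour feline pets\nCats are furry animals",
    '2.txt': "This story is about surfing\nCatching waves is fun\nSurfing is a popular water sport",
}

# Shared vocabulary: word -> column index (0-based here, no slot reserved for the filename)
vocab = {}
for text in docs.values():
    for word in text.lower().split():
        vocab.setdefault(word, len(vocab))

def to_sparse(text):
    """Return {column index: count}, holding only the non-zero entries (hypothetical helper)."""
    row = {}
    for word in text.lower().split():
        col = vocab[word]
        row[col] = row.get(col, 0) + 1
    return row

sparse_rows = {name: to_sparse(text) for name, text in docs.items()}

# Compare the number of stored entries with the size of the full (dense) matrix
stored = sum(len(row) for row in sparse_rows.values())
total = len(docs) * len(vocab)
print(f'{stored} stored entries out of {total} cells')
```

With only two short documents the saving is small, but the gap between stored entries and total cells widens quickly as the vocabulary grows.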


# Feature Extraction from Text.

In [1]:
import numpy as np
import pandas as pd

dataset = pd.read_csv('C:\\Users\\ebineet\\Documents\\GitHub\\NLP\\UPDATED_NLP_COURSE\\TextFiles\\smsspamcollection.tsv', sep= '\t')

In [2]:
dataset.isnull().sum()

label      0
message    0
length     0
punct      0
dtype: int64

In [3]:
dataset['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64
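
The classes are clearly imbalanced, which matters when judging accuracy later. As a rough reference point, the sketch below computes the accuracy of a trivial classifier that always predicts the majority class on the full dataset, using only the counts printed above:

```python
# Class counts copied from the value_counts() output above
counts = {'ham': 4825, 'spam': 747}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of always predicting the majority class
baseline_accuracy = counts[majority_label] / total
print(f'{majority_label}: {baseline_accuracy:.3f}')
```

Any model trained on this data should be compared against this baseline rather than against 50%.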

In [4]:
# Split the data into training and test sets.

from sklearn.model_selection import train_test_split
x= dataset['message']
y= dataset['label']
x_train, x_test, y_train, y_test= train_test_split(x,y,test_size=0.3,random_state=0)

In [5]:
x_train

4380                 How are you. Just checking up on you
3887    Same, I'm at my great aunts anniversary party ...
4755                   Ok lor... Or u wan me go look 4 u?
2707    S now only i took tablets . Reaction morning o...
4747           Orh i tot u say she now still dun believe.
                              ...                        
4931    Hi, the SEXYCHAT girls are waiting for you to ...
3264                              So u gonna get deus ex?
1653    For ur chance to win a £250 cash every wk TXT:...
2607    R U &SAM P IN EACHOTHER. IF WE MEET WE CAN GO ...
2732    Mm feeling sleepy. today itself i shall get th...
Name: message, Length: 3900, dtype: object

In [6]:
# Create a CountVectorizer, which turns text into word-count vectors

from sklearn.feature_extraction.text import CountVectorizer
count_vect= CountVectorizer()


In [7]:
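# Fit the vectorizer to the training data (this builds the vocabulary),
# then transform the text into a matrix of word counts.
count_vect.fit(x_train)
x_train_counts = count_vect.transform(x_train)

# Or do both in one step:
# x_train_counts = count_vect.fit_transform(x_train)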

In [8]:
x_train_counts

<3900x7314 sparse matrix of type '<class 'numpy.int64'>'
	with 52101 stored elements in Compressed Sparse Row format>

In [9]:
x_train.shape

(3900,)

In [10]:
x_train_counts.shape

(3900, 7314)

In [11]:
# Transform the word counts into TF-IDF weights
# with the TF-IDF transformer.

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer= TfidfTransformer()
x_train_tfidf = tfidf_transformer.fit_transform(x_train_counts)

In [12]:
x_train_tfidf.shape

(3900, 7314)
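
To make the weighting less of a black box, here is a hand-rolled sketch of TF-IDF on a made-up three-document corpus. It assumes the smoothed formula documented for scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`), that is idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row. The tokenization here is a plain whitespace split, which is simpler than scikit-learn's default token pattern, so treat it as an approximation rather than a reimplementation.

```python
import math

# Made-up toy corpus (not taken from the SMS dataset)
docs = [
    "cats are furry animals",
    "surfing is a popular water sport",
    "cats like water",
]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed inverse document frequency (assumed to match smooth_idf=True)
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Raw count times idf for every vocabulary term, then L2-normalized."""
    raw = [doc.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in tokenized]

# A rarer term ("furry") gets more weight than a more common one ("cats")
first = dict(zip(vocab, rows[0]))
print(first['furry'] > first['cats'])
```

Terms that occur in many documents receive a lower weight, so they contribute less to the final feature vector than rare, more distinctive terms.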

# Alternative: TfidfVectorizer (CountVectorizer and TfidfTransformer in one step)

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
x_train_tfidf = vectorizer.fit_transform(x_train)

In [14]:
x_train_tfidf.shape

(3900, 7314)

In [15]:
x_train_tfidf

<3900x7314 sparse matrix of type '<class 'numpy.float64'>'
	with 52101 stored elements in Compressed Sparse Row format>

In [43]:
from sklearn.svm import LinearSVC
clf= LinearSVC()
clf.fit(x_train_tfidf,y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

# Creating a Pipeline Object

In [45]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('tfidf', TfidfVectorizer()),('clf', LinearSVC())])

In [46]:
text_clf.fit(x_train,y_train)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
               

In [47]:
pred = text_clf.predict(x_test)
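
The pipeline can be given raw text because it runs the vectorizer internally before the classifier. The sketch below, on a made-up four-message corpus, compares the pipeline with the equivalent manual steps; note that new text is passed through `transform` (not `fit_transform`), so that it is mapped onto the vocabulary learned from the training data. The toy messages, variable names, and `random_state=0` are assumptions chosen for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up toy data (not from the SMS dataset)
train_texts = [
    "win a free prize now",
    "free cash call now",
    "are you coming home",
    "see you at lunch",
]
train_labels = ["spam", "spam", "ham", "ham"]
test_texts = ["free prize call", "coming to lunch"]

# Manual route: fit the vectorizer on training text, reuse it on new text
vectorizer = TfidfVectorizer()
clf = LinearSVC(random_state=0)
clf.fit(vectorizer.fit_transform(train_texts), train_labels)
manual_pred = clf.predict(vectorizer.transform(test_texts))

# Pipeline route: the same two steps wrapped in one object
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC(random_state=0))])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(test_texts)

print(list(manual_pred) == list(pipe_pred))
```

Wrapping both steps in one object removes the risk of accidentally refitting the vectorizer on test data.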

In [48]:
from sklearn.metrics import confusion_matrix, classification_report

In [49]:
print( confusion_matrix(y_test,pred))

[[1450    1]
 [  17  204]]


In [50]:
print(classification_report(y_test,pred))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1451
        spam       1.00      0.92      0.96       221

    accuracy                           0.99      1672
   macro avg       0.99      0.96      0.98      1672
weighted avg       0.99      0.99      0.99      1672



In [52]:
from sklearn import metrics
metrics.accuracy_score(y_test,pred)

0.9892344497607656
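
As a sanity check, the spam-class figures in the report and the accuracy score can be recomputed by hand from the confusion matrix printed above, treating spam as the positive class (rows are true labels, columns are predicted labels, in the order ham, spam):

```python
# Counts copied from the confusion matrix above
tn, fp = 1450, 1    # true ham:  predicted ham, predicted spam
fn, tp = 17, 204    # true spam: predicted ham, predicted spam

precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of real spam that was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}')
# → precision=1.00 recall=0.92 f1=0.96 accuracy=0.9892
```

The lower recall for spam reflects the 17 spam messages that were classified as ham.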

In [53]:
text_clf.predict(["Hi how are you doing today?"])

array(['ham'], dtype=object)

In [54]:
text_clf.predict(["Congratulations you have been selected as a Winner! Text Won to 4255 congratulaitions to you and your family"])

array(['spam'], dtype=object)