# <b><span style = "color:#d46606; font-family:calibri"> Text Classification

### <b><span style = "color:#d46606; font-family:calibri"> Topics Covered:
* Understand Machine Learning Basics
* Understand Classification Metrics
* Understand Text Feature Extraction


## <b><span style = "color:#d46606; font-family:calibri"> Machine Learning Overview

**1. Machine Learning is a method of data analysis that automates analytical model building.**

**2. Using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed.**

### `Supervised Learning` is commonly used in applications where historical data predicts likely future events.

## <b><span style = "color:#d46606; font-family:calibri"> Machine Learning Process

**1. Data Acquisition.**

**2. Data cleaning.**

**3. Model Training & Building.**

**4. Model Testing.**

**5. Model Deployment.**

### `Text classification` and `recognition` are very common and widely applicable uses of machine learning.

## <b><span style = "color:#d46606; font-family:calibri"> Classification Metrics

### After our `Machine Learning Process` is complete, we will use performance metrics to evaluate how our model did.

### The key classification metrics we need to understand are:
* Accuracy
* Recall
* Precision
* F1_score

## A false positive is a `Type 1 error` in statistics

## A false negative is a `Type 2 error` in statistics
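To make the four metrics concrete, here is a minimal sketch that counts true/false positives and negatives by hand and derives each metric from them. The function name `classification_metrics` and the label lists are made up for illustration; they are not part of the datasets used later.

```python
def classification_metrics(y_true, y_pred, positive):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)  # Type 1 errors
    fn = sum(t == positive and p != positive for t, p in pairs)  # Type 2 errors
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels, made up for illustration only
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

scores = classification_metrics(y_true, y_pred, positive="spam")
print(scores)
```

Here one spam message is missed (a false negative) and one ham message is flagged (a false positive), so accuracy is 0.75 while precision, recall and F1 are all 2/3.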

# Text Feature Extraction

### Most machine learning algorithms can't take in raw text.

### Instead, we need to perform feature "extraction" on the raw text in order to pass numerical features to the machine learning algorithm.

### For example, we can count the occurrences of each word to map text to numbers.

## `Count Vectorization`

### Using `from sklearn.feature_extraction.text import CountVectorizer`, we can create a count vectorizer.

### `CountVectorizer` treats each individual word as a feature.

#### It then counts the occurrences of every unique word in each document and ends up creating a `Document Term Matrix (DTM)`.
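As a small sketch of what such a matrix looks like, the snippet below vectorizes two short, made-up sentences with default settings. It reads the column order from the fitted `vocabulary_` attribute, which maps each term to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is a story about cats", "Cats are furry animals"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each term to its column index in the DTM
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(terms)
print(dtm.toarray())
```

With the default settings, text is lower-cased and single-character tokens such as "a" are dropped, so "a" does not get a column of its own.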

### An alternative to `CountVectorizer` is `TfidfVectorizer`. It also creates a document term matrix from our messages.

### However, instead of filling the DTM with token counts, it calculates the term frequency-inverse document frequency `(TF-IDF)` value of each word.

### The inverse document frequency is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient). TF-IDF multiplies a word's term frequency by this value, so words that appear in many documents are weighted down.
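The definition can be worked through by hand. The sketch below implements the textbook formula exactly as described, on three made-up tokenized documents; the helper names `idf` and `tf_idf` are illustrative only. Note that scikit-learn's `TfidfTransformer` defaults to a smoothed variant, ln((1 + n) / (1 + df)) + 1, and L2-normalizes each row, so its numbers will differ from these.

```python
import math

# Three made-up documents, already lower-cased and split into words
docs = [
    "this is a story about cats".split(),
    "cats are furry animals".split(),
    "this story is about surfing".split(),
]


def idf(term, docs):
    """Textbook IDF: log(total documents / documents containing the term).

    Assumes the term occurs in at least one document.
    """
    containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    """Raw count of the term in one document, scaled by its IDF."""
    return doc.count(term) * idf(term, docs)


print(idf("cats", docs))     # "cats" is in 2 of 3 documents: log(3/2)
print(idf("surfing", docs))  # "surfing" is in 1 of 3 documents: log(3)
print(tf_idf("surfing", docs[2], docs))
```

The rarer word "surfing" gets a higher IDF than "cats", which is the behaviour the definition is meant to produce.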

## Coding example

In [1]:
%%writefile 1.txt
This is a story about cats
our feline pets
Cats are furry animals

Overwriting 1.txt


In [2]:
%%writefile 2.txt
This story is about surfing
Catching waves is fun
Surfing is a popular water sport

Overwriting 2.txt


#### Building Vocabulary

In [5]:
vocab = {}
i = 1
with open('1.txt') as f:
    x = f.read().lower().split()
    
for word in x:
    if word in vocab:
        continue
    else:
        vocab[word] = i
        i +=1

print(vocab)

{'this': 1, 'is': 2, 'a': 3, 'story': 4, 'about': 5, 'cats': 6, 'our': 7, 'feline': 8, 'pets': 9, 'are': 10, 'furry': 11, 'animals': 12}


In [7]:
with open('2.txt') as f:
    x = f.read().lower().split()
    
for word in x:
    if word in vocab:
        continue
    else:
        vocab[word] = i
        i +=1
print(vocab)

{'this': 1, 'is': 2, 'a': 3, 'story': 4, 'about': 5, 'cats': 6, 'our': 7, 'feline': 8, 'pets': 9, 'are': 10, 'furry': 11, 'animals': 12, 'surfing': 13, 'catching': 14, 'waves': 15, 'fun': 16, 'popular': 17, 'water': 18, 'sport': 19}


#### Feature Extraction

In [8]:
# Create an empty vector with space for each word in the vocabulary
# (slot 0 holds the file name, so word indices start at 1)
one = ['1.txt'] + [0]* len(vocab)
one

['1.txt', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [9]:
with open("1.txt") as f:
    x = f.read().lower().split()
    
for word in x:
    one[vocab[word]]+=1
    
one

['1.txt', 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

In [10]:
two = ['2.txt']+[0]* len(vocab)

with open('2.txt') as f:
    x = f.read().lower().split()
for word in x:
    two[vocab[word]] +=1

In [11]:
print(f"{one}\n {two}")

['1.txt', 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
 ['2.txt', 1, 3, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 1, 1, 1]
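The manual steps above can be condensed into two small functions. This is only a sketch: the names `build_vocab` and `vectorize` are made up, the two texts are held in memory instead of being read from `1.txt` and `2.txt`, and indices start at 0 because no slot is reserved for the file name.

```python
from collections import Counter


def build_vocab(texts):
    """Assign each new lower-cased word the next free index, starting at 0."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def vectorize(text, vocab):
    """Return word counts for `text`, ordered by the vocabulary indices."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocab)
    for word, index in vocab.items():
        vector[index] = counts[word]
    return vector


doc_one = "This is a story about cats\nour feline pets\nCats are furry animals"
doc_two = ("This story is about surfing\nCatching waves is fun\n"
           "Surfing is a popular water sport")

vocab = build_vocab([doc_one, doc_two])
vec_one = vectorize(doc_one, vocab)
vec_two = vectorize(doc_two, vocab)
print(vec_one)
print(vec_two)
```

The two printed vectors contain the same counts as `one` and `two` above, minus the leading file name.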


### `Bag of Words and TF-IDF`

In [12]:
import numpy as np
import pandas as pd


In [13]:
df= pd.read_csv('smsspamcollection.tsv', sep='\t')

In [14]:
df.head()

Unnamed: 0,label,message,length,punct
0,ham,"Go until jurong point, crazy.. Available only ...",111,9
1,ham,Ok lar... Joking wif u oni...,29,6
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155,6
3,ham,U dun say so early hor... U c already then say...,49,6
4,ham,"Nah I don't think he goes to usf, he lives aro...",61,2


In [15]:
df.isnull().sum()

label      0
message    0
length     0
punct      0
dtype: int64

In [16]:
df["label"].value_counts()

ham     4825
spam     747
Name: label, dtype: int64
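The classes are imbalanced, which matters when we read the metrics later. As a quick arithmetic check, assume a hypothetical classifier that always predicts `ham`; the counts are taken from the output above.

```python
# Class counts taken from the value_counts() output above
ham, spam = 4825, 747

# Accuracy of a hypothetical classifier that always predicts the majority class
baseline_accuracy = ham / (ham + spam)
print(f"{baseline_accuracy:.3f}")

# That classifier never flags a spam message, so its recall on spam is zero
spam_recall = 0 / spam
print(spam_recall)
```

Such a classifier would already be right about 87% of the time while catching no spam at all, which is why precision and recall are worth checking alongside accuracy.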

In [17]:
from sklearn.model_selection import train_test_split

In [19]:
X = df["message"]

In [21]:
y = df["label"]

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

### Now we will do count vectorization. Text preprocessing, tokenization and the ability to filter out stop words are all included in `CountVectorizer`.

In [23]:
from sklearn.feature_extraction.text import CountVectorizer

In [25]:
count_vect = CountVectorizer()

### When we fit `CountVectorizer` to the data, it builds a vocabulary and counts the number of words.

### `transform` converts the original text into vectors.
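The difference between the two steps shows up when we vectorize text that was not seen during fitting. The following sketch uses made-up messages: the vocabulary is learned from the training text only, and the test text is mapped onto those same columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["free prize now", "see you at lunch"]  # made-up messages
test_docs = ["free lunch tomorrow"]

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)  # learn the vocabulary from the training text only
train_counts = vectorizer.transform(train_docs)
test_counts = vectorizer.transform(test_docs)  # reuse the same columns

print(train_counts.shape, test_counts.shape)
print("tomorrow" in vectorizer.vocabulary_)
```

Both matrices have the same number of columns, and the unseen word "tomorrow" is simply ignored. This is why we call `fit_transform` on the training set but only `transform` on the test set.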

In [26]:
X_train_counts = count_vect.fit_transform(X_train)

In [27]:
X_train_counts

<3733x7082 sparse matrix of type '<class 'numpy.int64'>'
	with 49992 stored elements in Compressed Sparse Row format>

In [28]:
from sklearn.feature_extraction.text import TfidfTransformer

In [29]:
tf_IDF_Tran = TfidfTransformer()

In [30]:
X_train_tfidf = tf_IDF_Tran.fit_transform(X_train_counts)

In [31]:
X_train_tfidf.shape

(3733, 7082)

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [33]:
vect = TfidfVectorizer()

In [34]:
X_train_vect = vect.fit_transform(X_train)

In [35]:
X_train_vect.shape

(3733, 7082)
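The matching shapes are no coincidence: `TfidfVectorizer` combines `CountVectorizer` and `TfidfTransformer` in a single step. A minimal check on a made-up corpus, assuming default settings for all three classes:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up corpus, for illustration only
docs = [
    "cats are furry animals",
    "surfing is a popular water sport",
    "cats dislike water",
]

# Two steps: raw counts first, then TF-IDF weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both at once
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```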

In [36]:
from sklearn.svm import LinearSVC

In [37]:
clf = LinearSVC()

In [39]:
clf.fit(X_train_vect, y_train)

LinearSVC()

### We can combine everything in one single pipeline object

In [40]:
from sklearn.pipeline import Pipeline

In [41]:
text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf',LinearSVC())])

In [42]:
text_clf.fit(X_train,y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
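To try the pipeline idea without the TSV file, here is a self-contained sketch on four made-up messages. The variable name `toy_clf` is hypothetical, and the data set is far too small to say anything about model quality; the point is only that the pipeline accepts raw text and that each step stays accessible by name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up messages, for illustration only
messages = [
    "win a free prize now",
    "free entry to win cash",
    "are we still on for lunch",
    "see you at home tonight",
]
labels = ["spam", "spam", "ham", "ham"]

toy_clf = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
toy_clf.fit(messages, labels)  # vectorizes the text, then trains the classifier

# Each step can be looked up by the name given in the pipeline
vocab_size = len(toy_clf.named_steps["tfidf"].vocabulary_)
prediction = toy_clf.predict(["free cash prize"])
print(vocab_size)
print(prediction)
```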

In [43]:
pred = text_clf.predict(X_test)

In [44]:
from sklearn.metrics import confusion_matrix, classification_report

In [45]:
print(confusion_matrix(y_test,pred))

[[1586    7]
 [  12  234]]


In [46]:
print(classification_report(y_test,pred))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.97      0.95      0.96       246

    accuracy                           0.99      1839
   macro avg       0.98      0.97      0.98      1839
weighted avg       0.99      0.99      0.99      1839



In [47]:
from sklearn import metrics

In [48]:
metrics.accuracy_score(y_test,pred)

0.989668297988037
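The numbers in the report can be reproduced from the confusion matrix by hand. The sketch below copies the matrix printed above and relies on the scikit-learn convention that rows are true labels and columns are predictions, both in sorted label order (`ham`, `spam`).

```python
# Confusion matrix printed above: rows are true labels, columns are predictions,
# both in the order (ham, spam)
cm = [[1586, 7],
      [12, 234]]

# Number of true samples per class (row sums)
support = [sum(row) for row in cm]
# Precision: correct predictions / everything predicted as that class
precision = [cm[i][i] / (cm[0][i] + cm[1][i]) for i in range(2)]
# Recall: correct predictions / everything that truly is that class
recall = [cm[i][i] / support[i] for i in range(2)]

total = sum(support)
accuracy = (cm[0][0] + cm[1][1]) / total
macro_recall = sum(recall) / 2  # plain mean over the two classes
weighted_recall = sum(r * s for r, s in zip(recall, support)) / total

print([round(p, 2) for p in precision])
print([round(r, 2) for r in recall])
print(round(accuracy, 4), round(macro_recall, 2), round(weighted_recall, 2))
```

The macro average treats both classes equally, while the weighted average is dominated by the much larger `ham` class. For recall, the weighted average works out to exactly the accuracy.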

In [49]:
text_clf.predict(["Hi, How are you today?"])

array(['ham'], dtype=object)

In [50]:
text_clf.predict(["Congratulation you have been selected as awinner. Text WON to 442255 congratulations free entry to contest."])

array(['spam'], dtype=object)

# `Another example`

In [51]:
import numpy as np
import pandas as pd

In [52]:
df = pd.read_csv("moviereviews.tsv", sep="\t")

In [53]:
df.head()

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...


In [54]:
df.isnull().sum()

label      0
review    35
dtype: int64

In [55]:
len(df)

2000

In [56]:
print(df["review"][0])

how do films like mouse hunt get into theatres ? 
isn't there a law or something ? 
this diabolical load of claptrap from steven speilberg's dreamworks studio is hollywood family fare at its deadly worst . 
mouse hunt takes the bare threads of a plot and tries to prop it up with overacting and flat-out stupid slapstick that makes comedies like jingle all the way look decent by comparison . 
writer adam rifkin and director gore verbinski are the names chiefly responsible for this swill . 
the plot , for what its worth , concerns two brothers ( nathan lane and an appalling lee evens ) who inherit a poorly run string factory and a seemingly worthless house from their eccentric father . 
deciding to check out the long-abandoned house , they soon learn that it's worth a fortune and set about selling it in auction to the highest bidder . 
but battling them at every turn is a very smart mouse , happy with his run-down little abode and wanting it to stay that way . 
the story alternate

In [57]:
df.dropna(inplace=True)

In [58]:
df.isnull().sum()

label     0
review    0
dtype: int64

In [59]:
blanks = []

for i , lb, rv in df.itertuples():
    if rv.isspace():
        blanks.append(i)

In [60]:
blanks

[57,
 71,
 147,
 151,
 283,
 307,
 313,
 323,
 343,
 351,
 427,
 501,
 633,
 675,
 815,
 851,
 977,
 1079,
 1299,
 1455,
 1493,
 1525,
 1531,
 1763,
 1851,
 1905,
 1993]

In [61]:
df.drop(blanks, inplace=  True)

In [62]:
len(df)

1938
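The same cleanup can also be written without an explicit loop. This is a sketch on a made-up DataFrame, not the review data: it treats missing values as empty strings, strips whitespace, and keeps only rows that still contain text.

```python
import numpy as np
import pandas as pd

# Made-up reviews: one whitespace-only string and one missing value
toy = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg"],
    "review": ["great film", "   ", np.nan, "dull plot"],
})

# Treat NaN as empty, strip whitespace, keep rows with text left over
has_text = toy["review"].fillna("").str.strip().ne("")
cleaned = toy[has_text]

print(len(toy), len(cleaned))
```

One difference from the loop above: this mask also removes completely empty strings, which `isspace()` does not flag.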

In [75]:
from sklearn.model_selection import train_test_split

In [76]:
X = df["review"]

In [77]:
y = df["label"]

In [82]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [83]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

In [84]:
text_clf = Pipeline([("tfidf", TfidfVectorizer()),('clf', LinearSVC())])

In [85]:
text_clf.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

In [86]:
predictons = text_clf.predict(X_test)

In [89]:
from sklearn.metrics import confusion_matrix, classification_report,accuracy_score

In [90]:
print(confusion_matrix(y_test,predictons))

[[235  47]
 [ 41 259]]


In [91]:
print(classification_report(y_test,predictons))

              precision    recall  f1-score   support

         neg       0.85      0.83      0.84       282
         pos       0.85      0.86      0.85       300

    accuracy                           0.85       582
   macro avg       0.85      0.85      0.85       582
weighted avg       0.85      0.85      0.85       582



In [92]:
print(accuracy_score(y_test,predictons))

0.8487972508591065


# Another Example

### Task #1: Perform imports and load the dataset into a pandas DataFrame


In [97]:
import numpy as np
import pandas as pd

In [98]:
df = pd.read_csv("moviereviews2.tsv", sep = "\t")

In [99]:
df.head()

Unnamed: 0,label,review
0,pos,I loved this movie and will watch it again. Or...
1,pos,"A warm, touching movie that has a fantasy-like..."
2,pos,I was not expecting the powerful filmmaking ex...
3,neg,"This so-called ""documentary"" tries to tell tha..."
4,pos,This show has been my escape from reality for ...


### Task #2: Check for missing values:

In [101]:
df.isnull().sum()

label      0
review    20
dtype: int64

### Task #3: Remove NaN values:

In [102]:
df.dropna(inplace=True)

In [103]:
df.isnull().sum()

label     0
review    0
dtype: int64

In [110]:
blank = []


for i , lb, rv in df.itertuples():
    if isinstance(rv, str):
        if rv.isspace():
            blank.append(i)

In [112]:
len(blank)

0

### Task #4: Take a quick look at the `label` column:

In [114]:
df["label"].value_counts()

pos    2990
neg    2990
Name: label, dtype: int64

In [115]:
X = df["review"]

In [116]:
y = df["label"]

### Task #5: Split the data into train & test sets:

In [117]:
from sklearn.model_selection import train_test_split

In [118]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

### Task #6: Build a pipeline to vectorize the data, then train and fit a model
You may use whatever model you like. To compare your results to the solution notebook, use `LinearSVC`.

In [119]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

In [121]:
from sklearn.svm import LinearSVC

In [122]:
clf_pip = Pipeline([('tfidf', TfidfVectorizer()),("clf",LinearSVC())])

In [123]:
clf_pip.fit(X_train,y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

In [124]:
pred = clf_pip.predict(X_test)

### Task #7: Run predictions and analyze the results

In [125]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [126]:
confusion_matrix(y_test,pred)

array([[900,  91],
       [ 63, 920]], dtype=int64)

In [127]:
print(classification_report(y_test,pred))

              precision    recall  f1-score   support

         neg       0.93      0.91      0.92       991
         pos       0.91      0.94      0.92       983

    accuracy                           0.92      1974
   macro avg       0.92      0.92      0.92      1974
weighted avg       0.92      0.92      0.92      1974



In [128]:
accuracy_score(y_test,pred)

0.9219858156028369