# Text Classification

**Goals:**
- Understand Text Feature Extraction.
- Become familiar with using `scikit-learn` and Python to perform text classification on real datasets.

The machine learning technique used here is supervised learning.

Most classic machine learning algorithms can't take in raw text. Instead, we must first extract features from the raw text so that numerical features can be passed to the machine learning algorithm. For example, we could count the occurrences of each word to map text to numbers, or combine count vectorization with Term Frequency and Inverse Document Frequency weighting.
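
As a minimal sketch of the word-counting idea, the snippet below maps two made-up sentences (not taken from the dataset) to count vectors using only the standard library. The helper name `count_words` and the whitespace tokenization are assumptions for illustration; `CountVectorizer` tokenizes more carefully.

```python
from collections import Counter

def count_words(text):
    """Count lowercase, whitespace-separated tokens in a string."""
    return Counter(text.lower().split())

# Toy documents, invented for illustration
docs = ["the dog chased the ball", "the weather is red hot"]

# The vocabulary is every distinct token, sorted so column order is stable
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})

# One row per document, one column per vocabulary word
vectors = [[count_words(doc)[word] for word in vocabulary] for doc in docs]

print(vocabulary)
for row in vectors:
    print(row)
```

Each document becomes a fixed-length list of integers, which is the kind of numerical input a classifier can work with.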

An alternative to `CountVectorizer` is `TfidfVectorizer`. It also creates a document-term matrix (DTM) from the text. However, instead of filling the DTM with token counts, it calculates the term frequency-inverse document frequency (`TF-IDF`) value for each word.

In this case term frequency $tf(t,d)$ is the raw count of a term in a document, i.e. the number of times that term $t$ occurs in the document $d$. However, term frequency alone isn't enough for a thorough feature analysis of the text. Consider very common terms like "a" or "the". Because the term "the" is so common, term frequency will tend to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to more meaningful terms such as "red", "dog", or "weather".

An inverse document frequency factor is therefore incorporated, which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely. It is the logarithmically scaled inverse fraction of the documents that contain the word: divide the total number of documents $N$ by the number of documents containing the term, then take the logarithm of that quotient.

<br></br>
<center>$tfidf(t,d,D) = tf(t,d) \cdot idf(t,D)$</center>

<br></br>
<center>$idf(t, D) = \log\left(\frac{N}{|\{d \in D : t \in d\}|}\right)$</center>

TF-IDF lets us gauge how important a word is across an entire corpus of documents, rather than only its relative importance within a single document.
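
The definitions above can be worked through by hand. The following is a minimal sketch on a three-document toy corpus (invented for illustration); the function names `tf`, `idf`, and `tfidf` simply mirror the symbols in the formulas and are not part of any library.

```python
import math

# Toy corpus, already tokenized; not taken from the dataset
docs = [
    "the dog chased the ball".split(),
    "the weather is hot".split(),
    "the dog likes hot weather".split(),
]

def tf(term, doc):
    """Raw count of `term` in one document."""
    return doc.count(term)

def idf(term, corpus):
    """log(N / number of documents containing `term`).

    Assumes the term occurs in at least one document; otherwise the
    division fails, which is one reason libraries use smoothed variants.
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

for term in ["the", "dog", "ball"]:
    print(term, round(tfidf(term, docs[0], docs), 3))
```

"the" appears in every document, so its idf is $\log(3/3) = 0$ and its weight vanishes even though it occurs twice in the first document, while "ball" (found in one document only) receives the largest weight.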

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('smsspamcollection.tsv',sep='\t')

In [3]:
df.head()

Unnamed: 0,label,message,length,punct
0,ham,"Go until jurong point, crazy.. Available only ...",111,9
1,ham,Ok lar... Joking wif u oni...,29,6
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155,6
3,ham,U dun say so early hor... U c already then say...,49,6
4,ham,"Nah I don't think he goes to usf, he lives aro...",61,2


In [4]:
df.isnull().sum()

label      0
message    0
length     0
punct      0
dtype: int64

In [5]:
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X = df['message']
y = df['label']

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## Count vectorization

With the data split, it is time to perform count vectorization.

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

In [10]:
count_vect = CountVectorizer()

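There are two ways of transforming raw text into vectors.

The first is to fit the vectorizer to the data (build the vocabulary and count the words) and then transform the text in a separate step:

`count_vect.fit(X_train)`

`X_train_counts = count_vect.transform(X_train)`

The second is to do both in a single step with `fit_transform`: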

In [13]:
# Fit the vectorizer and transform the original text messages into vectors
X_train_counts = count_vect.fit_transform(X_train)

We cannot usefully display `X_train_counts` directly because it is a huge sparse matrix. `scikit-learn` stores it in Compressed Sparse Row (CSR) format. Each row consists almost entirely of zeroes, so it is far more efficient to store only the nonzero entries and their positions than to keep every zero in place.

In [14]:
X_train_counts

<3733x7082 sparse matrix of type '<class 'numpy.int64'>'
	with 49992 stored elements in Compressed Sparse Row format>
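
The output above reports 49992 stored elements out of 3733 × 7082 cells, so fewer than 0.2% of the entries are nonzero. To show what Compressed Sparse Row storage keeps, here is a hand-rolled sketch in plain Python. The helper `dense_to_csr` is hypothetical and only mimics the three-array layout (`data`, `indices`, `indptr`) that SciPy's CSR matrices use; it is not how `scikit-learn` builds the matrix.

```python
def dense_to_csr(rows):
    """Convert a list of dense rows into CSR-style arrays.

    data    : the nonzero values, row by row
    indices : the column index of each value in `data`
    indptr  : indptr[i]..indptr[i+1] is the slice of `data` for row i
    """
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))
    return data, indices, indptr

# Toy matrix, invented for illustration
dense = [
    [0, 0, 2, 0, 0],
    [1, 0, 0, 0, 3],
    [0, 0, 0, 0, 0],
]

data, indices, indptr = dense_to_csr(dense)
print(data)     # nonzero values
print(indices)  # their column positions
print(indptr)   # row boundaries
```

Only three values are stored for fifteen cells, and an all-zero row costs nothing beyond one repeated entry in `indptr`.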

So, across 3733 messages there were 7082 unique words (vocabulary tokens), as the output of the previous cell shows.

In [15]:
X_train.shape

(3733,)

In [16]:
X_train_counts.shape

(3733, 7082)

## TF-IDF Transformation

Let's transform the counts into weighted frequencies with `TF-IDF`.

In [17]:
from sklearn.feature_extraction.text import TfidfTransformer

In [18]:
tfidf_transformer = TfidfTransformer()

In [19]:
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

As we can see, the result has the same shape, but the entries are no longer raw counts. Instead, each term frequency has been multiplied by its inverse document frequency.

In [20]:
X_train_tfidf.shape

(3733, 7082)
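
One detail worth knowing: with its default settings, `TfidfTransformer` does not apply the textbook idf formula given earlier verbatim. As far as I understand the defaults (`smooth_idf=True`, `norm='l2'`), it uses $\ln\frac{1+N}{1+df(t)} + 1$ and then scales each row to unit length. The sketch below checks that reading on a toy corpus (invented for illustration) rather than on the SMS data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus, not taken from the dataset
corpus = [
    "the dog chased the ball",
    "the weather is hot",
    "the dog likes hot weather",
]

counts = CountVectorizer().fit_transform(corpus)
transformer = TfidfTransformer()  # defaults assumed: smooth_idf=True, norm='l2'
tfidf = transformer.fit_transform(counts)

# Recompute the weights by hand under the assumed default formula
dense = counts.toarray()
n_docs = dense.shape[0]
doc_freq = (dense > 0).sum(axis=0)          # documents containing each term
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

manual = dense * expected_idf               # tf * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)  # l2-normalize rows

print(np.allclose(transformer.idf_, expected_idf))
print(np.allclose(tfidf.toarray(), manual))
```

The smoothing means that a term present in every document keeps a small positive weight instead of dropping to exactly zero.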

Since it is so common to do count vectorization followed by a TF-IDF transformation, `scikit-learn` also offers `TfidfVectorizer`, which does the same thing in one step.

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [22]:
vectorizer = TfidfVectorizer()

In [23]:
X_train_tfidf = vectorizer.fit_transform(X_train)
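
The claim that the one-step and two-step routes agree can be checked quickly. This is a small sketch on a toy corpus (invented for illustration), assuming all three classes are left at their default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus, not taken from the dataset
corpus = [
    "free entry to win a prize",
    "are you coming home tonight",
    "win a free ticket tonight",
]

# Two steps: counts first, then TF-IDF weighting
two_step = TfidfTransformer().fit_transform(
    CountVectorizer().fit_transform(corpus)
)

# One step: TfidfVectorizer does both
one_step = TfidfVectorizer().fit_transform(corpus)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```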

## Model training

Let's train a linear `SVM` classifier on our data.

In [24]:
from sklearn.svm import LinearSVC

In [25]:
clf = LinearSVC()

In [26]:
clf.fit(X_train_tfidf, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

## Pipeline creation

Our training set has now been vectorized into a full vocabulary. To run the model on the test set we would have to repeat the same preprocessing steps. `scikit-learn` provides a `Pipeline` class that essentially behaves like a compound classifier: it can perform both vectorization and classification. Let's use it so that we don't have to repeat the whole process by hand.

In [27]:
from sklearn.pipeline import Pipeline

The `Pipeline` object takes a list of tuples. Each tuple holds a string name of your choosing for the step, followed by the estimator for that step. The resulting `text_clf` can then perform all of these steps in a single call.

In [28]:
text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

In [29]:
text_clf.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])
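
To make explicit what the pipeline saves us from writing, the sketch below runs the manual route and the pipeline route side by side on a few made-up messages (not from the dataset). `random_state=0` is an assumption added only to keep the two fits comparable. Note that the manual route must call `transform`, not `fit_transform`, on new text so that the training vocabulary is reused.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for illustration
train_texts = [
    "win a free prize now",
    "free entry win cash",
    "are we meeting for lunch",
    "see you at home tonight",
]
train_labels = ["spam", "spam", "ham", "ham"]
test_texts = ["free cash prize", "lunch at home"]

# Manual route: vectorize, fit, then vectorize again before predicting
vectorizer = TfidfVectorizer()
clf = LinearSVC(random_state=0)
clf.fit(vectorizer.fit_transform(train_texts), train_labels)
manual_pred = clf.predict(vectorizer.transform(test_texts))

# Pipeline route: the same two steps behind a single fit/predict interface
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC(random_state=0))])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(test_texts)

print(list(manual_pred) == list(pipe_pred))

# Individual steps stay reachable by the names chosen in the tuples
print(len(pipe.named_steps["tfidf"].vocabulary_))
```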

In [30]:
predictions = text_clf.predict(X_test)

In [31]:
from sklearn.metrics import confusion_matrix, classification_report

In [32]:
print(confusion_matrix(y_test, predictions))

[[1586    7]
 [  12  234]]


In [33]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.97      0.95      0.96       246

   micro avg       0.99      0.99      0.99      1839
   macro avg       0.98      0.97      0.98      1839
weighted avg       0.99      0.99      0.99      1839



In [34]:
from sklearn import metrics

In [35]:
metrics.accuracy_score(y_test, predictions)

0.989668297988037
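
The figures in the classification report and the accuracy score follow directly from the confusion matrix printed earlier. As a short sketch, the arithmetic can be redone by hand; rows are the true labels and columns the predicted labels, in the order `ham`, `spam`. The variable names are just descriptive labels for the four cells.

```python
# Confusion matrix copied from the output above
cm = [[1586, 7],
      [12, 234]]

ham_as_ham, ham_as_spam = cm[0]    # true ham: predicted ham / predicted spam
spam_as_ham, spam_as_spam = cm[1]  # true spam: predicted ham / predicted spam

total = sum(sum(row) for row in cm)
accuracy = (ham_as_ham + spam_as_spam) / total

# Precision: of everything predicted spam, how much really was spam
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
# Recall: of all real spam, how much was caught
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(total)
print(round(accuracy, 4))
print(round(spam_precision, 2), round(spam_recall, 2), round(spam_f1, 2))
```

So 12 spam messages slipped through as ham and 7 legitimate messages were flagged as spam, which is what the `spam` row of the report summarizes.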

We can also run predictions on new text messages.

In [38]:
text_clf.predict(['Hi how are you doing today?'])

array(['ham'], dtype=object)

In [39]:
text_clf.predict(["Congratulations! You've been selected as a winner. TEXT WON to 4255 congratulations free entry to contest."])

array(['spam'], dtype=object)

___
# Movie Review project

In this project we will use a movie review dataset to identify whether a review is positive or negative. The prediction is based solely on the text of the review.

In [40]:
import numpy as np
import pandas as pd

In [42]:
df = pd.read_csv('moviereviews.tsv', sep='\t')

In [43]:
df.head()

Unnamed: 0,label,review
0,neg,how do films like mouse hunt get into theatres...
1,neg,some talented actresses are blessed with a dem...
2,pos,this has been an extraordinary year for austra...
3,pos,according to hollywood movies made in last few...
4,neg,my first press screening of 1998 and already i...


In [44]:
len(df)

2000

We can take a look at what the reviews look like inside the `DataFrame`.

In [45]:
# Negative review
df['review'][0]

'how do films like mouse hunt get into theatres ? \r\nisn\'t there a law or something ? \r\nthis diabolical load of claptrap from steven speilberg\'s dreamworks studio is hollywood family fare at its deadly worst . \r\nmouse hunt takes the bare threads of a plot and tries to prop it up with overacting and flat-out stupid slapstick that makes comedies like jingle all the way look decent by comparison . \r\nwriter adam rifkin and director gore verbinski are the names chiefly responsible for this swill . \r\nthe plot , for what its worth , concerns two brothers ( nathan lane and an appalling lee evens ) who inherit a poorly run string factory and a seemingly worthless house from their eccentric father . \r\ndeciding to check out the long-abandoned house , they soon learn that it\'s worth a fortune and set about selling it in auction to the highest bidder . \r\nbut battling them at every turn is a very smart mouse , happy with his run-down little abode and wanting it to stay that way . \r\

In [46]:
# Positive review
print(df['review'][2])

this has been an extraordinary year for australian films . 
 " shine " has just scooped the pool at the australian film institute awards , picking up best film , best actor , best director etc . to that we can add the gritty " life " ( the anguish , courage and friendship of a group of male prisoners in the hiv-positive section of a jail ) and " love and other catastrophes " ( a low budget gem about straight and gay love on and near a university campus ) . 
i can't recall a year in which such a rich and varied celluloid library was unleashed from australia . 
 " shine " was one bookend . 
stand by for the other one : " dead heart " . 
>from the opening credits the theme of division is established . 
the cast credits have clear and distinct lines separating their first and last names . 
bryan | brown . 
in a desert settlement , hundreds of kilometres from the nearest town , there is an uneasy calm between the local aboriginals and the handful of white settlers who live nearby . 

We aren't missing any labels, but we are missing a few reviews. We can safely drop them.

In [47]:
df.isnull().sum()

label      0
review    35
dtype: int64

In [48]:
df.dropna(inplace=True)

In [49]:
df.isnull().sum()

label     0
review    0
dtype: int64

Sometimes missing values are not stored as `null` directly but as blank strings. We want to detect and remove whitespace-only strings with the `isspace()` method.

In [50]:
blanks = []

for i, lb, rv in df.itertuples():
    if rv.isspace():
        blanks.append(i)

In [51]:
df.drop(blanks, inplace=True)
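
Two caveats about the loop above are easy to miss. First, it relies on `dropna()` having run already, because a missing value is a float `NaN` and has no `isspace()` method. Second, `isspace()` returns `False` for a truly empty string. The plain-Python sketch below (sample strings invented for illustration) shows the difference and one possible alternative check, `not s.strip()`, which treats empty and whitespace-only strings alike.

```python
# Sample strings, invented for illustration
samples = ["a real review", "   ", "\n\t", ""]

# isspace() is True only for non-empty strings made entirely of whitespace
flags_isspace = [s.isspace() for s in samples]

# strip() removes surrounding whitespace; an empty result means "blank"
flags_blank = [not s.strip() for s in samples]

print(flags_isspace)
print(flags_blank)
```

Whether empty strings occur in this particular file is not something the output above tells us, so this is only a precaution.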

In [52]:
df.shape

(1938, 2)

## Data Split

Here we will split the data into a training set and a test set.

In [53]:
from sklearn.model_selection import train_test_split

In [54]:
X = df['review']
y = df['label']

In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## Pipeline building

Let's build a pipeline to vectorize the data and then fit the model.

In [58]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

In [63]:
text_clf = Pipeline([('tfidf', TfidfVectorizer()), 
                     ('clf', LinearSVC())])

In [64]:
text_clf.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [69]:
predictions = text_clf.predict(X_test)

## Model evaluation

Let's evaluate our model.

In [66]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

In [70]:
print(confusion_matrix(y_test, predictions))

[[235  47]
 [ 41 259]]


In [71]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

         neg       0.85      0.83      0.84       282
         pos       0.85      0.86      0.85       300

   micro avg       0.85      0.85      0.85       582
   macro avg       0.85      0.85      0.85       582
weighted avg       0.85      0.85      0.85       582



In [72]:
print(accuracy_score(y_test, predictions))

0.8487972508591065
