## Feature Extraction from Text using Vectorization

___
# Feature Extraction from Text
In this section we'll look at the text of each message and classify it based on its content, using some of scikit-learn's [feature extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) tools.

## Load a dataset

In [10]:
import numpy as np
import pandas as pd


In [11]:
df=pd.read_csv('smsspamcollection.tsv',sep='\t')


In [12]:
df.head()
#preview the data; the raw text in the message column is what we will vectorize

|   | label | message | length | punct |
|---|-------|---------|--------|-------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 111 | 9 |
| 1 | ham | Ok lar... Joking wif u oni... | 29 | 6 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 155 | 6 |
| 3 | ham | U dun say so early hor... U c already then say... | 49 | 6 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 61 | 2 |


## Check for missing values:

In [13]:
df.isnull().sum()

label      0
message    0
length     0
punct      0
dtype: int64

## Take a quick look at the *ham* and *spam* counts in the `label` column:

In [14]:
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

## Split the data into train & test sets:

In [15]:
from sklearn.model_selection import train_test_split

In [16]:
X=df['message']
y=df['label']

In [17]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33,random_state=42)
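
As a quick sanity check on what `test_size=0.33` does, here is a minimal sketch that splits a synthetic stand-in with the same class counts as our data (4825 ham, 747 spam). The names `X_demo` and `y_demo` are placeholders introduced for illustration; they are not part of the notebook's data.

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-in: integer ids instead of messages, same label counts as above
y_demo = ['ham'] * 4825 + ['spam'] * 747
X_demo = list(range(len(y_demo)))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# Roughly one third of the 5572 rows is held out for testing
print(len(X_tr), len(X_te))
```

The two sizes should line up with the 3733 training documents and 1839 test documents that appear later in this section.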

## Scikit-learn's CountVectorizer
Text preprocessing, tokenizing and the ability to filter out stopwords are all included in [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which builds a dictionary of features and transforms documents to feature vectors.

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect=CountVectorizer()


In [19]:
X_train_counts=count_vect.fit_transform(X_train)

In [20]:
X_train_counts.shape

(3733, 7082)

This shows that our training set comprises 3733 documents and 7082 features (one per unique token in the vocabulary).
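
To make the document-by-feature shape more concrete, here is a small sketch on three made-up messages (the `toy_docs` list is an assumption for illustration, not taken from the dataset). It uses only the default `CountVectorizer` settings, which lowercase the text and keep tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three invented messages, used only to illustrate the output shape
toy_docs = ["Free entry to win", "Ok see you later", "Win a free prize"]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

# One row per document, one column per unique token
print(toy_counts.shape)
# The learned vocabulary; the single-character token "a" is dropped by default
print(sorted(toy_vect.vocabulary_))
# Dense view of the (normally sparse) count matrix
print(toy_counts.toarray())
```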

In [21]:
#tf-idf gives more weight to words that are distinctive for a message and less to words that are common everywhere
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer=TfidfTransformer()

In [22]:
X_train_tfidf=tfidf_transformer.fit_transform(X_train_counts)
#pass the count matrix produced by CountVectorizer to the tf-idf transformer

In [23]:
X_train_tfidf.shape

(3733, 7082)

Note: the `fit_transform()` method actually performs two operations: it fits the transformer to the data (learning the idf weights from the counts) and then transforms our count matrix to a tf-idf representation.
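
The two operations can also be called separately. The sketch below, on a tiny hand-written count matrix (an assumption for illustration), suggests that `fit()` followed by `transform()` gives the same result as `fit_transform()`, and that each row is scaled to unit length under the default `norm='l2'` setting.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Hand-made counts: 3 documents x 3 terms (illustrative values only)
toy_counts = csr_matrix(np.array([[2, 0, 1],
                                  [0, 1, 1],
                                  [1, 1, 0]]))

# One call: fit and transform together
one_step = TfidfTransformer().fit_transform(toy_counts)

# Two calls: learn the idf weights first, then apply them
fitted = TfidfTransformer().fit(toy_counts)
two_step = fitted.transform(toy_counts)

same_result = np.allclose(one_step.toarray(), two_step.toarray())
row_norms = np.linalg.norm(one_step.toarray(), axis=1)
print(same_result)
print(row_norms)
```

Keeping the two calls separate matters later: new data should only be passed to `transform()`, so that it is weighted with the idf values learned from the training set.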

## Combine Steps with TfidfVectorizer
In the future, we can combine the CountVectorizer and TfidfTransformer steps into one using [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html):

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer 
#combines the CountVectorizer and TfidfTransformer steps into one
tfidf_vectorizer=TfidfVectorizer()

In [25]:
X_train_tfidf=tfidf_vectorizer.fit_transform(X_train)
X_train_tfidf.shape

(3733, 7082)
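
The matching shape is not a coincidence. As a rough check, assuming default settings everywhere and a few invented messages (`toy_docs` is a placeholder), the one-step and two-step routes should produce the same matrix:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Invented messages for illustration only
toy_docs = ["Free entry to win", "Ok see you later", "Win a free prize"]

# Two steps: raw counts first, then tf-idf weighting
counts = CountVectorizer().fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: TfidfVectorizer does both
one_step = TfidfVectorizer().fit_transform(toy_docs)

matrices_match = np.allclose(two_step.toarray(), one_step.toarray())
print(one_step.shape, matrices_match)
```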

## Train a Classifier
Here we'll introduce an SVM classifier that's similar to SVC, called [LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html). LinearSVC handles sparse input better, and scales well to large numbers of samples.

In [26]:
from sklearn.svm import LinearSVC
clf=LinearSVC()
clf.fit(X_train_tfidf,y_train)

LinearSVC()

## Build a Pipeline
Remember that only our training set has been vectorized so far, and the vocabulary was built from it. To evaluate on our test set, we'll have to run it through the same fitted steps. Fortunately scikit-learn offers a [**Pipeline**](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class that chains these steps and behaves like a single compound classifier.

In [27]:
#pipeline can perform vectorization and classification 
from sklearn.pipeline import Pipeline
text_clf=Pipeline([('tfidf',TfidfVectorizer()),('clf',LinearSVC())]) #this line creates a pipeline
text_clf.fit(X_train,y_train)#pass it raw training data 

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
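
To see what the pipeline saves us, here is a sketch comparing it with the manual route on a handful of invented messages (the `train_docs`, `train_labels` and `test_docs` lists are assumptions for illustration). In the manual route the test text goes through `transform()` only, so it is mapped onto the vocabulary learned from the training text. `random_state=0` is set on both classifiers so the comparison is repeatable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented toy data, not from the SMS dataset
train_docs = [
    "Free entry win a prize now",
    "Win cash prize text now",
    "Claim your free prize today",
    "Are you coming home tonight",
    "See you at lunch tomorrow",
    "Ok I will call you later",
]
train_labels = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham']
test_docs = ["Free prize waiting claim now", "Will you call me tonight"]

# Manual route: fit the vectorizer on training text, reuse it on test text
vect = TfidfVectorizer()
manual_clf = LinearSVC(random_state=0)
manual_clf.fit(vect.fit_transform(train_docs), train_labels)
test_matrix = vect.transform(test_docs)  # transform only, no refitting
manual_pred = manual_clf.predict(test_matrix)

# Pipeline route: the same two steps, handled in one object
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC(random_state=0))])
pipe.fit(train_docs, train_labels)
pipe_pred = pipe.predict(test_docs)

print(list(manual_pred) == list(pipe_pred))
```

With the pipeline there is no separate vectorizer object to keep track of, which removes the risk of accidentally refitting it on the test set.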

## Test the classifier and display results

In [28]:
#form a prediction set
predications=text_clf.predict(X_test)
#X_test contains raw text messages


In [29]:
#report the confusion matrix
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(y_test,predications))

[[1586    7]
 [  12  234]]


In [30]:
#print a classification report
print(classification_report(y_test,predications))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.97      0.95      0.96       246

    accuracy                           0.99      1839
   macro avg       0.98      0.97      0.98      1839
weighted avg       0.99      0.99      0.99      1839
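
The figures in this report can be recomputed by hand from the confusion matrix printed earlier. The sketch below assumes the usual scikit-learn layout, with rows as actual labels and columns as predicted labels, both in the order ham, spam; the variable names are introduced here for illustration.

```python
# Values copied from the confusion matrix above
ham_as_ham, ham_as_spam = 1586, 7      # actual ham row
spam_as_ham, spam_as_spam = 12, 234    # actual spam row

total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam

# Precision: of everything predicted spam, how much really was spam
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
# Recall: of all actual spam, how much did we catch
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)
# F1: harmonic mean of precision and recall
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)
# Accuracy: correct predictions over all test messages
accuracy = (ham_as_ham + spam_as_spam) / total

print(round(spam_precision, 2), round(spam_recall, 2), round(spam_f1, 2))
print(total, round(accuracy, 4))
```

Rounded to two decimals these come to 0.97, 0.95 and 0.96 for the spam row, matching the report, while the 7 ham messages flagged as spam and the 12 spam messages let through are the only errors among 1839 test messages.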



In [31]:
#print the overall accuracy
from sklearn import metrics
metrics.accuracy_score(y_test,predications)
#gives nearly 99 percent accuracy

0.989668297988037

Using the text of the messages, our model performed very well: it classified **98.97%** of the test messages correctly, and it caught 234 of the 246 spam messages (95% recall).

Now let's apply what we've learned to a text classification project involving positive and negative movie reviews.

In [32]:
#apply on a new text message

In [33]:
text_clf.predict(["How are you doing today"]) #pass the message as a list

array(['ham'], dtype=object)

In [34]:
text_clf.predict(["Congratulations! You've been selected as a winner. TEXT WON to 44255 congratulations free entry to contest"])

array(['spam'], dtype=object)

In [36]:
print("The End")

The End
