## Feature Extraction from Text using Vectorization

___
# Feature Extraction from Text


## Load a dataset

In [None]:
import numpy as np
import pandas as pd


In [None]:
df=pd.read_csv('smsspamcollection.tsv',sep='\t')


In [None]:
df.head()
#take raw text information and vectorize it 

## Check for missing values:

In [None]:
df.isnull().sum()

## Take a quick look at the *ham* and *spam* `label` column:

In [None]:
df['label'].value_counts()

## Split the data into train & test sets:

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X=df['message']
y=df['label']

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33,random_state=42)

## Scikit-learn's CountVectorizer
Text preprocessing, tokenizing and the ability to filter out stopwords are all included in [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which builds a dictionary of features and transforms documents to feature vectors.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect=CountVectorizer()


In [None]:
X_train_counts=count_vect.fit_transform(X_train)

In [None]:
X_train_counts.shape

This shows that our training set comprises 3733 documents and 7082 features, one feature per unique token in the training vocabulary.
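To make the document-term matrix concrete, here is a minimal pure-Python sketch of what a bag-of-words vectorizer does. The three messages are made up for illustration, and the `tokenize` helper only approximates CountVectorizer's defaults (lowercasing, keeping tokens of two or more word characters); it is not the library's implementation.

```python
import re
from collections import Counter

# Hypothetical toy corpus (not taken from the SMS dataset)
docs = [
    "Free entry! Win a FREE prize",
    "Are you free today?",
    "Call me today",
]

def tokenize(text):
    # Approximates CountVectorizer's defaults: lowercase the text, then
    # keep tokens made of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in docs]

# "fit": build the vocabulary, mapping each unique token to a column index
unique_tokens = sorted({tok for tokens in tokenized for tok in tokens})
vocabulary = {tok: i for i, tok in enumerate(unique_tokens)}

# "transform": one row per document, one column per vocabulary entry
def to_counts(tokens):
    counts = Counter(tokens)
    return [counts[tok] for tok in vocabulary]

matrix = [to_counts(tokens) for tokens in tokenized]

print(len(matrix), len(vocabulary))  # number of documents, number of features
```

The row count matches the number of messages and the column count matches the vocabulary size, which is how the shape reported above should be read.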

In [None]:
#tf-idf can be used to give more weight to words that are important
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer=TfidfTransformer()

In [None]:
X_train_tfidf=tfidf_transformer.fit_transform(X_train_counts)
#pass the count matrix produced by CountVectorizer to the tfidf transformer

In [None]:
X_train_tfidf.shape
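The weighting itself can be worked through by hand. The sketch below assumes TfidfTransformer's default settings (`smooth_idf=True`, `norm='l2'`), under which the inverse document frequency of a term is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit length. The corpus and the variable names (`doc_freq`, `tfidf_row`) are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased and free of punctuation
docs = [
    "free entry win prize",
    "are you free today",
    "call me today",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for tokens in tokenized for tok in tokens})
n_docs = len(docs)

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(tok for tokens in tokenized for tok in set(tokens))

# Smoothed idf, assuming the smooth_idf=True default
idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[tok] * idf[tok] for tok in vocab]
    # L2-normalise so every document vector has unit length
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(tokens) for tokens in tokenized]

print(round(idf["free"], 3), round(idf["prize"], 3))
```

In the first message, "prize" ends up with a larger weight than "free" because "free" also appears in a second message and is therefore less distinctive.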

## Combine Steps with TfidfVectorizer


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer 
#combines CountVectorizer and TfidfTransformer into a single step
tfidf_vectorizer=TfidfVectorizer()

In [None]:
X_train_tfidf=tfidf_vectorizer.fit_transform(X_train)
X_train_tfidf.shape
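As a quick sanity check, the two routes can be compared on a small corpus. This sketch assumes default parameters for all three classes and uses made-up messages rather than the SMS data.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus (not taken from the SMS dataset)
docs = [
    "free entry win a prize",
    "are you free today",
    "call me today",
]

# Two-step route: raw counts first, then tf-idf weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both at once
one_step = TfidfVectorizer().fit_transform(docs)

# Both results are sparse matrices; densify the toy data to compare values
same = np.allclose(two_step.toarray(), one_step.toarray())
print(two_step.shape, one_step.shape, same)
```

With default settings the two matrices should agree, so the shorter one-step form is the one used from here on.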

## Train a Classifier
Here we'll introduce an SVM classifier that's similar to SVC, called [LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html). LinearSVC handles sparse input better, and scales well to large numbers of samples.

In [None]:
from sklearn.svm import LinearSVC
clf=LinearSVC()
clf.fit(X_train_tfidf,y_train)
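A linear classifier pairs naturally with sparse tf-idf vectors because its decision score is a weighted sum in which only the tokens present in a message contribute. The sketch below illustrates that idea in plain Python; the weights, the intercept and the helper names are hypothetical, and in practice the fitted values live in `clf.coef_` and `clf.intercept_`.

```python
# Hypothetical learned weights: positive values push toward "spam",
# negative values push toward "ham"
weights = {"free": 1.4, "prize": 1.1, "win": 0.9, "home": -0.8, "lunch": -0.6}
bias = -0.5  # hypothetical intercept

def decision_score(features):
    # features is a sparse vector stored as {token: tf-idf value};
    # tokens absent from the message contribute nothing to the sum
    return sum(weights.get(tok, 0.0) * val for tok, val in features.items()) + bias

def predict(features):
    # A positive score maps to the second class in sorted order ("spam")
    return "spam" if decision_score(features) > 0 else "ham"

spam_like = {"free": 0.7, "prize": 0.7}
ham_like = {"home": 0.6, "lunch": 0.8}

print(predict(spam_like), predict(ham_like))
```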

## Build a Pipeline
Remember that only our training set has been vectorized into a full vocabulary. To evaluate the model on our test set, we have to run it through the same steps, reusing the vocabulary and weights already fitted on the training data rather than fitting them again. Fortunately, scikit-learn offers a [**Pipeline**](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class that behaves like a compound classifier.

In [None]:
#pipeline can perform vectorization and classification 
from sklearn.pipeline import Pipeline
text_clf=Pipeline([('tfidf',TfidfVectorizer()),('clf',LinearSVC())]) #this line creates a pipeline
text_clf.fit(X_train,y_train)#pass it raw training data 
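To see what the pipeline saves us from doing by hand, the sketch below runs both routes on a made-up corpus. In the manual route the test messages go through `transform`, not `fit_transform`, so they are mapped onto the vocabulary learned from the training messages. The texts, labels and `random_state=0` are assumptions chosen only to keep the example small and repeatable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data (not taken from the SMS dataset)
train_texts = [
    "win a free prize now",
    "free entry to win cash",
    "claim your free cash prize",
    "are you coming home tonight",
    "see you at lunch tomorrow",
    "call me when you get home",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
test_texts = ["free prize waiting for you", "home for lunch tomorrow"]

# Manual route: fit the vectorizer on training text only
vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)
manual_clf = LinearSVC(random_state=0).fit(train_matrix, train_labels)
test_matrix = vectorizer.transform(test_texts)  # transform only, no refit
manual_predictions = manual_clf.predict(test_matrix)

# Pipeline route: the same two steps wrapped in one estimator
toy_pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC(random_state=0))])
toy_pipeline.fit(train_texts, train_labels)
pipeline_predictions = toy_pipeline.predict(test_texts)

print(list(manual_predictions), list(pipeline_predictions))
```

Both routes should produce the same predictions; the pipeline simply keeps the fit and transform bookkeeping in one place.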

## Test the classifier and display results

In [None]:
#form a prediction set
predications=text_clf.predict(X_test)
#X_test contains raw text messages


In [None]:
#report the confusion matrix
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(y_test,predications))

In [None]:
#print a classification report
print(classification_report(y_test,predications))

In [None]:
#print the overall accuracy
from sklearn import metrics
metrics.accuracy_score(y_test,predications)
#gives nearly 99 percent accuracy
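The numbers in these reports all derive from the four cells of the confusion matrix. Here is a minimal hand-rolled sketch on made-up labels, assuming scikit-learn's default layout: rows are true labels, columns are predicted labels, and labels appear in sorted order (`ham`, then `spam`). The label lists are hypothetical and are not results from the SMS model.

```python
# Hypothetical true and predicted labels
y_true = ["ham"] * 6 + ["spam"] * 4
y_pred = ["ham", "ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam", "ham"]

labels = ["ham", "spam"]  # sorted label order

# Rows are true labels, columns are predicted labels
matrix = [[0, 0], [0, 0]]
for true_label, predicted_label in zip(y_true, y_pred):
    matrix[labels.index(true_label)][labels.index(predicted_label)] += 1

# Treating "spam" as the positive class
tn, fp = matrix[0]
fn, tp = matrix[1]

accuracy = (tp + tn) / len(y_true)   # share of all messages classified correctly
spam_precision = tp / (tp + fp)      # share of predicted spam that really is spam
spam_recall = tp / (tp + fn)         # share of actual spam that was caught

print(matrix, accuracy, spam_precision, spam_recall)
```

Accuracy counts every message, while precision and recall for `spam` look only at one class, which is why the classification report lists them separately.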

Using the text of the messages, our model performed exceedingly well; it classified **98.97%** of the test messages correctly as *ham* or *spam*!<br>
Now let's apply what we've learned to a text classification project involving positive and negative movie reviews.

In [None]:
#apply the model to a new text message

In [None]:
text_clf.predict(["How are you doing today"]) #pass the message as a list

In [None]:
text_clf.predict(["Congratulations! You've been selected as a winner. TEXT WON to 44255 congratulations free entry to contest"])

In [None]:
print("The End")