# Text Classification

## Agenda

1. Represent text as numerical data
2. Read text dataset into pandas
3. Vectorize text dataset (using both CountVectorizer and TfidfVectorizer)
4. Build and evaluate a model
5. Compare multiple models
6. Fine tune vectorizer
7. Word cloud
8. Lemmatization and stemming
9. Sentiment calculation

## 1. Represent text as numerical data

In [2]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# Sample text data
sampleTrain = ['i will call you tonight', 'please help me...', 'Please call a cab please !']

In [None]:
# sample target vector
y = [0, 1, 0]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
toNumeric = CountVectorizer()

In [None]:
# Learn the vocabulary (unique words) from the training data
toNumeric.fit(sampleTrain)

In [None]:
# Inspect the learned vocabulary (get_feature_names() was removed in scikit-learn 1.2)
toNumeric.get_feature_names_out()

In [None]:
# Convert training data into a 'document-term matrix'
sampleTrain_dtm = toNumeric.transform(sampleTrain)
sampleTrain_dtm

In [None]:
# Let's convert sparse matrix to a dense matrix
sampleTrain_dtm.toarray()

In [None]:
# View the document-term matrix as a DataFrame with one column per vocabulary term
pd.DataFrame(sampleTrain_dtm.toarray(), columns=toNumeric.get_feature_names_out())

In [None]:
# Build a model to predict the target
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(sampleTrain_dtm, y)

In [None]:
# Test text for model validation
sampleTest = ["Don't call please"]

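To make a prediction, the test data must have the same features as the training observations, in both number and meaning. This is why the test text is transformed with the vocabulary learned from the training data rather than refitted.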

In [None]:
# Transform test data into DTM by using generated vocabulary
sampleTest_dtm = toNumeric.transform(sampleTest)
sampleTest_dtm.toarray()

In [None]:
# Test built model
knn.predict(sampleTest_dtm)

**Summary:**

- `toNumeric.fit(sampleTrain)` **learns the vocabulary** from the training data
- `toNumeric.transform(sampleTrain)` uses the **fitted vocabulary** to build a DTM from the training data
- `toNumeric.transform(sampleTest)` uses the **fitted vocabulary** to build a DTM from the test data and **ignores words** it has not seen before
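
As a minimal sketch of the last point, the snippet below refits a `CountVectorizer` on the same sample data and shows that a token absent from the training vocabulary (`don`) gets no column in the test DTM. The names `vect`, `vocab`, and `test_row` are chosen here for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

sampleTrain = ['i will call you tonight', 'please help me...', 'Please call a cab please !']
sampleTest = ["Don't call please"]

vect = CountVectorizer()
vect.fit(sampleTrain)

# Vocabulary learned from the training data only
# (by default, text is lowercased and single-character tokens are dropped)
vocab = sorted(vect.vocabulary_)
print(vocab)

# The test row has one column per training-vocabulary term; "don" is silently ignored
test_row = vect.transform(sampleTest).toarray()[0].tolist()
print(dict(zip(vocab, test_row)))
```

Only `call` and `please` receive a count, because they are the only test tokens present in the fitted vocabulary.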

## 2. Read text dataset into pandas

In [4]:
# Read sms data into pandas
data = pd.read_table('sms (1).tsv', header=None, names=['label', 'message'])

In [5]:
data.shape

(5572, 2)

In [7]:
data.head()

|   | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [8]:
# Target variable 'label' is categorical. Convert it into numeric value
data['label_num'] = data.label.map({'ham':0, 'spam':1})

In [9]:
data.head(5)

|   | label | message | label_num |
|---|-------|---------|-----------|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |


In [10]:
data.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64
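
Because the classes are imbalanced, any accuracy score is best read against a baseline. As a rough sketch using only the counts printed above, a model that always predicts `ham` would already be right most of the time on the full dataset (the test split will differ slightly). The name `null_accuracy` is just a label used here.

```python
# Class counts copied from the value_counts() output above
counts = {'ham': 4825, 'spam': 747}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy obtained by always predicting the majority class
null_accuracy = counts[majority_class] / total
print(majority_class, round(null_accuracy, 3))
```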

In [None]:
# Define X and y from 'data'
X = data.message
y = data.label_num
print(X.shape)
print(y.shape)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

## 3. Vectorize text dataset

In [None]:
# Instantiate the vectorizer
toNumeric = CountVectorizer()

In [None]:
# Learn the vocabulary and create the document-term matrix
toNumeric.fit(X_train)
X_train_dtm = toNumeric.transform(X_train)

In [None]:
X_train_dtm

In [None]:
# Transform the test data into a DTM using the fitted vocabulary
X_test_dtm = toNumeric.transform(X_test)
X_test_dtm
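
The agenda also mentions TF-IDF. As a hedged sketch on toy data (not the SMS set), `TfidfVectorizer` follows the same fit/transform pattern as `CountVectorizer` and learns the same vocabulary, but stores weighted, L2-normalized values instead of raw counts. The names `docs`, `count_vect`, and `tfidf_vect` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['i will call you tonight', 'please help me...', 'Please call a cab please !']

count_vect = CountVectorizer()
tfidf_vect = TfidfVectorizer()

counts = count_vect.fit_transform(docs)
tfidf = tfidf_vect.fit_transform(docs)

# Same vocabulary and matrix shape, different cell values
same_vocab = count_vect.vocabulary_ == tfidf_vect.vocabulary_
print(same_vocab, counts.shape, tfidf.shape)

# With the default norm='l2', every non-empty row has unit length
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(row_norms)
```

In this notebook, one could presumably swap `CountVectorizer()` for `TfidfVectorizer()` in the cells above and leave the rest of the workflow unchanged.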

## 4. Build and evaluate a model

In [None]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [None]:
nb.fit(X_train_dtm, y_train)

In [None]:
y_pred_class = nb.predict(X_test_dtm)

In [None]:
# Accuracy calculation
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

In [None]:
metrics.confusion_matrix(y_test, y_pred_class)

In [None]:
# Predict spam probabilities for the test data instead of class labels
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

In [None]:
# Calculate AUC-ROC
metrics.roc_auc_score(y_test, y_pred_prob)
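
To help read the confusion matrix, here is a small sketch on made-up labels (not the SMS results). In scikit-learn, rows are actual classes and columns are predicted classes, so for binary labels `ravel()` yields true negatives, false positives, false negatives, and true positives in that order. The label lists are hypothetical.

```python
from sklearn import metrics

# Hypothetical labels: 0 = ham, 1 = spam
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of correct predictions (the diagonal of the matrix)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(cm)
print(accuracy, metrics.accuracy_score(y_true, y_pred))
```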

## 5. Compare multiple models

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

In [None]:
lr.fit(X_train_dtm, y_train)

In [None]:
y_pred_class = lr.predict(X_test_dtm)

In [None]:
# calculate predicted probabilities for X_test_dtm
y_pred_prob = lr.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

In [None]:
metrics.accuracy_score(y_test, y_pred_class)

In [None]:
metrics.roc_auc_score(y_test, y_pred_prob)
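
When comparing more than two models, the repeated fit/predict/score cells can be folded into a loop. The sketch below assumes a tiny made-up dataset and illustrative names (`models`, `results`); it is one possible arrangement, not the notebook's own method.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up dataset (1 = spam, 0 = ham), for illustration only
train_text = ['win a free prize now', 'free entry call now',
              'are you coming home', 'see you at lunch']
train_y = [1, 1, 0, 0]
test_text = ['free prize call', 'lunch at home']
test_y = [1, 0]

vect = CountVectorizer()
train_dtm = vect.fit_transform(train_text)
test_dtm = vect.transform(test_text)

models = {
    'naive_bayes': MultinomialNB(),
    'logistic_regression': LogisticRegression(),
}

# Fit each model on the same DTM and collect its test accuracy
results = {}
for name, model in models.items():
    model.fit(train_dtm, train_y)
    results[name] = metrics.accuracy_score(test_y, model.predict(test_dtm))

print(results)
```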

## 6. Fine tune vectorizer

In [None]:
# Default parameters for CountVectorizer
toNumeric = CountVectorizer()
toNumeric

In [None]:
# Remove English stop words
toNumeric = CountVectorizer(stop_words='english')
toNumeric.fit(X_train)
len(toNumeric.get_feature_names_out())

In [None]:
# Include 1 and 2-grams
toNumeric = CountVectorizer(ngram_range=(1, 2))
toNumeric.fit(X_train)
len(toNumeric.get_feature_names_out())

In [None]:
# Ignore terms that appear in more than 75% of the documents
toNumeric = CountVectorizer(max_df=0.75)
toNumeric.fit(X_train)
len(toNumeric.get_feature_names_out())

In [None]:
# Keep only terms that appear in at least 2 documents
toNumeric = CountVectorizer(min_df=2)
toNumeric.fit(X_train)
len(toNumeric.get_feature_names_out())
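
The effect of these parameters is easier to see on a corpus small enough to check by hand. The sketch below reuses the three sample sentences from section 1; `vocab_size` is a hypothetical helper, and `max_df=0.5` is used instead of 0.75 because no term appears in more than 75% of three documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['i will call you tonight', 'please help me...', 'Please call a cab please !']

def vocab_size(**params):
    """Fit a CountVectorizer with the given parameters and return its vocabulary size."""
    return len(CountVectorizer(**params).fit(docs).vocabulary_)

sizes = {
    'default': vocab_size(),                       # unigrams only
    'stop_words': vocab_size(stop_words='english'),
    'ngram_range': vocab_size(ngram_range=(1, 2)),  # adds word pairs
    'max_df': vocab_size(max_df=0.5),               # drops terms in more than half the documents
    'min_df': vocab_size(min_df=2),                 # keeps terms in at least 2 documents
}
print(sizes)
```

Here only `call` and `please` occur in two documents, so `min_df=2` keeps exactly those two terms and `max_df=0.5` removes exactly those two.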

## 7. Word Cloud

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud 

In [None]:
data.head(5)

In [None]:
# Join all messages into one string; str() on the Series would only give a truncated preview
wc = WordCloud()
wc.generate(' '.join(data['message']))
plt.figure(figsize=(20,10), facecolor='k')
plt.title("Most frequent words in SMS dataset", fontsize=40,color='white')
plt.imshow(wc)
plt.show()
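
A word cloud sizes words by how often they occur. As a rough numeric counterpart, the sketch below counts raw word frequencies with `collections.Counter` on a few made-up messages; the regex tokenizer is an assumption and does not reproduce WordCloud's own preprocessing (for example, its stop-word removal).

```python
import re
from collections import Counter

# Made-up messages standing in for data['message']
messages = ['Free entry to win', 'Call now for free prize', 'call me now', 'Please call']

# Lowercase the text and split it into alphabetic tokens
tokens = re.findall(r'[a-z]+', ' '.join(messages).lower())

word_counts = Counter(tokens)
top_word = word_counts.most_common(1)
print(top_word)
```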

## 8. Lemmatization and stemming

In [None]:
import nltk

In [None]:
nltk.download('punkt')

In [None]:
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer

In [None]:
# Create one stemmer object of each class
porter = PorterStemmer()
lancaster = LancasterStemmer()

### PorterStemmer

- It uses a set of rules to decide whether a suffix can be stripped.
- The resulting stems are often not dictionary words.
- PorterStemmer is known for its simplicity and speed.
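
To make the idea of rule-based suffix stripping concrete, here is a deliberately simplified toy stemmer. It is not the Porter algorithm: the rule list `SUFFIX_RULES`, the length threshold, and the function `toy_stem` are all hypothetical and only illustrate why related words collapse to a stem that need not be a dictionary word.

```python
# Toy rule-based suffix stripper (NOT the Porter algorithm)
SUFFIX_RULES = ['ing', 'ed', 'es', 's', 'e']  # hypothetical rules, tried in this order
MIN_STEM_LENGTH = 3                            # refuse to strip if the stem gets too short

def toy_stem(word):
    """Strip the first matching suffix, provided the remaining stem is long enough."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word

stems = {w: toy_stem(w) for w in ['cats', 'trouble', 'troubling', 'troubled']}
print(stems)
```

All three forms of "trouble" map to the same stem, which is useful for counting even though the stem itself is not a word.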

In [None]:
# Provide words to be stemmed
print(porter.stem("cats"))
print(porter.stem("trouble"))
print(porter.stem("troubling"))
print(porter.stem("troubled"))

### Sentence stemming

In [None]:
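sentence = "Pythoners are very intelligent and work very pythonly and now they are pythoning their way to success."
# Passing the whole sentence treats it as a single token, so the words are not stemmed individually
porter.stem(sentence)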

In [None]:
from nltk.tokenize import word_tokenize

def stemSentence(sentence):
    # Split the sentence into word tokens, stem each one, and rejoin with spaces
    token_words = word_tokenize(sentence)
    return " ".join(porter.stem(word) for word in token_words)

x = stemSentence(sentence)
print(x)

In [None]:
from textblob import TextBlob

In [None]:
sent = TextBlob(sentence)

In [None]:
print(' '.join([porter.stem(word) for word in sent.words]))

## 9. Sentiment calculation

In [None]:
from textblob import TextBlob

In [None]:
text = "I hate anything that goes in my ear"

In [None]:
result = TextBlob(text)

In [None]:
# Polarity ranges from -1.0 (most negative) to 1.0 (most positive)
result.sentiment.polarity