# Machine Learning and NLP

## Text Classification Examples
- Logistic Regression
- Naive Bayes
- Comparing Methods: Classification Metrics

## Supervised Learning

![](https://i163.photobucket.com/albums/t281/kyin_album/m1_1.png)

# <font color="blue"> __Logistic Regression__

![](https://i163.photobucket.com/albums/t281/kyin_album/m6.png)

# Step 1: Prepare the data



In [None]:
# make sure the data is labeled
import pandas as pd
data = pd.read_table('spam.txt',encoding='windows-1252', header=None)
data.columns = ['label', 'text']
print(data.head()) 
len(data)

In [None]:
# remove tokens that contain digits, replace punctuation with spaces, and lowercase the text
import re
import string
alphanumeric = lambda x: re.sub(r"""\w*\d\w*""", ' ', x)
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())
data['text'] = data.text.map(alphanumeric).map(punc_lower)
print(data.head())
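To see what the two cleaning steps do, here is a minimal sketch that applies the same regular expressions to a single made-up message. The sample string and the function names are assumptions for illustration; they do not come from `spam.txt`.

```python
import re
import string

def drop_tokens_with_digits(text):
    # Replace any token that contains at least one digit with a space.
    return re.sub(r"\w*\d\w*", " ", text)

def strip_punctuation_and_lowercase(text):
    # Lowercase the text, then replace each punctuation character with a space.
    return re.sub("[%s]" % re.escape(string.punctuation), " ", text.lower())

sample = "WINNER!! Call 0800123 now, txt WIN2 to claim."  # made-up message
cleaned = strip_punctuation_and_lowercase(drop_tokens_with_digits(sample))

# Splitting on whitespace hides the extra spaces left by the substitutions.
tokens = cleaned.split()
print(tokens)
```

Note that the cleaned string still contains runs of spaces where tokens and punctuation were removed; `CountVectorizer` ignores them when it tokenizes.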

# Step 2: Split the data (into training and testing set)

<Font color="Blue">__Input__: Features, Predictors, Independent Variables, X's 
<Font color="orange">__Outputs__: Label, Outcome, Dependent Variable, Y
    
![](https://i163.photobucket.com/albums/t281/kyin_album/m2.png)

In [None]:
# split the data into feature and label
X = data.text # inputs into model
y = data.label # output of model


In [None]:
X.head()

In [None]:
y.head()

## Overfitting

![](https://i163.photobucket.com/albums/t281/kyin_album/m3.png)

![](https://i163.photobucket.com/albums/t281/kyin_album/m4.png)

# Split the data [Code]

In [None]:
# split the data into a training and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# test size = 30% of observations, which means training size = 70% of observations
# random state = 42, so we all get the same random train / test split

In [None]:
X_train.head()

In [None]:
X_train.shape

In [None]:
y_train.head()

In [None]:
y_train.shape

In [None]:
X_test.shape


In [None]:
y_test.shape
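The effect of `test_size` and `random_state` can be checked on a small example. This is a sketch on ten made-up observations (the toy lists are assumptions, not the spam data): it confirms the 70/30 sizes and shows that repeating the call with the same `random_state` returns the same split.

```python
from sklearn.model_selection import train_test_split

# Ten made-up observations with a label for each one.
X_toy = list(range(10))
y_toy = ["ham"] * 7 + ["spam"] * 3

split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))   # sizes of the training and test sets
print(split_a == split_b)     # identical random_state, identical split
```

If the classes are imbalanced, passing `stratify=y` to `train_test_split` keeps the class proportions roughly equal in both sets; the notebook does not use it, so treat it as an optional variation.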

# Step 3: Numerically encode the input data [Code]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(stop_words='english')
X_train_cv = cv.fit_transform(X_train) 
X_test_cv = cv.transform(X_test) # transform reuses the vocabulary learned from X_train and counts term occurrences
# print the dimensions of the training set (text messages, terms)
print(X_train_cv.shape) # the sparse matrix has a shape attribute, so toarray() is not needed

In [None]:
help(cv.fit_transform)
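The difference between `fit_transform` and `transform` is easier to see on a tiny corpus. The sketch below uses made-up messages and a vectorizer without stop words (both are assumptions for illustration): the vocabulary is learned from the training documents only, and a word that appears only in the test document is silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["free prize free", "lunch today"]   # made-up training messages
test_docs = ["free lunch tomorrow"]               # "tomorrow" never appears in training

toy_cv = CountVectorizer()
train_counts = toy_cv.fit_transform(train_docs)   # learn the vocabulary, then count
test_counts = toy_cv.transform(test_docs)         # count using the learned vocabulary only

# Order the vocabulary by column index so it lines up with the matrix columns.
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)
print(train_counts.toarray())
print(test_counts.toarray())
```

Each row is a document and each column is a term; the entries are counts (note the 2 for "free"), not one-hot indicators.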

# Step 4: Fit model and predict outcomes [Code]

In [None]:
# Use a logistic regression model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

# Train the model
lr.fit(X_train_cv, y_train)

# Take the model that was trained on the X_train_cv data and apply it to the X_test_cv
y_pred_cv = lr.predict(X_test_cv)
y_pred_cv # The output is all of the predictions/ labels
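Besides hard labels, a fitted `LogisticRegression` can also report class probabilities through `predict_proba`. The following is a sketch on four made-up messages (the texts, labels, and variable names are assumptions), showing how the probability columns line up with `classes_`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up training messages and labels.
texts = ["win a free prize now", "free cash prize",
         "are we still on for lunch", "see you at lunch tomorrow"]
labels = ["spam", "spam", "ham", "ham"]

toy_cv = CountVectorizer()
toy_lr = LogisticRegression()
toy_lr.fit(toy_cv.fit_transform(texts), labels)

# Probabilities for one new message; columns follow the order of toy_lr.classes_.
new_counts = toy_cv.transform(["free prize"])
proba = toy_lr.predict_proba(new_counts)[0]
for cls, p in zip(toy_lr.classes_, proba):
    print(cls, round(p, 3))
```

`predict` simply returns the class with the highest probability, so the probabilities are useful when a different decision threshold is wanted.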

# Step 5: Evaluate the model

![](https://i163.photobucket.com/albums/t281/kyin_album/m5.png)

# Step 5: Evaluate the model [Code]

In [None]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
cm = confusion_matrix(y_test, y_pred_cv)
sns.heatmap(cm, xticklabels=['predicted_ham', 'predicted_spam'], yticklabels=['actual_ham', 'actual_spam'],
annot=True, fmt='d', annot_kws={'fontsize':20}, cmap="YlGnBu");
true_neg, false_pos = cm[0]
false_neg, true_pos = cm[1]
accuracy = round((true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg),3)
precision = round((true_pos) / (true_pos + false_pos),3)
recall = round((true_pos) / (true_pos + false_neg),3)
f1 = round(2 * (precision * recall) / (precision + recall),3)
print('Accuracy: {}'.format(accuracy))
print('Precision: {}'.format(precision))
print('Recall: {}'.format(recall))
print('F1 Score: {}'.format(f1))
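The same four numbers are also available from `sklearn.metrics`. The sketch below uses made-up labels (an assumption, not the notebook's predictions) and passes `pos_label="spam"` so that precision and recall refer to the spam class. It also passes `labels=` to `confusion_matrix` explicitly; without it the rows and columns follow the sorted label order, which is what the `cm[0]` / `cm[1]` unpacking above relies on.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up true labels and predictions.
y_true = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam"]
y_hat = ["ham", "ham", "ham", "spam", "spam", "spam", "spam", "ham"]

# Rows are actual classes, columns are predicted classes, in the given label order.
toy_cm = confusion_matrix(y_true, y_hat, labels=["ham", "spam"])
print(toy_cm)

acc = accuracy_score(y_true, y_hat)
prec = precision_score(y_true, y_hat, pos_label="spam")
rec = recall_score(y_true, y_hat, pos_label="spam")
f1_toy = f1_score(y_true, y_hat, pos_label="spam")
print(acc, prec, rec, f1_toy)
```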

# <font color="blue"> __Naive Bayes__

# Naive Bayes [code]

In [None]:
# Use a Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
import numpy as np
nb = MultinomialNB()
# Train the model
nb.fit(X_train_cv, y_train)
# Take the model that was trained on the X_train_cv data and apply it to the X_test_cv

y_pred_cv_nb = nb.predict(X_test_cv)
y_pred_cv_nb # The output is all of the predictions



# Naive Bayes: Results

In [None]:
cm = confusion_matrix(y_test, y_pred_cv_nb)
sns.heatmap(cm, xticklabels=['predicted_ham', 'predicted_spam'], yticklabels=['actual_ham', 'actual_spam'],
annot=True, fmt='d', annot_kws={'fontsize':20}, cmap="YlGnBu");
true_neg, false_pos = cm[0]
false_neg, true_pos = cm[1]
accuracy = round((true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg),3)
precision = round((true_pos) / (true_pos + false_pos),3)
recall = round((true_pos) / (true_pos + false_neg),3)
f1 = round(2 * (precision * recall) / (precision + recall),3)
print('Accuracy: {}'.format(accuracy))
print('Precision: {}'.format(precision))
print('Recall: {}'.format(recall))
print('F1 Score: {}'.format(f1))

NBscore = nb.score(X_test_cv, y_test) # mean accuracy on the test set
print('Accuracy (nb.score): {}'.format(round(NBscore, 3)))
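Because the metric code is repeated for each model, one option is to wrap it in a small helper and compare models side by side. This is a plain-Python sketch; the function name `binary_metrics` and the toy labels are hypothetical, and in the notebook the inputs would be `y_test` together with `y_pred_cv` or `y_pred_cv_nb`.

```python
def binary_metrics(y_true, y_pred, positive="spam"):
    """Return accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels and predictions standing in for two fitted models.
actual = ["ham", "ham", "spam", "spam", "ham", "spam"]
model_a = ["ham", "ham", "spam", "ham", "ham", "spam"]
model_b = ["ham", "spam", "spam", "spam", "ham", "spam"]

results = {"model_a": binary_metrics(actual, model_a),
           "model_b": binary_metrics(actual, model_b)}
for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

In this toy case both models have the same accuracy but trade precision against recall, which is why accuracy alone is a weak basis for comparing a spam filter.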


# <font color="red"> Machine Learning and NLP Exercises

# Introduction

We will be using the same review data set from Kaggle for this exercise. The product we'll focus on this time is a cappuccino cup.

The following code will help you load in the data.


In [None]:
import nltk
import pandas as pd

In [None]:
data = pd.read_csv('coffee.csv')
data.head()

# Question 1 

- Determine how many reviews there are in total.


Use the preprocessing code below to clean the reviews data before moving on to modeling.


In [None]:
# Text preprocessing steps - remove tokens containing digits, replace punctuation with spaces, and lowercase the text
import re
import string

alphanumeric = lambda x: re.sub(r"""\w*\d\w*""", ' ', x)
punc_lower = lambda x: re.sub('[%s]' % re.escape(string.punctuation), ' ', x.lower())

data['reviews'] = data.reviews.map(alphanumeric).map(punc_lower)
data.head()

# Question 2: Classification *(20% testing, 80% training)*

Process for classification:

### <font color="Blue">Step 1:</font> Prepare the data (identify the feature and label)

### <font color="Blue">Step 2:</font> Split the data into training and testing sets

### <font color="Blue">Step 3:</font> Vectorize the feature

### <font color="Blue">Step 4:</font> Identify the model/classifier to be used. Feed the training data into the model

### - Linear Regression

### - SVM

### - Decision Tree

### - Random Forest

### - KNN

### -  Naive Bayes

### <font color="Blue">Step 5:</font> Evaluate the Model - Accuracy Measurement
Generate the accuracy scores for Linear Regression, SVM, Decision Tree, Random Forest, KNN, and Naive Bayes.  

__Example Output:__
- Accuracy score for LR  = 0.1651
- Accuracy score for SVM = 0.5413
- Accuracy score for DT  = 0.5505
- Accuracy score for RF  = 0.5872
- Accuracy score for KNN = 0.5963
- Accuracy score for NB  = 0.6514
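The five steps can be sketched as a single loop over classifiers. Everything here is an assumption made for illustration: the reviews and star ratings are made up, the variable names are hypothetical, and `LogisticRegression` is swapped in for the "LR" entry because scikit-learn's `LinearRegression` predicts continuous values and its `score` method returns R² rather than accuracy.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Step 1: made-up reviews (feature) and star ratings (label).
good = ["love this cup great quality", "excellent coffee wonderful taste",
        "great cup love the design", "wonderful quality excellent value"]
bad = ["terrible taste very greasy", "dislike this cup poor quality",
       "awful coffee terrible smell", "poor design dislike it"]
texts = (good + bad) * 5
stars = ([5] * 4 + [1] * 4) * 5

# Step 2: 80% training, 20% testing.
X_tr, X_te, y_tr, y_te = train_test_split(texts, stars, test_size=0.2, random_state=42)

# Step 3: fit the vectorizer on the training text only.
vec = CountVectorizer(stop_words="english")
X_tr_counts = vec.fit_transform(X_tr)
X_te_counts = vec.transform(X_te)

# Step 4: candidate classifiers, all with default settings.
models = {
    "LR": LogisticRegression(),
    "SVM": SVC(),
    "DT": DecisionTreeClassifier(random_state=42),
    "RF": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "NB": MultinomialNB(),
}

# Step 5: for classifiers, score() returns mean accuracy on the given data.
scores = {}
for name, model in models.items():
    model.fit(X_tr_counts, y_tr)
    scores[name] = model.score(X_te_counts, y_te)
    print("Accuracy score for {} = {:.4f}".format(name, scores[name]))
```

The toy data is far easier than real reviews, so the printed scores say nothing about what to expect on the exercise data set.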

# Question 3
Predict the rating of this review,

<font color="blue">__"I dislike this coffee, terrible taste and very greasy."__



by using Linear Regression, SVM, Decision Tree, Random Forest, KNN, and Naive Bayes
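When predicting a single new review, the key detail is that the text goes through the same cleaning and the same already-fitted vectorizer (`transform`, not `fit_transform`) before it reaches the model. The sketch below shows this with one classifier on made-up training data; the training reviews, ratings, and the helper name `clean` are assumptions, and the other classifiers can be used in the same way.

```python
import re
import string

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def clean(text):
    # Same cleaning as the preprocessing cell: drop tokens with digits,
    # lowercase, and replace punctuation with spaces.
    text = re.sub(r"\w*\d\w*", " ", text)
    return re.sub("[%s]" % re.escape(string.punctuation), " ", text.lower())

# Made-up training reviews and ratings (1 = low, 5 = high).
train_texts = ["Love this cup, great quality!", "Excellent coffee, wonderful taste.",
               "Terrible smell and poor quality.", "Awful cup, I dislike it."]
train_stars = [5, 5, 1, 1]

vec = CountVectorizer(stop_words="english")
model = MultinomialNB()
model.fit(vec.fit_transform([clean(t) for t in train_texts]), train_stars)

review = "I dislike this coffee, terrible taste and very greasy."
# transform() maps the review onto the training vocabulary; unseen words are ignored.
review_counts = vec.transform([clean(review)])
predicted = model.predict(review_counts)[0]
print(predicted)
```

Calling `fit_transform` on the new review instead would build a different vocabulary, and the resulting columns would no longer match what the model was trained on.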