<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Machine Learning for NLP
    </div>
    <p>We learned about various extraction methods, such as tokenization, stemming, lemmatization, and stop-word removal, which are used to extract features from unstructured text. We also discussed Bag of Words and Term Frequency-Inverse Document Frequency (TFIDF).

In this session, we will learn how to use these extracted features to develop machine learning models. These models are capable of solving real-world problems, such as detecting whether the sentiment carried by a text is positive or negative, predicting whether an email is spam or not, and so on. We will also cover concepts such as supervised and unsupervised learning, classification and regression, sampling and splitting data, along with evaluating the performance of a model in depth. This session also discusses how to load and save these models for future use.</p>
    <ol>
        <li>Supervised Learning</li>
        <li>Unsupervised Learning</li>
        <li>Semi-supervised Learning</li>
        <li>Reinforcement Learning</li>
    </ol>

</div>

## Text Classification

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Classification
    </div>
    <p>Say you have two types of food, of which type 1 tastes sweet and type 2 tastes salty, and you need to determine how an unknown food will taste using various attributes of the food (such as color, fragrance, shape, and ingredients). This is an instance of classification.

Here, the two classes are class 1, which tastes sweet, and class 2, which tastes salty. The features that are used in this classification are color, fragrance, the ingredients used to prepare the dish, and so on. These features are called independent variables. The class (sweet or salty) is called a dependent variable.

Formally, classification algorithms are those that learn patterns from a labeled dataset in order to predict the classes of unseen data. Some of the most widely used classification algorithms are logistic regression, Naive Bayes, k-nearest neighbors, and tree methods. Let's learn about each of them.</p>

</div>


### Logistic Regression

#### Insert a new cell and add the following code to import the necessary packages:

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Logistic Regression
    </div>
    <p>Despite the term "regression" in its name, logistic regression is used for probabilistic classification. In this case, the dependent variable (the outcome) is binary, which means that its values can be represented by 0 or 1. For example, consider that you need to decide whether an email is spam or not. Here, the value of the decision (the dependent variable, or the outcome) can be considered to be 1 if the email is spam; otherwise, it will be 0. No other outcome is possible. The independent variables (that is, the features) will consist of various attributes of the email, such as the number of occurrences of certain keywords. We can then use the logistic regression algorithm to create a model that predicts whether the email is spam (1) or not (0).</p>

</div>
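
To make the "probabilistic" part concrete, here is a minimal sketch of how a logistic model turns a weighted sum of features into a probability and then into a 0/1 label. The weights, the bias, and the two keyword-count features are hypothetical values chosen for illustration; a trained model would learn them from data.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters for two email features:
# the count of the word "free" and the count of the word "meeting".
weights = [1.2, -0.8]
bias = -0.5

def spam_probability(features):
    """Return the modeled probability that an email is spam (class 1)."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

p_spammy = spam_probability([3, 0])   # three "free", no "meeting"
p_normal = spam_probability([0, 2])   # no "free", two "meeting"

# A probability of at least 0.5 is mapped to class 1 (spam), otherwise class 0.
label_spammy = 1 if p_spammy >= 0.5 else 0
label_normal = 1 if p_normal >= 0.5 else 0
```

The 0.5 cut-off is the usual default; it can be moved when one kind of error is more costly than the other.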


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import re
import string
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
from pylab import *
import nltk
import warnings
warnings.filterwarnings('ignore')

#### Read the data file in JSON format using pandas

In [None]:
review_data = pd.read_json('data/reviews_Musical_Instruments_5.json', lines=True)
review_data.shape

In [None]:
review_data

In [None]:
review_data[['reviewText', 'overall']].head()

#### Use a lambda function to extract tokens from each 'reviewText' of this DataFrame
We will then lemmatize the tokens and concatenate them side by side. Use the join function to concatenate a list of words into a single sentence. Use the regular expression module (re) to replace anything other than alphabetical characters, digits, and whitespace with a blank space.

In [None]:
lemmatizer = WordNetLemmatizer()
review_data['cleaned_review_text'] = review_data['reviewText'].apply(\
lambda x : ' '.join([lemmatizer.lemmatize(word.lower()) \
    for word in word_tokenize(re.sub(r'([^\s\w]|_)+', ' ', str(x)))]))

#### Stop words removal

In [None]:
from nltk import download
download('stopwords')
from nltk import word_tokenize
from nltk.corpus import stopwords

In [None]:
stop_words = stopwords.words('english')
len(review_data['cleaned_review_text'])

In [None]:
review_data['cleaned_text'] = review_data['reviewText'].apply(\
lambda x : ' '.join([lemmatizer.lemmatize(word.lower()) \
    for word in word_tokenize(re.sub(r'([^\s\w]|_)+', ' ', str(x))) if word.lower() not in stop_words]))

#### Preview the cleaned version of reviewText alongside the original text and rating

In [None]:
review_data[['cleaned_review_text', 'reviewText', 'overall']].head()

#### Create a TFIDF matrix and transform it into a DataFrame

In [None]:
tfidf_model = TfidfVectorizer(max_features=500)
tfidf_df = pd.DataFrame(tfidf_model.fit_transform(review_data['cleaned_review_text']).todense())
tfidf_df.columns = sorted(tfidf_model.vocabulary_)
tfidf_df.head()

In [None]:
review_data['target'] = review_data['overall'].apply(lambda x : 0 if x<=4 else 1)
review_data['target'].value_counts()

#### Splitting the dataset into the Training set and Test set

In [None]:
# Identify X and y
X = tfidf_df
y = review_data['target']

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split

training_features, test_features, \
training_target, test_target, = train_test_split(X,y,
                                               test_size = .2,
                                               random_state= 45)

#### Use sklearn's LogisticRegression() function to fit a logistic regression model 

In [None]:
# Import the library
from sklearn.linear_model import LogisticRegression
# Initiate the model
logreg = LogisticRegression()
# Train the Model on 80%
logreg.fit(training_features, training_target)
#  Test the Model on 20%
lr_predicted_labels = logreg.predict(test_features)
# Predicted probability of the positive class (label 1) for each test review
logreg.predict_proba(test_features)[:,1]

#### Use a confusion matrix to compare actual and predicted labels
Compare the actual target variable with the predicted target variable.

In [None]:
# review_data['predicted_labels'] = predicted_labels
# pd.crosstab(review_data['target'], review_data['predicted_labels'])
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Making the Confusion Matrix
CMLR= confusion_matrix(test_target, lr_predicted_labels)
CMLR

In [None]:
# Accuracy Score
LR = accuracy_score(test_target, lr_predicted_labels)

print(" Logistic Regression Prediction Accuracy : {:.2f}%".format(LR * 100))
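
Accuracy alone can hide how a model behaves on each class, so it helps to read precision and recall off the confusion matrix as well. The sketch below computes them by hand on a small set of made-up labels (not output from the model above). For binary labels 0 and 1, sklearn's `confusion_matrix` arranges the same four counts as `[[tn, fp], [fn, tp]]`.

```python
def binary_metrics(actual, predicted):
    """Compute confusion-matrix counts and common metrics for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels for illustration only
toy_actual = [1, 1, 1, 0, 0, 0, 0, 1]
toy_predicted = [1, 0, 1, 0, 1, 0, 0, 1]
toy_metrics = binary_metrics(toy_actual, toy_predicted)
```

On the real test set, `classification_report(test_target, lr_predicted_labels)` from `sklearn.metrics` reports the same quantities per class.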

### Naive Bayes Classifiers

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Naive Bayes Classifiers
    </div>
    <p>Just like logistic regression, a Naive Bayes classifier is another kind of probabilistic classifier. It is based on Bayes' theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

In the preceding formula, A and B are events and P(B) is not equal to 0. P(A|B) is the probability of event A occurring, given that event B is true. Similarly, P(B|A) is the probability of event B occurring, given that event A is true. P(A) and P(B) are the probabilities of events A and B occurring on their own.

Say there is an online platform where hotel customers can provide a review for the service provided to them. The hotel now wants to figure out whether new reviews on the platform are appreciative in nature or not. Here, P(A) = the probability of the review being an appreciative one, while P(B) = the probability of the review being long. Now, we've come across a review that is long and want to figure out the probability of it being appreciative. To do that, we need to calculate P(A|B). P(B|A) will be the probability of appreciative reviews being long. From the training dataset, we can easily calculate P(B|A), P(A), and P(B) and then use Bayes' theorem to calculate P(A|B).

As with logistic regression, the scikit-learn library provides a Naive Bayes classifier, which can be used in Python with the following code:</p>

</div>
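
The hotel review example can be worked through with numbers. The counts below are hypothetical and chosen only to keep the arithmetic easy to follow. The sketch applies Bayes' theorem to get the probability that a long review is appreciative, then checks the result against a direct count.

```python
# Hypothetical counts from a training set of 100 hotel reviews
total_reviews = 100
appreciative_reviews = 60      # event A: the review is appreciative
long_reviews = 40              # event B: the review is long
long_and_appreciative = 30     # reviews where both A and B hold

p_a = appreciative_reviews / total_reviews                  # P(A)
p_b = long_reviews / total_reviews                          # P(B)
p_b_given_a = long_and_appreciative / appreciative_reviews  # P(B given A)

# Bayes' theorem: P(A given B) = P(B given A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

# Sanity check: the share of long reviews that are appreciative, counted directly
direct_estimate = long_and_appreciative / long_reviews
```

Both routes give 0.75, which is the point of the theorem: the conditional probability we want can be recovered from quantities that are easy to estimate from training data.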


#### Insert a new cell and add the following code to import the necessary packages

In [None]:
# Import the library
from sklearn.naive_bayes import GaussianNB
# Initiate the model
nb = GaussianNB()
# Train the Model on 80%
nb.fit(training_features, training_target)
#  Test the Model on 20%
nb_predicted_labels = nb.predict(test_features)
# Predicted probability of the positive class (label 1) for each test review
nb.predict_proba(test_features)[:,1]

In [None]:
# Making the Confusion Matrix
CMNB= confusion_matrix(test_target, nb_predicted_labels)
CMNB

In [None]:
# Accuracy Score
NB = accuracy_score(test_target, nb_predicted_labels)

print(" Naive Bayes Prediction Accuracy : {:.2f}%".format(NB * 100))

### k-nearest Neighbors

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        k-nearest Neighbors
    </div>
    <p>k-nearest neighbors is an algorithm that can be used to solve both regression and classification. In this chapter, we will focus on the classification aspect of the algorithm as it is used for NLP applications. Consider, for instance, the saying "birds of a feather flock together." This means that people who have similar interests prefer to stay close to each other and form groups. This characteristic is called homophily. This characteristic is the main idea behind the k-nearest neighbors classification algorithm.

To classify an unknown object, k number of other objects located nearest to it with class labels will be looked into. The class that occurs the most among them will be assigned to it—that is, the object with an unknown class. The value of k is chosen by running experiments on the training dataset to find the most optimal value. When dealing with text data for a given document, we interpret "nearest neighbors" as other documents that are the most similar to the unknown document.</p>

</div>
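
As a rough illustration of "nearest neighbors" for text, the sketch below measures similarity between documents with cosine similarity over raw word counts and lets the k most similar training documents vote. The documents, labels, and helper names (`cosine_sim`, `knn_predict`) are made up for this example. It is a simplified stand-in for the idea, not what `KNeighborsClassifier` does internally.

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_predict(train_docs, train_labels, new_doc, k=3):
    """Label new_doc by majority vote among its k most similar documents."""
    new_vec = Counter(new_doc.lower().split())
    scored = [(cosine_sim(Counter(doc.lower().split()), new_vec), label)
              for doc, label in zip(train_docs, train_labels)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    top_labels = [label for _, label in scored[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Made-up training documents for illustration only
toy_docs = [
    "great guitar strings great tone",
    "love this guitar pedal",
    "great tone love it",
    "cable broke after a week",
    "poor quality cable broke",
    "pedal stopped working poor build",
]
toy_labels = ["positive"] * 3 + ["negative"] * 3

prediction_a = knn_predict(toy_docs, toy_labels, "great tone from this pedal", k=3)
prediction_b = knn_predict(toy_docs, toy_labels, "cable broke", k=3)
```

In practice, TFIDF vectors usually replace raw counts, so that very common words contribute less to the similarity.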


#### Insert a new cell and add the following code to import the necessary packages

In [None]:
# Import the library
from sklearn.neighbors import KNeighborsClassifier
# Initiate the model
knn = KNeighborsClassifier(n_neighbors=3)
# Train the Model on 80%
knn.fit(training_features, training_target)
#  Test the Model on 20%
knn_predicted_labels = knn.predict(test_features)
# Predicted probability of the positive class (label 1) for each test review
knn.predict_proba(test_features)[:,1]

In [None]:
# Making the Confusion Matrix
CMKNN = confusion_matrix(test_target, knn_predicted_labels)
CMKNN

In [None]:
# Accuracy Score
KNN = accuracy_score(test_target, knn_predicted_labels)

print(" KNN Prediction Accuracy : {:.2f}%".format(KNN * 100))

#### Models compared

In [None]:
print(" Logistic Regression Prediction Accuracy : {:.2f}%".format(LR * 100))
print(" Naive Bayes Prediction Accuracy         : {:.2f}%".format(NB * 100))
print(" KNN Prediction Accuracy                 : {:.2f}%".format(KNN * 100))

## Regression

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Regression
    </div>
    <p>To better understand regression, consider a practical example. For example, say you have photos of several people, along with a list of their respective ages, and you need to predict the ages of some other people from their photos. This is a use case for regression.

In the case of regression, the dependent variable (age, in this example) is continuous. The independent variables—that is, features—consist of the attributes of the images, such as the color intensity of each pixel. Formally, regression analysis refers to the process of learning a mapping function, which relates features or predictors (inputs) to the dependent variable (output).

There are various types of regression: univariate, multivariate, simple, multiple, linear, non-linear, polynomial, stepwise, ridge, lasso, and elastic net regression. If there is just one dependent variable, then it is referred to as univariate regression. On the other hand, two or more dependent variables constitute multivariate regression. Simple regression has only one predictor variable, while multiple regression has more than one predictor variable.

Since linear regression is the base algorithm for the different types of regression mentioned previously, in the next section, we will cover linear regression in detail.</p>

</div>


### Linear Regression

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Linear Regression
    </div>
    <p>The term "linear" refers to the linearity of parameters. Parameters are the coefficients of predictor variables in the linear regression equation. The following formula represents the linear regression equation:

$$y = \beta_0 + \beta_1 X + \epsilon$$

Here, y is the dependent variable (output); it is continuous. X is an independent variable or feature (input). β0 and β1 are parameters: β0 is the intercept and β1 is the coefficient of X. ε is the error component, which is the difference between the actual and predicted values of y. Because linear regression assumes that y is a linear function of the parameters, it can be too restrictive for many real-world problems. However, it is useful for high-level predictions, such as the sales revenue of a product given the price and advertising cost.</p>

</div>
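
To see what fitting the two parameters means, here is a minimal sketch that recovers the intercept and the slope by least squares with NumPy. The data is synthetic and generated without noise from a known line, so the fit should return the values used to generate it. The names `x_toy`, `y_toy`, and `design` are illustrative only.

```python
import numpy as np

# Synthetic data generated from y = 2 + 3x with no noise (illustration only)
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = 2.0 + 3.0 * x_toy

# Design matrix: a column of ones for the intercept, then the feature itself
design = np.column_stack([np.ones_like(x_toy), x_toy])

# Least squares picks the parameters that minimize the sum of squared errors
beta, *_ = np.linalg.lstsq(design, y_toy, rcond=None)
beta0, beta1 = beta

# Error component: actual values minus fitted values
toy_residuals = y_toy - design @ beta
```

With real, noisy data the residuals are not zero, and the fitted parameters are estimates rather than exact values.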


#### Transform the original data again 

In [None]:
tfidf_model = TfidfVectorizer(max_features=500)
tfidf_df = pd.DataFrame(tfidf_model.fit_transform(review_data['cleaned_review_text']).todense())
tfidf_df.columns = sorted(tfidf_model.vocabulary_)
tfidf_df.head()

#### Splitting the dataset into the Training set and Test set

In [None]:
# Identify X and y; the regression target is the continuous 'overall' rating
X = tfidf_df
y = review_data['overall']

In [None]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split

training_features, test_features, \
training_target, test_target, = train_test_split(X,y,
                                               test_size = .2,
                                               random_state= 45)

#### Use sklearn's LinearRegression() function to fit a linear regression model

In [None]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(training_features,training_target)
linreg.coef_

#### To check the intercept (β0) of the linear regression model

In [None]:
linreg.intercept_

#### To check the prediction in a TFIDF DataFrame

In [None]:
linreg.predict(test_features)

#### Use the model to predict the 'overall' score and store it in a column 

In [None]:
# 'overall' is not a column of test_features, so look it up in review_data by index
results = review_data.loc[test_features.index, ['overall']].copy()
results['predicted_score_from_linear_regression'] = linreg.predict(test_features)
results.head(10)
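
Looking at the first ten rows gives only a rough impression of how well the regression fits. A minimal sketch of two common summary measures, mean squared error and the coefficient of determination (R²), is shown below on made-up ratings and predictions. The numbers are illustrative and are not results from the model above.

```python
import numpy as np

# Made-up actual ratings and predictions, for illustration only
actual_scores = np.array([5.0, 4.0, 3.0, 5.0])
predicted_scores = np.array([4.5, 4.0, 3.5, 4.0])

errors = actual_scores - predicted_scores

# Mean squared error: the average of the squared prediction errors
mse = np.mean(errors ** 2)

# R squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual_scores - actual_scores.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
```

`sklearn.metrics` provides `mean_squared_error` and `r2_score`, which compute the same quantities on the held-out test set.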

## Hierarchical Clustering

In this exercise, we will analyze the text documents in sklearn's fetch_20newsgroups dataset. The 20 newsgroups dataset contains news articles on 20 different topics. We will make use of hierarchical clustering to group the documents into clusters. Once the clusters have been created, we will compare them with their actual categories. Follow these steps to implement this exercise:

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        Performing Hierarchical Clustering
    </div>
    <p>Hierarchical clustering algorithms group similar objects together to create a cluster with the help of a dendrogram. In this algorithm, we can vary the number of clusters as per our requirements. First, we construct a matrix consisting of distances between each pair of instances (data points). After that, we construct a dendrogram (a representation of clusters in the form of a tree) based on the distances between them. We truncate the tree at a location corresponding to the number of clusters we need.

For example, imagine that you have 10 documents and want to group them into a number of categories based on their attributes (the number of words they contain, the number of paragraphs, punctuation, and so on) and don't have any fixed number of categories in mind. This is a use case of hierarchical clustering. Let's assume that we have a dataset containing the features of the 10 documents. Firstly, the distances between each pair of documents from the set of 10 documents are calculated. After that, we construct a dendrogram and truncate it at a suitable position to get a suitable number of clusters:</p>

</div>
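
The three steps described above (pairwise distances, building the tree, cutting it) can be sketched on a handful of 2-D points before applying them to text. The points are made up so that two groups are obvious; the names `toy_points`, `toy_distances`, `toy_linkage`, and `toy_labels` are illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Six made-up points forming two well-separated groups
toy_points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                       [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

# Step 1: distances between every pair of points, in condensed form
# (one entry per pair, so 6 points give 15 distances)
toy_distances = pdist(toy_points, metric='euclidean')

# Step 2: build the merge tree; each row of the result records one merge
toy_linkage = linkage(toy_distances, method='ward')

# Step 3: cut the tree so that at most two clusters remain
toy_labels = fcluster(toy_linkage, t=2, criterion='maxclust')
```

Passing `toy_linkage` to `scipy.cluster.hierarchy.dendrogram` draws the tree that is being cut in step 3.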


#### Insert a new cell and add the following code to import the necessary libraries

In [None]:
from sklearn.datasets import fetch_20newsgroups
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import ward, dendrogram
import matplotlib as mpl
from scipy.cluster.hierarchy import fcluster
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import re
import string
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
from pylab import *
import nltk
import warnings
warnings.filterwarnings('ignore')

#### Download a list of stop words and the Wordnet corpus from nltk. Insert a new cell and add the following code to implement this:

In [None]:
nltk.download('stopwords')
stop_words=stopwords.words('english')
stop_words=stop_words+list(string.printable)
nltk.download('wordnet')
lemmatizer=WordNetLemmatizer()

#### Specify the categories of news articles we want to fetch to perform our clustering task.

In [None]:
categories= ['misc.forsale', 'sci.electronics', 'talk.religion.misc']

#### To fetch the dataset, add the following lines of code:

In [None]:
news_data = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42, download_if_missing=True)

#### To view the data of the fetched content, add the following code:

In [None]:
news_data['data'][:5]

#### To check the categories of news articles

In [None]:
print(news_data.target)

#### To store news_data and the corresponding categories in a pandas DataFrame and view it, write the following code:

In [None]:
news_data_df = pd.DataFrame({'text' : news_data['data'], 'category': news_data.target})
news_data_df.head()

#### To count the number of occurrences of each category appearing in this dataset

In [None]:
news_data_df['category'].value_counts()

#### Use a lambda function to extract tokens from each "text" of the news_data_df DataFrame.

In [None]:
news_data_df['cleaned_text'] = news_data_df['text'].apply(\
lambda x : ' '.join([lemmatizer.lemmatize(word.lower()) \
    for word in word_tokenize(re.sub(r'([^\s\w]|_)+', ' ', str(x))) if word.lower() not in stop_words]))

#### Create a TFIDF matrix and transform it into a DataFrame

In [None]:
tfidf_model = TfidfVectorizer(max_features=200)
tfidf_df = pd.DataFrame(tfidf_model.fit_transform(news_data_df['cleaned_text']).todense())
tfidf_df.columns = sorted(tfidf_model.vocabulary_)
tfidf_df.head()

#### Calculate the pairwise Euclidean distances using scipy's pdist

In [None]:
# Ward linkage expects Euclidean distances in condensed form (one entry per document pair)
from scipy.spatial.distance import pdist
dist = pdist(tfidf_df.values, metric='euclidean')

#### Now, create a dendrogram for the TFIDF representation of documents:

In [None]:
import scipy.cluster.hierarchy as sch

In [None]:
# Use a separate name so the imported dendrogram function is not shadowed
dendrogram_plot = sch.dendrogram(sch.linkage(dist, method='ward'))

plt.xlabel('Data Points')
plt.ylabel('Euclidean Distance')
plt.title('Dendrogram')
plt.show()

Using the above image, we can analyze the high-level patterns that the clustering algorithm found to group the articles into one of the four clusters. As you can see, cluster 2 has mostly religion-related articles, while cluster 3 consists of primarily sales-related articles. The other two clusters do not have a proper distinction. The reason for this could be that the model figured out that words related to "religion" and "for sale" appeared frequently in the articles that were classified into those respective clusters, while the articles on "electronics" consist of mostly generic words.

#### Use the fcluster() function to obtain the cluster labels of the clusters that were obtained by hierarchical clustering

In [None]:
k = 4
clusters = fcluster(sch.linkage(dist, method='ward'), k, criterion='maxclust')
clusters

In [None]:
news_data_df['obtained_clusters'] = clusters
pd.crosstab(news_data_df['category'].replace(
    {0:'misc.forsale', 
     1:'sci.electronics', 
     2:'talk.religion.misc'}),\
            news_data_df['obtained_clusters'].replace(
    {1 : 'cluster_1', 
     2 : 'cluster_2', 
     3 : 'cluster_3', 
     4: 'cluster_4'}))

## k-means Clustering

<div style="background-color: rgba(0, 100, 200, .1); padding: 1em 3em; border-radius: 5px; border: 1px solid black">
    <div style="font-weight: bold; font-size: 1.2em; border-bottom: 1px dashed black; padding-bottom: .5em;">
        k-means Clustering
    </div>
    <p>In this algorithm, we segregate the given instances (data points) into "k" number of groups (here, k is a natural number). First, we choose k centroids. We assign each instance to its nearest centroid, thereby creating k groups. This is the assignment phase, which is followed by the update phase.

In the update phase, new centroids for each of these k groups are calculated. The data points are reassigned to their nearest newly calculated centroids. The assignment phase and the update phase are carried on repeatedly until the assignment of data points no longer changes.

For example, suppose you have 10 documents. You want to group them into three categories based on their attributes, such as the number of words they contain, the number of paragraphs, punctuation, and the tone of the document. In this case, we will assume that k is 3; that is, we want to create three groups. First, three centroids need to be chosen. In the assignment phase, each of these 10 documents is assigned to its nearest centroid, thereby forming three groups. In the update phase, the centroids of the three newly formed groups are recalculated. To decide the optimal number of clusters (that is, k), we execute k-means clustering for various values of k and note down their performances (sum of squared errors). Because the sum of squared errors keeps decreasing as k grows, we select the smallest value of k beyond which the error stops dropping sharply. This method is called the elbow method.</p>

</div>
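
The assignment and update phases can be written out directly. The sketch below is a simplified version for illustration: it uses the first k points as the initial centroids (an assumption made here to keep the run deterministic, whereas library implementations choose them more carefully) and runs on made-up 2-D points rather than documents.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100):
    """Alternate assignment and update phases until labels stop changing."""
    centroids = points[:k].copy()  # simplifying assumption: first k points
    labels = None
    for _ in range(max_iter):
        # Assignment phase: each point goes to its nearest centroid
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments no longer change
        labels = new_labels
        # Update phase: each centroid moves to the mean of its group
        # (an empty group keeps its previous centroid)
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j)
            else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Six made-up points forming two well-separated groups
toy_points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                       [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2)
```

Even though both starting centroids lie in the same group here, the update phase pulls one of them toward the second group after the first pass.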


In this exercise, we will create four clusters from text documents in sklearn's fetch_20newsgroups text dataset using k-means clustering. We will compare these clusters with the actual categories and use the elbow method to obtain the optimal number of clusters. Follow these steps to implement this exercise:

#### Fit k-means with four clusters and use pandas' crosstab function to compare the clusters with the actual categories of the news articles.

In [None]:
# Fit k-means with four clusters on the TFIDF features and store the resulting labels
kmeans = KMeans(n_clusters=4, random_state=42)
news_data_df['obtained_clusters'] = kmeans.fit_predict(tfidf_df)
pd.crosstab(news_data_df['category'].replace(
    {0:'misc.forsale', 
     1:'sci.electronics', 
     2:'talk.religion.misc'}),\
            news_data_df['obtained_clusters'].replace(
    {0 : 'cluster_1', 
     1 : 'cluster_2', 
     2 : 'cluster_3', 
     3 : 'cluster_4'}))

#### Obtain the optimal value of k

In [None]:
distortions = []
K = range(1,6)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(tfidf_df)
    distortions.append(sum(np.min(cdist(tfidf_df, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / tfidf_df.shape[0])

plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal number of clusters')
plt.show()
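
The distortion plotted above is the mean distance of each point to its nearest centroid. The sum of squared errors mentioned earlier is also available directly from a fitted scikit-learn model as the `inertia_` attribute. A minimal sketch on made-up 2-D points with two obvious groups shows the sharp drop at k = 2 that the elbow method looks for; `n_init` and `random_state` are set here only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Six made-up points forming two well-separated groups
toy_points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                       [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

toy_inertias = []
for n_clusters in range(1, 4):
    toy_model_k = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    toy_model_k.fit(toy_points)
    # inertia_ is the sum of squared distances of points to their nearest centroid
    toy_inertias.append(toy_model_k.inertia_)
```

Going from one cluster to two removes almost all of the error, while a third cluster improves it only marginally, so k = 2 is the elbow for this toy data.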

In [None]:
## 7

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import re
import string
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
from pylab import *
import nltk
import warnings
warnings.filterwarnings('ignore')

In [None]:
data_patio_lawn_garden = pd.read_json('data/reviews_Patio_Lawn_and_Garden_5.json', lines = True)
data_patio_lawn_garden[['reviewText', 'overall']].head()

In [None]:
lemmatizer = WordNetLemmatizer()
data_patio_lawn_garden['cleaned_review_text'] = data_patio_lawn_garden['reviewText'].apply(\
lambda x : ' '.join([lemmatizer.lemmatize(word.lower()) \
    for word in word_tokenize(re.sub(r'([^\s\w]|_)+', ' ', str(x)))]))
data_patio_lawn_garden[['cleaned_review_text', 'reviewText', 'overall']].head()

In [None]:
tfidf_model = TfidfVectorizer(max_features=500)
tfidf_df = pd.DataFrame(tfidf_model.fit_transform(data_patio_lawn_garden['cleaned_review_text']).todense())
tfidf_df.columns = sorted(tfidf_model.vocabulary_)
tfidf_df.head()

In [None]:
data_patio_lawn_garden['target'] = data_patio_lawn_garden['overall'].apply(lambda x : 0 if x<=4 else 1)
data_patio_lawn_garden['target'].value_counts()

In [None]:
from sklearn import tree
dtc = tree.DecisionTreeClassifier()
dtc = dtc.fit(tfidf_df, data_patio_lawn_garden['target'])
data_patio_lawn_garden['predicted_labels_dtc'] = dtc.predict(tfidf_df)

In [None]:
pd.crosstab(data_patio_lawn_garden['target'], data_patio_lawn_garden['predicted_labels_dtc'])

In [None]:
from sklearn import tree
dtr = tree.DecisionTreeRegressor()
dtr = dtr.fit(tfidf_df, data_patio_lawn_garden['overall'])
data_patio_lawn_garden['predicted_values_dtr'] = dtr.predict(tfidf_df)
data_patio_lawn_garden[['predicted_values_dtr', 'overall']].head(10)
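
A fitted model can be kept for later use by serializing it. The sketch below is one possible approach using Python's built-in `pickle` module on a tiny made-up dataset. It round-trips the model through bytes in memory and checks that the restored copy makes the same predictions. The toy data and variable names are illustrative only.

```python
import pickle

from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: one feature, two classes
toy_features = [[0.0], [1.0], [2.0], [3.0]]
toy_target = [0, 0, 1, 1]
toy_model = LogisticRegression().fit(toy_features, toy_target)

# Serialize the fitted model to bytes and restore it again
serialized_model = pickle.dumps(toy_model)
restored_model = pickle.loads(serialized_model)

# The restored model should behave exactly like the original
same_predictions = (list(toy_model.predict(toy_features))
                    == list(restored_model.predict(toy_features)))
```

To store the model on disk instead, `pickle.dump` and `pickle.load` work the same way with a file opened in binary mode. Only load pickle files from sources you trust, since unpickling can execute arbitrary code.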