![license_header_logo](../../../images/license_header_logo.png)

> **Copyright (c) 2021 CertifAI Sdn. Bhd.**<br>
<br>
This program is part of OSRFramework. You can redistribute it and/or modify
<br>it under the terms of the GNU Affero General Public License as published by
<br>the Free Software Foundation, either version 3 of the License, or
<br>(at your option) any later version.
<br>
<br>This program is distributed in the hope that it will be useful
<br>but WITHOUT ANY WARRANTY; without even the implied warranty of
<br>MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
<br>GNU Affero General Public License for more details.
<br>
<br>You should have received a copy of the GNU Affero General Public License
<br>along with this program.  If not, see <http://www.gnu.org/licenses/>.
<br>

# Introduction

This notebook introduces the basics of the scikit-learn module through a simple NLP task. Scikit-learn (sklearn) is a widely used Python library for machine learning. This tutorial consists of two major parts:

1. Introduction to scikit-learn
2. Using sklearn in sentiment analysis task

# Notebook Content

* [What is Scikit-Learn (Sklearn)](#What-is-Scikit-Learn-(Sklearn))


* [Prerequisites](#Prerequisites)


* [Installation](#Installation)
    * [Using PIP](#Using-PIP)
    * [Using conda](#Using-conda)


* [Features](#Features)


* [Sentiment Analysis](#Sentiment-Analysis)
    * [Objective](#Objective)


* [Get Started](#Let’s-Get-Started)
    * [TF-IDF](#TF-IDF)
    * [Support Vector Machine](#Support-Vector-Machine)
    

* [Conclusion](#Conclusion)

<img align="left" width="300" height="300" src="../../../images/sklearn.png">

# What is Scikit-Learn (Sklearn)

Scikit-learn (sklearn) is one of the most widely used and robust libraries for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python. This library, which is largely written in Python, is built upon **NumPy**, **SciPy** and **Matplotlib**.

# Prerequisites
Before using scikit-learn, make sure the following are installed. The minimum versions below are the ones listed when this notebook was written; newer scikit-learn releases may require newer versions.

- Python (>= 3.5)
- NumPy (>= 1.11.0)
- SciPy (>= 0.17.0)
- Joblib (>= 0.11)
- Matplotlib (>= 1.5.1), required for sklearn's plotting capabilities
- Pandas (>= 0.18.0), required for some of the scikit-learn examples that use its data structures and analysis tools

# Installation

If you have already installed NumPy and SciPy, the following are the two easiest ways to install scikit-learn:

### Using PIP
The following command installs scikit-learn via pip:

> `pip install -U scikit-learn`

### Using conda
The following command installs scikit-learn via conda:

> `conda install scikit-learn`

If NumPy and SciPy are not yet installed on your Python workstation, you can install them using either **pip** or **conda**.

Another option is to use a Python distribution such as **Canopy** or **Anaconda**, both of which ship with scikit-learn.
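To confirm what is actually installed, a quick check like the one below can help. It is a minimal sketch that relies only on the standard library's `importlib.metadata` (Python 3.8+); `installed_version` is a hypothetical helper name, and the package list simply mirrors the prerequisites named in this notebook.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as they are published on PyPI
packages = ["numpy", "scipy", "joblib", "matplotlib", "pandas", "scikit-learn"]

for name in packages:
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

Any package reported as `not installed` can then be added with pip or conda as described above.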

# Features

Rather than focusing on loading, manipulating and summarising data, the scikit-learn library focuses on modeling the data. Some of the most popular groups of models and tools provided by sklearn are as follows:

- **Supervised Learning algorithms** − Almost all the popular supervised learning algorithms, such as Linear Regression, Support Vector Machine (SVM) and Decision Tree, are part of scikit-learn.


- **Unsupervised Learning algorithms** − It also includes popular unsupervised learning algorithms, from clustering, factor analysis and PCA (Principal Component Analysis) to unsupervised neural networks.


- **Clustering** − This model is used for grouping unlabeled data.


- **Cross Validation** − It is used to check the accuracy of supervised models on unseen data.


- **Dimensionality Reduction** − It is used for reducing the number of attributes in data which can be further used for summarisation, visualisation and feature selection.


- **Ensemble methods** − As the name suggests, these are used for combining the predictions of multiple supervised models.


- **Feature extraction** − It is used to extract the features from data to define the attributes in image and text data.


- **Feature selection** − It is used to identify useful attributes to create supervised models.


- **Open Source** − It is an open source library and is also commercially usable under the BSD license.

# Sentiment Analysis

## Objective

In this notebook we perform binary classification: we classify the sentiment of each review as positive or negative based on the `Reviews` column of the IMDB dataset. We use TF-IDF to vectorize the text and a linear Support Vector Machine for classification.

`Natural Language Processing (NLP)` is a sub-field of artificial intelligence that deals with understanding and processing human language. In light of new advancements in machine learning, many organizations have begun applying natural language processing to translation, chatbots and candidate filtering.

Machine learning algorithms cannot work with raw text directly. Rather, the text must first be converted into vectors of numbers. For this we use the `TF-IDF` vectorizer approach. `TF-IDF` is a technique used in natural language processing that transforms text into feature vectors that can be used as input to an estimator.

## Required Libraries

- **Pandas**
> `!pip install pandas`

- **Numpy**
> `!pip install numpy`

- **Scikit-learn**
> `!pip install scikit-learn`

# Let’s Get Started

In [1]:
import pandas as pd
import numpy as np

In [2]:
train_df = pd.read_excel("../../../resources/day_02/IMDB_train.xlsx")

## TF-IDF

![TF-IDF](../../../images/tf-idf.png)

Because TF-IDF gives uncommon words more weight than common words, some semantic information is preserved.

`E.g. in 'She is beautiful', 'beautiful' will have more importance than 'she' or 'is'.`
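The weighting can be worked through by hand on a tiny corpus. The sketch below is plain Python, not scikit-learn's code; it assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is how `TfidfVectorizer` behaves with its default settings. The three sentences and the `tfidf_weights` helper are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration)
docs = ["she is beautiful", "she is here", "he is here"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_weights(tokens):
    """Return L2-normalised TF-IDF weights for one tokenised document."""
    counts = Counter(tokens)
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


weights = tfidf_weights(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In this corpus 'beautiful' appears in one document, 'she' in two and 'is' in all three, so the weights for the first sentence come out in that order, largest first.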

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

In [4]:
# displaying top 5 rows of our dataset
train_df.head()

Unnamed: 0,Reviews,Sentiment
0,"When I first tuned in on this morning news, I ...",neg
1,"Mere thoughts of ""Going Overboard"" (aka ""Babes...",neg
2,Why does this movie fall WELL below standards?...,neg
3,Wow and I thought that any Steven Segal movie ...,neg
4,"The story is seen before, but that does'n matt...",neg


In natural language processing (NLP), text preprocessing is the practice of cleaning and preparing text data. **NLTK** and **re** are common Python libraries used to handle many text preprocessing tasks.

The `get_clean` function defined below takes the text of one review and applies the following steps:

1. Convert the text to lowercase, remove backslashes and replace underscores with spaces.

2. Remove email addresses.

3. Remove URLs.

4. Remove special characters and extra whitespace.

5. Collapse runs of a repeated character into a single character.

In [5]:
import re

def get_clean(x):
    x = str(x).lower().replace('\\', '').replace('_', ' ')
    
    # Remove Emails
    x = re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)',"",  x)

    # Remove Urls
    x = re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , x)

    # Remove Special Characters
    x = re.sub(r'[^\w ]+', "", x)
    x = ' '.join(x.split())

    # Collapse any character repeated three or more times into a single one
    x = re.sub(r"(.)\1{2,}", r"\1", x)
    return x

train_df['Reviews'] = train_df['Reviews'].apply(get_clean)
train_df.head()

Unnamed: 0,Reviews,Sentiment
0,when i first tuned in on this morning news i t...,neg
1,mere thoughts of going overboard aka babes aho...,neg
2,why does this movie fall well below standards ...,neg
3,wow and i thought that any steven segal movie ...,neg
4,the story is seen before but that doesn matter...,neg
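To see what the individual cleaning steps do, it can help to run them on a single string. The following is a minimal sketch on a made-up sample; the URL pattern is deliberately simplified compared with the one in `get_clean`, and the last line demonstrates the repeated-character step on its own.

```python
import re

# Made-up sample text containing an email address, a URL and punctuation
sample = "Loved_it!!! Mail me at fan@example.com or see https://example.com/review :)"

text = sample.lower().replace("_", " ")
# Remove email addresses
text = re.sub(r"[a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+", "", text)
# Remove URLs (simplified pattern for this sketch)
text = re.sub(r"(?:http|https|ftp|ssh)://\S+", "", text)
# Keep only word characters and spaces, then squeeze repeated whitespace
text = re.sub(r"[^\w ]+", "", text)
text = " ".join(text.split())
print(text)

# Collapse a character repeated three or more times into a single one
collapsed = re.sub(r"(.)\1{2,}", r"\1", "sooooo cool")
print(collapsed)
```

Note that the last step leaves ordinary double letters such as the "oo" in "cool" untouched, because the pattern only matches three or more repeats.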


In [6]:
# Create a TF-IDF vectorizer limited to the 5,000 most frequent terms

tfidf = TfidfVectorizer(max_features=5000)
X = train_df['Reviews']
y = train_df['Sentiment']

In [7]:
# Fit the data into vectorizer and then transform it
X = tfidf.fit_transform(X)
X

<25000x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 2894437 stored elements in Compressed Sparse Row format>

Here we split the dataset, keeping **80%** of the samples for `training` and **20%** for `testing`.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
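The idea behind a seeded, shuffled split can be sketched in plain Python. `shuffled_split` is a hypothetical helper written for this illustration and is not scikit-learn's implementation; `train_test_split` additionally keeps `X` and `y` aligned and accepts sparse matrices such as the TF-IDF output.

```python
import random


def shuffled_split(items, test_size=0.2, seed=0):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


samples = list(range(10))
train, test = shuffled_split(samples, test_size=0.2, seed=0)
print(len(train), len(test))
```

Fixing the seed plays the same role as `random_state = 0` above: rerunning the cell produces the same split, which keeps results comparable between runs.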

## Support Vector Machine

`SVM` is a supervised machine learning algorithm that can be used for classification or regression problems. It finds an optimal boundary between the possible outputs. With a non-linear kernel, it uses a technique called the kernel trick to implicitly transform the data before finding that boundary.

![SVM](../../../images/SVM_1.png)

![SVM](../../../images/SVM_2.png)

The objective of a `Linear SVC` (Support Vector Classifier) is to fit the data you provide, returning a “best fit” hyperplane that divides, or categorizes, your data. Once you have the hyperplane, you can feed new features to your classifier to see what the “predicted” class is.
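How a hyperplane turns features into a class can be sketched in a few lines. The weights and bias below are made up for illustration rather than learned; for a fitted `LinearSVC` the corresponding values live in its `coef_` and `intercept_` attributes. `decision_score` and `predict` are hypothetical helper names.

```python
# Hypothetical weights for three features and a bias term (made up, not learned)
weights = [1.5, -2.0, 0.5]
bias = -0.25


def decision_score(features):
    """Signed score w . x + b; its sign says which side of the hyperplane x is on."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict(features):
    """Map the sign of the score to a class label."""
    return "pos" if decision_score(features) > 0 else "neg"


print(predict([1.0, 0.0, 0.0]))  # score is 1.25, so the positive side
print(predict([0.0, 1.0, 0.0]))  # score is -2.25, so the negative side
```

Training the classifier amounts to choosing the weights and bias so that this rule separates the two classes as well as possible.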

In [9]:
clf = LinearSVC()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

The `classification report` shows the main classification metrics on a per-class basis. This gives a deeper intuition of the classifier's behavior than global accuracy, which can mask functional weaknesses in one class of a multiclass problem.

In [10]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         neg       0.87      0.87      0.87      2480
         pos       0.87      0.88      0.88      2520

    accuracy                           0.87      5000
   macro avg       0.87      0.87      0.87      5000
weighted avg       0.87      0.87      0.87      5000
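The per-class columns of the report can be reproduced by hand from true and predicted labels. The sketch below uses eight made-up labels, not the IMDB results, and `class_metrics` is a hypothetical helper name.

```python
# Toy labels (made up for illustration)
y_true = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
y_hat = ["pos", "pos", "neg", "neg", "neg", "pos", "neg", "pos"]


def class_metrics(y_true, y_hat, label):
    """Return (precision, recall, f1, support) for one class."""
    pairs = list(zip(y_true, y_hat))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # label that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn


for label in ("neg", "pos"):
    precision, recall, f1, support = class_metrics(y_true, y_hat, label)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
```

Precision answers "of the reviews predicted as this class, how many were right?", recall answers "of the reviews that truly belong to this class, how many were found?", and support is the number of true samples of the class.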



In [11]:
x = 'this movie is really good. thanks a lot for making it'

x = get_clean(x)
vec = tfidf.transform([x])

In [12]:
# Print the shape of input vector
vec.shape

(1, 5000)

In [13]:
# Prediction on input vector
clf.predict(vec)

array(['pos'], dtype=object)

Python's `pickle` module is used for serializing and de-serializing Python object structures. The process of converting a Python object (list, dict, etc.) into a byte stream is called `pickling`, `serialization`, `flattening` or `marshalling`. The byte stream generated through pickling can be converted back into Python objects by a process called `unpickling`.

In [14]:
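import pickle
# The model/ directory must already exist; `with` closes each file after writing
with open('model/clf_model', 'wb') as f:
    pickle.dump(clf, f)
with open('model/tfidf', 'wb') as f:
    pickle.dump(tfidf, f)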
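Loading works the same way in reverse. The round trip below is a minimal sketch that uses a plain dictionary as a stand-in for `clf` or `tfidf` and writes to a temporary directory, so it runs without the trained objects; the file name is made up.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be `clf` or `tfidf`
model = {"name": "demo-model", "classes": ["neg", "pos"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model")  # hypothetical file name
    # Pickling: write the object to disk as a byte stream
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Unpickling: rebuild an equal object from the byte stream
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code. For scikit-learn objects it is also safest to load them with the same library version that saved them.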

# Conclusion

- First, we loaded the IMDB movie reviews dataset into a pandas DataFrame.


- Then we defined the `get_clean()` function to remove unwanted emails, URLs and special characters.


- We converted the text into vectors with the TF-IDF vectorizer.


- After that, we used a linear support vector machine classifier.


- We fit the `LinearSVC` classifier for binary classification and predicted the sentiment, i.e. positive or negative, of a new review.


- Lastly, we dumped the classifier and the TF-IDF vectorizer with the pickle library. In other words, we converted Python objects into byte streams so that they can be stored in a file or database, keep program state across sessions, or be transported over the network.

# Contributors

**Author**
<br>Chee Lam

# References

1. [Sklearn Tutorial](https://www.tutorialspoint.com/scikit_learn/index.htm)
2. [Sentiment Analysis Using Sklearn](https://kgptalkie.com/sentiment-analysis-using-scikit-learn/)