# Text Feature Extraction

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>
    
1. <a href="#item31">NLP</a> 
2. <a href="#item31">Download and Import NLTK</a>  
3. <a href="#item32">Perform Data Processing</a> 
4. <a href="#item33">Perform Feature Extraction</a>
5. <a href="#item33">Import pandas</a>
6. <a href="#item34">Read Dataset</a>  
7. <a href="#item34">Train and Test the model</a> 
8. <a href="#item34">Model Evaluation</a>
9. <a href="#item34">Questions</a>

</font>
</div>

## NLP

**Natural language processing (NLP)** is the field concerned with programming computers to process, analyze, and, to a degree, understand large amounts of natural language data.

Computers handle numerical information well, but text is harder: a program cannot read and understand text in any human sense.

Machine learning algorithms cannot work on raw text directly. The text first needs specialized **pre-processing techniques** to clean and structure it, followed by **feature extraction techniques** that convert it into a matrix (or vector) of numeric features.

In this section we discuss some techniques for creating structure out of highly unstructured text data using the **NLTK** library.

NLTK stands for **Natural Language Toolkit**, a commonly used NLP library that ships with many corpora, models, and algorithms.

A **corpus** (plural: corpora) is a collection of text documents.



#### The flow of an NLP application looks like this


 Text ----> Pre-processing ----> Feature extraction ----> ML model

## Download NLTK

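The command below opens the NLTK downloader; it is commented out so it only runs when you need it. You may download everything from the Collections tab.

In [65]:
import nltk
# nltk.download()             # uncomment to open the NLTK downloader
# nltk.download('stopwords')  # or fetch only the stop-word corpus used below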

## Pre-processing data: tokenization, stemming, and removal of stop words

Here we will look at three common pre-processing steps in natural language processing:

1) **Tokenization:** the process of segmenting text into words, clauses or sentences (here we will separate out words).

2) **Removal of stop words:** removal of commonly used words unlikely to be useful for learning e.g. 'a', 'and', 'but', 'what' etc.

3) **Stemming:** reducing related words to a common stem.
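
Before turning to NLTK, here is a minimal standard-library sketch of the three steps on a toy sentence. The regular-expression tokenizer, the tiny `STOP_WORDS` set, and the `crude_stem` helper are all hypothetical stand-ins chosen for illustration; they are much simpler than NLTK's tokenizer, stop-word list, and Porter stemmer.

```python
import re

# Hypothetical, deliberately tiny stop-word list (NLTK's English list is much longer).
STOP_WORDS = {"a", "and", "but", "he", "in", "the", "was", "what"}

def tokenize(text):
    """Split text into lowercase word tokens (a crude stand-in for a real tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip a few common suffixes; far simpler than the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "He was delighted and played in the plays"
tokens = tokenize(text)                    # 1) tokenization
kept = remove_stop_words(tokens)           # 2) removal of stop words
stems = [crude_stem(t) for t in kept]      # 3) stemming

print(tokens)
print(kept)
print(stems)  # ['delight', 'play', 'play']
```

Note how "played" and "plays" collapse to the same stem, which is the point of stemming: related word forms end up as a single feature.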


  ### Tokenization

In [1]:
import nltk
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
text = "John was delighted he got a part in the school play, even though the part was a small one."
print("\nOriginal string:")
print(text)
print("\nList of words:")
words = tokenizer.tokenize(text)
print(words)

### Removal of stop words

In [2]:
from nltk.corpus import stopwords
stops = set(stopwords.words("english"))
words = [w for w in words if w not in stops]
print(words)

### Stemming

In [3]:
from nltk.stem import PorterStemmer
stemming = PorterStemmer()
meaningful_words = []
for w in words:
    stem = stemming.stem(w)  # compute the stem once and reuse it
    meaningful_words.append(stem)
    print(w, "---->", stem)

In [4]:
print(meaningful_words)

## Feature Extraction

The feature extraction step produces feature representations that are appropriate for the NLP task you are trying to accomplish and the type of model you plan to use.

In this section we will transform tokens into features.

**Some of the most popular methods of feature extraction are:**

1- Bag-of-Words

2- TF-IDF

### Bag-of-Words

The bag-of-words model is a simplifying representation used in NLP. In this model, a text is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity (the number of times each word appears in the document).

Bag-of-words is one of the most fundamental methods for transforming tokens into a set of features. The model is commonly used in document classification, where each word serves as a feature for training the classifier.

Creating a BoW model takes three steps:

1- The first step is **text pre-processing**.

2- The second step is to **create a vocabulary** of all unique words from the corpus.

3- In the third step, we **create a matrix of features** with one column per vocabulary word and one row per document. This process is known as **text vectorization**. In the simplest (binary) form, each entry is 1 if the word is present in the document and 0 if it is not; more commonly, each entry stores the number of times the word occurs in the document.
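
The three steps above can be sketched by hand with only the standard library. This is an illustrative sketch, not how any particular library implements it; the pre-processing here is just lowercasing and splitting on whitespace, and the two-document corpus is a toy example.

```python
from collections import Counter

corpus = [
    "call your sister also call your parents",
    "want to go out",
]

# Step 1: pre-processing (here just lowercasing and splitting on whitespace).
tokenized = [doc.lower().split() for doc in corpus]

# Step 2: vocabulary of unique words, sorted so the column order is stable.
vocabulary = sorted({word for doc in tokenized for word in doc})

# Step 3: one row per document, one column per vocabulary word (word counts).
count_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    count_matrix.append([counts[word] for word in vocabulary])

# Binary variant: 1 if the word is present in the document, 0 otherwise.
binary_matrix = [[1 if c > 0 else 0 for c in row] for row in count_matrix]

print(vocabulary)
for row in count_matrix:
    print(row)
```

In the first row, "call" and "your" each get a count of 2 in the count matrix but only a 1 in the binary matrix, which shows the difference between the two variants.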

scikit-learn provides a ready-to-use class for this model, **CountVectorizer**, which performs all three steps above for us. By default it stores word counts rather than 0/1 flags.

In [70]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["Hey, let's go to the game today",
          "call your sister also call your parents",
          "want to go out?"]
# initialize count vectorizer object
vect = CountVectorizer()
X = vect.fit_transform(corpus)

In the output below, each row is a document and each entry is the number of times a vocabulary word appears in that document.

In [5]:
X.toarray()

Here you can see the unique words present in the corpus:

In [6]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
column_names = vect.get_feature_names_out()
print(column_names)

### TF-IDF

**Term frequency-inverse document frequency (TF-IDF)** uses all the tokens in the dataset as its vocabulary. The **term frequency (TF)** is how often a token occurs in a given document; the **inverse document frequency (IDF)** is derived from the number of documents in which the token occurs. A token that occurs frequently in a document gets a high TF, but if it also occurs in the majority of documents its IDF is low. As a result, words that are common everywhere are penalized, while distinctive words that carry the essence of a document get a boost. For each document, the TF and IDF values are multiplied and the resulting vector is normalized to give the document's TF-IDF representation.
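
To make the multiply-and-normalize step concrete, here is a small standard-library sketch. It assumes one particular variant, the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is the default scikit-learn describes for its TF-IDF classes; other TF-IDF definitions exist and give different numbers. The toy documents are already tokenized.

```python
import math
from collections import Counter

docs = [
    ["good", "hotel", "good", "staff"],
    ["bad", "hotel"],
    ["good", "location"],
]
n_docs = len(docs)
vocabulary = sorted({t for d in docs for t in d})

# Document frequency: in how many documents does each token occur?
df = Counter(t for d in docs for t in set(d))

# Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

def tfidf_vector(doc):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    tf = Counter(doc)
    raw = [tf[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

vectors = [tfidf_vector(d) for d in docs]

for token in vocabulary:
    print(f"{token:10s} df={df[token]} idf={idf[token]:.3f}")
```

In the second document, "bad" and "hotel" both occur once, but "bad" ends up with the larger weight because it appears in fewer documents.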

scikit-learn also provides a ready-to-use class for this, **TfidfTransformer**.

In the next section, however, we will use **TfidfVectorizer**, which performs the work of both **CountVectorizer** and **TfidfTransformer**.


**TfidfVectorizer = CountVectorizer + TfidfTransformer**
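
The identity above can be checked directly. The sketch below assumes scikit-learn and NumPy are installed and that all three classes are left at their default settings; with non-default parameters the two routes would need matching arguments.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = [
    "Hey, let's go to the game today",
    "call your sister also call your parents",
    "want to go out?",
]

# Two-step route: raw counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both at once.
one_step = TfidfVectorizer().fit_transform(docs)

# Compare the dense versions of the two sparse matrices.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```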

## Model

### Sentiment Analysis

Sentiment analysis is a common text classification task: it analyses an incoming message and tells whether the underlying sentiment is positive, negative, or neutral.

In other words, sentiment analysis assigns a text to a class. Depending on the dataset and the goal, sentiment classification can be a binary (positive or negative) or multi-class (3 or more classes) problem.

**In this section, I’ll show you how to perform sentiment classification on a hotel reviews dataset using a machine learning approach.**

This approach employs a machine learning technique and diverse features to construct a classifier that can identify text that expresses sentiment.

We have a dataset of hotel reviews. It contains 210 guest reviews, each labelled with one of 2 classes: positive (positive reviews) or negative (negative reviews).

In the Label column, 1 stands for a positive review.

In the Label column, 0 stands for a negative review.

#### Import Packages

Let's start by importing pandas

In [73]:
import pandas as pd

Use the pandas function <b>read_csv()</b> to load the data.

In [74]:
url ='https://raw.githubusercontent.com/skilluplabs/TextFeatureExtraction/master/hotel_review.csv'
df = pd.read_csv(url)

Use the method <b>head()</b> to display the first five rows of the dataframe.

In [7]:
df.head()

In [8]:
df.shape

Now we put the Reviews column into the variable X and the Label column into the variable y.

In [77]:
X = df['Reviews']
y = df['Label']

#### Split the data into train and test sets using train_test_split.

In [78]:
from sklearn.model_selection import train_test_split 

In [79]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=40)
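
To see what a split like this does conceptually, here is a standard-library sketch: shuffle the row indices with a fixed seed, then slice off the test portion. The helper `simple_train_test_split` and the placeholder reviews are hypothetical; it will not reproduce the exact rows scikit-learn selects, and rounding the test size up is an assumption of this sketch.

```python
import math
import random

def simple_train_test_split(items, labels, test_size=0.3, seed=40):
    """Shuffle indices with a fixed seed, then slice off the test portion."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)       # fixed seed makes the split reproducible
    n_test = math.ceil(len(items) * test_size)  # test-set size, rounded up
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Placeholder data with the same size as the hotel review dataset (210 rows).
reviews = [f"review {i}" for i in range(210)]
labels = [i % 2 for i in range(210)]

X_tr, X_te, y_tr, y_te = simple_train_test_split(reviews, labels)
print(len(X_tr), len(X_te))
```

With 210 rows and `test_size=0.3`, this gives 147 training rows and 63 test rows, and every row lands in exactly one of the two sets.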

In [80]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

#### Importing the Linear SVC classifier from the sklearn library.

In [81]:
from sklearn.svm import LinearSVC

In [82]:
svc_classifier = LinearSVC()

In [83]:
# Chain TF-IDF feature extraction and the classifier created above
test_clf = Pipeline([("tfidf", TfidfVectorizer(stop_words='english')), ("svc_classifier", svc_classifier)])

#### Training the Linear SVC classifier. 

In [9]:
test_clf.fit(X_train, y_train)

In [85]:
prediction = test_clf.predict(X_test)

#### Measuring Performance of the Model 

To see how the model performs on the new data (test data), we will use accuracy as our metric.

In [86]:
from sklearn.metrics import accuracy_score

In [87]:
print('Accuracy Score on test data: ', accuracy_score(y_test,prediction))

Accuracy Score on test data:  0.873015873015873
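
Accuracy is simply the fraction of test reviews whose predicted label matches the true label. A minimal sketch with made-up labels (the `accuracy` helper and both label lists are hypothetical, for illustration only):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
score = accuracy(y_true, y_pred)
print(score)  # 6 of 8 predictions match -> 0.75
```

Read the same way, the score reported above (0.873015873015873) is what 55 correct predictions out of 63 would give, which is consistent with a test set of 30% of the 210 reviews.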


## Bonus Section

#### Here, you will apply the pre-processing techniques above to the same dataset, 'hotel_review.csv', using a dataframe.

In [88]:
import pandas as pd
url ='https://raw.githubusercontent.com/skilluplabs/TextFeatureExtraction/master/hotel_review.csv'
reviews = pd.read_csv(url)

#### Passing the dataset into a dataframe

In [10]:
# read_csv() already returns a DataFrame, so this wrapping step is optional
df = pd.DataFrame(reviews)
df.head()

#### Let’s look at what columns exist in our dataset

In [90]:
print(list(reviews.columns))

['Reviews', 'Label']


#### We will convert all text to lower case so that differently cased forms of the same word (e.g. 'Hotel' and 'hotel') become one vocabulary entry, because NLTK treats differently cased words as different tokens.

In [91]:
reviews['Reviews'] = reviews['Reviews'].str.lower()
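
A small standard-library sketch of why lowercasing helps. The token list and the three-word `stops` set are hypothetical; the set is all lowercase, as NLTK's English stop-word list is.

```python
# Hypothetical stop-word set, all lowercase like NLTK's English list.
stops = {"the", "was", "and"}

raw_tokens = ["The", "room", "was", "clean", "and", "the", "Room", "WAS", "quiet"]

# Vocabulary size before and after lowercasing.
vocab_raw = set(raw_tokens)
vocab_lower = {t.lower() for t in raw_tokens}

# Stop-word removal before and after lowercasing.
kept_raw = [t for t in raw_tokens if t not in stops]
kept_lower = [t.lower() for t in raw_tokens if t.lower() not in stops]

print(len(vocab_raw), len(vocab_lower))
print(kept_raw)
print(kept_lower)
```

Without lowercasing, "The" and "WAS" slip past the stop-word filter and "room" and "Room" count as two different words; after lowercasing, the vocabulary shrinks from 9 entries to 6 and only the content words remain.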

#### Let's check out the first review and its sentiment.
#### The review is text, and the sentiment label is either 0 (negative) or 1 (positive), based on how the reviewer rated the hotel.

In [11]:
example_review = reviews.iloc[0]
print(example_review['Reviews'])

# Now we do the same for the label:
print(example_review['Label'])

#### Question 1: Create a function 'identify_tokens()' to perform tokenization using TreebankWordTokenizer on the complete dataframe (Reviews column).

In [94]:
#Type your answer here

Double-click <b>here</b> for the solution.

<!-- Your answer is below:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
def identify_tokens(row):
    review_text = row['Reviews']
    # tokenize() already returns a list of tokens
    return tokenizer.tokenize(review_text)

reviews['Reviews'] = reviews.apply(identify_tokens, axis=1)
-->


#### Question 2: Create a function 'remove_stops()' to remove stopwords from the dataframe 

In [95]:
#Type your answer here

Double-click <b>here</b> for the solution.

<!-- Your answer is below:
from nltk.corpus import stopwords
stops = set(stopwords.words("english"))                  

def remove_stops(row):
    my_list = row['Reviews']
    meaningful_words = [w for w in my_list if w not in stops]
    return meaningful_words

reviews['Reviews'] = reviews.apply(remove_stops, axis=1)
-->

#### Question 3: Create a function 'stem_list()' to perform stemming on the complete dataframe

In [96]:
#Type your answer here

Double-click <b>here</b> for the solution.

<!-- Your answer is below:
from nltk.stem import PorterStemmer
stemming = PorterStemmer()

def stem_list(row):
    my_list = row['Reviews']
    stemmed_list = [stemming.stem(word) for word in my_list]
    return (stemmed_list)

reviews['Reviews'] = reviews.apply(stem_list, axis=1)
-->


<h1>Thank you for completing this notebook</h1>