# Text Analytics with Python - Part 1

Text analytics, also known as Natural Language Processing (NLP), refers to the analysis of text-based data.

Text-based communication has become one of the most common forms of expression. We email, text message, tweet, and update our statuses on a daily basis. As a result, unstructured text data has become extremely common, and analyzing large quantities of text data is now a key way to understand what people are thinking.

One of the biggest breakthroughs required for achieving any level of artificial intelligence is to have machines which can process text data. Thankfully, the amount of text data being generated has grown enormously in the last few years.

It has become imperative for an organization to have a structure in place to mine actionable insights from the text being generated. From social media analytics to risk management and cybercrime protection, dealing with text data has never been more important.

In this article we will discuss different feature extraction methods, starting with some basic techniques which will lead into advanced Natural Language Processing techniques. We will also learn about pre-processing of the text data in order to extract better features from clean data.

By the end of this article, you will be able to perform text operations by yourself. Let’s get started!

Original tutorial from [link](https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/), revised and adapted by Dr. Tao.
ver 1.0, May 2018

# Table of Contents

1. Basic feature extraction using text data
    - Load data
    - Number of words
    - Unique Words and Their Counts
    - Other Counts of Words
2. Basic pre-processing of text data
    - Lower case
    - Removing Punctuation
    - Removal of Stop Words
    - Common words removal
    - Rare words removal
    - Spelling correction
    - Tokenization
    - Stemming
    - Lemmatization
    - Part-of-Speech Tagging

# 1. Basic Text Feature Extraction

We can use text data to extract a number of features even if we don’t have sufficient knowledge of Natural Language Processing. So let’s discuss some of them in this section.

Before starting, let’s quickly read the training file from the dataset in order to perform different tasks on it. In the entire article, we will use the *twitter sentiment dataset* provided with NLTK.

In this class we work with Twitter data. NLTK ships with a sample of tweets, so we can load them directly from there.

## 1.0 Load Data

In [None]:
from nltk.corpus import twitter_samples

The data is stored as JSON files, which map naturally to Python dicts. Let's look at which files are included.

In [None]:
twitter_samples.fileids()

It is clear the data is divided into three parts:
    - positive;
    - negative (these two are used for training in sentiment analysis);
    - tweets.20150430... 
    
In this particular exercise, we want to focus on the **positive** tweets.

In [None]:
pos_tweets = twitter_samples.strings('positive_tweets.json')

Let's look at what is in our *pos_tweets* data.

In [None]:
type(pos_tweets)

### YOUR TURN HERE
Provide your code in the block below to list the first **10** tweets in the *pos_tweets* dataset.

In [None]:
#### YOUR CODE HERE


## 1.1 Number of Words

The first thing we want to know about a text is its word count. Let us start by counting the words in each tweet.

In [None]:
# This will return the count for the first tweet
print('The first tweet has %s words.' % str(len(pos_tweets[0].split(' '))))
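
One detail worth checking is how the string is split. The sketch below uses a made-up tweet (not one from the dataset) to compare `split(' ')`, which cuts at every single space, with `split()`, which cuts at runs of any whitespace. Depending on how the tweets are spaced, the two can give different word counts.

```python
# Compare two ways of splitting a tweet into words.
tweet = "Hello  world :)"  # made-up example; note the double space

by_space = tweet.split(' ')     # cuts at every single space character
by_whitespace = tweet.split()   # cuts at runs of any whitespace

print(by_space)        # ['Hello', '', 'world', ':)']
print(by_whitespace)   # ['Hello', 'world', ':)']
print(len(by_space), len(by_whitespace))
```

The empty string in the first result is counted as a "word", which inflates the count.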

### YOUR TURN HERE

Provide the word counts for the first **10** tweets.

In [None]:
#### YOUR CODE HERE
# use enumerate() here


Clearly, this is not a very data-science-like approach, so we are going to use a powerful Python library: **Pandas**.

In [None]:
import pandas as pd

In [None]:
pos_tweets = pd.DataFrame(pos_tweets, columns=['text'])
pos_tweets.head()

Now we can count words in every tweet.

In [None]:
pos_tweets['word_count'] = pos_tweets['text'].apply(lambda x: len(str(x).split(" ")))
pos_tweets.head()

We can also look at the number of characters in each tweet.

In [None]:
pos_tweets['char_count'] = pos_tweets.text.str.len()
pos_tweets.head()

We will also extract another feature which will calculate the average word length of each tweet. This can also potentially help us in improving our model.

Here, we simply take the sum of the lengths of all the words and divide it by the number of words in the tweet:

In [None]:
# We use this function to calculate the average word length in each tweet
def avg_word(sentence):
    words = sentence.split()
    return (sum(len(word) for word in words)/len(words))

# avg_word already takes one string, so it can be passed to apply() directly
pos_tweets['avg_word_length'] = pos_tweets.text.apply(avg_word)
pos_tweets.head()
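
The calculation divides by the number of words, so a text with no words at all would raise a `ZeroDivisionError`. Here is a minimal sketch with a guard for that case. The name `avg_word_length` and the choice to return `0.0` for empty text are assumptions made for illustration, not part of the original tutorial.

```python
def avg_word_length(sentence):
    """Mean number of characters per whitespace-separated word."""
    words = sentence.split()
    if not words:   # empty or whitespace-only text (assumed to score 0.0)
        return 0.0
    return sum(len(w) for w in words) / len(words)

# Worked example: (4 + 7 + 3) / 3
print(avg_word_length("good morning all"))
# Whitespace-only input no longer raises an error
print(avg_word_length("   "))
```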

Let's save this dataframe just in case.

In [None]:
pos_tweets.to_csv('pos_tweets.csv')

## 1.2 Unique Words and Their Counts

You may have observed that the split() function is not good at identifying words, so we are going to use a method built into NLTK to identify words, namely tokenization.

Refer to this [link](https://www.nltk.org/book/ch03.html) for more information.

As you will notice from the results below, the TweetTokenizer can capture the emojis in tweets.

In [None]:
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer() # This function is particularly used for tweets
from collections import Counter
vocab = Counter()

Now we can count the unique words across the overall set of texts, namely the **corpus**.

In [None]:
for text in pos_tweets.text:
    n = tknzr.tokenize(text)
    vocab.update(n)

vocab

How many unique tokens (words) are in the corpus?

In [None]:
print('The corpus has %s unique words.' % str(len(vocab)))

What are the top 10 unique words in the corpus, by their counts?

In [None]:
print(vocab.most_common(10))

In [None]:
dict(vocab.most_common(10))
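
To see what `Counter.update()` and `most_common()` are doing, here is a small sketch on a made-up, already tokenized corpus (the `docs` list is invented for illustration). It also shows the difference between the number of distinct tokens and the total number of tokens.

```python
from collections import Counter

# Toy corpus, already tokenized (made-up data for illustration).
docs = [["good", "morning", ":)"], ["good", "night"], ["morning", ":)", ":)"]]

vocab = Counter()
for tokens in docs:
    vocab.update(tokens)   # adds 1 for each occurrence of each token

print(len(vocab))            # number of distinct tokens
print(sum(vocab.values()))   # total number of tokens
print(vocab.most_common(1))  # the most frequent token with its count
```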

Does the above list make any sense to you? Maybe not.

Nonetheless, we should visualize the counter of unique words just because we can.

We will visualize only the 20 most common words to keep the running time short.

In [None]:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

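count_dict = dict(vocab.most_common(20))
positions = range(len(count_dict))
# Convert the dict views to lists so that matplotlib accepts them across versions
plt.bar(positions, list(count_dict.values()), color='blue', align='center', tick_label=list(count_dict.keys()))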
plt.show()

## 1.3 Other Counts of Words

Sometimes we also care about other counts of words in a corpus. Let's go through them.

For instance, we noticed that the most common words from the counter (vocab) carry little meaning. We call them **stopwords** (e.g. 'I', 'the', ...). Let's examine them by counts for now.

In [None]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

# NLTK's stop word list is lowercase, so lowercase each word before the lookup
pos_tweets['stop_counts'] = pos_tweets.text.apply(lambda x: len([w for w in x.split() if w.lower() in stop]))
pos_tweets.head()
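
Stop word lists such as NLTK's English list are lowercase, so whether a word is lowercased before the lookup changes the count. The sketch below illustrates this with a tiny, hypothetical stop word set (NLTK's real list is much longer). The function name `count_stop_words` is also made up for this example.

```python
# A tiny, hypothetical stop word set; NLTK's English list is much longer.
stop_words = {"i", "the", "a", "is", "to"}

def count_stop_words(text, stop_words):
    """Count whitespace-separated words that are stop words, ignoring case."""
    return sum(1 for word in text.split() if word.lower() in stop_words)

sentence = "I love the weather"
case_insensitive = count_stop_words(sentence, stop_words)
case_sensitive = sum(1 for w in sentence.split() if w in stop_words)

print(case_insensitive)  # 2 ('I' and 'the')
print(case_sensitive)    # 1 (only 'the', because 'I' is uppercase)
```

Using a `set` rather than a list also makes each membership test faster, which matters on larger corpora.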

We also noticed that a large number of special characters, such as *emojis*, appear in the tweets. We can count them as well.

In [None]:
t = pos_tweets.text[0]
print(tknzr.tokenize(t))

In [None]:
import re
def count_word_tokens(sentence):
    # Count the tokens that start with a word character (letter, digit or underscore)
    word_tokens = [tok for tok in tknzr.tokenize(sentence) if re.search(r'^\w', tok)]
    return len(word_tokens)

In [None]:
count_word_tokens(t)

In [None]:
# Special tokens = all tokens minus the word-like tokens
pos_tweets['special_char'] = pos_tweets.text.apply(lambda x: len(tknzr.tokenize(x)) - count_word_tokens(x))
pos_tweets.head()
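
It helps to see exactly which tokens the pattern `^\w` treats as word-like. The sketch below applies it to a hand-written token list that stands in for tokenizer output (an assumption, not real output from the dataset).

```python
import re

# Hand-written tokens standing in for tokenizer output (assumed, not real output).
tokens = ["#FollowFriday", "@user", "thanks", "for", "2015", ":)", "!"]

# \w matches letters, digits and the underscore; ^ anchors it to the first character.
word_like = [tok for tok in tokens if re.search(r'^\w', tok)]
special = [tok for tok in tokens if not re.search(r'^\w', tok)]

print(word_like)  # ['thanks', 'for', '2015']
print(special)    # ['#FollowFriday', '@user', ':)', '!']
```

Note that hashtags and mentions are counted as special tokens under this rule because they start with `#` or `@`, while numbers are counted as word-like.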

We may also care about the number of numerics in each tweet.

Just like we calculated the number of words, we can also calculate the number of numerics which are present in the tweets. It does not have a lot of use in our example, but this is still a useful feature that should be run while doing similar exercises. The code below counts the words that consist entirely of digits.

In [None]:
pos_tweets['num_counts'] = pos_tweets.text.apply(lambda x: len([w for w in x.split() if w.isdigit()]))
pos_tweets[['text','num_counts']].head()
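
Keep in mind that `str.isdigit()` only accepts strings made up entirely of digits, so decimals, negative numbers and values like "10k" are not counted. The sketch below shows this on a few made-up tokens and offers a looser, hypothetical helper `is_number` based on `float()`. Whether that looser definition is appropriate depends on the task.

```python
samples = ["2015", "3.5", "-7", "10k", "1,000"]

# Strict check: every character must be a digit.
strict = {token: token.isdigit() for token in samples}

def is_number(token):
    """Looser check (an assumption): accept anything float() can parse.
    Caveat: float() also accepts strings such as 'nan' and 'inf'."""
    try:
        float(token)
        return True
    except ValueError:
        return False

loose = {token: is_number(token) for token in samples}

for token in samples:
    print(token, strict[token], loose[token])
```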

### YOUR TURN HERE

Say we are interested in searching for all long words in the tweets. A long word is defined as:
- has more than 5 letters;
- has to be a word (not special character, numerics, ...)

Please write your own code for the **count of long words** and add it as a column in *pos_tweets*.

**HINT**: use code/results from above as much as you can.

In [None]:
## YOUR CODE HERE
# define your function here


In [None]:
# use apply(lambda) here


# 2. Basic Preprocessing

So far, we have learned how to extract basic features from text data. Before diving deeper into feature extraction, our first step should be cleaning the data in order to obtain better features. We will achieve this by doing some basic pre-processing steps on our training data.

So, let’s get into it.

## 2.1 Lower case

The first pre-processing step which we will do is transform our tweets into lower case. This avoids having multiple copies of the same words. For example, while calculating the word count, ‘Analytics’ and ‘analytics’ will be taken as different words.

In [None]:
pos_tweets['lower'] = pos_tweets.text.apply(lambda x: " ".join(w.lower() for w in x.split()))
# Use bracket access so the column is never confused with a DataFrame attribute
pos_tweets['lower'].head()
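
The effect of lowercasing on the vocabulary is easy to check on a small made-up token list: variants that differ only in case collapse into a single entry.

```python
# Made-up tokens that differ only in letter case.
tokens = ["Analytics", "analytics", "ANALYTICS", "Python"]

unique_before = set(tokens)
unique_after = set(t.lower() for t in tokens)

print(len(unique_before))  # 4
print(len(unique_after))   # 2
```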

## 2.2 Removing Punctuation

The next step is to remove punctuation, as it usually adds little information when treating text data. Removing all instances of it will also help us reduce the size of the training data.

In [None]:
from nltk.tokenize import RegexpTokenizer
reg_tok = RegexpTokenizer(r'\w+')
pos_tweets['no_punc'] = pos_tweets['lower'].apply(lambda x: ' '.join(reg_tok.tokenize(x)))
pos_tweets.no_punc.head()

As you can see in the above output, all the punctuation, including ‘#’ and ‘@’, has been removed from the training data.
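
The pattern `\w+` keeps only runs of letters, digits and underscores. The sketch below uses plain `re.findall` with the same pattern on a made-up string. We assume this behaves like `RegexpTokenizer(r'\w+')` in its default mode, which is worth verifying against your own NLTK version.

```python
import re

# Made-up text containing a mention, a contraction, a hashtag and an emoticon.
text = "@user don't miss #nlp :) 100%"

tokens = re.findall(r'\w+', text)
print(tokens)  # ['user', 'don', 't', 'miss', 'nlp', '100']
```

Two side effects are visible here: emoticons disappear entirely, and contractions such as "don't" are split into two tokens.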

## 2.3 Removal of Stop Words

As we discussed earlier, stop words (or commonly occurring words) should be removed from the text data. For this purpose, we can either create a list of stopwords ourselves or we can use predefined libraries.

In [None]:
pos_tweets['no_stop'] = pos_tweets['no_punc'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
pos_tweets.no_stop.head()

## 2.4 Common word removal

Previously, we just removed commonly occurring words in a general sense. We can also remove the words that occur most often in our own text data. First, let’s check the 10 most frequently occurring words in our text data, then decide whether to remove or retain them.

In [None]:
freq = pd.Series(' '.join(pos_tweets['no_stop']).split()).value_counts()[:10]
freq

Now, let’s remove these words, as their presence will not be of any use in the classification of our text data.

In [None]:
freq = list(freq.index)
pos_tweets['no_common'] = pos_tweets['no_stop'].apply(lambda x: " ".join(x for x in x.split() if x not in freq))
pos_tweets['no_common'].head()
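
The same idea can be sketched without pandas, which makes the two steps (count, then filter) easier to see. The `docs` list below is made up, and removing only the single most frequent word is an arbitrary choice for the example.

```python
from collections import Counter

# Made-up documents for illustration.
docs = ["love love this", "love this day", "great day"]

# Step 1: count every word across the whole corpus.
counts = Counter(word for doc in docs for word in doc.split())

# Step 2: collect the most frequent word(s) in a set for fast lookups.
top = {word for word, _ in counts.most_common(1)}

# Step 3: rebuild each document without those words.
cleaned = [" ".join(w for w in doc.split() if w not in top) for doc in docs]

print(top)      # {'love'}
print(cleaned)  # ['this', 'this day', 'great day']
```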

## 2.5 Rare words removal

Similarly, just as we removed the most common words, this time let’s remove rarely occurring words from the text. Because they’re so rare, the association between them and other words is dominated by noise. Alternatively, you can replace rare words with a more general form, which will then have a higher count.

In [None]:
rare = pd.Series(' '.join(pos_tweets['no_common']).split()).value_counts()[-10:]
rare

In [None]:
rare = list(rare.index)
pos_tweets['no_rare'] = pos_tweets['no_common'].apply(lambda x: " ".join(x for x in x.split() if x not in rare))
pos_tweets['no_rare'].head()
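
Taking the last ten entries of `value_counts()` selects ten words out of what may be a very long tail of words with the same low count. An alternative, shown here only as a sketch and not as the tutorial's method, is to remove every word below a frequency threshold. The data and the threshold `min_count` are made up for illustration.

```python
from collections import Counter

# Made-up documents for illustration.
docs = ["good morning", "good night", "good grief"]

counts = Counter(word for doc in docs for word in doc.split())

min_count = 2  # hypothetical threshold: keep words seen at least twice
rare = {word for word, count in counts.items() if count < min_count}

cleaned = [" ".join(w for w in doc.split() if w not in rare) for doc in docs]

print(sorted(rare))  # ['grief', 'morning', 'night']
print(cleaned)       # ['good', 'good', 'good']
```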

## 2.6 Spelling correction

We’ve all seen tweets with a plethora of spelling mistakes. Our timelines are often filled with hastily sent tweets that are barely legible at times.

In that regard, spelling correction is a useful pre-processing step because this also will help us in reducing multiple copies of words. For example, “Analytics” and “analytcs” will be treated as different words even if they are used in the same sense.

To achieve this we will use the textblob library. If you are not familiar with it, the author of the original tutorial also wrote an article on ‘NLP for beginners using textblob’.

In [None]:
from textblob import TextBlob
pos_tweets['no_common'][:5].apply(lambda x: str(TextBlob(x).correct()))

Note that some of the words are corrected accurately ('accnt' -> 'accent'), while others are not ('rqst' -> 'rest').
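
To get a feel for why a corrector can pick the wrong word, here is a rough sketch of similarity-based correction using `difflib` from the standard library. This is a different and much simpler technique than the one TextBlob uses. The vocabulary and the helper name `suggest` are hypothetical.

```python
import difflib

# Hypothetical vocabulary of known words; a real corrector uses a large dictionary.
vocabulary = ["analytics", "account", "request", "python"]

def suggest(word, vocabulary, cutoff=0.6):
    """Return the most similar known word, or the word itself if nothing is close."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

for misspelled in ["analytcs", "rqst", "zzzz"]:
    print(misspelled, "->", suggest(misspelled, vocabulary))
```

The result depends entirely on which words are in the vocabulary: if the intended word is missing, the closest available word is returned instead, which is one way wrong corrections arise.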

## 2.7 Tokenization - Again

Tokenization refers to dividing the text into a sequence of words or sentences. In our example, we use the TweetTokenizer from NLTK to convert a tweet into a list of tokens.

In [None]:
from nltk.tokenize import TweetTokenizer
twtkr = TweetTokenizer()
twtkr.tokenize(pos_tweets['text'][1])

### YOUR TURN HERE

Select the last 10 tweets in the dataset (pos_tweets), then use the following tokenizers to tokenize each tweet and output the results.
- Regexp Tokenizer (reg_tok)
- Tweet Tokenizer (twtkr)

In [None]:
# We already defined both tokenizers, now we just need to call them


## 2.8 Stemming

Stemming refers to the removal of suffixes, like “ing”, “ly”, “s”, etc. by a simple rule-based approach. For this purpose, we will use SnowballStemmer from the NLTK library.

In [None]:
from nltk.stem.snowball import SnowballStemmer
st = SnowballStemmer("english")
pos_tweets['text'][:5].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

Twitter data is not well suited to stemming because of the misspellings and the tokens that are not plain English (handles, tags, ...). However, to further show that the stemming is working, try the code below.

In [None]:
print(st.stem('running'))

Note that we are using the native English Snowball stemmer; other stemmers, such as the Porter stemmer, are also available through the same class.

In [None]:
print('Use Snowball stemmer on generously is: %s.' % str(SnowballStemmer("english").stem("generously")))
print('Use Porter stemmer on generously is: %s.' % str(SnowballStemmer("porter").stem("generously")))
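
To illustrate what "rule-based suffix removal" means, here is a deliberately naive sketch in plain Python. It is not the Porter or Snowball algorithm. The suffix list and the minimum-length rule are made up, and the mistakes it makes show why real stemmers need many more rules.

```python
SUFFIXES = ("ing", "ly", "s")  # toy rule set, checked in this order

def naive_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

for word in ["running", "quickly", "cats", "sing", "this"]:
    print(word, "->", naive_stem(word))
# running -> runn   (the doubled consonant is left behind)
# quickly -> quick
# cats -> cat
# sing -> sing      (too short to strip)
# this -> thi       (over-stemming: 's' is not a suffix here)
```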

## 2.9 Lemmatization

Lemmatization is often a more effective option than stemming because it converts the word into its root word (lemma), rather than just stripping the suffixes. It makes use of the vocabulary and does a morphological analysis to obtain the root word. **Therefore, we usually prefer using lemmatization over stemming.**

In [None]:
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
pos_tweets['text'][:5].apply(lambda x: " ".join([wordnet_lemmatizer.lemmatize(word) for word in x.split()]))

Again, tweets may not be the best for demonstrating *lemmatization*, so let's check the following example.

In [None]:
my_sent = 'WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. \
However, there are some important distinctions.'
[wordnet_lemmatizer.lemmatize(w) for w in reg_tok.tokenize(my_sent)]

You can also define the Part-of-Speech in lemmatization. See code below.

In [None]:
wordnet_lemmatizer.lemmatize('are', pos='v')

## 2.10 Part-of-Speech Tagging

Part-of-speech (POS) tagging is one of the most important text analysis tasks. It classifies words into their parts of speech and labels them according to a tagset, which is the collection of tags used for POS tagging. Parts of speech are also known as word classes or lexical categories. A detailed definition can be found on [Wikipedia](https://en.wikipedia.org/wiki/Part-of-speech_tagging).

In [None]:
s = 'Due to the nature of this tagger, it works best when trained over sentence delimited input.'
from nltk import pos_tag
pos_tag([w for w in reg_tok.tokenize(s)])

Note that pos_tag expects **a list of strings** (tokens) as input; if you pass a plain string, it is treated as a sequence of characters and each character is tagged separately, so tokenize the text first.

## Exercises

### Q1.
We are going to load the book *Moby Dick* (text1) from the embedded texts in NLTK.

In [1]:
from nltk.book import text1

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


What is the difference between the following two lines? Which one will give a larger value of length? Will this be the case for other texts?

    sorted(set(w.lower() for w in text1))
    sorted(w.lower() for w in set(text1))

In [2]:
#### Your code here


### Q2.
Define sent to be the list of words 

    ['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore']. 

Now write code to perform the following tasks:
- Print all words beginning with **se**
- Print all words longer than **four** characters

In [None]:
#### Your code here


### End of Tutorial

In this tutorial, we completed **basic feature extraction** and **basic text pre-processing**. We will continue with **advanced text processing** in Part 2 of the tutorial.