# Text Processing

One of the most common examples of text processing is text classification, such as labeling emails as spam or non-spam, or performing sentiment analysis. Sentiment analysis classifies a statement as positive or negative. For example:

Text (feature):
 Eight days, 25 dead: California shaken by string of mass shootings

Sentiment: Negative

Natural language processing (NLP) is the process of analyzing text data to perform tasks such as sentiment analysis, cognitive assistance, fake news detection, and real-time language translation.

With NLP, text classification analyzes a piece of text and assigns it a predefined class based on its context. There are three main approaches to text classification:

- Rule-based system
- Machine learning-based system
- Hybrid system

Before we jump to the classification approaches, let's understand preprocessing. ML algorithms need numerical input, so the text data (words) must first be cleaned and transformed into numerical features. This process is called text preprocessing.
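To make "numerical features" concrete, here is a minimal bag-of-words sketch in plain Python. The two messages and the helper name `to_count_vector` are made up for illustration; a real project would more likely use a library vectorizer.

```python
from collections import Counter

# Two toy messages (made up for illustration, not taken from the dataset)
messages = ["free entry to win", "ok see you at home ok"]

# Build a sorted vocabulary of every distinct word across the messages
vocabulary = sorted({word for message in messages for word in message.split()})

def to_count_vector(message, vocabulary):
    """Return one count per vocabulary word for the given message."""
    counts = Counter(message.split())
    return [counts[word] for word in vocabulary]

# Each message becomes a list of numbers with the same length as the vocabulary
vectors = [to_count_vector(message, vocabulary) for message in messages]

print(vocabulary)
print(vectors)
```

Each position in a vector corresponds to one vocabulary word, which is why reducing the vocabulary during preprocessing directly reduces the number of features.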


In this tutorial, we will perform text preprocessing using NLTK, a Python library used mainly for natural language processing.

In [2]:
# Load libraries
import pandas as pd
import numpy as np

import nltk
import string
import re

Import the file

The dataset contains SMS messages labeled as spam or ham (not spam). We will use it for our text processing.

In [3]:
data = pd.read_csv('spamdata.csv')
data.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Check for null values and class frequency

In [4]:
data.isnull().sum()

Category    0
Message     0
dtype: int64

In [5]:
data['Category'].value_counts()

ham     4825
spam     747
Name: Category, dtype: int64
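As a quick reading of these counts, the sketch below computes each class's share of the messages. The two numbers are copied from the output above; nothing else is assumed.

```python
# Class counts copied from the value_counts() output above
counts = {"ham": 4825, "spam": 747}

total = sum(counts.values())

# Fraction of all messages that belongs to each class
shares = {label: count / total for label, count in counts.items()}

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
```

Spam makes up roughly 13% of the messages, so the classes are imbalanced.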

To lowercase: The first preprocessing step is to convert the text to lowercase, which reduces the vocabulary size.

In [6]:
def to_lowercase(text):
    text = text.lower()
    return text

data['Message'] = data['Message'].apply(to_lowercase)
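A small sketch of why lowercasing shrinks the vocabulary. The sample string is made up for illustration.

```python
# Toy text (made up for illustration)
text = "Free FREE free entry Entry"

# Distinct words before and after lowercasing
vocab_before = set(text.split())
vocab_after = set(text.lower().split())

print(len(vocab_before), len(vocab_after))
```

Five distinct spellings collapse into two words once case is ignored.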

Remove numerals: We can either remove numbers or convert them to their text representation. In this case removing them is not a problem, as they have little significance for the classification.

In [7]:
# Remove numbers
def remove_numbers(text):
    text = re.sub(r'\d+', '', text)
    return text
 
data['Message'] = data['Message'].apply(remove_numbers)
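For the other option mentioned above, converting numbers to text, one possible sketch is to spell out each digit. The helper `digits_to_words` and its digit-by-digit behavior are assumptions made for illustration, not part of the tutorial's pipeline; it does not read multi-digit numbers as a whole.

```python
import re

# Hypothetical lookup table: index i holds the word for digit i
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def digits_to_words(text):
    """Replace every ASCII digit with its English word, digit by digit."""
    spelled = re.sub(r"[0-9]",
                     lambda match: f" {DIGIT_WORDS[int(match.group())]} ",
                     text)
    # Collapse the extra spaces introduced by the padding
    return " ".join(spelled.split())

print(digits_to_words("win 2 tickets"))
print(digits_to_words("call 08"))
```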

Remove punctuation: We remove punctuation to avoid counting the same word more than once. For example, "the" and "the!" are the same word, but they are treated as different tokens until the punctuation is removed.

In [8]:
def remove_punctuations(text):
    return text.translate(str.maketrans('', '', string.punctuation))

data['Message'] = data['Message'].apply(remove_punctuations)
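The following sketch shows the effect on a few sample tokens. The helper name `strip_punctuation` and the tokens are chosen for illustration; the helper uses the same `translate` approach as the cell above.

```python
import string

def strip_punctuation(text):
    # Delete every character listed in string.punctuation
    return text.translate(str.maketrans("", "", string.punctuation))

# Variants of one word that differ only in punctuation
tokens = ["the", "the!", "the,"]
cleaned = {strip_punctuation(token) for token in tokens}

print(cleaned)
print(strip_punctuation("don't"))
```

Note that apostrophes are removed as well, so a contraction such as "don't" becomes "dont".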

Remove unnecessary white spaces

In [9]:
# Remove extra whitespace from text
def remove_whitespace(text):
    # split() drops leading, trailing, and repeated whitespace
    return " ".join(text.split())

data['Message'] = data['Message'].apply(remove_whitespace)
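To see what the split-and-join idiom does, here is a short sketch on a made-up string. The helper name `collapse_whitespace` is chosen for illustration.

```python
def collapse_whitespace(text):
    # split() with no arguments splits on any run of spaces, tabs, or newlines
    return " ".join(text.split())

# Made-up string with leading, trailing, and repeated whitespace
messy = "  free   entry\tto \n win  "

print(repr(collapse_whitespace(messy)))
```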

Remove stop words: Stop words are common words that add little meaning or context to a sentence. NLTK provides a list of stop words that can be removed from the data.

Input: "Boulder is the best city in colorado"
Output: ['Boulder', 'best', 'city', 'colorado']

In [13]:
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the stop word set once; set lookups are faster than scanning a list
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Tokenize, then keep only the words that are not stop words
    words = word_tokenize(text)
    return ' '.join(word for word in words if word not in stop_words)

data['Message'] = data['Message'].apply(remove_stopwords)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
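The Boulder example above can be reproduced without any downloads using a tiny hand-written stop word set. The set below is an assumption for this sketch only; NLTK's English list is much longer.

```python
# Tiny hand-written stop word set (an assumption for this sketch,
# not NLTK's actual list)
STOP_WORDS = {"is", "the", "in"}

def drop_stop_words(text, stop_words=STOP_WORDS):
    # Compare in lowercase so "The" and "the" are treated alike
    return [word for word in text.split() if word.lower() not in stop_words]

print(drop_stop_words("Boulder is the best city in colorado"))
```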


Stemming: Many words appear in plural forms or with added suffixes. Stemming strips these endings to get the root of the word. For example, parents, parenthood, and parenting share the root word 'parent'.

In [15]:

from nltk.stem.porter import PorterStemmer

# Create the stemmer once and reuse it for every message
stemmer = PorterStemmer()

def perform_stemming(text):
    # Stem each token and join the stems back into a single string
    words = word_tokenize(text)
    return " ".join(stemmer.stem(word) for word in words)
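A quick way to get a feel for the stemmer is to run it on a few individual words. The word list below is chosen for illustration; `PorterStemmer` itself needs no extra NLTK data downloads.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Sample words chosen for illustration
words = ["parents", "parenting", "tries"]

# Map each word to its Porter stem
stems = {word: stemmer.stem(word) for word in words}

print(stems)
```

The stem of 'tries' is 'tri', which shows that a stem is not always a dictionary word.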



Lemmatization: Lemmatization performs a similar operation to stemming but ensures that the resulting word belongs to the language. For example, stemming can reduce 'tries' to 'tri', which is not a word, whereas lemmatization returns 'try'.

In [16]:
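from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# WordNetLemmatizer needs the WordNet corpus
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# Lemmatize a string
def lemmatize_word(text):
    word_tokens = word_tokenize(text)
    # Provide context, i.e. part of speech ('v' treats every token as a verb)
    lemmas = [lemmatizer.lemmatize(word, pos='v') for word in word_tokens]
    # Join the lemmas so the function returns a string like the other steps
    return ' '.join(lemmas)

# Stemming is applied here; swap in lemmatize_word to use lemmatization instead
data['Message'] = data['Message'].apply(perform_stemming)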

Think about which technique (stemming or lemmatization) is more suitable for the dataset you will use in the assignment.

Congratulations!