# CHAPTER-6 Handling Text

Strategies for transforming text into information-rich features

## 6.1 Cleaning Text

Cleaning unstructured text data. The most basic operations use Python's built-in string methods.

In [1]:
# Creating a text

text_data = [" Interrobang. By Aishwarya Henriette   ",
            "Parking And Going. By Karl Gautier",
            " Today Is The night. By Jarek Prakash "]

In [4]:
# Strip Whitespaces
strip_whitespace = [ string.strip() for string in text_data ]

# display text

strip_whitespace

['Interrobang. By Aishwarya Henriette',
 'Parking And Going. By Karl Gautier',
 'Today Is The night. By Jarek Prakash']

In [6]:
# Remove periods

remove_periods = [string.replace("."," ") for string in strip_whitespace]

# display text

remove_periods

['Interrobang  By Aishwarya Henriette',
 'Parking And Going  By Karl Gautier',
 'Today Is The night  By Jarek Prakash']

In [10]:
# creating and applying a custom transformation function

def capitalizer(string: str) -> str:
    return string.upper()

[capitalizer(string) for string in remove_periods]

['INTERROBANG  BY AISHWARYA HENRIETTE',
 'PARKING AND GOING  BY KARL GAUTIER',
 'TODAY IS THE NIGHT  BY JAREK PRAKASH']

In [26]:
# we can use regular expressions for more powerful string operations:

import re

In [29]:
# replace letters with X function

def replace_with_X(string: str) -> str:
    return re.sub(r"[a-zA-Z]","X",string)

[replace_with_X(string) for string in remove_periods]

['XXXXXXXXXXX  XX XXXXXXXXX XXXXXXXXX',
 'XXXXXXX XXX XXXXX  XX XXXX XXXXXXX',
 'XXXXX XX XXX XXXXX  XX XXXXX XXXXXXX']
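Replacing periods with spaces, as in the cells above, leaves double spaces behind. A minimal sketch of one way to tidy that up with a regular expression; the helper name `collapse_whitespace` is an assumption made for illustration and is not part of the original notebook:

```python
import re

def collapse_whitespace(string: str) -> str:
    # Replace every run of whitespace with a single space, then trim the ends
    return re.sub(r"\s+", " ", string).strip()

cleaned = [collapse_whitespace(s) for s in ["Interrobang  By Aishwarya Henriette",
                                            "Parking And Going  By Karl Gautier"]]
print(cleaned)
# ['Interrobang By Aishwarya Henriette', 'Parking And Going By Karl Gautier']
```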

## 6.2 Parsing and Cleaning HTML

Extracting text data from HTML elements

In [54]:
# importing BeautifulSoup library
# BeautifulSoup - for parsing HTML pages (commonly used in web scraping)

from bs4 import BeautifulSoup

In [55]:
# creating HTML code

html = """ <div class ='fullName'><span style = 'font-weight:bold'>Masego</span> Azra</div> """

In [59]:
# parsing HTML

soup = BeautifulSoup(html, 'lxml')

# lxml is an HTML parser

In [60]:

# finding a particular div and showing the text in it

soup.find("div", {"class" : "fullName"}).text

'Masego Azra'

## 6.3 Removing Punctuation

Removing punctuation from a feature of text data

In [61]:
# importing libraries

import unicodedata
import sys

In [62]:
# creating a text (a list, so the order of the results matches the input)

text = ['Hi!!!! I. Love. This. Song....',
        '10000% Agree!!!! #LoveIT',
        'Right?!?!']

In [69]:
# creating a dictionary of punctuation characters
# (each punctuation code point maps to None, so translate() deletes it)

punctuation = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))

In [70]:
[string.translate(punctuation) for string in text]

['Hi I Love This Song', '10000 Agree LoveIT', 'Right']
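`str.translate` deletes any character whose code point maps to `None` in the table it is given. For ASCII-only text, a smaller table could be built from `string.punctuation`. This is only a sketch of a shortcut, not a replacement for the Unicode-wide table above, since it misses non-ASCII punctuation such as curly quotes:

```python
import string

# Translation table that maps each ASCII punctuation character to None (delete)
ascii_punctuation = str.maketrans("", "", string.punctuation)

sample = "10000% Agree!!!! #LoveIT"
stripped = sample.translate(ascii_punctuation)
print(stripped)  # 10000 Agree LoveIT
```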

## 6.4 Tokenizing text

Breaking text into individual words

In [72]:
# importing word tokenizer library
# NLTK: Natural Language ToolKit

import nltk
from nltk.tokenize import word_tokenize

In [81]:
# the 'punkt' tokenizer models have to be downloaded before word_tokenize can run

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [86]:
# creating a text

tokenize_word = "The science of today is the technology of tomorrow"

In [88]:
# tokenizing words

word_tokenize(tokenize_word)

# word tokenization is usually one of the first steps in cleaning text data

['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of', 'tomorrow']

In [84]:
# we can also tokenize sentences

from nltk.tokenize import sent_tokenize

In [89]:
# creating a text

tokenize_sentence = "The science of today is the technology of tomorrow. Tomorrow is today. Today is now :)"

In [90]:
# tokenizing the sentence

sent_tokenize(tokenize_sentence)

['The science of today is the technology of tomorrow.',
 'Tomorrow is today.',
 'Today is now :)']

## 6.5 Removing stop words

Removing low-information words from tokenized text data, using NLTK's stopwords.

In [92]:
# importing the library

from nltk.corpus import stopwords

In [95]:
# when working with the stopwords for the first time, we have to download the set of stopwords

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [96]:
tokenized_words = ['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of', 'tomorrow']

In [97]:
# loading the stopwords

stop_words = stopwords.words('english')

In [101]:
# show few stopwords

stop_words[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [98]:
# removing stop words
# note: 'The' is kept because the stop-word list is lowercase and the comparison is case-sensitive

[word for word in tokenized_words if word not in stop_words]

['The', 'science', 'today', 'technology', 'tomorrow']
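The capitalized 'The' survives in the output above because the stop-word list is all lowercase. A minimal sketch of a case-insensitive filter; the three-word `stop_words_demo` set is a hypothetical stand-in for NLTK's full list, used only so the example runs on its own:

```python
# Hypothetical mini stop-word list, used only to keep this sketch self-contained
stop_words_demo = {"the", "of", "is"}

tokenized_words_demo = ['The', 'science', 'of', 'today', 'is', 'the',
                        'technology', 'of', 'tomorrow']

# Compare the lowercased token against the list, but keep the original token
filtered = [w for w in tokenized_words_demo if w.lower() not in stop_words_demo]
print(filtered)  # ['science', 'today', 'technology', 'tomorrow']
```

Using a set rather than a list for the stop words also makes each membership test faster on large texts.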

## 6.6 Stemming words

Converting tokenized words into root forms.

In [102]:
from nltk.stem.porter import PorterStemmer

In [103]:
# creating word tokens

tokenized_words = ['i', 'am', 'humbled', 'by', 'this', 'traditional', 'meeting']

In [104]:
# creating stemmer

porter = PorterStemmer()

In [106]:
# Applying stemmer 
[porter.stem(word) for word in tokenized_words]

['i', 'am', 'humbl', 'by', 'thi', 'tradit', 'meet']

## 6.7 Tagging Parts of Speech

Tagging text data with its parts of speech, using NLTK's pretrained model.

Text tagging is labeling the words in a text with their grammatical categories, such as nouns and verbs.

In [111]:
from nltk import pos_tag
from nltk import word_tokenize

In [114]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [117]:
# creating text data

text_data = "Chris loved outdoor running"

In [118]:
text_tagged =  pos_tag(word_tokenize(text_data))

In [124]:
text_tagged

# output is a list of tuples with the word and the tag

[('Chris', 'NNP'), ('loved', 'VBD'), ('outdoor', 'RP'), ('running', 'VBG')]

In [125]:
# we can filter the words based on their parts of speech (here, nouns)

[word for word, tag in text_tagged if tag in ['NN', 'NNS', 'NNP', 'NNPS']]

['Chris']

In [126]:
# converting sentences (like tweets) into features based on their individual parts of speech,
# for example a feature that is 1 if a proper noun is present and 0 otherwise

# creating text

tweets = ["I am eating a burrito for breakfast",
         "Political science is an amazing field",
         "San Francisco is an awesome city"]

In [145]:
# Creating an empty list

tagged_tweets = []

# tagging each word for each tweet

for tweet in tweets:
    tweet_tag = pos_tag(word_tokenize(tweet))
    tagged_tweets.append([tag for word, tag in tweet_tag])

In [132]:
from sklearn.preprocessing import MultiLabelBinarizer

In [136]:
# using one-hot encoding to convert tags into features

one_hot_multi = MultiLabelBinarizer()
one_hot_multi.fit_transform(tagged_tweets)

array([[1, 1, 0, 1, 0, 1, 1, 1, 0],
       [1, 0, 1, 1, 0, 0, 0, 0, 1],
       [1, 0, 1, 1, 1, 0, 0, 0, 1]])

In [134]:
# using classes_ we can see which part of speech each feature represents

one_hot_multi.classes_

array(['DT', 'IN', 'JJ', 'NN', 'NNP', 'PRP', 'VBG', 'VBP', 'VBZ'],
      dtype=object)
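To make the encoding concrete, here is a pure-Python sketch of the same multi-hot idea, plus the single "contains a proper noun" feature mentioned earlier. The tag sets are hard-coded (read off the matrix above) so the sketch runs without NLTK or scikit-learn; it illustrates the idea and is not how `MultiLabelBinarizer` is implemented:

```python
# Tag sets per tweet, hard-coded so the sketch needs no external libraries
tag_sets = [
    {"PRP", "VBP", "VBG", "DT", "NN", "IN"},
    {"JJ", "NN", "VBZ", "DT"},
    {"NNP", "VBZ", "DT", "JJ", "NN"},
]

# Columns are the sorted union of all tags, mirroring classes_
classes = sorted(set().union(*tag_sets))

# One row per tweet: 1 if the tag occurs in the tweet, 0 otherwise
multi_hot = [[int(tag in tags) for tag in classes] for tags in tag_sets]

# Single feature: does the tweet contain a proper noun (NNP)?
has_proper_noun = [int("NNP" in tags) for tags in tag_sets]

print(classes)
print(multi_hot)
print(has_proper_noun)  # [0, 0, 1]
```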

Tagged Corpus: A collection of text data in which each word is associated with its part-of-speech (POS) tag. This is useful for training and evaluating POS tagging models.

Corpus used - Brown Corpus

Tagger used - n-gram tagger

n is the size of the context window. A TrigramTagger tags a word using the word itself and the tags of the two previous words; if that context was not seen in training, we back off to a BigramTagger, which uses the word and the tag of one previous word, and then to a UnigramTagger, which uses the word alone.

To check the accuracy of our tagger, we split the text data into two parts: train on the first part and test on the second part.
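The backoff idea can be illustrated with a toy, two-level lookup. This is a sketch only: the tables and their entries are made up for illustration, and real NLTK taggers learn their tables from the training sentences.

```python
# Toy lookup tables; the entries are made up for illustration only
bigram_table = {("DT", "run"): "NN", ("TO", "run"): "VB"}  # (previous tag, word) -> tag
unigram_table = {"run": "VB", "the": "DT", "to": "TO"}     # word -> tag

def tag_with_backoff(words, default="NN"):
    tags = []
    for word in words:
        prev_tag = tags[-1] if tags else None
        # Most specific context first, then back off to less context
        tag = bigram_table.get((prev_tag, word))
        if tag is None:
            tag = unigram_table.get(word, default)
        tags.append(tag)
    return list(zip(words, tags))

print(tag_with_backoff(["the", "run"]))  # [('the', 'DT'), ('run', 'NN')]
print(tag_with_backoff(["to", "run"]))   # [('to', 'TO'), ('run', 'VB')]
```

The same word receives different tags depending on the previous tag, and words with no matching context fall through to the simpler table.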

In [148]:
from nltk.corpus import brown
from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger

In [150]:
nltk.download('brown')

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\brown.zip.


True

In [151]:
# Text from Brown Corpus, broken into sentences

sentences = brown.tagged_sents(categories = 'news')

In [156]:
len(sentences)

4623

In [159]:
# Splitting such that, 4000 sentences for training and 623 for testing

train = sentences[:4000]
test = sentences[4000:]

In [164]:
# Creating backoff taggers

unigram = UnigramTagger(train)
bigram = BigramTagger(train, backoff = unigram)
trigram = TrigramTagger(train, backoff = bigram)

# if the context of the 2 previous tags was not seen in training, TrigramTagger falls back to BigramTagger,
# if the context of the 1 previous tag was not seen, BigramTagger falls back to UnigramTagger,
# which tags the word without considering any other words

In [165]:
# show accuracy (accuracy() replaces the deprecated evaluate())

trigram.accuracy(test)

0.8174734002697437

## 6.8 Encoding Text as a Bag of words

Creating a set of features that count the number of times each word occurs in the text data.

In [166]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [167]:
# creating text

text_data = np.array(['I love Brazil. Brazil!',
                     'Sweden is best',
                     'India beats both'])

In [168]:
# creating bag of words feature matrix

count = CountVectorizer()

bag_of_words = count.fit_transform(text_data)

In [169]:
# feature matrix
bag_of_words

<3x8 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [172]:
count.get_feature_names_out()

array(['beats', 'best', 'both', 'brazil', 'india', 'is', 'love', 'sweden'],
      dtype=object)

In [173]:
# the output above is a sparse matrix; we can use toarray to view it as a dense array

bag_of_words.toarray()

# from the output we can see that 'brazil' occurs twice in the first document, so its count is 2;
# each word in the feature names above is one feature (column);
# for large text data the resulting matrix can contain thousands of features.

array([[0, 0, 0, 2, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 0, 0]], dtype=int64)

In many cases most words appear in only a few documents, so the feature matrix is mostly zeros and would waste a lot of memory if stored densely. A sparse matrix stores only the nonzero values, and CountVectorizer outputs a sparse matrix by default.
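A rough pure-Python picture of the sparse idea: keep one `Counter` per document that holds only the words that actually occur. The regular expression approximates CountVectorizer's default tokenization (lowercase, tokens of two or more word characters); treat it as an assumption for illustration rather than an exact reimplementation.

```python
import re
from collections import Counter

docs = ['I love Brazil. Brazil!', 'Sweden is best', 'India beats both']

def count_tokens(doc: str) -> Counter:
    # Lowercase, then keep tokens of two or more word characters
    # (this is why the single-letter "I" does not become a feature)
    return Counter(re.findall(r"\b\w\w+\b", doc.lower()))

sparse_rows = [count_tokens(doc) for doc in docs]
stored = sum(len(row) for row in sparse_rows)

print(sparse_rows[0]["brazil"])  # 2
print(stored)                    # 8
```

Only 8 counts are stored, against the 24 cells of the dense 3x8 matrix.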

By default every feature is a single word. Instead, we can use combinations of consecutive words, called 2-grams or 3-grams.

'ngram_range' sets the minimum and maximum size of the n-grams; for example, (1,2) returns all 1-grams and 2-grams, and (2,3) returns all 2-grams and 3-grams.

We can remove filler words using 'stop_words'.

We can restrict the words or phrases considered using 'vocabulary'.
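To see what an n-gram range produces, here is a small pure-Python sketch. The function name `word_ngrams` is an assumption for illustration; it mimics the idea behind 'ngram_range', not scikit-learn's implementation.

```python
def word_ngrams(tokens, min_n, max_n):
    # Collect every run of n consecutive tokens for n in [min_n, max_n]
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

print(word_ngrams(["sweden", "is", "best"], 1, 2))
# ['sweden', 'is', 'best', 'sweden is', 'is best']
```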

In [179]:
# creating feature matrix restricted to a fixed vocabulary

count_2gram = CountVectorizer(ngram_range = (1,2),
                              stop_words = 'english',
                              vocabulary = ['brazil'])

bag = count_2gram.fit_transform(text_data)

In [180]:
# show feature matrix

bag.toarray()

array([[2],
       [0],
       [0]], dtype=int64)

In [181]:
# view the vocabulary; only 'brazil' appears because 'vocabulary' restricts the features

count_2gram.vocabulary_

{'brazil': 0}

In [185]:
# creating feature matrix of 1-grams with English stop words removed

count_1gram = CountVectorizer(ngram_range = (1,1),
                              stop_words = 'english')

bag = count_1gram.fit_transform(text_data)

In [186]:
# show feature matrix

bag.toarray()

array([[0, 0, 2, 0, 1, 0],
       [0, 1, 0, 0, 0, 1],
       [1, 0, 0, 1, 0, 0]], dtype=int64)

In [188]:
# View 1-grams

count_1gram.vocabulary_

{'love': 4, 'brazil': 2, 'sweden': 5, 'best': 1, 'india': 3, 'beats': 0}

## 6.9 Weighting Word Importance

A bag of words in which words are weighted by their importance.

This is done by comparing how often a word appears in a document with how many documents in the collection contain it, using 'term frequency-inverse document frequency' (tf-idf).

scikit-learn provides TfidfVectorizer for this.

In [189]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

In [191]:
# creating text data

text_data = np.array(['I love Brazil. Brazil!',
                     'Sweden is best',
                     'India beats both'])

In [193]:
# creating tf-idf feature matrix

tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)

In [194]:
feature_matrix

<3x8 sparse matrix of type '<class 'numpy.float64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [201]:
feature_matrix.toarray()

array([[0.        , 0.        , 0.        , 0.89442719, 0.        ,
        0.        , 0.4472136 , 0.        ],
       [0.        , 0.57735027, 0.        , 0.        , 0.        ,
        0.57735027, 0.        , 0.57735027],
       [0.57735027, 0.        , 0.57735027, 0.        , 0.57735027,
        0.        , 0.        , 0.        ]])

In [202]:
feature_matrix.toarray().shape

(3, 8)

In [198]:
tfidf.vocabulary_

{'love': 6,
 'brazil': 3,
 'sweden': 7,
 'is': 5,
 'best': 1,
 'india': 4,
 'beats': 0,
 'both': 2}

In [199]:
tfidf.get_feature_names_out()

array(['beats', 'best', 'both', 'brazil', 'india', 'is', 'love', 'sweden'],
      dtype=object)

'tf' is the number of times a word occurs in a document, and 'idf' is calculated from the total number of documents and the number of documents that contain the word. The more documents a word appears in, the lower its idf and the less weight it receives.
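As a worked check, the first row of the feature matrix above can be reproduced by hand. This sketch assumes scikit-learn's default settings (smoothed idf and L2 normalization of each row); the variable names are chosen for illustration.

```python
import math

n_docs = 3
counts = {"brazil": 2, "love": 1}    # term counts in the first document
doc_freq = {"brazil": 1, "love": 1}  # number of documents containing each term

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in counts}
raw = {t: counts[t] * idf[t] for t in counts}

# L2-normalize the row so that its squared entries sum to 1
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf_row = {t: v / norm for t, v in raw.items()}

print(tfidf_row)
```

Because both words occur in only one document, they share the same idf, and after normalization the values depend only on the 2:1 ratio of their counts, giving roughly 0.894 for 'brazil' and 0.447 for 'love'.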

Tag - Part of speech
================

    NNP: Proper noun, singular
    NN: Noun, singular or mass
    RB: Adverb
    VBD: Verb, past tense
    VBG: Verb, gerund or present participle
    JJ: Adjective
    PRP: Personal pronoun

KEYPOINTS:

    NLTK's stop-word list assumes all tokenized words are lowercased
    
    Corpus: A corpus in NLP is a large, structured collection of text in a particular language. It serves as a resource for language-related algorithms and models.
    
    Tagged Corpus: A collection of text data in which each word is associated with its part-of-speech (POS) tag. This is useful for training and evaluating POS tagging models.
    
    Brown Corpus: A collection of text samples from various sources representing American English.


 KEYWORDS:
    
     BeautifulSoup: A Python library for parsing HTML and extracting data from it (commonly used in web scraping).
     nltk: Natural Language ToolKit, a Python library for natural language processing and text manipulation.
     PorterStemmer: Stemming algorithm that removes affixes, keeping the stem.
     pos_tag: Part-of-speech tagging, where each word is tagged with its part of speech.
     UnigramTagger: A POS tagger that assigns each word the tag it most often had in the training data, without considering the context of surrounding words.
     BigramTagger: A POS tagger that assigns a tag based on the current word and the tag of the previous word.
     TrigramTagger: A POS tagger that assigns a tag based on the current word and the tags of the previous 2 words.
     CountVectorizer: Converts text documents into a feature matrix of token counts.