<a href="https://colab.research.google.com/github/PaulAyobamidele/NLP/blob/main/dataprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Basics in NLP 1**

In [1]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

### **Data Acquisition**

In [None]:
import requests
import re

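def fetch_and_save_wiki_text(title):
    """Fetch the plain-text extract of an English Wikipedia page (nothing is written to disk)."""
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            "explaintext": True,
        },
        timeout=30,
    )
    response.raise_for_status()  # fail early on HTTP errors
    data = response.json()

    # "pages" is keyed by page ID, so take the first (and only) entry
    page = next(iter(data["query"]["pages"].values()))
    wiki_text = page["extract"]

    return wiki_text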


In [None]:
city = 'Tokyo'
info = fetch_and_save_wiki_text(city)

In [None]:
print(info)
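The `pages` object in the API response is keyed by page ID rather than by title, which is why the function takes the first value instead of indexing by name. A minimal sketch of that extraction step, run on a hand-written dictionary (the sample response is a made-up stand-in, not real API output, and `extract_text` is a hypothetical helper, not part of the notebook):

```python
# Hand-written stand-in for a MediaWiki "query" response; not real API output.
sample_response = {
    "query": {
        "pages": {
            "12345": {
                "pageid": 12345,
                "title": "Tokyo",
                "extract": "Tokyo is the capital of Japan.",
            }
        }
    }
}


def extract_text(response):
    """Return the plain-text extract of the first page, or "" if it has none."""
    pages = response.get("query", {}).get("pages", {})
    if not pages:
        return ""
    # Pages are keyed by page ID, so take the first value regardless of its key.
    page = next(iter(pages.values()))
    return page.get("extract", "")


print(extract_text(sample_response))
```

Using `.get` with defaults means a missing page yields an empty string instead of a `KeyError`, which may or may not be the behaviour you want in a pipeline.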

## **Data Cleaning using Regex**

In [None]:
def clean_text(text):
    # Remove URLs and email addresses first, while ':', '/' and '@' are still present
    url = re.compile(r'https?://\S+|www\.\S+')
    text = url.sub('', text)
    email = re.compile(r'[A-Za-z0-9._-]+@\w+\.\w+')
    text = email.sub('', text)
    # Remove HTML tags
    html = re.compile(r'<[^>]+>')
    text = html.sub('', text)
    # Remove everything except letters, whitespace and "."
    #text = re.sub(r'[^A-Za-z0-9\s.\(\)\[\]\{\}]+', '', text)
    text = re.sub(r'[^A-Za-z\s.]+', '', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

In [None]:
clean = clean_text(info)
print(clean)
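The order of the substitutions matters: once a character filter has stripped `:`, `/` and `@`, URL and email patterns can no longer match. A small sketch on a made-up string (the sample text, the patterns and the `strip_noise` helper are illustrative assumptions, not part of the notebook):

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@\w+(?:\.\w+)+")
NON_LETTER_RE = re.compile(r"[^A-Za-z\s.]+")


def strip_noise(text):
    # Remove URLs and emails first, while ':', '/' and '@' are still present.
    text = URL_RE.sub("", text)
    text = EMAIL_RE.sub("", text)
    # Then keep only letters, whitespace and full stops.
    text = NON_LETTER_RE.sub("", text)
    # Collapse runs of whitespace.
    return " ".join(text.split())


sample = "Visit https://example.com or mail info@example.org for details. Tokyo has 23 wards!"
print(strip_noise(sample))

# Filtering characters first leaves URL debris behind, because "://" is already gone.
filtered_first = URL_RE.sub("", NON_LETTER_RE.sub("", sample))
print(filtered_first)
```

Comparing the two printed lines shows why the URL and email steps need to run before the character filter.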

## **Tokenization**

### **Tokenization using NLTK**

NLTK (Natural Language Toolkit) is a Python library for working with human language data. It offers easy-to-use interfaces and supports classification, stemming, tagging, and more.

In [None]:
!pip install -q nltk

The next cell can take quite some time, because it downloads every NLTK resource: tokenizers, chunkers, other models, and all of the corpora.

In [None]:
import nltk
nltk.download('all')

**1. Sentence Tokenization**

In [None]:
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(clean)

for i, sentence in enumerate(sentences):
  print(f"Sentence {i} : ")
  print(sentence)

**2. Word Tokenization**

In [None]:
from nltk.tokenize import word_tokenize

words = word_tokenize(clean)

for word in words:
  print(word)

These tokenizers split the text on punctuation and whitespace. They do not discard the punctuation; each mark is kept as its own token, so you can decide what to do with it during pre-processing.
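The idea of keeping punctuation as separate tokens can be sketched with a single regular expression. This is only a rough approximation under simplifying assumptions, not NLTK's actual algorithm (`word_tokenize` also handles contractions, abbreviations and other cases); `simple_tokenize` is a hypothetical helper:

```python
import re
import string

PUNCTUATION = set(string.punctuation)


def simple_tokenize(text):
    # A token is either a run of word characters or one non-space punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Tokyo is large. It is busy, too!")
print(tokens)
# ['Tokyo', 'is', 'large', '.', 'It', 'is', 'busy', ',', 'too', '!']

# The caller decides what happens to punctuation, for example dropping it:
words_only = [t for t in tokens if t not in PUNCTUATION]
print(words_only)
# ['Tokyo', 'is', 'large', 'It', 'is', 'busy', 'too']
```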

**3. Regexp Tokenization**

In [None]:
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"[^ ][\w']*")
regwords = tokenizer.tokenize(clean)
print(regwords)
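The pattern `[^ ][\w']*` reads as: one non-space character, followed by any number of word characters or apostrophes. To see what that matches, the same pattern can be run through the standard `re` module. This sketch assumes that `RegexpTokenizer`, in its default mode, returns the pattern's matches; the sample sentence is made up:

```python
import re

pattern = re.compile(r"[^ ][\w']*")
sample = "Tokyo's population, as of 2023."

matches = pattern.findall(sample)
print(matches)
# ["Tokyo's", 'population', ',', 'as', 'of', '2023', '.']
```

Apostrophes stay inside words, while other punctuation marks become single-character tokens because nothing after them matches `[\w']*`.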

## **Lemmatization**

**1. Wordnet Lemmatizer**



[WordNet®](https://wordnet.princeton.edu/) is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.

WordNet is publicly available and encodes semantic relationships between its words. Lemmatizing with WordNet is one of the earliest and most commonly used lemmatization techniques.

*   It is available through the NLTK library in Python.
*   WordNet links words through semantic relations (e.g. synonymy).
*   It groups synonyms into `synsets` (sets of words that are semantically equivalent).



---


How to use:  

Download WordNet and the POS tagger data through NLTK:
*   `import nltk`
*   `nltk.download('wordnet')`
*   `nltk.download('averaged_perceptron_tagger')`

In [None]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

# Create WordNetLemmatizer object
wnl = WordNetLemmatizer()

# single word lemmatization
for word in words[:100]:
    print(word + " ---> " + wnl.lemmatize(word))


Tokyo Japanese Tky toko officially the Tokyo Metropolis Tkyto is the capital of Japan and the most populous city in the world with a population of over million residents as of . ---> Tokyo Japanese Tky toko officially the Tokyo Metropolis Tkyto is the capital of Japan and the most populous city in the world with a population of over million residents as of .
The Tokyo metropolitan area which includes Tokyo and nearby prefectures is the worlds mostpopulous metropolitan area with . ---> The Tokyo metropolitan area which includes Tokyo and nearby prefectures is the worlds mostpopulous metropolitan area with .
million residents as of and is the secondlargest metropolitan economy in the world after New York City with a gross metropolitan product estimated at US trillion. ---> million residents as of and is the secondlargest metropolitan economy in the world after New York City with a gross metropolitan product estimated at US trillion.


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


**2. Wordnet Lemmatizer (with POS tag)**

In the above approach, the WordNet results were not up to the mark: words like ‘located’ and ‘includes’ remained unchanged after lemmatization. This is because `lemmatize` treats every word as a noun unless told otherwise, so verb forms are never reduced. To overcome this, we use POS (Part of Speech) tags.

We add a tag with a particular word defining its type (verb, noun, adjective etc).
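Why the tag matters can be shown with a toy lemmatizer. The sketch below uses a tiny hand-written lookup table as a stand-in for WordNet; the table entries and the `toy_lemmatize` helper are illustrative assumptions. Only the fallback behaviour (return the word unchanged when nothing is found, default to noun) mirrors `WordNetLemmatizer.lemmatize`:

```python
# Toy lookup table standing in for WordNet; entries are illustrative only.
TOY_LEMMAS = {
    ("cities", "n"): "city",
    ("located", "v"): "locate",
    ("includes", "v"): "include",
}


def toy_lemmatize(word, pos="n"):
    # Fall back to the word itself when no entry exists for the (word, pos) pair.
    return TOY_LEMMAS.get((word.lower(), pos), word)


for word in ["cities", "located", "includes"]:
    print(word, "-> default:", toy_lemmatize(word), "| as verb:", toy_lemmatize(word, "v"))
```

With the default noun tag, `located` and `includes` come back unchanged; passing `"v"` is what lets the verb entries match.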


In [None]:
# WORDNET LEMMATIZER (with appropriate pos tags)

import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:


lemmatizer = WordNetLemmatizer()

# Define function to lemmatize each word with its POS tag

# POS_TAGGER_FUNCTION
def pos_tagger(nltk_tag):
	if nltk_tag.startswith('J'):
		return wordnet.ADJ
	elif nltk_tag.startswith('V'):
		return wordnet.VERB
	elif nltk_tag.startswith('N'):
		return wordnet.NOUN
	elif nltk_tag.startswith('R'):
		return wordnet.ADV
	else:
		return None

sentence = sentences[4]
print("Original sentence: ")
print(sentence)
print("---------------------")

# tokenize the sentence and find the POS tag for each token
pos_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
print("POS results:")
print(pos_tagged)
print("---------------------")

# As you may have noticed, the above pos tags are a little confusing.

# we use our own pos_tagger function to make things simpler to understand.
wordnet_tagged = list(map(lambda x: (x[0], pos_tagger(x[1])), pos_tagged))
print("Our POS results:")
print(wordnet_tagged)
print("---------------------")

lemmatized_sentence = []
for word, tag in wordnet_tagged:
	if tag is None:
		# if there is no available tag, append the token as is
		lemmatized_sentence.append(word)
	else:
		# else use the tag to lemmatize the token
		lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
lemmatized_sentence = " ".join(lemmatized_sentence)

print("Final lemmatization results:")
print(lemmatized_sentence)



Original sentence: 
Tokyo serves as Japans economic center and the seat of both the Japanese government and the Emperor of Japan.
---------------------
POS results:
[('Tokyo', 'NNP'), ('serves', 'VBZ'), ('as', 'IN'), ('Japans', 'NNPS'), ('economic', 'JJ'), ('center', 'NN'), ('and', 'CC'), ('the', 'DT'), ('seat', 'NN'), ('of', 'IN'), ('both', 'DT'), ('the', 'DT'), ('Japanese', 'JJ'), ('government', 'NN'), ('and', 'CC'), ('the', 'DT'), ('Emperor', 'NNP'), ('of', 'IN'), ('Japan', 'NNP'), ('.', '.')]
---------------------
Our POS results:
[('Tokyo', 'n'), ('serves', 'v'), ('as', None), ('Japans', 'n'), ('economic', 'a'), ('center', 'n'), ('and', None), ('the', None), ('seat', 'n'), ('of', None), ('both', None), ('the', None), ('Japanese', 'a'), ('government', 'n'), ('and', None), ('the', None), ('Emperor', 'n'), ('of', None), ('Japan', 'n'), ('.', None)]
---------------------
Final lemmatization results:
Tokyo serve as Japans economic center and the seat of both the Japanese government and 

**3. TextBlob Lemmatizer**

TextBlob is a python library used for processing textual data. It provides a simple API to access its methods and perform basic NLP tasks.

First, install the TextBlob package:


In [None]:
!pip install textblob

In [None]:
from textblob import TextBlob, Word


sentence = sentences[15]
print("Original sentence: ")
print(sentence)
print("---------------------")

s = TextBlob(sentence)
lemmatized_sentence = " ".join([w.lemmatize() for w in s.words])

print("Final lemmatization results:")
print(lemmatized_sentence)


Original sentence: 
As of the city is home to of the worlds largest companies listed in the annual Fortune Global .In Tokyo ranked fourth on the Global Financial Centres Index behind New York City London and Shanghai.
---------------------
Final lemmatization results:
As of the city is home to of the world largest company listed in the annual Fortune Global In Tokyo ranked fourth on the Global Financial Centres Index behind New York City London and Shanghai


**4. TextBlob (with POS tag)**

As with the WordNet approach, lemmatizing without appropriate POS tags shows the same limitations here. To overcome them, we use one of TextBlob's more powerful features: Part of Speech tagging.

In [None]:
from textblob import TextBlob

# Define function to lemmatize each word with its POS tag

# POS_TAGGER_FUNCTION
def pos_tagger_blob(sentence):
	sent = TextBlob(sentence)
	tag_dict = {"J": 'a', "N": 'n', "V": 'v', "R": 'r'}
	words_tags = [(w, tag_dict.get(pos[0], 'n')) for w, pos in sent.tags]
	lemma_list = [wd.lemmatize(tag) for wd, tag in words_tags]
	return lemma_list

# Lemmatize
sentence = sentences[15]
print("Original sentence: ")
print(sentence)
print("---------------------")

lemma_list = pos_tagger_blob(sentence)
lemmatized_sentence = " ".join(lemma_list)
print("Final lemmatization results:")
print(lemmatized_sentence)


Original sentence: 
As of the city is home to of the worlds largest companies listed in the annual Fortune Global .In Tokyo ranked fourth on the Global Financial Centres Index behind New York City London and Shanghai.
---------------------
Final lemmatization results:
As of the city be home to of the world large company list in the annual Fortune Global .In Tokyo rank fourth on the Global Financial Centres Index behind New York City London and Shanghai


**5. spaCy**

spaCy is an open-source python library that parses and “understands” large volumes of text. Separate models are available that cater to specific languages (English, French, German, etc.).

In [None]:
!pip install -q spacy

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
# Create a Doc object
doc = nlp(u'Japanese author Haruki Murakami has based some of his novels in Tokyo including Norwegian Wood and David Mitchells first two novels numberdream and Ghostwritten featured the city.')

# Create list of tokens from given string
tokens = []
for token in doc:
	tokens.append(token)

print(tokens)

lemmatized_sentence = " ".join([token.lemma_ for token in doc])

print(lemmatized_sentence)



[Japanese, author, Haruki, Murakami, has, based, some, of, his, novels, in, Tokyo, including, Norwegian, Wood, and, David, Mitchells, first, two, novels, numberdream, and, Ghostwritten, featured, the, city, .]
japanese author Haruki Murakami have base some of his novel in Tokyo include Norwegian Wood and David Mitchells first two novel numberdream and Ghostwritten feature the city .




---



## **Stemming**

NLTK provides several stemming algorithms. Let’s examine three of them below.
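Stemmers strip or rewrite suffixes using rules rather than looking words up in a dictionary, which is why a stem does not have to be a real word. The sketch below is a deliberately tiny toy stemmer that illustrates this idea only; its rule list and the `toy_stem` helper are made up and are not the Porter, Snowball or Lancaster algorithms:

```python
# Ordered (suffix, replacement) rules; a deliberately tiny, made-up rule set.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ly", ""), ("ed", ""), ("s", "")]


def toy_stem(word, min_stem_len=3):
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        # Apply the first matching rule that leaves a long enough stem.
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem + replacement
    return word


for w in ["cities", "running", "officially", "listed", "is"]:
    print(w, "->", toy_stem(w))
```

Results such as `citi` and `runn` show the typical trade-off: related word forms collapse to a shared stem, but the stem itself may not be a dictionary word.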

### **1. Porter’s Stemmer**

In [None]:
from nltk.stem import PorterStemmer

# Create a Porter Stemmer instance
porter_stemmer = PorterStemmer()

# Example words for stemming
#originals = ["running", "better", "mice", "caring", "wolves"]
originals = words[:10]

# Apply stemming to each word
stemmed_words = [porter_stemmer.stem(word) for word in originals]

# Print the results
print("Original words:", originals)
print("Stemmed words:", stemmed_words)


Original words: ['Tokyo', 'Japanese', 'Tky', 'toko', 'officially', 'the', 'Tokyo', 'Metropolis', 'Tkyto', 'is']
Stemmed words: ['tokyo', 'japanes', 'tki', 'toko', 'offici', 'the', 'tokyo', 'metropoli', 'tkyto', 'is']


### **2. Snowball Stemmer**

Unlike the original Porter stemmer, the Snowball stemmer is multilingual: it supports a range of languages besides English. It is written in Snowball, a small string-processing language designed for creating stemming algorithms.

The English Snowball stemmer is also known as the Porter2 stemmer. It is generally described as more aggressive than the original Porter stemmer, although on the ten sample words used here both produce identical stems. Thanks to its improvements over the original, it is also reported to be faster.

In [None]:
from nltk.stem import SnowballStemmer

# Choose a language for stemming, for example, English
stemmer = SnowballStemmer(language='english')

# Example words to stem
words_to_stem = originals

# Apply Snowball Stemmer
stemmed_words = [stemmer.stem(word) for word in words_to_stem]

# Print the results
print("Original words:", words_to_stem)
print("Stemmed words:", stemmed_words)


Original words: ['Tokyo', 'Japanese', 'Tky', 'toko', 'officially', 'the', 'Tokyo', 'Metropolis', 'Tkyto', 'is']
Stemmed words: ['tokyo', 'japanes', 'tki', 'toko', 'offici', 'the', 'tokyo', 'metropoli', 'tkyto', 'is']


### **3. Lancaster Stemmer**

The Lancaster stemmer is the most aggressive of the three. It is fast, but it tends to over-stem, especially short words, which can make the resulting stems hard to interpret (in the output below, ‘officially’ is reduced to ‘off’). For that reason its results are usually less useful than those of the Snowball stemmer.

In [None]:
from nltk.stem import LancasterStemmer

# Create a Lancaster Stemmer instance
stemmer = LancasterStemmer()

# Example words to stem
words_to_stem = originals

# Apply Lancaster Stemmer
stemmed_words = [stemmer.stem(word) for word in words_to_stem]

# Print the results
print("Original words:", words_to_stem)
print("Stemmed words:", stemmed_words)

Original words: ['Tokyo', 'Japanese', 'Tky', 'toko', 'officially', 'the', 'Tokyo', 'Metropolis', 'Tkyto', 'is']
Stemmed words: ['tokyo', 'japanes', 'tky', 'toko', 'off', 'the', 'tokyo', 'metropol', 'tkyto', 'is']


## **Stopwords Removal**

Stopwords are very common words in a language that are often omitted from natural language processing (NLP) tasks because they contribute little to the meaning of a text. The exact list of stopwords depends on the language and the context. The following is a broad list of stopword categories:

*  Common Stopwords: These are the most frequently occurring words in a language and are often removed during text preprocessing. Examples include “the,” “is,” “in,” “for,” “where,” “when,” “to,” “at,” etc.
*  Custom Stopwords: Depending on the specific task or domain, additional words may be considered as stopwords. These could be domain-specific terms that don’t contribute much to the overall meaning. For example, in a medical context, words like “patient” or “treatment” might be considered as custom stopwords.
*  Numerical Stopwords: Numbers and numeric characters may be treated as stopwords in certain cases, especially when the analysis is focused on the meaning of the text rather than specific numerical values.
*  Single-Character Stopwords: Single characters, such as “a,” “I,” “s,” or “x,” may be considered stopwords, particularly in cases where they don’t convey much meaning on their own.
*  Contextual Stopwords: Words that are stopwords in one context but meaningful in another may be considered as contextual stopwords. For instance, the word “will” might be a stopword in the context of general language processing but could be important in predicting future events.
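The categories above can be combined in one filter. The sketch below is a minimal pure-Python illustration; the word lists are small made-up samples (NLTK's English list is much longer), and `remove_stopwords` with its parameters is a hypothetical helper:

```python
# Small illustrative base list; NLTK's English list is much longer.
COMMON_STOPWORDS = {"the", "is", "in", "for", "to", "at", "of", "and", "a"}
# Hypothetical domain-specific additions for a medical corpus.
CUSTOM_STOPWORDS = {"patient", "treatment"}


def remove_stopwords(tokens, extra=frozenset(), drop_numbers=True, drop_single_chars=True):
    stop = COMMON_STOPWORDS | set(extra)
    kept = []
    for tok in tokens:
        low = tok.lower()
        if low in stop:
            continue  # common or custom stopword
        if drop_numbers and low.isdigit():
            continue  # numerical stopword
        if drop_single_chars and len(low) == 1:
            continue  # single-character stopword
        kept.append(tok)
    return kept


tokens = "The patient is in room 12 for treatment of a fracture x".split()
print(remove_stopwords(tokens))
# ['patient', 'room', 'treatment', 'fracture']
print(remove_stopwords(tokens, extra=CUSTOM_STOPWORDS))
# ['room', 'fracture']
```

Contextual stopwords are handled the same way: whether a word such as "will" goes into `extra` is a per-task decision rather than a property of the word.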

In [None]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
print(stopwords.words('english'))


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### **1. Removing stop words with NLTK**

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "Japanese author Haruki Murakami has based some of his novels in Tokyo including Norwegian Wood and David Mitchells first two novels numberdream and Ghostwritten featured the city."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]


print(word_tokens)
print(filtered_sentence)


['Japanese', 'author', 'Haruki', 'Murakami', 'has', 'based', 'some', 'of', 'his', 'novels', 'in', 'Tokyo', 'including', 'Norwegian', 'Wood', 'and', 'David', 'Mitchells', 'first', 'two', 'novels', 'numberdream', 'and', 'Ghostwritten', 'featured', 'the', 'city', '.']
['Japanese', 'author', 'Haruki', 'Murakami', 'based', 'novels', 'Tokyo', 'including', 'Norwegian', 'Wood', 'David', 'Mitchells', 'first', 'two', 'novels', 'numberdream', 'Ghostwritten', 'featured', 'city', '.']


### **2. Removing stop words with spaCy**

In [None]:
import spacy

# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Japanese author Haruki Murakami has based some of his novels in Tokyo including Norwegian Wood and David Mitchells first two novels numberdream and Ghostwritten featured the city."

# Process the text using spaCy
doc = nlp(text)

# Remove stopwords
filtered_words = [token.text for token in doc if not token.is_stop]

# Join the filtered words to form a clean text
# (named cleaned_text so it does not shadow the clean_text() function defined earlier)
cleaned_text = ' '.join(filtered_words)

print("Original Text:", text)
print("Text after Stopword Removal:", cleaned_text)


Original Text: Japanese author Haruki Murakami has based some of his novels in Tokyo including Norwegian Wood and David Mitchells first two novels numberdream and Ghostwritten featured the city.
Text after Stopword Removal: Japanese author Haruki Murakami based novels Tokyo including Norwegian Wood David Mitchells novels numberdream Ghostwritten featured city .
