### Preprocessing Techniques for Text Data

- The example must contain at least 4 sentences.
- Write about which text processing steps you might use for this task.
- Support each step with a description of the technique and a worked example.
- Use the same example for every step, so choose one that lets you demonstrate each step.


##### 1. Importing Dependencies 

In [1]:
import nltk
import os
import re
import math
import operator
from nltk import pos_tag, ne_chunk
from nltk.wsd import lesk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize,word_tokenize
from bs4 import BeautifulSoup # parse the HTML of a scraped blog post
import requests # make HTTP requests to fetch the page
import json
from nltk.tree import Tree

In [2]:
# Download NLTK data
nltk.download('all')


[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\Harman\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\Harman\AppData\Roaming\nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\Harman\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     C:\Users\Harman\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\Harman\AppData\Roaming\nltk_data

True

In [3]:
Stopwords = set(stopwords.words('english'))
wordlemmatizer = WordNetLemmatizer()

##### 2. Building a Web Scraper

In [4]:
# URL = "https://hackernoon.com/simplifying-the-crazy-world-of-linux-distros"
URL = "https://medium.com/@palashm0002/understanding-the-basics-of-natural-language-processing-nlp-52b67cf954cb"
# URL = "https://hackernoon.com/nvidia-throws-gamers-under-the-bus"
# URL = "https://hackernoon.com/the-eggs-destined-to-give-birth-to-queens"

In [5]:
URL = input('Enter the URL: ')

In [6]:
r = requests.get(URL, timeout=10)
r.raise_for_status()  # stop early if the server returned an HTTP error

In [7]:
soup = BeautifulSoup(r.text, 'html.parser')
results = soup.find_all(['h1', 'p'])
text = [result.text for result in results]
ARTICLE = ' '.join(text)
script_tag = soup.find('script', type='application/ld+json')

if script_tag:
    # Parse JSON data
    json_data = json.loads(script_tag.string)
    
    # Extract articleBody
    article_body = json_data.get('articleBody', '')

    print(article_body)

    ARTICLE = ARTICLE + ". " + article_body
else:
    print("JSON-LD script tag not found.")

If you ever needed proof that Nvidia is no longer the company you think it is, look no further than the company&apos;s most recent GPU Technology Conference (also known as the GTC). The conference was chock-full of announcements, but the big one was of course the unveiling of the Blackwell class of AI chips which are supposedly going to be better than the widely-in-demand, and equally as expensive, Hopper processors that have powered the likes of OpenAI&apos;s ChatGPT. It was the demand for the H100 chips that transformed Nvidia into a company worth over $2 trillion, almost overnight in the larger scheme of things, leading some (including HackerNoon) to question whether the stock was in a bubble (almost a quarter of HackerNoon readers thought that it was). But, as if Nvidia was out to prove everyone wrong, the company&apos;s last earnings release showed that not only was it still selling the Hopper processors, it was selling more of them and would continue to do so in the near future. 
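The printed `articleBody` above still contains HTML entities such as `&apos;`, and a JSON-LD script can hold a list of objects rather than a single one. The following is a minimal sketch of a more defensive extraction, assuming the JSON-LD text has already been read from the script tag; the helper name `extract_article_body` and the sample string are hypothetical and not part of the original notebook.

```python
import html
import json


def extract_article_body(json_ld_text):
    """Return the unescaped articleBody from a JSON-LD string, or '' if absent."""
    if not json_ld_text:
        return ""
    try:
        data = json.loads(json_ld_text)
    except json.JSONDecodeError:
        return ""
    # JSON-LD may be a single object or a list of objects
    candidates = data if isinstance(data, list) else [data]
    for item in candidates:
        if isinstance(item, dict) and item.get("articleBody"):
            # Decode entities such as &apos; into plain characters
            return html.unescape(item["articleBody"])
    return ""


# Hypothetical JSON-LD payload used only for illustration
sample = '[{"@type": "WebSite"}, {"@type": "Article", "articleBody": "the company&apos;s chips"}]'
print(extract_article_body(sample))
```

In the notebook, the argument would be `script_tag.string`, which can be `None` when the tag has no single text child; the sketch returns an empty string in that case.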

In [8]:
print(ARTICLE)

Nvidia Throws Gamers Under the Bus If you ever needed proof that Nvidia is no longer the company you think it is, look no further than the company's most recent GPU Technology Conference (also known as the GTC). The conference was chock-full of announcements, but the big one was of course the unveiling of the Blackwell class of AI chips which are supposedly going to be better than the widely-in-demand, and equally as expensive, Hopper processors that have powered the likes of OpenAI's ChatGPT.    It was the demand for the H100 chips that transformed Nvidia into a company worth over $2 trillion, almost overnight in the larger scheme of things, leading some (including HackerNoon) to question whether the stock was in a bubble (almost a quarter of HackerNoon readers thought that it was). But, as if Nvidia was out to prove everyone wrong, the company's last earnings release showed that not only was it still selling the Hopper processors, it was selling more of them and would continue to do 

##### 3. Data Cleaning Techniques
- Noise Removal
- Sentence Segmentation
- Normalization
- Tokenization
- Stop-word Removal

In [9]:
def remove_noise(text):
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text

cleaned_text = remove_noise(ARTICLE)
print(cleaned_text)

Nvidia Throws Gamers Under the Bus If you ever needed proof that Nvidia is no longer the company you think it is look no further than the companys most recent GPU Technology Conference also known as the GTC The conference was chockfull of announcements but the big one was of course the unveiling of the Blackwell class of AI chips which are supposedly going to be better than the widelyindemand and equally as expensive Hopper processors that have powered the likes of OpenAIs ChatGPT    It was the demand for the H chips that transformed Nvidia into a company worth over  trillion almost overnight in the larger scheme of things leading some including HackerNoon to question whether the stock was in a bubble almost a quarter of HackerNoon readers thought that it was But as if Nvidia was out to prove everyone wrong the companys last earnings release showed that not only was it still selling the Hopper processors it was selling more of them and would continue to do so in the near future    Whic
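The output above shows some side effects of deleting every non-letter character: `chock-full` becomes `chockfull`, `company's` becomes `companys`, and `H100` becomes `H`. The sketch below is one possible gentler variant, not the notebook's method: it drops possessive endings, keeps digits, and replaces other punctuation with a space. The function name `remove_noise_gently` and the sample sentence are assumptions made for illustration.

```python
import re


def remove_noise_gently(text):
    # Drop possessive endings first (company's -> company)
    text = re.sub(r"['’]s\b", "", text)
    # Replace anything that is not a letter, digit, or whitespace with a space
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    # Collapse runs of whitespace left behind by the substitutions
    return re.sub(r"\s+", " ", text).strip()


sample = "The company's chock-full event sold H100 chips worth over $2 trillion."
print(remove_noise_gently(sample))
# prints: The company chock full event sold H100 chips worth over 2 trillion
```

Whether digits should survive depends on the task; for topic modelling they are often noise, while for a finance article they may carry meaning.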

In [10]:

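def tokenize_sentence(text):
    return sent_tokenize(text)

# Segment the raw article: sentence boundaries rely on the punctuation that noise removal strips
sentences = tokenize_sentence(ARTICLE)
print(sentences)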

["Nvidia Throws Gamers Under the Bus If you ever needed proof that Nvidia is no longer the company you think it is, look no further than the company's most recent GPU Technology Conference (also known as the GTC).", "The conference was chock-full of announcements, but the big one was of course the unveiling of the Blackwell class of AI chips which are supposedly going to be better than the widely-in-demand, and equally as expensive, Hopper processors that have powered the likes of OpenAI's ChatGPT.", 'It was the demand for the H100 chips that transformed Nvidia into a company worth over $2 trillion, almost overnight in the larger scheme of things, leading some (including HackerNoon) to question whether the stock was in a bubble (almost a quarter of HackerNoon readers thought that it was).', "But, as if Nvidia was out to prove everyone wrong, the company's last earnings release showed that not only was it still selling the Hopper processors, it was selling more of them and would continu
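In the output above, the article title is fused with the first sentence because the scraped headings and paragraphs were joined with a plain space. A minimal sketch of one way to avoid this, assuming the scraped blocks are available as a list of strings, is to end each block with a period when it lacks terminal punctuation. The helper name `join_blocks` is hypothetical.

```python
def join_blocks(blocks):
    """Join text blocks, ending each with a period if it lacks terminal punctuation."""
    cleaned = []
    for block in blocks:
        block = block.strip()
        if not block:
            continue  # skip empty headings or paragraphs
        if block[-1] not in ".!?":
            block += "."
        cleaned.append(block)
    return " ".join(cleaned)


print(join_blocks(["Nvidia Throws Gamers Under the Bus", "If you ever needed proof, look no further."]))
# prints: Nvidia Throws Gamers Under the Bus. If you ever needed proof, look no further.
```

In the notebook this would replace `' '.join(text)` when building `ARTICLE`, so that the sentence tokenizer sees the title as its own sentence.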

In [11]:
def normalize_text(text):
    return text.lower()

normalized_text = normalize_text(cleaned_text)
print(normalized_text)

nvidia throws gamers under the bus if you ever needed proof that nvidia is no longer the company you think it is look no further than the companys most recent gpu technology conference also known as the gtc the conference was chockfull of announcements but the big one was of course the unveiling of the blackwell class of ai chips which are supposedly going to be better than the widelyindemand and equally as expensive hopper processors that have powered the likes of openais chatgpt    it was the demand for the h chips that transformed nvidia into a company worth over  trillion almost overnight in the larger scheme of things leading some including hackernoon to question whether the stock was in a bubble almost a quarter of hackernoon readers thought that it was but as if nvidia was out to prove everyone wrong the companys last earnings release showed that not only was it still selling the hopper processors it was selling more of them and would continue to do so in the near future    whic
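Lowercasing is the only normalization applied above. Scraped web text often also contains non-breaking spaces, ligatures, and repeated whitespace, so a slightly broader normalization can help. The following sketch uses only the standard library; the function name `normalize_text_unicode` is an assumption and it is not part of the original pipeline.

```python
import unicodedata


def normalize_text_unicode(text):
    # NFKC folds compatibility characters (ligatures, non-breaking spaces) into plain forms
    text = unicodedata.normalize("NFKC", text)
    # casefold() is a more aggressive, Unicode-aware version of lower()
    text = text.casefold()
    # Collapse any run of whitespace into a single space
    return " ".join(text.split())


print(normalize_text_unicode("ﬁne  TEXT\u00a0here"))
# prints: fine text here
```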

In [12]:

def tokenize_text(text):
    return word_tokenize(text)
tokens = tokenize_text(normalized_text)
print(tokens)

['nvidia', 'throws', 'gamers', 'under', 'the', 'bus', 'if', 'you', 'ever', 'needed', 'proof', 'that', 'nvidia', 'is', 'no', 'longer', 'the', 'company', 'you', 'think', 'it', 'is', 'look', 'no', 'further', 'than', 'the', 'companys', 'most', 'recent', 'gpu', 'technology', 'conference', 'also', 'known', 'as', 'the', 'gtc', 'the', 'conference', 'was', 'chockfull', 'of', 'announcements', 'but', 'the', 'big', 'one', 'was', 'of', 'course', 'the', 'unveiling', 'of', 'the', 'blackwell', 'class', 'of', 'ai', 'chips', 'which', 'are', 'supposedly', 'going', 'to', 'be', 'better', 'than', 'the', 'widelyindemand', 'and', 'equally', 'as', 'expensive', 'hopper', 'processors', 'that', 'have', 'powered', 'the', 'likes', 'of', 'openais', 'chatgpt', 'it', 'was', 'the', 'demand', 'for', 'the', 'h', 'chips', 'that', 'transformed', 'nvidia', 'into', 'a', 'company', 'worth', 'over', 'trillion', 'almost', 'overnight', 'in', 'the', 'larger', 'scheme', 'of', 'things', 'leading', 'some', 'including', 'hackernoon',

In [14]:
def remove_stopwords(tokens):
    stop_words = set(stopwords.words('english'))
    return [word for word in tokens if word not in stop_words]

filtered_tokens = remove_stopwords(tokens)
print(filtered_tokens)

['nvidia', 'throws', 'gamers', 'bus', 'ever', 'needed', 'proof', 'nvidia', 'longer', 'company', 'think', 'look', 'companys', 'recent', 'gpu', 'technology', 'conference', 'also', 'known', 'gtc', 'conference', 'chockfull', 'announcements', 'big', 'one', 'course', 'unveiling', 'blackwell', 'class', 'ai', 'chips', 'supposedly', 'going', 'better', 'widelyindemand', 'equally', 'expensive', 'hopper', 'processors', 'powered', 'likes', 'openais', 'chatgpt', 'demand', 'h', 'chips', 'transformed', 'nvidia', 'company', 'worth', 'trillion', 'almost', 'overnight', 'larger', 'scheme', 'things', 'leading', 'including', 'hackernoon', 'question', 'whether', 'stock', 'bubble', 'almost', 'quarter', 'hackernoon', 'readers', 'thought', 'nvidia', 'prove', 'everyone', 'wrong', 'companys', 'last', 'earnings', 'release', 'showed', 'still', 'selling', 'hopper', 'processors', 'selling', 'would', 'continue', 'near', 'future', 'brings', 'us', 'nvidia', 'gtc', 'nvidia', 'chief', 'executive', 'jensen', 'huang', 'stev
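NLTK's English stop-word list includes negations, which is why `no` disappears from "is no longer the company" in the output above. For tasks where negation matters, such as sentiment analysis, it can be useful to keep a few words. The sketch below illustrates the idea with a small hand-written stop-word set standing in for the NLTK list; `STOP_WORDS`, `NEGATIONS`, and `remove_stopwords_keep` are hypothetical names.

```python
# Illustrative stop-word set; the notebook uses NLTK's much larger English list
STOP_WORDS = {"the", "is", "it", "you", "no", "not", "that", "than"}
NEGATIONS = {"no", "not"}


def remove_stopwords_keep(tokens, stop_words, keep=frozenset()):
    """Remove stop words, except those explicitly listed in `keep`."""
    effective = set(stop_words) - set(keep)
    return [token for token in tokens if token not in effective]


sample_tokens = ["nvidia", "is", "no", "longer", "the", "company", "you", "think", "it", "is"]
print(remove_stopwords_keep(sample_tokens, STOP_WORDS))
# prints: ['nvidia', 'longer', 'company', 'think']
print(remove_stopwords_keep(sample_tokens, STOP_WORDS, keep=NEGATIONS))
# prints: ['nvidia', 'no', 'longer', 'company', 'think']
```

With the notebook's setup, `stop_words` would be `stopwords.words('english')`.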

##### 4. Data Transformation Techniques
- Stemming
- Lemmatization

In [15]:
def stem_words(tokens):
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in tokens]

stemmed_tokens = stem_words(filtered_tokens)
print(stemmed_tokens)

['nvidia', 'throw', 'gamer', 'bu', 'ever', 'need', 'proof', 'nvidia', 'longer', 'compani', 'think', 'look', 'compani', 'recent', 'gpu', 'technolog', 'confer', 'also', 'known', 'gtc', 'confer', 'chockful', 'announc', 'big', 'one', 'cours', 'unveil', 'blackwel', 'class', 'ai', 'chip', 'supposedli', 'go', 'better', 'widelyindemand', 'equal', 'expens', 'hopper', 'processor', 'power', 'like', 'openai', 'chatgpt', 'demand', 'h', 'chip', 'transform', 'nvidia', 'compani', 'worth', 'trillion', 'almost', 'overnight', 'larger', 'scheme', 'thing', 'lead', 'includ', 'hackernoon', 'question', 'whether', 'stock', 'bubbl', 'almost', 'quarter', 'hackernoon', 'reader', 'thought', 'nvidia', 'prove', 'everyon', 'wrong', 'compani', 'last', 'earn', 'releas', 'show', 'still', 'sell', 'hopper', 'processor', 'sell', 'would', 'continu', 'near', 'futur', 'bring', 'us', 'nvidia', 'gtc', 'nvidia', 'chief', 'execut', 'jensen', 'huang', 'steve', 'job', 'certainli', 'sharehold', 'man', 'know', 'keep', 'happi', 'prese

In [16]:
def lemmatize_words(tokens):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in tokens]

lemmatized_tokens = lemmatize_words(filtered_tokens)
print(lemmatized_tokens)

['nvidia', 'throw', 'gamers', 'bus', 'ever', 'needed', 'proof', 'nvidia', 'longer', 'company', 'think', 'look', 'company', 'recent', 'gpu', 'technology', 'conference', 'also', 'known', 'gtc', 'conference', 'chockfull', 'announcement', 'big', 'one', 'course', 'unveiling', 'blackwell', 'class', 'ai', 'chip', 'supposedly', 'going', 'better', 'widelyindemand', 'equally', 'expensive', 'hopper', 'processor', 'powered', 'like', 'openais', 'chatgpt', 'demand', 'h', 'chip', 'transformed', 'nvidia', 'company', 'worth', 'trillion', 'almost', 'overnight', 'larger', 'scheme', 'thing', 'leading', 'including', 'hackernoon', 'question', 'whether', 'stock', 'bubble', 'almost', 'quarter', 'hackernoon', 'reader', 'thought', 'nvidia', 'prove', 'everyone', 'wrong', 'company', 'last', 'earnings', 'release', 'showed', 'still', 'selling', 'hopper', 'processor', 'selling', 'would', 'continue', 'near', 'future', 'brings', 'u', 'nvidia', 'gtc', 'nvidia', 'chief', 'executive', 'jensen', 'huang', 'steve', 'job', '
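Comparing the two outputs, the stemmer reduces `needed` to `need` and `powered` to `power`, while the lemmatizer leaves both unchanged. This is because `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part of speech is passed. One common remedy is to map Penn Treebank tags to WordNet's single-letter part-of-speech codes; the helper below is a sketch of that mapping, and its name `penn_to_wordnet` is an assumption.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech code."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default to noun, as the lemmatizer itself does


# Tags copied from the PoS output shown later in this notebook
for word, tag in [("needed", "VBN"), ("powered", "VBN"), ("company", "NN")]:
    print(word, tag, "->", penn_to_wordnet(tag))
```

In the notebook, the result would be passed as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`, which requires tagging the tokens before lemmatizing them.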

##### 5. Data Enrichment Techniques
- Part-of-Speech Tagging (PoS)
- Word Sense Disambiguation
- Noun Phrase Extraction
- Named-entity Recognition

In [17]:
def pos_tagging(tokens):
    # Use a distinct name so the imported pos_tag function is not shadowed
    tagged = nltk.pos_tag(tokens)
    pos_tagged_noun_verb = []
    for word, tag in tagged:
        if tag in ["NN", "NNP", "NNS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]:
            pos_tagged_noun_verb.append((word, tag))
    return pos_tagged_noun_verb

pos_tagged_text = pos_tagging(lemmatized_tokens)
print(pos_tagged_text)

[('throw', 'NN'), ('gamers', 'NNS'), ('bus', 'VBP'), ('needed', 'VBN'), ('proof', 'NN'), ('nvidia', 'NN'), ('company', 'NN'), ('think', 'VBP'), ('look', 'NN'), ('company', 'NN'), ('gpu', 'NN'), ('technology', 'NN'), ('conference', 'NN'), ('known', 'VBN'), ('conference', 'NN'), ('chockfull', 'NN'), ('announcement', 'NN'), ('course', 'NN'), ('unveiling', 'VBG'), ('class', 'NN'), ('ai', 'NN'), ('chip', 'NN'), ('going', 'VBG'), ('widelyindemand', 'VB'), ('hopper', 'NN'), ('processor', 'NN'), ('powered', 'VBN'), ('openais', 'NN'), ('demand', 'NN'), ('h', 'NN'), ('chip', 'NN'), ('transformed', 'VBD'), ('company', 'NN'), ('worth', 'NN'), ('thing', 'NN'), ('leading', 'VBG'), ('including', 'VBG'), ('question', 'NN'), ('stock', 'NN'), ('quarter', 'NN'), ('hackernoon', 'NN'), ('reader', 'NN'), ('thought', 'VBD'), ('everyone', 'NN'), ('company', 'NN'), ('earnings', 'NNS'), ('release', 'NN'), ('showed', 'VBD'), ('selling', 'VBG'), ('processor', 'NN'), ('selling', 'NN'), ('continue', 'VB'), ('brings
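The filter above lists each noun and verb tag explicitly and leaves out `NNPS` (plural proper noun). A more compact alternative is to match on tag prefixes. The sketch below works on any list of `(word, tag)` pairs; the function name `keep_nouns_and_verbs` is hypothetical, and the two capitalized sample pairs are invented to show the proper-noun tags.

```python
def keep_nouns_and_verbs(tagged_tokens, prefixes=("NN", "VB")):
    """Keep (word, tag) pairs whose tag starts with one of the given prefixes."""
    return [(word, tag) for word, tag in tagged_tokens if tag.startswith(prefixes)]


sample_tagged = [
    ("hopper", "NN"),
    ("equally", "RB"),
    ("earnings", "NNS"),
    ("Nvidia", "NNP"),   # hypothetical pair, not from the output above
    ("Gamers", "NNPS"),  # hypothetical pair, not from the output above
    ("showed", "VBD"),
    ("expensive", "JJ"),
]
print(keep_nouns_and_verbs(sample_tagged))
```

Note that the tagger in this notebook runs on lowercased tokens with stop words removed, so it has less context than it would on the original sentences; tags such as `('bus', 'VBP')` in the output reflect that.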

In [18]:
def named_entity_recognition(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
   
    ne_tree = ne_chunk(pos_tags)

    named_entities = []
    for subtree in ne_tree:
        if isinstance(subtree, Tree):
            entity_type = subtree.label()
            entity_value = " ".join([word for word, tag in subtree.leaves()])
            named_entities.append((entity_type, entity_value))
   
    return named_entities

named_entity_text = named_entity_recognition(ARTICLE)
print("\nNamed entities found:")
print(named_entity_text)


Named entities found:
[('GPE', 'Nvidia'), ('PERSON', 'Throws Gamers'), ('GPE', 'Nvidia'), ('ORGANIZATION', 'GPU Technology Conference'), ('ORGANIZATION', 'GTC'), ('ORGANIZATION', 'Blackwell'), ('ORGANIZATION', 'AI'), ('GPE', 'Hopper'), ('ORGANIZATION', 'OpenAI'), ('ORGANIZATION', 'ChatGPT'), ('ORGANIZATION', 'H100'), ('GPE', 'Nvidia'), ('ORGANIZATION', 'HackerNoon'), ('ORGANIZATION', 'HackerNoon'), ('PERSON', 'Nvidia'), ('ORGANIZATION', 'Hopper'), ('PERSON', 'Which'), ('ORGANIZATION', 'Nvidia'), ('PERSON', 'Nvidia Chief Executive Jensen Huang'), ('PERSON', 'Steve Jobs'), ('ORGANIZATION', 'Blackwell'), ('PERSON', 'Apple'), ('ORGANIZATION', 'iPhone'), ('FACILITY', 'Wall Street'), ('GPE', 'Nvidia'), ('PERSON', 'Which'), ('GPE', 'Nvidia'), ('GPE', 'Nvidia'), ('ORGANIZATION', 'GTC'), ('PERSON', 'Jensen Huang'), ('GPE', 'Nvidia'), ('GPE', 'Nvidia'), ('PERSON', 'Nvidia'), ('ORGANIZATION', 'OpenAI'), ('PERSON', 'Nvidia'), ('PERSON', 'Little'), ('GPE', 'Nvidia'), ('ORGANIZATION', 'LLMs'), ('PE
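The output above assigns different labels to the same string: `Nvidia` appears as `GPE`, `PERSON`, and `ORGANIZATION`. A quick way to surface such inconsistencies is to count the labels each entity received. The sketch below uses a subset of pairs copied from the output above; the helper name `labels_per_entity` is an assumption.

```python
from collections import Counter


def labels_per_entity(pairs):
    """Map each entity string to a Counter of the labels it received."""
    result = {}
    for label, text in pairs:
        result.setdefault(text, Counter())[label] += 1
    return result


# A subset of (label, text) pairs copied from the output above
sample_entities = [
    ("GPE", "Nvidia"), ("GPE", "Nvidia"), ("GPE", "Nvidia"),
    ("PERSON", "Nvidia"), ("ORGANIZATION", "Nvidia"),
    ("GPE", "Hopper"), ("ORGANIZATION", "Hopper"),
    ("PERSON", "Steve Jobs"),
]

for text, counts in labels_per_entity(sample_entities).items():
    # Flag entities that were labelled in more than one way
    flag = "inconsistent" if len(counts) > 1 else "consistent"
    print(text, dict(counts), flag)
```

In the notebook, the function would be called on `named_entity_text` directly.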

In [19]:
def word_sense_disambiguation(sentence, word):
    # lesk expects the context as a list of tokens, not a raw string
    return lesk(word_tokenize(sentence), word)

sense_disambiguation = word_sense_disambiguation(ARTICLE, "nlp")
print(sense_disambiguation)

Synset('natural_language_processing.n.01')
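The Lesk approach picks the sense whose dictionary gloss shares the most words with the surrounding context. The sketch below is a simplified, self-contained illustration of that overlap idea and not NLTK's implementation. The two-sense inventory for "chip" and its glosses are hypothetical; in practice the glosses would come from WordNet.

```python
def simplified_lesk(context_tokens, senses):
    """Pick the sense whose gloss shares the most words with the context."""
    context = {token.lower() for token in context_tokens}
    best_sense, best_overlap = None, -1
    for name, gloss in senses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = name, overlap
    return best_sense


# Hypothetical sense inventory; real glosses would come from WordNet
CHIP_SENSES = {
    "chip.electronics": "a small piece of semiconductor material carrying an integrated circuit used in processors",
    "chip.food": "a thin crisp slice of fried potato eaten as a snack",
}

context = "the blackwell class of ai chips will be better than the hopper processors".split()
print(simplified_lesk(context, CHIP_SENSES))
# prints: chip.electronics
```

Because the score is a raw word overlap, common words such as "of" contribute to it; filtering stop words from both the context and the glosses usually makes the choice more reliable.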


In [20]:
def noun_phrase_extraction(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    noun_phrases = []
    current_phrase = []
    for word, tag in pos_tags: 
        if tag.startswith('NN') or tag.startswith('JJ'):
            current_phrase.append(word)
        else:
            if current_phrase:
                noun_phrases.append(' '.join(current_phrase))
                current_phrase = []
    # Add the last phrase if it exists
    if current_phrase:
        noun_phrases.append(' '.join(current_phrase))
    return noun_phrases

noun_phrases = noun_phrase_extraction(ARTICLE)

print("Extracted noun phrases:")
for phrase in noun_phrases:  # avoid the name np, commonly used as the NumPy alias
    print(f"- {phrase}")

Extracted noun phrases:
- Nvidia Throws Gamers
- Bus
- proof
- Nvidia
- company
- further
- company
- recent GPU Technology Conference
- GTC
- conference
- chock-full
- announcements
- big
- course
- unveiling
- Blackwell class
- AI chips
- better
- widely-in-demand
- expensive
- Hopper processors
- likes
- OpenAI
- ChatGPT
- demand
- H100 chips
- Nvidia
- company worth
- larger scheme
- things
- HackerNoon
- stock
- bubble
- quarter
- HackerNoon readers
- Nvidia
- everyone wrong
- company
- last earnings release
- Hopper processors
- more
- near future
- Which
- Nvidia GTC
- Nvidia Chief Executive Jensen Huang
- Steve Jobs
- shareholders
- man
- happy – presentation skills
- Blackwell announcement
- nothing akin
- Apple
- unveiling
- first iPhone
- Wall Street
- crazy
- crazy
- monopoly
- supply
- company
- darn AI processors
- Nvidia
- COVID-19
- demand
- consumers
- desperate
- company
- price
- hardware
- ladies
- gentlemen
- future tech companies
- worldwide
- demand
- world
- Whi