### Preprocessing Techniques for Text Data

- The example must contain at least 4 sentences.
- Write about which text processing steps you might use for this task.
- Support each step with a description of the technique and worked-out example.
- The same example is to be used for each step, so choose the example carefully so that you will be able to demonstrate each step.


##### 1. Importing Dependencies 

In [55]:
import nltk
import os
import re
import math
import operator
from nltk import pos_tag, ne_chunk
from nltk.wsd import lesk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize,word_tokenize
from nltk.tree import Tree

In [56]:
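# Download NLTK data. 'all' fetches every corpus and model, which is a large download;
# the steps below only use the tokenizer, stop-word, WordNet, POS-tagger and NE-chunker data.
nltk.download('all')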



In [57]:
Stopwords = set(stopwords.words('english'))
wordlemmatizer = WordNetLemmatizer()

##### 2. Getting Text 

In [58]:
ARTICLE = "The quick brown fox jumps over the lazy dog. This sentence contains every letter of the English alphabet. Natural Language Processing (NLP) is a fascinating field of study. It combines linguistics, computer science, and artificial intelligence to analyze and understand human language. NLP is used in a wide range of applications, from chatbots to machine translation. In this article, we will explore the basics of NLP and how it works. We will also discuss some common NLP techniques and tools. Let's get started!"

In [59]:
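# Optionally replace the sample text; pressing Enter keeps the ARTICLE defined above
ARTICLE = input('Enter Text: ').strip() or ARTICLE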

In [60]:
print(ARTICLE)

The quick brown fox jumps over the lazy dog. This sentence contains every letter of the English alphabet. Natural Language Processing (NLP) is a fascinating field of study. It combines linguistics, computer science, and artificial intelligence to analyze and understand human language. NLP is used in a wide range of applications, from chatbots to machine translation. In this article, we will explore the basics of NLP and how it works. We will also discuss some common NLP techniques and tools. Let's get started!


##### 3. Data Cleaning Techniques
- Noise Removal
- Sentence Segmentation
- Normalization
- Tokenization
- Stop-word Removal

In [61]:
def remove_noise(text):
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text

cleaned_text = remove_noise(ARTICLE)
print(cleaned_text)

The quick brown fox jumps over the lazy dog This sentence contains every letter of the English alphabet Natural Language Processing NLP is a fascinating field of study It combines linguistics computer science and artificial intelligence to analyze and understand human language NLP is used in a wide range of applications from chatbots to machine translation In this article we will explore the basics of NLP and how it works We will also discuss some common NLP techniques and tools Lets get started
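The output above shows two side effects of the pattern `[^a-zA-Z\s]`: sentence boundaries disappear and "Let's" becomes "Lets". Below is a minimal sketch of a gentler variant, assuming you want to keep apostrophes and sentence-ending punctuation for later steps. The name `remove_noise_keep_boundaries` is hypothetical and not part of the notebook.

```python
import re

def remove_noise_keep_boundaries(text):
    # Keep letters, whitespace, apostrophes and sentence-ending punctuation (. ! ?)
    text = re.sub(r"[^a-zA-Z\s'.!?]", "", text)
    # Collapse any runs of whitespace left behind by the removed characters
    return re.sub(r"\s+", " ", text).strip()

sample = "Natural Language Processing (NLP) is a fascinating field of study. Let's get started!"
print(remove_noise_keep_boundaries(sample))
```

Which characters count as noise depends on the task, so treat the character class as something to tune rather than a fixed rule.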


In [62]:

def tokenize_sentence(text):
    # Sentence segmentation runs on the original text because it relies on punctuation
    return sent_tokenize(text)

sentences = tokenize_sentence(ARTICLE)
print(sentences)

['The quick brown fox jumps over the lazy dog.', 'This sentence contains every letter of the English alphabet.', 'Natural Language Processing (NLP) is a fascinating field of study.', 'It combines linguistics, computer science, and artificial intelligence to analyze and understand human language.', 'NLP is used in a wide range of applications, from chatbots to machine translation.', 'In this article, we will explore the basics of NLP and how it works.', 'We will also discuss some common NLP techniques and tools.', "Let's get started!"]
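For intuition, sentence segmentation can be approximated by splitting after terminal punctuation that is followed by whitespace. This is a rough sketch only, not how `sent_tokenize` works internally; `naive_sent_split` is a hypothetical helper, and the last sentence of the sample is added here to show where the naive rule breaks.

```python
import re

def naive_sent_split(text):
    # Split wherever '.', '!' or '?' is followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "The quick brown fox jumps over the lazy dog. Let's get started! Dr. Smith agrees."
for sentence in naive_sent_split(sample):
    print(repr(sentence))
```

The abbreviation "Dr." is split off as its own sentence, which is the kind of case a trained sentence tokenizer is meant to handle.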


In [63]:
def normalize_text(text):
    return text.lower()

normalized_text = normalize_text(cleaned_text)
print(normalized_text)

the quick brown fox jumps over the lazy dog this sentence contains every letter of the english alphabet natural language processing nlp is a fascinating field of study it combines linguistics computer science and artificial intelligence to analyze and understand human language nlp is used in a wide range of applications from chatbots to machine translation in this article we will explore the basics of nlp and how it works we will also discuss some common nlp techniques and tools lets get started
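Lowercasing is the simplest form of normalization. A slightly broader sketch is shown below, assuming the input may contain Unicode compatibility characters or irregular spacing; `normalize_more` is a hypothetical name. It would need to run before noise removal, since the regular expression used earlier deletes non-ASCII letters.

```python
import re
import unicodedata

def normalize_more(text):
    # NFKC folds compatibility characters (e.g. full-width letters) into their standard forms
    text = unicodedata.normalize("NFKC", text)
    # casefold() is a more aggressive lower() intended for caseless matching
    text = text.casefold()
    # Collapse repeated whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

print(normalize_more("Natural   Language\tProcessing (ＮＬＰ)"))
```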


In [64]:

def tokenize_text(text):
    return word_tokenize(text)
tokens = tokenize_text(normalized_text)
print(tokens)

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'this', 'sentence', 'contains', 'every', 'letter', 'of', 'the', 'english', 'alphabet', 'natural', 'language', 'processing', 'nlp', 'is', 'a', 'fascinating', 'field', 'of', 'study', 'it', 'combines', 'linguistics', 'computer', 'science', 'and', 'artificial', 'intelligence', 'to', 'analyze', 'and', 'understand', 'human', 'language', 'nlp', 'is', 'used', 'in', 'a', 'wide', 'range', 'of', 'applications', 'from', 'chatbots', 'to', 'machine', 'translation', 'in', 'this', 'article', 'we', 'will', 'explore', 'the', 'basics', 'of', 'nlp', 'and', 'how', 'it', 'works', 'we', 'will', 'also', 'discuss', 'some', 'common', 'nlp', 'techniques', 'and', 'tools', 'lets', 'get', 'started']
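The token list above contains `'lets'` because the apostrophe was stripped during noise removal. As an illustration of what a word tokenizer does, here is a small regular-expression sketch that keeps contractions together. It is a simplification under stated assumptions (ASCII letters only) and `simple_word_tokenize` is a hypothetical name, not an NLTK function.

```python
import re

def simple_word_tokenize(text):
    # A word is a run of letters, optionally followed by an apostrophe and more letters
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

print(simple_word_tokenize("Let's get started! NLP is used in chatbots."))
```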


In [66]:
def remove_stopwords(tokens):
    stop_words = set(stopwords.words('english'))
    return [word for word in tokens if word not in stop_words]

filtered_tokens = remove_stopwords(tokens)
print(filtered_tokens)

['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', 'sentence', 'contains', 'every', 'letter', 'english', 'alphabet', 'natural', 'language', 'processing', 'nlp', 'fascinating', 'field', 'study', 'combines', 'linguistics', 'computer', 'science', 'artificial', 'intelligence', 'analyze', 'understand', 'human', 'language', 'nlp', 'used', 'wide', 'range', 'applications', 'chatbots', 'machine', 'translation', 'article', 'explore', 'basics', 'nlp', 'works', 'also', 'discuss', 'common', 'nlp', 'techniques', 'tools', 'lets', 'get', 'started']
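Stop-word removal is a plain membership filter, so it is easy to adapt. The sketch below uses a tiny hypothetical stop list (`STOP_WORDS`) instead of NLTK's and adds a `keep` parameter, which is an assumption of this example, to protect words such as negations that can change the meaning of a sentence.

```python
# Hypothetical mini stop list for illustration; NLTK's English list is much longer
STOP_WORDS = {"the", "is", "a", "of", "not", "this"}

def remove_stopwords_keep(tokens, keep=frozenset()):
    # Words listed in `keep` survive even if they are stop words
    return [t for t in tokens if t not in STOP_WORDS or t in keep]

demo_tokens = ["this", "is", "not", "a", "fascinating", "field"]
print(remove_stopwords_keep(demo_tokens))
print(remove_stopwords_keep(demo_tokens, keep={"not"}))
```

Whether to keep negations depends on the downstream task; for sentiment-style tasks they usually matter, while for topic-style tasks they often do not.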


##### 4. Data Transformation Techniques
- Stemming
- Lemmatization

In [67]:
def stem_words(tokens):
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in tokens]

stemmed_tokens = stem_words(filtered_tokens)
print(stemmed_tokens)

['quick', 'brown', 'fox', 'jump', 'lazi', 'dog', 'sentenc', 'contain', 'everi', 'letter', 'english', 'alphabet', 'natur', 'languag', 'process', 'nlp', 'fascin', 'field', 'studi', 'combin', 'linguist', 'comput', 'scienc', 'artifici', 'intellig', 'analyz', 'understand', 'human', 'languag', 'nlp', 'use', 'wide', 'rang', 'applic', 'chatbot', 'machin', 'translat', 'articl', 'explor', 'basic', 'nlp', 'work', 'also', 'discuss', 'common', 'nlp', 'techniqu', 'tool', 'let', 'get', 'start']


In [68]:
def lemmatize_words(tokens):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in tokens]

lemmatized_tokens = lemmatize_words(filtered_tokens)
print(lemmatized_tokens)

['quick', 'brown', 'fox', 'jump', 'lazy', 'dog', 'sentence', 'contains', 'every', 'letter', 'english', 'alphabet', 'natural', 'language', 'processing', 'nlp', 'fascinating', 'field', 'study', 'combine', 'linguistics', 'computer', 'science', 'artificial', 'intelligence', 'analyze', 'understand', 'human', 'language', 'nlp', 'used', 'wide', 'range', 'application', 'chatbots', 'machine', 'translation', 'article', 'explore', 'basic', 'nlp', 'work', 'also', 'discus', 'common', 'nlp', 'technique', 'tool', 'let', 'get', 'started']
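In the lemmatizer output above, 'contains', 'used' and 'started' are unchanged and 'discuss' has become 'discus'. A likely cause is that `lemmatize` treats every word as a noun unless a part of speech is passed. The sketch below maps Penn Treebank tags to the single-letter WordNet codes; `penn_to_wordnet` is a hypothetical helper, and the commented usage line assumes the notebook's NLTK objects are available.

```python
def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to WordNet's single-letter POS code
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else falls back to noun

for tag in ["NN", "VBZ", "JJ", "RB", "VBN"]:
    print(tag, "->", penn_to_wordnet(tag))

# Assumed usage with the notebook's objects (needs NLTK and its WordNet data):
# lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
```

This requires tagging before lemmatizing, ideally on the original sentences, since tags assigned to isolated lowercase tokens tend to be less reliable.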


##### 5. Data Enrichment Techniques
- Part-of-Speech Tagging (PoS)
- Word Sense Disambiguation
- Noun Phrase Extraction
- Named-entity Recognition

In [69]:
def pos_tagging(tokens):
    # Tag every token, then keep only nouns and verbs
    tagged = nltk.pos_tag(tokens)  # local name avoids shadowing the imported pos_tag
    pos_tagged_noun_verb = []
    for word, tag in tagged:
        if tag in ["NN", "NNP", "NNS", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]:
            pos_tagged_noun_verb.append((word, tag))
    return pos_tagged_noun_verb

pos_tagged_text = pos_tagging(lemmatized_tokens)
print(pos_tagged_text)

[('brown', 'NN'), ('jump', 'NN'), ('lazy', 'NN'), ('dog', 'NN'), ('sentence', 'NN'), ('contains', 'VBZ'), ('letter', 'NN'), ('alphabet', 'NN'), ('language', 'NN'), ('processing', 'NN'), ('field', 'NN'), ('study', 'NN'), ('combine', 'VBP'), ('linguistics', 'NNS'), ('computer', 'NN'), ('science', 'NN'), ('intelligence', 'NN'), ('analyze', 'NN'), ('understand', 'VBP'), ('language', 'NN'), ('nlp', 'NN'), ('used', 'VBN'), ('range', 'NN'), ('application', 'NN'), ('chatbots', 'NNS'), ('machine', 'NN'), ('translation', 'NN'), ('article', 'NN'), ('explore', 'VBP'), ('nlp', 'NN'), ('work', 'NN'), ('nlp', 'NN'), ('technique', 'NN'), ('tool', 'NN'), ('let', 'VBD'), ('get', 'VB'), ('started', 'VBN')]
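The fine-grained Penn Treebank tags can be collapsed into coarse classes by prefix, which also covers tags such as `NNPS` that an explicit list may leave out. A minimal sketch follows, using a few pairs copied from the output above; `coarse_tag` is a hypothetical helper.

```python
from collections import Counter

# A few (word, tag) pairs copied from the output above
tagged_sample = [("dog", "NN"), ("contains", "VBZ"), ("linguistics", "NNS"),
                 ("combine", "VBP"), ("used", "VBN"), ("machine", "NN")]

def coarse_tag(tag):
    # Collapse fine-grained Penn Treebank tags into NOUN / VERB / OTHER
    if tag.startswith("NN"):
        return "NOUN"
    if tag.startswith("VB"):
        return "VERB"
    return "OTHER"

counts = Counter(coarse_tag(tag) for _, tag in tagged_sample)
print(counts)
```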


In [70]:
def named_entity_recognition(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
   
    ne_tree = ne_chunk(pos_tags)

    named_entities = []
    for subtree in ne_tree:
        if isinstance(subtree, Tree):
            entity_type = subtree.label()
            entity_value = " ".join([word for word, tag in subtree.leaves()])
            named_entities.append((entity_type, entity_value))
   
    return named_entities

named_entity_text = named_entity_recognition(ARTICLE)
print("\nNamed entities found:")
print(named_entity_text)


Named entities found:
[('GPE', 'English'), ('ORGANIZATION', 'NLP'), ('ORGANIZATION', 'NLP'), ('ORGANIZATION', 'NLP'), ('ORGANIZATION', 'NLP')]
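The result lists `NLP` once per mention. If only the distinct entities are of interest, the pairs can be counted instead. This is a small sketch over the output above, with `demo_entities` as a hypothetical copy of that list. Note that the chunker labels the acronym `NLP` as an organization, so the labels are worth reviewing by hand.

```python
from collections import Counter

# Entities copied from the output above
demo_entities = [("GPE", "English"), ("ORGANIZATION", "NLP"), ("ORGANIZATION", "NLP"),
                 ("ORGANIZATION", "NLP"), ("ORGANIZATION", "NLP")]

entity_counts = Counter(demo_entities)
for (label, value), count in entity_counts.most_common():
    print(f"{label:<13} {value:<8} {count}")
```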


In [71]:
def word_sense_disambiguation(sentence, word):
    # lesk expects the context as a list of tokens, not a raw string
    return lesk(word_tokenize(sentence), word)

sense_disambiguation = word_sense_disambiguation(ARTICLE, "nlp")
print(sense_disambiguation)

Synset('natural_language_processing.n.01')
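The Lesk idea is easier to see with an ambiguous word. The sketch below is a simplified version under stated assumptions: the two glosses for "bank" are hypothetical and written for this example (real Lesk reads glosses from WordNet), and `simplified_lesk` simply picks the sense whose gloss shares the most words with the context.

```python
# Hypothetical glosses for illustration; real Lesk reads them from WordNet
GLOSSES = {
    "bank.river": "sloping land beside a body of water such as a river",
    "bank.money": "financial institution that accepts deposits and lends money",
}

def simplified_lesk(context_tokens, glosses):
    # Pick the sense whose gloss shares the most words with the context
    context = set(context_tokens)
    return max(glosses, key=lambda sense: len(context & set(glosses[sense].split())))

print(simplified_lesk("she deposits money at the bank".split(), GLOSSES))
print(simplified_lesk("they fished from the river bank".split(), GLOSSES))
```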


In [72]:
def noun_phrase_extraction(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    noun_phrases = []
    current_phrase = []
    for word, tag in pos_tags: 
        if tag.startswith('NN') or tag.startswith('JJ'):
            current_phrase.append(word)
        else:
            if current_phrase:
                noun_phrases.append(' '.join(current_phrase))
                current_phrase = []
    # Add the last phrase if it exists
    if current_phrase:
        noun_phrases.append(' '.join(current_phrase))
    return noun_phrases

noun_phrases = noun_phrase_extraction(ARTICLE)

print("Extracted noun phrases:")
for phrase in noun_phrases:  # avoid the name `np`, commonly used as the NumPy alias
    print(f"- {phrase}")

Extracted noun phrases:
- quick brown fox
- lazy dog
- sentence
- letter
- English alphabet
- Natural Language Processing
- NLP
- fascinating field
- study
- linguistics
- computer science
- artificial intelligence
- human language
- NLP
- wide range
- applications
- chatbots
- machine translation
- article
- basics
- NLP
- common NLP techniques
- tools
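The grouping logic used for noun phrase extraction can also be written with `itertools.groupby`, which makes the "consecutive run of NN/JJ tags" idea explicit. This is a sketch on a hand-tagged sample; the tags are an assumption of this example rather than tagger output, and `group_noun_phrases` is a hypothetical name.

```python
from itertools import groupby

def group_noun_phrases(tagged):
    # Group consecutive tokens by whether their tag starts with NN or JJ
    is_np_tag = lambda pair: pair[1].startswith(("NN", "JJ"))
    return [" ".join(word for word, _ in run)
            for keep, run in groupby(tagged, key=is_np_tag) if keep]

# Hand-tagged sample (tags are an assumption, not tagger output)
hand_tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
               ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
print(group_noun_phrases(hand_tagged))
```

Like the function above, this would also return a run made only of adjectives, so a stricter version might require at least one noun tag per phrase.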
