
***Sara Gul***

#**Comprehensive NLP Preprocessing**


#Introduction

Natural Language Processing (NLP) is a crucial field in artificial intelligence that focuses on the interaction between computers and human language. Preprocessing is a vital step in any NLP project, as it prepares raw text data for further analysis or model training. This notebook will guide you through a comprehensive set of preprocessing techniques using a paragraph as an example.

#Setup
Install the necessary libraries, then import them:

In [None]:
!pip install nltk spacy scikit-learn pyspellchecker
!python -m spacy download en_core_web_sm

Collecting pyspellchecker
  Downloading pyspellchecker-0.8.1-py3-none-any.whl.metadata (9.4 kB)
Downloading pyspellchecker-0.8.1-py3-none-any.whl (6.8 MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.8.1
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or

In [None]:
import re
import nltk
import spacy
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
from sklearn.feature_extraction.text import TfidfVectorizer
from spellchecker import SpellChecker

# Download required NLTK data
# (newer NLTK releases may additionally need 'punkt_tab' and 'averaged_perceptron_tagger_eng')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


##Sample Paragraph

In [None]:
paragraph = """
In 2023, AI technolgy advanced rapidley! ChatGPT-4 amazd users with its
capabilities, while Tesla's self-driving cars loged 1,000,000+ miles.
The globel AI market reached $150 billion. Despite concerns, 78 percent of
companies planed to increase AI investments by 2025.
"""

print("Original paragraph:")
print(paragraph)

Original paragraph:

In 2023, AI technolgy advanced rapidley! ChatGPT-4 amazd users with its
capabilities, while Tesla's self-driving cars loged 1,000,000+ miles.
The globel AI market reached $150 billion. Despite concerns, 78 percent of
companies planed to increase AI investments by 2025.



#**Preprocessing Steps**

#1. Lowercase Conversion
Converting all text to lowercase standardizes it, so that variants such as "AI" and "ai" are treated as the same token:

In [None]:
def to_lowercase(text):
    return text.lower()

lowercase_text = to_lowercase(paragraph)
print("\nLowercase text:")
print(lowercase_text)


Lowercase text:

in 2023, ai technolgy advanced rapidley! chatgpt-4 amazd users with its
capabilities, while tesla's self-driving cars loged 1,000,000+ miles.
the globel ai market reached $150 billion. despite concerns, 78 percent of
companies planed to increase ai investments by 2025.
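
For text that is not purely English, `str.lower()` can leave some characters distinct that should compare as equal. The sketch below is an optional variation rather than part of the pipeline above; `normalize_case` is a hypothetical helper name, and it relies on Python's built-in `str.casefold()`.

```python
def normalize_case(text):
    # casefold() is a more aggressive form of lowercasing meant for caseless matching
    return text.casefold()

# lower() keeps the German sharp s, casefold() expands it
print("Straße".lower())
print(normalize_case("Straße"))

# For plain ASCII text such as the sample paragraph, both give the same result
print(normalize_case("ChatGPT-4") == "ChatGPT-4".lower())
```

For the English sample paragraph used here the two methods behave identically, so `lower()` remains a reasonable choice.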



#2. Remove Punctuation
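We'll use a regex that deletes every character that is neither a word character nor whitespace. Note that this also joins hyphenated words, so "self-driving" becomes "selfdriving":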

In [None]:
def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

no_punct_text = remove_punctuation(lowercase_text)
print("\nText after removing punctuation:")
print(no_punct_text)


Text after removing punctuation:

in 2023 ai technolgy advanced rapidley chatgpt4 amazd users with its
capabilities while teslas selfdriving cars loged 1000000 miles
the globel ai market reached 150 billion despite concerns 78 percent of
companies planed to increase ai investments by 2025
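
If joined tokens such as "selfdriving" and "chatgpt4" are undesirable, one possible variation is to replace most punctuation with a space instead of deleting it. The function name `remove_punctuation_spaced` and the choice of which characters to delete outright are assumptions made for this sketch, not part of the notebook's pipeline.

```python
import re

def remove_punctuation_spaced(text):
    # Delete apostrophes inside words so "tesla's" stays a single token
    text = re.sub(r"(?<=\w)'(?=\w)", "", text)
    # Delete commas between digits so "1,000,000" stays a single number
    text = re.sub(r"(?<=\d),(?=\d)", "", text)
    # Replace every remaining punctuation character with a space
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace left behind by the replacements
    return re.sub(r"\s+", " ", text).strip()

sample = "chatgpt-4 amazd users, while tesla's self-driving cars loged 1,000,000+ miles."
print(remove_punctuation_spaced(sample))
```

With this variant "self-driving" is split into "self" and "driving" rather than merged, which may or may not suit a given downstream task.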



#3. Tokenization
Tokenization breaks down the text into individual words:

In [None]:
def tokenize_text(text):
    return word_tokenize(text)

tokens = tokenize_text(no_punct_text)
print("\nTokenized text:")
print(tokens)


Tokenized text:
['in', '2023', 'ai', 'technolgy', 'advanced', 'rapidley', 'chatgpt4', 'amazd', 'users', 'with', 'its', 'capabilities', 'while', 'teslas', 'selfdriving', 'cars', 'loged', '1000000', 'miles', 'the', 'globel', 'ai', 'market', 'reached', '150', 'billion', 'despite', 'concerns', '78', 'percent', 'of', 'companies', 'planed', 'to', 'increase', 'ai', 'investments', 'by', '2025']
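
To illustrate what a tokenizer has to decide when punctuation has not been stripped first, here is a minimal regex-based sketch. It is not how `word_tokenize` works internally; `TOKEN_PATTERN` and `simple_tokenize` are hypothetical names, and the pattern is an assumption chosen to keep hyphenated words, possessives, and comma-grouped numbers together.

```python
import re

# A word, optionally continued by "-", "'" or "," followed by more word characters,
# or else a single punctuation character on its own
TOKEN_PATTERN = re.compile(r"\w+(?:[-',]\w+)*|[^\w\s]")

def simple_tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("Tesla's self-driving cars logged 1,000,000+ miles."))
```

Here punctuation survives as separate tokens, so it can be filtered after tokenization instead of before.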


#4. Correct Misspellings
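We'll use the pyspellchecker library to correct misspellings. Corrections are chosen automatically and are not always the intended word: in the output below, "loged" becomes "loved" rather than "logged", and "planed" is left unchanged because it is itself a valid English word: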

In [None]:
def correct_spellings(tokens):
    spell = SpellChecker()
    # correction() returns None when it finds no candidate, so fall back to the original word
    return [spell.correction(word) or word for word in tokens]

corrected_tokens = correct_spellings(tokens)
print("\nTokens after spell correction:")
print(corrected_tokens)


Tokens after spell correction:
['in', '2023', 'ai', 'technology', 'advanced', 'rapidly', 'chatgpt4', 'amazed', 'users', 'with', 'its', 'capabilities', 'while', 'teslas', 'selfdriving', 'cars', 'loved', '1000000', 'miles', 'the', 'global', 'ai', 'market', 'reached', '150', 'billion', 'despite', 'concerns', '78', 'percent', 'of', 'companies', 'planed', 'to', 'increase', 'ai', 'investments', 'by', '2025']
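
The "loved" result can be understood with a toy version of edit-distance correction, the general approach pyspellchecker is based on. The sketch below is a simplified illustration and not the library's code: the vocabulary `WORD_FREQ` and its frequencies are made up, and `edits1` and `correct` are hypothetical helper names.

```python
import string

# Hypothetical toy vocabulary with made-up frequencies (illustration only)
WORD_FREQ = {"logged": 5, "loved": 50, "planned": 20, "planed": 1, "global": 30}

def edits1(word):
    # All strings one edit (delete, transpose, replace, insert) away from word
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    # Known words are returned unchanged, even if they are not what the writer meant
    if word in WORD_FREQ:
        return word
    candidates = [w for w in edits1(word) if w in WORD_FREQ]
    # Among candidates one edit away, pick the most frequent; otherwise keep the word
    return max(candidates, key=WORD_FREQ.get) if candidates else word

for token in ["loged", "planed", "globel"]:
    print(token, "->", correct(token))
```

Both "logged" and "loved" are one edit away from "loged", so the more frequent word wins, while "planed" is kept because it is already in the vocabulary.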


#5. Remove Stopwords
Stopwords are common words that often don't contribute much to the meaning:

In [None]:
def remove_stopwords(tokens):
    stop_words = set(stopwords.words('english'))
    return [word for word in tokens if word not in stop_words]

filtered_tokens = remove_stopwords(corrected_tokens)
print("\nTokens after removing stopwords:")
print(filtered_tokens)


Tokens after removing stopwords:
['2023', 'ai', 'technology', 'advanced', 'rapidly', 'chatgpt4', 'amazed', 'users', 'capabilities', 'teslas', 'selfdriving', 'cars', 'loved', '1000000', 'miles', 'global', 'ai', 'market', 'reached', '150', 'billion', 'despite', 'concerns', '78', 'percent', 'companies', 'planed', 'increase', 'ai', 'investments', '2025']
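
Stopword lists are not neutral: NLTK's English list includes negations such as "not" and "no", which matter for tasks like sentiment analysis. The sketch below shows one way to keep selected words; `BASE_STOPWORDS` is a small hypothetical stand-in for `stopwords.words('english')`, and `remove_stopwords_keep` is an illustrative name.

```python
# Hypothetical stand-in for the NLTK stopword list (illustration only)
BASE_STOPWORDS = {"in", "with", "its", "while", "the", "of", "to", "by", "do", "not", "no"}
NEGATIONS = {"not", "no"}

def remove_stopwords_keep(tokens, stop_words=BASE_STOPWORDS, keep=NEGATIONS):
    # Remove the words we want to keep from the stopword set before filtering
    effective = set(stop_words) - set(keep)
    return [token for token in tokens if token not in effective]

tokens = ["despite", "concerns", "companies", "do", "not", "plan", "to", "cut", "ai", "investments"]
print(remove_stopwords_keep(tokens))
```

The same pattern works in the other direction: domain-specific filler words can be added to the set with a union.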


#6. Stemming
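Stemming strips suffixes heuristically to reduce words to a root form, which is not always a real word (for example, "technology" becomes "technolog"):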

In [None]:
def stem_words(tokens):
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in tokens]

stemmed_tokens = stem_words(filtered_tokens)
print("\nStemmed tokens:")
print(stemmed_tokens)


Stemmed tokens:
['2023', 'ai', 'technolog', 'advanc', 'rapidli', 'chatgpt4', 'amaz', 'user', 'capabl', 'tesla', 'selfdriv', 'car', 'love', '1000000', 'mile', 'global', 'ai', 'market', 'reach', '150', 'billion', 'despit', 'concern', '78', 'percent', 'compani', 'plane', 'increas', 'ai', 'invest', '2025']
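
To make the idea of suffix stripping concrete, here is a deliberately naive stemmer. It is not the Porter algorithm, which applies several ordered rule phases with extra conditions; the suffix list and the `naive_stem` name are assumptions for illustration, so its output differs from `PorterStemmer` on many words.

```python
# Hypothetical suffix list, checked in order (illustration only)
SUFFIXES = ("ies", "ing", "ed", "ly", "s")

def naive_stem(word, min_stem=3):
    # Strip the first matching suffix, provided enough of the word remains
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["companies", "reached", "rapidly", "cars", "ai"]
print([naive_stem(word) for word in words])
```

Even this crude version shows why stems such as "compan" are not dictionary words: the rules only look at word endings, not at meaning.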


#7. Lemmatization
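Lemmatization is similar to stemming but maps words to dictionary forms (lemmas). Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given, which is why verb forms such as "advanced" and "reached" are unchanged below: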

In [None]:
def lemmatize_words(tokens):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in tokens]

lemmatized_tokens = lemmatize_words(filtered_tokens)
print("\nLemmatized tokens:")
print(lemmatized_tokens)


Lemmatized tokens:
['2023', 'ai', 'technology', 'advanced', 'rapidly', 'chatgpt4', 'amazed', 'user', 'capability', 'tesla', 'selfdriving', 'car', 'loved', '1000000', 'mile', 'global', 'ai', 'market', 'reached', '150', 'billion', 'despite', 'concern', '78', 'percent', 'company', 'planed', 'increase', 'ai', 'investment', '2025']


#8. Parts of Speech Tagging
POS tagging assigns grammatical categories to each word:

In [None]:
def pos_tagging(tokens):
    return pos_tag(tokens)

pos_tagged = pos_tagging(lemmatized_tokens)
print("\nPOS tagged tokens:")
print(pos_tagged)


POS tagged tokens:
[('2023', 'CD'), ('ai', 'NN'), ('technology', 'NN'), ('advanced', 'VBD'), ('rapidly', 'RB'), ('chatgpt4', 'JJ'), ('amazed', 'VBN'), ('user', 'NN'), ('capability', 'NN'), ('tesla', 'VBP'), ('selfdriving', 'VBG'), ('car', 'NN'), ('loved', 'VBD'), ('1000000', 'CD'), ('mile', 'NN'), ('global', 'JJ'), ('ai', 'NN'), ('market', 'NN'), ('reached', 'VBD'), ('150', 'CD'), ('billion', 'CD'), ('despite', 'IN'), ('concern', 'NN'), ('78', 'CD'), ('percent', 'NN'), ('company', 'NN'), ('planed', 'VBD'), ('increase', 'NN'), ('ai', 'NN'), ('investment', 'NN'), ('2025', 'CD')]
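
POS tags can also feed back into lemmatization, because `WordNetLemmatizer.lemmatize` accepts a `pos` argument ('n', 'v', 'a' or 'r'). A common approach, sketched below under that assumption, is to map the Penn Treebank tags returned by `pos_tag` to those codes; `penn_to_wordnet` is a hypothetical helper name, and the sample pairs are copied from the output above.

```python
def penn_to_wordnet(tag):
    # Map the first letter of a Penn Treebank tag to a WordNet POS code
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default to noun, as the lemmatizer itself does

tagged = [("advanced", "VBD"), ("rapidly", "RB"), ("global", "JJ"), ("car", "NN"), ("2023", "CD")]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)

# With NLTK available, each pair could then be lemmatized like this:
# lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
```

Doing this requires tagging before lemmatizing, which is the reverse of the order used in this notebook.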


#9. Named Entity Recognition
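NER identifies and classifies named entities in the text. Here it is run on the lowercased, lemmatized tokens joined back into a string, so capitalization cues are lost; "tesla", for instance, is not recognized in the output below: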

In [None]:
def named_entity_recognition(text):
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

ner_results = named_entity_recognition(" ".join(lemmatized_tokens))
print("\nNamed Entities:")
print(ner_results)


Named Entities:
[('2023', 'DATE'), ('chatgpt4', 'ORG'), ('1000000 mile', 'QUANTITY'), ('150 billion', 'CARDINAL'), ('78 percent', 'PERCENT'), ('2025', 'DATE')]


#10. Compound Term Extraction
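We'll use spaCy to extract compound terms (noun chunks). Because the input has already had stopwords and punctuation removed, the parser has few phrase boundaries to work with and some chunks come out very long: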

In [None]:
def extract_compound_terms(text):
    doc = nlp(text)
    return [chunk.text for chunk in doc.noun_chunks]

compound_terms = extract_compound_terms(" ".join(lemmatized_tokens))
print("\nCompound Terms:")
print(compound_terms)


Compound Terms:
['2023 ai technology', 'rapidly chatgpt4 amazed user capability tesla selfdriving car', '1000000 mile global ai market', 'concern 78 percent company', 'increase', 'investment']
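
Noun chunks depend on a syntactic parse. A lighter alternative technique, which is not what the cell above does, is to count adjacent token pairs (bigrams) and treat frequent pairs as candidate compound terms. The sketch below uses only the standard library; `count_bigrams` and the short token list are made up for illustration.

```python
from collections import Counter

def count_bigrams(tokens):
    # Pair each token with its successor and count how often each pair occurs
    return Counter(zip(tokens, tokens[1:]))

tokens = ["ai", "market", "reached", "150", "billion", "global", "ai", "market"]
bigrams = count_bigrams(tokens)
print(bigrams.most_common(1))
```

On a single short paragraph almost every bigram occurs once, so this approach only becomes informative on larger collections of text.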


#11. Term Frequency (TF) Calculation
Term Frequency measures how frequently a term occurs in a document.

In [None]:
from collections import Counter

def calculate_tf(text):
    # text is a list of tokens; TF = occurrences of a term / total number of tokens
    total_terms = len(text)
    return {word: count / total_terms for word, count in Counter(text).items()}

# Calculate TF for our preprocessed text
tf_scores = calculate_tf(lemmatized_tokens)
print("\nTerm Frequency (TF) Scores:")
for term, score in tf_scores.items():
    print(f"{term}: {score:.4f}")


Term Frequency (TF) Scores:
2023: 0.0323
ai: 0.0968
technology: 0.0323
advanced: 0.0323
rapidly: 0.0323
chatgpt4: 0.0323
amazed: 0.0323
user: 0.0323
capability: 0.0323
tesla: 0.0323
selfdriving: 0.0323
car: 0.0323
loved: 0.0323
1000000: 0.0323
mile: 0.0323
global: 0.0323
market: 0.0323
reached: 0.0323
150: 0.0323
billion: 0.0323
despite: 0.0323
concern: 0.0323
78: 0.0323
percent: 0.0323
company: 0.0323
planed: 0.0323
increase: 0.0323
investment: 0.0323
2025: 0.0323


#12. Inverse Document Frequency (IDF) Calculation
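Standard IDF is defined over a corpus: the logarithm of the number of documents divided by the number of documents containing the term. With a single document that value would be zero for every term, so this notebook uses a single-document substitute: the logarithm of the total number of terms divided by the number of times a specific term appears.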

In [None]:
import math

def calculate_idf(text):
    # Single-document stand-in for IDF: log(total tokens / occurrences of the term)
    total_terms = len(text)
    term_count = Counter(text)
    return {term: math.log(total_terms / count) for term, count in term_count.items()}

# Calculate IDF for our preprocessed text
idf_scores = calculate_idf(lemmatized_tokens)
print("\nInverse Document Frequency (IDF) Scores:")
for term, score in idf_scores.items():
    print(f"{term}: {score:.4f}")


Inverse Document Frequency (IDF) Scores:
2023: 3.4340
ai: 2.3354
technology: 3.4340
advanced: 3.4340
rapidly: 3.4340
chatgpt4: 3.4340
amazed: 3.4340
user: 3.4340
capability: 3.4340
tesla: 3.4340
selfdriving: 3.4340
car: 3.4340
loved: 3.4340
1000000: 3.4340
mile: 3.4340
global: 3.4340
market: 3.4340
reached: 3.4340
150: 3.4340
billion: 3.4340
despite: 3.4340
concern: 3.4340
78: 3.4340
percent: 3.4340
company: 3.4340
planed: 3.4340
increase: 3.4340
investment: 3.4340
2025: 3.4340
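
For comparison, the usual corpus-level definition of IDF counts documents rather than tokens. The following is a minimal sketch of that definition using the unsmoothed form log(N / df); the three tiny documents and the `corpus_idf` name are hypothetical, and libraries such as scikit-learn apply smoothing by default, so their values differ.

```python
import math
from collections import Counter

def corpus_idf(documents):
    # documents is a list of token lists; df counts the documents containing each term
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # set() so a term counts once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

docs = [
    ["ai", "market", "global"],
    ["ai", "car", "tesla"],
    ["ai", "investment", "market"],
]
idf = corpus_idf(docs)
for term in ["ai", "market", "car"]:
    print(f"{term}: {idf[term]:.4f}")
```

A term that appears in every document ("ai" here) gets an IDF of zero, which is how IDF downweights words that do not help distinguish documents.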


#13. TF-IDF Calculation
Now we'll combine TF and IDF by multiplying them to get the TF-IDF score of each term, which measures the importance of a word in the document.

In [None]:
def calculate_tfidf(tf_scores, idf_scores):
    tfidf_scores = {}
    for word, tf_score in tf_scores.items():
        tfidf_scores[word] = tf_score * idf_scores[word]
    return tfidf_scores

# Calculate TF-IDF
tfidf_scores = calculate_tfidf(tf_scores, idf_scores)
print("\nTF-IDF Scores:")
for term, score in sorted(tfidf_scores.items(), key=lambda x: x[1], reverse=True):
    print(f"{term}: {score:.4f}")


TF-IDF Scores:
ai: 0.2260
2023: 0.1108
technology: 0.1108
advanced: 0.1108
rapidly: 0.1108
chatgpt4: 0.1108
amazed: 0.1108
user: 0.1108
capability: 0.1108
tesla: 0.1108
selfdriving: 0.1108
car: 0.1108
loved: 0.1108
1000000: 0.1108
mile: 0.1108
global: 0.1108
market: 0.1108
reached: 0.1108
150: 0.1108
billion: 0.1108
despite: 0.1108
concern: 0.1108
78: 0.1108
percent: 0.1108
company: 0.1108
planed: 0.1108
increase: 0.1108
investment: 0.1108
2025: 0.1108
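
The scores above can be checked by hand. Writing $f_t$ for the number of times term $t$ occurs and $N = 31$ for the total number of tokens in the lemmatized list, the formulas used in this notebook (with the single-document form of IDF) work out as follows for "ai", which occurs three times, and for any term that occurs once:

```latex
\mathrm{tf}(t) = \frac{f_t}{N}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{f_t}, \qquad
\mathrm{tfidf}(t) = \mathrm{tf}(t)\cdot\mathrm{idf}(t)

% "ai": f_t = 3
\mathrm{tf} = \frac{3}{31} \approx 0.0968, \qquad
\mathrm{idf} = \ln\frac{31}{3} \approx 2.3354, \qquad
\mathrm{tfidf} \approx 0.0968 \times 2.3354 \approx 0.2260

% any term with f_t = 1
\mathrm{tf} = \frac{1}{31} \approx 0.0323, \qquad
\mathrm{idf} = \ln 31 \approx 3.4340, \qquad
\mathrm{tfidf} \approx 0.0323 \times 3.4340 \approx 0.1108
```

Because every term except "ai" occurs exactly once, all of them share the same score, which is why the ranking above has only two distinct values.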


#**Conclusion**
In this notebook, we've demonstrated a comprehensive set of NLP preprocessing techniques on a custom paragraph. These steps are crucial for cleaning, standardizing, and extracting meaningful information from text data before further analysis or model training. Each step serves a specific purpose:

1.   Lowercase conversion: Standardizes text
2.   Punctuation removal: Cleans the text
3.   Tokenization: Breaks text into individual units
4.   Spell correction: Fixes misspellings
5.   Stopword removal: Eliminates less informative words
6.   Stemming: Reduces words to their root form
7.   Lemmatization: Produces meaningful root forms
8.   POS tagging: Assigns grammatical categories
9.   Named Entity Recognition: Identifies and classifies named entities
10.  Compound Term Extraction: Identifies multi-word expressions
11.  Term Frequency (TF) Calculation: Computes how often a word appears in the document
12.  Inverse Document Frequency (IDF) Calculation: Measures the importance of a word within the document
13.  TF-IDF Calculation: Combines TF and IDF to compute the importance of words in the document

By applying these preprocessing steps, we've transformed our original paragraph into a format that's more suitable for various NLP tasks.