
1️⃣ Tokenization (Splitting Text into Words or Sentences)

Tokenization is the process of breaking text into smaller units (words or sentences).

In [1]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

In [4]:
nltk.download('punkt')
nltk.download('punkt_tab')

text = "Natural Language Processing (NLP) is amazing! It helps machines understand human language."

# Tokenizing words
words = word_tokenize(text)
print("Word Tokenization:", words)

# Tokenizing sentences
sentences = sent_tokenize(text)
print("Sentence Tokenization:", sentences)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Word Tokenization: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'amazing', '!', 'It', 'helps', 'machines', 'understand', 'human', 'language', '.']
Sentence Tokenization: ['Natural Language Processing (NLP) is amazing!', 'It helps machines understand human language.']


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


2️⃣ Stopword Removal (Filtering Out Common Words)

Stopwords are common words (e.g., "the", "is", "in", "and") that don’t add much meaning.

In [5]:
from nltk.corpus import stopwords

nltk.download('stopwords')

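words = ["this", "is", "an", "example", "of", "text", "processing"]

# Build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]

print("Filtered Words:", filtered_words)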

Filtered Words: ['example', 'text', 'processing']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
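The filtering step itself is just a membership test. The sketch below assumes a tiny hand-picked stopword set (`TOY_STOPWORDS`, a hypothetical name, far smaller than NLTK's English list) and also drops punctuation tokens, which is a common companion step when the input comes straight from a tokenizer.

```python
# Assumption: a tiny illustrative stopword set, not NLTK's English list.
TOY_STOPWORDS = {"is", "it", "this", "an", "of", "the", "in", "and"}

def remove_stopwords(tokens, stopword_set=TOY_STOPWORDS):
    kept = []
    for token in tokens:
        # Skip tokens with no letters or digits (pure punctuation).
        if not any(ch.isalnum() for ch in token):
            continue
        # Compare in lowercase so "It" matches "it".
        if token.lower() in stopword_set:
            continue
        kept.append(token)
    return kept

tokens = ["Natural", "Language", "Processing", "is", "amazing", "!",
          "It", "helps", "machines", "understand", "human", "language", "."]
content_words = remove_stopwords(tokens)
print("Content words:", content_words)
```

Using a set rather than a list keeps each lookup constant-time, which matters once the token list grows beyond a toy example.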


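3️⃣ Stemming (Reducing Words to Root Form)

Stemming chops off word endings using heuristic rules. It is fast, but the resulting stem is not always a real word.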


In [6]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runs", "running", "easily", "fairly"]
stemmed_words = [stemmer.stem(word) for word in words]

print("Stemmed Words:", stemmed_words)

Stemmed Words: ['run', 'run', 'run', 'easili', 'fairli']
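To see how rule-based suffix stripping works in principle, here is a toy stemmer. It is a sketch only and not the Porter algorithm: the suffix list, the minimum stem length, and the `toy_stem` name are all assumptions made for illustration.

```python
# Assumption: three suffixes are enough for this illustration.
SUFFIXES = ("ing", "ly", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        # Only strip when the word ends in the suffix and a stem of at
        # least three characters is left over.
        if word.endswith(suffix) and len(stem) >= 3:
            # Undo consonant doubling, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word

words = ["running", "runs", "running", "easily", "fairly"]
toy_stems = [toy_stem(word) for word in words]
print("Toy stems:", toy_stems)
```

The toy rules give `easi` and `fair` where the Porter stemmer above gave `easili` and `fairli`, which shows that different rule sets can produce different stems for the same word.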


4️⃣ Lemmatization (Getting Base Form of Words)

Lemmatization converts words to their dictionary form (lemma), taking the word's part of speech and meaning into account.

In [7]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
words = ["running", "flies", "better", "wolves"]

lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print("Lemmatized Words:", lemmatized_words)

[nltk_data] Downloading package wordnet to /root/nltk_data...


Lemmatized Words: ['running', 'fly', 'better', 'wolf']
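In the output above, "running" and "better" are unchanged because `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is passed. The sketch below mimics that behaviour with a four-entry lookup table so that the effect of the part of speech is easy to see. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names, and a real lemmatizer consults a full dictionary rather than a hand-written table.

```python
# Assumption: a tiny toy dictionary keyed by (word, part of speech).
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("better", "a"): "good",
    ("flies", "n"): "fly",
    ("wolves", "n"): "wolf",
}

def toy_lemmatize(word, pos="n"):
    # Default to noun, and fall back to the word itself when no entry matches.
    return TOY_LEMMAS.get((word, pos), word)

words = ["running", "flies", "better", "wolves"]
as_nouns = [toy_lemmatize(word) for word in words]
with_pos = [toy_lemmatize("running", pos="v"), toy_lemmatize("better", pos="a")]

print("Treated as nouns:", as_nouns)
print("With explicit POS:", with_pos)
```

With NLTK itself, the equivalent calls would be `lemmatizer.lemmatize("running", pos="v")` and `lemmatizer.lemmatize("better", pos="a")`.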


5️⃣ Part-of-Speech (POS) Tagging (Identifying Word Types)

POS tagging labels each word as a noun, verb, adjective, etc.

In [12]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

# Download the necessary data package
nltk.download('averaged_perceptron_tagger_eng')

# Tokenize first, then tag the tokens with the imported pos_tag function
tokens = word_tokenize("John plays football on Sunday.")
pos_tags = pos_tag(tokens)

print("POS Tags:", pos_tags)


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


POS Tags: [('John', 'NNP'), ('plays', 'VBZ'), ('football', 'NN'), ('on', 'IN'), ('Sunday', 'NNP'), ('.', '.')]
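The tags follow the Penn Treebank convention, in which every noun tag starts with `NN` and every verb tag starts with `VB`. A short sketch of how the tagged pairs might be used downstream, with the `pos_tags` list copied from the output above so that the snippet runs without NLTK:

```python
# Tagged pairs copied from the POS tagging output above.
pos_tags = [("John", "NNP"), ("plays", "VBZ"), ("football", "NN"),
            ("on", "IN"), ("Sunday", "NNP"), (".", ".")]

# Penn Treebank noun tags (NN, NNS, NNP, NNPS) share the "NN" prefix;
# verb tags (VB, VBD, VBG, VBN, VBP, VBZ) share the "VB" prefix.
nouns = [word for word, tag in pos_tags if tag.startswith("NN")]
verbs = [word for word, tag in pos_tags if tag.startswith("VB")]

print("Nouns:", nouns)
print("Verbs:", verbs)
```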


6️⃣ Named Entity Recognition (NER) (Extracting Names, Locations, etc.)

NER identifies named entities such as people, places, and dates.

📌 Example: NER using spaCy

In [13]:
import spacy

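# Load the small English pipeline; if it is missing, install it with
# `python -m spacy download en_core_web_sm`
nlp = spacy.load("en_core_web_sm")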
text = "Elon Musk founded Tesla in 2003."

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)


Elon Musk PERSON
Tesla ORG
2003 DATE
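A common next step is to group the extracted entities by label. The sketch below uses the three (text, label) pairs from the output above so that it runs without spaCy. With a real `doc`, the same list could be built as `[(ent.text, ent.label_) for ent in doc.ents]`. The variable names are illustrative.

```python
from collections import defaultdict

# Pairs copied from the NER output above.
entities = [("Elon Musk", "PERSON"), ("Tesla", "ORG"), ("2003", "DATE")]

# Map each label to the list of entity strings that carry it.
entities_by_label = defaultdict(list)
for entity_text, label in entities:
    entities_by_label[label].append(entity_text)

print(dict(entities_by_label))
```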


Summary of Basic NLP Functions

| Function | Purpose | Example Output |
|---|---|---|
| Tokenization | Splits text into words/sentences | ['Natural', 'Language', 'Processing', 'is', 'fun', '!'] |
| Stopword Removal | Removes common words | ['example', 'text', 'processing'] |
| Stemming | Reduces words to root form | 'running' → 'run' |
| Lemmatization | Converts words to base dictionary form | 'wolves' → 'wolf' |
| POS Tagging | Labels words as noun, verb, etc. | [('John', 'NNP'), ('plays', 'VBZ')] |
| NER | Extracts people, places, organizations | 'Elon Musk' → PERSON |