
NLTK (Natural Language Toolkit) is a widely used Python library for working with human language data. It simplifies common text-processing tasks such as tokenization, part-of-speech (POS) tagging, stemming, and lemmatization, and provides tools to preprocess and analyze natural language text.


In [1]:
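import nltk

# Download the Punkt models that NLTK uses for sentence tokenization
# (splitting text into sentences); nltk.word_tokenize relies on them as well.
# Recent NLTK releases look for the 'punkt_tab' resource instead; download
# that one if word_tokenize raises a LookupError.
nltk.download('punkt')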

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string
# Initialize stemmer, stopwords, and punctuation
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)
# Your list of sentences
sentences = ["""Muhammad Ali Jinnah played a pivotal role in the creation of Pakistan, which gained
              independence from British rule on August 14, 1947.
              Jinnah was a lawyer and politician who advocated for the rights of Muslims in British India.
              He was the leader of the All-India Muslim League and was instrumental in negotiating for a
              separate homeland for Muslims,
              leading to the establishment of Pakistan as a separate nation for Muslims.
              Jinnah served as Pakistan's first Governor-General until his death in 1948.
              He is highly revered in Pakistan for his efforts in achieving independence and is often
              referred to as Quaid-e-Azam Muhammad Ali Jinnah."""]

In [None]:
# Initialize an empty list to store the cleaned sentences
q = []
for sentence in sentences:
    # Tokenize the sentence and convert to lowercase
    words = nltk.word_tokenize(sentence.lower())
    # Drop tokens that are a single punctuation character
    words = [word for word in words if word not in punctuation]
    # Stem each word, skipping stopwords and non-alphanumeric tokens
    # (isalnum() also drops hyphenated tokens and tokens such as '1947.')
    words = [stemmer.stem(word) for word in words if word not in stop_words and word.isalnum()]
    # Join the words back into a sentence
    cleaned_sentence = ' '.join(words)
    # Append the cleaned sentence to the list
    q.append(cleaned_sentence)
print(q)

['muhammad ali jinnah play pivot role creation pakistan gain independ british rule august 14 jinnah lawyer politician advoc right muslim british india leader muslim leagu instrument negoti separ homeland muslim lead establish pakistan separ nation muslim jinnah serv pakistan first death highli rever pakistan effort achiev independ often refer muhammad ali jinnah']
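
To see what the stemming step does in isolation, here is a minimal sketch that runs the Porter stemmer on a handful of words taken from the text. The word list is hand-picked for illustration; `PorterStemmer` needs no downloaded data. The stems it produces should match the ones visible in the stemmed output (`play`, `pivot`, `independ`, `leagu`, `highli`), which shows that stems are not always dictionary words.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hand-picked, already lowercased words from the example text (illustrative only)
words = ["played", "pivotal", "independence", "league", "highly"]

# Map each word to its Porter stem
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Stems such as `independ` and `highli` are truncated forms rather than real words, which is the main practical difference from lemmatization.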

In [4]:
# Your list of sentences
sentences = ["""Muhammad Ali Jinnah played a pivotal role in the creation of Pakistan, which gained
              independence from British rule on August 14, 1947.
              Jinnah was a lawyer and politician who advocated for the rights of Muslims in British India.
              He was the leader of the All-India Muslim League and was instrumental in negotiating for a
              separate homeland for Muslims,
              leading to the establishment of Pakistan as a separate nation for Muslims.
              Jinnah served as Pakistan's first Governor-General until his death in 1948.
              He is highly revered in Pakistan for his efforts in achieving independence and is often
              referred to as Quaid-e-Azam Muhammad Ali Jinnah."""]

print(sentences)

["Muhammad Ali Jinnah played a pivotal role in the creation of Pakistan, which gained independence from British rule on August 14, 1947.\n              Jinnah was a lawyer and politician who advocated for the rights of Muslims in British India.\n              He was the leader of the All-India Muslim League and was instrumental in negotiating for a separate homeland for Muslims,\n             leading to the establishment of Pakistan as a separate nation for Muslims.\n             Jinnah served as Pakistan's first Governor-General until his death in 1948.\n              He is highly revered in Pakistan for his efforts in achieving independence and is often referred to as Quaid-e-Azam Muhammad Ali Jinnah."]


In [6]:
# Download WordNet, a lexical database of English words. In NLP it is often
# used for synonym generation, semantic similarity calculations, and lemmatization.
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [10]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string

# Initialize lemmatizer, stopwords, and punctuation
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

# Your list of sentences
sentences = ["""Muhammad Ali Jinnah played a pivotal role in the creation of Pakistan, which gained
              independence from British rule on August 14, 1947.
              Jinnah was a lawyer and politician who advocated for the rights of Muslims in British India.
              He was the leader of the All-India Muslim League and was instrumental in negotiating for a
              separate homeland for Muslims,
              leading to the establishment of Pakistan as a separate nation for Muslims.
              Jinnah served as Pakistan's first Governor-General until his death in 1948.
              He is highly revered in Pakistan for his efforts in achieving independence and is often
              referred to as Quaid-e-Azam Muhammad Ali Jinnah."""]

# Initialize an empty list to store the cleaned sentences
q = []
print("=========================sentences=========================")
print(sentences)
for sentence in sentences:
    # Tokenize the sentence and convert to lowercase
    words = nltk.word_tokenize(sentence.lower())
    print("==================word_tokenize===========================")
    print(words)
    # Drop tokens that are a single punctuation character
    words = [word for word in words if word not in punctuation]
    print("======================punctuation=======================")
    print(words)
    # Lemmatize each word (as a noun, the default), skipping stopwords
    # and non-alphanumeric tokens
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words and word.isalnum()]
    print("====================Lemmatization and filtering out stopwords=========================")
    print(words)
    # Join the words back into a sentence
    cleaned_sentence = ' '.join(words)
    print("===================cleaned_sentence==========================")
    print(cleaned_sentence)
    # Append the cleaned sentence to the list
    q.append(cleaned_sentence)
print(q)

["Muhammad Ali Jinnah played a pivotal role in the creation of Pakistan, which gained independence from British rule on August 14, 1947.\n              Jinnah was a lawyer and politician who advocated for the rights of Muslims in British India.\n              He was the leader of the All-India Muslim League and was instrumental in negotiating for a separate homeland for Muslims,\n             leading to the establishment of Pakistan as a separate nation for Muslims.\n             Jinnah served as Pakistan's first Governor-General until his death in 1948.\n              He is highly revered in Pakistan for his efforts in achieving independence and is often referred to as Quaid-e-Azam Muhammad Ali Jinnah."]
['muhammad', 'ali', 'jinnah', 'played', 'a', 'pivotal', 'role', 'in', 'the', 'creation', 'of', 'pakistan', ',', 'which', 'gained', 'independence', 'from', 'british', 'rule', 'on', 'august', '14', ',', '1947.', 'jinnah', 'was', 'a', 'lawyer', 'and', 'politician', 'who', 'advocated', 'f
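
The token list above contains entries such as `','` and `'1947.'`, and the two filters in the loop treat them differently. The sketch below reproduces both filters on a small hand-written token list. The tokens and the three-word stop list are assumptions chosen for illustration, not NLTK's real stop-word corpus.

```python
import string

punctuation = set(string.punctuation)

# Illustrative stand-in for NLTK's English stop-word list (assumption)
stop_words = {"the", "on", "of"}

# Hand-written tokens resembling the tokenizer output (assumption)
tokens = ["on", "august", "14", ",", "1947.", "the", "all-india", "league", "'s"]

# Filter 1: removes only tokens that are exactly one punctuation character
no_punct = [t for t in tokens if t not in punctuation]

# Filter 2: removes stopwords and any token with a non-alphanumeric character
kept = [t for t in no_punct if t not in stop_words and t.isalnum()]
dropped = [t for t in tokens if t not in kept]

print(kept)     # ['august', '14', 'league']
print(dropped)  # ['on', ',', '1947.', 'the', 'all-india', "'s"]
```

The `isalnum()` check is what removes `'1947.'` and hyphenated tokens, which is why the year and names like "All-India" are absent from the cleaned sentences.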

In [12]:
# Download the averaged perceptron tagger model that NLTK uses for
# part-of-speech (POS) tagging. The model assigns a grammatical category
# (noun, verb, adjective, ...) to each word in a sentence.
# Recent NLTK releases look for 'averaged_perceptron_tagger_eng' instead;
# download that one if pos_tag raises a LookupError.
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
from nltk import pos_tag
from nltk.corpus import wordnet

# Initialize lemmatizer, stopwords, and punctuation
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

# Function to convert NLTK POS tags to WordNet POS tags
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None
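
The mapping above can also be written as a lookup table. The sketch below is one possible variant, not the notebook's own function. It assumes WordNet's one-letter POS codes (`'a'`, `'v'`, `'n'`, `'r'`), written as plain strings so that it runs without downloading any corpus. The names `TAG_PREFIX_TO_WORDNET` and `wordnet_pos_or_none` are hypothetical.

```python
# WordNet POS codes as plain strings: adjective, verb, noun, adverb
TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def wordnet_pos_or_none(treebank_tag):
    """Return the WordNet POS code for a Penn Treebank tag, or None if unmapped."""
    # Only the first letter of the Treebank tag decides the coarse category
    return TAG_PREFIX_TO_WORDNET.get(treebank_tag[:1])


# A few Penn Treebank tags: adjective, past-tense verb, plural noun,
# adverb, and preposition (which has no WordNet counterpart)
for tag in ["JJ", "VBD", "NNS", "RB", "IN"]:
    print(tag, "->", wordnet_pos_or_none(tag))
```

Tags without a WordNet counterpart, such as `IN`, map to `None`, so the caller can leave those words unchanged.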

In [14]:
# Your list of sentences
sentences = ["""Muhammad Ali Jinnah played a pivotal role in the creation of Pakistan, which gained
              independence from British rule on August 14, 1947.
              Jinnah was a lawyer and politician who advocated for the rights of Muslims in British India.
              He was the leader of the All-India Muslim League and was instrumental in negotiating for a
              separate homeland for Muslims,
              leading to the establishment of Pakistan as a separate nation for Muslims.
              Jinnah served as Pakistan's first Governor-General until his death in 1948.
              He is highly revered in Pakistan for his efforts in achieving independence and is often
              referred to as Quaid-e-Azam Muhammad Ali Jinnah."""]

# Initialize an empty list to store the cleaned sentences
q = []

for sentence in sentences:
    # Tokenize the sentence and convert to lowercase
    words = nltk.word_tokenize(sentence.lower())

    # Drop tokens that are a single punctuation character
    words = [word for word in words if word not in punctuation]
    # POS tagging
    tagged_words = pos_tag(words)
    # Lemmatize with the mapped WordNet POS tag (words with an unmapped tag
    # are kept as-is), skipping stopwords and non-alphanumeric tokens
    words = [
        lemmatizer.lemmatize(word, get_wordnet_pos(tag)) if get_wordnet_pos(tag) else word
        for word, tag in tagged_words
        if word not in stop_words and word.isalnum()
    ]

    # Join the words back into a sentence
    cleaned_sentence = ' '.join(words)
    # Append the cleaned sentence to the list
    q.append(cleaned_sentence)
print(q)


['muhammad ali jinnah play pivotal role creation pakistan gain independence british rule august 14 jinnah lawyer politician advocate right muslim british india leader muslim league instrumental negotiate separate homeland muslim lead establishment pakistan separate nation muslim jinnah serve pakistan first death highly revere pakistan effort achieve independence often refer muhammad ali jinnah']
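
Comparing this result with the stemmed output from the Porter stemmer cell shows where the two approaches diverge. The sketch below lines up the first seven tokens of each output, copied from the notebook outputs, and lists the positions where they differ.

```python
# First seven tokens of the stemmed and the POS-aware lemmatized outputs
stemmed = "play pivot role creation pakistan gain independ".split()
lemmatized = "play pivotal role creation pakistan gain independence".split()

# Keep only the positions where the two normalizations disagree
differences = [(s, l) for s, l in zip(stemmed, lemmatized) if s != l]

print(differences)  # [('pivot', 'pivotal'), ('independ', 'independence')]
```

Both methods reduce "played" to "play", but lemmatization keeps full dictionary words, while stemming strips suffixes and produces forms such as "independ".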
