## Program:
#### Download Wikipedia's page on open source and convert the text to its base forms. Try it with various stemming and lemmatization modules. Use Python's `time` module to measure their performance.


In [1]:
# import the required libraries
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_rus to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |  

True

In [4]:
# import the time module to evaluate the performance of different stemming and lemmatization modules
import time

In [6]:
!pip install wikipedia-api

Collecting wikipedia-api
  Downloading wikipedia_api-0.8.1.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia-api
  Building wheel for wikipedia-api (setup.py) ... done
  Created wheel for wikipedia-api: filename=Wikipedia_API-0.8.1-py3-none-any.whl size=15384 sha256=9408d64ff05ae282063b0531864f7c577b9235c93256fc341a00c88734d9b504
  Stored in directory: /root/.cache/pip/wheels/0b/0f/39/e8214ec038ccd5aeb8c82b957289f2f3ab2251febeae5c2860
Successfully built wikipedia-api
Installing collected packages: wikipedia-api
Successfully installed wikipedia-api-0.8.1


In [7]:
# import the downloaded wikipedia api to get the text data
import wikipediaapi
# import the NLP library for lemmatization
import spacy
# import the PorterStemmer class to implement stemming
from nltk.stem import PorterStemmer
# import the word tokenizer
from nltk.tokenize import word_tokenize


In [8]:
# define a function to get the page summary of a wikipedia page
def get_wikipedia_text(page_title):
  wiki_wiki = wikipediaapi.Wikipedia(user_agent="MyNLPProject/1.0", language="en")
  page = wiki_wiki.page(page_title)
  return page.summary if page.exists() else ""


In [35]:
text = get_wikipedia_text("ISRO")

In [36]:
# tokenize the text using the word tokenizer
tokens = word_tokenize(text)
print(text)

The Indian Space Research Organisation (ISRO ) is India's national space agency. It serves as the principal research and development arm of the Department of Space (DoS), overseen by the Prime Minister of India, with the Chairman of ISRO also serving as the chief executive of the DoS. It is primarily responsible for space-based operations, space exploration, international space cooperation and the development of related technologies. The agency maintains a constellation of imaging, communication and remote sensing satellites. It operates the GAGAN and IRNSS satellite navigation systems. It has sent three missions to the Moon and one mission to Mars.
Formerly known as the Indian National Committee for Space Research (INCOSPAR), ISRO was set up in 1962 by then-Prime Minister Jawaharlal Nehru on the recommendation of scientist Vikram Sarabhai. It was renamed as ISRO in 1969 and was subsumed into the Department of Atomic Energy (DAE). The establishment of ISRO institutionalised space resea

In [37]:
# perform stemming using the PorterStemmer class
stemmer = PorterStemmer()
start_stem1 = time.time() # Start timer
stemmed_words = [stemmer.stem(word) for word in tokens]
end_stem1 = time.time() # End timer

In [39]:
print("Original Text Sample:", tokens[:15])
print("Stemmed Words:", stemmed_words[:15])
print("\nPerformance Analysis:")
print(f"Stemming Execution Time for PorterStemmer: {end_stem1 - start_stem1:.5f} seconds")


Original Text Sample: ['The', 'Indian', 'Space', 'Research', 'Organisation', '(', 'ISRO', ')', 'is', 'India', "'s", 'national', 'space', 'agency', '.']
Stemmed Words: ['the', 'indian', 'space', 'research', 'organis', '(', 'isro', ')', 'is', 'india', "'s", 'nation', 'space', 'agenc', '.']

Performance Analysis:
Stemming Execution Time for PorterStemmer: 0.00527 seconds
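
A single `time.time()` reading of a few milliseconds can be dominated by system noise. The sketch below is not part of the original notebook: it repeats a call several times with `time.perf_counter()` and keeps the fastest run. The helper `best_of` and the toy function `strip_plural` are hypothetical names used only for illustration; in the notebook, a stemming call over `tokens` would take the place of `strip_plural`.

```python
import time

def best_of(func, arg, repeats=5):
    """Return the fastest wall-clock time in seconds over `repeats` calls of func(arg)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func(arg)
        timings.append(time.perf_counter() - start)
    return min(timings)

def strip_plural(words):
    # Toy stand-in for a stemmer (assumption), used only to keep the sketch self-contained.
    return [w[:-1] if w.endswith("s") else w for w in words]

sample = ["satellites", "missions", "space", "agency"] * 1000
elapsed = best_of(strip_plural, sample)
print(f"Best of 5 runs: {elapsed:.5f} seconds")
```

Taking the minimum rather than the mean is one common choice, since slower runs mostly reflect interference from other processes.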


In [40]:
# perform stemming using the LancasterStemmer class
from nltk.stem import LancasterStemmer
stemmer2 = LancasterStemmer()
start_stem2 = time.time() # Start timer
stemmed_words = [stemmer2.stem(word) for word in tokens]
end_stem2 = time.time() # End timer

In [41]:
print("Original Text Sample:", tokens[:10])
print("Stemmed Words:", stemmed_words[:10])
print("\nPerformance Analysis:")
print(f"Stemming Execution Time for LancasterStemmer: {end_stem2 - start_stem2:.5f} seconds")


Original Text Sample: ['The', 'Indian', 'Space', 'Research', 'Organisation', '(', 'ISRO', ')', 'is', 'India']
Stemmed Words: ['the', 'ind', 'spac', 'research', 'org', '(', 'isro', ')', 'is', 'ind']

Performance Analysis:
Stemming Execution Time for LancasterStemmer: 0.00467 seconds
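
The Porter and Lancaster outputs above differ on several tokens, with Lancaster cutting more aggressively. As a minimal sketch, the helper below (a hypothetical name, `differing_stems`, not from the original notebook) lists the tokens on which two stemmers disagree. The lists are copied from the first ten items of the recorded outputs so that the example runs without NLTK.

```python
def differing_stems(tokens, stems_a, stems_b):
    """Return (token, stem_a, stem_b) triples where the two stem lists disagree."""
    return [(t, a, b) for t, a, b in zip(tokens, stems_a, stems_b) if a != b]

# First ten tokens and stems, copied from the notebook outputs above.
tokens = ['The', 'Indian', 'Space', 'Research', 'Organisation', '(', 'ISRO', ')', 'is', 'India']
porter = ['the', 'indian', 'space', 'research', 'organis', '(', 'isro', ')', 'is', 'india']
lancaster = ['the', 'ind', 'spac', 'research', 'org', '(', 'isro', ')', 'is', 'ind']

diffs = differing_stems(tokens, porter, lancaster)
for token, p, l in diffs:
    print(f"{token}: Porter={p!r}, Lancaster={l!r}")
```

On this sample, the two stemmers disagree on four of the ten tokens: `Indian`, `Space`, `Organisation`, and `India`.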


In [42]:
# perform stemming using the SnowballStemmer class
from nltk.stem import SnowballStemmer
stemmer3 = SnowballStemmer("english")
start_stem3 = time.time() # Start timer
stemmed_words = [stemmer3.stem(word) for word in tokens]
end_stem3 = time.time() # End timer

In [43]:
print("Original Text Sample:", tokens[:10])
print("Stemmed Words:", stemmed_words[:10])
print("\nPerformance Analysis:")
print(f"Stemming Execution Time for SnowballStemmer: {end_stem3 - start_stem3:.5f} seconds")


Original Text Sample: ['The', 'Indian', 'Space', 'Research', 'Organisation', '(', 'ISRO', ')', 'is', 'India']
Stemmed Words: ['the', 'indian', 'space', 'research', 'organis', '(', 'isro', ')', 'is', 'india']

Performance Analysis:
Stemming Execution Time for SnowballStemmer: 0.00408 seconds


In [44]:
# use the spaCy library to apply lemmatization and extract the base words
nlp = spacy.load("en_core_web_sm")
start_lem = time.time() # Start timer
doc = nlp(" ".join(tokens))
lemmatized_words = [token.lemma_ for token in doc]
end_lem = time.time() # End timer

In [46]:
print("Original Text Sample:", tokens[:10])
print("Lemmatized Words:", lemmatized_words[:10])
print("\nPerformance Analysis:")
print(f"Lemmatization Execution Time for spaCy: {end_lem - start_lem:.5f} seconds")

Original Text Sample: ['The', 'Indian', 'Space', 'Research', 'Organisation', '(', 'ISRO', ')', 'is', 'India']
Lemmatized Words: ['the', 'Indian', 'Space', 'Research', 'Organisation', '(', 'ISRO', ')', 'be', 'India']

Performance Analysis:
Lemmatization Execution Time for spaCy: 0.05374 seconds


In [48]:
# use the WordNetLemmatizer class in the NLTK library to perform lemmatization
from nltk.stem import WordNetLemmatizer
start_lem2 = time.time() # Start timer
nltk_lemmatizer = WordNetLemmatizer()
nltk_lemmas = [nltk_lemmatizer.lemmatize(token) for token in tokens]
end_lem2 = time.time() # End timer
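
`WordNetLemmatizer.lemmatize` treats every token as a noun unless a `pos` argument is supplied, so verbs and adjectives are often left unchanged or reduced incorrectly. One possible refinement, sketched here under the assumption that Penn Treebank tags are available (for example from `nltk.pos_tag`), is to map each tag to the single-letter part of speech that WordNet expects. The function name `penn_to_wordnet` is hypothetical and not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBZ') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default

# Illustrative tags only; in the notebook they would come from a POS tagger.
for tag in ["NN", "VBZ", "JJ", "RB"]:
    print(tag, "->", penn_to_wordnet(tag))
```

In the notebook, the mapped letter could then be passed as `nltk_lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))` for each `(token, tag)` pair returned by `nltk.pos_tag(tokens)`. Tagging adds its own cost, so it would need to be included in any timing comparison.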

In [50]:
print("Original Text Sample:", tokens[:10])
print("Lemmatized Words:", nltk_lemmas)
print("\nPerformance Analysis:")
print(f"Lemmatization Execution Time for WordNetLemmatizer: {end_lem2 - start_lem2:.5f} seconds")

Original Text Sample: ['The', 'Indian', 'Space', 'Research', 'Organisation', '(', 'ISRO', ')', 'is', 'India']
Lemmatized Words: ['The', 'Indian', 'Space', 'Research', 'Organisation', '(', 'ISRO', ')', 'is', 'India', "'s", 'national', 'space', 'agency', '.', 'It', 'serf', 'a', 'the', 'principal', 'research', 'and', 'development', 'arm', 'of', 'the', 'Department', 'of', 'Space', '(', 'DoS', ')', ',', 'overseen', 'by', 'the', 'Prime', 'Minister', 'of', 'India', ',', 'with', 'the', 'Chairman', 'of', 'ISRO', 'also', 'serving', 'a', 'the', 'chief', 'executive', 'of', 'the', 'DoS', '.', 'It', 'is', 'primarily', 'responsible', 'for', 'space-based', 'operation', ',', 'space', 'exploration', ',', 'international', 'space', 'cooperation', 'and', 'the', 'development', 'of', 'related', 'technology', '.', 'The', 'agency', 'maintains', 'a', 'constellation', 'of', 'imaging', ',', 'communication', 'and', 'remote', 'sensing', 'satellite', '.', 'It', 'operates', 'the', 'GAGAN', 'and', 'IRNSS', 'satellite'