
### **Q2. Copy the text about NASA from its Wikipedia page and perform the following tasks**
##### 1. Apply POS tagging to this data
##### 2. Remove punctuation symbols
##### 3. Delete the stopwords
##### 4. Perform morphological analysis on all the nouns
##### 5. Find the top three words in the data

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.stem import WordNetLemmatizer 
from nltk.probability import FreqDist

In [4]:
# Text copied from the NASA Wikipedia page
text ="""The National Aeronautics and Space Administration (NASA; /ˈnæsə/) is an independent agency of the U.S. federal government responsible for the civil space program, aeronautics research, and space research. Established in 1958, it succeeded the National Advisory Committee for Aeronautics (NACA) to give the U.S. space development effort a distinct civilian orientation, emphasizing peaceful applications in space science. It has since led most of America's space exploration programs, including Project Mercury, Project Gemini, the 1968–1972 Apollo Moon landing missions, the Skylab space station, and the Space Shuttle. Currently, NASA supports the International Space Station (ISS) along with the Commercial Crew Program, and oversees the development of the Orion spacecraft and the Space Launch System for the lunar Artemis program.

NASA's science division is focused on better understanding Earth through the Earth Observing System; advancing heliophysics through the efforts of the Science Mission Directorate's Heliophysics Research Program; exploring bodies throughout the Solar System with advanced robotic spacecraft such as New Horizons and planetary rovers such as Perseverance; and researching astrophysics topics, such as the Big Bang, through the James Webb Space Telescope, the Great Observatories and associated programs. The Launch Services Program oversees launch operations for its uncrewed launches."""

In [5]:
#1. Apply the POS tagging
tags = pos_tag(word_tokenize(text))
tags

[('The', 'DT'),
 ('National', 'NNP'),
 ('Aeronautics', 'NNP'),
 ('and', 'CC'),
 ('Space', 'NNP'),
 ('Administration', 'NNP'),
 ('(', '('),
 ('NASA', 'NNP'),
 (';', ':'),
 ('/ˈnæsə/', 'NNP'),
 (')', ')'),
 ('is', 'VBZ'),
 ('an', 'DT'),
 ('independent', 'JJ'),
 ('agency', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('U.S.', 'NNP'),
 ('federal', 'JJ'),
 ('government', 'NN'),
 ('responsible', 'JJ'),
 ('for', 'IN'),
 ('the', 'DT'),
 ('civil', 'JJ'),
 ('space', 'NN'),
 ('program', 'NN'),
 (',', ','),
 ('aeronautics', 'NNS'),
 ('research', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('space', 'NN'),
 ('research', 'NN'),
 ('.', '.'),
 ('Established', 'VBN'),
 ('in', 'IN'),
 ('1958', 'CD'),
 (',', ','),
 ('it', 'PRP'),
 ('succeeded', 'VBD'),
 ('the', 'DT'),
 ('National', 'NNP'),
 ('Advisory', 'NNP'),
 ('Committee', 'NNP'),
 ('for', 'IN'),
 ('Aeronautics', 'NNP'),
 ('(', '('),
 ('NACA', 'NNP'),
 (')', ')'),
 ('to', 'TO'),
 ('give', 'VB'),
 ('the', 'DT'),
 ('U.S.', 'NNP'),
 ('space', 'NN'),
 ('development', 

In [6]:
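#2. Remove punctuation symbols
# Keep only purely alphanumeric tokens. Note that isalnum() also drops tokens
# that merely contain punctuation, such as 'U.S.', '1968–1972' and "'s".
words_only = [word for word in word_tokenize(text) if word.isalnum()]
text1 = " ".join(words_only)
text1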

'The National Aeronautics and Space Administration NASA is an independent agency of the federal government responsible for the civil space program aeronautics research and space research Established in 1958 it succeeded the National Advisory Committee for Aeronautics NACA to give the space development effort a distinct civilian orientation emphasizing peaceful applications in space science It has since led most of America space exploration programs including Project Mercury Project Gemini the Apollo Moon landing missions the Skylab space station and the Space Shuttle Currently NASA supports the International Space Station ISS along with the Commercial Crew Program and oversees the development of the Orion spacecraft and the Space Launch System for the lunar Artemis program NASA science division is focused on better understanding Earth through the Earth Observing System advancing heliophysics through the efforts of the Science Mission Directorate Heliophysics Research Program exploring 
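The `isalnum()` filter above removes every token that contains any non-alphanumeric character, which is why `U.S.` and `1968–1972` are missing from the output. A minimal sketch of a gentler alternative follows, assuming the goal is to drop only tokens made up entirely of punctuation or symbol characters. The helper names `is_punctuation` and `drop_punctuation` are hypothetical, and the sample tokens are taken from the tokenizer output shown earlier.

```python
import unicodedata


def is_punctuation(token: str) -> bool:
    """True when every character is a Unicode punctuation (P*) or symbol (S*) mark."""
    return all(unicodedata.category(ch)[0] in ("P", "S") for ch in token)


def drop_punctuation(tokens):
    """Keep tokens that contain at least one letter, digit, or other non-punctuation character."""
    return [tok for tok in tokens if tok and not is_punctuation(tok)]


# Tokens copied from the word_tokenize output of the NASA text.
sample = ["NASA", ";", "/ˈnæsə/", ")", "U.S.", "federal", ",", "1968–1972", "'s", "America"]
print(drop_punctuation(sample))
# ['NASA', '/ˈnæsə/', 'U.S.', 'federal', '1968–1972', "'s", 'America']
```

This keeps abbreviations and year ranges, but it also keeps the possessive clitic `'s` and the pronunciation token `/ˈnæsə/`, so whether it is preferable depends on what counts as punctuation for the task.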

In [7]:
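#3. Delete the stopwords
# A set gives fast membership tests; the original text is tokenized here,
# so punctuation tokens are still present in the result.
stp_wrds = set(stopwords.words('english'))
without_sw = [word for word in word_tokenize(text) if word.lower() not in stp_wrds]
without_sw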

['National',
 'Aeronautics',
 'Space',
 'Administration',
 '(',
 'NASA',
 ';',
 '/ˈnæsə/',
 ')',
 'independent',
 'agency',
 'U.S.',
 'federal',
 'government',
 'responsible',
 'civil',
 'space',
 'program',
 ',',
 'aeronautics',
 'research',
 ',',
 'space',
 'research',
 '.',
 'Established',
 '1958',
 ',',
 'succeeded',
 'National',
 'Advisory',
 'Committee',
 'Aeronautics',
 '(',
 'NACA',
 ')',
 'give',
 'U.S.',
 'space',
 'development',
 'effort',
 'distinct',
 'civilian',
 'orientation',
 ',',
 'emphasizing',
 'peaceful',
 'applications',
 'space',
 'science',
 '.',
 'since',
 'led',
 'America',
 "'s",
 'space',
 'exploration',
 'programs',
 ',',
 'including',
 'Project',
 'Mercury',
 ',',
 'Project',
 'Gemini',
 ',',
 '1968–1972',
 'Apollo',
 'Moon',
 'landing',
 'missions',
 ',',
 'Skylab',
 'space',
 'station',
 ',',
 'Space',
 'Shuttle',
 '.',
 'Currently',
 ',',
 'NASA',
 'supports',
 'International',
 'Space',
 'Station',
 '(',
 'ISS',
 ')',
 'along',
 'Commercial',
 'Crew',


In [8]:
#4. Perform morphological analysis on all the nouns
# Collect every token whose tag starts with 'NN' (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in tags if tag.startswith('NN')]
nouns

['National',
 'Aeronautics',
 'Space',
 'Administration',
 'NASA',
 '/ˈnæsə/',
 'agency',
 'U.S.',
 'government',
 'space',
 'program',
 'aeronautics',
 'research',
 'space',
 'research',
 'National',
 'Advisory',
 'Committee',
 'Aeronautics',
 'NACA',
 'U.S.',
 'space',
 'development',
 'effort',
 'orientation',
 'applications',
 'space',
 'science',
 'America',
 'space',
 'exploration',
 'programs',
 'Project',
 'Mercury',
 'Project',
 'Gemini',
 'Apollo',
 'Moon',
 'missions',
 'Skylab',
 'space',
 'station',
 'Space',
 'Shuttle',
 'Currently',
 'NASA',
 'International',
 'Space',
 'Station',
 'ISS',
 'Crew',
 'Program',
 'development',
 'Orion',
 'spacecraft',
 'Space',
 'Launch',
 'System',
 'Artemis',
 'program',
 'NASA',
 'science',
 'division',
 'understanding',
 'Earth',
 'Earth',
 'Observing',
 'System',
 'heliophysics',
 'efforts',
 'Science',
 'Mission',
 'Directorate',
 'Heliophysics',
 'Research',
 'Program',
 'bodies',
 'Solar',
 'System',
 'spacecraft',
 'New',
 'Horizo

In [None]:
# Morphological analysis using polyglot
# (needs the English morphology model, e.g. `polyglot download morph2.en`)
from polyglot.text import Word
for noun in nouns:
    word = Word(noun, language="en")
    print(word, '\t\t', word.morphemes)

In [16]:
#5. Find the top three words in the data
words = [word.lower() for word in word_tokenize(text) if word.isalnum()]
# Get the frequency of each distinct word
freq = FreqDist(words)
# Get the top 3 words; stopwords are not filtered here, so words like 'the' can rank highest
top3 = freq.most_common(3)
top3

[('the', 22), ('space', 11), ('and', 8)]
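Two of the three top words in this result are stopwords. A minimal sketch of counting content words only is shown below, using only the standard library. `SMALL_STOPWORDS` is a hypothetical, deliberately tiny stand-in for `stopwords.words('english')`, `top_content_words` is an illustrative helper, and the excerpt is one sentence from the NASA text rather than the full passage.

```python
import re
from collections import Counter

# Hypothetical stand-in for NLTK's English stopword list (assumption: far from complete).
SMALL_STOPWORDS = {"the", "and", "of", "for", "with", "along"}


def top_content_words(text, n=3, stopwords=SMALL_STOPWORDS):
    """Return the n most frequent lowercase ASCII words that are not stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in stopwords)
    return counts.most_common(n)


excerpt = (
    "Currently, NASA supports the International Space Station (ISS) along with "
    "the Commercial Crew Program, and oversees the development of the Orion "
    "spacecraft and the Space Launch System for the lunar Artemis program."
)
top = top_content_words(excerpt, n=2)
print(top)
```

In the notebook itself, the same idea would amount to filtering `words` against `stopwords.words('english')` before building the `FreqDist`.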

### Q2. Open the Wikipedia page on Pune
##### Find the adjectives used in the text

In [17]:
# Text copied from the Wikipedia page on Pune
text1="""Pune (/ˈpuːnə/ POO-nə, Marathi: [ˈpuɳe] ⓘ), previously spelled in English as Poona (the official name until 1978),[15][16] is a city in Maharashtra state in the Deccan plateau in Western India. It is the administrative headquarters of the Pune district, and of Pune division. According to the 2011 Census of India, Pune has 7.2 million residents in the metropolitan region, making it the eighth-most populous metropolitan area in India.[17] The city of Pune is part of Pune Metropolitan Region.[18] Pune is one of the largest IT hubs in India.[19][20] It is also one of the most important automobile and manufacturing hubs of India. Pune is often referred to as the "Oxford of the East" because of its highly regarded educational institutions.[21][22][23] It has been ranked "the most liveable city in India" several times.[24][25]

Pune at different points in time has been ruled by the Rashtrakuta dynasty, Ahmadnagar Sultanate, the Mughals, and the Adil Shahi dynasty. In the 18th century, the city was part of the Maratha Empire, and the seat of the Peshwas, the prime ministers of the Maratha Empire.[26] Many historical landmarks like Pataleshwar caves, Shaniwarwada, Shinde Chhatri, and Vishrambaug Wada date to this era. Historical sites from different eras dot the city.

Pune has historically been a major cultural centre, with important figures like Dnyaneshwar, Shivaji, Tukaram, Baji Rao I, Balaji Baji Rao, Madhavrao I, Nana Fadnavis, Mahadev Govind Ranade, Gopal Krishna Gokhale, Mahatma Jyotirao Phule, Savitribai Phule, Gopal Ganesh Agarkar, Tarabai Shinde, Dhondo Keshav Karve, and Pandita Ramabai doing their life's work in Pune City or in an area that falls in Pune Metropolitan Region. Pune was a major centre of resistance to British Raj, with people like Gopal Krishna Gokhale, Bal Gangadhar Tilak and Vinayak Damodar Savarkar playing leading roles in struggle for Indian independence in their times."""

In [18]:
#POS tagging of the text data
tags1 = pos_tag(word_tokenize(text1))
tags1

[('Pune', 'NNP'),
 ('(', '('),
 ('/ˈpuːnə/', 'JJ'),
 ('POO-nə', 'NNP'),
 (',', ','),
 ('Marathi', 'NNP'),
 (':', ':'),
 ('[', 'NN'),
 ('ˈpuɳe', 'NNP'),
 (']', 'NNP'),
 ('ⓘ', 'NNP'),
 (')', ')'),
 (',', ','),
 ('previously', 'RB'),
 ('spelled', 'VBN'),
 ('in', 'IN'),
 ('English', 'NNP'),
 ('as', 'IN'),
 ('Poona', 'NNP'),
 ('(', '('),
 ('the', 'DT'),
 ('official', 'NN'),
 ('name', 'NN'),
 ('until', 'IN'),
 ('1978', 'CD'),
 (')', ')'),
 (',', ','),
 ('[', '$'),
 ('15', 'CD'),
 (']', 'NNP'),
 ('[', 'VBD'),
 ('16', 'CD'),
 (']', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('city', 'NN'),
 ('in', 'IN'),
 ('Maharashtra', 'NNP'),
 ('state', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('Deccan', 'NNP'),
 ('plateau', 'NN'),
 ('in', 'IN'),
 ('Western', 'JJ'),
 ('India', 'NNP'),
 ('.', '.'),
 ('It', 'PRP'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('administrative', 'JJ'),
 ('headquarters', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('Pune', 'NNP'),
 ('district', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('of', 'IN'),
 ('Pun

In [19]:
# Extract all adjectives (tags starting with 'JJ': JJ, JJR, JJS) into the list 'adjectives'
adjectives = [word for word, tag in tags1 if tag.startswith('JJ')]
adjectives

['/ˈpuːnə/',
 'Western',
 'administrative',
 'metropolitan',
 'eighth-most',
 'populous',
 'metropolitan',
 'largest',
 ']',
 'important',
 'regarded',
 'educational',
 'liveable',
 'several',
 ']',
 'different',
 '18th',
 'prime',
 ']',
 'Many',
 'historical',
 'Historical',
 'different',
 'major',
 'cultural',
 'important',
 'major',
 'Indian']
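The adjective list contains a few tagger artifacts: `]` appears because the copied text still carries citation markers such as `[17]`, and `/ˈpuːnə/` is a pronunciation guide rather than an adjective. A minimal sketch of two possible clean-up steps follows, assuming citation markers always look like a bracketed number and that a real adjective starts and ends with a letter or digit. The names `strip_citations` and `clean_adjectives` are hypothetical, and the sample pairs are taken from the tagger output above.

```python
import re

# Assumption: Wikipedia citation markers are a number in square brackets, e.g. [17].
CITATION_MARKER = re.compile(r"\[\d+\]")


def strip_citations(text):
    """Remove bracketed numeric citation markers before tokenizing."""
    return CITATION_MARKER.sub("", text)


def clean_adjectives(tagged_tokens):
    """Keep JJ* tokens that begin and end with an alphanumeric character."""
    return [
        token
        for token, tag in tagged_tokens
        if tag.startswith("JJ") and token and token[0].isalnum() and token[-1].isalnum()
    ]


print(strip_citations("until 1978),[15][16] is a city"))
# until 1978), is a city

sample_tags = [
    ("/ˈpuːnə/", "JJ"), ("Western", "JJ"), ("India", "NNP"),
    ("]", "JJ"), ("eighth-most", "JJ"), ("18th", "JJ"),
]
print(clean_adjectives(sample_tags))
# ['Western', 'eighth-most', '18th']
```

Stripping the markers before calling `pos_tag` may also change the tags assigned to neighbouring words, so the resulting list is not guaranteed to match the one above minus the artifacts.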