## **Experiment - 7: Perform the steps involved in Text Analytics in Python & R**

### **Task performed**:
Explored Top-5 Text Analytics Libraries in Python (w.r.t Features & Applications)

Explored Top-5 Text Analytics Libraries in R (w.r.t Features & Applications)

Performed the following experiments using Python and R:

1. Tokenization (Sentence & Word)

2. Frequency Distribution

3. Remove stopwords & punctuation

4. Lexicon Normalization (Stemming, Lemmatization)

5. Part of Speech tagging

6. Named Entity Recognition

7. Scrape data from a website

**Python Libraries**: nltk, scattertext, spaCy, TextBlob, sklearn, pandas, numpy

**R Libraries**: shiny, tm, quanteda

In [1]:
pip install nltk



In [2]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
import nltk
from nltk import sent_tokenize, word_tokenize, FreqDist, pos_tag, ne_chunk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [4]:
import nltk
from nltk import sent_tokenize, word_tokenize, FreqDist, pos_tag, ne_chunk
def tokenize_sentences(text):
    return sent_tokenize(text)
sample_text = "Nature, with its infinite canvas of wonders, paints landscapes that evoke a profound sense of awe and serenity. The rustling leaves whisper tales of ancient wisdom, while the gentle caress of the breeze carries the sweet fragrance of blooming flowers. The sun, a radiant artist, orchestrates the masterpiece of dawn and dusk, casting hues that dance across the sky in a symphony of colors. Majestic mountains stand as sentinels of time, their peaks touching the heavens, and babbling brooks weave through meadows like nature's delicate embroidery. In the heart of the wilderness, the melody of chirping birds and the hum of insects create a harmonious chorus, inviting one to immerse in the symphony of life. Each season, from the vibrant rebirth of spring to the introspective hush of winter, paints a unique stroke on the canvas of existence. Nature, a sanctuary of tranquility, inspires contemplation and connection, reminding us of the intricate dance between the earth and its inhabitants."
sentences = tokenize_sentences(sample_text)
print("Sentences:", sentences)



Sentences: ['Nature, with its infinite canvas of wonders, paints landscapes that evoke a profound sense of awe and serenity.', 'The rustling leaves whisper tales of ancient wisdom, while the gentle caress of the breeze carries the sweet fragrance of blooming flowers.', 'The sun, a radiant artist, orchestrates the masterpiece of dawn and dusk, casting hues that dance across the sky in a symphony of colors.', "Majestic mountains stand as sentinels of time, their peaks touching the heavens, and babbling brooks weave through meadows like nature's delicate embroidery.", 'In the heart of the wilderness, the melody of chirping birds and the hum of insects create a harmonious chorus, inviting one to immerse in the symphony of life.', 'Each season, from the vibrant rebirth of spring to the introspective hush of winter, paints a unique stroke on the canvas of existence.', 'Nature, a sanctuary of tranquility, inspires contemplation and connection, reminding us of the intricate dance between the e

In [5]:
def tokenize_words(text):
    return word_tokenize(text)
words = tokenize_words(sample_text)
print("Words:",words)

Words: ['Nature', ',', 'with', 'its', 'infinite', 'canvas', 'of', 'wonders', ',', 'paints', 'landscapes', 'that', 'evoke', 'a', 'profound', 'sense', 'of', 'awe', 'and', 'serenity', '.', 'The', 'rustling', 'leaves', 'whisper', 'tales', 'of', 'ancient', 'wisdom', ',', 'while', 'the', 'gentle', 'caress', 'of', 'the', 'breeze', 'carries', 'the', 'sweet', 'fragrance', 'of', 'blooming', 'flowers', '.', 'The', 'sun', ',', 'a', 'radiant', 'artist', ',', 'orchestrates', 'the', 'masterpiece', 'of', 'dawn', 'and', 'dusk', ',', 'casting', 'hues', 'that', 'dance', 'across', 'the', 'sky', 'in', 'a', 'symphony', 'of', 'colors', '.', 'Majestic', 'mountains', 'stand', 'as', 'sentinels', 'of', 'time', ',', 'their', 'peaks', 'touching', 'the', 'heavens', ',', 'and', 'babbling', 'brooks', 'weave', 'through', 'meadows', 'like', 'nature', "'s", 'delicate', 'embroidery', '.', 'In', 'the', 'heart', 'of', 'the', 'wilderness', ',', 'the', 'melody', 'of', 'chirping', 'birds', 'and', 'the', 'hum', 'of', 'insects'

In [6]:
def calculate_frequency_distribution(words):
    return FreqDist(words)
freq_dist = calculate_frequency_distribution(words)
print("Frequency Distribution:", freq_dist)

Frequency Distribution: <FreqDist with 111 samples and 182 outcomes>
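
Printing a `FreqDist` object only shows its summary line, not the counts themselves. `FreqDist` subclasses `collections.Counter`, so the individual counts can be listed with `most_common()`. The sketch below uses `Counter` directly on a short, hypothetical token list (a stand-in for `words`) so that it runs without any NLTK data. Punctuation tokens are counted as well, which is one reason to filter them out before counting.

```python
from collections import Counter

# Hypothetical token list standing in for `words` from the cell above.
tokens = ["the", "canvas", "of", "nature", ",", "the",
          "symphony", "of", "life", ",", "the", "dance"]

# Counter is the base class of nltk.FreqDist, so most_common() works the same way.
freq = Counter(tokens)

# Distinct tokens ("samples") and total tokens ("outcomes").
print("samples:", len(freq), "outcomes:", sum(freq.values()))

# The three most frequent tokens with their counts.
for token, count in freq.most_common(3):
    print(f"{token!r}: {count}")
```

With the notebook's own data, the equivalent call would be `freq_dist.most_common(10)`.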


In [7]:
def remove_stopwords_and_punctuations(words):
    stop_words = set(stopwords.words('english'))
    return [word.lower() for word in words if word.isalnum() and word.lower() not in stop_words]
filtered_words = remove_stopwords_and_punctuations(words)
print("Filtered Words (stopwords and punctuations removed):", filtered_words)

Filtered Words (stopwords and punctuations removed): ['nature', 'infinite', 'canvas', 'wonders', 'paints', 'landscapes', 'evoke', 'profound', 'sense', 'awe', 'serenity', 'rustling', 'leaves', 'whisper', 'tales', 'ancient', 'wisdom', 'gentle', 'caress', 'breeze', 'carries', 'sweet', 'fragrance', 'blooming', 'flowers', 'sun', 'radiant', 'artist', 'orchestrates', 'masterpiece', 'dawn', 'dusk', 'casting', 'hues', 'dance', 'across', 'sky', 'symphony', 'colors', 'majestic', 'mountains', 'stand', 'sentinels', 'time', 'peaks', 'touching', 'heavens', 'babbling', 'brooks', 'weave', 'meadows', 'like', 'nature', 'delicate', 'embroidery', 'heart', 'wilderness', 'melody', 'chirping', 'birds', 'hum', 'insects', 'create', 'harmonious', 'chorus', 'inviting', 'one', 'immerse', 'symphony', 'life', 'season', 'vibrant', 'rebirth', 'spring', 'introspective', 'hush', 'winter', 'paints', 'unique', 'stroke', 'canvas', 'existence', 'nature', 'sanctuary', 'tranquility', 'inspires', 'contemplation', 'connection',

In [8]:

def stem_words(words):
    porter_stemmer = PorterStemmer()
    return [porter_stemmer.stem(word) for word in words]
stemmed_words = stem_words(filtered_words)
print("Stemmed Words:", stemmed_words)

Stemmed Words: ['natur', 'infinit', 'canva', 'wonder', 'paint', 'landscap', 'evok', 'profound', 'sens', 'awe', 'seren', 'rustl', 'leav', 'whisper', 'tale', 'ancient', 'wisdom', 'gentl', 'caress', 'breez', 'carri', 'sweet', 'fragranc', 'bloom', 'flower', 'sun', 'radiant', 'artist', 'orchestr', 'masterpiec', 'dawn', 'dusk', 'cast', 'hue', 'danc', 'across', 'sky', 'symphoni', 'color', 'majest', 'mountain', 'stand', 'sentinel', 'time', 'peak', 'touch', 'heaven', 'babbl', 'brook', 'weav', 'meadow', 'like', 'natur', 'delic', 'embroideri', 'heart', 'wilder', 'melodi', 'chirp', 'bird', 'hum', 'insect', 'creat', 'harmoni', 'choru', 'invit', 'one', 'immers', 'symphoni', 'life', 'season', 'vibrant', 'rebirth', 'spring', 'introspect', 'hush', 'winter', 'paint', 'uniqu', 'stroke', 'canva', 'exist', 'natur', 'sanctuari', 'tranquil', 'inspir', 'contempl', 'connect', 'remind', 'us', 'intric', 'danc', 'earth', 'inhabit']
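
The Python cells stop at stemming, although lemmatization is also listed in the tasks. Stemming strips suffixes and can leave non-words such as 'leav' or 'carri', whereas lemmatization maps each word to its dictionary form. The sketch below shows the idea with a tiny hand-written lookup table. The table and the `lemmatize` helper are hypothetical and for illustration only; a real lemmatizer, such as one backed by WordNet, covers the full vocabulary and takes part of speech into account.

```python
# Hypothetical lemma table for illustration; not taken from any library.
LEMMA_TABLE = {
    "leaves": "leaf",
    "carries": "carry",
    "mountains": "mountain",
    "paints": "paint",
    "casting": "cast",
}

def lemmatize(word: str) -> str:
    """Return the dictionary form of a word, or the lowercased word if unknown."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)

# Words taken from the sample text; compare with the stems printed above.
sample = ["leaves", "carries", "mountains", "nature"]
lemmas = [lemmatize(w) for w in sample]
print(lemmas)
```

The Porter stems for the same four words in the output above are 'leav', 'carri', 'mountain', and 'natur'.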


In [9]:
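def perform_pos_tagging(words):
    return pos_tag(words)
# Note: the tagger is applied to the stemmed tokens here. Stems such as 'natur'
# are not real words, so several tags in the output are unreliable; passing the
# original `words` list instead should give more accurate tags.
pos_tags = perform_pos_tagging(stemmed_words)
print("Part of Speech Tags:", pos_tags)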

Part of Speech Tags: [('natur', 'JJ'), ('infinit', 'NN'), ('canva', 'NN'), ('wonder', 'VBP'), ('paint', 'NN'), ('landscap', 'NN'), ('evok', 'VBP'), ('profound', 'NN'), ('sens', 'NNS'), ('awe', 'VBP'), ('seren', 'JJ'), ('rustl', 'NN'), ('leav', 'NN'), ('whisper', 'IN'), ('tale', 'JJ'), ('ancient', 'JJ'), ('wisdom', 'NN'), ('gentl', 'NN'), ('caress', 'NN'), ('breez', 'NN'), ('carri', 'NN'), ('sweet', 'JJ'), ('fragranc', 'NN'), ('bloom', 'NN'), ('flower', 'NN'), ('sun', 'NN'), ('radiant', 'JJ'), ('artist', 'NN'), ('orchestr', 'IN'), ('masterpiec', 'JJ'), ('dawn', 'NN'), ('dusk', 'NN'), ('cast', 'VBD'), ('hue', 'JJ'), ('danc', 'NN'), ('across', 'IN'), ('sky', 'NN'), ('symphoni', 'NN'), ('color', 'NN'), ('majest', 'JJS'), ('mountain', 'NN'), ('stand', 'NN'), ('sentinel', 'NN'), ('time', 'NN'), ('peak', 'JJ'), ('touch', 'JJ'), ('heaven', 'NN'), ('babbl', 'NN'), ('brook', 'NN'), ('weav', 'VBP'), ('meadow', 'NN'), ('like', 'IN'), ('natur', 'JJ'), ('delic', 'JJ'), ('embroideri', 'JJ'), ('heart'

In [10]:
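def perform_named_entity_recognition(pos_tags):
    return ne_chunk(pos_tags)
# Note: the input is the tagged list of lowercased stems, and the sample text
# contains no person, place, or organization names, so the tree below holds
# no named-entity subtrees.
named_entities = perform_named_entity_recognition(pos_tags)
print("Named Entities:", named_entities)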

Named Entities: (S
  natur/JJ
  infinit/NN
  canva/NN
  wonder/VBP
  paint/NN
  landscap/NN
  evok/VBP
  profound/NN
  sens/NNS
  awe/VBP
  seren/JJ
  rustl/NN
  leav/NN
  whisper/IN
  tale/JJ
  ancient/JJ
  wisdom/NN
  gentl/NN
  caress/NN
  breez/NN
  carri/NN
  sweet/JJ
  fragranc/NN
  bloom/NN
  flower/NN
  sun/NN
  radiant/JJ
  artist/NN
  orchestr/IN
  masterpiec/JJ
  dawn/NN
  dusk/NN
  cast/VBD
  hue/JJ
  danc/NN
  across/IN
  sky/NN
  symphoni/NN
  color/NN
  majest/JJS
  mountain/NN
  stand/NN
  sentinel/NN
  time/NN
  peak/JJ
  touch/JJ
  heaven/NN
  babbl/NN
  brook/NN
  weav/VBP
  meadow/NN
  like/IN
  natur/JJ
  delic/JJ
  embroideri/JJ
  heart/NN
  wilder/NN
  melodi/NN
  chirp/NN
  bird/NN
  hum/NN
  insect/JJ
  creat/NN
  harmoni/NN
  choru/NN
  invit/NN
  one/CD
  immers/NNS
  symphoni/JJ
  life/NN
  season/NN
  vibrant/JJ
  rebirth/NN
  spring/NN
  introspect/NN
  hush/NN
  winter/NN
  paint/NN
  uniqu/JJ
  stroke/VBD
  canva/JJ
  exist/JJ
  natur/NN
  sanctuari/NN
 

In [11]:
pip install requests beautifulsoup4




In [15]:

import requests
from bs4 import BeautifulSoup
url = 'https://www.w3schools.com'
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    print("Page Title:", soup.title.text)
    links = soup.find_all('a')
    for link in links:
        print("Link:", link.get('href'))
else:
    print("Failed to retrieve the page. Status code:", response.status_code)


# import requests
# from bs4 import BeautifulSoup

# url = 'https://www.w3schools.com'

# response = requests.get(url)

# if response.status_code == 200:
#     soup = BeautifulSoup(response.text, 'html.parser')
#     quotes = soup.find_all('span', class_='text')
#     authors = soup.find_all('small', class_='author')
#     for quote, author in zip(quotes, authors):
#         print(f"Quote: {quote.text.strip()}")
#         print(f"Author: {author.text.strip()}\n")

# else:
#     print(f"Failed to retrieve the page. Status code: {response.status_code}")

Page Title: W3Schools Online Web Tutorials
Link: https://www.w3schools.com
Link: javascript:void(0)
Link: javascript:void(0)
Link: javascript:void(0)
Link: javascript:void(0)
Link: javascript:void(0)
Link: javascript:void(0);
Link: https://profile.w3schools.com/log-in
Link: https://profile.w3schools.com/sign-up
Link: https://profile.w3schools.com/log-in
Link: https://my-learning.w3schools.com
Link: https://campus.w3schools.com/collections/course-catalog
Link: /spaces/index.php
Link: /pathfinder/pathfinder_talent.php
Link: https://campus.w3schools.com/collections/course-catalog
Link: https://spaces.w3schools.com/space/
Link: /pathfinder/pathfinder_talent.php
Link: https://my-learning.w3schools.com
Link: /spaces/index.php
Link: https://campus.w3schools.com/collections/course-catalog
Link: /pathfinder/pathfinder_talent.php
Link: https://profile.w3schools.com/logout
Link: https://www.facebook.com/w3schoolscom/
Link: https://www.instagram.com/w3schools.com_official/
Link: https://discord.gg
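
The raw `href` values above mix absolute URLs, site-relative paths such as `/spaces/index.php`, `javascript:` pseudo-links, and duplicates. A possible clean-up step, sketched here with the standard library only, resolves relative paths against the page URL and keeps each http(s) link once. The `normalize_links` helper and the short `raw_links` list are illustrative assumptions; the list copies a few values from the output and adds `None`, which `link.get('href')` returns for an anchor without an `href`.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.w3schools.com"

# A few href values copied from the output above, plus None for a missing href.
raw_links = [
    "https://www.w3schools.com",
    "javascript:void(0)",
    "/spaces/index.php",
    "/pathfinder/pathfinder_talent.php",
    "/spaces/index.php",
    None,
]

def normalize_links(hrefs, base_url):
    """Resolve relative hrefs, keep http(s) links only, and drop duplicates."""
    seen = set()
    cleaned = []
    for href in hrefs:
        if not href:
            continue  # anchor without an href attribute
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skips javascript: and similar pseudo-links
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned

for link in normalize_links(raw_links, BASE_URL):
    print("Link:", link)
```

In the scraping cell, the same helper could be fed `[a.get('href') for a in soup.find_all('a')]`.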

### **Using R**

In [3]:
install.packages("tm")
install.packages("rvest")
install.packages("tokenizers")
install.packages("openNLP")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘NLP’, ‘Rcpp’, ‘slam’, ‘BH’


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependency ‘SnowballC’


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘openNLPdata’, ‘rJava’


“installation of package ‘rJava’ had non-zero exit status”
“installation of package ‘openNLPdata’ had non-zero exit status”
“installation of package ‘openNLP’ had non-zero exit status”


In [4]:
library(tm)
library(rvest)
library(NLP)
library(tokenizers)
library(SnowballC)

Loading required package: NLP



In [None]:
# install.packages(c("tm", "NLP", "quanteda", "udpipe", "rvest", "tidyverse"))
install.packages('tm')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
install.packages('NLP')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
install.packages('quanteda')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
# install.packages('udpipe')
# install.packages('rvest')
install.packages('tidyverse')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
 # Sample Text
library(tm)
library(NLP)
library(quanteda)
library(udpipe)
library(rvest)
library(tidyverse)


text <- "Welcome to Geeks for Geeks.Embark on an extraordinary coding odyssey with our
groundbreaking course,
DSA to Development - Complete Coding Guide! Discover the transformative power of mastering
Data Structures and Algorithms
(DSA) as you venture towards becoming a Proficient Developer."

# Tokenize the text into words
word_tokens <- unlist(tokenize_words(text))

# Print the result
print(word_tokens)



 [1] "welcome"        "to"             "geeks"          "for"           
 [5] "geeks.embark"   "on"             "an"             "extraordinary" 
 [9] "coding"         "odyssey"        "with"           "our"           
[13] "groundbreaking" "course"         "dsa"            "to"            
[17] "development"    "complete"       "coding"         "guide"         
[21] "discover"       "the"            "transformative" "power"         
[25] "of"             "mastering"      "data"           "structures"    
[29] "and"            "algorithms"     "dsa"            "as"            
[33] "you"            "venture"        "towards"        "becoming"      
[37] "a"              "proficient"     "developer"     


In [8]:
# Remove stopwords and punctuation
stop_words <- stopwords("en")
filtered_tokens <- word_tokens[!(word_tokens %in% stop_words) & grepl("[a-zA-Z]", word_tokens)]
cat("Filtered Tokens (without stopwords and punctuations):", filtered_tokens, "\n")

Filtered Tokens (without stopwords and punctuations): nature infinite canvas wonders paints landscapes evoke profound sense awe serenity rustling leaves whisper tales ancient wisdom gentle caress breeze carries sweet fragrance blooming flowers sun radiant artist orchestrates masterpiece dawn dusk casting hues dance across sky symphony colors majestic mountains stand sentinels time peaks touching heavens babbling brooks weave meadows like nature's delicate embroidery heart wilderness melody chirping birds hum insects create harmonious chorus inviting one immerse symphony life season vibrant rebirth spring introspective hush winter paints unique stroke canvas existence nature sanctuary tranquility inspires contemplation connection reminding us intricate dance earth inhabitants 


In [7]:
text <- "Nature, with its infinite canvas of wonders, paints landscapes that evoke a profound sense of awe and serenity. The rustling leaves whisper tales of ancient wisdom, while the gentle caress of the breeze carries the sweet fragrance of blooming flowers. The sun, a radiant artist, orchestrates the masterpiece of dawn and dusk, casting hues that dance across the sky in a symphony of colors. Majestic mountains stand as sentinels of time, their peaks touching the heavens, and babbling brooks weave through meadows like nature's delicate embroidery. In the heart of the wilderness, the melody of chirping birds and the hum of insects create a harmonious chorus, inviting one to immerse in the symphony of life. Each season, from the vibrant rebirth of spring to the introspective hush of winter, paints a unique stroke on the canvas of existence. Nature, a sanctuary of tranquility, inspires contemplation and connection, reminding us of the intricate dance between the earth and its inhabitants."
sent_tokens <- unlist(tokenize_sentences(text))
word_tokens <- unlist(tokenize_words(text))
cat("Sentence Tokens:", sent_tokens, "\n")
cat("Word Tokens:", word_tokens, "\n")
# Frequency Distribution
fdist <- table(unlist(word_tokens))
print(head(sort(fdist, decreasing = TRUE), 2))

Sentence Tokens: Nature, with its infinite canvas of wonders, paints landscapes that evoke a profound sense of awe and serenity. The rustling leaves whisper tales of ancient wisdom, while the gentle caress of the breeze carries the sweet fragrance of blooming flowers. The sun, a radiant artist, orchestrates the masterpiece of dawn and dusk, casting hues that dance across the sky in a symphony of colors. Majestic mountains stand as sentinels of time, their peaks touching the heavens, and babbling brooks weave through meadows like nature's delicate embroidery. In the heart of the wilderness, the melody of chirping birds and the hum of insects create a harmonious chorus, inviting one to immerse in the symphony of life. Each season, from the vibrant rebirth of spring to the introspective hush of winter, paints a unique stroke on the canvas of existence. Nature, a sanctuary of tranquility, inspires contemplation and connection, reminding us of the intricate dance between the earth and its i

In [9]:
stemmed_tokens <- wordStem(filtered_tokens, language = "en")

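# Lemmatization
# Note: this step does not actually lemmatize. wordStem() expects a vector of
# single words; given the whole text as one string it leaves the text
# essentially unchanged apart from the lowercasing done by tolower(), as the
# output below shows. A dictionary-based lemmatizer would be needed instead.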
lemmatized_text <- tolower(text)
lemmatized_text <- wordStem(lemmatized_text, language = "en")
cat("Stemmed Tokens:", stemmed_tokens, "\n")
cat("Lemmatized Text:", lemmatized_text, "\n")

Stemmed Tokens: natur infinit canva wonder paint landscap evok profound sens awe seren rustl leav whisper tale ancient wisdom gentl caress breez carri sweet fragranc bloom flower sun radiant artist orchestr masterpiec dawn dusk cast hue danc across sky symphoni color majest mountain stand sentinel time peak touch heaven babbl brook weav meadow like natur delic embroideri heart wilder melodi chirp bird hum insect creat harmoni chorus invit one immers symphoni life season vibrant rebirth spring introspect hush winter paint uniqu stroke canva exist natur sanctuari tranquil inspir contempl connect remind us intric danc earth inhabit 
Lemmatized Text: nature, with its infinite canvas of wonders, paints landscapes that evoke a profound sense of awe and serenity. the rustling leaves whisper tales of ancient wisdom, while the gentle caress of the breeze carries the sweet fragrance of blooming flowers. the sun, a radiant artist, orchestrates the masterpiece of dawn and dusk, casting hues that d

In [10]:
url <- 'http://quotes.toscrape.com/'
web_page <- read_html(url)
web_text <- html_text(web_page)
cat("Scraped Data from the Website:\n", web_text)

Scraped Data from the Website:
 Quotes to Scrape
    
        
            
                
                    Quotes to Scrape
                
            
            
                
                
                    Login
                
                
            
        
    


    

    
        “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
        by Albert Einstein
        (about)
        
        
            Tags:
            change
            
            deep-thoughts
            
            thinking
            
            world
            
        
    

    
        “It is our choices, Harry, that show what we truly are, far more than our abilities.”
        by J.K. Rowling
        (about)
        
        
            Tags:
            abilities
            
            choices
            
        
    

    
        “There are only two ways to live your life. One is as though nothing
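
Calling `html_text()` on the whole page returns every text node, including navigation labels and layout whitespace, as the output above shows. Selecting elements by class gives structured records instead. The sketch below illustrates that idea in Python with the standard library's `html.parser` on a small inline fragment. The fragment's markup is an assumption modelled on the class names (`text`, `author`) used in the commented-out Python cell, not a verified copy of the live page, and `QuoteParser` is a hypothetical helper.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment; class names follow the commented-out Python cell.
SNIPPET = """
<div class="quote">
  <span class="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span>
  <span>by <small class="author">J.K. Rowling</small></span>
</div>
"""

class QuoteParser(HTMLParser):
    """Collect text inside <span class="text"> and <small class="author">."""

    def __init__(self):
        super().__init__()
        self.field = None  # field currently being read, if any
        self.quotes = []
        self.authors = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "text" in classes:
            self.field = "quote"
        elif tag == "small" and "author" in classes:
            self.field = "author"

    def handle_endtag(self, tag):
        if tag in ("span", "small"):
            self.field = None

    def handle_data(self, data):
        if self.field == "quote":
            self.quotes.append(data.strip())
        elif self.field == "author":
            self.authors.append(data.strip())

parser = QuoteParser()
parser.feed(SNIPPET)
for quote, author in zip(parser.quotes, parser.authors):
    print(f"Quote: {quote}")
    print(f"Author: {author}\n")
```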

**Conclusion:**
*   Identified the Text Analytics Libraries in Python and R
*   Performed simple experiments with these libraries in Python and R


