**Aim:** Perform the steps involved in text analytics in Python and R.


# **Python**

1. **NLTK (Natural Language Toolkit)**: NLTK is one of the most widely used libraries for natural language processing (NLP) and text analysis in Python. It provides easy-to-use interfaces to over 50 corpora and lexical resources, along with a suite of text processing libraries for tokenization, stemming, tagging, parsing, and semantic reasoning.

2.   **spaCy**: spaCy is a modern and efficient NLP library designed for production use. It offers pre-trained models for various languages and provides easy-to-use APIs for performing tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and sentence segmentation.

3.   **TextBlob**: TextBlob is a simple and beginner-friendly library for text processing and sentiment analysis in Python. It offers a consistent API for common NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, translation, and classification.

4.   **Gensim**: Gensim is a robust library for topic modeling, document similarity analysis, and word vector representations in Python. Gensim is optimized for large-scale text corpora and provides tools for discovering hidden semantic structures within documents, identifying topics, and calculating document similarities.

5.   **Transformers (Hugging Face)**: Transformers, developed by Hugging Face, is a state-of-the-art library for natural language understanding and generation tasks using transformer-based models like BERT, GPT, and RoBERTa. It provides pre-trained models and fine-tuning pipelines for various NLP tasks, including text classification, named entity recognition, question answering, summarization, and text generation.






In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk

In [None]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [None]:
example_string = "Bob Newman,The tranquil serenity of the Amazon forest in South America enveloped her as she ventured deeper into its heart. Sunlight filtered through the thick canopy above, dappling the forest floor with patches of golden light. The air was crisp, carrying the scent of pine and earth, mingling with the faint chorus of birdsong. With each step, she felt the weight of the world lift from her shoulders, replaced by a profound sense of peace. Here, amidst the ancient trees and winding paths, she found solace, a sanctuary from the chaos of everyday life."


Tokenization (Sentence & Word)

In [None]:
sentences = sent_tokenize(example_string)
print(sentences)

['The tranquil serenity of the forest enveloped her as she ventured deeper into its heart.', 'Sunlight filtered through the thick canopy above, dappling the forest floor with patches of golden light.', 'The air was crisp, carrying the scent of pine and earth, mingling with the faint chorus of birdsong.', 'With each step, she felt the weight of the world lift from her shoulders, replaced by a profound sense of peace.', 'Here, amidst the ancient trees and winding paths, she found solace, a sanctuary from the chaos of everyday life.']


In [None]:
words = word_tokenize(example_string)
print(words)

['The', 'tranquil', 'serenity', 'of', 'the', 'forest', 'enveloped', 'her', 'as', 'she', 'ventured', 'deeper', 'into', 'its', 'heart', '.', 'Sunlight', 'filtered', 'through', 'the', 'thick', 'canopy', 'above', ',', 'dappling', 'the', 'forest', 'floor', 'with', 'patches', 'of', 'golden', 'light', '.', 'The', 'air', 'was', 'crisp', ',', 'carrying', 'the', 'scent', 'of', 'pine', 'and', 'earth', ',', 'mingling', 'with', 'the', 'faint', 'chorus', 'of', 'birdsong', '.', 'With', 'each', 'step', ',', 'she', 'felt', 'the', 'weight', 'of', 'the', 'world', 'lift', 'from', 'her', 'shoulders', ',', 'replaced', 'by', 'a', 'profound', 'sense', 'of', 'peace', '.', 'Here', ',', 'amidst', 'the', 'ancient', 'trees', 'and', 'winding', 'paths', ',', 'she', 'found', 'solace', ',', 'a', 'sanctuary', 'from', 'the', 'chaos', 'of', 'everyday', 'life', '.']
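
To make the two tokenization steps concrete, here is a minimal sketch that uses only Python's `re` module. The helper names `simple_sent_tokenize` and `simple_word_tokenize` are made up for illustration, and the rules are far cruder than NLTK's trained Punkt model (abbreviations such as "Dr." would be split incorrectly, for example).

```python
import re

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace and an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "The air was crisp, carrying the scent of pine. With each step, she felt lighter."
sentences = simple_sent_tokenize(sample)
words = simple_word_tokenize(sentences[0])
print(sentences)
print(words)
```

As in the NLTK output above, punctuation marks come out as tokens of their own, which is why a later step has to filter them.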


Frequency Distribution


In [None]:
fdist = FreqDist(words)
print("Frequency Distribution:")
print(fdist.most_common(20))

Frequency Distribution:
[('the', 9), (',', 8), ('of', 7), ('.', 5), ('she', 3), ('The', 2), ('forest', 2), ('her', 2), ('with', 2), ('and', 2), ('from', 2), ('a', 2), ('tranquil', 1), ('serenity', 1), ('enveloped', 1), ('as', 1), ('ventured', 1), ('deeper', 1), ('into', 1), ('its', 1)]
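
The output above counts `'the'` and `'The'` separately and includes punctuation, because the tokens were counted as-is. NLTK's `FreqDist` builds on `collections.Counter`, so the effect of normalizing first can be sketched with the standard library alone. The token list below is a made-up sample, not the notebook's text.

```python
from collections import Counter

tokens = ["The", "forest", "was", "quiet", ",", "and", "the", "forest", "floor", "was", "damp", "."]

# Case-sensitive counts: "The" and "the" are separate keys, and punctuation is counted too.
raw_counts = Counter(tokens)

# Lowercase and keep alphabetic tokens only, so case variants merge and punctuation drops out.
normalized_counts = Counter(t.lower() for t in tokens if t.isalpha())

print(raw_counts["The"], raw_counts["the"], raw_counts[","])
print(normalized_counts.most_common(3))
```

Whether to normalize before counting depends on the task: case can carry information (for example for named entities), so this is a design choice rather than a required step.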


Remove stopwords & punctuation


In [None]:
stop_words = set(stopwords.words('english'))
filtered_words = [word.lower() for word in words if word.isalnum() and word.lower() not in stop_words]
print(filtered_words)

['tranquil', 'serenity', 'forest', 'enveloped', 'ventured', 'deeper', 'heart', 'sunlight', 'filtered', 'thick', 'canopy', 'dappling', 'forest', 'floor', 'patches', 'golden', 'light', 'air', 'crisp', 'carrying', 'scent', 'pine', 'earth', 'mingling', 'faint', 'chorus', 'birdsong', 'step', 'felt', 'weight', 'world', 'lift', 'shoulders', 'replaced', 'profound', 'sense', 'peace', 'amidst', 'ancient', 'trees', 'winding', 'paths', 'found', 'solace', 'sanctuary', 'chaos', 'everyday', 'life']
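
The filter above combines two tests: `word.isalnum()` drops any token that contains a non-alphanumeric character, and the set lookup drops stopwords. A small standard-library sketch shows a side effect worth knowing about. `TOY_STOP_WORDS` is a tiny made-up list (NLTK's English list is much longer), and the tokens are illustrative.

```python
# Tiny illustrative stop list (an assumption; NLTK's English list is much longer).
TOY_STOP_WORDS = {"the", "of", "a", "and", "with", "from", "her", "she"}

tokens = ["She", "found", "a", "well-known", "path", ",", "the", "2nd", "one", "."]

# Same pattern as the cell above: keep alphanumeric tokens that are not stopwords.
kept = [t.lower() for t in tokens if t.isalnum() and t.lower() not in TOY_STOP_WORDS]

# Everything rejected by isalnum(), which includes more than punctuation marks.
dropped_by_isalnum = [t for t in tokens if not t.isalnum()]

print(kept)
print(dropped_by_isalnum)
```

Note that `isalnum()` also discards tokens such as `well-known`, not just punctuation. If hyphenated words matter for the analysis, a check against `string.punctuation` may be a better fit.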


Lexicon Normalization (Stemming, Lemmatization)


In [None]:
ps = PorterStemmer()
stemmed_words = [ps.stem(word) for word in filtered_words]
print(stemmed_words)

['tranquil', 'seren', 'forest', 'envelop', 'ventur', 'deeper', 'heart', 'sunlight', 'filter', 'thick', 'canopi', 'dappl', 'forest', 'floor', 'patch', 'golden', 'light', 'air', 'crisp', 'carri', 'scent', 'pine', 'earth', 'mingl', 'faint', 'choru', 'birdsong', 'step', 'felt', 'weight', 'world', 'lift', 'shoulder', 'replac', 'profound', 'sens', 'peac', 'amidst', 'ancient', 'tree', 'wind', 'path', 'found', 'solac', 'sanctuari', 'chao', 'everyday', 'life']


In [None]:
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
print(lemmatized_words)

['tranquil', 'serenity', 'forest', 'enveloped', 'ventured', 'deeper', 'heart', 'sunlight', 'filtered', 'thick', 'canopy', 'dappling', 'forest', 'floor', 'patch', 'golden', 'light', 'air', 'crisp', 'carrying', 'scent', 'pine', 'earth', 'mingling', 'faint', 'chorus', 'birdsong', 'step', 'felt', 'weight', 'world', 'lift', 'shoulder', 'replaced', 'profound', 'sense', 'peace', 'amidst', 'ancient', 'tree', 'winding', 'path', 'found', 'solace', 'sanctuary', 'chaos', 'everyday', 'life']
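
The two outputs differ because stemming strips suffixes by rule and can produce non-words (`'canopi'`, `'seren'`), while lemmatization looks up dictionary forms. `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given, which is why verb forms such as `'enveloped'` and `'carrying'` come back unchanged above. The sketch below imitates both behaviours with toy stand-ins: `toy_stem` is not the Porter algorithm, and `TOY_LEMMAS` is a hypothetical three-entry lexicon.

```python
def toy_stem(word):
    """Strip the first matching suffix; a crude stand-in for rule-based stemming, not Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-lexicon keyed by (word, part of speech); real lemmatizers consult WordNet.
TOY_LEMMAS = {
    ("paths", "n"): "path",
    ("carrying", "v"): "carry",
    ("enveloped", "v"): "envelop",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up under the given part of speech; return it unchanged if unknown."""
    return TOY_LEMMAS.get((word, pos), word)

for w in ("paths", "carrying", "enveloped"):
    print(w, toy_stem(w), toy_lemmatize(w), toy_lemmatize(w, pos="v"))
```

With the real lemmatizer, the analogous call is `lemmatizer.lemmatize(word, pos='v')` for verbs.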


Part of Speech tagging


In [None]:
pos_tags = pos_tag(filtered_words)
print("Part of Speech tagging:")
print(pos_tags)


Part of Speech tagging:
[('tranquil', 'JJ'), ('serenity', 'NN'), ('forest', 'JJS'), ('enveloped', 'VBN'), ('ventured', 'VBD'), ('deeper', 'JJ'), ('heart', 'NN'), ('sunlight', 'NN'), ('filtered', 'VBD'), ('thick', 'JJ'), ('canopy', 'NN'), ('dappling', 'VBG'), ('forest', 'JJS'), ('floor', 'NN'), ('patches', 'NNS'), ('golden', 'JJ'), ('light', 'JJ'), ('air', 'NN'), ('crisp', 'NN'), ('carrying', 'VBG'), ('scent', 'JJ'), ('pine', 'NN'), ('earth', 'NN'), ('mingling', 'VBG'), ('faint', 'NN'), ('chorus', 'NN'), ('birdsong', 'JJ'), ('step', 'NN'), ('felt', 'VBD'), ('weight', 'JJ'), ('world', 'NN'), ('lift', 'NN'), ('shoulders', 'NNS'), ('replaced', 'VBD'), ('profound', 'JJ'), ('sense', 'NN'), ('peace', 'NN'), ('amidst', 'NN'), ('ancient', 'NN'), ('trees', 'NNS'), ('winding', 'VBG'), ('paths', 'NNS'), ('found', 'VBD'), ('solace', 'NN'), ('sanctuary', 'JJ'), ('chaos', 'NN'), ('everyday', 'JJ'), ('life', 'NN')]


Named Entity Recognition


In [None]:
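import spacy

# Assumes the model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp(example_string)

# Print each detected entity with its label
for ent in doc.ents:
    print(ent.text, ent.label_)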

Bob Newman PERSON
Amazon ORG
South America LOC


Scrape data from a website


In [None]:
import requests

In [None]:
r = requests.get("https://www.tripadvisor.in/")

In [None]:
print(r)

<Response [200]>


In [None]:
print(r.content)

b'<!DOCTYPE html><html lang="en-IN"><head><meta http-equiv="content-type" content="text/html; charset=utf-8"/><meta http-equiv="content-language" content="en"/><link href="https://static.tacdn.com/css2/webfonts/TripSans/TripSans-VF.woff2?v1.002" rel="preload" as="font" type="font/woff2" crossorigin="anonymous"/><link rel="icon" id="favicon" href="https://static.tacdn.com/favicon.ico?v2" type="image/x-icon"/><link rel="mask-icon" sizes="any" href="https://static.tacdn.com/img2/brand_refresh/application_icons/mask-icon.svg" color="#000000"/><meta name="theme-color" content="#34e0a1"/><meta name="format-detection" content="telephone=no"/><meta property="al:ios:app_name" content="TripAdvisor"/><meta property="al:ios:app_store_id" content="284876795"/><meta property="twitter:app:id:ipad" name="twitter:app:id:ipad" content="284876795"/><meta property="twitter:app:id:iphone" name="twitter:app:id:iphone" content="284876795"/><meta property="al:ios:url" content="tripadvisor://www.tripadvisor.in
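
The raw bytes above are the whole page. To pull specific elements out of it, the HTML has to be parsed. Below is a minimal sketch using only the standard library's `html.parser`. The `HeadingCollector` class and the `sample_html` string are made up for illustration, and the string stands in for `r.text` so the example runs without network access.

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collect the text content of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.headings.append("")  # start a new heading

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headings[-1] += data  # also captures text inside nested tags

# Made-up HTML standing in for r.text.
sample_html = (
    "<html><body><h2>Top experiences</h2><p>intro</p>"
    "<h2>More to <b>explore</b></h2></body></html>"
)

collector = HeadingCollector()
collector.feed(sample_html)
headings = [h.strip() for h in collector.headings]
print(headings)
```

For real pages, a dedicated parser such as Beautiful Soup is usually more convenient. The point here is only that scraping is a fetch step followed by a parse step.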

# **R**

1.   **tm (Text Mining)**: The tm (Text Mining) package in R is a comprehensive toolkit for text mining and analysis. It provides functionalities for text preprocessing, transformation, and analysis, including term-document matrix creation, text cleaning, stemming, and stopword removal.

2.   **quanteda**: quanteda is a powerful and user-friendly package for text analysis and natural language processing in R. It offers an intuitive and consistent interface for text corpus creation, tokenization, n-gram analysis, and text manipulation.

3.   **text2vec**: text2vec is an efficient and scalable text mining package in R designed for processing large text corpora. It provides tools for word embeddings, text vectorization, document similarity analysis, and dimensionality reduction.

4.   **udpipe**: udpipe is an R package for tokenization, part-of-speech tagging, and dependency parsing using pre-trained models provided by the UDPipe project. It offers multilingual support and provides access to state-of-the-art NLP models for various languages.

5.   **koRpus**: koRpus is an R package designed for text analysis and computational linguistics tasks. It provides functions for text readability analysis, lexical diversity measurement, and linguistic annotation. koRpus offers a wide range of linguistic indices and measures, including lexical richness, syntactic complexity, and text cohesion metrics.



In [None]:
install.packages("tokenizers")
install.packages("textstem")

In [None]:
library(tokenizers)
library(textstem)
library(stringr)

In [None]:
example_string <- "Bob Newman,The tranquil serenity of the Amazon forest in South America enveloped her as she ventured deeper into its heart. Sunlight filtered through the thick canopy above, dappling the forest floor with patches of golden light. The air was crisp, carrying the scent of pine and earth, mingling with the faint chorus of birdsong. With each step, she felt the weight of the world lift from her shoulders, replaced by a profound sense of peace. Here, amidst the ancient trees and winding paths, she found solace, a sanctuary from the chaos of everyday life."

Tokenization (Sentence & Word)


In [None]:
sentences <- tokenize_sentences(example_string)
print(sentences)

[[1]]
[1] "Bob Newman,The tranquil serenity of the Amazon forest in South America enveloped her as she ventured deeper into its heart."
[2] "Sunlight filtered through the thick canopy above, dappling the forest floor with patches of golden light."                  
[3] "The air was crisp, carrying the scent of pine and earth, mingling with the faint chorus of birdsong."                       
[4] "With each step, she felt the weight of the world lift from her shoulders, replaced by a profound sense of peace."           
[5] "Here, amidst the ancient trees and winding paths, she found solace, a sanctuary from the chaos of everyday life."           



In [None]:
word_tokens <- tokenize_words(example_string)
print(word_tokens)

[[1]]
 [1] "bob"       "newman"    "the"       "tranquil"  "serenity"  "of"       
 [7] "the"       "amazon"    "forest"    "in"        "south"     "america"  
[13] "enveloped" "her"       "as"        "she"       "ventured"  "deeper"   
[19] "into"      "its"       "heart"     "sunlight"  "filtered"  "through"  
[25] "the"       "thick"     "canopy"    "above"     "dappling"  "the"      
[31] "forest"    "floor"     "with"      "patches"   "of"        "golden"   
[37] "light"     "the"       "air"       "was"       "crisp"     "carrying" 
[43] "the"       "scent"     "of"        "pine"      "and"       "earth"    
[49] "mingling"  "with"      "the"       "faint"     "chorus"    "of"       
[55] "birdsong"  "with"      "each"      "step"      "she"       "felt"     
[61] "the"       "weight"    "of"        "the"       "world"     "lift"     
[67] "from"      "her"       "shoulders" "replaced"  "by"        "a"        
[73] "profound"  "sense"     "of"        "peace"     "here"      "amid

Frequency Distribution


In [None]:
word_freq <- table(word_tokens)
print("Word Frequency Distribution:")
print(word_freq)

[1] "Word Frequency Distribution:"
word_tokens
        a     above       air    amazon   america    amidst   ancient       and 
        2         1         1         1         1         1         1         2 
       as  birdsong       bob        by    canopy  carrying     chaos    chorus 
        1         1         1         1         1         1         1         1 
    crisp  dappling    deeper      each     earth enveloped  everyday     faint 
        1         1         1         1         1         1         1         1 
     felt  filtered     floor    forest     found      from    golden     heart 
        1         1         1         2         1         2         1         1 
      her      here        in      into       its      life      lift     light 
        2         1         1         1         1         1         1         1 
 mingling    newman        of   patches     paths     peace      pine  profound 
        1         1         7         1         1         1   

Remove stopwords & punctuation


In [None]:
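# stopwords() is assumed to come from an attached package such as stopwords or tm
stop_words <- stopwords("en")
# tokenize_words() returns a list with one character vector per document; flatten it before filtering
tokens <- unlist(word_tokens)
filtered_tokens <- tokens[!tokens %in% stop_words]
filtered_tokens <- filtered_tokens[str_detect(filtered_tokens, "[a-zA-Z]")]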


In [None]:
print("Tokens after removing stopwords & punctuations:")
print(filtered_tokens)

[1] "Tokens after removing stopwords & punctuations:"
[[1]]
 [1] "bob"       "newman"    "the"       "tranquil"  "serenity"  "of"       
 [7] "the"       "amazon"    "forest"    "in"        "south"     "america"  
[13] "enveloped" "her"       "as"        "she"       "ventured"  "deeper"   
[19] "into"      "its"       "heart"     "sunlight"  "filtered"  "through"  
[25] "the"       "thick"     "canopy"    "above"     "dappling"  "the"      
[31] "forest"    "floor"     "with"      "patches"   "of"        "golden"   
[37] "light"     "the"       "air"       "was"       "crisp"     "carrying" 
[43] "the"       "scent"     "of"        "pine"      "and"       "earth"    
[49] "mingling"  "with"      "the"       "faint"     "chorus"    "of"       
[55] "birdsong"  "with"      "each"      "step"      "she"       "felt"     
[61] "the"       "weight"    "of"        "the"       "world"     "lift"     
[67] "from"      "her"       "shoulders" "replaced"  "by"        "a"        
[73] "profound" 

Lexicon Normalization (Stemming, Lemmatization)


In [None]:
# lemmatize_words() expects a character vector, so flatten any list first
lemmatized_tokens <- lemmatize_words(unlist(filtered_tokens))
print("Lemmatized Tokens:")
print(lemmatized_tokens)

[1] "Lemmatized Tokens:"
[[1]]
 [1] "bob"       "newman"    "the"       "tranquil"  "serenity"  "of"       
 [7] "the"       "amazon"    "forest"    "in"        "south"     "america"  
[13] "enveloped" "her"       "as"        "she"       "ventured"  "deeper"   
[19] "into"      "its"       "heart"     "sunlight"  "filtered"  "through"  
[25] "the"       "thick"     "canopy"    "above"     "dappling"  "the"      
[31] "forest"    "floor"     "with"      "patches"   "of"        "golden"   
[37] "light"     "the"       "air"       "was"       "crisp"     "carrying" 
[43] "the"       "scent"     "of"        "pine"      "and"       "earth"    
[49] "mingling"  "with"      "the"       "faint"     "chorus"    "of"       
[55] "birdsong"  "with"      "each"      "step"      "she"       "felt"     
[61] "the"       "weight"    "of"        "the"       "world"     "lift"     
[67] "from"      "her"       "shoulders" "replaced"  "by"        "a"        
[73] "profound"  "sense"     "of"        "pea

In [None]:
# stem_words() expects a character vector; passing a list yields one deparsed string
stemmed_tokens <- stem_words(unlist(filtered_tokens))
print("Stemmed Tokens:")
print(stemmed_tokens)

[1] "Stemmed Tokens:"
[1] "c(\"bob\", \"newman\", \"the\", \"tranquil\", \"serenity\", \"of\", \"the\", \"amazon\", \"forest\", \"in\", \"south\", \"america\", \"enveloped\", \"her\", \"as\", \"she\", \"ventured\", \"deeper\", \"into\", \"its\", \"heart\", \"sunlight\", \"filtered\", \"through\", \"the\", \"thick\", \"canopy\", \"above\", \"dappling\", \"the\", \"forest\", \"floor\", \"with\", \"patches\", \"of\", \"golden\", \"light\", \"the\", \"air\", \"was\", \"crisp\", \"carrying\", \"the\", \"scent\", \"of\", \"pine\", \"and\", \"earth\", \"mingling\", \"with\", \"the\", \"faint\", \"chorus\", \"of\", \"birdsong\", \"with\", \"each\", \n\"step\", \"she\", \"felt\", \"the\", \"weight\", \"of\", \"the\", \"world\", \"lift\", \"from\", \"her\", \"shoulders\", \"replaced\", \"by\", \"a\", \"profound\", \"sense\", \"of\", \"peace\", \"here\", \"amidst\", \"the\", \"ancient\", \"trees\", \"and\", \"winding\", \"paths\", \"she\", \"found\", \"solace\", \"a\", \"sanctuary\", \"from\", \"

Scrape data from a website


In [None]:
# Example: scrape the <h2> headings from a web page
library(rvest)  # provides read_html(), html_nodes(), html_text() and the %>% pipe

url <- "https://www.tripadvisor.in/"
page <- read_html(url)
titles <- page %>% html_nodes("h2") %>% html_text()
print("Scraped Titles:")
print(titles)

[1] "Scraped Titles:"
[1] "Take a trip to a traveller fave"                  
[2] "Top experiences on Tripadvisor"                   
[3] "Vibrant Asian cities"                             
[4] "More to explore"                                  
[5] "Top destinations for your next holiday"           
[6] "Watch The Wanderer Season 1 on Amazon Prime Video"
[7] "Travellers' Choice Awards\nBest of the Best"      


**Outcome:**


1.   Identified the Text Analytics Libraries in Python and R

2.   Performed simple experiments with these libraries in Python and R



