<a href="https://colab.research.google.com/github/DrizzyOVO/DataAnalysisVisualization_39/blob/main/DAV_Exp7_39.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Experiment - 7: Perform the steps involved in Text Analytics in Python & R**

**Tasks to be performed:**

Explore Top-5 Text Analytics Libraries in Python (w.r.t Features & Applications)

Explore Top-5 Text Analytics Libraries in R (w.r.t Features & Applications)

Perform the following experiments using Python & R

Tokenization (Sentence & Word)

Frequency Distribution

Remove stopwords & punctuation

Lexicon Normalization (Stemming, Lemmatization)

Part of Speech tagging

Named Entity Recognition

Scrape data from a website

**Explore Top-5 Text Analytics Libraries in Python**

1. **NLTK (Natural Language Toolkit):**
   - **Features:**
     - Comprehensive set of libraries and tools for natural language processing (NLP).
     - Tokenization, stemming, tagging, parsing, and semantic reasoning functionalities.
     - Supports various corpora and lexical resources.
     - Provides interfaces to popular resources like WordNet.

   - **Applications:**
     - Text classification and sentiment analysis.
     - Named entity recognition.
     - Part-of-speech tagging.
     - Concordance and collocation analysis.

2. **Scattertext:**
   - **Features:**
     - Specifically designed for visualizing linguistic variation between document categories.
     - Produces interactive scatter plots that highlight terms differentiating categories.
     - Supports customization and interactive exploration of visualizations.
     - Handles linguistic and stylistic differences well.

   - **Applications:**
     - Comparative analysis of document categories.
     - Identifying distinctive terms in different contexts.
     - Visual exploration of language patterns.

3. **SpaCy:**
   - **Features:**
     - Fast and efficient NLP library.
     - Tokenization, part-of-speech tagging, named entity recognition, and dependency parsing.
     - Pre-trained models for various languages.
     - Easy integration with machine learning pipelines.

   - **Applications:**
     - Named entity recognition and extraction.
     - Dependency parsing for understanding relationships between words.
     - Text summarization.
     - Information extraction.

4. **TextBlob:**
   - **Features:**
     - Simple and intuitive API for common NLP tasks.
     - Built on top of NLTK and Pattern libraries.
     - Part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and more (the translation helpers were deprecated and have been removed from recent releases).
     - Easy to use for beginners.

   - **Applications:**
     - Sentiment analysis and classification.
     - Basic text processing tasks.
     - Language translation (older releases only).
     - Parsing and extracting information from text.

5. **scikit-learn (sklearn):**
   - **Features:**
     - General-purpose machine learning library with text processing capabilities.
     - Text vectorization techniques (TF-IDF, CountVectorizer).
     - Integration with other machine learning algorithms for text classification and clustering.
     - Comprehensive documentation and community support.

   - **Applications:**
     - Text classification (e.g., spam detection).
     - Clustering and topic modeling.
     - Feature extraction and representation.
     - Text regression.
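
To make the TF-IDF vectorization listed under scikit-learn more concrete, here is a minimal pure-Python sketch. It assumes whitespace tokenization and the textbook weighting tf × log(N / df); scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ. The function name `tf_idf` and the sample documents are illustrative only.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Illustrative documents (not from the experiment)
docs = ["free prize inside", "meeting notes inside", "free free prize"]
for doc, doc_weights in zip(docs, tf_idf(docs)):
    print(doc, "->", {term: round(value, 3) for term, value in doc_weights.items()})
```

Terms that occur in every document get a weight of zero under this formula, which is the behaviour that makes TF-IDF useful for down-weighting very common words.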



**Explore Top-5 Text Analytics Libraries in R**

1. **shiny:**
   - **Features:**
     - Web application framework for R.
     - Allows for the creation of interactive and dynamic web-based dashboards and applications.
     - Well-suited for building user interfaces and visualizations for text analytics applications.
     - Integration with other R libraries for data processing and analysis.

   - **Applications:**
     - Building interactive dashboards for text analysis results.
     - Creating user-friendly interfaces for exploring and visualizing text data.
     - Incorporating text analytics into web-based applications.

2. **tm (Text Mining Package):**
   - **Features:**
     - Comprehensive package for text mining in R.
     - Supports text preprocessing tasks such as cleaning, stemming, and stopword removal.
     - Provides functions for creating document-term matrices (DTM) and term-document matrices (TDM).
     - Integration with other R packages for statistical analysis.

   - **Applications:**
     - Document clustering and classification.
     - Term frequency analysis.
     - Text preprocessing and transformation.
     - Integration with machine learning algorithms for text analysis.

3. **quanteda:**
   - **Features:**
     - Modern and flexible package for quantitative text analysis.
     - Supports corpus management, document-feature matrices, and various text analysis operations.
     - Designed for efficiency and scalability in handling large text datasets.
     - Integration with other R packages for statistical analysis and visualization.

   - **Applications:**
     - Document-feature matrix creation for text analysis.
     - Text preprocessing, including tokenization and stemming.
     - Sentiment analysis and text classification.
     - Topic modeling and exploratory data analysis.

4. **quanteda.textstats:**
   - **Features:**
     - An extension of the quanteda package, specifically focusing on text statistics.
     - Provides functions for calculating various text statistics, such as word frequencies, lexical diversity, and readability measures.
     - Useful for gaining insights into the linguistic characteristics of a text corpus.
     - Complements quanteda's core functionalities for text analysis.

   - **Applications:**
     - Analyzing word frequencies and patterns in a corpus.
     - Assessing the complexity and readability of text.
     - Extracting key statistical information about a text dataset.

5. **tm.plugin.sentiment:**
   - **Features:**
     - A plugin for the tm package that focuses on sentiment analysis.
     - Enables sentiment analysis on text data by incorporating pre-trained sentiment lexicons.
     - Supports the calculation of sentiment scores for individual documents or terms.
     - Useful for understanding the emotional tone or sentiment expressed in a text corpus.

   - **Applications:**
     - Sentiment analysis in text mining projects.
     - Assessing the sentiment polarity (positive, negative, neutral) of documents.
     - Incorporating sentiment analysis into larger text analytics workflows.



In [None]:
%pip install nltk beautifulsoup4 requests




1: Tokenization (Sentence & Word)

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Sample text
text = "NLTK is a powerful library for natural language processing"

# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentence Tokenization:")
print(sentences)

# Word Tokenization
words = word_tokenize(text)
print("\nWord Tokenization:")
print(words)


Sentence Tokenization:
['NLTK is a powerful library for natural language processing']

Word Tokenization:
['NLTK', 'is', 'a', 'powerful', 'library', 'for', 'natural', 'language', 'processing']
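
The sample above contains a single sentence, so the sentence tokenizer has nothing to split. As a rough illustration of what the two tokenizers do, the sketch below uses only the standard `re` module on a multi-sentence text. The helper names `split_sentences` and `split_words` are hypothetical, and the regular expressions are a simplification: unlike NLTK's punkt model, they would wrongly split after abbreviations such as "Dr.".

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    """Return runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "NLTK is a powerful library. It supports tokenization! Does it handle questions?"
for sentence in split_sentences(text):
    print(split_words(sentence))
```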


2: Frequency Distribution

In [None]:
from nltk import FreqDist

# Sample words
word_list = ["apple", "banana", "apple", "orange", "banana", "apple", "grape"]

# Calculate frequency distribution
freq_dist = FreqDist(word_list)
print("Frequency Distribution:")
print(freq_dist)


Frequency Distribution:
<FreqDist with 4 samples and 7 outcomes>
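
Printing a `FreqDist` only shows a summary. Since `FreqDist` is a subclass of `collections.Counter`, the individual counts can be inspected with the usual `Counter` methods. The sketch below uses a plain `Counter` so that it runs without NLTK; the same calls should work on `freq_dist`.

```python
from collections import Counter

word_list = ["apple", "banana", "apple", "orange", "banana", "apple", "grape"]
counts = Counter(word_list)

print(counts.most_common(2))              # the two most frequent words
print(counts["apple"])                    # count of a single word
print(len(counts), sum(counts.values()))  # distinct words, total words
```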


3: Remove Stopwords & Punctuation

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Sample text
text = "This is a sample sentence, with some stopwords and punctuations."

# Remove stopwords and punctuation: isalnum() drops punctuation tokens,
# and the set lookup drops English stopwords
stop_words = set(stopwords.words("english"))
filtered_text = [
    word.lower()
    for word in word_tokenize(text)
    if word.isalnum() and word.lower() not in stop_words
]

print("Text after removing stopwords and punctuations:")
print(filtered_text)


Text after removing stopwords and punctuations:
['sample', 'sentence', 'stopwords', 'punctuations']
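
The `isalnum()` test above removes punctuation tokens, but it also discards any token that contains a non-alphanumeric character, for example a hyphenated word. A possible alternative, sketched here with the standard library only, strips punctuation from the edges of each word instead. The stopword set `STOP_WORDS` is a small hypothetical list for illustration, not NLTK's English list, and `clean_tokens` is an assumed helper name.

```python
from string import punctuation

# Hypothetical mini stopword list, for illustration only
STOP_WORDS = {"this", "is", "a", "with", "some", "and"}

def clean_tokens(text):
    """Lowercase, strip surrounding punctuation, and drop stopwords."""
    tokens = (word.strip(punctuation).lower() for word in text.split())
    return [token for token in tokens if token and token not in STOP_WORDS]

print(clean_tokens("This is a sample sentence, with some stopwords and punctuations."))
print(clean_tokens("This is a state-of-the-art test!"))
```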


4: Lexicon Normalization (Stemming, Lemmatization)

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Sample words
words = ["running", "better", "cats", "dogs"]

# Stemming
porter_stemmer = PorterStemmer()
stemmed_words = [porter_stemmer.stem(word) for word in words]
print("Stemmed Words:")
print(stemmed_words)

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print("\nLemmatized Words:")
print(lemmatized_words)


Stemmed Words:
['run', 'better', 'cat', 'dog']

Lemmatized Words:
['running', 'better', 'cat', 'dog']
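
`WordNetLemmatizer.lemmatize` treats each word as a noun unless a `pos` argument is given, which is why `running` and `better` come back unchanged above. To show the conceptual difference between the two techniques, the toy sketch below contrasts rule-based suffix stripping (stemming) with a dictionary lookup (lemmatization). The suffix rules and the lookup table are invented for this example; this is neither the Porter algorithm nor WordNet.

```python
# Toy rules and lookup table, invented for this illustration
SUFFIXES = ("ning", "ing", "s")
LEMMA_TABLE = {"running": "run", "better": "good", "cats": "cat", "dogs": "dog"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

words = ["running", "better", "cats", "dogs"]
print([toy_stem(w) for w in words])
print([toy_lemmatize(w) for w in words])
```

A stemmer can only cut characters off, so it cannot map `better` to `good`; a lemmatizer can, because it relies on vocabulary knowledge rather than string rules.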


5: Part of Speech Tagging

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
# Sample text
text = "NLTK is a powerful library for natural language processing."

# Part of Speech Tagging
pos_tags = nltk.pos_tag(word_tokenize(text))
print("Part of Speech Tagging:")
print(pos_tags)


Part of Speech Tagging:
[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('powerful', 'JJ'), ('library', 'NN'), ('for', 'IN'), ('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('.', '.')]
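
The tagger returns a list of `(word, tag)` pairs, which can be filtered or counted with ordinary Python. As a small sketch, the snippet below copies the tagged tokens from the output above, so it runs without NLTK, then keeps the nouns and counts the tags.

```python
from collections import Counter

# Tagged tokens copied from the output above
pos_tags = [('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('powerful', 'JJ'),
            ('library', 'NN'), ('for', 'IN'), ('natural', 'JJ'),
            ('language', 'NN'), ('processing', 'NN'), ('.', '.')]

# Tags starting with "NN" are nouns in the Penn Treebank tag set
nouns = [word for word, tag in pos_tags if tag.startswith("NN")]
tag_counts = Counter(tag for _, tag in pos_tags)

print(nouns)
print(tag_counts.most_common(3))
```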


6: Named Entity Recognition

In [None]:
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [None]:
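nltk.download('words')

# Sample text
text = "Barack Obama was the 44th President of the United States."

# Named Entity Recognition
from nltk import ne_chunk

# Tag the new sentence before chunking. The recorded run reused pos_tags from
# section 5, so the output below shows the NLTK sentence instead of this one.
pos_tags = nltk.pos_tag(word_tokenize(text))
named_entities = ne_chunk(pos_tags)
print("Named Entity Recognition:")
print(named_entities)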


Named Entity Recognition:
(S
  (ORGANIZATION NLTK/NNP)
  is/VBZ
  a/DT
  powerful/JJ
  library/NN
  for/IN
  natural/JJ
  language/NN
  processing/NN
  ./.)
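
As a rough baseline to compare against `ne_chunk`, the sketch below collects runs of capitalized words with a regular expression. This is a naive capitalization heuristic, not a statistical NER model: it assigns no entity types, it flags any capitalized word (including sentence-initial ones), and it misses all-caps names such as "NLTK". The helper name `capitalized_spans` is hypothetical.

```python
import re

def capitalized_spans(text):
    """Return runs of consecutive capitalized words as candidate entities."""
    return re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", text)

text = "Barack Obama was the 44th President of the United States."
print(capitalized_spans(text))
```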


[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


7: Scrape data from a website

In [None]:
import requests
from bs4 import BeautifulSoup

# URL to scrape
url = "https://example.com"

# Make a request to the URL (fail fast on timeouts and HTTP errors)
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse HTML content
soup = BeautifulSoup(response.text, "html.parser")

# Extract text content from the webpage
webpage_text = soup.get_text()

print("Text extracted from the website:")
print(webpage_text)


Text extracted from the website:



Example Domain







Example Domain
This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.
More information...
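
`soup.get_text()` keeps the whitespace between tags, which explains the runs of blank lines in the output; calling `soup.get_text(separator="\n", strip=True)` is one way to tidy that up. To illustrate what text extraction involves, the sketch below does the same job with the standard library's `html.parser`, skipping `<script>` and `<style>` contents and empty text nodes. The class name `VisibleTextParser` and the inline sample page are assumptions made so that the example runs without network access.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped tags
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

# Inline sample page so the sketch runs without network access
html = """<html><head><title>Example Domain</title>
<style>body { margin: 0; }</style></head>
<body><h1>Example Domain</h1><p>This domain is for use in examples.</p></body></html>"""

parser = VisibleTextParser()
parser.feed(html)
print("\n".join(parser.chunks))
```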






In [None]:
import requests
from bs4 import BeautifulSoup

# URL to scrape (Python official documentation homepage)
url = "https://docs.python.org/3/"

# Make a request to the URL (fail fast on timeouts and HTTP errors)
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse HTML content
soup = BeautifulSoup(response.text, "html.parser")

# Extract text content from the webpage
webpage_text = soup.get_text()

# Display a portion of the extracted text (for brevity)
print("Text extracted from the Python documentation homepage:")
print(webpage_text[:500])


Text extracted from the Python documentation homepage:




3.12.2 Documentation














































    Theme
    
Auto
Light
Dark


Download
Download these documents
Docs by version

Python 3.13 (in development)
Python 3.12 (stable)
Python 3.11 (stable)
Python 3.10 (security-fixes)
Python 3.9 (security-fixes)
Python 3.8 (security-fixes)
Python 3.7 (EOL)
Python 3.6 (EOL)
Python 3.5 (EOL)
Python 3.4 (EOL)
Python 3.3 (EOL)
Python 3.2 (EOL)
Python 3.1 (EOL)
Python 3.0 (EOL)
Python 2.7 (EOL)
Python 2.6 (EOL)
All versions

Other


# **R**

In [None]:
install.packages(c("shiny", "tm", "SnowballC", "NLP"))


Installing packages into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
# Install necessary packages (openNLP is left out: it depends on rJava,
# which fails to install in this environment)
install.packages(c("tokenizers", "tm", "SnowballC", "rvest", "udpipe"))

# Load required libraries
library(tokenizers)
library(tm)
library(SnowballC)
library(udpipe)
library(rvest)

# Sample text
text <- "This is a sample sentence. Tokenization in R is interesting!"

# Sentence Tokenization
sent_tokens <- tokenize_sentences(text)[[1]]

# Word Tokenization (lowercases and strips punctuation by default)
word_tokens <- tokenize_words(text)[[1]]

# Frequency Distribution
word_freq <- table(word_tokens)
print(word_freq)

# Remove stopwords (punctuation was already stripped by tokenize_words)
filtered_tokens <- word_tokens[!word_tokens %in% tm::stopwords("en")]

# Lexicon Normalization: stemming with SnowballC
stemmed_tokens <- wordStem(filtered_tokens, language = "english")

# Lemmatization and Part of Speech tagging with udpipe
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)
annotated <- as.data.frame(udpipe_annotate(ud_model, x = text))
lemmatized_tokens <- annotated$lemma
pos_tags <- annotated[, c("token", "upos")]

# Named Entity Recognition (NER)
# Note: NER requires a pre-trained model, for example, the spaCy model
# You can use udpipe for POS tagging, but for NER, you might want to use spaCy in Python
# Alternatively, you can explore the 'cleanNLP' package for NER in R

# Scraping data from a website
# Example: Scraping headings from a website (example.com uses an <h1> heading)
url <- "https://example.com"
webpage <- read_html(url)
titles <- html_text(html_nodes(webpage, "h1"))
print(titles)


1. Tokenization (Sentence & Word)

In [None]:
# Tokenization (Sentence & Word)
text <- "This is a sample sentence. Tokenization is important for NLP."
sentences <- strsplit(text, "\\.")[[1]]
words <- unlist(strsplit(text, "\\s+"))

print("Sentences:")
print(sentences)
print("Words:")
print(words)

[1] "Sentences:"
[1] "This is a sample sentence"          " Tokenization is important for NLP"
[1] "Words:"
 [1] "This"         "is"           "a"            "sample"       "sentence."   
 [6] "Tokenization" "is"           "important"    "for"          "NLP."        


In [None]:
install.packages("tokenizers")
library(tokenizers)

text <- "This is a sample sentence. Tokenization is important for NLP."
sentences <- tokenize_sentences(text)
words <- tokenize_words(text)

print("Sentences:")
print(sentences)
print("Words:")
print(words)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘Rcpp’, ‘SnowballC’




[1] "Sentences:"
[[1]]
[1] "This is a sample sentence."         "Tokenization is important for NLP."

[1] "Words:"
[[1]]
 [1] "this"         "is"           "a"            "sample"       "sentence"    
 [6] "tokenization" "is"           "important"    "for"          "nlp"         



2. Frequency Distribution

In [None]:
# Frequency Distribution
word_freq <- table(words)
print("Word frequency:")
print(word_freq)

[1] "Word frequency:"
words
           a          for    important           is          nlp       sample 
           1            1            1            2            1            1 
    sentence         this tokenization 
           1            1            1 


3. Remove stopwords & punctuation

In [None]:
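# Remove stopwords & punctuation
stop_words <- c("is", "a", "for")  # Example list of stopwords
word_vec <- unlist(words)  # tokenize_words() returns a list, so flatten it first
filtered_words <- word_vec[!tolower(word_vec) %in% stop_words & !grepl("[[:punct:]]", word_vec)]
print("Filtered words:")
print(filtered_words)

[1] "Filtered words:"
[1] "this"         "sample"       "sentence"     "tokenization" "important"   
[6] "nlp"         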


4. Lexicon Normalization (Stemming, Lemmatization)

In [None]:
install.packages("SnowballC")
library(SnowballC)

# Example data
filtered_words <- c("running", "flies", "happily", "jumps")

# Stemming using SnowballC
stemmed_words <- wordStem(filtered_words)

# Print results
print("Stemmed words:")
print(stemmed_words)


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



[1] "Stemmed words:"
[1] "run"     "fli"     "happili" "jump"   


5. Part of Speech tagging

In [None]:
install.packages("udpipe", dependencies=TRUE)


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘modeltools’, ‘topicmodels’


“installation of package ‘topicmodels’ had non-zero exit status”


In [None]:
# Install and load the udpipe package
install.packages("udpipe")
library(udpipe)

# Download and load the English model
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)

# Example data
words <- c("running", "flies", "happily", "jumps")

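# Annotate each word; the resulting data frame holds lemmas (lemma column)
# as well as part-of-speech tags (upos and xpos columns)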
x <- udpipe_annotate(ud_model, x = words, doc_id = 1:length(words))
lemmatized_words <- as.data.frame(x)$lemma

# Print the result
print("Lemmatized words:")
print(lemmatized_words)


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe to /content/english-ewt-ud-2.5-191206.udpipe

 - This model has been trained on version 2.5 of data from https://universaldependencies.org

 - The model is distributed under the CC-BY-SA-NC license: https://creativecommons.org/licenses/by-nc-sa/4.0

 - Visit https://github.com/jwijffels/udpipe.models.ud.2.5 for model license details.

 - For a list of all models and their licenses (most models you can download with this package have either a CC-BY-SA or a CC-BY-SA-NC license) read the documentation at ?udpipe_download_model. For building your own models: visit the documentation by typing vignette('udpipe-train', package = 'udpipe')

Downloading finished, model stored at '/content/english-ewt-ud-2.5-191206.udpipe'



[1] "Lemmatized words:"
[1] "run"     "flie"    "happily" "jump"   


6. Named Entity Recognition

In [None]:
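install.packages("NLP")
install.packages("openNLP")
library(openNLP)
library(NLP)

# Person-name NER with openNLP; the entity annotator also needs the
# openNLPmodels.en model package, which is installed separately
s <- as.String("Barack Obama was the 44th President of the United States.")
annotators <- list(Maxent_Sent_Token_Annotator(),
                   Maxent_Word_Token_Annotator(),
                   Maxent_Entity_Annotator(kind = "person"))
annotations <- annotate(s, annotators)
ner_tags <- s[annotations[annotations$type == "entity"]]
print("Named Entities:")
print(ner_tags)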

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘openNLPdata’, ‘rJava’


“installation of package ‘rJava’ had non-zero exit status”
“installation of package ‘openNLPdata’ had non-zero exit status”
“installation of package ‘openNLP’ had non-zero exit status”


ERROR: Error in library(openNLP): there is no package called ‘openNLP’


7. Scrape data from a website

In [None]:
install.packages("rvest")
library(rvest)

url <- "https://example.com"
page <- read_html(url)
text_data <- page %>%
  html_text()

print("Text data from website:")
print(text_data)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



[1] "Text data from website:"
[1] "Example Domain\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, \"Segoe UI\", \"Open Sans\", \"Helvetica Neue\", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    \n\n    Example Domain\n    This domain is for use in illustrative examples in documents. You may use this\n    domain in literature without prior coordination or asking for permission.\n    More information...\n\n"


Outcome:
Identified the Text Analytics Libraries in Python and R.
Performed simple experiments with these libraries in Python and R.