
**Aim:**<br>
Getting introduced to text analytics libraries in Python and R. <br>


**Lab Outcomes:**<br>
Implement visualization techniques on given data sets using R.<br>
Implement visualization techniques on given data sets using Python.<br>



**Lab Objectives:**<br>
To effectively use libraries for data analytics.<br>
Identify 5 Python and 5 R libraries for text analytics.<br>
Prepare a brief summary about their features and applications. <br>
Perform simple experiments on 2 libraries each for Python and R.<br>

**Theory:**

**Python Text Analytics Libraries:**

**1) NLTK (Natural Language Toolkit):**

Features:

a) Tokenization, stemming, lemmatization.

b) Part-of-speech tagging, named entity recognition.

c) Concordance and collocation analysis.

Applications:

a) Sentiment analysis, text classification.

b) Information retrieval, language modeling.

**2) spaCy:**

Features:

a) Tokenization, POS tagging, and named entity recognition.

b) Dependency parsing, sentence segmentation.

c) Pre-trained models for multiple languages.

Applications:

a) Named entity recognition, information extraction.

b) Natural language understanding in chatbots.

**3) TextBlob:**

Features:

a) Simple API for common NLP tasks.

b) Sentiment analysis, noun phrase extraction.

c) Part-of-speech tagging.

Applications:

a) Sentiment analysis of customer reviews.

b) Basic text processing for beginners.
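
To make the sentiment analysis feature concrete, here is a minimal sketch of lexicon-based polarity scoring in plain Python. It is not TextBlob's API or its lexicon: the word list `LEXICON`, the negator set `NEGATORS`, and the function `polarity` are hypothetical names chosen for illustration, and the negation rule only looks one word back.

```python
import re

# Hypothetical mini-lexicon for illustration; real lexicons score thousands of words.
LEXICON = {"good": 1.0, "great": 1.0, "love": 1.0,
           "bad": -1.0, "poor": -1.0, "terrible": -1.0}
NEGATORS = {"not", "never", "no"}


def polarity(text):
    """Average the lexicon scores of matched words, flipping the sign after a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = []
    for i, token in enumerate(tokens):
        if token in LEXICON:
            score = LEXICON[token]
            # Simplistic negation handling: only the immediately preceding word is checked.
            if i > 0 and tokens[i - 1] in NEGATORS:
                score = -score
            scores.append(score)
    # Text with no lexicon words is treated as neutral.
    return sum(scores) / len(scores) if scores else 0.0


reviews = [
    "Great phone, I love the camera.",
    "The battery is not good.",
    "It arrived on Tuesday.",
]
for review in reviews:
    print(f"{polarity(review):+.2f}  {review}")
```

A score near +1 suggests positive wording, near -1 negative wording, and 0 either neutral or uncovered vocabulary, which is the main weakness of small lexicons.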

**4) Gensim:**

Features:

a) Topic modeling (e.g., LDA).

b) Document similarity analysis.

c) Word embedding models (Word2Vec, Doc2Vec).

Applications:

a) Topic modeling in large document collections.

b) Document similarity and clustering.
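
Document similarity is usually computed by turning each document into a vector and comparing the vectors. The sketch below uses raw word counts and cosine similarity with only the standard library. It does not use Gensim, and the helper names `bag_of_words` and `cosine_similarity` are assumptions made for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "Topic models group documents by theme.",
    "Documents are grouped by theme using topic models.",
    "The cat sat on the mat.",
]
vectors = [bag_of_words(d) for d in docs]
print("doc 0 vs doc 1:", round(cosine_similarity(vectors[0], vectors[1]), 3))
print("doc 0 vs doc 2:", round(cosine_similarity(vectors[0], vectors[2]), 3))
```

The first pair shares most of its vocabulary and scores high, while the third document shares no words with the first and scores zero. Embedding models such as Word2Vec replace the count vectors with dense vectors but keep the same comparison step.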

**5) Transformers (Hugging Face):**

Features:

a) State-of-the-art pre-trained models (e.g., BERT, GPT).

b) Easy integration for various NLP tasks.

c) Fine-tuning capabilities.

Applications:

a) Named entity recognition, sentiment analysis.

b) Text generation, language translation.

**R Text Analytics Libraries:**

**1) tm (Text Mining Package):**

Features:

a) Text preprocessing: Cleaning, stemming, stopword removal.

b) Document-term matrix creation.

c) Basic text mining functions.

Applications:

a) Topic modeling, clustering.

b) Sentiment analysis, document classification.
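
A document-term matrix has one row per document and one column per vocabulary term, with each cell holding a count. The following is a small Python sketch of the idea, not the tm package's implementation. The two sentences are taken from the example text used later in this notebook, and `tokenize` is a hypothetical helper.

```python
import re
from collections import Counter

docs = [
    "He raced to the grocery store.",
    "He raced back home to grab his wallet.",
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


# One Counter per document, then a shared, sorted vocabulary for the columns.
counts = [Counter(tokenize(d)) for d in docs]
vocabulary = sorted(set().union(*counts))

# Rows are documents, columns follow the order of `vocabulary`.
dtm = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in dtm:
    print(row)
```

Most cells in a realistic matrix are zero, which is why text mining packages store it in a sparse format.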

**2) quanteda:**

Features:

a) Fast and flexible text processing.

b) Tokenization, n-grams, and corpus analysis.

c) Support for advanced text analysis functions.

Applications:

a) Content analysis, sentiment analysis.

b) Document-feature matrix creation.
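
An n-gram is a run of n consecutive tokens. A minimal Python sketch of n-gram extraction, independent of quanteda, is shown here; the function name `ngrams` is an assumption for this example.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "he raced to the grocery store".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

Bigrams and trigrams keep some word order, so phrases such as "grocery store" survive as single features instead of being split into unrelated words.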

**3) tm.plugin.sentiment:**

Features:

a) Lexicon-based sentiment scoring (e.g., polarity and subjectivity).

b) Integration with the tm package.

Applications:

a) Analyzing sentiment in textual data.

**4) textTinyR:**

Features:

a) Efficient text preprocessing (tokenization and transformation) for small or large files.

b) Lightweight implementation for quick analysis.

Applications:

a) Fast preprocessing of large text collections.

**5) quanteda.textstats:**

Features:

a) Basic text statistics.

b) Word frequency analysis, readability scores.

Applications:

a) Exploratory text analysis.

b) Extracting insights from text data.
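
As an illustration of a readability score, the sketch below computes the Flesch Reading Ease, defined as 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). It is written in plain Python rather than with quanteda.textstats, and `count_syllables` is a rough vowel-group heuristic assumed for this example, so scores will differ slightly from those of a dictionary-based tool.

```python
import re


def count_syllables(word):
    """Estimate syllables by counting vowel groups (a heuristic, not a dictionary lookup)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in '-le' endings such as 'table'.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text):
    """Higher scores indicate text that is easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


simple = "The cat sat on the mat."
dense = "Interdisciplinary computational linguistics necessitates sophisticated methodologies."
print(round(flesch_reading_ease(simple), 1))
print(round(flesch_reading_ease(dense), 1))
```

Short sentences made of one-syllable words score high, while long words push the score down, which matches the intent of the formula.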

**Steps Involved in Text Analytics:**

**1) Data Collection:** Gather text data from various sources such as documents, social media, or websites.

**2) Text Preprocessing:** Clean and preprocess the text data by removing noise, handling missing values, and performing tasks like tokenization and stemming.

**3) Text Exploration:** Explore the data through descriptive statistics, word frequency analysis, and visualization.

**4) Feature Extraction:** Convert the text into a numerical format, often using techniques like TF-IDF (Term Frequency-Inverse Document Frequency).

**5) Model Building:** Apply machine learning or natural language processing models for tasks such as sentiment analysis, classification, or clustering.

**6) Evaluation:** Assess the performance of the models using appropriate metrics based on the task.

**7) Visualization:** Visualize the results to gain insights and interpret the findings.
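
The preprocessing and feature extraction steps can be sketched in a few lines of Python. This example uses the plain TF-IDF variant, tf × ln(N / df), while libraries often apply smoothed or normalized variants, so their numbers will differ. The stopword list `STOP_WORDS`, the sample reviews, and the helper names `preprocess` and `tf_idf` are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer.
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


docs = [
    "The service is slow and the food is cold.",
    "The food is great.",
    "Great service and great prices.",
]
weights = tf_idf(docs)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

Terms that occur in only one document ("slow", "cold") receive a higher weight than terms shared across documents ("food", "service"), which is what makes TF-IDF useful as input to the model building step.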

**Benefits of Text Analytics:**

**1) Insight Generation:** Extract valuable insights from large volumes of unstructured text data.

**2) Decision Support:** Support decision-making processes by providing data-driven insights.

**3) Automation:** Automate the analysis of textual data, saving time and resources.

**4) Customer Feedback Analysis:** Understand customer sentiments and opinions from reviews and feedback.

**5) Information Retrieval:** Enhance search functionalities by improving the relevance of retrieved information.

**6) Fraud Detection:** Identify patterns or anomalies in textual data that may indicate fraudulent activities.

**7) Competitive Intelligence:** Monitor and analyze competitor activities and market trends from textual sources.




# **Text Analytics in Python**

In [None]:
pip install beautifulsoup4



In [None]:
pip install nltk



In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag, ne_chunk
import requests
from bs4 import BeautifulSoup

# Download necessary resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [None]:
# Function to scrape text from a website
def scrape_text(url):
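    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on HTTP errors instead of parsing an error page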
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract text from paragraphs
    paragraphs = soup.find_all('p')
    text = ' '.join([para.get_text() for para in paragraphs])
    return text

# URL of the Wikipedia article to scrape
url = 'https://en.wikipedia.org/wiki/Natural_language_processing'
text = scrape_text(url)

In [None]:
# Tokenization (Sentence & Word)
sentences = sent_tokenize(text)
words = word_tokenize(text)

# Frequency Distribution
fdist = FreqDist(words)
print("Frequency Distribution:")
print(fdist.most_common(10))

# Remove stopwords and punctuation
stop_words = set(stopwords.words('english'))
filtered_words = [word.lower() for word in words if word.isalnum() and word.lower() not in stop_words]

# Lexicon Normalization (Stemming, Lemmatization)
porter = PorterStemmer()
stemmed_words = [porter.stem(word) for word in filtered_words]

lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]

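# Part-of-speech tagging
# Note: filtered_words is lowercased and has stopwords removed, which strips the
# capitalization and sentence context the tagger and NE chunker rely on; tagging
# the original `words` list generally gives more reliable tags and entities.
pos_tags = pos_tag(filtered_words)

# Named Entity Recognition
named_entities = ne_chunk(pos_tags)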


Frequency Distribution:
[(',', 74), ('the', 61), ('of', 55), ('.', 44), ('and', 31), ('[', 29), (']', 29), ('in', 26), ('language', 22), ('to', 19)]
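
The top entries above are punctuation and stopwords because the counts were taken on the raw tokens. Counting after filtering gives a more informative picture. The sketch below shows the idea with the standard library only; the sample sentence and the short `STOP_WORDS` set are assumptions for illustration. In the NLTK cell, the analogous change would be to build the distribution from `filtered_words` once that list has been computed.

```python
import re
from collections import Counter

# Short illustrative stopword list; NLTK's English list is far longer.
STOP_WORDS = {"the", "of", "and", "in", "to", "is", "a"}

text = "The study of language and the processing of language [1] is central to NLP."

# Keep alphabetic tokens only, which drops punctuation and citation markers like [1].
tokens = re.findall(r"[a-z]+", text.lower())
content_tokens = [t for t in tokens if t not in STOP_WORDS]

freq = Counter(content_tokens)
print(freq.most_common(3))
```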


In [None]:
print("\nProcessed Text:")
print("Sentences:", len(sentences))
print("Words before processing:", len(words))
print("Words after processing:", len(filtered_words))
print("Stemmed Words:", stemmed_words[:10])
print("Lemmatized Words:", lemmatized_words[:10])
print("POS Tags:", pos_tags[:10])
print("Named Entities:", named_entities[:10])


Processed Text:
Sentences: 44
Words before processing: 1363
Words after processing: 686
Stemmed Words: ['natur', 'languag', 'process', 'nlp', 'interdisciplinari', 'subfield', 'comput', 'scienc', 'linguist', 'primarili']
Lemmatized Words: ['natural', 'language', 'processing', 'nlp', 'interdisciplinary', 'subfield', 'computer', 'science', 'linguistics', 'primarily']
POS Tags: [('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('nlp', 'JJ'), ('interdisciplinary', 'JJ'), ('subfield', 'NN'), ('computer', 'NN'), ('science', 'NN'), ('linguistics', 'NNS'), ('primarily', 'RB')]
Named Entities: [('natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('nlp', 'JJ'), ('interdisciplinary', 'JJ'), ('subfield', 'NN'), ('computer', 'NN'), ('science', 'NN'), ('linguistics', 'NNS'), ('primarily', 'RB')]


# **Text Analytics in R**

In [None]:
install.packages('tokenizers')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘Rcpp’, ‘SnowballC’




In [None]:
install.packages('tm')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘NLP’, ‘slam’, ‘BH’




In [None]:
install.packages('rvest')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
library(tm)
library(rvest)
library(NLP)
library(tokenizers)
library(SnowballC)

Loading required package: NLP



In [None]:
text <- "He raced to the grocery store. He went inside but realized he forgot his wallet. He raced back home to grab it. Once he found it, he raced to the car again and drove back to the grocery store."
# Tokenisation
sent_tokens <- unlist(tokenize_sentences(text))
word_tokens <- unlist(tokenize_words(text))
cat("Sentence Tokens:", sent_tokens, "\n")
cat("Word Tokens:", word_tokens, "\n")


Sentence Tokens: He raced to the grocery store. He went inside but realized he forgot his wallet. He raced back home to grab it. Once he found it, he raced to the car again and drove back to the grocery store. 
Word Tokens: he raced to the grocery store he went inside but realized he forgot his wallet he raced back home to grab it once he found it he raced to the car again and drove back to the grocery store 


In [None]:
# Frequency Distribution
fdist <- table(unlist(word_tokens))
print(head(sort(fdist, decreasing = TRUE), 2))


he to 
 6  4 


In [None]:
# Remove stopwords and punctuation
stop_words <- stopwords("en")
filtered_tokens <- word_tokens[!(word_tokens %in% stop_words) & grepl("[a-zA-Z]", word_tokens)]
cat("Filtered Tokens (without stopwords and punctuations):", filtered_tokens, "\n")

Filtered Tokens (without stopwords and punctuations): raced grocery store went inside realized forgot wallet raced back home grab found raced car drove back grocery store 


In [None]:
# Stemming
stemmed_tokens <- wordStem(filtered_tokens, language = "en")
cat("Stemmed Tokens:", stemmed_tokens, "\n")

Stemmed Tokens: race groceri store went insid realiz forgot wallet race back home grab found race car drove back groceri store 


In [None]:
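# Lemmatization
# Note: wordStem() is a stemmer, not a lemmatizer, and it expects a vector of single
# words. Passed the whole string, it treats it as one "word", so the text comes back
# lowercased but otherwise unchanged. Dictionary-based lemmatization needs a separate
# package (for example textstem::lemmatize_strings).
lemmatized_text <- tolower(text)
lemmatized_text <- wordStem(lemmatized_text, language = "en")

cat("Lemmatized Text:", lemmatized_text, "\n")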

Lemmatized Text: he raced to the grocery store. he went inside but realized he forgot his wallet. he raced back home to grab it. once he found it, he raced to the car again and drove back to the grocery store. 
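
The difference between stemming and lemmatization is easiest to see side by side. Stemming strips suffixes by rule, while lemmatization looks each word up and returns its dictionary form. The Python sketch below is illustrative only: `naive_stem` is a deliberately crude suffix stripper (not the Porter algorithm), and `LEMMAS` is a hypothetical lookup table covering a few words from the example text.

```python
# Hypothetical lookup table; real lemmatizers rely on a full dictionary.
LEMMAS = {
    "raced": "race", "went": "go", "realized": "realize",
    "forgot": "forget", "found": "find", "drove": "drive",
}


def naive_stem(word):
    """Strip one common suffix by rule; the result need not be a real word."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def lemmatize(word):
    """Return the dictionary form if known, otherwise the word itself."""
    return LEMMAS.get(word, word)


words = ["raced", "went", "drove", "store"]
for w in words:
    print(f"{w:8s} stem={naive_stem(w):8s} lemma={lemmatize(w)}")
```

Rule-based stemming cannot map irregular forms such as "went" or "drove" to their base verbs, which is why those tokens pass through the stemming step unchanged.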


In [None]:
# Web Scraping
url <- 'http://quotes.toscrape.com/'
web_page <- read_html(url)
web_text <- html_text(web_page)
cat("Scraped Data from the Website:\n", web_text)

Scraped Data from the Website:
 Quotes to Scrape
    
        
            
                
                    Quotes to Scrape
                
            
            
                
                
                    Login
                
                
            
        
    


    

    
        “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
        by Albert Einstein
        (about)
        
        
            Tags:
            change
            
            deep-thoughts
            
            thinking
            
            world
            
        
    

    
        “It is our choices, Harry, that show what we truly are, far more than our abilities.”
        by J.K. Rowling
        (about)
        
        
            Tags:
            abilities
            
            choices
            
        
    

    
        “There are only two ways to live your life. One is as though nothing
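
Extracting the text of the whole page mixes navigation, tags, and whitespace with the quotes. Selecting specific elements first gives cleaner input for analysis. The sketch below shows the idea in Python with the standard library's `html.parser`. It runs on a small hypothetical snippet (`SAMPLE_HTML`) rather than the live site, and it assumes that each quote sits in a `<span class="text">` element. In rvest, the analogous step would be to select nodes with `html_elements()` before calling `html_text()`.

```python
from html.parser import HTMLParser


class QuoteExtractor(HTMLParser):
    """Collect the text of every <span class="text"> element (no nested spans assumed)."""

    def __init__(self):
        super().__init__()
        self.in_quote = False
        self.quotes = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "text" in classes:
            self.in_quote = True
            self.quotes.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_quote = False

    def handle_data(self, data):
        if self.in_quote:
            self.quotes[-1] += data


# Hypothetical markup shaped like a quotes page; not copied from the real site.
SAMPLE_HTML = """
<div class="quote">
  <span class="text">First quote.</span>
  <span>by <small class="author">Author One</small></span>
</div>
<div class="quote">
  <span class="text">Second quote.</span>
</div>
"""

parser = QuoteExtractor()
parser.feed(SAMPLE_HTML)
print(parser.quotes)
```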

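**Conclusion:** In this experiment, text was tokenized, counted, stripped of stopwords, stemmed, and scraped from the web in both Python (NLTK, BeautifulSoup) and R (tokenizers, tm, SnowballC, rvest), with part-of-speech tagging and named entity chunking additionally performed in Python. Text analytics extracts useful information from unstructured text, and the choice of libraries and techniques depends on the goals of the analysis and the characteristics of the data at hand.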