
# **Experiment - 7: Perform the steps involved in Text Analytics in Python & R**

# **Lab Outcomes (LO):**

*   Design Text Analytics Application on a given data set. (LO4)



# **Theory :**

# 1. Explore 5 Text Analytics Libraries in Python and write their features & applications.

Python offers several mature libraries for natural language processing (NLP). Five of the most widely used options are described below.

**1. TextBlob**

TextBlob is a Python (2 and 3) library that is used to process textual data, with a primary focus on making common text-processing functions accessible via easy-to-use interfaces. Objects within TextBlob can be used as Python strings that can deliver NLP functionality to help build text analysis applications.

TextBlob’s API is extremely intuitive and makes it easy to perform an array of NLP tasks, such as noun phrase extraction, language translation, part-of-speech tagging, sentiment analysis, WordNet integration, and more.

This library is highly recommended for anyone relatively new to developing text analysis applications, as text can be processed with just a few lines of code.

**2. spaCy**

This open source Python NLP library has established itself as the go-to library for production usage, simplifying the development of applications that must process large volumes of text quickly.

spaCy can be used to preprocess text for deep learning, to build systems that understand natural language, and to create information extraction systems.

Two of the key selling points of spaCy are that it features many pre-trained statistical models and word vectors, and has tokenization support for 49 languages. spaCy is also preferred by many Python developers for its high speed, parsing efficiency, deep learning integration, convolutional neural network models, and named entity recognition capabilities.

**3. Natural Language Toolkit (NLTK)**

NLTK consists of a wide range of text-processing libraries and is one of the most popular Python platforms for processing human language data and text analysis. Favored by experienced NLP developers and beginners, this toolkit provides a simple introduction to programming applications that are designed for language processing purposes.

Some of the key features provided by Natural Language Toolkit’s libraries include sentence detection, POS tagging, and tokenization. Tokenization, for example, is used in NLP to split paragraphs and sentences into smaller components that can be assigned specific, more understandable, meanings.
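To make the idea of tokenization concrete, here is a minimal sketch that uses only the Python standard library. It is not how NLTK tokenizes internally: the helper names `split_sentences` and `split_words` and the regular expressions are illustrative assumptions, and the sentence rule is deliberately naive (it would split after abbreviations such as "Dr.").

```python
import re

def split_sentences(text):
    """Split text at '.', '!' or '?' followed by whitespace (naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def split_words(sentence):
    """Return words and punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "NLTK splits text into tokens. Each token can then be tagged!"
sentences = split_sentences(text)
tokens = [split_words(sentence) for sentence in sentences]

print(sentences)
print(tokens)
```

For real text, `sent_tokenize` and `word_tokenize` from NLTK handle abbreviations, contractions, and similar edge cases far better than this sketch.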

NLTK provides a simple interface to over 50 corpora and lexical resources. Thanks to the large number of modules it bundles, NLTK offers the core functionality needed for almost any type of NLP task within Python.

**4. Gensim**

Gensim is a Python library designed for document indexing, topic modeling, and similarity retrieval over large corpora. Its algorithms are memory-independent with respect to corpus size. This means it can process an input that exceeds the available RAM on a system.

Popular topic-modeling algorithms are available through the library’s user-friendly interfaces, including Hierarchical Dirichlet Process (HDP), Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA/LSI/SVD), and Random Projections (RP).

Gensim’s accessibility is further enhanced by its extensive documentation and Jupyter Notebook tutorials. Note that Gensim depends on the Python packages SciPy and NumPy for scientific computing functionality, so they must also be installed.
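The point about handling input larger than RAM comes down to streaming: documents are read one at a time rather than loaded all at once. The sketch below illustrates that idea with the standard library only. It does not use the library's API; `stream_documents`, `streamed_term_counts`, and the two-line sample corpus are hypothetical names and data chosen for illustration.

```python
from collections import Counter

def stream_documents(lines):
    """Yield one tokenized document at a time instead of building a full list."""
    for line in lines:
        yield line.lower().split()

def streamed_term_counts(lines):
    """Accumulate corpus-wide term counts while holding one document in memory."""
    totals = Counter()
    for tokens in stream_documents(lines):
        totals.update(tokens)
    return totals

# Hypothetical stand-in for a large file; a file object can be passed the
# same way, because iterating over it also yields one line at a time.
corpus_lines = [
    "topic models find themes",
    "topic models scale to large corpora",
]
counts = streamed_term_counts(iter(corpus_lines))
print(counts)
```

Only the running `Counter` and the current document are kept in memory, so the size of the corpus itself does not limit what can be processed.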

**5. PyNLPl**

Last on the list is PyNLPl (pronounced "pineapple"), a Python library made up of several custom modules designed specifically for NLP tasks. Its most notable feature is a comprehensive library for working with FoLiA XML (Format for Linguistic Annotation).

The platform is segmented into different packages and modules that are capable of both basic and advanced tasks, from the extraction of things like n-grams to much more complex functions. This makes it a great option for any NLP developer, regardless of their experience level.
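As an illustration of the n-gram extraction mentioned above, here is a minimal standard-library sketch. The function `ngrams` is a hypothetical helper written for this example, not a function taken from PyNLPl.

```python
def ngrams(tokens, n):
    """Return all contiguous windows of n tokens as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "text analytics with python".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

If the token list is shorter than `n`, the function returns an empty list.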

# 2. Explore 5 Text Analytics Libraries in R and write their features & applications.

R also has several powerful libraries for text analytics. Here are five popular ones with their features and applications:

**tm (Text Mining Infrastructure in R):**

**Features:**

Corpus creation and manipulation.

Text preprocessing tasks such as stemming, stop-word removal, and tokenization.

Document-term matrix creation.

Various text mining algorithms.

**Applications:**

Document classification.

Sentiment analysis.

Topic modeling.

Text clustering.
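The document-term matrix listed among the features above is the central data structure in this style of text mining: one row per document, one column per term, and each cell holding a count. Since tm is an R package, the sketch below only illustrates the concept in Python (the language used for the code in this experiment). The function `document_term_matrix` and the sample documents are assumptions made for the example.

```python
from collections import Counter

def document_term_matrix(documents):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

docs = ["text mining in r", "text mining in python", "python for text"]
vocabulary, matrix = document_term_matrix(docs)

print(vocabulary)
for row in matrix:
    print(row)
```

Classification, clustering, and topic modeling typically take a matrix of this shape as input, usually in a sparse representation because most cells are zero.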

**quanteda:**

**Features:**

High-performance text analysis.

Tokenization and document-feature matrix creation.

Text scaling and transformation.

Advanced text manipulation functions.

Support for multi-language text analysis.

**Applications:**

Text classification.

Topic modeling.

Text clustering.

Named entity recognition.

**NLP (Natural Language Processing) package:**

**Features:**

Part-of-speech tagging.

Named entity recognition.

Dependency parsing.

Coreference resolution.

**Applications:**

Advanced syntactic and semantic analysis.

Information extraction.

Coreference resolution in documents.

**tm.plugin.sentiment:**

**Features:**

Sentiment analysis on text data.

Integration with external sentiment lexicons.

Customizable sentiment analysis functions.

**Applications:**

Sentiment analysis in social media data.

Customer review sentiment analysis.

Assessing sentiment trends over time.
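Lexicon-based sentiment analysis, as described in the features above, looks up each word in a table of scored terms and combines the scores. The following Python sketch shows only the principle. The four-word `LEXICON` and the function `lexicon_score` are hypothetical, and real lexicons and scoring rules (negation handling, intensifiers, normalization) are considerably more elaborate.

```python
import re

# Hypothetical toy lexicon; real lexicons contain thousands of scored entries.
LEXICON = {"good": 1, "great": 2, "bad": -1, "terrible": -2}

def lexicon_score(text, lexicon=LEXICON):
    """Sum the lexicon scores of all words in the text; unknown words score 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)

review = "The service was great but the food was bad"
score = lexicon_score(review)
print(score)
```

A positive total suggests positive sentiment and a negative total suggests negative sentiment. Swapping in a different dictionary is all it takes to use another lexicon.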

**wordcloud:**

**Features:**

Creation of word clouds.

Customizable word cloud appearance.

Frequency-based word representation.

**Applications:**

Visualization of most frequent terms in a corpus.

Identifying key terms in a document or set of documents.

Exploring and presenting textual data in a visually appealing way.

These R libraries provide a comprehensive set of tools for text analytics, ranging from basic text preprocessing to advanced natural language processing tasks. Depending on the specific requirements of your project, you can choose and combine these libraries to perform various text analytics tasks efficiently.

# 3. Steps involved in Text Analytics

The steps involved in analyzing an unstructured text document are:

*   Language identification
*   Tokenization
*   Sentence breaking
*   Part-of-speech tagging
*   Chunking
*   Syntax parsing
*   Sentence chaining
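The first step in the list, language identification, can be sketched with a simple heuristic: count how many of a language's common function words appear in the text and pick the language with the most hits. This is only an illustrative approach, not the method used by any particular library. The tiny `STOPWORDS` sets and the function `guess_language` are assumptions made for this example.

```python
# Hypothetical, heavily abbreviated stopword sets for three languages.
STOPWORDS = {
    "english": {"the", "and", "is", "of", "to", "in"},
    "french": {"le", "la", "et", "est", "de", "les"},
    "spanish": {"el", "la", "y", "es", "de", "los"},
}

def guess_language(text):
    """Return the language whose stopwords overlap most with the text, or None."""
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_language("the cat is in the garden"))
print(guess_language("le chat est dans le jardin"))
print(guess_language("el gato es de la casa"))
```

The later steps depend on this one, because tokenizers, taggers, and parsers are usually language-specific.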

# 4. Benefits of Text Analytics

* Helps in understanding emerging customer trends, product performance, and service quality.
* Helps researchers explore existing literature and extract what is relevant to their study.
* Helps search engines improve their performance, giving users faster results.
* Supports more data-driven decision making.
* Refines content recommendation systems by categorizing related content.
* Boosts the efficiency of working with unstructured data.

# **Task to be performed :**

1. Explore Top-5 Text Analytics Libraries in Python (w.r.t Features & Applications)
2. Explore Top-5 Text Analytics Libraries in R (w.r.t Features & Applications)
3. Perform the following experiments using Python & R
* Tokenization (Sentence & Word)
* Frequency Distribution
* Remove stopwords & punctuations
* Lexicon Normalization (Stemming, Lemmatization)
* Part of Speech tagging
* Named Entity Recognition
* Scrape data from a website
4. Prepare a document with the Aim, Tasks performed, Program, Output, and Conclusion.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('vader_lexicon')

text = "NLTK is a powerful library for natural language processing."

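tokens = word_tokenize(text)
# Build the stopword set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]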

sid = SentimentIntensityAnalyzer()
sentiment_score = sid.polarity_scores(text)

print("Tokenized Text:", tokens)
print("Filtered Tokens:", filtered_tokens)
print("Sentiment Score:", sentiment_score)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


Tokenized Text: ['NLTK', 'is', 'a', 'powerful', 'library', 'for', 'natural', 'language', 'processing', '.']
Filtered Tokens: ['NLTK', 'powerful', 'library', 'natural', 'language', 'processing', '.']
Sentiment Score: {'neg': 0.0, 'neu': 0.531, 'pos': 0.469, 'compound': 0.6486}
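The `compound` value in the output above is a single normalized score between -1 and 1. A common convention for turning it into a label, suggested in the VADER documentation, uses cut-offs of ±0.05. The sketch below applies that convention to the score dictionary printed above. The function name `label_from_compound` is hypothetical, and the threshold should be treated as an adjustable assumption rather than a fixed rule.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Scores copied from the cell output above.
scores = {'neg': 0.0, 'neu': 0.531, 'pos': 0.469, 'compound': 0.6486}
label = label_from_compound(scores["compound"])
print(label)  # positive
```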


In [None]:
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import string
from nltk import pos_tag, ne_chunk


nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Sample text
sample_text = """
Natural language processing (NLP) is a field of artificial intelligence that focuses on the interaction
between computers and humans through natural language. It involves several tasks such as tokenization,
lemmatization, and named entity recognition. Web scraping is a technique to extract information from websites.
"""

# Tokenization
sentences = sent_tokenize(sample_text)
words = word_tokenize(sample_text)

# Frequency Distribution
fdist = FreqDist(words)
print("Frequency Distribution:")
print(fdist.most_common(5))

# Remove Stopwords & Punctuation
stop_words = set(stopwords.words("english"))
filtered_words = [word.lower() for word in words if word.lower() not in stop_words and word.lower() not in string.punctuation]

# Lexicon Normalization (Stemming, Lemmatization)
porter_stemmer = PorterStemmer()
stemmed_words = [porter_stemmer.stem(word) for word in filtered_words]

lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]

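# Part of Speech Tagging
# Note: the tokens here are lowercased and stopword-filtered, which removes
# capitalization and context that the tagger and NER chunker rely on;
# tagging the original `words` list generally gives more accurate results.
pos_tags = pos_tag(filtered_words)

# Named Entity Recognition
ner_tags = ne_chunk(pos_tags)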

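# Web Scraping
url = "https://example.com"
response = requests.get(url, timeout=10)  # avoid hanging on an unresponsive server
response.raise_for_status()  # stop early if the server returns an HTTP error
html_content = response.text
soup = BeautifulSoup(html_content, 'html.parser')
scraped_text = soup.get_text()  # strip the HTML tags and keep only the text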

# Print Results
print("\nTokenization (Sentence):")
print(sentences)

print("\nTokenization (Word):")
print(words)

print("\nFrequency Distribution:")
print(fdist.most_common(5))

print("\nAfter Removing Stopwords & Punctuation:")
print(filtered_words)

print("\nAfter Stemming:")
print(stemmed_words)

print("\nAfter Lemmatization:")
print(lemmatized_words)

print("\nPart of Speech Tagging:")
print(pos_tags)

print("\nNamed Entity Recognition:")
print(ner_tags)

print("\nScraped Data from Website:")
print(scraped_text)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


Frequency Distribution:
[('.', 3), ('language', 2), ('is', 2), ('a', 2), ('and', 2)]

Tokenization (Sentence):
['\nNatural language processing (NLP) is a field of artificial intelligence that focuses on the interaction\nbetween computers and humans through natural language.', 'It involves several tasks such as tokenization,\nlemmatization, and named entity recognition.', 'Web scraping is a technique to extract information from websites.']

Tokenization (Word):
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'focuses', 'on', 'the', 'interaction', 'between', 'computers', 'and', 'humans', 'through', 'natural', 'language', '.', 'It', 'involves', 'several', 'tasks', 'such', 'as', 'tokenization', ',', 'lemmatization', ',', 'and', 'named', 'entity', 'recognition', '.', 'Web', 'scraping', 'is', 'a', 'technique', 'to', 'extract', 'information', 'from', 'websites', '.']

Frequency Distribution:
[('.', 3), ('language', 2), ('i