
# Experiment 7


> **Aim:** Perform the steps involved in Text Analytics in Python & R

> **Lab Objective:** To introduce the concept of text analytics and its applications.

> **Lab Outcome:** Design Text Analytics Application on a given data set. (LO4)



**Task to be performed:**
1. Explore Top-5 Text Analytics Libraries in Python (w.r.t Features & Applications)
2. Explore Top-5 Text Analytics Libraries in R (w.r.t Features & Applications)
3. Perform the following experiments using Python & R
* Tokenization (Sentence & Word)
* Frequency Distribution
* Remove stopwords & punctuations
* Lexicon Normalization (Stemming, Lemmatization)
* Part of Speech tagging
* Named Entity Recognition
* Scrape data from a website
4. Prepare a document with the Aim, Tasks performed, Program, Output, and Conclusion.

**Tools & Libraries to be explored:**
* Python Libraries: nltk, scattertext, SpaCy, TextBlob, sklearn, pandas, numpy
* R Libraries: shiny, tm, quanteda

**Theory:**
### Text analytics libraries in Python

1. **NLTK:** NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and more.

> **Features:**
* Tokenization: NLTK provides functions for splitting text into words, sentences, or other linguistic units, making it easy to process text data.
* Part-of-speech Tagging: NLTK includes pre-trained models for tagging words with their respective parts of speech (e.g., noun, verb, adjective), which is essential for many NLP tasks.
* Named Entity Recognition (NER): NLTK offers tools for identifying and extracting named entities such as persons, organizations, locations, and dates from text documents.

> **Applications:**
* Text Classification: NLTK is widely used for building text classification models, including spam detection, sentiment analysis, topic classification, and document categorization.
* Information Extraction: NLTK facilitates the extraction of structured information from unstructured text sources, including entity extraction, event extraction, and relationship extraction.
* Language Understanding: NLTK enables the development of systems for understanding and interpreting human language, such as chatbots, virtual assistants, and question-answering systems.


2. **SpaCy:** spaCy is an open-source library for advanced Natural Language Processing (NLP) in Python. It features pre-trained models for various languages, including English, German, French, and Spanish, and provides capabilities for tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and more.

> **Features:**
* Dependency Parsing: spaCy supports dependency parsing, allowing for the analysis of syntactic relationships between words in a sentence and the construction of parse trees representing sentence structure.
* Lemmatization: spaCy provides lemmatization functionality for reducing words to their base or canonical form, which helps in text normalization and word analysis.
* Sentence Boundary Detection: spaCy includes algorithms for detecting sentence boundaries in text, enabling tasks such as sentence segmentation and sentence-level analysis.

> **Applications:**
* Information Retrieval: spaCy is employed in information retrieval systems for indexing, searching, and retrieving relevant documents or passages based on user queries or keywords.
* Chatbots and Virtual Assistants: spaCy is used in the development of conversational agents, chatbots, and virtual assistants for natural language understanding, dialogue management, and response generation.
* Text Mining and Analysis: spaCy is applied in text mining and analysis tasks, including document clustering, keyword extraction, trend detection, and content summarization.

3. **TextBlob:** TextBlob is a simple and easy-to-use library for processing textual data in Python. It provides a simple API for common text processing tasks such as sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and classification.

> **Features:**
* Sentiment Analysis: TextBlob provides built-in sentiment analysis capabilities for assessing the sentiment or opinion expressed in a piece of text, including polarity (positive, negative, neutral) and subjectivity scores.
* Language Detection: Older TextBlob releases could detect the language of a given text through the Google Translate API, enabling language identification and language-specific processing. This feature was deprecated in version 0.16.0.
* Translation: Older TextBlob releases also supported translating text between languages through the same Google Translate API. This feature was deprecated alongside language detection.

> **Applications:**
* Text Classification: TextBlob is employed in text classification tasks, including document categorization, topic modeling, and sentiment-based classification in various domains.
* Language Processing: TextBlob is utilized in language processing applications, such as language detection, translation, spell checking, and linguistic analysis.
* Data Preprocessing: TextBlob is used for data preprocessing tasks in natural language processing pipelines, including text cleaning, normalization, and feature extraction.

4. **ScatterText:** Scattertext is a free, open-source Python library that visualizes text data from different corpora as an interactive HTML scatterplot. It shows how words are distributed across different documents or categories of documents.

> **Features:**
* Comparative Visualization: Scattertext specializes in visualizing the differences in word frequencies and associations between two or more corpora of text, allowing users to compare and contrast language use across different groups or categories.
* Term Frequency Analysis: Scattertext provides tools for analyzing and visualizing the frequency of terms (words or phrases) in text data, highlighting terms that are more prevalent in one corpus compared to another.
* Term Association Analysis: Scattertext identifies and visualizes associations between terms and categories, revealing which terms are strongly associated with specific groups or topics.

> **Applications:**
* Comparative Text Analysis: Scattertext is commonly used for comparative text analysis tasks, such as comparing language use between different groups (e.g., political parties, product reviews, social media posts) to identify distinctive terms and themes.
* Sentiment Analysis: Scattertext can be applied to sentiment analysis tasks to examine differences in sentiment expression across different categories or topics, helping identify sentiment-related patterns and trends.
* Authorship Attribution: Scattertext is used in authorship attribution studies to identify linguistic features that distinguish between different authors or writing styles, aiding in the analysis of authorship patterns and authorship identification.
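
The comparative term-frequency idea behind Scattertext can be illustrated with the standard library alone. The following is a minimal sketch and not Scattertext's API: the two toy corpora, the helper `relative_frequencies`, and the plain frequency-difference score are all assumptions made for this example.

```python
from collections import Counter

def relative_frequencies(docs):
    """Return each term's share of all tokens in a list of documents."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

# Hypothetical toy corpora for two categories of product reviews
positive_reviews = ["great phone great battery", "great screen"]
negative_reviews = ["poor battery", "poor screen poor sound"]

pos = relative_frequencies(positive_reviews)
neg = relative_frequencies(negative_reviews)

# Score each term by how much more frequent it is in the positive corpus
vocabulary = set(pos) | set(neg)
scores = {t: pos.get(t, 0.0) - neg.get(t, 0.0) for t in vocabulary}

# Terms ordered from most positive-associated to most negative-associated
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0], ranked[-1])
```

Terms that occur equally often in both corpora (here `battery` and `screen`) score zero, which is why a comparative plot places them near the diagonal rather than in either corner.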

5. **Gensim:** Gensim is a Python library for topic modeling, document similarity analysis, and other natural language processing (NLP) tasks.

> **Features:**
* Topic Modeling: Gensim provides efficient implementations of popular topic modeling algorithms such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA). These algorithms allow users to discover latent topics in a collection of documents and analyze the distribution of topics across documents.
* Word Embeddings: Gensim offers tools for training and using word embedding models such as Word2Vec, Doc2Vec, and FastText. These models learn dense vector representations of words and documents in a continuous vector space, capturing semantic relationships and similarities between words.
* Document Similarity: Gensim allows users to compute similarities between documents based on their vector representations. This functionality is useful for tasks such as document clustering, information retrieval, and recommendation systems.

> **Applications:**
* Topic Modeling: Gensim is widely used for topic modeling tasks in various domains, including academic research, social media analysis, and content recommendation systems. It helps identify latent themes and topics in large collections of documents, enabling researchers and analysts to explore and understand complex textual data.
* Document Clustering: Gensim's document similarity functionality is employed in document clustering applications to group similar documents together based on their content. It is used in information retrieval systems, content organization, and text classification tasks.
* Natural Language Understanding: Gensim's word embedding models are used for natural language understanding tasks such as named entity recognition, sentiment analysis, and semantic similarity assessment. These models capture semantic relationships between words and phrases, facilitating more accurate and context-aware NLP applications.
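
Document similarity on vector representations usually comes down to cosine similarity. The sketch below uses raw bag-of-words counts instead of Gensim's models, so it only illustrates the idea; the helper names `bow` and `cosine_similarity` and the sample sentences are assumptions for this example.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector: term -> count (whitespace tokenization)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

doc_a = bow("text mining finds patterns")
doc_b = bow("text mining finds trends")
doc_c = bow("shiny builds dashboards")

print(cosine_similarity(doc_a, doc_b))  # three of four terms shared
print(cosine_similarity(doc_a, doc_c))  # no shared terms
```

Dense embeddings such as those learned by Word2Vec or Doc2Vec are compared the same way; only the vectors change.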


In [None]:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Text_mining'
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')

    first_paragraph = soup.find('p')

    if first_paragraph:
        print(first_paragraph.text)
    else:
        print('No <p> tags found on the website')
else:
    print('Failed to retrieve data from the website')


Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources."[1] Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al. (2005) we can distinguish between three different perspectives of text mining: information extraction, data mining, and a knowledge discovery in databases (KDD) process.[2] Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. '
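
`soup.find('p')` returns the first `<p>` element in document order, which on some pages is an empty placeholder. A more defensive variant picks the first paragraph that actually contains text. The sketch below shows that idea offline with the standard library's `html.parser`; the class `ParagraphCollector`, the helper `first_nonempty_paragraph`, and the sample HTML string are illustrative assumptions, not part of the experiment above.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text content of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._buffer).strip())
            self._in_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still part of the paragraph
        if self._in_p:
            self._buffer.append(data)

def first_nonempty_paragraph(html):
    """Return the text of the first <p> that is not empty, or None."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    for paragraph in parser.paragraphs:
        if paragraph:
            return paragraph
    return None

# Hypothetical page: an empty placeholder paragraph precedes the real one
sample_html = (
    '<html><body><p class="mw-empty-elt"></p>'
    "<p><b>Text mining</b> is the process of deriving information from text.</p>"
    "</body></html>"
)
print(first_nonempty_paragraph(sample_html))
```

With BeautifulSoup the same check can be written as a loop over `soup.find_all('p')` that stops at the first paragraph whose stripped text is non-empty.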

In [None]:
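import nltk
# 'punkt' holds the pre-trained sentence tokenizer models used by sent_tokenize();
# recent NLTK releases may additionally ask for nltk.download('punkt_tab')
nltk.download('punkt')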

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize

text = first_paragraph.text
print(text)
print("\n Sentence tokenization: \n" , sent_tokenize(text))
print("\n Word tokenization: \n" , word_tokenize(text))

Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources."[1] Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al. (2005) we can distinguish between three different perspectives of text mining: information extraction, data mining, and a knowledge discovery in databases (KDD) process.[2] Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. '
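
To make clear what the two tokenizers do, here is a deliberately simple regex-based approximation. It is not NLTK's Punkt or Treebank tokenizer and will split differently in many cases; the function names and the sample text are assumptions for this sketch.

```python
import re

def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace and a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

def simple_word_tokenize(text):
    """Words (with internal hyphens or apostrophes) or single punctuation marks."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

sample = ("Text mining derives high-quality information from text. "
          "It usually involves parsing.")

print(simple_sent_tokenize(sample))
print(simple_word_tokenize("It's high-quality text."))
```

A rule this naive also breaks after abbreviations such as "al." whenever a capitalized word follows, which is the kind of case a trained sentence tokenizer is meant to handle.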

In [None]:
from nltk.probability import FreqDist

fdist = FreqDist(word_tokenize(text))
print(fdist.most_common(2))

print("Frequency of each word: \n")
for word, freq in fdist.items():
    print(f'{word}: {freq}')


[(',', 22), ('text', 9)]
Frequency of each word: 

Text: 2
mining: 7
,: 22
text: 9
data: 3
(: 5
TDM: 1
): 5
or: 1
analytics: 1
is: 2
the: 8
process: 3
of: 9
deriving: 2
high-quality: 1
information: 5
from: 2
.: 9
It: 1
involves: 2
``: 2
discovery: 2
by: 4
computer: 1
new: 1
previously: 1
unknown: 1
automatically: 1
extracting: 1
different: 2
written: 1
resources: 2
[: 2
1: 1
]: 2
Written: 1
may: 1
include: 2
websites: 1
books: 1
emails: 1
reviews: 1
and: 9
articles: 1
High-quality: 1
typically: 1
obtained: 1
devising: 1
patterns: 2
trends: 1
means: 1
such: 1
as: 1
statistical: 1
pattern: 1
learning: 2
According: 1
to: 2
Hotho: 1
et: 1
al: 1
2005: 1
we: 1
can: 1
distinguish: 1
between: 2
three: 1
perspectives: 1
:: 1
extraction: 2
a: 2
knowledge: 1
in: 2
databases: 1
KDD: 1
2: 1
usually: 3
structuring: 1
input: 1
parsing: 1
along: 1
with: 1
addition: 1
some: 2
derived: 1
linguistic: 1
features: 1
removal: 1
others: 1
subsequent: 1
insertion: 1
into: 1
database: 1
within: 1
structured: 1
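
In the output above, `Text` and `text` are counted separately and punctuation marks dominate the top of the list. A possible refinement, sketched here with `collections.Counter` on the first few tokens of the paragraph, is to lowercase tokens and skip punctuation before counting. The short token list is copied from the tokenizer output; everything else is an illustrative assumption.

```python
import string
from collections import Counter

# First twelve tokens of the scraped paragraph
tokens = ['Text', 'mining', ',', 'text', 'data', 'mining', '(', 'TDM', ')',
          'or', 'text', 'analytics']

punctuation = set(string.punctuation)

# Raw counts: case-sensitive, punctuation included
raw_counts = Counter(tokens)

# Normalized counts: lowercase, punctuation tokens dropped
normalized_counts = Counter(t.lower() for t in tokens if t not in punctuation)

print(raw_counts['Text'], raw_counts['text'])
print(normalized_counts.most_common(2))
```

`FreqDist` is a subclass of `Counter`, so the same normalization can be applied before building the `FreqDist` as well.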

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords

stop_words=set(stopwords.words("english"))

print(stop_words)

{'those', 'during', 'an', 'be', 've', 'ours', 'nor', 'myself', 'while', "shouldn't", 'again', 'not', "you're", 'himself', 'yours', "hasn't", 'now', 'weren', 'in', 'all', "you've", 'wasn', 'don', 'because', 'itself', 'y', 'how', 'before', 'but', 'through', 'wouldn', "weren't", 'any', 'their', 'being', 'each', 'it', "you'd", 't', 'very', 'then', 'they', 'these', "it's", 'other', 'haven', 'shouldn', 'ourselves', 'he', 'whom', 'hers', 'which', 'after', 'mightn', "needn't", 'doesn', 'what', 'some', 'do', 's', "you'll", 'few', "aren't", 'when', 'did', "should've", 'under', 'can', "mustn't", 'she', 'its', 'yourself', "that'll", 'only', 'here', 'more', 'll', 'won', 'had', 'against', 'theirs', 'where', 'we', "couldn't", 'shan', 'into', 'me', "hadn't", "don't", 'from', 'the', 'no', 'is', 'so', 'them', 'over', 'mustn', 'just', 'as', 'hadn', "won't", 'between', 'why', 'above', 'further', 'below', 'needn', 'o', 'my', 'too', "haven't", 'about', 'was', 'having', 'her', 'aren', 'should', 'a', 'up', 'h

In [None]:
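tokens = word_tokenize(text)

# Keep only tokens that are not in the stopword list
# (the comparison is case-sensitive, so capitalized stopwords such as 'It' are kept)
filtered_tokens = [w for w in tokens if w not in stop_words]

print("Tokenized Words:", tokens)
print("Filtered Tokens:", filtered_tokens)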

Tokenized Words: ['Text', 'mining', ',', 'text', 'data', 'mining', '(', 'TDM', ')', 'or', 'text', 'analytics', 'is', 'the', 'process', 'of', 'deriving', 'high-quality', 'information', 'from', 'text', '.', 'It', 'involves', '``', 'the', 'discovery', 'by', 'computer', 'of', 'new', ',', 'previously', 'unknown', 'information', ',', 'by', 'automatically', 'extracting', 'information', 'from', 'different', 'written', 'resources', '.', '``', '[', '1', ']', 'Written', 'resources', 'may', 'include', 'websites', ',', 'books', ',', 'emails', ',', 'reviews', ',', 'and', 'articles', '.', 'High-quality', 'information', 'is', 'typically', 'obtained', 'by', 'devising', 'patterns', 'and', 'trends', 'by', 'means', 'such', 'as', 'statistical', 'pattern', 'learning', '.', 'According', 'to', 'Hotho', 'et', 'al', '.', '(', '2005', ')', 'we', 'can', 'distinguish', 'between', 'three', 'different', 'perspectives', 'of', 'text', 'mining', ':', 'information', 'extraction', ',', 'data', 'mining', ',', 'and', 'a', 

In [None]:
import string

punctuations=list(string.punctuation)

filtered_tokens2=[]

for i in filtered_tokens:
    if i not in punctuations:
        filtered_tokens2.append(i)

print("Filtered Tokens After Removing Punctuations:",filtered_tokens2)

Filtered Tokens After Removing Punctuations: ['Text', 'mining', 'text', 'data', 'mining', 'TDM', 'text', 'analytics', 'process', 'deriving', 'high-quality', 'information', 'text', 'It', 'involves', '``', 'discovery', 'computer', 'new', 'previously', 'unknown', 'information', 'automatically', 'extracting', 'information', 'different', 'written', 'resources', '``', '1', 'Written', 'resources', 'may', 'include', 'websites', 'books', 'emails', 'reviews', 'articles', 'High-quality', 'information', 'typically', 'obtained', 'devising', 'patterns', 'trends', 'means', 'statistical', 'pattern', 'learning', 'According', 'Hotho', 'et', 'al', '2005', 'distinguish', 'three', 'different', 'perspectives', 'text', 'mining', 'information', 'extraction', 'data', 'mining', 'knowledge', 'discovery', 'databases', 'KDD', 'process', '2', 'Text', 'mining', 'usually', 'involves', 'process', 'structuring', 'input', 'text', 'usually', 'parsing', 'along', 'addition', 'derived', 'linguistic', 'features', 'removal', '
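
The filtered list above still contains `'It'` and the two-character quote token ``'``'``: the stopword list is lowercase, and `string.punctuation` only lists single characters. The sketch below shows one way to handle both cases. The small `STOP_WORDS` set is a hypothetical stand-in for NLTK's much longer English list, and `clean_tokens` is a name chosen for this example.

```python
import string

# Small illustrative stopword set (assumption: a subset, not NLTK's full list)
STOP_WORDS = {"it", "the", "is", "of", "by", "or", "from"}

def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Drop stopwords case-insensitively and tokens made only of punctuation."""
    punctuation = set(string.punctuation)
    cleaned = []
    for tok in tokens:
        if tok.lower() in stop_words:
            continue  # 'It' matches 'it' after lowercasing
        if all(ch in punctuation for ch in tok):
            continue  # catches multi-character tokens such as '``'
        cleaned.append(tok)
    return cleaned

tokens = ['It', 'involves', '``', 'the', 'discovery', 'by', 'computer', '.']
print(clean_tokens(tokens))
```

Hyphenated words such as `high-quality` survive because only tokens consisting entirely of punctuation are removed.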

In [None]:
# Stemming
from nltk.stem import PorterStemmer
ps = PorterStemmer()

# Reduce every filtered token to its Porter stem
stemmed_words = [ps.stem(w) for w in filtered_tokens2]

print("Filtered Tokens After Removing Punctuations:",filtered_tokens2)
print("Stemmed Tokens:",stemmed_words)

Filtered Tokens After Removing Punctuations: ['Text', 'mining', 'text', 'data', 'mining', 'TDM', 'text', 'analytics', 'process', 'deriving', 'high-quality', 'information', 'text', 'It', 'involves', '``', 'discovery', 'computer', 'new', 'previously', 'unknown', 'information', 'automatically', 'extracting', 'information', 'different', 'written', 'resources', '``', '1', 'Written', 'resources', 'may', 'include', 'websites', 'books', 'emails', 'reviews', 'articles', 'High-quality', 'information', 'typically', 'obtained', 'devising', 'patterns', 'trends', 'means', 'statistical', 'pattern', 'learning', 'According', 'Hotho', 'et', 'al', '2005', 'distinguish', 'three', 'different', 'perspectives', 'text', 'mining', 'information', 'extraction', 'data', 'mining', 'knowledge', 'discovery', 'databases', 'KDD', 'process', '2', 'Text', 'mining', 'usually', 'involves', 'process', 'structuring', 'input', 'text', 'usually', 'parsing', 'along', 'addition', 'derived', 'linguistic', 'features', 'removal', 
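
Stemming works by stripping suffixes with rules, so the result need not be a dictionary word. The toy stemmer below makes that visible. It is a much cruder rule set than the Porter algorithm used above and will produce different stems; the suffix list and the function `toy_stem` are assumptions for this sketch.

```python
# Suffixes are tried in this order; the first match wins
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip one common suffix if at least `min_stem` characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for w in ["mining", "patterns", "derived", "resources", "text", "is"]:
    print(w, "->", toy_stem(w))
```

Lemmatization, shown in the next cell with spaCy, instead maps each word to a dictionary form using vocabulary and part-of-speech information.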

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp(text)

lemmatized_tokens = [token.lemma_ for token in doc]

lemmatized_text = ' '.join(lemmatized_tokens)

print("Original Text:", text)
print("Lemmatized Text:", lemmatized_text)


Original Text: Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources."[1] Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al. (2005) we can distinguish between three different perspectives of text mining: information extraction, data mining, and a knowledge discovery in databases (KDD) process.[2] Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation o

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
from nltk.tokenize import word_tokenize
from nltk import pos_tag

tokens=word_tokenize(text)
pos_=pos_tag(tokens)


print("Tokens:",tokens)
print("PoS tags:",pos_)

Tokens: ['Text', 'mining', ',', 'text', 'data', 'mining', '(', 'TDM', ')', 'or', 'text', 'analytics', 'is', 'the', 'process', 'of', 'deriving', 'high-quality', 'information', 'from', 'text', '.', 'It', 'involves', '``', 'the', 'discovery', 'by', 'computer', 'of', 'new', ',', 'previously', 'unknown', 'information', ',', 'by', 'automatically', 'extracting', 'information', 'from', 'different', 'written', 'resources', '.', '``', '[', '1', ']', 'Written', 'resources', 'may', 'include', 'websites', ',', 'books', ',', 'emails', ',', 'reviews', ',', 'and', 'articles', '.', 'High-quality', 'information', 'is', 'typically', 'obtained', 'by', 'devising', 'patterns', 'and', 'trends', 'by', 'means', 'such', 'as', 'statistical', 'pattern', 'learning', '.', 'According', 'to', 'Hotho', 'et', 'al', '.', '(', '2005', ')', 'we', 'can', 'distinguish', 'between', 'three', 'different', 'perspectives', 'of', 'text', 'mining', ':', 'information', 'extraction', ',', 'data', 'mining', ',', 'and', 'a', 'knowledg

In [None]:
import nltk
nltk.download('words')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [None]:
import nltk
nltk.download('maxent_ne_chunker')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.


True

In [None]:
from nltk import ne_chunk

# ne_chunk() returns a tree; only named-entity subtrees have a label (e.g. GPE, ORGANIZATION)
for chunk in ne_chunk(nltk.pos_tag(word_tokenize(text))):
    if hasattr(chunk, 'label'):
        print(chunk.label(), ' '.join(c[0] for c in chunk))

GPE Text
ORGANIZATION TDM
GPE Hotho
ORGANIZATION KDD


### Text analytics libraries in R

1. **tm:** The tm (Text Mining Infrastructure in R) package is a comprehensive framework for text mining tasks in R.

> **Features:**
* Text Preprocessing: tm provides functions for text cleaning, including removing punctuation, numbers, and stop words, stemming, and converting text to lower case.
* Document-Term Matrix (DTM) Creation: It offers methods to create document-term matrices, a crucial data structure for text analysis, where rows represent documents and columns represent terms (words or n-grams).
* Text Transformation: The package supports various transformations on text data, such as term frequency-inverse document frequency (TF-IDF) weighting and scaling.

> **Applications:**
* Document Classification: tm is used for categorizing documents into predefined classes or categories, such as spam detection, sentiment analysis, and topic classification.
* Text Clustering: It facilitates grouping similar documents together based on their content, enabling tasks like document clustering and clustering-based topic modeling.
* Text Retrieval: The package supports efficient retrieval of relevant documents based on search queries or similarity measures, enabling applications like information retrieval and recommendation systems.
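
The document-term matrix and TF-IDF weighting described above can be sketched in a few lines of plain Python. This illustrates the data structure only and is not tm's implementation: the toy documents are made up, and the weighting shown (raw term count times the natural log of N/df) is one common variant among several.

```python
import math
from collections import Counter

# Hypothetical toy corpus
docs = [
    "text mining finds patterns",
    "text analytics finds trends",
    "shiny builds web apps",
]
tokenized = [doc.split() for doc in docs]

# Columns of the matrix: the sorted vocabulary
vocabulary = sorted({term for doc in tokenized for term in doc})

# Document-term matrix: one row per document, one column per term
dtm = []
for doc in tokenized:
    counts = Counter(doc)
    dtm.append([counts[term] for term in vocabulary])

def tfidf(matrix):
    """Weight each count by log(N / document frequency) of its term."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [[row[j] * math.log(n_docs / df[j]) for j in range(n_terms)]
            for row in matrix]

weights = tfidf(dtm)
col = vocabulary.index
print(weights[0][col("text")], weights[0][col("mining")])
```

`mining` appears in one document and `text` in two, so `mining` receives the higher weight in the first row even though both occur once there.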

2. **quanteda:** The quanteda package in R is a comprehensive and powerful framework for quantitative text analysis.

> **Features:**
* Text Preprocessing: quanteda offers extensive text preprocessing capabilities, including tokenization, stemming, lemmatization, and stop word removal, to clean and prepare text data for analysis.
* Document-Term Matrix (DTM) Creation: It provides efficient methods to create document-term matrices (DTMs) and document-feature matrices (DFMs), essential data structures for text analysis, allowing users to represent text corpora in a structured format.
* Advanced Tokenization: The package supports advanced tokenization features, including n-grams, skip-grams, and user-defined tokenization rules, enabling fine-grained control over text processing.

> **Applications:**
* Social Science Research: quanteda is widely used in social science research for analyzing textual data from sources such as surveys, interviews, and social media to understand social phenomena, public opinion, and sentiment.
* Political Analysis: It enables political scientists to analyze political texts, such as speeches, policy documents, and legislative texts, for studying political discourse, ideology, and policy preferences.
* Market Research: The package is applied in market research and consumer analytics for analyzing customer feedback, product reviews, and social media data to identify trends, sentiments, and customer preferences.
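
The n-gram and skip-gram features mentioned above are easy to picture with a short example. The functions `ngrams` and `skip_bigrams` below are illustrative helpers written for this sketch, not quanteda's API, and the skip-gram version is limited to pairs of tokens for simplicity.

```python
def ngrams(tokens, n):
    """All contiguous runs of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_bigrams(tokens, max_skip):
    """Token pairs with at most `max_skip` tokens skipped between them."""
    pairs = []
    for i in range(len(tokens)):
        for gap in range(max_skip + 1):
            j = i + 1 + gap
            if j < len(tokens):
                pairs.append((tokens[i], tokens[j]))
    return pairs

tokens = ["text", "mining", "finds", "patterns"]
print(ngrams(tokens, 2))
print(skip_bigrams(tokens, 1))
```

With `max_skip=0` the skip-bigrams reduce to ordinary bigrams; larger values add pairs such as `("text", "finds")` that capture slightly longer-range context.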


3. **shiny:** Shiny is an R package that enables the creation of interactive web applications directly from R.

> **Features:**
* Interactive Text Analysis: Shiny allows users to build interactive web applications for text analytics, enabling exploration and analysis of text data through interactive visualizations and tools.
* Reactive Text Processing: With Shiny's reactive programming framework, changes in input text or parameters trigger updates to text processing and analysis, providing real-time feedback to users.
* Customizable Text Widgets: Shiny offers customizable text input and output widgets, such as text areas, text inputs, and text outputs, allowing users to input, process, and display text data flexibly.

> **Applications:**
* Text Mining Dashboards: Shiny is used to create interactive dashboards for text mining and exploration, allowing users to interactively analyze and visualize text data patterns, trends, and insights.
* Text Analytics Tools: It serves as a platform for building text analytics tools and applications for tasks such as sentiment analysis, topic modeling, text classification, and NER, providing users with customizable and user-friendly interfaces.
* Text Data Visualization: Shiny applications enable the visualization of text data through interactive plots, word clouds, heatmaps, and network graphs, facilitating the exploration and interpretation of text data structures and relationships.

4. **tidytext:** tidytext is an R package for text mining built around the tidy text format: a table with one token per row. A token is any meaningful unit of text, such as a word, that we are interested in using for analysis.

> **Features:**
* Integration with Tidy Data Principles: tidytext follows the principles of tidy data, making it compatible with other tidyverse packages in R. This ensures consistency in data manipulation and analysis workflows.
* Tokenization and Text Parsing: tidytext provides functions for tokenizing text data, splitting it into individual words or tokens, and parsing text into grammatical elements such as sentences, words, and n-grams.
* Sentiment Analysis: The package includes functions for sentiment analysis, allowing users to analyze the sentiment of text data by associating each word with a sentiment score or sentiment category. This enables the identification of positive, negative, or neutral sentiment in text documents.


> **Applications:**
* Social Media Analysis: tidytext is used for analyzing text data from social media platforms, such as Twitter, Facebook, and Instagram. Researchers and marketers use the package to perform sentiment analysis, topic modeling, and trend analysis on social media content.
* Customer Feedback Analysis: Businesses leverage tidytext for analyzing customer feedback data, including product reviews, surveys, and customer support tickets. By conducting sentiment analysis and word frequency analysis, companies can gain insights into customer sentiment, preferences, and pain points.
* Text-Based Recommender Systems: tidytext is employed in building text-based recommender systems for recommending articles, products, or content based on textual similarity. By applying techniques such as TF-IDF and cosine similarity, developers can create personalized recommendations for users.
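
The one-token-per-row format and lexicon-based sentiment scoring can be mimicked in Python to show the mechanics. In the sketch below, `unnest_tokens` is named after tidytext's function of the same name but is only a simplified stand-in, and the two reviews and the four-word `LEXICON` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical documents and sentiment lexicon
docs = {
    "review_1": "Great phone, great battery",
    "review_2": "Poor sound and bad screen",
}
LEXICON = {"great": "positive", "good": "positive",
           "poor": "negative", "bad": "negative"}

def unnest_tokens(documents):
    """Return (doc_id, word) rows: one lowercase token per row."""
    rows = []
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z']+", text.lower()):
            rows.append((doc_id, word))
    return rows

rows = unnest_tokens(docs)

# Equivalent of an inner join with the lexicon followed by a count per document
sentiment_counts = Counter(
    (doc_id, LEXICON[word]) for doc_id, word in rows if word in LEXICON
)

print(rows[:4])
print(sentiment_counts)
```

Words missing from the lexicon simply drop out of the join, which is why lexicon coverage matters for this style of sentiment analysis.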


5. **text2vec:** text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP).

> **Features:**
* Efficient Text Processing: text2vec is designed for efficient processing of large-scale text data, using streaming iterators and parallel processing so that large text corpora need not be held in memory all at once.
* Modular Framework: The package offers a modular framework for text analysis, allowing users to build custom text processing pipelines by combining various vectorization, dimensionality reduction, and modeling techniques.
* Vectorization Methods: text2vec provides several vectorization methods for converting text data into numerical representations, including Bag-of-Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and GloVe word embeddings. These methods capture semantic and contextual information from text documents.

> **Applications:**
* Document Classification: text2vec is used for document classification tasks, such as sentiment analysis, topic categorization, and spam detection. By converting text documents into numerical vectors and training machine learning models, users can classify documents into predefined categories or labels.
* Information Retrieval: The package is employed in information retrieval systems for searching and retrieving relevant documents based on user queries. By indexing and vectorizing text documents, information retrieval systems can quickly retrieve documents that match specific keywords or topics.
* Text Clustering: text2vec facilitates text clustering tasks, where similar documents are grouped together based on their semantic similarity. By applying clustering algorithms to vectorized representations of text documents, users can discover latent clusters and organize large text corpora into meaningful groups.
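
Embedding methods such as GloVe start from counts of which terms occur near each other. The sketch below builds such term co-occurrence counts for a sliding window. It is a simplified illustration and not text2vec's API: the function `cooccurrence_counts` is an assumed name, pairs are unordered, and every pair inside the window counts equally, whereas real implementations often down-weight distant pairs.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count unordered term pairs that occur within `window` positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Look ahead only, so each pair of positions is counted once
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            pair = tuple(sorted((word, tokens[j])))
            counts[pair] += 1
    return counts

tokens = "text mining and text analytics".split()
counts = cooccurrence_counts(tokens, window=2)

for pair, n in sorted(counts.items()):
    print(pair, n)
```

The resulting counts form a sparse, symmetric term-by-term matrix; an embedding model then learns vectors whose dot products approximate a function of these counts.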

In [None]:
install.packages("tokenizers")
install.packages("tm")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
# Text analytics in R

library(tokenizers)

text <- readline(prompt = "Enter text: ")

word_tokens <- unlist(tokenize_words(text))
sentence_tokens <- unlist(tokenize_sentences(text))

cat("\nTokenized words:\n")
print(word_tokens)

cat("\nTokenized sentences:\n")
print(sentence_tokens)

Enter text: Tokenization is the first step in text analytics. The process of breaking down text paragraphs into smaller chunks such as words or sentences is called Tokenization. The token is a single entity that is building blocks for sentences or paragraphs.

Tokenized words:
 [1] "tokenization" "is"           "the"          "first"        "step"        
 [6] "in"           "text"         "analytics"    "the"          "process"     
[11] "of"           "breaking"     "down"         "text"         "paragraphs"  
[16] "into"         "smaller"      "chunks"       "such"         "as"          
[21] "words"        "or"           "sentences"    "is"           "called"      
[26] "tokenization" "the"          "token"        "is"           "a"           
[31] "single"       "entity"       "that"         "is"           "building"    
[36] "blocks"       "for"          "sentences"    "or"           "paragraphs"  

Tokenized sentences:
[1] "Tokenization is the first step in text analytics."     

In [None]:
word_freq <- table(word_tokens)

print("Most common words:")
print(head(sort(word_freq, decreasing = TRUE), 2))

print("Frequency of each word:")
print(word_freq)

[1] "Most common words:"
word_tokens
 is the 
  4   3 
[1] "Frequency of each word:"
word_tokens
           a    analytics           as       blocks     breaking     building 
           1            1            1            1            1            1 
      called       chunks         down       entity        first          for 
           1            1            1            1            1            1 
          in         into           is           of           or   paragraphs 
           1            1            4            1            2            2 
     process    sentences       single      smaller         step         such 
           1            2            1            1            1            1 
        text         that          the        token tokenization        words 
           2            1            3            1            2            1 


In [None]:
library(tm)

filtered_tokens <- word_tokens[!word_tokens %in% stopwords("en")]

print("Filtered Tokens:")
print(filtered_tokens)

[1] "Filtered Tokens:"
 [1] "tokenization" "first"        "step"         "text"         "analytics"   
 [6] "process"      "breaking"     "text"         "paragraphs"   "smaller"     
[11] "chunks"       "words"        "sentences"    "called"       "tokenization"
[16] "token"        "single"       "entity"       "building"     "blocks"      
[21] "sentences"    "paragraphs"  


In [None]:
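# Stem each token with tm's stemDocument (requires the SnowballC package)
stemming <- function(text) {
  corpus <- Corpus(VectorSource(text))  # one document per token
  corpus <- tm_map(corpus, stemDocument)
  return(corpus)
}

stemmed_corpus <- stemming(filtered_tokens)

print("Stemmed Tokens:")
# Flatten the corpus back into a character vector for printing
print(unlist(sapply(stemmed_corpus, as.character)))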

“transformation drops documents”


[1] "Stemmed Tokens:"
 [1] "token"     "first"     "step"      "text"      "analyt"    "process"  
 [7] "break"     "text"      "paragraph" "smaller"   "chunk"     "word"     
[13] "sentenc"   "call"      "token"     "token"     "singl"     "entiti"   
[19] "build"     "block"     "sentenc"   "paragraph"


In [None]:
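library(textstem)  # provides lemmatize_strings()

# Lemmatize the text by mapping lemmatize_strings over a tm corpus
lemmatization <- function(text) {
  corpus <- Corpus(VectorSource(text))
  corpus <- tm_map(corpus, lemmatize_strings)
  return(corpus)
}

# Note: this is applied to the full input text, not to the filtered tokens
lemmatized_corpus <- lemmatization(text)

print("Lemmatized Tokens:")
print(unlist(sapply(lemmatized_corpus, as.character)))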

“transformation drops documents”


[1] "Lemmatized Tokens:"
[1] "Token is the first step in text analytics. The process of break down text paragraph into smaller chunk such as word or sentenc is call Tokenization. The token is a singl entiti that is build block for sentenc or paragraphs."


In [None]:
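library(rvest)

url <- 'https://en.wikipedia.org/wiki/Text_mining'

# read_html() signals an error on failure instead of returning NULL,
# so catch the error and fall back to NULL
page <- tryCatch(read_html(url), error = function(e) NULL)

if (!is.null(page)) {
  print(page)

  paragraphs <- html_nodes(page, 'p')

  # html_nodes() returns an empty node set when nothing matches
  if (length(paragraphs) > 0) {
    print(html_text(paragraphs[1]))
  } else {
    print('No <p> tags found on the website')
  }
} else {
  print('Failed to retrieve data from the website')
}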


{html_document}
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-feature-night-mode-disabled skin-night-mode-clientpref-0 vector-toc-available" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="skin-vector skin-vector-search-vue mediawiki ltr sitedir-ltr ...
[1] "Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves \"the discovery by computer of new, previously unknown information, by automatic

### Steps involved in text analytics

1. **Data gathering:**
In this stage, you gather text data from internal or external sources.
> Internal data: Internal data is text content that already exists inside your business and is readily available, such as emails, chats, invoices, and employee surveys.

 > External data: You can find external data in sources such as social media posts, online reviews, news articles, and online forums. It is harder to acquire because it is beyond your control. You might need to use web scraping tools or integrate with third-party solutions to extract it.

2. **Data preparation:**
Data preparation is an essential part of text analysis. It involves structuring raw text data in an acceptable format for analysis. Text analysis software automates this process using the following common natural language processing (NLP) methods.
> Tokenization: Tokenization splits the raw text into smaller parts that make semantic sense. For example, the phrase "text analytics benefits businesses" tokenizes to the words "text", "analytics", "benefits", and "businesses".

 > Part-of-speech tagging: Part-of-speech tagging assigns grammatical tags to the tokenized text. For example, applying this step to the tokens above gives text: Noun; analytics: Noun; benefits: Verb; businesses: Noun.

 >Parsing: Parsing establishes meaningful connections between the tokenized words according to English grammar. It helps the text analysis software visualize the relationships between words.

 >Lemmatization: Lemmatization is a linguistic process that reduces words to their dictionary form, or lemma. For example, the dictionary form of "visualizing" is "visualize".

 >Stop words removal: Stop words are words that offer little or no semantic context to a sentence, such as "and", "or", and "for". Depending on the use case, the software might remove them from the structured text.

3. **Text analysis:**
Text analysis is the core part of the process, in which text analysis software processes the text by using different methods.
> Text classification: Classification is the process of assigning tags to the text data, using either rules or machine learning-based systems.

 >Text extraction: Extraction involves identifying specific keywords in the text and associating them with tags. The software uses methods such as regular expressions and conditional random fields (CRFs) to do this.

4. **Visualization:**
Visualization turns the text analysis results into an easily understandable format, typically graphs, charts, and tables. The visualized results help you identify patterns and trends and build action plans. For example, suppose you’re getting a spike in product returns but have trouble finding the causes. With visualization, you look for phrases such as "defects", "wrong size", or "not a good fit" in the feedback and tabulate them into a chart. You then know which issue is the major one and takes top priority.
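
The preparation, analysis, and visualization steps above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not how any particular text analysis product works: the feedback comments, the `STOP_WORDS` set, the `ISSUE_PATTERNS` rules, and the helper names `tokenize`, `remove_stop_words`, and `classify` are all invented for this example. It covers tokenization, stop-word removal, rule-based classification with regular expressions, and tabulation only. Part-of-speech tagging, parsing, and lemmatization are left out, and a plain-text bar chart stands in for a real plot.

```python
import re
from collections import Counter

# Hypothetical customer feedback (invented for this sketch).
FEEDBACK = [
    "The zipper had defects and the stitching came apart.",
    "Wrong size delivered, had to return it.",
    "Not a good fit for me, and the wrong size again.",
    "Two defects on the sole. Returned.",
    "Ordered medium, received the wrong size.",
]

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "it", "to", "for", "me", "on", "had"}

# Rule-based tags: a tag is assigned when its regular expression matches.
ISSUE_PATTERNS = {
    "defects": re.compile(r"\bdefects?\b"),
    "wrong size": re.compile(r"\bwrong size\b"),
    "not a good fit": re.compile(r"\bnot a good fit\b"),
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def classify(text):
    """Return every tag whose pattern matches the text."""
    lowered = text.lower()
    return [tag for tag, pattern in ISSUE_PATTERNS.items() if pattern.search(lowered)]


# Data preparation: tokenize and filter the first comment.
tokens = remove_stop_words(tokenize(FEEDBACK[0]))
print(tokens)

# Text analysis: tag every comment, then count how often each tag occurs.
issue_counts = Counter(tag for comment in FEEDBACK for tag in classify(comment))

# Visualization: a plain-text bar chart, most frequent issue first.
for tag, count in issue_counts.most_common():
    print(f"{tag:<15} {'#' * count} ({count})")
```

With these made-up comments, "wrong size" is tagged most often, so it is the issue that would take top priority in the product-returns scenario.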

### Benefits of Text Analytics
* Helps in understanding emerging customer trends, product performance, and service quality.
* Helps researchers explore pre-existing literature and extract what’s relevant to their study.
* Helps search engines improve their performance, thereby providing faster user experiences.
* Helps in making more data-driven decisions.
* Refines user content recommendation systems by categorizing related content.
* Boosts the efficiency of working with unstructured data.

### Conclusion:
* Identified the Text Analytics Libraries in Python and R
* Performed simple experiments with these libraries in Python and R