
**Aim:** Perform the steps involved in text analytics in Python and R.

**Theory:**

**Python Text Analytics Libraries:**

**1) NLTK (Natural Language Toolkit):**

Features:

a) Tokenization, stemming, lemmatization.

b) Part-of-speech tagging, named entity recognition.

c) Concordance and collocation analysis.

Applications:

a) Sentiment analysis, text classification.

b) Information retrieval, language modeling.

**2) spaCy:**

Features:

a) Tokenization, POS tagging, and named entity recognition.

b) Dependency parsing, sentence segmentation.

c) Pre-trained models for multiple languages.

Applications:

a) Named entity recognition, information extraction.

b) Natural language understanding in chatbots.

**3) TextBlob:**

Features:

a) Simple API for common NLP tasks.

b) Sentiment analysis, noun phrase extraction.

c) Part-of-speech tagging.

Applications:

a) Sentiment analysis of customer reviews.

b) Basic text processing for beginners.

**4) Gensim:**

Features:

a) Topic modeling (e.g., LDA).

b) Document similarity analysis.

c) Word embedding models (Word2Vec, Doc2Vec).

Applications:

a) Topic modeling in large document collections.

b) Document similarity and clustering.

**5) Transformers (Hugging Face):**

Features:

a) State-of-the-art pre-trained models (e.g., BERT, GPT).

b) Easy integration for various NLP tasks.

c) Fine-tuning capabilities.

Applications:

a) Named entity recognition, sentiment analysis.

b) Text generation, language translation.

**R Text Analytics Libraries:**

**1) tm (Text Mining Package):**

Features:

a) Text preprocessing: Cleaning, stemming, stopword removal.

b) Document-term matrix creation.

c) Basic text mining functions.

Applications:

a) Topic modeling, clustering.

b) Sentiment analysis, document classification.

**2) quanteda:**

Features:

a) Fast and flexible text processing.

b) Tokenization, n-grams, and corpus analysis.

c) Support for advanced text analysis functions.

Applications:

a) Content analysis, sentiment analysis.

b) Document-feature matrix creation.

**3) tm.plugin.sentiment:**

Features:

a) Lexicon-based sentiment scoring (e.g., polarity and subjectivity).

b) Integration with the tm package.

Applications:

a) Analyzing sentiment in textual data.

**4) textTinyR:**

Features:

a) Efficient text preprocessing and tokenization for small or large files.

b) Lightweight implementation for quick analysis.

Applications:

a) Fast preparation of text for downstream tasks such as classification.

**5) quanteda.textstats:**

Features:

a) Basic text statistics.

b) Word frequency analysis, readability scores.

Applications:

a) Exploratory text analysis.

b) Extracting insights from text data.

**Steps Involved in Text Analytics:**

**1) Data Collection:** Gather text data from various sources such as documents, social media, or websites.

**2) Text Preprocessing:** Clean and preprocess the text data by removing noise, handling missing values, and performing tasks like tokenization and stemming.

**3) Text Exploration:** Explore the data through descriptive statistics, word frequency analysis, and visualization.

**4) Feature Extraction:** Convert the text into a numerical format, often using techniques like TF-IDF (Term Frequency-Inverse Document Frequency).

**5) Model Building:** Apply machine learning or natural language processing models for tasks such as sentiment analysis, classification, or clustering.

**6) Evaluation:** Assess the performance of the models using appropriate metrics based on the task.

**7) Visualization:** Visualize the results to gain insights and interpret the findings.
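
As a rough illustration of the feature-extraction step, the sketch below computes TF-IDF weights using only the standard library. The toy corpus and the `tf_idf` helper are hypothetical, and the formula used (term frequency times `log(N / document frequency)`) is one common variant; libraries such as scikit-learn apply additional smoothing and normalization.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized and lowercased.
docs = [
    ["text", "analytics", "with", "python"],
    ["text", "mining", "with", "r"],
    ["python", "and", "r", "for", "text"],
]

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(N / number of documents containing the term)
    """
    n_docs = len(docs)
    # Count each term once per document to get its document frequency.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
for i, w in enumerate(weights):
    # Show the two highest-weighted terms of each document.
    top = sorted(w.items(), key=lambda kv: kv[1], reverse=True)[:2]
    print(i, top)
```

A term that occurs in every document (here "text") gets weight 0, while terms unique to one document score highest. This is what makes TF-IDF useful for telling documents apart.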

**Benefits of Text Analytics:**

**1) Insight Generation:** Extract valuable insights from large volumes of unstructured text data.

**2) Decision Support:** Support decision-making processes by providing data-driven insights.

**3) Automation:** Automate the analysis of textual data, saving time and resources.

**4) Customer Feedback Analysis:** Understand customer sentiments and opinions from reviews and feedback.

**5) Information Retrieval:** Enhance search functionalities by improving the relevance of retrieved information.

**6) Fraud Detection:** Identify patterns or anomalies in textual data that may indicate fraudulent activities.

**7) Competitive Intelligence:** Monitor and analyze competitor activities and market trends from textual sources.

**Conclusion:** Text analytics is a powerful tool for extracting valuable information from text data. The choice of libraries and techniques depends on the goals of the analysis and the characteristics of the data.


**Tokenization (Sentence & Word) in Python**

In [1]:
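import nltk
# Download the Punkt models used by sent_tokenize and word_tokenize.
# Newer NLTK releases may ask for 'punkt_tab' instead.
nltk.download('punkt')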


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
from nltk.tokenize import word_tokenize, sent_tokenize

text = "This is a sample text. Tokenize it!"
sent_tokens = sent_tokenize(text)   # split the text into sentences
word_tokens = word_tokenize(text)   # split the text into words and punctuation

print(sent_tokens)
print(word_tokens)

['This is a sample text.', 'Tokenize it!']
['This', 'is', 'a', 'sample', 'text', '.', 'Tokenize', 'it', '!']
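
To make the two tokenization levels concrete, here is a rough regex-based approximation using only the standard library. It is not how NLTK's Punkt tokenizer works, and it will mishandle abbreviations, contractions, and similar cases. It only illustrates the idea on the same sample sentence.

```python
import re

text = "This is a sample text. Tokenize it!"

# Split after sentence-ending punctuation that is followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# A word is a run of word characters; any other non-space character
# (such as punctuation) becomes its own token.
words = re.findall(r"\w+|[^\w\s]", text)

print(sentences)
print(words)
```

For this input the two lists match the NLTK output shown above. On real text the results will diverge, which is why a trained tokenizer is generally preferred.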


**Frequency Distribution in Python**

In [3]:
from nltk import FreqDist
# Count how often each token occurs; printing the object shows only a summary.
fdist = FreqDist(word_tokens)
print(fdist)

<FreqDist with 9 samples and 9 outcomes>
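
The summary line reports 9 samples (distinct tokens) and 9 outcomes (total tokens), so every token occurs exactly once. To inspect individual counts, the sketch below uses `collections.Counter` from the standard library as a stand-in. `FreqDist` is a `Counter` subclass, so calls such as `most_common()` should behave the same way on `fdist`.

```python
from collections import Counter

# Tokens copied from the tokenization output above.
word_tokens = ['This', 'is', 'a', 'sample', 'text', '.', 'Tokenize', 'it', '!']

counts = Counter(word_tokens)
print(len(counts), sum(counts.values()))   # distinct tokens, total tokens

# Lowercasing and dropping punctuation first usually gives more useful counts.
normalized = Counter(tok.lower() for tok in word_tokens if tok.isalpha())
print(normalized.most_common(3))
```

On a longer text, `most_common(n)` surfaces the dominant terms, which is typically the starting point for the exploration step.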


**Frequency Distribution in R**

In [4]:
%load_ext rpy2.ipython

In [5]:
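%%R
library(tm)

text <- "This is a sample text. Tokenize it!"
corpus <- Corpus(VectorSource(text))

# Lowercase, strip punctuation, and drop English stopwords.
clean_text <- function(text) {
  text <- tolower(text)
  text <- removePunctuation(text)
  text <- removeWords(text, stopwords("en"))
  return(text)
}

corpus_cleaned <- tm_map(corpus, content_transformer(clean_text))

# This cell prints the cleaned text only; word counts can then be obtained with,
# e.g., table(strsplit(trimws(cleaned_text), "\\s+")[[1]]).
cleaned_text <- sapply(corpus_cleaned, function(x) as.character(x))
print(cleaned_text)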

[1] "   sample text tokenize "


**Remove Stopwords & Punctuation in Python**

In [7]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

text = "This is a sample text. Tokenize it!"
stop_words = set(stopwords.words('english'))

tokens = word_tokenize(text)

# Keep alphabetic tokens only (this drops punctuation), lowercase them,
# and discard English stopwords.
filtered_tokens = [word.lower() for word in tokens if word.isalpha() and word.lower() not in stop_words]

print(filtered_tokens)


['sample', 'text', 'tokenize']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**Remove Stopwords & Punctuation in R**

In [8]:
%%R
library(tm)

text <- "This is a sample text. Tokenize it!"
corpus <- Corpus(VectorSource(text))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

corpus <- tm_map(corpus, removeWords, stopwords("english"))

cleaned_text <- sapply(corpus, function(x) as.character(x))

print(cleaned_text)


[1] "   sample text tokenize "


**Lexicon Normalization (Stemming, Lemmatization) in Python**

In [10]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [11]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

words = ["running", "flies", "happily", "better"]

# Stemming strips suffixes by rule, so the results need not be dictionary words.
porter = PorterStemmer()
stemmed_words = [porter.stem(word) for word in words]
print(stemmed_words)

# Lemmatization maps words to dictionary forms. With no `pos` argument every word
# is treated as a noun, which is why "running" and "better" come back unchanged.
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)


['run', 'fli', 'happili', 'better']
['running', 'fly', 'happily', 'better']


**Lexicon Normalization (Stemming, Lemmatization) in R**

In [24]:
%%R
library(tm)
library(SnowballC)  # stemming backend used by stemDocument

words <- c("running", "flies", "happily", "better")
corpus <- Corpus(VectorSource(words))
# This cell performs stemming only; no lemmatization step is applied.
corpus <- tm_map(corpus, content_transformer(stemDocument))
stemmed_words <- sapply(corpus, function(x) unlist(strsplit(as.character(x), " ")))
print(stemmed_words)




[1] "run"     "fli"     "happili" "better" 


**Part of Speech Tagging in Python**

In [14]:
import nltk
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [15]:
from nltk import pos_tag
# `words` is the list defined in the stemming cell above.
pos_tags = pos_tag(words)
print(pos_tags)


**Named Entity Recognition in Python**

In [43]:
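import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)  # `text` is the sample string defined earlier
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)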


**Web Scraping in Python**

In [18]:
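import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors
soup = BeautifulSoup(response.text, "html.parser")
print(soup.get_text(separator=" ", strip=True))  # visible page text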
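
Once a page has been downloaded, its visible text can be passed to the tokenization and cleaning steps shown earlier. The sketch below illustrates the extraction step offline. It uses the standard library's `html.parser` rather than BeautifulSoup, and the `html_doc` string and `TextExtractor` class are hypothetical stand-ins for `response.text` and `soup.get_text()`.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

# Hypothetical page content standing in for response.text.
html_doc = """
<html><head><title>Demo</title><style>p {color: red;}</style></head>
<body><h1>Sample page</h1><p>This is a sample text.</p></body></html>
"""

extractor = TextExtractor()
extractor.feed(html_doc)
page_text = " ".join(extractor.parts)
print(page_text)
```

The resulting `page_text` string can then be tokenized and filtered exactly like the `text` variable in the earlier cells.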
