# **Title: Perform the steps involved in Text Analytics in Python & R**

Lab Objectives:
* To introduce the concept of text analytics and its applications.

Lab Outcome (LO): Implement various Regression techniques for prediction. (LO2)


## **Python**


In [2]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

!pip install contractions

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


Collecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21 (from contractions)
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting anyascii (from textsearch>=0.0.21->contractions)
  Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
Collecting pyahocorasick (from textsearch>=0.0.21->contractions)
  Downloading pyahocorasick-2.0.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (110 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.2 contractions-0.1.73 pyahocorasick-2.0.0 textsearch-0.0.24


In [3]:
import pandas as pd
import string
import contractions
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tag import pos_tag
from nltk import ne_chunk

### Load Dataset

Dataset: [Kaggle Dataset Link](https://www.kaggle.com/datasets/mujeebunissa/movies-csv)

In [4]:
data = pd.read_csv('/content/drive/MyDrive/Dataset/movies.csv')

print(data.info(),'\n\n Null values per column:\n',data.isna().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9166 entries, 0 to 9165
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0    9166 non-null   int64  
 1   id            9166 non-null   int64  
 2   title         9166 non-null   object 
 3   overview      9165 non-null   object 
 4   popularity    9166 non-null   float64
 5   vote_average  9166 non-null   float64
 6   vote_count    9166 non-null   int64  
dtypes: float64(2), int64(3), object(2)
memory usage: 501.4+ KB
None 

 Null values per column:
 Unnamed: 0      0
id              0
title           0
overview        1
popularity      0
vote_average    0
vote_count      0
dtype: int64


In [5]:
data.dropna(axis=0,inplace=True)

text = data[['title','overview']]
text = text.sample(frac=1,random_state=42).reset_index(drop=True)

In [6]:
sample = text.sample(n=1)
para = sample['overview'].values[0]

### Word Tokenizer

In [7]:
token = word_tokenize(contractions.fix(para))
print(token)

['An', 'opulent', 'beach', 'resort', 'provides', 'a', 'scenic', 'background', 'to', 'this', 'amusing', 'whodunit', 'as', 'Poirot', 'attempts', 'to', 'uncover', 'the', 'nefarious', 'evildoer', 'behind', 'the', 'strangling', 'of', 'a', 'notorious', 'stage', 'star', '.']


### Sentence Tokenizer

In [8]:
sentence = sent_tokenize(para)
for sent in sentence:
  print(sent,'\n')

An opulent beach resort provides a scenic background to this amusing whodunit as Poirot attempts to uncover the nefarious evildoer behind the strangling of a notorious stage star. 
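
A rough sketch of why a trained tokenizer is used here rather than a plain regular expression. The hypothetical `naive_split` helper below is not part of NLTK. It splits on whitespace that follows `.`, `!`, or `?`, which is fine for simple text but breaks on abbreviations such as "Dr."; the Punkt model behind `sent_tokenize` is designed to cope with such cases.

```python
import re

def naive_split(text):
    """Split after sentence-ending punctuation followed by whitespace (illustrative only)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

# Made-up example strings, not taken from the dataset.
simple = naive_split("Poirot arrives. Who did it? Nobody knows!")
tricky = naive_split("Dr. Watson arrived late.")

print(simple)  # three sentences, as intended
print(tricky)  # split after the abbreviation "Dr.", which is not intended
```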



### Word Frequency Distribution

In [9]:
frequency = FreqDist(token)
print('5 most common words in the text: ',frequency.most_common(5))

5 most common words in the text:  [('a', 2), ('to', 2), ('the', 2), ('An', 1), ('opulent', 1)]
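
The counts above are case-sensitive, so `'An'` and `'a'`-style variants are tallied separately, and the top entries are all function words. `FreqDist` behaves like a `collections.Counter`, so the same tally can be sketched with the standard library alone. Lowercasing before counting is an assumption about what is wanted here, not something the original cell does.

```python
from collections import Counter

# Tokens copied from the word-tokenizer output above.
tokens = ['An', 'opulent', 'beach', 'resort', 'provides', 'a', 'scenic',
          'background', 'to', 'this', 'amusing', 'whodunit', 'as', 'Poirot',
          'attempts', 'to', 'uncover', 'the', 'nefarious', 'evildoer',
          'behind', 'the', 'strangling', 'of', 'a', 'notorious', 'stage',
          'star', '.']

# Case-fold each token so that capitalised and lowercase forms share a count.
counts = Counter(tok.lower() for tok in tokens)
top3 = counts.most_common(3)
print(top3)
```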


### Punctuation and Stopword Removal

In [10]:
stopword = stopwords.words('english')
punct = list(string.punctuation)

def text_clean(tokens):
  cleaned = []
  for word in tokens:
    if word.lower() not in stopword and word not in punct:
      cleaned.append(word)
  return cleaned

cleaned_tokens = text_clean(token)
cleaned_para = ' '.join(cleaned_tokens)
print(cleaned_para)

opulent beach resort provides scenic background amusing whodunit Poirot attempts uncover nefarious evildoer behind strangling notorious stage star
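
`text_clean` checks each token against a Python list, which is scanned linearly on every lookup. A minimal sketch of the same filter using sets follows. The tiny `STOPWORDS` set is a hand-picked stand-in for illustration only, not NLTK's English list, and `clean_tokens` is a hypothetical name.

```python
import string

# Hand-picked stand-in for a stopword list (assumption: NOT the NLTK list).
STOPWORDS = {'a', 'an', 'the', 'to', 'this', 'as', 'of'}
PUNCT = set(string.punctuation)

def clean_tokens(tokens):
    """Drop stopwords (case-insensitively) and single punctuation tokens."""
    return [t for t in tokens if t.lower() not in STOPWORDS and t not in PUNCT]

tokens = ['An', 'opulent', 'beach', 'resort', 'provides', 'a', 'scenic',
          'background', 'to', 'this', 'amusing', 'whodunit', 'as', 'Poirot',
          'attempts', 'to', 'uncover', 'the', 'nefarious', 'evildoer',
          'behind', 'the', 'strangling', 'of', 'a', 'notorious', 'stage',
          'star', '.']

cleaned = ' '.join(clean_tokens(tokens))
print(cleaned)
```

With NLTK available, `set(stopwords.words('english'))` could be passed in place of the stand-in set.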


### Lexicon Normalization

In [11]:
stem = PorterStemmer()
lemmatize = WordNetLemmatizer()

In [12]:
stemmed_tokens = [stem.stem(word) for word in cleaned_tokens]
stemmed_para = ' '.join(stemmed_tokens)
print('Stemmed Sentence: ',stemmed_para)

lemmatized_tokens = [lemmatize.lemmatize(word, pos='v') for word in cleaned_tokens]
lemmatized_para = ' '.join(lemmatized_tokens)
print('Lemmatized Sentence: ',lemmatized_para)

Stemmed Sentence:  opul beach resort provid scenic background amus whodunit poirot attempt uncov nefari evildo behind strangl notori stage star
Lemmatized Sentence:  opulent beach resort provide scenic background amuse whodunit Poirot attempt uncover nefarious evildoer behind strangle notorious stage star
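
The two outputs differ because stemming strips suffixes by rule and can leave non-words such as `opul`, while lemmatization maps each word to a dictionary form. The toy function below illustrates the rule-based idea only: `toy_stem` and its suffix list are hypothetical and far simpler than the Porter algorithm used above.

```python
# Suffixes are tried in this order; the first match is removed.
SUFFIXES = ('ing', 'ed', 'es', 's')

def toy_stem(word):
    """Strip one common suffix if at least three characters remain (not Porter)."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[:-len(suffix)]
    return w

words = ['provides', 'attempts', 'strangling', 'amusing', 'stage', 'star']
stems = [toy_stem(w) for w in words]
print(stems)
```

Like the real stemmer, this produces truncated forms such as `provid` and `strangl`, which is acceptable for indexing but not for display.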


### POS Tagging

* CC: Coordinating conjunction
* CD: Cardinal number
* DT: Determiner
* EX: Existential there
* FW: Foreign word
* IN: Preposition or subordinating conjunction
* JJ: Adjective
* JJR: Adjective, comparative
* JJS: Adjective, superlative
* LS: List item marker
* MD: Modal
* NN: Noun, singular or mass
* NNS: Noun, plural
* NNP: Proper noun, singular
* NNPS: Proper noun, plural
* PDT: Predeterminer
* POS: Possessive ending
* PRP: Personal pronoun
* RB: Adverb
* RBR: Adverb, comparative
* RBS: Adverb, superlative
* RP: Particle
* SYM: Symbol
* TO: to
* UH: Interjection
* VB: Verb, base form
* VBD: Verb, past tense
* VBG: Verb, gerund or present participle
* VBN: Verb, past participle
* VBP: Verb, non-3rd person singular present
* VBZ: Verb, 3rd person singular present
* WDT: Wh-determiner
* WP: Wh-pronoun
* WP$: Possessive wh-pronoun
* WRB: Wh-adverb

In [13]:
pos_tagged = pos_tag(token)
# Print the list of (token, tag) pairs returned by pos_tag
print('Parts of Speech: ', pos_tagged)


### Named Entity Recognition

In [14]:
ner_tagged = nltk.ne_chunk(pos_tagged)

names = []
for chunk in ner_tagged:
    if isinstance(chunk, nltk.Tree):
        names.append(" ".join([token for token, pos in chunk.leaves()]))

print(names)

['Poirot']


### Web Scraping

In [15]:
!pip install beautifulsoup4
!pip install requests



In [16]:
from bs4 import BeautifulSoup
import requests
from requests import HTTPError

In [17]:
url = 'https://github.com/LifnaJos/ADL601-Data-Analytics-and-Visualization-Lab/blob/main/Experiments/Experiment_7.md'

try:
  response = requests.get(url, timeout=10)
  response.raise_for_status()  # raises HTTPError for 4xx/5xx status codes
except HTTPError as e:
  print(e)

In [18]:
soup = BeautifulSoup(response.text, 'html.parser')

In [19]:
links = soup.find_all('a')

for link in links:
    print(link.get('href'))

\"#experiment---7-perform-the-steps-involved-in-text-analytics-in-python--r\"
\"#lab-outcomes-lo\"
\"#task-to-be-performed-\"
\"#tools--libraries-to-be-explored\"
\"#theory-to-be-written\"
\"#outcome-\"
\"#online-resources\"
\"https://github.com/LifnaJos/ADC601-Data-Analytics-Visualization/blob/DAV_Colab_Notebooks/Data_Preprocessing_techniques.ipynb\"
\"https://guides.library.upenn.edu/penntdm/python\"
\"https://machinelearninggeek.com/text-analytics-for-beginners-using-python-nltk/\"
\"https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/\"
\"https://www.kdnuggets.com/2020/05/text-mining-python-steps-examples.html\"
\"https://www.youtube.com/watch?v=bZoC-UW50sI&list=PLH6mU1kedUy-xjgiuvqMkVn8npK0TGAv5\"
\"https://www.analyticsvidhya.com/blog/2022/07/sentiment-analysis-using-python/\"
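
The list above mixes in-page anchors (starting with `#`) with absolute URLs. A minimal standard-library sketch of collecting links and resolving them against a base URL follows. The `LinkCollector` class, the HTML snippet, and the `example.com` addresses are all made up for illustration and are not part of the notebook.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href:  # skip anchors without an href attribute
            self.links.append(urljoin(self.base_url, href))

# Made-up HTML and base URL (assumptions for the example).
html = ('<p><a href="#outcome-">Outcome</a> '
        '<a href="https://example.com/guide">Guide</a> '
        '<a>no href</a></p>')

collector = LinkCollector('https://example.com/page')
collector.feed(html)
links = collector.links
print(links)
```

Resolving with `urljoin` turns fragment-only links into full URLs, which makes them easier to filter or deduplicate.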


## **R**

In [1]:
install.packages("tokenizers")
install.packages("tm")
install.packages("udpipe")
install.packages("spacyr")
install.packages("rvest")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘Rcpp’, ‘SnowballC’


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘NLP’, ‘slam’, ‘BH’


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘RcppTOML’, ‘here’, ‘png’, ‘reticulate’


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [3]:
library(tokenizers)
library(tm)
library(udpipe)
library(spacyr)
library(rvest)

Loading required package: NLP



### Tokenization

In [4]:
text <- 'Suave, charming and volatile, Reggie Kray and his unstable twin brother Ronnie start to leave their mark on the London underworld in the 1960s. Using violence to get what they want, the siblings orchestrate robberies and murders while running nightclubs and protection rackets. With police Detective Leonard "Nipper" Read hot on their heels, the brothers continue their rapid rise to power and achieve tabloid notoriety.'

word_tokens <- tokenize_words(text)
sent_tokens <- tokenize_sentences(text)

print("Sentence Tokens:")
print(sent_tokens)
print("Word Tokens:")
print(word_tokens)

[1] "Sentence Tokens:"
[[1]]
[1] "Suave, charming and volatile, Reggie Kray and his unstable twin brother Ronnie start to leave their mark on the London underworld in the 1960s." 
[2] "Using violence to get what they want, the siblings orchestrate robberies and murders while running nightclubs and protection rackets."           
[3] "With police Detective Leonard \"Nipper\" Read hot on their heels, the brothers continue their rapid rise to power and achieve tabloid notoriety."

[1] "Word Tokens:"
[[1]]
 [1] "suave"       "charming"    "and"         "volatile"    "reggie"     
 [6] "kray"        "and"         "his"         "unstable"    "twin"       
[11] "brother"     "ronnie"      "start"       "to"          "leave"      
[16] "their"       "mark"        "on"          "the"         "london"     
[21] "underworld"  "in"          "the"         "1960s"       "using"      
[26] "violence"    "to"          "get"         "what"        "they"       
[31] "want"        "the"         "siblings

### Word Frequency Distribution

In [5]:
word_freq <- table(word_tokens)

print("Word Frequency Distribution:")
print(word_freq)

[1] "Word Frequency Distribution:"
word_tokens
      1960s     achieve         and     brother    brothers    charming 
          1           1           5           1           1           1 
   continue   detective         get       heels         his         hot 
          1           1           1           1           1           1 
         in        kray       leave     leonard      london        mark 
          1           1           1           1           1           1 
    murders  nightclubs      nipper   notoriety          on orchestrate 
          1           1           1           1           2           1 
     police       power  protection     rackets       rapid        read 
          1           1           1           1           1           1 
     reggie        rise   robberies      ronnie     running    siblings 
          1           1           1           1           1           1 
      start       suave     tabloid         the       their        they 
    

In [6]:
sorted_freq <- sort(word_freq, decreasing = TRUE)

print("Top 5 most used words:")
print(sorted_freq[1:5])

[1] "Top 5 most used words:"
word_tokens
  and   the their    to    on 
    5     4     3     3     2 


### Punctuation and Stopword Removal

In [7]:
corpus <- Corpus(VectorSource(text))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

cleaned_text <- sapply(corpus, as.character)
print(cleaned_text)

“transformation drops documents”
“transformation drops documents”
“transformation drops documents”


[1] "suave charming  volatile reggie kray   unstable twin brother ronnie start  leave  mark   london underworld   1960s using violence  get   want  siblings orchestrate robberies  murders  running nightclubs  protection rackets  police detective leonard nipper read hot   heels  brothers continue  rapid rise  power  achieve tabloid notoriety"


### Lexicon Normalization

In [8]:
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)

Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe to /content/english-ewt-ud-2.5-191206.udpipe

 - This model has been trained on version 2.5 of data from https://universaldependencies.org

 - The model is distributed under the CC-BY-SA-NC license: https://creativecommons.org/licenses/by-nc-sa/4.0

 - Visit https://github.com/jwijffels/udpipe.models.ud.2.5 for model license details.

 - For a list of all models and their licenses (most models you can download with this package have either a CC-BY-SA or a CC-BY-SA-NC license) read the documentation at ?udpipe_download_model. For building your own models: visit the documentation by typing vignette('udpipe-train', package = 'udpipe')

Downloading finished, model stored at '/content/english-ewt-ud-2.5-191206.udpipe'



In [9]:
tokens <- udpipe_annotate(ud_model, x = text)

lemmas <- as.data.frame(tokens)$lemma
cat("Lemmas:", lemmas, "\n")

# udpipe lemmatizes but does not stem; the token column holds the original word forms
token_forms <- as.data.frame(tokens)$token
cat("Tokens:", token_forms, "\n")

Lemmas: Suave , charming and volatile , Reggie Kray and he unstable twin brother Ronnie start to leave they mark on the London underworld in the 1960 . use violence to get what they want , the sibling orchestrate robbery and murder while run nightclub and protection racket . with police detective Leonard " Nipper " read hot on they heel , the brother continue they rapid rise to power and achieve tabloid notoriety . 
Tokens: Suave , charming and volatile , Reggie Kray and his unstable twin brother Ronnie start to leave their mark on the London underworld in the 1960s . Using violence to get what they want , the siblings orchestrate robberies and murders while running nightclubs and protection rackets . With police Detective Leonard " Nipper " Read hot on their heels , the brothers continue their rapid rise to power and achieve tabloid notoriety . 


### Part of Speech tagging

In [10]:
pos_tags <- as.data.frame(tokens)$upos
cat("POS tags:", pos_tags, "\n")

POS tags: PROPN PUNCT NOUN CCONJ NOUN PUNCT PROPN PROPN CCONJ PRON ADJ NOUN NOUN PROPN VERB PART VERB PRON NOUN ADP DET PROPN NOUN ADP DET NOUN PUNCT VERB NOUN PART VERB PRON PRON VERB PUNCT DET NOUN ADJ NOUN CCONJ NOUN SCONJ VERB NOUN CCONJ NOUN NOUN PUNCT ADP NOUN PROPN PROPN PUNCT PROPN PUNCT VERB ADJ ADP PRON NOUN PUNCT DET NOUN VERB PRON ADJ NOUN ADP NOUN CCONJ NOUN NOUN NOUN PUNCT 


### Web Scraping

In [11]:
url <- "https://en.wikipedia.org/wiki/Web_scraping"
webpage <- read_html(url)

article_titles <- webpage %>%
  html_nodes(".text") %>%
  html_text()

print(article_titles)

 [1] "\"Web scraping\""                                                                                                                   
 [2] "news"                                                                                                                               
 [3] "newspapers"                                                                                                                         
 [4] "books"                                                                                                                              
 [5] "scholar"                                                                                                                            
 [6] "JSTOR"                                                                                                                              
 [7] "improve this section"                                                                                                               
 [8] "\"SASSCAL WebSAPI: A 

## Conclusion

We performed text analytics in Python using NLTK (together with contractions and BeautifulSoup) and in R using tokenizers, tm, udpipe, and rvest. Between them, these libraries covered tokenization, word frequency distribution, punctuation and stopword removal, lexicon normalization, POS tagging, named entity recognition (Python only), and basic web scraping.