<a href="https://colab.research.google.com/github/Jasleen8801/Conversational-Systems/blob/main/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **NLTK**
The Natural Language Toolkit (NLTK) is a powerful Python library for working with human language data, particularly for natural language processing (NLP) and text analysis tasks.

In [1]:
!pip install nltk



**NLTK Corpora**: NLTK corpora are the text datasets and lexical resources (for example, stop word lists and WordNet) that the NLTK library can download with `nltk.download()`.

**Text Processing**: Text Processing involves converting textual data into mathematical objects, typically vectors or matrices, that can be analyzed, manipulated, and used for various machine learning and statistical techniques.
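As a minimal sketch of what "converting text into vectors" can mean, the snippet below builds bag-of-words count vectors using only the standard library. The two sample documents and the names `docs`, `vocab`, and `vectors` are made up for illustration; real pipelines usually rely on a dedicated vectorizer.

```python
from collections import Counter

# Two tiny, made-up documents used only for illustration.
docs = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: every distinct word across all documents, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

# One count vector per document, with one position per vocabulary word.
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocab])

print(vocab)
print(vectors)
```

Each document becomes a list of integers of the same length, which is the kind of object that statistical and machine learning methods can work with.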

In [1]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /home/jasleen/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/jasleen/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /home/jasleen/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jasleen/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [2]:
sentence = "I am running in the park and playing with my friends."

**Tokenization**: Tokenization is a fundamental text processing task in natural language processing (NLP). It involves breaking a text into individual units, typically words or tokens.

In [3]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(sentence)
print(tokens)

['I', 'am', 'running', 'in', 'the', 'park', 'and', 'playing', 'with', 'my', 'friends', '.']
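To make the idea concrete, here is a rough regular-expression approximation of word tokenization. It is not what `word_tokenize` does internally; the function name `simple_tokenize` is hypothetical, and the pattern simply treats runs of word characters and single punctuation marks as tokens.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "I am running in the park and playing with my friends."
print(simple_tokenize(sentence))
```

For this sentence the result matches the NLTK output above. The two approaches diverge on harder input such as contractions, where `word_tokenize` splits "don't" into "do" and "n't" while this pattern does not.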


**Stop Words Removal**: Stop words are words that are considered to be of little value in text analysis because they are very common and do not carry much information about the content of the text. Examples of stop words in English include "the," "and," "in," "is," and "it." Removing stop words can help reduce noise in your text data and improve the efficiency and accuracy of various NLP tasks.

In [4]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
print(stop_words)

# print(tokens)
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
filtered_text = ' '.join(filtered_tokens)
print(filtered_text)

{'him', 'isn', 'themselves', 'should', 'he', 'the', 'from', 'when', 'each', 'between', "wouldn't", 'couldn', 'once', 'out', 'off', 'needn', 'hasn', 'be', "you've", 'same', 'doesn', "weren't", 'other', 'that', 'then', 'your', 'shouldn', 'there', 'above', "aren't", 'having', 'any', 'of', 'against', 'its', 'before', 'nor', 'where', 'further', 'why', 'on', 'again', "shan't", 'she', 'mightn', "shouldn't", 't', 'what', 'into', "hadn't", 'below', 'in', 'wasn', 'after', 'few', 'at', 'this', 'll', 'for', 'but', 'mustn', "wasn't", 'because', 'who', "hasn't", 'm', 'doing', "that'll", 'so', "needn't", "don't", 'himself', 'such', 'an', 'y', 'which', 'didn', 'was', "you'd", 'now', 'through', 'have', 'up', 'theirs', 'his', 'they', "didn't", 'shan', 'herself', 'hers', "you'll", 'can', "mightn't", 'you', 'being', 'only', 'about', 'will', 'd', 'were', "isn't", 'not', 'no', 'own', "doesn't", 'ain', 'over', 'under', "it's", 're', 'our', 'while', 'am', 'them', 'had', 'during', "you're", 's', 've', 'it', 'm
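The filtering step can be illustrated without any downloads. The sketch below uses a small hand-picked stop word set (`SAMPLE_STOP_WORDS`, an assumption standing in for the full NLTK list) and additionally drops punctuation-only tokens, which the cell above does not do.

```python
import string

tokens = ['I', 'am', 'running', 'in', 'the', 'park', 'and',
          'playing', 'with', 'my', 'friends', '.']

# Hand-picked subset for illustration; the NLTK English list is much longer.
SAMPLE_STOP_WORDS = {"i", "am", "in", "the", "and", "with", "my"}

# Compare in lowercase so that "I" matches the lowercase entry "i".
filtered = [t for t in tokens if t.lower() not in SAMPLE_STOP_WORDS]

# Optional extra step: remove tokens made up entirely of punctuation.
content_words = [t for t in filtered
                 if not all(ch in string.punctuation for ch in t)]

print(' '.join(filtered))
print(content_words)
```

Lowercasing before the membership test matters because stop word lists are typically stored in lowercase.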

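**Lemmatization**: Lemmatization reduces words to their dictionary form (lemma), which can be helpful in various natural language processing tasks. `WordNetLemmatizer.lemmatize()` treats every word as a noun unless a `pos` argument is given, which is why "friends" becomes "friend" in the output below while the verbs "running" and "playing" are left unchanged.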

In [5]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
lemmatized_text = ' '.join(lemmatized_tokens)
print(lemmatized_text)

I am running in the park and playing with my friend .


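**Stemming**: Stemming reduces words to a stem by stripping suffixes with heuristic rules, so the result is not always a dictionary word. The Porter stemmer also lowercases its input by default, which is why "I" appears as "i" in the output below.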

In [6]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in tokens]
stemmed_text = ' '.join(stemmed_tokens)
print(stemmed_text)

i am run in the park and play with my friend .


**Part of Speech Tagging**: Part-of-speech tagging (POS tagging) is the process of assigning a part-of-speech label (such as noun, verb, adjective, etc.) to each word in a text.


In [7]:
from nltk import pos_tag

pos_tags = pos_tag(tokens)
print(pos_tags)

[('I', 'PRP'), ('am', 'VBP'), ('running', 'VBG'), ('in', 'IN'), ('the', 'DT'), ('park', 'NN'), ('and', 'CC'), ('playing', 'NN'), ('with', 'IN'), ('my', 'PRP$'), ('friends', 'NNS'), ('.', '.')]
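These tags can also inform the lemmatizer used earlier, which accepts a WordNet part-of-speech letter. The helper below is a common heuristic rather than part of NLTK: `penn_to_wordnet` is a hypothetical name, and the mapping falls back to noun for any tag it does not recognize.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (heuristic)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # default to noun, as the lemmatizer itself does

# Tagged tokens copied from the output above.
pos_tags = [('I', 'PRP'), ('am', 'VBP'), ('running', 'VBG'), ('in', 'IN'),
            ('the', 'DT'), ('park', 'NN'), ('and', 'CC'), ('playing', 'NN'),
            ('with', 'IN'), ('my', 'PRP$'), ('friends', 'NNS'), ('.', '.')]

wordnet_pos = [(word, penn_to_wordnet(tag)) for word, tag in pos_tags]
print(wordnet_pos)
```

Each pair can then be passed on as `lemmatizer.lemmatize(word, pos=letter)`. With `pos='v'`, "running" should lemmatize to "run"; "playing" would stay as it is because the tagger labeled it `NN` in this sentence.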


## **TextBlob**

**Sentiment Analysis**: Sentiment analysis, also known as opinion mining, is a natural language processing (NLP) technique used to determine the emotional tone or sentiment expressed in a piece of text, such as a sentence, paragraph, or document.

In [8]:
!pip install textblob
!python -m textblob.download_corpora

Collecting textblob
  Downloading textblob-0.17.1-py2.py3-none-any.whl (636 kB)
[2K     [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m636.8/636.8 kB[0m [31m1.9 MB/s[0m eta [36m0:00:00[0mm eta [36m0:00:01[0m0:01[0m
Installing collected packages: textblob
Successfully installed textblob-0.17.1
[nltk_data] Downloading package brown to /home/jasleen/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /home/jasleen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jasleen/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jasleen/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     /home/jasleen/nltk_data...
[nltk_data]   Unzipping corpora/conll200

In [9]:
from textblob import TextBlob

**Polarity Score**: The polarity score measures the sentiment of the text as positive, negative, or neutral. It is represented as a numerical value ranging from -1 (very negative) to 1 (very positive), with 0 indicating a neutral sentiment.

**Subjectivity Score**: The subjectivity score measures the subjectivity or objectivity of the text. It is represented as a numerical value ranging from 0 to 1, where 0 is highly objective (factual) and 1 is highly subjective (opinionated).

In [10]:
text = "TextBlob is a simple and easy-to-use library for processing textual data."

blob = TextBlob(text)

print("\nTokenization:")
print(blob.words)

print("\nPart-of-speech tagging:")
print(blob.tags)

print("\nSentiment Analysis:")
print(f"Polarity: {blob.sentiment.polarity}")
print(f"Subjectivity: {blob.sentiment.subjectivity}")



Tokenization:
['TextBlob', 'is', 'a', 'simple', 'and', 'easy-to-use', 'library', 'for', 'processing', 'textual', 'data']

Part-of-speech tagging:
[('TextBlob', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'), ('and', 'CC'), ('easy-to-use', 'JJ'), ('library', 'NN'), ('for', 'IN'), ('processing', 'VBG'), ('textual', 'JJ'), ('data', 'NNS')]

Sentiment Analysis:
Polarity: 0.0
Subjectivity: 0.35714285714285715
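A polarity number is often turned into a coarse label. The helper below is one possible sketch: the function name `label_polarity` and the 0.05 cut-off are assumptions chosen for illustration, not something TextBlob defines.

```python
def label_polarity(score, threshold=0.05):
    """Turn a polarity score in [-1, 1] into a coarse sentiment label."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

# The polarity printed above was 0.0.
print(label_polarity(0.0))
print(label_polarity(0.6))
print(label_polarity(-0.3))
```

A small dead zone around zero keeps near-zero scores from flipping between labels; the right width depends on the data.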


In [15]:
import nltk
nltk.download('tagsets')
nltk.help.upenn_tagset()


[nltk_data] Downloading package tagsets to /home/jasleen/nltk_data...


$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

[nltk_data]   Unzipping help/tagsets.zip.


**Machine Translation**: TextBlob previously wrapped Google Translate to provide a very simple interface for translating text.

The `translate()` and `detect_language()` methods have been deprecated since TextBlob 0.16.0, which is why the cell below fails with an HTTP error. Use the official Google Translate API instead.

In [13]:
from textblob import TextBlob
 
blob = TextBlob("Comment vas-tu?")
 
print(blob.detect_language())
 
print(blob.translate(to='es'))
print(blob.translate(to='en'))
print(blob.translate(to='zh'))

HTTPError: HTTP Error 400: Bad Request

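**Valence Aware Dictionary and sEntiment Reasoner (VADER)** is a lexicon- and rule-based sentiment analysis tool tuned for short, informal text such as social media posts. Its authors report that it performs better than earlier lexicon-based sentiment analyzers on that kind of text. Its `polarity_scores()` method returns the proportions of negative, neutral, and positive text plus a normalized `compound` score between -1 and 1.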

In [16]:
# Requires the vaderSentiment package (pip install vaderSentiment).
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyser = SentimentIntensityAnalyzer()
analyser.polarity_scores("This life sucks.")

{'neg': 0.556, 'neu': 0.444, 'pos': 0.0, 'compound': -0.3612}
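The `compound` value can be checked by hand under two assumptions that the output itself does not show: that "sucks" is the only sentiment-bearing word, with a lexicon valence of -1.5, and that VADER normalizes the summed valence x as x / sqrt(x² + 15). The function name `normalize_valence` below is hypothetical.

```python
import math

def normalize_valence(total, alpha=15):
    """Squash a summed valence into the range [-1, 1]."""
    score = total / math.sqrt(total * total + alpha)
    # Clamp for safety; the formula already stays inside the range.
    return max(-1.0, min(1.0, score))

# Assumed summed valence for "This life sucks." (see the lead-in).
compound = round(normalize_valence(-1.5), 4)
print(compound)
```

Under these assumptions the result agrees with the `compound` value reported above.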

**Web Scraping Libraries and Methodology**: Web scraping is a technique for extracting information from websites. It involves fetching a page programmatically, parsing its HTML to pull out the data of interest, and saving the result in a structured format. The cell below uses `requests` to download a test e-commerce page, `BeautifulSoup` to parse it, and `pandas` to write the product titles, prices, and ratings to a CSV file.

In [25]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

titles = []
prices = []
ratings = []

url = "https://webscraper.io/test-sites/e-commerce/allinone/computers/laptops"

request = requests.get(url)
# print(request.text)
soup = BeautifulSoup(request.text, "html.parser")

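# The selectors below depend on the test site's markup and may need updating if it changes.
for prod in soup.find_all("div", {'class': 'col-sm-4 col-lg-4 col-md-4'}):
    for pr in prod.find_all("div", {'class': 'caption'}):
        for p in pr.find_all('h4', {'class': 'pull-right price'}):
            prices.append(p.text)
        # Product links carry the class "title"; the full name is in the title attribute.
        for t in pr.find_all('a', {'class': 'title'}):
            titles.append(t.get('title'))
    for rt in prod.find_all('div', {'class': 'ratings'}):
        # Each star is a Bootstrap glyphicon span; the count is the rating.
        ratings.append(len(rt.find_all('span', {'class': 'glyphicon glyphicon-star'})))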

# print(prices, ratings, titles)
prod_df = pd.DataFrame(zip(titles, prices, ratings), columns=['Titles','Prices','Ratings'])
prod_df.to_csv("ecommerce.csv", index=False)
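The same fetch, parse, and extract logic can be tried offline on a static snippet. The sketch below swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages or network access. The HTML in `SAMPLE_HTML`, the product names, and the class `ProductParser` are all made up for illustration and only loosely imitate the test site's structure.

```python
from html.parser import HTMLParser

# Made-up markup that loosely imitates a product listing.
SAMPLE_HTML = """
<div class="caption">
  <h4 class="pull-right price">$295.99</h4>
  <a class="title" title="Laptop A">Laptop A...</a>
</div>
<div class="caption">
  <h4 class="pull-right price">$299.00</h4>
  <a class="title" title="Laptop B">Laptop B...</a>
</div>
"""

class ProductParser(HTMLParser):
    """Collect product titles and prices from the sample markup."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.prices = []
        self._in_price = False  # True while inside a price <h4>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "h4" and "price" in classes:
            self._in_price = True
        elif tag == "a" and "title" in classes:
            self.titles.append(attrs.get("title"))

    def handle_data(self, data):
        # Only text inside a price heading is kept.
        if self._in_price and data.strip():
            self.prices.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "h4":
            self._in_price = False

parser = ProductParser()
parser.feed(SAMPLE_HTML)
rows = list(zip(parser.titles, parser.prices))
print(rows)
```

Testing extraction logic against a saved snippet like this makes it easier to tell a selector bug apart from a change in the live site.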