<a href="https://colab.research.google.com/github/Jasleen8801/Conversational-Systems/blob/main/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **NLTK**
The Natural Language Toolkit (NLTK) is a Python library for working with human language data. It provides tokenizers, stemmers, lemmatizers, part-of-speech taggers, and access to text corpora for natural language processing (NLP) and text analysis.

In [1]:
!pip install nltk



**NLTK Corpora**: NLTK corpora are the text datasets and lexical resources (for example, stop word lists and WordNet) that NLTK can download and load on demand.

**Text Processing**: Text processing cleans and normalizes raw text (tokenization, stop word removal, stemming, lemmatization, tagging) so that it can then be converted into mathematical objects, typically vectors or matrices, for machine learning and statistical analysis.
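
As a rough illustration of turning text into vectors, the sketch below builds bag-of-words count vectors using only the standard library. The helper name `bag_of_words` and the whitespace tokenization are assumptions made for this example; they are not part of NLTK.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, vectors) with one count vector per document."""
    # Naive tokenization: lowercase and split on whitespace (illustration only)
    tokenized = [doc.lower().split() for doc in documents]
    # A sorted vocabulary gives every word a fixed column position
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["the cat sat", "the cat saw the dog"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(vectors)
```

Each document becomes a list of counts aligned with the shared vocabulary, so documents of different lengths end up as vectors of the same size.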

In [28]:
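import nltk

# Download the data packages used in this notebook.
# Note: some newer NLTK releases may ask for differently named packages
# (e.g. 'punkt_tab'); if so, download the name given in the LookupError message.
nltk.download('punkt')                       # tokenizer models for word_tokenize
nltk.download('stopwords')                   # stop word lists
nltk.download('wordnet')                     # lexical database for WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')  # model used by pos_tag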

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [20]:
sentence = "I am running in the park and playing with my friends."

**Tokenization**: Tokenization is a fundamental text processing step in natural language processing (NLP). It breaks a text into individual units called tokens, typically words and punctuation marks.

In [21]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(sentence)
print(tokens)

['I', 'am', 'running', 'in', 'the', 'park', 'and', 'playing', 'with', 'my', 'friends', '.']
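
To make the idea concrete, here is a minimal regex-based tokenizer written with the standard library. It is only a sketch, not how `word_tokenize` works internally; the function name `simple_tokenize` and the pattern are assumptions chosen for this example.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches any single character that is neither word nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

regex_tokens = simple_tokenize("I am running in the park and playing with my friends.")
print(regex_tokens)
```

For this sentence the sketch yields the same tokens as the output above. The two approaches diverge on harder input: this pattern splits a contraction such as "don't" into `don`, `'`, `t`, whereas `word_tokenize` applies its own rules for such cases.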


**Stop Words Removal**: Stop words are very common words, such as "the", "and", "in", "is", and "it", that carry little information about the content of a text. Removing them reduces noise and shrinks the vocabulary, which often helps tasks such as keyword extraction, although tasks that depend on function words (for example, negation in sentiment analysis) may suffer.

In [29]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
print(stop_words)

# Keep only the tokens whose lowercase form is not a stop word
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
filtered_text = ' '.join(filtered_tokens)
print(filtered_text)

{'themselves', 'because', 'are', 'whom', "haven't", 'can', 'just', 'by', 'from', 'herself', 'has', 'with', 'above', "it's", 'been', 'all', "hasn't", 'further', 'about', "couldn't", 'ma', 'wasn', "mightn't", 'mustn', 'at', 'most', 'hasn', 'll', "wasn't", 'were', 'weren', 'theirs', 'only', 'her', 'was', 't', 'him', 'through', 're', 'between', 'there', "doesn't", "wouldn't", 'my', "you'll", 'why', 'me', 'them', "that'll", 'too', 'himself', 'isn', 'as', 'again', 'no', 'is', 'off', "weren't", 'who', 'but', 'don', 'you', "hadn't", 'on', 'very', 'have', 'those', "should've", 'this', 'y', "isn't", 'they', 'an', 'not', 'same', 'didn', 'needn', 'until', 'against', 'yours', 'will', "you're", 'then', 'did', 'i', 'for', 'she', "shan't", 'during', 'more', "don't", 'and', 'where', "you've", 'below', 'wouldn', 'both', 'or', 'than', 'up', 'these', 'does', 'once', 'he', "you'd", 'mightn', "aren't", 'yourselves', 'be', 'ourselves', 'our', 'am', 'shan', 'before', 'which', 'doing', 'won', 'that', 'such', '
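
One caveat is worth a small sketch: the printed list contains 'not', so filtering can remove negation. The example below uses a tiny hand-picked stop word set and a hypothetical `keep` parameter (both are assumptions for illustration, not NLTK features) to show how selected words can be protected from removal.

```python
# A small hand-picked stop word set for illustration; the NLTK list is much longer
STOP_WORDS = {"the", "is", "this", "a", "not", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS, keep=()):
    """Drop tokens whose lowercase form is a stop word, except those in `keep`."""
    blocked = set(stop_words) - set(keep)
    return [t for t in tokens if t.lower() not in blocked]

review = "This movie is not good".split()
without_negation = remove_stop_words(review)
with_negation = remove_stop_words(review, keep={"not"})
print(without_negation)  # negation is lost
print(with_negation)     # negation is preserved
```

The same idea applies to the NLTK list: subtracting a few words from `stop_words` before filtering keeps them in the text.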

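**Lemmatization**: Lemmatization reduces words to their dictionary form (lemma), for example "friends" to "friend". `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given, which is why "running" and "playing" are unchanged in the output below.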

In [30]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
lemmatized_text = ' '.join(lemmatized_tokens)
print(lemmatized_text)

I am running in the park and playing with my friend .


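**Stemming**: Stemming reduces words to a stem by stripping suffixes with heuristic rules, without consulting a dictionary. It is usually faster than lemmatization, but the stem is not always a real word, and `PorterStemmer` also lowercases its input (note the "i" in the output below).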

In [31]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(token) for token in tokens]
stemmed_text = ' '.join(stemmed_tokens)
print(stemmed_text)

i am run in the park and play with my friend .
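
To show what suffix stripping means, here is a deliberately naive stemmer. It is a toy sketch, not the Porter algorithm: the function name `naive_stem`, the three suffixes, and the minimum stem length are all assumptions made for this example.

```python
def naive_stem(word, min_stem_length=3):
    """Strip one common suffix if enough of the word remains."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when the remaining stem is not too short
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[:-len(suffix)]
    return word

tokens = ['I', 'am', 'running', 'in', 'the', 'park', 'and',
          'playing', 'with', 'my', 'friends', '.']
naive_text = ' '.join(naive_stem(token) for token in tokens)
print(naive_text)
```

The toy version leaves "running" as "runn", while the Porter output above has "run". The Porter stemmer applies additional rules, such as handling doubled consonants, that this sketch omits.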


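**Part of Speech Tagging**: Part-of-speech tagging (POS tagging) is the process of assigning a part-of-speech label (such as noun, verb, or adjective) to each word in a text. `pos_tag` returns Penn Treebank tags, for example `NN` for a singular noun and `VBG` for a gerund or present participle. The tagger is statistical and can make mistakes: in the output below, "playing" is tagged `NN` even though it is used as a verb.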


In [32]:
from nltk import pos_tag

pos_tags = pos_tag(tokens)
print(pos_tags)

[('I', 'PRP'), ('am', 'VBP'), ('running', 'VBG'), ('in', 'IN'), ('the', 'DT'), ('park', 'NN'), ('and', 'CC'), ('playing', 'NN'), ('with', 'IN'), ('my', 'PRP$'), ('friends', 'NNS'), ('.', '.')]
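
These tags can also guide lemmatization. The sketch below maps Penn Treebank tags to the single-letter WordNet categories ('n', 'v', 'a', 'r') that `WordNetLemmatizer.lemmatize` accepts as its `pos` argument. The helper name `penn_to_wordnet` is an assumption for this example, and the tag list is copied from the output above so that the block runs without downloading any data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('n', 'v', 'a', 'r')."""
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # everything else falls back to noun

# Tags copied from the pos_tag output above
pos_tags = [('I', 'PRP'), ('am', 'VBP'), ('running', 'VBG'), ('in', 'IN'),
            ('the', 'DT'), ('park', 'NN'), ('and', 'CC'), ('playing', 'NN'),
            ('with', 'IN'), ('my', 'PRP$'), ('friends', 'NNS'), ('.', '.')]

wordnet_pairs = [(word, penn_to_wordnet(tag)) for word, tag in pos_tags]
verbs = [word for word, pos in wordnet_pairs if pos == "v"]
print(verbs)

# Each pair could then be passed on as lemmatizer.lemmatize(word, pos=pos)
```

Because "playing" was tagged `NN`, it is mapped to a noun here, so a tagging error carries over into lemmatization.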


## **TextBlob**

**Sentiment Analysis**: Sentiment analysis, also known as opinion mining, is an NLP technique for determining the emotional tone expressed in a piece of text, such as a sentence, paragraph, or document. TextBlob reports it as two numbers, a polarity score and a subjectivity score.

**Polarity Score**: The polarity score measures the sentiment of the text as positive, negative, or neutral. It is represented as a numerical value ranging from -1 (very negative) to 1 (very positive), with 0 indicating a neutral sentiment.

**Subjectivity Score**: The subjectivity score measures the subjectivity or objectivity of the text. It is represented as a numerical value ranging from 0 to 1, where 0 is highly objective (factual) and 1 is highly subjective (opinionated).
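
To give an intuition for how such scores can arise, the sketch below averages per-word scores from a tiny lexicon. It is a toy model under stated assumptions, not TextBlob's implementation: the lexicon `TOY_LEXICON` and all of its values are invented for illustration, and TextBlob's own analyzer is more elaborate.

```python
# Hypothetical mini-lexicon: word -> (polarity, subjectivity). Values are invented.
TOY_LEXICON = {
    "great": (0.75, 0.75),
    "good": (0.5, 0.5),
    "bad": (-0.5, 0.5),
    "terrible": (-1.0, 1.0),
}

def toy_sentiment(text, lexicon=TOY_LEXICON):
    """Average the scores of the words that appear in the lexicon."""
    scores = [lexicon[w] for w in text.lower().split() if w in lexicon]
    if not scores:
        return 0.0, 0.0  # no known opinion words: neutral and objective
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return polarity, subjectivity

mixed = toy_sentiment("the food was good but the service was terrible")
factual = toy_sentiment("the store opens at nine")
print(mixed)
print(factual)
```

Opposing opinion words pull the polarity toward the middle, and a text with no lexicon words scores 0 on both scales.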

In [18]:
!pip install textblob
!python -m textblob.download_corpora

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.


In [33]:
from textblob import TextBlob

text = "TextBlob is a simple and easy-to-use library for processing textual data."

blob = TextBlob(text)

print("\nTokenization:")
print(blob.words)

print("\nPart-of-speech tagging:")
print(blob.tags)

print("\nSentiment Analysis:")
print(f"Polarity: {blob.sentiment.polarity}")
print(f"Subjectivity: {blob.sentiment.subjectivity}")



Tokenization:
['TextBlob', 'is', 'a', 'simple', 'and', 'easy-to-use', 'library', 'for', 'processing', 'textual', 'data']

Part-of-speech tagging:
[('TextBlob', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'), ('and', 'CC'), ('easy-to-use', 'JJ'), ('library', 'NN'), ('for', 'IN'), ('processing', 'VBG'), ('textual', 'JJ'), ('data', 'NNS')]

Sentiment Analysis:
Polarity: 0.0
Subjectivity: 0.35714285714285715
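
A polarity score is often turned into a label for reporting. The helper below is a minimal sketch; the name `label_polarity` and the 0.1 threshold are assumptions chosen for illustration, not values defined by TextBlob.

```python
def label_polarity(polarity, threshold=0.1):
    """Convert a polarity score in [-1, 1] into a coarse sentiment label."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"  # scores close to zero are treated as neutral

label = label_polarity(0.0)
print(label)
```

With this threshold, the polarity of 0.0 reported above is labeled neutral.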
