# Assignment-21

## NLP

### Name- NITESH KUMAR Batch-4

# Q1. What is the primary purpose of tokenization in natural language processing? 

### Answer:-

In natural language processing (NLP), tokenization is the process of breaking down a text into smaller units called tokens. These tokens can be words, phrases, sentences, or other meaningful elements, depending on the level of granularity required for a particular NLP task. The primary purpose of tokenization is to convert raw text into a format that can be easily processed and analyzed by machines.

#### Here are some key reasons for tokenization in NLP:

1. Text Preprocessing:
Tokenization is often the first step in text preprocessing. Breaking down text into tokens makes it easier to handle and apply subsequent processing steps such as stemming, lemmatization, and stop-word removal.

2. Feature Extraction:
Tokens serve as the basic building blocks for feature extraction in NLP tasks. By representing text as a sequence of tokens, machine learning models can analyze and learn patterns from the data more effectively.

3. Text Representation:
Tokenization is crucial for converting text into a format that can be used for various NLP applications, such as sentiment analysis, named entity recognition, and machine translation. Each token represents a discrete unit of meaning.

4. Vocabulary Building:
Tokenization is essential for building the vocabulary of a corpus. Each unique token typically corresponds to a unique word, and by keeping track of the vocabulary, models can understand the distribution of words and learn relationships between them.

5. Text Analysis:
Tokenization enables the analysis of the structure and content of a text. It helps in understanding the syntactic and semantic aspects of language, which is important for tasks like part-of-speech tagging and parsing.

6. Computational Efficiency:
Once text is split into tokens, each token can be mapped to an integer ID, so counting, lookup, and comparison operate on compact IDs instead of repeatedly scanning raw strings. This makes large corpora more manageable for subsequent tasks.

7. Machine Learning Input:
Many machine learning models, especially those used in NLP, require fixed-size input vectors. Tokenization is the first step toward that format: the tokens are mapped to integer IDs, and the resulting variable-length sequences are then padded or truncated to a fixed length before being fed into these models.
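
The steps described above (splitting text, building a vocabulary, and producing fixed-length model input) can be sketched with the standard library alone. This is a minimal illustration, not how NLTK or any particular library does it: the regular expression, the `<pad>`/`<unk>` conventions, and the function names `simple_tokenize`, `build_vocab`, and `encode` are all assumptions made for this example.

```python
import re

def simple_tokenize(text):
    # Lowercase, then keep runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(token_lists):
    # Reserve ID 0 for padding and ID 1 for unknown tokens (an assumed convention).
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(tokens, vocab, max_len):
    # Map tokens to IDs, truncate to max_len, then pad the remainder with 0.
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

corpus = ["Tokenization is important.", "Tokenization splits text into tokens!"]
token_lists = [simple_tokenize(s) for s in corpus]
vocab = build_vocab(token_lists)
encoded = [encode(t, vocab, max_len=6) for t in token_lists]

print(token_lists[0])  # ['tokenization', 'is', 'important', '.']
print(encoded)         # [[2, 3, 4, 5, 0, 0], [2, 6, 7, 8, 9, 10]]
```

Both sentences end up as lists of six integers, which is the kind of fixed-size input item 7 refers to.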

# Q2. Write a Python function to tokenize a given sentence? 

### Answer:-

In [1]:
pip install nltk

Collecting nltk
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting regex>=2021.8.3
  Downloading regex-2023.10.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (773 kB)
Installing collected packages: regex, nltk
Successfully installed nltk-3.8.1 regex-2023.10.3
Note: you may need to restart the kernel to use updated packages.


In [1]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # Download the necessary data for tokenization

def tokenize_sentence(sentence):
    tokens = word_tokenize(sentence)
    return tokens

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
sentence = "Tokenization is an important step in natural language processing."
tokens = tokenize_sentence(sentence)
print(tokens)


['Tokenization', 'is', 'an', 'important', 'step', 'in', 'natural', 'language', 'processing', '.']


# Q3. Why is the removal of stop words considered significant in NLP? 

### Answer:-

The removal of stop words is considered significant in natural language processing (NLP) for several reasons:

1. Reduction of Dimensionality:
Stop words are commonly occurring words in a language that often do not carry significant meaning by themselves, such as articles (e.g., "the," "a," "an"), prepositions (e.g., "in," "on," "at"), and conjunctions (e.g., "and," "but," "or"). By removing these words, the overall dimensionality of the data is reduced, making it easier to analyze and process.

2. Improved Computational Efficiency:
Stop words are frequently used, but they typically contribute less to the overall meaning of a document. Removing them can significantly reduce the size of the data that needs to be processed, leading to improved computational efficiency. This is especially important when working with large datasets.

3. Focus on Content Words:
Stop word removal allows NLP models to focus more on content words, which carry the primary meaning of a text. Content words, such as nouns, verbs, and adjectives, are often more informative for tasks like sentiment analysis, topic modeling, and document classification.

4. Enhanced Semantic Analysis:
Stop words often do not contribute much to the semantic content of a document. By eliminating them, the remaining words become more indicative of the underlying meaning, facilitating more accurate semantic analysis.

5. Improved Information Retrieval:
In tasks related to information retrieval, such as document retrieval or search engines, the removal of stop words can improve the precision and relevance of search results. Queries and documents can be compared based on more meaningful terms.

6. Prevention of Data Skewness:
Stop words are highly frequent in language and can dominate term frequency statistics. Removing them helps prevent skewness in the data, ensuring that less informative words do not disproportionately influence the results of certain analyses.

7. Memory and Storage Efficiency:
By excluding stop words from the analysis, memory and storage requirements are reduced. This is particularly beneficial when working with resource-constrained environments.
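
The effect on term-frequency statistics and vocabulary size can be seen on a single sentence. The sketch below is only illustrative: the stop list is a tiny hand-picked set (NLTK's English list is much longer), and `term_counts` is a hypothetical helper written for this example.

```python
from collections import Counter

# A tiny hand-picked stop list for illustration; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "in", "on", "to"}

def term_counts(text, remove_stop_words=False):
    # Split on whitespace and optionally drop stop words before counting.
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)

text = "the cat sat on the mat and the dog sat in the garden"
raw = term_counts(text)
filtered = term_counts(text, remove_stop_words=True)

print(raw.most_common(1))        # [('the', 4)]
print(filtered.most_common(1))   # [('sat', 2)]
print(len(raw), len(filtered))   # 9 5
```

Before filtering, "the" dominates the counts; after filtering, the most frequent term is a content word and the number of distinct terms drops from 9 to 5.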


# Q4. Write a Python function to eliminate stop words from a given sentence. 

### Answer:-

In [3]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

def remove_stop_words(sentence):
    stop_words = set(stopwords.words('english'))
    words = word_tokenize(sentence)
    filtered_words = [word for word in words if word.lower() not in stop_words]
    filtered_sentence = ' '.join(filtered_words)
    return filtered_sentence

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [4]:
sentence = "Removing stop words is essential for natural language processing."
result = remove_stop_words(sentence)
print(result)

Removing stop words essential natural language processing .


# Q5. Explain the key differences between stemming and lemmatization in text processing.

### Answer:-

Stemming and lemmatization are two text processing techniques used in natural language processing (NLP) to reduce words to their base or root forms. However, they differ in their approaches and the results they produce. Here are the key differences between stemming and lemmatization:

## Stemming:

#### Definition:

It is the process of removing suffixes from words to obtain their root forms. The goal is to reduce words to a common base form, even if the result is not a valid word in the language.
#### Method:

It uses heuristics or rule-based approaches to chop off suffixes. It doesn't always guarantee that the resulting stem is a valid word.
#### Output:

The output (stem) may not be a real word. For example, "happily" is stemmed to "happili," which is not a valid English word, whereas "running" happens to be stemmed to the valid word "run."
#### Speed:

It tends to be faster computationally because it involves simple rule-based operations.
### Examples:

Stemming:
"running" -> "run"
"happily" -> "happili"
"better" -> "better"

## Lemmatization:

#### Definition:

It is the process of reducing words to their base or dictionary form, known as the lemma. The goal is to transform words into valid words by considering their meaning in context.
#### Method:

It involves using language dictionaries and morphological analysis to obtain the base or dictionary form of a word.
#### Output:

The output (lemma) is always a valid word. For example, "running" might be lemmatized to "run," and "better" might be lemmatized to "good" when it is treated as an adjective.
#### Precision:

It is more precise than stemming because it considers the context and meaning of words. It requires more complex linguistic rules and tools.
### Examples:

"running" -> "run"
"happily" -> "happily" (an adverb is usually already in its dictionary form)
"better" -> "good"

In summary, stemming aims to reduce words to a common root, often resulting in non-words, while lemmatization aims to reduce words to their base or dictionary form, ensuring that the output is always a valid word. The choice between stemming and lemmatization depends on the specific requirements of the NLP task and the desired trade-off between simplicity and linguistic precision.
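
The contrast can be made concrete with a toy version of each technique. This is a deliberately simplified sketch, not the Porter algorithm or a real lemmatizer: the suffix list, the three-character minimum, and the `LEMMA_DICT` mini-dictionary are assumptions invented for illustration.

```python
# Toy rules for illustration only; real stemmers (e.g. Porter) use many more.
SUFFIXES = ["ing", "ly", "ed", "s"]

# Hypothetical mini-dictionary standing in for a real lexicon such as WordNet.
LEMMA_DICT = {"running": "run", "better": "good", "mice": "mouse", "was": "be"}

def toy_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    # Look the word up; fall back to the word itself when it is unknown.
    return LEMMA_DICT.get(word, word)

for w in ["running", "better", "mice", "studies"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The rule-based function produces non-words such as "runn" and "studie" and cannot relate "better" to "good," while the dictionary lookup returns valid words but only for entries it knows about. That is the simplicity versus precision trade-off in miniature.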

# Q6. Provide a Python code snippet for performing stemming on a list of words.

### Answer:-

In [5]:
from nltk.stem import PorterStemmer

def perform_stemming(words):
    stemmer = PorterStemmer()
    stemmed_words = [stemmer.stem(word) for word in words]
    return stemmed_words

In [6]:
words = ["running", "happily", "better", "playing"]
stemmed_words = perform_stemming(words)
print(stemmed_words)

['run', 'happili', 'better', 'play']


# Q7. What is the role of text classification in natural language processing? 

### Answer:-

Text classification is a fundamental task in natural language processing (NLP) that involves assigning predefined categories or labels to pieces of text based on their content. The primary role of text classification is to automatically analyze and categorize unstructured text data, enabling machines to make sense of human language. 

### Here are key roles and applications of text classification in NLP:

#### Document Categorization:

Classifying documents into predefined categories or topics. For example, categorizing news articles into sections like politics, sports, or entertainment.
#### Sentiment Analysis:

Determining the sentiment expressed in a piece of text, such as identifying whether a review is positive, negative, or neutral. This is crucial for understanding user opinions and feedback.
#### Spam Detection:

Identifying and filtering out spam or unwanted messages from emails, comments, or other forms of user-generated content.
#### Topic Modeling:

Assigning documents to topics, which aids in organizing and summarizing large datasets. Strictly speaking, uncovering latent topics without labels is an unsupervised task (topic modeling), while text classification assigns documents to a predefined set of topics.
#### Intent Recognition:

Understanding the intent behind user queries or statements. This is common in chatbots, virtual assistants, and customer support systems.
#### Language Identification:

Determining the language of a given text. This is useful in multilingual applications and content filtering.
#### Named Entity Recognition (NER):

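Identifying and classifying named entities, such as names of people, organizations, locations, and other specific entities within a text. Here the classification is applied to individual tokens or spans (sequence labeling) rather than to whole documents.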
#### Document Routing:

Automatically routing documents to appropriate departments or individuals based on their content. For example, sorting customer support tickets into relevant categories.
#### Legal and Compliance Analysis:

Analyzing legal documents to categorize them based on legal clauses, topics, or compliance requirements.
#### Product Categorization:

Categorizing product descriptions or reviews into relevant product categories. This is commonly used in e-commerce applications.
#### Medical Text Analysis:

Categorizing medical documents, such as patient records or research papers, into specific medical conditions or topics.
#### Fraud Detection:

Identifying fraudulent activities or transactions by classifying text data associated with financial transactions or communication.

Text classification is often approached as a supervised machine learning task, where models are trained on labeled datasets to learn patterns and relationships between the input text and corresponding categories. Common algorithms for text classification include Naive Bayes, Support Vector Machines (SVM), and deep learning approaches such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). The effectiveness of text classification models depends on the quality and representativeness of the training data and the chosen features for analysis.
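
To show what "learning patterns from labeled data" means for Naive Bayes, here is a minimal from-scratch sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is not scikit-learn's implementation; the function names `train_nb` and `predict_nb` and the four training sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    # samples: list of (text, label) pairs.
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # log P(label) + sum of log P(word | label), with add-one smoothing.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training examples, for illustration only.
train = [
    ("great movie loved it", "pos"),
    ("wonderful acting great plot", "pos"),
    ("terrible movie hated it", "neg"),
    ("boring plot terrible acting", "neg"),
]
model = train_nb(train)
print(predict_nb(model, "great acting"))         # pos
print(predict_nb(model, "boring and terrible"))  # neg
```

With only four training sentences the result says nothing about real-world accuracy; the point is that the classifier scores each label from word counts observed in the labeled data.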

# Q8. Share a basic example of text classification using a machine learning library like scikit-learn in Python. 

### Answer:-

In [1]:
import nltk
from sklearn.feature_extraction.text import CountVectorizer



In [2]:
# Download the NLTK movie_reviews dataset
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\L184\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\movie_reviews.zip.


True

In [3]:
# Load the movie_reviews dataset
from nltk.corpus import movie_reviews

In [4]:
# Create a list of documents (reviews) and their corresponding labels (positive or negative)
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

In [5]:
# Shuffle the documents
import random
random.shuffle(documents)

In [6]:
# Extract features (words) and labels
all_words = [word.lower() for word in movie_reviews.words()]
all_words = nltk.FreqDist(all_words)
# Keep the 3000 most frequent words; keys() alone is not sorted by frequency
word_features = [word for word, _ in all_words.most_common(3000)]

In [7]:
def document_features(document):
    document_words = set(document)
    features = {word: (word in document_words) for word in word_features}
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]

In [8]:
# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(featuresets, test_size=0.2, random_state=42)

In [12]:
# Extract features and labels from the training set
X_train, y_train = zip(*train_set)

# Extract features and labels from the test set
X_test, y_test = zip(*test_set)

In [14]:
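# Convert features to text: keep only the words marked True for each document.
# Joining the dict itself would emit all of its keys for every document, making
# every input identical (the cause of the all-'neg' predictions recorded below).
X_train_text = [' '.join(w for w, present in document.items() if present) for document in X_train]
X_test_text = [' '.join(w for w, present in document.items() if present) for document in X_test]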

In [15]:
# Convert features to vectors using CountVectorizer
vectorizer = CountVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train_text)
X_test_vectorized = vectorizer.transform(X_test_text)

In [16]:
# Train a Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train_vectorized, y_train)

MultinomialNB()

In [17]:
# Make predictions on the test set
predictions = classifier.predict(X_test_vectorized)

In [18]:
# Evaluate the performance
from sklearn.metrics import accuracy_score, classification_report
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2%}")

Accuracy: 47.50%


In [19]:
# Display the classification report
print("Classification Report:\n", classification_report(y_test, predictions))

Classification Report:
               precision    recall  f1-score   support

         neg       0.47      1.00      0.64       190
         pos       0.00      0.00      0.00       210

    accuracy                           0.48       400
   macro avg       0.24      0.50      0.32       400
weighted avg       0.23      0.47      0.31       400



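
The numbers in the report above can be reproduced by hand. The recall values (1.00 for `neg`, 0.00 for `pos`) imply that all 400 test reviews were predicted as `neg`, so the classifier learned nothing useful from its input. The sketch below recomputes the per-class metrics from those counts; `precision_recall_f1` is a hypothetical helper, and returning 0.0 on a zero denominator is an assumption chosen to mirror the zeros shown in the report.

```python
def precision_recall_f1(tp, fp, fn):
    # Return 0.0 when a denominator is zero, e.g. for a class that is
    # never predicted (precision would otherwise be undefined).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Counts implied by the report: every test review was predicted 'neg'.
neg = precision_recall_f1(tp=190, fp=210, fn=0)   # 190 true neg, 210 pos mislabeled
pos = precision_recall_f1(tp=0, fp=0, fn=210)     # 'pos' is never predicted

print("neg:", neg)
print("pos:", pos)
print("accuracy:", 190 / 400)
```

Precision for `neg` is 190/400 = 0.475 and accuracy is the same 47.5%, which is exactly the share of `neg` reviews in the test set.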


# Q9. Why is Named Entity Recognition (NER) important, and what types of entities can it identify? 

### Answer:-

Named Entity Recognition (NER) is a crucial task in natural language processing (NLP) because it involves identifying and classifying entities in text, providing valuable information about specific elements mentioned in the text. The primary importance of NER lies in its ability to extract structured information from unstructured text, enabling applications to understand and work with named entities in a meaningful way. 

### Here are some key reasons why NER is important:

#### Information Extraction:

NER helps extract structured information from unstructured text, allowing systems to identify and catalog important details such as names of people, organizations, locations, dates, percentages, and more.
#### Data Linking and Integration:

By recognizing named entities, NER facilitates the linking of text data with external knowledge bases or databases. This linking enables the integration of information and enhances the overall understanding of the content.
#### Improved Search and Retrieval:

NER enhances search capabilities by enabling users to search for specific entities or categories within a document or a corpus. It contributes to more accurate and relevant search results.
#### Event Extraction:

NER can be used to identify entities involved in events or activities mentioned in text. This information is valuable for understanding relationships, roles, and interactions between entities.
#### Sentiment Analysis:

Identifying named entities in text can contribute to more nuanced sentiment analysis. Knowing which entities are mentioned in positive or negative contexts can provide a deeper understanding of opinions and sentiments.
#### Enhanced Language Understanding:

NER contributes to the overall understanding of language by recognizing and categorizing named entities. This understanding is crucial for various NLP tasks, such as machine translation, summarization, and question-answering systems.
#### Legal and Compliance Analysis:

In legal documents, contracts, and compliance-related text, NER can identify and categorize entities such as legal provisions, names of parties, dates, and monetary amounts, facilitating legal analysis and compliance monitoring.
#### Financial and Business Analysis:

In financial reports and business documents, NER can identify entities related to companies, financial figures, dates, and other relevant information, supporting financial analysis and decision-making.

### Types of entities that NER can identify include:

##### Person:
Names of individuals.
##### Organization:
Names of companies, institutions, or other organized entities.
##### Location:
Geographical locations, such as cities, countries, and landmarks.
##### Date:
References to specific dates or date ranges.
##### Time:
Mentions of specific times or time intervals.
##### Percentage:
Indications of percentage values.
##### Money:
References to monetary values or currencies.
##### Product:
Names of products or items.
##### Event:
Names of events or occurrences.
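The exact label set depends on the model and its training data; for example, spaCy's English models tag cities and countries as GPE rather than as a generic location. The ability to recognize and categorize these entities enhances the overall utility of NLP applications, making them more powerful in extracting actionable information from diverse textual data sources.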
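
Some of the entity types listed above follow surface patterns and can be approximated with regular expressions. The sketch below is a toy rule-based extractor, not a substitute for a trained NER model: the three patterns and the `regex_entities` function are assumptions made for illustration, and names of people or organizations generally cannot be found this way.

```python
import re

# Hypothetical patterns for a few "pattern-like" entity types.
PATTERNS = [
    ("MONEY", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")),
    ("PERCENT", re.compile(r"\d+(?:\.\d+)?%")),
    ("DATE", re.compile(r"\b(?:19|20)\d{2}\b")),  # four-digit years only
]

def regex_entities(text):
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    # Return entities in the order they appear in the text.
    return [(value, label) for _, value, label in sorted(found)]

print(regex_entities("Revenue grew 12.5% to $3,400 in 2022."))
# [('12.5%', 'PERCENT'), ('$3,400', 'MONEY'), ('2022', 'DATE')]
```

Rules like these are brittle (any four-digit number starting with 19 or 20 is treated as a year), which is one reason statistical models are preferred for general-purpose NER.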

# Q10. Develop a simple Python function to perform basic Named Entity Recognition on a given text.

### Answer:-

In [30]:
pip install spacy

Collecting spacy
  Downloading spacy-3.7.2-cp39-cp39-win_amd64.whl (12.2 MB)
Collecting thinc<8.3.0,>=8.1.8
  Downloading thinc-8.2.1-cp39-cp39-win_amd64.whl (1.5 MB)
Collecting weasel<0.4.0,>=0.1.0
  Downloading weasel-0.3.4-py3-none-any.whl (50 kB)
Collecting typer<0.10.0,>=0.3.0
  Downloading typer-0.9.0-py3-none-any.whl (45 kB)
Collecting pydantic!=1.8,!=1.8.1,<3.0.0,>=1.7.4
  Downloading pydantic-2.5.2-py3-none-any.whl (381 kB)
Collecting cymem<2.1.0,>=2.0.2
  Downloading cymem-2.0.8-cp39-cp39-win_amd64.whl (39 kB)
Collecting spacy-legacy<3.1.0,>=3.0.11
  Downloading spacy_legacy-3.0.12-p

In [35]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.1
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [36]:
import spacy

# Load the English model once; reloading it on every call is slow
nlp = spacy.load("en_core_web_sm")

def spacy_ner(text):
    # Process the text with the spaCy NLP pipeline
    doc = nlp(text)

    # Extract named entities as (text, label) pairs
    return [(ent.text, ent.label_) for ent in doc.ents]

In [37]:
# Example usage:
text = "Apple Inc. is planning to open a new office in London in 2022."
result = spacy_ner(text)
print(result)

[('Apple Inc.', 'ORG'), ('London', 'GPE'), ('2022', 'DATE')]
