## Natural language processing (NLP)
Natural language processing, or NLP, combines computational linguistics—rule-based modeling of human language—with statistical and machine learning models to enable computers and digital devices to recognize, understand and generate text and speech.
A branch of artificial intelligence (AI), NLP lies at the heart of applications and devices that can:

* translate text from one language to another
* respond to typed or spoken commands
* recognize or authenticate users based on voice
* summarize large volumes of text
* assess the intent or sentiment of text or speech
* generate text, graphics, or other content on demand

NLP is a rapidly evolving field, with new models and techniques emerging regularly. The field is also highly interdisciplinary, drawing on insights from linguistics, cognitive science, computer science, and other fields.

## NLP Techniques

NLP encompasses a wide array of techniques aimed at enabling computers to process and understand human language. These techniques can be grouped into several broad areas, each addressing a different aspect of language processing. Here are some of the key NLP techniques:

**1. Text Processing and Preprocessing in NLP**

Text processing and preprocessing are the first steps in NLP. These techniques are used to clean and prepare text data for further analysis. Some common text processing and preprocessing techniques include:

**Tokenization**: Dividing text into smaller units, such as words or sentences.

**Stemming and Lemmatization:** Reducing words to their base or root forms.

**Stop word Removal:** Removing common words (like “and”, “the”, “is”) that may not carry significant meaning.

**Text Normalization:** Standardizing text, including case normalization, removing punctuation, and correcting spelling errors.

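**Bag of Words:** Representing text as word counts, ignoring grammar and word order.

**TF-IDF (Term Frequency-Inverse Document Frequency):** Weighting a word by how often it appears in a document relative to how common it is across a collection of documents.

**2. Syntax and Parsing**
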
**Part-of-Speech (POS) Tagging**: Assigning parts of speech to each word in a sentence (e.g., noun, verb, adjective).

**Dependency Parsing:** Analyzing the grammatical structure of a sentence to identify relationships between words.

**Constituency Parsing:** Breaking down a sentence into its constituent parts or phrases (e.g., noun phrases, verb phrases).

**3. Semantic Analysis**

**Named Entity Recognition (NER)**: Identifying and classifying entities in text, such as names of people, organizations, locations, dates, etc.

**Word Sense Disambiguation (WSD)**: Determining which meaning of a word is used in a given context.

**Coreference Resolution**: Identifying when different words refer to the same entity in a text (e.g., “he” refers to “John”).

**4. Information Extraction**

**Entity Extraction:** Identifying specific entities and their relationships within the text.

**Relation Extraction:** Identifying and categorizing the relationships between entities in a text.

**5. Text Classification in NLP**

**Sentiment Analysis:** Determining the sentiment or emotional tone expressed in a text (e.g., positive, negative, neutral).

**Topic Modeling:** Identifying topics or themes within a large collection of documents.

**Spam Detection:** Classifying text as spam or not spam.
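To make the classification idea concrete, here is a minimal sketch of a multinomial naive Bayes spam classifier written with the standard library only. Naive Bayes is just one possible approach, chosen here for brevity; the four training messages and the `classify` helper are invented for illustration.

```python
import math
from collections import Counter

# Tiny hand-made training set (illustrative only).
train = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("lunch with the team tomorrow", "ham"),
]

# Count words per class and documents per class.
word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

vocab = {word for counts in word_counts.values() for word in counts}


def classify(text):
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""
    scores = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        score = math.log(doc_counts[label] / len(train))  # log prior
        for word in text.split():
            # Smoothed log likelihood of the word given the class.
            score += math.log((counts[word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


print(classify("free prize tomorrow"))
print(classify("team meeting at noon"))
```

With this toy data the first message is scored as spam and the second as ham. A real classifier would need far more training data and proper tokenization.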

**6. Language Generation**

**Machine Translation**: Translating text from one language to another.

**Text Summarization:** Producing a concise summary of a larger text.

**Text Generation:** Automatically generating coherent and contextually relevant text.

**7. Speech Processing**

**Speech Recognition:** Converting spoken language into text.

**Text-to-Speech (TTS) Synthesis:** Converting written text into spoken language.

**8. Question Answering**

**Retrieval-Based QA:** Finding and returning the most relevant text passage in response to a query.

**Generative QA:** Generating an answer based on the information available in a text corpus.
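As a rough illustration of retrieval-based QA, the sketch below ranks passages by Jaccard word overlap with the query and returns the best match. This is a deliberately simple lexical baseline, not how production systems work (they typically use TF-IDF or BM25 scoring, or dense embeddings). The passages and the `retrieve` helper are made up for the example.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, passages):
    """Return the passage with the highest Jaccard word overlap with the query."""
    query_words = tokenize(query)

    def score(passage):
        words = tokenize(passage)
        union = query_words | words
        return len(query_words & words) / len(union) if union else 0.0

    return max(passages, key=score)


passages = [
    "Paris is the capital of France.",
    "The Nile is the longest river in Africa.",
    "Python is a popular programming language.",
]
print(retrieve("What is the capital of France?", passages))
```

Here the first passage shares five of its six words with the query, so it is the one returned.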

**9. Dialogue Systems**

**Chatbots and Virtual Assistants:** Enabling systems to engage in conversations with users, providing responses and performing tasks based on user input.

**10. Sentiment and Emotion Analysis in NLP**

**Emotion Detection:** Identifying and categorizing emotions expressed in text.

**Opinion Mining:** Analyzing opinions or reviews to understand public sentiment toward products, services, or topics.




![image.png](attachment:image.png)

**1. Text Input and Data Collection**

**Data Collection:** Gathering text data from various sources such as websites, books, social media, or proprietary databases.

**Data Storage:** Storing the collected text data in a structured format, such as a database or a collection of documents.

**2. Text Preprocessing**

Preprocessing is crucial to clean and prepare the raw text data for analysis. Common preprocessing steps include:

**Tokenization:** Splitting text into smaller units like words or sentences.

**Lowercasing:** Converting all text to lowercase to ensure uniformity.

**Stopword Removal:** Removing common words that do not contribute significant meaning, such as “and,” “the,” “is.”

**Punctuation Removal:** Removing punctuation marks.

**Stemming and Lemmatization:** Reducing words to their base or root forms. Stemming cuts off suffixes, while lemmatization considers the context and converts words to their meaningful base form.

**Text Normalization:** Standardizing text format, including correcting spelling errors, expanding contractions, and handling special characters.
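The preprocessing steps above can be sketched in a few lines of plain Python. This is a minimal illustration under simplifying assumptions: the stop word list and contraction table are tiny hand-made samples, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as the Porter stemmer.

```python
import re
import string

# Tiny illustrative stop word list; real projects typically use a larger one.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in"}

# Hypothetical contraction table used only for this sketch.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}


def naive_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    text = text.lower()                                  # lowercasing
    for short, full in CONTRACTIONS.items():             # expand contractions
        text = text.replace(short, full)
    # Punctuation removal
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = re.findall(r"[a-z0-9]+", text)              # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    return [naive_stem(t) for t in tokens]               # stemming


print(preprocess("The cats are chasing the mice, and it's raining!"))
# ['cat', 'chas', 'mice', 'it', 'rain']
```

Note that the stemmer turns "chasing" into the non-word "chas" and leaves "mice" untouched. A lemmatizer would map these to "chase" and "mouse", which is the difference between cutting suffixes and finding a meaningful base form.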

**3. Text Representation**

**Bag of Words (BoW):** Representing text as a collection of words, ignoring grammar and word order but keeping track of word frequency.

**Term Frequency-Inverse Document Frequency (TF-IDF)**: A statistic that reflects the importance of a word in a document relative to a collection of documents.

**Word Embeddings:** Using dense vector representations of words where semantically similar words are closer together in the vector space (e.g., Word2Vec, GloVe).
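Bag of Words and TF-IDF are simple enough to compute by hand. The sketch below uses one common TF-IDF variant (term frequency normalized by document length, and idf = log(N / df)); libraries often use slightly different smoothing, so exact numbers will vary. The three example documents are invented.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [doc.split() for doc in docs]

# Bag of Words: per-document word counts, word order discarded.
bow = [Counter(tokens) for tokens in tokenized]


def tf_idf(term, doc_index):
    """TF-IDF with length-normalized term frequency and idf = log(N / df)."""
    tf = bow[doc_index][term] / len(tokenized[doc_index])
    df = sum(1 for counts in bow if term in counts)  # documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)


print(bow[0]["the"])               # 2
print(round(tf_idf("the", 0), 3))  # 0.135
print(round(tf_idf("cat", 0), 3))  # 0.183
```

Although "the" occurs twice in the first document and "cat" only once, "cat" receives the higher TF-IDF score because it appears in fewer documents.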

**4. Feature Extraction**

Extracting meaningful features from the text data that can be used for various NLP tasks.

**N-grams:** Capturing sequences of N words to preserve some context and word order.

**Syntactic Features:** Using parts of speech tags, syntactic dependencies, and parse trees.

**Semantic Features:** Leveraging word embeddings and other representations to capture word meaning and context.
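Of the feature types above, N-grams are the easiest to show in code. A minimal sketch, where `ngrams` is a helper name chosen for this example:

```python
def ngrams(tokens, n):
    """Return all contiguous windows of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "new york is a big city".split()
print(ngrams(tokens, 2))
# [('new', 'york'), ('york', 'is'), ('is', 'a'), ('a', 'big'), ('big', 'city')]
```

The bigram `('new', 'york')` keeps together two words that a plain Bag of Words representation would treat as unrelated.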

**5. Model Selection and Training**

Selecting and training a machine learning or deep learning model to perform specific NLP tasks.

**Supervised Learning:** Using labeled data to train models like Support Vector Machines (SVM), Random Forests, or deep learning models like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).

**Unsupervised Learning:** Applying techniques like clustering or topic modeling (e.g., Latent Dirichlet Allocation) on unlabeled data.

**Pre-trained Models**: Utilizing pre-trained language models such as BERT, GPT, or other transformer-based models that have been trained on large corpora.

**6. Model Deployment and Inference**

Deploying the trained model and using it to make predictions or extract insights from new text data.

**Text Classification:** Categorizing text into predefined classes (e.g., spam detection, sentiment analysis).

**Named Entity Recognition (NER)**: Identifying and classifying entities in the text.

**Machine Translation:** Translating text from one language to another.

**Question Answering:** Providing answers to questions based on the context provided by text data.

**7. Evaluation and Optimization**

Evaluating the performance of the NLP algorithm using metrics such as accuracy, precision, recall, F1-score, and others.

**Hyperparameter Tuning:** Adjusting model parameters to improve performance.

**Error Analysis:** Analyzing errors to understand model weaknesses and improve robustness.
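The metrics named above follow directly from the counts of true and false positives and negatives. A minimal sketch for the binary case, where the labels and predictions are invented and `binary_metrics` is a helper name chosen for this example:

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

In this example there are 3 true positives, 1 false positive, 1 false negative and 3 true negatives, which is why all four metrics happen to equal 0.75.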

**8. Iteration and Improvement**

Continuously improving the algorithm by incorporating new data, refining preprocessing techniques, experimenting with different models, and optimizing features.


## Building a simple RNN-based character-level text generator

Let's build a simple RNN-based character-level text generator using a sample text. The model (an LSTM, a type of RNN) reads a fixed-length window of characters and predicts the next character in the sequence.

In [3]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

In [4]:
# Sample text data
text = "hello world! this is a simple text example for rnn."
chars = sorted(list(set(text))) # Get unique characters
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))
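As a quick check on the two dictionaries above, the following sketch rebuilds them and round-trips a word through the mapping. Only the word "hello" being encoded is new.

```python
text = "hello world! this is a simple text example for rnn."
chars = sorted(set(text))  # unique characters in sorted order
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for i, c in enumerate(chars)}

encoded = [char_to_int[c] for c in "hello"]         # characters -> integers
decoded = "".join(int_to_char[i] for i in encoded)  # integers -> characters

print(len(chars))  # 19
print(encoded)     # [7, 5, 9, 9, 12]
print(decoded)     # hello
```

The vocabulary has 19 characters: 16 letters plus the space, `!` and `.`. This is also the size of the softmax output layer the model needs.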

In [6]:
# Create input-output sequences
seq_length = 10
X = []
y = []

for i in range(len(text) - seq_length):
    seq_in = text[i:i + seq_length]
    seq_out = text[i + seq_length]
    X.append([char_to_int[char] for char in seq_in])
    y.append(char_to_int[seq_out])

X = np.reshape(X, (len(X), seq_length, 1)) / float(len(chars)) # Normalize
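# Pass num_classes so the one-hot width always matches the model's output layer
y = tf.keras.utils.to_categorical(y, num_classes=len(chars))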
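To see what the sliding-window loop above produces, here is the same idea on a shorter string with a window of 3 characters instead of 10. Both values are chosen only to keep the printed output small.

```python
text = "hello world"
seq_length = 3  # smaller than the notebook's 10, for readability

# Each input is a window of seq_length characters; the target is the next character.
pairs = [(text[i:i + seq_length], text[i + seq_length])
         for i in range(len(text) - seq_length)]

for seq_in, seq_out in pairs[:3]:
    print(repr(seq_in), "->", repr(seq_out))
# 'hel' -> 'l'
# 'ell' -> 'o'
# 'llo' -> ' '
```

A text of length n yields n - seq_length training pairs, so this 11-character string gives 8 pairs.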

In [7]:
# Define model
model = Sequential([
    # An explicit Input layer avoids the Keras warning about passing input_shape to a layer
    tf.keras.Input(shape=(X.shape[1], X.shape[2])),
    LSTM(128),
    Dense(len(chars), activation='softmax')])


In [8]:
%%time
# Compile and train
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X, y, epochs=100, batch_size=100, verbose=1)

Epoch 1/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 6s/step - loss: 2.9397
Epoch 2/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 99ms/step - loss: 2.9292
Epoch 3/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 149ms/step - loss: 2.9189
Epoch 4/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 155ms/step - loss: 2.9084
Epoch 5/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 164ms/step - loss: 2.8975
Epoch 6/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 87ms/step - loss: 2.8859
Epoch 7/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 166ms/step - loss: 2.8734
Epoch 8/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 166ms/step - loss: 2.8596
Epoch 9/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 163ms/step - loss: 2.8442
Epoch 10/100
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 177ms/step - loss: 2.8269
Epoc

<keras.src.callbacks.history.History at 0x16e7a5be990>

In [9]:
# Predict a sequence
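start = np.random.randint(0, len(X))  # upper bound is exclusive
pattern = X[start]
print("Seed:")
# Undo the normalization; round() guards against floating-point truncation
print("\"", ''.join([int_to_char[int(round(value * len(chars)))] for value in pattern.flatten()]), "\"")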

Seed:
" xt example "


In [10]:
for i in range(50):  # Generate 50 characters
    x = np.reshape(pattern, (1, len(pattern), 1))
    prediction = model.predict(x, verbose=0)  # verbose=0 suppresses the per-step progress bar
    index = np.argmax(prediction)  # Greedy choice: the most probable next character
    result = int_to_char[index]
    print(result, end="")
    # Slide the window: append the normalized prediction and drop the oldest character
    pattern = np.append(pattern, index / float(len(chars)))
    pattern = pattern[1:]

print("\n")

## Basic text classification using a simple RNN model

We'll use a dataset of movie reviews and classify them as positive or negative based on the sentiment expressed in the review.

In [1]:
import matplotlib.pyplot as plt
import os  # operating-system functionality, such as building file-system paths
import re  # regular expressions
import shutil  # copying, moving, and removing files and directories
import string
import tensorflow as tf

from tensorflow.keras import layers
from tensorflow.keras import losses

In [2]:
print(tf.__version__)

2.16.1


## Sentiment analysis

This notebook trains a sentiment analysis model to classify movie reviews as positive or negative, based on the text of the review. This is an example of binary—or two-class—classification, an important and widely applicable kind of machine learning problem.

You'll use the Large Movie Review Dataset, which contains the text of 50,000 movie reviews from the Internet Movie Database. These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and testing sets are balanced, meaning they contain an equal number of positive and negative reviews.

In [9]:
import tensorflow as tf
import os

url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

# Specify the desired directory
desired_directory = r"D:\Drive D\One drive folder\OneDrive - Higher Education Commission\Drive G\Data science coding\Data sets"

# Download and extract the dataset to the specified directory
dataset = tf.keras.utils.get_file("aclImdb_v1", url,
                                  untar=True, cache_dir=desired_directory,
                                  cache_subdir='')

# Form the path to the extracted dataset directory
dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')

print(f"The dataset has been downloaded and extracted to: {dataset_dir}")


The dataset has been downloaded and extracted to: D:\Drive D\One drive folder\OneDrive - Higher Education Commission\Drive G\Data science coding\Data sets\aclImdb


In [10]:
os.listdir(dataset_dir)

['imdb.vocab', 'imdbEr.txt', 'README', 'test', 'train']

In [11]:
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)

['labeledBow.feat',
 'neg',
 'pos',
 'unsupBow.feat',
 'urls_neg.txt',
 'urls_pos.txt',
 'urls_unsup.txt']

The aclImdb/train/pos and aclImdb/train/neg directories contain many text files, each of which is a single movie review. Let's take a look at one of them.

In [14]:
dataset_dir

'D:\\Drive D\\One drive folder\\OneDrive - Higher Education Commission\\Drive G\\Data science coding\\Data sets\\aclImdb'

In [15]:
train_dir 

'D:\\Drive D\\One drive folder\\OneDrive - Higher Education Commission\\Drive G\\Data science coding\\Data sets\\aclImdb\\train'

In [16]:
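sample_file = os.path.join(train_dir, 'pos', '1181_9.txt')
# Specify the encoding explicitly; the platform default on Windows may not be UTF-8
with open(sample_file, encoding='utf-8') as f:
    print(f.read())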

Rachel Griffiths writes and directs this award winning short film. A heartwarming story about coping with grief and cherishing the memory of those we've loved and lost. Although, only 15 minutes long, Griffiths manages to capture so much emotion and truth onto film in the short space of time. Bud Tingwell gives a touching performance as Will, a widower struggling to cope with his wife's death. Will is confronted by the harsh reality of loneliness and helplessness as he proceeds to take care of Ruth's pet cow, Tulip. The film displays the grief and responsibility one feels for those they have loved and lost. Good cinematography, great direction, and superbly acted. It will bring tears to all those who have lost a loved one, and survived.


**Load the dataset**

Next, you will load the data from disk and prepare it in a format suitable for training. To do so, you will use the text_dataset_from_directory utility, which expects a directory structure as follows:

main_directory/
...class_a/
......a_text_1.txt
......a_text_2.txt
...class_b/
......b_text_1.txt
......b_text_2.txt

To prepare a dataset for binary classification, you will need two folders on disk, corresponding to class_a and class_b. These will be the positive and negative movie reviews, which can be found in aclImdb/train/pos and aclImdb/train/neg. As the IMDB dataset contains additional folders, you will remove them before using this utility.
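The one-subdirectory-per-class convention can be illustrated without TensorFlow. The sketch below builds a throwaway directory tree with the layout described above and derives integer labels from the sorted subdirectory names. It is a conceptual illustration only, not the Keras implementation; the file names and review texts are made up.

```python
import os
import tempfile

# Build a temporary directory tree: one subdirectory per class.
root = tempfile.mkdtemp()
reviews_by_class = {"neg": ["Dull and slow."], "pos": ["Loved it!", "Great cast."]}
for class_name, reviews in reviews_by_class.items():
    class_dir = os.path.join(root, class_name)
    os.makedirs(class_dir)
    for i, review in enumerate(reviews):
        with open(os.path.join(class_dir, f"{i}.txt"), "w", encoding="utf-8") as f:
            f.write(review)

# Every subdirectory is treated as a class; sorted names give a stable label order.
class_names = sorted(d for d in os.listdir(root)
                     if os.path.isdir(os.path.join(root, d)))

samples = []
for label, class_name in enumerate(class_names):
    class_dir = os.path.join(root, class_name)
    for filename in sorted(os.listdir(class_dir)):
        with open(os.path.join(class_dir, filename), encoding="utf-8") as f:
            samples.append((f.read(), label))

print(class_names)  # ['neg', 'pos']
print(samples)      # [('Dull and slow.', 0), ('Loved it!', 1), ('Great cast.', 1)]
```

Because every subdirectory counts as a class, an extra folder left in the training directory would show up as a third class, which is why it has to be removed for binary classification.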

In [21]:
train_dir = os.path.join(dataset_dir, 'train')
remove_dir = os.path.join(train_dir, 'unsup')
# Guard the removal so that re-running this cell does not raise FileNotFoundError
if os.path.isdir(remove_dir):
    shutil.rmtree(remove_dir)


Next, you will use the text_dataset_from_directory utility to create a labeled tf.data.Dataset. tf.data is a powerful collection of tools for working with data.

When running a machine learning experiment, it is a best practice to divide your dataset into three splits: train, validation, and test.

The IMDB dataset has already been divided into train and test, but it lacks a validation set. Let's create a validation set using an 80:20 split of the training data by using the validation_split argument below.
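Conceptually, an 80:20 split shuffles the sample indices with a fixed seed and holds out one fifth of them. The sketch below shows that idea in plain Python. It is not how Keras implements validation_split internally, and `split_indices` is a helper name invented for this example.

```python
import random


def split_indices(n_samples, validation_split, seed):
    """Shuffle indices with a fixed seed, then hold out the last fraction."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle every time
    n_val = int(n_samples * validation_split)
    return indices[:n_samples - n_val], indices[n_samples - n_val:]


train_idx, val_idx = split_indices(25000, 0.2, seed=42)
print(len(train_idx), len(val_idx))      # 20000 5000
print(set(train_idx) & set(val_idx))     # set()
```

The fixed seed matters: when the training and validation subsets are requested in two separate calls, using the same seed for both keeps them from overlapping.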

In [19]:
batch_size = 32
seed = 42

raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    train_dir,  # path built above; the relative 'aclImdb/train' does not exist in the working directory
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)