#Natural Language Processing (NLP)

Natural Language Processing (NLP) is a field of artificial intelligence concerned with the interaction between computers and human language. It enables computers to understand, interpret, and generate human language in ways that are useful and meaningful.



#Corpus-based Methods

Corpus-based methods in NLP involve analyzing large collections of texts (corpora) to extract patterns, relationships, and insights. These methods often include techniques such as statistical analysis, machine learning, and linguistic processing to understand and manipulate natural language data.



In [None]:
# Example of corpus-based method: Counting word frequencies in a corpus
from collections import Counter

corpus = ["The cat sat on the mat.", "The dog barked loudly.", "The sun is shining."]
tokenized_corpus = [sentence.split() for sentence in corpus]
word_counts = Counter(word for sentence in tokenized_corpus for word in sentence)

print(word_counts)

Counter({'The': 3, 'cat': 1, 'sat': 1, 'on': 1, 'the': 1, 'mat.': 1, 'dog': 1, 'barked': 1, 'loudly.': 1, 'sun': 1, 'is': 1, 'shining.': 1})
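
Note that the counts above treat 'The' and 'the' as different words and keep punctuation attached to tokens such as 'mat.'. The following is a minimal sketch of one way to normalize tokens before counting, assuming that lowercasing and a simple regular expression are good enough for this toy corpus. The `tokenize` helper is an illustrative name, not part of any library.

```python
import re
from collections import Counter

corpus = ["The cat sat on the mat.", "The dog barked loudly.", "The sun is shining."]

def tokenize(sentence):
    # Lowercase the sentence and keep only runs of letters, digits, and apostrophes,
    # which drops trailing punctuation such as the final period
    return re.findall(r"[a-z0-9']+", sentence.lower())

normalized_counts = Counter(word for sentence in corpus for word in tokenize(sentence))

print(normalized_counts.most_common(3))
```

With this normalization all four occurrences of "the" fall under a single key. A real corpus would likely need a proper tokenizer rather than a regular expression.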


#N-gram Representation

N-grams are contiguous sequences of n items from a given sample of text or speech. In NLP, representing text using N-grams can capture local word dependencies and provide useful features for tasks such as language modeling, text generation, and sentiment analysis.



In [None]:
# Example of representing text using N-grams
from nltk import ngrams

sentence = "The cat sat on the mat."
n = 2  # Bi-grams
tokenized_sentence = sentence.split()
bi_grams = list(ngrams(tokenized_sentence, n))

print(bi_grams)

[('The', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat.')]
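
For readers without NLTK installed, the same sliding-window idea can be sketched in plain Python. The `make_ngrams` function below is a hypothetical helper written for illustration, not NLTK's implementation.

```python
def make_ngrams(tokens, n):
    # Zip together n copies of the token list, each shifted by one position,
    # so every tuple holds n consecutive tokens
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "The cat sat on the mat.".split()

print(make_ngrams(tokens, 2))  # bi-grams
print(make_ngrams(tokens, 3))  # tri-grams
```

A sentence with k tokens yields k - n + 1 n-grams, so the six tokens here give five bi-grams and four tri-grams.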


#Stopwords and Lemmatization

Stopwords are common words (e.g., "and", "the", "is") that are often filtered out from text data because they do not carry significant meaning. Lemmatization is the process of reducing words to their base or dictionary form (lemmas) to normalize variations of words.

In [None]:
# Example of stop-word removal and lemmatization using NLTK
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

text = "The cats are sitting on the mats and dogs are barking."
tokenized_text = word_tokenize(text)
filtered_text = [word for word in tokenized_text if word.lower() not in stop_words]
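# lemmatize() treats every word as a noun unless a pos argument is given,
# so verb forms such as "sitting" and "barking" are left unchanged here
lemmatized_text = [lemmatizer.lemmatize(word) for word in filtered_text]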

print(lemmatized_text)


['cat', 'sitting', 'mat', 'dog', 'barking', '.']


#Text Classification

Text classification is the task of assigning predefined categories or labels to textual documents based on their content. It is a fundamental problem in NLP with applications in sentiment analysis, spam detection, topic categorization, and more.

In [None]:
# Example of text classification using scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training data
X_train = ["I love this movie.", "This movie is awful.", "The plot is interesting.", "This movie is terrible"]
y_train = ["positive", "negative", "positive", "negative"]

# Test data
X_test = ["I enjoyed the film.", "The movie was terrible."]

# Create a pipeline with CountVectorizer and MultinomialNB classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())

# Train the model
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
print(predictions)

['positive' 'negative']
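
To make the classifier less of a black box, here is a rough sketch of the idea behind multinomial Naive Bayes with add-one (Laplace) smoothing, written in plain Python. It is not scikit-learn's implementation: the function names are hypothetical, the tokenizer is a simple regular expression (unlike `CountVectorizer`, it keeps one-letter words such as "i"), and words unseen in training are simply ignored.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and split into alphabetic tokens
    return re.findall(r"[a-z']+", text.lower())

def train_naive_bayes(texts, labels):
    class_counts = Counter(labels)        # number of documents per class
    word_counts = defaultdict(Counter)    # word frequencies per class
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocabulary

def predict(text, class_counts, word_counts, vocabulary, alpha=1.0):
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Start from the log prior of the class
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # skip words never seen during training
            # Add the smoothed log likelihood of the word given the class
            likelihood = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocabulary))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

X_train = ["I love this movie.", "This movie is awful.", "The plot is interesting.", "This movie is terrible"]
y_train = ["positive", "negative", "positive", "negative"]

nb_model = train_naive_bayes(X_train, y_train)
for sentence in ["I enjoyed the film.", "The movie was terrible."]:
    print(sentence, "->", predict(sentence, *nb_model))
```

With only four training sentences the predictions hinge on one or two shared words, so this is a demonstration of the mechanics rather than a usable classifier.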


#Some NLP models:

#Named Entity Recognition (NER) Model

Named Entity Recognition (NER) is a task in NLP that involves identifying and classifying named entities (such as names of persons, organizations, locations, etc.) in text data. It is often used for information extraction and text understanding tasks.

In [None]:
# Example of Named Entity Recognition using spaCy
import spacy

# Load English NER model
nlp = spacy.load("en_core_web_sm")

# Text to analyze
text = "Apple is headquartered in Cupertino, California."

# Analyze text
doc = nlp(text)

# Extract named entities
for entity in doc.ents:
    print(entity.text, entity.label_)

Apple ORG
Cupertino GPE
California GPE


#Word Embedding Model

Word embeddings are dense vector representations of words in a high-dimensional space where the similarity between words is captured by the proximity of their vectors. They are widely used in NLP tasks such as semantic similarity, language translation, and document clustering.
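
Proximity between word vectors is commonly measured with cosine similarity. The sketch below computes it in plain Python on three hand-made vectors. The vectors are hypothetical values chosen for illustration only, not output from any trained model.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors; real embeddings usually have 100 or more dimensions
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "sun": [0.1, 0.2, 0.9],
}

print(cosine_similarity(toy_vectors["cat"], toy_vectors["dog"]))
print(cosine_similarity(toy_vectors["cat"], toy_vectors["sun"]))
```

Values close to 1 indicate vectors pointing in nearly the same direction. In this toy setup "cat" is closer to "dog" than to "sun".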

In [None]:
# Example of Word Embedding using Word2Vec
from gensim.models import Word2Vec

# Example sentences
sentences = [["cat", "sat", "mat"], ["dog", "barked", "loudly"], ["sun", "shining"]]

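# Train Word2Vec model (min_count=1 keeps every word, even those seen only once).
# With a corpus this small the vectors stay close to their random initialization,
# so they illustrate the API rather than meaningful similarities.
model = Word2Vec(sentences, min_count=1)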

# Get word vector
vector = model.wv['cat']
print(vector)

[ 8.1681199e-03 -4.4430327e-03  8.9854337e-03  8.2536647e-03
 -4.4352221e-03  3.0310510e-04  4.2744912e-03 -3.9263200e-03
 -5.5599655e-03 -6.5123225e-03 -6.7073823e-04 -2.9592158e-04
  4.4630850e-03 -2.4740540e-03 -1.7260908e-04  2.4618758e-03
  4.8675989e-03 -3.0808449e-05 -6.3394094e-03 -9.2608072e-03
  2.6657581e-05  6.6618943e-03  1.4660227e-03 -8.9665223e-03
 -7.9386048e-03  6.5519023e-03 -3.7856805e-03  6.2549924e-03
 -6.6810320e-03  8.4796622e-03 -6.5163244e-03  3.2880199e-03
 -1.0569858e-03 -6.7875278e-03 -3.2875966e-03 -1.1614120e-03
 -5.4709399e-03 -1.2113475e-03 -7.5633135e-03  2.6466595e-03
  9.0701487e-03 -2.3772502e-03 -9.7651005e-04  3.5135616e-03
  8.6650876e-03 -5.9218528e-03 -6.8875779e-03 -2.9329848e-03
  9.1476962e-03  8.6626766e-04 -8.6784009e-03 -1.4469790e-03
  9.4794659e-03 -7.5494875e-03 -5.3580985e-03  9.3165627e-03
 -8.9737261e-03  3.8259076e-03  6.6544057e-04  6.6607012e-03
  8.3127534e-03 -2.8507852e-03 -3.9923131e-03  8.8979173e-03
  2.0896459e-03  6.24894

#Text Summarization Model

Text summarization is the process of distilling the most important information from a text document to produce a concise summary. It can be done through extractive methods (selecting important sentences) or abstractive methods (generating new sentences).

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.probability import FreqDist

def summarize_text(text, num_sentences=2):
    # Tokenize the text into sentences
    sentences = sent_tokenize(text)

    # Tokenize the text into words
    words = word_tokenize(text)

    # Filter out stopwords
    stop_words = set(stopwords.words("english"))
    filtered_words = [word for word in words if word.lower() not in stop_words]

    # Calculate word frequencies
    word_freq = FreqDist(filtered_words)

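    # Calculate sentence scores based on word frequencies.
    # Note: word_freq keeps the original casing of each word, while the lookup
    # below uses lowercased tokens, so words that appear only capitalized in
    # the text do not contribute to a sentence's score.
    sentence_scores = {sentence: sum(word_freq[word] for word in word_tokenize(sentence.lower()) if word in word_freq)
                       for sentence in sentences}

    # Sort sentences by scores and select the top N sentences
    # (the result is ordered by score, not by position in the text)
    top_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:num_sentences]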

    # Return the summarized text
    summarized_text = ' '.join(top_sentences)
    return summarized_text

# Example text
text = """
Natural Language Processing (NLP) is a field of artificial intelligence focused on the interaction between computers and humans' natural language. It enables computers to understand, interpret, and generate human language in a way that is both useful and meaningful. Text summarization is the process of distilling the most important information from a text document to produce a concise summary. It can be done through extractive methods (selecting important sentences) or abstractive methods (generating new sentences).
"""

# Summarize the text
summary = summarize_text(text)
print(summary)

It can be done through extractive methods (selecting important sentences) or abstractive methods (generating new sentences). 
Natural Language Processing (NLP) is a field of artificial intelligence focused on the interaction between computers and humans' natural language.
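
A dependency-free variant of the same frequency-based idea is sketched below. It rests on several assumptions: a naive regular-expression sentence splitter, a tiny hand-written stopword list (`STOP_WORDS` is illustrative and far shorter than NLTK's), and lowercasing of every token before counting. Unlike the function above, it returns the selected sentences in their original order.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption; NLTK's English list is much longer)
STOP_WORDS = {"a", "an", "and", "are", "as", "be", "by", "can", "in", "is",
              "it", "of", "on", "or", "that", "the", "to"}

def content_words(sentence):
    # Lowercased alphabetic tokens with stopwords removed
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]

def summarize(text, num_sentences=2):
    # Naive split: break after ., !, or ? when followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    word_freq = Counter(w for s in sentences for w in content_words(s))
    # Score each sentence by the summed frequency of its content words
    scores = [sum(word_freq[w] for w in content_words(s)) for s in sentences]
    # Pick the highest-scoring sentence indices, then restore document order
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(top))

sample = "Cats chase mice. Mice eat cheese. The weather is nice today."
print(summarize(sample))
```

Summing raw frequencies favors long sentences. Dividing each score by the sentence length is one common adjustment, left out here to keep the sketch short.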


#Sentiment Analysis Model

Sentiment analysis aims to determine the sentiment or opinion expressed in a piece of text. It can be binary (positive or negative) or multiclass (positive, neutral, negative), and it is useful for understanding customer feedback, social media sentiment, and market trends.



In [None]:
# Example of Sentiment Analysis using VADER
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

# Text to analyze
text = "I love this movie. It's fantastic!"

# Sentiment analysis
sid = SentimentIntensityAnalyzer()
sentiment_scores = sid.polarity_scores(text)

# Print sentiment scores
print(sentiment_scores)

{'neg': 0.0, 'neu': 0.27, 'pos': 0.73, 'compound': 0.8439}
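
The compound score can be mapped onto the three classes mentioned earlier. The sketch below uses cut-offs of 0.05 and -0.05, a commonly used convention for VADER's compound score. Treat the threshold as an assumption to tune for your data; `label_from_compound` is a hypothetical helper, not part of NLTK.

```python
def label_from_compound(compound, threshold=0.05):
    # Scores at or above the threshold count as positive,
    # scores at or below its negation as negative, anything between as neutral
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound score taken from the output above
print(label_from_compound(0.8439))
```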




#Text Generation Model

Text generation involves generating new text based on a given input or prompt. It can be done using various techniques such as Markov chains, recurrent neural networks (RNNs), or transformers. Here's an example using a simple Markov chain approach:



In [None]:
%pip install markovify

Successfully installed markovify-0.9.4 unidecode-1.3.8


In [None]:
import markovify

# Larger corpus of text for training
corpus = [
    "The cat sat on the mat.",
    "The dog barked loudly.",
    "The sun is shining.",
    "She sells seashells by the seashore.",
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
]

# Build a Markov chain model
text_model = markovify.Text(corpus)

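# Generate text.
# make_sentence() returns None when, within its limited number of tries, it cannot
# build a sentence that differs enough from the training text. With a corpus this
# small almost every chain reproduces an input sentence, so None is the likely result.
generated_text = text_model.make_sentence()
print(generated_text)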

None
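
To show what a Markov chain does under the hood, here is a minimal first-order (one word of context) sketch in plain Python. It is not markovify's implementation: the `<START>` and `<END>` markers and both function names are assumptions made for illustration, and there is no check against copying the training sentences.

```python
import random
from collections import defaultdict

def build_chain(sentences):
    # Map each word to the list of words observed immediately after it
    chain = defaultdict(list)
    for sentence in sentences:
        words = ["<START>"] + sentence.split() + ["<END>"]
        for current_word, next_word in zip(words, words[1:]):
            chain[current_word].append(next_word)
    return chain

def generate(chain, rng, max_words=20):
    # Walk the chain from <START>, sampling a successor at each step
    word, output = "<START>", []
    while len(output) < max_words:
        word = rng.choice(chain[word])
        if word == "<END>":
            break
        output.append(word)
    return " ".join(output)

corpus = [
    "The cat sat on the mat.",
    "The dog barked loudly.",
    "The sun is shining.",
    "She sells seashells by the seashore.",
    "How much wood would a woodchuck chuck if a woodchuck could chuck wood?",
]

chain = build_chain(corpus)
print(generate(chain, random.Random(0)))
```

Because successors are stored with repetition, frequent transitions are sampled more often. The `max_words` cap guards against loops such as "a woodchuck chuck if a ...".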


#Recurrent Neural Network:

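The example below trains a stacked LSTM, a type of recurrent neural network, on the IMDb movie review dataset for binary sentiment classification. Each review carries one of two labels:

* Positive: the review expresses a positive sentiment towards the movie.
* Negative: the review expresses a negative sentiment towards the movie.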

In [None]:
import numpy as np
from tensorflow.keras.datasets import imdb
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding, SpatialDropout1D
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import BinaryCrossentropy
from sklearn.model_selection import train_test_split

# Load the IMDb dataset
num_words = 10000
max_len = 200
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=num_words)

# Pad sequences to ensure uniform length
X_train = pad_sequences(X_train, maxlen=max_len)
X_test = pad_sequences(X_test, maxlen=max_len)

# Split the training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Define the RNN architecture
embedding_dim = 128
units = 64
dropout_rate = 0.2

model = Sequential()
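# input_length is omitted: the sequence length is inferred from the padded data,
# and recent Keras versions deprecate the argument
model.add(Embedding(input_dim=num_words, output_dim=embedding_dim))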
model.add(SpatialDropout1D(dropout_rate))
model.add(LSTM(units=units, dropout=dropout_rate, recurrent_dropout=dropout_rate, return_sequences=True))
model.add(LSTM(units=units, dropout=dropout_rate, recurrent_dropout=dropout_rate))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
optimizer = Adam(learning_rate=0.001)
loss = BinaryCrossentropy()
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

# Early stopping to prevent overfitting
early_stopping = EarlyStopping(monitor='val_loss', patience=3)

# Train the model
batch_size = 128
epochs = 10
history = model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_data=(X_val, y_val), callbacks=[early_stopping])

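# Evaluate the model on the test set
# (separate names avoid overwriting the loss object defined above)
test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f'Test Loss: {test_loss}, Test Accuracy: {test_accuracy}')

# Example predictions: each value is the predicted probability
# that the review is positive (label 1)
predictions = model.predict(X_test[:10])
print(predictions)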

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Test Loss: 0.4318576157093048, Test Accuracy: 0.8570799827575684
[[0.2337942 ]
 [0.9995297 ]
 [0.99739015]
 [0.9942407 ]
 [0.9995601 ]
 [0.9973126 ]
 [0.9983939 ]
 [0.00730167]
 [0.99783653]
 [0.9969639 ]]
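
The sigmoid outputs above are probabilities, not labels. A minimal sketch of turning them into the two classes follows, assuming the usual 0.5 cut-off; the `to_label` helper is a hypothetical name. In the IMDb dataset, label 1 marks a positive review and 0 a negative one.

```python
# First five probabilities copied from the predictions printed above
probabilities = [0.2337942, 0.9995297, 0.99739015, 0.9942407, 0.9995601]

def to_label(probability, threshold=0.5):
    # Probabilities at or above the threshold are read as positive reviews
    return "positive" if probability >= threshold else "negative"

labels = [to_label(p) for p in probabilities]
print(labels)
```

Raising the threshold trades recall on positive reviews for higher precision, which can matter when the two kinds of error have different costs.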
