
# **NLP Basics**

In [None]:
import requests
from bs4 import BeautifulSoup

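url = "https://en.wikipedia.org/wiki/Natural_Language_Processing"
# Fetch the page; the timeout avoids hanging indefinitely on a slow connection
response = requests.get(url, timeout=10)
# Stop early on HTTP errors instead of parsing an error page
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
# get_text() returns all visible text, including navigation menus and the table of contents
text = soup.get_text()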
print(text)

Natural language processing - Wikipedia

Jump to content
Main menu
Navigation
Main pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate
Contribute
HelpLearn to editCommunity portalRecent changesUpload file
Search
Appearance
Create account
Log in
Personal tools
Pages for logged out editors learn more
ContributionsTalk
Contents
(Top)
1 History
1.1 Symbolic NLP (1950s – early 1990s)
1.2 Statistical NLP (1990s–2010s)
1.3 Neural NLP (present)
2 Approaches: Symbolic, statistical, neural networks
2.1 Statistical approach
2.2 Neural networks
3
Common NL
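
The scraped text is dominated by blank lines and indentation left over from the page layout. A minimal sketch of one way to compact it, assuming a plain string as input; `compact_text` and the `sample` fragment are hypothetical names and data, not part of the original notebook:

```python
def compact_text(raw_text):
    """Strip each line and drop the lines that are empty after stripping."""
    lines = (line.strip() for line in raw_text.splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical fragment shaped like the scraped output
sample = (
    "\n\nNatural language processing - Wikipedia\n\n\n"
    "\t\tNavigation\n\t\n\nJump to content\n"
)

compacted = compact_text(sample)
print(compacted)
```

BeautifulSoup can do something similar at extraction time with `soup.get_text(separator="\n", strip=True)`; navigation entries still need to be filtered separately.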

# **Text Cleaning**

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from bs4 import BeautifulSoup

# Sample text
text = "This is a sample text with <b>HTML</b> tags and @special characters! It's a beautiful day, isn't it?"

# Step 1: Remove HTML tags
soup = BeautifulSoup(text, 'html.parser')
text = soup.get_text()
print("After removing HTML tags: ", text)

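# Step 2: Remove special characters and punctuation
# Note: this also strips apostrophes, so "isn't" becomes "isnt" and is no longer
# matched by the stopword list in Step 4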
text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
print("After removing special characters and punctuation: ", text)

# Step 3: Convert to lowercase
text = text.lower()
print("After converting to lowercase: ", text)

# Step 4: Remove stopwords
nltk.download('stopwords')  # Download stopwords corpus if not already downloaded
stop_words = set(stopwords.words('english'))
text = ' '.join([word for word in text.split() if word not in stop_words])
print("After removing stopwords: ", text)

After removing HTML tags:  This is a sample text with HTML tags and @special characters! It's a beautiful day, isn't it?
After removing special characters and punctuation:  This is a sample text with HTML tags and special characters Its a beautiful day isnt it
After converting to lowercase:  this is a sample text with html tags and special characters its a beautiful day isnt it
After removing stopwords:  sample text html tags special characters beautiful day isnt


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
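
The leftover `isnt` in the output comes from stripping apostrophes before the stopword lookup. A minimal sketch of one possible reordering, where stopwords are removed while contractions are still intact; `STOP_WORDS` is a small hypothetical list standing in for NLTK's much larger English list, and `clean` is an illustrative helper name:

```python
import re

# Hypothetical mini stopword list; NLTK's English list is much larger.
STOP_WORDS = {"this", "is", "a", "with", "and", "it", "it's", "isn't"}


def clean(text, stop_words=STOP_WORDS):
    text = text.lower()
    # Keep apostrophes for now so contractions still match the stopword list.
    tokens = re.findall(r"[a-z0-9']+", text)
    kept = [tok for tok in tokens if tok not in stop_words]
    # Strip any apostrophes left in the surviving tokens.
    return " ".join(tok.replace("'", "") for tok in kept)


cleaned = clean(
    "This is a sample text with HTML tags and @special characters! "
    "It's a beautiful day, isn't it?"
)
print(cleaned)
```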


# **Tokenization**

In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

text = "This is a sample text with multiple sentences. It's a beautiful day!"

# Download the Punkt sentence tokenizer models
nltk.download('punkt')

# Word Tokenization
words = word_tokenize(text)
print("Word Tokens:", words)

# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentence Tokens:", sentences)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Word Tokens: ['This', 'is', 'a', 'sample', 'text', 'with', 'multiple', 'sentences', '.', 'It', "'s", 'a', 'beautiful', 'day', '!']
Sentence Tokens: ['This is a sample text with multiple sentences.', "It's a beautiful day!"]
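
To make the splitting rules visible, here is a rough regex-based approximation of the two tokenizers. It is only a sketch: NLTK's `word_tokenize` and `sent_tokenize` handle abbreviations and many other cases that these two patterns (both assumptions of this example) do not.

```python
import re

text = "This is a sample text with multiple sentences. It's a beautiful day!"

# Words, clitics such as 's, and single punctuation marks become separate tokens.
words = re.findall(r"\w+|'\w+|[^\w\s]", text)

# Split after sentence-final punctuation that is followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

print("Word Tokens:", words)
print("Sentence Tokens:", sentences)
```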


# **Stopword Removal**

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "This is a sample text with multiple sentences. It's a beautiful day!"

# Tokenize text
words = word_tokenize(text)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]

print("Filtered Words:", filtered_words)

Filtered Words: ['sample', 'text', 'multiple', 'sentences', '.', "'s", 'beautiful', 'day', '!']
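
The filtered list still contains punctuation and the clitic `'s`. A minimal sketch of an extra filter that keeps alphabetic tokens only; the token list is copied from the tokenization output, and `stop_words` is a small hypothetical subset of NLTK's list:

```python
# Tokens as produced by word_tokenize in the Tokenization section
tokens = ['This', 'is', 'a', 'sample', 'text', 'with', 'multiple',
          'sentences', '.', 'It', "'s", 'a', 'beautiful', 'day', '!']

# Hypothetical subset of the English stopword list
stop_words = {"this", "is", "a", "with", "it"}

# str.isalpha() rejects punctuation and tokens that contain an apostrophe.
filtered = [tok for tok in tokens
            if tok.isalpha() and tok.lower() not in stop_words]

print("Filtered Words:", filtered)
```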


# **Stemming and Lemmatization**

In [None]:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the WordNet data used by the lemmatizer
nltk.download('wordnet')
text = "running, runs, runner"

# Tokenize text
words = text.split(",")

# Stemming using Porter Stemmer
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word.strip()) for word in words]
print("Stemmed Words:", stemmed_words)

# Lemmatization using WordNetLemmatizer
# (the default pos='n' treats every word as a noun, so 'running' is left unchanged)
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word.strip()) for word in words]
print("Lemmatized Words:", lemmatized_words)

[nltk_data] Downloading package wordnet to /root/nltk_data...


Stemmed Words: ['run', 'run', 'runner']
Lemmatized Words: ['running', 'run', 'runner']
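
The two outputs differ because stemming applies suffix rules, while lemmatization is a dictionary lookup that depends on the part of speech. The toy sketch below illustrates that contrast only: `naive_stem` is not the Porter algorithm, and `TOY_LEMMAS` is a hypothetical three-entry table, not WordNet.

```python
def naive_stem(word):
    """Toy suffix stripper (not the Porter algorithm)."""
    for suffix in ("ing", "s"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            # Undo a doubled final consonant: "runn" -> "run"
            if stem[-1] == stem[-2]:
                stem = stem[:-1]
            return stem
    return word


# Hypothetical lemma table keyed by (word, part of speech)
TOY_LEMMAS = {("running", "v"): "run", ("runs", "v"): "run", ("runs", "n"): "run"}


def toy_lemmatize(word, pos="n"):
    """Dictionary lookup; unknown (word, pos) pairs are returned unchanged."""
    return TOY_LEMMAS.get((word, pos), word)


words = ["running", "runs", "runner"]
stems = [naive_stem(w) for w in words]
noun_lemmas = [toy_lemmatize(w) for w in words]
verb_lemmas = [toy_lemmatize(w, pos="v") for w in words]

print("Toy stems:", stems)
print("Toy lemmas (as nouns):", noun_lemmas)
print("Toy lemmas (as verbs):", verb_lemmas)
```

With the real `WordNetLemmatizer`, the part of speech is passed the same way, for example `lemmatizer.lemmatize("running", pos="v")`.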


# **Vectorization**

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
text_data = ["This is a sample text", "Another text for demonstration"]

# Tokenize text data
tokenized_data = [word_tokenize(text) for text in text_data]

# Create a Bag-of-Words model
vectorizer = CountVectorizer()

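# Re-join the tokens into strings (CountVectorizer expects raw text and, by default,
# drops one-character tokens such as 'a'), then fit the model and transform into vectors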
vectors = vectorizer.fit_transform([' '.join(tokens) for tokens in tokenized_data])

# Print the vectorized data
print("Vectorized Data:")
print(vectors.toarray())

Vectorized Data:
[[0 0 0 1 1 1 1]
 [1 1 1 0 0 1 0]]
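
The matrix is easier to read once each column is tied to a word. The sketch below rebuilds the same counts by hand, assuming CountVectorizer's default behaviour (lowercasing, tokens of at least two word characters, alphabetically sorted vocabulary); the helper name `tokenize` is illustrative.

```python
import re
from collections import Counter

docs = ["This is a sample text", "Another text for demonstration"]


def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", doc.lower())


tokenized = [tokenize(doc) for doc in docs]

# One column per distinct token, in alphabetical order
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# One row per document: how often each vocabulary term occurs in it
matrix = [[Counter(toks)[term] for term in vocabulary] for toks in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```

In scikit-learn itself, `vectorizer.get_feature_names_out()` returns the column labels after fitting.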


# **Text Classification**

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Sample text data
text_data = ["This is a positive review", "This is a negative review"]
labels = [1, 0]

# Tokenize text data
tokenized_data = [word_tokenize(text) for text in text_data]

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer to the tokenized data and transform into vectors
vectors = vectorizer.fit_transform([' '.join(tokens) for tokens in tokenized_data])

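# Split data into training and testing sets
# Note: with only two samples, one is used for training and the other for testing,
# so the classifier only ever sees one class and the accuracy printed below is 0.0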
X_train, X_test, y_train, y_test = train_test_split(vectors, labels, test_size=0.2)

# Train a Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.0
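
An accuracy of 0.0 is expected here: the classifier is trained on a single review and tested on the one with the opposite label. The sketch below shows the idea behind multinomial Naive Bayes on a slightly larger toy set. It is a from-scratch illustration on raw word counts with Laplace smoothing, not scikit-learn's implementation, and all reviews and helper names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Collect per-class document counts and word counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocab, alpha


def predict_nb(model, doc):
    """Return the class with the highest log posterior for one document."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in doc.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            smoothed = word_counts[label][word] + alpha
            score += math.log(smoothed / (total_words + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy reviews: 1 = positive, 0 = negative
train_docs = [
    "great movie loved it", "wonderful acting great plot",
    "loved the wonderful story", "terrible movie hated it",
    "boring plot terrible acting", "hated the boring story",
]
train_labels = [1, 1, 1, 0, 0, 0]

model = train_nb(train_docs, train_labels)
predictions = [predict_nb(model, doc) for doc in ["great story", "boring movie"]]
print(predictions)
```

With both classes present in training, the held-out examples can actually be classified. With scikit-learn, passing `stratify=labels` to `train_test_split` helps keep both classes in each split.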


# **Sentiment Analysis**

In [None]:
import seaborn as sns
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Load the Titanic dataset from seaborn
titanic = sns.load_dataset('titanic')

# Use 'survived' (0 or 1) as a stand-in label; this is not real sentiment data
titanic['sentiment'] = titanic['survived'].apply(lambda x: 1 if x == 1 else 0)

# Use the 'sex' column ('male' or 'female') as the text feature
text_data = titanic['sex']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(text_data, titanic['sentiment'], test_size=0.2)

# Tokenize text data
tokenized_train = [word_tokenize(str(review)) for review in X_train] # Added str() to handle non-string values
tokenized_test = [word_tokenize(str(review)) for review in X_test] # Added str() to handle non-string values

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer to the tokenized data and transform into vectors
X_train_vectors = vectorizer.fit_transform([' '.join(tokens) for tokens in tokenized_train])
X_test_vectors = vectorizer.transform([' '.join(tokens) for tokens in tokenized_test])

# Train a Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train_vectors, y_train)

# Predict on the test set
y_pred = clf.predict(X_test_vectors)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 0.776536312849162
Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.78      0.81       108
           1       0.70      0.77      0.73        71

    accuracy                           0.78       179
   macro avg       0.77      0.78      0.77       179
weighted avg       0.78      0.78      0.78       179
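
The figures in the classification report follow directly from a confusion matrix. The counts below were inferred from the report and are consistent with it (84 + 55 correct out of 179); they are an assumption, not values printed by the original run.

```python
# Inferred counts: rows are true classes, columns are predicted classes
confusion = {
    0: {0: 84, 1: 24},   # true 0: 84 predicted 0, 24 predicted 1
    1: {0: 16, 1: 55},   # true 1: 16 predicted 0, 55 predicted 1
}


def precision(cls):
    # Of everything predicted as cls, how much was really cls?
    predicted_as_cls = sum(confusion[true][cls] for true in confusion)
    return confusion[cls][cls] / predicted_as_cls


def recall(cls):
    # Of everything that really is cls, how much was found?
    return confusion[cls][cls] / sum(confusion[cls].values())


def f1(cls):
    p, r = precision(cls), recall(cls)
    return 2 * p * r / (p + r)


total = sum(sum(row.values()) for row in confusion.values())
accuracy = sum(confusion[c][c] for c in confusion) / total

for cls in confusion:
    print(cls, round(precision(cls), 2), round(recall(cls), 2), round(f1(cls), 2))
print("accuracy", accuracy)
```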



# **Text Representation**

In [3]:
# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec, Doc2Vec
from gensim.models.doc2vec import TaggedDocument # Import TaggedDocument
import numpy as np

# Sample text data
text_data = ["The quick brown fox jumps over the lazy dog"]

# Bag-of-Words (BoW) representation
vectorizer = CountVectorizer()
bow_vector = vectorizer.fit_transform(text_data)
print("Bag-of-Words Representation:")
print(bow_vector.toarray())

# Term Frequency-Inverse Document Frequency (TF-IDF) representation
tfidf_vectorizer = TfidfVectorizer()
tfidf_vector = tfidf_vectorizer.fit_transform(text_data)
print("\nTerm Frequency-Inverse Document Frequency Representation:")
print(tfidf_vector.toarray())

# Word Embeddings representation using Word2Vec
sentences = [text_data[0].split()]
model = Word2Vec(sentences, min_count=1, vector_size=5)
print("\nWord Embeddings Representation:")
print(model.wv["dog"])

# Document Embeddings representation using Doc2Vec
doc_sentences = [TaggedDocument(text_data[0].split(), [0])] # Use TaggedDocument to wrap the sentence
doc_model = Doc2Vec(doc_sentences, min_count=1, vector_size=5)
print("\nDocument Embeddings Representation:")
print(doc_model.dv[0])  # gensim 4 renamed docvecs to dv

Bag-of-Words Representation:
[[1 1 1 1 1 1 1 2]]

Term Frequency-Inverse Document Frequency Representation:
[[0.30151134 0.30151134 0.30151134 0.30151134 0.30151134 0.30151134
  0.30151134 0.60302269]]

Word Embeddings Representation:
[-0.01072454  0.00472863  0.10206699  0.18018547 -0.186059  ]

Document Embeddings Representation:
[-0.10484434 -0.1198414  -0.19804156  0.17142637  0.07147636]
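
The TF-IDF row above can be reproduced by hand. The sketch assumes TfidfVectorizer's default settings: smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation. With a single document every IDF equals 1, so the row is just the normalised word counts.

```python
import math
from collections import Counter

doc = "The quick brown fox jumps over the lazy dog"
counts = Counter(doc.lower().split())
vocabulary = sorted(counts)

n_docs = 1
doc_freq = 1  # every term occurs in the single document

# Smoothed inverse document frequency; equals 1.0 for every term here
idf = {term: math.log((1 + n_docs) / (1 + doc_freq)) + 1 for term in vocabulary}

# Term frequency times IDF, then L2 normalisation of the whole row
raw = [counts[term] * idf[term] for term in vocabulary]
norm = math.sqrt(sum(value * value for value in raw))
tfidf = [value / norm for value in raw]

for term, weight in zip(vocabulary, tfidf):
    print(term, round(weight, 8))
```

The word "the" occurs twice, which is why its weight is double that of the other seven terms.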




# **Syntax and Semantic Analysis**

In [5]:
# Import necessary libraries
import nltk
from nltk import pos_tag, word_tokenize

# Download required NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Sample text data
text_data = "John Smith bought a book from Amazon. He read it."


# Part-of-Speech (POS) Tagging
def pos_tagging(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    return pos_tags

print("Part-of-Speech Tagging:")
print(pos_tagging(text_data))


# Named Entity Recognition (NER)
def named_entity_recognition(text):
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    named_entities = nltk.ne_chunk(tagged)
    return named_entities

print("\nNamed Entity Recognition:")
print(named_entity_recognition(text_data))


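# Dependency Parsing
# NLTK does not ship a pretrained dependency parser; this placeholder only repeats the
# named-entity chunking above and does not produce head-dependent relations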
def dependency_parsing(text):
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    chunked = nltk.ne_chunk(tagged)
    return chunked

print("\nDependency Parsing:")
print(dependency_parsing(text_data))


# Semantic Role Labeling (SRL)
def semantic_role_labeling(text):
    # NLTK doesn't have built-in SRL support
    # Simplified approach: label every noun 'Theme' and every verb 'Action' from POS tags alone
    tokens = word_tokenize(text)
    tagged = pos_tag(tokens)
    roles = []
    for token, tag in tagged:
        if tag.startswith('NN'):
            roles.append((token, 'Theme'))
        elif tag.startswith('VB'):
            roles.append((token, 'Action'))
    return roles

print("\nSemantic Role Labeling:")
print(semantic_role_labeling(text_data))


# Coreference Resolution
def coreference_resolution(text):
    # NLTK doesn't have built-in coreference resolution support
    # Simplified approach: only collect the pronouns; they are not linked to antecedents
    sentences = nltk.sent_tokenize(text)
    pronouns = []
    for sentence in sentences:
        tokens = word_tokenize(sentence)
        tagged = pos_tag(tokens)
        for token, tag in tagged:
            if tag.startswith('PRP'):
                pronouns.append(token)
    return pronouns

print("\nCoreference Resolution:")
print(coreference_resolution(text_data))

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


Part-of-Speech Tagging:
[('John', 'NNP'), ('Smith', 'NNP'), ('bought', 'VBD'), ('a', 'DT'), ('book', 'NN'), ('from', 'IN'), ('Amazon', 'NNP'), ('.', '.'), ('He', 'PRP'), ('read', 'VBD'), ('it', 'PRP'), ('.', '.')]

Named Entity Recognition:
(S
  (PERSON John/NNP)
  (PERSON Smith/NNP)
  bought/VBD
  a/DT
  book/NN
  from/IN
  (GPE Amazon/NNP)
  ./.
  He/PRP
  read/VBD
  it/PRP
  ./.)

Dependency Parsing:
(S
  (PERSON John/NNP)
  (PERSON Smith/NNP)
  bought/VBD
  a/DT
  book/NN
  from/IN
  (GPE Amazon/NNP)
  ./.
  He/PRP
  read/VBD
  it/PRP
  ./.)

Semantic Role Labeling:
[('John', 'Theme'), ('Smith', 'Theme'), ('bought', 'Action'), ('book', 'Theme'), ('Amazon', 'Theme'), ('read', 'Action')]

Coreference Resolution:
['He', 'it']
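
In the NER output, `John` and `Smith` are tagged as two separate PERSON chunks. A minimal sketch of one heuristic for joining them, using only the POS tags printed above: runs of adjacent `NNP` tokens are merged into a single span. The helper name `merge_proper_nouns` is hypothetical, and the heuristic will over-merge in sentences where two distinct names are adjacent.

```python
# POS-tagged tokens copied from the Part-of-Speech Tagging output
tagged = [('John', 'NNP'), ('Smith', 'NNP'), ('bought', 'VBD'), ('a', 'DT'),
          ('book', 'NN'), ('from', 'IN'), ('Amazon', 'NNP'), ('.', '.'),
          ('He', 'PRP'), ('read', 'VBD'), ('it', 'PRP'), ('.', '.')]


def merge_proper_nouns(tagged_tokens):
    """Join runs of consecutive NNP tokens into multi-word spans."""
    spans, current = [], []
    for token, tag in tagged_tokens:
        if tag == 'NNP':
            current.append(token)
        elif current:
            spans.append(' '.join(current))
            current = []
    if current:  # flush a run that ends the sentence
        spans.append(' '.join(current))
    return spans


entity_spans = merge_proper_nouns(tagged)
print(entity_spans)
```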


# **Sentiment Analysis on IMDB**

In [8]:
# Import necessary libraries
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from keras.datasets import imdb
from keras.preprocessing import sequence
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Download required NLTK resources
nltk.download('vader_lexicon')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load the IMDB dataset from Keras
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

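# Convert data to text format
# Create a dictionary to map word indices to words
# Note: imdb.load_data shifts every index by 3 by default (index_from=3), so looking
# the raw indices up in this dictionary without subtracting 3 yields shifted, incorrect words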
word_index = imdb.get_word_index()
word_index = {k:v for k,v in word_index.items() if v <= 10000}
word_index['<PAD>'] = 0

# Reverse the word index
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

x_train_text = [' '.join([reverse_word_index.get(word, '?') for word in sequence.pad_sequences([x_train[i]], maxlen=100)[0] if word != 0]) for i in range(len(x_train))]
x_test_text = [' '.join([reverse_word_index.get(word, '?') for word in sequence.pad_sequences([x_test[i]], maxlen=100)[0] if word != 0]) for i in range(len(x_test))]


# Initialize SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

# Perform sentiment analysis using VADER
sentiments = []
for text in x_train_text:
    sentiment = sia.polarity_scores(text)
    if sentiment['compound'] >= 0.05:
        sentiments.append('Positive')
    elif sentiment['compound'] <= -0.05:
        sentiments.append('Negative')
    else:
        sentiments.append('Neutral')

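# Compare VADER sentiment with actual labels
# Note: sentiments holds strings ('Positive', 'Negative', 'Neutral') while y_train holds
# 0/1 integers, so no pair ever matches and the accuracy printed below is 0.0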
print("VADER Sentiment Accuracy:", accuracy_score(sentiments, y_train))

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=5000)

# Fit the vectorizer to the training data and transform both sets
x_train_vectors = vectorizer.fit_transform(x_train_text)
x_test_vectors = vectorizer.transform(x_test_text)

# Train a Naive Bayes classifier
clf = MultinomialNB()
clf.fit(x_train_vectors, y_train)

# Predict on the test set
y_pred = clf.predict(x_test_vectors)

# Evaluate the model
print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred))
print("Naive Bayes Classification Report:")
print(classification_report(y_test, y_pred))

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
1641221/1641221 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step
VADER Sentiment Accuracy: 0.0
Naive Bayes Accuracy: 0.8268
Naive Bayes Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.84      0.83     12500
           1       0.83      0.82      0.83     12500

    accuracy                           0.83     25000
   macro avg       0.83      0.83      0.83     25000
weighted avg       0.83      0.83      0.83     25000
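
Two details explain the VADER accuracy of 0.0 and suggest how the comparison could be set up: the Keras IMDB encoding reserves indices 0 to 2 and shifts word ranks by 3, and the VADER labels must be mapped to the same 0/1 scheme as `y_train`. The sketch below shows both steps on toy data. `toy_word_index`, the compound scores, and the choice of treating a compound score of 0 or more as positive are assumptions for illustration, not results from the dataset.

```python
INDEX_FROM = 3  # Keras default offset: 0 = padding, 1 = start, 2 = unknown


def decode_review(encoded, word_index, index_from=INDEX_FROM):
    """Map encoded indices back to words, skipping the reserved indices."""
    reverse = {rank + index_from: word for word, rank in word_index.items()}
    return " ".join(reverse.get(i, "?") for i in encoded if i >= index_from)


def compound_to_label(compound):
    """Collapse a VADER compound score into the dataset's 0/1 labels."""
    return 1 if compound >= 0 else 0


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical word index in the same word -> rank format as imdb.get_word_index()
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}
encoded_review = [1, 4, 5, 6, 7]  # start marker followed by four shifted ranks
decoded = decode_review(encoded_review, toy_word_index)
print(decoded)

# Hypothetical compound scores and true labels
toy_compounds = [0.8, -0.6, 0.0, -0.2]
toy_labels = [1, 0, 0, 0]
predicted = [compound_to_label(c) for c in toy_compounds]
toy_accuracy = accuracy(toy_labels, predicted)
print(predicted, toy_accuracy)
```

Folding the neutral band into the positive class is only one option; dropping neutral predictions before scoring is another.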

