<a href="https://colab.research.google.com/github/Anschoudary/NLP/blob/main/NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## What is Natural Language Processing (NLP)?

NLP is a field of artificial intelligence (AI) that focuses on enabling computers to understand, interpret, and generate human language. It combines principles from linguistics, computer science, and AI to bridge the gap between human communication and machine understanding.

### Types of NLP

- Text Classification: Categorizing text into predefined categories (e.g., spam detection, sentiment analysis).
- Machine Translation: Converting text from one language to another.
- Question Answering: Extracting answers from text based on given questions.
- Text Summarization: Generating concise summaries of longer texts.
- Named Entity Recognition (NER): Identifying and classifying named entities (e.g., person, location, organization).

### Examples of NLP Applications

- Chatbots: Simulating human conversation for customer service or information retrieval.
- Virtual Assistants: Performing tasks based on voice commands (e.g., Siri, Alexa).
- Search Engines: Understanding user queries to retrieve relevant information.
- Social Media Monitoring: Analyzing social media data for insights and trends.


In [3]:
!pip install nltk



In [34]:
import nltk

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [35]:
# Tokenization: Splitting text into individual words or sentences.

from nltk.tokenize import word_tokenize, sent_tokenize

text = """We don’t regularly think about the intricacies of our own languages.
          It’s an intuitive behavior used to convey information and meaning with semantic cues such as words,
          signs, or images. It’s been said that language is easier to learn and comes more naturally in adolescence
          because it’s a repeatable, trained behavior—much like walking. And language doesn’t follow a strict set
          of rules, with so many exceptions like “I before E except after C.” What comes naturally to humans, however,
          is exceedingly difficult for computers with the amount of unstructured data, lack of formal rules, and absence
          of real-world context or intent. That’s why machine learning and artificial intelligence (AI) are gaining
          attention and momentum, with greater human dependency on computing systems to communicate and perform tasks.
          And as AI and augmented analytics get more sophisticated, so will Natural Language Processing (NLP).
          While the terms AI and NLP might conjure images of futuristic robots, there are already basic examples of
          NLP at work in our daily lives. Here are a few prominent examples."""
tokens = word_tokenize(text)
sentences = sent_tokenize(text)
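
The sample text contains typographic (curly) quotes and a long dash, and the later outputs show the curly apostrophe surviving as a token of its own. One possible preprocessing step, sketched here as an assumption rather than something the notebook does, is to map those characters to ASCII before tokenizing. The mapping and the name `normalize_punctuation` are illustrative only.

```python
# Map typographic punctuation to ASCII equivalents before tokenizing.
# This mapping is an illustrative assumption, not an exhaustive list.
ASCII_PUNCTUATION = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (curly apostrophe)
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2014": " - ",  # long dash, padded so the joined words separate
})


def normalize_punctuation(raw_text):
    """Return raw_text with curly quotes and long dashes replaced by ASCII."""
    return raw_text.translate(ASCII_PUNCTUATION)


sample = "It\u2019s a trained behavior\u2014much like walking."
print(normalize_punctuation(sample))
# It's a trained behavior - much like walking.
```

Passing `normalize_punctuation(text)` to `word_tokenize` should then let NLTK handle contractions such as "don't" the way it does for plain ASCII input.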

In [36]:
# Stop Word Removal: Filtering out common words like "the", "a", "is", etc.

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_words = [word for word in tokens if word.lower() not in stop_words]

print(filtered_words)

['’', 'regularly', 'think', 'intricacies', 'languages', '.', '’', 'intuitive', 'behavior', 'used', 'convey', 'information', 'meaning', 'semantic', 'cues', 'words', ',', 'signs', ',', 'images', '.', '’', 'said', 'language', 'easier', 'learn', 'comes', 'naturally', 'adolescence', '’', 'repeatable', ',', 'trained', 'behavior—much', 'like', 'walking', '.', 'language', '’', 'follow', 'strict', 'set', 'rules', ',', 'many', 'exceptions', 'like', '“', 'E', 'except', 'C.', '”', 'comes', 'naturally', 'humans', ',', 'however', ',', 'exceedingly', 'difficult', 'computers', 'amount', 'unstructured', 'data', ',', 'lack', 'formal', 'rules', ',', 'absence', 'real-world', 'context', 'intent', '.', '’', 'machine', 'learning', 'artificial', 'intelligence', '(', 'AI', ')', 'gaining', 'attention', 'momentum', ',', 'greater', 'human', 'dependency', 'computing', 'systems', 'communicate', 'perform', 'tasks', '.', 'AI', 'augmented', 'analytics', 'get', 'sophisticated', ',', 'Natural', 'Language', 'Processing',
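
The output above still contains punctuation tokens such as `','`, `'.'`, and the curly apostrophe, because the stop word list only covers words. A minimal sketch of one way to drop them as well is shown below. It uses a tiny hand-written stop word set (`demo_stop_words`, hypothetical) so that it runs without any downloads; in the notebook, the `stop_words` set built from `stopwords.words('english')` would take its place.

```python
# Hypothetical, tiny stop word set for illustration only; the notebook's
# `stop_words` (from nltk.corpus.stopwords) would be used in practice.
demo_stop_words = {"we", "about", "the", "of", "our", "own"}


def keep_token(token, stop_words):
    """Keep a token if it is not a stop word and contains a letter or digit."""
    if token.lower() in stop_words:
        return False
    # Pure punctuation tokens (".", ",", curly quotes) have no alphanumerics.
    return any(ch.isalnum() for ch in token)


demo_tokens = ["We", "regularly", "think", "about", "the", "intricacies",
               "of", "our", "own", "languages", ".", "\u2019", ","]
content_words = [t for t in demo_tokens if keep_token(t, demo_stop_words)]
print(content_words)
# ['regularly', 'think', 'intricacies', 'languages']
```

Checking for at least one alphanumeric character, rather than requiring `token.isalpha()`, keeps hyphenated tokens such as `real-world`.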

In [37]:
# Stemming: Reducing words to a root form by stripping suffixes (e.g., "running" to "run").
# Stems are not always dictionary words (e.g., "languages" becomes "languag").

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in filtered_words]

print(stemmed_words)

['’', 'regularli', 'think', 'intricaci', 'languag', '.', '’', 'intuit', 'behavior', 'use', 'convey', 'inform', 'mean', 'semant', 'cue', 'word', ',', 'sign', ',', 'imag', '.', '’', 'said', 'languag', 'easier', 'learn', 'come', 'natur', 'adolesc', '’', 'repeat', ',', 'train', 'behavior—much', 'like', 'walk', '.', 'languag', '’', 'follow', 'strict', 'set', 'rule', ',', 'mani', 'except', 'like', '“', 'e', 'except', 'c.', '”', 'come', 'natur', 'human', ',', 'howev', ',', 'exceedingli', 'difficult', 'comput', 'amount', 'unstructur', 'data', ',', 'lack', 'formal', 'rule', ',', 'absenc', 'real-world', 'context', 'intent', '.', '’', 'machin', 'learn', 'artifici', 'intellig', '(', 'ai', ')', 'gain', 'attent', 'momentum', ',', 'greater', 'human', 'depend', 'comput', 'system', 'commun', 'perform', 'task', '.', 'ai', 'augment', 'analyt', 'get', 'sophist', ',', 'natur', 'languag', 'process', '(', 'nlp', ')', '.', 'term', 'ai', 'nlp', 'might', 'conjur', 'imag', 'futurist', 'robot', ',', 'alreadi', 'b
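
One way to see what stemming buys you is to group word forms by their stem. The sketch below reuses `PorterStemmer` on a few words taken from the sample text; the grouping helper is an illustration, not part of the original notebook.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Word forms that appear in the sample text (lowercased).
words = ["computers", "computing", "language", "languages",
         "naturally", "natural"]

# Collect every word under the stem it reduces to.
groups = defaultdict(list)
for word in words:
    groups[stemmer.stem(word)].append(word)

for stem, members in groups.items():
    print(stem, "<-", members)
```

Each pair should collapse to a single stem (`comput`, `languag`, `natur`), matching the stems in the output above. That conflation is the point of stemming, even though the stems themselves are not real words.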

In [38]:
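# Lemmatization: Converting words to their dictionary form (e.g., "better" to "good" when
# lemmatized as an adjective, pos="a"). lemmatize() treats every word as a noun by default.
# Caveat: this cell lemmatizes the already-stemmed tokens. Most stems are not dictionary words,
# so they pass through unchanged; lemmatizing filtered_words instead would give real lemmas.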

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in stemmed_words]
print("Lemmatized Words:", lemmatized_words)

Lemmatized Words: ['’', 'regularli', 'think', 'intricaci', 'languag', '.', '’', 'intuit', 'behavior', 'use', 'convey', 'inform', 'mean', 'semant', 'cue', 'word', ',', 'sign', ',', 'imag', '.', '’', 'said', 'languag', 'easier', 'learn', 'come', 'natur', 'adolesc', '’', 'repeat', ',', 'train', 'behavior—much', 'like', 'walk', '.', 'languag', '’', 'follow', 'strict', 'set', 'rule', ',', 'mani', 'except', 'like', '“', 'e', 'except', 'c.', '”', 'come', 'natur', 'human', ',', 'howev', ',', 'exceedingli', 'difficult', 'comput', 'amount', 'unstructur', 'data', ',', 'lack', 'formal', 'rule', ',', 'absenc', 'real-world', 'context', 'intent', '.', '’', 'machin', 'learn', 'artifici', 'intellig', '(', 'ai', ')', 'gain', 'attent', 'momentum', ',', 'greater', 'human', 'depend', 'comput', 'system', 'commun', 'perform', 'task', '.', 'ai', 'augment', 'analyt', 'get', 'sophist', ',', 'natur', 'languag', 'process', '(', 'nlp', ')', '.', 'term', 'ai', 'nlp', 'might', 'conjur', 'imag', 'futurist', 'robot', 

In [39]:
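# Part-of-Speech (POS) Tagging: Assigning grammatical tags to words (e.g., noun, verb, adjective).
# Caveat: the tagger expects ordinary running text. Here it receives stemmed tokens with stop
# words removed, so many tags are unreliable (e.g., 'regularli' is tagged NN); tagging the
# original `tokens` list is the usual choice.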

pos_tags = nltk.pos_tag(lemmatized_words)
print("POS Tags:", pos_tags)

POS Tags: [('’', 'JJ'), ('regularli', 'NN'), ('think', 'VBP'), ('intricaci', 'JJ'), ('languag', 'NN'), ('.', '.'), ('’', 'CC'), ('intuit', 'NN'), ('behavior', 'NN'), ('use', 'NN'), ('convey', 'JJ'), ('inform', 'NN'), ('mean', 'NN'), ('semant', 'NN'), ('cue', 'NN'), ('word', 'NN'), (',', ','), ('sign', 'NN'), (',', ','), ('imag', 'NN'), ('.', '.'), ('’', 'NN'), ('said', 'VBD'), ('languag', 'RB'), ('easier', 'JJR'), ('learn', 'JJ'), ('come', 'VBP'), ('natur', 'JJ'), ('adolesc', 'NN'), ('’', 'NNP'), ('repeat', 'NN'), (',', ','), ('train', 'VBP'), ('behavior—much', 'JJ'), ('like', 'IN'), ('walk', 'NN'), ('.', '.'), ('languag', 'CC'), ('’', 'JJ'), ('follow', 'JJ'), ('strict', 'NN'), ('set', 'NN'), ('rule', 'NN'), (',', ','), ('mani', 'RB'), ('except', 'IN'), ('like', 'IN'), ('“', 'NNP'), ('e', 'FW'), ('except', 'IN'), ('c.', 'NN'), ('”', 'NNP'), ('come', 'VBP'), ('natur', 'JJ'), ('human', 'JJ'), (',', ','), ('howev', 'NN'), (',', ','), ('exceedingli', 'VBZ'), ('difficult', 'JJ'), ('comput',

## Word Embeddings

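This code snippet demonstrates how to train a Word2Vec model using the gensim library. The input is a list of tokenized sentences (each sentence is a list of words), and the model learns a vector for each word from the words that appear around it. With a corpus this small, the vectors stay essentially at their random initialization, so treat the output as a demonstration of the API rather than as meaningful embeddings.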

In [40]:
!pip install gensim



In [41]:
from gensim.models import Word2Vec

# Sample corpus
sentences = [["this", "is", "a", "sentence"], ["another", "sentence"], ["yet", "another", "one"]]

# Train Word2Vec model
model = Word2Vec(sentences, min_count=1)

# Get word embedding for "sentence"
word_embedding = model.wv['sentence']
print(word_embedding)

[-8.6196875e-03  3.6657380e-03  5.1898835e-03  5.7419385e-03
  7.4669183e-03 -6.1676754e-03  1.1056137e-03  6.0472824e-03
 -2.8400505e-03 -6.1735227e-03 -4.1022300e-04 -8.3689485e-03
 -5.6000124e-03  7.1045388e-03  3.3525396e-03  7.2256695e-03
  6.8002474e-03  7.5307419e-03 -3.7891543e-03 -5.6180597e-04
  2.3483764e-03 -4.5190323e-03  8.3887316e-03 -9.8581640e-03
  6.7646410e-03  2.9144168e-03 -4.9328315e-03  4.3981876e-03
 -1.7395747e-03  6.7113843e-03  9.9648498e-03 -4.3624435e-03
 -5.9933780e-04 -5.6956373e-03  3.8508223e-03  2.7866268e-03
  6.8910765e-03  6.1010956e-03  9.5384968e-03  9.2734173e-03
  7.8980681e-03 -6.9895042e-03 -9.1558648e-03 -3.5575271e-04
 -3.0998408e-03  7.8943167e-03  5.9385742e-03 -1.5456629e-03
  1.5109634e-03  1.7900408e-03  7.8175711e-03 -9.5101865e-03
 -2.0553112e-04  3.4691966e-03 -9.3897223e-04  8.3817719e-03
  9.0107834e-03  6.5365066e-03 -7.1162102e-04  7.7104042e-03
 -8.5343346e-03  3.2071066e-03 -4.6379971e-03 -5.0889552e-03
  3.5896183e-03  5.37033
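
Embeddings are usually compared rather than read directly, most often with cosine similarity. The sketch below implements it in plain Python on three made-up 3-dimensional vectors; the vectors and their names are hypothetical and were not produced by the model above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors, invented for illustration only.
king = [0.9, 0.1, 0.4]
queen = [0.8, 0.2, 0.5]
banana = [-0.3, 0.9, -0.1]

similar = cosine_similarity(king, queen)
dissimilar = cosine_similarity(king, banana)
print(round(similar, 3), round(dissimilar, 3))
```

The same function should work on the NumPy arrays returned by `model.wv[...]`, and gensim offers `model.wv.similarity('sentence', 'another')` for the same comparison.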

## Bagging

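Bagging (bootstrap aggregating) trains several copies of a base model, each on a random sample of the training data drawn with replacement, and combines their predictions. This example uses the BaggingClassifier from sklearn.ensemble with a DecisionTreeClassifier as the base estimator. You can replace the sample data with your own dataset and experiment with different base estimators and parameters.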


In [42]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Sample data (replace with your own dataset)
X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 1, 0, 1]

# Create a decision tree classifier
base_estimator = DecisionTreeClassifier()

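# Create a bagging classifier with 10 estimators.
# The base model is passed positionally because its keyword was renamed from
# `base_estimator` to `estimator` in scikit-learn 1.2 (the old name was removed in 1.4).
bagging_model = BaggingClassifier(base_estimator, n_estimators=10)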

# Fit the bagging model to the data
bagging_model.fit(X, y)

# Make predictions
predictions = bagging_model.predict([[3, 3]])
print(predictions)

[0]
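
To make the mechanism behind `BaggingClassifier` more concrete, here is a simplified pure-Python sketch of the two steps, bootstrap sampling and aggregation by majority vote. It is not scikit-learn's implementation (which, for instance, can average predicted probabilities), and the toy nearest-neighbor base learner and the function names are assumptions chosen to keep the example short.

```python
import random
from collections import Counter


def nearest_neighbor_predict(train_X, train_y, point):
    """Toy base learner: return the label of the closest training point."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    best_index = min(range(len(train_X)),
                     key=lambda i: squared_distance(train_X[i], point))
    return train_y[best_index]


def bagging_predict(X, y, point, n_estimators=10, seed=0):
    """Vote over base learners, each fitted on a bootstrap sample of (X, y)."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_estimators):
        # Bootstrap sample: draw len(X) indices with replacement.
        indices = [rng.randrange(len(X)) for _ in range(len(X))]
        sample_X = [X[i] for i in indices]
        sample_y = [y[i] for i in indices]
        votes.append(nearest_neighbor_predict(sample_X, sample_y, point))
    # Aggregate: the most common vote wins.
    return Counter(votes).most_common(1)[0][0], votes


X = [[1, 2], [2, 3], [3, 4], [4, 5]]
y = [0, 1, 0, 1]
prediction, votes = bagging_predict(X, y, [3, 3])
print(prediction, votes)
```

Because each learner sees a different resample, individual votes can disagree; the aggregate is what the ensemble reports. With only four training points the result depends heavily on the random draws, which also applies to the `[0]` printed by the scikit-learn cell above.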




## Vectorization

This code demonstrates how to use the TfidfVectorizer from sklearn.feature_extraction.text to convert a corpus of text documents into a TF-IDF matrix. Each row in the matrix represents a document, and each column represents a unique term in the corpus, in the order given by the feature names. Each value is the TF-IDF score of that term in that document: it is higher for terms that are frequent in the document but rare across the corpus.

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample corpus
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer to the corpus
vectorizer.fit(corpus)

# Transform the corpus into a TF-IDF matrix
tfidf_matrix = vectorizer.transform(corpus)

# Get the feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Print the TF-IDF matrix
print(tfidf_matrix.toarray())

# Print the feature names
print(feature_names)

[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
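
The numbers above can be reproduced by hand. The sketch below recomputes the matrix rows in plain Python, assuming the vectorizer's default settings as I understand them: lowercased tokens of two or more word characters, raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper names are illustrative.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


def tokenize(doc):
    """Lowercase and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())


docs = [tokenize(doc) for doc in corpus]
vocabulary = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))
# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}


def tfidf_row(doc):
    """TF-IDF weights for one tokenized document, scaled to unit L2 norm."""
    counts = Counter(doc)
    weights = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


first_row = tfidf_row(docs[0])
second_row = tfidf_row(docs[1])
print(vocabulary)
print([round(w, 8) for w in first_row])
```

For the first document, the terms `this`, `is`, and `the` occur in all four documents, so their IDF is 1 and they share the lowest non-zero weight (about 0.384), while `first`, which occurs in only two documents, gets the highest (about 0.580).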
