<a href="https://colab.research.google.com/github/Ananya221203/Natural-Language-Processing-NLP-/blob/main/NLP_(for_beginners).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Natural Language Processing (NLP)**

Objectives:

To introduce the basic concepts of NLP.

To demonstrate real-world applications of NLP.

To engage students with interactive examples and exercises.

**1. What is NLP?**

**Definition:** Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans using natural language.

**Simple Explanation:** Think of NLP as teaching computers to understand and talk in human language.

**2. Need for NLP**

**Understanding Human Language:** Computers need to understand human language to provide meaningful responses.

**Automating Repetitive Tasks:** Tasks like sorting emails, summarizing texts, or analyzing sentiments can be automated using NLP.

**Enhancing User Experience:** Virtual assistants, chatbots, and translation services all rely on NLP to improve interactions.

**3. Applications of NLP**

**Machine Translation:** Tools like Google Translate.

**Sentiment Analysis:** Understanding emotions in texts (e.g., social media posts).

**Text Summarization:** Creating short summaries of long documents.

**Speech Recognition:** Voice-activated assistants like Siri or Alexa.

**Chatbots and Virtual Assistants:** Automated customer service bots.


**4. Basic steps in NLP**

**Text Preprocessing:** Cleaning and preparing text data.

**Tokenization:** Splitting text into individual words or phrases.

**Removing Stop Words:** Filtering out common words that add little meaning.

**Stemming and Lemmatization:** Reducing words to their root forms. Stemming strips suffixes by rule, while lemmatization maps each word to its dictionary form.

**Vectorization:** Converting text into numerical vectors.

**Model Building:** Creating machine learning models to analyze text.

**Evaluation:** Assessing the performance of the models.
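
Before turning to the libraries, the first few steps can be sketched with the Python standard library alone. This is a rough illustration, not how NLTK or scikit-learn do it: the `STOP_WORDS` set is a hypothetical mini list, whitespace splitting stands in for a real tokenizer, and the vector is a plain word count.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"i", "am", "a", "the", "is", "and", "to", "of"}

def clean(text):
    """Drop punctuation and underscores, then lowercase."""
    return re.sub(r"[^\w\s]|_", "", text).lower()

def tokenize(text):
    """Split on whitespace; a rough stand-in for a real tokenizer."""
    return text.split()

def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]

def vectorize(tokens, vocabulary):
    """Count how often each vocabulary word occurs (bag of words)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

text = "The student is reading, and the student is learning!"
tokens = remove_stop_words(tokenize(clean(text)))
vocabulary = sorted(set(tokens))
vector = vectorize(tokens, vocabulary)

print(tokens)      # ['student', 'reading', 'student', 'learning']
print(vocabulary)  # ['learning', 'reading', 'student']
print(vector)      # [1, 1, 2]
```

Each position in `vector` corresponds to one word in `vocabulary`, which is the basic idea behind the vectorization step.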

**Demonstration Outline:**

**1. Text Preprocessing**


In [4]:
import re

def preprocess_text(text):
    text = re.sub(r'[^\w\s]|_', '', text)
    text = text.lower()
    return text

sample_text = "Hello World! i am a univrtsity student ___ ."
processed_text = preprocess_text(sample_text)
print(processed_text)


hello world i am a univrtsity student  


**2. Tokenization**

In [5]:
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

tokens = word_tokenize(processed_text)
print(tokens)

['hello', 'world', 'i', 'am', 'a', 'univrtsity', 'student']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**3. Removing Stop Words:**

In [6]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]
print(filtered_tokens)

['hello', 'world', 'univrtsity', 'student']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [7]:
# stopwords.words is a method, so it must be called to return the list of words
print(stopwords.words('english'))


**4. Stemming**

In [8]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print(stemmed_tokens)



['hello', 'world', 'univrts', 'student']


**5. Lemmatization:**

In [11]:
import nltk
from nltk.stem import WordNetLemmatizer
import spacy

# Download the necessary NLTK data
nltk.download('wordnet')
nltk.download('omw-1.4')  # Required for wordnet
nltk.download('punkt')  # Ensure punkt is downloaded for tokenization

# Load the spaCy model
nlp = spacy.load('en_core_web_sm')

# Example text
text = "I am very happy. I am going for a trip"

# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Example list of filtered tokens
filtered_tokens = ["tree", "happy", "god", "fairly"]

# Check if the required NLTK resources are available
try:
    nltk.data.find('corpora/wordnet.zip')
    nltk.data.find('corpora/omw-1.4.zip')
    # Lemmatize each token
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
    print("Lemmatized tokens using NLTK:", lemmatized_tokens)
except LookupError:
    print("WordNet resource not found. Please ensure it is downloaded properly.")

# Additionally, lemmatize using spaCy
doc = nlp(text)
spacy_lemmatized_tokens = [token.lemma_ for token in doc]
print("Lemmatized tokens using spaCy:", spacy_lemmatized_tokens)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Lemmatized tokens using NLTK: ['tree', 'happy', 'god', 'fairly']
Lemmatized tokens using spaCy: ['I', 'be', 'very', 'happy', '.', 'I', 'be', 'go', 'for', 'a', 'trip']
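
The NLTK example words above are already in their base form, so the lemmatizer leaves them unchanged. To make the difference between stemming and lemmatization easier to see, here is a toy sketch using only plain Python. Both `toy_stem` and `toy_lemmatize` are hypothetical helpers written for illustration; they are not the Porter algorithm or the WordNet lemmatizer, only simplified stand-ins for the two ideas.

```python
# Suffixes tried in order; a crude rule set, not the Porter algorithm.
SUFFIXES = ("ing", "ies", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word

# Hypothetical lookup table standing in for a real dictionary such as WordNet.
TOY_LEMMAS = {"studies": "study", "going": "go", "trees": "tree", "better": "good"}

def toy_lemmatize(word):
    """Look the word up in the table; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

words = ["studies", "going", "trees", "better"]
stems = [toy_stem(word) for word in words]
lemmas = [toy_lemmatize(word) for word in words]

print(stems)   # ['stud', 'go', 'tre', 'better']
print(lemmas)  # ['study', 'go', 'tree', 'good']
```

Rule-based stripping is fast but can produce non-words such as `stud` or `tre`, while a dictionary lookup returns real words and can handle irregular forms such as `better`.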


**6. Vectorization**

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

texts = ["I love drawing.", "Python is great.", "I hate studies.", "Debugging is difficult."]
labels = [1, 1, 0, 1]  # 1 for positive, 0 for negative

tfidf_vectorizer = TfidfVectorizer()
vectorized_texts = tfidf_vectorizer.fit_transform(texts)
print(vectorized_texts)

  (0, 2)	0.7071067811865476
  (0, 6)	0.7071067811865476
  (1, 3)	0.6176143709756019
  (1, 5)	0.48693426407352264
  (1, 7)	0.6176143709756019
  (2, 8)	0.7071067811865476
  (2, 4)	0.7071067811865476
  (3, 1)	0.6176143709756019
  (3, 0)	0.6176143709756019
  (3, 5)	0.48693426407352264
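
Each output row has the form `(document index, vocabulary index)` followed by the TF-IDF weight. The values can be reproduced by hand. The following is a minimal sketch, assuming `TfidfVectorizer`'s default settings: lowercasing, tokens of two or more word characters (so "I" is dropped), a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row.

```python
import math
import re

texts = ["I love drawing.", "Python is great.", "I hate studies.", "Debugging is difficult."]

def tokenize(text):
    """Lowercase and keep words of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(text) for text in texts]
n_docs = len(docs)
vocabulary = sorted({word for doc in docs for word in doc})

# Document frequency: in how many documents each word appears.
df = {word: sum(word in doc for doc in docs) for word in vocabulary}

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}

def tfidf(doc):
    """Term count times IDF, scaled so the row has unit L2 norm."""
    raw = {word: doc.count(word) * idf[word] for word in set(doc)}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

weights = [tfidf(doc) for doc in docs]

for row, doc_weights in enumerate(weights):
    for word, value in sorted(doc_weights.items()):
        print(row, vocabulary.index(word), word, round(value, 4))
```

In "Python is great.", the word "is" also appears in another document, so its IDF is lower and it gets the smaller weight (about 0.4869) than "python" and "great" (about 0.6176).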


**7. Simple Model Building:**

Splitting data into training and testing sets.

Training a Multinomial Naive Bayes model on the training data.

Evaluating the model's performance on the test data using accuracy as a metric.




In [13]:
X_train, X_test, y_train, y_test = train_test_split(vectorized_texts, labels, test_size=0.25, random_state=42)
model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 100.00%
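
This score should be read with caution. With four texts and `test_size=0.25`, only one sample is held out, so the accuracy can only be 0% or 100%. The sketch below is an illustration in plain Python, not part of scikit-learn. It uses a hypothetical helper, `possible_scores`, to list every accuracy value a binary classifier could reach for a given test-set size.

```python
from itertools import product

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def possible_scores(n_test):
    """Every accuracy value a binary classifier can reach on n_test samples."""
    y_true = [1] * n_test  # the actual labels do not change the set of scores
    return sorted({accuracy(y_true, pred) for pred in product([0, 1], repeat=n_test)})

print(possible_scores(1))  # [0.0, 1.0]
print(possible_scores(4))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

A larger labeled dataset, and therefore a larger test set, is needed before the accuracy says much about the model.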
