# NLP Zero to Hero

## Introduction to NLP

Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) concerned with the interaction between computers and humans through natural language. Its ultimate objective is to enable computers to understand, interpret, and generate human language in a way that is both meaningful and useful.

### Why is NLP Important?

NLP matters because it helps resolve ambiguity in language and turns text into numeric representations that downstream applications, such as text analytics or speech recognition, can work with. Here are some specific reasons:

1. **Volume of Text Data**: With the explosion of digital communication, a vast amount of text is generated every day. NLP helps extract useful information from this unstructured data.
2. **Human-Computer Interaction**: NLP enables more natural interactions between humans and computers, making technologies like virtual assistants and chatbots more effective.
3. **Automation of Routine Tasks**: NLP can automate routine tasks such as summarizing documents, filtering spam emails, and translating languages.

### Applications of NLP

NLP has a wide range of applications across various domains:

- **Text Classification**: Categorizing text into predefined categories. For example, filtering spam emails or classifying customer reviews.
- **Sentiment Analysis**: Determining the sentiment expressed in a piece of text, such as identifying positive or negative reviews.
- **Machine Translation**: Translating text from one language to another, like Google Translate.
- **Named Entity Recognition (NER)**: Identifying entities such as names, dates, and places within a text.
- **Speech Recognition**: Converting spoken language into text, as used in virtual assistants like Siri and Alexa.
- **Chatbots**: Enabling conversational agents to understand and respond to human queries in real-time.
- **Text Summarization**: Creating a summary of a longer piece of text.


### Basic NLP Concepts and Terminology

Before looking into NLP tasks, it's essential to understand some basic concepts and terminology:

- **Tokenization**: The process of splitting text into individual words or phrases, known as tokens.
- **Stopwords**: Common words like "the", "is", "in", which are often removed from text before processing because they add little value to the analysis.
- **Stemming and Lemmatization**: Techniques to reduce words to their base or root form. Stemming is a crude heuristic process that chops off the ends of words, so the result is not always a real word. Lemmatization uses a vocabulary and morphological analysis to return the dictionary form of a word (its lemma).
- **Vectorization**: Converting text into numerical format. Common methods include Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF).
- **Word Embeddings**: Dense vector representations of words that capture meaning and semantic relationships, learned from the contexts in which words appear. Examples include Word2Vec, GloVe, and FastText.
- **Sequence Models**: Models like Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRUs) that are capable of processing sequences of data.
- **Transformers**: A type of model architecture that has revolutionized NLP, particularly through models like BERT, GPT, and T5. Transformers use attention to process all positions of a sequence in parallel and have shown significant improvements in various NLP tasks.

### Structure of This Notebook

This notebook will take you through a journey from basic NLP tasks to more advanced techniques. Here’s the structure:

1. **Text Preprocessing**
   - Tokenization
   - Stopword removal
   - Stemming and Lemmatization
   - Vectorization

2. **Basic NLP Tasks**
   - Sentiment Analysis
   - Named Entity Recognition (NER)
   - Part-of-Speech Tagging
   - Text Classification

3. **Advanced NLP Techniques**
   - Word Embeddings (Word2Vec, GloVe)
   - Sequence Models (RNN, LSTM, GRU)
   - Attention Mechanisms and Transformers
   - BERT and other Transformer-based models

4. **Practical Projects**
   - Building a Sentiment Analysis Model
   - Creating a Chatbot
   - Text Summarization
   - Machine Translation

5. **Conclusion and Further Reading**
   - Summary of key points
   - Resources for further learning

Let's get started on this exciting journey into the world of Natural Language Processing!

# Part 1 : Text Preprocessing

### 📝 Importing libraries

In [15]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import spacy

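#### 🚀 Let's download the NLTK data and load the spaCy model we need (just once)

In [17]:
nltk.download('punkt')  # newer NLTK releases may also need nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
# nltk.download('omw-1.4')
# The spaCy model must be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')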

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [10]:
# 🌟 Example text
text = """
FeatureLens generates visuals by applying text mining concepts such as frequent words, expressions, and closed itemsets of n-grams to guide the discovery process. These concepts are combined with interactive visualization to help users analyze text, create insights
"""

### 🪄 Step 1: Tokenization - Breaking text into words (tokens) 🌟

In [11]:
tokens = word_tokenize(text)
print("🔹 Tokens:", tokens)

🔹 Tokens: ['FeatureLens', 'generates', 'visuals', 'by', 'applying', 'text', 'mining', 'concepts', 'such', 'as', 'frequent', 'words', ',', 'expressions', ',', 'and', 'closed', 'itemsets', 'of', 'n-grams', 'to', 'guide', 'the', 'discovery', 'process', '.', 'These', 'concepts', 'are', 'combined', 'with', 'interactive', 'visualization', 'to', 'help', 'users', 'analyze', 'text', ',', 'create', 'insights']
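
To make the splitting rule concrete, here is a minimal sketch of a regex-based tokenizer. It is a simplified stand-in and not how `word_tokenize` is implemented; the `simple_tokenize` name and the pattern are assumptions chosen for illustration.

```python
import re

# Hypothetical pattern: a word (optionally hyphenated) or a single punctuation mark
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens with one regular expression."""
    return TOKEN_PATTERN.findall(text)

sample = "These concepts are combined with n-grams, and more."
sample_tokens = simple_tokenize(sample)
print(sample_tokens)
```

Like the output above, this keeps `n-grams` as one token and emits punctuation marks as tokens of their own. Real tokenizers also handle contractions, abbreviations, and many other edge cases that this sketch ignores.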


### 🛑 Step 2: Stopword removal - Get rid of those pesky common words 🛑

In [12]:

stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print("🔹 Tokens after stopword removal:", filtered_tokens)


🔹 Tokens after stopword removal: ['FeatureLens', 'generates', 'visuals', 'applying', 'text', 'mining', 'concepts', 'frequent', 'words', ',', 'expressions', ',', 'closed', 'itemsets', 'n-grams', 'guide', 'discovery', 'process', '.', 'concepts', 'combined', 'interactive', 'visualization', 'help', 'users', 'analyze', 'text', ',', 'create', 'insights']
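
Notice that punctuation tokens such as `','` and `'.'` survive stopword removal. If they are not wanted, one option is to filter them in the same pass. The sketch below assumes a tiny hand-picked stopword set (`DEMO_STOPWORDS`) so that it runs without NLTK; the helper name is hypothetical.

```python
# Tiny hand-picked stopword set for illustration only; NLTK's English list is much longer
DEMO_STOPWORDS = {"by", "such", "as", "and", "of", "to", "the"}

def remove_stopwords_and_punctuation(tokens, stop_words):
    """Keep tokens that are not stopwords and contain at least one letter or digit."""
    return [
        tok for tok in tokens
        if tok.lower() not in stop_words and any(ch.isalnum() for ch in tok)
    ]

# First sentence of the token list printed in Step 1
demo_tokens = ["FeatureLens", "generates", "visuals", "by", "applying", "text",
               "mining", "concepts", "such", "as", "frequent", "words", ",",
               "expressions", ",", "and", "closed", "itemsets", "of", "n-grams",
               "to", "guide", "the", "discovery", "process", "."]
clean_tokens = remove_stopwords_and_punctuation(demo_tokens, DEMO_STOPWORDS)
print(clean_tokens)
```

Checking for at least one alphanumeric character, rather than using `str.isalpha()`, keeps hyphenated tokens like `n-grams`.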


### ✂️ Step 3: Stemming - Chopping words to their roots using PorterStemmer! ✂️

In [13]:
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print("🔹 Stemmed tokens:", stemmed_tokens)

🔹 Stemmed tokens: ['featurelen', 'gener', 'visual', 'appli', 'text', 'mine', 'concept', 'frequent', 'word', ',', 'express', ',', 'close', 'itemset', 'n-gram', 'guid', 'discoveri', 'process', '.', 'concept', 'combin', 'interact', 'visual', 'help', 'user', 'analyz', 'text', ',', 'creat', 'insight']
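
To see what "chopping" means in practice, here is a toy suffix stripper. It is deliberately NOT the Porter algorithm, which applies several ordered rule phases with extra conditions; the suffix list and the `toy_stem` name are assumptions for illustration only.

```python
# Toy suffix list, checked in order. This is NOT the Porter algorithm.
SUFFIXES = ("ing", "es", "ed", "s")

def toy_stem(word, min_stem_len=3):
    """Lowercase the word and strip the first matching suffix, if enough letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["concepts", "combined", "applying", "text"]:
    print(w, "->", toy_stem(w))
```

Even these few rules show why stems are often not real words (`combined` becomes `combin`), and why different stemmers disagree: this sketch turns `applying` into `apply`, while the PorterStemmer output above gives `appli`.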


### 🌱 Step 4: Lemmatization - Morphing words to their base form with spaCy (or NLTK's WordNetLemmatizer)! 🌱

In [18]:
def lemmatize_with_spacy(tokens):
    # spaCy re-tokenizes the joined string, so 'n-grams' comes back as 'n', '-', 'gram'
    doc = nlp(' '.join(tokens))
    return [token.lemma_ for token in doc]

lemmatized_tokens = lemmatize_with_spacy(filtered_tokens)
print("🔹 Lemmatized tokens:", lemmatized_tokens)

🔹 Lemmatized tokens: ['FeatureLens', 'generate', 'visual', 'apply', 'text', 'mining', 'concept', 'frequent', 'word', ',', 'expression', ',', 'closed', 'itemset', 'n', '-', 'gram', 'guide', 'discovery', 'process', '.', 'concept', 'combine', 'interactive', 'visualization', 'help', 'user', 'analyze', 'text', ',', 'create', 'insight']


In [None]:
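from nltk.stem import WordNetLemmatizer  # missing from the imports cell

# NLTK alternative; lemmatize() defaults to pos='n' (noun). Kept under its own name
# so the spaCy lemmas used in Step 5 are not overwritten.
lemmatizer = WordNetLemmatizer()
wordnet_lemmas = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print("🔹 Lemmatized tokens (WordNet):", wordnet_lemmas)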

### 💼 Step 5: Vectorization - Turn those words into numbers for our model! 📊

#### 🧮 Count Vectorizer - Counting word occurrences

In [19]:
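# CountVectorizer lowercases the text and keeps only tokens of 2+ word characters,
# so punctuation and the lone 'n' disappear from the vocabulary
count_vectorizer = CountVectorizer()
count_vectors = count_vectorizer.fit_transform([' '.join(lemmatized_tokens)])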
print("🔹 Count Vectorizer feature names:", count_vectorizer.get_feature_names_out())
print("🔹 Count Vectors:\n", count_vectors.toarray())

🔹 Count Vectorizer feature names: ['analyze' 'apply' 'closed' 'combine' 'concept' 'create' 'discovery'
 'expression' 'featurelens' 'frequent' 'generate' 'gram' 'guide' 'help'
 'insight' 'interactive' 'itemset' 'mining' 'process' 'text' 'user'
 'visual' 'visualization' 'word']
🔹 Count Vectors:
 [[1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1]]
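
The counting step can be sketched in plain Python. The regular expression below mirrors `CountVectorizer`'s default token rule (lowercased tokens of two or more word characters); the `bag_of_words` helper and the short sample string are assumptions for illustration.

```python
import re
from collections import Counter

def bag_of_words(document):
    """Return (sorted vocabulary, count vector) for a single document."""
    # Lowercase, then keep tokens made of 2+ word characters
    words = re.findall(r"\b\w\w+\b", document.lower())
    counts = Counter(words)
    vocabulary = sorted(counts)
    return vocabulary, [counts[w] for w in vocabulary]

bow_vocab, bow_vector = bag_of_words("concept text , n - gram concept text word")
print(bow_vocab)
print(bow_vector)
```

The vocabulary is sorted alphabetically, which is why `concept` and `text` (the two words that appear twice) show up as the `2`s at fixed positions in the count vector above.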


#### 🔥 TF-IDF Vectorizer - Considering word frequency across documents

In [20]:
# Only one document is fitted here, so every term gets the same IDF weight
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectors = tfidf_vectorizer.fit_transform([' '.join(lemmatized_tokens)])
print("🔹 TF-IDF Vectorizer feature names:", tfidf_vectorizer.get_feature_names_out())
print("🔹 TF-IDF Vectors:\n", tfidf_vectors.toarray())

🔹 TF-IDF Vectorizer feature names: ['analyze' 'apply' 'closed' 'combine' 'concept' 'create' 'discovery'
 'expression' 'featurelens' 'frequent' 'generate' 'gram' 'guide' 'help'
 'insight' 'interactive' 'itemset' 'mining' 'process' 'text' 'user'
 'visual' 'visualization' 'word']
🔹 TF-IDF Vectors:
 [[0.18257419 0.18257419 0.18257419 0.18257419 0.36514837 0.18257419
  0.18257419 0.18257419 0.18257419 0.18257419 0.18257419 0.18257419
  0.18257419 0.18257419 0.18257419 0.18257419 0.18257419 0.18257419
  0.18257419 0.36514837 0.18257419 0.18257419 0.18257419 0.18257419]]
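
The numbers above can be reproduced by hand. The sketch below assumes scikit-learn's default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and uses whitespace splitting as a simplification; the `tfidf` helper is hypothetical.

```python
import math
from collections import Counter

def tfidf(documents):
    """TF-IDF with smoothed IDF and L2-normalized rows."""
    # Whitespace splitting is a simplification of the vectorizer's tokenizer
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term
    df = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # avoid dividing by zero
        rows.append([w / norm for w in weights])
    return vocabulary, rows

# Feature names from the output above; 'concept' and 'text' occur twice
single_doc = ("analyze apply closed combine concept create discovery expression "
              "featurelens frequent generate gram guide help insight interactive "
              "itemset mining process text user visual visualization word concept text")
vocab_one, rows_one = tfidf([single_doc])
print(round(rows_one[0][vocab_one.index("analyze")], 8))  # 0.18257419
print(round(rows_one[0][vocab_one.index("concept")], 8))  # 0.36514837

# With two documents, a term shared by both ('text') is down-weighted
vocab_two, rows_two = tfidf(["text mining concept", "text visualization"])
print(dict(zip(vocab_two, (round(w, 3) for w in rows_two[1]))))
```

With a single document every IDF equals 1, so the result is just the count vector divided by its length, `sqrt(22 * 1 + 2 * 4) = sqrt(30)`. The "across documents" part of TF-IDF only becomes visible once the vectorizer is fitted on several documents, as in the second call.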


#### 🎉 And that's it, guys! Our text is now preprocessed and ready for some NLP magic! 🎉

# Part 2 : Basic NLP Tasks

#### 🎉 Now, let's dive into some basic NLP tasks! 🎉

### 📢 Task 1: Sentiment Analysis - Get those text vibes! 😃😢😡

In [29]:
# NLTK's VADER for sentiment analysis
# (needs the VADER lexicon: run nltk.download('vader_lexicon') once if it is missing)
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

sentiment_scores = sid.polarity_scores(text)
print("🔹 Sentiment Scores:", sentiment_scores)

def get_sentiment_label(scores):
    if scores['compound'] >= 0.05:
        return 'positive'
    elif scores['compound'] <= -0.05:
        return 'negative'
    else:
        return 'neutral'
    
sentiment_label = get_sentiment_label(sentiment_scores)
print(f"🔹 Sentiment Label: {sentiment_label}")

🔹 Sentiment Scores: {'neg': 0.0, 'neu': 0.879, 'pos': 0.121, 'compound': 0.5859}
🔹 Sentiment Label: positive


### 🎭 Task 2: Named Entity Recognition (NER) - Who's who and what's what? 🤔

In [22]:
# using spaCy
doc = nlp(text)

entities = [(entity.text, entity.label_) for entity in doc.ents]
print("🔹 Named Entities:", entities)

🔹 Named Entities: [('FeatureLens', 'ORG')]


### 🏷️ Task 3: Part-of-Speech Tagging - Tagging words like a pro! 🏷️

In [23]:
pos_tags = [(token.text, token.pos_) for token in doc]
print("🔹 Part-of-Speech Tags:", pos_tags)

🔹 Part-of-Speech Tags: [('\n', 'SPACE'), ('FeatureLens', 'PROPN'), ('generates', 'VERB'), ('visuals', 'NOUN'), ('by', 'ADP'), ('applying', 'VERB'), ('text', 'NOUN'), ('mining', 'NOUN'), ('concepts', 'NOUN'), ('such', 'ADJ'), ('as', 'ADP'), ('frequent', 'ADJ'), ('words', 'NOUN'), (',', 'PUNCT'), ('expressions', 'NOUN'), (',', 'PUNCT'), ('and', 'CCONJ'), ('closed', 'ADJ'), ('itemsets', 'NOUN'), ('of', 'ADP'), ('n', 'NOUN'), ('-', 'PUNCT'), ('grams', 'NOUN'), ('to', 'PART'), ('guide', 'VERB'), ('the', 'DET'), ('discovery', 'NOUN'), ('process', 'NOUN'), ('.', 'PUNCT'), ('These', 'DET'), ('concepts', 'NOUN'), ('are', 'AUX'), ('combined', 'VERB'), ('with', 'ADP'), ('interactive', 'ADJ'), ('visualization', 'NOUN'), ('to', 'PART'), ('help', 'VERB'), ('users', 'NOUN'), ('analyze', 'VERB'), ('text', 'NOUN'), (',', 'PUNCT'), ('create', 'VERB'), ('insights', 'NOUN'), ('\n', 'SPACE')]
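
Once tokens are tagged, the tags can be used for filtering or simple statistics. The sketch below works on the first few `(token, tag)` pairs copied from the output above, so it runs without spaCy; the variable names are assumptions.

```python
from collections import Counter

# First few (token, tag) pairs copied from the output above
tagged_sample = [("\n", "SPACE"), ("FeatureLens", "PROPN"), ("generates", "VERB"),
                 ("visuals", "NOUN"), ("by", "ADP"), ("applying", "VERB"),
                 ("text", "NOUN"), ("mining", "NOUN"), ("concepts", "NOUN")]

# Drop whitespace tokens, then count the tags and collect the nouns
tagged_sample = [(tok, tag) for tok, tag in tagged_sample if tag != "SPACE"]
tag_counts = Counter(tag for _, tag in tagged_sample)
nouns = [tok for tok, tag in tagged_sample if tag == "NOUN"]

print(tag_counts.most_common())
print(nouns)
```

The `SPACE` tokens at both ends of the output come from the newlines that open and close the triple-quoted example text; passing `text.strip()` to `nlp` would be one way to avoid them.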


### 📚 Task 4: Text Classification - Let's train a simple classifier! 📚

In [24]:
train_texts = [
    "I love sunny days",
    "I hate rainy days",
    "Sunshine makes me happy",
    "Rainy days make me sad"
]
train_labels = ['positive', 'negative', 'positive', 'negative']

#### 🧑‍🏫 Step 1: Vectorize the training texts

In [25]:
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

#### 🧠 Step 2: Train a simple classifier (Naive Bayes)

In [26]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(X_train, train_labels)

#### 🎓 Step 3: Predict sentiment of a new text

In [28]:
new_text = "I feel happy when it is sunny"
X_new = vectorizer.transform([new_text])
predicted_label = classifier.predict(X_new)[0]
print(f"🔹 New Text: {new_text}")
print(f"🔹 Predicted Sentiment: {predicted_label}")

🔹 New Text: I feel happy when it is sunny
🔹 Predicted Sentiment: positive
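
To see why the classifier picks `positive`, the calculation can be sketched in plain Python. This assumes the defaults used above: `CountVectorizer`'s token rule (lowercased tokens of two or more word characters) and Laplace smoothing with `alpha=1.0`. The helper names are hypothetical, and the sketch is a simplified reimplementation rather than scikit-learn's code.

```python
import math
import re
from collections import Counter, defaultdict

def nb_tokenize(text):
    # Lowercase, then keep tokens made of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def train_nb(texts, labels):
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(nb_tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocabulary

def predict_nb(text, model, alpha=1.0):
    """Return the label with the highest log score, plus all scores."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)  # log prior
        for word in nb_tokenize(text):
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + alpha)
                                  / (total_words + alpha * len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get), scores

nb_texts = ["I love sunny days", "I hate rainy days",
            "Sunshine makes me happy", "Rainy days make me sad"]
nb_labels = ["positive", "negative", "positive", "negative"]
nb_model = train_nb(nb_texts, nb_labels)
nb_label, nb_scores = predict_nb("I feel happy when it is sunny", nb_model)
print(nb_label, nb_scores)
```

Only `happy` and `sunny` from the new sentence appear in the training vocabulary, and both were seen only in positive examples, so the positive class gets the higher score. With four training sentences this is a demonstration of the mechanics, not a usable model.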


#### 🎉 And that's a wrap on our basic NLP tasks! You're now an NLP hero! 🎉