# NLP Zero to Hero

## Introduction to NLP

Natural Language Processing (NLP) is a sub-field of artificial intelligence (AI) that focuses on the interaction between computers and humans through natural language. The ultimate objective of NLP is to enable computers to understand, interpret, and generate human languages in a way that is both meaningful and useful.

### Why is NLP Important?

NLP matters because human language is ambiguous and unstructured. NLP techniques help resolve that ambiguity and turn text into numeric representations that downstream applications, such as text analytics or speech recognition, can work with. A few specific reasons:

1. **Volume of Text Data**: With the explosion of digital communication, the amount of text data generated daily is vast. NLP helps in extracting useful information from this vast amount of unstructured data.
2. **Human-Computer Interaction**: NLP provides more natural interactions between humans and computers, making technologies like virtual assistants and chatbots more effective.
3. **Automation of Routine Tasks**: NLP can automate routine tasks such as summarizing documents, filtering spam emails, and translating languages.

### Applications of NLP

NLP has a wide range of applications across various domains:

- **Text Classification**: Categorizing text into predefined categories. For example, filtering spam emails or classifying customer reviews.
- **Sentiment Analysis**: Determining the sentiment expressed in a piece of text, such as identifying positive or negative reviews.
- **Machine Translation**: Translating text from one language to another, as done by services like Google Translate.
- **Named Entity Recognition (NER)**: Identifying entities such as names, dates, and places within a text.
- **Speech Recognition**: Converting spoken language into text, as used in virtual assistants like Siri and Alexa.
- **Chatbots**: Enabling conversational agents to understand and respond to human queries in real time.
- **Text Summarization**: Creating a summary of a longer piece of text.


### Basic NLP Concepts and Terminology

Before looking into NLP tasks, it's essential to understand some basic concepts and terminology:

- **Tokenization**: The process of splitting text into smaller units called tokens, typically words, subwords, or punctuation marks.
- **Stopwords**: Common words like "the", "is", "in", which are often removed from text before processing because they add little value to the analysis.
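- **Stemming and Lemmatization**: Techniques to reduce words to their base or root form. Stemming is a crude heuristic that chops off word endings, so the result may not be a real word (for example, "applying" becomes "appli"), while lemmatization uses a vocabulary and morphological analysis to return the dictionary form (lemma) of a word.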
- **Vectorization**: Converting text into numerical format. Common methods include Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF).
- **Word Embeddings**: Dense vector representations of words that capture meaning and semantic relationships, learned from the contexts in which words appear. Examples include Word2Vec, GloVe, and FastText.
- **Sequence Models**: Models such as Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, and Gated Recurrent Units (GRUs) that process sequential data step by step.
- **Transformers**: A model architecture that has revolutionized NLP, particularly through models like BERT, GPT, and T5. Transformers use attention to process all positions of a sequence in parallel and have shown significant improvements on a wide range of NLP tasks.
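
To make the vectorization idea concrete, here is a minimal pure-Python sketch of Bag of Words and a textbook TF-IDF weighting. The two-sentence corpus is a hypothetical toy example, and the formula used (`tf * log(N / df)`) is the plain textbook variant. scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical toy corpus, for illustration only
docs = [
    "text mining finds frequent words",
    "interactive visualization helps users analyze text",
]

# Naive whitespace tokenization keeps the sketch self-contained
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# Bag of Words: one raw count per vocabulary term, per document
counts = [Counter(doc) for doc in tokenized]
bow = [[c[term] for term in vocab] for c in counts]

# Document frequency: in how many documents each term appears
n_docs = len(tokenized)
df = {term: sum(term in doc for doc in tokenized) for term in vocab}

# Textbook TF-IDF: term count times log(N / document frequency)
tfidf = [
    [c[term] * math.log(n_docs / df[term]) for term in vocab]
    for c in counts
]

for term, weight in zip(vocab, tfidf[0]):
    print(f"{term:15s} {weight:.3f}")
```

In this sketch the word "text" appears in both documents, so its IDF is `log(2 / 2) = 0` and it contributes nothing, while words unique to one document get a positive weight. That is the intuition behind TF-IDF: terms that appear everywhere are less informative than terms that distinguish a document.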

### Structure of This Notebook

This notebook will take you through a journey from basic NLP tasks to more advanced techniques. Here’s the structure:

1. **Text Preprocessing**
   - Tokenization
   - Stopword removal
   - Stemming and Lemmatization
   - Vectorization

2. **Basic NLP Tasks**
   - Sentiment Analysis
   - Named Entity Recognition (NER)
   - Part-of-Speech Tagging
   - Text Classification

3. **Advanced NLP Techniques**
   - Word Embeddings (Word2Vec, GloVe)
   - Sequence Models (RNN, LSTM, GRU)
   - Attention Mechanisms and Transformers
   - BERT and other Transformer-based models

4. **Practical Projects**
   - Building a Sentiment Analysis Model
   - Creating a Chatbot
   - Text Summarization
   - Machine Translation

5. **Conclusion and Further Reading**
   - Summary of key points
   - Resources for further learning

Let's get started on this exciting journey into the world of Natural Language Processing!

# 📝 Importing libraries

In [2]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# 🚀 Let's download the NLTK data we need (just once)

In [3]:
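# punkt: tokenizer models; stopwords: stopword lists; wordnet: lemmatizer data.
# Newer NLTK releases may look for 'punkt_tab' instead of 'punkt'; if word_tokenize
# raises a LookupError, also run nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')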

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [4]:
# 🌟 Example text to process (replace with your text)
text = """
FeatureLens generates visuals by applying text mining concepts such as frequent words, expressions, and closed itemsets of n-grams to guide the discovery process. These concepts are combined with interactive visualization to help users analyze text, create insights
"""

In [5]:
# 🪄 Step 1: Tokenization - Breaking text into words (tokens) 🌟
tokens = word_tokenize(text)
print("🔹 Tokens:", tokens)

🔹 Tokens: ['FeatureLens', 'generates', 'visuals', 'by', 'applying', 'text', 'mining', 'concepts', 'such', 'as', 'frequent', 'words', ',', 'expressions', ',', 'and', 'closed', 'itemsets', 'of', 'n-grams', 'to', 'guide', 'the', 'discovery', 'process', '.', 'These', 'concepts', 'are', 'combined', 'with', 'interactive', 'visualization', 'to', 'help', 'users', 'analyze', 'text', ',', 'create', 'insights']
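
The output above shows that tokenization separates punctuation from words while keeping hyphenated terms like "n-grams" together. As a rough illustration of that behavior, here is a small regex-based tokenizer. This is only a sketch and not the algorithm `word_tokenize` uses; the function name `simple_tokenize` and the sample sentence are made up for this example.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    A word is a run of word characters, optionally joined by internal
    hyphens or apostrophes (so "n-grams" stays in one piece). Any other
    non-space character becomes its own token.
    """
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

sample = "These concepts, such as n-grams, guide the discovery process."
tokens_demo = simple_tokenize(sample)
print(tokens_demo)
```

A pattern like this is fine for quick experiments, but it will miss many cases (abbreviations, URLs, contractions in some languages) that a trained tokenizer is designed to handle.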


In [6]:
# 🛑 Step 2: Stopword removal - Get rid of those pesky common words 🛑
stop_words = set(stopwords.words('english'))
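# Compare in lowercase because NLTK's English stopword list is lowercase.
# Punctuation tokens such as ',' and '.' are not stopwords, so they are kept.
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]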
print("🔹 Tokens after stopword removal:", filtered_tokens)


🔹 Tokens after stopword removal: ['FeatureLens', 'generates', 'visuals', 'applying', 'text', 'mining', 'concepts', 'frequent', 'words', ',', 'expressions', ',', 'closed', 'itemsets', 'n-grams', 'guide', 'discovery', 'process', '.', 'concepts', 'combined', 'interactive', 'visualization', 'help', 'users', 'analyze', 'text', ',', 'create', 'insights']
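
Notice that punctuation tokens survive the filter above, because they are not in the stopword list. If you also want to drop them, one possible approach is sketched below. The tiny `STOP_WORDS` set and the helper name `remove_stopwords_and_punct` are hypothetical and used only to keep the example self-contained; in the notebook you would pass the NLTK `stop_words` set instead.

```python
# Hypothetical, tiny stopword set for illustration; NLTK's English list is much longer
STOP_WORDS = {"by", "such", "as", "and", "of", "to", "the", "these", "are", "with"}

def remove_stopwords_and_punct(tokens, stop_words=STOP_WORDS):
    """Drop stopwords (case-insensitive) and tokens with no letters or digits."""
    kept = []
    for tok in tokens:
        if tok.lower() in stop_words:
            continue  # common word, carries little meaning on its own
        if not any(ch.isalnum() for ch in tok):
            continue  # pure punctuation such as ',' or '.'
        kept.append(tok)
    return kept

sample_tokens = ["These", "concepts", "are", "combined", "with",
                 "n-grams", ",", "and", "text", "."]
cleaned = remove_stopwords_and_punct(sample_tokens)
print(cleaned)
```

The check uses "contains at least one letter or digit" rather than `str.isalpha()`, so hyphenated tokens like "n-grams" are kept.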


In [7]:
# ✂️ Step 3: Stemming - Chopping words to their roots using PorterStemmer! ✂️
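# Note: stems need not be dictionary words (e.g. 'gener', 'appli'),
# and PorterStemmer lowercases its input by default.
stemmer = PorterStemmer()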
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print("🔹 Stemmed tokens:", stemmed_tokens)

🔹 Stemmed tokens: ['featurelen', 'gener', 'visual', 'appli', 'text', 'mine', 'concept', 'frequent', 'word', ',', 'express', ',', 'close', 'itemset', 'n-gram', 'guid', 'discoveri', 'process', '.', 'concept', 'combin', 'interact', 'visual', 'help', 'user', 'analyz', 'text', ',', 'creat', 'insight']
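
To see why stemming is described as a crude heuristic, here is a deliberately simplified suffix stripper. It is not the Porter algorithm, which applies several ordered rule phases with extra conditions; the suffix list, the `min_stem_len` threshold, and the name `toy_stem` are assumptions made for this sketch.

```python
# A few common English suffixes, checked in this order (illustrative only)
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

words = ["concepts", "combined", "applying", "process", "text"]
stems = [toy_stem(w) for w in words]
print(list(zip(words, stems)))
```

This toy version turns "process" into "proces" because it blindly removes a trailing "s", whereas the Porter stemmer output above leaves "process" intact. Rule-based stemmers trade accuracy for speed and simplicity; when you need real dictionary forms, lemmatization is usually the better fit.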
