## 🔍 Natural Language Processing (NLP): Foundational Theory

### 🎯 Definition:
Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI) and Linguistics that focuses on enabling machines to understand, interpret, and generate human language.

### 📚 Core Components:
| Component               | Description                                                                                      |
| ----------------------- | ------------------------------------------------------------------------------------------------ |
| **Text Preprocessing**  | Tokenization, stemming, lemmatization, stop word removal, lowercasing, etc.                      |
| **Text Representation** | Bag of Words (BoW), TF-IDF, Word Embeddings (Word2Vec, GloVe), Contextual embeddings (BERT, GPT) |
| **Syntax & Parsing**    | Part-of-speech tagging, dependency parsing, constituency parsing                                 |
| **Semantics**           | Named Entity Recognition (NER), word sense disambiguation                                        |
| **Pragmatics**          | Understanding context, irony, and implied meaning                                                |

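The first two rows of the table above (preprocessing and Bag of Words representation) can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the `STOP_WORDS` set, the regex tokenizer, and the helper names `preprocess` and `bag_of_words` are assumptions made for this example, and real projects would usually rely on a library such as NLTK or spaCy.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "on", "and", "of", "to", "in"}


def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of raw documents."""
    tokenized = [preprocess(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # One count per vocabulary word; word order in the text is ignored.
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


docs = ["The cat sat on the mat.", "The dog chased the cat!"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['cat', 'chased', 'dog', 'mat', 'sat']
print(vectors)  # [[1, 0, 0, 1, 1], [1, 1, 1, 0, 0]]
```

Each document becomes a fixed-length vector over the shared vocabulary, which is the input format most classical classifiers expect.
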
### 📦 Common NLP Tasks:
| Task                               | Description                                                    |
| ---------------------------------- | -------------------------------------------------------------- |
| **Text Classification**            | Spam detection, sentiment analysis, topic classification       |
| **Named Entity Recognition (NER)** | Extracting people, places, organizations from text             |
| **Part-of-Speech Tagging**         | Identifying grammatical roles (noun, verb, adjective, etc.)    |
| **Machine Translation**            | Translating text from one language to another                  |
| **Text Summarization**             | Extractive and abstractive methods to summarize content        |
| **Question Answering**             | Answering queries from unstructured or structured content      |
| **Text Generation**                | Producing human-like text (e.g., chatbots, content generation) |

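As a concrete instance of the text classification task listed above, the following is a minimal sketch of a multinomial Naive Bayes spam filter with add-one (Laplace) smoothing. The class name `NaiveBayes`, the `tokenize` helper, and the four training sentences are invented for illustration; this technique is one common baseline, not the only way to classify text.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Start from the log prior, then add log likelihoods per token.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokenize(text):
                if tok not in self.vocab:
                    continue  # skip words never seen during training
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_texts = [
    "win a free prize now",
    "free money claim your prize",
    "meeting agenda for monday",
    "please review the project report",
]
train_labels = ["spam", "spam", "ham", "ham"]

clf = NaiveBayes().fit(train_texts, train_labels)
print(clf.predict("claim your free prize"))   # spam
print(clf.predict("monday project meeting"))  # ham
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.
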

### 💼 Industry Use Cases
| Domain               | Use Case                                                                |
| -------------------- | ----------------------------------------------------------------------- |
| **Finance**          | Sentiment analysis of market news, fraud detection from email/chat logs |
| **Healthcare**       | Extracting key terms from clinical notes, medical report summarization  |
| **E-Commerce**       | Product recommendation based on reviews, intelligent search systems     |
| **Legal**            | Contract analysis, case summarization, legal document classification    |
| **Customer Service** | Chatbots, ticket routing, intent classification                         |
| **Marketing**        | Brand monitoring, personalized messaging, ad targeting                  |

### 🛠️ Tools & Libraries
| Library/Tool          | Description                                                                 |
| --------------------- | --------------------------------------------------------------------------- |
| **NLTK**              | Natural Language Toolkit, a suite of libraries for NLP tasks in Python      |
| **spaCy**             | Industrial-strength NLP library for Python, optimized for performance      |
| **Transformers**      | Hugging Face library for state-of-the-art NLP models (BERT, GPT, etc.)     |
| **Gensim**            | Topic modeling and document similarity, supports Word2Vec and Doc2Vec     |
| **TextBlob**          | Simplified text processing, built on NLTK and Pattern                      |
| **OpenNLP**           | Apache's machine learning-based toolkit for processing natural language    |
| **Stanford CoreNLP**  | Java-based library for various NLP tasks, including parsing and NER         |

### 📖 Further Reading
- [Speech and Language Processing (3rd Edition)](https://web.stanford.edu/~jurafsky/slp3/)
- [Natural Language Processing with Python](https://www.nltk.org/book/)
- [Introduction to Natural Language Processing](https://www.coursera.org/learn/natural-language-processing)




### 📘 Common NLP Terminologies
| Term                                                     | Definition                                                                                                           |
| -------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| **Corpus**                                               | A large and structured set of texts. It serves as the dataset for NLP tasks.                                         |
| **Token**                                                | A single unit of text, typically a word, subword, or punctuation mark.                                               |
| **Tokenization**                                         | The process of splitting text into individual tokens (words, subwords, or characters).                               |
| **Stop Words**                                           | Common words (e.g., “is”, “the”, “a”) that are often removed from text as they carry little semantic meaning.        |
| **Stemming**                                             | Heuristically stripping affixes to reduce a word to a stem, which may not be a valid word (e.g., “playing” → “play”, “studies” → “studi”). |
| **Lemmatization**                                        | Reducing a word to its dictionary form (lemma) using vocabulary and part-of-speech information; more linguistically accurate than stemming (e.g., “better” → “good”). |
| **POS Tagging (Part-of-Speech Tagging)**                 | Assigning each word a grammatical category such as noun, verb, adjective, etc.                                       |
| **Named Entity Recognition (NER)**                       | Identifying and classifying entities in text (like person names, organizations, locations, dates).                   |
| **Bag of Words (BoW)**                                   | A text representation that describes a document by the counts of the words it contains, ignoring grammar and order.  |
| **TF-IDF (Term Frequency – Inverse Document Frequency)** | A weighting scheme that scores a word higher the more often it occurs in a document and the fewer documents in the collection contain it. |
| **Word Embedding**                                       | A dense vector representation of words capturing semantic relationships (e.g., Word2Vec, GloVe).                     |
| **n-gram**                                               | A contiguous sequence of 'n' items (usually words or characters) from a given text (e.g., bigram, trigram).          |
| **Cosine Similarity**                                    | The cosine of the angle between two vectors, used to measure how similar two text representations are regardless of their length. |
| **Text Classification**                                  | Assigning predefined categories to text (e.g., spam or not spam).                                                    |
| **Sentiment Analysis**                                   | Identifying and categorizing opinions expressed in text as positive, negative, or neutral.                           |
| **Topic Modeling**                                       | Unsupervised technique to discover hidden thematic structure in text (e.g., LDA - Latent Dirichlet Allocation).      |
| **Parsing**                                              | Analyzing the syntactic structure of a sentence (e.g., constituency or dependency parsing).                          |
| **Language Modeling**                                    | Assigning probabilities to word sequences, typically by predicting the next word (e.g., GPT) or masked words (e.g., BERT). |
| **Perplexity**                                           | A measurement of how well a probabilistic language model predicts a sample; lower is better.                         |
| **Sequence Labeling**                                    | Assigning labels to each token in a sequence (e.g., POS tagging, NER).                                               |
| **Transformer**                                          | Deep learning model architecture based on attention mechanisms (used in BERT, GPT, etc.).                            |
| **Attention Mechanism**                                  | A method that allows the model to focus on relevant parts of the input sequence while making predictions.            |
| **Contextual Embedding**                                 | Word embeddings that depend on the surrounding context, so the same word gets different vectors in different sentences (e.g., BERT). |
| **Fine-tuning**                                          | Adapting a pre-trained NLP model to a specific downstream task using task-specific data.                             |
| **Zero-shot / Few-shot Learning**                        | Performing NLP tasks with zero or minimal task-specific training examples.                                           |
| **Prompt Engineering**                                   | Designing effective input prompts to guide the behavior of language models (especially in generative AI).            |

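The TF-IDF and cosine similarity entries above can be made concrete with a short worked example. This sketch assumes one particular variant, `tf = count / document length` and `idf = log(N / df)`; libraries such as scikit-learn use smoothed formulas and will give different numbers. The function names and the three toy documents are invented for illustration, and every document is assumed to be non-empty.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_docs):
    """TF-IDF with tf = count / doc length and idf = log(N / df)."""
    n_docs = len(tokenized_docs)
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vectors.append(
            [(counts[t] / len(doc)) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "the cat sat on the mat".split(),
    "the cat chased the mouse".split(),
    "stocks fell in the market".split(),
]
vocab, vecs = tf_idf_vectors(docs)

sim_cats = cosine_similarity(vecs[0], vecs[1])   # share "cat"
sim_other = cosine_similarity(vecs[0], vecs[2])  # share only "the"
print(sim_cats > sim_other)  # True
print(sim_other)             # 0.0
```

Because "the" appears in every document, its IDF is `log(3 / 3) = 0`, so it contributes nothing to any vector. That is why the first and third documents end up with a similarity of zero even though they share a word.
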