![image.png](https://i.imgur.com/a3uAqnb.png)

# Natural Language Processing (NLP)
# First: Text Pre-Processing



## 🔍 What is Natural Language Processing?

**Natural Language Processing (NLP)** is a branch of Artificial Intelligence (AI) that focuses on enabling computers to understand, interpret, and generate human language in a meaningful and useful way.

It combines:
- **Computational Linguistics**
- **Machine Learning**
- **Deep Learning**

to process and analyze large amounts of natural language data.

### 🎯 Key Objectives of NLP:
- **Understanding**: Extracting meaning from language.
- **Generation**: Producing human-like text.
- **Translation**: Converting text between languages.
- **Interaction**: Facilitating natural human-computer communication.

---

## 📝 Core NLP Terminology

### 📚 Fundamental Concepts

| Term            | Description                                                                                  | Example                              |
|-----------------|----------------------------------------------------------------------------------------------|--------------------------------------|
| **Corpus**      | A large collection of texts used for training NLP models.                                     | Brown Corpus, Reuters Corpus         |
| **Token**       | The smallest unit of text (word, character, or subword) obtained after tokenization.          | "Hello world!" → `["Hello", "world", "!"]` |
| **Tokenization**| The process of breaking text into tokens.                                                     | Input: "It's great!" → `["It", "'s", "great", "!"]` |
| **Vocabulary**  | The set of unique tokens in a corpus (also called lexicon).                                   | {"Hello", "world", "!"}              |

### 💡 Notes:
- Tokenization can be complex due to punctuation, contractions, and language-specific rules.
- The size of the vocabulary can impact both model performance and resource requirements.
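
The terms above can be illustrated with a small sketch that uses only the Python standard library. The helper `simple_tokenize` is a hypothetical name introduced here for illustration; it is a naive regular-expression tokenizer, not the algorithm used by NLTK or spaCy.

```python
import re
from collections import Counter

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("Hello world! Hello NLP.")
vocabulary = Counter(tokens)  # maps each unique token to its frequency

print(tokens)
print(sorted(vocabulary))
print(f"{len(tokens)} tokens, {len(vocabulary)} unique")
```

Note that this naive pattern splits `"It's"` into `It`, `'`, and `s`, whereas the table above shows `'s` kept together. Handling contractions like this is one reason dedicated tokenizers exist.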


## 🚀 Getting Started with NLTK and spaCy

Now that we understand the theoretical foundations of NLP, let’s dive into practical implementation using two of the most popular NLP libraries: **NLTK** and **spaCy**.

---

### 🐍 NLTK (Natural Language Toolkit)

**NLTK** is a comprehensive Python library designed for working with human language data. It provides simple interfaces to over 50 corpora and lexical resources and is widely used in education, research, and rapid prototyping.

#### ⭐ Key Features of NLTK:
- 📚 Extensive collection of text processing libraries
- 🗂️ Built-in corpora and datasets for experimentation
- 🎓 Educational focus with detailed tutorials and documentation
- 🔍 Wide range of algorithms for:
  - Tokenization
  - Stemming
  - Tagging
  - Parsing
  - Classification
  - Semantic reasoning

---



## 1. **NLTK Installation**

In [None]:
!pip install nltk



In [None]:
sample_text = """
Natural Language Processing is a fascinating field that combines linguistics,
computer science, and artificial intelligence. It enables computers to understand,
interpret, and generate human language in a meaningful way. NLP has revolutionized
how we interact with technology, from search engines to chatbots.
"""


In [None]:
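# English gloss of the Arabic sample below:
# "King Abdullah University of Science and Technology is an advanced research
# university located in the Kingdom of Saudi Arabia. The university was founded
# in 2009 and aims to be a beacon of science and knowledge in the region and the
# world. It houses the latest laboratories and research facilities, and attracts
# students and researchers from all over the world. KAUST is distinguished by its
# focus on scientific research and innovation in fields such as engineering,
# natural sciences, computing, and artificial intelligence. The university seeks
# to solve global challenges through research and development."
sample_text_arabic = """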
جامعة الملك عبدالله للعلوم والتقنية هي جامعة بحثية متطورة تقع في المملكة العربية السعودية.
تأسست الجامعة عام 2009 وتهدف إلى أن تكون منارة للعلم والمعرفة في المنطقة والعالم.
تضم الجامعة أحدث المختبرات والمرافق البحثية، وتجذب الطلاب والباحثين من جميع أنحاء العالم.
تتميز كاوست بتركيزها على البحث العلمي والابتكار في مجالات متعددة مثل الهندسة والعلوم الطبيعية
والحاسوب والذكاء الاصطناعي. تسعى الجامعة لحل التحديات العالمية من خلال البحث والتطوير.
"""


## ✂️ 2. **Tokenization Function**

**Tokenization** is the process of splitting text into smaller meaningful units called **tokens**. These tokens are often words, subwords, or even characters, depending on the task.

Tokenization is usually the **first step in text preprocessing** because most NLP models require structured input instead of raw text.

---

### 💡 Why is Tokenization Important?

- ✅ Converts unstructured text into manageable pieces
- ✅ Enables word-level feature extraction and analysis
- ✅ Impacts vocabulary size and downstream model performance

---


In [None]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tag import pos_tag
from nltk.sentiment import SentimentIntensityAnalyzer

# Tokenizer models required by word_tokenize and sent_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
words = word_tokenize(sample_text)
sentences = sent_tokenize(sample_text)
print(words)
print(sentences)
print(f"Number of words: {len(words)}")
print(f"Number of sentences: {len(sentences)}")

['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'that', 'combines', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '.', 'It', 'enables', 'computers', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', 'in', 'a', 'meaningful', 'way', '.', 'NLP', 'has', 'revolutionized', 'how', 'we', 'interact', 'with', 'technology', ',', 'from', 'search', 'engines', 'to', 'chatbots', '.']
['\nNatural Language Processing is a fascinating field that combines linguistics,\ncomputer science, and artificial intelligence.', 'It enables computers to understand,\ninterpret, and generate human language in a meaningful way.', 'NLP has revolutionized\nhow we interact with technology, from search engines to chatbots.']
Number of words: 50
Number of sentences: 3


In [None]:
Arabic_words = word_tokenize(sample_text_arabic)
Arabic_sentences = sent_tokenize(sample_text_arabic)
print(Arabic_words)
print(Arabic_sentences)
print(f"Number of words: {len(Arabic_words)}")
print(f"Number of sentences: {len(Arabic_sentences)}")

['جامعة', 'الملك', 'عبدالله', 'للعلوم', 'والتقنية', 'هي', 'جامعة', 'بحثية', 'متطورة', 'تقع', 'في', 'المملكة', 'العربية', 'السعودية', '.', 'تأسست', 'الجامعة', 'عام', '2009', 'وتهدف', 'إلى', 'أن', 'تكون', 'منارة', 'للعلم', 'والمعرفة', 'في', 'المنطقة', 'والعالم', '.', 'تضم', 'الجامعة', 'أحدث', 'المختبرات', 'والمرافق', 'البحثية،', 'وتجذب', 'الطلاب', 'والباحثين', 'من', 'جميع', 'أنحاء', 'العالم', '.', 'تتميز', 'كاوست', 'بتركيزها', 'على', 'البحث', 'العلمي', 'والابتكار', 'في', 'مجالات', 'متعددة', 'مثل', 'الهندسة', 'والعلوم', 'الطبيعية', 'والحاسوب', 'والذكاء', 'الاصطناعي', '.', 'تسعى', 'الجامعة', 'لحل', 'التحديات', 'العالمية', 'من', 'خلال', 'البحث', 'والتطوير', '.']
['\nجامعة الملك عبدالله للعلوم والتقنية هي جامعة بحثية متطورة تقع في المملكة العربية السعودية.', 'تأسست الجامعة عام 2009 وتهدف إلى أن تكون منارة للعلم والمعرفة في المنطقة والعالم.', 'تضم الجامعة أحدث المختبرات والمرافق البحثية، وتجذب الطلاب والباحثين من جميع أنحاء العالم.', 'تتميز كاوست بتركيزها على البحث العلمي والابتكار في مجالات 
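
In the token list above, `'البحثية،'` still carries the Arabic comma (`،`), which suggests the tokenizer as called here does not split on it. One possible workaround, sketched below under that assumption, is to post-process the token list. `split_arabic_punct` and `ARABIC_PUNCT` are hypothetical names introduced for illustration, not part of NLTK.

```python
ARABIC_PUNCT = "،؛؟"  # Arabic comma, semicolon, and question mark

def split_arabic_punct(tokens):
    """Detach leading and trailing Arabic punctuation from each token."""
    result = []
    for token in tokens:
        without_lead = token.lstrip(ARABIC_PUNCT)
        lead = token[: len(token) - len(without_lead)]
        core = without_lead.rstrip(ARABIC_PUNCT)
        trail = without_lead[len(core):]
        result.extend(lead)   # each leading mark becomes its own token
        if core:
            result.append(core)
        result.extend(trail)  # each trailing mark becomes its own token
    return result

sample_tokens = ["والمرافق", "البحثية،", "وتجذب"]
cleaned_tokens = split_arabic_punct(sample_tokens)
print(cleaned_tokens)
```

Applied to `Arabic_words`, this would keep the word itself available to later steps instead of letting a check such as `isalpha()` discard it together with the comma.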

## 🚫 3. Stop Words Removal Function

**Stop words** are commonly used words in a language that carry little meaningful information for NLP tasks. Examples include:  
**"the"**, **"is"**, **"in"**, **"and"**, **"on"**, etc.

Removing stop words is a common preprocessing step that helps:
- 🧹 Reduce noise in the data
- 📉 Decrease vocabulary size
- 🚀 Improve model efficiency and focus on important words

---




In [None]:
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
filtered_words = [word.lower() for word in words if word.lower() not in stop_words and word.isalpha()]
print(f"Original words: {len(words)}")
print(f"After removing stop words: {len(filtered_words)}")
print(f"Filtered words: {filtered_words}")

Original words: 50
After removing stop words: 27
Filtered words: ['natural', 'language', 'processing', 'fascinating', 'field', 'combines', 'linguistics', 'computer', 'science', 'artificial', 'intelligence', 'enables', 'computers', 'understand', 'interpret', 'generate', 'human', 'language', 'meaningful', 'way', 'nlp', 'revolutionized', 'interact', 'technology', 'search', 'engines', 'chatbots']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
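# Note: this reuses the name stop_words, which now holds the Arabic list
stop_words = set(stopwords.words('arabic'))
# isalpha() also discards tokens with attached punctuation, such as 'البحثية،';
# lower() has no effect on Arabic script and is kept only for symmetry with the English cell
Arabic_filtered_words = [word.lower() for word in Arabic_words if word.lower() not in stop_words and word.isalpha()]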

print(f"Original words: {len(Arabic_words)}")
print(f"After removing stop words: {len(Arabic_filtered_words)}")
print(f"Filtered words: {Arabic_filtered_words}")

Original words: 72
After removing stop words: 54
Filtered words: ['جامعة', 'الملك', 'عبدالله', 'للعلوم', 'والتقنية', 'جامعة', 'بحثية', 'متطورة', 'تقع', 'المملكة', 'العربية', 'السعودية', 'تأسست', 'الجامعة', 'عام', 'وتهدف', 'تكون', 'منارة', 'للعلم', 'والمعرفة', 'المنطقة', 'والعالم', 'تضم', 'الجامعة', 'أحدث', 'المختبرات', 'والمرافق', 'وتجذب', 'الطلاب', 'والباحثين', 'أنحاء', 'العالم', 'تتميز', 'كاوست', 'بتركيزها', 'البحث', 'العلمي', 'والابتكار', 'مجالات', 'متعددة', 'الهندسة', 'والعلوم', 'الطبيعية', 'والحاسوب', 'والذكاء', 'الاصطناعي', 'تسعى', 'الجامعة', 'لحل', 'التحديات', 'العالمية', 'خلال', 'البحث', 'والتطوير']
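
The filtering step above can be sketched without NLTK to show one common caveat: stop-word lists often include negations such as "not", and removing them can change the meaning of a sentence. The stop-word set below is a tiny hand-written assumption, and `remove_stop_words` is a hypothetical helper, not an NLTK function.

```python
# Illustrative stop-word list (an assumption; real lists are much longer)
TOY_STOP_WORDS = {"the", "is", "a", "this", "not", "and", "in"}
NEGATIONS = {"not", "no", "never"}

def remove_stop_words(tokens, stop_words, keep=frozenset()):
    """Lowercase tokens, then drop non-alphabetic tokens and stop words not listed in `keep`."""
    kept = []
    for token in tokens:
        word = token.lower()
        if not word.isalpha():
            continue  # punctuation and numbers
        if word in stop_words and word not in keep:
            continue  # ordinary stop word
        kept.append(word)
    return kept

review_tokens = ["This", "movie", "is", "not", "good", "."]
plain = remove_stop_words(review_tokens, TOY_STOP_WORDS)
with_negation = remove_stop_words(review_tokens, TOY_STOP_WORDS, keep=NEGATIONS)
print(plain)
print(with_negation)
```

For tasks such as sentiment analysis, keeping negations (or skipping stop-word removal entirely) may be the safer choice.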


## 🌱 4. **Stemming Function**

**Stemming** reduces words to a **base or root form** by stripping affixes with heuristic rules (the Porter stemmer used below removes suffixes only). The resulting stem is not always a valid word, but it groups related word forms together.

Example:  
`"running"`, `"runs"` → `"run"`; `"language"` → `"languag"`

---

### 💡 Why is Stemming Useful?

- ✅ Reduces different forms of a word to a common base
- ✅ Helps in grouping words with similar meaning
- ✅ Decreases vocabulary size, which can improve generalization

---



In [None]:
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)

['natur', 'languag', 'process', 'is', 'a', 'fascin', 'field', 'that', 'combin', 'linguist', ',', 'comput', 'scienc', ',', 'and', 'artifici', 'intellig', '.', 'it', 'enabl', 'comput', 'to', 'understand', ',', 'interpret', ',', 'and', 'gener', 'human', 'languag', 'in', 'a', 'meaning', 'way', '.', 'nlp', 'ha', 'revolution', 'how', 'we', 'interact', 'with', 'technolog', ',', 'from', 'search', 'engin', 'to', 'chatbot', '.']
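
Stems such as `'languag'` and `'combin'` in the output above come from rule-based suffix stripping. The sketch below is a deliberately crude toy, not the Porter algorithm, and `toy_stem` is a hypothetical helper. It only shows how a few suffix rules can merge related forms while producing non-words.

```python
from collections import defaultdict

SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if at least `min_stem_len` characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

# Group word forms that share the same toy stem
groups = defaultdict(list)
for w in ["computers", "computer", "combines", "combined", "languages", "is"]:
    groups[toy_stem(w)].append(w)

for stem, forms in groups.items():
    print(f"{stem:10} <- {forms}")
```

Six surface forms collapse into four stems here, which is the vocabulary reduction that stemming aims for.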


In [None]:
from nltk.stem.isri import ISRIStemmer

# Arabic stemmer
stemmer = ISRIStemmer()
Arabic_stemmed_words = [stemmer.stem(word) for word in Arabic_filtered_words]
print(f"Original words: {Arabic_filtered_words[:10]}")
print(f"Stemmed words: {Arabic_stemmed_words[:10]}")

Original words: ['جامعة', 'الملك', 'عبدالله', 'للعلوم', 'والتقنية', 'جامعة', 'بحثية', 'متطورة', 'تقع', 'المملكة']
Stemmed words: ['جمع', 'ملك', 'عبدالل', 'علم', 'قني', 'جمع', 'بحث', 'تطر', 'تقع', 'ملك']


## 🍃 5. **Lemmatization Function**

**Lemmatization** is the process of reducing words to their **dictionary base form** (called the *lemma*) while ensuring the result is a **valid word**. Unlike stemming, lemmatization relies on a **vocabulary and the word's part of speech** to produce meaningful base forms.

Example:  
`"running"` → `"run"` (as a verb)  
`"better"` → `"good"` (as an adjective)

---

### 💡 Why Use Lemmatization?

- ✅ Produces linguistically correct root words
- ✅ Considers **part of speech (POS)** for accurate results
- ✅ More precise than stemming, especially in formal NLP tasks

---

In [None]:
import nltk
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
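# Without a pos argument every word is treated as a noun, so verbs such as
# "enables" stay unchanged and "has" is wrongly reduced to "ha"
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]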
print(lemmatized_words)

[nltk_data] Downloading package wordnet to /root/nltk_data...


['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'that', 'combine', 'linguistics', ',', 'computer', 'science', ',', 'and', 'artificial', 'intelligence', '.', 'It', 'enables', 'computer', 'to', 'understand', ',', 'interpret', ',', 'and', 'generate', 'human', 'language', 'in', 'a', 'meaningful', 'way', '.', 'NLP', 'ha', 'revolutionized', 'how', 'we', 'interact', 'with', 'technology', ',', 'from', 'search', 'engine', 'to', 'chatbots', '.']
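
The output above contains `'ha'` for "has" and leaves `'enables'` untouched, because the lemmatizer was called without part-of-speech information. A common remedy is to map Penn Treebank tags to the single-letter codes WordNet uses. The mapping below is a minimal sketch and `penn_to_wordnet` is a hypothetical helper name. The block itself needs no NLTK data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech code."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default

tagged_sample = [("has", "VBZ"), ("enables", "VBZ"), ("computers", "NNS"), ("meaningful", "JJ")]
for word, tag in tagged_sample:
    print(f"{word:12} {tag:4} -> pos='{penn_to_wordnet(tag)}'")

# With NLTK and the WordNet data available, each pair could then be passed as:
#   lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
```

Feeding the tags produced by `pos_tag` through such a mapping lets verbs be lemmatized as verbs rather than as nouns.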


# 6. **Part-of-Speech Tagging**

**Part-of-Speech (POS) Tagging** is the process of labeling each word in a sentence with its **grammatical role** such as noun, verb, adjective, etc.

POS tagging is essential in NLP for understanding sentence structure, syntactic parsing, named entity recognition, and more.

---

## 📝 Part-of-Speech (POS) Tags Reference

| Tag   | Meaning                             | Example Words            |
|-------|-------------------------------------|--------------------------|
| **CC**   | Coordinating conjunction            | and, but, or             |
| **CD**   | Cardinal number                    | one, two, 1, 2           |
| **DT**   | Determiner                         | the, a, an               |
| **EX**   | Existential 'there'                | there (as in *there is*) |
| **FW**   | Foreign word                       | bonjour, sayonara        |
| **IN**   | Preposition or subordinating conj. | in, of, like, after      |
| **JJ**   | Adjective                          | big, blue, smart         |
| **JJR**  | Adjective, comparative             | bigger, smarter          |
| **JJS**  | Adjective, superlative             | biggest, smartest        |
| **LS**   | List item marker                   | 1., A, B                 |
| **MD**   | Modal verb                         | can, could, will, would  |
| **NN**   | Noun, singular                     | cat, house, idea         |
| **NNS**  | Noun, plural                       | cats, houses, ideas      |
| **NNP**  | Proper noun, singular              | John, London, Apple      |
| **NNPS** | Proper noun, plural                | Americans, Smiths        |
| **PDT**  | Predeterminer                      | all, both, half          |
| **POS**  | Possessive ending                  | 's (as in *John's*)      |
| **PRP**  | Personal pronoun                   | I, he, she, it           |
| **PRP$** | Possessive pronoun                 | my, his, her, its        |
| **RB**   | Adverb                             | quickly, very, well      |
| **RBR**  | Adverb, comparative                | faster, better           |
| **RBS**  | Adverb, superlative                | fastest, best            |
| **RP**   | Particle                           | up, off (as in *give up*) |
| **TO**   | *to*                               | to                       |
| **UH**   | Interjection                       | oh, wow                  |
| **VB**   | Verb, base form                    | run, eat                 |
| **VBD**  | Verb, past tense                   | ran, ate                 |
| **VBG**  | Verb, gerund or present participle | running, eating          |
| **VBN**  | Verb, past participle              | run, eaten               |
| **VBP**  | Verb, non-3rd person singular present | run, eat              |
| **VBZ**  | Verb, 3rd person singular present  | runs, eats               |
| **WDT**  | Wh-determiner                      | which, that              |
| **WP**   | Wh-pronoun                         | who, what                |
| **WP$**  | Possessive wh-pronoun              | whose                    |
| **WRB**  | Wh-adverb                          | how, where, when         |

---

## 📌 Quick Reference for Common POS Categories

| Category        | POS Tags                          | Examples              |
|-----------------|-------------------------------------|-----------------------|
| **Nouns**       | NN, NNS, NNP, NNPS                 | cat, London, ideas    |
| **Verbs**       | VB, VBD, VBG, VBN, VBP, VBZ        | run, ran, running     |
| **Adjectives**  | JJ, JJR, JJS                       | blue, smarter, biggest |
| **Adverbs**     | RB, RBR, RBS                       | quickly, more, most   |
| **Pronouns**    | PRP, PRP$, WP, WP$                 | he, she, whose        |
| **Determiners** | DT, WDT                            | the, which            |
| **Prepositions**| IN                                  | in, on, after         |
| **Conjunctions**| CC                                  | and, but, or          |

---



In [None]:
import nltk
nltk.download('averaged_perceptron_tagger_eng')
pos_tags = pos_tag(words)
print(pos_tags)

[('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('fascinating', 'JJ'), ('field', 'NN'), ('that', 'WDT'), ('combines', 'VBZ'), ('linguistics', 'NNS'), (',', ','), ('computer', 'NN'), ('science', 'NN'), (',', ','), ('and', 'CC'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('.', '.'), ('It', 'PRP'), ('enables', 'VBZ'), ('computers', 'NNS'), ('to', 'TO'), ('understand', 'VB'), (',', ','), ('interpret', 'VB'), (',', ','), ('and', 'CC'), ('generate', 'VB'), ('human', 'JJ'), ('language', 'NN'), ('in', 'IN'), ('a', 'DT'), ('meaningful', 'JJ'), ('way', 'NN'), ('.', '.'), ('NLP', 'NNP'), ('has', 'VBZ'), ('revolutionized', 'VBN'), ('how', 'WRB'), ('we', 'PRP'), ('interact', 'VBP'), ('with', 'IN'), ('technology', 'NN'), (',', ','), ('from', 'IN'), ('search', 'NN'), ('engines', 'NNS'), ('to', 'TO'), ('chatbots', 'NNS'), ('.', '.')]


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
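
Once text is tagged, the fine-grained tags are often collapsed into the coarse categories from the quick-reference table. The sketch below does this for a few `(word, tag)` pairs copied from the output above. `coarse_category` is a hypothetical helper.

```python
from collections import Counter

# Tag prefixes taken from the quick-reference table
CATEGORY_PREFIXES = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}

def coarse_category(tag):
    """Collapse a Penn Treebank tag into a coarse category by prefix."""
    for prefix, name in CATEGORY_PREFIXES.items():
        if tag.startswith(prefix):
            return name
    return "other"

pos_sample = [
    ("NLP", "NNP"), ("has", "VBZ"), ("revolutionized", "VBN"),
    ("how", "WRB"), ("we", "PRP"), ("interact", "VBP"),
]
category_counts = Counter(coarse_category(tag) for _, tag in pos_sample)
print(category_counts)
```

The same function can be applied to the full `pos_tags` list to get a quick profile of a text.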




# 😊 7. Sentiment Analysis

**Sentiment Analysis** is the process of identifying and categorizing emotions or opinions expressed in a piece of text, typically as **positive**, **negative**, or **neutral**.

In **NLTK**, we can perform sentiment analysis using the built-in **VADER** sentiment analyzer, which is particularly effective for short texts like social media posts.

## ❓ Why is it Called **VADER**?

**VADER** stands for:

> **V**alence **A**ware **D**ictionary and s**E**ntiment **R**easoner

---

### 💡 What Does That Mean?

- **Valence**: The **emotional value** of a word or phrase (positive, negative, or neutral).
- **Aware**: It considers not just the words, but also how they are used in context.
- **Dictionary**: Uses a **predefined lexicon** of words with associated sentiment scores.
- **Sentiment Reasoner**: Applies **rules** to adjust sentiment based on:
  - Punctuation (e.g., “!!!” adds intensity)
  - Capitalization (e.g., “LOVE” is stronger than “love”)
  - Degree modifiers (e.g., “very good” vs. “good”)
  - Emoticons (e.g., `:)`, `:(`); emoji coverage depends on the VADER implementation and lexicon version

---


In [None]:
import nltk
nltk.download('vader_lexicon')

def nltk_sentiment_analysis(text):
    """Analyze sentiment of text"""
    print("=== SENTIMENT ANALYSIS ===")
    sia = SentimentIntensityAnalyzer()
    sentiment_scores = sia.polarity_scores(text)

    print(f"Sentiment scores: {sentiment_scores}")

    # Interpret the scores
    if sentiment_scores['compound'] >= 0.05:
        print("Overall sentiment: Positive")
    elif sentiment_scores['compound'] <= -0.05:
        print("Overall sentiment: Negative")
    else:
        print("Overall sentiment: Neutral")

    return sentiment_scores

# Run the function
sentiment_scores = nltk_sentiment_analysis(sample_text)

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


=== SENTIMENT ANALYSIS ===
Sentiment scores: {'neg': 0.0, 'neu': 0.759, 'pos': 0.241, 'compound': 0.886}
Overall sentiment: Positive
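
The `compound` value above is a normalized score between -1 and 1. A minimal sketch of that idea follows, assuming the normalization used by the reference VADER implementation: the summed valence divided by the square root of its square plus a constant of 15. The raw sums are made-up inputs rather than real lexicon scores, and `normalize` and `label_from_compound` are illustrative names.

```python
import math

def normalize(raw_sum, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

def label_from_compound(compound, threshold=0.05):
    """Apply the same +/-0.05 cut-offs used in the cell above."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

for raw in (-4.0, 0.0, 1.0, 7.4):  # hypothetical valence sums
    compound = normalize(raw)
    print(f"{raw:+5.1f} -> {compound:+.3f} -> {label_from_compound(compound)}")
```

Because of the square root, large sums saturate near 1 or -1 instead of growing without bound, so one fixed threshold can be used for texts of different lengths.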


## 🚀 8. **spaCy**

**spaCy** is an **industrial-strength Natural Language Processing (NLP) library** built for **production-level applications**. It is known for being **fast, efficient, and accurate**, with easy integration into real-world systems.

---

### ⭐ Key Features of spaCy:

- ⚡ **High Performance:** Optimized for speed and large-scale processing.
- 🌐 **Pre-trained Models:** Trained pipelines for many languages (English, German, Spanish, etc.); some other languages, such as Arabic, currently have tokenization support only.
- 🧠 **Deep Learning Ready:** Seamless integration with deep learning frameworks (e.g., PyTorch, TensorFlow).
- 🏗️ **Production Focused:** Designed for real-world use, not just research or prototyping.
- 🔗 **Advanced NLP Tasks:**
  - **Tokenization**
  - **Part-of-Speech Tagging (POS)**
  - **Dependency Parsing**
  - **Named Entity Recognition (NER)**
  - **Text Classification**

---





# 9. **spaCy Installation**

In [None]:
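!pip install spacy
# Download the small English pipeline loaded below (skip if it is already installed)
!python -m spacy download en_core_web_sm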



In [None]:
import spacy
from spacy import displacy
# Load the English model
nlp = spacy.load("en_core_web_sm")
# Process the text
doc = nlp(sample_text)


# 🏷️ 10. Named Entity Recognition (NER)

## 📌 What is Named Entity Recognition (NER)?

**Named Entity Recognition (NER)** is a key task in Natural Language Processing (NLP) where the goal is to **automatically detect and classify named entities** in text into predefined categories such as:
- **Persons**
- **Organizations**
- **Locations**
- **Dates**
- **Monetary values**
- And more...

NER is useful in a variety of applications including:
- 📊 **Information extraction**
- 🔎 **Search engines**
- 📰 **News analysis**
- 📈 **Business intelligence**
- 💬 **Chatbots and virtual assistants**

---

## 📝 How Does NER Work in spaCy?

- spaCy's pre-trained models can **identify named entities** directly from raw text.
- Entities are returned with:
  - **The text span** (e.g., `"Apple"`)
  - **The entity type** (e.g., `ORG` for Organization)
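
Under the hood, entity spans are typically encoded per token with a begin/inside/outside (BIO) scheme. spaCy exposes this through `token.ent_iob_` and `token.ent_type_`. The sketch below merges such per-token tags back into spans. The tokens and tags are hand-written for illustration, not model output, and `bio_to_spans` is a hypothetical helper.

```python
def bio_to_spans(tokens, iob_tags, labels):
    """Merge per-token B/I/O tags into (entity text, label) pairs."""
    spans = []
    current_words, current_label = [], None
    for word, iob, label in zip(tokens, iob_tags, labels):
        starts_entity = iob == "B" or (iob == "I" and label != current_label)
        if starts_entity or iob == "O":
            if current_words:  # close the entity that was open
                spans.append((" ".join(current_words), current_label))
            current_words, current_label = ([word], label) if starts_entity else ([], None)
        else:  # "I" continuing the current entity
            current_words.append(word)
    if current_words:
        spans.append((" ".join(current_words), current_label))
    return spans

tokens = ["Apple", "opened", "an", "office", "in", "New", "York", "."]
iob_tags = ["B", "O", "O", "O", "O", "B", "I", "O"]
labels = ["ORG", "", "", "", "", "GPE", "GPE", ""]
spans = bio_to_spans(tokens, iob_tags, labels)
print(spans)
```

In practice `doc.ents` already provides these merged spans. The sketch only shows what that convenience is built on.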

---



In [None]:
import spacy

# Load the spaCy English model once
nlp = spacy.load("en_core_web_sm")

def detailed_ner_analysis(text):
    """
    Perform Named Entity Recognition (NER) on the given text using spaCy.
    Displays entities with their labels and explanations, along with token details.
    """
    doc = nlp(text)

    print("\n=== NAMED ENTITY RECOGNITION (NER) ===")
    print(f"📄 Text length: {len(text)} characters")
    print(f"🔢 Number of tokens: {len(doc)}")
    print(f"🔍 Number of entities: {len(doc.ents)}\n")

    if doc.ents:
        print("🗂 Entities Found:")
        for ent in doc.ents:
            label_desc = spacy.explain(ent.label_) or "No description available"
            print(f"- {ent.text}  ➔  {ent.label_} ({label_desc})")
    else:
        print("No entities detected in the text.")

    print("\n🔠 First 10 Tokens (with POS and Entity Type):")
    for token in doc[:10]:
        ent_type = token.ent_type_ if token.ent_type_ else "None"
        print(f"{token.text:<15} | {token.pos_:<10} | {ent_type}")

    return doc.ents



# Run the analysis
entities = detailed_ner_analysis(sample_text)



=== NAMED ENTITY RECOGNITION (NER) ===
📄 Text length: 311 characters
🔢 Number of tokens: 55
🔍 Number of entities: 1

🗂 Entities Found:
- NLP  ➔  ORG (Companies, agencies, institutions, etc.)

🔠 First 10 Tokens (with POS and Entity Type):

               | SPACE      | None
Natural         | PROPN      | None
Language        | PROPN      | None
Processing      | NOUN       | None
is              | AUX        | None
a               | DET        | None
fascinating     | ADJ        | None
field           | NOUN       | None
that            | PRON       | None
combines        | VERB       | None


# 🔗 11. Dependency Parsing

## 📌 What is Dependency Parsing?

**Dependency Parsing** is the process of analyzing the **grammatical structure** of a sentence by identifying relationships between words. Each word (token) is connected to another word called its **head**, forming a tree-like structure.

✅ Every word has:
- A **head word** it depends on (except the root word)
- A **dependency relation** that explains the grammatical link

---

## 🗂️ Common Dependency Relations

| Dependency | Full Name                    | Description                                  | Example (Relation)                   |
|-----------|------------------------------|----------------------------------------------|--------------------------------------|
| **ROOT**   | Root                          | Main verb or predicate of the sentence        | "runs" in "John runs fast"           |
| **nsubj**  | Nominal subject               | Subject of the verb                          | "John" → "runs"                      |
| **nsubjpass** | Passive nominal subject     | Subject of passive verb                      | "cake" in "The cake was eaten"       |
| **dobj**   | Direct object                 | Direct object of the verb                    | "ball" in "John threw the ball"      |
| **iobj**   | Indirect object               | Indirect object of the verb                  | "him" in "I gave him the book"       |
| **amod**   | Adjectival modifier          | Adjective modifying a noun                   | "red" → "car"                        |
| **advmod** | Adverbial modifier           | Adverb modifying verb/adj/adv                | "quickly" → "runs"                   |
| **det**    | Determiner                   | Article or determiner                        | "the" → "book"                       |
| **prep**   | Prepositional modifier       | Preposition linking to object                | "in" in "in the house"               |
| **pobj**   | Object of preposition        | Object of the preposition                    | "house" in "in the house"            |
| **aux**    | Auxiliary                    | Helping verb                                 | "is" in "is running"                 |
| **auxpass** | Passive auxiliary            | Passive helping verb                         | "was" in "was eaten"                 |
| **cop**    | Copula                       | Linking verb (be, seem, etc.)                | "is" in "John is tall"               |
| **cc**     | Coordinating conjunction     | And, or, but                                 | "and" in "cats and dogs"             |
| **conj**   | Conjunct                     | Coordinated element                          | "dogs" in "cats and dogs"            |
| **compound** | Compound                    | Modifier in compound words                   | "ice" → "cream"                      |
| **poss**   | Possessive modifier          | Possessive noun or pronoun                   | "John" → "book" in "John's book"     |
| **appos**  | Appositional modifier        | Noun phrase renaming another noun            | "CEO" in "John, the CEO"             |
| **acl**    | Adjectival clause            | Clause modifying a noun                      | "sitting" in "the man sitting there" |
| **advcl**  | Adverbial clause             | Clause modifying verb or adjective           | "when he arrived"                   |
| **ccomp**  | Clausal complement           | Clause functioning as object                 | "that he left" in "I think that..."  |
| **xcomp**  | Open clausal complement      | Non-finite clause complement                 | "to run" in "I want to run"          |
| **mark**   | Marker                       | Subordinating conjunction                    | "that" in "I know that..."           |
| **punct**  | Punctuation                  | Punctuation marks                            | ".", "!", "?"                        |
| **neg**    | Negation modifier            | Negation word                                | "not" in "do not run"                |
| **prt**    | Particle                     | Verb particle                                | "up" in "give up"                    |
| **quantmod** | Quantifier modifier         | Modifier of a quantifier                      | "about" in "about 200 people"        |
| **npadvmod** | Noun phrase adverbial modifier | Noun used adverbially                        | "today" in "I'll go today"           |
| **tmod**   | Temporal modifier            | Time expression                              | "yesterday"                          |
| **nummod** | Numeric modifier             | Numeric expression                           | "five" in "five books"               |
| **number** | Number compound              | Part of complex number                       | "twenty" in "twenty-five"            |
| **parataxis** | Parataxis                  | Parallel clauses                             | Quoted or separate sentences         |
| **dep**    | Unspecified dependency       | Fallback relation when parser is unsure      | Miscellaneous links                  |
| **attr**   | Attribute                    | Noun phrase complement of a copular verb     | "field" in "NLP is a field"          |
| **relcl**  | Relative clause modifier     | Relative clause modifying a noun             | "combines" in "a field that combines…" |

The exact label set depends on the parser and model. spaCy's English pipelines, for example, attach the complement of *be* with `attr` or `acomp` and make the verb the head, rather than using `cop`.

---

## 📝 Example Dependency Analysis

**Sentence**:  
*“The quick brown fox jumps over the lazy dog.”*

| Token  | Dependency | Head   | Explanation                          |
|--------|------------|--------|--------------------------------------|
| The    | det        | fox    | Determiner modifies "fox"            |
| quick  | amod       | fox    | Adjective modifies "fox"             |
| brown  | amod       | fox    | Adjective modifies "fox"             |
| fox    | nsubj      | jumps  | Subject of the verb                  |
| jumps  | ROOT       | jumps  | Main verb                            |
| over   | prep       | jumps  | Preposition attached to "jumps"      |
| the    | det        | dog    | Determiner modifies "dog"            |
| lazy   | amod       | dog    | Adjective modifies "dog"             |
| dog    | pobj       | over   | Object of the preposition "over"     |
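
The table above can be treated as a small tree. The sketch below stores each row as a `(token, dependency, head)` tuple and walks it to find the root, its subject, and the modifiers of "fox". The helper names are hypothetical. Matching heads by text works here only because every token is distinct. With spaCy one would compare token indices (`token.i`, `token.head.i`) instead.

```python
# (token, dependency label, head token) rows copied from the table above
parse = [
    ("The", "det", "fox"), ("quick", "amod", "fox"), ("brown", "amod", "fox"),
    ("fox", "nsubj", "jumps"), ("jumps", "ROOT", "jumps"), ("over", "prep", "jumps"),
    ("the", "det", "dog"), ("lazy", "amod", "dog"), ("dog", "pobj", "over"),
]

def find_root(parse):
    """Return the token whose dependency label is ROOT."""
    return next(token for token, dep, _ in parse if dep == "ROOT")

def children_of(parse, head):
    """Return the tokens directly attached to `head` (the root's self-link is skipped)."""
    return [token for token, dep, h in parse if h == head and dep != "ROOT"]

root = find_root(parse)
subjects = [token for token, dep, h in parse if dep == "nsubj" and h == root]
print("Root:", root)
print("Subject(s):", subjects)
print("Modifiers of 'fox':", children_of(parse, "fox"))
```

This is the kind of traversal behind "who did what" extraction: start at the root verb, then follow `nsubj` and object links.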

---

## 🔑 Key Concepts in Dependency Parsing

### 🔗 Head vs. Dependent:
- **Head**: The main word that governs the relation (e.g., verb or noun).
- **Dependent**: The word that is attached to and modifies the head.

### 🔄 Common Grammar Patterns:
| Pattern           | Example                             |
|-------------------|-------------------------------------|
| **Subject-Verb**  | `nsubj → ROOT`                      |
| **Verb-Object**   | `ROOT → dobj`                       |
| **Adj-Noun**      | `amod → noun`                       |
| **Det-Noun**      | `det → noun`                        |
| **Prep-Object**   | `prep → pobj`                       |

### 🌳 ROOT Token:
- Every sentence has **one ROOT**.
- It is usually the **main verb**.
- All other tokens connect back to this ROOT.

---

## ⚙️ Applications of Dependency Parsing:
- 🔍 **Information Extraction:** Find key actors (subjects) and actions (verbs).
- ❓ **Question Answering:** Identify important relationships for precise answers.
- ✂️ **Text Summarization:** Understand structure for better summarization.
- 🌐 **Machine Translation:** Preserve grammatical correctness across languages.

---



In [None]:
import spacy

# Load spaCy model and process text
nlp = spacy.load("en_core_web_sm")

def spacy_dependency_parsing(text):
    """Show dependency relationships using spaCy"""
    # Process the text through spaCy pipeline
    doc = nlp(text)

    print("=== DEPENDENCY PARSING ===")
    print("Token -> Dependency -> Head:")

    count = 0
    for token in doc:
        if token.is_alpha:  # Only show alphabetic tokens
            print(f"{token.text:15} | {token.dep_:10} | {token.head.text}")
            count += 1
            if count >= 10:  # Limit to first 10 tokens
                break

    return doc

# Run the function
doc = spacy_dependency_parsing(sample_text)

=== DEPENDENCY PARSING ===
Token -> Dependency -> Head:
Natural         | compound   | Language
Language        | compound   | Processing
Processing      | nsubj      | is
is              | ROOT       | is
a               | det        | field
fascinating     | amod       | field
field           | attr       | is
that            | nsubj      | combines
combines        | relcl      | field
linguistics     | dobj       | combines


# 🔍 12. Text Similarity

## 📌 What is Text Similarity?

**Text Similarity** measures how **similar or related two pieces of text are**. This is a fundamental task in NLP used in various applications such as:
- 📑 **Duplicate detection** (e.g., finding similar questions in forums)
- 🔎 **Information retrieval** (e.g., ranking search results)
- 💬 **Chatbots and conversational agents**
- 📄 **Plagiarism detection**

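spaCy provides a simple way to compute **semantic similarity** between words, sentences, or entire documents using **word vectors**. Small pipelines such as `en_core_web_sm` ship without static word vectors: spaCy still returns a score but warns that it may not be meaningful, so a pipeline with vectors (e.g., `en_core_web_md`) is recommended for this task.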
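
By default, spaCy's similarity is the cosine of the angle between two vectors, where a document vector is the average of its word vectors. The sketch below reproduces that computation in plain Python. The three-dimensional vectors are made up for illustration (real word vectors have hundreds of dimensions), and the helper names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def average_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical word vectors, invented for this example
toy_vectors = {
    "apple": [0.9, 0.1, 0.3],
    "microsoft": [0.8, 0.2, 0.4],
    "banana": [0.1, 0.9, 0.2],
    "company": [0.7, 0.1, 0.6],
}

doc_a = average_vector([toy_vectors["apple"], toy_vectors["company"]])
doc_b = average_vector([toy_vectors["microsoft"], toy_vectors["company"]])
doc_c = average_vector([toy_vectors["banana"]])

print(f"a vs b: {cosine_similarity(doc_a, doc_b):.3f}")
print(f"a vs c: {cosine_similarity(doc_a, doc_c):.3f}")
```

Because averaging ignores word order, two documents with the same words in a different order get the same vector, which is one limitation of this approach.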

---



In [None]:
def spacy_text_similarity(nlp):
    """Calculate similarity between texts"""
    try:
        # Two sample texts
        doc1 = nlp("Apple is a technology company")
        doc2 = nlp("Microsoft is a software company")

        # Calculate similarity
        similarity = doc1.similarity(doc2)

        # Simple, clear output
        print("=== TEXT SIMILARITY ===")
        print(f"Text 1: Apple is a technology company")
        print(f"Text 2: Microsoft is a software company")
        print(f"Similarity: {similarity:.3f}")

        return similarity

    except Exception as e:
        # Report the actual error; similarity is most meaningful with a vectors model
        print(f"Error computing similarity: {e}")
        print("For better results, install: python -m spacy download en_core_web_md")
        return None

# Run the function
similarity = spacy_text_similarity(nlp)

=== TEXT SIMILARITY ===
Text 1: Apple is a technology company
Text 2: Microsoft is a software company
Similarity: 0.917




### Contributed by: Lama Ayash


## References and Further Reading

1. **Natural Language Toolkit (NLTK)**: https://www.nltk.org/
   - Python library with extensive NLP tools and datasets

2. **spaCy**: https://spacy.io/
   - Industrial-strength NLP library

3. **Hugging Face**: https://huggingface.co/
   - Repository of pre-trained models and datasets

4. **Stanford NLP Group**: https://nlp.stanford.edu/
   - Research papers and resources
