# Rule-Based NLP Foundations

## Introduction: Purpose
Rule-based NLP is a natural **first stage in learning NLP**.  
Its purpose is to **process and extract information from text using predefined rules**, without relying on statistical models or deep learning.

### Why use Rule-Based NLP?
1. **Understand text structure**: Identify sentences, words, and patterns.  
2. **Quick solutions for simple tasks**: Keyword extraction, sentiment detection, and pattern matching.  
3. **Foundation for advanced NLP**: Useful grounding before applying machine learning or deep learning methods.  
4. **Build intuition**: Understand how text is structured and how meaning can be extracted.

### Learning Objectives
By the end of this notebook, you will be able to:
- Extract keywords from text using lists of important words
- Match patterns in text using regular expressions (regex)
- Apply simple rules for sentiment or structural analysis
- Prepare the groundwork for statistical NLP methods


In [1]:
# Step 0: Import required libraries
import re


### Keyword Extraction
- Extract important words from a text using a **list of keywords**.
- Simple but effective for **domain-specific applications**.

In [2]:
# Sample text
text = "Python and NLP are awesome. NLP helps machines understand language."

# Keyword list
keywords = ["Python", "NLP", "language"]

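# Extract keywords: tokenize on word characters so trailing punctuation
# (e.g. "language.") does not block a match
found_keywords = [word for word in re.findall(r"\w+", text) if word in keywords]
print("Keywords found:", found_keywords)


Keywords found: ['Python', 'NLP', 'NLP', 'language']


**Explanation:**  
- We split the text into word tokens with `re.findall(r"\w+", text)` and keep the tokens that appear in the keyword list.
- A plain `text.split()` would leave punctuation attached, so `"language."` would not match the keyword `"language"`.
- Matching is case-sensitive: `"python"` would not match `"Python"`.
- Try adding more keywords to see how the results change.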
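
Keyword lookups like this are sensitive to letter case and to punctuation attached to words. The sketch below shows one way to make the lookup more forgiving. It assumes a hypothetical helper, `extract_keywords`, which is not part of the original notebook: it lowercases both tokens and keywords and tokenizes with `re.findall`.

```python
import re

def extract_keywords(text, keywords):
    """Return tokens of `text` that match `keywords`, ignoring case and punctuation.

    Hypothetical helper for illustration only.
    """
    keyword_set = {k.lower() for k in keywords}   # set lookup, case-folded
    tokens = re.findall(r"\w+", text)              # word tokens without punctuation
    return [tok for tok in tokens if tok.lower() in keyword_set]

sample = "Python and NLP are awesome. nlp helps machines understand Language."
found = extract_keywords(sample, ["python", "NLP", "language"])
print("Keywords found:", found)
```

The tokens are returned in their original casing, so the output still shows how each keyword was written in the text.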

### Pattern Matching with Regex
- Regular expressions (regex) let us find **patterns in text**.
- They can be used for:
  - Detecting words starting with uppercase
  - Finding emails, dates, or special patterns

In [3]:
# Regex example: words starting with capital letters
pattern = r'\b[A-Z][a-z]+\b'
matches = re.findall(pattern, text)
print("Regex matches:", matches)

Regex matches: ['Python']


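**Explanation:**  
- `\b` → word boundary  
- `[A-Z]` → starts with uppercase  
- `[a-z]+` → followed by one or more lowercase letters  
- `NLP` is not matched: it is all uppercase, and `[a-z]+` requires at least one lowercase letter after the first capital.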
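
The same approach extends to the other patterns mentioned above, such as emails and dates. The sketch below uses deliberately simplified patterns: `EMAIL_PATTERN` and `ISO_DATE_PATTERN` are illustrative names, the email pattern is not a full address validator, and the date pattern only covers the `YYYY-MM-DD` layout. The sample sentence is made up for the example.

```python
import re

# Simplified patterns for illustration (assumptions, not complete validators)
EMAIL_PATTERN = r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"
ISO_DATE_PATTERN = r"\b\d{4}-\d{2}-\d{2}\b"   # YYYY-MM-DD only

sample = (
    "Email alice@example.com or bob.smith@mail.example.org before 2024-03-15. "
    "The workshop ran on 2023-11-02."
)

emails = re.findall(EMAIL_PATTERN, sample)
dates = re.findall(ISO_DATE_PATTERN, sample)

print("Emails:", emails)
print("Dates:", dates)
```

The date pattern checks only the digit layout, so a string such as `2024-99-99` would still match; checking that a date is real needs a separate step.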


### Sentiment/Rule-Based Pattern Matching
- Define **simple positive/negative word rules**.
- Can detect sentiment or custom patterns in text.

In [4]:
# Sample text
text2 = "I love NLP but sometimes Python is confusing."

# Sentiment keywords
positive_words = ["love", "awesome", "great"]
negative_words = ["confusing", "boring", "difficult"]

# Match sentiment words (tokenize with regex so "confusing." still matches "confusing")
tokens2 = re.findall(r"\w+", text2)
pos_matches = [w for w in tokens2 if w in positive_words]
neg_matches = [w for w in tokens2 if w in negative_words]

print("Positive words found:", pos_matches)
print("Negative words found:", neg_matches)


Positive words found: ['love']
Negative words found: ['confusing']
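
The matched words can be turned into a single label by comparing how many positive and negative words a text contains. The sketch below assumes a hypothetical helper, `rule_based_sentiment`, and a simple scoring rule (positive count minus negative count) that the original notebook does not define. The word lists are the ones used above.

```python
import re

POSITIVE_WORDS = {"love", "awesome", "great"}
NEGATIVE_WORDS = {"confusing", "boring", "difficult"}

def rule_based_sentiment(text):
    """Return (label, score), where score = positive count - negative count.

    Hypothetical helper for illustration only.
    """
    tokens = [tok.lower() for tok in re.findall(r"\w+", text)]
    score = sum(t in POSITIVE_WORDS for t in tokens) - sum(t in NEGATIVE_WORDS for t in tokens)
    if score > 0:
        return "positive", score
    if score < 0:
        return "negative", score
    return "neutral", score

for sentence in [
    "I love NLP but sometimes Python is confusing.",
    "Python is awesome and great.",
    "Regex is boring and difficult.",
]:
    print(sentence, "->", rule_based_sentiment(sentence))
```

A rule like this ignores negation ("not great") and word order, which is one reason rule-based sentiment works best on simple, predictable text.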


In [5]:
# Pattern matching example: find 'Python'
if re.search(r'\bPython\b', text2):
    print("Found 'Python' in the text!")

Found 'Python' in the text!


**Explanation:**  
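- `re.search(pattern, text)` scans the text and returns a match object for the **first match** of the pattern, or `None` if nothing matches, which is why it can be used directly in an `if` test.  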
- Rule-based NLP is **easy to understand** and a good foundation before machine learning approaches.
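
To see what `re.search` hands back, the short sketch below inspects the match object for a pattern that is present and for one that is absent. The second pattern (`Java`) is an arbitrary example chosen because it does not occur in the text.

```python
import re

text2 = "I love NLP but sometimes Python is confusing."

match = re.search(r"\bPython\b", text2)
if match:
    # group() is the matched text; span() is its (start, end) position
    print("Matched:", match.group(), "at", match.span())

missing = re.search(r"\bJava\b", text2)
print("Missing pattern gives:", missing)
```

Because a failed search returns `None`, calling `.group()` on the result without checking it first raises an `AttributeError`.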

### Applying Rules to a Dataset

In [6]:
# Sample dataset
texts = [
    "I love Python and NLP.",
    "Learning regex is fun but sometimes confusing.",
    "Python is awesome for text analysis."
]

# Define rules
keywords = ["Python", "NLP", "regex"]
positive_words = ["love", "awesome", "fun"]

# Apply keyword and sentiment rules
for idx, t in enumerate(texts, 1):
    tokens = re.findall(r"\w+", t)  # word tokens without punctuation, so "NLP." matches "NLP"
    found_keywords = [w for w in tokens if w in keywords]
    pos_matches = [w for w in tokens if w in positive_words]
    print(f"Text {idx}: {t}")
    print(" - Keywords:", found_keywords)
    print(" - Positive words:", pos_matches)
    print("-----")


Text 1: I love Python and NLP.
 - Keywords: ['Python', 'NLP']
 - Positive words: ['love']
-----
Text 2: Learning regex is fun but sometimes confusing.
 - Keywords: ['regex']
 - Positive words: ['fun']
-----
Text 3: Python is awesome for text analysis.
 - Keywords: ['Python']
 - Positive words: ['awesome']
-----
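
When the dataset grows, it is often handier to collect results in a structure than to print them. The sketch below wraps the rules in a hypothetical function, `apply_rules`, and summarizes keyword frequency across the dataset with `collections.Counter`. The function name, the lowercasing step, and the result layout are assumptions made for illustration.

```python
import re
from collections import Counter

KEYWORDS = {"python", "nlp", "regex"}
POSITIVE_WORDS = {"love", "awesome", "fun"}

def apply_rules(text):
    """Apply the keyword and positive-word rules to one text (hypothetical helper)."""
    tokens = [tok.lower() for tok in re.findall(r"\w+", text)]
    return {
        "text": text,
        "keywords": [t for t in tokens if t in KEYWORDS],
        "positive": [t for t in tokens if t in POSITIVE_WORDS],
    }

texts = [
    "I love Python and NLP.",
    "Learning regex is fun but sometimes confusing.",
    "Python is awesome for text analysis.",
]

results = [apply_rules(t) for t in texts]

# Count how often each keyword appears across the whole dataset
keyword_counts = Counter(k for row in results for k in row["keywords"])

for row in results:
    print(row)
print("Keyword counts:", dict(keyword_counts))
```

Keeping each result as a dictionary makes it straightforward to load the output into a table later or compare it against labels.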
