## Sentence Reconstruction Using NLP Techniques

This Jupyter notebook implements a pure-Python pipeline, with no external libraries, to reconstruct two ambiguous sentences using basic Natural Language Processing (NLP) techniques. The goal is to clarify the meaning of the sentences by addressing ambiguities through tokenization, part-of-speech (POS) tagging, custom transformation rules, and sentence reconstruction. Each step is introduced with the NLP theory behind it.

### 1. Introduction to NLP and the Task

#### **Theory: What is NLP?**

Natural Language Processing (NLP) is a field of computer science that focuses on enabling computers to understand and process human language. It involves techniques such as tokenization, POS tagging, parsing, and grammars to analyze and manipulate text. In this task, we use these techniques to reconstruct two sentences:
- Sentence 1: "Thank your message to show our words to the doctor, as his next contract checking, to all of us."
- Sentence 2: "Overall, let us make sure all are safe and celebrate the outcome with strong coffee and future targets."

The sentences contain ambiguities, such as unclear phrasing ("Thank your message") and vague modifiers ("as his next contract checking"). Our pipeline aims to rephrase them for clarity using a rule-based approach.

**Approach**:
The pipeline consists of:
1. Tokenization: Breaking sentences into tokens (words and punctuation).
2. POS Tagging: Assigning grammatical categories to tokens.
3. Transformation Rules: Defining rules to rephrase ambiguous segments.
4. Automaton: Applying rules to transform the tagged tokens.
5. Reconstruction: Reassembling tokens into clearer sentences.


### 2. Tokenization

#### **Theory: What is Tokenization?**
Tokenization is the process of splitting text into smaller units called tokens, typically words, punctuation, or symbols. It’s a foundational step in NLP, enabling further analysis like tagging or parsing. Tokenization must handle punctuation, spaces, and special cases (e.g., contractions or acronyms) appropriately.

#### **Implementation**
Without external libraries, we implement a simple tokenizer that:
- Iterates through each character in the sentence.
- Groups consecutive alphanumeric characters into words and lowercases them.
- Emits each punctuation character as a separate token.
- Treats whitespace as a word boundary and discards it.


In [1]:
def tokenize(sentence):
    tokens = []
    word = ""
    for char in sentence:
        if char.isalnum():
            word += char
        else:
            if word:
                tokens.append(word.lower())  # Case-insensitive
                word = ""
            if char.strip():  # Non-whitespace
                tokens.append(char)
    if word:
        tokens.append(word.lower())
    return tokens

#### **How It Works**
- Input: A string (e.g., "Thank your message,").
- Process: The function scans each character, building words from alphanumeric sequences and separating punctuation.
- Output: A list of tokens (e.g., ["thank", "your", "message", ","]).
- Example: For "doctor,", it produces ["doctor", ","].

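This approach is sufficient for the given sentences, which contain no contractions or acronyms. It would not generalize as-is: every non-alphanumeric character becomes its own token, so "don't" would be split into "don", "'", and "t".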
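As a cross-check, the same token boundaries can be expressed with a single regular expression. This is a sketch rather than part of the notebook's pipeline: the name `regex_tokenize` is hypothetical, and it assumes that the standard-library `re` module is acceptable under the no-library constraint and that the input is ASCII text.

```python
import re

# Hypothetical alternative to tokenize(): one alternative per token kind.
# Assumes ASCII input; [A-Za-z0-9]+ stands in for str.isalnum().
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")

def regex_tokenize(sentence):
    # Lowercase first so word tokens are case-insensitive, as in tokenize().
    return TOKEN_PATTERN.findall(sentence.lower())

print(regex_tokenize("Thank your message,"))
# ['thank', 'your', 'message', ',']
print(regex_tokenize("don't"))
# ['don', "'", 't']
```

The second call shows the contraction case: the apostrophe is emitted as its own token, exactly as the character loop would do.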

### 3. Part-of-Speech (POS) Tagging
#### **Theory: What is POS Tagging?**
POS tagging assigns grammatical categories (e.g., noun, verb, adjective) to each token. It’s crucial for understanding sentence structure and disambiguating meanings. Common approaches include rule-based, statistical, or neural taggers, but without libraries, we use a dictionary-based method with context rules for ambiguous words.

#### **Implementation**
We define a dictionary mapping words to POS tags based on the sentences’ vocabulary. For ambiguous words like "to," we use context to decide between preposition (PREP) and infinitive marker (TO).

**POS Tags Used:**

Tag | Description
---|---
N | Noun
V | Verb
DET | Determiner
PRON | Pronoun
PREP | Preposition
ADJ | Adjective
ADV | Adverb
CONJ | Conjunction
PUNCT | Punctuation
TO | Infinitive marker

In [2]:
pos_dict = {
    "thank": "V", "your": "PRON", "message": "N", "to": "PREP", "show": "V",
    "our": "PRON", "words": "N", "the": "DET", "doctor": "N", "as": "PREP",
    "his": "PRON", "next": "ADJ", "contract": "N", "checking": "V", "all": "DET",
    "of": "PREP", "us": "PRON", "overall": "ADV", "let": "V", "make": "V",
    "sure": "ADJ", "are": "V", "safe": "ADJ", "and": "CONJ", "celebrate": "V",
    "outcome": "N", "with": "PREP", "strong": "ADJ", "coffee": "N", "future": "ADJ",
    "targets": "N", ",": "PUNCT", ".": "PUNCT"
}
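# Words whose default tag is V; used by the context rule for "to".
verbs = {word for word, tag in pos_dict.items() if tag == "V"}

def tag(tokens):
    # Look up each token; words outside the dictionary get "UNK".
    tagged = [(token, pos_dict.get(token, "UNK")) for token in tokens]
    # Context rule: "to" followed by a verb is the infinitive marker (TO);
    # otherwise it is a preposition. The loop stops one token early, so a
    # sentence-final "to" keeps its default PREP tag.
    for i in range(len(tagged) - 1):
        if tagged[i][0] == "to":
            tagged[i] = ("to", "TO" if tagged[i+1][0] in verbs else "PREP")
    return tagged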

#### **How It Works**
- Input: List of tokens (e.g., ["thank", "your", "message", "to", "show"]).
- Process:
    - Assigns tags from `pos_dict`; tokens missing from the dictionary get the fallback tag `UNK`.
    - For "to," checks if the next token is a verb to assign "TO" or "PREP".
- Output: List of (word, tag) tuples (e.g., [("thank", "V"), ("your", "PRON"), ("message", "N"), ("to", "TO"), ("show", "V")]).
- Example: For the first sentence, "to" before "show" is tagged as "TO," while "to" before "the doctor" is "PREP."
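
The context rule for "to" can be tried in isolation. The block below is a minimal sketch: `MINI_POS` is a cut-down copy of a few `pos_dict` entries, and `tag_to` is a hypothetical helper name, both introduced only so the example runs on its own.

```python
# Cut-down dictionary (a subset of pos_dict) so the example is self-contained.
MINI_POS = {
    "message": "N", "to": "PREP", "show": "V", "our": "PRON",
    "words": "N", "the": "DET", "doctor": "N",
}

def tag_to(tokens):
    # Default lookup; unknown words fall back to "UNK".
    tagged = [(tok, MINI_POS.get(tok, "UNK")) for tok in tokens]
    for i in range(len(tagged) - 1):
        # "to" + verb -> infinitive marker; anything else stays a preposition.
        if tagged[i][0] == "to" and tagged[i + 1][1] == "V":
            tagged[i] = ("to", "TO")
    return tagged

tokens = ["message", "to", "show", "our", "words", "to", "the", "doctor", "today"]
for word, pos in tag_to(tokens):
    print(f"{word:8} {pos}")
```

The first "to" comes out as TO and the second as PREP, while "today", which is not in the cut-down dictionary, is tagged UNK.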


### **4. Custom Grammar and Transformation Rules**
#### **Theory: Grammars in NLP**
A grammar defines the syntactic rules of a language, often using production rules (e.g., S → NP VP). In this task, instead of a full context-free grammar, we use transformation rules that act as a custom grammar to rephrase ambiguous segments. These rules are pattern-based, matching sequences of tagged tokens and replacing them with clearer alternatives.
#### **Implementation**
We define rules to address specific ambiguities in the sentences:
- Rule 1: Insert "you" after "thank."
- Rule 2: Insert "for" before "your message."
- Rule 3: Replace "as" with "during."
- Rule 4: Replace "checking" with "review."

In [3]:
rules = [
    # Rule 1: Insert "you" after "thank"
    (
        [("thank", "V")],
        [("thank", "V"), ("you", "PRON")]
    ),
    # Rule 2: Insert "for" before "your message"
    (
        [("you", "PRON"), ("your", "PRON"), ("message", "N")],
        [("you", "PRON"), ("for", "PREP"), ("your", "PRON"), ("message", "N")]
    ),
    # Rule 3: Replace "as" with "during"
    (
        [("as", "PREP")],
        [("during", "PREP")]
    ),
    # Rule 4: Replace "checking" with "review"
    (
        [("checking", "V")],
        [("review", "N")]
    )
]

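#### **How It Works**
- Each rule is a tuple of (pattern, replacement), where both are lists of (word, tag) pairs.
- A pattern matches only a contiguous run of tagged tokens that equals it in both word and tag.
- Rules are applied in list order, so Rule 2 can match only after Rule 1 has inserted "you".
- Replacements provide clearer phrasing with appropriate tags; Rule 4, for example, retags "checking" (V) as the noun "review" (N).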
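
Rule 2's pattern begins with the "you" that Rule 1 inserts, so the order of the list matters. The sketch below illustrates this with a hypothetical helper, `find_match`, which is not part of the notebook; it only reports where a pattern occurs.

```python
def find_match(tagged, pattern):
    # Return the index of the first window equal to pattern, or -1 if none.
    n = len(pattern)
    for i in range(len(tagged) - n + 1):
        if tagged[i:i + n] == pattern:
            return i
    return -1

rule1 = ([("thank", "V")], [("thank", "V"), ("you", "PRON")])
rule2_pattern = [("you", "PRON"), ("your", "PRON"), ("message", "N")]

tagged = [("thank", "V"), ("your", "PRON"), ("message", "N")]
print(find_match(tagged, rule2_pattern))       # -1: "you" is not there yet

# Apply Rule 1 by hand, then look again.
i = find_match(tagged, rule1[0])
after_rule1 = tagged[:i] + rule1[1] + tagged[i + len(rule1[0]):]
print(find_match(after_rule1, rule2_pattern))  # 1: the pattern now starts at index 1
```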

### **5. Rule-Based Automaton**
#### **Theory: Automata in NLP**
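An automaton is a computational model that processes input sequentially, often used in NLP for tasks like tokenization or parsing. Here, we implement a simple automaton as a loop that scans the tagged tokens, identifies patterns matching our rules, and applies the corresponding replacements. Strictly speaking, this is a sequential pattern-rewriting pass rather than a formal finite-state machine: it keeps no explicit states or transition table.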
#### **Implementation**

In [4]:
def apply_rules(tagged):
    # Try the rules in list order; later rules see the output of earlier ones.
    for pattern, replacement in rules:
        pattern_len = len(pattern)
        # Slide a window of pattern_len tokens over the sentence.
        for i in range(len(tagged) - pattern_len + 1):
            if tagged[i:i+pattern_len] == pattern:
                tagged = tagged[:i] + replacement + tagged[i+pattern_len:]
                break  # Each rule rewrites only its first match
    return tagged

#### **How It Works**
- Input: List of tagged tokens.
- Process: For each rule in turn, scans for the first subsequence matching the rule's pattern and replaces it with the rule's replacement.
- Output: Transformed list of tagged tokens.
- Example: For [("thank", "V"), ("your", "PRON"), ("message", "N"), ...], Rule 1 first inserts ("you", "PRON") after "thank"; Rule 2 then matches "you your message" and inserts ("for", "PREP"). Together they turn the first three tokens into [("thank", "V"), ("you", "PRON"), ("for", "PREP"), ("your", "PRON"), ("message", "N")].
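
`apply_rules` rewrites only the first match of each rule, which is enough for these two sentences. If a pattern could occur more than once, a variant along the following lines would rewrite every non-overlapping occurrence. This is a sketch under that assumption: `apply_rules_everywhere` and its explicit `rules` parameter are hypothetical and do not appear in the notebook.

```python
def apply_rules_everywhere(tagged, rules):
    # Apply each rule, in order, to every non-overlapping match (left to right).
    for pattern, replacement in rules:
        n = len(pattern)
        if n == 0:
            continue  # An empty pattern would match everywhere; skip it
        result = []
        i = 0
        while i < len(tagged):
            if tagged[i:i + n] == pattern:
                result.extend(replacement)
                i += n  # Jump past the matched window
            else:
                result.append(tagged[i])
                i += 1
        tagged = result
    return tagged

demo_rules = [([("as", "PREP")], [("during", "PREP")])]
demo = [("as", "PREP"), ("planned", "V"), ("as", "PREP"), ("usual", "ADJ")]
print(apply_rules_everywhere(demo, demo_rules))
```

Both occurrences of "as" are replaced here. Inserted tokens are not rescanned within the same rule, so a rule such as Rule 1, whose replacement contains its own pattern, still terminates.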

### **6. Sentence Reconstruction**
#### **Theory: Reconstructing Sentences**
After transforming the tokens, we reassemble them into a coherent sentence. This involves handling spacing and punctuation correctly to produce readable output.
#### **Implementation**

In [5]:
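def reconstruct(tagged):
    tokens = [word for word, _ in tagged]
    sentence = " ".join(tokens)
    # Remove the space that join() leaves before each punctuation mark.
    for punct in [",", ".", "!", "?", ";", ":"]:
        sentence = sentence.replace(" " + punct, punct)
    # Slicing (rather than sentence[0]) also works for an empty token list.
    return sentence[:1].upper() + sentence[1:]  # Capitalize first letter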

#### **How It Works**
- Input: List of tagged tokens.
- Process: Extracts words, joins with spaces, removes spaces before punctuation, and capitalizes the first letter.
- Output: A readable sentence.
- Example: For tagged tokens whose words are ["thank", "you", "for", "your", "message", ","], it produces "Thank you for your message,".
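
An alternative to joining first and then removing spaces is to attach each punctuation mark to the preceding word while building the sentence. The sketch below assumes the same six punctuation marks; `join_tokens` is a hypothetical name, and it takes plain words rather than (word, tag) pairs to keep the example short.

```python
PUNCTUATION = {",", ".", "!", "?", ";", ":"}

def join_tokens(words):
    parts = []
    for word in words:
        if word in PUNCTUATION and parts:
            parts[-1] += word  # Glue punctuation onto the previous word
        else:
            parts.append(word)
    sentence = " ".join(parts)
    # Slicing avoids an IndexError when the input is empty.
    return sentence[:1].upper() + sentence[1:]

print(join_tokens(["thank", "you", "for", "your", "message", ","]))
# Thank you for your message,
print(repr(join_tokens([])))
# ''
```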

In [6]:
def process_sentence(sentence):
    tokens = tokenize(sentence)
    tagged = tag(tokens)
    transformed = apply_rules(tagged)
    reconstructed = reconstruct(transformed)
    return reconstructed

# Test the pipeline
sentence1 = "Thank your message to show our words to the doctor, as his next contract checking, to all of us."
sentence2 = "Overall, let us make sure all are safe and celebrate the outcome with strong coffee and future targets."

print("Original Sentence 1:", sentence1)
print("Reconstructed Sentence 1:", process_sentence(sentence1))
print("\nOriginal Sentence 2:", sentence2)
print("Reconstructed Sentence 2:", process_sentence(sentence2))

Original Sentence 1: Thank your message to show our words to the doctor, as his next contract checking, to all of us.
Reconstructed Sentence 1: Thank you for your message to show our words to the doctor, during his next contract review, to all of us.

Original Sentence 2: Overall, let us make sure all are safe and celebrate the outcome with strong coffee and future targets.
Reconstructed Sentence 2: Overall, let us make sure all are safe and celebrate the outcome with strong coffee and future targets.


### **Expected Output**
Original Sentence | Reconstructed Sentence
---|---
Thank your message to show our words to the doctor, as his next contract checking, to all of us. | Thank you for your message to show our words to the doctor, during his next contract review, to all of us.
Overall, let us make sure all are safe and celebrate the outcome with strong coffee and future targets. | Overall, let us make sure all are safe and celebrate the outcome with strong coffee and future targets.

### **Analysis**:
- Sentence 1: The reconstruction clarifies "Thank your message" to "Thank you for your message" and "as his next contract checking" to "during his next contract review," making the purpose and timing clearer. The phrase "to all of us" remains unchanged, as it’s sufficiently clear.
- Sentence 2: No transformations are applied, as the sentence is relatively clear. The coordination in "with strong coffee and future targets" could be clarified (e.g., "while setting future targets"), but we retain the original for simplicity.

### **8. Addressing Ambiguities**
#### **Sentence 1 Ambiguities**
- "Thank your message": Likely a typo or non-standard phrasing, possibly meant as "Thank you for your message." The rule corrects this to a standard expression of gratitude.
- "as his next contract checking": Unclear modifier, possibly indicating timing or purpose. Rephrasing to "during his next contract review" suggests a temporal context and uses "review" as a clearer noun.
- "to all of us": Could be ambiguous in attachment, but interpreted as benefiting the group, so left unchanged.

#### **Sentence 2 Ambiguities**
- "with strong coffee and future targets": Could imply celebrating with both coffee and targets or setting targets separately. The original is retained, as it’s reasonably clear, but could be rephrased for explicitness if needed.
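
If the more explicit reading were wanted, the "while setting future targets" rephrasing could be encoded as one more rule. The block below is only a sketch: the rule itself, the tags chosen for "while" and "setting", and the helper `apply_once` are assumptions made for illustration, and the notebook's output does not include this rule.

```python
# Hypothetical rule: rewrite "and future targets" as "while setting future targets".
# The CONJ tag for "while" and the V tag for "setting" are assumed, not from pos_dict.
extra_rule = (
    [("and", "CONJ"), ("future", "ADJ"), ("targets", "N")],
    [("while", "CONJ"), ("setting", "V"), ("future", "ADJ"), ("targets", "N")],
)

def apply_once(tagged, rule):
    # Replace the first window equal to the rule's pattern, if any.
    pattern, replacement = rule
    n = len(pattern)
    for i in range(len(tagged) - n + 1):
        if tagged[i:i + n] == pattern:
            return tagged[:i] + replacement + tagged[i + n:]
    return tagged

tail = [("with", "PREP"), ("strong", "ADJ"), ("coffee", "N"),
        ("and", "CONJ"), ("future", "ADJ"), ("targets", "N"), (".", "PUNCT")]
print(" ".join(word for word, _ in apply_once(tail, extra_rule)))
# with strong coffee while setting future targets .
```

Because the pattern requires "future targets" after "and", the earlier "and" in "safe and celebrate" would not be affected.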

### **9. Limitations**
- Scalability: The dictionary-based POS tagging and specific rules are tailored to these sentences, limiting generalization.
- Complexity: Without libraries, we omit advanced techniques like parsing trees or statistical disambiguation.
- Ambiguity Resolution: Some ambiguities (e.g., attachment of "to all of us") rely on interpretation, which may not be definitive.
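
The scalability limit is easy to see by measuring how many tokens of an unseen sentence the dictionary covers. The sketch below uses a hypothetical helper, `unknown_tokens`, and a small stand-in dictionary (`SMALL_POS`) rather than the full `pos_dict`; the sample sentence is made up for the illustration.

```python
# Stand-in for pos_dict with only a handful of entries (assumption for the demo).
SMALL_POS = {"the": "DET", "doctor": "N", "message": "N", "to": "PREP", ".": "PUNCT"}

def unknown_tokens(tokens, pos_dict):
    # Tokens the dictionary tagger would label "UNK".
    return [tok for tok in tokens if tok not in pos_dict]

tokens = ["the", "nurse", "forwarded", "the", "message", "to", "the", "doctor", "."]
missing = unknown_tokens(tokens, SMALL_POS)
coverage = 1 - len(missing) / len(tokens)
print(missing)
# ['nurse', 'forwarded']
print(f"coverage: {coverage:.0%}")
```

Any token reported here would be tagged `UNK`, and no transformation rule could match it, since the rules compare both word and tag.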


### **10. Conclusion**
This pipeline demonstrates how basic NLP techniques can clarify ambiguous sentences without external libraries. By tokenizing, tagging, applying custom rules, and reconstructing, it rewrites the unclear segments of Sentence 1 and leaves Sentence 2 unchanged, while adhering to the task’s constraints. The markdown explanations tie each step to the NLP concept behind it.