# 01 — Load & Slice the First Letter
## Import Text Dataset

---


## 🎯 Concept Primer

### What We're Doing

Loading text data is the foundation of any NLP project. For this project, we're working with **Frankenstein** by Mary Shelley (Project Gutenberg).

To keep training fast and focused (this is a learning exercise, not production), we'll **slice just the first letter** from the novel. That means Letter 1 of the novel's opening correspondence, not the first character, and it comes to 6,850 characters. This is enough to:
- See patterns emerge
- Train in seconds, not hours
- Iterate quickly

### Why This Matters

**Clean data access** and **reproducible slicing** mean:
- Your experiments are repeatable
- You can easily expand to more text later
- You understand exactly what data the model sees

### What Breaks If We Skip This?

- Training on the full novel = much longer training time (the full text is many times the size of this slice)
- Inconsistent slices = non-reproducible results
- No inspection = you won't understand the text patterns

### Shapes
- **Input**: A single `.txt` file
- **Output**: A Python string of length 6,850 characters

---


## ✅ Objectives

By the end of this notebook, you should:

- [ ] Load `../datasets/frankenstein.txt` into a Python string variable named `frankenstein`
- [ ] Slice characters `[1380:8230]` into `first_letter_text`
- [ ] Print the length of `first_letter_text`
- [ ] Display a preview (first 500 characters)

---


## 🎓 Acceptance Criteria

**You pass this notebook when:**

✅ `len(first_letter_text)` prints and equals **6850**  
✅ You can see a preview of the text (first 500 chars)  
✅ You notice patterns: punctuation style, sentence structure, archaic language

---


## 📝 TODO 1: Load the Full Text

**Hint:**  
Use a `with open(...)` context manager to read the file. This ensures the file is properly closed after reading.

**Steps:**
1. Use `open('../datasets/frankenstein.txt', 'r', encoding='utf-8')`
2. Read the entire file into a variable called `frankenstein`
3. The result should be a single string

**What you need:**
- `with open(...) as f:`
- `f.read()`
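
The context-manager pattern above can be sketched on a throwaway file. This is a minimal illustration rather than the solution to the TODO: the temporary file, `sample`, and `loaded` are hypothetical names invented for the example, and the sample text is a stand-in, not the novel.

```python
import tempfile
from pathlib import Path

# Stand-in text so the sketch runs anywhere (assumption: not the real dataset).
sample = "Line one of a stand-in file.\n\nLine two after a blank line.\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text(sample, encoding="utf-8")

    # The file is closed automatically when the with-block exits,
    # even if f.read() raises an exception.
    with open(sample_path, "r", encoding="utf-8") as f:
        loaded = f.read()  # the whole file as one string

print(type(loaded).__name__, len(loaded))
print(f.closed)
```

Passing `encoding='utf-8'` explicitly keeps the result independent of the platform's default encoding, which helps make later slice indices reproducible.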


In [None]:
# TODO: Load the full Frankenstein text
# Use: with open('../datasets/frankenstein.txt', 'r', encoding='utf-8') as f:
#      frankenstein = ...

frankenstein = None  # Replace this line with your code


## 📝 TODO 2: Slice to First Letter

**Hint:**  
Python slicing syntax: `string[start:end]` (end is exclusive)

**Steps:**
1. Slice `frankenstein[1380:8230]`
2. Assign it to `first_letter_text`
3. This gives us Letter 1 from the novel (skipping Project Gutenberg header and preliminaries)

**Why these indices?**
- Index 1380 is the first character of Letter 1
- Index 8230 is the exclusive end, so the last character kept is at index 8229
- Total: 8230 - 1380 = 6,850 characters
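
The index arithmetic can be checked on a dummy string before touching the real text. This is a sketch under assumptions: `START`, `END`, `fake_text`, and `short_text` are hypothetical names, and the strings of `x` characters merely stand in for the novel.

```python
START, END = 1380, 8230  # slice indices used in this notebook

# Stand-in for the novel: any string at least END characters long.
fake_text = "x" * 10_000
piece = fake_text[START:END]
print(len(piece))  # 6850, i.e. END - START

# Slicing never raises on short input; it silently returns fewer characters.
short_text = "x" * 5_000
truncated = short_text[START:END]
print(len(truncated))  # 3620, i.e. 5000 - START
```

Because an out-of-range slice does not raise an error, checking `len(first_letter_text)` against 6850 is a quick way to detect a file that is shorter than expected.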


In [None]:
# TODO: Slice to the first letter
# first_letter_text = frankenstein[1380:8230]

first_letter_text = None  # Replace this line with your code


## 📝 TODO 3: Inspect the Data

**Hint:**  
Use `len()` to get the string's length, and slicing to preview its first characters.

**Steps:**
1. Print `len(first_letter_text)` — should be **6850**
2. Print the first 500 characters: `first_letter_text[:500]`
3. Observe: punctuation, capitalization, sentence structure
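
One way to make whitespace visible during inspection is to print both the rendered preview and its `repr()`. A minimal sketch, assuming a made-up `sample_text` in place of `first_letter_text`:

```python
# Stand-in text (assumption: not taken from the novel).
sample_text = "Dear reader,\n\nThis is a stand-in; it is not the novel.\n"

preview = sample_text[:20]

print(f"Length: {len(sample_text)}")
print(preview)        # rendered: newlines appear as line breaks
print(repr(preview))  # escaped: newlines appear as \n
```

The `repr()` form is useful because the model sees newlines and spaces as ordinary characters, and they are easy to overlook in a rendered preview.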


In [None]:
# TODO: Print length and preview
# print(f"Length of first_letter_text: {len(first_letter_text)}")
# print(f"\nFirst 500 characters:\n{first_letter_text[:500]}")

# Your code here


## 💭 Reflection Prompts

**Write your observations:**

1. **Punctuation**: What punctuation marks appear frequently? (commas, periods, semicolons, dashes?)

2. **Sentence Structure**: Are sentences short or long? Simple or complex?

3. **Archaic Language**: Do you see any old-fashioned words or phrasing?

4. **Why 6,850 characters?**: How does this compare to a typical page of text? (Hint: ~2-3 pages)
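
If you would like numbers to back up the observations in prompts 1 and 2, a rough count is easy to sketch. The text below is invented for the example, and the names `punct_counts`, `sentences`, and `lengths` are hypothetical; swap in `first_letter_text` to try it on the real slice.

```python
import re
import string
from collections import Counter

# Stand-in text (assumption: not taken from the novel).
sample_text = "The wind was cold; the road, long and empty, led north. Did anyone follow? No one did."

# Count ASCII punctuation only; string.punctuation does not include
# typographic characters such as curly quotes or long dashes.
punct_counts = Counter(ch for ch in sample_text if ch in string.punctuation)
print(punct_counts.most_common())

# Rough sentence lengths in words, splitting on '.', '!' and '?' only.
sentences = [s for s in re.split(r"[.!?]+", sample_text) if s.strip()]
lengths = [len(s.split()) for s in sentences]
print(lengths)  # [11, 3, 3]
```

The sentence split is deliberately crude: abbreviations such as "Mrs." would be counted as sentence ends, so treat the lengths as an approximation.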

---


## 🚀 Next Steps

Once you've completed the TODOs and can print the length + preview:

➡️ **Move to Notebook 02**: Character Tokenization & Vocabulary Building

---

## 📌 Key Takeaways

- ✅ Text data starts as a simple string
- ✅ Slicing lets us work with manageable chunks
- ✅ Inspection helps us understand patterns the model will learn
- ✅ Context managers (`with open(...)`) are best practice for file I/O

---

*Remember: The model will learn from every character, including spaces, punctuation, and newlines!*
