# 📊 Notebook 01: Import and Inspect Data

## Your First Look at Movie Reviews

This notebook introduces the dataset used throughout this NLP series. We'll load the IMDB movie reviews and examine their structure before moving on to preprocessing.


## 🧠 Concept Primer: Why Data Inspection Matters

### What We're Doing
Loading and exploring the IMDB movie reviews dataset to understand its structure, distribution, and characteristics before building our models.

### Why This Step is Critical
**Data inspection prevents downstream bugs.** Common issues caught early:
- **Missing values** that would break tokenization
- **Label inconsistencies** that would confuse the model
- **Imbalanced classes** that would bias predictions
- **Encoding issues** that would cause training failures
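
Each of the four issues above can be checked with a few lines of pandas. The sketch below is only an illustration: `quality_report` is a hypothetical helper, the `review` column name is an assumption (this notebook only names `aspect` and `aspect_encoded`), and the toy DataFrame stands in for the real data.

```python
import pandas as pd

def quality_report(df, text_col, label_col):
    """Collect a few counts that surface common data-quality issues."""
    text = df[text_col]
    present = text.dropna().astype(str)
    return {
        # Missing values: rows with no text or no label
        "missing_text": int(text.isna().sum()),
        "missing_label": int(df[label_col].isna().sum()),
        # Blank strings are not NaN but are just as unusable
        "blank_text": int(present.str.strip().eq("").sum()),
        # U+FFFD usually marks bytes decoded with the wrong encoding
        "replacement_chars": int(present.str.contains("\ufffd", regex=False).sum()),
        # Label inconsistencies show up as unexpected spellings
        "labels": sorted(df[label_col].dropna().unique()),
        # Class imbalance: share of rows per label
        "class_share": df[label_col].value_counts(normalize=True).to_dict(),
    }

# Toy stand-in for the real data; "review" is an assumed column name.
toy_df = pd.DataFrame({
    "review": ["Great film", None, "   ", "Bad caf\ufffd scene"],
    "aspect": ["positive", "negative", "Positive", "negative"],
})

report = quality_report(toy_df, "review", "aspect")
for key, value in report.items():
    print(f"{key}: {value}")
```

Here the toy data deliberately contains one missing review, one blank review, one mis-decoded character, and a label spelled two ways, so each check has something to find.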

### What We'll Discover
- **Schema**: What columns exist and their data types
- **Distribution**: How many examples per class
- **Content**: Sample reviews to understand the text quality
- **Target variable**: The aspect labels we're trying to predict

### How It Maps to Our Pipeline
This inspection directly informs:
- **Vocabulary size decisions** (notebook 03)
- **Padding length choices** (notebook 04) 
- **Model architecture** (number of output classes)
- **Evaluation strategy** (handling class imbalance)
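
As a rough illustration of how inspection feeds the padding and vocabulary decisions, the sketch below measures review lengths in whitespace-separated words and counts distinct lowercase words. Whitespace splitting is only a stand-in for the tokenizer built in later notebooks, and the `review` column name and toy rows are assumptions.

```python
import pandas as pd

# Toy stand-in for the real data; "review" is an assumed column name.
toy_df = pd.DataFrame({
    "review": ["good movie", "a very long and slow film", "ok"],
})

# Words per review, using naive whitespace splitting.
lengths = toy_df["review"].str.split().str.len()

# A high percentile of the length distribution is one common
# starting point when choosing a padding length.
p95_length = lengths.quantile(0.95)

# Distinct lowercase words give a rough estimate of vocabulary size.
vocab = {word for text in toy_df["review"] for word in text.lower().split()}

print(lengths.describe())
print("95th percentile length:", p95_length)
print("distinct words:", len(vocab))
```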


## 📋 Checklist Objectives

By the end of this notebook, you will have:

- [ ] **Loaded both datasets** (train and test CSVs)
- [ ] **Inspected data schema** (columns, types, shapes)
- [ ] **Analyzed label distributions** (aspect and aspect_encoded counts)
- [ ] **Set up key variables** (`n_aspects` for model architecture)
- [ ] **Understood the data quality** (missing values, sample content)

## ✅ Acceptance Criteria

**You've completed this notebook when:**
- Both DataFrames load without errors
- You can print the schema and basic statistics
- You know how many aspect classes exist (`n_aspects`)
- You understand the distribution of labels in both train and test sets


## 🔧 TODO #1: Load the Datasets

**Task:** Load both train and test CSV files into pandas DataFrames.

**Hint:** Use `pd.read_csv()` with the paths `datasets/imdb_movie_reviews_train.csv` and `datasets/imdb_movie_reviews_test.csv`.

**Expected Variables:**
- `train_reviews_df` → training data DataFrame
- `test_reviews_df` → test data DataFrame

**Expected Output:** Printing each DataFrame shows a preview of its rows. For large frames pandas appends a `[N rows x M columns]` footer, and `.shape` reports the same counts.
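
One possible shape for this step is sketched below. The two paths come from the hint above; `load_reviews` is a hypothetical helper name, and the existence check is there only so the snippet does not crash when it is run outside the project folder.

```python
from pathlib import Path

import pandas as pd

TRAIN_PATH = Path("datasets/imdb_movie_reviews_train.csv")
TEST_PATH = Path("datasets/imdb_movie_reviews_test.csv")

def load_reviews(train_path=TRAIN_PATH, test_path=TEST_PATH):
    """Read both CSV files and return (train_df, test_df)."""
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)
    return train_df, test_df

if TRAIN_PATH.exists() and TEST_PATH.exists():
    train_reviews_df, test_reviews_df = load_reviews()
    print("train shape:", train_reviews_df.shape)
    print("test shape:", test_reviews_df.shape)
else:
    print("CSV files not found; check the working directory.")
```

In the notebook itself, calling `pd.read_csv()` directly on each path is enough; the helper only keeps the two reads together.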


```python
# TODO #1: Load the datasets
# Your code here
```


## 🔧 TODO #2: Inspect Data Schema

**Task:** Display the first few rows and basic information about both datasets.

**Hint:** Use `.head()` and `.info()` methods on both DataFrames.

**What to look for:**
- Column names and data types
- Number of rows and columns
- Memory usage
- Any obvious data quality issues
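
A minimal sketch of this inspection follows, assuming nothing beyond the `.head()` and `.info()` methods named in the hint. `inspect_schema` is a hypothetical helper, and the toy DataFrame (including its `review` column name) is a stand-in so the snippet runs on its own.

```python
import pandas as pd

def inspect_schema(df, name):
    """Print a preview and a column summary; return (rows, columns)."""
    print(f"--- {name} ---")
    print(df.head())
    # memory_usage="deep" measures the actual size of string columns.
    df.info(memory_usage="deep")
    return df.shape

# Toy stand-in; in the notebook, pass train_reviews_df and
# test_reviews_df instead. "review" is an assumed column name.
toy_df = pd.DataFrame({
    "review": ["Loved the soundtrack", "The plot dragged"],
    "aspect": ["positive", "negative"],
    "aspect_encoded": [1, 0],
})

toy_shape = inspect_schema(toy_df, "toy")
print("shape:", toy_shape)
```

The `Non-Null Count` column printed by `.info()` is a quick first signal of missing values: any count below the total row count means gaps in that column.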


```python
# TODO #2: Inspect data schema
# Your code here
```


## 🔧 TODO #3: Analyze Label Distributions

**Task:** Examine the distribution of both `aspect` (string) and `aspect_encoded` (integer) columns.

**Hint:** Use `.value_counts()` on both columns for both datasets.

**Expected Output Example:**
```
aspect_encoded: 0 (2500), 1 (2500), 2 (2500), 3 (2500)
aspect: 'positive' (2500), 'negative' (2500), 'neutral' (2500), 'mixed' (2500)
```

**What to analyze:**
- Are the classes balanced?
- Do train and test have similar distributions?
- Are there any missing values?
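
The three questions above can be approached along the lines of the sketch below. It is an illustration on toy data, not the expected solution: `label_summary` is a hypothetical helper that places train and test counts side by side, and the toy label values are made up.

```python
import pandas as pd

def label_summary(train_df, test_df, col):
    """Side-by-side counts and shares of `col` in train and test."""
    summary = pd.DataFrame({
        "train": train_df[col].value_counts(),
        "test": test_df[col].value_counts(),
    }).fillna(0).astype(int)  # a label absent from one split counts as 0
    summary["train_share"] = summary["train"] / summary["train"].sum()
    summary["test_share"] = summary["test"] / summary["test"].sum()
    return summary.sort_index()

# Toy stand-ins; in the notebook, use train_reviews_df and test_reviews_df.
toy_train = pd.DataFrame({
    "aspect": ["positive", "positive", "negative", "negative"],
    "aspect_encoded": [1, 1, 0, 0],
})
toy_test = pd.DataFrame({
    "aspect": ["positive", "negative"],
    "aspect_encoded": [1, 0],
})

encoded_summary = label_summary(toy_train, toy_test, "aspect_encoded")
print(encoded_summary)
print(label_summary(toy_train, toy_test, "aspect"))

# Missing values per column.
missing = toy_train.isna().sum()
print(missing)

# Each string label should map to exactly one integer code.
codes_per_label = toy_train.groupby("aspect")["aspect_encoded"].nunique()
print(codes_per_label)
```

Comparing the `train_share` and `test_share` columns is usually more informative than comparing raw counts, since the two splits may differ in size.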


```python
# TODO #3: Analyze label distributions
# Your code here
```


## 🔧 TODO #4: Set Up Key Variables

**Task:** Extract the number of unique aspect classes and prepare the text and label lists.

**Hint:** Use `train_reviews_df['aspect_encoded'].nunique()` to get the number of classes.

**Expected Variables:**
- `n_aspects` → number of unique aspect classes (for model architecture)
- `train_texts` → list of review texts from training data
- `train_labels` → list of encoded labels from training data
- `test_texts` → list of review texts from test data  
- `test_labels` → list of encoded labels from test data

**Why this matters:** `n_aspects` determines the output size of your neural network's final layer.
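
A sketch of this setup on toy data is shown below. The text column name is not given in this notebook, so `TEXT_COL = "review"` is an assumption to replace with the real name from TODO #2; `to_lists` is a hypothetical helper.

```python
import pandas as pd

TEXT_COL = "review"            # assumed name; check df.columns in your data
LABEL_COL = "aspect_encoded"

def to_lists(df, text_col=TEXT_COL, label_col=LABEL_COL):
    """Return (texts, labels) as plain Python lists."""
    return df[text_col].tolist(), df[label_col].tolist()

# Toy stand-ins; in the notebook, use train_reviews_df and test_reviews_df.
toy_train = pd.DataFrame({
    "review": ["Loved it", "Too slow", "Great cast"],
    "aspect_encoded": [1, 0, 1],
})
toy_test = pd.DataFrame({
    "review": ["Weak ending"],
    "aspect_encoded": [0],
})

# Number of distinct classes, taken from the training labels.
n_aspects = toy_train[LABEL_COL].nunique()

train_texts, train_labels = to_lists(toy_train)
test_texts, test_labels = to_lists(toy_test)

print("n_aspects:", n_aspects)
print("train examples:", len(train_texts), "test examples:", len(test_texts))
```

Texts and labels are extracted from the same DataFrame in one call so that their order stays aligned row by row.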


```python
# TODO #4: Set up key variables
# Your code here
```


## 📝 Reflection Prompts

Take a moment to reflect on what you've discovered about your dataset:

### 🤔 Understanding Check
1. **What would happen if `aspect_encoded` had gaps (e.g., 0, 1, 5)?** How would this affect your neural network?

2. **Why is it important to check both train and test distributions?** What could go wrong if they're different?

3. **Looking at your sample reviews, what challenges do you anticipate in tokenization?** (Think about punctuation, capitalization, special characters)

4. **If one aspect class had 90% of the examples, how would that affect your model training?**
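
To explore questions 1 and 4 on your own labels, a small check like the one below may help. It is a sketch only: `label_checks` is a hypothetical helper, and the toy labels are chosen to contain a gap.

```python
from collections import Counter

def label_checks(labels):
    """Report whether integer labels run 0..k-1 and how dominant the top class is."""
    counts = Counter(labels)
    unique = sorted(counts)
    return {
        "n_unique": len(unique),
        "max_label": unique[-1],
        # True only when the labels are exactly 0, 1, ..., n_unique - 1
        "contiguous": unique == list(range(len(unique))),
        # Fraction of examples that belong to the most frequent class
        "majority_share": max(counts.values()) / len(labels),
    }

# Toy labels with a gap; in the notebook, pass train_labels instead.
toy_labels = [0, 1, 5, 5]
checks = label_checks(toy_labels)
print(checks)
```

Comparing `n_unique` with `max_label` is a good starting point for question 1, and `majority_share` for question 4.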

### 🎯 Data Quality Assessment
- Are there any data quality issues you noticed?
- How will the class distribution affect your evaluation strategy?
- What preprocessing challenges do you foresee?

---

**Write your reflections here:**


---

## 🚀 Ready for Tokenization?

Great job exploring your dataset! You now understand the structure and distribution of your movie reviews. 

**Key takeaways from this notebook:**
- You know how many classes you're predicting (`n_aspects`)
- You understand the data quality and any potential issues
- You have text and label lists ready for preprocessing

**Next up:** In Notebook 02, you'll learn how to transform raw text into processable tokens using regex and string manipulation.

**Remember:** Every step builds on the previous one. The text lists you just created will become the input for your tokenization function!
