# Scientist's Notebook — Final Project (Deep Learning)

**Topic:** Sentiment analysis on movie reviews (IMDB 50K)  
**Format:** Work diary, decisions, and results  

---

## 1) Initial Approach

**Problem.** Classify movie reviews as positive or negative.

**Quantitative objective.** Exceed **70%** on the main metrics (F1, ROC-AUC, and accuracy).

**Hypotheses.**
- **H1.** A "mini-transformer" in Keras with integrated vectorization will be sufficient for >70% on IMDB.
- **H2.** Lightweight text normalization (lowercase, noise and emoji cleanup) improves stability without requiring aggressive cleaning.
- **H3.** With stratified partitions and *early stopping*, simple validation + *bootstrap* on test provides sufficient evidence without exhaustive CV.

---

## 2) Stage-by-Stage Diary

### Week 1 — Dataset Selection and Scope
I chose **IMDB 50K Movie Reviews** for its size, clear format, and abundant literature.  
I ruled out **Sentiment140** and **TweetEval** because their larger data volume would have increased training time.  
I defined the technical approach: a **"mini-transformer"** in Keras, with **70%** as the minimum target on the main metrics.

### Week 2 — EDA and Preparation
I performed a quick inspection:  
- Typical noise (HTML, tags, symbols) -> basic filtering.  
- Emojis and unicode: low frequency; I added handling later in the *standardize* function of `TextVectorization`.  
- Long and colloquial reviews; I confirmed the need for truncation via `seq_len`.

I deferred the per-class length analysis and the exhaustive vocabulary study (not critical for the MVP).
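
A minimal sketch of the lightweight normalization described above (HTML and tag removal, lowercasing, emoji and symbol cleanup). It is plain Python rather than the `standardize` callable that was eventually passed to `TextVectorization`; the function name `normalize_review` and the regular expressions are assumptions for illustration, not the project's actual rules.

```python
import html
import re

# Hypothetical patterns; the project's real cleanup rules are not reproduced here.
TAG_RE = re.compile(r"<[^>]+>")                      # HTML tags such as <br />
# Rough emoji/pictograph ranges (an assumption, not an exhaustive list).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")
PUNCT_RE = re.compile(r"[^\w\s']")                   # keep letters, digits, apostrophes
SPACE_RE = re.compile(r"\s+")


def normalize_review(text: str) -> str:
    """Lowercase a review and strip HTML, emojis, and stray symbols."""
    text = html.unescape(text)        # turn entities like &amp; into characters
    text = TAG_RE.sub(" ", text)      # drop HTML tags
    text = EMOJI_RE.sub(" ", text)    # drop emojis and pictographs
    text = text.lower()
    text = PUNCT_RE.sub(" ", text)    # drop remaining punctuation and symbols
    return SPACE_RE.sub(" ", text).strip()


cleaned = normalize_review("Great movie!<br /><br />Loved it 😍 &amp; more")
```

Keeping the rules this small matches H2: the text is made uniform without stemming, stop-word removal, or other aggressive cleaning.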

### Week 3 — Partitioning and Experimental Protocol
I applied **train / valid / test** with **stratification**. The **test** set was completely withheld until the end to avoid *leakage*.  
I explored `KerasClassifier` *wrappers* (scikeras) + `StratifiedKFold` with *multi-metrics*, but **changed plans**: **exhaustive CV was abandoned** due to cost/benefit; I switched to **internal stratified validation** during tuning and **bootstrap on test** to estimate uncertainty.
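
A minimal sketch of a stratified **train / valid / test** partition, written with the standard library only. The split fractions, the seed, and the function name `stratified_split` are assumptions; the diary does not record the values actually used.

```python
import random
from collections import defaultdict


def stratified_split(labels, valid_frac=0.1, test_frac=0.2, seed=42):
    """Return (train, valid, test) index lists that preserve class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the partition is reproducible
    train, valid, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)
        n_valid = round(len(indices) * valid_frac)
        # Each class contributes the same fraction to every partition.
        test.extend(indices[:n_test])
        valid.extend(indices[n_test:n_test + n_valid])
        train.extend(indices[n_test + n_valid:])
    return train, valid, test


# Toy labels (not the IMDB data): 50 negative and 50 positive reviews.
labels = [0] * 50 + [1] * 50
train_idx, valid_idx, test_idx = stratified_split(labels)
```

The test indices are produced once and set aside, which is what keeps the test set out of tuning.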

### Week 4 — Text Engineering and Model
I integrated `TextVectorization` into the model graph itself to accept raw text during inference.  
I adjusted *standardize* for emojis/unicode and fixed `vocab_size` based on pilot counts.  
Final architecture: **mini-transformer** with *embedding*, lightweight attention block, and *feed-forward*; decision threshold (`threshold`) configurable from the *config*.  
**Training operational adjustment:** I set **epochs = 3** and **seq_len = 256** for more reasonable training times. Consistent with this, **I removed the *callbacks*** *EarlyStopping* and *ReduceLROnPlateau* (they had **patience** 2 and 1, with little utility over so few *epochs*). The performance target and training stability were maintained.  
Tracked metrics: F1 / ROC-AUC / Accuracy as needed; the **>70%** target was met.  
Incidents: an **OOV** calculation path became very slow after I varied `seq_len`; it stabilized after I reviewed the pipeline and the sequence limits. The slowdown occurred mainly in runs where the **OUT_OF_RANGE: End of sequence** message associated with `TextVectorization` did not appear, and it was resolved by adjusting some configuration values.
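
The reasoning for dropping the callbacks this week can be checked with a small simulation. This is a simplified model of patience-based early stopping, not the Keras implementation, and the validation losses are illustrative values, not project results.

```python
def early_stop_epoch(val_losses, patience):
    """Return the 1-based epoch after which training would stop, or None.

    Simplified rule: stop once the validation loss has failed to improve
    on its best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Worst case for a 3-epoch run: the loss never improves after epoch 1.
worst_case = [0.60, 0.65, 0.70]
stop_at = early_stop_epoch(worst_case, patience=2)
```

Under this model, a patience of 2 cannot stop training before the end of the third epoch, which is already the last one when **epochs = 3**. By the same logic, a plateau rule with a patience of 1 could act at the end of epoch 2 at the earliest, so it would affect only the final epoch.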

### Week 5 — Cleanup, Configuration, and *Hardening*
I reorganized the **configuration** to reflect the abandonment of CV; I added `threshold` and `prediction_confidence`.  
I tidied up the code with **type hints**, **docstrings**, and **Markdown**; I removed packages that ended up unused from `requirements.txt`.  
I unified all project text to **English**.
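
A minimal sketch of how `threshold` and `prediction_confidence` could fit together. The diary does not define either one precisely, so everything here is an assumption: the default of 0.5, the `InferenceConfig` and `classify` names, and the reading of confidence as the probability of the class the threshold selects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceConfig:
    # Assumed default; the diary does not state the value that was used.
    threshold: float = 0.5


def prediction_confidence(prob_positive: float, threshold: float) -> float:
    """Probability assigned to whichever class the threshold selects."""
    if prob_positive >= threshold:
        return prob_positive
    return 1.0 - prob_positive


def classify(prob_positive: float, config: InferenceConfig) -> dict:
    """Turn the model's positive-class probability into a label and confidence."""
    label = "positive" if prob_positive >= config.threshold else "negative"
    return {
        "label": label,
        "confidence": prediction_confidence(prob_positive, config.threshold),
    }


result = classify(0.9, InferenceConfig())
```

Reading the threshold from the configuration lets the decision boundary change without retraining or editing the model code.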

### Week 6 — Minimum Viable Product and Deployment
I trained the **final model** on all the training data and saved it in `SavedModel` format.  
I exposed the model via a **REST API** and **dockerized** it; I also cleaned up code comments and added the `prediction_confidence` function to the API.  
Finally, I polished the project documentation, the READMEs, and the work diary.

---

## 3) Results and Evidence

- **Quantitative objective:** exceeded **70%** in main metrics.  
- **Behavior:** lightweight normalization + `TextVectorization` in the graph simplifies *serving* and maintains performance.  
- **Efficiency:** with **epochs = 3** and **seq_len = 256**, training times became more reasonable while the minimum target was still met.  
- **Robustness:** *bootstrap* on the test set provides uncertainty bands at a controlled computational cost.
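
A minimal sketch of the *bootstrap* on the test set, here as a percentile interval for accuracy. The number of resamples, the 95% level, the function name, and the toy labels are assumptions for illustration; none of the values below are project results.

```python
import random


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    rng = random.Random(seed)
    n = len(y_true)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]

    scores = []
    for _ in range(n_resamples):
        # Resample test examples with replacement; the model is never refit.
        sample = rng.choices(correct, k=n)
        scores.append(sum(sample) / n)
    scores.sort()

    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return sum(correct) / n, scores[lo_idx], scores[hi_idx]


# Toy data: 200 examples, of which the first 40 predictions are wrong.
y_true = [1, 0] * 100
y_pred = [1 - t if i < 40 else t for i, t in enumerate(y_true)]
accuracy, lower, upper = bootstrap_accuracy_ci(y_true, y_pred)
```

Only the stored predictions are resampled, so the cost is a few thousand list operations rather than repeated training runs, which is the cost advantage over exhaustive CV.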

---

## 4) Key Decisions (and Why)

1. **IMDB 50K over Twitter datasets.** Less platform noise and a lower data volume, which shortens training time.  
2. **No exhaustive CV.** High cost and low marginal gain; replaced by internal stratified validation and bootstrap on test.
3. **Lightweight normalization.** Sufficient for the objective without complicating the pipeline.
