# Understanding TfidfVectorizer: What Do the Numbers Mean?

This notebook explains what each number in the vectorized review represents (Cell 7 from imdb_neural_network.ipynb).

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

## What is Vectorization?

When we vectorize text, we convert words into numbers. TfidfVectorizer does this by:
1. Finding all unique words in the training data
2. Assigning each word an index (0, 1, 2, ...)
3. Calculating a TF-IDF score for each word in each review
4. Creating a numerical vector where each position represents a word
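
To make steps 1, 2 and 4 concrete, here is a minimal hand-rolled sketch in plain Python. It is not how TfidfVectorizer is implemented: raw word counts stand in for the TF-IDF scores of step 3, the toy `docs` list is invented for illustration, and the alphabetical index order is chosen to mirror the ordering scikit-learn produces.

```python
# Hand-rolled sketch of steps 1, 2 and 4 (raw counts stand in for TF-IDF scores).
docs = ['good movie', 'bad movie', 'good good movie']  # illustrative toy data

# Step 1: collect the unique words across all documents.
unique_words = sorted({word for doc in docs for word in doc.split()})

# Step 2: give each word a fixed index (alphabetical order).
word_to_index = {word: idx for idx, word in enumerate(unique_words)}

# Step 4: build one vector per document, with one position per word.
vectors = []
for doc in docs:
    row = [0] * len(word_to_index)
    for word in doc.split():
        row[word_to_index[word]] += 1
    vectors.append(row)

print(word_to_index)
for doc, row in zip(docs, vectors):
    print(f'{doc!r:20} -> {row}')
```

Every document becomes a vector of the same length, and a given position always refers to the same word.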

## Simple Example First

In [None]:
# Let's start with simple reviews
simple_reviews = [
    'good movie',
    'bad movie',
    'movie is good'
]

# Cap the vocabulary at 5 features (words); 'is' is dropped as a stop word,
# so only 3 words actually remain here
vec = TfidfVectorizer(max_features=5, stop_words='english')
X = vec.fit_transform(simple_reviews).toarray()

print('Reviews:')
for i, review in enumerate(simple_reviews):
    print(f'{i}: {review}')

In [None]:
# Get the vocabulary - a word's position in this array is its index
vocab = vec.get_feature_names_out()
print('\nVocabulary (index -> word):')
for idx, word in enumerate(vocab):
    print(f'  Index {idx}: "{word}"')

In [None]:
# Now look at the vectorized form
print('\nVectorized Reviews:')
print('Review 0 ("good movie"):')
print(X[0])
print('\nBreakdown:')
for idx, word in enumerate(vocab):
    print(f'  Index {idx} ("{word}"): {X[0][idx]:.4f}')

## What Do These Numbers (TF-IDF Scores) Mean?

**TF-IDF** stands for **Term Frequency - Inverse Document Frequency**

### TF (Term Frequency):
- How many times does a word appear in THIS document?
- More frequent words get higher scores

### IDF (Inverse Document Frequency):
- How rare is this word across ALL documents?
- Words that appear in every document (like 'the', 'is') get lower scores
- Words that appear in few documents get higher scores

### TF-IDF Score = TF × IDF
- High score: Word appears frequently in this review AND is relatively rare overall
- Low score: Word appears rarely in this review OR appears in almost every review
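- Zero: Word doesn't appear in this review at all
- By default, scikit-learn then scales each review's vector to unit length (L2 norm), so a word's final score also depends on the other words in that review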

In [None]:
# Let's look at Review 0 more carefully
print('Review 0: "good movie"')
print('\nVectorized form:', X[0])
print('\nDetailed breakdown:')
print('-' * 50)
for idx, word in enumerate(vocab):
    score = X[0][idx]
    if score == 0:
        print(f'Index {idx} ("{word:10s}"): {score:.4f}  <- Not in review')
    else:
        print(f'Index {idx} ("{word:10s}"): {score:.4f}  <- In review!')

In [None]:
print('\nReview 1: "bad movie"')
print('Vectorized form:', X[1])
print('\nDetailed breakdown:')
print('-' * 50)
for idx, word in enumerate(vocab):
    score = X[1][idx]
    if score == 0:
        print(f'Index {idx} ("{word:10s}"): {score:.4f}  <- Not in review')
    else:
        print(f'Index {idx} ("{word:10s}"): {score:.4f}  <- In review!')

## Notice Something Interesting?

In [None]:
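movie_idx = list(vocab).index('movie')  # look up the column instead of hard-coding it
print('Review 0 (good movie) - TF-IDF score for "movie":', X[0][movie_idx])
print('Review 1 (bad movie)  - TF-IDF score for "movie":', X[1][movie_idx])
print('Review 2 (movie is good) - TF-IDF score for "movie":', X[2][movie_idx])
print()
print('Reviews 0 and 2 match (same words once "is" is dropped); review 1 differs.')
print('"movie" appears in ALL 3 reviews, so it has the LOWEST IDF of any word')
print('(it\'s not a good indicator of sentiment)')
print('Its score still varies because each review vector is scaled to unit length')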

In [None]:
good_idx = list(vocab).index('good')  # index 0 is "bad", so look "good" up by name
print('\nBut look at "good":')
print('Review 0 (good movie) - TF-IDF score for "good":', X[0][good_idx])
print('Review 1 (bad movie)  - TF-IDF score for "good":', X[1][good_idx])
print('Review 2 (movie is good) - TF-IDF score for "good":', X[2][good_idx])
print()
print('"good" scores HIGHER than "movie" in the reviews where it appears')
print('(because it\'s rarer - good indicator of sentiment!)')

## Now with IMDB Data: 5000 Features

In [None]:
# In the IMDB notebook, we use max_features=5000
# This means:
# - Find the 5000 most common words in all training reviews
# - Create a vector of length 5000 for each review
# - Each position (0-4999) represents one word
# - The number at that position is the TF-IDF score for that word

print('IMDB Vectorization:')
print('- 5000 features (positions in the vector)')
print('- 1 feature = 1 word')
print('- Each number = TF-IDF score for that word in that review')
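print('- Range: 0.0 to 1.0 (each review vector is scaled to unit length)')
print('  * 0.0 = word not in review')
print('  * lower = word is common across reviews, or the review has many other words')
print('  * higher = word is rare across reviews and prominent in this review')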

## Concrete IMDB Example

In [None]:
# Simulating what happens in imdb_neural_network.ipynb Cell 7
imdb_reviews = [
    'This movie was absolutely amazing and fantastic. I loved it!',
    'Terrible movie. Worst waste of time. Horrible acting.',
    'The film was okay. Not great, not bad. Just average.'
]

# Create vectorizer like in the notebook
# (with only 3 short reviews, the vocabulary ends up far smaller than 5000)
imdb_vec = TfidfVectorizer(max_features=5000, stop_words='english')
imdb_X = imdb_vec.fit_transform(imdb_reviews).toarray()

print('Review being analyzed:')
print(f'"{imdb_reviews[0]}"')
print(f'\nTotal vector length: {imdb_X.shape[1]} features')
print(f'Number of non-zero values: {np.count_nonzero(imdb_X[0])}')

In [None]:
# Show up to the first 20 features (like Cell 7 does)
vocab_imdb = imdb_vec.get_feature_names_out()
n_show = min(20, len(vocab_imdb))  # this toy corpus has fewer than 20 words
print(f'First {n_show} features of review 0:')
print(imdb_X[0][:n_show])
print('\nBreakdown:')
for i in range(n_show):
    score = imdb_X[0][i]
    word = vocab_imdb[i]
    if score > 0:
        print(f'Feature {i:2d}: "{word:15s}" = {score:.4f}  <- In review!')
    else:
        print(f'Feature {i:2d}: "{word:15s}" = {score:.4f}')

## Key Takeaways

### Each Number in the Vector Represents:
1. **Position (Index)**: Which word it is (0-4999)
2. **Value (Number)**: How important that word is in THIS review
   - 0 = word not in review
   - 0.0-1.0 = TF-IDF importance score
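
The 0.0-1.0 range follows from the unit-length scaling that TfidfVectorizer applies by default (`norm='l2'`). The sketch below, using made-up one-line documents, checks that every row has length 1 and shows that the same word scores lower as a review gains more distinct words, so a score cannot be read as "important" or "common" without that context.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents of increasing length, all containing "amazing".
docs = [
    'amazing',
    'amazing fantastic plot',
    'amazing fantastic plot acting soundtrack',
]
range_vec = TfidfVectorizer()
scores = range_vec.fit_transform(docs).toarray()

# With the default norm='l2', every row has Euclidean length 1,
# so no single score can exceed 1.0.
row_norms = np.linalg.norm(scores, axis=1)
print('Row norms:', row_norms)

# The score for "amazing" shrinks as the document gains more distinct words.
amazing_scores = scores[:, range_vec.vocabulary_['amazing']]
for doc, score in zip(docs, amazing_scores):
    print(f'{score:.4f}  {doc}')
```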

### The Neural Network Uses This:
- Input layer: 5000 neurons (one per word)
- Each neuron receives a TF-IDF score (0 to ~1.0)
- The network learns patterns: high scores for certain words → positive sentiment

### Example Pattern the Network Might Learn:
```
If words like 'amazing', 'fantastic', 'loved' have high scores
  → Predict sentiment = POSITIVE (closer to 1.0)

If words like 'terrible', 'horrible', 'worst' have high scores
  → Predict sentiment = NEGATIVE (closer to 0.0)
```
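
As a rough illustration of that pattern, the sketch below wires up a single sigmoid neuron in NumPy. It is not the network from imdb_neural_network.ipynb: the two reviews are invented, and the weights are hand-picked (hypothetical) rather than learned, purely to show how TF-IDF inputs and per-word weights combine into a prediction between 0 and 1.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    'amazing fantastic film, loved it',
    'terrible horrible film, worst acting',
]
vectorizer = TfidfVectorizer(stop_words='english')
features = vectorizer.fit_transform(reviews).toarray()
vocabulary = vectorizer.vocabulary_  # dict: word -> column index

# Hypothetical hand-picked weights (a trained network would learn these):
# positive words push the output up, negative words push it down.
weights = np.zeros(features.shape[1])
for word in ('amazing', 'fantastic', 'loved'):
    weights[vocabulary[word]] = 4.0
for word in ('terrible', 'horrible', 'worst'):
    weights[vocabulary[word]] = -4.0

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# One neuron: weighted sum of the TF-IDF inputs, then the sigmoid.
predictions = sigmoid(features @ weights)
for review, p in zip(reviews, predictions):
    print(f'{p:.3f}  {review}')
```

The first review should land above 0.5 and the second below it, because the weighted sum is positive for one and negative for the other.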

In [None]:
# Which words had the highest TF-IDF scores?
# (only non-zero scores are printed, so fewer than 10 words may appear)
print('Highest-scoring words in review 0 (up to 10):')
print(f'Review: "{imdb_reviews[0]}"\n')
top_indices = np.argsort(imdb_X[0])[-10:][::-1]
for idx in top_indices:
    word = vocab_imdb[idx]
    score = imdb_X[0][idx]
    if score > 0:
        print(f'  "{word:15s}": {score:.4f}')