# Wrangling Data with Scikit-learn

## Working with Deterministic Selection

### 1) What Does “Deterministic” Mean?
- A process is deterministic if it produces the same result every time it runs with the same input.
- In data science and machine learning, determinism is important for:
   - **Reproducibility** → getting the same training results on every run.
   - **Debugging** → a bug can be reproduced because behavior is consistent.
   - **Fair comparison** → models are evaluated on the same splits and the same random choices.

**Example:**
- If you shuffle data or split into training/testing sets, the split should stay the same every time you rerun the code.

#### Random vs Deterministic Behavior

| Aspect   | Random                              | Deterministic                        |
| -------- | ----------------------------------- | ------------------------------------ |
| Output   | Can change on each run              | Same output on every run             |
| Use case | Exploration, randomness testing     | Production, reproducible results     |
| Example  | `random.shuffle()` without a seed   | `random.shuffle()` after `random.seed(42)` |
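
The last row of the table can be sketched with Python's standard `random` module. The helper name `seeded_shuffle` and the seed value 42 are illustrative assumptions, not part of any library API.

```python
import random

items = [1, 2, 3, 4, 5]

# Unseeded shuffle: the order depends on the generator's current state,
# so it will usually differ from one run of the script to the next.
unseeded = items.copy()
random.shuffle(unseeded)
print("Unseeded:", unseeded)


def seeded_shuffle(values, seed=42):
    """Shuffle a copy of `values` with a private, fixed-seed generator."""
    rng = random.Random(seed)  # independent generator, does not touch global state
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


# Seeded shuffle: the same seed gives the same order on every call and every run.
print("Seeded:  ", seeded_shuffle(items))
print("Repeatable:", seeded_shuffle(items) == seeded_shuffle(items))
```

Using a private `random.Random(seed)` instance instead of `random.seed()` keeps the fixed seed from affecting other code that relies on the global generator.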


### 2) Random vs Deterministic Splits

```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(10).reshape((5, 2))
y = np.array([0, 1, 2, 3, 4])

# Random Split (different each time)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print("Random Split:", y_train, y_test)

# Deterministic Split (using random_state)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print("Deterministic Split:", y_train, y_test)
```

```text
Random Split: [0 1 4] [3 2]
Deterministic Split: [2 0 3] [1 4]
```


Running the same cell a second time gives a different random split, while the seeded split is unchanged:

```text
Random Split: [3 1 0] [2 4]
Deterministic Split: [2 0 3] [1 4]
```


- Without `random_state`, the split can differ on each run.
- With `random_state=42` (or any other fixed integer), the split stays the same every time you run it.
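
Rather than comparing printed output by eye, the repeatability can be checked in code. This is a minimal sketch; the helper `split_labels` is a hypothetical name introduced only for this check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape((5, 2))
y = np.arange(5)


def split_labels(seed):
    """Return (y_train, y_test) for a split made with the given random_state."""
    _, _, y_train, y_test = train_test_split(X, y, random_state=seed)
    return y_train, y_test


first = split_labels(42)
second = split_labels(42)

# Compare the two splits element by element.
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print("Identical splits with random_state=42:", same)
```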

### 3) Deterministic Selection in Hashing

In the hashing trick, determinism ensures that:
- The same word always maps to the same hashed feature (bucket).
- Preprocessing stays consistent across training and testing.

Example:
- If the word "apple" is hashed to index 3 in the training data, it must also be hashed to index 3 in the testing data. That is a deterministic mapping.

#### What Happens in the Hashing Trick?

In the hashing trick, each feature (like a word, token, or value) is converted into a numerical index using a hash function, typically by taking the hash value modulo the number of buckets.

Example:
- hash("apple") → some integer → bucket 3
- hash("banana") → some integer → bucket 7

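```python
print(hash("apple"))
print(hash("banana"))
```

```text
-8840528395652088597
-4358372444463375106
```

Note that Python's built-in `hash()` salts string hashes per interpreter process (unless the `PYTHONHASHSEED` environment variable is fixed), so these two integers will be different in a new session. The built-in function illustrates the idea of hashing, but it is not a suitable hash for the hashing trick.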


#### Problem Without Determinism

- If the hash function gave different outputs each time, then:

    - During training, "apple" might map to index 3,
    - But during testing, "apple" might map to index 5 instead!

- This means the model will treat the same feature as a completely different one, leading to:

    - Wrong feature alignment,
    - Decreased accuracy,
    - Inconsistent results (non-reproducible behavior).
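
The problem described above can be observed with Python's own built-in `hash()`, whose string hashes depend on the per-process `PYTHONHASHSEED`. The sketch below starts fresh interpreters with different seeds; the helper name `builtin_hash_in_new_process` and the bucket count of 10 are assumptions made for illustration.

```python
import os
import subprocess
import sys


def builtin_hash_in_new_process(text, seed):
    """Return hash(text) as computed by a fresh interpreter with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


n_buckets = 10  # hypothetical number of hashed features

# Two "sessions" with different hash seeds, standing in for training and testing runs.
train_hash = builtin_hash_in_new_process("apple", seed=1)
test_hash = builtin_hash_in_new_process("apple", seed=2)

print("training bucket:", train_hash % n_buckets)
print("testing bucket: ", test_hash % n_buckets)
print("raw hashes equal:", train_hash == test_hash)
```

The two buckets may occasionally coincide because there are only 10 of them, but the raw hash values differ, so the mapping cannot be relied on across sessions.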

#### Solution: Deterministic Hashing

A deterministic hash function always produces the same output for the same input.

That means:
- A deterministic hash of "apple" always gives the same index (say 3), in every run and on every machine.

- So:
    - Both training and testing phases interpret "apple" as the same feature.
    - The feature representation remains **stable** and **consistent**.
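
A deterministic bucket assignment can be sketched with the standard library. This example assumes MD5 from `hashlib` as a stand-in stable hash (it is not the algorithm Scikit-learn uses), and `stable_bucket` and `n_buckets` are hypothetical names.

```python
import hashlib


def stable_bucket(token, n_buckets=10):
    """Map a token to a bucket index that is identical in every Python session."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then fold into the bucket range.
    return int.from_bytes(digest[:8], "big") % n_buckets


def hashed_counts(text, n_buckets=10):
    """Build a fixed-length count vector from whitespace-separated tokens."""
    vector = [0] * n_buckets
    for token in text.lower().split():
        vector[stable_bucket(token, n_buckets)] += 1
    return vector


print(stable_bucket("apple"), stable_bucket("banana"))
print(hashed_counts("apple banana apple"))
```

Because the digest depends only on the bytes of the token, the same word lands in the same bucket during training and testing, regardless of `PYTHONHASHSEED`.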

#### Example
- Scikit-learn’s HashingVectorizer is designed to be deterministic:
    - It uses a non-random, fixed hashing algorithm (MurmurHash3) internally.
    - So the same text input will always generate the same feature vector.

```python
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=10)
print(vectorizer.transform(["apple banana"]).toarray())
print(vectorizer.transform(["apple banana"]).toarray())  # same result every time
```

```text
[[ 0.70710678  0.          0.          0.          0.         -0.70710678
   0.          0.          0.          0.        ]]
[[ 0.70710678  0.          0.          0.          0.         -0.70710678
   0.          0.          0.          0.        ]]
```


Both results are identical, because the hashing process is deterministic.
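
The same property should hold across separate vectorizer objects, which is closer to a real train/test setup. The sketch below assumes `norm=None` and `alternate_sign=False` so that the vectors hold plain non-negative counts; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

# Two independent instances, standing in for the training and testing pipelines.
train_vec = HashingVectorizer(n_features=10, norm=None, alternate_sign=False)
test_vec = HashingVectorizer(n_features=10, norm=None, alternate_sign=False)

train_row = train_vec.transform(["apple"]).toarray()[0]
test_row = test_vec.transform(["apple banana"]).toarray()[0]

# Column that "apple" occupies according to the training-side vectorizer.
apple_col = int(np.flatnonzero(train_row)[0])

print("apple column:", apple_col)
print("same column is non-zero on the testing side:", bool(test_row[apple_col] > 0))
```

No `fit` call is needed because the vectorizer is stateless: the column index comes from the hash of the token alone.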

### 4) Why Determinism Matters in Machine Learning

#### Reproducibility:
- Two data scientists running the same code should get the same result.

#### Consistency:
- Your model's predictions for the same input should not change from one run to the next.

#### Debugging:
- If results change on every run, errors become very hard to isolate and reproduce.

#### Model Deployment:
- Production systems must behave predictably on the same input data.

### Learning:
- Deterministic selection = same output for the same input → reproducible.
- In Scikit-learn, use `random_state` to fix randomness.
- Hashing and random splits should be deterministic for reliable ML workflows.
    - In hashing, determinism makes sure the same input always produces the same feature representation. This keeps model training and testing consistent, accurate, and reproducible.
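
As a closing sketch of these points, the example below fixes `random_state` at every step that involves randomness and checks that two full runs agree. The synthetic dataset, the choice of `RandomForestClassifier`, and the helper name `run_experiment` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def run_experiment(seed):
    """Generate data, split, train, and predict, seeding every random step."""
    X, y = make_classification(n_samples=200, n_features=8, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    model.fit(X_train, y_train)
    return model.predict(X_test)


first = run_experiment(seed=0)
second = run_experiment(seed=0)

print("Predictions identical across runs:", np.array_equal(first, second))
```

Seeding only the split but not the estimator (or the reverse) would leave one source of randomness uncontrolled, so the predictions could still vary between runs.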