# Wrangling Data with Scikit-learn

### 1) Concept of Hashing in Data Processing

- **Hashing** = applying a hash function to data (like words or IDs) to convert them into fixed-size numerical values.

- A hash function takes input (e.g., a word like "apple") and outputs a number.

- The output number is within a fixed range (say 0–999 if we set 1000 buckets).

- This is very useful for handling large, high-dimensional data (like text, logs, categorical values).

#### Key properties of hashing in data processing:

- Deterministic → same input always gives the same output (e.g., "apple" will always map to the same number).

- Fixed range → regardless of dataset size, results fit into a fixed number of buckets.

- Fast & memory-efficient → avoids creating huge feature dictionaries.

#### Example (conceptual):

- Suppose we hash words into 10 buckets, where the bucket is the hash value modulo 10 (the hash values here are illustrative):

- **Word** → **Hash Value** → **Bucket**

    "apple" → 232 → 2

    "banana" → 546 → 6

    "car" → 832 → 2 (collision with apple!)
    

- When different words land in the same bucket (here "apple" and "car"), it is called a collision.
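
The bucket assignment above can be sketched in a few lines of Python. This is a minimal illustration, assuming `zlib.crc32` as a stand-in stable hash, so the hash values 232, 546, and 832 from the example will not be reproduced. The helper name `bucket_of` and the word list are hypothetical.

```python
import zlib
from collections import defaultdict

N_BUCKETS = 10  # same bucket count as the conceptual example

def bucket_of(word: str, n_buckets: int = N_BUCKETS) -> int:
    """Map a word to a bucket index in [0, n_buckets)."""
    # crc32 is an assumption here: a fast, stable, non-cryptographic hash
    return zlib.crc32(word.encode("utf-8")) % n_buckets

words = ["apple", "banana", "car", "data", "science", "model",
         "feature", "vector", "matrix", "label", "train", "test"]

# Group words by the bucket they land in
buckets = defaultdict(list)
for w in words:
    buckets[bucket_of(w)].append(w)

# 12 words cannot fit into 10 buckets without at least one collision
for b, members in sorted(buckets.items()):
    tag = "  <- collision" if len(members) > 1 else ""
    print(b, members, tag)
```

Because there are more words than buckets, at least one bucket must hold two or more words, whichever hash function is used.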

### 2) Why and When to Use Hashing in Machine Learning

#### Why use it?

- **Scalability:** Works with very large vocabularies (millions of words).

- **Speed:** Faster than building large dictionaries.

- **Low Memory Usage:** Fixed-size representation saves RAM.

- **Streaming Data:** Useful when new features appear continuously (no need to rebuild dictionary).

#### When to use it?

- Text data (like tweets, reviews, logs) → feature extraction without storing the whole vocabulary.

- Large categorical variables (like millions of unique IDs, products, or URLs).

- Online learning systems where data arrives in a stream and vocabulary keeps changing.

#### Example (Text Classification Scenario)

- Without hashing:

    - We must store a dictionary:
    { "apple":0, "banana":1, "car":2, ... }

- With hashing:
    - We skip the dictionary.
    - Each word is directly hashed into a number → mapped into a fixed-size vector.
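
The contrast between the two approaches can be sketched as follows. This is a rough illustration rather than how any particular library implements it: `zlib.crc32` stands in for the hash function, and `dict_vector`, `hashed_vector`, and `N_FEATURES` are hypothetical names.

```python
import zlib

docs = [["apple", "banana"], ["car", "apple", "apple"]]

# Without hashing: build and keep a word -> index dictionary.
vocab = {}
for doc in docs:
    for word in doc:
        vocab.setdefault(word, len(vocab))

def dict_vector(doc):
    vec = [0] * len(vocab)  # vector length grows with the vocabulary
    for word in doc:
        vec[vocab[word]] += 1
    return vec

# With hashing: no dictionary, the index comes from the hash itself.
N_FEATURES = 8  # assumed fixed vector size

def hashed_vector(doc, n_features=N_FEATURES):
    vec = [0] * n_features
    for word in doc:
        vec[zlib.crc32(word.encode("utf-8")) % n_features] += 1
    return vec

print(dict_vector(docs[1]))       # length depends on how many words were seen
print(hashed_vector(docs[1]))     # length is always N_FEATURES
print(hashed_vector(["durian"]))  # an unseen word needs no dictionary update
```

The dictionary version raises a `KeyError` for a word it has never seen, while the hashed version handles any word at the cost of possible collisions.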
    
    
### Summary:

- Hashing = mapping input data to fixed-size numeric buckets using a hash function.

- Advantage: fast, memory-efficient, scalable for huge text or categorical datasets.

- Used in ML when vocab/feature space is very large or streaming.

- Limitation: collisions (different items map to same bucket) → but usually acceptable in practice.

## Using Hash Functions
### 3) Properties of Hash Functions

When we use hashing in data processing or ML, we need hash functions with certain properties:

##### 1) Determinism 

- The same input always produces the same output.

- Example: hash("apple") → always the same value (in the same Python session).

##### 2) Uniformity 

- Hash values should be evenly distributed across the range (buckets).

- Avoids “clustering” where too many items land in the same bucket.

##### 3) Speed 

- Hashing should be fast to compute since it is used for very large datasets.
- Cryptographic hashes are built for security rather than speed, so they are slower than ML wrangling needs.

##### 4) Fixed Range (Modulus Operation)

- Usually we use modulus (% N) to keep values in a fixed range of buckets.

##### Important Note:
- In machine learning, we usually prefer fast and simple hash functions (not cryptographic ones like SHA256).
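
To see the fixed-range and uniformity properties together, one option is to hash many keys and count how many land in each bucket. The sketch below assumes `zlib.crc32` as the hash and uses synthetic `user_<i>` keys; how even the counts turn out depends on both the hash function and the keys.

```python
import zlib
from collections import Counter

N_BUCKETS = 8
# Synthetic keys (hypothetical IDs) standing in for real feature names
keys = [f"user_{i}" for i in range(8000)]

# The modulus keeps every hash value inside [0, N_BUCKETS)
counts = Counter(zlib.crc32(k.encode("utf-8")) % N_BUCKETS for k in keys)

expected = len(keys) / N_BUCKETS
for bucket in range(N_BUCKETS):
    ratio = counts[bucket] / expected
    print(bucket, counts[bucket], f"{ratio:.2f}x expected")
```

A ratio close to 1.00 in every bucket suggests the hash spreads these keys evenly. Large deviations would indicate clustering.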

### 4) Examples of Hash Functions in Python
#### A. Built-in hash() function

    # Python built-in hash function
    print(hash("apple"))
    print(hash("banana"))
    print(hash("car"))

Output:

    -5445322363184041899
    -8279985716640786902
    5190923733118357545


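- **Note:** For strings, Python's built-in hash() changes between runs because of hash randomization (a security measure), unless the PYTHONHASHSEED environment variable is set to a fixed value.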
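
The run-to-run behaviour can be checked by calling `hash()` in fresh interpreter processes. This sketch assumes a standard CPython setup where `sys.executable` points to the interpreter. `hash_in_fresh_process` is a hypothetical helper, and the seed values are arbitrary.

```python
import os
import subprocess
import sys

def hash_in_fresh_process(word: str, seed: str) -> int:
    """Run hash(word) in a new interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({word!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())

# Same seed: the value is reproducible across processes
first = hash_in_fresh_process("apple", "123")
second = hash_in_fresh_process("apple", "123")
print(first == second)

# Different seeds: the values generally differ
print(hash_in_fresh_process("apple", "1"), hash_in_fresh_process("apple", "2"))
```

For feature hashing that must be reproducible across runs, a hash that does not depend on the process (such as the one inside scikit-learn's hashing tools) is usually a safer choice than `hash()`.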

#### B. Using hashlib (Cryptographic Hashes)

    import hashlib

    # SHA256 hash
    word = "apple"
    hash_value = hashlib.sha256(word.encode()).hexdigest()
    print("SHA256:", hash_value)

Output:

    SHA256: 3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b


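- Cryptographic hashes (MD5, SHA1, SHA256) are deterministic and uniform but slower. Of these, only SHA256 is still considered collision-resistant; MD5 and SHA1 have known collision attacks.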
- Used in security/forensics, not usually in ML feature hashing.
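
If a cryptographic digest is used for bucketing anyway (for example, when reproducibility matters more than speed), the digest has to be turned into an integer first. A minimal sketch, where `sha256_bucket` is a hypothetical helper and taking the first 8 bytes is an arbitrary choice:

```python
import hashlib

def sha256_bucket(word: str, n_buckets: int) -> int:
    """Map a word to a bucket using SHA256 (stable across runs and machines)."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then apply the modulus
    return int.from_bytes(digest[:8], "big") % n_buckets

for w in ["apple", "banana", "car"]:
    print(w, sha256_bucket(w, 1000))
```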

#### C. Using Scikit-learn’s Hashing for ML

    from sklearn.feature_extraction.text import HashingVectorizer

    corpus = ["I love data science", "Data wrangling with scikit-learn"]

    hv = HashingVectorizer(n_features=10)  # 10 buckets
    X = hv.fit_transform(corpus)

    print("Hashed Feature Matrix:\n", X.toarray())

Output:

    Hashed Feature Matrix:
     [[ 0.         -0.57735027  0.         -0.57735027  0.          0.57735027
       0.          0.          0.          0.        ]
     [-0.37796447  0.          0.          0.          0.          0.37796447
       0.         -0.37796447  0.         -0.75592895]]


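Here, each word is hashed directly into one of 10 buckets. Negative entries appear because alternate_sign=True by default (the sign is also derived from the hash), and each row is L2-normalized (norm='l2' by default), which is why the values are fractions rather than raw counts.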

- No need to store vocabulary.

- Useful for text processing with large data.
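
The same idea applies to large categorical variables such as IDs, products, or URLs. A small sketch using scikit-learn's `FeatureHasher` with `input_type="string"`; the records and the `field=value` token format are made up for illustration.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical records: each row is a list of "field=value" tokens
rows = [
    ["user_id=u1842", "product=p77", "url=/home"],
    ["user_id=u0031", "product=p12", "url=/cart"],
]

# 16 buckets is an arbitrary choice for the demo
hasher = FeatureHasher(n_features=16, input_type="string")
X = hasher.transform(rows)  # stateless: no fit step is needed

print(X.shape)
print(X.toarray())
```

The number of columns stays at `n_features` no matter how many distinct users, products, or URLs appear.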

### Summary

- Hash functions must be deterministic, uniform, and fast.

- Python offers:

    - hash() (fast, but string hashes change across runs)
    - hashlib (cryptographic, slower, more secure)
    - HashingVectorizer in Scikit-learn (practical for ML text wrangling).

## Demonstrating the Hashing Trick
### 5) Implementing the Hashing Trick in Scikit-learn

Scikit-learn provides HashingVectorizer to apply the hashing trick directly.

#### HashingVectorizer

    from sklearn.feature_extraction.text import HashingVectorizer

    # Example text corpus
    corpus = [
        "I love data science",
        "Data wrangling with scikit-learn",
        "Machine learning needs clean data"
    ]

    # Apply hashing trick
    hv = HashingVectorizer(n_features=10, alternate_sign=False)  # 10 buckets
    X_hash = hv.fit_transform(corpus)

    print("Hashed Feature Matrix:\n", X_hash.toarray())

Output:

    Hashed Feature Matrix:
     [[0.         0.57735027 0.         0.57735027 0.         0.57735027
      0.         0.         0.         0.        ]
     [0.37796447 0.         0.         0.         0.         0.37796447
      0.         0.37796447 0.         0.75592895]
     [0.4472136  0.         0.4472136  0.         0.4472136  0.4472136
      0.4472136  0.         0.         0.        ]]


- Each word is hashed into one of 10 buckets (features).

- Vocabulary is not stored → very memory-efficient.

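- alternate_sign=False keeps all values non-negative; with the default (True), the sign is taken from the hash so that collisions tend to cancel out.

- Collisions are possible (different words → same bucket).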
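
To make the mechanics concrete, here is a rough pure-Python sketch of the hashing trick. It is not scikit-learn's implementation: `zlib.crc32` is an assumed stand-in (scikit-learn uses MurmurHash3), so bucket positions will differ from the matrix above. `hashing_trick` is a hypothetical function name.

```python
import math
import re
import zlib

TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters

def hashing_trick(text, n_features=10, alternate_sign=False):
    """Hash tokens into a fixed-size, L2-normalized vector."""
    vec = [0.0] * n_features
    for token in TOKEN.findall(text.lower()):
        h = zlib.crc32(token.encode("utf-8"))  # assumed stand-in hash
        index = h % n_features
        # Optionally take a sign from one hash bit so collisions tend to cancel
        sign = -1.0 if (alternate_sign and (h >> 31) & 1) else 1.0
        vec[index] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

row = hashing_trick("I love data science")
print([round(v, 4) for v in row])
```

With three tokens in three different buckets, each non-zero entry is 1/√3 ≈ 0.5774, the same magnitude that appears in the first row of the output above.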

### 6) Comparing with Traditional Vectorization
#### A. CountVectorizer (Bag of Words)

    from sklearn.feature_extraction.text import CountVectorizer

    cv = CountVectorizer()
    X_count = cv.fit_transform(corpus)

    print("Vocabulary:", cv.get_feature_names_out())
    print("Count Vectorizer Matrix:\n", X_count.toarray())

Output:

    Vocabulary: ['clean' 'data' 'learn' 'learning' 'love' 'machine' 'needs' 'science'
     'scikit' 'with' 'wrangling']
    Count Vectorizer Matrix:
     [[0 1 0 0 1 0 0 1 0 0 0]
     [0 1 1 0 0 0 0 0 1 1 1]
     [1 1 0 1 0 1 1 0 0 0 0]]


- Creates a dictionary of words → indices.

- Good for small data, but memory-heavy for large vocabularies.
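
One practical consequence of the stored dictionary is how unseen words are handled at transform time. The sketch below uses a made-up new document whose words do not occur in the training texts.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

train = ["I love data science", "Data wrangling with scikit-learn"]
new_doc = ["neural networks everywhere"]  # none of these words occur in train

cv = CountVectorizer().fit(train)  # learns a fixed vocabulary from train
hv = HashingVectorizer(n_features=10, alternate_sign=False)  # nothing to learn

X_cv = cv.transform(new_doc)
X_hv = hv.transform(new_doc)

print("CountVectorizer non-zeros:", X_cv.nnz)    # unseen words are dropped
print("HashingVectorizer non-zeros:", X_hv.nnz)  # unseen words still get buckets
```

CountVectorizer silently ignores words outside its vocabulary, so the new document becomes an all-zero row. The hashing approach still assigns every word a bucket.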

#### B. TF-IDF Vectorizer

    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf = TfidfVectorizer()
    X_tfidf = tfidf.fit_transform(corpus)

    print("Vocabulary:", tfidf.get_feature_names_out())
    print("TF-IDF Matrix:\n", X_tfidf.toarray())

Output:

    Vocabulary: ['clean' 'data' 'learn' 'learning' 'love' 'machine' 'needs' 'science'
     'scikit' 'with' 'wrangling']
    TF-IDF Matrix:
     [[0.         0.38537163 0.         0.         0.65249088 0.
      0.         0.65249088 0.         0.         0.        ]
     [0.         0.28321692 0.47952794 0.         0.         0.
      0.         0.         0.47952794 0.47952794 0.47952794]
     [0.47952794 0.28321692 0.         0.47952794 0.         0.47952794
      0.47952794 0.         0.         0.         0.        ]]


- Adjusts word counts using importance (rare words get higher weight).

- Often works better for text classification than plain counts.
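
The weights in the first row of the TF-IDF matrix can be reproduced by hand. This sketch assumes TfidfVectorizer's default settings: smoothed idf, computed as ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. `tfidf_row` is a hypothetical helper.

```python
import math
import re
from collections import Counter

corpus = [
    "I love data science",
    "Data wrangling with scikit-learn",
    "Machine learning needs clean data",
]
# Lowercase and keep tokens of two or more word characters
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in docs for term in set(doc))

def tfidf_row(doc):
    tf = Counter(doc)
    # Smoothed idf times term count, then L2-normalize the row
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

row = tfidf_row(docs[0])
print({t: round(w, 4) for t, w in sorted(row.items())})
# {'data': 0.3854, 'love': 0.6525, 'science': 0.6525}
```

"data" appears in all three documents, so its idf is the minimum value of 1. "love" and "science" appear in only one document and therefore get a higher weight.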

### Summary:

- CountVectorizer: simple word counts (memory-heavy for big data).

- TF-IDF: weighted counts → considers importance of words.

- HashingVectorizer: no vocabulary stored, scalable, but risk of collisions.

##### Rule of Thumb:

- Use Count/TF-IDF for small to medium datasets.

- Use HashingVectorizer for large-scale or streaming text.
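
For the streaming case, HashingVectorizer pairs naturally with a model that supports incremental training. The sketch below assumes an `SGDClassifier` trained with `partial_fit`. The mini-batches, labels, and `n_features=2**12` are made up for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical mini-batches of (texts, labels) arriving over time
batches = [
    (["great product, love it", "terrible, broke in a day"], [1, 0]),
    (["love the quality", "awful support, terrible"], [1, 0]),
]

vectorizer = HashingVectorizer(n_features=2**12, alternate_sign=False)
clf = SGDClassifier(random_state=0)

for texts, labels in batches:
    X = vectorizer.transform(texts)  # stateless: no fit step, no vocabulary
    # The full set of classes must be declared for incremental training
    clf.partial_fit(X, labels, classes=[0, 1])

pred = clf.predict(vectorizer.transform(["love this great product"]))
print(pred)
```

Because the vectorizer never stores a vocabulary, new words in later batches need no refitting, and the model's input width stays fixed at `n_features`.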