# Session 11 – Hashing Techniques and Optimized Data Structures in Python

_Exercise notebook (without solutions)._

## Exercise 1 – Indexing PDF books with hashing

Create a system that:
- Iterates through multiple PDF files
- Computes the SHA-256 hash of each file
- Stores the results in a dictionary
- Identifies duplicate files based on their hash

In [14]:
import hashlib
from pathlib import Path

folder_path = Path("ex1_pdfs")

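def compute_hash(filepath):
    """Return the SHA-256 hex digest of the file at filepath."""
    sha = hashlib.sha256()
    with open(filepath, "rb") as f:
        # Read in 4 KiB chunks so large PDFs are never loaded into memory at once.
        while chunk := f.read(4096):
            sha.update(chunk)
    return sha.hexdigest()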

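# Map each digest to the list of file names that share it.
hashes = {}
duplicates = []

for file in folder_path.iterdir():
    if file.is_file() and file.suffix == '.pdf':
        hashes.setdefault(compute_hash(file), []).append(file.name)

# Files with identical content produce the same digest, so any digest
# with more than one file name marks a group of duplicates.
for digest, names in hashes.items():
    is_duplicate = len(names) > 1
    print(digest, '[DUPLIC]' if is_duplicate else '[UNIQUE]', names)
    if is_duplicate:
        duplicates.append({digest: names})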


58bd9a6814d01231253ed537c8308c9d323534172f0c499ab1e640e1bcc0c293 [DUPLIC] ['book_alpha_copy.pdf', 'book_alpha.pdf']
8c17eac149c61478980d1b956ff8fd3b0970f32762fbc64578f7ec81d43fddf6 [DUPLIC] ['book_beta_copy.pdf', 'book_beta.pdf']
5c2e490ef930af42dac33db0e4abf9f7b9c0fd5a75da1245eb2a787417f11d29 [UNIQUE] ['book_gamma.pdf']
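
Hashing every file works, but files of different sizes can never be byte-identical, so a size check can rule most of them out before any hashing happens. The following is a minimal sketch of that idea, not part of the original solution: `sha256_of`, `find_duplicates`, and the temporary demo files are hypothetical names and data chosen for illustration.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(folder, pattern="*.pdf"):
    """Group files by size first, then hash only sizes shared by 2+ files."""
    by_size = defaultdict(list)
    for path in Path(folder).glob(pattern):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have an identical copy
        for path in paths:
            by_hash[sha256_of(path)].append(path.name)

    # Keep only digests shared by several files; sort names for stable output.
    return {h: sorted(names) for h, names in by_hash.items() if len(names) > 1}


# Hypothetical demo data in a temporary folder (not the ex1_pdfs folder).
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    (tmp_path / "a.pdf").write_bytes(b"same content")
    (tmp_path / "a_copy.pdf").write_bytes(b"same content")
    (tmp_path / "b.pdf").write_bytes(b"different content, different size")
    (tmp_path / "c.pdf").write_bytes(b"SAME CONTENT")  # same size, other bytes
    demo_duplicates = find_duplicates(tmp_path)

print(demo_duplicates)
```

In this demo only `a.pdf` and `a_copy.pdf` end up in the result: `b.pdf` is skipped without being hashed, while `c.pdf` matches on size but is told apart by its digest.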


## Exercise 2 – Finding duplicates in 10,000 names

Generate or load a large list of names.
- Use a `set` for fast duplicate detection
- Display the duplicate names
- Measure execution time

In [20]:
import random
import time

def generate_names(total=10_000, unique_pool=2_000, seed=42):
    random.seed(seed)
    first_names = [
        "Liam", "Noah", "Oliver", "Elijah", "James", "William", "Benjamin", "Lucas", "Henry", "Alexander",
        "Olivia", "Emma", "Charlotte", "Amelia", "Sophia", "Isabella", "Ava", "Mia", "Evelyn", "Luna"
    ]
    last_names = [
        "Smith", "Johnson", "Williams", "Brown", "Jones", "Garcia", "Miller", "Davis", "Rodriguez", "Martinez",
        "Hernandez", "Lopez", "Gonzalez", "Wilson", "Anderson", "Thomas", "Taylor", "Moore", "Jackson", "Martin"
    ]

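    # Only 20 * 20 = 400 distinct full names exist, so the pool itself contains
    # repeats and the number of unique names can never exceed 400.
    pool = [f"{random.choice(first_names)} {random.choice(last_names)}" for _ in range(unique_pool)]
    return [random.choice(pool) for _ in range(total)]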

names = generate_names(total=10_000, unique_pool=2_000)

start = time.perf_counter()
seen = set()
duplicates = set()

for name in names:
    if name in seen:
        duplicates.add(name)
    else:
        seen.add(name)

elapsed = time.perf_counter() - start

print(f"Total names: {len(names)}")
print(f"Unique names: {len(seen)}")
print(f"Duplicate names found: {len(duplicates)}")
print(f"Execution time: {elapsed:.6f} seconds")
print("Sample duplicates:", sorted(duplicates)[:4])


Total names: 10000
Unique names: 397
Duplicate names found: 397
Execution time: 0.002777 seconds
Sample duplicates: ['Alexander Anderson', 'Alexander Brown', 'Alexander Davis', 'Alexander Garcia']
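
The set-based loop reports which names repeat, but not how often. If the counts matter, `collections.Counter` (also backed by a hash table) is one possible alternative. The sketch below uses a small hypothetical list, `sample_names`, in place of the generated 10,000 names.

```python
from collections import Counter

# Hypothetical sample; the generated list of names would work the same way.
sample_names = [
    "Emma Smith", "Liam Brown", "Emma Smith",
    "Ava Jones", "Liam Brown", "Emma Smith",
]

# One pass over the list; each name is hashed once per occurrence.
name_counts = Counter(sample_names)

# Names that occur more than once, with their number of occurrences.
repeated = {name: n for name, n in name_counts.items() if n > 1}

print(repeated)
```

Both approaches do a single pass with constant-time lookups on average; the `Counter` version simply keeps one integer per distinct name in addition to the name itself.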


## Exercise 3 – Top 5 most frequent words (Counter)

Take a long text as input, then:
- Normalize the text (lowercase, remove punctuation)
- Use `collections.Counter`
- Display the top 5 most frequent words

In [49]:
from collections import Counter

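with open("ex3_text.txt", 'r', encoding='utf-8') as f:
    # Note: strip() only trims these characters from the ends of each line;
    # punctuation attached to words in the middle of a line is left in place.
    data = [x.strip('\n.,;-') for x in f.readlines()]

# Split each line on spaces and lowercase every word.
full_data = [y.strip().lower() for x in data for y in x.split(' ')]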

counts = Counter(full_data)

for k, v in counts.most_common(5):
    print(v, k)

7 data
6 and
5 python
3 hashing
3 in
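
The cell above trims punctuation only at the ends of each line, so a word followed by a comma in the middle of a line is counted separately from the bare word. A regular expression is one way to normalize every word. This is a sketch on a hypothetical `sample_text`, not on `ex3_text.txt`, so its counts are unrelated to the output above.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for ex3_text.txt.
sample_text = """Hashing maps data to keys; Python uses hashing in dicts.
Data, data, and more data: Python's sets rely on hashing too."""

# Lowercase first, then keep runs of letters, digits and apostrophes as words.
# Everything else (spaces, newlines, punctuation) acts as a separator.
tokens = re.findall(r"[a-z0-9']+", sample_text.lower())

top_words = Counter(tokens).most_common(2)
print(top_words)
```

Because `re.findall` only returns non-empty matches, repeated spaces or blank lines cannot produce empty tokens. The character class is an assumption suited to English text; accented letters would need a broader pattern such as `\w`.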


## Exercise 4 – Build your own HashMap

Manually implement a HashMap structure:
- Create a `HashMap` class
- Define a hash function
- Handle collisions (bucket chaining using lists)
- Implement methods: `put`, `get`, `delete`

In [50]:
class HashMap:
    def __init__(self, size=10):
        self.size = size
        self.buckets = [[] for _ in range(size)]

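    def _hash(self, key):
        # Reduce Python's built-in hash() to a bucket index. String hashes are
        # randomized per process, so bucket placement can differ between runs.
        return hash(key) % self.size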

    def put(self, key, value):
        i = self._hash(key)
        bucket = self.buckets[i]

        for item in bucket:
            if item[0] == key:
                item[1] = value
                return

        bucket.append([key, value])

    def get(self, key):
        i = self._hash(key)
        bucket = self.buckets[i]

        for k, v in bucket:
            if k == key:
                return v

        return None

    def delete(self, key):
        i = self._hash(key)
        bucket = self.buckets[i]

        for j, item in enumerate(bucket):
            if item[0] == key:
                del bucket[j]
                return True

        return False


hmap = HashMap(size=5)

hmap.put("apple", 10)
hmap.put("banana", 20)
hmap.put("orange", 30)
hmap.put("apple", 99)

print("apple ->", hmap.get("apple"))
print("banana ->", hmap.get("banana"))
print("grape ->", hmap.get("grape"))

print("delete banana:", hmap.delete("banana"))
print("banana ->", hmap.get("banana"))
print("delete grape:", hmap.delete("grape"))


apple -> 99
banana -> 20
grape -> None
delete banana: True
banana -> None
delete grape: False
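
With a fixed number of buckets, chains grow as more keys are added and lookups drift toward a linear scan. A common remedy is to grow the table once the load factor (items per bucket) passes a threshold. The sketch below is an illustrative variant, not the exercise's required solution; `ResizingHashMap`, `max_load`, and `_resize` are hypothetical names.

```python
class ResizingHashMap:
    """Chained hash map that doubles its bucket count when it gets crowded."""

    def __init__(self, size=4, max_load=0.75):
        self.size = size
        self.max_load = max_load  # assumed threshold, chosen for illustration
        self.count = 0
        self.buckets = [[] for _ in range(size)]

    def _bucket(self, key):
        return self.buckets[hash(key) % self.size]

    def put(self, key, value):
        bucket = self._bucket(key)
        for item in bucket:
            if item[0] == key:
                item[1] = value  # overwrite an existing key in place
                return
        bucket.append([key, value])
        self.count += 1
        if self.count / self.size > self.max_load:
            self._resize(self.size * 2)

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

    def _resize(self, new_size):
        old_items = [item for bucket in self.buckets for item in bucket]
        self.size = new_size
        self.buckets = [[] for _ in range(new_size)]
        for key, value in old_items:
            # Bucket indices depend on the size, so every item is re-placed.
            self._bucket(key).append([key, value])


demo = ResizingHashMap(size=4)
for n in range(10):
    demo.put(n, n * n)  # small non-negative ints hash to themselves

longest_chain = max(len(b) for b in demo.buckets)
print(demo.size, demo.count, longest_chain)
```

Integer keys are used in the demo because their hash values are predictable, which makes the effect of resizing easy to inspect; with string keys the bucket layout varies from run to run.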


## Exercise 5 – Auto-complete using prefix hashing

Create a simple auto-complete system:
- Receive a list of words
- Build a hashing-based structure for prefixes
- Return all words starting with a given prefix
- Optimize for fast lookup

In [51]:
from pathlib import Path

words_path = Path("ex5_words.txt")
words = [line.strip().lower() for line in words_path.read_text(encoding="utf-8").splitlines() if line.strip()]

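# Index every prefix of every word, so a lookup is a single dictionary access.
# The trade-off is memory: each word is stored once per prefix length.
prefix_map = {}

for word in words:
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        prefix_map.setdefault(prefix, []).append(word)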

def autocomplete(prefix):
    return prefix_map.get(prefix.lower(), [])

print("Total words loaded:", len(words))
print("Total prefixes indexed:", len(prefix_map))
print("ap ->", autocomplete("ap"))
print("sta ->", autocomplete("sta"))
print("py ->", autocomplete("py"))
print("zzz ->", autocomplete("zzz"))


Total words loaded: 36
Total prefixes indexed: 120
ap -> ['apple', 'application', 'apply', 'apricot']
sta -> ['stack', 'stamina', 'stamp', 'standard', 'start', 'starter', 'state', 'station']
py -> ['python', 'pycharm', 'pyramid', 'pytorch']
zzz -> []
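
For comparison, here is a sketch of a different technique that does not use hashing: keep the words in a sorted list and find the prefix range with binary search. Lookups cost O(log n) instead of a single dictionary access, but each word is stored only once. The word list `sample_words` and the function `autocomplete_sorted` are hypothetical; `ex5_words.txt` is not reproduced here.

```python
from bisect import bisect_left

# Hypothetical sample; a set removes duplicates before sorting.
sample_words = sorted(
    {"apple", "apply", "apricot", "banana", "stack", "stamp", "start", "python"}
)


def autocomplete_sorted(prefix, words=sample_words):
    """Return all words starting with prefix, via binary search on a sorted list."""
    prefix = prefix.lower()
    lo = bisect_left(words, prefix)
    # "\uffff" is assumed to sort after any character used in the words,
    # so prefix + "\uffff" is an upper bound for everything with this prefix.
    hi = bisect_left(words, prefix + "\uffff")
    return words[lo:hi]


print("ap ->", autocomplete_sorted("ap"))
print("sta ->", autocomplete_sorted("sta"))
print("zzz ->", autocomplete_sorted("zzz"))
```

Unlike the prefix dictionary, this version returns matches in alphabetical order rather than file order, and adding a word requires inserting it at the right position to keep the list sorted.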
