<a href="https://colab.research.google.com/github/Anze-/datathon2k25/blob/alberto/feature_engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Feature Engineering

# 1. Set up working environment

In [1]:
import pandas as pd
import numpy as np
import csv

In [2]:
# Enable the GPU if needed. A GPU can speed up vector embedding when you compute the vectors locally (rather than through an API).

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cpu


In [3]:
import os
import json
import chromadb
import openai
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

# Set OpenAI API Key
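# strip() drops the trailing newline without cutting a character when the file has none
with open('api.key') as key_file:
    os.environ["OPENAI_API_KEY"] = key_file.read().strip()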

# 2. Working on URLs

In [5]:
folder_path = "./data/hackathon_data/"  # Google Drive path of the dataset
files_in_folder = os.listdir(folder_path)
len(files_in_folder)

13144

In [6]:
def load_documents(json_file):
    """Loads the JSON file."""
    with open(json_file, 'r') as f:
      try:
          data = json.load(f)
          return data
      except json.JSONDecodeError:
          print(f"Error reading {json_file}, it may not be a valid JSON file.")
    return []

In [23]:
def load_urls(file=files_in_folder[0]):
    website_name=file[:-5]
    urldocs=load_documents(folder_path+file)['text_by_page_url']
    return {website_name:list(urldocs.keys())}

In [25]:
load_urls(files_in_folder[0])

{'skysolutions.com': ['http://skysolutions.com',
  'http://skysolutions.com/',
  'https://skysolutions.com/services/',
  'https://skysolutions.com/services/ai-solutions/',
  'https://skysolutions.com/services/ai-solutions/enterprise-data-management-and-platform-enablement/',
  'https://skysolutions.com/services/ai-solutions/ai-ml-and-advanced-analytics/',
  'https://skysolutions.com/services/ai-solutions/generative-ai-platforms-and-solutions/',
  'https://skysolutions.com/services/digital-transformation/',
  'https://skysolutions.com/services/digital-transformation/human-centered-design/',
  'https://skysolutions.com/services/digital-transformation/agile-safe-agile-product-management/',
  'https://skysolutions.com/services/digital-transformation/ci-cd-and-devsecops-practices/',
  'https://skysolutions.com/services/digital-transformation/low-code-no-code-development/',
  'https://skysolutions.com/services/digital-transformation/legacy-modernization-and-cloud-migration/',
  'https://skys

In [36]:
#choose a labeling set

np.random.seed(42)
labeling_set = np.random.choice(files_in_folder,100)

In [42]:
labelurls = []
for L in [list(load_urls(site).values())[0] for site in labeling_set]:
    labelurls.extend(L)

In [43]:
labelurls

['http://warehouseanywhere.com',
 'https://www.warehouseanywhere.com',
 'https://www.warehouseanywhere.com/who-we-serve/medical-devices/',
 'https://www.warehouseanywhere.com/who-we-serve/field-service-repair/',
 'https://www.warehouseanywhere.com/who-we-serve/pharmaceuticals/',
 'https://www.warehouseanywhere.com/technology/',
 'https://www.warehouseanywhere.com/about/',
 'https://www.warehouseanywhere.com/about/careers/',
 'https://www.warehouseanywhere.com/resources/',
 'https://www.warehouseanywhere.com/privacy-policy/',
 'https://www.warehouseanywhere.com/resources/?type=case-study',
 'https://www.warehouseanywhere.com/resources/?type=blog',
 'https://www.warehouseanywhere.com/resources/?type=ebook',
 'https://www.warehouseanywhere.com/resources/?type=whitepaper',
 'https://www.warehouseanywhere.com/resources/what-is-an-inventory-management-system/',
 'https://www.warehouseanywhere.com/resources/diebold-nixdorf/',
 'https://www.warehouseanywhere.com/resources/6-steps-to-medical-de

## tokenizer

In [47]:
# tokenizer

import tldextract
from urllib.parse import urlparse, parse_qs

def simple_url_tokenizer(url: str):
    # 1. Parse the URL with urllib to get basic components
    parsed_url = urlparse(url)
    
    # 2. Use tldextract to get domain/subdomain
    extracted = tldextract.extract(url)
    
    # 3. Extract components
    scheme = parsed_url.scheme    # 'http', 'https', etc.
    domain = extracted.domain      # 'example'
    subdomain = extracted.subdomain  # 'www'
    path = parsed_url.path        # '/path/to/resource'
    query = parsed_url.query      # 'id=123&name=abc'

    # 4. Tokenize path: split on '/', then on '-', dropping empty pieces
    #    (a bare '/' path would otherwise yield an empty-string token)
    path_tokens = [
        tk
        for segment in path.split('/')
        for tk in segment.split('-')
        if tk
    ]

    # 5. Tokenize query parameters (key-value pairs)
    query_tokens = {k: v[0] for k, v in parse_qs(query).items()} if query else {}

    # Return all tokens in a dictionary
    return {
        #'scheme': scheme,
        #'subdomain': subdomain,
        #'domain': domain,
        'path_tokens': path_tokens,
        #'query_tokens': query_tokens
    }

In [48]:
# Apply the tokenizer to each URL
for url in labelurls:
    tokens = simple_url_tokenizer(url)
    print(f"Tokens for URL: {url}")
    print(tokens)
    print("-" * 50)

Tokens for URL: http://warehouseanywhere.com
{'path_tokens': []}
--------------------------------------------------
Tokens for URL: https://www.warehouseanywhere.com
{'path_tokens': []}
--------------------------------------------------
Tokens for URL: https://www.warehouseanywhere.com/who-we-serve/medical-devices/
{'path_tokens': ['who', 'we', 'serve', 'medical', 'devices']}
--------------------------------------------------
Tokens for URL: https://www.warehouseanywhere.com/who-we-serve/field-service-repair/
{'path_tokens': ['who', 'we', 'serve', 'field', 'service', 'repair']}
--------------------------------------------------
Tokens for URL: https://www.warehouseanywhere.com/who-we-serve/pharmaceuticals/
{'path_tokens': ['who', 'we', 'serve', 'pharmaceuticals']}
--------------------------------------------------
Tokens for URL: https://www.warehouseanywhere.com/technology/
{'path_tokens': ['technology']}
--------------------------------------------------
Tokens for URL: https://www.w

## model

In [None]:
# Import necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Example dataset (text and labels)
texts = [
    "I love programming", 
    "Python is great for data science", 
    "I hate bugs", 
    "Coding is fun", 
    "Debugging is frustrating",
    "I enjoy solving problems",
    "Software development is exciting", 
    "Errors are annoying"
]

labels = [1, 1, 0, 1, 0, 1, 1, 0]  # 1 = positive, 0 = negative sentiment (binary labels)

# 1. Convert text to features using TF-IDF vectorization
vectorizer = TfidfVectorizer()

# 2. Vectorize the text data
X = vectorizer.fit_transform(texts)

# 3. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# 4. Initialize and train the Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# 5. Make predictions on the test set
y_pred = model.predict(X_test)

# 6. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Print results
print("Accuracy:", accuracy)
print("Classification Report:")
print(class_report)


In [293]:
import re
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple
from tqdm import tqdm

# City list (geo_df, load_text and USCODES are assumed to be defined in cells not shown here)
cities = geo_df['ASCII Name'].values
cities_set = set(city.lower() for city in cities)

# Regex for Title Case OR ALL CAPS
pattern = re.compile(
    r'\b(?:' +
    '|'.join(
        rf'{re.escape(city.title())}|{re.escape(city.upper())}'
        for city in cities_set
    ) +
    r')\b'
)

# Match function — now returns position too
def contains_city(text: str) -> Tuple[bool, str, int]:
    match = pattern.search(text)
    if match:
        return True, match.group(0), match.start()
    return False, "", -1

# Run contains_city over all documents in parallel threads
def process_documents(documents: List[str], max_workers: int = 4):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(tqdm(executor.map(contains_city, documents), total=len(documents), desc="Processing Docs"))

# Example documents
docs = load_text()

# Run it
results = process_documents(docs)

# Output results
found_cities = []
contexts = {}
for doc, (found, city, pos) in zip(docs, results):
    if found:
        found_cities.append(city)
        if city in contexts.keys(): # add context
            contexts[city] = contexts[city]+' '+doc[pos:pos+30]
        else: #create context
            contexts[city] = doc[pos:pos+30]
        #print(f"{city}, Context: {doc[pos:pos+30]}")

found_cities = np.array(found_cities)

# Get unique values and their counts
cvalues, ccounts = np.unique(found_cities, return_counts=True)

# Find the most frequent (mode)

most_frequent = cvalues[np.argmax(ccounts)]
top_context = contexts[most_frequent]


# Now check if in the context

# Build regex to match full words (case-sensitive)
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(code) for code in USCODES) + r')\b')

# Find all matches
code_matches = pattern.findall(top_context)

# Find the most frequent one
if code_matches:
    values, counts = np.unique(code_matches, return_counts=True)
    found_state = values[np.argmax(counts)]
else:
    found_state = None

print(f'we are in {most_frequent}, {found_state}')

skysolutions.com.json

Processing Docs: 100%|█████████████████████████████████████| 70/70 [00:01<00:00, 40.73it/s]

we are in Herndon, VA





## Now run it on the dataset

In [330]:
cities = geo_df['ASCII Name'].values
cities_set = set(city.lower() for city in cities)

# Regex for Title Case OR ALL CAPS city names.
# Compiled once here rather than on every call: the alternation spans the whole city list.
CITY_PATTERN = re.compile(
    r'\b(?:' +
    '|'.join(
        rf'{re.escape(city.title())}|{re.escape(city.upper())}'
        for city in cities_set
    ) +
    r')\b'
)

def extract_location(baseurl=files_in_folder[0],max_workers=4,disabletqdm=True):
    # Documents
    docs = load_text(file=baseurl)

    # Reuse the precompiled city regex
    pattern = CITY_PATTERN
    
    # Match function — now returns position too
    def contains_city(text: str) -> Tuple[bool, str, int]:
        match = pattern.search(text)
        if match:
            return True, match.group(0), match.start()
        return False, "", -1
    
    def process_documents(documents: List[str], max_workers: int = 4):
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            return list(tqdm(executor.map(contains_city, documents), total=len(documents), desc="Processing Docs", disable=disabletqdm))
    
    # Run it
    results = process_documents(docs,max_workers=max_workers)
    
    # Output results
    found_cities = []
    contexts = {}
    for doc, (found, city, pos) in zip(docs, results):
        if found:
            found_cities.append(city)
            if city in contexts.keys(): # add context
                contexts[city] = contexts[city]+' '+doc[pos:pos+30]
            else: #create context
                contexts[city] = doc[pos:pos+30]
            #print(f"{city}, Context: {doc[pos:pos+30]}")
    
    found_cities = np.array(found_cities)
    
    # Get unique values and their counts
    cvalues, ccounts = np.unique(found_cities, return_counts=True)

    if len(ccounts) == 0:
        return {'city':None, 'state':None, 'context':None}
        
    # Find the most frequent (mode)
    most_frequent = cvalues[np.argmax(ccounts)]
    top_context = contexts[most_frequent]
    
    
    # Now check if in the context
  
    # Build regex to match full words (case-sensitive)
    pattern = re.compile(r'\b(?:' + '|'.join(re.escape(code) for code in USCODES) + r')\b')
    
    # Find all matches
    code_matches = pattern.findall(top_context)
    
    # Find the most frequent one
    if code_matches:
        values, counts = np.unique(code_matches, return_counts=True)
        found_state = values[np.argmax(counts)]
    else:
        found_state = None
    
    return {'city':most_frequent, 'state':found_state, 'context':top_context}
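
The city-then-state lookup above can be tried on a toy input. This is a minimal sketch only: `CITIES`, `STATE_CODES`, `guess_location` and the sample pages are hypothetical stand-ins for `geo_df`, `USCODES` and the real documents, and it uses `collections.Counter` instead of `np.unique`, so ties may resolve differently.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the geo_df city column and USCODES.
CITIES = ["Herndon", "Reston", "New York"]
STATE_CODES = ["VA", "NY", "CA"]

# Title Case or ALL CAPS city names, longest names first.
city_pattern = re.compile(
    r"\b(?:" + "|".join(
        rf"{re.escape(c.title())}|{re.escape(c.upper())}"
        for c in sorted(CITIES, key=len, reverse=True)
    ) + r")\b"
)
state_pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, STATE_CODES)) + r")\b")

def guess_location(pages, window=30):
    """Return (city, state) using the first city hit on each page."""
    hits = Counter()
    contexts = {}
    for text in pages:
        m = city_pattern.search(text)
        if m:
            city = m.group(0)
            hits[city] += 1
            # Keep a short window after the city name, where a state code often follows.
            contexts.setdefault(city, []).append(text[m.start():m.start() + window])
    if not hits:
        return None, None
    city = hits.most_common(1)[0][0]
    codes = state_pattern.findall(" ".join(contexts[city]))
    state = Counter(codes).most_common(1)[0][0] if codes else None
    return city, state

pages = [
    "Visit us at 200 Main St, Herndon, VA 20170.",
    "Our Herndon, VA office is hiring.",
    "We also have a desk in Reston, VA.",
]
print(guess_location(pages))
```

Only the first city match per page is counted, as in the function above, so a page that mentions several cities contributes a single vote.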

In [331]:
len(geo_df[geo_df['Elevation']==-1000])

0

In [None]:
for file in tqdm(files_in_folder[67:]):
    info = extract_location(baseurl=file,max_workers=30)
    geodata = geo_df[(geo_df['ASCII Name']==info['city']) & (geo_df['Admin1 Code']==info['state'])]
    # Log the raw context when no gazetteer row matches, otherwise the matched rows as JSON
    payload = info['context'] if len(geodata)==0 else geodata.to_json()
    with open('./data/geofeature.csv', mode='a', newline='') as geo_out:
        writer = csv.writer(geo_out)
        writer.writerow([file,len(geodata),payload])




affinityhealthcorp.com.json
stadiumpeople.com.json
gannettfleming.com.json
imigroup.com.json
trustapexinternational.com.json
lumapps.com.json
cardinals.com.json
revisionenergy.com.json
guthy-renker.com.json
marqeta.com.json
mechatronics.com.json
rewterz.com.json
tektonlabs.com.json
georgetowncommunityhospital.com.json
sterlingequities.com.json
coverwallet.com.json
holyspiritretirementhome.com.json
senecaglobal.com.json
rwcatskills.com.json
emersonrogers.com.json
neogenomics.com.json
cctvcamerapros.com.json
thevillageshealth.com.json
cjadvertising.com.json
engageware.com.json
cbservice.com.json
conceptrehab.com.json

  0%|                                                | 27/13077 [02:58<19:50:46,  5.47s/it]

answernet.com.json

2


## other...


In [9]:
files_in_folder

['skysolutions.com.json',
 'richardsonsports.com.json',
 'wilson-company.com.json',
 'westernallied.com.json',
 'bluescopebuildings.com.json',
 'cobbemc.com.json',
 'is4s.com.json',
 'berkleyselect.com.json',
 'webbwheel.com.json',
 'sygmanetwork.com.json',
 'smma.com.json',
 'cgicontainersales.com.json',
 'nydig.com.json',
 'protranslating.com.json',
 'traditionalbank.com.json',
 'percona.com.json',
 'lawnlove.com.json',
 'elementsdesign.com.json',
 'pghwong.com.json',
 'fwcook.com.json',
 'sofistadium.com.json',
 'sfopera.com.json',
 'helenwellsagency.com.json',
 'russdaviswholesale.com.json',
 'irissoftware.com.json',
 'klarquist.com.json',
 'govtact.com.json',
 'lowlandsgroup.com.json',
 'kingseducation.com.json',
 'qualcareinc.com.json',
 'mymotomart.com.json',
 'chartbeat.com.json',
 'kongbasileconsulting.com.json',
 'javacity.com.json',
 'saintmarks.com.json',
 'finchpaper.com.json',
 'optoro.com.json',
 'paccarparts.com.json',
 'tendercarehh.com.json',
 'ftei.com.json',
 'lesse

In [34]:
for filename in files_in_folder:
    if filename.endswith('.json'):
        file_path = os.path.join(folder_path, filename)
        doc = load_documents(file_path)
        break
print(doc.keys())

dict_keys(['url', 'timestamp', 'text_by_page_url', 'doc_id'])


## 2.2 Pre-process documents.

Feel free to explore and pre-process the data. You may want to clean or segment the documents as you see fit.

In [None]:
def page_segment(docs):
    """You may prefer to load each page separately."""
    # Use a separate list name so the function does not shadow itself
    segments = []
    for i, text in enumerate(docs['text_by_page_url'].values()):
        segments.append({"docID": docs['doc_id'], "pageID": f"page_{i}", "text": text})
    return segments

In [None]:
def segment_documents(docs, chunk_size=500):
    """Segments documents into chunks of a given size in characters (the slice below counts characters, not tokens). Replace this function with your segmentation approach or maybe use the original document without segmentation."""
    segmented = []
    for doc_id, content in docs.items():
        for i in range(0, len(content), chunk_size):
            segment = content[i : i + chunk_size]
            segmented.append({"id": doc_id, "text": segment})
    return segmented
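
If you prefer chunks measured in tokens rather than characters, a minimal sketch could look like the following. It assumes whitespace-separated words are a good enough proxy for model tokens; `chunk_by_tokens` and the `overlap` parameter are illustrative names, not part of the template.

```python
def chunk_by_tokens(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size whitespace tokens.

    Consecutive chunks share `overlap` tokens so that a sentence cut at a
    boundary still appears whole in one of the two chunks.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks

demo_chunks = chunk_by_tokens("a b c d e f g h i j", chunk_size=4, overlap=1)
print(demo_chunks)
```

A real tokenizer (for example the one that ships with your embedding model) will count differently, so treat `chunk_size` here as approximate.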



In [None]:
def document_clean(docs):
  """
  You may want to clean the dataset, add the code here.
  """
  pass
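
As one possible starting point for cleaning, the sketch below normalizes whitespace and drops lines that repeat across most pages of a site (typically navigation bars and footers). The function names and the `max_share` threshold are assumptions for illustration, not part of the template.

```python
import re
from collections import Counter

def clean_page(text):
    """Collapse runs of whitespace and drop empty lines from one page."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

def drop_repeated_lines(pages, max_share=0.5):
    """Remove lines that occur on more than max_share of the pages."""
    cleaned = [clean_page(p) for p in pages]
    counts = Counter()
    for page in cleaned:
        # Count each distinct line once per page.
        counts.update(set(page.split("\n")))
    limit = max_share * len(cleaned)
    return [
        "\n".join(line for line in page.split("\n") if counts[line] <= limit)
        for page in cleaned
    ]

demo_pages = [
    "Home | About\n\nWe  build   bridges.",
    "Home | About\nWe pave roads.",
    "Home | About\nContact us.",
]
print(drop_repeated_lines(demo_pages))
```

For the documents in this dataset, `pages` could be `list(doc['text_by_page_url'].values())`.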

## 2.3 Document Indexing and Storage (Profiling)

Feel free to choose different ways of indexing and storing the provided documents in a knowledge database.

They can then be retrieved in different ways according to your system design choices, such as keyword search, vector representation, graph relations, etc.
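
For the keyword-search option, a minimal in-memory inverted index is sketched below. It uses only the standard library; `tokenize`, `build_inverted_index` and `keyword_search` are hypothetical names, and a real system would probably persist the index (for example in Chroma or a database) rather than keep it in a dict.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(pages):
    """Map each token to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in tokenize(text):
            index[token].add(page_id)
    return index

def keyword_search(index, query):
    """Return the page ids that contain every query token (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

demo_index = build_inverted_index({
    "p1": "Cloud migration services",
    "p2": "Cloud security audits",
    "p3": "Warehouse services",
})
print(keyword_search(demo_index, "cloud services"))
```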

# 3. Retrieval Augmented Generation

## 3.1 Load Knowledge Database

## 3.2 Relevant Document Retrieval

Feel free to check and improve your retrieval performance, as it affects the generation results significantly.

In [None]:
def retrieve_documents(query, db_path, embedding_model):
  """
  retrieve relevant documents from the knowledge database to the query.
  """
  return relevant_docs
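
One simple way to fill in the retrieval step, before moving to embeddings, is cosine similarity over word counts. The sketch below is a standard-library baseline under that assumption; `retrieve_top_k` is a hypothetical helper that works on chunks shaped like the `{"id", "text"}` dicts produced by `segment_documents`, not on a Chroma store.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Token counts for lowercase alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_top_k(query, chunks, k=2):
    """Rank chunks by similarity to the query and keep the k best non-zero hits."""
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(c["text"])), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]

demo_chunks = [
    {"id": "a", "text": "We offer cloud migration and data platforms"},
    {"id": "b", "text": "Our stadium hosts football games"},
    {"id": "c", "text": "Cloud security audits"},
]
print([c["id"] for c in retrieve_top_k("cloud migration", demo_chunks)])
```

Swapping `bag_of_words` for an embedding model and `cosine` for a vector-store query keeps the same overall shape.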

## 3.3 Response Generation

Feel free to explore prompt engineering to improve the quality of your generated response.

The retrieved documents are used as context to generate a more relevant response. General knowledge from the language model itself is also used.

In [None]:
def generate_answer(query, retrieved_texts, prompt_template):
    """Generates an answer using retrieved documents and GPT-4."""
    return response
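
The prompt-assembly part of generation can be sketched without calling any model. The helper below is an illustration only: `build_prompt`, its default template wording and the `max_chars` budget are assumptions, and the resulting string would still need to be sent to the LLM of your choice.

```python
def build_prompt(query, retrieved_texts, prompt_template=None, max_chars=4000):
    """Fill a template with numbered context passages and the user query."""
    if prompt_template is None:
        # Hypothetical default template; adjust the wording to taste.
        prompt_template = (
            "Use the context below to answer the question.\n\n"
            "Context:\n{context}\n\n"
            "Question: {query}\nAnswer:"
        )
    passages, used = [], 0
    for i, text in enumerate(retrieved_texts, start=1):
        entry = f"[{i}] {text.strip()}"
        if used + len(entry) > max_chars:
            break  # stay within a rough context budget
        passages.append(entry)
        used += len(entry)
    return prompt_template.format(context="\n".join(passages), query=query)

demo_prompt = build_prompt("Who?", ["Alpha Corp is in Valencia. ", "Beta LLC."])
print(demo_prompt)
```

Numbering the passages makes it easy to ask the model to cite which passage supports its answer.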

In [None]:
query = "What company is located in 29010 Commerce Center Dr., Valencia, 91355, California, US?"
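retrieved_docs = retrieve_documents(query, db_path, embedding_model)
# generate_answer expects plain text, so pull the page content out of the retrieved documents
retrieved_texts = [doc.page_content for doc in retrieved_docs]
response = generate_answer(query, retrieved_texts, prompt_template)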

print("Query:", query)
print("Retrieved Documents:", [doc.page_content for doc in retrieved_docs])
print("Generated Answer:", response)

# 4. Evaluation

Try as many examples as you can to evaluate your system and improve its performance!

The final system will be evaluated from various aspects, so try to check different metrics when you evaluate. One trick is to do a "strict RAG", where the response is generated from the retrieved documents only, i.e. no general knowledge from the LLM is used. This may be a good way to check whether your retrieval part is working as expected. Note that in the final system, general knowledge from the LLM is welcome. "Strict RAG" is only a way for you to check your performance :)
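
To check the retrieval part on its own, one metric you could track is the hit rate at k: the share of test questions whose expected document shows up among the top-k retrieved results. The sketch below assumes you have labeled `(query, expected_doc_id)` pairs and a `retrieve` callable that returns ranked document ids; `hit_rate_at_k`, `toy_retrieve` and the sample data are hypothetical.

```python
def hit_rate_at_k(examples, retrieve, k=3):
    """Fraction of examples whose expected doc id is in the top-k results."""
    if not examples:
        return 0.0
    hits = 0
    for query, expected_id in examples:
        top_ids = retrieve(query)[:k]
        hits += expected_id in top_ids
    return hits / len(examples)

def toy_retrieve(query):
    """Stand-in retriever returning a fixed ranking per query."""
    rankings = {"q1": ["a", "b"], "q2": ["c", "d"]}
    return rankings.get(query, [])

demo_examples = [("q1", "a"), ("q2", "d"), ("q3", "x")]
print(hit_rate_at_k(demo_examples, toy_retrieve, k=1))
print(hit_rate_at_k(demo_examples, toy_retrieve, k=2))
```

A low hit rate points to a retrieval problem that prompt changes alone are unlikely to fix.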