<div style="background-color: #2563eb; color: white; padding: 20px; border-radius: 8px; margin: 10px 0; max-width: 1120px; box-shadow: 0 2px 4px rgba(0,0,0,0.1);">

## Dataset Overview: Electric Vehicle Charging Stations (2024)

Our AI chatbot draws its knowledge from the **Electric Vehicle Charging Stations (2024)** dataset on Kaggle (`sahirmaharajj/electric-vehicle-charging-stations-2024`). It suits this mockup because each record carries the fields needed to answer concrete user queries: where a station is, when it is accessible, and how many ports of each charging level it offers.

### Key Features:
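- **Location information**: street address, city, and a georeferenced `POINT (longitude latitude)` column from which latitude and longitude are extracted for proximity searches
- **Charging capacity**: port counts for Level 1, Level 2, and DC fast charging, used to filter stations by charging speed
- **Access hours** (`Access Days Time`), used to find stations that are open 24 hours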

We will load this data into a **pandas DataFrame**, which will serve as the core "database" for all bot-driven station lookups.

</div>
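
The dataset stores each station's position as a `POINT (longitude latitude)` string, so coordinates have to be parsed before any proximity search. The sketch below shows one way to do that with pandas. The column name `geo` and the helper `parse_points` are hypothetical; the two valid sample values are copied from rows that appear in the results later in this notebook.

```python
import pandas as pd


def parse_points(series: pd.Series) -> pd.DataFrame:
    """Split 'POINT (lon lat)' strings into numeric longitude/latitude columns."""
    # Two capture groups: longitude first, then latitude (WKT order)
    pattern = r"POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)"
    parts = series.str.extract(pattern)
    parts.columns = ["longitude", "latitude"]
    # Rows that do not match the pattern stay NaN after conversion
    return parts.apply(pd.to_numeric, errors="coerce")


# Hypothetical three-row frame; the last row is deliberately malformed
sample = pd.DataFrame({
    "geo": [
        "POINT (-73.546435 41.068704)",
        "POINT (-72.562282 41.80452)",
        "not a point",
    ]
})
coords = parse_points(sample["geo"])
print(coords)
```

Using two capture groups avoids a separate split step, and malformed rows surface as `NaN` rather than raising.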

### Installations and imports

In [10]:
!pip install -r ../requirements.txt

Collecting sentence-transformers (from -r ../requirements.txt (line 4))
  Using cached sentence_transformers-5.1.0-py3-none-any.whl.metadata (16 kB)
Collecting scikit-learn (from -r ../requirements.txt (line 5))
  Using cached scikit_learn-1.7.1-cp313-cp313-macosx_12_0_arm64.whl.metadata (11 kB)
Collecting transformers<5.0.0,>=4.41.0 (from sentence-transformers->-r ../requirements.txt (line 4))
  Downloading transformers-4.56.0-py3-none-any.whl.metadata (40 kB)
Collecting torch>=1.11.0 (from sentence-transformers->-r ../requirements.txt (line 4))
  Using cached torch-2.8.0-cp313-none-macosx_11_0_arm64.whl.metadata (30 kB)
Collecting scipy (from sentence-transformers->-r ../requirements.txt (line 4))
  Using cached scipy-1.16.1-cp313-cp313-macosx_14_0_arm64.whl.metadata (61 kB)
Collecting huggingface-hub>=0.20.0 (from sentence-transformers->-r ../requirements.txt (line 4))
  Using cached huggingface_hub-0.34.4-py3-none-any.whl.metadata (14 kB)
Collecting Pillow (from sentence-transforme

In [11]:
# imports

import kagglehub
from kagglehub import KaggleDatasetAdapter

import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity
import re
from typing import List, Dict, Optional, Tuple

### Fetching Data

In [18]:
class EVChargingStationBot:
    """
    AI Chatbot for Electric Vehicle Charging Station queries.
    Supports semantic search, geographic clustering, and structured filtering.
    """
    
    def __init__(self, dataset_handle: str = "sahirmaharajj/electric-vehicle-charging-stations-2024",
                 file_path: Optional[str] = None, embedding_model: str = 'all-MiniLM-L6-v2'):
        """
        Initialize the EV Charging Station Bot.
        
        Args:
            dataset_handle: Kaggle dataset identifier
            file_path: Specific file to load from dataset
            embedding_model: SentenceTransformer model for semantic search
        """
        self.dataset_handle = dataset_handle
        self.file_path = file_path
        self.embedding_model_name = embedding_model
        
        # Core data structures
        self.df = None
        self.model = None
        self.station_embeddings = None
        self.geo_clusters = None
        
        # Initialize the bot
        self._load_data()
        self._preprocess_data()
        self._initialize_embeddings()
        
    def _load_data(self) -> None:
        """Load the EV charging station dataset."""
        try:
            if self.file_path:
                self.df = kagglehub.dataset_load(
                    KaggleDatasetAdapter.PANDAS,
                    self.dataset_handle,
                    self.file_path
                )
            else:
                # Download and explore dataset first
                path = kagglehub.dataset_download(self.dataset_handle)
                print(f"Dataset downloaded to: {path}")
                # User needs to specify the correct file_path
                raise ValueError("Please specify the file_path after exploring the dataset")
                
            print(f"Loaded {len(self.df)} charging stations")
            
        except Exception as e:
            print(f"Error loading data: {e}")
            raise
    
    def _preprocess_data(self) -> None:
        """Preprocess the dataset for optimal searching."""
        # Clean and standardize columns FIRST
        self._clean_data()
        
        # Extract coordinates from georeferenced column
        if 'New Georeferenced Column' in self.df.columns:
            coordinates = self.df['New Georeferenced Column'].str.extract(r'POINT \(([^)]+)\)')
            coord_split = coordinates[0].str.split(' ', expand=True)
            if len(coord_split.columns) >= 2:
                self.df['longitude'] = pd.to_numeric(coord_split[0], errors='coerce')
                self.df['latitude'] = pd.to_numeric(coord_split[1], errors='coerce')
        
        # Create rich descriptions for semantic search (after cleaning)
        self.df['description'] = self.df.apply(self._create_station_description, axis=1)
        
        print("Data preprocessing completed")
    
    def _create_station_description(self, row) -> str:
        """Create a rich text description for each charging station."""
        description_parts = []
        
        if pd.notna(row.get('Station Name')):
            description_parts.append(f"Station: {row['Station Name']}")
        
        if pd.notna(row.get('City')) and pd.notna(row.get('Street Address')):
            description_parts.append(f"Located in {row['City']} at {row['Street Address']}")
        
        if pd.notna(row.get('Access Days Time')):
            description_parts.append(f"Hours: {row['Access Days Time']}")
        
        # Charging capabilities
        charging_info = []
        if pd.notna(row.get('EV Level1 EVSE Num')) and row['EV Level1 EVSE Num'] > 0:
            charging_info.append(f"Level 1: {row['EV Level1 EVSE Num']} ports")
        if pd.notna(row.get('EV Level2 EVSE Num')) and row['EV Level2 EVSE Num'] > 0:
            charging_info.append(f"Level 2: {row['EV Level2 EVSE Num']} ports")
        if pd.notna(row.get('EV DC Fast Count')) and row['EV DC Fast Count'] > 0:
            charging_info.append(f"DC Fast: {row['EV DC Fast Count']} ports")
        
        if charging_info:
            description_parts.append("Charging: " + ", ".join(charging_info))
        
        return ". ".join(description_parts)
    
    def _clean_data(self) -> None:
        """Clean and standardize the dataset."""
        # Convert numeric columns - handle "NONE" strings
        numeric_cols = ['EV Level1 EVSE Num', 'EV Level2 EVSE Num', 'EV DC Fast Count']
        for col in numeric_cols:
            if col in self.df.columns:
                # "NONE" and any other non-numeric entry is coerced to NaN, then set to 0
                self.df[col] = pd.to_numeric(self.df[col], errors='coerce').fillna(0).astype(int)
        
        # Clean string columns
        string_cols = ['Station Name', 'City', 'Street Address']
        for col in string_cols:
            if col in self.df.columns:
                self.df[col] = self.df[col].astype(str).str.strip()
    
    def _initialize_embeddings(self) -> None:
        """Initialize the sentence transformer model and create embeddings."""
        try:
            self.model = SentenceTransformer(self.embedding_model_name)
            self.station_embeddings = self.model.encode(self.df['description'].tolist())
            print("Embeddings initialized successfully")
        except Exception as e:
            print(f"Warning: Could not initialize embeddings: {e}")
            self.model = None
            self.station_embeddings = None
    
    def setup_geographic_clustering(self, n_clusters: int = 20) -> None:
        """Setup geographic clustering for location-based optimization."""
        if 'latitude' in self.df.columns and 'longitude' in self.df.columns:
            # Remove rows with missing coordinates
            valid_coords = self.df.dropna(subset=['latitude', 'longitude'])
            
            if len(valid_coords) > 0:
                # Cap the cluster count at the number of rows with usable coordinates
                k = min(n_clusters, len(valid_coords))
                kmeans = KMeans(n_clusters=k, random_state=42)
                coords = valid_coords[['latitude', 'longitude']].values
                cluster_labels = kmeans.fit_predict(coords)
                
                # Map clusters back to original dataframe
                self.df['geo_cluster'] = -1  # Default for missing coordinates
                self.df.loc[valid_coords.index, 'geo_cluster'] = cluster_labels
                
                self.geo_clusters = kmeans
                # Report the number of clusters actually fitted, not the requested number
                print(f"Geographic clustering completed with {k} clusters")
            else:
                print("No valid coordinates found for clustering")
        else:
            print("Latitude/Longitude columns not available for clustering")
    
    # === SEARCH METHODS ===
    def find_by_city(self, city: str, limit: int = 10) -> pd.DataFrame:
        """Find charging stations by city name."""
        mask = self.df['City'].str.contains(city, case=False, na=False)
        return self.df[mask].head(limit)
    
    def find_fast_charging(self, city: Optional[str] = None, limit: int = 10) -> pd.DataFrame:
        """Find stations with DC fast charging."""
        # Convert to numeric on the fly if needed
        dc_fast_col = pd.to_numeric(self.df['EV DC Fast Count'], errors='coerce').fillna(0)
        mask = dc_fast_col > 0
        
        if city:
            city_mask = self.df['City'].str.contains(city, case=False, na=False)
            mask = mask & city_mask
        
        return self.df[mask].head(limit)

    def find_level2_charging(self, city: Optional[str] = None, min_ports: int = 1, limit: int = 10) -> pd.DataFrame:
        """Find stations with Level 2 charging."""
        # Convert to numeric on the fly if needed
        level2_col = pd.to_numeric(self.df['EV Level2 EVSE Num'], errors='coerce').fillna(0)
        mask = level2_col >= min_ports
        
        if city:
            city_mask = self.df['City'].str.contains(city, case=False, na=False)
            mask = mask & city_mask
        
        return self.df[mask].head(limit)
    
    def find_24_hour_stations(self, city: Optional[str] = None, limit: int = 10) -> pd.DataFrame:
        """Find 24-hour accessible charging stations."""
        mask = self.df['Access Days Time'].str.contains('24 hours', case=False, na=False)
        
        if city:
            city_mask = self.df['City'].str.contains(city, case=False, na=False)
            mask = mask & city_mask
        
        return self.df[mask].head(limit)
    
    def find_nearby_stations(self, lat: float, lon: float, radius_km: float = 10, limit: int = 10) -> pd.DataFrame:
        """Find charging stations within a radius of given coordinates."""
        if 'latitude' not in self.df.columns or 'longitude' not in self.df.columns:
            raise ValueError("Coordinate data not available")
        
        # Calculate distances using Haversine formula
        distances = self._calculate_distances(lat, lon)
        nearby_mask = distances <= radius_km
        
        # Sort by distance
        nearby_stations = self.df[nearby_mask].copy()
        nearby_stations['distance_km'] = distances[nearby_mask]
        
        return nearby_stations.sort_values('distance_km').head(limit)
    
    def semantic_search(self, query: str, limit: int = 5) -> pd.DataFrame:
        """Perform semantic search using embeddings."""
        if self.model is None or self.station_embeddings is None:
            raise ValueError("Embeddings not available. Initialize embeddings first.")
        
        # Encode the query
        query_embedding = self.model.encode([query])
        
        # Calculate similarities
        similarities = cosine_similarity(query_embedding, self.station_embeddings)[0]
        
        # Get top results
        top_indices = np.argsort(similarities)[::-1][:limit]
        results = self.df.iloc[top_indices].copy()
        results['similarity_score'] = similarities[top_indices]
        
        return results
    
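    def _calculate_distances(self, lat: float, lon: float) -> np.ndarray:
        """Calculate great-circle distances in km using the Haversine formula."""
        R = 6371  # Earth's mean radius in kilometers
        
        lat_rad = np.radians(lat)
        lon_rad = np.radians(lon)
        
        # Keep missing coordinates as NaN: their distance is NaN, which never
        # passes a "<= radius" test, instead of being treated as the point (0, 0)
        station_lats = np.radians(self.df['latitude'].to_numpy(dtype=float))
        station_lons = np.radians(self.df['longitude'].to_numpy(dtype=float))
        
        dlat = station_lats - lat_rad
        dlon = station_lons - lon_rad
        
        a = np.sin(dlat/2)**2 + np.cos(lat_rad) * np.cos(station_lats) * np.sin(dlon/2)**2
        # Clip guards against rounding that pushes a slightly outside [0, 1]
        c = 2 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
        
        return R * c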
    
    def get_station_details(self, station_id: int) -> Dict:
        """Get detailed information about a station, selected by its zero-based row position."""
        station = self.df.iloc[station_id]
        return station.to_dict()
    
    def get_cities(self) -> List[str]:
        """Get list of all cities with charging stations."""
        return sorted(self.df['City'].unique())
    
    def get_summary_stats(self) -> Dict:
        """Get summary statistics about the dataset."""
        return {
            'total_stations': len(self.df),
            'cities': len(self.df['City'].unique()),
            'level1_stations': (self.df['EV Level1 EVSE Num'] > 0).sum(),
            'level2_stations': (self.df['EV Level2 EVSE Num'] > 0).sum(),
            'fast_dc_stations': (self.df['EV DC Fast Count'] > 0).sum(),
            'total_level2_ports': self.df['EV Level2 EVSE Num'].sum(),
            'total_fast_dc_ports': self.df['EV DC Fast Count'].sum()
        }
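
To sanity-check the Haversine logic in `_calculate_distances`, here is a scalar version that uses only the standard library. The function name `haversine_km` is hypothetical; the coordinates are those of two Stamford-area stations that appear in the results further down, which lie roughly 4.3 km apart.

```python
import math

EARTH_RADIUS_KM = 6371.0  # same mean radius as R in the class above


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # min() guards against rounding that pushes a slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


ridgeway = (41.068704, -73.546435)    # 2233 Summer Street, Stamford
high_ridge = (41.107722, -73.546513)  # 1145 High Ridge Rd, North Stamford
d = haversine_km(*ridgeway, *high_ridge)
print(f"{d:.2f} km")
```

A scalar check like this is handy when changing the vectorized method, since both should agree for any single station.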

In [19]:
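# Usage:
bot = EVChargingStationBot(file_path="Electric_Vehicle_Charging_Stations.csv")
# The sample results below are Connecticut stations, so query a city and
# coordinates that appear in the data (Stamford) rather than Boston
results = bot.find_fast_charging("Stamford")
nearby = bot.find_nearby_stations(41.068704, -73.546435, radius_km=5)
semantic_results = bot.semantic_search("Tesla supercharger downtown")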

Loaded 385 charging stations
Data preprocessing completed
Embeddings initialized successfully
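
The ranking step behind `semantic_search` can be seen in isolation with toy vectors instead of real sentence embeddings. This is a sketch only: the 3-dimensional vectors, the labels, and the `top_k` helper are made up for illustration and do not come from the `all-MiniLM-L6-v2` model.

```python
import numpy as np


def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 2):
    """Return (indices, scores) of the k rows most cosine-similar to query_vec."""
    # Normalize so that a dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    # argsort is ascending, so reverse it and keep the first k
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]


# Hypothetical "embeddings" for three station descriptions
labels = ["supercharger", "hotel level 2", "garage level 2"]
doc_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.2],
    [0.0, 0.8, 0.6],
])
query = np.array([1.0, 0.2, 0.0])

idx, scores = top_k(query, doc_vecs)
for i, s in zip(idx, scores):
    print(labels[i], round(float(s), 3))
```

The bot does the same thing with `cosine_similarity` from scikit-learn; the only difference is that its vectors come from the sentence-transformer model.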


In [20]:
# TODO: pass these results to a decoder (generative) model so the reply reads as natural language and includes directions to the station
semantic_results

| | Station Name | Street Address | City | Access Days Time | EV Level1 EVSE Num | EV Level2 EVSE Num | EV DC Fast Count | EV Other Info | New Georeferenced Column | longitude | latitude | description | similarity_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 280 | Ridgeway Shopping Center - Tesla Supercharger | 2233 Summer Street | Stamford | 24 hours daily; for Tesla use only | 0 | 0 | 12 | NONE | POINT (-73.546435 41.068704) | -73.546435 | 41.068704 | Station: Ridgeway Shopping Center - Tesla Supe... | 0.76358 |
| 18 | The Plaza at Buckland Hills - Tesla Supercharger | 1470 Pleasant Valley Road | Manchester | 24 hours daily; for Tesla use only | 0 | 0 | 16 | NONE | POINT (-72.562282 41.80452) | -72.562282 | 41.80452 | Station: The Plaza at Buckland Hills - Tesla S... | 0.74995 |
| 217 | Greenwich Northbound Travel Plaza - Tesla Supe... | 3000 Merritt Parkway | Greenwich | 24 hours daily; for Tesla use only | 0 | 0 | 4 | NONE | POINT (-73.671661 41.041538) | -73.671661 | 41.041538 | Station: Greenwich Northbound Travel Plaza - T... | 0.746748 |
| 375 | Grand Central Fashion Plaza Shopping Center - ... | 1145 High Ridge Rd | North Stamford | 24 hours daily; for Tesla use only | 0 | 0 | 8 | NONE | POINT (-73.546513 41.107722) | -73.546513 | 41.107722 | Station: Grand Central Fashion Plaza Shopping ... | 0.739274 |
| 160 | Greenwich Southbound Travel Plaza - Tesla Supe... | 2000 Merritt Parkway | Greenwich | 24 hours daily; for Tesla use only | 0 | 0 | 4 | NONE | POINT (-73.673445 41.040555) | -73.673445 | 41.040555 | Station: Greenwich Southbound Travel Plaza - T... | 0.737158 |
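
One possible way to turn a result frame like the one above into a chat reply with directions is a plain template over the rows, with no generative model involved. This is only a sketch: `format_reply` is a hypothetical helper, it takes plain dicts (for example from `semantic_results.to_dict('records')`), and the link pattern follows the Google Maps URLs `dir` action, which is an assumption about how directions might be offered rather than something this notebook implements.

```python
from urllib.parse import urlencode


def format_reply(records, max_items=3):
    """Render station records as a short text reply with a directions link each."""
    if not records:
        return "Sorry, I could not find a matching charging station."
    lines = ["Here are the best matches:"]
    for rank, rec in enumerate(records[:max_items], start=1):
        # Assumed link format: Google Maps URLs, "dir" action with a lat,lon destination
        query = urlencode({"api": 1, "destination": f"{rec['latitude']},{rec['longitude']}"})
        lines.append(
            f"{rank}. {rec['Station Name']}, {rec['Street Address']}, {rec['City']} "
            f"({rec['EV DC Fast Count']} DC fast ports)\n"
            f"   Directions: https://www.google.com/maps/dir/?{query}"
        )
    return "\n".join(lines)


# One record copied from the first row of the results above
sample_records = [{
    "Station Name": "Ridgeway Shopping Center - Tesla Supercharger",
    "Street Address": "2233 Summer Street",
    "City": "Stamford",
    "EV DC Fast Count": 12,
    "latitude": 41.068704,
    "longitude": -73.546435,
}]
reply = format_reply(sample_records)
print(reply)
```

A template keeps the reply deterministic and easy to test; a decoder model could later replace it if more conversational wording is wanted.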
