# Function Notebook

## Introduction
This notebook is designed to be a flexible and expandable template for developing and documenting functions for various tasks.

## Next Steps 
- Data Reduction
- Feature Selection
- Dimensionality Reduction
- Binning / Encoding
- Feature Engineering
- Text Handling
- Data Validation


## NOTE : This is still in progress 

## Table of Contents1. [Configuration and Setup](#Configuration-and-Setpu)
2. [API Get Dataset no key](#Get-Data-No-ApiKey)
3. [Pre-Processing Functions](#Pre-Processing-Functions)
    - [Dealing with NULL Values (#Finding Missing Data Count )](Dealing-with-NULL-Values-(-Finding-Missing-Data-Count-))
    - [Remove(drop), mean , median , mode](#Remove(drop),-mean-,-median-,-mode)
    - [Data Combining / Intergration](#Data-Combining-/-Intergration)
    - [Normalizing / Feature Scaling](#Normalizing-/-Feature-Scaling)


## Configuration and Setup
Set up the environment with the necessary libraries and configurations. Make sure every library imported in the cell below is installed before running the functions.
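
One way to confirm that the libraries are available before running the import cell is to look each one up without importing it. The sketch below is not part of the notebook: `missing_modules` is a hypothetical helper name, and the list of module names is assumed from the imports used in this notebook.

```python
import importlib.util

def missing_modules(module_names):
    """Return the names from module_names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Top-level import names used by this notebook (assumed list, adjust as needed).
# Note: the import name 'sklearn' is installed with the package name 'scikit-learn'.
required = ['numpy', 'pandas', 'seaborn', 'folium', 'matplotlib', 'requests',
            'tensorflow', 'geopy', 'sklearn', 'haversine', 'tkinter']

print("Missing libraries:", missing_modules(required))
```

An empty list means every import in the setup cell should succeed.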

In [1]:
###################################################################
# Libraries used:
###################################################################
import numpy as np
import pandas as pd
import seaborn as sns
import folium
import matplotlib.pyplot as plt
import requests
import math
import tensorflow as tf
from io import StringIO
from geopy.distance import geodesic
from folium.plugins import MarkerCluster
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler,PowerTransformer,MaxAbsScaler
from sklearn.preprocessing import RobustScaler,Normalizer,QuantileTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from haversine import haversine
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
import tkinter as tk
from tkinter import messagebox

## Get Data No ApiKey

In [2]:
def API_Unlimited(datasetname): # pass in the dataset name (no api key needed)
    """
    Get unlimited data from the API Function

    Parameters:
    datasetname (string): dataset name as listed on the City of Melbourne open data portal

    No API key is needed for this export endpoint.

    Returns:
    DataFrame : Returns the dataset loaded from the csv export, or None if the request fails
    """
    base_url = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
    export_format = 'csv'  # renamed from 'format' so the Python builtin is not shadowed

    url = f'{base_url}{datasetname}/exports/{export_format}'
    params = {
        'select': '*',
        'limit': -1,  # all records
        'lang': 'en',
        'timezone': 'UTC'
    }

    # GET request
    response = requests.get(url, params=params)

    if response.status_code == 200:
        # StringIO to read the CSV data
        url_content = response.content.decode('utf-8')
        dataset = pd.read_csv(StringIO(url_content), delimiter=';')
        # Test: show up to 10 random rows (fewer if the dataset is smaller)
        print(dataset.sample(min(10, len(dataset)), random_state=999))
        return dataset
    else:
        print(f'Request failed with status code {response.status_code}')
        return None

#### Testing :

In [None]:
dataset_id_1 = 'litter-traps'
dataset_id_2 = 'public-barbecues'
dataset_id_3 = 'cafes-and-restaurants-with-seating-capacity'
dataset_id_4 = 'argyle-square-air-quality'
litter_df = API_Unlimited(dataset_id_1)
bbq_df = API_Unlimited(dataset_id_2)
cafe_df = API_Unlimited(dataset_id_3)
AirQuality_df = API_Unlimited(dataset_id_4)
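
The response handling inside `API_Unlimited` can be tried without network access. The sketch below repeats the decode and parse steps on made-up bytes standing in for `response.content`; the column names and values are illustrative assumptions, not real portal data.

```python
import pandas as pd
from io import StringIO

# Made-up bytes standing in for response.content (not real portal data)
fake_content = b"asset_number;lat;lon\n1001;-37.81;144.96\n1002;-37.80;144.95\n"

# Same steps as in API_Unlimited: decode, wrap in StringIO, parse with ';' as delimiter
decoded = fake_content.decode('utf-8')
sample_df = pd.read_csv(StringIO(decoded), delimiter=';')

print(sample_df.shape)
print(list(sample_df.columns))
```

If the delimiter were left at the pandas default (a comma), each row would be read as a single column, which is why the function passes `delimiter=';'`.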

# Pre-Processing Functions

## Dealing with NULL Values ( Finding Missing Data Count )

In [4]:
def FindMissingVal(df):
  """
  Function to get column names with count of missing values

  Parameters:
  df (dataframe): the dataframe to check for missing values

  Returns:
  list : a list of dictionaries, one per column, with the feature name and its number of missing values
  """
  # list to store each feature with its number of NaN values
  MissingFeaturenValues = []
  # now we check each column
  for column in df.columns:
    missingVals = int(df[column].isnull().sum()) # count the NaN values in this column
    MissingFeaturenValues.append({'Feature':column ,'Number of Missing Values':missingVals}) # each entry is a dictionary with the feature and its missing value count
  return MissingFeaturenValues

In [5]:
FindMissingVal(litter_df)

[{'Feature': 'asset_number', 'Number of Missing Values': 0},
 {'Feature': 'asset_description', 'Number of Missing Values': 0},
 {'Feature': 'construct_material_lupvalue', 'Number of Missing Values': 7},
 {'Feature': 'inspection_frequency', 'Number of Missing Values': 5},
 {'Feature': 'maintained_by', 'Number of Missing Values': 0},
 {'Feature': 'object_type_lupvalue', 'Number of Missing Values': 4},
 {'Feature': 'lat', 'Number of Missing Values': 0},
 {'Feature': 'lon', 'Number of Missing Values': 0},
 {'Feature': 'location', 'Number of Missing Values': 0}]
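
The same per-column count can also be obtained without an explicit loop. The following is a minimal sketch on made-up toy data (not the litter traps dataset), using only standard pandas calls.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration)
toy_df = pd.DataFrame({
    'asset_number': [1, 2, 3],
    'inspection_frequency': ['Weekly', np.nan, np.nan],
})

# isnull() marks missing cells; sum() counts them per column
missing_counts = toy_df.isnull().sum()

# Keep only the columns that actually have missing values
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing.to_dict())
```

Filtering to the non-zero counts gives a ready-made list of columns to pass on to a null-handling step.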

### Remove(drop), mean , median , mode 

In [6]:


def handle_null_values(dataset, columns, action):
    """
    Handling Missing Values Function

    Parameters:

    dataset (dataframe) - Dataframe whose null values you want to handle
    columns (array) - an array of all columns to handle missing values in with the picked action
    action (string) - 'remove' , 'mode' , 'mean' , 'median' performs the said action when selected ( can select one at a time )

    Returns:
    Dataframe : Returns a new Dataframe with the missing values handled ( the input dataframe is left unchanged )
    """
    if action == 'remove':
        modified_dataset = dataset.dropna(subset=columns)
    elif action in ['mean', 'median', 'mode']:
        # Work on a copy so the caller's dataframe is not modified in place
        modified_dataset = dataset.copy()
        for column in columns:
            if modified_dataset[column].isnull().any():
                if action == 'mean':
                    fill_value = modified_dataset[column].mean()
                elif action == 'median':
                    fill_value = modified_dataset[column].median()
                else:  # 'mode'
                    fill_value = modified_dataset[column].mode()[0]
                modified_dataset[column] = modified_dataset[column].fillna(fill_value)
    else:
        raise ValueError("Action must be 'remove', 'mean', 'median', or 'mode'")
    return modified_dataset

#### Testing - I made an array of the columns I want to fill using the mode and ran the function; the result is returned to a new df called modified_mode

In [7]:
"""Usage Example"""
columns=['inspection_frequency','construct_material_lupvalue']
modified_mode = handle_null_values(litter_df,columns,'mode') #<========== Pass the dataset, the columns and the preferred method


In [8]:
FindMissingVal(modified_mode)

[{'Feature': 'asset_number', 'Number of Missing Values': 0},
 {'Feature': 'asset_description', 'Number of Missing Values': 0},
 {'Feature': 'construct_material_lupvalue', 'Number of Missing Values': 0},
 {'Feature': 'inspection_frequency', 'Number of Missing Values': 0},
 {'Feature': 'maintained_by', 'Number of Missing Values': 0},
 {'Feature': 'object_type_lupvalue', 'Number of Missing Values': 4},
 {'Feature': 'lat', 'Number of Missing Values': 0},
 {'Feature': 'lon', 'Number of Missing Values': 0},
 {'Feature': 'location', 'Number of Missing Values': 0}]
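
The choice between mean, median and mode matters when a column is skewed. The sketch below uses a made-up numeric column with one outlier to show how far the three fill values can differ. Mean and median only apply to numeric columns, while mode also works for text columns such as `inspection_frequency`.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value and one outlier
seats = pd.Series([2.0, 4.0, 4.0, 100.0, np.nan])

fill_values = {
    'mean': seats.mean(),        # pulled upwards by the outlier
    'median': seats.median(),    # robust to the outlier
    'mode': seats.mode()[0],     # most frequent value
}

for name, value in fill_values.items():
    filled = seats.fillna(value)
    print(name, value, int(filled.isnull().sum()))
```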

### Data Combining / Intergration

In [9]:
import pandas as pd

def Combine_Dataset(datasets, mode='outer'):
    """
    Combining multiple datasets

    Parameters:

    datasets - Array of multiple datasets (dataframes)
    mode - inner , outer , left , right    (JOIN) Default : outer

    Returns:
    Dataframe : Returns the combined Dataframe
    """
    # Check if no dataset is given
    if not datasets:
        raise ValueError("No datasets provided for merging.")

    # We check if there are any common columns
    common_columns = set(datasets[0].columns) # start from the columns of the first dataset
    for dataset in datasets[1:]:
        common_columns.intersection_update(dataset.columns) # keep only the columns shared with this dataset

    # Error if no common column is found
    if not common_columns:
        raise ValueError("No common columns available for combining the datasets. Please give datasets with common columns.")

    # Merge
    combined_dataset = datasets[0]
    for dataset in datasets[1:]:
        # combine with mode ( default is outer ) on the common columns
        combined_dataset = pd.merge(combined_dataset, dataset, how=mode, on=list(common_columns))

    return combined_dataset

#### Testing - I pass an array of the datasets I want to combine with mode='outer' and run the function; the result is returned to a new df called combinedDF

In [10]:
Datasets = [litter_df,cafe_df] # passing 2 datasets as an array here

combinedDF = Combine_Dataset(Datasets, mode='outer') # using mode "outer"

In [11]:
combinedDF.head(5)

Unnamed: 0,asset_number,asset_description,construct_material_lupvalue,inspection_frequency,maintained_by,object_type_lupvalue,lat,lon,location,census_year,...,building_address,clue_small_area,trading_name,business_address,industry_anzsic4_code,industry_anzsic4_description,seating_type,number_of_seats,longitude,latitude
0,,,,,,,,,"-37.77619476202889, 144.93912821315",2013.0,...,52-62 Cade Way PARKVILLE 3052,Parkville,Corner Cafe & Convenience Store,"Shop 1, Ground , 52 Cade Way PARKVILLE 3052",4511.0,Cafes and Restaurants,Seats - Outdoor,6.0,144.939128,-37.776195
1,,,,,,,,,"-37.77619476202889, 144.93912821315",2013.0,...,52-62 Cade Way PARKVILLE 3052,Parkville,Corner Cafe & Convenience Store,"Shop 1, Ground , 52 Cade Way PARKVILLE 3052",4511.0,Cafes and Restaurants,Seats - Indoor,35.0,144.939128,-37.776195
2,,,,,,,,,"-37.776194762058005, 144.93912821305003",2010.0,...,52-62 Cade Way PARKVILLE 3052,Parkville,Corner Cafe & Convenience Store,"Unit 1, 62-0 Cade Way PARKVILLE 3052",4511.0,Cafes and Restaurants,Seats - Outdoor,6.0,144.939128,-37.776195
3,,,,,,,,,"-37.776194762058005, 144.93912821305003",2010.0,...,52-62 Cade Way PARKVILLE 3052,Parkville,Corner Cafe & Convenience Store,"Unit 1, 62-0 Cade Way PARKVILLE 3052",4511.0,Cafes and Restaurants,Seats - Indoor,35.0,144.939128,-37.776195
4,,,,,,,,,"-37.776194762090775, 144.93912821290002",2011.0,...,52-62 Cade Way PARKVILLE 3052,Parkville,Corner Cafe & Convenience Store,"Unit 1, 62-0 Cade Way PARKVILLE 3052",4511.0,Cafes and Restaurants,Seats - Indoor,35.0,144.939128,-37.776195
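
The empty litter trap columns in the rows above are what an outer join produces when the values in the shared column do not match. A minimal sketch with made-up frames (the `location` values and other columns are illustrative only) shows the difference between `outer` and `inner`.

```python
import pandas as pd

# Made-up frames sharing only the 'location' column
left = pd.DataFrame({'location': ['A', 'B'], 'asset_number': [1, 2]})
right = pd.DataFrame({'location': ['B', 'C'], 'number_of_seats': [6, 35]})

# outer keeps every row from both sides and fills the gaps with NaN
outer = pd.merge(left, right, how='outer', on=['location'])

# inner keeps only the rows whose 'location' appears on both sides
inner = pd.merge(left, right, how='inner', on=['location'])

print(outer)
print(inner)
```

When two datasets share a column name but almost no values, as can happen with raw coordinate strings, an inner join may return very few rows and an outer join mostly stacks the two tables.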


### Normalizing / Feature Scaling 

In [12]:

def Scale_data(dataframe, columns, method='minmax'):
    """
    Scaling Features in dataset

    Parameters:

    dataframe - Dataframe containing the features to scale
    columns - array of all columns/features to normalize or scale
    method -  minmax , zscore , powertransformer , absscalar , robustscalar , normalizer , quantile . Default : minmax

    Returns:
    Dataframe : Returns a copy of the Dataframe with the selected columns Scaled/Normalized
    """
    # Copy so we don't change the original DataFrame
    df_scaled = dataframe.copy()

    # Check if all specified columns exist in the DataFrame
    if not all(col in df_scaled.columns for col in columns):
        missing_cols = [col for col in columns if col not in df_scaled.columns]
        raise ValueError(f"Columns not found in DataFrame: {missing_cols}")

    # Select the scaling method (each method name maps to its own scaler)
    if method == 'minmax':
        scaler = MinMaxScaler()
    elif method == 'zscore':
        scaler = StandardScaler()
    elif method == 'powertransformer':
        scaler = PowerTransformer()
    elif method == 'absscalar':
        scaler = MaxAbsScaler()
    elif method == 'robustscalar':
        scaler = RobustScaler()
    elif method == 'normalizer':
        scaler = Normalizer()  # note: Normalizer rescales each row (sample) to unit norm, not each column
    elif method == 'quantile':
        scaler = QuantileTransformer()
    else:
        raise ValueError("Please enter one scaler method : minmax , zscore , powertransformer , absscalar , robustscalar , normalizer , quantile") #exception

    # Use the selected scaler
    df_scaled[columns] = scaler.fit_transform(df_scaled[columns])

    return df_scaled

### Get column names in a list ( Easy to copy over )

In [13]:
list(AirQuality_df.columns.values)

['time',
 'dev_id',
 'sensor_name',
 'lat_long',
 'averagespl',
 'carbonmonoxide',
 'humidity',
 'ibatt',
 'nitrogendioxide',
 'ozone',
 'particulateserr',
 'particulatesvsn',
 'peakspl',
 'pm1',
 'pm10',
 'pm25',
 'temperature',
 'vbatt',
 'vpanel']

#### Testing : Adding the columns I want to scale on the AirQuality DF to the function, selecting the minmax scaler

In [14]:
 # minmax , zscore , powertransformer , absscalar , robustscalar , normalizer , quantile are currently supported

Scaled_min_max_df = Scale_data(AirQuality_df, ['averagespl','carbonmonoxide','humidity','ibatt','nitrogendioxide','ozone','particulateserr','particulatesvsn','peakspl','pm1','pm10','pm25','temperature'], method='minmax')
Scaled_min_max_df.head(6)

Unnamed: 0,time,dev_id,sensor_name,lat_long,averagespl,carbonmonoxide,humidity,ibatt,nitrogendioxide,ozone,particulateserr,particulatesvsn,peakspl,pm1,pm10,pm25,temperature,vbatt,vpanel
0,2020-06-09T09:02:38+00:00,ems-ec8a,Air Quality Sensor 2,"-37.802772, 144.9655513",0.1,0.136969,0.583333,0.866142,0.2997,0.733945,0.0,1.0,0.241379,0.1,0.075697,0.001002,0.261421,3.96,0.0
1,2020-06-09T11:17:37+00:00,ems-ec8a,Air Quality Sensor 2,"-37.802772, 144.9655513",0.075,0.115559,0.619048,0.877953,0.322523,0.768807,0.0,1.0,0.12069,0.125,0.095618,0.001296,0.225888,3.93,0.0
2,2020-06-09T11:32:37+00:00,ems-ec8a,Air Quality Sensor 2,"-37.802772, 144.9655513",0.075,0.115559,0.630952,0.869423,0.322523,0.768807,0.0,1.0,0.224138,0.158333,0.115538,0.001414,0.215736,3.92,0.0
3,2020-06-09T12:17:37+00:00,ems-ec8a,Air Quality Sensor 2,"-37.802772, 144.9655513",0.125,0.102704,0.666667,0.874016,0.33994,0.776147,0.0,1.0,0.327586,0.166667,0.115538,0.001709,0.200508,3.91,0.0
4,2020-06-09T13:47:36+00:00,ems-ec8a,Air Quality Sensor 2,"-37.802772, 144.9655513",0.075,0.115559,0.714286,0.91273,0.333934,0.785321,0.0,1.0,0.12069,0.2,0.163347,0.002121,0.180203,3.89,0.0
5,2020-06-09T14:02:36+00:00,ems-ec8a,Air Quality Sensor 2,"-37.802772, 144.9655513",0.125,0.094149,0.738095,0.879921,0.333934,0.785321,0.0,1.0,0.275862,0.133333,0.099602,0.001296,0.182741,3.89,0.0
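
To make the scaled values above easier to interpret, the sketch below reproduces by hand what the `minmax` and `zscore` options compute for a single column, assuming the default settings of `MinMaxScaler` and `StandardScaler`. The readings are made up.

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0])  # made-up readings

# minmax: (x - min) / (max - min) maps the column onto [0, 1]
minmax = (values - values.min()) / (values.max() - values.min())

# zscore: (x - mean) / std, using the population standard deviation (ddof=0)
zscore = (values - values.mean()) / values.std()

print(minmax)
print(zscore.mean(), zscore.std())
```

A constant column has `max - min` equal to zero, so this hand-written formula would divide by zero; columns such as `particulateserr` in the output above are worth checking before scaling.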


## Point to point distance calculator minimum ( Thomas )( NEED FIXING)


In [None]:
# Function to calculate the minimum distance from a point to any point in a list
def min_distance(point, list_of_points):
    """
    Calculate the minimum geodesic distance from a point to any point in a given list.

    Parameters:
    point (tuple): A tuple representing the coordinates (latitude, longitude) of the point.
    list_of_points (list of tuples): A list of tuples, each representing coordinates (latitude, longitude) of points to compare against.

    Returns:
    float: The minimum geodesic distance in meters from the given point to the closest point in the list.
    """
    return min(geodesic(point, pt).meters for pt in list_of_points) # get min distance

# example :
# NOTE: bbq_coords is not defined in this notebook yet; it must be a list of (latitude, longitude) tuples

row = {'lat': 40.7128, 'lon': -74.0060}
# Distance from this row's coordinates to the nearest point in bbq_coords
result = min_distance((row['lat'], row['lon']), bbq_coords)
# Print the result
print("test distance in meters :",result)

# example used in dataset :

litter_df['Nearest BBQ Distance (m)'] = litter_df.apply(lambda row: min_distance((row['lat'], row['lon']), bbq_coords), axis=1)
# creates a new column for the nearest distance to a point
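
The examples in this cell rely on `bbq_coords` being a list of (latitude, longitude) tuples. One possible way to build such a list from a dataframe is sketched below. `coords_from_df` is a hypothetical helper, and the column names are assumptions: the cafe dataset output earlier shows `latitude` and `longitude` columns rather than `lat` and `lon`, and the barbecue dataset's columns are not shown in this notebook, so check them first.

```python
import numpy as np
import pandas as pd

def coords_from_df(df, lat_col, lon_col):
    """Return a list of (latitude, longitude) tuples, skipping rows with a missing coordinate."""
    clean = df.dropna(subset=[lat_col, lon_col])
    return list(zip(clean[lat_col], clean[lon_col]))

# Made-up stand-in for a points dataframe
toy_points = pd.DataFrame({
    'lat': [-37.81, np.nan, -37.80],
    'lon': [144.96, 144.95, 144.97],
})

toy_coords = coords_from_df(toy_points, 'lat', 'lon')
print(toy_coords)
```

For the cafe dataset the call would presumably look like `coords_from_df(cafe_df, 'latitude', 'longitude')`.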

## Point to point distance calculator maximum ( NEED FIXING)


In [None]:
# Function to calculate the maximum distance from a point to any point in a list
def max_distance(point, list_of_points):
    """
    Calculate the maximum geodesic distance from a point to any point in a given list.

    Parameters:
    point (tuple): A tuple representing the coordinates (latitude, longitude) of the point.
    list_of_points (list of tuples): A list of tuples, each representing coordinates (latitude, longitude) of points to compare against.

    Returns:
    float: The maximum geodesic distance in meters from the given point to the farthest point in the list.
    """
    return max(geodesic(point, pt).meters for pt in list_of_points) # get max distance

# example :
# NOTE: row and bbq_coords must be defined first (see the minimum distance example)

value = max_distance((row['lat'], row['lon']), bbq_coords)

## Number of points in a given radius ( NEED FIXING)

In [None]:
# Calculate the number of points in a radius from a point
def count_points_in_radius(center_point, list_of_points, radius_meters):
    """
    Count the points in a given list that lie within a geodesic radius of a center point.

    Parameters:
    center_point (tuple): A tuple representing the coordinates (latitude, longitude) of the point.
    list_of_points (list of tuples): A list of tuples, each representing coordinates (latitude, longitude) of points to compare against.
    radius_meters (float): The radius around the center point, in meters.

    Returns:
    int: The number of points within the given radius
    """
    count = sum(1 for pt in list_of_points if geodesic(center_point, pt).meters <= radius_meters)
    return count

# Example into dataset :

#========Parameter 1 : Center point
#========Parameter 2 : all coordinate points [must be list form , see example ]
#========Parameter 3 : radius in meters
# NOTE: bbq_coords and cafe_coords must be lists of (latitude, longitude) tuples defined beforehand

radius = 100
litter_df['Number of Nearby Points in Radius'] = litter_df.apply(lambda row: count_points_in_radius((row['lat'], row['lon']), bbq_coords + cafe_coords,radius),axis=1)

# Example ( singular ) :

values = count_points_in_radius((row['lat'], row['lon']),cafe_coords,radius)
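
Calling `geodesic` once per pair of points can be slow on large dataframes. As a rough alternative, the haversine formula treats the Earth as a sphere and needs only the standard library; its results usually differ from the ellipsoidal geodesic distance by a fraction of a percent. The sketch below is an assumption-based illustration: `haversine_m` and `count_within` are hypothetical names and the coordinates are made up.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in meters (spherical approximation)

def haversine_m(p1, p2):
    """Great-circle distance in meters between two (latitude, longitude) points in degrees."""
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def count_within(center, points, radius_m):
    """Count the points lying within radius_m meters of center."""
    return sum(1 for pt in points if haversine_m(center, pt) <= radius_m)

# Made-up coordinates around central Melbourne
center = (-37.8136, 144.9631)
points = [(-37.8136, 144.9631), (-37.8140, 144.9635), (-37.9000, 145.0000)]

print(count_within(center, points, 100))
```

The same `haversine_m` function could stand in for `geodesic(...).meters` in the minimum and maximum distance helpers if the small loss of accuracy is acceptable.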

## The Map using folium ( basic ) ( NEED FIXING )


from folium.plugins import MarkerCluster

def map_func(PointsDatasets):
    """
    Draw a basic folium map with a circle at each point of the given dataframe.

    Parameters:
    PointsDatasets : A dataframe with 'lat' and 'lon' columns holding the coordinates of each point. Its other
    columns can also be shown on the map, for example when using an html legend.

    Returns:
    Map: The folium based map is returned
    """
    # Create a folium map centered at the mean coordinates of the points / initial setup
    map_center = [PointsDatasets['lat'].mean(), PointsDatasets['lon'].mean()]
    mymap = folium.Map(location=map_center, zoom_start=13)

    # Add circles for the points
    for index, row in PointsDatasets.iterrows():
        location = [row['lat'], row['lon']]
        # Add a circle with a 30 meter radius around the point
        folium.Circle(
            location=location,
            radius=30,
            color='red',
            fill=True,
            fill_opacity=0.2
        ).add_to(mymap)
    return mymap

# Example usage ========================= Pass in your dataframe =================
"""Make sure your dataframe has 'lat' and 'lon' columns"""
map_func(litter_df)

### correlation heat map for spearman and pearson correlation

In [22]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import spearmanr

def plot_correlation_heatmaps(data, labels, order=None):
    """
    Plots Pearson and Spearman correlation heatmaps for the given data.

    Parameters:
    - data: 2D numpy array or DataFrame containing the data to analyze.
    - labels: List of column names corresponding to the data.
    - order: List of indices specifying the order of columns for aesthetic purposes in the heatmap.

    The function creates a figure with two subplots: one for Pearson correlation and one for Spearman correlation.
    """
    if order is None:
        order = range(len(labels))  # Default order if none provided

    # Compute Pearson correlation coefficients
    R = np.corrcoef(data, rowvar=False)

    # Compute Spearman's rank correlation
    rho, pval = spearmanr(data, axis=0)

    # Create a figure with two subplots
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

    # Plot Pearson correlation heatmap
    ax1.set_title('Pearson Correlation')
    plt.sca(ax1)
    corrheatmap(R[np.ix_(order, order)], np.array(labels)[order])

    # Plot Spearman correlation heatmap
    ax2.set_title('Spearman Correlation')
    plt.sca(ax2)
    corrheatmap(rho[np.ix_(order, order)], np.array(labels)[order])

    plt.show()

def corrheatmap(R, labels):
    """
    Helper function to draw a correlation heat map.
    """
    k = len(labels)
    plt.imshow(R, cmap='RdBu', vmin=-1, vmax=1)
    plt.xticks(np.arange(k), labels=labels, rotation=45)
    plt.yticks(np.arange(k), labels=labels)
    plt.colorbar()
    for i in range(k):
        for j in range(k):
            plt.text(j, i, f"{R[i, j]:.2f}", ha="center", va="center",
                     color="white" if np.abs(R[i, j]) > 0.5 else "black")
    plt.grid(False)

# Usage example
# data = np.random.rand(100, 5)  # Dummy data
# labels = ['Var1', 'Var2', 'Var3', 'Var4', 'Var5']
# order = [0, 1, 2, 3, 4]
# plot_correlation_heatmaps(data, labels, order)
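
The reason for plotting both heatmaps is that the two coefficients can disagree. The sketch below uses made-up data with a monotonic but non-linear relationship and NumPy only: Spearman is computed as Pearson on the ranks, which is valid here because the data contain no ties. The helper name `ranks` is hypothetical.

```python
import numpy as np

x = np.arange(1, 11, dtype=float)
y = x ** 3  # monotonic but non-linear relationship

def ranks(a):
    """Rank values from 0 to n-1 (valid here because there are no ties)."""
    return np.argsort(np.argsort(a)).astype(float)

# Pearson measures linear association
pearson = np.corrcoef(x, y)[0, 1]

# Spearman is Pearson computed on the ranks, so it measures monotonic association
spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]

print(round(pearson, 3), round(spearman, 3))
```

A pair of features that scores clearly higher on Spearman than on Pearson is a hint that the relationship is monotonic but not linear.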


In [23]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

def optimal_k_clusters(data, k_range):
    """
    Determines the optimal number of clusters for K-means clustering based on silhouette scores.

    Parameters:
    - data: The dataset on which clustering is to be performed.
    - k_range: A range of k values to test. Typically, this is a range object.

    Returns:
    - optimal_k: The optimal number of clusters with the highest silhouette score.
    - Plots the silhouette scores for each k in k_range.
    """
    # List to store silhouette scores for each value of k
    silh_scores = []

    # Iterate over each value of k in the range provided
    for k in k_range:
        # Fit KMeans clustering model to the data with 'k' clusters
        kmeans = KMeans(n_clusters=k, n_init=10) # n_init=10 to ensure consistency across initializations
        cluster_labels = kmeans.fit_predict(data)
        
        # Calculate the silhouette score for the current number of clusters
        silhouette_avg = silhouette_score(data, cluster_labels)
        silh_scores.append(silhouette_avg)

    # Determine the value of k that has the maximum silhouette score
    optimal_k = k_range[np.argmax(silh_scores)]
    print("Optimal number of clusters (k):", optimal_k)

    # Plot the silhouette scores against the number of clusters
    plt.figure(figsize=(10, 6))
    plt.plot(k_range, silh_scores, marker='o')
    plt.xlabel('Number of clusters (k)')
    plt.ylabel('Silhouette Score')
    plt.title('Silhouette Score for Different Values of k')
    plt.grid(True)
    plt.show()

    return optimal_k

# Example of how to use the function
# data = your_data_frame  # make sure to define your DataFrame
# k_range = range(2, 11)  # Setting a range from 2 to 10
# optimal_k = optimal_k_clusters(data, k_range)
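
The silhouette approach can be sanity-checked on data where the answer is known. The sketch below generates two tight, well-separated synthetic blobs (made-up data, fixed seeds) and repeats the same scoring loop without the plot; the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two tight, well-separated synthetic blobs (made-up data)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=(10.0, 10.0), scale=0.1, size=(50, 2))
toy_data = np.vstack([blob_a, blob_b])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_data)
    scores[k] = silhouette_score(toy_data, labels)

# The k with the highest average silhouette score
best_k = max(scores, key=scores.get)
print(best_k)
```

Note that the silhouette score is undefined for a single cluster, so the tested range has to start at k = 2.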


In [24]:
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

def find_optimal_clusters(data, k_range=(2, 12)):
    """
    Determines the optimal number of clusters for K-means clustering using the elbow method and plots the results.

    Parameters:
    - data: The dataset on which clustering is to be performed, typically preprocessed (e.g., PCA-transformed).
    - k_range: A tuple (start, stop) of k values to test. The visualizer treats it like np.arange(start, stop), so the upper bound is excluded. Default is (2, 12).

    Returns:
    - Plots the elbow plot showing the distortion for each k, helping to identify the optimal number of clusters.
    """
    # Initialize the KMeans model; n_init=10 runs ten initializations per k and keeps the best, reducing sensitivity to the random start
    model = KMeans(n_init=10)

    # Initialize the KElbowVisualizer with the KMeans model, specifying the range of k and the metric 'distortion'
    visualizer = KElbowVisualizer(
        model, k=k_range, metric='distortion', timings=False
    )

    # Fit the visualizer to the data
    visualizer.fit(data)

    # Finalize and render the figure
    visualizer.show()

# Example of how to use the function
# X_pca = your_pca_transformed_data  # Ensure your data is appropriately preprocessed, e.g., using PCA
# find_optimal_clusters(X_pca)
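
If yellowbrick is not available, the idea behind the elbow plot can be reproduced with scikit-learn alone. The sketch below is an assumption-based illustration on made-up data with three blobs: it uses the KMeans `inertia_` attribute (the sum of squared distances of samples to their closest cluster center) as a stand-in for the distortion metric and looks at how much each extra cluster reduces it.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Three synthetic blobs (made-up data)
centers = [(0.0, 0.0), (8.0, 0.0), (4.0, 7.0)]
toy_data = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 2)) for c in centers])

k_values = list(range(1, 7))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_data).inertia_
            for k in k_values]

# Drop in inertia gained by each extra cluster; the elbow is where this levels off
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]

for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

On data like this the drop should be large up to k = 3 and small afterwards, which is the "elbow" the visualizer highlights.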
