<div style="border: none; margin: 5px 0; border-top: 1px dashed #FFFFFF; border-bottom: 1px dashed #FFFFFF; height: 5px;"></div>

<h2 style="color: #FFA07A;">4. Analysis and Explainability of Clusters Using Artificial Intelligence</h2>

In [1]:
import ipywidgets as widgets
import time
from IPython.display import display

# --- Create HTML widget ---
output = widgets.HTML(value="<div></div>")
display(output)

# --- Formatted HTML text ---
text = """
<div style="background-color: #FFFFFF; color: #333333; padding: 15px; 
            border-left: 5px solid #FFA500; font-family: Arial, sans-serif; 
            text-align: justify; font-size: 16px; line-height: 1.6;">
    <p>Algorithmic decisions can be difficult to interpret. In this section we therefore apply techniques from <i>Explainable Artificial Intelligence</i> (XAI), a branch of AI that aims to make algorithms more transparent by explaining how their decisions are made.</p>

    <p>K-means does not itself explain how clusters are formed, so we train supervised models on the cluster labels and use the following techniques to explore <u>how and why</u> the buildings were grouped into two <i>clusters</i>:</p>

    <p> 🔸 <b><u>LIME (<i>Local Interpretable Model-Agnostic Explanations</i>)</u></b>: explains individual predictions, showing how much each variable contributed to the cluster assignment of one sample building drawn from the selected cluster.</p>

    <p> 🔸 <b><u>Decision Tree</u></b>: a shallow tree (maximum depth 4) fitted to the cluster labels, giving a hierarchical, rule-based view of how the variables separate the clusters.</p>

    <p> 🔸 <b><u>Feature Importance in Clusters</u></b>: computed from a Random Forest model trained on the cluster labels, this analysis identifies which variables contributed most to distinguishing the groups.</p>

    <p> 🔸 <b><u>SHAP (<i>SHapley Additive exPlanations</i>)</u></b>: attributes each Random Forest prediction to the individual variables and summarizes the contributions per cluster in a beeswarm plot.</p>

    <p>Together, these techniques give a deeper understanding of the patterns identified by K-means and feed into the overall model analysis.</p>
</div>
"""

# --- Typing effect character by character ---
typed = ""
for char in text:
    typed += char
    output.value = typed
    time.sleep(0.005)  # Adjust speed 

# --- Ensure final rendering of complete HTML ---
output.value = text

HTML(value='<div></div>')
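The idea described above, explaining K-means by training a supervised model on its cluster labels, can be sketched on synthetic data. This is only an illustration under stated assumptions: the data from `make_blobs`, the feature names, and the variable names (`X_demo`, `surrogate`, `fidelity`) are hypothetical and are not taken from the case study.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the building features (hypothetical data).
X_demo, _ = make_blobs(n_samples=300, centers=2, n_features=3, random_state=42)
feature_names_demo = ["feature_a", "feature_b", "feature_c"]  # hypothetical names

# Step 1: unsupervised clustering produces labels but no explanation.
kmeans_demo = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_labels = kmeans_demo.fit_predict(X_demo)

# Step 2: a shallow tree is trained to reproduce those labels.
surrogate = DecisionTreeClassifier(max_depth=4, random_state=42)
surrogate.fit(X_demo, cluster_labels)

# Step 3: fidelity = share of points where the tree agrees with K-means.
fidelity = accuracy_score(cluster_labels, surrogate.predict(X_demo))
print(f"Surrogate fidelity on the training data: {fidelity:.3f}")

# Step 4: the learned splits can be read as plain-text rules.
print(export_text(surrogate, feature_names=feature_names_demo))
```

A fidelity close to 1 suggests that the tree's rules are a fair summary of the clustering. A low value would mean the explanations describe the surrogate rather than K-means itself.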

In [2]:
from IPython.display import Javascript, display
# hide-me
display(Javascript('window.cellVisibilityManager.hideCells();'))

# Run the preparation notebook, which loads the shared libraries
ipython = get_ipython()
ipython.run_line_magic("run", "case_study_prep.ipynb")

<IPython.core.display.Javascript object>


In [9]:
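# --- Imports (assumed to be provided by case_study_prep.ipynb; repeated so the cell can run on its own) ---
import base64
import io
import pickle
import random
import socket
import subprocess
from tempfile import NamedTemporaryFile

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import shap
from dash import Dash, Input, Output, dcc, html
from lime.lime_tabular import LimeTabularExplainer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# --- Load the data ---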
input_pkl_path = "data.pkl"
with open(input_pkl_path, 'rb') as pkl_file:
    data = pickle.load(pkl_file)

df_services = data['df']  # DataFrame with clusters

# --- Mapping for readable feature names (English) ---
label_map = {
    'number_of_nearby_services': 'Total number of nearby services',
    'pop_65_plus': 'Population aged 65+',
    'average_distance_to_services': 'Average distance to services',
    'Health Centers': 'Health Centers',
    'Pharmacies': 'Pharmacies',
    'Hospitals': 'Hospitals',
    'Supermarkets': 'Supermarkets',
    'Banks': 'Banks',
    'Parks and Gardens': 'Parks or Gardens',
    'Post Offices': 'Post Offices'
}

# --- Prepare data for the models ---
X = df_services[['pop_65_plus', 'number_of_nearby_services', 'average_distance_to_services',
                 'Health Centers', 'Pharmacies', 'Hospitals',
                 'Supermarkets', 'Banks', 'Parks and Gardens', 'Post Offices']]
y = df_services['cluster_kmeans']  # Use K-Means clusters

# --- Train the RandomForest model ---
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X, y)

# --- SHAP explainer for Random Forest ---
explainer_shap = shap.TreeExplainer(rf_model)
shap_values = explainer_shap.shap_values(X)

# --- Initialize LIME explainer with mapped feature names ---
explainer = LimeTabularExplainer(
    X.values,
    feature_names=[label_map.get(col, col) for col in X.columns],
    class_names=[f"Cluster {i}" for i in sorted(y.unique())],
    mode="classification",
    discretize_continuous=True
)

# --- Train the DecisionTree model ---
dt_model = DecisionTreeClassifier(max_depth=4, random_state=42)
dt_model.fit(X, y)

# --- Initialize Dash app ---
app = Dash(__name__, suppress_callback_exceptions=True)

# --- Define the layout of the app ---
app.layout = html.Div([
    html.H1("Explainable AI Model for Cluster Evaluation", style={
        'text-align': 'center',
        'color': 'white',
        'background-color': '#000',
        'border': '2px solid white',
        'padding': '10px',
        'font-weight': 'bold',
        'font-size': '32px'
    }),

    dcc.Tabs(id="tabs", value='tree-tab', children=[
        dcc.Tab(label='Decision Tree', value='tree-tab',
                style={'backgroundColor': '#000', 'color': 'white', 'padding': '10px'},
                selected_style={'backgroundColor': '#000', 'color': 'white', 'padding': '10px', 'borderTop': '4px solid #ffcc00'}),
        dcc.Tab(label='LIME', value='lime-tab',
                style={'backgroundColor': '#000', 'color': 'white', 'padding': '10px'},
                selected_style={'backgroundColor': '#000', 'color': 'white', 'padding': '10px', 'borderTop': '4px solid #ffcc00'}),
        dcc.Tab(label='Variable Importance', value='importance-tab',
                style={'backgroundColor': '#000', 'color': 'white', 'padding': '10px'},
                selected_style={'backgroundColor': '#000', 'color': 'white', 'padding': '10px', 'borderTop': '4px solid #ffcc00'}),
        dcc.Tab(label='SHAP', value='shap-tab',
                style={'backgroundColor': '#000', 'color': 'white', 'padding': '10px'},
                selected_style={'backgroundColor': '#000', 'color': 'white', 'padding': '10px', 'borderTop': '4px solid #ffcc00'})
    ]),

    html.Div(id='tabs-content')
])

# --- Function to generate the decision tree as a base64-encoded SVG ---
def generate_tree_svg():
    # Export the fitted tree as DOT source (out_file=None returns a string)
    dot_source = export_graphviz(
        dt_model,
        out_file=None,
        feature_names=[label_map.get(col, col) for col in X.columns],
        class_names=[f"Cluster {i}" for i in sorted(y.unique())],
        filled=True,
        rounded=True,
        special_characters=True,
        precision=0
    )

    # Render with the Graphviz "dot" CLI via stdin/stdout, so no temp files are left behind
    result = subprocess.run(
        ["dot", "-Tsvg"],
        input=dot_source.encode("utf-8"),
        capture_output=True,
        check=True
    )

    return base64.b64encode(result.stdout).decode('utf-8')

# --- Render content depending on selected tab ---
@app.callback(Output('tabs-content', 'children'), Input('tabs', 'value'))
def render_tab_content(tab):
    if tab == 'tree-tab':
        svg_base64 = generate_tree_svg()
        return html.Div([
            html.Div([
                html.Img(src=f"data:image/svg+xml;base64,{svg_base64}")
            ], style={'text-align': 'center', 'overflow-x': 'scroll'})
        ])
    elif tab == 'lime-tab':
        return html.Div([
            dcc.Dropdown(
                id='lime-cluster-selector',
                options=[{'label': f'Cluster {i}', 'value': i} for i in sorted(y.unique())],
                placeholder="Select a cluster",
                style={'backgroundColor': 'white', 'color': 'black'}
            ),
            html.Div(id='lime-output', style={'padding': '10px', 'border': '1px solid #ccc',
                                              'borderRadius': '5px', 'backgroundColor': 'white'})
        ])
    elif tab == 'importance-tab':
        importances = rf_model.feature_importances_
        features = [label_map.get(col, col) for col in X.columns]
        fig = px.bar(x=importances, y=features, orientation='h',
                     labels={'x': 'Importance', 'y': 'Variables'})
        return html.Div([dcc.Graph(figure=fig)])
    elif tab == 'shap-tab':
        cluster_options = [{'label': f'Cluster {i}', 'value': i} for i in sorted(y.unique())]
        return html.Div([
            dcc.Dropdown(
                id='shap-cluster-selector',
                options=cluster_options,
                value=sorted(y.unique())[0],  # default value: first cluster
                clearable=False,
                style={'width': '50%', 'margin': 'auto', 'marginBottom': '20px', 'backgroundColor': 'white', 'color': 'black'}
            ),
            html.Div(id='shap-output')
        ])
    return html.Div("Select a tab to view results.")

# --- Callback for SHAP beeswarm plot by cluster (samples from the selected cluster) ---
@app.callback(
    Output('shap-output', 'children'),
    Input('shap-cluster-selector', 'value')
)
def update_shap_output(cluster_selected):
    plt.clf()
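    feature_names_list = [str(label_map.get(col, col)) for col in X.columns]

    mask = (y == cluster_selected).values
    # Position of the selected cluster in the classifier's class order
    idx = list(rf_model.classes_).index(cluster_selected)
    # Handle the SHAP output layouts: list per class, 3D array, or plain 2D array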
    if isinstance(shap_values, list):
        cluster_shap = shap_values[idx][mask]
    elif isinstance(shap_values, np.ndarray) and shap_values.ndim == 3:
        cluster_shap = shap_values[mask, :, idx]
    else:
        cluster_shap = shap_values[mask]
    cluster_X = X.loc[mask]
    if cluster_shap.shape[0] == 0:
        return f"No sample found for cluster {cluster_selected}."

    # SHAP plot with a custom x-axis label
    shap.summary_plot(
        cluster_shap,
        cluster_X.values,
        feature_names=feature_names_list,
        show=False
    )
    plt.xlabel("SHAP value (impact on model output)")
    buf = io.BytesIO()
    plt.savefig(buf, format="png", bbox_inches="tight")
    buf.seek(0)
    encoded = base64.b64encode(buf.read()).decode()
    plt.close('all')
    return html.Div([
        html.Img(src=f"data:image/png;base64,{encoded}", style={'width': '100%'})
    ])

# --- Generate LIME output for selected cluster ---
@app.callback(
    Output('lime-output', 'children'),
    Input('lime-cluster-selector', 'value')
)
def update_lime_output(cluster_selected):
    if cluster_selected is not None:
        try:
            cluster_indices = y[y == cluster_selected].index.tolist()
            if len(cluster_indices) == 0:
                return "No instance found for the selected cluster."

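            np.random.seed(42)
            idx = np.random.choice(cluster_indices)

            # cluster_indices holds index labels, so select the row with .loc
            explanation = explainer.explain_instance(
                X.loc[idx].values,
                lambda x: rf_model.predict_proba(pd.DataFrame(x, columns=X.columns)),
                num_features=len(X.columns),
                # LIME expects the class position, not the cluster label itself
                labels=[list(rf_model.classes_).index(cluster_selected)]
            )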
            return html.Iframe(
                srcDoc=explanation.as_html(),
                style={'width': '100%', 'height': '600px', 'border': 'none'}
            )
        except Exception as e:
            return f"Error generating explanation with LIME: {e}"
    return "Select a cluster to view the LIME explanation."

# --- Find a free network port for the interactive dashboard ---
def find_free_port():
    while True:
        port = random.randint(8000, 9000)
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            if s.connect_ex(("localhost", port)) != 0:
                return port

port = find_free_port()

# --- Launch the interactive dashboard (Dash) if run directly ---
if __name__ == '__main__':
    app.run(debug=False, port=port)

print("\033[92m[INFO] Analysis completed. You may continue.\033[0m")

<IPython.core.display.Javascript object>

[INFO] Analysis completed. You may continue.
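The SHAP callback above branches on the layout of `shap_values`, which can differ depending on the installed `shap` version: a list with one array per class, or a single 3D array with the classes on the last axis. The helper below is a minimal sketch of that selection logic in isolation. The function name `select_class_shap` and the toy arrays are hypothetical, and the sketch assumes the class position is looked up in the classifier's class order (for example `rf_model.classes_`) rather than taken from the label value.

```python
import numpy as np

def select_class_shap(shap_values, classes, cluster_label):
    """Return the (n_samples, n_features) SHAP matrix for one cluster label."""
    # Position of the label in the classifier's class order (e.g. model.classes_).
    class_pos = list(classes).index(cluster_label)
    if isinstance(shap_values, list):
        # List layout: one (n_samples, n_features) array per class.
        return np.asarray(shap_values[class_pos])
    values = np.asarray(shap_values)
    if values.ndim == 3:
        # Stacked layout: (n_samples, n_features, n_classes).
        return values[:, :, class_pos]
    # Single-output layout: already (n_samples, n_features).
    return values

# Toy arrays standing in for real SHAP output (4 samples, 3 features, 2 classes).
rng = np.random.default_rng(0)
stacked = rng.normal(size=(4, 3, 2))
as_list = [stacked[:, :, 0], stacked[:, :, 1]]
classes_demo = np.array([1, 2])  # labels need not start at 0

from_list = select_class_shap(as_list, classes_demo, 2)
from_3d = select_class_shap(stacked, classes_demo, 2)
print(from_list.shape, np.allclose(from_list, from_3d))
```

Looking up the position in the class order keeps the selection correct even when cluster labels do not start at 0 or are not consecutive.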


In [1]:
import ipywidgets as widgets
import time
from IPython.display import display

# --- Create an HTML widget to display the typing effect for the second text ---
output2 = widgets.HTML(value="<div></div>")
display(output2)

# --- Full explanatory HTML-formatted text including SHAP ---
translated_text = """
<p><b>Explanation:</b></p>
<p style="text-align: justify;">
🔹 <strong><u>Decision Tree:</u></strong> The Decision Tree shows that the <strong>number of nearby services</strong> is the most influential variable in the segmentation, appearing in the top levels of the tree. This variable is essential for distinguishing the two <i>clusters</i>. <i><strong>Cluster 0</strong></i> represents buildings located in areas with a <strong>lower availability of nearby services</strong>, while <i><strong>Cluster 1</strong></i> groups areas with a <strong>higher concentration and diversity of urban services</strong>.</p>

<p>The variable <strong>population aged 65+</strong> appears only in the deeper levels, indicating that the <strong>demographic factor</strong> plays a secondary role compared to service accessibility. Other variables such as <em>hospitals</em>, <em>banks</em>, and <em>health centers</em> also contribute to the classification, but only as complements that refine the segmentation in specific cases.</p>

<p>🔹 <strong><u>LIME (Local Interpretable Model-Agnostic Explanations)</u></strong>: LIME helps explain how the model reached its decision for a specific building. Its output has three main components, each with a distinct function, which makes it accessible even to users unfamiliar with the technique:</p>

<ul>
  <li><strong>Prediction probability chart:</strong> Shows the likelihood of the building belonging to each <i>cluster</i>. For example, if the model indicates a 100% probability for <i><strong>Cluster 0</strong></i> and 0% for <i><strong>Cluster 1</strong></i>, it means the model is fully confident that the building belongs to cluster 0.</li>

  <li><strong>Most relevant variables chart (center of the figure):</strong> Displays the variables that most influenced the model’s decision for that building. The <span style="color:orange;"><strong>orange bars</strong></span> indicate influence in favor of <i>cluster 1</i>, while the <span style="color:blue;"><strong>blue bars</strong></span> indicate influence in favor of <i>cluster 0</i>. The longer the bar, the greater the impact of that variable on the final decision. <strong>Note:</strong> Even if most bars are orange, the building can still be classified into <i>cluster 0</i>. What matters is the combined length of the bars on each side, not how many there are, and the bars come from LIME's simple local approximation of the model rather than from the model itself.</li>

  <li><strong>"Value" column (Variable Values):</strong> Shows the actual value of each variable for the analyzed building. For example, if the variable <em>"banks"</em> has a value of <strong>9</strong>, it means there are 9 banks nearby (within 1.5 km). LIME explains how that value contributed to the classification into a specific <i>cluster</i>.</li>
</ul>

<p style="text-align: justify;">
🔹 <b><u>Variable importance:</u></b> The most relevant variable is the <i>number of nearby services</i>, followed by variables like supermarkets, banks, post offices (CTT), and pharmacies. On the other hand, variables like average distance to services, hospitals, and parks/gardens have less influence. This suggests that the presence and diversity of nearby services are the main criteria for defining <i>clusters</i>.
</p>
<p style="text-align: justify;">
🔹 <b><u>SHAP (SHapley Additive exPlanations):</u></b> SHAP values provide a detailed view of how each variable contributes to the model’s output, based on cooperative game theory. In the beeswarm plots, each dot represents a building. The position on the x-axis shows the impact of that feature on the prediction, while the color indicates the actual feature value (blue = low, red = high). In the plot for Cluster 0, low values of nearby services tend to push predictions toward this cluster. In the plot for Cluster 1, high values of the same variable strongly push the prediction toward that group. Other variables like supermarkets, banks, and pharmacies also play a significant role, while demographic variables like population aged 65+ have only a minor influence.</p>

<p style="text-align: justify;">
<b>Conclusion:</b> Both the <b>Decision Tree</b> and the <b>Random Forest</b> (<b><u>feature importance</u></b>) assign the greatest weight to the <b>accessibility and diversity of services</b> when separating the clusters. The variable <b>population aged 65 or older</b> did not emerge as a relevant criterion, appearing only in the lower levels of the Decision Tree. This suggests that the <b>presence and concentration of services</b> was the decisive factor in the segmentation performed by the <b>K-means</b> algorithm.
</p>
</div>
"""

# --- Simulate character-by-character typing effect with preserved HTML formatting ---
typed_html = """
<div style="background-color: #FFFFFF; color: #333333; padding: 15px; 
            border-left: 5px solid #6A0DAD; font-family: Arial, sans-serif; 
            text-align: justify; font-size: 16px; line-height: 1.6;">
"""

for word in translated_text.split():
    typed_html += word + " "
    output2.value = typed_html + "</div>"
    time.sleep(0.10)  # Typing speed

# --- Ensure final full rendering (translated_text already ends with the closing </div>) ---
output2.value = typed_html

HTML(value='<div></div>')
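The explanation above notes that SHAP is based on cooperative game theory. The sketch below computes exact Shapley values for a tiny hand-written scoring function, to show what "contribution of each variable" means. It is a simplified illustration, not the method used by `shap.TreeExplainer`: the function `toy_model`, the all-zero baseline, and the input values are all hypothetical, and absent features are simply replaced by their baseline value.

```python
from itertools import permutations

def shapley_values(predict, instance, baseline):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)          # start from the reference input
        previous = predict(current)
        for feature in order:
            current[feature] = instance[feature]   # switch one feature to its real value
            updated = predict(current)
            totals[feature] += updated - previous  # marginal contribution of this feature
            previous = updated
    return [total / len(orderings) for total in totals]

# Hypothetical scoring function: services and banks interact, population adds little.
def toy_model(x):
    services, banks, pop_65_plus = x
    return 2.0 * services + 1.0 * services * banks + 0.1 * pop_65_plus

instance = [3.0, 2.0, 5.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print(phi)
print(sum(phi), toy_model(instance) - toy_model(baseline))
```

The two numbers on the last line match: the contributions always add up to the difference between the prediction for the instance and the prediction for the baseline. This additivity is what allows a beeswarm plot to be read as a breakdown of each prediction. The interaction term between services and banks is split equally between the two variables.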

<div style="border: none; margin: 5px 0; border-top: 1px dashed #FFFFFF; border-bottom: 1px dashed #FFFFFF; height: 5px;"></div>

In [None]:
import ipywidgets as widgets
import time
from IPython.display import display

# --- Create a third HTML widget for the "Completed" message ---
output3 = widgets.HTML(value="<div></div>")
display(output3)

# --- Stylized HTML block for the "Completed" message ---
completed_text = """
<div style="background-color: #2E3B4E; color: #FFFFFF; padding: 30px; 
            border-left: 5px solid #FFA500; font-family: Arial, sans-serif; 
            text-align: center;">
    <p style="font-size: 78px; font-weight: bold; margin: 0;">
        Completed
    </p>
</div>
"""

# --- Typing effect for the final message (character by character) ---
typed_completed = ""
for char in completed_text:
    typed_completed += char
    output3.value = typed_completed
    time.sleep(0.005)  # Adjust speed as needed

# --- Ensure the final full rendering ---
output3.value = completed_text

<div style="border: none; margin: 5px 0; border-top: 1px dashed #FFFFFF; border-bottom: 1px dashed #FFFFFF; height: 5px;"></div>