### Model Analysis
This notebook guides users through evaluating multiple language models across different server types using llama-stack’s agent framework. It visualizes the results as stacked bar charts, enabling quick comparison of tool call accuracy. The analysis covers two checks: whether the tool call matched the expected one, and whether the inference output was non-empty.


### Overview
This notebook evaluates model tool calling and inference output capabilities across various MCP server types.

1. The metrics for this notebook are generated by `test_mcp_servers.py` and loaded from `results/metrics.csv`. Run `python test_mcp_servers.py` before rerunning this notebook.

2. The analysis focuses on the `tool_call_match` and `inference_not_empty` columns.

3. Clean, dynamic visualizations are generated to compare model performance across server types.
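
The loading step above can be sanity-checked before any plotting. The following is a minimal sketch, assuming the CSV exposes the columns used later in this notebook (`model`, `server_type`, `tool_call_match`, `inference_not_empty`). The helper `check_metrics_columns` and the toy rows are hypothetical and stand in for the real `results/metrics.csv`.

```python
import pandas as pd

# Columns the later cells rely on (names taken from this notebook).
REQUIRED_COLUMNS = ["model", "server_type", "tool_call_match", "inference_not_empty"]

def check_metrics_columns(df):
    """Hypothetical helper: return the required columns that are missing from df."""
    return [c for c in REQUIRED_COLUMNS if c not in df.columns]

# Toy stand-in for results/metrics.csv; the values are illustrative only.
toy = pd.DataFrame({
    "timestamp": ["t0", "t1"],
    "model": ["model-a", "model-b"],
    "server_type": ["server-x", "server-x"],
    "tool_call_match": [True, False],
    "inference_not_empty": [True, True],
})

missing = check_metrics_columns(toy)
metrics = toy[REQUIRED_COLUMNS]  # keep only the columns used in the charts
```

If `missing` is non-empty, the metrics file was probably produced by a different version of the test script and should be regenerated.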

In [10]:
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import math
import re

In [11]:
file_path = './results/metrics.csv'
df = pd.read_csv(file_path)

# Preview the data
# print(df.head())

### Preprocessing
Remove the timestamp column and simplify the model names.

In [12]:
df = df.drop(columns=['timestamp'])

def clean_model_name(name):
    # Keep only the last part after the slash (if present)
    name = name.split('/')[-1]
    # Remove '-instruct' or '-Instruct' (case insensitive)
    name = re.sub(r'(?i)-instruct$', '', name)
    return name

df['model'] = df['model'].apply(clean_model_name)
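
To see what the cleanup does, here is a small self-contained sketch that applies the same function to two sample identifiers. The identifiers are assumptions chosen to resemble the models named in this notebook; the actual values in `metrics.csv` may differ.

```python
import re

def clean_model_name(name):
    # Keep only the last path component, then drop a trailing "-instruct"
    # suffix regardless of letter case.
    name = name.split('/')[-1]
    name = re.sub(r'(?i)-instruct$', '', name)
    return name

# Sample identifiers (assumed, for illustration only).
samples = [
    "meta-llama/Llama-3.2-3B-Instruct",
    "ibm-granite/granite-3.2-8b-instruct",
    "plain-model",
]
cleaned = [clean_model_name(s) for s in samples]
# cleaned == ['Llama-3.2-3B', 'granite-3.2-8b', 'plain-model']
```

Names without a slash or an `-instruct` suffix pass through unchanged, so the function is safe to apply to the whole column.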


### Function to create a comparison bar chart for each model
This counts the True and False values of a given column, either `tool_call_match` or `inference_not_empty`.
For each model it draws one stacked bar, with the True count in green at the bottom and the False count in grey on top.

In [13]:

def add_subplot(fig, df, row, col, column_name, title):
    # Calculate true counts, false counts, and total counts per model
    models = df['model'].unique().tolist()
    true_counts = df[df[column_name] == True].groupby('model')[column_name].count().reindex(models, fill_value=0)
    false_counts = df[df[column_name] == False].groupby('model')[column_name].count().reindex(models, fill_value=0)

    # total_counts = df.groupby('model')[column_name].count()

    models = true_counts.index.tolist()
    true_vals = true_counts.values.tolist()
    false_vals = false_counts.values.tolist()

    # Create stacked bar chart with true (bottom) and false (top) parts
    fig.add_trace(go.Bar(
        x=models,
        y=true_vals,
        name='True',
        marker_color='mediumseagreen',
        text=true_vals,
        textposition='outside',
    ), row=row, col=col)

    fig.add_trace(go.Bar(
        x=models,
        y=false_vals,
        name='False',
        marker_color='lightgray',  # Changed to light gray
        text=false_vals,
        textposition='inside',  # Move text inside the bars
        insidetextanchor='middle'  # Center the text inside the bar
    ), row=row, col=col)

    # Label this subplot's y-axis, then apply figure-wide layout settings
    fig.update_yaxes(title_text='True/False Count', row=row, col=col)
    fig.update_layout(
        title=title,
        barmode='stack',  # Stack the bars
        template='plotly_dark',  # plotly_white for bright background
        showlegend=False
    )

    return fig

In [14]:
# Helper to choose the number of rows and columns in the subplot grid
def get_subplot_grid(n):
    cols = math.ceil(math.sqrt(n))          # Try to make columns as sqrt(n)
    rows = math.ceil(n / cols)              
    return rows, cols
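
As a worked example of the grid arithmetic, the sketch below evaluates the helper for a few subplot counts and maps a running index to a 1-based (row, col) cell in the same way as the plotting loops. The helper `cell_for` is hypothetical and only mirrors that index arithmetic.

```python
import math

def get_subplot_grid(n):
    cols = math.ceil(math.sqrt(n))   # columns: roughly the square root of n
    rows = math.ceil(n / cols)       # rows: enough to hold all n subplots
    return rows, cols

def cell_for(idx, cols):
    # Hypothetical helper: 0-based index -> 1-based (row, col), filling row by row.
    return (idx // cols) + 1, (idx % cols) + 1

grids = {n: get_subplot_grid(n) for n in (1, 2, 3, 4, 5)}
# grids == {1: (1, 1), 2: (1, 2), 3: (2, 2), 4: (2, 2), 5: (2, 3)}

positions = [cell_for(i, 3) for i in range(5)]
# positions == [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)]
```

Note that `get_subplot_grid(0)` would raise a `ZeroDivisionError`, so the helper assumes at least one server type is present in the data.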

### Bar chart comparing tool call matching for the Llama-3.2-3b and granite-3.2-8b models across all server types

In [15]:

fig = make_subplots(rows=1, cols=1)  
fig = add_subplot(fig, df, row=1, col=1, column_name='tool_call_match', title='Overall comparison of correct tool calls')  # Add the subplot
fig.show()


### Bar chart comparing tool call matching for the Llama-3.2-3b and granite-3.2-8b models, broken down by server type

In [16]:
from plotly.subplots import make_subplots
# Make subplots for each server type
subplots = df['server_type'].unique().tolist()
rows, cols = get_subplot_grid(len(subplots))
fig = make_subplots(rows=rows, cols=cols, subplot_titles=subplots)  

# Loop through each subplot and assign correct (row, col)
for idx, subplot in enumerate(subplots):
    row = (idx // cols) + 1
    col = (idx % cols) + 1
    # Filter the DataFrame for the current server type
    filtered_df = df[df['server_type'] == subplot]
    #plot for each server type
    fig = add_subplot(
        fig,
        filtered_df,
        row=row,
        col=col,
        column_name='tool_call_match',
        title='Comparison of correct tool call matches for each server type'
    )

fig.show()


### Bar chart comparing inference output for the Llama-3.2-3b and granite-3.2-8b models across all server types

In [17]:
fig = make_subplots(rows=1, cols=1) 
fig = add_subplot(fig, df, row=1, col=1, column_name='inference_not_empty', title='Comparison check of inference output given')  # Add the subplot
fig.show()

### Bar chart breaking down inference output for the Llama-3.2-3b and granite-3.2-8b models by server type

In [18]:
from plotly.subplots import make_subplots
# Make subplots for each server type
subplots = df['server_type'].unique().tolist()
rows, cols = get_subplot_grid(len(subplots))
fig = make_subplots(rows=rows, cols=cols, subplot_titles=subplots)  

# Loop through each subplot and assign correct (row, col)
for idx, subplot in enumerate(subplots):
    row = (idx // cols) + 1
    col = (idx % cols) + 1
    # Filter the DataFrame for the current server type
    filtered_df = df[df['server_type'] == subplot]
    #plot for each server type
    fig = add_subplot(
        fig,
        filtered_df,
        row=row,
        col=col,
        column_name='inference_not_empty',
        title='Comparison check of inference output given for each server type'
    )

fig.show()

### Key takeaways
1. For tool call matching, the Granite-3.2-8b model outperforms the Llama-3.2-3B model by more than a factor of two.
2. For inference output, the Granite-3.2-8b model also outperforms the Llama-3.2-3B model by more than a factor of two.
3. The per-server breakdown shows where the models fail, or struggle, to call the correct tool and to provide an inference output.
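
Claims such as "more than a factor of two" can be checked numerically rather than read off the charts. The following is a minimal sketch on toy data (the model names and values are illustrative, not the notebook's results) that computes the per-model pass rate for both checks and the ratio between two models.

```python
import pandas as pd

# Toy data, illustrative only; replace with the cleaned metrics DataFrame.
toy = pd.DataFrame({
    "model": ["model-a"] * 4 + ["model-b"] * 4,
    "tool_call_match": [True, True, True, False, True, False, False, False],
    "inference_not_empty": [True, True, True, True, True, True, False, False],
})

checks = ["tool_call_match", "inference_not_empty"]

# Pass rate per model: the mean of a 0/1 column is the fraction of True rows.
rates = toy[checks].astype(int).groupby(toy["model"]).mean()

# How many times better model-a does than model-b on each check.
ratio = rates.loc["model-a"] / rates.loc["model-b"]
```

Comparing rates instead of raw counts keeps the result meaningful when the models were run on different numbers of prompts.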