In [1]:
import pandas as pd
import mlflow
import plotly.express as px

## Data import

In [2]:
mlflow.set_tracking_uri("http://localhost:5000")

# Load the runs into a DataFrame
df_runs = mlflow.search_runs(experiment_ids=["1"])

# View the first rows (transposed for readability)
df_runs.head().T

Unnamed: 0,0,1,2,3,4
run_id,f556b6c922974f23a1be74a3c473bbb3,ba35a24b7a044b6fb48394b6d41e50af,8441e3f09d094442a4782890560c5c29,53fef2064ced494e8d35798c5ed0c97d,3a9b1123f498464abe1f58ee7e7fba28
experiment_id,1,1,1,1,1
status,FINISHED,FINISHED,FINISHED,FINISHED,FINISHED
artifact_uri,/mlflow/artifacts/1/f556b6c922974f23a1be74a3c4...,/mlflow/artifacts/1/ba35a24b7a044b6fb48394b6d4...,/mlflow/artifacts/1/8441e3f09d094442a478289056...,/mlflow/artifacts/1/53fef2064ced494e8d35798c5e...,/mlflow/artifacts/1/3a9b1123f498464abe1f58ee7e...
start_time,2026-01-08 16:45:53.770000+00:00,2026-01-08 16:44:20.433000+00:00,2026-01-08 16:42:47.092000+00:00,2026-01-08 16:41:13.794000+00:00,2026-01-08 16:39:40.458000+00:00
end_time,2026-01-08 16:47:27.180000+00:00,2026-01-08 16:45:53.672000+00:00,2026-01-08 16:44:20.332000+00:00,2026-01-08 16:42:47.003000+00:00,2026-01-08 16:41:13.709000+00:00
metrics.Q3_data_availability_match,1.0,1.0,0.0,1.0,1.0
metrics.Q1_mentions_transitory,1.0,1.0,1.0,1.0,1.0
metrics.Q3_hallucination_detected,1.0,0.0,1.0,1.0,0.0
metrics.overall_score,8.0,6.0,6.0,7.0,6.0


In [3]:
# Dropping unneeded columns for analysis
df_runs = df_runs.drop(columns=['run_id', 'experiment_id', 'status', 'artifact_uri',
                                'start_time', 'end_time', 'tags.mlflow.source.name',
                                'tags.mlflow.source.type', 'tags.mlflow.user','tags.mlflow.source.git.commit'])
df_runs = df_runs.rename(columns={'tags.mlflow.runName': 'run_name'})
df_runs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39 entries, 0 to 38
Data columns (total 19 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   metrics.Q3_data_availability_match  39 non-null     float64
 1   metrics.Q1_mentions_transitory      39 non-null     float64
 2   metrics.Q3_hallucination_detected   39 non-null     float64
 3   metrics.overall_score               39 non-null     float64
 4   metrics.Q2_mentions_2008_tone       39 non-null     float64
 5   metrics.Q1_mentions_early_2021      39 non-null     float64
 6   metrics.Q2_final_score              39 non-null     float64
 7   metrics.Q3_final_score              39 non-null     float64
 8   metrics.Q1_shifted                  39 non-null     float64
 9   metrics.Q1_final_score              39 non-null     float64
 10  metrics.Q2_mentions_2020_tone       39 non-null     float64
 11  metrics.Q2_provides_comparison      39 non-null

In [4]:
df_runs['params.k'] = df_runs['params.k'].astype(int)
df_runs['metrics.overall_score'] = df_runs['metrics.overall_score'].astype(int)

**Findings**  
For the semantic chunking runs, the "percentile" parameter was not recorded in MLflow. Reviewing the function responsible for the chunking parameters did not reveal why.
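A quick way to confirm which parameters are missing for which chunking method is to compute the share of missing values per `params.*` column, grouped by method. The following is a minimal sketch, not the code used in this notebook; the helper name `missing_params_by_group` is hypothetical, and the grouping column (for example `params.chunking_method`) is an assumption, since the actual column name is not shown above.

```python
import pandas as pd


def missing_params_by_group(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Return the share of missing values in each `params.*` column, per group.

    `group_col` is assumed to be the column that identifies the chunking
    method (hypothetical name, adjust to the logged parameter).
    """
    # Every logged parameter except the grouping column itself
    param_cols = [
        c for c in df.columns if c.startswith("params.") and c != group_col
    ]
    # isna() yields booleans; their mean per group is the missing fraction
    return df[param_cols].isna().groupby(df[group_col]).mean()
```

Calling `missing_params_by_group(df_runs, "params.chunking_method")` (with the real column name substituted) would show a value of 1.0 for any parameter that was never logged for a given method.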

## Analysis

### Top 5 Runs

In [None]:
top_5_runs = df_runs.nlargest(5, 'metrics.overall_score')

fig = px.bar(
    top_5_runs,
    x='run_name',
    y='metrics.overall_score',
    color='run_name',
    title='Top 5 Runs by Overall Score',
    labels={'run_name': 'Run Name', 'metrics.overall_score': 'Overall Score'},
    height=600,
    width=1680
)
fig.update_xaxes(
    showticklabels=False,
    title_text="Runs"
)
fig.show()

**Findings**  
Four of the top five runs used the recursive character splitting method with a chunk size of 1500; the two best used an overlap of 15.

Since retrieving 20 chunks consumes fewer resources than retrieving 30, k = 20 is the configuration chosen for deployment.
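Because the bar chart hides the tick labels, the parameters behind each top run are easier to read from a table. The sketch below is one possible way to build it, assuming the `run_name` column created earlier and that all logged parameters carry the `params.` prefix; the helper name `top_runs_table` is hypothetical.

```python
import pandas as pd


def top_runs_table(
    df: pd.DataFrame,
    n: int = 5,
    score_col: str = "metrics.overall_score",
) -> pd.DataFrame:
    """Return the n best runs with their name, parameters, and score."""
    # Collect every logged parameter column
    param_cols = [c for c in df.columns if c.startswith("params.")]
    cols = ["run_name", *param_cols, score_col]
    # nlargest sorts by score in descending order and keeps the first n rows
    return df.nlargest(n, score_col)[cols].reset_index(drop=True)
```

For example, `top_runs_table(df_runs)` would list the splitting method, chunk size, overlap, and k of the five best runs side by side.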

### Analysis of k with respect to overall score

In [6]:
# Create the box plot

fig = px.box(
    df_runs, 
    x="params.k", 
    y="metrics.overall_score",
    color="params.k",        
    title="Experiment Analysis: Overall Score vs k",
    width=1000,
    height=600
    )
fig.update_layout(showlegend=False)

# Show the interactive plot
fig.show()

**Findings**  
At first glance, there does not seem to be any correlation between the number of chunks retrieved (k) and the overall score.
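The visual impression can be backed with numbers. The following is a minimal sketch, assuming `params.k` and `metrics.overall_score` are numeric as cast earlier; the helper name `k_vs_score` is hypothetical. It returns a per-k summary and a Spearman rank correlation, computed here as the Pearson correlation of the ranks so that no extra dependency is needed.

```python
import pandas as pd


def k_vs_score(
    df: pd.DataFrame,
    k_col: str = "params.k",
    score_col: str = "metrics.overall_score",
) -> tuple[pd.DataFrame, float]:
    """Summarize the score per k and compute a rank correlation."""
    # Count, mean, median, and standard deviation of the score for each k
    summary = df.groupby(k_col)[score_col].agg(["count", "mean", "median", "std"])
    # Spearman correlation equals the Pearson correlation of average ranks
    rank_corr = df[k_col].rank().corr(df[score_col].rank())
    return summary, rank_corr
```

A rank correlation close to 0 would support the reading above, although with only 39 runs the estimate should be treated with caution.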

## Conclusion

The recursive splitting method outperformed semantic chunking, and there is no evident correlation between the number of chunks retrieved and the score.

The parameter configuration for the deployed RAG will be:
-   Recursive splitting
-   chunk size: 1500
-   chunk overlap: 15
-   number of chunks retrieved (k): 20