load dataset used in kpi 2.1

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import joypy
from matplotlib import cm
from scipy.stats import mannwhitneyu

In [None]:
df = pd.read_csv("/Users/frankdong/Documents/Analytics Local/williams-racing-strategies/processed_data/williams-deltas-by-sector-type.csv")

In [None]:
print(df)


<br>
EDA - let the data speak for itself through visualisations. <br>
Get to know the data both visually and statistically, lay the groundwork for analysis and hypothesis testing.<br>
1. Start with core, rough visualisations. <br>
Histograms, boxplots, violin plots, scatterplots, barplots.  <br>
2. Save all interesting "candidate" plots, output and save by exporting as PNG for review.<br>
Plots may reveal surprising outliers, odd clusters, or clear trends<br>
3. Annotate and document observations in Markdown<br>
Write brief markdown notes next to each PNG. Note patterns, anomalies, large/small group sizes, etc.<br>
4. Identify outliers, check sample sizes, assess normality. <br>
Key for hypothesis testing - boxplots and histograms help spot outliers and skew.<br>
Check sample sizes with .groupby() or .value_counts() - are all groups (teams, circuits, years) large enough for statistical tests?<br>
5. Iterate and refine - select the most informative charts for polish, annotation, and inclusion in Tableau<br>
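
The sample-size check in step 4 can be sketched roughly as below. The DataFrame is a made-up stand-in for the real CSV, and the threshold of 30 is only the rule of thumb used later in this notebook, not a hard statistical limit.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset; values are invented.
toy = pd.DataFrame({
    "sector_type": ["power", "power", "power", "balanced", "balanced", "technical"],
    "sector_delta": [0.41, 0.52, 0.38, 0.77, 0.63, 0.59],
})

MIN_N = 30  # rule-of-thumb threshold (an assumption, not a strict cut-off)

# Count rows per group and flag the groups that fall below the threshold.
group_sizes = toy["sector_type"].value_counts()
too_small = group_sizes[group_sizes < MIN_N]

print(group_sizes)
print("Groups below threshold:", sorted(too_small.index))
```

Any group listed as below the threshold is a hint to prefer non-parametric tests or to collect more data.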


-------- Steps 1, 2, 3. Rough vizs, save plots, annotate observations --------

-------------------- 1: Delta boxplot of grid delta by sector ------------------

set up new figure and size - taller boxplot helps compare heights 

In [None]:
plt.figure(figsize=(9, 9))

create boxplot

In [None]:
sector_delta_boxplot = sns.boxplot(
    x = 'sector_type', 
    y = 'sector_delta', 
    data = df, 
    hue = 'sector_type', 
    palette = 'Set2', # unbiased colour set for boxplot visualisation
    order = ['power', 'balanced', 'technical'] # order the boxes this way
)

set title, xlabel and ylabel

In [None]:
sector_delta_boxplot.set_title("Williams' Qualifying Deficit to Midfield Fastest by Sector Type (Time, 2018-2019)")
sector_delta_boxplot.set_xlabel('Sector Type')
sector_delta_boxplot.set_ylabel('Time Deficit to Midfield Fastest (seconds)')

In [None]:
plt.grid(linewidth = 0.25)
plt.show()


<br>
A boxplot summarises how Williams' qualifying sector deficits—time lost <br>
versus the fastest midfield team—are distributed across different sector types.<br>
The box captures the middle 50% of performances (the interquartile range, or IQR), <br>
while the median line indicates the typical time lost for each sector type.<br>
Whiskers extend to include most other values, and any dots (if present) represent outlying results.<br>
Power sectors show the smallest and most consistent qualifying deficits, <br>
with a tight range and a median around 0.45 seconds.<br>
Technical sectors have the second largest median deficit (about 0.6s) and the widest range, <br>
highlighting differing performance outcomes and potentially greater setup challenges.<br>
Balanced sectors surprisingly display the largest variability and maximum time lost (up to ~1.4s). <br>
This could reflect subjective labelling of sector types or underline Williams' particular struggles <br>
to find the optimal balance between top speed and downforce.<br>
No outliers present. <br>
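
As a rough illustration of how the box, whiskers and outlier dots are derived, the sketch below computes the quartiles, the IQR and the 1.5 x IQR fences (the default whisker rule in seaborn and matplotlib). The numbers are invented and are not taken from the Williams dataset.

```python
import numpy as np

# Hypothetical deficits in seconds; values are invented for illustration.
deltas = np.array([0.30, 0.38, 0.41, 0.45, 0.47, 0.52, 0.55, 0.61, 1.40])

# The box spans Q1 to Q3; the line inside the box is the median.
q1, median, q3 = np.percentile(deltas, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as outlier dots.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = deltas[(deltas < lower_fence) | (deltas > upper_fence)]

print(f"median={median:.2f}, IQR={iqr:.2f}, outliers={outliers}")
```

The whiskers themselves stop at the most extreme data points that still lie inside the fences, which is why "no dots" means no value fell outside them.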


------------------ 2. Percentage delta boxplot by sector ----------------

set up new figure 

In [None]:
plt.figure(figsize=(9, 9))

create boxplot

In [None]:
sector_percentage_delta_boxplot = sns.boxplot(
    x = 'sector_type', 
    y = 'pct_slower', 
    data = df, 
    hue = 'sector_type', 
    palette = 'Set2', # unbiased colour set for boxplot visualisation
    order = ['power', 'balanced', 'technical'] # order the boxes this way
)

set title, xlabel and ylabel

In [None]:
sector_percentage_delta_boxplot.set_title("Williams' Qualifying Deficit to Midfield Fastest by Sector Type (%, 2018-2019)")
sector_percentage_delta_boxplot.set_xlabel('Sector Type')
sector_percentage_delta_boxplot.set_ylabel('% Slower to Midfield Fastest')

In [None]:
plt.grid(linewidth = 0.25)
plt.show()


<br>
This boxplot compares qualifying sector times in relative terms compared to the fastest <br>
midfield team, offering an additional perspective on where and how competitive gaps open up.<br>
Power sectors again show the smallest median relative deficit (~2%) and the tightest IQR.<br>
Outside the box, however, power sectors span a wide range of relative deficits, including an<br>
outlier around 4.47%. This should be kept in mind when interpreting test results or considering data cleaning.<br>
Balanced sectors now exhibit less volatility in relative terms compared to absolute times, <br>
with a consistently narrow band of percentage deficits. <br>
Technical sectors demonstrate the largest variability within the IQR. However, the gap is capped at approximately +3.5%, <br>
and, in encouraging moments, Williams nearly matches the fastest midfield team with a minimum deficit near +0.15%.<br>



<br>
Both charts confirm power sectors are Williams' comparative stronghold, with the smallest<br>
and most stable time losses. <br>
Technical and balanced sectors introduce greater unpredictability, and at times, larger<br>
performance gaps - possibly pointing to either difficulties in setup or sector labelling. <br>


---------------- 3. Histograms to check for normality - absolute -----------------

plot a grid of three histograms, one per sector type

In [None]:
grid = sns.FacetGrid(
    df, 
    col = 'sector_type', 
    col_order = ['power', 'balanced', 'technical'],
    sharex = True, sharey = True, 
    height = 4, aspect = 1
)
grid.map(
    sns.histplot, 
    'sector_delta', 
    kde=True, 
    stat='density', 
    bins=15, 
    color='royalblue'
)

annotate counts to each plot on the grid

In [None]:
for ax, sector in zip(grid.axes.flat, ['power', 'balanced', 'technical']):
    n = df[df['sector_type'] == sector].shape[0]
    ax.text(0.95, 0.95, f'n = {n}', ha='right', va='top', transform=ax.transAxes,
            fontsize=12, bbox=dict(boxstyle='round', alpha=0.2))
    ax.set_xlabel('Time Deficit (s)')
    ax.set_ylabel('Density')
    ax.set_title(f'{sector.capitalize()} Sectors')

In [None]:
plt.suptitle("Williams' Sector Delta Distributions by Sector Type (s)", y=1.08, fontsize=16)
plt.tight_layout()
plt.show()

------------------ 4. Histograms to check for normality - relative -------------------

plot a grid of three histograms, one per sector type

In [None]:
grid = sns.FacetGrid(
    df, 
    col = 'sector_type', 
    col_order = ['power', 'balanced', 'technical'],
    sharex = True, sharey = True, 
    height = 4, aspect = 1
)
grid.map(
    sns.histplot, 
    'pct_slower', 
    kde=True, 
    stat='density', 
    bins=15, 
    color='royalblue'
)

annotate counts to each plot on the grid

In [None]:
for ax, sector in zip(grid.axes.flat, ['power', 'balanced', 'technical']):
    n = df[df['sector_type'] == sector].shape[0]
    ax.text(0.95, 0.95, f'n = {n}', ha='right', va='top', transform=ax.transAxes,
            fontsize=12, bbox=dict(boxstyle='round', alpha=0.2))
    ax.set_xlabel('Percent Slower (%)')
    ax.set_ylabel('Density')
    ax.set_title(f'{sector.capitalize()} Sectors')

In [None]:
plt.suptitle("Williams' Sector Delta Distributions by Sector Type (%)", y=1.08, fontsize=16)
plt.tight_layout()
plt.show()


<br>
For both plots, <br>
    -> Sample Size: (P: 21, B: 17, T: 16). <br>
        Since none exceed n = 30, we cannot rely on the Central Limit Theorem to justify<br>
        standard parametric tests like the t-test.<br>
    -> Outliers: <br>
        No obvious outliers present in the first grid. <br>
        Technical sectors in the second '% relative' plot seem skewed by a dense cluster near x = +3.4%.<br>
    -> Shape: <br>
        All three sector types show some skew and multimodal patterns. <br>
        Especially for balanced and technical sectors, distributions don't follow typical, <br>
        classic bell-curve shapes. <br>
Conclusion: <br>
-> The Mann-Whitney U test is probably the best choice to compare Williams' sector time deficits, <br>
particularly between technical and power sectors. <br>
-> Test works well without assuming normality or large n. Also robust to small samples and <br>
subtle data quirks. <br>
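
The visual normality check could be complemented with a formal test. The sketch below uses the Shapiro-Wilk test from SciPy, which is an addition for illustration and not part of the original analysis. The samples are simulated with the group sizes quoted above; their values and distributions are assumptions.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)

# Hypothetical samples using the group sizes quoted above; values are simulated.
samples = {
    "power": rng.normal(loc=0.45, scale=0.10, size=21),
    "balanced": rng.exponential(scale=0.50, size=17),
    "technical": rng.exponential(scale=0.60, size=16),
}

# Shapiro-Wilk: a small p-value suggests the sample departs from normality.
results = {}
for name, values in samples.items():
    stat, p = shapiro(values)
    results[name] = (stat, p)
    print(f"{name:<10} W={stat:.3f}  p={p:.4f}")
```

With samples this small the test has limited power, so a large p-value should not be read as proof of normality.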


------------------- Step 4. Perform hypothesis testing ----------------

count samples for each sector type

In [None]:
sample_counts = df['sector_type'].value_counts()
print("Sample sizes for each sector: \n", sample_counts)


<br>
nP, nB, nT < 30.<br>
We will perform a Mann-Whitney U test - a non-parametric alternative to a t-test. <br>
This ensures robustness given small sample sizes and multimodal, non-normal samples. <br>
Note results and observations at a 95% confidence level. <br>



<br>
Hypothesis Recap:<br>
Compare Williams' absolute qualifying sector deficits to the fastest midfield team  <br>
between technical and power sectors during 2018-2019.<br>
Groups:  <br>
- Power sectors: n = 21  <br>
- Technical sectors: n = 16  <br>
- Two independent populations representing sector types.<br>
We are testing whether the deficits in technical sectors are significantly worse (larger) than those in power sectors,  <br>
focusing on one direction because of the hypothesis that technical sectors reveal greater performance limitations.<br>
Test type:  <br>
A two-sample, independent, one-tailed Mann-Whitney U test to determine whether  <br>
Williams' absolute deficits in technical sectors tend to be greater than those in power sectors.<br>
Because the test compares ranks rather than means, the hypotheses are stated in terms of the two distributions.<br>
Hypotheses:  <br>
- Null (H0): Deficits in technical sectors are not stochastically greater than in power sectors  <br>
- Alternative (H1): Deficits in technical sectors are stochastically greater than in power sectors<br>
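
To make the test statistic concrete, here is a minimal sketch of how a U statistic can be computed by hand: count the (technical, power) pairs in which the technical value is larger, with ties counting as half. The helper `u_statistic` and the sample values are hypothetical. To my understanding, recent SciPy versions report this same quantity for the first sample passed to `mannwhitneyu`.

```python
import numpy as np

def u_statistic(x, y):
    """U for sample x: number of pairs with x_i > y_j, ties counted as 0.5."""
    x = np.asarray(x, dtype=float)[:, None]  # column vector
    y = np.asarray(y, dtype=float)[None, :]  # row vector
    return float(np.sum(x > y) + 0.5 * np.sum(x == y))

# Hypothetical deficits in seconds; values are invented for illustration.
technical = [0.62, 0.81, 0.55, 0.90]
power = [0.45, 0.50, 0.62, 0.40, 0.48]

u_tech = u_statistic(technical, power)
u_power = u_statistic(power, technical)

# The two U values always sum to the number of pairs, n1 * n2.
print(u_tech, u_power, len(technical) * len(power))
```

A U close to n1 * n2 means almost every technical value beats every power value; a U near half of that means no clear tendency.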


--------- 1. Absolute deficits (s) --------

1. extract relevant data for the test

In [None]:
technical_deficits = df[df['sector_type'] == 'technical']['sector_delta']
power_deficits = df[df['sector_type'] == 'power']['sector_delta']

2. run the mann-whitney u test

In [None]:
m_stat, p_value = mannwhitneyu(technical_deficits, power_deficits, alternative = "greater")

In [None]:
print("\nMann-Whitney U Test for technical vs. power absolute sector deficits at 95% confidence level.\n")
print(f"Mann-Whitney U statistic: {m_stat:.3f}")
print(f"One-tailed p-value: {p_value:.4f}")

In [None]:
alpha = 0.05 # significance level (95% confidence)

In [None]:
if p_value < alpha:
    print("\nReject the null hypothesis (H0) in favour of H1: Williams' qualifying deficit in technical sectors is significantly greater than in power sectors.")
else: # p_value >= alpha
    print("\nFail to reject the null hypothesis (H0): There is no significant evidence that Williams' qualifying deficit in technical sectors is greater than in power sectors.")


<br>
Mann-Whitney U Test for technical vs. power absolute sector deficits at 95% confidence level.<br>
Mann-Whitney U statistic: 204.000<br>
One-tailed p-value: 0.1382<br>
Fail to reject the null hypothesis (H0): There is no significant evidence that Williams' qualifying deficit in technical sectors is greater than in power sectors.<br>
---<br>
While our data sample suggests a trend towards larger deficits in technical sectors, the current sample size<br>
and variability mean we cannot confidently state that Williams struggles more in technical sectors compared to <br>
power sectors based on qualifying sector times. <br>
This invites further analysis, with additional data, alternative metrics, or complementary performance angles, <br>
like driver experience or consistency - which is explored in SQ/KPI 3. <br>
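
A p-value alone says little about the size of the trend. As a rough sketch, the reported U statistic can be turned into an effect size. This assumes that the 204 printed above is the U for the technical sample (the first argument passed to `mannwhitneyu`), with the group sizes from the hypothesis recap.

```python
# Values reported above: U for the technical sample and the two group sizes.
u_stat = 204.0
n_technical = 16
n_power = 21

# Share of (technical, power) pairs in which the technical deficit is larger
# (ties count as half); 0.5 would mean no tendency either way.
prob_superiority = u_stat / (n_technical * n_power)

# Rank-biserial correlation: rescales that share to the range [-1, 1].
rank_biserial = 2 * prob_superiority - 1

print(f"P(technical > power) ~ {prob_superiority:.3f}")
print(f"rank-biserial r      ~ {rank_biserial:.3f}")
```

Values near 0.5 and 0 respectively would indicate no tendency, so this gives a scale for the "trend" described above.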


--------- 2. Relative deficits (%) ----------

1. Filter out the outlier

In [None]:
df_pct = df[df['pct_slower'] < 4.471] # filters out the one pct_slower value of 4.471 in power sectors
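
Filtering on a hard-coded threshold depends on the stored float comparing as expected against 4.471. A possible alternative, sketched on invented data, is to drop the row by its index label. This assumes the outlier is the single largest `pct_slower` value in the dataset.

```python
import pandas as pd

# Hypothetical toy data; only the 4.471 value mirrors the outlier named above.
toy = pd.DataFrame({
    "sector_type": ["power", "power", "technical", "technical"],
    "pct_slower": [2.01, 4.471, 3.40, 0.15],
})

# Locate the single largest value and drop that row by its index label.
outlier_idx = toy["pct_slower"].idxmax()
toy_pct = toy.drop(index=outlier_idx)

print(f"Dropped row {outlier_idx}; {len(toy)} -> {len(toy_pct)} rows")
```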

2. Extract relevant data

In [None]:
pct_technical = df_pct[df_pct['sector_type'] == 'technical']['pct_slower']
pct_power = df_pct[df_pct['sector_type'] == 'power']['pct_slower']

3. Perform Mann-Whitney U Test (one-tailed, technical > power)

In [None]:
m_stat, p_value = mannwhitneyu(pct_technical, pct_power, alternative="greater")

4. Print results

In [None]:
print("\n\nMann-Whitney U Test for technical vs. power '% slower' at 95% confidence level.\n")
print(f"Mann-Whitney U statistic: {m_stat:.3f}")
print(f"One-tailed p-value: {p_value:.4f}")

In [None]:
alpha = 0.05  # significance level

In [None]:
if p_value < alpha:
    print("\nReject the null hypothesis (H0) in favour of H1: Williams was significantly slower in technical sectors (as % behind fastest midfield).")
else:
    print("\nFail to reject the null hypothesis (H0): No significant evidence Williams was slower in technical sectors (as % behind fastest midfield).")


<br>
Mann-Whitney U Test for technical vs. power '% slower' at 95% confidence level.<br>
Mann-Whitney U statistic: 181.000<br>
One-tailed p-value: 0.2570<br>
Fail to reject the null hypothesis (H0): No significant evidence Williams was slower in technical sectors (as % behind fastest midfield).<br>
---<br>
The relative performance gap, measured in % behind the fastest midfield team, did not reach statistical significance. <br>
This suggests: <br>
    -> While limitations may exist, they weren't consistently large enough to be confirmed in this sample. <br>
    -> Williams' deficits in power and technical sectors may have been more evenly distributed than expected<br>
    -> Otherwise, the small sample size, n < 30, may not offer enough power to detect small, but real differences. <br>
Nonetheless, this trend invites further investigation, possibly using a broader time period or cross-checking against<br>
driver performance, as a complement to racecraft performance. <br>
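
The point about limited power can be explored with a small simulation. The sketch below is illustrative only. It assumes normally distributed percentage deficits with an invented mean and spread, and group sizes of 16 (technical) and 20 (power, after removing the one outlier). The helper `simulated_power` is hypothetical.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def simulated_power(shift, n_tech=16, n_power=20, sd=0.8, alpha=0.05,
                    n_sims=500, seed=0):
    """Share of simulated datasets in which the one-tailed test rejects H0."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        # Assumed distributions: normal, baseline mean 2.0 (% slower).
        power_sample = rng.normal(2.0, sd, n_power)
        tech_sample = rng.normal(2.0 + shift, sd, n_tech)
        _, p = mannwhitneyu(tech_sample, power_sample, alternative="greater")
        rejections += int(p < alpha)
    return rejections / n_sims

power_no_shift = simulated_power(0.0)  # false-positive rate, near alpha
power_small = simulated_power(0.3)     # small true difference
power_large = simulated_power(1.0)     # large true difference

print(power_no_shift, power_small, power_large)
```

If the rejection rate for a plausible shift is low, a non-significant result says little about whether a real difference exists.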


-------------------- Step 5: Further visualisations ------------------


<br>
Main variables of interest are: <br>
- sector_delta (absolute deficit in seconds)<br>
- pct_slower (relative deficit, percentage slower behind fastest)<br>
- sector_type (power, balanced, or technical)<br>
- race, sector, fastest_team<br>
Since we failed to reject H0, we are looking for charts that add storytelling value. <br>


1. Heatmap - avg % slower by sector type and sector number

In [None]:
pivot = df.pivot_table(index="sector", columns="sector_type", values="pct_slower", aggfunc="mean")

In [None]:
pivot = pivot[['power', 'balanced', 'technical']]
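
Selecting columns with a plain list raises a `KeyError` if a sector type never appears in the data. A slightly more forgiving variant, sketched here on invented data, uses `reindex`, which keeps the requested order and fills any missing column with NaN.

```python
import pandas as pd

# Hypothetical toy data with no 'balanced' rows; values are invented.
toy = pd.DataFrame({
    "sector": [1, 1, 2, 3],
    "sector_type": ["power", "technical", "power", "technical"],
    "pct_slower": [2.0, 3.0, 1.5, 2.5],
})

toy_pivot = toy.pivot_table(index="sector", columns="sector_type",
                            values="pct_slower", aggfunc="mean")

# reindex enforces the column order and adds an all-NaN 'balanced' column,
# where toy_pivot[['power', 'balanced', 'technical']] would raise KeyError.
toy_pivot = toy_pivot.reindex(columns=["power", "balanced", "technical"])

print(toy_pivot)
```

In the heatmap, NaN cells are simply left blank, which makes missing sector/type combinations easy to spot.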

In [None]:
plt.figure(figsize = (12, 8))

In [None]:
sectors_heatmap = sns.heatmap(
    data = pivot, 
    annot = True, 
    cmap = "YlOrRd", 
    fmt = ".2f", 
    cbar_kws = {'label': '% Slower'} # labels the colour bar on the side
)

In [None]:
sectors_heatmap.set_title("Average % Slower by Sector Number and Sector Type")
sectors_heatmap.set_xlabel("Sector Type")
sectors_heatmap.set_ylabel("Sector Number")

In [None]:
plt.tight_layout()
plt.savefig("plots/plots2/5-sectors-heatmap.png")
plt.show()

2. strip plot - view individual performances per sector type

In [None]:
plt.figure(figsize = (12, 8))

In [None]:
sector_stripplot = sns.stripplot(
    data = df, 
    x = 'sector_type', 
    y = 'pct_slower', 
    hue = 'sector_type',
    palette = 'Set2', 
    order = ['power', 'balanced', 'technical']
)

In [None]:
sector_stripplot.set_title("All Relative Deficits by Sector Type")
sector_stripplot.set_xlabel("Sector Type")
sector_stripplot.set_ylabel("% Slower")

In [None]:
plt.grid(linewidth = 0.25)
plt.tight_layout()
plt.savefig("plots/plots2/6-sectors-stripplot.png")
plt.show()


<br>
Stripplots are great for identifying individual race performances, but also give us a sense of the number of values considered in the data. <br>
Here, it's very apparent that there are not enough entries in each segment. <br>


3. violin plot - view distribution shape per sector type

In [None]:
plt.figure(figsize = (12, 8))

In [None]:
sector_violinplot = sns.violinplot(
    data = df, 
    x = 'sector_type', 
    y = 'pct_slower', 
    hue = 'sector_type',
    palette = 'Set2', 
    order = ['power', 'balanced', 'technical']
)

In [None]:
sector_violinplot.set_title("Distribution of Relative Deficits by Sector Type")
sector_violinplot.set_xlabel("Sector Type")
sector_violinplot.set_ylabel("% Slower")

In [None]:
plt.grid(linewidth = 0.25)
plt.tight_layout()
plt.savefig("plots/plots2/7-sectors-violinplot.png")
plt.show()


<br>
This sort of plot helps us identify distributions, combining KDEs and boxplots. <br>
However, the granularity of seeing individual performances is now lost. <br>


4. barchart - average relative deficit per race (mean across all sector types)

In [None]:
grouped = df.groupby('race')['pct_slower'].mean().sort_values(ascending = False).reset_index() # form a new groupby dataframe
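
Since each race contributes only a few sector rows, it may help to carry the row count alongside the mean, so that averages built from very few values are easy to spot. This is a sketch on invented data using pandas named aggregation. The race names and the column names `mean_pct` and `n_rows` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; race names and values are invented.
toy = pd.DataFrame({
    "race": ["Race A", "Race A", "Race B"],
    "pct_slower": [2.0, 3.0, 1.0],
})

# Mean deficit and number of contributing rows per race, largest mean first.
summary = (
    toy.groupby("race")
       .agg(mean_pct=("pct_slower", "mean"), n_rows=("pct_slower", "size"))
       .sort_values("mean_pct", ascending=False)
       .reset_index()
)

print(summary)
```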

In [None]:
plt.figure(figsize = (12, 6))

In [None]:
barchart = sns.barplot(
    data = grouped, 
    x = 'pct_slower', 
    y = 'race', 
    zorder = 3 # place bars over the gridlines
)

create rounded labels as strings with 2 decimal places

In [None]:
labels = [f"{x:.2f}" for x in grouped['pct_slower']]
barchart.bar_label(barchart.containers[0], labels=labels, fontsize=10)

In [None]:
barchart.set_title("Williams' Average Relative Qualifying Deficit by Circuit (2018-2019)")
barchart.set_xlabel("% Slower to Fastest Midfield")
barchart.set_ylabel("Circuit")

In [None]:
plt.grid(linewidth = 0.25, axis = 'x', zorder = 0) # vertical lines only, behind the bars. 
plt.tight_layout()
plt.savefig("plots/plots2/8-relative-barplot.png")
plt.show()

5. barchart - average absolute deficit per race (mean across all sector types)

In [None]:
grouped = df.groupby('race')['sector_delta'].mean().sort_values(ascending = False).reset_index() # form a new groupby dataframe

In [None]:
plt.figure(figsize = (12, 6))

In [None]:
barchart = sns.barplot(
    data = grouped, 
    x = 'sector_delta', 
    y = 'race', 
    zorder = 3 # place bars over the gridlines
)

create rounded labels as strings with 2 decimal places

In [None]:
labels = [f"{x:.2f}" for x in grouped['sector_delta']]
barchart.bar_label(barchart.containers[0], labels=labels, fontsize=10)

In [None]:
barchart.set_title("Williams' Average Absolute Qualifying Deficit by Circuit (2018-2019)")
barchart.set_xlabel("(s) Slower to Fastest Midfield")
barchart.set_ylabel("Circuit")

In [None]:
plt.grid(linewidth = 0.25, axis = 'x', zorder = 0) # vertical lines only, behind the bars. 
plt.tight_layout()
plt.savefig("plots/plots2/9-absolute-barplot.png")
plt.show()

6. countplot - find the most frequent fastest rivals, and by what sector types

In [None]:
plt.figure(figsize = (12, 8))

In [None]:
teams_countplot = sns.countplot(
    data=df, 
    x="fastest_team", 
    hue="sector_type", 
    palette = 'Set2',
    order=df['fastest_team'].value_counts().index, # order teams by how often they were fastest, most frequent first
    zorder = 3
)

In [None]:
teams_countplot.set_title("Fastest Midfield Rivals by Sector Types")
teams_countplot.set_xlabel("Midfield Rival")
teams_countplot.set_ylabel("Times Achieved Fastest Midfield Team")

In [None]:
plt.grid(linewidth = 0.25, axis = 'y', zorder = 1) # add gridlines before saving so they appear in the PNG
plt.tight_layout()
plt.savefig('plots/plots2/10-teams-countplot.png')
plt.show()