# Improving Developer Satisfaction through focusing on problems that provably matter

---

- Stanislaw Swierc (stansw)
- Mateusz Machalica (stupaq)
- Scott Yost (scyost)

\* @meta.com

https://github.com/StanislawSwierc/CCIW2025-Improving-Developer-Satisfaction

<img src="media/logo.jpeg" style="float:right;">

# Disclaimer

> The sample code in this presentation is for demonstration purposes only and does not reflect or represent any actual code, systems, or data used by Meta Platforms, Inc.

Notes:  
*To make the presentation more approachable I will avoid Meta-specific jargon, but if I use a term that is unclear, please let me know.*

# Spark of Curiosity
> Why don't the Developer Experience Survey results reflect the improvements in CI?

Possible explanations:
- Developer Experience Survey 
  - Questions are not clear enough
- Improvements 
  - Improvements do not address the key pain points
- Other
  - Moving target due to ever-growing codebase
  - Hedonic adaptation


# Developer Experience Survey

The Developer Experience Survey is a bi-annual survey conducted at Meta to gather feedback from developers about their experiences with the tools and processes they use in their daily work. The primary goal of the survey is to assess the perceived quality of the tools, identify areas for improvement, and compare experiences across all the product groups.


![](media/survey_cycle_1.png)


# Developer Experience Survey
The frequency of development cycles greatly exceeds the frequency of the Developer Experience Survey feedback loop.

![](media/survey_cycle_2.png)

# Developer Experience Survey
By testing for correlation we can find metrics which partially explain variations in developer productivity and sentiment.

> **We analyzed the correlation between Time In Review and user satisfaction (as measured by a company-wide survey).** The results were clear: The longer someone’s slowest 25 percent of diffs take to review, the less satisfied they were by their code review process. We now had our north star metric: P75 Time In Review.

Correlations can help us identify **Trustworthy** metrics which are demonstrably aligned with the key product goal.

- [Riggs, Patrick, Louise Huang, Seth Rogers, and James Saindon. "Move faster, wait less: Improving code review time at Meta." Engineering at Meta. 2022.](https://engineering.fb.com/2022/11/16/culture/meta-code-review-time-improving/)
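
A minimal sketch of such a correlation test, using made-up numbers rather than any real survey data. The `reviews` and `satisfaction` tables and the `spearman` helper are hypothetical names introduced here for illustration. Rank correlation is computed as the Pearson correlation of ranks, which is one common choice and not necessarily the one used in the cited post.

```python
import pandas as pd

# Hypothetical data: hours each diff spent in review, per developer.
reviews = pd.DataFrame({
    "developer": ["a"] * 4 + ["b"] * 4 + ["c"] * 4 + ["d"] * 4,
    "hours_in_review": [1, 2, 3, 4, 2, 4, 6, 8, 3, 6, 9, 12, 4, 8, 12, 16],
})
# Hypothetical survey answers on a 1-5 satisfaction scale.
satisfaction = pd.Series({"a": 5, "b": 4, "c": 4, "d": 2})


def spearman(x: pd.Series, y: pd.Series) -> float:
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    return x.rank().corr(y.rank())


# P75 Time In Review per developer: the slowest 25 percent of their diffs.
p75 = reviews.groupby("developer")["hours_in_review"].quantile(0.75)

# The two series are aligned on the developer index before correlating.
rho = spearman(p75, satisfaction)
print(f"Spearman correlation between P75 and satisfaction: {rho:.2f}")
```

In this toy data the correlation is strongly negative: developers with a higher P75 report lower satisfaction, which is the kind of relationship that would support using the metric as a proxy for survey sentiment.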



*The blueprint for how to link metrics with Developer Experience Survey results was shared by Patrick Riggs in the Engineering at Meta blog post titled "Move faster, wait less: Improving code review time at Meta." There, he casually stated...*

# Metrics: STEDII Framework

The STEDII framework outlines six essential properties that define a good metric for online controlled experiments: Sensitivity, Trustworthiness, Efficiency, Debuggability, Interpretability, and Inclusivity. 


1. A **Sensitive** metric detects real changes or effects when they occur.
2. A **Trustworthy** metric provides accurate measurements aligned with the product goals.
3. An **Efficient** metric is cost-effective in terms of data collection, processing, and analysis.
4. A **Debuggable** metric makes it easy to answer why the metric moved in either direction.
5. An **Interpretable** metric is straightforward to understand and communicate.
6. An **Inclusive and Fair** metric provides unbiased measurement.

[Gupta, Somit, and Widad Machmouchi. "STEDII Properties of a Good Metric." Microsoft Experimentation Platform. 2022.](https://www.microsoft.com/en-us/research/articles/stedii-properties-of-a-good-metric/)

Debuggability is not something you would analyze in isolation. It is a product of the metric and its supporting tools.

Examples:
- Debuggability - Imagine that an experiment shows that the test group has higher resource consumption. With a debuggable metric we can identify the set of tests which became more expensive to run.
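
The debuggability example can be sketched as a per-test breakdown of the metric. The `runs` table, its column names, and the numbers are all hypothetical. The point is only that a debuggable metric can be sliced down to the entities that moved it.

```python
import pandas as pd

# Hypothetical per-test CPU seconds recorded in each experiment group.
runs = pd.DataFrame({
    "group": ["control", "control", "control", "test", "test", "test"],
    "test_name": ["test_login", "test_search", "test_upload"] * 2,
    "cpu_seconds": [10.0, 20.0, 30.0, 10.0, 45.0, 31.0],
})

# One row per test, one column per experiment group.
by_test = runs.pivot_table(
    index="test_name", columns="group", values="cpu_seconds", aggfunc="mean"
)

# Rank tests by how much more expensive they became in the test group.
by_test["delta"] = by_test["test"] - by_test["control"]
regressions = by_test.sort_values("delta", ascending=False)
print(regressions)
```

Here `test_search` surfaces at the top of the ranking, which tells us where to look first.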


# Metrics: Developer Experience Survey
Survey results can be formed into metrics.

![](media/survey_results_metric.png)

# Metrics: Excess Merge Attempts
The number of failed merge attempts for Code Reviews that ultimately closed.

![](media/excess_merge_attempts_1.png)

# Metrics: Excess Merge Attempts
The number of failed merge attempts for Code Reviews that ultimately closed.

![](media/excess_merge_attempts_2.png)
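
As a rough illustration of the definition, the metric can be computed from a log of merge attempts. The `merge_attempts` and `code_reviews` tables, their columns, and the `"closed"` status value are assumptions made for this sketch.

```python
import pandas as pd

# Hypothetical merge log: one row per merge attempt.
merge_attempts = pd.DataFrame({
    "code_review_id": [1, 1, 1, 2, 3, 3],
    "succeeded": [False, False, True, True, False, False],
})
# Hypothetical final status of each code review.
code_reviews = pd.DataFrame({
    "code_review_id": [1, 2, 3],
    "status": ["closed", "closed", "open"],
})

# Only code reviews that ultimately closed contribute to the metric.
closed = code_reviews.loc[code_reviews["status"] == "closed", "code_review_id"]

# Keep failed attempts that belong to closed code reviews.
failed = merge_attempts[
    ~merge_attempts["succeeded"] & merge_attempts["code_review_id"].isin(closed)
]

excess_per_review = failed.groupby("code_review_id").size()
excess_merge_attempts = int(excess_per_review.sum())
print(excess_merge_attempts)
```

Code review 3 has two failed attempts but is still open, so it is excluded. Only the two failed attempts on code review 1 count.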

# Metrics: Number of Unreliable Code Review Iterations

The number of code review iterations which were incorrectly blocked by a failed signal (e.g. flaky test), which subsequently recovered upon retry or trivial rebase.

![](media/unreliable_code_review_iterations_1.png)

# Metrics: Number of Unreliable Code Review Iterations

The number of code review iterations which were incorrectly blocked by a failed signal (e.g. flaky test), which subsequently recovered upon retry or trivial rebase.

![](media/unreliable_code_review_iterations_2.png)
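
A minimal sketch of how such iterations might be detected, assuming a hypothetical `signal_runs` log in which `attempt` 1 is the original run of a signal and higher attempts are retries. Recovery after a trivial rebase is not modeled here.

```python
import pandas as pd

# Hypothetical log: one row per run of a signal on a code review iteration.
signal_runs = pd.DataFrame({
    "code_review_iteration_id": [1, 1, 2, 3, 3, 4],
    "signal": ["test_a", "test_a", "test_a", "test_b", "test_b", "test_b"],
    "attempt": [1, 2, 1, 1, 2, 1],
    "status": ["failed", "passed", "passed", "failed", "failed", "failed"],
})

unreliable_iterations = set()
grouped = signal_runs.groupby(["code_review_iteration_id", "signal"])
for (iteration_id, _signal), runs in grouped:
    statuses = runs.sort_values("attempt")["status"].tolist()
    # Blocked by a failure at first, then recovered on a later attempt.
    if statuses[0] == "failed" and "passed" in statuses[1:]:
        unreliable_iterations.add(int(iteration_id))

print(len(unreliable_iterations), sorted(unreliable_iterations))
```

Iteration 1 failed and then passed on retry, so it counts. Iteration 3 kept failing, which suggests a real problem, and iteration 4 was never retried, so neither is counted in this sketch.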

# Metrics: Summary


![](media/metric_summary.png)

# Blame Models: Schema
Sample Code Review Schema

In [1]:
import duckdb
import numpy as np
import pandas as pd

db = duckdb.connect(database=':memory:')

%load_ext sql
%sql db --alias duckdb

In [2]:
%%sql

CREATE TABLE code_review (
    code_review_id INTEGER NOT NULL,
    title VARCHAR NOT NULL,
    status VARCHAR NOT NULL
);
    
CREATE TABLE code_review_iterations (
    code_review_iteration_id INTEGER NOT NULL,
    status VARCHAR NOT NULL,

    code_review_id INTEGER NOT NULL
);

SELECT NULL AS Success WHERE FALSE;

Success


# Blame Models: Schema
Sample Test Schema

In [3]:
%%sql

CREATE TABLE tests (
    test_id INTEGER NOT NULL,
    test_name VARCHAR
);

CREATE TABLE test_results (
    test_result_id BIGINT NOT NULL,
    status VARCHAR NOT NULL,
    framework VARCHAR,

    test_id INTEGER NOT NULL,
    code_review_iteration_id INTEGER
);

SELECT NULL AS Success WHERE FALSE;

Success


# Blame Models: Schema
Blame columns

In [4]:
%%sql

ALTER TABLE test_results
ADD COLUMN blamed_for_unreliable_code_review_iteration BIGINT;

ALTER TABLE test_results
ADD COLUMN blamed_for_excess_merge_attempts BIGINT[]; -- Array!

Success


# Blame Models: Sample Data

In [5]:
n = 100
np.random.seed(40)

test_results = pd.DataFrame({
    'test_result_id': np.arange(1, n + 1),
    'status': np.random.choice(['passed', 'failed', 'skipped'], n),
    'framework': np.random.choice(['JUnit', 'Pytest', 'Jest'], n),
    'test_id': np.random.randint(1, 50, n),
    'code_review_iteration_id': np.random.choice([None, *range(1, 10)], n),
    'blamed_for_unreliable_code_review_iteration': np.random.choice([None] * 5 + list(range(1, 10)), n),
    'blamed_for_excess_merge_attempts': np.random.choice(np.array([None, [1], [1, 2], [2, 3]], dtype=object), n)
})
db.append('test_results', test_results)

<duckdb.duckdb.DuckDBPyConnection at 0x1077a74b0>

In [6]:
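# Second batch: blame ids are drawn from a much wider range, so most of
# these bad interactions end up with a single blamed test result.
test_results = pd.DataFrame({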
    'test_result_id': np.arange(1, n + 1),
    'status': np.random.choice(['passed', 'failed', 'skipped'], n),
    'framework': np.random.choice(['JUnit', 'Pytest', 'Jest'], n),
    'test_id': np.random.randint(1, 50, n),
    'code_review_iteration_id': np.random.choice([None, *range(1, 10)], n),
    'blamed_for_unreliable_code_review_iteration': np.random.choice(list(range(1, n * 10)), n),
    'blamed_for_excess_merge_attempts': np.random.choice(np.array([None, [1], [1, 2], [2, 3]], dtype=object), n)
})
db.append('test_results', test_results)

<duckdb.duckdb.DuckDBPyConnection at 0x1077a74b0>

In [7]:
%%sql
SELECT * FROM test_results LIMIT 5

test_result_id,status,framework,test_id,code_review_iteration_id,blamed_for_unreliable_code_review_iteration,blamed_for_excess_merge_attempts
1,skipped,Pytest,7,1.0,7.0,"[1, 2]"
2,failed,Jest,48,,,
3,passed,Pytest,29,,,
4,passed,JUnit,42,8.0,8.0,[1]
5,skipped,Pytest,7,,5.0,[1]


# Blame Models: Partial Blame Score
Number of Unreliable Code Review Iterations where a particular test was to blame.

In [8]:
%%sql
SELECT test_id,
    COUNT(DISTINCT blamed_for_unreliable_code_review_iteration) as partial_blame
FROM test_results 
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5

test_id,partial_blame
35,7
18,7
7,6
6,5
13,5


# Blame Models: Partial Blame Score

Number of Bad Interactions (e.g. Unreliable Code Review Iterations) where a particular test was to blame (e.g. unreliable).

In [9]:
%%sql --save test_results_blamed_for_unreliable_code_review_iteration --no-execute
SELECT *, blamed_for_unreliable_code_review_iteration as bad_interaction_id
FROM test_results
WHERE blamed_for_unreliable_code_review_iteration IS NOT NULL

In [10]:
%%sql
SELECT test_id,
    COUNT(DISTINCT bad_interaction_id) as partial_blame
FROM test_results_blamed_for_unreliable_code_review_iteration
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5

test_id,partial_blame
18,7
35,7
7,6
48,5
12,5


# Blame Models: Partial Blame Score
Number of Excess Merge Attempts where a particular test was to blame.

In [11]:
%%sql --save test_results_blamed_for_excess_merge_attempts --no-execute
SELECT * 
FROM test_results 
CROSS JOIN UNNEST(blamed_for_excess_merge_attempts) as t(bad_interaction_id)
WHERE blamed_for_excess_merge_attempts IS NOT NULL

In [12]:
%%sql
SELECT test_id,
    COUNT(DISTINCT bad_interaction_id) as partial_blame
FROM test_results_blamed_for_excess_merge_attempts
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5

test_id,partial_blame
34,3
2,3
46,3
25,3
38,3


# Blame Models: Uniform Blame Score
When a test is blamed for a bad interaction, its score is calculated by taking the reciprocal of the total number of tests blamed for that interaction. When there are multiple test results, the score gets equally divided between them.

In [13]:
%%sql
DROP TABLE IF EXISTS unreliable_code_review_iterations

Success


In [14]:
%%sql

CREATE TABLE unreliable_code_review_iterations (
    bad_interaction_id INTEGER,
    test_result_count INTEGER,
    test_id_count INTEGER,
    test_id_frequency MAP(INTEGER, INTEGER)
);

SELECT NULL AS Success WHERE FALSE;

Success


# Blame Models: Uniform Blame Score
When a test is to blame for a bad interaction, it receives a score equal to the inverse of the number of tests blamed for that bad interaction.

In [15]:
%%sql
INSERT INTO unreliable_code_review_iterations
SELECT
    blamed_for_unreliable_code_review_iteration,
    COUNT() AS test_result_count,
    COUNT(DISTINCT test_id) test_id_count,
    HISTOGRAM(test_id) as test_id_frequency
FROM test_results
WHERE blamed_for_unreliable_code_review_iteration IS NOT NULL
GROUP BY 1;

SELECT * FROM unreliable_code_review_iterations ORDER BY 2 DESC LIMIT 5

bad_interaction_id,test_result_count,test_id_count,test_id_frequency
8,11,10,"{5: 1, 21: 1, 22: 1, 25: 1, 30: 1, 31: 1, 35: 1, 41: 2, 42: 1, 48: 1}"
5,11,11,"{1: 1, 4: 1, 6: 1, 7: 1, 18: 1, 23: 1, 26: 1, 28: 1, 37: 1, 38: 1, 43: 1}"
6,8,8,"{3: 1, 15: 1, 20: 1, 23: 1, 24: 1, 36: 1, 40: 1, 42: 1}"
1,7,6,"{9: 1, 18: 2, 25: 1, 38: 1, 40: 1, 43: 1}"
9,7,7,"{1: 1, 17: 1, 18: 1, 23: 1, 27: 1, 38: 1, 40: 1}"


In [16]:
%%sql
SELECT * FROM unreliable_code_review_iterations ORDER BY 2 DESC LIMIT 5

bad_interaction_id,test_result_count,test_id_count,test_id_frequency
8,11,10,"{5: 1, 21: 1, 22: 1, 25: 1, 30: 1, 31: 1, 35: 1, 41: 2, 42: 1, 48: 1}"
5,11,11,"{1: 1, 4: 1, 6: 1, 7: 1, 18: 1, 23: 1, 26: 1, 28: 1, 37: 1, 38: 1, 43: 1}"
6,8,8,"{3: 1, 15: 1, 20: 1, 23: 1, 24: 1, 36: 1, 40: 1, 42: 1}"
1,7,6,"{9: 1, 18: 2, 25: 1, 38: 1, 40: 1, 43: 1}"
9,7,7,"{1: 1, 17: 1, 18: 1, 23: 1, 27: 1, 38: 1, 40: 1}"


# Blame Models: Uniform Blame Score
When a test is to blame for a bad interaction, it receives a score equal to the inverse of the number of tests blamed for that bad interaction.

In [17]:
%%sql
SELECT test_id,
    SUM(1.0 / test_id_count / test_id_frequency[test_id]) as uniform_blame
FROM test_results_blamed_for_unreliable_code_review_iteration
JOIN unreliable_code_review_iterations USING(bad_interaction_id)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5

test_id,uniform_blame
35,5.6
33,5.0
13,4.5
12,4.2
48,4.1


# Blame Models: Full Blame Score

When a test is to blame for a bad interaction, it receives a score of 1.0, but only if no other test was also blamed for that bad interaction.

In [18]:
%%sql
SELECT test_id,
    SUM(full_blame) AS full_blame
FROM (   
    SELECT test_id,
        bad_interaction_id,
        IF(COUNT() = ARBITRARY(test_result_count), 1, 0) full_blame,
        ARBITRARY(test_result_count) all_count
    FROM test_results_blamed_for_unreliable_code_review_iteration
    JOIN unreliable_code_review_iterations USING(bad_interaction_id)
    GROUP BY 1, 2
)
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5

test_id,full_blame
35,5
33,5
13,4
12,4
6,4


# Blame Models: All Scores
We can efficiently calculate and present all scores.

In [19]:
%%sql 
SELECT test_id,
    SUM(full_blame) AS full_blame,
    SUM(uniform_blame) AS uniform_blame,
    SUM(partial_blame) AS partial_blame
FROM (   
    SELECT test_id,
        bad_interaction_id,
        IF(COUNT() = ARBITRARY(test_result_count), 1, 0) full_blame,
        SUM(1.0 / test_id_count / test_id_frequency[test_id]) AS uniform_blame,
        1.0 AS partial_blame
    FROM test_results_blamed_for_unreliable_code_review_iteration
    JOIN unreliable_code_review_iterations USING(bad_interaction_id)
    GROUP BY 1, 2
)    
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5

test_id,full_blame,uniform_blame,partial_blame
35,5,5.6,7.0
33,5,5.0,5.0
13,4,4.5,5.0
12,4,4.2,5.0
6,4,4.090909090909091,5.0


# Blame Models: All Scores
We can calculate scores for an arbitrary filtering and grouping scheme.

In [20]:
bucket, predicate = "framework", "test_id NOT IN (10)"

In [21]:
%%sql 
SELECT {{bucket}},
    SUM(full_blame) AS full_blame,
    SUM(uniform_blame) as uniform_blame,
    SUM(partial_blame) AS partial_blame
FROM (   
    SELECT {{bucket}},
        bad_interaction_id,
        IF(COUNT() = ARBITRARY(test_result_count), 1, 0) full_blame,
        SUM(1.0 / test_id_count / test_id_frequency[test_id]) as uniform_blame,
        1.0 AS partial_blame
    FROM test_results_blamed_for_unreliable_code_review_iteration
    JOIN unreliable_code_review_iterations USING(bad_interaction_id)
    WHERE {{predicate}}
    GROUP BY bad_interaction_id, {{bucket}}
)    
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5

framework,full_blame,uniform_blame,partial_blame
Pytest,32,35.98430735930736,42.0
JUnit,30,34.90573593073593,43.0
Jest,27,31.109956709956712,38.0


# Blame Models: Summary

**Full Blame**
- **Lower bound** - if you eliminate the problem, the topline will move by at least X
- **Sensitive** - even the smallest unaccounted participation in a bad interaction counts

**Uniform Blame**
- **Fine grained** - the only sub-bad-interaction score
- **Additive** - scores can be summed up and presented in an area chart

**Partial Blame**
- **Upper bound** - if you eliminate the problem, the topline will move by at most X
- **Sensitive** - even the smallest participation in a bad interaction counts
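
The relationship between the three scores can be checked on a tiny hand-made example in plain Python. The `blamed` mapping below is hypothetical data: each bad interaction id maps to the set of tests blamed for it.

```python
from collections import defaultdict

# Hypothetical input: bad interaction id -> tests blamed for it.
blamed = {
    101: {"test_a"},
    102: {"test_a", "test_b"},
    103: {"test_a", "test_b", "test_c", "test_d"},
}

full = defaultdict(float)
uniform = defaultdict(float)
partial = defaultdict(float)

for tests in blamed.values():
    for test in tests:
        # Partial: every participation counts as a whole bad interaction.
        partial[test] += 1.0
        # Uniform: the bad interaction is split equally among blamed tests.
        uniform[test] += 1.0 / len(tests)
        # Full: counts only when the test is the sole culprit.
        if len(tests) == 1:
            full[test] += 1.0

for test in sorted(partial):
    # Full blame is a lower bound and partial blame an upper bound.
    assert full[test] <= uniform[test] <= partial[test]
    print(test, full[test], uniform[test], partial[test])

# Uniform blame is additive: it sums to the number of bad interactions.
print(sum(uniform.values()), len(blamed))
```

For `test_a` the scores are 1.0 (full), 1.75 (uniform), and 3.0 (partial). Only the uniform scores add up to the total of three bad interactions, which is why they suit stacked area charts.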


# Experimentation

**Use Code Review as the experiment unit**. This unit is coarse-grained enough to avoid interactions between test and control groups and fine-grained enough to yield well-powered experiments. Alternatives include User, Code Review Creator, and Code Review Iteration.

**Run experiments in increments of 7 days**. This will account for strong weekly seasonality in the development cycle.

**Record exposure only for Code Reviews which can be affected by the feature**. This will reduce variance in measurements. You can do it by first checking whether a Code Review can be affected and only then assigning it to the test or control group.

**Record exposure for Code Reviews created after the start of the experiment**. Existing Code Reviews implicitly start in the control group. Exclude them from the experiment to avoid bad interactions between groups.
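
A minimal sketch of the exposure rules above, assuming a hash-based 50/50 split. `EXPERIMENT_START`, `assign_group`, `record_exposure`, and the `can_be_affected` flag are hypothetical names, and the eligibility check itself is left to the caller because it depends on the feature under test.

```python
import hashlib
from datetime import datetime
from typing import Optional

EXPERIMENT_START = datetime(2025, 1, 6)  # hypothetical start date


def assign_group(code_review_id: int, salt: str = "my_experiment") -> str:
    """Deterministically assign a code review to the test or control group."""
    digest = hashlib.sha256(f"{salt}:{code_review_id}".encode()).digest()
    return "test" if digest[0] % 2 == 0 else "control"


def record_exposure(
    code_review_id: int, created_at: datetime, can_be_affected: bool
) -> Optional[str]:
    """Return the assigned group, or None when no exposure should be logged."""
    # Code reviews created before the experiment started are excluded.
    if created_at < EXPERIMENT_START:
        return None
    # Check eligibility first, and assign a group only to affected reviews.
    if not can_be_affected:
        return None
    return assign_group(code_review_id)


print(record_exposure(42, datetime(2025, 1, 7), can_be_affected=True))
print(record_exposure(42, datetime(2025, 1, 1), can_be_affected=True))
```

Hashing the code review id with an experiment-specific salt keeps the assignment stable across all iterations of the same code review without storing any state.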
    
**Communicate impact as daily impact in absolute terms.** Percentages can be misleading if exposures were limited. By communicating daily impact in absolute terms we can compare the results of multiple experiments.
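
To illustrate why percentages can mislead, here is a small sketch with invented numbers. It assumes that daily impact can be approximated as the per-Code-Review effect multiplied by the number of Code Reviews the feature can affect per day. That extrapolation is a simplifying assumption of this example.

```python
def relative_change(control_mean: float, test_mean: float) -> float:
    """Relative change of the metric in the test group versus control."""
    return (test_mean - control_mean) / control_mean


def daily_absolute_impact(
    control_mean: float, test_mean: float, affected_reviews_per_day: int
) -> float:
    """Per-review effect scaled by the daily number of affected reviews."""
    return (test_mean - control_mean) * affected_reviews_per_day


# Hypothetical experiment A: large relative win, but few affected reviews.
a_relative = relative_change(1.0, 0.5)
a_daily = daily_absolute_impact(1.0, 0.5, affected_reviews_per_day=100)

# Hypothetical experiment B: modest relative win, many affected reviews.
b_relative = relative_change(1.0, 0.875)
b_daily = daily_absolute_impact(1.0, 0.875, affected_reviews_per_day=8000)

print(f"A: {a_relative:+.1%} relative, {a_daily:+.0f} per day")
print(f"B: {b_relative:+.1%} relative, {b_daily:+.0f} per day")
```

Experiment A looks four times better in relative terms, yet experiment B removes twenty times more bad interactions per day.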


# Conclusion
We rely on topline metrics throughout the development cycle.

![](media/survey_cycle_3.png)

# Thank You