# Pipeline example: extracting and transforming a PDF file

In this example, we show how to generate question-answer (QA) pairs end to end from a given PDF using uniflow's `MultiFlowsPipeline`.

### Before running the code

You will need the `uniflow` conda environment to run this notebook. You can set it up by following the installation instructions: https://github.com/CambioML/uniflow/tree/main#installation.

### Update system path

In [1]:
%reload_ext autoreload
%autoreload 2

import sys
import pprint
import re

sys.path.append(".")
sys.path.append("..")
sys.path.append("../..")

### Install libraries

In [2]:
!{sys.executable} -m pip install transformers accelerate bitsandbytes scipy nougat-ocr



### Import dependency

In [3]:
import os
import pandas as pd
from uniflow.pipeline import MultiFlowsPipeline
from uniflow.flow.config import PipelineConfig
from uniflow.flow.config import TransformHuggingFaceConfig, ExtractPDFConfig
from uniflow.op.model.model_config import HuggingfaceModelConfig, NougatModelConfig
from uniflow.op.prompt_schema import GuidedPrompt, Context
from uniflow.op.extract.split.constants import MARKDOWN_HEADER_SPLITTER



### Prepare the input data

First, let's set the current directory and the input data directory, and point to the raw PDF file.

In [4]:
dir_cur = os.getcwd()
pdf_file = "nike-paper.pdf"
input_file = os.path.join(dir_cur, "data", "raw_input", pdf_file)

data = [
    {"pdf": input_file},
]
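Before starting a long extraction run, it can help to confirm that every input path actually exists. The following is a minimal sketch only: `check_inputs` is a hypothetical helper, not part of `uniflow`, and it assumes the same `{"pdf": path}` record layout used above.

```python
import os


def check_inputs(records, key="pdf"):
    """Return the paths in `records` that do not exist on disk."""
    return [r[key] for r in records if not os.path.isfile(r[key])]


# Hypothetical usage with the same directory layout as the cell above.
example = [{"pdf": os.path.join(os.getcwd(), "data", "raw_input", "nike-paper.pdf")}]
missing = check_inputs(example)
if missing:
    print("Missing input files:", missing)
```

An empty list means all inputs were found; anything else is worth fixing before loading the models.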

### Define extract config using Nougat

In [5]:
extract_config = ExtractPDFConfig(
    model_config=NougatModelConfig(
        model_name="0.1.0-small",
        batch_size=1,  # When batch_size > 1, Nougat runs on CUDA; otherwise it runs on CPU
    ),
    splitter=MARKDOWN_HEADER_SPLITTER,
)
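To give an intuition for what splitting on markdown headers means, here is a small illustrative sketch. It is not `uniflow`'s implementation of `MARKDOWN_HEADER_SPLITTER`; the function name `split_on_markdown_headers` and the exact splitting rule (a new chunk starts at any line beginning with one to six `#` characters followed by a space) are assumptions made for illustration.

```python
import re


def split_on_markdown_headers(text):
    """Split markdown text into chunks, each starting at a header line."""
    chunks, current = [], []
    for line in text.splitlines():
        # Start a new chunk whenever a header line is seen and a chunk is in progress.
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    # Drop chunks that contain only whitespace.
    return [c for c in chunks if c]


sample = "# Title\nAuthors\n\n###### Abstract\nWe collected data.\n\n## 1 Introduction\nThere is a growing consensus."
for chunk in split_on_markdown_headers(sample):
    print(repr(chunk))
```

Each chunk keeps its header line, which matches the shape of the contexts seen later in this notebook (for example, a chunk beginning with `###### Abstract`).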

### Prepare sample prompts

The LLM needs a few sample prompts to follow. We provide them by passing a list of `Context` examples to the `GuidedPrompt` class.

In [6]:
guided_prompt = GuidedPrompt(
    instruction="""Generate one question and its corresponding answer based on the last context in the last
    example. Follow the format of the examples below to include context, question, and answer in the response""",
    examples=[
        Context(
            context="In 1948, Claude E. Shannon published A Mathematical Theory of\nCommunication (Shannon, 1948) establishing the theory of\ninformation. In his article, Shannon introduced the concept of\ninformation entropy for the first time. We will begin our journey here.",
            question="Who published A Mathematical Theory of Communication in 1948?",
            answer="Claude E. Shannon.",
        ),
])
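To see roughly what the model receives, the sketch below renders an instruction, the few-shot examples, and a new context into one text prompt. This is an assumption based on the `instruction:` / `context:` / `question:` / `answer:` layout visible in the response printed later in this notebook; `render_prompt` is a hypothetical function using plain dictionaries, not `uniflow`'s own serialization code.

```python
def render_prompt(instruction, examples, new_context):
    """Render a few-shot prompt as plain text (illustrative layout only)."""
    lines = [f"instruction: {instruction}"]
    for ex in examples:
        lines.append(f"context: {ex['context']}")
        lines.append(f"question: {ex['question']}")
        lines.append(f"answer: {ex['answer']}")
    # The new context comes last; the model is expected to continue
    # with a "question:" line and an "answer:" line.
    lines.append(f"context: {new_context}")
    return "\n".join(lines)


prompt = render_prompt(
    "Generate one question and its corresponding answer.",
    [{
        "context": "In 1948, Claude E. Shannon published A Mathematical Theory of Communication.",
        "question": "Who published A Mathematical Theory of Communication in 1948?",
        "answer": "Claude E. Shannon.",
    }],
    "###### Abstract We collected marathon performance data.",
)
print(prompt)
```

Because the prompt ends with a bare `context:` line, the quality of the examples strongly shapes the format of what the model generates next.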

### Define transform config

In this example, we will use the [HuggingfaceModelConfig](https://github.com/CambioML/uniflow/blob/main/uniflow/model/config.py#L39)'s default LLM to generate questions and answers. Let's import the config of this model.

Here, we pass our `guided_prompt` to `TransformHuggingFaceConfig` so that our customized instruction and examples are used instead of the `uniflow` defaults.

Note: depending on your GPU memory, you can adjust `batch_size` below.

In [7]:
current_batch_size = 1
print("batch size:", current_batch_size)

transform_config = TransformHuggingFaceConfig(
    guided_prompt_template=guided_prompt,
    model_config=HuggingfaceModelConfig(batch_size=current_batch_size)
)

batch size: 1


### Use MultiFlowsPipeline

Let's use `PipelineConfig` to connect `extract_config` and `transform_config` in a `MultiFlowsPipeline`.

In [8]:
p = MultiFlowsPipeline(PipelineConfig(
    extract_config=extract_config,
    transform_config=transform_config,
))

Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.16s/it]


Now we call the `run` method on the `MultiFlowsPipeline` object to execute the question-answer generation operation on the data shown above.

In [9]:
output = p.run(data)


100%|██████████| 1/1 [00:41<00:00, 41.20s/it]
100%|██████████| 13/13 [02:33<00:00, 11.79s/it]


### Output

Let's take a look at the generated output for the Abstract segment.

In [11]:
pprint.pprint(output[0][1]['output'][0]['response'][0])

('instruction: Generate one question and its corresponding answer based on the '
 'last context in the last\n'
 '    example. Follow the format of the examples below to include context, '
 'question, and answer in the response\n'
 'context: In 1948, Claude E. Shannon published A Mathematical Theory of\n'
 'Communication (Shannon, 1948) establishing the theory of\n'
 'information. In his article, Shannon introduced the concept of\n'
 'information entropy for the first time. We will begin our journey here.\n'
 'question: Who published A Mathematical Theory of Communication in 1948?\n'
 'answer: Claude E. Shannon.\n'
 'context: ###### Abstract\n'
 'We collected marathon performance data from a systematic sample of elite and '
 'sub-elite athletes over the period 2015 to 2019, then searched the internet '
 'for publicly-available photographs of these performances, identifying '
 'whether the Nike Vaporly shoes were worn or not in each performance. '
 'Controlling for athlete ability and ra
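The indexing `output[0][1]['output'][0]['response'][0]` used above is deeply nested, so a small helper can make it easier to scan all chunks at once. This is a sketch under the assumption that every chunk follows the same nesting as the one printed above; `first_responses` and `mock_output` are hypothetical names, and the mock data only imitates that shape.

```python
def first_responses(pipeline_output):
    """Collect the first response string for every chunk of the first input."""
    return [item["output"][0]["response"][0] for item in pipeline_output[0]]


# Mock data shaped like the pipeline output indexed above (illustrative only).
mock_output = [[
    {"output": [{"response": ["context: A\nquestion: Q1?\nanswer: A1."]}]},
    {"output": [{"response": ["context: B\nquestion: Q2?\nanswer: A2."]}]},
]]

for i, text in enumerate(first_responses(mock_output)):
    # Print a short single-line preview of each response.
    print(i, text[:40].replace("\n", " | "))
```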

### Process the output

Let's take a look at the generated output. We need to do a little postprocessing on the raw output first.

In [43]:
# Extracting context, question, and answer into a DataFrame
contexts = []
questions = []
answers = []

keywords = ["context:", "question:", "answer:"]
pattern = '|'.join(map(re.escape, keywords))

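for item in output[0]:
    # First response string generated for this chunk
    o = item['output'][0]['response'][0]
    # Split on the keywords and drop empty segments
    segments = [segment for segment in re.split(pattern, o) if segment.strip()]

    # Assumes the response ends with a context/question/answer triple
    contexts.append(segments[-3])
    questions.append(segments[-2])
    answers.append(segments[-1])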

# Set display options
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)

df = pd.DataFrame({
    'Context': contexts,
    'Question': questions,
    'Answer': answers
})

styled_df = df.style.set_properties(**{'text-align': 'left'}).set_table_styles([{
    'selector': 'th',
    'props': [('text-align', 'left')]
}])
styled_df

Unnamed: 0,Context,Question,Answer
0,"# An Observational Study of the Effect of Nike Vaporly Shoes on Marathon Performance Joseph Guinness, Debasmita Bhattacharya, Jenny Chen, Max Chen, Angela Loh _Cornell University_, _Ithaca NY_, _USA_, _2023_",What is the title of the study that Joseph Guinness et al conducted at Cornell University in 2023?,An Observational Study of the Effect of Nike Vaporly Shoes on Marathon Performance
1,"###### Abstract We collected marathon performance data from a systematic sample of elite and sub-elite athletes over the period 2015 to 2019, then searched the internet for publicly-available photographs of these performances, identifying whether the Nike Vaporly shoes were worn or not in each performance. Controlling for athlete ability and race difficulty, we estimated the effect on marathon times of wearing the Vaporfly shoes. Assuming that the effect of Vaporfly shoes is additive, we estimate that the Vaporfly shoes improve men's times between 2.0 and 3.9 minutes, while they improve women's times between 0.8 and 3.5 minutes. Assuming that the effect of Vaporfly shoes is multiplicative, we estimate that they improve men's times between 1.4 and 2.8 percent and women's performances between 0.6 and 2.2 percent. The improvements are in comparison to the shoe the athlete was wearing before switching to Vaporfly shoes, and represents an expected improvement rather than a guaranteed improvement.",What is the estimated effect of wearing Nike Vaporly shoes on men's marathon times?,Between 2.0 and 3.9 minutes.
2,"## 1 Introduction There is a growing consensus that Nike Corporation's new line of marathon racing shoes, which are commonly referred to as Vaporflys, provide a significant performance advantage to athletes who wear them. While several different versions of the shoes have appeared in races, including the Vaporfly 4%, the Vaporfly Next%, the Alphafly, and several prototype shoes, each iteration of the shoes has in common a carbon fiber plate stacked inside of a highly responsive foam sole. Several research studies have investigated the magnitude of the Vaporfly performance benefit. Hoogkamer et al. (2018) and Barnes and Kilding (2019) tested highly trained distance runners in laboratory studies, measuring various biomechanical and physiological variables while subjects wore Vaporflys and several other shoes in trial runs on a treadmill. Although the measured benefits varied somewhat from athlete to athlete, both studies found a roughly 4% average reduction in energy expenditures while wearing Vaporflys, in comparison to other popular racing shoes such as the Adidas Adizero Adios Boost line of racing shoes, and Nike Zoom Matumbo track spikes. The Upshot, a division of The New York Times, collected data from actual marathon performances recorded on Strava, a popular running log and GPS tracking website. Their study included hundreds of thousands of marathon performances, and dozens of different shoes. The Upshot found that the Vaporflys imparted a 4 to 5% advantage in finishing time over anaverage shoe and a 1.5 to 2.5% advantage over the second-best shoe (Kealy and Katz, 2018, 2019). A study published by Wired Magazine found that a sample of runners in the 2017 New York City Marathon were more likely to run the second half of the race faster than the first if they were wearing Vaporflys (Thompson, 2017). Our study is most similar to the Upshot study in that we analyze data from marathon performances and compare people's performances with and without the Vaporflys. 
However, our study differs in a few ways. First, instead of relying on a convenient sample of athletes who upload their data to Strava, we take an exhaustive sample of athletes who met a minimum performance standard at one of 22 of the largest marathon venues in 2015 and 2016 in the US and Canada. Second, instead of relying on self-reported shoe data, we searched the internet for photos of races and visually identified the shoes that runners wore. Third, we focus only on athletes who performed at an elite level before the Vaporflys were released to the public. Thus, we are only considering accomplished runners with marathon experience who, most likely, settled on a suitable shoe before the Vaporflys were released. These runners are also those most likely to be affected by shoe regulations because many of them compete in national Olympic qualifying races subject to regulations.",What is the performance advantage of Vaporflys compared to other popular racing shoes?,"On average, Vaporflys provide a 4% reduction in energy expenditures and a 1.5 to 2.5% advantage in finishing time over an average shoe and a 1.5 to 2.5% advantage over the second-best shoe."
3,"## 2 Study Design We selected athletes who recorded a sufficiently fast marathon time--men under 2:24 and women under 2:45--at a collection of 22 distinct marathon venues in 2015 or 2016, including the 2016 U.S. Olympic Marathon Trials, which were contested in Los Angeles in February of 2016. The list of marathons is included in the Appendix. This resulted in a sample of 270 distinct women and 308 distinct men after matching names and our best effort to correct alternate spellings of names. We recorded these athletes' performances in the same 22 marathon venues over the period 2015 to 2019, and searched publicly available online photographs, manually identifying whether or not each athlete was wearing a Nike Vaporfly shoe by visual inspection. All marathon times were downloaded from the website www.marathonguide.com. Our criteria for inclusion in the study were meant to satisfy certain objectives. First, we wanted to study elite and sub-elite athletes, since shoe regulations are motivated by performance advantages for athletes in this group. Second, we wanted to study athletes who had achieved success in the marathon before the Nike Vaporfly shoes had been released to the public. This ensures that inclusion in the study is unrelated to whether an athlete was wearing the shoes in the race where they qualified for inclusion in the study. This is important because, if any shoe effect exists, the magnitude of the effect may differ among different athletes. If we were to use performances potentially aided by the shoes to select the athletes, that might have biased our sample towards athletes who benefit most from the shoes. To identify shoes worn by the runners, we used photos posted on public websites such as marathonfoto.com, marathon-photos.com, sportphoto.com, and flashframe.io. We also collected photographs from social media sites such as facebook.com and instagram.com. 
We assumed that Vaporfly shoes were not worn in 2015 or 2016 by any runners except for a few that were reported to have worn prototypes in the 2016 US Olympic Trials Marathon. Identification of shoes via photos is a manual process that is subject to error. We have made all of our shoe identifications publicly available at [https://github.com/joeguinness/vaporfly](https://github.com/joeguinness/vaporfly) and will update this paper with new data if we are made aware of any errors in shoe identification. We identified the shoes worn in 840 of 880 (95.5%) men's performances in our dataset and in 778 of 810 (96.0%) women's performances.",What were the selection criteria for the athletes included in the study?,"The selection criteria for the athletes included in the study were elite and sub-elite athletes, athletes who had achieved success in the marathon before the Nike Vaporfly shoes had been released to the public, and athletes whose performances could be verified through publicly available online photographs."
4,"## 3 Data Exploration In Figure 1, we plot some summaries of the data. The left plot contains the proportion of runners wearing Vaporflys in each race in our dataset, separated by sex. Aside from a few prototypes being used in 2016, adoption of the shoes began in early 2017 and rose to over 50% on average in races at the end of 2019. The right plot contains the average marathon time for each athlete in the dataset in Vaporfly vs. non-Vaporfly shoes. Most runners' average time in Vaporfly shoes is faster than their average time in non-Vaporfly shoes. Specifically, 53 of 71 men (74.5%) who switched to Vaporflys ran faster in them, and 40 of 56 women (71.4%) who switched to Vaporflys ran faster in them. The right plot does not tell the whole story because it might be the case that runners who switched to Vaporflys did so when they ran on faster marathon courses. Some courses, such as the Boston Marathon course, have hills or often have poor weather, while others are flat and fast. So it is important to use the data to attempt to account for the difficulty of each Figure 1: (Left) Each circle represents an individual race, with the area of the circle proportional to the number of runners from the race in our dataset, and the vertical position equal to the proportion of runners wearing Vaporfly shoes in the race. (Right) Each circle represents an athlete, with the horizontal position being the athlete’s average marathon time in non-Vaporfly shoes, and the vertical position being the athlete’s average time in Vaporfly shoes. course. To get a satisfactory estimate of the effect of Vaporfly shoes, we need to analyze all of the data holistically, controlling for the strength of each runner and the difficulty of each marathon course. In the next section, we describe a statistical model intended for that purpose.",What is shown in the left plot of Figure 1?,"The proportion of runners wearing Vaporflys in each race in our dataset, separated by sex."
5,Who published A Mathematical Theory of Communication in 1948?,Claude E. Shannon.,"## 4 Statistical Model We seek to estimate the effect of Vaporfly shoes on marathon performance, controlling for runner ability and marathon course difficulty. This is achieved by fitting a statistical model with non-random effects for Vaporfly shoes and random effects for runner ability, overall course difficulty, and course difficulty specific to year. To allow for the possibility that men and women have different performance characteristics, we analyze data from the two sexes separately. We assign each performance a label between \(1\) and \(n\) (\(n=840\) for men, \(n=778\) for women). Each athlete is assigned a label between \(1\) and the number of athletes \(A\) (\(A=308\) men, \(A=270\) women). We assign each marathon course a label between \(1\) and the number of marathon courses \(C\) (\(C=22\)), and we assign each individual race a label between \(1\) and the number of races \(R\), (\(R=106\)). We summarize our notation for the data here: \[y_{i} =\text{ marathon time in minutes for performance }i\] \[x_{i} =\left\{\begin{array}{ll}1&\text{if Vaporfly shoes worn in performance }i\\ 0&\text{if Vaporfly shoes not worn in performance }i\end{array}\right.\] \[j(i) =\text{ label for athlete who completed performance }i\] \[k(i) =\text{ label for marathon course associated with performance }i\] \[\ell(i) =\text{ label for individual marathon race associated with performance }i\] A statistical model is a family of probability distributions that encodes the assumptions we make about the processes generating the data. Models generally include unknown parameters relevant to the questions posed in the study. The goal of the analysis is to use the data to make inferences about these parameters. 
We consider the following two models for the performances \(y_{1},\ldots,y_{n}\): \[\text{Untransformed:}\qquad\quad Y_{i} =b_{0}+b_{1}x_{i}+U_{j(i)}+V_{k(i)}+W_{\ell(i)}+Z_{i}\] \[\text{Log Untransformed:}\quad\log Y_{i} =b_{0}+b_{1}x_{i}+U_{j(i)}+V_{k(i)}+W_{\ell(i)}+Z_{i}\] with each of the individual terms defined in the following table The primary parameter of interest is \(b_{1}\), which is the effect of the Vaporfly shoes. The model assumes that, all else held constant, switching to Vaporfly shoes changes the response by \(b_{1}\). We do not attempt to model Vaporfly effects that vary among individual runners. The interpretations of the parameters are different depending on whether we take a log transformation of the times or not. When modeling untransformed times, the effect of Vaporfly shoes is additive, meaning that we expect the time to change by adding \(b_{1}\), and when modeling log-transformed times, the effect is multiplicative, meaning that we expect the time to change by multiplying by \(\exp(b_{1})\). Aside from \(b_{0}\) and \(b_{1}x_{i}\), the rest of the terms are independent normal random effects. Each of the \(A\) runners has its own offset term \(U_{j}\) to account for the fact that runners have differing abilities; each of the \(C\) marathon courses has its own offset term \(V_{k}\) to account for the fact that different marathon courses are slower or faster than others; each of the \(R\) individual races has its own offset term \(W_{\ell}\) to account for the fact that race conditions vary from year to year, making some years slower or faster than others at the same course; and finally each of the \(n\) individual performances has a term \(Z_{i}\) to account for any other factors that affected the performance. 
We also considered including time-varying runner effects, to allow each runner's fitness to improve or decline over time independently of the fitness of other runners, but we decided that this model placed too much weight on athletes who raced frequently. In the Appendix, we also include results from a combined model for men and women. In summary, we fit two models to the data: an untransformed linear regression model and a logarithmic linear regression model. Both models assume that the effect of Vaporfly shoes is additive when modeling untransformed times and multiplicative when modeling log-transformed times. The models also include random effects for runner ability, marathon course difficulty, and course difficulty specific to year. We analyze the data separately for men and women."
6,Who published A Mathematical Theory of Communication in 1948?,Claude E. Shannon.,"## 5 Estimated Parameters To fit the models, we used the lmer function, which is part of the lme4 package (Bates et al., 2015) in the R programming language (R Core Team, 2018). The lme4 package is a well-established piece of statistical software for fitting random effects models of the type we seek to estimate in this research study. Code and data for reproducing our results are available online at [https://github.com/joeguinness/vaporfly](https://github.com/joeguinness/vaporfly). We fit separate models for men and women, and additionally, separate models for the untransformed and log-transformed marathon times. The estimated parameters are summarized in the table below: For both men and women and for untransformed and log-transformed times, the Vaporfly effect is negative, indicating that the evidence supports the hypothesis that Vaporflys decrease, \begin{table} \begin{tabular}{c c c c} & men minutes & women minutes & men log minutes & women log minutes \\ \cline{2-4} & estimate (s.e.) & estimate (s.e.) & estimate (s.e.) & estimate (s.e.) \\ \hline \(b_{0}\) & 139.69 (0.59) & 159.83 (0.81) & 4.94 (0.004) & 5.070 (0.0050) \\ \(b_{1}\) & \(\boldsymbol{-2.95}\) **(0.60)** & \(\boldsymbol{-2.18}\) **(0.81)** & \(\boldsymbol{-0.0209}\) **(0.0041)** & \(\boldsymbol{-0.0135}\) **(0.0049)** \\ \(\sigma_{1}\) & 4.175 & 6.40 & 0.030 & 0.041 \\ \(\sigma_{2}\) & 1.852 & 2.33 & 0.013 & 0.014 \\ \(\sigma_{3}\) & 1.874 & 2.43 & 0.013 & 0.015 \\ \(\sigma_{4}\) & 4.108 & 5.02 & 0.028 & 0.030 \\ \end{tabular} \end{table} Table 1: Table of parameter estimates for the statistical models. or improve, marathon times. Our best estimates of the additive effects are \(-2.95\) minutes for men and \(-2.18\) minutes for women. 
Using log-transformed data, our best estimates of the multiplicative effects are \(\exp(-0.0209)=0.979\) for men, and \(\exp(-0.0135)=0.986\) for women, meaning that we expect men's times to decrease by \(2.06\%\), and women's times to decrease by \(1.34\%\), when wearing Vaporfly shoes, as compared to the shoes each athlete was wearing before switching to Vaporfly. While our estimates suggest that the effect of Vaporfly shoes is greater for men, the estimates come with some uncertainty. In the following table, we include \(90\%\) confidence intervals for each of the Vaporfly effects, constructed using a normal approximation to the sampling distribution of the estimates. None of the intervals contain zero, which indicates strong evidence for a non-zero Vaporfly effect. There is substantial overlap between the men's and women's confidence intervals, which leaves some uncertainty about which sex benefits most from Vaporfly shoes. In the Appendix, we include an analysis assessing the difference between the men's and women's Vaporfly effects; we find that we do not have sufficient evidence to conclude that the effects differ by sex. In random effects models such as those we use here, the estimates of the fixed effects, \(b_{0}\) and \(b_{1}\), are calculated using the generalized least squares criterion. Generalized least squares attempts to triangulate all of the dependencies in the data, for example the fact that there are several performances for each runner and for each race, to arrive at a statistically optimal estimate of the effects. The estimates are linear combinations of the responses, for example \[\widehat{b}_{1}=\sum_{i=1}^{n}c_{i}y_{i},\] where \(c_{1},\ldots,c_{n}\) are coefficients calculated using the covariance matrix of the random effects model. These coefficients can sometimes have seemingly counter-intuitive values. 
In the spirit of attempting to make sense of the magic of generalized least squares, and to promote its utility for this type of problem, we plot the coefficients for the estimates of the Vaporfly effects in Figure 2. To help make sense of why generalized least squares picks these coefficients, consider four performances \[\begin{array}{l}y_{1}=\mbox{Time for Runner 1 at Boston Marathon 2016}\\ y_{2}=\mbox{Time for Runner 2 at Boston Marathon 2016}\\ y_{3}=\mbox{Time for Runner 1 at Chicago Marathon 2017}\\ y_{4}=\mbox{Time for Runner 2 at Chicago Marathon 2017}\end{array}\] \begin{table} \begin{tabular}{c c c c} men minutes & women minutes & men log minutes & women log minutes \\ \hline \((-3.933,-1.959)\) & \((-3.514,-0.847)\) & \((-0.028,-0.014)\) & \((-0.022,-0.006)\) \\ \end{tabular} \end{table} Table 2: 90% confidence intervals for Vaporfly effects in each model. The first runner (\(y_{1}\) and \(y_{3}\)) did not wear Vaporflys, but the second runner (\(y_{2}\) and \(y_{4}\)) switched to Vaporflys at Chicago in 2017. A reasonable estimate for the Vaporfly effect from these data might be the average of the Vaporfly performances minus the average of the non-Vaporfly performances, \[y_{4}-\frac{1}{3}(y_{1}+y_{2}+y_{3}),\] which places a positive coefficient (1.0) on the Vaporfly performance and negative coefficients (\(-0.33\)) on the non-Vaporfly performances. However, a better estimate would consider how much the second athlete's advantage increased after switching to the Vaporflys, \[(y_{4}-y_{3})-(y_{2}-y_{1}).\] The first difference (\(y_{4}-y_{3}\)) measures how much better the second athlete did in Chicago (wearing Vaporflys), and the second difference is how much better the second athlete did in Boston (not wearing Vaporflys). 
This estimate places a positive coefficient on the second runner's Vaporfly performance in Chicago, a positive coefficient on the first runner's Boston performance, a negative coefficient on the second runner's Boston performance, and a negative coefficient on the first runner's Chicago performance. This pattern can be observed in Figure 2; early performances--before the Vaporfly appeared on the market--have either positive or negative coefficients, whereas later performances generally have positive coefficients when the Vaporfly is worn and negative coefficients when not worn. There is some variation in the magnitude of these coefficients, which we expect is due to the differing number of performances from each runner and from each race. Figure 2: Coefficients for the generalized least squares estimate of the Vaporfly effect \(\widehat{b}_{1}\) for the untransformed data. Each point represents an individual performance \(y_{i}\). The height of the point is the corresponding coefficient \(c_{i}\). Note that Figure 2 shows the coefficients in the generalized least squares estimate, not raw performances. Raw performances are simply the original observations, while the coefficients represent the change in performance associated with wearing the Vaporflys. For example, if a runner ran a time of 139 minutes without the Vaporflys and 137 minutes with them, then the coefficient for the Vaporfly performance would be 2 minutes. If the same runner ran a time of 150 minutes without the Vaporflys and 148 minutes with them, then the coefficient for the Vaporfly performance would be -2 minutes. Therefore, it is important to note that the coefficients in Figure 2 are not directly comparable across different runs or races. They reflect only the change in performance associated with wearing the Vaporflys, and they should not be interpreted as absolute differences in performance. \begin{figure}[htbp!] 
\centering \includegraphics[width=\textwidth]{figures/coef_plot.png} \caption{\textbf{Coefficients for the generalized least squares estimate of the Vaporfly effect $\hat{b}_1$ for the untransformed data.} Each point represents an individual performance $y\_i$. The height of the point is the corresponding coefficient $c\_i$. Note that the coefficients in Figure 2 are not directly comparable across different runs or races. They reflect only the change in performance associated with wearing the Vaporflys, and they should not be interpreted as absolute differences in performance.} \label{fig:coef_plot} \end{figure} \section {Discussion} Our findings provide robust evidence that Vaporfly shoes decrease marathon running times. Specifically, we found that the best estimates of the additive effects were $-2.95$ minutes for men and $-2.18$ minutes for women. When using log-transformed data, our best estimates of the multiplicative effects were $0.979$ for men and $0.986$ for women, meaning that we expect men’s times to decrease by $2.06\%$, and women’s times to decrease by $1.34\%$, when wearing Vaporfly shoes, as compared to the shoes each athlete was wearing before switching to Vaporfly. These findings are consistent with previous studies reporting similar improvements in speed when using Vaporfly shoes (Hermansson et al., 2019; Kjellström et al., 2019; Lundberg et al., 2019; Nilsson et al., 2019; Östling et al., 2019; Sörensen et al., 2019; Sundqvist et al., 2019). While our sample size is relatively small, we believe that our results are trustworthy because we included multiple models (separate models for men and women, and separately for untransformed and log-transformed times) and we replicated our analyses using two independent samples (the original dataset and the publicly available dataset). It is worth noting that our estimates come with some uncertainty, particularly for the men’s group. 
While none of the confidence intervals contain zero, there is substantial overlap between the men’s and women’s confidence intervals, which suggests that there may be some variability in the effect of Vaporfly shoes across genders. Future studies could investigate whether the effect of Vaporfly shoes differs by gender more systematically. Another limitation of our study is that we did not control for other factors"
7,Who published A Mathematical Theory of Communication in 1948?,Claude E. Shannon.,"## 6 Discussion By collecting data on marathon times and identifying shoes worn in a systematic sample of elite and sub-elite marathon runners, we studied how much a runner's marathon time can be expected to improve after switching to Vaporfly shoes. For men, the improvement is most likely somewhere between 2.0 and 3.9 minutes, or between 1.4% and 2.8%. For women it is likely between 0.8 and 3.5 minutes, or between 0.6% and 2.2%. To put these numbers into perspective, elite marathon runners cover more than half a mile in 3 minutes. We made several assumptions that we believe to be reasonable, but nevertheless are open for debate and could be refined. We assumed that the expected Vaporfly effect (\(b_{1}\)) is the same for every runner of the same sex. Including prototypes, there are several different versions of Vaporflys, and it is logical to expect that newer versions improve upon older versions. Moreover, depending on biomechanical factors, some runners may benefit from Vaporflys more than others, or could have been wearing shoes that were not optimal for them before switching to Vaporflys. The models assume that, conditional on the runner and race, the marathon time follows a normal distribution. This may not be entirely appropriate because we believe that a runner is more likely to run 5 minutes slower than expected rather than 5 minutes faster; when things go wrong in a marathon, they can go really wrong. The study did not include data on runners that did not finish their races. This data is more difficult to obtain and more difficult to model. Based on the Wired study (Thompson, 2017), which found that people wearing Vaporflys generally ran better in the second half of the race, we believe that people wearing Vaporflys are less likely to drop out, so we are more likely to miss the very worst performances in non-Vaporfly shoes. 
Thus, we believe that including drop-outs would only strengthen our estimate of the Vaporfly effect. The shoes were identified via a manual process of searching through photographs. We believe that this manual process for getting the shoe data is better (though labor intensive) than relying on self-reported shoes. Nonetheless, it is prone to error in misidentification of the person and the shoes. For example, Nike has an orange Vaporfly with a black Swoosh logo, and also a Zoom Fly with the same color scheme. Another example, some professional athletes ran in a neon yellow and pink prototype Vaporfly, which looks very similar to the neon yellow and pink Nike Zoom Streak 6. See Johnson (2020) for a more detailed analysis. Yet a further example, we identified one athlete who attempted to conceal the identity of his shoes by coloring in the white Swoosh on a blue pair of Vaporflys. Some of the athletes may have competed in marathons not included in the 22 marathons that we sampled. Missing these performances shouldn't bias our results, but our results could be strengthened if we are able to track down every performance from the athletes in the study. It is possible that athletes are more likely to switch to Vaporfly shoes when they know they are ready to turn in a good marathon performance. Inversely, some athletes might not be willing to pay $250 for shoes when they are out of shape. The Upshot study investigated this possibility by controlling for training volume; they did not see substantially different results. Further, our sample consists of solely highly accomplished runners. We believe that these athletes are generally wearing the best shoes available to them whenever they run a marathon. We were able to identify shoes in nearly all, but not all marathon performances. Athletes have the ability to suppress photographs of themselves, for example by untagging themselves in Facebook photos, or simply electing not to post pictures of themselves from their races. 
If athletes are more likely to suppress photos of poor performances in Vaporfly shoes, our estimated effect of Vaporfly shoes could be larger than it should be. **Acknowledgements** The authors thank Richard Heffron and Melissa Hardesty for discussions and providing comments on an early draft of the manuscript, and Richard Cleary for comments on the first version of the paper. We thank two peer reviewers, Harry Crane and Ted Westling, for providing detailed assessments of the paper and suggestions for improvements. Their reviews, along with our responses, are included in the Appendix. Finally, we express gratitude for those who take photos of major marathons and post them to social media, especially Karen Mitchell, Clay Shaw, and Malrie Sonier. # 7 Conclusion In this work, we present a novel approach to studying the impact of new technology on human performance using machine learning techniques. Our goal was to understand whether the introduction of carbon fiber soles in running shoes improves athletic performance. We used a combination of expert knowledge and machine learning algorithms to analyze large amounts of data collected over many years. We then applied our findings to predict the impact of introducing carbon fiber soles on the performance of elite and sub-elite marathon runners. Our approach involved three main steps: data collection, feature engineering, and modeling. First, we collected data on marathon times and shoe types from various sources, including online databases, scientific studies, and personal records. Next, we engineered features from this data that captured important characteristics of both the runners and the shoes. These features included information about the runner’s age, weight, height, gender, and previous marathon times, as well as details about the shoe’s brand, style, and materials. Finally, we trained machine learning models on this data to predict the impact of introducing carbon fiber soles on marathon times. 
To evaluate the effectiveness of our approach, we compared our predictions to actual marathon times recorded during a recent competition. We found that our models accurately predicted the impact of introducing carbon fiber soles on marathon times, with an average prediction error of just 0.3%. This suggests that our approach is effective at capturing important relationships between variables and making accurate predictions. Overall, our work demonstrates the potential of machine learning techniques to help us better understand complex systems and make informed decisions. By combining expert knowledge with large amounts of data and sophisticated algorithms, we can gain insights into phenomena that are otherwise difficult to observe directly. As such, our approach could be useful in a variety of fields, ranging from sports science to finance to healthcare."
8,Who published A Mathematical Theory of Communication in 1948?,Claude E. Shannon.,"## References * Barnes and Kilding (2019) Barnes, K. R. and Kilding, A. E. (2019). A randomized crossover study investigating the running economy of highly-trained male and female distance runners in marathon racing shoes versus track spikes. _Sports Medicine_, 49(2):331-342. * Bates et al. (2015) Bates, D., Machler, M., Bolker, B., and Walker, S. (2015). Fitting linear mixed-effects models using lme4. _Journal of Statistical Software_, 67(1):1-48. * Hoogkamer et al. (2018) Hoogkamer, W., Kipp, S., Frank, J. H., Farina, E. M., Luo, G., and Kram, R. (2018). A comparison of the energetic cost of running in marathon racing shoes. _Sports Medicine_, 48(4):1009-1019. * will anything be done about it? [https://tinyurl.com/umvmy7t](https://tinyurl.com/umvmy7t). * Kealy and Katz (2018) Kealy, K. and Katz, J. (2018). Nike says its $250 running shoes will make you run much faster. What if that's actually true? [https://www.nytimes.com/interactive/2018/07/18/upshot/nike-vaporfly-shoe-strava.html](https://www.nytimes.com/interactive/2018/07/18/upshot/nike-vaporfly-shoe-strava.html). * Kealy and Katz (2019) Kealy, K. and Katz, J. (2019). Nike's fastest shoes may give runners an even bigger advantage than we thought. [https://www.nytimes.com/interactive/2019/12/13/upshot/nike-vaporfly-next-percent-shoe-estimates.html](https://www.nytimes.com/interactive/2019/12/13/upshot/nike-vaporfly-next-percent-shoe-estimates.html). * R Core Team (2018) R Core Team (2018). _R: A Language and Environment for Statistical Computing_. R Foundation for Statistical Computing, Vienna, Austria. * Thompson, N. (2017). Do Nike's new marathon shoes actually make you run faster? [https://www.wired.com/story/do-nike-zoom-vaporfly-make-you-run-faster/](https://www.wired.com/story/do-nike-zoom-vaporfly-make-you-run-faster/). * Thompson, N. (2018). The science behind Nike's VaporFly shoes. 
[https://www.wired.com/story/the-science-behind-nikes-vaporfly-shoes/](https://www.wired.com/story/the-science-behind-nikes-vaporfly-shoes/). * Thompson, N. (2019). How do Nike's VaporFly shoes work? [https://www.wired.com/story/how-do-nikes-vaporfly-shoes-work/](https://www.wired.com/story/how-do-nikes-vaporfly-shoes-work/). * Thompson, N. (2020). What makes Nike's VaporFly shoes so fast? [https://www.wired.com/story/what-makes-nikes-vaporfly-shoes-so-fast/](https://www.wired.com/story/what-makes-nikes-vaporfly-shoes-so-fast/). * Thompson, N. (2021). Can Nike's VaporFly shoes really help you run faster? [https://www.wired.com/story/can-nikes-vaporfly-shoes-really-help-you-run-faster/](https://www.wired.com/story/can-nikes-vaporfly-shoes-really-help-you-run-faster/). * Thompson, N. (2022). Why are Nike's VaporFly shoes so expensive? [https://www.wired.com/story/why-are-nikes-vaporfly-shoes-so-expensive/](https://www.wired.com/story/why-are-nikes-vaporfly-shoes-so-expensive/). * Thompson, N. (2022). Are Nike's VaporFly shoes worth the hype? [https://www.wired.com/story/are-nikes-vaporfly-shoes-worth-the-hype/](https://www.wired.com/story/are-nikes-vaporfly-shoes-worth-the-hype/). * Thompson, N. (2022). How do Nike's VaporFly shoes compare to other high-end running shoes? [https://www.wired.com/story/how-do-nikes-vaporfly-shoes-compare-to-other-high-end-running-shoes/](https://www.wired.com/story/how-do-nikes-vaporfly-shoes-compare-to-other-high-end-running-shoes/). * Thompson, N. (2022). What is the science behind Nike's VaporFly shoes? [https://www.wired.com/story/what-is-the-science-behind-nikes-vaporfly-shoes/](https://www.wired.com/story/what-is-the-science-behind-nikes-vaporfly-shoes/). * Thompson, N. (2022)."
9,## Appendix A Table of Marathons Used in Analysis \begin{tabular}{l} \hline Boston Marathon \\ California International Marathon \\ Chicago Marathon \\ Columbus Marathon \\ Eugene Marathon \\ Grandma's Marathon \\ Houston Marathon \\ Indianapolis Monumental Marathon \\ Los Angeles Marathon \\ Lakefront Marathon \\ Marine Corps Marathon \\ New York City Marathon \\ Olympic Trials Marathon \\ Ottawa Marathon \\ Philadelphia Marathon \\ Phoenix Marathon \\ Richmond Marathon \\ Toronto Waterfront Marathon \\ Twin Cities Marathon \\ Vancouver International Marathon \\ Vermont City Marathon \\ Wineglass Marathon \\ \hline \end{tabular},What is the name of the marathon used as an example in this table?,The name of the marathon used as an example in this table is not specified.
