# Analyzing ISU "Deviation Points"

The International Skating Union analyzes judges' scores using a "deviation points" system. The analysis below replicates the calculations described in [ISU Communication 2098](http://www.isu.org/communications/593-isu-communication-2098/file), Section H, entitled "Criteria for the identification of cases of evaluation in the Judges’ GOEs and Program Components scores."

## Load judge data

In [1]:
import pandas as pd

In [2]:
all_judges = pd.read_csv("../data/processed/judges.csv")
all_judges.head()

| | judge_name | assigned_country | role | segment_category | pdf_name | program | competition | officials_table_link | results_pdf_link | clean_judge_name |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ms. Chihee RHEE | KOR | Judge No.1 | Free Dance\|Ice Dance | 4f3031488c_data0405.pdf | ICE DANCE FREE DANCE | Grand Prix Final 2017 Senior and Junior | http://www.isuresults.com/results/season1718/g... | http://www.isuresults.com/results/season1718/g... | Chihee RHEE |
| 1 | Mr. David MUNOZ | ESP | Judge No.2 | Free Dance\|Ice Dance | 4f3031488c_data0405.pdf | ICE DANCE FREE DANCE | Grand Prix Final 2017 Senior and Junior | http://www.isuresults.com/results/season1718/g... | http://www.isuresults.com/results/season1718/g... | David MUNOZ |
| 2 | Mr. Feng HUANG | CHN | Judge No.3 | Free Dance\|Ice Dance | 4f3031488c_data0405.pdf | ICE DANCE FREE DANCE | Grand Prix Final 2017 Senior and Junior | http://www.isuresults.com/results/season1718/g... | http://www.isuresults.com/results/season1718/g... | Feng HUANG |
| 3 | Mr. Richard DALLEY | USA | Judge No.4 | Free Dance\|Ice Dance | 4f3031488c_data0405.pdf | ICE DANCE FREE DANCE | Grand Prix Final 2017 Senior and Junior | http://www.isuresults.com/results/season1718/g... | http://www.isuresults.com/results/season1718/g... | Richard DALLEY |
| 4 | Mr. Walter ZUCCARO | ITA | Judge No.5 | Free Dance\|Ice Dance | 4f3031488c_data0405.pdf | ICE DANCE FREE DANCE | Grand Prix Final 2017 Senior and Junior | http://www.isuresults.com/results/season1718/g... | http://www.isuresults.com/results/season1718/g... | Walter ZUCCARO |


In [3]:
judge_nat = pd.read_csv("../data/processed/judge-country.csv")
judge_nat.head()

| | clean_judge_name | judge_country |
|---|---|---|
| 0 | Adriana DOMANSKA | SVK |
| 1 | Adriana ORDEANU | ROU |
| 2 | Agita ABELE | LAT |
| 3 | Agnieszka SWIDERSKA | POL |
| 4 | Aigul KUANISHEVA | KAZ |


In [4]:
judges = pd.merge(
    all_judges,
    judge_nat,
    on="clean_judge_name"
)

In [5]:
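def clean_judge_number(role):
    # "Judge No.4" -> "J4". Only the last character is kept,
    # so this assumes single-digit judge numbers.
    return "J" + role.strip()[-1]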

In [6]:
judges["clean_role"] = judges["role"].apply(clean_judge_number)

## Load scoring data

In [7]:
performances = pd.read_csv("../data/raw/performances.csv")
scores = pd.read_csv("../data/raw/judge-scores.csv")
aspects = pd.read_csv("../data/raw/judged-aspects.csv")

In [8]:
scores_with_context = scores.pipe(
    pd.merge,
    aspects,
    on = "aspect_id",
    how = "left"
).pipe(
    pd.merge,
    performances,
    on = "performance_id",
    how = "left"
).assign(
    is_junior = lambda x: x["program"].str.contains("JUNIOR")
)

In [9]:
assert len(scores_with_context) == len(scores)
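The assertion guards against a left merge silently adding rows, which happens when a key is duplicated in the right-hand table. As a sketch, `pd.merge` can enforce the same guarantee itself through its `validate` argument; the two toy tables below are hypothetical stand-ins for `scores` and `aspects`, not the real data.

```python
import pandas as pd

# Hypothetical miniatures of the scores and aspects tables.
toy_scores = pd.DataFrame({
    "aspect_id": ["a1", "a1", "a2"],
    "judge": ["J1", "J2", "J1"],
    "score": [1, 2, 0],
})
toy_aspects = pd.DataFrame({
    "aspect_id": ["a1", "a2"],
    "section": ["elements", "components"],
})

# validate="many_to_one" raises pandas.errors.MergeError if aspect_id is not
# unique on the right-hand side, so the merge can never multiply rows.
merged = pd.merge(toy_scores, toy_aspects, on="aspect_id", how="left",
                  validate="many_to_one")

# A duplicated right-hand key is rejected instead of duplicating rows.
dup_aspects = pd.concat([toy_aspects, toy_aspects.iloc[[0]]], ignore_index=True)
try:
    pd.merge(toy_scores, dup_aspects, on="aspect_id", how="left",
             validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
```

With `validate` in place, the row-count assertion becomes a second line of defense rather than the only one.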

Each row holds one judge's score for one judged aspect. The last three rows look like this:

In [10]:
scores_with_context.tail(3).T

| | 214528 | 214529 | 214530 |
|---|---|---|---|
| aspect_id | fff81cacfb | fff81cacfb | fff81cacfb |
| judge | J7 | J8 | J9 |
| score | 2 | 2 | 2 |
| performance_id | ba47e4e8f3 | ba47e4e8f3 | ba47e4e8f3 |
| section | elements | elements | elements |
| aspect_num | 2 | 2 | 2 |
| aspect_desc | 1MB4+kpYYY | 1MB4+kpYYY | 1MB4+kpYYY |
| info_flag | | | |
| credit_flag | | | |
| base_value | 5 | 5 | 5 |


In [11]:
len(scores_with_context)

214531

In [12]:
scores_with_context["section"].value_counts()

elements      136861
components     77670
Name: section, dtype: int64

In [13]:
scores_with_context["performance_id"].nunique()

1726

## Element Deviation Points

Section H, Part 1 states:

> **Method of Calculating Deviation Points in Grades of Execution (GOE)**
    
> a) For each element performed the computer calculates the average GOE of all the Judges. The GOE’s awarded by the Referee are NOT used in this calculation.

> b) The computer then calculates the difference between the “calculated average” and each Judges GOE’s which results in so called “Deviation Points”.
    
> c) The Deviation Points for each Judge will be added to give a Deviation Total for that Judge (+ and – Deviation Points do not compensate each other).
    
> d) The Deviation Total must not exceed the Total Number of Elements performed\*.

> \* In the case of elements receiving NO VALUE the Deviation Total still remains with the number of elements performed; the Deviation Points for this element for all Judges will be 0.
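To make the Part 1 arithmetic concrete, here is a minimal pure-Python sketch on a hypothetical four-judge, three-element panel. The function name `element_deviation_totals`, its input layout, and the `no_value` argument are illustrative assumptions, not part of the ISU text or of this notebook, which works on the full `elements` table with pandas.

```python
from statistics import mean


def element_deviation_totals(goes_by_element, no_value=frozenset()):
    """Per-judge Deviation Totals following Section H, Part 1 (sketch).

    goes_by_element: one dict per element performed, mapping judge -> GOE
    (Referee excluded). Indices listed in `no_value` are treated as
    NO VALUE elements: 0 deviation points for every judge, but they still
    count toward the number of elements performed.
    """
    judges = sorted({j for element in goes_by_element for j in element})
    totals = {j: 0.0 for j in judges}
    for i, element in enumerate(goes_by_element):
        if i in no_value:
            continue  # NO VALUE: contributes 0 deviation points for all judges
        avg = mean(element.values())  # a) average GOE of the panel
        for judge, goe in element.items():
            # b) + c) absolute deviations are added, so + and - do not cancel
            totals[judge] += abs(goe - avg)
    limit = len(goes_by_element)  # d) total number of elements performed
    flagged = {j: total > limit for j, total in totals.items()}
    return totals, flagged


panel = [
    {"J1": 1, "J2": 1, "J3": 1, "J4": -2},
    {"J1": 1, "J2": 1, "J3": 0, "J4": -1},
    {"J1": 0, "J2": 0, "J3": 0, "J4": -3},  # treated as NO VALUE below
]
totals, flagged = element_deviation_totals(panel, no_value={2})
```

With `no_value={2}`, the third element adds nothing to any judge's total but still raises the limit to three elements, so only J4 (3.5 deviation points) is flagged.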

Here we select only the elements for non-junior performances:

In [14]:
elements = scores_with_context[
    (scores_with_context["section"] == "elements") &
    (~scores_with_context["is_junior"])
].copy()

In [15]:
len(elements)

129814

Here we find the average score for each element:

In [16]:
elem_grps = elements.groupby("aspect_id")

In [17]:
len(elem_grps)

14430

In [18]:
elem_averages = pd.DataFrame({
    "elem_avg": elem_grps["score"].mean(),
    "elem_judge_count": elem_grps.size()
})

In [19]:
elements_with_avg = pd.merge(
    elements,
    elem_averages.reset_index(),
    on="aspect_id",
    how="left"
)

In [20]:
assert len(elements_with_avg) == len(elements)

This is what the joined data looks like:

In [21]:
elements_with_avg.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| aspect_id | 000df5399a | 000df5399a | 000df5399a | 000df5399a | 000df5399a |
| judge | J1 | J2 | J3 | J4 | J5 |
| score | -1 | -1 | -1 | 0 | -1 |
| performance_id | b017147b2f | b017147b2f | b017147b2f | b017147b2f | b017147b2f |
| section | elements | elements | elements | elements | elements |
| aspect_num | 1 | 1 | 1 | 1 | 1 |
| aspect_desc | 3Tw2 | 3Tw2 | 3Tw2 | 3Tw2 | 3Tw2 |
| info_flag | | | | | |
| credit_flag | | | | | |
| base_value | 5.8 | 5.8 | 5.8 | 5.8 | 5.8 |
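The `groupby` + `merge` steps above attach each element's panel average to every judge's row. A more compact equivalent, sketched here on a hypothetical miniature of `elements`, is `groupby(...).transform`, which returns a result aligned to the original rows, so no join (and no row-count check) is needed.

```python
import pandas as pd

# Hypothetical miniature of `elements`: two elements, three judges each.
toy = pd.DataFrame({
    "aspect_id": ["e1", "e1", "e1", "e2", "e2", "e2"],
    "judge": ["J1", "J2", "J3", "J1", "J2", "J3"],
    "score": [-1, -1, 0, 2, 1, 3],
})

# transform("mean") broadcasts each group's mean back onto that group's rows.
toy["elem_avg"] = toy.groupby("aspect_id")["score"].transform("mean")
# "count" counts non-null scores, which equals size() when none are missing.
toy["elem_judge_count"] = toy.groupby("aspect_id")["score"].transform("count")
```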


### Calculate deviation points for elements

In [22]:
elements_with_avg["deviation_points"] = elements_with_avg["score"] - elements_with_avg["elem_avg"]
elements_with_avg["abs_deviation_points"] = elements_with_avg["deviation_points"].abs()

In [23]:
judge_elem_grps = elements_with_avg.groupby(["performance_id", "judge"])

In [24]:
len(judge_elem_grps)

14680

In [25]:
# Counting the total number of performed elements, 
# rather than the number of scored elements,
# per the final stipulation of Section H, Part 1.

performed_elem_counts = aspects[
    (aspects["section"] == "elements")
].groupby("performance_id")["aspect_id"]\
    .nunique()\
    .to_frame("num_elem_performed")\
    .reset_index()
    
performed_elem_counts.head()

| | performance_id | num_elem_performed |
|---|---|---|
| 0 | 0055dd827d | 13 |
| 1 | 00693b66b5 | 5 |
| 2 | 007541009f | 12 |
| 3 | 007e8ef343 | 7 |
| 4 | 008085d237 | 13 |


In [26]:
elem_eval = pd.DataFrame({
    "num_elem_scored": judge_elem_grps.size(),
    "total_deviation_points": judge_elem_grps["abs_deviation_points"].sum(),
    "total_elem_outside_range": judge_elem_grps["abs_deviation_points"].apply(lambda x: len(x[x >= 1.5])),
    "positive_elem_outside_range": judge_elem_grps["deviation_points"].apply(lambda x: len(x[x >= 1.5])),
    "negative_elem_outside_range": judge_elem_grps["deviation_points"].apply(lambda x: len(x[x <= -1.5])),
    "nation": judge_elem_grps["nation"].first(),
    "program": judge_elem_grps["program"].first(),
    "competition": judge_elem_grps["competition"].first(),
}).reset_index().pipe(
    pd.merge,
    performed_elem_counts,
    on = "performance_id",
    how = "left"
).pipe(
    pd.merge,
    judges,
    left_on = ["program", "competition", "judge"],
    right_on = ["program", "competition", "clean_role"],
    how="left"
)

This is what the dataframe for counting element evaluations looks like:

In [27]:
elem_eval.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| performance_id | 0055dd827d | 0055dd827d | 0055dd827d | 0055dd827d | 0055dd827d |
| judge | J1 | J2 | J3 | J4 | J5 |
| competition | ISU GP Trophee de France 2016 | ISU GP Trophee de France 2016 | ISU GP Trophee de France 2016 | ISU GP Trophee de France 2016 | ISU GP Trophee de France 2016 |
| nation | KAZ | KAZ | KAZ | KAZ | KAZ |
| negative_elem_outside_range | 0 | 0 | 0 | 0 | 0 |
| num_elem_scored | 13 | 13 | 13 | 13 | 13 |
| positive_elem_outside_range | 0 | 0 | 0 | 0 | 0 |
| program | MEN FREE SKATING | MEN FREE SKATING | MEN FREE SKATING | MEN FREE SKATING | MEN FREE SKATING |
| total_deviation_points | 5.44444 | 6.88889 | 5.11111 | 6.66667 | 5.88889 |
| total_elem_outside_range | 0 | 0 | 0 | 0 | 0 |


For each judge/performance combination, check whether the judge's Deviation Total exceeds the number of elements performed (Part 1, item d):

In [28]:
elem_eval["exceeds_eval_threshold"] = elem_eval["total_deviation_points"] > \
    elem_eval["num_elem_performed"]

In [29]:
print("Of the {:,} total judge/performance combinations, {:,} ({:.2f}%) are outside the acceptable range."\
    .format(
        len(elem_eval),
        elem_eval["exceeds_eval_threshold"].sum(),
        (elem_eval["exceeds_eval_threshold"].sum() / 
            len(elem_eval) * 100)
    ))

Of the 14,680 total judge/performance combinations, 30 (0.20%) are outside the acceptable range.


In [30]:
elem_eval_same_country = elem_eval[
    elem_eval["nation"] == elem_eval["judge_country"]
].copy()

In [31]:
print("Of the {:,} total judge/performance combinations where the judge and skaters represent the same country,"
      " {:,} ({:.2f}%) are outside the acceptable range."\
    .format(
        len(elem_eval_same_country),
        elem_eval_same_country["exceeds_eval_threshold"].sum(),
        (elem_eval_same_country["exceeds_eval_threshold"].sum() / 
            len(elem_eval_same_country) * 100)
    ))

Of the 1,305 total judge/performance combinations where the judge and skaters represent the same country, 1 (0.08%) are outside the acceptable range.


Section H, Part 2 states:

> **Method of Calculating Range of Grade of Execution (GOE)**

> a) For each element performed the computer calculates the average GOE of all the Judges. The GOE’s awarded by the Referee are NOT used in this calculation.

> b) The computer then calculates the difference between the “calculated average” and each Judge’s GOE’s which results in so called “Deviation Points”.

> c) Acceptable range of scores in GOE is stated as less than “+ or –“ 1.5 Deviation Points.

> d) If the Deviation Points of an element for a Judge equals 1.5 points or more, the GOEs of that Judge for that element will constitute a case of evaluation.
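The Part 2 test applies to each individual GOE rather than to a whole performance. A minimal sketch, assuming the GOEs for a single element arrive as a judge-to-GOE dict (a hypothetical layout; `goes_outside_range` is an illustrative name). Note the boundary: a deviation of exactly 1.5 already counts as a case of evaluation, which is why the notebook uses `>= 1.5`.

```python
from statistics import mean


def goes_outside_range(goes, threshold=1.5):
    """Judges whose GOE on one element is a case of evaluation (Part 2 sketch).

    goes: dict mapping judge -> GOE for a single element, Referee excluded.
    Returns {judge: signed deviation} for every |GOE - average| >= threshold.
    """
    avg = mean(goes.values())
    return {
        judge: goe - avg
        for judge, goe in goes.items()
        if abs(goe - avg) >= threshold  # "equals 1.5 points or more"
    }


on_boundary = goes_outside_range({"J1": 1, "J2": 1, "J3": 1, "J4": -1})
within_range = goes_outside_range({"J1": 2, "J2": 1, "J3": 1, "J4": 0})
```

In the first panel the average is 0.5, so J4 sits exactly 1.5 below it and is flagged; in the second, no judge is more than 1 point from the average.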

In [32]:
print("Of the {:,} total evaluated elements, {:,} ({:.2f}%) are outside the acceptable range."\
    .format(
        elem_eval["num_elem_scored"].sum(),
        elem_eval["total_elem_outside_range"].sum(),
        (elem_eval["total_elem_outside_range"].sum() / 
            elem_eval["num_elem_scored"].sum() * 100)
    ))

Of the 129,814 total evaluated elements, 1,238 (0.95%) are outside the acceptable range.


In [33]:
print("Of the {:,} total evaluated elements where the judge and skaters represent the same country,"
      " {:,} ({:.2f}%) are outside the acceptable range."\
    .format(
        elem_eval_same_country["num_elem_scored"].sum(),
        elem_eval_same_country["total_elem_outside_range"].sum(),
        (elem_eval_same_country["total_elem_outside_range"].sum() / 
            elem_eval_same_country["num_elem_scored"].sum() * 100)
    ))

Of the 11,597 total evaluated elements where the judge and skaters represent the same country, 108 (0.93%) are outside the acceptable range.


In [34]:
total_elem_for_eval = elem_eval[
    elem_eval["exceeds_eval_threshold"]
]["num_elem_performed"].sum() + elem_eval[
    ~elem_eval["exceeds_eval_threshold"]
]["total_elem_outside_range"].sum()

print(("Of the {:,} total evaluated elements, {:,} ({:.2f}%) elements would have been flagged "
      "by either Part 1 or Part 2.")\
    .format(
        elem_eval["num_elem_scored"].sum(),
        total_elem_for_eval,
        (total_elem_for_eval / 
            elem_eval["num_elem_scored"].sum() * 100)
    ))

Of the 129,814 total evaluated elements, 1,428 (1.10%) elements would have been flagged by either Part 1 or Part 2.
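The two-part sum above counts every performed element for judges who fail Part 1, and only the individual out-of-range GOEs for everyone else. As a sketch, the same row-wise choice can be written with `Series.where`, which keeps `num_elem_performed` where the Part 1 threshold is exceeded and falls back to `total_elem_outside_range` elsewhere. The rows below are hypothetical, shaped like `elem_eval`.

```python
import pandas as pd

# Hypothetical rows shaped like `elem_eval` (one row per judge/performance).
toy_eval = pd.DataFrame({
    "num_elem_performed": [13, 13, 7],
    "total_deviation_points": [14.2, 6.9, 2.0],
    "total_elem_outside_range": [3, 1, 0],
})
toy_eval["exceeds_eval_threshold"] = (
    toy_eval["total_deviation_points"] > toy_eval["num_elem_performed"]
)

# Keep num_elem_performed where the condition is True; otherwise
# substitute the number of individual GOEs outside the Part 2 range.
flagged_per_row = toy_eval["num_elem_performed"].where(
    toy_eval["exceeds_eval_threshold"],
    toy_eval["total_elem_outside_range"],
)
total_flagged = int(flagged_per_row.sum())
```

Only the first row exceeds its Part 1 limit (14.2 > 13), so it contributes all 13 elements; the other rows contribute 1 and 0.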


In [35]:
total_elem_for_eval_same_country = elem_eval_same_country[
    elem_eval_same_country["exceeds_eval_threshold"]
]["num_elem_performed"].sum() + elem_eval_same_country[
    ~elem_eval_same_country["exceeds_eval_threshold"]
]["total_elem_outside_range"].sum()

print(("Of the {:,} total evaluated elements where the judge and skaters represent the same country,"
       " {:,} ({:.2f}%) elements would have been flagged "
      "by either Part 1 or Part 2.")\
    .format(
        elem_eval_same_country["num_elem_scored"].sum(),
        total_elem_for_eval_same_country,
        (total_elem_for_eval_same_country / 
            elem_eval_same_country["num_elem_scored"].sum() * 100)
    ))

Of the 11,597 total evaluated elements where the judge and skaters represent the same country, 115 (0.99%) elements would have been flagged by either Part 1 or Part 2.


## Component Deviation Points

Section H, Part 3 states:

> **Method of Calculating Deviation Points in Program Component scores**

> a) For each Program Component, the computer calculates the average scores of all the Judges. The Program Components scores awarded by the Referee are NOT used in this calculation.
    
> b) The computer then calculates the difference between the “calculated average” and the Judges Program Components scores which results in “Deviation Points”.

> c) The Total Deviation points for each Judge will be added to provide a Total Net Deviation Points (+ and – Deviation Points compensate each other).
    
> d) Total Net Deviation Points must not exceed 7.5.
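Unlike Part 1, Part 3 adds signed deviations, so a judge who is high on one component and low on another can net out to zero. A minimal sketch under assumed inputs: a hypothetical dict mapping made-up component keys to judge scores, and an illustrative function name. Like the notebook's own check, it reads "must not exceed 7.5" as a positive net above 7.5; that is an interpretation, not something the rule text spells out.

```python
from statistics import mean


def net_component_deviation(panel, limit=7.5):
    """Sketch of Section H, Part 3 for one performance.

    panel: dict mapping component -> {judge: score}, Referee excluded.
    Returns (net deviation per judge, judges whose net exceeds `limit`).
    """
    net = {}
    for scores in panel.values():
        avg = mean(scores.values())  # a) average score for this component
        for judge, score in scores.items():
            # b) + c) signed deviations, so + and - compensate each other
            net[judge] = net.get(judge, 0.0) + (score - avg)
    # d) treats only a positive net above `limit` as exceeding it
    flagged = [judge for judge, total in net.items() if total > limit]
    return net, flagged


# Hypothetical panel where J1 and J2 disagree in opposite directions.
balanced = {
    "SS": {"J1": 8.0, "J2": 7.0, "J3": 7.5},
    "TR": {"J1": 7.0, "J2": 8.0, "J3": 7.5},
}
# Hypothetical panel where J1 is 3 points above the average on all five.
biased = {c: {"J1": 10.0, "J2": 6.0, "J3": 6.0, "J4": 6.0}
          for c in ["SS", "TR", "PE", "CO", "IN"]}

balanced_net, balanced_flagged = net_component_deviation(balanced)
biased_net, biased_flagged = net_component_deviation(biased)
```

In the balanced panel every judge nets to 0.0 despite half-point disagreements; in the biased panel J1 accumulates a net of 15.0 and is the only judge flagged.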

Here we select only the components for non-junior performances.

In [36]:
components = scores_with_context[
    (scores_with_context["section"] == "components") &
    (~scores_with_context["is_junior"]) &
    # There is one program in the data where one judge does not have scores;
    # they were marked as 0 in the scoresheet. We remove these cases.
    (scores_with_context["score"] != 0)
].copy()

In [37]:
len(components)

73400

This is what the components data looks like:

In [38]:
components.head(2).T

| | 0 | 1 |
|---|---|---|
| aspect_id | 00034b9414 | 00034b9414 |
| judge | J1 | J2 |
| score | 9 | 9 |
| performance_id | b639d77459 | b639d77459 |
| section | components | components |
| aspect_num | | |
| aspect_desc | Transitions | Transitions |
| info_flag | | |
| credit_flag | | |
| base_value | | |


Here we find the average score for each component:

In [39]:
comp_grps = components.groupby("aspect_id")

In [40]:
len(comp_grps)

8160

In [41]:
comp_averages = pd.DataFrame({
    "comp_avg": comp_grps["score"].mean(),
    "comp_judge_count": comp_grps.size()
})

In [42]:
comp_with_avg = pd.merge(
    components,
    comp_averages.reset_index(),
    on="aspect_id",
    how="left"
)

In [43]:
assert len(components) == len(comp_with_avg)

### Calculate deviation points for components

In [44]:
comp_with_avg["deviation_points"] = comp_with_avg["score"] - comp_with_avg["comp_avg"]
comp_with_avg["abs_deviation_points"] = comp_with_avg["deviation_points"].abs()

In [45]:
judge_comp_grps = comp_with_avg.groupby(["performance_id", "judge"])

In [46]:
comp_eval = pd.DataFrame({
    "number_of_comp": judge_comp_grps.size(),
    "total_deviation_points": judge_comp_grps["deviation_points"].sum(),
    "total_comp_outside_range": judge_comp_grps["abs_deviation_points"].apply(lambda x: len(x[x >= 1.5])),
    "positive_comp_outside_range": judge_comp_grps["deviation_points"].apply(lambda x: len(x[x >= 1.5])),
    "negative_comp_outside_range": judge_comp_grps["deviation_points"].apply(lambda x: len(x[x <= -1.5])),
    "name": judge_comp_grps["name"].first(),
    "nation": judge_comp_grps["nation"].first(),
    "program": judge_comp_grps["program"].first(),
    "competition": judge_comp_grps["competition"].first()
}).reset_index()\
.pipe(
    pd.merge,
    judges,
    left_on = ["program", "competition", "judge"],
    right_on = ["program", "competition", "clean_role"],
    how="left"
)

In [47]:
comp_eval["exceeds_evaluation_threshold"] = comp_eval["total_deviation_points"] > 7.5

This is what the component evaluation dataframe looks like:

In [48]:
comp_eval.head(3).T

| | 0 | 1 | 2 |
|---|---|---|---|
| performance_id | 0055dd827d | 0055dd827d | 0055dd827d |
| judge | J1 | J2 | J3 |
| competition | ISU GP Trophee de France 2016 | ISU GP Trophee de France 2016 | ISU GP Trophee de France 2016 |
| name | Denis TEN | Denis TEN | Denis TEN |
| nation | KAZ | KAZ | KAZ |
| negative_comp_outside_range | 0 | 0 | 0 |
| number_of_comp | 5 | 5 | 5 |
| positive_comp_outside_range | 0 | 0 | 0 |
| program | MEN FREE SKATING | MEN FREE SKATING | MEN FREE SKATING |
| total_comp_outside_range | 0 | 0 | 0 |


In [49]:
comp_eval_same_country = comp_eval[
    comp_eval["nation"] == comp_eval["judge_country"]
].copy()

In [50]:
print((
    "Of the {:,} total judge/program combinations, {} ({:.2f}%) " 
    "exceeded the component Total Deviation Points threshold."
    ).format(
        len(comp_eval),
        comp_eval["exceeds_evaluation_threshold"].sum(),
        comp_eval["exceeds_evaluation_threshold"].mean() * 100,
    ))

Of the 14,680 total judge/program combinations, 4 (0.03%) exceeded the component Total Deviation Points threshold.


In [51]:
print((
    "Of the {:,} total judge/program combinations where the judge and skaters represent the same country,"
    " {} ({:.2f}%) exceeded the component Total Deviation Points threshold."
    ).format(
        len(comp_eval_same_country),
        comp_eval_same_country["exceeds_evaluation_threshold"].sum(),
        comp_eval_same_country["exceeds_evaluation_threshold"].mean() * 100,
    ))

Of the 1,305 total judge/program combinations where the judge and skaters represent the same country, 0 (0.00%) exceeded the component Total Deviation Points threshold.


Section H, Part 4 states:

> **Method of Calculating the Range of Program Components scores**

> (i) For each Program Component, the computer calculates the average scores of all of the Judges. The Program Components scores awarded by the Referee are NOT used in this calculation.

> (ii) The computer then calculates the difference between the “calculated average” and the Judges Program Components scores which results in “Deviation Points”.

> (iii) Acceptable range of scores in Program Components is stated as less than “+ or –“ 1.5 Deviation Points.

> (iv) If the Deviation Points of a component for a Judge equals 1.5 points or more, the scores of that Judge for that component will constitute a case of evaluation.
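Part 4 applies the same 1.5-point range test as Part 2, but to each component score. The earlier cells count out-of-range scores with `apply(lambda x: len(x[x >= 1.5]))`; as a sketch, the same per-judge count can be obtained without a Python-level lambda by summing a boolean column. The toy rows below are hypothetical, shaped like `comp_with_avg`.

```python
import pandas as pd

# Hypothetical rows: one performance, two components, three judges.
toy = pd.DataFrame({
    "performance_id": ["p1"] * 6,
    "judge": ["J1", "J2", "J3"] * 2,
    "aspect_id": ["c1"] * 3 + ["c2"] * 3,
    "score": [9.0, 7.0, 6.5, 8.0, 8.25, 8.5],
})
toy["deviation_points"] = (
    toy["score"] - toy.groupby("aspect_id")["score"].transform("mean")
)
# (iii)/(iv): a score is a case of evaluation when its deviation from the
# panel average is 1.5 points or more in either direction.
toy["outside_range"] = toy["deviation_points"].abs() >= 1.5

# Summing a boolean column counts its True values within each group.
per_judge = toy.groupby(["performance_id", "judge"])["outside_range"].sum()
```

Component `c1` averages 7.5, so J1's 9.0 sits exactly 1.5 above it and counts; every deviation on `c2` is at most 0.25.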

In [52]:
print((
    "Of the {:,} components, {:,} ({:.2f}%) "
    "individually exceeded the acceptable range."
    ).format(
        comp_eval["number_of_comp"].sum(),
        comp_eval["total_comp_outside_range"].sum(),
        100 * comp_eval["total_comp_outside_range"].sum() / \
            comp_eval["number_of_comp"].sum()
    ))

Of the 73,400 components, 110 (0.15%) individually exceeded the acceptable range.


In [53]:
print((
    "Of the {:,} components in programs where the judge and skaters represent" 
    " the same country, {:,} ({:.2f}%) "
    "individually exceeded the acceptable range."
    ).format(
        comp_eval_same_country["number_of_comp"].sum(),
        comp_eval_same_country["total_comp_outside_range"].sum(),
        100 * comp_eval_same_country["total_comp_outside_range"].sum() / \
            comp_eval_same_country["number_of_comp"].sum()
    ))

Of the 6,525 components in programs where the judge and skaters represent the same country, 2 (0.03%) individually exceeded the acceptable range.

