# Red Card Exploratory Data Analysis

### *abstract*

### *introduction*



### *Research Question*: 
### Are soccer referees more likely to give red cards to dark skin toned players than light skin toned players?

### brief

>This analysis addresses two questions: whether soccer referees are more likely to give red cards to dark skin toned players than light skin toned players, and whether this effect is moderated by skin-tone prejudice across cultures.

>The available dataset provides an opportunity to identify the magnitude of the relationship among these variables. It does not offer an opportunity to identify causal relations.



### dataset

* ### data
>The data comprise records and profile photos for all soccer players (N = 2,053) who played in the first male divisions of England, Germany, France and Spain in the 2012-2013 season, and for all referees (N = 3,147) that these players played under in their professional careers. The dataset consists of player-referee dyads and includes the number of matches in which each player and referee encountered each other and our dependent variable, the number of red cards given to a player by a particular referee across all of those matches.



* ### skin tones
>A player's photo was available from the source for 1,586 of the 2,053 players. Players' skin tone was coded by two independent raters blind to the research question who, based on the profile photo, categorized each player on a 5-point scale ranging from “very light skin” to “very dark skin”, with “neither dark nor light skin” as the center value.





* ### bias scores (about referees)
>Implicit bias scores for each referee country were calculated using a race implicit association test (IAT), with higher values corresponding to faster white | good, black | bad associations. Explicit bias scores for each referee country were calculated using a racial thermometer task, with higher values corresponding to greater feelings of warmth toward whites versus blacks. Both measures were created by aggregating data from many online users in the referee countries who took these tests on Project Implicit.

### **Data Structure**
The dataset is available as a list with 146,028 dyads of players and referees and includes details from players, details from referees and details regarding the interactions of player-referees. A summary of the variables of interest can be seen below. A detailed description of all variables included can be seen in the README file on the project website.

https://osf.io/jv6yw/files/

| Variable Name: | Variable Description: | 
| -- | -- | 
| playerShort | short player ID | 
| player | player name | 
| club | player club | 
| leagueCountry | country of player club (England, Germany, France, and Spain) | 
| height | player height (in cm) | 
| weight | player weight (in kg) | 
| position | player position | 
| games | number of games in the player-referee dyad | 
| goals | number of goals in the player-referee dyad | 
| yellowCards | number of yellow cards player received from the referee | 
| yellowReds | number of yellow-red cards player received from the referee | 
| redCards | number of red cards player received from the referee | 
| photoID | ID of player photo (if available) | 
| rater1 | skin rating of photo by rater 1 | 
| rater2 | skin rating of photo by rater 2 | 
| refNum | unique referee ID number (referee name removed for anonymizing purposes) | 
| refCountry | unique referee country ID number | 
| meanIAT | mean implicit bias score (using the race IAT) for referee country | 
| nIAT | sample size for race IAT in that particular country | 
| seIAT | standard error for mean estimate of race IAT   | 
| meanExp | mean explicit bias score (using a racial thermometer task) for referee country | 
| nExp | sample size for explicit bias in that particular country | 
| seExp |  standard error for mean estimate of explicit bias measure | 



In [12]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import os, sys

In [13]:
data_frame = pd.read_csv("../Datasets/CrowdstormingDataJuly1st.csv")

In [15]:
data_frame.head()

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,...,rater2,refNum,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp
0,lucas-wilchez,Lucas Wilchez,Real Zaragoza,Spain,31.08.1983,177.0,72.0,Attacking Midfielder,1,0,...,0.5,1,1,GRC,0.326391,712.0,0.000564,0.396,750.0,0.002696
1,john-utaka,John Utaka,Montpellier HSC,France,08.01.1982,179.0,82.0,Right Winger,1,0,...,0.75,2,2,ZMB,0.203375,40.0,0.010875,-0.204082,49.0,0.061504
2,abdon-prats,Abdón Prats,RCD Mallorca,Spain,17.12.1992,181.0,79.0,,1,0,...,,3,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002
3,pablo-mari,Pablo Marí,RCD Mallorca,Spain,31.08.1993,191.0,87.0,Center Back,1,1,...,,3,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002
4,ruben-pena,Rubén Peña,Real Valladolid,Spain,18.07.1991,172.0,70.0,Right Midfielder,1,1,...,,3,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002


In [7]:
data_frame.shape  # (rows, columns)

(146028, 28)

In [17]:
data_frame.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
height,145765.0,181.935938,6.738726,161.0,177.0,182.0,187.0,203.0
weight,143785.0,76.075662,7.140906,54.0,71.0,76.0,81.0,100.0
games,146028.0,2.921166,3.413633,1.0,1.0,2.0,3.0,47.0
victories,146028.0,1.278344,1.790725,0.0,0.0,1.0,2.0,29.0
ties,146028.0,0.708241,1.116793,0.0,0.0,0.0,1.0,14.0
defeats,146028.0,0.934581,1.383059,0.0,0.0,1.0,1.0,18.0
goals,146028.0,0.338058,0.906481,0.0,0.0,0.0,0.0,23.0
yellowCards,146028.0,0.385364,0.795333,0.0,0.0,0.0,1.0,14.0
yellowReds,146028.0,0.011381,0.107931,0.0,0.0,0.0,0.0,3.0
redCards,146028.0,0.012559,0.112889,0.0,0.0,0.0,0.0,2.0
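The `count` values for `height` (145,765) and `weight` (143,785) are below the 146,028 rows of the frame, which suggests missing values in those columns. A minimal sketch of how one might tally missing values per column is shown here; `toy` is a small made-up frame standing in for `data_frame`, and its values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_frame; the values are made up.
toy = pd.DataFrame({
    "playerShort": ["a", "b", "c", "d"],
    "height": [180.0, np.nan, 175.0, 190.0],
    "weight": [75.0, np.nan, np.nan, 85.0],
    "rater1": [0.25, 0.0, np.nan, 1.0],
})

# Number of missing values in each column, largest first.
missing = toy.isna().sum().sort_values(ascending=False)
# Share of rows that are missing in each column.
missing_share = toy.isna().mean()

print(missing)
print(missing_share)
```

Running the same two expressions on `data_frame` would show which of the 28 columns are incomplete before any modelling decisions are made.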


In [19]:
all_columns = data_frame.columns.tolist()
all_columns

['playerShort',
 'player',
 'club',
 'leagueCountry',
 'birthday',
 'height',
 'weight',
 'position',
 'games',
 'victories',
 'ties',
 'defeats',
 'goals',
 'yellowCards',
 'yellowReds',
 'redCards',
 'photoID',
 'rater1',
 'rater2',
 'refNum',
 'refCountry',
 'Alpha_3',
 'meanIAT',
 'nIAT',
 'seIAT',
 'meanExp',
 'nExp',
 'seExp']

In [20]:
np.mean(data_frame.groupby('playerShort').height.mean()) # Average height

181.74372848007872
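This per-player average (181.74) differs from the `height` mean reported by `describe()` (181.94) because the raw frame repeats each player once per referee, so players with many dyads carry more weight there. The toy example below, with made-up heights and hypothetical player IDs, sketches the difference between the two averages.

```python
import pandas as pd

# Made-up data: one player appears in three dyads, the other in one.
toy = pd.DataFrame({
    "playerShort": ["tall", "tall", "tall", "short"],
    "height": [190.0, 190.0, 190.0, 170.0],
})

# Every dyad row counts once, so the frequent player dominates.
row_mean = toy["height"].mean()

# Collapse to one value per player first, so every player counts once.
player_mean = toy.groupby("playerShort")["height"].mean().mean()

print(row_mean, player_mean)
```

This is one motivation for splitting the raw frame into tidy per-entity tables before summarising player attributes.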

## Tidy Players Table

In [21]:
player_index = 'playerShort'
player_cols = [#'player', # drop player name, we have unique identifier
               'birthday',
               'height',
               'weight',
               'position',
               'photoID',
               'rater1',
               'rater2',
              ]

In [24]:
all_cols_unique_players = data_frame.groupby('playerShort').agg({col:'nunique' for col in player_cols})
all_cols_unique_players.head()

playerShort,birthday,height,weight,position,photoID,rater1,rater2
aaron-hughes,1,1,1,1,1,1,1
aaron-hunt,1,1,1,1,1,1,1
aaron-lennon,1,1,1,1,1,1,1
aaron-ramsey,1,1,1,1,1,1,1
abdelhamid-el-kaoutari,1,1,1,1,1,1,1


In [26]:
all_cols_unique_players[all_cols_unique_players > 1].dropna().head()  # players with more than one distinct value in every column; the result is empty

playerShort,birthday,height,weight,position,photoID,rater1,rater2


In [27]:
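def get_subgroup(dataframe, g_index, g_columns):
    """Return one row per g_index value, warning if a column varies within a group."""
    grouped = dataframe.groupby(g_index)
    n_unique = grouped.agg({col: 'nunique' for col in g_columns})
    # Flag a group when any of its columns varies, not only when all of them do.
    if (n_unique > 1).any().any():
        print("Warning: you probably assumed this had all unique values but it doesn't.")
    # 'max' collapses each (assumed constant) column to one value per group.
    return grouped.agg({col: 'max' for col in g_columns})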
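When the uniqueness warning fires, it helps to know which keys are responsible. The sketch below is a hypothetical companion helper (`find_inconsistent` is not part of the original notebook) that lists the group keys for which any column takes more than one distinct value, demonstrated on a made-up clubs frame.

```python
import pandas as pd

def find_inconsistent(dataframe, g_index, g_columns):
    """Return the group keys where any column in g_columns has more than one value."""
    counts = dataframe.groupby(g_index)[g_columns].nunique()
    return counts[(counts > 1).any(axis=1)].index.tolist()

# Made-up data: club "B" is listed under two different league countries.
toy = pd.DataFrame({
    "club": ["A", "A", "B", "B"],
    "leagueCountry": ["Spain", "Spain", "France", "Germany"],
})

inconsistent = find_inconsistent(toy, "club", ["leagueCountry"])
print(inconsistent)
```

Note that `nunique` ignores missing values by default, so a column that is partly `NaN` and otherwise constant is not reported.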

In [29]:
players = get_subgroup(data_frame, player_index, player_cols)
players.head()

playerShort,birthday,height,weight,position,photoID,rater1,rater2
aaron-hughes,08.11.1979,182.0,71.0,Center Back,3868.jpg,0.25,0.0
aaron-hunt,04.09.1986,183.0,73.0,Attacking Midfielder,20136.jpg,0.0,0.25
aaron-lennon,16.04.1987,165.0,63.0,Right Midfielder,13515.jpg,0.25,0.25
aaron-ramsey,26.12.1990,178.0,76.0,Center Midfielder,94953.jpg,0.0,0.0
abdelhamid-el-kaoutari,17.03.1990,180.0,73.0,Center Back,124913.jpg,0.25,0.25
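The two rater columns appear on a 0 to 1 scale in steps of 0.25, and the raters do not always agree. One common way to get a single score per player is to average the two ratings; this is an assumption for illustration, not something the notebook does at this point. In the sketch below, `skintone` and `rater_gap` are hypothetical column names and the ratings are made up.

```python
import numpy as np
import pandas as pd

# Made-up ratings on the 0, 0.25, ..., 1 scale; "p4" has no photo, hence no ratings.
toy_players = pd.DataFrame(
    {"rater1": [0.25, 0.0, 1.0, np.nan], "rater2": [0.0, 0.25, 1.0, np.nan]},
    index=pd.Index(["p1", "p2", "p3", "p4"], name="playerShort"),
)

# Average the two ratings; rows with no ratings at all stay NaN.
toy_players["skintone"] = toy_players[["rater1", "rater2"]].mean(axis=1)

# Absolute disagreement between the raters, in scale units.
toy_players["rater_gap"] = (toy_players["rater1"] - toy_players["rater2"]).abs()

print(toy_players)
```

Inspecting the distribution of `rater_gap` on the real `players` table would indicate how much the two raters disagree before the ratings are combined.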


In [30]:
def save_subgroup(dataframe, g_index, subgroup_name, prefix='raw_'):
    """Save a tidy table as a gzipped CSV and check that it reloads unchanged."""
    save_subgroup_filename = "".join([prefix, subgroup_name, ".csv.gz"])
    dataframe.to_csv(save_subgroup_filename, compression='gzip', encoding='UTF-8')
    # Read the file back using the same index column(s) that were written out.
    test_df = pd.read_csv(save_subgroup_filename, compression='gzip', index_col=g_index, encoding='UTF-8')
    # Test that we recover what we sent in
    if dataframe.equals(test_df):
        print("Test-passed: we recover the equivalent subgroup dataframe.")
    else:
        print("Warning: the round-trip equivalence test failed. Double-check the saved file.")

In [32]:
save_subgroup(players, player_index, "players")

Test-passed: we recover the equivalent subgroup dataframe.
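`DataFrame.equals` only returns `True` or `False`, so a failed round trip gives no hint about what changed. A possible alternative, sketched here under the assumption that a temporary directory is acceptable for the test file, is `pandas.testing.assert_frame_equal`, which raises an `AssertionError` describing the first mismatch. The two rows are copied from the `players` output; the file name `raw_toy.csv.gz` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Two rows taken from the players table shown in the notebook output.
toy = pd.DataFrame(
    {"height": [182.0, 165.0], "position": ["Center Back", "Right Midfielder"]},
    index=pd.Index(["aaron-hughes", "aaron-lennon"], name="playerShort"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "raw_toy.csv.gz")
    toy.to_csv(path, compression="gzip", encoding="UTF-8")
    reloaded = pd.read_csv(path, compression="gzip", index_col="playerShort", encoding="UTF-8")

# Raises AssertionError with a description of the difference if the frames differ.
pd.testing.assert_frame_equal(toy, reloaded)
print("Round trip preserved the table.")
```

By default `assert_frame_equal` compares floating-point columns with a small tolerance, which can be more forgiving than `equals` for values that do not survive text serialisation exactly.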


##  Tidy Clubs Table


In [34]:
club_index = 'club'
club_cols = ['leagueCountry']
clubs = get_subgroup(data_frame, club_index, club_cols)
clubs.head()

club,leagueCountry
1. FC Nürnberg,Germany
1. FSV Mainz 05,Germany
1899 Hoffenheim,Germany
AC Ajaccio,France
AFC Bournemouth,England


In [35]:
clubs['leagueCountry'].value_counts()  # number of clubs per league country; England has the most

England    48
Spain      27
France     22
Germany    21
Name: leagueCountry, dtype: int64

In [36]:
save_subgroup(clubs, club_index, "clubs")

Test-passed: we recover the equivalent subgroup dataframe.


## Tidy Referees Table

In [38]:
referee_index = 'refNum'
referee_cols = ['refCountry']
referees = get_subgroup(data_frame, referee_index, referee_cols)
referees.head()

refNum,refCountry
1,1
2,2
3,3
4,4
5,5


In [39]:
referees.refCountry.nunique()  # number of distinct countries the referees come from

161

In [40]:
referees.shape

(3147, 1)

In [41]:
save_subgroup(referees, referee_index, "referees")

Test-passed: we recover the equivalent subgroup dataframe.


## Tidy Player-Referee Dyads Table

In [42]:
dyad_index = ['refNum', 'playerShort']
dyad_cols = ['games',
             'victories',
             'ties',
             'defeats',
             'goals',
             'yellowCards',
             'yellowReds',
             'redCards',
            ]

In [43]:
dyads = get_subgroup(data_frame, g_index=dyad_index, g_columns=dyad_cols)

In [47]:
dyads.head(10)

refNum,playerShort,games,victories,ties,defeats,goals,yellowCards,yellowReds,redCards
1,lucas-wilchez,1,0,0,1,0,0,0,0
2,john-utaka,1,0,0,1,0,1,0,0
3,abdon-prats,1,0,1,0,0,1,0,0
3,pablo-mari,1,1,0,0,0,0,0,0
3,ruben-pena,1,1,0,0,0,0,0,0
4,aaron-hughes,1,0,0,1,0,0,0,0
4,aleksandar-kolarov,1,1,0,0,0,0,0,0
4,alexander-tettey,1,0,0,1,0,0,0,0
4,anders-lindegaard,1,0,1,0,0,0,0,0
4,andreas-beck,1,1,0,0,0,0,0,0


In [48]:
dyads.shape

(146028, 8)

In [49]:
dyads[dyads.redCards > 1].head(10)

refNum,playerShort,games,victories,ties,defeats,goals,yellowCards,yellowReds,redCards
140,bodipo,6,2,1,3,1,0,0,2
367,antonio-lopez_2,8,5,2,1,0,2,0,2
432,javi-martinez,14,4,3,7,2,2,0,2
432,jonas,9,1,4,4,1,0,0,2
487,phil-jagielka,7,2,1,4,1,0,0,2
586,cyril-jeunechamp,14,8,0,6,0,6,0,2
804,sergio-ramos,18,12,1,5,4,6,1,2
985,aly-cissokho,9,1,5,3,1,1,0,2
1114,eugen-polanski,8,4,0,4,0,0,0,2
1214,emmanuel-adebayor,23,9,7,7,10,4,1,2


In [50]:
save_subgroup(dyads, dyad_index, "dyads")

Test-passed: we recover the equivalent subgroup dataframe.


In [51]:
dyads.redCards.max()

2
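With the tidy tables in place, the dyads can be joined back to player attributes to describe red-card rates by skin tone. The sketch below uses made-up tables shaped like `players` and `dyads`; the `skintone` column, the 0.5 cut-off and the `tone_group` labels are assumptions chosen for illustration. The result is a descriptive rate only, in line with the note that the dataset cannot identify causal relations.

```python
import pandas as pd

# Made-up tidy tables shaped like `players` and `dyads`.
toy_players = pd.DataFrame(
    {"skintone": [0.0, 0.25, 1.0]},
    index=pd.Index(["p1", "p2", "p3"], name="playerShort"),
)
toy_dyads = pd.DataFrame({
    "refNum": [1, 1, 2, 2],
    "playerShort": ["p1", "p3", "p2", "p3"],
    "games": [10, 5, 20, 5],
    "redCards": [0, 1, 1, 0],
})

# Attach the player attribute to every dyad row.
merged = toy_dyads.merge(toy_players, left_on="playerShort", right_index=True, how="left")

# Hypothetical split: ratings above 0.5 are labelled "dark", the rest "light".
merged["tone_group"] = merged["skintone"].gt(0.5).map({True: "dark", False: "light"})

# Pool red cards and games within each group, then take the ratio.
totals = merged.groupby("tone_group")[["redCards", "games"]].sum()
totals["red_per_game"] = totals["redCards"] / totals["games"]

print(totals)
```

Dividing by `games` matters because dyads differ widely in how many matches they cover (from 1 to 47 in the summary above), so raw red-card counts are not comparable across groups.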