Open question: should we also predict on data that has no skin color label?

Let's first import the necessary libraries.

In [1]:
import numpy as np
import pandas as pd

Now let's get ahold of the data we will be working with! We will first load the CSV file and check its dimensions.

In [2]:
soccer_data = pd.read_csv('CrowdstormingDataJuly1st.csv')
soccer_data.shape

(146028, 28)

That's a lot of data ;) Instead of using `head()`, let's print the first 14 columns and the next 14 separately.

In [3]:
# .ix has been removed from pandas; .iloc[:6] selects the same six rows (0 to 5)
soccer_data.iloc[:6, :14]

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,ties,defeats,goals,yellowCards
0,lucas-wilchez,Lucas Wilchez,Real Zaragoza,Spain,31.08.1983,177.0,72.0,Attacking Midfielder,1,0,0,1,0,0
1,john-utaka,John Utaka,Montpellier HSC,France,08.01.1982,179.0,82.0,Right Winger,1,0,0,1,0,1
2,abdon-prats,Abdón Prats,RCD Mallorca,Spain,17.12.1992,181.0,79.0,,1,0,1,0,0,1
3,pablo-mari,Pablo Marí,RCD Mallorca,Spain,31.08.1993,191.0,87.0,Center Back,1,1,0,0,0,0
4,ruben-pena,Rubén Peña,Real Valladolid,Spain,18.07.1991,172.0,70.0,Right Midfielder,1,1,0,0,0,0
5,aaron-hughes,Aaron Hughes,Fulham FC,England,08.11.1979,182.0,71.0,Center Back,1,0,0,1,0,0


In [4]:
# .ix has been removed from pandas; .iloc[:6] selects the same six rows (0 to 5)
soccer_data.iloc[:6, 14:]

Unnamed: 0,yellowReds,redCards,photoID,rater1,rater2,refNum,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp
0,0,0,95212.jpg,0.25,0.5,1,1,GRC,0.326391,712.0,0.000564,0.396,750.0,0.002696
1,0,0,1663.jpg,0.75,0.75,2,2,ZMB,0.203375,40.0,0.010875,-0.204082,49.0,0.061504
2,0,0,,,,3,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002
3,0,0,,,,3,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002
4,0,0,,,,3,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002
5,0,0,3868.jpg,0.25,0.0,4,4,LUX,0.325185,127.0,0.003297,0.538462,130.0,0.013752


Description of each feature: link to file in GitHub TODO

# Data Cleaning

Let's first clean the data a bit. The first thing we can do is drop the players who have no skin color rating, as we can neither train nor evaluate our classifier on such entries. Here we use a missing `photoID` as the indicator that a player has no rating.

In [25]:
soccer_data_clean = soccer_data[~soccer_data.photoID.isnull()]
soccer_data.shape[0] - soccer_data_clean.shape[0]

21407

21407 entries have been dropped! Let's just make sure that all `rater1` and `rater2` fields are valid.

In [6]:
print(soccer_data_clean[soccer_data_clean.rater1.isnull()].shape)
soccer_data_clean[soccer_data_clean.rater2.isnull()].shape

(0, 28)


(0, 28)
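
As a possible alternative to filtering on `photoID` and then checking the ratings, one could filter on the rating columns directly. The sketch below assumes a small made-up DataFrame (`toy`) standing in for `soccer_data`; only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for soccer_data (values are made up).
toy = pd.DataFrame({
    "playerShort": ["a", "b", "c", "d"],
    "photoID": ["1.jpg", None, "3.jpg", "4.jpg"],
    "rater1": [0.25, np.nan, 0.75, 0.0],
    "rater2": [0.50, np.nan, 0.75, 0.0],
})

# Keep only the rows where both ratings are present.
toy_clean = toy.dropna(subset=["rater1", "rater2"])

# Number of rows removed by the filter.
n_dropped = len(toy) - len(toy_clean)
print(n_dropped)
```

Filtering on the ratings makes the intent explicit and does not rely on `photoID` and the ratings always being missing together.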

The given data has an inconvenient structure for our analysis: each row is a _dyad_, i.e. a single player-referee pair. This means that if a player has played games under more than one referee, that player will have several rows in this dataset. For example, let's look at everyone's favorite googly-eyed German: Mesut Özil.

In [7]:
soccer_data_clean[soccer_data_clean.playerShort == "mesut-oezil"][:5]

Unnamed: 0,playerShort,player,club,leagueCountry,birthday,height,weight,position,games,victories,...,rater2,refNum,refCountry,Alpha_3,meanIAT,nIAT,seIAT,meanExp,nExp,seExp
26,mesut-oezil,Mesut Özil,Real Madrid,Spain,15.10.1988,183.0,76.0,Attacking Midfielder,1,1,...,0.25,4,4,LUX,0.325185,127.0,0.003297,0.538462,130.0,0.013752
1051,mesut-oezil,Mesut Özil,Real Madrid,Spain,15.10.1988,183.0,76.0,Attacking Midfielder,1,1,...,0.25,66,4,LUX,0.325185,127.0,0.003297,0.538462,130.0,0.013752
1773,mesut-oezil,Mesut Özil,Real Madrid,Spain,15.10.1988,183.0,76.0,Attacking Midfielder,2,2,...,0.25,72,28,IRL,0.355498,4078.0,9.8e-05,0.517225,4238.0,0.000405
2852,mesut-oezil,Mesut Özil,Real Madrid,Spain,15.10.1988,183.0,76.0,Attacking Midfielder,14,11,...,0.25,88,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002
3407,mesut-oezil,Mesut Özil,Real Madrid,Spain,15.10.1988,183.0,76.0,Attacking Midfielder,1,0,...,0.25,94,3,ESP,0.369894,1785.0,0.000229,0.588297,1897.0,0.001002


As we can see, certain "features" of a player stay the same across rows: his name, his height, and his weight. We will also assume that `club` and `leagueCountry` stay the same to simplify the analysis; this is a reasonable assumption, as most players stay with the same team during one season. Other variables depend on the referee (see table below).
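
The assumption that player attributes are constant across rows could be checked rather than taken on faith. The following is a minimal sketch on made-up data (`toy` is hypothetical); on the real data one would group `soccer_data_clean` in the same way.

```python
import pandas as pd

# Hypothetical toy dyads: player p2 appears with two different clubs.
toy = pd.DataFrame({
    "playerShort": ["p1", "p1", "p2", "p2", "p2"],
    "club": ["Club A", "Club A", "Club B", "Club C", "Club B"],
    "height": [180.0, 180.0, 175.0, 175.0, 175.0],
})

# Number of distinct values each player takes in each column.
n_distinct = toy.groupby("playerShort")[["club", "height"]].nunique()

# Per column, the number of players whose value is not constant.
violations = (n_distinct > 1).sum()
print(violations)
```

A column with zero violations can safely be aggregated with "keep first"; a non-zero count shows how many players the assumption simplifies away.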

In the first exercise, we would like to predict the skin color of a player given his description, and the second exercise asks us to "aggregate the referee information grouping by soccer player". Therefore, we will have to carefully aggregate the variables that depend on the referee. The table below describes how we will deal with each feature when performing the aggregation. We decided to disregard the referee scores, as another [analysis](http://nbviewer.jupyter.org/github/mathewzilla/redcard/blob/master/Crowdstorming_visualisation.ipynb) found that country attitude scores do not predict carding by individual referees. Moreover, referees are professionals, so it would be surprising if such a prediction could be made!

| Feature  | Process  | Reason/Assumption  |
|---|---|---|
| _playerShort_  | Keep first  | Unique for player  |
| _player_  | Keep first  | Unique for player  |
| _club_  | Keep first  | Assuming player stays in same team  |
| _leagueCountry_  | Keep first  | Assuming player stays in same team  |
| _birthday_  | Keep first  | Unique for player  |
| _height_  | Keep first  | Assuming player does not grow or have a significant height increase during a single season.  |
| _weight_  | Keep first  | Assuming player does not gain a significant amount of weight during a single season.  |
| _position_  | Keep first  | Assuming player has the same position during a single season.  |
| _games_  | Sum over rows  | Yields total number of games during the 2012/2013 season.  |
| _victories_  | Sum over rows  | Yields total number of victories.   |
| _ties_  | Sum over rows  | Yields total number of ties.  |
| _defeats_  | Sum over rows  | Yields total number of defeats.  |
| _goals_  | Sum over rows  | Yields total number of goals.  |
| _yellowCards_  | Sum over rows  | Yields total number of yellow cards.  |
| _yellowReds_  | Sum over rows  | Yields total number of red cards obtained by two yellow cards.  |
| _redCards_  | Sum over rows  | Yields total number of straight red cards.  |
| _photoID_  | Disregard  | The photo ID is not needed for our analysis.  |
| _rater1_  | Keep first  | Unique for player  |
| _rater2_  | Keep first  | Unique for player  |
| _refNum_  | Disregard  | Purpose of aggregation is to remove "relationship" with a particular referee.   |
| _refCountry_  | Disregard  | Purpose of aggregation is to remove "relationship" with a particular referee.  |
| <em>Alpha\_3</em>  | Disregard  | As it is another representation of the referee's country, we will also disregard this.  |
| _meanIAT_  | Disregard  | Assuming honesty of the referee as they are professionals.  |
| _nIAT_  | Disregard  | Assuming honesty of the referee as they are professionals.  |
| _seIAT_  | Disregard  | Assuming honesty of the referee as they are professionals.  |
| _meanExp_  | Disregard  | Assuming honesty of the referee as they are professionals.  |
| _nExp_  |  Disregard | Assuming honesty of the referee as they are professionals.  |
| _seExp_  | Disregard  | Assuming honesty of the referee as they are professionals.  |
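
The "keep first" and "sum over rows" rules in the table can be expressed as a single mapping passed to `groupby(...).agg(...)`. This is only a sketch on made-up data (`toy` and `rules` are hypothetical names), not the approach used in the cells below, which build the aggregate column by column.

```python
import pandas as pd

# Hypothetical toy dyads with a subset of the dataset's columns.
toy = pd.DataFrame({
    "playerShort": ["p1", "p1", "p2"],
    "player": ["Player One", "Player One", "Player Two"],
    "height": [180.0, 180.0, 175.0],
    "games": [3, 2, 4],
    "yellowCards": [1, 0, 2],
    "refNum": [10, 11, 10],
})

# One aggregation rule per kept column; columns not listed
# (here refNum) are disregarded.
rules = {
    "player": "first",
    "height": "first",
    "games": "sum",
    "yellowCards": "sum",
}

toy_agg = toy.groupby("playerShort").agg(rules)
print(toy_agg)
```

Note that the `"first"` aggregation returns the first non-null value in each group, which can differ from taking the value of the first row when that row has a missing entry.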

In [51]:
# add racism features somehow --> average?
# extract columns that need to be summed over
soccer_data_trim = soccer_data_clean[["playerShort", "games", "victories","ties","defeats","goals","yellowCards","yellowReds","redCards"]]
soccer_data_trim.shape
soccer_data_trim.head()

Unnamed: 0,playerShort,games,victories,ties,defeats,goals,yellowCards,yellowReds,redCards
0,lucas-wilchez,1,0,0,1,0,0,0,0
1,john-utaka,1,0,0,1,0,1,0,0
5,aaron-hughes,1,0,0,1,0,0,0,0
6,aleksandar-kolarov,1,1,0,0,0,0,0,0
7,alexander-tettey,1,0,0,1,0,0,0,0


In [52]:
# sum the per-referee counts to get one row of totals per player
data_agg = soccer_data_trim.groupby("playerShort").sum()
data_agg.head()

playerShort,games,victories,ties,defeats,goals,yellowCards,yellowReds,redCards
aaron-hughes,654,247,179,228,9,19,0,0
aaron-hunt,336,141,73,122,62,42,0,1
aaron-lennon,412,200,97,115,31,11,0,0
aaron-ramsey,260,150,42,68,39,31,0,1
abdelhamid-el-kaoutari,124,41,40,43,1,8,4,2


In [53]:
ref_bias = soccer_data_clean[["playerShort", "games","meanIAT","meanExp"]].groupby("playerShort")

In [54]:
# weighted average of a referee-country score (meanIAT or meanExp),
# using the number of games with each referee as the weight
def weighted_average(group, feature):
    weights = group['games']
    total_games = weights.sum()
    scores = group[feature]
    return (scores * weights).sum() / total_games

data_agg["weightedIAT"] = ref_bias.apply(weighted_average, 'meanIAT')
data_agg["weightedExp"] = ref_bias.apply(weighted_average, 'meanExp')
data_agg.head()

playerShort,games,victories,ties,defeats,goals,yellowCards,yellowReds,redCards,weightedIAT,weightedExp
aaron-hughes,654,247,179,228,9,19,0,0,0.333195,0.400637
aaron-hunt,336,141,73,122,62,42,0,1,0.341438,0.380811
aaron-lennon,412,200,97,115,31,11,0,0,0.332389,0.399459
aaron-ramsey,260,150,42,68,39,31,0,1,0.336638,0.433294
abdelhamid-el-kaoutari,124,41,40,43,1,8,4,2,0.331882,0.328895
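
To make the games-weighted average concrete, here is a small worked example on made-up numbers, computed without `apply`. The DataFrame `toy` and its values are hypothetical; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy dyads: p1 met two referees, p2 met one.
toy = pd.DataFrame({
    "playerShort": ["p1", "p1", "p2"],
    "games": [1, 3, 2],
    "meanIAT": [0.30, 0.40, 0.35],
})

# Numerator: sum over referees of (score * games) for each player.
numerator = (toy["meanIAT"] * toy["games"]).groupby(toy["playerShort"]).sum()

# Denominator: total number of games for each player.
denominator = toy.groupby("playerShort")["games"].sum()

weighted_iat = numerator / denominator
print(weighted_iat)
```

For `p1` this gives (0.30 * 1 + 0.40 * 3) / 4 = 0.375. One caveat that applies to this formulation: if a score is missing for a row, that row drops out of the numerator but its games still count in the denominator, which pulls the average towards zero.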


Next, we count the number of unique referees who gave each card type to each player.

In [55]:
unique_ref = soccer_data_trim.groupby("playerShort")

In [56]:
# count number of non-zero entries, i.e. number of unique referees who have given the card
def num_unique_ref_card(group, card_type):
    ref_card = group[card_type]
    return (ref_card!=0).sum()

# count number of unique referees that have given a card to a particular player
def num_unique_ref(group):
    ref_cards = group['yellowCards']+group['yellowReds']+group['redCards']
    return (ref_cards!=0).sum()


data_agg['uniqueYellow'] = unique_ref.apply(num_unique_ref_card, 'yellowCards')
data_agg['uniqueYellowReds'] = unique_ref.apply(num_unique_ref_card, 'yellowReds')
data_agg['uniqueReds'] = unique_ref.apply(num_unique_ref_card, 'redCards')
data_agg['uniqueRefCards'] = unique_ref.apply(num_unique_ref)
data_agg.head()

playerShort,games,victories,ties,defeats,goals,yellowCards,yellowReds,redCards,weightedIAT,weightedExp,uniqueYellow,uniqueYellowReds,uniqueReds,uniqueRefCards
aaron-hughes,654,247,179,228,9,19,0,0,0.333195,0.400637,16,0,0,16
aaron-hunt,336,141,73,122,62,42,0,1,0.341438,0.380811,29,0,1,29
aaron-lennon,412,200,97,115,31,11,0,0,0.332389,0.399459,10,0,0,10
aaron-ramsey,260,150,42,68,39,31,0,1,0.336638,0.433294,25,0,1,26
abdelhamid-el-kaoutari,124,41,40,43,1,8,4,2,0.331882,0.328895,8,4,2,13
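
Since each row corresponds to one referee, counting non-zero entries per player can also be done in a vectorized way. The sketch below uses made-up data (`toy` is hypothetical) and assumes, as above, that a player has at most one row per referee.

```python
import pandas as pd

# Hypothetical toy dyads: p1 met three referees, p2 met one.
toy = pd.DataFrame({
    "playerShort": ["p1", "p1", "p1", "p2"],
    "yellowCards": [2, 0, 1, 0],
    "yellowReds": [0, 0, 0, 1],
    "redCards": [0, 1, 0, 0],
})
card_cols = ["yellowCards", "yellowReds", "redCards"]

# True where the referee of that row gave at least one card of that type.
gave_card = toy[card_cols] != 0

# Referees per player who gave each card type.
per_type = gave_card.groupby(toy["playerShort"]).sum()

# Referees per player who gave any card at all.
any_card = gave_card.any(axis=1).groupby(toy["playerShort"]).sum()

print(per_type)
print(any_card)
```

The count of referees who gave any card can be smaller than the sum of the per-type counts, because one referee may have given several card types to the same player.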


In [57]:
data_agg.tail()

playerShort,games,victories,ties,defeats,goals,yellowCards,yellowReds,redCards,weightedIAT,weightedExp,uniqueYellow,uniqueYellowReds,uniqueReds,uniqueRefCards
zoltan-gera,392,150,96,146,71,44,1,1,0.336001,0.417374,35,1,1,37
zoltan-stieber,142,48,37,57,27,12,0,0,0.336786,0.345085,12,0,0,12
zoumana-camara,395,148,117,130,7,46,2,6,0.338068,0.363993,30,2,6,33
zubikarai,47,14,15,18,0,2,0,2,0.36927,0.590521,2,0,2,4
zurutuza,160,68,39,53,12,22,0,0,0.368915,0.588902,16,0,0,16


Now we need to extract the characteristic data of the players: name, height, weight, birthday, position, skin color ratings, club, and country of the league.

In [58]:
def extract_param(group, param):
    return group[param].values[0]

const_param = ["player","height","weight","club","leagueCountry","birthday","position","rater1","rater2"]
for param in const_param:
    data_agg[param] = soccer_data_clean.groupby("playerShort").apply(extract_param, param)

data_agg.head()

playerShort,games,victories,ties,defeats,goals,yellowCards,yellowReds,redCards,weightedIAT,weightedExp,...,uniqueRefCards,player,height,weight,club,leagueCountry,birthday,position,rater1,rater2
aaron-hughes,654,247,179,228,9,19,0,0,0.333195,0.400637,...,16,Aaron Hughes,182.0,71.0,Fulham FC,England,08.11.1979,Center Back,0.25,0.0
aaron-hunt,336,141,73,122,62,42,0,1,0.341438,0.380811,...,29,Aaron Hunt,183.0,73.0,Werder Bremen,Germany,04.09.1986,Attacking Midfielder,0.0,0.25
aaron-lennon,412,200,97,115,31,11,0,0,0.332389,0.399459,...,10,Aaron Lennon,165.0,63.0,Tottenham Hotspur,England,16.04.1987,Right Midfielder,0.25,0.25
aaron-ramsey,260,150,42,68,39,31,0,1,0.336638,0.433294,...,26,Aaron Ramsey,178.0,76.0,Arsenal FC,England,26.12.1990,Center Midfielder,0.0,0.0
abdelhamid-el-kaoutari,124,41,40,43,1,8,4,2,0.331882,0.328895,...,13,Abdelhamid El-Kaoutari,180.0,73.0,Montpellier HSC,France,17.03.1990,Center Back,0.25,0.25


In [59]:
data_agg.columns.values

array(['games', 'victories', 'ties', 'defeats', 'goals', 'yellowCards',
       'yellowReds', 'redCards', 'weightedIAT', 'weightedExp',
       'uniqueYellow', 'uniqueYellowReds', 'uniqueReds', 'uniqueRefCards',
       'player', 'height', 'weight', 'club', 'leagueCountry', 'birthday',
       'position', 'rater1', 'rater2'], dtype=object)

In [62]:
data_agg = data_agg[["player", "height", "weight", "rater1", "rater2", "club", "leagueCountry", 
                     "birthday", "position", "games", "victories", "ties", "defeats", "goals", 'yellowCards',
                     'yellowReds', 'redCards', 'weightedIAT', 'weightedExp', 'uniqueYellow', 'uniqueYellowReds', 
                     'uniqueReds', 'uniqueRefCards']]
# .ix has been removed from pandas; .iloc selects the first 5 rows by position
data_agg.iloc[:5, :14]

playerShort,player,height,weight,rater1,rater2,club,leagueCountry,birthday,position,games,victories,ties,defeats,goals
aaron-hughes,Aaron Hughes,182.0,71.0,0.25,0.0,Fulham FC,England,08.11.1979,Center Back,654,247,179,228,9
aaron-hunt,Aaron Hunt,183.0,73.0,0.0,0.25,Werder Bremen,Germany,04.09.1986,Attacking Midfielder,336,141,73,122,62
aaron-lennon,Aaron Lennon,165.0,63.0,0.25,0.25,Tottenham Hotspur,England,16.04.1987,Right Midfielder,412,200,97,115,31
aaron-ramsey,Aaron Ramsey,178.0,76.0,0.0,0.0,Arsenal FC,England,26.12.1990,Center Midfielder,260,150,42,68,39
abdelhamid-el-kaoutari,Abdelhamid El-Kaoutari,180.0,73.0,0.25,0.25,Montpellier HSC,France,17.03.1990,Center Back,124,41,40,43,1


In [63]:
# .ix has been removed from pandas; .iloc selects the first 5 rows by position
data_agg.iloc[:5, 14:]

playerShort,yellowCards,yellowReds,redCards,weightedIAT,weightedExp,uniqueYellow,uniqueYellowReds,uniqueReds,uniqueRefCards
aaron-hughes,19,0,0,0.333195,0.400637,16,0,0,16
aaron-hunt,42,0,1,0.341438,0.380811,29,0,1,29
aaron-lennon,11,0,0,0.332389,0.399459,10,0,0,10
aaron-ramsey,31,0,1,0.336638,0.433294,25,0,1,26
abdelhamid-el-kaoutari,8,4,2,0.331882,0.328895,8,4,2,13


# Classification

<em>Train a `sklearn.ensemble.RandomForestClassifier` that given a soccer player description outputs his skin color. Show how different parameters passed to the Classifier affect the overfitting issue. Perform cross-validation to mitigate the overfitting of your model. Once you assessed your model, inspect the `feature_importances_` attribute and discuss the obtained results. With different assumptions on the data (e.g., dropping certain features even before feeding them to the classifier), can you obtain a substantially different `feature_importances_` attribute?</em>

TODO: create dummy variables for the categorical features and perform cross-validation.
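
As a rough sketch of what these two steps could look like, the example below one-hot encodes a categorical column with `pd.get_dummies` and scores a `RandomForestClassifier` with 5-fold cross-validation. Everything here is an assumption made for illustration: the data is randomly generated (so the scores carry no meaning), and the binary label obtained by thresholding the mean of `rater1` and `rater2` at 0.5 is just one possible choice, not a decision taken in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical random data standing in for data_agg.
rng = np.random.default_rng(0)
n = 200
ratings = [0.0, 0.25, 0.5, 0.75, 1.0]
toy = pd.DataFrame({
    "height": rng.normal(180, 7, n),
    "games": rng.integers(1, 500, n),
    "position": rng.choice(["Center Back", "Goalkeeper", "Left Winger"], n),
    "rater1": rng.choice(ratings, n),
    "rater2": rng.choice(ratings, n),
})

# Assumption: binary label from the mean of the two ratings, threshold 0.5.
y = (toy[["rater1", "rater2"]].mean(axis=1) >= 0.5).astype(int)

# Drop the ratings from the features and one-hot encode the categorical column.
features = toy.drop(columns=["rater1", "rater2"])
X = pd.get_dummies(features, columns=["position"], dtype=float)

# Limiting max_depth is one of the parameters that affects overfitting.
clf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)

# 5-fold cross-validated accuracy, one score per fold.
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```

Varying `max_depth` or `n_estimators` and comparing the cross-validated scores with the training accuracy is one way to show how the parameters affect overfitting.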