# Constructing a Team

My flatmates were struggling to create a team that met the criteria of [this tweet](https://twitter.com/Carra23/status/1250066001821130759).

With the right data this is surely just a simple constraints problem, so I challenged myself to come up with my own team, even though I know little about football beyond my interest as an 8-year-old.

I took data from [this repository](https://github.com/pratapvardhan/FIFAWorldCup), which lists the players in each World Cup squad along with the club they belonged to at the time.

An important component of this problem is that no two players can have played for the same club at any point. Each row only records the club a player belonged to in that World Cup year, so clubs they passed through between tournaments are invisible to the check.

Because of these limitations in the data, a number of assumptions have to be made, which prevents the results from being 100% accurate. More like 99%.

There are other data sets out there that contain rankings, but using them could require significant fuzzy matching of names, and they may be missing data for older players.

The selector works by maintaining sets of the clubs and nations already represented in the team. Each candidate is drawn at random from the pool for their position, checked against those sets, and added to the team only if neither their nation nor any of their clubs is already present.
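
A rough, self-contained sketch of that idea on made-up data may help before the real code. It is not the notebook's implementation: the pool, the player names and the `pick_team` helper are all hypothetical, and a shuffled pass over the candidates stands in for repeated random sampling.

```python
import random

# Hypothetical toy pool: (name, country, clubs played for).
POOL = [
    ("Player A", "England", {"Club 1"}),
    ("Player B", "England", {"Club 2"}),
    ("Player C", "France", {"Club 1", "Club 3"}),
    ("Player D", "Brazil", {"Club 4"}),
]

def pick_team(pool, size, rng):
    """Greedily pick up to `size` players sharing no country and no club."""
    used_countries, used_clubs, team = set(), set(), []
    candidates = list(pool)
    rng.shuffle(candidates)  # random order stands in for random sampling
    for name, country, clubs in candidates:
        if country in used_countries or clubs & used_clubs:
            continue  # clashes with someone already picked
        used_countries.add(country)
        used_clubs |= clubs
        team.append(name)
        if len(team) == size:
            break
    return team

team = pick_team(POOL, size=2, rng=random.Random(0))
print(team)
```

The two sets are the whole trick: membership tests against them are cheap, and every accepted player narrows what the remaining candidates are allowed to be.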

In [1]:
import pandas as pd
import random as rd

In [None]:
total_data = pd.read_csv("squads.csv")

# Number of people in each position
GK_NUM = 1
DF_NUM = 4
FW_NUM = 3
MF_NUM = 3

# Minimum number of World Cup squads a player must appear in (a rough proxy for quality).
MIN_CAPS = 2

# Only consider World Cups held after this year.
YEAR_SINCE = 1996

In [2]:
total_data.head()

Unnamed: 0,No,Pos,Player,DOB/Age,Caps,Club,Country,ClubCountry,Year
0,1,1GK,Ángel Bossio,(1905-05-05)5 May 1905 (aged 25),,Talleres,Argentina,Argentina,1930
1,1,1GK,Juan Botasso,(1908-10-23)23 October 1908 (aged 21),,Quilmes,Argentina,Argentina,1930
2,9,4FW,Roberto Cherro,(1907-02-23)23 February 1907 (aged 23),,Boca Juniors,Argentina,Argentina,1930
3,4,2DF,Alberto Chividini,(1907-02-23)23 February 1907 (aged 23),,Central Norte Tucumán,Argentina,Argentina,1930
4,10,4FW,Attilio Demaría,(1909-03-19)19 March 1909 (aged 21),,Estudiantil Porteño,Argentina,Argentina,1930


How many different players do we have?


In [25]:
all_players = len(total_data["Player"])
unique_players = total_data["Player"].nunique()
print("Total player instances {0} and {1} unique players in the data".format(all_players,
                                                                             unique_players))
# Share of rows that are a repeat appearance of a name already seen in the data.
repeat_share = (all_players - unique_players) / all_players
print("{0:.1%} of rows are repeat appearances by a player since 1930".format(repeat_share))

Total player instances 8897 and 7191 unique players in the data
19.2% of rows are repeat appearances by a player since 1930
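
The share of players who appear in only one squad can be read off the per-player row counts. A minimal sketch on made-up data, where the frame `squads` and the names `appearances` and `share_once` are illustrative rather than part of the notebook:

```python
import pandas as pd

# Hypothetical toy data: one row per (player, tournament) squad entry.
squads = pd.DataFrame({
    "Player": ["A", "A", "B", "C", "C", "C", "D"],
    "Year":   [1998, 2002, 1998, 1998, 2002, 2006, 2010],
})

appearances = squads["Player"].value_counts()  # number of rows per player
share_once = (appearances == 1).mean()         # fraction of players with exactly one row
print(f"{appearances.size} unique players, {share_once:.1%} appear once")
```

Counting per player, rather than comparing total rows with unique names, is what separates "players with one appearance" from "rows that are repeats".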


In [4]:
total_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8897 entries, 0 to 8896
Data columns (total 9 columns):
No             8873 non-null object
Pos            8897 non-null object
Player         8897 non-null object
DOB/Age        8805 non-null object
Caps           6086 non-null object
Club           8897 non-null object
Country        8897 non-null object
ClubCountry    8897 non-null object
Year           8897 non-null int64
dtypes: int64(1), object(8)
memory usage: 625.7+ KB


The columns we are most interested in are Pos, Player, Club and Country, none of which have missing values.

In [5]:
# Keep only rows from World Cups held after YEAR_SINCE.
# .copy() gives an independent frame, so later column assignments don't raise SettingWithCopyWarning.
correct_age = total_data[total_data["Year"] > YEAR_SINCE].copy()

# How many players in our timeframe
len(correct_age["Player"].unique())

2576

In [6]:
# Prevent duplicates by removing the "(c)" marker that denotes captains.
# regex=False matters: as a regex, "(c)" is a capture group that matches every lowercase "c".
cleaned = correct_age["Player"].str.replace("(c)", "", regex=False)
# Strip any remaining parentheses and surrounding whitespace.
cleaned = cleaned.str.replace("(", "", regex=False).str.replace(")", "", regex=False).str.strip()
correct_age = correct_age.assign(Player=cleaned)



In [7]:
len(correct_age["Player"].unique())

2519
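
Removing the captain marker is easy to get subtly wrong, because parentheses are special characters in regular expressions. The sketch below uses the standard-library `re` module on a made-up name to show the difference. pandas' `Series.str.replace` takes a `regex` argument that selects between the two behaviours, and its default has differed between pandas versions, so it is safest to pass it explicitly.

```python
import re

name = "Michael Owen (c)"  # made-up input in the same shape as the data

# Unescaped, "(c)" is a capture group that matches every lowercase "c".
as_regex = re.sub(r"(c)", "", name)

# Escaping the parentheses matches only the literal captain marker.
escaped = re.sub(r"\s*\(c\)", "", name)

# A plain (non-regex) string replace does the same job.
literal = name.replace("(c)", "").strip()

print(as_regex)  # Mihael Owen ()
print(escaped)   # Michael Owen
print(literal)   # Michael Owen
```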

In [8]:
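# Keep only players who appear in at least MIN_CAPS World Cup squads.
player_caps = correct_age["Player"].value_counts()
frequent_players = player_caps.index[player_caps >= MIN_CAPS]
# Test the Player column only, so a matching string in another column cannot select a row.
correct_age = correct_age[correct_age["Player"].isin(frequent_players)]
correct_age.sample(n=10)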

Unnamed: 0,No,Pos,Player,DOB/Age,Caps,Club,Country,ClubCountry,Year
6641,21,3MF,Luis Enrique,(1970-05-08)8 May 1970 (aged 32),57,Barcelona,Spain,Spain,2002
6445,8,3MF,Hidetoshi Nakata,(1977-01-22)22 January 1977 (aged 21),21,Bellmare Hiratsuka,Japan,Japan,1998
5967,15,2DF,Lilian Thuram,(1972-01-01)1 January 1972 (aged 26),32,Parma,France,Italy,1998
8566,13,2DF,Miguel,(1980-01-04)4 January 1980 (aged 30),53,Valencia,Portugal,Spain,2010
7029,18,4FW,Ivica Olić,(1979-09-14)14 September 1979 (aged 22),4,Zagreb,Croatia,Croatia,2002
7413,11,4FW,Didier Drogba,(1978-03-11)11 March 1978 (aged 28),32,Chelsea,Côte d'Ivoire,England,2006
6560,9,4FW,Roque Santa Cruz,(1981-08-16)16 August 1981 (aged 20),24,Bayern Munich,Paraguay,Germany,2002
6998,10,4FW,Marcus Allbäck,(1973-07-05)5 July 1973 (aged 28),18,Heerenveen,Sweden,Netherlands,2002
8299,22,1GK,Richard Kingson,(1978-06-13)13 June 1978 (aged 31),58,Wigan Athletic,Ghana,England,2010
7759,12,4FW,Thierry Henry,(1977-08-17)17 August 1977 (aged 28),78,Arsenal,France,England,2006
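
Filtering by number of appearances can be written in a couple of equivalent ways. The sketch below uses a made-up frame and an illustrative threshold name, `MIN_APPEARANCES`. Both options test the `Player` column only, which avoids accidental matches elsewhere, such as a club that happens to share a player's name.

```python
import pandas as pd

MIN_APPEARANCES = 2  # illustrative stand-in for the notebook's threshold

# Hypothetical data: player B appears once, at a club that is also called "A".
squads = pd.DataFrame({
    "Player": ["A", "A", "B", "C", "C", "C"],
    "Club":   ["X", "Y", "A", "Z", "Z", "Z"],
})

# Option 1: count rows per player, then keep rows whose Player is frequent enough.
counts = squads["Player"].value_counts()
kept = squads[squads["Player"].isin(counts.index[counts >= MIN_APPEARANCES])]

# Option 2: broadcast each player's group size back onto their rows and compare.
sizes = squads.groupby("Player")["Player"].transform("size")
kept_alt = squads[sizes >= MIN_APPEARANCES]

print(kept["Player"].tolist())
```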


From here on we assume player names are unique, so two different players who share a name are treated as one person.

In [9]:
# We want one row per player, holding the list of clubs they were at across their World Cups.
gb_name = correct_age.groupby("Player")["Club"].agg(list).to_frame()

In [10]:
gb_name.head()

Player,Club
Aaron Lennon,"[Tottenham Hotspur, Tottenham Hotspur]"
Aaron Mokoena,"[Germinal Beerschot, Portsmouth]"
Abdulaziz Khathran,"[Al-Shabab, Al-Hilal]"
Abdullah Zubromawi,"[Al-Ahli, Al-Ahli]"
Adel Sellimi,"[Real Jaén, SC Freiburg]"


In [11]:
correct_age.head()

Unnamed: 0,No,Pos,Player,DOB/Age,Caps,Club,Country,ClubCountry,Year
5756,2,2DF,Cafu,(1970-06-07)7 June 1970 (aged 28),,Roma,Brazil,Italy,1998
5760,6,2DF,Roberto Carlos,(1973-04-10)10 April 1973 (aged 25),,Real Madrid,Brazil,Spain,1998
5763,9,4FW,Ronaldo,(1976-09-22)22 September 1976 (aged 21),,Internazionale,Brazil,Italy,1998
5764,10,3MF,Rivaldo,(1972-04-19)19 April 1972 (aged 26),,Barcelona,Brazil,Spain,1998
5765,11,3MF,Emerson,(1976-04-04)4 April 1976 (aged 22),,Bayer Leverkusen,Brazil,Germany,1998


In [12]:
# Merge the per-player club lists back onto the original rows to recover Country, Pos and Year.
result = pd.merge(gb_name, correct_age[["Player", "Country", "Pos", "Year"]], 
                  left_on="Player", 
                  right_on="Player",
                  how="inner")

In [13]:
# The merge yields one row per World Cup appearance, so keep only the first row per player.
# Country, Pos and Year therefore come from that first row.
result = result.drop_duplicates(subset=["Player"])
result = result.reset_index(drop=True)

In [14]:
result.head()

Unnamed: 0,Player,Club,Country,Pos,Year
0,Aaron Lennon,"[Tottenham Hotspur, Tottenham Hotspur]",England,3MF,2006
1,Aaron Mokoena,"[Germinal Beerschot, Portsmouth]",South Africa,2DF,2002
2,Abdulaziz Khathran,"[Al-Shabab, Al-Hilal]",Saudi Arabia,3MF,2002
3,Abdullah Zubromawi,"[Al-Ahli, Al-Ahli]",Saudi Arabia,3MF,1998
4,Adel Sellimi,"[Real Jaén, SC Freiburg]",Tunisia,4FW,1998
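
The merge followed by de-duplication can probably be avoided by building the per-player frame in a single `groupby`. One possible sketch uses named aggregation (available in pandas 0.25 and later) on made-up data. Sorting by `Year` first, so that `"first"` means the earliest tournament, is an assumption about what is wanted rather than something the notebook specifies.

```python
import pandas as pd

# Hypothetical data in the same shape as the filtered squad rows.
squads = pd.DataFrame({
    "Player":  ["A", "A", "B", "B"],
    "Club":    ["X", "Y", "Z", "Z"],
    "Country": ["England", "England", "France", "France"],
    "Pos":     ["3MF", "3MF", "1GK", "1GK"],
    "Year":    [1998, 2002, 2002, 2006],
})

per_player = (
    squads.sort_values("Year")
    .groupby("Player")
    .agg(
        Club=("Club", list),           # every club across tournaments
        Country=("Country", "first"),  # taken from the earliest row
        Pos=("Pos", "first"),
        Year=("Year", "first"),
    )
    .reset_index()
)
print(per_player)
```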


In [15]:
# There are four different positions given by this dataset. Easier than specific positions.
len(result["Pos"].unique())

4

In [16]:
# //TODO make this cleaner / change data structure (groupbys?)
def split_position(input_data=result, GK_NUM=GK_NUM, DF_NUM=DF_NUM, MF_NUM=MF_NUM, FW_NUM=FW_NUM):
    """
    Creates a dictionary containing different frames for each position
    
    Input: original dataframe + team structure constants. 
    Output: dictionary of dataframes and how many players needed.
    """
    goalies = input_data[input_data["Pos"] == "1GK"]
    defense = input_data[input_data["Pos"] == "2DF"]
    mids = input_data[input_data["Pos"] == "3MF"]
    forwards = input_data[input_data["Pos"] == "4FW"]
    data_key = {
        "goalies": [goalies, GK_NUM],
        "defense": [defense, DF_NUM],
        "mids": [mids, MF_NUM],
        "forwards": [forwards, FW_NUM],
    }
    return data_key
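
As the TODO hints, a `groupby` could do the splitting. A minimal sketch on made-up data, where the `FORMATION` mapping and the `by_position` name are hypothetical and the dictionary is keyed by the dataset's own position codes rather than by "goalies", "defense" and so on:

```python
import pandas as pd

# Illustrative formation: position code -> number of players wanted.
FORMATION = {"1GK": 1, "2DF": 4, "3MF": 3, "4FW": 3}

players = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E"],
    "Pos":    ["1GK", "2DF", "2DF", "3MF", "4FW"],
})

# One sub-frame per position code, paired with how many players to draw from it.
by_position = {
    pos: (frame, FORMATION[pos])
    for pos, frame in players.groupby("Pos")
}
print({pos: len(frame) for pos, (frame, _) in by_position.items()})
```

This keeps the position codes in one place, so adding or renaming a position means editing only the mapping.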

This is one of the most significant units of the code: it determines whether a proposed player can be added to the team under the given constraints.

Calling `to_list()` and then indexing is ugly, but the alternative is `.values`, which returns a NumPy array.

This assumes names are unique.

In [17]:
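def check_membership(df, player_name, nt, cl, pl):
    """
    Checks whether a player is a valid addition to the team.
    
    Input: dataframe of players, player name, and the sets of nations (nt),
           clubs (cl) and players (pl) already in the team.
    Output: Boolean if player can be added. On success the player's nation
            and clubs are added to nt and cl.
    """
    row = df[df["Player"] == player_name]
    country = row["Country"].to_list()[0]
    player_clubs = row["Club"].to_list()[0]
    if country in nt:
        return False
    elif any(club in cl for club in player_clubs):
        return False
    elif player_name in pl:
        return False
    else:
        # Record the nation and clubs so later candidates are checked against them.
        nt.add(country)
        cl.update(player_clubs)
        return True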
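
One way to make this step easier to reason about is to split it into a side-effect-free test and a separate registration step, and to pull the single matching row out with `.iloc[0]` instead of `to_list()[0]`. The sketch below is an alternative under those assumptions, not the notebook's code; `can_add` and `register` are hypothetical names and the data is made up.

```python
import pandas as pd

players = pd.DataFrame({
    "Player":  ["A", "B"],
    "Country": ["England", "France"],
    "Club":    [["X", "Y"], ["Y", "Z"]],
})

def can_add(df, player_name, used_countries, used_clubs, team):
    """Return True if the player clashes with nobody already picked (no side effects)."""
    row = df.loc[df["Player"] == player_name].iloc[0]  # first match; assumes unique names
    return (
        player_name not in team
        and row["Country"] not in used_countries
        and used_clubs.isdisjoint(row["Club"])
    )

def register(df, player_name, used_countries, used_clubs, team):
    """Record the player's country and clubs so later candidates are checked against them."""
    row = df.loc[df["Player"] == player_name].iloc[0]
    used_countries.add(row["Country"])
    used_clubs.update(row["Club"])
    team.add(player_name)

countries, clubs, team = set(), set(), set()
if can_add(players, "A", countries, clubs, team):
    register(players, "A", countries, clubs, team)
b_allowed = can_add(players, "B", countries, clubs, team)  # B shares club "Y" with A
print(team, b_allowed)
```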

In [18]:
def find_player_position(position, frames_dict, nt, cl, pl):
    """
    Given a desired position, the function randomly samples the frame 
    of the right position. It checks each suggested addition to the team to ensure
    that they are a valid addition. Continues until the required number are found.
    
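    Input: team position searched for, dataframe dictionary, and the nation,
           club and player sets shared with the caller
    Output: None, mutates the caller's sets in place (bad practice!). Loops
            forever if too few valid players remain for the position.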
    """
    player_count = 0
    while player_count < frames_dict[position][1]:
        test_player = frames_dict[position][0]["Player"].sample().to_list()[0]
        if check_membership(frames_dict[position][0], test_player, nt, cl, pl):
            pl.add(test_player)
            player_count += 1
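
Because the loop keeps sampling until enough valid players are found, it never finishes if the remaining pool cannot satisfy the constraints. A hedged sketch of one possible guard is a cap on the number of attempts. The helper `draw_valid` and its `max_attempts` parameter are hypothetical, and here `is_valid` is assumed to be a pure check on plain values, which differs from the notebook's version.

```python
import random

def draw_valid(candidates, needed, is_valid, rng, max_attempts=1000):
    """Randomly draw `needed` distinct valid candidates, or raise if attempts run out."""
    chosen = []
    for _ in range(max_attempts):
        if len(chosen) == needed:
            break
        pick = rng.choice(candidates)
        if pick not in chosen and is_valid(pick):
            chosen.append(pick)
    if len(chosen) < needed:
        raise RuntimeError(f"found only {len(chosen)} of {needed} players")
    return chosen

rng = random.Random(0)
evens = draw_valid(list(range(10)), needed=3, is_valid=lambda n: n % 2 == 0, rng=rng)
print(evens)
```

Raising an error makes an unsatisfiable position visible to the caller, who can then retry with a different position order or relax the filters.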
    

In [19]:
def find_team(data):
    """
    Wrapper function that runs the search for the team given the data dictionary.
    Randomly shuffles the different positions so GK etc aren't always favoured first. 
    
    Input: data dictionary
    Output: team players 
    """
    nationalities = set()
    clubs = set()
    players = set()
    pos_keys = list(data.keys())
    rd.shuffle(pos_keys)
    diff_positions = pos_keys
    for rand_position in diff_positions:
        find_player_position(rand_position, frames_dict=data, nt=nationalities, cl=clubs, pl=players)
    return players

In [20]:
def construct_info_frame(data=result):
    """
    Creates a dataframe for the team.
    Initial set of players is not very informative, combining it with the initial 
    data set gives us information about the players.
    
    Input: dataframe with one row per player
    Output: data frame
    """
    split_keys = split_position(data)
    players = pd.DataFrame(data=list(find_team(split_keys)), columns=["Names"])
    player_info = pd.merge(players, data[["Player", "Pos", "Country", "Year"]], 
                  left_on="Names", 
                  right_on="Player",
                  how="inner")
    player_info = player_info.drop(columns="Player")
    return player_info

In [21]:
construct_info_frame()

Unnamed: 0,Names,Pos,Country,Year
0,Andranik Teymourian,3MF,Iran,2006
1,Oguchi Onyewu,2DF,United States,2006
2,Abdulaziz Khathran,3MF,Saudi Arabia,2002
3,Rónald Gómez,4FW,Costa Rica,2002
4,Thomas Helveg,2DF,Denmark,1998
5,Vincent Enyeama,1GK,Nigeria,2002
6,Michael Owen,4FW,England,1998
7,Hatem Trabelsi,2DF,Tunisia,1998
8,Danny Boffin,3MF,Belgium,1998
9,Seol Ki-Hyeon,4FW,Korea Republic,2002


### TO DO

Change global/local variables for constants.

Work out a way to evaluate a team's quality (FIFA scores?)

Do we want sets (unordered but faster) for membership structure?

Use classes instead of dicts / constants
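
For the last item, one possible shape is a small frozen dataclass that replaces the module-level formation constants. This is only a sketch: the `Formation` class, its field names and its methods are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Formation:
    """Hypothetical container for the team-shape constants."""
    goalkeepers: int = 1
    defenders: int = 4
    midfielders: int = 3
    forwards: int = 3

    def as_position_counts(self):
        """Map the dataset's position codes to the number of players wanted."""
        return {
            "1GK": self.goalkeepers,
            "2DF": self.defenders,
            "3MF": self.midfielders,
            "4FW": self.forwards,
        }

    @property
    def size(self):
        """Total number of players in the team."""
        return sum(self.as_position_counts().values())

default = Formation()
narrow = Formation(defenders=3, midfielders=4)
print(default.size, narrow.as_position_counts())
```

Passing a `Formation` instance into the selection functions would remove the reliance on globals and make it easy to try other team shapes.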