### Prepping Data Challenge: It's Coming Rome (week 28)

### Challenge
The challenge this week is to analyse the all of the penalty shootouts in the World Cup and European Championships (Euro's) since 1976.

### Input
Data is from Wikipedia (World Cup & Euro's) and is two sheets: 

### Requirements
 - Input Data
 - Determine what competition each penalty was taken in
 - Clean any fields, correctly format the date the penalty was taken, & group the two German countries (eg, West Germany & Germany)
 - Rank the countries on the following: 
   - Shootout win % (exclude teams who have never won a shootout)
   - Penalties scored %
 - What is the most and least successful time to take a penalty? (What penalty number are you most likely to score or miss?)
 - Output the Data
 
 #### my solution for the challenge was heavily influrenced by @ArseneXie 's solution

In [1]:
import pandas as pd
import numpy as np
import re

In [21]:
#Input the data
temp = []
with pd.ExcelFile('WK28-InternationalPenalties.xlsx') as xlsx:
    for sheet in xlsx.sheet_names:
        df = pd.read_excel(xlsx, sheet)
        df.insert(0, 'Sheet', sheet)
        df.columns = [x.title().strip() for x in df.columns]
        temp.append(df)
        df1 = pd.concat(temp)[['Sheet','No.','Penalty Number','Winner','Loser','Winning Team Taker','Losing Team Taker']]

In [22]:
df1.head()

Unnamed: 0,Sheet,No.,Penalty Number,Winner,Loser,Winning Team Taker,Losing Team Taker
0,WorldCup,1,1,West Germany,France,Kaltz Penalty scored,Penalty scored Giresse
1,WorldCup,1,2,West Germany,France,Breitner Penalty scored,Penalty scored Amoros
2,WorldCup,1,3,West Germany,France,Stielike Penalty missed,Penalty scored Rocheteau
3,WorldCup,1,4,West Germany,France,Littbarski Penalty scored,Penalty missed Six
4,WorldCup,1,5,West Germany,France,Rummenigge Penalty scored,Penalty scored Platini


In [32]:
#clean data
df1['Winner'] = df1['Winner'].str.strip()
df1['Loser'] = df1['Loser'].str.strip()
df1 = df1.replace({'Winner':{'West Germany':'Germany'}, 'Loser':{'West Germany':'Germany'}})
df1['Winner Scored'] = df1['Winning Team Taker'].apply(lambda x: 1 if re.search('(scored)', str(x)) else 0)
df1['Total Winner Penalties'] = df1['Winning Team Taker'].apply(lambda x: 0 if pd.isna(x) else 1)
df1['Loser Scored'] = df1['Losing Team Taker'].apply(lambda x: 1 if re.search('(scored)', str(x)) else 0)
df1['Total Loser Penalties'] = df1['Losing Team Taker'].apply(lambda x: 0 if pd.isna(x) else 1)

In [33]:
#Rank the countries on the following: 
#Shootout win % (exclude teams who have never won a shootout)
output1 = df1[['Sheet','No.','Winner','Loser']].drop_duplicates().copy()
output1 = output1.melt(id_vars=['Sheet','No.'], value_name='Team',var_name='Win Lose')
output1['Shootouts'] = output1['Win Lose'].apply(lambda x: 1 if x=='Winner' else 0)
output1['Total Shootouts'] = 1
output1 = output1.groupby(['Team'], as_index=False).agg({'Shootouts':'sum','Total Shootouts':'sum'})
output1 = output1[output1['Shootouts']>0]
output1['Shootouts Win %'] = round(output1['Shootouts']*100/output1['Total Shootouts'])
output1['Win % Rank'] = output1['Shootouts Win %'].rank(method='dense', ascending=False).astype(int)
output1 = output1[['Win % Rank','Shootouts Win %','Total Shootouts','Shootouts','Team']].sort_values(['Win % Rank','Shootouts'], ascending=[True,False])

In [34]:
output1.head()

Unnamed: 0,Win % Rank,Shootouts Win %,Total Shootouts,Shootouts,Team
9,1,100.0,2,2,Czechoslovakia
1,1,100.0,1,1,Belgium
3,1,100.0,1,1,Bulgaria
8,1,100.0,1,1,Czech Republic
20,1,100.0,1,1,Paraguay


In [37]:
#Rank the countries on the following: 
#Penalties scored %
output2 = df1[['Winner','Loser','Winner Scored','Total Winner Penalties','Loser Scored','Total Loser Penalties']].copy()
output2 = output2.melt(id_vars=[c for c in output2.columns if re.search('\s\w',c)], value_name='Team',var_name='Win Lose')
output2['Penalties Scored'] = output2.apply(lambda x: x['Winner Scored'] if x['Win Lose']=='Winner' else x['Loser Scored'], axis=1)
output2['Total Penalties'] =  output2.apply(lambda x: x['Total Winner Penalties'] if x['Win Lose']=='Winner' else x['Total Loser Penalties'], axis=1)
output2 = output2.groupby(['Team'], as_index=False).agg({'Penalties Scored':'sum','Total Penalties':'sum'})
output2['Penalties Missed'] = output2['Total Penalties']-output2['Penalties Scored']
output2['% Total Penalties Scored'] = round(output2['Penalties Scored']*100/output2['Total Penalties'])
output2['Penalties Scored % Rank'] = output2['Penalties Scored'].rank(method='dense', ascending=False).astype(int)
output2 = output2[['Penalties Scored % Rank','% Total Penalties Scored','Penalties Missed','Penalties Scored','Team']].sort_values(['Penalties Scored % Rank','Penalties Scored'], ascending=[True,False])

In [38]:
output2.head()

Unnamed: 0,Penalties Scored % Rank,% Total Penalties Scored,Penalties Missed,Penalties Scored,Team
16,1,69.0,19,42,Italy
27,2,70.0,14,33,Spain
13,3,86.0,5,32,Germany
11,4,64.0,16,29,England
12,4,81.0,7,29,France


In [40]:
#What is the most and least successful time to take a penalty? (What penalty number are you most likely to score or miss?)
output3 = df1[['Penalty Number','Winner Scored','Total Winner Penalties', 'Loser Scored','Total Loser Penalties']].copy()
output3['Total Penalties'] = output3['Total Winner Penalties']+ output3['Total Loser Penalties']
output3['Penalties Scored'] = output3['Winner Scored']+ output3['Loser Scored']
output3 = output3.groupby(['Penalty Number'], as_index=False).agg({'Total Penalties':'sum','Penalties Scored':'sum'})
output3['Penalties Missed'] = output3['Total Penalties']-output3['Penalties Scored']
output3['Penalties Scored %'] = round(output3['Total Penalties']*100/output3['Total Penalties'])
output3['Rank'] = output3['Penalties Scored %'].rank(method='dense', ascending=False).astype(int)
output3 = output3[['Rank','Penalties Scored %','Penalties Missed','Penalties Scored','Total Penalties','Penalty Number']].sort_values(['Rank','Penalty Number'], ascending=[True,False])

In [41]:
output3.head()

Unnamed: 0,Rank,Penalties Scored %,Penalties Missed,Penalties Scored,Total Penalties,Penalty Number
8,1,100.0,2,2,4,9
7,1,100.0,0,4,4,8
6,1,100.0,1,5,6,7
5,1,100.0,5,11,16,6
4,1,100.0,19,49,68,5


In [12]:
output.to_csv('wk27-output.csv', index=False)