### Prepping Data Challenge: Games Night Viz Collab (week 47)

### Requirements
- Input the Data
- Add the player names to their poker events
- Create a column to count when the player finished 1st in an event
- Replace any nulls in prize_usd with zero
- Find the dates of the players' first and last events
- Use these dates to calculate the length of poker career in years (with decimals)
- Create an aggregated view to find the following player stats:
    - Number of events they've taken part in
    - Total prize money
    - Their biggest win
    - The percentage of events they've won
    - The distinct count of countries played in
    - Their length of career
- Reduce the data to name, number of events, total prize money, biggest win, percentage won, countries visited, career length

#### Creating a Pizza Plot / Coxcomb chart output:
- Use the player stats to create two pivot tables
   - a pivot of the raw values
   - a pivot of the values ranked from 1-100, with 100 representing the highest value
     - Note: we're using a ranking method that averages ties, so pay particular attention to countries visited!
- Join the pivots together
- Output the data
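
The tie-averaging note above is easiest to see on a tiny example. The sketch below uses made-up values (the player labels `A` to `D` and their counts are hypothetical, not taken from the data set) to show how `rank(method="average")` treats two players with the same number of countries visited.

```python
import pandas as pd

# Hypothetical toy values: players B and C tie on 3 countries visited.
countries_visited = pd.Series([1, 3, 3, 7], index=["A", "B", "C", "D"])

# method="average" gives tied values the mean of the rank positions they occupy.
# B and C would hold positions 2 and 3, so each receives 2.5.
ranks = countries_visited.rank(method="average", ascending=True)
print(ranks.to_dict())  # {'A': 1.0, 'B': 2.5, 'C': 2.5, 'D': 4.0}
```

Because countries visited is a small integer, ties are common there, which is why that metric produces many fractional ranks.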

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Input the data
with pd.ExcelFile(r"\Dataprep\2021\top_female_poker_players_and_events.xlsx") as xl:
    df1 = pd.read_excel(xl, 'top_100')
    df2 = pd.read_excel(xl, 'top_100_poker_events')

In [3]:
# Preview a random sample of the top_100 player sheet (one row per player)
df1.sample(5)

Unnamed: 0,position,country,name,all_time_money_usd,player_url,player_id,source,last_updated
29,30th,England,Lucy Rokach,1329114,https://pokerdb.thehendonmob.com/player.php?a=...,268,https://pokerdb.thehendonmob.com/ranking/137/,2021-10-19
48,49th,United States,Melissa Hayden,888824,https://pokerdb.thehendonmob.com/player.php?a=...,113,https://pokerdb.thehendonmob.com/ranking/137/,2021-10-19
61,62nd,United States,Jamie Kerstetter,691812,https://pokerdb.thehendonmob.com/player.php?a=...,124549,https://pokerdb.thehendonmob.com/ranking/137/,2021-10-19
38,39th,United States,Jessica Dawley,1094697,https://pokerdb.thehendonmob.com/player.php?a=...,82653,https://pokerdb.thehendonmob.com/ranking/137/,2021-10-19
27,28th,United States,Cyndy Violette,1407044,https://pokerdb.thehendonmob.com/player.php?a=...,14506,https://pokerdb.thehendonmob.com/ranking/137/,2021-10-19


In [4]:
df2.sample(5)

Unnamed: 0,event_date,event_country,event_name,player_place,prize_usd,player_id,source,last_updated
239,2009-11-17,United States,$ 500 + 60 No Limit Hold'em2009 United States ...,9th,1191.0,154,https://pokerdb.thehendonmob.com/player.php?a=...,2021-10-19
2598,2012-06-30,United States,"$ 1,500 No Limit Hold'em (Event #53)43rd World...",185th,3932.0,145000,https://pokerdb.thehendonmob.com/player.php?a=...,2021-10-19
760,2018-04-07,Spain,"€ 1,000 + 100 No Limit Hold'em - Open #1partyp...",128th,4902.0,39790,https://pokerdb.thehendonmob.com/player.php?a=...,2021-10-19
5218,2012-04-17,United States,$ 350 No Limit Hold'emSeminole Hard Rock Showd...,4th,1320.0,73227,https://pokerdb.thehendonmob.com/player.php?a=...,2021-10-19
456,1997-03-02,International,"$ 100 + 20 Omaha Hi/LoCard Player Cruises, Cruise",7th,205.0,154,https://pokerdb.thehendonmob.com/player.php?a=...,2021-10-19


In [5]:
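# Add the player names to their poker events.
# Both sheets carry source and last_updated, so the merge suffixes them _x/_y;
# keep the player-sheet copies under their original names.
df = (
    df1.merge(df2, on='player_id', how='left')
    .rename(columns={'source_x': 'source', 'last_updated_x': 'last_updated'})
    .drop(columns=['source_y', 'last_updated_y'])
)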
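
A join like this can silently duplicate rows if `player_id` is not unique on the player side. As a minimal sketch, assuming two small stand-in frames (`players` and `events` are hypothetical, not the real sheets), the `validate` and `indicator` parameters of `DataFrame.merge` can be used to check the join:

```python
import pandas as pd

# Hypothetical miniature versions of the two sheets.
players = pd.DataFrame({"player_id": [1, 2], "name": ["Ann", "Bea"]})
events = pd.DataFrame({"player_id": [1, 1, 2], "prize_usd": [100.0, None, 50.0]})

# validate="one_to_many" raises a MergeError if player_id repeats in `players`.
# indicator=True adds a `_merge` column recording where each row was found.
checked = players.merge(
    events, on="player_id", how="left", validate="one_to_many", indicator=True
)

# Players with no events at all would show up here as "left_only".
unmatched = checked[checked["_merge"] == "left_only"]
print(len(checked), len(unmatched))
```

This is only a sanity check; the challenge itself does not require it.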

In [6]:
df.head()

Unnamed: 0,position,country,name,all_time_money_usd,player_url,player_id,source,last_updated,event_date,event_country,event_name,player_place,prize_usd
0,1st,United States,Vanessa Selbst,11906247,https://pokerdb.thehendonmob.com/player.php?a=...,68149,https://pokerdb.thehendonmob.com/ranking/137/,2021-10-19,2020-02-21,Canada,"C$ 4,700 + 300 No Limit Hold'em - WPT Main Eve...",22nd,14915.0
1,1st,United States,Vanessa Selbst,11906247,https://pokerdb.thehendonmob.com/player.php?a=...,68149,https://pokerdb.thehendonmob.com/ranking/137/,2021-10-19,2019-09-15,United States,"$ 3,300 + 200 No Limit Hold'em - WPT Borgata P...",14th,39950.0
2,1st,United States,Vanessa Selbst,11906247,https://pokerdb.thehendonmob.com/player.php?a=...,68149,https://pokerdb.thehendonmob.com/ranking/137/,2021-10-19,2017-07-07,United States,"$ 1,000 No Limit Hold'em - Ladies Championship...",56th,2040.0
3,1st,United States,Vanessa Selbst,11906247,https://pokerdb.thehendonmob.com/player.php?a=...,68149,https://pokerdb.thehendonmob.com/ranking/137/,2021-10-19,2017-06-14,United States,"$ 3,000 No Limit Hold'em - 6 Handed (Event #27...",56th,6191.0
4,1st,United States,Vanessa Selbst,11906247,https://pokerdb.thehendonmob.com/player.php?a=...,68149,https://pokerdb.thehendonmob.com/ranking/137/,2021-10-19,2016-12-13,Czech Republic,"€ 5,000 + 300 No Limit Hold'em - EPT Main Even...",211th,8001.0


In [7]:
#Create a column to count when the player finished 1st in an event
df['1st_place'] = np.where(df['player_place'] == '1st', 1, 0)


In [8]:
# Replace any nulls in prize_usd with zero, then check the remaining null counts
df['prize_usd'] = df['prize_usd'].fillna(0)
print(df.isna().sum())

position              0
country               0
name                  0
all_time_money_usd    0
player_url            0
player_id             0
source                0
last_updated          0
event_date            0
event_country         0
event_name            0
player_place          1
prize_usd             0
1st_place             0
dtype: int64


In [9]:
#Find the dates of the players first and last events
df['first_event'] = df.groupby('player_id')['event_date'].transform('min')
df['last_event'] = df.groupby('player_id')['event_date'].transform('max')

In [10]:
#Use these dates to calculate the length of poker career in years (with decimals)
df['career_length'] = (df['last_event'] - df['first_event']).dt.days / 365.25
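
To make the year conversion concrete, here is a small worked sketch on invented dates (the `events` frame below is hypothetical). It aggregates once per player instead of using `transform`, which is an alternative to the approach above rather than a replacement for it.

```python
import pandas as pd

# Hypothetical events for two players.
events = pd.DataFrame({
    "player_id": [1, 1, 2, 2],
    "event_date": pd.to_datetime(
        ["2010-01-01", "2020-01-01", "2015-06-01", "2016-06-01"]
    ),
})

# First and last event per player, one row per player.
span = events.groupby("player_id")["event_date"].agg(["min", "max"])

# Dividing the day count by 365.25 approximates years while allowing for leap years.
span["career_length"] = (span["max"] - span["min"]).dt.days / 365.25
print(span)
```

Player 1 spans 3652 days (about 9.9986 years) and player 2 spans 366 days (about 1.0021 years), so whole calendar years do not come out as exact integers under this convention.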

In [11]:
#- Create an aggregated view to find the following player stats:
#  - Number of events they've taken part in
#  - Total prize money
#  - Their biggest win 
#  - The percentage of events they've won
#  - The distinct count of the country played in
#  - Their length of career
#Reduce the data to name, number of events, total prize money, biggest win, percentage won, countries visited, career length
player_stats = df.groupby('name').agg(
    number_of_events=('player_id', 'count'),
    total_prize_money=('prize_usd', 'sum'),
    biggest_win=('prize_usd', 'max'),
    percentage_won=('1st_place', lambda x: (x.sum() / x.count()) * 100),
    countries_visited=('event_country', 'nunique'),
    career_length=('career_length', 'max')
).reset_index()
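
Since `1st_place` is a 0/1 flag, its mean is the share of events won, so the lambda above should be equivalent to taking the mean and multiplying by 100. A minimal sketch on invented data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical flags: Ann wins 1 of 4 events, Bea wins 2 of 2.
toy = pd.DataFrame({
    "name": ["Ann", "Ann", "Ann", "Ann", "Bea", "Bea"],
    "1st_place": [1, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the fraction of rows equal to 1.
stats = toy.groupby("name").agg(win_rate=("1st_place", "mean"))
stats["percentage_won"] = stats["win_rate"] * 100
print(stats["percentage_won"].to_dict())  # {'Ann': 25.0, 'Bea': 100.0}
```

The built-in `"mean"` avoids a Python-level lambda, which usually matters little at this data size but keeps the aggregation declarative.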

In [12]:
player_stats.head()

Unnamed: 0,name,number_of_events,total_prize_money,biggest_win,percentage_won,countries_visited,career_length
0,Abbey Daniels,124,713438.0,286819.0,4.83871,1,10.652977
1,Allyn Jaffrey Shulman,135,1568871.0,603713.0,5.185185,7,19.983573
2,Amanda Musumeci,71,1103874.0,481643.0,4.225352,5,6.26694
3,Ana Marquez,87,1891934.0,445000.0,2.298851,15,11.227926
4,Angelina Rich,50,578802.0,304386.0,6.0,11,7.509925


In [13]:
# Pivot the stats to one row per player and metric (raw values)
Output = player_stats.melt(id_vars='name', var_name='metric', value_name='raw_value',
                           value_vars=['number_of_events', 'total_prize_money', 'biggest_win',
                                       'percentage_won', 'countries_visited', 'career_length'])
# Rank each metric across players, averaging ties (a 1-100 scale for the 100 players)
Output['scaled_value'] = Output.groupby('metric')['raw_value'].rank(method='average', ascending=True)
Output = Output.sort_values(by='name').reset_index(drop=True)
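
The requirements describe two pivots joined together, whereas the cell above ranks within a single melted table. The two routes should give the same result. As a sketch of the two-pivot route, assuming a small made-up `player_stats` frame (names and numbers are hypothetical):

```python
import pandas as pd

# Hypothetical player stats with a tie on number_of_events.
player_stats = pd.DataFrame({
    "name": ["Ann", "Bea", "Cat"],
    "number_of_events": [10, 20, 20],
    "biggest_win": [500.0, 300.0, 900.0],
})

# Pivot 1: raw values, one row per player and metric.
raw = player_stats.melt(id_vars="name", var_name="metric", value_name="raw_value")

# Pivot 2: rank every metric column (ties averaged), then reshape the same way.
ranked_wide = player_stats.set_index("name").rank(method="average").reset_index()
ranked = ranked_wide.melt(id_vars="name", var_name="metric", value_name="scaled_value")

# Join the two pivots on player and metric.
joined = raw.merge(ranked, on=["name", "metric"])
print(joined)
```

With three players the ranks run from 1 to 3; with the 100 players in the challenge data the same code would put them on the 1-100 scale.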

In [14]:
Output.head()

Unnamed: 0,name,metric,raw_value,scaled_value
0,Abbey Daniels,number_of_events,124.0,84.0
1,Abbey Daniels,biggest_win,286819.0,52.0
2,Abbey Daniels,total_prize_money,713438.0,36.0
3,Abbey Daniels,percentage_won,4.83871,47.0
4,Abbey Daniels,career_length,10.652977,34.0


In [15]:
#output the data
Output.to_csv('wk47-output.csv', index=False)