### Prepping Data Challenge: Comparing Prize Money for Professional Golfers (week 6)


#### Requirements:

 1. Input the data
 2. Answer these questions:
    - What's the Total Prize Money earned by players for each tour? 
    - How many players are in this dataset for each tour?
    - How many events in total did players participate in for each tour?
    - How much do players win per event? What's the average of this for each tour? 
    - How do players rank by prize money for each tour? What about overall? What is the average difference between where they are ranked within their tour compared to the overall rankings where both tours are combined? 
    
        - Here we would like the difference to be positive, since combining the tours should push a player's rank number up (a lower position in the combined list)
        
 3. Combine the answers to these questions into one dataset 
 4. Pivot the data so that we have a column for each tour, with each row representing an answer to the above questions 
 5. Clean up the Measure field and create a new column showing the difference between the tours for each measure
    - We're looking at the LPGA value minus the PGA value, so in most instances this number will be negative
 6. Output the data

### 1. Input the data 

In [1]:
#import libraries
import pandas as pd

In [2]:
df = pd.read_excel('WK6-Official Money.xlsx')

df.head()

|   | PLAYER NAME | MONEY | EVENTS | TOUR |
|---|---|---|---|---|
| 0 | Brooks Koepka | 9684006 | 21 | PGA |
| 1 | Rory McIlroy | 7785286 | 19 | PGA |
| 2 | Matt Kuchar | 6294690 | 22 | PGA |
| 3 | Patrick Cantlay | 6121488 | 21 | PGA |
| 4 | Gary Woodland | 5690965 | 24 | PGA |
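
Before aggregating, it can help to confirm that the loaded frame has the expected shape. The sketch below is not part of the original notebook: `check_input` is a hypothetical helper, the three sample rows are copied from the `head()` output above, and the specific checks (required columns, no nulls, positive `EVENTS`) are assumptions about what matters for the later steps.

```python
import pandas as pd

# Stand-in for the loaded workbook: the first three rows shown by df.head().
sample = pd.DataFrame({
    "PLAYER NAME": ["Brooks Koepka", "Rory McIlroy", "Matt Kuchar"],
    "MONEY": [9684006, 7785286, 6294690],
    "EVENTS": [21, 19, 22],
    "TOUR": ["PGA", "PGA", "PGA"],
})


def check_input(frame: pd.DataFrame) -> None:
    """Hypothetical helper: raise ValueError if the frame is unsafe to aggregate."""
    required = ["PLAYER NAME", "MONEY", "EVENTS", "TOUR"]
    missing = [col for col in required if col not in frame.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if frame[required].isna().any().any():
        raise ValueError("null values found in required columns")
    # MONEY is later divided by EVENTS, so zero or negative counts would break that step.
    if (frame["EVENTS"] <= 0).any():
        raise ValueError("EVENTS must be positive")


check_input(sample)  # passes silently when the frame looks as expected
```

In the notebook itself, the same call would be `check_input(df)` right after `pd.read_excel`.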


### 2. Answer these questions:
   - What's the Total Prize Money earned by players for each tour? 
   - How many players are in this dataset for each tour?
   - How many events in total did players participate in for each tour?
   - How much do players win per event? What's the average of this for each tour?
   - How do players rank by prize money for each tour? What about overall? What is the average difference between where they are ranked within their tour compared to the overall rankings where both tours are combined? 
   
      - Here we would like the difference to be positive, since combining the tours should push a player's rank number up (a lower position in the combined list)

In [3]:
#What's the Total Prize Money earned by players for each tour?
q1 = df.groupby(['TOUR']).agg(Total_Prize_Money = ('MONEY','sum')).reset_index()

In [4]:
#How many players are in this dataset for each tour?
q2 = df.groupby(['TOUR']).agg(Number_of_Players = ('PLAYER NAME','count')).reset_index()

In [5]:
#How many events in total did players participate in for each tour?
q3 = df.groupby(['TOUR']).agg(Number_of_Event = ('EVENTS','sum')).reset_index()

In [6]:
#How much do players win per event? What's the average of this for each tour?
#Money per event for each player, then the mean of those per-player values for each tour
df['MONEY PER EVENT'] = df['MONEY']/df['EVENTS']
q4 = df.groupby(['TOUR']).agg(Avg_money_per_event = ('MONEY PER EVENT','mean')).reset_index()
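
The four aggregations above share the same grouping key, so they could also be computed in a single `groupby(...).agg(...)` call with named aggregation. The sketch below is one possible alternative, not the notebook's method. The four-row `sample` frame and the `MONEY PER EVENT` column name are made up for illustration. It also shows that the mean of per-player ratios is not the same number as total money divided by total events.

```python
import pandas as pd

# Made-up data: two players per tour, chosen so the arithmetic is easy to follow.
sample = pd.DataFrame({
    "PLAYER NAME": ["A", "B", "C", "D"],
    "MONEY": [1000, 3000, 200, 600],
    "EVENTS": [10, 20, 4, 6],
    "TOUR": ["PGA", "PGA", "LPGA", "LPGA"],
})
sample["MONEY PER EVENT"] = sample["MONEY"] / sample["EVENTS"]

# One pass over the groups; each keyword becomes an output column.
summary = sample.groupby("TOUR").agg(
    Total_Prize_Money=("MONEY", "sum"),
    Number_of_Players=("PLAYER NAME", "count"),
    Number_of_Event=("EVENTS", "sum"),
    Avg_money_per_event=("MONEY PER EVENT", "mean"),
)

# Pooled alternative: total money over total events. For PGA this is 4000 / 30,
# while the per-player mean above is (100 + 150) / 2 = 125.
pooled = summary["Total_Prize_Money"] / summary["Number_of_Event"]

print(summary)
print(pooled)
```

The notebook uses the per-player mean, which weights every player equally regardless of how many events they entered.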

In [7]:
#How do players rank by prize money for each tour? What about overall? 
#What is the average difference between where they are ranked within their tour compared to 
#the overall rankings where both tours are combined? 
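#Rank 1 = highest earner. TM_RANK is the rank within the player's own tour
df['TM_RANK'] = df.groupby(['TOUR'])['MONEY'].rank(ascending=False)
df['OVERALL_RANK'] = df['MONEY'].rank(ascending=False)
#Overall rank minus tour rank, so the difference is never negative
df['DIFF EARNING RANK'] = df['OVERALL_RANK'] - df['TM_RANK']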

q5 = df.groupby(['TOUR']).agg(Avg_difference_in_rank = ('DIFF EARNING RANK','mean')).reset_index()
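
To see why the rank difference comes out positive, here is the same calculation on a made-up four-player dataset. The names, amounts, and the `TOUR_RANK` and `RANK_DIFF` column names are illustrative only. Note that `rank` defaults to `method='average'`, so tied amounts would share a fractional rank.

```python
import pandas as pd

# Made-up players: two per tour, with no ties in MONEY.
toy = pd.DataFrame({
    "PLAYER NAME": ["P1", "P2", "L1", "L2"],
    "MONEY": [900, 500, 700, 100],
    "TOUR": ["PGA", "PGA", "LPGA", "LPGA"],
})

# Rank within each tour and across both tours (1 = highest earner).
toy["TOUR_RANK"] = toy.groupby("TOUR")["MONEY"].rank(ascending=False)
toy["OVERALL_RANK"] = toy["MONEY"].rank(ascending=False)

# A player can only gain competitors when the tours are combined,
# so overall rank minus tour rank is zero or positive.
toy["RANK_DIFF"] = toy["OVERALL_RANK"] - toy["TOUR_RANK"]

avg_diff = toy.groupby("TOUR")["RANK_DIFF"].mean()
print(toy)
print(avg_diff)
```

Here P2 is second on the PGA list but third overall, because L1 slots in ahead of them, which gives a difference of 1.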

### 3. Combine the answers to these questions into one dataset

In [8]:
df_all = q1
df_all = df_all.merge(q2, on='TOUR', how='inner')
df_all = df_all.merge(q3, on='TOUR', how='inner')
df_all = df_all.merge(q4, on='TOUR', how='inner')
df_all = df_all.merge(q5, on='TOUR', how='inner')
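
The repeated merges could also be folded into one expression with `functools.reduce`. This is only a sketch of an alternative: the three small frames `a`, `b`, and `c` are made-up stand-ins for `q1` to `q5`, and `validate="one_to_one"` is an optional extra that makes pandas raise an error if any frame has more than one row per tour.

```python
from functools import reduce

import pandas as pd

# Made-up stand-ins for the per-question result frames.
a = pd.DataFrame({"TOUR": ["LPGA", "PGA"], "Total_Prize_Money": [800, 4000]})
b = pd.DataFrame({"TOUR": ["LPGA", "PGA"], "Number_of_Players": [2, 2]})
c = pd.DataFrame({"TOUR": ["LPGA", "PGA"], "Number_of_Event": [10, 30]})

# Merge the frames left to right on TOUR, checking that keys are unique on both sides.
combined = reduce(
    lambda left, right: left.merge(right, on="TOUR", how="inner", validate="one_to_one"),
    [a, b, c],
)
print(combined)
```

With the notebook's own frames, the list would be `[q1, q2, q3, q4, q5]`.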

### 4. Pivot the data so that we have a column for each tour, with each row representing an answer to the above questions 

In [9]:
df_all

|   | TOUR | Total_Prize_Money | Number_of_Players | Number_of_Event | Avg_money_per_event | Avg_difference_in_rank |
|---|---|---|---|---|---|---|
| 0 | LPGA | 58410411 | 100 | 2266 | 25525.30112 | 96.13 |
| 1 | PGA | 256726356 | 100 | 2282 | 120281.569273 | 3.87 |


In [10]:
df_all.set_index('TOUR',inplace=True)
df_all = df_all.T
df_all

| TOUR | LPGA | PGA |
|---|---|---|
| Total_Prize_Money | 58410410.0 | 256726400.0 |
| Number_of_Players | 100.0 | 100.0 |
| Number_of_Event | 2266.0 | 2282.0 |
| Avg_money_per_event | 25525.3 | 120281.6 |
| Avg_difference_in_rank | 96.13 | 3.87 |
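
Transposing works here because there is exactly one row per tour. An alternative that produces an explicit `Measure` column is to melt the wide frame and pivot it back with the tours as columns. The sketch below uses a made-up two-measure frame, and the `Measure` and `Value` names are assumptions chosen to match the wording of the challenge.

```python
import pandas as pd

# Made-up wide frame: one row per tour, one column per measure.
wide = pd.DataFrame({
    "TOUR": ["LPGA", "PGA"],
    "Total_Prize_Money": [800, 4000],
    "Number_of_Players": [2, 2],
})

# Long form: one row per (tour, measure) pair.
long = wide.melt(id_vars="TOUR", var_name="Measure", value_name="Value")

# Back to wide, this time with one column per tour and one row per measure.
pivoted = long.pivot(index="Measure", columns="TOUR", values="Value").reset_index()
pivoted.columns.name = None  # drop the leftover "TOUR" label on the column axis
print(pivoted)
```

One difference from the transpose: `pivot` sorts the measures alphabetically, so the row order may not match the order in which the questions were answered.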


### 5. Clean up the Measure field and create a new column showing the difference between the tours for each measure
   - We're looking at the LPGA value minus the PGA value, so in most instances this number will be negative

In [11]:
df_all['DIFF IN TOUR'] = df_all['LPGA'] - df_all['PGA']

In [12]:
df_all

| TOUR | LPGA | PGA | DIFF IN TOUR |
|---|---|---|---|
| Total_Prize_Money | 58410410.0 | 256726400.0 | -198315900.0 |
| Number_of_Players | 100.0 | 100.0 | 0.0 |
| Number_of_Event | 2266.0 | 2282.0 | -16.0 |
| Avg_money_per_event | 25525.3 | 120281.6 | -94756.27 |
| Avg_difference_in_rank | 96.13 | 3.87 | 92.26 |
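
The measure names above still carry the underscores from the aggregation step. One way the "clean up the Measure field" requirement could be handled is sketched below on a made-up two-row frame. Replacing underscores with spaces and naming the column `Measure` are assumptions about what "clean up" means here.

```python
import pandas as pd

# Made-up stand-in for the pivoted frame: measures in the index, one column per tour.
pivoted = pd.DataFrame(
    {"LPGA": [800.0, 2.0], "PGA": [4000.0, 2.0]},
    index=["Total_Prize_Money", "Number_of_Players"],
)
pivoted["DIFF IN TOUR"] = pivoted["LPGA"] - pivoted["PGA"]

# Swap underscores for spaces, label the index, and turn it into a regular column.
cleaned = (
    pivoted.rename(index=lambda name: name.replace("_", " "))
    .rename_axis("Measure")
    .reset_index()
)
cleaned.columns.name = None  # in the notebook this would clear the leftover "TOUR" label
print(cleaned)
```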


### 6. Output the data

In [13]:
df_all.to_csv('WK6-Official Money Output.csv')
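
A quick way to check the export is to read it back and compare a few values. The sketch below writes to an in-memory buffer instead of a file so that it runs anywhere. The small `result` frame is made up, and its index is named `Measure` on the assumption that the measure labels should get a header in the CSV.

```python
import io

import pandas as pd

# Made-up stand-in for the final frame, with the measure labels as a named index.
result = pd.DataFrame(
    {"LPGA": [800.0, 2.0], "PGA": [4000.0, 2.0], "DIFF IN TOUR": [-3200.0, 0.0]},
    index=pd.Index(["Total Prize Money", "Number of Players"], name="Measure"),
)

# Write to a text buffer, rewind, and read the CSV back with the first column as the index.
buffer = io.StringIO()
result.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(restored)
```

For the real file, the read-back would be `pd.read_csv('WK6-Official Money Output.csv', index_col=0)`.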