<a href="https://colab.research.google.com/github/OptimalDecisions/sports-analytics-foundations/blob/main/sa-examples/SA_4_4_Reading_Files_with_2_Header_Rows.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

  ## Pandas Full Example of Data Preparation



  # Reading Files with 2 Header Rows

  <img src = "../img/sa_logo.png" width="100" align="left">

  Ram Narasimhan

  <br><br><br>



## NFL Coach Standings



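Concepts covered in this notebook:

1. Reading a CSV file whose header spans two rows
2. Combining the category row and the column-name row into single column names
3. Cleaning up the combined names
4. Adding a `Season` column and reordering columns
5. Validating the data by inspecting the result at each step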


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


In [None]:
curr_season = '2000'

url = f'https://raw.githubusercontent.com/keshavmp/Sports_Analytics_Project/main/{curr_season}_Coach_Standings.csv'
df = pd.read_csv(url)

In [None]:
df.head(2)

,Unnamed: 0,Unnamed: 1,2000 Season,Unnamed: 3,Unnamed: 4,Unnamed: 5,w/ Team,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Playoffs,Unnamed: 15,Unnamed: 16,w/ Team.1,Unnamed: 18,Unnamed: 19,Career.1,Unnamed: 21,Unnamed: 22,Unnamed: 23
0,Coach,Tm,G,W,L,T,G,W,L,T,...,G plyf,W plyf,L plyf,G plyf,W plyf,L plyf,G plyf,W plyf,L plyf,Remark
1,Vince Tobin,ARI,7,2,5,0,71,28,43,0,...,0,0,0,2,1,1,2,1,1,Fired after week 8


In [None]:

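# regex=False makes every pattern a literal string; without it, '.' may be
# read as a regular expression (any character) depending on the pandas version.
df.columns = df.columns.str.replace('w/ ', 'with_', regex=False)
df.columns = df.columns.str.replace('Team', 'Team_', regex=False)
df.columns = df.columns.str.replace('.', '-', regex=False)
df.columns = df.columns.str.replace(' ', '_', regex=False)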




In [None]:
column_categories = df.columns

In [None]:
df = pd.read_csv(url, header=1)


In [None]:
df.head(1)

,Coach,Tm,G,W,L,T,G.1,W.1,L.1,T.1,...,G plyf,W plyf,L plyf,G plyf.1,W plyf.1,L plyf.1,G plyf.2,W plyf.2,L plyf.2,Remark
0,Vince Tobin,ARI,7,2,5,0,71,28,43,0,...,0,0,0,2,1,1,2,1,1,Fired after week 8
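
As an aside, pandas can also read both header rows in a single call by passing `header=[0, 1]`, which gives the columns a two-level `MultiIndex`. The sketch below is not what this notebook does; it uses a tiny inline CSV (`toy_csv`, a hypothetical stand-in for the coach standings file, trimmed to six columns) just to show the shape of the result. Blank cells in the first header row still come back as `Unnamed: ...` placeholders, so the category names would still have to be carried forward by hand.

```python
import io

import pandas as pd

# Hypothetical, trimmed-down stand-in for the real file: two header rows, one data row.
toy_csv = io.StringIO(
    ",,2000 Season,,w/ Team,\n"
    "Coach,Tm,G,W,G,W\n"
    "Vince Tobin,ARI,7,2,71,28\n"
)

# header=[0, 1] tells pandas that the first two rows together form the header.
toy_df = pd.read_csv(toy_csv, header=[0, 1])

print(toy_df.columns.nlevels)                       # number of header levels
print(toy_df.columns.get_level_values(0).tolist())  # category row (with placeholders)
print(toy_df.columns.get_level_values(1).tolist())  # column-name row
```

Reading the file twice, as done above, keeps both rows as plain strings, which makes the renaming steps that follow easier to reason about.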


In [None]:
df.rename(columns={'Tm':'Team'}, inplace=True)

The following is a key step:

1. We have the broad categories in a variable called `column_categories` (the first row of the CSV file).
2. We have the actual column names in `df.columns` now (the second row of the CSV file).
3. In the following step, we prepend the relevant category to each column name.

The result is not clean yet, but it gets us close to where we want to be.

In [None]:
current_cat = ""
new_column_names = []
for idx, cat in enumerate(column_categories):
    if "Unnamed" not in cat:  # we have found a new category
        print(idx, cat)
        current_cat = cat + "_"
    if "Season" in cat:
        # We don't want the year in the name. We will make a separate column for it.
        current_cat = 'Season_'
    # Every column keeps the most recent category until a new one appears.
    new_column_names.append(current_cat + df.columns[idx])

2 2000_Season
6 with_Team_
10 Career
14 Playoffs
17 with_Team_-1
20 Career-1
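
The loop above is a forward fill: each category applies to its own column and to every `Unnamed` column after it, until the next category shows up. Here is a minimal sketch of the same idea on toy lists. The function name `prepend_categories` and the toy inputs are made up for illustration, and the special handling of the season year is left out.

```python
def prepend_categories(categories, names):
    """Prefix each name with the most recent category that is not 'Unnamed'."""
    current = ""
    combined = []
    for cat, name in zip(categories, names):
        if "Unnamed" not in cat:  # a new category starts here
            current = cat + "_"
        combined.append(current + name)
    return combined


# Toy inputs, shaped like the two header rows of the file.
toy_categories = ["Unnamed: 0", "Unnamed: 1", "Season", "Unnamed: 3", "Career", "Unnamed: 5"]
toy_names = ["Coach", "Team", "G", "W", "G.1", "W.1"]

toy_result = prepend_categories(toy_categories, toy_names)
print(toy_result)
```

Columns that come before the first category (here `Coach` and `Team`) keep their names unchanged, because `current` is still an empty string at that point.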


Let's print out the result and examine the new names.

In [None]:
new_column_names

['Coach',
 'Team',
 'Season_G',
 'Season_W',
 'Season_L',
 'Season_T',
 'with_Team__G.1',
 'with_Team__W.1',
 'with_Team__L.1',
 'with_Team__T.1',
 'Career_G.2',
 'Career_W.2',
 'Career_L.2',
 'Career_T.2',
 'Playoffs_G plyf',
 'Playoffs_W plyf',
 'Playoffs_L plyf',
 'with_Team_-1_G plyf.1',
 'with_Team_-1_W plyf.1',
 'with_Team_-1_L plyf.1',
 'Career-1_G plyf.2',
 'Career-1_W plyf.2',
 'Career-1_L plyf.2',
 'Career-1_Remark']

Replace `.` with `-` and spaces with `_`

In [None]:
df.columns = new_column_names
df.columns = df.columns.str.replace('.','-', regex=False)
df.columns = df.columns.str.replace(' ','_', regex=False)

In [None]:
df.columns

Index(['Coach', 'Team', 'Season_G', 'Season_W', 'Season_L', 'Season_T',
       'with_Team__G-1', 'with_Team__W-1', 'with_Team__L-1', 'with_Team__T-1',
       'Career_G-2', 'Career_W-2', 'Career_L-2', 'Career_T-2',
       'Playoffs_G_plyf', 'Playoffs_W_plyf', 'Playoffs_L_plyf',
       'with_Team_-1_G_plyf-1', 'with_Team_-1_W_plyf-1',
       'with_Team_-1_L_plyf-1', 'Career-1_G_plyf-2', 'Career-1_W_plyf-2',
       'Career-1_L_plyf-2', 'Career-1_Remark'],
      dtype='object')
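
Why pass `regex=False`? In a regular expression, `.` means "any character", so a regex replacement of `.` can wipe out the whole name. The sketch below shows the difference with Python's own `re` module and plain `str.replace`, then applies the literal version to a small toy `Index` (the two names are taken from the list above).

```python
import re

import pandas as pd

name = "Career_G.2"

# As a regular expression, "." matches every character in the name.
as_regex = re.sub(".", "-", name)

# As a literal string, "." matches only the dot itself.
as_literal = name.replace(".", "-")

print(as_regex)
print(as_literal)

# The same literal replacement on a pandas Index, as used in this notebook.
toy_index = pd.Index(["Career_G.2", "Playoffs_G plyf"])
cleaned = toy_index.str.replace(".", "-", regex=False).str.replace(" ", "_", regex=False)
print(cleaned.tolist())
```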

Notice that the "Remark" column has picked up a category prefix (`Career-1_`). We don't want that, so we rename it to simply read "Remark".

In [None]:
remark_columns = df.filter(like='Remark')

if not remark_columns.empty:
    # Get the first matching column and rename it
    column_to_rename = remark_columns.columns[0]
    df.rename(columns={column_to_rename: 'Remark'}, inplace=True)
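
An alternative sketch, shown on a toy frame: `rename` also accepts a function, which is applied to every column name. This version assumes exactly one column ends in `_Remark`; if several did, they would all be renamed to `Remark` and collide.

```python
import pandas as pd

# Toy frame with a prefixed remark column, as in the notebook.
toy_df = pd.DataFrame(
    {"Coach": ["Vince Tobin"], "Career-1_Remark": ["Fired after week 8"]}
)

# Strip whatever prefix the remark column picked up; leave other names alone.
toy_df = toy_df.rename(
    columns=lambda name: "Remark" if name.endswith("_Remark") else name
)
print(toy_df.columns.tolist())
```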

The `_plyf` suffix is a bit clumsy. Let's be explicit and rename all those columns to start with `Playoff_` instead.

In [None]:
for column in df.columns:
    if '_plyf' in column:
        new_column = 'Playoff_' + column
        df.rename(columns={column: new_column}, inplace=True)

# regex=False treats each pattern as a literal string
df.columns = df.columns.str.replace("_plyf", "", regex=False)  # drop the _plyf
df.columns = df.columns.str.replace("Playoff_Playoffs", "Playoff_", regex=False)  # avoid a doubled prefix
df.columns = df.columns.str.replace("__", "_", regex=False)  # collapse double underscores into one

In [None]:
df.columns

Index(['Coach', 'Team', 'Season_G', 'Season_W', 'Season_L', 'Season_T',
       'with_Team_G-1', 'with_Team_W-1', 'with_Team_L-1', 'with_Team_T-1',
       'Career_G-2', 'Career_W-2', 'Career_L-2', 'Career_T-2', 'Playoff_G',
       'Playoff_W', 'Playoff_L', 'Playoff_with_Team_-1_G-1',
       'Playoff_with_Team_-1_W-1', 'Playoff_with_Team_-1_L-1',
       'Playoff_Career-1_G-2', 'Playoff_Career-1_W-2', 'Playoff_Career-1_L-2',
       'Remark'],
      dtype='object')
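
The same clean-up can be sketched as one small function that handles a single name, which makes the steps easy to test one name at a time. The function name `tidy_playoff_name` is made up for this illustration, and the toy frame reuses a few column names from the earlier output.

```python
import pandas as pd


def tidy_playoff_name(name):
    """Move the playoff marker to the front and collapse double underscores."""
    if "_plyf" in name:
        name = "Playoff_" + name.replace("_plyf", "")
        name = name.replace("Playoff_Playoffs", "Playoff_")  # avoid a doubled prefix
    return name.replace("__", "_")


# Empty toy frame that only carries column names.
toy_df = pd.DataFrame(
    columns=["Coach", "with_Team__G-1", "Playoffs_G_plyf", "Career-1_G_plyf-2"]
)
toy_df = toy_df.rename(columns=tidy_playoff_name)
print(toy_df.columns.tolist())
```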

### New Column showing Current Season

Rule of thumb: Our code should work for any season we read, from 2000 to 2022.

Thus we should not have any column name that contains the year (for example "2000_Season"). If one df has "2000" in a column name and another df has "2001", we cannot merge them easily.

Trick: We know which season we are working with. It is in a variable we called `curr_season` at the very beginning. So let's create a new column and store the current season in it.


In [None]:
df['Season'] = curr_season
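
To see why this helps, here is a small sketch with two toy frames that share the same column names. The 2001 row is a placeholder invented for the illustration, not real data. Because the year lives in the `Season` column rather than in the column names, the frames stack cleanly with `pd.concat`.

```python
import pandas as pd

# Toy frames with identical, year-free column names.
toy_2000 = pd.DataFrame({"Coach": ["Vince Tobin"], "Team": ["ARI"], "Season_G": [7]})
toy_2001 = pd.DataFrame({"Coach": ["Coach X"], "Team": ["XXX"], "Season_G": [0]})  # placeholder row

frames = []
for season, frame in [("2000", toy_2000), ("2001", toy_2001)]:
    frame = frame.copy()      # leave the original toy frames untouched
    frame["Season"] = season  # record which season each row came from
    frames.append(frame)

combined = pd.concat(frames, ignore_index=True)
print(combined)
```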

### Reorder Columns

Final cleanup step: It would be good to keep the Season right after the Coach's name and team.

Here's how we'd move the last column to be the third column. We perform some careful "column name surgery" here: we keep the first two names, graft the last name into the third slot, and then append the remaining names in their original order.

In [None]:
cols = df.columns.tolist()
new_order = cols[:2] + [cols[-1]] + cols[2:-1]
df = df[new_order]
df

,Coach,Team,Season,Season_G,Season_W,Season_L,Season_T,with_Team_G-1,with_Team_W-1,with_Team_L-1,...,Playoff_G,Playoff_W,Playoff_L,Playoff_with_Team_-1_G-1,Playoff_with_Team_-1_W-1,Playoff_with_Team_-1_L-1,Playoff_Career-1_G-2,Playoff_Career-1_W-2,Playoff_Career-1_L-2,Remark
0,Vince Tobin,ARI,2000,7,2,5,0,71,28,43,...,0,0,0,2,1,1,2,1,1,Fired after week 8
1,Dave McGinnis,ARI,2000,9,1,8,0,57,17,40,...,0,0,0,0,0,0,0,0,0,Coach starting week 9
2,Dan Reeves,ATL,2000,16,4,12,0,109,49,59,...,0,0,0,5,3,2,20,11,9,
3,Brian Billick,BAL,2000,16,12,4,0,144,80,64,...,4,4,0,8,5,3,8,5,3,Super Bowl Champions
4,Wade Phillips,BUF,2000,16,8,8,0,48,29,19,...,0,0,0,2,0,2,6,1,5,
5,George Seifert,CAR,2000,16,7,9,0,48,16,32,...,0,0,0,0,0,0,15,10,5,
6,Dick Jauron,CHI,2000,16,5,11,0,80,35,45,...,0,0,0,1,0,1,1,0,1,
7,Bruce Coslet,CIN,2000,3,0,3,0,60,21,39,...,0,0,0,0,0,0,1,0,1,Fired after week 4
8,Dick LeBeau,CIN,2000,13,4,9,0,45,12,33,...,0,0,0,0,0,0,0,0,0,Coach starting week 5
9,Chris Palmer,CLE,2000,16,3,13,0,32,5,27,...,0,0,0,0,0,0,0,0,0,
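
Another way to get the same column order, sketched on a toy frame: `DataFrame.pop` removes a column and returns it, and `DataFrame.insert` puts it back at a chosen position. Both calls modify the frame in place, unlike the list slicing above, which builds a new frame.

```python
import pandas as pd

# Toy frame with Season as the last column, as in the notebook.
toy_df = pd.DataFrame(
    {
        "Coach": ["Vince Tobin"],
        "Team": ["ARI"],
        "Season_G": [7],
        "Remark": ["Fired after week 8"],
        "Season": ["2000"],
    }
)

season_values = toy_df.pop("Season")       # remove the column and keep its values
toy_df.insert(2, "Season", season_values)  # position 2 is the third slot
print(toy_df.columns.tolist())
```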


In the next notebook, [4.5 Handling Multiple dfs](SA_4_5_Concatenating_Multiple_Dataframes.ipynb), we will read all the files in one go and concatenate them into one very long dataframe.