<a href="https://colab.research.google.com/github/OptimalDecisions/sports-analytics-foundations/blob/main/sa-getting-started/SA_3_0_Taking_Python_and_Pandas_for_a_Sports_Workout.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# prompt: import pandas, matplotlib and make the plotting inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline


In [None]:
url = 'https://raw.githubusercontent.com/OptimalDecisions/sports-analytics-foundations/main/data/nfl_2022_games.csv'
df = pd.read_csv(url)

Let's take a look at a few rows in our dataframe. `sample()` does just that: we pass the number of rows we want to see, 10 in this case. Note that the rows are selected at random, so each run of this command returns a different set of 10 rows.

In [None]:
df.sample(10)

Unnamed: 0,Week,Day,Date,Time,Winner/tie,Unnamed: 5,Loser/tie,Unnamed: 7,PtsW,PtsL,YdsW,TOW,YdsL,TOL
157,11,Sun,2022-11-20,1:00PM,Philadelphia Eagles,@,Indianapolis Colts,boxscore,17.0,16.0,314.0,2.0,284.0,1.0
82,6,Sun,2022-10-16,1:00PM,Cincinnati Bengals,@,New Orleans Saints,boxscore,30.0,26.0,348.0,1.0,399.0,0.0
166,12,Thu,2022-11-24,8:20PM,Minnesota Vikings,,New England Patriots,boxscore,33.0,26.0,358.0,1.0,409.0,0.0
40,3,Sun,2022-09-25,1:00PM,Tennessee Titans,,Las Vegas Raiders,boxscore,24.0,22.0,361.0,1.0,396.0,1.0
87,6,Sun,2022-10-16,1:00PM,New York Giants,,Baltimore Ravens,boxscore,24.0,20.0,238.0,1.0,406.0,2.0
243,17,Sun,2023-01-01,1:00PM,Detroit Lions,,Chicago Bears,boxscore,41.0,10.0,504.0,0.0,230.0,2.0
257,18,Sun,2023-01-08,1:00PM,Buffalo Bills,,New England Patriots,boxscore,35.0,23.0,327.0,3.0,341.0,3.0
22,2,Sun,2022-09-18,1:00PM,Tampa Bay Buccaneers,@,New Orleans Saints,boxscore,20.0,10.0,260.0,1.0,308.0,5.0
34,3,Sun,2022-09-25,1:00PM,Carolina Panthers,,New Orleans Saints,boxscore,22.0,14.0,293.0,0.0,426.0,3.0
185,13,Sun,2022-12-04,1:00PM,Washington Commanders,@,New York Giants,boxscore,20.0,20.0,411.0,1.0,316.0,1.0
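
Because `sample()` draws rows at random, the table above will differ from run to run. If a repeatable sample is wanted, the `random_state` argument can be fixed. A minimal sketch, assuming a small made-up frame (`toy` and its values are illustrative, not the NFL data):

```python
import pandas as pd

# Hypothetical stand-in for the games table; the values are made up.
toy = pd.DataFrame({"Week": range(1, 11), "PtsW": range(20, 30)})

# Fixing random_state makes the "random" draw repeatable across runs.
first = toy.sample(3, random_state=42)
second = toy.sample(3, random_state=42)

print(first.equals(second))  # both draws contain the same rows
```

Any integer works as the seed; 42 is an arbitrary choice.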


### Data Cleaning and reformatting

From the table above, two columns (`Unnamed: 5` and `Unnamed: 7`) should either be dropped or at least given better names. Before we drop anything, we should see what they contain.

A good first check on a column's usefulness is to look at its unique values. To do that we use the `unique()` method that comes with pandas.

In [None]:
df['Unnamed: 5'].unique(), df['Unnamed: 7'].unique()

(array(['@', nan, 'N'], dtype=object), array(['boxscore', nan], dtype=object))

From the results, it seems that `Unnamed: 5` might actually be useful. `Unnamed: 7` is not: apart from missing values, it only ever contains `boxscore`, and a column with a single value throughout generally carries no information.
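
The same check can be run on every column at once. The sketch below is one possible approach, assuming a hypothetical four-row frame (`toy`) whose values mirror what `unique()` returned above. It uses `nunique()` to flag constant columns and `value_counts(dropna=False)` to see how often each value, including missing ones, occurs.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-version of the two unnamed columns (made-up rows).
toy = pd.DataFrame({
    "Unnamed: 5": ["@", np.nan, "N", "@"],
    "Unnamed: 7": ["boxscore", "boxscore", np.nan, "boxscore"],
})

# nunique() ignores NaN by default, so a count of 1 flags a constant column.
distinct = toy.nunique()
constant_cols = distinct[distinct == 1].index.tolist()
print(constant_cols)

# value_counts(dropna=False) also reports how many rows are missing.
print(toy["Unnamed: 5"].value_counts(dropna=False))
```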

### Drop a column


In [None]:
# prompt: drop the column called "Unnamed: 7"

df.drop('Unnamed: 7', axis=1, inplace=True)


`inplace=True` tells pandas to modify `df` itself. Without it, `drop()` returns a new dataframe and leaves `df` unchanged.
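
To see what happens without `inplace=True`, here is a small sketch on a made-up two-row frame (`toy` and `trimmed` are illustrative names, not part of the notebook). `drop()` hands back a new dataframe, and the original keeps all its columns unless the result is assigned back.

```python
import pandas as pd

# Hypothetical two-row frame; the values are made up.
toy = pd.DataFrame({"Week": [1, 2], "Unnamed: 7": ["boxscore", "boxscore"]})

# Without inplace=True, drop() returns a new frame and leaves toy untouched.
trimmed = toy.drop(columns="Unnamed: 7")

print(list(toy.columns))      # the original still has both columns
print(list(trimmed.columns))  # the returned frame has the column removed
```

Writing `toy = toy.drop(columns="Unnamed: 7")` has the same end result as the `inplace=True` form used above.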

Let's examine the column called `Unnamed: 5`. The `@` symbol gives us a good clue: it tells us in which of the two teams' cities each game was held. We can infer that by default the game was held in the first team's city, *except when there is an @ in front of the second team.*

But, wait. The column also contains a strange value -- namely, `N`. What could that be? Let's examine that row and find out.

In [None]:
df[df['Unnamed: 5'] == 'N']

Unnamed: 0,Week,Day,Date,Time,Winner/tie,Unnamed: 5,Loser/tie,PtsW,PtsL,YdsW,TOW,YdsL,TOL
284,SuperBowl,Sun,2023-02-12,6:30PM,Kansas City Chiefs,N,Philadelphia Eagles,38.0,35.0,340.0,0.0,417.0,1.0


That explains it! The `N` marks the Super Bowl, which is played at a stadium chosen in advance rather than at either team's home field (so `N` presumably stands for a neutral site). Now, let's rename the column to something better, such as `location`. From the location, the venue can be inferred.

In [None]:
# prompt: rename the "Unnamed: 5" column to location

df.rename(columns={'Unnamed: 5': 'location'}, inplace=True)


In [None]:
df

Unnamed: 0,Week,Day,Date,Time,Winner/tie,location,Loser/tie,PtsW,PtsL,YdsW,TOW,YdsL,TOL
0,1,Thu,2022-09-08,8:20PM,Buffalo Bills,@,Los Angeles Rams,31.0,10.0,413.0,4.0,243.0,3.0
1,1,Sun,2022-09-11,1:00PM,New Orleans Saints,@,Atlanta Falcons,27.0,26.0,385.0,1.0,416.0,2.0
2,1,Sun,2022-09-11,1:00PM,Cleveland Browns,@,Carolina Panthers,26.0,24.0,355.0,0.0,261.0,1.0
3,1,Sun,2022-09-11,1:00PM,Chicago Bears,,San Francisco 49ers,19.0,10.0,204.0,1.0,331.0,2.0
4,1,Sun,2022-09-11,1:00PM,Pittsburgh Steelers,@,Cincinnati Bengals,23.0,20.0,267.0,0.0,432.0,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
280,Division,Sun,2023-01-22,3:00PM,Cincinnati Bengals,@,Buffalo Bills,27.0,10.0,412.0,0.0,325.0,1.0
281,Division,Sun,2023-01-22,6:30PM,San Francisco 49ers,,Dallas Cowboys,19.0,12.0,312.0,1.0,282.0,2.0
282,ConfChamp,Sun,2023-01-29,3:00PM,Philadelphia Eagles,,San Francisco 49ers,31.0,7.0,269.0,0.0,164.0,3.0
283,ConfChamp,Sun,2023-01-29,6:30PM,Kansas City Chiefs,,Cincinnati Bengals,23.0,20.0,357.0,1.0,309.0,2.0


### Find the games with the biggest Margin of Victory

How can we find games with the highest point differential in terms of the winning team and the losing team?

A simple way to do this is to create a new column called `Margin` and store the point difference (the winning team's points minus the losing team's points) in it.

In [None]:
df['Margin'] = df['PtsW'] - df['PtsL']

Now that we have a new column called `Margin`, we can sort all the 2022 NFL games by the margin of victory. `ascending=False` (note the lowercase `a`; argument names are case-sensitive) tells pandas to put the highest values first, so the game with the lowest margin ends up in the last row.

In [None]:
df.sort_values(by = 'Margin', ascending=False)

Unnamed: 0,Week,Day,Date,Time,Winner/tie,location,Loser/tie,PtsW,PtsL,YdsW,TOW,YdsL,TOL,Margin
161,11,Sun,2022-11-20,4:25PM,Dallas Cowboys,@,Minnesota Vikings,40.0,3.0,458.0,0.0,183.0,1.0,37.0
237,16,Sun,2022-12-25,4:30PM,Los Angeles Rams,,Denver Broncos,51.0,14.0,388.0,0.0,323.0,4.0,37.0
67,5,Sun,2022-10-09,1:00PM,Buffalo Bills,,Pittsburgh Steelers,38.0,3.0,552.0,2.0,364.0,2.0,35.0
193,13,Sun,2022-12-04,8:20PM,Dallas Cowboys,,Indianapolis Colts,54.0,19.0,385.0,1.0,309.0,5.0,35.0
30,2,Mon,2022-09-19,7:15PM,Buffalo Bills,,Tennessee Titans,41.0,7.0,414.0,0.0,187.0,4.0,34.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
119,8,Sun,2022-10-30,4:25PM,Washington Commanders,@,Indianapolis Colts,17.0,16.0,362.0,1.0,324.0,2.0,1.0
142,10,Sun,2022-11-13,1:00PM,Detroit Lions,@,Chicago Bears,31.0,30.0,323.0,0.0,408.0,1.0,1.0
185,13,Sun,2022-12-04,1:00PM,Washington Commanders,@,New York Giants,20.0,20.0,411.0,1.0,316.0,1.0,0.0
5,1,Sun,2022-09-11,1:00PM,Houston Texans,,Indianapolis Colts,20.0,20.0,299.0,1.0,517.0,2.0,0.0
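
When only the top few games are of interest, `nlargest()` is a possible shortcut for sorting and then taking the first rows. A minimal sketch on a made-up frame (team names `A` to `D` and the frame `toy` are placeholders):

```python
import pandas as pd

# Hypothetical scores; the team names are placeholders.
toy = pd.DataFrame({
    "Winner/tie": ["A", "B", "C", "D"],
    "PtsW": [40.0, 54.0, 20.0, 17.0],
    "PtsL": [3.0, 19.0, 20.0, 16.0],
})
toy["Margin"] = toy["PtsW"] - toy["PtsL"]

# Keep only the two rows with the largest margins, biggest first.
top2 = toy.nlargest(2, "Margin")
print(top2)
```

The counterpart `nsmallest()` works the same way for the closest games.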


### How to find high-scoring games?

One easy way of doing that is to create a new column, `Combined`, that contains the total points scored by the winning and losing teams together, and then to sort the data frame by that column.

In [None]:
# prompt: In this df, which row has the highest combined PtsW and PtsL?

df['Combined'] = df['PtsW'] + df['PtsL']
df.sort_values(by = 'Combined', ascending=False)


Unnamed: 0,Week,Day,Date,Time,Winner/tie,location,Loser/tie,PtsW,PtsL,YdsW,TOW,YdsL,TOL,Margin,Combined
55,4,Sun,2022-10-02,1:00PM,Seattle Seahawks,@,Detroit Lions,48.0,45.0,555.0,1.0,520.0,2.0,3.0,93.0
21,2,Sun,2022-09-18,1:00PM,Miami Dolphins,@,Baltimore Ravens,42.0,38.0,547.0,2.0,473.0,0.0,4.0,80.0
113,8,Sun,2022-10-30,1:00PM,Dallas Cowboys,,Chicago Bears,49.0,29.0,442.0,1.0,371.0,1.0,20.0,78.0
94,7,Thu,2022-10-20,8:15PM,Arizona Cardinals,,New Orleans Saints,42.0,34.0,326.0,0.0,494.0,3.0,8.0,76.0
209,15,Sat,2022-12-17,1:00PM,Minnesota Vikings,,Indianapolis Colts,39.0,36.0,518.0,3.0,341.0,1.0,3.0,75.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
210,15,Sat,2022-12-17,4:30PM,Cleveland Browns,,Baltimore Ravens,13.0,3.0,283.0,0.0,324.0,2.0,10.0,16.0
156,11,Sun,2022-11-20,1:00PM,Baltimore Ravens,,Carolina Panthers,13.0,3.0,308.0,1.0,205.0,3.0,10.0,16.0
154,11,Sun,2022-11-20,1:00PM,New England Patriots,,New York Jets,10.0,3.0,297.0,0.0,103.0,0.0,7.0,13.0
177,12,Sun,2022-11-27,4:25PM,San Francisco 49ers,,New Orleans Saints,13.0,0.0,317.0,0.0,260.0,2.0,13.0,13.0
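
The prompt asks for the single highest-scoring row, and sorting the whole frame is one way to find it. Another option, sketched here on three rows copied from the output above (the frame `toy` is an illustrative stand-in for `df`), is `idxmax()`, which returns the index label of the maximum value.

```python
import pandas as pd

# Three games copied from the sorted output above, used as a stand-in for df.
toy = pd.DataFrame({
    "Winner/tie": ["Seattle Seahawks", "San Francisco 49ers", "Miami Dolphins"],
    "Loser/tie": ["Detroit Lions", "New Orleans Saints", "Baltimore Ravens"],
    "PtsW": [48.0, 13.0, 42.0],
    "PtsL": [45.0, 0.0, 38.0],
})
toy["Combined"] = toy["PtsW"] + toy["PtsL"]

# idxmax() gives the index label of the largest value; .loc fetches that row.
top_label = toy["Combined"].idxmax()
top_game = toy.loc[top_label]
print(top_game)
```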


### Subset just one team -- say the Pittsburgh Steelers.

We do this in two steps:
 1. We create a condition that identifies the rows in which the Pittsburgh Steelers either won or lost. To do that, we use the `str.contains()` method on each of the two team columns and combine the results with `|` (or). Note that when `cond` is printed out, only the rows with a Steelers game are marked `True`.
 2. Now that we have a Boolean vector, the next step is very easy. We apply the Boolean mask (`cond`) to `df` by typing `df[cond]`.

In [None]:
# prompt: just show me the rows in which Pittsburgh Steelers are in the Winner/tie or Loser/tie column

cond = df['Winner/tie'].str.contains('Pittsburgh Steelers') | df['Loser/tie'].str.contains('Pittsburgh Steelers')
cond

0      False
1      False
2      False
3      False
4       True
       ...  
280    False
281    False
282    False
283    False
284    False
Length: 285, dtype: bool

In [None]:
# Step 2: apply cond to the data frame so that only the matching rows are returned. All the other rows are left out.
df[cond]

Unnamed: 0,Week,Day,Date,Time,Winner/tie,location,Loser/tie,PtsW,PtsL,YdsW,TOW,YdsL,TOL,Margin,Combined
4,1,Sun,2022-09-11,1:00PM,Pittsburgh Steelers,@,Cincinnati Bengals,23.0,20.0,267.0,0.0,432.0,5.0,3.0,43.0
23,2,Sun,2022-09-18,1:00PM,New England Patriots,@,Pittsburgh Steelers,17.0,14.0,376.0,1.0,243.0,2.0,3.0,31.0
32,3,Thu,2022-09-22,8:15PM,Cleveland Browns,,Pittsburgh Steelers,29.0,17.0,376.0,0.0,308.0,1.0,12.0,46.0
58,4,Sun,2022-10-02,1:00PM,New York Jets,@,Pittsburgh Steelers,24.0,20.0,348.0,2.0,297.0,4.0,4.0,44.0
67,5,Sun,2022-10-09,1:00PM,Buffalo Bills,,Pittsburgh Steelers,38.0,3.0,552.0,2.0,364.0,2.0,35.0,41.0
88,6,Sun,2022-10-16,1:00PM,Pittsburgh Steelers,,Tampa Bay Buccaneers,20.0,18.0,270.0,0.0,304.0,0.0,2.0,38.0
106,7,Sun,2022-10-23,8:20PM,Miami Dolphins,,Pittsburgh Steelers,16.0,10.0,372.0,0.0,341.0,3.0,6.0,26.0
111,8,Sun,2022-10-30,1:00PM,Philadelphia Eagles,,Pittsburgh Steelers,35.0,13.0,401.0,0.0,302.0,2.0,22.0,48.0
144,10,Sun,2022-11-13,1:00PM,Pittsburgh Steelers,,New Orleans Saints,20.0,10.0,379.0,0.0,186.0,2.0,10.0,30.0
160,11,Sun,2022-11-20,4:25PM,Cincinnati Bengals,@,Pittsburgh Steelers,37.0,30.0,408.0,2.0,351.0,0.0,7.0,67.0


In the cell above, we can see that only the Steelers games are being printed out.
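
`str.contains()` matches substrings (and treats the pattern as a regular expression by default), which is convenient but looser than needed when the full team name is known. A possible alternative, sketched on a made-up three-row frame (`toy`, `team`, and `steelers` are illustrative names), is an exact comparison with `==`:

```python
import pandas as pd

# Hypothetical three-game frame using the notebook's column names.
toy = pd.DataFrame({
    "Winner/tie": ["Pittsburgh Steelers", "New England Patriots", "Buffalo Bills"],
    "Loser/tie": ["Cincinnati Bengals", "Pittsburgh Steelers", "Los Angeles Rams"],
})

team = "Pittsburgh Steelers"

# Each comparison needs its own parentheses because | binds tighter than ==.
cond = (toy["Winner/tie"] == team) | (toy["Loser/tie"] == team)
steelers = toy[cond]
print(steelers)
```

Either form produces a Boolean mask, so the second step, indexing with the mask, is unchanged.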

A natural question is whether our original data frame, `df`, is still intact. It is: filtering with `df[cond]` returns a new object and does not change `df`. If we want to keep working with the Steelers games, we can save a *copy* of the filtered rows under another name.

In [None]:
pitt = df[cond].copy()

Let's confirm that *both* dataframes are now in memory and can be used independently.

In [None]:
df.shape, pitt.shape

((285, 15), (17, 15))
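
The reason for calling `.copy()` is that the new frame can then be changed freely without touching the original. A minimal sketch on made-up data (`toy` and `subset` are illustrative names, not the notebook's `df` and `pitt`):

```python
import pandas as pd

# Hypothetical two-row frame; the values are made up.
toy = pd.DataFrame({
    "Team": ["Pittsburgh Steelers", "Buffalo Bills"],
    "PtsW": [23.0, 31.0],
})

# Filter, then copy, so the subset is an independent frame.
subset = toy[toy["Team"] == "Pittsburgh Steelers"].copy()

# Overwrite a column in the subset only.
subset["PtsW"] = 0.0

print(toy["PtsW"].tolist())     # the original values are unchanged
print(subset["PtsW"].tolist())  # only the copy was modified
```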

## Challenge

Use the `location` column to create a new column called `Venue`. It should default to the city of the team in the `Winner/tie` column, unless `location` is `@`, in which case the venue should be the city of the team in the `Loser/tie` column.