<a href="https://colab.research.google.com/github/OptimalDecisions/sports-analytics-foundations/blob/main/pandas-basics/Pandas_Basics_2_4_Column_Operations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Pandas Basics 2.4

# Column Operations

  <img src = "../img/sa_logo.png" width="100" align="left">

  Ram Narasimhan

  <br><br><br>

  << [Examining, Describing & Summarizing Data](Pandas_Basics_2_3_Exploring_Data.ipynb)|[Column Operations](Pandas_Basics_2_4_Column_Operations.ipynb) | [Filtering using Conditions](Pandas_Basics_2_5_Filtering_Data.ipynb) >>


In this notebook we will see how we can:
- Select a subset of columns
- Add two or more columns (or perform math ops on columns)
- Create entire new columns
- Drop columns we no longer wish to keep
- Rename the column names to suit our needs
- Use the `apply()` method to operate on columns




In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline


In [2]:
url = "https://raw.githubusercontent.com/OptimalDecisions/sports-analytics-foundations/main/data/fifa-world-cup-2022.csv"
df = pd.read_csv(url)

In [3]:
df.shape

(64, 8)

In [4]:
df.head()

Unnamed: 0,Match Number,Round Number,Date,Location,Home Team,Away Team,Group,Result
0,1,1,20/11/2022 16:00,Al Bayt Stadium,Qatar,Ecuador,Group A,0 - 2
1,3,1,21/11/2022 13:00,Khalifa International Stadium,England,Iran,Group B,6 - 2
2,2,1,21/11/2022 16:00,Al Thumama Stadium,Senegal,Netherlands,Group A,0 - 2
3,4,1,21/11/2022 19:00,Ahmad Bin Ali Stadium,USA,Wales,Group B,1 - 1
4,8,1,22/11/2022 10:00,Lusail Stadium,Argentina,Saudi Arabia,Group C,1 - 2


## Selecting Columns

Often, we only want to work with a subset of columns, just the few columns of interest. An easy way to select those is to put their names in a list and then index the df with that list.

In [5]:
keep_cols = ['Round Number', 'Home Team', 'Away Team', 'Result']

df[keep_cols] # select just the subset

Unnamed: 0,Round Number,Home Team,Away Team,Result
0,1,Qatar,Ecuador,0 - 2
1,1,England,Iran,6 - 2
2,1,Senegal,Netherlands,0 - 2
3,1,USA,Wales,1 - 1
4,1,Argentina,Saudi Arabia,1 - 2
...,...,...,...,...
59,Quarter Finals,England,France,1 - 2
60,Semi Finals,Argentina,Croatia,3 - 0
61,Semi Finals,France,Morocco,2 - 0
62,Finals,Croatia,Morocco,2 - 1
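
One detail worth seeing in isolation is the difference between passing a single label and passing a list of labels. The sketch below uses a small hand-made frame (`toy` is a hypothetical stand-in, with values copied from the first two rows above) rather than the full dataset.

```python
import pandas as pd

# Hypothetical two-row stand-in for the World Cup data
toy = pd.DataFrame({
    'Round Number': ['1', '1'],
    'Home Team': ['Qatar', 'England'],
    'Away Team': ['Ecuador', 'Iran'],
    'Result': ['0 - 2', '6 - 2'],
})

one_col_series = toy['Home Team']         # single label -> Series
one_col_frame = toy[['Home Team']]        # list with one label -> DataFrame
two_cols = toy[['Result', 'Home Team']]   # columns come back in the order listed

print(type(one_col_series).__name__)
print(type(one_col_frame).__name__)
print(list(two_cols.columns))
```

Indexing with a list always returns a DataFrame, even when the list holds only one name, and the columns appear in the order given in the list.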


## Adding columns

Pandas allows us to add numerical columns, and to perform other mathematical operations on 2 or more columns. For example, we can divide one column of numbers by another.

In fact, we can even "add" two columns of strings.

For example,

In [15]:
df['Home Team'] + ' vs ' + df['Away Team']

0              Qatar vs Ecuador
1               England vs Iran
2        Senegal vs Netherlands
3                  USA vs Wales
4     Argentina vs Saudi Arabia
                ...            
59            England vs France
60         Argentina vs Croatia
61            France vs Morocco
62           Croatia vs Morocco
63          Argentina vs France
Length: 64, dtype: object
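
The numeric case mentioned at the start of this section works the same way. The dataset stores each score as a string in `Result`, so the sketch below assumes two hypothetical numeric columns, `home_goals` and `away_goals`, filled in by hand from the first four results shown earlier.

```python
import pandas as pd

# Hypothetical numeric columns; the real dataset only has the 'Result' string
goals = pd.DataFrame({
    'home_goals': [0, 6, 0, 1],
    'away_goals': [2, 2, 2, 1],
})

# Arithmetic between columns is element-wise, row by row
goals['total_goals'] = goals['home_goals'] + goals['away_goals']
goals['goal_diff'] = goals['home_goals'] - goals['away_goals']
goals['home_share'] = goals['home_goals'] / goals['total_goals']

print(goals)
```

Each operation lines up the two columns by row index and returns a new Series of the same length.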

## Create new columns


In [16]:
df['game'] = df['Home Team'] + ' vs ' + df['Away Team']

In [17]:
df['Round Number'].unique()

array(['1', '2', '3', 'Round of 16', 'Quarter Finals', 'Semi Finals',
       'Finals'], dtype=object)

We could create a new column called `match_stage`

This column would be "Group Stage" for Rounds 1, 2 and 3.
This column would be "Knockout Stage" for all the later rounds.

We will create it using `np.where()` based on the Round Number.

In [10]:
#If the Round Number is '1', '2' or '3', this column will have the value of "Group Stage".
#Else, it will take on the value of "Knockout Stage"

df['match_stage'] = np.where(df['Round Number'].isin(['1', '2', '3']), 'Group Stage', 'Knockout Stage')


In [12]:
df

Unnamed: 0,Match Number,Round Number,Date,Location,Home Team,Away Team,Group,Result,match_stage
0,1,1,20/11/2022 16:00,Al Bayt Stadium,Qatar,Ecuador,Group A,0 - 2,Group Stage
1,3,1,21/11/2022 13:00,Khalifa International Stadium,England,Iran,Group B,6 - 2,Group Stage
2,2,1,21/11/2022 16:00,Al Thumama Stadium,Senegal,Netherlands,Group A,0 - 2,Group Stage
3,4,1,21/11/2022 19:00,Ahmad Bin Ali Stadium,USA,Wales,Group B,1 - 1,Group Stage
4,8,1,22/11/2022 10:00,Lusail Stadium,Argentina,Saudi Arabia,Group C,1 - 2,Group Stage
...,...,...,...,...,...,...,...,...,...
59,59,Quarter Finals,10/12/2022 19:00,Al Bayt Stadium,England,France,,1 - 2,Knockout Stage
60,61,Semi Finals,13/12/2022 19:00,Lusail Stadium,Argentina,Croatia,,3 - 0,Knockout Stage
61,62,Semi Finals,14/12/2022 19:00,Al Bayt Stadium,France,Morocco,,2 - 0,Knockout Stage
62,63,Finals,17/12/2022 15:00,Khalifa International Stadium,Croatia,Morocco,,2 - 1,Knockout Stage


The df now has 2 new columns, `game` and `match_stage`. These are the columns that we've created.
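
To see what `np.where()` does step by step, here is a minimal sketch on a hand-made frame (`rounds` is a hypothetical stand-in containing a few of the round labels listed above). The boolean mask is kept in its own variable, `is_group`, purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a handful of round labels
rounds = pd.DataFrame({'Round Number': ['1', '2', '3', 'Round of 16', 'Finals']})

# Step 1: a boolean Series, True where the round is a group-stage round
is_group = rounds['Round Number'].isin(['1', '2', '3'])

# Step 2: pick the first value where the mask is True, the second where it is False
rounds['match_stage'] = np.where(is_group, 'Group Stage', 'Knockout Stage')

print(rounds)
```

Note that the round labels are strings (`'1'`, not `1`), which is why the list passed to `isin()` contains strings.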

## Drop Columns


We can use the intuitive command `df.drop()` to drop one or more columns.
Note that we specify `axis=1` so that the labels we pass are looked up among the columns. In contrast, `axis=0` (the default) looks them up among the row labels.

If we want to "permanently" drop the column(s), we specify `inplace=True`, which modifies `df` itself instead of returning a new DataFrame.

In [18]:
df.drop(['game'], axis=1, inplace=True)
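
As an alternative to `inplace=True`, `drop()` can return a new DataFrame that we assign to a name. The sketch below assumes a small hypothetical frame `toy` and uses the `columns=` keyword, which is equivalent to passing the labels with `axis=1`.

```python
import pandas as pd

# Hypothetical one-row frame with a 'game' column to remove
toy = pd.DataFrame({
    'Home Team': ['Qatar'],
    'Away Team': ['Ecuador'],
    'game': ['Qatar vs Ecuador'],
})

# Without inplace=True, drop() returns a new DataFrame and leaves toy untouched
trimmed = toy.drop(columns=['game'])

print(list(toy.columns))
print(list(trimmed.columns))
```

Keeping the original frame intact can be handy while exploring, since a dropped column can only be recovered by recreating it or reloading the data.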


## Rename columns


We can use the aptly named `df.rename()` method to give the columns more intuitive names. To do this, we simply create a dictionary.
Each entry in our dictionary has the form
`old name: new name`.




In [22]:
print("Before renaming", df.columns.values)

# rename some of the columns in df
df = df.rename(columns={'Home Team': 'Home', 'Away Team': 'Away'})
df.head()

print("After renaming", df.columns.values)


Before renaming ['Match Number' 'Round Number' 'Date' 'Location' 'Home Team' 'Away Team'
 'Group' 'Result' 'match_stage']
After renaming ['Match Number' 'Round Number' 'Date' 'Location' 'Home' 'Away' 'Group'
 'Result' 'match_stage']
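
One behavior of `rename()` that is easy to miss: dictionary keys that do not match any existing column are skipped silently by default. The sketch below illustrates this on a hypothetical empty frame `toy`; the `'Venue'` key is deliberately not one of its columns.

```python
import pandas as pd

# Hypothetical empty frame that only carries column names
toy = pd.DataFrame(columns=['Home Team', 'Away Team', 'Result'])

# 'Venue' is not a column of toy, so that entry is ignored without an error
renamed = toy.rename(columns={
    'Home Team': 'Home',
    'Away Team': 'Away',
    'Venue': 'Stadium',
})

print(list(renamed.columns))
```

This also means that running the same rename twice is harmless: on the second run the old names no longer exist, so nothing changes.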


## .apply()

Pandas allows us to `apply` any function (an existing one, or a function that we define ourselves) to each element of any column.

The format is:
`df[column_name].apply(some_python_function)`
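
As a small illustration of this format, the sketch below applies both a built-in function and a user-defined one to a hypothetical Series of score strings. The helper `total_goals` is made up for this example and assumes every score looks like `'home - away'`.

```python
import pandas as pd

# Hypothetical score strings in the same 'home - away' style as the Result column
results = pd.Series(['0 - 2', '6 - 2', '1 - 1'], name='Result')

def total_goals(result):
    """Return the total goals in a 'home - away' score string."""
    home, away = result.split(' - ')
    return int(home) + int(away)

lengths = results.apply(len)          # an existing Python function
totals = results.apply(total_goals)   # a function we defined ourselves

print(lengths.tolist())
print(totals.tolist())
```

In both cases the function is called once per element, and the return values are collected into a new Series.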

Let's use this to find out the Day of Week for each of the FIFA WC 2022 games.
We do this in three steps.

1. We convert the 'Date' column, which is a string, to a `datetime` format using `pd.to_datetime()`.
2. The datetime object has a very handy method called `day_name()`, which is what we want. So we can write a tiny function that returns this string.
3. We then use the `apply()` method to call this function.

In [34]:
# Convert the string Date to a datetime object.
# The dates are written day-first (e.g. 20/11/2022), so we pass dayfirst=True;
# otherwise an ambiguous date such as 10/12/2022 can be read as October 12.
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)

# Create a new column 'DOW' using apply() to extract the day of the week

def day_of_week(d):
  ''' Returns the day name from the datetime object'''
  return d.day_name()

df['DOW'] = df['Date'].apply(day_of_week)


In [35]:
df[['Home', 'Away', 'Date', 'DOW']]

Unnamed: 0,Home,Away,Date,DOW
0,Qatar,Ecuador,2022-11-20 16:00:00,Sunday
1,England,Iran,2022-11-21 13:00:00,Monday
2,Senegal,Netherlands,2022-11-21 16:00:00,Monday
3,USA,Wales,2022-11-21 19:00:00,Monday
4,Argentina,Saudi Arabia,2022-11-22 10:00:00,Tuesday
...,...,...,...,...
59,England,France,2022-12-10 19:00:00,Saturday
60,Argentina,Croatia,2022-12-13 19:00:00,Tuesday
61,France,Morocco,2022-12-14 19:00:00,Wednesday
62,Croatia,Morocco,2022-12-17 15:00:00,Saturday
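
For datetime columns, pandas also offers the `.dt` accessor, which gives the same day names without writing a helper function. The sketch below is a minimal illustration on two hand-typed dates in the day-first style used by this dataset; `dates` is a hypothetical stand-in for the `Date` column.

```python
import pandas as pd

# Two hypothetical date strings written day-first, as in the dataset
dates = pd.Series(['20/11/2022 16:00', '10/12/2022 19:00'])

# dayfirst=True makes 10/12/2022 parse as 10 December, not 12 October
parsed = pd.to_datetime(dates, dayfirst=True)

dow_accessor = parsed.dt.day_name()                  # vectorized accessor
dow_apply = parsed.apply(lambda d: d.day_name())     # same result via apply()

print(dow_accessor.tolist())
print(dow_apply.tolist())
```

Both approaches give the same answer; `apply()` is the more general tool, since it works with any function and not only with the methods exposed through `.dt`.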


