

  ## Pandas Basics 2.8

# Formatting Date and time columns


  <img src = "../img/sa_logo.png" width="100" align="left">

  Ram Narasimhan

  <br><br><br>

  << [Writing to Files](Pandas_Basics_2_7_Writing_to_Files.ipynb) | [Time Series](Pandas_Intermediate_2_8_Time_Series.ipynb) | [Merging Dataframes](Pandas_Intermediate_2_9_Merging_DataFrames.ipynb) >>


Working with dates (and times) is an integral part of any sports analysis.

Pandas is versatile when it comes to handling dates. The key function is `to_datetime()`, a fundamental tool in data cleaning and preprocessing workflows for datasets that contain date information.

## Introduction to `pd.to_datetime`

The `pd.to_datetime` function converts "date-like" strings to Pandas `datetime` objects. (Yes, Pandas has its own "datetime" type, just as integers and strings are types.)
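As a minimal sketch of that conversion (the variable names `tip_off` and `game_days` are made up for illustration), a single string comes back as a `Timestamp`, and a list of strings in one consistent format comes back as a `DatetimeIndex`:

```python
import pandas as pd

# A single date-like string becomes a pandas Timestamp
tip_off = pd.to_datetime("2022-01-15 18:30:00")

# A list of strings in one consistent format becomes a DatetimeIndex
game_days = pd.to_datetime(["2022-01-15", "2022-02-20", "2022-03-10"])

print(type(tip_off).__name__)    # Timestamp
print(type(game_days).__name__)  # DatetimeIndex
print(tip_off.hour, game_days[1].month)
```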




## Key Uses of `pd.to_datetime`

### Standardizing Date Formats

It allows us to standardize the format of date-like strings in a DataFrame, making it consistent and suitable for analysis.
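One way this might look in practice, assuming a hypothetical column of month/day/year strings: parse with an explicit `format`, then write the dates back out in a single ISO-style layout with `dt.strftime`.

```python
import pandas as pd

# Hypothetical raw values, all written as month/day/year
raw = pd.Series(["01/15/2022", "02/20/2022", "03/10/2022"])

# Parse using the known layout
parsed = pd.to_datetime(raw, format="%m/%d/%Y")

# Render every date in one standard text layout (YYYY-MM-DD)
iso_text = parsed.dt.strftime("%Y-%m-%d")
print(iso_text.tolist())  # ['2022-01-15', '2022-02-20', '2022-03-10']
```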



### Handling Missing or Invalid Dates

By default (`errors='raise'`), `pd.to_datetime` raises an error when it meets a value it cannot parse. Passing `errors='coerce'` converts invalid values to `NaT` (Not a Time) instead, and missing values such as `None` or `NaN` also become `NaT`. This lets us clean datasets with inconsistent date representations.
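A small sketch of the difference between raising and coercing, using made-up values (one unparseable string, one missing value, and one impossible calendar date):

```python
import pandas as pd

# Hypothetical messy input: a valid date, junk text, a missing value, and Feb 30
raw = pd.Series(["2022-01-15", "not a date", None, "2022-02-30"])

# With the default errors='raise', the first bad value stops the conversion
try:
    pd.to_datetime(raw, format="%Y-%m-%d")
    raised = False
except ValueError:
    raised = True

# With errors='coerce', bad and missing values turn into NaT
clean = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(raised)                 # True
print(clean.isna().tolist())  # [False, True, True, True]
```

After coercing, `clean.isna()` is a convenient way to find the rows that need manual attention.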



### Extracting Date Components

Once the date-like strings are converted to datetime objects, we can easily extract various components such as `year`, `month`, `day`, `hour`, `minute`, and `second`. This facilitates time-based analysis and filtering.



### Supporting Datetime Operations

`Datetime` objects support a wide range of time-related operations. After conversion, we can perform operations like time-based filtering, resampling, and calculating time intervals.
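The three operations mentioned above can be sketched on a tiny, made-up `games` table (the column names and point totals are illustrative only):

```python
import pandas as pd

games = pd.DataFrame({
    "GameDate": pd.to_datetime(["2022-01-15", "2022-01-29", "2022-02-20", "2022-03-10"]),
    "Points": [101, 95, 110, 88],
})

# Time-based filtering: games played on or after February 1
from_feb = games[games["GameDate"] >= pd.Timestamp("2022-02-01")]

# Resampling: total points per calendar month ("MS" = month start)
monthly = games.set_index("GameDate")["Points"].resample("MS").sum()

# Time intervals: days of rest since the previous game
rest_days = games["GameDate"].diff().dt.days

print(len(from_feb))       # 2
print(monthly.tolist())    # [196, 110, 88]
print(rest_days.tolist())  # [nan, 14.0, 22.0, 18.0]
```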

## Converting a Date/Time Column

- Walks through converting a single column of date strings to `datetime`.
- Shows how to handle errors and invalid values using the `errors` parameter.
- Illustrates the effect of specifying the date format when needed.



### Basic Syntax

```
import pandas as pd

pd.to_datetime(arg, errors='raise', format=None, unit=None, infer_datetime_format=False, origin='unix', cache=True)
```



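Common Parameters:
- `arg`: The object to convert to datetime (a string, list, Series, etc.).
- `errors`: How to handle parsing errors (`raise`, `coerce`, `ignore`). Note that `ignore` is deprecated in recent pandas versions.
- `format`: Specify the expected date format (e.g., `'%m/%d/%Y'`).
- `unit`: Unit of the arg when it is numeric (e.g., `s` for seconds).
- `infer_datetime_format`: If `True`, infer the datetime format. This parameter is deprecated as of pandas 2.0, where format inference is the default behavior.
- `origin`: The reference date for numeric time-related units.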
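The `unit` and `origin` parameters only matter for numeric input. A brief sketch, assuming a hypothetical feed that stores tip-off times as Unix epoch seconds and another that stores day offsets from the start of a season:

```python
import pandas as pd

# Hypothetical epoch timestamps: seconds since 1970-01-01 (the default origin)
epoch_seconds = pd.Series([1642271400, 1645371900])
from_seconds = pd.to_datetime(epoch_seconds, unit="s")

# Day offsets counted from a custom origin instead of the Unix epoch
day_offsets = pd.to_datetime([0, 1, 31], unit="D", origin="2022-01-01")

print(from_seconds.tolist())
print(day_offsets.tolist())
```

Here the two epoch values correspond to 2022-01-15 18:30:00 and 2022-02-20 15:45:00, and the offsets land on January 1, January 2, and February 1 of 2022.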


In [11]:
import pandas as pd

# Example DataFrame with a messy date column
data = {'Event': ['Game 1', 'Game 2', 'Game 3', 'Game 4', 'Game 5', ],
        'Date': ['2022-01-15', 'Feb 20, 2022', '2022-03-10', '05/15/2022', '06-25-2022']}
df = pd.DataFrame(data)
df


| | Event | Date |
|---|---|---|
| 0 | Game 1 | 2022-01-15 |
| 1 | Game 2 | Feb 20, 2022 |
| 2 | Game 3 | 2022-03-10 |
| 3 | Game 4 | 05/15/2022 |
| 4 | Game 5 | 06-25-2022 |


In [12]:
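# Convert the 'Date' column to datetime; unparseable values become NaT.
# Note: pandas 2.0+ infers one format from the first value, so rows in other
# formats may become NaT unless format='mixed' is also passed.
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')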
df

| | Event | Date |
|---|---|---|
| 0 | Game 1 | 2022-01-15 |
| 1 | Game 2 | 2022-02-20 |
| 2 | Game 3 | 2022-03-10 |
| 3 | Game 4 | 2022-05-15 |
| 4 | Game 5 | 2022-06-25 |



## Extraction of time components: `year`, `month`, `day`, `hour`

### Extracting Date, Time and Year and Creating New Columns

- How to access datetime properties like `year`, `month`, `day`, `hour`, `minute`, and `second` through the `.dt` accessor.
- How to extract the day of the week using the `dt.day_name()` method.
- Other common operations like extracting the date, time, and year.



In [5]:
data = {'GameDate': ['2022/01/15 6:30:00 PM', '2022-02-20 15:45:00',
                     '2022-March-10 20:00:00', '16 May 2022 12:00:00', '2022-06-25 19:15:00'],
        'Team': ['TeamA', 'TeamB', 'TeamC', 'TeamA', 'TeamB']}

sports_df = pd.DataFrame(data)
sports_df


| | GameDate | Team |
|---|---|---|
| 0 | 2022/01/15 6:30:00 PM | TeamA |
| 1 | 2022-02-20 15:45:00 | TeamB |
| 2 | 2022-March-10 20:00:00 | TeamC |
| 3 | 16 May 2022 12:00:00 | TeamA |
| 4 | 2022-06-25 19:15:00 | TeamB |


In [6]:

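# Convert the 'GameDate' column to datetime; unparseable values become NaT.
# Note: on pandas 2.0+, mixed formats like these may need format='mixed'.
sports_df['GameDate'] = pd.to_datetime(sports_df['GameDate'], errors='coerce')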

# Extracting Year
sports_df['Year'] = sports_df['GameDate'].dt.year

# Extracting Time
sports_df['Time'] = sports_df['GameDate'].dt.time

# Extracting Day of Week
sports_df['DayOfWeek'] = sports_df['GameDate'].dt.day_name()

# Display the DataFrame
print(sports_df)


             GameDate   Team  Year      Time DayOfWeek
0 2022-01-15 18:30:00  TeamA  2022  18:30:00  Saturday
1 2022-02-20 15:45:00  TeamB  2022  15:45:00    Sunday
2 2022-03-10 20:00:00  TeamC  2022  20:00:00  Thursday
3 2022-05-16 12:00:00  TeamA  2022  12:00:00    Monday
4 2022-06-25 19:15:00  TeamB  2022  19:15:00  Saturday


In [7]:
data = {'GameDate': ['2022-01-15 18:30:00', '2022-02-20 15:45:00', '2022-03-10 20:00:00', '2022-05-15 12:00:00', '2022-06-25 19:15:00'],
        'Team': ['TeamA', 'TeamB', 'TeamC', 'TeamA', 'TeamB']}

sports_df = pd.DataFrame(data)
sports_df


| | GameDate | Team |
|---|---|---|
| 0 | 2022-01-15 18:30:00 | TeamA |
| 1 | 2022-02-20 15:45:00 | TeamB |
| 2 | 2022-03-10 20:00:00 | TeamC |
| 3 | 2022-05-15 12:00:00 | TeamA |
| 4 | 2022-06-25 19:15:00 | TeamB |


In [8]:

# Convert the 'GameDate' column to datetime
sports_df['GameDate'] = pd.to_datetime(sports_df['GameDate'], errors='coerce')

# Accessing datetime properties
sports_df['Year'] = sports_df['GameDate'].dt.year
sports_df['Month'] = sports_df['GameDate'].dt.month
sports_df['Day'] = sports_df['GameDate'].dt.day
sports_df['Hour'] = sports_df['GameDate'].dt.hour
sports_df['Minute'] = sports_df['GameDate'].dt.minute
sports_df['Second'] = sports_df['GameDate'].dt.second

sports_df


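| | GameDate | Team | Year | Month | Day | Hour | Minute | Second |
|---|---|---|---|---|---|---|---|---|
| 0 | 2022-01-15 18:30:00 | TeamA | 2022 | 1 | 15 | 18 | 30 | 0 |
| 1 | 2022-02-20 15:45:00 | TeamB | 2022 | 2 | 20 | 15 | 45 | 0 |
| 2 | 2022-03-10 20:00:00 | TeamC | 2022 | 3 | 10 | 20 | 0 | 0 |
| 3 | 2022-05-15 12:00:00 | TeamA | 2022 | 5 | 15 | 12 | 0 | 0 |
| 4 | 2022-06-25 19:15:00 | TeamB | 2022 | 6 | 25 | 19 | 15 | 0 |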


## Handling Multiple Columns:

### Using a `for` Loop to Convert Each Column

In [36]:
data = { 'Team': ['TeamA', 'TeamB', 'TeamC'],
        'StartDate': ['2022-01-15', '02/20/2022', '2022-03-10'],
        'EndDate': ['2022-02-15', '03/05/2022', '04/20/2022'],
        'GameTime': ['18:30', '3:45:00 PM', '20:00:00'],
        }

sports_df = pd.DataFrame(data)



In [37]:
sports_df

| | Team | StartDate | EndDate | GameTime |
|---|---|---|---|---|
| 0 | TeamA | 2022-01-15 | 2022-02-15 | 18:30 |
| 1 | TeamB | 02/20/2022 | 03/05/2022 | 3:45:00 PM |
| 2 | TeamC | 2022-03-10 | 04/20/2022 | 20:00:00 |


In [38]:
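# List of date columns
date_columns = ['StartDate', 'EndDate']

# Convert each date column; unparseable values become NaT.
# Note: on pandas 2.0+, mixed formats like these may need format='mixed'.
for col in date_columns:
    sports_df[col] = pd.to_datetime(sports_df[col], errors='coerce')

# Parse the time strings, then keep only the time-of-day part
sports_df['GameTime'] = pd.to_datetime(sports_df['GameTime']).dt.time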

print(sports_df)


    Team  StartDate    EndDate  GameTime
0  TeamA 2022-01-15 2022-02-15  18:30:00
1  TeamB 2022-02-20 2022-03-05  15:45:00
2  TeamC 2022-03-10 2022-04-20  20:00:00


In [39]:
sports_df.dtypes

Team                 object
StartDate    datetime64[ns]
EndDate      datetime64[ns]
GameTime             object
dtype: object

In the code above,

- We use a `for` loop to iterate over each column listed in `date_columns`.

- The `pd.to_datetime` function is applied to each column individually.

- The `errors='coerce'` parameter handles parsing errors by coercing them to `NaT`.

- An explicit `format`, such as `'%m/%d/%Y %H:%M:%S'`, can be passed when every value in a column follows the same date-and-time layout. It is not used here because these columns mix several layouts.
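To illustrate the explicit `format` option, here is a hedged sketch on a hypothetical `schedule` table in which every value is supposed to follow `month/day/year hour:minute:second`. The third value is deliberately invalid (month 13) to show that a strict format combined with `errors='coerce'` flags it as `NaT`.

```python
import pandas as pd

# Hypothetical column where every entry should be MM/DD/YYYY HH:MM:SS
schedule = pd.DataFrame({
    "Tipoff": ["01/15/2022 18:30:00", "02/20/2022 15:45:00", "13/45/2022 20:00:00"],
})

# An explicit format is strict: values that do not fit are coerced to NaT
schedule["Tipoff"] = pd.to_datetime(
    schedule["Tipoff"], format="%m/%d/%Y %H:%M:%S", errors="coerce"
)

print(schedule["Tipoff"].isna().tolist())  # [False, False, True]
```

Specifying the format also removes any ambiguity between month-first and day-first dates such as `03/05/2022`.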

### Using the `apply()` method

In [92]:
# DataFrame must be accessed through the pandas namespace (pd)
df = pd.DataFrame.from_dict(
    {'Opponent': ["Bulls", "Knicks", "Warriors"],
     'StartTime': ["11/12/2021, 19:00", "10/22/2021, 18:09:00", "01/22/2021, 18:29:14"],
     'EndTime': ["11/12/2021, 22:15", "10/22/2021, 20:39:00", "01/22/2021, 21:00:04"]})
print("Create DataFrame:\n", df)

Create DataFrame:
    Opponent             StartTime               EndTime
0     Bulls     11/12/2021, 19:00     11/12/2021, 22:15
1    Knicks  10/22/2021, 18:09:00  10/22/2021, 20:39:00
2  Warriors  01/22/2021, 18:29:14  01/22/2021, 21:00:04


In [93]:
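# Use DataFrame.apply() to convert multiple columns to datetime.
# Note: the first row omits seconds, so pandas 2.0+ may need
# .apply(pd.to_datetime, format='mixed') here.
df[['StartTime','EndTime']] = df[['StartTime','EndTime']].apply(pd.to_datetime)
print(" After converting multiple columns to datetime:\n", df)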

 After converting multiple columns to datetime:
    Opponent           StartTime             EndTime
0     Bulls 2021-11-12 19:00:00 2021-11-12 22:15:00
1    Knicks 2021-10-22 18:09:00 2021-10-22 20:39:00
2  Warriors 2021-01-22 18:29:14 2021-01-22 21:00:04
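`DataFrame.apply()` forwards extra keyword arguments to the function it calls, so `format` and `errors` can be set once for all columns. A minimal sketch, assuming a made-up `games` table whose time columns all share one layout:

```python
import pandas as pd

# Hypothetical table where both time columns use "MM/DD/YYYY, HH:MM:SS"
games = pd.DataFrame({
    "Opponent": ["Bulls", "Knicks"],
    "StartTime": ["11/12/2021, 19:00:00", "10/22/2021, 18:09:00"],
    "EndTime": ["11/12/2021, 22:15:00", "10/22/2021, 20:39:00"],
})

time_cols = ["StartTime", "EndTime"]

# Keyword arguments after the function are passed on to pd.to_datetime
games[time_cols] = games[time_cols].apply(
    pd.to_datetime, format="%m/%d/%Y, %H:%M:%S", errors="coerce"
)

print(games.dtypes)
```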


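## Calculating the Time Difference (between 2 columns)

Subtracting one datetime column from another gives a column of time differences (timedeltas). The convenient `.dt.total_seconds()` method returns each difference in seconds, which we can then convert to minutes, hours, etc.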

In [98]:
df['DurationMinutes'] = (df['EndTime'] - df['StartTime']).dt.total_seconds()/60

In [96]:
df.dtypes

Opponent                   object
StartTime          datetime64[ns]
EndTime            datetime64[ns]
DurationMinutes           float64
dtype: object

In [97]:
df


| | Opponent | StartTime | EndTime | DurationMinutes |
|---|---|---|---|---|
| 0 | Bulls | 2021-11-12 19:00:00 | 2021-11-12 22:15:00 | 195.0 |
| 1 | Knicks | 2021-10-22 18:09:00 | 2021-10-22 20:39:00 | 150.0 |
| 2 | Warriors | 2021-01-22 18:29:14 | 2021-01-22 21:00:04 | 150.833333 |
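As an alternative sketch for unit conversion, a timedelta column can also be divided by a `pd.Timedelta` of the desired unit. The `start` and `end` Series below are stand-ins that reuse two of the game times from the example above.

```python
import pandas as pd

start = pd.to_datetime(pd.Series(["2021-11-12 19:00:00", "2021-10-22 18:09:00"]))
end = pd.to_datetime(pd.Series(["2021-11-12 22:15:00", "2021-10-22 20:39:00"]))

# Subtracting two datetime Series gives a timedelta Series
duration = end - start

# Option 1: divide by a Timedelta of the target unit
minutes = duration / pd.Timedelta(minutes=1)

# Option 2: go through total seconds, as in the cell above
hours = duration.dt.total_seconds() / 3600

print(minutes.tolist())  # [195.0, 150.0]
print(hours.tolist())    # [3.25, 2.5]
```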



