# **Guided Lab 343.3.16 - Convert date-like strings to Pandas DateTime Series using to_datetime()**

## **Lab Overview:**
In this lab, we will demonstrate how to convert a list of date-like strings to the pandas DateTime format.

Pandas has a built-in function called **`to_datetime()`** that converts date-like strings to the DateTime format. This function is especially useful when working with date and time Series data.

This lab is suitable for beginners in data analysis who want to enhance their skills in handling date and time data.

## **Lab Objective:**

By the end of this lab, you will be able to:

- Describe the role of the pandas.to_datetime() function in converting date-like strings.
- Use the function to convert a list of date-like strings to pandas DateTime objects.
- Handle different date formats and deal with common parsing challenges.

---
# **Begin:**
### **Example: Convert List of String Date to Pandas DateTime Series**

In the example below, we define a list of date-like strings and then convert it to the pandas DateTime format.

In [None]:
import pandas as pd

# Avoid naming the list `input`, which would shadow Python's built-in input()
date_strings = ['2023-01-01', '2023-01-02', '2023-02-06']
print("Original Date Strings:")
print(date_strings)
# Record the type of the original object (a plain Python list)
x = type(date_strings)

print("\n======= Before convert============")
print("datatype is " ,x)

print("\n======= after convert ============")

output = pd.to_datetime(date_strings)
# Display the converted DateTime objects
print("\nConverted DateTime Objects:")

print("Output: ", output)
y = type(output)
print(y)



Original Date Strings:
['2023-01-01', '2023-01-02', '2023-02-06']

datatype is  <class 'list'>


Converted DateTime Objects:
Output:  DatetimeIndex(['2023-01-01', '2023-01-02', '2023-02-06'], dtype='datetime64[ns]', freq=None)
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>
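
The list above came back as a `DatetimeIndex`. As a small sketch (the variable names here are illustrative, not part of the lab), the return type of `pd.to_datetime()` follows the type of its input:

```python
import pandas as pd

# A single string becomes a Timestamp
single = pd.to_datetime('2023-01-01')

# A list becomes a DatetimeIndex
from_list = pd.to_datetime(['2023-01-01', '2023-01-02'])

# A Series stays a Series, now with a datetime64 dtype
from_series = pd.to_datetime(pd.Series(['2023-01-01', '2023-01-02']))

print(type(single).__name__)
print(type(from_list).__name__)
print(type(from_series).__name__, from_series.dtype)
```

This matters later in the lab: DataFrame columns are Series, so converting a column gives back a Series that can be assigned straight into the DataFrame.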


**Let's demonstrate the flexibility of pd.to_datetime() in handling a variety of date formats within the input list. It can parse the different formats and convert them to a consistent DatetimeIndex.**

In [None]:
import pandas as pd
# Create a list of date strings written in different formats
input_list = ['2023-01-01', '2023-01-02', '3/10/2020 143045', '13th of October, 2023']
print("Original Date Strings:")
print(input_list)
# Record the type of the original object (a plain Python list)
x = type(input_list)
print("\n======= Before convert ============")
print("datatype is " ,x)


print("\n======= after convert ============")

# Convert with format='mixed'
output = pd.to_datetime(input_list, format='mixed')

print("\nConverted DateTime Objects:")
print("datatype: ", output)
# Display the converted DateTime objects
y = type(output)
print(y)



Original Date Strings:
['2023-01-01', '2023-01-02', '3/10/2020 143045', '13th of October, 2023']

datatype is  <class 'list'>


Converted DateTime Objects:
datatype:  DatetimeIndex(['2023-01-01 00:00:00', '2023-01-02 00:00:00',
               '2020-03-10 14:30:45', '2023-10-13 00:00:00'],
              dtype='datetime64[ns]', freq=None)
<class 'pandas.core.indexes.datetimes.DatetimeIndex'>


## **Example: String column to datetime**
### **Dealing with Ambiguities:**

Create date strings with ambiguous formats (e.g., '01/02/2023' - is it January 2nd or February 1st?) and explore how pd.to_datetime() handles these situations.

In [None]:
import pandas as pd
df = pd.DataFrame({
    'patientID':[101,23,48,49],
    'name': ['alice','bob','charlie','Eric'],
    'date_of_birth': ['2023-01-01', '2023-01-02', '3/10/2020 143045', '13th of October, 2023']
})

print("\n====== before convert ============")
print("Original Date Strings:")
print(df)
#print(df.info())
print("Data Type of date_of_birth column: ",df['date_of_birth'].dtype)
print(type(df["date_of_birth"]))
print(df['date_of_birth'])

print("\n======= after convert============")

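# The conversion is left commented out, so the dtype printed below is still object.
# With pandas 2.x this call typically raises a ValueError, because the strings do not share one format.
#df['date_of_birth'] = pd.to_datetime(df['date_of_birth'])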

#print(df.info())


print("Data Type of date_of_birth column: ",df['date_of_birth'].dtype)

print("\nConverted DateTime Objects:")
#df['date_of_birth']
df




Original Date Strings:
   patientID     name          date_of_birth
0        101    alice             2023-01-01
1         23      bob             2023-01-02
2         48  charlie       3/10/2020 143045
3         49     Eric  13th of October, 2023
Data Type of date_of_birth column:  object
<class 'pandas.core.series.Series'>
0               2023-01-01
1               2023-01-02
2         3/10/2020 143045
3    13th of October, 2023
Name: date_of_birth, dtype: object

Data Type of date_of_birth column:  object

Converted DateTime Objects:


| | patientID | name | date_of_birth |
|---|---|---|---|
| 0 | 101 | alice | 2023-01-01 |
| 1 | 23 | bob | 2023-01-02 |
| 2 | 48 | charlie | 3/10/2020 143045 |
| 3 | 49 | Eric | 13th of October, 2023 |


## **Example : String column to datetime, custom format**

The **format** argument of the **to_datetime()** function allows you to pass a custom format. For example, let's say you want to parse your string with the following timestamp format:
- YYYY-MM-DD HH:MM:SS, written as the format string `%Y-%m-%d %H:%M:%S`.
- [Click here to see all formats](https://strftime.org/)
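
As a minimal sketch of the format above (the sample strings are made up for illustration), each directive in the format string matches one part of the timestamp:

```python
import pandas as pd

# Hypothetical timestamps that all follow YYYY-MM-DD HH:MM:SS
stamps = ['2023-01-01 14:30:45', '2023-01-02 08:05:00']

# %Y = 4-digit year, %m = month, %d = day,
# %H = hour (24-hour clock), %M = minute, %S = second
parsed = pd.to_datetime(stamps, format='%Y-%m-%d %H:%M:%S')

print(parsed)
print(parsed[0].hour, parsed[0].minute, parsed[0].second)
```

Passing an explicit format removes guesswork about how the strings should be read, at the cost of failing when a string does not follow it.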

If your date string does not match the given format, pandas raises an error (a `ValueError` in the example below). Let's see what this looks like:


**Note:** The example below raises an error because the strings are written as YYYY-MM-DD, while the format `'%d/%m/%Y'` expects DD/MM/YYYY.



In [None]:
import pandas as pd
df = pd.DataFrame({
    'patientID':[101,23,48,49],
    'name': ['alice','bob','charlie','keyla'],
    'date_of_birth': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04']
})

print("\n======= before============")
#print(df.info())
print("Data Type of date_of_birth column: ",df['date_of_birth'].dtype)
#print(df['date_of_birth'])
df


print("\n======= after============")

df['date_of_birth'] = pd.to_datetime(df['date_of_birth'], format='%d/%m/%Y')

df


#print(df.info())
print(df['date_of_birth'].dtype)
print("Data Type of date_of_birth column: ",df['date_of_birth'].dtype)

df


Data Type of date_of_birth column:  object



ValueError: time data "2023-01-01" doesn't match format "%d/%m/%Y", at position 0. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

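**Handling Parsing Errors**

You can set the **errors** argument to **'ignore'** or **'coerce'** to avoid raising an error. With 'ignore', the input is returned unchanged, which is why the column below is still of type object; this option is deprecated since pandas 2.2, so recent versions print a warning. With 'coerce', values that cannot be parsed become NaT.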




In [None]:
import pandas as pd
df = pd.DataFrame({
    'patientID':[101,23,48,49],
    'name': ['alice','bob','charlie','keyla'],
    'date_of_birth': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04']
})

print("======= before============")
#print(df.info())
print("Data Type of date_of_birth column: ",df['date_of_birth'].dtype)
#print(df['date_of_birth'])
df


print("\n======= after============")

df['date_of_birth'] = pd.to_datetime(df['date_of_birth'],format='%d/%m/%Y', errors='ignore')
df


print(df.info())
print(df['date_of_birth'].dtype)
print("Data Type of date_of_birth column: ",df['date_of_birth'].dtype)

df

Data Type of date_of_birth column:  object

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   patientID      4 non-null      int64 
 1   name           4 non-null      object
 2   date_of_birth  4 non-null      object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes
None
object
Data Type of date_of_birth column:  object




| | patientID | name | date_of_birth |
|---|---|---|---|
| 0 | 101 | alice | 2023-01-01 |
| 1 | 23 | bob | 2023-01-02 |
| 2 | 48 | charlie | 2023-01-03 |
| 3 | 49 | keyla | 2023-01-04 |
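
For comparison with `errors='ignore'`, here is a small sketch of `errors='coerce'`. The sample Series, including the deliberately invalid entry, is invented for illustration:

```python
import pandas as pd

# Two valid dates and one value that cannot be parsed
dates = pd.Series(['2023-01-01', '2023-01-02', 'not a date'])

# errors='coerce' replaces every unparseable value with NaT
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')

print(parsed)
print("Unparseable rows:", parsed.isna().sum())
```

Unlike 'ignore', the result is always a datetime column, and counting the NaT values with `isna().sum()` shows how many rows failed to parse.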


**Another approach:** We can use the arguments **format='mixed'** and **dayfirst=True** for ambiguous dates, as shown in the following example.

In [None]:
df = pd.DataFrame({
    'patientID': [101, 23, 48, 49],
    'name': ['alice', 'bob', 'charlie', 'Eric'],
    'date_of_birth': ['2023-01-01', '2023-01-02', '3/10/2020 143045', '13th of October, 2023']
})

print("\n====== before convert ============")
print("Original Date Strings:")
print(df)
print("Data Type of date_of_birth column: ", df['date_of_birth'].dtype)
print(type(df["date_of_birth"]))
print(df['date_of_birth'])

print("\n======= after convert============")

# Use format='mixed' and dayfirst=True for ambiguous dates
df['date_of_birth'] = pd.to_datetime(df['date_of_birth'], format='mixed', dayfirst=True, errors='coerce')

print("Data Type of date_of_birth column: ", df['date_of_birth'].dtype)

print("\nConverted DateTime Objects:")
print(df)

**Explanation of Changes**

- **format='mixed':** This tells pd.to_datetime() to infer the format of each date string individually, allowing for different formats within the column.
- **errors='coerce':** This converts any value that cannot be parsed to NaT (Not a Time) instead of raising an error.
- **dayfirst=True:** This treats the first number in an ambiguous date string as the day, so '3/10/2020' is read as 3 October 2020 rather than March 10, 2020.
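
To see the effect of `dayfirst` in isolation, this short sketch parses one ambiguous string both ways. It uses a single string rather than the DataFrame column to keep the comparison simple:

```python
import pandas as pd

ambiguous = '3/10/2020'

# Default (dayfirst=False): read as month/day/year
month_first = pd.to_datetime(ambiguous)

# dayfirst=True: read as day/month/year
day_first = pd.to_datetime(ambiguous, dayfirst=True)

print(month_first.date())  # 2020-03-10
print(day_first.date())    # 2020-10-03
```

Both results are valid dates, so pandas cannot flag the wrong one for you; the right setting depends on where the data came from.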

# **Specifying datetime format when importing a csv file**

**By default, when importing a CSV file into a DataFrame, date/time values are read as strings (object dtype). However, we can convert them to DateTime objects explicitly using parse_dates in pd.read_csv().**


Let’s try to import the CSV dataset into a Pandas DataFrame and check the date's column data types.

To read the date column correctly, we can use the argument parse_dates to specify a list of date columns.

In [None]:
import pandas as pd
url='https://raw.githubusercontent.com/bprasad26/lwd/master/data/tesla_stock_prices.csv'
# Assuming the date format is YYYY-MM-DD
tesla_df = pd.read_csv(url, parse_dates=['Date'], date_format='%Y-%m-%d')
# Display the DataFrame

tesla_df

| | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 0 | 2010-06-29 | 19.000000 | 25.000000 | 17.540001 | 23.889999 | 23.889999 | 18766300 |
| 1 | 2010-06-30 | 25.790001 | 30.420000 | 23.299999 | 23.830000 | 23.830000 | 17187100 |
| 2 | 2010-07-01 | 25.000000 | 25.920000 | 20.270000 | 21.959999 | 21.959999 | 8218800 |
| 3 | 2010-07-02 | 23.000000 | 23.100000 | 18.709999 | 19.200001 | 19.200001 | 5139800 |
| 4 | 2010-07-06 | 20.000000 | 20.000000 | 15.830000 | 16.110001 | 16.110001 | 6866900 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1786 | 2017-08-02 | 318.940002 | 327.119995 | 311.220001 | 325.890015 | 325.890015 | 13091500 |
| 1787 | 2017-08-03 | 345.329987 | 350.000000 | 343.149994 | 347.089996 | 347.089996 | 13535000 |
| 1788 | 2017-08-04 | 347.000000 | 357.269989 | 343.299988 | 356.910004 | 356.910004 | 9198400 |
| 1789 | 2017-08-07 | 357.350006 | 359.480011 | 352.750000 | 355.170013 | 355.170013 | 6276900 |


The import appears to have worked, but the preview alone does not tell us how each column is stored. Let's check the data types of the columns using the **.info()** method and the **dtypes** attribute.

In [None]:
tesla_df.dtypes

| Column | dtype |
|---|---|
| Date | datetime64[ns] |
| Open | float64 |
| High | float64 |
| Low | float64 |
| Close | float64 |
| Adj Close | float64 |
| Volume | int64 |


In [None]:
tesla_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1791 entries, 0 to 1790
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Date       1791 non-null   datetime64[ns]
 1   Open       1791 non-null   float64       
 2   High       1791 non-null   float64       
 3   Low        1791 non-null   float64       
 4   Close      1791 non-null   float64       
 5   Adj Close  1791 non-null   float64       
 6   Volume     1791 non-null   int64         
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 98.1 KB


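We can see that the data type of the Date column is datetime64[ns], because parse_dates converted it during import. If a date column is read as object instead (for example, when parse_dates is omitted), its values are stored as strings, and you can't access the DateTime functionality available in Pandas until you convert it.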
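
The effect of `parse_dates` can be checked without downloading anything. The sketch below uses a hypothetical two-row CSV held in memory, not the lab's dataset, and reads it with and without the argument:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file with a Date column
csv_text = "Date,Close\n2010-06-29,1.0\n2010-06-30,2.0\n"

as_text = pd.read_csv(io.StringIO(csv_text))
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

# Without parse_dates the column holds strings; with it, datetimes
print(pd.api.types.is_datetime64_any_dtype(as_text['Date']))   # False
print(pd.api.types.is_datetime64_any_dtype(as_dates['Date']))  # True
```

`is_datetime64_any_dtype` is used instead of comparing dtype names, since the exact name of the string and datetime dtypes can differ between pandas versions.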

## **Using to_datetime to Convert Columns to DateTime**

The pandas `to_datetime()` function converts a date/time value stored in a DataFrame column into a DateTime object. Having date/time values as DateTime objects makes manipulating them much easier. Run the following statement and see the changes:

In [None]:
# Convert the 'Date' column to pandas DateTime format
tesla_df['Date'] = pd.to_datetime(tesla_df['Date'], errors='coerce')
tesla_df['Date']

| | Date |
|---|---|
| 0 | 2010-06-29 |
| 1 | 2010-06-30 |
| 2 | 2010-07-01 |
| 3 | 2010-07-02 |
| 4 | 2010-07-06 |
| ... | ... |
| 1786 | 2017-08-02 |
| 1787 | 2017-08-03 |
| 1788 | 2017-08-04 |
| 1789 | 2017-08-07 |


In [None]:
# Display the DataFrame with the converted 'Date' column
print("\nDataFrame with Converted 'Date' Column:")
print(tesla_df.head())

**Let's verify the data type of the Date column**

In [None]:
print(tesla_df['Date'].dtype)

In [None]:
tesla_df.dtypes

In [None]:
tesla_df.info()

## **Extract month and year from the 'Date' column**



In [None]:
tesla_df['Month'] = tesla_df['Date'].dt.month
print("\n  Month:")
tesla_df['Month'].head(10)




### **Explanation:**
`tesla_df['Date']` : This accesses the Date column in the DataFrame.

`.dt` : This is a datetime accessor that lets you extract date/time properties (like .year, .month, .day, etc.) from a column of datetime64 objects.

`.dt.month` : This extracts the month component (as an integer from 1 to 12) from each date.

`tesla_df['Month']` : This creates a new column in the DataFrame named Month, storing the extracted month values.


---



In [None]:
tesla_df['Year'] = tesla_df['Date'].dt.year # .dt.year is used to get the year of the date column
print("\n  Year:")
tesla_df['Year'].head(10)

### **Explanation:**
`tesla_df['Date']` : Accesses the Date column in the DataFrame, which must be in datetime64 format.

`.dt`: A datetime accessor that provides properties of date/time values.

`.dt.year` : Extracts the year component from each date.

`tesla_df['Year']` : Creates a new column called Year that holds the extracted year values.

---



In [None]:
tesla_df['day_name_sample'] = tesla_df['Date'].dt.day_name()
tesla_df['day_name_sample'].head(20)

### **Explanation:**
`tesla_df['Date']` : Accesses the Date column from the tesla_df DataFrame.

`.dt` : A datetime accessor used to extract components of date/time values from a column of datetime64 objects.

`.dt.day_name()` : Returns the full weekday name (e.g., 'Monday', 'Tuesday') for each date in the Date column.

`tesla_df['day_name_sample']` : Creates a new column called day_name_sample in the DataFrame, storing the name of the day for each corresponding date.
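
The `.dt` accessor offers more than month, year, and day name. As a hedged sketch on a small made-up Series (not the lab's DataFrame), a few other commonly used members look like this:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2010-06-29', '2017-08-07']))

parts = pd.DataFrame({
    'day': dates.dt.day,                  # day of the month
    'quarter': dates.dt.quarter,          # 1 to 4
    'weekday': dates.dt.weekday,          # Monday=0 ... Sunday=6
    'label': dates.dt.strftime('%b %Y'),  # formatted string
})
print(parts)
```

Note that `.dt.strftime()` returns strings, so its result is meant for display and no longer supports date arithmetic.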


---



Similarly, you can access different calculated attributes. For example, you can calculate the largest and smallest dates using the .max() and .min() methods. Let’s see what this looks like:

In [None]:
# Calculating Max and Min DateTimes
print(tesla_df['Date'].max())
print(tesla_df['Date'].min())
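
Because the values are true datetimes, the maximum and minimum can also be subtracted. A minimal sketch, using a few sample dates rather than the full dataset:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2010-06-29', '2010-07-06', '2017-08-07']))

# Subtracting two Timestamps gives a Timedelta
span = dates.max() - dates.min()

print(span)
print(span.days)  # length of the covered period in whole days
```

The same subtraction on string columns would fail, which is one practical reason to convert dates before analysis.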

### **Group by month and sum the volume for each month**

In [None]:
# Sum the traded volume per calendar month (Year and Month were extracted earlier)
monthly_volume = tesla_df.groupby(['Year', 'Month'])['Volume'].sum().reset_index()
monthly_volume
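
Grouping by separate Year and Month columns is one option. An alternative sketch, shown on a hypothetical three-row DataFrame, groups directly on the date column with `dt.to_period('M')`, so no helper columns are needed:

```python
import pandas as pd

# Hypothetical miniature of the stock data; the volumes are made up
df = pd.DataFrame({
    'Date': pd.to_datetime(['2010-06-29', '2010-06-30', '2010-07-01']),
    'Volume': [100, 200, 50],
})

# to_period('M') labels each date with its calendar month, year included
monthly_volume = df.groupby(df['Date'].dt.to_period('M'))['Volume'].sum()
print(monthly_volume)
```

The result is indexed by monthly periods such as 2010-06, which keeps the same month of different years apart automatically.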


## **Filtering Based on Date**

**Let's filter rows by date using loc[] and query().**


In [None]:
# Filter data between two dates
filtered_df_twoDate = tesla_df.loc[(tesla_df['Date'] >= '2010-02-01') & (tesla_df['Date'] < '2010-12-01')]
# Display
filtered_df_twoDate

| | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 0 | 2010-06-29 | 19.000000 | 25.000000 | 17.540001 | 23.889999 | 23.889999 | 18766300 |
| 1 | 2010-06-30 | 25.790001 | 30.420000 | 23.299999 | 23.830000 | 23.830000 | 17187100 |
| 2 | 2010-07-01 | 25.000000 | 25.920000 | 20.270000 | 21.959999 | 21.959999 | 8218800 |
| 3 | 2010-07-02 | 23.000000 | 23.100000 | 18.709999 | 19.200001 | 19.200001 | 5139800 |
| 4 | 2010-07-06 | 20.000000 | 20.000000 | 15.830000 | 16.110001 | 16.110001 | 6866900 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 103 | 2010-11-23 | 33.290001 | 35.680000 | 32.189999 | 34.570000 | 34.570000 | 1577800 |
| 104 | 2010-11-24 | 35.270000 | 35.970001 | 34.330002 | 35.470001 | 35.470001 | 1425000 |
| 105 | 2010-11-26 | 35.599998 | 36.000000 | 34.750000 | 35.320000 | 35.320000 | 350600 |
| 106 | 2010-11-29 | 35.410000 | 35.950001 | 33.330002 | 34.330002 | 34.330002 | 1145600 |


In [None]:
# Filter data between two dates using query
filtered_df_twoDate = tesla_df.query("Date >= '2010-02-01' and Date < '2010-12-01'")
filtered_df_twoDate
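
A third way to express the same range filter is `Series.between()`. This sketch uses a hypothetical three-row DataFrame with placeholder prices, and assumes a pandas version where `inclusive` accepts a string (1.3 or later):

```python
import pandas as pd

# Hypothetical rows; the Close values are placeholders
df = pd.DataFrame({
    'Date': pd.to_datetime(['2010-06-29', '2010-11-30', '2010-12-01']),
    'Close': [1.0, 2.0, 3.0],
})

start = pd.Timestamp('2010-02-01')
end = pd.Timestamp('2010-12-01')

# inclusive='left' keeps start <= Date < end, like the loc[] filter
in_range = df[df['Date'].between(start, end, inclusive='left')]
print(in_range)
```

The row dated 2010-12-01 is dropped because the right bound is excluded, matching the `<` comparison used with loc[] and query().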

## **Filter data for specific weekday (Wednesday)**

In [None]:

# Weekday numbers run from Monday=0 to Sunday=6, so 2 selects Wednesdays
filtered_df_week = tesla_df.loc[tesla_df['Date'].dt.weekday == 2]
filtered_df_week
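
If the numeric weekday code feels cryptic, the same filter can be written with `dt.day_name()`. A small sketch on made-up dates, checking that both forms select the same rows:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2010-06-29', '2010-06-30', '2010-07-07']),
})

# Filter by weekday number (Monday=0, so Wednesday=2)
by_number = df[df['Date'].dt.weekday == 2]

# Filter by weekday name
by_name = df[df['Date'].dt.day_name() == 'Wednesday']

print(by_name)
print(by_number.equals(by_name))
```

The name-based version is easier to read, while the numeric version avoids string comparisons and does not depend on the language of the day names.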


Reference: [Official documentation](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html)

# **Submission Instructions:**
- Submit your completed lab using the Start Assignment button on the assignment page in Canvas.
- Your submission can include:
  - If you are using a notebook, all tasks should be written and submitted in a single notebook file, for example: (**your_name_labname.ipynb**).
  - If you are using a Python script, all tasks should be written and submitted in a single Python script file, for example: (**your_name_labname.py**).
- Add appropriate comments and any additional instructions if required.
