# About
- **By**: Tsombou Christian
- **@** : tsombouchris@gmail.com
- **linkedIn**: https://www.linkedin.com/in/tsombouchris/

This notebook cleans the load shedding schedule dataset for Chennai, covering 2013 to 2022.

- **Data Source**: [livechennai website](https://www.livechennai.com/powercut_schedule.asp)
- **Data scraped by**: Tsombou Christian (GitHub repo for the web scraping notebook)
- **Reference**: 
- **Input Dataset**: 2524 rows and 8 columns
- **Output Dataset**: 2524 rows and 11 columns

_Process_:

The data cleaning consisted of:

- checking the unique posting dates for cleaning
- converting posting dates to datetime
- checking the unique values of load shedding dates for cleaning
- replacing the dirty load shedding dates with the posting date plus one day
- converting the load shedding date column to datetime
- inferring the day of the week for load shedding from the load shedding date
- verifying that the load shedding day column is clean by checking unique values
- checking unique values of load shedding start time for cleaning
- replacing incorrect start times
- converting the start time column to time
- checking unique values of load shedding end time for cleaning
- replacing incorrect end times
- creating a column for load shedding duration


NB: This Notebook was produced within the context of The Omdena [Chennai Chapter project on Electricity Power Outage Analysis](https://github.com/OmdenaAI/chennai-india-power-outage)

In [1]:
#   import the necessary libraries for EDA
import pandas as pd
import numpy as np
import seaborn as sns ; sns.set(font_scale=1)
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.dates import DateFormatter
import datetime as dt
from datetime import timedelta
import calendar
import re
%matplotlib inline

In [2]:
po_schedule_chennai = pd.read_csv('17.07.22_power_outages_chennai_located.csv')

In [3]:
po_schedule_chennai

Unnamed: 0,PO_day,PO_date,PO_start_time,PO_end_time,PO_posting_date,PO_posting_time,PO_locations_link,location
0,Tuesday,19/07/2022,09.00 am,02.00 pm,18/Jul/2022,9:26:34 AM,https://www.livechennai.com/detailnews.asp?new...,TAMBARAM
1,Tuesday,19/07/2022,09.00 am,02.00 pm,18/Jul/2022,9:26:34 AM,https://www.livechennai.com/detailnews.asp?new...,T. NAGAR
2,Tuesday,19/07/2022,09.00 am,02.00 pm,18/Jul/2022,9:26:34 AM,https://www.livechennai.com/detailnews.asp?new...,ADYAR
3,Saturday,16/07/2022,9 am,"2 pm,",16/Jul/2022,8:47:41 AM,https://www.livechennai.com/detailnews.asp?new...,Egmore
4,Saturday,16/07/2022,9 am,"2 pm,",16/Jul/2022,8:47:41 AM,https://www.livechennai.com/detailnews.asp?new...,Tambaram
...,...,...,...,...,...,...,...,...
2519,Aug,2018,9 am,4 pm:,22/Aug/2018,9:45:26 AM,https://www.livechennai.com/detailnews.asp?new...,Shastri Nagar
2520,Aug,2018,9 am,4 pm:,22/Aug/2018,9:45:26 AM,https://www.livechennai.com/detailnews.asp?new...,Ayanavaram
2521,Aug,2018,9 am,4 pm:,22/Aug/2018,9:45:26 AM,https://www.livechennai.com/detailnews.asp?new...,Tagore Nagar
2522,Jul,2018,9 am,4 pm,23/Jul/2018,10:03:05 AM,https://www.livechennai.com/detailnews.asp?new...,Kovoor


In [4]:
# checking the unique dates for cleaning
pd.unique(po_schedule_chennai.PO_posting_date)

array(['18/Jul/2022', '16/Jul/2022', '14/Jul/2022', '12/Jul/2022',
       '11/Jul/2022', '07/Jul/2022', '05/Jul/2022', '04/Jul/2022',
       '01/Jul/2022', '30/Jun/2022', '28/Jun/2022', '27/Jun/2022',
       '26/Jun/2022', '25/Jun/2022', '23/Jun/2022', '22/Jun/2022',
       '21/Jun/2022', '20/Jun/2022', '18/Jun/2022', '15/Jun/2022',
       '14/Jun/2022', '13/Jun/2022', '11/Jun/2022', '09/Jun/2022',
       '07/Jun/2022', '06/Jun/2022', '05/Jun/2022', '09/Apr/2022',
       '05/Apr/2022', '04/Apr/2022', '01/Apr/2022', '31/Mar/2022',
       '30/Mar/2022', '29/Mar/2022', '28/Mar/2022', '25/Mar/2022',
       '24/Mar/2022', '18/Mar/2022', '17/Mar/2022', '16/Mar/2022',
       '15/Mar/2022', '14/Mar/2022', '11/Mar/2022', '10/Mar/2022',
       '08/Mar/2022', '07/Mar/2022', '03/Mar/2022', '02/Mar/2022',
       '01/Mar/2022', '28/Feb/2022', '26/Feb/2022', '24/Feb/2022',
       '21/Feb/2022', '28/Jan/2022', '26/Jan/2022', '24/Jan/2022',
       '22/Jan/2022', '20/Jan/2022', '18/Jan/2022', '15/Jan/20

In [5]:
# converting posting dates to datetime
po_schedule_chennai.PO_posting_date = pd.to_datetime(po_schedule_chennai.PO_posting_date)
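
The posting dates listed above all follow a day/abbreviated-month/year pattern. One option is to pass an explicit `format`, so that a value that deviates raises an error instead of being guessed. This is a minimal sketch on a hypothetical `sample` Series, not the real column, and it assumes an English locale for the month abbreviations:

```python
import pandas as pd

# Hypothetical sample mimicking the PO_posting_date strings shown above
sample = pd.Series(['18/Jul/2022', '05/Jun/2022', '28/Feb/2022'])

# %d = day of month, %b = abbreviated month name, %Y = four-digit year
parsed = pd.to_datetime(sample, format='%d/%b/%Y')
print(parsed)
```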

In [6]:
# checking the unique values for load shedding dates for cleaning
pd.unique(po_schedule_chennai.PO_date)

array(['19/07/2022', '16/07/2022', '15/07/2022', '14/07/2022',
       '13/07/2022', '12/07/2022', '11/07/2022', '08/07/2022',
       '07/07/2022', '06/07/2022', '05/07/2022', '04/07/2022',
       '02/07/2022', '01/07/2022', '30/06/2022', '29/06/2022',
       '28/06/2022', '27/06/2022', '25/06/2022', 'Friday(24/06/2022',
       '23/06/2022', '22/06/2022', '21/06/2022', '20/06/2022',
       '18/06/2022', '17/06/2022', '16/06/2022', '15/06/2022',
       '14/06/2022', '13/06/2022', '11/06/2022', '10/06/2022',
       '09/06/2022', '07/06/2022', '08/06/2022', '06/05/2022',
       '09/04/2022', '06/04/2022', '05/04/2022', '04/04/2022',
       '02/04/2022', '01/04/2022', '31/03/2022', '30/03/2022',
       '29/03/2022', '26/03/2022', '25/03/2022', '19/03/2022',
       '18/03/2022', '17/03/2022', '16/03/2022', '15/03/2022',
       '14/03/2022', '12/03/2022', '11/03/2022', '10/03/2022',
       '09/03/2022', '08/03/2022', 'Monday(07/03/2022', '04/03/2022',
       '03/03/2022', '02/03/2022', '01/03

In [7]:
# keep only the last 10 characters, which drops prefixes such as 'Friday('
po_schedule_chennai['PO_date'] = po_schedule_chennai['PO_date'].str[-10:]
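
Slicing the last 10 characters works when the date sits at the end of the string. A pattern-based alternative is sketched below: it pulls out a `dd/mm/yyyy` token wherever it appears and leaves anything else as missing. The `raw_dates` sample is hypothetical and only mimics the kinds of values seen in this column.

```python
import pandas as pd

# Hypothetical sample: clean dates, prefixed dates, and unusable fragments
raw_dates = pd.Series(['19/07/2022', 'Friday(24/06/2022', 'Monday(07/03/2022', '2018', 'day'])

# Capture a dd/mm/yyyy token; values without one become NaN
extracted = raw_dates.str.extract(r'(\d{2}/\d{2}/\d{4})', expand=False)
print(extracted)
```

Values that come back as missing can then be handled explicitly rather than being carried forward as malformed strings.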

In [8]:
# verify the cleaning for the dates
pd.unique(po_schedule_chennai.PO_date)

array(['19/07/2022', '16/07/2022', '15/07/2022', '14/07/2022',
       '13/07/2022', '12/07/2022', '11/07/2022', '08/07/2022',
       '07/07/2022', '06/07/2022', '05/07/2022', '04/07/2022',
       '02/07/2022', '01/07/2022', '30/06/2022', '29/06/2022',
       '28/06/2022', '27/06/2022', '25/06/2022', '24/06/2022',
       '23/06/2022', '22/06/2022', '21/06/2022', '20/06/2022',
       '18/06/2022', '17/06/2022', '16/06/2022', '15/06/2022',
       '14/06/2022', '13/06/2022', '11/06/2022', '10/06/2022',
       '09/06/2022', '07/06/2022', '08/06/2022', '06/05/2022',
       '09/04/2022', '06/04/2022', '05/04/2022', '04/04/2022',
       '02/04/2022', '01/04/2022', '31/03/2022', '30/03/2022',
       '29/03/2022', '26/03/2022', '25/03/2022', '19/03/2022',
       '18/03/2022', '17/03/2022', '16/03/2022', '15/03/2022',
       '14/03/2022', '12/03/2022', '11/03/2022', '10/03/2022',
       '09/03/2022', '08/03/2022', '07/03/2022', '04/03/2022',
       '03/03/2022', '02/03/2022', '01/03/2022', '28/02

In [9]:
# Checking if there are still inconsistent load shedding dates
clean_PO_dates_bool = po_schedule_chennai.PO_date.apply(lambda x: len(x) == 10)
po_schedule_chennai[~clean_PO_dates_bool]

Unnamed: 0,PO_day,PO_date,PO_start_time,PO_end_time,PO_posting_date,PO_posting_time,PO_locations_link,location
1126,20-04-2021,day,09.00 am,05.00 pm,2021-04-19,6:18:19 PM,https://www.livechennai.com/detailnews.asp?new...,Avadi
1127,20-04-2021,day,09.00 am,05.00 pm,2021-04-19,6:18:19 PM,https://www.livechennai.com/detailnews.asp?new...,Sembium
1128,20-04-2021,day,09.00 am,05.00 pm,2021-04-19,6:18:19 PM,https://www.livechennai.com/detailnews.asp?new...,Tambaram
1129,20-04-2021,day,09.00 am,05.00 pm,2021-04-19,6:18:19 PM,https://www.livechennai.com/detailnews.asp?new...,Puzhal
1193,(Feb,05,09.00 am,02.00 pm,2021-02-03,2:36:25 PM,https://www.livechennai.com/detailnews.asp?new...,BESANT NAGAR
...,...,...,...,...,...,...,...,...
2519,Aug,2018,9 am,4 pm:,2018-08-22,9:45:26 AM,https://www.livechennai.com/detailnews.asp?new...,Shastri Nagar
2520,Aug,2018,9 am,4 pm:,2018-08-22,9:45:26 AM,https://www.livechennai.com/detailnews.asp?new...,Ayanavaram
2521,Aug,2018,9 am,4 pm:,2018-08-22,9:45:26 AM,https://www.livechennai.com/detailnews.asp?new...,Tagore Nagar
2522,Jul,2018,9 am,4 pm,2018-07-23,10:03:05 AM,https://www.livechennai.com/detailnews.asp?new...,Kovoor
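
The length test above flags strings that are not 10 characters long, but a 10-character string is not necessarily a valid date. A stricter check, sketched here on a hypothetical `toy` frame, is to attempt the parse and treat every failure as dirty:

```python
import pandas as pd

# Hypothetical sample; '31/02/2022' has 10 characters but is not a real date
toy = pd.DataFrame({'PO_date': ['19/07/2022', 'day', '2018', '31/02/2022', '05']})

# errors='coerce' turns anything that does not match the format into NaT
parsed = pd.to_datetime(toy['PO_date'], format='%d/%m/%Y', errors='coerce')
is_dirty = parsed.isna()
print(toy[is_dirty])
```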


In [10]:
# Replace the dirty load shedding dates with the posting date plus one day
dirty_PO_date_index = po_schedule_chennai[~clean_PO_dates_bool].index
for i in dirty_PO_date_index:
    po_schedule_chennai.loc[i, 'PO_date'] = po_schedule_chennai.loc[i, 'PO_posting_date'] + timedelta(days=1)
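
The same fallback rule (posting date plus one day) can also be written without a loop. This is a sketch on a hypothetical `toy` frame, and the `PO_date_clean` column name is an assumption used for illustration only:

```python
import pandas as pd

# Hypothetical sample with one clean date and two dirty ones
toy = pd.DataFrame({
    'PO_date': ['19/07/2022', 'day', '2018'],
    'PO_posting_date': pd.to_datetime(['2022-07-18', '2021-04-19', '2018-08-22']),
})

# Parse what can be parsed; unparseable values become NaT
parsed = pd.to_datetime(toy['PO_date'], format='%d/%m/%Y', errors='coerce')

# Fill the gaps with the posting date shifted by one day
fallback = toy['PO_posting_date'] + pd.Timedelta(days=1)
toy['PO_date_clean'] = parsed.fillna(fallback)
print(toy)
```

Writing to a separate column keeps the raw strings available for inspection, which helps when checking how many rows relied on the fallback.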



In [11]:
# Convert load shedding date column to datetime; the strings are dd/mm/yyyy, so parse day first
po_schedule_chennai.PO_date = pd.to_datetime(po_schedule_chennai.PO_date, dayfirst=True)
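
Because the load shedding dates are written as `dd/mm/yyyy`, any date whose day is 12 or lower is ambiguous. The sketch below uses one such string to show how the parsing options differ:

```python
import pandas as pd

ambiguous = '12/07/2022'  # 12 July 2022 in the dd/mm/yyyy convention

month_first = pd.to_datetime(ambiguous)                    # default: month first
day_first = pd.to_datetime(ambiguous, dayfirst=True)       # day first
explicit = pd.to_datetime(ambiguous, format='%d/%m/%Y')    # explicit format

print(month_first.date(), day_first.date(), explicit.date())
```

An explicit `format` is the strictest of the three, since it rejects strings that do not match instead of guessing.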

In [12]:
# inferring the day of the week from the load shedding date, since the PO_day column is dirty
po_schedule_chennai.PO_day = po_schedule_chennai.PO_date.dt.day_name()

In [13]:
# verify that the load shedding day column is clean by checking unique values
pd.unique(po_schedule_chennai.PO_day)


array(['Tuesday', 'Saturday', 'Friday', 'Thursday', 'Wednesday', 'Monday',
       'Sunday'], dtype=object)

In [14]:
# Checking unique values of load shedding start time for cleaning
pd.unique(po_schedule_chennai.PO_start_time)

array(['09.00 am', '9 am', '09:00 AM', 'from9.00 am', '10.00 am',
       '06.00 am', '9.00 AM', '9.00 A.M.'], dtype=object)

In [17]:
# replace inconsistent start time spellings with a single 'HH:MM AM' form
to_replace_dic = {
    '09.00 am': '09:00 AM', '9 am': '09:00 AM', 'from9.00 am': '09:00 AM',
    '9.00 AM': '09:00 AM', '9.00 A.M.': '09:00 AM',
    '10.00 am': '10:00 AM', '06.00 am': '06:00 AM',
}
# assign the result back; replace(inplace=True) on an attribute may act on a temporary copy
po_schedule_chennai['PO_start_time'] = po_schedule_chennai['PO_start_time'].replace(to_replace_dic)

# convert the column to datetime (the date part defaults to 1900-01-01)
po_schedule_chennai['PO_start_time'] = pd.to_datetime(po_schedule_chennai['PO_start_time'], format='%I:%M %p')


In [18]:
# Checking unique values of load shedding end time for cleaning
pd.unique(po_schedule_chennai.PO_end_time)

array(['02.00 pm', '2 pm,', '05.00 pm', '04.00 pm', '2 pm', '5 pm',
       '01.00 pm', '12.00 pm', '02:00 PM', '05:00 PM', '4 pm.', '4 pm',
       '4.00 pm,', '4 pm:', '02.00pm.', '5.00 PM', '2.00 PM', '2.00 P.M.',
       '5.00 P.M.'], dtype=object)

In [21]:
# replace inconsistent end time spellings with a single 'HH:MM PM' form (duplicate keys removed)
to_replace_dic2 = {
    '02.00 pm': '02:00 PM', '2 pm,': '02:00 PM', '2 pm': '02:00 PM',
    '2.00 PM': '02:00 PM', '02.00pm.': '02:00 PM', '2.00 P.M.': '02:00 PM',
    '05.00 pm': '05:00 PM', '5 pm': '05:00 PM', '5.00 PM': '05:00 PM', '5.00 P.M.': '05:00 PM',
    '04.00 pm': '04:00 PM', '4 pm.': '04:00 PM', '4 pm': '04:00 PM',
    '4.00 pm,': '04:00 PM', '4 pm:': '04:00 PM',
    '01.00 pm': '01:00 PM', '12.00 pm': '12:00 PM',
}
# assign the result back; replace(inplace=True) on an attribute may act on a temporary copy
po_schedule_chennai['PO_end_time'] = po_schedule_chennai['PO_end_time'].replace(to_replace_dic2)

# convert the column to datetime (the date part defaults to 1900-01-01)
po_schedule_chennai['PO_end_time'] = pd.to_datetime(po_schedule_chennai['PO_end_time'], format='%I:%M %p')
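
A hand-written mapping has to be extended whenever a new spelling appears in the start or end times. A pattern-based normaliser is one alternative, sketched below. `normalize_time` and `TIME_PATTERN` are hypothetical names, and the pattern is only known to cover the spellings seen in this dataset.

```python
import re
import pandas as pd

# hour, optional minutes after '.' or ':', then am/pm with optional dots
TIME_PATTERN = re.compile(r'(\d{1,2})(?:[.:](\d{2}))?\s*([ap])\.?m', re.IGNORECASE)

def normalize_time(raw):
    """Hypothetical helper: return 'HH:MM AM/PM', or None if nothing matches."""
    match = TIME_PATTERN.search(raw)
    if match is None:
        return None
    hour, minute, meridiem = match.groups()
    return f"{int(hour):02d}:{minute or '00'} {meridiem.upper()}M"

# Sample spellings taken from the unique values listed above
samples = pd.Series(['09.00 am', 'from9.00 am', '9.00 A.M.', '2 pm,', '02.00pm.', '4 pm:'])
normalized = samples.map(normalize_time)
times = pd.to_datetime(normalized, format='%I:%M %p')
print(pd.DataFrame({'raw': samples, 'normalized': normalized, 'hour': times.dt.hour}))
```

Strings that do not match return `None`, so unexpected spellings show up as missing values rather than passing through unchanged.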


In [22]:
# Create a column for load shedding duration in hours (vectorised instead of a row-by-row loop)
po_schedule_chennai['PO_duration(Hrs)'] = (po_schedule_chennai.PO_end_time - po_schedule_chennai.PO_start_time).dt.seconds / 3600.0
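
The duration above relies on the `.seconds` attribute of a timedelta, which is always between 0 and 86399. That matters only if an end time is earlier than its start time. None of the values shown in this notebook do that, so the overnight case below is purely hypothetical:

```python
import pandas as pd

# Hypothetical overnight outage: starts at 22:00 and ends at 02:00
start = pd.Timestamp('1900-01-01 22:00')
end = pd.Timestamp('1900-01-01 02:00')
delta = end - start  # -20 hours, stored as -1 day + 4 hours

wrapped_hours = delta.seconds / 3600.0         # uses only the 0..86399 seconds component
signed_hours = delta.total_seconds() / 3600.0  # keeps the sign of the difference

print(wrapped_hours, signed_hours)
```

With `.seconds` the overnight case wraps around to a positive value, whereas `total_seconds()` makes a reversed start and end visible as a negative duration.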


In [23]:
# Create two columns holding only the time of day for load shedding start and end
po_schedule_chennai['PO_start_time_short'] = po_schedule_chennai.PO_start_time.dt.time
po_schedule_chennai['PO_end_time_short'] = po_schedule_chennai.PO_end_time.dt.time

In [130]:
# convert posting time column to time (left disabled; values such as '9:26:34 AM' include seconds, so the format needs %S)
#po_schedule_chennai.PO_posting_time = po_schedule_chennai.PO_posting_time.apply(lambda x: dt.datetime.strptime(x, '%I:%M:%S %p'))

In [24]:
po_schedule_chennai.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2524 entries, 0 to 2523
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   PO_day               2524 non-null   object        
 1   PO_date              2524 non-null   datetime64[ns]
 2   PO_start_time        2524 non-null   datetime64[ns]
 3   PO_end_time          2524 non-null   datetime64[ns]
 4   PO_posting_date      2524 non-null   datetime64[ns]
 5   PO_posting_time      2524 non-null   object        
 6   PO_locations_link    2524 non-null   object        
 7   location             2514 non-null   object        
 8   PO_duration(Hrs)     2524 non-null   float64       
 9   PO_start_time_short  2524 non-null   object        
 10  PO_end_time_short    2524 non-null   object        
dtypes: datetime64[ns](4), float64(1), object(6)
memory usage: 217.0+ KB
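
The summary above shows 2514 non-null values for `location` out of 2524 rows, so a few quick consistency checks may be worth running before export. The `check_cleaned` helper below is a hypothetical sketch, shown on a two-row `toy` frame that reuses the column names from this notebook:

```python
import pandas as pd

def check_cleaned(df):
    """Hypothetical helper: return a few simple consistency checks as a dict."""
    return {
        # the stored day name should agree with the parsed date
        'day_matches_date': bool((df['PO_day'] == df['PO_date'].dt.day_name()).all()),
        # every outage should last longer than zero hours
        'duration_positive': bool((df['PO_duration(Hrs)'] > 0).all()),
        # number of rows without a location
        'missing_locations': int(df['location'].isna().sum()),
    }

toy = pd.DataFrame({
    'PO_day': ['Tuesday', 'Saturday'],
    'PO_date': pd.to_datetime(['2022-07-19', '2022-07-16']),
    'PO_duration(Hrs)': [5.0, 5.0],
    'location': ['TAMBARAM', None],
})
report = check_cleaned(toy)
print(report)
```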


In [25]:
po_schedule_chennai.to_csv('located_dayly_load_shedding_schedule_chennai_2014_2022.csv', index=False)
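
CSV files do not store column types, so the datetime columns come back as plain text unless they are parsed again on load. The sketch below shows the round trip on a hypothetical `toy` frame, with an in-memory buffer standing in for the exported file:

```python
import io
import pandas as pd

toy = pd.DataFrame({
    'PO_date': pd.to_datetime(['2022-07-19']),
    'PO_posting_date': pd.to_datetime(['2022-07-18']),
    'location': ['TAMBARAM'],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading without parse_dates leaves the date columns as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates restores the datetime dtype for the listed columns
buffer.seek(0)
typed = pd.read_csv(buffer, parse_dates=['PO_date', 'PO_posting_date'])

print(plain.dtypes)
print(typed.dtypes)
```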