# Cleaning dates in pandas

Dates often arrive in many different formats. Pandas has a helpful function, pd.to_datetime, which converts dates stored as strings into datetime values. Getting dates into a datetime format is worthwhile because it makes them far easier to sort, filter and do arithmetic on.

https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html




Let's look at a few examples of using pd.to_datetime, since it may be all you need early on, without writing very customised functions. The format codes below will help us:

- %Y: Four-digit year
- %y: Two-digit year
- %m: Two-digit month [01, 12]
- %d: Two-digit day [01, 31]
- %H: Hour (24-hour clock) [00, 23]
- %I: Hour (12-hour clock) [01, 12]
- %M: Two-digit minute [00, 59]
- %S: Second [00, 61]
- %f: Microsecond [000000, 999999]
- %b: Abbreviated month name (e.g. Jan)
- %B: Full month name (e.g. January)

This link gives us even more format codes:
https://dataindependent.com/pandas/pandas-to-datetime-string-to-date-pd-to_datetime/



In [44]:
import pandas as pd
date_1 = "01-03-18"
pd.to_datetime(date_1, format = "%d-%m-%y")

Timestamp('2018-03-01 00:00:00')

Notice that if I put the format in wrong, it throws an error

In [46]:
pd.to_datetime(date_1, format = "%d-%m-%Y") # raises ValueError: %Y expects a four-digit year, but the string ends in "18"
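
If a column may contain values that do not match the format, passing errors="coerce" is one way to avoid the exception: values that cannot be parsed become NaT (Not a Time) instead. A minimal sketch, reusing the date_1 string from above; the names strict_failed and coerced are only for illustration.

```python
import pandas as pd

date_1 = "01-03-18"

# With the default errors="raise", a mismatched format raises ValueError.
try:
    pd.to_datetime(date_1, format="%d-%m-%Y")
    strict_failed = False
except ValueError:
    strict_failed = True

# errors="coerce" returns NaT instead of raising.
coerced = pd.to_datetime(date_1, format="%d-%m-%Y", errors="coerce")

print(strict_failed, coerced)
```

This is handy when cleaning a whole column, because the rows that failed can be found afterwards with isna().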

Example 2

In [49]:
date_2 = "15/Jan/2020"
pd.to_datetime(date_2, format = "%d/%b/%Y")

Timestamp('2020-01-15 00:00:00')

Example 3

In [51]:
date_3 = "1/March/2020"
pd.to_datetime(date_3, format = "%d/%B/%Y")

Timestamp('2020-03-01 00:00:00')

Example 4

In [52]:
date_4 = "3rd april 1954"
pd.to_datetime(date_4, format = "%drd %B %Y")

Timestamp('1954-04-03 00:00:00')
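
The format in Example 4 only matches days ending in rd. One possible way to handle st, nd, rd and th with a single format is to strip the suffix first. The sketch below assumes English month names (the behaviour of %B depends on the locale), and strip_ordinal_suffix is a hypothetical helper name, not a pandas function.

```python
import re
import pandas as pd

def strip_ordinal_suffix(text):
    """Remove st/nd/rd/th only when it directly follows a digit."""
    return re.sub(r"(?<=\d)(?:st|nd|rd|th)\b", "", text)

examples = ["3rd April 1954", "21st March 2020", "2nd August 2018"]

# After stripping, every example fits the same "%d %B %Y" format.
parsed = [
    pd.to_datetime(strip_ordinal_suffix(s), format="%d %B %Y")
    for s in examples
]
print(parsed)
```

The lookbehind for a digit keeps the "st" inside a word like August untouched.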

### Real date data

Often the problem with real data is that dates arrive in a variety of formats, and often in a complete mess. Below we walk through the process I went through to clean some date time data (by no means the most efficient one). In hindsight, using pd.to_datetime could have simplified parts of the analysis below, so bear that in mind.
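
As a rough idea of what a pd.to_datetime based approach might look like, the sketch below tries a short list of formats in turn and keeps the first one that parses. This is not the method used in the rest of this walkthrough; the function name try_formats and the list CANDIDATE_FORMATS are assumptions made for illustration, and the formats would need adjusting to the data at hand.

```python
import pandas as pd

# Assumed list of formats, tried in order; extend it to suit your data.
CANDIDATE_FORMATS = ["%d/%m/%y", "%d.%b.%y", "%d %b %Y"]

def try_formats(text, formats=CANDIDATE_FORMATS):
    """Return the first successful parse, or NaT if no format matches."""
    for fmt in formats:
        parsed = pd.to_datetime(text, format=fmt, errors="coerce")
        if not pd.isna(parsed):
            return parsed
    return pd.NaT

for raw in ["27/6/13", "07.Oct.18", "12 Jan 2021", "day/month/yr"]:
    print(raw, "->", try_formats(raw))
```

Anything that falls through to NaT can then be inspected by hand, much like the flag column built later on.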

In [1]:
import pandas as pd

Only Dates is a spreadsheet that has had all the other data removed, leaving just the date columns. We are going to use this spreadsheet to explore the data in the date columns.

In [2]:
xl= pd.read_excel('OnlyDates.xlsx', sheet_name = None)

sheet_names = xl.keys() 
sheet_names

dict_keys(['2021', '2019', '2018', '2017', '2016', '2015', '2014', '2013', '2012'])

In [3]:
column_of_dates = []
years = []
for name in sheet_names:
    df = pd.read_excel('OnlyDates.xlsx', sheet_name = name)
    for i in range(len(df)):
        column_of_dates.append(df.values[i][0])
        years.append(int(name))
    

I now have two lists. One contains all the dates across the various sheets. The other uses the sheet name to give the year for each of those dates. I will then make a dataframe from these two lists.

In [6]:
df = pd.DataFrame({'raw_date' : column_of_dates, 'year': years})
df.head()

Unnamed: 0,raw_date,year
0,12th Jan 2021,2021
1,13th Jan 2021,2021
2,13th Jan 2021,2021
3,16th Jan 2021,2021
4,23rd Jan 2021,2021
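
Since read_excel with sheet_name = None already returns a dictionary of dataframes keyed by sheet name, the same table could also be built without reading the file a second time. The sketch below uses a small hand-made dictionary (demo_sheets, with made-up column names) in place of the real workbook, and stack_first_column is a hypothetical helper name.

```python
import pandas as pd

def stack_first_column(sheets):
    """Stack the first column of every sheet and tag rows with the sheet's year."""
    frames = []
    for name, sheet in sheets.items():
        frames.append(
            pd.DataFrame({
                "raw_date": sheet.iloc[:, 0].to_numpy(),
                "year": int(name),  # assumes sheet names are years
            })
        )
    return pd.concat(frames, ignore_index=True)

# Stand-in for pd.read_excel('OnlyDates.xlsx', sheet_name=None)
demo_sheets = {
    "2021": pd.DataFrame({"Date": ["12th Jan 2021", "13th Jan 2021"]}),
    "2013": pd.DataFrame({"Date": ["27/6/13"]}),
}
stacked = stack_first_column(demo_sheets)
print(stacked)
```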


To get an idea of the types of date formats I have, I have used the sample method

In [8]:
df.sample(100)

Unnamed: 0,raw_date,year
1154,27/6/13,2013
463,2017-02-15 00:00:00,2017
1335,2012-09-03 00:00:00,2012
414,07.Oct.18,2018
1320,2012-07-11 00:00:00,2012
...,...,...
106,17th Oct 2021,2021
857,2015-05-15 00:00:00,2015
776,2110-10-20 00:00:00,2016
893,2015-08-26 00:00:00,2015


Immediately I can see this is dirty. Some records look like they are already in date time format, and others are strings in various other formats. Run the cell above a few times to get a feel for the variety.

From looking at the data above, I am going to need different approaches for the different formats the dates come in. To make this slightly easier to visualise, I will divide the data up by the number of numerical digits in the date column.

I have made a custom function that counts the number of digits in a date record. If it cannot count them (for example because the value is not a string), it returns None when applied to the dataframe using the apply method. I have also created a function that checks whether the value is already in a date time format.

In [9]:
from  datetime import datetime
def count_digits(string):
    try:
        return sum(x.isdigit() for x in string)
    except:
        return None
        
def check_datetime(string):
    if isinstance(string, datetime):
        return 1
    else:
        return 0
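
To see what these two helpers return, here is a small self-contained sketch run on a few values of the kinds seen in the sample. It restates the functions so it can run on its own; as an assumption on my part it catches only TypeError (the error raised when a non-string such as a datetime is iterated) rather than every exception.

```python
from datetime import datetime
import pandas as pd

def count_digits(value):
    try:
        return sum(ch.isdigit() for ch in value)
    except TypeError:  # value is not iterable, e.g. a datetime or NaN
        return None

def check_datetime(value):
    return 1 if isinstance(value, datetime) else 0

samples = ["12th Jan 2021", "27/6/13", "day/month/yr", pd.Timestamp("2017-02-15")]
for value in samples:
    print(repr(value), count_digits(value), check_datetime(value))
```

A pandas Timestamp is a subclass of datetime, which is why check_datetime recognises the values Excel has already stored as dates.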
    
    

In order to apply these functions to the dataframe I am going to use the pandas apply method in combination with a lambda function

#### Lambda functions

A Python lambda function is a small anonymous function that can take any number of arguments, but can only have one expression. The syntax for a lambda function is:


lambda arguments : expression

The expression is executed and returned when the function is called. The lambda function is often used as an argument to higher-order functions such as map(), filter(), or reduce(). Here's an example:


In [10]:
lam_function = lambda x : x*6
lam_function(5)

30
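
The higher-order functions mentioned above can be illustrated with a short sketch using map() and filter(); the list numbers is just example data.

```python
numbers = [1, 2, 3, 4, 5, 6]

# map applies the lambda to every element
times_six = list(map(lambda x: x * 6, numbers))

# filter keeps the elements for which the lambda returns True
evens = list(filter(lambda x: x % 2 == 0, numbers))

print(times_six)
print(evens)
```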

We are going to use the Python pandas apply method

#### Pandas apply method

The apply() method in Pandas applies a function along an axis of a DataFrame (to each column or to each row) or, when called on a Series, to each element. The function can be either built-in or user-defined.

The basic syntax for using apply() is:

df.apply(function, axis=0)

df is the Pandas DataFrame on which the method is applied.

function is the function to be applied. It can be a built-in function or a user-defined function.

axis is an optional parameter that specifies the axis along which the function is applied. axis=0 (the default) passes each column to the function as a Series, and axis=1 passes each row. The row-wise form, axis=1, is the one to use when a function needs several columns of the same record.
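
A small sketch of the difference between the two axis values, using a made-up two-column dataframe called demo:

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: the function receives each column in turn
column_totals = demo.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each row in turn
row_totals = demo.apply(lambda row: row.sum(), axis=1)

print(column_totals)
print(row_totals)
```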


#### Pandas apply with lambda


Lambda functions and apply work well together: a lambda creates a small anonymous function, which can then be passed as the argument to a higher-order function like apply.

For example, say you have a Pandas DataFrame and you want to apply a custom function to a particular column. You can use a lambda to create an anonymous function and then pass it as the argument to the apply method on that column.
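
A minimal sketch of that pattern on a made-up dataframe; the column names has_slash and length are only for illustration.

```python
import pandas as pd

demo = pd.DataFrame({"raw_date": ["12th Jan 2021", "27/6/13", "day/month/yr"]})

# Each lambda receives one cell of the raw_date column at a time
demo["has_slash"] = demo["raw_date"].apply(lambda text: "/" in text)
demo["length"] = demo["raw_date"].apply(lambda text: len(text))

print(demo)
```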






Below we shall use lambda and apply to apply count_digits and check_datetime to our data

In [12]:
df['raw_datetime_format'] = df.raw_date.apply(lambda x : check_datetime(x))
df['digit_number'] = df.raw_date.apply(lambda x : count_digits(x))
df.head()

Unnamed: 0,raw_date,year,raw_datetime_format,digit_number
0,12th Jan 2021,2021,0,6.0
1,13th Jan 2021,2021,0,6.0
2,13th Jan 2021,2021,0,6.0
3,16th Jan 2021,2021,0,6.0
4,23rd Jan 2021,2021,0,6.0


Now I am going to view slices of the data, grouped by the number of numeric characters, to decide how to clean each group

In [15]:
df[df.raw_datetime_format ==0]

Unnamed: 0,raw_date,year,raw_datetime_format,digit_number
0,12th Jan 2021,2021,0,6.0
1,13th Jan 2021,2021,0,6.0
2,13th Jan 2021,2021,0,6.0
3,16th Jan 2021,2021,0,6.0
4,23rd Jan 2021,2021,0,6.0
...,...,...,...,...
1215,(D),2012,0,0.0
1316,,2012,0,
1330,8/1812,2012,0,5.0
1365,11/291/12,2012,0,7.0


Above we see records that have not been recognised as datetime

Now we shall filter by number of digits
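
Before filtering one digit count at a time, a quick summary of how many records fall into each group, with one example from each, can help decide where to look first. This is a sketch on a made-up five-row dataframe rather than the real data.

```python
import pandas as pd

demo = pd.DataFrame({
    "raw_date": ["12th Jan 2021", "13th Jan 2021", "4th Jan", "27/6/13", "day/month/yr"]
})
demo["digit_number"] = demo["raw_date"].apply(
    lambda text: sum(ch.isdigit() for ch in text)
)

# One row per digit count: how many records, and the first example seen
summary = demo.groupby("digit_number")["raw_date"].agg(["count", "first"])
print(summary)
```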

In [16]:
df[df.digit_number ==0]

Unnamed: 0,raw_date,year,raw_datetime_format,digit_number
291,day/month/yr,2018,0,0.0
443,day/month/yr,2017,0,0.0
588,day/month/yr,2016,0,0.0
810,day/month/yr,2015,0,0.0
945,Date of injury or onset of illness,2014,0,0.0
947,day/month/yr,2014,0,0.0
948,(D),2014,0,0.0
1064,Log of Work-Related Injuries and Illnesses,2013,0,0.0
1065,Date of injury or onset of illness,2013,0,0.0
1067,(month/day),2013,0,0.0


Looking at the above, if there is no number in the record then we can say it is not a date. We will create an extra column that flags these records as non-dates

In [17]:
df['cleaned_date'] = ['NOT A DATE' if df.digit_number.iloc[i] ==0 else 'DATE' for i in range(len(df))]
df.sample(10)

Unnamed: 0,raw_date,year,raw_datetime_format,digit_number,cleaned_date
1243,2012-02-24 00:00:00,2012,1,,DATE
1337,2012-09-10 00:00:00,2012,1,,DATE
692,2016-07-02 00:00:00,2016,1,,DATE
845,2015-04-20 00:00:00,2015,1,,DATE
1099,15/3/13,2013,0,5.0,DATE
629,2016-02-25 00:00:00,2016,1,,DATE
1328,2012-08-09 00:00:00,2012,1,,DATE
1304,2012-06-14 00:00:00,2012,1,,DATE
580,2017-11-15 00:00:00,2017,1,,DATE
567,2017-09-10 00:00:00,2017,1,,DATE
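
The same flag can also be built without looping over row positions. One option, sketched here on made-up values, is numpy's where function. Rows where digit_number is missing (the records already stored as datetimes) compare as not equal to 0, so they are labelled DATE, as in the list comprehension above.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"digit_number": [6.0, 0.0, np.nan, 5.0]})

# Vectorised: the comparison runs on the whole column at once
demo["cleaned_date"] = np.where(demo["digit_number"] == 0, "NOT A DATE", "DATE")
print(demo)
```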


Let's look at digit counts 1 and 2 together

In [18]:
df[df.digit_number ==1]

Unnamed: 0,raw_date,year,raw_datetime_format,digit_number,cleaned_date
293,4th Jan,2018,0,1.0,DATE
294,4th Jan,2018,0,1.0,DATE
312,7th March,2018,0,1.0,DATE
346,6th May,2018,0,1.0,DATE
363,3rd June,2018,0,1.0,DATE
381,9th July,2018,0,1.0,DATE
390,7th August,2018,0,1.0,DATE
405,9th Sept,2018,0,1.0,DATE


In [19]:
df[df.digit_number ==2]

Unnamed: 0,raw_date,year,raw_datetime_format,digit_number,cleaned_date
182,30th March,2019,0,2.0,DATE
183,31st March,2019,0,2.0,DATE
290,04th December,2019,0,2.0,DATE
300,27th Jan,2018,0,2.0,DATE
302,31st Jan,2018,0,2.0,DATE
303,31st jan,2018,0,2.0,DATE
328,24th March,2018,0,2.0,DATE
329,25th March,2018,0,2.0,DATE
331,30th March,2018,0,2.0,DATE
332,31st March,2018,0,2.0,DATE


It seems that for this type of format the user has not put in the year, but we can get that from the sheet name. Our task is to extract the day number and the month. We will write functions below to do this.

First we extract the month by looking for the three-letter month abbreviation in the string.

In [20]:
months = ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"]
def extract_month(x, string_list):
    
    input_string = str(x.raw_date).lower()
    date_t = x.raw_datetime_format
    if date_t ==0:
        for substring in string_list:
            if substring in input_string:
                return  str(string_list.index(substring) + 1).zfill(2)
        return None
    else:
        return None

df['month'] = df.apply(lambda x : extract_month(x, months), axis = 1)
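
To make the matching logic easier to test on its own, here is a sketch of the same substring idea as a function that takes plain text rather than a dataframe row. The name month_number is hypothetical. Because it is a substring match, spellings like Sept or March are caught, but so is any other text containing those three letters (for example "marked" would give 03).

```python
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def month_number(text):
    """Return the month as a zero-padded string, or None if no month is found."""
    lowered = str(text).lower()
    for position, abbreviation in enumerate(MONTHS, start=1):
        if abbreviation in lowered:
            return str(position).zfill(2)
    return None

for raw in ["7th March", "9th Sept", "31st jan", "27/6/13"]:
    print(raw, "->", month_number(raw))
```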

We will then extract the digits before the month. The days seem to have a standard suffix such as th or rd after them, so we shall use regular expressions to pull out these days. Depending on whether the day is written as 01 or 1, we deal with it slightly differently.

In [23]:
df[df.digit_number ==2].sample(10)

Unnamed: 0,raw_date,year,raw_datetime_format,digit_number,cleaned_date,month
340,15th April,2018,0,2.0,DATE,4
424,10th November,2018,0,2.0,DATE,11
350,14th May,2018,0,2.0,DATE,5
438,04th December,2018,0,2.0,DATE,12
423,07th November,2018,0,2.0,DATE,11
332,31st March,2018,0,2.0,DATE,3
434,26th November,2018,0,2.0,DATE,11
440,12th December,2018,0,2.0,DATE,12
302,31st Jan,2018,0,2.0,DATE,1
427,19th November,2018,0,2.0,DATE,11


In [24]:
import re

def extract_day_2_digits(x):
    input_string = str(x.raw_date)
    date_t = x.raw_datetime_format
    if date_t ==0:
        pattern = r'\b\d{1,2}(?:rd|th|st|nd|yh)\b'
        match = re.search(pattern, input_string)
        if match:
            if len(str(match.group()))==4:
                return str(match.group())[0:2]
            else:
                return str(match.group())[0:1].zfill(2)
        return None
    else:
        return None

df['day'] = df.apply(lambda x : extract_day_2_digits(x), axis = 1)
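
The length check on the match can be avoided by putting the digits in a capture group, so the suffix never needs slicing off. A sketch of that variant, working on plain text; day_from_ordinal and DAY_WITH_SUFFIX are hypothetical names, and the suffix list is copied from the pattern above.

```python
import re

# Group 1 holds just the one or two digits before the suffix
DAY_WITH_SUFFIX = re.compile(r"\b(\d{1,2})(?:st|nd|rd|th|yh)\b")

def day_from_ordinal(text):
    """Return the day as a zero-padded string, or None if no ordinal is found."""
    match = DAY_WITH_SUFFIX.search(str(text))
    return match.group(1).zfill(2) if match else None

for raw in ["4th Jan", "04th December", "31st jan", "15.May.18"]:
    print(raw, "->", day_from_ordinal(raw))
```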

Below we can see how the month and the day have been successfully retrieved

In [25]:
df[(df.digit_number ==2) |(df.digit_number ==1) ]

Unnamed: 0,raw_date,year,raw_datetime_format,digit_number,cleaned_date,month,day
182,30th March,2019,0,2.0,DATE,3,30
183,31st March,2019,0,2.0,DATE,3,31
290,04th December,2019,0,2.0,DATE,12,4
293,4th Jan,2018,0,1.0,DATE,1,4
294,4th Jan,2018,0,1.0,DATE,1,4
300,27th Jan,2018,0,2.0,DATE,1,27
302,31st Jan,2018,0,2.0,DATE,1,31
303,31st jan,2018,0,2.0,DATE,1,31
312,7th March,2018,0,1.0,DATE,3,7
328,24th March,2018,0,2.0,DATE,3,24


Let's now look at digit counts 3 and 4

In [26]:
df[df.digit_number ==3]

Unnamed: 0,raw_date,year,raw_datetime_format,digit_number,cleaned_date,month,day


In [27]:
df[df.digit_number ==4]

Unnamed: 0,raw_date,year,raw_datetime_format,digit_number,cleaned_date,month,day
347,06.May.18,2018,0,4.0,DATE,5.0,
349,13.May.18,2018,0,4.0,DATE,5.0,
351,15.May 18,2018,0,4.0,DATE,5.0,
352,15.May.18,2018,0,4.0,DATE,5.0,
354,17.May.18,2018,0,4.0,DATE,5.0,
356,22.May 18,2018,0,4.0,DATE,5.0,
358,23.May 18,2018,0,4.0,DATE,5.0,
411,04.Oct.18,2018,0,4.0,DATE,10.0,
412,04.Oct.18,2018,0,4.0,DATE,10.0,
413,05.Oct.18,2018,0,4.0,DATE,10.0,


We can see in this sample there are dates separated by dots. The previous transformations have detected the month, but the day is still missing. So let's write a function to pull out the day before the first dot

In [28]:

def extract_day_before_dot(x):
    input_string = str(x.raw_date)
    date_t = x.raw_datetime_format
    day_correct = str(x.day)
    if len(day_correct) != 2:
        if date_t ==0:
            input_string = input_string.strip()
            day = input_string.split('.')[0]  # text before the first dot
            if len(day)<3:
                return day.zfill(2)
            else:
                return None
        else:
            return None
            
    else:
        return day_correct



df['day'] = df.apply(lambda x : extract_day_before_dot(x), axis = 1)


Alternatively, you could have used pd.to_datetime(input_string, format = "%d.%b.%y"); however, I did not find it worked well on this data
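
One possible reason a single format string struggles here is that the separators are inconsistent: some records read 15.May.18 and others 15.May 18. A sketch of one workaround is to normalise runs of dots and spaces to a single dot before parsing. The function name parse_dotted_date is hypothetical, and English month abbreviations are assumed.

```python
import re
import pandas as pd

def parse_dotted_date(text):
    """Normalise separators to '.', then parse as day.month-abbreviation.2-digit-year."""
    normalised = re.sub(r"[.\s]+", ".", str(text).strip())
    return pd.to_datetime(normalised, format="%d.%b.%y", errors="coerce")

for raw in ["15.May 18", "07.Oct.18", "27/6/13"]:
    print(raw, "->", parse_dotted_date(raw))
```

Records in other formats come back as NaT, so they can be left for the other cleaning steps.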

In [29]:
df[df.digit_number ==4]

Unnamed: 0,raw_date,year,raw_datetime_format,digit_number,cleaned_date,month,day
347,06.May.18,2018,0,4.0,DATE,5.0,6.0
349,13.May.18,2018,0,4.0,DATE,5.0,13.0
351,15.May 18,2018,0,4.0,DATE,5.0,15.0
352,15.May.18,2018,0,4.0,DATE,5.0,15.0
354,17.May.18,2018,0,4.0,DATE,5.0,17.0
356,22.May 18,2018,0,4.0,DATE,5.0,22.0
358,23.May 18,2018,0,4.0,DATE,5.0,23.0
411,04.Oct.18,2018,0,4.0,DATE,10.0,4.0
412,04.Oct.18,2018,0,4.0,DATE,10.0,4.0
413,05.Oct.18,2018,0,4.0,DATE,10.0,5.0


We can see this has recovered some of the days; however, some records still contain typos, and we will leave those. Next, let's extract the dates written with two slashes (for example 27/6/13)

In [30]:
df[df.digit_number ==5]

Unnamed: 0,raw_date,year,raw_datetime_format,digit_number,cleaned_date,month,day
10,5th Mar 2021,2021,0,5.0,DATE,03,05
79,9th Jul 2021,2021,0,5.0,DATE,07,09
102,8th Oct 2021,2021,0,5.0,DATE,10,08
103,8th Oct 2021,2021,0,5.0,DATE,10,08
362,4th June 2018,2018,0,5.0,DATE,06,04
...,...,...,...,...,...,...,...
1184,18/9/13,2013,0,5.0,DATE,,
1185,24/9/13,2013,0,5.0,DATE,,
1186,25/9/13,2013,0,5.0,DATE,,
1187,26/9/13,2013,0,5.0,DATE,,


In [31]:
def double_slash_datetime(x):
    input_string = str(x.raw_date)
    date_t = x.raw_datetime_format
    day_correct = str(x.day)
    if date_t ==0:
        pattern = r'.*(?<!/)/(?!/).*(?<!/)/(?!/).*'
        if re.match(pattern, input_string):
            try:
                date_time_format = pd.to_datetime(input_string, dayfirst = True)  # dates are day/month/year; without dayfirst, 5/9/13 would parse as 9 May
                x.day = date_time_format.day
                x.month = date_time_format.month
                return x
            except:
                return x
        else:
            return x
    else: 
        return x
    


Note we could have made use of pd.to_datetime(input_string, format = "%d/%m/%y"); however, as before, I did not find it worked well
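
For reference, this is roughly what the explicit-format route looks like on a whole column. With errors="coerce", the malformed records seen earlier (such as 8/1812 and 11/291/12) become NaT rather than stopping the conversion. The series raw is made-up example data.

```python
import pandas as pd

raw = pd.Series(["27/6/13", "22/11/13", "8/1812", "11/291/12"])

# Day first, then month, then two-digit year; failures become NaT
parsed = pd.to_datetime(raw, format="%d/%m/%y", errors="coerce")
print(parsed)
```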

In [32]:
df = df.apply(lambda x : double_slash_datetime(x), axis = 1)

In [33]:
df[df.digit_number ==5]

Unnamed: 0,raw_date,year,raw_datetime_format,digit_number,cleaned_date,month,day
10,5th Mar 2021,2021,0,5.0,DATE,03,05
79,9th Jul 2021,2021,0,5.0,DATE,07,09
102,8th Oct 2021,2021,0,5.0,DATE,10,08
103,8th Oct 2021,2021,0,5.0,DATE,10,08
362,4th June 2018,2018,0,5.0,DATE,06,04
...,...,...,...,...,...,...,...
1184,18/9/13,2013,0,5.0,DATE,9,18
1185,24/9/13,2013,0,5.0,DATE,9,24
1186,25/9/13,2013,0,5.0,DATE,9,25
1187,26/9/13,2013,0,5.0,DATE,9,26


In [34]:
df[df.digit_number ==6]

Unnamed: 0,raw_date,year,raw_datetime_format,digit_number,cleaned_date,month,day
0,12th Jan 2021,2021,0,6.0,DATE,01,12
1,13th Jan 2021,2021,0,6.0,DATE,01,13
2,13th Jan 2021,2021,0,6.0,DATE,01,13
3,16th Jan 2021,2021,0,6.0,DATE,01,16
4,23rd Jan 2021,2021,0,6.0,DATE,01,23
...,...,...,...,...,...,...,...
1206,22/11/13,2013,0,6.0,DATE,11,22
1207,26/11/13,2013,0,6.0,DATE,11,26
1208,30/11/13,2013,0,6.0,DATE,11,30
1209,15/12/13,2013,0,6.0,DATE,12,15


Now we can combine our day, month and year columns into a date time format

In [35]:
def combined_date(x):
    date_t = x.raw_datetime_format
    if date_t == 1:
        return x.raw_date
    else:
        try:
        
            date_string = str(x.month) +'-' + str(x.day) + '-' + str(x.year)
            return pd.to_datetime(date_string)
                                 
        except:
            return None
        
def usa_convert(x):
    try:
        date_obj = x.combined_date
        return  date_obj.strftime('%m/%d/%Y')
    except:
        return None
    
def check_date(x):
    try:
        if len(x.combined_date_USA) == 10:
            return True
        else:
            return False
    except:
        return False

    
df['combined_date'] = df.apply(lambda x : combined_date(x), axis = 1)    
df['combined_date_USA'] = df.apply(lambda x: usa_convert(x), axis =1 )
df['date_format'] = df.apply(lambda x: check_date(x), axis = 1)
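
As an alternative to building a string per row, pd.to_datetime can also assemble dates from a dataframe that has columns named year, month and day. The sketch below is one way that could be used here, under the assumption that rows with a missing part are set aside first and come back as NaT. The dataframe parts and the other variable names are made up for illustration.

```python
import pandas as pd

parts = pd.DataFrame({
    "year": [2021, 2013, 2019],
    "month": ["01", 9, None],   # mixed strings, ints and missing values
    "day": ["12", 18, None],
})

# Convert everything to numbers; anything unparseable becomes NaN
numeric = parts.apply(pd.to_numeric, errors="coerce")

# Assemble only the complete rows, then restore the original index
complete = numeric.dropna().astype(int)
assembled = pd.to_datetime(complete).reindex(numeric.index)

# True where a date was built, False where manual checking is needed
flag = assembled.notna()
print(assembled)
print(flag)
```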

In [36]:
df.head()

Unnamed: 0,raw_date,year,raw_datetime_format,digit_number,cleaned_date,month,day,combined_date,combined_date_USA,date_format
0,12th Jan 2021,2021,0,6.0,DATE,1,12,2021-01-12,01/12/2021,True
1,13th Jan 2021,2021,0,6.0,DATE,1,13,2021-01-13,01/13/2021,True
2,13th Jan 2021,2021,0,6.0,DATE,1,13,2021-01-13,01/13/2021,True
3,16th Jan 2021,2021,0,6.0,DATE,1,16,2021-01-16,01/16/2021,True
4,23rd Jan 2021,2021,0,6.0,DATE,1,23,2021-01-23,01/23/2021,True


We have now also labelled the data that we could not convert. These records can easily be filtered out and corrected manually

In [37]:
df[df.date_format == False]

Unnamed: 0,raw_date,year,raw_datetime_format,digit_number,cleaned_date,month,day,combined_date,combined_date_USA,date_format
63,19th 05.2021,2021,0,8.0,DATE,,19.0,NaT,,False
223,9/1/201+G4:M99,2019,0,8.0,DATE,,,NaT,,False
271,,2019,0,,DATE,,,NaT,,False
272,,2019,0,,DATE,,,NaT,,False
273,,2019,0,,DATE,,,NaT,,False
274,,2019,0,,DATE,,,NaT,,False
275,,2019,0,,DATE,,,NaT,,False
276,,2019,0,,DATE,,,NaT,,False
277,,2019,0,,DATE,,,NaT,,False
278,,2019,0,,DATE,,,NaT,,False


## Combine to make a function

Now that we have gone through how to catch most of the date formats, we are going to combine our steps into one script that takes in a date column and outputs cleaned dates, along with a check column that makes it easy to find dates needing manual verification.

Let's collect the functions we need first in the cell below

In [38]:
from  datetime import datetime
import re 
import pandas as pd
import numpy as np

def count_digits(string):
    try:
        return sum(x.isdigit() for x in string)
    except:
        pass
        
def check_datetime(string):
    if isinstance(string, datetime):
        return 1
    else:
        return 0
    
def extract_month(x, string_list):
    
    input_string = str(x.raw_date).lower()
    date_t = x.raw_datetime_format
    if date_t ==0:
        for substring in string_list:
            if substring in input_string:
                return  str(string_list.index(substring) + 1).zfill(2)
        return None
    else:
        return None
    
def extract_day_2_digits(x):
    input_string = str(x.raw_date)
    date_t = x.raw_datetime_format
    if date_t ==0:
        pattern = r'\b\d{1,2}(?:rd|th|st|nd|yh)\b'
        match = re.search(pattern, input_string)
        if match:
            if len(str(match.group()))==4:
                return str(match.group())[0:2]
            else:
                return str(match.group())[0:1].zfill(2)
        return None
    else:
        return None
    
def extract_day_before_dot(x):
    input_string = str(x.raw_date)
    date_t = x.raw_datetime_format
    day_correct = str(x.day)
    if len(day_correct) != 2:
        if date_t ==0:
            input_string = input_string.strip()
            day = input_string.split('.')[0]  # text before the first dot
            if len(day)<3:
                return day.zfill(2)
            else:
                return None
        else:
            return None
            
    else:
        return day_correct
    
def double_slash_datetime(x):
    input_string = str(x.raw_date)
    date_t = x.raw_datetime_format
    day_correct = str(x.day)
    if date_t ==0:
        pattern = r'.*(?<!/)/(?!/).*(?<!/)/(?!/).*'
        if re.match(pattern, input_string):
            try:
                date_time_format = pd.to_datetime(input_string, dayfirst = True)  # dates are day/month/year; without dayfirst, 5/9/13 would parse as 9 May
                x.day = date_time_format.day
                x.month = date_time_format.month
                return x
            except:
                return x
        else:
            return x
    else: 
        return x
    
def combined_date(x):
    date_t = x.raw_datetime_format
    if date_t == 1:
        return x.raw_date
    else:
        try:
        
            date_string = str(int(x.month)).zfill(2) +'-' + str(int(x.day)).zfill(2) + '-' + str(x.year)
            return pd.to_datetime(date_string)
                                 
        except:
            return None
        
def usa_convert(x):
    try:
        date_obj = x.combined_date
        return  date_obj.strftime('%m/%d/%Y')
    except:
        return None
    
def check_date(x):
    try:
        if len(x.combined_date_USA) == 10:
            return True
        else:
            return False
    except:
        return False

### Bring in a sheet with other columns to make the script more usable

In [39]:
book_name = 'dummy_sheet.xlsx'
write_name = 'dummy_date_cleaned_2.xlsx'
date_column_name = "Date of injury or onset of illness"


xl= pd.read_excel(book_name, sheet_name = None)
sheet_names = xl.keys() 
sheet_names # change this if you want certain sheets excluded

dict_keys(['2021', '2019', '2018', '2017', '2016', '2015', '2014', '2013', '2012'])

The cell below combines all of the transformations above and goes across all the sheets

In [53]:
# def clean_xls_dates(xls_name, xls_new_name):
writer = pd.ExcelWriter(write_name, engine='xlsxwriter')
for sheet_name in list(sheet_names):
    data = pd.read_excel(book_name, sheet_name = sheet_name)
    try:
        result = data.applymap(lambda x: True if date_column_name in str(x) else False)
        rows_to_skip = np.where(result)[0][0]
        data = data.iloc[rows_to_skip:]
        data.rename(columns = data.iloc[0], inplace = True)
        data = data.iloc[1:]
    except:
        pass

    
    months = ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"]
    date_series = data[date_column_name]
    year_series = [sheet_name]*len(date_series)
    df = pd.DataFrame({'raw_date':date_series, 'year': year_series})
    df['raw_datetime_format'] = df.raw_date.apply(lambda x : check_datetime(x))
    df['digit_number'] = df.raw_date.apply(lambda x : count_digits(x))
    df['cleaned_date'] = ['NOT A DATE' if df.digit_number.iloc[i] ==0 else 'DATE' for i in range(len(df))]
    df['month'] = df.apply(lambda x : extract_month(x, months), axis = 1)
    df['day'] = df.apply(lambda x : extract_day_2_digits(x), axis = 1)
    df['day'] = df.apply(lambda x : extract_day_before_dot(x), axis = 1)
    df = df.apply(lambda x : double_slash_datetime(x), axis = 1)
    df['combined_date'] = df.apply(lambda x : combined_date(x), axis = 1)    
    df['combined_date_USA'] = df.apply(lambda x: usa_convert(x), axis =1 )
    df['date_format'] = df.apply(lambda x: check_date(x), axis = 1)
    data['cleaned_date_UK'] = df.combined_date
    data['cleaned_date_USA'] = df.combined_date_USA
    data['date_format_correct'] = df.date_format
    # print(data[data.date_format_correct == False])

    data.to_excel(writer, sheet_name = sheet_name, index = False)
    print(sheet_name)
writer.close()  # writes the workbook to disk; writer.save() was removed in pandas 2.0

PermissionError: [Errno 13] Permission denied: 'dummy_date_cleaned_2.xlsx'

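At this point you should open up your new sheet in Excel and manually change the remaining dates, using the date_format_correct column as a flag. If you hit a PermissionError like the one above, a likely cause is that the output file is already open in another program such as Excel; close it and run the cell again.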

In [43]:
writer.close()

  warn("Calling close() on already closed file.")
