# Dates and strings in Python

## Miguel Ángel Canela, IESE Business School

******

### Date and datetime

In computer environments, there are typically two data types for time, called date and datetime. In **type date** we can store dates, that is, year, month and day. Most software applications for data management and analysis can deal with different date formats. The default format for dates in most languages, including Python, is `yyyy-mm-dd`. I advise you to use this format everywhere. Under the hood, a variable of type date is typically just a number, the number of days since a time origin, which is usually 1970-01-01 (this is the origin used by Numpy and Pandas).

In data of **type datetime**, we store the same as in type date, plus hour, minute and second. The preferred format is `yyyy-mm-dd hh:mm:ss`. Sometimes, an indication of the time zone is added at the end, as we will see below. Examples are CET (Central European Time), CEST (Central European Summer Time), GMT (Greenwich Mean Time) and UTC (Coordinated Universal Time). Datetime is also called **timestamp**. Under the hood, datetime is the number of seconds since the time origin.

Data of type datetime can be managed in many ways in Python. The package `datetime` is recommended if you want to deal with times one by one, not within a data frame. The classes `datetime.date` and `datetime.datetime` can be used to create dates and datetimes. The dates are just datetimes in disguise, that is, the date `1954-04-30` corresponds to the datetime `1954-04-30 00:00:00`.
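As a minimal sketch of the package `datetime` (the variable names are mine, chosen for illustration), the following creates a date and a datetime, and turns the date into the corresponding datetime at midnight:

```python
import datetime

# A date stores year, month and day
d = datetime.date(1954, 4, 30)

# A datetime adds hour, minute and second
dt = datetime.datetime(1954, 4, 30, 5, 0, 0)

# Combining a date with the default time (00:00:00) gives the
# datetime which corresponds to that date
midnight = datetime.datetime.combine(d, datetime.time())

print(d.isoformat())            # yyyy-mm-dd
print(dt.isoformat(sep=' '))    # yyyy-mm-dd hh:mm:ss
print(midnight)
```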

The old date and datetime types became obsolete when the **type datetime64** was introduced in Numpy. In this data type, times are recorded down to the nanosecond,
$$1\ {\rm nanosecond} = 10^{-9}\ {\rm seconds}.$$ 

So a datetime64 is just the number of nanoseconds since the time origin. Pandas inherits this from Numpy. My presentation here is very brief, and restricted to the management of times in Pandas data structures. So, I start by creating a series which contains a time as a string:

In [1]:
import pandas as pd
time1 = pd.Series('1954-04-30 05:00:00')
time1

0    1954-04-30 05:00:00
dtype: object

With the function `astype`, which is typically used for type conversions in Pandas structures, we can put this as a datetime:

In [2]:
time2 = time1.astype('datetime64[ns]')
time2

0   1954-04-30 05:00:00
dtype: datetime64[ns]

The expression 'ns' between the square brackets means nanoseconds. We can go back to the original string type:

In [3]:
time2.astype('str')

0    1954-04-30 05:00:00
dtype: object

As I said before, `time2` is just a number of nanoseconds:

In [4]:
time3 = time2.astype('int64')
time3

0   -494622000000000000
dtype: int64

In [5]:
time3.astype('datetime64[ns]')

0   1954-04-30 05:00:00
dtype: datetime64[ns]
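The number of nanoseconds can be checked by hand. The following sketch (variable names are illustrative) computes the time elapsed between the origin and the timestamp used above, and converts it to nanoseconds:

```python
import pandas as pd

ts = pd.Timestamp('1954-04-30 05:00:00')
origin = pd.Timestamp('1970-01-01')

# Time elapsed since the origin (negative, since 1954 comes before 1970)
elapsed = ts - origin

# Whole seconds since the origin
seconds = int(elapsed.total_seconds())

# 1 second = 10**9 nanoseconds
nanoseconds = seconds * 10**9

print(seconds)
print(nanoseconds)
```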

Conversion of numbers into times can also be performed with other time units, such as days or seconds. The following two examples illustrate this:

In [6]:
num_days = pd.Series(10**4)
num_days.astype('datetime64[D]')

0   1997-05-19
dtype: datetime64[ns]

In [7]:
num_secs = pd.Series(10**9)
num_secs.astype('datetime64[s]')

0   2001-09-09 01:46:40
dtype: datetime64[ns]
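The units accepted by `astype`, and the resolution of the result, have changed across Pandas versions, so the two conversions above may not behave in the same way in your installation. A sketch of an alternative, assuming that we prefer the function `pd.to_datetime`, which takes the unit as an argument:

```python
import pandas as pd

num_days = pd.Series([10**4])
num_secs = pd.Series([10**9])

# unit='D' reads the numbers as days since 1970-01-01
days_as_time = pd.to_datetime(num_days, unit='D')

# unit='s' reads the numbers as seconds since 1970-01-01
secs_as_time = pd.to_datetime(num_secs, unit='s')

print(days_as_time)
print(secs_as_time)
```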

The method `apply` is used to apply a function term by term to a series, such as a column of a data frame. `apply` is typically used in combination with a lambda expression. Using `apply` and an appropriate lambda expression, we can extract from a datetime series any information that can be extracted from a datetime variable. For instance, we can see that April 30, 1954 was a Friday.

In [8]:
time2.apply(lambda x: x.weekday())

0    4
dtype: int64

*Note*. For the function `weekday`, Monday is 0 and Sunday is 6. With `isoweekday`, Monday is 1 and Sunday is 7.
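For datetime series, Pandas also provides the accessor `dt`, which gives the same kind of information without a lambda expression. A brief sketch, in which the series is created with `pd.to_datetime` so that the block can be run on its own:

```python
import pandas as pd

times = pd.to_datetime(pd.Series(['1954-04-30 05:00:00']))

# Components of the datetime, extracted term by term
weekday_number = times.dt.weekday     # Monday is 0, Sunday is 6
weekday_name = times.dt.day_name()    # name of the day, in English
year = times.dt.year
hour = times.dt.hour

print(weekday_number)
print(weekday_name)
```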

### String data

Python provides a collection of methods for manipulating string variables. This includes not only sequences of **alphanumeric characters**, but also **white space** and **punctuation**. Beware that the **empty string** (`''`) is not the same as `nan`. It is a string of length zero.

We also have in Pandas a collection of string functions. They are mostly those of plain Python, but they are vectorized in Pandas and use the syntax we are familiar with. So, if `df` is a data frame and `var` is one of the columns of `df`, we input `df['var'].str.func` to obtain a Pandas series whose terms will result from applying the function `func` to the series `df['var']`. I present next a brief review of some useful Pandas string functions.

* With the function `str.len`, we **get the length of every element of a string series**. Note, in the following example, how the empty string and the missing value are dealt with. Also, note that the series containing the lengths has been coerced by Pandas to type float to cope with the `nan` value (the length has int type when there are no missing values).

In [9]:
import numpy as np
presidents = pd.Series(['Donald Trump', 'Bill Clinton',
    '', np.nan])
presidents

0    Donald Trump
1    Bill Clinton
2                
3             NaN
dtype: object

In [10]:
presidents.str.len()

0    12.0
1    12.0
2     0.0
3     NaN
dtype: float64
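If you prefer to keep the lengths as integers in spite of the missing value, one option is the nullable integer type `Int64` of Pandas (note the capital I). This is a sketch, not the only way to handle this:

```python
import numpy as np
import pandas as pd

presidents = pd.Series(['Donald Trump', 'Bill Clinton', '', np.nan])

# The nullable integer type stores the missing value as <NA>,
# so the other lengths are kept as integers
lengths = presidents.str.len().astype('Int64')

print(lengths)
```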

* **Substrings** can be extracted from a string variable just as we extract elements from a list. The same works for a string series (adding `str`). This can be useful to manage dates, as shown in the following example.

In [11]:
'2016-10-06'[0:4]

'2016'

In [12]:
dates = pd.Series(['2016-10-06', '2015-08-19', '2016-01-30'])
dates.str[0:4]

0    2016
1    2015
2    2016
dtype: object
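The year extracted in this way is still a string. The following sketch shows two ways of getting it as a number: converting the substring with `astype`, or converting the whole series to datetime and using the accessor `dt`. The variable names are illustrative.

```python
import pandas as pd

dates = pd.Series(['2016-10-06', '2015-08-19', '2016-01-30'])

# Option 1: take the substring and convert it to integer
year_from_string = dates.str[0:4].astype('int64')

# Option 2: convert to datetime and extract the year
year_from_datetime = pd.to_datetime(dates).dt.year

print(year_from_string)
print(year_from_datetime)
```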

* **Strings are joined** just as lists, with the plus sign (`+`):

In [13]:
'Leo' + 'nard'

'Leonard'

In [14]:
firstnames = pd.Series(['Marvin', 'Leonard'])
secondnames = pd.Series(['Gaye', 'Cohen'])
firstnames + ' ' + secondnames

0      Marvin Gaye
1    Leonard Cohen
dtype: object
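The same result can be obtained with the Pandas function `str.cat`, which takes the separator as an argument. A short sketch:

```python
import pandas as pd

firstnames = pd.Series(['Marvin', 'Leonard'])
secondnames = pd.Series(['Gaye', 'Cohen'])

# Join the two series term by term, with a white space in between
fullnames = firstnames.str.cat(secondnames, sep=' ')

print(fullnames)
```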

* Many methods of string data analysis are based on counting the occurrences of selected terms. Counting is typically preceded by **conversion to lowercase**, performed with the function `str.lower`.

In [15]:
students = pd.Series(['Pablo', 'Liudmila', 'Nana Yaa'])
students.str.lower()

0       pablo
1    liudmila
2    nana yaa
dtype: object

* `str.contains` is used to **detect the presence or absence of a pattern in a string**. It returns a Boolean series indicating, term by term, whether the pattern occurs.

In [16]:
students.str.contains(pat='an')

0    False
1    False
2     True
dtype: bool
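The Boolean series returned by `str.contains` is typically used as a filter. The sketch below adds a missing value to the series (my addition, for illustration) and uses the arguments `case` and `na` to ignore case and to treat missing values as `False`:

```python
import numpy as np
import pandas as pd

names = pd.Series(['Pablo', 'Liudmila', 'Nana Yaa', np.nan])

# case=False ignores the case, na=False turns missing values into False
mask = names.str.contains(pat='NA', case=False, na=False)

# The mask selects the terms in which the pattern occurs
selected = names[mask]

print(mask)
print(selected)
```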

* The function `str.findall` is used to **extract matching patterns from a string**. It produces, for each term of the series, a list containing all the occurrences of the pattern.

In [17]:
students.str.findall(pat='a')

0             [a]
1             [a]
2    [a, a, a, a]
dtype: object
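When only the number of occurrences is needed, the function `str.count` returns it directly, without building the lists. A sketch with the same series:

```python
import pandas as pd

students = pd.Series(['Pablo', 'Liudmila', 'Nana Yaa'])

# Number of occurrences of the pattern in every term
num_a = students.str.count(pat='a')

print(num_a)
```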

* With the function `str.replace`, we can **replace matched patterns in a string**:

In [18]:
students.str.replace(pat=' ', repl='-')

0       Pablo
1    Liudmila
2    Nana-Yaa
dtype: object

Note that, while the replacement (the argument `repl`) has to be a single string, the pattern (the argument `pat`) can cover several alternatives. In the preceding example, we replaced a single white space by a dash. Now, to replace either white space or the letter 'o', we set as the pattern the regular expression 'o| ', adding `regex=True` so that the pattern is read as a regular expression and not as a literal string. Note that, in regular expressions, as in Python and many other computer environments, the vertical bar means 'OR'.

In [19]:
students.str.replace(pat='o| ', repl='-', regex=True)

0       Pabl-
1    Liudmila
2    Nana-Yaa
dtype: object

* The function `str.split` **splits up a string into pieces**. This is one way to transform a string into a **bag of words**, that is, a list whose terms are the words contained in the string. For every term of a string series, the function returns the corresponding bag of words.

In [20]:
sayings = pd.Series(['Correlation is not causation',
    'Flattery is the food of fools'])
sayings.str.split(pat=' ')

0       [Correlation, is, not, causation]
1    [Flattery, is, the, food, of, fools]
dtype: object
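Once we have the bags of words, counting the occurrences of every term is straightforward. The sketch below is one possible way, using the Pandas methods `explode` (one row per word) and `value_counts`:

```python
import pandas as pd

sayings = pd.Series(['Correlation is not causation',
    'Flattery is the food of fools'])

# Lowercase, split into words and put every word in a separate row
words = sayings.str.lower().str.split(pat=' ').explode()

# Number of occurrences of every word
counts = words.value_counts()

print(counts)
```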

### Regular expressions

Some of the transformations performed by the methods described in the preceding section are dramatically simplified by using **regular expressions**. A regular expression is a pattern which describes a set of strings. Regular expressions are a whole chapter of programming, with entire books, such as Friedl (2007), devoted to them. If you are interested, you may also try the website `regexr.com`, which makes learning regular expressions fun.

Regular expressions can be used in many computer environments, but they are not exactly the same in all languages. Nevertheless, the basic ones are universal. Among them, **character classes** are the simplest case. They are built by enclosing a collection of characters within square brackets. The square brackets indicate *any one* of the characters enclosed. For instance, `[0-9]` stands for any digit, and `[A-Z]` for any capital letter.

I show how this works with some simple examples.

In [21]:
bio = pd.Series(['I was born in 1954',
    'My phone is +34 932 534 200'])
bio.str.replace(pat='[a-z]', repl='x', regex=True)

0             I xxx xxxx xx 1954
1    Mx xxxxx xx +34 932 534 200
dtype: object

In [22]:
bio.str.replace(pat='[0-9]', repl='x', regex=True)

0             I was born in xxxx
1    My phone is +xx xxx xxx xxx
dtype: object
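Character classes are not specific to Pandas. As a sketch, the same patterns can be tried on single strings with the module `re` of plain Python:

```python
import re

sentence = 'I was born in 1954'
phone = 'My phone is +34 932 534 200'

# Replace every digit by an x
masked = re.sub('[0-9]', 'x', sentence)

# Find all the capital letters
capitals = re.findall('[A-Z]', phone)

print(masked)
print(capitals)
```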

Character classes get more powerful when complemented with **quantifiers**. For instance, followed by a plus sign (+), a character class indicates a sequence of one or more characters of that class. So, `[0-9]+` indicates any sequence of digits, therefore any non-negative integer. We can also specify the minimum and maximum length of the sequence, as in the second example below.

In [23]:
bio.str.replace(pat='[a-zA-Z]+', repl='x', regex=True)

0             x x x x 1954
1    x x x +34 932 534 200
dtype: object

In [24]:
bio.str.replace(pat='[0-9]{1,3}', repl='x', regex=True)

0        I was born in xx
1    My phone is +x x x x
dtype: object
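To see why `1954` turns into two x's in the last example, it may help to look at the matches themselves. In this sketch, based on the module `re` of plain Python, the quantifier `{1,3}` takes as many digits as it can, up to three, and then starts a new match:

```python
import re

# The plus sign takes the whole sequence of digits as a single match
one_or_more = re.findall('[0-9]+', 'I was born in 1954')

# With {1,3}, a match has at most three digits, so 1954 is split in two
up_to_three = re.findall('[0-9]{1,3}', 'I was born in 1954')

print(one_or_more)
print(up_to_three)
```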

Finally, a simple, clean way of getting a bag of words is as follows.

In [25]:
bio.str.findall(pat='[a-zA-Z0-9]+')

0              [I, was, born, in, 1954]
1    [My, phone, is, 34, 932, 534, 200]
dtype: object
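Quantifiers are also useful to pick a specific piece of a string. As a sketch, the Pandas function `str.extract` returns the first match of the part of the pattern enclosed in parentheses, and a missing value when there is no match. Here, I assume that a year is any sequence of four digits:

```python
import pandas as pd

bio = pd.Series(['I was born in 1954',
    'My phone is +34 932 534 200'])

# The parentheses delimit the part of the match that is returned;
# expand=False gives a series instead of a data frame
year = bio.str.extract(pat='([0-9]{4})', expand=False)

print(year)
```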

### Special characters

Text imported from PDF or HTML documents, or from devices like mobile phones, may contain **special characters** like the left/right quotation marks (‘, “, etc), or the three-dot character (…), which it is better to control, to avoid confusion. To keep this note short, I do not develop this point here, but mind that, if you capture string data on your own, you will probably find some of these characters in your data. Even if the documents are expected to be in English, they can be contaminated by other languages: Han characters, German umlaut, Spanish eñe, etc.
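A simple way of controlling these characters is to replace them by plain equivalents. The sketch below uses a translation table built with `str.maketrans`. The list of replacements is my choice, for illustration, and is far from complete.

```python
import pandas as pd

# Translation table: special character -> plain replacement
table = str.maketrans({
    '\u2018': "'",    # left single quotation mark
    '\u2019': "'",    # right single quotation mark
    '\u201c': '"',    # left double quotation mark
    '\u201d': '"',    # right double quotation mark
    '\u2026': '...',  # three-dot character
})

# A string containing curly quotes and the three-dot character
text = '\u201cIt\u2019s fine\u2026\u201d'

# Plain Python, for a single string
clean = text.translate(table)

# Pandas, for a string series
clean_series = pd.Series([text]).str.translate(table)

print(clean)
```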

Another source of trouble is that these special symbols can be encoded by different computers or different text editors in different ways. The preferred **encoding** is **UTF-8**, which is the default encoding in Macintosh computers. When reading and writing text files in Pandas, the argument `encoding` allows you to manage both UTF-8 and the alternative encoding **Latin-1**. The problem with Windows computers is that they use a third system, called **Windows-1252**, which is very close to Latin-1, but not exactly the same. I cannot say more, because this topic goes beyond my expertise, so I do not discuss encodings in this course. If you are interested, you may take a look at Korpela (2006).
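Without going into detail, the following sketch, in plain Python, gives an idea of how the three encodings differ. The examples (the word eñe and the euro sign) are my choice.

```python
word = 'e\u00f1e'    # the word eñe

# In UTF-8 the letter ñ takes two bytes, in Latin-1 only one
utf8_bytes = word.encode('utf-8')
latin1_bytes = word.encode('latin-1')

# Reading UTF-8 bytes as if they were Latin-1 garbles the text
garbled = utf8_bytes.decode('latin-1')

# The euro sign exists in Windows-1252, but not in Latin-1
euro = '\u20ac'
euro_cp1252 = euro.encode('cp1252')
try:
    euro.encode('latin-1')
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False

print(utf8_bytes, latin1_bytes)
print(garbled)
print(euro_cp1252, euro_in_latin1)
```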

### References

1. JEF Friedl (2007), *Mastering Regular Expressions*, O'Reilly.

2. JK Korpela (2006), *Unicode Explained*, O'Reilly.

3. J VanderPlas (2017), *Python Data Science Handbook*, O'Reilly.