# Times and places

Elements of Data Science

by [Allen Downey](https://allendowney.com)

[MIT License](https://opensource.org/licenses/MIT)

### A note

Before we start this notebook, I have something to say.  In these notebooks, I generally try to define terms when they are first used, and to explain the code examples as we go along.

However, there are a few places where I provide code with the expectation that you will not understand all of the details yet.  When that happens, I'll explain what you should or should not understand.

Generally, if you are able to do the exercises, that means you are understanding what you need to understand.  

I understand it can be irritating to see examples that don't make sense, but if I tell you there's something you don't need to know yet, try not to let it bother you.

## Strings

A **string** is a sequence of letters, numbers, and punctuation marks.

In Python you can create a string by typing letters between single or double quotation marks.

In [1]:
'Data'

In [2]:
"Science"

And you can assign string values to variables.

In [3]:
first = 'Data'

In [4]:
last = "Science"

Some arithmetic operators work with strings, but they might not do what you expect.  For example, the `+` operator "concatenates" two strings; that is, it creates a new string that contains the first string followed by the second string:

In [5]:
first + last

Strings are used to store text data like names, addresses, titles, etc.

When you read data from a file, you might see values that look like numbers, but they are actually strings, like this:

In [6]:
not_actually_a_number = '123'

If you try to do math with these strings, you might get an error:

In [7]:
not_actually_a_number + 1

Or you might get a surprising result:

In [8]:
not_actually_a_number + '1'

Fortunately, you can convert strings to numbers.

If you have a string that contains only digits, you can convert it to an integer using the `int` function:

In [9]:
int('123')

Or you can convert it to a floating-point number using `float`:

In [10]:
float('123')

But if the string contains a decimal point, you can't convert it to an `int`:

In [11]:
int('12.3')

Going in the other direction, you can convert almost any type of value to a string using `str`:

In [12]:
str(123)

In [13]:
str(12.3)
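As a small illustration of these conversions, here is a sketch that converts strings to numbers before doing arithmetic.  The variable names `count_string` and `price_string` are made up for this example.

```python
# Hypothetical values, as they might appear after reading a file
count_string = '123'
price_string = '12.3'

# Convert to numbers before doing arithmetic
count = int(count_string) + 1
price = float(price_string) * 2

# int does not accept '12.3' directly, but it accepts the float;
# the part after the decimal point is dropped
whole_part = int(float(price_string))

count, price, whole_part
```

Converting to `float` first and then to `int` is one way to handle a string with a decimal point, assuming you are willing to drop the fraction.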

## Dates and times

If you read data from a file, you might also find that dates and times are represented with strings.

In [14]:
not_really_a_date = 'May 11, 1967'

To confirm that this value is a string, we can use the `type` function, which takes a value and reports its type.

In [15]:
type(not_really_a_date)

`str` indicates that the value of `not_really_a_date` is a string.

We get the same result with `not_really_a_time`, below:

In [16]:
not_really_a_time = '6:30:00'

In [17]:
type(not_really_a_time)

Representing dates and times using strings provides human-readable values, but they are not useful for doing computation.

Fortunately, Python provides tools for working with date and time data.  Specifically, the Pandas library provides `Timestamp`, which represents a date and time.

As always, we have to import a library before we use it; it is conventional to import Pandas with the abbreviated name `pd`:

In [18]:
import pandas as pd

Now we can use the `Timestamp` function to convert a string to a `Timestamp`:

In [19]:
pd.Timestamp('6:30:00')

If the string specifies a time but no date, Pandas fills in today's date.

A `Timestamp` is a value, so you can assign it to a variable.

In [20]:
date_of_birth = pd.Timestamp('June 4, 1989')
date_of_birth

If the string specifies a date but no time, Pandas fills in midnight as the default time.

If you assign the `Timestamp` to a variable, you can use the variable name to get the year, month, and day, like this:

In [21]:
date_of_birth.year, date_of_birth.month, date_of_birth.day

You can also get the name of the month and the day of the week.

In [22]:
date_of_birth.day_name(), date_of_birth.month_name()
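To see the defaults and attributes described above in one place, here is a short sketch with made-up variable names.  It uses one string that contains both a date and a time, and one that contains only a date.

```python
import pandas as pd

# A string with both a date and a time
moment = pd.Timestamp('1989-06-04 06:30:00')

# A string with a date but no time; the time defaults to midnight
date_only = pd.Timestamp('June 4, 1989')

# The time of day is available in the same way as year, month, and day
moment.hour, moment.minute, date_only.hour, date_only.day_name()
```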

`Timestamp` provides a function called `now` that returns the current date and time.

In [23]:
now = pd.Timestamp.now()
now

**Exercise:** Use the value of `now` to display the name of the current month and day of the week.

In [24]:
# Solution goes here

## Timedelta

`Timestamp` values support some arithmetic operations.  For example, you can compute the difference between two `Timestamps`:

In [25]:
age = now - date_of_birth
age

The result is a `Timedelta` that represents the current age of someone born on `date_of_birth`.

The `Timedelta` contains `components` that store the number of days, hours, etc. between the two `Timestamp` values.

In [26]:
age.components

You can get one of the components like this:

In [27]:
age.days
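Because `now` changes every time you run the notebook, it can help to try the same operations with fixed dates.  The following sketch uses two made-up `Timestamp` values in a leap year, so the result is the same every time.

```python
import pandas as pd

# Fixed, made-up dates; 2020 is a leap year, so February has 29 days
start = pd.Timestamp('2020-02-01')
end = pd.Timestamp('2020-03-01 12:00:00')

# The difference between two Timestamps is a Timedelta
elapsed = end - start

# Whole days, and the leftover hours from the components
elapsed.days, elapsed.components.hours
```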

The biggest component of `Timedelta` is days, not years, because days are well defined and years are problematic.

Most years are 365 days, but some are 366.  The average calendar year is 365.24 days, which is a very good approximation of a solar year, [but it is not exact](https://pumas.jpl.nasa.gov/files/04_21_97_1.pdf).

One way to compute age in years is to divide age in days by 365.24:

In [28]:
age.days / 365.24

But people usually report their ages in integer years.  We can use the Numpy `floor` function to round down:

In [29]:
import numpy as np

np.floor(age.days / 365.24)

Or the `ceil` function (which stands for "ceiling") to round up:

In [30]:
np.ceil(age.days / 365.24)

We can also compare `Timestamp` values to see which comes first.

For example, let's see if a person with a given birthdate has already had a birthday this year.

We can create a new `Timestamp` with the year from `now` and the month and day from `date_of_birth`.

In [31]:
bday_this_year = pd.Timestamp(now.year, date_of_birth.month, date_of_birth.day)
bday_this_year

The result represents the person's birthday this year.  Now we can use the `>` operator to check whether `now` is later than the birthday:

In [32]:
now > bday_this_year

The result is either `True` or `False`, which are special values in Python used to represent results from this kind of comparison.

These values belong to a type called `bool`, short for "Boolean algebra", which is a branch of algebra where all values are either true or false. 

In [33]:
type(True)

In [34]:
type(False)
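Comparing `Timestamp` values also gives us another way to compute age in whole years, without dividing by 365.24.  The following is a sketch, not the only way to do it; it uses made-up dates so the result is reproducible, and it relies on the fact that Python treats `True` as 1 and `False` as 0 in arithmetic, which we have not discussed.

```python
import pandas as pd

# Fixed, made-up dates so the result is reproducible
birth = pd.Timestamp('1989-06-04')
today = pd.Timestamp('2021-03-15')

# Birthday in the year of `today`
# (this line would fail for a February 29 birthday in a non-leap year)
bday = pd.Timestamp(today.year, birth.month, birth.day)

# True if the birthday has already happened this year
had_birthday = today >= bday

# Subtract one year if the birthday has not happened yet
age_in_years = today.year - birth.year - (not had_birthday)
age_in_years
```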

**Exercise:** Any two people with different birthdays have a "Double Day" when one is twice as old as the other.

Suppose you are given two `Timestamp` values, `d1` and `d2`, that represent birthdays for two people.  Compute their double day.

Hint: if `x` is the unknown double day, we can write:

$(x - d_1) = 2 (x - d_2)$

If we solve for `x`, we get

$x = 2 d_2 - d_1$

But if you try to compute that, you will get an error, because you cannot multiply a `Timestamp` by 2.

However, you can compute the double day using `Timestamp` and `Timedelta` values; you just have to express it a different way.

Here are two example dates; with these dates, the result should be December 19, 2009.

In [35]:
d1 = pd.Timestamp('2003-07-12')

In [36]:
d2 = pd.Timestamp('2006-09-30')

In [37]:
# Solution goes here

## Location

There are many ways to represent geographical locations, but the most common, at least for global data, is latitude and longitude.

When stored as strings, latitude and longitude are expressed in degrees with compass directions N, S, E, and W.  For example, this string represents the location of Boston, MA, USA:

In [38]:
lat_lon_string = '42.3601° N, 71.0589° W'

When we compute with location information, we use floating-point numbers, with 

* Positive latitude for the northern hemisphere, negative latitude for the southern hemisphere, and 

* Positive longitude for the eastern hemisphere and negative longitude for the western hemisphere.

Of course, the choice of the origin, and the orientation of positive and negative, are arbitrary choices that were made for historical reasons.  We might not be able to change conventions like these, but we should be aware that they are conventions.

Here's how we might represent the location of Boston with two variables.

In [39]:
lat = 42.3601
lon = -71.0589

It is also possible to combine two numbers into a composite value and assign it to a single variable:

In [40]:
boston = lat, lon

The type of this variable is `tuple`, which is a mathematical term for a value that contains a sequence of elements.  Math people pronounce it "tuh' ple", but computational people usually say "too' ple".  Take your pick.

In [41]:
type(boston)

If you have a tuple with two elements, you can assign them to two variables, like this:

In [42]:
y, x = boston
y

In [43]:
x

Notice that I assigned latitude to `y` and longitude to `x`, because a `y` coordinate usually goes up and down like latitude, and an `x` coordinate usually goes side-to-side like longitude.
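If you are curious how a string like `lat_lon_string` could be converted to a tuple of floating-point numbers, here is one possible sketch.  The function `parse_coordinate` is a hypothetical helper written for this example; it assumes the string has exactly the format shown above, and it uses features we have not seen yet, so you don't have to understand the details.

```python
def parse_coordinate(text):
    """Convert a string like '71.0589° W' to a signed float.

    Hypothetical helper: assumes a number, a degree sign,
    a space, and one of N, S, E, or W.
    """
    number, direction = text.strip().split('° ')
    value = float(number)
    # South and west are negative by convention
    if direction in ('S', 'W'):
        value = -value
    return value

lat_lon_string = '42.3601° N, 71.0589° W'

# Split the string at the comma, then convert each part
lat_part, lon_part = lat_lon_string.split(',')
location = parse_coordinate(lat_part), parse_coordinate(lon_part)
location
```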

**Exercise:** Find the latitude and longitude of the place you were born, or some place you think of as your "home town".

Make a tuple of floating-point numbers that represents this location.

In [44]:
# Solution goes here

## Distance

If you are given two tuples that represent locations, you can compute the approximate distance between them, along the surface of the globe, using the haversine function.

If you are curious about it, [you can read an explanation in this article](https://janakiev.com/blog/gps-points-distance-python/).

The following cell defines two new functions, called `hav` and `haversine`.  When you run this cell, it creates the functions, but it doesn't run them yet.

We have not talked about function definitions, so there might be some things here you don't understand.  That's ok, for now.  You can use a function without knowing how it works.

We will see more functions, and learn more about them, soon.


In [45]:
import numpy as np

def hav(theta):
    """Compute the haversine function of theta."""
    return np.sin(theta/2)**2

def haversine(coord1, coord2):
    """Haversine distance between two locations.
    
    coord1: lat-lon as tuple of float 
    coord2: lat-lon as tuple of float
    
    returns: distance in km
    """
    R = 6372.8  # Earth radius in km
    lat1, lon1 = coord1
    lat2, lon2 = coord2
    
    phi1, phi2 = np.radians(lat1), np.radians(lat2) 
    dphi       = np.radians(lat2 - lat1)
    dlambda    = np.radians(lon2 - lon1)
    
    a = hav(dphi) + np.cos(phi1)*np.cos(phi2)*hav(dlambda)
    
    distance = 2*R*np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    
    return distance

Now we can use these functions to compute the distance from Boston to London.

Here's the location of London, England, UK:

In [46]:
london = 51.5074, -0.1278

And here's the haversine distance between Boston and London.

In [47]:
haversine(boston, london)

The actual geographic distance is slightly different because Earth is not a perfect sphere.  But the error of this estimate is less than 1%.
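One way to check a function like this is to try it with a case where we know the answer.  In the following sketch, `haversine_km` is a made-up name for a compact copy of the same formula, repeated so the cell runs by itself.  Two points on the equator, 90 degrees of longitude apart, should be one quarter of the circumference of a sphere apart.

```python
import numpy as np

def haversine_km(coord1, coord2, R=6372.8):
    """Same formula as `haversine`, repeated so this cell is self-contained."""
    lat1, lon1 = coord1
    lat2, lon2 = coord2
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlambda = np.radians(lon2 - lon1)
    a = np.sin(dphi/2)**2 + np.cos(phi1)*np.cos(phi2)*np.sin(dlambda/2)**2
    return 2*R*np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Two points on the equator, 90 degrees of longitude apart
quarter = haversine_km((0, 0), (0, 90))

# One quarter of the circumference of a sphere with radius 6372.8 km
expected = 2 * np.pi * 6372.8 / 4

quarter, expected
```

The two values should agree, up to floating-point error.  This checks the formula on a sphere; it does not tell us how well a sphere approximates Earth.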

**Exercise:** Use `haversine` to compute the distance between Boston and your "home town" from the previous exercise.

If possible, use an online map to check the result.

In [48]:
# Solution goes here

## Geopandas

Python provides libraries for working with geographical data.  One of the most popular is Geopandas, which is based on another library called Shapely.  I'll introduce these libraries here, and we'll come back to them later.

Shapely provides `Point` and `LineString` values, which we'll use to represent geographic locations and lines between locations.

In [49]:
from shapely.geometry import Point, LineString

We can use the tuples we defined in the previous section to create Shapely `Point` values, but we have to reverse the order of the coordinates, providing them in x-y order rather than lat-lon order, because that's the order the `Point` function expects.

In [50]:
p1 = Point(reversed(boston))
p2 = Point(reversed(london))

If we display a `Point` value, we get a graphical representation, but not a very useful one.

In [51]:
p1

Soon we will see how to plot a `Point`, more usefully, on a map.

We can use the points we just defined to create a `LineString`:

In [52]:
line = LineString([p1, p2])

If we display the result, we get another not very useful graphical representation.

In [53]:
line
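Although the graphical representation is not very useful, the values themselves contain the coordinates.  The following sketch, which uses made-up variable names, creates the same two points by passing `x` and `y` directly, and then reads the coordinates back.

```python
from shapely.geometry import Point, LineString

# Shapely expects x-y order, so longitude comes first
boston_point = Point(-71.0589, 42.3601)
london_point = Point(-0.1278, 51.5074)

# The coordinates are available as attributes
coords = boston_point.x, boston_point.y

# The length of a LineString is in the units of the coordinates,
# which are degrees here, not kilometers
segment = LineString([boston_point, london_point])
coords, segment.length
```

Notice that `length` treats longitude and latitude as if they were coordinates on a flat plane, so it is not a substitute for the haversine distance.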

However, now we can use Geopandas to show these points and lines on a map.

If you are running this notebook on Colab, the following cell will install Geopandas, which should only take a few seconds.  It uses features we have not seen yet; you might be able to read it and guess how it works, but you don't have to.

In [54]:
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    !pip install geopandas

Now the following import statement should work.

In [55]:
import geopandas as gpd

If you are running this notebook on Colab, the libraries you install disappear when you shut down the notebook.  When you start the notebook again, you have to install them again.

The following code loads a map of the world and plots it.

In [56]:
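# Note: gpd.datasets was deprecated in Geopandas 0.14 and removed in 1.0,
# so this cell needs an earlier version of Geopandas.
path = gpd.datasets.get_path('naturalearth_lowres')
world = gpd.read_file(path)
world.plot(color='white', edgecolor='gray');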

Here's a version that just plots North America and Europe:

In [57]:
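# select the rows for each continent
in_north_america = world.continent == 'North America'
in_europe = world.continent == 'Europe'

# despite its name, north_america contains both continents
north_america = world[in_north_america | in_europe]
north_america.plot(color='white', edgecolor='gray');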

Notice:

* The map data uses the political definition of "Europe", which includes the part of Russia that is on the Asian continent.

* By default, Geopandas plots longitude and latitude directly as `x` and `y` coordinates, which provides a misleading picture of relative land areas, especially near the poles.

You can't make a map without making visualization decisions.

Now let's put dots on the map for Boston and London.  We have to put the `Point` values and the `LineString` into a `GeoSeries`, which provides a `plot` function:

In [59]:
t = [p1, p2, line]
series = gpd.GeoSeries(t)
series.plot();

Here's a first attempt to plot the maps and the lines together:

In [60]:
# plot the map
north_america.plot(color='white', edgecolor='gray')

# plot Boston, London, and the line
series.plot();

Geopandas puts the two plots on different axes, which is not what we want in this case.

To get the points and the map on the same axes, we have to use a function from Matplotlib, which is a visualization library we will use extensively.

The function is `gca`, which stands for "get current axes".  We can use the result to tell `plot` to put the points and lines on the current axes, rather than create a new one.

In [62]:
import matplotlib.pyplot as plt

# plot the map
north_america.plot(color='white', edgecolor='gray')

# plot Boston, London, and the line
series.plot(ax=plt.gca());

There are a few features in this example we have not explained completely, but hopefully you get the idea.  We will come back to Geopandas later.

**Exercise:** Modify the code in the previous section to plot a point that shows the "home town" you chose in a previous exercise and a line from there to Boston.

Then go to [this online survey](https://forms.gle/RJva9c3JhAUL3THS6) and answer the questions there.  We will use your responses for an upcoming example.