## Problem 3: How far did people travel? (8 points)

During this task, the aim is to calculate the (air-line) distance in meters that each social media user in the data set prepared in *Problem 2* has travelled in-between the posts. We’re interested in the Euclidean distance between subsequent points generated by the same user.

For this, we will need to use the `userid` column of the data set `kruger_posts.shp` that we created in *Problem 2*.

Answer the following questions:
- What was the shortest distance a user travelled between all their posts (in meters)?
- What was the mean distance travelled per user (in meters)?
- What was the maximum distance a user travelled (in meters)?

---


### a) Read the input file and re-project it

- Read the input file `kruger_points.shp` into a geo-data frame `kruger_points`
- Transform the data from WGS84 to an `EPSG:32735` projection (UTM Zone 35S, suitable for South Africa). This CRS has *metres* as units.

In [1]:
# ADD YOUR OWN CODE HERE
import geopandas as gpd

# Read the input file kruger_points.shp into a geo-data frame kruger_points
import pathlib 
NOTEBOOK_PATH = pathlib.Path().resolve()
DATA_DIRECTORY = NOTEBOOK_PATH / "data"

kruger_points = gpd.read_file(DATA_DIRECTORY / "kruger_points.shp")

# Transform the data from WGS84 to an EPSG:32735 projection (UTM Zone 35S, suitable for South Africa). This CRS has metres as units.
kruger_points = kruger_points.to_crs("EPSG:32735")

In [2]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the data
kruger_points.head()

Unnamed: 0,lat,lon,timestamp,userid,geometry
0,-24.980792,31.484633,2015-07-07 03:02,66487960,POINT (952912.890 7229683.258)
1,-25.499225,31.508906,2015-07-07 03:18,65281761,POINT (953433.223 7172080.632)
2,-24.342578,30.930866,2015-03-07 03:38,90916112,POINT (898955.144 7302197.408)
3,-24.854614,31.519718,2015-10-07 05:04,37959089,POINT (956927.218 7243564.942)
4,-24.921069,31.520836,2015-10-07 05:19,27793716,POINT (956794.955 7236187.926)


In [3]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check that the crs is correct after re-projecting (should be epsg:32735)
import pyproj
assert kruger_points.crs == pyproj.CRS("EPSG:32735")

### b) Group the data by user id

Group the data by `userid` and store the grouped data in a variable `grouped_by_users`

In [4]:
grouped_by_users = kruger_points.groupby("userid")

In [None]:
# ADD YOUR OWN CODE HERE

In [5]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the number of groups:
assert len(grouped_by_users.groups) == kruger_points["userid"].nunique(), "Number of groups should match number of unique users!"

In [6]:
len(grouped_by_users.groups)

14990

In [7]:
kruger_points["userid"].nunique()

14990

As expected the two values match :) The `groupby()` function groups together the unique `userid` values.

### c) Create `shapely.geometry.LineString` objects for each user connecting the points from oldest to most recent

There are multiple ways to solve this problem (see the [hints for this exercise](https://autogis-site.readthedocs.io/en/latest/lessons/lesson-2/exercise-2.html). You can use, for instance, a dictionary or an empty GeoDataFrame to collect data that is generated using the steps below:

- Use a for-loop to iterate over the grouped object. For each user’s data: 
    - [sort](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) the rows by timestamp 
    - create a `shapely.geometry.LineString` based on the user’s points

**CAREFUL**: Remember that every LineString needs at least two points. Skip users who have less than two posts.

Store the results in a `geopandas.GeoDataFrame` called `movements`, and remember to assign a CRS.

In [8]:
# ADD YOUR OWN CODE HERE

from shapely.geometry import LineString
import pandas as pd

# Sort the GeoDataFrame by timestamp
sorted_points = kruger_points.sort_values('timestamp')

# Create empty lists to store userids and LineString objects
userids = []
lines = []

# Iterate over unique userids
for userid in sorted_points['userid'].unique():
    
    # Filter points for the current userid
    user_points = sorted_points[sorted_points['userid'] == userid]

    # Skip if less than two points for the current userid
    if len(user_points) < 2:
        continue

    # Sort the points by timestamp
    user_points = user_points.sort_values('timestamp')

    # Create a LineString object from the sorted points
    line = LineString(user_points[['lon', 'lat']].values.tolist())

    # Append the userid and LineString object to the respective lists
    userids.append(userid)
    lines.append(line)

# Create a DataFrame with userid and geometry columns
line_data = pd.DataFrame({'userid': userids, 'geometry': lines})

# Create a GeoDataFrame from the line_data DataFrame
movements = gpd.GeoDataFrame(line_data, geometry='geometry', crs = "EPSG:32735")


In [9]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the result
print(type(movements))
print(movements.crs)

movements

<class 'geopandas.geodataframe.GeoDataFrame'>
EPSG:32735


Unnamed: 0,userid,geometry
0,78183633,"LINESTRING (31.151 -25.468, 31.134 -25.343, 30..."
1,20420100,"LINESTRING (31.593 -24.993, 31.593 -24.993, 31..."
2,88360442,"LINESTRING (31.470 -24.945, 31.509 -24.897, 31..."
3,48538532,"LINESTRING (31.174 -25.474, 30.966 -25.439, 31..."
4,91153427,"LINESTRING (31.141 -25.160, 31.141 -25.160, 31..."
...,...,...
9021,46347466,"LINESTRING (31.892 -25.204, 31.054 -25.313)"
9022,39778980,"LINESTRING (31.351 -25.519, 31.351 -25.519)"
9023,19119058,"LINESTRING (31.497 -24.739, 31.514 -24.730)"
9024,81326644,"LINESTRING (30.924 -23.696, 30.983 -23.800)"


### d) Calculate the distance between all posts of a user

- Check once more that the CRS of the data frame is correct
- Compute the lengths of the lines, and store it in a new column called `distance`

In [10]:
# ADD YOUR OWN CODE HERE

# Check once more that the CRS of the data frame is correct
movements.crs

<Derived Projected CRS: EPSG:32735>
Name: WGS 84 / UTM zone 35S
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Between 24°E and 30°E, southern hemisphere between 80°S and equator, onshore and offshore. Botswana. Burundi. Democratic Republic of the Congo (Zaire). Rwanda. South Africa. Tanzania. Uganda. Zambia. Zimbabwe.
- bounds: (24.0, -80.0, 30.0, 0.0)
Coordinate Operation:
- name: UTM zone 35S
- method: Transverse Mercator
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

In [11]:
movements["distance"] = movements.length

In [12]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the output
movements.head()

Unnamed: 0,userid,geometry,distance
0,78183633,"LINESTRING (31.151 -25.468, 31.134 -25.343, 30...",3.549311
1,20420100,"LINESTRING (31.593 -24.993, 31.593 -24.993, 31...",1.433487
2,88360442,"LINESTRING (31.470 -24.945, 31.509 -24.897, 31...",0.124166
3,48538532,"LINESTRING (31.174 -25.474, 30.966 -25.439, 31...",0.422277
4,91153427,"LINESTRING (31.141 -25.160, 31.141 -25.160, 31...",0.34669


### e) Answer the original questions

You should now be able to quickly find answers to the following questions: 
- What was the shortest distance a user travelled between all their posts (in meters)? (store the value in a variable `shortest_distance`)
- What was the mean distance travelled per user (in meters)? (store the value in a variable `mean_distance`)
- What was the maximum distance a user travelled (in meters)? (store the value in a variable `longest_distance`)

How do we know what the unit of the coordinates are for EPSG:32735 ?

In [13]:
# ADD YOUR OWN CODE HERE
shortest_distance = movements["distance"].min()
shortest_distance

0.0

In [14]:
mean_distance = movements["distance"].mean()
mean_distance

1.0045096183927393

In [15]:
longest_distance = movements["distance"].max()
longest_distance

63.94749765236673

EPSG:32735 is a specific coordinate reference system (CRS) that corresponds to the Universal Transverse Mercator (UTM) projection for the southern hemisphere zone 35. The unit of measurement for EPSG:32735 is `meters`.

In UTM projections, coordinates are measured in meters along the easting (x-axis) and northing (y-axis) directions. The values represent distances from a reference point, which in UTM zone 35 is located at the intersection of the central meridian (longitude 27°E) and the equator.

Therefore, if you are working with coordinates in EPSG:32735, you can assume that the unit of measurement for both the easting and northing values is `meters`.

### f) Save the movements in a file

Save the `movements` into a new Shapefile called `movements.shp` inside the `data` directory.

In [16]:
# ADD YOUR OWN CODE HERE
movements.to_file(DATA_DIRECTORY / "movements.shp")

In [17]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

assert (DATA_DIRECTORY / "movements.shp").exists()


---

# Fantastic job!

That’s all for this week! 