## Problem 3: How far did people travel? (8 points)

During this task, the aim is to calculate the (air-line) distance in meters that each social media user in the data set prepared in *Problem 2* has travelled in-between the posts. We’re interested in the Euclidean distance between subsequent points generated by the same user.

For this, we will need to use the `userid` column of the data set `kruger_posts.shp` that we created in *Problem 2*.

Answer the following questions:
- What was the shortest distance a user travelled between all their posts (in meters)?
- What was the mean distance travelled per user (in meters)?
- What was the maximum distance a user travelled (in meters)?

---


### a) Read the input file and re-project it

- Read the input file `kruger_points.shp` into a geo-data frame `kruger_points`
- Transform the data from WGS84 to an `EPSG:32735` projection (UTM Zone 35S, suitable for South Africa). This CRS has *metres* as units.

In [1]:
import geopandas as gpd
import pathlib
EXERCISE_PATH = pathlib.Path().resolve()
DATA_DIRECTORY = EXERCISE_PATH / 'data'
kruger_points = gpd.read_file(DATA_DIRECTORY / 'kruger_points.shp')
kruger_points = kruger_points.to_crs('EPSG:32735')


In [2]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the data
kruger_points.head()

Unnamed: 0,lat,lon,timestamp,userid,geometry
0,-24.980792,31.484633,2015-07-07 03:02,66487960,POINT (952912.890 7229683.258)
1,-25.499225,31.508906,2015-07-07 03:18,65281761,POINT (953433.223 7172080.632)
2,-24.342578,30.930866,2015-03-07 03:38,90916112,POINT (898955.144 7302197.408)
3,-24.854614,31.519718,2015-10-07 05:04,37959089,POINT (956927.218 7243564.942)
4,-24.921069,31.520836,2015-10-07 05:19,27793716,POINT (956794.955 7236187.926)


In [3]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check that the crs is correct after re-projecting (should be epsg:32735)
import pyproj
assert kruger_points.crs == pyproj.CRS("EPSG:32735")

### b) Group the data by user id

Group the data by `userid` and store the grouped data in a variable `grouped_by_users`

In [4]:
grouped_by_users = kruger_points.groupby('userid')

In [5]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the number of groups:
assert len(grouped_by_users.groups) == kruger_points["userid"].nunique(), "Number of groups should match number of unique users!"

### c) Create `shapely.geometry.LineString` objects for each user connecting the points from oldest to most recent

There are multiple ways to solve this problem (see the [hints for this exercise](https://autogis-site.readthedocs.io/en/latest/lessons/lesson-2/exercise-2.html). You can use, for instance, a dictionary or an empty GeoDataFrame to collect data that is generated using the steps below:

- Use a for-loop to iterate over the grouped object. For each user’s data: 
    - [sort](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) the rows by timestamp 
    - create a `shapely.geometry.LineString` based on the user’s points

**CAREFUL**: Remember that every LineString needs at least two points. Skip users who have less than two posts.

Store the results in a `geopandas.GeoDataFrame` called `movements`, and remember to assign a CRS.

In [8]:
from shapely.geometry import LineString
line_geometries = []
user_ids = []
for userid, group in grouped_by_users:
    sorted_group = group.sort_values(by='timestamp')
    coordinates = sorted_group[['lon', 'lat']].values
    
    if len(coordinates) >= 2:
        line = LineString(coordinates)
        line_geometries.append(line)
        user_ids.append(userid)

movements = gpd.GeoDataFrame({'userid': user_ids, 'geometry': line_geometries})
movements.set_crs('EPSG:32735', inplace=True)


Unnamed: 0,userid,geometry
0,16301,"LINESTRING (31.371 -24.760, 31.338 -24.750, 31..."
1,45136,"LINESTRING (31.026 -25.321, 31.026 -25.321)"
2,50136,"LINESTRING (31.394 -24.770, 31.593 -24.993, 31..."
3,88775,"LINESTRING (31.000 -25.329, 31.000 -25.329)"
4,88918,"LINESTRING (31.551 -25.067, 31.593 -24.993)"
...,...,...
9021,99921781,"LINESTRING (31.000 -25.290, 31.011 -25.295, 31..."
9022,99936874,"LINESTRING (31.593 -24.993, 31.592 -24.993, 31..."
9023,99964140,"LINESTRING (31.322 -24.305, 31.322 -24.305)"
9024,99986933,"LINESTRING (31.293 -24.299, 31.299 -24.276)"


In [9]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the result
print(type(movements))
print(movements.crs)

movements

<class 'geopandas.geodataframe.GeoDataFrame'>
EPSG:32735


Unnamed: 0,userid,geometry
0,16301,"LINESTRING (31.371 -24.760, 31.338 -24.750, 31..."
1,45136,"LINESTRING (31.026 -25.321, 31.026 -25.321)"
2,50136,"LINESTRING (31.394 -24.770, 31.593 -24.993, 31..."
3,88775,"LINESTRING (31.000 -25.329, 31.000 -25.329)"
4,88918,"LINESTRING (31.551 -25.067, 31.593 -24.993)"
...,...,...
9021,99921781,"LINESTRING (31.000 -25.290, 31.011 -25.295, 31..."
9022,99936874,"LINESTRING (31.593 -24.993, 31.592 -24.993, 31..."
9023,99964140,"LINESTRING (31.322 -24.305, 31.322 -24.305)"
9024,99986933,"LINESTRING (31.293 -24.299, 31.299 -24.276)"


### d) Calculate the distance between all posts of a user

- Check once more that the CRS of the data frame is correct
- Compute the lengths of the lines, and store it in a new column called `distance`

In [13]:
assert movements.crs == pyproj.CRS('EPSG:32735')
movements['distance'] = movements.length


In [14]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the output
movements.head()

Unnamed: 0,userid,geometry,distance
0,16301,"LINESTRING (31.371 -24.760, 31.338 -24.750, 31...",3.158937
1,45136,"LINESTRING (31.026 -25.321, 31.026 -25.321)",0.0
2,50136,"LINESTRING (31.394 -24.770, 31.593 -24.993, 31...",1.490359
3,88775,"LINESTRING (31.000 -25.329, 31.000 -25.329)",7.28011e-07
4,88918,"LINESTRING (31.551 -25.067, 31.593 -24.993)",0.08527984


### e) Answer the original questions

You should now be able to quickly find answers to the following questions: 
- What was the shortest distance a user travelled between all their posts (in meters)? (store the value in a variable `shortest_distance`)
- What was the mean distance travelled per user (in meters)? (store the value in a variable `mean_distance`)
- What was the maximum distance a user travelled (in meters)? (store the value in a variable `longest_distance`)

In [16]:
shortest_distance = movements['distance'].min()
mean_distance = movements['distance'].mean()
longest_distance = movements['distance'].max()
print(f'Shortest distance: {shortest_distance:.3f} meters')
print(f'Mean distance: {mean_distance:.3f} meters')
print(f'Longest distance: {longest_distance:.3f} meters')


Shortest distance: 0.000 meters
Mean distance: 1.004 meters
Longest distance: 63.947 meters


### f) Save the movements in a file

Save the `movements` into a new Shapefile called `movements.shp` inside the `data` directory.

In [17]:
DATA_FILE = DATA_DIRECTORY / 'movements.shp'
movements.to_file(DATA_FILE, driver='GPKG')


In [18]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

assert (DATA_DIRECTORY / "movements.shp").exists()


---

# Fantastic job!

That’s all for this week! 