## Problem 3: How far have individuals travelled? (8 points)

In this problem, the aim is to calculate the distance in meters that each individual has travelled according to the social media posts (Euclidean distances between consecutive points). We will need the `userid` column and the points created in the previous problem. The shapefile `Kruger_posts.shp` generated in Problem 2 is the input file.
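
As a rough illustration of what "Euclidean distance between points" means for a whole track, the sketch below sums the straight-line segment lengths of an ordered list of coordinates. The function name `path_length` and the sample `track` are hypothetical and not part of the exercise data; the coordinates are assumed to be in a metric projection.

```python
import math


def path_length(points):
    """Sum the straight-line distances between consecutive (x, y) points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


# Hypothetical track in metres: a 3-4-5 triangle side followed by a 6 m step
track = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]
print(path_length(track))  # 11.0
```

A track with a single point has no segments, so its length under this definition is 0.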

Our goal is to answer these questions based on the input data:
- What was the shortest distance travelled in meters?
- What was the mean distance travelled in meters?
- What was the maximum distance travelled in meters?

**In your code, you should first:**
 - Import required modules.
 - Read in the shapefile as a geodataframe called `data`
 - Reproject the data from WGS84 into the `EPSG:32735` projection (UTM Zone 35S, the UTM zone used here for South Africa) so that the coordinates are in metres.
 
*Store the result in a variable called `data`*!

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
%cd /content/drive/MyDrive/GitHub/automating-gis-processes/Lesson-2/exercise-2-Nkumah7-main/files

/content/drive/MyDrive/GitHub/automating-gis-processes/Lesson-2/exercise-2-Nkumah7-main/files


In [3]:
!pip install geopandas

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [4]:
# Import required modules.
import os
import pandas as pd
from pyproj import CRS
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import LineString

In [5]:
# Read in the shapefile as a geodataframe called data
data = gpd.read_file('Problem-2/Kruger_posts.shp')
data.head()

             lat           lon         timestamp    userid                    geometry
0  -24.980792492  31.484633302  2015-07-07 03:02  66487960  POINT (-24.98079 31.48463)
1  -25.499224667  31.508905612  2015-07-07 03:18  65281761  POINT (-25.49922 31.50891)
2  -24.342578456  30.930866066  2015-03-07 03:38  90916112  POINT (-24.34258 30.93087)
3   -24.85461393  31.519718439  2015-10-07 05:04  37959089  POINT (-24.85461 31.51972)
4  -24.921068894  31.520835558  2015-10-07 05:19  27793716  POINT (-24.92107 31.52084)
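
In the table above, each point's x value equals `lat` and its y value equals `lon`, whereas shapely's `Point(x, y)` expects longitude first. If that ordering carries over from Problem 2, the reprojected coordinates and the distances derived from them would not be true metres. A minimal check, assuming the `lat` and `lon` columns hold the correct values as text; `sample` and `fixed` are hypothetical names, and the single row is copied from the table:

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical one-row sample mimicking the layout above (lat/lon stored as text)
sample = gpd.GeoDataFrame(
    {"lat": ["-24.980792492"], "lon": ["31.484633302"]},
    geometry=[Point(-24.980792492, 31.484633302)],  # built as (lat, lon)
    crs="EPSG:4326",
)

# A point is flagged as swapped when its x coordinate matches the lat column
swapped = (sample.geometry.x - sample["lat"].astype(float)).abs() < 1e-6
print(swapped.all())

# Rebuild the geometry in (lon, lat) order, i.e. x = longitude, y = latitude
fixed = sample.set_geometry(
    gpd.points_from_xy(
        sample["lon"].astype(float), sample["lat"].astype(float), crs="EPSG:4326"
    )
)
print(fixed.geometry.iloc[0])
```

If the check reports swapped points, the geometries would need rebuilding before reprojecting; the outputs shown in the rest of this notebook were produced without such a step.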


- Check the crs of the input data. If this information is missing, set it as epsg:4326 (WGS84).
- Reproject the data from WGS84 into the `EPSG:32735` projection (UTM Zone 35S, the UTM zone used here for South Africa) so that the coordinates are in metres. (Don't create a new variable, update the existing variable `data`!)
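
The two steps above could be sketched as follows for the case where the CRS is missing. This is only an illustration on a hypothetical two-point layer called `points`, assuming the coordinates really are WGS84 longitude/latitude; it is not the exercise data.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical layer created without any CRS information
points = gpd.GeoDataFrame(
    {"userid": [1, 2]},
    geometry=[Point(31.48, -24.98), Point(31.51, -25.50)],  # (lon, lat)
)

# set_crs only labels the coordinates; it does not move them
if points.crs is None:
    points = points.set_crs(epsg=4326)

# to_crs transforms the coordinates into the target projection
points = points.to_crs(epsg=32735)
print(points.crs)
```

The distinction matters: `set_crs` declares what the existing numbers mean, while `to_crs` recalculates them.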

In [6]:
# Check the coordinate reference system
data.crs

<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

In [7]:
# REPLACE THE ERROR BELOW WITH YOUR OWN CODE

# Make a backup copy of the original WGS84 data
data_wgs84 = data.copy()

# Reproject the data from WGS84 into EPSG:32735 (UTM Zone 35S)
# so that the coordinates are in metres
data = data.to_crs(epsg=32735)

In [8]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the data
print(data.head())

             lat           lon         timestamp    userid  \
0  -24.980792492  31.484633302  2015-07-07 03:02  66487960   
1  -25.499224667  31.508905612  2015-07-07 03:18  65281761   
2  -24.342578456  30.930866066  2015-03-07 03:38  90916112   
3   -24.85461393  31.519718439  2015-10-07 05:04  37959089   
4  -24.921068894  31.520835558  2015-10-07 05:19  27793716   

                            geometry  
0  POINT (-4695752.719 14973674.275)  
1  POINT (-4748939.258 15014098.837)  
2  POINT (-4672729.591 14859391.193)  
3  POINT (-4679391.656 14969037.444)  
4  POINT (-4686373.982 14973910.589)  


In [9]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check that the crs is correct after re-projecting (should be epsg:32735)
print(data.crs)

epsg:32735


 - Group the data by userid

In [10]:
# REPLACE THE ERROR BELOW WITH YOUR OWN CODE
# group the data by 'userid'
grouped = data.groupby('userid')
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f5201e14350>

In [11]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the number of groups:
assert len(grouped.groups) == data["userid"].nunique(), "Number of groups should match number of unique users!"

**Create LineString objects for each user connecting the points from oldest to latest:**

*Suggested steps:*
- Create an empty DataFrame called `movements`. 
- Create an empty column "geometry"
- Use a for-loop where you iterate over the grouped object. For each user's data: 
    - [sort](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) the rows by timestamp 
    - create a LineString object based on the user's points
    - Add the LineString to the geometry column of the `movements` dataframe. You can also add the `userid` in a separate column (or use the userid as index).
- Convert `movements` into a `GeoDataFrame` (you can replace the DataFrame created in the previous steps with the GeoDataFrame). Remember to set the `geometry` column.
- Set the CRS of the ``movements`` GeoDataFrame as ``EPSG:32735`` 
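
The sorting step relies on the timestamps ordering correctly. Sorting them as text works only while every value is zero-padded in year-first order, so one option is to parse them first. A small sketch with a hypothetical `posts` table, assuming the format is `YYYY-MM-DD HH:MM` (the source does not state which field is the month):

```python
import pandas as pd

# Hypothetical posts for one user, deliberately out of order
posts = pd.DataFrame(
    {
        "userid": [1, 1, 1],
        "timestamp": ["2015-10-07 05:04", "2015-07-07 03:02", "2015-03-07 03:38"],
    }
)

# Parse the text into datetimes (format is an assumption), then sort oldest first
posts["timestamp"] = pd.to_datetime(posts["timestamp"], format="%Y-%m-%d %H:%M")
ordered = posts.sort_values("timestamp")
print(ordered)
```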

In [12]:
# Create an empty DataFrame called movements.
movements = pd.DataFrame()

In [13]:
# Create an empty column "geometry"
movements['geometry'] = None

In [None]:
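# Build one LineString per user, ordered from oldest to latest post
rows = []
for userid, group in grouped:
    group = group.sort_values(by='timestamp')

    # A LineString needs at least two points, so skip single-post users
    if len(group) > 1:
        rows.append({'geometry': LineString(group.geometry.tolist()), 'userid': userid})

# DataFrame.append was removed in pandas 2.0, so collect the rows first
movements = pd.DataFrame(rows, columns=['geometry', 'userid'])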

In [15]:
movements.head()

                                            geometry    userid
0  LINESTRING (-4730621.203672637 14933217.538595...  10019400
1  LINESTRING (-4781540.532141375 14959979.491119...  10028023
2  LINESTRING (-4771281.624748364 14939830.766613...  10051964
3  LINESTRING (-4671809.9654520005 14967904.54142...  10060150
4  LINESTRING (-4798946.074502462 14936289.665372...  10081875


In [16]:
# REPLACE THE ERROR BELOW WITH YOUR OWN CODE
movements = gpd.GeoDataFrame(movements, geometry=movements['geometry'])

In [17]:
# Make a backup copy of 'movements'
movements_bkp = movements.copy()

In [18]:
# Initialize the CRS class for epsg code 32735:
CRS.from_epsg(32735)

<Projected CRS: EPSG:32735>
Name: WGS 84 / UTM zone 35S
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Between 24°E and 30°E, southern hemisphere between 80°S and equator, onshore and offshore. Botswana. Burundi. Democratic Republic of the Congo (Zaire). Rwanda. South Africa. Tanzania. Uganda. Zambia. Zimbabwe.
- bounds: (24.0, -80.0, 30.0, 0.0)
Coordinate Operation:
- name: UTM zone 35S
- method: Transverse Mercator
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

In [19]:
# Set the CRS of the movements GeoDataFrame as EPSG:32735
movements = movements.set_crs('epsg:32735')

In [20]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the result
print(type(movements))
print(movements.crs)
print(movements["geometry"].head())

<class 'geopandas.geodataframe.GeoDataFrame'>
epsg:32735
0    LINESTRING (-4730621.204 14933217.539, -473038...
1    LINESTRING (-4781540.532 14959979.491, -474888...
2    LINESTRING (-4771281.625 14939830.767, -476489...
3    LINESTRING (-4671809.965 14967904.541, -467180...
4    LINESTRING (-4798946.075 14936289.665, -476905...
Name: geometry, dtype: geometry


**Finally:**
- Check once more the crs definition of your dataframe (should be epsg:32735, define the correct crs if this information is missing)
- Calculate the lengths of the lines into a new column called ``distance`` in the ``movements`` GeoDataFrame.

In [21]:
# REPLACE THE ERROR BELOW WITH YOUR OWN CODE
movements['distance'] = movements['geometry'].length

In [22]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the output
movements.head()

                                            geometry    userid       distance
0  LINESTRING (-4730621.204 14933217.539, -473038...  10019400     373.543124
1  LINESTRING (-4781540.532 14959979.491, -474888...  10028023  228993.826929
2  LINESTRING (-4771281.625 14939830.767, -476489...  10051964  238709.357864
3  LINESTRING (-4671809.965 14967904.541, -467180...  10060150            0.0
4  LINESTRING (-4798946.075 14936289.665, -476905...  10081875   29894.451742


You should now be able to print answers to the following questions: 

 - What was the shortest distance travelled in meters?
 - What was the mean distance travelled in meters?
 - What was the maximum distance travelled in meters?

In [23]:
print(f"Shortest distance = {movements['distance'].min()}")
print(f"Mean distance = {movements['distance'].mean()}")
print(f"Maximum distance = {movements['distance'].max()}")

Shortest distance = 0.0
Mean distance = 138871.14194460004
Maximum distance = 8457917.497356467
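
The three statistics can also be collected in one call with `Series.agg`. The sketch below uses only the five `distance` values visible in the `head()` output above, stored in a hypothetical `distances` series, so its numbers will not match the full-dataset results.

```python
import pandas as pd

# Only the five distances shown by movements.head(), not the full column
distances = pd.Series(
    [373.543124, 228993.826929, 238709.357864, 0.0, 29894.451742],
    name="distance",
)

# One call returns a Series indexed by the statistic names
summary = distances.agg(["min", "mean", "max"])
print(summary.round(1))
```

A minimum of 0.0 corresponds to a user whose posts all share the same location, as in row 3 of the table.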


- Finally, save the movements into a Shapefile called ``some_movements.shp``

In [24]:
# REPLACE THE ERROR BELOW WITH YOUR OWN CODE
# Check if folder exists
if not os.path.exists('Problem-3/'):
    print('Creating Problem-3 folder...')
    os.makedirs('Problem-3/')
else:
    print('Problem-3 folder exists')

Creating Problem-3 folder...


In [25]:
# Change to the output directory
os.chdir('Problem-3/')

In [26]:
fp = 'some_movements.shp'
movements.to_file(fp)

In [27]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

import os

#Check if output file exists
assert os.path.isfile(fp), "Output file does not exist."

That's all for this week!