## Problem 3: How long distance individuals have travelled? (8 points)

In this problem the aim is to calculate the "distance" in meters that the individuals have travelled according the social media posts (Euclidean distances between points). In this problem, we will need the `userid` -column an the points created in the previous problem. You will need the shapefile `Kruger_posts.shp` generated in Problem 2 as input file.

Our goal is to answer these questions based on the input data:
- What was the shortest distance travelled in meters?
- What was the mean distance travelled in meters?
- What was the maximum distance travelled in meters?

**In your code, you should first:**
 - Import required modules.
 - Read in the shapefile as a geodataframe called `data`
 - Reproject the data from WGS84 projection into `EPSG:32735` -projection which stands for UTM Zone 35S (UTM zone for South Africa) to transform the data into metric system.
 
*Store the result in a variable called `data`*!

In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
from shapely.geometry import Point
from shapely.geometry import LineString
from pyproj import CRS

data = gpd.read_file("data/Kruger_posts.shp")

- Check the crs of the input data. If this information is missing, set it as epsg:4326 (WGS84).
- Reproject the data from WGS84 to `EPSG:32735` -projection which stands for UTM Zone 35S (UTM zone for South Africa) to transform the data into metric system. (don't create a new variable, update the existing variable `data`!)"

In [2]:
data = data.to_crs(epsg=32735)

In [3]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check the data
print(data.head())

          timestamp    userid                        geometry
0  2015-07-07 03:02  66487960  POINT (952912.890 7229683.258)
1  2015-07-07 03:18  65281761  POINT (953433.223 7172080.632)
2  2015-03-07 03:38  90916112  POINT (898955.144 7302197.408)
3  2015-10-07 05:04  37959089  POINT (956927.218 7243564.942)
4  2015-10-07 05:19  27793716  POINT (956794.955 7236187.926)


In [4]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

# Check that the crs is correct after re-projecting (should be epsg:32735)
print(data.crs)

epsg:32735


 - Group the data by userid

In [5]:
grouped = data.groupby("userid")
len(grouped)
grouped.get_group(37959089)

Unnamed: 0,timestamp,userid,geometry
3,2015-10-07 05:04,37959089,POINT (956927.218 7243564.942)
300,2015-06-10 10:03,37959089,POINT (959476.770 7241547.289)
1795,2015-05-27 18:12,37959089,POINT (959477.643 7241503.537)
1975,2015-03-30 14:36,37959089,POINT (959479.354 7241529.289)
1981,2015-07-30 15:44,37959089,POINT (959479.354 7241529.289)
...,...,...,...
33447,2015-06-22 08:04,37959089,POINT (954981.384 7235222.703)
33838,2015-02-24 08:45,37959089,POINT (956919.751 7236232.069)
38364,2015-11-13 17:26,37959089,POINT (956785.963 7236159.685)
38451,2015-07-14 09:32,37959089,POINT (956808.640 7236099.800)


In [6]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the number of groups:
assert len(grouped.groups) == data["userid"].nunique(), "Number of groups should match number of unique users!"

**Create LineString objects for each user connecting the points from oldest to latest:**

*Suggested steps:*
- Create an empty DataFrame called `movements`. 
- Create an empty column "geometry"
- Use a for-loop where you iterate over the grouped object. For each user's data: 
    - [sort](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) the rows by timestamp 
    - create a LineString object based on the user's points
    - Add the LineString to the geometry column of the `movements` dataframe. You can also add the `userid` in a separate column (or use the userid as index).
- Convert `movements` into a `GeoDataFrame` (you can replace the DataFrame created in the previous steps with the GeoDataFrame). Remember to set the `geometry` column.
- Set the CRS of the ``movements`` GeoDataFrame as ``EPSG:32735`` 

In [7]:
movements = pd.DataFrame()
movements['geometry'] = None
linestring_list = []

for key, group in grouped:
    group.sort_values(by='timestamp')
    if len(group['geometry']) >= 2:
        point_geometry = list(group['geometry'])
        line_string = LineString(point_geometry)    
        linestring_list.append(line_string) 
    
movements['geometry'] = linestring_list
movements = gpd.GeoDataFrame(movements, geometry='geometry', crs=CRS.from_epsg(32735))

In [8]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the result
print(type(movements))
print(movements.crs)
print(movements["geometry"].head())

<class 'geopandas.geodataframe.GeoDataFrame'>
epsg:32735
0    LINESTRING (939011.113 7254636.121, 942231.630...
1    LINESTRING (905394.500 7193375.148, 905394.500...
2    LINESTRING (963788.403 7228015.063, 944551.607...
3    LINESTRING (902800.817 7192546.975, 902800.839...
4    LINESTRING (959332.961 7219877.715, 963788.403...
Name: geometry, dtype: geometry


**Finally:**
- Check once more the crs definition of your dataframe (should be epsg:32735, define the correct crs if this information is missing)
- Calculate the lenghts of the lines into a new column called ``distance`` in ``movements`` GeoDataFrame.

In [9]:
movements.crs
movements['distance'] = movements['geometry'].length

In [10]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

#Check the output
movements.head()

Unnamed: 0,geometry,distance
0,"LINESTRING (939011.113 7254636.121, 942231.630...",195251.395657
1,"LINESTRING (905394.500 7193375.148, 905394.500...",0.0
2,"LINESTRING (963788.403 7228015.063, 944551.607...",254702.52963
3,"LINESTRING (902800.817 7192546.975, 902800.839...",0.080245
4,"LINESTRING (959332.961 7219877.715, 963788.403...",9277.252211


You should now be able to print answers to the following questions: 

 - What was the shortest distance travelled in meters?
 - What was the mean distance travelled in meters?
 - What was the maximum distance travelled in meters?

In [11]:
shortest_distance = movements['distance'].min()
average_distance = movements['distance'].mean()
maximum_distance = movements['distance'].max()

print("The shortest distance is:", round(shortest_distance, 2), "meters")
print("The average distance is:", round(average_distance, 2), "meters")
print("The maximum distance is:", round(maximum_distance, 2), "meters")

The shortest distance is: 0.0 meters
The average distance is: 69090.38 meters
The maximum distance is: 4535318.99 meters


- Finally, save the movements of into a Shapefile called ``some_movements.shp``

In [12]:
fp = "data/some_movements.shp"
movements.to_file(fp)

In [13]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION

import os

#Check if output file exists
assert os.path.isfile(fp), "Output file does not exits."

That's all for this week!