## Problem 3: How long distance individuals have travelled? (8 points)

In this problem the aim is to calculate the distance in meters that the individuals have travelled according the social media posts (Euclidean distances between points). In this problem, we will need the `userid` -column an the points created in the previous problem. You will need the shapefile `Kruger_posts.shp` generated in Problem 2 as input file.

Our goal is to answer these questions based on the input data:

 - What was the shortest distance travelled in meters?
 - What was the mean distance travelled in meters?
 - What was the maximum distance travelled in meters?

**In your code, you should first:**
 - Import required modules
 - Read in the shapefile as a geodataframe called `data`

In [15]:
# YOUR CODE HERE
# raise NotImplementedError()
import geopandas as gpd
from shapely.geometry import LineString
from pyproj import CRS
import os, sys 

# Read shapefile data
fp = 'data/Kruger_posts.shp'
data = gpd.read_file(fp)

 - Check the crs of the input data. If this information is missing, set it as epsg:4326 (WGS84).
 - Reproject the data from WGS84 to `EPSG:32735` -projection which stands for UTM Zone 35S (UTM zone for South Africa) to transform the data into metric system. (don't create a new variable, update the existing variable `data`!)

In [6]:
# YOUR CODE HERE
# raise NotImplementedError()

# Check CRS of dataframe
print(f'Original CRS the shapefile: {data.crs}')

# RE-project the geoDataFrame data
sAfrica = CRS.from_epsg(32735) # define projection 
data = data.to_crs(sAfrica)

Original CRS the shapefile:epsg:32735


In [4]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION
print(data.head())

     userid        lat        lon         timestamp  \
0  66487960 -24.980792  31.484633  2015-07-07 03:02   
1  65281761 -25.499225  31.508906  2015-07-07 03:18   
2  90916112 -24.342578  30.930866  2015-03-07 03:38   
3  37959089 -24.854614  31.519718  2015-10-07 05:04   
4  27793716 -24.921069  31.520836  2015-10-07 05:19   

                            geometry  
0  POINT (-4695752.719 14973674.275)  
1  POINT (-4748939.258 15014098.837)  
2  POINT (-4672729.591 14859391.193)  
3  POINT (-4679391.656 14969037.444)  
4  POINT (-4686373.982 14973910.589)  


In [5]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION
# Check that the crs is correct after re-projecting (should be epsg:32735)
print(data.crs)

epsg:32735


 - Group the data by userid column

In [29]:
# YOUR CODE HERE
# raise NotImplementedError()

try:
    # group data by 'userid'
    grouped = data.groupby(by='userid')
    # Specify the userid as key
    u_id = 16301
    group1 = grouped.get_group(u_id)
    group1_sorted = group1.sort_values(by=['timestamp'])
    # let's see what we have
    print(f'Group sorted: \n{group1_sorted}')
    # Create line object from group1_sorted
    line = LineString(list(group1_sorted['geometry']))
    # print line bounds
    print(f'\nLine Bounds: {line.bounds}')
except:
    raise NotImplementedError()

Group sorted 
       userid        lat        lon         timestamp  \
30535   16301 -24.759508  31.371200  2015-02-08 06:18   
30770   16301 -24.749845  31.338317  2015-02-09 08:09   
38235   16301 -24.995803  31.592000  2015-03-13 10:59   
38232   16301 -24.791483  31.865172  2015-05-13 10:51   
30512   16301 -24.760170  31.339430  2015-06-08 04:34   
38909   16301 -25.102336  31.894695  2015-08-16 14:27   
30545   16301 -24.774158  31.380342  2015-09-08 06:58   
38911   16301 -24.985142  31.625662  2015-09-16 14:30   
38913   16301 -25.122811  31.911867  2015-11-16 14:31   

                                geometry  
30535  POINT (-4681550.088 14943799.279)  
30770  POINT (-4683233.102 14939015.568)  
38235  POINT (-4688386.821 14988087.394)  
38232  POINT (-4643987.777 15007357.316)  
30512  POINT (-4684246.015 14939886.378)  
38909  POINT (-4674277.458 15033193.770)  
30545  POINT (-4682360.424 14945977.388)  
38911  POINT (-4684442.069 14991502.196)  
38913  POINT (-4674988.223 1

**Let's check what do we have in the grouped variable**

In [30]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION
assert len(grouped.groups) == data["userid"].nunique(), "Number of groups should match number of unique users!"

**Then:**
- Create an empty GeoDataFrame called `movements`
- Create a for-loop where you iterate over the grouped object. For each user's data: 
    - [sort](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) the rows by timestamp 
    - create a LineString object based on the user's points
    - add the geometry and the userid into the `movements` dataframe (one userid per row). You can achieve this either by using the `.at` indexer, or the `append` method. See hints for more help.
- Set the CRS of the ``movements`` GeoDataFrame as ``EPSG:32735`` 

In [34]:
%%time
# YOUR CODE HERE
# raise NotImplementedError()

# define columns for the empty GeoDataFrame: Create dictionary 
d = {'geometry': [], 'userid': []}
# Create an empty geoDataFrame
movements = gpd.GeoDataFrame(data= d)
try:
    for key, group in grouped:
        if len(group) >=2: # select only groups that contain at least 2 points to create line
            # sorted group by 'timestamp'
            group.sort_values(by=['timestamp'])
            line = LineString(list(group['geometry']))
            movements.at[key, 'userid'] = int(key)
            movements.at[key, 'geometry'] = line        
except:
    raise NotImplementedError()

Wall time: 42.6 s


In [35]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION
movements.head()

Unnamed: 0,geometry,userid
16301,"LINESTRING (-4684246.015 14939886.378, -468155...",16301.0
45136,"LINESTRING (-4770692.230 14940874.449, -477069...",45136.0
50136,"LINESTRING (-4687987.866 14987928.782, -468073...",50136.0
88775,"LINESTRING (-4773713.345 14938272.132, -477371...",88775.0
88918,"LINESTRING (-4699374.159 14988142.858, -468798...",88918.0


**Finally:**
- Check once the crs definition of your dataframe (should be epsg:32735, define the correct crs if this information is missing)
- Calculate the lenghts of the lines into a new column called ``distance`` in ``movements`` GeoDataFrame.

In [37]:
# YOUR CODE HERE
# raise NotImplementedError()

# Re-projecting CRS of movement GeoDataFrame
try:
    movements.set_crs(epsg= 32735, inplace=True)
    print(movements.crs)
except:
    raise NotImplementedError()



epsg:32735


In [38]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION
movements.head()

Unnamed: 0,geometry,userid
16301,"LINESTRING (-4684246.015 14939886.378, -468155...",16301.0
45136,"LINESTRING (-4770692.230 14940874.449, -477069...",45136.0
50136,"LINESTRING (-4687987.866 14987928.782, -468073...",50136.0
88775,"LINESTRING (-4773713.345 14938272.132, -477371...",88775.0
88918,"LINESTRING (-4699374.159 14988142.858, -468798...",88918.0


You should now be able to print answers to the following questions: 

 - What was the shortest distance travelled in meters?
 - What was the mean distance travelled in meters?
 - What was the maximum distance travelled in meters?

In [41]:
# Create distance column in movements GeoDataFrame
movements['distance'] = movements['geometry'].length

# Define max, min, & mean distances
max_length = movements['distance'].max()
min_length = movements['distance'].min()
mean_lenght = movements['distance'].mean()

print(f'Max distance travelled was: {round(max_length, 2)} meters')
print(f'Min distance travelled was: {round(min_length, 2)} meters' )
print(f'Mean distance travelled was {round(mean_lenght, 2)} meters') 

Max distance travelled was: 5486426.02 meters
Min distance travelled was: 0.0 meters
Mean distance travelled was 90256.51 meters


- Finally, save the movements of into a Shapefile called ``some_movements.shp``

In [42]:
# YOUR CODE HERE
# raise NotImplementedError()

# Save GeoDataFrame as shapefile
fp = 'data/some_movements.shp'
movements.to_file(fp)

In [43]:
# NON-EDITABLE CODE CELL FOR TESTING YOUR SOLUTION
import os
assert os.path.isfile(fp), "output shapefile does not exits"

That's all for this week!