---
### Assignment 4
Collect tweets published via Endomondo by people from 3-4 major cities in Poland. Check if there are some differences in types of activities between them. Visualize the activity areas on a map.

---

*Importing all the necessary libraries.*

In [99]:
import json  # For reading the configuration file
import tweepy  # For connecting to the Twitter API
import regex  # For extracting distances and times from tweet text

Authentication keys and tokens are stored locally so that they are not shared with other people.

In [2]:
# Getting the authentication keys and tokens
with open('twitter_config.json', mode='r') as file:
    config = json.load(file)
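
A missing key in the configuration file would only surface later as a `KeyError` during authentication. The sketch below shows one way to fail early instead. The four key names are the ones used in the authentication cell; the helper `load_twitter_config` is a hypothetical name introduced here for illustration, not part of the original notebook.

```python
import json

# Key names as used in the authentication cell; the helper itself is hypothetical.
REQUIRED_KEYS = ('consumerKey', 'consumerSecret', 'accessToken', 'accessTokenSecret')


def load_twitter_config(path):
    """Load the JSON config and raise early if a required key is missing."""
    with open(path, mode='r') as file:
        config = json.load(file)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f'Missing keys in {path}: {missing}')
    return config
```

With this helper, the cell above would reduce to `config = load_twitter_config('twitter_config.json')`.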

Now we can authenticate using `tweepy`

In [3]:
# Authenticating to Twitter
auth = tweepy.OAuthHandler(config['consumerKey'], config['consumerSecret'])
auth.set_access_token(config['accessToken'], config['accessTokenSecret'])

and create an `api` object, which we will use for making requests.

In [9]:
# Creating an API object
api = tweepy.API(auth)
# api.verify_credentials()

### Obtaining the data

We will define a dictionary with the largest Polish cities and their geographic locations (latitude and longitude). The data comes from Wikimedia Toolforge GeoHack.

In [11]:
LOCATIONS = {
    'Warszawa': (52.232222, 21.008333),
    'Kraków': (50.061389, 19.938333),
    'Łódź': (51.776667, 19.454722),
    'Wrocław': (51.11, 17.022222),
    'Poznań': (52.408333, 16.934167),
    'Gdańsk': (54.3475, 18.645278),
    'Szczecin': (53.438056, 14.542222),
    'Bydgoszcz': (53.125, 18.011111),
    'Lublin': (51.248056, 22.570278),
    'Białystok': (53.135278, 23.145556),
}

And now we will fetch tweets containing the hashtags *#Endomondo* and *#endorphins*, together with their locations, that were posted at most 50 km from the center of each city (so the user had to have automatic posting and geolocation turned on!).

In [92]:
city_data = {city: [] for city in LOCATIONS}
for city, (lat, lon) in LOCATIONS.items():
    response = api.search(q='#Endomondo #endorphins',
                         geocode=f'{lat},{lon},50km',
                         count=100)  # The standard search API returns at most 100 tweets per request
    for tweet in response:
        city_data[city].append({
            'text': tweet.text,
            'lat': tweet.geo['coordinates'][0] if tweet.geo else None,
            'lon': tweet.geo['coordinates'][1] if tweet.geo else None
        })
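
The `geocode` filter is applied by Twitter, so the notebook never checks the distance itself. As a sanity check, the 50 km radius can be verified locally with the haversine formula. This is a minimal sketch: the function names `haversine_km` and `within_radius` are hypothetical, and the Earth is approximated as a sphere with a mean radius of 6371 km.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # Mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def within_radius(tweet, center, radius_km=50):
    """Return True if the tweet has coordinates and lies within radius_km of center."""
    if tweet['lat'] is None or tweet['lon'] is None:
        return False
    return haversine_km(tweet['lat'], tweet['lon'], *center) <= radius_km
```

For example, `within_radius(tweet, LOCATIONS['Warszawa'])` could be used to double-check the tweets collected for Warsaw.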

As we might have suspected, we don't have much data.

In [96]:
{city: len(tweets) for city, tweets in city_data.items()}

{'Warszawa': 38,
 'Kraków': 16,
 'Łódź': 13,
 'Wrocław': 20,
 'Poznań': 19,
 'Gdańsk': 16,
 'Szczecin': 2,
 'Bydgoszcz': 1,
 'Lublin': 4,
 'Białystok': 3}

For example, let's look at the first five tweets from Warsaw:

In [97]:
city_data['Warszawa'][:5]

[{'text': 'I just finished cycling 15.95 km in 48m:46s with #Endomondo #endorphins https://t.co/tFLL1EkUyU',
  'lat': 52.176159,
  'lon': 21.049438},
 {'text': 'I was out cycling 10.03 km with #Endomondo #endorphins https://t.co/xlTi9h9pIo',
  'lat': 52.1725727,
  'lon': 21.1146933},
 {'text': 'I just finished cycling 12.10 km in 37m:41s with #Endomondo #endorphins https://t.co/DAvFB8vTsl',
  'lat': 52.201857,
  'lon': 20.955691},
 {'text': 'I was out cycling 9.64 km with #Endomondo #endorphins https://t.co/v25vGCTbt0',
  'lat': 52.2101056,
  'lon': 21.0394128},
 {'text': 'I was out walking 1.77 km with #Endomondo #endorphins https://t.co/EX3CZ1nNDk',
  'lat': 52.2638483,
  'lon': 20.9845198}]

We can see that there are only a few types of activities, namely: *cycling*, *running*, *walking*, *stretching*, *fitness*, *swimming*, *weight training* and *playing soccer*. We will create a list of them.

In [98]:
ACTIVITES = [
    'cycling',
    'running', 
    'walking',
    'stretching',
    'fitness',
    'swimming',
    'weight training',
    'playing soccer'
]

In [100]:
def extract_activities(tweet):
    activities = []
    for activity in ACTIVITES:
        if activity in tweet['text']:
            activities.append(activity)
    return activities

In [131]:
tmp = city_data['Warszawa'][0]
print(tmp)
regex.search(r'\s\d+(\.\d{2})?\skm', tmp['text'])[0]  # Distance
regex.search(r'\s\d{1,2}m:\d{2}s\s', tmp['text'])[0] # Time

{'text': 'I just finished cycling 15.95 km in 48m:46s with #Endomondo #endorphins https://t.co/tFLL1EkUyU', 'lat': 52.176159, 'lon': 21.049438}


' 48m:46s '

In [153]:
def extract_distance(tweet):
    # Capture the number directly, so whole-number distances (e.g. '5 km') also work
    match = regex.search(r'\s(\d+(?:\.\d{2})?)\skm', tweet['text'])
    if match:
        return float(match[1])
    else:
        return None

In [161]:
def extract_time(tweet):
    matches = regex.search(r'\s\d{1,2}m:\d{2}s\s', tweet['text'])
    if matches:
        return matches[0].strip()
    else:
        return None
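
The extracted time is kept as a string such as `'48m:46s'`, which is hard to compare or aggregate. The sketch below converts it to seconds and derives an average speed. Both helper names are hypothetical, the standard `re` module is used instead of `regex` to keep the example self-contained, and only the `MMm:SSs` format seen in the tweets above is handled.

```python
import re


def time_to_seconds(time_str):
    """Convert a string such as '48m:46s' to seconds; return None if it does not match."""
    match = re.fullmatch(r'(\d{1,2})m:(\d{2})s', time_str or '')
    if match is None:
        return None
    return int(match[1]) * 60 + int(match[2])


def average_speed_kmh(distance_km, time_str):
    """Average speed in km/h, or None when the distance or the time is unavailable."""
    seconds = time_to_seconds(time_str)
    if distance_km is None or not seconds:
        return None
    return distance_km / (seconds / 3600)
```

For the first Warsaw tweet (15.95 km in 48m:46s), this gives roughly 19.6 km/h.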

In [162]:
import copy
enhanced_data = copy.deepcopy(city_data)  # A shallow .copy() would share the tweet dicts with city_data

In [163]:
for city, tweets in enhanced_data.items():
    for tweet in tweets:
        tweet['activities'] = extract_activities(tweet)
        tweet['distance'] = extract_distance(tweet)
        tweet['time'] = extract_time(tweet)

In [164]:
enhanced_data

{'Warszawa': [{'text': 'I just finished cycling 15.95 km in 48m:46s with #Endomondo #endorphins https://t.co/tFLL1EkUyU',
   'lat': 52.176159,
   'lon': 21.049438,
   'activities': ['cycling'],
   'distance': 15.95,
   'time': '48m:46s'},
  {'text': 'I was out cycling 10.03 km with #Endomondo #endorphins https://t.co/xlTi9h9pIo',
   'lat': 52.1725727,
   'lon': 21.1146933,
   'activities': ['cycling'],
   'distance': 10.03,
   'time': None},
  {'text': 'I just finished cycling 12.10 km in 37m:41s with #Endomondo #endorphins https://t.co/DAvFB8vTsl',
   'lat': 52.201857,
   'lon': 20.955691,
   'activities': ['cycling'],
   'distance': 12.1,
   'time': '37m:41s'},
  {'text': 'I was out cycling 9.64 km with #Endomondo #endorphins https://t.co/v25vGCTbt0',
   'lat': 52.2101056,
   'lon': 21.0394128,
   'activities': ['cycling'],
   'distance': 9.64,
   'time': None},
  {'text': 'I was out walking 1.77 km with #Endomondo #endorphins https://t.co/EX3CZ1nNDk',
   'lat': 52.2638483,
   'lon':

We will check if there are any tweets which don't have any activities.

In [157]:
uncategorized = {city: [] for city in enhanced_data}
for city, tweets in enhanced_data.items():
    for tweet in tweets:
        if not tweet['activities']:
            uncategorized[city].append(tweet)

In [158]:
uncategorized

{'Warszawa': [],
 'Kraków': [],
 'Łódź': [],
 'Wrocław': [],
 'Poznań': [],
 'Gdańsk': [],
 'Szczecin': [],
 'Bydgoszcz': [],
 'Lublin': [],
 'Białystok': []}

As we can see, we categorized everything!
Thank you, Python, very cool.
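
The assignment also asks whether the types of activities differ between cities. One possible way to compare them is to compute the share of each activity per city, which is easier to read than raw counts when the cities have very different numbers of tweets. This is a sketch only: `activity_shares` is a hypothetical helper, and `sample` is toy input with the same structure as `enhanced_data`, not the collected data.

```python
from collections import Counter


def activity_shares(tweets):
    """Return the fraction of each activity among all activities found in the tweets."""
    counts = Counter(activity for tweet in tweets for activity in tweet['activities'])
    total = sum(counts.values())
    if total == 0:
        return {}
    return {activity: count / total for activity, count in counts.items()}


# Hypothetical toy input, NOT the collected data
sample = {
    'City A': [{'activities': ['cycling']}, {'activities': ['cycling']},
               {'activities': ['walking']}, {'activities': ['running']}],
    'City B': [{'activities': ['running']}, {'activities': ['running']}],
}

for city, tweets in sample.items():
    print(city, activity_shares(tweets))
```

Calling `activity_shares(enhanced_data[city])` for each city would give the real comparison. With only a handful of tweets for some cities, any differences should be read with caution.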

### Visualizing the results

To visualize the data on a map, we will use the `folium` library.

In [173]:
import folium

Now we create a new map centered at Wrocław (**obviously...**)

In [174]:
map_ = folium.Map(
    location=LOCATIONS['Wrocław'],
    world_copy_jump=True,
    no_wrap=False,
    width='100%',
    zoom_start=6
)

And we add appropriate markers:

In [175]:
# We flatten the dictionary into a list of tweets, as we don't need the city names any more.
tweets = [tweet for tweets in enhanced_data.values() for tweet in tweets]

In [176]:
for tweet in tweets:
    if tweet['lat'] is not None and tweet['lon'] is not None:
        popup = (
            f"<p>Distance: {tweet['distance']}km</p>"
            f"<p>Time: {tweet['time']}</p>"
        )
        folium.Marker(location=(tweet['lat'], tweet['lon']), 
                      tooltip=' and '.join(tweet['activities']),
                      popup=popup).add_to(map_)
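
Tweets without a distance or time would show `None` in the popup above. The sketch below builds the popup only from the fields that were actually extracted, and picks a marker color from the first activity. The helper names and the color mapping are hypothetical choices made for illustration; the color string is meant to be passed to `folium.Icon(color=...)`.

```python
# Hypothetical mapping from activity to marker color; unlisted activities fall back to gray.
ACTIVITY_COLORS = {
    'cycling': 'blue',
    'running': 'red',
    'walking': 'green',
}


def build_popup(tweet):
    """Build the popup HTML, skipping fields that could not be extracted."""
    parts = []
    if tweet.get('distance') is not None:
        parts.append(f"<p>Distance: {tweet['distance']} km</p>")
    if tweet.get('time') is not None:
        parts.append(f"<p>Time: {tweet['time']}</p>")
    return ''.join(parts) or '<p>No details available</p>'


def marker_color(tweet):
    """Pick a color based on the first recognized activity of the tweet."""
    activities = tweet.get('activities') or []
    return ACTIVITY_COLORS.get(activities[0], 'gray') if activities else 'gray'
```

In the loop above, these could be used as `popup=build_popup(tweet)` and `icon=folium.Icon(color=marker_color(tweet))`.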

In [177]:
map_