# <center> COVID'19 Tweets Geographical Distribution 🌎</center>

<center><h2>Real Time Tweets and Geographical Distribution Application</h2></center><br>
<center><h5>Recovered more than 70% of geo coordinates from users' self-reported locations</h5></center>

<center><img src='https://raw.githubusercontent.com/AkhilRam7/Covid19Tweets/master/Webp.net-gifmaker.gif'></center>

## Dataset Description

These tweets are collected using the Twitter API and a Python script. A query for the high-frequency hashtag #covid19 is run daily for a certain time period, to collect a larger sample of tweets.

**Content**
The tweets have the #covid19 hashtag. Collection started on 25/7/2020 with an initial batch of 17k tweets and will continue on a daily basis.


The collection script can be found here: https://github.com/gabrielpreda/covid-19-tweets

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Importing Necessary Libraries

In [None]:
# Importing the libraries needed for later analysis of the dataset

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import folium
from folium.plugins import HeatMapWithTime, TimestampedGeoJson

# Dataset Exploration

In [None]:
# Pandas to read Covid Tweets dataset
tweets = pd.read_csv('../input/covid19-tweets/covid19_tweets.csv')

In [None]:
# Examining the first rows of the dataset
tweets.head()

In [None]:
# Checking the shape of dataset
tweets.shape

In [None]:

tweets.info()

# Data Insights
We have around 166,656 tweets in the dataset, and each tweet can carry up to 13 fields. Let's explore how many of them are needed to visualize the geographical distribution. To plot tweets on a map we need either country codes or location coordinates. The dataset only has a '*user_location*' field, and its free-text values are too vague to plot directly.


The main idea is to extract the geographical coordinates of each user. Here I'm using two other datasets to improve the location feature in the dataset.
* world-cities-datasets (https://www.kaggle.com/viswanathanc/world-cities-datasets)
* Average location coordinates based on country codes (https://gist.githubusercontent.com/tadast/8827699/raw/3cd639fa34eec5067080a61c69e3ae25e3076abb/countries_codes_and_coordinates.csv)

world-cities-datasets is used to identify the Alpha-2 code, Alpha-3 code, country name, and city name in the user location field of the Covid tweets dataset.

The average location coordinates dataset supplies an average latitude and longitude for each Alpha-2 code, Alpha-3 code, or country name found through world-cities-datasets.

In [None]:
# World City Dataset

cities = pd.read_csv('../input/world-cities-datasets/worldcities.csv')

In [None]:
# Exploring city dataset
cities.head()

In [None]:
# Add empty latitude and longitude columns to the tweets dataset
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)

tweets["lat"] = np.nan
tweets["lng"] = np.nan
tweets["location"] = tweets["user_location"]


## Missing Data Handling in tweets dataset

I don't want to drop the rows with NaN locations yet, so I replace NaN with an empty string before splitting each location on commas.

In [None]:
user_location = tweets['location'].fillna(value='').str.split(',')

In [None]:
# AVG Location Dataset

avg_countries_loc = pd.read_csv('https://gist.githubusercontent.com/tadast/8827699/raw/3cd639fa34eec5067080a61c69e3ae25e3076abb/countries_codes_and_coordinates.csv')

In [None]:
avg_countries_loc.head()

**Here I check whether the world cities dataset and the average location dataset share the same country codes, so that nothing is lost when mapping between them. Four country codes turned out to be missing from the average location dataset, so I added these four codes and their geo coordinates manually.**
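
One way to phrase this check is as a difference between two normalized collections of codes. The following is a minimal sketch on invented inputs: `missing_codes`, `toy_city_codes`, and `toy_avg_codes` are hypothetical names and values, not taken from either dataset, and the notebook's own cells below use plain list scans instead.

```python
def missing_codes(city_codes, avg_codes):
    """Return codes found in city_codes but not in avg_codes (case-insensitive).

    The result keeps first-appearance order and contains no duplicates.
    """
    def normalize(code):
        # Mirror the clean-up applied in the notebook: drop quotes and whitespace
        return code.replace('"', '').strip().lower()

    known = {normalize(code) for code in avg_codes}
    seen = set()
    missing = []
    for code in city_codes:
        key = normalize(code)
        if key not in known and key not in seen:
            seen.add(key)
            missing.append(key)
    return missing


# Hypothetical toy inputs, for illustration only
toy_city_codes = ['JP', 'US', 'XK', 'us', 'SX']
toy_avg_codes = [' "JP"', ' "US"', ' "IN"']
print(missing_codes(toy_city_codes, toy_avg_codes))
```

Using a set for the known codes keeps each membership test constant-time, which matters little for a few hundred country codes but keeps the intent easy to read.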

In [None]:
# Make a list of all countries in Avg Location Dataset
codes = avg_countries_loc['Alpha-2 code'].str.replace('"','').str.strip().to_list()
world_city_iso2 = []
for c in cities['iso2'].str.lower().str.strip().values.tolist():
    if c not in world_city_iso2:
        world_city_iso2.append(c)
        
# Check whether both external datasets share the same country codes
l_codes = [c.lower() for c in codes]
for a in world_city_iso2:
    if a not in l_codes:
        print(a)

In [None]:
# Adding the missing country codes manually

codes = avg_countries_loc['Alpha-2 code'].str.replace('"','').str.strip().to_list() + ['XW','SX', 'CW','XK']
code_lat = avg_countries_loc['Latitude (average)'].str.replace('"','').to_list() + ['31.953112', '18.0255', '12.2004', '42.609778']
code_lng = avg_countries_loc['Longitude (average)'].str.replace('"','').to_list() + ['35.301170', '-63.0450', '-69.0200', '20.918062']

# Feature Engineering

In [None]:
lat = cities['lat'].fillna(value = '').values.tolist()
lng = cities['lng'].fillna(value = '').values.tolist()


# Getting all alpha 3 codes into  a list
world_city_iso3 = []
for c in cities['iso3'].str.lower().str.strip().values.tolist():
    if c not in world_city_iso3:
        world_city_iso3.append(c)
        
# Getting all alpha 2 codes into  a list    
world_city_iso2 = []
for c in cities['iso2'].str.lower().str.strip().values.tolist():
    if c not in world_city_iso2:
        world_city_iso2.append(c)
        
# Getting all countries into  a list        
world_city_country = []
for c in cities['country'].str.lower().str.strip().values.tolist():
    if c not in world_city_country:
        world_city_country.append(c)

# Getting all admin names into a list (not de-duplicated, so positions line up with lat/lng)
world_states = []
for c in cities['admin_name'].str.lower().str.strip().tolist():
    world_states.append(c)


# Getting all cities into  a list
world_city = cities['city'].fillna(value = '').str.lower().str.strip().values.tolist()



# Data Mapping among Datasets

1. Each user location may contain a combination of city name, country name, and country or state codes (e.g. Pewee Valley, KY).
2. We split each user location on commas.
3. Each location then becomes a list of tokens that may be a country, a city, or a code.
4. If a token matches a city name or admin name in the world cities dataset, we assign that row's coordinates from the world cities dataset to the tweet.
5. If a token matches an Alpha-2 code, Alpha-3 code, or country name in the world cities dataset, we take the average location from the average countries dataset and add it to the tweet.
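
The priority order in the steps above can be sketched with dictionary lookups on a few toy rows. Everything here is illustrative: the `CITY_COORDS`, `STATE_COORDS`, and `COUNTRY_COORDS` tables and the `resolve_location` helper are hypothetical stand-ins with rough coordinates, not values from the notebook's datasets, and dictionaries replace the list scans that the notebook itself uses.

```python
# Toy lookup tables (hypothetical, rough values for illustration only)
CITY_COORDS = {'louisville': (38.25, -85.76), 'london': (51.51, -0.13)}
STATE_COORDS = {'kentucky': (38.25, -85.76)}
COUNTRY_COORDS = {
    'united kingdom': (54.0, -2.0),
    'gb': (54.0, -2.0),
    'gbr': (54.0, -2.0),
}


def resolve_location(user_location):
    """Return (lat, lng) for a free-text location, or None if nothing matches.

    Tiers are tried in order: city, then admin name, then country name or
    code. Within a tier, the first matching token wins.
    """
    if not user_location:
        return None
    tokens = [part.strip().lower() for part in user_location.split(',')]
    for table in (CITY_COORDS, STATE_COORDS, COUNTRY_COORDS):
        for token in tokens:
            if token in table:
                return table[token]
    return None


print(resolve_location('London, UK'))
print(resolve_location('Somewhere, Kentucky'))
print(resolve_location('GB'))
print(resolve_location('the moon'))
```

Two-letter tokens are ambiguous: `KY` in the example above is a US state abbreviation and also the ISO alpha-2 code for the Cayman Islands, so code-level matches are best treated as approximate.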

In [None]:

for ind in range(len(user_location)):
    # Match positions for [city, admin name, country, alpha-2, alpha-3].
    # None (not False) marks "no match", so a match at list index 0 still counts.
    order = [None, None, None, None, None]
    for each in user_location[ind]:
        each = each.lower().strip()
        if each in world_city:
            order[0] = world_city.index(each)
        if each in world_states:
            order[1] = world_states.index(each)
        if each in world_city_country:
            order[2] = world_city_country.index(each)
        if each in world_city_iso2:
            order[3] = world_city_iso2.index(each)
        if each in world_city_iso3:
            order[4] = world_city_iso3.index(each)

    # City and admin-name matches take coordinates straight from the cities dataset.
    # .at avoids chained assignment (tweets['lat'][ind] = ...), which may not write through.
    if order[0] is not None:
        tweets.at[ind, 'lat'] = lat[order[0]]
        tweets.at[ind, 'lng'] = lng[order[0]]
        continue
    if order[1] is not None:
        tweets.at[ind, 'lat'] = lat[order[1]]
        tweets.at[ind, 'lng'] = lng[order[1]]
        continue

    # Country-level matches fall back to the average coordinates of that country.
    # Indexing world_city_iso2 with a country or alpha-3 position assumes the three
    # de-duplicated lists are in the same order.
    if order[2] is not None:
        try:
            pos = codes.index(world_city_iso2[order[2]].upper())
            tweets.at[ind, 'lat'] = float(code_lat[pos])
            tweets.at[ind, 'lng'] = float(code_lng[pos])
        except (ValueError, IndexError):
            pass
        continue
    if order[3] is not None:
        pos = codes.index(world_city_iso2[order[3]].upper())
        tweets.at[ind, 'lat'] = float(code_lat[pos])
        tweets.at[ind, 'lng'] = float(code_lng[pos])
        continue
    if order[4] is not None:
        pos = codes.index(world_city_iso2[order[4]].upper())
        tweets.at[ind, 'lat'] = float(code_lat[pos])
        tweets.at[ind, 'lng'] = float(code_lng[pos])
        continue


In [None]:
# Null values of location in tweets
all_tweets = len(tweets)
bad_tweets_without_location = tweets['user_location'].isnull().sum()
tweets_unrecovered_location = tweets['lat'].isnull().sum()

print(all_tweets, bad_tweets_without_location, tweets_unrecovered_location)
print('\nShare of tweet locations recovered using external datasets...')
print((all_tweets-(tweets_unrecovered_location))/(all_tweets-bad_tweets_without_location))


We have recovered about 71% of the tweet geo coordinates, which is highly satisfactory.
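
The ratio printed above can be wrapped in a small helper to make its denominator explicit. This is a sketch only: `recovery_rate` is a hypothetical name, and the counts in the usage line are invented round numbers, not the notebook's output.

```python
def recovery_rate(total, without_location, unrecovered):
    """Share of tweets with a user_location that ended up with coordinates.

    total            -- number of tweets
    without_location -- tweets whose user_location is missing
    unrecovered      -- tweets still lacking coordinates after matching
                        (this count includes the tweets without a location)
    """
    with_location = total - without_location
    if with_location <= 0:
        raise ValueError('no tweets with a user_location')
    return (total - unrecovered) / with_location


# Invented counts for illustration: 1000 tweets, 200 without a location,
# 440 without coordinates after matching, so 560 recovered out of 800
print(recovery_rate(1000, 200, 440))
```

Tweets with no `user_location` are excluded from the denominator because no amount of matching could recover them.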

Now we will create a partial dataframe from the tweets dataset for animating the geographical distribution.

In [None]:
map_df = tweets[['lat','lng','user_location','date']].dropna()

In [None]:
map_df.head()

In [None]:
dates = map_df['date'].str.split(' ').str.get(0).unique().tolist()
print('Number of Days in dataset:', len(dates))

# Daily Tweets and their Geographical Distribution

In [None]:

daily_tweets = folium.Map(tiles='cartodbpositron', min_zoom=2) 

# Ensure you're handing it floats
map_df['lat'] = map_df['lat'].astype(float)
map_df['lng'] = map_df['lng'].astype(float)
map_df['date'] = map_df['date'].str.split(' ').str.get(0)


# List comprehension to build one list of [lat, lng] points per date
heat_data = [[[row['lat'],row['lng']] for index, row in map_df[map_df['date'] == i].iterrows()] for i in dates]

# Plot it on the map
hm = HeatMapWithTime(data=heat_data, name=None, radius=7, min_opacity=0, max_opacity=0.8, 
                     scale_radius=False, gradient=None, use_local_extrema=False, auto_play=False, 
                     display_index=True, index_steps=1, min_speed=0.1, max_speed=10, speed_step=0.1, 
                     position='bottomleft', overlay=True, control=True, show=True)
hm.add_to(daily_tweets)
# Save the map as HTML and display it
daily_tweets.save('daily_tweets.html')
daily_tweets
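
`HeatMapWithTime` takes one list of `[lat, lng]` points per time step. A single pass over the rows can build that structure instead of filtering the frame once per date. The sketch below uses plain tuples in place of DataFrame rows; `build_heat_data` and the sample rows are hypothetical and only illustrate the grouping.

```python
from collections import defaultdict


def build_heat_data(rows):
    """Group (lat, lng, timestamp) rows into per-day lists of [lat, lng].

    Returns (days, heat_data), where days is sorted and heat_data[i]
    holds the points for days[i].
    """
    by_day = defaultdict(list)
    for lat, lng, timestamp in rows:
        day = timestamp.split(' ')[0]  # keep only the date part
        by_day[day].append([float(lat), float(lng)])
    days = sorted(by_day)
    return days, [by_day[day] for day in days]


# Invented sample rows; coordinates may arrive as strings or floats
sample_rows = [
    (51.5, -0.1, '2020-07-26 10:00:00'),
    ('38.2', '-85.7', '2020-07-25 12:30:00'),
    (48.9, 2.4, '2020-07-26 23:59:59'),
]
days, heat_data = build_heat_data(sample_rows)
print(days)
print(heat_data)
```

Keeping `days` alongside `heat_data` means the same list could be passed as the `index` argument of `HeatMapWithTime`, so that each frame is labelled with its date rather than a position number.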

In [None]:
def geojson_features(map_df):
    features = []
    for _, row in map_df.iterrows():
        feature = {
            'type': 'Feature',
            'geometry': {
                'type':'Point', 
                'coordinates':[row['lng'],row['lat']]
            },
            'properties': {
                'time': row['date'],
                'style': {'color' : 'red'},
                'icon': 'circle',
                'iconstyle':{
                    'fillColor': 'red',
                    'fillOpacity': 0.5,
                    'stroke': 'true',
                    'radius': 3
                }
            }
        }
        features.append(feature)
    return features
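
GeoJSON stores point coordinates as `[longitude, latitude]`, which is why `lng` comes first in the function above. Features are commonly wrapped in a `FeatureCollection` before being handed to a viewer; the sketch below shows that wrapper on one invented point and checks that it survives a JSON round trip. `to_feature_collection` is a hypothetical helper, not part of folium.

```python
import json


def to_feature_collection(features):
    """Wrap a list of GeoJSON Feature dicts in a FeatureCollection."""
    return {'type': 'FeatureCollection', 'features': list(features)}


# One invented point, shaped like the features built above
sample_feature = {
    'type': 'Feature',
    'geometry': {'type': 'Point', 'coordinates': [-0.1, 51.5]},  # [lng, lat]
    'properties': {'time': '2020-07-26 10:00:00'},
}
collection = to_feature_collection([sample_feature])

# Serializing confirms the structure contains only JSON-compatible values
print(json.dumps(collection, indent=2))
```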


# Timely Tweets and Their Geographic Distribution

This is a very heavy cell, because it plots a point for every timestamp in the dataset. The map animation in the cell output will be slow.

In [None]:

map_df = tweets[['lat','lng','user_location','date']].dropna()
timely_tweets = folium.Map(tiles='cartodbpositron', min_zoom=2) 

# Ensure you're handing it floats; the full timestamp in 'date' is kept as-is
map_df['lat'] = map_df['lat'].astype(float)
map_df['lng'] = map_df['lng'].astype(float)

# The per-date heat_data list is not needed here: TimestampedGeoJson takes the
# GeoJSON features directly, one point per tweet

# Plot it on the map
hm = TimestampedGeoJson(geojson_features(map_df), transition_time=200, loop=True, auto_play=False, add_last_point=True, 
                   period='P1D', min_speed=0.1, max_speed=10, loop_button=False, date_options='YYYY-MM-DD HH:mm:ss', 
                   time_slider_drag_update=False, duration=None)
hm.add_to(timely_tweets)
# Display the map
timely_tweets


# Conclusion

We have generated animations of the tweets and their geographical distribution.

The next big step is a real-time web application that displays the geographical distribution of tweets along with the top tweets at each point in time.

Here I am tweaking the HTML file generated by folium.

In that HTML I found a JavaScript function that is triggered whenever the map updates. We can use it to call another function that displays the top tweets at that timestamp.

I will then host this HTML on a simple Flask server.


Here I am cloning the already tweaked HTML file and will try to render it in this kernel.

In [None]:
!apt install git
!git clone https://github.com/AkhilRam7/Covid19Tweets.git
%cd Covid19Tweets

In [None]:
!pip install flask-ngrok

In [None]:

from flask_ngrok import run_with_ngrok
from flask import Flask, render_template
app = Flask(__name__)
run_with_ngrok(app)   #starts ngrok when the app is run
@app.route('/')
def index():
    return render_template('daily_tweets.html')
app.run()

In [None]:
from IPython.display import IFrame

# Replace src below with the ngrok serving address printed when the Flask app starts

IFrame(src='http://705d513a190c.ngrok.io/', width=700, height=600)