<a href="https://colab.research.google.com/github/Elzawawy/covid-case-estimator/blob/master/Temperature_Features_Exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Temperature Feature Exploration

In this notebook we set out to explore daily features for our daily-cases model estimator.

One of the most important daily features that could strongly influence COVID-19 cases is temperature, along with related features like humidity, wind, etc. In light of the conclusions of our previous work, this notebook explores the temperature features and exposes them in the form our models need.

In [8]:
#imports cell
import pandas as pd
import numpy as np
import csv
from shutil import copyfile
from enum import Enum

# mount google drive to copy files from repo into drive.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### We use the Official API for https://www.kaggle.com


*   The first dataset we explore is the Temperature Dataset by [Pierre Winter](https://www.kaggle.com/winterpierre91/covid19-global-weather-data)
*   To run this cell you need your own Kaggle API key: go to kaggle.com, open the `My Account` tab, and use the `Create API Key` button. Then upload the resulting `kaggle.json` file to the notebook's temporary storage.
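
The same credential setup can be done from Python instead of shell commands. This is a minimal sketch, not part of the original notebook: `install_kaggle_key` is a hypothetical helper name, and the paths in the usage note assume the Colab layout described above.

```python
import os
import shutil
import stat


def install_kaggle_key(uploaded_key, kaggle_dir):
    """Copy an uploaded kaggle.json into kaggle_dir and restrict its permissions."""
    # Create the target directory if it does not exist yet.
    os.makedirs(kaggle_dir, exist_ok=True)
    target = os.path.join(kaggle_dir, "kaggle.json")
    shutil.copyfile(uploaded_key, target)
    # Owner read/write only, the same effect as `chmod 600`.
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target
```

On Colab this would be called as `install_kaggle_key("/content/kaggle.json", os.path.expanduser("~/.kaggle"))`.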



In [4]:
!pip install kaggle
# You have to upload your own Kaggle API key (the `kaggle.json` file) into the temp directory first.
# Make sure the target directory exists before copying the key into it.
!mkdir -p ~/.kaggle
!cp /content/kaggle.json ~/.kaggle/kaggle.json
# Make the Kaggle API key unreadable by other users on this system.
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d winterpierre91/covid19-global-weather-data
!unzip covid19-global-weather-data.zip
!rm covid19-global-weather-data.zip

Downloading covid19-global-weather-data.zip to /content
  0% 0.00/204k [00:00<?, ?B/s]
100% 204k/204k [00:00<00:00, 76.6MB/s]
Archive:  covid19-global-weather-data.zip
  inflating: temperature_dataframe.csv  


### Reading and Understanding the Global Weather Dataset

*   Each country's rows contain some columns we do not need alongside the ones we actually need.
*   Some countries have multiple provinces, and therefore multiple data points for each day; others have a single data row per day, which is the form we require.
*   We map the multi-province countries to a single row per day by taking the mean of the features of interest across all provinces for each day.
*   We also drop the unneeded columns early, before processing the dataframe, to save some time.
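
To make the per-day averaging concrete, here is a small sketch on made-up data. It uses `groupby`, which is a vectorized alternative and not the loop-based approach this notebook implements; the toy country names and values are assumptions for illustration only.

```python
import pandas as pd

# Toy data (made up): country "A" has two provinces, country "B" has one row per day.
toy = pd.DataFrame({
    "country":  ["A", "A", "A", "A", "B", "B"],
    "province": ["p1", "p2", "p1", "p2", None, None],
    "date":     ["2020-01-22", "2020-01-22", "2020-01-23", "2020-01-23",
                 "2020-01-22", "2020-01-23"],
    "tempC":    [10.0, 20.0, 12.0, 18.0, 5.0, 6.0],
})

# Average the feature over all provinces of a country for each day.
daily_mean = toy.groupby(["country", "date"], as_index=False)["tempC"].mean()
print(daily_mean)
```

After the grouping, every country has exactly one row per day, whether it started with one province or several.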



In [14]:
GLOBAL_WEATHER_DATA_FILE = "/content/temperature_dataframe.csv"
STORAGE_DIR = "/content/drive/My Drive/COVID-19/daily-features/"
copyfile(GLOBAL_WEATHER_DATA_FILE, STORAGE_DIR+"global_weather_data.csv")

temperature_dataframe = pd.read_csv(STORAGE_DIR+"global_weather_data.csv")
temperature_dataframe.head()

Unnamed: 0.1,Unnamed: 0,id,province,country,lat,long,date,cases,fatalities,capital,humidity,sunHour,tempC,windspeedKmph
0,0,1,,Afghanistan,33.0,65.0,2020-01-22,0.0,0.0,Kabul,65.0,8.7,-1.0,8.0
1,1,2,,Afghanistan,33.0,65.0,2020-01-23,0.0,0.0,Kabul,59.0,8.7,-3.0,8.0
2,2,3,,Afghanistan,33.0,65.0,2020-01-24,0.0,0.0,Kabul,71.0,7.1,0.0,7.0
3,3,4,,Afghanistan,33.0,65.0,2020-01-25,0.0,0.0,Kabul,79.0,8.7,0.0,7.0
4,4,5,,Afghanistan,33.0,65.0,2020-01-26,0.0,0.0,Kabul,64.0,8.7,-1.0,8.0


### Cleaning the dataset and preparing for Dictionary Construction

1. Get the names of countries with multiple provinces.
2. Get the names of countries with a single province.
3. Remove the unneeded columns.
4. Get the range of available dates. We know it runs from 2020-01-22 to 2020-03-21, but it should be derived in code rather than hard-coded.
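
The four steps can be sketched on a toy dataframe as follows. The data is made up, and splitting countries by whether the `province` column is filled in is an assumption about how the dataset marks single-province countries.

```python
import pandas as pd

# Toy data (made up): "A" has two provinces, "B" has no province value.
toy = pd.DataFrame({
    "country":  ["A", "A", "B", "B"],
    "province": ["p1", "p2", None, None],
    "date":     ["2020-01-22", "2020-01-22", "2020-01-22", "2020-01-23"],
    "lat":      [1.0, 2.0, 3.0, 3.0],
    "tempC":    [10.0, 20.0, 5.0, 6.0],
})

# Steps 1 and 2: split countries by whether the province column is filled in.
multi = toy.loc[toy["province"].notna(), "country"].unique()
single = toy.loc[toy["province"].isna(), "country"].unique()

# Step 3: drop columns that are not needed downstream.
trimmed = toy.drop(columns=["lat", "province"])

# Step 4: derive the available dates from the data instead of hard-coding them.
dates = trimmed["date"].unique()
```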

In [0]:
def extract_from_dataset(dataframe):
  # step 1: rows with NaN in the province column are dropped; the countries that remain are the ones with multiple provinces.
  countries_with_multiple_provinces = dataframe.dropna(subset=["province"]).country.unique()
  # step 2: countries that have rows without a province value are treated as single-province countries.
  countries_with_single_province = dataframe[dataframe["province"].isna()].country.unique()
  # step 3: drop the unneeded columns (drop returns a new dataframe; the caller's dataframe is left unchanged).
  dataframe = dataframe.drop(columns=["Unnamed: 0","id","lat","long","cases","fatalities","capital","province"])
  # step 4: get the available date range (2020-01-22 to 2020-03-21) from the data instead of hard-coding it.
  dates_range = dataframe.date.unique()
  return (dataframe,countries_with_multiple_provinces,countries_with_single_province,dates_range)

### Create Feature Dictionary Method

1. Calls the `extract_from_dataset` function to prepare the dataframe and extract the country lists and the date range it needs.

2. Creates a `K:Country- V:Feature` dictionary for the requested feature, starting with the countries that have multiple provinces. These need special handling because the mean across their provinces has to be calculated first.

3. Adds the remaining countries, which have only a single province and are easier to handle, to the same `K:Country- V:Feature` dictionary.

**Notes about data that had to be handled:**

* Some countries have no data for the desired features, which is why we add the `count() != 0` check in the second loop.

* One country (Gambia) has duplicated data for each day, which is why we also add `drop_duplicates()` in the second loop.
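
Both checks can be seen on a small made-up example. Country "G" stands in for a country whose rows are fully repeated, and "E" for one with no feature data; the names and values are illustrative only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["G", "G", "G", "G", "E", "E"],
    "date":    ["2020-01-22", "2020-01-22", "2020-01-23", "2020-01-23",
                "2020-01-22", "2020-01-23"],
    "tempC":   [30.0, 30.0, 31.0, 31.0, np.nan, np.nan],
})

# Fully repeated rows collapse to one row per day.
g = toy[toy["country"] == "G"].drop_duplicates()["tempC"]

# Series.count() ignores NaN, so a country without feature data counts zero values.
e = toy[toy["country"] == "E"].drop_duplicates()["tempC"]

print(g.to_numpy(), e.count())
```

Note that `drop_duplicates()` only removes rows that match in every remaining column, so it handles exact repeats but not two different readings for the same day.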

In [0]:
def create_feature_dict(dataframe, feature):
  if feature not in ['tempC', 'humidity', 'sunHour', 'windspeedKmph']:
    raise ValueError("Feature must be one of the four temperature-related features")
  (dataframe,multi_countries,single_countries,avail_dates) = extract_from_dataset(dataframe)
  # build a dictionary for the feature where the key is the country and the value is a 1-D array of daily values.
  country_dict = {}
  for country in multi_countries:
    # for each available date, take the mean of the feature across this country's provinces.
    country_feature = []
    for date in avail_dates:
      # select only the feature column before averaging, so non-numeric columns (country, date) are never averaged.
      day_rows = (dataframe['country'] == country) & (dataframe['date'] == date)
      country_feature.append(dataframe.loc[day_rows, feature].mean())
    country_dict[country] = np.array(country_feature)

  # add the single-province countries to the same dictionary.
  for country in single_countries:
    # Gambia's data has an issue: all of its dates are repeated twice, so we have to drop duplicates.
    feature_series = dataframe[dataframe['country'] == country].drop_duplicates()[feature]
    # skip countries with no feature data.
    if feature_series.count() != 0:
      country_dict[country] = feature_series.to_numpy()
  return country_dict

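def save_dict_to_csv(feature_dict, csv_file):
  # use a context manager so the file is flushed and closed; newline="" is what the csv module expects.
  with open(csv_file, "w", newline="") as csv_handle:
    w = csv.writer(csv_handle)
    for key, val in feature_dict.items():
      # each row holds the country name and the string form of its array of values.
      w.writerow([key, val])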
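
Because each row stores the array as a single string, reading the values back later requires parsing that string. One possible alternative, sketched here as an assumption and not as the notebook's method, writes one value per column so the file round-trips with the standard `csv` module. `save_rows` and `load_rows` are hypothetical helper names, and the example values are illustrative.

```python
import csv
import os
import tempfile


def save_rows(feature_dict, csv_file):
    """Write one row per country: the name followed by one value per column."""
    with open(csv_file, "w", newline="") as f:
        writer = csv.writer(f)
        for country, values in feature_dict.items():
            writer.writerow([country, *values])


def load_rows(csv_file):
    """Read the file back into a dict of country -> list of floats."""
    with open(csv_file, newline="") as f:
        return {row[0]: [float(v) for v in row[1:]] for row in csv.reader(f)}


# Illustrative values only; a real dictionary would hold one value per day.
example = {"Afghanistan": [-1.0, -3.0, 0.0], "Egypt": [15.0, 16.0, 14.0]}
path = os.path.join(tempfile.mkdtemp(), "tempC_dict.csv")
save_rows(example, path)
restored = load_rows(path)
```

The trade-off is a wider file (one column per day) in exchange for values that can be read back without custom string parsing.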

### Finally, we iterate over each of our four temperature-related features.

* Obtain a country-to-feature dictionary, where the key is the country name (a string) and the value is a 1-D array of 60 values, one for each day from Day 1 to Day 60.

* Save that dictionary to a CSV file in permanent Google Drive storage for later use.

In [0]:
for feature in ['tempC', 'humidity','sunHour', 'windspeedKmph']:
  country_feature_dict = create_feature_dict(temperature_dataframe, feature)
  save_dict_to_csv(country_feature_dict, STORAGE_DIR+feature+"_dict.csv")