# Lecture 9 - The data analysis process - from download to visualization, using COVID-19 data as an example

## Getting the data
### - JSON. Fetching data from remote APIs.
### - Data Wrangling

##  Feature Engineering examples
##  Visualizing the example

https://github.com/MichalKorzycki/WarsztatPythonDataScience.git/

# Getting the data

API

https://api.covid19api.com/

_Impact of non-pharmaceutical interventions (NPIs) to reduce COVID19 mortality and healthcare demand_ - Neil M Ferguson et al. 

https://www.imperial.ac.uk/media/imperial-college/medicine/sph/ide/gida-fellowships/Imperial-College-COVID19-NPI-modelling-16-03-2020.pdf

## JavaScript Object Notation - JSON

In [None]:
import json
from pprint import pprint

data = [ {"Name": "Jan", "Surname": "Kowalski", "Age": 37}, {"Name": "Marek", "Surname": "Nowak", "Age": 53}]

my_json_string = json.dumps(data)
type(my_json_string)
pprint(my_json_string)

In [None]:
my_json_object = json.loads(my_json_string)
pprint(my_json_object)
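
A minimal sketch of the round trip shown in the two cells above: `json.dumps` produces a plain `str`, and `json.loads` turns that text back into Python lists and dicts. The `records` name and the `indent` setting are illustrative choices, not part of the original cells.

```python
import json

records = [{"Name": "Jan", "Surname": "Kowalski", "Age": 37}]

# Serialize: Python objects -> JSON text (a plain str).
text = json.dumps(records, indent=2)

# Deserialize: JSON text -> Python objects.
restored = json.loads(text)

print(type(text).__name__)    # str
print(restored == records)    # True
```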

# Fetching data from remote APIs

In [None]:
import requests

url = "https://api.covid19api.com/"
response = requests.request(method="GET", url=url)
print(response.text)

In [None]:
from pprint import pprint

json_response = response.json()
pprint(json_response[:2])

In [None]:
url = "https://api.covid19api.com/countries"
response = requests.request(method="GET", url=url)
pprint(response.json()[:15])

Download the data and save it to a file

In [None]:
import json
from datetime import datetime

# Timestamp without ':' so the file name is also valid on Windows.
now = datetime.now().strftime("%d-%b-%Y-%H-%M-%S")
fname = "all-" + now + ".json"
print("File name: %s" % fname)

url = "https://api.covid19api.com/all"
response = requests.request(method="GET", url=url)
# Write to the timestamped file name computed above.
with open(fname, 'w') as f:
    json.dump(response.json(), f)
    print("File saved")

In [None]:
import json

with open('wyklad9/all.json') as json_file:
    data = json.load(json_file)

print("Records: %d" % len(data))

Choose one of the following

In [None]:
import requests 
import json

url = "https://api.covid19api.com/all"

with requests.Session() as s:
    input_data = s.get(url).json()


print("Records: %d" % len(input_data))

In [None]:
import json 

filename = 'wyklad9/all.json'

with open(filename) as json_file:
    input_data = json.load(json_file)
    
print(len(input_data))
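
The two alternatives above (remote API or local file) can be folded into one helper. This is only a sketch: `load_or_fetch` and its `fetch` callable are hypothetical names introduced here, and the helper assumes the cached file, if present, is valid JSON.

```python
import json
import os


def load_or_fetch(path, fetch):
    """Return parsed JSON from `path`; if the file is missing, call `fetch()` and cache its result."""
    if os.path.exists(path):
        # Offline copy available: no network access needed.
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    data = fetch()
    # Store the freshly downloaded data for later offline use.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data
```

With `requests` imported, a call could look like `input_data = load_or_fetch('wyklad9/all.json', lambda: requests.get(url, timeout=30).json())`.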

In [None]:
from pprint import pprint

pprint(input_data[-3:-1])

``` python 
import pandas as pd
data = pd.read_json ('wyklad9/all.json')
```

In [None]:
import pandas as pd

data = pd.DataFrame(input_data)
data.head()

In [None]:
data.dtypes

we coerce the 'Date' column to a datetime ...

In [None]:
data['Date'] = pd.to_datetime(data['Date'], errors='coerce', format='%Y-%m-%dT%H:%M:%S') 
data['Day'] = data['Date'].dt.date
data.dtypes
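
A small sketch of what `errors='coerce'` does, using a made-up two-element series: values that do not match the format become `NaT` instead of raising, which is why a later `dropna()` removes them.

```python
import pandas as pd

raw = pd.Series(["2020-03-01T00:00:00", "not a date"])

# Unparseable entries are turned into NaT rather than raising an exception.
parsed = pd.to_datetime(raw, errors="coerce", format="%Y-%m-%dT%H:%M:%S")

print(parsed.isna().tolist())   # [False, True]
print(parsed.dt.date.iloc[0])   # 2020-03-01
```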

In [None]:
data.head()

In [None]:
data = data.dropna()
data.head()

In [None]:
australia = data[ data['Country'] == 'Australia' ]
australia.head(15)

We drop the geographic coordinates and the province

In [None]:
df = data[['Country','Date', 'Day', 'Cases', 'Status']] 
df.head()

Sum the provinces within each country

In [None]:
# Group by the identifying columns only; 'Cases' is the value being summed.
df = df.groupby(['Country', 'Date', 'Day', 'Status'], as_index=False)['Cases'].sum()
df.head()
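
A toy illustration of summing provinces into countries (made-up numbers and column names): the value column `Cases` is aggregated, while the grouping keys are only the identifying columns. If `Cases` were part of the key, provinces reporting the same count would collapse into one row instead of being added.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["Australia", "Australia", "Australia"],
    "Province": ["NSW", "Victoria", "Queensland"],
    "Status": ["confirmed"] * 3,
    "Cases": [5, 5, 2],
})

# Group only by the identifying columns and sum the value column.
per_country = toy.groupby(["Country", "Status"], as_index=False)["Cases"].sum()
print(per_country["Cases"].tolist())   # [12]
```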

Build a pivot table with the columns: confirmed, deaths and recovered

In [None]:
df = df.pivot_table(
        values='Cases', 
        index=['Country', 'Date', 'Day'], 
        columns='Status', 
        aggfunc='sum')

df.reset_index(inplace=True)
df.head(10)
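
The long-to-wide reshaping can be seen on a tiny made-up table: each distinct `Status` value becomes its own column, and `reset_index()` turns the index back into ordinary columns.

```python
import pandas as pd

long_df = pd.DataFrame({
    "Country": ["Italy", "Italy", "Italy", "US", "US", "US"],
    "Status": ["confirmed", "deaths", "recovered"] * 2,
    "Cases": [100, 10, 20, 200, 5, 8],
})

# One row per country, one column per status.
wide_df = long_df.pivot_table(values="Cases", index="Country",
                              columns="Status", aggfunc="sum").reset_index()
print(list(wide_df.columns))   # ['Country', 'confirmed', 'deaths', 'recovered']
```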

In [None]:
df[df["Country"]=="Canada"].head(15)

... clean up the country names

In [None]:
df_iran = df[ df["Country"].str.contains("Iran")  ]
df_iran

In [None]:
df = df.replace('Iran (Islamic Republic of)', 'Iran')
df_iran = df[ df["Country"].str.contains("Iran")  ]
df_iran

In [None]:
df_korea = df[ df["Country"].str.contains("Korea")  ]
df_korea.shape

In [None]:
pd.set_option('display.max_rows', None)
df_korea

In [None]:
df = df.replace('Korea, South', 'South Korea')
df = df.replace('Republic of Korea', 'South Korea')
df_korea = df[ df["Country"].str.contains("Korea")  ]
df_korea

In [None]:
df["Country"].value_counts()

In [None]:
df = df.replace('Russian Federation', 'Russia')
df = df.replace(' Azerbaijan', 'Azerbaijan')
df = df.replace('Republic of Ireland', 'Ireland')
df = df.replace('Republic of Moldova', 'Moldova')
df = df.replace('Hong Kong SAR', 'Hong Kong')
df = df.replace('Taipei and environs', 'Taiwan')
df = df.replace('Taiwan*', 'Taiwan')
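
The chain of `replace` calls can also be kept as a single mapping. A sketch under the assumption that only the `Country` column needs cleaning; `COUNTRY_ALIASES` is a hypothetical name holding a subset of the renames above.

```python
import pandas as pd

# Hypothetical subset of the renames above, kept in one place.
COUNTRY_ALIASES = {
    "Korea, South": "South Korea",
    "Republic of Korea": "South Korea",
    "Russian Federation": "Russia",
    "Taiwan*": "Taiwan",
}

names = pd.Series(["Korea, South", " Azerbaijan", "Poland", "Taiwan*"])

# str.strip() removes stray whitespace such as the leading space in ' Azerbaijan';
# replace() with a dict swaps exact matches and leaves other values untouched.
cleaned = names.str.strip().replace(COUNTRY_ALIASES)
print(cleaned.tolist())   # ['South Korea', 'Azerbaijan', 'Poland', 'Taiwan']
```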

Compute the date 15 days before the last day in the dataset

In [None]:
lastday = max(df["Date"])
daysbefore = lastday + pd.Timedelta(days=-15)
daysbefore

filter by date

In [None]:
df = df[ df["Date"] >= pd.to_datetime(daysbefore) ]
df.head(15)
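
A quick check of the window arithmetic on synthetic dates: with `>=` the cutoff day itself is included, so a 15-day offset keeps 16 distinct days, while a strict `>` keeps exactly 15.

```python
import pandas as pd

dates = pd.DataFrame({"Date": pd.date_range("2020-03-01", periods=30, freq="D")})

last = dates["Date"].max()
cutoff = last - pd.Timedelta(days=15)

inclusive = dates[dates["Date"] >= cutoff]
strict = dates[dates["Date"] > cutoff]
print(len(inclusive), len(strict))   # 16 15
```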

take the top N countries by number of cases on the last day

In [None]:
topdf = df[ df["Date"] == lastday ]
topdf = topdf.sort_values(by=['confirmed'])
topdf.head(25)

In [None]:
# Reassign instead of sorting in place, because topdf is a slice of df.
topdf = topdf.sort_values(by=['confirmed'], ascending=False)
topdf.head(25)

In [None]:
N=10
first_N_countries = topdf.iloc[0:N]["Country"]
first_N_countries

In [None]:
second_N_countries = topdf.iloc[N:2*N]["Country"]
second_N_countries
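
As an alternative to sorting and slicing with `iloc`, pandas offers `nlargest`. A sketch on a made-up snapshot of one day:

```python
import pandas as pd

snapshot = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "confirmed": [10, 40, 30, 20],
})

# The two countries with the highest confirmed counts, largest first.
top2 = snapshot.nlargest(2, "confirmed")["Country"].tolist()
print(top2)   # ['B', 'C']
```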

In [None]:
df = df[ df['Country'].isin(first_N_countries)  ]
df.head(10)

In [None]:
df = df.sort_values(by=['Country', 'Date'])
df.reset_index(drop=True, inplace=True)  # drop=True avoids adding a stray 'index' column
df.head(100)

# The whole preprocessing pipeline put together...

In [None]:
import json
import requests 
import pandas as pd
import numpy as np
url = "https://api.covid19api.com/all"

with requests.Session() as s:
    input_data = s.get(url).json()
print("Read %d rows from %s" % (len(input_data), url) )

with open('wyklad9/all.json', 'w') as f:
    json.dump(input_data, f)

data = pd.DataFrame(input_data)
data['Date'] = pd.to_datetime(data['Date'], errors='coerce', format='%Y-%m-%dT%H:%M:%S') 
data['Day'] = data['Date'].dt.date

data = data.dropna()
print("The input has %d records and %d columns" % (data.shape[0],data.shape[1]))

In [None]:
######################################
DAYS_WINDOW=31
N=10
######################################

lastday = max(data["Date"])
daysbefore = lastday + pd.Timedelta(days=-DAYS_WINDOW)
print("Data from %s to %s" % (str(daysbefore).split(' ')[0], str(lastday).split(' ')[0]) )
df = data[ data["Date"] > pd.to_datetime(daysbefore) ]
print("%d records and %d columns remain" % (df.shape[0],df.shape[1]))

df = df[['Country','Date', 'Day', 'Cases', 'Status']] 
# Group by the identifying columns only; 'Cases' is the value being summed.
df = df.groupby(['Country', 'Date', 'Day', 'Status'], as_index=False)['Cases'].sum()
print("After aggregating provinces we have %d records and %d columns: %s" % ( df.shape[0],df.shape[1], " ".join(df.columns) ))

df = df.replace('Iran (Islamic Republic of)', 'Iran')
df = df.replace('Korea, South', 'South Korea')
df = df.replace('Republic of Korea', 'South Korea')
df = df.replace('Russian Federation', 'Russia')
df = df.replace(' Azerbaijan', 'Azerbaijan')
df = df.replace('Republic of Ireland', 'Ireland')
df = df.replace('Republic of Moldova', 'Moldova')
df = df.replace('Hong Kong SAR', 'Hong Kong')
df = df.replace('Taipei and environs', 'Taiwan')
df = df.replace('Taiwan*', 'Taiwan')

df = df.pivot_table(
        values='Cases', 
        index=['Country', 'Date', 'Day'], 
        columns='Status', 
        aggfunc='sum')

df.reset_index(inplace=True)
print("After the pivot we have %d records and %d columns:  %s" % ( df.shape[0], df.shape[1], " ".join(df.columns) ))

topdf = df[ df["Date"] == lastday ]
topdf = topdf.sort_values(by=['confirmed'], ascending=False)
first_N_countries = topdf.iloc[0:N]["Country"]
smaller_top_N = topdf.iloc[0:(N//2)]["Country"]

italy = df[ df['Country'] == 'Italy'  ] 
us = df[ df['Country'] == 'US'  ] 

df = df[ df['Country'].isin(first_N_countries)  ]
df = df.sort_values(by=['Country', 'Date'])
df.reset_index(drop=True, inplace=True)

print("After filtering we have %d records and %d columns: %s" % ( df.shape[0],df.shape[1]," ".join(df.columns) ))
print("Prepared data from %d days for %d countries" % 
      ( len(df["Date"].value_counts()), len(df["Country"].value_counts()) ))
df.head()

In [None]:
smaller_df = df[ df['Country'].isin(smaller_top_N)  ]
smaller_df = smaller_df.sort_values(by=['Country', 'Date'])
smaller_df.reset_index(drop=True, inplace=True)

print("After filtering the smaller dataset we have %d records and %d columns: %s" % ( smaller_df.shape[0],smaller_df.shape[1]," ".join(smaller_df.columns) ))
print("Prepared smaller data from %d days for %d countries" % 
      ( len(smaller_df["Date"].value_counts()), len(smaller_df["Country"].value_counts()) ))
smaller_df.head()

# Visualization

In [None]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

One country

In [None]:
plt.figure(figsize=(20,10))
plt.style.use("dark_background")

chart = sns.lineplot(x='Day',
                     y='confirmed',
                     color='g',linestyle='-', marker='o',
                     data=italy
                    )

chart.set_title('Confirmed COVID-19 Cases')

plt.show();

Two countries

In [None]:
plt.figure(figsize=(20,10))
plt.style.use("dark_background")

chart = sns.lineplot(x='Day',
                     y='confirmed',
                     color='g',linestyle='-', marker='o',
                     data=italy
                    )
chart = sns.lineplot(x='Day',
                     y='confirmed',
                     color='r',linestyle='-', marker='o',
                     data=us
                    )
chart.set_title('Confirmed COVID-19 Cases')

plt.show();

N countries

In [None]:
plt.figure(figsize=(20,10))
plt.style.use("dark_background")

chart = sns.lineplot(x='Day',
                     y='confirmed',
                     hue='Country',linestyle='-', marker='o',
                     palette='bright',    
                     data=df
                    )
chart.set_title('Confirmed COVID-19 Cases')

plt.show();

fix the legend

In [None]:
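def fix_legend(chart, marker="o"):
    """Order legend entries by the top-N ranking and give each handle a marker."""
    handles, labels = chart.get_legend_handles_labels()
    # Rank of each country in the top-N list (0 = most confirmed cases).
    sorting_order = {country: rank for rank, country in enumerate(first_N_countries)}
    # Keep only entries for known countries; this also drops the "Country"
    # title entry that some seaborn versions add to the legend.
    labels_handles = [(l, h) for l, h in zip(labels, handles) if l in sorting_order]

    labels_handles.sort(key=lambda x: sorting_order[x[0]])
    labels = [x[0] for x in labels_handles]
    handles = [x[1] for x in labels_handles]
    for handle in handles:
        handle.set_marker(marker)
        handle.set_markeredgecolor("white")

    return handles, labels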

In [None]:
plt.figure(figsize=(20,10))
plt.style.use("dark_background")

chart = sns.lineplot(x='Day',
                     y='confirmed',
                     hue='Country',linestyle='-', marker='o',
                     palette='bright',    
                     data=df
                    )

chart.set_title('Confirmed COVID-19 Cases')

handles, labels = fix_legend(chart)
plt.legend(handles, labels, frameon=False, loc="best")

plt.show();

In [None]:
plt.figure(figsize=(20,10))
plt.style.use("dark_background")

chart = sns.lineplot(x='Day',
                     y='confirmed',
                     hue='Country',linestyle='-', marker='o',
                     palette='bright',    
                     data=df
                    )

chart.set_title('Confirmed COVID-19 Cases')

handles, labels = fix_legend(chart)
plt.legend(handles, labels, frameon=False, loc="best")

plt.yscale("log")

plt.show();

Two data series

In [None]:
plotdata=df

plt.figure(figsize=(20,10))
plt.style.use("dark_background")

chart = sns.lineplot(x='Day',
                     y='confirmed',
                     hue='Country',linestyle='-', marker='o',
                     palette='bright',    
                     data=plotdata
                    )


chart.set_title('Confirmed COVID-19 cases vs number of deaths for %d countries' % N)

handles, labels = fix_legend(chart)
plt.legend(handles, labels, frameon=False, loc=2, title="Confirmed")

ax2 = chart.twinx()

chart2 = sns.lineplot(x='Day',
                     y='deaths',
                     hue='Country', linestyle='-', marker='s',
                     palette='muted',    
                     data=plotdata,
                       ax=ax2
                    )

handles, labels = fix_legend(chart2, marker="s")
legend2 = plt.legend(handles, labels, loc=2, frameon=False, title="Deaths", bbox_to_anchor=(0.15, 1))

plt.show();

Fewer countries

In [None]:
plotdata=smaller_df

plt.figure(figsize=(20,10))
plt.style.use("dark_background")

chart = sns.lineplot(x='Day',
                     y='confirmed',
                     hue='Country',linestyle='-', marker='o',
                     palette='bright',    
                     data=plotdata
                    )


chart.set_title('Confirmed COVID-19 cases vs number of deaths for %d countries' % (N//2) )

handles, labels = fix_legend(chart)
plt.legend(handles, labels, frameon=False, loc=2, title="Confirmed")

ax2 = chart.twinx()

chart2 = sns.lineplot(x='Day',
                     y='deaths',
                     hue='Country', linestyle='-', marker='s',
                     palette='muted',    
                     data=plotdata,
                     ax=ax2
                    )

handles, labels = fix_legend(chart2, marker="s")
legend2 = plt.legend(handles, labels, loc=2, frameon=False, title="Deaths", bbox_to_anchor=(0.15, 1))

plt.show();

## Adding features - ratios between features

In [None]:
df["Mortality"] = 100*df["deaths"]    / df["confirmed"] 
df["Recovery"]  = 100*df["recovered"] / df["confirmed"] 
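
Ratio features need some care when the denominator can be zero. A sketch on made-up numbers that marks such rows as undefined (`NaN`) instead of producing `inf`; the `denominator` name is illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"confirmed": [0, 50, 200], "deaths": [0, 1, 10]})

# Treat a zero denominator as "undefined" (NaN) rather than dividing by zero.
denominator = toy["confirmed"].replace(0, np.nan)
toy["Mortality"] = 100 * toy["deaths"] / denominator
print(toy["Mortality"].tolist())   # [nan, 2.0, 5.0]
```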

In [None]:
plt.figure(figsize=(20,10))
plt.style.use("dark_background")

chart = sns.lineplot(x='Day',
                     y='Mortality',
                     hue='Country',linestyle='-', marker='o',
                     palette='bright',    
                     data=df
                    )

chart.set_title('COVID-19 Mortality')

handles, labels = fix_legend(chart)
plt.legend(handles, labels, frameon=False, loc="best")

plt.show();

In [None]:
plotdata=df

plt.figure(figsize=(20,10))
plt.style.use("dark_background")

chart = sns.lineplot(x='Day',
                     y='Recovery',
                     hue='Country',linestyle='-', marker='o',
                     palette='bright',    
                     data=plotdata
                    )

chart.set_title('COVID-19 Recovery rate')

handles, labels = fix_legend(chart)
plt.legend(handles, labels, frameon=False, loc="best")

plt.show();

## Features as a function of several rows - difference and moving average

In [None]:
countries = df["Country"].unique()
countries

In [None]:
dataframes = [ df[ df["Country"] == x] for x in countries ] 
dataframes[0].head()

... this will raise an error

In [None]:
dataframes[0].diff()

In [None]:
m_df = dataframes[0][ ["Day", "confirmed", "deaths", "recovered"] ]
m_df.set_index("Day", inplace=True)
m_df.head()

In [None]:
m_df.diff().head()

In [None]:
m_df.diff().rolling(2).mean()
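
What `diff()` and `rolling(...).mean()` compute is easiest to see on a short made-up cumulative series: `diff()` gives the day-over-day increments, and the rolling mean averages each increment with the one before it.

```python
import pandas as pd

cumulative = pd.Series([1, 3, 6, 10, 15])

daily = cumulative.diff()             # NaN, 2, 3, 4, 5
smoothed = daily.rolling(2).mean()    # NaN, NaN, 2.5, 3.5, 4.5
print(smoothed.tolist())
```

The leading `NaN` values appear because the first row has no predecessor and a window of 2 needs two valid observations.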

In [None]:
ROLL=7
result = []

for m_df in dataframes:
    country = m_df['Country'].iloc[0]
    # Keep the cumulative counts for one country, indexed by day.
    m_df = m_df[ ["Day", "confirmed", "deaths", "recovered"] ].set_index("Day")
    # Day-over-day difference of the cumulative counts.
    df_diff = m_df.diff()
    df_diff.columns=["confirmed change", "deaths change", "recovered change"]
    
    m_df = pd.concat([m_df, df_diff], axis=1, sort=False)
    
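    # Daily change as a percentage of the same day's cumulative total.
    m_df["confirmed pct change"] = 100.0 * m_df["confirmed change"] / m_df["confirmed"]
    # Zero out implausible drops (-50% or lower); NaN also fails the comparison and becomes 0.
    m_df["confirmed pct change"]  = m_df["confirmed pct change"].apply(lambda x: x if x > -50.0 else 0.0)

    m_df["deaths pct change"] = 100.0 * m_df["deaths change"] / m_df["deaths"]
    m_df["deaths pct change"]  = m_df["deaths pct change"].apply(lambda x: x if x > -50.0 else 0.0)
    # For deaths, also zero out jumps of 99% or more (e.g. the step from zero deaths).
    m_df["deaths pct change"]  = m_df["deaths pct change"].apply(lambda x: x if x < 99.0 else 0.0)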
    
    m_df["rolling confirmed pct change"] = m_df["confirmed pct change"] .rolling(window=ROLL).mean()
    m_df["rolling deaths pct change"] = m_df["deaths pct change"] .rolling(window=ROLL).mean()
    m_df = m_df.dropna()
    
    m_df = m_df.sort_values(by="Day")
    m_df["Country"] = country
    m_df.reset_index(inplace=True)
    result.append(m_df)
    
new_df = pd.concat(result, axis=0, sort=False)
new_df.head(15)
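
The per-country loop can also be expressed with `groupby`, which computes the difference separately inside each group. This is a sketch on made-up data showing only the differencing step, not a drop-in replacement for the full loop above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Day": [1, 2, 3, 1, 2, 3],
    "confirmed": [10, 15, 25, 100, 100, 130],
})

toy = toy.sort_values(["Country", "Day"])
# diff() restarts at each country boundary, so B's first row is NaN, not 100 - 25.
toy["confirmed change"] = toy.groupby("Country")["confirmed"].diff()
print(toy["confirmed change"].tolist())   # [nan, 5.0, 10.0, nan, 0.0, 30.0]
```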

In [None]:
plotdata=new_df

plt.figure(figsize=(20,10))
plt.style.use("dark_background")

chart = sns.lineplot(x='Day',
                     y='confirmed pct change',
                     hue='Country',linestyle='-', marker='o',
                     palette='bright',    
                     data=plotdata
                    )

chart.set_title('COVID-19 Confirmed percentage change')

handles, labels = fix_legend(chart)
plt.legend(handles, labels, frameon=False, loc="best")

plt.show();

In [None]:
plotdata=new_df

plt.figure(figsize=(20,10))
plt.style.use("dark_background")

chart = sns.lineplot(x='Day',
                     y='rolling confirmed pct change',
                     hue='Country',linestyle='-', marker='o',
                     palette='bright',    
                     data=plotdata
                    )

chart.set_title('COVID-19 Confirmed percentage change daily rolling average over %d days' % ROLL)

handles, labels = fix_legend(chart)
plt.legend(handles, labels, frameon=False, loc="best")

plt.show();

In [None]:
plotdata=new_df

plt.figure(figsize=(20,10))
plt.style.use("dark_background")

chart = sns.lineplot(x='Day',
                     y='deaths pct change',
                     hue='Country',linestyle='-', marker='o',
                     palette='bright',    
                     data=plotdata
                    )

chart.set_title('COVID-19 Confirmed deaths change in pct daily')

handles, labels = fix_legend(chart)
plt.legend(handles, labels, frameon=False, loc="best")

plt.show();

In [None]:
plotdata=new_df

plt.figure(figsize=(20,10))
plt.style.use("dark_background")

chart = sns.lineplot(x='Day',
                     y='rolling deaths pct change',
                     hue='Country',linestyle='-', marker='o',
                     palette='bright',    
                     data=plotdata
                    )

chart.set_title('COVID-19 deaths change in pct daily rolling average over %d days' % ROLL)

handles, labels = fix_legend(chart)
plt.legend(handles, labels, frameon=False, loc="best")


plt.show();

## Changing the _X_ dimension

In [None]:
plt.figure(figsize=(20,10))
plt.style.use("dark_background")
plt.xscale("log")
plt.yscale("log")
chart = sns.lineplot(x='confirmed',
                     y='deaths',
                     hue='Country',linestyle='-', marker='o',
                     palette='bright',    
                     data=df
                    )

chart.set_title('COVID-19 Mortality')

handles, labels = fix_legend(chart)
plt.legend(handles, labels, frameon=False, loc="best")

plt.show();

---
# _Feature engineering_ is the process of applying domain knowledge to create data features for machine learning.

---

### Coming up with features is difficult, time-consuming, requires expert knowledge. 
### "Applied machine learning" is basically feature engineering.
## <p style='text-align: right;'>- Andrew Ng</p>

---

# The analysis process

## Preprocessing
- Where possible, fetch the data online but keep an offline copy
- Data wrangling is 80% of the work
- Keep the processing and analysis parameters under control, here: ROLL, N, DAYS_WINDOW
- Data quality is crucial
- Manual data quality fixes are often needed

## Data analysis
- Choosing the right features is crucial (time is not necessarily one of them)
- The features you need often have to be created
- Many phenomena are exponential or heavy-tailed (e.g. the Pareto principle) - use relative values

## Visualization 
- Match the chart type to the phenomenon
- For exponential phenomena - use a logarithmic scale
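
Why a logarithmic scale suits exponential phenomena can be checked numerically. In this sketch the 30% daily growth rate is a made-up assumption; the point is that a constant growth factor turns into a constant step after taking logarithms, which is a straight line on a log-scaled axis.

```python
import math

# Hypothetical series that grows by 30% per day.
cases = [100 * 1.3 ** day for day in range(6)]

# On a log scale, equal growth factors become equal steps.
log_steps = [math.log10(b) - math.log10(a) for a, b in zip(cases, cases[1:])]
print([round(s, 4) for s in log_steps])   # every step is log10(1.3), about 0.1139
```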
