# Notebook for IART project 2

## Data Importing

The dataframe imported here contains data from the WHO situation reports from 2020-01-22 to 2020-05-21 (at the time of writing) for 188 different countries. For each day, it gives the cumulative numbers of confirmed cases, confirmed deaths, and recovered patients up to that day.

In [1]:
import pandas as pd

imported_df = pd.read_csv('Data/covid_19_clean_complete.csv')

display(imported_df)

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered
0,,Afghanistan,33.000000,65.000000,1/22/20,0,0,0
1,,Albania,41.153300,20.168300,1/22/20,0,0,0
2,,Algeria,28.033900,1.659600,1/22/20,0,0,0
3,,Andorra,42.506300,1.521800,1/22/20,0,0,0
4,,Angola,-11.202700,17.873900,1/22/20,0,0,0
...,...,...,...,...,...,...,...,...
32060,,Sao Tome and Principe,0.186360,6.613081,5/21/20,251,8,4
32061,,Yemen,15.552727,48.516388,5/21/20,197,33,0
32062,,Comoros,-11.645500,43.333300,5/21/20,34,1,8
32063,,Tajikistan,38.861034,71.276093,5/21/20,2350,44,0


## Modifying the data

For the purpose of this project we will make some quality-of-life changes to the data to make it more adequate for our usage. The changes we will perform are:

- Change the date column to a counter of the days passed since the first data was collected.
- Group the data of each country into one single line per day, which means removing the Province/State column.
- Add a column for the number of current active cases (total confirmed cases - (deaths + recovered)).
- Add a column for the number of active cases on the day before.
- Remove lines in which a country has 0 confirmed cases.
- MAYBE give each country its own day count, starting from the first infection detected in the data.

### Changing Date column to counter of days passed

In [2]:
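# Parse the M/D/YY strings into datetimes, then store the offset from the first reporting day.
imported_df['Date'] = pd.to_datetime(imported_df['Date'], format='%m/%d/%y')
first_day = pd.to_datetime('2020-01-22')
imported_df['Day'] = imported_df['Date'] - first_day
del imported_df['Date']
display(imported_df)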


Unnamed: 0,Province/State,Country/Region,Lat,Long,Confirmed,Deaths,Recovered,Day
0,,Afghanistan,33.000000,65.000000,0,0,0,0 days
1,,Albania,41.153300,20.168300,0,0,0,0 days
2,,Algeria,28.033900,1.659600,0,0,0,0 days
3,,Andorra,42.506300,1.521800,0,0,0,0 days
4,,Angola,-11.202700,17.873900,0,0,0,0 days
...,...,...,...,...,...,...,...,...
32060,,Sao Tome and Principe,0.186360,6.613081,251,8,4,120 days
32061,,Yemen,15.552727,48.516388,197,33,0,120 days
32062,,Comoros,-11.645500,43.333300,34,1,8,120 days
32063,,Tajikistan,38.861034,71.276093,2350,44,0,120 days
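
The `Day` column above holds `Timedelta` values (shown as `0 days`, `120 days`). If a plain integer counter is preferred, for example as a numeric feature, the `.dt.days` accessor could be used. A minimal sketch on a small made-up frame; the sample rows and the `Day_Number` column name are illustrative assumptions, not part of the notebook:

```python
import pandas as pd

# Tiny made-up sample in the same M/D/YY format as the CSV.
sample = pd.DataFrame({'Date': ['1/22/20', '1/23/20', '5/21/20']})

sample['Date'] = pd.to_datetime(sample['Date'], format='%m/%d/%y')
first_day = pd.to_datetime('2020-01-22')

# Timedelta column, as in the cell above.
sample['Day'] = sample['Date'] - first_day
# Integer number of whole days (hypothetical column name).
sample['Day_Number'] = sample['Day'].dt.days

print(sample['Day_Number'].tolist())
```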


### Grouping the data of each country into one single line per day

In [3]:
del imported_df['Province/State']
# Sum the counts over all provinces of a country for each day; keep the first listed coordinates.
new_df = imported_df.groupby(['Country/Region', 'Day']).agg(
    {'Lat': 'first', 'Long': 'first', 'Confirmed': 'sum', 'Deaths': 'sum', 'Recovered': 'sum'}
).reset_index()

# pd.set_option('display.max_rows', 40000)
display(new_df)

Unnamed: 0,Country/Region,Day,Lat,Long,Confirmed,Deaths,Recovered
0,Afghanistan,0 days,33.0,65.0,0,0,0
1,Afghanistan,1 days,33.0,65.0,0,0,0
2,Afghanistan,2 days,33.0,65.0,0,0,0
3,Afghanistan,3 days,33.0,65.0,0,0,0
4,Afghanistan,4 days,33.0,65.0,0,0,0
...,...,...,...,...,...,...,...
22743,Zimbabwe,116 days,-20.0,30.0,44,4,17
22744,Zimbabwe,117 days,-20.0,30.0,46,4,18
22745,Zimbabwe,118 days,-20.0,30.0,46,4,18
22746,Zimbabwe,119 days,-20.0,30.0,48,4,18


### Adding a column for the number of current active cases

In [6]:
new_df['Active_Cases'] = new_df['Confirmed'] - (new_df['Deaths'] + new_df['Recovered'])
display(new_df)

Unnamed: 0,Country/Region,Day,Lat,Long,Confirmed,Deaths,Recovered,Active_Cases
0,Afghanistan,0 days,33.0,65.0,0,0,0,0
1,Afghanistan,1 days,33.0,65.0,0,0,0,0
2,Afghanistan,2 days,33.0,65.0,0,0,0,0
3,Afghanistan,3 days,33.0,65.0,0,0,0,0
4,Afghanistan,4 days,33.0,65.0,0,0,0,0
...,...,...,...,...,...,...,...,...
22743,Zimbabwe,116 days,-20.0,30.0,44,4,17,23
22744,Zimbabwe,117 days,-20.0,30.0,46,4,18,24
22745,Zimbabwe,118 days,-20.0,30.0,46,4,18,24
22746,Zimbabwe,119 days,-20.0,30.0,48,4,18,26


### Adding a column for the number of active cases of the day before (not sure how to do this)
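
One possible approach, offered as a sketch rather than a settled solution, is to shift the `Active_Cases` column by one row within each country using `groupby(...).shift(1)`. This assumes there is exactly one row per country per day with no gaps, and that the rows are sorted by day. The sample data and the `Active_Cases_Prev` column name below are made up for illustration:

```python
import pandas as pd

# Made-up rows for two countries (hypothetical values).
df = pd.DataFrame({
    'Country/Region': ['A', 'A', 'A', 'B', 'B'],
    'Day': pd.to_timedelta([0, 1, 2, 0, 1], unit='D'),
    'Active_Cases': [1, 3, 6, 10, 15],
})

# Make sure each country's rows are in chronological order.
df = df.sort_values(['Country/Region', 'Day']).reset_index(drop=True)

# shift(1) inside each group moves values down one row, so every row
# sees the previous day's value for the same country. The first row of
# each country has no previous day and becomes NaN.
df['Active_Cases_Prev'] = df.groupby('Country/Region')['Active_Cases'].shift(1)

print(df)
```

Because the shift is done per group, the last day of one country never leaks into the first day of the next. The `NaN` rows can be dropped or filled with 0, depending on how the first day should be treated.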

### Removing lines in which the countries have 0 confirmed cases

In [9]:
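# Keep only rows with at least one confirmed case; copy() avoids SettingWithCopyWarning on later column assignments.
new_df = new_df[new_df['Confirmed'] != 0].copy()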
display(new_df)

Unnamed: 0,Country/Region,Day,Lat,Long,Confirmed,Deaths,Recovered,Active_Cases
33,Afghanistan,33 days,33.0,65.0,1,0,0,1
34,Afghanistan,34 days,33.0,65.0,1,0,0,1
35,Afghanistan,35 days,33.0,65.0,1,0,0,1
36,Afghanistan,36 days,33.0,65.0,1,0,0,1
37,Afghanistan,37 days,33.0,65.0,1,0,0,1
...,...,...,...,...,...,...,...,...
22743,Zimbabwe,116 days,-20.0,30.0,44,4,17,23
22744,Zimbabwe,117 days,-20.0,30.0,46,4,18,24
22745,Zimbabwe,118 days,-20.0,30.0,46,4,18,24
22746,Zimbabwe,119 days,-20.0,30.0,48,4,18,26
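
With the zero-case rows removed, the first remaining row of each country is the day its first case appears in the data. The optional per-country day count from the list of changes could then be derived by subtracting each country's earliest `Day`. This is a minimal sketch on made-up rows; the `Days_Since_First_Case` column name is an assumption:

```python
import pandas as pd

# Made-up rows: country A's first case is on day 33, country B's on day 2.
df = pd.DataFrame({
    'Country/Region': ['A', 'A', 'A', 'B', 'B'],
    'Day': pd.to_timedelta([33, 34, 35, 2, 3], unit='D'),
    'Confirmed': [1, 1, 2, 5, 8],
})

# Earliest remaining day per country, broadcast back onto every row.
first_case_day = df.groupby('Country/Region')['Day'].transform('min')

# Whole days elapsed since that country's first recorded case.
df['Days_Since_First_Case'] = (df['Day'] - first_case_day).dt.days

print(df)
```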


# Current problems

- How can we reference the previous day's case columns for one specific country?
- How should the dependent variable be handled, and which variable did we settle on: whether the growth in cases will be greater than on the previous day (yes or no)?
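
For the dependent-variable question, one possible reading is a binary label that says whether a day's new cases exceed the new cases of the day before, computed per country. The sketch below is only one interpretation under that assumption; the sample values and the `New_Cases`, `New_Cases_Prev`, and `Growth_Increased` column names are hypothetical:

```python
import pandas as pd

# Made-up cumulative confirmed counts for two countries.
df = pd.DataFrame({
    'Country/Region': ['A'] * 4 + ['B'] * 4,
    'Day': pd.to_timedelta([0, 1, 2, 3] * 2, unit='D'),
    'Confirmed': [1, 2, 5, 6, 10, 10, 12, 20],
})
df = df.sort_values(['Country/Region', 'Day']).reset_index(drop=True)

# Daily growth: today's cumulative total minus the previous day's, per country.
df['New_Cases'] = df.groupby('Country/Region')['Confirmed'].diff()
# Growth on the previous day, again per country.
df['New_Cases_Prev'] = df.groupby('Country/Region')['New_Cases'].shift(1)

# Rows without two earlier days cannot be labelled, so drop them.
labelled = df.dropna(subset=['New_Cases', 'New_Cases_Prev']).copy()
# 1 if growth is greater than on the previous day, 0 otherwise.
labelled['Growth_Increased'] = (
    labelled['New_Cases'] > labelled['New_Cases_Prev']
).astype(int)

print(labelled)
```

A label defined this way loses the first two days of every country, which is a trade-off to keep in mind for countries with short histories.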