<header style="padding:1px;background:#f9f9f9;border-top:3px solid #00b2b1"><img id="Teradata-logo" src="https://www.teradata.com/Teradata/Images/Rebrand/Teradata_logo-two_color.png" alt="Teradata" width="220" align="right" />

<b style = 'font-size:28px;font-family:Arial;color:#E37C4D'>Data Science Flow 101</b>
</header>

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>Introduction</b></p>
<p style = 'font-size:16px;font-family:Arial'>In this Jupyter notebook, we walk through a typical data science workflow following CRISP-DM (Cross-Industry Standard Process for Data Mining):</p>

<img src="images/800px-CRISP-DM_Process_Diagram.png"  alt="CRISP-DM" style="width: 300px;"/>

<p style = 'font-size:16px;font-family:Arial'>You can find more information on <a href="https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining">Wikipedia</a> about this framework.</p>

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b> 1. Business Understanding </b></p>
<p style = 'font-size:16px;font-family:Arial'>The first step in CRISP-DM is Business Understanding. Here, we want to better understand the impact of COVID-19 in Italy. Specifically, we would like to estimate the maximum number of new daily cases that Italy will see.</p>

<p style = 'font-size:16px;font-family:Arial'>To forecast the number of cases, we will use the Diffusion of Innovations framework. In this framework, the number of people newly purchasing a product (e.g. iPhone 11) or contracting a virus such as COVID-19 has two components. The first is proportional to both the number of people who already have it (product or virus) and the number of people left (word of mouth, community spreading). The second captures exogenous factors and is proportional only to the number of people left.</p>

<p style = 'font-size:16px;font-family:Arial'>(New People) = (some parameter) * (Number of People left who don't have it) + (some other parameter) * (Number of People who already have it) * (Number of People left who don't have it)</p>

<p style = 'font-size:16px;font-family:Arial'>dN = a * (NMAX - N) + b * N * (NMAX - N)</p>

<p style = 'font-size:16px;font-family:Arial'>with dN = New People (new cases), N = Total People with it (cumulative cases), NMAX = Total people in the market who could have it, a = the parameter for the exogenous factors, and b = the parameter for word-of-mouth spreading. We need to fit a and b from the data. This is the parameter order used in the code cells below.</p>

<p style = 'font-size:16px;font-family:Arial'>Expanding this expression gives a basic second-order polynomial, i.e. f(x) = A + B * x + C * x^2 with x = N, A = a * NMAX, B = b * NMAX - a, and C = -b.</p>
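
<p style = 'font-size:16px;font-family:Arial'>As a quick numerical check that the diffusion form and the polynomial form agree, here is a minimal sketch. It follows the parameter order used in this notebook's code cells (<code>a</code> multiplies the exogenous term, <code>b</code> the word-of-mouth term). The function names and the numeric values are illustrative assumptions, not fitted results.</p>

```python
def new_cases(n, a, b, nmax):
    """Diffusion form: exogenous term plus word-of-mouth term."""
    return a * (nmax - n) + b * n * (nmax - n)


def polynomial_coefficients(a, b, nmax):
    """Coefficients (A, B, C) of A + B*n + C*n**2 for the same model."""
    return a * nmax, b * nmax - a, -b


# Illustrative values only (assumptions, not fitted to any data)
a, b, nmax = 1e-4, 2e-7, 605_000.0
A, B, C = polynomial_coefficients(a, b, nmax)

for n in (0.0, 1_000.0, 50_000.0, 300_000.0):
    direct = new_cases(n, a, b, nmax)
    poly = A + B * n + C * n ** 2
    print(f"N={n:>9.0f}  diffusion={direct:.4f}  polynomial={poly:.4f}")
```

<p style = 'font-size:16px;font-family:Arial'>Both columns should agree up to floating-point rounding, because the polynomial is only a rearrangement of the same expression.</p>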

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>2. Data Understanding</b></p>
<p style = 'font-size:16px;font-family:Arial'>There are many sources of data. Here, we will use the data from the European Centre for Disease Prevention and Control at https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide.</p>

<p style = 'font-size:16px;font-family:Arial'>The current data in Excel is located at https://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide-2020-03-10.xls. </p>



<b style = 'font-size:20px;font-family:Arial;color:#E37C4D'>Accessing the Data</b>
<p style = 'font-size:16px;font-family:Arial'>These demos work either with foreign tables accessed from Cloud Storage via NOS, or with tables imported to your machine. If you import data for multiple demos, you may need to use the Data Dictionary "Manage Your Space" routine to clean up tables you no longer need.</p>

<p style = 'font-size:16px;font-family:Arial'>Use the link below to access the two options for using data from the Data Dictionary notebook:</p>

[Click Here to get data for this notebook](../Data_Dictionary/Data_Dictionary.ipynb#TRNG_DataScienceFlow)

[Click Here to Manage Your Space](../Data_Dictionary/Data_Dictionary.ipynb#Manage_Your_Space)
    
<p style = 'font-size:16px;font-family:Arial'>We start by installing iminuit, then importing the required libraries and connecting to the database. You will be asked to enter the password.</p>

In [None]:
!pip install --user iminuit

<p style = 'font-size:16px;font-family:Arial'><b><i>*BEFORE proceeding, please RESTART the kernel to bring new software into Jupyter.</i></b></p>

In [None]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")
import getpass
from teradataml import *
# Import the Python wrapper for CERN's Minuit to fit a function by chi-square minimization
from iminuit import Minuit, describe
# Import NumPy for numeric computation
import numpy as np
# Import Matplotlib for charts
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
eng = create_context(host = 'host.docker.internal', username='demo_user', password = getpass.getpass())
print(eng)

<p style = 'font-size:16px;font-family:Arial'>Load the data from Vantage into a pandas DataFrame.</p>

In [None]:
df = pd.read_sql('select * from TRNG_DataScienceFlow.covid_geo_dist;',eng)

In [None]:
df.head()

<p style = 'font-size:16px;font-family:Arial'>Then we explore the data to better understand it. What is inside the dataframe?</p>

<p style = 'font-size:16px;font-family:Arial'>From this, we can guess/infer the following:</p>

<ol style = 'font-size:16px;font-family:Arial'>
    <li>DateRep is the date of the reported values</li>
    <li>CountryExp is the country</li>
    <li>NewConfCases is the number of new cases for that day in that country</li>
    <li>NewDeaths is the number of new deaths for that day in that country</li>
    <li>GeoId and EU are metadata information related to the country</li>
 </ol>   

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>3. Data Preparation</b></p>
<p style = 'font-size:16px;font-family:Arial'>Now that we understand what each column and its values mean, we can manipulate the DataFrame to build the analytics dataset that we will use for modeling.</p>

<p style = 'font-size:16px;font-family:Arial'>Let's filter down to Italy. First we create a Boolean mask (one True/False value per row) that flags the rows for Italy:</p>

In [None]:
df['CountryExp']=='Italy'

<p style = 'font-size:16px;font-family:Arial'>Then we apply this mask to the current DataFrame:</p>

In [None]:
df = df[df['CountryExp']=='Italy']

<p style = 'font-size:16px;font-family:Arial'>We also remove the days with no new cases, such as those before the virus reached Italy, using the same technique:</p>

In [None]:
df = df[df['NewConfCases']>0]

<p style = 'font-size:16px;font-family:Arial'>Now let's sort by date:</p>

In [None]:
df=df.sort_values(by=['DateRep'])

<p style = 'font-size:16px;font-family:Arial'>Let's add the cumulative number of cases and deaths with:</p>

In [None]:
df["CumulConfCases"] = df["NewConfCases"].cumsum()
df["CumulDeaths"] = df["NewDeaths"].cumsum()

In [None]:
df

<p style = 'font-size:16px;font-family:Arial'>Next, we keep only the columns of interest and rename them to T, dN and N:</p>

In [None]:
df_cases = df.filter(items=['DateRep', 'NewConfCases','CumulConfCases']) \
    .rename(columns={"DateRep": "T", "NewConfCases": "dN", "CumulConfCases": "N"})
df_deaths = df.filter(items=['DateRep', 'NewDeaths','CumulDeaths']) \
    .rename(columns={"DateRep": "T", "NewDeaths": "dN", "CumulDeaths": "N"})

<p style = 'font-size:16px;font-family:Arial'>Finally, we define the time series index of those two final datasets:</p>

In [None]:
df_cases = df_cases.set_index('T')
df_deaths = df_deaths.set_index('T')

In [None]:
df_cases
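
<p style = 'font-size:16px;font-family:Arial'>The preparation steps above can also be bundled into one reusable function. The sketch below is one possible way to do this. The helper name <code>prepare_country</code> and the toy rows are assumptions made for illustration (the numbers are not ECDC figures), while the column names are the ones used in this notebook.</p>

```python
import pandas as pd


def prepare_country(df, country, value_col="NewConfCases"):
    """Return a date-indexed frame with daily (dN) and cumulative (N) counts."""
    # Keep the rows for one country that report at least one new case
    out = df[(df["CountryExp"] == country) & (df[value_col] > 0)]
    # Sort by date before accumulating, otherwise cumsum is meaningless
    out = out.sort_values(by="DateRep")
    out = out.assign(N=out[value_col].cumsum())
    return (out.rename(columns={"DateRep": "T", value_col: "dN"})
               [["T", "dN", "N"]]
               .set_index("T"))


# Toy rows, made up for illustration only
toy = pd.DataFrame({
    "DateRep": pd.to_datetime(["2020-03-03", "2020-03-01", "2020-03-02", "2020-03-01"]),
    "CountryExp": ["Italy", "Italy", "Italy", "France"],
    "NewConfCases": [5, 0, 2, 7],
    "NewDeaths": [1, 0, 0, 0],
})

cases = prepare_country(toy, "Italy")
print(cases)
```

<p style = 'font-size:16px;font-family:Arial'>Passing <code>value_col="NewDeaths"</code> would produce the equivalent of <code>df_deaths</code>.</p>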

<p style = 'font-size:16px;font-family:Arial'>Let's do some charts.</p>

<p style = 'font-size:16px;font-family:Arial'>The time-series for N and dN:</p>

In [None]:
df_cases.plot(y=["N","dN"])

<p style = 'font-size:16px;font-family:Arial'>Let's plot dN as a function of N in a scatter plot, as defined in our Diffusion of Innovations framework:</p>

In [None]:
df_cases.plot.scatter(x='N',y='dN')

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>4. Modeling </b></p>


<p style = 'font-size:16px;font-family:Arial'>As seen in the scatter chart above, the data show no sign yet of the curve turning over. This means the data are not sensitive to how big NMAX will eventually be (i.e. the total number of people who will eventually be infected), so we have to fix NMAX by assumption.</p>

<p style = 'font-size:16px;font-family:Arial'>To make a prediction, let's define NMAX as 1% of Italy's population (1% of 60.5 million, i.e. 605,000 people).</p>

In [None]:
NMAX = 60.5e6 * 0.01

In [None]:
def fit(a, b):
    return a * (NMAX - df_cases['N']) + b * df_cases['N'] * (NMAX - df_cases['N'])

<p style = 'font-size:16px;font-family:Arial'>We define the error function (i.e. Chi Square), which returns the sum of the squared errors over all points:</p>

In [None]:
def chisquare(a, b):    
    return (df_cases['dN'] - fit(a,b)).pow(2).sum()

<p style = 'font-size:16px;font-family:Arial'>For example, the total Chi Square error for a = 0.1 and b = 0.2 is </p>

In [None]:
chisquare(0.1, 0.2)

<p style = 'font-size:16px;font-family:Arial'>We now create the minimization object m with Minuit to minimize the function 'chisquare', starting from a = 0.1 and b = 0.2:</p>

In [None]:
m = Minuit(chisquare, 0.1, 0.2)

<p style = 'font-size:16px;font-family:Arial'>We now execute the fit to find the best parameters describing the data:</p>

In [None]:
m.migrad()
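
<p style = 'font-size:16px;font-family:Arial'>Because the model is linear in <code>a</code> and <code>b</code>, the minimum of the squared error can also be computed in closed form, which gives a simple cross-check of an iterative fit. The sketch below solves the 2x2 normal equations in plain Python. It is not what Minuit does internally, and the function name, data points and "true" parameter values are assumptions chosen for illustration.</p>

```python
def fit_linear_diffusion(n_values, dn_values, nmax):
    """Least-squares estimate of (a, b) in dN = a*(nmax-N) + b*N*(nmax-N).

    Solves the 2x2 normal equations directly, which is possible because
    the model is linear in a and b.
    """
    x1 = [nmax - n for n in n_values]          # regressor multiplied by a
    x2 = [n * (nmax - n) for n in n_values]    # regressor multiplied by b
    s11 = sum(u * u for u in x1)
    s12 = sum(u * v for u, v in zip(x1, x2))
    s22 = sum(v * v for v in x2)
    t1 = sum(u * y for u, y in zip(x1, dn_values))
    t2 = sum(v * y for v, y in zip(x2, dn_values))
    det = s11 * s22 - s12 * s12
    a = (t1 * s22 - t2 * s12) / det            # Cramer's rule
    b = (s11 * t2 - s12 * t1) / det
    return a, b


# Synthetic, noise-free data generated from known parameters (illustrative values)
NMAX_DEMO = 605_000.0
A_TRUE, B_TRUE = 1e-4, 2e-7
n_demo = [100.0, 500.0, 1_000.0, 3_000.0, 6_000.0, 10_000.0]
dn_demo = [A_TRUE * (NMAX_DEMO - n) + B_TRUE * n * (NMAX_DEMO - n) for n in n_demo]

a_hat, b_hat = fit_linear_diffusion(n_demo, dn_demo, NMAX_DEMO)
print(a_hat, b_hat)
```

<p style = 'font-size:16px;font-family:Arial'>On noise-free data the estimates should reproduce the parameters used to generate it. On real data, they should land close to the values reported by the fit, provided the fit converged.</p>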

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>5. Evaluation</b></p>

<p style = 'font-size:16px;font-family:Arial'>Now that we have our model with parameters, we can check the robustness of our model. We can do this in multiple ways.</p>
<p style = 'font-size:16px;font-family:Arial'>When we have a lot of data, we can use a test sample, or cross-validate with a different time range or market.</p>

<p style = 'font-size:16px;font-family:Arial'>Here, with a very limited number of data points, we can do a basic assessment of the error on our parameters.</p>

<p style = 'font-size:16px;font-family:Arial'>We can start by looking at the Chi Square function as a function of the parameter <code>a</code> and identify the range in which we can trust its value (i.e. the confidence interval).</p>

In [None]:
m.draw_mnprofile("a")

<p style = 'font-size:16px;font-family:Arial'>Roughly, the best value from the data is 6.15e-7, but there is a 68% chance that the value is between 6.1e-7 and 6.2e-7.</p>

<p style = 'font-size:16px;font-family:Arial'>We can now check the parameter <code>b</code>:</p>

In [None]:
m.draw_mnprofile("b")

<p style = 'font-size:16px;font-family:Arial'>Here the best value is 3.1885e-9, but there is a 68% chance that the value is between 3.187e-9 and 3.190e-9.</p>
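
<p style = 'font-size:16px;font-family:Arial'>To make the idea of a profile concrete: the profile plots scan one parameter and, at each scan point, re-minimize the error over the other parameter. The sketch below does the same by hand for <code>a</code>, using the fact that the best <code>b</code> for a fixed <code>a</code> has a closed form in this linear model. The function name, the synthetic data and its small made-up noise are assumptions for illustration only.</p>

```python
def profile_over_a(a_grid, n_values, dn_values, nmax):
    """For each fixed a, minimize the squared error over b and return the minimum."""
    x1 = [nmax - n for n in n_values]
    x2 = [n * (nmax - n) for n in n_values]
    s22 = sum(v * v for v in x2)
    profile = []
    for a in a_grid:
        # What is left to explain once the a-term is subtracted
        resid = [y - a * u for y, u in zip(dn_values, x1)]
        # Best b for this a (one-parameter least squares)
        b_best = sum(v * r for v, r in zip(x2, resid)) / s22
        profile.append(sum((r - b_best * v) ** 2 for r, v in zip(resid, x2)))
    return profile


# Synthetic data: known parameters plus small made-up noise (illustrative only)
NMAX_DEMO = 605_000.0
A_TRUE, B_TRUE = 1e-4, 2e-7
n_demo = [100.0, 500.0, 1_000.0, 3_000.0, 6_000.0, 10_000.0]
noise = [0.3, -0.2, 0.4, -0.5, 0.1, -0.1]
dn_demo = [A_TRUE * (NMAX_DEMO - n) + B_TRUE * n * (NMAX_DEMO - n) + e
           for n, e in zip(n_demo, noise)]

a_grid = [A_TRUE * (0.5 + 0.1 * k) for k in range(11)]
profile = profile_over_a(a_grid, n_demo, dn_demo, NMAX_DEMO)
best = profile.index(min(profile))
print("lowest profile value at a =", a_grid[best])
```

<p style = 'font-size:16px;font-family:Arial'>The resulting curve is a parabola in <code>a</code>. The narrower the parabola, the tighter the interval that can be quoted for the parameter.</p>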

<p style = 'font-size:16px;font-family:Arial'>We can check the dependency between the two variables:</p>

In [None]:
m.draw_mncontour("a","b", cl = [0.680, 0.950])

<p style = 'font-size:16px;font-family:Arial'>Here, the black ellipse represents the region at the 68% confidence level that we just identified. The red ellipse is at the 95% confidence level.</p>

<p style = 'font-size:16px;font-family:Arial'>Let's now apply those fitted values to see the result in a chart:</p>

In [None]:
a=m.values["a"]
b=m.values["b"]

In [None]:
df_cases['fit'] = a * (NMAX - df_cases['N']) + b * df_cases['N'] * (NMAX - df_cases['N'])

In [None]:
df_cases.plot('N',['dN','fit'],style=['o','-'])

<p style = 'font-size:20px;font-family:Arial;color:#E37C4D'><b>6. Deployment</b></p>
<p style = 'font-size:16px;font-family:Arial'>Deployment is about using the insights we found to do something differently. It can be a single event or a repetitive task.</p>

<p style = 'font-size:16px;font-family:Arial'>Let's start with the simpler case. For a one-off case such as this one, we would just extrapolate the insights into the answer we are looking for: what is the maximum number of new daily cases we can have in Italy?</p>

In [None]:
xx=np.arange(0,NMAX*0.55,NMAX/1000)

In [None]:
yy = a * (NMAX - xx) + b * xx * (NMAX - xx)

In [None]:
plt.plot(df_cases['N'], df_cases['dN'], '^', xx, yy, '-')

<p style = 'font-size:16px;font-family:Arial'>If all our assumptions hold (which is unlikely), this means that at the peak of the epidemic in Italy, we could see about 30,000 new cases per day.</p>
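
<p style = 'font-size:16px;font-family:Arial'>Since the model is a downward parabola in N (for positive <code>b</code>), the peak can also be computed directly from its vertex instead of being read off a grid. The sketch below compares both approaches. The helper name and the parameter values are illustrative assumptions, not the fitted values from this notebook.</p>

```python
def peak_new_cases(a, b, nmax):
    """Return (N at the peak, dN at the peak) for dN = a*(nmax-N) + b*N*(nmax-N).

    Assumes b > 0, so that the parabola opens downward.
    """
    n_peak = (b * nmax - a) / (2.0 * b)       # vertex of the parabola
    n_peak = min(max(n_peak, 0.0), nmax)      # keep it inside the valid range
    dn_peak = a * (nmax - n_peak) + b * n_peak * (nmax - n_peak)
    return n_peak, dn_peak


# Illustrative parameter values (assumptions, not fitted results)
a_demo, b_demo, nmax_demo = 1e-4, 2e-7, 605_000.0
n_peak, dn_peak = peak_new_cases(a_demo, b_demo, nmax_demo)

# Cross-check against a grid search with the same step as the cells above
grid = [nmax_demo * k / 1000 for k in range(1001)]
dn_grid_max = max(a_demo * (nmax_demo - n) + b_demo * n * (nmax_demo - n) for n in grid)
print(n_peak, dn_peak, dn_grid_max)
```

<p style = 'font-size:16px;font-family:Arial'>The grid maximum can only be slightly below the analytic peak, because the grid rarely lands exactly on the vertex.</p>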

<p style = 'font-size:16px;font-family:Arial'>For repetitive tasks, we will need to automate the analysis and the deployment of the insights. 
So, we can start by developing one Python script with only the necessary steps, as well as input parameters to collect 
the relevant report date, country, assumptions, etc.</p>

<p style = 'font-size:14px;font-family:Arial'>Here is an example of a covid19.py script:</p>

```python
#!/usr/bin/python3
import sys
import pandas as pd
from iminuit import Minuit, describe
import numpy as np

# Get report date and country from the command line
if len(sys.argv) != 3:
    print('Usage: covid19.py [date] [country]')
    sys.exit(1)
rep_date = sys.argv[1]
country = sys.argv[2]

# Import data locally
data_url = "https://www.ecdc.europa.eu/sites/default/files/documents/COVID-19-geographic-disbtribution-worldwide-" + rep_date + ".xls"
df = pd.read_excel(data_url)

# Filter data
df = df[df['CountryExp']==country]
df = df[df['NewConfCases']>0]
df = df.sort_values(by=['DateRep'])
df["CumulConfCases"] = df["NewConfCases"].cumsum()
df["CumulDeaths"] = df["NewDeaths"].cumsum()
df = df.filter(items=['DateRep', 'NewConfCases','CumulConfCases']). \
    rename(columns={"DateRep": "T", "NewConfCases": "dN", "CumulConfCases": "N"}). \
    set_index('T')

# Prepare model fit
NMAX = 60.5e6 * 0.01
x = df['N']
y = df['dN']

def fit(a, b):
    return a * (NMAX - x) + b * x * (NMAX - x)
def chisquare(a, b):
    return (y - fit(a,b)).pow(2).sum()

# Execute the fit
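# iminuit 2.x (the API used in the notebook cells) takes the start values as
# arguments and no longer accepts print_level/pedantic in the constructor
m = Minuit(chisquare, a=0.1, b=0.2)
m.errordef = Minuit.LEAST_SQUARES
m.print_level = 0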
m.migrad()
a=m.values["a"]
b=m.values["b"]
xx=np.arange(0,NMAX,NMAX/100)
yy = a * (NMAX - xx) + b * xx * (NMAX - xx)
print("Max:",max(yy))
```

<p style = 'font-size:16px;font-family:Arial'>Running <code>./covid19.py 2020-03-10 Italy</code> will give <code>Max: 29582.958532824043</code>.</p>

<p style = 'font-size:16px;font-family:Arial'>Whereas <code>./covid19.py 2020-03-08 Italy</code> will give <code>Max: 29105.409956316485</code>.</p>
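
<p style = 'font-size:16px;font-family:Arial'>The script hard-codes the NMAX assumption (1% of Italy's population) even though the country is an input. As a possible variation on its argument handling, the sketch below uses <code>argparse</code> from the standard library to expose that assumption too. The option names <code>--population</code> and <code>--fraction</code> are hypothetical and are not part of the original script.</p>

```python
import argparse


def parse_args(argv=None):
    """Parse the report date, country and the NMAX assumption from the command line."""
    parser = argparse.ArgumentParser(description="Forecast peak daily COVID-19 cases")
    parser.add_argument("date", help="report date, e.g. 2020-03-10")
    parser.add_argument("country", help="country name as it appears in CountryExp")
    # Hypothetical options: they expose the assumptions hard-coded in the script
    parser.add_argument("--population", type=float, default=60.5e6,
                        help="total population of the country")
    parser.add_argument("--fraction", type=float, default=0.01,
                        help="assumed share of the population eventually infected")
    return parser.parse_args(argv)


# Passing a list makes the parser easy to try out without a shell
args = parse_args(["2020-03-10", "Italy"])
nmax = args.population * args.fraction
print(args.date, args.country, nmax)
```

<p style = 'font-size:16px;font-family:Arial'>With the defaults, this reproduces the NMAX used in this notebook. A different assumption could then be tested without editing the code.</p>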

<p style = 'font-size:16px;font-family:Arial'>With this script doing the work, we would need to deploy it on a production system, for example as a Docker image in AppCenter, on AWS EC2, or
on Google GCE. We would then set up a scheduler to refresh the analysis and monitoring to alert when there is an error, and finally 
integrate the results into operations. What will you do with this data? Perhaps an automated email to the secretary of health in Italy with the forecast of the worst to come?</p>

<p style = 'font-size:16px;font-family:Arial'>And you, what do you think? What would you do?</p>

<footer style="padding:10px;background:#f9f9f9;border-bottom:3px solid #394851">Copyright © Teradata Corporation - 2023. All Rights Reserved.</footer>