# Encapsulation into Classes

In this chapter, we'll encapsulate all of our work from the smoothing and model building chapters into Python classes. Encapsulation is just a fancy word for bringing together (capturing in a capsule, if you will) data and functions that act together to complete a task.

## The `CasesModel` class

We now write a class, `CasesModel`, that is responsible for nearly all tasks involving cases once the data has been downloaded. For each area, it smooths the data, trains the model, and predicts the number of cases. Several of our previous functions (`smooth`, `get_bounds_p0`, `get_L_limits`, `predict`) have been rewritten as methods.

Once instantiated, the `run` method must be called to complete the smoothing, training, and predicting. When the `run` method is called, a number of dictionaries are created to hold DataFrames with the following data:

* smoothed data
* bounds
* initial parameter guess
* fitted parameters
* daily predicted cases
* cumulative predicted cases
* combined daily/cumulative actual and predicted cases
* combined daily/cumulative smoothed actual and predicted cases

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import least_squares
from statsmodels.nonparametric.smoothers_lowess import lowess
plt.style.use('dashboard.mplstyle')

GROUPS = 'world', 'usa'
KINDS = 'cases', 'deaths'
MIN_OBS = 15  # Minimum observations needed to make prediction

def general_logistic_shift(x, L, x0, k, v, s):
    return (L - s) / ((1 + np.exp(-k * (x - x0))) ** (1 / v)) + s

def optimize_func(params, x, y, model):
    y_pred = model(x, *params)
    error = y - y_pred
    return error

class CasesModel:
    def __init__(self, model, data, last_date, n_train, n_smooth, 
                 n_pred, L_n_min, L_n_max, **kwargs):
        """
        Smooths, trains, and predicts cases for all areas
        
        Parameters
        ----------
        model : function such as general_logistic_shift
        
        data : dictionary of data from all areas - result of PrepareData().run()
        
        last_date : str, last date to be used for training
        
        n_train : int, number of preceding days to use for training
        
        n_smooth : int, number of points used in LOWESS
        
        n_pred : int, days of predictions to make
        
        L_n_min, L_n_max : int, min/max number of days used to estimate L_min/L_max
        
        **kwargs : extra keyword arguments passed to scipy's least_squares function
        """
        # Set basic attributes
        self.model = model
        self.data = data
        self.last_date = self.get_last_date(last_date)
        self.n_train = n_train
        self.n_smooth = n_smooth
        self.n_pred = n_pred
        self.L_n_min = L_n_min
        self.L_n_max = L_n_max
        self.kwargs = kwargs
        
        # Set attributes for prediction
        self.first_pred_date = pd.Timestamp(self.last_date) + pd.Timedelta("1D")
        self.pred_index = pd.date_range(start=self.first_pred_date, periods=n_pred)
        
    def get_last_date(self, last_date):
        # Use the most current date as the last actual date if not provided
        if last_date is None:
            return self.data['world_cases'].index[-1]
        else:
            return pd.Timestamp(last_date)
        
    def init_dictionaries(self):
        # Create dictionaries to store results for each area
        # Executed first in `run` method
        self.smoothed = {'world_cases': {}, 'usa_cases': {}}
        self.bounds = {'world_cases': {}, 'usa_cases': {}}
        self.p0 = {'world_cases': {}, 'usa_cases': {}}
        self.params = {'world_cases': {}, 'usa_cases': {}}
        self.pred_daily = {'world_cases': {}, 'usa_cases': {}}
        self.pred_cumulative = {'world_cases': {}, 'usa_cases': {}}
        
        # Dictionary to hold DataFrame of actual and predicted values
        self.combined_daily = {}
        self.combined_cumulative = {}
        
        # Same as above, but stores smoothed and predicted values
        self.combined_daily_s = {}
        self.combined_cumulative_s = {}
        
    def smooth(self, s):
        s = s[:self.last_date]
        if s.values[0] == 0:
            # Filter the data if the first value is 0
            last_zero_date = s[s == 0].index[-1]
            s = s.loc[last_zero_date:]
            s_daily = s.diff().dropna()
        else:
            # If first value not 0, use it to fill in the 
            # first missing value
            s_daily = s.diff().fillna(s.iloc[0])

        # Don't smooth data with fewer than MIN_OBS values
        if len(s_daily) < MIN_OBS:
            return s_daily.cumsum()

        y = s_daily.values
        frac = self.n_smooth / len(y)
        x = np.arange(len(y))
        y_pred = lowess(y, x, frac=frac, is_sorted=True, return_sorted=False)
        s_pred = pd.Series(y_pred, index=s_daily.index).clip(0)
        s_pred_cumulative = s_pred.cumsum()
        
        if s_pred_cumulative.iloc[-1] == 0:
            # Don't use smoothed values if they are all 0
            return s_daily.cumsum()
        
        last_actual = s.values[-1]
        last_smoothed = s_pred_cumulative.values[-1]
        s_pred_cumulative *= last_actual / last_smoothed
        return s_pred_cumulative
    
    def get_train(self, smoothed):
        # Filter the data for the most recent to capture new waves
        return smoothed.iloc[-self.n_train:]
    
    def get_L_limits(self, s):
        last_val = s.iloc[-1]
        last_pct = s.pct_change().iloc[-1] + 1
        L_min = last_val * last_pct ** self.L_n_min
        L_max = last_val * last_pct ** self.L_n_max + 1
        L0 = (L_max - L_min) / 2 + L_min
        return L_min, L_max, L0
    
    def get_bounds_p0(self, s):
        L_min, L_max, L0 = self.get_L_limits(s)
        x0_min, x0_max = -50, 50
        k_min, k_max = 0.01, 0.5
        v_min, v_max = 0.01, 2
        s_min, s_max = 0, s.iloc[-1] + 0.01
        s0 = s_max / 2
        lower = L_min, x0_min, k_min, v_min, s_min
        upper = L_max, x0_max, k_max, v_max, s_max
        bounds = lower, upper
        p0 = L0, 0, 0.1, 0.1, s0
        return bounds, p0
    
    def train_model(self, s, bounds, p0):
        y = s.values
        n_train = len(y)
        x = np.arange(n_train)
        res = least_squares(optimize_func, p0, args=(x, y, self.model), bounds=bounds, **self.kwargs)
        return res.x
    
    def get_pred_daily(self, n_train, params):
        x_pred = np.arange(n_train - 1, n_train + self.n_pred)
        y_pred = self.model(x_pred, *params)
        y_pred_daily = np.diff(y_pred)
        return pd.Series(y_pred_daily, index=self.pred_index)
    
    def get_pred_cumulative(self, s, pred_daily):
        last_actual_value = s.loc[self.last_date]
        return pred_daily.cumsum() + last_actual_value
    
    def convert_to_df(self, gk):
        # convert dictionary of areas mapped to Series to DataFrames
        self.smoothed[gk] = pd.DataFrame(self.smoothed[gk]).fillna(0).astype('int')
        self.bounds[gk] = pd.concat(self.bounds[gk].values(), 
                                    keys=self.bounds[gk].keys()).T
        self.bounds[gk].loc['L'] = self.bounds[gk].loc['L'].round()
        self.p0[gk] = pd.DataFrame(self.p0[gk], index=['L', 'x0', 'k', 'v', 's'])
        self.p0[gk].loc['L'] = self.p0[gk].loc['L'].round()
        self.params[gk] = pd.DataFrame(self.params[gk], index=['L', 'x0', 'k', 'v', 's'])
        self.pred_daily[gk] = pd.DataFrame(self.pred_daily[gk])
        self.pred_cumulative[gk] = pd.DataFrame(self.pred_cumulative[gk])
        
    def combine_actual_with_pred(self):
        for gk, df_pred in self.pred_cumulative.items():
            df_actual = self.data[gk][:self.last_date]
            df_comb = pd.concat((df_actual, df_pred))
            self.combined_cumulative[gk] = df_comb
            self.combined_daily[gk] = df_comb.diff().fillna(df_comb.iloc[0]).astype('int')
            
            df_comb_smooth = pd.concat((self.smoothed[gk], df_pred))
            self.combined_cumulative_s[gk] = df_comb_smooth
            self.combined_daily_s[gk] = df_comb_smooth.diff().fillna(df_comb_smooth.iloc[0]).astype('int')

    def run(self):
        self.init_dictionaries()
        for group in GROUPS:
            gk = f'{group}_cases'
            df_cases = self.data[gk]
            for area, s in df_cases.items():
                smoothed = self.smooth(s)
                train = self.get_train(smoothed)
                n_train = len(train)
                if n_train < MIN_OBS:
                    bounds = np.full((2, 5), np.nan)
                    p0 = np.full(5, np.nan)
                    params = np.full(5, np.nan)
                    pred_daily = pd.Series(np.zeros(self.n_pred), index=self.pred_index)
                else:
                    bounds, p0 = self.get_bounds_p0(train)
                    params = self.train_model(train, bounds=bounds,  p0=p0)
                    pred_daily = self.get_pred_daily(n_train, params).round(0)
                pred_cumulative = self.get_pred_cumulative(s, pred_daily)
                
                # save results to dictionaries mapping each area to its result
                self.smoothed[gk][area] = smoothed
                self.bounds[gk][area] = pd.DataFrame(bounds, index=['lower', 'upper'], 
                                                     columns=['L', 'x0', 'k', 'v', 's'])
                self.p0[gk][area] = p0
                self.params[gk][area] = params
                self.pred_daily[gk][area] = pred_daily.astype('int')
                self.pred_cumulative[gk][area] = pred_cumulative.astype('int')
            self.convert_to_df(gk)
        self.combine_actual_with_pred()
        
    def plot_prediction(self, group, area, **kwargs):
        group_kind = f'{group}_cases'
        actual = self.data[group_kind][area]
        pred = self.pred_cumulative[group_kind][area]
        first_date = self.last_date - pd.Timedelta(self.n_train, 'D')
        last_pred_date = self.last_date + pd.Timedelta(self.n_pred, 'D')
        actual.loc[first_date:last_pred_date].plot(label='Actual', **kwargs)
        pred.plot(label='Predicted').legend()
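
To see what the five parameters of `general_logistic_shift` mean, the sketch below evaluates the function far to the left, at the midpoint, and far to the right. The parameter values are hypothetical and chosen only for illustration. The curve starts near `s` and levels off at `L`, which is why `L` is bounded from below by the most recent cumulative total.

```python
import numpy as np

def general_logistic_shift(x, L, x0, k, v, s):
    # Same formula as the model function defined in this chapter
    return (L - s) / ((1 + np.exp(-k * (x - x0))) ** (1 / v)) + s

# Hypothetical parameter values, for illustration only
L, x0, k, v, s = 1000.0, 0.0, 0.1, 0.5, 200.0

x = np.array([-500.0, 0.0, 500.0])
y = general_logistic_shift(x, L, x0, k, v, s)
print(y)  # lower asymptote (s), value at x0, upper asymptote (L)

# With k > 0, v > 0 and L > s, the curve never decreases
grid = np.linspace(-100, 100, 201)
is_increasing = np.all(np.diff(general_logistic_shift(grid, L, x0, k, v, s)) >= 0)
print(is_increasing)
```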

An instance of the `CasesModel` class is created below. It uses the 60 days leading up to November 5, 2020 as training data for the model and makes predictions for the next 30 days. If `last_date` is `None`, the last date in the given data is used. The integers `L_n_min` and `L_n_max` are used to find the bounds of `L`.

In [None]:
from prepare import PrepareData
data = PrepareData(download_new=False).run()
cm = CasesModel(model=general_logistic_shift, data=data, last_date='2020-11-05', 
                n_train=60, n_smooth=15, n_pred=30, L_n_min=5, L_n_max=50)
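
As a rough illustration of how `L_n_min` and `L_n_max` turn into bounds for `L`, here is a standalone version of the logic in `get_L_limits`. The helper name `l_limits` and the toy series are assumptions made for this sketch. The most recent daily growth rate is compounded for `L_n_min` and `L_n_max` days to get the lower and upper limits.

```python
import pandas as pd

def l_limits(s, n_min, n_max):
    # Hypothetical standalone copy of CasesModel.get_L_limits
    last_val = s.iloc[-1]
    growth = s.pct_change().iloc[-1] + 1   # e.g. 1.01 for 1% daily growth
    L_min = last_val * growth ** n_min
    L_max = last_val * growth ** n_max + 1
    L0 = (L_min + L_max) / 2               # midpoint, used as the initial guess
    return L_min, L_max, L0

# Toy cumulative totals growing by 1% per day
s = pd.Series([1000.0, 1010.0, 1020.1])
L_min, L_max, L0 = l_limits(s, n_min=5, n_max=50)
print(round(L_min), round(L_max), round(L0))
```

A larger `L_n_max` widens the range the optimizer may explore, at the cost of allowing more extreme long-run totals.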

The `run` method must be called in order to smooth, train, and predict. Executing the following cell took about 25 seconds on my machine, as it completed the process for all areas.

In [None]:
cm.run()

### Results

Let's take a look at all of the results which are stored as DataFrames within dictionaries with keys `world_cases` and `usa_cases`. The original unprocessed data is in the `data` attribute. We select the last five rows of the first 10 areas.

In [None]:
cm.data['world_cases'].iloc[-5:, :10]

The smoothed data is what is used for training and is therefore only calculated through the date we wish to make a prediction from.

In [None]:
cm.smoothed['usa_cases'].iloc[-5:, :10]
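
The last step of `smooth` rescales the smoothed cumulative series so that it ends at the actual cumulative total. The sketch below shows that step on toy data. To keep it dependency-free, a centered rolling mean stands in for LOWESS; that substitution is an assumption of this example and not what the class does.

```python
import pandas as pd

idx = pd.date_range("2020-10-01", periods=10)
# Toy cumulative totals, for illustration only
cumulative = pd.Series([2, 3, 4, 10, 11, 20, 22, 35, 36, 50], index=idx, dtype="float")

# Daily values; the first day is filled with the first cumulative value
daily = cumulative.diff().fillna(cumulative.iloc[0])

# Stand-in smoother (the class uses LOWESS here), clipped at 0
daily_smooth = daily.rolling(3, min_periods=1, center=True).mean().clip(lower=0)

# Back to cumulative, then rescale so the final value matches the actual total
cum_smooth = daily_smooth.cumsum()
cum_smooth *= cumulative.iloc[-1] / cum_smooth.iloc[-1]
print(cum_smooth.round(1))
```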

The bounds for each of the parameters when fitting are below.

In [None]:
cm.bounds['world_cases'].iloc[:, :10]

The initial guess for each parameter:

In [None]:
cm.p0['world_cases'].iloc[:, :10]
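
SciPy's `least_squares` rejects an initial guess that lies outside the bounds, so it can be worth checking feasibility before fitting. The following sketch rebuilds the bounds and guess the same way `get_bounds_p0` does. The helper name `bounds_and_p0` and the input numbers are assumptions for illustration.

```python
import numpy as np

def bounds_and_p0(last_val, L_min, L_max):
    # Hypothetical standalone copy of the logic in CasesModel.get_bounds_p0
    L0 = (L_min + L_max) / 2
    s_max = last_val + 0.01
    # Parameter order: L, x0, k, v, s
    lower = np.array([L_min, -50, 0.01, 0.01, 0])
    upper = np.array([L_max, 50, 0.5, 2, s_max])
    p0 = np.array([L0, 0, 0.1, 0.1, s_max / 2])
    return lower, upper, p0

lower, upper, p0 = bounds_and_p0(last_val=1020.1, L_min=1072.0, L_max=1678.0)

# Every component of the guess must sit between its lower and upper bound
feasible = np.all((lower <= p0) & (p0 <= upper))
print(feasible)
```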

The first five predicted values for daily cases:

In [None]:
cm.pred_daily['usa_cases'].iloc[:5, :10]

The first five predicted values for cumulative cases:

In [None]:
cm.pred_cumulative['usa_cases'].iloc[:5, :10]
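
The daily and cumulative predictions are linked: the model is evaluated on the last training day plus the next `n_pred` days, consecutive differences give the daily values, and their running sum is added to the last actual total. A minimal sketch of `get_pred_daily` and `get_pred_cumulative`, using hypothetical fitted parameters and a hypothetical last actual value:

```python
import numpy as np
import pandas as pd

def general_logistic_shift(x, L, x0, k, v, s):
    return (L - s) / ((1 + np.exp(-k * (x - x0))) ** (1 / v)) + s

params = (2000.0, 30.0, 0.1, 1.0, 100.0)  # hypothetical L, x0, k, v, s
n_train, n_pred = 60, 5
last_actual = 1901                        # hypothetical last cumulative total

# Start one day early so np.diff returns exactly n_pred daily values
x_pred = np.arange(n_train - 1, n_train + n_pred)
curve = general_logistic_shift(x_pred, *params)

pred_index = pd.date_range("2020-11-06", periods=n_pred)
pred_daily = pd.Series(np.diff(curve), index=pred_index).round(0)
pred_cumulative = pred_daily.cumsum() + last_actual
print(pd.DataFrame({"daily": pred_daily, "cumulative": pred_cumulative}))
```

Anchoring the cumulative prediction to the last actual value, rather than to the fitted curve, avoids a jump between the actual and predicted series.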

The `combined_daily` attribute contains the actual and predicted values combined in a single DataFrame. Below, we have the last three days of actual data and the first three predicted values.

In [None]:
cm.combined_daily['usa_cases'].loc['2020-11-03':'2020-11-08', :'Delaware']

Similarly, the `combined_cumulative` dictionary holds the actual cumulative along with the predicted values.

In [None]:
cm.combined_cumulative['usa_cases'].loc['2020-11-03':'2020-11-08', :'Delaware']

The `combined_daily_s` and `combined_cumulative_s` attributes hold the smoothed actual values together with the predicted values.

In [None]:
cm.combined_daily_s['usa_cases'].loc['2020-11-03':'2020-11-08', :'Delaware']

In [None]:
cm.combined_cumulative_s['usa_cases'].loc['2020-11-03':'2020-11-08', :'Delaware']
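
The combined tables are built by stacking the cumulative actual and predicted values and then differencing to recover daily values. Here is a small sketch of what `combine_actual_with_pred` does, using a made-up area named `A`:

```python
import pandas as pd

# Toy cumulative values: three actual days followed by two predicted days
actual = pd.DataFrame({"A": [10, 15, 25]},
                      index=pd.date_range("2020-11-03", periods=3))
pred = pd.DataFrame({"A": [32, 40]},
                    index=pd.date_range("2020-11-06", periods=2))

combined_cumulative = pd.concat((actual, pred))

# diff() leaves the first row missing, so fill it with the first cumulative value
combined_daily = (combined_cumulative.diff()
                  .fillna(combined_cumulative.iloc[0])
                  .astype("int"))
print(combined_daily)
```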

### Plotting results

A `plot_prediction` method was also defined to visualize the actual vs predicted values for a given area.

In [None]:
cm.plot_prediction('usa', 'Texas', title="Texas Cases")

## Create Prediction for Deaths using Case Fatality Ratio

We've only discussed making predictions for cases and not deaths. We could use a generalized logistic function to model deaths and get reasonable results, though a simpler approach exists using the Case Fatality Ratio (CFR). The CFR is the fraction of cases resulting in death. Knowing this ratio can help us estimate the number of deaths that result from the number of recorded cases. While the CFR can change substantially over the course of a pandemic as treatments change and more people have access to tests, it should be reasonably stable over a short amount of time.

From clinical data, deaths usually occur two to three weeks after the initial coronavirus infection. Using this knowledge, we can estimate the CFR based on historical cases and deaths. To calculate the CFR, we do the following:

* Find total cases between 15 and 45 days prior
* Find total deaths between 0 and 30 days prior
* Divide the total deaths by the total cases

The function below takes the unprocessed data and the last date of known values and then calculates the CFR for each area. A CFR of 0.005 is used for areas where the ratio is undefined because both the case total and the death total are zero in their respective 30-day windows.

In [None]:
def calculate_cfr(data, last_date):
    last_day_deaths = last_date
    first_day_deaths = last_date - pd.Timedelta('30D')
    last_day_cases = last_day_deaths - pd.Timedelta('15D')
    first_day_cases = last_day_cases - pd.Timedelta('30D')
    
    cfr = {}
    for group in GROUPS:
        deaths, cases = data[f'{group}_deaths'], data[f'{group}_cases']
        deaths_total = deaths.loc[last_day_deaths] - deaths.loc[first_day_deaths]
        cases_total = cases.loc[last_day_cases] - cases.loc[first_day_cases]
        cfr[group] = (deaths_total / cases_total).fillna(0.005)
    return cfr
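
To make the arithmetic concrete, the sketch below applies the same division and `fillna` step to three made-up areas. It also shows an edge case: `fillna` only replaces the missing value produced by 0 / 0, so an area with deaths but no cases in the window ends up with an infinite ratio.

```python
import pandas as pd

# Made-up window totals for three hypothetical areas
deaths_total = pd.Series({"A": 300, "B": 0, "C": 5})
cases_total = pd.Series({"A": 20000, "B": 0, "C": 0})

cfr = (deaths_total / cases_total).fillna(0.005)
print(cfr)
# A: 300 / 20000 = 0.015
# B: 0 / 0 is missing, so the default 0.005 is used
# C: 5 / 0 is infinite and is NOT replaced by fillna
```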

Let's use the function to get the CFR for all areas and output some of the calculated values.

In [None]:
last_date = pd.Timestamp('2020-11-05')
cfr = calculate_cfr(data, last_date)
cfr['world'].head(10).round(3)

In [None]:
cfr['usa'].head(10).round(3)

## Create class to model deaths

We create another class, `DeathsModel`, to model the deaths of each area. It allows the user to set the `lag`, the number of days between cases and deaths, and the `period`, the number of days over which cases and deaths are totaled for the CFR calculation. The `run` method multiplies the CFR by the number of cases that happened about `lag` days earlier. For example, with a last date of November 5 and a lag of 15, the window of cases starts on October 21 (15 days before the last date), and those cases are multiplied by the area's CFR to give the deaths for November 6, the first predicted day. To get smoother results, we use the smoothed daily cases rather than the raw values and then apply a centered 14-day rolling average to the predicted deaths five times.

In [None]:
class DeathsModel:
    def __init__(self, data, last_date, cm, lag, period):
        """
        Build simple model based on CFR to predict deaths for all areas

        Parameters
        ----------
        data : dictionary of data from all areas - result of PrepareData().run()

        last_date : str, last date to be used for training

        cm : CasesModel instance after calling `run` method
        
        lag : int, number of days between cases and deaths, used to calculate CFR
        
        period : int, window size of number of days to calculate CFR
        """
        self.data = data
        self.last_date = self.get_last_date(last_date)
        self.cm = cm
        self.lag = lag
        self.period = period
        self.pred_daily = {}
        self.pred_cumulative = {}
        
        # Dictionary to hold DataFrame of actual and predicted values
        self.combined_daily = {}
        self.combined_cumulative = {}
        
    def get_last_date(self, last_date):
        if last_date is None:
            return self.data['world_cases'].index[-1]
        else:
            return pd.Timestamp(last_date)
        
    def calculate_cfr(self):
        first_day_deaths = self.last_date - pd.Timedelta(f'{self.period}D')
        last_day_cases = self.last_date - pd.Timedelta(f'{self.lag}D')
        first_day_cases = last_day_cases - pd.Timedelta(f'{self.period}D')

        cfr = {}
        for group in GROUPS:
            deaths = self.data[f'{group}_deaths']
            cases = self.data[f'{group}_cases']
            deaths_total = deaths.loc[self.last_date] - deaths.loc[first_day_deaths]
            cases_total = cases.loc[last_day_cases] - cases.loc[first_day_cases]
            cfr[group] = (deaths_total / cases_total).fillna(0.01)
        return cfr
    
    def run(self):
        self.cfr = self.calculate_cfr()
        for group in GROUPS:
            group_cases = f'{group}_cases'
            group_deaths = f'{group}_deaths'
            cfr_start_date = self.last_date - pd.Timedelta(f'{self.lag}D')
            
            daily_cases_smoothed = self.cm.combined_daily_s[group_cases]
            pred_daily = daily_cases_smoothed[cfr_start_date:] * self.cfr[group]
            pred_daily = pred_daily.iloc[:self.cm.n_pred]
            pred_daily.index = self.cm.pred_daily[group_cases].index
            
            # Use repeated rolling average to smooth out the predicted deaths
            for i in range(5):
                pred_daily = pred_daily.rolling(14, min_periods=1, center=True).mean()
            
            pred_daily = pred_daily.round(0).astype("int")
            self.pred_daily[group_deaths] = pred_daily
            last_deaths = self.data[group_deaths].loc[self.last_date]
            self.pred_cumulative[group_deaths] = pred_daily.cumsum() + last_deaths
        self.combine_actual_with_pred()
            
    def combine_actual_with_pred(self):
        for gk, df_pred in self.pred_cumulative.items():
            df_actual = self.data[gk][:self.last_date]
            df_comb = pd.concat((df_actual, df_pred))
            self.combined_cumulative[gk] = df_comb
            self.combined_daily[gk] = df_comb.diff().fillna(df_comb.iloc[0]).astype('int')
            
    def plot_prediction(self, group, area, **kwargs):
        group_kind = f'{group}_deaths'
        actual = self.data[group_kind][area]
        pred = self.pred_cumulative[group_kind][area]
        first_date = self.last_date - pd.Timedelta(60, 'D')
        last_pred_date = self.last_date + pd.Timedelta(30, 'D')
        actual.loc[first_date:last_pred_date].plot(label='Actual', **kwargs)
        pred.plot(label='Predicted').legend()
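
The core of `run` is a shift of the daily cases followed by a multiplication with the CFR. The sketch below reproduces that step for a single made-up series with a hypothetical CFR of 0.02. It mirrors the slicing in `run`, where the window of cases starts at `last_date - lag` (label slices include the start date) and is relabeled with the prediction dates.

```python
import pandas as pd

last_date = pd.Timestamp("2020-11-05")
lag, n_pred, cfr = 15, 3, 0.02   # cfr is a hypothetical value

# Made-up daily cases: 100 on October 15, increasing by 1 each day
idx = pd.date_range("2020-10-15", "2020-11-08")
daily_cases = pd.Series(range(100, 100 + len(idx)), index=idx, dtype="float")

start = last_date - pd.Timedelta(days=lag)       # 2020-10-21
window = daily_cases.loc[start:].iloc[:n_pred]   # cases on Oct 21, 22, 23

# Relabel the lagged cases with the prediction dates and apply the CFR
pred_index = pd.date_range(last_date + pd.Timedelta(days=1), periods=n_pred)
pred_deaths = pd.Series((window * cfr).values, index=pred_index)
print(pred_deaths)
```

In the class, these raw products are then passed through the repeated rolling mean before being rounded to integers.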

Let's instantiate this class and then call the `run` method, which should execute immediately as the model for deaths is far simpler than it is for cases.

In [None]:
dm = DeathsModel(data=data, last_date='2020-11-05', cm=cm, lag=15, period=30)
dm.run()

Let's output the first daily and cumulative predictions.

In [None]:
dm.pred_daily['usa_deaths'].iloc[:5, :10]

In [None]:
dm.pred_cumulative['usa_deaths'].iloc[:5, :10]

Just as with the cases, `combined_daily` and `combined_cumulative` are available combining actual and predicted values. Again, we look at the three days preceding and following the predicted date.

In [None]:
dm.combined_daily['usa_deaths'].loc['2020-11-03':'2020-11-08', :'Delaware']

In [None]:
dm.combined_cumulative['usa_deaths'].loc['2020-11-03':'2020-11-08', :'Delaware']

Use the `plot_prediction` method to plot the actual and predicted values of deaths for a particular area.

In [None]:
dm.plot_prediction('usa', 'Texas', title='Texas Deaths')

## Creating final tables for the dashboard

Now that we have all the information for our dashboard, we'll need to extract it from these classes and store it in files. For simplicity, we will use CSV files.

### Adding USA to world tables

Originally, we filtered out the USA from the world tables. We did this knowing that the predictions for the entire country were likely to be different from the sum of the predictions of each individual state. Now that we have each state's predictions, we will total them up and append them to the world tables. Here, the daily and cumulative cases and deaths are assigned to new variables. Each of the world DataFrames has the USA added as a new column.

In [None]:
# Get Daily Cases and Deaths from dictionaries
world_cases_d = cm.combined_daily['world_cases']
usa_cases_d = cm.combined_daily['usa_cases']
world_deaths_d = dm.combined_daily['world_deaths']
usa_deaths_d = dm.combined_daily['usa_deaths']

# Add USA to world 
world_cases_d = world_cases_d.assign(USA=usa_cases_d.sum(axis=1))
world_deaths_d = world_deaths_d.assign(USA=usa_deaths_d.sum(axis=1))

# Get Cumulative Cases and Deaths
world_cases_c = cm.combined_cumulative['world_cases']
usa_cases_c = cm.combined_cumulative['usa_cases']
world_deaths_c = dm.combined_cumulative['world_deaths']
usa_deaths_c = dm.combined_cumulative['usa_deaths']

# Add USA to world 
world_cases_c = world_cases_c.assign(USA=usa_cases_c.sum(axis=1))
world_deaths_c = world_deaths_c.assign(USA=usa_deaths_c.sum(axis=1))

We verify that the USA has been added to the world cases DataFrame.

In [None]:
world_cases_d.iloc[-5:, -10:]

### Creating a single table to store results

A single table will be used to hold the daily and cumulative cases and deaths for each area for each date. We'll reshape the DataFrames using the `stack` method so that all values are in a single column with the index containing the date and the area name.

In [None]:
world_cases_d.stack().tail()
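
For readers less familiar with `stack`, here is a tiny example on a made-up two-day, two-area DataFrame. Each column label moves into a second index level, leaving one value per (date, area) pair.

```python
import pandas as pd

# Made-up daily cases for two areas over two days
df = pd.DataFrame({"Texas": [5, 7], "Utah": [1, 2]},
                  index=pd.date_range("2020-11-05", periods=2))

stacked = df.stack()
print(stacked)

# A single value is now looked up with a (date, area) pair
value = stacked.loc[(pd.Timestamp("2020-11-06"), "Utah")]
print(value)
```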

We can place all four Series as columns in a single DataFrame with the `concat` function, using its `keys` parameter to label each new column.

In [None]:
df_world = pd.concat((world_deaths_d.stack(), world_cases_d.stack(),
                      world_deaths_c.stack(), world_cases_c.stack()), axis=1,
                     keys=['Daily Deaths', 'Daily Cases', 'Deaths', 'Cases'])
df_world.tail()

The same thing is done for the USA data.

In [None]:
df_usa = pd.concat((usa_deaths_d.stack(), usa_cases_d.stack(), 
                    usa_deaths_c.stack(), usa_cases_c.stack()), axis=1, 
                   keys=['Daily Deaths', 'Daily Cases', 'Deaths', 'Cases'])
df_usa.tail()

All of the above code is placed in a function that accepts instances of the `CasesModel` and `DeathsModel` as arguments.

In [None]:
def combine_all_data(cm, dm):
    # Get Daily Cases and Deaths
    world_cases_d = cm.combined_daily['world_cases']
    usa_cases_d = cm.combined_daily['usa_cases']
    world_deaths_d = dm.combined_daily['world_deaths']
    usa_deaths_d = dm.combined_daily['usa_deaths']

    # Add USA to world 
    world_cases_d = world_cases_d.assign(USA=usa_cases_d.sum(axis=1))
    world_deaths_d = world_deaths_d.assign(USA=usa_deaths_d.sum(axis=1))

    # Get Cumulative Cases and Deaths
    world_cases_c = cm.combined_cumulative['world_cases']
    usa_cases_c = cm.combined_cumulative['usa_cases']
    world_deaths_c = dm.combined_cumulative['world_deaths']
    usa_deaths_c = dm.combined_cumulative['usa_deaths']

    # Add USA to world 
    world_cases_c = world_cases_c.assign(USA=usa_cases_c.sum(axis=1))
    world_deaths_c = world_deaths_c.assign(USA=usa_deaths_c.sum(axis=1))
    
    df_world = pd.concat((world_deaths_d.stack(), world_cases_d.stack(), 
                          world_deaths_c.stack(), world_cases_c.stack()), axis=1, 
                         keys=['Daily Deaths', 'Daily Cases', 'Deaths', 'Cases'])
    
    df_usa = pd.concat((usa_deaths_d.stack(), usa_cases_d.stack(), 
                        usa_deaths_c.stack(), usa_cases_c.stack()), axis=1, 
                       keys=['Daily Deaths', 'Daily Cases', 'Deaths', 'Cases'])
    df_all = pd.concat((df_world, df_usa), keys=['world', 'usa'], 
                       names=['group', 'date', 'area'])
    df_all.to_csv('data/all_data.csv')
    return df_all

Notice that the very end of this function concatenates the world and USA DataFrames, one on top of the other, and adds a new index level, `group`, to the DataFrame. The data is written to the file `all_data.csv`.

In [None]:
df_all = combine_all_data(cm, dm)
df_all.tail()
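
The sketch below shows, on made-up data, how `keys` and `names` add the `group` level during a vertical concatenation, and one way the three index levels might be restored when the CSV is read back. It writes to an in-memory buffer instead of a file, and the dates come back as plain strings because no date parsing is requested.

```python
import io
import pandas as pd

# Made-up tables indexed by (date, area)
world_idx = pd.MultiIndex.from_tuples([("2020-11-05", "Italy"), ("2020-11-05", "Spain")])
usa_idx = pd.MultiIndex.from_tuples([("2020-11-05", "Texas"), ("2020-11-05", "Utah")])
df_world = pd.DataFrame({"Cases": [10, 20]}, index=world_idx)
df_usa = pd.DataFrame({"Cases": [3, 4]}, index=usa_idx)

# keys adds an outer index level; names labels all three levels
df_all = pd.concat((df_world, df_usa), keys=["world", "usa"],
                   names=["group", "date", "area"])

# Round trip through CSV; the index levels are written as ordinary columns
buffer = io.StringIO()
df_all.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=["group", "date", "area"])
print(restored)
```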

## Create summary table

The main table in our dashboard is a summary of the data at the current date. We'll create it now and begin by selecting the current day's rows.

In [None]:
df_summary = df_all.query('date == @last_date')
df_summary.head()

We read in a file called `population.csv` that has the population and code (used in the map) of each area.

In [None]:
pop = pd.read_csv("data/population.csv")
pop.head()

Let's merge these two tables together and add columns for deaths and cases per million.

In [None]:
df_summary = df_summary.merge(pop, how='left', on=['group','area'])
df_summary["Deaths per Million"] = (df_summary["Deaths"] / df_summary["population"]).round(0)
df_summary["Cases per Million"] = (df_summary["Cases"] / df_summary["population"]).round(-1)
df_summary.head()
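
Dividing by `population` yields a per-million figure only if the population is stored in millions. The sketch below assumes that is the case; if the column held raw head counts, the ratio would need to be multiplied by 1,000,000. All numbers are made up for illustration.

```python
import pandas as pd

# Made-up summary rows and populations (population in millions, an assumption)
summary = pd.DataFrame({"group": ["usa", "usa"], "area": ["Texas", "Utah"],
                        "Deaths": [18000, 640], "Cases": [950000, 120000]})
pop = pd.DataFrame({"group": ["usa", "usa"], "area": ["Texas", "Utah"],
                    "population": [29.0, 3.2]})

# A left merge keeps every summary row, even one without a population match
merged = summary.merge(pop, how="left", on=["group", "area"])
merged["Deaths per Million"] = (merged["Deaths"] / merged["population"]).round(0)
# round(-1) rounds to the nearest 10
merged["Cases per Million"] = (merged["Cases"] / merged["population"]).round(-1)
print(merged)
```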

Let's place all of this code within its own function which also writes the data to a file.

In [None]:
def create_summary_table(df_all, last_date):
    df = df_all.query('date == @last_date')
    pop = pd.read_csv("data/population.csv")
    df = df.merge(pop, how='left', on=['group','area'])
    df["Deaths per Million"] = (df["Deaths"] / df["population"]).round(0)
    df["Cases per Million"] = (df["Cases"] / df["population"]).round(-1)
    df['date'] = last_date
    df.to_csv('data/summary.csv', index=False)
    return df

In [None]:
create_summary_table(df_all, last_date).head()

## Code within the modules

The `CasesModel` and `DeathsModel` classes are placed in the `models.py` file. The `PrepareData` class and the `combine_all_data` and `create_summary_table` functions are placed in the `prepare.py` file. In the next chapter, we'll run all of our code for the entire project to prepare the data, make predictions, and save the final tables.