In [22]:
import os
import json
import numpy as np
import pandas as pd
import sqlite3
import functools as ft
import matplotlib.pyplot as plt
%matplotlib inline

ETL - Extract, Transform, Load

Extract: The data is read from an Excel file named 'Death reasons table.xlsx'. Extraction comes first in the ETL process, so that the data is available to work on.

In [23]:
Death_reasons_table_df = pd.read_excel('Death reasons table.xlsx')
Death_reasons_table_df

Unnamed: 0,Date,Patient Number,isDead,Routine test,DeathReason
0,2005-06-30,1,0,1,
1,2005-07-08,1,0,1,
2,2005-10-24,1,0,1,
3,2006-01-08,1,0,1,
4,2006-02-02,1,0,1,
...,...,...,...,...,...
99995,2019-10-09,300,0,0,
99996,2019-11-19,300,0,0,
99997,2019-11-21,300,0,0,
99998,2019-12-17,300,0,0,


In [24]:
Death_reasons_table_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 5 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   Date            100000 non-null  datetime64[ns]
 1   Patient Number  100000 non-null  int64         
 2   isDead          100000 non-null  int64         
 3   Routine test    100000 non-null  int64         
 4   DeathReason     66 non-null      object        
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 3.8+ MB


In [25]:
Death_reasons_table_df.shape

(100000, 5)

Transform: The central step of ETL, in which the original data is changed to fit the project's goals. In this step we also clean the data of empty values and noise so that they do not interfere with drawing conclusions from it.

In [26]:
Death_reasons_table_df.isnull().sum()

Date                  0
Patient Number        0
isDead                0
Routine test          0
DeathReason       99934
dtype: int64

Each row in the table represents a test, and a person dies only once, so in a table of causes of death most rows naturally have no value in DeathReason: 99934 of the 100000 rows are empty. We keep only the rows that do have a cause of death and try to draw conclusions from them.
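
The reasoning above can be checked directly: if a cause of death is recorded only when a patient dies, DeathReason should be filled exactly on the rows where isDead is 1. The following is a minimal sketch on a made-up toy table (toy_df and its values are assumptions for illustration, not the real data).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table; the values are made up for illustration.
toy_df = pd.DataFrame({
    "Patient Number": [1, 1, 2, 2, 3],
    "isDead": [0, 0, 0, 1, 0],
    "DeathReason": [np.nan, np.nan, np.nan, "Septicemia", np.nan],
})

# Count missing values per column, as isnull().sum() does in the notebook.
missing_per_column = toy_df.isnull().sum()

# A cause of death should be present exactly on the rows where isDead == 1.
has_reason = toy_df["DeathReason"].notna()
is_consistent = bool((has_reason == (toy_df["isDead"] == 1)).all())

print(missing_per_column["DeathReason"])  # rows without a cause of death
print(is_consistent)
```

If the same check returned False on the real table, the missing values would need a closer look before being dropped.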

In [27]:
Death_reasons_table_df = Death_reasons_table_df.dropna()
Death_reasons_table_df

Unnamed: 0,Date,Patient Number,isDead,Routine test,DeathReason
579,2019-12-16,2,1,1,Kidney disease
3131,2019-12-26,12,1,1,Septicemia
3653,2019-12-31,15,1,1,Septicemia
3883,2019-11-05,20,1,1,Liver disease
4975,2019-08-22,31,1,1,Other causes
...,...,...,...,...,...
96471,2019-12-24,281,1,0,Septicemia
96947,2019-12-21,285,1,0,Cerebrovascular disease
97003,2019-10-06,286,1,0,Anemias
98823,2019-12-15,291,1,0,Hypertension
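
In this table only DeathReason contains missing values, so dropna() without arguments removes exactly the rows we intend. If another column ever contained gaps, it would silently remove those rows too. A hedged alternative is to name the column explicitly with subset; the toy table below is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy table in which a second column also has a gap (made-up values).
toy_df = pd.DataFrame({
    "Patient Number": [1, 2, 3],
    "Routine test": [1, np.nan, 0],
    "DeathReason": [np.nan, "Septicemia", "Anemias"],
})

# dropna() drops a row if ANY column is missing.
all_columns = toy_df.dropna()

# subset= restricts the check to DeathReason; copy() yields an independent frame.
reason_only = toy_df.dropna(subset=["DeathReason"]).copy()

print(len(all_columns), len(reason_only))
```

Here the row for patient 2 survives only in reason_only, because its gap is in "Routine test" rather than in DeathReason.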


Because every remaining row belongs to a patient who has died, the "isDead" column no longer carries any information, so we delete it.
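
Before deleting the column, the claim that it is constant can be verified. This is a small sketch on a made-up table (toy_df is an assumption); it drops the column only if every value equals 1.

```python
import pandas as pd

# Toy stand-in for the filtered table (made-up values).
toy_df = pd.DataFrame({
    "Patient Number": [2, 12, 15],
    "isDead": [1, 1, 1],
    "DeathReason": ["Kidney disease", "Septicemia", "Septicemia"],
})

# Drop the column only when it is constant, so no information is lost.
if toy_df["isDead"].eq(1).all():
    toy_df = toy_df.drop(columns="isDead")

print(list(toy_df.columns))
```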

In [28]:
del Death_reasons_table_df['isDead']
Death_reasons_table_df

Unnamed: 0,Date,Patient Number,Routine test,DeathReason
579,2019-12-16,2,1,Kidney disease
3131,2019-12-26,12,1,Septicemia
3653,2019-12-31,15,1,Septicemia
3883,2019-11-05,20,1,Liver disease
4975,2019-08-22,31,1,Other causes
...,...,...,...,...
96471,2019-12-24,281,0,Septicemia
96947,2019-12-21,285,0,Cerebrovascular disease
97003,2019-10-06,286,0,Anemias
98823,2019-12-15,291,0,Hypertension


In [29]:
# assign() returns a new DataFrame, so no SettingWithCopyWarning is raised
Death_reasons_table_df = Death_reasons_table_df.assign(DeathReason=Death_reasons_table_df['DeathReason'].str.upper())
Death_reasons_table_df


Unnamed: 0,Date,Patient Number,Routine test,DeathReason
579,2019-12-16,2,1,KIDNEY DISEASE
3131,2019-12-26,12,1,SEPTICEMIA
3653,2019-12-31,15,1,SEPTICEMIA
3883,2019-11-05,20,1,LIVER DISEASE
4975,2019-08-22,31,1,OTHER CAUSES
...,...,...,...,...
96471,2019-12-24,281,0,SEPTICEMIA
96947,2019-12-21,285,0,CEREBROVASCULAR DISEASE
97003,2019-10-06,286,0,ANEMIAS
98823,2019-12-15,291,0,HYPERTENSION
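
Uppercasing matters because labels that differ only in letter case would otherwise be counted as separate causes of death. The sketch below shows the effect on a made-up list of labels; the extra str.strip() step is an assumption added for illustration and is not part of the notebook.

```python
import pandas as pd

# Made-up labels with inconsistent case and stray whitespace.
labels = pd.Series(["Septicemia", "SEPTICEMIA", " septicemia ", "Anemias"])

# Without normalization every spelling is its own category.
raw_counts = labels.value_counts()

# Strip surrounding whitespace, then uppercase, so spellings collapse.
normalized_counts = labels.str.strip().str.upper().value_counts()

print(len(raw_counts), len(normalized_counts))
```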


In [30]:
Death_reasons_table_df.duplicated().sum()

0

Note that because the data is synthetic, the cause of death is duplicated for each patient: it is recorded both for a hospitalization and for a routine test. Such rows are not identical across all columns, which is why duplicated() reports 0 above.
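
To find this kind of repetition, duplicated() can be restricted to the columns that identify a death rather than comparing whole rows. The following is a sketch on a made-up table; toy_df and the choice of key columns are assumptions.

```python
import pandas as pd

# Made-up rows: patient 2 appears once per test type with the same cause.
toy_df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-12-16", "2019-12-16", "2019-12-26"]),
    "Patient Number": [2, 2, 12],
    "Routine test": [1, 0, 1],
    "DeathReason": ["KIDNEY DISEASE", "KIDNEY DISEASE", "SEPTICEMIA"],
})

key_columns = ["Patient Number", "DeathReason"]

# Whole-row comparison finds nothing, because "Routine test" differs.
exact_duplicates = int(toy_df.duplicated().sum())

# Comparing only the key columns reveals the repeated death record.
repeated_patients = int(toy_df.duplicated(subset=key_columns).sum())

# Keep the first record per patient and cause.
one_row_per_patient = toy_df.drop_duplicates(subset=key_columns, keep="first")

print(exact_duplicates, repeated_patients, len(one_row_per_patient))
```

Whether to keep one row per patient depends on the analysis: counting causes of death calls for one row per patient, while comparing routine tests with hospitalizations needs both.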

Load: In this step all the tables are loaded and merged into one final table. Because this project works with a single fact table, the final table is simply the result of the Transform step.

In [31]:
final_death_reasons_table_df = Death_reasons_table_df.copy()
final_death_reasons_table_df

Unnamed: 0,Date,Patient Number,Routine test,DeathReason
579,2019-12-16,2,1,KIDNEY DISEASE
3131,2019-12-26,12,1,SEPTICEMIA
3653,2019-12-31,15,1,SEPTICEMIA
3883,2019-11-05,20,1,LIVER DISEASE
4975,2019-08-22,31,1,OTHER CAUSES
...,...,...,...,...
96471,2019-12-24,281,0,SEPTICEMIA
96947,2019-12-21,285,0,CEREBROVASCULAR DISEASE
97003,2019-10-06,286,0,ANEMIAS
98823,2019-12-15,291,0,HYPERTENSION
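
If the final table should also be stored outside the notebook, one option is to write it to a SQLite database with the sqlite3 module imported at the top. This is a minimal sketch under stated assumptions: the table name death_reasons_fact, the in-memory database, and the toy rows are all hypothetical.

```python
import sqlite3

import pandas as pd

# Toy stand-in for final_death_reasons_table_df (made-up rows).
toy_final_df = pd.DataFrame({
    "Date": ["2019-12-16", "2019-12-26"],
    "Patient Number": [2, 12],
    "Routine test": [1, 1],
    "DeathReason": ["KIDNEY DISEASE", "SEPTICEMIA"],
})

# An in-memory database keeps the sketch self-contained;
# a file path would be used for a persistent store.
conn = sqlite3.connect(":memory:")

# Write the fact table, replacing it if it already exists.
toy_final_df.to_sql("death_reasons_fact", conn, index=False, if_exists="replace")

# Read it back to confirm the load succeeded.
loaded_df = pd.read_sql_query("SELECT * FROM death_reasons_fact", conn)
conn.close()

print(loaded_df.shape)
```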
