# Creating the fake database
****

<a id="top"></a>

<b>Table of contents</b><br>

1. [Creating fake population](#population)
2. [Populating the database](#Database)

In this notebook I am going to create a few CSV files with fake data. Then I will use this data to create tables in a new database. The graphical representation (schema) of the database I want to create can be seen below.



![Database schema](https://github.com/Manuel-DominguezCBG/SQL2Dashboard/blob/main/Covid-19/Images/screenshot-aca1dabf.png?raw=true)

This database contains 4 interconnected tables. The main table is Patient_data, which holds basic patient data such as ID, name, NHS number, gender and so on. In this fake database, it represents all the patients of a Trust.

The second table is Covid_19_admission. It contains a proportion of the patients from the Patient_data table. Its column Patient_admitted_id links to Patient_data (ID).

The third table is Covid_19_death, which holds the patients who die a few random days after being admitted. Its column Patient_admitted_id links to Covid_19_admission (Patient_admitted_id).

The last table is Hospital_features, which contains information about the three hospitals belonging to the Trust. Its column Hospital_ID links to Covid_19_death (Hospital_ID) and to Covid_19_admission (Hospital_ID).
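The links between the four tables can be sketched directly in SQLite. This is only a minimal illustration on an in-memory database, not the final schema of this notebook: the column types, the reduced column lists and the sample rows are assumptions made for the example. Foreign-key enforcement is switched on explicitly because SQLite leaves it off by default.

```python
import sqlite3

# In-memory database, so the sketch leaves no file behind.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores FKs unless this is set

demo_conn.executescript("""
CREATE TABLE Hospital_features (Hospital_ID INTEGER PRIMARY KEY, Hospital_name TEXT);
CREATE TABLE Patient_data (ID INTEGER PRIMARY KEY, Full_Name TEXT);
CREATE TABLE Covid_19_admission (
    Patient_admitted_id INTEGER PRIMARY KEY REFERENCES Patient_data (ID),
    Hospital_ID INTEGER REFERENCES Hospital_features (Hospital_ID)
);
CREATE TABLE Covid_19_death (
    Patient_admitted_id INTEGER PRIMARY KEY
        REFERENCES Covid_19_admission (Patient_admitted_id),
    Hospital_ID INTEGER REFERENCES Hospital_features (Hospital_ID)
);
""")

# Parent rows first, then the admission that points at them.
demo_conn.execute("INSERT INTO Hospital_features VALUES (214321, 'Chris Martin Hospital')")
demo_conn.execute("INSERT INTO Patient_data VALUES (1, 'Example Patient')")
demo_conn.execute("INSERT INTO Covid_19_admission VALUES (1, 214321)")

# An admission for a patient that does not exist is rejected.
try:
    demo_conn.execute("INSERT INTO Covid_19_admission VALUES (999, 214321)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

admission_count = demo_conn.execute(
    "SELECT COUNT(*) FROM Covid_19_admission"
).fetchone()[0]
```

The child tables carry the foreign keys: an admission points at a patient, and a death points at an admission.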

<a id="population"></a>
## 1. Creating fake people 

In [1]:
# Populating Patient_data dataframe
# 2000 patients with ID number, Name, NHS number, Birthdate, Gender, Ethnicity and Postcode.

# Import libraries
import pandas as pd
from pandas import DataFrame
import numpy as np
import random
import datetime
from datetime import timedelta
import names                                     # pip install names
from faker.providers.person.en import Provider   # pip install faker
import sqlite3


size = 2000

In [2]:
# Let’s create some functions to randomly generate our data 

def random_id(size):
    id_patient = random.sample(range(100000000), size)
    return id_patient

In [3]:
def random_NHS_number(size):
    NHS_numbers = random.sample(range(100000000,999999999),size)
    return NHS_numbers

In [4]:
def random_names(name_type, size):
    """
    Generate n-length ndarray of person names.
    name_type: a string, either first_names or last_names
    """
    names = getattr(Provider, name_type)
    return np.random.choice(names, size=size)

In [5]:
def random_genders(size, p=None):
    """Generate n-length ndarray of genders."""
    if not p:
        # Equal probability of gender
        p = (0.5, 0.5)
    gender = ("M", "F")
    return np.random.choice(gender, size=size, p=p)

In [6]:
def random_Ethnicity(size, p=None):
    """Generate n-length ndarray of ethnic groups."""
    if not p:
        # 5 groups with different probability
        p = (0.49, 0.10, 0.11, 0.01, 0.29)
    Ethnicity = ("White British", "Black British people", "British Indians", "White Gypsy or Irish Traveller", "Other White")
    return np.random.choice(Ethnicity, size=size, p=p)

In [7]:
def random_Postcode(size, p=None):
    """
    Faker's real UK postcode generator can be found here:
    https://github.com/joke2k/faker/blob/07ca4ede54c26554fdb5c7a4f55432cb0498d338/faker/providers/address/en_GB/__init__.py
    However, this small fake database does not need realistic UK postcodes.
    Instead, only a few fake postcodes are written by hand."""
    if not p:
        # 10 postcodes, same p
        p = (0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,0.1)
    Postcode = ("SO15 5FL", "SP01 10MA", "BE01 5SA", "CB19 5US", "SO15 10FL", "LO12 8HG", "WE1 7YG", "SO01 7JH", "SP2 8BJ", "SP3 8BJ")
    return np.random.choice(Postcode, size=size, p=p)

In [8]:
def random_Hospital_names(size, p=None):
    """
    3 Fake hospitals names with similar probability
    """
    if not p:
        # 3 hospitals, similar p
        p = (0.3, 0.3, 0.4)
    Hospital_names = ("Robin Hood Hospital", "Alfred Hitchcock Hospital", "Chris Martin Hospital")
    return np.random.choice(Hospital_names, size=size, p=p)
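The helper functions above draw from NumPy's global random state, so every run of the notebook produces different data. If repeatable output is wanted, one option is a seeded generator. The sketch below is an assumption, not part of the original workflow: the name `rng` and the seed value 42 are arbitrary choices for illustration.

```python
import numpy as np

hospital_names = ("Robin Hood Hospital", "Alfred Hitchcock Hospital", "Chris Martin Hospital")
p = (0.3, 0.3, 0.4)  # the probabilities must add up to 1

# Hypothetical seed: any fixed integer makes the draws repeatable.
rng = np.random.default_rng(seed=42)
draws = rng.choice(hospital_names, size=2000, p=p)

# A second generator with the same seed reproduces exactly the same draws.
draws_again = np.random.default_rng(seed=42).choice(hospital_names, size=2000, p=p)

# Observed share of each hospital, which should sit close to p.
shares = {name: float((draws == name).mean()) for name in hospital_names}
print(shares)
```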

In [9]:
def random_dates(start, end, size):
    """
    Generate random dates within range between start and end.    
    Adapted from: https://stackoverflow.com/a/50668285
    """
    # Unix timestamp is in nanoseconds by default, so divide it by
    # 24*60*60*10**9 to convert to days.
    divide_by = 24 * 60 * 60 * 10**9
    start_u = start.value // divide_by
    end_u = end.value // divide_by
    return pd.to_datetime(np.random.randint(start_u, end_u, size), unit="D")
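A short check of how `random_dates` behaves at the edges of the range. The function is repeated here so the sketch runs on its own; the sample size of 1000 is an arbitrary choice. Because `np.random.randint` excludes its upper bound, the `end` date itself is never drawn.

```python
import numpy as np
import pandas as pd

def random_dates(start, end, size):
    # Same logic as above: convert nanosecond timestamps to whole days.
    divide_by = 24 * 60 * 60 * 10**9
    start_u = start.value // divide_by
    end_u = end.value // divide_by
    return pd.to_datetime(np.random.randint(start_u, end_u, size), unit="D")

start = pd.to_datetime("2021-01-01")
end = pd.to_datetime("2021-01-31")
sample_dates = random_dates(start, end, size=1000)

# The earliest possible date is start, the latest is the day before end.
print(sample_dates.min(), sample_dates.max())
```

To include the last day of the month, `end` can be set to the first day of the following month.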

## patient_data_df

In [10]:
# Empty df with headers
patients_admitted = pd.DataFrame(columns=['ID', 'NHS_Number','Full_Name','Gender', 'Birthdate', 'Ethnicity', 'Postcode'])

# Populate the dataframe with the functions created above.
patients_admitted['ID'] = random_id(size) 
patients_admitted['NHS_Number'] = random_NHS_number(size)
patients_admitted['first_names'] = random_names('first_names', size)
patients_admitted['last_names'] = random_names('last_names', size)
patients_admitted['Full_Name'] = patients_admitted['first_names']  + ' ' + patients_admitted['last_names']
del patients_admitted['first_names']
del patients_admitted['last_names']
patients_admitted['Gender'] = random_genders(size)
patients_admitted['Birthdate'] = random_dates(start=pd.to_datetime('1900-01-01'), end=pd.to_datetime('2008-01-01'), 
                                              size=size)
patients_admitted['Ethnicity'] = random_Ethnicity(size)
patients_admitted['Postcode'] = random_Postcode(size)
patients_admitted

| | ID | NHS_Number | Full_Name | Gender | Birthdate | Ethnicity | Postcode |
|---|---|---|---|---|---|---|---|
| 0 | 16025501 | 854569122 | Turner Frami | F | 1919-01-31 | Other White | SO15 10FL |
| 1 | 81087031 | 228528071 | Tatianna Lehner | F | 1993-01-19 | Other White | BE01 5SA |
| 2 | 91017798 | 342178620 | Yahir Reynolds | F | 1918-05-21 | White British | LO12 8HG |
| 3 | 91436914 | 420895431 | Terry Heathcote | F | 1952-12-15 | Other White | SO15 5FL |
| 4 | 479005 | 496878295 | Essie Bode | F | 1964-11-26 | White British | SP01 10MA |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | 98581986 | 325159334 | Alphons Prohaska | F | 1966-05-02 | White British | SP01 10MA |
| 1996 | 10295584 | 760009254 | Charls Moen | M | 1977-02-25 | White British | LO12 8HG |
| 1997 | 42950741 | 421296048 | Charlotte Jones | F | 1915-01-06 | White British | CB19 5US |
| 1998 | 16476090 | 657475861 | Harlen Satterfield | F | 1985-10-18 | British Indians | BE01 5SA |


In [11]:
# To save this as CSV if necessary

#patients_admitted.to_csv('./patient_data_df.csv')
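One detail worth knowing when saving with `to_csv`: by default the row index is written as an unnamed first column, and reading the file back yields an extra `Unnamed: 0` column. The sketch below shows the difference on a tiny stand-in dataframe, using in-memory buffers instead of files. The names `demo_df`, `with_index` and `without_index` are made up for the example.

```python
import io
import pandas as pd

# Tiny stand-in for patients_admitted.
demo_df = pd.DataFrame({"ID": [16025501, 81087031], "Gender": ["F", "F"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
demo_df.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# index=False keeps only the real columns.
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)     # ['Unnamed: 0', 'ID', 'Gender']
print(columns_without_index)  # ['ID', 'Gender']
```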

## covid_19_admission_df

In [12]:
# Populating COVID-19 admission dataframe

# Empty df
covid_19_admission_df = pd.DataFrame(columns=['Patient_admitted_id', 'Date', 'Hospital_ID' ])


# Admission of COVID patients in three hospitals of the same TRUST for one month period

# Populate Patient_admitted_id with the ID of people from patients_admitted
# This table contains 5% of the patients found in patient_data_df
patients_admitted_length = int(len(patients_admitted)*0.05)
ID = patients_admitted['ID'].tolist() # ID column as a list, to select 5% of its values
# random.sample draws without replacement, so no patient is admitted twice
# (Patient_admitted_id is later used as a primary key)
ID2df = random.sample(ID, k=patients_admitted_length)
covid_19_admission_df['Patient_admitted_id'] = ID2df

# Dates in January 2021; random_dates excludes the end date, so 31 January itself is never drawn
covid_19_admission_df['Date'] = random_dates(start=pd.to_datetime('2021-01-01'), 
                                             end=pd.to_datetime('2021-01-31'), size=patients_admitted_length)




# For now, each hospital gets a random number of patients

The_hospitals_list = [214321,224323,3234234]
covid_19_admission_df['Hospital_ID'] = np.random.choice(list(The_hospitals_list), len(covid_19_admission_df))

covid_19_admission_df

| | Patient_admitted_id | Date | Hospital_ID | Date_discharge |
|---|---|---|---|---|
| 0 | 74385428 | 2021-01-10 | 224323 | 2021-03-28 |
| 1 | 14990182 | 2021-01-08 | 214321 | 2021-02-13 |
| 2 | 62291818 | 2021-01-23 | 224323 | 2021-02-23 |
| 3 | 26321802 | 2021-01-15 | 224323 | 2021-01-21 |
| 4 | 30418443 | 2021-01-03 | 224323 | 2021-01-16 |
| ... | ... | ... | ... | ... |
| 95 | 64054283 | 2021-01-27 | 224323 | 2021-02-06 |
| 96 | 63979213 | 2021-01-09 | 224323 | 2021-01-20 |
| 97 | 81639508 | 2021-01-04 | 3234234 | 2021-02-04 |
| 98 | 23084923 | 2021-01-06 | 3234234 | 2021-02-07 |
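Because Patient_admitted_id later serves as a primary key, the 5% subset should contain each patient at most once. One way to sketch that with pandas alone is `Series.sample`, which draws without replacement by default. The stand-in dataframe `patients` and the `random_state` value are assumptions for the example.

```python
import pandas as pd

# Stand-in for the 2000 patient IDs of patients_admitted.
patients = pd.DataFrame({"ID": range(1000, 3000)})

# frac=0.05 keeps 5% of the rows; sampling is without replacement by default.
admitted = patients["ID"].sample(frac=0.05, random_state=1)

print(len(admitted), admitted.is_unique)
```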


In [None]:
# To save this as CSV
#covid_19_admission_df.to_csv('./covid_19_admission_df.csv')

## covid_19_death_df

This table is built in a similar way to covid_19_admission_df: a small proportion of the admitted patients die a few days after admission.


In [None]:

# Create a dict with ID number and date of admission
id2date_admission = pd.Series(covid_19_admission_df.Date.values,covid_19_admission_df.Patient_admitted_id.values).to_dict()

# Select the 5% of the total number of items  covid_19_admission_df
covid_19_admission_length = int(len(covid_19_admission_df)*0.05)


#Take the 5% of total number of items 
random_entry = random.sample(list(id2date_admission.items()), k=covid_19_admission_length)

# Populate a new df with the patients who are going to die and their date of admission
covid_19_death_df = DataFrame (random_entry,columns=['Patient_admitted_id','Date_admission'])

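# Now, we suppose they will die a few random days later, between the 3rd and the 20th day after admission for example.
# Draw one offset per patient; a single random.randint call would give every patient the same delay.
death_offsets = np.random.randint(3, 21, size=len(covid_19_death_df))  # upper bound is exclusive
covid_19_death_df["Death_dates"] = covid_19_death_df["Date_admission"] + pd.to_timedelta(death_offsets, unit="D")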

# we need the hospital where they were admitted and they died
covid_19_death_df = covid_19_admission_df.merge(covid_19_death_df, on="Patient_admitted_id")

# Drop the columns that are no longer needed.
# errors='ignore' skips Date_discharge when that column has not been created yet.
covid_19_death_df = covid_19_death_df.drop(columns=['Date', 'Date_admission', 'Date_discharge'], errors='ignore')

covid_19_death_df


## Hospital_features_df
Some characteristics of the hospitals of this fake database

In [None]:
data = {'Hospital_ID': [214321,224323,3234234],
        'Hospital_name':  ['Chris Martin Hospital', 'Alfred Hitchcock Hospital','Robin Hood Hospital'],
        'Hospital_location': ['SO15 5FL', 'BE01 5SA','LO12 8HG'],
        'Number_of_beds' : [100,200,150],
        'Number_of_staff' : [300,600,400],
        'Number_of_ITU_Beds' : [10,20,15]}
Hospital_features_df = pd.DataFrame(data, columns=['Hospital_ID','Hospital_name', 'Hospital_location', 
                                                   'Number_of_beds','Number_of_staff', 'Number_of_ITU_Beds' ])
Hospital_features_df

In [None]:
#Hospital_features_df.to_csv('./Hospital_features_df.csv')


<a id="Database"></a>
## 2. Populating the database

Once all the dataframes have been built, we can load them into the database.

In [None]:
# Create a database connection and cursor to execute queries.
conn = sqlite3.connect('./fake_db.db') # This creates an empty database in the current directory
c = conn.cursor()

In [None]:
#   1. patient_data_df

# Add an empty table and load patient_data_df into the SQL table

c.execute('''DROP TABLE IF EXISTS patient_data''')
c.execute(''' CREATE TABLE patient_data (ID NOT NULL,
NHS_Number,
Full_Name,
Gender,
Birthdate,
Ethnicity, 
Postcode,
PRIMARY KEY (ID))''')

# patient_data is the parent table; covid_19_admission (Patient_admitted_id) refers to patient_data (ID)
patients_admitted.to_sql('patient_data', conn, if_exists='append', index = False) #LOAD
# c.execute('''SELECT * FROM patient_data''').fetchall() 

In [None]:
#   2. covid_19_admission_df

c.execute('''DROP TABLE IF EXISTS covid_19_admission''')
c.execute(''' CREATE TABLE covid_19_admission (Patient_admitted_id NOT NULL,
Date,
Hospital_ID,
Date_discharge,
PRIMARY KEY (Patient_admitted_id),
FOREIGN KEY (Hospital_ID) REFERENCES  Hospital_features (Hospital_ID) ON DELETE CASCADE,
FOREIGN KEY (Patient_admitted_id) REFERENCES patient_data (ID) ON DELETE CASCADE)''')
covid_19_admission_df.to_sql('covid_19_admission', conn, if_exists='append', index = False)
c.execute('''SELECT * FROM covid_19_admission''').fetchall()

In [None]:
#   3. covid_19_death_df

c.execute('''DROP TABLE IF EXISTS covid_19_death''')
c.execute(''' CREATE TABLE covid_19_death (Patient_admitted_id NOT NULL,
Hospital_ID,
Death_dates,
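PRIMARY KEY (Patient_admitted_id),
FOREIGN KEY (Patient_admitted_id) REFERENCES covid_19_admission (Patient_admitted_id) ON DELETE CASCADE,
FOREIGN KEY (Hospital_ID) REFERENCES  Hospital_features (Hospital_ID) ON DELETE CASCADE)''')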
covid_19_death_df.to_sql('covid_19_death', conn, if_exists='append', index = False)
c.execute('''SELECT * FROM covid_19_death''').fetchall()

In [None]:
#   4. Hospital_features_df

c.execute('''DROP TABLE IF EXISTS Hospital_features''')
c.execute(''' CREATE TABLE Hospital_features ( Hospital_ID NOT NULL PRIMARY KEY,
Hospital_name,
Hospital_location,
Number_of_beds number (3),
Number_of_staff number (3),
Number_of_ITU_Beds number (3))''')
Hospital_features_df.to_sql('Hospital_features', conn, if_exists='append', index = False)
c.execute('''SELECT * FROM Hospital_features''').fetchall()

## Database created. 

In [None]:
# Let's ensure everything is ok.
# PRAGMA table_info returns one row per column of the given table.
c.execute("PRAGMA table_info(patient_data);").fetchall()
# Each row holds: cid, name, type, notnull, dflt_value, pk

In [None]:
c.execute("PRAGMA table_info(covid_19_admission);").fetchall()

In [None]:
c.execute("PRAGMA table_info(covid_19_death);").fetchall()

In [None]:
c.execute("PRAGMA table_info(Hospital_features);").fetchall()
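Beyond inspecting the columns, a join query is a quick way to confirm that the tables link up as intended. The sketch below is an illustration on a separate in-memory database with a handful of rows, so it does not touch fake_db.db. The reduced column lists and the name `demo_conn` are assumptions for the example.

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(":memory:")

# Minimal versions of two of the tables, loaded the same way as above.
pd.DataFrame({
    "Hospital_ID": [214321, 224323, 3234234],
    "Hospital_name": ["Chris Martin Hospital", "Alfred Hitchcock Hospital",
                      "Robin Hood Hospital"],
}).to_sql("Hospital_features", demo_conn, index=False)

pd.DataFrame({
    "Patient_admitted_id": [74385428, 14990182, 62291818, 81639508],
    "Hospital_ID": [224323, 214321, 224323, 3234234],
}).to_sql("covid_19_admission", demo_conn, index=False)

# Count admissions per hospital through the Hospital_ID link.
query = """
SELECT h.Hospital_name, COUNT(a.Patient_admitted_id) AS admissions
FROM Hospital_features AS h
LEFT JOIN covid_19_admission AS a ON a.Hospital_ID = h.Hospital_ID
GROUP BY h.Hospital_ID
ORDER BY h.Hospital_ID
"""
admissions_per_hospital = pd.read_sql_query(query, demo_conn)
print(admissions_per_hospital)
```

The LEFT JOIN keeps hospitals with no admissions in the result, with a count of 0.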

### Notebook details
<br>
<i>Notebook created by <strong>Manuel Dominguez</strong></i>

Creation date: May 2021<br>


Code used to create the database schema diagram

//// -- Tables and References

// Creating tables
Table Patient_data {
  id int [pk, increment] // auto-increment
  Full_name varchar
  NHS_number int
  Birthdate int
  Gender varchar
  Ethnicity varchar
  Postcode varchar
}

Table Covid_19_admission {
  Patients_admitted_id int [ref: > Patient_data.id]  // inline relationship (many-to-one)
  Date_adm int
  Hospital_ID varchar [ref: > Hospital_features.Hospital_ID]
  Indexes {
    (Patients_admitted_id) [pk]
  }
}



Table Hospital_features {
 Hospital_ID varchar
 Hospital_name varchar 
 Hospital_location varchar
 Number_of_beds varchar
 Number_of_ITU_beds varchar
 

}



Table covid_19_death {
 Patients_admitted_id  int [ref: > Covid_19_admission.Patients_admitted_id]
 Hospital_ID varchar [ref: > Hospital_features.Hospital_ID]
 Death_dates varchar 
 Indexes {
    (Patients_admitted_id) [pk]
  }
 

}

In [None]:
covid_19_admission_df['Diff'] = covid_19_admission_df.Date_discharge - covid_19_admission_df.Date  # length of stay (discharge minus admission)

In [None]:
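futures_days = [random.randint(3, 50) for _ in range(len(covid_19_admission_df))]
# timedelta() does not accept a list, so convert the per-row day counts with pd.to_timedelta
covid_19_admission_df["ololo"] = covid_19_admission_df["Date"] + pd.to_timedelta(futures_days, unit="D")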

In [16]:
# Series.iteritems() was removed in pandas 2.0; items() is the replacement
for idx, value in covid_19_admission_df.iloc[:,3].items():
    covid_19_admission_df.loc[idx, ['discharge_date']] = covid_19_admission_df.loc[idx, ['Date']] + timedelta(days=random.randint(3, 50))
    print(timedelta(days=random.randint(3, 50)))
covid_19_admission_df

18 days, 0:00:00
16 days, 0:00:00
48 days, 0:00:00
39 days, 0:00:00
24 days, 0:00:00
33 days, 0:00:00
48 days, 0:00:00
42 days, 0:00:00
50 days, 0:00:00
22 days, 0:00:00
12 days, 0:00:00
30 days, 0:00:00
31 days, 0:00:00
19 days, 0:00:00
38 days, 0:00:00
21 days, 0:00:00
49 days, 0:00:00
20 days, 0:00:00
44 days, 0:00:00
48 days, 0:00:00
37 days, 0:00:00
6 days, 0:00:00
44 days, 0:00:00
46 days, 0:00:00
6 days, 0:00:00
27 days, 0:00:00
46 days, 0:00:00
46 days, 0:00:00
12 days, 0:00:00
7 days, 0:00:00
25 days, 0:00:00
39 days, 0:00:00
7 days, 0:00:00
11 days, 0:00:00
48 days, 0:00:00
20 days, 0:00:00
45 days, 0:00:00
27 days, 0:00:00
42 days, 0:00:00
41 days, 0:00:00
46 days, 0:00:00
33 days, 0:00:00
48 days, 0:00:00
12 days, 0:00:00
22 days, 0:00:00
27 days, 0:00:00
9 days, 0:00:00
34 days, 0:00:00
31 days, 0:00:00
44 days, 0:00:00
34 days, 0:00:00
26 days, 0:00:00
26 days, 0:00:00
10 days, 0:00:00
39 days, 0:00:00
9 days, 0:00:00
33 days, 0:00:00
41 days, 0:00:00
17 days, 0:00:00
35 

| | Patient_admitted_id | Date | Hospital_ID | Date_discharge | discharge_date |
|---|---|---|---|---|---|
| 0 | 74385428 | 2021-01-10 | 224323 | 2021-03-28 | |
| 1 | 14990182 | 2021-01-08 | 214321 | 2021-02-13 | |
| 2 | 62291818 | 2021-01-23 | 224323 | 2021-02-23 | |
| 3 | 26321802 | 2021-01-15 | 224323 | 2021-01-21 | |
| 4 | 30418443 | 2021-01-03 | 224323 | 2021-01-16 | |
| ... | ... | ... | ... | ... | ... |
| 95 | 64054283 | 2021-01-27 | 224323 | 2021-02-06 | |
| 96 | 63979213 | 2021-01-09 | 224323 | 2021-01-20 | |
| 97 | 81639508 | 2021-01-04 | 3234234 | 2021-02-04 | |
| 98 | 23084923 | 2021-01-06 | 3234234 | 2021-02-07 | |
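The discharge_date column in the output above is empty. A likely cause is label alignment: `loc[idx, ['Date']]` returns a one-element Series labelled `Date`, and assigning it to the `discharge_date` label finds no matching label, so nothing is stored. A vectorized sketch that avoids the loop is shown below. It is an illustration on a small stand-in dataframe (three rows copied from the output above), and the 3 to 50 day range follows the loop.

```python
import numpy as np
import pandas as pd

# Stand-in for covid_19_admission_df, using three rows from the output above.
demo_admissions = pd.DataFrame({
    "Patient_admitted_id": [74385428, 14990182, 62291818],
    "Date": pd.to_datetime(["2021-01-10", "2021-01-08", "2021-01-23"]),
})

# One random length of stay per row; randint excludes the upper bound, so 51 gives 3..50.
stay_days = np.random.randint(3, 51, size=len(demo_admissions))

# Adding a timedelta per row works on the whole column at once.
demo_admissions["discharge_date"] = (
    demo_admissions["Date"] + pd.to_timedelta(stay_days, unit="D")
)

length_of_stay = (demo_admissions["discharge_date"] - demo_admissions["Date"]).dt.days
print(demo_admissions)
```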


In [None]:
covid_19_admission_df.loc[4, ['Date']] + timedelta(days=random.randint(3, 50))

In [15]:
data = {'Dates': ['2021-01-01','2021-01-08','2021-01-02']}
        
dates = pd.DataFrame(data, columns=['Dates'])

dates

| | Dates |
|---|---|
| 0 | 2021-01-01 |
| 1 | 2021-01-08 |
| 2 | 2021-01-02 |
