# Table SIM_AV_PATIENT

The Simulacrum database contains synthetic cancer data that imitates some of the data held securely by the National Cancer Registration and Analysis Service (NCRAS) within the National Disease Registration Service, which is part of Public Health England (PHE). The database is a collection of linked data tables with the same structure as those used in the original NCRAS data. The SIM_AV tables represent the patient and tumour registration data, and the SIM_SACT tables represent the Systemic Anti-Cancer Therapy (SACT) data.

I am going to start with the cancer registration table "SIM_AV_PATIENT". This table includes patient demographics, vital status date and cause of death.

First, I am going to import the libraries that are needed and then I will open the table.

In [1]:
import pandas as pd
import numpy as np
from zipfile import ZipFile

In [2]:
pd.set_option('display.max_columns', None) 

In [3]:
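# Read the CSV straight from the archive; the with-blocks close both handles afterwards
with ZipFile('../Data/sim_av_patient.zip') as zf:
    with zf.open('sim_av_patient.csv') as csv:
        df = pd.read_csv(csv)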
df.head()

Unnamed: 0,PATIENTID,SEX,LINKNUMBER,ETHNICITY,DEATHCAUSECODE_1A,DEATHCAUSECODE_1B,DEATHCAUSECODE_1C,DEATHCAUSECODE_2,DEATHCAUSECODE_UNDERLYING,DEATHLOCATIONCODE,NEWVITALSTATUS,VITALSTATUSDATE
0,10000001,2,810000001,A,,,,,,,A,2017-01-17
1,10000002,2,810000002,Z,,,,,,,A,2017-01-14
2,10000003,1,810000003,A,,,,,,,A,2017-01-17
3,10000004,1,810000004,A,,,,,,,A,2017-01-13
4,10000005,2,810000005,,,,,,,,A,2017-01-16


Now I will check the basic characteristics of the table: a random sample of rows, the shape of the table, the amount of missing data, the variable types and a general summary.

In [4]:
df.sample(5)

Unnamed: 0,PATIENTID,SEX,LINKNUMBER,ETHNICITY,DEATHCAUSECODE_1A,DEATHCAUSECODE_1B,DEATHCAUSECODE_1C,DEATHCAUSECODE_2,DEATHCAUSECODE_UNDERLYING,DEATHLOCATIONCODE,NEWVITALSTATUS,VITALSTATUSDATE
336728,20006755,2,820006755,A,,,,,,,A,2017-01-13
595866,40027516,1,840027516,A,C349,C349,,,C349,1.0,A,2017-01-16
141105,10143577,1,810143577,A,,,,,,,A,2017-01-13
1314640,220040287,1,1020040287,A,,J386,,"I489,R18,N179",,,D,2015-04-26
655061,40092908,1,840092908,A,C349,,,,C349,1.0,D,2016-10-09


In [5]:
df.shape

(1322100, 12)

In [6]:
df.isna().sum()

PATIENTID                          0
SEX                                0
LINKNUMBER                         0
ETHNICITY                     129851
DEATHCAUSECODE_1A             991820
DEATHCAUSECODE_1B            1224015
DEATHCAUSECODE_1C            1303994
DEATHCAUSECODE_2             1180118
DEATHCAUSECODE_UNDERLYING     994190
DEATHLOCATIONCODE             991719
NEWVITALSTATUS                     0
VITALSTATUSDATE                    0
dtype: int64
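
The raw counts above are easier to compare as percentages. For example, 991820 of 1322100 rows (about 75%) have no DEATHCAUSECODE_1A and 129851 (about 10%) have no ETHNICITY. The following is a minimal sketch of that calculation. It runs on a small toy frame with made-up values, not on `df` itself; only the column names are taken from the table, and the names `toy` and `missing_pct` are illustrative.

```python
import pandas as pd

# Toy stand-in for df: the values are invented, only the column names match the table.
toy = pd.DataFrame({
    'PATIENTID': [1, 2, 3, 4],
    'ETHNICITY': ['A', None, 'Z', 'A'],
    'DEATHCAUSECODE_1A': [None, 'C349', None, None],
})

# isna().mean() gives the fraction of missing values per column;
# multiplying by 100 turns it into a percentage.
missing_pct = (toy.isna().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

On the real table the same expression would be applied to `df` in place of `toy`.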

In [7]:
df.count()

PATIENTID                    1322100
SEX                          1322100
LINKNUMBER                   1322100
ETHNICITY                    1192249
DEATHCAUSECODE_1A             330280
DEATHCAUSECODE_1B              98085
DEATHCAUSECODE_1C              18106
DEATHCAUSECODE_2              141982
DEATHCAUSECODE_UNDERLYING     327910
DEATHLOCATIONCODE             330381
NEWVITALSTATUS               1322100
VITALSTATUSDATE              1322100
dtype: int64

In [8]:
df.dtypes

PATIENTID                     int64
SEX                           int64
LINKNUMBER                    int64
ETHNICITY                    object
DEATHCAUSECODE_1A            object
DEATHCAUSECODE_1B            object
DEATHCAUSECODE_1C            object
DEATHCAUSECODE_2             object
DEATHCAUSECODE_UNDERLYING    object
DEATHLOCATIONCODE            object
NEWVITALSTATUS               object
VITALSTATUSDATE              object
dtype: object

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1322100 entries, 0 to 1322099
Data columns (total 12 columns):
PATIENTID                    1322100 non-null int64
SEX                          1322100 non-null int64
LINKNUMBER                   1322100 non-null int64
ETHNICITY                    1192249 non-null object
DEATHCAUSECODE_1A            330280 non-null object
DEATHCAUSECODE_1B            98085 non-null object
DEATHCAUSECODE_1C            18106 non-null object
DEATHCAUSECODE_2             141982 non-null object
DEATHCAUSECODE_UNDERLYING    327910 non-null object
DEATHLOCATIONCODE            330381 non-null object
NEWVITALSTATUS               1322100 non-null object
VITALSTATUSDATE              1322100 non-null object
dtypes: int64(3), object(9)
memory usage: 121.0+ MB
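
The summary shows VITALSTATUSDATE and NEWVITALSTATUS stored as generic `object` columns. One possible follow-up, which this notebook does not perform, is to parse the date column and store the status codes as a categorical. The sketch below assumes the dates all follow the `YYYY-MM-DD` pattern seen in the sample rows and uses a toy frame with made-up rows; `toy` and `converted` are illustrative names.

```python
import pandas as pd

# Toy stand-in for df with made-up rows in the same format as the samples above.
toy = pd.DataFrame({
    'SEX': [2, 2, 1],
    'NEWVITALSTATUS': ['A', 'A', 'D'],
    'VITALSTATUSDATE': ['2017-01-17', '2017-01-14', '2015-04-26'],
})

converted = toy.assign(
    # Parse the text dates into datetime64 values (assumes YYYY-MM-DD throughout).
    VITALSTATUSDATE=pd.to_datetime(toy['VITALSTATUSDATE'], format='%Y-%m-%d'),
    # Store the few distinct status codes as a categorical instead of Python strings.
    NEWVITALSTATUS=toy['NEWVITALSTATUS'].astype('category'),
)
print(converted.dtypes)
```

With a datetime column, comparisons and date arithmetic work directly, and a categorical column with few distinct codes usually needs less memory than an `object` column.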


I am also going to check if there are duplicate entries.

In [10]:
print(f'The original dataset has data for {len(df)} patients')
print(f'After removing duplicate entries, the dataset has data for {len(df.drop_duplicates())} patients')

The original dataset has data for 1322100 patients
After removing duplicate entries, the dataset has data for 1322100 patients


The row count is unchanged, so there are no fully duplicated rows in the dataset.
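
Note that `drop_duplicates()` with no arguments only removes rows that are identical in every column. A patient who appeared twice with different values would not be caught. A stricter check, sketched here as an optional extra step, looks at PATIENTID alone. The toy frame deliberately contains a repeated ID to show the difference; it is made up and says nothing about the real table.

```python
import pandas as pd

# Toy frame with an invented repeated PATIENTID whose rows differ in another column.
toy = pd.DataFrame({
    'PATIENTID': [10000001, 10000002, 10000002],
    'NEWVITALSTATUS': ['A', 'A', 'D'],
})

# Rows identical in every column (what drop_duplicates() would remove).
full_row_dupes = toy.duplicated().sum()

# Rows whose PATIENTID has already appeared, regardless of the other columns.
repeated_ids = toy['PATIENTID'].duplicated().sum()

print(full_row_dupes, repeated_ids, toy['PATIENTID'].is_unique)
```

On the real table, `df['PATIENTID'].is_unique` would confirm whether each patient appears exactly once.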

I am going to save this dataframe as a pickle file and then I will merge it with another table in the next notebook.

In [11]:
df.to_pickle('./avpat.pickle')
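
For reference, a pickle written this way can be loaded back with `pd.read_pickle`, which restores the column types as well as the values. The sketch below shows the round trip on a toy frame in a temporary directory, so it does not touch `./avpat.pickle`; the frame contents are made up.

```python
import os
import tempfile

import pandas as pd

# Toy frame with invented rows standing in for df.
toy = pd.DataFrame({
    'PATIENTID': [10000001, 10000002],
    'NEWVITALSTATUS': ['A', 'D'],
})

# Write the frame to a temporary pickle file and read it straight back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'avpat.pickle')
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares values and dtypes column by column.
print(restored.equals(toy))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.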