# SIM_SACT_TUMOUR

I am going to open the cancer registration table "SIM_SACT_TUMOUR". This table contains a patient ID, a tumour ID (not the same as the one in the "SIM_AV_TUMOUR" table), the primary diagnosis site, the morphology code and the consultant specialty code.

First, I am going to import the dependencies that are needed and open the table "SIM_SACT_TUMOUR". Then I am going to open the pickle file that I saved in the previous notebook (which contains the merged tables "SIM_AV_PATIENT", "SIM_AV_TUMOUR" and "SIM_SACT_PATIENT") and merge the two. The data frame loaded from the pickle file is called "df" and the table "SIM_SACT_TUMOUR" is called "sim_sact_tumour".

In [1]:
import pandas as pd
import numpy as np
from zipfile import ZipFile

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [3]:
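# Read the CSV straight from the zip archive; the context managers close both handles afterwards
with ZipFile('../Data/sim_sact_tumour.zip') as zf, zf.open('sim_sact_tumour.csv') as csv:
    sim_sact_tumour = pd.read_csv(csv)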
sim_sact_tumour.head()

Unnamed: 0,MERGED_TUMOUR_ID,MERGED_PATIENT_ID,CONSULTANT_SPECIALITY_CODE,PRIMARY_DIAGNOSIS,MORPHOLOGY_CLEAN
0,10000001,10000235,101.0,C61,81403.0
1,10000002,10000315,101.0,C679,81403.0
2,10000003,10000337,100.0,C500,
3,10000004,10000480,303.0,C829,
4,10000005,10000533,823.0,D473,


Now I am going to explore some characteristics of the table and check if there are duplicate entries.

In [4]:
sim_sact_tumour.shape

(299727, 5)

In [5]:
print(f'The original dataset has {len(sim_sact_tumour)} rows')
print(f'After removing duplicate entries, the dataset has {len(sim_sact_tumour.drop_duplicates())} rows')

The original dataset has 299727 rows
After removing duplicate entries, the dataset has 299727 rows
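
The check above compares whole rows, so it only detects rows that are identical in every column. As a minimal sketch (on a toy data frame with made-up values, not the real table), two complementary checks are whether any tumour ID is repeated and how many distinct patients the rows belong to. The name `toy` and its contents are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for sim_sact_tumour (values are made up for illustration).
toy = pd.DataFrame({
    'MERGED_TUMOUR_ID': [1, 2, 3, 4],
    'MERGED_PATIENT_ID': [10, 10, 11, 12],
})

# Rows that are identical to an earlier row in every column.
full_row_duplicates = int(toy.duplicated().sum())

# Tumour IDs that already appeared in an earlier row.
repeated_tumour_ids = int(toy['MERGED_TUMOUR_ID'].duplicated().sum())

# Each row is one tumour record, so rows and patients can differ.
n_rows = len(toy)
n_patients = toy['MERGED_PATIENT_ID'].nunique()

print(f'{full_row_duplicates} fully duplicated rows')
print(f'{repeated_tumour_ids} repeated tumour IDs')
print(f'{n_rows} rows for {n_patients} distinct patients')
```

In the toy data one patient has two tumour records, so the number of rows is larger than the number of patients even though no row is duplicated.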


As we can see, there are no duplicate rows. Now I am going to select the most relevant variables, which here are the two ID columns.

In [6]:
sim_sact_tumour.columns

Index(['MERGED_TUMOUR_ID', 'MERGED_PATIENT_ID', 'CONSULTANT_SPECIALITY_CODE',
       'PRIMARY_DIAGNOSIS', 'MORPHOLOGY_CLEAN'],
      dtype='object')

In [7]:
columns_selected = ['MERGED_TUMOUR_ID', 'MERGED_PATIENT_ID']
sim_sact_tumour = sim_sact_tumour[columns_selected]
sim_sact_tumour.head()

Unnamed: 0,MERGED_TUMOUR_ID,MERGED_PATIENT_ID
0,10000001,10000235
1,10000002,10000315
2,10000003,10000337
3,10000004,10000480
4,10000005,10000533


Now I am going to open the pickle file I saved in the previous notebook and merge it with the table "SIM_SACT_TUMOUR". I use a left merge on MERGED_PATIENT_ID, so every row of "df" is kept.

In [8]:
df = pd.read_pickle('./avpat_avtum_sactpat.pickle')
df.head()

Unnamed: 0,PATIENTID,SEX,LINKNUMBER,ETHNICITY,NEWVITALSTATUS,NUMBER_TUMOURS,C180,C181,C182,C183,C184,C185,C186,C187,C188,BEH_BENIGN,BEH_MALIG,BEH_MICINV,BEH_INSITU,BEH_UNCERT,BEH_MALIG_METAS,BEH_MALIG_UNCERT,T,N,M,STAGE,GRADE_2,AGE_MEDIAN,L0801,L1001,L1701,L0201,L0401,L1201,L0301,L0901,CURATIVE_TREAT,NON_CURATIVE_TREAT,NO_ACTIVE_TREAT,ECOG,DEPR,CANCER_YEARS_MEDIAN,DIAG_TO_SURG_DAYS_MEDIAN,DIAGNOSISDATEBEST,MERGED_PATIENT_ID,LINK_NUMBER
0,10001000,F,810001000,White British,A,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,3.0,0.0,0.0,3.0,2.0,79.0,0,1,0,0,0,0,0,0,0,0,0,2.0,4.0,3.857711,120.0,2013-03-07,10001000.0,810001000.0
1,10001128,F,810001128,,A,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,3.0,0.0,0.0,3.0,2.0,86.0,0,0,1,0,0,0,0,0,0,0,0,0.0,1.0,2.234132,0.0,2014-10-23,,
2,10001482,F,810001482,,A,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,4.0,0.0,0.0,3.0,2.0,77.0,1,0,0,0,0,0,0,0,0,1,0,0.0,3.0,3.022649,0.0,2014-01-08,10001482.0,810001482.0
3,10001901,M,810001901,,A,1,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,2.0,2.0,0.0,2.0,2.0,62.0,0,1,0,0,0,0,0,0,1,0,0,0.0,1.0,1.325147,0.0,2015-09-20,,
4,10002351,F,810002351,,A,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1.0,1.0,0.0,1.0,2.0,63.0,0,1,0,0,0,0,0,0,0,0,0,0.0,3.0,1.158135,29.0,2015-11-20,,


In [9]:
df.shape

(34062, 46)

In [10]:
df = df.merge(sim_sact_tumour, on='MERGED_PATIENT_ID', how='left')
df.head()

Unnamed: 0,PATIENTID,SEX,LINKNUMBER,ETHNICITY,NEWVITALSTATUS,NUMBER_TUMOURS,C180,C181,C182,C183,C184,C185,C186,C187,C188,BEH_BENIGN,BEH_MALIG,BEH_MICINV,BEH_INSITU,BEH_UNCERT,BEH_MALIG_METAS,BEH_MALIG_UNCERT,T,N,M,STAGE,GRADE_2,AGE_MEDIAN,L0801,L1001,L1701,L0201,L0401,L1201,L0301,L0901,CURATIVE_TREAT,NON_CURATIVE_TREAT,NO_ACTIVE_TREAT,ECOG,DEPR,CANCER_YEARS_MEDIAN,DIAG_TO_SURG_DAYS_MEDIAN,DIAGNOSISDATEBEST,MERGED_PATIENT_ID,LINK_NUMBER,MERGED_TUMOUR_ID
0,10001000,F,810001000,White British,A,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,3.0,0.0,0.0,3.0,2.0,79.0,0,1,0,0,0,0,0,0,0,0,0,2.0,4.0,3.857711,120.0,2013-03-07,10001000.0,810001000.0,10002223.0
1,10001000,F,810001000,White British,A,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,3.0,0.0,0.0,3.0,2.0,79.0,0,1,0,0,0,0,0,0,0,0,0,2.0,4.0,3.857711,120.0,2013-03-07,10001000.0,810001000.0,10005351.0
2,10001128,F,810001128,,A,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,3.0,0.0,0.0,3.0,2.0,86.0,0,0,1,0,0,0,0,0,0,0,0,0.0,1.0,2.234132,0.0,2014-10-23,,,
3,10001482,F,810001482,,A,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,4.0,0.0,0.0,3.0,2.0,77.0,1,0,0,0,0,0,0,0,0,1,0,0.0,3.0,3.022649,0.0,2014-01-08,10001482.0,810001482.0,10005354.0
4,10001482,F,810001482,,A,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,4.0,0.0,0.0,3.0,2.0,77.0,1,0,0,0,0,0,0,0,0,1,0,0.0,3.0,3.022649,0.0,2014-01-08,10001482.0,810001482.0,10010150.0


In [11]:
df.shape

(38485, 47)
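
The data frame grew from 34062 to 38485 rows because a left merge repeats a left row once for every matching row on the right, and some patients have more than one tumour ID (patient 10001000 appears twice in the output above). The following is a minimal sketch of that behaviour on toy data. The frames `patients` and `tumours` and their values are made up; `validate` and `indicator` are optional arguments of `DataFrame.merge` that I add here only to make the behaviour visible.

```python
import pandas as pd

# Toy stand-ins for df and sim_sact_tumour (made-up values).
patients = pd.DataFrame({
    'MERGED_PATIENT_ID': [10.0, 11.0, float('nan')],  # NaN: missing key, as in some rows of df
    'SEX': ['F', 'M', 'F'],
})
tumours = pd.DataFrame({
    'MERGED_TUMOUR_ID': [1, 2, 3],
    'MERGED_PATIENT_ID': [10.0, 10.0, 11.0],  # patient 10 has two tumour IDs
})

merged = patients.merge(
    tumours,
    on='MERGED_PATIENT_ID',
    how='left',
    validate='one_to_many',  # raises MergeError if a key is repeated on the left
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Extra rows come from patients with more than one tumour ID.
rows_added = len(merged) - len(patients)

# Left rows that found no tumour record keep NaN in MERGED_TUMOUR_ID.
unmatched = int((merged['_merge'] == 'left_only').sum())

# Number of tumour IDs attached to each patient after the merge.
tumours_per_patient = merged.groupby('MERGED_PATIENT_ID')['MERGED_TUMOUR_ID'].count()

print(merged)
print(f'{rows_added} row(s) added, {unmatched} row(s) without a tumour record')
```

After this kind of merge the data frame has one row per patient and tumour ID rather than one row per patient, which matters for any later step that counts rows as patients.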

In [12]:
df.isna().sum()

PATIENTID                       0
SEX                             0
LINKNUMBER                      0
ETHNICITY                    2249
NEWVITALSTATUS                  0
NUMBER_TUMOURS                  0
C180                            0
C181                            0
C182                            0
C183                            0
C184                            0
C185                            0
C186                            0
C187                            0
C188                            0
BEH_BENIGN                      0
BEH_MALIG                       0
BEH_MICINV                      0
BEH_INSITU                      0
BEH_UNCERT                      0
BEH_MALIG_METAS                 0
BEH_MALIG_UNCERT                0
T                               0
N                               0
M                               0
STAGE                           0
GRADE_2                         0
AGE_MEDIAN                      0
L0801                           0
L1001         

I am going to save this data frame in a new pickle file so that I can merge it with another table in the next notebook.

In [13]:
df.to_pickle('./avpat_avtum_sactpat_sacttum.pickle')
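
As a minimal sketch (not part of the original notebook), the saved file can be checked by reloading it and comparing it with the data frame in memory. The toy frame and the temporary path below are assumptions for illustration; with the real data the same comparison would use `df` and the pickle path above.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the merged data frame (made-up values).
toy = pd.DataFrame({
    'MERGED_PATIENT_ID': [10.0, 11.0, float('nan')],
    'MERGED_TUMOUR_ID': [1.0, 3.0, float('nan')],
})

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.pickle')
    toy.to_pickle(path)
    reloaded = pd.read_pickle(path)

# Raises AssertionError if values, dtypes, index or columns differ.
pd.testing.assert_frame_equal(toy, reloaded)
print(f'Round trip OK, shape {reloaded.shape}')
```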