In [1]:
import pandas as pd
df = pd.read_csv('ir_data.csv')

We start with only the first 1001 rows, just to get the structure working.

In [124]:
x = df.values  # NumPy array of all rows (replaces df.as_matrix(), which was removed in pandas 1.0)
x = x[:1001,:]
print(x)

[[0 174 22532 ..., 'RT LOWER LOBE PNEUMONIA' 0 1]
 [1 170 22532 ..., 'RT LOWER LOBE PNEUMONIA' 0 1]
 [2 59795 22532 ..., 'RT LOWER LOBE PNEUMONIA' 0 1]
 ..., 
 [998 59810 26175 ..., 'RESPIRATORY DISTRESS' 0 1]
 [999 59811 26175 ..., 'RESPIRATORY DISTRESS' 0 1]
 [1000 105662 26175 ..., 'RESPIRATORY DISTRESS' 0 1]]


In [49]:
list(df)
for num, col_name in enumerate(df):
    print(num, col_name)

0 Unnamed: 0
1 ROW_ID_x
2 SUBJECT_ID
3 HADM_ID
4 CHARTDATE
5 CHARTTIME
6 STORETIME
7 CATEGORY
8 DESCRIPTION
9 CGID
10 ISERROR
11 TEXT
12 ROW_ID_y
13 ADMITTIME
14 DISCHTIME
15 DEATHTIME
16 ADMISSION_TYPE
17 ADMISSION_LOCATION
18 DISCHARGE_LOCATION
19 INSURANCE
20 LANGUAGE
21 RELIGION
22 MARITAL_STATUS
23 ETHNICITY
24 EDREGTIME
25 EDOUTTIME
26 DIAGNOSIS
27 HOSPITAL_EXPIRE_FLAG
28 HAS_CHARTEVENTS_DATA


### SUBJECT_ID, HADM_ID
Identifiers which specify the patient: SUBJECT_ID is unique to a patient and HADM_ID is unique to a patient hospital stay.

### CGID
This is the ID of the caregiver who made the observation; it is not relevant to this analysis.

### STORETIME
This is the time at which the observation was actually recorded in the EHR. We use this time because CHARTTIME is not actually available in the EHR.
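
A minimal sketch of turning STORETIME strings into timestamps, assuming the column holds strings in the `YYYY-MM-DD HH:MM:SS` form printed later in this notebook, with missing values for some rows. The `raw` series is a made-up stand-in for `df['STORETIME']`.

```python
import pandas as pd

# Stand-in for df['STORETIME']; the values are illustrative, including one gap.
raw = pd.Series(['2151-07-18 14:06:00', None, '2151-07-19 05:10:00'])

# errors='coerce' turns anything unparseable into NaT instead of raising.
store_time = pd.to_datetime(raw, errors='coerce')

# Number of notes without a stored time, then the earliest stored time.
print(store_time.isnull().sum())
print(store_time.min())
```

Parsing once up front means later steps can sort and compare times directly instead of testing for the string `"nan"`.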

### CATEGORY, DESCRIPTION
CATEGORY and DESCRIPTION define the type of note recorded. For example, a CATEGORY of ‘Discharge’ indicates that the note is a discharge note, and a DESCRIPTION of ‘Summary’ in conjunction with this indicates that the note is a discharge summary.
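
A minimal sketch of selecting one note type by matching both columns. `select_notes` is a hypothetical helper, and the stand-in frame is made up; the literal values to match depend on the data, and the rows printed at the end of this notebook show `Discharge summary` with `Report`, which is what the sketch uses.

```python
import pandas as pd

def select_notes(frame, category, description):
    """Return rows whose CATEGORY and DESCRIPTION both match (hypothetical helper)."""
    mask = (frame['CATEGORY'] == category) & (frame['DESCRIPTION'] == description)
    return frame[mask]

# Made-up stand-in for df; only the first row uses values seen in this notebook.
demo = pd.DataFrame({
    'CATEGORY': ['Discharge summary', 'Discharge summary', 'Nursing'],
    'DESCRIPTION': ['Report', 'Addendum', 'Report'],
    'TEXT': ['a', 'b', 'c'],
})

summaries = select_notes(demo, 'Discharge summary', 'Report')
print(len(summaries))
```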

### EDREGTIME, EDOUTTIME

Times at which the patient was registered in and discharged from the emergency department.

### HOSPITAL_EXPIRE_FLAG
Whether or not the patient died in the hospital.

### The columns that we consider important for the initial part of the project
* 2 SUBJECT_ID
* 3 HADM_ID
* 6 STORETIME
* 7 CATEGORY
* 8 DESCRIPTION
* 11 TEXT
* 13 ADMITTIME
* 14 DISCHTIME
* 15 DEATHTIME
* 16 ADMISSION_TYPE
* 26 DIAGNOSIS
* 27 HOSPITAL_EXPIRE_FLAG
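
A minimal sketch of narrowing the frame to these columns by name rather than by position. `keep_cols` is a hypothetical name and `demo` is a tiny stand-in frame used only to make the example runnable; in the notebook the same selection would be applied to `df`.

```python
import pandas as pd

# Column names taken from the list above; `keep_cols` is a hypothetical name.
keep_cols = [
    'SUBJECT_ID', 'HADM_ID', 'STORETIME', 'CATEGORY', 'DESCRIPTION', 'TEXT',
    'ADMITTIME', 'DISCHTIME', 'DEATHTIME', 'ADMISSION_TYPE', 'DIAGNOSIS',
    'HOSPITAL_EXPIRE_FLAG',
]

# Stand-in frame with one extra column (CGID) that should be dropped.
demo = pd.DataFrame({name: [None] for name in keep_cols + ['CGID']})

# Selecting by name avoids depending on positional indices such as 2, 3 or 26.
subset = demo[keep_cols]
print(list(subset.columns))
```

Working with names instead of positions also keeps later cells readable: `row['DIAGNOSIS']` says more than `items[26]`.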

In [50]:
pat_dict = {}
# for rows in x:
#     pat_id = rows[2]
#     pat_dict[pat_id] = pat_dict.get(pat_id, 0)+1
#     print(pat_id, pat_dict[pat_id])
count = 0
count2 = 0
# Count rows where ISERROR is set (count2) versus missing (count).
for items in df['ISERROR']:
    if pd.notnull(items):
        count2 += 1
    else:
        count += 1


We check the number of data entries that have chart events and the entries that don't.

In [51]:
count1 = 0
count2 = 0
for items in (df['HAS_CHARTEVENTS_DATA']):
    if items==1:
        count1 += 1
    else:
        count2 += 1
print(count1, count2)

1842222 9122
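
The same tally can be computed without a Python loop. This is a minimal sketch on a made-up stand-in for `df['HAS_CHARTEVENTS_DATA']`; the names `flags`, `with_events` and `without_events` are assumptions for illustration.

```python
import pandas as pd

# Stand-in for df['HAS_CHARTEVENTS_DATA']; the values are made up.
flags = pd.Series([1, 1, 0, 1, 0], name='HAS_CHARTEVENTS_DATA')

# Vectorised equivalent of the loop: rows equal to 1 versus everything else.
with_events = int((flags == 1).sum())
without_events = int(len(flags) - with_events)
print(with_events, without_events)
```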


In [52]:
count = 1
patid_set = set()
patid = {}
for items in x:
    patid_set.add(items[2])
    patid[items[2]] = patid.get(items[2], 0)+1

There is just one case where there is **more than one entry for the same (patient_id, hospital_id) pair.** We ignore that case for now.

But the same patient id can have multiple stays. How do we take care of that?
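
One possible way to handle this, sketched under assumptions: key the structure first by patient id and then by stay id, so a second stay for a known patient adds a new inner entry instead of being dropped. The `rows` tuples are made up, and the layout `patients[pat_id][stay_id]` is only one option, not necessarily the structure the project ends up using.

```python
from collections import OrderedDict

# Made-up rows in the form (SUBJECT_ID, HADM_ID, STORETIME, DIAGNOSIS).
rows = [
    (1, 100.0, '2151-07-18 14:06:00', 'PNEUMONIA'),
    (1, 100.0, '2151-07-18 19:00:00', 'PNEUMONIA'),
    (1, 101.0, '2152-01-02 08:00:00', 'COLITIS'),
    (2, 200.0, None, 'STAB WOUND'),
]

# patients[pat_id][stay_id] -> {'diagnosis': ..., 'times': [...]}
patients = OrderedDict()
for pat_id, stay_id, store_time, diagnosis in rows:
    # Create the patient on first sight, then the stay on first sight.
    stays = patients.setdefault(pat_id, OrderedDict())
    stay = stays.setdefault(stay_id, {'diagnosis': diagnosis, 'times': []})
    # Every row for the same (patient, stay) pair adds its time to that stay.
    stay['times'].append(store_time)

for pat_id, stays in patients.items():
    print(pat_id, list(stays.keys()))
```

With this layout, repeated rows for the same patient and stay accumulate in one place, and a patient with several stays simply has several inner keys.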

In [53]:
# Scratch test of appending to a list nested inside a dict:
# data = [ { 'patient_id': None, 'b':[2, 4], 'c':3.0 } ]
# data[0]['b'].append(3)
# print(repr(data))


data = []
pat_set = set()
for entry in x:
    pat_id = entry[2]
    stay_id = entry[3]
    diagnosis = entry[26]
    store_time = entry[6]
    # Keep only the first row seen for each patient; later rows and stays are skipped here.
    if pat_id not in pat_set:
        pat_set.add(pat_id)
        item = {
            'patient':{
                'pat_id': pat_id,
                'hospital':{
                    'stay_id': stay_id,
                    'diagnosis': diagnosis
                    
                }
            }
        }
        data.append(item)
for items in data:
    print(items)

{'patient': {'pat_id': 22532, 'hospital': {'stay_id': 167853.0, 'diagnosis': 'RT LOWER LOBE PNEUMONIA'}}}
{'patient': {'pat_id': 13702, 'hospital': {'stay_id': 107527.0, 'diagnosis': 'CHRONIC OBSTRUCTIVE PULMONARY DISEASE'}}}
{'patient': {'pat_id': 26880, 'hospital': {'stay_id': 135453.0, 'diagnosis': 'S/P FALL;TELEMETRY'}}}
{'patient': {'pat_id': 53181, 'hospital': {'stay_id': 170490.0, 'diagnosis': 'RIGHT TEMPORAL MENIGIOMA/SDA'}}}
{'patient': {'pat_id': 20646, 'hospital': {'stay_id': 134727.0, 'diagnosis': 'PNEUMONIA;HYPOXIA'}}}
{'patient': {'pat_id': 42130, 'hospital': {'stay_id': 114236.0, 'diagnosis': 'LEFT SPHENOID MENENGIOMA/SDA'}}}
{'patient': {'pat_id': 56174, 'hospital': {'stay_id': 163469.0, 'diagnosis': 'BRAIN MASS/SDA'}}}
{'patient': {'pat_id': 28063, 'hospital': {'stay_id': 121936.0, 'diagnosis': 'CONGESTIVE HEART FAILURE'}}}
{'patient': {'pat_id': 1136, 'hospital': {'stay_id': 139574.0, 'diagnosis': 'COLITIS'}}}
{'patient': {'pat_id': 5350, 'hospital': {'stay_id': 16968

In [54]:
# for items in x:
#     if str(items[6])!="nan":
#         print(items[6])

In [55]:
check_set = set()
for items in x:
    print(items[26])

RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
RT LOWER LOBE PNEUMONIA
CHRONIC OBSTRUCTIVE PULMONARY DISEASE
CHRONIC OBSTRUCTIVE PULMONARY DISEASE
CHRONIC OBSTRUCTIVE PULMONARY DISEASE
CHRONIC OBSTRUCTIVE PULMONARY DISEASE
CHRONIC OBSTRUCTIVE PULMONARY DISEASE
CHRONIC OBSTRUCTIVE PULMONARY DISEASE
CHRO

STAB WOUND
STAB WOUND
TAMPONDAE
TAMPONDAE
TAMPONDAE
TAMPONDAE
TAMPONDAE
TAMPONDAE
TAMPONDAE
TAMPONDAE
TAMPONDAE
TAMPONDAE
TAMPONDAE
TAMPONDAE
TAMPONDAE
TAMPONDAE
TAMPONDAE
METASTATIC MELANOMA;BRAIN METASTASIS
METASTATIC MELANOMA;BRAIN METASTASIS
METASTATIC MELANOMA;BRAIN METASTASIS
METASTATIC MELANOMA;BRAIN METASTASIS
METASTATIC MELANOMA;BRAIN METASTASIS
METASTATIC MELANOMA;BRAIN METASTASIS
METASTATIC MELANOMA;BRAIN METASTASIS
METASTATIC MELANOMA;BRAIN METASTASIS
METASTATIC MELANOMA;BRAIN METASTASIS
METASTATIC MELANOMA;BRAIN METASTASIS
METASTATIC MELANOMA;BRAIN METASTASIS
METASTATIC MELANOMA;BRAIN METASTASIS
METASTATIC MELANOMA;BRAIN METASTASIS
METASTATIC MELANOMA;BRAIN METASTASIS
METASTATIC MELANOMA;BRAIN METASTASIS
METASTATIC MELANOMA;BRAIN METASTASIS
METASTATIC MELANOMA;BRAIN METASTASIS
METASTATIC MELANOMA;BRAIN METASTASIS
METASTATIC MELANOMA;BRAIN METASTASIS
METASTATIC MELANOMA;BRAIN METASTASIS
METASTATIC MELANOMA;BRAIN METASTASIS
CONGESTIVE HEART FAILURE
CONGESTIVE HEART FAILURE
C

In [41]:
data = [ { 'a':'A', 'b':[2, 4], 'c':3.0 } ]
data[0]['b'].append(3)
print(repr(data))

[{'a': 'A', 'b': [2, 4, 3], 'c': 3.0}]


In [161]:
# data = [{"pat_id":None}]
# data[0]['pat_id'] = 2
# data.append({'pat_id':3})
# l = 5
# desc = [items['pat_id'] for items in data]
# if l not in desc:
#     data.append({'pat_id':l})
# # print(data)
pat_set = set()
final_structure = []
for entry in x:
    pat_id = entry[2]
    stay_id = entry[3]
    diagnosis = entry[26]
    store_time = entry[6]
    print(store_time)
    if pat_id not in pat_set:
        # First time we see this patient: build a new record for them.
        pat_set.add(pat_id)
        new_struct = {'pat_id':pat_id, 'hospital':[]}

        # diagnosis is a list because the same stay might have different diagnoses
        stay_struct = {'stay_id':stay_id, 'diagnosis':[diagnosis], 'time':[store_time]}
        new_struct['hospital'].append(stay_struct)
        final_structure.append(new_struct)
final_arr = [data['pat_id'] for data in final_structure]
for items in final_structure:
    print(items)
    

nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
2151-07-18 14:06:00
2151-07-18 19:00:00
2151-07-19 05:10:00
2151-07-19 16:38:00
2151-07-17 17:42:00
2151-07-18 05:41:00
2151-07-16 15:36:00
2151-07-17 02:18:00
2151-07-17 05:49:00
2151-07-17 05:49:00
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
2118-06-03 15:02:00
2118-06-03 17:27:00
2118-06-10 18:31:00
2118-06-07 05:47:00
2118-06-07 06:27:00
2118-06-11 05:45:00
2118-06-03 05:50:00
2118-06-07 15:40:00
2118-06-07 18:34:00
2118-06-07 19:03:00
2118-06-03 06:40:00
2118-06-06 18:28:00
2118-06-06 18:54:00
2118-06-07 03:13:00
2118-06-09 05:58:00
2118-06-09 18:38:00
2118-06-10 04:26:00
2118-06-10 04:56:00
2118-06-05 16:37:00
2118-06-05 17:06:00
2118-06-06 06:02:00
2118-06-06 05:58:00
2118-06-04 17:46:00
2118-06-05 05:43:00
2118-06-05 06:43:00
2118-06-08 19:24:00
2118-06-08 18:43:00
2118-06-09 05:59:00
2118-06-08 03:44:00
2118-06-08 05:59:00
2118-06-03 19:28:00
2118-06-04 05:18:00
2118-06-04 06:35:00


nan
2174-02-13 02:50:21
2174-02-13 02:52:40
2174-02-13 06:22:12
2174-02-13 05:44:32
2174-02-12 19:43:31
2174-02-13 14:27:48
2174-02-13 14:29:45
2174-02-13 14:34:40
2174-02-12 19:14:21
2174-02-12 18:28:36
2174-02-13 10:57:13
2174-02-13 10:58:06
2174-02-13 11:47:07
2174-02-12 18:23:31
2174-02-13 07:41:27
2174-02-13 07:43:39
2174-02-13 07:34:28
nan
nan
nan
nan
nan
nan
nan
2145-12-01 03:47:32
2145-12-01 05:37:36
2145-12-01 05:20:01
2145-12-01 04:54:13
2145-12-01 04:44:48
2145-12-01 04:46:21
2145-12-01 10:17:19
2145-12-01 11:08:54
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
2194-05-09 18:57:00
2194-05-08 18:03:00
2194-05-09 06:16:00
nan
nan
nan
nan
2194-08-15 18:06:00
2194-08-15 23:46:00
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
2147-06-27 16:47:00
2147-06-30 06:35:00
2147-06-30 16:09:00
2147-06-30 19:39:00
2147-07-01 07:14:00
2147-06-24 17:22:00
2147-06-25 06:58:00
2147-06-25 17:58:00
2147-06-27 05:45:00
2147-06-24 06:54:00
2147-06-26 13:25:00
2147

print(x)

In [177]:
for items in x[:2,]:
    print(items[2], items[3],items[6], items[7], items[8], items[26], items[11])

22532 167853.0 nan Discharge summary Report RT LOWER LOBE PNEUMONIA Admission Date:  [**2151-7-16**]       Discharge Date:  [**2151-8-4**]


Service:
ADDENDUM:

RADIOLOGIC STUDIES:  Radiologic studies also included a chest
CT, which confirmed cavitary lesions in the left lung apex
consistent with infectious process/tuberculosis.  This also
moderate-sized left pleural effusion.

HEAD CT:  Head CT showed no intracranial hemorrhage or mass
effect, but old infarction consistent with past medical
history.

ABDOMINAL CT:  Abdominal CT showed lesions of
T10 and sacrum most likely secondary to osteoporosis. These can
be followed by repeat imaging as an outpatient.



                            [**First Name8 (NamePattern2) **] [**First Name4 (NamePattern1) 1775**] [**Last Name (NamePattern1) **], M.D.  [**MD Number(1) 1776**]

Dictated By:[**Hospital 1807**]
MEDQUIST36

D:  [**2151-8-5**]  12:11
T:  [**2151-8-5**]  12:21
JOB#:  [**Job Number 1808**]

22532 167853.0 nan Discharge summary Repor