---
## Data preprocessing: upper GI endoscopy reports
---
### Content
---
- **[Libraries to use](#Libraries_to_use)**

- **[Loading the dataset](#Loading_dataset)**

- **[Data extraction and cleaning](#Data_extraction_cleaning)**
 ---

### Libraries to use <a id="Libraries_to_use"> </a>

In [1]:
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer, KNNImputer

### Loading the dataset <a id="Loading_dataset"> </a>

In [2]:
df = pd.read_csv("../data/OGD_FakeSet.csv")

In [3]:
df.head()

Unnamed: 0,text
0,"Hospital Number: O8583042 , Hospital: Random N..."
1,"Hospital Number: H2587908 , Hospital: Random N..."
2,"Hospital Number: P9907275 , Hospital: Random N..."
3,"Hospital Number: D9990587 , Hospital: Random N..."
4,"Hospital Number: K9010769 , Hospital: Random N..."


In [4]:
df.text[0]

"Hospital Number: O8583042 , Hospital: Random NHS Foundation Trust , DOB:  1903-05-10 , General Practitioner: Dr. Lutter, Carrie Nkaus Xwb  , Date of procedure:  2004-11-14Endoscopist: Dr. al-Rashid, Muzna  , 2nd Endoscopist: Dr. Labeaux, Destiny  , Medications: Fentanyl  25mcg  , Midazolam  6mg  , Instrument:  FG2  , Extent of Exam:  D1  , Procedure Performed: Gastroscopy (OGD)  , INDICATIONS FOR PROCEDURE: Biopsies for H pylori FINDINGS: The patient has Barrett's oesophagus. It is a long segment. No loss of aceto-whitening  was also seen. Short segment only. There is a polyp in the third part of the duodenum which is stalked with an abnormal pit pattern. The mucosa surrounding the nodules is inflamed and edematous, with a granular appearance.The polyp was marked with a tattoo to aid in future surveillance endoscopy.. NA RECOMMENDATION: The patient should be informed of the diagnosis and the importance of adhering to the recommended treatment plan.  The patient may be prescribed medic

### Data extraction and cleaning <a id="Data_extraction_cleaning"> </a>

In [5]:
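def regex_hosp(string):
    # Match from the "Hospital:" label to the end of the report.
    hospital_reg = r"\.*Hospital:.*"
    line = re.findall(hospital_reg, string)[0]
    # Treat commas as separators too, then take the value after the first colon.
    return line.replace(',',':').split(":")[1]

df["foundation_trust"] = df['text'].apply(regex_hosp)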

In [6]:
def regex_hosp_num(string):
    hospital_reg = r"\.*Hospital Number.*"
    line= re.findall(hospital_reg, string)[0]
    return  line.replace(',',':').split(":")[1]
df["hospital_num"] = df['text'].apply(regex_hosp_num)

In [7]:
def regex_GP(string):
    hospital_reg = r"\.*General Practitioner:.*"
    line = re.findall(hospital_reg, string)[0]
    retrn_string= line.replace(',',':').split(":")[1]
    if retrn_string[-1:] == "\r":
        return retrn_string[:-1]
    else:
        return retrn_string
df["gp"] = df['text'].apply(regex_GP)

In [8]:
def regex_DOB(string):
    hospital_reg = r"\.*DOB:.*"
    line =  re.findall(hospital_reg, string)[0]
    retrn_string= line.replace(',',':').split(":")[1]
    if retrn_string[-1:] == "\r":
        return retrn_string[:-1]
    else:
        return retrn_string
df["DOB"] = df['text'].apply(regex_DOB)
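The extractors above all follow one pattern: find a label, then take the text up to the next comma. A minimal sketch of a single reusable helper follows, assuming every field value ends at the next comma. `extract_field` and `sample` are hypothetical names that are not part of the original notebook; the sample string is the start of the first report.

```python
import re

# Hypothetical helper, not part of the original notebook.
def extract_field(report, label):
    """Return the text after `label:` up to the next comma, stripped."""
    # re.escape guards against labels that contain regex metacharacters.
    match = re.search(re.escape(label) + r":\s*([^,]*)", report)
    # Return None when the label is absent instead of raising IndexError.
    return match.group(1).strip() if match else None

# Start of the first report in the dataset.
sample = (
    "Hospital Number: O8583042 , Hospital: Random NHS Foundation Trust , "
    "DOB:  1903-05-10 , General Practitioner: Dr. Lutter, Carrie Nkaus Xwb  , "
)

hospital_num = extract_field(sample, "Hospital Number")
trust = extract_field(sample, "Hospital")
dob = extract_field(sample, "DOB")
gp = extract_field(sample, "General Practitioner")
missing = extract_field(sample, "Instrument")  # label not in this excerpt
```

Applied to the DataFrame, this could stand in for the individual functions, for example `df["DOB"] = df["text"].apply(lambda t: extract_field(t, "DOB"))`. Unlike the originals, it strips surrounding whitespace, so the resulting values would differ from the padded strings the notebook produces.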

In [9]:
def regex_procedure_date(string):
    hospital_reg = r"\.*Date of procedure:.*"
    line = re.findall(hospital_reg, string)[0]
    # The date runs straight into the next label ("2004-11-14Endoscopist"),
    # so drop the trailing 11 characters of "Endoscopist".
    retrn_string = line.split(":")[1][:-11]
    if retrn_string[-1:] == "\r":
        return retrn_string[:-1]
    else:
        return retrn_string
df["procedure_date"] = df['text'].apply(regex_procedure_date)

In [10]:
df.head()

Unnamed: 0,text,foundation_trust,hospital_num,gp,DOB,procedure_date
0,"Hospital Number: O8583042 , Hospital: Random N...",Random NHS Foundation Trust,O8583042,Dr. Lutter,1903-05-10,2004-11-14
1,"Hospital Number: H2587908 , Hospital: Random N...",Random NHS Foundation Trust,H2587908,Dr. Akhmedov,1933-11-14,2015-09-02
2,"Hospital Number: P9907275 , Hospital: Random N...",Random NHS Foundation Trust,P9907275,Dr. al-Tabet,1967-03-08,2002-01-13
3,"Hospital Number: D9990587 , Hospital: Random N...",Random NHS Foundation Trust,D9990587,Dr. Garcia,1973-11-22,2007-04-20
4,"Hospital Number: K9010769 , Hospital: Random N...",Random NHS Foundation Trust,K9010769,Dr. el-Kaiser,1968-04-13,2016-01-12
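`DOB` and `procedure_date` are still plain strings at this point. As a hedged sketch of one way to use them, the snippet below parses two ISO dates with the standard library and derives the age in whole years at the time of the procedure. `age_at_procedure` is a hypothetical helper that is not part of the original notebook, and it assumes both values are in `YYYY-MM-DD` form, possibly padded with spaces.

```python
from datetime import date

# Hypothetical helper, not part of the original notebook.
def age_at_procedure(dob_text, procedure_text):
    """Return whole years between two ISO dates (YYYY-MM-DD)."""
    dob = date.fromisoformat(dob_text.strip())
    procedure = date.fromisoformat(procedure_text.strip())
    # Subtract one year if the birthday has not yet occurred in the procedure year.
    before_birthday = (procedure.month, procedure.day) < (dob.month, dob.day)
    return procedure.year - dob.year - int(before_birthday)

# Values taken from rows 0 and 1 of the table above.
age_row0 = age_at_procedure("  1903-05-10 ", "  2004-11-14")
age_row1 = age_at_procedure("1933-11-14", "2015-09-02")
```

The same conversion could be done column-wise with `pd.to_datetime`; the standard-library version is shown only to keep the sketch self-contained.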


In [11]:
def regex_endoscopist(string):
    hospital_reg = r"\.*Endoscopist:.*"
    line = re.findall(hospital_reg, string)[0]
    retrn_string =  line.replace(',',':').split(":")[1]
    if retrn_string[-1:] == "\r":
        return retrn_string[:-1]
    else:
        return retrn_string
df["endoscopist"] = df['text'].apply(regex_endoscopist)

In [12]:
def regex_2nd_endoscopist(string):
    hospital_reg = r"\.*2nd Endoscopist:.*"
    line = re.findall(hospital_reg, string)[0]
    retrn_string = line.replace(',',':').split(":")[1]
    # Drop a trailing carriage return, as the other extractors do.
    if retrn_string[-1:] == "\r":
        return retrn_string[:-1]
    else:
        return retrn_string
df["second_endoscopist"] = df['text'].apply(regex_2nd_endoscopist)

In [13]:
def regex_medication(string):
    # Fentanyl dose: digits, an optional decimal part, then "mcg" (e.g. 25mcg, 12.5mcg).
    hospital_reg = r"\d+(?:\.\d+)?mcg"
    retrn_string = re.findall(hospital_reg, string)[0]
    # Strip the 3-character "mcg" suffix before converting to a number.
    return float(retrn_string[:-3])
df["medications_fentynl"] = df['text'].apply(regex_medication)

In [14]:
def regex_midazolam(string):
    hospital_reg = r"\.*Midazolam.*"
    line = re.findall(hospital_reg, string)[0]
    retrn_string =  line.split()[1]
    if retrn_string[-1:] == "\r":
        return int(retrn_string[:-3])
    else:
        return int(retrn_string[:-2])
df["midazolam"] = df['text'].apply(regex_midazolam)
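The two dose extractors above differ only in drug name and unit. A minimal sketch of a shared parser follows, assuming each dose is written as the drug name, whitespace, a number, and the unit. `parse_dose` is a hypothetical name that is not part of the original notebook; the sample string is an excerpt of the first report.

```python
import re

# Hypothetical helper, not part of the original notebook.
def parse_dose(report, drug, unit):
    """Return the numeric dose of `drug` given in `unit`, or None if absent."""
    # Capture an integer or decimal number between the drug name and its unit.
    pattern = re.escape(drug) + r"\s+(\d+(?:\.\d+)?)\s*" + re.escape(unit)
    match = re.search(pattern, report)
    return float(match.group(1)) if match else None

# Excerpt of the first report in the dataset.
sample = "Medications: Fentanyl  25mcg  , Midazolam  6mg  , Instrument:  FG2  ,"

fentanyl = parse_dose(sample, "Fentanyl", "mcg")
midazolam = parse_dose(sample, "Midazolam", "mg")
propofol = parse_dose(sample, "Propofol", "mg")  # drug not in the report
```

Tying the number to the drug name avoids picking up an unrelated `mcg` value elsewhere in the report. Returning `None` for a missing drug lets the column hold a missing value rather than stopping the `apply` call with an exception.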

In [15]:
def regex_instrument(string):
    hospital_reg = r"\.*Instrument.*"
    line = re.findall(hospital_reg, string)[0]
    retrn_string =  line.replace(',',':').split(":")[1]
    if retrn_string[-1:] == "\r":
        return retrn_string[:-1]
    else:
        return retrn_string
df["instrument"] = df['text'].apply(regex_instrument)

In [16]:
def regex_extent(string):
    hospital_reg = r"\.*Extent of Exam:.*"
    line = re.findall(hospital_reg, string)[0]
    retrn_string =  line.replace(',',':').split(":")[1]
    if retrn_string[-1:] == "\r":
        return retrn_string[:-1]
    else:
        return retrn_string
df["extent_of_exam"] = df['text'].apply(regex_extent)

In [17]:
def regex_indications(string):
    hospital_reg = r"\.*INDICATIONS FOR PROCEDURE:.*"
    line = re.findall(hospital_reg, string)[0]
    retrn_string =  line.replace(',',':').split(":")[1]
    if retrn_string[-1:] == "\r":
        retrn_string= retrn_string[:-1]
    if retrn_string[-8:] == "FINDINGS":
        return retrn_string[:-8]
    else:
        return retrn_string
    
df["indications"] = df['text'].apply(regex_indications)

In [18]:
df.head()

Unnamed: 0,text,foundation_trust,hospital_num,gp,DOB,procedure_date,endoscopist,second_endoscopist,medications_fentynl,midazolam,instrument,extent_of_exam,indications
0,"Hospital Number: O8583042 , Hospital: Random N...",Random NHS Foundation Trust,O8583042,Dr. Lutter,1903-05-10,2004-11-14,Dr. al-Rashid,Dr. Labeaux,25.0,6,FG2,D1,Biopsies for H pylori
1,"Hospital Number: H2587908 , Hospital: Random N...",Random NHS Foundation Trust,H2587908,Dr. Akhmedov,1933-11-14,2015-09-02,Dr. Garnier,Dr. Carter,25.0,1,FG5,D1,Known coeliac ch diarrhoea.Myelofibrosis on r...
2,"Hospital Number: P9907275 , Hospital: Random N...",Random NHS Foundation Trust,P9907275,Dr. al-Tabet,1967-03-08,2002-01-13,Dr. Stearns,Dr. Geist,50.0,1,FG3,Failed intubation,Known coeliac disease vomitting
3,"Hospital Number: D9990587 , Hospital: Random N...",Random NHS Foundation Trust,D9990587,Dr. Garcia,1973-11-22,2007-04-20,Dr. Ali,Dr. Carter,150.0,5,FG4,Pylorus,Reflux-like Symptoms/Atypical Chest Pain
4,"Hospital Number: K9010769 , Hospital: Random N...",Random NHS Foundation Trust,K9010769,Dr. el-Kaiser,1968-04-13,2016-01-12,Dr. Presta,Dr. Labeaux,125.0,7,FG3,Pylorus,Dysphagia/Odynophagia post oesophagectomy


In [19]:
def regex_procedure(string):
    hospital_reg = r"\.*Procedure Performed:.*"
    line = re.findall(hospital_reg, string)[0]
    retrn_string = line.replace(',',':').split(":")[1]
    # Drop a trailing carriage return if present; surrounding spaces are kept.
    if retrn_string[-1:] == "\r":
        return retrn_string[:-1]
    else:
        return retrn_string

df["procedure_performed"] = df['text'].apply(regex_procedure)

In [20]:
regex_procedure(df.text[3])

' Gastroscopy (OGD)  '

In [22]:
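def regex_findings(string):
    hospital_reg = r"\.*FINDINGS:.*"
    # Skip the 10-character "FINDINGS: " prefix; the rest of the report
    # (including any RECOMMENDATION and FOLLOW UP text) is kept.
    line = re.findall(hospital_reg, string)[0][10:]
    return line
df["findings"] = df['text'].apply(regex_findings)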

In [44]:
df.findings[0]

"The patient has Barrett's oesophagus. It is a long segment. No loss of aceto-whitening  was also seen. Short segment only. There is a polyp in the third part of the duodenum which is stalked with an abnormal pit pattern. The mucosa surrounding the nodules is inflamed and edematous, with a granular appearance.The polyp was marked with a tattoo to aid in future surveillance endoscopy.. NA RECOMMENDATION: The patient should be informed of the diagnosis and the importance of adhering to the recommended treatment plan.  The patient may be prescribed medication to reduce the risk of developing more polyps. FOLLOW UP: The patient may be prescribed medication to manage any symptoms associated with the nodule, such as pain or discomfort.  The patient should be advised to avoid consuming too much sugar, as this can increase the risk of bacterial overgrowth in the stomach and increase the risk of polyp growth."

In [24]:
df.shape

(1000, 15)

In [25]:
df.head()

Unnamed: 0,text,foundation_trust,hospital_num,gp,DOB,procedure_date,endoscopist,second_endoscopist,medications_fentynl,midazolam,instrument,extent_of_exam,indications,procedure_performed,findings
0,"Hospital Number: O8583042 , Hospital: Random N...",Random NHS Foundation Trust,O8583042,Dr. Lutter,1903-05-10,2004-11-14,Dr. al-Rashid,Dr. Labeaux,25.0,6,FG2,D1,Biopsies for H pylori,Gastroscopy (OGD),The patient has Barrett's oesophagus. It is a ...
1,"Hospital Number: H2587908 , Hospital: Random N...",Random NHS Foundation Trust,H2587908,Dr. Akhmedov,1933-11-14,2015-09-02,Dr. Garnier,Dr. Carter,25.0,1,FG5,D1,Known coeliac ch diarrhoea.Myelofibrosis on r...,Gastroscopy (OGD),There is a polyp in the antrum which is sessil...
2,"Hospital Number: P9907275 , Hospital: Random N...",Random NHS Foundation Trust,P9907275,Dr. al-Tabet,1967-03-08,2002-01-13,Dr. Stearns,Dr. Geist,50.0,1,FG3,Failed intubation,Known coeliac disease vomitting,Gastroscopy (OGD),The patient has inflammation in the second par...
3,"Hospital Number: D9990587 , Hospital: Random N...",Random NHS Foundation Trust,D9990587,Dr. Garcia,1973-11-22,2007-04-20,Dr. Ali,Dr. Carter,150.0,5,FG4,Pylorus,Reflux-like Symptoms/Atypical Chest Pain,Gastroscopy (OGD),Normal gastroscopy to the duodenum.
4,"Hospital Number: K9010769 , Hospital: Random N...",Random NHS Foundation Trust,K9010769,Dr. el-Kaiser,1968-04-13,2016-01-12,Dr. Presta,Dr. Labeaux,125.0,7,FG3,Pylorus,Dysphagia/Odynophagia post oesophagectomy,Gastroscopy (OGD),There is an ulcer in the second part of the du...


In [27]:
# Inspect the value counts of the extracted columns

In [28]:
df.medications_fentynl.value_counts()

75.0     172
50.0     151
25.0     139
100.0    139
125.0    135
12.5     134
150.0    130
Name: medications_fentynl, dtype: int64

In [29]:
df.midazolam.value_counts()

6    167
2    162
4    148
1    139
3    135
7    132
5    117
Name: midazolam, dtype: int64

In [30]:
df.extent_of_exam.value_counts()

  Pylorus                162
  Stomach body           151
  Oesophagus             148
  D1                     142
  GOJ                    137
  Failed intubation      133
  D2                     127
Name: extent_of_exam, dtype: int64

In [31]:
df.procedure_performed.value_counts()

 Gastroscopy (OGD)      1000
Name: procedure_performed, dtype: int64

In [32]:
df.indications.value_counts()

 Abdominal Pain x1 hematemesis Crohns colitis                         15
 Dysphagia/Odynophagia post LTA                                       13
 Dysphagia oesophageal stricture post chemorad for scc oesophagus     13
 Other - chronic cough ?GORD                                          12
 Ongoing reflux symptoms.                                             12
                                                                      ..
 Dysphagia/Odynophagia post transhiatal oesophagectomy                 2
 Abdominal Pain Weight Loss                                            2
 On going epigastric pain                                              2
 Dyspepsia Other- Bloating                                             1
 Post CRT stricture                                                    1
Name: indications, Length: 159, dtype: int64

In [33]:
df.endoscopist.value_counts()

 Dr. Song         118
 Dr. Stearns      113
 Dr. al-Rashid    104
 Dr. al-Mannan    100
 Dr. Fears         99
 Dr. Currier       96
 Dr. Ali           95
 Dr. el-Haque      95
 Dr. Presta        93
 Dr. Garnier       87
Name: endoscopist, dtype: int64

In [34]:
df.findings.shape

(1000,)

In [35]:
#df.findings.value_counts().sort_values(ascending=False)

In [36]:
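# Average only the numeric dose columns; recent pandas versions raise a TypeError on text columns.
df_doc_drugs = df.groupby("endoscopist")[["medications_fentynl", "midazolam"]].mean()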

In [37]:
df_doc_drugs.head()

endoscopist,medications_fentynl,midazolam
Dr. Ali,81.447368,4.305263
Dr. Currier,76.171875,3.46875
Dr. Fears,72.474747,4.282828
Dr. Garnier,75.862069,4.241379
Dr. Presta,74.05914,3.913978


In [38]:
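# Average only the numeric dose columns; recent pandas versions raise a TypeError on text columns.
df_indications = df.groupby("indications")[["medications_fentynl", "midazolam"]].mean()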

In [39]:
df_indications.head()

indications,medications_fentynl,midazolam
Abdominal Pain,76.5625,3.25
Abdominal Pain .,81.25,3.5
Abdominal Pain Anaemia/Low Iron or Vitamins,64.583333,4.166667
Abdominal Pain Bloating,125.0,5.333333
Abdominal Pain Dyspepsia,86.111111,4.222222


In [40]:
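# Average only the numeric dose columns; recent pandas versions raise a TypeError on text columns.
df_extent_exam = df.groupby("extent_of_exam")[["medications_fentynl", "midazolam"]].mean()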

In [41]:
df_extent_exam.head()

extent_of_exam,medications_fentynl,midazolam
D1,77.90493,4.091549
D2,79.330709,4.165354
Failed intubation,76.597744,3.902256
GOJ,70.437956,3.817518
Oesophagus,75.168919,3.97973


In [42]:
df.findings[800]

'Normal gastroscopy to the duodenum.     The patient should be encouraged to maintain a healthy weight to reduce the risk of developing more polyps.'
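
The `findings` values still carry the recommendation and follow-up text. Some records mark these parts with `RECOMMENDATION:` and `FOLLOW UP:` labels, while others, like record 800, have no labels at all. A minimal sketch of splitting on the labels where they exist is given below. `split_report_sections` is a hypothetical helper that is not part of the original notebook, and the sample is a shortened version of the first record's findings.

```python
import re

# Hypothetical helper, not part of the original notebook.
def split_report_sections(findings_text):
    """Split findings text on the RECOMMENDATION and FOLLOW UP labels."""
    # A capturing group makes re.split keep the labels in the result list:
    # [findings, label, body, label, body, ...]
    parts = re.split(r"\s*(RECOMMENDATION|FOLLOW UP):\s*", findings_text)
    sections = {"FINDINGS": parts[0].strip()}
    for label, body in zip(parts[1::2], parts[2::2]):
        sections[label] = body.strip()
    return sections

# Shortened version of the first record's findings.
sample = (
    "The patient has Barrett's oesophagus. It is a long segment. NA "
    "RECOMMENDATION: The patient should be informed of the diagnosis. "
    "FOLLOW UP: The patient should be advised to avoid consuming too much sugar."
)

sections = split_report_sections(sample)
unlabelled = split_report_sections("Normal gastroscopy to the duodenum.")
```

Records without labels come back with only a `FINDINGS` key, so text such as the advice in record 800 stays in the findings and would need a different rule. The stray `NA` token before `RECOMMENDATION` is left untouched here.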