<hr style="border:2px solid gray">

# Preprocessing

<hr style="border:2px solid gray">

## Content

<hr style="border:1px solid gray">

- **[Libraries to use](#Libraries)**

- **[Loading the dataset](#Loading)**

- **[Cleaning the dataset](#Cleaning)**

- **[Extracting required features](#Extracting)**

- **[Data exploration](#Exploration)**


<hr style="border:2px solid gray">

---
<a name="Libraries"></a>
### Libraries to use 
---

In [3]:
#----------------------------------------------------------
# Regular modules
#----------------------------------------------------------
import numpy as np

import re
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
#----------------------------------------------------------
# Visualization
#----------------------------------------------------------
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
#----------------------------------------------------------
# Classifiers
#----------------------------------------------------------
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
#----------------------------------------------------------
# For metrics
#----------------------------------------------------------
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
#----------------------------------------------------------
# To avoid warnings
#----------------------------------------------------------
import os
import warnings 
warnings.filterwarnings('ignore')
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
#----------------------------------------------------------
# Not regular modules
#----------------------------------------------------------
import datasets # to create a dictionary of datasets.
import torch # provides multi-dimensional arrays called tensors.
from umap import UMAP # Uniform Manifold Approximation and Projection:
# a dimensionality-reduction technique commonly used to visualize
# high-dimensional data in two or three dimensions.
#----------------------------------------------------------
# Transformers
#----------------------------------------------------------
from transformers import AutoTokenizer # to tokenize dataset of text.
from transformers import AutoModel # to export last hidden layer from the outputs of the model.
from transformers import AutoModelForSequenceClassification # to export logits from the outputs of the model.
#----------------------------------------------------------

---
### Loading datasets <a id="Loading"> </a>
---

#### Loading real dataset

In [29]:
real = pd.read_csv("../data/real.csv")

In [30]:
real.head()

| | Unnamed: 0 | out | NA |
|---|---|---|---|
| 0 | out1 | Hospital Number: R1265623 , Hospital: Random N... | \nNA |
| 1 | out2 | Hospital Number: K2515095 , Hospital: Random N... | Nature of specimen:x9 stomach biopsy specimens... |
| 2 | out3 | Hospital Number: L7746099 , Hospital: Random N... | \ncharacter(0) |
| 3 | out4 | Hospital Number: J4131371 , Hospital: Random N... | |
| 4 | out5 | Hospital Number: S4637507 , Hospital: Random N... | Nature of specimen:x6 fundus biopsy specimens ... |


#### Loading synthetic dataset

In [31]:
synthetic = pd.read_csv("../data/synthetic.csv")

In [32]:
synthetic.head()

| | Unnamed: 0 | findings |
|---|---|---|
| 0 | 0 | Normal gastroscopy to the duodenum. |
| 1 | 1 | Normal gastroscopy to the duodenum. |
| 2 | 2 | Normal gastroscopy to the duodenum. |
| 3 | 3 | The patient has Barrett's oesophagus. No loss... |
| 4 | 4 | Normal gastroscopy to the duodenum. |


<a id="Cleaning"> </a>

---
### Cleaning the dataset 
---

---
#### Function (cleaning_real)
---

In [40]:
def cleaning_real(df):
    """
    -----------------------------------------------
    Description: Cleaning the loaded dataset and separating the relevant features.
    -----------------------------------------------    
        - Input: DataFrame with medical reports (Unnamed: 0, out, NA).
        - Output: DataFrame with separated features from the original out feature.
    -----------------------------------------------    
    """
    #--------------------------------------------------------------------
    # List of features for regex_list
    #--------------------------------------------------------------------
    hospital_numb = r"\.*Hospital Number.*"
    hospital = r"\.*Hospital:.*"
    general_practitioner = r"\.*General Practitioner:.*"
    DOB = r"\.*DOB:.*"
    Endoscopist = r"\.*Endoscopist:.*"
    Endoscopist_2 = r"\.*2nd Endoscopist:.*"
    Instrument = r"\.*Instrument.*"
    Extent = r"\.*Extent of Exam:.*"
    Procedure = r"\.*Procedure Performed:.*"
    #--------------------------------------------------------------------
    list_features_regex = [hospital_numb,
                           hospital,
                           general_practitioner,
                           DOB,
                           Endoscopist,
                           Endoscopist_2,
                           Instrument,
                           Extent,
                           Procedure]
    #--------------------------------------------------------------------
    def regex_list(string,feature):
        """
        -----------------------------------------------
        Inputs:
            - string: full text of the out feature for one row (str).
            - feature: regex pattern of the feature to extract (str).
        Output:
            - retrn_string: returned string with the information of each feature.
            Example: for feature=hospital_numb, retrn_string holds the hospital number.
        -----------------------------------------------
        """
        # First line that matches the pattern ('.' does not cross a newline).
        line = re.findall(feature, string)[0]
        # Commas are treated as delimiters too; keep the value after the first ':'.
        retrn_string= line.replace(',',':').split(":")[1]
        if retrn_string[-1:] == "\r":
            return retrn_string[:-1]
        else:
            return retrn_string
    #--------------------------------------------------------------------    
    df["Hospital Number"] = df['out'].apply(regex_list, args=(hospital_numb,))
    df["Hospital"] = df['out'].apply(regex_list, args=(hospital,))
    df["General Practitioner"] = df['out'].apply(regex_list, args=(general_practitioner,))
    df["DOB"] = df['out'].apply(regex_list, args=(DOB,))
    df["Endoscopist"] = df['out'].apply(regex_list, args=(Endoscopist,))
    df["2nd Endoscopist"] = df['out'].apply(regex_list, args=(Endoscopist_2,))
    df["Instrument"] = df['out'].apply(regex_list, args=(Instrument,))
    df["Extent of Exam"] = df['out'].apply(regex_list, args=(Extent,))
    df["Procedure Performed"] = df['out'].apply(regex_list, args=(Procedure,))
    #--------------------------------------------------------------------
    # Date of procedure
    #--------------------------------------------------------------------
    Date_procedure = r"\.*Date of procedure:.*"
    #--------------------------------------------------------------------   
    def regex_procedure_date(string):
        hospital_reg = r"\.*Date of procedure:.*"
        line =  re.findall(hospital_reg, string)[0]
        retrn_string =  line.split(":")[1][:-11]
        if retrn_string[-1:] == "\r":
            return retrn_string[:-1]
        else:
            return retrn_string
    #--------------------------------------------------------------------
    df["Date of procedure"] = df['out'].apply(regex_procedure_date)    
    #--------------------------------------------------------------------
    # Medication
    #--------------------------------------------------------------------
    dmcg = r"\d*.\dmcg"
    #--------------------------------------------------------------------       
    def regex_medication(string):
        # First dose written in micrograms, e.g. "75mcg".
        retrn_string= re.findall(dmcg, string)[0]
        # The match always ends in "mcg"; drop those three characters before converting.
        return float(retrn_string[:-3])
    #--------------------------------------------------------------------    
    df["Medication"] = df['out'].apply(regex_medication)    
    #--------------------------------------------------------------------
    # Midazolam
    #--------------------------------------------------------------------
    Midazolam = r"\.*Midazolam.*"
    #--------------------------------------------------------------------     
    def regex_midazolam(string):
        line = re.findall(Midazolam, string)[0]
        # split() discards surrounding whitespace (including "\r"), so the second
        # token never ends in "\r"; its last two characters are dropped before conversion.
        retrn_string =  line.split()[1]
        return int(retrn_string[:-2])
    #--------------------------------------------------------------------    
    df["Midazolam"] = df['out'].apply(regex_midazolam)      
    #--------------------------------------------------------------------
    # Indications for procedure
    #--------------------------------------------------------------------       
    def regex_indications(string):
        hospital_reg = r"\.*INDICATIONS FOR PROCEDURE:.*"
        line = re.findall(hospital_reg, string)[0]
        retrn_string =  line.replace(',',':').split(":")[1]
        if retrn_string[-1:] == "\r":
            retrn_string= retrn_string[:-1]
        if retrn_string[-8:] == "FINDINGS":
            return retrn_string[:-8]
        else:
            return retrn_string 
    #--------------------------------------------------------------------    
    df["Indications"] = df['out'].apply(regex_indications)        
    #--------------------------------------------------------------------
    # Findings
    #-------------------------------------------------------------------- 
    def regex_findings(string):
        hospital_reg = r"\.*FINDINGS:.*"
        line = re.findall(hospital_reg, string)[0][10:]
        return line
    #--------------------------------------------------------------------    
    df["findings"] = df['out'].apply(regex_findings)  
    #--------------------------------------------------------------------
    return df 
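
To make the field extraction concrete, here is a minimal sketch of the same find-then-split idea applied to a made-up report string. The sample text and the helper name `extract_field` are assumptions for illustration only; the real `out` text may be laid out differently, and unlike `regex_list` this helper also strips surrounding spaces.

```python
import re

# Hypothetical sample report; the real 'out' text may be laid out differently.
sample = (
    "Hospital Number: R0000001\r\n"
    "Hospital: Example Trust\r\n"
    "General Practitioner: Dr. Example, second clause\r\n"
)

def extract_field(text, pattern):
    """Return the value after the first ':' on the first line matching pattern."""
    # '.' does not match '\n', so each match stays on a single line.
    line = re.findall(pattern, text)[0]
    # Commas are treated as delimiters too, as in regex_list.
    value = line.replace(",", ":").split(":")[1]
    # Drop the trailing carriage return and surrounding spaces (an addition here).
    return value.rstrip("\r").strip()

print(extract_field(sample, r"Hospital Number.*"))
print(extract_field(sample, r"Hospital:.*"))
print(extract_field(sample, r"General Practitioner:.*"))
```

Because the comma is turned into a delimiter, anything after the first comma in a value is discarded, which matters for fields whose text can legitimately contain commas.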

---
#### Result 
---

In [41]:
df_cleaned = cleaning_real(real)
df_cleaned.head(3)

| | Unnamed: 0 | out | NA | Hospital Number | Hospital | General Practitioner | DOB | Endoscopist | 2nd Endoscopist | Instrument | Extent of Exam | Procedure Performed | Date of procedure | Medication | Midazolam | Indications | findings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | out1 | Hospital Number: R1265623 , Hospital: Random N... | \nNA | R1265623 | Random NHS Foundation Trust | Dr. Taylor | 1960-06-23 | Dr. el-Hasen | Dr. Phenna | FG2 | D1 | Gastroscopy (OGD) | 2014-11-13 | 75.0 | 6 | Ongoing reflux symptoms. | Columnar lined oesophagus is present. The segm... |
| 1 | out2 | Hospital Number: K2515095 , Hospital: Random N... | Nature of specimen:x9 stomach biopsy specimens... | K2515095 | Random NHS Foundation Trust | Dr. Cheek | 1981-01-24 | Dr. el-Hasen | Dr. Sherwood | FG4 | Oesophagus | Gastroscopy (OGD) | 2002-05-01 | 25.0 | 2 | Endoscopic ultrasound findings | There is an ulcer in the stomach which is supe... |
| 2 | out3 | Hospital Number: L7746099 , Hospital: Random N... | \ncharacter(0) | L7746099 | Random NHS Foundation Trust | Dr. al-Zamani | 1912-06-02 | Dr. Hall | Dr. Barrett | FG7 | D1 | Gastroscopy (OGD) | 2011-09-20 | 25.0 | 3 | Nausea and/or Vomiting Haematemesis or Melaen... | LA Grade D oesophagitis. The oesopahgitis is ... |


#### Function (cleaning_synthetic)

In [49]:
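def cleaning_synthetic(synthetic):
    # Keep only the report text; copy so the label is not written to a slice.
    synthetic = synthetic[['findings']].copy()
    # Mark every synthetic report with label 1.
    synthetic['label'] = 1
    return synthetic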
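
Adding a column to a column subset is a spot where pandas can, depending on the version, emit a `SettingWithCopyWarning`. The sketch below labels an explicit copy of the selected column on a toy frame; the rows are made up and the variable names `frame` and `labelled` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the synthetic reports (made-up rows).
frame = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "findings": ["Normal gastroscopy to the duodenum.",
                 "Large sliding hiatus hernia."],
})

# Take an explicit copy of the selected column, then add the constant label.
labelled = frame[["findings"]].copy()
labelled["label"] = 1

print(list(labelled.columns))     # columns of the labelled copy
print("label" in frame.columns)   # the source frame is left untouched
```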

#### Result (cleaning_synthetic)

In [50]:
synthetic = cleaning_synthetic(synthetic)
synthetic

| | findings | label |
|---|---|---|
| 0 | Normal gastroscopy to the duodenum. | 1 |
| 1 | Normal gastroscopy to the duodenum. | 1 |
| 2 | Normal gastroscopy to the duodenum. | 1 |
| 3 | The patient has Barrett's oesophagus. No loss... | 1 |
| 4 | Normal gastroscopy to the duodenum. | 1 |
| ... | ... | ... |
| 95 | The oesophagus appears to have a mildly friab... | 1 |
| 96 | There is a nodule in the antrum which is beni... | 1 |
| 97 | The patient has a 9mm nodule in the third par... | 1 |
| 98 | Large sliding hiatus hernia. inorad for Exam:... | 1 |


<a id="Extracting"> </a>

---
### Extracting required features
---

---
#### Function
---

In [55]:
def extracting_real(df):
    """
    -----------------------------------------------
    Description: Extracting relevant features from df.
    -----------------------------------------------
     Input: df: DataFrame resulting from cleaning_real.
     Output: df_extracted: DataFrame with "General Practitioner", "Endoscopist", "Instrument", "Extent of Exam", "Indications" and "findings".
    """
    #--------------------------------------------------------------------
    df_extracted = df[["General Practitioner","Endoscopist","Instrument","Extent of Exam","Indications","findings"]]
    #--------------------------------------------------------------------    
    return df_extracted 

#### Result

In [56]:
real = extracting_real(df_cleaned)
real.head(3)

| | General Practitioner | Endoscopist | Instrument | Extent of Exam | Indications | findings |
|---|---|---|---|---|---|---|
| 0 | Dr. Taylor | Dr. el-Hasen | FG2 | D1 | Ongoing reflux symptoms. | Columnar lined oesophagus is present. The segm... |
| 1 | Dr. Cheek | Dr. el-Hasen | FG4 | Oesophagus | Endoscopic ultrasound findings | There is an ulcer in the stomach which is supe... |
| 2 | Dr. al-Zamani | Dr. Hall | FG7 | D1 | Nausea and/or Vomiting Haematemesis or Melaen... | LA Grade D oesophagitis. The oesopahgitis is ... |


In [48]:
real.findings.iloc[12]

"Normal gastroscopy to the duodenum.  FOLLOW UP: A blood test may be ordered to assess the patient's iron levels, as polyps can cause bleeding in the stomach and intestines. RECOMMENDATION: The patient should be advised to avoid acidic foods and drinks, which can irritate the lining of the stomach and increase the risk of developing more polyps."

<a id="Saving"> </a>

---
### Saving the dataset with extracted data
---


In [53]:
real.to_csv('../data/real_preprocessed.csv')

In [54]:
synthetic.to_csv('../data/synthetic_preprocessed.csv')
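
The `Unnamed: 0` columns seen when the files were loaded are what pandas produces when a CSV written with its index is read back. Whether the source files were created that way is an assumption, but the effect is easy to reproduce. The sketch below round-trips a toy frame through in-memory buffers (no files are touched) with and without `index=False`.

```python
import io
import pandas as pd

# Toy frame standing in for a preprocessed dataset (made-up row).
frame = pd.DataFrame({
    "findings": ["Normal gastroscopy to the duodenum."],
    "label": [1],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)     # includes an extra "Unnamed: 0" column
print(cols_without_index)  # only "findings" and "label"
```

Passing `index=False` when saving (or `index_col=0` when loading) keeps the extra column from accumulating across repeated save and load cycles.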

<a id="Complete preprocessing"> </a>

### Complete preprocessing

#### Function (preprocess_real)

In [59]:
def preprocess_real():
    real = pd.read_csv("../data/real.csv")
    real = cleaning_real(real)
    real = extracting_real(real)
    real.to_csv('../data/real_preprocessed.csv')

#### Function (preprocess_synthetic)

In [63]:
def preprocess_synthetic():
    synthetic = pd.read_csv("../data/synthetic.csv")
    synthetic = cleaning_synthetic(synthetic)
    synthetic.to_csv('../data/synthetic_preprocessed.csv')

#### Result preprocess synthetic and real

In [64]:
preprocess_real()

In [65]:
preprocess_synthetic()