### Inflammatory Bowel Disease (IBD) Clinical Data Analysis

##### Predicting IBD-UC Factors: Correlating Demographics, Disease Phenotype, and Treatment Response

Inflammatory Bowel Disease (IBD) refers to a group of chronic disorders that cause inflammation in the gastrointestinal tract, including Ulcerative Colitis (UC), Crohn's Disease (CD), and Unclassified types. Timely identification and precise prediction of the disease phenotype are essential for effective treatment and enhanced patient outcomes. By analyzing the data, our goal is to identify patterns that may help ***predict*** which patients are at greater risk for severe **UC** forms, allowing for the development of personalized treatment plans.

#### Data table for the study
The study collects patient data on demographics, disease phenotype, treatment and medication history, clinical history, and co-morbidities.

| **Variable Name**                    | **Type**          | **Description**                                                                                                                                                     |
|--------------------------------------|-------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **id**                               | Categorical       | Unique identifier for each patient.                                                                                                                                  |
| **Country**                          | Categorical       | The country where the patient resides.                                                                                                                                 |
| **Age**                              | Continuous        | The age of the patient at the time of diagnosis or data collection.                                                                                                  |
| **Sex**                              | Categorical       | The sex of the patient (Male/Female).                                                                                                                                  |
| **Current Smoker**                   | Categorical       | Whether the patient is currently smoking (Yes/No).                                                                                                                   |
| **Year of Dx**                       | Continuous        | The year the patient was diagnosed with IBD.                                                                                                                          |
| **Duration**                         | Continuous        | The duration (in years or months) since the patient was diagnosed with IBD.                                                                                          |
| **UC (E1-E3)**                       | Categorical       | Extent of **Ulcerative Colitis** under the Montreal classification (E1: proctitis, rectum only; E2: left-sided; E3: extensive disease).                              |
| **CD (L1-L4)**                       | Categorical       | **Crohn's Disease** location classification (L1: Ileal, L2: Colonic, L3: Ileocolonic, L4: Upper GI).                                                                  |
| **CD (B1-B3)**                       | Categorical       | **Crohn's Disease** behavior classification (B1: Non-stricturing, Non-penetrating; B2: Stricturing; B3: Penetrating).                                                 |
| **CD (Perianal)**                    | Categorical       | Whether the patient has perianal disease (Yes/No), applicable for **Crohn's Disease**.                                                                              |
| **IBD-U**                            | Categorical       | Whether the patient has **IBD-Unclassified** (Yes/No).                                                                                                              |
| **Previous Surgery**                 | Categorical       | Whether the patient has undergone previous surgical procedures related to IBD (Yes/No).                                                                             |
| **5-ASA**                            | Categorical       | Whether the patient is taking **5-aminosalicylic acid** (5-ASA) as part of treatment (Yes/No).                                                                       |
| **Azathioprine or 6-MP**             | Categorical       | Whether the patient is using **Azathioprine** or **6-mercaptopurine** (Yes/No).                                                                                      |
| **Methotrexate**                     | Categorical       | Whether the patient is using **Methotrexate** for treatment (Yes/No).                                                                                                |
| **Cyclosporine**                     | Categorical       | Whether the patient is using **Cyclosporine** (Yes/No).                                                                                                             |
| **Infliximab**                       | Categorical       | Whether the patient has used **Infliximab**, a biologic treatment for IBD (Yes/No).                                                                                 |
| **Certolizumab**                     | Categorical       | Whether the patient has used **Certolizumab** (Yes/No).                                                                                                            |
| **Golimumab**                        | Categorical       | Whether the patient has used **Golimumab** (Yes/No).                                                                                                               |
| **Adalimumab**                       | Categorical       | Whether the patient has used **Adalimumab** (Yes/No).                                                                                                              |
| **Vedolizumab**                      | Categorical       | Whether the patient has used **Vedolizumab** (Yes/No).                                                                                                             |
| **Current Steroid Use**              | Categorical       | Whether the patient is currently using **steroids** for IBD management (Yes/No).                                                                                     |
| **Tofacitinib**                      | Categorical       | Whether the patient is using **Tofacitinib** (Yes/No).                                                                                                              |
| **Ustekinumab**                      | Categorical       | Whether the patient is using **Ustekinumab** (Yes/No).                                                                                                             |
| **# of Previous Biologics Used**     | Continuous        | The total number of biologic therapies the patient has used in the past.                                                                                           |
| **EIM**                              | Categorical       | Presence of **Extra-intestinal Manifestations (EIM)** (e.g., arthritis, skin lesions, eye inflammation) (Yes/No).                                                   |
| **Other Autoimmune Diseases**        | Categorical       | Whether the patient has other autoimmune diseases (Yes/No).                                                                                                         |
| **Co-morbidities**                   | Categorical       | Whether the patient has additional co-morbidities that may influence disease progression or treatment response (Yes/No).                                             |


### 1.1 Import the data

In [105]:
import pandas as pd 
import numpy as np

# Print NumPy floats in fixed-point instead of scientific notation (pandas display is not affected)
np.set_printoptions(formatter={'float_kind':'{:f}'.format})

# load the data
df = pd.read_csv('IBD.csv')

# Set 'id' column as the index
df.set_index('id', inplace=True)

# print the dataframe
df

id,Country,Age,Sex,Current Smoker,Year of Dx,Duration,UC (E1-E3),CD(L1-L4),CD (B1-B3),CD (Perianal),...,Golimumab,Adalimumab,vedolizumab,current steroid use,Tofacitinib,Ustekinumab,No.of previous biologic used,EIM,Other autoimmune diseases,Co-morbidities
1,Spain,14.0,M,0,2022,2.0,0,L3+L4,B1,0,...,,No,No,Yes,No,No,2,Yes,0,P
2,Spain,55.0,F,0,2023,1.0,0,L1+L4,B1,0,...,,No,No,,No,No,0,,0,0
3,Spain,23.0,F,0,2021,3.0,E3,0,0,0,...,,No,No,,No,No,0,,0,0
4,Spain,23.0,F,0,2018,6.0,E3,0,0,0,...,,No,No,Yes,No,Yes,4,,0,0
5,Spain,9.0,M,0,2018,6.0,E2,0,0,0,...,,No,Yes,Yes,No,No,2,,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5536,France,,F,0,2007,16.0,0,L1,B1,0,...,,,,,,,0,,0,0
5537,France,,F,0,2013,10.0,0,L1,B3,1,...,,,,,,,0,,0,0
5538,France,,M,0,2019,4.0,0,L1,B1,0,...,,,,,,Yes,1,,0,0
5539,France,,M,0,2021,2.0,0,L1,B1,0,...,,,,Yes,,,0,,0,0


Our data is messy and contains many NaN values. Let's explore further.

### 1.2 Check the data

In [106]:
# Top 5 rows
df.head()

id,Country,Age,Sex,Current Smoker,Year of Dx,Duration,UC (E1-E3),CD(L1-L4),CD (B1-B3),CD (Perianal),...,Golimumab,Adalimumab,vedolizumab,current steroid use,Tofacitinib,Ustekinumab,No.of previous biologic used,EIM,Other autoimmune diseases,Co-morbidities
1,Spain,14.0,M,0,2022,2.0,0,L3+L4,B1,0,...,,No,No,Yes,No,No,2,Yes,0,P
2,Spain,55.0,F,0,2023,1.0,0,L1+L4,B1,0,...,,No,No,,No,No,0,,0,0
3,Spain,23.0,F,0,2021,3.0,E3,0,0,0,...,,No,No,,No,No,0,,0,0
4,Spain,23.0,F,0,2018,6.0,E3,0,0,0,...,,No,No,Yes,No,Yes,4,,0,0
5,Spain,9.0,M,0,2018,6.0,E2,0,0,0,...,,No,Yes,Yes,No,No,2,,0,0


In [107]:
# Bottom 5 rows
df.tail()

id,Country,Age,Sex,Current Smoker,Year of Dx,Duration,UC (E1-E3),CD(L1-L4),CD (B1-B3),CD (Perianal),...,Golimumab,Adalimumab,vedolizumab,current steroid use,Tofacitinib,Ustekinumab,No.of previous biologic used,EIM,Other autoimmune diseases,Co-morbidities
5536,France,,F,0,2007,16.0,0,L1,B1,0,...,,,,,,,0,,0,0
5537,France,,F,0,2013,10.0,0,L1,B3,1,...,,,,,,,0,,0,0
5538,France,,M,0,2019,4.0,0,L1,B1,0,...,,,,,,Yes,1,,0,0
5539,France,,M,0,2021,2.0,0,L1,B1,0,...,,,,Yes,,,0,,0,0
5540,France,,M,0,2003,20.0,0,L3,B1,0,...,,,,,,,0,Yes,0,0


In [108]:
# Shape of data
print(f'The IBD data consists of {df.shape[0]} rows and {df.shape[1]} columns.')

The IBD data consists of 5540 rows and 28 columns.


### 1.3 Data Preprocessing

In [109]:
# Check Info of df
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5540 entries, 1 to 5540
Data columns (total 28 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Country                       5540 non-null   object 
 1   Age                           4928 non-null   float64
 2   Sex                           5540 non-null   object 
 3   Current Smoker                5540 non-null   int64  
 4   Year of Dx                    5540 non-null   int64  
 5   Duration                      5531 non-null   float64
 6   UC (E1-E3)                    5540 non-null   object 
 7   CD(L1-L4)                     5540 non-null   object 
 8   CD (B1-B3)                    5540 non-null   object 
 9   CD (Perianal)                 5540 non-null   int64  
 10  IBD-U                         5540 non-null   int64  
 11  Previous surgery              5540 non-null   object 
 12  5-ASA                         4460 non-null   object 
 13  Azathiop

Several variables need their data types addressed, such as Year of Dx, Duration, Adalimumab, current steroid use, No.of previous biologic used, EIM, Other autoimmune diseases and Co-morbidities.

In [110]:
# Store the flag columns as strings so they are treated as categorical
df['CD (Perianal)'] = df['CD (Perianal)'].fillna('0').astype(str)
df['IBD-U'] = df['IBD-U'].fillna('0').astype(str)
df['Current Smoker'] = df['Current Smoker'].astype(str)
# Convert 'Year of Dx' to datetime format (each year becomes January 1 of that year)
df['Year of Dx'] = pd.to_datetime(df['Year of Dx'], format='%Y')

Here we convert the flag variables to string type and Year of Dx to datetime.
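The year-to-datetime conversion can be illustrated on a small, hypothetical frame. This is a sketch only: the `toy` frame is invented for illustration, and the year is cast to string before parsing to keep the behaviour explicit. It also shows how the plain year can be recovered with `.dt.year` if a numeric year is needed later.

```python
import pandas as pd

# Hypothetical stand-in for the 'Year of Dx' column of df
toy = pd.DataFrame({"Year of Dx": [2022, 2018, 2007]})

# Parse the four-digit year; each value becomes January 1 of that year
toy["Year of Dx"] = pd.to_datetime(toy["Year of Dx"].astype(str), format="%Y")

# Recover the plain integer year when a numeric feature is preferred
dx_year = toy["Year of Dx"].dt.year

print(toy["Year of Dx"].tolist())
print(dx_year.tolist())
```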

In [111]:
# Check Info of df
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5540 entries, 1 to 5540
Data columns (total 28 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   Country                       5540 non-null   object        
 1   Age                           4928 non-null   float64       
 2   Sex                           5540 non-null   object        
 3   Current Smoker                5540 non-null   object        
 4   Year of Dx                    5540 non-null   datetime64[ns]
 5   Duration                      5531 non-null   float64       
 6   UC (E1-E3)                    5540 non-null   object        
 7   CD(L1-L4)                     5540 non-null   object        
 8   CD (B1-B3)                    5540 non-null   object        
 9   CD (Perianal)                 5540 non-null   object        
 10  IBD-U                         5540 non-null   object        
 11  Previous surgery              5540 

In [112]:
# Check missing values per column
df.isnull().sum()

Country                            0
Age                              612
Sex                                0
Current Smoker                     0
Year of Dx                         0
Duration                           9
UC (E1-E3)                         0
CD(L1-L4)                          0
CD (B1-B3)                         0
CD (Perianal)                      0
IBD-U                              0
Previous surgery                   0
5-ASA                           1080
Azathioprine or 6-MP            1072
Methotrexate                    2885
Cyclosporine                    2854
Infliximab                       742
Certolizumab                    2816
Golimumab                       2814
Adalimumab                       758
vedolizumab                      861
current steroid use             1603
Tofacitinib                      865
Ustekinumab                      361
No.of previous biologic used       0
EIM                             4060
Other autoimmune diseases          0
C

The dataset has many missing values. Let's try to mitigate them.
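Raw counts are easier to judge next to percentages. The sketch below is one possible helper, not part of the original notebook: `missing_summary` and the `toy` frame are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, largest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Hypothetical toy frame standing in for df
toy = pd.DataFrame({
    "Age": [14.0, np.nan, 23.0, np.nan],
    "EIM": ["Yes", None, None, None],
    "Country": ["Spain", "Spain", "France", "France"],
})

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df)` on the real data would give the same view for all 28 columns.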

In [113]:
# Step 1: Define the columns for Yes/No mapping
cat_cols_y_n = ['5-ASA', 'Azathioprine or 6-MP', 'Methotrexate', 'Cyclosporine', 
                'Infliximab', 'Certolizumab', 'Golimumab', 'Adalimumab', 
                'vedolizumab', 'current steroid use', 'Tofacitinib', 'Ustekinumab']

# Step 2: Apply the mapping (Yes -> 1, No -> 0) and fill NaN with 0
df[cat_cols_y_n] = df[cat_cols_y_n].apply(lambda col: col.map({'Yes': 1, 'No': 0}).fillna(0))

# Step 3: Verify the changes
print(df[cat_cols_y_n].head())

# Copy the transformed DataFrame
new_df = df.copy()  # or do other transformations if needed

new_df.info()


    5-ASA  Azathioprine or 6-MP  Methotrexate  Cyclosporine  Infliximab  \
id                                                                        
1     0.0                   0.0           0.0           0.0         1.0   
2     0.0                   0.0           0.0           0.0         0.0   
3     0.0                   0.0           0.0           0.0         0.0   
4     1.0                   0.0           0.0           0.0         0.0   
5     1.0                   1.0           0.0           0.0         0.0   

    Certolizumab  Golimumab  Adalimumab  vedolizumab  current steroid use  \
id                                                                          
1            0.0        0.0         0.0          0.0                  1.0   
2            0.0        0.0         0.0          0.0                  0.0   
3            0.0        0.0         0.0          0.0                  0.0   
4            0.0        0.0         0.0          0.0                  1.0   
5           

The variables above are converted to 0 and 1 for ease of analysis. Missing entries are also coded as 0, so they are treated the same as 'No'.
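Because filling missing entries with 0 makes "not recorded" indistinguishable from "No", it can be useful to look at an encoding that keeps the two apart before deciding. The following is a sketch under assumptions: the `toy` frame is hypothetical, and the nullable `Int8` dtype is an alternative, not what the notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for two of the Yes/No medication columns
toy = pd.DataFrame({
    "Infliximab": ["Yes", "No", np.nan],
    "Adalimumab": ["No", np.nan, "Yes"],
})

yes_no = {"Yes": 1, "No": 0}

# Map Yes/No to 1/0 and keep missing entries as <NA> using a nullable integer dtype
encoded = toy.apply(lambda col: col.map(yes_no)).astype("Int8")

# The notebook's choice (missing treated as 'No') can still be derived afterwards
filled = encoded.fillna(0).astype("int8")

print(encoded)
print(filled)
```

Keeping `encoded` around makes it possible to check later whether the missing-as-No assumption changes any result.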

In [114]:
# Concatenate the modified columns to the original DataFrame
# df = pd.concat([df, df[cat_cols_y_n]], axis=1)

# Verify the changes
# print(df.head())


In [115]:
new_df.columns

Index(['Country', 'Age ', 'Sex', 'Current Smoker', 'Year of Dx', 'Duration',
       'UC (E1-E3)', 'CD(L1-L4)', 'CD (B1-B3)', 'CD (Perianal)', 'IBD-U',
       'Previous surgery', '5-ASA', 'Azathioprine or 6-MP', 'Methotrexate',
       'Cyclosporine', 'Infliximab', 'Certolizumab', 'Golimumab', 'Adalimumab',
       'vedolizumab', 'current steroid use', 'Tofacitinib', 'Ustekinumab',
       'No.of previous biologic used', 'EIM', 'Other autoimmune diseases',
       'Co-morbidities'],
      dtype='object')
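The column listing shows 'Age ' with a trailing space, which is easy to mistype. A minimal sketch of one way to normalise the names is shown below on a hypothetical empty frame. If this were applied to `df`, later cells that reference 'Age ' would need to use 'Age' instead.

```python
import pandas as pd

# Hypothetical frame with the same kind of column-name whitespace issue
toy = pd.DataFrame(columns=["Country", "Age ", "CD(L1-L4)"])

# Strip leading and trailing whitespace from every column name
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```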

In [116]:
new_df

id,Country,Age,Sex,Current Smoker,Year of Dx,Duration,UC (E1-E3),CD(L1-L4),CD (B1-B3),CD (Perianal),...,Golimumab,Adalimumab,vedolizumab,current steroid use,Tofacitinib,Ustekinumab,No.of previous biologic used,EIM,Other autoimmune diseases,Co-morbidities
1,Spain,14.0,M,0,2022-01-01,2.0,0,L3+L4,B1,0,...,0.0,0.0,0.0,1.0,0.0,0.0,2,Yes,0,P
2,Spain,55.0,F,0,2023-01-01,1.0,0,L1+L4,B1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,,0,0
3,Spain,23.0,F,0,2021-01-01,3.0,E3,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,,0,0
4,Spain,23.0,F,0,2018-01-01,6.0,E3,0,0,0,...,0.0,0.0,0.0,1.0,0.0,1.0,4,,0,0
5,Spain,9.0,M,0,2018-01-01,6.0,E2,0,0,0,...,0.0,0.0,1.0,1.0,0.0,0.0,2,,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5536,France,,F,0,2007-01-01,16.0,0,L1,B1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,,0,0
5537,France,,F,0,2013-01-01,10.0,0,L1,B3,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0,,0,0
5538,France,,M,0,2019-01-01,4.0,0,L1,B1,0,...,0.0,0.0,0.0,0.0,0.0,1.0,1,,0,0
5539,France,,M,0,2021-01-01,2.0,0,L1,B1,0,...,0.0,0.0,0.0,1.0,0.0,0.0,0,,0,0


In [117]:
# # Investigate all the elements within each Feature 

# for col in df:
#     unique_vals = np.unique(df[col])
#     nr_vals = len(unique_vals)
#     if nr_vals <= 10:
#         print(f"The number of values for feature {col} : {nr_vals} -- {unique_vals}")
#     else:
#         print(f"The number of values for feature {col} : {nr_vals}")

In [118]:
df.columns

Index(['Country', 'Age ', 'Sex', 'Current Smoker', 'Year of Dx', 'Duration',
       'UC (E1-E3)', 'CD(L1-L4)', 'CD (B1-B3)', 'CD (Perianal)', 'IBD-U',
       'Previous surgery', '5-ASA', 'Azathioprine or 6-MP', 'Methotrexate',
       'Cyclosporine', 'Infliximab', 'Certolizumab', 'Golimumab', 'Adalimumab',
       'vedolizumab', 'current steroid use', 'Tofacitinib', 'Ustekinumab',
       'No.of previous biologic used', 'EIM', 'Other autoimmune diseases',
       'Co-morbidities'],
      dtype='object')

In [119]:
# cat_cols_y_n =df[['5-ASA', 'Azathioprine or 6-MP', 'Methotrexate',
#        'Cyclosporine', 'Infliximab', 'Certolizumab', 'Golimumab', 'Adalimumab',
#        'vedolizumab', 'current steroid use', 'Tofacitinib', 'Ustekinumab']]

# for col in cat_cols_y_n:
#     df[col] = df[col].map({'Yes': 1, 'No': 0}).fillna(0)  # Map 'Yes' to 1, 'No' to 0, and fill NaN values with 0

# # Display the DataFrame after applying the mapping
# for col in cat_cols_y_n:
#     print(cat_cols_y_n[col].value_counts())
#     print()


In [120]:
# cat = df[['Sex', 'Current Smoker', 'Previous surgery', 'No.of previous biologic used', 'EIM', 'Other autoimmune diseases',
#        'Co-morbidities']]
# df['Sex'] = df['Sex'].fillna('0').astype(str)
# df['Current Smoker'] = df['Current Smoker'].fillna('0').astype(str)
# df['Previous surgery'] = df['Previous surgery'].fillna('0').astype(str)
# df['No.of previous biologic used'] = df['No.of previous biologic used'].fillna('0').astype(str)
# df['EIM'] = df['EIM'].fillna('0').astype(str)
# df['Other autoimmune diseases'] = df['Other autoimmune diseases'].fillna('0').astype(str)
# df['Co-morbidities'] = df['Co-morbidities'].fillna('0').astype(str)



In [121]:
# new_df.info()

for col in new_df:
    print(new_df[col].value_counts())
    print()

Country
United Kingdom    1802
France            1338
Spain             1017
Italy              768
Poland             377
Greece             238
Name: count, dtype: int64

Age 
27.0    171
32.0    169
33.0    167
28.0    158
37.0    154
       ... 
87.0      2
83.0      2
91.0      1
82.0      1
84.0      1
Name: count, Length: 81, dtype: int64

Sex
M     2874
F     2662
m        2
f        1
M        1
Name: count, dtype: int64

Current Smoker
0    4972
1     481
3      87
Name: count, dtype: int64

Year of Dx
2019-01-01    445
2018-01-01    442
2017-01-01    375
2020-01-01    353
2015-01-01    347
2016-01-01    326
2021-01-01    308
2014-01-01    278
2022-01-01    277
2013-01-01    272
2012-01-01    263
2011-01-01    240
2010-01-01    232
2023-01-01    171
2008-01-01    168
2009-01-01    167
2007-01-01    136
2006-01-01    113
2005-01-01    105
2004-01-01     89
2003-01-01     64
2000-01-01     62
2002-01-01     47
2001-01-01     42
1998-01-01     37
1999-01-01     30
1997-01-01    

In [122]:
# Filling missing values with '0' and converting to string
df['Sex'] = df['Sex'].fillna('0').astype(str)
df['Current Smoker'] = df['Current Smoker'].fillna('0').astype(str)
df['Previous surgery'] = df['Previous surgery'].fillna('0').astype(str)
df['No.of previous biologic used'] = df['No.of previous biologic used'].fillna('0').astype(str)
df['EIM'] = df['EIM'].fillna('0').astype(str)
df['Other autoimmune diseases'] = df['Other autoimmune diseases'].fillna('0').astype(str)
df['Co-morbidities'] = df['Co-morbidities'].fillna('0').astype(str)

# Mapping categorical values to numerical values
# Only the exact labels 'M' and 'F' are mapped; any other label (e.g. 'm', 'f') becomes NaN and is then filled with 0
df['Sex'] = df['Sex'].map({'M': 1, 'F': 0}).fillna(0)
# Any value other than '0' (including the code 3 seen in the counts above) is treated as 1
df['Current Smoker'] = df['Current Smoker'].apply(lambda x: 0 if x == '0' else 1)
# New binary indicator columns: 0 if the original value is '0', otherwise 1
df['New Previous surgery'] = df['Previous surgery'].apply(lambda x: 0 if x == '0' else 1)
df['New previous biologic used'] = df['No.of previous biologic used'].apply(lambda x: 0 if x == '0' else 1)
df['EIM'] = df['EIM'].apply(lambda x: 0 if x == '0' else 1)
df['New Co-morbidities'] = df['Co-morbidities'].apply(lambda x: 0 if x == '0' else 1)
df['New OAD'] = df['Other autoimmune diseases'].apply(lambda x: 0 if x == '0' else 1)


print(df[['Sex', 'Current Smoker', 'Previous surgery', 'No.of previous biologic used', 'EIM', 'Other autoimmune diseases', 'Co-morbidities']].head())


    Sex  Current Smoker Previous surgery No.of previous biologic used  EIM  \
id                                                                           
1   1.0               0                0                            2    1   
2   0.0               0              ICR                            0    0   
3   0.0               0                0                            0    0   
4   0.0               0              TAC                            4    0   
5   1.0               0                0                            2    0   

   Other autoimmune diseases Co-morbidities  
id                                           
1                          0              P  
2                          0              0  
3                          0              0  
4                          0              0  
5                          0              0  
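The value counts for Sex list lowercase 'm' and 'f' and a second 'M' entry, which presumably differs only by surrounding whitespace (an assumption, since the raw strings are not shown). With an exact-match mapping those rows all end up as 0. The sketch below shows one way to normalise labels first; the `sex` series is a hypothetical sample, not the real column.

```python
import pandas as pd

# Hypothetical sample of the label variants seen in the value counts
sex = pd.Series(["M", "F", "m", "f", "M "], name="Sex")

# Normalise whitespace and case before mapping
normalized = sex.str.strip().str.upper()
encoded = normalized.map({"M": 1, "F": 0})

# Anything still unmapped would show up here and can be inspected by hand
unmapped = sex[encoded.isna()]

print(encoded.tolist())
print(len(unmapped))
```

Inspecting `unmapped` before filling anything avoids silently assigning unknown labels to one sex.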


In [123]:
# df['Sex'] = df['Sex'].map({'M': 1, 'F': 0}).fillna(0)
# df['Current Smoker'] = df['Current Smoker'].apply(lambda x: 0 if x == '0' else 1)
# df['New Previous surgery'] = df['Previous surgery'].apply(lambda x: 0 if x == '0' else 1)
# df['New previous biologic used'] = df['No.of previous biologic used'].apply(lambda x: 0 if x == '0' else 1)
# df['EIM'] = df['EIM'].apply(lambda x: 0 if x == '0' else 1)
# df['New Co-morbidities'] = df['Co-morbidities'].apply(lambda x: 0 if x == '0' else 1)
# df['New OAD'] = df['Other autoimmune diseases'].apply(lambda x: 0 if x == '0' else 1)

In [124]:
# cat = df[['Sex', 'New Previous surgery', 'New previous biologic used', 
#        'EIM', 'New OAD', 'New Co-morbidities']]

# for col in cat:
#     print(cat[col].value_counts())
#     print()

In [125]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5540 entries, 1 to 5540
Data columns (total 32 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   Country                       5540 non-null   object        
 1   Age                           4928 non-null   float64       
 2   Sex                           5540 non-null   float64       
 3   Current Smoker                5540 non-null   int64         
 4   Year of Dx                    5540 non-null   datetime64[ns]
 5   Duration                      5531 non-null   float64       
 6   UC (E1-E3)                    5540 non-null   object        
 7   CD(L1-L4)                     5540 non-null   object        
 8   CD (B1-B3)                    5540 non-null   object        
 9   CD (Perianal)                 5540 non-null   object        
 10  IBD-U                         5540 non-null   object        
 11  Previous surgery              5540 

In [126]:
# Drop the original columns that were replaced by the binary 'New ...' indicators
cols_drop = ['Previous surgery', 'No.of previous biologic used', 'Other autoimmune diseases', 'Co-morbidities']

df.drop(columns=cols_drop, inplace=True)

In [127]:
for col in df:
    print(df[col].value_counts())
    print()

Country
United Kingdom    1802
France            1338
Spain             1017
Italy              768
Poland             377
Greece             238
Name: count, dtype: int64

Age 
27.0    171
32.0    169
33.0    167
28.0    158
37.0    154
       ... 
87.0      2
83.0      2
91.0      1
82.0      1
84.0      1
Name: count, Length: 81, dtype: int64

Sex
1.0    2874
0.0    2666
Name: count, dtype: int64

Current Smoker
0    4972
1     568
Name: count, dtype: int64

Year of Dx
2019-01-01    445
2018-01-01    442
2017-01-01    375
2020-01-01    353
2015-01-01    347
2016-01-01    326
2021-01-01    308
2014-01-01    278
2022-01-01    277
2013-01-01    272
2012-01-01    263
2011-01-01    240
2010-01-01    232
2023-01-01    171
2008-01-01    168
2009-01-01    167
2007-01-01    136
2006-01-01    113
2005-01-01    105
2004-01-01     89
2003-01-01     64
2000-01-01     62
2002-01-01     47
2001-01-01     42
1998-01-01     37
1999-01-01     30
1997-01-01     26
1995-01-01     18
1996-01-01     15
1

In [128]:
# Impute missing numeric values with the column median (note the trailing space in 'Age ')
df['Age '] = df['Age '].fillna(df['Age '].median())
df['Duration'] = df['Duration'].fillna(df['Duration'].median())

In [129]:
df.isnull().sum()

Country                       0
Age                           0
Sex                           0
Current Smoker                0
Year of Dx                    0
Duration                      0
UC (E1-E3)                    0
CD(L1-L4)                     0
CD (B1-B3)                    0
CD (Perianal)                 0
IBD-U                         0
5-ASA                         0
Azathioprine or 6-MP          0
Methotrexate                  0
Cyclosporine                  0
Infliximab                    0
Certolizumab                  0
Golimumab                     0
Adalimumab                    0
vedolizumab                   0
current steroid use           0
Tofacitinib                   0
Ustekinumab                   0
EIM                           0
New Previous surgery          0
New previous biologic used    0
New Co-morbidities            0
New OAD                       0
dtype: int64

After imputing the numeric variables with their medians and encoding the remaining categorical (Yes/No) variables as 1/0 for statistical analysis, no missing values remain, as seen above.
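Median imputation hides which rows were originally missing. A common companion step, sketched here on a hypothetical `toy` frame, is to record a missing-value indicator before filling. The `Age_missing` column name is an assumption for illustration and does not exist in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the age column
toy = pd.DataFrame({"Age": [14.0, 55.0, np.nan, 23.0, np.nan]})

# Flag the rows that were missing before imputation
toy["Age_missing"] = toy["Age"].isna().astype(int)

# Median is computed over the observed values only (NaN is skipped by default)
age_median = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(age_median)

print(age_median)
print(toy)
```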

#### TARGET Variable

In [133]:
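# Binary target: 1 if any UC extent (E1-E3) is recorded, 0 otherwise
df['UC'] = df['UC (E1-E3)'].apply(lambda x: 0 if x == '0' else 1)
# Same binary encoding for the Crohn's Disease location column
df['CD(L1-L4)'] = df['CD(L1-L4)'].apply(lambda x: 0 if x == '0' else 1)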

In [134]:
# Count of Target variable
df['UC'].value_counts()

UC
1    2825
0    2715
Name: count, dtype: int64

In our data, 2825 of the 5540 patients (about 51%) have Ulcerative Colitis.
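The same binary target can also be built without a row-wise `apply`. The sketch below uses a vectorised comparison on a hypothetical `toy` frame and assumes, as in the data shown earlier, that the string '0' marks patients without a UC extent code.

```python
import pandas as pd

# Hypothetical stand-in for the 'UC (E1-E3)' column
toy = pd.DataFrame({"UC (E1-E3)": ["0", "E3", "E2", "0", "E1"]})

# True where an extent code is present, then cast to a clean integer column
toy["UC"] = toy["UC (E1-E3)"].ne("0").astype(int)

counts = toy["UC"].value_counts()
print(counts)
```

This yields a single integer dtype for the target, which most modelling libraries expect.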

In [132]:
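# Save the cleaned data; to_csv returns None when given a path, so nothing is assigned
# index=False omits the 'id' index from the file
df.to_csv('cleaned_data.csv', index=False)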
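If the patient identifier should survive the export, the index can be written out and read back. The round trip below is a sketch on a hypothetical two-row frame, using an in-memory buffer in place of a file on disk.

```python
import io
import pandas as pd

# Hypothetical frame indexed by patient id
toy = pd.DataFrame({"id": [1, 2], "UC": [0, 1]}).set_index("id")

# Write including the index; to_csv returns None when a target is supplied
buffer = io.StringIO()
result = toy.to_csv(buffer)

# Read it back and restore 'id' as the index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="id")

print(result)
print(restored)
```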