##  1. Business Understanding
Our team has been engaged by the Kenya Medical Research Institute to carry out a detailed analysis of hypothyroidism, a common medical condition in our nation. We are tasked with identifying and interpreting the patterns and relationships between hypothyroidism and a patient's medical history.


### 1.1 Business Overview
Hypothyroidism, also known as **underactive thyroid**, is a medical condition in which the thyroid gland fails to produce enough thyroid hormones, leading to symptoms such as fatigue, weight gain, and depression. Iodine deficiency, autoimmune diseases, certain medications, and radiation therapy are common contributing factors.

Age is a major factor: as age increases, the frequency of hypothyroidism, goiters, and thyroid nodules increases. Studies show that 2-20% of older age groups have some form of hypothyroidism. A Framingham study found hypothyroidism in 5.9% of women and 2.4% of men older than 60 years.

The Kenya Medical Research Institute aims to identify patterns and relationships between demographic factors, medical history, and thyroid function test results. By analyzing these patterns, it hopes to improve diagnostic accuracy, tailor treatments more effectively, and enhance the overall understanding of hypothyroidism management.

### 1.2 Problem statement
The Kenya Medical Research Institute is committed to improving the lives of patients with hypothyroidism. We aim to understand how different demographic and clinical factors affect the management and outcomes of this condition.

### 1.3 Objectives
**Main objective**
>To create a machine learning model that can predict whether a patient is at high risk of being diagnosed with hypothyroidism.

**Specific objectives**
1. To determine which risk factors most strongly predispose a patient to the condition
2. To determine which gender is most affected by the condition
3. To determine which age group is most affected by the condition
4. To establish diagnostic criteria for the condition

### 1.4 Success Criteria

### 2. Data Understanding

Description of each feature in the dataset:

1. **status**: Indicates whether the patient is hypothyroid (`hypothyroid` or `negative`).
2. **age**: Age of the patient.
3. **sex**: Gender of the patient (M/F).
4. **on_thyroxine**: Indicates if the patient is on thyroxine medication.
5. **query_on_thyroxine**: Query if the patient is on thyroxine.
6. **on_antithyroid_medication**: Indicates if the patient is on antithyroid medication.
7. **thyroid_surgery**: Indicates if the patient has had thyroid surgery.
8. **query_hypothyroid**: Query if the patient is hypothyroid.
9. **query_hyperthyroid**: Query if the patient is hyperthyroid.
10. **pregnant**: Indicates if the patient is pregnant.
11. **sick**: Indicates if the patient is sick.
12. **tumor**: Indicates if the patient has a tumor.
13. **lithium**: Indicates if the patient is on lithium.
14. **goitre**: Indicates if the patient has a goitre.
15. **TSH_measured**: Indicates whether the TSH level has been measured. Values are 'y' (yes) or 'n' (no).
16. **TSH**: The actual measured value of the TSH level if TSH_measured is 'y'.
17. **T3_measured**: Indicates whether the T3 level has been measured. Values are 'y' (yes) or 'n' (no).
18. **T3**: The actual measured value of the T3 level if T3_measured is 'y'.
19. **TT4_measured**: Indicates whether the TT4 level has been measured. Values are 'y' (yes) or 'n' (no).
20. **TT4**: The actual measured value of the TT4 level if TT4_measured is 'y'.
21. **T4U_measured**: Indicates whether the Thyroxine Uptake (T4U) level has been measured. Values are 'y' (yes) or 'n' (no).
22. **T4U**: The actual measured value of the T4U level if T4U_measured is 'y'.
23. **FTI_measured**: Indicates whether the Free Thyroxine Index (FTI) has been measured. Values are 'y' (yes) or 'n' (no).
24. **FTI**: The actual measured value of the FTI if FTI_measured is 'y'.
25. **TBG_measured**: Indicates whether the Thyroxine-Binding Globulin (TBG) level has been measured. Values are 'y' (yes) or 'n' (no).
26. **TBG**: The actual measured value of the TBG level if TBG_measured is 'y'.
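To make the column roles described above easier to reuse in later cells, the names can be grouped by type. This is only a sketch: the variable names (`TARGET`, `DEMOGRAPHIC`, `HISTORY_FLAGS`, `LAB_TESTS`, `MEASURED_FLAGS`) are assumptions and are not part of the original notebook.

```python
# Hypothetical grouping of the dataset columns by role.
TARGET = "status"
DEMOGRAPHIC = ["age", "sex"]
HISTORY_FLAGS = [
    "on_thyroxine", "query_on_thyroxine", "on_antithyroid_medication",
    "thyroid_surgery", "query_hypothyroid", "query_hyperthyroid",
    "pregnant", "sick", "tumor", "lithium", "goitre",
]
LAB_TESTS = ["TSH", "T3", "TT4", "T4U", "FTI", "TBG"]
# Each lab test has a companion '<name>_measured' flag ('y' / 'n').
MEASURED_FLAGS = [f"{name}_measured" for name in LAB_TESTS]

ALL_COLUMNS = [TARGET] + DEMOGRAPHIC + HISTORY_FLAGS + MEASURED_FLAGS + LAB_TESTS
print(len(ALL_COLUMNS))
```

Keeping the groups in one place means the lists used for type conversion, scaling and encoding do not have to be retyped in each cell.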



In [1]:
# importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

In [2]:
# Load the dataset
df = pd.read_csv("hypothyroid.csv")

In [3]:
#  First five rows of the dataset
df.head()

Unnamed: 0,status,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,thyroid_surgery,query_hypothyroid,query_hyperthyroid,pregnant,...,T3_measured,T3,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG
0,hypothyroid,72,M,f,f,f,f,f,f,f,...,y,0.6,y,15,y,1.48,y,10,n,?
1,hypothyroid,15,F,t,f,f,f,f,f,f,...,y,1.7,y,19,y,1.13,y,17,n,?
2,hypothyroid,24,M,f,f,f,f,f,f,f,...,y,0.2,y,4,y,1.0,y,0,n,?
3,hypothyroid,24,F,f,f,f,f,f,f,f,...,y,0.4,y,6,y,1.04,y,6,n,?
4,hypothyroid,77,M,f,f,f,f,f,f,f,...,y,1.2,y,57,y,1.28,y,44,n,?


In [4]:
#  last five rows of the dataset
df.tail()

Unnamed: 0,status,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,thyroid_surgery,query_hypothyroid,query_hyperthyroid,pregnant,...,T3_measured,T3,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG
3158,negative,58,F,f,f,f,f,f,f,f,...,y,1.7,y,86,y,0.91,y,95,n,?
3159,negative,29,F,f,f,f,f,f,f,f,...,y,1.8,y,99,y,1.01,y,98,n,?
3160,negative,77,M,f,f,f,f,f,f,f,...,y,0.6,y,71,y,0.68,y,104,n,?
3161,negative,74,F,f,f,f,f,f,f,f,...,y,0.1,y,65,y,0.48,y,137,n,?
3162,negative,56,F,t,f,f,f,f,f,f,...,y,1.8,y,139,y,0.97,y,143,n,?


### 3. Data preparation

In [5]:
df.shape

(3163, 26)

In [6]:
len(df)

3163

Our dataframe has 3163 rows and 26 columns: the target column `status` and 25 features. The index is not counted in the shape. The column names are displayed below.
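As a small illustration of how the shape relates to the target and the features, the sketch below uses a made-up three-row frame (`toy`, `X` and `y` are illustrative names, not variables from the notebook).

```python
import pandas as pd

# Toy frame with the same layout idea: one target column plus features.
toy = pd.DataFrame({
    "status": ["hypothyroid", "negative", "negative"],
    "age": [72, 58, 29],
    "sex": ["M", "F", "F"],
})

y = toy["status"]                 # target
X = toy.drop(columns="status")    # features only

n_rows, n_cols = toy.shape        # shape counts data columns, not the index
print(n_rows, n_cols, X.shape[1])
```

Applied to the real frame, the same `drop(columns="status")` call would leave the feature columns only.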

In [7]:
columns = df.columns
print(columns)

Index(['status', 'age', 'sex', 'on_thyroxine', 'query_on_thyroxine',
       'on_antithyroid_medication', 'thyroid_surgery', 'query_hypothyroid',
       'query_hyperthyroid', 'pregnant', 'sick', 'tumor', 'lithium', 'goitre',
       'TSH_measured', 'TSH', 'T3_measured', 'T3', 'TT4_measured', 'TT4',
       'T4U_measured', 'T4U', 'FTI_measured', 'FTI', 'TBG_measured', 'TBG'],
      dtype='object')


In [8]:
# to print unique values
for col in columns:
    print(f"{col}:{df[col].unique()[:10]}")

status:['hypothyroid' 'negative']
age:['72' '15' '24' '77' '85' '64' '20' '42' '69' '75']
sex:['M' 'F' '?']
on_thyroxine:['f' 't']
query_on_thyroxine:['f' 't']
on_antithyroid_medication:['f' 't']
thyroid_surgery:['f' 't']
query_hypothyroid:['f' 't']
query_hyperthyroid:['f' 't']
pregnant:['f' 't']
sick:['f' 't']
tumor:['f' 't']
lithium:['f' 't']
goitre:['f' 't']
TSH_measured:['y' 'n']
TSH:['30' '145' '0' '430' '7.30' '138' '7.70' '21' '92' '48']
T3_measured:['y' 'n']
T3:['0.60' '1.70' '0.20' '0.40' '1.20' '1.10' '1.30' '1.90' '?' '0.80']
TT4_measured:['y' 'n']
TT4:['15' '19' '4' '6' '57' '27' '54' '34' '39' '7.60']
T4U_measured:['y' 'n']
T4U:['1.48' '1.13' '1' '1.04' '1.28' '1.19' '0.86' '1.05' '1.21' '1.02']
FTI_measured:['y' 'n']
FTI:['10' '17' '0' '6' '44' '23' '63' '32' '7.50' '61']
TBG_measured:['n' 'y']
TBG:['?' '28' '34' '0' '19' '30' '25' '48' '39' '31']


In [9]:
# Summary statistics of the dataset
df.describe()

### 4. Data Cleaning
**4.1 Validation**

The first step of data cleaning is **validation** which includes the following steps:
1. Remove irrelevant data: Based on our data description, all columns seem relevant to hypothyroid prediction. Therefore, no columns need to be removed for this initial step.
2. Check for whitespace and correct format:
>Ensure there is no leading or trailing whitespace in the data.
>Verify data types for each column.
>Replace any placeholder values (e.g., '?') with appropriate NaN values for further processing.
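The same validation steps can also be requested while the file is being parsed. The sketch below is an alternative under stated assumptions, not the approach used in this notebook: it reads a tiny in-memory stand-in for `hypothyroid.csv` (the values are made up) and relies on the `na_values` and `skipinitialspace` parameters of `pd.read_csv`.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for hypothyroid.csv (values are made up).
raw = io.StringIO(
    "status, age, sex,TSH\n"
    "hypothyroid,72, M,30\n"
    "negative,?,?,1.2\n"
)

toy = pd.read_csv(
    raw,
    na_values="?",          # treat '?' as missing while parsing
    skipinitialspace=True,  # drop spaces that follow a delimiter
)
toy.columns = toy.columns.str.strip()
print(toy.dtypes)
```

Because the placeholder is handled at load time, columns such as `age` are parsed as numbers straight away instead of as strings.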


In [10]:
# Strip whitespace from the headers
df.columns = df.columns.str.strip()

# Replace '?' with NaN
df.replace('?', np.nan, inplace=True)

# Strip whitespace from string data entries
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].str.strip()

In [11]:
# Check data types
print(df.dtypes)

status                       object
age                          object
sex                          object
on_thyroxine                 object
query_on_thyroxine           object
on_antithyroid_medication    object
thyroid_surgery              object
query_hypothyroid            object
query_hyperthyroid           object
pregnant                     object
sick                         object
tumor                        object
lithium                      object
goitre                       object
TSH_measured                 object
TSH                          object
T3_measured                  object
T3                           object
TT4_measured                 object
TT4                          object
T4U_measured                 object
T4U                          object
FTI_measured                 object
FTI                          object
TBG_measured                 object
TBG                          object
dtype: object


In [12]:
# Convert columns to appropriate data types
# Convert numeric columns that are currently object types
numeric_columns = ['age', 'TSH', 'T3', 'TT4', 'T4U', 'FTI', 'TBG']
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors='coerce')

In [13]:
# Verify data types again after conversion
print(df.dtypes)

status                        object
age                          float64
sex                           object
on_thyroxine                  object
query_on_thyroxine            object
on_antithyroid_medication     object
thyroid_surgery               object
query_hypothyroid             object
query_hyperthyroid            object
pregnant                      object
sick                          object
tumor                         object
lithium                       object
goitre                        object
TSH_measured                  object
TSH                          float64
T3_measured                   object
T3                           float64
TT4_measured                  object
TT4                          float64
T4U_measured                  object
T4U                          float64
FTI_measured                  object
FTI                          float64
TBG_measured                  object
TBG                          float64
dtype: object


In [14]:
# Summary statistics to check for any anomalies after validation
print(df.describe(include='all'))

# Check for unique values in categorical columns after validation
for col in df.select_dtypes(include=['object']).columns:
    print(f"{col}: {df[col].unique()}")

          status          age   sex on_thyroxine query_on_thyroxine  \
count       3163  2717.000000  3090         3163               3163   
unique         2          NaN     2            2                  2   
top     negative          NaN     F            f                  f   
freq        3012          NaN  2182         2702               3108   
mean         NaN    51.154214   NaN          NaN                NaN   
std          NaN    19.294405   NaN          NaN                NaN   
min          NaN     1.000000   NaN          NaN                NaN   
25%          NaN    35.000000   NaN          NaN                NaN   
50%          NaN    54.000000   NaN          NaN                NaN   
75%          NaN    67.000000   NaN          NaN                NaN   
max          NaN    98.000000   NaN          NaN                NaN   

       on_antithyroid_medication thyroid_surgery query_hypothyroid  \
count                       3163            3163              3163   
unique 

Our data is now valid and in the right format: all '?' placeholders have been converted to NaN, and the numeric columns have been cast to `float64`.
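The summary statistics above also report that `negative` is the most frequent `status`, with 3012 of 3163 rows. The short sketch below works out the class shares from those two printed numbers; it is a worked calculation, not an extra step from the original notebook.

```python
import pandas as pd

# Counts taken from the describe() output above: 3163 rows, 3012 'negative'.
total_rows = 3163
negative_rows = 3012
counts = pd.Series(
    {"negative": negative_rows, "hypothyroid": total_rows - negative_rows},
    name="status",
)
shares = (counts / counts.sum()).round(3)
print(shares)

# On the real frame the equivalent call would be:
# df["status"].value_counts(normalize=True)
```

The hypothyroid class is a small minority, which is worth keeping in mind when a model's accuracy is judged later.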

**4.2 Accuracy**

The next step is checking for **accuracy**. In this step we check:
1. Range checks: verifying that numerical values fall within plausible ranges.
>For age, a plausible lifespan is 0-120 years.

For the features below, the listed range is the normal reference range. A value outside it suggests the level is abnormal, i.e. either too high or too low.
>TSH (Thyroid Stimulating Hormone): 0.4 - 4.0 mIU/L

>T3 (Triiodothyronine): 80 - 200 ng/dL

>TT4 (Total Thyroxine): 5.0 - 12.0 µg/dL

>T4U (Thyroxine Uptake): 0.8 - 1.3

>FTI (Free Thyroxine Index): 1.0 - 4.3

>TBG (Thyroxine Binding Globulin): 15 - 30 µg/mL

2. Cross-field validation: checking that related fields have consistent values (e.g., TSH_measured and TSH).
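A range check can also report missing values separately from values that are genuinely outside a range. The sketch below is one possible way to do that on made-up data; `REFERENCE_RANGES` and `range_report` are hypothetical names, and only two of the ranges listed above are included to keep it short.

```python
import numpy as np
import pandas as pd

# Two of the reference ranges quoted above; the dict name is illustrative.
REFERENCE_RANGES = {"TSH": (0.4, 4.0), "T4U": (0.8, 1.3)}

def range_report(frame, ranges):
    """Count missing, in-range and out-of-range values per column."""
    rows = {}
    for col, (low, high) in ranges.items():
        values = frame[col]
        in_range = values.between(low, high)   # NaN evaluates to False here
        missing = values.isna()
        rows[col] = {
            "missing": int(missing.sum()),
            "in_range": int(in_range.sum()),
            "out_of_range": int((~in_range & ~missing).sum()),
        }
    return pd.DataFrame(rows).T

toy = pd.DataFrame({
    "TSH": [30.0, 1.2, np.nan, 0.1],
    "T4U": [1.48, 1.0, 0.9, np.nan],
})
report = range_report(toy, REFERENCE_RANGES)
print(report)
```

Separating the two counts makes it clear how many flagged rows are real measurements and how many are simply not recorded.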

In [15]:
# 1. Range Checks
# Identify and print out-of-range age values
out_of_range_age = df[~df['age'].between(0, 120)]
print("Out-of-range age values shape:")
print(out_of_range_age.shape)

# Handle out-of-range age values by replacing them with NaN
df.loc[~df['age'].between(0, 120), 'age'] = np.nan

# Thyroid hormone levels should be within plausible biological ranges 
# Identify and print out-of-range values for each hormone level
out_of_range_tsh = df[~df['TSH'].between(0.4, 4.0, inclusive='both')]
print("Out-of-range TSH values shape:")
print(out_of_range_tsh.shape)

out_of_range_t3 = df[~df['T3'].between(80, 200, inclusive='both')]
print("Out-of-range T3 values shape:")
print(out_of_range_t3.shape)

out_of_range_tt4 = df[~df['TT4'].between(5.0, 12.0, inclusive='both')]
print("Out-of-range TT4 values shape:")
print(out_of_range_tt4.shape)

out_of_range_t4u = df[~df['T4U'].between(0.8, 1.3, inclusive='both')]
print("Out-of-range T4U values shape:")
print(out_of_range_t4u.shape)

out_of_range_fti = df[~df['FTI'].between(1.0, 4.3, inclusive='both')]
print("Out-of-range FTI values shape:")
print(out_of_range_fti.shape)

out_of_range_tbg = df[~df['TBG'].between(15, 30, inclusive='both')]
print("Out-of-range TBG values shape:")
print(out_of_range_tbg.shape)

# 2. Cross-Field Validation
# If TSH_measured is 'n', TSH should be NaN
df.loc[df['TSH_measured'] == 'n', 'TSH'] = np.nan

# If T3_measured is 'n', T3 should be NaN
df.loc[df['T3_measured'] == 'n', 'T3'] = np.nan

# If TT4_measured is 'n', TT4 should be NaN
df.loc[df['TT4_measured'] == 'n', 'TT4'] = np.nan
# If T4U_measured is 'n', T4U should be NaN
df.loc[df['T4U_measured'] == 'n', 'T4U'] = np.nan

# If FTI_measured is 'n', FTI should be NaN
df.loc[df['FTI_measured'] == 'n', 'FTI'] = np.nan

# If TBG_measured is 'n', TBG should be NaN
df.loc[df['TBG_measured'] == 'n', 'TBG'] = np.nan


Out-of-range age values shape:
(446, 26)
Out-of-range TSH values shape:
(2042, 26)
Out-of-range T3 values shape:
(3163, 26)
Out-of-range TT4 values shape:
(3139, 26)
Out-of-range T4U values shape:
(890, 26)
Out-of-range FTI values shape:
(3156, 26)
Out-of-range TBG values shape:
(3015, 26)


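The output above shows the shape (rows, columns) of each out-of-range subset; the first number is the count of flagged rows. Because `between` evaluates to False for NaN, rows with a missing value are also counted as out of range.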
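The cross-field rule in the cell above overwrites values without reporting how many rows were inconsistent. A possible audit, run before the overwrite, is sketched below on made-up data; `flag_value_mismatches` is a hypothetical helper, not a function from the notebook or from pandas.

```python
import numpy as np
import pandas as pd

def flag_value_mismatches(frame, tests):
    """Count rows where '<test>_measured' and the test value disagree."""
    summary = {}
    for test in tests:
        measured = frame[f"{test}_measured"].eq("y")
        has_value = frame[test].notna()
        summary[test] = {
            "value_without_flag": int((~measured & has_value).sum()),
            "flag_without_value": int((measured & ~has_value).sum()),
        }
    return pd.DataFrame(summary).T

toy = pd.DataFrame({
    "TSH_measured": ["y", "n", "y", "n"],
    "TSH": [30.0, np.nan, np.nan, 2.5],
})
mismatches = flag_value_mismatches(toy, ["TSH"])
print(mismatches)
```

If both counts are zero on the real data, the `_measured` flags and the values already agree and the overwrite changes nothing.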

In [16]:
print("Out-of-range age values:")
print(out_of_range_age['age'].unique())

print( )
print("Unique out-of-range TSH values:")
print(out_of_range_tsh['TSH'].unique())

print( )
print("Unique out-of-range T3 values:")
print(out_of_range_t3['T3'].unique())

print( )
print("Unique out-of-range TT4 values:")
print(out_of_range_tt4['TT4'].unique())

print( )
print("Unique out-of-range T4U values:")
print(out_of_range_t4u['T4U'].unique())

print( )
print("Unique out-of-range FTI values:")
print(out_of_range_fti['FTI'].unique())

print( )
print("Unique out-of-range TBG values:")
print(out_of_range_tbg['TBG'].unique())


Out-of-range age values:
[nan]

Unique out-of-range TSH values:
[3.00e+01 1.45e+02 0.00e+00 4.30e+02 7.30e+00 1.38e+02 7.70e+00 2.10e+01
 9.20e+01 4.80e+01 3.60e+01 1.50e+01 1.53e+01 2.50e+01 6.10e+01 2.80e+01
 1.70e+02 5.40e+01 2.16e+02 5.60e+01 7.10e+01 4.60e+01 7.00e+01 3.40e+01
 5.30e+01 9.40e+00 1.26e+02 1.00e+01 5.30e+02 3.50e+01 6.50e+01 5.70e+01
 1.25e+02 2.30e+01 8.00e+01 1.17e+02 4.90e+01 6.60e+01 8.20e+00 1.50e+02
      nan 1.80e+01 1.65e+02 1.64e+02 2.40e+01 9.00e+01 7.70e+01 1.90e+01
 5.80e+01 1.00e+02 2.13e+02 1.70e+01 2.35e+02 1.53e+02 1.30e+01 3.10e+01
 1.09e+02 2.60e+02 4.30e+01 1.20e+01 1.10e+01 5.50e+01 6.50e+00 2.00e+01
 7.50e+00 1.40e+01 6.00e+01 1.40e+02 3.30e+01 8.70e+00 2.50e-01 1.07e+01
 8.20e+01 4.50e+01 4.20e+01 4.10e+01 1.60e+02 1.60e+01 8.90e+01 4.40e+01
 1.76e+02 6.40e+00 1.83e+02 2.90e+01 3.70e+01 3.90e+01 7.90e+00 5.90e+01
 6.80e+01 3.80e+01 4.70e+01 1.43e+02 6.60e+00 2.88e+02 9.60e+01 9.00e-02
 3.00e-01 4.60e+00 2.00e-01 5.80e+00 4.90e+00 1.03e+01 5.10e

Although many measured values fall outside the normal reference ranges, the data appears to be accurate. This is supported by the unique out-of-range values shown above.
>For **age**, the only out-of-range value is NaN: every recorded age lies between 0 and 120 years, and the 446 flagged rows are those where age was not given.

>For the other measured features, the out-of-range values are genuine measurements that reflect abnormal levels rather than data-entry errors. The very high counts for T3, TT4 and FTI (nearly every row) also suggest that these columns may be recorded in different units from the reference ranges quoted above, so those ranges are best treated as a rough guide.

**4.3 Completeness**

The third step is checking for **completeness**. Here we deal with missing values.

In [17]:
# Completeness Checks
# Check for any remaining NaN values
print(df.isna().sum())

# Fill missing values for categorical columns with the mode
categorical_columns = df.select_dtypes(include=['object']).columns
for col in categorical_columns:
    # Assign back instead of using inplace=True on a column selection
    df[col] = df[col].fillna(df[col].mode()[0])

# Fill missing values for numerical columns with the median
for col in numeric_columns:
    df[col] = df[col].fillna(df[col].median())

status                          0
age                           446
sex                            73
on_thyroxine                    0
query_on_thyroxine              0
on_antithyroid_medication       0
thyroid_surgery                 0
query_hypothyroid               0
query_hyperthyroid              0
pregnant                        0
sick                            0
tumor                           0
lithium                         0
goitre                          0
TSH_measured                    0
TSH                           468
T3_measured                     0
T3                            695
TT4_measured                    0
TT4                           249
T4U_measured                    0
T4U                           248
FTI_measured                    0
FTI                           247
TBG_measured                    0
TBG                          2903
dtype: int64


In [18]:
# Check for any remaining NaN values
print(df.isna().sum())

status                       0
age                          0
sex                          0
on_thyroxine                 0
query_on_thyroxine           0
on_antithyroid_medication    0
thyroid_surgery              0
query_hypothyroid            0
query_hyperthyroid           0
pregnant                     0
sick                         0
tumor                        0
lithium                      0
goitre                       0
TSH_measured                 0
TSH                          0
T3_measured                  0
T3                           0
TT4_measured                 0
TT4                          0
T4U_measured                 0
T4U                          0
FTI_measured                 0
FTI                          0
TBG_measured                 0
TBG                          0
dtype: int64


Given the nature of our features, they all seem important, and dropping any of them would lead to a loss of important information. We therefore filled missing categorical values with the mode and missing numerical values with the median. Our data can now be confirmed as complete, with no missing values.
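Before relying on imputation, it can help to see what share of each column was missing. The sketch below reuses the missing-value counts printed above; the 0.5 cut-off is a hypothetical threshold chosen for illustration, not a rule from this notebook.

```python
import pandas as pd

# Missing-value counts copied from the isna().sum() output above.
missing_counts = pd.Series(
    {"age": 446, "sex": 73, "TSH": 468, "T3": 695,
     "TT4": 249, "T4U": 248, "FTI": 247, "TBG": 2903}
)
n_rows = 3163  # from df.shape

missing_share = (missing_counts / n_rows).sort_values(ascending=False)
print(missing_share.round(3))

# Hypothetical cut-off: flag columns where more than half the rows are missing.
THRESHOLD = 0.5
mostly_missing = missing_share[missing_share > THRESHOLD].index.tolist()
print(mostly_missing)
```

For a column flagged this way, most of the filled values are the median itself, so the column carries little information after imputation.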

**4.4 Consistency**

This is the fourth step of data cleaning. It involves checking for duplicate rows.

In [19]:
# Check for duplicates
duplicates = df.duplicated()
print(f"Number of duplicate rows: {duplicates.sum()}")

Number of duplicate rows: 78
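The cells here only count and display the duplicates. If the team decides they should be removed, a minimal sketch on made-up data could look like the block below. Note the assumption: the dataset has no patient identifier, so identical rows may belong to different patients, and identical rows can also arise once missing values have been filled.

```python
import pandas as pd

toy = pd.DataFrame({
    "status": ["negative", "negative", "hypothyroid", "negative"],
    "age": [58.0, 58.0, 69.0, 29.0],
    "sex": ["F", "F", "F", "F"],
})

n_before = len(toy)
# keep="first" retains one copy of each repeated row.
deduplicated = toy.drop_duplicates(keep="first").reset_index(drop=True)
print(f"Removed {n_before - len(deduplicated)} duplicate rows")
```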


In [20]:
# Print duplicate rows if any
if duplicates.any():
    print("Duplicate rows:")
    print(df[duplicates])

Duplicate rows:
           status   age sex on_thyroxine query_on_thyroxine  \
53    hypothyroid  69.0   F            f                  f   
66    hypothyroid  62.0   M            f                  f   
124   hypothyroid  77.0   F            f                  f   
128   hypothyroid  79.0   F            f                  f   
131   hypothyroid  50.0   F            t                  f   
...           ...   ...  ..          ...                ...   
3048     negative  28.0   M            f                  f   
3055     negative  33.0   F            f                  f   
3066     negative  74.0   F            t                  f   
3111     negative  89.0   M            f                  f   
3151     negative  58.0   F            f                  f   

     on_antithyroid_medication thyroid_surgery query_hypothyroid  \
53                           f               f                 f   
66                           f               f                 f   
124                    

**4.5 Uniformity**

This is the last step of data cleaning. It involves standardizing our data in preparation for modeling.
In this case we will use:
1. `StandardScaler` for numerical data.
2. One-hot encoding (`pd.get_dummies`) for categorical data.
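The two steps listed above can also be bundled into one preprocessing object. The sketch below is an alternative, not the method used in the following cells: it swaps `pd.get_dummies` for scikit-learn's `OneHotEncoder`, combines it with `StandardScaler` in a `ColumnTransformer`, and fits on a training split only so that the scaling statistics are not influenced by the test rows. The data and the column lists are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "age": [72.0, 15.0, 24.0, 58.0, 29.0, 77.0],
    "TSH": [30.0, 145.0, 0.0, 1.2, 2.1, 0.9],
    "sex": ["M", "F", "M", "F", "F", "M"],
    "on_thyroxine": ["f", "t", "f", "f", "f", "t"],
})
num_cols = ["age", "TSH"]
cat_cols = ["sex", "on_thyroxine"]

train, test = train_test_split(toy, test_size=2, random_state=0)

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_ready = preprocess.fit_transform(train)  # statistics learned on train only
test_ready = preprocess.transform(test)        # reused, not re-fitted
print(train_ready.shape, test_ready.shape)
```

With `handle_unknown="ignore"`, a category that appears only in the test rows is encoded as all zeros instead of raising an error.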

In [21]:
# Standardize the numerical columns
scaler = StandardScaler()
df[numeric_columns] = scaler.fit_transform(df[numeric_columns])

In [22]:
df.head()

Unnamed: 0,status,age,sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,thyroid_surgery,query_hypothyroid,query_hyperthyroid,pregnant,...,T3_measured,T3,TT4_measured,TT4,T4U_measured,T4U,FTI_measured,FTI,TBG_measured,TBG
0,hypothyroid,1.141734,M,f,f,f,f,f,f,f,...,y,-1.483836,y,-2.14031,y,2.313337,y,-1.809839,n,-0.04841
1,hypothyroid,-2.041459,F,t,f,f,f,f,f,f,...,y,-0.236955,y,-2.048715,y,0.704388,y,-1.688885,n,-0.04841
2,hypothyroid,-1.538849,M,f,f,f,f,f,f,f,...,y,-1.937247,y,-2.392197,y,0.106779,y,-1.982629,n,-0.04841
3,hypothyroid,-1.538849,F,f,f,f,f,f,f,f,...,y,-1.710542,y,-2.346399,y,0.290659,y,-1.878955,n,-0.04841
4,hypothyroid,1.420961,M,f,f,f,f,f,f,f,...,y,-0.803719,y,-1.17856,y,1.393938,y,-1.222352,n,-0.04841


In [23]:
# List of categorical columns to be one-hot encoded
categorical_columns = ['sex', 'on_thyroxine', 'query_on_thyroxine', 'on_antithyroid_medication', 'thyroid_surgery', 'query_hypothyroid', 'query_hyperthyroid', 'pregnant', 'sick', 'tumor', 'lithium', 'goitre']

# Apply one-hot encoding
df = pd.get_dummies(df, columns=categorical_columns)

# Check the resulting DataFrame
df.head()


Unnamed: 0,status,age,TSH_measured,TSH,T3_measured,T3,TT4_measured,TT4,T4U_measured,T4U,...,pregnant_f,pregnant_t,sick_f,sick_t,tumor_f,tumor_t,lithium_f,lithium_t,goitre_f,goitre_t
0,hypothyroid,1.141734,y,1.122672,y,-1.483836,y,-2.14031,y,2.313337,...,True,False,True,False,True,False,True,False,True,False
1,hypothyroid,-2.041459,y,6.318212,y,-0.236955,y,-2.048715,y,0.704388,...,True,False,True,False,True,False,True,False,True,False
2,hypothyroid,-1.538849,y,-0.232686,y,-1.937247,y,-2.392197,y,0.106779,...,True,False,True,False,True,False,True,False,True,False
3,hypothyroid,-1.538849,y,19.194114,y,-1.710542,y,-2.346399,y,0.290659,...,True,False,True,False,True,False,True,False,True,False
4,hypothyroid,1.420961,y,0.097118,y,-0.803719,y,-1.17856,y,1.393938,...,True,False,True,False,True,False,True,False,True,False


Our data is now standardized. Notice that the number of columns increases to 38. This is because one-hot encoding replaced each of the 12 categorical columns with two dummy features (26 - 12 + 24 = 38).
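The column count can be checked with a short calculation, and the same call also accepts `drop_first=True` for cases where one dummy column per binary feature is enough. The sketch below uses a made-up three-row frame; whether to drop the first level is a modelling choice and not something decided in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [72.0, 15.0, 58.0],
    "sex": ["M", "F", "F"],
    "pregnant": ["f", "f", "t"],
})

full = pd.get_dummies(toy, columns=["sex", "pregnant"])
compact = pd.get_dummies(toy, columns=["sex", "pregnant"], drop_first=True)

print(list(full.columns))     # two dummy columns per binary feature
print(list(compact.columns))  # one dummy column per binary feature

# Column count for the real frame: 26 original - 12 encoded + 12 * 2 dummies.
expected_columns = 26 - 12 + 12 * 2
print(expected_columns)
```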