## Part 3: Data Preparation for Machine Learning (ML)

__This part consists of 2 sections:__
1. Encoding categorical values
2. Splitting Train and Test data sets with various predictor variables

In [23]:
import pandas as pd

thyroiddata = pd.read_csv("thyroiddata.csv")

## 1. Encoding categorical variables into numerical values

Here, we create a second dataframe, "thyroiddata_filtered", and encode its categorical variables as integers so that our machine learning models can handle them.

In [26]:
relevant_columns = [
    "Age", "Gender", "Currently Smoking", "Smoking History", "Radiotherapy History",
    "Adenopathy", "Focality", "Risk", "Tumor", "Lymph Nodes", "Cancer Metastasis", "Recurred"
]

thyroiddata_filtered = thyroiddata.copy()

#### Our current data:

In [28]:
thyroiddata_filtered.head()

Unnamed: 0,Age,Gender,Currently Smoking,Smoking History,Radiotherapy History,Thyroid Function,Physical Examination,Adenopathy,Types of Thyroid Cancer (Pathology),Focality,Risk,Tumor,Lymph Nodes,Cancer Metastasis,Stage,Treatment Response,Recurred
0,27,F,No,No,No,Euthyroid,Single nodular goiter-left,No Lympth Adenopathy,Micropapillary,Uni-Focal,Low,tumor is less than or equal to 1cm,no evidence of regional lymph node metastasis,no evidence of distant metastasis,First-Stage,Indeterminate,No
1,34,F,No,Yes,No,Euthyroid,Multinodular goiter,No Lympth Adenopathy,Micropapillary,Uni-Focal,Low,tumor is less than or equal to 1cm,no evidence of regional lymph node metastasis,no evidence of distant metastasis,First-Stage,Excellent,No
2,30,F,No,No,No,Euthyroid,Single nodular goiter-right,No Lympth Adenopathy,Micropapillary,Uni-Focal,Low,tumor is less than or equal to 1cm,no evidence of regional lymph node metastasis,no evidence of distant metastasis,First-Stage,Excellent,No
3,62,F,No,No,No,Euthyroid,Single nodular goiter-right,No Lympth Adenopathy,Micropapillary,Uni-Focal,Low,tumor is less than or equal to 1cm,no evidence of regional lymph node metastasis,no evidence of distant metastasis,First-Stage,Excellent,No
4,62,F,No,No,No,Euthyroid,Multinodular goiter,No Lympth Adenopathy,Micropapillary,Multi-Focal,Low,tumor is less than or equal to 1cm,no evidence of regional lymph node metastasis,no evidence of distant metastasis,First-Stage,Excellent,No


### Here, we identify the distinct values in each column that will later be used as a predictor variable.

In [31]:
columns_to_check = ["Gender", "Currently Smoking", "Smoking History", "Radiotherapy History", "Thyroid Function", "Physical Examination", "Adenopathy", 
                    "Types of Thyroid Cancer (Pathology)", "Focality", "Risk", "Tumor", 
                    "Lymph Nodes", "Cancer Metastasis", "Stage", "Treatment Response"]

for col in columns_to_check:
    # unique() returns the distinct values in order of first appearance
    unique_values = thyroiddata_filtered[col].unique().tolist()
    print(f"Number of distinct variables in '{col}':", len(unique_values))
    print("Variables include:", unique_values)
    print()  # Adds a blank line between outputs


Number of distinct variables in 'Gender': 2
Variables include: ['F', 'M']

Number of distinct variables in 'Currently Smoking': 2
Variables include: ['No', 'Yes']

Number of distinct variables in 'Smoking History': 2
Variables include: ['No', 'Yes']

Number of distinct variables in 'Radiotherapy History': 2
Variables include: ['No', 'Yes']

Number of distinct variables in 'Thyroid Function': 5
Variables include: ['Euthyroid', 'Clinical Hyperthyroidism', 'Clinical Hypothyroidism', 'Subclinical Hyperthyroidism', 'Subclinical Hypothyroidism']

Number of distinct variables in 'Physical Examination': 5
Variables include: ['Single nodular goiter-left', 'Multinodular goiter', 'Single nodular goiter-right', 'Normal', 'Diffuse goiter']

Number of distinct variables in 'Adenopathy': 6
Variables include: ['No Lympth Adenopathy', 'Right Side Body Adenopathy', 'Extensive and Widespread', 'Left Side Body Adenopathy', 'Bilateral', 'Posterior']

Number of distinct variables in 'Types of Thyroid Cancer
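
Besides listing the distinct values, it can help to see how many rows fall into each category, since rare categories give a model very little to learn from. The snippet below is a minimal sketch on a toy dataframe (the `toy` frame and its values are made up for illustration and are not taken from the thyroid data); it uses pandas' `value_counts()`.

```python
import pandas as pd

# Toy stand-in for thyroiddata_filtered; the values are illustrative only.
toy = pd.DataFrame({
    "Gender": ["F", "F", "M", "F"],
    "Risk": ["Low", "High", "Low", "Intermediate"],
})

# value_counts() returns the number of rows per category,
# sorted from the most frequent to the least frequent.
risk_counts = toy["Risk"].value_counts()
print(risk_counts)
```

Running the same call on each column in `columns_to_check` would show whether any category is represented by only a handful of patients.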

In [49]:
# Encode categorical variables
# For example, Gender: F -> 0, M -> 1
thyroiddata_filtered["Gender"] = thyroiddata["Gender"].map({"F": 0, "M": 1})
thyroiddata_filtered["Currently Smoking"] = thyroiddata["Currently Smoking"].map({"No": 0, "Yes": 1})
thyroiddata_filtered["Smoking History"] = thyroiddata["Smoking History"].map({"No": 0, "Yes": 1})
thyroiddata_filtered["Adenopathy"] = thyroiddata["Adenopathy"].map({"No Lympth Adenopathy": 0, "Right Side Body Adenopathy": 1, "Extensive and Widespread": 2, "Left Side Body Adenopathy": 3, "Bilateral": 4, "Posterior": 5})
thyroiddata_filtered["Radiotherapy History"] = thyroiddata["Radiotherapy History"].map({"No": 0, "Yes": 1}) 
thyroiddata_filtered["Focality"] = thyroiddata["Focality"].map({"Uni-Focal": 0, "Multi-Focal": 1}) 
thyroiddata_filtered["Risk"] = thyroiddata["Risk"].map({"Low": 0, "Intermediate": 1, "High": 2}) 
thyroiddata_filtered["Tumor"] = thyroiddata["Tumor"].map({'tumor is less than or equal to 1cm': 0, 
                                                          'tumor between the size of 1cm to 2cm inclusive': 1, 
                                                          'tumor between the size of 2cm to 4cm inclusive': 2, 
                                                          'tumor larger than the size of 4 cm': 3, 
                                                          'tumor that has grown outside the thyroid': 4, 
                                                          'tumor that has invaded nearby Head and Neck structures': 5,
                                                          'tumor that has invaded nearby Cervicothoracic Spine and Vascular structures': 6})

thyroiddata_filtered["Cancer Metastasis"] = thyroiddata["Cancer Metastasis"].map({"no evidence of distant metastasis": 0, "presence of distant metastasis": 1}) 

thyroiddata_filtered["Lymph Nodes"] = thyroiddata["Lymph Nodes"].map({"no evidence of regional lymph node metastasis": 0,
                                                                      'regional lymph node metastasis in the central of the neck': 1,
                                                                      'regional lymph node metastasis in the lateral of the neck': 2})

# Encode remaining categorical variables

# Thyroid Function
thyroiddata_filtered["Thyroid Function"] = thyroiddata["Thyroid Function"].map({
    "Euthyroid": 0,
    "Clinical Hyperthyroidism": 1,
    "Clinical Hypothyroidism": 2,
    "Subclinical Hyperthyroidism": 3,
    "Subclinical Hypothyroidism": 4
})

# Physical Examination
thyroiddata_filtered["Physical Examination"] = thyroiddata["Physical Examination"].map({
    "Normal": 0,
    "Single nodular goiter-left": 1,
    "Single nodular goiter-right": 2,
    "Multinodular goiter": 3,
    "Diffuse goiter": 4
})

# Types of Thyroid Cancer (Pathology)
thyroiddata_filtered["Types of Thyroid Cancer (Pathology)"] = thyroiddata["Types of Thyroid Cancer (Pathology)"].map({
    "Micropapillary": 0,
    "Papillary": 1,
    "Follicular": 2,
    "Hurthel cell": 3
})

# Stage
thyroiddata_filtered["Stage"] = thyroiddata["Stage"].map({
    "First-Stage": 0,
    "Second-Stage": 1,
    "Third-Stage": 2,
    "IVA": 3,
    "IVB": 4
})

# Treatment Response
thyroiddata_filtered["Treatment Response"] = thyroiddata["Treatment Response"].map({
    "Excellent": 0,
    "Indeterminate": 1,
    "Biochemical Incomplete": 2,
    "Structural Incomplete": 3
})
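
One thing to watch with dictionary-based encoding: `Series.map` returns `NaN` for any label that is missing from the dictionary, for example because of a spelling difference between the data and the mapping. The sketch below shows one possible way to detect such labels. The toy series and the `risk_map` name are assumptions made for illustration only.

```python
import pandas as pd

# Toy column; "Unknown" is deliberately missing from the mapping below.
raw = pd.Series(["Low", "High", "Unknown", "Intermediate"], name="Risk")
risk_map = {"Low": 0, "Intermediate": 1, "High": 2}

encoded = raw.map(risk_map)

# Labels that are absent from the dictionary come back as NaN.
unmapped = raw[encoded.isna()].unique().tolist()
print("Unmapped labels:", unmapped)
```

Applying a check like this to each encoded column of `thyroiddata_filtered` would confirm that every category received a code before the data reaches a model.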

In [35]:
thyroiddata_filtered.head()

Unnamed: 0,Age,Gender,Currently Smoking,Smoking History,Radiotherapy History,Thyroid Function,Physical Examination,Adenopathy,Types of Thyroid Cancer (Pathology),Focality,Risk,Tumor,Lymph Nodes,Cancer Metastasis,Stage,Treatment Response,Recurred
0,27,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,No
1,34,0,0,1,0,0,3,0,0,0,0,0,0,0,0,0,No
2,30,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,No
3,62,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,No
4,62,0,0,0,0,0,3,0,0,1,0,0,0,0,0,0,No


#### Now that our data is encoded and prepared, we will move on to the Machine Learning part.


## 2. Splitting the data into Train and Test sets
We __shuffled__ our data first and then used a __3:1__ ratio to split the data into training and testing sets. <br>
To __evaluate the impact__ of different predictor variables on model performance, we created __four versions__ of the __training dataset__ (each with its own test set), each containing a different number of features. Guided by __insights from EDA__, we __systematically__ excluded irrelevant variables while retaining consistently important ones like __“Age” and “Gender”__ in every version.

This approach allowed us to __assess how model accuracy__ and generalizability respond to changes in feature selection, helping us strike a __balance between complexity and performance__.
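
Because `train_test_split` draws a new random partition on every call unless `random_state` is set, splitting each predictor set separately means the feature sets are evaluated on different rows. The following is a minimal sketch of one alternative, not the method used in this notebook: split the row labels once and reuse them for every predictor set. The toy dataframe, the `feature_sets` dictionary and its keys are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the encoded dataframe; values are illustrative only.
toy = pd.DataFrame({
    "Age": [27, 34, 30, 62, 62, 52, 41, 46],
    "Gender": [0, 0, 0, 0, 0, 1, 1, 0],
    "Risk": [0, 0, 0, 0, 0, 2, 1, 1],
    "Recurred": ["No", "No", "No", "No", "No", "Yes", "Yes", "No"],
})

# Split the row labels once (3:1), with a fixed seed for reproducibility.
train_idx, test_idx = train_test_split(
    toy.index.to_numpy(), test_size=0.25, random_state=42
)

# Hypothetical predictor sets of increasing size.
feature_sets = {
    "small": ["Age", "Gender"],
    "large": ["Age", "Gender", "Risk"],
}

# Every predictor set reuses the same training and test rows.
splits = {
    name: (toy.loc[train_idx, cols], toy.loc[test_idx, cols])
    for name, cols in feature_sets.items()
}
y_train = toy.loc[train_idx, "Recurred"]
y_test = toy.loc[test_idx, "Recurred"]

print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

With a shared partition, a difference in accuracy between two predictor sets reflects the features themselves and not which patients happened to land in the test set.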

### First Set of Predictors: X

__The Target Variable__: "Recurred"  <br>
__The Predictor Variables [X]__: "Age", "Gender", "Smoking History" <br>


In [87]:
from sklearn.model_selection import train_test_split

#Shuffling the data
thyroiddata_filtered = thyroiddata_filtered.sample(frac=1, random_state=42) 

y = pd.DataFrame(thyroiddata_filtered['Recurred'])

X = pd.DataFrame(thyroiddata_filtered[["Age", "Gender", "Smoking History"]]) 


# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

X_train.head()

Unnamed: 0,Age,Gender,Smoking History
95,26,0,1
78,35,0,0
179,67,1,0
329,56,0,0
266,19,0,0


### Second Set of Predictors: X2
__The Target Variable__: "Recurred"  <br>
__The Predictor Variables [X2]__: "Age", "Gender", "Currently Smoking", "Smoking History", "Adenopathy" <br>

In [89]:
X2 = pd.DataFrame(thyroiddata_filtered[["Age", "Gender", "Currently Smoking", "Smoking History", "Adenopathy"]]) 

# Split the Dataset into Train and Test
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size = 0.25)
X2_train.head()

Unnamed: 0,Age,Gender,Currently Smoking,Smoking History,Adenopathy
217,28,0,1,0,0
59,43,0,0,0,0
78,35,0,0,0,0
159,24,0,0,0,0
227,21,0,0,0,1


### Third Set of Predictors: X3

__The Target Variable__: "Recurred"  <br>
__The Predictor Variables [X3]__: "Age", "Gender", "Currently Smoking", "Smoking History", "Adenopathy", "Risk", "Treatment Response"

In [91]:
# Third predictor set: X2 plus "Risk" and "Treatment Response"
X3 = pd.DataFrame(thyroiddata_filtered[["Age", "Gender", "Currently Smoking", "Smoking History", "Adenopathy", "Risk", "Treatment Response"]]) 

# Split the Dataset into Train and Test
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y, test_size = 0.25)

X3_train.head()

Unnamed: 0,Age,Gender,Currently Smoking,Smoking History,Adenopathy,Risk,Treatment Response
379,81,1,1,0,2,2,3
326,35,0,0,0,1,1,2
201,25,0,0,0,1,0,0
329,56,0,0,0,0,0,3
356,54,1,1,0,1,1,3


### Fourth Set of Predictors: X4


__The Target Variable__: "Recurred"  <br>
__The Predictor Variables [X4]__: __ALL__ of the encoded columns: "Age", "Gender", "Currently Smoking", "Smoking History", "Radiotherapy History", "Thyroid Function", "Physical Examination", "Adenopathy", "Types of Thyroid Cancer (Pathology)", "Focality", "Risk", "Tumor", "Lymph Nodes", "Cancer Metastasis", "Stage", "Treatment Response"

In [93]:
# Use all of the variables as predictors here
X4 = pd.DataFrame(thyroiddata_filtered[["Age", "Gender", "Currently Smoking", "Smoking History", "Radiotherapy History", "Thyroid Function", "Physical Examination", "Adenopathy", 
                    "Types of Thyroid Cancer (Pathology)", "Focality", "Risk", "Tumor", 
                    "Lymph Nodes", "Cancer Metastasis", "Stage", "Treatment Response"]])

# Split the Dataset into Train and Test
X4_train, X4_test, y4_train, y4_test = train_test_split(X4, y, test_size = 0.25)

X4_train.head()

Unnamed: 0,Age,Gender,Currently Smoking,Smoking History,Radiotherapy History,Thyroid Function,Physical Examination,Adenopathy,Types of Thyroid Cancer (Pathology),Focality,Risk,Tumor,Lymph Nodes,Cancer Metastasis,Stage,Treatment Response
48,26,0,0,0,0,0,0,2,1,0,1,0,1,0,0,3
114,26,0,0,0,0,0,3,0,1,1,0,2,0,0,0,0
177,52,0,0,0,0,0,3,0,1,0,0,2,0,0,0,0
370,78,1,1,1,1,1,3,0,2,1,2,5,0,1,4,3
255,37,0,0,0,0,0,2,0,1,1,0,3,0,0,0,0


### Saving the Outputs as a New File
__Explanation:__ Our machine learning models were trained and tested on __all eight datasets__ (the training and test sets for X, X2, X3 and X4), each yielding different accuracy results. Ultimately, we found that the models trained on __datasets X3 and X4__ generally performed best. Therefore, we’ve chosen to highlight these two datasets to showcase our feature selection process.

This step is included as an __additional component__ in the "datasets" folder of our GitHub repository for __documentation purposes__ and is __not__ part of the main codebase.

In [80]:
X3_train.to_csv('X3_train.csv', index=False)
print("Saved the X3_train dataset as 'X3_train.csv'")

Saved the X3_train dataset as 'X3_train.csv'


In [78]:
X3_test.to_csv('X3_test.csv', index=False)
print("Saved the X3_test dataset as 'X3_test.csv'")

Saved the X3_test dataset as 'X3_test.csv'


In [82]:
X4_train.to_csv('X4_train.csv', index=False)
print("Saved the X4_train dataset as 'X4_train.csv'")

Saved the X4_train dataset as 'X4_train.csv'


In [103]:
X4_test.to_csv('X4_test.csv', index=False)
print("Saved the X4_test dataset as 'X4_test.csv'")

Saved the X4_test dataset as 'X4_test.csv'
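
Note that the files above contain only the predictor columns, and `index=False` drops the original row labels. If the saved files need to be usable on their own, one option is to store the target alongside the predictors and keep the index. The snippet below is a sketch of that idea on toy data; the file name `train_with_target.csv` and the `row_id` label are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy training split; the index holds the original row labels.
X_train = pd.DataFrame(
    {"Age": [26, 35, 67], "Gender": [0, 0, 1]}, index=[95, 78, 179]
)
y_train = pd.Series(["No", "No", "Yes"], index=[95, 78, 179], name="Recurred")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_with_target.csv")  # hypothetical file name
    # Keep the index so each saved row can be traced back to the source data.
    X_train.join(y_train).to_csv(path, index=True, index_label="row_id")
    # Read the file back to confirm that the round trip preserves the data.
    restored = pd.read_csv(path, index_col="row_id")

print(restored)
```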


### Saving our encoded main dataset as well

In [101]:
thyroiddata_filtered.to_csv('thyroiddata_filtered.csv', index=False)
print("Saved the encoded main data set as 'thyroiddata_filtered.csv'")

Saved the encoded main data set as 'thyroiddata_filtered.csv'
