Data Overview:
Train data (train_dataset):

Dataset: 1500 entries, 10 columns

Columns:
Patient_ID: Identifier (not useful for prediction, can be ignored)
Age: Numerical
Marital_Status: Categorical (married/single)
Year of Operation: Numerical 
Positive_Axillary_Nodes: Numerical
Tumor_Size: Numerical (cm)
Radiation_Therapy: Categorical (Yes/No)
Chemotherapy: Categorical (Yes/No)
Hormone_Therapy: Categorical (Yes/No)
Survival_Status: Target variable (1 = Survived/0 = Not Survived)

0-No missing data, so no imputation is needed
1-Drop Patient_ID because it is not useful for prediction
2-Encoding: convert the categorical variables (Marital_Status, Radiation_Therapy, Chemotherapy, Hormone_Therapy) into numerical form with LabelEncoder
3-Scale the numerical data (Age, Positive_Axillary_Nodes, and Tumor_Size; the scaled-data cell also scales Year of Operation) with StandardScaler
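
Steps 0 to 2 above can be sketched on a toy frame. This is a minimal sketch, not the notebook's data: the four rows are made up, only two of the four categorical columns are shown, and keeping one encoder per column in a dict (`encoders`, a hypothetical name) is an assumption that keeps each column's mapping available for later reuse.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows that reuse column names from train.csv (the values are made up).
toy = pd.DataFrame({
    "Patient_ID": [1, 2, 3, 4],
    "Age": [77, 36, 47, 54],
    "Marital_Status": ["married", "single", "married", "single"],
    "Radiation_Therapy": ["No", "Yes", "No", "No"],
    "Survival_Status": [1, 0, 1, 0],
})

# Step 0: count missing values per column to confirm nothing needs imputing.
missing_per_column = toy.isna().sum()

# Step 1: drop the identifier and separate the target from the features.
X = toy.drop(columns=["Patient_ID", "Survival_Status"])
y = toy["Survival_Status"]

# Step 2: fit one LabelEncoder per categorical column and keep it around.
encoders = {}
for col in ["Marital_Status", "Radiation_Therapy"]:
    encoders[col] = LabelEncoder()
    X[col] = encoders[col].fit_transform(X[col])

print("Total missing values:", missing_per_column.sum())
print(X)
```

LabelEncoder assigns integer codes in sorted order of the class labels, so in this toy frame "No" becomes 0 and "Yes" becomes 1.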

Training set after the 80/20 split: 1,200 samples, 8 features.

Test data (test_dataset):
500 entries

Same structure as train_dataset, except that it doesn't contain the Survival_Status column (because that is what we aim to predict).
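
The structural claim can be checked directly from the column names. A small sketch, assuming the two lists below, which are copied from the column listings printed by the notebook cells:

```python
# Column lists as printed by the notebook for train.csv and test.csv.
train_columns = ['Patient_ID', 'Age', 'Marital_Status', 'Year of Operation',
                 'Positive_Axillary_Nodes', 'Tumor_Size', 'Radiation_Therapy',
                 'Chemotherapy', 'Hormone_Therapy', 'Survival_Status']
test_columns = ['Patient_ID', 'Age', 'Marital_Status', 'Year of Operation',
                'Positive_Axillary_Nodes', 'Tumor_Size', 'Radiation_Therapy',
                'Chemotherapy', 'Hormone_Therapy']

# Columns present in train but absent from test.
only_in_train = [c for c in train_columns if c not in test_columns]

# After removing the target, the remaining columns should match in order too.
same_order = [c for c in train_columns if c != 'Survival_Status'] == test_columns

print(only_in_train, same_order)
```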

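As with the training data, there is no missing data.

Categorical columns (Marital_Status, Radiation_Therapy, Chemotherapy, and Hormone_Therapy) are encoded with LabelEncoder, which converts the categories into numerical values.
Numerical columns (Age, Year of Operation, Positive_Axillary_Nodes, and Tumor_Size) are standardized with StandardScaler. The scaler is fitted on the training data, so the training columns have mean 0 and standard deviation 1; the test columns are transformed with the same parameters and are only approximately standardized.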

Validation set (20% of the training data): 300 samples, 8 features.

Notes:
X_train contains all features except Survival_Status and Patient_ID.
y_train contains only Survival_Status.
X_test contains the preprocessed test features (used to make predictions).

fit_transform learns the mapping from the training data and applies it; transform applies that same mapping to the test data.
StandardScaler scales numeric data so that each training column has a mean of 0 and a standard deviation of 1.
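
The difference between fit_transform and transform can be illustrated on a single numeric column. This is a minimal sketch with made-up ages; it assumes nothing about the real data beyond the column being numeric.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up ages, shaped (n_samples, 1) as scikit-learn expects.
train_age = np.array([[77.0], [36.0], [47.0], [54.0]])
test_age = np.array([[62.0], [33.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_age)  # learns mean_ and scale_ from train
test_scaled = scaler.transform(test_age)        # reuses the training mean_ and scale_

# transform is equivalent to applying the training statistics by hand.
manual = (test_age - scaler.mean_) / scaler.scale_

print("train mean/std:", train_scaled.mean(), train_scaled.std())
print("test mean:", test_scaled.mean())
print("matches manual formula:", np.allclose(test_scaled, manual))
```

Only the training column ends up with mean 0 and standard deviation 1; the test column is shifted and scaled by the training statistics, so its mean is generally not zero.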

Unscaled data

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder

train_dataset = pd.read_csv("train.csv")
test_dataset = pd.read_csv("test.csv")

print("Columns in train_dataset:", train_dataset.columns.tolist())
print("Columns in test_dataset:", test_dataset.columns.tolist())

# Separate features and target; Patient_ID is an identifier, not a predictor.
X_train = train_dataset.drop(columns=['Survival_Status', 'Patient_ID'])
y_train = train_dataset['Survival_Status']

X_test = test_dataset.drop(columns=['Patient_ID'])

categorical_columns = ['Marital_Status', 'Radiation_Therapy', 'Chemotherapy', 'Hormone_Therapy']
labelencoder = LabelEncoder()

# Encode each categorical column: fit on the training column, then reuse
# that mapping for the test column. The encoder is refitted for every column.
for col in categorical_columns:
    X_train[col] = labelencoder.fit_transform(X_train[col])
    X_test[col] = labelencoder.transform(X_test[col])

# No scaling happens in this cell; the numerical columns are only listed here.
numerical_columns = ['Age', 'Year of Operation', 'Positive_Axillary_Nodes', 'Tumor_Size']

print("\nX_train:")
print(X_train.to_string(index=False))

print("\nX_test:")
print(X_test.to_string(index=False))

print("\ny_train:")
print(y_train.to_string(index=False))

Columns in train_dataset: ['Patient_ID', 'Age', 'Marital_Status', 'Year of Operation', 'Positive_Axillary_Nodes', 'Tumor_Size', 'Radiation_Therapy', 'Chemotherapy', 'Hormone_Therapy', 'Survival_Status']
Columns in test_dataset: ['Patient_ID', 'Age', 'Marital_Status', 'Year of Operation', 'Positive_Axillary_Nodes', 'Tumor_Size', 'Radiation_Therapy', 'Chemotherapy', 'Hormone_Therapy']

X_train:
 Age  Marital_Status  Year of Operation  Positive_Axillary_Nodes  Tumor_Size  Radiation_Therapy  Chemotherapy  Hormone_Therapy
  77               0               1962                        5         3.0                  0             1                0
  36               0               1964                        2         1.9                  1             0                0
  47               0               1960                        5         2.0                  0             0                0
  54               0               1965                        0         1.4                  0 

Scaled data

In [7]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split


train_dataset = pd.read_csv("train.csv")
test_dataset = pd.read_csv("test.csv")

print("Columns in train_dataset:", train_dataset.columns.tolist())
print("Columns in test_dataset:", test_dataset.columns.tolist())


X_train = train_dataset.drop(columns=['Survival_Status', 'Patient_ID'])
y_train = train_dataset['Survival_Status']


X_test = test_dataset.drop(columns=['Patient_ID'])


categorical_columns = ['Marital_Status', 'Radiation_Therapy', 'Chemotherapy', 'Hormone_Therapy']
labelencoder = LabelEncoder()

for col in categorical_columns:
    X_train[col] = labelencoder.fit_transform(X_train[col])
    X_test[col] = labelencoder.transform(X_test[col])

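# Standardize the numerical columns: fit on the training data only, then
# apply the same mean and standard deviation to the test data.
numerical_columns = ['Age', 'Year of Operation', 'Positive_Axillary_Nodes', 'Tumor_Size']
scaler = StandardScaler()
X_train[numerical_columns] = scaler.fit_transform(X_train[numerical_columns])
X_test[numerical_columns] = scaler.transform(X_test[numerical_columns])

# NOTE: X_val and y_val are not created in this cell. The shapes printed for
# them come from a train_test_split run in another cell (leftover kernel
# state), which is why X_train still has all 1500 rows here.
print("\nX_train shape:", X_train.shape)
print("X_val shape:", X_val.shape)
print("y_train shape:", y_train.shape)
print("y_val shape:", y_val.shape)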


print("\nX_test:")
print(X_test.to_string(index=False))


print("\nX_train:")
print(X_train.to_string(index=False))

print("\ny_train:")
print(y_train.to_string(index=False))


Columns in train_dataset: ['Patient_ID', 'Age', 'Marital_Status', 'Year of Operation', 'Positive_Axillary_Nodes', 'Tumor_Size', 'Radiation_Therapy', 'Chemotherapy', 'Hormone_Therapy', 'Survival_Status']
Columns in test_dataset: ['Patient_ID', 'Age', 'Marital_Status', 'Year of Operation', 'Positive_Axillary_Nodes', 'Tumor_Size', 'Radiation_Therapy', 'Chemotherapy', 'Hormone_Therapy']

X_train shape: (1500, 8)
X_val shape: (300, 8)
y_train shape: (1500,)
y_val shape: (300,)

X_test:
      Age  Marital_Status  Year of Operation  Positive_Axillary_Nodes  Tumor_Size  Radiation_Therapy  Chemotherapy  Hormone_Therapy
 0.330900               1           0.749868                -0.558080   -1.047030                  1             0                0
-1.548479               0          -1.024735                -0.558080    0.766271                  1             0                0
-0.317162               0          -0.137433                -0.267413   -0.518151                  1             0    

The training dataset is the portion of the data used to train the ML model. Here I will split it into 80% training data and 20% validation data.
The test dataset won't be split, because it is used to evaluate the final model's performance after training. It should not be altered or split during the training process, so that the model's performance is assessed on completely unseen data. I will be using a random forest classifier, so I won't need scaled data (tree splits depend only on the ordering of feature values, not on their scale).
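
The 80/20 split can be sketched on synthetic data of the same shape. This is a sketch under assumptions: the features and labels are random placeholders, and `stratify=y` is an addition that the notebook's own split does not use; it keeps the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random placeholder data with the same shape as the encoded training features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 8))
y = rng.integers(0, 2, size=1500)

# Hold out 20% of the rows for validation; stratify keeps class proportions aligned.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("train:", X_tr.shape, "validation:", X_val.shape)
print("positive rate train/val:", y_tr.mean(), y_val.mean())
```

With 1,500 rows and test_size=0.2 this yields 1,200 training rows and 300 validation rows, matching the shapes the notebook prints.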

In [29]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

train_dataset = pd.read_csv("train.csv")
test_dataset = pd.read_csv("test.csv")

print("Columns in train_dataset:", train_dataset.columns.tolist())
print("Columns in test_dataset:", test_dataset.columns.tolist())


X_train = train_dataset.drop(columns=['Survival_Status', 'Patient_ID'])
y_train = train_dataset['Survival_Status']


X_test = test_dataset.drop(columns=['Patient_ID'])

categorical_columns = ['Marital_Status', 'Radiation_Therapy', 'Chemotherapy', 'Hormone_Therapy']
labelencoder = LabelEncoder()

for col in categorical_columns:
    X_train[col] = labelencoder.fit_transform(X_train[col])
    X_test[col] = labelencoder.transform(X_test[col])

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

print("\nX_train shape:", X_train.shape)
print("X_val shape:", X_val.shape)
print("y_train shape:", y_train.shape)
print("y_val shape:", y_val.shape)

print("\nX_test:")
print(X_test.to_string(index=False))

print("\nX_train:")
print(X_train.to_string(index=False))

print("\ny_train:")
print(y_train.to_string(index=False))


Columns in train_dataset: ['Patient_ID', 'Age', 'Marital_Status', 'Year of Operation', 'Positive_Axillary_Nodes', 'Tumor_Size', 'Radiation_Therapy', 'Chemotherapy', 'Hormone_Therapy', 'Survival_Status']
Columns in test_dataset: ['Patient_ID', 'Age', 'Marital_Status', 'Year of Operation', 'Positive_Axillary_Nodes', 'Tumor_Size', 'Radiation_Therapy', 'Chemotherapy', 'Hormone_Therapy']

X_train shape: (1200, 8)
X_val shape: (300, 8)
y_train shape: (1200,)
y_val shape: (300,)

X_test:
 Age  Marital_Status  Year of Operation  Positive_Axillary_Nodes  Tumor_Size  Radiation_Therapy  Chemotherapy  Hormone_Therapy
  62               1               1966                        4         1.4                  1             0                0
  33               0               1960                        4         3.8                  1             0                0
  52               0               1963                        7         2.1                  1             0                0
  56  

Accuracy test with random forest

In [25]:

import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE

train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

features_train = train_data.drop(columns=['Survival_Status', 'Patient_ID'])
labels_train = train_data['Survival_Status']

features_test = test_data.drop(columns=['Patient_ID'])

categorical_features = ['Marital_Status', 'Radiation_Therapy', 'Chemotherapy', 'Hormone_Therapy']
label_encoder = LabelEncoder()

for feature in categorical_features:
    features_train[feature] = label_encoder.fit_transform(features_train[feature])
    features_test[feature] = label_encoder.transform(features_test[feature])

features_train, features_val, labels_train, labels_val = train_test_split(features_train, labels_train, test_size=0.2, random_state=42)

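# Oversample the minority class in the training split only; the validation
# split stays untouched. Resampling before RandomizedSearchCV lets synthetic
# points appear in the CV folds, so the CV scores may be optimistic.
smote = SMOTE(random_state=42)
features_train, labels_train = smote.fit_resample(features_train, labels_train)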

rf_classifier = RandomForestClassifier(random_state=42)

param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

random_search = RandomizedSearchCV(estimator=rf_classifier, param_distributions=param_grid, n_iter=50, cv=5, verbose=2, n_jobs=-1, random_state=42)
random_search.fit(features_train, labels_train)

best_rf_classifier = random_search.best_estimator_
print("Best parameters found: ", random_search.best_params_)

# best_estimator_ has already been refitted on the full training set
# (refit=True is the default), so it is not fitted again here.

val_predictions = best_rf_classifier.predict(features_val)
accuracy = accuracy_score(labels_val, val_predictions)
print("Validation Accuracy:", accuracy)

print("\nClassification Report:")
print(classification_report(labels_val, val_predictions))

test_predictions = best_rf_classifier.predict(features_test)
submission_df = pd.DataFrame({'Patient_ID': test_data['Patient_ID'], 'Survival_Status': test_predictions})
submission_df.to_csv('submission_rf.csv', index=False)

print("\nSubmission file created: submission_rf.csv")


Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best parameters found:  {'n_estimators': 500, 'min_samples_split': 10, 'min_samples_leaf': 1, 'max_depth': 5}
Validation Accuracy: 0.5333333333333333

Classification Report:
              precision    recall  f1-score   support

           0       0.52      0.54      0.53       147
           1       0.54      0.52      0.53       153

    accuracy                           0.53       300
   macro avg       0.53      0.53      0.53       300
weighted avg       0.53      0.53      0.53       300


Submission file created: submission_rf.csv


Accuracy test with logistic regression model

In [26]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from imblearn.over_sampling import SMOTE

train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

features_train = train_data.drop(columns=['Survival_Status', 'Patient_ID'])
labels_train = train_data['Survival_Status']

features_test = test_data.drop(columns=['Patient_ID'])

categorical_features = ['Marital_Status', 'Radiation_Therapy', 'Chemotherapy', 'Hormone_Therapy']
label_encoder = LabelEncoder()

for feature in categorical_features:
    features_train[feature] = label_encoder.fit_transform(features_train[feature])
    features_test[feature] = label_encoder.transform(features_test[feature])

scaler = StandardScaler()
numerical_features = ['Age', 'Tumor_Size', 'Positive_Axillary_Nodes']
features_train[numerical_features] = scaler.fit_transform(features_train[numerical_features])
features_test[numerical_features] = scaler.transform(features_test[numerical_features])

features_train, features_val, labels_train, labels_val = train_test_split(features_train, labels_train, test_size=0.2, random_state=42)

smote = SMOTE(random_state=42)
features_train, labels_train = smote.fit_resample(features_train, labels_train)

logistic_regression_model = LogisticRegression(random_state=42, max_iter=1000)

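# NOTE: not every combination in this grid is valid. 'liblinear' does not
# support 'elasticnet', and 'elasticnet' with 'saga' requires l1_ratio to be
# set, so those candidates fail to fit and are scored as nan.
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2', 'elasticnet'],
    'solver': ['liblinear', 'saga']
}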

random_search = RandomizedSearchCV(estimator=logistic_regression_model, param_distributions=param_grid, n_iter=50, cv=5, verbose=2, n_jobs=-1, random_state=42)
random_search.fit(features_train, labels_train)

best_logistic_model = random_search.best_estimator_
print("Best parameters found: ", random_search.best_params_)

# best_estimator_ has already been refitted on the full training set
# (refit=True is the default), so it is not fitted again here.

val_predictions = best_logistic_model.predict(features_val)

accuracy = accuracy_score(labels_val, val_predictions)
print("Validation Accuracy:", accuracy)

print("\nClassification Report:")
print(classification_report(labels_val, val_predictions))

test_predictions = best_logistic_model.predict(features_test)
submission_df = pd.DataFrame({'Patient_ID': test_data['Patient_ID'], 'Survival_Status': test_predictions})
submission_df.to_csv('submission_logreg.csv', index=False)

print("\nSubmission file created: submission_logreg.csv")




Fitting 5 folds for each of 30 candidates, totalling 150 fits


50 fits failed out of a total of 150.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\DELL\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1194, in fit
    solver = _check_solver(self.

Best parameters found:  {'solver': 'liblinear', 'penalty': 'l1', 'C': 100}
Validation Accuracy: 0.5166666666666667

Classification Report:
              precision    recall  f1-score   support

           0       0.51      0.48      0.49       147
           1       0.53      0.55      0.54       153

    accuracy                           0.52       300
   macro avg       0.52      0.52      0.52       300
weighted avg       0.52      0.52      0.52       300


Submission file created: submission_logreg.csv
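
The 50 failed fits reported above are consistent with the grid itself: 5 values of C, 3 penalties and 2 solvers give 30 candidates, and the 10 that use 'elasticnet' cannot be fitted ('liblinear' does not support it, and 'saga' needs l1_ratio), which over 5 folds gives 50 failures. One possible way to avoid this, sketched here on synthetic data rather than the notebook's, is to pass a list of dicts so that only compatible combinations are sampled. The reduced C values, the `l1_ratio` of 0.5, `cv=3`, and `n_iter=10` are assumptions chosen to keep the sketch quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the encoded training features (8 columns).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Each dict only pairs a solver with penalties it supports.
param_distributions = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.01, 1, 100]},
    {"solver": ["saga"], "penalty": ["l1", "l2"], "C": [0.01, 1, 100]},
    {"solver": ["saga"], "penalty": ["elasticnet"], "l1_ratio": [0.5],
     "C": [0.01, 1, 100]},
]

search = RandomizedSearchCV(
    LogisticRegression(max_iter=5000, random_state=42),
    param_distributions=param_distributions,
    n_iter=10,
    cv=3,
    random_state=42,
    error_score="raise",  # fail loudly instead of silently scoring nan
)
search.fit(X, y)

scores = search.cv_results_["mean_test_score"]
print("candidates:", len(scores), "any nan:", bool(np.isnan(scores).any()))
print("best parameters:", search.best_params_)
```

Setting error_score="raise" is the option the warning message itself suggests for debugging; with a valid grid it never triggers.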




XGBoost with Scaled Data and Hyperparameter Tuning

In [27]:
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder, StandardScaler
from imblearn.over_sampling import SMOTE

train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

features_train = train_data.drop(columns=['Survival_Status', 'Patient_ID'])
labels_train = train_data['Survival_Status']

features_test = test_data.drop(columns=['Patient_ID'])

categorical_features = ['Marital_Status', 'Radiation_Therapy', 'Chemotherapy', 'Hormone_Therapy']
label_encoder = LabelEncoder()

for feature in categorical_features:
    features_train[feature] = label_encoder.fit_transform(features_train[feature])
    features_test[feature] = label_encoder.transform(features_test[feature])

scaler = StandardScaler()
numerical_features = ['Age', 'Tumor_Size', 'Positive_Axillary_Nodes']
features_train[numerical_features] = scaler.fit_transform(features_train[numerical_features])
features_test[numerical_features] = scaler.transform(features_test[numerical_features])

features_train, features_val, labels_train, labels_val = train_test_split(features_train, labels_train, test_size=0.2, random_state=42)

smote = SMOTE(random_state=42)
features_train, labels_train = smote.fit_resample(features_train, labels_train)

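# use_label_encoder is ignored by recent XGBoost versions, which report it as
# an unused parameter in a warning; it is kept here as originally run.
xgboost_model = XGBClassifier(random_state=42, use_label_encoder=False)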

param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 10],
    'subsample': [0.6, 0.8, 1.0]
}

random_search = RandomizedSearchCV(estimator=xgboost_model, param_distributions=param_grid, n_iter=50, cv=5, verbose=2, n_jobs=-1, random_state=42)
random_search.fit(features_train, labels_train)

best_xgboost_model = random_search.best_estimator_
print("Best parameters found: ", random_search.best_params_)

val_predictions = best_xgboost_model.predict(features_val)

accuracy = accuracy_score(labels_val, val_predictions)
print("Validation Accuracy:", accuracy)

print("\nClassification Report:")
print(classification_report(labels_val, val_predictions))

test_predictions = best_xgboost_model.predict(features_test)
submission_df = pd.DataFrame({'Patient_ID': test_data['Patient_ID'], 'Survival_Status': test_predictions})
submission_df.to_csv('submission_xgb.csv', index=False)

print("\nSubmission file created: submission_xgb.csv")


Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best parameters found:  {'subsample': 1.0, 'n_estimators': 300, 'max_depth': 3, 'learning_rate': 0.01}
Validation Accuracy: 0.5066666666666667

Classification Report:
              precision    recall  f1-score   support

           0       0.50      0.52      0.51       147
           1       0.52      0.50      0.51       153

    accuracy                           0.51       300
   macro avg       0.51      0.51      0.51       300
weighted avg       0.51      0.51      0.51       300


Submission file created: submission_xgb.csv


Parameters: { "use_label_encoder" } are not used.
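
To put the three validation accuracies in context, they can be compared with the accuracy of always predicting the larger class. This is a rough sketch using only numbers from the classification reports above; the normal approximation to the binomial and the 1.96 cutoff are assumptions, not something the notebook computes.

```python
import math

# Figures copied from the classification reports (validation split of 300 rows).
n_val = 300
support = {0: 147, 1: 153}
val_accuracy = {
    "random forest": 0.5333333333333333,
    "logistic regression": 0.5166666666666667,
    "xgboost": 0.5066666666666667,
}

# Accuracy of a model that always predicts the majority class.
baseline = max(support.values()) / n_val

# Standard error of an accuracy estimate at the baseline rate with n_val samples.
std_err = math.sqrt(baseline * (1 - baseline) / n_val)

for name, acc in val_accuracy.items():
    z = (acc - baseline) / std_err
    print(f"{name}: accuracy={acc:.3f}, baseline={baseline:.2f}, z={z:+.2f}")
```

Under these assumptions, a |z| below 1.96 means the accuracy is not clearly distinguishable from the majority-class baseline at this validation size.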

