<a href="https://colab.research.google.com/github/Devsachin2003/Prognosticating-Different-stages-of-Goiter-using-ML-algorithms-/blob/main/Gradient_Boosting_machine_for_goiter_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Overall Goal
This script outlines a complete machine learning workflow to predict the presence of 'goitre' based on a medical dataset. It involves data loading, extensive preprocessing of categorical features, handling missing values, splitting data into training and testing sets, training a Gradient Boosting Classifier, and finally evaluating its performance.

### 1. Importing Necessary Libraries
First, we import the Python libraries needed for data manipulation, model building, and evaluation. `pandas` handles the data; the `sklearn` modules cover data splitting, the ensemble model, imputation, and metrics; and `google.colab` provides `drive` for mounting Google Drive (the mount call itself is left commented out below).

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from google.colab import drive

# Mount Google Drive; uncomment this when the dataset is stored under /content/drive
# drive.mount('/content/drive')

### 2. Loading the Dataset
Here, the notebook loads the dataset, a CSV file stored in Google Drive as `Copy of sick_csv.csv`, into a pandas DataFrame.

**Note on `FileNotFoundError`:** An earlier execution failed because `sick_csv.csv` was not found. The path used below sits under `/content/drive`, which only exists once Google Drive has been mounted, and the `drive.mount` call above is commented out. Either mount Drive first, or upload the file to your Colab environment and point `file_path` at it, for example `file_path = '/content/your_folder/sick_csv.csv'`.

In [4]:
# Load the dataset
file_path = '/content/drive/MyDrive/Copy of sick_csv.csv'
df = pd.read_csv(file_path)
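
To catch a missing file before `pd.read_csv` raises, the path can be checked up front. The following is a minimal sketch; `load_dataset` is a hypothetical helper, not part of the original notebook, and the demo CSV it reads is made up.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_dataset(path):
    """Read a CSV into a DataFrame, failing early with a clear message."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path} not found; mount Drive or upload the file first"
        )
    return pd.read_csv(csv_path)


# Demo on a throwaway file so the sketch runs anywhere (contents are invented).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("goitre,age\nf,41\nt,23\n")
    demo_df = load_dataset(demo_path)

print(demo_df.shape)
```

In the notebook itself, the call would be `df = load_dataset(file_path)`.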

### 3. Data Preprocessing - 'goitre' Column
The 'goitre' column, which is the target variable, contains 'f' (false) and 't' (true) values. This step converts these categorical values into numerical representations: 'f' becomes 0 and 't' becomes 1.

In [5]:
# Replace 'f' with 0 and 't' with 1 in 'goitre' column
df['goitre'] = df['goitre'].map({'f': 0, 't': 1})
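
One behaviour of `.map` is worth knowing here: any value that is not a key in the mapping becomes `NaN`. The toy series below (invented values, with `'?'` standing in for any unexpected entry) illustrates this. A `NaN` in the target would not be repaired by the feature imputer used later, which only sees `X`.

```python
import pandas as pd

# Toy column; '?' stands in for any value the mapping does not cover.
toy = pd.Series(["f", "t", "?", "f"])
encoded = toy.map({"f": 0, "t": 1})

# Unmapped entries become NaN, which also turns the column into floats.
n_unmapped = int(encoded.isna().sum())
print(encoded.tolist())
print("unmapped values:", n_unmapped)
```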

### 4. Data Preprocessing - Binary Categorical Columns
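Several feature columns are also binary flags coded as 'f' and 't'. This section lists them and applies the same mapping (f=0, t=1), since most machine learning algorithms need numerical inputs. Note that `.map` turns every value outside the mapping into `NaN`, so a column coded differently (for instance, if `sex` holds 'F'/'M' rather than 'f'/'t') would come out entirely missing. It is worth checking each column's unique values before mapping.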

In [6]:
binary_categorical_cols = ['sex', 'on_thyroxine', 'query_on_thyroxine', 'on_antithyroid_medication',
                           'sick', 'pregnant', 'thyroid_surgery', 'I131_treatment', 'query_hypothyroid',
                           'query_hyperthyroid', 'lithium', 'tumor', 'hypopituitary', 'psych',
                           'TSH_measured', 'T3_measured', 'TT4_measured', 'T4U_measured', 'FTI_measured',
                           'TBG_measured']

for col in binary_categorical_cols:
    df[col] = df[col].map({'f': 0, 't': 1})
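
Before applying the mapping, one way to find columns that hold something other than 'f' and 't' is sketched below. The toy frame is hypothetical, and the 'F'/'M' coding for `sex` is an assumption made for illustration, not something confirmed by the dataset.

```python
import pandas as pd

# Hypothetical toy frame; the 'F'/'M' coding for sex is an assumption.
toy_df = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "pregnant": ["f", "t", "f"],
})

expected = {"f", "t"}
unexpected = {
    col: sorted(set(toy_df[col].dropna().unique()) - expected)
    for col in toy_df.columns
}
# Keep only the columns that actually hold values the mapping would miss.
unexpected = {col: vals for col, vals in unexpected.items() if vals}
print(unexpected)
```

Any column reported here needs its own mapping (or a different encoding) rather than the shared `{'f': 0, 't': 1}` dictionary.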

### 5. Data Preprocessing - Multi-Categorical Columns
Columns with more than two categories (`referral_source`, `Class`) are handled using one-hot encoding. This technique creates new binary columns for each unique category, preventing the model from assuming an ordinal relationship between categories.

In [7]:
multi_categorical_cols = ['referral_source', 'Class']

# One-hot encode multi-category columns
df = pd.get_dummies(df, columns=multi_categorical_cols)
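
To see what `pd.get_dummies` produces, here is a small sketch on invented data. The category labels are made up for the demo and may not match the real `referral_source` values.

```python
import pandas as pd

toy = pd.DataFrame({
    "referral_source": ["SVI", "other", "SVI"],  # labels invented for the demo
    "age": [41, 23, 46],
})

# Each category becomes its own indicator column; other columns are kept as-is.
encoded = pd.get_dummies(toy, columns=["referral_source"])
print(list(encoded.columns))
```

The original column is dropped and replaced by one indicator column per category, named `<column>_<category>`.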

### 6. Defining Features (X) and Target (y)
In this step, the dataset is split into features (`X`), which are the input variables used for prediction, and the target variable (`y`), which is what we want to predict (in this case, 'goitre').

In [8]:
# Define features and target variable
X = df.drop("goitre", axis=1)  # Features
y = df["goitre"]  # Target

### 7. Imputing Missing Values
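Many machine learning algorithms cannot handle missing data. This step uses `SimpleImputer` to fill missing values in the features (`X`) with the mean of the respective column, a common baseline strategy. Two details are worth noting: `fit_transform` returns a NumPy array, so the column names are not carried over into `X_imputed`, and because the imputer is fitted on all of `X` before the split in the next step, the column means also reflect rows that end up in the test set.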

In [9]:
# Impute missing values using SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
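
As a point of comparison, the sketch below fits the imputer on training rows only and then reuses the learned means for held-out rows. This is an alternative ordering, not what the notebook does, and the arrays are toy values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrices with gaps (values invented for the demo).
X_train_toy = np.array([[1.0, 10.0], [3.0, np.nan], [np.nan, 30.0]])
X_test_toy = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train_toy)  # learns means from training rows only
X_test_filled = imputer.transform(X_test_toy)        # reuses those means

print(imputer.statistics_)
print(X_test_filled)
```

The learned per-column means are stored in `imputer.statistics_`.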



### 8. Splitting Data into Training and Test Sets
To evaluate the model's performance on unseen data, the dataset is split into training and testing sets. The training set is used to fit the model, and the test set is used to check how well it generalizes. `test_size=0.00099` reserves only about 0.099% of the rows for testing (roughly one row per thousand), so the test set holds just a handful of samples and any metric computed on it is very coarse. `random_state=42` makes the split reproducible.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X_imputed, y, test_size=0.00099, random_state=42)
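
To make the effect of such a small fraction concrete, the sketch below applies the same `test_size` to 1,000 synthetic rows and compares it with a larger, stratified hold-out. The data and the 0.2 fraction are assumptions chosen for illustration, not values from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1,000 rows, 10% positive class (invented for the demo).
X_toy = np.arange(2000).reshape(1000, 2)
y_toy = np.array([0] * 900 + [1] * 100)

# Same fraction as the cell above: only a single row is held out here.
_, X_tiny, _, y_tiny = train_test_split(
    X_toy, y_toy, test_size=0.00099, random_state=42
)

# A larger, stratified hold-out keeps the class ratio in both parts.
_, X_big, _, y_big = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(len(y_tiny), len(y_big), int(y_big.sum()))
```

With `stratify`, the held-out part keeps the same share of positive cases as the full data, which matters when the positive class is rare.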

### 9. Training the Gradient Boosting Classifier
This section initializes and trains a Gradient Boosting Classifier. Gradient Boosting is an ensemble learning method that builds the model stage by stage, with each new tree fitted to correct the errors of the ensemble so far. `n_estimators` is the number of boosting stages, and `learning_rate` shrinks the contribution of each tree.

In [11]:
# Train the Gradient Boosting classifier
gradient_boost_classifier = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gradient_boost_classifier.fit(X_train, y_train)
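
The stage-wise behaviour can be inspected with `staged_predict`, which yields the ensemble's predictions after each boosting stage. The sketch below uses synthetic data from `make_classification` rather than the thyroid dataset, so the numbers it prints say nothing about the notebook's model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary classification problem (stand-in for the real features).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
clf.fit(X_toy, y_toy)

# One regression tree is fitted per boosting stage for a binary target.
n_stages = clf.estimators_.shape[0]

# Training accuracy of the ensemble after each successive stage.
train_acc_by_stage = [
    float((pred == y_toy).mean()) for pred in clf.staged_predict(X_toy)
]
print(n_stages, train_acc_by_stage[0], train_acc_by_stage[-1])
```

Running the same loop on held-out data instead of the training data is a common way to judge whether fewer stages would suffice.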

### 10. Making Predictions and Evaluating the Model
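Finally, the trained model is used to make predictions on the held-out test set. The `accuracy_score` metric then quantifies performance by comparing the predicted values (`y_pred`) with the actual values (`y_test`). Because the test set is tiny (`test_size=0.00099`), the accuracy can only move in large steps; a value of 0.75 is, for example, what three correct predictions out of four would give.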

In [12]:
# Make predictions on the test set
y_pred = gradient_boost_classifier.predict(X_test)

# Evaluate the model
accuracy_gbt = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy_gbt)

Accuracy: 0.75
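
Accuracy alone can hide how a classifier treats the rarer class. The sketch below uses made-up labels (not the notebook's actual predictions) in which a model that never predicts the positive class still reaches 0.75 accuracy. The confusion matrix and recall show the problem.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Invented labels: one positive case, and a model that always predicts 0.
y_true_toy = [0, 0, 0, 1]
y_pred_toy = [0, 0, 0, 0]

acc = accuracy_score(y_true_toy, y_pred_toy)
rec = recall_score(y_true_toy, y_pred_toy, zero_division=0)
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])

print("accuracy:", acc)
print("recall (positive class):", rec)
print(cm)
```

Rows of the confusion matrix are true classes and columns are predicted classes, so the bottom-left entry counts positive cases the model missed.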
