# About Dataset

### Brain Tumor Dataset
This dataset contains simulated data for brain tumor diagnosis, treatment, and patient details. It consists of 20 columns and 20,000 rows, providing information such as patient demographics, tumor characteristics, symptoms, treatment details, and follow-up requirements. The dataset is designed for machine learning projects focused on predicting the type and severity of brain tumors, as well as understanding various treatment methods and patient outcomes.
## Columns:
1. Patient_ID: Unique identifier for each patient.
2. Age: Age of the patient (in years).
3. Gender: Gender of the patient (Male/Female).
4. Tumor_Type: Type of tumor (Benign/Malignant).
5. Tumor_Size: Size of the tumor in centimeters.
6. Location: The part of the brain where the tumor is located (e.g., Frontal, Temporal).
7. Histology: The histological type of the tumor (e.g., Astrocytoma, Glioblastoma).
8. Stage: The stage of the tumor (I, II, III, IV).
9. Symptom_1: The first symptom observed (e.g., Headache, Seizures).
10. Symptom_2: The second symptom observed.
11. Symptom_3: The third symptom observed.
12. Radiation_Treatment: Whether radiation treatment was administered (Yes/No).
13. Surgery_Performed: Whether surgery was performed (Yes/No).
14. Chemotherapy: Whether chemotherapy was administered (Yes/No).
15. Survival_Rate: The estimated survival rate of the patient (percentage).
16. Tumor_Growth_Rate: The growth rate of the tumor (cm per month).
17. Family_History: Whether the patient has a family history of brain tumors (Yes/No).
18. MRI_Result: The result of the MRI scan (Positive/Negative).
19. Follow_Up_Required: Whether follow-up is required (Yes/No).
20. Treatment_Response: The response to the treatment (Improved/Worsened/Stable).
### Intended Use:
This dataset can be used for various machine learning tasks, such as:
- Tumor classification: Predicting whether a tumor is benign or malignant.
- Survival analysis: Estimating the survival rate based on different features like tumor type and treatment.
- Outcome prediction: Predicting the treatment response or follow-up requirement.

# Plans 
- Summary Statistics
- EDA
- Feature Engineering
- Feature Extraction

## Steps
- Handling Missing values
- Feature Encoding
- Feature Scaling
- Feature Selection
- Feature Engineering
- Train Test Splitting
- Model Training
- Model Evaluation
- Hyperparameter Tuning
- Best model Selection

In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [6]:
##loading data
bt=pd.read_csv('BTumer.csv')
bt.head()



## Summary Statistics/EDA

In [7]:
bt.info()



In [8]:
##shapes of data
bt.shape



In [9]:
##checking for missing values
bt.isnull().sum()



## Observe
No null value or no missing value exist in the dataset

In [10]:
##Droping patient id
bt.drop(columns='Patient_ID',axis=1,inplace=True)

In [11]:
bt.head()



In [12]:
##describe the data
bt.describe()



## observe
- Minimum age is 20, Max is 79 and average age of the patient is 49 years old
- Minimum Tumor Size is 0.5 centimeters. maximum is 9.99 cm and the average size of the tumor is 5.2 cm.
- Minimum chance of survival rate is 40%, maximum is 99% and on average survival rate is 70.1%.
- The average Tumor growth rate is 1.55 CM per month and minimum growth rate per month is 0.100,maximum tumor growth rate per month is 2.999.
- Among Numerical features, Survival Rate has the highest variance about 17.27 in comparision with other features. The minimum would be 0.83 for Tumor Growth Rate.

In [13]:
##checking for unique values
bt.columns



## Data Preprocessing

In [14]:
bt.head()



In [15]:
##category columns 
cat_cols=[cat for cat in bt.columns if bt[cat].dtype=='object']
print(cat_cols)
print(len(cat_cols))




In [16]:
##numerical columns
num_cols=[num for num in bt.columns if bt[num].dtype!='O']
print(num_cols)
print(len(num_cols))



In [17]:
##checking for unique value count for all categorical values
for i in cat_cols:
    print(bt[i].value_counts())
    print()



In [19]:
bt.head()



# Data Distribution

In [39]:
##correlation among numerical values
sns.heatmap(bt[num_cols].corr(),annot=True)





In [22]:
sns.displot(bt['Tumor_Size'])





In [23]:
## Analysis 
plt.figure(figsize=(15,13))
plt.subplot(3,2,1)
sns.barplot(y=bt['Tumor_Size'],x=bt['Tumor_Type'],hue=bt['Tumor_Type'])
plt.xlabel("Tumor Type")
plt.ylabel("Tumor Size")
plt.title("Fig-1: Tumor Type vs Size",fontsize=12, fontweight="bold")
plt.legend()

plt.subplot(3,2,2)
sns.countplot(x=bt['Tumor_Type'],hue=bt['Gender'])
plt.xlabel("Gender")
plt.ylabel("Counts")
plt.title("Fig-2: Tumor Type over Gender counts",fontsize=12, fontweight="bold")
plt.legend()

plt.subplot(3,2,3)
sns.countplot(x=bt['Chemotherapy'],hue=bt['Tumor_Type'])
plt.xlabel("Chemotherapy requirement")
plt.ylabel("Counts")
plt.title("Fig-3: Chemotherapy Counts",fontsize=12, fontweight="bold")
plt.legend()

plt.subplot(3,2,4)
sns.countplot(x=bt['Symptom_1'], hue=bt['Tumor_Type'])
plt.title("Symptom-1 Tumor")
plt.xlabel("Fig-4: Symptom 1",fontsize=12, fontweight="bold")
plt.legend()

plt.subplot(3,2,5)
sns.countplot(x=bt['Location'],hue=bt['Tumor_Type'])
plt.xlabel("Brain Parts")
plt.ylabel("Counts")
plt.title("Fig-5: Parts of Brain affacted by Tumor affects",fontsize=12, fontweight="bold")

plt.subplot(3,2,6)
sns.countplot(x=bt['Histology'], hue=bt['Tumor_Type'])
plt.title("Fig-6: Histology of Tumor Cell",fontsize=12, fontweight="bold")
plt.xlabel("Histology")
plt.legend()

## Entire figure name
plt.figtext(0.5, 0.01, "Fig-1: Tumor Data Analysis", ha="center", fontsize=12, fontweight="bold")
plt.tight_layout(pad=2.0, h_pad=1.5, w_pad=1.5, rect=[0, 0.03, 1, 0.97]) ##dimension that separates each figure distinctively
plt.show() 



## Observation
- In Fig 1: Apperantly the size of the Benign tumor is bit larger than Malignant tumor.
- In Fig 2: similarly, the malignent tumor count in female is higher than the male, where Benign tumor count is higher in male than the female.
- In fig 3: In comparing both tumor, Benign seems less user of chemotherapy than Malignant tumor.
- In Fig 4: Vision issue is more in Malignant, headach is same on both case, seizure is higher in Benign etc.
- In Fig 5: Temporal and Parietal part of the brain is highly affected by Malignant than Benign, Frontal and Occipital part of the brain is dominantly affect by Benign than Malignant.
- In Fig 6: Astrocytoma,Glioblastoma and Meningioma cell structures are  most likely to be found more in Malignant tumor cells as compare to Benign.


In [33]:
##Outlier checks
sns.boxplot(x=bt['Tumor_Size'])
plt.show()



In [34]:
sns.boxplot(x=bt['Survival_Rate'])
plt.show()



In [35]:
num_cols



In [36]:
sns.boxplot(x=bt['Tumor_Growth_Rate'])
plt.show()



In [37]:
sns.boxplot(x=bt['Age'])
plt.show()



## Splitting the data 
- Spliting data into Dependent and Independent features using Train-Test Splitting technique before applying any transformation is just because to avoid some sort of Data Leakage. Data leakage is a terminology used to explain the indulgence of test data in a training set.

In [24]:
## Dependencies for data transformation/feature engineering and model training
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder,LabelEncoder,OrdinalEncoder
from sklearn.pipeline import make_pipeline,Pipeline

##model training dependencies
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
import  lightgbm 
from lightgbm import LGBMClassifier 

## Model evaluation metrics
from sklearn.metrics import classification_report,accuracy_score,f1_score,recall_score,precision_score,roc_auc_score

## Hyperparameter tuning
from sklearn.model_selection import GridSearchCV,cross_val_score,RandomizedSearchCV

## Dependent and Independent Features

In [25]:
##Independent features
X=bt.drop(columns='Tumor_Type',axis=1)

##Dependent feature
y=bt['Tumor_Type']

In [46]:
## train test splitting
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,random_state=42)

# Feature Encoding: Handling categorical features
 In the dataset I have two types of feature present in the dataset:
 1. Ordinal categorical Features: I can use either OrdinalEncoding or LabelEncoding technique to convert into numerical feature.
 2. Nominal(non-orderd) categorical features: To handle nominal fetaure, we have OneHotEncoding technique.

 ## Common practice to use encodors
 1. ### Label Encoding:
    When we have the categorical data that can represent a binary classification like (Male/Female, Yes/No, Positive/Negative), Label encoding would be very helpful.

2. ### Ordinal Encoding:
    The categorical feature that supposedly an orderd nature of the categorical values like (Educational Level, Review Rating scale etc.)

3. ### OneHot Encoding:
    It is normally used when none of the above two nature of the features present in the dataset.

In [47]:
X_train.head()



## Feature Engineering Note
So, the following feature transformation is only apply on Independent set of feature such as X_train and X_test, Nonetheless,
we transform target feture seperately.

In [28]:
##numerical feature scaling to bring all the numerical features into same range of values
##numerical columns
num_cols
##initialize the object of standardscaler
scaler=StandardScaler()


## Separating ordinal
ordinal_feats=["Stage"]
##initialize ordinal encoding
ordinalencoder=OrdinalEncoder()


##label features
label_feats=['Gender','Radiation_Treatment','Surgery_Performed',
            'Chemotherapy','Family_History','MRI_Result','Follow_Up_Required']
##initialize label encoder object
labelencoder=LabelEncoder()

##nominal features
ohe_feats=['Location','Histology','Symptom_1','Symptom_2','Symptom_3']
##initialize onhot encoder object
ohencoder=OneHotEncoder(handle_unknown='ignore')



##Combine all the transformer with ColumnTransformer
preprocessor=ColumnTransformer(transformers=[
                                     ('Scaling',scaler,num_cols),
                                     ('Ordinal Encoding',ordinalencoder,ordinal_feats),
                                     ('Nominal Encoding',ohencoder,ohe_feats)
                                 ])

## Feature transformation of Independent set of features

In [29]:
##LabelEncoder.fit_transform() can only handle one input at a time
for i in label_feats:
    X_train[i]=labelencoder.fit_transform(X_train[i])
    X_test[i]=labelencoder.fit_transform(X_test[i])

In [None]:
X_train



In [30]:
##apply transformation on independent features only
X_trainprocess=preprocessor.fit_transform(X_train)
##only transform on independent test sets of fetaures to avoid data leakage
X_testprocess=preprocessor.transform(X_test)

In [31]:
X_trainprocess




## Transformation on target feature

In [48]:
##feature engineering on y_train target feature
y_trainprocess=labelencoder.fit_transform(y_train)

##y_test target feature
y_testprocess=labelencoder.transform(y_test)

## Model training with default parameters

In [49]:
##models
models={
    "Random Forest":RandomForestClassifier(),
    "Decision Tree":DecisionTreeClassifier(),
    "Logistic Regression":LogisticRegression(),
    "Support Vector Machine":SVC(),
    "Ada-Boost":AdaBoostClassifier(),
    "XGBoost":XGBClassifier(),
    "Gradient Boosting":GradientBoostingClassifier(),
    "KNN":KNeighborsClassifier(),
    "Lgbm":LGBMClassifier()
    }



for i in range(len(list(models))):
    model=list(models.values())[i]
    model.fit(X_trainprocess,y_trainprocess)

    ##prediction of the model
    ##prediction on training set
    y_predtrain=model.predict(X_trainprocess)
    ##prediction on unseen test set
    y_predtest=model.predict(X_testprocess)

    ##model evaluation on training set
    Train_accuracyscore=accuracy_score(y_predtrain,y_trainprocess)
    Train_f1score=f1_score(y_predtrain,y_trainprocess)
    Train_recallscore=recall_score(y_predtrain,y_trainprocess)
    Train_rocauc_score=roc_auc_score(y_predtrain,y_trainprocess)
    Train_precisionscore=precision_score(y_predtrain,y_trainprocess)
    Train_ClassReport=classification_report(y_predtrain,y_trainprocess)

    ##model evaluation on test set(unseen data)
    Test_accuracyscore=accuracy_score(y_testprocess,y_predtest)
    Test_recallscore=recall_score(y_testprocess,y_predtest)
    Test_f1score=f1_score(y_testprocess,y_predtest)
    Test_precisionscore=precision_score(y_testprocess,y_predtest)
    Test_rocauc_score=roc_auc_score(y_testprocess,y_predtest)
    Test_ClassReport=classification_report(y_testprocess,y_predtest)

    print(list(models.keys())[i])
    print("Performance Score of the Training set")
    print(f'Accuracy Score: {Train_accuracyscore}')
    print(f'F1 Score: {Train_f1score}')
    print(f'Recall Score: {Train_recallscore}')
    print(f'Precision Score: {Train_precisionscore}')
    print(f'ROC/AUC Score: {Train_accuracyscore}')
    print(f'Classification Report: {Train_ClassReport}')
    print('*'*35,'\n')

    print("Performance Score of the Testing set")
    print(f'Accuracy Score: {Test_accuracyscore}')
    print(f'F1 Score: {Test_f1score}')
    print(f'Recall Score: {Test_recallscore}')
    print(f'Precision Score: {Test_precisionscore}')
    print(f'ROC/AUC score: {Test_rocauc_score}')
    print(f'Classification Report: {Test_ClassReport}')
    print('='*35,'\n')





In [None]:
df=[('x',34,'amke'),(56,'gh','son')]

for f,s,t in df:
    print(f,s,t)




## Hyperparameter Tunning

In [50]:
rf_params={
    'max_depth':[5,8,None,15,10],
    'n_estimators':[100,200,500,1000],
    'min_samples_split':[5,8,15,20],
    'max_features':['auto',5,7,8]
    }

XGB_params={'learning_rate':[0.1,0.01],
            'max_depth':[5,8,12,20,30],
            'n_estimators':[100,200,300],
            'colsample_bytree':[0.5,0.8,1,0.3,0.4]}

dt_params={
    'criterion':['gini', 'entropy', 'log_loss'],
    'splitter':['best', 'random'],
    'max_depth':[3,5,10,None],
    'min_samples_split':[2,5,10,15,20],
    'min_samples_leaf':[1,5,10,15],
    'max_leaf_nodes':[10,50,100,None],
    'max_features':['auto','sqrt', 'log2']
}

lg_params={
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2', 'elasticnet'],
    'solver': ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'],
    'max_iter': [100, 500, 1000]
}

Ada_params={
    'n_estimators': [50, 100, 200, 500],
    'learning_rate': [0.001, 0.01, 0.1, 0.5, 1],
    'algorithm': ['SAMME', 'SAMME.R']   
}

gb_params={
    'loss':['log_loss', 'exponential'],
    'learning_rate':[0.001,0.01,0.1,0.5,1],
    'n_estimators':[10,50,100,200],
    'subsample':[0.01,0.8, 0.9,1],
    'criterion':['friedman_mse', 'squared_error'],
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 5, 10],
    'max_features': [None, 'sqrt', 'log2', 0.8],
    'loss': ['deviance', 'exponential']
}

svm_params={
    'C':[0.1, 1, 10, 100, 1000],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': ['scale', 'auto', 0.001, 0.01, 0.1, 1, 10],
    #'degree': [3, 4, 5],  # Only relevant for 'poly' kernel
    #'class_weight': ['balanced', None],
    #'tol': [1e-3, 1e-4, 1e-5],
    'max_iter': [1000, 5000, 10000],
    'shrinking': [True, False]
}

knn_params={
    'n_neighbors': [3, 5, 7, 10, 15, 20],
    'metric': ['euclidean', 'manhattan', 'minkowski', 'chebyshev'],
    'p': [1, 2],  # Only relevant if metric is 'minkowski'
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'leaf_size': [10, 20, 30, 40, 50],
    'n_jobs': [-1, 1, 2, 4]
}

In [51]:
##Combine model with its parameters
model_param=[
    ('rf',RandomForestClassifier(),rf_params),
    ('dt',DecisionTreeClassifier(),dt_params),
    ('xgb',XGBClassifier(),XGB_params),
    ('Ada',AdaBoostClassifier(),Ada_params),
    ('Gb',GradientBoostingClassifier(),gb_params),
    ('lgr',LogisticRegression(),lg_params),
    ('SVM',SVC(),svm_params),
    ('KNN',KNeighborsClassifier(),knn_params)
]

## RandomsearchCv for Hyperparameter Tuning

In [52]:
##save model and best param
model_bestparam={}

##Tuning on params
for name,model,param in model_param:
    np.random.seed(42)
    randcvmodel=RandomizedSearchCV(estimator=model,param_distributions=param,cv=3,n_iter=100,verbose=2,n_jobs=-1)
    randcvmodel.fit(X_trainprocess,y_trainprocess)
    model_bestparam[name]=randcvmodel.best_params_

for model_name in model_bestparam:
    print(f'{model_name}: {model_bestparam[model_name]}')







































































































































































































































































































































































































































































































































































































































































In [53]:
hyper_models={
    'Random Forest': RandomForestClassifier(n_estimators= 100,
                                            min_samples_split= 5,
                                            max_features= 8,
                                            max_depth= 15),

'Decision Tree': DecisionTreeClassifier(splitter='random',
                                    min_samples_split= 2, 
                                    min_samples_leaf= 5, 
                                    max_leaf_nodes= None,
                                    max_features= 'sqrt', 
                                    max_depth= None, 
                                    criterion='log_loss'),

'XGBoost': XGBClassifier(n_estimators= 200, 
                            max_depth= 12,
                            learning_rate= 0.01,
                            colsample_bytree= 0.4),

'AdaBoost': AdaBoostClassifier(n_estimators= 50,
                               learning_rate= 0.5,
                                algorithm= 'SAMME'),

'GradientBoost': GradientBoostingClassifier(subsample= 1,
                                            n_estimators= 50,
                                            min_samples_split= 2, 
                                            min_samples_leaf= 10, 
                                            max_features= 0.8, 
                                            max_depth= 3, 
                                            loss='exponential', 
                                            learning_rate= 1, 
                                            criterion='friedman_mse'),

'Logistic Regression': LogisticRegression(solver='liblinear', 
                                          penalty='l1', 
                                          max_iter= 100, 
                                          C= 0.1),

'SVM': SVC(shrinking= True, 
      max_iter= 5000, 
      kernel='poly', 
      gamma= 1, 
      C= 10),

'KNN': KNeighborsClassifier(weights= 'uniform', 
      p= 1, 
      n_neighbors= 10, 
      n_jobs= -1, 
      metric= 'chebyshev', 
      leaf_size= 10, 
      algorithm= 'brute')}

In [54]:
##models with hyperparameters
for i in range(len(list(hyper_models))):
    model=list(hyper_models.values())[i]
    model.fit(X_trainprocess,y_trainprocess)

    ##prediction of the model
    ##prediction on training set
    y_predtrain=model.predict(X_trainprocess)
    ##prediction on unseen test set
    y_predtest=model.predict(X_testprocess)

    ##model evaluation on training set
    Train_accuracyscore=accuracy_score(y_predtrain,y_trainprocess)
    Train_f1score=f1_score(y_predtrain,y_trainprocess)
    Train_recallscore=recall_score(y_predtrain,y_trainprocess)
    Train_rocauc_score=roc_auc_score(y_predtrain,y_trainprocess)
    Train_precisionscore=precision_score(y_predtrain,y_trainprocess)
    Train_ClassReport=classification_report(y_predtrain,y_trainprocess)

    ##model evaluation on test set(unseen data)
    Test_accuracyscore=accuracy_score(y_testprocess,y_predtest)
    Test_recallscore=recall_score(y_testprocess,y_predtest)
    Test_f1score=f1_score(y_testprocess,y_predtest)
    Test_precisionscore=precision_score(y_testprocess,y_predtest)
    Test_rocauc_score=roc_auc_score(y_testprocess,y_predtest)
    Test_ClassReport=classification_report(y_testprocess,y_predtest)

    print(list(hyper_models.keys())[i])
    print("Performance Score of the Training set")
    print(f'Accuracy Score: {Train_accuracyscore}')
    print(f'F1 Score: {Train_f1score}')
    print(f'Recall Score: {Train_recallscore}')
    print(f'Precision Score: {Train_precisionscore}')
    print(f'ROC/AUC Score: {Train_accuracyscore}')
    print(f'Classification Report: {Train_ClassReport}')
    print('*'*35,'\n')

    print("Performance Score of the Testing set")
    print(f'Accuracy Score: {Test_accuracyscore}')
    print(f'F1 Score: {Test_f1score}')
    print(f'Recall Score: {Test_recallscore}')
    print(f'Precision Score: {Test_precisionscore}')
    print(f'ROC/AUC score: {Test_rocauc_score}')
    print(f'Classification Report: {Test_ClassReport}')
    print('='*35,'\n')



## Model observation 
### Logistic Regression
- Accuracy Score: 0.50675
- F1 Score: 0.5210002427773731
- Recall Score: 0.5356964553170245
- Precision Score: 0.5070888468809074
- ROC/AUC score: 0.506706515089659

### SVM
- Accuracy Score: 0.51225
- F1 Score: 0.5544644896094999
- Recall Score: 0.6060908637044433
- Precision Score: 0.5109427609427609
- ROC/AUC score: 0.5121090272453113

In [55]:
svm_params = {
    'C': [0.01, 0.1, 1, 10, 100],  # Keep range, but add lower values for better regularization
    'kernel': ['rbf', 'linear'],  # 'rbf' and 'linear' work best for most classification tasks
    'gamma': [0.0001, 0.001, 0.01, 0.1, 1],  # Remove large gamma values to prevent overfitting
    'max_iter': [5000],  # 5000 is sufficient in most cases
    'shrinking': [True],  # Shrinking heuristics usually improve performance
}

In [None]:
svm=SVC()
rnd=RandomizedSearchCV(estimator=svm,param_distributions=svm_params,cv=3,n_iter=100,verbose=2,n_jobs=-1)
rnd.fit(X_trainprocess,y_trainprocess)























































































































































































































































































In [57]:
rnd.best_params_



In [59]:
SVM_model=SVC(shrinking= True, 
      max_iter= 5000, 
      kernel='poly', 
      gamma= 1, 
      C= 10)

SVM_model.fit(X_trainprocess,y_trainprocess)
y_predtrain=SVM_model.predict(X_trainprocess)
    ##prediction on unseen test set
y_predtest=SVM_model.predict(X_testprocess)

Test_accuracyscore=accuracy_score(y_testprocess,y_predtest)
Test_recallscore=recall_score(y_testprocess,y_predtest)
Test_f1score=f1_score(y_testprocess,y_predtest)
Test_precisionscore=precision_score(y_testprocess,y_predtest)
Test_rocauc_score=roc_auc_score(y_testprocess,y_predtest)
Test_ClassReport=classification_report(y_testprocess,y_predtest)

print(Test_accuracyscore)
print(Test_recallscore)
print(Test_f1score)
print(Test_precisionscore)
print(Test_rocauc_score)
print(Test_ClassReport)

