
### PHW1

Compare the performance of the following classification models against the same dataset.
- Decision Tree (using entropy)
- Decision Tree (using gini index)
- Logistic Regression
- Support Vector Machine

Try combinations of the following:
- Various data scaling methods and encoding methods
- Various values of the model parameters for each model.
- Various values for the hyperparameters
- Various numbers 𝑘 for 𝑘-fold cross validation.

** Document the user manual of the program framework in the Scikit-learn style.
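
One way to cover the combinations listed above in a single search is to put the scaler and the classifier in a `Pipeline` and let `GridSearchCV` treat the scaler as just another hyperparameter, repeating the search for several values of 𝑘. The following is a minimal sketch under stated assumptions, not the notebook's own implementation: it uses scikit-learn's bundled `load_breast_cancer` data as a stand-in for `sample_data/tumor.csv`, and the names `search_spaces` and `results`, the step names `scaler` and `model`, and the choice of 𝑘 ∈ {3, 5} are all illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled breast cancer data stands in for the notebook's CSV file
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Candidate scalers, searched as a hyperparameter of the pipeline
scalers = [StandardScaler(), MinMaxScaler()]

# Hypothetical registry: model name -> (estimator, grid for the "model" step)
tree_grid = {"model__max_depth": [5, 10], "model__min_samples_split": [2, 3, 4]}
search_spaces = {
    "tree_entropy": (DecisionTreeClassifier(criterion="entropy", random_state=42), tree_grid),
    "tree_gini": (DecisionTreeClassifier(criterion="gini", random_state=42), tree_grid),
    "logreg": (LogisticRegression(max_iter=5000), {"model__solver": ["lbfgs", "liblinear"]}),
    "svc": (
        SVC(),
        {"model__kernel": ["linear", "rbf"], "model__gamma": [0.001, 0.01, 0.1], "model__C": [0.1]},
    ),
}

results = {}
for name, (estimator, grid) in search_spaces.items():
    # The scaler is refit inside every CV fold, so validation folds never
    # influence the scaling statistics
    pipe = Pipeline([("scaler", StandardScaler()), ("model", estimator)])
    param_grid = {"scaler": scalers, **grid}
    for k in (3, 5):
        cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
        search = GridSearchCV(pipe, param_grid=param_grid, cv=cv)
        search.fit(X_train, y_train)
        # Store mean CV accuracy, held-out test accuracy, and the best settings
        results[(name, k)] = (
            search.best_score_,
            search.score(X_test, y_test),
            search.best_params_,
        )

for (name, k), (cv_score, test_score, best) in results.items():
    print(f"{name} (k={k}): cv={cv_score:.4f}, test={test_score:.4f}, best={best}")
```

Keeping the scaler inside the pipeline means each candidate is scaled with statistics from its own training folds only, and scoring the refit search on the held-out split gives an estimate that the grid search never saw.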

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder

df = pd.read_csv('sample_data/tumor.csv')

X = df.drop(['Class'], axis=1)
y = df['Class']

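scaling_method = [StandardScaler(), MinMaxScaler()]
# NOTE: the `sparse` argument was renamed to `sparse_output` in scikit-learn 1.2
# and removed in 1.4; on newer versions use OneHotEncoder(sparse_output=False)
encoding_method = [LabelEncoder(), OneHotEncoder(sparse=False)]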

param_dte = {'max_depth':[5,10],
             'min_samples_split':[2,3,4]} # Decision Tree using Entropy
param_dtg = {'max_depth':[5,10],
             'min_samples_split':[2,3,4]} # Decision Tree using Gini index
param_lr = {'solver':['lbfgs', 'liblinear']} # Logistic Regression
param_svc = {'kernel':['linear', 'rbf'],
             'gamma':[0.001, 0.01, 0.1],
             'C':[0.1]} # Support Vector Machine

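# Select the parameter grid for a model:
# 1 = entropy tree, 2 = gini tree, 3 = logistic regression, 4 = SVM
def paramSelector(num):
  params = {1: param_dte, 2: param_dtg, 3: param_lr, 4: param_svc}
  if num not in params:
    # Previously an unknown number left `param` unassigned and crashed on return
    raise ValueError(f"num must be 1, 2, 3, or 4, got {num}")
  return params[num]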

# For each scaler/encoder combination, tune the model's hyperparameters with
# 5-fold cross-validated grid search and print the best result
def BuildModel(num, model_name, model, X, y):
  print(model_name)

  # Hold out 20% of the data; only the training split is used for the search
  train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=42)
  param = paramSelector(num)

  for scaling in scaling_method:
    # Fit the scaler on the training split and transform it
    scaled_train_x = scaling.fit_transform(train_x)

    for encoding in encoding_method:
      print(scaling)
      print(encoding)

      # NOTE: the encoder is only printed, not applied, so both encoder
      # iterations search over the same scaled features
      # Use GridSearchCV for hyperparameter search with cross-validation
      grid = GridSearchCV(model, param_grid=param, cv=5, refit=True, return_train_score=True)
      grid.fit(scaled_train_x, train_y)

      print(f"Optimal parameter : {grid.best_params_}")
      print(f"Accuracy : {grid.best_score_}\n")
  print("\n")


# Run the four models listed above
model = DecisionTreeClassifier(criterion='entropy')
BuildModel(1, "DecisionTreeClassifier using entropy", model, X, y)

model = DecisionTreeClassifier(criterion='gini')
BuildModel(2, "DecisionTreeClassifier using gini index", model, X, y)

model = LogisticRegression()
BuildModel(3, "Logistic Regression", model, X, y)

model = SVC()
BuildModel(4, "Support Vector Machine", model, X, y)


DecisionTreeClassifier using entropy
StandardScaler()
LabelEncoder()
Optimal parameter : {'max_depth': 5, 'min_samples_split': 2}
Accuracy : 0.9504920767306089

StandardScaler()
OneHotEncoder(sparse=False)
Optimal parameter : {'max_depth': 5, 'min_samples_split': 3}
Accuracy : 0.9523269391159299

MinMaxScaler()
LabelEncoder()
Optimal parameter : {'max_depth': 5, 'min_samples_split': 2}
Accuracy : 0.950508757297748

MinMaxScaler()
OneHotEncoder(sparse=False)
Optimal parameter : {'max_depth': 5, 'min_samples_split': 2}
Accuracy : 0.950508757297748



DecisionTreeClassifier using gini index
StandardScaler()
LabelEncoder()
Optimal parameter : {'max_depth': 5, 'min_samples_split': 4}
Accuracy : 0.9505754795663053

StandardScaler()
OneHotEncoder(sparse=False)
Optimal parameter : {'max_depth': 5, 'min_samples_split': 3}
Accuracy : 0.9487406171809842

MinMaxScaler()
LabelEncoder()
Optimal parameter : {'max_depth': 5, 'min_samples_split': 3}
Accuracy : 0.9505754795663053

MinMaxScaler()
OneHotE