# Model Optimization and Training Notebook(s) (60 points)
Must include complete training and optimization of:
- A Penalized (Ridge, Lasso or ElasticNet) linear model (Linear Regression or Logistic Regression).
- Support Vector Machine
- Ensemble model (e.g. Random Forest or Gradient Boosting)
- Neural network implemented in PyTorch

You may use one combined notebook or separate notebooks for each model.

REMEMBER: For the optimization and training stage, you must not use the test set you put aside in the prerequisite Final Project assignment. 

In [22]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv('../Data/survey-lung-cancer.csv')
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df.to_csv('../Data/train_lung_cancer.csv', index=False)
test_df.to_csv('../Data/test_lung_cancer.csv', index=False)

## Preprocessing

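Planned preprocessing: apply `StandardScaler` to `X_train` and `X_test`, and `LabelEncoder` to `y_train` and `y_test`. The cells below do the encoding by hand, recoding the survey answers and the target to 0/1. No scaling is applied yet.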

In [None]:
import pandas as pd
train_df = pd.read_csv('../Data/train_lung_cancer.csv')

In [None]:
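features_df = train_df.drop(columns=['LUNG_CANCER'])
features = features_df.columns.tolist()
# Recoding (not standardization): turn 1 into 0 and 2 into 1.
# Only columns whose values are all 1 or 2 are recoded, so any other numeric column is left untouched.
binary_cols = [c for c in features if train_df[c].isin([1, 2]).all()]
train_df[binary_cols] = train_df[binary_cols].replace({1: 0, 2: 1})
# Encode the categorical columns as 0/1 integers
train_df['GENDER'] = train_df['GENDER'].replace({'M': 0, 'F': 1}).astype(int)
train_df['LUNG_CANCER'] = train_df['LUNG_CANCER'].replace({'NO': 0, 'YES': 1}).astype(int)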

In [None]:
# Separate the features (X_train) from the target (y_train)
X_train = train_df.drop('LUNG_CANCER', axis=1)
y_train = train_df['LUNG_CANCER']
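
The held-out test set will eventually need exactly the same recoding. One way to keep the two in sync is to wrap the steps in a function. The sketch below is hypothetical: the name `recode_survey` and the demo columns `AGE` and `SMOKING` are illustrative assumptions, and only `GENDER` and `LUNG_CANCER` are taken from the cells above.

```python
import pandas as pd


def recode_survey(df: pd.DataFrame) -> pd.DataFrame:
    """Return a recoded copy of df; the input frame is not modified."""
    out = df.copy()
    feature_cols = [c for c in out.columns if c != "LUNG_CANCER"]
    # Columns that only contain 1/2 are treated as no/yes survey answers
    binary_cols = [c for c in feature_cols if out[c].isin([1, 2]).all()]
    out[binary_cols] = out[binary_cols].replace({1: 0, 2: 1})
    out["GENDER"] = out["GENDER"].map({"M": 0, "F": 1})
    out["LUNG_CANCER"] = out["LUNG_CANCER"].map({"NO": 0, "YES": 1})
    return out


# Tiny illustrative frame (assumed columns, not real survey rows)
demo = pd.DataFrame({
    "GENDER": ["M", "F"],
    "AGE": [59, 61],
    "SMOKING": [1, 2],
    "LUNG_CANCER": ["YES", "NO"],
})
recoded = recode_survey(demo)
print(recoded)
```

Because the function returns a copy, it can be applied once to the training frame and once, later, to the test frame without either call affecting the other.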

## Training and optimization of the 4 models mentioned above.

### A penalized linear model (Ridge, Lasso or ElasticNet) (Linear Regression or Logistic Regression).

Chosen model: Logistic Regression with a Lasso (L1) penalty.
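
A minimal sketch of how an L1-penalized logistic regression could be tuned with cross-validation on training data only. Synthetic data from `make_classification` stands in for `X_train` and `y_train`. The `C` grid, the `liblinear` solver and the scaling step are illustrative assumptions, not choices taken from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (assumption: 15 features, imbalanced classes)
X_demo, y_demo = make_classification(
    n_samples=247, n_features=15, n_informative=6,
    weights=[0.15, 0.85], random_state=42,
)

# Scaling lives inside the pipeline so it is re-fit on each CV training fold
lasso_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear",
                               max_iter=1000, random_state=42)),
])

# Smaller C means a stronger L1 penalty and more coefficients driven to zero
lasso_grid = GridSearchCV(
    lasso_pipe,
    param_grid={"clf__C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="accuracy",
)
lasso_grid.fit(X_demo, y_demo)

best_lasso = lasso_grid.best_estimator_
n_zero = int(np.sum(best_lasso.named_steps["clf"].coef_ == 0))
print("Best C:", lasso_grid.best_params_["clf__C"])
print("CV accuracy:", round(lasso_grid.best_score_, 3))
print("Coefficients set to zero:", n_zero)
```

To run this on the real data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`.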

### Support Vector Machine.

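SVMs are sensitive to feature scale, so they normally benefit from `StandardScaler`. Most features here are 0/1 after recoding, which limits the effect, but any feature on a larger numeric range would still dominate the margin. The cell below fits on the features as prepared above, without scaling.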

In [29]:
# Support Vector Machine (SVM)
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix

# Define and train the model
svm_model = SVC(kernel='linear', random_state=42)
svm_model.fit(X_train, y_train)

# Make predictions on the training set
y_train_pred = svm_model.predict(X_train)

# Evaluate the model
print("Confusion Matrix (Train):")
print(confusion_matrix(y_train, y_train_pred))
print("\nClassification Report (Train):")
print(classification_report(y_train, y_train_pred))

Confusion Matrix (Train):
[[ 27  10]
 [  7 203]]

Classification Report (Train):
              precision    recall  f1-score   support

           0       0.79      0.73      0.76        37
           1       0.95      0.97      0.96       210

    accuracy                           0.93       247
   macro avg       0.87      0.85      0.86       247
weighted avg       0.93      0.93      0.93       247
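
The SVM above uses the default `C` and is assessed on the data it was trained on, so the figures are optimistic. A hedged sketch of an optimization step follows: scaling and the classifier go in one pipeline, and `C` and the kernel are tuned with 5-fold cross-validation. Synthetic data stands in for `X_train` and `y_train`. The grid values and the `f1_macro` scoring (chosen here because the classes are imbalanced, 37 vs 210) are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for X_train / y_train (assumption)
X_demo, y_demo = make_classification(
    n_samples=247, n_features=15, n_informative=6,
    weights=[0.15, 0.85], random_state=42,
)

svm_pipe = Pipeline([
    ("scaler", StandardScaler()),   # re-fit on each CV training fold
    ("svc", SVC(random_state=42)),
])

# Illustrative grid: 2 kernels x 3 values of C = 6 candidates
svm_param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10],
}

svm_grid = GridSearchCV(svm_pipe, svm_param_grid, cv=5, scoring="f1_macro")
svm_grid.fit(X_demo, y_demo)

print("Best parameters:", svm_grid.best_params_)
print("Cross-validated macro F1:", round(svm_grid.best_score_, 3))
```

The cross-validated score is computed on folds the model did not see during fitting, so it is a fairer estimate than the training-set report.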



### Ensemble model (e.g. Random Forest or Gradient Boosting).

In [28]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define the model
rf = RandomForestClassifier(random_state=42)
# Define the hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
# Set up GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=5, n_jobs=-1, scoring='accuracy')

# Fit the model
grid_search.fit(X_train, y_train)

# Get the best model
best_rf = grid_search.best_estimator_
print("Best Hyperparameters:", grid_search.best_params_)

Best Hyperparameters: {'max_depth': 10, 'min_samples_split': 2, 'n_estimators': 50}
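
The selected `max_depth` of 10 is the smallest value in the grid, which suggests shallower trees may be worth trying. A minimal sketch of how the full cross-validation results could be inspected, rather than only the best parameters, is given here. It uses synthetic stand-in data and a reduced, illustrative grid; both are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (assumption)
X_demo, y_demo = make_classification(
    n_samples=247, n_features=15, n_informative=6,
    weights=[0.15, 0.85], random_state=42,
)

# Reduced grid that also probes shallower trees (illustrative values)
demo_grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [25, 50], "max_depth": [3, 5, 10]},
    cv=5,
    scoring="accuracy",
)
demo_grid.fit(X_demo, y_demo)

# One row per candidate, best-ranked first
results = pd.DataFrame(demo_grid.cv_results_)[
    ["param_max_depth", "param_n_estimators",
     "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(results.to_string(index=False))
print("Best CV accuracy:", round(demo_grid.best_score_, 3))
```

Comparing `mean_test_score` against `std_test_score` shows whether the gap between candidates is larger than the fold-to-fold noise.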


### Neural network implemented in PyTorch.
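
A minimal sketch of a binary classifier in PyTorch, under stated assumptions: synthetic data stands in for `X_train` and `y_train`, and the architecture (one hidden layer of 16 units), the Adam optimizer, the learning rate and the epoch count are illustrative rather than tuned. The validation split is carved out of the training data, so the test set stays untouched.

```python
import torch
from torch import nn
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

torch.manual_seed(42)

# Synthetic stand-in for X_train / y_train (assumption)
X_demo, y_demo = make_classification(
    n_samples=247, n_features=15, n_informative=6,
    weights=[0.15, 0.85], random_state=42,
)

# Validation data comes from the training set only
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fit the scaler on the training part, then apply it to both parts
scaler = StandardScaler().fit(X_tr)
X_tr_t = torch.tensor(scaler.transform(X_tr), dtype=torch.float32)
X_val_t = torch.tensor(scaler.transform(X_val), dtype=torch.float32)
y_tr_t = torch.tensor(y_tr, dtype=torch.float32).unsqueeze(1)
y_val_t = torch.tensor(y_val, dtype=torch.float32).unsqueeze(1)

model = nn.Sequential(
    nn.Linear(X_tr_t.shape[1], 16),
    nn.ReLU(),
    nn.Linear(16, 1),  # single logit; the sigmoid is folded into the loss
)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Full-batch training loop
losses = []
for epoch in range(200):
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(X_tr_t), y_tr_t)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Evaluate on the validation part
model.eval()
with torch.no_grad():
    val_pred = (torch.sigmoid(model(X_val_t)) >= 0.5).float()
    val_acc = (val_pred == y_val_t).float().mean().item()

print(f"Final training loss: {losses[-1]:.4f}")
print(f"Validation accuracy: {val_acc:.3f}")
```

Hidden size, learning rate and epoch count are the natural candidates for the optimization step, for example by looping over a small set of values and comparing validation accuracy.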