## Titanic Project: Using an XGBoost Model (by ChatGPT)
Here’s a code example for the Titanic dataset problem using XGBoost. It covers data preprocessing, one-hot encoding, training the model with a held-out validation split, and saving the predictions to a submission file:

In [5]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score
import xgboost as xgb

# Load the datasets
train_data = pd.read_csv('./titanic_project_data/train.csv')
test_data = pd.read_csv('./titanic_project_data/test.csv')

# Display the first few rows of the train dataset
print(train_data.head())

# Data cleaning
# Fill missing Age values with the median age
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].median())
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].median())

# Fill missing Embarked values with the most common port
train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].mode()[0])

# Fill missing Fare values in the test set with the median fare
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].median())

# Drop the Cabin column as it has many missing values
train_data.drop(columns=['Cabin'], inplace=True)
test_data.drop(columns=['Cabin'], inplace=True)

# Drop the Ticket and Name columns as they may not contribute much to the prediction
train_data.drop(columns=['Ticket', 'Name'], inplace=True)
test_data.drop(columns=['Ticket', 'Name'], inplace=True)

# One-hot encoding for categorical variables
encoder = OneHotEncoder(sparse_output=False, drop='first')

# Encode 'Sex' and 'Embarked' columns in the train dataset
encoded_train = encoder.fit_transform(train_data[['Sex', 'Embarked']])
encoded_train_df = pd.DataFrame(encoded_train, columns=encoder.get_feature_names_out(['Sex', 'Embarked']))

# Encode 'Sex' and 'Embarked' columns in the test dataset
encoded_test = encoder.transform(test_data[['Sex', 'Embarked']])
encoded_test_df = pd.DataFrame(encoded_test, columns=encoder.get_feature_names_out(['Sex', 'Embarked']))

# Concatenate encoded columns with original data
train_data = pd.concat([train_data.drop(['Sex', 'Embarked'], axis=1), encoded_train_df], axis=1)
test_data = pd.concat([test_data.drop(['Sex', 'Embarked'], axis=1), encoded_test_df], axis=1)

# Split features and target variable
X = train_data.drop(['Survived', 'PassengerId'], axis=1)
y = train_data['Survived']

# Split the training data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert datasets into DMatrix format for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
dtest = xgb.DMatrix(test_data.drop('PassengerId', axis=1))

# Set XGBoost parameters
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'eta': 0.1,
    'max_depth': 3
}

# Train the model
bst = xgb.train(params, dtrain, num_boost_round=100, evals=[(dval, 'validation')], early_stopping_rounds=10)

# Make predictions on the test dataset, using only the trees up to the best validation round
predictions = bst.predict(dtest, iteration_range=(0, bst.best_iteration + 1))
predictions = np.where(predictions >= 0.5, 1, 0)  # Convert probabilities to binary outcome (0 or 1)

# Create submission DataFrame
submission = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived': predictions})

# Save the submission DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)

print("Your submission was successfully saved!")


   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  

## Reducing Runtime and Complexity, and Increasing Interpretability

1. **Reduce `max_depth` and `num_boost_round`:**
   - Lowering the `max_depth` parameter reduces the complexity of individual trees, making the model less computationally expensive.
   - Reducing `num_boost_round` can lower runtime, although it may slightly reduce accuracy.

2. **Feature Selection:**
   - Dropping less important features or using feature selection techniques can reduce model complexity. The code above already excludes `PassengerId`, `Ticket`, `Name`, and `Cabin`; any low-importance features among the remaining columns are further candidates.

3. **Parameter Tuning:**
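   - Raising `eta` (the learning rate) usually reduces the number of boosting rounds needed for convergence, at the cost of a coarser fit; lowering it has the opposite effect.
   - Use the regularization parameters `alpha` (L1) and `lambda` (L2), which penalize leaf weights, to limit overfitting. This controls model complexity more than it improves interpretability.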

4. **Use a simpler model:**
   - For increased interpretability, consider using logistic regression or decision trees instead of XGBoost.
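
As a rough illustration of point 4, a depth-limited decision tree can be printed as plain if/else rules. The sketch below uses a small synthetic frame with Titanic-like column names (`Pclass`, `Sex_male`, `Age`, `Fare`); the values and the survival rule are invented for the example and are not derived from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic passengers with Titanic-like columns (assumption, not real data)
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    'Pclass': rng.integers(1, 4, size=n),
    'Sex_male': rng.integers(0, 2, size=n),
    'Age': rng.uniform(1, 70, size=n),
    'Fare': rng.uniform(5, 100, size=n),
})

# Invented rule: women and first-class passengers survive more often
survive_prob = 0.2 + 0.5 * (X['Sex_male'] == 0) + 0.2 * (X['Pclass'] == 1)
y = (rng.uniform(size=n) < survive_prob).astype(int)

# A depth-2 tree has at most four leaves, so the whole model fits on a few lines
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

rules = export_text(tree, feature_names=list(X.columns))
print(rules)
```

The printed rules are the entire model, which is the trade-off being described: less predictive capacity than a boosted ensemble, but every decision path can be read directly.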



### Training on the Full Training Set (No `train_test_split`)

If you’re not using `train_test_split`, you can train the model on the entire `train_data` and then use `test_data` for predictions. Here’s the revised code:

In [8]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
import xgboost as xgb

# Load the datasets
train_data = pd.read_csv('./titanic_project_data/train.csv')
test_data = pd.read_csv('./titanic_project_data/test.csv')

# Data cleaning
# Fill missing Age values with the median age
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].median())
test_data['Age'] = test_data['Age'].fillna(test_data['Age'].median())

# Fill missing Embarked values with the most common port
train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].mode()[0])

# Fill missing Fare values in the test set with the median fare
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].median())

# Drop the Cabin column as it has many missing values
train_data.drop(columns=['Cabin'], inplace=True)
test_data.drop(columns=['Cabin'], inplace=True)

# Drop the Ticket and Name columns as they may not contribute much to the prediction
train_data.drop(columns=['Ticket', 'Name'], inplace=True)
test_data.drop(columns=['Ticket', 'Name'], inplace=True)

# One-hot encoding for categorical variables
encoder = OneHotEncoder(sparse_output=False, drop='first')

# Encode 'Sex' and 'Embarked' columns in the train dataset
encoded_train = encoder.fit_transform(train_data[['Sex', 'Embarked']])
encoded_train_df = pd.DataFrame(encoded_train, columns=encoder.get_feature_names_out(['Sex', 'Embarked']))

# Encode 'Sex' and 'Embarked' columns in the test dataset
encoded_test = encoder.transform(test_data[['Sex', 'Embarked']])
encoded_test_df = pd.DataFrame(encoded_test, columns=encoder.get_feature_names_out(['Sex', 'Embarked']))

# Concatenate encoded columns with original data
train_data = pd.concat([train_data.drop(['Sex', 'Embarked'], axis=1), encoded_train_df], axis=1)
test_data = pd.concat([test_data.drop(['Sex', 'Embarked'], axis=1), encoded_test_df], axis=1)

# Separate features and target variable
X_train = train_data.drop(['Survived', 'PassengerId'], axis=1)
y_train = train_data['Survived']

# Convert training data to DMatrix format for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(test_data.drop('PassengerId', axis=1))

# Set XGBoost parameters
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'eta': 0.1,
    'max_depth': 3
}

# Train the model
bst = xgb.train(params, dtrain, num_boost_round=100)

# Make predictions on the test dataset
predictions = bst.predict(dtest)
predictions = np.where(predictions >= 0.5, 1, 0)  # Convert probabilities to binary outcome (0 or 1)

# Create submission DataFrame
submission = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived': predictions})

# Save the submission DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)

### Explanation of Key Changes:
1. **No Validation Split:** The `train_test_split` step was removed since you will use the entire training data (`train_data`) for training and directly make predictions on the provided `test_data`.
2. **Model Training:** The XGBoost model is trained on all of the provided training data without any separation for validation.
3. **Output Prediction:** Predictions are made directly on the test data and saved as `submission.csv`.

### Impact on Runtime, Complexity, and Interpretability:
- **Decrease Runtime and Complexity:** Reduce `max_depth`, lower `num_boost_round`, or raise `eta` so that fewer rounds are needed. Feature selection can also shorten training by leaving fewer columns to split on.
- **Increase Interpretability:** Consider using fewer features, a simpler model like a decision tree, or applying SHAP values for feature importance explanation in XGBoost.