## Instructions:

- Put the parts of your code under the corresponding sections. (0.25/2 points will be taken off for not doing this.)
- Do not include any redundant/irrelevant code, text or comments. (0.5/2 points will be taken off for not doing this.)
- **Your code must run without any errors or runtime issues.** (Failure to meet this condition will result in a 0.)
- **Your code must return your Public Leaderboard score.** (Failure to meet this condition will result in a 0.)
- **Submit both your ipynb and your html file for grading purposes.**

## 1) Libraries

Put all the Python libraries and tools you imported here.

In [16]:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings("ignore")

## 2) Data

- This section must contain the code that reads, cleans, and preprocesses the datasets.
- Note that both the training and test datasets should undergo the same sequence of operations.
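
One way to keep the training and test operations identical is to route both through a single helper. The sketch below is illustrative only: `prepare_features` is a hypothetical function, and the tiny frames and their column names (`instant_bookable`, `price`, `target`) are made-up placeholders, not the competition data.

```python
import pandas as pd

def prepare_features(df, drop_cols, cat_cols):
    """Hypothetical helper: apply the same cleaning steps to any split."""
    # errors="ignore" lets the same drop list work when a split lacks the target column
    out = df.drop(columns=drop_cols, errors="ignore").copy()
    for col in cat_cols:
        # Cast categorical columns to strings so the encoder sees one consistent type
        out[col] = out[col].astype(str)
    return out

# Placeholder data standing in for train.csv and test.csv
train_demo = pd.DataFrame({"id": [1, 2], "instant_bookable": [True, False],
                           "price": [80.0, 120.0], "target": [1, 0]})
test_demo = pd.DataFrame({"id": [3], "instant_bookable": [True], "price": [60.0]})

X_train_demo = prepare_features(train_demo, ["id", "target"], ["instant_bookable"])
X_test_demo = prepare_features(test_demo, ["id", "target"], ["instant_bookable"])
```

Because both splits pass through the same function with the same arguments, they end up with the same columns and the same dtypes.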

In [17]:

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
y = train['host_is_superhost']
X = train.drop(columns=['host_is_superhost', 'id', 'description', 'host_about', 'first_review', 'last_review',
                        'host_verifications', 'host_location', 'host_neighbourhood', 'neighbourhood_cleansed',
                        'host_since', 'amenities', 'bathrooms_text'], errors='ignore')


In [18]:

X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
cat_cols = X_train.select_dtypes(include=['object', 'bool']).columns.tolist()
X_train[cat_cols] = X_train[cat_cols].astype(str)
X_val[cat_cols] = X_val[cat_cols].astype(str)
num_cols = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()


## 3) Machine Learning Model

- This section must train the **already tuned** model and use it to obtain the test predictions (or prediction probabilities).
- As stated in the instructions, your code must not have any runtime issues, so **do NOT include your grid search here!** You still need to tune your model to pass the thresholds, but keep that tuning as your personal work.

In [19]:

numeric_transformer = Pipeline([('imputer', SimpleImputer(strategy='median'))])
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
])
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, num_cols),
    ('cat', categorical_transformer, cat_cols)
])
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', XGBClassifier(subsample=0.8,
                                 n_estimators=700,
                                 min_child_weight=1,
                                 max_depth=6,
                                 learning_rate=0.07,
                                 colsample_bytree=0.6,
                                 eval_metric='auc',
                                 random_state=42))
])
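
The hold-out split from Section 2 can give a local estimate of the score before the final fit. The sketch below shows the pattern on synthetic data. It assumes ROC AUC is the metric of interest (suggested by `eval_metric='auc'`, but not confirmed here), and it uses `LogisticRegression` purely as a lightweight stand-in for the tuned pipeline. All `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the real features
X_demo, y_demo = make_classification(n_samples=500, n_features=10, n_informative=5,
                                     n_clusters_per_class=1, class_sep=2.0,
                                     random_state=42)

# Same split settings as the notebook: stratified, 20% hold-out
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, stratify=y_demo,
                                          test_size=0.2, random_state=42)

# Stand-in model; in the notebook this would be the `model` pipeline
demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(X_tr, y_tr)

# Score the positive-class probabilities on the hold-out rows only
val_auc = roc_auc_score(y_va, demo_model.predict_proba(X_va)[:, 1])
print(f"Validation ROC AUC: {val_auc:.4f}")
```

In the notebook itself, the same three steps would be to fit on `X_train`, call `predict_proba` on `X_val`, and pass the result to `roc_auc_score` with `y_val`.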


## 4) Exporting the Predictions

Include the code that (1) puts the predictions in the format that Kaggle understands and (2) exports it as a csv file.
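
Before exporting, it can help to sanity-check the predictions. The sketch below uses the `id` and `predicted` column names that appear in this notebook's export cell. The `build_submission` helper and the sample values are hypothetical, and the checks assume the competition expects probabilities in [0, 1].

```python
import io

import numpy as np
import pandas as pd

def build_submission(ids, probs, id_col="id", pred_col="predicted"):
    """Hypothetical helper: validate predictions and build the submission frame."""
    probs = np.asarray(probs, dtype=float)
    if len(ids) != len(probs):
        raise ValueError("ids and predictions must have the same length")
    if np.isnan(probs).any() or (probs < 0).any() or (probs > 1).any():
        raise ValueError("predictions must be probabilities in [0, 1] with no NaN")
    return pd.DataFrame({id_col: list(ids), pred_col: probs})

# Made-up ids and probabilities for illustration
demo_submission = build_submission([101, 102, 103], [0.12, 0.87, 0.5])

# Write to an in-memory buffer here; the notebook writes to a file instead
buffer = io.StringIO()
demo_submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Passing `index=False` keeps the pandas row index out of the file, so the CSV contains only the two expected columns.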

In [20]:

X_full = pd.concat([X_train, X_val])
y_full = pd.concat([y_train, y_val])
model.fit(X_full, y_full)

# Align the test columns with the training features first, so the cast below cannot hit a missing column
X_test = test.reindex(columns=X.columns)
# Cast categoricals to strings exactly as was done for the training data
X_test[cat_cols] = X_test[cat_cols].astype(str)

y_test_pred = model.predict_proba(X_test)[:, 1]
submission = pd.DataFrame({'id': test['id'], 'predicted': y_test_pred})
submission.to_csv("submission.csv", index=False)
