# A Simple XGBT Model using XGBoost 🪐
## Tracking Experiments Using Neptune AI...
In this notebook I demonstrate how to track experiments using Neptune.ai.

## What's Neptune AI
Neptune is a metadata store for MLOps, built for teams that run a lot of experiments.
It gives you a single place to log, store, display, organize, compare, and query all your model-building metadata.

https://neptune.ai/

**Neptune is used for:**
* Experiment tracking: Log, display, organize, and compare ML experiments in a single place.
* Model registry: Version, store, manage, and query trained models and model-building metadata.
* Monitoring ML runs live: Record and monitor model training, evaluation, or production runs live.



## Dashboards

Use the links below to view the dashboards.

**Main View**

https://app.neptune.ai/cviejom/Spaceship-Titanic/experiments?split=tbl&dash=charts&viewId=standard-view

**Comparison Dashboard**

https://app.neptune.ai/cviejom/Spaceship-Titanic/experiments?compare=auto&split=cmp&dash=leaderboard&viewId=standard-view&base=SPAC-8&to=SPAC-7


---

## About the Data...
In this competition your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.

File and Data Field Descriptions

train.csv - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
* PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
* HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
* CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
* Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
* Destination - The planet the passenger will be debarking to.
* Age - The age of the passenger.
* VIP - Whether the passenger has paid for special VIP service during the voyage.
* RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
* Name - The first and last names of the passenger.
* Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
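
The two composite fields described above, `PassengerId` (gggg_pp) and `Cabin` (deck/num/side), can be unpacked with plain string splitting. The sketch below is only an illustration: the helper names `parse_passenger_id` and `parse_cabin` and the sample values are made up for this example and are not read from the dataset.

```python
def parse_passenger_id(passenger_id):
    """Split an id of the form gggg_pp into (group, number within the group)."""
    group, number = passenger_id.split('_')
    return group, int(number)


def parse_cabin(cabin):
    """Split a cabin of the form deck/num/side into its three parts."""
    deck, num, side = cabin.split('/')
    return deck, int(num), side


# Illustrative values only (hypothetical, not taken from train.csv)
print(parse_passenger_id('0003_02'))
print(parse_cabin('F/12/S'))
```

The group is kept as a string so that leading zeros are preserved.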

test.csv - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.

sample_submission.csv - A submission file in the correct format.
* PassengerId - Id for each passenger in the test set.
* Transported - The target. For each passenger, predict either True or False.

---

# 1.0 - Installing Libraries

In [None]:
%%capture
!pip install neptune-client

---

# 2.0 - Importing Libraries

In [None]:
%%time
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

---

# 3.0 - Configuring Neptune AI Project

In [None]:
%%time
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
API_KEY = user_secrets.get_secret("NEPTUNE_API_TOKEN")

In [None]:
%%time
import neptune.new as neptune

In [None]:
%%time
run = neptune.init(
    project = 'cviejom/Spaceship-Titanic',
    name = 'Spaceship-Titanic',
    tags = ['Kaggle', 'Machine_Learning'],
    api_token = API_KEY
)

In [None]:
%%time
run['Algorithm'] = 'XGBoost'

---

# 4.0 - Configuring the Notebook

In [None]:
%%time
# Silence library warnings to keep the notebook output readable.
import warnings
warnings.filterwarnings('ignore')

In [None]:
%%time
# Notebook Configuration...

# Amount of data we want to load into the Model...
DATA_ROWS = None
# Dataframe, the amount of rows and cols to visualize...
NROWS = 50
NCOLS = 15
# Main data location path...
BASE_PATH = '...'

In [None]:
%%time
# Configure notebook display settings to only use 2 decimal places, tables look nicer.
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.max_columns', NCOLS) 
pd.set_option('display.max_rows', NROWS)

---

# 5.0 - Importing the Information or Data for the Analysis

In [None]:
%%time
trn_data = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
tst_data = pd.read_csv('/kaggle/input/spaceship-titanic/test.csv')

In [None]:
%%time
sub = pd.read_csv('/kaggle/input/spaceship-titanic/sample_submission.csv')

---

# 6.0 - Exploring the Datasets

## 6.1 - Understanding the Train Dataset

In [None]:
%%time
trn_data.info()

In [None]:
%%time
trn_data.head()

In [None]:
%%time
trn_data.describe()

In [None]:
%%time
def describe_categ(df):
    for col in df.columns:
        unique_samples = list(df[col].unique())
        unique_values = df[col].nunique()

        print(f' {col}: {unique_values} Unique Values,  Data Sample >> {unique_samples[:5]}')
    print(' ...')
    return None

In [None]:
%%time
describe_categ(trn_data)

In [None]:
%%time
trn_data.isnull().sum()

In [None]:
%%time
trn_data[trn_data['CryoSleep'].isnull()].sample(10)

## 6.2 - Understanding the Test Dataset

In [None]:
%%time
tst_data.head()

In [None]:
%%time
describe_categ(tst_data)

In [None]:
%%time
tst_data.isnull().sum()

## 6.3 - Understanding the Target Variable

In [None]:
%%time
def analyse_categ_target(df, target = 'Transported'):
    
    transported = df[df[target] == True].shape[0]
    not_transported = df[df[target] == False].shape[0]
    total = transported + not_transported
    
    # The .2% format multiplies the ratio by 100 and appends the percent sign
    print(f'Transported     : {transported / total:.2%}')
    print(f'Not Transported : {not_transported / total:.2%}')
    print(f'Total Passengers: {total}')
    print('...')

In [None]:
%%time
analyse_categ_target(trn_data)

In [None]:
%%time
trn_passenger_ids = set(trn_data['PassengerId'].unique())
tst_passenger_ids = set(tst_data['PassengerId'].unique())
intersection = trn_passenger_ids.intersection(tst_passenger_ids)
print('Overlapped Passengers:', len(intersection))

---

# 7.0 - Feature Engineering

## 7.1 - Traveling Group

In [None]:
%%time
def extract_group(df):
    '''
    Extract the travel group from the PassengerId (the part before the underscore).
    '''
    df['TravelGroup'] =  df['PassengerId'].str.split('_', expand = True)[0]
    return df

In [None]:
%%time
trn_data = extract_group(trn_data)
tst_data = extract_group(tst_data)

## 7.2 - Total Money Expended

In [None]:
%%time
def total_billed(df):
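    '''
    Calculate the total amount billed to the passenger during the trip.
    The sum is NaN when any of the five amounts is missing.
    Args:
        df (DataFrame): Input data with the five amenity spending columns.
    Returns:
        df (DataFrame): The same DataFrame with a new TotalBilled column.
    '''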
    
    df['TotalBilled'] = df['RoomService'] + df['FoodCourt'] + df['ShoppingMall'] + df['Spa'] + df['VRDeck']
    return df

In [None]:
%%time
trn_data = total_billed(trn_data)
tst_data = total_billed(tst_data)

## 7.3 - Extracting Deck, Side and Cabin Number

In [None]:
%%time
def cabin_separation(df):
    '''
    Split the Cabin name into Deck, Number and Side
    
    '''
    
    # Split once and assign the three parts in a single step
    df[['CabinDeck', 'CabinNum', 'CabinSide']] = df['Cabin'].str.split('/', expand=True)
    df.drop(columns = ['Cabin'], inplace = True)
    return df

In [None]:
%%time
trn_data = cabin_separation(trn_data)
tst_data = cabin_separation(tst_data)

## 7.4 - Counting the Number of Missing Data

In [None]:
%%time
def count_missing(df):
    '''
    Count the number of missing values in each row and store it in MissingData.
    '''
    
    df['MissingData'] = df.isnull().sum(axis = 1)
    return df

In [None]:
%%time
trn_data = count_missing(trn_data)
tst_data = count_missing(tst_data)

## 7.5 - Extracting Family Names

In [None]:
%%time
def name_extraction(df):
    '''
    Split the Name of the passenger into First and Family...
    
    '''
    
    df['FirstName']  = df['Name'].str.split(' ', expand=True)[0]
    df['FamilyName'] = df['Name'].str.split(' ', expand=True)[1]
    df.drop(columns  = ['Name'], inplace = True)
    return df

In [None]:
%%time
trn_data = name_extraction(trn_data)
tst_data = name_extraction(tst_data)

## 7.6 - Number of Relatives

In [None]:
%%time
trn_relatives = trn_data.groupby('FamilyName')['PassengerId'].count().reset_index()
tst_relatives = tst_data.groupby('FamilyName')['PassengerId'].count().reset_index()

In [None]:
%%time
trn_relatives = trn_relatives.rename(columns = {'PassengerId': 'NumRelatives'})
tst_relatives = tst_relatives.rename(columns = {'PassengerId': 'NumRelatives'})

In [None]:
%%time
trn_data = trn_data.merge(trn_relatives, how = 'left', on = ['FamilyName'])
tst_data = tst_data.merge(tst_relatives, how = 'left', on = ['FamilyName'])
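
The three cells above (group, rename, merge) can also be collapsed into a single `groupby(...).transform('count')`. This is a sketch of an alternative on a toy frame, not the approach used in this notebook; `toy` and all of its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not taken from the competition files
toy = pd.DataFrame({
    'PassengerId': ['0001_01', '0002_01', '0002_02', '0003_01'],
    'FamilyName': ['Ofracculy', 'Vines', 'Vines', 'Susent'],
})

# transform('count') returns one value per row, aligned with the original index
toy['NumRelatives'] = toy.groupby('FamilyName')['PassengerId'].transform('count')
print(toy)
```

As in the notebook's version, the count includes the passenger, so a passenger travelling without relatives gets 1.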

---

# 8.0 - Pre-Processing the Data

## 8.1 - Imputing Missing Values

In [None]:
%%time
# Set all amenity spending to 0 for passengers younger than age_limit.
def fill_nans_by_age(df, age_limit = 13):
    df['RoomService'] = np.where(df['Age'] < age_limit, 0, df['RoomService'])
    df['FoodCourt'] = np.where(df['Age'] < age_limit, 0, df['FoodCourt'])
    df['ShoppingMall'] = np.where(df['Age'] < age_limit, 0, df['ShoppingMall'])
    df['Spa'] = np.where(df['Age'] < age_limit, 0, df['Spa'])
    df['VRDeck'] = np.where(df['Age'] < age_limit, 0, df['VRDeck'])
    
    return df

In [None]:
%%time
trn_data =  fill_nans_by_age(trn_data)
tst_data =  fill_nans_by_age(tst_data)

In [None]:
%%time
# Set all amenity spending to 0 for passengers in cryosleep (they are confined to their cabins).
def fill_nans_by_cryo(df):
    df['RoomService'] = np.where(df['CryoSleep'] == True, 0, df['RoomService'])
    df['FoodCourt'] = np.where(df['CryoSleep'] == True, 0, df['FoodCourt'])
    df['ShoppingMall'] = np.where(df['CryoSleep'] == True, 0, df['ShoppingMall'])
    df['Spa'] = np.where(df['CryoSleep'] == True, 0, df['Spa'])
    df['VRDeck'] = np.where(df['CryoSleep'] == True, 0, df['VRDeck'])
    
    return df

In [None]:
%%time
trn_data =  fill_nans_by_cryo(trn_data)
tst_data =  fill_nans_by_cryo(tst_data)

In [None]:
%%time
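def fill_nans_by_totalspend(df):
    # Only fill missing CryoSleep values: a passenger with no spending is assumed to be in cryosleep.
    df['CryoSleep'] = np.where(df['CryoSleep'].isnull() & (df['TotalBilled'] == 0), True, df['CryoSleep'])
    return df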

In [None]:
%%time
trn_data =  fill_nans_by_totalspend(trn_data)
tst_data =  fill_nans_by_totalspend(tst_data)

In [None]:
%%time
def fill_na_using_groups(df, group_field = 'TravelGroup'):
    """
    Fill the missing information for numerical features utilizing the median value of the group.

    Args
        df(DataFrame): The input DataFrame to impute.
        group_field (str): The name of the field that will be used to group the data.
    Returns
        df (DataFrame): A DataFrame with some imputed missing values.

    """
    
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    numeric_tmp = df.select_dtypes(include = numerics)
    categorical_tmp = df.select_dtypes(exclude = numerics)
    
    for col in numeric_tmp.columns:
        df[col] = df[col].fillna(df.groupby(group_field)[col].transform('median'))
    
    return df

In [None]:
%%time
trn_data = fill_na_using_groups(trn_data, group_field = 'TravelGroup')
tst_data = fill_na_using_groups(tst_data, group_field = 'TravelGroup')
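
A toy run of the group-median idea used by `fill_na_using_groups` may make its behaviour easier to see. The frame `toy` and its values are hypothetical. The example also shows why the global `fill_missing` step further down is still needed: a group whose values are all missing has no median to borrow.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: group 0001 has one gap, group 0002 has no known age at all
toy = pd.DataFrame({
    'TravelGroup': ['0001', '0001', '0001', '0002'],
    'Age': [30.0, np.nan, 40.0, np.nan],
})

# Median age per group, broadcast back to every row of that group
group_median = toy.groupby('TravelGroup')['Age'].transform('median')
toy['Age'] = toy['Age'].fillna(group_median)
print(toy)
```

The gap in group 0001 is filled with the median of 30 and 40, while the single member of group 0002 stays missing.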

In [None]:
%%time
def fill_categ_na_using_groups(df, group_field = 'TravelGroup'):
    """
    Fill the missing values using the most frequent value (mode) of each column within the group.

    Args
        df(DataFrame): The input DataFrame to impute.
        group_field (str): The name of the field that will be used to group the data.
    Returns
        df (DataFrame): A DataFrame with some imputed missing values.

    """
    
    df = df.groupby(group_field).apply(lambda x: x.fillna(x.mode().iloc[0])).reset_index(drop = True)
    return df


In [None]:
%%time
trn_data = fill_categ_na_using_groups(trn_data, group_field = 'TravelGroup')
tst_data = fill_categ_na_using_groups(tst_data, group_field = 'TravelGroup')

In [None]:
%%time
trn_data.isnull().sum()

In [None]:
%%time
tst_data.isnull().sum()

In [None]:
tst_data[tst_data['CryoSleep'] == False].head()

In [None]:
%%time
trn_data[trn_data['Age'].isnull()].sample(10)

In [None]:
%%time
def fill_missing(df):
    '''
    Fill the remaining missing values with the column mean (numeric features)
    or the most common value (all other features).
    '''
    
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    numeric_tmp = df.select_dtypes(include = numerics)
    categ_tmp = df.select_dtypes(exclude = numerics)

    for col in numeric_tmp.columns:
        print(col)
        df[col] = df[col].fillna(value = df[col].mean())
        
    for col in categ_tmp.columns:
        print(col)
        df[col] = df[col].fillna(value = df[col].mode()[0])
        
    print('...', '\n')
    
    return df

In [None]:
%%time
trn_data =  fill_missing(trn_data)
tst_data =  fill_missing(tst_data)

In [None]:
%%time
trn_data.isnull().sum()

In [None]:
%%time
def age_groups(df, age_limit = 13):
    df['AgeGroup'] = np.where(df['Age'] < age_limit, 0, 1)
    return df

In [None]:
%%time
trn_data =  age_groups(trn_data)
tst_data =  age_groups(tst_data)

## 8.2 - Label Encoding

In [None]:
print(trn_data.columns)

In [None]:
%%time
categorical_features = ['TravelGroup','CabinNum', 'FirstName', 'FamilyName']
categorical_features_onehot = ['HomePlanet','CryoSleep','Destination','VIP', 'CabinSide', 'CabinDeck']

In [None]:
%%time
from sklearn.preprocessing import LabelEncoder 

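def encode_categorical(train_df, test_df, categ_feat = categorical_features):
    '''
    Label-encode the given features, fitting each encoder on train and test combined
    so that labels present in only one of the two sets are still known.
    '''
    encoder_dict = {}
    # Use the function arguments rather than the global DataFrames
    concat_data = pd.concat([train_df[categ_feat], test_df[categ_feat]])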
    
    for col in concat_data.columns:
        print('Encoding: ', col, '...')
        encoder = LabelEncoder()
        encoder.fit(concat_data[col])
        encoder_dict[col] = encoder

        train_df[col + '_Enc'] = encoder.transform(train_df[col])
        test_df[col + '_Enc'] = encoder.transform(test_df[col])
    
    train_df = train_df.drop(columns = categ_feat, axis = 1)
    test_df = test_df.drop(columns = categ_feat, axis = 1)

    return train_df, test_df

In [None]:
%%time
trn_data, tst_data = encode_categorical(trn_data, tst_data, categorical_features)
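
The encoders above are fitted on the train and test values together because `LabelEncoder.transform` raises an error for labels it did not see during `fit`. A minimal sketch with made-up family names (the lists `train_vals` and `test_vals` are hypothetical):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values: 'Ofracculy' appears only in the test list
train_vals = ['Vines', 'Susent']
test_vals = ['Vines', 'Ofracculy']

# Fitting on the union means every label in either list gets an integer code
enc = LabelEncoder().fit(train_vals + test_vals)
print(list(enc.classes_))
print(enc.transform(test_vals))
```

The codes follow the sorted order of the labels, so they carry no meaning beyond identity.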

## 8.3 - One Hot Encoding

In [None]:
%%time
def one_hot(df, one_hot_categ):
    for col in one_hot_categ:
        print('Encoding: ', col, '...')
        tmp = pd.get_dummies(df[col], prefix = col)
        df = pd.concat([df, tmp], axis = 1)
    df = df.drop(columns = one_hot_categ)
    return df

In [None]:
%%time
trn_data = one_hot(trn_data, categorical_features_onehot) 
tst_data = one_hot(tst_data, categorical_features_onehot) 
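
Because `one_hot` is applied to the train and test frames separately, a category that appears in only one of them would produce different columns in each. One possible safeguard, shown here on hypothetical toy frames and not part of the original pipeline, is to reindex the test dummies to the training columns:

```python
import pandas as pd

# Hypothetical toy frames: deck 'T' is present only in the training data
toy_trn = pd.DataFrame({'CabinDeck': ['A', 'B', 'T']})
toy_tst = pd.DataFrame({'CabinDeck': ['A', 'B']})

trn_dummies = pd.get_dummies(toy_trn['CabinDeck'], prefix='CabinDeck')
tst_dummies = pd.get_dummies(toy_tst['CabinDeck'], prefix='CabinDeck')

# Add columns missing from the test frame (filled with 0) and drop any extras
tst_dummies = tst_dummies.reindex(columns=trn_dummies.columns, fill_value=0)
print(list(tst_dummies.columns))
```

After the reindex both frames expose the same columns in the same order, which is what the model expects at prediction time.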

In [None]:
%%time
trn_data.head()

In [None]:
trn_data.info(verbose=True)

---

# 9.0 - Train and Validation Strategy

## 9.1 - Selection of the Features for Training

In [None]:
%%time
target_feature = 'Transported'

remove = ['PassengerId', 
          'Route', 
          'FirstName_Enc', 
          'CabinNum_Enc', 
          'Transported',
          'Name'
         ]

features = [feat for feat in trn_data.columns if feat not in remove]

In [None]:
%%time
features

['Age',
 'RoomService',
 'FoodCourt',
 'ShoppingMall',
 'Spa',
 'VRDeck',
 'AgeGroup',
 'Total_Billed',
 'TravelGroup',
 'FamilyName_Enc',
 'HomePlanet_Earth',
 'HomePlanet_Europa',
 'HomePlanet_Mars',
 'CryoSleep_False',
 'CryoSleep_True',
 'CabinDeck_A',
 'CabinDeck_B',
 'CabinDeck_C',
 'CabinDeck_D',
 'CabinDeck_E',
 'CabinDeck_F',
 'CabinDeck_G',
 'CabinDeck_T',
 'CabinSide_P',
 'CabinSide_S',
 'Destination_55 Cancri e',
 'Destination_PSO J318.5-22',
 'Destination_TRAPPIST-1e',
 'VIP_False',
 'VIP_True']

In [None]:
%%time
run['Features'] = features

## 9.2 - Validation Strategy: Simple Train/Test Split

In [None]:
%%time
from sklearn.model_selection import train_test_split
test_size_pct = 0.10
X_train, X_valid, y_train, y_valid = train_test_split(trn_data[features], trn_data[target_feature], test_size = test_size_pct, random_state = 42)

In [None]:
%%time
run['CV Strategy'] = 'train_test_split'
run['Test Size'] = test_size_pct
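
With `test_size = 0.10`, one row in ten is held out for validation. The sketch below reproduces that on synthetic arrays and additionally passes `stratify`, a `train_test_split` parameter that keeps the class ratio equal in both parts. Stratification is not used in this notebook; it is shown only as an option, and `X` and `y` are made-up data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, perfectly balanced binary target
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.10, random_state=42, stratify=y
)
print(len(X_tr), len(X_va))
```

Since the Transported target is described above as roughly balanced, a plain random split is likely to give a similar class ratio anyway.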

---

# 10.0 - Model Training Using Gradient Boosted Trees

## 10.1 - Model Training

In [None]:
%%time
from xgboost  import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

In [None]:
%%time
params = {'learning_rate': 0.001,
         'n_estimators': 8096,
         'n_jobs': -1,
         'random_state': 42,
         'objective': 'binary:logistic',
        }

In [None]:
run['Params'] = params

In [None]:
%%time
cls = XGBClassifier(**params)
cls.fit(X_train, y_train, eval_set = [(X_valid, y_valid)], eval_metric = ['logloss'], early_stopping_rounds = 128, verbose = 250)
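
`early_stopping_rounds = 128` asks XGBoost to stop adding trees once the validation log loss has not improved for 128 consecutive rounds. The rule itself can be sketched in plain Python. The function `best_round` and the loss values are hypothetical, and this is an illustration of the idea, not XGBoost's internal code.

```python
def best_round(losses, patience):
    """Return the index of the best loss, stopping after `patience` rounds without improvement."""
    best_idx = 0
    for i, loss in enumerate(losses):
        if loss < losses[best_idx]:
            best_idx = i          # new best round
        elif i - best_idx >= patience:
            break                 # no improvement for `patience` rounds
    return best_idx


# Made-up validation losses: the last value is never reached with patience=3
val_logloss = [0.69, 0.60, 0.55, 0.56, 0.57, 0.58, 0.40]
print(best_round(val_logloss, patience=3))
```

A small patience can stop before a later improvement, which is one reason to pair a very low learning rate with a generous patience as done here.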

## 10.2 - Model Performance Analysis

In [None]:
%%time
from sklearn.metrics import accuracy_score

# accuracy_score expects (y_true, y_pred)
trn_preds = cls.predict(X_train[features])
trn_preds = trn_preds.astype('bool')
trn_accuracy = accuracy_score(y_train, trn_preds)

val_preds = cls.predict(X_valid[features])
val_preds = val_preds.astype('bool')
val_accuracy = accuracy_score(y_valid, val_preds)

In [None]:
%%time
# Single hold-out split, so these are plain scores rather than means
print(f'Train accuracy score: {trn_accuracy}')
print(f'Validation accuracy score: {val_accuracy}')
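
Accuracy is the share of predictions that match the true labels. A hand-computed sketch with made-up values (`y_true` and `y_pred` are hypothetical):

```python
# Hypothetical labels and predictions
y_true = [True, False, True, True, False]
y_pred = [True, False, False, True, True]

# Count the positions where prediction and label agree
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)
```

A large gap between the train and validation scores would point to overfitting.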

In [None]:
%%time
run['Train Accuracy'] = trn_accuracy
run['Val Accuracy'] = val_accuracy

## 10.3 - Feature Importance Analysis

In [None]:
%%time
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
def plot_feature_importance(importance, names, model_type):
    #Create arrays from feature importance and feature names
    feature_importance = np.array(importance)
    feature_names = np.array(names)

    #Create a DataFrame using a Dictionary
    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_df = pd.DataFrame(data)

    #Sort the DataFrame in order of decreasing feature importance
    fi_df.sort_values(by=['feature_importance'], ascending=False,inplace=True)

    #Define size of bar plot
    plt.figure(figsize=(8,8))
    #Plot Seaborn bar chart
    sns.barplot(x=fi_df['feature_importance'], y=fi_df['feature_names'])
    #Add chart labels
    plt.title(model_type + 'FEATURE IMPORTANCE')
    plt.xlabel('FEATURE IMPORTANCE')
    plt.ylabel('FEATURE NAMES')

In [None]:
plot_feature_importance(cls.feature_importances_,X_train.columns,'XG BOOST ')

---

# 11.0 - Model Prediction Submission to Kaggle

In [None]:
%%time
preds = cls.predict(tst_data[features])

In [None]:
%%time
sub['Transported'] = preds
sub.to_csv('submission_simple_split_03262022.csv', index = False)

In [None]:
%%time
# Stop logging to your Run
run.stop()