Importing xgboost for the regression model, scikit-learn for the mean squared error and the hyperparameter search, pandas for DataFrames, and numpy for numerical operations.

In [17]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

Loading the dataset that we will train our model on.

In [18]:
df = pd.read_csv("../datafiles/Train_data.csv")

Calling head() to view the first rows of the dataset.

In [19]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age
0,329,1,3,0,1,1,346,20.525,-1,2,31.0
1,74,0,3,1,1,0,166,14.4542,-1,0,26.0
2,254,0,3,1,1,0,419,16.1,-1,2,30.0
3,720,0,3,1,0,0,260,7.775,-1,2,33.0
4,667,0,2,1,0,0,104,13.0,-1,2,25.0


Locating a string in the Fare column (the Norwegian text means "Maybe you should remove this one?"). Our dataset needs to be cleaned.

In [20]:
df.loc[df["Fare"] == "Kanskje du burde fjerne denne?"]

Unnamed: 0,PassengerId,Survived,Pclass,Sex,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age
1361,765,0,3,1,0,0,268,Kanskje du burde fjerne denne?,-1,2,16.0


Using df.apply with pd.to_numeric to convert every column to numbers. With errors="coerce", any value that cannot be parsed as a number becomes NaN.

In [21]:
df = df.apply(pd.to_numeric, errors="coerce")

Having used df.apply(pd.to_numeric, errors="coerce"), we now have the earlier string cells marked as NaN. We then clean the dataset by calling .dropna(), which removes every row containing a NaN.

In [22]:
df = df.dropna()
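
The two cleaning steps above can be tried in isolation on a small frame. This is a minimal sketch; the `toy` DataFrame and its values are made up for illustration and are not taken from Train_data.csv.

```python
import pandas as pd

# Toy frame (hypothetical values) with one non-numeric cell in "Fare".
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Fare": ["7.25", "not a number", "13.0"],
})

# Every value that cannot be parsed as a number becomes NaN.
numeric = toy.apply(pd.to_numeric, errors="coerce")

# dropna() removes each row that contains at least one NaN.
cleaned = numeric.dropna()

print(numeric["Fare"].isna().sum())     # number of coerced cells
print(cleaned["PassengerId"].tolist())  # ids of the surviving rows
```

Counting the NaN values before calling dropna() is a cheap way to see how many rows the cleaning step is about to discard.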

Validating that the earlier row containing "Kanskje du burde fjerne denne?" in the Fare cell has now been removed.

In [23]:
df.loc[df["PassengerId"] == 765]

Unnamed: 0,PassengerId,Survived,Pclass,Sex,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age


Calling info() to study our dataset.

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1482 entries, 0 to 1492
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1482 non-null   int64  
 1   Survived     1482 non-null   int64  
 2   Pclass       1482 non-null   int64  
 3   Sex          1482 non-null   int64  
 4   SibSp        1482 non-null   int64  
 5   Parch        1482 non-null   int64  
 6   Ticket       1482 non-null   float64
 7   Fare         1482 non-null   float64
 8   Cabin        1482 non-null   int64  
 9   Embarked     1482 non-null   int64  
 10  Age          1482 non-null   float64
dtypes: float64(3), int64(8)
memory usage: 138.9 KB


By default, df.drop_duplicates() compares entire rows. We then call it two more times to drop rows with a duplicated PassengerId or a duplicated Ticket, keeping the first occurrence of each.

In [25]:
df = df.drop_duplicates()
df = df.drop_duplicates(subset=["PassengerId"])
df = df.drop_duplicates(subset=["Ticket"])
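
To see how the `subset` argument changes what counts as a duplicate, here is a small sketch on made-up data (the `toy` frame is hypothetical). Note that deduplicating on Ticket alone also drops distinct passengers who happen to share a ticket number.

```python
import pandas as pd

# Hypothetical rows: index 1 is an exact copy of index 0,
# while index 3 is a different passenger sharing ticket 200.
toy = pd.DataFrame({
    "PassengerId": [1, 1, 2, 3],
    "Ticket": [100, 100, 200, 200],
    "Fare": [7.25, 7.25, 13.0, 8.05],
})

# Without arguments, only rows identical in every column are dropped.
full_row = toy.drop_duplicates()

# With subset, rows are compared on the listed columns only;
# the first occurrence of each Ticket value is kept.
by_ticket = full_row.drop_duplicates(subset=["Ticket"])

print(len(toy), len(full_row), len(by_ticket))
print(by_ticket["PassengerId"].tolist())
```

Comparing the lengths before and after each call shows how much data each rule removes.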

Cleaning our dataset by keeping only rows where Age is strictly between 0 and 100. We then drop every row whose Ticket value is 0 or lower, and cast the Ticket column to an integer type.

In [26]:
df = df[df["Age"] > 0]
df = df[df["Age"] < 100] 
df = df[df["Ticket"] > 0]
df["Ticket"] = df["Ticket"].astype(int)
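
The same filters can also be written as one combined boolean mask. The sketch below uses a made-up `toy` frame; calling `.copy()` before assigning the converted column is a common way to avoid pandas' SettingWithCopyWarning, and is an addition here rather than something the cell above does.

```python
import pandas as pd

# Hypothetical values covering each rejection rule.
toy = pd.DataFrame({
    "Age": [-3.0, 0.42, 35.0, 120.0],
    "Ticket": [10.0, 20.0, 0.0, 40.0],
})

# Each comparison yields a boolean Series; & combines them element-wise.
mask = (toy["Age"] > 0) & (toy["Age"] < 100) & (toy["Ticket"] > 0)

# .copy() makes the filtered frame independent of the original.
filtered = toy[mask].copy()
filtered["Ticket"] = filtered["Ticket"].astype(int)

print(filtered)
```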

Calling info() to inspect our cleaned dataset.

In [27]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 402 entries, 0 to 1492
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  402 non-null    int64  
 1   Survived     402 non-null    int64  
 2   Pclass       402 non-null    int64  
 3   Sex          402 non-null    int64  
 4   SibSp        402 non-null    int64  
 5   Parch        402 non-null    int64  
 6   Ticket       402 non-null    int32  
 7   Fare         402 non-null    float64
 8   Cabin        402 non-null    int64  
 9   Embarked     402 non-null    int64  
 10  Age          402 non-null    float64
dtypes: float64(2), int32(1), int64(8)
memory usage: 36.1 KB


Creating a regression model with XGBRegressor, using its default hyperparameters.

In [28]:
model = xgb.XGBRegressor()

We let y be our "Age" and X be everything else.

In [29]:
X = df.drop("Age", axis=1)
y = df["Age"]

Validating that "Age" has been removed from X

In [30]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 402 entries, 0 to 1492
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  402 non-null    int64  
 1   Survived     402 non-null    int64  
 2   Pclass       402 non-null    int64  
 3   Sex          402 non-null    int64  
 4   SibSp        402 non-null    int64  
 5   Parch        402 non-null    int64  
 6   Ticket       402 non-null    int32  
 7   Fare         402 non-null    float64
 8   Cabin        402 non-null    int64  
 9   Embarked     402 non-null    int64  
dtypes: float64(1), int32(1), int64(8)
memory usage: 33.0 KB


Validating that y only contains "Age".

In [31]:
y.info()

<class 'pandas.core.series.Series'>
Int64Index: 402 entries, 0 to 1492
Series name: Age
Non-Null Count  Dtype  
--------------  -----  
402 non-null    float64
dtypes: float64(1)
memory usage: 6.3 KB


Reading our dataset of test data from the csv file. As specified in the assignment we assume this data is clean and to be trusted.

In [32]:
df_test = pd.read_csv("../datafiles/Test_data.csv")

y_test contains "Age", X_test contains everything else.

In [33]:
X_test = df_test.drop(["Age"], axis=1)
y_test = df_test["Age"]

We train our model by fitting it to X and y.

In [34]:
model.fit(X, y)


We predict the ages for X_test (the clean test data) using the model trained in the code block above.

In [35]:
preds = model.predict(X_test)

display(preds)

array([29.61592   , 11.645875  , 32.733345  , 32.59539   , 28.982075  ,
        6.805506  , 21.551044  , 11.552687  , 30.323124  , 27.891098  ,
       43.59116   , 34.6083    , 45.2575    , 33.228676  , 40.544964  ,
       10.578039  , 39.636227  , 44.671383  , 38.187035  , 43.49068   ,
        0.51472634, 30.52768   , 21.81449   , 29.890509  , 29.750868  ,
       28.800314  , 27.787771  , 30.639275  , 19.550314  , 22.724663  ,
       25.099821  , 19.98789   , 35.571507  , 35.014297  , 25.561426  ,
       32.68694   , 31.785522  , 35.3464    , 50.780823  , 34.546204  ,
       12.450188  , 26.93968   , 41.33641   , 41.216614  , 38.753963  ,
       22.554886  , 14.7937355 , 21.466509  , 48.637913  , 37.095783  ,
       37.584747  , 33.74455   , 30.8529    , 25.852428  , 23.980745  ,
       24.857702  , 44.01738   ,  5.5597954 , 21.307789  , 33.919735  ,
       36.837673  , 28.926031  ,  8.51908   , 43.541553  , 21.863     ,
       48.91516   , 16.957909  , 43.26497   , 27.379827  , 36.73

We use the mean_squared_error function imported earlier to compute the MSE between the true ages in the clean test set (y_test) and the predicted ages.

In [36]:
mse = mean_squared_error(y_test, preds)  # signature is (y_true, y_pred)

We check whether any ages in the training data are below 1.

In [37]:
y.loc[df["Age"] < 1]
#display(X_test.tail())

974    0.42
Name: Age, dtype: float64

The assignment asks for the root mean squared error (RMSE). We apply np.sqrt() to the MSE to get the RMSE of our trained age-prediction model.

In [38]:
np.sqrt(mse)

12.606453245944877
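
For reference, the RMSE is $\sqrt{\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2}$ and is expressed in the same unit as the target (years of age). The sketch below computes it by hand on made-up numbers and compares it with a naive baseline that always predicts the mean. The arrays are hypothetical, and the baseline uses the mean of the same targets only to keep the example short.

```python
import numpy as np

# Hypothetical true and predicted ages.
y_true = np.array([22.0, 38.0, 26.0, 35.0])
y_pred = np.array([25.0, 30.0, 28.0, 40.0])

# Mean of the squared errors, then the square root.
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

# Naive baseline: predict the mean age for every passenger.
baseline_pred = np.full_like(y_true, y_true.mean())
baseline_rmse = np.sqrt(np.mean((y_true - baseline_pred) ** 2))

print(rmse, baseline_rmse)
```

A model is only useful if its RMSE is clearly below that of such a baseline, so the same comparison may be worth running on the real y_test.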

Hyperparameter tuning: defining the candidate values for each hyperparameter.

In [39]:
params = {
    "learning_rate": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth": [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight": [1, 3, 5, 7],
    "gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
    "n_estimators": [100, 200, 300, 400, 500, 900, 1100, 1500],
}
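
To get a feel for the size of this search space, the number of possible combinations is the product of the list lengths. A short sketch, assuming Python 3.8 or later for `math.prod`; `n_iter` is set to 100 only as an illustrative sampling budget.

```python
import math

params = {
    "learning_rate": [0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth": [3, 4, 5, 6, 8, 10, 12, 15],
    "min_child_weight": [1, 3, 5, 7],
    "gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
    "n_estimators": [100, 200, 300, 400, 500, 900, 1100, 1500],
}

# Size of the full grid: 6 * 8 * 4 * 5 * 4 * 8 combinations.
n_combinations = math.prod(len(values) for values in params.values())

# Illustrative budget: how much of the grid 100 random draws would cover.
n_iter = 100
fraction = n_iter / n_combinations

print(n_combinations)
print(f"{fraction:.2%}")
```

With a grid this large, an exhaustive GridSearchCV would be expensive, which is the usual reason for sampling a limited number of candidates at random.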

Defining a new model and using RandomizedSearchCV to evaluate 100 randomly sampled parameter combinations (n_iter) with 5-fold cross-validation.

In [40]:
model2 = xgb.XGBRegressor()

random_search = RandomizedSearchCV(
    model2,
    param_distributions=params,
    n_iter=100,
    scoring="neg_mean_squared_error",
    n_jobs=-1,
    cv=5,
)

random_search.fit(X,y)


Inspecting the best parameters found by the search:

In [41]:
best_params = random_search.best_params_  # a dict of parameters, not a model
best_params

{'n_estimators': 100,
 'min_child_weight': 1,
 'max_depth': 3,
 'learning_rate': 0.1,
 'gamma': 0.2,
 'colsample_bytree': 0.3}
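
Because the candidates are sampled at random, the selected parameters can differ from run to run unless `random_state` is fixed. The sketch below illustrates this on synthetic data. To keep it light it uses scikit-learn's DecisionTreeRegressor as a stand-in for XGBRegressor, and `X_toy`, `y_toy`, `toy_params` and `run_search` are all hypothetical names. It also shows that `best_score_` is a negated MSE under this scoring, so a cross-validated RMSE is `sqrt(-best_score_)`.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (not the Titanic data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy[:, 0] * 2.0 + rng.normal(scale=0.1, size=60)

toy_params = {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 2, 5]}


def run_search(seed):
    """Run a small randomized search with a fixed sampling seed."""
    search = RandomizedSearchCV(
        DecisionTreeRegressor(random_state=0),  # stand-in estimator
        param_distributions=toy_params,
        n_iter=5,
        scoring="neg_mean_squared_error",
        cv=3,
        random_state=seed,  # fixes which candidates are sampled
    )
    return search.fit(X_toy, y_toy)


first = run_search(42)
second = run_search(42)

# Same seed, same candidates, same winner.
print(first.best_params_ == second.best_params_)

# best_score_ is the negated mean MSE over the folds.
cv_rmse = np.sqrt(-first.best_score_)
print(cv_rmse)
```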

Retrieving the best estimator, which the search has refit on all of X and y using the best parameters:

In [42]:
model_new = random_search.best_estimator_

Confirming the new model is of the type XGBRegressor.

In [43]:
type(model_new)

xgboost.sklearn.XGBRegressor

Creating a new prediction set based on the new model and the clean test data of X_test

In [44]:
preds2 = model_new.predict(X_test)

Computing the new MSE between the true ages (y_test) and the predictions of the tuned model.

In [45]:
mse_new = mean_squared_error(y_test, preds2)  # signature is (y_true, y_pred)

Finding the RMSE of the tuned model. As we can see, the RMSE has dropped from 12.61 to 11.93. Not too shabby!

In [46]:
np.sqrt(mse_new)

11.93322365340896
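
To put the two results side by side, the improvement can be expressed in absolute and relative terms. The values below are copied from the two RMSE outputs in this notebook; the variable names are only illustrative.

```python
# RMSE values copied from the outputs above.
rmse_default = 12.606453245944877
rmse_tuned = 11.93322365340896

# Absolute gain in years, and gain relative to the default model.
absolute_gain = rmse_default - rmse_tuned
relative_gain = absolute_gain / rmse_default

print(f"{absolute_gain:.2f} years, {relative_gain:.1%}")
```

The gain is roughly 5% on this test set. Since the search is randomized, a rerun may give a somewhat different figure.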