# Example application

Here we walk through the steps of a simple machine learning application based on the Titanic data set.

## Load the data and print a bit of information

In [1]:
import pandas as pd
import numpy as np
titanic = pd.read_csv("titanic.csv")

In [2]:
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


Before we can start, we need to handle a few issues. The data set contains several NaN values, which many machine learning algorithms can't handle. They can't handle non-numeric values either, and we deal with those in two ways: we turn the `Sex` feature into a dummy variable and remove all other non-numeric columns.

In [3]:
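# Dropping unwanted (non-numeric) columns:
titanic = titanic.drop(["Name", "Ticket", "Cabin", "Embarked"], axis='columns')

# Creating a dummy variable for sex (1 = male, 0 = female).
# astype(int) keeps the column numeric on pandas versions where get_dummies returns booleans:
titanic["Sex"] = pd.get_dummies(titanic["Sex"])["male"].astype(int)

# We will also drop all rows that contain NaN values:
titanic = titanic.dropna()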
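
To make the effect of these three steps visible, here is a minimal sketch on a tiny made-up frame. The rows in `toy` are hypothetical and not taken from `titanic.csv`; the explicit `astype(int)` assumes a pandas version in which `get_dummies` may return boolean columns.

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set with one text column and one missing age
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "male"],
    "Age": [22.0, np.nan, 35.0],
})

# Drop the free-text column, as done for Name/Ticket/Cabin/Embarked
toy = toy.drop(["Name"], axis="columns")

# One indicator column: 1 for male, 0 for female
toy["Sex"] = pd.get_dummies(toy["Sex"])["male"].astype(int)

# Remove every row that still contains a NaN
toy = toy.dropna()

print(toy)
```

Note that `dropna` keeps the original index labels, which is why the index in the outputs that follow has gaps (for example, row 888 has no age and disappears).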

We want to predict whether a given passenger survived. So `Survived` is our target variable (i.e. our labels).

In [4]:
labels = titanic["Survived"]
labels

0      0
1      1
2      1
3      1
4      0
      ..
885    0
886    0
887    1
889    1
890    0
Name: Survived, Length: 714, dtype: int64

In [7]:
data = titanic.drop(['Survived'], axis='columns')
data

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare
0,1,3,1,22.0,1,0,7.2500
1,2,1,0,38.0,1,0,71.2833
2,3,3,0,26.0,0,0,7.9250
3,4,1,0,35.0,1,0,53.1000
4,5,3,1,35.0,0,0,8.0500
...,...,...,...,...,...,...,...
885,886,3,0,39.0,0,5,29.1250
886,887,2,1,27.0,0,0,13.0000
887,888,1,0,19.0,0,0,30.0000
889,890,1,1,26.0,0,0,30.0000



## Building a model

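To illustrate the process we can use the naïve Bayes model (more on this later). In scikit-learn every ML algorithm is implemented in its own class (for naïve Bayes it is `GaussianNB` in `sklearn.naive_bayes`), which has to be instantiated. The cell below keeps naïve Bayes and k nearest neighbors as commented-out alternatives and actually uses XGBoost's `XGBClassifier`, which follows the same interface. It also splits the data into a training part (80%) and a test part (20%).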

In [41]:
# Naive Bayes:
#from sklearn.naive_bayes import GaussianNB
#model = GaussianNB()

# It is extremely easy to try a different machine learning algorithm instead!
# Just uncomment the two lines below (and comment out the XGBoost lines)
# to use a "k Nearest Neighbors" model:

#from sklearn.neighbors import KNeighborsClassifier
#model = KNeighborsClassifier()

# XGBoost (the model used in the rest of this example):
from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=1000, learning_rate=0.05)

# Hold out 20% of the rows as a test set:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)
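
Swapping models is easy because scikit-learn classifiers share the same `fit`/`predict`/`score` methods. The following is a minimal sketch of that idea on synthetic data; the arrays `X` and `y` and the names `candidate` and `accuracies` are made up for illustration and have nothing to do with the Titanic data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(4.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Each classifier is trained and scored through the same interface
accuracies = {}
for candidate in (GaussianNB(), KNeighborsClassifier()):
    candidate.fit(X, y)
    accuracies[type(candidate).__name__] = candidate.score(X, y)

print(accuracies)
```

Because the loop body never mentions a concrete model class, adding another classifier only means adding it to the tuple.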

Use the `fit` method to train the model; it takes the training data and the corresponding labels as arguments.

In [42]:
model.fit(X_train, y_train)

## Making predictions

We use the learned model to make predictions about new data instances for which we do not know the labels.
Let's see whether a person with PassengerId 100, class 3, female, age 23, two siblings/spouses aboard (SibSp 2), Parch 0 and a fare of 15 is predicted to survive:

In [43]:
# New data organized in a two-dimensional array, in the column order
# PassengerId, Pclass, Sex (0 = female), Age, SibSp, Parch, Fare
x_new = np.array([[100, 3, 0, 23, 2, 0, 15]])

In [44]:
predict = model.predict(x_new)

In [45]:
print("Prediction: {}".format(predict))

Prediction: [0]
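
Passing a bare array relies on getting the column order exactly right. One possible alternative, sketched here under the assumption that the seven feature columns are the ones left after preprocessing, is to describe the passenger by column name. The names `feature_columns`, `passenger` and `x_new_df` are introduced only for this illustration.

```python
import pandas as pd

# Feature columns left after preprocessing, in training order
feature_columns = ["PassengerId", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]

# One hypothetical passenger, described by name instead of by position
passenger = {
    "PassengerId": 100, "Pclass": 3, "Sex": 0, "Age": 23,
    "SibSp": 2, "Parch": 0, "Fare": 15,
}

# reindex puts the columns in the order used during training
x_new_df = pd.DataFrame([passenger]).reindex(columns=feature_columns)
print(x_new_df)
```

A frame like this could then be passed to `model.predict` in place of the array.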


Lastly, let's see for how many of the passengers the model correctly predicted whether they survived. Note that `data` includes the rows the model was trained on, so these scores are optimistic:

In [46]:
print("Accuracy score: {}".format(model.score(data, labels)))
from sklearn.metrics import roc_auc_score
print("AUC-ROC score: {}".format(roc_auc_score(labels, model.predict_proba(data)[:,1])))

Accuracy score: 0.9551820728291317
AUC-ROC score: 0.975300910865322


In [40]:
#cross validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, data, labels, cv=5)
print("Cross-validation scores: {}".format(scores))

Cross-validation scores: [0.63636364 0.67132867 0.73426573 0.8041958  0.83802817]
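
With an integer `cv`, scikit-learn builds stratified folds for a classifier without shuffling, so the folds follow the row order of the data. A minimal sketch of passing an explicitly shuffled splitter instead is shown here on synthetic data; `X`, `y`, `folds` and `cv_scores` are hypothetical names, and whether shuffling changes the Titanic scores is not something this sketch establishes.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Hypothetical two-class data, for illustration only
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(60, 3)),
               rng.normal(1.5, 1.0, size=(60, 3))])
y = np.array([0] * 60 + [1] * 60)

# Shuffled, stratified folds with a fixed seed for reproducibility
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(GaussianNB(), X, y, cv=folds)

print("Per-fold accuracy:", cv_scores)
print("Mean: {:.3f}, std: {:.3f}".format(cv_scores.mean(), cv_scores.std()))
```

Reporting the mean together with the spread gives a fairer picture than any single fold.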


In [49]:
# Test data evaluation (for a classifier, score returns the accuracy, not ROC-AUC)
score = model.score(X_test, y_test)
print("Test data accuracy score: {}".format(score))

Test data accuracy score: 0.7762237762237763
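
A classifier's `score` method returns accuracy, which is a different quantity from ROC-AUC. The small sketch below, on hand-made labels and probabilities (all values hypothetical), shows how the two are computed and that they need not agree.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 1])
proba = np.array([0.1, 0.6, 0.4, 0.7, 0.9])

# Accuracy judges the hard 0/1 decisions obtained with a 0.5 threshold
y_pred = (proba >= 0.5).astype(int)
acc = accuracy_score(y_true, y_pred)

# ROC-AUC judges how well the probabilities rank positives above negatives
auc = roc_auc_score(y_true, proba)

print(acc, auc)
```

Here three of five decisions are correct (accuracy 0.6), while five of the six positive/negative pairs are ranked correctly (ROC-AUC of about 0.83). For the test split above, the analogous call would use `model.predict_proba(X_test)[:, 1]` as the score argument.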


In [136]:
gold_data = pd.read_csv("gold_data.csv")

# Read the JSON file datafrom2023.json
import json
with open('datafrom2023.json') as f:
    new_data = json.load(f)
new_data = pd.DataFrame(new_data)
new_data.rename(columns={'timestamp':'date'}, inplace=True)

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
gold_data = pd.concat([gold_data, new_data], ignore_index=True)
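
As a small illustration of combining two sets of records, here is a sketch using `pd.concat` on made-up frames. The values are hypothetical; the column names `date`, `timestamp` and `price` are the ones used in this notebook.

```python
import pandas as pd

# Hypothetical price records (values are made up)
old_rows = pd.DataFrame({"date": ["2023-08-20", "2023-08-21"],
                         "price": [4800, 4850]})
new_rows = pd.DataFrame({"timestamp": ["2023-08-22"], "price": [4900]})

# Align the column names, then stack the frames with a fresh 0..n-1 index
new_rows = new_rows.rename(columns={"timestamp": "date"})
combined = pd.concat([old_rows, new_rows], ignore_index=True)
print(combined)
```

Renaming before concatenating matters: with mismatched column names the result would contain both a `date` and a `timestamp` column, each partly filled with NaN.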


In [138]:
#save gold_data to csv
gold_data.to_csv('gold_data3.csv', index=False)

In [114]:
label = gold_data["price"]
train = gold_data.drop(['price'], axis='columns')

In [123]:
# Convert gold_data['date'] to datetime, then to an integer (nanoseconds since the epoch);
# the date is the only feature used for training
gold_data['date'] = pd.to_datetime(gold_data['date'])
train = pd.to_numeric(gold_data['date'])

In [124]:
from xgboost import XGBRegressor
model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
model.fit(train, label)

In [142]:
# Predict for 22/08/2023 08:00:00
datetoday = pd.to_datetime('2023-08-22 08:00:00')
intdate = datetoday.value

In [143]:
intdate

1692691200000000000
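
The integer above is the timestamp expressed in nanoseconds since 1970-01-01 00:00:00. A minimal sketch of the conversion in both directions (the names `stamp`, `as_int` and `round_trip` are introduced only for this illustration):

```python
import pandas as pd

stamp = pd.Timestamp("2023-08-22 08:00:00")

# .value is the number of nanoseconds since 1970-01-01 00:00:00
as_int = stamp.value
print(as_int)

# Converting the integer back recovers the original timestamp
round_trip = pd.to_datetime(as_int, unit="ns")
print(round_trip)
```

The same encoding has to be used for training and for prediction, since the model only ever sees the integer.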

In [144]:
# Make prediction
datedf = pd.DataFrame({'date': [intdate]})
model.predict(datedf)

array([4878.792], dtype=float32)

In [93]:
# Request gold prices from 2018-01-01 08:00:00 to 2023-08-29 08:00:00
import requests
gold_data2 = requests.get('https://west.albion-online-data.com/api/v2/stats/Gold.json?date=2018-01-01%2008%3A00%3A00&end_date=2023-08-29%2008%3A00%3A00')

In [94]:
gold_data2

<Response [200]>
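
A `200` status only says that the request succeeded; the data still has to be extracted from the response. The sketch below shows one way this could look, using a hand-written JSON string in place of the live response. The payload in `body` is hypothetical: its field names are assumed to match the `timestamp` and `price` columns used earlier, and the real API output may differ.

```python
import json
import pandas as pd

# Hypothetical response body; the real payload's fields are an assumption here
body = (
    '[{"price": 4800, "timestamp": "2023-08-21T08:00:00"},'
    ' {"price": 4850, "timestamp": "2023-08-22T08:00:00"}]'
)

# With a live response, response.json() would take the place of json.loads(body)
records = json.loads(body)

# Build a frame and align it with the column naming used for gold_data
api_frame = pd.DataFrame(records).rename(columns={"timestamp": "date"})
api_frame["date"] = pd.to_datetime(api_frame["date"])
print(api_frame)
```

With the `requests` library, calling `raise_for_status()` on the response before parsing is a common way to fail early on HTTP errors.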