# Context – Drunk Smurfs
Among all international hotel guests, Smurfs are burdened with the upkeep of a singular reputation: they are (supposedly) the rowdiest bunch one can entertain, and are equally well-known for unbridled spending as for racking up extensive costs in damages to hotel infrastructure, staff, and occasionally also other guests – costs which typically cannot be recovered once the guest has sought out the safety of his (or her) homeland.
It is your job as a data scientist to screen Smurf clients applying to an exclusive hotel in the Bahamas - yes, it's the kind of hotel you need to apply for!
# The data
At your disposal is a training set containing data about the behavior of 5000 Smurf hotel guests (train_V2.csv). This data set contains information about the profit the hotel made during their last visit (excluding damages), but also whether they caused damages during their last visit, and for what amount. These outcomes are respectively called 'outcome_profit', 'outcome_damage_inc', and 'outcome_damage_amount'. To predict them, you have access to a host of personal information: previous history of profits and damages, use of hotel facilities, socio-demographics, and behavioral scores from the staff of other hotels within the hotel chain. A brief description of the features is available in dictionary.csv.
You also get information on the 500 applicants for the 2024 season (score.csv). It is your job to return a list of 150 clients that offer an attractive balance between projected profit for the hotel, and anticipated damages. 
You will notice the data set contains a large number of oddities. You are expected to think for yourself about what is intuitive and acceptable in terms of approach, and to provide some brief reflection on this in your technical report. 


# Possible approach
To generate a client list, you can (but don't have to) follow the next steps:
1)	prepare the data set	
* briefly survey the data
* deal with data issues:
* appropriately handle categorical data
* treat missing data
* identify outliers, and choose whether to make your analysis more robust by removing these
2)	predict the projected revenue per client
* choose an algorithm, and train it in an optimal way
* score the 500 applicants
3)	predict which clients will cause damage
* choose an algorithm, and train it in an optimal way
* score the 500 applicants
4)	for those that will wreak havoc, predict the amount of damage they will cause
* choose an algorithm, and train it in an optimal way
* score the 500 applicants
5)	create a measure of the expected value of each applicant, and create an optimal selection of 200 guests


## 0. Loading packages and dataset

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

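# 'seaborn-darkgrid' was renamed in Matplotlib 3.6; fall back to the old name on older versions
style = 'seaborn-v0_8-darkgrid' if 'seaborn-v0_8-darkgrid' in plt.style.available else 'seaborn-darkgrid'
plt.style.use(style)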

In [None]:
# read in data
train = pd.read_csv('train_V2.csv')
score = pd.read_csv('score.csv')

## 1. Data exploration

1. and 2: number of features and observations

In [None]:
train.shape

In [None]:
# correlation matrix (numeric columns only; pandas >= 2.0 raises an error on non-numeric columns otherwise)
corr_matrix = train.corr(numeric_only=True)
print(corr_matrix)
# No pair of variables has a correlation of 1, so we do not need to drop any variables on that basis.


# Check for constant variables

In [None]:
# There are no constant variables, so we do not need to omit any on this basis.
constant_columns = [col for col in train.columns if train[col].nunique() == 1]
print(constant_columns)

No constant variables were found.

In [None]:
train.describe().T

In [None]:
train[0:500].T

3. Check for datatypes

In [None]:
train.info()

4. and 5. Check for missing data

In [None]:
#Here we check how many missing values we have per variable.
train.isnull().sum()[train.isnull().sum() != 0]

In [None]:
# Percentage of observations that are not NaN, for each variable with missing values
(len(train) - train.isnull().sum()[train.isnull().sum() != 0]) / len(train) * 100
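
The share of missing values can also be computed without a hard-coded row count. The following is a minimal sketch on a toy frame; the frame `toy` and its values are illustrative assumptions, not the hotel data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; the values are illustrative only.
toy = pd.DataFrame({
    "age": [30, np.nan, 45, 52],
    "spa_ic": [1, 0, np.nan, np.nan],
    "nights_booked": [3, 5, 2, 7],
})

# isna().mean() gives the fraction of missing values per column,
# so the number of rows never has to be written out by hand.
missing_share = toy.isna().mean()
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)

# Percentage of observed (non-missing) values per affected column.
observed_pct = (1 - missing_share) * 100
print(observed_pct)
```

Because the result is sorted, the columns with the most missing values appear first, which makes it easier to decide which ones need special treatment.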

In [None]:
# Define the columns that cannot contain negative values
non_neg_cols = ['outcome_damage_inc','outcome_damage_amount','crd_lim_rec', 'credit_use_ic', 'insurance_ic', 'spa_ic', 
                'empl_ic', 'bar_no', 'sport_ic','neighbor_income','age', 'dining_ic', 
                'presidential', 'client_segment', 'sect_empl','prev_stay','prev_all_in_stay', 'fam_adult_size', 'children_no','tenure_yrs',
                'tenure_mts','company_ic', 'claims_no','claims_am', 'damage_am', 'damage_inc','nights_booked', 'shop_am', 'shop_use', 'retired',
                'profit_am','profit_last_am', 'gold_status']

mask = (train[non_neg_cols] < 0).any(axis=1)

# Drop the rows that have negative values in any of the specified columns
train = train[~mask]
train

## b) Look at the data


In [None]:
#here we look at the first 16 variables
train.iloc[:,0:16].head()

In [None]:
#here we look at the variables starting from the 16th just to see what the data looks like
train.iloc[:,16:53].head()

## Plot the data

### Barplot

In [None]:
# Look at the amount of men and women with a bar chart
sns.countplot(y=train["gender"])

### Correlation plot

In [None]:
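# numeric_only=True skips non-numeric columns (required on pandas >= 2.0 when such columns are present)
corrmat = train.corr(numeric_only=True)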

fig, ax = plt.subplots(figsize=(8,8))

# Add title to the Heat map
title = "Correlation between variables heatmap"

# Set the font size and the distance of the title from the plot
plt.title(title,fontsize=18)
ttl = ax.title
ttl.set_position([0.5,1.05])

# Hide ticks for X & Y axis
ax.set_xticks([])
ax.set_yticks([])

# Remove the axes
ax.axis('off')

sns.heatmap(corrmat,fmt="",cmap='RdYlGn',linewidths=0.30,ax=ax)

plt.show()
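
A heatmap with this many variables is hard to read, so it can help to list the most strongly correlated pairs explicitly. This is a sketch on synthetic data; the columns `a`, `b` and `c` are hypothetical stand-ins for the real features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),  # almost a rescaled copy of "a"
    "c": rng.normal(size=200),
})

corr = toy.corr(numeric_only=True).abs()
cols = corr.columns

# Indices of the strict upper triangle: each pair once, diagonal excluded.
i, j = np.triu_indices(len(cols), k=1)
pairs = pd.DataFrame({
    "feature_1": cols[i],
    "feature_2": cols[j],
    "abs_corr": corr.to_numpy()[i, j],
}).sort_values("abs_corr", ascending=False)

print(pairs.head())
```

Pairs near the top of this table are candidates for dropping one of the two features.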


In [None]:
# sns.set()
# features = train.copy()
# features = features.drop(["outcome_damage_inc"], axis=1)
# xvars = features.columns
# sns.pairplot(train, y_vars=['outcome_damage_inc'], x_vars=(xvars[0:5]))
# sns.pairplot(train, y_vars=['outcome_damage_inc'], x_vars=(xvars[5:10]))
# sns.pairplot(train, y_vars=['outcome_damage_inc'], x_vars=(xvars[10:15]))
# sns.pairplot(train, y_vars=['outcome_damage_inc'], x_vars=(xvars[15:20]))
# sns.pairplot(train, y_vars=['outcome_damage_inc'], x_vars=(xvars[20:25]))
# sns.pairplot(train, y_vars=['outcome_damage_inc'], x_vars=(xvars[25:30]))
# sns.pairplot(train, y_vars=['outcome_damage_inc'], x_vars=(xvars[30:35]))
# sns.pairplot(train, y_vars=['outcome_damage_inc'], x_vars=(xvars[35:40]))
# sns.pairplot(train, y_vars=['outcome_damage_inc'], x_vars=(xvars[40:45]))
# sns.pairplot(train, y_vars=['outcome_damage_inc'], x_vars=(xvars[45:50]))
# sns.pairplot(train, y_vars=['outcome_damage_inc'], x_vars=(xvars[50:53]))

# plt.show()

In [None]:
# sns.set()
# features = train.copy()
# features = features.drop(["outcome_profit"], axis=1)
# xvars = features.columns
# sns.pairplot(train, y_vars=['outcome_profit'], x_vars=(xvars[0:5]))
# sns.pairplot(train, y_vars=['outcome_profit'], x_vars=(xvars[5:10]))
# sns.pairplot(train, y_vars=['outcome_profit'], x_vars=(xvars[10:15]))
# sns.pairplot(train, y_vars=['outcome_profit'], x_vars=(xvars[15:20]))
# sns.pairplot(train, y_vars=['outcome_profit'], x_vars=(xvars[20:25]))
# sns.pairplot(train, y_vars=['outcome_profit'], x_vars=(xvars[25:30]))
# sns.pairplot(train, y_vars=['outcome_profit'], x_vars=(xvars[30:35]))
# sns.pairplot(train, y_vars=['outcome_profit'], x_vars=(xvars[35:40]))
# sns.pairplot(train, y_vars=['outcome_profit'], x_vars=(xvars[40:45]))
# sns.pairplot(train, y_vars=['outcome_profit'], x_vars=(xvars[45:50]))
# sns.pairplot(train, y_vars=['outcome_profit'], x_vars=(xvars[50:53]))

# plt.show()

In [None]:
# sns.set()
# features = train.copy()
# features = features.drop(["outcome_damage_amount"], axis=1)
# xvars = features.columns
# sns.pairplot(train, y_vars=['outcome_damage_amount'], x_vars=(xvars[0:5]))
# sns.pairplot(train, y_vars=['outcome_damage_amount'], x_vars=(xvars[5:10]))
# sns.pairplot(train, y_vars=['outcome_damage_amount'], x_vars=(xvars[10:15]))
# sns.pairplot(train, y_vars=['outcome_damage_amount'], x_vars=(xvars[15:20]))
# sns.pairplot(train, y_vars=['outcome_damage_amount'], x_vars=(xvars[20:25]))
# sns.pairplot(train, y_vars=['outcome_damage_amount'], x_vars=(xvars[25:30]))
# sns.pairplot(train, y_vars=['outcome_damage_amount'], x_vars=(xvars[30:35]))
# sns.pairplot(train, y_vars=['outcome_damage_amount'], x_vars=(xvars[35:40]))
# sns.pairplot(train, y_vars=['outcome_damage_amount'], x_vars=(xvars[40:45]))
# sns.pairplot(train, y_vars=['outcome_damage_amount'], x_vars=(xvars[45:50]))
# sns.pairplot(train, y_vars=['outcome_damage_amount'], x_vars=(xvars[50:53]))

# plt.show()

## Check for outliers

We calculate the z_scores of each data point and identify outliers as data points with a score greater than 3, here for 'outcome_damage_amount'

In [None]:
# calculate the Z-score of each data point
z_scores = np.abs((train['outcome_damage_amount'] - train['outcome_damage_amount'].mean()) / train['outcome_damage_amount'].std())

# identify outliers as data points with a Z-score greater than 3
outliers = train[z_scores > 3]

# print the number of outliers
print(len(outliers["outcome_damage_amount"]))

We calculate the z_scores of each data point and identify outliers as data points with a score greater than 3, here for 'outcome_profit'

In [None]:
# calculate the Z-score of each data point
z_scores = np.abs((train['outcome_profit'] - train['outcome_profit'].mean()) / train['outcome_profit'].std())

# identify outliers as data points with a Z-score greater than 3
outliers = train[z_scores > 3]

# print the number of outliers
print(len(outliers["outcome_profit"]))

We calculate the z_scores of each data point and identify outliers as data points with a score greater than 3, here for 'outcome_damage_inc'

In [None]:
# calculate the Z-score of each data point
z_scores = np.abs((train['outcome_damage_inc'] - train['outcome_damage_inc'].mean()) / train['outcome_damage_inc'].std())

# identify outliers as data points with a Z-score greater than 3
outliers = train[z_scores > 3]

# print the number of outliers
print(len(outliers["outcome_damage_inc"]))
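
The three z-score checks above repeat the same calculation, so they can be wrapped in one helper. This is a sketch only: the function name `count_zscore_outliers` and the toy values are assumptions introduced for illustration.

```python
import numpy as np
import pandas as pd

def count_zscore_outliers(df, columns, threshold=3.0):
    """Count, per column, the rows whose absolute z-score exceeds `threshold`."""
    counts = {}
    for col in columns:
        z = (df[col] - df[col].mean()) / df[col].std()
        counts[col] = int((z.abs() > threshold).sum())
    return pd.Series(counts)

# Toy data: one extreme profit value, evenly spread damage amounts.
toy = pd.DataFrame({
    "outcome_profit": [100.0] * 50 + [10_000.0],
    "outcome_damage_amount": np.linspace(0, 10, 51),
})

outlier_counts = count_zscore_outliers(toy, ["outcome_profit", "outcome_damage_amount"])
print(outlier_counts)
```

The same call would work on `train` with the three outcome columns, or with any list of numeric feature names.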

In [None]:
train.info()

## c) Look at the descriptives
1. For which features do you suspect outliers?
2. Which of these outliers seem most suspicious? Which would you certainly check if you were able to?

In [None]:
train.iloc[:,0:16].head()

2. Convert categorical

In [None]:
# This variable is a boolean but has to be converted to a number
train['married_cd'] = train['married_cd'].astype('int')
train.loc[:, 'married_cd']

score['married_cd'] = score['married_cd'].astype('int')
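
For categorical features with more than two levels, one-hot encoding is a common option. The sketch below shows how the encoded training and scoring frames can be kept aligned; the toy frames and the `region` values are illustrative assumptions.

```python
import pandas as pd

toy_train = pd.DataFrame({"region": ["north", "south", "east"], "age": [30, 41, 25]})
toy_score = pd.DataFrame({"region": ["south", "west"], "age": [50, 33]})

# One 0/1 column per category level.
train_enc = pd.get_dummies(toy_train, columns=["region"], dtype=int)
score_enc = pd.get_dummies(toy_score, columns=["region"], dtype=int)

# Give the scoring frame exactly the training columns:
# levels unseen in training ("west") are dropped, absent levels are filled with 0.
score_enc = score_enc.reindex(columns=train_enc.columns, fill_value=0)

print(score_enc)
```

Aligning on the training columns matters because a fitted model expects the same features, in the same order, at prediction time.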

## Clean the data

1. Drop duplicates

In [None]:
train = train.drop_duplicates()
train.head()

2. Replace all NaN values with '-1'

In [None]:
# Replace all NaN values with the sentinel value -1
train.fillna(-1, inplace=True)
train.head()

score.fillna(-1, inplace=True)

In [None]:
na = train.isna()
columns_with_na = train.columns[na.any()].tolist() 
print(len(columns_with_na))
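
Filling with -1 mixes a sentinel into otherwise meaningful numeric scales, which can distort distance-based and linear models. One alternative, shown here as a sketch rather than as the approach used in this notebook, is median imputation with an extra indicator column per affected feature. The toy frame is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({"age": [30.0, np.nan, 50.0], "tenure_yrs": [2.0, 4.0, np.nan]})

# add_indicator=True appends one 0/1 column per feature that had missing values.
imputer = SimpleImputer(strategy="median", add_indicator=True)
filled = imputer.fit_transform(toy)

# Build readable column names for the imputed values and the indicators.
indicator_cols = [f"{c}_missing" for c in toy.columns[imputer.indicator_.features_]]
filled_df = pd.DataFrame(filled, columns=list(toy.columns) + indicator_cols)

print(filled_df)
```

In practice the imputer would be fitted on the training data only and then applied to the applicants with `imputer.transform`.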

3. Drop irrelevant columns

gluten_ic and lactose_ic: The fact that a person is gluten or lactose intolerant does not indicate how likely it is for them to inflict damages or how much money they will be spending in the hotel. Maybe they will pay a tiny bit more for food without those ingredients but it shouldn't have a significant impact.

cab_requests: The hotel is very unlikely to own the cab company, so whether or not guests order a lot of taxis will not influence the hotel's profit.

marketing_permit: Whether or not the marketing team may contact them will not influence how much money they spend, nor how likely they are to inflict damages.

region: Although region could be a small factor due to cultural differences in spending and personal traits, using it could lead to discrimination against people from a certain region.

gender: It could be that a certain gender is, for example, more aggressive than others and thus more likely to inflict damages, but using this could also lead to discrimination based on generalisations.

divorce: Being divorced or not does not impact the way you behave or spend money, certainly not once some time has passed. In the first months, a guest might be somewhat more aggressive or impulsive due to grief.

In [None]:
# drop the columns that are not needed
train = train.drop(['gluten_ic', 'lactose_ic', 'marketing_permit','divorce', 'cab_requests', 'urban_ic', 'gender', "married_cd"], axis=1) #outcome_damage_inc, outcome_damage_amount
score = score.drop(['gluten_ic', 'lactose_ic', 'marketing_permit', 'divorce', 'cab_requests', 'urban_ic', 'gender', "married_cd"], axis=1)

I make sure all three categorical features have dtype 'object' so that they can be identified as categorical.

4. Remove unwanted outliers


No unwanted outliers found

## 2. Machine Learning 


Train the different ML models to predict projected revenue


### 2.0.0 Split the data in test/train and standardize data


In [None]:
from sklearn.preprocessing import StandardScaler

X = train
X = X.drop(['outcome_damage_amount', 'outcome_damage_inc', 'outcome_profit'], axis=1)
y = train['outcome_profit']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

num_feat = X_train.select_dtypes(include=['int64', 'float64']).columns

scaler = StandardScaler()
scaler.fit(X_train[num_feat])

X_train_stan = X_train.copy()
X_test_stan = X_test.copy()

X_train_stan[num_feat] = scaler.transform(X_train[num_feat])
X_test_stan[num_feat] = scaler.transform(X_test[num_feat])
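
Scaling and model fitting can also be bundled in a single pipeline, so that the same transformation is automatically applied to the training data, the test data and the new applicants. The sketch below uses synthetic arrays; `X_toy`, `y_toy` and `new_toy` are hypothetical stand-ins for the features, the profit target and score.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_toy = rng.normal(loc=100, scale=20, size=(300, 3))
y_toy = 2.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(size=300)
new_toy = rng.normal(loc=100, scale=20, size=(5, 3))  # stands in for the applicants

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# The scaler is fitted on the training rows only, inside the pipeline.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_tr, y_tr)

test_r2 = model.score(X_te, y_te)   # scaling is applied automatically
new_pred = model.predict(new_toy)   # same scaling for new applicants
print(test_r2, new_pred)
```

With this design there is no separate scaled copy of the data to keep in sync, which removes the risk of training on unscaled features and predicting on scaled ones.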

### 2.0.1 Linear Regression

In [None]:
# This algorithm has a score of 0.2842767456180185

# Train a linear regression model
LRmodel = LinearRegression()
LRmodel.fit(X_train, y_train)

# Evaluate the model on the testing set
testScore = LRmodel.score(X_test, y_test)
print('R^2 score on testing set:', testScore)

LRy_pred = LRmodel.predict(X_test)

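# Predict the projected revenue for the 500 applicants.
# The model was fit on unscaled features, so the applicants are scored on the same unscaled columns.
LRscore = score[X_train.columns]
LRpredictions = LRmodel.predict(LRscore)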

# Sort the predictions in descending order
sorted_index = np.argsort(LRpredictions)[::-1]
LRsorted_predictions = LRpredictions[sorted_index]

print(LRsorted_predictions)

Plot of the actual outcomes and predicted outcomes: Linear Regression

In [None]:
plt.scatter(y_test, LRy_pred)
plt.xlabel('Actual outcomes')
plt.ylabel('Predicted outcomes')
plt.title('Scatter plot of actual vs predicted outcomes')
plt.show()

### 2.0.2 KNN 

In [None]:
# This algorithm has a score of 0.025000182087153267
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

# Train a k-nearest neighbors regression model
KNNmodel = KNeighborsRegressor(n_neighbors=5)
KNNmodel.fit(X_train, y_train)

# Make predictions on the testing set
KNNy_pred = KNNmodel.predict(X_test)

# Evaluate the model using r-squared score
r2 = r2_score(y_test, KNNy_pred)
print('R-squared score:', r2)

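# Predict the projected revenue for the 500 applicants.
# The model was fit on unscaled features, so the applicants are scored on the same unscaled columns.
KNNscore = score[X_train.columns]
KNNpredictions = KNNmodel.predict(KNNscore)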

# Sort the predictions in descending order
KNNsorted_index = np.argsort(KNNpredictions)[::-1]
KNNsorted_predictions = KNNpredictions[KNNsorted_index]

print(KNNsorted_predictions)

Plot of the Actual outcomes and predicted outcomes: KNN

In [None]:
# plot the predicted outcomes against the actual outcomes in the testing set
plt.scatter(y_test, KNNy_pred)
plt.xlabel('Actual outcomes')
plt.ylabel('Predicted outcomes')
plt.title('Scatter plot of actual vs predicted outcomes')
plt.show()

### 2.0.3 Decision Tree

In [None]:
# This algorithm has a score of 0.5646249872330892
from sklearn.tree import DecisionTreeRegressor

# Create a decision tree regressor object
dt_regressor = DecisionTreeRegressor(random_state=0)

# Fit the regressor with the training data
dt_regressor.fit(X_train, y_train)

# Predict the revenue on the testing data
y_pred_dt = dt_regressor.predict(X_test)

# Compute R^2 score on the testing data
r2_score_dt = dt_regressor.score(X_test, y_test)
print("R^2 Score (Decision Tree Regression): ", r2_score_dt)

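# Predict the projected revenue for the 500 applicants.
# The model was fit on unscaled features, so the applicants are scored on the same unscaled columns.
DTscore = score[X_train.columns]
DTpredictions = dt_regressor.predict(DTscore)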

# Sort the predictions in descending order
DTsorted_index = np.argsort(DTpredictions)[::-1]
DTsorted_predictions = DTpredictions[DTsorted_index]

print(DTsorted_predictions)

Plot of the Actual outcomes and predicted outcomes: Decision Tree

In [None]:
plt.scatter(y_test, y_pred_dt)
plt.xlabel('Actual outcomes')
plt.ylabel('Predicted outcomes')
plt.title('Scatter plot of actual vs predicted outcomes')
plt.show()

### 2.0.4 Random Forest 

In [None]:
# This algorithm has a score of 0.7581997911409797
from sklearn.ensemble import RandomForestRegressor

# Instantiate the model
rf = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model to the training data
rf.fit(X_train, y_train)

# Predict on the test data
RFy_pred = rf.predict(X_test)

# Evaluate the model using r2 score
from sklearn.metrics import r2_score
RFr2 = r2_score(y_test, RFy_pred)
print("r2 score on test set:", RFr2)

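# Predict the projected revenue for the 500 applicants.
# The model was fit on unscaled features, so the applicants are scored on the same unscaled columns.
RFscore = score[X_train.columns]
RFpredictions = rf.predict(RFscore)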

# Sort the predictions in descending order
RFsorted_index = np.argsort(RFpredictions)[::-1]
RFsorted_predictions = RFpredictions[RFsorted_index]

print(RFsorted_predictions)

Plot of the actual outcomes and predicted outcomes: Random Forest

In [None]:
plt.scatter(y_test, RFy_pred)
plt.xlabel('Actual outcomes')
plt.ylabel('Predicted outcomes')
plt.title('Scatter plot of actual vs predicted outcomes')
plt.show()

### 2.0.5 Gradient Boosting

In [None]:
# This algorithm has a score of 0.771681378659433
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# Instantiate the model
GBmodel = GradientBoostingRegressor()

# Fit the model on the training data
GBmodel.fit(X_train, y_train)

# Make predictions on the testing data
GBy_pred = GBmodel.predict(X_test)

# Calculate the R-squared score on the testing data
GBr2 = r2_score(y_test, GBy_pred)
print("R-squared score on testing data:", GBr2)

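# Predict the projected revenue for the 500 applicants.
# The model was fit on unscaled features, so the applicants are scored on the same unscaled columns.
GBscore = score[X_train.columns]
GBpredictions = GBmodel.predict(GBscore)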

# Sort the predictions in descending order
GBsorted_index = np.argsort(GBpredictions)[::-1]
GBsorted_predictions = GBpredictions[GBsorted_index]

print(GBsorted_predictions)

Plot of the Actual outcomes and predicted outcomes: Gradient Boosting

In [None]:
plt.scatter(y_test, GBy_pred)
plt.xlabel('Actual outcomes')
plt.ylabel('Predicted outcomes')
plt.title('Scatter plot of actual vs predicted outcomes')
plt.show()

### 2.0.6.1 Polynomial

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)

# Split the dataset into training and testing sets
X = train
X = X.drop(['outcome_damage_amount', 'outcome_damage_inc', 'outcome_profit'], axis=1)
y = train['outcome_profit']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)


In [None]:
#check the number of features
X_train_poly.shape

In [None]:
#fit the linear regression
reg_quad = LinearRegression(fit_intercept=False)
reg_quad.fit(X_train_poly, y_train)


In [None]:
print(reg_quad.score(X_train_poly, y_train))
print(reg_quad.score(X_test_poly, y_test))


The model is overfitted: it performs relatively well on the training set but poorly on the unseen test data.
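
One common remedy for an overfitted polynomial model is to add regularization. The sketch below combines degree-2 features with ridge regression on synthetic data; the data, the choice of `Ridge` and the value `alpha=10.0` are assumptions for illustration and were not tuned on the hotel data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 8))
y_toy = X_toy[:, 0] + 0.5 * X_toy[:, 1] ** 2 + rng.normal(scale=0.5, size=120)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# Scale first, expand to degree-2 terms, then shrink the coefficients with an L2 penalty.
ridge_poly = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    Ridge(alpha=10.0),
)
ridge_poly.fit(X_tr, y_tr)

train_r2 = ridge_poly.score(X_tr, y_tr)
test_r2 = ridge_poly.score(X_te, y_te)
print(train_r2, test_r2)
```

Comparing the two printed scores for several values of `alpha` shows how much regularization is needed to close the gap between training and test performance.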

### Rate the models

In [None]:
from sklearn.model_selection import cross_val_score

# Define the evaluation metric (e.g., mean squared error)
metric = 'neg_mean_squared_error'

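# Evaluate each algorithm using 10-fold cross-validation.
# Results are keyed by class name, so reg_quad and LRmodel (both LinearRegression) share a single entry.
scores = {}
for reg in [KNNmodel, reg_quad, GBmodel, rf, dt_regressor, LRmodel]:
    name = type(reg).__name__
    CVscore = cross_val_score(reg, X, y, cv=10, scoring=metric)
    scores[name] = -CVscore.mean()

# Print the mean squared error of each algorithm
# (use CVscore here; `score` is the applicants DataFrame and cannot be formatted as a float)
for name, CVscore in scores.items():
    print(f"{name}: {CVscore:.4f}")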

# 3. Conclusion Rating ML Algorithms

Mean squared error per model (10-fold cross-validation):

* KNeighborsRegressor: 1543567.6565
* LinearRegression: 1290593.4049
* GradientBoostingRegressor: 422058.4695
* RandomForestRegressor: 486162.5380
* DecisionTreeRegressor: 904579.7700

We can see that gradient boosting has the lowest mean squared error, so on this metric it is the best-performing model.
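
When several estimators share a class, keying the results by an explicit label avoids one entry overwriting another. The following is a sketch on synthetic data; the labels, the toy data and the two models chosen are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

# Explicit labels instead of class names, so every model keeps its own entry.
models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}

results = {}
for name, est in models.items():
    neg_mse = cross_val_score(est, X_toy, y_toy, cv=5, scoring="neg_mean_squared_error")
    results[name] = -neg_mse.mean()  # flip the sign back to a positive MSE

for name, mse in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: {mse:.4f}")
```

Sorting by the error puts the best model first, which makes the comparison easier to read than a dictionary in insertion order.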

# 2.1 Damages

Score the 500 applicants

### Split data in test/train and standardize

In [None]:
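# Re-split with the damage indicator as target; otherwise y_train would still hold 'outcome_profit' from section 2.0
y = train['outcome_damage_inc']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

num_feat = X_train.select_dtypes(include=['int64', 'float64']).columns
num_feat = [feat for feat in num_feat if feat in X_train.columns and feat in score.columns] # remove non-existent features
scaler = StandardScaler()
scaler.fit(X_train[num_feat])
X_train_stand = X_train.copy()
X_test_stand = X_test.copy()
X_train_stand[num_feat] = scaler.transform(X_train[num_feat])
X_test_stand[num_feat] = scaler.transform(X_test[num_feat])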
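
Since 'outcome_damage_inc' records whether a guest caused damage, predicting it can also be framed as a classification problem that returns a probability per applicant. The sketch below is one possible setup on synthetic data, not the method used in this notebook; the data-generating rule and the choice of `GradientBoostingClassifier` are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(400, 4))

# Hypothetical rule: damage becomes more likely as the first feature increases.
p_true = 1 / (1 + np.exp(-2 * X_toy[:, 0]))
y_toy = (rng.uniform(size=400) < p_true).astype(int)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_toy[:300], y_toy[:300])

# Column 1 of predict_proba is the estimated probability of class 1 (damage).
damage_proba = clf.predict_proba(X_toy[300:])[:, 1]
print(damage_proba[:5])
```

A probability is convenient later on, because it can be multiplied with a predicted damage amount to obtain an expected damage per applicant.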

### 2.1.1 Decision Tree

In [None]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# fit decision tree regressor with cross-validation
depth = np.arange(1, 50)
cv_scores = []
sd_scores = []
for d in depth:
    dec_tree = DecisionTreeRegressor(random_state=0, max_depth=d)
    scores = cross_val_score(dec_tree, X_train_stand, y_train, cv=5)
    cv_scores.append(scores.mean())
    sd_scores.append(np.sqrt(scores.var())/np.sqrt(5))

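# refit on the entire training set using the depth with the best mean CV score
# (without this step, the last tree from the loop, with max_depth=49, would be used)
best_depth = int(depth[np.argmax(cv_scores)])
dec_tree = DecisionTreeRegressor(random_state=0, max_depth=best_depth)
dec_tree.fit(X_train_stand, y_train)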

# Standardize numerical features for new applicants
num_feat = [feat for feat in num_feat if feat in score.columns] # remove non-existent features
new_applicants_stand = score.copy()
new_applicants_stand[num_feat] = scaler.transform(score[num_feat])


# Predict damages for new applicants using the trained decision tree regressor
damages_pred = dec_tree.predict(new_applicants_stand)

# Only keep the applicants who will cause damage to calculate the damage amount
applicants_who_will_cause_damage = new_applicants_stand[damages_pred > 0]
applicants_who_will_cause_damage



### 2.1.3 Gradient Boosting

In [None]:
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# fit gradient boosting regressor with cross-validation over the tree depth
depth = np.arange(1, 10)
cv_scores = []
sd_scores = []
for d in depth:
    gb_regressor = GradientBoostingRegressor(random_state=0, n_estimators=100, max_depth=d)
    scores = cross_val_score(gb_regressor, X_train_stand, y_train, cv=5)
    cv_scores.append(scores.mean())
    sd_scores.append(np.sqrt(scores.var())/np.sqrt(5))

# refit on the entire training set using the depth with the best mean CV score
best_depth = int(depth[np.argmax(cv_scores)])
gb_regressor = GradientBoostingRegressor(random_state=0, n_estimators=100, max_depth=best_depth)
gb_regressor.fit(X_train_stand, y_train)

# Standardize numerical features for new applicants
num_feat = [feat for feat in num_feat if feat in score.columns] # remove non-existent features
new_applicants_stand = score.copy()
new_applicants_stand[num_feat] = scaler.transform(score[num_feat])

# Predict damages for new applicants using the trained gradient boosting regressor
damages_pred = gb_regressor.predict(new_applicants_stand)
print(damages_pred)

# Only keep the applicants who will cause damage to calculate the damage amount
applicants_who_will_cause_damage = new_applicants_stand[damages_pred > 0]
applicants_who_will_cause_damage


### 2.1.4 KNN

In [None]:
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

# fit KNN regressor with cross-validation
k_values = np.arange(1, 50)
cv_scores = []
sd_scores = []
for k in k_values:
    knn = KNeighborsRegressor(n_neighbors=k)
    scores = cross_val_score(knn, X_train_stand, y_train, cv=5)
    cv_scores.append(scores.mean())
    sd_scores.append(np.sqrt(scores.var())/np.sqrt(5))

# refit on the entire training set using the k with the best mean CV score
best_k = int(k_values[np.argmax(cv_scores)])
knn = KNeighborsRegressor(n_neighbors=best_k)
knn.fit(X_train_stand, y_train)

# Standardize numerical features for new applicants
num_feat = [feat for feat in num_feat if feat in score.columns] # remove non-existent features
new_applicants_stand = score.copy()
new_applicants_stand[num_feat] = scaler.transform(score[num_feat])

# Predict damages for new applicants using the trained KNN regressor
damages_pred = knn.predict(new_applicants_stand)
print(damages_pred)

# Only keep the applicants who will cause damage to calculate the damage amount
applicants_who_will_cause_damage = new_applicants_stand[damages_pred > 0]
applicants_who_will_cause_damage


### 2.1.5 Linear Regression

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# fit linear regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train_stand, y_train)

# Standardize numerical features for new applicants
new_applicants_stand = score.copy()
new_applicants_stand[num_feat] = scaler.transform(score[num_feat])

# Predict damages for new applicants using the trained linear regression model
damages_pred = lin_reg.predict(new_applicants_stand)
print(damages_pred)

# Only keep the applicants who will cause damage to calculate the damage amount
applicants_who_will_cause_damage = new_applicants_stand[damages_pred > 0]
applicants_who_will_cause_damage


### 2.1.6 Random Forest 

In [None]:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# fit random forest regressor with cross-validation
depth = np.arange(1, 50)
cv_scores = []
sd_scores = []
for d in depth:
    rnd_forest = RandomForestRegressor(random_state=0, n_estimators=100, max_depth=d)
    scores = cross_val_score(rnd_forest, X_train_stand, y_train, cv=5)
    cv_scores.append(scores.mean())
    sd_scores.append(np.sqrt(scores.var())/np.sqrt(5))

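# refit on the entire training set using the depth with the best mean CV score
best_depth = int(depth[np.argmax(cv_scores)])
rnd_forest = RandomForestRegressor(random_state=0, n_estimators=100, max_depth=best_depth)
rnd_forest.fit(X_train_stand, y_train)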

# Standardize numerical features for new applicants
num_feat = [feat for feat in num_feat if feat in score.columns] # remove non-existent features
new_applicants_stand = score.copy()
new_applicants_stand[num_feat] = scaler.transform(score[num_feat])

# Predict damages for new applicants using the trained random forest regressor
damages_pred = rnd_forest.predict(new_applicants_stand)
print(damages_pred) 

# Only keep the applicants who will cause damage to calculate the damage amount
applicants_who_will_cause_damage = new_applicants_stand[damages_pred > 0]
applicants_who_will_cause_damage


# 2.2 Predict damage amount

In [None]:
 # TODO
 # Use all ML algorithms to predict damages:
    # - Linear Regression x
    # - Decision Tree x
    # - KNN x
    # - Random Forest x
    # - Gradient Boosting x
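
Because a damage amount only exists for guests who actually caused damage, one option is to train the amount model on those rows only. The sketch below illustrates this on synthetic data; the toy frame, the relation between `bar_no` and the amount, and the choice of `RandomForestRegressor` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "nights_booked": rng.integers(1, 15, size=n),
    "bar_no": rng.integers(0, 20, size=n),
})
toy["outcome_damage_inc"] = rng.integers(0, 2, size=n)
# Hypothetical relation: the amount grows with bar visits and is 0 without damage.
toy["outcome_damage_amount"] = np.where(
    toy["outcome_damage_inc"] == 1, 200 + 50 * toy["bar_no"], 0
)

features = ["nights_booked", "bar_no"]

# Train on the guests who caused damage only.
damaged = toy[toy["outcome_damage_inc"] == 1]
amount_model = RandomForestRegressor(n_estimators=50, random_state=0)
amount_model.fit(damaged[features], damaged["outcome_damage_amount"])

# Predicted amount *given that* damage occurs, for every guest.
predicted_amount = amount_model.predict(toy[features])
print(predicted_amount[:5])
```

The output is a conditional amount: it answers "how much, if damage occurs", and therefore needs to be combined with a damage probability before it is compared with profit.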

### 2.2.1 Gradient Boosting

In [None]:
# from random import Random
# import numpy as np
# from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import StandardScaler
# from sklearn.ensemble import GradientBoostingRegressor
# from sklearn.model_selection import cross_val_score

# # split into train and test sets
# X = train
# X = X.drop(['outcome_damage_amount', 'outcome_damage_inc', 'outcome_profit'], axis=1)
# y = train['outcome_damage_amount']  # target is the outcome column; 'damage_am' is a feature
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# # standardize numerical features
# num_feat = X_train.select_dtypes(include=['int64', 'float64']).columns
# num_feat = [feat for feat in num_feat if feat in X_train.columns and feat in score.columns] # remove non-existent features
# scaler = StandardScaler()
# scaler.fit(X_train[num_feat])
# X_train_stand = X_train.copy()
# X_test_stand = X_test.copy()
# X_train_stand[num_feat] = scaler.transform(X_train[num_feat])
# X_test_stand[num_feat] = scaler.transform(X_test[num_feat])

# # fit Gradient Boosting regressor with cross-validation
# depth = np.arange(1, 10)
# cv_scores = []
# sd_scores = []
# for d in depth:
#     gb_regressor = GradientBoostingRegressor(random_state=0, n_estimators=100, max_depth=d)
#     scores = cross_val_score(gb_regressor, X_train_stand, y_train, cv=5)
#     cv_scores.append(scores.mean())
#     sd_scores.append(np.sqrt(scores.var())/np.sqrt(5))

# # fit Gradient Boosting regressor to entire training set
# gb_regressor.fit(X_train_stand, y_train)

# # Predict damages for new applicants using the trained Gradient Boosting regressor
# damages_pred = gb_regressor.predict(applicants_who_will_cause_damage)
# damages_pred


### 2.2.2 Decision Tree Regressor

TODO Make sure the other algorithms also use applicants_who_will_cause_damage

In [None]:
# # Decision Tree Regressor
# from random import Random
# import numpy as np
# from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import StandardScaler
# from sklearn.tree import DecisionTreeRegressor
# from sklearn.model_selection import cross_val_score

# # split into train and test sets
# X = train
# X = X.drop(['outcome_damage_amount', 'outcome_damage_inc', 'outcome_profit'], axis=1)
# y = train['outcome_damage_amount']  # target is the outcome column; 'damage_am' is a feature
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# # standardize numerical features
# num_feat = X_train.select_dtypes(include=['int64', 'float64']).columns
# num_feat = [feat for feat in num_feat if feat in X_train.columns and feat in score.columns] # remove non-existent features
# scaler = StandardScaler()
# scaler.fit(X_train[num_feat])
# X_train_stand = X_train.copy()
# X_test_stand = X_test.copy()
# X_train_stand[num_feat] = scaler.transform(X_train[num_feat])
# X_test_stand[num_feat] = scaler.transform(X_test[num_feat])

# # fit decision tree regressor with cross-validation
# depth = np.arange(1, 50)
# cv_scores = []
# sd_scores = []
# for d in depth:
#     dec_tree = DecisionTreeRegressor(random_state=0, max_depth=d)
#     scores = cross_val_score(dec_tree, X_train_stand, y_train, cv=5)
#     cv_scores.append(scores.mean())
#     sd_scores.append(np.sqrt(scores.var())/np.sqrt(5))

# # fit decision tree regressor to entire training set
# dec_tree.fit(X_train_stand, y_train)

# # Standardize numerical features for new applicants
# num_feat = [feat for feat in num_feat if feat in score.columns] # remove non-existent features
# new_applicants_stand = score.copy()
# new_applicants_stand[num_feat] = scaler.transform(score[num_feat])

# # Predict damages for new applicants using the trained decision tree regressor
# damages_pred = dec_tree.predict(applicants_who_will_cause_damage)
# print(damages_pred)

### 2.2.3 KNN

In [None]:
# import numpy as np
# from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import StandardScaler
# from sklearn.neighbors import KNeighborsRegressor
# from sklearn.model_selection import cross_val_score

# # split into train and test sets
# X = train
# X = X.drop(['outcome_damage_amount', 'outcome_damage_inc', 'outcome_profit'], axis=1)
# y = train['outcome_damage_amount']  # target is the outcome column; 'damage_am' is a feature
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# # standardize numerical features
# num_feat = X_train.select_dtypes(include=['int64', 'float64']).columns
# num_feat = [feat for feat in num_feat if feat in X_train.columns and feat in score.columns] # remove non-existent features
# scaler = StandardScaler()
# scaler.fit(X_train[num_feat])
# X_train_stand = X_train.copy()
# X_test_stand = X_test.copy()
# X_train_stand[num_feat] = scaler.transform(X_train[num_feat])
# X_test_stand[num_feat] = scaler.transform(X_test[num_feat])

# # fit KNN regressor with cross-validation
# k_values = np.arange(1, 50)
# cv_scores = []
# sd_scores = []
# for k in k_values:
#     knn = KNeighborsRegressor(n_neighbors=k)
#     scores = cross_val_score(knn, X_train_stand, y_train, cv=5)
#     cv_scores.append(scores.mean())
#     sd_scores.append(np.sqrt(scores.var())/np.sqrt(5))

# # fit KNN regressor to entire training set
# knn = KNeighborsRegressor(n_neighbors=5)
# knn.fit(X_train_stand, y_train)

# # Standardize numerical features for new applicants
# num_feat = [feat for feat in num_feat if feat in score.columns] # remove non-existent features
# new_applicants_stand = score.copy()
# new_applicants_stand[num_feat] = scaler.transform(score[num_feat])

# # Predict damages for new applicants using the trained KNN regressor
# damages_pred = knn.predict(new_applicants_stand)
# print(damages_pred)


### 2.2.4 Random Forest

In [None]:
# from random import Random
# import numpy as np
# from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import StandardScaler
# from sklearn.ensemble import RandomForestRegressor
# from sklearn.model_selection import cross_val_score

# # split into train and test sets
# X = train
# X = X.drop(['outcome_damage_amount', 'outcome_damage_inc', 'outcome_profit'], axis=1)
# y = train['outcome_damage_amount']  # target is the outcome column; 'damage_am' is a feature
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# # standardize numerical features
# num_feat = X_train.select_dtypes(include=['int64', 'float64']).columns
# num_feat = [feat for feat in num_feat if feat in X_train.columns and feat in score.columns] # remove non-existent features
# scaler = StandardScaler()
# scaler.fit(X_train[num_feat])
# X_train_stand = X_train.copy()
# X_test_stand = X_test.copy()
# X_train_stand[num_feat] = scaler.transform(X_train[num_feat])
# X_test_stand[num_feat] = scaler.transform(X_test[num_feat])

# # fit random forest regressor with cross-validation
# depth = np.arange(1, 50)
# cv_scores = []
# sd_scores = []
# for d in depth:
#     rnd_forest = RandomForestRegressor(random_state=0, n_estimators=100, max_depth=d)
#     scores = cross_val_score(rnd_forest, X_train_stand, y_train, cv=5)
#     cv_scores.append(scores.mean())
#     sd_scores.append(np.sqrt(scores.var())/np.sqrt(5))

# # fit random forest regressor to entire training set
# rnd_forest.fit(X_train_stand, y_train)

# # Standardize numerical features for new applicants
# num_feat = [feat for feat in num_feat if feat in score.columns] # remove non-existent features
# new_applicants_stand = score.copy()
# new_applicants_stand[num_feat] = scaler.transform(score[num_feat])

# # Predict damages for new applicants using the trained random forest regressor
# damages_pred = rnd_forest.predict(new_applicants_stand)
# print(damages_pred)


In [None]:
# TODO
# To select the 200 applicants we will subtract the predicted damages from the predicted revenue
# Test the ML algorithms to determine which one is the best
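
The selection step described in the TODO above could be sketched as follows: the expected value of an applicant is the predicted profit minus the damage probability times the predicted damage amount. The function name `select_applicants` and the toy numbers are hypothetical, and the number of applicants to select is left as a parameter.

```python
import numpy as np

def select_applicants(profit_pred, damage_proba, damage_amount_pred, n_select):
    """Rank applicants by expected value and return the indices of the top n_select."""
    expected_value = profit_pred - damage_proba * damage_amount_pred
    order = np.argsort(expected_value)[::-1]  # highest expected value first
    return order[:n_select], expected_value

# Toy predictions for four applicants.
profit_pred = np.array([1000.0, 1500.0, 800.0, 1200.0])
damage_proba = np.array([0.1, 0.8, 0.0, 0.5])
damage_amount_pred = np.array([500.0, 1000.0, 300.0, 400.0])

chosen, expected_value = select_applicants(
    profit_pred, damage_proba, damage_amount_pred, n_select=2
)
print(chosen, expected_value[chosen])
```

Note that the applicant with the highest predicted profit is not selected here, because the expected damage outweighs the extra profit.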