# Background:

We are one of the fastest growing startups in the logistics and delivery domain. We work with several partners and make on-demand delivery to our customers. From operational standpoint we have been facing several different challenges and everyday we are trying to address these challenges.

We thrive on making our customers happy. As a growing startup, with a global expansion strategy we know that we need to make our customers happy and the only way to do that is to measure how happy each customer is. If we can predict what makes our customers happy or unhappy, we can then take necessary actions.

Getting feedback from customers is not easy either, but we do our best to get constant feedback from our customers. This is a crucial function to improve our operations across all levels.

We recently did a survey to a select customer cohort. You are presented with a subset of this data. We will be using the remaining data as a private test set.

# Data Description:

Y = target attribute (Y) with values indicating 0 (unhappy) and 1 (happy) customers

X1 = my order was delivered on time

X2 = contents of my order was as I expected

X3 = I ordered everything I wanted to order

X4 = I paid a good price for my order

X5 = I am satisfied with my courier

X6 = the app makes ordering easy for me

Attributes X1 to X6 indicate the responses for each question and have values from 1 to 5 where the smaller number indicates less and the higher number indicates more towards the answer.

# Download Data:

https://drive.google.com/open?id=1KWE3J0uU_sFIJnZ74Id3FDBcejELI7FD

# Goal(s):

Predict if a customer is happy or not based on the answers they give to questions asked.

# Success Metrics:

Reach 73% accuracy score or above, or convince us why your solution is superior. We are definitely interested in every solution and insight you can provide us.

Try to submit your working solution as soon as possible. The sooner the better.

# Bonus(es):

We are very interested in finding which questions/features are more important when predicting a customer’s happiness. Using a feature selection approach show us understand what is the minimal set of attributes/features that would preserve the most information about the problem while increasing predictability of the data we have. Is there any question that we can remove in our next survey?

In [11]:
#importing the needed libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFECV, SelectKBest, chi2
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix


# Data Preparation

In [14]:
#loading the CSV data

customers = pd.read_csv('ACME-HappinessSurvey2020.csv')

#looking at the first five rows of the data

print(customers.head())

   Y  X1  X2  X3  X4  X5  X6
0  0   3   3   3   4   2   4
1  0   3   2   3   5   4   3
2  1   5   3   3   3   3   5
3  0   5   4   3   3   3   5
4  0   5   4   3   3   3   5


# Data Preprocessing 

In [17]:
#examining if there are any missing values in the data

print(customers.isnull().sum())

#upon the examination, there are no missing values in the data

Y     0
X1    0
X2    0
X3    0
X4    0
X5    0
X6    0
dtype: int64


# Data Exploration 

Here is the description of the data

Y = target attribute (Y) with values indicating 0 (unhappy) and 1 (happy) customers

X1 = my order was delivered on time

X2 = contents of my order was as I expected

X3 = I ordered everything I wanted to order

X4 = I paid a good price for my order

X5 = I am satisfied with my courier

X6 = the app makes ordering easy for me

Attributes X1 to X6 indicate the responses for each question and have values from 1 to 5 where the smaller number indicates less and the higher number indicates more towards the answer.

In [20]:
#doing quick summary Stats

customers.describe()

#data has 126 rows

Unnamed: 0,Y,X1,X2,X3,X4,X5,X6
count,126.0,126.0,126.0,126.0,126.0,126.0,126.0
mean,0.547619,4.333333,2.531746,3.309524,3.746032,3.650794,4.253968
std,0.499714,0.8,1.114892,1.02344,0.875776,1.147641,0.809311
min,0.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,0.0,4.0,2.0,3.0,3.0,3.0,4.0
50%,1.0,5.0,3.0,3.0,4.0,4.0,4.0
75%,1.0,5.0,3.0,4.0,4.0,4.0,5.0
max,1.0,5.0,5.0,5.0,5.0,5.0,5.0


In [22]:
#counting the number of happy customers vs. unhappy customers

happy_unhappy_counts = customers['Y'].value_counts()
print("Happy vs Unhappy customers:\n", happy_unhappy_counts)

Happy vs Unhappy customers:
 Y
1    69
0    57
Name: count, dtype: int64


In [24]:
#counting the ratings for X1 as well as the number of happy customers vs. unhappy customers

ratings_X1 = customers['X1'].value_counts().sort_index()
happy_unhappy_X1 = customers.groupby('Y')['X1'].value_counts().unstack().fillna(0)

print("\nRatings distribution for X1:\n", ratings_X1)
print("\nHappy vs Unhappy in X1:\n", happy_unhappy_X1)



Ratings distribution for X1:
 X1
1     1
3    20
4    40
5    65
Name: count, dtype: int64

Happy vs Unhappy in X1:
 X1    1     3     4     5
Y                        
0   1.0  12.0  24.0  20.0
1   0.0   8.0  16.0  45.0


In [26]:
#counting the ratings for X2 as well as the number of happy customers vs. unhappy customers

ratings_X2 = customers['X2'].value_counts().sort_index()
happy_unhappy_X2 = customers.groupby('Y')['X2'].value_counts().unstack().fillna(0)

print("\nRatings distribution for X2:\n", ratings_X2)
print("\nHappy vs Unhappy in X2:\n", happy_unhappy_X2)


Ratings distribution for X2:
 X2
1    27
2    34
3    42
4    17
5     6
Name: count, dtype: int64

Happy vs Unhappy in X2:
 X2   1   2   3   4  5
Y                    
0   13  13  19  10  2
1   14  21  23   7  4


In [28]:
#counting the ratings for X3 as well as the number of happy customers vs. unhappy customers

ratings_X3 = customers['X3'].value_counts().sort_index()
happy_unhappy_X3 = customers.groupby('Y')['X3'].value_counts().unstack().fillna(0)

print("\nRatings distribution for X3:\n", ratings_X3)
print("\nHappy vs Unhappy in X3:\n", happy_unhappy_X3)


Ratings distribution for X3:
 X3
1     7
2    14
3    55
4    33
5    17
Name: count, dtype: int64

Happy vs Unhappy in X3:
 X3  1  2   3   4   5
Y                   
0   4  7  29  11   6
1   3  7  26  22  11


In [30]:
#counting the ratings for X4 as well as the number of happy customers vs. unhappy customers

ratings_X4 = customers['X4'].value_counts().sort_index()
happy_unhappy_X4 = customers.groupby('Y')['X4'].value_counts().unstack().fillna(0)

print("\nRatings distribution for X4:\n", ratings_X4)
print("\nHappy vs Unhappy in X4:\n", happy_unhappy_X4)


Ratings distribution for X4:
 X4
1     2
2     5
3    41
4    53
5    25
Name: count, dtype: int64

Happy vs Unhappy in X4:
 X4    1    2     3     4     5
Y                             
0   0.0  4.0  20.0  23.0  10.0
1   2.0  1.0  21.0  30.0  15.0


In [32]:
#counting the ratings for X5 as well as the number of happy customers vs. unhappy customers

ratings_X5 = customers['X5'].value_counts().sort_index()
happy_unhappy_X5 = customers.groupby('Y')['X5'].value_counts().unstack().fillna(0)

print("\nRatings distribution for X5:\n", ratings_X5)
print("\nHappy vs Unhappy in X5:\n", happy_unhappy_X5)


Ratings distribution for X5:
 X5
1     7
2    16
3    22
4    50
5    31
Name: count, dtype: int64

Happy vs Unhappy in X5:
 X5  1  2   3   4   5
Y                   
0   5  9  12  22   9
1   2  7  10  28  22


In [34]:
#counting the ratings for X6 as well as the number of happy customers vs. unhappy customers

ratings_X6 = customers['X6'].value_counts().sort_index()
happy_unhappy_X6 = customers.groupby('Y')['X6'].value_counts().unstack().fillna(0)

print("\nRatings distribution for X6:\n", ratings_X6)
print("\nHappy vs Unhappy in X6:\n", happy_unhappy_X6)


Ratings distribution for X6:
 X6
1     1
2     1
3    20
4    47
5    57
Name: count, dtype: int64

Happy vs Unhappy in X6:
 X6    1    2     3     4     5
Y                             
0   0.0  1.0  14.0  20.0  22.0
1   1.0  0.0   6.0  27.0  35.0


# Data Splitting

In [37]:
#separating the target variable Y from the predictors X1-X6

X = customers[['X1', 'X2', 'X3', 'X4', 'X5', 'X6']]
y = customers['Y']


In [39]:
#splitting the data into train and validation sets (80%/20% rule)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)




# Feature Selection by using RFECV with Logistic Regression Model

In [42]:
#initializing a logistic regression model

estimator = LogisticRegression(max_iter=1000, solver='liblinear')

#setting up Recursive Feature Elimination with Cross-Validation

selector = RFECV(estimator, step=1, cv=5, scoring='accuracy')

#fitting the selector to the training data

selector = selector.fit(X_train, y_train)

print("Selected features:", X.columns[selector.support_])

Selected features: Index(['X1', 'X2', 'X5'], dtype='object')


In [44]:
#reducing the dataset to include only the selected features in the previous step

X_train_selected = selector.transform(X_train)
X_valid_selected = selector.transform(X_valid)


# Hyperparameter Tuning with GridSearchCV for Logistic Regression Model

In [47]:
#defining the parameter grid for the cross-validation
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],  #regularization strength
    'penalty': ['l1', 'l2'],       #regularization type
}

#initializing a new logistic regression model to be tuned through GridSearch

logreg = LogisticRegression(max_iter=1000, solver='liblinear')

#setting up a grid search to find the best hyperparameters using cross-validation

grid = GridSearchCV(logreg, param_grid, cv=5, scoring='accuracy')

#fitting the grid search to the training data

grid.fit(X_train_selected, y_train)

print("Best parameters:", grid.best_params_)


Best parameters: {'C': 1, 'penalty': 'l2'}


In [49]:
#retrieving the model with the best-found parameters

best_model = grid.best_estimator_


In [51]:
#doing the model evaluation on the validation set

y_pred = best_model.predict(X_valid_selected)
accuracy = accuracy_score(y_valid, y_pred)

print(f"Validation Accuracy: {accuracy:.2f}")
print("Classification report:\n", classification_report(y_valid, y_pred))

Validation Accuracy: 0.65
Classification report:
               precision    recall  f1-score   support

           0       0.71      0.42      0.53        12
           1       0.63      0.86      0.73        14

    accuracy                           0.65        26
   macro avg       0.67      0.64      0.63        26
weighted avg       0.67      0.65      0.63        26



In [53]:
#measuring the feature importance

coefficients = best_model.coef_[0]
for feature, coef in zip(X.columns[selector.support_], coefficients):
    print(f"{feature}: {coef:.4f}")

X1: 0.2466
X2: -0.2150
X5: 0.1940


# Feature Selection by using RFECV with Random Forest Model

In [56]:
#initializing a random forest model 

rf_estimator = RandomForestClassifier(n_estimators=100, random_state=42)

#setting up Recursive Feature Elimination with Cross-Validation

selector = RFECV(rf_estimator, step=1, cv=5, scoring='accuracy')

#fitting the selector to the training data
selector = selector.fit(X_train, y_train)

print("Optimal number of features:", selector.n_features_)
print("Selected features:", list(X.columns[selector.support_]))

Optimal number of features: 4
Selected features: ['X1', 'X2', 'X3', 'X5']


In [57]:
#transforming the  training and validation sets

X_train_selected = selector.transform(X_train)
X_valid_selected = selector.transform(X_valid)

# Hyperparamter Tuning for Random Forest Model

In [61]:
#defining the parameter grid for the cross-validation
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['auto', 'sqrt', 'log2']
}

#initializing a new random forest model

rf = RandomForestClassifier(random_state=42)

#setting up a grid search to find the best hyperparameters using cross-validation

grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy', n_jobs=-1)

#fitting the grid search to the training data

grid_search.fit(X_train_selected, y_train)

#print("Best parameters found:", grid_search.best_params_)

540 fits failed out of a total of 1620.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
404 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\maadm\Downloads\anac\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\maadm\Downloads\anac\Lib\site-packages\sklearn\base.py", line 1466, in wrapper
    estimator._validate_params()
  File "C:\Users\maadm\Downloads\anac\Lib\site-packages\sklearn\base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\maadm\Downloads\anac\Lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_constraints
  

In [62]:
#retrieving the model with the best-found parameters

best_rf = grid_search.best_estimator_

In [63]:
#doing the model evaluation on the validation set

y_pred = best_rf.predict(X_valid_selected)
accuracy = accuracy_score(y_valid, y_pred)

print(f"Validation Accuracy: {accuracy:.2f}")
print("Classification Report:\n", classification_report(y_valid, y_pred))

Validation Accuracy: 0.69
Classification Report:
               precision    recall  f1-score   support

           0       0.67      0.67      0.67        12
           1       0.71      0.71      0.71        14

    accuracy                           0.69        26
   macro avg       0.69      0.69      0.69        26
weighted avg       0.69      0.69      0.69        26



# Feature Scaling for SVM Model

In [68]:
#initializing an instance of the Standard Scaler

scaler = StandardScaler()

#fitting the Standard Scaler to the data

X_scaled = scaler.fit_transform(X)


In [70]:
#spliting dataset into training and validation sets

X_train, X_valid, y_train, y_valid = train_test_split(X_scaled, y, test_size=0.2, random_state=42, stratify=y)


# Feature selection by using RFECV with SVC Model

In [72]:
#initializing an SVC model 

svc_estimator = SVC(kernel='linear', random_state=42)

#setting up Recursive Feature Elimination with Cross-Validation

selector = RFECV(svc_estimator, step=1, cv=5, scoring='accuracy')

#fitting the selector to the training data

selector = selector.fit(X_train, y_train)

print("Optimal number of features:", selector.n_features_)
print("Selected features:", list(X.columns[selector.support_]))

Optimal number of features: 1
Selected features: ['X1']


In [74]:
#transforming the training and validation sets

X_train_selected = selector.transform(X_train)
X_valid_selected = selector.transform(X_valid)

# Hyperparamter Tuning for SVC Model

In [77]:
#defining the parameter grid for the cross-validation

param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']  #only relevant for 'rbf' kernel
}

#initializing a new SVC model

svc = SVC(random_state=42)

#initializing GridSearchCV to search for the best hyperparameters

grid_search = GridSearchCV(svc, param_grid, cv=5, scoring='accuracy', n_jobs=-1)

#fitting the GridSearchCV to the training data 

grid_search.fit(X_train_selected, y_train)

print("Best parameters found:", grid_search.best_params_)


Best parameters found: {'C': 1, 'gamma': 'scale', 'kernel': 'linear'}


In [79]:
#retrieving the model with the best-found parameters

best_svc = grid_search.best_estimator_

In [81]:
#doing the model evaluation on the validation set

y_pred = best_svc.predict(X_valid_selected)
accuracy = accuracy_score(y_valid, y_pred)

print(f"Validation Accuracy: {accuracy:.2f}")
print("Classification Report:\n", classification_report(y_valid, y_pred))

Validation Accuracy: 0.73
Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.67      0.70        12
           1       0.73      0.79      0.76        14

    accuracy                           0.73        26
   macro avg       0.73      0.73      0.73        26
weighted avg       0.73      0.73      0.73        26



In [86]:
import xgboost as xgb
from sklearn.feature_selection import SelectFromModel
import optuna



# Feature Selection for XGBoost Model

In [89]:
#initiating an instance of the XGBoost Model

model = xgb.XGBClassifier()

# fittting the model to the training data

model.fit(X_train, y_train)

#fitting the selector to the training data

selector = SelectFromModel(model, prefit=True)

#transforming the training and validation datas to retain the selected features

X_train_selected = selector.transform(X_train)
X_val_selected = selector.transform(X_valid)

# Hyperparameter Tuning with Optuna for XGBoost Model

In [92]:
def objective(trial):       #defining an objective function for the Optuna study
    param = {
        'objective': 'binary:logistic',
        'eval_metric': 'logloss',
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'gamma': trial.suggest_float('gamma', 0, 5)
    }              #setting up a dictionary of hyperparameters to optimize

#createing another instance of the XGBClassifier using the hyperparameters in param 
#training it on the selected features of the training set and then predicting the validation set
    model = xgb.XGBClassifier(**param)
    model.fit(X_train_selected, y_train)
    preds = model.predict(X_val_selected)

#returning the accuracy of the model predictions on the validation set    
    accuracy = accuracy_score(y_valid, preds)
    return accuracy

#creating an Optuna study object that aims to maximize the objective function to find the hyperparameter set that yields the highest accuracy
study = optuna.create_study(direction='maximize')

#running the optimization process for 100 trials to explore the different hyperparameter combinations
study.optimize(objective, n_trials=100)

#returning the best trial results by displaying the highest accuracy and the corresponding set of hyperparameters
print("Best trial:")
trial = study.best_trial
print(f'  Accuracy: {trial.value}')
print("  Best hyperparameters: {}".format(trial.params))

[I 2025-08-03 16:57:09,418] A new study created in memory with name: no-name-32f71a5b-41a2-45a3-b7d1-98b771f3239d
[I 2025-08-03 16:57:09,490] Trial 0 finished with value: 0.7307692307692307 and parameters: {'max_depth': 4, 'learning_rate': 0.17055988994462612, 'n_estimators': 135, 'subsample': 0.7724094686835079, 'colsample_bytree': 0.5919239736855275, 'gamma': 2.939325967013593}. Best is trial 0 with value: 0.7307692307692307.
[I 2025-08-03 16:57:09,707] Trial 1 finished with value: 0.7307692307692307 and parameters: {'max_depth': 6, 'learning_rate': 0.1224787394323923, 'n_estimators': 607, 'subsample': 0.5559829914214132, 'colsample_bytree': 0.7692054272588472, 'gamma': 2.805719929723669}. Best is trial 0 with value: 0.7307692307692307.
[I 2025-08-03 16:57:09,808] Trial 2 finished with value: 0.7307692307692307 and parameters: {'max_depth': 8, 'learning_rate': 0.26132027626544296, 'n_estimators': 267, 'subsample': 0.9020180733419183, 'colsample_bytree': 0.8092376619925836, 'gamma': 0

Best trial:
  Accuracy: 0.7692307692307693
  Best hyperparameters: {'max_depth': 3, 'learning_rate': 0.18990398767656982, 'n_estimators': 855, 'subsample': 0.8282106342366561, 'colsample_bytree': 0.9724452860337269, 'gamma': 0.43196934881542415}


In [94]:
#retrieving the best hyperparameters from the study and initializing a new XGBoost model with these parameters
best_params = trial.params
best_model = xgb.XGBClassifier(**best_params)

#training the new model on the selected features of the training set and predicting on the validation set
best_model.fit(X_train_selected, y_train)
final_predictions = best_model.predict(X_val_selected)

#calculating the final accuracy of the new model
final_accuracy = accuracy_score(y_valid, final_predictions)
print(f'Final Accuracy: {final_accuracy}')

Final Accuracy: 0.7692307692307693
