## Exploratory Data Analysis for Happy Customers

### Section 1 — The Problem

**Predict if a customer is happy or not based on the answers they give to questions asked.**

### Section 2 — Load data + sanity checks

In [16]:
import pandas as pd

df = pd.read_csv("../data/ACME-HappinessSurvey2020.csv")

# df.head()
# df.info()
df.describe()
# df.shape
# df.isna().sum()

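| | Y | X1 | X2 | X3 | X4 | X5 | X6 |
|---|---|---|---|---|---|---|---|
| count | 126.0 | 126.0 | 126.0 | 126.0 | 126.0 | 126.0 | 126.0 |
| mean | 0.547619 | 4.333333 | 2.531746 | 3.309524 | 3.746032 | 3.650794 | 4.253968 |
| std | 0.499714 | 0.8 | 1.114892 | 1.02344 | 0.875776 | 1.147641 | 0.809311 |
| min | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 0.0 | 4.0 | 2.0 | 3.0 | 3.0 | 3.0 | 4.0 |
| 50% | 1.0 | 5.0 | 3.0 | 3.0 | 4.0 | 4.0 | 4.0 |
| 75% | 1.0 | 5.0 | 3.0 | 4.0 | 4.0 | 4.0 | 5.0 |
| max | 1.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |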
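The commented-out checks in the cell above can also be written as one explicit report. The sketch below is a minimal illustration, not part of the original notebook: `check_survey` is a hypothetical helper, and the `demo` frame is made-up data standing in for the loaded CSV.

```python
import pandas as pd


def check_survey(df: pd.DataFrame, target: str = "Y") -> dict:
    """Return a small report of basic sanity checks for a Likert survey frame."""
    features = df.columns.drop(target)
    return {
        "n_rows": len(df),
        "n_nulls": int(df.isna().sum().sum()),
        # Every answer should be one of the five Likert levels.
        "likert_in_range": bool(df[features].isin([1, 2, 3, 4, 5]).all().all()),
        # The target should only contain 0 (unhappy) and 1 (happy).
        "binary_target": bool(df[target].isin([0, 1]).all()),
        # Identical rows are counted, not treated as errors.
        "n_duplicate_rows": int(df.duplicated().sum()),
    }


# Made-up rows for illustration only; replace with the loaded survey frame.
demo = pd.DataFrame(
    {"Y": [1, 0, 1], "X1": [5, 3, 5], "X2": [2, 3, 2], "X3": [4, 1, 4]}
)
report = check_survey(demo)
print(report)
```

Duplicate rows are expected in a short survey with five answer levels per question, so that count is informational rather than a failure condition.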


### Dataset Overview & Constraints

This dataset contains customer survey responses (Likert scale 1–5) for 6 operational questions,
with a binary target indicating overall customer happiness.

Key constraints that influence modeling choices:
- Small sample size (126 rows, no nulls)
- Ordinal, low-cardinality features
- Binary target with no severe class imbalance

Given these constraints, model evaluation will rely on cross-validation
rather than a single train/test split.


In [15]:
df["Y"].value_counts(), df["Y"].value_counts(normalize=True)

(Y
 1    69
 0    57
 Name: count, dtype: int64,
 Y
 1    0.547619
 0    0.452381
 Name: proportion, dtype: float64)

The target variable is reasonably balanced (55% happy, 45% unhappy), so accuracy is an acceptable primary metric.
However, due to the small dataset size, all results will be reported using
Stratified K-Fold cross-validation to reduce the variance of the performance estimate.
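To make the variance concern concrete, here is a rough back-of-the-envelope sketch that uses only the row count shown above. The accuracy value of 0.6 is a placeholder, and the binomial standard error treats predictions as independent, which is an approximation.

```python
import math

n_rows, n_splits = 126, 5

# With 5 folds, each test fold holds about 25 rows, so a single
# misclassified row moves that fold's accuracy by roughly 4 points.
fold_size = n_rows // n_splits
step_per_error = 1 / fold_size

# Rough binomial standard error for a placeholder accuracy of 0.6
# measured once over all 126 rows.
acc = 0.6
std_err = math.sqrt(acc * (1 - acc) / n_rows)

print(f"rows per test fold: {fold_size}, step per error: {step_per_error:.3f}")
print(f"approx. standard error at acc={acc}: {std_err:.3f}")
```

Under these assumptions, differences of a few accuracy points between models sit within the noise of the estimate, which is worth keeping in mind when comparing the scores later in the notebook.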

### Section 3 — Univariate EDA

In [19]:
df.groupby("Y").mean().T.sort_values(by=1, ascending=False)

| Feature | Y = 0 | Y = 1 |
|---|---|---|
| X1 | 4.087719 | 4.536232 |
| X6 | 4.105263 | 4.376812 |
| X5 | 3.368421 | 3.884058 |
| X4 | 3.684211 | 3.797101 |
| X3 | 3.140351 | 3.449275 |
| X2 | 2.561404 | 2.507246 |


In [20]:
df.corr(numeric_only=True)["Y"].sort_values(ascending=False)

Y     1.000000
X1    0.280160
X5    0.224522
X6    0.167669
X3    0.150838
X4    0.064415
X2   -0.024274
Name: Y, dtype: float64

In [13]:
df.corr(numeric_only=True)

| | Y | X1 | X2 | X3 | X4 | X5 | X6 |
|---|---|---|---|---|---|---|---|
| Y | 1.0 | 0.28016 | -0.024274 | 0.150838 | 0.064415 | 0.224522 | 0.167669 |
| X1 | 0.28016 | 1.0 | 0.059797 | 0.283358 | 0.087541 | 0.432772 | 0.411873 |
| X2 | -0.024274 | 0.059797 | 1.0 | 0.184129 | 0.114838 | 0.039996 | -0.062205 |
| X3 | 0.150838 | 0.283358 | 0.184129 | 1.0 | 0.302618 | 0.358397 | 0.20375 |
| X4 | 0.064415 | 0.087541 | 0.114838 | 0.302618 | 1.0 | 0.293115 | 0.215888 |
| X5 | 0.224522 | 0.432772 | 0.039996 | 0.358397 | 0.293115 | 1.0 | 0.320195 |
| X6 | 0.167669 | 0.411873 | -0.062205 | 0.20375 | 0.215888 | 0.320195 | 1.0 |


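From a univariate perspective, some features (e.g., delivery timeliness and app usability)
show stronger separation between happy and unhappy customers.
In the tables above, X1 and X5 have the largest correlations with Y (0.28 and 0.22),
while X2 and X4 are close to zero (-0.02 and 0.06).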

This motivates:
- Testing whether all questions are necessary
- Evaluating if a smaller subset of features preserves predictive power
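Because the features are ordinal with only five levels, group means and Pearson correlations can hide patterns that are not linear in the answer value. One possible complement is to tabulate the share of happy customers at each answer level of a question. The sketch below assumes a hypothetical helper `happy_rate_by_answer` and uses made-up rows rather than the survey data.

```python
import pandas as pd


def happy_rate_by_answer(df: pd.DataFrame, feature: str, target: str = "Y") -> pd.DataFrame:
    """Share of happy customers and row count at each answer level of one question."""
    return (
        df.groupby(feature)[target]
        .agg(happy_rate="mean", n="count")
        .sort_index()
    )


# Made-up rows for illustration only; replace with the loaded survey frame.
demo = pd.DataFrame({"X1": [3, 3, 4, 5, 5, 5], "Y": [0, 0, 1, 1, 1, 0]})
table = happy_rate_by_answer(demo, "X1")
print(table)
```

The `n` column matters here: with 126 rows, some answer levels will have only a handful of respondents, and their rates should be read with caution.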


### Section 4 — Baseline Models

We begin with simple, interpretable classifiers to establish a performance floor.
These models help determine whether:
- The problem is linearly separable
- More complex models are justified
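A performance floor is easier to interpret next to a no-skill reference scored with the same folds. The sketch below uses scikit-learn's `DummyClassifier` on synthetic stand-in data that only mimics the survey's shape and class counts. The names `X_demo`, `y_demo`, and `cv_demo` are illustrative, not from the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in with the same shape and class counts as the survey
# (126 rows, 6 Likert features, 69 happy / 57 unhappy); not the real data.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 6, size=(126, 6))
y_demo = np.array([1] * 69 + [0] * 57)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Always predicts the most frequent class seen in the training folds.
dummy = DummyClassifier(strategy="most_frequent")
scores_dummy = cross_val_score(dummy, X_demo, y_demo, cv=cv_demo, scoring="accuracy")

print(scores_dummy.mean(), scores_dummy.std())
```

In the notebook itself, the same estimator could be passed `X`, `y`, and `cv` once they are defined, so that every later model is compared against this reference.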


In [21]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

In [22]:
# Logistic Regression Baseline
from sklearn.linear_model import LogisticRegression

X = df.drop(columns="Y")
y = df["Y"]

log_reg = LogisticRegression(max_iter=1000)
scores_lr = cross_val_score(log_reg, X, y, cv=cv, scoring="accuracy")

scores_lr.mean(), scores_lr.std()

(np.float64(0.5473846153846154), np.float64(0.06114025006094423))

In [23]:
# Decision-Tree Baseline
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=3, random_state=42)
scores_dt = cross_val_score(dt, X, y, cv=cv, scoring="accuracy")

scores_dt.mean(), scores_dt.std()

(np.float64(0.5956923076923077), np.float64(0.054578752612041176))

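The baselines show that the survey responses carry only a weak signal.
Logistic regression (0.547 ± 0.061) is no better than always predicting the majority class (0.548),
while the depth-3 decision tree (0.596 ± 0.055) is about five points above it.

Both baselines have known limitations: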
- Linear models may underfit nonlinear interactions
- Single trees are unstable on small datasets

This motivates testing ensemble-based models designed for tabular data.

### Section 5 — Stronger models

1. Random Forest
2. Gradient Boosting

In [24]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=4,
    random_state=42
)

scores_rf = cross_val_score(rf, X, y, cv=cv, scoring="accuracy")
scores_rf.mean(), scores_rf.std()


(np.float64(0.6027692307692308), np.float64(0.0815610999052335))

In [26]:
# Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(random_state=42)
scores_gb = cross_val_score(gb, X, y, cv=cv, scoring="accuracy")

scores_gb.mean(), scores_gb.std()


(np.float64(0.6107692307692308), np.float64(0.033225070736475286))

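The ensembles improve mean accuracy only slightly over the decision tree
(0.603 for Random Forest and 0.611 for Gradient Boosting, versus 0.596).
Gradient Boosting is also the most stable across folds (std 0.033), whereas Random Forest is the least stable (std 0.082).
The gain over logistic regression (about six points) is consistent with:
- Feature interactions contributing some signal
- The problem benefiting from non-linear decision boundaries

With differences this small relative to the fold-to-fold spread, these readings are tentative.
Gradient Boosting has the highest mean accuracy and the lowest spread of the four models,
so it is selected for further analysis.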


### Section 6 — Hyperparameter tuning

Given the small dataset, we restrict the search space to avoid overfitting and excessive variance.

In [27]:
# Grid Search
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=cv,
    scoring="accuracy"
)

grid.fit(X, y)

grid.best_params_, grid.best_score_

({'learning_rate': 0.05, 'max_depth': 3, 'n_estimators': 100},
 np.float64(0.6427692307692309))
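The best grid-search score is selected on the same folds that produce it, so it tends to be optimistic. One common way to get a less biased estimate is nested cross-validation, sketched below on synthetic stand-in data. The reduced grid, the three inner folds, and the `*_demo` names are assumptions chosen to keep the example fast, not settings from the notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in data (not the survey): 126 rows, 6 Likert-style features.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 6, size=(126, 6))
y_demo = (X_demo[:, 0] + rng.normal(0, 2, size=126) > 3.5).astype(int)

# Inner loop picks hyperparameters; outer loop scores the whole tuning procedure.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=42),
    {"learning_rate": [0.05, 0.1], "max_depth": [2, 3]},  # reduced grid for speed
    cv=inner_cv,
    scoring="accuracy",
)

# Each outer fold reruns the grid search on its training part only, so the
# outer test rows never influence which hyperparameters are chosen.
nested_scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv, scoring="accuracy")
print(nested_scores.mean(), nested_scores.std())
```

The outer-loop mean estimates how well the tuning procedure generalizes, rather than how well one particular parameter set scored on folds it was chosen from.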

### Section 7 — Feature importance & selection

In [28]:
best_gb = grid.best_estimator_
importances = pd.Series(
    best_gb.feature_importances_,
    index=X.columns
).sort_values(ascending=False)

importances

X1    0.282480
X3    0.209636
X5    0.203363
X6    0.138052
X4    0.084346
X2    0.082124
dtype: float64

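Feature importance analysis indicates that a subset of questions
contributes most of the predictive signal: X1, X3, and X5 together account for
about 70% of the total importance, while X4 and X2 contribute about 8% each.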

This motivates an explicit feature selection step to determine whether
a smaller survey can achieve comparable performance.
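With only six questions, one simple way to run that selection step is to score every non-empty subset of features with the same cross-validation scheme. The sketch below is an illustration under assumptions: `score_subsets` is a hypothetical helper, the data is synthetic, and the demo uses four columns and a smaller model to keep the run short.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def score_subsets(X: pd.DataFrame, y, estimator, cv) -> pd.DataFrame:
    """Cross-validated accuracy for every non-empty subset of X's columns."""
    rows = []
    for k in range(1, X.shape[1] + 1):
        for cols in combinations(X.columns, k):
            scores = cross_val_score(estimator, X[list(cols)], y, cv=cv, scoring="accuracy")
            rows.append({"features": cols, "mean_acc": scores.mean(), "std_acc": scores.std()})
    return pd.DataFrame(rows).sort_values("mean_acc", ascending=False, ignore_index=True)


# Synthetic stand-in data (not the survey), with four columns to keep the demo fast.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.integers(1, 6, size=(126, 4)), columns=["X1", "X2", "X3", "X4"])
y_demo = (X_demo["X1"] + rng.normal(0, 2, size=126) > 3.5).astype(int)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=42)
ranking = score_subsets(X_demo, y_demo, model, cv_demo)
print(ranking.head())
```

For the six survey questions this means 63 subsets. Picking the top row out of 63 candidates on the same folds is itself an optimistic selection, so the ranking is best treated as exploratory and the `std_acc` column read alongside the means.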