### Understanding the Problem:

**1. Problem description**:

   - **Objective**: The company’s primary objective is to ensure customer happiness, which is crucial for maintaining customer loyalty and sustaining growth, especially during challenging times like the COVID-19 pandemic. By accurately predicting customer happiness, the company can take proactive measures to improve customer experiences, thus reducing churn and enhancing customer satisfaction.
   - **Actionable Insights**: Beyond just predicting customer happiness, the company needs insights into *why* customers are happy or unhappy. Understanding the key drivers behind customer satisfaction allows the company to make strategic decisions, such as improving delivery times, adjusting pricing strategies, or enhancing app usability.

**2. Which metric should we use?**

   - **Accuracy**: Accuracy is straightforward and commonly used, but it can be misleading under class imbalance, because a model that always predicts the majority class still scores well. The company should therefore look beyond overall accuracy to how well the model predicts each class, happy and unhappy.
   - **Precision, Recall, F1-Score** (taking happy, `Y = 1`, as the positive class):
     - **Precision** is crucial when the cost of a false positive (predicting a customer is happy when they are not) is high.
     - **Recall** measures how many truly happy customers the model finds; its false negatives are happy customers predicted as unhappy. To avoid missing unhappy customers, who might otherwise be lost, track recall with unhappy (`Y = 0`) as the class of interest.
     - **F1-Score** is the harmonic mean of precision and recall, and might be a more appropriate metric if both false positives and false negatives are costly.
   - **Customer Lifetime Value (CLV) Impact**: The company might consider metrics that are tied to customer lifetime value, where the impact of retaining or losing a customer is factored into the evaluation.
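The metric definitions above can be made concrete with a small, hand-checkable sketch. The labels below are hypothetical (not taken from the survey), and the helper name `binary_metrics` is illustrative only. Passing `positive=0` computes the same metrics with the unhappy class as the class of interest.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    # Guard against empty denominators (no predicted or no actual positives)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels: 1 = happy, 0 = unhappy
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]

happy = binary_metrics(y_true, y_pred, positive=1)
unhappy = binary_metrics(y_true, y_pred, positive=0)

print("happy as positive:  ", happy)
print("unhappy as positive:", unhappy)
```

In this toy example two of the five unhappy customers are predicted as happy, so recall for the unhappy class is 0.6 even though overall accuracy is 0.7. That gap is the kind of detail accuracy alone would hide.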

**3. What adds the most value?**

   - **Predictive Power vs. Interpretability**: While a highly accurate model is valuable, the company also needs to understand what drives customer satisfaction. A simpler, more interpretable model might be preferable if it provides actionable insights, even if it’s slightly less accurate.
   - **Customer Retention**: Focusing on the factors that contribute most to retaining customers adds significant value. Identifying and addressing the pain points that lead to customer dissatisfaction can directly impact the company’s revenue and growth.
   - **Survey Optimization**: If the company can identify the minimal set of survey questions that still provide robust predictions, it can streamline its feedback process, reduce customer survey fatigue, and focus resources on the most impactful areas.

### Conclusion:

- **Primary Need**: The company needs a model that not only predicts customer happiness accurately but also provides insights into the factors that most influence customer satisfaction.
- **Metrics**: While accuracy is important, the company should also consider precision, recall, and F1-score, possibly weighted by customer lifetime value, to ensure the model aligns with business goals.
- **Value Addition**: Understanding and acting on the key drivers of customer satisfaction will add the most value. This includes focusing on features that have the highest impact on customer retention and optimizing the survey to gather only the most essential data.

In [4]:
import pandas as pd
from IPython.display import display


# Load the data
file_path = r"../data/raw/ACME-HappinessSurvey2020.csv"
df = pd.read_csv(file_path)

# Print column dtypes and non-null counts (df.info() prints directly and returns None)
df.info()

# Draw five random rows (not rendered here: only display() calls and a cell's last expression produce output)
df.sample(5)

# Display summary statistics for each column
display(df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126 entries, 0 to 125
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Y       126 non-null    int64
 1   X1      126 non-null    int64
 2   X2      126 non-null    int64
 3   X3      126 non-null    int64
 4   X4      126 non-null    int64
 5   X5      126 non-null    int64
 6   X6      126 non-null    int64
dtypes: int64(7)
memory usage: 7.0 KB

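|       | Y        | X1       | X2       | X3       | X4       | X5       | X6       |
|-------|----------|----------|----------|----------|----------|----------|----------|
| count | 126.0    | 126.0    | 126.0    | 126.0    | 126.0    | 126.0    | 126.0    |
| mean  | 0.547619 | 4.333333 | 2.531746 | 3.309524 | 3.746032 | 3.650794 | 4.253968 |
| std   | 0.499714 | 0.8      | 1.114892 | 1.02344  | 0.875776 | 1.147641 | 0.809311 |
| min   | 0.0      | 1.0      | 1.0      | 1.0      | 1.0      | 1.0      | 1.0      |
| 25%   | 0.0      | 4.0      | 2.0      | 3.0      | 3.0      | 3.0      | 4.0      |
| 50%   | 1.0      | 5.0      | 3.0      | 3.0      | 4.0      | 4.0      | 4.0      |
| 75%   | 1.0      | 5.0      | 3.0      | 4.0      | 4.0      | 4.0      | 5.0      |
| max   | 1.0      | 5.0      | 5.0      | 5.0      | 5.0      | 5.0      | 5.0      |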


The data loaded successfully: 126 records across 7 integer columns (`Y`, `X1`, `X2`, `X3`, `X4`, `X5`, and `X6`), with no missing values. The columns are:

- **Y**: Target attribute indicating customer happiness (0 = unhappy, 1 = happy).
- **X1**: Order delivered on time (1 to 5 scale).
- **X2**: Contents of the order were as expected (1 to 5 scale).
- **X3**: Ordered everything wanted (1 to 5 scale).
- **X4**: Paid a good price for the order (1 to 5 scale).
- **X5**: Satisfaction with the courier (1 to 5 scale).
- **X6**: Ease of ordering via the app (1 to 5 scale).

### Initial Observations:

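- The target variable `Y` is roughly balanced: its mean of 0.548 corresponds to 69 happy and 57 unhappy customers out of 126.
- `X1` and `X6` are concentrated at the top of the scale (means of about 4.3, medians of 5 and 4), and `X4` and `X5` centre near 4 (medians of 4). `X2` and `X3` sit lower, with means of 2.53 and 3.31 and medians of 3. `X5` and `X2` show the widest spread (standard deviations of about 1.15 and 1.11).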
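These observations can be checked with a couple of pandas calls. The sketch below uses a tiny hypothetical frame (`toy`, with made-up rows and only two of the feature columns) so that it runs on its own. On the real data the same expressions would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for the survey frame (same column names, made-up rows)
toy = pd.DataFrame({
    "Y":  [1, 0, 1, 1, 0, 1],
    "X1": [5, 3, 5, 4, 4, 5],
    "X2": [3, 2, 1, 3, 2, 4],
})

# Share of each class in the target; values near 0.5 indicate a balanced target
class_share = toy["Y"].value_counts(normalize=True).sort_index()

# Per-feature centre and spread, to compare where the ratings cluster
feature_summary = toy.drop(columns="Y").agg(["mean", "median", "std"]).T

print(class_share)
print(feature_summary)
```

Reading the mean and median side by side helps separate a feature whose ratings are simply more spread out from one whose ratings are centred lower on the scale.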

### Next Steps:

1. **Modeling**: I will apply a classification algorithm to predict customer happiness.
2. **Feature Importance**: I'll use feature selection techniques to identify the most significant features influencing customer happiness.

In [5]:
# Split columns by dtype family so each can be downcast to its smallest fitting type
fcols = df.select_dtypes('float').columns
icols = df.select_dtypes('integer').columns

# Downcast floats and integers to the smallest dtype that still holds their values
df[fcols] = df[fcols].apply(pd.to_numeric, downcast='float')
df[icols] = df[icols].apply(pd.to_numeric, downcast='integer')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126 entries, 0 to 125
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Y       126 non-null    int8
 1   X1      126 non-null    int8
 2   X2      126 non-null    int8
 3   X3      126 non-null    int8
 4   X4      126 non-null    int8
 5   X5      126 non-null    int8
 6   X6      126 non-null    int8
dtypes: int8(7)
memory usage: 1010.0 bytes

In [8]:
# Save the downcast frame (requires a parquet engine such as pyarrow)
df.to_parquet("../data/clean/ACME-happinesSurvey2020.parquet", compression="brotli")
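To illustrate what the downcasting step does, here is a minimal sketch on a hypothetical five-element series of 1 to 5 ratings (the values are made up). `pd.to_numeric` with `downcast="integer"` picks the smallest integer dtype that can hold every value.

```python
import pandas as pd

# Hypothetical ratings on the 1-5 scale, stored as 64-bit integers
ratings = pd.Series([1, 3, 5, 4, 2], dtype="int64")

# downcast="integer" picks the smallest signed integer dtype that fits the values
compact = pd.to_numeric(ratings, downcast="integer")

# Compare dtype and bytes used by the values (index excluded)
print(ratings.dtype, ratings.memory_usage(index=False))
print(compact.dtype, compact.memory_usage(index=False))
```

The same effect explains the two `df.info()` outputs in this notebook: 126 rows × 7 columns take 8 bytes per value as `int64` (about 7.0 KB in total) and 1 byte per value as `int8` (882 bytes, plus index overhead, for the reported 1010 bytes).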


### Next Steps:

1. **Feature Selection**:
   - I'll use techniques such as Recursive Feature Elimination (RFE) to identify the most important features.
2. **Model Training and Evaluation**:
   - Train a classification model (e.g., Logistic Regression, Random Forest).
   - Evaluate the model using accuracy, precision, recall, and F1-score.
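The planned steps could look roughly like the sketch below, which uses scikit-learn's `RFE`, `LogisticRegression`, and `cross_validate`. Everything about the data is an assumption: the frame is synthetic, and the rule that makes `Y` depend mostly on `X1` and `X5` is invented for illustration, not a finding from the survey. The choice of two features and 5 folds is likewise arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the survey: six 1-5 ratings, Y driven by X1 and X5 plus noise.
# This data-generating rule is an assumption for illustration only.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(1, 6, size=(200, 6)),
                 columns=[f"X{i}" for i in range(1, 7)])
signal = X["X1"] + X["X5"] + rng.normal(0, 1, size=len(X))
y = (signal > signal.median()).astype(int)

# Recursive Feature Elimination: repeatedly fit the model and drop the weakest feature
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
selector.fit(X, y)
selected = list(X.columns[selector.support_])

# Evaluate selection + classifier together, so selection is redone inside each fold
pipeline = make_pipeline(
    RFE(LogisticRegression(max_iter=1000), n_features_to_select=2),
    LogisticRegression(max_iter=1000),
)
metric_names = ["accuracy", "precision", "recall", "f1"]
scores = cross_validate(pipeline, X, y, cv=5, scoring=metric_names)
summary = {name: scores[f"test_{name}"].mean() for name in metric_names}

print("Selected features:", selected)
print("Cross-validated scores:", summary)
```

Putting `RFE` inside the pipeline means features are re-selected on each training fold, which avoids leaking information from the validation fold into the selection step. With only 126 survey responses, cross-validated scores are likely to be more stable than a single train/test split.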