# Baseline

## Set up

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from IPython.display import display, Markdown
from pprint import pprint
from sklearn.metrics import accuracy_score, classification_report

# Load the data
file_path = r"../data/clean/ACME-happinesSurvey2020.parquet"
data = pd.read_parquet(file_path)

# Display basic information about the dataset
# (info() prints its summary directly and returns None, so there is nothing to store)
data.info()

# Display a random sample of five rows
data.sample(5)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126 entries, 0 to 125
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Y       126 non-null    int8 
 1   X1      126 non-null    int8 
 2   X2      126 non-null    int8 
 3   X3      126 non-null    int8 
 4   X4      126 non-null    int8 
 5   X5      126 non-null    int8 
 6   X6      126 non-null    int8 
dtypes: int8(7)
memory usage: 1010.0 bytes


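|     | Y | X1 | X2 | X3 | X4 | X5 | X6 |
| --- | - | -- | -- | -- | -- | -- | -- |
| 86  | 1 | 5  | 3  | 3  | 3  | 5  | 5  |
| 108 | 0 | 5  | 2  | 4  | 4  | 5  | 5  |
| 92  | 1 | 5  | 4  | 5  | 5  | 5  | 4  |
| 72  | 1 | 4  | 3  | 3  | 4  | 2  | 4  |
| 91  | 1 | 5  | 1  | 3  | 4  | 5  | 5  |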


## Modeling

The data has been successfully loaded and consists of 126 entries with 7 columns: `Y`, `X1`, `X2`, `X3`, `X4`, `X5`, and `X6`. There are no missing values in the dataset.

Let's start with splitting the data into training and test sets.

### Splitting

In [12]:
# Separate features and target
X = data.drop('Y', axis=1)
y = data['Y']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape


((100, 6), (26, 6), (100,), (26,))

The data has been successfully split into training and test sets. We have 100 samples in the training set and 26 samples in the test set.
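With only 126 rows, a purely random split can give the test set a different class balance from the full data. One option, which the notebook above does not use, is to pass `stratify=y` to `train_test_split`. The sketch below illustrates this on synthetic stand-in data; `X_demo`, `y_demo`, and the class counts are invented for illustration and are not taken from the survey.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey: 126 rows, 6 ordinal features (1-5)
# and a binary target. These values are invented, not the real data.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 6, size=(126, 6))
y_demo = np.array([1] * 69 + [0] * 57)

# stratify=y_demo keeps the class ratio roughly equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("positive share (all, train, test):",
      y_demo.mean(), y_tr.mean(), y_te.mean())
```

The split sizes stay at 100 and 26; only the assignment of rows to each side changes.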

### Feature Selection

Next, we use Recursive Feature Elimination (RFE) with a Logistic Regression model to identify the most important features.

In [13]:
# Initialize the model
model = LogisticRegression(max_iter=1000)

# Initialize RFE
rfe = RFE(model, n_features_to_select=1)
rfe.fit(X_train, y_train)

# Get the ranking of the features
feature_ranking = rfe.ranking_
selected_features = X.columns[rfe.support_]

feature_ranking, selected_features

(array([1, 2, 5, 4, 6, 3]), Index(['X1'], dtype='object'))

With `n_features_to_select=1`, RFE eliminates features one at a time until a single feature remains. It ranks `X1` (my order was delivered on time) first, which makes it the most important feature for predicting customer happiness under this Logistic Regression model.
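The `ranking_` array is easier to read once it is paired with the column names. A small sketch, using the values copied from the RFE output above (the variable names here are illustrative):

```python
# Values copied from the RFE output above
feature_names = ["X1", "X2", "X3", "X4", "X5", "X6"]
ranking = [1, 2, 5, 4, 6, 3]

# Sort feature names by rank: 1 is the last feature left standing,
# higher numbers were eliminated earlier.
ordered = [name for _, name in sorted(zip(ranking, feature_names))]
print(ordered)  # ['X1', 'X2', 'X6', 'X4', 'X3', 'X5']
```

Read this way, `X5` was the first feature eliminated and `X2` the last one dropped before `X1`.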

### Model Training and Evaluation

The evaluation proceeds in three steps:

1. Train the Logistic Regression model using all features.
2. Evaluate the model's performance.
3. Compare the performance with a model trained using only the selected feature `X1`.

### Model Training with All Features

In [14]:
# Train the model with all features
model_all_features = LogisticRegression(max_iter=1000)
model_all_features.fit(X_train, y_train)

# Predict on the test set
y_pred_all_features = model_all_features.predict(X_test)

# Evaluate the model
accuracy_all_features = accuracy_score(y_test, y_pred_all_features)
report_all_features = classification_report(y_test, y_pred_all_features)


print(f'{accuracy_all_features=:}')
# The report is a preformatted string, so print() keeps its layout
print(report_all_features)

accuracy_all_features=0.46153846153846156
              precision    recall  f1-score   support

           0       0.56      0.33      0.42        15
           1       0.41      0.64      0.50        11

    accuracy                           0.46        26
   macro avg       0.48      0.48      0.46        26
weighted avg       0.49      0.46      0.45        26


#### Model Performance with All Features:

The results of the model evaluation are as follows:
- **Accuracy**: 46.15%
- **Classification Report**:

|              | precision | recall | f1-score | support |
| ------------ | --------- | ------ | -------- | ------- |
| 0            | 0.56      | 0.33   | 0.42     | 15      |
| 1            | 0.41      | 0.64   | 0.50     | 11      |
| accuracy     |           |        | 0.46     | 26      |
| macro avg    | 0.48      | 0.48   | 0.46     | 26      |
| weighted avg | 0.49      | 0.46   | 0.45     | 26      |
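The report can be cross-checked by recovering the underlying prediction counts from recall and support. This is a worked-arithmetic sketch; the inputs are copied from the table above, and rounding to whole samples is an assumption needed because the report shows only two decimals.

```python
# Support and recall per class, copied from the classification report
support = {0: 15, 1: 11}
recall = {0: 0.33, 1: 0.64}

# Correct predictions per class: recall * support, rounded to whole samples
correct = {c: round(recall[c] * support[c]) for c in support}
missed = {c: support[c] - correct[c] for c in support}

# Samples predicted as class c = its correct hits + misses of the other class
predicted = {0: correct[0] + missed[1], 1: correct[1] + missed[0]}
precision = {c: correct[c] / predicted[c] for c in support}
accuracy = sum(correct.values()) / sum(support.values())

print(correct)     # {0: 5, 1: 7}
print({c: round(p, 2) for c, p in precision.items()})  # {0: 0.56, 1: 0.41}
print(round(accuracy, 4))  # 0.4615
```

The recovered precision and accuracy match the report, so the model got 12 of the 26 test samples right: 5 of class `0` and 7 of class `1`.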

### Model Training with Selected Feature

In [None]:
# Train the model with the selected feature
model_selected_feature = LogisticRegression(max_iter=1000)
model_selected_feature.fit(X_train[['X1']], y_train)

# Predict on the test set
y_pred_selected_feature = model_selected_feature.predict(X_test[['X1']])

# Evaluate the model
accuracy_selected_feature = accuracy_score(y_test, y_pred_selected_feature)
report_selected_feature = classification_report(y_test, y_pred_selected_feature)

print(f'{accuracy_selected_feature=:}')
# The report is a preformatted string, so print() keeps its layout
print(report_selected_feature)

accuracy_selected_feature=0.4230769230769231
              precision    recall  f1-score   support

           0       0.50      0.20      0.29        15
           1       0.40      0.73      0.52        11

    accuracy                           0.42        26
   macro avg       0.45      0.46      0.40        26
weighted avg       0.46      0.42      0.38        26




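#### Model Performance with Selected Feature (`X1`):

- **Accuracy**: 42.31%
- **Classification Report**:

|              | precision | recall | f1-score | support |
| ------------ | --------- | ------ | -------- | ------- |
| 0            | 0.50      | 0.20   | 0.29     | 15      |
| 1            | 0.40      | 0.73   | 0.52     | 11      |
| accuracy     |           |        | 0.42     | 26      |
| macro avg    | 0.45      | 0.46   | 0.40     | 26      |
| weighted avg | 0.46      | 0.42   | 0.38     | 26      |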

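### Analysis:
- The model trained with all features scores slightly higher (46.15%) than the model trained with only the selected feature `X1` (42.31%). On a 26-sample test set, that gap amounts to a single prediction (12 versus 11 correct).
- Both accuracies are low. They fall below the target of 73%, and also below the 57.69% that always predicting the majority test class (`0`, 15 of 26 samples) would achieve.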
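To put these accuracies in context, the sketch below computes the majority-class baseline on the test set and an approximate 95% interval for each accuracy. The Wilson score interval is a choice made here for illustration and is not part of the original analysis; the counts are derived from the reported accuracies and supports.

```python
import math

def wilson_interval(correct, n, z=1.96):
    """Approximate 95% Wilson score interval for the proportion correct / n."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

n_test = 26
correct_all, correct_x1 = 12, 11   # 0.4615 * 26 and 0.4231 * 26
majority_baseline = 15 / n_test    # always predict class 0 (support 15)

ci_all = wilson_interval(correct_all, n_test)
ci_x1 = wilson_interval(correct_x1, n_test)

print(f"majority baseline: {majority_baseline:.4f}")
print(f"all features: {correct_all / n_test:.4f}, interval {ci_all}")
print(f"X1 only:      {correct_x1 / n_test:.4f}, interval {ci_x1}")
```

With so few test samples the two intervals overlap heavily, so this single split gives little evidence that either model is better than the other.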

### Next Steps:
1. **Try Different Models**:
   - Experiment with other classification algorithms (e.g., Random Forest, Support Vector Machine).
   - Perform hyperparameter tuning to improve model performance.

2. **Feature Engineering**:
   - Consider creating new features or combining existing ones to capture more information.

3. **Cross-Validation**:
   - Use cross-validation to check that the model's performance is consistent across splits and that the model is not overfitting.
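As a starting point for the first and third steps, the sketch below compares two classifiers with stratified 5-fold cross-validation. It runs on synthetic stand-in data, because the survey file is not available here; `X_demo`, `y_demo`, the candidate list, and the hyperparameters are all assumptions for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in: 126 rows, 6 ordinal features (1-5), and a noisy
# binary target loosely driven by the first feature. Invented data.
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 6, size=(126, 6))
y_demo = (X_demo[:, 0] + rng.normal(0, 1.5, size=126) > 3).astype(int)

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

On the real data, `X` and `y` from the splitting step would replace `X_demo` and `y_demo`. The spread of the fold scores shows how much a single 26-sample test split can vary.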