# Baseline

## Set up

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from IPython.display import display, Markdown
from pprint import pprint
from sklearn.metrics import accuracy_score, classification_report, recall_score
from lazypredict.Supervised import LazyClassifier
import random

seed = random.randint(1000, 9999)
print(seed)

# Load the data
file_path = r"../data/clean/ACME-happinesSurvey2020.parquet"
data = pd.read_parquet(file_path)

# Display basic information about the dataset (info() prints directly and returns None)
data.info()

# Display five randomly sampled rows of the dataset
data.sample(5)

3104
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126 entries, 0 to 125
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Y       126 non-null    int8 
 1   X1      126 non-null    int8 
 2   X2      126 non-null    int8 
 3   X3      126 non-null    int8 
 4   X4      126 non-null    int8 
 5   X5      126 non-null    int8 
 6   X6      126 non-null    int8 
dtypes: int8(7)
memory usage: 1010.0 bytes


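|     | Y | X1 | X2 | X3 | X4 | X5 | X6 |
|-----|---|----|----|----|----|----|----|
| 116 | 1 | 3  | 4  | 4  | 5  | 1  | 3  |
| 69  | 1 | 5  | 4  | 5  | 5  | 5  | 5  |
| 49  | 1 | 5  | 1  | 3  | 3  | 4  | 4  |
| 79  | 1 | 5  | 5  | 5  | 5  | 5  | 5  |
| 59  | 1 | 5  | 2  | 4  | 2  | 2  | 4  |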


## Modeling

The data loaded successfully: 126 entries and 7 columns, all stored as `int8`, with no missing values. `Y` is the target and `X1` to `X6` are the features.

Let's start by splitting the data into training and test sets.

### Splitting

In [14]:
%%capture
# Separate features and target
X = data.drop('Y', axis=1)
y = data['Y']

# Split the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Fit LazyClassifier's collection of models and score each one on the test set,
# reporting recall as an extra custom metric
clf = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=recall_score)
models, predictions = clf.fit(X_train, X_test, y_train, y_test)
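
With 126 rows, a 20% test split holds only 26 samples, so the class ratio in the test set can drift from the full data by chance. The sketch below shows a stratified split as one possible variation. It is not what the cell above does. The DataFrame is synthetic: the ratings are random and the 69/57 class balance is an arbitrary assumption, not a figure from the survey.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data (all values are made up for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.integers(1, 6, size=(126, 6)),
    columns=[f"X{i}" for i in range(1, 7)],
)
demo["Y"] = np.array([1] * 69 + [0] * 57)  # arbitrary class balance (assumption)

X_demo = demo.drop("Y", axis=1)
y_demo = demo["Y"]

# stratify=y_demo keeps the class ratio (almost) identical in both subsets
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

print(len(Xd_train), len(Xd_test))
print(round(y_demo.mean(), 3), round(yd_train.mean(), 3), round(yd_test.mean(), 3))
```

The share of positive labels printed on the last line should be close across the full data, the training set, and the test set.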


In [16]:
models.sort_values('F1 Score', ascending=False)

| Model | Accuracy | Balanced Accuracy | ROC AUC | F1 Score | recall_score | Time Taken |
|-------|----------|-------------------|---------|----------|--------------|------------|
| ExtraTreeClassifier | 0.62 | 0.65 | 0.65 | 0.62 | 0.8 | 0.02 |
| LGBMClassifier | 0.58 | 0.6 | 0.6 | 0.58 | 0.7 | 0.04 |
| ExtraTreesClassifier | 0.58 | 0.6 | 0.6 | 0.58 | 0.7 | 0.13 |
| LabelSpreading | 0.58 | 0.62 | 0.62 | 0.57 | 0.8 | 0.02 |
| BaggingClassifier | 0.58 | 0.62 | 0.62 | 0.57 | 0.8 | 0.04 |
| LabelPropagation | 0.58 | 0.62 | 0.62 | 0.57 | 0.8 | 0.02 |
| QuadraticDiscriminantAnalysis | 0.58 | 0.64 | 0.64 | 0.56 | 0.9 | 0.01 |
| Perceptron | 0.54 | 0.51 | 0.51 | 0.54 | 0.4 | 0.01 |
| RandomForestClassifier | 0.54 | 0.59 | 0.59 | 0.53 | 0.8 | 0.17 |
| AdaBoostClassifier | 0.54 | 0.59 | 0.59 | 0.53 | 0.8 | 0.12 |
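
In the results above, the Balanced Accuracy and ROC AUC columns match in every row. That is what one would expect if the AUC is computed from hard 0/1 predictions rather than predicted probabilities. This is an assumption about how the table was produced, not something verified here. For binary labels, the ROC curve of a hard prediction has a single interior point, and the area under it equals the mean of sensitivity and specificity, which is the balanced accuracy. The sketch below checks that identity on made-up labels.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score, roc_auc_score

# Made-up labels and hard predictions, for illustration only
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=26)
y_true[:2] = [0, 1]  # make sure both classes are present
# Copy the true labels, then flip roughly 30% of them to simulate errors
y_pred = np.where(rng.random(26) < 0.7, y_true, 1 - y_true)

sensitivity = recall_score(y_true, y_pred, pos_label=1)  # recall of class 1
specificity = recall_score(y_true, y_pred, pos_label=0)  # recall of class 0

bal_acc = balanced_accuracy_score(y_true, y_pred)
auc_hard = roc_auc_score(y_true, y_pred)  # AUC from 0/1 predictions, not probabilities

print(bal_acc, auc_hard, (sensitivity + specificity) / 2)
```

If that assumption holds, the ROC AUC column carries no information beyond Balanced Accuracy. A probability-based AUC would have to be computed separately from `predict_proba` or `decision_function` outputs.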


- ExtraTreeClassifier
- QuadraticDiscriminantAnalysis
- LabelSpreading
- LabelPropagation
- Perceptron

## Classifier Types

### 1. Quadratic Discriminant Analysis (QDA)

- **Type**: Probabilistic Classifier
- **Description**: QDA is a generative model that assumes the features of each class follow a Gaussian distribution. It estimates a separate mean and covariance matrix for every class, which produces a quadratic decision boundary between classes.
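
To illustrate why per-class covariance matters, the sketch below fits QDA and, for contrast, Linear Discriminant Analysis (which shares one covariance matrix across classes) on synthetic data. The two classes have the same mean but different spread, so no straight line separates them. The data and the variable names are illustrative assumptions and have nothing to do with the survey.

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import train_test_split

# Synthetic data: two Gaussian classes centred at the origin,
# one tight (std 0.5) and one wide (std 2.0)
rng = np.random.default_rng(0)
X_tight = rng.normal(0.0, 0.5, size=(500, 2))
X_wide = rng.normal(0.0, 2.0, size=(500, 2))
X_syn = np.vstack([X_tight, X_wide])
y_syn = np.array([0] * 500 + [1] * 500)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0, stratify=y_syn
)

# store_covariance=True exposes the fitted per-class covariance matrices
qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(Xs_train, ys_train)
lda = LinearDiscriminantAnalysis().fit(Xs_train, ys_train)

qda_acc = qda.score(Xs_test, ys_test)
lda_acc = lda.score(Xs_test, ys_test)

print("QDA accuracy:", round(qda_acc, 2))
print("LDA accuracy:", round(lda_acc, 2))
print("Class 0 covariance:\n", qda.covariance_[0].round(2))
print("Class 1 covariance:\n", qda.covariance_[1].round(2))
```

On data like this, QDA can draw a roughly circular boundary around the tight class, while a linear boundary does little better than guessing.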

### 2. Label Spreading

- **Type**: Semi-Supervised Classifier
- **Description**: This algorithm is designed for semi-supervised learning, where it uses both labeled and unlabeled data. It builds a similarity graph over the data points and spreads labels along the edges of that graph, which makes it useful when labeled data is scarce. In this notebook every training row is labeled, so it acts as a graph-based supervised classifier.

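### 3. Label Propagation

- **Type**: Semi-Supervised Classifier
- **Description**: Like Label Spreading, Label Propagation is a semi-supervised technique that propagates labels through a similarity graph of the data points, so the model learns from both labeled and unlabeled data. The main difference is that Label Propagation keeps the given labels fixed during propagation, whereas Label Spreading allows them to be adjusted, which makes Label Spreading more robust to noisy labels.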
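
The sketch below shows both graph-based methods used in their semi-supervised mode, which the notebook itself does not exercise. It runs on synthetic blobs, keeps only five labels per class, and marks the rest as unlabeled with `-1` (scikit-learn's convention). The data, the kernel choice, and `gamma=5` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelPropagation, LabelSpreading

# Two well-separated synthetic clusters (illustrative data only)
X_blob, y_blob = make_blobs(
    n_samples=200,
    centers=[[-3.0, 0.0], [3.0, 0.0]],
    cluster_std=0.6,
    random_state=0,
)

# Keep 5 labels per class; everything else is marked unlabeled with -1
y_partial = np.full_like(y_blob, -1)
for cls in (0, 1):
    keep = np.flatnonzero(y_blob == cls)[:5]
    y_partial[keep] = cls
unlabeled = y_partial == -1

# Fit each model and measure how well the inferred labels match the hidden ones
results = {}
for Model in (LabelPropagation, LabelSpreading):
    model = Model(kernel="rbf", gamma=5).fit(X_blob, y_partial)
    inferred = model.transduction_[unlabeled]  # labels assigned to unlabeled points
    results[Model.__name__] = (inferred == y_blob[unlabeled]).mean()

print(results)
```

With only 10 of 200 labels available, both models should recover nearly all of the hidden labels on data this cleanly separated.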

### 4. Perceptron

- **Type**: Linear Classifier
- **Description**: The Perceptron is a linear classifier that predicts from the sign of a weighted sum of the features. It updates its weights only when it misclassifies a training sample. It is one of the simplest forms of neural network and is primarily used for binary classification tasks.
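
The update rule can be sketched in a few lines of NumPy. This is a from-scratch illustration on synthetic, linearly separable data, not scikit-learn's `Perceptron` implementation. The data, the margin filter, and the epoch limit are assumptions chosen so that the loop is guaranteed to converge.

```python
import numpy as np

# Synthetic, linearly separable data: the label is the side of the line x0 + x1 = 0.
# Points too close to the line are dropped so the classes are separated by a margin.
rng = np.random.default_rng(0)
X_raw = rng.uniform(-2.0, 2.0, size=(200, 2))
X_lin = X_raw[np.abs(X_raw.sum(axis=1)) > 0.5]
y_lin = np.where(X_lin.sum(axis=1) > 0, 1, -1)

w = np.zeros(2)  # weights
b = 0.0          # bias

for epoch in range(200):
    mistakes = 0
    for xi, yi in zip(X_lin, y_lin):
        # A sample is misclassified when the sign of the score disagrees with its label
        if yi * (xi @ w + b) <= 0:
            w += yi * xi  # move the boundary toward classifying xi correctly
            b += yi
            mistakes += 1
    if mistakes == 0:  # a full pass without errors: training has converged
        break

train_acc = np.mean(np.sign(X_lin @ w + b) == y_lin)
print("weights:", w, "bias:", b, "accuracy:", train_acc)
```

Because a separating line with a margin exists by construction, the loop stops after a finite number of updates. On data that is not linearly separable, such as the survey responses may be, it would keep cycling until the epoch limit.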

## Next Steps

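1. **Feature Engineering**:
 - Consider creating new features or combining existing ones to capture more information.

2. **Cross-Validation**:
 - Use cross-validation to check that performance is consistent across folds and that the models are not overfitting. With only 126 rows, the 20% test split holds just 26 samples, so scores from a single split are noisy.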
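
The cross-validation step could look like the minimal sketch below, which uses stratified 5-fold cross-validation with `LogisticRegression`. The data is a random stand-in with the same shape as the survey (126 rows, six ratings from 1 to 5), so the scores are meaningless in themselves. The choice of model, fold count, and metric are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Random stand-in with the survey's shape (values are made up)
rng = np.random.default_rng(0)
X_cv = rng.integers(1, 6, size=(126, 6))
y_cv = rng.integers(0, 2, size=126)

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_cv, y_cv, cv=cv, scoring="accuracy"
)

print("fold accuracies:", scores.round(2))
print("mean:", scores.mean().round(2), "std:", scores.std().round(2))
```

The spread across folds gives a rough sense of how much a single train/test split can mislead on a dataset this small.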