# Project Name

[Project Description]

# Import Packages

In [None]:
import numpy as np
import pandas as pd
from IPython.display import display

# 1. Load the Data

In [None]:
train_df = pd.read_csv("data/train.csv")
test_df = pd.read_csv("data/test.csv")

# 2. Understand the Data

In [None]:
display(train_df.head(5))
display(test_df.head(5))

In [None]:
# info() prints its summary directly and returns None, so it should not be wrapped in display()
train_df.info()
test_df.info()

In [None]:
train_df.describe()

In [None]:
test_df.describe()

# 3. Data Cleaning

## 3.1. Check for duplicates

In [None]:
train_df.duplicated().any()

In [None]:
test_df.duplicated().any()

## 3.2. Check for missing data
Let's check for 0, blank, NaN or None values.

In [None]:
def print_missing_values(df):
    """
    Count the 0, blank and NaN/None values in each column of df, as counts and percentages.
    Only columns of df with at least one such value are displayed.
    """

    # Counts
    counts = pd.concat([
        (df == 0).sum().rename('zeros_count'),
        (df == '').sum().rename('blanks_count'),
        # isna() flags both NaN and None; (df == None) is always False element-wise
        df.isna().sum().rename('nan_count'),
    ], axis=1)

    # Percentages of the total number of rows
    percentages = (counts / len(df) * 100).round(1)
    percentages.columns = ['zeros_%', 'blanks_%', 'nan_%']

    missing_df = pd.concat([counts, percentages], axis=1)

    # Keep only the rows (columns of df) where at least one count is non-zero
    missing_df = missing_df[(counts != 0).any(axis=1)]

    # Sort by zeros count (descending) and then by NaN count (descending)
    missing_df = missing_df.sort_values(['zeros_count', 'nan_count'], ascending=[False, False])

    display(missing_df)

In [None]:
print_missing_values(train_df)
print_missing_values(test_df)

# 4. Exploratory Data Analysis

## 4.1. Univariate analysis

For each categorical variable, display the bar plot.

For each numerical variable, show histograms, measures of central tendency (mean, median, mode), and measures of dispersion (range, standard deviation, skewness, kurtosis). 
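
As a minimal sketch of the summaries listed above, the snippet below computes them with pandas on made-up data. `values` and `categories` are hypothetical stand-ins for one numerical and one categorical column of `train_df`.

```python
import pandas as pd

# Hypothetical stand-in for one numerical column of train_df
values = pd.Series([1.0, 2.0, 2.0, 3.0, 4.0, 10.0], name="feature")

summary = pd.Series({
    "mean": values.mean(),
    "median": values.median(),
    "mode": values.mode().iloc[0],          # first mode if there are several
    "range": values.max() - values.min(),
    "std": values.std(),                    # sample standard deviation (ddof=1)
    "skewness": values.skew(),
    "kurtosis": values.kurt(),              # excess kurtosis (0 for a normal distribution)
})
print(summary)

# Hypothetical stand-in for one categorical column of train_df.
# value_counts() gives the frequencies that a bar plot would show.
categories = pd.Series(["a", "b", "a", "c", "a"], name="category")
category_counts = categories.value_counts()
print(category_counts)
```

A positive skewness, as in this toy column, indicates a long right tail caused here by the single large value.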

## 4.2. Bivariate analysis

Understand the relationships between features and the target variable using scatterplots, correlation coefficients / matrix.
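
The correlation step can be sketched as follows, assuming a numerical target. The frame `demo_df` and its column names are invented for illustration and are not part of the project data.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature linearly related to the target, one pure-noise feature
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=200)
demo_df = pd.DataFrame({
    "linear_feature": x,
    "noise_feature": rng.normal(size=200),
    "target": 3 * x + rng.normal(scale=0.5, size=200),
})

# Pearson correlation matrix (the pandas default)
corr_matrix = demo_df.corr()

# Correlation of each feature with the target, strongest first (by absolute value)
target_corr = corr_matrix["target"].drop("target").sort_values(key=abs, ascending=False)
print(target_corr)
```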

## 4.3. Handle Outliers

## 4.4. Handle missing values

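**IMPORTANT!** Don't impute missing values using statistics computed on the test set! This is a common mistake leading to **Data Leakage** by train-test contamination. Fit the imputer (e.g. scikit-learn's `SimpleImputer`) on the training set only, then use it to transform both the training and test sets.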
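
A minimal sketch of leakage-free imputation with scikit-learn's `SimpleImputer`, assuming median imputation is appropriate. The toy frames `train_demo` and `test_demo` are hypothetical stand-ins for `train_df` and `test_df`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for train_df / test_df
train_demo = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0]})
test_demo = pd.DataFrame({"age": [np.nan, 100.0]})

imputer = SimpleImputer(strategy="median")

# Learn the median from the training data only...
train_imputed = imputer.fit_transform(train_demo)
# ...and reuse that same median on the test data (no refitting)
test_imputed = imputer.transform(test_demo)

print(imputer.statistics_)  # median learned from the training set
print(test_imputed)
```

The missing test value is filled with the training median (30.0), so the extreme test value of 100.0 has no influence on the imputation.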

# 5. Feature engineering

## 5.1. Create new features

Let's engineer relevant features that might improve predictive performance.

## 5.2. Encode categorical features

Let's convert categorical features into numerical form using techniques like one-hot encoding, label encoding or target encoding.

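Note: with one-hot encoding (e.g. `pd.get_dummies`), using drop_first=True creates k-1 columns for k categories, for example 1 column instead of 2 for a binary feature. This avoids the dummy variable trap, i.e. perfect multicollinearity between the dummy columns.

Target encoding replaces each category with a statistic of the target (typically its mean) for that category. It preserves the relationship between categorical values and the target variable, which can improve model performance compared to simple label encoding or one-hot encoding, particularly for tree-based models. Compute the encoding on the training set only, otherwise the target leaks into the features.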
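
The two encodings can be sketched as below. The columns `city` and `price` are invented for illustration, and the target encoding shown is the plain per-category mean. In practice a smoothed or out-of-fold variant is often preferred, because the plain mean lets each training row see its own target.

```python
import pandas as pd

# Hypothetical toy data: "city" is the categorical feature, "price" the target
train_demo = pd.DataFrame({
    "city": ["Paris", "Lyon", "Paris", "Nice", "Lyon", "Paris"],
    "price": [10.0, 4.0, 12.0, 6.0, 6.0, 14.0],
})
test_demo = pd.DataFrame({"city": ["Lyon", "Paris", "Lille"]})

# One-hot encoding: 3 categories become 2 columns with drop_first=True
one_hot = pd.get_dummies(train_demo["city"], prefix="city", drop_first=True)
print(list(one_hot.columns))

# Target (mean) encoding: category means are learned on the training set only
global_mean = train_demo["price"].mean()
city_means = train_demo.groupby("city")["price"].mean()

train_demo["city_te"] = train_demo["city"].map(city_means)
# Categories unseen during training ("Lille") fall back to the global training mean
test_demo["city_te"] = test_demo["city"].map(city_means).fillna(global_mean)
print(test_demo)
```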

## 5.3. Feature selection

Let's remove features that do not contribute much to the prediction, such as collinear features or features poorly correlated with the target variable.

Many ML algorithms, such as Random Forest, provide a feature importance score that can guide this selection.

Removing uninformative features improves the model's stability and interpretability, and reduces overfitting.

The Spearman correlation coefficient is rank-based, so it picks up monotonic relationships between variables even when they are nonlinear.
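
The sketch below illustrates both ideas on synthetic data: Spearman versus Pearson on a monotonic but nonlinear relationship, and a simple filter for collinear features. The column names and the 0.95 threshold are arbitrary choices made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = np.linspace(0.1, 5, 100)
demo_df = pd.DataFrame({
    "x": x,
    "x_copy": 2 * x + 1,            # perfectly collinear with x
    "noise": rng.normal(size=100),
    "target": np.exp(x),            # monotonic but nonlinear in x
})

# Spearman works on ranks, so a monotonic nonlinear relationship scores ~1
pearson = demo_df.corr(method="pearson").loc["x", "target"]
spearman = demo_df.corr(method="spearman").loc["x", "target"]
print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")

# Flag one feature of each highly correlated pair (threshold chosen arbitrarily)
features = ["x", "x_copy", "noise"]
feature_corr = demo_df[features].corr().abs()
# Keep only the upper triangle so that each pair is considered once
upper = feature_corr.where(np.triu(np.ones(feature_corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)
```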

## 5.4. Feature Normalization

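Linear regression models tend to work better when features and target are roughly symmetric rather than heavily skewed (strictly speaking, the normality assumption applies to the residuals, not to the features). Let's reduce the skew of right-skewed numerical features by applying a log transform. `np.log1p` computes log(1 + x), so it handles zeros but requires values greater than -1.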

In [None]:
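# `quantitative` is expected to hold the names of the skewed, non-negative numerical columns
for col in quantitative:
    train_df[col] = np.log1p(train_df[col])
    test_df[col] = np.log1p(test_df[col])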

## 5.5. Feature Scaling

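Scaling numerical features improves distance-based calculations (for KNN, SVM classifiers) and prevents features with large ranges from dominating the others. As with imputation, fit the scaler on the training set only, then apply it to both sets.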
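
A minimal sketch with scikit-learn's `StandardScaler`, one of several possible scalers. The arrays are invented toy data in which the second feature has a much larger range than the first.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: the second feature is 100 times larger than the first
X_train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_demo = np.array([[2.0, 400.0]])

scaler = StandardScaler()
# Mean and standard deviation are learned from the training data only
X_train_scaled = scaler.fit_transform(X_train_demo)
# The test data is transformed with the training statistics
X_test_scaled = scaler.transform(X_test_demo)

print(X_train_scaled)
print(X_test_scaled)
```

After scaling, both training columns have mean 0 and standard deviation 1, so neither dominates a distance computation.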

## 5.6. Save the processed data

In [None]:
train_df.to_csv('data/train_clean.csv', index=False)
test_df.to_csv('data/test_clean.csv', index=False)

# 6. Choose an Evaluation Metric
For regression problems, common metrics include:
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R-squared

For classification problems, common metrics include:
- Accuracy
- Precision
- Recall
- F1-score
- AUC

The confusion matrix and ROC curve can also bring useful insights.
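
The metrics above are all available in `sklearn.metrics`. The sketch below computes them on small hand-made arrays; the labels, predictions and scores are invented for illustration.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, mean_absolute_error,
    mean_squared_error, precision_score, r2_score, recall_score, roc_auc_score,
)

# Regression metrics
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")

# Classification metrics
labels = np.array([0, 0, 1, 1, 1, 0])
predicted_labels = np.array([0, 1, 1, 1, 0, 0])
predicted_scores = np.array([0.1, 0.6, 0.8, 0.7, 0.4, 0.2])  # probability of class 1

accuracy = accuracy_score(labels, predicted_labels)
precision = precision_score(labels, predicted_labels)
recall = recall_score(labels, predicted_labels)
f1 = f1_score(labels, predicted_labels)
auc = roc_auc_score(labels, predicted_scores)  # AUC needs scores, not hard labels
confusion = confusion_matrix(labels, predicted_labels)  # rows: true class, columns: predicted class
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}")
print(confusion)
```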

# 7. Select Algorithms
- Start with simple regression algorithms like Linear Regression and gradually explore more complex models like Random Forest, Gradient Boosting, or XGBoost.
- Consider simple ensemble methods, such as simple average, weighted average, or voting ensembles, to combine multiple models for potentially better results.
- The model chosen depends on the data. A more complex model does not always constitute a better model.
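
As a rough illustration of the points above, the sketch below compares a linear model, a Random Forest and their simple average on synthetic data. The data is generated with a purely linear signal (an assumption made for this demo), which is why the simpler model wins here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data with a linear signal plus noise
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model on the training split and predict on the validation split
val_preds = {name: model.fit(X_train, y_train).predict(X_val) for name, model in models.items()}


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


scores = {name: rmse(y_val, preds) for name, preds in val_preds.items()}

# Simple average ensemble: mean of the individual predictions
ensemble_preds = np.mean(list(val_preds.values()), axis=0)
scores["average"] = rmse(y_val, ensemble_preds)
print(scores)
```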

# 8. Model Validation
- Split the data into training and validation sets. A common split is 70-30 or 80-20 for training and validation, respectively. This method is computationally less intensive and often used for initial model exploration or when dealing with very large datasets.
- K-Fold Cross Validation. This method provides a more reliable evaluation, especially with smaller datasets.
- Model validation is important to assess the model's generalization performance (i.e. assess how well the model performs on unseen data). This helps prevent overfitting and gives you a more reliable estimate of your model's performance.
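
Both validation strategies can be sketched with scikit-learn as follows. The synthetic data and the choice of `LinearRegression` are placeholders for the project's own data and model.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic placeholder data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Option 1: a single 80-20 hold-out split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
holdout_r2 = LinearRegression().fit(X_train, y_train).score(X_val, y_val)
print(f"hold-out R2: {holdout_r2:.3f}")

# Option 2: 5-fold cross-validation, each sample is used for validation exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
print(f"CV R2: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

The spread of the fold scores gives an idea of how much a single hold-out estimate could vary.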

## 8.1. Hyperparameter Tuning
- Tune the hyperparameters of your chosen algorithms on the validation dataset using techniques like grid search or random search to find the best combination.
- Optuna is a hyperparameter optimization framework that samples promising regions of the search space adaptively, which is often more efficient than an exhaustive grid search.
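
A minimal grid search sketch using scikit-learn's `GridSearchCV`, which tunes with cross-validation instead of a single validation split. The `Ridge` model and the grid of `alpha` values are arbitrary choices for illustration. Optuna is not shown here to keep the example free of extra dependencies.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic placeholder data
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Candidate values for the regularization strength (arbitrary grid)
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(
    Ridge(),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, hence the negation
)
search.fit(X, y)

print(search.best_params_)
print(f"best CV RMSE: {-search.best_score_:.3f}")
```

For larger search spaces, `RandomizedSearchCV` has a similar interface and samples a fixed number of candidates instead of trying them all.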

## 8.2. Regularization
- Implement regularization techniques like L1 (Lasso) or L2 (Ridge) regularization to prevent overfitting.
- Many ML algorithms expose regularization parameters, including L1 and L2, sometimes called reg_alpha (L1) and reg_lambda (L2), as in XGBoost and LightGBM. Read up on your chosen algorithm's regularization parameters and tune them accordingly on your validation set.
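
The effect of L1 and L2 penalties can be illustrated on synthetic data where only two of ten features carry signal. The `alpha` values below are arbitrary and would normally be tuned.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: only the first two of ten features influence the target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)   # no regularization
ridge = Ridge(alpha=10.0).fit(X, y)  # L2: shrinks all coefficients towards zero
lasso = Lasso(alpha=0.5).fit(X, y)   # L1: can set coefficients exactly to zero

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("OLS coefficient norm:  ", np.linalg.norm(ols.coef_))
print("Ridge coefficient norm:", np.linalg.norm(ridge.coef_))
print("Lasso zero coefficients:", n_zero_lasso)
```

Ridge keeps every feature with smaller weights, while Lasso tends to discard uninformative features entirely, which also makes it a feature selection tool.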

# 9. Fit the final model

Fit the best model on the whole training set (including the validation set) using the optimal hyperparameters found during model validation.

# 10. Predict

Generate predictions on the test set (unseen data).
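
The final fit and prediction steps can be sketched as follows. The data is synthetic, and `best_params` is a hypothetical placeholder for the hyperparameters found during model validation.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic placeholder data, split into a full training set and an unseen test set
X, y = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)
X_full_train, X_test, y_full_train, _ = train_test_split(X, y, test_size=20, random_state=0)

# Hypothetical result of the hyperparameter search
best_params = {"alpha": 1.0}

# Refit on the whole training set (training + validation rows together)
final_model = Ridge(**best_params).fit(X_full_train, y_full_train)

# Predict on the unseen test set
test_predictions = final_model.predict(X_test)
print(test_predictions[:5])
```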

# 11. Model Persistence
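Save the fitted model for future use, together with any fitted preprocessing objects (imputer, encoders, scaler) needed to transform new data.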
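
One common option for scikit-learn models is `joblib`, sketched below on a toy model. The file name `model.joblib` is an arbitrary choice, and a temporary directory is used only to keep the example self-contained.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model: y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, model_path)      # save the fitted model to disk
    restored = joblib.load(model_path)  # load it back

# The restored model gives the same predictions as the original
restored_prediction = restored.predict(np.array([[4.0]]))
print(restored_prediction)
```

Only load model files from trusted sources, since joblib files are pickle-based and can execute arbitrary code when loaded.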