In [None]:
# import the libraries to use
import numpy as np
import pandas as pd
import seaborn as sns

# Step 1: Problem statement and data collection

In [None]:
from utils import load_data, ReadCsvParams, SaveCsvParams

file_path = '../data/raw/file_name.csv'
url = 'url_to_csv_file'
read_csv_params: ReadCsvParams = {'delimiter': ','}
save_csv_params: SaveCsvParams = {'sep': ','}

df: pd.DataFrame = load_data(file_path=file_path, url=url, read_csv_params=read_csv_params, save_csv_params=save_csv_params)

## Problem to solve:

Problem to solve and how.

# Step 2: Exploration and data cleaning

## Eliminate duplicates
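
A minimal sketch of this step, assuming a hypothetical toy DataFrame (`dup_demo`) in place of the loaded `df`. It counts fully duplicated rows first, so the size of the problem is known, and then keeps the first occurrence of each one.

```python
import pandas as pd

# Hypothetical toy data; in the notebook this would be the loaded `df`.
dup_demo = pd.DataFrame({"id": [1, 2, 2, 3], "city": ["A", "B", "B", "C"]})

# Count the rows that are exact copies of an earlier row.
n_duplicates = int(dup_demo.duplicated().sum())

# Keep the first occurrence of each duplicated row and rebuild the index.
dedup_demo = dup_demo.drop_duplicates().reset_index(drop=True)
```

If only some columns identify a record, `drop_duplicates(subset=[...])` restricts the comparison to those columns.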

## Eliminate irrelevant information
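
A minimal sketch, assuming hypothetical column names (`record_id`, `free_text_note`) that stand for identifiers or free text with no predictive value. Which columns are actually irrelevant depends on the problem statement.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
irrelevant_demo = pd.DataFrame({
    "record_id": [101, 102, 103],
    "free_text_note": ["a", "b", "c"],
    "age": [34, 51, 29],
    "y": [0, 1, 0],
})

# Columns judged irrelevant for the problem (an assumption for this example).
columns_to_drop = ["record_id", "free_text_note"]

# errors="ignore" skips any listed name that is not present in the DataFrame.
trimmed_demo = irrelevant_demo.drop(columns=columns_to_drop, errors="ignore")
```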

# Step 3: Analysis of univariate variables

A **univariate variable** is a statistical term for the set of observations of a single attribute. Univariate analysis is therefore the column-by-column analysis of the DataFrame. To do this, we must distinguish whether a variable is categorical or numerical, as the analysis and the conclusions that can be drawn will be different.

## Analysis of categorical variables

A **categorical variable** is a type of variable that can be one of a limited number of categories or groups. These groups are often nominal (e.g., the color of a car: red, blue, black, etc., but none of these colors is inherently "greater" or "better" than the others) but can also be represented by finite numbers.

**To represent these types of variables we will use histograms (bar charts of the count of each category).**
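
A possible sketch of these plots, assuming a hypothetical toy DataFrame (`cat_demo`). It uses seaborn's `countplot`, which draws one bar per category with its frequency; `value_counts()` gives the same information as numbers.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data: one nominal column and one numeric-coded category.
cat_demo = pd.DataFrame({
    "color": ["red", "blue", "red", "black", "blue", "red"],
    "doors": [2, 4, 4, 2, 4, 4],
})

# One panel per categorical column, each showing the count of every category.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, column in zip(axes, ["color", "doors"]):
    sns.countplot(data=cat_demo, x=column, ax=ax)
    ax.set_title(column)
fig.tight_layout()

# The same frequencies as a table.
color_counts = cat_demo["color"].value_counts()
```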

## Analysis of numeric variables

A **numeric variable** is a type of variable that can take numeric values (integers, fractions, decimals, negatives, etc.) in an infinite range. A categorical variable encoded with numbers can also be treated as a numeric variable.

**They are usually represented using a histogram and a boxplot, displayed together.**

# Step 4: Analysis of multivariate variables

After analyzing the features one by one, it is time to analyze them in relation to the target variable and to each other, in order to draw clearer conclusions about their relationships and to be able to make decisions about their processing.

Thus, before eliminating a variable because of a high proportion of null values or certain outliers, we must first run this analysis to make sure that the variable is not critical for predicting the target. For example, in the Titanic dataset the variable `Cabin` has many null values, and we would have to ensure that there is no relationship between it and passenger survival before eliminating it, since it could be significant for the model and its removal could bias the prediction.

## Numerical-numerical analysis

When the two variables being compared have numerical data, the analysis is said to be numerical-numerical. 

**Scatterplots and correlation analysis are used to compare two numerical columns.**
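
A sketch of both tools on hypothetical data (`nn_demo`), where `price` is built to depend linearly on `area` plus noise. `DataFrame.corr()` computes Pearson correlations by default.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: price grows with area, with random noise added.
rng = np.random.default_rng(0)
area = rng.uniform(40, 200, size=100)
nn_demo = pd.DataFrame({
    "area": area,
    "price": 1500 * area + rng.normal(0, 20000, size=100),
})

# Pairwise Pearson correlation between the numeric columns.
corr_matrix = nn_demo.corr()

# Scatterplot with a fitted line on the left, correlation heatmap on the right.
fig, (ax_scatter, ax_heat) = plt.subplots(1, 2, figsize=(10, 4))
sns.regplot(data=nn_demo, x="area", y="price", ax=ax_scatter)
sns.heatmap(corr_matrix, annot=True, fmt=".2f", ax=ax_heat)
fig.tight_layout()
```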

## Categorical-categorical analysis

When the two variables being compared have categorical data, the analysis is said to be categorical-categorical. 

**Histograms and combinations are used to compare two categorical columns.**
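
A minimal sketch, assuming hypothetical columns `plan` and `churned`. The `hue` argument splits each bar by the second variable, and `pd.crosstab` gives the same counts as a contingency table.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data with two categorical columns.
cc_demo = pd.DataFrame({
    "plan": ["basic", "pro", "basic", "pro", "basic", "pro"],
    "churned": ["no", "no", "yes", "no", "yes", "yes"],
})

# Contingency table: rows are plans, columns are churn outcomes.
contingency = pd.crosstab(cc_demo["plan"], cc_demo["churned"])

# Grouped bars: one group per plan, one bar per churn outcome.
fig, ax = plt.subplots(figsize=(5, 3))
sns.countplot(data=cc_demo, x="plan", hue="churned", ax=ax)
fig.tight_layout()
```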

### Combinations of class with various predictors

## Numerical-categorical analysis (complete)
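
One possible approach, sketched on hypothetical data (`mixed_demo`): encode the categorical column as integer codes so that it can enter a single correlation matrix with the numeric columns, and compare the numeric variable across categories with a grouped mean. The integer codes are arbitrary for nominal categories, so correlations involving them should be read with caution.

```python
import pandas as pd

# Hypothetical toy data mixing one categorical and two numeric columns.
mixed_demo = pd.DataFrame({
    "segment": ["a", "b", "a", "c", "b", "a"],
    "income": [30, 55, 32, 80, 50, 28],
    "y": [0, 1, 0, 1, 1, 0],
})

# pd.factorize assigns integer codes in order of first appearance.
mixed_demo["segment_n"] = pd.factorize(mixed_demo["segment"])[0]

# Correlation matrix over all variables, now that every column is numeric.
full_corr = mixed_demo[["segment_n", "income", "y"]].corr()

# Mean of the numeric variable within each category.
group_means = mixed_demo.groupby("segment")["income"].mean()
```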

# Step 5: Feature engineering

Feature engineering is a process that involves the creation of new features (or variables) from existing ones to improve model performance. This may involve a variety of techniques, such as normalization, data transformation, and so on. The goal is to improve the accuracy of the model and/or reduce the complexity of the model, thus making it easier to interpret.

Although this could have been done in this step as it is part of the feature engineering, it is usually done before analyzing the variables, separating this process into a previous one and the one we are going to see next.

## Outlier analysis

An outlier is a data point that deviates significantly from the others. It is a value that is noticeably different from what would be expected given the general trend of the data. These outliers may be caused by errors in data collection, natural variations in the data, or they may be indicative of something significant, such as an anomaly or extraordinary event.

Descriptive analysis is a powerful tool for characterizing the data set: the mean, variance and quartiles provide useful information about each variable. The `describe()` method of a DataFrame calculates all these values in a single call.
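
A sketch of this analysis on a hypothetical column (`fare`). The 1.5 × IQR rule used to flag outliers is a common convention chosen here for illustration; the text above does not prescribe a specific rule.

```python
import pandas as pd

# Hypothetical column with one value far from the rest.
outlier_demo = pd.DataFrame({"fare": [7.0, 8.0, 8.5, 9.0, 10.0, 11.0, 12.0, 250.0]})

# count, mean, std, min, quartiles and max for every numeric column.
summary = outlier_demo.describe()

# Flag values beyond 1.5 * IQR from the quartiles (assumed convention).
q1 = outlier_demo["fare"].quantile(0.25)
q3 = outlier_demo["fare"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = outlier_demo[(outlier_demo["fare"] < lower) | (outlier_demo["fare"] > upper)]
```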

## Missing value analysis

A **missing** value is a space that has no value assigned to it in the observation of a specific variable. These types of values are quite common and can arise for many reasons. For example, there could be an error in data collection, someone may have refused to answer a question in a survey, or it could simply be that certain information is not available or not applicable.
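
A minimal sketch, assuming hypothetical columns `age` and `city`. It measures the share of missing values per column and then fills them with the median (numeric) or the most frequent value (categorical). These imputation choices are assumptions for the example; in practice the fill values are usually computed on the training split only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
missing_demo = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "city": ["S", "C", None, "S"],
})

# Fraction of missing values per column, largest first.
missing_share = missing_demo.isnull().mean().sort_values(ascending=False)

# Fill numeric gaps with the median and categorical gaps with the mode.
filled_demo = missing_demo.copy()
filled_demo["age"] = filled_demo["age"].fillna(filled_demo["age"].median())
filled_demo["city"] = filled_demo["city"].fillna(filled_demo["city"].mode()[0])
```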

## Inference of new features

Another typical use of this engineering is to obtain new features by "merging" two or more existing ones.
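
For illustration, a sketch with hypothetical columns `siblings` and `parents` merged into a single `family_members` feature:

```python
import pandas as pd

# Hypothetical toy data: two related counts per observation.
family_demo = pd.DataFrame({"siblings": [1, 0, 3], "parents": [0, 0, 2]})

# New feature obtained by adding the two existing columns.
family_demo["family_members"] = family_demo["siblings"] + family_demo["parents"]
```

Once the merged feature exists, the original columns can be kept or dropped depending on what the multivariate analysis shows.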

## Divide the set into train and test

In [None]:
from utils import split_my_data


# set independent and dependent variables
X: pd.DataFrame = fact_df.drop("y", axis = 1)
y: pd.Series = fact_df["y"]

# divide the dataset into training and test samples
X_train, X_test, y_train, y_test = split_my_data(X, y, test_size = 0.2, random_state = 42)
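
`split_my_data` is a project helper whose implementation is not shown here. As a point of comparison only, a similar split can be sketched with scikit-learn's `train_test_split`, using a hypothetical toy DataFrame (`split_demo`); whether the helper wraps this function is an assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with a target column named "y".
split_demo = pd.DataFrame({"x1": range(10), "x2": range(10, 20), "y": [0, 1] * 5})
X_demo = split_demo.drop("y", axis=1)
y_demo = split_demo["y"]

# 80% of the rows go to training and 20% to test; the seed makes it repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
```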

## Feature scaling

**Feature scaling** is a crucial step in data preprocessing for many Machine Learning algorithms. It is a technique that changes the range of data values so that they can be compared to each other.

### Feature Scaling with Scikit-learn (sklearn)

Scikit-learn (sklearn) provides several tools for feature scaling, each with its own characteristics and use cases. Here's a breakdown:

**1. StandardScaler:**

* **How it works:** Standardizes features by removing the mean and scaling to unit variance.
* **Formula:** `z = (x - u) / s`, where `u` is the mean and `s` is the standard deviation.
* **When to use:**
    * When your features are roughly Gaussian (normal). Standardization shifts and rescales the data but does not change the shape of its distribution.
    * When your model benefits from features centered around zero with unit variance (e.g., regularized linear regression, logistic regression, support vector machines).
* **Caution:** Sensitive to outliers.

**2. MinMaxScaler:**

* **How it works:** Scales features to a given range, usually between 0 and 1.
* **Formula:** `x_scaled = (x - x_min) / (x_max - x_min)`
* **When to use:**
    * When you need to keep the values within a specific range.
    * When you don't have many outliers.
    * When using algorithms that are sensitive to the magnitude of features (e.g., neural networks).
* **Caution:** Sensitive to outliers.

**3. RobustScaler:**

* **How it works:** Scales features using statistics that are robust to outliers (median and interquartile range).
* **Formula:** `x_scaled = (x - median) / IQR`, where `IQR` is the interquartile range.
* **When to use:**
    * When your data contains outliers.
    * When you want to reduce the impact of outliers on your scaling.
* **Caution:** Doesn't normalize data to a specific range.

**4. MaxAbsScaler:**

* **How it works:** Scales features by dividing each value by the maximum absolute value.
* **Formula:** `x_scaled = x / max(abs(x))`
* **When to use:**
    * When you have sparse data (data with many zero values).
    * When you want to preserve the sparsity of your data.
    * When you want to scale data to the range [-1, 1].
* **Caution:** Sensitive to outliers in the maximum absolute values.

**5. QuantileTransformer:**

* **How it works:** Transforms features to follow a uniform or normal distribution. It is a non-linear transformation.
* **When to use:**
    * When your data is far from Gaussian (e.g., heavily skewed or multimodal).
    * When you want to reduce the impact of outliers, since they are compressed into the same interval as the other values.
* **Caution:** Distorts correlations and distances.

**6. PowerTransformer:**

* **How it works:** Applies power transformations (Yeo-Johnson or Box-Cox) to make data more Gaussian-like.
* **When to use:**
    * When your data is skewed and you want to normalize its distribution.
    * When your model assumes a Gaussian distribution.
* **Caution:** Box-Cox can only be used with strictly positive data; Yeo-Johnson also accepts zero and negative values.
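
To make the differences concrete, the following sketch applies four of the scalers above to a hypothetical single column with one outlier (`values`). In a real workflow the scaler would be fitted on the training split only and then applied to the test split.

```python
import numpy as np
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    RobustScaler,
    StandardScaler,
)

# Hypothetical single feature; 100.0 plays the role of an outlier.
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Fit each scaler on the same column and keep the transformed result.
scaled = {
    "standard": StandardScaler().fit_transform(values),
    "minmax": MinMaxScaler().fit_transform(values),
    "robust": RobustScaler().fit_transform(values),
    "maxabs": MaxAbsScaler().fit_transform(values),
}

# Print the results side by side to see how the outlier affects each one.
for name, result in scaled.items():
    print(name, np.round(result.ravel(), 3))
```

With `MinMaxScaler` the outlier squeezes the remaining values close to 0, while `RobustScaler` keeps them spread around the median.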

**Key Considerations:**

* **Model Requirements:** The choice of scaler often depends on the requirements of your machine learning model. Some models are more sensitive to the scale of features than others.
* **Data Distribution:** Consider the distribution of your data (e.g., Gaussian, skewed, presence of outliers) when choosing a scaler.
* **Outliers:** If your data contains outliers, `RobustScaler` or `QuantileTransformer` are good choices.
* **Range Requirements:** If you need to scale data to a specific range (e.g., [0, 1] or [-1, 1]), use `MinMaxScaler` or `MaxAbsScaler`.
* **Pipelines and ColumnTransformer:** It is highly recommended to use scikit-learn's `Pipeline` and `ColumnTransformer` to apply the appropriate preprocessing to datasets that mix different kinds of columns (e.g., numerical and categorical).
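
A minimal sketch of that recommendation, assuming a hypothetical dataset with one numeric column (`age`) and one categorical column (`city`). The one-hot encoding of the categorical column is an illustrative choice, not something this notebook prescribes.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical train and test samples; "D" does not appear in training.
train_demo = pd.DataFrame({"age": [20.0, 30.0, 40.0, 50.0], "city": ["A", "B", "A", "C"]})
test_demo = pd.DataFrame({"age": [35.0], "city": ["D"]})

preprocessor = ColumnTransformer(
    transformers=[
        # Numeric columns go through a pipeline that standardizes them.
        ("num", Pipeline([("scale", StandardScaler())]), ["age"]),
        # Categories unseen during fit are encoded as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Fit on the training data only, then reuse the fitted transformer on test.
train_scaled = preprocessor.fit_transform(train_demo)
test_scaled = preprocessor.transform(test_demo)
```

Fitting on the training split and only transforming the test split keeps test-set statistics from leaking into the preprocessing.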

# Step 6: Feature selection

Feature selection is a process that involves selecting the most relevant features (variables) from our dataset to use in building a Machine Learning model, discarding the rest.

There are several reasons to include it in our exploratory analysis:

1. To simplify the model so that it is easier to understand and interpret.
2. To reduce the training time of the model.
3. To avoid overfitting by reducing the dimensionality of the model and minimizing noise and unnecessary correlations.
4. To improve model performance by removing irrelevant features.
 
In addition, there are several techniques for feature selection. Many of them are based on trained supervised or clustering models.

The sklearn library contains many of the best alternatives to perform it. One of the most commonly used tools for fast feature selection is `SelectKBest`. This class selects the k best features from our dataset according to a scoring function based on a statistical test, usually an ANOVA F-test (`f_classif`) or a chi-squared test (`chi2`).
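
A sketch of `SelectKBest` on hypothetical synthetic data. The value `k=3`, the `f_classif` scoring function and all variable names are assumptions made for the example; the selector is fitted on the training split and then applied to both splits.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Hypothetical classification data: 8 features, 3 of them informative.
X_arr, y_arr = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=42
)
X_fs = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])
X_fs_train, X_fs_test, y_fs_train, y_fs_test = train_test_split(
    X_fs, y_arr, test_size=0.2, random_state=42
)

# Score every feature with an ANOVA F-test and keep the 3 best.
selector = SelectKBest(score_func=f_classif, k=3)
selector.fit(X_fs_train, y_fs_train)
selected = X_fs_train.columns[selector.get_support()]

# Rebuild DataFrames so that the selected column names are preserved.
X_train_sel_demo = pd.DataFrame(
    selector.transform(X_fs_train), columns=selected, index=X_fs_train.index
)
X_test_sel_demo = pd.DataFrame(
    selector.transform(X_fs_test), columns=selected, index=X_fs_test.index
)
```

Note that `chi2` requires non-negative feature values, so it would not work on this synthetic data without rescaling.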

# Step 7: Save the data

In [None]:
from utils import X_TRAIN_PATH, X_TEST_PATH, Y_TRAIN_PATH, Y_TEST_PATH

# save the processed data to their corresponding files
X_train_sel.to_csv(path_or_buf=X_TRAIN_PATH, sep=',', index=False)
X_test_sel.to_csv(path_or_buf=X_TEST_PATH, sep=',', index=False)

y_train.to_csv(path_or_buf=Y_TRAIN_PATH, sep=',', index=False)
y_test.to_csv(path_or_buf=Y_TEST_PATH, sep=',', index=False)