In [None]:
# import the libraries to use
#import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt 

sns.set_theme()

# Step 1: Problem statement and data collection

The data is available at <https://raw.githubusercontent.com/4GeeksAcademy/linear-regression-project-tutorial/main/medical_insurance_cost.csv>. Each feature is described below:

1. age. Age of primary beneficiary **(numeric)**
2. sex. Gender of the primary beneficiary **(categorical)**
3. bmi. Body mass index **(numeric)**
4. children. Number of children/dependents covered by health insurance **(numeric)**
5. smoker. Is the person a smoker? **(categorical)**
6. region. Beneficiary's residential area in the U.S.: northeast, southeast, southwest, northwest **(categorical)**
7. charges. Health insurance premium **(numeric) (TARGET)**

In [None]:
from src.utils import load_data, ReadCsvParams, SaveCsvParams

file_path = '../data/raw/file_name.csv'
url = 'url_to_data.csv'
read_csv_params: ReadCsvParams = {'delimiter': ','}
save_csv_params: SaveCsvParams = {'sep': ','}

df: pd.DataFrame = load_data(file_path=file_path, url=url, read_csv_params=read_csv_params, save_csv_params=save_csv_params)

## Problem to solve:
Estimate, based on the physiological data of an insurer's customers, the premium (cost) that each person will have to pay. Build a linear regression model to predict the cost of insuring a person.

# Step 2: Exploration and data cleaning

## Dataframe information

Let's look at the data, its summary information, and a first view of its distribution.

In [None]:
# head of the dataframe
df.head()

In [None]:
# tail of the dataframe
df.tail()

In [None]:
# info of the dataframe
df.info()

In [None]:
# describe the dataframe
df.describe()

## Cols for the different types of data

In [None]:
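# numerical columns (including the target)
numerical_cols: list[str] = ['age', 'bmi', 'children', 'charges']

# categorical columns
categorical_cols: list[str] = ['sex', 'smoker', 'region']

# target variable
target: str = 'charges'

# features (every column except the target)
features: list[str] = [col for col in numerical_cols + categorical_cols if col != target]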

## Eliminate duplicates

This could be done here, or in the feature engineering step.
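
As a minimal sketch of this step, the snippet below counts and drops fully duplicated rows with pandas. The `toy` frame and its values are made up for illustration; in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Toy frame with the same columns as the dataset; the values are made up.
toy = pd.DataFrame({
    "age": [19, 19, 33],
    "sex": ["female", "female", "male"],
    "bmi": [27.9, 27.9, 22.7],
    "children": [0, 0, 1],
    "smoker": ["yes", "yes", "no"],
    "region": ["southwest", "southwest", "northwest"],
    "charges": [16884.92, 16884.92, 21984.47],
})

# Count fully duplicated rows (the first occurrence is not counted).
n_duplicates = int(toy.duplicated().sum())

# Drop the duplicates and rebuild a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Duplicated rows: {n_duplicates}")
print(f"Rows after removing duplicates: {len(deduped)}")
```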

## Eliminate irrelevant information

This could be done here, or in the feature engineering step.

# Step 3: Analysis of univariate variables

A **univariate variable** is a statistical term used to refer to a set of observations of a single attribute. Univariate analysis is therefore the column-by-column analysis of the DataFrame. To do this, we must distinguish whether a variable is categorical or numerical, as the analysis and the conclusions that can be drawn will be different.

## Analysis of categorical variables

A **categorical variable** is a type of variable that can be one of a limited number of categories or groups. These groups are often nominal (e.g., the color of a car: red, blue, black, etc., but none of these colors is inherently "greater" or "better" than the others) but can also be represented by finite numbers.

**To represent these types of variables we will use histograms.**

In [None]:
# let's remember the categorical data
print(f'Categorical columns: {categorical_cols}')
print(f'Amount of categorical columns: {len(categorical_cols)}')

In [None]:
fig, axis = plt.subplots(1, 3, figsize = (14, 4))

# Plot one histogram per categorical feature to see the count of each category.

# first col
sns.histplot(ax = axis[0], data = df, x = "sex")
# second col
sns.histplot(ax = axis[1], data = df, x = "smoker").set(ylabel = None)
# third col
sns.histplot(ax = axis[2], data = df, x = "region").set(ylabel = None)


# adjust the layout
plt.tight_layout()

# show the plot
plt.show()

### Analysis

Put here the analysis

## Analysis on numeric variables

A **numeric variable** is a type of variable that can take numeric values (integers, fractions, decimals, negatives, etc.) over a potentially unbounded range. A categorical variable that is encoded with numbers can also be treated as a numeric variable.

**They are usually represented using a histogram and a boxplot, displayed together.**

In [None]:
# let's remember the numerical data
print(f'Numerical columns: {numerical_cols}')
print(f'Amount of numerical columns: {len(numerical_cols)}')

In [None]:
_, axis = plt.subplots(2, 3, figsize = (20, 5), gridspec_kw={'height_ratios': [6, 1]})

# Plot a histogram (top row) and a box plot (bottom row) for each numerical
# feature to see its distribution and its possible outliers.

# first col
sns.histplot(ax = axis[0, 0], data = df, x = "age").set(xlabel = None)
sns.boxplot(ax = axis[1, 0], data = df, x = "age")
# second col
sns.histplot(ax = axis[0, 1], data = df, x = "bmi").set(xlabel = None, ylabel = None)
sns.boxplot(ax = axis[1, 1], data = df, x = "bmi")
# third col
sns.histplot(ax = axis[0, 2], data = df, x = "children").set(xlabel = None, ylabel = None)
sns.boxplot(ax = axis[1, 2], data = df, x = "children")


# adjust the layout
plt.tight_layout()

# show the plot
plt.show()

### Analysis

Do the breakdown of the distribution and skewness for each variable:

**1. For each variable**

* **Histogram:** Example - The distribution appears to be relatively uniform or slightly multimodal. There are no strong peaks or a clear bell shape.
* **Box Plot:** Example - The box plot shows a relatively symmetrical distribution with no apparent outliers.
* **Distribution:** Example - Not normally distributed. It looks more like a uniform distribution or possibly a multimodal distribution.
* **Skewness:** Example - Slightly right-skewed, but not strongly so.
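
To back the visual reading with a number, the skewness can also be computed directly. This is a minimal sketch on made-up values (the `toy` frame and its column names are illustrative only); in the notebook, `df[numerical_cols].skew()` would give the same statistic for the real columns.

```python
import pandas as pd

# Two made-up columns: one symmetric, one with a long right tail.
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 2, 2, 20],
})

# Sample skewness per column: close to 0 for symmetric data,
# positive when the right tail is longer, negative when the left tail is.
skewness = toy.skew()

print(skewness)
```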


# Step 4: Analysis of multivariate variables

After analyzing the features one by one, it is time to analyze them in relation to the target and to each other, in order to draw clearer conclusions about their relationships and to make decisions about how to process them.

Thus, if we would like to eliminate a variable because of a high number of null values or certain outliers, we must first check that it has no meaningful relationship with the target. For example, in a passenger-survival dataset, a variable such as Cabin may have many null values, yet we would have to ensure that there is no relationship between it and survival before eliminating it, since it could be highly significant for the model and strongly influence the prediction.

## Numerical-numerical analysis

When the two variables being compared have numerical data, the analysis is said to be numerical-numerical. 

**Scatter-plots and correlation analysis are used to compare two numerical columns.**
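
As a small illustration of what the correlation analysis measures, the sketch below computes Pearson correlations on made-up values. The `discount` column is hypothetical and exists only to show a negative relationship.

```python
import pandas as pd

# Made-up values: charges grows linearly with age, discount shrinks linearly.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "charges": [2000.0, 3000.0, 4000.0, 5000.0],
    "discount": [50.0, 40.0, 30.0, 20.0],  # hypothetical column
})

# Pearson correlation (the default of DataFrame.corr) ranges from -1 to 1.
corr = toy.corr()

print(corr.round(2))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little or no linear relationship. `DataFrame.corr` also accepts `method="spearman"` when a rank-based measure is preferred.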

In [None]:
# graphs of numerical vs numerical with histograms and scatter plots
g = sns.PairGrid(df[numerical_cols])
g.map_diag(sns.histplot)
g.map_offdiag(sns.scatterplot)

In [None]:
from src.draw_utils import draw_corr_matrix

# compute the correlation matrix of the numerical columns
corr = df[numerical_cols].corr()

# draw the correlation matrix
draw_corr_matrix(corr=corr)

### Analysis

Here analysis.

## Categorical-categorical analysis

When the two variables being compared have categorical data, the analysis is said to be categorical-categorical. 

**Histograms and combinations are used to compare two categorical columns.**
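
Besides plots, a contingency table gives the same combinations as numbers. The sketch below uses `pd.crosstab` on made-up values; in the notebook it could be applied to any pair of columns in `categorical_cols`.

```python
import pandas as pd

# Made-up observations of two categorical columns.
toy = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "yes", "yes", "no"],
})

# Absolute counts for every (sex, smoker) combination.
counts = pd.crosstab(toy["sex"], toy["smoker"])

# Row-wise proportions: the share of smokers within each sex.
shares = pd.crosstab(toy["sex"], toy["smoker"], normalize="index")

print(counts)
print(shares.round(2))
```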

### Factorize the dataframe

In [None]:
# get a copy of the dataframe to hold the factorized categorical data
fact_df = df.copy()

# factorize every categorical column (the target is numeric, so it is left untouched)
for col in categorical_cols:
    fact_df[col] = pd.factorize(fact_df[col])[0]
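
To make the encoding explicit, here is a minimal sketch of what `pd.factorize` returns, using made-up values.

```python
import pandas as pd

# Made-up column with one missing value.
smoker = pd.Series(["yes", "no", None, "yes"], name="smoker")

# codes: one integer per row, assigned in order of first appearance.
# uniques: the original categories, indexed by their code.
codes, uniques = pd.factorize(smoker)

print(codes)          # missing values are encoded as -1
print(list(uniques))
```

Because the codes follow the order of first appearance, they carry no natural ranking for nominal columns such as `region`, so correlations computed on them should be read with care.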

In [None]:
# see the factorized dataframe
fact_df.head()

In [None]:
# compute the correlation matrix of the factorized categorical columns
corr = fact_df[categorical_cols].corr()

# draw the correlation matrix
draw_corr_matrix(corr=corr)

### Combinations of class with various predictors

In [None]:
# let's remember the numerical data
print(f'Numerical columns: {numerical_cols}')
print(f'Amount of numerical columns: {len(numerical_cols)}')

In [None]:
# let's remember the categorical data
print(f'Categorical columns: {categorical_cols}')
print(f'Amount of categorical columns: {len(categorical_cols)}')

In [None]:
fig, axis = plt.subplots(2, 2, figsize = (15, 8))

# Plot histograms of each categorical feature, using another categorical column as hue.

# first row
# 	first col
sns.histplot(ax = axis[0, 0], data = df, x = "smoker", hue="sex")
# 	second col
sns.histplot(ax = axis[0, 1], data = df, x = "region", hue="sex").set(ylabel = None)

# second row
# 	first col
sns.histplot(ax = axis[1, 0], data = df, x = "sex", hue="smoker")
# 	second col
sns.histplot(ax = axis[1, 1], data = df, x = "region", hue="smoker").set(ylabel = None)


# adjust the layout
plt.tight_layout()

# show the plot
plt.show()

### Analysis

Here the analysis.

## Numerical-categorical analysis (complete)

Now do the analysis of the numerical vs categorical variables (factorized).

In [None]:
# pair-plot of all the data
sns.pairplot(data = fact_df)

In [None]:
# compute the correlation matrix of all the data
corr = fact_df.corr()

# draw the correlation matrix
draw_corr_matrix(corr=corr)

### Conclusion

Do the analysis of the relation of all the variables

***For example:*** We can see that the variables with the strongest correlation with the target **y** are:

Positive correlation:
1. variable 1
2. variable 2

Negative correlation:
1. variable 3

# Step 5: Feature engineering

***Feature engineering*** is a process that involves the creation of new features (or variables) from existing ones to improve model performance. This may involve a variety of techniques, such as normalization, data transformation, and so on. The goal is to improve the accuracy of the model and/or reduce the complexity of the model, thus making it easier to interpret.

Although the elimination of duplicates and irrelevant information could be done in this step, as it is part of feature engineering, it is usually done before analyzing the variables. The process is therefore split into that earlier stage and the one we are going to see next.

## Missing value analysis

A **missing** value is a space that has no value assigned to it in the observation of a specific variable. These types of values are quite common and can arise for many reasons. For example, there could be an error in data collection, someone may have refused to answer a question in a survey, or it could simply be that certain information is not available or not applicable.

### Treating Missing Values in Pandas DataFrames

Treating missing values in a variable within a Pandas DataFrame is a crucial step in data preprocessing. Here's a breakdown of common methods and when to use them:

**1. Removing Rows or Columns:**

* **Method:** Delete rows or columns containing missing values.
* **When to use:**
    * When missing values are a small percentage of the dataset.
    * When you're confident that removing the missing values won't introduce significant bias.
    * When a column has a very large number of missing values.
* **Caution:**
    * Can lead to significant data loss, especially if missing values are widespread.
    * May introduce bias if missing values are not randomly distributed.


**2. Imputation:**

* **Method:** Replace missing values with estimated values.
* **Common Imputation Methods:**
    * **Mean Imputation:** Replace missing values with the mean of the column.
    * **Median Imputation:** Replace missing values with the median of the column. (Robust to outliers)
    * **Mode Imputation:** Replace missing values with the mode (most frequent value) of the column. (For categorical data)
    * **Forward Fill:** Propagate the last valid observation forward.
    * **Backward Fill:** Propagate the next valid observation backward.
    * **Interpolation:** Estimate missing values based on surrounding values.
    * **Predictive Imputation:** Use machine learning models to predict missing values.
* **When to use:**
    * When you want to preserve the data and avoid data loss.
    * When missing values are likely due to random factors.
    * When you have time series data and forward/backward fill makes sense.
* **Caution:**
    * Can introduce bias if imputed values are not accurate.
    * Reduces variability in the data.
    * The mean is heavily influenced by outliers, so mean imputation is risky when outliers are present.


**3. Creating a Missing Value Indicator:**

* **Method:** Create a new binary column indicating whether a value is missing or not.
* **When to use:**
    * When the fact that a value is missing is itself informative.
    * When you want to preserve the missing information for your model.
* **Caution:**
    * Adds a new feature to your dataset.
    * May not be necessary if missing values are completely random.


**Choosing the Right Method:**

* Analyze the nature of missing values: Are they random or systematic?
* Consider the percentage of missing values: If it's high, imputation might be necessary.
* Think about the impact on your model: Some models can handle missing values better than others.
* Use domain knowledge: Your understanding of the data can guide your decision.
* Experiment: Try different methods and evaluate their impact on your analysis.
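
The methods above can be sketched with pandas as follows. The `toy` frame is made up, and the choice of median for the numeric column and mode for the categorical one is only an illustration, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value per column.
toy = pd.DataFrame({
    "bmi": [22.0, np.nan, 30.0, 26.0],
    "region": ["northeast", "southeast", None, "southeast"],
})

# Count the missing values per column before deciding what to do.
missing_counts = toy.isna().sum()

# Method 3: keep a binary indicator of where the value was missing.
toy["bmi_missing"] = toy["bmi"].isna().astype(int)

# Method 2: median imputation for the numeric column,
# mode imputation for the categorical column.
toy["bmi"] = toy["bmi"].fillna(toy["bmi"].median())
toy["region"] = toy["region"].fillna(toy["region"].mode()[0])

print(missing_counts)
print(toy)
```

Method 1 (removal) would instead be `toy.dropna()` for rows or `toy.dropna(axis=1)` for columns.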

In [None]:
# check for null values
fact_df.info()

**Explain what was done**

## Outlier analysis

An outlier is a data point that deviates significantly from the others. It is a value that is noticeably different from what would be expected given the general trend of the data. These outliers may be caused by errors in data collection, natural variations in the data, or they may be indicative of something significant, such as an anomaly or extraordinary event.

Descriptive analysis is a powerful tool for characterizing the data set: the mean, variance and quartiles provide powerful information about each variable. The describe() function of a DataFrame helps us to calculate in a very short time all these values.

In [None]:
# check the distribution again; we are not going to work on the outliers this time
fact_df.describe()

### Treating Outliers in Pandas DataFrames

There are several ways to treat outliers in a variable within a Pandas DataFrame. The best approach depends on the nature of your data, the extent of the outliers, and your specific analysis goals. Here's a breakdown of common methods:

**1. Removing Outliers:**

* **Method:** Filter out rows containing outliers based on a defined threshold.
* **When to use:**
    * When you're confident that the outliers are due to errors or anomalies.
    * When you have a large dataset and removing a few outliers won't significantly impact your analysis.
    * When you want to prevent outliers from skewing statistical measures.
* **Caution:**
    * Can lead to data loss.
    * May introduce bias if outliers are not random.


**2. Capping/Flooring Outliers:**

* **Method:** Replace outlier values with a predefined upper or lower limit.
* **When to use:**
    * When you want to preserve the data but reduce the impact of outliers.
    * When outliers are likely due to extreme but valid values.
* **Caution:**
    * Can distort the distribution of the data.
    * Requires careful selection of capping/flooring limits.


**3. Transforming Outliers:**

* **Method:** Apply mathematical transformations (e.g., log, square root, Box-Cox) to reduce the skewness caused by outliers.
* **When to use:**
    * When outliers are causing significant skewness in the data.
    * When your model assumes a normal distribution.
* **Caution:**
    * Can make data interpretation more complex.
    * Requires careful selection of transformation methods.


**4. Imputing Outliers:**

* **Method:** Replace outliers with estimated values (e.g., mean, median).
* **When to use:**
    * When you want to preserve the data and avoid data loss.
    * When the outliers are probably errors.
* **Caution:**
    * Can introduce bias if imputed values are not accurate.
    * Reduces variability in the data.

**5. Using Robust Scalers:**

* **Method:** Use scaling techniques that are less sensitive to outliers (e.g., `RobustScaler` from scikit-learn).
* **When to use:**
    * When you want to scale the data without removing or capping outliers.
    * When using models that are sensitive to feature scaling.
* **Caution:**
    * Doesn't remove or modify outliers; it only scales them.

**Choosing the Right Method:**

* Visualize your data: Use box plots, histograms, and scatter plots to identify outliers.
* Consider your model: Some models are more sensitive to outliers than others.
* Domain knowledge: Use your understanding of the data to determine the best approach.
* Experiment: Try different methods and evaluate their impact on your analysis.
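
As one possible way to apply methods 1 and 2, the sketch below flags outliers with the common 1.5 × IQR rule and then caps them. The values are made up, and the 1.5 factor is a convention rather than something prescribed by this notebook.

```python
import pandas as pd

# Made-up BMI values with one extreme observation.
bmi = pd.Series([18.0, 22.0, 25.0, 27.0, 30.0, 33.0, 60.0], name="bmi")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = bmi.quantile(0.25), bmi.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Boolean mask of the values outside the fences.
is_outlier = (bmi < lower) | (bmi > upper)

# Method 1: remove the outliers.
without_outliers = bmi[~is_outlier]

# Method 2: cap/floor the outliers at the fences instead of removing them.
capped = bmi.clip(lower=lower, upper=upper)

print(f"Fences: [{lower}, {upper}]")
print(f"Outliers found: {int(is_outlier.sum())}")
```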

**Explain what was done**

## Inference of new features

Another typical use of this engineering is to obtain new features by "merging" two or more existing ones.
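
A minimal sketch of this idea is shown below. The derived columns `is_obese` and `obese_smoker` are hypothetical names chosen for illustration; the BMI threshold of 30 is the usual cut-off for obesity.

```python
import pandas as pd

# Made-up observations.
toy = pd.DataFrame({
    "bmi": [22.5, 31.0, 35.2],
    "smoker": ["no", "yes", "no"],
})

# Hypothetical feature derived from one column: obesity flag.
toy["is_obese"] = (toy["bmi"] >= 30).astype(int)

# Hypothetical feature merging two columns: obese and smoker at the same time.
toy["obese_smoker"] = toy["is_obese"] * (toy["smoker"] == "yes").astype(int)

print(toy)
```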

**Explain what was done**

## Divide the set into train and test

In [None]:
from src.utils import split_my_data


# set independent and dependent variables
X: pd.DataFrame = fact_df.drop(target, axis = 1)
y: pd.Series = fact_df[target]

# divide the dataset into training and test samples
X_train, X_test, y_train, y_test = split_my_data(X, y, test_size = 0.2, random_state = 42)
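
The implementation of `split_my_data` is not shown in this notebook. Assuming it behaves like scikit-learn's `train_test_split` (an assumption, not something confirmed here), the split can be sketched on made-up data as follows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up features and target with 10 rows.
X = pd.DataFrame({
    "age": list(range(20, 30)),
    "bmi": [float(v) for v in range(20, 30)],
})
y = pd.Series(list(range(1000, 11000, 1000)), name="charges")

# 80% of the rows go to training and 20% to testing;
# random_state makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Train rows: {len(X_train)}, test rows: {len(X_test)}")
```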

## Feature scaling

**Feature scaling** is a crucial step in data preprocessing for many Machine Learning algorithms. It is a technique that changes the range of data values so that they can be compared to each other.

### Feature Scaling with Scikit-learn (sklearn)

Scikit-learn (sklearn) provides several tools for feature scaling, each with its own characteristics and use cases. Here's a breakdown:

**1. StandardScaler:**

* **How it works:** Standardizes features by removing the mean and scaling to unit variance.
* **Formula:** `z = (x - u) / s`, where `u` is the mean and `s` is the standard deviation.
* **When to use:**
    * When your data is approximately Gaussian (normal). Standardization shifts and rescales the values but does not change the shape of the distribution.
    * When your model benefits from features that are centered around zero with unit variance (e.g., regularized linear regression, logistic regression, support vector machines).
* **Caution:** Sensitive to outliers.

**2. MinMaxScaler:**

* **How it works:** Scales features to a given range, usually between 0 and 1.
* **Formula:** `x_scaled = (x - x_min) / (x_max - x_min)`
* **When to use:**
    * When you need to keep the values within a specific range.
    * When you don't have many outliers.
    * When using algorithms that are sensitive to the magnitude of features (e.g., neural networks).
* **Caution:** Sensitive to outliers.

**3. RobustScaler:**

* **How it works:** Scales features using statistics that are robust to outliers (median and interquartile range).
* **Formula:** `x_scaled = (x - median) / IQR`, where `IQR` is the interquartile range.
* **When to use:**
    * When your data contains outliers.
    * When you want to reduce the impact of outliers on your scaling.
* **Caution:** Doesn't normalize data to a specific range.

**4. MaxAbsScaler:**

* **How it works:** Scales features by dividing each value by the maximum absolute value.
* **Formula:** `x_scaled = x / max(abs(x))`
* **When to use:**
    * When you have sparse data (data with many zero values).
    * When you want to preserve the sparsity of your data.
    * When you want to scale data to the range [-1, 1].
* **Caution:** Sensitive to outliers in the maximum absolute values.

**5. QuantileTransformer:**

* **How it works:** Transforms features to follow a uniform or normal distribution. It is a non-linear transformation.
* **When to use:**
    * When your data is far from Gaussian (e.g., heavily skewed).
    * When you want to reduce the impact of outliers, which are compressed into a smaller interval.
* **Caution:** Distorts correlations and distances.

**6. PowerTransformer:**

* **How it works:** Applies power transformations (Yeo-Johnson or Box-Cox) to make data more Gaussian-like.
* **When to use:**
    * When your data is skewed, and you want to normalize its distribution.
    * When your model assumes a Gaussian distribution.
* **Caution:** Box-Cox can only be used with strictly positive data; Yeo-Johnson (the scikit-learn default) also accepts zero and negative values.

**Key Considerations:**

* **Model Requirements:** The choice of scaler often depends on the requirements of your machine learning model. Some models are more sensitive to the scale of features than others.
* **Data Distribution:** Consider the distribution of your data (e.g., Gaussian, skewed, presence of outliers) when choosing a scaler.
* **Outliers:** If your data contains outliers, `RobustScaler` or `QuantileTransformer` are good choices.
* **Range Requirements:** If you need to scale data to a specific range (e.g., [0, 1] or [-1, 1]), use `MinMaxScaler` or `MaxAbsScaler`.
* **Pipelines and ColumnTransformer:** It is highly recommended to use scikit-learn's `Pipeline` together with `ColumnTransformer` to handle datasets that mix different kinds of data.
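
As a minimal sketch of the `ColumnTransformer` recommendation, the snippet below standardizes only the numerical columns and passes the already encoded `smoker` column through unchanged. The data is made up. It also checks the `StandardScaler` formula by hand, using the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up train and test samples; smoker is already encoded as 0/1.
X_train = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "bmi": [22.0, 26.0, 30.0, 34.0],
    "smoker": [0, 1, 0, 1],
})
X_test = pd.DataFrame({"age": [35], "bmi": [28.0], "smoker": [1]})

numerical_features = ["age", "bmi"]

# Scale the numerical columns and leave the remaining columns as they are.
preprocess = ColumnTransformer(
    transformers=[("num", StandardScaler(), numerical_features)],
    remainder="passthrough",
)

# Learn the mean and standard deviation from the training data only,
# then reuse them to transform the test data.
train_scaled = preprocess.fit_transform(X_train)
test_scaled = preprocess.transform(X_test)

# Manual check of z = (x - u) / s for the age column.
age_mean = X_train["age"].mean()
age_std = X_train["age"].std(ddof=0)
manual_age = (X_train["age"] - age_mean) / age_std

print(train_scaled)
print(test_scaled)
```

The output columns follow the order of the transformers, with the passthrough columns last: `age`, `bmi`, `smoker`. The same `preprocess` object could be placed as the first step of a `Pipeline` so that scaling and model fitting happen together.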

In this case we are going to use a **StandardScaler** because the features do not have many outliers; the outliers are in the target.

In [None]:
# numerical columns without the target
numerical_features = ['age', 'bmi', 'children']

In [None]:
from sklearn.preprocessing import StandardScaler

# scaler instance
scaler = StandardScaler()

### scaling training data --------------------------

# create a copy of the train dataframe
X_train_scaled: pd.DataFrame = X_train.copy()
# fit the scaler on the training data only and scale just the numerical columns
X_train_scaled[numerical_features] = scaler.fit_transform(X_train_scaled[numerical_features])

### scaling testing data --------------------------

# create a copy of the test dataframe
X_test_scaled: pd.DataFrame = X_test.copy()
# reuse the statistics learned from the training data (no re-fitting) to avoid data leakage
X_test_scaled[numerical_features] = scaler.transform(X_test_scaled[numerical_features])

In [None]:
# print the head of the x train data
X_train_scaled.head()

In [None]:
# print the head of the x test data
X_test_scaled.head()

# Step 6: Feature selection

Feature selection is a process that involves selecting the most relevant features (variables) from our dataset to use in building a Machine Learning model, discarding the rest.

There are several reasons to include it in our exploratory analysis:

1. To simplify the model so that it is easier to understand and interpret.
2. To reduce the training time of the model.
3. To avoid overfitting by reducing the dimensionality of the model and minimizing noise and unnecessary correlations.
4. To improve model performance by removing irrelevant features.

In addition, there are several techniques for feature selection. Many of them are based on trained supervised or clustering models.

The sklearn library contains many of the best alternatives to perform it. One of the most commonly used tools for fast and successful feature selection is `SelectKBest`. This class selects the k best features from our dataset based on the scores of a statistical test. For classification targets this test is usually an ANOVA F-test (`f_classif`, the default) or a chi-square test (`chi2`); for a continuous target such as `charges`, a regression score such as `f_regression` is the appropriate choice.
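
To show how `SelectKBest` behaves, here is a minimal sketch on synthetic data. The column names `signal`, `noise_a`, and `noise_b` are made up, and `f_regression` is used as the score function because the synthetic target is continuous.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: the target depends on "signal" only; the other columns are noise.
rng = np.random.default_rng(42)
n_rows = 200
X = pd.DataFrame({
    "signal": rng.normal(size=n_rows),
    "noise_a": rng.normal(size=n_rows),
    "noise_b": rng.normal(size=n_rows),
})
y = 3.0 * X["signal"] + rng.normal(scale=0.1, size=n_rows)

# Keep the single feature with the highest univariate F-score.
selector = SelectKBest(score_func=f_regression, k=1)
selector.fit(X, y)

# get_support() returns a boolean mask over the input columns.
selected = X.columns[selector.get_support()].tolist()

print(f"Selected features: {selected}")
```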

In [None]:
from sklearn.feature_selection import SelectKBest, f_regression

# create the selection model; f_regression is used because the target is continuous
# (the default score function, f_classif, is meant for classification targets).
# k = 'all' keeps every feature; set an integer k to keep only the k best ones.
selection_model = SelectKBest(score_func = f_regression, k = 'all')
# fit the model to the train scaled data
selection_model.fit(X_train_scaled, y_train)
# get the boolean mask of the selected columns
ix = selection_model.get_support()

# get the dataframe of the train selected columns
X_train_sel = pd.DataFrame(selection_model.transform(X_train_scaled), columns = X_train_scaled.columns.values[ix]) # type: ignore

# get the dataframe of the test selected columns
X_test_sel = pd.DataFrame(selection_model.transform(X_test_scaled), columns = X_test_scaled.columns.values[ix]) # type: ignore

In [None]:
# print the selected features on the training data
X_train_sel.head()

In [None]:
# print the selected features on the test data
X_test_sel.head()

## Conclusion

We can see that the selected features differ from the ones with the best correlation.

# Step 7: Save the data

In [None]:
from src.constants import X_TRAIN_PATH, X_TEST_PATH, Y_TRAIN_PATH, Y_TEST_PATH

# save the processed data to their corresponding files
X_train_sel.to_csv(path_or_buf = X_TRAIN_PATH, sep = ',', index = False)
X_test_sel.to_csv(path_or_buf = X_TEST_PATH, sep = ',', index = False)

y_train.to_csv(path_or_buf = Y_TRAIN_PATH, sep = ',', index = False)
y_test.to_csv(path_or_buf = Y_TEST_PATH, sep = ',', index = False)