<a id = "toc"></a>

## Table of Contents
* [1. Import the Libraries](#chapter1)
* [2. Import the Datasets](#chapter2)
* [3. Explore the Dataset](#chapter3)
    * [3.1. Basic Statistics](#section_3_1)
    * [3.2. Inconsistencies](#section_3_2)
        * [3.2.1. Checking Combinations of Code](#sub_section_3_2_1)
        * [3.2.2. Handling Covid-19 Inconsistencies](#sub_section_3_2_2)
        * [3.2.3. Handling Average Weekly Wage Inconsistencies](#sub_section_3_2_3)
        * [3.2.4. Handling Birth Year Inconsistencies](#sub_section_3_2_4)
        * [3.2.5. Age at Injury vs. Birth Year](#sub_section_3_2_5)
        * [3.2.6. Age at Injury](#sub_section_3_2_6)
        * [3.2.7. First Hearing Date vs. Accident Date](#sub_section_3_2_7)
        * [3.2.8. C2 Date vs. C3 Date vs. Accident Date](#sub_section_3_2_8)
        * [3.2.9. Assembly Date vs. Accident Date](#sub_section_3_2_9)
        * [3.2.10. Handling ZIP Code Format](#sub_section_3_2_10)
        * [3.2.11. Overview of Inconsistencies](#sub_section_3_2_11)
* [4. Visual Exploration](#chapter4)
    * [4.1. Univariate Plots](#section_3_3)  
        * [4.1.1. Continuous Variables](#sub_section_4_1_1)
        * [4.1.2. Categorical Variables](#sub_section_4_1_2)
        * [4.1.3. Discrete Variables](#sub_section_4_1_3)
        * [4.1.4. Binary Variables](#sub_section_4_1_4)
    * [4.2. Multivariate Analysis](#section_4_2)  
* [5. Save Dataset for Preprocessing](#chapter5)


# 1. Import the Libraries <a class="anchor" id="chapter1"></a>

# 2. Load and Prepare Datasets <a class="anchor" id="chapter2"></a>

Inconsistencies

In [None]:
# Replace negative code (-9) with 101 and update description to 'Nonclassifiable'
df_train.loc[df_train['WCIO Part Of Body Code'] < 0, 'WCIO Part Of Body Code'] = 101
df_train.loc[df_train['WCIO Part Of Body Code'] == 101, 'WCIO Part Of Body Description'] = 'Nonclassifiable'

In [None]:
# Drop rows with missing values in the target column ('Claim Injury Type')
df_train_cleaned = df_train_cleaned.dropna(subset=['Claim Injury Type'])

# Data Exploration

Basic plots

Continuous variables

In [None]:
# Filter out rows where 'Average Weekly Wage' is zero or at/above the 95th percentile (extreme values)
filtered_df = df[(df['Average Weekly Wage'] > 0) & (df['Average Weekly Wage'] < df['Average Weekly Wage'].quantile(0.95))]

# Set figure size and plot pairplot
g = sns.pairplot(filtered_df, vars=['Average Weekly Wage'], hue='Claim Injury Type', palette='Set2', height=5, aspect=1.5)

# Add a title to the pairplot
g.fig.suptitle("Pairplot between continuous variable and target (filtered data)", y=1.02)

# Move the legend that pairplot already created to the outside of the plot
# (passing labels manually to plt.legend can mismatch the hue order)
sns.move_legend(g, loc='upper left', bbox_to_anchor=(1.05, 1), fontsize='medium')

plt.show()

In [None]:
plt.figure(figsize=(10, 6))  # Increased figure size for better visibility
sns.boxplot(data=filtered_df, x='Claim Injury Type', y='Average Weekly Wage', palette='Set2')
plt.title('Distribution of Average Weekly Wage by Claim Injury Type')
plt.xlabel('Claim Injury Type')  # Adding labels for better clarity
plt.ylabel('Average Weekly Wage')
plt.xticks(rotation=45)
plt.show()

Categorical and discrete variables

In [None]:
# Categorical columns: plot frequency distributions with the target as hue
# `categorical_columns` was defined in an earlier cell.

# Loop through categorical columns and create separate figures
for column in categorical_columns:
    plt.figure(figsize=(12, 8))
    
    # Plot using seaborn to include hue (target column)
    ax = sns.countplot(data=df, x=column, hue='Claim Injury Type', palette='viridis', order=df[column].value_counts().iloc[:10].index)
    
    # Set title and labels
    plt.title(f'Top 10 Most Frequent Values in {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.xticks(rotation=45)
    
    # Add count labels above each bar
    for p in ax.patches:
        height = p.get_height()
        if height > 0:
            ax.annotate(f'{int(height)}', (p.get_x() + p.get_width() / 2., height),
                        ha='center', va='bottom', fontsize=8, color='black')

    # Show the plot
    plt.tight_layout()
    plt.show()

Check binary variables

In [None]:
# Define the columns to exclude
exclude_columns = ['Zip Code', 'Carrier Name']  # These columns have too many distinct values to plot, so we exclude them.

# Get binary variables from the dataframe, skipping the excluded columns
binary_vars = [col for col in binary_columns if col not in exclude_columns]

# Get categorical variables excluding the specified columns
categorical_vars = [col for col in categorical_columns if col not in exclude_columns]

# Generate count plots for binary and categorical variables
plot_count_for_binary_and_categorical(df, binary_vars, categorical_vars)

In [None]:
def plot_count_for_binary_and_discrete(data, binary_vars, discrete_vars):
    for discrete_var in discrete_vars:
        print(f"Discrete Variable: {discrete_var}\n")  # Print the discrete variable being plotted
        
        # Loop through each binary variable
        for binary_var in binary_vars:
            plt.figure(figsize=(16, 8))  # Increase figure size for better clarity
            
            # Create the count plot with aesthetics for categorical variables
            ax = sns.countplot(
                data=data, 
                x=binary_var, 
                hue=discrete_var, 
                palette="muted", 
                order=data[binary_var].value_counts().index,  # Order categories by frequency
                linewidth=0.5, 
                edgecolor="gray"
            )
            
            # Add annotations to display counts on top of the bars
            for container in ax.containers:
                ax.bar_label(container, fmt='%d', label_type='edge', fontsize=8, padding=3)
            
            # Improve titles and labels for clarity
            plt.title(f"Distribution of {discrete_var} by {binary_var}", fontsize=14, fontweight='bold')  
            plt.xlabel(binary_var, fontsize=12)  
            plt.ylabel("Count", fontsize=12)  
            plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for readability
            
            # Optimize layout to avoid overlapping elements
            plt.tight_layout()  
            plt.legend(title=discrete_var, loc='upper right')  # Add legend
            
            plt.show()  # Display the plot

In [None]:
def plot_top_feature_with_proportion(X, y, feature, target_name='Claim Injury Type', top_n=10):

    import matplotlib.pyplot as plt
    import seaborn as sns
    import pandas as pd
    sns.set(style="darkgrid")
    
    # Combine feature and target into a new DataFrame
    data = X[[feature]].copy()
    data[target_name] = pd.Series(y, index=X.index)  # Ensure alignment
    
    # Filter the top N most frequent values for the feature
    top_values = data[feature].value_counts().head(top_n).index
    filtered_data = data[data[feature].isin(top_values)]
    
    # Calculate proportions of target values for each feature value
    # (transform keeps the original index, so reset_index works across pandas versions)
    counts = filtered_data.groupby([feature, target_name]).size()
    proportion_data = (
        (counts / counts.groupby(level=0).transform('sum'))
        .reset_index(name='Proportion')
    )
    
    # Pivot for stacked bar plot
    pivot_data = proportion_data.pivot(index=feature, columns=target_name, values='Proportion').fillna(0)
    pivot_data = pivot_data.loc[top_values]  # Ensure correct order
    
    # Plot the stacked bar plot
    fig, ax = plt.subplots(figsize=(18, 12))
    pivot_data.plot(kind='bar', stacked=True, ax=ax, colormap='viridis')
    
    # Annotate the bars with percentage values
    for i, bar_group in enumerate(ax.containers):  # Loop through stacked bars
        for bar in bar_group:
            height = bar.get_height()
            if height > 0:  # Only annotate non-zero segments
                ax.annotate(f'{height:.0%}',  # Convert to percentage
                            (bar.get_x() + bar.get_width() / 2, bar.get_y() + height / 2),
                            ha='center', va='center', fontsize=13, color='white', weight='bold')
    
    # Final plot adjustments
    plt.title(f"Top {top_n} Most Frequent {feature} with {target_name} Proportions")
    plt.xlabel(feature)
    plt.ylabel("Proportion")
    plt.legend(title=target_name, bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.show()

# Feature Engineering & Encoding Notebook

Create bins

In [None]:
# Defining the bins and labels for categorizing income based on percentiles
income_bins = [0, 876.0, 1125.0, 1269.0, 1666.0, float('inf')]  # float('inf') allows us to set an open-ended range
income_labels = ['Low Income', 'Lower-Middle Income', 'Middle Income', 'Upper-Middle Income', 'High Income']

# Creating the new feature for income categories for the train set
X_train['Income_Category'] = pd.cut(X_train['Average Weekly Wage'], bins=income_bins, labels=income_labels)

# Apply to the val set
X_val['Income_Category'] = pd.cut(X_val['Average Weekly Wage'], bins=income_bins, labels=income_labels)

# Apply to the test set
df_test['Income_Category'] = pd.cut(df_test['Average Weekly Wage'], bins=income_bins, labels=income_labels)
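The bin edges above are described as percentile-based. As a minimal sketch of how such edges might be derived, the example below computes them from the training split only and reuses them unchanged on another split. The toy wages and the 20/40/60/80 percentile choice are assumptions for illustration, not the notebook's actual values.

```python
import pandas as pd

# Toy wages (hypothetical values, for illustration only)
train_wage = pd.Series([500, 800, 900, 1100, 1200, 1300, 1500, 1700, 2000, 2500], dtype=float)
val_wage = pd.Series([0.0, 850.0, 1250.0, 3000.0])

# Interior edges taken from training percentiles (assumed here: quintiles)
inner_edges = train_wage.quantile([0.2, 0.4, 0.6, 0.8]).tolist()
income_bins = [0] + inner_edges + [float('inf')]
income_labels = ['Low Income', 'Lower-Middle Income', 'Middle Income',
                 'Upper-Middle Income', 'High Income']

# Reuse the training edges on the validation split so no validation
# information leaks into the feature definition
val_category = pd.cut(val_wage, bins=income_bins, labels=income_labels)
print(val_category.tolist())
```

Because `pd.cut` uses right-closed intervals by default, a wage of exactly 0 falls outside the first bin `(0, edge]` and becomes `NaN`. Passing `include_lowest=True` would place it in the first bin instead, if that is the intended behavior.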

Encoding
- label_encoder = LabelEncoder()
- Frequency encoding and one-hot encoding (OHE)
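A minimal sketch of the three encodings listed above, fitted on the training split and applied to a validation split. The column names and values are hypothetical, and which columns receive which encoding in the actual notebook is not specified here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical column names and values)
X_train = pd.DataFrame({'County': ['A', 'B', 'A', 'C', 'A', 'B'],
                        'Gender': ['M', 'F', 'F', 'M', 'M', 'F']})
X_val = pd.DataFrame({'County': ['A', 'D'], 'Gender': ['F', 'M']})
y_train = pd.Series(['MED', 'TEMP', 'MED', 'NON', 'TEMP', 'MED'])

# Label encoding for the target: classes are sorted, then mapped to 0..n-1
label_encoder = LabelEncoder()
y_train_enc = label_encoder.fit_transform(y_train)

# Frequency encoding: replace each category by its share in the training data;
# categories unseen in training (here 'D') are assumed to map to 0
freq_map = X_train['County'].value_counts(normalize=True)
X_train['County_freq'] = X_train['County'].map(freq_map)
X_val['County_freq'] = X_val['County'].map(freq_map).fillna(0)

# One-hot encoding for a low-cardinality column; validation columns are
# aligned to the training columns so both splits share the same layout
train_ohe = pd.get_dummies(X_train['Gender'], prefix='Gender')
val_ohe = (pd.get_dummies(X_val['Gender'], prefix='Gender')
           .reindex(columns=train_ohe.columns, fill_value=0))

print(list(label_encoder.classes_), y_train_enc.tolist())
print(X_val['County_freq'].tolist())
print(val_ohe)
```

Frequency encoding keeps high-cardinality columns as a single numeric feature, while one-hot encoding is usually only practical when the number of categories is small.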

Should we include binary features? Spearman Correlation Matrix for Numerical and Binary Features

Choosing based on interpretability could be one criterion.

Choose based on which feature is the original one?
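A minimal sketch of a Spearman correlation matrix over numerical and binary features, followed by a listing of the highly correlated pairs from which one feature would have to be chosen. The synthetic columns and the 0.8 cut-off are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
wage = rng.lognormal(mean=7, sigma=0.5, size=n)

# Synthetic features (hypothetical names)
df_num = pd.DataFrame({
    'wage': wage,
    'wage_log': np.log(wage),            # monotone transform of 'wage'
    'age': rng.integers(18, 65, size=n),
    'flag': rng.integers(0, 2, size=n),  # binary feature
})

# Spearman works on ranks, so it captures any monotone relationship
corr = df_num.corr(method='spearman')

# Keep only the upper triangle so each pair appears once
threshold = 0.8  # assumed cut-off
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs.abs() > threshold]
print(high_pairs)
```

Here `wage` and `wage_log` are flagged because a monotone transform preserves ranks. Deciding which of the two to keep (the more interpretable one, or the original one) is the judgment call raised in the notes above.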

### LASSO Regression <a class="anchor" id="sub_section_4_2_1"></a>

Do we still use it when the relationship is not linear?


LASSO (Least Absolute Shrinkage and Selection Operator) regression is used here for feature selection: a model is fitted to the standardized dataset and its coefficients are analyzed, since the L1 penalty shrinks the coefficients of less useful features toward zero.
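A minimal sketch of LASSO-based selection on synthetic data, assuming a continuous target that depends linearly on two of five features. The feature names, the data, and `alpha=0.1` are illustrative assumptions; the notebook's own target is a class label and its `alpha` is not stated here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: only f0 and f3 drive the target (hypothetical setup)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f'f{i}' for i in range(5)])
y = 3 * X['f0'] - 2 * X['f3'] + rng.normal(scale=0.1, size=200)

# Standardize so the L1 penalty treats all features on the same scale
X_scaled = StandardScaler().fit_transform(X)

# Fit LASSO; alpha controls how strongly coefficients are shrunk (assumed value)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

# Features with a non-zero coefficient are the ones LASSO keeps
coef = pd.Series(lasso.coef_, index=X.columns)
selected = coef[coef.abs() > 1e-8].index.tolist()
print(coef.round(3))
print('Selected features:', selected)
```

LASSO only measures linear association between each standardized feature and the target, so a feature with a strong non-linear effect could still receive a zero coefficient.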

### Recursive Feature Elimination - RFE <a class="anchor" id="sub_section_4_2_2"></a>


RFE is employed here to further validate the important features as identified by LASSO. By sequentially removing the least important features, RFE helps to refine the feature set.

The features selected by RFE likely overlap with those identified by LASSO, which would suggest consistent feature importance.
Using both LASSO and RFE makes the selection more robust, because each method cross-checks the other's ranking of individual features.

This block of code performs RFE to identify the best subset of features by iterating over a range of feature numbers. The code aims to maximize model performance on the validation set.
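The search described above can be sketched as follows on synthetic data. The dataset, the choice of `LogisticRegression`, and the range of feature counts are assumptions for illustration, not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic classification data with 3 informative features out of 8
X, y = make_classification(n_samples=400, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

scores = {}
for n in range(1, X.shape[1] + 1):
    # RFE repeatedly drops the weakest feature until n remain
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=n)
    rfe.fit(X_train, y_train)

    # RFE.predict reduces X to the selected features, then uses the
    # estimator that was fitted on those features
    y_val_pred = rfe.predict(X_val)
    scores[n] = f1_score(y_val, y_val_pred, average='macro')

# Pick the feature count with the highest validation F1 Macro
best_n = max(scores, key=scores.get)
print(f"Best number of features: {best_n} (F1 Macro = {scores[best_n]:.4f})")
```

Selecting the feature count on the validation set means the reported score is optimistic for that set, so a final estimate would need data not used in this loop.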

### Feature Importance - Decision Tree <a class="anchor" id="sub_section_4_2_3"></a>
### Feature Importance - Random Forest <a class="anchor" id="sub_section_4_2_4"></a>

In [None]:

# Find the optimal number of features using Recursive Feature Elimination (RFE), evaluated with the F1 Macro score.
def find_optimal_features_with_rfe(model, X_train, y_train, X_val, y_val, max_features=8):
    """
    Finds the optimal number of features using Recursive Feature Elimination (RFE) 
    and evaluates using F1 Macro score.
    
    Parameters:
    - model: The machine learning model (e.g., LogisticRegression()).
    - X_train: Scaled training feature set (numpy array or DataFrame).
    - y_train: Encoded training target labels.
    - X_val: Scaled validation feature set (numpy array or DataFrame).
    - y_val: Encoded validation target labels.
    - max_features: Maximum number of features to evaluate (default=8).

    Returns:
    - best_features: Optimal number of features for the highest F1 Macro score.
    - best_score: The highest F1 Macro score achieved.
    - scores_list: List of F1 Macro scores for each number of features.
    """
    nof_list = np.arange(1, max_features + 1)
    high_score = 0
    best_features = 0
    scores_list = []

    for n in nof_list:
        rfe = RFE(model, n_features_to_select=n)
        
        # Transform training and validation sets with RFE
        X_train_rfe = rfe.fit_transform(X_train, y_train)
        X_val_rfe = rfe.transform(X_val)
        
        # Fit the model
        model.fit(X_train_rfe, y_train)
        
        # Predict on the validation set
        y_val_pred = model.predict(X_val_rfe)
        
        # Calculate F1 Macro score
        score = f1_score(y_val, y_val_pred, average='macro')
        scores_list.append(score)
        
        if score > high_score:
            high_score = score
            best_features = n
    
    print(f"Optimum number of features: {best_features}")
    print(f"F1 Macro Score with {best_features} features: {high_score:.6f}")
    
    return best_features, high_score, scores_list

# Compare and plot Decision Tree feature importances (Gini vs. Entropy).
def compare_feature_importances(X_train, y_train, figsize=(13, 5)):
    """
    Compares feature importances using Gini and Entropy criteria in a Decision Tree Classifier 
    and visualizes the results in a bar plot.

    Parameters:
    - X_train: Training feature set (DataFrame or array with column names).
    - y_train: Target labels for training.
    - figsize: Tuple specifying the figure size for the plot (default=(13, 5)).

    Returns:
    - zippy: DataFrame containing feature importances for Gini and Entropy.
    """
    # Calculate feature importances using Gini and Entropy criteria
    gini_importance = DecisionTreeClassifier().fit(X_train, y_train).feature_importances_
    entropy_importance = DecisionTreeClassifier(criterion='entropy').fit(X_train, y_train).feature_importances_
    
    # Create a DataFrame to store and organize the feature importances
    zippy = pd.DataFrame(zip(gini_importance, entropy_importance), columns=['gini', 'entropy'])
    zippy['col'] = X_train.columns  # Add column names
    
    # Melt the DataFrame for easier plotting with Seaborn
    tidy = zippy.melt(id_vars='col').rename(columns=str.title)
    tidy.sort_values(['Value'], ascending=False, inplace=True)
    
    # Plot the feature importances
    plt.figure(figsize=figsize)
    sns.barplot(y='Col', x='Value', hue='Variable', data=tidy)
    plt.title("Feature Importances: Gini vs Entropy")
    plt.xlabel("Importance")
    plt.ylabel("Feature")
    plt.legend(title="Criterion")
    plt.show()
    
    return zippy

# Compare feature importances under the Gini and Entropy criteria in a Random Forest Classifier and plot the results as a bar plot.
def compare_rf_feature_importances(X_train, y_train, figsize=(13, 5), random_state=42):
    """
    Compares feature importances using Gini and Entropy criteria in a Random Forest Classifier 
    and visualizes the results in a bar plot.

    Parameters:
    - X_train: Training feature set (DataFrame or array with column names).
    - y_train: Target labels for training.
    - figsize: Tuple specifying the figure size for the plot (default=(13, 5)).
    - random_state: Random state for reproducibility (default=42).

    Returns:
    - importances: DataFrame containing feature importances for Gini and Entropy.
    """
    # Calculate feature importances using Gini and Entropy criteria
    gini_importance = RandomForestClassifier(random_state=random_state).fit(X_train, y_train).feature_importances_
    entropy_importance = RandomForestClassifier(criterion='entropy', random_state=random_state).fit(X_train, y_train).feature_importances_
    
    # Create a DataFrame to store and organize the feature importances
    importances = pd.DataFrame({
        'gini': gini_importance,
        'entropy': entropy_importance,
        'col': X_train.columns
    })
    
    # Melt the DataFrame for easier plotting with Seaborn
    tidy = importances.melt(id_vars='col').rename(columns=str.title)
    tidy.sort_values(['Value'], ascending=False, inplace=True)
    
    # Plot the feature importances
    plt.figure(figsize=figsize)
    sns.barplot(y='Col', x='Value', hue='Variable', data=tidy)
    plt.title("Random Forest Feature Importances: Gini vs Entropy")
    plt.xlabel("Importance")
    plt.ylabel("Feature")
    plt.legend(title="Criterion")
    plt.show()
    
    return importances


# Select the top-k features by Chi-square score.
def select_high_score_features_chi2_no_model(X_train, y_train, threshold=25):
    """
    Performs Chi-square test to select top k features based on scores.

    Parameters:
    - X_train: Training feature set (DataFrame).
    - y_train: Training target labels.
    - threshold: Number of top features to select (default=25).

    Returns:
    - high_score_features: List of top feature names based on Chi-square scores.
    - scores: Corresponding Chi-square scores of the selected features.
    """
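    # Compute Chi-square scores for all features (chi2 requires non-negative values);
    # calling chi2 directly avoids SelectKBest complaining when threshold exceeds the feature count
    feature_scores, _ = chi2(X_train, y_train)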

    # Select top features
    high_score_features = []
    scores = []
    for score, f_name in sorted(zip(feature_scores, X_train.columns), reverse=True)[:threshold]:
        high_score_features.append(f_name)
        scores.append(score)

    print(f"Top {threshold} features based on Chi-square scores:", high_score_features)
    print("Corresponding Chi-square scores:", scores)

    return high_score_features, scores

# Select the top-k features by mutual information (MIC) score.
def select_high_score_features_MIC(X_train, y_train, threshold=25, random_state=42):
    """
    Selects the top features based on Mutual Information Criterion (MIC).

    Parameters:
    - X_train: Training feature set (DataFrame or array).
    - y_train: Training target labels (array-like).
    - threshold: Number of top features to select (default=25).
    - random_state: Random state for reproducibility (default=42).

    Returns:
    - high_score_features: List of top feature names based on MIC scores.
    - scores: Corresponding MIC scores of the selected features.
    """
    # Calculate MIC scores
    feature_scores = mutual_info_classif(X_train, y_train, random_state=random_state)

    # Select top features based on MIC scores
    high_score_features = []
    scores = []
    for score, f_name in sorted(zip(feature_scores, X_train.columns), reverse=True)[:threshold]:
        high_score_features.append(f_name)
        scores.append(score)

    print(f"Top {threshold} features based on MIC scores:", high_score_features)
    print("Corresponding MIC scores:", scores)

    return high_score_features, scores


### **Categorical Feature Selection Results**

The following table summarizes the decisions for each categorical feature based on **Mutual Information (MIC)** and **Chi-Squared** (χ²) results. The retained features will be used in subsequent modeling to enhance predictive performance.
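A minimal sketch of how the two scores could be placed side by side for encoded categorical features. The synthetic features (`related`, `noise`) and the integer encoding are assumptions for illustration and do not reproduce the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2, mutual_info_classif

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 3, size=n)  # three target classes

# Integer-encoded categorical features (hypothetical)
X_cat = pd.DataFrame({
    'related': (y + rng.integers(0, 2, size=n)) % 3,  # depends on the target
    'noise': rng.integers(0, 3, size=n),              # independent of the target
})

# chi2 expects non-negative feature values and returns (scores, p-values)
chi2_scores, chi2_pvalues = chi2(X_cat, y)

# discrete_features=True tells the estimator the columns are categorical
mi_scores = mutual_info_classif(X_cat, y, discrete_features=True, random_state=0)

summary = pd.DataFrame(
    {'chi2': chi2_scores, 'p_value': chi2_pvalues, 'MI': mi_scores},
    index=X_cat.columns,
)
print(summary)
```

A feature that scores high on both measures is a strong candidate to keep, while one that scores low on both (like `noise` here) is a candidate to drop.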

# Modeling <a class="anchor" id="chapter5"></a>


Logistic Regression

KNN

RandomForest

Grid Search

Gradient Boosting



____