
# Credit Card Churn Analysis
---  

## Project Description 
In this project, I work with a dataset from **Kaggle** to develop a churn analysis. The goal is to identify why customers of a banking institution abandon its credit card service. Once these causes are understood, machine learning models will be developed to predict which customers are likely to leave. Based on these predictions, I will propose actions to prevent or reverse the churn of these customers.  

---  

### CRISP-DM Methodology  
The project will follow the CRISP-DM (*Cross-Industry Standard Process for Data Mining*) framework:  

| **Stage** | **Objective** | **Key Actions** |  
|-----------|---------------|------------------|  
| **1. Business Understanding** | Define the impact of churn prediction on customer retention. | - Identify costs of false negatives.<br>- Align metrics with business KPIs. |  
| **2. Data Understanding** | Analyze data structure, quality, and variable relationships. | - Exploratory Data Analysis (EDA).<br>- Outlier and correlation detection. |  
| **3. Data Preparation** | Prepare data for model training. | - Split training and test data.<br>- Remove redundant variables. |  
| **4. Modeling** | Train and compare classical models and neural networks. | - Random Forest/Logistic Regression (baseline).<br>- PyTorch neural network (focus on generalization). |  
| **5. Evaluation** | Validate performance with business-oriented metrics. | - AUC-ROC, confusion matrix.<br>- Simulate financial impact. |  
| **6. Deployment** | Deploy the model for production use. | - Build a final churn prediction model with customer behavior indicators. |  

*This notebook covers the Business Understanding, Data Understanding, and Data Preparation stages.*  
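
To make the "costs of false negatives" item from the Business Understanding stage concrete, the sketch below compares two sets of model errors under asymmetric costs. The unit costs, the error counts, and the helper name `misclassification_cost` are all illustrative assumptions, not values taken from the dataset or from the bank.

```python
# Hypothetical unit costs; real values would have to come from the business side.
COST_FALSE_NEGATIVE = 500.0  # assumed revenue lost when a churner is missed
COST_FALSE_POSITIVE = 50.0   # assumed cost of a retention offer sent to a loyal customer


def misclassification_cost(false_negatives: int, false_positives: int) -> float:
    """Total cost of a model's errors under the assumed unit costs."""
    return (false_negatives * COST_FALSE_NEGATIVE
            + false_positives * COST_FALSE_POSITIVE)


# Two illustrative outcomes with the same total number of errors (100).
cautious_model = misclassification_cost(false_negatives=20, false_positives=80)
lenient_model = misclassification_cost(false_negatives=80, false_positives=20)

print(cautious_model)  # 14000.0
print(lenient_model)   # 41000.0
```

With these assumed costs, the model that misses fewer churners is far cheaper even though both make 100 mistakes, which is why the evaluation stage should not rely on accuracy alone.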

---  


## Imports:


In [0]:

# Data analysis/preparation:
# SRC/Data
import sys
sys.path.append('./src')  # sys.path entries must be directories, not .py files
# PySpark.SQL
from pyspark.sql import functions as F
from pyspark.sql import DataFrame
from pyspark.sql.types import IntegerType
# Numpy 
import numpy as np
# Pandas
import pandas as pd
# Scikit-learn feature selection
from sklearn.feature_selection import chi2
# Scipy
from scipy.stats import ttest_ind

# Graphics:
# Matplotlib
import matplotlib.pyplot as plt
# Seaborn
import seaborn as sns

## Functions:

#### Dataset Manipulation in Spark

In [0]:
class DataSpark:

    # Init
    def __init__(
        self, 
        file_location: str,
        dataframe: DataFrame = None, 
    ):
        # Always define both attributes, so `save_data` and `load_data`
        # never fail with AttributeError on a partially initialized object.
        self.file_location = file_location
        self.dataframe = None

        try:
            if dataframe is not None: 
                # Checks if it is a PySpark DataFrame
                if not isinstance(dataframe, DataFrame):
                    raise TypeError('The input must be a PySpark DataFrame.')

                # Check if the DataFrame is empty
                if dataframe.rdd.isEmpty():
                    raise ValueError('The provided DataFrame is empty.')

                self.dataframe = dataframe

        except Exception as e:
            print(f"[Error] Failed to load DataFrame {dataframe} or file path {file_location}: {e}")

    # Save Data
    def save_data(
        self,
        file_type: str = 'parquet',
        mode: str = 'overwrite',
        delimiter: str = ',',
        header: bool = True
    ):
        """
        Saves the DataFrame to disk at the path defined by `self.file_location`.

        Supports saving as CSV or Parquet. The output format is determined by the `file_type` argument.
        This method uses the PySpark DataFrameWriter with the specified options.

        Args:
            file_type (str, optional): Format to save the file. Must be 'csv' or 'parquet'. Default is 'parquet'.
            mode (str, optional): Write mode, such as 'overwrite', 'append', 'ignore', or 'error'. Default is 'overwrite'.
            delimiter (str, optional): Field delimiter to use when writing CSV files. Default is ','.
            header (bool, optional): Whether to include a header row in CSV files. Default is True.

        Raises:
            ValueError: If an unsupported file type is provided.
            Exception: If the save operation fails for any other reason.

        Example:
            >>> spark_df = spark.read.csv('data.csv', header=True, inferSchema=True)
            >>> ds = DataSpark(file_location='/tmp/output/', dataframe=spark_df)
            >>> ds.save_data(file_type='csv')
        """
        try:
            # Mode
            writer = self.dataframe.write.mode(mode)

            # Parquet file
            if file_type == 'parquet':
                writer = writer.format('parquet')

            # CSV File
            elif file_type == 'csv':
                writer = writer.format('csv') \
                    .option('delimiter', delimiter) \
                    .option('header', str(header)) \
                    .option('encoding', 'UTF-8') \
                    .option('escape', '"') \
                    .option('multiline', 'true')
            else:
                raise ValueError(f"Unsupported file type '{file_type}'. Use 'csv' or 'parquet'.")

            writer.save(self.file_location)
            print(f"✅ Data saved successfully to: {self.file_location}")

        except Exception as e:
            print(f"[ERROR] Unable to save data to '{self.file_location}': {e}")

    # Load Data
    def load_data(
        self,
        spark = spark,
        file_type: str = 'csv',
        infer_schema: bool = True,
        header: bool = True,
        delimiter: str = ',',
        encoding: str = 'UTF-8',
        multiline: bool = True,
        escape: str = '"'
    ):
        """
        Loads data from disk into `self.dataframe` using the specified format and options.

        Supports loading from CSV or Parquet files. Uses the path defined in `self.file_location`.
        For CSV files, several options like schema inference, delimiter, and multiline reading are configurable.

        Args:
            spark (SparkSession): The active Spark session used to read the data.
            file_type (str, optional): The format of the input file. Options are 'csv' or 'parquet'. Default is 'csv'.
            infer_schema (bool, optional): Whether to infer the schema when reading CSV files. Default is True.
            header (bool, optional): Whether the CSV file has a header row. Default is True.
            delimiter (str, optional): Field delimiter for CSV files. Default is ','.
            encoding (str, optional): Character encoding of the file. Default is 'UTF-8'.
            multiline (bool, optional): Whether to support multiline fields in CSV. Default is True.
            escape (str, optional): Character used to escape quotes in CSV. Default is '"'.

        Returns:
            DataFrame: A PySpark DataFrame containing the loaded data.

        Raises:
            ValueError: If an unsupported file type is provided.
            FileNotFoundError: If the file does not exist at the specified location.
            Exception: For any other issues during file loading.

        Example:
            >>> ds = DataSpark(file_location='/tmp/data.csv')
            >>> df = ds.load_data(spark=spark, file_type='csv')
        """
        try:
            if file_type == 'csv':
                df = spark.read.format('csv') \
                    .option('inferSchema', str(infer_schema)) \
                    .option('header', str(header)) \
                    .option('delimiter', delimiter) \
                    .option('encoding', encoding) \
                    .option('multiline', str(multiline)) \
                    .option('escape', escape) \
                    .load(self.file_location)

            elif file_type == 'parquet':
                df = spark.read.format('parquet').load(self.file_location)
            else:
                raise ValueError(f"Unsupported file type '{file_type}'. Use 'csv' or 'parquet'.")

            print(f'✅ File loaded successfully from: {self.file_location}')
            self.dataframe = df
            return self.dataframe

        except FileNotFoundError:
            print(f"[ERROR] File not found at: '{self.file_location}'")
        except ValueError as ve:
            print(f"[ERROR] {ve}")
        except Exception as e:
            print(f"[ERROR] Error loading file '{self.file_location}': {e}")


#### Graphics
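
The plotting class in this section places one subplot per variable on a grid and derives the number of rows with an integer ceiling division (see `_initializer_subplot_grid`). The standalone sketch below reproduces only that arithmetic; the function name `grid_rows` is hypothetical and is not part of the class.

```python
def grid_rows(num_vars: int, num_columns: int) -> int:
    """Rows needed to place `num_vars` plots with `num_columns` plots per row."""
    if num_columns < 1:
        raise ValueError('num_columns must be at least 1.')
    # Integer ceiling division, the same expression used in `_initializer_subplot_grid`.
    return (num_vars + num_columns - 1) // num_columns


for n in (1, 3, 4, 7):
    rows = grid_rows(n, num_columns=3)
    unused = rows * 3 - n  # leftover axes that `_finalize_subplot_layout` deletes
    print(n, rows, unused)
# 1 1 2
# 3 1 0
# 4 2 2
# 7 3 2
```

The leftover axes in the last row are the ones removed after plotting, so the final figure shows no empty panels.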

In [0]:
class GraphicsData:

    # Init
    def __init__(
        self, 
        data: pd.DataFrame,
        ):

        try:
            # Entry checks
            if data.empty:
                raise ValueError('The provided DataFrame is empty.')

            self.data = data

        except Exception  as e:
            print(f'[Error] Failed to load Dataframe : {str(e)}')
    

    ###_initializer_subplot_grid
    def _initializer_subplot_grid(
        self, 
        num_columns, 
        figsize_per_row
    ):
        """
        Initializes and returns a standardized matplotlib subplot grid layout.

        This utility method calculates the required number of rows based on 
        the number of variables in the dataset and the desired number of 
        columns per row. It then creates a grid of subplots accordingly and 
        applies a consistent styling.

        Args:
            num_columns (int): Number of subplots per row.
            figsize_per_row (int): Vertical size (height) per row in the final figure.

        Returns:
            tuple:
                - fig (matplotlib.figure.Figure): The full matplotlib figure object.
                - ax (np.ndarray of matplotlib.axes._subplots.AxesSubplot): Flattened array of subplot axes.
        """
        num_vars = len(self.data.columns)
        num_rows = (num_vars + num_columns - 1) // num_columns

        plt.rc('font', size = 12)
        fig, ax = plt.subplots(
            num_rows,
            num_columns,
            figsize = (30, num_rows * figsize_per_row),
            squeeze = False  # always return a 2-D array, even for a 1x1 grid
        )
        ax = ax.flatten()
        sns.set(style = 'whitegrid')

        return fig, ax

    ###_finalize_subplot_layout
    def _finalize_subplot_layout(
        self,
        fig,
        ax,
        i: int,
        title: str = None,
        fontsize: int = 30,
    ):
        """
        Finalizes and displays a matplotlib figure by adjusting layout and removing unused subplots.

        This method is used after plotting multiple subplots to:
        - Remove any unused axes in the grid.
        - Set a central title for the entire figure.
        - Automatically adjust spacing and layout for better readability.
        - Display the resulting plot.

        Args:
            fig (matplotlib.figure.Figure): The matplotlib figure object containing the subplots.
            ax (np.ndarray of matplotlib.axes.Axes): Array of axes (flattened) for all subplots.
            i (int): Index of the last used subplot (all subplots after this will be removed).
            title (str, optional): Title to be displayed at the top of the entire figure.
            fontsize (int, optional): Font size of the overall title. Default is 30.
        """
        for j in range(i + 1, len(ax)):
            fig.delaxes(ax[j])
        
        plt.suptitle(title, fontsize = fontsize, fontweight = 'bold')
        plt.tight_layout(rect = [0, 0, 1, 0.97])
        plt.show()
    
    ###_format_single_ax
    def _format_single_ax(
        self, 
        ax,
        title: str = None,
        fontsize: int = 20,
        linewidth: float = 0.9
    ):

        """
        Applies standard formatting to a single subplot axis.

        This method configures a single axis by:
        - Setting the title with specified font size and bold style.
        - Hiding the x and y axis labels.
        - Adding dashed grid lines for both axes with configurable line width.

        Args:
            ax (matplotlib.axes.Axes): The axis to be formatted.
            title (str, optional): Title text for the axis. Defaults to None.
            fontsize (int, optional): Font size for the title. Defaults to 20.
            linewidth (float, optional): Width of the dashed grid lines. Defaults to 0.9.
        """
        ax.set_title(title, fontsize = fontsize, fontweight = 'bold')
        ax.set_xlabel(None)
        ax.set_ylabel(None)
        ax.grid(axis = 'y', which = 'major', linestyle = '--', linewidth = linewidth)
        ax.grid(axis = 'x', which = 'major', linestyle = '--', linewidth = linewidth)

    ### Plot Variable Type
    def plot_variable_type(
        self,
        count_col: str,
        label_col: str, 
        title = 'Distribution of Variable Types'
    ):
        
        """
        Plots a pie chart to display the proportion of each variable type in the dataset.

        This method uses a pie chart to visualize the distribution of different types of variables
        (e.g., categorical, numerical) based on the values provided in `count_col` and `label_col`.

        Args:
            count_col (str): Name of the column containing the counts for each variable type.
            label_col (str): Name of the column containing the labels/categories of variable types.
            title (str, optional): Title of the pie chart. Defaults to 'Distribution of Variable Types'.

        Raises:
            ValueError: If `count_col` or `label_col` is not found in the DataFrame.
            Exception: For any other error that occurs during plotting.
        """

        try:
            # Entry checks
            if count_col not in self.data.columns:
                raise ValueError(f"Column '{count_col}' does not exist in the DataFrame.")

            if label_col not in self.data.columns:
                raise ValueError(f"Column '{label_col}' does not exist in the DataFrame.")

            # Define AX and Fig
            plt.rc('font', size = 14)
            fig, ax = plt.subplots(figsize = (7, 7))

            ax.pie(
                self.data[count_col],
                labels = self.data[label_col],
                colors = sns.color_palette('Set3', len(self.data)),
                autopct = '%1.1f%%',
                startangle = 120,
                explode=[0.05 if i >= len(self.data) - 2 else 0 for i in range(len(self.data))],
                shadow = False,
            )
            # Config Ax's and Show Graphics
            ax.set_title(title, fontsize = 15, fontweight='bold')
            plt.tight_layout()
            plt.show()
        except Exception  as e:
            print(f'[Error] Failed to generate variable distribution plot: {str(e)}')

    ### Numerical histograms
    def numerical_histograms(
        self, 
        num_columns: int = 3,
        figsize_per_row: int = 6,
        color: str = '#a2bffe',
        hue: str = None,
        palette: list = ['#b0ff9d', '#db5856'],
        title: str = 'Histograms of Numerical Variables',
    ):
        """
        Plots histograms with KDE (Kernel Density Estimation) for all numerical columns in the dataset.

        Optionally groups the histograms by a categorical target variable using different colors (hue).
        Useful for visualizing the distribution of numerical features and how they differ between groups.

        Args:
            num_columns (int): Number of plots per row in the subplot grid.
            figsize_per_row (int): Height of each row in inches (controls vertical spacing).
            color (str): Default color for histograms when `hue` is not specified.
            hue (str, optional): Name of the column used for grouping (e.g., 'churn_target'). Must be categorical.
            palette (list): List of colors for hue levels. Only used if `hue` is provided.
            title (str): Title of the entire figure layout.

        Raises:
            Exception: If plotting fails due to missing columns, incorrect types, or rendering errors.
        """
        try:
            # Entry checks
            numeric_cols = self.data.select_dtypes(include = 'number').columns.tolist()
            if hue and hue in numeric_cols:
                numeric_cols.remove(hue)

            # Define AX and Fig
            fig, ax = self._initializer_subplot_grid(num_columns, figsize_per_row)

            for i, column in enumerate(numeric_cols):
                sns.histplot(
                    data = self.data,
                    x = column,
                    kde = True,
                    hue = hue,
                    palette = palette if hue else None,
                    edgecolor = 'black',
                    alpha = 0.4 if hue else 0.7,
                    color = None if hue else color,
                    ax = ax[i],
                )
                # Config Ax's
                self._format_single_ax(ax[i], title = f'Histogram of variable: {column}')
                
            # Show Graphics
            self._finalize_subplot_layout(fig, ax, i, title = title)
        except Exception as e:
            print(f'[Error] Failed to generate numeric histograms: {str(e)}')

    ### Numerical Boxplots
    def numerical_boxplots(
        self, 
        hue: str = None, 
        num_columns: int = 3,
        figsize_per_row: int = 6,
        palette: list = ['#b0ff9d', '#db5856'],
        color: str = '#a2bffe',
        showfliers: bool = False,
        title: str = 'Boxplots of Numerical Variables',
        legend: list = []
    ):
        """
        Plots boxplots for each numerical variable in the dataset.

        Optionally groups the boxplots by a categorical hue variable (e.g., churn target), 
        allowing for comparison of distributions between groups. Helps identify outliers, 
        skewness, and variability in each feature.

        Args:
            hue (str, optional): Column name to group the boxplots (e.g., 'churn_target').
                                If None, individual boxplots are created without grouping.
            num_columns (int): Number of plots per row in the subplot grid.
            figsize_per_row (int): Height (in inches) of each row of plots.
            palette (list): Color palette to use when `hue` is provided.
            color (str): Single color to use when `hue` is not specified.
            showfliers (bool): Whether to display outlier points in the boxplots (default: False).
            title (str): Overall title for the subplot grid.
            legend (list): Custom legend labels to replace default tick labels when `hue` is present.

        Raises:
            ValueError: If the hue column is not found in the DataFrame.
            Exception: If plotting fails due to unexpected issues.
        """
        try:
            # Entry checks
            if hue and hue not in self.data.columns:
                raise ValueError(f"Column '{hue}' not in the DataFrame.")

            numeric_cols = self.data.select_dtypes(include = 'number').columns.tolist()
            if hue and hue in numeric_cols:
                numeric_cols.remove(hue)

            # Define AX and Fig
            fig, ax = self._initializer_subplot_grid(num_columns, figsize_per_row)

            for i, column in enumerate(numeric_cols):
                sns.boxplot(
                    data = self.data,
                    x = hue if hue else column,
                    y = column if hue else None,
                    hue = hue if hue else None,
                    palette = palette if hue else None,
                    color = None if hue else color,
                    showfliers = showfliers,
                    legend = False,
                    ax = ax[i]
                )

                # Config Ax's
                if len(legend) > 0:
                    ax[i].set_xticks(list(range(len(legend))))
                    ax[i].set_xticklabels(legend, fontsize = 16, fontweight = 'bold')

                self._format_single_ax(ax[i], f'Box plot of variable: {column}')
                ax[i].set_yticklabels([])
                sns.despine(ax = ax[i], top = True, right = True, left = True, bottom = True)
            
            # Show Graphics
            self._finalize_subplot_layout(fig, ax, i, title = title)
        except Exception as e: 
            print(f'[ERROR] Failed to generate numerical boxplots: {str(e)}')

    ### Categorical Countplots
    def categorical_countplots(
        self,
        hue: str = None,
        num_columns: int = 2,
        figsize_per_row: int = 7,
        palette: list = ['#b0ff9d', '#db5856'],
        color: str = '#a2bffe',
        title: str = 'Countplots of Categorical Variables'
    ):
        """
        Plots countplots for all categorical variables in the dataset.

        Optionally groups the bars using a hue column (e.g., 'churn_target'), allowing 
        visual comparison of class distributions between different categories. Annotates
        each bar with its percentage frequency.

        Args:
            hue (str, optional): Name of the column used to group bars (e.g., target variable).
                                If None, no grouping is applied.
            num_columns (int): Number of plots per row in the subplot grid.
            figsize_per_row (int): Height (in inches) of each subplot row.
            palette (list): List of colors to use when `hue` is specified.
            color (str): Default color to use when `hue` is not provided.
            title (str): General title for the entire plot grid.

        Raises:
            ValueError: If the hue column is not found in the DataFrame.
            Exception: If the plot generation fails for unexpected reasons.
        """
        try:
            # Entry checks
            if hue and hue not in self.data.columns:
                raise ValueError(f"Column '{hue}' not found in the DataFrame.")

            categorical_cols = self.data.select_dtypes(include = ['object', 'category']).columns.tolist()
            if hue and hue in categorical_cols:
                categorical_cols.remove(hue)
            
            # Config Ax's
            fig, ax = self._initializer_subplot_grid(num_columns, figsize_per_row)

            for i, column in enumerate(categorical_cols):
                sns.countplot(
                    data = self.data,
                    x = column,
                    hue = hue if hue else None,
                    palette = palette if hue else None,
                    color = None if hue else color,
                    edgecolor = 'white' if hue else 'black',
                    saturation = 1,
                    legend = False,
                    ax = ax[i]
                )
                
                total = len(self.data[column])
                for p in ax[i].patches:
                    height = p.get_height()
                    if height == 0:
                        continue
                    percentage = f'{100 * height / total:.1f}%'
                    x = p.get_x() + p.get_width() / 1.95
                    y = height
                    ax[i].annotate(
                        percentage,
                        (x, y),
                        ha = 'center',
                        va = 'bottom',
                        fontsize = 16,
                        color = 'black'
                    )

                # Config Ax's
                self._format_single_ax(ax[i], f'Countplot of variable: {column}')
                ax[i].set_xticks(range(len(ax[i].get_xticklabels())))
                ax[i].set_xticklabels(ax[i].get_xticklabels(), fontsize = 16)
                
            # Show Graphics
            self._finalize_subplot_layout(fig, ax, i, title = title)
        except Exception as e:
            print(f'[ERROR] Failed to generate categorical countplots: {str(e)}')

    ### Numerical Barplots
    def numerical_barplots(
        self,
        hue: str = None,
        num_columns: int = 3,
        figsize_per_row: int = 6,
        palette: list = ['#b0ff9d', '#db5856'],
        errorbar = ('ci', 90),
        title: str = 'Barplots of Numerical Variables',
        legend: list = []
    ):
        """
        Plots barplots for each numerical variable, optionally grouped by a hue variable.

        This method creates barplots to visualize the mean (or other estimator) of numerical
        variables in the dataset. It supports grouping by a categorical variable (`hue`)
        and displays error bars (e.g., confidence intervals).

        Args:
            hue (str, optional): Column name to group the barplots (e.g., 'churn_target').
                If None, no grouping is applied. Defaults to None.
            num_columns (int): Number of subplots per row in the grid.
            figsize_per_row (int): Height (in inches) allocated per row of subplots.
            palette (list, optional): List of colors to use when `hue` is specified.
                Defaults to ['#b0ff9d', '#db5856'].
            errorbar (tuple or str, optional): Error bar representation passed to seaborn.barplot.
                Defaults to ('ci', 90) for 90% confidence intervals.
            title (str): Overall title for the figure.
            legend (list, optional): Custom labels to replace default hue legend labels.
                Defaults to an empty list.

        Raises:
            ValueError: If the `hue` column is specified but not found in the DataFrame.
            Exception: For other errors during plotting.
        """
        try:
            # Entry checks
            if hue and hue not in self.data.columns:
                raise ValueError(f"Column '{hue}' not found in the DataFrame.")

            numeric_cols = self.data.select_dtypes(include = 'number').columns.tolist()
            if hue and hue in numeric_cols:
                numeric_cols.remove(hue)
            
            # Define AX and Fig
            fig, ax = self._initializer_subplot_grid(num_columns, figsize_per_row)

            for i, column in enumerate(numeric_cols):
                sns.barplot(
                    data = self.data,
                    x = hue,
                    y = column,
                    hue = hue,
                    errorbar = errorbar,
                    dodge = False,
                    palette = palette if hue else None,  # a palette without hue triggers a seaborn warning
                    edgecolor = 'white',
                    legend = False,
                    ax = ax[i]
                )

                # Config Ax's
                if len(legend) > 1:
                    ax[i].set_xticks(list(range(len(legend))))
                    ax[i].set_xticklabels(legend, fontsize = 16, fontweight = 'bold')

                self._format_single_ax(ax[i], f'Barplot of variable: {column}')
                ax[i].set_yticklabels([])
                sns.despine(ax = ax[i], top = True, right = True, left = True, bottom = True)
            
            # Show Graphics
            self._finalize_subplot_layout(fig, ax, i, title = title)
        except Exception as e:
            print(f'[ERROR] Failed to generate numerical barplots: {str(e)}')

    ### Barplot Target
    def barplot_target(
        self,
        target_col: str,
        percentage_col: str,
        title: str,
        palette: list = ['#b0ff9d', '#db5856'],
    ):
        """
        Plots a bar chart showing the churn rate from a pre-aggregated DataFrame.

        This method visualizes the percentage distribution of the churn target classes,
        using bars colored by the target class and annotated with percentage values.

        Args:
            target_col (str): Name of the column representing the churn target classes
                (e.g., 0 = non-churner, 1 = churner).
            percentage_col (str): Name of the column containing percentage values for each class.
            title (str): Title of the plot.
            palette (list, optional): List of colors for the bars. Defaults to ['#b0ff9d', '#db5856'].

        Raises:
            ValueError: If `target_col` or `percentage_col` are not found in the DataFrame.
            Exception: For any other error occurring during plotting.
        """
        try:
            # Entry checks
            if target_col not in self.data.columns:
                raise ValueError(f"Column '{target_col}' not found in the DataFrame.")

            if percentage_col not in self.data.columns:
                raise ValueError(f"Column '{percentage_col}' not found in the DataFrame.")
            
            # Define AX and Fig
            plt.rc('font', size = 20, weight = 'bold')
            fig, ax = plt.subplots(figsize = (8, 6))

            barplot = sns.barplot(
                data = self.data,
                x = target_col,
                y = percentage_col,
                hue = target_col,
                dodge = False,
                palette = palette,
                edgecolor = 'black',
                saturation = 1,
                legend = False,
                ax = ax
            )

            # Annotate bars
            for v in barplot.patches:
                barplot.annotate(
                    f'{v.get_height():.2f}%',
                    (v.get_x() + v.get_width() / 2., v.get_height() / 1.06),
                    ha = 'center',
                    va = 'top',
                    fontsize = 16,
                    fontweight = 'bold',
                    color = 'black'
                )

            # Config Ax's and Show Graphics
            ax.set_yticklabels([])
            sns.despine(ax = ax, top = True, right = True, left = True, bottom = False)
            self._format_single_ax(ax, title = title, linewidth = 0.5)
            
            plt.tight_layout()
            plt.show()
        except Exception as e:
            print(f'[Error] Failed to generate Barplot target: {str(e)}')

    ### Correlation Heatmap
    def correlation_heatmap(
        self,
        title: str = None,
        cmap: str = 'coolwarm'
    ):
        """
        Plots a heatmap showing the correlation matrix among the numerical columns.

        This method computes the correlation matrix of the dataset and displays it as a heatmap,
        with annotations showing the correlation coefficients.

        Args:
            title (str, optional): Title for the heatmap plot. Defaults to None.
            cmap (str, optional): Colormap to use for the heatmap. Defaults to 'coolwarm'.

        Raises:
            Exception: If the heatmap generation or plotting fails.
        """
        try:
            # Restrict the correlation matrix to the numerical columns
            corr_data = self.data.select_dtypes(include = 'number').corr()

            # Define AX and Fig
            plt.rc('font', size = 15)
            fig, ax = plt.subplots(figsize = (20, 15))

            sns.heatmap(
                corr_data,
                annot = True,
                cmap = cmap,
                fmt = '.2f',
                linewidths = 0.5,
                ax = ax
            )
            # Config Ax's and Show Graphics
            ax.set_title(title, fontsize = 20, fontweight = 'bold')
            plt.tight_layout(rect = [0, 0, 1, 0.97])
            plt.show()
        except Exception as e:
            print(f'[Error] Failed to generate correlation heatmap: {str(e)}')

    ### Scatterplots vs Reference
    def scatterplots_vs_reference(
        self, 
        x_reference: str,
        hue: str = None,
        exclude_cols: list = [],
        num_columns: int = 3,
        figsize_per_row: int = 6,
        palette: list = ['#b0ff9d', '#db5856'],
        title: str = 'Scatterplot of Numerical Variables vs Reference'
    ):
        """
        Plots scatterplots comparing numerical variables against a reference variable,
        optionally grouped by a hue variable.

        This method creates scatterplots of all numerical columns (excluding specified ones)
        against a single reference numerical column on the X-axis. Points can be colored by
        a categorical hue variable.

        Args:
            x_reference (str): Column name to be used as X-axis in all scatterplots.
            hue (str, optional): Column name used for grouping/coloring points. Defaults to None.
            exclude_cols (list, optional): List of columns to exclude from Y-axis candidates,
                in addition to `x_reference` and `hue`. Defaults to empty list.
            num_columns (int): Number of plots per row in the subplot grid.
            figsize_per_row (int): Height (in inches) allocated per subplot row.
            palette (list, optional): List of colors for the hue categories. Defaults to ['#b0ff9d', '#db5856'].
            title (str): Overall title for the figure.

        Raises:
            ValueError: If `x_reference` or `hue` (when specified) are not found in the DataFrame.
            Exception: For any other errors during plotting.

        """
        try:
            # Entry checks
            if x_reference not in self.data.columns:
                raise ValueError(f"Column '{x_reference}' not found in the DataFrame.")
        
            if hue and hue not in self.data.columns:
                raise ValueError(f"Column '{hue}' not found in the DataFrame.")

            numeric_cols = self.data.select_dtypes(include = 'number').columns.tolist()
            for col in [x_reference, hue] + exclude_cols:
                if col in numeric_cols:
                    numeric_cols.remove(col)

            # Define AX and Fig
            fig, ax = self._initializer_subplot_grid(num_columns, figsize_per_row)

            for i, column in enumerate(numeric_cols):
                sns.scatterplot(
                    data = self.data,
                    x = x_reference,
                    y = column,
                    hue = hue,
                    palette = palette if hue else None,
                    ax = ax[i]
                )

                # Config Ax's
                self._format_single_ax(ax[i], f'{column} x {x_reference}')
                ax[i].set_xticklabels([])
                ax[i].set_yticklabels([])
                sns.despine(ax = ax[i], top = True, right = True, left = True, bottom = True)

            # Show Graphics
            self._finalize_subplot_layout(fig, ax, i, title = title)
        except Exception as e:
            print(f'[ERROR] Failed to generate scatterplots vs reference: {str(e)}')  

    # Categorical Bar Percentages
    def categorical_bar_percentages(
        self,
        hue: str ,
        palette: list = ['#b0ff9d', '#db5856'],
        num_columns: int = 2,
        figsize_per_row: int = 8,
        title: str = 'Barplots Of The Individual Rate Percentages Of Each Column Class'
    ):
        """
        Plots barplots of churn percentages per class of each categorical variable.

        This method calculates the percentage distribution of a binary target (`hue`)
        within each category of all categorical columns in the dataset, and visualizes
        these percentages as barplots.

        Args:
            hue (str): Name of the binary target column (e.g., 'churn_target').
            palette (list, optional): List of colors for the hue classes.
                Defaults to ['#b0ff9d', '#db5856'].
            num_columns (int): Number of subplots per row in the grid.
            figsize_per_row (int): Height (in inches) allocated per subplot row.
            title (str): Overall title for the figure.

        Raises:
            ValueError: If `hue` is not found in the DataFrame.
            Exception: For other errors during computation or plotting.

        Returns:
            None: Displays the plot directly.
        """
        try:
            # Entry checks
            if hue and hue not in self.data.columns:
                raise ValueError(f"Column '{hue}' not found in the DataFrame.")
            categorical_cols = self.data.select_dtypes(include = ['object', 'category']).columns.tolist()
            if hue and hue in categorical_cols:
                categorical_cols.remove(hue)

            # Define AX and Fig
            fig, ax = self._initializer_subplot_grid(num_columns, figsize_per_row)

            for i, column in enumerate(categorical_cols):
                
                total_churn_per_class = self.data.groupby(column)[hue].count().reset_index(name = 'total_count_class')

                result = (
                    self.data.groupby([column, hue])[hue]
                    .count()
                    .reset_index(name = 'frequency')
                    .merge(total_churn_per_class, on = column)
                )
                result['percentage_per_class'] = round((result['frequency'] / result['total_count_class']) * 100, 2)

                sns.barplot(
                    data=result,
                    x = column,
                    y = 'percentage_per_class',
                    hue = hue,
                    palette = palette,
                    edgecolor = 'white',
                    saturation = 1,
                    legend = False,
                    ax = ax[i]
                )

                # Annotate bars
                for p in ax[i].patches:
                    height = p.get_height()
                    percentage = f'{height:.1f}%'
                    x = p.get_x() + p.get_width() / 2
                    ax[i].annotate(
                        percentage,
                        (x, height),
                        ha='center',
                        va='bottom',
                        fontsize=14,
                        color='black'
                    )

                # Config Ax's
                self._format_single_ax(ax[i], f'Barplot of variable: {column}')
                ax[i].set_xticks(range(len(ax[i].get_xticklabels())))
                ax[i].set_xticklabels(ax[i].get_xticklabels(), fontsize = 16)
            
            # Show Graphics
            self._finalize_subplot_layout(fig, ax, i, title = title)
        except Exception as e:
            print(f'[ERROR] Failed to generate percentage barplots: {str(e)}')


#### Find Outliers

In [0]:
def find_outliers(df_num):
    """
    Calculates and displays the percentage of outliers in each column of a PySpark DataFrame.

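    Outliers are identified with the IQR rule: a value is flagged when it falls below
    Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where IQR = Q3 - Q1.
    The percentage of outliers per column is computed over the total number of rows.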

    Parameters:
    -----------
    df_num : pyspark.sql.DataFrame
        PySpark DataFrame with numeric columns.

    Returns:
    --------
    None
    """
    try:
        # Checks if it is a PySpark DataFrame
        if not isinstance(df_num, DataFrame):
            raise TypeError('The input must be a PySpark DataFrame.')
        
        # Check if the DataFrame is empty
        if df_num.rdd.isEmpty():
            raise ValueError('The provided DataFrame is empty.')
        
        # List to save data
        out_col, num_outliers = [], []

        # Total size of the dataframe
        size_df = df_num.count()
        if size_df == 0:
            raise ValueError('The DataFrame has no rows.')
        
        for column in df_num.columns:
            try:
                # Calculation of quartiles (may fail if not numeric)
                quantiles = df_num.approxQuantile(column, [0.25, 0.75], 0)
                if not quantiles or len(quantiles) < 2:
                    print(f"[Warning] Could not calculate quantiles for column: {column}")
                    continue
                
                Q1, Q3 = quantiles # Lower quartile and Upper quartile
                IQR = Q3 - Q1 # Difference between the third quartile and the first quartile
                lower_bound = Q1 - 1.5 * IQR
                upper_bound = Q3 + 1.5 * IQR

                # Filter nulls and create temporary column for outliers
                df_filtered = df_num.filter(F.col(column).isNotNull())
                df_filtered = df_filtered.withColumn(
                    f'{column}_out',
                    F.when((F.col(column) < lower_bound) | (F.col(column) > upper_bound), True).otherwise(False)
                )

                # Count outliers
                n_outliers = df_filtered.filter(F.col(f'{column}_out') == True).count()
                percentage_out = round((n_outliers / size_df) * 100, 2)

                # Store the results
                out_col.append(column)
                num_outliers.append(percentage_out)
            
            except Exception as inner_e:
                print(f"[Warning] Failed to process column '{column}': {inner_e}")

        # Show Results
        if out_col:
            print('\n✅ Percentage of Outliers by Column:')
            percentage_out_data = spark.createDataFrame([tuple(num_outliers)], out_col)
            percentage_out_data.display()
        else:
            print('⚠️ No outliers could be computed.')

    except Exception as e:
        print(f"[Error] Failed to compute outliers: {e}")

#### Hypothesis Testing

In [0]:
def run_ttest_between_groups(
    data: pd.DataFrame, 
    numerical_col: str, 
    group_col: str, 
    group1_val  = 1, 
    group2_val = 0, 
    alpha: float = 0.05,
    print_summary: bool = True
):
    """
    Performs a t-test comparing a numerical variable between two specified groups.

    This function conducts an independent two-sample t-test (Welch’s t-test) 
    on the specified numerical column between two groups defined by values in a grouping column.

    Args:
        data (pd.DataFrame): The dataset containing the variables.
        numerical_col (str): Name of the numeric column to compare.
        group_col (str): Name of the column representing groups.
        group1_val (any): Value in `group_col` representing the first group.
        group2_val (any): Value in `group_col` representing the second group.
        alpha (float, optional): Significance level for hypothesis testing. Default is 0.05.
        print_summary (bool, optional): Whether to print the test summary. Default is True.

    Returns:
        tuple:
            t_stat (float): The computed t-statistic.
            p_value (float): The p-value of the test.

    Raises:
        Exception: If an error occurs during the test execution.
    """
    try:
        group1 = data[data[group_col] == group1_val][numerical_col]
        group2 = data[data[group_col] == group2_val][numerical_col]

        t_stat, p_value = ttest_ind(group1, group2, equal_var = False)

        if print_summary:
            print(f'\n🟢 t-statistic: {t_stat:.5f}')
            print(f'🔵 p-value: {p_value:.5f}')
            print('-------' * 10)
            if p_value <= alpha:
                print(f'\n✅ Null Hypothesis (H0) Rejected!')
                print(f"There is a significant difference in '{numerical_col}'") 
                print(f'between the two groups ({group1_val} vs {group2_val}).')
            else:
                print(f'\n⛔ Failed to Reject the Null Hypothesis (H0)!')
                print(f"There is no significant difference in '{numerical_col}'") 
                print(f'between the two groups ({group1_val} vs {group2_val})')

        # Return the statistics, as documented in the docstring
        return t_stat, p_value

    except Exception as e:
        print(f'[ERROR] Failed to perform t-test: {str(e)}')
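
To make explicit what Welch's t-test measures, here is a minimal standard-library sketch of the statistic and its approximate degrees of freedom. The helper name `welch_t` and the sample values are illustrative assumptions; the notebook itself relies on `ttest_ind(..., equal_var = False)`, which also produces the p-value.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group1, group2):
    """Return Welch's t-statistic and the Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    # Squared standard error of each group mean (sample variance / group size)
    v1 = variance(group1) / n1
    v2 = variance(group2) / n2
    t_stat = (mean(group1) - mean(group2)) / sqrt(v1 + v2)
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, dof


# Hypothetical values for two groups (e.g. churners vs non-churners)
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(f't-statistic: {t_stat:.4f} | degrees of freedom: {dof:.4f}')
```

Unlike Student's t-test, this version does not assume that both groups share the same variance, which is useful when the churner and non-churner groups differ greatly in size.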



## 1 - Business Understanding  
---

### General Problem Context  
#### What is Churn Rate, and What Are the Solutions to This Problem? 
Many companies struggle with customer churn and often find it challenging to reverse this trend. The metric that measures this scenario is called **churn rate**, which indicates when strategic solutions are needed to address the issue.  

In 2020, Bryce Baer published a guide on churn rate on the [Zendesk website](https://www.zendesk.com.br/blog/customer-churn-rate/?_ga=2.155312252.614584228.1623244699-1365810980.1622555740#) – a company specializing in corporate software development. The guide highlights that businesses implementing strategies to reduce churn can increase their **profitability** by nearly 40%.  

---

#### How to Calculate Churn Rate?
##### Churn Rate Formula:  
$$\text{Churn Rate} = \frac{\text{Number of customers lost during a period}}{\text{Total number of customers at the start of the period}} \times 100$$  

---
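
As a quick worked example of the formula above, the sketch below computes the churn rate for a hypothetical period. The function name `churn_rate` and the customer counts are assumptions used only for illustration.

```python
def churn_rate(customers_lost: int, customers_at_start: int) -> float:
    """Return the churn rate of a period as a percentage."""
    if customers_at_start <= 0:
        raise ValueError('customers_at_start must be positive.')
    return customers_lost * 100 / customers_at_start


# Hypothetical period: 1,000 customers at the start, 50 of them lost
example_rate = churn_rate(customers_lost = 50, customers_at_start = 1000)
print(f'Churn rate: {example_rate:.2f}%')  # Churn rate: 5.00%
```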

#### Impacts of a High Churn Rate
While reducing churn to zero is practically impossible, acceptable rates (4% to 5%) minimize financial impacts. Some companies operate at higher rates (5% to 7%) without significant revenue loss, depending on industry dynamics. **Key factors to define "acceptable" churn**:  
- Industry standards (e.g., SaaS vs. retail).  
- Customer lifetime value (CLV).  
- Customer acquisition cost (CAC).  

---

#### Reasons for Customer Churn
1. **Lack of Perceived Value**:  
   - Occurs when there’s a growing gap between customer expectations and actual delivery. Clear communication about product/service benefits is critical.  
2. **Poor Customer Experience**:  
   - Negative interactions (e.g., bad support, complex processes, product failures) drive churn.  
3. **Competitor Offers**:  
   - Attractive promotions or pricing from competitors can lure customers away.  
4. **Changing Customer Needs**:  
   - Failure to adapt products/services to evolving demands leads to turnover.  

---

## Project Challenge: 
The bank’s manager has observed a rising number of customers abandoning credit card services. Stakeholders aim to:  
1. **Analyze historical data** to identify root causes of churn.  
2. **Develop a machine learning model** to predict customer churn probability.  
3. **Implement strategic actions** to retain high-risk customers.  

---

## KPIs for the Churn Prediction Project:  
1. **Churn Rate**:  
   - *Definition*: Percentage of customers who discontinue credit card services within a specific period.  
   - *Goal*: Reduce this metric through targeted retention strategies.  

2. **Retention Rate**:  
   - *Definition*: Percentage of customers retained after a period.  
   - *Importance*: Directly reflects the success of retention efforts.  

3. **Customer Acquisition Cost (CAC) vs. Retention Cost**:  
   - *Definition*: Ratio of costs to acquire new customers vs. retaining existing ones.  
   - *Insight*: Retention is typically **5-7x cheaper** than acquisition.  

4. **AUC-ROC (Area Under the Receiver Operating Characteristic Curve)**:  
   - *Definition*: Measures the model’s ability to distinguish between churners and non-churners.  
   - *Target*: AUC-ROC > 0.85.  

5. **Recall**:  
   - *Definition*: Proportion of actual churners correctly identified by the model.  
   - *Importance*: High recall ensures fewer **false negatives** (missed churners), which is critical because a false negative could result in losing a customer. Retaining existing customers through targeted strategies is significantly cheaper than acquiring new ones.  

---
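
The two model KPIs above can be illustrated with a small standard-library sketch. The labels, scores, and helper names are hypothetical. AUC-ROC is computed here through its pairwise interpretation: the share of (churner, non-churner) pairs in which the churner receives the higher score.

```python
def recall_binary(y_true, y_pred):
    """Recall = TP / (TP + FN) for the positive class (1 = churner)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def auc_roc_pairwise(y_true, scores):
    """AUC-ROC as the fraction of churner/non-churner pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # A tie between a churner and a non-churner counts as half a correct pair
    wins = sum((1.0 if p > n else 0.5 if p == n else 0.0) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = churner) and predicted churn probabilities
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.4, 0.8, 0.1, 0.3, 0.6, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(f'Recall: {recall_binary(y_true, y_pred):.4f}')
print(f'AUC-ROC: {auc_roc_pairwise(y_true, scores):.4f}')
```

In this toy example one of the four churners is missed at the 0.5 threshold. That is the false-negative case the Recall KPI is meant to keep low.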



## 2 - Data Understanding

---

* This dataset contains about 10,000 customers and describes their age, income category, marital status, credit card limit, credit card category, and other attributes.

---

- **Data file**: BankChurners.csv

---

- **Target (dependent) variable**: 'Attrition_Flag', a categorical column with binary classification, i.e. 'Existing Customer' (non-churner) or 'Attrited Customer' (churner).

---

- **Dataset collected from Kaggle**: https://www.kaggle.com/datasets/sakshigoyal7/credit-card-customers?sort=votes&select=BankChurners.csv

---
- **Original source of the dataset**: https://leaps.analyttica.com/home

---



### Loading Dataset

I will use the Medallion Architecture to organize the data. This approach arranges the data into successive levels of processing and refinement, which makes it easier to manage and analyze. Here is a summary of each layer:

---

- **Bronze Layer:**

  Description: Stores the raw data, exactly as it was collected from different sources.
  
  Objective: Preserve the integrity of the original data, without any transformation.
---
- **Silver Layer:**

  Description: Contains pre-processed and cleaned data.
  
  Objective: Perform basic transformations, such as data cleaning, standardization, and type correction.
---
- **Gold Layer:**

  Description: Stores the refined data, ready for analysis and final consumption.
  
  Objective: Apply specific corrections and improvements according to business needs.

---
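
One way to keep the three layers consistent is to build every storage path from a single helper. The sketch below is only an illustration: the helper `layer_path` and the Gold file name are hypothetical, while the base directory follows the paths used in this notebook.

```python
# Base directory used throughout this notebook
BASE_PATH = 'dbfs:/FileStore/DS_Credit-Card_Churn_Analysis/Datasets'
LAYERS = ('Bronze', 'Silver', 'Gold')


def layer_path(layer: str, name: str) -> str:
    """Build the storage path of a dataset inside a Medallion layer."""
    if layer not in LAYERS:
        raise ValueError(f"Unknown layer '{layer}'. Expected one of {LAYERS}.")
    return f'{BASE_PATH}/{layer}/{name}'


print(layer_path('Bronze', 'BankChurners.csv'))  # raw file, exactly as collected
print(layer_path('Silver', 'Clean-data'))        # cleaned and typed data
print(layer_path('Gold', 'Train-data'))          # hypothetical name for refined data
```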

### Bronze Data Tier

In [0]:
# Creating a directory to store the files
dbutils.fs.mkdirs('dbfs:/FileStore/DS_Credit-Card_Churn_Analysis/Datasets/Bronze/')
###### >>>>>>> Note: At this point, upload the files present in the notebook repository folder to this directory

# Viewing the location of files
display(dbutils.fs.ls('dbfs:/FileStore/DS_Credit-Card_Churn_Analysis/Datasets/Bronze/'))

In [0]:
# File location and type
file_location = 'dbfs:/FileStore/DS_Credit-Card_Churn_Analysis/Datasets/Bronze/BankChurners.csv'
file_type = 'csv'
# Load Data
df_csv = DataSpark(file_location = file_location).load_data(file_type = file_type)
# Show Data
df_csv.limit(10).display()

### Saving the dataset in Parquet format for better query performance

In [0]:
# File location and type
file_location = 'dbfs:/FileStore/DS_Credit-Card_Churn_Analysis/Datasets/Bronze/BankChurners-parquet'
file_type = 'parquet'

# Save Data
DataSpark(dataframe = df_csv, file_location = file_location).save_data(file_type = file_type)
# Load Data
df = DataSpark(file_location = file_location).load_data(file_type = file_type)
# Show Data
df.limit(10).display()

### Data Dictionary
---

**CLIENTNUM**: Client number. Unique identifier for the customer holding the account

---

**Attrition_Flag**: Internal event (customer activity) variable - if the account is closed then 'Attrited Customer' else 'Existing Customer'

---

**Customer_Age**: Demographic variable - Customer's Age in Years

---

**Gender**: Demographic variable - M=Male, F=Female

---

**Dependent_count**: Demographic variable - Number of dependents

---

**Education_Level**: Demographic variable - Educational Qualification of the account holder (example: high school, college graduate, etc.)

---

**Marital_Status**: Demographic variable - Married, Single, Divorced, Unknown

---

**Income_Category**: Demographic variable - Annual Income Category of the account holder (< $40K, $40K - $60K, $60K - $80K, $80K - $120K, > $120K)

---

**Card_Category**: Product Variable - Type of Card (Blue, Silver, Gold, Platinum)

---

**Months_on_book**: Period of relationship with bank

---

**Total_Relationship_Count**: Total no. of products held by the customer

---

**Months_Inactive_12_mon**: No. of months inactive in the last 12 months

---

**Contacts_Count_12_mon**: No. of Contacts in the last 12 months

---

**Credit_Limit**: Credit Limit on the Credit Card

---

**Total_Revolving_Bal**: Total Revolving Balance on the Credit Card

---

**Avg_Open_To_Buy**: Open to Buy Credit Line (Average of last 12 months)

---

**Total_Amt_Chng_Q4_Q1**: Change in Transaction Amount (Q4 over Q1)

---

**Total_Trans_Amt**: Total Transaction Amount (Last 12 months)

---

**Total_Trans_Ct**: Total Transaction Count (Last 12 months)

---

**Total_Ct_Chng_Q4_Q1**: Change in Transaction Count (Q4 over Q1)

---

**Avg_Utilization_Ratio**: Average Card Utilization Ratio

---


### Dataset Size

In [0]:
print(f'Number of records: {df.count()}\nNumber of columns: {len(df.columns)}')

### Drop Redundant Columns:

The documentation of this dataset on Kaggle recommends removing the following columns:

---
- Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1

---
- Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2

---
Therefore, I will be applying this recommendation.

I will also remove the **CLIENTNUM** column, which holds the registration number of each customer of this banking institution. As a unique identifier, it adds no relevant information for the problems and questions this analysis aims to answer.

---

In [0]:
redundants_cols = [
  'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1', 'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2',
  'CLIENTNUM'
]
df = df.drop(*redundants_cols)

# Check dataset size
df.count(), len(df.columns)

### Checking data and its characteristics

#### Checking data schema

In [0]:
df.printSchema()

#### Checking for null data

In [0]:
# Size df
size_df = df.count()

# Check null data
df.agg(*[F.round(((F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)) / size_df) * 100), 2).alias(c) for c in df.columns]) \
    .display()

#### Checking for duplicate data

In [0]:
df.groupby(df.columns) \
    .count() \
    .filter(F.col('count') > 1) \
    .display()

### Classifying variables
#### Concepts for Classification of variables according to statistics:

---

**Quantitative or numerical variables**:

* *Discrete*: take only integer values

* *Continuous*: can take any value in a range of real numbers

---

**Qualitative or categorical variables**:
* *Nominal*: categories have no natural order

* *Ordinal*: categories can be ordered

---
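
The practical difference between nominal and ordinal variables is that only ordinal categories can be ranked. The sketch below illustrates this with income brackets. The labels and their order are assumptions for illustration and may not match the exact strings stored in the CSV.

```python
# Assumed order, from lowest to highest income bracket (illustrative labels)
INCOME_ORDER = ['Less than $40K', '$40K - $60K', '$60K - $80K', '$80K - $120K', '$120K +']
INCOME_RANK = {label: rank for rank, label in enumerate(INCOME_ORDER)}

# Ordinal: sorting by rank is meaningful
sample_income = ['$80K - $120K', 'Less than $40K', '$60K - $80K']
ordered_income = sorted(sample_income, key = lambda label: INCOME_RANK[label])
print(ordered_income)

# Nominal: categories such as marital status have no rank, so only
# equality checks and frequency counts are meaningful
sample_marital = ['Married', 'Single', 'Married']
print({status: sample_marital.count(status) for status in set(sample_marital)})
```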

##### Adjusting column names

In [0]:
for column in df.columns:
    # Renaming each column to lowercase
    df = df.withColumnRenamed(column, column.lower())

df.limit(5).display()

In [0]:
# Numerical Variables

discrete_numerical = [
    'customer_age', 'dependent_count', 'months_on_book', 'total_relationship_count', 'months_inactive_12_mon', 'contacts_count_12_mon', 'total_trans_ct'
    ]

continuos_numerical = [
    'credit_limit','total_revolving_bal', 'avg_open_to_buy', 'total_amt_chng_q4_q1', 'total_trans_amt', 'total_ct_chng_q4_q1',
    'avg_utilization_ratio'

    ]


# Categorical Variables

nominal_categorical = [
    'attrition_flag', 'gender', 'marital_status', 
    ]

ordinal_categorical  = [
    'education_level', 'income_category', 'card_category', 

    ]


In [0]:
# Create dataset with column types
column = ['ct_type_cols']
data = [ (name_col, ) for name_col in df.columns]
type_columns = spark.createDataFrame(data, column)

# Adding types
type_columns = type_columns \
  .withColumn('ct_type_cols', F.when(F.col('ct_type_cols').isin(nominal_categorical), 'Nominal Categorical').otherwise(F.col('ct_type_cols'))) \
  .withColumn('ct_type_cols', F.when(F.col('ct_type_cols').isin(ordinal_categorical), 'Ordinal Categorical').otherwise(F.col('ct_type_cols'))) \
  .withColumn('ct_type_cols', F.when(F.col('ct_type_cols').isin(continuos_numerical), 'Continuous Numerical').otherwise(F.col('ct_type_cols'))) \
  .withColumn('ct_type_cols', F.when(F.col('ct_type_cols').isin(discrete_numerical), 'Discrete Numerical').otherwise(F.col('ct_type_cols')))

# Collecting data: count and percentage of columns per variable type
groupy_type_columns = type_columns.groupBy('ct_type_cols') \
  .agg(F.count('ct_type_cols').alias('count_types')) \
  .withColumn('percentage', F.round((F.col('count_types') / len(df.columns)) * 100, 2)) \
  .orderBy('count_types')

# Graphic
# Data
data_ax = groupy_type_columns.toPandas()

In [0]:
pie_type_var = GraphicsData(data = data_ax)
pie_type_var.plot_variable_type(count_col='count_types', label_col='ct_type_cols')

### Checking the data initially to verify its characteristics and structure

#### Checking summary statistics of the numerical columns

In [0]:
df_describe = df.select(*discrete_numerical, *continuos_numerical) \
    .describe()
df_describe.display()

#### Mean of numerical variables

In [0]:
df_describe.filter(F.col('summary') == 'mean').display()

#### Checking categorical variables

In [0]:
# Total number of rows, computed once instead of on every iteration
total_rows = df.count()

# Iterating over the categorical columns
for column in df.select(*nominal_categorical, *ordinal_categorical).columns:

    # Grouping each column by frequency and percentage
    df.groupBy(column) \
        .agg(F.count(column).alias('frequency')) \
        .withColumn('percentage', F.round((F.col('frequency') / total_rows) * 100, 2)) \
        .orderBy('frequency', ascending = False) \
        .display()

#### Checking Outliers

Minimum values of the numerical data

In [0]:
df_describe.filter(F.col('summary') == 'min').display()

Maximum values of the numerical data

In [0]:
df_describe.filter(F.col('summary') == 'max').display()

Distribution of numerical variables

In [0]:
# Collecting the data
data_ax = df.select(*discrete_numerical, *continuos_numerical).toPandas()
# Histograms of Numerical Variables
GraphicsData(data_ax).numerical_histograms()

In [0]:
# Boxplots of Numerical Variables
GraphicsData(data_ax).numerical_boxplots(showfliers = True)

Checking the percentage of outliers

In [0]:
find_outliers(df.select(*discrete_numerical, *continuos_numerical))


### Initial observations and insights:
---

- 1 - This dataset has a little over **10,000 records**, so it is a small dataset. This limits the statistical inferences that can be drawn from the analyses, and it also makes it harder to train models that generalize well.
---
- 2 - There is no **null data** or **duplicate data** in this dataset, which is very positive.
---
- 3 - Most columns in this dataset are **numerical variables**, corresponding to **70% of the columns**; only **30%** are **categorical**.
---
- 4 - The average age of customers is **46 years**, which indicates a more mature customer profile. One hypothesis is that, because of this, they tend to be more demanding about the credit card services provided by the bank.
---
- 5 - Customers have, on average, **2 dependents**, which can be a significant factor in understanding the profile of this banking institution's customers.
---
- 6 - On average, customers have had a relationship with the bank for **35 months (about 3 years)**, which initially seems positive, but it could still be a factor to improve.
---
- 7 - Customers hold, on average, approximately **4 products** offered by the bank. This factor should be analyzed more carefully to determine its influence on the churn rate of this institution's customers.
---
- 8 - The largest group of customers is **married, about 46%**, which can tell us more about the products and services to offer to this customer profile.
---
- 9 - The largest education group is **graduates**, around 30%.
---
- 10 - The most common income category is less than **$40,000** per year.
---
- 11 - On average, customers tend to use **27%** of their credit card limit.
---
- 12 - The **Blue** card is the most common credit card category, held by around **93%** of customers.
---
- 13 - There is a balance between the number of male and female customers, with a slight majority of women.
---
- 14 - Initially, I used the **IQR** rule to flag potential **outliers** in this dataset. Some numeric variables have values that, statistically, can be considered outliers. The **Credit_Limit** variable, for example, has 9.72% of its values flagged. However, all numeric columns will be analyzed to verify whether these values are part of the **natural distribution** of this dataset or whether they are **incorrect data**.

---
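
The IQR rule mentioned in the last observation can be reproduced on a tiny sample with the standard library. This is a sketch with hypothetical values and a hypothetical helper name, `iqr_outliers`. It also interpolates quartiles with `statistics.quantiles`, whereas `find_outliers` uses Spark's `approxQuantile`, so the bounds may differ slightly between the two.

```python
from statistics import quantiles


def iqr_outliers(values):
    """Return the IQR bounds and the values that fall outside them."""
    q1, _, q3 = quantiles(values, n = 4, method = 'inclusive')
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_bound or v > upper_bound]
    return lower_bound, upper_bound, outliers


# Hypothetical sample with one extreme value
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
lower_bound, upper_bound, outliers = iqr_outliers(sample)
outlier_percentage = round(len(outliers) / len(sample) * 100, 2)

print(f'Bounds: [{lower_bound}, {upper_bound}]')
print(f'Outliers: {outliers} ({outlier_percentage}%)')
```

A flagged value is not necessarily an error. As noted above, it still has to be checked against the natural distribution of the variable.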


## 3 - Data Preparation
---

- In this step, I first split the data into training and test sets so that the test data does not influence the analyses. The model should learn only from the training data, without any bias from the test data, and still produce classifications that generalize well.
---
- Next, I will conduct an EDA to examine the data and its main characteristics. The main objective of this EDA is to understand how the data relates to the churn rate of this banking institution.
---

### Adjusting dataset targets
---

I will adjust the target column, attrition_flag, which is a categorical column with binary classification:

---

- **Existing Customer (Non-churner)**: will receive the **value 0**. 

  These are customers who still use the credit card services provided by the bank.

---

- **Attrited Customer (Churner)**: will receive the **value 1**. 

  These are customers who have stopped using the credit card services.

---

This approach makes it easier to calculate correlation statistics between variables, since the target variable is already numerically encoded.

---

In [0]:
# Indexing and adjusting the target column name
df_clean = df.withColumn('churn_target', F.when(F.col('attrition_flag') == 'Existing Customer', 0).otherwise(1).cast(IntegerType()))

# Dropping the original target column, now replaced by churn_target
df_clean = df_clean.drop('attrition_flag')

# Adjusting the list of nominal variables
nominal_categorical.remove('attrition_flag')

# Check new df
df_clean.limit(5).display()



### Silver Data Tier

In [0]:

# File location and type
file_location = 'dbfs:/FileStore/DS_Credit-Card_Churn_Analysis/Datasets/Silver/Clean-data'
file_type = 'parquet'

# Save Clean Dataset
DataSpark(dataframe = df_clean, file_location = file_location).save_data(file_type = file_type)

# Reading Dataset
df = DataSpark(file_location = file_location).load_data(file_type = file_type)

# Check Dataset
df.limit(10).display()

### Splitting the Training and Testing data
---
* From here on, all analyses will be based on the **training dataset**, which represents **80%** of the total dataset. I chose to split the data at this point to avoid **leaking test data**: the test data must not influence the analysis or the modeling decisions made with the training data.

---

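* I will use PySpark's randomSplit function. Because randomSplit samples rows at random and does not stratify by class, I compare the target distribution of the full, training, and test datasets in the cells below to confirm that they remain similar.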

---
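
For comparison, a split that preserves the class distribution by construction is a stratified split. The sketch below shows the idea in plain Python with hypothetical labels. It is not what `randomSplit` does and it is not used in this notebook. In PySpark, a similar effect could be approximated with `DataFrame.sampleBy`, which samples each class with its own fraction.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction = 0.8, seed = 12):
    """Split row indices so that each class keeps the same train/test proportion."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical target column: 160 churners (1) and 840 non-churners (0)
labels = [1] * 160 + [0] * 840
train_idx, test_idx = stratified_split(labels)

train_churn = sum(labels[i] for i in train_idx) / len(train_idx) * 100
test_churn = sum(labels[i] for i in test_idx) / len(test_idx) * 100
print(f'Train churn rate: {train_churn:.2f}% | Test churn rate: {test_churn:.2f}%')
```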

In [0]:
# Df data
df.groupBy('churn_target') \
    .agg(F.count('churn_target').alias('frequency')) \
    .withColumn('percentage', F.round((F.col('frequency') / df.count()) * 100, 2)) \
    .orderBy('frequency', ascending = False) \
    .display()  

In [0]:
train, test = df.randomSplit([0.8, 0.2], seed = 12) 

In [0]:
# Training data
train.groupBy('churn_target') \
    .agg(F.count('churn_target').alias('frequency')) \
    .withColumn('percentage', F.round((F.col('frequency') / train.count()) * 100, 2)) \
    .orderBy('frequency', ascending = False) \
    .display()  

In [0]:
# Test data
test.groupBy('churn_target') \
    .agg(F.count('churn_target').alias('frequency')) \
    .withColumn('percentage', F.round((F.col('frequency') / test.count()) * 100, 2)) \
    .orderBy('frequency', ascending = False) \
    .display()  

### Exploratory Data Analysis - EDA

#### Checking the churn rate

In [0]:
# Data collect
train_churn = train.groupBy('churn_target') \
    .agg(F.count('churn_target').alias('frequency')) \
    .withColumn('percentage', F.round((F.col('frequency') / train.count()) * 100, 2)) \
    .withColumn('churn_target',F.when(F.col('churn_target') == 1, 'Churners').otherwise('Non-Churners')) \
    .orderBy('frequency', ascending = False)

# Data of graphics
data_ax = train_churn.toPandas()

In [0]:
GraphicsData(data_ax).barplot_target(target_col = 'churn_target', percentage_col = 'percentage', title =  'Churn Rate of Training Data')

- In the training data, **16.02%** of customers abandoned credit card services, while **83.98%** continue to use the bank's services.
---
- Put simply, about **16** out of every **100 customers** discontinued credit card services.
---
- This dataset has imbalanced classes, which must be considered when training machine learning models. Imbalanced classes make it harder for a model to learn and generalize, especially for the **minority class**: models tend to learn to predict the **majority class** more easily and have more difficulty detecting the **minority class**.

---
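
One common way to account for this imbalance is to weight each class inversely to its frequency. The sketch below derives such weights from the percentages reported above, using the `n_samples / (n_classes * class_count)` heuristic (the one scikit-learn applies for `class_weight='balanced'`). Whether class weighting is the right remedy for this project is an assumption; the notebook has not decided on it.

```python
# Class shares taken from the churn-rate chart above (in percent).
class_share = {0: 83.98, 1: 16.02}  # 0 = non-churner, 1 = churner

n_classes = len(class_share)
total = sum(class_share.values())

# Inverse-frequency weight: total / (n_classes * share of the class).
class_weight = {
    label: total / (n_classes * share)
    for label, share in class_share.items()
}

for label, weight in class_weight.items():
    print(f"class {label}: weight {weight:.3f}")
```

With these shares, the churner class receives a weight roughly five times larger than the non-churner class.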

#### Checking the distributions of numerical variables

In [0]:
data_ax = train.select(*discrete_numerical, *continuos_numerical).toPandas()
GraphicsData(data_ax).numerical_histograms()

#### Observations and insights regarding numerical data
---
- 1 - The **age of customers** is concentrated between **40** and **50 years**, with **49** being the most frequent age in this data.
---
- 2 - Most customers have between **2** and **3 dependents**, with a minority having 5 dependents.
---
- 3 - The length of the **customer's relationship** with the bank varies from **13** to **56 months**, with **36 months** being the most frequent.
---
- 4 - The **number of products** maintained by the customer is generally above **3 products**, with few customers maintaining only **1** or **2 products**.
---
- 5 - Most **customers remain inactive** for a maximum of **3 months**, with only a small fraction remaining inactive for **4** to **6 months**.
---
- 6 - The **number of contacts** in the last 12 months was, in most cases, **2** to **3 contacts**.
---
- 7 - The **number of transactions** is mostly distributed between **60** and **80 transactions**, with a very small portion of customers making less than **20** or more than **100 transactions**.
---
- 8 - Most customers have a **credit limit** of less than **5,000 dollars**, although there is a relatively significant portion of customers with a limit of **35,000 dollars**.
---
- 9 - Most customers have a **zero credit card revolving balance**, which is a positive sign, indicating that most customers are up to date with their bill payments.
---
- 10 - The **average open to buy** is below **5,000 dollars** for most customers.
---
- 11 - Most credit **card limit utilization** is below **20%**, with a small portion of customers using more than **80%** of their credit card limit.

#### Checking the distribution of categorical variables

In [0]:
# Data collect
data_ax = train.select(*nominal_categorical, *ordinal_categorical).toPandas()
# Categorical Countplots
GraphicsData(data_ax).categorical_countplots()


#### Observations and insights regarding categorical data
---

- 1 - The majority of clients are women, with a percentage of **52.8%**.
---
- 2 - **46.3%** of clients are **married**, while **38.8%** are **single**. There is a small portion of **7.5%** of **divorced** clients and another portion of **7.4%** of clients who do **not fit** into any of the above categories.
---
- 3 - The largest group of clients has a **Graduate** level of education, with **30.8%**. This status refers to people who have already graduated and completed a specialization in the area they studied.

  **High School** represents **20%**. This status refers to people who have already graduated and completed high school.

  **Unknown** represents **15%**. This status refers to people who possibly did not fill out the form or did not fit into any of the above classifications.

  **Uneducated** represents **14.7%**. This status refers to people who have not had access to formal education or have not completed a significant level of study.

  **College** represents **10%**. This status refers to higher education.

  Finally, the **Postgraduate** and **Doctorate** statuses have the smallest shares. **Postgraduate** is basically a synonym for **Graduate**, both referring to the same status, while **Doctorate** refers to the highest level of study.
---
- 4 - The largest group of clients, **34.9%**, has an income below **40k**.

  **17.6%** of clients have an income between **40k and 60k**.

  **15.5%** have an income between **80k and 120k**.

  **13.7%** have an income between **60k and 80k**.

  **11%** did not fill out this information or do not fit into any of the categories above.

  A smaller portion, **7.3%**, has an income above **120k** per year.
---
- 5 - **93.2%** of customers have a **Blue** credit card, which is the dominant class. Next comes the **Silver** credit card with **5.6%**, and the **Gold** and **Platinum** cards with a small share of participation that is practically nil.

#### Checking the correlation of numerical variables with the cause of the problem

In [0]:
# Data collect
data_ax = train.select(*continuos_numerical, *discrete_numerical, 'churn_target').toPandas()
# Correlation Heatmap
GraphicsData(data_ax).correlation_heatmap(title = 'Correlation Matrix Of Variables with Churn Rate')

Variables ordered by correlation with the churn target variable:

In [0]:
data_ax.corr()['churn_target'].abs().sort_values(ascending = False)

#### Observations and insights on the correlation of numeric variables with the 'churn_target' variable.
---
**Considering that a customer who churns has a value of 1 and a customer who does not churn has a value of 0:**

---
- 1 - **total_trans_ct** is the variable with the highest correlation with the **churn_target** variable, with a negative correlation of **-0.37**. This indicates that the lower the number of transactions in the last 12 months, the greater the likelihood that customers will stop using credit card services.
---
- 2 - **total_ct_chng_q4_q1** had a correlation of **-0.29**. This column represents the change in the number of transactions between the fourth quarter (Q4) of the previous year and the first quarter (Q1). The lower this index, the greater the possibility of the customer stopping using the credit card.
---
- 3 - Customers with a **lower revolving balance** are more likely to become **inactive**, while customers with a higher revolving balance tend to continue using the bank's credit card services. It is interesting to note that, even with a **higher revolving balance**, which can lead to possible future debt, these customers tend to remain **active customers** of the institution.
---
- 4 - The **number of contacts** made in the last 12 months showed a **positive correlation**, indicating that a higher number of contacts is statistically associated with the rate of **inactive customers**.
---
- 5 - The **use of the credit card limit** has a **negative correlation** with customers who have abandoned the credit card service, that is, the **lower the use of the card**, the greater the possibility of them becoming **inactive**. On the other hand, the greater the use of the credit card limit, the greater the possibility of the customer continuing to use the service.
---
- 6 - The **number of services** has a negative correlation with customers who have stopped using the credit card; the **fewer the number of services**, the greater the possibility of the customer becoming **inactive**.
---
- 7 - The **number of inactive months** has a **positive correlation** with inactive customers, since the higher this number, the greater the possibility of the customer becoming inactive.
---
- 8 - **total_amt_chng_q4_q1** has a **negative correlation** with inactive customers; the lower the total value of the variations between the fourth quarter (Q4) of the previous year and the first quarter (Q1), the greater the possibility of the customer becoming inactive.
---
- 9 - The variables **avg_open_to_buy** and **credit_limit** have a **perfect correlation**, indicating that these two variables are passing the same information to future machine learning models.
---
- 10 - The **other variables** do not have a very relevant correlation with the **churn_target** variable.
---
- 11 - Initially, it is possible to conclude that the **number of transactions** carried out by customers in recent months has the **strongest correlation** with the problem in question, which is credit card churn. The **two variables** most clearly related to **churn_target** both describe the **number of transactions carried out by customers**. Therefore, it is now possible to formulate a hypothesis about this fact.
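
Because `churn_target` is binary, the Pearson coefficient returned by `data_ax.corr()` is equivalent to the point-biserial correlation, which depends only on the difference between the two group means, the overall spread, and the class proportions. The sketch below checks that equivalence on a small made-up sample; the numbers are illustrative and not taken from the dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def point_biserial(xs, labels):
    """Point-biserial correlation between a numeric variable and a 0/1 label."""
    group_1 = [x for x, y in zip(xs, labels) if y == 1]
    group_0 = [x for x, y in zip(xs, labels) if y == 0]
    n = len(xs)
    mean_all = sum(xs) / n
    std_all = sqrt(sum((x - mean_all) ** 2 for x in xs) / n)  # population std
    mean_1 = sum(group_1) / len(group_1)
    mean_0 = sum(group_0) / len(group_0)
    return (mean_1 - mean_0) / std_all * sqrt(len(group_1) * len(group_0) / n ** 2)

# Made-up transaction counts and churn labels, for illustration only.
trans_ct = [30, 42, 45, 50, 60, 70, 75, 82, 90, 95]
churn = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

r_pearson = pearson(trans_ct, churn)
r_pb = point_biserial(trans_ct, churn)
print(f"pearson = {r_pearson:.4f}, point-biserial = {r_pb:.4f}")
```

A negative value here simply means that the churner group has the lower mean, which is how the negative coefficients in the list above should be read.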

In [0]:
# Data collect
data_ax = train.select(*continuos_numerical, *discrete_numerical, 'churn_target').toPandas()
# Numerical Histograms By Churn Rate
GraphicsData(data = data_ax).numerical_histograms(
    hue = 'churn_target',
    title='Histoplots Of Numerical Variables Grouped By Churn Rate'
)

In [0]:
GraphicsData(data_ax).numerical_barplots(hue = 'churn_target', legend = ['Non-Churner', 'Churner'])

In [0]:
GraphicsData(data_ax).numerical_boxplots(hue = 'churn_target', legend = ['Non-churner', 'Churner'])



#### Observations and insights into the distributions of numeric variables with the 'churn_target' variable.
---
- 1 - In the variable **total_revolving_bal**, it is possible to observe that a greater distribution of customers who stopped using their credit card is in the **lowest revolving balance values**. A significant portion of these customers have a revolving balance **below \$500**.
---
- 2 - In the variable **total_trans_amt**, it is possible to observe that customers who stopped using their credit card are concentrated in the **lower transaction amounts**. Most of these customers had a total **transaction amount below \$2750**.
---
- 3 - In the variable **total_ct_chng_q4_q1**, it is possible to observe that most customers who kept their credit card service active had an increase of at least **50%** in the number of transactions carried out in relation to Q4 and Q1.
---
- 4 - In the **avg_utilization_ratio** variable, it is possible to observe that most customers who stopped using their credit card have practically **not used their credit card limit** in the last few months.
---
- 5 - In the **contacts_count_12_mon** variable, it is possible to observe that most customers who stopped using their credit card have a **number of contacts greater than or equal to 3**.
---
- 6 - In the **total_trans_ct** variable, it is possible to observe that most customers who stopped using the credit card have a number **below 80 transactions in the last 12 months**. And all customers who have **95 transactions or more** continued to use the credit card.
---
- 7 - The value of the revolving balances of customers who stopped using their credit cards is relatively lower, around **45% less**, than that of customers who continued using their credit cards.
---
- 8 - The total transaction amounts in recent months are lower for customers who stopped using their credit cards.
---
- 9 - Customers who continued using their credit cards have a reasonably higher number of services.
---
- 10 - Customers who stopped using their credit cards have a higher number of inactive months and a higher number of contacts in the last 12 months.
---
- 11 - Customers who stopped using their credit cards have around **34% fewer transactions** in the last 12 months.
---
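
The percentage gaps quoted above are relative differences between group means. As a worked example, the EDA conclusions of this notebook report an average of about 69 transactions for non-churners and 45 for churners; the helper below (a hypothetical name, not part of the project code) reproduces the gap of roughly 34% from those rounded means.

```python
def relative_difference(reference, value):
    """Percentage difference of `value` relative to `reference`."""
    return (value - reference) / reference * 100

# Rounded group means of total_trans_ct as reported in the EDA conclusions.
mean_non_churners = 69
mean_churners = 45

gap = relative_difference(mean_non_churners, mean_churners)
print(f"Churners make {abs(gap):.1f}% fewer transactions on average")
```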

#### Checking the relationship of the numeric variables together with the total_trans_ct variable with the churn_target variable

In [0]:
# Scatterplots vs total_trans_ct Grouped By Churn Rate
GraphicsData(data_ax).scatterplots_vs_reference(
    x_reference = 'total_trans_ct',
    hue = 'churn_target',
    title = 'Scatterplot Of Numeric Variables x total_trans_ct Grouped By Churn Rate'
)


#### Observations and insights into the scatterplots of numeric variables x total_trans_ct with the 'churn_target' variable.
---
- 1 - The variable **total_trans_ct** has the strongest correlation with the variable **churn_target** among the numeric variables. Therefore, I chose to check its dispersion against the other variables, grouped by the target variable **churn_target**.

The combinations that best separated **Non-churners** from **Churners** were:

- **total_trans_ct** x **total_trans_amt**
- **total_trans_ct** x **total_ct_chng_q4_q1**

These variables refer to the quantity or total value of transactions, reinforcing the previous observations that the number of transactions and their total value reflect, in a certain way, the possible behavior of the customer, indicating whether he or she will continue to use the credit card or stop using it.

---
- 2 - The variables **total_revolving_bal** and **avg_utilization_ratio**, together with the variable **total_trans_ct**, had a reasonable separation of the **Churn** and **Non-churn** classes, although there was a greater dispersion in the graphs of these two variables.

---


#### Checking the relationship of categorical variables with the cause of the problem
---

- Checking the statistical relationship of categorical variables using the Chi-Square Test

- **Note**:
To submit categorical variables to the Chi-Square Test it will be necessary to index them first.

---

In [0]:
train_idx = train.select(*ordinal_categorical, *nominal_categorical, 'churn_target') \
    .withColumn('education_level', F.when(F.col('education_level') == 'Unknown', 6)
                .otherwise(F.col('education_level'))) \
    .withColumn('education_level', F.when(F.col('education_level') == 'Uneducated', 5)
                .otherwise(F.col('education_level'))) \
    .withColumn('education_level', F.when(F.col('education_level') == 'High School', 4)
                .otherwise(F.col('education_level'))) \
    .withColumn('education_level', F.when(F.col('education_level') == 'College', 3)
                .otherwise(F.col('education_level'))) \
    .withColumn('education_level', F.when(F.col('education_level') == 'Graduate', 2)
                .otherwise(F.col('education_level'))) \
    .withColumn('education_level', F.when(F.col('education_level') == 'Post-Graduate', 1)
                .otherwise(F.col('education_level'))) \
    .withColumn('education_level', F.when(F.col('education_level') == 'Doctorate', 0)
                .otherwise(F.col('education_level').cast(IntegerType()))) \
    .withColumn('income_category', F.when(F.col('income_category') == 'Unknown', 0)
                .otherwise(F.col('income_category'))) \
    .withColumn('income_category', F.when(F.col('income_category') == 'Less than $40K', 1)
                .otherwise(F.col('income_category'))) \
    .withColumn('income_category', F.when(F.col('income_category') == '$40K - $60K', 2)
                .otherwise(F.col('income_category'))) \
    .withColumn('income_category', F.when(F.col('income_category') == '$60K - $80K', 3)
                .otherwise(F.col('income_category'))) \
    .withColumn('income_category', F.when(F.col('income_category') == '$80K - $120K', 4)
                .otherwise(F.col('income_category'))) \
    .withColumn('income_category', F.when(F.col('income_category') == '$120K +', 5)
                .otherwise(F.col('income_category').cast(IntegerType()))) \
    .withColumn('card_category', F.when(F.col('card_category') == 'Blue', 0)
                .otherwise(F.col('card_category'))) \
    .withColumn('card_category', F.when(F.col('card_category') == 'Silver', 1)
                .otherwise(F.col('card_category'))) \
    .withColumn('card_category', F.when(F.col('card_category') == 'Gold', 2)
                .otherwise(F.col('card_category'))) \
    .withColumn('card_category', F.when(F.col('card_category') == 'Platinum', 3)
                .otherwise(F.col('card_category').cast(IntegerType()))) \
    .withColumn('gender', F.when(F.col('gender') == 'F', 0)
                .otherwise(F.col('gender'))) \
    .withColumn('gender', F.when(F.col('gender') == 'M', 1)
                .otherwise(F.col('gender').cast(IntegerType()))) \
    .withColumn('marital_status', F.when(F.col('marital_status') == 'Unknown', 0)
                .otherwise(F.col('marital_status'))) \
    .withColumn('marital_status', F.when(F.col('marital_status') == 'Married', 1)
                .otherwise(F.col('marital_status'))) \
    .withColumn('marital_status', F.when(F.col('marital_status') == 'Divorced', 2)
                .otherwise(F.col('marital_status'))) \
    .withColumn('marital_status', F.when(F.col('marital_status') == 'Single', 3)
                .otherwise(F.col('marital_status').cast(IntegerType())))

train_idx.limit(3).display()

In [0]:
train_idx = train_idx.toPandas()
x = train_idx[['education_level', 'income_category', 'card_category', 'gender','marital_status']]
y = train_idx['churn_target']

chi_stat, p_value = chi2(x, y)
chi_results = pd.DataFrame({
    'cat_variables': x.columns,
    'chi_score': chi_stat, 
    'p_value': p_value
  })
chi_results

#### Observations and insights on the Chi-Square Test of category variables with the 'churn_target' variable.
---
- Initially, the statistical tests used on the categorical variables showed that they did not have much of a relationship with the churn rate.
---
- The variable **gender** shows a statistically significant relationship according to its p_value, but its chi_score is low, which suggests that the association is weak in practice.
---
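
Assuming `chi2` here is `sklearn.feature_selection.chi2`, each feature is scored as a column of non-negative counts, so the statistics above depend on the integer codes chosen during indexing. An alternative that does not depend on the encoding is the chi-square test of independence on a contingency table, complemented by Cramér's V as an effect size. The sketch below implements both by hand for a 2x2 table; the counts are made up for illustration. In practice, `scipy.stats.chi2_contingency` applied to a `pd.crosstab(...)` table does the same job (it applies Yates' continuity correction to 2x2 tables by default).

```python
from math import erfc, sqrt

def chi_square_independence(table):
    """Pearson chi-square statistic, degrees of freedom and Cramér's V
    for an r x c contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    cramers_v = sqrt(stat / (n * min(len(row_totals) - 1, len(col_totals) - 1)))
    return stat, dof, cramers_v

# Hypothetical counts: rows = gender (F, M), columns = (non-churner, churner).
table = [[40, 10],
         [30, 20]]
stat, dof, cramers_v = chi_square_independence(table)

# With 1 degree of freedom the p-value has a closed form: erfc(sqrt(stat / 2)).
p_value = erfc(sqrt(stat / 2)) if dof == 1 else None
print(f"chi2 = {stat:.3f}, dof = {dof}, p = {p_value:.4f}, V = {cramers_v:.3f}")
```

Reading the p-value together with Cramér's V separates the question "is there an association?" from "how strong is it?", which is the distinction drawn for **gender** above.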

In [0]:
# Data collect
data_ax = train.select(*ordinal_categorical, *nominal_categorical, 'churn_target').toPandas()
# Categorical Countplots by Churn
GraphicsData(data_ax).categorical_countplots(
    hue = 'churn_target',
    title = 'Countplot of categorical variables by Churn Rate',
)

In [0]:
GraphicsData(data_ax).categorical_bar_percentages(
    hue = 'churn_target',
    title = 'Barplots Of The Individual Churn Rate Percentages Of Each Column Class'
)

#### Observations and insights on categorical variables with the 'churn_target' variable
---
- 1 - The **level of education** does not show a strong relationship with the rate of customers who stopped using credit card services. Since this variable is ordered by level of education, one might expect the churn rate to rise or fall steadily with it, but no such trend appears.

- However, it is possible to draw some observations regarding this data, as it may inform the institution's decision-making.
- The level of education with the highest churn rate is **Doctorate**, with around **22%**. The lowest rates are at the **Graduate** level, with **15%**, and **High School**, with **14.40%**.
---
- 2 - **Customers' annual income** does not have a significant influence on the rate of customers who stopped using their credit cards, as all salary ranges follow a practically similar distribution in relation to the churn rate index. Only customers with a salary range of **60k - 80k** had a churn rate of **13.5%**, which is slightly lower compared to the other salary ranges.
---
- 3 - The **credit card category** shows a significant relationship with the rate of customers who stopped using their credit cards. The **Gold** and **Platinum** categories had a higher-than-average rate of credit card service cancellations compared to the other categories. However, these two categories represent a very small percentage of this data set; together, they do not even have a **2%** share in relation to the other categories.
---
- 4 - The **Silver** category is the category with the lowest rate of credit card service cancellations **with a 15% churn rate**.
---
- 5 - Considering the data from this banking institution, it is possible to conclude that the share of customers with **Silver**, **Gold** and **Platinum** cards is very low. **93%** of customers hold the entry-level category, the basic **Blue** card. It would be of great value for this institution to invest in a more flexible policy for its card categories. Offering more benefits and differentiated services through the **Silver**, **Gold** and **Platinum** categories may increase customer loyalty.
---
- 6 - The **gender** of customers showed a relationship with the churn rate. Women had a higher rate of card service cancellation than men.
---
- 7 - The **marital status** shows, graphically, a relationship with the churn rate. **Married customers** have a **slightly lower** churn rate than **single customers**.
---
- 8 - Considering the statistical tests and graphs of these categorical variables, it is possible to conclude that they have limited relevance for explaining this institution's churn rate. However, some variables do show differences in churn rate between their classes, and these differences are affected by the skewed class distributions of variables such as **marital_status** and **card_category**.
---
- 9 - For now, I will keep these variables, analyze their contribution during the pre-training process of the models, and then decide whether to keep or discard them.

---

#### Hypothesis testing H0 and H1

- I will consider the main variables that the analyses showed to have a significant relationship with this institution's churn rate.
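
For each variable, the hypotheses are H0: churners and non-churners have the same mean, against H1: the means differ. The implementation of `run_ttest_between_groups` lives in the project's source and is not shown here, so the sketch below is only an assumption about what such a test computes: Welch's t statistic and its degrees of freedom, on made-up values. With SciPy available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same statistic together with a p-value.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each group mean (sample variance / n).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t_stat = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t_stat, dof

# Made-up transaction counts, for illustration only.
non_churners = [60, 70, 75, 82, 90]
churners = [30, 42, 45, 50, 38]

t_stat, dof = welch_t(non_churners, churners)
print(f"t = {t_stat:.3f}, dof = {dof:.2f}")
```

Welch's variant is used in this sketch because the two groups differ greatly in size and may differ in variance; a pooled-variance t-test would be the alternative if equal variances were assumed.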

In [0]:
train_h_test = train.toPandas()

##### Hypothesis test with the variable: total_trans_ct

In [0]:
run_ttest_between_groups(
    data = train_h_test,
    numerical_col = 'total_trans_ct',
    group_col = 'churn_target',   
)

##### Hypothesis test with the variable: total_trans_amt

In [0]:
run_ttest_between_groups(
    data = train_h_test,
    numerical_col = 'total_trans_amt',
    group_col = 'churn_target',   
)

##### Hypothesis test with the variable: total_revolving_bal

In [0]:
run_ttest_between_groups(
    data = train_h_test,
    numerical_col = 'total_revolving_bal',
    group_col = 'churn_target',   
)

##### Hypothesis test with the variable: avg_utilization_ratio

In [0]:
run_ttest_between_groups(
    data = train_h_test,
    numerical_col = 'avg_utilization_ratio',
    group_col = 'churn_target',   
)

##### Hypothesis test with the variable: total_relationship_count

In [0]:
run_ttest_between_groups(
    data = train_h_test,
    numerical_col = 'total_relationship_count',
    group_col = 'churn_target',   
)

##### Hypothesis test with the variable: months_inactive_12_mon

In [0]:
run_ttest_between_groups(
    data = train_h_test,
    numerical_col = 'months_inactive_12_mon',
    group_col = 'churn_target',   
)

### Saving training and testing data
---

- Removing the **avg_open_to_buy** column, as it is perfectly correlated with **credit_limit**. I will keep credit_limit, as it has been shown to be slightly more associated with the target variable.

In [0]:
# Drop the redundant continuous variable
# Train data
train = train.drop('avg_open_to_buy')
# Test data
test = test.drop('avg_open_to_buy')
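
A plausible reason for the very high correlation is that open-to-buy equals the credit limit minus the revolving balance, and revolving balances are small relative to limits. That relationship is an assumption about how the dataset was built, not something verified in this notebook. The sketch below uses made-up values to show how it would produce a correlation close to 1.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up values, for illustration only.
credit_limit = [1500, 3000, 5000, 12000, 20000, 34500]
total_revolving_bal = [0, 800, 1500, 0, 2500, 1200]

# Assumed relationship: open to buy = credit limit - revolving balance.
avg_open_to_buy = [
    limit - balance for limit, balance in zip(credit_limit, total_revolving_bal)
]

r = pearson(credit_limit, avg_open_to_buy)
print(f"correlation(credit_limit, avg_open_to_buy) = {r:.4f}")
```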

### Gold Data Tier

In [0]:

# File location and type
file_location = 'dbfs:/FileStore/DS_Credit-Card_Churn_Analysis/Datasets/Gold/train'
file_type = 'parquet'

# Save  Train Dataset
DataSpark(dataframe = train, file_location = file_location).save_data(file_type = file_type)


In [0]:
# File location and type
file_location = 'dbfs:/FileStore/DS_Credit-Card_Churn_Analysis/Datasets/Gold/test'
file_type = 'parquet'

# Save  Test Dataset
DataSpark(dataframe = test, file_location = file_location).save_data(file_type = file_type)

### EDA Conclusions
---

- Through the hypothesis test, it is possible to conclude that the number of transactions made by customers using their credit cards has a significant impact on the churn rate. In the distribution graphs, it is possible to see that the number of transactions for customers who stopped using the credit card services of this institution is much lower. The average number of transactions made by **non-churner customers is almost 69**, while that of **churner customers is 45 transactions**.
---
- Also through the hypothesis test, it is possible to conclude that the total value of transactions made by customers using their credit cards has a significant impact on the churn rate. Most customers who stopped using credit card services had a total value of **transactions equal to or below \$2500**.
---
- The credit card limit utilization rate proved to be highly significant in relation to the churn rate of this institution, and this was validated through the hypothesis test. In this data, there is an **asymmetric distribution with a very long tail to the right**, indicating that most of the credit card limit usage values are in the **lowest values**. This is something to be explored by the stakeholders of this institution, since most of its customers use a **limit much lower than expected**, which can be a problem, since most of the bank's **profits from the credit card** are associated with customer usage. Regarding this problem, we have **non-churner** customers with an **average of 30%** limit usage and **churners** with an **average of 20%** limit usage.
---
- The number of services maintained on their card by the customer has been shown to have a significant impact on the churn rate of this institution, and this was also proven in the hypothesis tests.
---
- The **credit limit of credit cards** has an **asymmetric distribution**, with most of the limits **below \$5000**. This is linked to the customer profile, as most have an annual income **below $40k**. However, the possibility of having a more flexible policy regarding limit increases can be assessed together with the institution's stakeholders, considering the default rate of its customers.
---
- The **card categories** need to be restructured with more flexible policies for the **Silver, Gold and Platinum categories**, as these three categories together represent **less than 7% of the total number of customers**. The **Blue card** category holds **93.20% of the total number of customers**, indicating that most customers only keep the entry-level card. This may not be a positive factor for the bank, considering that the benefits and advantages of cards above the Blue category tend to be greater, and with these advantages it is natural for customer loyalty to be stronger and more lasting.
---
- Churners had **20% more contacts with the bank**. It is important for the bank to understand and seek to meet the requests of these customers.
---
- Most customers of this institution were inactive for up to 3 months, and **churners** had **15% more months inactive** than **non-churners** who continued with the credit card service, indicating that customer inactivity is a problem for the bank to explore and address.
---