# Client Churn Prediction
### CRISP-DM Cycle 3
---

> Disclaimer: This is a fictional business case.

## Settings

In [1]:
# Settings imports
import os
import sys

from dotenv import load_dotenv

# Load .env file
env_path = "../.env"
load_dotenv(dotenv_path=env_path)

# Seed
seed = int(os.getenv("SEED"))

# Add path
path = os.getenv("HOMEPATH")

# Add path to sys.path
sys.path.append(path)

# Colors
colors_list = ["#DE9776", "#9192B3", "#3D8221", "#823F21", "#BFC2FF", "#97D77D"]

In [2]:
# Import classes
from helpers.classes.Queries import DuckQueries
from helpers.classes.FeatureEngineering import FeatureEngineering
from helpers.classes.DataVisualizer import DataVisualizer

# import libraries



## Data

This dataset is available [here](https://www.kaggle.com/mervetorkan/churndataset).


**Data fields**

- **RowNumber**: the row number of the record
- **CustomerID**: unique identifier of the client
- **Surname**: the client's last name
- **CreditScore**: the client's credit score in the financial market
- **Geography**: the country of the client
- **Gender**: the gender of the client
- **Age**: the client's age
- **Tenure**: number of years the client has been with the bank
- **Balance**: the amount the client holds in their account
- **NumOfProducts**: the number of products the client has bought
- **HasCrCard**: whether the client has a credit card
- **IsActiveMember**: whether the client is active (within the last 12 months)
- **EstimatedSalary**: estimate of the client's annual salary
- **Exited**: whether the client churned (*target variable*)
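
The code cells in this notebook refer to these fields by snake_case names such as `customer_id`, `row_number`, and `has_cr_card`. How the project's helpers perform that renaming is not shown here. The sketch below is one possible way to do it; the function name `to_snake_case` is hypothetical and not part of the `helpers` package.

```python
import re


def to_snake_case(name: str) -> str:
    """Put an underscore before each capital that follows a lowercase
    letter or digit, then lowercase the whole string."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


# A few field names taken from the list above
fields = ["RowNumber", "CustomerID", "HasCrCard", "IsActiveMember", "NumOfProducts"]
renamed = {field: to_snake_case(field) for field in fields}
print(renamed)
```

Because the lookbehind only matches after a lowercase letter or digit, an acronym such as `ID` stays together and `CustomerID` becomes `customer_id` rather than `customer_i_d`.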

In [3]:
qb = DuckQueries()
conn = qb.get_connection(path + "/data/interim/churn.db")

query = qb.select("*").from_table("churn").build()
data = conn.execute(query).df()
conn.close()

fe = FeatureEngineering()

df = fe._perform_transformations(data)

## Report

### Generating Report

In [4]:
# report_path = path + "/reports/eda_report.html"
# fe.get_profile_report(df, report_path)

### Report Feedback

## Quantitative and Qualitative

In [5]:

# Numeric columns, excluding identifiers and binary flags
# (the binary flags are treated as qualitative below)
univariate_quantitative = df.select_dtypes(include=["int64", "float64"]).drop(
    columns=["customer_id", "row_number", "has_cr_card", "exited", "is_active_member"]
)
univariate_qualitative = df[
    [
        "is_active_member",
        "has_cr_card",
        "exited",
        "geography",
        "gender",
        "balance_indicator",
        "life_stage",
        "cs_category",
        "tenure_group",
    ]
]

eda = DataVisualizer(df)


## Univariate Analysis

### Quantitative

In [6]:
# There are many quantitative variables, so split them into two halves for plotting
n = len(univariate_quantitative.columns) // 2
print(n)
columns = univariate_quantitative.columns
univariate_quantitative_1 = univariate_quantitative[columns[:n]]
univariate_quantitative_2 = univariate_quantitative[columns[n:]]

10


In [7]:
# colors_list has 6 colors and n is 10, so reuse the first 4
colors = colors_list + colors_list[:4]

eda.multiple_distplots(univariate_quantitative_1.columns, colors)

In [8]:
eda.multiple_distplots(univariate_quantitative_2.columns, colors)
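
The palette above is extended by hand, so the slice has to change whenever the number of columns changes. A minimal sketch of a helper that repeats the base colors to any requested length; `build_palette` is a hypothetical name and is not part of `DataVisualizer`.

```python
from itertools import cycle, islice


def build_palette(base_colors, n):
    """Return n colors by repeating base_colors in order."""
    if not base_colors:
        raise ValueError("base_colors must not be empty")
    return list(islice(cycle(base_colors), n))


colors_list = ["#DE9776", "#9192B3", "#3D8221", "#823F21", "#BFC2FF", "#97D77D"]

# Same result as colors_list + colors_list[:4], without the hard-coded 4
colors = build_palette(colors_list, 10)
print(colors)
```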

### Qualitative

In [9]:
# 9 qualitative columns and 6 colors, so reuse the first 3
n = len(univariate_qualitative.columns)
colors = colors_list + colors_list[:3]

In [10]:
eda.multiple_barplots(univariate_qualitative.columns, colors)

## Bivariate Analysis

### List of Hypotheses

Number | Hypotheses
---    | ---
1      | Elderly clients are more likely to churn.
2      | Elderly clients tend not to churn.
3      | Clients with more products are less likely to churn.
4      | Clients with a bad credit score tend to churn.
5      | Clients with higher estimated salaries tend to churn.
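
Each hypothesis can be checked by comparing churn rates across groups. A minimal sketch for hypothesis 3, run on a small made-up frame rather than the real data. It assumes the transformed frame exposes `num_of_products` and `exited` columns: `exited` appears in the cells above, while `num_of_products` is an assumed snake_case name.

```python
import pandas as pd

# Toy data for illustration only, not drawn from the churn dataset
toy = pd.DataFrame(
    {
        "num_of_products": [1, 1, 1, 1, 2, 2, 2, 2],
        "exited": [1, 1, 0, 0, 1, 0, 0, 0],
    }
)

# exited is coded 0/1, so the mean within a group is that group's churn rate
churn_rate = toy.groupby("num_of_products")["exited"].mean()
print(churn_rate)
```

On the real data, a lower rate for the groups with more products would be consistent with the hypothesis, although a difference in rates alone says nothing about statistical significance.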

## Multivariate Analysis

In [11]:
eda.correlation_heatmap(univariate_quantitative.columns)
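
How `correlation_heatmap` computes its values is not shown in this notebook. A common basis for this kind of plot is the pairwise Pearson correlation matrix, which pandas provides through `DataFrame.corr`. A minimal sketch on made-up numbers; the column names are assumptions.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame(
    {
        "age": [25, 35, 45, 55, 65],
        "balance": [10.0, 20.0, 30.0, 40.0, 50.0],  # rises linearly with age
        "credit_score": [800, 700, 600, 500, 400],  # falls linearly with age
    }
)

# Square matrix with one row and one column per numeric variable
corr = toy.corr(method="pearson")
print(corr.round(2))
```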

In [12]:
eda.categorical_heatmap(univariate_qualitative.columns)
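
Pearson correlation does not apply to unordered categories. Which statistic `categorical_heatmap` uses is not shown in this notebook; one common choice for the association between two categorical variables is Cramér's V, sketched below under that assumption. The function name `cramers_v` and the toy series are hypothetical, and this version omits bias correction.

```python
import numpy as np
import pandas as pd


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V from the chi-squared statistic of the contingency table.

    Assumes both variables have at least two observed levels.
    """
    table = pd.crosstab(x, y).to_numpy()
    n = table.sum()
    # Expected counts if the two variables were independent
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Toy data for illustration only
gender = pd.Series(["F", "F", "M", "M"])
exited_dependent = pd.Series([1, 1, 0, 0])    # fully determined by gender
exited_independent = pd.Series([0, 1, 0, 1])  # unrelated to gender

print(cramers_v(gender, exited_dependent))
print(cramers_v(gender, exited_independent))
```

The value ranges from 0 (no association) to 1 (one variable fully determines the other), which makes it convenient to display on the same color scale for every pair of columns.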

In [15]:
eda.scatter_plot_matrix(univariate_quantitative.columns)