## CUSTOMER SEGMENTATION

First, we import the necessary libraries.

In [1]:
# for data analysis
import pandas as pd
import numpy as np

import json     # for reading the key inside the json formatted file

# for data visualization
import matplotlib.pyplot as plt

import pyodbc   # for connecting database

> Connecting to Database

The pyodbc library handles the connection between the Jupyter notebook and MS SQL Server. The SQL Server connection string is kept in a JSON file so that it does not appear in the notebook.

In [2]:
with open('log.json') as f:    # the file is closed automatically after reading
    sql_key = json.load(f)     # returns the JSON object as a dictionary

cnxn = pyodbc.connect(sql_key['key'])     # establish a connection
crsr = cnxn.cursor()                      # the cursor is used to send commands

> Sending Queries 

The queries follow the decisions made in the analytics phase. Data for G-type and T-type companies are retrieved separately.

For G (Şahıs: sole proprietorship)

In [None]:
gk_query= """SELECT MUSTERI_ID, ID, CEK_NO, CEK_TUTAR, VADE_GUN, SIRKET_TURU, BK_LIMIT, BK_RISK,
            BK_GECIKMEHESAP, BK_GECIKMEBAKIYE
            FROM dbo.dataset
            WHERE SIRKET_TURU LIKE 'G' """

g_company_type_df = pd.read_sql(gk_query, cnxn)

For T (Tüzel: legal entity)

In [None]:
tk_query = """SELECT MUSTERI_ID, ID, CEK_NO, CEK_TUTAR, VADE_GUN, SIRKET_TURU,
            TK_NAKDILIMIT, TK_NAKDIRISK, TK_GAYRINAKDILIMIT, TK_GAYRINAKDIRISK,
            TK_GECIKMEHESAP, TK_GECIKMEBAKIYE
            FROM dbo.dataset
            WHERE SIRKET_TURU LIKE 'T' """

t_company_type_df = pd.read_sql(tk_query, cnxn)
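
The two queries differ only in the column list and the company type, so they could share one helper. The following is a minimal sketch, not the notebook's method: `load_company_type` and its parameters are hypothetical names, and an in-memory SQLite table stands in for `dbo.dataset` so the example runs without SQL Server. The `?` placeholder style is also the one pyodbc uses.

```python
import sqlite3

import pandas as pd


def load_company_type(connection, company_type, columns, table="dbo.dataset"):
    """Return the requested columns for one SIRKET_TURU value (hypothetical helper)."""
    # Column and table names cannot be bound as parameters, so they are joined
    # into the statement; they must come from a trusted, hard-coded list.
    column_list = ", ".join(columns)
    query = f"SELECT {column_list} FROM {table} WHERE SIRKET_TURU = ?"
    # The company type itself is passed as a bound parameter.
    return pd.read_sql(query, connection, params=(company_type,))


# Stand-in database with made-up rows (assumption: not the real data).
demo_cnxn = sqlite3.connect(":memory:")
demo_cnxn.execute(
    "CREATE TABLE dataset (MUSTERI_ID INTEGER, CEK_TUTAR REAL, SIRKET_TURU TEXT)"
)
demo_cnxn.executemany(
    "INSERT INTO dataset VALUES (?, ?, ?)",
    [(1, 1000.0, "G"), (2, 2500.0, "T"), (3, 400.0, "G")],
)

demo_g_df = load_company_type(
    demo_cnxn, "G", ["MUSTERI_ID", "CEK_TUTAR", "SIRKET_TURU"], table="dataset"
)
print(demo_g_df)
```

With the real connection, the call would pass `cnxn` and keep the default table name.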

### MACHINE LEARNING

Machine learning is used to segment the customer portfolio into clusters based on risk. First, the portfolio is split into two groups, T-type and G-type customers; this step is necessary because the two groups have different attributes. The type of machine learning must also be chosen. The data contain no target label, so every attribute is used as a feature and the problem is an unsupervised one. To provide an accurate solution, the customer portfolio needs to be grouped into segments, so **K-means** clustering is going to be implemented.

Next, the datasets are prepared for clustering. Feature extraction and data scaling must be done before the data are passed to the model. The data will also be divided into training and test sets. The number of clusters is then chosen with the Elbow Method, after which the model is fitted on the training data.

After that, the accuracy of the K-means model must be assessed; since the data have no ground-truth labels, this means measuring clustering quality rather than classification accuracy. Other suitable unsupervised ML techniques will also be compared with K-means. The results of the best-fitting model will be saved to the database, which will be integrated into our data-oriented web application for a strong Business Intelligence presentation.

In summary, our steps are:

*   Feature Extraction
*   Scaling Data
*   Training/Test Set Division
*   Elbow Method
*   Accuracy Calculation
*   Comparing K-means with Other Models
*   Saving Results into Database
*   Presenting Results via Streamlit Web-App
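
The modelling steps in this list can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: `make_blobs` stands in for the real features, three clusters are assumed rather than chosen by the Elbow Method, and the silhouette score is one possible label-free quality measure, not necessarily the one this project will use. All `demo_` names are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the extracted customer features (assumption).
demo_X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)

# Training/test set division.
demo_train, demo_test = train_test_split(demo_X, test_size=0.2, random_state=0)

# Scaling: fit on the training set only, then apply to both sets.
demo_scaler = StandardScaler().fit(demo_train)
demo_train_scaled = demo_scaler.transform(demo_train)
demo_test_scaled = demo_scaler.transform(demo_test)

# K-means with an assumed number of clusters.
demo_model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(demo_train_scaled)

# Assign the held-out customers to the learned clusters.
demo_test_labels = demo_model.predict(demo_test_scaled)

# Label-free quality check: silhouette lies in [-1, 1], higher is better.
demo_silhouette = silhouette_score(demo_train_scaled, demo_model.labels_)
print(f"silhouette on training data: {demo_silhouette:.3f}")
```

Fitting the scaler on the training set alone keeps information from the test set out of the model.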

> Feature Extraction

In [None]:
# ... to be done!
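
Until this step is written, the following sketch shows one shape it might take: aggregating check-level rows into one row per customer. It is purely hypothetical. The column names come from the G-type query above, but their meanings (check amount, maturity in days, bank limit and risk) and the chosen aggregates are assumptions, and the rows are made up.

```python
import pandas as pd

# Made-up check-level rows using column names from the G-type query.
demo_checks = pd.DataFrame({
    "MUSTERI_ID": [1, 1, 2],
    "CEK_TUTAR": [1000.0, 3000.0, 500.0],
    "VADE_GUN": [30, 90, 60],
    "BK_LIMIT": [10000.0, 10000.0, 2000.0],
    "BK_RISK": [2500.0, 2500.0, 1500.0],
})

# One row per customer: totals and averages over that customer's checks.
demo_features = demo_checks.groupby("MUSTERI_ID").agg(
    total_check_amount=("CEK_TUTAR", "sum"),
    mean_maturity_days=("VADE_GUN", "mean"),
    bank_limit=("BK_LIMIT", "max"),
    bank_risk=("BK_RISK", "max"),
)

# Share of the limit in use; a zero limit becomes NaN instead of infinity.
safe_limit = demo_features["bank_limit"].replace(0, float("nan"))
demo_features["limit_utilisation"] = demo_features["bank_risk"] / safe_limit

print(demo_features)
```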

> Scaling Data

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [None]:
scaler.fit(df)      # df does not exist yet; placeholder until feature extraction is done
scaled_data = scaler.transform(df)
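
K-means relies on Euclidean distance, so a column with a large numeric range would dominate the clustering if left unscaled. The small sketch below, on made-up numbers, shows what `StandardScaler` does: each column ends up with mean 0 and standard deviation 1. The `demo_` names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales (e.g. an amount and a day count).
demo_values = np.array([
    [1000.0, 30.0],
    [3000.0, 90.0],
    [5000.0, 60.0],
])

# fit_transform is equivalent to calling fit and then transform on the same data.
demo_scaled = StandardScaler().fit_transform(demo_values)

print(demo_scaled.mean(axis=0))  # per-column mean after scaling
print(demo_scaled.std(axis=0))   # per-column standard deviation after scaling
```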

> Elbow Method

We aim for the most effective segmentation with the smallest number of clusters. The Elbow Method helps find this balance: inertia is plotted against the number of clusters, and the point where the curve stops dropping sharply (the "elbow") suggests a suitable K.

In [11]:
# importing the K-means model from the scikit-learn library
from sklearn.cluster import KMeans

In [None]:
def find_optimal_clusters(df, maximum_K):
    clusters_centers = []   # inertia of the fitted model for each k
    k_values = []           # k values from 1 to maximum_K (inclusive)

    for k in range(1, maximum_K + 1):
        # a fixed random_state keeps the elbow curve reproducible between runs
        kmeans_model = KMeans(n_clusters = k, n_init = 10, random_state = 42)
        kmeans_model.fit(df)

        clusters_centers.append(kmeans_model.inertia_)
        k_values.append(k)

    return clusters_centers, k_values

Note About **Inertia**:

Inertia measures how well a dataset was clustered by K-Means. It is calculated by measuring the distance between each data point and its centroid, squaring this distance, and summing these squares across one cluster. A good model is one with low inertia AND a low number of clusters (K) *-Codecademy*

The `inertia_` attribute in scikit-learn reports this sum over all clusters, i.e. over every data point in the dataset.
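
The definition can be checked by hand. The sketch below, on synthetic points from `make_blobs` (an assumption, not the project data), sums the squared distance from each point to its assigned centroid and compares the result with the model's `inertia_`. The `demo_` names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic points standing in for scaled customer features.
demo_points, _ = make_blobs(n_samples=200, centers=3, random_state=0)
demo_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(demo_points)

# Look up, for every point, the centroid of the cluster it was assigned to.
assigned_centers = demo_kmeans.cluster_centers_[demo_kmeans.labels_]

# Sum of squared point-to-centroid distances over all points and clusters.
manual_inertia = np.sum((demo_points - assigned_centers) ** 2)

print(manual_inertia, demo_kmeans.inertia_)
```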

Next, we plot the elbow curve for the data.

In [None]:
def generate_elbow_plot(clusters_centers, k_values):
    plt.figure(figsize = (12, 6))   # create the figure that the plt calls below draw on
    plt.plot(k_values, clusters_centers, 'o-', color = 'blue')
    plt.xlabel("Number of Clusters")
    plt.ylabel("Cluster Inertia")
    plt.title("Elbow Plot of Model")
    plt.show()

In [None]:
clusters_centers, k_values = find_optimal_clusters(df, 16)
generate_elbow_plot(clusters_centers, k_values)