# Mel Schwan, Stuart Miller, Justin Howard, Paul Adams
# Lab Three: Clustering, Association Rules, or Recommenders
## Capstone: Association Rule Mining, Clustering, or Collaborative Filtering

### Lab3 Project Requirements
1. [Business understanding](#Businessunderstanding)
    1. [Describe the purpose of the data set you selected](#Assessthecurrentsituation)
    2. [Describe how you would define and measure the outcomes from the dataset](#CostBenefit)
    3. [How would you measure the effectiveness of a good prediction algorithm](#Desiredoutputs)
  
2. [Data Understanding](#Dataunderstanding)
    1. [Describe the meaning and type of data for each attribute in the data file](#Describedata)
    2. [Verify data quality: Explain any missing values, duplicate data, and outliers](#Datareport)
    3. [Give simple, appropriate statistics (range, mode, mean, median, variance, counts, etc.) for the most important attributes and describe what they mean](#Stats)
    4. [Visualize the most important attributes appropriately (at least 5 attributes)](#Distributions)
    5. [Explore relationships between attributes: Look at the attributes via scatter plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate](#Correlations)
    6. [Identify and explain interesting relationships between features and the class you are trying to predict](#relationships)
    7. [Are there other features that could be added to the data or created from existing features? Which ones?](#Featurecreation)
    8. [Outlier Removal](#OutlierRemoval)

3. [Modeling and Evaluation](#Model)
    1. [Option A: KMeans Cluster Analysis](#KMeans)
    2. [Option B: t-SNE Cluster Analysis](#t-SNE)
    
4. [Deployment](#Deployment)
    1. [How useful is your model for interested parties? ](#Useful)
    2. [How would you measure the model's value if it was used by these parties?](#Value)
    3. [How would you deploy your model for interested parties?](#Deploy)
    4. [How often would your model need to be updated?](#Update)
    5. [What other data should be collected? ](#Collect)

A1. [Model Hyperparameter Tuning Details](#A2)


Our project follows a hybrid methodology that combines the expectations of the grading rubric with the CRISP-DM framework. CRISP-DM stands for the Cross-Industry Standard Process for Data Mining, which provides a structured approach to planning a data mining project. It is a robust and well-proven methodology.

We are continuing the data cleaning and preparation loop to prepare the dataset for cluster analysis.

For the final lab, we have chosen to test different models for clustering our dataset. The two approaches we have taken are:
* [Model A: KMeans Clustering](#KMeans)
* [Model B: t-SNE Clustering](#t-SNE)


<img src="../_images/crisps-dm3.png" style="width:550px;height:450px"/>


# 1. Stage One - Determine Business Objectives and Assess the Situation  <a class="anchor" id="Businessunderstanding"></a>
We will use the Home Credit Default Risk dataset made available on Kaggle to develop a useful model that predicts loan defaults for a majority of the loan applicants whose population is defined by the given training and test datasets. Predicting loan defaults is essential to the profitability of banks and, given the competitive nature of the loan market, a bank that collects the right data can offer and service more loans. This analysis of Home Credit's Default Risk dataset will focus on generating accurate loan default risk probabilities, identifying sub-populations among the given applicants, and finally, the most critical factors that indicate that an applicant will likely default on their loan.

## 1.1 Assess the Current Situation (Q1A)<a class="anchor" id="Assessthecurrentsituation"></a>
Home Credit is an international non-bank financial institution that operates in 10 countries and focuses on lending to people with little or no credit history. The institution has served 11 million customers, is based in Czechia, and is a significant consumer lender in most of the Commonwealth of Independent States countries, especially Russia. Recently, it has established a presence in China and the United States. The dataset provided is extensive, representing 307,511 applications from various locations.

The data types vary in scale and type, from time-series credit histories to demographic indicators. Our analysis will focus on two datasets, data collected in the application train and test datasets, and several engineered features gathered from the millions of credit bureau records for each loan applicant.

### 1.1.2. Measuring the effectiveness of a good algorithm <a class="anchor" id="Requirements"></a>
#### 1. Effective Clustering Metric: Cluster Validity

The dataset contains 326 attributes for 307,511 loan applicants. An algorithm that clusters the dataset with a high degree of cluster validity will be difficult to obtain due to the high space-time complexity of clustering algorithms at our disposal. To reduce the space-time complexity of the dataset, we have elected to perform K-Means clustering on the first two principal components.
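
The reduction strategy described above can be sketched end to end on synthetic data. This is a minimal illustration rather than the notebook's actual pipeline: the array `X_demo`, its shape, and the choice of 4 clusters are assumptions made only for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the applicant feature matrix (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 20))

# Standardize, then keep only the first two principal components.
X_std = StandardScaler().fit_transform(X_demo)
X_2pc = PCA(n_components=2, random_state=0).fit_transform(X_std)

# Cluster in the reduced two-dimensional space.
km_demo = KMeans(n_clusters=4, n_init=10, random_state=0)
labels_demo = km_demo.fit_predict(X_2pc)

# Shape of the reduced data and the number of points per cluster.
print(X_2pc.shape, np.bincount(labels_demo))
```

Clustering on two columns instead of hundreds keeps each K-Means distance computation cheap, at the cost of ignoring whatever variance the remaining components carry.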

- **Business success criteria**

- **Data mining success criteria**

#### 2. Effective Association Rule Determination

- **Business success criteria**

- **Data mining success criteria**

#### 3. Effective Collaborative Filtering

- **Business success criteria**

- **Data mining success criteria**

# 2. Stage  Two - Data Understanding <a class="anchor" id="Dataunderstanding"></a>


## 2.1 Initial Data Report (Q2) <a class="anchor" id="Datareport"></a>
Our data comes from the [Kaggle Home Credit Default Risk](https://www.kaggle.com/c/home-credit-default-risk/overview) competition website. 

Our analysis features the use of several Python libraries, such as Pandas, in addition to a custom data cleaning script for both the `application` and `bureau` datasets. 

In [None]:
# Import Libraries Required.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns
from IPython.display import IFrame

# import custom code
from project_code.cleaning import read_clean_data, missing_values_table, load_bureau, create_newFeatures, fill_occupation_type
from project_code.tables import count_values_table

# some defaults
pd_max_rows_default = 60

#removing warnings
import warnings
warnings.simplefilter('ignore')

In [None]:
# load data 
# path =  './application_train.csv'
# note that XNA is an encoding for NA; interpret it as np.nan
df =  pd.read_csv('./application_train.csv', na_values = ['XNA'])
#loading bureau dataset

bureau = pd.read_csv('./bureau.csv', na_values = ['XNA'])

In [None]:
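# PCA is needed for this cell; std_df (the standardized dataset) and random_state
# are assumed to be defined in an earlier cell
from sklearn.decomposition import PCA

pca = PCA(random_state = random_state).fit(std_df)
exp_var = np.cumsum(pca.explained_variance_ratio_)

plt.figure()
# x-axis starts at 1 so that it reads as a component count
plt.plot(np.arange(1, len(exp_var) + 1), exp_var)
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance (proportion)')
plt.title('Standardized Dataset Explained Variance')
plt.show()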

To avoid inserting random noise from the principal component analysis, we will use the minimum number of Principal Components to arrive at a cumulative variance of 99%.
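
As a hedged illustration of this selection rule, the sketch below picks the smallest number of components whose cumulative explained variance reaches 99%, first by hand and then by passing a float to `n_components`, which scikit-learn interprets as a variance threshold. The synthetic matrix `X_demo` and its latent-factor structure are assumptions for the example only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 30 features driven by 5 latent factors plus small noise.
latent = rng.normal(size=(1000, 5))
X_demo = latent @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(1000, 30))

# Manual selection from the cumulative explained variance.
full_pca = PCA().fit(X_demo)
cum_var = np.cumsum(full_pca.explained_variance_ratio_)
n_keep = int(np.argmax(cum_var >= 0.99)) + 1  # argmax is zero-based

# Shortcut: a float n_components is treated as a variance threshold.
auto_pca = PCA(n_components=0.99, svd_solver="full").fit(X_demo)

print(n_keep, auto_pca.n_components_)
```

The `+ 1` matters: the position returned by `argmax` is zero-based, so it is one less than the number of components retained.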

In [None]:
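idx = np.where(exp_var >= .99)
# idx[0][0] is a zero-based position, so the component count is one more than it
n_components = idx[0][0] + 1
print('Optimal Number of Principal Components: ', n_components)
# .loc slicing is inclusive, so columns 0 through 87 keep the first 88 components;
# X_pca is assumed to already hold the projected data as a DataFrame
X_pca = np.array(X_pca.loc[:,:87])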

## 2.2 Verify data quality: Explain any missing values, duplicate data, and outliers (Q2B) <a class="anchor" id="Datareport"></a>
We will use two of the files from the total dataset.

* `application_train.csv`: Information provided with each loan application
* `bureau.csv`: Information regarding clients from the credit bureaus

The two data files can be joined on the loan ID (`SK_ID_CURR`).
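
A minimal sketch of such a join is shown below, using two tiny hand-made frames. Only the key `SK_ID_CURR` is taken from the text above; the other column names, the values, and the decision to aggregate the bureau records before merging are illustrative assumptions, made because one applicant can have several bureau records.

```python
import pandas as pd

# Tiny hypothetical extracts of the two files.
app = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "AMT_INCOME_TOTAL": [100.0, 200.0, 150.0],
})
bur = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 3],
    "AMT_CREDIT_SUM": [50.0, 70.0, 20.0],
})

# Collapse the bureau records to one row per applicant before joining.
bur_agg = (
    bur.groupby("SK_ID_CURR", as_index=False)
       .agg(bureau_credit_total=("AMT_CREDIT_SUM", "sum"),
            bureau_record_count=("AMT_CREDIT_SUM", "count"))
)

# A left join keeps applicants that have no bureau history (values become NaN).
merged = app.merge(bur_agg, on="SK_ID_CURR", how="left")
print(merged)
```

Aggregating first keeps the merged table at one row per application, which is the shape the clustering steps expect.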


## 2.3 Give simple, appropriate statistics (range, mode, mean, median, variance, counts, etc.) for the most important attributes and describe what they mean (Q2C) <a class="anchor" id="Stats"></a>



#### Data Cleaning Script

All the cleaning discussed in the sections above is implemented in `cleaning.py`.
This script contains a function (`read_clean_data`) to apply the cleaning steps and return the cleaned dataset for work.

**Details**  
* Cleaning
  * Read csv with Pandas (setting correct data types)
  * Drop columns that will not be used
  * Recode NA values that are not listed as np.nan
  * Formatting
  * Encode categorical variables
* Returns
  * DataFrame with cleaned data


## 2.4 Visualize the most important attributes appropriately (Q2D) <a class="anchor" id="Visualize"></a>

info

-


### 2.4.1 Distributions  <a class="anchor" id="Distributions"></a>

This part of the exploration focused on the use of box plots and histograms for visualizing continuous variables and bar charts for visualizing categorical variables. These graphical formats permit the easy identification of skewness and help us identify outliers. 

Variables we expect to be important were selected for univariate visualization, such as `AMT_INCOME_TOTAL` and `AMT_ANNUITY`. The histogram shows the features of the main distribution, while the behavior of the tails and extreme values is shown in the boxplots. The distribution of incomes is extremely long-tailed and right-skewed, as can be expected with almost any income distribution. 

Boxplots indicate a large number of outliers in both the `AMT_INCOME_TOTAL` and `AMT_ANNUITY` features and one extreme outlier that will require closer examination. A comparison of each distribution in its raw and transformed form is displayed below.
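
To make the effect of a transformation on skewness and boxplot outliers concrete, here is a small sketch on synthetic, right-skewed data. The series `income` is a hypothetical lognormal stand-in and not the real `AMT_INCOME_TOTAL` column, and the 1.5 × IQR rule is the usual boxplot whisker convention rather than a rule taken from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical right-skewed "income" values.
income = pd.Series(rng.lognormal(mean=12.0, sigma=0.6, size=5000), name="income_demo")

def iqr_outlier_count(s):
    """Count the points a boxplot would draw beyond its 1.5 * IQR whiskers."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())

log_income = np.log(income)

print("skew raw:", income.skew(), "skew log:", log_income.skew())
print("outliers raw:", iqr_outlier_count(income),
      "outliers log:", iqr_outlier_count(log_income))
```

On data like this, the log transform pulls the long right tail in, so both the skewness and the number of flagged points drop.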

## 2.5 Explore relationships between attributes: Look at the attributes via scatter plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate (Q2E) <a class="anchor" id="Correlations"></a>


info

## 2.6 Identify and explain interesting relationships between features and the class you are trying to predict (Q2F) <a class="anchor" id="relationships"></a>

Info

## 2.7 Are there other features that could be added to the data or created from existing features (Q2G) <a class="anchor" id="Featurecreation"></a>
Info

# 3. Stage  Three - Modeling and Evaluation <a class="anchor" id="Model"></a>

## 3.1 Option A: Cluster Analysis (Q3A)<a class="anchor" id="Cluster"></a>

**Clustering using K-Means**<a class="anchor" id="KMeans"></a>

With 307,511 records, the dataset restricts the tools we can use to arrive at useful clusters without discarding a large share of the records. We elected to run K-Means on the first two principal component values for each record, which drastically reduces the dimensionality of the dataset while retaining the two directions of greatest variance.

In [None]:
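# imports needed for clustering and plotting
from sklearn.cluster import KMeans
from matplotlib.colors import ListedColormap

# reducing dataset to 2 PCs
X_pca2 = X_pca[:,:2]
# conducting a 2 cluster KMeans operation to create benchmark visualization
# (n_jobs is omitted because recent scikit-learn releases no longer accept it for KMeans)
km = KMeans(n_clusters = 2,
            init = 'random',
            n_init = 10,
            max_iter = 300,
            tol = .0001,
            random_state = random_state)
y_km = km.fit_predict(X_pca2)

# y is assumed to be a DataFrame holding the TARGET column
markers = ('s','x', 'o', '^', 'v')
colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
cmap = ListedColormap(colors[:len(np.unique(y))])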

plt.figure(figsize = (14,7))
plt.subplot(1,2,1)
for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x = X_pca2[y.TARGET == cl, 0],
                    y = X_pca2[y.TARGET == cl, 1],
                    alpha = .4,
                    color = cmap(idx),
                    edgecolor = 'black',
                    marker = markers[idx],
                    label =cl)
plt.title('PCA of application_train Dataset')
plt.legend(scatterpoints = 1, loc = 'lower right')
plt.xlabel('PC1')
plt.ylabel('PC2')


plt.subplot(1,2,2)
plt.scatter(X_pca2[y_km == 0, 0],
           X_pca2[y_km == 0 , 1],
           s = 50, c = 'lightgreen',
           marker = 's', edgecolor = 'black', 
           label = 'Cluster 1')

plt.scatter(X_pca2[y_km == 1, 0],
           X_pca2[y_km == 1, 1],
           s = 50, c = 'orange',
           marker = 'o', edgecolor = 'black', 
           label = 'Cluster 2')
plt.scatter(km.cluster_centers_[:,0],
           km.cluster_centers_[:,1],
           s = 250,
           marker = '*',
            c = 'red',
           label = 'Centroids')
plt.legend(scatterpoints = 1, loc = 'lower right')
plt.title('K-Means with 2 Clusters')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.tight_layout()
plt.show()

**Significant Findings**

The initial K-Means clustering made it clear that two clusters is unlikely to be the ideal number, so an elbow plot was created using the within-cluster sum of squared errors for cluster counts from 1 to 12.

In [None]:
SSEs = []

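for i in range(1,13):
    # n_jobs is omitted because recent scikit-learn releases no longer accept it for KMeans
    km = KMeans(n_clusters = i,
               init = 'k-means++',
               n_init= 10,
               max_iter= 300,
               random_state = random_state)
    km.fit(X_pca2)
    # inertia_ is the within-cluster sum of squared distances
    SSEs.append(km.inertia_)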


   
plt.figure(figsize = (10,6))
plt.plot(range(1,13), SSEs, marker = 'o')
plt.xlabel('Number of Clusters')
plt.ylabel('Within-Cluster SSE')
plt.tight_layout()
plt.title('Elbow Plot: SSE Reduction by Cluster Number')
plt.show()

**Significant Findings**

We elect to describe the data using 5 clusters and to create new features that help describe these clusters of applicants. To begin our description of the 5 chosen clusters, we first visualize them.

In [None]:
km = KMeans(n_clusters = 5,
            init = 'k-means++',
            n_init = 10,
            max_iter = 300,
            tol = .0001,
           random_state = 1)
y_km = km.fit_predict(X_pca2)

In [None]:
plt.figure(figsize = (10,6))
plt.scatter(X_pca2[y_km == 0, 0],
           X_pca2[y_km == 0 , 1],
           s = 30, c = 'lightgreen',
           marker = 's', edgecolor = 'black', 
           label = 'Cluster 1')

plt.scatter(X_pca2[y_km == 1, 0],
           X_pca2[y_km == 1, 1],
           s = 30, c = 'orange',
           marker = 'o', edgecolor = 'black', 
           label = 'Cluster 2')

plt.scatter(X_pca2[y_km == 2, 0],
           X_pca2[y_km == 2, 1],
           s = 50, c = 'yellow',
           marker = '*', edgecolor = 'black', 
           label = 'Cluster 3') 
plt.scatter(X_pca2[y_km == 3, 0],
           X_pca2[y_km == 3, 1],
           s = 50, c = 'lightblue',
           marker = '^', edgecolor = 'black', 
           label = 'Cluster 4') 
plt.scatter(X_pca2[y_km == 4, 0],
           X_pca2[y_km == 4, 1],
           s = 40, c = 'pink',
           marker = 'd', edgecolor = 'black', 
           label = 'Cluster 5') 
plt.scatter(km.cluster_centers_[:,0],
           km.cluster_centers_[:,1],
           s = 250,
           marker = '*',
            c = 'red',
           label = 'Centroids')

plt.legend(scatterpoints = 1, loc = 'lower right')
plt.title('K-Means: 5 Clusters')
plt.xlabel('PC1')
plt.ylabel("PC2")
plt.tight_layout()
plt.show()

In [None]:
# count the members of each of the 5 clusters (labels 0-4 map to Clusters 1-5)
cluster_sizes = np.bincount(y_km, minlength = 5)

for cluster, size in enumerate(cluster_sizes, start = 1):
    print('Cluster', cluster, 'size:', size)

**Significant Findings**

We see that our 5 clusters vary in their density and size. These clusters were formed based on the Euclidean distance between records in the space of the first two principal components of the standardized dataset. 

Because each principal component is a linear combination of all the original attributes, it is nearly impossible to interpret the characteristics of the 5 clusters directly. 
To clarify what makes these clusters different, we can add a cluster feature to the raw dataset and conduct a Linear Discriminant Analysis (LDA). Since this analysis is purely descriptive, a finely tuned model is not deemed necessary. 

To stay aligned with the assumptions of LDA, log transformations will be applied to all predictor variables, excluding the `EXT_SOURCE` variables and the binary indicator variables. We chose to exclude the `EXT_SOURCE` variables from transformations because they are already transformed.
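
The descriptive LDA step could look roughly like the sketch below. It is an assumption-laden illustration, not the analysis itself: the data are synthetic, the feature names are hypothetical, and `np.log1p` stands in for whichever log transformation is finally applied.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]  # hypothetical names
X_raw = rng.lognormal(size=(600, 4))  # positive, right-skewed stand-in data

# Log-transform and standardize the predictors.
X_std = StandardScaler().fit_transform(np.log1p(X_raw))

# Derive cluster labels the same way as above: K-Means on the first two PCs.
X_2pc = PCA(n_components=2, random_state=0).fit_transform(X_std)
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_2pc)

# Fit LDA with the cluster label as the class to be described.
lda = LinearDiscriminantAnalysis().fit(X_std, clusters)

# coef_ has one row per cluster; large absolute weights flag the features
# that most separate that cluster from the others.
for cluster_id, weights in enumerate(lda.coef_):
    top = feature_names[int(np.argmax(np.abs(weights)))]
    print("cluster", cluster_id, "most influential feature:", top)
```

Reading the coefficients on the original (transformed) attributes is what restores interpretability that the principal components alone do not offer.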

In [None]:
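# imports needed for the silhouette analysis
import matplotlib.cm as cm
from sklearn.metrics import silhouette_samples, silhouette_score

#reducing dataset size to 20% to reduce the computational time of silhouette sampling
tiny_pca = pd.DataFrame(X_pca2)
tiny_pca = tiny_pca.sample(frac= 0.2, replace = False, random_state=1)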


range_n_clusters = [2,3,4,5,6,7]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(tiny_pca) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 1 for reproducibility (n_jobs is omitted because recent
    # scikit-learn releases no longer accept it for KMeans).
    clusterer = KMeans(n_clusters=n_clusters, 
                       init = 'k-means++',
                       random_state=1)
    cluster_labels = clusterer.fit_predict(tiny_pca)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(tiny_pca, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(tiny_pca, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(tiny_pca.iloc[:, 0], tiny_pca.iloc[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()

In [None]:
# refit with the chosen 4 clusters (n_jobs is omitted for compatibility with recent scikit-learn)
km = KMeans(n_clusters = 4,
            init = 'k-means++',
            random_state=1).fit(tiny_pca)

**Significant Findings**

- The silhouette scores provided additional insight into the validity of our clustering strategy. While 2 clusters provide the highest cluster validity, the significant imbalance of the clusters led us to choose a less optimal clustering in exchange for gains in the group membership balance. Rather than the 5 cluster arrangement, we elect to proceed with a 4 cluster arrangement because of its higher silhouette score.

- The combination of PCA and K-Means led to the establishment of 4 clusters with an acceptable cluster validity based on an average silhouette score of 0.57. 

- A major limitation of this strategy is the uneven density of the clusters.
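
The trade-off between silhouette score and cluster balance described in the findings above can be tabulated directly. The following is a sketch on synthetic blobs; the data, the range of cluster counts, and the "smallest cluster share" balance measure are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical two-dimensional data standing in for the first two PCs.
X_demo, _ = make_blobs(n_samples=600, centers=4, cluster_std=1.0, random_state=0)

results = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    sizes = np.bincount(labels, minlength=k)
    results[k] = {
        "silhouette": silhouette_score(X_demo, labels),
        # Share of points in the smallest cluster; 1/k means perfectly balanced.
        "smallest_share": sizes.min() / sizes.sum(),
    }
    print(k, results[k])
```

Looking at both columns side by side makes the choice explicit: a cluster count is preferred when its silhouette score stays acceptable and its smallest cluster is not vanishingly small.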

## t-SNE<a class="anchor" id="t-SNE"></a>

A limitation of using only the principal components of our dataset is that the Euclidean distances between points in the space of the first two components vary widely across the dataset, making an accurate interpretation of the clustering difficult at best. To address this problem, we will use t-distributed Stochastic Neighbor Embedding (t-SNE). This technique mitigates the density problem by adapting its neighborhood size to the local density around each point, which tends to distribute the embedded points more evenly across each dimension.

Methodology:
- Our input will be the reduced `X_pca` dataset, which consists of the 88 principal components that represent 99% of the variance in the dataset.
- We must down-sample this dataset by 80% to reduce the time complexity of the algorithms.

In [None]:
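#downsampling X_pca
dfX_pca = pd.DataFrame(X_pca)
# random_state = 1 matches the earlier 20% sample, so the same rows are drawn
# and the K-Means labels fitted on that sample line up with this one
tiny_pca = dfX_pca.sample(frac = .2, replace =False, random_state = 1)
tiny_pca.shape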

### t-SNE Hyperparameter Tuning

In [None]:
from sklearn.manifold import TSNE

# embeddings keyed by perplexity value
tsne_res = dict()
for i in range(5, 55, 5):
    tsne = TSNE(n_components = 2, random_state = 42, perplexity = i, n_jobs =-1)
    tsne_res[i] = tsne.fit_transform(tiny_pca)

In [None]:
fig, ax = plt.subplots(2,5, figsize = (50,10));
fig.suptitle('Perplexity Plots');

for i, key in enumerate(tsne_res.keys()):
    if i % 2 == 0:
        ax[0, i // 2].scatter(
            tsne_res[key][:, 0],
            tsne_res[key][:, 1],
            s = 0.5
        )
    else:
        ax[1, (i - 1) // 2].scatter(
            tsne_res[key][:, 0],
            tsne_res[key][:, 1],
            s = 0.5
        )

As a benchmark, we will visualize where the t-SNE method disperses the K-Means labeled points.

In [None]:
tsne_res[50].shape

In [None]:
tsne = TSNE(n_components = 2, random_state=random_state, perplexity = 50, n_jobs = -1)
tsne_comp = tsne.fit_transform(tiny_pca)

In [None]:
def tsne_clusters(tsne, labels, s = 0.5, figsize = (10,10)):
    """Plot a 2D t-SNE embedding, colored by cluster label."""
    num_classes = len(np.unique(labels))
    palette = np.array(sns.color_palette("hls", num_classes))
    plt.figure(figsize = figsize)
    plt.scatter(tsne[:, 0], tsne[:, 1], c = palette[labels], s = s)
    plt.show()

# color the t-SNE embedding by the 4-cluster K-Means labels
tsne_clusters(tsne_comp, km.labels_)

In [None]:
## Implementing DBSCAN to take advantage of the variance in the densities

# lets first look at the connectivity of the graphs and distance to the nearest neighbors
from sklearn.neighbors import kneighbors_graph

#=======================================================
# CHANGE THESE VALUES TO ADJUST MINPTS FOR EACH DATASET
minpts= 2500

#=======================================================
# create the k-nearest-neighbor distance graph used to choose eps
knn_graph = kneighbors_graph(tsne_res[50], 
                             minpts,
                             mode='distance',
                             n_jobs = -1) # distances to the minpts nearest neighbors

# the largest stored distance in each row is the distance to the minpts-th neighbor
N2 = knn_graph.shape[0]
nn_distances = knn_graph.max(axis=1).toarray()

nn_distances = np.sort(nn_distances, axis=0)

plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
plt.plot(range(N2), nn_distances, 'r.', markersize=2) #plot the data
plt.title('Dataset name: tsne_res(50), sorted by neighbor distance')
plt.xlabel('X2, Instance Number')
plt.ylabel('X2, Distance to {0}th nearest neighbor'.format(minpts))
plt.grid()

plt.show()


In [None]:
from sklearn.cluster import DBSCAN

#=====================================
# MINPTS AND EPS PAIRS TO EVALUATE FOR EACH DATASET
minpts = [2100,2500]
eps = [6.55, 7.7]
#=====================================

for e,m in zip(eps, minpts):

    db = DBSCAN(eps=e, 
                min_samples=m,
                leaf_size = 41,
                n_jobs = -1,
                metric = 'euclidean',
               ).fit(tsne_res[50])
    labels = db.labels_

    # Number of clusters in labels, ignoring noise if present.
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

    # mark the samples that are considered "core"
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True

    plt.figure(figsize=(15,8))
    unique_labels = set(labels) # the unique labels
    colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Black used for noise.
            col = 'k'

        class_member_mask = (labels == k)

        xy = tsne_res[50][class_member_mask & core_samples_mask]
        # plot the core points in this class
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor='w', markersize=6)

        # plot the remaining points that are edge points
        xy = tsne_res[50][class_member_mask & ~core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
                 markeredgecolor='w', markersize=3)

    plt.title('Estimated number of clusters: %d' % n_clusters_)
    plt.grid()
    plt.show()

## 3.2 Option B: Association Rule Mining (Q3B)<a class="anchor" id="Rule_mining"></a>

Info

## 3.3 Option C: Collaborative Filtering (Q3C)<a class="anchor" id="Collaborative"></a>

Info

# 4. Stage Four - Deployment (Q3) <a class="anchor" id="Deployment"></a>

## 4.1 Next Stage Deployment <a class="anchor" id="Deployment"></a>

The next stage in CRISP-DM is deployment. After model building and evaluation, we are ready to deploy our code representation of the model into a production environment and solve our original business problem.
Our business problem is to give Home Credit loan evaluators access to a model that evaluates an applicant’s current and past financial history in determining whether to approve the requested loan.

#### How useful is your model for interested parties?<a class="anchor" id="Useful"></a>

We believe this model would be useful for the loan evaluators in a loan department. This is contingent on discovering how some of the external variables are created. With this information and some additional consistency in the top-scoring features, we could improve the accuracy.

#### How would you measure the model's value if it was used by these parties?<a class="anchor" id="Value"></a>

This model should be tested in parallel with the present evaluation process. After a set period of time, the accuracy of the human evaluations would be compared with that of the model. If the results were the same, the minimum resulting savings would be the salaries of the loan evaluators. Added value would result from the acceleration of the loan approval process.

#### How would you deploy your model for interested parties?<a class="anchor" id="Deploy"></a>

Depending on the resources available, models can be deployed for batch or real-time predictions. Home Credit's current process is a batch implementation. The applicant fills out the form, which is then digitized and sent to the loan approval department. During loan approval, the collected data will need to be cleaned and normalized before being processed through the machine learning predictive model.
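
One way to keep the normalization and the model together for batch scoring is a single scikit-learn pipeline that is fitted once and then applied to each new batch. The sketch below is a hedged illustration only: the arrays are synthetic, the name `segmenter` is hypothetical, and the steps reuse the clustering approach from this notebook rather than any confirmed Home Credit process.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 10))  # hypothetical cleaned training features
X_batch = rng.normal(size=(25, 10))   # hypothetical new batch of applications

# Normalization, dimensionality reduction, and clustering in one object.
segmenter = make_pipeline(
    StandardScaler(),
    PCA(n_components=2, random_state=0),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)
segmenter.fit(X_train)

# Each scheduled batch is passed through exactly the same fitted steps.
batch_segments = segmenter.predict(X_batch)
print(batch_segments)
```

Because the scaler and PCA are fitted only on the training data, every batch is transformed identically, which supports the consistency that a batch process is meant to provide.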

#### How often would the model need to be updated?<a class="anchor" id="Update"></a>

The metrics for each customer are continually updated. On a scheduled cycle, this new dataset is analyzed through the current model. Running batch cycles allows for consistency in whether an applicant is approved or not. However, model designs will change as newer technology becomes available or as changing business environments require reengineering.

#### What other data should be collected?<a class="anchor" id="Collect"></a>

In addition to the clarification of data we have described above, there would be tremendous value in the financials for the industry in which the individual works. These financials could assist in predicting any future concerns for the applicant's present income.

Below is a typical example of a batch deployment \[1\].

<img src="../_images/batchdeployment_process.png" style="width:800px;height:375px"/>



## References

\[1\] J. Kervizic, "Overview of Different Approaches to Deploying Machine Learning Models in Production," June 2019.
Accessed on: Feb. 15, 2019. \[Online\].
Available: https://www.kdnuggets.com/2019/06/approaches-deploying-machine-learning-production.html

# A1. Parameter Tuning <a class="anchor" id="A2"></a>

The code and output of model tuning are shown below.