In [1]:
from jenkspy import JenksNaturalBreaks
import pandas as pd
import numpy as np
import time
import os
import pickle
from sklearn.feature_selection import RFECV
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import PredefinedSplit
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

In [2]:
if (
        os.path.isfile('./data/X_train.csv') and os.path.isfile('./data/X_test.csv') and
        os.path.isfile('./data/y_train.csv') and os.path.isfile('./data/y_test.csv') and
        os.path.isfile('./data/X_validate.csv') and os.path.isfile('./data/y_validate.csv')
    ):
    print("generate from files")
    X_train = pd.read_csv('./data/X_train.csv', index_col = 'index')
    X_train.index.name = None
    X_test = pd.read_csv('./data/X_test.csv', index_col = 'index')
    X_test.index.name = None
    X_validate = pd.read_csv('./data/X_validate.csv', index_col = 'index')
    X_validate.index.name = None
    y_train = pd.read_csv('./data/y_train.csv', index_col = 'index')
    y_train.index.name = None
    y_test = pd.read_csv('./data/y_test.csv', index_col = 'index')
    y_test.index.name = None
    y_validate = pd.read_csv('./data/y_validate.csv', index_col = 'index')
    y_validate.index.name = None
else:
    print("=== NO data to load ===")

generate from files


# Modeling
This is the point at which your hard work begins to pay off. The data you spent time preparing are brought into the analysis tools in IBM SPSS Modeler, and the results begin to shed some light on the business problem posed during Business Understanding. Modeling is usually conducted in multiple iterations. Typically, data miners run several models using the default parameters and then fine-tune the parameters or revert to the data preparation phase for manipulations required by their model of choice. It is rare for an organization's data mining question to be answered satisfactorily with a single model and a single execution. This is what makes data mining so interesting--there are many ways to look at a given problem, and IBM SPSS Modeler offers a wide variety of tools to help you do so.

## Selecting Modeling Technique
Although you may already have some idea about which types of modeling are most appropriate for your organization's needs, now is the time to make some firm decisions about which ones to use. Determining the most appropriate model will typically be based on the following considerations: 
* The data types available for mining. For example, are the fields of interest categorical (symbolic)? 
* Your data mining goals. Do you simply want to gain insight into transactional data stores and unearth interesting purchase patterns? Or do you need to produce a score indicating, for example, propensity to default on a student loan? 
* Specific modeling requirements. Does the model require a particular data size or type? Do you need a model with easily presentable results? 

For more information on the model types in IBM SPSS Modeler and their requirements, see the IBM SPSS Modeler documentation or online Help.

### Choosing the Right Modeling Techniques 

Many modeling techniques are available in IBM SPSS Modeler. Frequently, data miners use more than one to approach the problem from a number of directions.

Task List

When deciding on which model(s) to use, consider whether the following issues have an impact on your choices:

* Does the model require the data to be split into test and training sets?
* Do you have enough data to produce reliable results for a given model?
* Does the model require a certain level of data quality? Can you meet this level with the current data? 
* Are your data the proper type for a particular model? If not, can you make the necessary conversions using data manipulation nodes? 
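
In a Python notebook, the data type and data quality questions above can be checked before committing to a model. The following is a minimal sketch with pandas; the small `df` frame and its column names are made-up stand-ins (an assumption, not the project data), and the same summary could be computed on a frame such as `X_train`.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; replace with the real feature table
df = pd.DataFrame({
    "age": [34, 51, np.nan, 29],
    "segment": ["a", "b", "b", None],
    "balance": [1200.5, 310.0, 87.2, 990.1],
})

# One row per column: storage type, number of missing values, number of distinct values
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_missing": df.isna().sum(),
    "n_unique": df.nunique(),
})

# Columns that would need encoding before most scikit-learn estimators accept them
non_numeric = df.select_dtypes(exclude="number").columns.tolist()

print(summary)
print(non_numeric)
```

Columns listed in `non_numeric` are candidates for conversion, and a high `n_missing` count signals a data quality issue to resolve or document.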


### Modeling Assumptions 

As you begin to narrow down your modeling tools of choice, take notes on the decision-making process. Document any data assumptions as well as any data manipulations made to meet the model's requirements. For example, both the Logistic Regression and Neural Net nodes require the data types to be fully instantiated (data types are known) before execution. This means you will need to add a Type node to the stream and execute it to run the data through before building and running a model. Similarly, predictive models, such as C5.0, may benefit from rebalancing the data when predicting rules for rare events. When making this type of prediction, you can often get better results by inserting a Balance node into the stream and feeding the more balanced subset into the model. Be sure to document these types of decisions.
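
Outside IBM SPSS Modeler, the effect of a Balance node can be approximated by downsampling the more frequent classes. The following is a minimal sketch using NumPy on synthetic labels; the helper name `downsample_majority` and the data are assumptions made for illustration, not part of the original stream.

```python
import numpy as np

def downsample_majority(X, y, seed=0):
    """Return a subset of (X, y) in which every class has as many rows as the rarest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        # Sample without replacement so no row is duplicated
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]

# Synthetic data: 95 common cases and 5 rare ones
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 95 + [1] * 5)

X_bal, y_bal = downsample_majority(X, y)
print(np.bincount(y_bal))
```

Downsampling discards data, so it should be applied to the training partition only and recorded alongside the other modeling assumptions.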

## Generating a Test Design 
As a final step before actually building the model, you should take a moment to consider again how the model's results will be tested. There are two parts to generating a comprehensive test design: 
* Describing the criteria for "goodness" of a model 
* Defining the data on which these criteria will be tested 

A model's goodness can be measured in several ways. For supervised models, such as C5.0 and C&R Tree, measurements of goodness typically estimate the error rate of a particular model. For unsupervised models, such as Kohonen cluster nets, measurements may include criteria such as ease of interpretation, deployment, or required processing time. Remember, model building is an iterative process. This means that you will typically test the results of several models before deciding on the ones to use and deploy. 

### Writing a Test Design 

The test design is a description of the steps you will take to test the models produced. Because modeling is an iterative process, it is important to know when to stop adjusting parameters and try another method or model.

Task List

When creating a test design, consider the following questions:
* What data will be used to test the models? Have you partitioned the data into train/test sets? (This is a commonly used approach in modeling.) 
* How might you measure the success of supervised models (such as C5.0)?
* How might you measure the success of unsupervised models (such as Kohonen cluster nets)?
* How many times are you willing to rerun a model with adjusted settings before attempting another type of model?
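
When the data are already partitioned into training and validation sets, as in this notebook, one way to encode the test design in scikit-learn is `PredefinedSplit`. The sketch below is illustrative only: it uses synthetic data from `make_classification` in place of the `X_train`/`X_validate` files, and the choice of a random forest and of accuracy as the goodness criterion are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import PredefinedSplit, cross_val_score

# Synthetic stand-ins for the training and validation partitions
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, y_tr = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

# -1 marks rows used only for training; 0 marks the single validation fold
test_fold = np.concatenate([
    np.full(len(X_tr), -1),
    np.zeros(len(X_val), dtype=int),
])
split = PredefinedSplit(test_fold)

X_all = np.vstack([X_tr, X_val])
y_all = np.concatenate([y_tr, y_val])

model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X_all, y_all, cv=split, scoring="accuracy")

# Error rate on the validation fold, the "goodness" measure for this design
error_rate = 1.0 - scores.mean()
print(round(error_rate, 3))
```

Fixing the split once means every candidate model is scored on the same rows, which keeps later comparisons fair.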

## Building the Models 

At this point, you should be well prepared to build the models you've spent so long considering. Give yourself time and room to experiment with a number of different models before drawing final conclusions. Most data miners build several models and compare the results before deploying or integrating them. 

In order to track your progress with a variety of models, be sure to keep notes on the settings and data used for each model. This will help you to discuss the results with others and retrace your steps if necessary. At the end of the model-building process, you'll have three pieces of information to use in data mining decisions:
* Parameter settings include the notes you take on parameters that produce the best results.
* The actual models produced.
* Descriptions of model results, including performance and data issues that occurred during the execution of the model and exploration of its results.

### Parameter Settings 
Most modeling techniques have a variety of parameters or settings that can be adjusted to control the modeling process. For example, decision trees can be controlled by adjusting tree depth, splits, and a number of other settings. Most people first build a model using the default options and then refine the parameters during subsequent sessions. 

Once you have determined the parameters that produce the most accurate results, be sure to save the stream and generated model nodes. Also, taking notes on the optimal settings can help when you decide to automate or rebuild the model with new data. 
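
In Python, the same habit of starting from defaults and keeping notes on each setting can be sketched as a small run log. The example below is an assumption-laden illustration: it uses scikit-learn's `DecisionTreeClassifier` on synthetic data, varies only `max_depth`, and stores each run in a plain list named `runs` (a hypothetical structure, not a library feature).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the prepared training set
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

runs = []
for max_depth in [None, 2, 4, 8]:  # None is the scikit-learn default (unlimited depth)
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X_tr, y_tr)
    runs.append({
        "params": tree.get_params(),              # full settings, for later rebuilds
        "val_accuracy": tree.score(X_val, y_val),  # accuracy on held-out rows
    })

# The run with the highest validation accuracy
best = max(runs, key=lambda r: r["val_accuracy"])
print(best["params"]["max_depth"], round(best["val_accuracy"], 3))
```

Because `get_params()` returns every setting, the log is enough to rebuild the best model later with new data.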

### Running the Models 

In IBM SPSS Modeler, running models is a straightforward task. Once you've inserted the model node into the stream and edited any parameters, simply execute the model to produce viewable results. Results appear in the Generated Models navigator on the right side of the workspace. You can right-click a model to browse the results. For most models, you can insert the generated model into the stream to further evaluate and deploy the results. Models can also be saved in IBM SPSS Modeler for easy reuse. 
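
For a scikit-learn model, a rough equivalent of saving a generated model for reuse is serializing it with `pickle`, which this notebook already imports. The sketch below assumes synthetic data and a temporary directory; the file name `rf_model.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Fit a small model on synthetic data
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Save the fitted model to disk
path = os.path.join(tempfile.mkdtemp(), "rf_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back and confirm that it predicts the same labels
with open(path, "rb") as f:
    restored = pickle.load(f)

same = bool((restored.predict(X) == model.predict(X)).all())
print(same)
```

Pickled models should only be loaded from trusted sources and, ideally, with the same library version that produced them.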

### Model Description 

When examining the results of a model, be sure to take notes on your modeling experience. You can store notes with the model itself using the node annotations dialog box or the project tool.

Task List

For each model, record information such as:
* Can you draw meaningful conclusions from this model?
* Are there new insights or unusual patterns revealed by the model?
* Were there execution problems for the model? How reasonable was the processing time?
* Did the model have difficulties with data quality issues, such as a high number of missing values?
* Were there any calculation inconsistencies that should be noted?
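
Some of these notes can be captured automatically in a notebook. The sketch below, under the assumption of synthetic data and a hypothetical `notes` dictionary, records the processing time with `time.perf_counter` and the number of missing values seen by the model.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Time the fit to judge whether the processing time is reasonable
start = time.perf_counter()
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
fit_seconds = time.perf_counter() - start

notes = {
    "model": type(model).__name__,
    "n_rows": X.shape[0],
    "n_missing_values": int(np.isnan(X).sum()),
    "fit_seconds": fit_seconds,
    "comments": "",  # free-text observations on insights or problems
}
print(notes)
```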

## Assessing the Model 

Now that you have a set of initial models, take a closer look at them to determine which are accurate or effective enough to be final. Final can mean several things, such as "ready to deploy" or "illustrating interesting patterns." Consulting the test plan that you created earlier can help to make this assessment from your organization's point of view. 

### Comprehensive Model Assessment 

For each model under consideration, it is a good idea to make a methodical assessment based on the criteria generated in your test plan. Here is where you may add the generated model to the stream and use evaluation charts or analysis nodes to analyze the effectiveness of the results. You should also consider whether the results make logical sense or whether they are too simplistic for your business goals (for example, a sequence that reveals purchases such as wine > wine > wine). 

Once you've made an assessment, rank the models in order based on both objective (model accuracy) and subjective (ease of use or interpretation of results) criteria. 
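
One simple way to combine objective and subjective criteria into a ranking is a weighted score. The sketch below is purely illustrative: the candidate names, accuracy values, interpretability ratings (1 to 5), and the 0.7 weight are placeholder assumptions, not results from this project.

```python
# Placeholder values for illustration only
candidates = [
    {"name": "decision_tree", "accuracy": 0.84, "interpretability": 5},
    {"name": "random_forest", "accuracy": 0.90, "interpretability": 2},
    {"name": "logistic_regression", "accuracy": 0.86, "interpretability": 4},
]

ACCURACY_WEIGHT = 0.7  # assumed trade-off between the two criteria

def combined_score(m):
    # Scale the 1-5 interpretability rating to the 0-1 range before weighting
    subjective = m["interpretability"] / 5
    return ACCURACY_WEIGHT * m["accuracy"] + (1 - ACCURACY_WEIGHT) * subjective

ranked = sorted(candidates, key=combined_score, reverse=True)
for m in ranked:
    print(m["name"], round(combined_score(m), 3))
```

Changing `ACCURACY_WEIGHT` shows how sensitive the ranking is to the relative importance your organization assigns to accuracy and ease of interpretation.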

Task List
* Using the data mining tools in IBM SPSS Modeler, such as evaluation charts, analysis nodes, or cross-validation charts, evaluate the results of your model.
* Conduct a review of the results based on your understanding of the business problem. Consult data analysts or other experts who may have insight into the relevance of particular results.
* Consider whether a model's results are easily deployable. Does your organization require that results be deployed over the Web or sent back to the data warehouse?
* Analyze the impact of results on your success criteria. Do they meet the goals established during the business understanding phase? 

If you were able to address the above issues successfully and believe that the current models meet your goals, it's time to move on to a more thorough evaluation of the models and a final deployment. Otherwise, take what you've learned and rerun the models with adjusted parameter settings.

### Keeping Track of Revised Parameters 

Based on what you've learned during model assessment, it's time to have another look at the models. You have two options here:
* Adjust the parameters of existing models.
* Choose a different model to address your data mining problem. 

In both cases, you'll return to the building models task and iterate until the results are successful. Don't worry about repeating this step. It is extremely common for data miners to evaluate and rerun models several times before finding one that meets their needs. This is a good argument for building several models at once and comparing the results before adjusting the parameters for each.
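
Building several models at once and comparing them can be sketched in scikit-learn with a shared cross-validation scheme. The example below assumes synthetic data and three arbitrarily chosen classifiers with mostly default parameters; it is a starting point for comparison, not a recommended model set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# The same folds are used for every model so the scores are comparable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Mean cross-validated accuracy per model
results = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Only the most promising candidates from such a comparison then need their parameters adjusted in the next iteration.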