**Cross-Industry Standard Process for Data Mining**
Overview by Michael McCarthy mbmccart@utica.edu

[CRISP-DM](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining) was developed by industry but closely follows the steps of the [scientific method](https://en.wikipedia.org/wiki/Scientific_method):
*   Observe
*   Question
*   Hypothesize
*   Test
*   Analyze
*   Iterate

As data scientists, we should always use a standard process to generate thoughtful outcomes.

CRISP-DM showcases iteration, unlike knowledge discovery in databases (KDD).



In [None]:
# CRISP-DM step 1: Business Understanding
step1 = [
    "This is where you map out your plan; it starts with understanding the problem being solved or the question being asked.",
    "Important questions to answer for a full business understanding:",
]
step1sub = [
    "What is the problem statement?",
    "What are the measures of success?",
    "Who is the data owner?",
    "What are the limitations of the data?",
    "Are there ethical considerations with the data or the resulting analysis?",
]
# Print each key idea with a 1-based index
for i, idea in enumerate(step1, start=1):
    print("Key idea ", i, ": ", idea)
# Print the supporting questions as an indented bullet list
for question in step1sub:
    print("              ", "* ", question)

Key idea  1 :  This is where you map out your plan; it starts with understanding the problem being solved or the question being asked.
Key idea  2 :  Important questions to answer for a full business understanding:
               *  What is the problem statement?
               *  What are the measures of success?
               *  Who is the data owner?
               *  What are the limitations of the data?
               *  Are there ethical considerations with the data or the resulting analysis?


In [None]:
# CRISP-DM step 2: Data Understanding
"""
Where is the data coming from?
Review the metadata.
Data Exploration (often called EDA, for "exploratory data analysis")
  Visualizations: Categorical Data
    Bar
    Pie
    Waffle Chart
  Visualizations: Continuous Data
    Scatterplots
    Histograms
    Boxplots
    Sankey
  Visualizations: Text
    Word Clouds
  Descriptive Statistics
    Mean
    Median
    Mode
    Standard Deviation
  Correlation
    Pearson (interval or ratio data)
    Spearman (ordinal or ranked data)
"""

import matplotlib.pyplot as plt
# https://pypi.org/project/matplotlib/

import numpy as np
# descriptive statistics https://numpy.org/doc/stable/reference/generated/numpy.mean.html?highlight=mean#numpy.mean
# covariance matrix https://numpy.org/doc/stable/reference/generated/numpy.cov.html#numpy.cov
# (for a correlation matrix, see numpy.corrcoef)

# REMEMBER, the EDA might require you to step back to Step 1: Business Understanding (this is OK!).
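
The descriptive statistics and the two correlation measures listed in step 2 can be sketched with the Python standard library alone. This is a minimal illustration, not the notebook author's method: the sample values and the helper names `pearson`, `average_ranks`, and `spearman` are hypothetical, and in practice NumPy or SciPy would likely be used instead.

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation: co-variation scaled by both spreads."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical sample for the descriptive statistics
scores = [2, 4, 4, 5, 7, 9]
summary = {
    "mean": statistics.fmean(scores),
    "median": statistics.median(scores),
    "mode": statistics.mode(scores),
    "stdev": statistics.stdev(scores),  # sample standard deviation
}
print(summary)

# Hypothetical sample: y rises monotonically with x, but not linearly
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print("Pearson: ", pearson(x, y))
print("Spearman:", spearman(x, y))
```

Because `y` always increases with `x`, the rank-based Spearman coefficient reaches 1 while the Pearson coefficient stays slightly below it. Comparing the two is a quick check for monotonic but non-linear relationships during EDA.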

In [None]:
# CRISP-DM step 3: Data Preparation
"""
Based on the EDA, select your data. This step is often called "Data Cleaning", although the cleaning process can start during the EDA (step 2).

Transform variables.
Generate features (i.e., use variables to generate new variables).
Perform dimensionality reduction (e.g., PCA).
Reflect on and resolve nulls (these strategies can help, but they are not a full list of ways to resolve nulls):
  Impute values (e.g., mean, median, 10th percentile)
  Assign zeros
  Delete variables (i.e., fields) with a lot of zeros.
  Remove records with a lot of zeros.
"""
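
A few of the preparation strategies above can be sketched in plain Python. Everything here is an assumption made for illustration: the records, the field names, the 50% threshold, and the helpers `null_fraction`, `drop_sparse_fields`, and `impute_median` are hypothetical. Dropping a field by its share of nulls is one possible reading of "delete variables"; a library such as pandas would normally handle this work.

```python
from statistics import median

# Hypothetical records; None marks a missing value
records = [
    {"age": 34, "income": 52000, "referrer": None},
    {"age": None, "income": 61000, "referrer": None},
    {"age": 29, "income": None, "referrer": "email"},
    {"age": 41, "income": 58000, "referrer": None},
]

def null_fraction(rows, field):
    """Share of rows in which the field is missing."""
    return sum(row[field] is None for row in rows) / len(rows)

def drop_sparse_fields(rows, max_null_fraction=0.5):
    """Delete fields whose share of nulls exceeds the threshold."""
    keep = [f for f in rows[0] if null_fraction(rows, f) <= max_null_fraction]
    return [{f: row[f] for f in keep} for row in rows]

def impute_median(rows, field):
    """Replace missing values with the median of the observed values."""
    fill = median([row[field] for row in rows if row[field] is not None])
    return [
        {**row, field: fill if row[field] is None else row[field]}
        for row in rows
    ]

cleaned = drop_sparse_fields(records)
for field in ("age", "income"):
    cleaned = impute_median(cleaned, field)

# Feature generation: derive a new variable from existing ones
for row in cleaned:
    row["income_per_year_of_age"] = row["income"] / row["age"]

for row in cleaned:
    print(row)
```

The order matters in this sketch: sparse fields are dropped first so that no effort goes into imputing a column that will be discarded anyway.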


In [None]:
# CRISP-DM step 4: Modeling
"""
Linear Modeling
Supervised Learning
Unsupervised Learning
Time-Series Analysis
NLP
"""
!pip install pycaret
# https://pycaret.org/

# REMEMBER, the modeling might require you to step back to earlier steps, this is OK!

Collecting pycaret
  Downloading pycaret-2.3.3-py3-none-any.whl (264 kB)
Collecting pandas-profiling>=2.8.0
  Downloading pandas_profiling-3.0.0-py2.py3-none-any.whl (248 kB)
Collecting imbalanced-learn==0.7.0
  Downloading imbalanced_learn-0.7.0-py3-none-any.whl (167 kB)
Collecting Boruta
  Downloading Boruta-0.3-py3-none-any.whl (56 kB)
Collecting scikit-plot
  Downloading scikit_plot-0.3.7-py3-none-any.whl (33 kB)
Collecting umap-learn
  Downloading umap-learn-0.5.1.tar.gz (80 kB)
Collecting pyLDAvis
  Downloading pyLDAvis-3.3.1.tar.gz (1.7 MB)
  Installing build dependencies ... done

In [None]:
# CRISP-DM step 5: Evaluation
"""
Every model has criteria against which it is assessed.
Often it is appropriate to use multiple modeling methods to contrast the results.
"""
# REMEMBER, the evaluation might require you to step back to earlier steps (maybe all the way to step 1), this is OK!
## Sometimes the model isn't good . . . iterate!
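
Contrasting multiple modeling methods on held-out data can be sketched as follows. This is a minimal illustration under stated assumptions: the data is synthetic, the 30/10 split and the choice of RMSE as the criterion are arbitrary, and the helpers `fit_line` and `rmse` are hypothetical names. It compares a simple least-squares line against a baseline that always predicts the training mean.

```python
import math
import random
from statistics import fmean

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread = sum((x - mean_x) ** 2 for x in xs)
    slope = covariation / spread
    return slope, mean_y - slope * mean_x

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(fmean([(a - p) ** 2 for a, p in zip(actual, predicted)]))

# Synthetic data (an assumption for this sketch): y = 3x + 5 plus noise
rng = random.Random(0)
xs = list(range(40))
ys = [3 * x + 5 + rng.gauss(0, 2) for x in xs]

# Hold out a quarter of the records for evaluation
indices = list(range(len(xs)))
rng.shuffle(indices)
train, test = indices[:30], indices[30:]
x_train, y_train = [xs[i] for i in train], [ys[i] for i in train]
x_test, y_test = [xs[i] for i in test], [ys[i] for i in test]

# Model A: baseline that always predicts the training mean
baseline_prediction = fmean(y_train)
rmse_baseline = rmse(y_test, [baseline_prediction] * len(y_test))

# Model B: least-squares line fitted on the training records only
slope, intercept = fit_line(x_train, y_train)
rmse_linear = rmse(y_test, [slope * x + intercept for x in x_test])

print("Baseline RMSE:", rmse_baseline)
print("Linear RMSE:  ", rmse_linear)
```

If a candidate model cannot beat a trivial baseline on the held-out records, that is a signal to iterate back to an earlier step rather than to deploy.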

In [None]:
# CRISP-DM step 6: Deployment
"""
This is the ultimate goal: a deployed model that is generalized and reliable.
Models need tuning after deployment.
If the model is not worth deploying, iterate or move on.
"""