<h1 style="color:blue; text-align: center;">Data Modeling Template with CRISP-DM</h1> 


## Business Understanding
- **Framing the problem**
    - What is the expectation of analyzing the data?
    - Is there a question to be answered?
    - Is it completely exploratory?  (a lot of data and no questions)
    - Is it a machine learning problem that includes a predictive model?
    - Is a visualization or report all that is needed?
    
    



## Data Understanding
- **Set up the workspace**
    - Use an editor – Jupyter notebooks
    - Folder management 
        - Root folder
            - data folder
            - raw folder
            - WIP folder
            - images folder
            - docs folder
- **Import and pip install libraries**
    - Ex. NumPy, pandas, scikit-learn, Matplotlib, seaborn, statsmodels
- **Get the Data**
    - Import from:
        - csv or xls/xlsx, URL, SQL, txt, other files/connections
- **Explore the Data**
    - Visualize the data
        - histograms, bar charts, scatter plots, correlation matrix
    - Group by
    - Value counts
    - .info()
    - .head()
    - .tail()
    - .sample()
    - .describe()

## Data Preparation
- **Combine Data**
    - Merge Data
    - Concatenate Data
    - Pivot or Melt Data
- **Cleanse the Data**
    - Cleaning NaN values
        - fillna with value, median, mean, grouped mean
        - Drop NaN
- **Transform the Data**
    - Change categorical data to numerical – binary, ordinal, or dummy variables
    - Standardize and normalize the data
    - Create a pipeline 
- **Feature Engineering**
    - Create new variables based on other features
        - Ex. ratios, interactions, etc. 
- **Create your X and y datasets for predictive modeling**
    - Create a dataset for your target variable (y) and your features (X)
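
The "Create a pipeline" step above can be sketched with scikit-learn's `Pipeline` and `ColumnTransformer`. This is a minimal illustration, not a prescribed setup: the toy DataFrame and the column names `age`, `income`, and `region` are hypothetical placeholders, and median imputation is only one of the fill strategies listed above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; replace with your own DataFrame and column names
df = pd.DataFrame({
    "age": [25, 32, None, 51],
    "income": [40000, 52000, 61000, None],
    "region": ["north", "south", "south", "east"],
})

numeric_cols = ["age", "income"]
categorical_cols = ["region"]

# Numeric columns: fill NaN with the median, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: one dummy column per category
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X_prepared = preprocess.fit_transform(df)
print(X_prepared.shape)
```

Bundling the steps this way means the same cleansing and transformation can later be fitted on training data and reapplied unchanged to test data.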

## Modeling
- **Split the Data**
    - Train test split
        - Standardize X_train and X_test (fit the scaler on X_train only, then apply it to both)
- **Select the Model/Test**
    - Supervised Learning
        - Numerical target - Regression
            - Lasso
            - Ridge
            - Backwards model building
        - Categorical target – Classification
            - Probabilistic
                - Logistic regression
                - Naive Bayes
            - Decision tree modeling
            - Ensemble
                - Random forest
            - SVM
    - Unsupervised learning
        - Clustering
            - K-means
            - Hierarchical
        - Dimension Reduction
            - PCA
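
As a rough sketch of how the target type drives the choice above, the snippet below fits a Lasso regression to a numerical target and two classifiers to a categorical target. The data is synthetic (generated with scikit-learn helpers), and the `alpha`, `max_iter`, and `n_estimators` values are arbitrary assumptions rather than recommendations.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.model_selection import train_test_split

# Numerical target -> regression (synthetic data)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42
)
lasso = Lasso(alpha=1.0).fit(Xr_train, yr_train)
r2 = lasso.score(Xr_test, yr_test)  # R^2 on the held-out rows

# Categorical target -> classification (synthetic data)
X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=42)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_clf, y_clf, test_size=0.3, random_state=42
)
models = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
# Fit each candidate and record its accuracy on the held-out rows
accuracies = {
    name: model.fit(Xc_train, yc_train).score(Xc_test, yc_test)
    for name, model in models.items()
}
print(r2, accuracies)
```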

## Evaluation
- **Fine tune the Model**
    - K-folds
    - For loop alpha scores
    - Grid Search
- **Evaluate the Final Model** 
    - Error and accuracy metrics (RMSE, accuracy score, etc.)
    - Confusion matrix
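
The tuning and evaluation steps above might look like the following sketch, which combines a grid search using 5-fold cross-validation with a confusion matrix on held-out data. The data is synthetic, the grid of `C` values is an arbitrary assumption, and the two-value RMSE calculation at the end uses made-up numbers purely to show the formula.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=300, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Grid search: every value of C is scored with 5-fold cross-validation
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Evaluate the best model (refit on all training rows) on the test set
y_pred = search.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
print(search.best_params_)
print(cm)

# For a numerical target, RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error([3.0, 5.0], [2.0, 7.0]))
```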

## Deployment
- Identify and deploy the model for testing and production


### Reference for Markdown
- https://www.markdownguide.org/cheat-sheet/

---
<h2 style="color:blue; text-align: center;">Business Understanding</h2> 

---

## Business Objective
 - **Framing the problem**
    - Is there a specific variable that you are looking to predict?
        - If not - what are you looking to explore?
        - If yes - what type of variable and what is the objective - predictive or feature selection?
    - Is there a question to be answered or is it completely exploratory?  (a lot of data and no questions)
    - Is it a machine learning problem that includes a predictive model?
    - Is a visualization or report all that is needed?
    - Who is the target audience?
 
## Technical Objective
- Create a dataset that will help meet the business objective
    - Normalize the data to see if it provides better results
    - Determine the best way to select and clean variables
    - Create additional variables to enhance your model
- What type of exploratory analysis is needed to identify the right technique?
    - Visualizations, summarizations, and groupings
- Which techniques will you use to solve the problem?
    - Supervised vs. unsupervised learning



---
<h2 style="color:blue; text-align: center;">Data Understanding</h2> 

---

## Import Libraries

In [5]:
# import numpy as np
# import pandas as pd
# import matplotlib.pyplot as plt
# %matplotlib inline 
      # displays graphs automatically, without calling plt.show()
# plt.style.use('fivethirtyeight') # a style that can be used for plots; see the Matplotlib style sheets reference


## Import Data
https://pandas.pydata.org/pandas-docs/stable/api.html#flat-file

- What data can be used for this data model?
- What are the sources - are they real time or static?
- Where will the data be housed?

In [8]:
#df = pd.read_csv('data/Data.csv', index_col=0, header=0) # sets the first column as the index
# and the top row as the headers

In [10]:
#df_excel =  pd.ExcelFile('data/RawData.xlsx') 
#print(df_excel.sheet_names)
#df_sheet1 = df_excel.parse('Sheet1')
#or
#df_sheet1 = df_excel.parse(0) # selects the sheet by position instead of by name

In [12]:
#url = 'https://url.com/datasets/Original.csv' 
    # url is a variable that points to the file online

#df_url = pd.read_csv(url, header = 0, index_col=None)
#df_url.head()

### Read the Data
- head( )
- tail( )
- info( )
- describe( )

In [15]:
#df.head(5)
#read the top 5 records

In [17]:
#df.tail(5)
# read the bottom 5 records

In [19]:
#df.info()
# shows each variable with its type and non-null count


## Explore the data

- Describe
- Groupby
- value_counts()
- plot with Matplotlib
- plot with Seaborn

In [22]:
#df.describe()
#shows all descriptive stats for the dataframe

In [24]:
#df.groupby('variable').count()
#df.groupby('variable').mean(numeric_only=True) # numeric_only avoids errors on text columns in recent pandas

In [26]:
#df['variable'].value_counts()
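
A small worked example of `value_counts()` and `groupby` may help here. The DataFrame and its column names (`department`, `salary`) are hypothetical; substitute the variables from your own data.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "department": ["sales", "sales", "ops", "ops", "ops"],
    "salary": [50, 70, 40, 60, 50],
})

# How many rows fall in each category
dept_counts = df["department"].value_counts()

# Count and mean of salary within each department
dept_summary = df.groupby("department")["salary"].agg(["count", "mean"])

print(dept_counts)
print(dept_summary)
```

Selecting the column before aggregating keeps the summary limited to the variable of interest.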

In [28]:
#import seaborn as sns

In [30]:
#view data using Matplotlib and Seaborn to get a better idea of the data

---
<h2 style="color:blue; text-align: center;">Data Preparation</h2> 

---


## Merge, concat, melt and pivot

In [33]:
#df_melt = pd.melt(df, id_vars=['Identifier'],
#                  value_vars=['Columns that you want to merge into one column'])

In [35]:
#df_melt.head(15)
#view top 15 records of new dataframe

In [37]:
#df_pivot = df_melt.pivot(index='Identifier', columns='Column that you want to split into multiple columns',
#                          values='Value for each of the new columns')

In [39]:
#df_pivot.head()
# View top records in the new pivot dataframe
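
To make the melt and pivot steps concrete, here is a minimal round trip from wide to long and back. The column names `q1`, `q2`, `quarter`, and `sales` are hypothetical; only `Identifier` comes from the template above.

```python
import pandas as pd

# Hypothetical wide table: one row per identifier, one column per quarter
wide = pd.DataFrame({
    "Identifier": ["a", "b"],
    "q1": [10, 30],
    "q2": [20, 40],
})

# Melt: stack the quarter columns into a single pair of name/value columns
long_df = pd.melt(
    wide,
    id_vars=["Identifier"],
    value_vars=["q1", "q2"],
    var_name="quarter",
    value_name="sales",
)

# Pivot: spread the quarter values back out into separate columns
back = long_df.pivot(index="Identifier", columns="quarter", values="sales")

print(long_df)
print(back)
```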

In [41]:
#df_all = pd.concat([df, df1], axis=0)
# Concatenate two sets of data together
# axis = 0 adds df1 underneath df
# axis = 1 adds df1 to the right of df

In [43]:
#df_all = pd.merge(df, df1, how='left', on='Variable')
# can also use inner, right, outer for how
# on is the variable that is common between the two (normally primary and foreign keys)
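
The effect of the `how` argument is easiest to see on a tiny example. The two tables and the key `customer_id` below are hypothetical; `indicator=True` is an optional pandas flag that adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Hypothetical primary table and foreign-key table
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 4], "amount": [25.0, 40.0, 15.0]})

# Left merge keeps every customer, matched or not
left = pd.merge(customers, orders, how="left", on="customer_id", indicator=True)

# Inner merge keeps only customers that have at least one order
inner = pd.merge(customers, orders, how="inner", on="customer_id")

print(left)
print(inner)
```

Checking the `_merge` column after a left merge is a quick way to spot keys that failed to match.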


## Cleanse Data

In [46]:
#df['VariableName'] = df['VariableName'].fillna('Text') # places Text in place of NaN
#df['VariableNumber'] = df['VariableNumber'].fillna(0) # places 0 in place of NaN
#mean_value = df['VariableNumber'].mean() # compute the mean before filling with it
#df['VariableNumber'] = df['VariableNumber'].fillna(mean_value) # places the mean in place of NaN

In [48]:
#df['VariableNumber'] = df['VariableNumber'].ffill() # fills NaN with the previous valid value
#df['VariableNumber'] = df['VariableNumber'].bfill() # fills NaN with the next valid value
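
The outline at the top of this template also lists filling with a grouped mean and dropping NaN rows. A minimal sketch of those options follows, assuming a hypothetical DataFrame with a `group` column and a numeric `value` column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b"],
    "value": [1.0, np.nan, 3.0, 10.0, np.nan],
})

# Fill with the overall mean of the column
overall_mean = df["value"].mean()
df["value_mean"] = df["value"].fillna(overall_mean)

# Fill with the mean of each row's own group
group_means = df.groupby("group")["value"].transform("mean")
df["value_grouped"] = df["value"].fillna(group_means)

# Or drop the rows where the value is missing
df_dropped = df.dropna(subset=["value"])

print(df)
```

A grouped mean is often closer to the missing value than the overall mean when the groups differ a lot, as they do in this toy data.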


## Transform Data

In [51]:
#transform with a function
# def gender(c):
#  if c['Gender'] == "Female":
#    return 1
#  else:
#    return 0

#df['Gender'] = df.apply(gender, axis=1)
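
For simple binary or ordinal recoding, a vectorized comparison or a dictionary passed to `map` is a common alternative to a row-wise `apply`. The sketch below assumes a hypothetical `Size` column and an ordering chosen for illustration; the new column names are placeholders.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "Size": ["small", "large", "medium"],
})

# Binary: 1 where the condition holds, 0 otherwise
df["Gender_flag"] = (df["Gender"] == "Female").astype(int)

# Ordinal: map each category to its rank in an assumed order
size_order = {"small": 0, "medium": 1, "large": 2}
df["Size_rank"] = df["Size"].map(size_order)

print(df)
```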

In [53]:
#transform with a label encoder
#from sklearn.preprocessing import LabelEncoder #import function for label encoder
#lb = LabelEncoder() # assign the label encoder to a short variable name
#df['Rank'] = lb.fit_transform(df['Rank']) 

In [55]:
#transform with pd.get_dummies
#df_dummy = pd.Series(df['DummyVariable'])
#df_dummy = pd.get_dummies(df_dummy)
#df = pd.concat([df, df_dummy], axis=1)
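
`pd.get_dummies` can also be applied to the whole DataFrame with the `columns` argument, which replaces the original column and skips the separate concat step. This sketch uses hypothetical columns `color` and `size`; `drop_first=True` is an optional choice that drops one category to avoid redundant columns.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})

# Encode only the listed columns; other columns pass through unchanged
df_encoded = pd.get_dummies(df, columns=["color"], drop_first=True, dtype=int)

print(df_encoded)
```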


## Split Data into test and train

In [58]:
# X (features) and y (target) must be created before splitting
#from sklearn.model_selection import train_test_split
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
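
A runnable sketch of the split followed by standardization is shown below, using synthetic data in place of a real X and y. The scaler is fitted on the training rows only and then reused on the test rows, so no information from the test set leaks into the preparation step.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the features (X) and the target (y)
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training mean and std

print(X_train_scaled.shape, X_test_scaled.shape)
```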

## Modeling techniques