# The Data Mentor Data Science Process


[Link to Step 1: Set Up Analysis Environment](#step1)

[Link to Step 2: Frame The Problem](#step2)

[Link to Step 3: Collect or Access Your Data](#step3)

[Link to Step 4: Clean Data](#step4)

[Link to Step 5: Explore Data](#step5)

[Link to Step 6: Model Data](#step6)

[Link to Step 7: Craft Data Story](#step7)

[Link to Step 8: Productionize Machine Learning Model](#step8)

<a id='step1'></a>

# Step 1: Set Up Analysis Environment

```python
import pandas as pd
import seaborn as sns
```

<a id='step2'></a>

# Step 2: Frame The Problem

Frame the problem: Who is your client? What exactly is the client asking you to solve? How can you translate their ambiguous request into a concrete, well-defined problem?

1. Make a clear statement of the problem.
2. List the assumptions about the problem (e.g. about the data).
3. Describe the motivation for solving the problem.
    - Describe the benefits of the solution (model or the predictions).
    - Describe how the solution will be used.
4. Map domain knowledge to a machine learning solution.
    - Describe how the problem is currently solved (if at all).
    - Describe how a subject matter expert would make manual predictions.
    - Describe how a programmer might hand-code a classifier.
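
A hand-coded classifier can double as a baseline for any later model. The sketch below assumes a made-up churn problem; the feature names (`days_since_last_login`, `support_tickets`, `tenure_months`) and the thresholds are hypothetical and only illustrate the idea of turning expert rules into code.

```python
# Hypothetical example: flag customers likely to churn using hand-written rules.
# Feature names and thresholds are illustrative assumptions, not real findings.

def rule_based_churn_classifier(customer: dict) -> bool:
    """Return True if the hand-coded rules predict the customer will churn."""
    # Rule 1: a long period without activity is treated as a warning sign.
    if customer["days_since_last_login"] > 30:
        return True
    # Rule 2: many support tickets on a young account.
    if customer["support_tickets"] >= 3 and customer["tenure_months"] < 6:
        return True
    return False


customers = [
    {"days_since_last_login": 45, "support_tickets": 0, "tenure_months": 24},
    {"days_since_last_login": 2, "support_tickets": 4, "tenure_months": 3},
    {"days_since_last_login": 5, "support_tickets": 1, "tenure_months": 12},
]

predictions = [rule_based_churn_classifier(c) for c in customers]
print(predictions)
```

Writing the rules down this way also makes the assumptions from item 2 explicit, since every threshold is a stated belief about the data.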

<a id='step3'></a>

# Step 3: Collect or Access Your Data

Collect the raw data needed to solve the problem: 
- Is this data already available? If so, what parts of the data are useful? 
- If not, what more data do you need? What kind of resources (time, money, infrastructure) would it take to collect this data in a usable form?

<a id='step4'></a>

# Step 4: Clean Data

Real, raw data is rarely usable out of the box. There are errors in data collection, corrupt records, missing values and many other challenges you will have to manage. You will first need to clean the data to convert it to a form that you can further analyze.


### Data Model:

 - Drop duplicate columns
 - Replace/drop missing values
 - Remove/winsorize outliers
 - Remove unwanted string characters
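
The four cleaning tasks above can be sketched with pandas roughly as follows. The toy DataFrame, its column names, the median fill, and the 10th/90th percentile cutoffs are all assumptions chosen for illustration; the right choices depend on your data.

```python
import pandas as pd

# Toy data, made up for illustration: a duplicated column, a missing value,
# an extreme outlier, and stray characters in a string column.
raw = pd.DataFrame({
    "city": [" NYC#", "Boston ", "NYC", "Boston*", "NYC", "Boston"],
    "price": [10.0, 12.0, 11.0, 13.0, 900.0, 12.0],
    "price_copy": [10.0, 12.0, 11.0, 13.0, 900.0, 12.0],
    "units": [1.0, 2.0, None, 4.0, 5.0, 3.0],
})

# 1. Drop duplicate columns: a column is a duplicate if it equals an earlier one.
cols = list(raw.columns)
dupes = [
    c for i, c in enumerate(cols)
    if any(raw[c].equals(raw[p]) for p in cols[:i])
]
clean = raw.drop(columns=dupes)

# 2. Replace missing values (here: with the column median, one option of many).
clean["units"] = clean["units"].fillna(clean["units"].median())

# 3. Winsorize outliers: clip values to the 10th and 90th percentiles.
#    interpolation="lower" picks an observed value instead of interpolating.
lower = clean["price"].quantile(0.10, interpolation="lower")
upper = clean["price"].quantile(0.90, interpolation="lower")
clean["price"] = clean["price"].clip(lower, upper)

# 4. Remove unwanted string characters: keep letters and spaces, then trim.
clean["city"] = (
    clean["city"].str.replace(r"[^A-Za-z ]", "", regex=True).str.strip()
)

print(clean)
```

Clipping keeps every row and only caps the extreme value, whereas removing outliers would drop the row entirely; which one is appropriate is a judgment call for the problem at hand.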
    

### Science Model:

- Scaling: The preprocessed data may contain attributes with a mixture of scales for various quantities such as dollars, kilograms, and sales volume. Many machine learning methods work better when attributes share the same scale, for example between 0 and 1 for the smallest and largest values of a given feature. Consider any feature scaling you may need to perform.
- Decomposition: Some features represent a complex concept that may be more useful to a machine learning method when split into its constituent parts. An example is a date, which may have day and time components that could in turn be split further. Perhaps only the hour of day is relevant to the problem being solved. Consider what feature decompositions you can perform.
- Sampling: 
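
As a minimal sketch of the scaling and decomposition ideas, the snippet below min-max scales two numeric columns to the 0 to 1 range and splits a timestamp into hour and day of week. The DataFrame and its column names (`timestamp`, `revenue_usd`, `weight_kg`) are invented for illustration.

```python
import pandas as pd

# Toy data, made up for illustration.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-05 08:30", "2024-01-06 14:00", "2024-01-07 23:15"]
    ),
    "revenue_usd": [100.0, 250.0, 400.0],
    "weight_kg": [2.0, 3.0, 6.0],
})

# Scaling: min-max scale each numeric column so its smallest value maps to 0
# and its largest to 1. A constant column would divide by zero here and
# needs separate handling.
numeric_cols = ["revenue_usd", "weight_kg"]
col_min = df[numeric_cols].min()
col_max = df[numeric_cols].max()
features = (df[numeric_cols] - col_min) / (col_max - col_min)

# Decomposition: split the timestamp into simpler parts.
features["hour"] = df["timestamp"].dt.hour
features["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday = 0

print(features)
```

When a model is later validated on held-out data, the minimum and maximum should come from the training data only, so the scaling does not leak information from the test set.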


Resource: https://machinelearningmastery.com/how-to-prepare-data-for-machine-learning/

<a id='step5'></a>

# Step 5: Explore Data 

Perform in-depth analysis (machine learning, statistical models, algorithms): This step is usually the meat of your project, where you apply all the cutting-edge machinery of data analysis to unearth high-value insights and predictions.
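
A first pass at exploration often starts with summary statistics, group comparisons, and correlations before any heavier modeling. The sketch below uses a small invented dataset; the column names (`segment`, `ad_spend`, `sales`) are assumptions for illustration only.

```python
import pandas as pd

# Toy data, made up for illustration.
df = pd.DataFrame({
    "segment": ["a", "a", "b", "b", "b"],
    "ad_spend": [1.0, 2.0, 3.0, 4.0, 5.0],
    "sales": [2.0, 4.0, 6.0, 8.0, 10.0],
})

# Summary statistics for the numeric columns (count, mean, std, quartiles).
summary = df.describe()

# Compare a target variable across groups.
segment_means = df.groupby("segment")["sales"].mean()

# Pearson correlation between two numeric columns.
correlation = df["ad_spend"].corr(df["sales"])

print(summary)
print(segment_means)
print(correlation)
```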

<a id='step6'></a>

# Step 6: Model Data

Document Model Assumptions:
    
    
    
    
    
Resource: https://towardsdatascience.com/all-the-annoying-assumptions-31b55df246c3

<a id='step7'></a>

# Step 7: Craft Data Story 

Communicate results of the analysis: All the analysis and technical results that you come up with are of little value unless you can explain to your stakeholders what they mean, in a way that’s comprehensible and compelling. Data storytelling is a critical and underrated skill that you will build and use here.

<a id='step8'></a>

# Step 8: Productionize Machine Learning Model 