<a href="https://colab.research.google.com/github/ThomasGVoss/LearningFactory/blob/main/Lab_Data_Prep_with_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The **CR**oss **I**ndustry **S**tandard **P**rocess for **D**ata **M**ining (CRISP-DM) is a process model that serves as the basis for a data science process. It has six sequential phases:



1.   Business understanding – What does the business need?
2.   Data understanding – What data do we have / need? Is it clean?
3.   Data preparation – How do we organize the data for modeling?
4.   Modeling – What modeling techniques should we apply?
5.   Evaluation – Which model best meets the business objectives?
6.   Deployment – How do stakeholders access the results?


Published in 1999 to standardize data mining processes across industries, it has since become the most common methodology for data mining, analytics, and data science projects.

## Phase 1: Business Understanding

The Business Understanding phase focuses on understanding the objectives and requirements of the project. It has four tasks. Apart from the third, they are foundational project management activities that are universal to most projects:

**Determine business objectives:** You should first “thoroughly understand, from a business perspective, what the customer really wants to accomplish” (CRISP-DM Guide) and then define business success criteria.

**Assess situation:** Determine resource availability and project requirements, assess risks and contingencies, and conduct a cost-benefit analysis.

**Determine data mining goals:** In addition to defining the business objectives, you should also define what success looks like from a technical data mining perspective.

**Produce project plan:** Select technologies and tools and define detailed plans for each project phase.

While many teams hurry through this phase, establishing a strong business understanding is like building the foundation of a house – absolutely essential.

This enables close coordination between the technical team of data scientists, data analysts, and data engineers and the business stakeholders.

## Phase 2: Data Understanding


Next is the Data Understanding phase. Adding to the foundation of Business Understanding, it drives the focus to identify, collect, and analyze the data sets that can help you accomplish the project goals. This phase also has four tasks:

**Collect initial data:** Acquire the necessary data and (if necessary) load it into your analysis tool.

**Describe data:** Examine the data and document its surface properties like data format, number of records, or field identities.

**Explore data:** Dig deeper into the data. Query it, visualize it, and identify relationships among the data.

**Verify data quality:** How clean/dirty is the data? Document any quality issues.
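
The describe, explore, and verify tasks can be sketched in a few pandas calls. The example below is a minimal sketch on an invented toy table, not on the lab data. The frame `journal` and all of its values are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for a workstation journal; all values are invented for illustration.
journal = pd.DataFrame(
    {
        "ProcessID": [1, 1, 2, 2, 2],
        "Workstation": ["A", "B", "A", "B", "B"],
        "Start": ["2023-01-01 10:00:00", "2023-01-01 10:05:00",
                  "2023-01-01 10:02:00", "2023-01-01 10:09:00",
                  "2023-01-01 10:09:00"],
        "End": ["2023-01-01 10:04:00", "2023-01-01 10:08:00",
                None, "2023-01-01 10:12:00",
                "2023-01-01 10:12:00"],
    }
)

# Describe data: number of records, number of fields, and stored types.
n_rows, n_cols = journal.shape
column_types = journal.dtypes

# Explore data: how many records does each workstation contribute?
records_per_station = journal["Workstation"].value_counts()

# Verify data quality: count missing values and fully duplicated rows.
missing_per_column = journal.isna().sum()
n_duplicates = int(journal.duplicated().sum())

print(n_rows, n_cols)
print(column_types)
print(records_per_station)
print(missing_per_column)
print(n_duplicates)
```

Documenting findings such as the one missing `End` value and the one duplicated row is the output of the "verify data quality" task. Deciding what to do about them belongs to the data preparation phase.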

In [None]:
import numpy as np                                # For matrix operations and numerical processing
import pandas as pd                               # For munging tabular data
import matplotlib.pyplot as plt                   # For charts and visualizations
from IPython.display import Image                 # For displaying images in the notebook
from IPython.display import display               # For displaying outputs in the notebook
from time import gmtime, strftime                 # For labeling SageMaker models, endpoints, etc.
import sys                                        # For writing outputs to notebook
import math                                       # For ceiling function
import json                                       # For parsing hosting outputs
import os                                         # For manipulating filepath names
import zipfile                                    # For unzipping

# ensure graphs are displayed correctly inline in this notebook
%matplotlib inline

Collecting and Loading the data

In [None]:
!wget https://raw.githubusercontent.com/ThomasGVoss/LearningFactory/main/apjournal.csv
!wget https://raw.githubusercontent.com/ThomasGVoss/LearningFactory/main/kundenauftrag.csv
!wget https://raw.githubusercontent.com/ThomasGVoss/LearningFactory/main/produktionsauftrag.csv


In [None]:
col = ['ProcessID', 'RoundId', 'Workstation', 'Null', 'Start', 'End']
# The file has no header row, so the column names are supplied here; malformed lines are skipped
data = pd.read_csv('/content/apjournal.csv', header=None, names=col, index_col=0, sep=',', on_bad_lines='skip')
pd.set_option('display.max_columns', 500)   # Make sure we can see all of the columns
pd.set_option('display.max_rows', 20)       # Keep the output on one page
data = data.drop(columns='Null')            # Drop the unused 'Null' column
# The last expression in a cell is displayed automatically; here, the pandas DataFrame
data

Let's take a look at the round we played and talk about the data.



In [None]:
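# Keep only the round we played; .copy() avoids a SettingWithCopyWarning when columns are added later
data = data.loc[data['RoundId'] == 216].copy()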

## Task:

Please think back to the game and match your experience with the data set you have been presented with. Write a description of the data: What do the rows and columns represent, and what types of values are you looking at?

Below are some examples of ways to access the data.


In [None]:
data['Start'] = pd.to_datetime(data['Start'])
data['End'] = pd.to_datetime(data['End'])
data.dtypes


In [None]:
# Select a column
ws = data['Workstation']
ws

In [None]:
data.sort_values(by="Start")

In [None]:
# Grouping
data.groupby('Workstation').size()

In [None]:
# Compute the duration from the end and start timestamps
data['Duration'] = data['End'] - data['Start']

# Series.dt is the accessor object for datetime-like properties of the Series values
data['Seconds'] = data['Duration'].dt.total_seconds()

# Drop the intermediate Duration column
data = data.drop(columns=['Duration'])

# Exploration 
 
Let's start exploring the data. First, let's understand how the features are distributed.
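
One way to look at distributions per group is a compact summary table. The following is a minimal sketch on invented durations, not on `apjournal.csv`. The frame `toy` and the factor 1.5 used to flag skew are arbitrary assumptions for illustration.

```python
import pandas as pd

# Toy durations in seconds; invented values, not taken from apjournal.csv.
toy = pd.DataFrame(
    {
        "Workstation": ["A", "A", "A", "B", "B", "B"],
        "Seconds": [30.0, 40.0, 50.0, 10.0, 20.0, 120.0],
    }
)

# Summarise the distribution of durations for each workstation.
summary = toy.groupby("Workstation")["Seconds"].agg(
    ["count", "mean", "median", "min", "max"]
)
print(summary)

# A mean far above the median hints at a skewed distribution or outliers.
# The factor 1.5 is an arbitrary threshold chosen for this sketch.
skewed = summary[summary["mean"] > 1.5 * summary["median"]].index.tolist()
print(skewed)
```

Comparing mean and median per workstation is a quick check before plotting: station `B` in the toy data is dominated by a single long step.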

In [None]:
data.describe()

In [None]:
# Let's find the mean duration in seconds of each process step (workstation)
data.groupby('Workstation')['Seconds'].mean()

In [None]:
for process_id, group in data.groupby('ProcessID'):  # one group of rows per process
    print(group)

## Task: 
Please take a look at produktionsauftrag.csv. What can you find out?
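
A first look at an unfamiliar CSV file could follow the pattern below. This is only a sketch: the inline sample and the column names `OrderID`, `RoundId`, `Variant`, and `Created` are hypothetical, and the real layout of produktionsauftrag.csv may differ.

```python
import io

import pandas as pd

# Stand-in for an order file; the real layout of produktionsauftrag.csv may differ.
sample_csv = io.StringIO(
    "101,216,Variant A,2023-01-01 10:00:00\n"
    "102,216,Variant B,2023-01-01 10:03:00\n"
    "103,217,Variant A,2023-01-01 11:00:00\n"
)

# header=None because the sample has no header row; the names are hypothetical.
orders = pd.read_csv(
    sample_csv,
    header=None,
    names=["OrderID", "RoundId", "Variant", "Created"],
)

print(orders.head())     # first rows
print(orders.dtypes)     # stored types
print(orders.nunique())  # distinct values per column
```

Looking at the first rows, the types, and the number of distinct values per column usually tells you which columns are keys, which are categories, and which are timestamps.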

## Phase 3: Data Preparation 
Transformation / Feature engineering

---


Cleaning up data is part of nearly every machine learning project. It often takes a large share of the project time and is a prerequisite for a good model. This phase has five tasks:

**Select data:** Determine which data sets will be used and document reasons for inclusion/exclusion.

**Clean data:** Often this is the lengthiest task. Without it, you’ll likely fall victim to garbage-in, garbage-out. A common practice during this task is to correct, impute, or remove erroneous values.

**Construct data:** Derive new attributes that will be helpful. For example, derive someone’s body mass index from height and weight fields.

**Integrate data:** Create new data sets by combining data from multiple sources.

**Format data:** Re-format data as necessary. For example, you might convert string values that store numbers to numeric values so that you can perform mathematical operations.
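
The tasks above can be sketched on a small example. The code is a minimal illustration on invented data that picks up the body mass index example. The frames `people` and `cities` and all of their values are assumptions, and formatting is done first here because the later steps need numeric values.

```python
import pandas as pd

# Toy data with typical problems; all names and values are invented.
people = pd.DataFrame(
    {
        "person_id": [1, 2, 3, 4],
        "height_m": ["1.80", "1.65", "1.75", "n/a"],  # numbers stored as strings
        "weight_kg": [81.0, 60.0, None, 70.0],         # one missing value
    }
)
cities = pd.DataFrame(
    {"person_id": [1, 2, 3, 4], "city": ["Kiel", "Bonn", "Jena", "Ulm"]}
)

# Format data: convert strings to numbers; unparseable entries become NaN.
people["height_m"] = pd.to_numeric(people["height_m"], errors="coerce")

# Clean data: impute the missing weight with the median, drop rows without height.
people["weight_kg"] = people["weight_kg"].fillna(people["weight_kg"].median())
people = people.dropna(subset=["height_m"]).copy()

# Construct data: derive the body mass index from height and weight.
people["bmi"] = people["weight_kg"] / people["height_m"] ** 2

# Integrate data: combine with a second source via the shared key.
prepared = people.merge(cities, on="person_id", how="left")
print(prepared)
```

Whether to impute or to drop a faulty record is a judgment call that depends on the project. Either way, the choice should be documented.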

Using the data, can you match the ProcessID with a variant?

Can you group the data based on the type of car and the workstation used?
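
As a hint, matching and grouping could look like the sketch below. It uses invented toy tables rather than the lab files. The order table and its column name `Variant` are assumptions, and the real files may use different names.

```python
import pandas as pd

# Toy journal indexed by ProcessID, as in the lab; the durations are invented.
journal = pd.DataFrame(
    {
        "ProcessID": [1, 1, 2, 2, 3, 3],
        "Workstation": ["WS1", "WS2", "WS1", "WS2", "WS1", "WS2"],
        "Seconds": [30.0, 45.0, 50.0, 55.0, 40.0, 35.0],
    }
).set_index("ProcessID")

# Hypothetical order table; the column name "Variant" is an assumption.
orders = pd.DataFrame(
    {"ProcessID": [1, 2, 3], "Variant": ["Car A", "Car B", "Car A"]}
)

# Match each journal row with its variant via the shared ProcessID.
merged = journal.reset_index().merge(orders, on="ProcessID", how="left")

# Mean duration for every combination of variant and workstation.
mean_seconds = merged.groupby(["Variant", "Workstation"])["Seconds"].mean()
print(mean_seconds)
```

A left merge keeps every journal row even if no matching order exists. Rows without a match would then show `NaN` in `Variant`, which is itself a data quality finding.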

Can you add more steps to the data set, such as the start and the end?
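
Adding rows to a table is typically done with `pd.concat`. The sketch below shows the mechanics on invented data. The timestamps are made up, and which table holds the real start and end times is left for you to find out.

```python
import pandas as pd

# Toy journal for one process; WS1 and WS2 and all timestamps are invented.
journal = pd.DataFrame(
    {
        "ProcessID": [1, 1],
        "Workstation": ["WS1", "WS2"],
        "Start": pd.to_datetime(["2023-01-01 10:05:00", "2023-01-01 10:10:00"]),
        "End": pd.to_datetime(["2023-01-01 10:09:00", "2023-01-01 10:14:00"]),
    }
)

# Hypothetical extra steps; where the real timestamps come from is left open.
extra_steps = pd.DataFrame(
    {
        "ProcessID": [1, 1],
        "Workstation": ["PPS", "Storage"],
        "Start": pd.to_datetime(["2023-01-01 10:00:00", "2023-01-01 10:14:00"]),
        "End": pd.to_datetime(["2023-01-01 10:05:00", "2023-01-01 10:20:00"]),
    }
)

# Append the new rows and restore chronological order within each process.
extended = (
    pd.concat([journal, extra_steps], ignore_index=True)
    .sort_values(["ProcessID", "Start"])
    .reset_index(drop=True)
)
print(extended)
```

Building the extra rows as their own DataFrame with the same column names keeps the concatenation simple and avoids appending row by row in a loop.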

In [None]:
# For each ProcessID, add two rows to the data table with the stations PPS and Storage.
# Think about which timestamps would be useful. In which table would you find them?


## End of Lab 1
We have now gained an understanding of our data and prepared it ...

Please download the generated file for further processing.

In [None]:
data.to_csv('output.csv')