# Business and Data understanding
------------

The initial phase defines the business objectives and translates them into ML objectives, collects the data and verifies its quality, and finally assesses the feasibility of the project.

![](https://i.imgur.com/55J7fBc.jpeg)

## Terminology

#### 1. Business terminology
| Term            | Meaning                                                                                                                                                                                                                                                                             |
|-----------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| WAC             | World Area Code                                                                                                                                                                                                                                                                     |
| US DOT          | United States Department Of Transportation                                                                                                                                                                                                                                          |
| Airline         | The company that operates airplanes                                                                                                                                                                                                                                                 |
| CRS             | Computer Reservation System, the source of the scheduled (as opposed to actual) departure and arrival times in the dataset |
| Carrier         | Same as airline                                                                                                                                                                                                                                                                     |
| On-time         | Flight occurred as per schedule |
| Diverted        | Flight is diverted if the airplane landed at an airport that was not in the flight plan |
| Delayed         | Flight is delayed if it departs or arrives later than scheduled; US DOT on-time statistics count a flight as delayed when it is 15 or more minutes late |
| Canceled        | Flight is canceled if it will not happen at all                                                                                                                                                                                                                                     |
| FIPS            | Federal Information Processing Standard                                                                                                                                                                                                                                             |
| Airplane        | An airplane, informally plane, is a fixed-wing aircraft that is propelled forward by thrust from a jet engine, propeller, or rocket engine                                                                                                                                          |
| Takeoff airport | Airport where the airplane takes off and the flight starts                                                                                                                                                                                                                          |
| Landing airport | Airport where the airplane lands. The flight may or may not be finished, since the airplane can land for maintenance, e.g. refueling. |
| Flight route    | A route that an airplane takes. Starts at a takeoff airport. Can contain multiple landing airports                                                                                                                                                                                  |

#### 2. ML terminology
| Term         | Meaning                                                                                                                       |
|--------------|-------------------------------------------------------------------------------------------------------------------------------|
| Underfitting | When a machine learning model is too simple and fails to capture the underlying patterns in the data.                         |
| Overfitting  | When a machine learning model is too complex and fixates too much on the training data, failing when performing on real data. |
| Word2Vec     | Word2vec is a technique in natural language processing (NLP) for obtaining vector representations of words.                   |

## Scope of the project
----------
#### 1. Background
The dataset has been extracted from the Marketing Carrier On-Time Performance (Beginning January 2018) data table of the "On-Time" database from the TranStats data library. 

#### 2. Business problem
Airlines and airports suffer from delayed and diverted flights, since rescheduling costs them money and inconveniences passengers. They want a solution that predicts whether a flight will be on-time, delayed, canceled, or diverted. The provided dataset contains extensive information about flights in the United States in 2022, which could help classify flights into the four aforementioned classes ahead of departure.

#### 3. Business objectives
- Increase passenger satisfaction and reduce time lost in airports
- Reduce airport load by better managing the flights
- Reduce airline spending on flight fees

#### 4. ML objectives
- Classify a given flight as on-time, delayed, canceled, or diverted, using only data available before the flight begins
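
To train such a classifier, each historical flight needs one of the four class labels. The sketch below shows one way the label could be derived from the outcome fields of a flight record. The column names (`Cancelled`, `Diverted`, `ArrDelayMinutes`), the 0/1 encoding of the flags, and the 15-minute threshold are assumptions for illustration and should be checked against the actual schema of the extract. These outcome fields are only known after the flight, so they belong in the target and not in the model features.

```python
from typing import Any, Mapping

# Assumed column names and threshold; adjust to the real dataset schema.
CANCELLED_COL = "Cancelled"
DIVERTED_COL = "Diverted"
ARR_DELAY_COL = "ArrDelayMinutes"
DELAY_THRESHOLD_MIN = 15


def flight_status(row: Mapping[str, Any]) -> str:
    """Map one flight record to one of the four target classes."""
    if row.get(CANCELLED_COL):
        return "canceled"
    if row.get(DIVERTED_COL):
        return "diverted"
    # Treat a missing delay value as 0 minutes late.
    delay = row.get(ARR_DELAY_COL) or 0
    return "delayed" if delay >= DELAY_THRESHOLD_MIN else "on-time"


# Hypothetical records, one per class.
rows = [
    {"Cancelled": 0, "Diverted": 0, "ArrDelayMinutes": 3},
    {"Cancelled": 0, "Diverted": 0, "ArrDelayMinutes": 42},
    {"Cancelled": 1, "Diverted": 0, "ArrDelayMinutes": None},
    {"Cancelled": 0, "Diverted": 1, "ArrDelayMinutes": None},
]
labels = [flight_status(r) for r in rows]
print(labels)
```

Checking cancellation and diversion before the delay keeps the four classes mutually exclusive, since a canceled flight has no meaningful arrival delay.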

## Success Criteria
-------------
#### 1. Business success criteria
- Decrease airline spending by at least 12% within the next year

#### 2. ML success criteria
- Because the dataset is highly imbalanced, accuracy alone would be misleading, so the model should achieve a recall above 75%.
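
The criterion does not say how recall is averaged over the four classes. A minimal sketch, assuming macro-averaged recall (the unweighted mean of per-class recall), shows why this matters on imbalanced data. The labels below are made up for illustration and are not results from the project.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_recall(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Recall for every class that occurs in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {cls: hits[cls] / totals[cls] for cls in totals}


def macro_recall(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Hypothetical, imbalanced example: most flights are on-time.
y_true = ["on-time"] * 6 + ["delayed"] * 2 + ["canceled", "diverted"]
y_pred = ["on-time"] * 6 + ["delayed", "on-time"] + ["canceled", "on-time"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy     = {accuracy:.3f}")                      # 0.800
print(f"macro recall = {macro_recall(y_true, y_pred):.3f}")  # 0.625
print(per_class_recall(y_true, y_pred))
```

In this toy case the accuracy is 0.8 while the macro recall is only 0.625, because the rare `diverted` class is never detected. scikit-learn's `recall_score` with `average="macro"` is the library counterpart of `macro_recall`.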

## Project feasibility
-------------
#### 1. Inventory of resources

List the resources available to the project, including: personnel (business experts, data experts, technical support, machine learning personnel), data (fixed extracts, access to live warehoused or operational data), computing resources (hardware platforms) and software (machine learning tools, other relevant software).

##### 1.1 Personnel:
- Alexey Tkachenko: Data engineer
- Ivan Golov: Data scientist
- Artem Bulgakov: ML engineer
  
##### 1.2 Experts:
- Firas Johla: MLOps expert

##### 1.3 Data:
- Marketing Carrier On-Time Performance data table for 2022 of the "On-Time" database from the TranStats data library

##### 1.4 Computing resources:
- Personal computers
- InnoDataHub
- Google Colab resources
- Apache Spark
- Apache Airflow
- Apache Pulsar
  
##### 1.5 Software:
- DVC for data management
- Python
- Git for version control
- Hydra for configuration management
- scikit-learn / pandas / seaborn
- Pytest for testing
- PostgreSQL
- MLflow
- Docker
- Feast + ClearML
- FastAPI, Flask

#### 2. Requirements, assumptions and constraints

##### 2.1 Requirements
Project will be completed in a few phases:
- Phase I - Business and data understanding
- Phase II - Data engineering/preparation
- Phase III - Model engineering
- Phase IV - Model validation
- Phase V - Model deployment
- Phase VI - Model monitoring and maintenance

##### 2.2 Constraints
- Time constraints: Each phase must be completed within one week, with a weekly deadline on Friday at 23:59. The project ends on 20 June.

#### 3. Risks and contingencies
- Accidental modification of data:
  We could accidentally change the sample data. To mitigate this risk, we have set up DVC and also run Great Expectations. If the Great Expectations validation fails, we can revert the data with DVC.

- Insufficient data quality:
  We have provided a comprehensive EDA, which shows that a model fulfilling the business task is at least feasible. If, during development of the ML model, we find that the task cannot be fulfilled, we will fall back to the problem formulation stage and update the requirements.
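
To illustrate the first risk, the sketch below detects an accidental change to a data file by comparing its content hash with a previously recorded one. It is only an illustration of the idea using the Python standard library, not how DVC or Great Expectations are implemented, and the file name and contents are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large data samples do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unmodified(path: Path, expected_sha256: str) -> bool:
    """True if the file content still matches the recorded hash."""
    return file_sha256(path) == expected_sha256


# Demo on a temporary, hypothetical sample file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("FlightDate,Cancelled\n2022-01-01,0\n")
    recorded = file_sha256(sample)           # stored right after sampling
    unchanged = is_unmodified(sample, recorded)

    sample.write_text("FlightDate,Cancelled\n2022-01-01,1\n")  # accidental edit
    change_detected = not is_unmodified(sample, recorded)

print(unchanged, change_detected)
```

When such a check (or the data validation suite) fails, running `dvc checkout` restores the version of the file that is tracked in the repository.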

#### 4. Costs and benefits

Construct a cost-benefit analysis for the project, which compares the costs of the project with the potential benefit to the business if it is successful. The comparison should be as specific as possible.

| Category                        | Details                                                                         | Estimated Cost                      | Potential Benefits                                           |
|---------------------------------|---------------------------------------------------------------------------------|-------------------------------------|--------------------------------------------------------------|
| Personnel                       |                                                                                 |                                     |                                                              |
| Data engineer                   | Alexey Tkachenko                                                                | $100/hr, 40 hrs/week                | Expertise in data processing and engineering                 |
| Data scientist                  | Ivan Golov                                                                      | $120/hr, 40 hrs/week                | Expertise in data analysis and model building                |
| ML engineer                     | Artem Bulgakov                                                                  | $110/hr, 40 hrs/week                | Expertise in machine learning and model implementation       |
| **Total Personnel Cost**        |                                                                                 | $13,200/week for 5 weeks = $66,000  |                                                              |
| Data                            | Marketing Carrier On-Time Performance data table for 2022                       | No additional cost                  | High-quality data for model training and validation          |
| **Computing Resources**         |                                                                                 |                                     |                                                              |
| Personal computers              | Already available                                                               | No additional cost                  | Necessary for development                                    |
| InnoDataHub                     | Cloud-based platform for data processing                                        | $1,000/month                        | Scalable data processing resources                           |
| Google Colab resources          | Cloud-based Jupyter notebook environment                                        | $500/month                          | Access to powerful computing resources for model training    |
| Apache Spark                    | Big data processing framework                                                   | No additional cost                  | Efficient large-scale data processing                        |
| Apache Airflow                  | Workflow orchestration tool                                                     | No additional cost                  | Efficient management of data pipelines                       |
| Apache Pulsar                   | Distributed messaging and streaming platform                                    | No additional cost                  | Real-time data processing                                    |
| **Total Computing Cost**        |                                                                                 | $1,500/month for 2 months = $3,000  |                                                              |
| Software                        |                                                                                 |                                     |                                                              |
| DVC                             | Data management tool                                                            | No additional cost                  | Efficient version control for data                           |
| Python                          | Programming language                                                            | No additional cost                  | Essential for model development                              |
| Git                             | Version control system                                                          | No additional cost                  | Essential for code versioning                                |
| Hydra                           | Configuration management tool                                                   | No additional cost                  | Efficient configuration management                           |
| scikit-learn / pandas / seaborn | Machine learning and data analysis libraries                                    | No additional cost                  | Essential for data analysis and model building               |
| Pytest                          | Testing framework                                                               | No additional cost                  | Ensures code quality and reliability                         |
| PostgreSQL                      | Database management system                                                      | No additional cost                  | Efficient data storage and retrieval                         |
| MLflow                          | Machine learning lifecycle management tool                                      | No additional cost                  | Tracks experiments and models                                |
| Docker                          | Containerization platform                                                       | No additional cost                  | Ensures consistency across environments                      |
| Feast + ClearML                 | Feature store and experiment management                                         | No additional cost                  | Manages features and experiments                             |
| FastAPI, Flask                  | Web frameworks for API development                                              | No additional cost                  | Deploys models as services                                   |
| **Total Software Cost**         |                                                                                 | No additional cost                  |                                                              |
| Other Costs                     |                                                                                 |                                     |                                                              |
| Training and development        | Continuous learning for team members                                            | $2,000                              | Improves team expertise and productivity                     |
| Miscellaneous                   | Additional unforeseen expenses                                                  | $1,000                              | Covers unexpected costs                                      |
| Total Other Costs               |                                                                                 | $3,000                              |                                                              |
| Total Project Cost              | Personnel + Computing + Other                                                   | $72,000                             |                                                              |
| Benefits                        | Improved model accuracy and deployment pipeline                                 |                                     |                                                              |
|                                 | Increased customer satisfaction due to improved on-time performance predictions |                                     | Estimated revenue increase of $200,000 annually              |
|                                 | Enhanced decision-making capabilities                                           |                                     | Cost savings of $50,000 annually due to optimized operations |
| Total Benefits                  |                                                                                 |                                     | $250,000 annually                                            |
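
The totals can be recomputed from the per-row figures listed in the table. The sketch below does this arithmetic explicitly. The rates, durations, and benefit estimates are taken from the rows above, while the payback period is an added, simplified calculation that assumes the benefits accrue evenly over the year.

```python
HOURS_PER_WEEK = 40
PROJECT_WEEKS = 5
COMPUTE_MONTHS = 2

hourly_rates = {"Data engineer": 100, "Data scientist": 120, "ML engineer": 110}
monthly_compute = {"InnoDataHub": 1_000, "Google Colab resources": 500}
other_costs = {"Training and development": 2_000, "Miscellaneous": 1_000}
annual_benefits = {"Revenue increase": 200_000, "Cost savings": 50_000}

weekly_personnel = sum(hourly_rates.values()) * HOURS_PER_WEEK   # 13,200
personnel_cost = weekly_personnel * PROJECT_WEEKS                # 66,000
computing_cost = sum(monthly_compute.values()) * COMPUTE_MONTHS  # 3,000
other_cost = sum(other_costs.values())                           # 3,000

total_cost = personnel_cost + computing_cost + other_cost        # 72,000
total_benefit = sum(annual_benefits.values())                    # 250,000 per year

net_first_year = total_benefit - total_cost
payback_months = total_cost / (total_benefit / 12)

print(f"Total project cost: ${total_cost:,}")
print(f"Net benefit in the first year: ${net_first_year:,}")
print(f"Payback period: {payback_months:.1f} months")
```

Under these figures the project costs $72,000 and would pay for itself in roughly three and a half months if the estimated benefits materialize.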


#### 5. Feasibility report

A proof-of-concept (POC) ML model is showcased in `poc.ipynb`.

## Project plan
----------------

#### 1. Project plan
- Phase I - Business and data understanding - Week 1
- Phase II - Data engineering/preparation - Week 2
- Phase III - Model engineering - Week 3
- Phase IV - Model validation - Week 4
- Phase V - Model deployment - Week 5
- Phase VI - Model monitoring and maintenance - Week 6

#### 2. ML project Canvas
At the end of the first phase, you should create a canvas for the project as a summary of this phase.


> ##### Example
> Follow the link: https://github.com/louisdorard/machine-learning-canvas/blob/master/churn.pdf

