# Business and Data understanding
------------

The initial phase defines the business objectives and translates them into ML objectives, collects the data and verifies its quality, and finally assesses the project's feasibility.

![](https://i.imgur.com/55J7fBc.jpeg)

## Terminology

#### 1. Business terminology
| Term | Meaning |
|---|---|
| WAC | World Area Code |
| US DOT | United States Department of Transportation |
| Airline | The company that operates airplanes |
| CRS | Computer Reservation System. In the On-Time data, fields prefixed with CRS hold scheduled (as opposed to actual) times |
| Carrier | Same as airline |
| On-time | Flight operated as scheduled |
| Diverted | A flight is diverted if the airplane landed at an airport that was not in the flight plan |
| Delayed | A flight is delayed if it was rescheduled to another time |
| Canceled | A flight is canceled if it will not happen at all |
| FIPS | Federal Information Processing Standard |
| Airplane | An airplane, informally plane, is a fixed-wing aircraft that is propelled forward by thrust from a jet engine, propeller, or rocket engine |
| Takeoff airport | Airport where the airplane takes off and the flight starts |
| Landing airport | Airport where the airplane lands. The flight may or may not be finished, since the airplane can land for maintenance, e.g., refueling |
| Flight route | A route that an airplane takes. Starts at a takeoff airport. Can contain multiple landing airports |

#### 2. ML terminology
| Term         | Meaning                                                                                                                       |
|--------------|-------------------------------------------------------------------------------------------------------------------------------|
| Underfitting | When a machine learning model is too simple and fails to capture the underlying patterns in the data.                         |
| Overfitting  | When a machine learning model is too complex and fixates too much on the training data, failing when performing on real data. |
| Word2Vec     | Word2vec is a technique in natural language processing (NLP) for obtaining vector representations of words.                   |

## Scope of the project
----------
#### 1. Background
The dataset has been extracted from the Marketing Carrier On-Time Performance (Beginning January 2018) data table of the "On-Time" database from the TranStats data library. 

#### 2. Business problem
Airports suffer from delayed and diverted flights, since rescheduling costs both airlines and airports money. It is also an inconvenience to passengers. Airports and airlines want a solution that predicts whether a flight will be on-time, delayed, canceled, or diverted. The provided dataset contains extensive information about flights in the United States in 2022, which could help classify flights into these four classes ahead of departure.

#### 3. Business objectives
- Increase passenger satisfaction and reduce time lost in airports
- Reduce airport load by better managing the flights
- Reduce airline spending on flight fees

#### 4. ML objectives
- Classify whether a given flight will be on-time, delayed, canceled, or diverted, using only data available before the flight begins
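
As a rough illustration of the four-class target, the sketch below maps one flight record to a label. The column names `Cancelled`, `Diverted`, and `ArrDel15` (an indicator for arrival delays of 15 minutes or more) are assumptions about the dataset schema and should be checked against the actual extract; the precedence of the rules is also an assumption. These columns are outcomes, so they would serve only as the target and never as input features.

```python
from typing import Any, Mapping

LABELS = ("on-time", "delayed", "canceled", "diverted")


def flight_label(record: Mapping[str, Any]) -> str:
    """Map one flight record to one of the four target classes.

    Assumed columns (not confirmed by this document):
    Cancelled, Diverted, ArrDel15, each equal to 1 when the event occurred.
    """
    if record.get("Cancelled", 0) == 1:
        return "canceled"
    if record.get("Diverted", 0) == 1:
        return "diverted"
    if record.get("ArrDel15", 0) == 1:
        return "delayed"
    # Anything that is not canceled, diverted, or delayed counts as on-time.
    return "on-time"


example = {"Cancelled": 0, "Diverted": 0, "ArrDel15": 1}
print(flight_label(example))
```

Missing or empty indicator values fall through to the next rule, so a canceled flight with no arrival delay recorded is still labeled `canceled`.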

## Success Criteria
-------------
#### 1. Business success criteria
- Decrease the airline's spending by at least 12% within the next year

#### 2. ML success criteria
- Because the dataset is quite imbalanced, the model should achieve a recall above 75%.
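
The criterion does not state how recall is aggregated across the four classes. The sketch below assumes macro-averaging (the unweighted mean of per-class recall), which is one common choice for imbalanced data; the toy labels are made up for illustration. It shows why accuracy alone can look good while recall does not.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall for each class: correctly predicted / actual occurrences."""
    actual = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / count for label, count in actual.items()}


def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls (assumed aggregation)."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Toy, imbalanced example: a model that always predicts the majority class.
y_true = ["on-time"] * 8 + ["delayed"] * 2
y_pred = ["on-time"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                          # 0.8
print(per_class_recall(y_true, y_pred))  # {'on-time': 1.0, 'delayed': 0.0}
print(macro_recall(y_true, y_pred))      # 0.5
```

Here the always-on-time predictor reaches 80% accuracy but only 50% macro recall, so it would fail the 75% target under this reading of the criterion.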

## Project feasibility
-------------
This task involves more detailed fact-finding about all of the resources, constraints, assumptions, and other factors that should be considered in determining the data analysis goal and project plan. In the previous task, your objective was to quickly get to the crux of the situation. Here, you want to flesh out the details.

### Tasks
- Assess the project feasibility
- Create POC (Proof-of-concept) model

#### 1. Inventory of resources

List the resources available to the project, including: personnel (business experts, data experts, technical support, machine learning personnel), data (fixed extracts, access to live warehoused or operational data), computing resources (hardware platforms) and software (machine learning tools, other relevant software).

##### 1.1 Personnel:
- Alexey Tkachenko: Data engineer
- Ivan Golov: Data scientist
- Artem Bulgakov: ML engineer
  
##### 1.2 Experts:
- Firas Johla: MLOps expert

##### 1.3 Data:
- Marketing Carrier On-Time Performance data table for 2022 of the "On-Time" database from the TranStats data library

##### 1.4 Computing resources:
- Personal computers
- InnoDataHub
- Google Colab resources
- Apache Spark
- Apache Airflow
- Apache Pulsar
  
##### 1.5 Software:
Mostly open-source software:
- DVC for data management
- Python
- Git for version control
- Hydra for configuration management
- scikit-learn / pandas / seaborn
- Pytest for testing
- PostgreSQL
- MLflow
- Docker
- Feast + ClearML
- FastAPI, Flask

#### 2. Requirements, assumptions and constraints

List all requirements of the project, including the schedule of completion, comprehensibility and quality of results, and security as well as legal issues. As part of this output, make sure that you are allowed to use the data. List the assumptions made by the project.

These may be assumptions about the data that can be checked during machine learning, but may also include non-checkable assumptions about the business upon which the project rests. It is particularly important to list the latter if they form conditions on the validity of the results.

List the constraints on the project. These may be constraints on the availability of resources, but may also include technological constraints such as the size of data that it is practical to use for modeling.

##### 2.1 Requirements
Project will be completed in a few phases:
- Phase I - Business and data understanding
- Phase II - Data engineering/preparation
- Phase III - Model engineering
- Phase IV - Model validation
- Phase V - Model deployment
- Phase VI - Model monitoring and maintenance

##### 2.3 Constraints
- Time constraints: each phase must be completed within one week, with a weekly deadline on Friday at 23:59. The project ends on
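
The weekly Friday 23:59 rule can be turned into a concrete schedule once a start date is known. The sketch below assumes one deadline per phase on consecutive Fridays; the start date used in the example is hypothetical, since the project dates are not given here.

```python
from datetime import date, datetime, time, timedelta

PHASES = [
    "Phase I - Business and data understanding",
    "Phase II - Data engineering/preparation",
    "Phase III - Model engineering",
    "Phase IV - Model validation",
    "Phase V - Model deployment",
    "Phase VI - Model monitoring and maintenance",
]


def friday_deadlines(start, phases=PHASES):
    """Pair each phase with a Friday 23:59 deadline, one week apart.

    The first deadline is the first Friday on or after `start`.
    """
    days_to_friday = (4 - start.weekday()) % 7  # Monday=0, Friday=4
    first = datetime.combine(start + timedelta(days=days_to_friday), time(23, 59))
    return [(phase, first + timedelta(weeks=i)) for i, phase in enumerate(phases)]


# Hypothetical start date, chosen only for illustration.
schedule = friday_deadlines(date(2024, 6, 3))
for phase, deadline in schedule:
    print(f"{deadline:%a %Y-%m-%d %H:%M}  {phase}")
```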

#### 3. Risks and contingencies

List the risks or events that might occur to delay the project or cause it to fail. List the corresponding contingency plans; what action will be taken if the risks happen.


#### 4. Costs and benefits

Construct a cost-benefit analysis for the project, which compares the costs of the project with the potential benefit to the business if it is successful. The comparison should be as specific as possible.

![](https://i.imgur.com/XU2lghc.png)

#### 5. Feasibility report

Build a POC ML model and explain as a team whether it is feasible to do this ML project or not. If not, then you need to find another business problem. The key factors here are related to data availability, quality, costs, and the nature of the business problem.
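
One possible first step toward a POC is to hold out a test set that preserves the class proportions, so that rare classes such as canceled or diverted flights appear in both parts. The sketch below is a minimal stratified split using only the standard library; the function name, the 20% test fraction, and the toy label counts are illustrative assumptions, and a time-based split may turn out to be more appropriate for flight data.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class represented in the test set.

    Each class contributes roughly `test_fraction` of its rows, and at least
    one row, to the test set. A class with a single row ends up only in test.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy label distribution, made up for illustration.
labels = ["on-time"] * 80 + ["delayed"] * 15 + ["canceled"] * 3 + ["diverted"] * 2
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```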

TODO

## Produce project plan
----------------

### Task

Describe the intended plan for achieving the machine learning goals and thereby achieving the business goals. The plan should specify the anticipated set of steps to be performed during the rest of the project including an initial selection of tools and techniques.

### Output

#### 1. Project plan
List the stages to be executed in the project, together with their duration, resources required, inputs, outputs, and dependencies. Where possible, try and make explicit the large-scale iterations in the machine learning process, for example, repetitions of the modeling and evaluation phases. As part of the project plan, it is also important to analyze dependencies between time schedule and risks. Mark results of these analyses explicitly in the project plan, ideally with actions and recommendations if the risks are manifested. Decide at this point which evaluation strategy will be used in the evaluation phase. Your project plan will be a dynamic document. At the end of each phase you’ll review progress and achievements and update the project plan accordingly. Specific review points for these updates should be part of the project plan.

#### 2. ML project Canvas
At the end of the first phase, you should create a canvas for the project as a summary of this phase.


> ##### Example
> Follow the link: https://github.com/louisdorard/machine-learning-canvas/blob/master/churn.pdf


TODO