# business and data understanding
------------

The initial phase defines the business objectives and translates them into ML objectives, collects the data and verifies its quality, and finally assesses the project's feasibility.

![](https://i.imgur.com/55J7fBc.jpeg)

## Terminology

### tasks

Compile a glossary of terminology relevant to the project. This may include two components:
(1) A glossary of relevant business terminology, which forms part of the business understanding available to the project. Constructing this glossary is a useful "knowledge elicitation" and education exercise.
(2) A glossary of machine learning terminology, illustrated with examples relevant to the business problem in question.




#### 1. Business terminology
A table or paragraph containing all the business-related terms to be used in the project.



> - Car Pricing Information: Data related to the prices of various car models, which is essential for predicting future prices.
> - Market Trends: Patterns and tendencies in the car market, including consumer preferences, economic factors, and industry developments. Example: "The market trend indicates an increasing demand for electric vehicles."
> - Vehicle Characteristics: Specific attributes of cars such as make, model, year, mileage, condition, and features. Example: "Vehicle characteristics like mileage and age significantly affect car prices."
> - Predictive Model: A statistical or machine learning model used to forecast future values based on historical data. Example: "The company developed a predictive model to estimate car prices for the upcoming year."
> - Dataset: A collection of data points, in this case, related to vehicle sales and market trends. Example: "The dataset includes information on car sales, prices, and market trends over the past decade."
> - Accurate Forecasting: The process of making precise and reliable predictions about future events. Example: "Accurate forecasting of car prices helps customers make informed purchasing decisions."
> - Reliability: The consistency and dependability of the data or the model’s predictions. Example: "The reliability of the predictive model is crucial for maintaining customer trust."

#### 2. ML Terminology
A table or paragraph containing all the machine learning-related terms to be used in the project.

> - Regression: A type of predictive modeling technique which estimates the relationships among variables. Example: "The project uses regression to predict car prices based on various factors."
> - Feature: An individual measurable property or characteristic of a phenomenon being observed. Example: "Features such as car age, mileage, and brand are used in the predictive model."
> - Target Variable: The variable that the model is trying to predict. In this project, it is the car price. Example: "The target variable in our model is the car price."
> - Training Data: The subset of the dataset used to train the model. Example: "80% of the dataset was used as training data to build the predictive model."
> - Test Data: The subset of the dataset used to test the model’s accuracy. Example: "20% of the dataset was reserved as test data to evaluate the model’s performance."
> - Model Accuracy: A measure of how close the model's predictions are to the actual values. Example: "The model achieved an accuracy of 95% in predicting car prices."
> - Overfitting: When a model learns the training data too well, including noise, resulting in poor performance on new data. Example: "The initial model overfitted the training data, so regularization techniques were applied."
> - Underfitting: When a model is too simple and fails to capture the underlying patterns in the data. Example: "To avoid underfitting, additional features were incorporated into the model."
> - Mean Absolute Error (MAE): The average absolute difference between predicted and actual values, expressed in the same units as the target variable. Example: "The MAE of the model was low, indicating that predicted car prices were close to actual prices on average."
> - Cross-Validation: A technique for assessing how the results of a statistical analysis will generalize to an independent data set. Example: "Cross-validation was used to ensure the model's robustness and avoid overfitting."
> - Feature Engineering: The process of using domain knowledge to create features that make machine learning algorithms work better. Example: "Feature engineering was crucial to improve the model's predictions by incorporating relevant vehicle characteristics."
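
To make the error metrics in this glossary concrete, here is a minimal sketch that computes MAE and RMSE by hand. The prices are hypothetical values chosen for illustration, not taken from the project dataset.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference; penalizes large errors more."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical selling prices in dollars (illustrative only).
actual = [10000, 15000, 20000, 25000]
predicted = [11000, 14000, 20000, 27000]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(f"MAE:  {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```

RMSE is never smaller than MAE on the same predictions, and the gap widens when a few predictions are far off.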

## Scope of the project
----------

### tasks
- Explore the background of the business.
- Define business problem
- Define business objectives
- Translate business objectives into ML objectives

The objective here is to thoroughly understand, from a business perspective, what the client really wants to accomplish. Often the client has many competing objectives and constraints that must be properly balanced. The goal is to uncover important factors, at the beginning, that can influence the outcome of the project. A possible consequence of neglecting this step is to expend a great deal of effort producing the right answers to the wrong questions.



#### 1. background
A short paragraph to record the information that is known about the organization's business situation at the beginning of the project.

>Companies are committed to providing accurate and reliable car pricing information to customers. They gather extensive data on vehicle sales and market trends to build predictive models that forecast car prices. The Vehicle Sales and Market Trends Dataset encompasses a comprehensive collection of information on various vehicle characteristics and market trends, which is crucial for developing robust predictive models.

#### 2. Business Problem
A short paragraph to describe the business problem.

>The business problem is that the companies need to provide accurate and reliable car pricing information to their customers. Currently, the companies struggle to predict car prices effectively due to the rapidly changing market trends and diverse vehicle characteristics. This inaccuracy can lead to customer dissatisfaction and a loss of competitive edge. The companies aim to develop a predictive model to forecast car prices based on historical sales data and market trends.

#### 3. Business Objectives
A list of business objectives which describes the customer's primary objective, from a business perspective. In addition to the primary business objective, there are typically other related business questions that the customer would like to address.

> - How do different vehicle characteristics (e.g., make, model, year, mileage) impact the predicted price?
> - What are the key market trends affecting car prices in the short and long term?
> - How can pricing accuracy influence customer satisfaction and loyalty?
> - Will the integration of economic indicators (e.g., inflation, interest rates) improve the model's prediction accuracy?
> - How can the predictive model help in optimizing inventory management and pricing strategies for dealerships?
#### 4. ML Objectives
A list of machine learning objectives which describes the intended outputs of the project that enables the achievement of the business objectives.

> - Build a regression model to predict car prices based on historical data and market trends.
> - Identify and engineer relevant features from the dataset that influence car prices.
> - Evaluate the model's performance using metrics such as Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
> - Implement cross-validation techniques to ensure the model's robustness and prevent overfitting.
> - Continuously update the model with new data to maintain accuracy and relevance in a changing market.
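
The evaluation objectives above rely on a hold-out split and k-fold cross-validation. The following is a minimal standard-library sketch of both, assuming an 80/20 split and 5 folds. The function names `train_test_split` and `k_fold_indices` are hypothetical helpers written for this example; a real project would more likely use a library such as scikit-learn.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows reproducibly and split them into (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Distribute the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

# Stand-in for 100 records; in the project these would be rows of the sales dataset.
train_rows, test_rows = train_test_split(range(100))
folds = list(k_fold_indices(len(train_rows), k=5))
print(len(train_rows), len(test_rows), len(folds))
```

Cross-validation is run on the training portion only, so the test rows stay untouched until the final evaluation.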

## Success Criteria
-------------

### tasks
- Describe the success criteria of the ML project on three different levels: the business success criteria, the ML success criteria and the economic success criteria.


#### 1. Business Success Criteria
A list of criteria from a business point of view.

> - Increase the accuracy of car price predictions by at least 10% compared to current methods.
> - Enhance customer satisfaction ratings by providing more accurate pricing information, leading to a 5% increase in positive feedback within six months.
> - Reduce pricing-related customer complaints by 15% within the next year.
> - Improve the decision-making process for inventory management, resulting in a 10% increase in sales turnover.
> - Gain a competitive edge in the market, reflected by a 7% increase in market share within one year.
#### 2. ML Success Criteria
A list of criteria for a successful outcome to the project in technical terms.

> - Achieve a Mean Absolute Error (MAE) of less than $500 in car price predictions.
> - Obtain a Root Mean Squared Error (RMSE) of less than $700.
> - Ensure the model's R-squared value is greater than 0.85, meaning the model explains more than 85% of the variance in actual prices.
> - Implement cross-validation techniques with a variance in performance metrics not exceeding 2% across different folds.
> - Maintain model performance by retraining with new data periodically, ensuring that prediction accuracy does not degrade over time.
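
The thresholds above can be checked automatically once a model has been evaluated. This is a sketch under stated assumptions: `r_squared` and `meets_ml_criteria` are hypothetical helpers, and the "2% variance across folds" criterion is interpreted here as the relative spread (max minus min, divided by the mean) of the per-fold MAE, which is only one possible reading.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def meets_ml_criteria(mae, rmse, r2, fold_maes):
    """Compare evaluation results against the ML success criteria listed above."""
    mean_fold = sum(fold_maes) / len(fold_maes)
    # Assumption: fold stability is measured as relative spread of per-fold MAE.
    spread = (max(fold_maes) - min(fold_maes)) / mean_fold
    return {
        "mae_below_500": mae < 500,
        "rmse_below_700": rmse < 700,
        "r2_above_0.85": r2 > 0.85,
        "fold_spread_within_2pct": spread <= 0.02,
    }

# Hypothetical evaluation results, for illustration only.
report = meets_ml_criteria(mae=450.0, rmse=650.0, r2=0.90, fold_maes=[448.0, 450.0, 452.0])
for criterion, passed in report.items():
    print(f"{criterion}: {'pass' if passed else 'fail'}")
```
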
#### 3. Economic Success Criteria
A list of criteria that describe the economic benefits expected from the project.

> - Increase in revenue by 8% through optimized pricing strategies and improved sales turnover.
> - Achieve a return on investment (ROI) of at least 15% within the first year of implementing the predictive model.
> - Reduce costs associated with manual pricing adjustments by 20%.
> - Enhance inventory turnover rate by 10%, reducing holding costs and improving cash flow.
> - Generate additional revenue streams by offering premium pricing insights to partner dealerships and third-party vendors.

## Data collection

### tasks
- Specify the data sources
- Collect the data
- Version control on the data


#### 1. Data Collection Report
A section to describe the data sources and how you want to collect the data.

> - **Data Source:** The origin of the data.
"The data was collected from the Kaggle dataset 'Vehicle Sales and Market Trends Dataset'."
> - **Data Type:** The format or structure of the data.
Example: "The data consists of numerical values (e.g., odometer reading, selling prices), categorical values (e.g., make, model, trim, body type), and text values (e.g., VIN)."
> - **Data Size:** The quantity of data collected.
"The dataset contains approximately 500 thousand vehicle sales transactions, each with numerous features such as year, make, model, trim, body type, transmission type, VIN, state of registration, condition rating, odometer reading, exterior and interior colors, seller information, Manheim Market Report (MMR) values, selling prices, and sale dates."
> - **Data Collection Method:** The process used to collect the data.
"The data was downloaded directly from the Kaggle website, ensuring it is in a standardized CSV format which facilitates easy import and manipulation in data analysis tools."
#### 2. Data Version Control Report
A section to describe the data versions, what changes were made to the data, and how the data is backed up. This should be done after you collect the data.

> - **Data Version:** A unique identifier for each version of the data.
"The current data version was last updated in February 2024."
> - **Data Change Log:** A record of changes made to the data.
"The data change log shows no modifications have been made yet. Initial dataset contains all original attributes and records as provided by Kaggle."
> - **Data Backup:** A copy of the data stored for recovery in case of data loss.
"We store the latest version of the unmodified dataset."
> - **Data Archiving:** The process of storing and managing historical data.
"We will update the database when a new version of the dataset is released. Previous versions of the dataset will also be stored"
> - **Data Access Control:** The process of controlling who can access and modify the data.
"We have a small team, so everyone has access to GitHub."

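One lightweight way to give each data version a unique identifier is to record a checksum of the raw file alongside a human-readable label. The sketch below uses only the standard library. The helper names and the file name in the usage note are hypothetical, and a dedicated tool such as DVC could serve the same purpose.

```python
import hashlib
from pathlib import Path

def file_sha256(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def version_record(path, version_label):
    """Build a small record that identifies one version of the dataset file."""
    p = Path(path)
    return {
        "file": p.name,
        "version": version_label,
        "size_bytes": p.stat().st_size,
        "sha256": file_sha256(p),
    }
```

For example, `version_record("car_prices.csv", "2024-02")` (with a hypothetical file name) returns a dictionary that can be appended to the data change log. If the checksum of a later download differs, the dataset has changed and a new version entry is needed.
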
## Data quality verification

### tasks
- Describe data
- Define data requirements
- Explore the data
- Verify the data quality

### output

#### 1. Data description
A section to describe the data that has been acquired including its format, its quantity (for example, the number of records and fields in each table), the identities of the fields and any other surface features which have been discovered. Add a table of description of the data features.

> ##### Example
> - The data acquired for this project includes a dataset of ~ 500k records with 16 fields each. The fields include customer demographics, purchase history, and product preferences. The data is in a CSV format and is stored in a local database.
> - Add a table of description of the data features.

#### 2. Data exploration
A section to present results of your data exploration, including first findings or initial hypothesis and their impact on the remainder of the project. If appropriate you could include graphs and plots here to indicate data characteristics that suggest further examination of interesting data subsets.

> ##### Example
> - During data exploration, several interesting patterns and correlations were discovered. For example, there is a strong correlation between the age of customers and their purchase frequency. Additionally, customers who have purchased from the company in the past are more likely to make repeat purchases. These findings suggest that the data is representative of the target audience and that the company's marketing strategies are effective.
> - Add charts and figures to present the findings.

#### 3. Data requirements
A section to describe the data requirements. The requirements can be defined either on the meta-level or directly in the data, and should state the expected conditions of the data, i.e., whether a certain sample is plausible. The requirements can be, e.g., the expected feature values (a range for continuous features or a list for discrete features), the format of the data and the maximum number of missing values. The bounds of the requirements have to be defined carefully to include all possible real-world values but discard non-plausible data. Data that does not satisfy the expected conditions could be treated as anomalies and needs to be evaluated manually or excluded automatically. To mitigate the risk of anchoring bias in this first phase, discussing the requirements with a domain expert is advised. Documentation of the data requirements could be expressed in the form of a schema with strict data types and conditions.

> ##### Example
> - The data requirements for this project are defined as follows:
> > - Customer Demographics: Age should be within the range of 18 to 100 years, and gender should be either male or female.
> > - Purchase History: The number of purchases should be greater than or equal to 0, and the total amount spent should be greater than or equal to $0.
> > - Product Preferences: The product preferences should be represented as a list of product IDs, and each product ID should be unique and within the range of 1 to 1000.
> > - Here we define expectations to satisfy the data requirements.
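
A schema with strict types and conditions, as described above, can be sketched as a plain dictionary plus a validation function. Everything specific here is an assumption made for illustration: the column names, the numeric bounds, and the allowed transmission values should be replaced with the ones agreed with a domain expert.

```python
# Hypothetical requirements: column names and bounds are illustrative assumptions.
REQUIREMENTS = {
    "year": {"type": int, "min": 1980, "max": 2025},
    "odometer": {"type": float, "min": 0, "max": 1_000_000},
    "sellingprice": {"type": float, "min": 1, "max": 500_000},
    "transmission": {"type": str, "allowed": {"automatic", "manual"}},
}

def validate_row(row, requirements=REQUIREMENTS):
    """Return a list of violations for one record; an empty list means it is valid."""
    problems = []
    for column, rule in requirements.items():
        raw = row.get(column)
        if raw is None or raw == "":
            problems.append(f"{column}: missing")
            continue
        try:
            value = rule["type"](raw)
        except (TypeError, ValueError):
            problems.append(f"{column}: cannot parse {raw!r}")
            continue
        if "min" in rule and value < rule["min"]:
            problems.append(f"{column}: {value} below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            problems.append(f"{column}: {value} above maximum {rule['max']}")
        if "allowed" in rule and value.lower() not in rule["allowed"]:
            problems.append(f"{column}: {value!r} not in allowed values")
    return problems

# Values arrive as strings, as they would when read from a CSV file.
good = {"year": "2015", "odometer": "16639", "sellingprice": "21500", "transmission": "automatic"}
bad = {"year": "1890", "odometer": "-5", "sellingprice": "", "transmission": "warp"}
print(validate_row(good))
print(validate_row(bad))
```

Records with a non-empty violation list can be flagged for manual review or excluded automatically, depending on the policy chosen for anomalies.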

#### 4. Data quality verification report
A section to verify and report the quality of the data. Examine the quality of the data, addressing questions such as:
- Is the data complete (does it cover all the cases required)?
- Is it correct, or does it contain errors and, if there are errors, how common are they?
- Are there missing values in the data? If so, how are they represented, where do they occur, and how common are they?

> ##### Example
> - Completeness: The data is complete in the sense that it covers all the required cases. All customers have demographic information, and all purchases are recorded.
> - Correctness: The data appears to be correct, with no obvious errors. However, a manual review of the data is recommended to ensure that there are no errors.
> - Missing Values: There are no missing values in the data.
> - Overall, the data quality is high, and the data is suitable for analysis and modeling. However, a manual review of the data is recommended to ensure that there are no errors or anomalies.
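
The missing-value questions above (how values are represented, where they occur, and how common they are) can be answered with a short per-column count. This is a minimal sketch using the standard `csv` module. The list of tokens treated as missing and the sample column names are assumptions to adapt to the real dataset.

```python
import csv
import io
from collections import Counter

def missing_value_report(csv_file, missing_tokens=("", "na", "n/a", "null", "nan")):
    """Return {column: (missing_count, missing_fraction)} for an open CSV file."""
    reader = csv.DictReader(csv_file)
    fieldnames = reader.fieldnames or []
    missing = Counter()
    total = 0
    for row in reader:
        total += 1
        for column in fieldnames:
            value = row.get(column)
            # Short rows yield None; otherwise compare against the missing tokens.
            if value is None or value.strip().lower() in missing_tokens:
                missing[column] += 1
    return {c: (missing[c], missing[c] / total if total else 0.0) for c in fieldnames}

# Small in-memory sample with hypothetical columns, standing in for the real file.
sample = io.StringIO("make,odometer,sellingprice\nKia,16639,21500\nBMW,,30000\n,NA,\n")
report = missing_value_report(sample)
for column, (count, fraction) in report.items():
    print(f"{column}: {count} missing ({fraction:.0%})")
```

For the real dataset, the same function can be called on a file opened with `open(path, newline="")`.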


## Project feasibility
-------------
This task involves more detailed fact-finding about all of the resources, constraints, assumptions and other factors that should be considered in determining the data analysis goal and project plan. In the previous task, your objective was to quickly get to the crux of the situation. Here, you want to flesh out the details.

### tasks
- Assess the project feasibility
- Create POC (Proof-of-concept) model

### output

#### 1. inventory of resources

List the resources available to the project, including: personnel (business experts, data experts, technical support, machine learning personnel), data (fixed extracts, access to live warehoused or operational data), computing resources (hardware platforms) and software (machine learning tools, other relevant software).

#### 2. Requirements, assumptions and constraints

List all requirements of the project, including the schedule of completion, the comprehensibility and quality of results, and security as well as legal issues. As part of this output, make sure that you are allowed to use the data. List the assumptions made by the project.

These may be assumptions about the data that can be checked during machine learning, but may also include non-checkable assumptions about the business upon which the project rests. It is particularly important to list the latter if they form conditions on the validity of the results.

List the constraints on the project. These may be constraints on the availability of resources, but may also include technological constraints such as the size of data that it is practical to use for modeling.

#### 3. Risks and contingencies

List the risks or events that might occur to delay the project or cause it to fail. List the corresponding contingency plans; what action will be taken if the risks happen.


#### 4. Costs and benefits

Construct a cost-benefit analysis for the project, which compares the costs of the project with the potential benefit to the business if it is successful. The comparison should be as specific as possible.
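
A cost-benefit comparison can be kept explicit by listing each cost and benefit item and deriving the net benefit and ROI from them. The sketch below uses the common definition ROI = (benefits - costs) / costs. All figures are hypothetical placeholders, not estimates for this project.

```python
def cost_benefit(costs, benefits):
    """Summarize itemized costs and benefits (dicts of name -> amount in dollars)."""
    total_cost = sum(costs.values())
    total_benefit = sum(benefits.values())
    net_benefit = total_benefit - total_cost
    roi = net_benefit / total_cost if total_cost else float("nan")
    return {
        "total_cost": total_cost,
        "total_benefit": total_benefit,
        "net_benefit": net_benefit,
        "roi": roi,
    }

# Hypothetical first-year figures, for illustration only.
costs = {"personnel": 60_000, "compute": 15_000, "tooling": 5_000}
benefits = {"additional_revenue": 70_000, "manual_pricing_savings": 26_000}

summary = cost_benefit(costs, benefits)
print(f"Net benefit: ${summary['net_benefit']:,}")
print(f"ROI: {summary['roi']:.0%}")
```

Keeping the items separate makes it easy to see which assumptions drive the result and to rerun the comparison when an estimate changes.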

![](https://i.imgur.com/XU2lghc.png)

#### 5. Feasibility report

Build a POC ML model and explain as a team whether this ML project is feasible. If it is not, you need to find another business problem. The key factors here are data availability, data quality, costs, and the nature of the business problem.
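
A proof of concept does not need to be elaborate: a simple model that clearly beats a naive baseline is already evidence of feasibility. The sketch below fits a one-feature least-squares line (price against odometer reading) and compares its MAE with a predict-the-mean baseline. The data points are made up for illustration, and the choice of feature and model is an assumption, not the project's final approach.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical toy data: (odometer in thousands of miles, selling price in dollars).
train = [(10, 27600), (30, 22400), (50, 17700), (70, 12300), (90, 7500)]
test = [(20, 25100), (60, 14800), (80, 10200)]

train_x, train_y = zip(*train)
test_x, test_y = zip(*test)

slope, intercept = fit_line(train_x, train_y)
model_mae = mae(test_y, [intercept + slope * x for x in test_x])

# Baseline: always predict the mean training price.
mean_price = sum(train_y) / len(train_y)
baseline_mae = mae(test_y, [mean_price] * len(test_y))

print(f"Model MAE:    {model_mae:.2f}")
print(f"Baseline MAE: {baseline_mae:.2f}")
```

If the POC model cannot beat the baseline on held-out data, that is a signal to revisit the features, the data quality, or the business problem itself.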


## produce project plan
----------------

### task

Describe the intended plan for achieving the machine learning goals and thereby achieving the business goals. The plan should specify the anticipated set of steps to be performed during the rest of the project including an initial selection of tools and techniques.

### output

#### 1. Project plan
- List the stages to be executed in the project, together with their duration, resources required, inputs, outputs, and dependencies. Where possible, try and make explicit the large-scale iterations in the machine learning process, for example, repetitions of the modeling and evaluation phases. As part of the project plan, it is also important to analyze dependencies between time schedule and risks. Mark results of these analyses explicitly in the project plan, ideally with actions and recommendations if the risks are manifested. Decide at this point which evaluation strategy will be used in the evaluation phase. Your project plan will be a dynamic document. At the end of each phase you’ll review progress and achievements and update the project plan accordingly. Specific review points for these updates should be part of the project plan.

- Build a Gantt chart for the project tasks and phases using an online platform such as TeamGantt, Jira, GoodDay, or Trello.

- Add all of your team members and make a preliminary assignment of tasks to them. Then check the progress daily.

#### 2. ML project Canvas
At the end of the first phase, you should create a canvas for the project as a summary of this phase.


> ##### Example
> Follow the link: https://github.com/louisdorard/machine-learning-canvas/blob/master/churn.pdf

