# business and data understanding
------------

The initial phase covers three tasks: defining the business objectives and translating them into ML objectives, collecting the data and verifying its quality, and finally assessing the project's feasibility.

![](https://i.imgur.com/55J7fBc.jpeg)

## Terminology

### tasks

Compile a glossary of terminology relevant to the project. This may include two components:
(1) A glossary of relevant business terminology, which forms part of the business understanding available to the project. Constructing this glossary is a useful "knowledge elicitation" and education exercise.
(2) A glossary of machine learning terminology, illustrated with examples relevant to the business problem in question.




#### 1. Business terminology
A table or paragraph containing all the business-related terms to be used in the project.



> - Car Pricing Information: Data related to the prices of various car models, which is essential for predicting future prices.
> - Market Trends: Patterns and tendencies in the car market, including consumer preferences, economic factors, and industry developments. Example: "The market trend indicates an increasing demand for electric vehicles."
> - Vehicle Characteristics: Specific attributes of cars such as make, model, year, mileage, condition, and features. Example: "Vehicle characteristics like mileage and age significantly affect car prices."
> - Predictive Model: A statistical or machine learning model used to forecast future values based on historical data. Example: "The company developed a predictive model to estimate car prices for the upcoming year."
> - Dataset: A collection of data points, in this case, related to vehicle sales and market trends. Example: "The dataset includes information on car sales, prices, and market trends over the past decade."
> - Accurate Forecasting: The process of making precise and reliable predictions about future events. Example: "Accurate forecasting of car prices helps customers make informed purchasing decisions."
> - Reliability: The consistency and dependability of the data or the model’s predictions. Example: "The reliability of the predictive model is crucial for maintaining customer trust."

#### 2. ML Terminology
A table or paragraph containing all the machine learning-related terms to be used in the project.

> - Regression: A type of predictive modeling technique which estimates the relationships among variables. Example: "The project uses regression to predict car prices based on various factors."
> - Feature: An individual measurable property or characteristic of a phenomenon being observed. Example: "Features such as car age, mileage, and brand are used in the predictive model."
> - Target Variable: The variable that the model is trying to predict. In this project, it is the car price. Example: "The target variable in our model is the car price."
> - Training Data: The subset of the dataset used to train the model. Example: "80% of the dataset was used as training data to build the predictive model."
> - Test Data: The subset of the dataset used to test the model’s accuracy. Example: "20% of the dataset was reserved as test data to evaluate the model’s performance."
> - Model Accuracy: A measure of how close the model's predictions are to the actual values. Example: "The model achieved an accuracy of 95% in predicting car prices."
> - Overfitting: When a model learns the training data too well, including noise, resulting in poor performance on new data. Example: "The initial model overfitted the training data, so regularization techniques were applied."
> - Underfitting: When a model is too simple and fails to capture the underlying patterns in the data. Example: "To avoid underfitting, additional features were incorporated into the model."
> - Mean Absolute Error (MAE): The average absolute difference between predicted and actual values, expressed in the same units as the target (here, dollars). Example: "The MAE of the model was low, indicating high accuracy in car price predictions."
> - Cross-Validation: A technique for assessing how the results of a statistical analysis will generalize to an independent data set. Example: "Cross-validation was used to ensure the model's robustness and avoid overfitting."
> - Feature Engineering: The process of using domain knowledge to create features that make machine learning algorithms work better. Example: "Feature engineering was crucial to improve the model's predictions by incorporating relevant vehicle characteristics."

## Scope of the project
----------

### tasks
- Explore the background of the business.
- Define business problem
- Define business objectives
- Translate business objectives into ML objectives

The objective here is to thoroughly understand, from a business perspective, what the client really wants to accomplish. Often the client has many competing objectives and constraints that must be properly balanced. The goal is to uncover important factors, at the beginning, that can influence the outcome of the project. A possible consequence of neglecting this step is to expend a great deal of effort producing the right answers to the wrong questions.



#### 1. background
A short paragraph to record the information that is known about the organization's business situation at the beginning of the project.

>Companies are committed to providing accurate and reliable car pricing information to customers. They gather extensive data on vehicle sales and market trends to build predictive models that forecast car prices. The Vehicle Sales and Market Trends Dataset encompasses a comprehensive collection of information on various vehicle characteristics and market trends, which is crucial for developing robust predictive models.

#### 2. Business Problem
A short paragraph to describe the business problem.

>The business problem is that the companies need to provide accurate and reliable car pricing information to their customers. Currently, the companies struggle to predict car prices effectively due to the rapidly changing market trends and diverse vehicle characteristics. This inaccuracy can lead to customer dissatisfaction and a loss of competitive edge. The companies aim to develop a predictive model to forecast car prices based on historical sales data and market trends.

#### 3. Business Objectives
A list of business objectives which describes the customer's primary objective, from a business perspective. In addition to the primary business objective, there are typically other related business questions that the customer would like to address.

> - How do different vehicle characteristics (e.g., make, model, year, mileage) impact the predicted price?
> - What are the key market trends affecting car prices in the short and long term?
> - How can pricing accuracy influence customer satisfaction and loyalty?
> - Will the integration of economic indicators (e.g., inflation, interest rates) improve the model's prediction accuracy?
> - How can the predictive model help in optimizing inventory management and pricing strategies for dealerships?
#### 4. ML Objectives
A list of machine learning objectives which describes the intended outputs of the project that enables the achievement of the business objectives.

> - Build a regression model to predict car prices based on historical data and market trends.
> - Identify and engineer relevant features from the dataset that influence car prices.
> - Evaluate the model's performance using metrics such as Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).
> - Implement cross-validation techniques to ensure the model's robustness and prevent overfitting.
> - Continuously update the model with new data to maintain accuracy and relevance in a changing market.
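
As a rough illustration of the cross-validation objective above, the sketch below splits a handful of made-up prices into k folds and scores a mean-price baseline on each fold. The helper names (`k_fold_indices`, `cross_validated_mae`), the toy prices, and the baseline itself are assumptions for illustration only; the project's actual model and tooling may differ.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k shuffled, near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, val_idx


def cross_validated_mae(prices, k=5):
    """Return the per-fold MAE of a mean-price baseline (a stand-in for a real model)."""
    fold_maes = []
    for train_idx, val_idx in k_fold_indices(len(prices), k):
        # "Training" here is just averaging the prices in the training folds.
        baseline = statistics.fmean(prices[j] for j in train_idx)
        errors = [abs(prices[j] - baseline) for j in val_idx]
        fold_maes.append(statistics.fmean(errors))
    return fold_maes


# Made-up selling prices, for illustration only.
toy_prices = [21500, 13200, 8700, 30100, 15900, 11800, 26400, 9900, 17300, 12500]
fold_maes = cross_validated_mae(toy_prices, k=5)
print([round(m, 1) for m in fold_maes])
```

A large spread between the per-fold values would suggest that performance depends heavily on which records the model happens to see.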

## Success Criteria
-------------

### tasks
- Describe the success criteria of the ML project on three different levels: the business success criteria, the ML success criteria and the economic success criteria.


#### 1. Business Success Criteria
A list of criteria from a business point of view.

> - Increase the accuracy of car price predictions by at least 10% compared to current methods.
> - Enhance customer satisfaction ratings by providing more accurate pricing information, leading to a 5% increase in positive feedback within six months.
> - Reduce pricing-related customer complaints by 15% within the next year.
> - Improve the decision-making process for inventory management, resulting in a 10% increase in sales turnover.
> - Gain a competitive edge in the market, reflected by a 7% increase in market share within one year.
#### 2. ML Success Criteria
A list of criteria for a successful outcome to the project in technical terms.

> - Achieve a Mean Absolute Error (MAE) of less than $500 in car price predictions.
> - Obtain a Root Mean Squared Error (RMSE) of less than $700.
> - Ensure the model's R-squared value is greater than 0.85, indicating that the model explains most of the variance in actual prices.
> - Implement cross-validation techniques, with performance metrics varying by no more than 2% across folds.
> - Maintain model performance by retraining with new data periodically, ensuring that prediction accuracy does not degrade over time.
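
The numeric criteria above can be checked mechanically. The following is a minimal sketch, assuming plain Python lists of actual and predicted prices; the function names and the toy numbers are hypothetical, and only the thresholds (MAE below 500, RMSE below 700, R-squared above 0.85) come from the list above.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: average size of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Squared Error: penalizes large errors more than MAE does."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Share of variance in the actual prices explained by the predictions.

    Assumes the actual prices are not all identical (otherwise ss_tot is zero).
    """
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def meets_ml_criteria(actual, predicted, max_mae=500, max_rmse=700, min_r2=0.85):
    """Return (passed, scores) for the three thresholds in the success criteria."""
    scores = {
        "mae": mae(actual, predicted),
        "rmse": rmse(actual, predicted),
        "r2": r_squared(actual, predicted),
    }
    passed = (
        scores["mae"] < max_mae
        and scores["rmse"] < max_rmse
        and scores["r2"] > min_r2
    )
    return passed, scores


# Made-up prices in dollars, for illustration only.
actual = [20000, 15000, 9000, 30000]
predicted = [19600, 15300, 9500, 29800]
passed, scores = meets_ml_criteria(actual, predicted)
print(passed, scores)
```

Keeping the thresholds as function parameters makes it easy to tighten or relax them if the success criteria are revised.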
#### 3. Economic Success Criteria
A list of criteria that describe the economic benefits expected from the project.

> - Increase in revenue by 8% through optimized pricing strategies and improved sales turnover.
> - Achieve a return on investment (ROI) of at least 15% within the first year of implementing the predictive model.
> - Reduce costs associated with manual pricing adjustments by 20%.
> - Enhance inventory turnover rate by 10%, reducing holding costs and improving cash flow.
> - Generate additional revenue streams by offering premium pricing insights to partner dealerships and third-party vendors.

## Data collection

### tasks
- Specify the data sources
- Collect the data
- Version control on the data


#### 1. Data Collection Report
A section to describe the data sources and how you want to collect the data.

> - **Data Source:** The origin of the data.
"The data was collected from the Kaggle dataset 'Vehicle Sales and Market Trends Dataset'."
> - **Data Type:** The format or structure of the data.
Example: "The data consists of numerical values (e.g., odometer reading, selling prices), categorical values (e.g., make, model, trim, body type), and text values (e.g., VIN)."
> - **Data Size:** The quantity of data collected.
"The dataset contains approximately 500 thousand vehicle sales transactions, each with numerous features such as year, make, model, trim, body type, transmission type, VIN, state of registration, condition rating, odometer reading, exterior and interior colors, seller information, Manheim Market Report (MMR) values, selling prices, and sale dates."
> - **Data Collection Method:** The process used to collect the data.
"The data was downloaded directly from the Kaggle website, ensuring it is in a standardized CSV format which facilitates easy import and manipulation in data analysis tools."
#### 2. Data Version Control Report
A section to describe the data versions, what changes have been made to the data, and how the data is backed up. This should be done after you collect the data.

> - **Data Version:** A unique identifier for each version of the data.
"The current data version was last updated in February 2024."
> - **Data Change Log:** A record of changes made to the data.
"The data change log shows no modifications have been made yet. Initial dataset contains all original attributes and records as provided by Kaggle."
> - **Data Backup:** A copy of the data stored for recovery in case of data loss.
"We store the latest version of the unmodified dataset."
> - **Data Archiving:** The process of storing and managing historical data.
"We will update the database when a new version of the dataset is released. Previous versions of the dataset will also be stored."
> - **Data Access Control:** The process of controlling who can access and modify the data.
"We have a small team, so everyone has access to GitHub."
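
One simple way to give each data version a unique identifier is a content checksum. The sketch below is an assumption about how this could be done by hand, not the project's actual setup: `file_sha256`, `version_entry`, the label, and the temporary sample file are all hypothetical. In practice a data versioning tool would handle this bookkeeping.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large CSVs do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def version_entry(path, label):
    """Build one change-log record: any change to the file changes the hash."""
    path = Path(path)
    return {
        "label": label,
        "file": path.name,
        "bytes": path.stat().st_size,
        "sha256": file_sha256(path),
    }


# Demonstration on a tiny made-up file; a real run would point at the dataset CSV.
sample_bytes = b"year,make,sellingprice\n2015,Kia,21500\n"
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(sample_bytes)
    entry = version_entry(sample, label="original, unmodified")
print(entry)
```

Comparing the stored hash against a freshly computed one is a quick check that a backup is still identical to the original download.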

## Data quality verification

### tasks
- Describe data
- Define data requirements
- Explore the data
- Verify the data quality

### output

#### 1. Data description
A section to describe the data that has been acquired including its format, its quantity (for example, the number of records and fields in each table), the identities of the fields and any other surface features which have been discovered. Add a table of description of the data features.

> ##### Example
> - The data acquired for this project is a dataset of about 500k records with 16 fields each. The fields cover vehicle attributes (such as year, make, model, trim, body, transmission, condition, and odometer), sale details (such as state, seller, and sale date), and pricing (MMR and selling price). The data is in CSV format and is stored in a local database.
> - Add a table of description of the data features.

| Column Name     | Type     | Description                                        |
|-----------------|----------|----------------------------------------------------|
| year            | int64    | The manufacturing year of the vehicle (e.g., 2015) |
| make            | object   | The brand or manufacturer of the vehicle (e.g., Kia, BMW, Volvo) |
| model           | object   | The specific model of the vehicle (e.g., Sorento, 3 Series, S60, 6 Series Gran Coupe) |
| trim            | object   | Additional designation for a particular version or option package of the model (e.g., LX, 328i SULEV, T5, 650i) |
| body            | object   | The type of vehicle body (e.g., SUV, Sedan)        |
| transmission    | object   | The type of transmission in the vehicle (e.g., automatic) |
| vin             | object   | The Vehicle Identification Number, a unique code used to identify individual motor vehicles |
| state           | object   | The state in which the vehicle is located or registered (e.g., CA for California) |
| condition       | float64  | A numerical representation of the condition of the vehicle (e.g., 5.0) |
| odometer        | float64  | The mileage or distance traveled by the vehicle    |
| color           | object   | The exterior color of the vehicle                  |
| interior        | object   | The interior color of the vehicle                  |
| seller          | object   | The entity or company selling the vehicle (e.g., Kia Motors America Inc, Financial Services Remarketing) |
| mmr             | float64  | Manheim Market Report (MMR) value, an estimated wholesale market price for the vehicle |
| sellingprice    | float64  | The price at which the vehicle was sold            |
| saledate        | object   | The date and time when the vehicle was sold        |



#### 2. Data exploration
A section to present results of your data exploration, including first findings or initial hypothesis and their impact on the remainder of the project. If appropriate you could include graphs and plots here to indicate data characteristics that suggest further examination of interesting data subsets.

> ##### Example
> - The data needs to be cleaned: it contains many outliers and missing values.
> - The features have a nonlinear relationship with the target, so a plain linear model such as LinearRegression is not a good fit.

#### 3. Data requirements
> - The dataset covers a fixed set of car makes and models, so input data should contain only those makes and models for the model to perform well.


#### 4. Data quality verification report

> - The dataset contains missing values, but their proportion is negligible compared to the size of the dataset.
> - The dataset contains the same cars at different points in time, since a car can be resold many times.
> - One feature is malformed and hard to parse automatically, so its preprocessing has to be done by hand.
> - Overall, the data quality is high, and the data is suitable for analysis and modeling. However, a manual review of the data is recommended to ensure that there are no errors or anomalies.
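
The first two findings above (share of missing values, repeated cars) are the kind of check that can be scripted. The following is a minimal sketch on a tiny inline sample: the column names `vin`, `make`, `odometer` and `sellingprice` come from the data description table, while the rows, the placeholder VINs, and the `quality_summary` helper are made up for illustration.

```python
import csv
import io
from collections import Counter

# Made-up rows with placeholder VINs; real data would be read from the dataset CSV.
SAMPLE_CSV = """vin,make,odometer,sellingprice
VIN0001,Kia,16639,21500
VIN0002,BMW,,30000
VIN0001,Kia,20110,19800
VIN0003,Volvo,14282,
"""


def quality_summary(rows):
    """Return (share of missing values per column, VINs that appear more than once)."""
    n_rows = len(rows)
    columns = list(rows[0].keys())
    missing_share = {
        col: sum(1 for row in rows if not (row[col] or "").strip()) / n_rows
        for col in columns
    }
    vin_counts = Counter(row["vin"] for row in rows)
    # A VIN seen several times is a resale of the same car, not necessarily an error.
    resold = {vin: count for vin, count in vin_counts.items() if count > 1}
    return missing_share, resold


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing_share, resold_vins = quality_summary(rows)
print(missing_share)
print(resold_vins)
```

Whether repeated VINs should be kept as separate sales or collapsed is a modelling decision that this check only surfaces.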


## Project feasibility
-------------
This task involves more detailed fact-finding about all of the resources, constraints, assumptions and other factors that should be considered in determining the data analysis goal and project plan. In the previous task, the objective was to quickly get to the crux of the situation. Here, the aim is to flesh out the details.

### tasks
- Assess the project feasibility
- Create POC (Proof-of-concept) model

### output

#### 1. inventory of resources

List the resources available to the project, including: personnel (business experts, data experts, technical support, machine learning personnel), data (fixed extracts, access to live warehoused or operational data), computing resources (hardware platforms) and software (machine learning tools, other relevant software).

##### Personnel
> - Small team of 3 developers
##### Hardware resources
> - Laptops that can run model training
> - Large data storage for the model, the dataset, and versioned data samples
##### Data
> - Dataset that was downloaded from Kaggle
##### Software
> - Hydra for managing configurations
> - DVC for versioning the data
> - Python for visualizations, data analysis, and model training

#### 2. Requirements, assumptions and constraints

##### Requirements
> - Project needs to be done before the end of July
> - We should use tools that were mentioned in the course

##### Assumptions
> - We can predict approximate price for the car based on its parameters
> - The project is feasible
> - The dataset contains actual data

##### Constraints
> - Not all models and brands of cars can be handled
> - Slow model training, since we have a lot of data in our dataset
> - We cannot guarantee exact predictions; nonetheless, they should be approximately accurate
> - We don't have complete information about each car, e.g. its engine, its number of owners, or whether it has been repaired

#### 3. Risks and contingencies

##### Data Quality Issues
Risk: The dataset might contain missing, inconsistent, or incorrect values that weren't identified during initial exploration.\
Contingency Plan:

- Implement Comprehensive Validation: Use Great Expectations to implement comprehensive data validation checks.
- Regular Updates: Regularly review and update the validation rules based on new data and findings.
- Data Cleaning: Incorporate additional data cleaning steps to handle unexpected data issues, ensuring data integrity before further processing.
##### Configuration Errors
Risk: Errors in Hydra configuration files might lead to incorrect data processing or validation.\
Contingency Plan:

- Thorough Testing: Thoroughly test configurations in a development environment before deployment.
- Unit Tests: Implement unit tests for configuration validation to catch errors early.
- Version Control: Use version control for configuration files to track and revert changes if necessary, ensuring configuration integrity.
##### Dependency Issues
Risk: Updates or changes to dependencies like Hydra, Great Expectations, or pandas could introduce compatibility issues.\
Contingency Plan:

- Pin Versions: Pin dependency versions in requirements.txt to ensure consistent environments.
- Controlled Updates: Regularly update dependencies in a controlled environment and run all tests to ensure compatibility.
- Issue Tracking: Maintain a list of known issues and fixes for dependency updates to quickly address any arising problems.
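
A pinned requirements.txt only helps if the running environment actually matches it. The sketch below shows one possible check using the standard library's `importlib.metadata`; the helper names and the version numbers in the sample text are illustrative assumptions, not the project's actual pins.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Extract {package: version} from lines of the form 'name==version'."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {package: (pinned, installed)} where the environment differs."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


# Illustrative pins only; a real check would read the project's requirements.txt.
SAMPLE_REQUIREMENTS = """
hydra-core==1.3.2   # configuration management
pandas==2.2.0
"""

pins = parse_pins(SAMPLE_REQUIREMENTS)
print(find_mismatches(pins))
```

Running such a check at the start of a pipeline or in CI turns a silent version drift into an explicit, early failure.
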
##### Performance Bottlenecks
Risk: Data processing might be slower than expected, especially with large datasets.\
Contingency Plan:

- Optimize Processing: Optimize data processing steps for performance improvements.
- Sampling Techniques: Use sampling techniques for testing and validation to reduce processing time.
- Scalable Resources: Scale the computing resources if necessary, leveraging cloud services to handle large datasets efficiently.
##### Integration Issues
Risk: Issues may arise when integrating different components of the project (e.g., data pipeline, validation).\
Contingency Plan:

- CI/CD Practices: Implement continuous integration and continuous deployment (CI/CD) practices to ensure smooth integration.
- Regular Integration Tests: Perform regular integration tests to catch and address issues early.
- Documentation: Document integration points and potential failure scenarios to provide clear guidance and facilitate troubleshooting.

#### 4. Costs and benefits

##### Proposed Action
> - Develop an ML model for predicting car prices for the car reseller business.

##### Benefits
> - The business can automate car price prediction so that prices are fair for both the seller and the consumer
> - Consumers will buy more cars when prices are affordable

##### Costs
> - To develop and integrate such a model, the business needs to pay developers
> - The dataset is freely available and can be supplemented with the reseller's own data
> - The model must be maintained by a team of developers who retrain it from time to time

##### Ratio benefits/costs

> - The project overall is important for the business and its benefits are bigger than its costs


#### 5. Feasibility report

After training a random forest regressor as a proof of concept, we found that the project is feasible: the model predicted car prices accurately from the vehicle parameters. The project can therefore proceed with an ML model.

## produce project plan
----------------

### task

Describe the intended plan for achieving the machine learning goals and thereby achieving the business goals. The plan should specify the anticipated set of steps to be performed during the rest of the project including an initial selection of tools and techniques.

### output

#### 1. Project plan

**Project Duration:** June 14 - Final Exam Date

**Team Members:** 3

##### Stage 1: Initial Setup and Dataset Analysis (14 June - 18 June)
- **Tasks:**
  - Set up the project repository and environment.
  - Download and explore the Kaggle dataset.
  - Conduct initial data exploration and visualization.
  - Identify key features and target variable.
- **Deliverables:**
  - Project repository set up.
  - Initial data exploration report.

##### Stage 2: Data Cleaning and Preprocessing (18 June - 22 June)
- **Tasks:**
  - Handle missing values and outliers.
  - Encode categorical variables.
  - Normalize/scale numerical features.
  - Split data into training and testing sets.
- **Deliverables:**
  - Cleaned and preprocessed dataset.
  - Data preprocessing script.

##### Stage 3: Model Development (22 June - 25 June)
- **Tasks:**
  - Research and select appropriate algorithms for car price prediction.
  - Develop multiple models (e.g. Decision Tree, Random Forest, XGBoost).
- **Deliverables:**
  - Initial versions of multiple models.
  - Model development scripts.

##### Stage 4: Model Training and Hyperparameter Tuning (26 June - 6 July)
- **Tasks:**
  - Train the models on the training dataset.
  - Perform hyperparameter tuning to optimize model performance.
  - Compare model performances using validation metrics.
- **Deliverables:**
  - Trained and tuned models.
  - Model training and tuning scripts.
  - Performance comparison report.

##### Stage 5: Model Testing and Evaluation (6 July - 9 July)
- **Tasks:**
  - Evaluate the models on the test dataset.
  - Select the best-performing model based on evaluation metrics.
  - Perform error analysis and document findings.
- **Deliverables:**
  - Evaluation report.
  - Error analysis documentation.

##### Stage 6: Model Deployment (9 July - 16 July)
- **Tasks:**
  - Develop an API for model inference.
  - Create a simple web application to showcase the model.
  - Deploy the model and application using a cloud service.
- **Deliverables:**
  - Deployed API and web application.
  - Deployment documentation.

##### Stage 7: Model Monitoring and Maintenance (16 July - 20 July)
- **Tasks:**
  - Set up monitoring for model performance and data drift.
  - Implement logging and alerting for potential issues.
  - Plan for regular model retraining with new data.
- **Deliverables:**
  - Monitoring and maintenance setup.
  - Documentation for model monitoring and maintenance.

##### Final Preparation and Review (20 July - Final Exam)
- **Tasks:**
  - Conduct a final review of the project.
  - Prepare project presentation and reports.
  - Practice for the final exam presentation.
- **Deliverables:**
  - Final project presentation.
  - Comprehensive project report.

##### Summary
This project plan outlines the key stages, tasks, and deliverables for predicting car prices from a Kaggle dataset. Each stage has a clear timeline to ensure the project progresses smoothly and is completed successfully by the final exam date.

#### Gantt chart

https://github.com/ErokhinE/MLOps/blob/main/reports/car_prices.pdf


#### 2. ML project Canvas
At the end of the first phase, you should create a canvas for the project as a summary of this phase.


> ##### Link
> Follow the link: https://github.com/ErokhinE/MLOps/blob/main/reports/CanvasML.pdf
