# business and data understanding
------------

The initial phase is concerned with tasks to define the business objectives and translate them into ML objectives, to collect the data and verify its quality, and finally to assess the project feasibility.

![](https://i.imgur.com/55J7fBc.jpeg)

## Terminology

### tasks

Compile a glossary of terminology relevant to the project. This may include two components:
(1) A glossary of relevant business terminology, which forms part of the business understanding available to the project. Constructing this glossary is a useful "knowledge elicitation" and education exercise.
(2) A glossary of machine learning terminology, illustrated with examples relevant to the business problem in question.

### output

#### 1. Business terminology
> - Fair Market Value (FMV): The estimated price a car would sell for in an open and competitive market, assuming neither buyer nor seller is under pressure.
> - Asking Price: The price a seller lists a car for on Craigslist or any other platform.
> - Selling Price: The final price at which a car is sold.
> - Market Price: The typical price at which similar used cars are selling in the current market.
> - MSRP (Manufacturer's Suggested Retail Price): The recommended price set by the manufacturer, not necessarily reflective of market value.
> - Market Segmentation: Dividing the car market into smaller groups based on specific characteristics (e.g., luxury cars, fuel-efficient cars).
> - Return on Investment (ROI): The net profit gained from selling a car compared to its purchase price.

#### 2. ML terminology
> - Underfitting: When a machine learning model is too simple and fails to capture the underlying patterns in the data.
> - Overfitting: When a model performs well on training data but poorly on unseen data. It has memorized specific examples instead of learning general patterns.
> - Root Mean Square Error (RMSE): A metric that measures the difference between the values predicted by the model and the actual target values in the dataset. It is defined as $RMSE=\sqrt{\frac{1}{N}\sum_{i=1}^N(y_i-\hat{y_i})^2}$, where $N$ is the number of points in the evaluation set and $y_i$, $\hat{y_i}$ are the true and predicted values respectively. The closer the RMSE is to 0, the better the model.
> - $R^2$: A metric that represents the proportion of variance in the target variable that can be explained by the model. It is defined as $R^2=1-\frac{\sum_{i=1}^N(y_i-\hat{y_i})^2}{\sum_{i=1}^N(y_i-\frac{1}{N}\sum_{i=1}^Ny_i)^2}$, where $N$ is the number of points in the evaluation set and $y_i$, $\hat{y_i}$ are the true and predicted values respectively. The closer the value is to 1, the better the model.
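
The two metrics can be illustrated with a minimal sketch in plain Python. RMSE is computed here as the square root of the mean squared residual, and $R^2$ as one minus the ratio of the residual sum of squares to the total sum of squares. The function names and the three example prices are hypothetical and are not taken from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared residual."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical prices in dollars, for illustration only.
actual = [10000, 15000, 20000]
predicted = [11000, 15000, 19000]

print(round(rmse(actual, predicted), 2))      # 816.5
print(round(r2_score(actual, predicted), 2))  # 0.96
```

Because RMSE is expressed in the same unit as the target, a value of about 816 here reads directly as a typical error of roughly $816 per car.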


## Scope of the project
----------

### tasks
- Explore the background of the business.
- Define business problem
- Define business objectives
- Translate business objectives into ML objectives

The objective here is to thoroughly understand, from a business perspective, what the client really wants to accomplish. Often the client has many competing objectives and constraints that must be properly balanced. The goal is to uncover important factors, at the beginning, that can influence the outcome of the project. A possible consequence of neglecting this step is to expend a great deal of effort producing the right answers to the wrong questions.

### output

#### 1. background

> Craigslist is a company operating a classified advertisements website with sections devoted to jobs, housing, for sale, items wanted, services, community service, gigs, résumés, and discussion forums. In particular, it contains listings related to car sales.

#### 2. business problem

> The current process for pricing used cars relies heavily on subjective judgement, leading to listings that are either overpriced and linger on the market or underpriced and leave money on the table for sellers. Sellers want to use a variety of car attributes to predict a car's fair market value. This will empower them to price their vehicles competitively and realistically, facilitating faster sales and maximizing their return on investment. In addition, it will help buyers avoid overpaying for a vehicle.

#### 3. business objectives

> - Maximize seller profit and buyer satisfaction by predicting the fair market value of used cars. This will:
>    * Reduce the time cars sit on the market (days listed) for sellers.
>    * Increase the likelihood of getting the asking price for sellers.
>    * Help buyers avoid overpaying for a vehicle.
> - Related objectives:
>     - How does the accuracy of the predicted price impact the time cars stay listed?
>     - Will segmenting the market and offering targeted pricing strategies for different buyer demographics impact the seller profit and buyer satisfaction?
>     - Does using the predicted price in conjunction with other pricing strategies (e.g., dynamic pricing) further improve sales velocity and profitability?

#### 4. ML objectives

> Given a dataset of used car listings, including attributes such as mileage, make, model, year, features, and location, develop a machine learning model that can accurately predict the fair market value of a car.

## Success Criteria
-------------

### tasks
- Describe the success criteria of the ML project on three different levels: the business success criteria, the ML success criteria and the economic success criteria.

### output

#### 1. Business success criteria

> - Decrease the average number of days a car listing remains active by 15% compared to the current process.
> - Achieve a 5% increase in the average selling price for a car compared to the current process. This can be measured by comparing the predicted price to the actual selling price.

#### 2. ML success criteria

> - The model should achieve an RMSE of less than $1,000 on the car price predictions. 
> - The model should achieve an $R^2$ of greater than 0.85 on the car price predictions.
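
As a small sketch, the two ML criteria can be encoded as a single check that is run on evaluation results. The function name and its default thresholds are illustrative and simply mirror the values stated above. Both comparisons are strict, matching the wording "less than" and "greater than".

```python
def meets_ml_criteria(rmse_value, r2_value, max_rmse=1000.0, min_r2=0.85):
    """Return True only if both ML success criteria are satisfied.

    RMSE must be strictly below max_rmse (in dollars) and
    R^2 must be strictly above min_r2.
    """
    return rmse_value < max_rmse and r2_value > min_r2


# Hypothetical evaluation results, for illustration only.
print(meets_ml_criteria(816.5, 0.96))   # True
print(meets_ml_criteria(1200.0, 0.90))  # False
```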

## Data collection

### tasks
- Specify the data sources
- Collect the data
- Version control on the data

### output

#### 1. Data collection report

> - **Data Source:** The data was collected from a publicly available dataset on Kaggle: https://www.kaggle.com/datasets/austinreese/craigslist-carstrucks-data
> - **Data Type:** The data is a mixture of numerical, categorical and datetime features. Numerical features include the miles traveled by the vehicle, latitude and longitude; categorical features include manufacturer, model and condition; the datetime feature is the posting date.
> - **Data Size:** The dataset on Kaggle contains 426880 listings, with 26 features each.
> - **Data Collection Method:** The data was directly downloaded from the Kaggle platform using its web interface. This eliminates the need for web scraping and ensures data quality and consistency.


#### 2. Data version control report

> - **Data Version:** The current data version is v1.0, which was updated on June 19, 2024.
> - **Data Change Log:** We will maintain a change log that documents any modifications made to the data. This ensures transparency and allows us to replicate the data preparation process for future iterations.
> - **Data Backup:** Daily backups of the downloaded data will be stored on a Google Drive. This safeguards against accidental deletion or system failures.
> - **Data Archiving:** Since the project duration is short, none of the data will be archived.
> - **Data Access Control:** Access to the data will be restricted to the project team and the Teaching Assistant. This mitigates the risk of unauthorized modifications.
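
One way to make change log entries verifiable is to record a content hash of the data file next to each version. The sketch below uses only the Python standard library. The helper names `file_sha256` and `change_log_entry` and the entry layout are assumptions made for illustration, not part of the project's tooling.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def change_log_entry(path, version, note):
    """Build one change log record (hypothetical layout) for a data file."""
    return {"version": version, "sha256": file_sha256(path), "note": note}
```

Two copies of the data with the same digest are byte-for-byte identical, so comparing digests is enough to confirm that a backup matches the logged version.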

## Data quality verification

### tasks
- Describe data
- Define data requirements
- Explore the data
- Verify the data quality

### output

#### 1. Data description

> - The data acquired for this project includes a dataset of 426880 records with 26 fields each. The fields include car characteristics, its location and price. The data is in a CSV format and is stored in Google Drive.
> - Feature description table:

| Feature Name | Description          | Data Type |
|--------------|----------------------|-----------|
| id           | Entry ID             | Integer   |
| url          | Listing URL          | String    |
| region       | Craigslist region    | String    |
| region_url   | Region URL           | String    |
| price        | Car price            | Integer   |
| year         | Entry year           | Integer   |
| manufacturer | Vehicle manufacturer | String    |
| model        | Vehicle model        | String    |
| condition    | Vehicle condition    | String    |
| cylinders    | Number of cylinders  | String    |
| fuel         | Fuel type            | String    |
| odometer     | Miles traveled by vehicle | Integer |
| title_status | Vehicle title status | String    |
| transmission | Vehicle transmission | String    |
| VIN          | Vehicle Identification Number | String |
| drive        | Drive Type           | String    |
| size         | Vehicle size         | String    |
| type         | Generic vehicle type | String    |
| paint_color  | Vehicle color        | String    |
| image_url    | Image URL            | String    |
| description  | Listed vehicle description | String |
| county       | County               | String (actually, all entries are NaN) |
| state        | Listing state        | String    |
| lat          | Listing latitude     | Float     |
| long         | Listing longitude    | Float     |
| posting_date | Posting date         | Datetime  |

#### 2. Data exploration
A section to present results of your data exploration, including first findings or initial hypothesis and their impact on the remainder of the project. If appropriate you could include graphs and plots here to indicate data characteristics that suggest further examination of interesting data subsets.

> ##### Example
> - "county" contains only NaN values. Hence, we will drop it, since it holds no information to recover or use.
> - During data exploration, several interesting patterns and correlations were discovered. For example, there is a strong correlation between the age of customers and their purchase frequency. Additionally, customers who have purchased from the company in the past are more likely to make repeat purchases. These findings suggest that the data is representative of the target audience and that the company's marketing strategies are effective.
> - Add charts and figures to present the findings.

#### 3. Data requirements

> - The data requirements for this project are defined as follows:
> > - id should be unique.
> > - price should not be NaN and should be greater than or equal to 0.
> > - region should not be NaN.
> > - description should not be
> > - VIN (Vehicle Identification Number) should contain exactly 17 characters.
> > - cylinders should be greater than or equal to 0.
> > - odometer (i.e., number of miles traveled by vehicle) should be greater than or equal to 0.
> > - year should be greater than or equal to 1800.
> > - lat should be in the range $[-90, 90]$, long - $[-180, 180]$.
> > - drive must be one of [4wd, fwd, rwd].
> > - transmission must be one of [manual, automatic].
> > - posting_date must satisfy the format "%Y-%m-%dT%H:%M:%S.%fZ", where Y - year, m - month, d - day, H - hour, M - minutes, S - seconds, f - microseconds, and Z is the literal UTC designator.
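
The record-level requirements above can be sketched as a plain Python check. This is an illustration only, not the Great Expectations suite the project plans to use. It assumes each record is a dictionary with already parsed numeric fields, and it assumes that fields other than price and region are checked only when a value is present. The id uniqueness rule (a dataset-level check), the cylinders rule and the incomplete description rule are left out.

```python
import math
from datetime import datetime

POSTING_DATE_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"
ALLOWED_DRIVE = {"4wd", "fwd", "rwd"}
ALLOWED_TRANSMISSION = {"manual", "automatic"}


def is_missing(value):
    """Treat None, empty strings and float NaN as missing."""
    return value is None or value == "" or (
        isinstance(value, float) and math.isnan(value)
    )


def matches_date_format(value):
    """Return True if the string parses with POSTING_DATE_FORMAT."""
    try:
        datetime.strptime(value, POSTING_DATE_FORMAT)
        return True
    except ValueError:
        return False


# Checks applied only when the field has a value (assumption).
OPTIONAL_CHECKS = {
    "VIN": lambda v: len(v) == 17,
    "odometer": lambda v: v >= 0,
    "year": lambda v: v >= 1800,
    "lat": lambda v: -90 <= v <= 90,
    "long": lambda v: -180 <= v <= 180,
    "drive": lambda v: v in ALLOWED_DRIVE,
    "transmission": lambda v: v in ALLOWED_TRANSMISSION,
    "posting_date": matches_date_format,
}


def validate_record(rec):
    """Return the names of the fields that violate a requirement."""
    errors = []
    # price and region are mandatory according to the requirements.
    if is_missing(rec.get("price")) or rec["price"] < 0:
        errors.append("price")
    if is_missing(rec.get("region")):
        errors.append("region")
    for field, is_valid in OPTIONAL_CHECKS.items():
        value = rec.get(field)
        if not is_missing(value) and not is_valid(value):
            errors.append(field)
    return errors
```

Returning the list of failing field names, rather than a single boolean, makes it easy to count how often each requirement is violated across the dataset.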

#### 4. Data quality verification report
A section to verify and report the quality of the data. Examine the quality of the data, addressing questions such as:
- Is the data complete (does it cover all the cases required)?
- Is it correct, or does it contain errors and, if there are errors, how common are they?
- Are there missing values in the data? If so, how are they represented, where do they occur, and how common are they?

> ##### Example
> - Completeness: The data is complete in the sense that it covers all the required cases. All customers have demographic information, and all purchases are recorded.
> - Correctness: The data appears to be correct, with no obvious errors. However, a manual review of the data is recommended to ensure that there are no errors.
> - Missing Values: There are no missing values in the data.
> - Overall, the data quality is high, and the data is suitable for analysis and modeling. However, a manual review of the data is recommended to ensure that there are no errors or anomalies.
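
The question about missing values can be answered per column with a short summary. The sketch below assumes the rows were read as dictionaries of strings, for example with `csv.DictReader`, and that missing entries appear as empty strings or the text `NaN`. Both the function name and the set of missing markers are assumptions.

```python
def missing_fraction(rows, missing=("", "nan", "NaN", None)):
    """Map each column name to the fraction of rows where it is missing."""
    if not rows:
        return {}
    total = len(rows)
    return {
        column: sum(1 for row in rows if row.get(column) in missing) / total
        for column in rows[0]
    }


# Hypothetical rows, for illustration only.
rows = [
    {"price": "5000", "condition": "", "county": ""},
    {"price": "7000", "condition": "good", "county": "NaN"},
    {"price": "", "condition": "", "county": ""},
    {"price": "100", "condition": "fair", "county": ""},
]
print(missing_fraction(rows))
# {'price': 0.25, 'condition': 0.5, 'county': 1.0}
```

A fraction of 1.0 marks a column with no usable values at all, which is the situation reported for the county field in the feature table.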

## Project feasibility
-------------
This task involves more detailed fact-finding about all of the resources, constraints, assumptions and other factors that should be considered in determining the data analysis goal and project plan. In the previous task, your objective was to quickly get to the crux of the situation. Here, you want to flesh out the details.

### tasks
- Assess the project feasibility
- Create POC (Proof-of-concept) model

### output

#### 1. inventory of resources

> - Personnel:
>   - Data & ML Experts:
>       - Data Engineer - Responsible for data acquisition, cleaning, transformation, and preparation for modeling.
>       - Data Scientist - Responsible for exploring data, feature engineering, model selection, training, evaluation, and interpretation.
>       - ML Engineer - Responsible for designing, building, deploying, and monitoring the machine learning model.
>   - Business Experts:
>       - We have no contact with anyone who has sufficient expertise in car sales, but we can contact the professor and the TA in case of difficulties.
>   - Technical Support:
>       - We can contact the professor and the TA in case of difficulties.
> - Data:
>   - Fixed Extract:
>       - The provided dataset from Kaggle: https://www.kaggle.com/datasets/austinreese/craigslist-carstrucks-data
> - Computing Resources:
>   - Since we will work with 20% samples of the data rather than the whole dataset at once, our own computing resources should be enough to complete the project. Otherwise, we can use Kaggle or Google Colab to get additional resources, especially GPUs.
> - Software
>   - Machine Learning Tools:
>       - Python libraries such as scikit-learn and PyTorch
>   - Other Relevant Software:
>       - Pandas, Matplotlib, Seaborn for Exploratory Data Analysis
>       - DVC for data versioning
>       - Hydra for configuration management
>       - Pytest for testing the code
>       - Great Expectations for validating and documenting the data
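
The 20% samples mentioned under Computing Resources could be produced as sketched below. The sketch assumes that the samples are meant to be five disjoint, randomly drawn parts of the dataset, which the text does not state explicitly. The function name, the seed and the choice of five parts are illustrative.

```python
import random


def split_into_samples(n_rows, n_samples=5, seed=42):
    """Shuffle row indices and split them into equally sized disjoint samples.

    Rows left over when n_rows is not divisible by n_samples are dropped.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # a fixed seed keeps the split reproducible
    size = n_rows // n_samples
    return [indices[k * size:(k + 1) * size] for k in range(n_samples)]


samples = split_into_samples(426880)
print(len(samples), len(samples[0]))  # 5 85376
```

With 426880 listings, each of the five samples holds 85376 row indices, so only one fifth of the rows needs to be processed at a time.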

#### 2. Requirements, assumptions and constraints

> - Requirements:
>   - Data requirements:
>       - Accessibility: the dataset is publicly available on Kaggle, so we are allowed to use it without restrictions.
>       - The remaining data requirements are listed in the Data requirements section above.
>   - Time requirements:
>       - The project must be completed by the end of the semester (approximately the end of July)
>   - Model requirements:
>       - At least one of the trained models must give RMSE and $R^2$ values that satisfy the success criteria on unseen data
>   - Technical requirements:
>       - The project must satisfy the minimal requirements stated by the Teaching Assistant

> - Assumptions:
>   - Data relevance:
>       - The data is assumed to be representative of the used car market.
>   - Data accuracy:
>       - There are no mistakes in the dataset.
>   - Relationship:
>       - There is a relationship between the car's attributes and the price.

> - Constraints:
>   - Time constraints: 
>       - The project must be completed by the end of the semester as stated above.
>   - Resources constraints:
>       - Since no additional resources have been provided, we must use our own resources, which may limit our project in some aspects such as training complex ML models.

#### 3. Risks and contingencies

> - Inaccurate or incomplete data: The data from Craigslist may contain errors, typos, or missing information about the cars. This can lead to a poorly trained model that produces inaccurate predictions.
>   - Contingency Plan: Implement data cleaning techniques to identify and fix errors or inconsistencies in the data. This may involve removing or imputing entries with missing values or removing outliers. 
> - Model underfitting or overfitting: The model might not learn the underlying patterns in the data effectively, leading to poor predictions. On the other hand, the model might memorize the training data too well and fail to generalize to unseen data, leading to inaccurate predictions on new cars.
>   - Contingency Plan: Experiment with different model hyperparameters to find the best configuration that balances underfitting and overfitting. Apply techniques such as regularization or dropout layers to prevent the model from overfitting. Train multiple models on different subsets of the data and combine their predictions to potentially improve overall accuracy.
> - Market fluctuations: The used car market can fluctuate significantly due to economic factors, fuel prices, or new car releases. This can affect the accuracy of the model's predictions over time.
>   - Contingency Plan: Regularly retrain the model with new data to account for market changes and ensure its predictions remain relevant.


#### 4. Costs and benefits

> Proposed Action/Alternative
> - Develop a machine learning model to predict the fair market value of used cars based on Craigslist car listings data.

> Benefits
> - Increased Sales Velocity (Benefit Impact: 3): Accurate pricing will lead to listings attracting qualified buyers faster, reducing the time a car sits on the market. This translates to faster revenue generation.
> - Improved Seller Profitability (Benefit Impact: 2): Fair market value pricing ensures sellers get a competitive price without undercutting themselves.
> - Enhanced Buyer Experience (Benefit Impact: 2): Buyers avoid overpaying for vehicles and can find cars within their budget more efficiently.

> Costs
> - Data Acquisition (Cost Impact: 1): The Craigslist dataset is publicly available, minimizing data acquisition costs.
> - Model Development & Training (Cost Impact: 2): This requires time from all the team to develop, train, and maintain the model.
> - Deployment & Integration (Cost Impact: 1): Model deployment might require development effort.
> - Model Maintenance (Cost Impact: 1): Cost of ongoing monitoring and retraining the model to ensure accuracy over time.

> Ratio Benefits/Costs
> - Assuming that this ratio is calculated as sum(benefits impact)/sum(costs impact), the ratio is 7/5 = 1.4, meaning that the benefits outweigh the costs.
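
The ratio can be reproduced from the impact scores listed under Benefits and Costs. The dictionary keys below are shortened labels chosen for this sketch.

```python
# Impact scores copied from the Benefits and Costs lists above.
benefits = {"sales velocity": 3, "seller profitability": 2, "buyer experience": 2}
costs = {
    "data acquisition": 1,
    "development and training": 2,
    "deployment and integration": 1,
    "maintenance": 1,
}

ratio = sum(benefits.values()) / sum(costs.values())
print(sum(benefits.values()), sum(costs.values()), ratio)  # 7 5 1.4
```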

> Ranking
> - This project ranks highly due to the potential for significant benefits (increased sales velocity, maximized seller profit and enhanced buyer experience) with relatively moderate costs (data acquisition, model development, deployment, maintenance).

#### 5. Feasibility report

Build a POC ML model and explain as a team whether it is feasible to do this ML project or not. If not, then you need to find another business problem. The key factors here are related to data availability, quality, costs and the nature of the business problem.

## produce project plan
----------------

### task

Describe the intended plan for achieving the machine learning goals and thereby achieving the business goals. The plan should specify the anticipated set of steps to be performed during the rest of the project including an initial selection of tools and techniques.

### output

#### 1. Project plan
List the stages to be executed in the project, together with their duration, resources required, inputs, outputs, and dependencies. Where possible, try and make explicit the large-scale iterations in the machine learning process, for example, repetitions of the modeling and evaluation phases. As part of the project plan, it is also important to analyze dependencies between time schedule and risks. Mark results of these analyses explicitly in the project plan, ideally with actions and recommendations if the risks are manifested. Decide at this point which evaluation strategy will be used in the evaluation phase. Your project plan will be a dynamic document. At the end of each phase you’ll review progress and achievements and update the project plan accordingly. Specific review points for these updates should be part of the project plan.

#### 2. ML project Canvas
At the end of the first phase, you should create a canvas for the project as a summary of this phase.


> ##### Example
> Follow the link: https://github.com/louisdorard/machine-learning-canvas/blob/master/churn.pdf
