# Business and data understanding
------------

The initial phase defines the business objectives and translates them into ML objectives, collects the data and verifies its quality, and finally assesses the feasibility of the project.

![](https://i.imgur.com/55J7fBc.jpeg)

## Terminology

### tasks

Compile a glossary of terminology relevant to the project. This may include two components:
(1) A glossary of relevant business terminology, which forms part of the business understanding available to the project. Constructing this glossary is a useful "knowledge elicitation" and education exercise.
(2) A glossary of machine learning terminology, illustrated with examples relevant to the business problem in question.

### output

#### 1. Business terminology
A table or paragraph containing all the business-related terms to be used in the project.

> ##### Example:
> - Key Performance Indicators (KPIs): Quantifiable metrics used to measure the performance of a business or organization. The company's KPIs include revenue growth, customer satisfaction, and employee retention.
> - Market Segmentation: Dividing a market into distinct groups based on demographics, needs, or preferences. Example: "The company segmented its target market into young professionals, families, and retirees to tailor marketing strategies."
> - Return on Investment (ROI): The return or profit generated by an investment compared to its cost. Example: "The company calculated a 20% ROI on its new marketing campaign, indicating a successful investment."

#### 2. ML terminology
A table or paragraph containing all the ML-related terms to be used in the project.


> ##### Example:
> Underfitting: When a machine learning model is too simple and fails to capture the underlying patterns in the data. Example: "The company's initial model was underfitting the data, so they added more features and layers to improve accuracy."

---
1. Business terminology

Income: The total amount of money generated from a business's primary operations.

Market Trends: The upward or downward movement of a market over a period of time. One trend that has been on the rise is sustainability, with companies moving towards more eco-friendly and socially responsible practices.

Sales Forecasting: The process of estimating future revenue by predicting how much of a product or service will sell in the next week, month, quarter, or year. Example: Predicting that the company will sell 1,000 units of a new product in its first month based on market research.

Revenue: The total income generated from the sale of goods or services. Example: If a company sells 100 products at \$10 each, its revenue is \$1,000.

2. ML terminology


Feature Engineering: The process of using domain knowledge to create new features that make machine learning models work better.

Feature Importance: A technique used to assign scores to input features based on their importance to predicting a target variable.

Hyperparameter Tuning: The process of choosing the optimal set of hyperparameters (settings fixed before training, such as the learning rate) for a machine learning model.

----

## Scope of the project
----------


----
1. Background

  AliExpress is a major global e-commerce platform offering a wide range of products from various sellers.
  Product prices on AliExpress are influenced by production costs and market conditions, such as supply and demand, competition, and seasonal trends.
  Both the sellers on AliExpress and the platform itself are invested in accurate income prediction.
  For sellers, precise forecasting helps set competitive prices and manage inventory.
  For AliExpress, reliable income prediction supports resource allocation, investment strategies, and overall business planning.
  Accurate forecasting is crucial for financial management and planning for future growth and expansion.
2. Business problem

  It is essential for companies to enhance their ability to predict their income accurately.
  By accurately forecasting the prices of their products, companies can more precisely estimate their total or partial income.
  This allows them to make informed decisions about resource allocation, investment, and overall business strategy.
  Additionally, accurate income prediction enables companies to better manage their finances and plan for future growth and expansion.
3. Business objectives

  To make revenue forecasting more precise.

  Related business questions:

  - How do seasonal trends and market demand affect product prices?
  - What impact do vendor ratings and historical sales data have on future product pricing?
  - How can we optimize resource allocation and inventory management based on predicted income?
4. ML objectives

  Predict Product Prices: Develop a machine learning model to accurately predict the prices of products listed on AliExpress, based on historical data, product features, and market trends.

  ---

## Success Criteria
-------------

---
1. Business success criteria
  

* Increase Forecast Accuracy: Achieve a noticeable improvement in the accuracy of revenue forecasts.

2. ML success criteria

* Feature Importance: Accurately identify and quantify the top five factors influencing product price variations, with a clear impact ranking.
* R-squared (R²) of at least 0.5.
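
As a reference for the R-squared criterion, the sketch below shows one way to compute the coefficient of determination from true and predicted prices. The function name `r2_score` and the sample values are illustrative assumptions, not project results.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Sum of squared residuals between true and predicted values.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares around the mean of the true values.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up prices in SAR, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

score = r2_score(actual, predicted)
meets_criterion = score >= 0.5  # the ML success threshold stated above
```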

---

## Data collection

---
1. Data collection report

* Data Source:
A dataset of AliExpress products available in Saudi Arabia.
* Data Type:
Most columns are numerical: price, discount, category ID, rating, shipping cost, number of items sold, and IDs. The text columns are the product title, the store name, and the URLs of the store and the product image. The categorical columns are the category name and the product type.
* Data Size:
The dataset contains 172,854 records and 17 columns, 16 of which are populated.
* Data Collection Method:
This dataset was scraped from the Saudi Arabian AliExpress site.

2. Data Version control report

* Data Version: The current data version is v1.1, updated on June 20, 2024.
* Data Change Log: The data comes without a change log, so we cannot say for certain what changes were made to it during collection. However, the format of the data is the same for instances from different time periods.
* Data Backup: A backup of the current version of the data is kept in cloud storage in case of data loss.
* Data Archiving: Data is not archived; everything is stored in cloud storage.
* Data Access Control: Access to the cloud storage is restricted to the developers' Google accounts, so no one except the developers has access.

## Data quality verification

---
### 1. Data Description
- The data acquired for this project is a dataset of 172,854 records with 17 fields each. The fields include product price, name, type, and seller information. The data is in CSV format and is stored in a local database.
- Feature description

| Column         | Description                                     |
|----------------|-------------------------------------------------|
| Price          | Price of the product after applying the discount, in SAR |
| Discount       | It is the discount percentage of the product Price, between 0-100   |
| Shipping Cost  | Shipping Cost in SAR                            |
| Sold           | Number of times the Product was Bought          |
| ID             | It is the unique identifier for each Entry (Same Product from different store/sellers has different IDs) |
| Store ID       | It is the id of the store unique to each store  |
| Store Name     | It is the name of the product store             |
| Title          | It is the title of the product                  |
| Rating         | It is the rating of the product, from 0-5       |
| Lunch Time     | It is the date at which the product was added   |
| CategoryName   | It is the product's category name               |
| CategoryID     | The ID of the Category                          |
| Type           | It is the type of the product, with two values: ad (Advertised), natural (not Advertised) |
| imageUrl       | Product's Image Url                             |
| store URL      | Store URL                                       | 

### 2. Data Exploration
Data exploration revealed several patterns. Each category has its own typical (mean) price range, and each store tends to operate within its own price range. There is also an inverse correlation between the number of items sold and the price.


![first diagram](imgs/1_business_data_understanding/category_price.png)


![second diagram](imgs/1_business_data_understanding/store_price.png)

![third diagram](imgs/1_business_data_understanding/sold_price.png)
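
The two kinds of summaries behind these plots can be sketched as follows: a mean price per category and a Pearson correlation between sold count and price. This is a minimal illustration in plain Python; the dictionary keys follow the column names in the feature table and may differ from the actual CSV headers, and the sample rows are made up.

```python
from collections import defaultdict
from math import sqrt


def mean_price_by_category(rows):
    """Average price per category name."""
    totals = defaultdict(lambda: [0.0, 0])  # category -> [price sum, count]
    for row in rows:
        acc = totals[row["CategoryName"]]
        acc[0] += row["Price"]
        acc[1] += 1
    return {cat: s / n for cat, (s, n) in totals.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up sample rows, for illustration only.
rows = [
    {"CategoryName": "phones", "Price": 100.0, "Sold": 10},
    {"CategoryName": "phones", "Price": 200.0, "Sold": 5},
    {"CategoryName": "toys", "Price": 10.0, "Sold": 100},
    {"CategoryName": "toys", "Price": 30.0, "Sold": 60},
]

category_means = mean_price_by_category(rows)
corr = pearson([r["Sold"] for r in rows], [r["Price"] for r in rows])
```

A negative `corr` corresponds to the inverse relationship described above.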

### 3. Data Requirements
> - The data requirements for this project are defined as follows:
> > - Rating: must be in the range 0 - 5
> > - Sold: must be an integer greater than or equal to 0
> > - Price and shippingCost: price must be greater than 0
> > - Discount: must be in the range 0 - 100
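
These requirements can be expressed as a simple per-record check. The sketch below is one possible formulation, not the project's validation code: the function name `check_requirements` is hypothetical, the keys follow the feature table, and the rule that shipping cost only needs to be non-negative is an assumption, since the requirement above only constrains the price explicitly.

```python
def check_requirements(row):
    """Return a list of requirement violations for one product record."""
    problems = []
    if not 0 <= row["Rating"] <= 5:
        problems.append("Rating outside 0-5")
    if not (isinstance(row["Sold"], int) and row["Sold"] >= 0):
        problems.append("Sold is not a non-negative integer")
    if not row["Price"] > 0:
        problems.append("Price is not greater than 0")
    # Assumption: free shipping is allowed, so only negative values are rejected.
    if row["Shipping Cost"] < 0:
        problems.append("Shipping Cost is negative")
    if not 0 <= row["Discount"] <= 100:
        problems.append("Discount outside 0-100")
    return problems


# Made-up records, for illustration only.
good = {"Rating": 4.7, "Sold": 120, "Price": 35.5, "Shipping Cost": 0.0, "Discount": 15}
bad = {"Rating": 5.4, "Sold": -1, "Price": 0.0, "Shipping Cost": 2.0, "Discount": 120}

good_problems = check_requirements(good)
bad_problems = check_requirements(bad)
```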

### 4. Data Quality Check Verification Report
* Completeness: The dataset appears complete in the sense that every product has the information needed for price prediction.
* Correctness: The data appears correct, but a manual review is recommended.
* Missing values: The 'category' column is entirely empty; it is the only column with missing values.
* Overall, the data appears suitable for the project.
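
A missing-values check of this kind can be sketched with the standard `csv` module. The helper name `empty_columns` and the tiny inline CSV are illustrative assumptions; the real file has 17 columns.

```python
import csv
import io


def empty_columns(csv_text):
    """Return the names of columns in which every value is empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    filled = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in filled:
            value = row.get(name)
            # Count a cell as filled only if it holds non-whitespace text.
            if value is not None and value.strip():
                filled[name] += 1
    return [name for name, count in filled.items() if count == 0]


# Tiny made-up CSV, for illustration only.
sample = "id,price,category\n1,10.5,\n2,20.0,\n"

result = empty_columns(sample)
```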

## Project feasibility

### 1. Inventory of resources
* Computing resources: 3 laptops, limited access to T4 and P100 GPUs
* Software: libraries for machine learning
### 2. Requirements, assumptions and constraints
* Requirements: The project must be completed within several weeks. The results must be comprehensible to business stakeholders. There are no restrictions on our right to use the data.
* Assumptions: We assume that the product title can be an important feature for predicting prices.
* Constraints: The data contains no features that directly determine the price.
### 3. Risks and contingencies
* There is a risk that the listed features will not be sufficient to predict the price accurately with any ML method. In that case, we will probably collect additional information from the product and store pages, since the dataset includes the relevant URLs.
### 4. Costs and benefits
In material terms the project is free; the only cost is time. The benefits should be noticeable, because accurate price forecasting will save business stakeholders money.
### 5. Feasibility report
We used the RoBERTa model to produce text embeddings and concatenated them with the preprocessed data. A straightforward MLP with hidden layers of sizes 512 and 64 yielded promising results: an R-squared of approximately 0.8 and a Mean Absolute Error (MAE) of around 27. These metrics support the feasibility of using machine learning for our project goals. However, data cleaning posed significant challenges, which highlights the problems of working with raw data. We achieved these results only after removing products with fewer than 20 sales and stores with fewer than 5 products.
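
The filtering step described above can be sketched as follows. This is a minimal illustration, not the project's cleaning code: the function name `filter_products` is hypothetical, the keys follow the feature table, and counting products per store before applying the sales filter is an assumption about the order of steps.

```python
from collections import Counter


def filter_products(rows, min_sold=20, min_store_products=5):
    """Drop rarely sold products and products from very small stores."""
    # Assumption: store sizes are counted on the full, unfiltered dataset.
    per_store = Counter(row["Store ID"] for row in rows)
    return [
        row for row in rows
        if row["Sold"] >= min_sold
        and per_store[row["Store ID"]] >= min_store_products
    ]


# Made-up data: store A has five products, store B has one.
rows = [{"Store ID": "A", "Sold": s} for s in (5, 20, 30, 40, 50)]
rows.append({"Store ID": "B", "Sold": 500})

kept = filter_products(rows)
```

In this example the product with 5 sales is dropped for low sales, and the single product of store B is dropped because its store lists fewer than 5 products.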


----------------

### task

Describe the intended plan for achieving the machine learning goals and thereby achieving the business goals. The plan should specify the anticipated set of steps to be performed during the rest of the project including an initial selection of tools and techniques.

### output

#### 1. Project plan
- List the stages to be executed in the project, together with their duration, resources required, inputs, outputs, and dependencies. Where possible, try and make explicit the large-scale iterations in the machine learning process, for example, repetitions of the modeling and evaluation phases. As part of the project plan, it is also important to analyze dependencies between time schedule and risks. Mark results of these analyses explicitly in the project plan, ideally with actions and recommendations if the risks are manifested. Decide at this point which evaluation strategy will be used in the evaluation phase. Your project plan will be a dynamic document. At the end of each phase you’ll review progress and achievements and update the project plan accordingly. Specific review points for these updates should be part of the project plan.

- Build a Gantt chart for the project tasks and phases using an online platform such as TeamGantt, Jira, GoodDay, or Trello.

- Add all of your team members and make preliminary task assignments. Then check progress daily.

#### 2. ML project Canvas
At the end of the first phase, you should create a canvas for the project as a summary of this phase.


> ##### Example
> Follow the link: https://github.com/louisdorard/machine-learning-canvas/blob/master/churn.pdf


#### 1. Project plan
1. Data Understanding Phase: (Iteration 1)
   - Duration: 1 week
   - Resources Required: Data scientist, Data Engineer
   - Inputs: Data collection plan, AliExpress data
   - Outputs: Clear understanding of the business task
   - Dependencies: -

2. Preprocessing, Feature Engineering and Selection Phase:
   - Duration: 1 week
   - Resources Required: Data scientist, Computational Power
   - Inputs: Data collection plan, AliExpress data
   - Outputs: Cleaned and Preprocessed data, Selected features for modeling
   - Dependencies: Completion of Data Understanding Phase

3. Modeling and Evaluation Phase (Iteration 1):
   - Duration: Half a week
   - Resources Required: Data scientist, ML engineer
   - Inputs: Selected features, modeling algorithms
   - Outputs: Trained machine learning model, Evaluation results, Model performance analysis
   - Dependencies: Completion of Preprocessing, Feature Engineering and Selection Phase

4. Model Refinement and Re-evaluation Phase (Iteration 2):
   - Duration: 1 week
   - Resources Required: Data scientists, machine learning engineers
   - Inputs: Evaluation results, feedback from Evaluation Phase
   - Outputs: Improved machine learning models, Updated evaluation results, model performance analysis
   - Dependencies: Completion of Modeling and Evaluation Phase (Iteration 1)

5. Deployment Phase:
   - Duration: 1 week
   - Resources Required: All
   - Inputs: Finalized models
   - Outputs: Deployed model for price prediction on AliExpress
   - Dependencies: Completion of Model Refinement and Re-evaluation Phase (Iteration 2)

6. Testing Automated Pipeline Phase:
   - Duration: A couple of days
   - Resources Required: All
   - Inputs: Finalized models
   - Outputs: Deployed and tested model for price prediction on AliExpress with fully automated pipeline
   - Dependencies: Completion of Model Refinement and Re-evaluation Phase (Iteration 2)


Evaluation Strategy: The project will use a holdout approach in which the dataset is split into training and test sets, and cross-validation will be employed to assess model performance.
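
One way to combine the two ideas is to hold out a test set first and run k-fold cross-validation on the remaining training indices. The sketch below illustrates this under assumptions that are not fixed by the plan: a 20% test fraction, 5 folds, a fixed random seed, and the hypothetical helper names `holdout_split` and `k_fold`.

```python
import random


def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 100 samples: 80 for training and cross-validation, 20 held out for the final test.
train_idx, test_idx = holdout_split(100)
folds = list(k_fold(train_idx, k=5))
```

The held-out test indices are never seen during cross-validation, so the final test score stays an independent estimate.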

Analysis of Dependencies and Risks:
- The main risks to the project are delays in data collection and preprocessing, as well as the need for significant model refinement iterations. Contingency plans will be put in place to address these risks, such as allocating more resources to critical phases or adjusting project timelines accordingly.
- Regular review points will be set after each phase to assess progress and make necessary adjustments to the project plan based on achievements and challenges encountered. 

Gantt chart:
https://docs.google.com/spreadsheets/d/1S_q7SpKwVRrKLhX5t75cMqm2tJVm7LSLbULE6R0hH9A/edit?pli=1&gid=0#gid=0

#### 2. ML project Canvas
At the end of the first phase, you should create a canvas for the project as a summary of this phase.
![ML canvas](imgs/1_business_data_understanding/Mlcanvas1.png)
![ML canvas](imgs/1_business_data_understanding/Mlcanvas2.png)