# business and data understanding
------------

The initial phase is concerned with defining the business objectives and translating them into ML objectives, collecting the data and verifying its quality, and finally assessing the project's feasibility.

![](https://i.imgur.com/55J7fBc.jpeg)

## Terminology

### tasks

Compile a glossary of terminology relevant to the project. This may include two components:
(1) A glossary of relevant business terminology, which forms part of the business understanding available to the project. Constructing this glossary is a useful "knowledge elicitation" and education exercise.
(2) A glossary of machine learning terminology, illustrated with examples relevant to the business problem in question.

### output

#### 1. Business terminology
A table or paragraph containing all the business-related terms to be used in the project.

> ##### Example:
> - Key Performance Indicators (KPIs): Quantifiable metrics used to measure the performance of a business or organization. The company's KPIs include revenue growth, customer satisfaction, and employee retention.
> - Market Segmentation: Dividing a market into distinct groups based on demographics, needs, or preferences. Example: "The company segmented its target market into young professionals, families, and retirees to tailor marketing strategies."
> - Return on Investment (ROI): The return or profit generated by an investment compared to its cost. Example: "The company calculated a 20% ROI on its new marketing campaign, indicating a successful investment."

#### 2. ML terminology
A table or paragraph containing all the ML-related terms to be used in the project.


> ##### Example:
> Underfitting: When a machine learning model is too simple and fails to capture the underlying patterns in the data. Example: "The company's initial model was underfitting the data, so they added more features and layers to improve accuracy."

---
1. Business terminology

Income: The total amount of money generated from a business's primary operations.

Market Trends: The upward or downward movement of a market, during a period of time. One such trend that has been on the rise is sustainability, with companies moving towards more eco-friendly and socially responsible practices.

Sales Forecasting: The process of estimating future revenue by predicting how much of a product or service will sell in the next week, month, quarter, or year. Example: Predicting that the company will sell 1,000 units of a new product in its first month based on market research.

Revenue: The total income generated from the sale of goods or services. Example: If a company sells 100 products at \$10 each, its revenue is \$1,000.

2. ML terminology


Feature Engineering: The process of using domain knowledge to create new features that make machine learning models work better.

Feature Importance: A technique used to assign scores to input features based on their importance to predicting a target variable.

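Hyperparameter Tuning: The process of choosing the best set of hyperparameters for a machine learning model, i.e. the settings that are fixed before training (such as the learning rate or tree depth) rather than learned from the data.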

----

## Scope of the project
----------

### tasks
- Explore the background of the business.
- Define business problem
- Define business objectives
- Translate business objectives into ML objectives

The objective here is to thoroughly understand, from a business perspective, what the client really wants to accomplish. Often the client has many competing objectives and constraints that must be properly balanced. The goal is to uncover important factors, at the beginning, that can influence the outcome of the project. A possible consequence of neglecting this step is to expend a great deal of effort producing the right answers to the wrong questions.

### output

#### 1. background

A short paragraph to record the information that is known about the organization's business situation at the beginning of the project.

> ##### Example:
> Shopzilla is an e-commerce platform with branches in different cities. It has a mobile app that allows clients to buy products. It profiles users' purchase history and uses this data to build AI models.

#### 2. business problem
A short paragraph to describe the business problem.

> ##### Example:
> The business problem is that the business stakeholders want to follow a customer-centric sales methodology. They want to predict customer satisfaction with their services so that they can adapt their strategies accordingly. The provided dataset is labelled and captures customer satisfaction scores for a one-month period. It includes various features such as category and sub-category of interaction, customer remarks, survey response date, category, item price, agent details (name, supervisor, manager), and CSAT score.
> The company has been experiencing high customer churn rates, resulting in significant revenue losses.

#### 3. business objectives

A list of business objectives which describes the customer's primary objective, from a business perspective. In addition to the primary business objective, there are typically other related business questions that the customer would like to address. For example, the primary business goal might be to keep current customers by predicting when they are prone to move to a competitor. Examples of related business questions are "How does the primary channel (e.g., ATM, visit branch, internet) a bank customer uses affect whether they stay or go?" or "Will lower ATM fees significantly reduce the number of high-value customers who leave?"

> ##### Examples:
> - Will lower ATM fees significantly reduce the number of high-value customers who leave?
> - Does the channel used affect whether customers stay or go?

#### 4. ML objectives

A list of ML objectives that describe the intended outputs of the project which enable the achievement of the business objectives.

> ##### Examples:
> Predict how many widgets a customer will buy, given their purchases over the past three years, demographic information (age, salary, city, etc.), and the price of the item.

----
1. Background

  AliExpress is a major global e-commerce platform offering a wide range of products from various sellers.
  Product prices on AliExpress are influenced by production costs and market conditions, such as supply and demand, competition, and seasonal trends.
  Both the sellers on AliExpress and the platform itself are invested in accurate income prediction.
  For sellers, precise forecasting helps set competitive prices and manage inventory.
  For AliExpress, reliable income prediction supports resource allocation, investment strategies, and overall business planning.
  Accurate forecasting is crucial for financial management and planning for future growth and expansion.
2. Business problem

  It is essential for companies to enhance their ability to predict their income accurately.
  By accurately forecasting the prices of their products, companies can more precisely estimate their total or partial income.
  This allows them to make informed decisions about resource allocation, investment, and overall business strategy.
  Additionally, accurate income prediction enables companies to better manage their finances and plan for future growth and expansion.
3. Business objectives

  To make revenue forecasting more precise.

  Related Business Questions:

  How do seasonal trends and market demand affect product prices?
  What impact do vendor ratings and historical sales data have on future product pricing?
  
  How can we optimize resource allocation and inventory management based on predicted income?
4. ML objectives

  Predict Product Prices: Develop a machine learning model to accurately predict the prices of products listed on AliExpress, based on historical data, product features, and market trends.

  ---

## Success Criteria
-------------

### tasks
- Describe the success criteria of the ML project on three different levels: the business success criteria, the ML success criteria and the economic success criteria.

### output

#### 1. Business success criteria
A list of criteria from a business point of view. For example, if an ML application is planned for a quality check in production and is supposed to outperform the current manual failure rate of 3%, the business success criterion could be derived as, e.g., "failure rate less than 3%".

> ##### Example:
> - Increase customer satisfaction ratings by 8% within the next quarter.
> - Reduce operational costs by 12% within the next year.


#### 2. ML success criteria
A list of criteria for a successful outcome to the project in technical terms, for example a certain level of predictive accuracy or a propensity to purchase profile with a given degree of "lift." As with business success criteria, it may be necessary to describe these in subjective terms, in which case the person or persons making the subjective judgment should be identified.

> ##### Example:
> - The model aims to achieve a recall rate of 95% for identifying customers who are likely to churn.
> - The model aims to achieve a precision rate of 90% for identifying high-value customers.
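
Criteria such as the recall and precision targets above can be checked mechanically once predictions are available. The following is a minimal sketch using only the Python standard library; the function names, the label encoding (1 = positive class) and the sample labels are illustrative assumptions, not part of any project code.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def meets_criteria(y_true, y_pred, min_precision, min_recall):
    """True if both thresholds from the ML success criteria are reached."""
    precision, recall = precision_recall(y_true, y_pred)
    return precision >= min_precision and recall >= min_recall


# Hypothetical labels: 1 = churned, 0 = stayed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```

With these made-up labels, three of the four predicted positives are correct and three of the four actual positives are found, so a 90% precision and 95% recall target would not be met.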

---
1. Business success criteria
  

* Increase Forecast Accuracy: Achieve a noticeable improvement in the accuracy of revenue forecasts.

2. ML success criteria

* Feature Importance: Accurately identify and quantify the top five factors influencing product price variations, with a clear impact ranking.
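
One possible way to obtain such an impact ranking is permutation importance: shuffle one feature at a time and measure how much the prediction error grows. The sketch below is an assumption about how this could be done, not the project's chosen method; `toy_predict` is a hypothetical stand-in for a trained model, and the feature names and values are invented for illustration.

```python
import random


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, targets, n_repeats=5, seed=0):
    """Average increase in MAE when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_absolute_error(targets, [predict(r) for r in rows])
    importances = {}
    for name in rows[0]:
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[name] for r in rows]
            rng.shuffle(shuffled)
            # Replace only the column under test, keep all other features intact.
            permuted = [{**r, name: v} for r, v in zip(rows, shuffled)]
            error = mean_absolute_error(targets, [predict(r) for r in permuted])
            increases.append(error - baseline)
        importances[name] = sum(increases) / n_repeats
    return importances


def top_factors(importances, k=5):
    """Return the k features with the largest importance, highest first."""
    return sorted(importances.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical model: price depends on discount and shipping cost, not on rating.
def toy_predict(row):
    return 100.0 - 0.5 * row["discount"] + 2.0 * row["shipping_cost"]


rows = [
    {"discount": d, "shipping_cost": s, "rating": r}
    for d, s, r in [(0, 5, 4.5), (10, 0, 3.9), (50, 12, 4.8),
                    (30, 7, 4.1), (70, 3, 2.5), (20, 9, 5.0)]
]
targets = [toy_predict(r) for r in rows]
importances = permutation_importance(toy_predict, rows, targets)
ranking = top_factors(importances)
```

Because the toy model ignores `rating`, shuffling that column leaves the error unchanged and its importance is zero, which is the behaviour one would expect from an uninformative feature.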

---

## Data collection

### tasks
- Specify the data sources
- Collect the data
- Version control on the data

### output

#### 1. Data collection report
A section to describe the data sources and how you want to collect the data.

> ##### Example:
> - **Data Source:** The origin of the data, such as a database, file, or API. Example: "The data was collected from the company's customer relationship management (CRM) database."
> - **Data Type:** The format or structure of the data, such as numerical, categorical, or text. Example: "The data consists of numerical values representing customer purchase amounts and categorical values representing customer demographics."
> - **Data Size:** The quantity of data collected. Example: "The dataset contains 100,000 customer records, with 50 features each."
> - **Data Collection Method:** The process used to collect the data. Example: "The data was collected using a web scraping tool that extracted customer information from the company's website."


#### 2. Data version control report
A section to describe the data versions, what changes happened in the data, and how you back up the data. This should be done after you collect the data.

> ##### Example:
> - **Data Version:** A unique identifier for each version of the data. Example: "The current data version is v1.2, which was updated on March 15, 2023."
> - **Data Change Log:** A record of changes made to the data. Example: "The data change log shows that the customer demographics feature was updated on February 20, 2023, to include new categories."
> - **Data Backup:** A copy of the data stored for recovery in case of data loss. Example: "The company has a daily backup of the data stored on a secure server."
> - **Data Archiving:** The process of storing and managing historical data. Example: "The company archives data older than one year to a cloud storage service for long-term retention."
> - **Data Access Control:** The process of controlling who can access and modify the data. Example: "The company uses role-based access control to ensure that only authorized personnel can access and modify the data."

---
1. Data collection report

* Data Source:
A dataset of AliExpress products available in Saudi Arabia.
* Data Type:
Most of the data consists of numerical values: price, discount, category ID, rating, shipping cost, number of units sold, and IDs. It also contains text data, such as the title, the store name, the store URL and the product image URL, and categorical data, such as the category name and the product type.
* Data Size:
The dataset contains 172,854 records; 16 of its 17 columns are filled.
* Data Collection Method:
The dataset was scraped from the AliExpress site for Saudi Arabia.

2. Data Version control report

* Data Version: The current data version is v1.1, which was updated on June 20, 2024
* Data Change Log: The current data has no change log, so we cannot say with certainty what changes were made to the data during collection. However, it is easy to see that the data format is the same for instances from different time periods.
* Data Backup: A backup of the current version of the data is kept in cloud storage in case of data loss.
* Data Archiving: Data is not archived. Everything is stored in cloud storage.
* Data Access Control: Access to the cloud storage is restricted to the developers' Google accounts, so no one except the developers has access.
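
Since the data ships without a change log, one lightweight option is to record a checksum per version so that later changes become detectable. The sketch below is only an illustration using the Python standard library; the function names and the JSON-lines log format are assumptions, and a dedicated data versioning tool may be a better fit in practice.

```python
import hashlib
import json
from datetime import date


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_version_entry(path, version, note, updated_on=None):
    """Describe one data version; the checksum identifies the exact file contents."""
    return {
        "version": version,
        "updated_on": (updated_on or date.today()).isoformat(),
        "sha256": file_sha256(path),
        "note": note,
    }


def append_to_change_log(log_path, entry):
    """Append the entry as one JSON line (hypothetical log format)."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
```

A call such as `make_version_entry("products.csv", "v1.1", "initial scrape")` (with `products.csv` as a placeholder file name) would yield an entry whose checksum changes whenever the file contents change.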

## Data quality verification

### tasks
- Describe data
- Define data requirements
- Explore the data
- Verify the data quality

### output

#### 1. Data description
A section to describe the data that has been acquired including its format, its quantity (for example, the number of records and fields in each table), the identities of the fields and any other surface features which have been discovered. Add a table of description of the data features.

> ##### Example
> - The data acquired for this project includes a dataset of 1000 records with 20 fields each. The fields include customer demographics, purchase history, and product preferences. The data is in a CSV format and is stored in a local database.
> - Add a table of description of the data features.

#### 2. Data exploration
A section to present results of your data exploration, including first findings or initial hypothesis and their impact on the remainder of the project. If appropriate you could include graphs and plots here to indicate data characteristics that suggest further examination of interesting data subsets.

> ##### Example
> - During data exploration, several interesting patterns and correlations were discovered. For example, there is a strong correlation between the age of customers and their purchase frequency. Additionally, customers who have purchased from the company in the past are more likely to make repeat purchases. These findings suggest that the data is representative of the target audience and that the company's marketing strategies are effective.
> - Add charts and figures to present the findings.

#### 3. Data requirements
A section to describe the data requirements. The requirements can be defined either on the meta-level or directly in the data, and should state the expected conditions of the data, i.e., whether a certain sample is plausible. The requirements can be, e.g., the expected feature values (a range for continuous features or a list for discrete features), the format of the data and the maximum number of missing values. The bounds of the requirements have to be defined carefully to include all possible real-world values but discard implausible data. Data that does not satisfy the expected conditions can be treated as anomalies and needs to be evaluated manually or excluded automatically. To mitigate the risk of anchoring bias in this first phase, discussing the requirements with a domain expert is advised. The data requirements can be documented in the form of a schema with strict data types and conditions.

> ##### Example
> - The data requirements for this project are defined as follows:
> > - Customer Demographics: Age should be within the range of 18 to 100 years, and gender should be either male or female.
> > - Purchase History: The number of purchases should be greater than or equal to 0, and the total amount spent should be greater than or equal to $0.
> > - Product Preferences: The product preferences should be represented as a list of product IDs, and each product ID should be unique and within the range of 1 to 1000.
> > - Here we define expectations to satisfy the data requirements.

#### 4. Data quality verification report
A section to verify and report the quality of the data. Examine the quality of the data, addressing questions such as:
- Is the data complete (does it cover all the cases required)?
- Is it correct, or does it contain errors and, if there are errors, how common are they?
- Are there missing values in the data? If so, how are they represented, where do they occur, and how common are they?

> ##### Example
> - Completeness: The data is complete in the sense that it covers all the required cases. All customers have demographic information, and all purchases are recorded.
> - Correctness: The data appears to be correct, with no obvious errors. However, a manual review of the data is recommended to ensure that there are no errors.
> - Missing Values: There are no missing values in the data.
> - Overall, the data quality is high, and the data is suitable for analysis and modeling. However, a manual review of the data is recommended to ensure that there are no errors or anomalies.

---
### 1. Data Description
- The data acquired for this project includes a dataset of 172854 records with 17 fields each. The fields include product price, name, type, and seller information. The data is in a CSV format and is stored in a local database.
- Feature description

| Column         | Description                                     |
|----------------|-------------------------------------------------|
| Price          | Price of the product after applying the discount, in SAR |
| Discount       | It is the discount percentage of the product Price, between 0-100   |
| Shipping Cost  | Shipping Cost in SAR                            |
| Sold           | Number of times the Product was Bought          |
| ID             | It is the unique identifier for each Entry (Same Product from different store/sellers has different IDs) |
| Store ID       | It is the id of the store unique to each store  |
| Store Name     | It is the name of the product store             |
| Title          | It is the title of the product                  |
| Rating         | It is the rating of the product, from 0-5       |
| Lunch Time     | It is the date at which the product was added   |
| CategoryName   | It is the product's category name               |
| CategoryID     | The ID of the Category                          |
| Type           | It is the type of the product, with two values: ad (Advertised), natural (not Advertised) |
| imageUrl       | Product's Image Url                             |
| store URL      | Store URL                                       | 

### 2. Data Exploration
During data exploration, several patterns were discovered. Each category has its own typical price range, and each store tends to operate within its own price range. There is also an inverse correlation between the number of units sold and the price.


![first diagram](../data/imgs/1_business_data_understanding/category_price.png)


![second diagram](../data/imgs/1_business_data_understanding/store_price.png)

![third diagram](../data/imgs/1_business_data_understanding/sold_price.png)
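
The strength of a relationship such as the one between units sold and price can be quantified with a correlation coefficient. Below is a minimal sketch of the Pearson correlation in plain Python; the sample values are invented for illustration and are not taken from the dataset.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: higher prices paired with lower sales counts.
prices = [5.0, 12.0, 30.0, 75.0, 150.0]
sold = [900, 400, 310, 40, 5]
r = pearson_correlation(prices, sold)  # negative for an inverse relationship
```

Pearson correlation only captures linear relationships; for heavily skewed quantities such as prices and sales counts, a rank-based measure may be worth considering as well.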

### 3. Data Requirements
> - The data requirements for this project are defined as follows:
> > - Rating: should be in the range 0 to 5.
> > - Sold: should be an integer greater than or equal to 0.
> > - Price and shippingCost: price should be greater than 0.
> > - Discount: should be in the range 0 to 100.
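
The requirements above can be written down as an executable schema, so that each record is checked automatically. This is a minimal sketch: the lower-case keys (`rating`, `sold`, `price`, `discount`) are assumed names for parsed columns, and no rule is included for the shipping cost because the requirements do not state a bound for it.

```python
# Each requirement maps an (assumed) column name to a validity check.
REQUIREMENTS = {
    "rating": lambda v: isinstance(v, (int, float)) and 0 <= v <= 5,
    "sold": lambda v: isinstance(v, int) and v >= 0,
    "price": lambda v: isinstance(v, (int, float)) and v > 0,
    "discount": lambda v: isinstance(v, (int, float)) and 0 <= v <= 100,
}


def violations(row):
    """Return the names of all requirements that the record fails."""
    return [
        name
        for name, is_valid in REQUIREMENTS.items()
        if name not in row or not is_valid(row[name])
    ]


def split_valid_invalid(rows):
    """Separate records that satisfy every requirement from anomalies."""
    valid, invalid = [], []
    for row in rows:
        (invalid if violations(row) else valid).append(row)
    return valid, invalid


good = {"rating": 4.7, "sold": 120, "price": 35.5, "discount": 20}
bad = {"rating": 5.5, "sold": -1, "price": 0, "discount": 20}
valid_rows, anomalies = split_valid_invalid([good, bad])
```

Records that land in `anomalies` can then be reviewed manually or excluded automatically, as described in the data requirements task.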

### 4. Data Quality Check Verification Report
* Completeness: The dataset appears complete, in the sense that all products have the information needed for price prediction.
* Correctness: The data appears correct, but a manual review is recommended.
* Missing values: The 'category' column is entirely empty; it is the only column with missing values.
* Overall, the data quality appears adequate for the project.
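
A missing-value check of this kind can be reproduced with a short script. The sketch below counts missing entries per column of a CSV file using only the standard library; the set of strings treated as missing and the tiny in-memory sample are assumptions for illustration.

```python
import csv
import io

# Assumed textual representations of a missing value.
MISSING_MARKERS = {"", "na", "nan", "null", "none"}


def missing_value_report(csv_file):
    """Return {column: {"missing": count, "share": fraction}} for an open CSV file."""
    reader = csv.DictReader(csv_file)
    counts = {name: 0 for name in reader.fieldnames or []}
    total = 0
    for row in reader:
        total += 1
        for name in counts:
            value = row.get(name)
            # Short rows yield None; otherwise compare the normalised text.
            if value is None or value.strip().lower() in MISSING_MARKERS:
                counts[name] += 1
    return {
        name: {"missing": n, "share": n / total if total else 0.0}
        for name, n in counts.items()
    }


# Hypothetical two-row sample in which the category column is always empty.
sample = io.StringIO("price,category,rating\n10.5,,4.7\n3.2,,\n")
report = missing_value_report(sample)
```

A column whose `share` equals 1.0 is entirely empty, which is the situation reported for the 'category' column.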

## Project feasibility
-------------
This task involves more detailed fact-finding about all of the resources, constraints, assumptions and other factors that should be considered in determining the data analysis goal and project plan. In the previous task, your objective was to quickly get to the crux of the situation. Here, you want to flesh out the details.

### tasks
- Assess the project feasibility
- Create POC (Proof-of-concept) model

### output

#### 1. inventory of resources

List the resources available to the project, including: personnel (business experts, data experts, technical support, machine learning personnel), data (fixed extracts, access to live warehoused or operational data), computing resources (hardware platforms) and software (machine learning tools, other relevant software).

#### 2. Requirements, assumptions and constraints

List all requirements of the project, including the schedule of completion, the comprehensibility and quality of results, and security as well as legal issues. As part of this output, make sure that you are allowed to use the data. List the assumptions made by the project.

These may be assumptions about the data that can be checked during machine learning, but may also include non-checkable assumptions about the business upon which the project rests. It is particularly important to list the latter if they form conditions on the validity of the results.

List the constraints on the project. These may be constraints on the availability of resources, but may also include technological constraints such as the size of data that it is practical to use for modeling.

#### 3. Risks and contingencies

List the risks or events that might occur to delay the project or cause it to fail. List the corresponding contingency plans; what action will be taken if the risks happen.


#### 4. Costs and benefits

Construct a cost-benefit analysis for the project, which compares the costs of the project with the potential benefit to the business if it is successful. The comparison should be as specific as possible.

![](https://i.imgur.com/XU2lghc.png)
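
A cost-benefit comparison can be kept explicit by listing cost and benefit items and deriving the net benefit and the return on investment from them. In the sketch below every figure and item name is hypothetical and serves only to show the arithmetic.

```python
def cost_benefit(costs, benefits):
    """Summarise itemised costs and benefits (same currency, same time horizon)."""
    total_cost = sum(costs.values())
    total_benefit = sum(benefits.values())
    if total_cost <= 0:
        raise ValueError("total cost must be positive to compute ROI")
    net_benefit = total_benefit - total_cost
    return {
        "total_cost": total_cost,
        "total_benefit": total_benefit,
        "net_benefit": net_benefit,
        "roi": net_benefit / total_cost,  # 0.2 means a 20% return
    }


# Hypothetical figures for illustration only.
costs = {"personnel": 40_000, "compute": 6_000, "data_acquisition": 4_000}
benefits = {"reduced_forecast_error": 45_000, "inventory_savings": 15_000}
summary = cost_benefit(costs, benefits)
```

Here the invented costs sum to 50,000 and the benefits to 60,000, giving a net benefit of 10,000 and an ROI of 20%.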

#### 5. Feasibility report

Build a POC ML model and explain as a team whether it is feasible to do this ML project or not. If not, then you need to find another business problem. The key factors here are related to data availability, quality, costs and the nature of the business problem.
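
A POC does not need to be sophisticated; a simple model that clearly beats a trivial baseline is already evidence of feasibility. The sketch below, which is only one possible starting point, predicts a product's price as the mean price of its category and compares the error with always predicting the global mean. The categories and prices are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def mean_absolute_error(actual, predicted):
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def fit_category_means(train):
    """train: list of (category, price). Returns (per-category means, global mean)."""
    by_category = defaultdict(list)
    for category, price in train:
        by_category[category].append(price)
    global_mean = mean(price for _, price in train)
    return {c: mean(p) for c, p in by_category.items()}, global_mean


def predict(category, category_means, global_mean):
    """Fall back to the global mean for categories not seen during training."""
    return category_means.get(category, global_mean)


# Hypothetical training and hold-out records.
train = [("phones", 900.0), ("phones", 1100.0), ("toys", 40.0), ("toys", 60.0)]
test = [("phones", 1000.0), ("toys", 50.0), ("books", 300.0)]

category_means, global_mean = fit_category_means(train)
actual = [price for _, price in test]
baseline_mae = mean_absolute_error(actual, [global_mean] * len(test))
poc_mae = mean_absolute_error(
    actual, [predict(c, category_means, global_mean) for c, _ in test]
)
```

If the POC error is not meaningfully lower than the baseline error on real held-out data, that is a signal to revisit the features, the data quality, or the problem definition.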

## produce project plan
----------------

### task

Describe the intended plan for achieving the machine learning goals and thereby achieving the business goals. The plan should specify the anticipated set of steps to be performed during the rest of the project including an initial selection of tools and techniques.

### output

#### 1. Project plan
- List the stages to be executed in the project, together with their duration, resources required, inputs, outputs, and dependencies. Where possible, try and make explicit the large-scale iterations in the machine learning process, for example, repetitions of the modeling and evaluation phases. As part of the project plan, it is also important to analyze dependencies between time schedule and risks. Mark results of these analyses explicitly in the project plan, ideally with actions and recommendations if the risks are manifested. Decide at this point which evaluation strategy will be used in the evaluation phase. Your project plan will be a dynamic document. At the end of each phase you’ll review progress and achievements and update the project plan accordingly. Specific review points for these updates should be part of the project plan.

- Build a Gantt chart for the project tasks and phases using an online platform such as TeamGantt, Jira, GoodDay or Trello.

- Add all of your team members and make a preliminary assignment of tasks to them. Then check the progress daily.
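
Before entering stages into a Gantt tool, it can help to derive the earliest possible start of each stage from its duration and dependencies. The following sketch uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+); the stage names and durations are hypothetical placeholders.

```python
from graphlib import TopologicalSorter


def earliest_schedule(stages):
    """stages: {name: (duration_days, [dependencies])} -> {name: (start_day, end_day)}."""
    graph = {name: deps for name, (_, deps) in stages.items()}
    schedule = {}
    # static_order() yields every stage after all of its dependencies.
    for name in TopologicalSorter(graph).static_order():
        duration, deps = stages[name]
        start = max((schedule[d][1] for d in deps), default=0)
        schedule[name] = (start, start + duration)
    return schedule


# Hypothetical stages and durations in days.
stages = {
    "business_and_data_understanding": (10, []),
    "data_preparation": (15, ["business_and_data_understanding"]),
    "modeling": (20, ["data_preparation"]),
    "evaluation": (10, ["modeling"]),
    "deployment": (10, ["evaluation"]),
}
schedule = earliest_schedule(stages)
```

Iterations such as repeating the modeling and evaluation phases are not modelled here; they would need to be added as extra stages or handled when the plan is reviewed at the end of each phase.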

#### 2. ML project Canvas
At the end of the first phase, you should create a canvas for the project as a summary of this phase.


> ##### Example
> Follow the link: https://github.com/louisdorard/machine-learning-canvas/blob/master/churn.pdf

