# Business and data understanding
------------

The initial phase defines the business objectives and translates them into ML objectives, collects the data and verifies its quality, and finally assesses the feasibility of the project.

![](https://i.imgur.com/55J7fBc.jpeg)

## Terminology

### tasks

Compile a glossary of terminology relevant to the project. This may include two components:
(1) A glossary of relevant business terminology, which forms part of the business understanding available to the project. Constructing this glossary is a useful "knowledge elicitation" and education exercise.
(2) A glossary of machine learning terminology, illustrated with examples relevant to the business problem in question.

### output

#### 1. Business terminology
A table or paragraph contains all the business related terms to be used in the project.

> ##### Example:
> - Key Performance Indicators (KPIs): Quantifiable metrics used to measure the performance of a business or organization. The company's KPIs include revenue growth, customer satisfaction, and employee retention.
> - Market Segmentation: Dividing a market into distinct groups based on demographics, needs, or preferences. Example: "The company segmented its target market into young professionals, families, and retirees to tailor marketing strategies."
> - Return on Investment (ROI): The return or profit generated by an investment compared to its cost. Example: "The company calculated a 20% ROI on its new marketing campaign, indicating a successful investment."

#### 2. ML terminology
A table or paragraph contains all the ML related terms to be used in the project.


> ##### Example:
> Underfitting: When a machine learning model is too simple and fails to capture the underlying patterns in the data. Example: "The company's initial model was underfitting the data, so they added more features and layers to improve accuracy."

## Terminology

#### 1. Business terminology

> - **Airbnb**: An online marketplace that connects people who want to rent out their homes with those looking for accommodations.
> - **Listing**: An individual rental property available on Airbnb.
> - **Occupancy Rate**: The percentage of available rental units that are occupied at a given time.
> - **Revenue Optimization**: The process of adjusting prices to maximize income from rental properties.
> - **Short-term Rental Market**: A segment of the rental market that offers accommodations for a short duration, typically less than a month.
> - **Dynamic Pricing**: A pricing strategy where prices are adjusted based on real-time supply and demand conditions.
> - **Booking Window**: The time frame between when a guest books a rental and the start of their stay.
> - **Cancellation Policy**: The rules set by the host regarding the conditions under which a guest can cancel a reservation and receive a refund.
> - **Review Rating**: A score given by guests based on their stay, reflecting the quality of the property and the host's service.
> - **Cleaning Fee**: An additional charge imposed by the host for cleaning the property after a guest's stay.
> - **Amenities**: Features provided by the rental property, such as Wi-Fi, parking, or a swimming pool.
> - **Property Type**: The classification of rental properties, such as apartment, house, or villa.
> - **Location**: The geographical area where the rental property is situated, impacting its attractiveness and price.
> - **Seasonality**: Fluctuations in demand and prices due to seasonal factors.
> - **Competitive Analysis**: The process of evaluating similar listings in the area to set competitive prices.
> - **Market Trends**: Changes and patterns in the short-term rental market that influence demand and pricing.
> - **Minimum Stay Requirement**: The shortest duration a guest can book a property, as set by the host.
> - **Check-in/Check-out Policy**: The rules and timings related to when guests can arrive and depart from the rental property.
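
Two of the terms above reduce to simple calculations. The following is a minimal sketch of how occupancy rate and booking window could be computed for a single listing. The function names and sample numbers are illustrative assumptions, and occupancy is measured here in nights, which is only one possible reading of the definition.

```python
from datetime import date


def occupancy_rate(booked_nights: int, available_nights: int) -> float:
    """Share of available nights that were booked, as a percentage."""
    if available_nights <= 0:
        raise ValueError("available_nights must be positive")
    return 100.0 * booked_nights / available_nights


def booking_window_days(booked_on: date, check_in: date) -> int:
    """Number of days between the booking date and the start of the stay."""
    return (check_in - booked_on).days


# Hypothetical listing: 21 of 30 available nights were booked.
print(occupancy_rate(21, 30))  # 70.0
# Hypothetical booking made on 1 June for a stay starting on 15 June.
print(booking_window_days(date(2024, 6, 1), date(2024, 6, 15)))  # 14
```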


#### 2. ML terminology

> - **Regression**: A supervised learning technique used to predict continuous values, such as the price of an Airbnb listing based on its features.
> - **Classification**: A supervised learning technique used to predict categorical outcomes, such as whether a listing will be booked or not.
> - **Feature Engineering**: The process of selecting, modifying, and creating new variables (features) that enhance the performance of ML models.
> - **Training Data**: The subset of data used to train ML models, containing input-output pairs.
> - **Validation Data**: A subset of data used to tune model parameters and prevent overfitting by evaluating model performance.
> - **Test Data**: A subset of data used to assess the final performance of the model after training and validation.
> - **Overfitting**: A modeling error that occurs when the ML model captures noise in the training data, performing well on training data but poorly on new, unseen data.
> - **Underfitting**: A modeling error that occurs when the ML model is too simple to capture the underlying patterns in the data, leading to poor performance on both training and test data.
> - **Cross-Validation**: A technique for assessing how the results of a statistical analysis will generalize to an independent dataset, typically by partitioning the data into subsets and training/testing the model multiple times.
> - **Hyperparameters**: Settings that define the model architecture and learning process, such as learning rate or number of trees in a random forest, which need to be specified before training.
> - **Model Evaluation Metrics**: Measures used to evaluate the performance of ML models, such as Mean Absolute Error (MAE) for regression or Accuracy for classification.
> - **Feature Importance**: A technique to determine the significance of individual features in predicting the target variable.
> - **Normalization**: A preprocessing step that scales features to a standard range, often 0 to 1, to ensure equal contribution to the model.
> - **ROC Curve**: A graphical plot that illustrates the diagnostic ability of a binary classifier system, plotting the true positive rate against the false positive rate.
> - **F1 Score**: A measure of a test's accuracy that considers both precision and recall, providing a single metric that balances both aspects.
> - **Data Preprocessing**: The process of cleaning and preparing raw data for ML, involving steps like handling missing values, encoding categorical variables, and normalizing data.
> - **Exploratory Data Analysis (EDA)**: An approach to analyzing data sets to summarize their main characteristics, often using visual methods, before applying more formal modeling techniques.
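
To make two of these definitions concrete, here is a minimal sketch of min-max normalization to the 0 to 1 range and of the F1 score computed from raw counts. The function names and the sample values are assumptions chosen for illustration only.

```python
def min_max_normalize(values):
    """Scale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: there is no spread to scale, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall computed from raw counts."""
    if true_positives == 0:
        # No correct positive predictions: precision and recall are both 0.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical nightly prices.
print(min_max_normalize([50, 100, 150]))  # [0.0, 0.5, 1.0]
# Hypothetical classifier: 8 true positives, 2 false positives, 2 false negatives.
print(f1_score(8, 2, 2))
```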

## Scope of the project
----------

### tasks
- Explore the background of the business.
- Define business problem
- Define business objectives
- Translate business objectives into ML objectives

The objective here is to thoroughly understand, from a business perspective, what the client really wants to accomplish. Often the client has many competing objectives and constraints that must be properly balanced. The goal is to uncover important factors, at the beginning, that can influence the outcome of the project. A possible consequence of neglecting this step is to expend a great deal of effort producing the right answers to the wrong questions.

### output

#### 1. background

A short paragraph to record the information that is known about the organization's business situation at the beginning of the project.

> ##### Example:
> Shopzilla is an e-commerce platform with branches in different cities. It has a mobile app that allows clients to buy products. The company profiles the purchase history of its users and uses this data to build AI models.

#### 2. business problem
A short paragraph to describe the business problem.

> ##### Example:
> The business problem is that the business stakeholders want to follow a customer-centric sales methodology. They want to predict customer satisfaction with their services so that they can adapt their strategies accordingly. The provided dataset is labelled and captures customer satisfaction scores for a one-month period. It includes various features such as category and sub-category of interaction, customer remarks, survey response date, category, item price, agent details (name, supervisor, manager), and CSAT score.
> The company has been experiencing high customer churn rates, resulting in significant revenue losses.

#### 3. business objectives

A list of business objectives which describes the customer's primary objective, from a business perspective. In addition to the primary business objective, there are typically other related business questions that the customer would like to address. For example, the primary business goal might be to keep current customers by predicting when they are prone to move to a competitor. Examples of related business questions are "How does the primary channel (e.g., ATM, visit branch, internet) a bank customer uses affect whether they stay or go?" or "Will lower ATM fees significantly reduce the number of high-value customers who leave?"

> ##### Examples:
> - Will lower ATM fees significantly reduce the number of high-value customers who leave?
> - Does the channel used affect whether customers stay or go?

#### 4. ML objectives

A list of ML objectives that describe the intended outputs of the project and enable the achievement of the business objectives.

> ##### Examples:
> Predict how many widgets a customer will buy, given their purchases over the past three years, demographic information (age, salary, city, etc.), and the price of the item.

**1. Background**

  Airbnb is a leading online marketplace that connects people looking to rent out their properties with those seeking short-term accommodations. Currently, Airbnb faces challenges in optimizing pricing strategies for different listings, considering varying demand across locations and seasons. The organization has collected substantial data on past bookings, pricing, customer reviews, and property features, which can be leveraged to enhance pricing strategies and improve occupancy rates.

**2. Business Problem**

  The primary business problem is to optimize the pricing of Airbnb listings to maximize revenue while maintaining high occupancy rates. This involves dynamically adjusting prices based on various factors, such as location, seasonality, property features, and market trends, to attract more bookings and enhance overall profitability.

**3. Business Objectives**

  The main objective of the project is to develop a dynamic pricing model that optimizes rental prices for Airbnb listings to maximize revenue and maintain high occupancy rates.

**4. ML Objectives**

  The goal is to develop a predictive model that predicts the optimal pricing for Airbnb listings based on historical booking data, property characteristics, location details, and market trends.

## Success Criteria
-------------

### tasks
- Describe the success criteria of the ML project on three different levels: the business success criteria, the ML success criteria and the economic success criteria.

### output

#### 1. Business success criteria
A list of criteria from a business point of view. For example, if an ML application is planned for a quality check in production and is supposed to outperform the current manual failure rate of 3%, the business success criterion could be derived as, e.g., "failure rate less than 3%".

> ##### Example:
> - Increase customer satisfaction ratings by 8% within the next quarter.
> - Reduce operational costs by 12% within the next year.


#### 2. ML success criteria
A list of criteria for a successful outcome to the project in technical terms, for example a certain level of predictive accuracy or a propensity to purchase profile with a given degree of "lift." As with business success criteria, it may be necessary to describe these in subjective terms, in which case the person or persons making the subjective judgment should be identified.

> ##### Example:
> - The model aims to achieve a recall rate of 95% for identifying customers who are likely to churn.
> - The model aims to achieve a precision rate of 90% for identifying high-value customers.


#### 1. Business success criteria

- Achieve at least a 15% increase in average monthly revenue per listing.
- Maintain or increase occupancy rates to at least 75% across listings.
- Aim to reduce cancellation rates by at least 10%.



#### 2. ML success criteria
- Mean Absolute Percentage Error < 0.05
- RMSE < 0.1
- R-squared values > 0.55
- Max error < 0.45
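
As a minimal sketch, the four criteria above could be checked programmatically as follows. The thresholds are taken from the list; the function names and the sample predictions are assumptions, and the source does not state on which scale the target is measured (for example, log price), so the numbers are purely illustrative.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAPE, RMSE, R-squared, and max error (y_true must not contain 0)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    max_error = max(abs(e) for e in errors)
    return {"mape": mape, "rmse": rmse, "r2": r2, "max_error": max_error}


def meets_criteria(metrics):
    """Apply the ML success criteria listed above."""
    return (
        metrics["mape"] < 0.05
        and metrics["rmse"] < 0.1
        and metrics["r2"] > 0.55
        and metrics["max_error"] < 0.45
    )


# Hypothetical targets and predictions.
metrics = regression_metrics([4.0, 5.0, 6.0], [4.05, 4.95, 6.05])
print(metrics)
print(meets_criteria(metrics))  # True
```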


## Data collection

### tasks
- Specify the data sources
- Collect the data
- Version control on the data

### output

#### 1. Data collection report
A section to describe the data sources and how you want to collect the data.

> ##### Example:
> - **Data Source:** The origin of the data, such as a database, file, or API. Example: "The data was collected from the company's customer relationship management (CRM) database."
> - **Data Type:** The format or structure of the data, such as numerical, categorical, or text. Example: "The data consists of numerical values representing customer purchase amounts and categorical values representing customer demographics."
> - **Data Size:** The quantity of data collected. Example: "The dataset contains 100,000 customer records, with 50 features each."
> - **Data Collection Method:** The process used to collect the data. Example: "The data was collected using a web scraping tool that extracted customer information from the company's website."


#### 2. Data version control report
A section to describe the data versions, what changes happened in the data, and how you back up the data. This should be done after you collect the data.

> ##### Example:
> - **Data Version:** A unique identifier for each version of the data. Example: "The current data version is v1.2, which was updated on March 15, 2023."
> - **Data Change Log:** A record of changes made to the data. Example: "The data change log shows that the customer demographics feature was updated on February 20, 2023, to include new categories."
> - **Data Backup:** A copy of the data stored for recovery in case of data loss. Example: "The company has a daily backup of the data stored on a secure server."
> - **Data Archiving:** The process of storing and managing historical data. Example: "The company archives data older than one year to a cloud storage service for long-term retention."
> - **Data Access Control:** The process of controlling who can access and modify the data. Example: "The company uses role-based access control to ensure that only authorized personnel can access and modify the data."

#### 1. Data collection report

> - **Data Source:** The dataset used for this analysis was sourced from Kaggle and contains comprehensive data on Airbnb listings. It is provided in CSV format, a common and versatile format for tabular data.
> - **Data Type:** The data consists of numerical values (the number of guests a listing accommodates, bedrooms, and prices), categorical values (room types, bed types, and property types), datetime values (date of the first review, date of the last review, and date the host started), and text features (descriptions, names, and neighbourhoods).
> - **Data Size:** The dataset comprises 74,111 rows and 29 columns.
> - **Data Collection Method:** The initial raw dataset was downloaded from Kaggle, then cleaned and preprocessed, and saved back under the same filename.


#### 2. Data version control report
> - **Data Version:** The current data version is the cleaned and preprocessed dataset, saved back under the same filename.
> - **Data Change Log:** Handled missing values using imputation strategies.
> - **Data Backup:** A copy of the data stored for recovery in case of data loss.
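
Because the cleaned data is saved under the same filename as the raw data, a content hash is one simple way to tell the versions apart. The sketch below is an assumption about how such a version record could look, not a description of the tooling used in the project; the tiny in-memory CSV samples stand in for the real files.

```python
import csv
import hashlib
import io


def describe_csv_version(raw_bytes: bytes) -> dict:
    """Return a small version record: content hash, row count, column count."""
    digest = hashlib.sha256(raw_bytes).hexdigest()
    rows = list(csv.reader(io.StringIO(raw_bytes.decode("utf-8"))))
    header, records = rows[0], rows[1:]
    return {"sha256": digest, "rows": len(records), "columns": len(header)}


# Tiny in-memory stand-ins for the raw and the cleaned file (hypothetical values).
raw = b"id,price,bedrooms\n1,120,2\n2,,1\n"
cleaned = b"id,price,bedrooms\n1,120,2\n2,120,1\n"

raw_record = describe_csv_version(raw)
cleaned_record = describe_csv_version(cleaned)
print(raw_record["rows"], raw_record["columns"])  # 2 3
print(raw_record["sha256"] != cleaned_record["sha256"])  # True
```

For a file on disk, the bytes could be obtained with `pathlib.Path(...).read_bytes()` and the resulting record appended to the change log.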

## Data quality verification

### tasks
- Describe data
- Define data requirements
- Explore the data
- Verify the data quality

### output

#### 1. Data description
A section to describe the data that has been acquired including its format, its quantity (for example, the number of records and fields in each table), the identities of the fields and any other surface features which have been discovered. Add a table of description of the data features.

> ##### Example
> - The data acquired for this project includes a dataset of 1000 records with 20 fields each. The fields include customer demographics, purchase history, and product preferences. The data is in a CSV format and is stored in a local database.
> - Add a table of description of the data features.

#### 2. Data exploration
A section to present results of your data exploration, including first findings or initial hypothesis and their impact on the remainder of the project. If appropriate you could include graphs and plots here to indicate data characteristics that suggest further examination of interesting data subsets.

> ##### Example
> - During data exploration, several interesting patterns and correlations were discovered. For example, there is a strong correlation between the age of customers and their purchase frequency. Additionally, customers who have purchased from the company in the past are more likely to make repeat purchases. These findings suggest that the data is representative of the target audience and that the company's marketing strategies are effective.
> - Add charts and figures to present the findings.

#### 3. Data requirements
A section to describe the data requirements. The requirements can be defined either on the meta-level or directly in the data, and should state the expected conditions of the data, i.e., whether a certain sample is plausible. The requirements can be, e.g., the expected feature values (a range for continuous features or a list for discrete features), the format of the data, and the maximum number of missing values. The bounds of the requirements have to be defined carefully to include all possible real-world values but discard non-plausible data. Data that does not satisfy the expected conditions could be treated as anomalies and needs to be evaluated manually or excluded automatically. To mitigate the risk of anchoring bias in this first phase, discussing the requirements with a domain expert is advised. Documentation of the data requirements could be expressed in the form of a schema with strict data types and conditions.

> ##### Example
> - The data requirements for this project are defined as follows:
> > - Customer Demographics: Age should be within the range of 18 to 100 years, and gender should be either male or female.
> > - Purchase History: The number of purchases should be greater than or equal to 0, and the total amount spent should be greater than or equal to $0.
> > - Product Preferences: The product preferences should be represented as a list of product IDs, and each product ID should be unique and within the range of 1 to 1000.
> > - Here we define expectations to satisfy the data requirements.

#### 4. Data quality verification report
A section to verify and report the quality of the data. Examine the quality of the data, addressing questions such as:
- Is the data complete (does it cover all the cases required)?
- Is it correct, or does it contain errors and, if there are errors, how common are they?
- Are there missing values in the data? If so, how are they represented, where do they occur, and how common are they?

> ##### Example
> - Completeness: The data is complete in the sense that it covers all the required cases. All customers have demographic information, and all purchases are recorded.
> - Correctness: The data appears to be correct, with no obvious errors. However, a manual review of the data is recommended to ensure that there are no errors.
> - Missing Values: There are no missing values in the data.
> - Overall, the data quality is high, and the data is suitable for analysis and modeling. However, a manual review of the data is recommended to ensure that there are no errors or anomalies.

#### **1. Data description**
Here's an overview of the dataset:
- Format: CSV
- Quantity: 74,111 records and 29 columns
- Key Features: Includes numerical, categorical, text, and datetime data types

#### **2. Data exploration**

Upon initial exploration, we identified several key points:
- Numerical Features: Some columns have a significant share of missing values.
- Categorical Features: Important features include property_type, room_type, bed_type, cancellation_policy, and city, among others. Categorical values need encoding before analysis.
- Text Features: Include free-form text such as description and amenities, which requires preprocessing for meaningful analysis.
- Datetime Features: Include dates such as first_review, last_review, and host_since, which require extraction and transformation for temporal analysis.
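
As a minimal sketch of the transformations mentioned above, the following extracts year and month from a date column and one-hot encodes a categorical column. The column names host_since and room_type come from the dataset description; the date format, the room type values, and the helper names are assumptions.

```python
from datetime import datetime


def date_parts(value, fmt="%Y-%m-%d"):
    """Split a date string into year and month; return None parts if missing."""
    if not value:
        return {"year": None, "month": None}
    parsed = datetime.strptime(value, fmt)
    return {"year": parsed.year, "month": parsed.month}


def one_hot(value, categories):
    """Encode a categorical value as 0/1 indicators over a known category list."""
    return {f"is_{c}": int(value == c) for c in categories}


# Hypothetical row; the date format and the room_type values are assumptions.
row = {"host_since": "2015-03-27", "room_type": "Private room"}
room_types = ["Entire home/apt", "Private room", "Shared room"]

features = {}
features.update({f"host_since_{k}": v for k, v in date_parts(row["host_since"]).items()})
features.update(one_hot(row["room_type"], room_types))
print(features)
```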

#### **3. Data requirements**
- thumbnail_url: Not typically used as a feature, so the column is dropped and needs no transformation.
- Expected Feature Values: Define ranges for continuous features (e.g., price, number of reviews) or lists for discrete features (e.g., property types, room types).
- Data Format: Ensure data formats adhere to specified standards, such as date formats or textual content.
- Maximum Number of Missing Values: Define thresholds for acceptable missing data across features, guiding data cleaning and imputation processes.
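
Requirements of this kind could be written down as a small schema and checked row by row. The sketch below is only an illustration: every bound, category list, and date format in it is a hypothetical placeholder, not a value agreed for this project.

```python
from datetime import datetime

# All bounds, category lists, and formats below are hypothetical placeholders.
REQUIREMENTS = {
    "price": {"min": 0.0, "max": 10_000.0},
    "room_type": {"allowed": {"Entire home/apt", "Private room", "Shared room"}},
    "first_review": {"date_format": "%Y-%m-%d"},
}


def violations(row):
    """Return a list of human-readable requirement violations for one row."""
    problems = []
    for column, rule in REQUIREMENTS.items():
        value = row.get(column)
        if value is None or value == "":
            continue  # Missing values are skipped here and counted separately.
        if "min" in rule and not rule["min"] <= float(value) <= rule["max"]:
            problems.append(f"{column}={value} outside [{rule['min']}, {rule['max']}]")
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"{column}={value!r} not in allowed list")
        if "date_format" in rule:
            try:
                datetime.strptime(value, rule["date_format"])
            except ValueError:
                problems.append(f"{column}={value!r} does not match {rule['date_format']}")
    return problems


good = {"price": "120", "room_type": "Private room", "first_review": "2016-06-18"}
bad = {"price": "-5", "room_type": "Castle", "first_review": "18/06/2016"}
print(violations(good))  # []
print(len(violations(bad)))  # 3
```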

#### **4. Data quality verification report**
The data quality verification process evaluates the integrity and reliability of the dataset through various analyses:
- Completeness: Missing data statistics per column were examined, revealing areas requiring imputation or further investigation.
- Accuracy: Quality checks revealed discrepancies that impact data reliability, particularly in numerical and categorical features.
- Missing Values: Strategies like median imputation and default values were used to address missing data systematically.

Initial exploration revealed significant missing data across several columns, necessitating robust handling strategies.
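
A minimal sketch of the two steps mentioned in this report, per-column missing data statistics and median imputation, is shown below. The sample rows and helper names are assumptions made for illustration; `None` marks a missing value.

```python
from statistics import median


def missing_share(rows, column):
    """Fraction of rows in which the column is None."""
    return sum(1 for r in rows if r[column] is None) / len(rows)


def impute_median(rows, column):
    """Replace None in a numeric column with the median of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


# Hypothetical sample rows.
rows = [
    {"bedrooms": 1, "accommodates": 2},
    {"bedrooms": None, "accommodates": 4},
    {"bedrooms": 3, "accommodates": 6},
    {"bedrooms": 2, "accommodates": 3},
]

print(missing_share(rows, "bedrooms"))  # 0.25
imputed = impute_median(rows, "bedrooms")
print(imputed[1]["bedrooms"])  # 2
```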

## Project feasibility
-------------
This task involves more detailed fact-finding about all of the resources, constraints, assumptions, and other factors that should be considered in determining the data analysis goal and project plan. In the previous task, your objective was to quickly get to the crux of the situation. Here, you want to flesh out the details.

### tasks
- Assess the project feasibility
- Create POC (Proof-of-concept) model

### output

#### 1. inventory of resources

List the resources available to the project, including: personnel (business experts, data experts, technical support, machine learning personnel), data (fixed extracts, access to live warehoused or operational data), computing resources (hardware platforms) and software (machine learning tools, other relevant software).

#### 2. Requirements, assumptions and constraints

List all requirements of the project, including the schedule of completion, the comprehensibility and quality of results, and security as well as legal issues. As part of this output, make sure that you are allowed to use the data. List the assumptions made by the project.

These may be assumptions about the data that can be checked during machine learning, but may also include non-checkable assumptions about the business upon which the project rests. It is particularly important to list the latter if they form conditions on the validity of the results.

List the constraints on the project. These may be constraints on the availability of resources, but may also include technological constraints such as the size of data that it is practical to use for modeling.

#### 3. Risks and contingencies

List the risks or events that might occur to delay the project or cause it to fail. List the corresponding contingency plans; what action will be taken if the risks happen.


#### 4. Costs and benefits

Construct a cost-benefit analysis for the project, which compares the costs of the project with the potential benefit to the business if it is successful. The comparison should be as specific as possible.

![](https://i.imgur.com/XU2lghc.png)

#### 5. Feasibility report

Build a POC ML model and explain as a team whether it is feasible to do this ML project or not. If not, then you need to find another business problem. The key factors here are related to data availability, quality, costs, and the nature of the business problem.

#### 1. inventory of resources

- Personnel: We have a Data Engineer responsible for data transformation, data analysis, and model training; a Data Scientist responsible for business understanding, data analysis, and model evaluation; and an ML Engineer responsible for data preparation, pipelines, and CI/CD.
- Data: Dataset sourced from Kaggle, which includes 74,111 rows and 29 columns with various features such as numerical, categorical, text, and datetime data.
- Computing Resources: Access to cloud computing resources such as Google Colab; most of the work is done on local machines.
- Machine Learning Tools: Python libraries (e.g., scikit-learn, TensorFlow, Keras), Jupyter notebooks for development and analysis. Data visualization tools (e.g., Matplotlib, Seaborn), data processing libraries (e.g., Pandas, NumPy).

#### 2. Requirements, assumptions and constraints

- Schedule of Completion: 4-5 weeks for the entire project.
- Comprehensibility and Quality of Results: Results should be interpretable and provide actionable insights for pricing optimization.
- Usability of Data: Data is clean, relevant, and well-preprocessed for the model.
- Data Quality: The data is representative and sufficiently comprehensive for building a reliable model.
- Business Context: The factors influencing Airbnb pricing identified in the dataset are consistent with real-world scenarios.
- Resource Availability: Limited.
- Time Constraints: Limited to 4-5 weeks, requiring efficient project management and prioritization of tasks.
- Data Size: The dataset is fixed in size (74,111 rows), so scalability testing might be limited.

#### 3. Risks and contingencies

- Data Quality Issues: Missing or inaccurate data could lead to poor model performance.
- Resource Limitations: Limited access to high-performance computing resources could delay processing and model training.
- Model Performance: The model may not achieve the desired accuracy or may overfit/underfit the data.
- Team Coordination: Coordination issues among team members could delay the project.

Our contingency plans are:

- Data Quality: Implement robust data preprocessing techniques and validate the data thoroughly before model training.
- Resource Limitations: Utilize cloud resources.
- Model Performance: Iterate on model tuning, consider alternative algorithms, and perform extensive cross-validation to ensure robustness.
- Team Coordination: Regular team meetings and clear task assignments to ensure smooth progress and timely completion.


#### 4. Costs and benefits

- Time Investment: Significant time commitment from all team members over 4-5 weeks.
- Resource Usage: Open cloud and local tools were used.
- Software Tools: Open cloud and local tools were used.

Let's imagine we are doing a real project outside the university.
The total personnel cost would be 300,000 RUB for our team of 3 people.

We use existing hardware at no additional cost. Assuming we pay for cloud resources: 10,000 RUB.

Internet and Electricity: 2,000 RUB.
Administrative and Documentation: 3,000 RUB.

Total Project Cost:
- Personnel: 300,000 RUB
- Cloud Computing: 10,000 RUB
- Internet, electricity, etc: 5,000 RUB

Total: 315,000 RUB

Benefits:
1. Skill Development (Enhanced Expertise): Value of acquiring advanced data science and machine learning skills. Estimated value: 50,000 RUB per person, 150,000 RUB in total for 3 team members.
2. Project Output: A functional model that can be included in portfolios, enhancing employability. Estimated value: 100,000 RUB.
3. Academic Achievement (University Project Completion): Contribution to academic success, with potential for awards or recognition. Estimated value: 50,000 RUB.
4. Business Insight (Practical Insights for Airbnb Hosts): Potential to improve pricing strategies, increasing rental income. Estimated value: 100,000 RUB (considering potential increase in income over time).

Total Project Benefits:
- Skill Development: 150,000 RUB
- Project Output: 100,000 RUB
- Academic Achievement: 50,000 RUB
- Business Insight: 100,000 RUB

Total: 400,000 RUB


This analysis shows that the project is feasible and beneficial, with a positive net benefit of 85,000 RUB (400,000 RUB in benefits against 315,000 RUB in costs). The detailed cost and benefit estimation provides a clear understanding of the project's value from both educational and practical perspectives.
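
The arithmetic behind this conclusion can be reproduced with a few lines of Python. The figures are the estimates listed above; the dictionary keys are just labels chosen for this sketch.

```python
# Estimated costs and benefits in RUB, as listed in this section.
costs_rub = {
    "personnel": 300_000,
    "cloud_computing": 10_000,
    "internet_electricity_admin": 5_000,
}
benefits_rub = {
    "skill_development": 150_000,
    "project_output": 100_000,
    "academic_achievement": 50_000,
    "business_insight": 100_000,
}

total_costs = sum(costs_rub.values())
total_benefits = sum(benefits_rub.values())
net_benefit = total_benefits - total_costs
benefit_cost_ratio = total_benefits / total_costs

print(total_costs, total_benefits)  # 315000 400000
print(net_benefit)  # 85000
print(round(benefit_cost_ratio, 2))  # 1.27
```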

![](https://i.imgur.com/XU2lghc.png)

#### 5. Feasibility report

POC model built:

- Model: Multi-layer Perceptron (MLP) Regressor
- Configuration: hidden layers = (100, 50), max iterations = 500, learning rate = 1e-3

Performance:

- Training: MSE = 3.691, coefficient of determination (R²) = 0.528
- Testing: MSE = 3.691, coefficient of determination (R²) = 0.525

The model explains about 53% of the variance in the target variable.

Model Performance: Moderate. While not ideal, the model is capturing some of the variance in the data; the test R² of 0.525 is slightly below the R-squared > 0.55 target set in the ML success criteria.

Recommendations:

- Hyperparameter Tuning: Further adjust parameters for improved performance.
- Feature Engineering: Enhance features for better model input.
- Alternative Models: Test other algorithms to compare effectiveness.
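
For reference, a POC with this configuration could be set up in scikit-learn roughly as sketched below. This is not the project's actual training code: the data is synthetic, the train/test split and random seeds are assumptions, and the printed metrics will therefore differ from the numbers reported above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in data: 500 rows, 5 numeric features, noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Configuration as reported: two hidden layers, 500 iterations, learning rate 1e-3.
model = MLPRegressor(
    hidden_layer_sizes=(100, 50),
    max_iter=500,
    learning_rate_init=1e-3,
    random_state=0,
)
model.fit(X_train, y_train)

# Report MSE and R-squared on both splits to spot over- or underfitting.
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(X_part)
    print(name, "MSE:", mean_squared_error(y_part, pred), "R2:", r2_score(y_part, pred))
```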

## produce project plan
----------------

### task

Describe the intended plan for achieving the machine learning goals and thereby achieving the business goals. The plan should specify the anticipated set of steps to be performed during the rest of the project including an initial selection of tools and techniques.

### output

#### 1. Project plan
- List the stages to be executed in the project, together with their duration, resources required, inputs, outputs, and dependencies. Where possible, try and make explicit the large-scale iterations in the machine learning process, for example, repetitions of the modeling and evaluation phases. As part of the project plan, it is also important to analyze dependencies between time schedule and risks. Mark results of these analyses explicitly in the project plan, ideally with actions and recommendations if the risks are manifested. Decide at this point which evaluation strategy will be used in the evaluation phase. Your project plan will be a dynamic document. At the end of each phase you’ll review progress and achievements and update the project plan accordingly. Specific review points for these updates should be part of the project plan.

- Build a Gantt chart for the project tasks and phases using an online platform such as TeamGantt, Jira, GoodDay, or Trello.

- Add all of your team members and make a preliminary assignment of tasks to them. Then check the progress daily.

#### 2. ML project Canvas
At the end of the first phase, you should create a canvas for the project as a summary of this phase.


> ##### Example
> Follow the link: https://github.com/louisdorard/machine-learning-canvas/blob/master/churn.pdf


In the project we divided tasks according to our roles and skills:

Andreas: data transformation, data analysis, model training

Zukhra: business understanding, data analysis, model evaluation

Amir: data preparation, data versioning, application build, CI/CD configuration

You can view the detailed plan and its completion status at this [link](https://docs.google.com/spreadsheets/d/16rn_4RmgNRX4sfpjhVMVTUwoDpTVk2LV-2Xcvrw7c9I/edit?gid=0#gid=0).