# business and data understanding
------------

The initial phase is concerned with defining the business objectives and translating them into ML objectives, collecting the data and verifying its quality, and finally assessing the project feasibility.

![](https://i.imgur.com/55J7fBc.jpeg)

## Terminology

### tasks

Compile a glossary of terminology relevant to the project. This may include two components:
(1) A glossary of relevant business terminology, which forms part of the business understanding available to the project. Constructing this glossary is a useful "knowledge elicitation" and education exercise.
(2) A glossary of machine learning terminology, illustrated with examples relevant to the business problem in question.

### output

#### 1. Business terminology
A table or paragraph containing all the business-related terms to be used in the project.

> ##### Example:
> - Key Performance Indicators (KPIs): Quantifiable metrics used to measure the performance of a business or organization. Example: "The company's KPIs include revenue growth, customer satisfaction, and employee retention."
> - Market Segmentation: Dividing a market into distinct groups based on demographics, needs, or preferences. Example: "The company segmented its target market into young professionals, families, and retirees to tailor marketing strategies."
> - Return on Investment (ROI): The return or profit generated by an investment compared to its cost. Example: "The company calculated a 20% ROI on its new marketing campaign, indicating a successful investment."

#### 2. ML terminology
A table or paragraph containing all the ML-related terms to be used in the project.


> ##### Example:
> Underfitting: When a machine learning model is too simple and fails to capture the underlying patterns in the data. Example: "The company's initial model was underfitting the data, so they added more features and layers to improve accuracy."

### TODO
#### 1. Business terminology

| Term                             | Description                                                                                                                                                                                      |
|----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Appraisal                    | The process or result of estimating the value of a property by an independent, qualified professional. Example: "The appraisal came in higher than the listing price, encouraging the seller."   |
| Comparative Market Analysis (CMA) | An evaluation of similar, recently sold homes in the same area to determine a competitive market price. Example: "The agent performed a CMA to suggest a listing price."                         |
| Fair Market Value (FMV)      | The price that a property would sell for on the open market. Example: "The FMV was determined based on recent sales data."                                                                       |
| Loan-to-Value Ratio (LTV)    | A ratio used by lenders to determine the risk of a loan by comparing the loan amount to the appraised value of the property. Example: "The bank required an LTV of 80% to approve the mortgage." |
| Multiple Listing Service (MLS)| A database used by real estate agents to share information about properties for sale. Example: "The property was listed on the MLS to reach a broader audience."                                 |
| Real Estate Investment Trust (REIT) | A company that owns, operates, or finances income-generating real estate. Example: "Investing in a REIT allowed for portfolio diversification."                                                  |
| Capitalization Rate (Cap Rate) | A metric used to evaluate the potential return on a real estate investment. Example: "The investor looked at the cap rate to decide on the property purchase."                                   |
| Depreciation                 | The decrease in the value of a property over time due to wear and tear. Example: "Depreciation was factored into the property's long-term value assessment."                                     |
| Equity                       | The difference between the market value of a property and the amount owed on the mortgage. Example: "Building equity was a key reason for purchasing a home."                                    |
| Mortgage                     | A loan used to purchase real estate, secured by the property itself. Example: "The couple secured a 30-year mortgage at a fixed interest rate."                                                  |

#### 2. ML terminology
| Term                             | Description |
|----------------------------------|-------------|
| Feature Engineering          | The process of selecting, modifying, or creating new variables to improve the performance of a machine learning model. Example: "Feature engineering involved creating new variables from the raw data, such as average neighborhood price." |
| Regression                   | A type of predictive modeling technique that estimates the relationships among variables. Example: "Linear regression was used to predict house prices based on features like square footage and location." |
| Overfitting                  | When a model learns not only the underlying pattern but also the noise in the training data, resulting in poor performance on new data. Example: "The model was overfitting, so we applied regularization techniques." |
| Hyperparameters              | Configurations that are external to the model and whose values cannot be estimated from the data. Example: "Tuning hyperparameters like learning rate and tree depth improved the model's accuracy." |
| Cross-Validation             | A technique for assessing how a model will generalize to an independent dataset, typically by partitioning the data into training and testing sets multiple times. Example: "Cross-validation showed that the model performed consistently across different subsets of the data." |
| Random Forest                | An ensemble learning method that constructs multiple decision trees and merges their results to improve predictive accuracy and control overfitting. Example: "The random forest model provided robust house price predictions." |
| Mean Absolute Error (MAE)    | A metric used to measure the accuracy of a model by averaging the absolute differences between predicted and actual values. Example: "The MAE of the model was low, indicating accurate predictions." |
| Mean Squared Error (MSE)     | A metric used to measure the average squared differences between predicted and actual values, giving more weight to large errors.|
| Gradient Boosting            | An ensemble technique that builds a model in a stage-wise fashion by combining weak learners to form a strong learner. Example: "Gradient boosting improved the prediction accuracy for complex datasets." |
| Training Set                 | The portion of the dataset used to train the model. Example: "80% of the data was used as the training set to build the model." |
| Test Set                     | The portion of the dataset used to evaluate the model's performance. Example: "The remaining 20% of the data was used as the test set to validate the model." |
| Data Preprocessing           | The process of cleaning and preparing raw data for modeling, including handling missing values and scaling features. Example: "Data preprocessing involved normalizing the features to improve model performance." |
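
As a minimal illustration of the two error metrics defined in the table above, the sketch below computes MAE and MSE in plain Python. The function names and the sample prices are hypothetical and are not taken from the project data.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all samples."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of (actual - predicted)^2; large errors weigh more."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical prices, for illustration only.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
```

The single error of 30 contributes 900 to the squared sum but only 30 to the absolute sum, which is why MSE reacts more strongly to a few badly mispriced properties.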


## Scope of the project
----------

### tasks
- Explore the background of the business.
- Define business problem
- Define business objectives
- Translate business objectives into ML objectives

The objective here is to thoroughly understand, from a business perspective, what the client really wants to accomplish. Often the client has many competing objectives and constraints that must be properly balanced. The goal is to uncover important factors, at the beginning, that can influence the outcome of the project. A possible consequence of neglecting this step is to expend a great deal of effort producing the right answers to the wrong questions.

### output

#### 1. background

A short paragraph to record the information that is known about the organization's business situation at the beginning of the project.

> ##### Example:
> Shopzilla is an e-commerce platform. They have branches in different cities. They have a mobile app which allows clients to buy products. They profile the purchase history of the users and use this data for building AI models.

#### 2. business problem
A short paragraph to describe the business problem.

> ##### Example:
> The business problem is that the business stakeholders want to follow a customer-centric sales methodology. They want to predict customer satisfaction with their services so that they can adapt their strategies accordingly. The provided dataset is labelled and captures customer satisfaction scores for a one-month period. It includes various features such as category and sub-category of interaction, customer remarks, survey response date, category, item price, agent details (name, supervisor, manager), CSAT score, etc.
> The company has been experiencing high customer churn rates, resulting in significant revenue losses.

#### 3. business objectives

A list of business objectives which describes the customer's primary objective, from a business perspective. In addition to the primary business objective, there are typically other related business questions that the customer would like to address. For example, the primary business goal might be to keep current customers by predicting when they are prone to move to a competitor. Examples of related business questions are "How does the primary channel (e.g., ATM, visit branch, internet) a bank customer uses affect whether they stay or go?" or "Will lower ATM fees significantly reduce the number of high-value customers who leave?"

> ##### Examples:
> - Will lower ATM fees significantly reduce the number of high-value customers who leave?
> - Does the channel used affect whether customers stay or go?

#### 4. ML objectives

A list of ML objectives which describes the intended outputs of the project that enable the achievement of the business objectives.

> ##### Examples:
> Predict how many widgets a customer will buy, given their purchases over the past three years, demographic information (age, salary, city, etc.), and the price of the item.

### TODO
#### 1. background
Zameen.com is Pakistan's leading website for posting ads about houses and properties. The platform operates across various cities, facilitating the buying, selling, and renting of properties. Zameen.com maintains a rich dataset comprising property listings, transaction histories, and user interactions. The company aims to utilize this data to improve the accuracy of house price predictions and enhance the overall user experience on the platform.

#### 2. business problem
Real estate businesses, such as Zameen.com, and real estate agents may struggle with accurately pricing houses, causing revenue loss, customer dissatisfaction, and competitive disadvantage. We propose developing a machine learning model to automate the evaluation of house prices using diverse data sources. This solution will enhance pricing accuracy, build customer trust, and streamline operations, ultimately optimizing pricing strategies and maximizing profits through precise, data-driven property valuations.

#### 3. business objectives
- Primary Objective: Improve the accuracy of house price predictions on the Zameen.com platform in order to:
  - Enhance customer satisfaction by offering fair and competitive property prices.
  - Reduce the time properties remain on the market by providing accurate initial pricing.
  - Increase revenue through optimized pricing strategies.

#### 4. ML objectives
- Predict house prices based on property type, location, city, province name, coordinates, number of baths and bedrooms, and other relevant features.

## Success Criteria
-------------

### tasks
- Describe the success criteria of the ML project on three different levels: the business success criteria, the ML success criteria and the economic success criteria.

### output

#### 1. Business success criteria
A list of criteria from a business point of view. For example, if an ML application is planned for a quality check in production and is supposed to outperform the current manual failure rate of 3%, the business success criterion could be derived as, e.g., "failure rate less than 3%".

> ##### Example:
> - Increase customer satisfaction ratings by 8% within the next quarter.
> - Reduce operational costs by 12% within the next year.


#### 2. ML success criteria
A list of criteria for a successful outcome to the project in technical terms, for example a certain level of predictive accuracy or a propensity to purchase profile with a given degree of "lift." As with business success criteria, it may be necessary to describe these in subjective terms, in which case the person or persons making the subjective judgment should be identified.

> ##### Example:
> - The model aims to achieve a recall rate of 95% for identifying customers who are likely to churn.
> - The model aims to achieve a precision rate of 90% for identifying high-value customers.

### TODO
#### 1. Business Success Criteria

- **Customer Satisfaction**: Increase customer satisfaction ratings by 10% within the next two quarters by providing more accurate and fair pricing.
- **Market Competitiveness**: Reduce the average time a property stays on the market by 20% through more precise initial pricing.
- **User Engagement**: Improve user engagement on the platform by 15% through enhanced pricing tools and features.
- **Revenue Growth**: Increase revenue from property listings by 15% over the next year by optimizing pricing strategies.

#### 2. ML Success Criteria

- **Predictive Accuracy**: Achieve an MAE of less than 5% (measured relative to the average property price, since MAE itself is expressed in price units) for the house price prediction model.
- **Feature Importance**: Accurately identify and validate the top 5 features that most significantly influence house prices.
- **Generalization**: The model should perform consistently with less than a 2% drop in MAE when applied to new, unseen data.
- **Scalability**: The model should be able to handle real-time predictions with response times under 1 second.
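
The accuracy and generalization criteria above could be checked automatically along the following lines. This is only a sketch under two assumptions that the criteria do not state explicitly: the 5% threshold is read as MAE divided by the mean actual price, and the 2% "drop" is read as the relative increase of that error on unseen data. All function and parameter names are hypothetical.

```python
def relative_mae(actual, predicted):
    """MAE divided by the mean actual price (assumed reading of 'MAE < 5%')."""
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    return mae / (sum(actual) / len(actual))


def meets_ml_criteria(val_rel_mae, test_rel_mae, max_rel_mae=0.05, max_drop=0.02):
    """Check the accuracy and generalization thresholds.

    'Drop' is interpreted as the relative increase of the error
    when moving from validation data to unseen test data.
    """
    accurate = test_rel_mae < max_rel_mae
    degradation = (test_rel_mae - val_rel_mae) / val_rel_mae
    return accurate and degradation < max_drop
```

For example, a validation error of 4.0% followed by a test error of 4.5% would pass the accuracy threshold but fail the generalization check, because the error grew by 12.5% in relative terms.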

#### 3. Economic Success Criteria

- **Cost Efficiency**: Reduce the costs associated with manual property appraisals by 25% within the next year through automation.
- **Return on Investment (ROI)**: Achieve an ROI of at least 20% from the implementation of the ML model within the first year.
- **Operational Savings**: Realize operational savings of 15% by streamlining the property valuation process with the ML model.
- **Profit Margin**: Increase the overall profit margin by 10% by reducing pricing errors and optimizing sales processes.
- **Long-Term Viability**: Ensure the model's maintenance and updating costs remain below 5% of the overall operational budget for the valuation process.

## Data collection

### tasks
- Specify the data sources
- Collect the data
- Version control on the data

### output

#### 1. Data collection report
A section to describe the data sources and how you want to collect the data.

> ##### Example:
> - **Data Source:** The origin of the data, such as a database, file, or API. Example: "The data was collected from the company's customer relationship management (CRM) database."
> - **Data Type:** The format or structure of the data, such as numerical, categorical, or text. Example: "The data consists of numerical values representing customer purchase amounts and categorical values representing customer demographics."
> - **Data Size:** The quantity of data collected. Example: "The dataset contains 100,000 customer records, with 50 features each."
> - **Data Collection Method:** The process used to collect the data. Example: "The data was collected using a web scraping tool that extracted customer information from the company's website."


#### 2. Data version control report
A section to describe the data versions, what changes happened in the data, and how you back up the data. This should be done after you collect the data.

> ##### Example:
> - **Data Version:** A unique identifier for each version of the data. Example: "The current data version is v1.2, which was updated on March 15, 2023."
> - **Data Change Log:** A record of changes made to the data. Example: "The data change log shows that the customer demographics feature was updated on February 20, 2023, to include new categories."
> - **Data Backup:** A copy of the data stored for recovery in case of data loss. Example: "The company has a daily backup of the data stored on a secure server."
> - **Data Archiving:** The process of storing and managing historical data. Example: "The company archives data older than one year to a cloud storage service for long-term retention."
> - **Data Access Control:** The process of controlling who can access and modify the data. Example: "The company uses role-based access control to ensure that only authorized personnel can access and modify the data."

### TODO
#### 1. Data Collection Report

**Data Source**:
The data for this project is sourced from a publicly available dataset on Kaggle. The dataset is provided by Zameen.com, Pakistan's leading real estate website, and contains information on house prices and related features.

**Data Link**:
[Zameen.com House Price Prediction Dataset](https://www.kaggle.com/datasets/howisusmanali/house-price-prediction-zameencom-dataset)

**Data Type**:
The dataset includes a mix of numerical, categorical, text, and time-based data:
- **Numerical**: price, property_id, location_id, latitude, longitude, etc.
- **Categorical/Text**: property_type, location, city, province_name, etc.
- **Time-Based**: date_added.

**Data Size**:
- **Number of Rows**: 168,446
- **Number of Columns**: 20

**Data Collection Method**:
The dataset was downloaded directly from the Kaggle platform as a structured CSV file. Kaggle does not guarantee that the data is clean, so its quality is checked in the data quality verification section before analysis and modeling.

#### 2. Data Version Control Report

**Data Version**:
- Current data version: **v1.0**

**Data Change Log**:
- **v1.0**: Initial dataset downloaded on [Insert Date], includes 168,446 rows and 20 columns.
- Any updates to the dataset, such as corrections or new data additions, will be logged here with corresponding dates.
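
One lightweight way to make each change-log entry verifiable is to record a checksum of the dataset file next to the version label. The sketch below uses only the Python standard library; the function names and the record layout are assumptions for illustration, not an established convention of this project.

```python
import hashlib
from datetime import date


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def change_log_entry(path, version, note):
    """Build one change-log record for a dataset file."""
    return {
        "version": version,
        "date": date.today().isoformat(),
        "sha256": file_sha256(path),
        "note": note,
    }
```

Calling `change_log_entry` on the downloaded CSV with version `"v1.0"` would produce a record whose checksum changes whenever the file content changes, so a silent modification of the data can be detected later.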


# TODO: add other sections

## Data quality verification

### tasks
- Describe data
- Define data requirements
- Explore the data
- Verify the data quality

### output

#### 1. Data description
A section to describe the data that has been acquired including its format, its quantity (for example, the number of records and fields in each table), the identities of the fields and any other surface features which have been discovered. Add a table of description of the data features.

> ##### Example
> - The data acquired for this project includes a dataset of 1000 records with 20 fields each. The fields include customer demographics, purchase history, and product preferences. The data is in a CSV format and is stored in a local database.
> - Add a table of description of the data features.

#### 2. Data exploration
A section to present results of your data exploration, including first findings or initial hypothesis and their impact on the remainder of the project. If appropriate you could include graphs and plots here to indicate data characteristics that suggest further examination of interesting data subsets.

> ##### Example
> - During data exploration, several interesting patterns and correlations were discovered. For example, there is a strong correlation between the age of customers and their purchase frequency. Additionally, customers who have purchased from the company in the past are more likely to make repeat purchases. These findings suggest that the data is representative of the target audience and that the company's marketing strategies are effective.
> - Add charts and figures to present the findings.

#### 3. Data requirements
A section to describe the data requirements. The requirements can be defined either on the meta-level or directly in the data, and should state the expected conditions of the data, i.e., whether a certain sample is plausible. The requirements can be, e.g., the expected feature values (a range for continuous features or a list for discrete features), the format of the data and the maximum number of missing values. The bounds of the requirements have to be defined carefully to include all possible real-world values but discard non-plausible data. Data that does not satisfy the expected conditions could be treated as anomalies and needs to be evaluated manually or excluded automatically. To mitigate the risk of anchoring bias in this first phase, discussing the requirements with a domain expert is advised. Documentation of the data requirements could be expressed in the form of a schema with strict data types and conditions.

> ##### Example
> - The data requirements for this project are defined as follows:
> > - Customer Demographics: Age should be within the range of 18 to 100 years, and gender should be either male or female.
> > - Purchase History: The number of purchases should be greater than or equal to 0, and the total amount spent should be greater than or equal to $0.
> > - Product Preferences: The product preferences should be represented as a list of product IDs, and each product ID should be unique and within the range of 1 to 1000.
> > - Here we define expectations to satisfy the data requirements.

#### 4. Data quality verification report
A section to verify and report the quality of the data. Examine the quality of the data, addressing questions such as:
- Is the data complete (does it cover all the cases required)?
- Is it correct, or does it contain errors and, if there are errors, how common are they?
- Are there missing values in the data? If so, how are they represented, where do they occur, and how common are they?

> ##### Example
> - Completeness: The data is complete in the sense that it covers all the required cases. All customers have demographic information, and all purchases are recorded.
> - Correctness: The data appears to be correct, with no obvious errors. However, a manual review of the data is recommended to ensure that there are no errors.
> - Missing Values: There are no missing values in the data.
> - Overall, the data quality is high, and the data is suitable for analysis and modeling. However, a manual review of the data is recommended to ensure that there are no errors or anomalies.

# TODO: complete this part
### 1. Data description
The data acquired for this project consists of a dataset of 168,446 records with 20 fields each. The fields include the link to the ad, the type of property, price, location, number of baths and bedrooms, and related data. The data is in CSV format and is stored in Google Drive.

| feature       | description                                                      |
|---------------|------------------------------------------------------------------|
| property_id   | unique id for each property                                      |
| location_id   | unique id for each location (not unique for each property)       |
| page_url      | url to the ad on zameen.com                                      |
| property_type | type of the property (FarmHouse, Flat, etc.)                     |
| price         | price of the property                                            |
| location      | location name in each city                                       |
| city          | name of the city                                                 |
| province_name | name of province (like state) of the city                        |
| latitude      | latitude in degrees (part of coordinates of a property)          |
| longitude     | longitude in degrees (another part of coordinates of a property) |
| baths         | number of baths in a property                                    |
| area          | area either in Marla or Kanal (e.g. "4 Marla")                   |
| purpose       | either "For Sale" or "For Rent"                                  |
| bedrooms      | number of bedrooms in a property                                 |
| date_added    | date when the ad was added                                       |
| agency        | real estate agency that helps customers find a property          |
| agent         | either individual agent or agent from some agency                |
| Area Type     | either "Marla" or "Kanal" (1 Kanal = 20 Marla)                   |
| Area Size     | value of area size either in Marla or Kanal (e.g. "5")           |
| Area Category | categorically encoded area (e.g. "0-5 Marla")                    |
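
Because the `area` field mixes two units, it can help to normalize everything to Marla before comparing sizes. The sketch below applies the conversion factor from the table (1 Kanal = 20 Marla); the function name is hypothetical and the parsing assumes strings of the form shown in the table, such as "4 Marla".

```python
MARLA_PER_KANAL = 20  # conversion factor stated in the feature table


def area_in_marla(area):
    """Convert a string such as '4 Marla' or '1.5 Kanal' to Marla."""
    size_text, unit = area.strip().split(maxsplit=1)
    size = float(size_text.replace(",", ""))
    unit = unit.strip().lower()
    if unit == "marla":
        return size
    if unit == "kanal":
        return size * MARLA_PER_KANAL
    raise ValueError(f"Unknown area unit: {unit!r}")
```

With this helper, "2 Kanal" and "40 Marla" map to the same numeric value, so area can be used as a single continuous feature.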

### 2. Data exploration
During data exploration, the following findings were discovered:
- There are outliers in the price (e.g., extremely high or low prices) that may need to be addressed.
- There is an ad with coordinates outside Pakistan, which may need to be removed.
- There are null values only in the 'agent' and 'agency' columns, which just means that an ad was posted by an individual without an agency. So we can say that there are no missing values in the data.
- There is strong correlation between number of baths and number of bedrooms, which is expected.
- Number of baths and bedrooms are highly correlated with price, indicating their importance in predicting house prices.
- Area size and price are also correlated, suggesting that larger properties tend to have higher prices.
- Average prices differ substantially across property types and cities, so the property type and city can be used as features for the model.

#### Correlation Matrix of numerical features:
![Correlation matrix of numerical features](./data/output.png)

#### Average prices per city:
| city       | average price |
|------------|---------------|
| Islamabad  | 10,996,388.21 |
| Lahore     | 20,294,003.59 |
| Faisalabad | 7,244,967.35  |
| Rawalpindi | 8,789,136.46  |
| Karachi    | 18,173,945.75 |

#### Average prices per property_type:
| property_type | average price |
|---------------|---------------|
| Flat          | 8,507,796.62  |
| House         | 20,340,713.91 |
| Farm House    | 63,248,551.02 |
| Lower Portion | 1,567,718.39  |
| Upper Portion | 2,509,633.61  |
| Penthouse     | 15,017,850.88 |
| Room          | 258,742.91    |
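
Group averages like the ones listed above can be reproduced with a few lines of standard-library Python. This is a sketch, not the code that produced the reported numbers; the helper name and the three sample rows are hypothetical, and real rows could come from `csv.DictReader` applied to the dataset file.

```python
from collections import defaultdict


def average_price_by(rows, key):
    """Return {group: mean price} for dict rows grouped by `key`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += float(row["price"])
        counts[row[key]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical rows, for illustration only.
rows = [
    {"city": "Lahore", "property_type": "House", "price": "300"},
    {"city": "Lahore", "property_type": "Flat", "price": "100"},
    {"city": "Karachi", "property_type": "Flat", "price": "200"},
]
by_city = average_price_by(rows, "city")
by_type = average_price_by(rows, "property_type")
```

Note that a plain mean is sensitive to the price outliers mentioned in the findings, so a median per group may be a more robust summary.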

### 3. Data requirements
The data requirements for this project are defined as follows:
- Null values are allowed only in 'agent' and 'agency' columns.
- **Property ID**: Unique identifier for each property.
- **Location ID**: Unique identifier for each location.
- **Latitude**: Should be within the range of 22 to 38 degrees.
- **Longitude**: Should be within the range of 59 to 79 degrees.
- **Price**: Should be greater than or equal to 0.
- **Baths**: Number of baths should be greater than or equal to 0.
- **Bedrooms**: Number of bedrooms should be greater than or equal to 0.
- **Area Size**: Value of area size should be greater than 0.
- **Area Category**: Should be one of the predefined categories (e.g., "0-5 Marla").
- **Area Type**: Should be either "Marla" or "Kanal".
- **Purpose**: Should be either "For Sale" or "For Rent".
- **Date Added**: Should be in a date format YEAR-MONTH-DAY (e.g., 2023-03-15).
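
The requirements above can be expressed as a small executable schema. The sketch below checks one listing, given as a dictionary keyed by the column names from the data description; `CHECKS` and `validate_row` are hypothetical names. Uniqueness of `property_id` and the list of valid area categories are not covered, because they need the whole dataset or a category list that is not spelled out here.

```python
from datetime import datetime

# (column, predicate) pairs mirroring the listed requirements.
CHECKS = [
    ("latitude", lambda v: 22 <= float(v) <= 38),
    ("longitude", lambda v: 59 <= float(v) <= 79),
    ("price", lambda v: float(v) >= 0),
    ("baths", lambda v: float(v) >= 0),
    ("bedrooms", lambda v: float(v) >= 0),
    ("Area Size", lambda v: float(v) > 0),
    ("Area Type", lambda v: v in {"Marla", "Kanal"}),
    ("purpose", lambda v: v in {"For Sale", "For Rent"}),
    ("date_added", lambda v: datetime.strptime(v, "%Y-%m-%d") is not None),
]


def validate_row(row):
    """Return a list of requirement violations for one listing."""
    errors = []
    for column, is_valid in CHECKS:
        value = row.get(column)
        if value in (None, ""):
            errors.append(f"{column}: missing value")
            continue
        try:
            ok = is_valid(value)
        except (TypeError, ValueError):
            ok = False  # unparsable number or date
        if not ok:
            errors.append(f"{column}: invalid value {value!r}")
    return errors
```

Rows that return a non-empty list can be routed to manual review or excluded automatically. The `agent` and `agency` columns are deliberately absent from `CHECKS`, since null values are allowed there.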

### 4. Data quality verification report
- **Completeness**: The data is complete in the sense that it covers all the required cases. All properties have the necessary information.
- **Correctness**: The data contains rare outliers but no obvious errors. For example, there is 1 ad among ~34,000 values with coordinates outside Pakistan, which may need to be removed.
- **Missing Values**: There are no missing values in the data except in the 'agent' and 'agency' columns, where they are expected.
- Overall, the data quality is good and the data is suitable for analysis and modeling. However, further cleaning and preprocessing may be required to address outliers and ensure data consistency.
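
The missing-value statement in this report can be backed by a per-column count. The following sketch counts empty cells across rows given as dictionaries; the function name is hypothetical and the approach assumes missing values appear as `None` or empty strings, which may differ from how the actual CSV encodes them.

```python
from collections import Counter


def missing_value_counts(rows):
    """Count empty or None cells per column across dict rows."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or str(value).strip() == "":
                counts[column] += 1
    return dict(counts)
```

If the result contains any column other than `agent` and `agency`, the completeness claim above would need to be revisited.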

## Project feasibility
-------------
This task involves more detailed fact-finding about all of the resources, constraints, assumptions and other factors that should be considered in determining the data analysis goal and project plan. In the previous task, your objective was to quickly get to the crux of the situation. Here, you want to flesh out the details.

### tasks
- Assess the project feasibility
- Create POC (Proof-of-concept) model

### output

#### 1. inventory of resources

List the resources available to the project, including: personnel (business experts, data experts, technical support, machine learning personnel), data (fixed extracts, access to live warehoused or operational data), computing resources (hardware platforms) and software (machine learning tools, other relevant software).

#### 2. Requirements, assumptions and constraints

List all requirements of the project, including the schedule of completion, the comprehensibility and quality of results, and security as well as legal issues. As part of this output, make sure that you are allowed to use the data. List the assumptions made by the project.

These may be assumptions about the data that can be checked during machine learning, but may also include non-checkable assumptions about the business upon which the project rests. It is particularly important to list the latter if they form conditions on the validity of the results.

List the constraints on the project. These may be constraints on the availability of resources, but may also include technological constraints such as the size of data that it is practical to use for modeling.

#### 3. Risks and contingencies

List the risks or events that might occur to delay the project or cause it to fail. List the corresponding contingency plans; what action will be taken if the risks happen.


#### 4. Costs and benefits

Construct a cost-benefit analysis for the project, which compares the costs of the project with the potential benefit to the business if it is successful. The comparison should be as specific as possible.

![](https://i.imgur.com/XU2lghc.png)
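
A cost-benefit comparison ultimately reduces to a few numbers. As a minimal sketch, the function below computes the return on investment as net benefit divided by cost, consistent with the ROI definition in the terminology section; the function name and the figures in the usage note are hypothetical.

```python
def roi(total_benefit, total_cost):
    """Return on investment: net benefit divided by cost."""
    if total_cost <= 0:
        raise ValueError("total_cost must be positive")
    return (total_benefit - total_cost) / total_cost
```

For instance, a project that costs 100,000 and yields benefits of 120,000 gives `roi(120_000, 100_000)`, which is about 0.2, i.e. a 20% return.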

#### 5. Feasibility report

Build a POC ML model and explain as a team whether it is feasible to do this ML project or not. If not, then you need to find another business problem. The key factors here are related to data availability, quality, costs and the nature of the business problem.
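
Before training a real POC model, it can be useful to have a trivial baseline that any ML model must beat. The sketch below is one possible baseline, not the team's POC: it predicts the median price of a group (for example, the city) and falls back to the global median for unseen groups. All names and numbers are hypothetical.

```python
from statistics import median


def fit_group_median(rows, key):
    """Learn the median price per group plus a global fallback."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row["price"])
    medians = {group: median(prices) for group, prices in groups.items()}
    fallback = median(row["price"] for row in rows)
    return medians, fallback


def predict(model, row, key):
    """Predict the group median, or the global median for unseen groups."""
    medians, fallback = model
    return medians.get(row[key], fallback)


def evaluate_mae(model, rows, key):
    """Mean absolute error of the baseline on held-out rows."""
    errors = [abs(row["price"] - predict(model, row, key)) for row in rows]
    return sum(errors) / len(errors)


# Hypothetical data, for illustration only.
train = [
    {"city": "Lahore", "price": 100},
    {"city": "Lahore", "price": 300},
    {"city": "Lahore", "price": 500},
    {"city": "Karachi", "price": 200},
]
test = [{"city": "Lahore", "price": 320}, {"city": "Multan", "price": 200}]

model = fit_group_median(train, "city")
baseline_mae = evaluate_mae(model, test, "city")
```

If a candidate ML model cannot clearly improve on the MAE of such a baseline on held-out data, that is a signal to revisit the features or the feasibility of the project.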

# TODO: complete this part

## produce project plan
----------------

### task

Describe the intended plan for achieving the machine learning goals and thereby achieving the business goals. The plan should specify the anticipated set of steps to be performed during the rest of the project including an initial selection of tools and techniques.

### output

#### 1. Project plan
- List the stages to be executed in the project, together with their duration, resources required, inputs, outputs, and dependencies. Where possible, try and make explicit the large-scale iterations in the machine learning process, for example, repetitions of the modeling and evaluation phases. As part of the project plan, it is also important to analyze dependencies between time schedule and risks. Mark results of these analyses explicitly in the project plan, ideally with actions and recommendations if the risks are manifested. Decide at this point which evaluation strategy will be used in the evaluation phase. Your project plan will be a dynamic document. At the end of each phase you’ll review progress and achievements and update the project plan accordingly. Specific review points for these updates should be part of the project plan.

- Build a Gantt chart for the project tasks and phases using an online platform such as TeamGantt, Jira, GoodDay, Trello, etc.

- Add all of your team members and make a preliminary assignment of tasks to them. Then check the progress daily.

#### 2. ML project Canvas
At the end of the first phase, you should create a canvas for the project as a summary of this phase.


> ##### Example
> Follow the link: https://github.com/louisdorard/machine-learning-canvas/blob/master/churn.pdf


# TODO: complete this part