# Business and data understanding
------------

The initial phase covers three tasks: defining the business objectives and translating them into ML objectives, collecting the data and verifying its quality, and finally assessing the feasibility of the project.

![](./data/business_and_data.jpeg)

## Terminology
#### 1. Business terminology

| Term                             | Description                                                                                                                                                                                      |
|----------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Comparative Market Analysis (CMA) | An assessment of similar recently sold properties in the same area to determine a competitive price. Example: "The agent has conducted a CMA to suggest an entry price".                        |
| Fair Market Value (FMV)      | The price for which a property would be sold on the open market. Example: "The FMV has been determined on the basis of recent sales data".                                                                      |
| Appraisal                    | The process or result of an estate agent estimating a property's value. Example: "The appraisal was higher than the initially suggested price."   |
| Multiple Listing Service (MLS)| A database used by real estate agents to share information about properties for sale. Example: "The property was listed on the MLS."                                 |
| Real Estate Investment Trust (REIT) | A company that owns, manages or finances income-generating property. Example: "The property was owned by a trusted REIT".                                                   |
| Marla                            | A traditional unit of area used in Pakistan and India for measuring property sizes, equal to approximately 272.25 square feet. Example: "The house is built on a 10 Marla plot."                |
| Kanal                            | Another traditional unit of area used in Pakistan and India for measuring property sizes, equal to 20 Marlas or approximately 5,445 square feet. Example: "The farm house spans 2 Kanals."      |
| Agency                           | A company or organisation set up to provide a particular service. Typically this involves arranging transactions between buyers and sellers of property. Example: "The agency took care of all the paperwork for the sale". |


#### 2. ML terminology
| Term                             | Description |
|----------------------------------|-------------|
| Feature Engineering          | The process of selecting, modifying, or creating new variables to improve the performance of a machine learning model. |
| Regression                   | A type of predictive modeling technique that estimates the relationships among variables. |
| Overfitting                  | When a model learns not only the underlying pattern but also the noise in the training data, resulting in poor performance on new data. |
| Hyperparameters              | Configurations that are external to the model and whose values cannot be estimated from the data. |
| Random Forest                | An ensemble learning method that constructs multiple decision trees and merges their results to improve predictive accuracy and control overfitting. |
| K-Nearest Neighbors (KNN) | A supervised learning algorithm that classifies or predicts values based on the 'k' nearest data points in the feature space.|
| Mean Absolute Error (MAE)    | A metric used to measure the accuracy of a model by averaging the absolute differences between predicted and actual values.|
| Mean Squared Error (MSE)     | A metric used to measure the average squared differences between predicted and actual values, giving more weight to large errors.|
| Gradient Boosting            | An ensemble technique that builds a model in a stage-wise fashion by combining weak learners to form a strong learner.|
| Training Set                 | The portion of the dataset used to train the model. Example: "80% of the data was used as the training set to build the model." |
| Test Set                     | The portion of the dataset used to evaluate the model's performance.  |
| Data Preprocessing           | The process of cleaning and preparing raw data for modeling.|
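
To make the two error metrics in the table concrete, here is a minimal sketch in plain Python. The function names and the example prices are hypothetical and only illustrate the definitions; they are not taken from the project code.

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of the squared differences; large errors dominate the result."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical prices in arbitrary units: two small errors and one large error.
actual = [10.0, 20.0, 30.0]
predicted = [11.0, 19.0, 40.0]

mae = mean_absolute_error(actual, predicted)  # (1 + 1 + 10) / 3 = 4.0
mse = mean_squared_error(actual, predicted)   # (1 + 1 + 100) / 3 = 34.0
```

The single large error contributes 10 of the 12 units of absolute error but 100 of the 102 units of squared error, which is why MSE is described as giving more weight to large errors.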


## Scope of the project
#### 1. Background
Zameen.com is Pakistan's leading website for posting ads about houses and properties. The platform operates across various cities, facilitating the buying, selling, and renting of properties. Zameen.com maintains a rich dataset comprising property listings, transaction histories, and user interactions. The company aims to utilize this data to improve the accuracy of house price predictions and enhance the overall user experience on the platform.

#### 2. Business problem
Real estate businesses, such as Zameen.com, and real estate agents may struggle with accurately pricing houses, causing revenue loss, customer dissatisfaction, and competitive disadvantage. We propose developing a machine learning model to automate the evaluation of house prices using diverse data sources. This solution will enhance pricing accuracy, build customer trust, optimize pricing strategies and maximize profits through precise, data-driven property valuations.

#### 3. Business objectives
- Primary objective: Improve the accuracy of house price predictions on the Zameen.com platform in order to:
    - Enhance customer satisfaction by offering fair and competitive property prices.
    - Reduce the time properties remain on the market by providing accurate initial pricing.
    - Increase revenue through optimized pricing strategies.

#### 4. ML objectives
- Predict house prices based on property type, location, city, province name, coordinates, number of baths and bedrooms, and other relevant features.

## Success Criteria
#### 1. Business Success Criteria

- **Customer Satisfaction**: Increase customer satisfaction ratings by 10% within the next two quarters by providing more accurate and fair pricing.
- **Market Competitiveness**: Reduce the average time a property stays on the market by 20% through more precise initial pricing.
- **Revenue Growth**: Increase revenue from property listings by 15% over the next year by optimizing pricing strategies.

#### 2. ML Success Criteria

- **Predictive Accuracy**: Achieve an R-squared of more than 0.4 when applied to new, unseen data.
- **Feature Importance**: Accurately identify and validate the top 7 features that most significantly influence house prices.
- **Scalability**: The model should be able to handle real-time predictions with response times under 1 second.
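
The predictive accuracy criterion can be checked with a few lines of code. The sketch below computes R-squared from its definition (one minus the ratio of residual to total sum of squares) and compares it with the 0.4 threshold stated above. The helper names and the sample values are hypothetical.

```python
from typing import Sequence

R2_TARGET = 0.4  # threshold from the ML success criteria


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


def meets_accuracy_target(actual: Sequence[float], predicted: Sequence[float],
                          target: float = R2_TARGET) -> bool:
    """True when R-squared on the given (unseen) data exceeds the target."""
    return r_squared(actual, predicted) > target


# Hypothetical hold-out values.
actual = [1.0, 2.0, 3.0, 4.0]
good_predictions = [1.5, 1.5, 3.5, 3.5]
mean_predictions = [2.5, 2.5, 2.5, 2.5]  # always predicting the mean gives R-squared of 0
```

A model that always predicts the mean of the targets scores 0, so the 0.4 threshold measures improvement over that baseline.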

#### 3. Economic Success Criteria

- **Cost Efficiency**: Reduce the costs associated with manual property appraisals by 25% within the next year through automation.
- **Return on Investment (ROI)**: Achieve an ROI of at least 20% from the implementation of the ML model within the first year.
- **Operational Savings**: Realize operational savings of 15% by streamlining the property valuation process with the ML model.
- **Profit Margin**: Increase the overall profit margin by 10% by reducing pricing errors and optimizing sales processes.
- **Long-Term Viability**: Ensure the model's maintenance and updating costs remain below 5% of the overall operational budget for the valuation process.

## Data collection
#### 1. Data Collection Report

**Data Source**:
The data for this project is sourced from a publicly available dataset on Kaggle.

**Data Link**:
[Zameen.com House Price Prediction Dataset](https://www.kaggle.com/datasets/howisusmanali/house-price-prediction-zameencom-dataset)

**Data Type**:
The dataset includes a mix of numerical, categorical, text, and time-based data:
- **Numerical**: price, property_id, location_id, latitude, longitude, etc;
- **Categorical/Text**: property_type, location, city, province_name, etc;
- **Time-Based**: date_added.

**Data Size**:
- **Number of Rows**: 168,446
- **Number of Columns**: 20

**Data Collection Method**:
The dataset was downloaded directly from the Kaggle platform as a structured CSV file that can be loaded for analysis without conversion. Kaggle does not guarantee that the data is clean, so its quality is checked in the data quality verification section.

#### 2. Data Version Control Report

**Data Version**:
- Current data version: **v1.1**

**Data Change Log**:
- **v1.0**: Initial sample drawn from the full dataset with the script `data.py`, which was run on Windows. Because file paths differ between operating systems, a separate version is kept for each system. Number of rows: 33,689.
- **v1.1**: Initial sample drawn from the full dataset, with Linux-style paths. Number of rows: 33,689.
- Any updates to the dataset, such as corrections or new data additions, will be logged here with corresponding dates.
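
The row counts in the change log are consistent with a 20% sample of the full dataset (168,446 × 0.2 ≈ 33,689). The sketch below shows one way to draw such a sample reproducibly and to build the output path so that it works on both Windows and Linux. It is not the project's `data.py`: the 20% fraction is inferred from the row counts, and the seed, function name and directory layout are assumptions.

```python
import random
from pathlib import PurePosixPath, PureWindowsPath

FULL_ROWS = 168_446    # size of the full Kaggle dataset
SAMPLE_FRACTION = 0.2  # assumption: inferred from 33,689 / 168,446


def sample_indices(n_rows: int, fraction: float, seed: int) -> list[int]:
    """Pick a reproducible random subset of row indices."""
    k = int(n_rows * fraction)
    rng = random.Random(seed)  # local generator, global random state is untouched
    return sorted(rng.sample(range(n_rows), k))


indices = sample_indices(FULL_ROWS, SAMPLE_FRACTION, seed=42)

# The same relative location rendered for each operating system.
# The directory and file names are hypothetical.
parts = ("data", "samples", "sample.csv")
posix_path = PurePosixPath(*parts)      # data/samples/sample.csv
windows_path = PureWindowsPath(*parts)  # data\samples\sample.csv
```

In a real script, `pathlib.Path(*parts)` picks the right flavour for the machine it runs on, which avoids keeping a separate data version only because of path separators.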

**Data Backup**:
- The dataset is stored on a configured Google Drive remote. Additionally, versions of the dataset are stored locally on team members' computers. Each newly sampled version is versioned and added to local storage.

**Data Archiving**:
- If the team finds a more suitable and more up-to-date dataset, the old version will be archived. For now, since we work with a single Kaggle dataset, archiving is not necessary.

**Data Access Control**:
- Data is stored on a Google Drive remote. Only one team member has direct access to the folder, and the folder ID is kept in a config file in a private repository.

## Data quality verification
### 1. Data description
The data acquired for this project is a dataset of 168,446 records with 20 fields each. The fields include a link to the ad, the type of property, price, location, number of baths and bedrooms, and related data. The data is in CSV format and is stored in Google Drive.

| feature       | description                                                      |
|---------------|------------------------------------------------------------------|
| property_id   | unique id for each property                                      |
| location_id   | unique id for each location (not unique for each property)       |
| page_url      | URL of the ad on Zameen.com                                      |
| property_type | type of the property (FarmHouse, Flat, etc.)                     |
| price         | price of the property                                            |
| location      | location name in each city                                       |
| city          | name of the city                                                 |
| province_name | name of province (like state) of the city                        |
| latitude      | latitude in degrees (part of coordinates of a property)          |
| longitude     | longitude in degrees (another part of coordinates of a property) |
| baths         | number of baths in a property                                    |
| area          | area either in Marla or Kanal (e.g. "4 Marla")                   |
| purpose       | either "For Sale" or "For Rent"                                  |
| bedrooms      | number of bedrooms in a property                                 |
| date_added    | date when the ad was added                                       |
| agency        | real estate agency that helps customers find a property          |
| agent         | either individual agent or agent from some agency                |
| Area Type     | either "Marla" or "Kanal" (1 Kanal = 20 Marla)                   |
| Area Size     | value of area size either in Marla or Kanal (e.g. "5")           |
| Area Category | categorically encoded area (e.g. "0-5 Marla")                    |
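
Because `area` mixes two units, comparing properties requires a common unit. The sketch below converts an `area` string such as "4 Marla" to square feet, using the conversion factors from the terminology section (1 Marla ≈ 272.25 square feet, 1 Kanal = 20 Marla). The function name is hypothetical, and the code assumes the value is a number followed by a space and the unit.

```python
SQFT_PER_MARLA = 272.25                      # approximate, from the terminology table
MARLAS_PER_UNIT = {"Marla": 1, "Kanal": 20}  # 1 Kanal = 20 Marla


def area_to_sqft(area: str) -> float:
    """Convert a string like '4 Marla' or '2 Kanal' to square feet."""
    size_text, unit = area.split()  # raises ValueError unless there are exactly two parts
    if unit not in MARLAS_PER_UNIT:
        raise ValueError(f"unknown area unit: {unit!r}")
    return float(size_text) * MARLAS_PER_UNIT[unit] * SQFT_PER_MARLA


ten_marla = area_to_sqft("10 Marla")  # 10 * 272.25 = 2722.5
two_kanal = area_to_sqft("2 Kanal")   # 2 * 20 * 272.25 = 10890.0
```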

### 2. Data exploration
During data exploration, the following findings were discovered:
- There are outliers in the price (e.g., extremely high or low prices) that may need to be addressed.
- There is an ad with coordinates outside Pakistan, which may need to be removed.
- Null values occur only in the 'agent' and 'agency' columns, which simply means that the ad was posted by an individual without an agency. We therefore treat the data as having no missing values.
- There is strong correlation between number of baths and number of bedrooms, which is expected.
- Number of baths and bedrooms are highly correlated with price, indicating their importance in predicting house prices.
- Area size and price are also correlated, suggesting that larger properties tend to have higher prices.
- Average prices differ substantially between property types and between cities, so these averages can be used as features for the model.
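
One common way to flag the price outliers mentioned in the findings is the interquartile range (IQR) rule. The sketch below is an illustration with hypothetical prices, not the method the team has committed to. The multiplier of 1.5 is the conventional default and the function names are assumptions.

```python
import statistics
from typing import Sequence


def iqr_bounds(values: Sequence[float], k: float = 1.5) -> tuple[float, float]:
    """Return (low, high) limits; values outside them are treated as outliers."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values: Sequence[float], k: float = 1.5) -> list[float]:
    """List the values that fall outside the IQR limits."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if v < low or v > high]


# Hypothetical prices in arbitrary units with one extreme listing.
prices = [5, 6, 7, 8, 9, 10, 11, 500]
outliers = find_outliers(prices)
```

Since prices for rentals and sales differ by orders of magnitude, it may be more meaningful to apply such a rule separately per `purpose` or on log-transformed prices.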

#### Correlation Matrix of numerical features:
![alt text](./data/output.png)

#### Average prices per city:
| city       | average price      |
|------------|--------------------|
| Islamabad  | 10996388.2093984   |
| Lahore     | 20294003.591380686 |
| Faisalabad | 7244967.351237494  |
| Rawalpindi | 8789136.459498653  |
| Karachi    | 18173945.75491426  |

#### Average prices per property_type:
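| property_type | average price      |
|---------------|--------------------|
| Flat          | 8507796.62478443   |
| House         | 20340713.913295355 |
| Farm House    | 63248551.02040816  |
| Lower Portion | 1567718.38643371   |
| Upper Portion | 2509633.6141406544 |
| Penthouse     | 15017850.877192982 |
| Room          | 258742.91497975707 |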
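
Group averages like the ones above can be computed with a short helper. The sketch below uses plain Python and hypothetical records; the function name is an assumption, and the project itself may well use a dataframe library instead.

```python
from collections import defaultdict


def mean_price_by(records: list[dict], key: str) -> dict[str, float]:
    """Average the 'price' field of the records within each value of `key`."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for record in records:
        totals[record[key]] += record["price"]
        counts[record[key]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical listings, prices in arbitrary units.
records = [
    {"city": "Lahore", "property_type": "House", "price": 10.0},
    {"city": "Lahore", "property_type": "Flat", "price": 20.0},
    {"city": "Karachi", "property_type": "House", "price": 30.0},
]

by_city = mean_price_by(records, "city")
by_type = mean_price_by(records, "property_type")
```

If such averages are used as model features, they should be computed on the training set only, otherwise information about test prices leaks into the features.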

### 3. Data requirements
The data requirements for raw unprocessed data for the project are defined as follows:
- Null values are allowed only in 'agent' and 'agency' columns.
- **property_id**: Unique identifier for each property.
- **location_id**: Unique identifier for each location.
- **latitude**: Should be within the range of 0 to 90 degrees.
- **longitude**: Should be within the range of 0 to 180 degrees.
- **price**: Should be greater than or equal to 0.
- **baths**: Number of baths should be greater than or equal to 0.
- **bedrooms**: Number of bedrooms should be greater than or equal to 0.
- **Area Size**: Value of area size should be greater than or equal to 0.
- **Area Category**: Should be one of the predefined categories (e.g., "0-5 Marla"), no more than 18 unique values.
- **Area Type**: Should be either "Marla" or "Kanal".
- **purpose**: Should be either "For Sale" or "For Rent".
- **date_added**: Should be in a date format YEAR-MONTH-DAY (e.g., 2023-03-15).
- **property_type**: No more than 7 unique values.
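
Several of the row-level requirements above can be expressed as a simple validation function. The sketch below is a plain-Python illustration, not the project's validation code. The field names come from the data description table, while the function name, the message texts and the example record are hypothetical.

```python
from datetime import datetime

ALLOWED_PURPOSES = {"For Sale", "For Rent"}
ALLOWED_AREA_TYPES = {"Marla", "Kanal"}


def validate_record(record: dict) -> list[str]:
    """Return a list of requirement violations for one raw record."""
    problems = []
    if not 0 <= record["latitude"] <= 90:
        problems.append("latitude is outside 0-90 degrees")
    if not 0 <= record["longitude"] <= 180:
        problems.append("longitude is outside 0-180 degrees")
    for field in ("price", "baths", "bedrooms", "Area Size"):
        if record[field] < 0:
            problems.append(f"{field} is negative")
    if record["purpose"] not in ALLOWED_PURPOSES:
        problems.append("purpose is not 'For Sale' or 'For Rent'")
    if record["Area Type"] not in ALLOWED_AREA_TYPES:
        problems.append("Area Type is not 'Marla' or 'Kanal'")
    try:
        datetime.strptime(record["date_added"], "%Y-%m-%d")
    except ValueError:
        problems.append("date_added is not in YEAR-MONTH-DAY format")
    return problems


# Hypothetical record that satisfies all checks.
good = {
    "latitude": 31.5, "longitude": 74.3, "price": 5_000_000,
    "baths": 2, "bedrooms": 3, "Area Size": 5.0, "Area Type": "Marla",
    "purpose": "For Sale", "date_added": "2023-03-15",
}
# Same record with a negative latitude and a wrongly formatted date.
bad = dict(good, latitude=-5.0, date_added="15/03/2023")
```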

Specifically, several data requirements were coded as expectations in a Great Expectations (GX) suite so that data samples can be checked against them regularly:
- **price**:
    - values must never be null
    - values must be of type int64
    - minimum value must be greater than or equal to 0.
    - median must be greater than or equal to 120000.0 and less than or equal to 19800000.0.
- **location**:
    - values must never be null.
    - distinct values must belong to this set (the set is predefined by the values in the dataset).
    - must have greater than or equal to 800 and less than or equal to 1500 unique values.
- **area**:
    - values must never be null.
    - values must match this regular expression: `^[0-9]+(\.[0-9])?\s*(Kanal|Marla)$`.
- **Area Size**:
    - minimum value must be greater than or equal to 0.
- **agency**:
    - distinct values must belong to this set (the list of agencies known from the first sample; it will be updated if needed).
    - values must be of type str.
    - values must not be null at least 50% of the time.
- **baths**:
    - minimum value must be greater than or equal to 0.
    - values must be of type int64.
    - maximum value must be less than or equal to 10.
- **bedrooms**:
    - minimum value must be greater than or equal to 0.
    - values must be of type int64.
    - maximum value must be less than or equal to 18.
- **property_type**:
    - distinct values must belong to this set: House, Penthouse, Flat, Farm House, Upper Portion, Lower Portion, Room.
    - values must never be null.
    - must have greater than or equal to 5 and less than or equal to 9 unique values.
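
To show what two of these expectations accept and reject, the sketch below re-implements them in plain Python with the standard `re` and `statistics` modules. It does not use the Great Expectations API. The regular expression and the median limits are copied from the list above, while the function names are hypothetical.

```python
import re
import statistics
from typing import Sequence

# Regular expression from the `area` expectation.
AREA_REGEX = re.compile(r"^[0-9]+(\.[0-9])?\s*(Kanal|Marla)$")


def area_is_valid(value: str) -> bool:
    """True when the value is a number with at most one decimal digit plus a unit."""
    return AREA_REGEX.match(value) is not None


def price_median_in_range(prices: Sequence[float],
                          low: float = 120_000.0,
                          high: float = 19_800_000.0) -> bool:
    """True when the median price lies within the limits of the `price` expectation."""
    return low <= statistics.median(prices) <= high
```

Note that the pattern is case-sensitive and allows only a single digit after the decimal point, so a value such as "1.25 Kanal" or "4 marla" would fail the check.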

The data dimensions represented by each expectation can be seen in `data_expectations.ipynb` or by loading the documentation site from the GX context.

The data requirements for preprocessed data:
- Null values are not allowed.
- **month_sin**: Should be in the range from -1 to 1.
- **month_cos**: Should be in the range from -1 to 1.
- **day_sin**: Should be in the range from -1 to 1.
- **day_cos**: Should be in the range from -1 to 1.

These data requirements will be used in the next phase to check expectations on the preprocessed dataset used for ML training.
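
The sine and cosine features above are bounded by -1 and 1 because they come from a cyclical encoding of the date. The sketch below shows one way such features could be derived. The periods (12 for months, 31 for days of the month) and the function names are assumptions, since the preprocessing code is not part of this document.

```python
import math


def cyclical_encode(value: float, period: float) -> tuple[float, float]:
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def encode_date_parts(month: int, day: int) -> dict[str, float]:
    """Encode a 1-based month and day of month as four features in [-1, 1]."""
    month_sin, month_cos = cyclical_encode(month - 1, 12)  # assumed period: 12 months
    day_sin, day_cos = cyclical_encode(day - 1, 31)        # assumed period: 31 days
    return {
        "month_sin": month_sin, "month_cos": month_cos,
        "day_sin": day_sin, "day_cos": day_cos,
    }


features = encode_date_parts(month=3, day=15)  # e.g. date_added = 2023-03-15
```

The benefit over a raw month number is that December and January end up next to each other on the circle instead of 11 units apart.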

### 4. Data quality verification report
- **Completeness**: The data is complete in the sense that it covers all the required cases. All properties have the necessary information.
- **Correctness**: The data contains rare outliers and no obvious errors. For example, 1 ad among ~34,000 records has coordinates outside Pakistan and may need to be removed.
- **Missing Values**: There are no missing values in the data except in 'agent' and 'agency' columns, which is logical.
- Overall, the data quality is good, it is suitable for analysis and modeling. However, further cleaning and preprocessing may be required to address outliers and ensure data consistency.

## Project feasibility
### 1. Inventory of resources
Personnel:
- Business experts: no real experts, but the team has a good understanding of the real estate domain.
- Data experts: 1 data engineer;
- Technical support: 2 MLOps course instructors;
- Machine learning personnel: 1 data scientist and 1 machine learning engineer.

Data:
- Fixed extracts: Zameen.com House Price Prediction Dataset (168446 records, 20 fields);
- Access to live warehoused or operational data: can be parsed from the website.

Computing resources:
- Hardware platforms:
    - GPUs: 2xNvidia RTX 3060 Ti, 1xNvidia RTX 2050;
    - CPUs: 1xIntel I5-11600, 1xIntel I7-1250h, 1xAMD Ryzen 5 5500, 1xAMD Ryzen 5 5600H, 1xAMD Ryzen 7 3700U;
    - RAM: 16-32 GB per machine;
    - Storage: 512-1024 GB SSD per machine.
    - Cloud hardware: Yandex Cloud 4xNvidia Tesla A100 cluster with 384 GB RAM and 64 vCPUs (Paid resource, preferably not to be used).

- Software:
    - Machine learning tools: Python, scikit-learn, TensorFlow, PyTorch, XGBoost, LightGBM, etc.
    - MLOps tools: Hydra, DVC, Pytest, GX, Spark, Airflow, etc.
    - Other relevant software: Jupyter Notebook, Google Colab, GitHub, etc.

### 2. Requirements, assumptions and constraints
Requirements:
- Schedule of completion: 6 weeks;
- Comprehensibility and quality of results: The results should be easily interpretable and have a high level of accuracy;
- Security: Data is open-source and does not require additional security measures;
- Legal issues: No legal issues are expected as the data is publicly available.

Assumptions:
- The data is representative of the real estate market in Pakistan;
- The data does not contain non-real estate related properties;
- The data is accurate and up-to-date.
- The data is not biased towards any specific property type or location based on the website's listings.
- The data is not manipulated or altered in any way.

Constraints:
- Availability of resources: Limited access to cloud hardware, which may impact the scalability of the project;
- Technological constraints: The size of the dataset may limit the complexity of the models that can be trained.
- Time constraints: The project must be completed within 6 weeks, which may limit the scope of the analysis.
- Budget constraints: Limited budget for cloud resources and additional tools.

### 3. Risks and contingencies
Risks:
- Data quality issues: Inaccurate or missing data may impact the time consumed for data cleaning and preprocessing;
- Model performance: The models may not achieve the desired accuracy on the new data, leading to suboptimal results;
- Resource constraints: Limited access to cloud hardware may slow down the training process and limit the scalability of the models.
- Personnel constraints: The team may face challenges in coordinating tasks and managing time effectively. Moreover, if one of the team members is unavailable for some time, the project timeline may be affected significantly.

Contingencies:
- Data quality issues: Implement data validation checks and data cleaning procedures to address inaccuracies and missing values;
- Model performance: Experiment with different algorithms and hyperparameters to improve model performance;
- Resource constraints: Optimize code and use cloud resources efficiently, using local hardware for initial testing and development;
- Personnel constraints: Hold regular team meetings and track tasks to ensure timely completion of project milestones. If one of the team members is unavailable for some time, the rest of the team will redistribute tasks and responsibilities to keep the project on schedule. If the timeline is still at risk, the team will consider reducing the scope of the project or seeking additional resources to meet the deadline.

### 4. Costs and benefits
Costs:
- Personnel costs: Salaries for data engineer, data scientist, and machine learning engineer;
- Hardware costs: Cloud resources for training and testing models;

Benefits:
- Improved house price predictions on Zameen.com platform;
- Enhanced user experience and customer satisfaction;
- Increased revenue from optimized pricing strategies;
- Improved understanding of the real estate market in Pakistan.

The impact of the benefits is rated high (3) and the impact of the costs is rated medium (2), so the benefit/cost ratio is 3/2 = 1.5.

### 5. Feasibility report
A Lasso regression model was built as a proof of concept (POC) to predict house prices from property features. The model achieved an R-squared score of 0.3 on the test data, indicating a moderate level of predictive accuracy that is still below the 0.4 target in the ML success criteria. The feasibility of the project is confirmed based on the available resources, data quality, and business objectives. The team is confident in proceeding with the project and will focus on refining the model and exploring additional features to improve predictive performance.

## Produce project plan
#### 1. Project plan

- Task tracker with Gantt chart: [invite link](https://height.app/invite/yD-nM4mcEVZgSn)

#### Stages:

1. Business and Data Understanding
   - Duration: 18 June - 25 June
   - Resources Required: All team members
   - Inputs: Dataset, research on real estate market
   - Outputs: Business problem definition, business and machine learning goals, success criteria, and risk analysis
   - Dependencies: Availability of dataset
   - Risks: Misalignment of business and ML goals, data unavailability, unfeasibility of ML solution
   - Evaluation Strategy: Documentation reviews

2. Data Preparation
   - Duration: 25 June - 2 July
   - Resources Required: Data Engineer, Data Scientist
   - Inputs: Raw sampled data, business and ML requirements
   - Outputs: Cleaned and transformed data, ML-ready features stored in feature stores
   - Dependencies: Access to data storage systems, ETL tools
   - Risks: Data quality issues, ETL pipeline failures
   - Evaluation Strategy: Data quality checks, ETL pipeline testing

3. Model Engineering
   - Duration: 3 July - 9 July
   - Resources Required: Data Scientist, ML Engineer
   - Inputs: ML-ready data, feature stores
   - Outputs: Trained ML models, experiment tracking reports, optimized model versions
   - Dependencies: Computational resources, MLflow setup
   - Risks: Overfitting/underfitting, computational resource limitations
   - Evaluation Strategy: Cross-validation, performance metrics tracking using MLflow

4. Model Validation and Deployment
   - Duration: 10 July - 16 July
   - Resources Required: Data Scientist, ML Engineer
   - Inputs: Trained models, validation data, deployment infrastructure
   - Outputs: Validated and deployed model, REST endpoint, user interface, business and ML objective assessment, feasibility report
   - Dependencies: Validation data, Deployment platform, CI/CD tools
   - Risks: Model performance below expectations, deployment failures, integration issues
   - Evaluation Strategy: Performance metrics review, deployment testing, CI/CD pipeline validation

5. Model Monitoring and Maintenance
   - Duration: 17 July - 23 July
   - Resources Required: ML Engineer, Data Scientist
   - Inputs: Production data, synthetic data, all generated artifacts from previous stages
   - Outputs: Monitoring reports, retrained models (if needed)
   - Dependencies: Monitoring tools, data streams
   - Risks: Model drift, performance degradation
   - Evaluation Strategy: Regular drift checks, synthetic data testing, model performance monitoring

#### 2. ML project Canvas
Available in the `reports` folder of the GitHub repository.