# Optimizing Profits: Discover Property Treasures Using Cutting-Edge AI and Data Mining

## Business Problem

### Introduction:

Welcome to the Property Investment Challenge! Our client, an investment fund, is on a mission to maximize returns on real estate investments. In this dynamic and competitive market, the fund is looking to leverage cutting-edge AI and data mining techniques to identify and invest in properties with the highest potential for profitability. The goal is to optimize the allocation of resources and strategically select properties that can yield substantial returns on investment.

### Challenge Overview:

The investment fund is planning to make a significant number of property investments, and the key to success lies in the ability to identify properties that offer substantial returns. The challenge is rooted in a vast dataset that spans the last 23 years, encompassing millions of property sales. The fund envisions a solution that can effectively segment these properties, enabling them to leverage sophisticated machine learning and analytics to quickly identify target properties for investment.

### Key Objectives:

1. **Predictive Modeling:**
   - Develop robust machine learning models capable of accurately predicting property sale prices. These models should take into account various factors such as location, estimated price, and other relevant features.

2. **Segmentation:**
   - After predicting sale prices, segment the properties into four distinct categories based on the calculated gain. The gain is calculated using the formula: `(Sale price - Estimated price)/100`.
     - **Segment 0: Premium Properties 💰🏰**
     - **Segment 1: Valuable Properties 💎🏡**
     - **Segment 2: Standard Properties 🏘️💸**
     - **Segment 3: Budget Properties 🏠💵**
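The brief does not say where the boundaries between the four segments lie. The sketch below assumes quartile cut-offs: it computes the gain with the formula stated above and uses `pandas.qcut` to form four equally sized groups, with Segment 0 holding the highest gains. The helper name `assign_segments` and the quartile rule are illustrative assumptions, not the challenge's official definition. The column names `Sale Price` and `Estimated Value` and the four example rows are taken from the training data preview.

```python
import pandas as pd

def assign_segments(df: pd.DataFrame) -> pd.DataFrame:
    """Add 'gain' and 'segment' columns (assumed quartile-based segmentation)."""
    out = df.copy()
    # Gain formula as stated in the challenge description.
    out["gain"] = (out["Sale Price"] - out["Estimated Value"]) / 100
    # ASSUMPTION: quartile cut-offs. Labels are reversed so that the
    # highest gains map to Segment 0 (Premium) and the lowest to Segment 3.
    out["segment"] = pd.qcut(out["gain"], q=4, labels=[3, 2, 1, 0]).astype(int)
    return out

# Four rows copied from the training data preview.
example = pd.DataFrame({
    "Estimated Value": [269430.0, 78180.0, 787290.0, 144900.0],
    "Sale Price": [650000.0, 140000.0, 1349000.0, 172750.0],
})
segmented = assign_segments(example)
print(segmented[["gain", "segment"]])
```

With real data the cut-offs could instead be fixed gain thresholds; quartiles are used here only because they need no extra parameters.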

### Dataset:

Participants will be provided with a comprehensive property dataset spanning the last 23 years. This dataset includes information such as property location, estimated price, selling price, and other relevant details.

### Expected Outcome:

The investment fund anticipates a solution that can not only accurately predict property sale prices but also categorize them into distinct segments. This segmentation will empower the fund to make well-informed decisions, strategically investing in properties that align with their overarching goal of maximizing returns.

### Evaluation:

Submissions will be assessed with a performance metric suited to regression tasks, one that reflects how accurately the model predicts continuous values such as property sale prices. Participants are encouraged to choose a metric aligned with the competition's goals and the dataset's characteristics, and to use it to guide model improvement.
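The brief does not name the metric. As a minimal sketch, two common regression metrics, root mean squared error (RMSE) and mean absolute error (MAE), can be computed with NumPy as follows. The function names and the example predictions are illustrative assumptions, not part of the challenge.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error: penalizes large errors more heavily."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred) -> float:
    """Mean absolute error: average size of the error in price units."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical sale prices and predictions, for illustration only.
actual = [650000.0, 140000.0, 172750.0]
predicted = [600000.0, 150000.0, 172750.0]
print(f"RMSE: {rmse(actual, predicted):.2f}")
print(f"MAE:  {mae(actual, predicted):.2f}")
```

RMSE is more sensitive to a few badly mispriced properties, while MAE reads directly as the typical error in currency units.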

Let the challenge begin, and may the best solution unlock the potential of property treasures! 🏡💎


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
pip install requests

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.





In [13]:
!pip install -r requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Collecting fastapi (from -r requirements.txt (line 6))
  Downloading fastapi-0.109.2-py3-none-any.whl.metadata (25 kB)
Collecting docker (from -r requirements.txt (line 12))
  Downloading docker-7.0.0-py3-none-any.whl.metadata (3.5 kB)
Collecting starlette<0.37.0,>=0.36.3 (from fastapi->-r requirements.txt (line 6))
  Downloading starlette-0.36.3-py3-none-any.whl.metadata (5.9 kB)
Downloading fastapi-0.109.2-py3-none-any.whl (92 kB)
   ---------------------------------------- 92.1/92.1 kB 186.8 kB/s eta 0:00:00
Downloading docker-7.0.0-py3-none-any.whl (147 kB)
   -------------------------------------- 147.6/147.6 kB 313.5 kB/s eta 0:00:00
Downloading starlette-0.36.3-py3-none-any.whl (71 kB)
   ---------------------------------------- 71.5/71.5 kB 437.4 kB/s eta 0:00:00
Installing collected packages: starlette, docker, fastapi
Successfully installed docker-7.0.0 fastapi-0.109.2 starlette-0.36.3





# 1. Data Acquisition

In [20]:
import os
import requests
from credentials import ACCESS, BUCKET_NAME, DATABASE_URL


# IMPORTANT: Credentials are not shared in this public notebook.
# Replace the values imported from `credentials` with your own credentials if you intend to run this code.
# Never share sensitive information like credentials publicly.

# See Rest of code 
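One way to keep the three imported values out of the notebook is to read them from environment variables. The following is a sketch only: the `load_credentials` helper and the assumption that the environment variables share the names `ACCESS`, `BUCKET_NAME`, and `DATABASE_URL` are hypothetical and do not describe the author's actual `credentials` module.

```python
import os

def load_credentials(env=os.environ) -> dict:
    """Read the three settings from environment variables (assumed names)."""
    names = ("ACCESS", "BUCKET_NAME", "DATABASE_URL")
    # Fail loudly if a value is missing or empty instead of building a broken URL.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}

# Possible use inside a credentials.py module:
# _creds = load_credentials()
# ACCESS, BUCKET_NAME, DATABASE_URL = (
#     _creds["ACCESS"], _creds["BUCKET_NAME"], _creds["DATABASE_URL"]
# )
```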

In [7]:
file_names = [f'pucon24_ai_train_{i}.csv' for i in range(1, 13)]
file_names += ['pucon24-ai-test.csv', 'pucon24-ai-sample_submission.csv', 'pucon24_ai_description.txt']
file_names

['pucon24_ai_train_1.csv',
 'pucon24_ai_train_2.csv',
 'pucon24_ai_train_3.csv',
 'pucon24_ai_train_4.csv',
 'pucon24_ai_train_5.csv',
 'pucon24_ai_train_6.csv',
 'pucon24_ai_train_7.csv',
 'pucon24_ai_train_8.csv',
 'pucon24_ai_train_9.csv',
 'pucon24_ai_train_10.csv',
 'pucon24_ai_train_11.csv',
 'pucon24_ai_train_12.csv',
 'pucon24-ai-test.csv',
 'pucon24-ai-sample_submission.csv',
 'pucon24_ai_description.txt']

In [10]:
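def access_dataset(file_name):
    """Download one dataset file into ./Dataset and report the outcome."""
    url = f'{DATABASE_URL}/{ACCESS}/{BUCKET_NAME}/{file_name}'
    # A timeout keeps the cell from hanging if the server stops responding.
    response = requests.get(url, timeout=60)

    if response.status_code == 200:
        # Create the target directory on first use so open() does not fail.
        os.makedirs('./Dataset', exist_ok=True)
        with open(f'./Dataset/{file_name}', 'wb') as file:
            file.write(response.content)
        print(f'File {file_name} saved successfully.')
    else:
        print(f'Failed to download file {file_name}. Status code: {response.status_code}')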

In [11]:
access_dataset('pucon24-ai-test.csv')


File pucon24-ai-test.csv saved successfully.


In [None]:
for file_name in file_names:
    access_dataset(file_name)
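Because the loop above downloads every file each time it runs, a rerun repeats work that has already been done. A small helper can filter the list down to files that are still missing, as sketched here. The name `files_to_download` is hypothetical; the `./Dataset` directory is the one used by `access_dataset`.

```python
from pathlib import Path

def files_to_download(file_names, directory="./Dataset"):
    """Return the names that are missing (or empty) in `directory`."""
    target = Path(directory)
    pending = []
    for name in file_names:
        path = target / name
        # Treat zero-byte files as failed downloads that should be retried.
        if not path.is_file() or path.stat().st_size == 0:
            pending.append(name)
    return pending

# Possible use with the notebook's own names:
# for file_name in files_to_download(file_names):
#     access_dataset(file_name)
```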

In [2]:
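# Read the 12 training shards and stack them row-wise into one DataFrame.
# axis=0 appends rows; axis=1 would place the shards side by side as extra columns.
frames = [pd.read_csv(f'./Dataset/pucon24_ai_train_{i}.csv') for i in range(1, 13)]
df = pd.concat(frames, axis=0, ignore_index=True)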

In [18]:
df.head()

| Unnamed: 0 | Date | crime_rate | renovation_level | Year | Address | num_rooms | Property | amenities_rating | carpet_area | nearby_restaurants | ... | carpet_area.1 | nearby_restaurants.1 | public_transport_availability | property_tax_rate | distance_to_school | Locality | Residential | Estimated Value | Sale Price | specifications |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009-01-02-00:00:00 | 2.6568262407789027 | Minor | 2009 | 40 ETTL LN UT 24 | Two rooms | Condo | Mediocre | 760.0 | 19.0 | ... | 933.0 | 14.0 | Scattered | 1.003979 | 0.128864 | Groton | Detached House | 269430.0 | 650000.0 | In 2022, this residential located in Groton at... |
| 1 | 2009-01-02-00:00:00 | 5.328727031244374 | Basic | 2009 | 18 BAUER RD | Three rooms | Single Family | Superb | 7967.337677159014 | 17.0 | ... | 975.0 | 1.0 | Abundant | 1.003979 | 8.864996 | Waterbury | Detached House | 78180.0 | 140000.0 | In 2022, this residential located in Waterbury... |
| 2 | 2009-01-02-00:00:00 | 4.037758682930219 | Basic | 2009 | 48 HIGH VALLEY RD. | Three rooms | Single Family | Satisfactory | 982.0 | 1.0 | ... | 1073.0 | 2.0 | Inadequate | 1.003979 | 5.792544 | Greenwich | Detached House | 787290.0 | 1349000.0 | In 2022, this residential located in Greenwich... |
| 3 | 2009-01-02-00:00:00 | 2.085308997846847 | Extensive | 2009 | 56 MERIDEN RD | Three rooms | Single Family | Superb | 976.0 | 5.0 | ... | 1022.0 | 5.0 | Sparse | 1.003979 | 7.198714 | Columbia | Detached House | 144900.0 | 172750.0 | In 2022, this residential located in Columbia ... |
| 4 | 02-01-2009-00:00:00 | 4.397712193695299 | Partial | 2009 | 13 CELENTANO DR | Three rooms | Single Family | Below Average | 947.0 | 14.0 | ... | 956.0 | 13.0 | Extensive | 1.003979 | 3.833746 | Watertown | Detached House | 107100.0 | 217000.0 | In 2022, this residential located in Watertown... |
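
The preview shows the `Date` column in two layouts: `2009-01-02-00:00:00` in rows 0 to 3 and `02-01-2009-00:00:00` in row 4. The sketch below parses both into one datetime column. It assumes the second layout is day-month-year, which is a guess based on the neighbouring rows and should be checked against the dataset description; the helper name `parse_mixed_dates` is hypothetical.

```python
import pandas as pd

def parse_mixed_dates(series: pd.Series) -> pd.Series:
    """Parse dates written as YYYY-MM-DD-HH:MM:SS or DD-MM-YYYY-HH:MM:SS."""
    # Try the year-first layout; values that do not match become NaT.
    year_first = pd.to_datetime(series, format="%Y-%m-%d-%H:%M:%S", errors="coerce")
    # ASSUMPTION: the other layout is day-first (DD-MM-YYYY), not month-first.
    day_first = pd.to_datetime(series, format="%d-%m-%Y-%H:%M:%S", errors="coerce")
    # Keep the year-first result and fill its gaps with the day-first result.
    return year_first.fillna(day_first)

# The two layouts seen in the preview.
sample = pd.Series(["2009-01-02-00:00:00", "02-01-2009-00:00:00"])
print(parse_mixed_dates(sample))
```

Values that match neither layout stay as `NaT`, so they can be counted and inspected rather than silently misread.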


In [3]:
# index=False avoids writing the row index as an extra unnamed column.
df.to_csv('./Dataset/training.csv', index=False)