<h1 align="center">Pool Detector</h1>
<h2 align="center">A problem from Quera's Olympic of Technology in Image and Data Processing</h2>

## Purpose

On many tourism and accommodation booking websites, users can view various images of different properties. In this project, we aim to design a model that can predict the presence or absence of a swimming pool in an accommodation based on the analysis of its images.

## Importing Required Libraries

First, let's import the necessary libraries.

In [59]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import os
from PIL import Image

In [60]:
# Check for GPU availability
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    print(f"GPU is available: {gpus}")
    try:
        # Set TensorFlow to use only the first GPU
        tf.config.set_visible_devices(gpus[0], 'GPU')
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        print(e)
else:
    print("GPU not available.")

GPU is available: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
1 Physical GPUs, 1 Logical GPUs


## Dataset Introduction

The provided dataset is in JSON format and contains information about 1,599 accommodations. The main key is "rooms", which holds a list of all properties. Each record represents an accommodation with its specific features and amenities. The table below describes the keys for each record:

| Key          | Description                                    |
|:-------------|:-----------------------------------------------|
| `id`         | Unique identifier for the accommodation        |
| `title`      | Title of the accommodation listing             |
| `description`| A detailed description of the accommodation    |
| `province`   | Province information, including ID and name    |
| `city`       | City information, including ID and name        |

Here is an example of a single data entry:
```json
{
  "rooms": [
    {
      "id": 3202100,
      "title": "Rent a suite on Javaherdeh road - Ashkounkooh",
      "description": "This suite with no bedroom features a lovely balcony and a beautiful view, located on the ground floor of a two-story building. The distance to the supermarket and bakery is about 50 and 500 meters, respectively.",
      "province": {
        "id": "p26",
        "name": "Mazandaran"
      },
      "city": {
        "id": 303,
        "name": "Ramsar"
      }
    },
    ...
  ]
}
```
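
The nested record shown above can be flattened into one row per accommodation before any further processing. The following is a minimal sketch using only the standard library; the helper name `flatten_rooms` and the output keys `province_name` and `city_name` are illustrative assumptions, not part of the dataset specification.

```python
import json

def flatten_rooms(raw_json):
    """Turn a {"rooms": [...]} payload into a list of flat records."""
    payload = json.loads(raw_json)
    rows = []
    for room in payload["rooms"]:
        # `or {}` guards against a missing or null nested object (assumed possible)
        province = room.get("province") or {}
        city = room.get("city") or {}
        rows.append({
            "id": room["id"],
            "title": room.get("title", ""),
            "description": room.get("description", ""),
            "province_name": province.get("name"),  # hypothetical column name
            "city_name": city.get("name"),          # hypothetical column name
        })
    return rows

# Tiny in-memory payload shaped like the example entry
sample = json.dumps({
    "rooms": [{
        "id": 3202100,
        "title": "Rent a suite on Javaherdeh road - Ashkounkooh",
        "description": "This suite with no bedroom features a lovely balcony.",
        "province": {"id": "p26", "name": "Mazandaran"},
        "city": {"id": 303, "name": "Ramsar"},
    }]
})
rows = flatten_rooms(sample)
print(rows[0]["id"], rows[0]["province_name"], rows[0]["city_name"])
```

A list of flat dictionaries like this can be handed to `pd.DataFrame(rows)` if a tabular view is preferred.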

## Reading the Dataset

First, you need to read the dataset files. The training features are in `train.json`, and the test data (used to find image directories) is in `test.json`. The image directory for each accommodation corresponds to its `id` and is located inside the `pictures` folder.

**Note:** The dataset is approximately 3.8 GB. If you have trouble uploading it to your environment (like Google Colab), you can use the code below to download and unzip it directly.

<a href="https://drive.google.com/file/d/1b8O_a6ywcsbLqJAGDGCkePrdn1cFlXl0/view?usp=sharing" target="_blank">Download from Google Drive</a>

In [61]:
## Run this cell to download the data directly into your environment

# Install gdown to download from Google Drive
# !pip install gdown
# 
# import gdown
# import zipfile
# 
# # Google Drive file ID and destination filename
# file_id = '1b8O_a6ywcsbLqJAGDGCkePrdn1cFlXl0'
# destination = 'dataset.zip'
# 
# # Download the file
# gdown.download(f'https://drive.google.com/uc?id={file_id}', destination, quiet=False)
# 
# # Unzip the file
# with zipfile.ZipFile(destination, 'r') as zip_ref:
#     zip_ref.extractall('unzipped_content')
# 
# print("Download and extraction complete.")

In [62]:
train_json_path = 'unzipped_content/train.json'
test_json_path = 'unzipped_content/test.json'

In [63]:
## Run this cell if you are on colab

# train_json_path = os.path.join("content/", train_json_path)
# test_json_path = os.path.join("content/", test_json_path)

In [64]:
train_data = pd.read_json(train_json_path)
print("Sample training data entry:")
print(train_data["rooms"][0])

Sample training data entry:
{'id': 3175858, 'title': 'رزرو ویلا با استخر روباز آبگرم در چهارباغ', 'description': '**رزرو ویلا با استخر روباز آبگرم در چهارباغ **\nاین ویلا دو خوابه که یک اتاق خواب مستر دارد با استخر چهارفصل روباز آبگرم در حیاط دنج و باصفای ویلا مزین به آبنما، فضای سبز، آتشدان و باربیکیو در منطقه ای امن و آرام از چهارباغ واقع شده است و با شهر کرج حدود ۲۵ کیلومتر فاصله دارد.\nبام تهرانی موجود به همراه فضای حیاط دلنشین ویلا می تواند اوقات خوشی را جهت شب نشینی و بهره بردن از آب و هوای منطقه فراهم آورد.\nمحیط اطراف ویلا از چهار طرف با دیوار بلند حصارکشی شده است و یک واحد نگهبانی شبانه در کانکس موجود در محوطه کوچه مستقر می باشد.\nاز این ویلا با حدود ۳ دقیقه رانندگی دسترسی به نانوایی و سوپرمارکت امکان\u200cپذیر است.\nکیفیت خطوط شبکه برای ایرانسل و همراه اول در مکالمه عالی و پوشش اینترنت ۴g می باشد.\nلازم به ذکر است حدود ۴۰۰ متر مسیر انتهایی ویلا جاده خاکی مناسب برای عبور و مرور وسایل نقلیه می باشد.', 'province': {'id': 'p31', 'name': 'البرز'}, 'city': {'id': 363, 'name': 'چهار

In [65]:
test_data = pd.read_json(test_json_path)
print("Test data structure:")
print(test_data.head())

Test data structure:
             rooms
0  {'id': 3160664}
1  {'id': 3195184}
2  {'id': 3224078}
3  {'id': 3233712}
4  {'id': 3201449}


## Preprocessing and Feature Engineering

In this section, we will create the target labels for our model. We will determine if an accommodation has a pool by searching for the keyword `استخر` (Persian for "pool") in its `title` and `description`.

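Because these labels are derived from the listing text rather than from the images themselves, they are only a proxy: a listing can mention a pool that does not appear in every photo. The quality of this labeling step therefore directly limits how accurate the trained model can be.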

In [66]:
train_df = pd.DataFrame()

for i in train_data['rooms']:
  # Check for the keyword "استخر" in both "title" and "description"
  has_pool = "استخر" in i.get("title", "") or "استخر" in i.get("description", "")
  i["pool"] = has_pool
  train_df = pd.concat([train_df, pd.DataFrame.from_records([{"id": i["id"], 'pool': i["pool"]}])])
  
print('\nValue counts for the target variable:')
print(train_df['pool'].value_counts())


Value counts for the target variable:
pool
True     807
False    792
Name: count, dtype: int64
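
The keyword rule described in this section can also be written as a small function that builds all label records in one pass. This is a sketch under stated assumptions: the helper names `has_pool_keyword` and `label_rooms` are hypothetical, and the `or ""` guard assumes that a `title` or `description` may be present but null. Note that a plain substring match also fires when the text says a property has no pool, so the labels remain approximate.

```python
POOL_KEYWORD = "استخر"  # Persian for "pool"

def has_pool_keyword(room, keyword=POOL_KEYWORD):
    """Return True if the keyword occurs in the title or the description."""
    # `or ""` covers keys that exist but hold None (an assumption about the data)
    title = room.get("title") or ""
    description = room.get("description") or ""
    return keyword in title or keyword in description

def label_rooms(rooms):
    """Build one {"id", "pool"} record per accommodation."""
    return [{"id": room["id"], "pool": has_pool_keyword(room)} for room in rooms]

# Toy records, not taken from the real dataset
rooms = [
    {"id": 1, "title": "ویلا با استخر", "description": ""},
    {"id": 2, "title": "سوئیت", "description": None},
]
records = label_rooms(rooms)
print(records)
```

Passing `records` to `pd.DataFrame(records)` builds the frame in a single call, which avoids repeated `pd.concat` calls inside a loop and yields a default `0..n-1` index.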


In [67]:
train_df

Unnamed: 0,id,pool
0,3175858,True
0,3237321,True
0,3154228,True
0,3169850,False
0,3207557,False
...,...,...
0,3167459,True
0,3207406,True
0,3172620,False
0,3237948,False


## Loading and Preprocessing Images

Now, we will load the images from their respective directories. We will create a list of all image paths and their corresponding labels (`True`/`False` for the presence of a pool).

In [68]:
image_dir = 'unzipped_content/train'

## Run this as well if you're on colab ##
# image_dir = os.path.join("content/", image_dir)
# print(image_dir)

In [69]:
image_dir = os.path.join(os.getcwd(), "unzipped_content/train")

all_locations = []
for index, row in train_df.iterrows():
    location_id = row['id']  # avoid shadowing the built-in id()
    pool_label = row['pool']
    path = os.path.join(image_dir, str(location_id))
    if os.path.exists(path):
        # Collect the full path of every image file in this location's folder
        one_location_path = [os.path.join(path, img_path) for img_path in os.listdir(path)]
        all_locations.append([location_id, pool_label, one_location_path])

In [70]:
image_paths_labels = []
for location_data in all_locations:
    pool_label = location_data[1]
    image_paths = location_data[2]
    for image_path in image_paths:
        image_paths_labels.append((image_path, pool_label))

print("Number of image path and label pairs:", len(image_paths_labels))
print("First 5 pairs:", image_paths_labels[:5])

Number of image path and label pairs: 15841
First 5 pairs: [('C:\\Users\\Pouyan\\PycharmProjects\\Pool_Detector\\unzipped_content/train\\3175858\\3175858220912224333.jpg', True), ('C:\\Users\\Pouyan\\PycharmProjects\\Pool_Detector\\unzipped_content/train\\3175858\\3175858220912224711.jpg', True), ('C:\\Users\\Pouyan\\PycharmProjects\\Pool_Detector\\unzipped_content/train\\3175858\\3175858220912224834.jpg', True), ('C:\\Users\\Pouyan\\PycharmProjects\\Pool_Detector\\unzipped_content/train\\3175858\\3175858220912224844.jpg', True), ('C:\\Users\\Pouyan\\PycharmProjects\\Pool_Detector\\unzipped_content/train\\3175858\\3175858220912225459.jpg', True)]


### Creating a TensorFlow Dataset

To train the model efficiently, we'll use `tf.data.Dataset`. This allows for high-performance data loading pipelines, including on-the-fly preprocessing, shuffling, and batching.

In [71]:
# Separate paths and labels
image_paths = [item[0] for item in image_paths_labels]
labels = [item[1] for item in image_paths_labels]

# Create TensorFlow datasets from the lists
path_ds = tf.data.Dataset.from_tensor_slices(image_paths)
label_ds = tf.data.Dataset.from_tensor_slices(tf.cast(labels, tf.int32))

# Zip the paths and labels together
full_ds = tf.data.Dataset.zip((path_ds, label_ds))

In [72]:
def load_and_preprocess_image(image_path, label):
    # Read the image file
    img = tf.io.read_file(image_path)
    # Decode the JPEG image to 3 channels (RGB)
    img = tf.image.decode_jpeg(img, channels=3)
    # Resize all images to a consistent size
    img = tf.image.resize(img, [128, 128])
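    # Pixel values are left in [0, 255]; no rescaling is applied here.
    # Note: the SELU head's self-normalizing property assumes standardized inputs,
    # so it is not a substitute for input scaling.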
    return img, label

# Apply the preprocessing function to the dataset
processed_ds = full_ds.map(load_and_preprocess_image)

### Splitting Data and Creating Batches

We will split the dataset into training (80%) and validation (20%) sets. We'll also shuffle the data, create batches, and use prefetching for better performance.

In [73]:
dataset_size = len(image_paths_labels)
train_size = int(0.8 * dataset_size)
val_size = dataset_size - train_size

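# Shuffle once with a fixed order. reshuffle_each_iteration=False keeps take() and
# skip() on the same permutation, so the two splits cannot overlap.
processed_ds = processed_ds.shuffle(buffer_size=dataset_size, seed=42,
                                    reshuffle_each_iteration=False)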

# Split into training and validation sets
train_ds = processed_ds.take(train_size)
val_ds = processed_ds.skip(train_size)

# Configure the datasets for performance
BATCH_SIZE = 128
train_ds = train_ds.batch(BATCH_SIZE).cache().prefetch(buffer_size=tf.data.AUTOTUNE)
val_ds = val_ds.batch(BATCH_SIZE).cache().prefetch(buffer_size=tf.data.AUTOTUNE)

print(f"Training dataset: {train_ds}")
print(f"Validation dataset: {val_ds}")

Training dataset: <PrefetchDataset element_spec=(TensorSpec(shape=(None, 128, 128, 3), dtype=tf.float32, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>
Validation dataset: <PrefetchDataset element_spec=(TensorSpec(shape=(None, 128, 128, 3), dtype=tf.float32, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>
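
Every accommodation contributes several images that share one label, so an image-level random split can place photos of the same property in both the training and the validation set. One way to avoid that is to split by accommodation ID first and only then expand to image paths. The sketch below is an illustrative alternative, not the notebook's method; the function name `split_by_location` and the parameters `val_fraction` and `seed` are hypothetical. It works on entries shaped like those in `all_locations` (`[id, label, paths]`).

```python
import random

def split_by_location(locations, val_fraction=0.2, seed=42):
    """Split [id, label, paths] entries so that no ID appears in both sets."""
    ids = sorted({loc[0] for loc in locations})
    rng = random.Random(seed)          # local RNG keeps the split reproducible
    rng.shuffle(ids)
    n_val = max(1, int(len(ids) * val_fraction))
    val_ids = set(ids[:n_val])
    train = [loc for loc in locations if loc[0] not in val_ids]
    val = [loc for loc in locations if loc[0] in val_ids]
    return train, val

# Toy data: 10 locations with 2 image paths each
toy = [[i, i % 2 == 0, [f"{i}/a.jpg", f"{i}/b.jpg"]] for i in range(10)]
train_locs, val_locs = split_by_location(toy)
print(len(train_locs), len(val_locs))
```

The two returned lists can then be flattened into `(path, label)` pairs and turned into separate `tf.data.Dataset` objects, so shuffling the training set never affects which images are used for validation.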


## Building and Training the Model

We will now define and compile a Convolutional Neural Network (CNN) for our image classification task. The model architecture consists of several convolutional blocks to extract features from the images, followed by a dense head to classify them.

In [74]:
model = keras.Sequential([
    keras.Input(shape=(128, 128, 3)),
    
    # Block 1: Gradually extract features
    layers.Conv2D(32, (5,5), activation='relu', padding='same'),
    layers.MaxPooling2D((2, 2)),
    
    # Block 2
    layers.Conv2D(64, (3,3), activation='relu', padding='same'),
    layers.MaxPooling2D((2, 2)),

    # Block 3
    layers.Conv2D(128, (5,5), activation='relu', padding='same'),
    layers.MaxPooling2D((2, 2)),

    # Classifier Head with SELU
    layers.Flatten(),
    layers.Dense(128, activation='selu',
                 kernel_initializer="lecun_normal",
                 kernel_regularizer=keras.regularizers.l2(0.01)),
    layers.AlphaDropout(0.5),
    layers.Dense(1, activation='sigmoid')
])

In [75]:
model.compile(
    optimizer=keras.optimizers.Nadam(learning_rate=0.001),
    loss=keras.losses.BinaryCrossentropy(),
    metrics=[
        "accuracy",
        keras.metrics.Precision(),
        keras.metrics.Recall()
    ]
)

In [76]:
EPOCHS = 128

# Define callbacks
callbacks_list = [
    tf.keras.callbacks.ModelCheckpoint(
        filepath="best_model.h5",   # File to save the model
        monitor="val_loss",  # The metric to monitor
        save_best_only=True         # Only save when the metric improves
    ),
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss",
        patience=32,
        restore_best_weights=True   # Restore weights from the epoch with the lowest val_loss
    ),
]

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=EPOCHS,
    callbacks=callbacks_list
)

# If early stopping triggers, the 'model' object holds the weights from the epoch
# with the lowest val_loss. 'best_model.h5' contains that best model either way.

Epoch 1/128
...
Epoch 28/128

KeyboardInterrupt: 

## Evaluation Metric

The primary metric for evaluating our model's performance is the **F1-score**, the harmonic mean of Precision and Recall. It is high only when both are high, so a model cannot score well by favoring one at the expense of the other.

$F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$
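
As a worked illustration of the formula, the sketch below computes F1 from two lists of boolean labels. The function name `f1_from_labels` is hypothetical, and returning `0.0` when precision or recall is undefined is a convention chosen here, not something the source specifies.

```python
def f1_from_labels(y_true, y_pred):
    """Compute F1 for the positive (True) class from boolean label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0  # convention for the undefined case (assumption)
    return 2 * precision * recall / (precision + recall)

# Toy labels: 2 true positives, 1 false positive, 1 false negative
y_true = [True, True, False, False, True]
y_pred = [True, False, True, False, True]
print(round(f1_from_labels(y_true, y_pred), 3))
```

In this toy case precision and recall are both 2/3, so F1 is also 2/3.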

## Generating Predictions for Test Data

Now we will use our trained model to make predictions on the test dataset. We'll follow a similar preprocessing pipeline for the test images.

In [81]:
test_image_dir = os.path.join(os.getcwd(), "unzipped_content/test")

## Run this as well if you're on colab ##
# test_image_dir = os.path.join("content/", test_image_dir)

# Create a list of test image paths, grouped by location ID
test_image_paths_by_location = []
for index, row in test_data.iterrows():
    location_id = row['rooms']['id']  # avoid shadowing the built-in id()
    path = os.path.join(test_image_dir, str(location_id))
    one_location_img_files = os.listdir(path)
    one_location_paths = [os.path.join(path, img_path) for img_path in one_location_img_files]
    test_image_paths_by_location.append((location_id, one_location_paths))

print("Number of locations in test data with image paths:", len(test_image_paths_by_location))

Number of locations in test data with image paths: 861


In [82]:
# Flatten the list to get all image paths and their corresponding IDs
test_image_paths = []
test_ids = []
for location_id, paths in test_image_paths_by_location:
    test_image_paths.extend(paths)
    test_ids.extend([location_id] * len(paths))

# Create a TensorFlow dataset for test images
test_path_ds = tf.data.Dataset.from_tensor_slices(test_image_paths)

# Preprocess the test images exactly like the training images (decode, resize, no rescaling)
def preprocess_test_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, [128, 128])
    return img

processed_test_ds = test_path_ds.map(preprocess_test_image, num_parallel_calls=tf.data.AUTOTUNE)

# Batch the test dataset
test_ds_batched = processed_test_ds.batch(BATCH_SIZE).prefetch(buffer_size=tf.data.AUTOTUNE)

print("Test dataset ready for prediction.")

Test dataset ready for prediction.


In [83]:
# Make predictions on the test dataset
predictions = model.predict(test_ds_batched)

# Convert probabilities to boolean labels (True/False)
predicted_labels = (predictions > 0.5).flatten()

print(f"Number of predictions made: {len(predicted_labels)}")
print("First 15 predicted labels:", predicted_labels[:15])

Number of predictions made: 8505
First 15 predicted labels: [False  True False False  True  True False False  True False False False
 False False False]


In [84]:
# Since there are multiple images per location, we need to aggregate the predictions.
# Strategy: a location has a pool if at least half of its images are predicted to have one
# (ties count as True because of the >= comparison below).

# Create a dictionary to store predictions for each location ID
location_predictions = {}
for i, label in enumerate(predicted_labels):
    location_id = test_ids[i]
    if location_id not in location_predictions:
        location_predictions[location_id] = []
    location_predictions[location_id].append(label)

# Aggregate predictions by location using a majority vote
final_predictions = {}
for location_id, preds in location_predictions.items():
    num_true = sum(preds)
    num_false = len(preds) - num_true
    if num_true >= num_false:
        final_predictions[location_id] = True
    else:
        final_predictions[location_id] = False

# Create the submission DataFrame
submission = pd.DataFrame(list(final_predictions.items()), columns=['id', 'pool'])

print(submission.head())
print(f"\nSubmission DataFrame shape: {submission.shape}")

        id   pool
0  3160664  False
1  3195184  False
2  3224078   True
3  3233712   True
4  3201449   True

Submission DataFrame shape: (861, 2)
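
The vote above discards how confident each per-image prediction was. A possible alternative, shown here only as a sketch and not as the notebook's method, is to average the predicted probabilities per location and threshold the mean. The function name `aggregate_by_mean` and the `threshold` parameter are hypothetical.

```python
from collections import defaultdict

def aggregate_by_mean(ids, probabilities, threshold=0.5):
    """Average per-image probabilities for each location, then threshold the mean."""
    grouped = defaultdict(list)
    for location_id, prob in zip(ids, probabilities):
        grouped[location_id].append(float(prob))
    return {loc: (sum(p) / len(p)) >= threshold for loc, p in grouped.items()}

# Toy values: location 101 has one very confident positive image
toy_ids = [101, 101, 101, 202, 202]
toy_probs = [0.9, 0.4, 0.45, 0.2, 0.6]
toy_result = aggregate_by_mean(toy_ids, toy_probs)
print(toy_result)
```

In the toy data, location 101 would be `False` under a per-image vote at 0.5 (one positive image out of three) but `True` under mean aggregation, because a single pool photo with a high score outweighs two borderline ones. With the notebook's variables, the call would look like `aggregate_by_mean(test_ids, predictions.flatten())`.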


## Generating the Submission File

After feature engineering and modeling, you have an algorithm that can predict the target variable from the independent variables. Use this model to predict the samples in the test data and prepare the results in the following DataFrame format.

| Column | Description                               |
|--------|-------------------------------------------|
| `id`   | Unique identifier of the accommodation    |
| `pool` | Predicted presence of a pool (True/False) |

The DataFrame must be named `submission`; otherwise, the evaluation system cannot assess your work. It should contain two columns, `id` and `pool`, and have one row for each unique ID in the test set. Below is an example of the first 5 rows of the `submission` DataFrame. Your predicted values in the `pool` column may differ.

| id      | pool  |
|---------|-------|
| 3160664 | True  |
| 3195184 | False |
| 3224078 | False |
| 3233712 | True  |
| 3201449 | True  |

## Create Submission Archive

Run the following cell to create the `result.zip` file. Ensure that you have saved the notebook (`Ctrl+S`) before running this cell. If you are using Google Colab, download the latest version of your notebook and include it in the `result.zip` file before submission.

In [85]:
import zipfile
import os

# Save the submission dataframe to a CSV file
submission.to_csv('submission.csv', index=False)

# Define the files to be included in the zip archive
notebook_name = 'Pool_Detector_English.ipynb' # Make sure this matches your notebook's filename
file_names = [notebook_name, 'submission.csv']

def compress(file_names):
    print("Files to be zipped:")
    print(file_names)
    compression = zipfile.ZIP_DEFLATED
    with zipfile.ZipFile("result.zip", mode="w") as zf:
        for file_name in file_names:
            if os.path.exists(file_name):
                zf.write(file_name, compress_type=compression)
            else:
                print(f"Warning: {file_name} not found and will not be included in the zip file.")

compress(file_names)
print("\nresult.zip created successfully.")

Files to be zipped:
['Pool_Detector_English.ipynb', 'submission.csv']

result.zip created successfully.


In [86]:
# Optional: Save the trained model for future use
model.save('pool_detector_model.h5')