# Hotel Recognition to Combat Human Trafficking | Exploratory Data Analysis
    2021-04-05
    Edward Sims

## Introduction
Victims of human trafficking are often photographed in hotel rooms as in the below examples. Identifying these hotels is vital to these trafficking investigations but poses particular challenges due to low quality of images and uncommon camera angles.

![Image from Kaggle competition page: https://www.kaggle.com/c/hotel-id-2021-fgvc8/overview/description](images/example_victim_images.png)

Even without victims in the images, hotel identification in general is a challenging fine-grained visual recognition task with a huge number of classes and potentially high intraclass and low interclass variation. In order to support research into this challenging task and create image search tools for human trafficking investigators, we created the TraffickCam mobile application, which allows everyday travelers to submit photos of their hotel room. Read more about [TraffickCam on TechCrunch](https://techcrunch.com/2016/06/25/traffickcam/).

## Evaluation
Submissions are evaluated according to the Mean Average Precision @ 5 (MAP@5):

<img src="images/map5_formula.png" alt="MAP@5 formula" style="width: 300px;"/>

where ***U*** is the number of images, ***P(k)*** is the precision at cutoff ***k***, ***n*** is the number of predictions per image, and ***rel(k)*** is an indicator function equaling 1 if the item at rank ***k*** is a relevant (correct) label, and zero otherwise.

Once a correct label has been scored for an observation, that label is no longer considered relevant for that observation, and additional predictions of that label are skipped in the calculation. For example, if the correct label is ***A*** for an observation, the following predictions all score an average precision of 1.0.

```
A B C D E
A A A A A
A B A C A
```
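
As a rough illustration of the scoring rule above, the following is a minimal sketch assuming a single correct label per image. The function names `average_precision_at_5` and `map_at_5` are hypothetical and not part of any competition code. Because only the first hit counts, the average precision for one image reduces to 1 / rank of the first correct prediction within the top five.

```python
def average_precision_at_5(true_label, predictions):
    """AP@5 for one image, assuming exactly one correct label."""
    for rank, label in enumerate(predictions[:5], start=1):
        if label == true_label:
            # Only the first hit is scored; repeats of the label are skipped
            return 1.0 / rank
    return 0.0


def map_at_5(true_labels, predictions_per_image):
    """Mean of AP@5 over all images."""
    scores = [
        average_precision_at_5(true_label, predictions)
        for true_label, predictions in zip(true_labels, predictions_per_image)
    ]
    return sum(scores) / len(scores)


# The three example prediction rows, with correct label "A"
examples = ["A B C D E", "A A A A A", "A B A C A"]
scores = [average_precision_at_5("A", row.split()) for row in examples]
print(scores)  # [1.0, 1.0, 1.0]
```

A correct label in second place, such as `B A C D E`, would score 0.5 under this sketch.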

## Submission File
For each image in the test set, you must predict a space-delimited list of hotel IDs that could match that image. The list should be sorted such that the first ID is considered the most relevant one and the last the least relevant one. The file should contain a header and have the following format:

```
image,hotel_id 
99e91ad5f2870678.jpg,36363 53586 18807 64314 60181
b5cc62ab665591a9.jpg,36363 53586 18807 64314 60181
d5664a972d5a644b.jpg,36363 53586 18807 64314 60181
```
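
A file in this format could be written along the following lines. This is only a sketch: `write_submission` is a hypothetical helper, the predictions are made up from the sample rows above, and truncating to five IDs is an assumption based on the MAP@5 metric.

```python
import csv
import io


def write_submission(predictions, out_file):
    """Write (image name -> ranked hotel IDs) in the submission format."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["image", "hotel_id"])
    for image_name, ranked_hotel_ids in predictions.items():
        # Most relevant ID first, at most five IDs, separated by single spaces
        ids = " ".join(str(hotel_id) for hotel_id in ranked_hotel_ids[:5])
        writer.writerow([image_name, ids])


# Hypothetical predictions for one test image
predictions = {"99e91ad5f2870678.jpg": [36363, 53586, 18807, 64314, 60181]}

buffer = io.StringIO()
write_submission(predictions, buffer)
print(buffer.getvalue())
```

Passing an open file handle instead of the `io.StringIO` buffer would write the same content to disk.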

## Data 
Identifying the location of a hotel room is a challenging problem of great interest for combating human trafficking. This competition provides a rich dataset of photos of hotel room interiors, without any people present, for this purpose.

Many of the hotels are independent or part of very small chains, where shared decor isn't a concern. However, the larger chains share standards for their interior decoration, which means that many of their hotels can look quite similar at first glance. Identifying the chain can narrow the range of possibilities, but only down to a set that is much harder to tell apart and is still scattered across a wide geographic area. The real value lies in getting the number of candidates to a small enough number that a human investigator could follow up on all of them.

### Files
**train.csv** - The training set metadata.
 - `image` - The image ID.
 - `chain` - An ID code for the hotel chain. A chain of zero (0) indicates that the hotel is either not part of a chain or the chain is not known. This field is not available for the test set. The number of hotels per chain varies widely.
 - `hotel_id` - The hotel ID. The target class.
 - `timestamp` - When the image was taken. Provided for the training set only.
 
**sample_submission.csv** - A sample submission file in the correct format.
 - `image` - The image ID.
 - `hotel_id` - The hotel ID. The target class.
 
**train_images** - The training set contains 97000+ images from around 7700 hotels from across the globe. All of the images for each hotel chain are in a dedicated subfolder for that chain.

**test_images** - The test set images. This competition has a hidden test set: only three images are provided here as samples while the remaining 13,000 images will be available to your notebook once it is submitted.

## 1.00 Import Packages

In [1]:
# Data manipulation packages
import pandas as pd
import numpy as np
from scipy import stats
from datetime import datetime as dt
import cv2

# General packages
import multiprocessing
import pickle
import os
import gc
import random
from tqdm import tqdm, tqdm_notebook
import time
import warnings

# Data vis packages
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Package options
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", 50)
plt.rcParams["figure.figsize"] = [14, 8]

## 2.00 Read in data
### 2.01 Get paths and set initial parameters

In [2]:
# Data paths
data_dir_path         = "../input/hotel-id-2021-fgvc8"
train_images_dir_path = os.path.join(data_dir_path, "train_images")
test_images_dir_path  = os.path.join(data_dir_path, "test_images")

train_metadata_path   = os.path.join(data_dir_path, "train.csv")
sample_sub_path       = os.path.join(data_dir_path, "sample_submission.csv")

# Read csv data
train_metadata        = pd.read_csv(train_metadata_path, parse_dates=["timestamp"])
sample_sub            = pd.read_csv(sample_sub_path)

In [3]:
# Set image parameters
ROWS     = 128 # Default row size
COLS     = 128 # Default col size
CHANNELS = 3

### 2.02 Read in images

In [4]:
def read_images(image_dir_path, rows, cols, loading_bar=True):
    """Reads images into np.array from directory of image files.

    Parameters
    ----------
    image_dir_path : str
        Directory of images to read from, with one subdirectory per chain ID.
    rows : int
        Image height to resize to.
    cols : int
        Image width to resize to.
    loading_bar : bool
        Include loading bar or non-verbose.

    Returns
    -------
    np.array
        Array of images read in.

    """
    # Read image data
    image_list = []

    # Each subdirectory is a chain_id; disable=True hides the tqdm loading bar
    for chain_id in tqdm(os.listdir(image_dir_path), disable=not loading_bar):
        chain_id_dir_path = os.path.join(image_dir_path, chain_id)
        # Read only the first 10 images listed in each chain_id subdirectory
        for image_name in os.listdir(chain_id_dir_path)[0:10]:
            image_path = os.path.join(chain_id_dir_path, image_name)
            # cv2.imread returns None (rather than raising) for unreadable files
            image = cv2.imread(image_path)
            if image is None:
                continue
            # cv2.resize expects the target size as (width, height)
            image = cv2.resize(image, (cols, rows))
            # Append to list of images
            image_list.append(image)

    # Convert image list to array
    image_list = np.array(image_list)

    return image_list

In [5]:
def read_images_2(image_dir_path, rows, cols, loading_bar=True):
    """Reads images into np.array from directory of image files.

    Parameters
    ----------
    image_dir : list
        Directory of images to read from.
    rows : int
        Image height to resize to.
    cols : int
        Image width to resize to.
    loading_bar : bool
        Include loading bar or non-verbose.

    Returns
    -------
    np.array
        Array of images read in.

    """
    def load_image(image_path):
        image = cv2.imread(image_path)
        image = cv2.resize(image, (rows, cols))
        return image
    
    # Read image data
    image_list = []
    
    if loading_bar == True:
        for chain_id in tqdm(os.listdir(image_dir_path)):
            # Each subdirectory is a chain_id
            chain_id_dir_path = os.path.join(image_dir_path, chain_id)
            # Read images from each chain_id subdirectory
            for image in os.listdir(chain_id_dir_path)[0:10]: 
                # Read image
                image_path = os.path.join(chain_id_dir_path, image)
                try:
                    pool = multiprocessing.Pool(processes=2)
                    pool.starmap(load_image, image_path)
                    
                    # Append to list of images
                    image_list.append(image)    
                finally:
                    pool.close()
                    pool.join()

    elif loading_bar == False:
        for chain_id in os.listdir(image_dir_path):
            # Each subdirectory is a chain_id
            chain_id_dir_path = os.path.join(image_dir_path, chain_id)
            # Read images from each chain_id subdirectory
            for image in os.listdir(chain_id_dir_path)[0:10]:
                # Read image
                image_path = os.path.join(chain_id_dir_path, image)
                try:
                    pool = multiprocessing.Pool(processes=2)
                    pool.starmap(load_image, image_path)
                    
                    # Append to list of images
                    image_list.append(image)    
                finally:
                    pool.close()
                    pool.join()
    
    # Convert image list to array
    image_list = np.array(image_list)
    
    return(image_list)

In [6]:
train_images = read_images(image_dir_path=train_images_dir_path, rows=ROWS, cols=COLS)

100%|██████████| 88/88 [00:20<00:00,  4.32it/s]


In [7]:
train_images = read_images_2(image_dir_path=train_images_dir_path, rows=ROWS, cols=COLS)

  0%|          | 0/88 [00:00<?, ?it/s]


AttributeError: Can't pickle local object 'read_images_2.<locals>.load_image'
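
The error above arises because `multiprocessing` pickles the function it sends to worker processes, and a function defined inside another function cannot be pickled. One possible workaround is to define the loader at module level. Another, sketched below, is to use threads, which share memory and so need no pickling. The helper name `load_all` and its parameters are hypothetical, and whether threads actually speed up image loading here is untested.

```python
from concurrent.futures import ThreadPoolExecutor


def load_all(image_paths, load_fn, workers=2):
    """Apply load_fn to every path using a pool of threads."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map returns results in the same order as image_paths
        return list(executor.map(load_fn, image_paths))


# Stand-in loader for demonstration: returns the length of each path string
lengths = load_all(["a.jpg", "bb.jpg"], len)
print(lengths)  # [5, 6]

# Hypothetical usage with OpenCV (not run here):
# def load_image(image_path):
#     return cv2.resize(cv2.imread(image_path), (COLS, ROWS))
# images = load_all(list_of_image_paths, load_image)
```

Unlike a process pool, this also works when `load_fn` is a nested function or a lambda.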

## 3.00 Metadata Analysis


### 3.01 Initial Checks

In [None]:
train_metadata.head()

In [None]:
train_metadata.tail()

In [None]:
train_metadata.info()

### 3.02 Duplicates

In [None]:
# Print high level stats on metadata dimensions
print(f"Train metadata dimensions: \t{train_metadata.shape}")
print(f"Number of unique records: \t{len(train_metadata.drop_duplicates())}")
print(f"Number of unique image names: \t{train_metadata['image'].nunique()}")

There appear to be two duplicated image names, but these aren't considered duplicate records. We'll investigate these two records individually.

In [None]:
# Subset duplicated image names - to all records (not just duplicates)
train_metadata.loc[train_metadata.groupby("image")["image"].transform("count") > 1, ]

It looks like these records have indeed been duplicated, and the names have just been switched around. We can remove 1 of each record. 

Although 2 records don't seem like a lot in the grand scheme of things (i.e. out of 97,554 total records), every marginal improvement in the data quality is **critical** to model improvement. 

In [None]:
# Remove 2 duplicated records
train_metadata_dupes = train_metadata.loc[train_metadata.groupby("image")["image"].transform("count") > 1, ]
train_metadata_dupes_idx = train_metadata_dupes.iloc[[1, 3]].index
train_metadata = train_metadata.drop(train_metadata_dupes_idx, axis=0)

In [None]:
# Observe any duplicates using chain, hotel_id and timestamp columns
train_metadata[train_metadata.duplicated(
    subset=["chain", "hotel_id", "timestamp"], keep=False
)].sort_values(["chain", "hotel_id", "timestamp"])

Excluding the image column, most records are flagged as duplicates, although this is unlikely to actually be the case. Rather, it may be that the timestamp column is not wholly accurate. Images may have been batch uploaded, or the timestamp column may not precisely measure the actual time these images were taken. 

The timestamp feature will be analysed in more detail to see if there is any value in including this feature as part of our model training. While it is not included as a feature in the test set, if informative, we can use it in our cross validation strategy. 

And as for possible image duplicates (where the metadata appear unique, but the actual image isn't), this will also be analysed in subsequent sections.

### 3.03 Chain ID Analysis

In [None]:
print(f"Number of unique chain ids: {train_metadata['chain'].nunique()}")

In [None]:
# Plot histogram of chain ids
sns.histplot(data=train_metadata, x="chain", stat="count", binwidth=1)
plt.xlabel("Chain ID", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("Distribution of Chain IDs in Train Metadata", fontsize=15)
plt.show()

We'll likely have to stratify by chain during cross validation. Records are heavily concentrated at the lower chain ID integers (the counts don't follow any particular distribution), and there is also some increase in counts towards the higher end of the chain ID integers. The middle range appears quite sparse, so we should take extra care not to allow our model to favour only chain IDs in the majority groups.

In [None]:
# Display descriptive stats on numbers of records per chain id
train_chain_stats = pd.DataFrame(train_metadata["chain"].value_counts())
train_chain_stats.describe().rename(columns={"chain": "Chain ID Count Statistics"})

Confirming the histogram above, there's a pretty huge standard deviation for number of records per chain id. We'll need to do some significant work and likely have to iterate on this topic to find an optimal solution.

Given that the minimum number of records for a chain is 8, it might be useful to look into sampling strategies. We could potentially upsample the low frequency chains and downsample (if needed) the chains with high frequency. This will be explored in later sections and in the modelling notebooks.
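
The upsampling and downsampling idea above could look something like the sketch below. It is only an illustration on toy records: `resample_by_group`, the `target` of 8 rows per chain and the fixed `seed` are all hypothetical choices, not a strategy taken from this notebook.

```python
import random
from collections import defaultdict


def resample_by_group(records, group_key, target, seed=0):
    """Return records resampled so that every group has exactly `target` rows."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)

    resampled = []
    for rows in groups.values():
        if len(rows) >= target:
            # Downsample frequent groups without replacement
            resampled.extend(rng.sample(rows, target))
        else:
            # Upsample rare groups: keep every row, then draw the shortfall
            # with replacement
            resampled.extend(rows)
            resampled.extend(rng.choices(rows, k=target - len(rows)))
    return resampled


# Toy metadata: chain 0 is frequent, chain 5 is rare
records = [{"image": f"a{i}.jpg", "chain": 0} for i in range(20)]
records += [{"image": f"b{i}.jpg", "chain": 5} for i in range(3)]

balanced = resample_by_group(records, group_key="chain", target=8)
```

If something like this were used, it would be safer to resample only within the training folds, since upsampling before splitting can place copies of the same image in both training and validation data.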

### 3.04 Hotel ID Analysis

In [None]:
print(f"Number of unique hotel ids: {train_metadata['hotel_id'].nunique()}")

In [None]:
# Plot histogram of hotel ids
sns.histplot(data=train_metadata, x="hotel_id", stat="count", binwidth=100, kde=True)
plt.xlabel("Hotel ID", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("Distribution of Hotel IDs in Train Metadata", fontsize=15)
plt.show()

There appears to be a fairly even distribution of hotel IDs, despite the skews in chain ID. This will be beneficial to modelling. What will be tricky for modelling is the sheer number of classes (7,770 is a huge number of classes for the size of the dataset).

In [None]:
# Display descriptive stats on numbers of records per hotel id
train_hotel_id_stats = pd.DataFrame(train_metadata["hotel_id"].value_counts())
train_hotel_id_stats.describe().rename(columns={"hotel_id": "Hotel ID Count Statistics"})

The fact that the standard deviation is less than the mean is promising, in terms of distribution. But with a minimum of 1 record and a maximum of 95 records per hotel ID, we may need to explore sampling strategies similar to those for chain ID. Stratifying by 7,770 different classes, on the other hand, might not be feasible.

In [None]:
sns.scatterplot(data=train_metadata, x="hotel_id", y="chain")
plt.xlabel("Hotel ID", fontsize=12)
plt.ylabel("Chain ID", fontsize=12)
plt.title("Hotel ID vs Chain ID", fontsize=15)
plt.show()

In [None]:
# Calculate Pearson correlation coefficient for Hotel ID and Chain ID
coeff, pvalue = stats.pearsonr(train_metadata["hotel_id"], train_metadata["chain"])
print("Hotel ID and chain ID correlation")
print(f"Pearson R Correlation Coefficient: {round(coeff, 4)} (p-value: {pvalue})")

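The near-zero Pearson correlation coefficient shows no evidence of a linear relationship between the hotel ID and chain ID integers, so the ID codes themselves don't appear to leak information about one another - which is a good thing when it comes to modelling. Note that the p-value tests the null hypothesis of zero correlation: a very small p-value would indicate that the coefficient, however tiny, is distinguishable from zero given the large sample size, rather than confirming the absence of a relationship. Both IDs are also nominal codes rather than quantities, so this check is only a rough sanity test.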

### 3.05 Timestamp

In [None]:
# Extract year, month and hour from timestamp
train_metadata["year"] = train_metadata["timestamp"].dt.year
train_metadata["month"] = train_metadata["timestamp"].dt.month
train_metadata["hour"] = train_metadata["timestamp"].dt.hour
train_metadata.head()

In [None]:
# Get counts for each year
# (rename_axis + reset_index(name=...) yields the same column names across pandas versions)
datetime_freq = train_metadata["year"].value_counts().rename_axis("year").reset_index(name="count")
datetime_freq = datetime_freq.sort_values("year")

# Plot bar chart of year
sns.barplot(data=datetime_freq, x="year", y="count", color="tab:blue")
plt.xlabel("Year", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("Number of Records by Year", fontsize=15)
plt.show()

The data span from 2015 to 2020, and there is some difference in counts for each year - particularly for 2015 and 2020.

This may seem unimportant, but the data source is an app where people upload photos of their hotel room, which indicates that the image data originate from phones. Phone camera technology has improved dramatically year-on-year, so the images from 2015 are likely to be noticeably different from those taken in 2020. It will thus be important to look into stratifying by year, or to find another way to reconcile this problem.

In [None]:
# Get counts for each month
# (rename_axis + reset_index(name=...) yields the same column names across pandas versions)
datetime_freq = train_metadata["month"].value_counts().rename_axis("month").reset_index(name="count")
datetime_freq = datetime_freq.sort_values("month")

# Plot bar chart of month
sns.barplot(data=datetime_freq, x="month", y="count", color="tab:blue")
plt.xlabel("Month", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("Number of Records by Month", fontsize=15)
plt.show()

While most months are quite equal in counts, it is unexpected that the summer months of June and July (6 & 7) have the highest numbers. This may not necessarily affect the model, but it is useful knowledge to have as we might be able to add it to the cross validation strategy. If a robust way to account for this information in the training strategy can be found, it would likely benefit modelling.

The reason why this might make a difference is that there could be some data leakage if we ignore the month. Some seasonal hotels are only open during summer months or vice versa, or at the very least there will be a seasonal factor involved (more images of seaside hotels occur in summer months). The images themselves may also differ by season; for example, images are generally brighter during summer months than winter months. Model generalisation can only improve if we take this into account.

In [None]:
# Get counts for each hour
# (rename_axis + reset_index(name=...) yields the same column names across pandas versions)
datetime_freq = train_metadata["hour"].value_counts().rename_axis("hour").reset_index(name="count")
datetime_freq = datetime_freq.sort_values("hour")

# Plot bar chart of hour
sns.barplot(data=datetime_freq, x="hour", y="count", color="tab:blue")
plt.xlabel("Hour", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("Number of Records by Hour", fontsize=15)
plt.show()

Interestingly, the above suggests that most images have been taken during darker hours - which will certainly have an impact on the image. Given our reservations about how accurate the time fragment of the timestamp is though, we will have to confirm this using the actual images to see if this is actually the case. It might be the case that images aren't uploaded as they are taken, or that the timestamp may not be the local time for the user, but rather the time logged at the server location. 

If the data are accurate though, we might want to consider comparing model performance for RGB vs grayscale images. 

In [None]:
# Extract is_weekend from timestamp
def get_is_weekend(timestamp_col):
    """
    Returns boolean for whether timestamp is a weekend.
    """
    # pandas numbers weekdays from Monday=0 to Sunday=6, so weekends are 5 & 6
    return timestamp_col.dt.weekday >= 5

train_metadata["is_weekend"] = get_is_weekend(train_metadata["timestamp"])

In [None]:
# Get counts for is_weekend
# (rename_axis + reset_index(name=...) yields the same column names across pandas versions)
datetime_freq = train_metadata["is_weekend"].value_counts().rename_axis("is_weekend").reset_index(name="count")
datetime_freq = datetime_freq.sort_values("is_weekend")

# Replace booleans with readable labels
datetime_freq["is_weekend"] = datetime_freq["is_weekend"].map({True: "Weekend", False: "Weekday"})

# Plot bar chart of is_weekend
sns.barplot(data=datetime_freq, x="is_weekend", y="count", color="tab:blue")
plt.xlabel("Weekday or Weekend", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("Number of Records for Weekends and Weekdays", fontsize=15)
plt.show()

Given that there are 5 weekdays and only 2 weekend days, it's expected that there are more records for weekdays. But this is important to look into, for the same reasons as looking into month. Hotels, such as airport hotels, may have a higher frequency of weekday images. Again, however, we cannot know from the competition information whether users are uploading images as soon as they are taken - they could be taken and uploaded as a batch following the visit. This feature of the data could be useful, depending on data quality, but won't be prioritised over the less granular features created from timestamp - such as year and month.

In [None]:
# Get correlation coefficients for all features
corr = train_metadata.drop(["image", "timestamp"], axis=1).corr()
mask = np.zeros_like(corr)
mask[np.triu_indices_from(mask)] = True

sns.heatmap(data=corr, annot=True, linewidths=0.5, mask=mask)
plt.title("Corrplot of Train Metadata Features", fontsize=15)
plt.show()

In [None]:
del train_metadata_dupes, train_metadata_dupes_idx, train_chain_stats, train_hotel_id_stats, coeff
del pvalue, datetime_freq, corr, mask

The above confirms that there are no correlations that we should be aware of (particularly in terms of how hotel ID and chain ID integers are allocated).

## 4.00 Train Images

TODO:
 - rgb distribution
 - avg rgb by chain
 - avg rgb by month
 - rgb distribution by month (2 plots for dec and jul)
 - image brightness by hour (https://stackoverflow.com/questions/14243472/estimate-brightness-of-an-image-opencv)
 - image brightness by month
 
 - Anomaly detection (https://github.com/SIlvaMFPedro/pyimagesearch/tree/master/anomaly-detection)
 - Detect low contrast images (https://github.com/SIlvaMFPedro/pyimagesearch/tree/master/detect-low-contrast)
 - Colour correction (https://github.com/SIlvaMFPedro/pyimagesearch/tree/master/opencv-color-correction)
 
 
### 4.01 RGB Analysis

### Image Duplicates

In [None]:
!time python3 ../input/multiprocessing_images/extract.py --images ../input/multiprocessing_images/images \
--output ../input/multiprocessing_images/temp_output --hashes ../input/multiprocessing_images/hashes.pickle

pd.read_pickle(r'../input/multiprocessing_images/hashes.pickle')
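
The contents of `extract.py` are not shown here, so the following is not a reconstruction of it. As a simpler, hypothetical starting point, exact byte-for-byte duplicates could be found by hashing each file's contents. The function name `find_exact_duplicates` and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib
import os
from collections import defaultdict


def find_exact_duplicates(image_dir_path):
    """Group files under image_dir_path that share identical bytes."""
    paths_by_hash = defaultdict(list)
    # os.walk also descends into the per-chain subdirectories
    for root, _dirs, files in os.walk(image_dir_path):
        for file_name in sorted(files):
            file_path = os.path.join(root, file_name)
            with open(file_path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            paths_by_hash[digest].append(file_path)

    # Keep only hashes shared by more than one file
    return {
        digest: paths
        for digest, paths in paths_by_hash.items()
        if len(paths) > 1
    }
```

This would only catch files that are identical byte for byte. Images that have been resized or re-encoded would need a perceptual hash or a similar technique instead.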