# Correlated

## Problem Statement

In this activity, you will use computer vision (CV) to detect the points in each plot and then calculate the correlation (range: [-1, 1]) between the X and Y positions of the points.

Attached Files
[CorrCV.package.tar.xz](https://api.t.cyberthon24.ctf.sg/file?id=clu5pv0wj0d2u0806jxncy6tf&name=CorrCV.package.tar.xz)

## Solution

We use `cv2` (OpenCV) to locate the points in each image and SciPy to calculate the correlation between their coordinates.
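
For reference, the Pearson coefficient that `scipy.stats.pearsonr` reports can be reproduced by hand. The sketch below is only an illustration: `pearson_r` is a hypothetical helper name, and the sample coordinates are made up rather than taken from the challenge data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Illustrative coordinates only, not challenge data.
xs = [10, 20, 30, 40, 50]
ys = [12, 25, 29, 48, 51]
r = pearson_r(xs, ys)
print(r)
```

The value always lies in [-1, 1], which matches the range given in the problem statement.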


Upload `CorrCV.package.tar.xz` to Google Drive and mount the drive.
The code below is meant to be run on Google Colab; the `tar` command assumes the archive is in the current working directory.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [2]:
!tar -xf CorrCV.package.tar.xz

In [3]:
import cv2
import numpy as np
import scipy.stats
from sklearn.metrics import mean_squared_error
import pandas as pd

The function `process_image` reads the image with OpenCV, converts it to grayscale, and applies an inverted binary threshold with Otsu's method, so that pixels darker than the threshold become foreground. It then finds the contours in the binary image, skips any contour whose area is below 28, and computes the centroid of each remaining contour from its image moments. The y coordinate is flipped (`image.shape[0] - cY`) because image rows are counted from the top, whereas the y axis of the plot points upward. Finally, it returns the Pearson correlation coefficient between the x and y coordinates of the centroids.

Note: Every isolated point has the same contour area (28.0). A contour with a larger area is therefore treated as several points clumped together, and its centroid is added `round(area / 28)` times.
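
The counting rule in the note can be sketched on its own, without OpenCV. `estimate_point_count` and `SINGLE_POINT_AREA` are hypothetical names introduced for illustration, and the areas in the loop are made-up values.

```python
SINGLE_POINT_AREA = 28.0  # contour area of one isolated point, as noted above

def estimate_point_count(contour_area, single_area=SINGLE_POINT_AREA):
    """Estimate how many points a blob contains from its contour area."""
    if contour_area < single_area:
        return 0  # smaller than a single point: skipped
    return round(contour_area / single_area)

# Made-up areas: noise, one point, and blobs of roughly two and three points.
for area in (12.0, 28.0, 55.0, 90.0):
    print(area, estimate_point_count(area))
```

This is an approximation: every copy of a clumped blob shares the blob's centroid, and overlapping points cover less area than the sum of their individual areas, so heavy overlap may be undercounted.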

In [4]:
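def process_image(image_path):
    # Load the plot and binarize it: inverted Otsu thresholding turns
    # pixels darker than the threshold into white foreground.
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

    points = []
    for cnt in contours:
        # Skip contours smaller than a single point (area 28.0).
        area = cv2.contourArea(cnt)
        if area < 28.0:
            continue

        # Centroid from the image moments.
        M = cv2.moments(cnt)
        if M["m00"] != 0:
            cX = int(M["m10"] / M["m00"])
            cY = int(M["m01"] / M["m00"])
            # Image rows grow downward; flip so that y grows upward as in the plot.
            cY = image.shape[0] - cY
            # A blob of several clumped points is counted once per estimated point.
            for _ in range(round(area / 28)):
                points.append((cX, cY))

    points = np.array(points)
    x = points[:, 0]
    y = points[:, 1]

    # Pearson correlation coefficient between the x and y coordinates.
    correlation = scipy.stats.pearsonr(x, y)[0]
    return correlation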
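
The line `cY = image.shape[0] - cY` matters because mirroring the y axis flips the sign of the correlation. The sketch below illustrates this with made-up pixel coordinates and a hypothetical image height; it is not part of the original notebook.

```python
import numpy as np

image_height = 150  # hypothetical image height in pixels

# Illustrative pixel coordinates: the row index grows downward.
cols = np.array([20, 45, 70, 95, 120], dtype=float)
rows = np.array([130, 100, 90, 50, 30], dtype=float)

# Correlation with y measured downward (raw pixel rows).
r_pixel = np.corrcoef(cols, rows)[0, 1]
# Correlation with y measured upward, as in the plotted axes.
r_plot = np.corrcoef(cols, image_height - rows)[0, 1]

print(r_pixel, r_plot)
```

The two values have the same magnitude and opposite signs, so skipping the flip would negate every prediction.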

In [5]:
df = pd.read_csv("./train.csv")
df['calculated'] = df['image'].map(lambda x: process_image(f'./train/{x}.jpg'))
mse = mean_squared_error(df['calculated'], df['correlation'])
print(f'Mean Squared Error: {mse}')
df.head()

Mean Squared Error: 0.00010633884481230846


| Unnamed: 0 | image | correlation | calculated |
|---|---|---|---|
| 0 | 0 | -0.21087 | -0.21256 |
| 1 | 1 | 0.46533 | 0.466746 |
| 2 | 2 | -0.04396 | -0.046793 |
| 3 | 3 | -0.10122 | -0.102621 |
| 4 | 4 | -0.16605 | -0.16378 |
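
Mean squared error is the average of the squared differences between the calculated and the reference correlations. As a rough illustration, the sketch below recomputes it with NumPy on only the five rows displayed above. Because those values are rounded for display and cover a tiny subset of the training set, the result is not expected to match the full-dataset figure.

```python
import numpy as np

# The five rows displayed above: reference vs. calculated correlation.
expected = np.array([-0.21087, 0.46533, -0.04396, -0.10122, -0.16605])
calculated = np.array([-0.21256, 0.466746, -0.046793, -0.102621, -0.16378])

errors = calculated - expected
mse_head = float(np.mean(errors ** 2))         # mean squared error on these rows
max_abs_error = float(np.max(np.abs(errors)))  # worst single prediction

print(mse_head, max_abs_error)
```

Looking at the largest absolute error alongside the MSE helps spot individual images where point detection went wrong.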


Lastly, run the algorithm on the submission dataset and write the results to `submission.csv`.

In [6]:
df = pd.read_csv("submission.csv")
df['correlation'] = df['image'].map(lambda x: process_image(f'./test/{x}.jpg'))
df.to_csv("submission.csv", index=False)
df.head()
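
Before uploading, it can be worth checking that every prediction is a finite number in [-1, 1]. The sketch below is an optional sanity check, not part of the original notebook; `invalid_predictions` is a hypothetical helper, and the sample list reuses the five values displayed for the submission.

```python
import math

def invalid_predictions(values):
    """Return the positions of values that are not finite numbers in [-1, 1]."""
    return [i for i, v in enumerate(values)
            if not (math.isfinite(v) and -1.0 <= v <= 1.0)]

# The five predictions displayed for the submission.
head = [-0.387269, 0.016863, -0.080502, -0.182272, 0.014748]
print(invalid_predictions(head))  # prints []
```

In the notebook, the same check could be applied to `df['correlation'].tolist()` before calling `to_csv`.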

| Unnamed: 0 | image | correlation |
|---|---|---|
| 0 | 0 | -0.387269 |
| 1 | 1 | 0.016863 |
| 2 | 2 | -0.080502 |
| 3 | 3 | -0.182272 |
| 4 | 4 | 0.014748 |


This yields a score of **99.67**, which obtains full marks for the problem.