# SoccerNet Jersey Extraction Dataset Creation

## Introduction

Welcome to my SoccerNet Jersey Extraction Dataset Creation notebook! Here, I'll focus on creating a dataset of jersey crops extracted from the SoccerNet jersey number recognition dataset. The SoccerNet dataset provides a large collection of player frames from soccer match videos, making it a useful resource for training computer vision models to recognize and extract jersey numbers from player frames.

## Goals

- **Dataset Creation**: Generate a dataset of jersey crops of soccer players, each labeled with the corresponding jersey number.
- **Data Preprocessing**: Perform the necessary preprocessing steps, such as image resizing, cropping, and normalization.
- **Dataset Storage**: Save the preprocessed dataset in an appropriate format for easy access and future use.

## Note

Ensure that you have access to the SoccerNet dataset or have downloaded the required data for this notebook to run successfully.

Let's get started!


### Package Installation
- **Ultralytics**: This package provides utilities and tools for working with computer vision models, including object detection, image classification, and more.
- **natsort**: This package offers human-friendly sorting of lists and arrays.

In [None]:
!pip install ultralytics
!pip install natsort

### Importing Necessary Libraries

In [None]:
import os
import cv2
import matplotlib.pyplot as plt
import numpy as np
import itertools
import time
from ultralytics import YOLO
from natsort import natsorted
import concurrent.futures

### Loading the YOLOv8 Pose Model

Before proceeding, we need to load the YOLOv8 Pose model. This model will be used for detecting poses and keypoints in the soccer player images.
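
The crop logic later in this notebook refers to keypoints by index (5, 6, 11, 12). As a reading aid, here is a small lookup table, assuming the loaded checkpoint follows the standard 17-keypoint COCO order (the names `COCO_KEYPOINTS` and `TORSO_INDICES` are illustrative and not part of the original notebook):

```python
# Standard COCO keypoint order (assumed to match the loaded pose checkpoint).
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Indices of the four torso corners used for the jersey crop.
TORSO_INDICES = [5, 6, 11, 12]
torso_names = [COCO_KEYPOINTS[i] for i in TORSO_INDICES]
print(torso_names)
```

Under this assumption, "left" and "right" refer to the player's own body, not to the image.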

In [None]:
model = YOLO("/kaggle/input/yolov8-pose-models/models/yolov8m-pose.pt")

## Demonstrating Dataset Processing on a Single Image

I'll illustrate the dataset processing workflow by performing all steps on a single image from the dataset. This demonstration covers image preprocessing, jersey extraction, and dataset creation.
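
The confidence filtering used in this walkthrough can be tried without running the model. The sketch below is a minimal stand-in that works on plain `(x, y, confidence)` rows; `filter_keypoints` and `example_rows` are hypothetical names, and the values are made up for illustration:

```python
def filter_keypoints(rows, threshold=0.1):
    """Return (x, y) integer tuples for confident keypoints and None for the rest."""
    points = []
    for x, y, conf in rows:
        points.append((int(x), int(y)) if conf > threshold else None)
    return points

# Made-up (x, y, confidence) rows standing in for one person's keypoints.
example_rows = [(120.7, 80.2, 0.95), (0.0, 0.0, 0.02), (140.1, 82.9, 0.60)]
filtered = filter_keypoints(example_rows)
print(filtered)
```

Keeping `None` in place of low-confidence keypoints preserves the index of every keypoint, which is what lets later cells address the shoulders and hips by position.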

In [None]:
image_path = "/kaggle/input/soccernet/Data/SoccerNet/jersey-2023/train/train/images/10/10_2.jpg"
image = cv2.imread(image_path)
results = model(image_path, save=True)
keypoints = results[0].keypoints
# keypoints.xy has shape (num_people, num_keypoints, 2):
# one set of (x, y) pixel coordinates per detected person
for person_idx in range(keypoints.xy.size(0)):
    for kpt in keypoints.xy[person_idx]:
        x, y = int(kpt[0].item()), int(kpt[1].item())
        cv2.circle(image, (x, y), radius=3, color=(0, 255, 0), thickness=-1)

# Display the image (OpenCV loads BGR, Matplotlib expects RGB)
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

In [None]:
threshold = 0.1
points = []

for kpt_data in keypoints.data[0]:
    # Extract the confidence (probability) of the keypoint
    prob = kpt_data[2].item()

    # If the confidence is above the threshold, consider it as a valid keypoint
    if prob > threshold:
        x = kpt_data[0].item()
        y = kpt_data[1].item()
        points.append((int(x), int(y)))
    else:
        points.append(None)
if keypoints.has_visible is False:
    # POSE_PAIRS indexes keypoints up to 16, so the placeholder list needs 17 entries
    points = [None] * 17


In [None]:
print(points)

In [None]:
POSE_PAIRS = [(2,0),(0,1),(1,3),(4,6),(3,5),(5,6),(6,8),(5,7),(8,10),(7,9),(6,12),(5,11),(12,14),(14,16),(11,13),(13,15),(11,12),(4,2)]

In [None]:
imPoints = image.copy()
imSkeleton = image.copy()
# Draw points (skip keypoints that fell below the confidence threshold)
for i, p in enumerate(points):
    if p is None:
        continue
    cv2.circle(imPoints, p, 2, (255, 255, 0), thickness=-1, lineType=cv2.FILLED)
    cv2.putText(imPoints, "{}".format(i), p, cv2.FONT_HERSHEY_SIMPLEX, 0.3, (255, 0, 0), 1, lineType=cv2.LINE_AA)
# Draw skeleton
for pair in POSE_PAIRS:
    partA = pair[0]
    partB = pair[1]
    if points[partA] and points[partB]:
        cv2.line(imSkeleton, points[partA], points[partB], (255, 255, 0), 2)
        cv2.circle(imSkeleton, points[partA], 2, (255, 0, 0), thickness=-1, lineType=cv2.FILLED)

In [None]:
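plt.figure(figsize=(20,20))
# OpenCV images are BGR; convert to RGB so Matplotlib shows the true colors
plt.subplot(121); plt.axis('off'); plt.imshow(cv2.cvtColor(imPoints, cv2.COLOR_BGR2RGB));
plt.subplot(122); plt.axis('off'); plt.imshow(cv2.cvtColor(imSkeleton, cv2.COLOR_BGR2RGB));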

### Testing Workflow on a Folder of Images

To validate the workflow, I'll apply the entire process to a folder containing multiple images. This test will assess the robustness and efficiency of the workflow across various images.
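
The crop geometry applied to each image in this test can be read in isolation. The sketch below derives the crop bounds from the four torso keypoints using the same margins as the loop (roughly the 10% to 80% band of the torso height, plus 3 px vertically and 10 px horizontally). `torso_crop_box` is a hypothetical helper, the sample coordinates are made up, and clamping the start indices at zero is an added safeguard:

```python
def torso_crop_box(left_shoulder, right_shoulder, left_hip, right_hip, pad_x=10, pad_y=3):
    """Return (row_start, row_end, col_start, col_end) for the jersey crop."""
    corners = (left_shoulder, right_shoulder, left_hip, right_hip)
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    top, left = min(ys), min(xs)
    height, width = max(ys) - top, max(xs) - left
    # Keep the 10%-80% band of the torso height, with a little padding.
    row_start = max(0, int(top + height * 0.1 - pad_y))
    row_end = int(top + height * 0.8 + pad_y)
    # Clamp at 0 so a negative start index cannot wrap around the array.
    col_start = max(0, int(left - pad_x))
    col_end = int(left + width + pad_x)
    return row_start, row_end, col_start, col_end

# Made-up keypoints on a 285x600 image.
box = torso_crop_box((100, 150), (190, 152), (105, 330), (185, 332))
print(box)
```

The crop itself would then be `image[row_start:row_end, col_start:col_end]`.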


In [None]:
paths = os.listdir("/kaggle/input/soccernet/Data/SoccerNet/jersey-2023/train/train/images/0")
paths = ["/kaggle/input/soccernet/Data/SoccerNet/jersey-2023/train/train/images/0/"+ i for i in paths]
img = cv2.imread(paths[1])
img = cv2.resize(img , (200,200))
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))  # convert BGR to RGB for display
len(paths)

In [None]:
start_time = time.time()

whole_numbers_generator = itertools.count(start=1)
# next(whole_numbers_generator)
npoints = 13
plt.figure(figsize=(200,200))

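for j in paths:
    print(j)
    image = cv2.imread(j)
    # cv2.imread returns None on failure, so check before reading the shape
    if image is None:
        print(f"Error: Unable to read image {j}. Skipping...")
        continue
    im_height, im_width = image.shape[:2]
    if im_height < 75 or im_width < 35:
        print(f"Error: Image {j} is too small. Skipping...")
        continue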
#     if im_height<180 or im_width<90:
    image = cv2.resize(image , (285,600))
    im_height, im_width = image.shape[:2]
    results = model(image)
    keypoints = results[0].keypoints

    threshold = 0.1
    points = []

    for kpt_data in keypoints.data[0]:
        # Extract the confidence (probability) of the keypoint
        prob = kpt_data[2].item()

        # If the confidence is above the threshold, consider it as a valid keypoint
        if prob > threshold:
            x = kpt_data[0].item()
            y = kpt_data[1].item()
            points.append((int(x), int(y)))
        else:
            points.append(None)
    if  keypoints.has_visible is False:
        points = [None] * 13
    quadrilateral_coords = [points[6], points[5], points[12], points[11]]
    
    # Check if any of the points are None
    if None in quadrilateral_coords:
        print(f"Error: Not enough valid points detected for image {j}. Skipping...")
        continue
    x1, y1 = points[5]
    x2, y2 = points[6]
    x3, y3 = points[11]
    x4, y4 = points[12]

    top = np.abs(min(y1,y2,y3,y4))
    left = np.abs(min(x1,x2,x3,x4))
    height = np.abs(max(y1,y2,y3,y4)-top)
    width = np.abs(max(x1,x2,x3,x4)-left)
    if top >= im_height*0.5 or left >= im_width*0.7:
        print(f"Error: wrong player detected {j}. Skipping...") 
        continue

    if 0 in [height , width , top , left]:
        continue
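    # Clamp the crop start at 0 so a negative index does not wrap around
    img1 = image[max(0, int(top+height*.1-3)):int(top+height*.8+3), max(0, int(left-10)):int(left+width+10)]
    img1 = cv2.resize(img1 , (32 ,32))
    # Skip unless the left shoulder/hip lies left of the right one in the image,
    # both shoulders are above both hips, and the shoulder span is at least 65 px
    if x1 >= x2 or x1 >= x4 or x3 >= x4 or x3 >= x2 or y1 >= y3 or y1 >= y4 or y2 >= y3 or y2 >= y4 or np.abs(x2 - x1) < 65:
        continue
    else:
        plt.subplot(50,50, next(whole_numbers_generator) )
        plt.axis("off")
        plt.imshow(cv2.cvtColor(img1, cv2.COLOR_BGR2RGB))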
end_time = time.time()
elapsed_time = end_time - start_time
print("Time consumed:", elapsed_time, "seconds")

In [None]:
def preprocess_image(image_path):
    """
    Preprocesses an image for jersey extraction.

    Args:
        image_path (str): The path to the input image.

    Returns:
        list or None: A two-element list [label, crop], where label is "jersey" or "no_jersey" and crop is the 32x32 jersey region, or None if preprocessing fails.
    """
    # Read the image
    image = cv2.imread(image_path)
    
    # Check if the image is valid and meets minimum size requirements
    if image is None or image.shape[0] < 75 or image.shape[1] < 35:
        return None

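    # Resize the image to a standard size
    image = cv2.resize(image, (285, 600))
    # Needed by the position checks below (previously read from notebook globals)
    im_height, im_width = image.shape[:2]

    # Run YOLO model on the image to detect keypoints
    results = model(image)
    keypoints = results[0].keypoints
    # No person detected: nothing to crop
    if keypoints is None or len(keypoints.data) == 0:
        return None
    threshold = 0.1
    points = []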

    # Iterate through detected keypoints
    for kpt_data in keypoints.data[0]:
        # Extract the confidence (probability) of the keypoint
        prob = kpt_data[2].item()

        # If the confidence is above the threshold, consider it as a valid keypoint
        if prob > threshold:
            x = kpt_data[0].item()
            y = kpt_data[1].item()
            points.append((int(x), int(y)))
        else:
            points.append(None)

    # If keypoint visibility is unavailable, mark every keypoint as missing
    # (the None check below then makes the function return None)
    if not keypoints.has_visible:
        points = [None] * 17
    
    # Extract quadrilateral coordinates
    quadrilateral_coords = [points[6], points[5], points[12], points[11]]
    
    # Check if any of the points are None
    if None in quadrilateral_coords:
        return None
        
    # Calculate bounding box coordinates
    x1, y1 = points[5]
    x2, y2 = points[6]
    x3, y3 = points[11]
    x4, y4 = points[12]
    top = np.abs(min(y1,y2,y3,y4))
    left = np.abs(min(x1,x2,x3,x4))
    height = np.abs(max(y1,y2,y3,y4)-top)
    width = np.abs(max(x1,x2,x3,x4)-left)

    # Check if dimensions are valid
    if top >= im_height * 0.5 or left >= im_width*0.7:
        return None
    if 0 in [top, height, width, left]:
        return None
    
    # Extract jersey region (clamp start indices at 0 so negative values do not wrap around)
    jersey_roi = image[max(0, int(top+height*.1-3)):int(top+height*.8+3), max(0, int(left-10)):int(left+width+10)]

    try:
        # Resize the jersey region
        jersey_roi = cv2.resize(jersey_roi, (32, 32))
    except cv2.error:
        # An empty crop cannot be resized
        return None

    # Label the crop: "jersey" requires the left shoulder and hip to lie left of the right ones
    # in the image, both shoulders above both hips, and a shoulder span of at least 55 px
    if x1 >= x2 or x1 >= x4 or x3 >= x4 or x3 >= x2 or y1 >= y3 or y1 >= y4 or y2 >= y3 or y2 >= y4 or np.abs(x2 - x1) < 55:
        return ["no_jersey", jersey_roi]
    else:
        return ["jersey", jersey_roi]

### Dataset Processing on a Folder

To demonstrate the processing steps on the entire dataset folder, I will iterate through each image in the folder and apply the necessary operations. This will provide a comprehensive overview of the data processing pipeline.



#### Preparing Directories

Set up the output directory structure so that processed images can be organized by player ID.

In [None]:
for player_id in range(len(os.listdir("/kaggle/input/soccernet/Data/SoccerNet/jersey-2023/train/train/images"))):
    
    os.makedirs("/kaggle/working/soccernet/jersey-2023/train/train/blur_no_jersey_images/" + str(player_id) , exist_ok=True)
    os.makedirs("/kaggle/working/soccernet/jersey-2023/train/train/images/" + str(player_id) , exist_ok=True)


This code block preprocesses the images of a single player folder from the SoccerNet dataset, separating the crops labeled `jersey` from those labeled `no_jersey`.
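
The rule that separates the two groups can be written as a small standalone predicate. This is a sketch of the same comparisons used in `preprocess_image`, with the hypothetical name `looks_back_facing`. The reading that the rule selects players whose back faces the camera is an interpretation, assuming COCO keypoints where "left" means the player's own left:

```python
def looks_back_facing(left_shoulder, right_shoulder, left_hip, right_hip, min_shoulder_px=55):
    """Return True when the torso keypoints pass the jersey checks."""
    x1, y1 = left_shoulder
    x2, y2 = right_shoulder
    x3, y3 = left_hip
    x4, y4 = right_hip
    # The player's left side must appear on the left of the image.
    left_side_is_left = x1 < x2 and x1 < x4 and x3 < x4 and x3 < x2
    # Both shoulders must be above both hips (smaller y is higher in the image).
    shoulders_above_hips = y1 < y3 and y1 < y4 and y2 < y3 and y2 < y4
    # The shoulders must be far enough apart for the number to be legible.
    wide_enough = abs(x2 - x1) >= min_shoulder_px
    return left_side_is_left and shoulders_above_hips and wide_enough

# Made-up keypoints: back view, front view, and a narrow side view.
back = looks_back_facing((100, 150), (190, 152), (105, 330), (185, 332))
front = looks_back_facing((190, 150), (100, 152), (185, 330), (105, 332))
narrow = looks_back_facing((100, 150), (130, 152), (105, 330), (128, 332))
print(back, front, narrow)
```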

In [None]:
start_time = time.time()

out_root = "/kaggle/working/soccernet/jersey-2023/train/train"
for player_id_path in ["/kaggle/input/soccernet/Data/SoccerNet/jersey-2023/train/train/images/569"]:
    for frame in os.listdir(player_id_path):
        frame_path = os.path.join(player_id_path, frame)
        result = preprocess_image(frame_path)
        # Skip frames that could not be read or lacked usable torso keypoints
        if result is None:
            continue
        label, jersey_roi = result
        subdir = "images" if label == "jersey" else "blur_no_jersey_images"
        dest_path = os.path.join(out_root, subdir, os.path.basename(player_id_path), frame)
        print("dest_path", dest_path)
        cv2.imwrite(dest_path, jersey_roi)

end_time = time.time()
elapsed_time = end_time - start_time
print("Time consumed:", elapsed_time, "seconds")
print(len(os.listdir("/kaggle/working/soccernet/jersey-2023/train/train/images/569")))

#### Sorting and Preparing Image Paths

The following code block sorts and prepares the image paths for processing from the SoccerNet dataset.
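
Natural sorting matters here because the player folders are named with integers, and plain string sorting would place `10` before `2`. The sketch below shows the idea with a regex-based key from the standard library; it is a stand-in for illustration, not how `natsort` is implemented, and `natural_key` is a hypothetical helper:

```python
import re

def natural_key(name):
    """Split a string into text and integer chunks so numbers compare numerically."""
    return [int(chunk) if chunk.isdigit() else chunk for chunk in re.split(r"(\d+)", name)]

folders = ["10", "2", "1", "100", "21"]
lexicographic = sorted(folders)
natural = sorted(folders, key=natural_key)
print(lexicographic)
print(natural)
```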



In [None]:
player_id_dir = os.listdir("/kaggle/input/soccernet/Data/SoccerNet/jersey-2023/train/train/images")
player_id_dir = natsorted(player_id_dir)
player_id_dir = ["/kaggle/input/soccernet/Data/SoccerNet/jersey-2023/train/train/images/"+i+"/" for i in player_id_dir]
img_paths = [os.path.join(directory, filename) for directory in player_id_dir for filename in os.listdir(directory)]

In [None]:
len(img_paths)

#### Image Processing and Saving
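
Each crop is written to a folder chosen by its label, under a subfolder named after the player ID taken from the source path. As a minimal sketch, that mapping could be factored into one helper; `build_dest_path` and the example paths are hypothetical, not part of the original notebook:

```python
import os

def build_dest_path(img_path, label, out_root):
    """Map a source frame path and its label to the output path for the crop."""
    player_id = os.path.basename(os.path.dirname(img_path))
    subdir = "images" if label == "jersey" else "blur_no_jersey_images"
    return os.path.join(out_root, subdir, player_id, os.path.basename(img_path))

jersey_dest = build_dest_path("/data/images/569/569_12.jpg", "jersey", "/out")
other_dest = build_dest_path("/data/images/569/569_13.jpg", "no_jersey", "/out")
print(jersey_dest)
print(other_dest)
```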

In [None]:
# Function to process images
def process_and_save_image(img_path):
    try:
        result = preprocess_image(img_path)
        # Nothing to save if the image was unreadable or the torso keypoints were missing
        if result is None:
            return
        label, jersey_roi = result
        subdir = "images" if label == "jersey" else "blur_no_jersey_images"
        dest_path = "/kaggle/working/soccernet/jersey-2023/train/train/" + subdir + "/" + img_path.split("/")[-2] + "/" + img_path.split("/")[-1]
        cv2.imwrite(dest_path, jersey_roi)
    except Exception as e:
        print(f"Error processing image {img_path}: {str(e)}")

# Main function
def main():
    start_time = time.time()
    
    with concurrent.futures.ThreadPoolExecutor() as executor:
        executor.map(process_and_save_image, img_paths)

    end_time = time.time()
    elapsed_time = end_time - start_time
    print("Time consumed:", elapsed_time, "seconds")

if __name__ == "__main__":
    main()
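
When running over the whole dataset it can help to know how many images ended up in each group. A minimal sketch, assuming the worker returns an outcome label: `fake_process` is a hypothetical stand-in for `process_and_save_image`, and the example paths are made up.

```python
import concurrent.futures
from collections import Counter

def fake_process(path):
    """Hypothetical stand-in for the real worker; returns an outcome label."""
    if path.endswith(".txt"):
        return "skipped"
    return "jersey" if "back" in path else "no_jersey"

example_paths = ["0/back_1.jpg", "0/front_2.jpg", "1/back_3.jpg", "1/notes.txt"]
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # Consuming the iterator returned by map() also surfaces worker exceptions.
    outcomes = Counter(executor.map(fake_process, example_paths))
print(dict(outcomes))
```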

## Data Creation Procedure

### Preprocessing and Function Implementation
1. **Preprocessing Steps:** Run the preprocessing steps described above (pose keypoint detection, torso cropping, and resizing to 32x32) on every player ID folder.
2. **Function Implementation:** Reuse the image extraction and processing functions defined above so that each crop is saved under the folder of its source player ID.

### Grouping Extracted Jersey Numbers
1. **Reading Ground Truth JSON File:**
```python
import json

# Maps each player ID folder (as a string key) to its jersey number
with open(r'Data\SoccerNet\jersey-2023\test\test\test_gt.json', 'r') as file:
    data = json.load(file)
```
2. **Defining Labels and Paths:**
```python
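import os

# path_to_preprocessed_extracted_jersey_no_images is the root folder of the saved crops
# (must end with a path separator)
labels = []
paths = []
for i in range(len(data)):
    for j in os.listdir(path_to_preprocessed_extracted_jersey_no_images + str(i) + "/"):
        label = data[str(i)]
        path = path_to_preprocessed_extracted_jersey_no_images + str(i) + "/" + j
        labels.append(label)
        paths.append(path)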
```
3. **Creating DataFrame:**
```python
import pandas as pd

df = pd.DataFrame({"img_path": paths, "label": labels})
```

4. **Sorting DataFrame by Label:**
```python
grouped = df.sort_values(by=['label'])
```
5. **Creating Directories for Data Storage:**
```python
unique_labels = list(set(data.values()))
for label in unique_labels:
    os.makedirs("test/test/images/" + str(label), exist_ok=True)
```
6. **Grouping Jersey Extracts with Corresponding Labels:**
```python
import shutil

for _, row in grouped.iterrows():
    label = row["label"]
    source = row["img_path"]
    destination = "test/test/images/" + str(label) + "/"
    try:
        shutil.copy(source, destination)
        print("File copied successfully.")
    except shutil.SameFileError:
        print("Source and destination represent the same file.")
    except PermissionError:
        print("Permission denied.")
    except Exception as e:
        print(f"Error occurred while copying file: {e}")
```
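
The grouping steps above can also be collapsed into a single function that skips the DataFrame. The following is a minimal sketch rather than the procedure actually used: `group_by_label` is a hypothetical helper, and the demo runs on temporary folders with made-up file names and labels.

```python
import os
import shutil
import tempfile

def group_by_label(gt, src_root, dst_root):
    """Copy every crop in src_root/<player_id>/ into dst_root/<jersey_number>/."""
    copied = 0
    for player_id, label in gt.items():
        src_dir = os.path.join(src_root, str(player_id))
        if not os.path.isdir(src_dir):
            continue  # no crops were saved for this player ID
        dst_dir = os.path.join(dst_root, str(label))
        os.makedirs(dst_dir, exist_ok=True)
        for name in os.listdir(src_dir):
            shutil.copy(os.path.join(src_dir, name), dst_dir)
            copied += 1
    return copied

# Tiny demo: two player folders share jersey number 7, a third has no crops.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = os.path.join(tmp, "src"), os.path.join(tmp, "dst")
    for pid, fname in [("0", "0_1.jpg"), ("1", "1_1.jpg")]:
        os.makedirs(os.path.join(src, pid))
        open(os.path.join(src, pid, fname), "wb").close()
    n_copied = group_by_label({"0": 7, "1": 7, "2": 10}, src, dst)
    grouped_files = sorted(os.listdir(os.path.join(dst, "7")))
print(n_copied, grouped_files)
```

This relies on file names being unique across player folders; identical names from two folders with the same jersey number would overwrite each other.
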
### Manual Data Cleaning
* After the initial data organization step, manual data cleaning was conducted.
* Each extracted image was carefully inspected to remove any erroneous or irrelevant samples.
* The cleaning process ensured that only high-quality and relevant data remained for further analysis and model training.
