# **GT AI Talent Acquisition Project Management**

## **Introduction**

The GT AI Talent Acquisition Project Management is gathering a team to embark on an AI project to develop a model to make accurate predictions for a football games. This notebook details the processes developed for loading data to peforming analysis.

In this notebbok the roles of the data engineer and data scientist are outlined for further development and for documenation purposes. For this particular project, The DFL - Bundesliga Data Shootout Kaggle Competition shall be used to kick-start the project and enable team members prep and familiarize themselves with how the real work would look like.

## **Data Ingestion**

The Data Ingestion role characterizes everything that is done to load and extract data from various sources to be stored in an efficient storage system. This section would detail how the football game videos were loaded from Kaggle (in this case, this is out target source).

### **Libraries and Packages**


**Kaggle :** This packgae had to be installed to facilitate a direct download from Kaggle to the preferred cloud storage, which for this project (due to external issues), is Google drive. So in the script below 'kaggle' is installed and in the following script 'KaggleApi' is imported from the 'kaggle.api.kaggle_api_extended' library to handle the api set up and thus facilitate direct link to the required data.

**OS :** This Python module was used used to execute file system functionalities. For this project it is used to crest directories, navigate through file systems, remove unwanted files and rename files or directores.

**CV2 :** This is from the 'OpenCV' package to handle reading video files.

**ZipFile :** This Python module facilitates CRUD functionalites on zip files. For this project, it was used to locate the zip files in the Kaggle storage as the videos were zip files. This module is also used to extraxt zip files to have the raw video format.



### **Data Ingestion Procedures and Steps**

**Importing Libraries and Packges :** This first step is to import libraries and packages. For this project the libraries as stated in the 'Libraries and Packages section were imported'. Kaggle is also downloaded using pip.

**Defining Functions to List Files in Kaggle :** A function called 'list_files_in_kaggle_competition' is defined to view the available data from the source and decide which data shall be used. This also helped with confirming Kaggle API connection. This function returns a list of files in the Kaggle dates set defined. These functionalities are peformed within this function;

- **Initialising Kaggle** is done here using the 'KaggleApi' instance.
- **List files in the Kaggle dataset** in order to view the available files in the dataset and their formats. This also ensures successful API connection and initialization.  

**Defining Function to download Dta from Kaggle to Google Drive :** This function ensures that the files defined in it's parameters are downloaded into the defined path. These functionalities are peformed within this function;

- **Initialising Kaggle** is done here using the 'KaggleApi' instance.
- **Ensuring the Download Path Exist** as provided in the parameter.  
- **Download Specific Files** since the entire data is large, only the train and test folders along with the train CSV file were to be used.
- **Extract Zip files** in order to acquire the video data. Files were stored as zip files in Kaggle, so this is necessary to ensure data can be used effectively.


** Set Kaggle Credentials :** Kaggle credentials is set using 'os.environ' to set it as enviroment variables to authenticate with the Kaggle API.

**List All the Files in the Kaggel dataset :** By calling the 'list_files_in_kaggle_competition' function to retrieve and print the names of all files available in the specified Kaggle competition.

**Specify the files to download :** The files to download are specified. It includes the train.csv file explicitly and all files in the train and test directories.

**Download data into specified directory :** By calling the 'download_specific_files_kaggle' function, the files specified are to be downloaded into the specifed google drive path.

In [6]:
import os
import zipfile
from kaggle.api.kaggle_api_extended import KaggleApi

def list_files_in_kaggle_competition(competition: str):
    # Initializing Kaggle API
    api = KaggleApi()
    api.authenticate()
    # List Kaggle DFL files
    files = api.competition_list_files(competition)
    return files.files  

def download_specific_files_kaggle(competition: str, files_to_download: list, download_path: str):
    # Initializing Kaggle API
    api = KaggleApi()
    api.authenticate()

    # Ensure the download path exists
    if not os.path.exists(download_path):
        os.makedirs(download_path)

    # Download specific files
    for file_name in files_to_download:
        file_path = os.path.join(download_path, file_name)
        print(f"Attempting to download {file_name} to {file_path}")  # Debug statement
        if os.path.exists(file_path):
            print(f"File {file_name} already exists. Skipping download.")
            continue
        try:
            # Ensure subdirectories are created
            os.makedirs(os.path.dirname(file_path), exist_ok=True)
            api.competition_download_file(competition, file_name, path=download_path)
            print(f"Downloaded {file_name} to {file_path}")
        except Exception as e:
            print(f"Error downloading {file_name}: {e}")

def extract_zip_files(directory):
    # Extract all zip files, including nested zips
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith('.zip'):
                file_path = os.path.join(root, file)
                try:
                    with zipfile.ZipFile(file_path, 'r') as zip_ref:
                        zip_ref.extractall(root)
                    os.remove(file_path)  # Optionally, remove the zip file after extraction
                    print(f"Extracted {file_path}")
                    # Recursively extract any new zips
                    extract_zip_files(root)
                except zipfile.BadZipFile:
                    print(f"Bad zip file: {file_path}")

def check_video_files(directory):
    all_videos_valid = True
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith(('.mp4', '.avi', '.mov', '.mkv')):
                video_path = os.path.join(root, file)
                cap = cv2.VideoCapture(video_path)
                if not cap.isOpened():
                    print(f"Corrupt video file detected: {video_path}")
                    all_videos_valid = False
                cap.release()
    if all_videos_valid:
        print("All video files are valid and not corrupt.")

In [1]:

# from google.colab import drive
# drive.mount('/content/mydrive')

# import os
# import boto3
# from kaggle.api.kaggle_api_extended import KaggleApi

# # AWS Credentials
# os.environ['AWS_ACCESS_KEY_ID'] = 'your_access_key'
# os.environ['AWS_SECRET_ACCESS_KEY'] = 'your_secret_key'
# os.environ['AWS_DEFAULT_REGION'] = 'your_region'

# # Kaggle Credentials
# os.environ['KAGGLE_USERNAME'] = 'your_kaggle_username'
# os.environ['KAGGLE_KEY'] = 'your_kaggle_key'

# def download_dataset(dataset_name, path_to_save):
#     api = KaggleApi()
#     api.authenticate()
#     api.dataset_download_files(dataset_name, path=path_to_save, unzip=True)

# def upload_files_to_s3(bucket_name, source_folder, s3_folder):
#     s3 = boto3.client('s3')
#     for filename in os.listdir(source_folder):
#         local_path = os.path.join(source_folder, filename)
#         s3_path = os.path.join(s3_folder, filename)
#         s3.upload_file(local_path, bucket_name, s3_path)
#         print(f"Uploaded {filename} to s3://{bucket_name}/{s3_path}")

# # Download from Kaggle
# download_dataset('dataset-owner/dataset-name', './data')

# # Upload to S3
# upload_files_to_s3('your-s3-bucket', './data', 'kaggle-data')



In [14]:
# THIS IS THE REAL ONE

# RUN TO LOAD DATA

# Set Kaggle API credentials
os.environ['KAGGLE_USERNAME'] = 'yolenedeblanche'
os.environ['KAGGLE_KEY'] = '38868207cd31f4b0b736846cfe41c7c3'

# List all files in the competition to view files in DFL and check API connection
files = list_files_in_kaggle_competition('dfl-bundesliga-data-shootout')
for file in files:
    print(file.name)

# Initialize train.csv file to download
files_to_download = ['train.csv']

# Adding all files in 'train' and 'test' directories
train_files_to_download = [file.name for file in files if file.name.startswith('train/')]
test_files_to_download = [file.name for file in files if file.name.startswith('test/')]

# Debug statements to verify the files to be downloaded
print("Files to download (train):", train_files_to_download)
print("Files to download (test):", test_files_to_download)

# Download the specific files and directories, i.e., the data
download_path = '/content/drive/MyDrive/GT-Bungladesh/raw_data_kaggle'
train_download_path = os.path.join(download_path, 'train')
test_download_path = os.path.join(download_path, 'test')

# Create train and test directories
os.makedirs(train_download_path, exist_ok=True)
os.makedirs(test_download_path, exist_ok=True)

# Download train.csv file to the main directory
download_specific_files_kaggle('dfl-bundesliga-data-shootout', ['train.csv'], download_path)

# Download train files to the train directory
download_specific_files_kaggle('dfl-bundesliga-data-shootout', train_files_to_download, download_path=train_download_path)

# Download test files to the test directory
download_specific_files_kaggle('dfl-bundesliga-data-shootout', test_files_to_download, download_path=test_download_path)

# Print to check if files are downloaded correctly
print(f"Train files downloaded to: {train_download_path}")
print(f"Test files downloaded to: {test_download_path}")

# Extract all downloaded zip files
extract_zip_files(train_download_path)
# extract_zip_files(test_download_path)

# Check for corrupt video files
check_video_files(train_download_path)
check_video_files(test_download_path)

test/0b1495d3_1.mp4
test/9f4df856_0.mp4
test/2f54ed1c_1.mp4
test/947e05ca_0.mp4
test/160606be_0.mp4
test/9a70c54e_1.mp4
test/2f54ed1c_0.mp4
test/947e05ca_1.mp4
test/9a70c54e_0.mp4
test/4dae79a9_0.mp4
test/019d5b34_0.mp4
test/9d3c239b_0.mp4
test/5dc4fe12_0.mp4
test/4dae79a9_1.mp4
test/0b1495d3_0.mp4
test/019d5b34_1.mp4
test/9d3c239b_1.mp4
test/9f4df856_1.mp4
test/5dc4fe12_1.mp4
test/160606be_1.mp4


We want to save the data in AWS S3 Bucket. Until that issue is fixed, here are the codes for preprocessing. It has been tested and it works when we have the data locally. Figuring out how it will work when the data is in S3 shouldn't be too hard.

## **Preprocessing**

### Extract Frames from video

In [15]:
import cv2
import os

def extract_frames_from_video(video_path, output_folder):
    # Create a VideoCapture object to read from video file
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        print(f"Error: Could not open video {video_path}")
        return

    frame_count = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break  # if no more frames to read

        # Save each frame as a JPEG file
        frame_filename = os.path.join(output_folder, f"frame_{frame_count:04d}.jpg")
        cv2.imwrite(frame_filename, frame)
        frame_count += 1

    cap.release()
    print(f"Finished extracting frames from {video_path}")

def process_videos_in_folder(folder_path):
    # List all files in the given folder
    for entry in os.listdir(folder_path):
        file_path = os.path.join(folder_path, entry)
        # Check if the file is a video (you could check extensions like .mp4, .avi)
        if os.path.isfile(file_path) and file_path.endswith('.mp4'):
            # Create a subfolder for the frames of this video
            video_name = os.path.splitext(entry)[0]
            frames_output_folder = os.path.join(folder_path, video_name + '_frames')
            os.makedirs(frames_output_folder, exist_ok=True)

            # Extract frames
            extract_frames_from_video(file_path, frames_output_folder)

# Set the directory where your videos are located
video_directory = '/content/raw_data/test'  # Adjust this path to fit video files location

# Process all videos in the directory
process_videos_in_folder(video_directory)


ModuleNotFoundError: No module named 'cv2'

### Resize frames


In [None]:
# Run if you want to overwrite the frames in the orginal extracted frames with the resize frames

import cv2
import os

def resize_frames_in_folder(folder_path, width=640, height=480):
    """Resize all frames in a specified folder."""
    for entry in os.listdir(folder_path):
        file_path = os.path.join(folder_path, entry)
        if file_path.endswith('.jpg'):
            # Read the frame
            frame = cv2.imread(file_path)
            # Resize the frame
            resized_frame = cv2.resize(frame, (width, height))
            # Save the resized frame
            cv2.imwrite(file_path, resized_frame)
            print(f"Resized and saved {file_path}")

def process_multiple_folders(parent_folder):
    """Process each subfolder within a parent directory."""
    for folder_name in os.listdir(parent_folder):
        folder_path = os.path.join(parent_folder, folder_name)
        if os.path.isdir(folder_path):  # Ensure it's a directory
            print(f"Processing folder: {folder_path}")
            resize_frames_in_folder(folder_path)

# Example usage
parent_folder = '/path/to/your/parent/directory'  # Adjust this path to fit video files location where multiple frames folders are stored
process_multiple_folders(parent_folder)

In [None]:
# Run if you want the resized frames to be saved seperately

import cv2
import os

def resize_frames_in_folder(folder_path, width=640, height=480):
    """Resize all frames in a specified folder and save them in a subfolder."""
    # Create a subfolder for the resized frames
    resized_folder_path = os.path.join(folder_path, 'resized')
    if not os.path.exists(resized_folder_path):
        os.makedirs(resized_folder_path)

    for entry in os.listdir(folder_path):
        file_path = os.path.join(folder_path, entry)
        if file_path.endswith('.jpg'):
            # Read the frame
            frame = cv2.imread(file_path)
            # Resize the frame
            resized_frame = cv2.resize(frame, (width, height))
            # Save the resized frame in the new subfolder
            resized_file_path = os.path.join(resized_folder_path, entry)
            cv2.imwrite(resized_file_path, resized_frame)
            print(f"Resized and saved {resized_file_path}")

def process_multiple_folders(parent_folder):
    """Process each subfolder within a parent directory."""
    for folder_name in os.listdir(parent_folder):
        folder_path = os.path.join(parent_folder, folder_name)
        if os.path.isdir(folder_path):  # Ensure it's a directory
            print(f"Processing folder: {folder_path}")
            resize_frames_in_folder(folder_path)

# Example usage
parent_folder = '/content/raw_data/test'  # Adjust this path to fit video files location where multiple frames folders are stored
process_multiple_folders(parent_folder)

### Stabilize video

In [None]:
import cv2
import numpy as np
import os
import shutil

def stabilize_video(input_path, output_path):
    cap = cv2.VideoCapture(input_path)
    if not cap.isOpened():
        print(f"Error: Could not open video {input_path}")
        return

    # Get frame rate and frame size for the output video
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

    # Define the codec and create VideoWriter object
    fourcc = cv2.VideoWriter_fourcc(*'XVID')
    out = cv2.VideoWriter(output_path, fourcc, fps, (frame_width, frame_height))

    # Read first frame
    ret, prev = cap.read()
    if not ret:
        print("Failed to read video")
        cap.release()
        return

    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

    while True:
        ret, curr = cap.read()
        if not ret:
            break

        curr_gray = cv2.cvtColor(curr, cv2.COLOR_BGR2GRAY)

        # Calculate optical flow
        flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        dx, dy = np.median(flow[..., 0]), np.median(flow[..., 1])

        # Transformation matrix
        m = np.float32([[1, 0, dx], [0, 1, dy]])
        stabilized = cv2.warpAffine(prev, m, (frame_width, frame_height))

        out.write(stabilized)
        prev_gray = curr_gray

    cap.release()
    out.release()
    print(f"Stabilization complete and saved to {output_path}")

def process_videos(input_folder):
    # Create output folder next to the input folder
    output_folder = os.path.join(input_folder, '../stabilized_videos')
    os.makedirs(output_folder, exist_ok=True)  # Create the output folder if it doesn't exist

    # Process each video in the input folder
    for filename in os.listdir(input_folder):
        if filename.endswith('.mp4'):  # Check for video files
            input_path = os.path.join(input_folder, filename)
            output_path = os.path.join(output_folder, 'stabilized_' + filename)
            print(f"Processing {input_path}...")
            stabilize_video(input_path, output_path)

#Adjust the path to use
input_directory = '/content/raw_data/test'
process_videos(input_directory)


### subtract background from video

In [None]:
import cv2
import os

def subtract_background_from_video(input_path, output_path):
    # Create the background subtractor object
    back_sub = cv2.createBackgroundSubtractorMOG2()

    # Open the input video
    cap = cv2.VideoCapture(input_path)
    if not cap.isOpened():
        print(f"Error: Could not open video {input_path}")
        return

    # Prepare to write the processed video
    fourcc = cv2.VideoWriter_fourcc(*'XVID')
    out = cv2.VideoWriter(output_path, fourcc, cap.get(cv2.CAP_PROP_FPS),
                          (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)), int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))))

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        # Apply background subtraction
        fg_mask = back_sub.apply(frame)

        # Save the resulting frame to the output video
        out.write(fg_mask)

    # Release resources
    cap.release()
    out.release()
    print(f"Background subtraction complete and saved to {output_path}")

def process_videos(input_folder):
    # Automatically set the output directory relative to the input folder
    output_folder = os.path.join(input_folder, '../processed_videos')
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    for filename in os.listdir(input_folder):
        if filename.endswith('.mp4'):
            input_path = os.path.join(input_folder, filename)
            output_path = os.path.join(output_folder, 'bg_subtracted_' + filename)
            print(f"Processing {input_path}...")
            subtract_background_from_video(input_path, output_path)


# Example usage
input_directory = '/path/to/your/input/videos'  # change this to the actual path for videos (the original ones from kaggle)
process_videos(input_directory)