# Image Recognition Project - Structural Defect Recognition
---------------------------------------------------------------
## Data Collection

### Section Objectives
 - Find a relevant dataset on Kaggle
 - Download the data
 - Preprocess the data, checking for outlier images and irrelevant files
 - Perform a manual data check
 - Divide the dataset into Train, Validation and Test subsets at a ratio of 0.7, 0.1 and 0.2 respectively
 

---------------------------------------------------------------

### Importing Packages

In [None]:
%pip install -r /workspaces/ML_Project_Image_Recognition/requirements.txt

In [None]:
import numpy
import os

### Setting Working Directory

In [None]:
current_dir = os.getcwd()
current_dir

In [None]:
directory = '/workspaces/ML_Project_Image_Recognition'

if not os.path.exists(directory):
    os.makedirs(directory)
    print(f"Directory '{directory}' created.")
else:
    print(f"Directory '{directory}' already exists.")


In [None]:
os.chdir(directory)
print(f"Working directory set to: {os.getcwd()}")

### Installing Kaggle


In [None]:
%pip install kaggle==1.5.12

In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = '/workspaces/ML_Project_Image_Recognition'
!chmod 600 /workspaces/ML_Project_Image_Recognition/kaggle.json
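
The cell above points Kaggle at the project folder and restricts the token's permissions. As a rough sanity check before downloading, the sketch below confirms that `kaggle.json` exists in that folder and that no group or other permission bits are set. The helper name `check_kaggle_credentials` is hypothetical and not part of the Kaggle package, and the permission check assumes a POSIX file system.

```python
import os
import stat


def check_kaggle_credentials(config_dir):
    """Return (found, owner_only) for the kaggle.json file in config_dir."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False, False

    mode = stat.S_IMODE(os.stat(token_path).st_mode)
    # Any group/other permission bit means the token is exposed beyond its owner
    owner_only = (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
    return True, owner_only


# Falls back to the current directory if KAGGLE_CONFIG_DIR has not been set
found, owner_only = check_kaggle_credentials(os.environ.get("KAGGLE_CONFIG_DIR", "."))
print(f"kaggle.json found: {found}, restricted to owner: {owner_only}")
```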

In [None]:
new_destination_folder = "/workspaces/ML_Project_Image_Recognition/inputs/cracks_dataset_new"
os.makedirs(new_destination_folder, exist_ok=True)
print(f"Created new folder: {new_destination_folder}")


In [None]:
KaggleDatasetPath = "aniruddhsharma/structural-defects-network-concrete-crack-images"
DestinationFolder = "/workspaces/ML_Project_Image_Recognition/inputs/cracks_dataset_new"
os.makedirs(DestinationFolder, exist_ok=True)
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

In [None]:
import zipfile

# The Kaggle CLI names the archive after the dataset slug
zip_file_path = os.path.join(DestinationFolder, 'structural-defects-network-concrete-crack-images.zip')
if os.path.exists(zip_file_path):
    # Extract the archive in place, then delete it to free disk space
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall(DestinationFolder)
    os.remove(zip_file_path)
else:
    print(f"File not found: {zip_file_path}")
    print("Listing files in the destination folder:")
    print(os.listdir(DestinationFolder))

---------------------------------------------------------------

## Preparing Data

### Data Cleaning
Check for and remove any non-image files from the downloaded dataset.

In [None]:
def remove_non_image_file(my_data_dir):
    """Delete every file without an image extension and report counts per folder."""
    image_extension = ('.png', '.jpg', '.jpeg')
    for folder in os.listdir(my_data_dir):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip loose files at the top level of the dataset

        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if given_file.lower().endswith(image_extension):
                image_count += 1
            else:
                os.remove(os.path.join(folder_path, given_file))
                non_image_count += 1
        print(f"Folder: {folder} - image files: {image_count}")
        print(f"Folder: {folder} - non-image files removed: {non_image_count}")

## Dividing Dataset
As mentioned previously, the dataset must be split into three partitions: a training set, a validation set and a test set, in the ratio 0.7 : 0.1 : 0.2 respectively.
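
One way such a split could be carried out is sketched below, using only the standard library. It is an illustration under stated assumptions, not necessarily the approach used later in this notebook: it assumes `my_data_dir` contains one folder per class label with the images directly inside, and the names `partition_file_names` and `split_dataset` are hypothetical. The test set receives whatever remains after the train and validation cuts, so every file lands in exactly one subset.

```python
import os
import random
import shutil


def partition_file_names(file_names, train_ratio=0.7, validation_ratio=0.1, seed=42):
    """Shuffle file names and cut them into train / validation / test lists."""
    shuffled = sorted(file_names)  # sort first so the seeded shuffle is reproducible
    random.Random(seed).shuffle(shuffled)

    # round() guards against float products such as 6.999... truncating to 6
    train_end = round(len(shuffled) * train_ratio)
    validation_end = train_end + round(len(shuffled) * validation_ratio)

    return {
        "train": shuffled[:train_end],
        "validation": shuffled[train_end:validation_end],
        "test": shuffled[validation_end:],  # remainder goes to the test set
    }


def split_dataset(my_data_dir, train_ratio=0.7, validation_ratio=0.1, seed=42):
    """Move images from <my_data_dir>/<label>/ into <my_data_dir>/<subset>/<label>/."""
    subsets = ("train", "validation", "test")
    labels = [
        name for name in os.listdir(my_data_dir)
        if os.path.isdir(os.path.join(my_data_dir, name)) and name not in subsets
    ]

    for label in labels:
        label_dir = os.path.join(my_data_dir, label)
        partitions = partition_file_names(
            os.listdir(label_dir), train_ratio, validation_ratio, seed
        )

        for subset, file_names in partitions.items():
            target_dir = os.path.join(my_data_dir, subset, label)
            os.makedirs(target_dir, exist_ok=True)
            for file_name in file_names:
                shutil.move(
                    os.path.join(label_dir, file_name),
                    os.path.join(target_dir, file_name),
                )

        os.rmdir(label_dir)  # the label folder is empty once every file has moved
```

Calling `split_dataset(my_data_dir)` with the default ratios would leave `train`, `validation` and `test` folders under `my_data_dir`, each holding one subfolder per label. Splitting per label keeps the class balance roughly the same across the three subsets.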