This notebook extracts the required data (images and metadata) and stores it in a form that the other notebooks can use directly. Because the metadata and images have already been processed, this notebook only needs to be re-run if the metadata changes.

#### Load these Packages

In [1]:
import pandas as pd
import numpy as np
import itertools
import os
import glob


#### Mount Google Drive and specify directory

In [2]:
from google.colab import drive
drive.mount('/gdrive', force_remount = True)

Mounted at /gdrive


In [3]:
#Change the working directory to where the project has been saved in google drive
project_dir = "/gdrive/My Drive/Colab Notebooks/TCA_project_Tomek_Thomas/tooth-annulation/"
# Do not change these directories
working_dir = project_dir + "SRC/"
images_dir = project_dir + "data/images_cleaned"
labels_dir = project_dir + 'data'

%cd {working_dir} 

/gdrive/My Drive/Colab Notebooks/TCA_project_Tomek_Thomas/tooth-annulation/SRC


#### Unzip image folder

In [4]:
#can only be imported after Google drive has been mounted
import helper as hp
from rename import rename

In [5]:
#Only do this part ONCE: uncomment the line below when needed, then re-comment it to avoid unzipping again.

#hp.unzip_file('./images_final.zip', './data')

In [6]:
#Check that all images have been unzipped and are in the correct folder.
current_folder_content = glob.glob(images_dir + "/*")
len(current_folder_content)

2635

This part looks through the images folder, builds a list of the image file names, and sorts it alphabetically.

In [7]:
image_name_list= [] #contain all image names

for image in current_folder_content:
  image_name = os.path.basename(image)  # file name only, independent of how deep the folder is nested
  image_name_list.append(image_name)

print(len(image_name_list))
image_name_list.sort()
image_name_list[:5]

2635


['001_44_NA_80.25_0138.jpg',
 '001_44_NA_80.25_0139.tif',
 '001_44_NA_80.25_0368.jpg',
 '001_44_NA_80.25_0369.tif',
 '002_12_2_46_0140.jpg']
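
The file names appear to encode the same fields as the metadata table, in the order ID, tooth code, sex, age, image ID, with `NA` marking a missing value. That layout is inferred from comparing the names with the metadata rows, not from any documentation, so the sketch below is only an illustration. `ImageInfo` and `parse_image_name` are hypothetical names that do not exist in the project.

```python
from typing import NamedTuple, Optional


class ImageInfo(NamedTuple):
    subject_id: int
    tooth_code: Optional[int]
    sex: Optional[int]
    age: Optional[float]
    img_id: int


def _maybe(text, cast):
    """Convert a field, treating the literal 'NA' as missing."""
    return None if text == "NA" else cast(text)


def parse_image_name(name: str) -> ImageInfo:
    """Split an image file name into its fields (assumed layout, see above)."""
    stem = name.rsplit(".", 1)[0]  # drop the extension only; the age may contain a dot
    subject_id, tooth, sex, age, img_id = stem.split("_")
    return ImageInfo(
        int(subject_id),
        _maybe(tooth, int),
        _maybe(sex, int),
        _maybe(age, float),
        int(img_id),
    )


info = parse_image_name("001_44_NA_80.25_0138.jpg")
print(info)
```

A parser like this could be used to cross-check the metadata table against the file names, assuming every file follows the same five-field pattern.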

##### If file names are inconsistent, run this script to batch-edit the file extensions. This avoids errors when the images are loaded in a later notebook.

In [8]:
rename(images_dir, '*.jpg', ".tif")
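
The project's `rename` module is not shown in this notebook, so its exact behaviour is unknown. As a rough idea of what a batch extension change could look like, here is a minimal sketch using only the standard library. `rename_extensions` is a hypothetical stand-in, and the file names are made up. Note that this sketch only changes the name of each file; it does not convert the image data.

```python
import tempfile
from pathlib import Path


def rename_extensions(folder, pattern, new_ext):
    """Hypothetical stand-in for the project's `rename` helper.

    Gives every file in `folder` that matches `pattern` the extension
    `new_ext` and returns the new file names.
    """
    renamed = []
    # sorted() materialises the listing before any file is renamed
    for path in sorted(Path(folder).glob(pattern)):
        target = path.with_suffix(new_ext)
        if target.exists():
            continue  # skip instead of overwriting an existing file
        path.rename(target)
        renamed.append(target.name)
    return renamed


# Try it on throwaway files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a_0001.jpg", "a_0002.tif", "b_0003.jpg"):
        (Path(tmp) / name).touch()
    renamed = rename_extensions(tmp, "*.jpg", ".tif")
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(renamed)
print(remaining)
```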

Double-check that the renaming has taken place

In [9]:
current_folder_content = glob.glob(images_dir + "/*")
image_name_list= [] #contain all image names

for image in current_folder_content:
  image_name = os.path.basename(image)  # file name only, independent of how deep the folder is nested
  image_name_list.append(image_name)

print(len(image_name_list))
image_name_list[:500]

2635


['311_12_2_59.5_2987.tif',
 '311_11_2_59.5_2982.tif',
 '310_34_2_76.25_2893.tif',
 '310_34_2_76.25_2892.tif',
 '310_34_2_76.25_2891.tif',
 '308_22_1_54.25_2806.tif',
 '308_22_1_54.25_2805.tif',
 '307_21_2_67_3012.tif',
 '307_21_2_67_3016.tif',
 '307_21_2_67_2804.tif',
 '307_21_2_67_2803.tif',
 '307_21_2_67_2802.tif',
 '305_32_1_58_3019.tif',
 '305_32_1_58_2699.tif',
 '305_32_1_58_2698.tif',
 '305_32_1_58_2697.tif',
 '304_32_1_63.25_3005.tif',
 '304_32_1_63.25_2694.tif',
 '304_32_1_63.25_2693.tif',
 '303_32_2_57_3003.tif',
 '303_32_2_57_3001.tif',
 '303_32_2_57_2690.tif',
 '303_32_2_57_2688.tif',
 '303_32_2_57_2687.tif',
 '303_32_2_57_2684.tif',
 '302_31_2_57_2499.tif',
 '302_31_2_57_2498.tif',
 '302_31_2_57_2403.tif',
 '302_31_2_57_2402.tif',
 '302_31_2_57_2401.tif',
 '302_31_2_57_2400.tif',
 '301_41_2_57_2397.tif',
 '301_41_2_57_2380.tif',
 '300_22_2_46.75_3445.tif',
 '300_22_2_46.75_3233.tif',
 '300_22_2_46.75_3206.tif',
 '300_22_2_46.75_3202.tif',
 '300_21_2_46.75_2494.tif',
 '300_2

#### Check out the metadata

Read the metadata file

In [10]:
meta = pd.read_excel(labels_dir + "/Metadata_clean.xlsx")
extra_meta = pd.read_excel(labels_dir + '/Datentabelle TCA_in-Arbeit.xlsx')
meta.head()

Unnamed: 0,ID,Tooth.Code,Age,Sex,ImgID,ImgName
0,1,44.0,80.25,,138,001_44_NA_80.25_0138.jpg
1,1,44.0,80.25,,139,001_44_NA_80.25_0139.jpg
2,1,44.0,80.25,,368,001_44_NA_80.25_0368.tif
3,1,44.0,80.25,,369,001_44_NA_80.25_0369.tif
4,2,12.0,46.0,2.0,140,002_12_2_46_0140.jpg


Some data is missing. It can be added from the "Datentabelle TCA_in-Arbeit" file.

In [11]:
extra_meta = extra_meta.loc[:, ['ID','tooth code', 'eruption']]
print(extra_meta.shape)
extra_meta.head()

(583, 3)


Unnamed: 0,ID,tooth code,eruption
0,1,44.0,10.0
1,2,12.0,8.5
2,3,42.0,8.5
3,5,21.0,7.5
4,6,22.0,8.5


In [12]:
print(meta.shape)
print(meta.isna().sum())

(2634, 6)
ID             0
Tooth.Code    41
Age           70
Sex           83
ImgID          0
ImgName        0
dtype: int64


There are some missing values for sex and tooth code. Rather than removing those rows, the missing values are filled with the number "0". Rows with a missing age are then dropped, which takes the table from 2634 to 2564 rows.

In [13]:
meta[['Tooth.Code', 'Sex']] = meta[['Tooth.Code', 'Sex']].fillna(value = 0)
clean_meta = meta.dropna()
print(clean_meta.shape)
clean_meta.head(10)

(2564, 6)


Unnamed: 0,ID,Tooth.Code,Age,Sex,ImgID,ImgName
0,1,44.0,80.25,0.0,138,001_44_NA_80.25_0138.jpg
1,1,44.0,80.25,0.0,139,001_44_NA_80.25_0139.jpg
2,1,44.0,80.25,0.0,368,001_44_NA_80.25_0368.tif
3,1,44.0,80.25,0.0,369,001_44_NA_80.25_0369.tif
4,2,12.0,46.0,2.0,140,002_12_2_46_0140.jpg
5,2,12.0,46.0,2.0,141,002_12_2_46_0141.jpg
6,2,12.0,46.0,2.0,142,002_12_2_46_0142.jpg
7,2,12.0,46.0,2.0,143,002_12_2_46_0143.jpg
8,2,12.0,46.0,2.0,144,002_12_2_46_0144.jpg
9,2,12.0,46.0,2.0,151,002_12_2_46_0151.jpg
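
The two-step cleaning (fill some columns, then drop whatever is still incomplete) can be tried on a toy table. The values below are invented for illustration and are not taken from the real metadata.

```python
import numpy as np
import pandas as pd

# Toy table: one gap each in Tooth.Code, Age and Sex (made-up values)
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Tooth.Code": [44.0, np.nan, 21.0],
    "Age": [80.25, 46.0, np.nan],
    "Sex": [np.nan, 2.0, 1.0],
})

# Step 1: fill only the columns where 0 is an acceptable "unknown" marker
toy[["Tooth.Code", "Sex"]] = toy[["Tooth.Code", "Sex"]].fillna(0)

# Step 2: drop rows that still contain a missing value (here: missing Age)
toy_clean = toy.dropna()

print(toy_clean)
```

Only the row with the missing age is removed; the rows whose tooth code or sex was missing are kept with 0 in that column.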


The data is now ready to be merged together. Duplicate images also need to be removed, which is done in the code below.

In [14]:
merged_meta = pd.merge(clean_meta, extra_meta, left_on = ['ID', 'Tooth.Code'], right_on = ["ID", "tooth code"], how = "inner")
merged_meta = merged_meta.drop_duplicates("ImgName")
print(merged_meta.shape)
merged_meta.head()

(2564, 8)


Unnamed: 0,ID,Tooth.Code,Age,Sex,ImgID,ImgName,tooth code,eruption
0,1,44.0,80.25,0.0,138,001_44_NA_80.25_0138.jpg,44.0,10.0
1,1,44.0,80.25,0.0,139,001_44_NA_80.25_0139.jpg,44.0,10.0
2,1,44.0,80.25,0.0,368,001_44_NA_80.25_0368.tif,44.0,10.0
3,1,44.0,80.25,0.0,369,001_44_NA_80.25_0369.tif,44.0,10.0
4,2,12.0,46.0,2.0,140,002_12_2_46_0140.jpg,12.0,8.5
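
An inner merge silently drops rows that have no matching key pair in the other table. One way to see whether that happens is a left merge with `indicator=True`, which labels each row by where it was found. The tables below are made-up stand-ins for `clean_meta` and `extra_meta`, so this is a sketch of the check rather than a result for the real data.

```python
import pandas as pd

# Made-up stand-ins for the two metadata tables
left = pd.DataFrame({
    "ID": [1, 2, 3],
    "Tooth.Code": [44.0, 12.0, 0.0],  # 0.0 marks a filled-in missing code
    "ImgName": ["a.tif", "b.tif", "c.tif"],
})
right = pd.DataFrame({
    "ID": [1, 2, 3],
    "tooth code": [44.0, 12.0, 42.0],
    "eruption": [10.0, 8.5, 8.5],
})

# A left merge keeps every image row; `_merge` says whether a partner was found
check = pd.merge(
    left, right,
    left_on=["ID", "Tooth.Code"],
    right_on=["ID", "tooth code"],
    how="left",
    indicator=True,
)
unmatched = check[check["_merge"] == "left_only"]

print(unmatched[["ID", "Tooth.Code", "ImgName"]])
```

Rows listed in `unmatched` are the ones an inner merge would have discarded.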


Update the file extensions in the table so that they match the renamed files

In [15]:
# Escape the dot and anchor at the end so only a trailing ".jpg" extension is replaced
merged_meta['ImgName'] = merged_meta['ImgName'].str.replace(r'\.jpg$', '.tif', regex=True)
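
In a regular expression an unescaped `.` matches any character, so the pattern `.jpg` can match more than the literal extension. The names in this dataset do not seem to trigger the problem, but a short sketch with a hypothetical file name shows the difference between the loose pattern and an escaped, anchored one.

```python
import re

real_name = "001_44_NA_80.25_0138.jpg"
tricky_name = "scan_xjpg_0001.jpg"  # hypothetical name, not from the dataset

# Loose pattern: "." matches any character, so "xjpg" is replaced as well
loose = re.sub(".jpg", ".tif", tricky_name)

# Strict pattern: a literal dot, and only at the end of the name
strict = re.sub(r"\.jpg$", ".tif", tricky_name)

# For a typical name from this dataset both patterns give the same result
same = re.sub(".jpg", ".tif", real_name) == re.sub(r"\.jpg$", ".tif", real_name)

print(loose)
print(strict)
print(same)
```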

Make sure that only the images in the folder are in the table

In [16]:
merged_meta = merged_meta[merged_meta["ImgName"].isin(image_name_list)]
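
The filter above keeps table rows that have a file, but it does not report what was left out on either side. A small set comparison can list both directions. `compare_names` is a hypothetical helper, and the names below are toy values.

```python
def compare_names(folder_names, table_names):
    """Return (files without a table row, table rows without a file)."""
    folder_set, table_set = set(folder_names), set(table_names)
    return sorted(folder_set - table_set), sorted(table_set - folder_set)


# Toy example: one file lacks metadata, one table row lacks a file
files_in_folder = ["a_0001.tif", "a_0002.tif", "b_0003.tif"]
names_in_table = ["a_0001.tif", "b_0003.tif", "c_0004.tif"]

no_metadata, no_file = compare_names(files_in_folder, names_in_table)
print("Files without metadata:", no_metadata)
print("Metadata without file:", no_file)
```

In the notebook the two arguments would presumably be `image_name_list` and `merged_meta["ImgName"]`.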

In [17]:
merged_meta = merged_meta.drop('tooth code', axis = 1)
merged_meta.head()

Unnamed: 0,ID,Tooth.Code,Age,Sex,ImgID,ImgName,eruption
0,1,44.0,80.25,0.0,138,001_44_NA_80.25_0138.tif,10.0
1,1,44.0,80.25,0.0,139,001_44_NA_80.25_0139.tif,10.0
2,1,44.0,80.25,0.0,368,001_44_NA_80.25_0368.tif,10.0
3,1,44.0,80.25,0.0,369,001_44_NA_80.25_0369.tif,10.0
4,2,12.0,46.0,2.0,140,002_12_2_46_0140.tif,8.5


Once the table is clean, export it as a CSV file for future use

In [21]:
with open(project_dir +'data/clean_meta.csv', 'w') as f:
  merged_meta.to_csv(f)
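
By default `to_csv` also writes the row index as a first, unnamed column. When such a file is read back without further options, that column shows up as `Unnamed: 0`. The sketch below uses a small made-up table and an in-memory buffer instead of the real file to show the effect and one way to restore the index when reading.

```python
import io

import pandas as pd

# Made-up table standing in for merged_meta
table = pd.DataFrame({
    "ID": [1, 2],
    "ImgName": ["a_0001.tif", "b_0002.tif"],
    "eruption": [10.0, 8.5],
})

buffer = io.StringIO()
table.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)  # the index comes back as "Unnamed: 0"

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # first column used as the index again

print(list(naive.columns))
print(list(restored.columns))
```

Passing `index=False` to `to_csv` would be an alternative if the row index is not needed in later notebooks.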