# Extracting metadata from the images

This script extracts the EXIF (Exchangeable Image File Format) metadata embedded in a collection of digital photographs and catalogues it in a single table. Beyond its visible content, an image file can carry metadata such as the geographic coordinates where the photo was taken, the timestamp, the camera model, and other technical parameters.

The script first uses the `glob` module to find every file in a specified directory whose name ends in `.jpg`. For each file, it reads the embedded metadata with `PIL` (the Python Imaging Library, now distributed as Pillow) and maps the numeric EXIF tag IDs to human-readable tag names.
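A minimal sketch of the file-discovery step, assuming the photos may also carry `.JPG` or `.jpeg` extensions (the script itself matches only `*.jpg`). The helper name `find_images` and the extension list are illustrative and not part of the original script.

```python
import glob
import os


def find_images(directory, extensions=(".jpg", ".jpeg")):
    """Return a sorted list of files in `directory` with a matching extension, ignoring case."""
    matches = []
    # glob lists the directory; the extension is checked by hand so that
    # 'IMG_1.JPG' and 'img_2.jpg' are both picked up.
    for path in glob.glob(os.path.join(directory, "*")):
        if os.path.isfile(path) and os.path.splitext(path)[1].lower() in extensions:
            matches.append(path)
    return sorted(matches)
```

Sorting the result keeps the row order of the final table stable between runs, since `glob` does not guarantee any particular order.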

Where a metadata value is stored as a byte string, the script decodes it to ASCII text so that the output stays consistent and readable. GPS information, which is nested under its own tag, is flattened into the same record as the other tags. The per-image records are then collected into a `pandas` data frame, a tabular structure that makes the data easy to manipulate and analyze.
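The decoding step can be sketched as follows. This version tests the value's type with `isinstance` rather than inspecting its string form, so an ordinary string that happens to start with the letter `b` is left alone. The helper name `to_text` is hypothetical, and `errors="replace"` is an assumption about how bytes outside the ASCII range should be handled.

```python
def to_text(value):
    """Decode bytes to str; return any other value unchanged."""
    if isinstance(value, bytes):
        # errors="replace" substitutes U+FFFD for each byte outside ASCII
        # instead of raising UnicodeDecodeError (an assumed policy).
        return value.decode("ascii", errors="replace")
    return value


print(to_text(b"Canon"))  # bytes are decoded
print(to_text("blue"))    # a str starting with 'b' passes through untouched
```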

Finally, the script writes the data frame to a single CSV (Comma-Separated Values) file at a fixed path, so that the EXIF data for the whole collection is available in one place for later review and analysis.
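Photos do not all carry the same tags, so the per-image records have different keys, and the table needs a column for every tag seen in any photo. The sketch below illustrates that idea with the standard-library `csv` module and made-up records; the script itself relies on `pandas` for this step. The function name `records_to_csv` and the sample values are hypothetical.

```python
import csv
import io


def records_to_csv(records):
    """Serialize a list of dicts with differing keys to CSV text."""
    # Collect every key seen, in first-seen order, to form the header row.
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    # restval="" leaves a cell empty when a record lacks that column.
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up records: the second photo has no 'Model' tag, the first no GPS tag.
sample = [
    {"ImageFilename": "a.jpg", "Model": "X100"},
    {"ImageFilename": "b.jpg", "GPSLatitudeRef": "N"},
]
print(records_to_csv(sample))
```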

In [65]:
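import glob
import os

import pandas as pd
from PIL import Image
from PIL.ExifTags import TAGS, GPSTAGS


def decode_if_bytes(value):
    # Decode byte strings to text; leave every other type untouched.
    # errors='replace' avoids a UnicodeDecodeError on non-ASCII bytes.
    if isinstance(value, bytes):
        return value.decode('ascii', errors='replace')
    return value


def get_exif_data(image_path):
    # Take the file name from the path using the platform's separator
    # (splitting on "\\" fails for the forward-slash paths used below).
    exif_dict = {'ImageFilename': os.path.basename(image_path)}
    with Image.open(image_path) as image:
        # _getexif() is a private Pillow method; it returns None when
        # the image has no EXIF data.
        exif_data = image._getexif()
    if exif_data is None:
        return exif_dict
    for key, value in exif_data.items():
        if key not in TAGS:
            continue
        tag = TAGS[key]
        if tag == 'GPSInfo':
            # Flatten the nested GPS dictionary into the top-level record.
            for t in value:
                sub_tag = GPSTAGS.get(t, t)
                exif_dict[sub_tag] = decode_if_bytes(value[t])
        else:
            exif_dict[tag] = decode_if_bytes(value)
    return exif_dict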


if __name__ == "__main__":
    image_files = glob.glob('/projectnb/ds549/students/vedikas/ml-terc-image-geolocation/data/*.jpg')
    all_exif_data = []
    
    for image_path in image_files:
        exif_data = get_exif_data(image_path)  
        all_exif_data.append(exif_data)
    
    df = pd.DataFrame(all_exif_data)
    df.to_csv('/projectnb/ds549/students/vedikas/ml-terc-image-geolocation/data/exif_metadata.csv', index=False)
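
EXIF stores latitude and longitude as degrees, minutes, and seconds together with a hemisphere letter (`GPSLatitudeRef`, `GPSLongitudeRef`). A possible follow-up step, sketched here under the assumption that each of the three parts can be converted with `float()`, turns them into signed decimal degrees. The helper name `dms_to_decimal` is hypothetical and not part of the script above.

```python
def dms_to_decimal(dms, ref):
    """Convert (degrees, minutes, seconds) plus a hemisphere letter to signed decimal degrees."""
    degrees, minutes, seconds = (float(part) for part in dms)
    decimal = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by convention.
    return -decimal if ref in ("S", "W") else decimal
```

Such a conversion would have to run on the values returned by `get_exif_data`, before they are written to CSV, because the CSV file holds them only as text.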