## **Team members**

|Team member|GitHub username|
|--|--|
|Nima Ghafar|NimaGhafar|
|Busse Heemskerk|BJHeemskerk|
|Henry Lau||
|Jesse van Leeuwen|22096337|

# *Photo recognition app*

This notebook sets up the pipeline that loads the data and that trains, deploys, and retrains the model.

-further assignment description-

-table of contents-

## Loading the libraries and the data

In [2]:
import pandas as pd
import numpy as np
import os

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [9]:
import os
os.chdir('/content/drive/My Drive/mlops/MLOps_2024')

## Start of the data-ingestion pipeline

To load the data, several functions place the image paths in the correct DataFrames. The train, test, and validation images are split across three separate DataFrames.

In [10]:
# Function to read image filenames from a txt file
def read_image_filenames(file_path):
    with open(file_path, 'r') as file:
        lines = file.read().splitlines()
    return lines

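# Function to read the caption tokens
def read_tokens(file_path):
    with open(file_path, 'r') as file:
        lines = file.read().splitlines()
    tokens_dict = {}
    for line in lines:
        parts = line.split()
        # Skip empty lines, which would otherwise raise an IndexError
        if not parts:
            continue
        # Strip the '#<n>' suffix to get the image filename
        image_token = parts[0].split('#')[0]
        tokens = parts[1:]
        # Note: a later line for the same image overwrites the earlier one
        tokens_dict[image_token] = tokens
    return tokens_dict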

# Function to load the images into DataFrames
def load_images_into_dataframes(data_dir, label_dir, labels_file, train_file, test_file, val_file):
    # Read the filenames from the txt files
    train_filenames = read_image_filenames(os.path.join(label_dir, train_file))
    test_filenames = read_image_filenames(os.path.join(label_dir, test_file))
    val_filenames = read_image_filenames(os.path.join(label_dir, val_file))

    # Read all tokens
    tokens_dict = read_tokens(os.path.join(label_dir, labels_file))

    # Create the DataFrames
    train_df = pd.DataFrame({'filename': train_filenames})
    test_df = pd.DataFrame({'filename': test_filenames})
    val_df = pd.DataFrame({'filename': val_filenames})

    # Add the full file paths
    train_df['filepath'] = train_df['filename'].apply(lambda x: os.path.join(data_dir, x))
    test_df['filepath'] = test_df['filename'].apply(lambda x: os.path.join(data_dir, x))
    val_df['filepath'] = val_df['filename'].apply(lambda x: os.path.join(data_dir, x))

    # Add the matching labels to each DataFrame
    train_df['labels'] = train_df['filename'].apply(lambda x: tokens_dict.get(x, []))
    test_df['labels'] = test_df['filename'].apply(lambda x: tokens_dict.get(x, []))
    val_df['labels'] = val_df['filename'].apply(lambda x: tokens_dict.get(x, []))

    return train_df, test_df, val_df

# Assign all paths
data_directory = 'Images'
label_directory = 'Label_files'
labels_file = 'Flickr8k.token.txt'
train_file_path = 'Flickr_8k.trainImages.txt'
test_file_path = 'Flickr_8k.testImages.txt'
val_file_path = 'Flickr_8k.devImages.txt'

# Load the images into the datasets
train_df, test_df, val_df = load_images_into_dataframes(
    data_directory, label_directory, labels_file, train_file_path, test_file_path, val_file_path
)

# Show the datasets
print("Train DataFrame:")
display(train_df.head())

print("\nTest DataFrame:")
display(test_df.head())

print("\nValidation DataFrame:")
display(val_df.head())


Train DataFrame:


| |filename|filepath|labels|
|--|--|--|--|
|0|2513260012_03d33305cf.jpg|Images/2513260012_03d33305cf.jpg|[Two, dogs, running, through, a, low, lying, b...|
|1|2903617548_d3e38d7f88.jpg|Images/2903617548_d3e38d7f88.jpg|[The, little, boy, is, playing, with, a, croqu...|
|2|3338291921_fe7ae0c8f8.jpg|Images/3338291921_fe7ae0c8f8.jpg|[A, dog, with, something, pink, in, its, mouth...|
|3|488416045_1c6d903fe0.jpg|Images/488416045_1c6d903fe0.jpg|[The, large, brown, dog, is, running, on, the,...|
|4|2644326817_8f45080b87.jpg|Images/2644326817_8f45080b87.jpg|[The, black, dog, is, dropping, a, red, disc, ...|



Test DataFrame:


| |filename|filepath|labels|
|--|--|--|--|
|0|3385593926_d3e9c21170.jpg|Images/3385593926_d3e9c21170.jpg|[Two, dogs, playing, in, the, snow, .]|
|1|2677656448_6b7e7702af.jpg|Images/2677656448_6b7e7702af.jpg|[The, small, brown, and, white, dog, is, in, t...|
|2|311146855_0b65fdb169.jpg|Images/311146855_0b65fdb169.jpg|[Two, people, are, dancing, with, drums, on, t...|
|3|1258913059_07c613f7ff.jpg|Images/1258913059_07c613f7ff.jpg|[Three, people, sit, at, a, picnic, table, out...|
|4|241347760_d44c8d3a01.jpg|Images/241347760_d44c8d3a01.jpg|[The, American, footballer, is, wearing, a, re...|



Validation DataFrame:


| |filename|filepath|labels|
|--|--|--|--|
|0|2090545563_a4e66ec76b.jpg|Images/2090545563_a4e66ec76b.jpg|[two, young, children, on, a, skateboard, goin...|
|1|3393035454_2d2370ffd4.jpg|Images/3393035454_2d2370ffd4.jpg|[Child, in, blue, and, grey, shirt, jumping, o...|
|2|3695064885_a6922f06b2.jpg|Images/3695064885_a6922f06b2.jpg|[The, woman, is, leading, a, dog, through, an,...|
|3|1679557684_50a206e4a9.jpg|Images/1679557684_50a206e4a9.jpg|[Two, dogs, playing, with, a, tennis, ball, in...|
|4|3582685410_05315a15b8.jpg|Images/3582685410_05315a15b8.jpg|[Two, women, in, bathing, suits, climb, rock, ...|
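
Before moving on, it can be useful to verify that the three splits do not share images and that every image received a label. The sketch below is one possible check, not part of the original pipeline. The function name `check_splits` and the small example DataFrames are hypothetical; the check only assumes the `filename` and `labels` columns created above.

```python
import pandas as pd


def check_splits(train_df, test_df, val_df):
    """Return overlap counts between splits and unlabeled row counts per split."""
    splits = {'train': train_df, 'test': test_df, 'val': val_df}
    names = list(splits)
    report = {}

    # Count filenames shared between each pair of splits (ideally 0)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = set(splits[first]['filename']) & set(splits[second]['filename'])
            report[f'overlap_{first}_{second}'] = len(shared)

    # Count rows with an empty label list (filename missing from the token file)
    for name, df in splits.items():
        report[f'unlabeled_{name}'] = int((df['labels'].apply(len) == 0).sum())

    return report


# Tiny hypothetical DataFrames, only to illustrate the check
example_train = pd.DataFrame({'filename': ['a.jpg', 'b.jpg'], 'labels': [['A', 'dog'], []]})
example_test = pd.DataFrame({'filename': ['c.jpg'], 'labels': [['Two', 'cats']]})
example_val = pd.DataFrame({'filename': ['b.jpg'], 'labels': [['A', 'boy']]})

report = check_splits(example_train, example_test, example_val)
print(report)
```

In the notebook itself, the same function could be called as `check_splits(train_df, test_df, val_df)`; any non-zero value in the report would be worth investigating before training.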


Now that the datasets are loaded, we can look at different forms of feature engineering. Since the tokens are the easiest to inspect, we look at these first.
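
One point to keep in mind when inspecting the tokens: `read_tokens` stores a single token list per image name, so if the token file contains several captions for the same image, only the last one is kept. The sketch below shows one way to keep all of them. It assumes that each line starts with `<filename>#<n>` followed by the caption, which is what the `split('#')` call in `read_tokens` suggests; the function name `parse_all_captions` and the example lines are hypothetical.

```python
from collections import defaultdict


def parse_all_captions(lines):
    """Group the tokens of every caption under its image filename."""
    captions = defaultdict(list)
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip empty lines
        image_name = parts[0].split('#')[0]
        captions[image_name].append(parts[1:])
    return dict(captions)


# Hypothetical lines in the assumed token-file format
example_lines = [
    'img_1.jpg#0\tTwo dogs run .',
    'img_1.jpg#1\tDogs are running through water .',
    'img_2.jpg#0\tA boy plays outside .',
]
all_captions = parse_all_captions(example_lines)
print(len(all_captions['img_1.jpg']))  # 2 captions kept for img_1.jpg
```

Whether one caption or all captions per image should be used depends on how the model is trained, so this is a design choice rather than a required change.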

In [None]:
# Show the tokens of the first training image
train_df['labels'][0]

['Two',
 'dogs',
 'running',
 'through',
 'a',
 'low',
 'lying',
 'body',
 'of',
 'water',
 '.']

As can be seen, the tokens describe the content of the image. According to the tokens, the image contains two dogs running through a shallow body of water. However, the caption also ends with a period. This token is redundant and only passes unnecessary information to the model.
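
A minimal sketch of how such punctuation tokens could be removed is shown below. It is one possible approach, not necessarily the one used later in this notebook. The function name `remove_punctuation_tokens` is hypothetical, and a token is treated as punctuation when every character in it appears in Python's `string.punctuation`.

```python
import string

import pandas as pd


def remove_punctuation_tokens(tokens):
    """Drop tokens that consist solely of punctuation characters."""
    return [
        token for token in tokens
        if not all(char in string.punctuation for char in token)
    ]


# Example caption: the tokens of the first training image shown above
example_df = pd.DataFrame({
    'labels': [['Two', 'dogs', 'running', 'through', 'a', 'low',
                'lying', 'body', 'of', 'water', '.']]
})
example_df['labels'] = example_df['labels'].apply(remove_punctuation_tokens)
print(example_df['labels'][0])
```

Tokens that merely contain punctuation, such as a hyphenated word, are kept, because only tokens made up entirely of punctuation are dropped. The same `apply` call could be used on the `labels` column of `train_df`, `test_df`, and `val_df`.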

## Start of the ML pipeline