## 2.2 - Enrich Tfrecords

Code that modifies the processed TFRecords (files in root/Datasets/SelectedClusters/nexus_tfrecords_processed) to add the two remaining indicators (longevity and literacy) that the process method left out.

## Prerequisites

In [None]:
import glob
import os
import numpy as np
import pandas as pd
import tensorflow as tf

Paths to Required Files

In [9]:
# Path to processed Tfrecords - Created in Code 02
tfrecords_path = "/root/Datasets/SelectedClusters/nexus_tfrecords_processed/brazil_2010"
# Path to the .csv file where the indicator data is stored - Created in Preprocessing
csv_path = "/root/Datasets/SelectedClusters"

# Path to new Processed Tfrecords - Output of this File
new_tfrecords_path = "/root/Datasets/SelectedClusters/nexus_tfrecords_processed_new/brazil_2010"

Creating a list containing all the Filenames of the Processed Tfrecords

In [None]:
filenames = glob.glob(os.path.join(tfrecords_path, "*.tfrecord.gz"))
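`glob.glob` returns paths in an arbitrary, filesystem-dependent order. Because the new files written later in this notebook are named after their position in the iteration, sorting the list is one way to make a run reproducible. A minimal sketch, in which a temporary folder and made-up file names stand in for the real TFRecord directory (`demo_path`, `demo_filenames` and `demo_basenames` are hypothetical names):

```python
import glob
import os
import tempfile

# Hypothetical stand-in for tfrecords_path, filled with empty placeholder files
with tempfile.TemporaryDirectory() as demo_path:
    for name in ["00002.tfrecord.gz", "00000.tfrecord.gz", "00001.tfrecord.gz"]:
        open(os.path.join(demo_path, name), "wb").close()

    # sorted() gives a deterministic order regardless of the filesystem
    demo_filenames = sorted(glob.glob(os.path.join(demo_path, "*.tfrecord.gz")))
    demo_basenames = [os.path.basename(f) for f in demo_filenames]

print(demo_basenames)
```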

Configuration to save the new Tfrecords in .gz format

In [None]:
options = tf.io.TFRecordOptions(compression_type = 'GZIP')

Loading the DataFrame and Changing Format

- Float64 -> Float32
- String (Object) -> Bytes

In [10]:
dataset = pd.read_csv(os.path.join(csv_path,'dataset_clean.csv'), float_precision='high')
for col in dataset.columns:
        if dataset[col].dtype == np.float64:
            dataset[col] = dataset[col].astype(np.float32)
        elif dataset[col].dtype == object:  # pandas uses 'object' type for str
            dataset[col] = dataset[col].astype(bytes)
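The `float_list` features of a `tf.train.Example` store 32-bit floats. One plausible reason for the Float64 -> Float32 cast (an assumption, the notebook does not state it) is that the later lat/lon lookup compares values by exact equality, which only succeeds when both sides are rounded to the same precision. A small sketch using only the standard library, where `to_float32` is a hypothetical helper and the longitude is the first value from the table in this notebook:

```python
import struct

def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

lon_csv = -60.10067               # value as parsed from the .csv (64-bit)
lon_stored = to_float32(lon_csv)  # value as a 32-bit float field would hold it

# The raw 64-bit value does not equal its 32-bit rounding
print(lon_csv == lon_stored)
# Rounding both sides to 32 bits makes the comparison succeed
print(to_float32(lon_csv) == lon_stored)
```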

Verifying Changes

In [11]:
dataset.head()

Unnamed: 0,country,lon,lat,literacy,longevity,income,year,TYPE
0,b'brazil',-60.10067,-12.772811,0.963506,0.768335,0.628482,2010,b'URBAN'
1,b'brazil',-48.270237,-7.218174,0.903346,0.758552,0.52086,2010,b'URBAN'
2,b'brazil',-48.210049,-7.157114,0.926798,0.781569,0.580544,2010,b'URBAN'
3,b'brazil',-49.162132,-11.892742,0.907821,0.790722,0.476975,2010,b'URBAN'
4,b'brazil',-48.468857,-8.062309,0.932752,0.778629,0.543212,2010,b'URBAN'


Function that serializes a dict into a string, which is what gets written to the TFRecords

In [12]:
def serialize(record):
    serializable_record = {key : feature for key, (feature, _) in record.items()}
    record_example = tf.train.Example(features=tf.train.Features(feature=serializable_record))
    return record_example.SerializeToString()

Algorithm that injects the new data into the TFRecords, creating new ones

Loop for each Tfrecord File
- Loading the Tfrecord
- Parsing and extracting the features from the string
- Creating a dictionary with the features
- Searching the .csv file for the same location (through Latitude and Longitude)
- Creating new features to put together with the old ones
- Coding the Dictionary into the String again
- Saving into a new Tfrecord

In [14]:
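tf_data = tf.data.TFRecordDataset(filenames, compression_type = 'GZIP')
indicators = ["literacy","longevity","income"]

# Make sure the output folder exists before writing to it
os.makedirs(new_tfrecords_path, exist_ok=True)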

for i, raw_record in enumerate(tf_data.take(len(filenames))):
    record = {}
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())

    # We first map each record feature into a dictionary
    for k, feat in example.features.feature.items():
        record[k] = (feat, feat.WhichOneof('kind'))

    # Now we retrieve lat/lon to match with info in the dataset
    # and update the dictionary
    recorded_lat_tuple = record["lat"]
    recorded_lon_tuple = record["lon"]
    lat = getattr(*recorded_lat_tuple).value[0]
    lon = getattr(*recorded_lon_tuple).value[0]
    row = dataset[(dataset['lon'] == lon) & (dataset['lat'] == lat)].reset_index()

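    # Now we update the record with each indicator value
    # (.iloc[0] takes the scalar from the matched row instead of passing a Series)
    for indicator in indicators:
        value = float(row[indicator].iloc[0])
        feat = tf.train.Feature(float_list=tf.train.FloatList(value=[value]))
        record[indicator] = (feat, feat.WhichOneof('kind'))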
        
    serialized_string = serialize(record)
    with tf.io.TFRecordWriter(os.path.join(new_tfrecords_path, f'{i:05d}.tfrecord.gz'), options=options) as writer:
        writer.write(serialized_string)
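The matching step filters the whole DataFrame once per record. As an alternative sketch, not the method used in this notebook, the rows can be indexed once in a dictionary keyed by `(lon, lat)` and then looked up per record. The rows below are made-up placeholders shaped like `dataset_clean.csv`, and `build_lookup` is a hypothetical helper. A lookup that returns `None` also makes a missing match explicit.

```python
def build_lookup(rows, indicators):
    """Map (lon, lat) -> {indicator: value} for constant-time matching."""
    lookup = {}
    for row in rows:
        key = (row["lon"], row["lat"])
        lookup[key] = {name: row[name] for name in indicators}
    return lookup

# Placeholder rows with illustrative values (not real data)
rows = [
    {"lon": -60.0, "lat": -12.5, "literacy": 0.9, "longevity": 0.75, "income": 0.5},
    {"lon": -48.25, "lat": -7.0, "literacy": 0.8, "longevity": 0.7, "income": 0.6},
]
indicators = ["literacy", "longevity", "income"]
lookup = build_lookup(rows, indicators)

match = lookup.get((-48.25, -7.0))   # found: dict of indicator values
missing = lookup.get((0.0, 0.0))     # not found: None, so it can be handled
print(match, missing)
```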

Verifying the Process - Amount of Files Created

The new folder must contain the same number of files as the original one (20438)

In [15]:
import os

# folder path
dir_path = "/root/Datasets/SelectedClusters/nexus_tfrecords_processed_new/brazil_2010/"
count = 0
# Iterate directory
for path in os.listdir(dir_path):
    # check if current path is a file
    if os.path.isfile(os.path.join(dir_path, path)):
        count += 1
print('File count:', count)

File count: 20438
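The same check can be written as a direct comparison between the input and output folders rather than against a hard-coded number. A minimal sketch, assuming both folders contain only `*.tfrecord.gz` files; `count_tfrecords` is a hypothetical helper, and temporary folders with placeholder files stand in for the real paths:

```python
import glob
import os
import tempfile

def count_tfrecords(folder):
    """Count the *.tfrecord.gz files directly inside a folder."""
    return len(glob.glob(os.path.join(folder, "*.tfrecord.gz")))

# Temporary folders with placeholder files stand in for the real paths
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    for folder in (src, dst):
        for i in range(3):
            open(os.path.join(folder, f"{i:05d}.tfrecord.gz"), "wb").close()
    n_src, n_dst = count_tfrecords(src), count_tfrecords(dst)

# Fail loudly if the two folders disagree
assert n_src == n_dst, f"expected {n_src} files, found {n_dst}"
print("File count matches:", n_dst)
```

With the real folders, `src` and `dst` would be replaced by `tfrecords_path` and `new_tfrecords_path`.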
