# Processing 
This notebook takes the following steps: 
1. New versions of the AU (action unit) files are created with the unneeded columns dropped. Doing this first saves time later, when all the files are combined. 
- Output: "full donation data - dropped columns" 
2. Once there is a new folder with the reduced files, two columns are added to each file: 'ID' and 'Time_point'. 
- Output: "full donation data - ID and time point"
3. Finally, all the files in the folder 'full donation data' are combined into one big file. 
- Output: "processed data"

In [32]:
# import 
import zipfile
import os
import pandas as pd
import csv
import socket  # Import the socket module
import pickle

# Unzipping the zipped folder (only have to do this once!)

Running it once took me around 2 minutes and 5 seconds. 
After that, the file path is: '/Users/dionnespaltman/Desktop/downloading/full donation data - unzipped/full donation data'

In [10]:
# Define paths
zip_file_path = '/Users/dionnespaltman/Desktop/downloading/full donation data.zip'
output_folder = '/Users/dionnespaltman/Desktop/downloading/full donation data - unzipped'

# Create the output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Unzip the folder
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(output_folder)

print("Unzipped folder created at:", output_folder)

Unzipped folder created at: /Users/dionnespaltman/Desktop/downloading/full donation data - unzipped
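
Since the archive only has to be extracted once, the cell can be guarded so that re-running the notebook skips the extraction. This is a minimal sketch, not part of the original notebook: `unzip_once` is a hypothetical helper, and it assumes that a non-empty output folder means the archive was already extracted. The demo uses a throwaway archive in a temporary directory.

```python
import os
import tempfile
import zipfile


def unzip_once(zip_file_path, output_folder):
    """Extract the archive unless output_folder already has content.

    Hypothetical helper: returns True if extraction ran, False if it was skipped.
    """
    if os.path.isdir(output_folder) and os.listdir(output_folder):
        return False
    os.makedirs(output_folder, exist_ok=True)
    with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
        zip_ref.extractall(output_folder)
    return True


# Demo on a small throwaway archive (not the real donation data)
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("full donation data/16-07.csv", "frame, face_id\n1, 0\n")

    target = os.path.join(tmp, "unzipped")
    first_run = unzip_once(demo_zip, target)   # extracts
    second_run = unzip_once(demo_zip, target)  # skipped, folder is not empty
    extracted = os.path.exists(
        os.path.join(target, "full donation data", "16-07.csv")
    )
```

With the real paths, the call would be `unzip_once(zip_file_path, output_folder)`.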


# If you've unzipped already, only load the unzipped folder 

### Size of all the files 

In [16]:
import os

# Define the path to the folder
folder_path = '/Users/dionnespaltman/Desktop/downloading/full donation data - unzipped/full donation data'

# Initialize total size variable
total_size = 0

# Iterate over all files in the folder
for filename in os.listdir(folder_path):
    file_path = os.path.join(folder_path, filename)
    # Check if it's a file (not a directory)
    if os.path.isfile(file_path):
        # Get the size of the file and add it to total_size
        total_size += os.path.getsize(file_path)

# Convert total size to a human-readable format (e.g., bytes to megabytes)
total_size_mb = total_size / (1024 * 1024)  # Convert bytes to megabytes

print("Total size of all files in the folder:", total_size_mb, "MB")


Total size of all files in the folder: 18684.90839099884 MB
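
The figure above is bytes divided by 1024 * 1024, so 18684.9 of these units is roughly 18.2 GiB. A small formatting helper makes totals of this size easier to read. This is a sketch; `format_size` is a hypothetical name, not something the notebook defines.

```python
def format_size(num_bytes):
    """Return a human-readable size using binary (1024-based) units."""
    size = float(num_bytes)
    for unit in ["B", "KiB", "MiB", "GiB"]:
        if size < 1024:
            return f"{size:.2f} {unit}"
        size /= 1024
    # Anything that is still 1024 GiB or more is reported in TiB
    return f"{size:.2f} TiB"


one_kib = format_size(1024)
five_gib = format_size(5 * 1024 ** 3)
```

In the cell above this would be used as `print(format_size(total_size))`.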


### Loading all the files into a dictionary 

#### I got a lot of errors when I tried to load all files at once, so I first tried loading 10 files. That took around 14.2 seconds. 

In [21]:
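# Set the default timeout for new network sockets to 60 seconds (adjust as needed)
# Note: this only affects network connections; it does not change how pd.read_csv reads local files
socket.setdefaulttimeout(60)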

# Define the path to the folder containing the unzipped files
folder_path = '/Users/dionnespaltman/Desktop/downloading/full donation data - unzipped/full donation data'

# List all files in the folder
file_names = os.listdir(folder_path)

# Read a subset of files into a dictionary
data = {}
num_files_to_read = 10  # Adjust the number of files to read as needed
for i, file_name in enumerate(file_names):
    if i >= num_files_to_read:
        break
    if file_name.endswith('.csv'):  # Assuming the files are CSV format
        file_path = os.path.join(folder_path, file_name)
        try:
            data[file_name] = pd.read_csv(file_path)
            print("File loaded successfully:", file_name)
        except Exception as e:
            print("Error loading file:", file_name, "- Error:", e)

# 'data' now maps each file name to its DataFrame (up to the specified number of files)
# Access a DataFrame by its file name, e.g. data['16-07.csv']
print("Number of files loaded:", len(data))


File loaded successfully: 16-07.csv
File loaded successfully: 80_04,05,06.csv
File loaded successfully: 7-04-05-06.csv
File loaded successfully: 324_02.csv
File loaded successfully: 87_04,05,06.csv
File loaded successfully: 78_04,05,06.csv
File loaded successfully: 328_02.csv
File loaded successfully: 105_04 donation not completed. No blood flow.csv
File loaded successfully: 92_04,05,06.csv
File loaded successfully: 38_4,5,6.csv
Number of files loaded: 10


#### Loading 99 files took a little over 2 minutes. 

In [24]:
# Define the path to the folder containing the unzipped files
folder_path = '/Users/dionnespaltman/Desktop/downloading/full donation data - unzipped/full donation data'

# List all files in the folder
file_names = os.listdir(folder_path)

# Read a subset of files into a dictionary
data = {}
num_files_to_read = 100  # Adjust the number of files to read as needed
for i, file_name in enumerate(file_names):
    if i >= num_files_to_read:
        break
    if file_name.endswith('.csv'):  # Assuming the files are CSV format
        file_path = os.path.join(folder_path, file_name)
        try:
            data[file_name] = pd.read_csv(file_path)
            print("File loaded successfully:", file_name)
        except Exception as e:
            print("Error loading file:", file_name, "- Error:", e)

# 'data' now maps each file name to its DataFrame (up to the specified number of files)
# Access a DataFrame by its file name, e.g. data['16-07.csv']
print("Number of files loaded:", len(data))


File loaded successfully: 16-07.csv
Error loading file: 80_04,05,06.csv - Error: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
File loaded successfully: 7-04-05-06.csv
File loaded successfully: 324_02.csv
File loaded successfully: 87_04,05,06.csv
File loaded successfully: 78_04,05,06.csv
File loaded successfully: 328_02.csv
File loaded successfully: 105_04 donation not completed. No blood flow.csv
File loaded successfully: 92_04,05,06.csv
File loaded successfully: 38_4,5,6.csv
File loaded successfully: 129_04,05,06.csv
File loaded successfully: 85_07.csv
File loaded successfully: DSCN2370.csv
File loaded successfully: 95_04,05,06.csv
File loaded successfully: 300_02.csv
File loaded successfully: 20-07.csv
File loaded successfully: 97_07.csv
File loaded successfully: 290_01.csv
File loaded successfully: 312_02.csv
File loaded successfully: 127_03.csv
File loaded successfully: 40_07.csv
File loaded successfully: 101_07.csv
File loaded success
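
Only a fraction of the columns is needed later, so one way to reduce loading time and memory is to skip unwanted columns while reading instead of dropping them afterwards. This is a minimal sketch using the `usecols` parameter of `pd.read_csv` with a callable. The predicate `keep_column` and the set of columns it keeps are assumptions for illustration, and the CSV text is made-up toy data in the same layout (column names with a leading space).

```python
import io

import pandas as pd


def keep_column(name):
    """Hypothetical predicate: keep bookkeeping columns and all AU columns."""
    name = name.strip()  # the column names carry a leading space
    return name in {"frame", "face_id", "confidence", "success"} or name.startswith("AU")


# Toy CSV that mimics the layout of the real files
csv_text = (
    "frame, face_id, timestamp, gaze_0_x, AU01_r, AU01_c\n"
    "1, 0, 0.00, -0.18, 0.00, 1.0\n"
    "2, 0, 0.04, -0.19, 0.07, 1.0\n"
)

# Columns for which keep_column returns False are never loaded
df_small = pd.read_csv(io.StringIO(csv_text), usecols=keep_column)
kept = [column.strip() for column in df_small.columns]
```

For a real file, `io.StringIO(csv_text)` would be replaced by `file_path`.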

#### Loading all files (took a little less than 8 minutes)

In [25]:
# Define the path to the folder containing the unzipped files
folder_path = '/Users/dionnespaltman/Desktop/downloading/full donation data - unzipped/full donation data'

# List all files in the folder
file_names = os.listdir(folder_path)

# Read all files into a dictionary
data = {}
num_files_to_read = 450  # Limit set high enough to cover all files (adjust as needed)
for i, file_name in enumerate(file_names):
    if i >= num_files_to_read:
        break
    if file_name.endswith('.csv'):  # Assuming the files are CSV format
        file_path = os.path.join(folder_path, file_name)
        try:
            data[file_name] = pd.read_csv(file_path)
            print("File loaded successfully:", file_name)
        except Exception as e:
            print("Error loading file:", file_name, "- Error:", e)

# 'data' now maps each file name to its DataFrame
# Access a DataFrame by its file name, e.g. data['16-07.csv']
print("Number of files loaded:", len(data))


File loaded successfully: 16-07.csv
File loaded successfully: 80_04,05,06.csv
File loaded successfully: 7-04-05-06.csv
File loaded successfully: 324_02.csv
File loaded successfully: 87_04,05,06.csv
File loaded successfully: 78_04,05,06.csv
File loaded successfully: 328_02.csv
File loaded successfully: 105_04 donation not completed. No blood flow.csv
File loaded successfully: 92_04,05,06.csv
File loaded successfully: 38_4,5,6.csv
File loaded successfully: 129_04,05,06.csv
File loaded successfully: 85_07.csv
File loaded successfully: DSCN2370.csv
File loaded successfully: 95_04,05,06.csv
File loaded successfully: 300_02.csv
File loaded successfully: 20-07.csv
File loaded successfully: 97_07.csv
File loaded successfully: 290_01.csv
File loaded successfully: 312_02.csv
File loaded successfully: 127_03.csv
File loaded successfully: 40_07.csv
File loaded successfully: 101_07.csv
File loaded successfully: 118_04,05,06.csv
File loaded successfully: 14-03,04,05,06.csv
File loaded successfully: 

  data[file_name] = pd.read_csv(file_path)


File loaded successfully: 108_04,05,06.csv
File loaded successfully: 80_07.csv
File loaded successfully: 305_02.csv
File loaded successfully: 299_01.csv
Number of files loaded: 411
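
Because individual files can fail to load, as happened in the 100-file run, it helps to collect the failures instead of only printing them, so they can be retried afterwards. This is a sketch under assumptions: `load_csv_folder` is a hypothetical helper, and the demo folder contains made-up files, one of them empty so that `pd.read_csv` raises `EmptyDataError`.

```python
import os
import tempfile

import pandas as pd


def load_csv_folder(folder_path):
    """Load every .csv file in a folder; return (loaded, failed) dictionaries."""
    loaded, failed = {}, {}
    for file_name in sorted(os.listdir(folder_path)):
        if not file_name.endswith(".csv"):
            continue  # ignore anything that is not a CSV file
        file_path = os.path.join(folder_path, file_name)
        try:
            loaded[file_name] = pd.read_csv(file_path)
        except (pd.errors.ParserError, pd.errors.EmptyDataError, OSError) as error:
            failed[file_name] = str(error)  # remember the reason for a later retry
    return loaded, failed


# Demo with one valid file, one empty file, and one non-CSV file
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "good.csv"), "w") as handle:
        handle.write("frame, AU01_r\n1, 0.0\n2, 0.07\n")
    open(os.path.join(tmp, "empty.csv"), "w").close()
    with open(os.path.join(tmp, "notes.txt"), "w") as handle:
        handle.write("not a csv")

    demo_data, demo_failed = load_csv_folder(tmp)
```

The keys of `demo_failed` are the file names to retry, for example with `engine='python'` as the parser error message suggests.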


Double check if it worked. 

In [28]:
# Check if the dictionary is not empty
if data:
    # Get the first key (file name) and its corresponding DataFrame
    first_file_name = next(iter(data.keys()))
    first_df = data[first_file_name]
    
    # Display the DataFrame
    print("DataFrame for the first file '{}' in the dictionary:".format(first_file_name))
    display(first_df)
else:
    print("The dictionary is empty. No files loaded.")


DataFrame for the first file '16-07.csv' in the dictionary:


Unnamed: 0,frame,face_id,timestamp,confidence,success,gaze_0_x,gaze_0_y,gaze_0_z,gaze_1_x,gaze_1_y,...,AU14_c,AU15_c,AU17_c,AU20_c,AU23_c,AU25_c,AU26_c,AU28_c,AU45_c,Filename
0,1,0,0.00,0.98,1,-0.186231,-0.071084,-0.979931,-0.321020,-0.025671,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,16-07.csv
1,2,0,0.04,0.98,1,-0.195272,-0.066113,-0.978518,-0.321741,-0.022495,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,16-07.csv
2,3,0,0.08,0.98,1,-0.183891,-0.072790,-0.980248,-0.325824,-0.032522,...,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,16-07.csv
3,4,0,0.12,0.98,1,-0.189050,-0.061480,-0.980041,-0.323833,-0.018319,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,16-07.csv
4,5,0,0.16,0.98,1,-0.193372,-0.059505,-0.979319,-0.317909,-0.020461,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,16-07.csv
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2745,2746,0,109.80,0.98,1,-0.152439,0.008936,-0.988272,-0.382937,0.032291,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,16-07.csv
2746,2747,0,109.84,0.98,1,-0.148107,0.010306,-0.988918,-0.384507,0.019742,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,16-07.csv
2747,2748,0,109.88,0.98,1,-0.154183,0.015469,-0.987921,-0.387649,0.025655,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,16-07.csv
2748,2749,0,109.92,0.98,1,-0.152587,0.013783,-0.988194,-0.396512,0.031367,...,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,16-07.csv


# Saving my progress so far (dictionary with all files including the filename as a column)

Saving the dictionary took a bit more than 3 minutes. 

In [30]:
import pickle

# Specify the file path to save the dictionary
save_path = '/Users/dionnespaltman/Desktop/downloading/data_dictionary.pkl'

# Save the dictionary to a file using pickle
with open(save_path, 'wb') as file:
    pickle.dump(data, file)

print("Dictionary saved to:", save_path)


Dictionary saved to: /Users/dionnespaltman/Desktop/downloading/data_dictionary.pkl
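
Since saving takes minutes, it is worth confirming that the pickle round trip preserves the data before relying on the file. This is a minimal sketch on a toy dictionary written to a temporary directory. The toy values are made up, and passing `protocol=pickle.HIGHEST_PROTOCOL` is my choice here, not something the notebook does.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy stand-in for the real dictionary of DataFrames
toy = {"16-07.csv": pd.DataFrame({"frame": [1, 2], "AU01_r": [0.0, 0.07]})}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data_dictionary.pkl")

    # Save with the newest pickle protocol available in this Python version
    with open(path, "wb") as file:
        pickle.dump(toy, file, protocol=pickle.HIGHEST_PROTOCOL)

    # Load it back while the temporary file still exists
    with open(path, "rb") as file:
        restored = pickle.load(file)

# Compare keys and contents of the restored dictionary with the original
same_keys = list(restored) == list(toy)
round_trip_ok = restored["16-07.csv"].equals(toy["16-07.csv"])
```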


# Loading the dictionary from a pickle file 

In [33]:
# Specify the file path from which to load the dictionary
load_path = '/Users/dionnespaltman/Desktop/downloading/data_dictionary.pkl'  # Update with the actual file path

# Load the dictionary from the file using pickle
with open(load_path, 'rb') as file:
    data = pickle.load(file)

print("Dictionary loaded from:", load_path)


Dictionary loaded from: /Users/dionnespaltman/Desktop/downloading/data_dictionary.pkl

Check if it works: 

In [34]:
if data:
    print("The loaded dictionary contains {} items.".format(len(data)))
else:
    print("The loaded dictionary is empty.")


The loaded dictionary contains 411 items.


# Dropping columns 

### First creating the variable that contains all the column names that need to be deleted 

First getting a list of all column names. 

In [35]:
# Check if the dictionary is not empty
if data:
    # Get the first key (file name) and its corresponding DataFrame
    first_file_name = next(iter(data.keys()))
    first_df = data[first_file_name]
    
    # Print the column names of the DataFrame
    print("Column names of the first file '{}' in the dictionary:".format(first_file_name))
    print(first_df.columns.tolist())
else:
    print("The dictionary is empty. No files loaded.")


Column names of the first file '16-07.csv' in the dictionary:
['frame', ' face_id', ' timestamp', ' confidence', ' success', ' gaze_0_x', ' gaze_0_y', ' gaze_0_z', ' gaze_1_x', ' gaze_1_y', ' gaze_1_z', ' gaze_angle_x', ' gaze_angle_y', ' eye_lmk_x_0', ' eye_lmk_x_1', ' eye_lmk_x_2', ' eye_lmk_x_3', ' eye_lmk_x_4', ' eye_lmk_x_5', ' eye_lmk_x_6', ' eye_lmk_x_7', ' eye_lmk_x_8', ' eye_lmk_x_9', ' eye_lmk_x_10', ' eye_lmk_x_11', ' eye_lmk_x_12', ' eye_lmk_x_13', ' eye_lmk_x_14', ' eye_lmk_x_15', ' eye_lmk_x_16', ' eye_lmk_x_17', ' eye_lmk_x_18', ' eye_lmk_x_19', ' eye_lmk_x_20', ' eye_lmk_x_21', ' eye_lmk_x_22', ' eye_lmk_x_23', ' eye_lmk_x_24', ' eye_lmk_x_25', ' eye_lmk_x_26', ' eye_lmk_x_27', ' eye_lmk_x_28', ' eye_lmk_x_29', ' eye_lmk_x_30', ' eye_lmk_x_31', ' eye_lmk_x_32', ' eye_lmk_x_33', ' eye_lmk_x_34', ' eye_lmk_x_35', ' eye_lmk_x_36', ' eye_lmk_x_37', ' eye_lmk_x_38', ' eye_lmk_x_39', ' eye_lmk_x_40', ' eye_lmk_x_41', ' eye_lmk_x_42', ' eye_lmk_x_43', ' eye_lmk_x_44', ' eye_lm

I am interested in the columns whose names contain 'AU' and end with '_r', e.g. AU45_r; these columns give the intensity of each extracted AU. 
So any column whose name includes gaze, eye, pose, x_, y_, X_, Y_, Z, or p_ can go. 
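
An alternative to listing what can go is to select what should stay. This is a sketch only, and not what the notebook does: `select_intensity_columns` is a hypothetical helper that keeps just `frame` and the `AU.._r` columns, so it also drops the `_c` columns. It strips the leading spaces from the column names first. The toy DataFrame is made up.

```python
import re

import pandas as pd


def select_intensity_columns(df, extra=("frame",)):
    """Keep the columns in `extra` plus every AU intensity column (AUxx_r)."""
    # Strip the leading spaces that the column names carry
    cleaned = df.rename(columns=str.strip)
    au_intensity = [c for c in cleaned.columns if re.fullmatch(r"AU\d+_r", c)]
    keep = [c for c in extra if c in cleaned.columns] + au_intensity
    return cleaned[keep]


toy = pd.DataFrame({
    "frame": [1, 2],
    " timestamp": [0.00, 0.04],
    " gaze_0_x": [-0.18, -0.19],
    " pose_Rx": [0.1, 0.1],
    " AU01_r": [0.00, 0.07],
    " AU45_r": [0.0, 0.0],
    " AU01_c": [1.0, 1.0],
})
selected = select_intensity_columns(toy)
selected_columns = selected.columns.tolist()
```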

In [37]:
# Get all the keys (file names) from the dictionary
keys = data.keys()

# Convert the keys to a list if you need to iterate over them or perform list operations
keys_list = list(keys)

# Print the keys
print("Keys in the dictionary:")
for key in keys:
    print(key)


Keys in the dictionary:
16-07.csv
80_04,05,06.csv
7-04-05-06.csv
324_02.csv
87_04,05,06.csv
78_04,05,06.csv
328_02.csv
105_04 donation not completed. No blood flow.csv
92_04,05,06.csv
38_4,5,6.csv
129_04,05,06.csv
85_07.csv
DSCN2370.csv
95_04,05,06.csv
300_02.csv
20-07.csv
97_07.csv
290_01.csv
312_02.csv
127_03.csv
40_07.csv
101_07.csv
118_04,05,06.csv
14-03,04,05,06.csv
324_01.csv
144_03.csv
113_07.csv
52_07.csv
328_01.csv
31_07.csv
125_07.csv
64_07.csv
146_07.csv
300_01.csv
49_04,05,06.csv
89_04.csv
76_07.csv
68_07.csv
129_07.csv
290_02.csv
312_01.csv
68_03.csv
33_07.csv
94_04,05,06.csv
111_07.csv
39_04,05,06.csv
93_04,05,06.csv
50_07.csv
326_01.csv
64_03.csv
42_07.csv
103_07.csv
79_04,05,06.csv
292_02.csv
310_01.csv
31_03.csv
81_04,05,06.csv
52_03.csv
113_03.csv
74_07.csv
135_07.csv
144_07.csv
13-03,04,05,06.csv
302_01.csv
139_07.csv
78_07.csv
127_07.csv
66_07.csv
18-07.csv
48_04,05,06.csv
326_02.csv
14-07.csv
21-03,04,05,06.csv
95_07.csv
292_01.csv
310_02.csv
128_04,05.csv
119_04,0

In [38]:
# Choose a file name (key) from the data dictionary
file_name = '16-07.csv'

# Access the DataFrame corresponding to the chosen file name
df = data.get(file_name)

# Check if the DataFrame is not None (i.e., it exists)
if df is not None:
    # Get the list of column names
    column_names = df.columns.tolist()
    
    # Print the list of column names
    print("Column names of DataFrame '{}' in the 'data' dictionary:".format(file_name))
    print(column_names)
else:
    print("DataFrame '{}' not found in the 'data' dictionary.".format(file_name))


Column names of DataFrame '16-07.csv' in the 'data' dictionary:
['frame', ' face_id', ' timestamp', ' confidence', ' success', ' gaze_0_x', ' gaze_0_y', ' gaze_0_z', ' gaze_1_x', ' gaze_1_y', ' gaze_1_z', ' gaze_angle_x', ' gaze_angle_y', ' eye_lmk_x_0', ' eye_lmk_x_1', ' eye_lmk_x_2', ' eye_lmk_x_3', ' eye_lmk_x_4', ' eye_lmk_x_5', ' eye_lmk_x_6', ' eye_lmk_x_7', ' eye_lmk_x_8', ' eye_lmk_x_9', ' eye_lmk_x_10', ' eye_lmk_x_11', ' eye_lmk_x_12', ' eye_lmk_x_13', ' eye_lmk_x_14', ' eye_lmk_x_15', ' eye_lmk_x_16', ' eye_lmk_x_17', ' eye_lmk_x_18', ' eye_lmk_x_19', ' eye_lmk_x_20', ' eye_lmk_x_21', ' eye_lmk_x_22', ' eye_lmk_x_23', ' eye_lmk_x_24', ' eye_lmk_x_25', ' eye_lmk_x_26', ' eye_lmk_x_27', ' eye_lmk_x_28', ' eye_lmk_x_29', ' eye_lmk_x_30', ' eye_lmk_x_31', ' eye_lmk_x_32', ' eye_lmk_x_33', ' eye_lmk_x_34', ' eye_lmk_x_35', ' eye_lmk_x_36', ' eye_lmk_x_37', ' eye_lmk_x_38', ' eye_lmk_x_39', ' eye_lmk_x_40', ' eye_lmk_x_41', ' eye_lmk_x_42', ' eye_lmk_x_43', ' eye_lmk_x_44', ' eye_

Get the shape of one of the DataFrames before dropping columns: (2750, 171).

In [40]:
# Get the DataFrame with the key '16-07.csv'
df_16_07 = data['16-07.csv']

# Get the shape of the DataFrame
shape_16_07 = df_16_07.shape

# Print the shape
print("Shape of the DataFrame with key '16-07.csv':", shape_16_07)


Shape of the DataFrame with key '16-07.csv': (2750, 171)


In [59]:
def get_columns_to_delete(columns):
    """Return the column names that contain any of the unwanted substrings."""
    # Substrings to search for in column names (matching is case-sensitive).
    # Note: single letters match broadly, e.g. 'p' also matches ' timestamp',
    # and 'p' already covers every name that contains 'pose'.
    substrings = ['gaze', 'p', 'x', 'X', 'Y', 'Z', 'pose', 'eye']

    # Collect every column whose name contains at least one of the substrings
    columns_to_delete = [
        column for column in columns
        if any(sub in column for sub in substrings)
    ]

    return columns_to_delete


In [46]:
# Example usage:
columns = ['AU01_r', 'AU02_r', 'AU04_r', 'AU05_r', 'gaze_0_x', 'gaze_0_y', 'pose_Rx', 'pose_Ry']
columns_to_delete = get_columns_to_delete(columns)
print(columns_to_delete)

['gaze_0_x', 'gaze_0_y', 'pose_Rx', 'pose_Ry']


In [62]:
# Get the DataFrame from the dictionary
df_to_process = data['16-07.csv']

# Get the list of column names
columns = df_to_process.columns
#print(columns)

# Get the list of columns to delete
columns_to_delete = get_columns_to_delete(columns)

# Print the list of columns to delete
#print(list(columns))
#print(columns_to_delete)
print(len(columns_to_delete))

131
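
To see why a given column ends up on the delete list, it helps to report which substrings matched it. This is a small diagnostic sketch; `matching_substrings` is a hypothetical helper and the column list is a hand-picked sample. It shows, for instance, that the single letter 'p' matches ' timestamp', which is consistent with `timestamp` not appearing among the columns that remain after dropping.

```python
def matching_substrings(column, substrings):
    """Return the substrings that occur in the column name (case-sensitive)."""
    return [sub for sub in substrings if sub in column]


substrings = ['gaze', 'p', 'x', 'X', 'Y', 'Z', 'pose', 'eye']
sample_columns = ['frame', ' timestamp', ' confidence', ' pose_Rx', ' AU01_r', ' AU28_c']

# Map each column name to the substrings that would cause it to be deleted
report = {column: matching_substrings(column, substrings) for column in sample_columns}

for column, matches in report.items():
    print(repr(column), "->", matches if matches else "kept")
```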


### Dropping the columns from every DataFrame 

In [63]:
# Define the new dictionary to store modified DataFrames
data_dropped = {}

# Iterate through all the DataFrames in the original dictionary
for key, df in data.items():
    # Get the list of column names for the current DataFrame
    columns = df.columns
    
    # Get the list of columns to delete
    columns_to_delete = get_columns_to_delete(columns)
    
    # Create a new DataFrame without the columns to delete
    df_dropped = df.drop(columns=columns_to_delete)
    
    # Add the new DataFrame to the new dictionary
    data_dropped[key] = df_dropped


Double check. Now there are 40 columns left. 

In [67]:
# Get the DataFrame with the key '16-07.csv'
df_16_07 = data_dropped['16-07.csv']

# Get the shape of the DataFrame
shape_16_07 = df_16_07.shape

# Print the shape
print("Shape of the DataFrame with key '16-07.csv':", shape_16_07)

Shape of the DataFrame with key '16-07.csv': (2750, 40)


In [70]:
# Get the DataFrame from the dictionary
df_to_process = data_dropped['16-07.csv']

print(df_to_process.columns)

Index(['frame', ' face_id', ' confidence', ' success', ' AU01_r', ' AU02_r',
       ' AU04_r', ' AU05_r', ' AU06_r', ' AU07_r', ' AU09_r', ' AU10_r',
       ' AU12_r', ' AU14_r', ' AU15_r', ' AU17_r', ' AU20_r', ' AU23_r',
       ' AU25_r', ' AU26_r', ' AU45_r', ' AU01_c', ' AU02_c', ' AU04_c',
       ' AU05_c', ' AU06_c', ' AU07_c', ' AU09_c', ' AU10_c', ' AU12_c',
       ' AU14_c', ' AU15_c', ' AU17_c', ' AU20_c', ' AU23_c', ' AU25_c',
       ' AU26_c', ' AU28_c', ' AU45_c', 'Filename'],
      dtype='object')


# Adding ID 

First get the keys of all the dataframes in the dictionary. 

In [71]:
# Get a list of keys in the dictionary
keys_list = list(data_dropped.keys())

# Print the list of keys
print(keys_list)

['16-07.csv', '80_04,05,06.csv', '7-04-05-06.csv', '324_02.csv', '87_04,05,06.csv', '78_04,05,06.csv', '328_02.csv', '105_04 donation not completed. No blood flow.csv', '92_04,05,06.csv', '38_4,5,6.csv', '129_04,05,06.csv', '85_07.csv', 'DSCN2370.csv', '95_04,05,06.csv', '300_02.csv', '20-07.csv', '97_07.csv', '290_01.csv', '312_02.csv', '127_03.csv', '40_07.csv', '101_07.csv', '118_04,05,06.csv', '14-03,04,05,06.csv', '324_01.csv', '144_03.csv', '113_07.csv', '52_07.csv', '328_01.csv', '31_07.csv', '125_07.csv', '64_07.csv', '146_07.csv', '300_01.csv', '49_04,05,06.csv', '89_04.csv', '76_07.csv', '68_07.csv', '129_07.csv', '290_02.csv', '312_01.csv', '68_03.csv', '33_07.csv', '94_04,05,06.csv', '111_07.csv', '39_04,05,06.csv', '93_04,05,06.csv', '50_07.csv', '326_01.csv', '64_03.csv', '42_07.csv', '103_07.csv', '79_04,05,06.csv', '292_02.csv', '310_01.csv', '31_03.csv', '81_04,05,06.csv', '52_03.csv', '113_03.csv', '74_07.csv', '135_07.csv', '144_07.csv', '13-03,04,05,06.csv', '302_01

In [74]:
import re

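# Iterate over the file names (keys) and their DataFrames
for key, df in data_dropped.items():
    # Take the first run of digits in the file name as the ID
    # (note: for a name like 'DSCN2370.csv' this yields '2370')
    match = re.search(r'\d+', key)

    # Add the ID as a new column (kept as a string; None if the name has no digits)
    df['ID'] = match.group() if match else None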
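
Taking the first run of digits anywhere in the name also returns a number for files that do not start with a participant number, such as 'DSCN2370.csv'. This is a sketch of a stricter variant, assuming (my assumption, not stated in the notebook) that a valid ID always sits at the very start of the file name. `extract_id` is a hypothetical helper.

```python
import re


def extract_id(file_name):
    """Return the leading digits of the file name as a string, or None."""
    # re.match only matches at the start of the string
    match = re.match(r"\d+", file_name)
    return match.group() if match else None


examples = [
    "16-07.csv",
    "80_04,05,06.csv",
    "105_04 donation not completed. No blood flow.csv",
    "DSCN2370.csv",
]
ids = {name: extract_id(name) for name in examples}
```

Files for which `extract_id` returns `None` can then be listed and checked by hand.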


Double check if it worked. 

In [75]:
# Get the DataFrame from the dictionary
df_to_process = data_dropped['16-07.csv']

display(df_to_process)

Unnamed: 0,frame,face_id,confidence,success,AU01_r,AU02_r,AU04_r,AU05_r,AU06_r,AU07_r,...,AU17_c,AU20_c,AU23_c,AU25_c,AU26_c,AU28_c,AU45_c,Filename,Timeframe,ID
0,1,0,0.98,1,0.00,0.00,0.92,0.0,0.23,1.15,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,16-07.csv,16,16
1,2,0,0.98,1,0.07,0.00,0.78,0.0,0.16,0.95,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,16-07.csv,16,16
2,3,0,0.98,1,0.15,0.00,0.81,0.0,0.16,0.91,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,16-07.csv,16,16
3,4,0,0.98,1,0.14,0.10,0.82,0.0,0.15,1.01,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,16-07.csv,16,16
4,5,0,0.98,1,0.15,0.11,0.87,0.0,0.15,0.99,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,16-07.csv,16,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2745,2746,0,0.98,1,0.17,0.10,0.54,0.0,0.01,1.33,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,16-07.csv,16,16
2746,2747,0,0.98,1,0.22,0.13,0.57,0.0,0.01,1.29,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,16-07.csv,16,16
2747,2748,0,0.98,1,0.14,0.03,0.61,0.0,0.00,1.24,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,16-07.csv,16,16
2748,2749,0,0.98,1,0.09,0.03,0.60,0.0,0.00,1.06,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,16-07.csv,16,16


# Adding timeframe 

In [78]:
import re

def extract_timeframe(filename):
    """Return all numbers in the filename as integers (the first one is the ID)."""
    # Extract numerical values from the filename
    numerical_values = re.findall(r'\d+', filename)
    
    # Convert each numerical value to an integer
    timeframes = [int(value) for value in numerical_values]
    
    return timeframes



In [79]:
# Test cases
filename1 = '16-07.csv'
filename2 = '6-03,04,05,06.csv'

timeframes1 = extract_timeframe(filename1)
timeframes2 = extract_timeframe(filename2)

print("Timeframes from filename 1:", timeframes1)
print("Timeframes from filename 2:", timeframes2)

Timeframes from filename 1: [16, 7]
Timeframes from filename 2: [6, 3, 4, 5, 6]
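
The lists above start with the ID, followed by the time points. This is a sketch that separates the two, assuming the first number in a file name is always the ID (which does not hold for a name like 'DSCN2370.csv'). `split_id_and_time_points` is a hypothetical helper.

```python
import re


def split_id_and_time_points(file_name):
    """Return (ID, list of time points) taken from the numbers in the file name."""
    numbers = [int(value) for value in re.findall(r"\d+", file_name)]
    if not numbers:
        return None, []
    # Assumption: the first number is the ID, the rest are time points
    return numbers[0], numbers[1:]


id_1, time_points_1 = split_id_and_time_points("16-07.csv")
id_2, time_points_2 = split_id_and_time_points("6-03,04,05,06.csv")
id_3, time_points_3 = split_id_and_time_points("38_4,5,6.csv")
```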


In [80]:
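# Iterate over the file names (keys) and their DataFrames
for key, df in data_dropped.items():
    # Extract all numbers from the file name; the first one is the ID
    timeframes = extract_timeframe(key)
    
    # Assigning the list directly fails with
    # "ValueError: Length of values (2) does not match length of index (2750)",
    # because a list must contain one element per row. Store the time points
    # (everything after the ID) as one comma-separated string instead,
    # which pandas broadcasts to every row.
    df['timeframe'] = ','.join(str(value) for value in timeframes[1:])

# Verify the changes
print("Timeframe column added to all dataframes.")

Timeframe column added to all dataframes.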
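
When a list is assigned to a DataFrame column, pandas expects one element per row, so a two-element list cannot fill a 2750-row frame. A scalar, by contrast, is broadcast to every row. This is a minimal sketch on a made-up three-row DataFrame that shows both behaviours.

```python
import pandas as pd

df = pd.DataFrame({"frame": [1, 2, 3]})
time_points = [4, 5, 6]

# A scalar (here one comma-separated string) is broadcast to all rows
df["time_points"] = ",".join(str(value) for value in time_points)

# A list whose length differs from the number of rows raises ValueError
try:
    df["broken"] = [16, 7]
    raised = False
except ValueError:
    raised = True
```

Storing the time points as a single string keeps the column simple; it can be split again later with `str.split(',')` if needed.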

# Creating processed_data.csv
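
Combining the per-file DataFrames into one file can be sketched with `pd.concat` followed by `to_csv`. This is an illustration on a made-up two-file dictionary written to a temporary directory, not the notebook's actual cell. It assumes all DataFrames share the same columns.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the dictionary of processed DataFrames
toy = {
    "16-07.csv": pd.DataFrame({"frame": [1, 2], "AU01_r": [0.00, 0.07], "ID": ["16", "16"]}),
    "85_07.csv": pd.DataFrame({"frame": [1], "AU01_r": [0.30], "ID": ["85"]}),
}

# Stack all DataFrames on top of each other with a fresh row index
combined = pd.concat(list(toy.values()), ignore_index=True)

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "processed_data.csv")
    combined.to_csv(out_path, index=False)  # index=False avoids an extra index column
    reloaded = pd.read_csv(out_path)
```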