# <b>CaRM Module: Advanced Topics in Data Preparation Using Python (2024/2025)</b>
## <b>Session 4: Processing multiple files and large amounts of data</b>

### <b>Loops</b>

At this point in your Python journey, you are probably familiar with two types of loops in Python: the <b>for</b> loop and the <b>while</b> loop. For loops are used to iterate over a sequence of elements contained in a data structure (commonly a list or an iterator, but also a tuple, a dictionary, a set, a string, etc.). While loops, instead, are used to repeat a segment of code as long as a certain condition is met.

Loops are very useful when you need to process several data files. They allow you to apply the same processing steps to multiple files, reusing the same lines of code. They are especially easy to apply when you have named your files in a systematic way. For example, if your file names contain information about the subject ID and a particular experimental condition (or any other variable like treatment, group, type of measurement, date, etc.), you can iterate over a list of subject IDs (outer loop), and then over a list of experimental conditions (nested or inner loop). This allows you to dynamically change the name of the data file you are going to process with the lines of code inside the loops. You can nest as many loops as you need, so you can iterate over several parameters.

#### <b>4.1. Iterating over datasets using nested for/while loops</b>

This is an example of how you would apply a <b>for loop</b> for iterating over subject IDs and conditions.

In [None]:
subjects = [1, 2] 
conditions = ['a', 'b']
# the above are lists, but you could use sets instead to ensure you don't have repeated values
# (keep in mind that sets do not guarantee any particular iteration order).
# e.g., subjects = {1,2} or subjects = set([1,2])
# if you accidentally repeat a subject number, converting to a set removes the repetition.
# subjects = {1,2,2,3} 
# print(subjects)

for i in subjects:
    for j in conditions: # this is a nested loop
        # do something: 
        # e.g., load the file of this participant in this condition and process it
        print(f'Processing subject {i} in condition {j}')
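As a minimal sketch of the idea of changing the file name dynamically inside the loops, the snippet below builds one file name per subject and condition. The naming pattern `sub-XX_cond-Y.csv` is only an assumption for illustration; adapt it to however your own files are named.

```python
subjects = [1, 2]
conditions = ['a', 'b']

filenames = []
for i in subjects:
    for j in conditions:
        # zero-pad the subject ID to two digits, e.g. 1 -> '01'
        filename = f'sub-{i:02d}_cond-{j}.csv'  # hypothetical naming pattern
        filenames.append(filename)
        print(f'Would load {filename}')
```

Inside the inner loop you would typically pass `filename` to a loading function such as `pd.read_csv()` instead of just printing it.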

An alternative way is to use the <b>range()</b> function to loop over the positions of the elements contained in a list or other data structure. The expression range(n) creates a sequence of numbers that starts at 0 by default and increases by 1 until it reaches n-1. If n is the length of a list, this gives you the positions of all the elements of the list. The main difference with the previous example is that, in this case, the variables i and j do not hold the values of the elements of the list, but their positions. So, to obtain the current element of the list, you need to index the list with the current position (e.g., current_element = myList[current_position]).

In [None]:
subjects = [1, 2] 
conditions = ['a', 'b']

for i in range(len(subjects)):
    for j in range(len(conditions)):
        print(f'Processing subject {subjects[i]} in condition {conditions[j]}')

You can obtain a similar result using a <b>while loop</b> when needing to process data over different subject IDs and conditions. Note that, here, you have to explicitly update the value of the variable (current ID, current condition) by adding 1 at the end of the loop.

In [None]:
subjects = [1, 2]
conditions = ['a', 'b']

i = 0
while i < len(subjects):
    j = 0    
    while j < len(conditions): # this is a nested loop
        # do something: 
        # e.g., load the file of this participant in this condition and process it
        print(f'Processing subject {subjects[i]} in condition {conditions[j]}')
        j = j + 1 # this happens inside the second loop 
    i = i + 1 # this happens outside the second loop, but inside of the first one

In the examples above, all subjects participated in the same experimental conditions. However, there is another scenario: sometimes, different subjects take part in different experimental conditions (e.g., received a different treatment, or belonged to a different group) and may even have a different number of data files to process. In this case, it may be more practical to store the variables in a dictionary. Dictionaries allow you to associate a key (e.g., a subject ID) with a value (e.g., a particular treatment, a list of files, etc.). So, you can use a <b>for loop to iterate over the key-value pairs in a dictionary</b>, as you will see in the example below.

In [None]:
myDict = {'sub-01': 'treatment 1', 'sub-02': 'treatment 2', 'sub-03': 'treatment 2'}

# create dictionary dynamically using the zip() function
'''subjects = ['sub-01', 'sub-02', 'sub-03']
treatments = ['treatment 1', 'treatment 2', 'treatment 2']

print(type(zip(subjects, treatments)))

# creating a dictionary by zipping keys and values together
myDict = dict(zip(subjects, treatments))
print(myDict)'''

for i in myDict:
    print(f'Processing {i} in {myDict[i]}')
    #print("Processing %s in %s alternative" % (i, myDict[i])) # alternative way of printing the values of variables into text

I have included alternative ways to perform some parts of the code above. The first commented lines of code show you a <b>practical way to create a dictionary from two lists of variables</b>. One list contains the keys, and the other list contains the values. You can use the <b>zip function</b> to combine those separate lists into a zip object, which is an iterator of tuples where each element of the first list is paired with the corresponding element of the second list (e.g., (key1, value1)). Then, you can easily convert the zip object into a dictionary by using the dict() function.

You may have noticed that, inside the loop, there is a line printing some text which contains the subject ID and the treatment, both changing dynamically on each iteration. Curiously, there is a letter f before the text string. This is an <b>f-string</b> (formatted string literal), a feature introduced in Python 3.6 that simplifies string formatting and interpolation. f-strings provide a concise and intuitive way to embed expressions and variables directly into strings. To insert values into an f-string, you add placeholders delimited by curly braces {}. A placeholder can contain variables (like in the example) but also function calls and operations.

The last commented line of the code snippet above shows you an alternative way to print variables into a text string. This is known as <b>printf-style formatting</b>, which was used in older versions of Python and is still available. People with experience in C or Matlab programming may find this formatting style more familiar. Note that the variables in parentheses will be inserted in the %s positions (the % symbol, which otherwise is the modulo operator, acts here as the string-formatting operator). You can use %d for inserting integers, and %f for floats. If you are interested in this method, read more at https://www.geeksforgeeks.org/python-output-formatting/
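To make the comparison between the two formatting styles more concrete, here is a small sketch that builds the same message with an f-string and with printf-style formatting. The variable names and values are made up for illustration.

```python
subject = 'sub-01'
session = 3
score = 0.87654

# f-string: expressions and format specifications go inside the braces
msg_f = f'{subject.upper()} | session {session + 1} | score {score:.2f}'

# printf-style: %s for strings, %d for integers, %f for floats
msg_p = '%s | session %d | score %.2f' % (subject.upper(), session + 1, score)

print(msg_f)
print(msg_p)
```

Both lines print the same text; `.2f` rounds the float to two decimal places in either style.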


------------------------------------------------------------------

#### <b>Code I used to prepare the files for this session (in case you are curious):</b>

First, I created the project folder in the current directory:

In [None]:
import os

cwd = os.getcwd()
projectdir = os.path.join(cwd,'Mission_Project_AdvPy2425')

if not os.path.exists(projectdir):
    os.makedirs(projectdir)

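Then, I created the subfolders following a hierarchical structure, and saved the data as .txt (one file per species and sex folder; that line is commented out below) and as .json (one file per creature):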

In [None]:
import os
import pandas as pd

projectname = 'Mission_Project_AdvPy2425'

cwd = os.getcwd()
projectdir = os.path.join(cwd, projectname)

df = pd.read_csv('mission1_data.csv')

species = df.species.unique()
sex = df.sex.unique()

for i in species:
    # create the species folder
    speciesdir = os.path.join(projectdir, i)
    if not os.path.exists(speciesdir):
        os.makedirs(speciesdir)

    for j in sex:
        # filter the dataframe
        df_filtered = df.query('species == @i and sex == @j')

        # if there is data, then 
        if df_filtered.shape[0] > 0:
            # create the sex folder
            sexdir = os.path.join(speciesdir, j)
            if not os.path.exists(sexdir):
                os.makedirs(sexdir)            

            filename = os.path.join(sexdir, 'mission_data.txt')
            #df_filtered.to_csv(filename, sep=' ', mode='w')

            for k in range(df_filtered.shape[0]):
                jsonfilename = os.path.join(sexdir, f'creature-{df_filtered.index[k]}.json')
                df_filtered.iloc[k,:].to_json(jsonfilename) #, orient="columns", index=0

In [4]:
import os
import pandas as pd

projectname = 'Mission_Project_AdvPy2425'

cwd = os.getcwd()
projectdir = os.path.join(cwd, projectname)

dflist = []
for i in [1,2,3]:
    df = pd.read_csv(f'mission3_data{i}.csv')
    dflist.append(df)

df2 = pd.concat(dflist)
print(df2.shape)
#print(df2)
df2.to_csv(os.path.join(projectdir,'mission3_data.txt'), sep=' ')

(327, 8)


In the code above and in most of the examples, I use the <b>os</b> module to handle directory names and related operations, such as creating a new directory. Another standard-library module, <b>pathlib</b> (available since Python 3.4), can perform similar operations and more. Below is a short example of how to use pathlib to create a directory. For more information about how to use the module, you can start here: https://www.geeksforgeeks.org/pathlib-module-in-python/. Also see the official documentation for more details: https://docs.python.org/3/library/pathlib.html.

In [None]:
from pathlib import Path

cwd = Path.cwd()
# alternative:
#cwd = Path().absolute()

newdir = cwd / 'newdir_with_pathlib'  # the / operator joins path components

newdir.mkdir(parents=True, exist_ok=True)
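As a rough pathlib counterpart of the hierarchical folder creation shown earlier with os, the sketch below creates a two-level folder tree. It works inside a temporary directory, and the project name, species, and sex values are made-up examples, not the actual course data.

```python
import tempfile
from pathlib import Path

# work inside a temporary folder so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    projectdir = Path(tmp) / 'Mission_Project_demo'  # hypothetical project name

    for species in ['Human', 'Droid']:       # made-up example values
        for sex in ['male', 'female']:
            sexdir = projectdir / species / sex  # the / operator joins path parts
            # parents=True creates the species folder too, if needed
            sexdir.mkdir(parents=True, exist_ok=True)

    # list every folder that was created, relative to the project folder
    created = sorted(p.relative_to(projectdir).as_posix()
                     for p in projectdir.rglob('*') if p.is_dir())
    print(created)
```

With `parents=True` and `exist_ok=True`, there is no need to check whether each folder exists before creating it.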

-----------------------------------------------------

#### <b>4.2. Loop over 'real' mission data files</b>

To have a more concrete example of the use of loops for processing multiple data files, I have created a <b>Mission_Project_AdvPy2425</b> folder containing different subfolders and data files in different formats. Inside the project folder, you will find a <b>first level of subfolders with the names of species</b> for all the creatures found in the mission1_data.csv file. Inside each species subfolder, you will find a <b>second level of subfolders named after the possible sexes</b> for each particular species. Inside each sex folder you will find a <b>mission_data.txt</b> file. This file contains formatted information regarding all the creatures that belong to that species and sex combination. In the code below, I show you how we can use two for loops (one nested inside the other) to load each text file from all of the subfolders. In each iteration, the data is converted to a dataframe and appended to a list of dataframes. When both loops are finished and all the data has been loaded, we concatenate the list of dataframes into a single dataframe. This reconstructed dataframe is equivalent to the dataframe we obtained when loading the original mission1_data.csv file. Note that, after concatenating the dataframes, we have to sort the rows according to the indexes (row labels) to have the same order of rows as in the original csv file.<br>
You could insert lines of code for preprocessing the data right after each dataset is loaded. However, very often it is more efficient (i.e., less time consuming) to apply some preprocessing steps after all the data has been concatenated into a single dataframe (e.g., you could now apply a function to a column in one go, instead of doing it row by row).<br>

In [None]:
import os
import pandas as pd

projectname = 'Mission_Project_AdvPy2425'

cwd = os.getcwd()
projectdir = os.path.join(cwd, projectname)

df = pd.read_csv('mission1_data.csv')

species = df.species.unique() 
sex = df.sex.unique()

dflist = []

for i in species:
    # species dir
    speciesdir = os.path.join(projectdir, i)

    for j in sex:
        # sex dir
        sexdir = os.path.join(speciesdir, j)

        # if the sex directory exists, then 
        if os.path.exists(sexdir):
            # read the text file with the mission data
            filename = os.path.join(sexdir, 'mission_data.txt')
            df2 = pd.read_csv(filename, sep=' ', index_col=0)
            # Note: you could implement some preprocessing steps here
            dflist.append(df2)
            #print(f'There are {df2.shape[0]} {j} {i} creatures.')     

df3 = pd.concat(dflist)
df3.sort_index(inplace=True)
# Note2: you could implement other preprocessing steps here
print(df3.shape)
print('This is the reconstructed dataframe, obtained from concatenating all the data from .txt files.')
print(df3.head(10))     
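To illustrate the point about applying preprocessing after concatenation, here is a minimal sketch with two small made-up dataframes standing in for the per-folder files. The column names (`name`, `height`) and the unit conversion are assumptions chosen only for the example.

```python
import pandas as pd

# two small made-up dataframes standing in for the per-folder files
df_a = pd.DataFrame({'name': ['A', 'B'], 'height': [172.0, 96.0]}, index=[0, 2])
df_b = pd.DataFrame({'name': ['C'], 'height': [202.0]}, index=[1])

# concatenate first, then restore the original row order
df_all = pd.concat([df_a, df_b]).sort_index()

# one vectorised operation on the whole column,
# instead of repeating it for every file or every row
df_all['height_m'] = df_all['height'] / 100

print(df_all)
```

The conversion runs once over the full column, which is usually faster and shorter to write than repeating it inside the loop for every file.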

#### <b>4.3. Processing chunks of data using ```for``` loops with the ```range()``` function</b>

If you have a very large number of data files or rows, or the files are very large in size, loading all the data may become a less efficient strategy because it is memory consuming. In these scenarios, it is wiser to <b>process the data in chunks</b>: for example, process a subgroup of files, or rows at a time.

In [None]:
import numpy as np

subjects = np.arange(1,101)
conditions = ['a', 'b']
batchsize = 11

nsub = len(subjects)
for i in range(0,nsub,batchsize):
    k = i + batchsize
    if k > nsub:
        k = nsub
    # do something with this group of subjects
    # e.g., process the files of these subjects first
    print(f'Loading subjects {subjects[i]} to {subjects[k-1]}:')
    print(subjects[i:k])
    for j in conditions:
        # do something: 
        # e.g., load the files of these participants in this condition and process them
        print(f'Processing condition {j}')

In [None]:
import pandas as pd

df = pd.read_csv('mission1_data.csv')

batchsize = 8
nrows = df.shape[0]
print(nrows)

for i in range(0, nrows, batchsize):
    j = i + batchsize
    if j > nrows:
        j = nrows
    # do something with this group of rows
    # e.g., create a dataframe with those rows and apply some function
    df2 = df.iloc[i:j,:]
    print(f'Processing rows {i+1} to {j}:')
    # code lines of preprocessing steps

    # then, you can save the data after preprocessing:
    # for example, append it to a list of dataframes,
    # and concatenate the dataframes at the end of the loop.
    # alternatively, you can save the transformed data back to the original dataframe
    # df.iloc[i:j,:] = df2

print(df2)
#print(df.iloc[-1,:])
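The strategy described in the comments above (process each batch of rows, collect the results in a list, concatenate at the end) could look roughly like this. The dataframe and the `mass_kg` conversion are made up for illustration.

```python
import pandas as pd

# made-up data standing in for mission1_data.csv
df = pd.DataFrame({'name': list('abcdefg'),
                   'mass': [10, 20, 30, 40, 50, 60, 70]})

batchsize = 3
processed = []

for i in range(0, len(df), batchsize):
    # slicing past the end is safe: the last batch is simply shorter
    batch = df.iloc[i:i + batchsize].copy()
    batch['mass_kg'] = batch['mass'] / 1000  # hypothetical preprocessing step
    processed.append(batch)

# put the processed batches back together
df_out = pd.concat(processed)
print(df_out.shape)
```

Because slices in Python and pandas stop at the end of the data, the explicit check for the last batch is optional when you only need the slice itself.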

#### <b>4.4. Processing chunks of rows in Pandas</b>

Pandas has an option to define chunks of data in a dataframe, upon reading the file. Then, it is very easy to loop over the dataframe chunks and process them.

In [None]:
import pandas as pd

# Read mission1_data.csv in chunks of 10 rows.
# With chunksize set, read_csv returns an iterator of dataframes, not a single dataframe
reader = pd.read_csv("mission1_data.csv", chunksize=10)

male = 0
female = 0

# Alternative: use enumerate() to also get the chunk number
# (process_chunk is a placeholder for your own function)
'''for i, chunk in enumerate(reader):
    print(f'Processing Chunk {i+1}')
    process_chunk(chunk)'''

# Loop through the chunks
for chunk in reader:
    # Every chunk is a dataframe containing a subset of rows (up to 10) from the file
    #print(chunk)
    # Insert code here if you would like to process this dataframe chunk (e.g., apply some function)
    # Then, you could append this chunk to a list (to be able to concatenate them at the end of the loop)

    # count all males in the sex column of the current chunk
    male += chunk[chunk['sex'] == "male"]["sex"].count()
    # count all females in the sex column of the current chunk
    female += chunk[chunk['sex'] == "female"]["sex"].count()

# you could concatenate the chunks here
# you could apply some preprocessing to the whole dataframe here
# and/or save the processed data

# Print the results
print(f'Total number of males: {male}')
print(f'Total number of females: {female}')
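A slightly more general variant of the counting logic is sketched below: instead of one counter per category, it tallies every value of the column in each chunk. To keep it runnable on its own, it reads a tiny made-up CSV from memory with `io.StringIO` rather than from mission1_data.csv.

```python
import io
from collections import Counter

import pandas as pd

# made-up CSV text standing in for a large file on disk
csv_text = 'name,sex\nA,male\nB,female\nC,male\nD,male\nE,female\n'

counts = Counter()
n_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # value_counts() tallies every category of this chunk at once
    counts.update(chunk['sex'].value_counts().to_dict())
    n_rows += len(chunk)

print(dict(counts), n_rows)
```

Only one chunk is held in memory at a time, while the running totals are kept in the small `counts` object.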

#### <b>4.5. Processing lines from a ```.txt``` file in batches</b>

In [6]:
import os
import pandas as pd

projectname = 'Mission_Project_AdvPy2425'

cwd = os.getcwd()
projectdir = os.path.join(cwd, projectname)

# Function to read lines from a file in batches
def read_lines_in_batches(file_path, batch_size):
    with open(file_path, 'r', encoding='utf-8') as file:  # explicit encoding avoids garbled non-ASCII characters
        batch = []
        for line in file:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

# Process lines from the file in batches
file_path = os.path.join(projectdir,'mission3_data.txt')
batch_size = 100
for batch in read_lines_in_batches(file_path, batch_size):
    # Replace the following line with actual processing logic
    print(f"Processing batch: {batch}")

Processing batch: ['"Unnamed: 0" Senator "Planet or sector represented" Term Notes "Species and gender" Planet Sector', '0 0 Aang Roona "c. 21 BBY"  "Roonan male"', '1 1 "Aak, Ask" "Malastare and the Dustig sector" "22 BBY–19 BBY"  "Gran male"', '2 2 "Adem\'thorn, Yeb Yeb" "Makem Te and the Nilgaard sector" "c. 32 BBY"  "Swokes Swokes male"', '3 3 "Alavar, Nee" "Lorrd and Kanz sector" "c. 19 BBY" "Arrested and executed by Palpatine" "Human female"', '4 4 "Aldrete, Agrippa" "Alderaan and Alderaan sector" "c. 52 BBY" "Replaced by Bail Antilles" "Human male"', '5 5 "Allum, Vivendi"  "c. 44 BBY"  male', '6 6 "Amedda, Mas" Champala "c. 33 BBY" "elected as Vice Chair" "Chagrian male"', '7 7 "Amidala, Padmé" "Naboo and Chommell sector" "c. 25 BBY-19 BBY" "Died giving birth to her children, Luke Skywalker and Leia Organa Solo" "Human female"', '8 8 "Annon, Blix"  "c. 77 BBY" "Killed by heart attack" "Human male"', '9 9 "Am-Ris, Paran" Cerea "c. 3653 BBY" "Became interim Supreme Chancellor" 
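As one possible way to fill in the "actual processing logic", the sketch below parses each batch of lines into fields with the standard `csv` module, which understands the double quotes around fields that contain spaces. It reuses the generator from the cell above (with an explicit encoding) and writes a tiny made-up file to a temporary folder so that it runs on its own; the file name and contents are assumptions.

```python
import csv
import os
import tempfile

def read_lines_in_batches(file_path, batch_size):
    # same generator as in the cell above, with an explicit encoding
    with open(file_path, 'r', encoding='utf-8') as file:
        batch = []
        for line in file:
            batch.append(line.strip())
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

header = None
rows = []
with tempfile.TemporaryDirectory() as tmpdir:
    # write a tiny made-up file so the sketch is self-contained
    file_path = os.path.join(tmpdir, 'demo.txt')
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write('id name "home planet"\n')
        f.write('0 "Amidala, Padmé" Naboo\n')
        f.write('1 Aang Roona\n')
        f.write('2 "Aak, Ask" Malastare\n')

    for batch in read_lines_in_batches(file_path, batch_size=2):
        # split each line on spaces, keeping quoted fields together
        parsed = list(csv.reader(batch, delimiter=' '))
        if header is None:
            # the first line of the file holds the column names
            header, parsed = parsed[0], parsed[1:]
        rows.extend(parsed)

print(header)
print(rows)
```

From here, each batch of parsed rows could be turned into a dataframe with `pd.DataFrame(parsed, columns=header)` if you prefer to continue in pandas.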

#### <b>4.6. Iterating over files in a directory</b>

Sometimes you cannot construct the names of the data files systematically, or it is simply more practical to access a directory and process all the data files in it. In this scenario, several Python modules can help (e.g., os, pathlib, glob), along with functions within those modules (e.g., os.listdir(), os.scandir()).

Method 1: os.listdir() 

In [None]:
import os
import pandas as pd
import re

projectname = 'Mission_Project_AdvPy2425'

cwd = os.getcwd()
projectdir = os.path.join(cwd, projectname)

df = pd.read_csv('mission1_data.csv')

species = df.species.unique() 
sex = df.sex.unique()

dflist = []
indexlist = []

for i in species:
    # species dir
    speciesdir = os.path.join(projectdir, i)

    for j in sex:
        # sex dir
        sexdir = os.path.join(speciesdir, j)

        # if the sex directory exists, then 
        if os.path.exists(sexdir):
            
            # loop over files in the sex directory
            for filename in os.listdir(sexdir):
                if filename.endswith('.json'):
                    # extract the creature number from the file name
                    # (int() makes sure the index is an integer, not a string; \. matches a literal dot)
                    idx = int(re.findall(r'creature-(\d+)\.json', filename)[0])
                    #print(idx)
                    #print(os.path.join(sexdir,filename))
                    # read the file
                    df2 = pd.read_json(os.path.join(sexdir,filename), orient='columns', typ='series').to_frame().transpose().set_axis([idx])
                    #print(df2.shape)
                    #print(df2)
                    dflist.append(df2)
                    indexlist.append(idx)

df3 = pd.concat(dflist)
df3.sort_index(inplace=True)
print(df3.shape)
print(df3.head(10))               

It is good practice, when iterating over the contents of a directory, to check that each entry is actually a file. Note that, in the previous example, we only processed the listed entries whose names ended with the .json extension, which in practice selects our data files. For a more generic check, you can use the following approaches.

In [None]:
import os

# assign directory
cwd = os.getcwd()
directory = cwd
 
# iterate over files in
# that directory
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    # checking if it is a file
    if os.path.isfile(f):
        print(filename)

Method 2: os.scandir()

In [None]:
import os
 
# assign directory
cwd = os.getcwd()
directory = cwd
 
# iterate over the entries in that directory
# (each entry is an os.DirEntry object, not a plain file name)
for entry in os.scandir(directory):
    if entry.is_file():
        print(entry.name)

Method 3: pathlib module

In [None]:
# import required module
from pathlib import Path
 
# assign directory
cwd = Path.cwd()
directory = cwd
 
# iterate over files in that directory
files = Path(directory).glob('*')
#files = Path(directory).glob('*.ipynb')
for file in files:
    # is_file() skips subdirectories; exists() would also be True for them
    if file.is_file():
        print(file.name)
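When the data files sit in nested subfolders, pathlib can also search recursively, which avoids writing one loop per folder level. The sketch below is a minimal illustration using `Path.rglob()` on a small made-up folder tree created in a temporary directory; the folder and file names are assumptions that only mimic the structure of the project folder.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    projectdir = Path(tmp)

    # build a small made-up folder tree: species / sex / creature-N.json
    for rel in ['Human/male/creature-0.json', 'Human/female/creature-3.json',
                'Droid/none/creature-7.json', 'Droid/none/notes.txt']:
        path = projectdir / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text('{}', encoding='utf-8')

    # rglob searches the folder and all of its subfolders
    jsonfiles = [p for p in projectdir.rglob('creature-*.json') if p.is_file()]
    names = sorted(p.name for p in jsonfiles)
    print(names)
```

The pattern `creature-*.json` leaves out `notes.txt`, so the resulting list contains only the files you want to process.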

By the way, you could use one of the strategies above just to create a list of filenames to process. You don't need to process them one by one; instead, you can process them in batches using <b>for loops</b> and the <b>range()</b> function, as we saw in <b>section 4.3</b>.

In [55]:
fileslist = [...]
batchsize = 500

for i in range(0, len(fileslist), batchsize):
    batch = fileslist[i:i+batchsize] # the result might be shorter than batchsize at the end
    # do stuff with batch

This, applied to our data:

In [None]:
import os
import pandas as pd
import re

projectname = 'Mission_Project_AdvPy2425'

cwd = os.getcwd()
projectdir = os.path.join(cwd, projectname)

df = pd.read_csv('mission1_data.csv')

species = df.species.unique() 
sex = df.sex.unique()

fileslist = []
indexlist = []

for i in species:
    # species dir
    speciesdir = os.path.join(projectdir, i)

    for j in sex:
        # sex dir
        sexdir = os.path.join(speciesdir, j)

        # if the sex directory exists, then 
        if os.path.exists(sexdir):
            
            # loop over files in the sex directory
            for filename in os.listdir(sexdir):
                if filename.endswith('.json'):
                    fileslist.append(os.path.join(sexdir,filename))
                    # extract the creature number from the file name
                    # (int() makes sure the index is an integer, not a string; \. matches a literal dot)
                    idx = int(re.findall(r'creature-(\d+)\.json', filename)[0])
                    indexlist.append(idx)

#print(fileslist)
batchsize = 10

for i in range(0, len(fileslist), batchsize):
    batch = fileslist[i:i+batchsize] # the result might be shorter than batchsize at the end
    # do stuff with batch
    # e.g., load the files, create a dataframe, apply a function, save the dataframe

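To give an idea of what "do stuff with batch" could mean in practice, here is a sketch that loads each batch of JSON files into one dataframe. It creates five small made-up JSON files in a temporary folder so that it is self-contained; the file contents (`name`, `mass`) are assumptions, not the real mission data.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

batch_frames = []
with tempfile.TemporaryDirectory() as tmp:
    # create five small made-up JSON files, one record per file
    fileslist = []
    for k in range(5):
        path = Path(tmp) / f'creature-{k}.json'
        record = {'name': f'creature {k}', 'mass': 10 * k}
        path.write_text(json.dumps(record), encoding='utf-8')
        fileslist.append(path)

    batchsize = 2
    for i in range(0, len(fileslist), batchsize):
        batch = fileslist[i:i + batchsize]
        # load every file in the batch and stack the records into one dataframe
        records = [json.loads(p.read_text(encoding='utf-8')) for p in batch]
        df_batch = pd.DataFrame(records)
        # e.g., apply a function or save df_batch to disk here
        batch_frames.append(df_batch)

print([len(d) for d in batch_frames])
```

Keeping the per-batch dataframes in a list is only for demonstration; with very large datasets you would rather save each processed batch and discard it before loading the next one.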

#### <b>4.7. Using ```itertools``` for processing data in batches</b>

itertools is a standard library module in Python that provides functions for building and combining iterators. It is a powerful toolset for handling iterable data efficiently.

In [None]:
import itertools

# Sample data
data = list(range(20)) # This is a list of numbers from 0 to 19, but it could also be a list of filenames or rows

# Use itertools.batched to process data in chunks of size 5
# (note: itertools.batched requires Python 3.12 or later)
for batch in itertools.batched(data, 5): # This generates batches (tuples) of size 5 from the data.
    # process the `batch` tuple
    # Replace the following line with actual processing logic
    print(f"Processing batch: {batch}")

Processing batch: (0, 1, 2, 3, 4)
Processing batch: (5, 6, 7, 8, 9)
Processing batch: (10, 11, 12, 13, 14)
Processing batch: (15, 16, 17, 18, 19)


In [53]:
import itertools 

# Sample data
data = list(range(50))

# Batch size
batch_size = 10

# Process data in batches using itertools.islice
iterator = iter(data)

# iter(callable, sentinel) keeps calling the lambda until it returns the sentinel ([]),
# which happens once the iterator is exhausted
for batch in iter(lambda: list(itertools.islice(iterator, batch_size)), []):
    # Replace the following line with actual processing logic
    print(f"Processing batch: {batch}")

Processing batch: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Processing batch: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
Processing batch: [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
Processing batch: [30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
Processing batch: [40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
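Since `itertools.batched` is only available from Python 3.12 onwards, the `islice` approach can be wrapped in a small reusable generator for older versions. The function name `batched_fallback` is hypothetical; this is a sketch of one way to do it, not part of the standard library.

```python
from itertools import islice

def batched_fallback(iterable, n):
    """Yield successive tuples of up to n items (hypothetical helper name)."""
    if n < 1:
        raise ValueError('n must be at least one')
    iterator = iter(iterable)
    while True:
        # take the next n items; the final tuple may be shorter
        batch = tuple(islice(iterator, n))
        if not batch:
            return
        yield batch

batches = list(batched_fallback(range(7), 3))
print(batches)
```

Because it is a generator, it works on any iterable (lists of filenames, open files, other generators) without loading everything into memory first.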
