This notebook creates 10 dummy files, each containing one integer per line in ascending order. The files are named text_file*.txt, where * is the file number. Each file is then converted to a text file of (number, number of repeats) pairs, named text_file_counted*.txt. Both a serial and a parallelised version of this conversion are presented. Finally, the converted text files are read and combined, and the result is written to the file "out_file.txt" in the form [...,...,...,...] as requested. This final step is done sequentially, because each write needs information from all of the input files to keep the output in the correct order (although writing the same number to file repeatedly could be done in parallel if required). Files are read line by line throughout to minimize memory usage.

In [65]:
##Create Dummy Text Files
from random import randrange
import numpy as np

number_of_files = 10

for i in range(number_of_files):
    number_of_rows = randrange(50)  # 0 to 49 rows, so a file may be empty

    # Draw the integers and sort them so that repeats sit on adjacent lines
    numbers = sorted(randrange(70) for _ in range(number_of_rows))

    # Mode "w" truncates any existing file, so no separate clearing step is needed
    with open("text_file"+str(i)+".txt", "w") as text_file:
        for number in numbers:
            text_file.write(str(number)+"\n")

Convert the run of numbers in each file into a new file with one (number, number of repeats) pair per line, to minimize memory usage and make later processing easier.
The input files are read line by line to minimize memory usage.
Simple serial version.

In [66]:
for i in range(number_of_files):
    current_number = None  # number whose run is currently being counted
    count_number = 0       # length of that run so far

    # Mode "w" truncates any existing file, so no separate clearing step is needed
    with open("text_file"+str(i)+".txt") as f, \
         open("text_file_counted"+str(i)+".txt", "w") as text_file_counted:
        for line in f:
            input_number = line.strip()

            if input_number == current_number:
                count_number += 1
            else:
                # A new number starts: write out the finished run, if there is one
                if current_number is not None:
                    text_file_counted.write(current_number+","+str(count_number)+"\n")
                current_number = input_number
                count_number = 1

        # Write the last run; an empty input file yields an empty output file
        if current_number is not None:
            text_file_counted.write(current_number+","+str(count_number)+"\n")
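
The (number, number of repeats) conversion above is a run-length encoding of a sorted file. As a minimal sketch of the same idea, assuming the standard-library itertools.groupby is acceptable here, the helper below groups adjacent equal lines while still consuming the input lazily. The name run_length_encode is hypothetical and not part of the notebook, and the list of strings stands in for an open file.

```python
from itertools import groupby

def run_length_encode(lines):
    """Yield (number, count) pairs for runs of adjacent equal lines.

    `lines` can be any iterable of strings, including an open file, so the
    input is still consumed one line at a time.
    """
    stripped = (line.strip() for line in lines)
    for number, run in groupby(stripped):
        # `run` is a lazy iterator over one run; count it without storing it
        yield number, sum(1 for _ in run)

# In-memory list standing in for a sorted input file (illustrative data only)
pairs = list(run_length_encode(["3\n", "3\n", "7\n", "9\n", "9\n", "9\n"]))
print(pairs)  # [('3', 2), ('7', 1), ('9', 3)]
```

An empty input simply yields no pairs, which matches the empty-file case that randrange(50) can produce.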

        
  



Convert the run of numbers in each file into a new file with one (number, number of repeats) pair per line, to minimize memory usage and make later processing easier.
The input files are read line by line to minimize memory usage.
Parallel version, with each file converted independently.

In [67]:
from joblib import Parallel, delayed

def read_files(i):
    # Convert text_file<i>.txt into "number,count" lines in text_file_counted<i>.txt
    current_number = None  # number whose run is currently being counted
    count_number = 0       # length of that run so far

    # Mode "w" truncates any existing file; the with block closes both files,
    # even when the input file is empty
    with open("text_file"+str(i)+".txt") as f, \
         open("text_file_counted"+str(i)+".txt", "w") as text_file_counted:
        for line in f:
            input_number = line.strip()

            if input_number == current_number:
                count_number += 1
            else:
                # A new number starts: write out the finished run, if there is one
                if current_number is not None:
                    text_file_counted.write(current_number+","+str(count_number)+"\n")
                current_number = input_number
                count_number = 1

        # Write the last run; an empty input file yields an empty output file
        if current_number is not None:
            text_file_counted.write(current_number+","+str(count_number)+"\n")

file_numbers_par = range(number_of_files)

# Each file is converted independently, so the calls can run concurrently
out = Parallel(n_jobs=-1, verbose=1, backend="threading")(
             delayed(read_files)(i) for i in file_numbers_par)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.0s finished
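
The per-file fan-out used above does not depend on joblib specifically. As a sketch, assuming only the standard library is available, the same pattern can be written with concurrent.futures.ThreadPoolExecutor. The function count_runs, the demo file names, and the sample data are hypothetical stand-ins for read_files and the notebook's text files.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby

def count_runs(in_path, out_path):
    """Convert one sorted file of integers into "number,count" lines."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for number, run in groupby(line.strip() for line in src):
            dst.write(f"{number},{sum(1 for _ in run)}\n")
    return out_path

# Build two small input files in a temporary directory (illustrative data only)
work_dir = tempfile.mkdtemp()
samples = [[1, 1, 4], [2, 2, 2]]
jobs = []
for i, numbers in enumerate(samples):
    in_path = os.path.join(work_dir, f"demo{i}.txt")
    with open(in_path, "w") as f:
        f.writelines(f"{n}\n" for n in numbers)
    jobs.append((in_path, os.path.join(work_dir, f"demo_counted{i}.txt")))

# Each file is independent, so the conversions can run concurrently;
# pool.map returns results in the same order as the submitted jobs
with ThreadPoolExecutor() as pool:
    written = list(pool.map(lambda job: count_runs(*job), jobs))

# Read the converted files back to inspect the result
results = []
for path in written:
    with open(path) as f:
        results.append(f.read())
```

With files this small, the cost of starting workers is likely to outweigh any gain, which is consistent with the 0.0s elapsed time reported above; the pattern matters more when there are many or much larger files.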


Read in the converted text files and write the output to file in a memory-efficient way. This step is serial because each write needs information from every file, and the overhead is too high to make parallelisation worthwhile. The end of a file is handled by setting that file's number in the array to a very large number, stored in the variable "large_number", which must be larger than the largest number in the input data.

In [68]:
text_file_counted_list = []
large_number = 1000000  # sentinel; must be larger than every number in the input data
#number_of_files = 2
for i in range(number_of_files):
    text_file_counted_list.append(open("text_file_counted"+str(i)+".txt", "r"))

def read_pair(file_index):
    # Read the next "number,count" line; go line by line through file to minimize memory usage
    temp = next(text_file_counted_list[file_index], "end")
    if temp == "end":
        return [large_number, 0]  # file finished: sentinel number with a count of 0
    return list(map(int, temp.strip().split(',')))

def write_repeats(out_file, number, total, flag):
    # Write number to file total times, comma-separated; returns the updated flag
    for _ in range(total):
        if flag == 0:
            out_file.write(str(number))
            flag = 1
        else:
            out_file.write(","+str(number))
    return flag

# One row per input file: [current number, count of that number]
current_numbers = np.array([read_pair(i) for i in range(number_of_files)])

current_number_total = 0
current_number_out = np.min(current_numbers[:,0])
flag = 0  # 0 until the first value has been written, so no leading comma is emitted

with open("out_file.txt", "w") as out_file:  # mode "w" clears any existing file
    #when an input file is finished its number is set to a very large number and its count to 0,
    #so no count in the second column is positive once all input files have been read
    while np.any(current_numbers[:,1] > 0):
        min_val_arg = np.argmin(current_numbers[:,0])
        if current_numbers[min_val_arg,0] == current_number_out:
            #same number as the current run: add its count and advance that file
            current_number_total += current_numbers[min_val_arg,1]
            current_numbers[min_val_arg,:] = read_pair(min_val_arg)
        else:
            #a larger number has become the minimum: write out the finished run
            flag = write_repeats(out_file, current_number_out, current_number_total, flag)
            current_number_out = current_numbers[min_val_arg,0]
            current_number_total = 0

    #write the final run, which the loop leaves pending
    flag = write_repeats(out_file, current_number_out, current_number_total, flag)

#close the input files, which were opened outside a with block
for f in text_file_counted_list:
    f.close()
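
The merge above keeps one (number, count) pair per file and repeatedly takes the smallest number. As an alternative sketch, not the approach used in this notebook, the standard library's heapq.merge performs the same lazy k-way merge over already sorted inputs and needs no sentinel such as large_number. The names parse_pairs and merge_counted are hypothetical, and the lists of strings stand in for the counted files.

```python
import heapq
from itertools import groupby

def parse_pairs(lines):
    """Lazily turn "number,count" lines into (number, count) integer tuples."""
    for line in lines:
        number, count = line.strip().split(",")
        yield int(number), int(count)

def merge_counted(sources):
    """Yield (number, total count) in ascending order across all sources.

    Each source must already be sorted by number, as the counted files are.
    """
    # heapq.merge holds only one pending item per source at a time
    merged = heapq.merge(*(parse_pairs(src) for src in sources))
    for number, group in groupby(merged, key=lambda pair: pair[0]):
        yield number, sum(count for _, count in group)

# In-memory stand-ins for three counted files, one of them empty
sources = [["1,2\n", "4,1\n"], ["1,1\n", "2,3\n"], []]
totals = list(merge_counted(sources))

# Expand each total back into repeated values, comma-separated
output = ",".join(str(n) for n, total in totals for _ in range(total))
print(output)  # 1,1,1,2,2,2,4
```

The join here builds the whole output string in memory for brevity; with large inputs, each run would instead be written to the output file as it is produced, as the cell above does.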