This code is based on Python 3.5.

I read the files in chunks because the process is resource-demanding, especially the join operation.

Special attention is paid to the data format (origin airport, destination airport and date) because inconsistent formatting causes records that should match to be missed.
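To illustrate why the format matters, here is a minimal sketch with made-up rows (the sample values are assumptions, not taken from the real files). It shows that airport codes padded with whitespace do not match in an inner join until they are stripped.

```python
import pandas as pd

# Hypothetical sample rows; the real files are much larger.
searches = pd.DataFrame({'Origin': ['MAD'], 'Destination': ['JFK']})
bookings = pd.DataFrame({'dep_port': ['MAD     '], 'arr_port': ['JFK     ']})

# Without cleaning, the padded airport codes do not match the search codes.
raw = pd.merge(searches, bookings,
               left_on=['Origin', 'Destination'],
               right_on=['dep_port', 'arr_port'],
               how='inner')

# After stripping the whitespace, the same rows match.
clean_bookings = bookings.apply(lambda col: col.str.strip())
clean = pd.merge(searches, clean_bookings,
                 left_on=['Origin', 'Destination'],
                 right_on=['dep_port', 'arr_port'],
                 how='inner')

print(len(raw), len(clean))
```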


In [1]:
# common parameters of the code
import pandas as pd 

# important filenames
strSearchesFilename='searches.csv' 
strBookingsFilename='bookings.csv' 

# To obtain a reasonable computation time, the chunk size should be large 
intChunksize=10000000 

# Columns for joining the bookings and the searches
# It is assumed that these columns of searches.csv should match the corresponding columns of bookings.csv
listColumnsSearchesInner=['Origin', 'Destination', 'Date']
listColumnsBookingInner=['dep_port', 'arr_port','cre_date           ']

The next cell defines two functions that correctly format the important columns read from the CSV files.
Format inconsistencies cause mismatches when associating bookings with searches.

In [2]:
# the "strip" function removes surrounding whitespace to avoid mismatches when joining the data
# The text formats of 'dep_port' and 'arr_port' are not identical to those of 'Origin' and 'Destination'
def strip(text):
    try:
        return text.strip()
    except AttributeError:
        return text
    
# When reading the file, incorrectly formatted dates are set to NaT
# The correct format is '%Y-%m-%d'
parse = lambda x: pd.to_datetime(x, format='%Y-%m-%d', errors='coerce')
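As a quick check of the two helpers, the sketch below applies them to a few made-up values (the inputs are assumptions chosen for illustration). The helpers are repeated so that the snippet runs on its own.

```python
import pandas as pd

def strip(text):
    # Remove surrounding whitespace; leave non-string values untouched.
    try:
        return text.strip()
    except AttributeError:
        return text

parse = lambda x: pd.to_datetime(x, format='%Y-%m-%d', errors='coerce')

stripped = strip('MAD     ')    # padded airport code
unchanged = strip(42)           # non-string input has no .strip()
good = parse('2013-01-01')      # matches '%Y-%m-%d'
bad = parse('01/01/2013')       # does not match, so it is coerced to NaT

print(repr(stripped), unchanged, good, bad)
```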

When reading the files, I set incorrectly formatted dates to NaT.

The correct format is '%Y-%m-%d'.

In [3]:
# retrieve the column names of "searches.csv" (no data rows are read)
dfSearchCol = pd.read_csv(strSearchesFilename,sep='^',nrows=0)
listColumnsSearches=list(dfSearchCol.columns.values)  
listAllColumns=listColumnsSearches+listColumnsBookingInner

del dfSearchCol  # delete the dataframe to save memory
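The header-only read can be tried on a small in-memory sample. This is a sketch: the sample text and its three column names are assumptions, and `io.StringIO` stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical '^'-separated sample standing in for searches.csv.
sample = io.StringIO("Date^Origin^Destination\n2013-01-01^MAD^JFK\n")

# nrows=0 parses only the header line, so the resulting dataframe is empty.
header = pd.read_csv(sample, sep='^', nrows=0)
columns = list(header.columns)

print(columns, len(header))
```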

The next cell works as follows:

1- I read the searches file

2- I read the bookings file

3- I use an inner join to match the searches with the bookings (keeping only the relevant columns)

4- I accumulate the records w.r.t. the bookings chunks.

5- The output file is written in CSV format iteratively, chunk after chunk. I delete temporary dataframes to save memory.


In [4]:
# read searches.csv, all the columns of interest
# All the formatting operations are done during the reading step in order to minimize the computational effort
dfSearch = pd.read_csv(strSearchesFilename,sep='^', 
                       usecols=listColumnsSearches,
                       parse_dates=[listColumnsSearchesInner[2]],date_parser=parse,
                       converters = {listColumnsSearchesInner[0] : strip,
                                     listColumnsSearchesInner[1] : strip},
                       dtype={listColumnsSearches[40]:'object',
                              listColumnsSearches[41]:'object',
                              listColumnsSearches[42]:'object',
                              listColumnsSearches[44]:'object'},
                       chunksize=intChunksize)
# Loop over the searches with chunks
for chunkS in dfSearch:
    dfConcatenated = pd.DataFrame(columns=listAllColumns) # initialization: empty dataframe whose columns correspond
                                                       # to the columns of the final output file
    # read bookings.csv, only the columns of interest given in listColumnsBookingInner, chunk by chunk,
    # in order to retrieve the records that correspond to those in chunkS
    # All the formatting operations are done during the reading step in order to minimize the computational effort
    dfBook = pd.read_csv(strBookingsFilename,sep='^', 
                         usecols=listColumnsBookingInner,
                         parse_dates=[listColumnsBookingInner[2]],date_parser=parse,
                         converters = {listColumnsBookingInner[0] : strip, # remove whitespaces in the columns
                                       listColumnsBookingInner[1] : strip},
                         chunksize=intChunksize) # read the CSV file, keeping only the columns in listColumnsBookingInner
 
    # Loop over the dataframe dfBook for the bookings
    for chunkB in dfBook:
        # The inner join keeps only the common records
        dfMergeSB = pd.merge(chunkS, chunkB, 
                             left_on=listColumnsSearchesInner, 
                             right_on=listColumnsBookingInner, 
                             how='inner')       
        dfConcatenated=pd.concat([dfConcatenated, dfMergeSB]) # I append the new relevant records to the previous ones
        
    del dfConcatenated        # delete a dataframe to save memory    
    del dfMergeSB             # delete a dataframe to save memory
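
The cell above does not show the writing step (step 5). The sketch below is one possible way to do it, not necessarily the original one: each merged chunk is appended to the output, and the header is written only once. The sample data, the `read_chunks` helper, the simplified `cre_date` header and the in-memory output buffer are all assumptions for illustration; with real files, the buffer would be replaced by an output path of your choice opened in append mode.

```python
import io
import pandas as pd

# Hypothetical '^'-separated samples standing in for searches.csv and bookings.csv.
SEARCHES = ("Date^Origin^Destination\n"
            "2013-01-01^MAD^JFK\n2013-01-02^CDG^LHR\n2013-01-03^FRA^SIN\n")
BOOKINGS = ("dep_port^arr_port^cre_date\n"
            "MAD     ^JFK     ^2013-01-01\n"
            "FRA     ^SIN     ^2013-01-03\n"
            "NCE     ^LIS     ^2013-01-04\n")

def read_chunks(text, date_col, strip_cols, chunksize):
    # Hypothetical helper: yield cleaned chunks (stripped codes, parsed dates).
    for chunk in pd.read_csv(io.StringIO(text), sep='^', chunksize=chunksize):
        for col in strip_cols:
            chunk[col] = chunk[col].str.strip()
        chunk[date_col] = pd.to_datetime(chunk[date_col],
                                         format='%Y-%m-%d', errors='coerce')
        yield chunk

output = io.StringIO()      # stands in for the output CSV file
header_written = False
for chunkS in read_chunks(SEARCHES, 'Date', ['Origin', 'Destination'], 2):
    # The bookings are re-read from the start for every searches chunk.
    for chunkB in read_chunks(BOOKINGS, 'cre_date', ['dep_port', 'arr_port'], 2):
        merged = pd.merge(chunkS, chunkB,
                          left_on=['Origin', 'Destination', 'Date'],
                          right_on=['dep_port', 'arr_port', 'cre_date'],
                          how='inner')
        if not merged.empty:
            # Append the matches; write the header only the first time.
            merged.to_csv(output, sep='^', index=False,
                          header=not header_written)
            header_written = True

result = pd.read_csv(io.StringIO(output.getvalue()), sep='^')
print(result)
```

Writing each merged chunk immediately keeps memory usage bounded by the chunk size, at the cost of re-reading the bookings once per searches chunk.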

