This code is based on Python 3.5.

I propose two versions of the code: 1) without chunks (the whole file is processed in a single step) and 2) with chunks (the file is processed in several steps). Both versions give the same result.

To speed up processing, I used an uncompressed data file; if necessary, the "pd.read_csv" function can also read compressed files by adding the option "compression='bz2'".
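
As an illustration of the compression option, here is a minimal sketch that writes a tiny bz2-compressed file and reads it back. The sample data, the temporary file name and the variable names are assumptions made for the example; only the '^' separator and the column names come from the bookings file.

```python
import bz2
import os
import tempfile

import pandas as pd

# Hypothetical sample data, using the same '^' separator as the bookings file
sample = "arr_port^pax\nLHR^2\nCDG^1\nLHR^-1\n"

with tempfile.TemporaryDirectory() as tmpDir:
    compressedPath = os.path.join(tmpDir, "sample.csv.bz2")
    # write the sample as a bz2-compressed text file
    with bz2.open(compressedPath, "wt") as f:
        f.write(sample)
    # compression='bz2' makes read_csv decompress the file while reading it
    dfCompressed = pd.read_csv(compressedPath, sep='^', compression='bz2')

print(dfCompressed)
```

The rest of the processing is unchanged: once loaded, the dataframe is the same as the one read from an uncompressed file.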


The next cell defines the parameters shared by both versions of the code.

In [5]:
import pandas as pd # import the library pandas
import numpy as np # import the library numpy to use "linspace"
filenameBooking="bookings.csv" # name of the bookings file
usedColumns = ['arr_port','pax'] # columns used to process the file
intChunksize = 10000 # size of the chunk (when using chunks)
topNumber=10 # number of tops I retrieve

The next cell computes the top 10 by processing the file in a single step (without chunks).

In [6]:
# Version without chunks
# read the CSV file, keeping only the columns 'arr_port' and 'pax'
df = pd.read_csv(filenameBooking,sep='^', usecols=usedColumns)
# The next command works as follows:
# 1) The groupby operator groups the dataframe rows by arrival airport, then I sum the pax (including negative values)
# 2) The function "reset_index" turns the index created by the groupby (the airport codes) back into a column,
#    so I obtain a dataframe
# 3) The dataframe is sorted by "pax" in descending order and I keep the first topNumber rows
# Comment: if a value is missing in the file, the read_csv function automatically replaces it with NaN,
# and the sum ignores the NaNs.
dfTop=df['pax'].groupby(df['arr_port']).sum().reset_index().sort_values(by='pax', ascending=False)[:topNumber]
print(dfTop.to_string(index=False)) # print the top10 dataframe (just for verification, no "pretty" print)

 arr_port    pax
 LHR       88809
 MCO       70930
 LAX       70530
 LAS       69630
 JFK       66270
 CDG       64490
 BKK       59460
 MIA       58150
 SFO       58000
 DXB       55590
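
To make the chained command easier to follow, the sketch below runs its three steps one at a time on a toy dataframe. The data (including one negative value and one missing value) and the names dfToy, paxPerAirport, dfSums and dfToyTop are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one negative pax value and one missing value
dfToy = pd.DataFrame({
    'arr_port': ['LHR', 'CDG', 'LHR', 'CDG', 'JFK'],
    'pax': [2, 3, -1, np.nan, 4],
})

# Step 1: sum the pax per arrival airport (the NaN is skipped by the sum)
paxPerAirport = dfToy['pax'].groupby(dfToy['arr_port']).sum()
# Step 2: move the 'arr_port' index back into a regular column
dfSums = paxPerAirport.reset_index()
# Step 3: sort by pax, largest first, and keep the first 2 rows
dfToyTop = dfSums.sort_values(by='pax', ascending=False)[:2]

print(dfToyTop.to_string(index=False))
```

Here LHR sums to 1 (2 minus 1), CDG to 3 (the missing value is ignored) and JFK to 4, so the top 2 is JFK followed by CDG.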


The next cell computes the top 10 by processing the file in chunks.

In [7]:
# Version with chunks
df = pd.read_csv(filenameBooking,sep='^', usecols=usedColumns,chunksize=intChunksize) 
dfConcatenated = pd.DataFrame(columns=usedColumns) # initialization: empty dataframe

for chunk in df: # loop over the chunks
    # partial dataframe "dfChunk": pax summed per arrival airport for this chunk only
    dfChunk=chunk.groupby(chunk['arr_port']).sum().reset_index() 
    # append the partial dataframe to "dfConcatenated"; the option "ignore_index=True" renumbers the rows,
    # so the overlapping indexes of the partial dataframes do not clash
    dfConcatenated = pd.concat([dfConcatenated, dfChunk], ignore_index=True) # accumulate the partial dataframes one by one

# Compute the top10 from the concatenated dataframe "dfConcatenated"    
dfTopChunk=dfConcatenated['pax'].groupby(dfConcatenated['arr_port']).sum().reset_index().sort_values(
    by='pax', ascending=False)[:topNumber]
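
The chunked version gives the same totals because a sum of partial sums equals the overall sum. The sketch below checks this on a small in-memory file; it is a variant, not the cell above: it collects the partial sums in a list and concatenates them once at the end, which avoids growing a dataframe inside the loop. The toy data, the chunk size of 2 and the variable names are assumptions for the example.

```python
import io

import pandas as pd

# Hypothetical toy data standing in for bookings.csv
csvText = "arr_port^pax\nLHR^2\nCDG^3\nLHR^-1\nJFK^4\nCDG^1\nLHR^5\n"

# Single step: total pax per airport over the whole file
dfWhole = pd.read_csv(io.StringIO(csvText), sep='^')
totalWhole = dfWhole.groupby('arr_port')['pax'].sum()

# Chunks: keep the partial sums of each chunk in a list
partialSums = []
for chunk in pd.read_csv(io.StringIO(csvText), sep='^', chunksize=2):
    partialSums.append(chunk.groupby('arr_port')['pax'].sum())

# Concatenate once, then sum the partial sums per airport
totalChunked = pd.concat(partialSums).groupby(level=0).sum()

# Compare the two results airport by airport
print(totalWhole.to_dict() == totalChunked.to_dict())
```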

In the next cell, I reorganize dfTopChunk to print a "pretty" dataframe.

In [8]:
dfTopChunk['Rank']=np.linspace(1,topNumber,topNumber)
dfTopChunk = dfTopChunk.reindex(columns=['Rank','arr_port','pax'])
dfTopChunkPrinted=dfTopChunk.rename(columns={'arr_port': 'Airport','pax': 'Number of bookings'})
print(dfTopChunkPrinted.to_string(index=False))

 Rank   Airport  Number of bookings
    1  LHR                    88809
    2  MCO                    70930
    3  LAX                    70530
    4  LAS                    69630
    5  JFK                    66270
    6  CDG                    64490
    7  BKK                    59460
    8  MIA                    58150
    9  SFO                    58000
   10  DXB                    55590
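
A possible variant for the rank column: "np.linspace" returns floating-point values, so depending on the pandas version the ranks may be displayed as 1.0, 2.0, and so on. The sketch below uses "np.arange" to get integer ranks instead. The toy dataframe and its name are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical top-3 dataframe, already sorted by pax in descending order
dfToyRanked = pd.DataFrame({'arr_port': ['JFK', 'CDG', 'LHR'], 'pax': [4, 3, 1]})

# np.arange(1, n + 1) gives the integers 1..n; insert them as the first column
dfToyRanked.insert(0, 'Rank', np.arange(1, len(dfToyRanked) + 1))

print(dfToyRanked.to_string(index=False))
```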
