<h6><center>Big Data Algorithms Techniques & Platforms</center></h6>
<h1>
<hr style=" border:none; height:3px;">
<center>Assignment 1: Introduction to MapReduce</center>
<hr style=" border:none; height:3px;">
</h1>

## Flights and Airports Data

In this assignment, we analyze a dataset of flight records. The dataset comes from <a href="https://www.kaggle.com/flashgordon/usa-airport-dataset">Kaggle</a> and is stored in a <code>.csv</code> file. Each line of the file is one record for an origin-destination pair and may cover more than one flight (see <code>Flights</code>). The file contains the following columns:


<code>Origin_airport</code>: Three-letter airport code of the origin airport<br>
<code>Destination_airport</code>: Three-letter airport code of the destination airport<br>
<code>Origin_city</code>: Origin city name<br>
<code>Destination_city</code>: Destination city name<br>
<code>Passengers</code>: Number of passengers transported from origin to destination<br>
<code>Seats</code>: Number of seats available on flights from origin to destination<br>
<code>Flights</code>: Number of flights between origin and destination (multiple records for one month, many with flights > 1)<br>
<code>Distance</code>: Distance (to nearest mile) flown between origin and destination<br>
<code>Fly_date</code>: The month of the flight (stored in the file as yyyy-mm-dd, e.g. 2008-10-01)<br>
<code>Origin_population</code>: Origin city's population as reported by US Census<br>
<code>Destination_population</code>: Destination city's population as reported by US Census<br>

## Assumption

In this assignment, I assume that the data is partitioned into 100,000-line chunks. That is, the `apply_map` function defined below is applied to at most 100,000 lines of data at a time.

The `shuffle` function is called right after `apply_map` produces the results for a partition, and it adds that partition's key-value pairs to the `defaultdict` created for the MapReduce procedure. I did not gather the results of the 37 `apply_map` calls into one big list and pass that list to `shuffle` afterwards, mainly for efficiency: merging would require holding all 37 intermediate lists in RAM and would involve a lot of copying as the merged list grows.
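The choice described above can be illustrated with a minimal sketch. The partition outputs below are made-up toy data, not taken from the flights file, and the variable names are hypothetical. Both approaches produce the same grouping; the incremental one simply never needs the merged list.

```python
from collections import defaultdict

# Hypothetical map outputs for three small partitions (illustrative data only).
toy_partition_outputs = [
    [("a", 1), ("b", 2)],
    [("a", 3)],
    [("b", 4), ("c", 5)],
]

# Incremental approach: fold each partition's pairs into one dictionary as
# soon as they are produced; the per-partition list can then be discarded.
incremental = defaultdict(list)
for pairs in toy_partition_outputs:
    for key, value in pairs:
        incremental[key].append(value)

# Merge-first approach: build one big list of all pairs, then shuffle once.
merged = [pair for pairs in toy_partition_outputs for pair in pairs]
merge_first = defaultdict(list)
for key, value in merged:
    merge_first[key].append(value)

print(dict(incremental))
print(incremental == merge_first)
```

The grouped result is identical in both cases; the difference is only in how much intermediate data has to be alive at the same time.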

In [1]:
from collections import defaultdict
from math import ceil
from re import compile
from os import listdir

FILENAME = "Airports2.csv"
ROWS_PER_PARTITION = 100000

In [2]:
with open(FILENAME, "r") as read_file:
    header = read_file.readline()
    line = read_file.readline()
    row_count = 0
    while line:
        row_count += 1
        line = read_file.readline()

    print(f"The total rows are {row_count}.")
    print(f"The number of partitions is {ceil(row_count / ROWS_PER_PARTITION)}.")

The total rows are 3606803.
The number of partitions is 37.


In [3]:
with open(FILENAME, "r") as read_file:  
    header = read_file.readline()
    file_num = 0
    file_name = f"data_part{file_num}.csv"
    partition = open(file_name, "w")
    partition.write(header)
    line = read_file.readline()
    rows_written = 0
    while line:
        partition.write(line)
        rows_written += 1
        line = read_file.readline()
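        # Start a new partition only if there is still data to write, so that
        # no header-only file is created when the row count is a multiple of
        # ROWS_PER_PARTITION.
        if line and rows_written % ROWS_PER_PARTITION == 0: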
            partition.close()
            file_num = rows_written // ROWS_PER_PARTITION
            file_name = f"data_part{file_num}.csv"
            partition = open(file_name, "w")
            partition.write(header)

    partition.close()

In [4]:
filenames_pattern = compile(r"^data_part.*\.csv$") 
filenames = list(filter(filenames_pattern.match, listdir('./')))
print(f"The number of created files is {len(filenames)}.")
print("Sample names:", filenames[:5])

The number of created files is 37.
Sample names: ['data_part11.csv', 'data_part10.csv', 'data_part12.csv', 'data_part13.csv', 'data_part17.csv']


## Helper functions

In [5]:
def read_csv(filename):
    """
    Parse a CSV file with the given `filename`.

    Parameters
    ----------
    filename : str
        Full name of the CSV file containing records.

    Yields
    ------
    list of str
        Each line of the CSV file split into a list.

    """
    with open(filename, 'r') as file:
        for line in file:
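            # Drop the line ending, strip quotes, and turn the ", " inside
            # quoted city names into ": " so that splitting on "," stays aligned.
            line_clean = line.rstrip("\n").replace("\"", "").replace(", ", ": ")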
            yield line_clean.split(",")


contents = read_csv(FILENAME)
print(next(contents), "\n")
print(next(contents), "\n")

['Origin_airport', 'Destination_airport', 'Origin_city', 'Destination_city', 'Passengers', 'Seats', 'Flights', 'Distance', 'Fly_date', 'Origin_population', 'Destination_population', 'Org_airport_lat', 'Org_airport_long', 'Dest_airport_lat', 'Dest_airport_long'] 

['MHK', 'AMW', 'Manhattan: KS', 'Ames: IA', '21', '30', '1', '254', '2008-10-01', '122049', '86219', '39.140998840332', '-96.6707992553711', 'NA', 'NA'] 



In [6]:
def find_indices(rows_gen, column_names):
    """
    Find the indices of given columns.

    Advances the input generator by one row, because the header row is
    consumed.

    Parameters
    ----------
    rows_gen : generator
        Generator yielding the rows of the text file.
    column_names : tuple of str
        Names of the columns whose indices we need.

    Returns
    -------
    list of int
        List containing the index of each column.

    """
    header = next(rows_gen)
    indices = []
    for column in column_names:
        indices.append(header.index(column))
    return indices


def apply_map(filename, map_func, column_names):
    """
    Iterate over `filename` and apply `map_func` on each line.

    Parameters
    ----------
    filename : str
        Full name of a text file containing records.
    map_func : func
        Map function returning a tuple for each line.
    column_names : tuple of str
        Names of the columns that `map_func` needs.

    Returns
    -------
    list of tuples
        Contains tuples returned by each call of `map_func`. If the
        map result contains None, it is not included in the list.
    """
    lines = read_csv(filename)
    column_indices = find_indices(lines, column_names)
    map_result_list = []
    for line in lines:
        map_result = map_func(line, column_indices)
        if None not in map_result:
            map_result_list.append(map_result)

    return map_result_list


def map_reduce(columns, partitions, map_func, reduce_func, combine_func=None):
    """
    Apply a MapReduce procedure on all partitions of data.

    Parameters
    ----------
    columns : tuple of str
        Names of the columns that `map_func` needs.
    partitions: list of str
        Names of the partitions of the data.
    map_func : func
        Map function returning a tuple for each line.
    reduce_func : func
        Reduce function returning a tuple for each key in the
        dictionary created by the shuffle function.
    combine_func : func
        Combine function that can be applied to the result when
        `map_func` is applied to each row of a partition.

    Returns
    -------
    list of tuples
        Contains tuples returned by each call of `reduce_func`.

    """
    shuffle_dict = defaultdict(list)
    for partition in partitions:
        map_list = apply_map(partition, map_func, columns)
        if combine_func is not None:
            map_list = combine_func(map_list)
        shuffle_flights(map_list, shuffle_dict)

    reduce_list = []
    for key in shuffle_dict:
        reduce_list.append(reduce_func(shuffle_dict, key))   
    return reduce_list

## <strong> Exercise 1 - Almost-empty flights</strong> 
### <strong> 4 points </strong>
Describe and define a MapReduce procedure that gives the number of flights that departed with at most 10% capacity (i.e. empty, 1%, 6%, etc.). 

The output can be of any form you like. You can use any data structure you want to support your implementation.
### <strong> Answer </strong>

My `apply_map` function produces a list of tuples of the form (occupancy rate, flight count), where the occupancy rate is rounded to two decimal places. After the map step, flight counts are grouped by occupancy rate in `shuffle` and summed in `reduce`. So, after the MapReduce procedure, I have a list of 11 elements (one tuple for each rounded occupancy rate from 0.00 to 0.10). The only thing that remains is to sum the flight counts in that list to arrive at the final result.
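As a rough illustration of the procedure just described, here is a self-contained sketch on a handful of made-up rows. The rows, the `toy_` names, and the simplified map function are assumptions for illustration only; the notebook's own functions work on CSV lines and column indices instead.

```python
from collections import defaultdict

# Hypothetical rows: (passengers, seats, flights). Illustrative data only.
toy_rows = [
    (0, 50, 2),
    (5, 100, 1),
    (90, 100, 4),   # 90% occupancy, filtered out
    (10, 100, 3),   # exactly 10%, kept
    (3, 0, 1),      # zero seats, filtered out
    (5, 100, 2),
]


def toy_map(row):
    """Return (rounded occupancy rate, flights) or None if the row is skipped."""
    passengers, seats, flights = row
    if seats == 0 or passengers / seats > 0.1:
        return None
    return round(passengers / seats, 2), flights


# Shuffle: group the flight counts by occupancy rate.
toy_shuffled = defaultdict(list)
for row in toy_rows:
    pair = toy_map(row)
    if pair is not None:
        toy_shuffled[pair[0]].append(pair[1])

# Reduce: sum the flight counts per occupancy rate, then add them all up.
toy_reduced = {rate: sum(counts) for rate, counts in toy_shuffled.items()}
toy_total = sum(toy_reduced.values())
print(toy_reduced)
print(toy_total)
```

Four of the six toy rows pass the filter, and their flight counts (2, 1, 3, 2) add up to 8.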

In [7]:
def map_empty(line, column_indices):
    """
    Create a tuple for flights having at most 10% occupancy rate.

    Parameters
    ----------
    line : list of str
        Each line of a text file split into a list.
    column_indices : list of int
        List containing the indices of Passengers, Seats, and
        Flights columns.

    Returns
    -------
    tuple of (float, int) or (None, None)
        Each tuple is (occupancy rate, flight count) if occupancy
        rate is at most 10%. Otherwise, it is (None, None).

    """
    passenger_idx, seats_idx, flights_idx = column_indices
    passengers = int(line[passenger_idx])
    seats = int(line[seats_idx])
    try:
        occupancy_rate = passengers / seats
        if occupancy_rate <= 0.1:
            flights = int(line[flights_idx])
            return round(occupancy_rate, 2), flights
        else:
            return None, None
    except ZeroDivisionError:
        return None, None

In [8]:
def shuffle_flights(tuples_list, shuffle_results):
    """
    Take the key-value pairs returned by a map function, create a 
    list of all values corresponding to the same key and store the
    key and the list in `shuffle_results`.

    Parameters
    ----------
    tuples_list : list of (hashable type, any type)
        List of tuples, where each tuple is a (key, value) pair.
    shuffle_results : defaultdict of {hashable type : list of any type}
        Keys are the individual values of the grouping variable, and 
        values are the lists corresponding to same key. This dictionary
        is modified by the function.

    Returns
    -------
    None

    """
    for key, value in tuples_list:
        shuffle_results[key].append(value)

In [9]:
def reduce_flights(pairs_dict, key):
    """
    Sum all values corresponding to the given key.

    Parameters
    ----------
    pairs_dict : dict of {hashable type : list of int}
        Keys are the individual values of the grouping variable,
        and values are the lists corresponding to same key.
    key : hashable type
        Individual value of the grouping variable for which we
        need a sum.

    Returns
    -------
    tuple of (hashable type, int)
        tuple of the form (key, sum of all values).

    """
    values_list = pairs_dict[key]
    values_sum = sum(values_list)
    return key, values_sum

In [10]:
columns1 = ("Passengers", "Seats", "Flights")
reduce_empty_simple = map_reduce(
    columns1, filenames, map_empty, reduce_flights)
reduce_empty_simple

[(0.0, 99198),
 (0.07, 26468),
 (0.06, 21925),
 (0.04, 15442),
 (0.02, 10080),
 (0.03, 12149),
 (0.05, 19344),
 (0.09, 36472),
 (0.1, 21950),
 (0.08, 28528),
 (0.01, 10677)]

In [11]:
reduce_empty_sum = sum(x[1] for x in reduce_empty_simple)
print(f"The number of almost empty flights is {reduce_empty_sum}.")

The number of almost empty flights is 302233.


### <strong> Combine function </strong>
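In this case, it is possible to use the `combine` operator. The reason is that the same key (`occupancy_rate`) is likely to be repeated many times inside one 100,000-line chunk, so pre-summing within a partition shrinks the data sent to `shuffle`. Also, the reduce function (`sum`) is commutative and associative, so summing partial sums gives the same result as summing the individual values.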
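The argument above can be checked on a small example. The sketch below uses made-up key-value pairs and hypothetical helper names (`toy_combine`, `toy_shuffle_reduce`); it only demonstrates that pre-summing per partition leaves the final sums unchanged while reducing the number of pairs that reach the shuffle step.

```python
from collections import defaultdict

# Hypothetical map outputs for two partitions (illustrative data only).
toy_partitions = [
    [(0.05, 1), (0.05, 2), (0.1, 4)],
    [(0.05, 3), (0.0, 7)],
]


def toy_combine(pairs):
    """Sum the values by key inside one partition."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def toy_shuffle_reduce(partitions):
    """Group values by key across partitions and sum each group."""
    grouped = defaultdict(list)
    for pairs in partitions:
        for key, value in pairs:
            grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}


without_combine = toy_shuffle_reduce(toy_partitions)
with_combine = toy_shuffle_reduce([toy_combine(p) for p in toy_partitions])

pairs_before = sum(len(p) for p in toy_partitions)
pairs_after = sum(len(toy_combine(p)) for p in toy_partitions)
print(without_combine == with_combine, pairs_before, pairs_after)
```

On real partitions with only 11 distinct keys, the reduction in shuffled pairs would be far larger than in this toy case.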

In [12]:
def combine_sum(tuples_list):
    """
    Take the key-value pairs returned by a map function for one
    partition and sum the values by key.

    Parameters
    ----------
    tuples_list : list of (hashable type, any type)
        List of tuples, where each tuple is a (key, value) pair.

    Returns
    -------
    list of (hashable type, int)
        Each tuple is of the form (key, sum of all values).

    """
    key_value_dict = defaultdict(int)
    for key, value in tuples_list:
        key_value_dict[key] += value

    return list(key_value_dict.items())

In [13]:
reduce_empty_combine = map_reduce(
    columns1, filenames, map_empty, reduce_flights, combine_sum)
reduce_empty_combine

[(0.0, 99198),
 (0.07, 26468),
 (0.06, 21925),
 (0.04, 15442),
 (0.02, 10080),
 (0.03, 12149),
 (0.05, 19344),
 (0.09, 36472),
 (0.1, 21950),
 (0.08, 28528),
 (0.01, 10677)]

In [14]:
reduce_empty_sum = sum(x[1] for x in reduce_empty_combine)
print(f"The number of almost empty flights is {reduce_empty_sum}.")

The number of almost empty flights is 302233.


## <strong> Exercise 2 - Top five destination airports </strong>
### <strong> 4 points </strong>

Now provide a function that lists the top five destination <strong>airports</strong>: the ones that have the highest number of incoming flights. Implement an algorithm that uses the MapReduce procedure.

### <strong> Answer </strong>

In [15]:
def map_flights(line, column_indices):
    """
    Create a tuple of (group variable, flight count).

    Parameters
    ----------
    line : list of str
        Each line of a text file split into a list.
    column_indices : list of int
        List containing the indices of the grouping variable
        and the Flights columns.

    Returns
    -------
    tuple of (hashable type, int)
        Hashable type is the type of the grouping variable.

    """
    group_var_idx, flights_idx = column_indices
    group_var = line[group_var_idx]
    flight_count = int(line[flights_idx])
    return group_var, flight_count

For the `shuffle` and `reduce` operations, we can take advantage of functions written for Exercise 1.

In [16]:
columns2 = ("Destination_airport", "Flights")
reduce_top_airports = map_reduce(
    columns2, filenames, map_flights, reduce_flights, combine_sum)

reduce_top_airports.sort(key=lambda x: x[1], reverse=True)
print("The top-5 airports are", reduce_top_airports[:5])

The top-5 airports are [('ORD', 6896285), ('ATL', 6544759), ('DFW', 5987886), ('LAX', 4096702), ('DTW', 3448042)]


## <strong> Exercise 3 - Top 5 destination cities </strong>
### <strong> 2 points </strong>

Try to reuse the code you wrote earlier and define a function that lists the top five destination cities: the ones that have the highest number of incoming flights. Implement an algorithm that uses the MapReduce procedure.

### <strong> Answer </strong>
Since this question is almost identical to the previous one, the only change needed is to use the `Destination_city` column instead of the `Destination_airport` column. We do not need to write new functions.

In [17]:
columns3 = ("Destination_city", "Flights")
reduce_top_cities = map_reduce(
    columns3, filenames, map_flights, reduce_flights, combine_sum)
reduce_top_cities.sort(key=lambda x: x[1], reverse=True)
print("The top-5 cities are", reduce_top_cities[:5])

The top-5 cities are [('Chicago: IL', 8357958), ('Dallas: TX', 6957427), ('Atlanta: GA', 6544880), ('Houston: TX', 4350447), ('New York: NY', 4317734)]


## <strong> Exercise 4 - Top five connections by month</strong>
### <strong> 4 points </strong>

Try to reuse the code you wrote earlier and define a function that lists the top five connections for each month: the five pairs of cities with the highest number of flights. The function should count flights from A to B and from B to A together, by month/year. Implement an algorithm that uses the MapReduce procedure.

### <strong> Answer </strong>

Here, we re-use the `shuffle` function written for Exercise 1. The `apply_map` function returns a list of tuples of the form (date, (cities, flight count)), where the two cities are ordered alphabetically so that A to B and B to A share the same key. Then, all flights that took place in the same month/year are collected into one list by the `shuffle` function. Finally, `reduce` sums the flights that took place between the same two cities and returns the top-5 connections for each month.
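The two small transformations that make this work (a direction-independent key for a city pair, and truncating the date to the month) can be sketched in isolation. The helper names below are hypothetical and are not part of the notebook; `sorted` is used here as an equivalent of swapping the two names when they are out of order.

```python
def toy_connection_key(city_a, city_b):
    """Return one key for both directions of a connection."""
    first, second = sorted((city_a, city_b))
    return f"{first} - {second}"


def toy_month(fly_date):
    """Truncate a 'yyyy-mm-dd' string to 'yyyy-mm'."""
    return fly_date[:-3]


print(toy_connection_key("Houston: TX", "Dallas: TX"))
print(toy_connection_key("Dallas: TX", "Houston: TX"))
print(toy_month("1990-01-01"))
```

Both calls to `toy_connection_key` give the same string, so flights in either direction are summed under one connection.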

In [18]:
def map_connections(line, column_indices):
    """
    Create a tuple for flights between two cities by month/year.

    Parameters
    ----------
    line : list of str
        Each line of a text file split into a list.
    column_indices : list of int
        List containing the indices of Origin_city, Destination_city,
        Flights, and Fly_date columns.

    Returns
    -------
    tuple of (str, (str, int))
        Each tuple is (date, (cities, flight count))

    """
    source_idx, destination_idx, flights_idx, date_idx = column_indices
    source = line[source_idx]
    destination = line[destination_idx]
    flights = int(line[flights_idx])
    date = line[date_idx][:-3]  # drop the day: "yyyy-mm-dd" -> "yyyy-mm"
    if source > destination:
        destination, source = source, destination  # ordering pairs alphabetically

    cities = source + " - " + destination
    return date, (cities, flights)

In [19]:
def reduce_connections(flights_dict, date):
    """
    Find the top-5 connections for the given date.

    Parameters
    ----------
    flights_dict : dict of {str : list of (str, int)}
        Keys are dates, and values are lists of tuples of the
        form (two cities, flights between them).
    date : str
        Date of the form yyyy-mm for which we need the top-5 connections.

    Returns
    -------
    tuple of (str, list of (str, int))
         Tuple is (date, list of (two cities, total flights between them))

    """
    all_flights_in_month = flights_dict[date]
    flight_counts = defaultdict(int)
    for cities, flights in all_flights_in_month:
        flight_counts[cities] += flights

    flight_counts_list = sorted(
        flight_counts.items(),
        key=lambda item: item[1],
        reverse=True)
    return date, flight_counts_list[:5]

In [20]:
columns4 = ("Origin_city", "Destination_city", "Flights", "Fly_date")
reduce_connections_simple = map_reduce(
    columns4, filenames, map_connections, reduce_connections)
reduce_connections_simple.sort()  # sorting chronologically
reduce_connections_simple[:5]

[('1990-01',
  [('Dallas: TX - Houston: TX', 4602),
   ('Los Angeles: CA - San Francisco: CA', 3812),
   ('Honolulu: HI - Kahului: HI', 3285),
   ('Portland: OR - Seattle: WA', 3162),
   ('Chicago: IL - Detroit: MI', 3081)]),
 ('1990-02',
  [('Dallas: TX - Houston: TX', 4135),
   ('Los Angeles: CA - San Francisco: CA', 3168),
   ('Honolulu: HI - Kahului: HI', 2991),
   ('Chicago: IL - Detroit: MI', 2755),
   ('Portland: OR - Seattle: WA', 2741)]),
 ('1990-03',
  [('Dallas: TX - Houston: TX', 4636),
   ('Honolulu: HI - Kahului: HI', 3563),
   ('Los Angeles: CA - San Francisco: CA', 3533),
   ('Portland: OR - Seattle: WA', 3172),
   ('Chicago: IL - Detroit: MI', 3031)]),
 ('1990-04',
  [('Dallas: TX - Houston: TX', 4518),
   ('Honolulu: HI - Kahului: HI', 3941),
   ('Los Angeles: CA - San Francisco: CA', 3545),
   ('Chicago: IL - Detroit: MI', 3042),
   ('Portland: OR - Seattle: WA', 3029)]),
 ('1990-05',
  [('Dallas: TX - Houston: TX', 4786),
   ('Los Angeles: CA - San Francisco: CA', 4

### <strong> Combine function </strong>
In this case, it is possible to use the `combine` operator. The reason is that the reduce function includes a `sum` over the flights of each connection, and `sum` is commutative and associative. So, this part of the work can be done inside the `combine` function, while the selection of the top five stays in `reduce`.
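To see why only the summation is moved into the combiner, consider the following sketch. It uses made-up counts for a single month and `K = 1` instead of 5 to keep the example short; the data and names are hypothetical. Summing per partition is safe, whereas truncating each partition to its local top `K` before merging can drop a connection that is in the global top `K`.

```python
from collections import Counter

# Hypothetical (cities, flights) pairs for one month in two partitions.
toy_month_partitions = [
    [("A - B", 5), ("C - D", 4)],
    [("C - D", 4), ("E - F", 3)],
]
K = 1  # top-1 instead of top-5, to keep the toy data small

# Safe combiner: only sum inside each partition, select the top K at the end.
safe_total = Counter()
for pairs in toy_month_partitions:
    partial = Counter()
    for cities, flights in pairs:
        partial[cities] += flights
    safe_total.update(partial)  # Counter.update adds the counts
safe_top = safe_total.most_common(K)

# Unsafe variant: keep only each partition's local top K before merging.
unsafe_total = Counter()
for pairs in toy_month_partitions:
    partial = Counter()
    for cities, flights in pairs:
        partial[cities] += flights
    unsafe_total.update(dict(partial.most_common(K)))
unsafe_top = unsafe_total.most_common(K)

print(safe_top)
print(unsafe_top)
```

In the toy data, "C - D" has the most flights overall (8), but the unsafe variant reports "A - B" because the first partition discarded its "C - D" count.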

In [21]:
def combine_value_count(tuples_list):
    """
    Take the key-value pairs returned by a map function for one
    partition and sum the flight counts (`value[1]`) for each
    combination of key and sub key (`value[0]`).

    Parameters
    ----------
    tuples_list : list of (str, (str or int, int))
        Each tuple is a (key, (sub key, flight count)).

    Returns
    -------
    list of (str, (str or int, int))
        Each tuple is a (key, (sub key, partition flight count)).

    """
    key_value_dict = defaultdict(int)
    for key, value_tuple in tuples_list:
        sub_key, flights = value_tuple
        new_key = (key, sub_key)
        key_value_dict[new_key] += flights

    combine_list = [(key[0], (key[1], value))
                    for key, value in key_value_dict.items()]
    return combine_list

In [22]:
reduce_connections_combine = map_reduce(
    columns4, filenames, map_connections, reduce_connections, combine_value_count)
reduce_connections_combine.sort()  # sorting chronologically
reduce_connections_combine[:5]

[('1990-01',
  [('Dallas: TX - Houston: TX', 4602),
   ('Los Angeles: CA - San Francisco: CA', 3812),
   ('Honolulu: HI - Kahului: HI', 3285),
   ('Portland: OR - Seattle: WA', 3162),
   ('Chicago: IL - Detroit: MI', 3081)]),
 ('1990-02',
  [('Dallas: TX - Houston: TX', 4135),
   ('Los Angeles: CA - San Francisco: CA', 3168),
   ('Honolulu: HI - Kahului: HI', 2991),
   ('Chicago: IL - Detroit: MI', 2755),
   ('Portland: OR - Seattle: WA', 2741)]),
 ('1990-03',
  [('Dallas: TX - Houston: TX', 4636),
   ('Honolulu: HI - Kahului: HI', 3563),
   ('Los Angeles: CA - San Francisco: CA', 3533),
   ('Portland: OR - Seattle: WA', 3172),
   ('Chicago: IL - Detroit: MI', 3031)]),
 ('1990-04',
  [('Dallas: TX - Houston: TX', 4518),
   ('Honolulu: HI - Kahului: HI', 3941),
   ('Los Angeles: CA - San Francisco: CA', 3545),
   ('Chicago: IL - Detroit: MI', 3042),
   ('Portland: OR - Seattle: WA', 3029)]),
 ('1990-05',
  [('Dallas: TX - Houston: TX', 4786),
   ('Los Angeles: CA - San Francisco: CA', 4

## <strong> Exercise 5 - Number of full flights</strong>
### <strong> 2 points </strong>
<p align="justify">
Describe and implement an algorithm that, following MapReduce procedure, shows how many full flights have departed. This exercise gives you an idea about how many times you can re-use code in MapReduce with minimum effort for repetitive analysis.
</p>

### <strong> Answer </strong>

To write a function that we can use for both Exercise 5 and Exercise 6, I modify the `map` function of Exercise 1 slightly. The new function returns (1, flight count) if the flights in a record were full and (0, flight count) if they were not. For the `shuffle` and `reduce` operations, we can take advantage of the functions written for Exercise 1.

In [23]:
def map_full(line, column_indices):
    """
    Create a tuple for flights depending on their occupancy rate.

    Parameters
    ----------
    line : list of str
        Each line of a text file split into a list.
    column_indices : list of int
        List containing the indices of Passengers, Seats, and
        Flights columns.

    Returns
    -------
    tuple of (int, int) or (None, None)
        Each tuple is (dummy key, flight count). The dummy key
        is 1 if the occupancy rate is 100% and 0 otherwise. The
        return value is (None, None) if the seat count is zero.

    """
    passenger_idx, seats_idx, flights_idx = column_indices
    passengers = int(line[passenger_idx])
    seats = int(line[seats_idx])
    try:
        occupancy_rate = passengers / seats
        flight_count = int(line[flights_idx])
        if occupancy_rate == 1:
            return 1, flight_count
        else:
            return 0, flight_count
    except ZeroDivisionError:
        return None, None

In [24]:
reduce_full = map_reduce(columns1, filenames, map_full, reduce_flights)
reduce_full.sort()
reduce_full

[(0, 130848882), (1, 38507)]

In [25]:
number_full = reduce_full[1][1]
print(f"The number of full flights is {number_full}.")

The number of full flights is 38507.


In [26]:
share_full = reduce_full[1][1] / (reduce_full[0][1]+reduce_full[1][1])
print(f"The share of full flights is {share_full*100:.2f}%.")

The share of full flights is 0.03%.


## <strong> Exercise 6 -  Percentage of full flights </strong>
### <strong> 4 points </strong>

<p align="justify">
Describe and implement a MapReduce procedure that gives, for each city, the percentage of full flights that have departed.

Notice that this exercise shares some similarities with one of the previous exercises. Think how and if you can modify (generalize) one of the functions already implemented before. 
</p>

### <strong> Answer </strong>
Inside the `map` function of this exercise, I delegate most of the computation to the `map` function of Exercise 5. After `apply_map` produces a list of tuples of the form (city, (is_full, flight count)), all flights that departed from the same city are collected into one list in `shuffle`. (Once again, the `shuffle` function of Exercise 1 is re-used.) Finally, `reduce` computes, for each city, the proportion of flights that were full, which is a weighted average of zeros and ones with the flight counts as weights.
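The weighted average mentioned above can be worked through on a tiny example. The values below are made up for a single hypothetical city; each tuple is (is_full, flight count).

```python
# Hypothetical shuffled values for one city: (is_full, flight count).
toy_city_values = [(1, 2), (0, 6), (1, 1), (0, 1)]

# Weighted sum of the 0/1 flags, i.e. the number of full flights.
full_flights = sum(flag * count for flag, count in toy_city_values)
# Sum of the weights, i.e. the total number of flights.
total_flights = sum(count for _, count in toy_city_values)

share_full = full_flights / total_flights
print(full_flights, total_flights, share_full)
```

Because the flags are only 0 or 1, the weighted sum is just the count of full flights (3), and dividing by the total (10) gives the share of full flights.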

In [27]:
def map_proportion(line, column_indices):
    """
    For each origin city, create a tuple depending on the 
    occupancy rate of flights.

    Parameters
    ----------
    line : list of str
        Each line of a text file split into a list.
    column_indices : list of int
        List containing the indices of Origin_city, Passengers,
        Seats, and Flights columns.

    Returns
    -------
    tuple of (str, (int, int)) or (None, (None, None))
        Returns (city, (1, flight count)) if the flight was full,
        (city, (0, flight count)) if the flight wasn't full, and
        (None, (None, None)) if the number of seats was zero.

    """
    city_idx = column_indices[0]
    city = line[city_idx]
    new_column_indices = column_indices[1:]
    is_full_tuple = map_full(line, new_column_indices)
    if is_full_tuple != (None, None):
        return city, is_full_tuple
    else:
        return None, is_full_tuple

In [28]:
def reduce_proportion(counts_dict, key):
    """
    Compute the weighted average of all values for the given key.

    Parameters
    ----------
    counts_dict : dict of {hashable type : list of tuple}
        Keys are the individual values of the grouping variable,
        and values are the lists of tuples corresponding to same key.
        The first element of each tuple is the value to be averaged,
        and the second element is the corresponding weight.
    key : hashable type
        Individual value of the grouping variable for which we
        need a weighted average.

    Returns
    -------
    tuple of (hashable type, float)
        tuple of the form (key, weighted average of all values).

    """
    value_weight_list = counts_dict[key]
    sum_of_weights = 0
    weighted_sum = 0
    for value, weight in value_weight_list:
        weighted_sum += value * weight
        sum_of_weights += weight

    weighted_mean = weighted_sum / sum_of_weights
    return key, weighted_mean

In [29]:
columns6 = ("Origin_city", "Passengers", "Seats", "Flights")
reduce_proportion_simple = map_reduce(
    columns6, filenames, map_proportion, reduce_proportion)
reduce_percent_simple = [(key, share*100) for key, share in reduce_proportion_simple]
reduce_percent_simple.sort(key=lambda x: x[1], reverse=True)
reduce_percent_simple[:7]

[('Marshall: TX', 100.0),
 ('Shelton: WA', 50.0),
 ('Seymour: IN', 50.0),
 ('Lawrence: KS', 40.0),
 ('Napa: CA', 16.666666666666664),
 ('Mansfield: OH', 10.0),
 ('Scottsbluff: NE', 8.539944903581267)]

### <strong> Combine function </strong>
In this case, it is possible to use the `combine` operator. The reason is that the `reduce` function only needs the total flight counts of full and non-full flights for each city, and these totals are sums. This part of the work can therefore be done inside the `combine` function, which we already defined for Exercise 4.
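A short sketch can confirm that pre-summing the flight counts per (city, is_full) pair does not change the resulting share. The data and the `toy_` helper names are hypothetical; the structure mirrors the (city, (is_full, flight count)) tuples described above.

```python
from collections import defaultdict

# Hypothetical map outputs for two partitions: (city, (is_full, flights)).
toy_city_partitions = [
    [("X", (1, 2)), ("X", (0, 3)), ("Y", (0, 4))],
    [("X", (0, 3)), ("X", (1, 1)), ("Y", (1, 1))],
]


def toy_combine(pairs):
    """Sum flights per (city, is_full) inside one partition."""
    totals = defaultdict(int)
    for city, (flag, flights) in pairs:
        totals[(city, flag)] += flights
    return [(city, (flag, flights)) for (city, flag), flights in totals.items()]


def toy_share_full(partitions):
    """Shuffle by city, then compute the weighted mean of the 0/1 flags."""
    grouped = defaultdict(list)
    for pairs in partitions:
        for city, value in pairs:
            grouped[city].append(value)
    return {
        city: sum(flag * flights for flag, flights in values)
        / sum(flights for _, flights in values)
        for city, values in grouped.items()
    }


plain = toy_share_full(toy_city_partitions)
combined = toy_share_full([toy_combine(p) for p in toy_city_partitions])
print(plain == combined, plain)
```

Both the numerator and the denominator of the weighted mean are integer sums, so the combined and uncombined runs give exactly the same shares.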

In [30]:
reduce_proportion_combine = map_reduce(
    columns6, filenames, map_proportion, reduce_proportion, combine_value_count)
reduce_percent_combine = [(key, share*100) for key, share in reduce_proportion_combine]
reduce_percent_combine.sort(key=lambda x: x[1], reverse=True)
reduce_percent_combine[:7]

[('Marshall: TX', 100.0),
 ('Shelton: WA', 50.0),
 ('Seymour: IN', 50.0),
 ('Lawrence: KS', 40.0),
 ('Napa: CA', 16.666666666666664),
 ('Mansfield: OH', 10.0),
 ('Scottsbluff: NE', 8.539944903581267)]