Taken from: https://realpython.com/introduction-to-python-generators/

# Creating Data Pipelines With Generators

Data pipelines allow you to string together code to process large datasets or streams of data without maxing out your machine’s memory.

To demonstrate how to build pipelines with generators, you’re going to analyze the techcrunch.csv file to get the total and average of all series A rounds in the dataset.

Let’s think of a strategy:

* Read every line of the file.
* Split each line into a list of values.
* Extract the column names.
* Use the column names and lists to create a dictionary.
* Filter out the rounds you aren’t interested in.
* Calculate the total and average values for the rounds you are interested in.

Normally, you can do this with a package like pandas, but you can also achieve this functionality with just a few generators.

In [1]:
# Let's open the file lazily using a generator expression

file_name = "techcrunch.csv"
lines = (line for line in open(file_name))
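
The generator expression above never closes the file explicitly. One alternative, sketched here under the assumption that you want the file closed once reading finishes, is a small generator function wrapped around a `with` block. The name `read_lines` and the temporary sample file are hypothetical and only there so the sketch runs without techcrunch.csv.

```python
import os
import tempfile


def read_lines(path):
    """Yield lines one at a time; the file closes when iteration finishes."""
    with open(path) as handle:
        for line in handle:
            yield line


# Write a tiny stand-in file (made-up data) so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("company,round\nAcme,a\nGlobex,b\n")
    sample_path = tmp.name

lines = read_lines(sample_path)   # nothing has been read yet
first_line = next(lines)          # the file is opened and one line is read
remaining = list(lines)           # exhausting the generator closes the file
os.remove(sample_path)
```

As with the generator expression, no line is read until something asks for it.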

In [2]:
# Now let's have a second generator expression that splits each line into a list of values

list_line = (s.rstrip().split(",") for s in lines)

In [3]:
# This line is going to get the column names
cols = next(list_line)
print(cols)

['permalink', 'company', 'numEmps', 'category', 'city', 'state', 'fundedDate', 'raisedAmt', 'raisedCurrency', 'round']
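
Calling next() here does more than print the header: it consumes the first item of list_line, so everything downstream sees only data rows. A minimal sketch with made-up sample lines (not taken from techcrunch.csv) shows the effect:

```python
# Made-up lines standing in for the open file.
sample_lines = iter([
    "company,raisedAmt,round\n",
    "Acme,1000,a\n",
    "Globex,2500,b\n",
])

list_line = (s.rstrip().split(",") for s in sample_lines)

cols = next(list_line)   # pulls the header row out of the generator
rows = list(list_line)   # only the data rows are left

print(cols)  # ['company', 'raisedAmt', 'round']
print(rows)  # [['Acme', '1000', 'a'], ['Globex', '2500', 'b']]
```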


In [4]:
# Creating dictionaries where the keys are the column names
company_dicts = (dict(zip(cols, data)) for data in list_line)
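
To see what dict(zip(cols, data)) produces for a single row, here is a small sketch with a made-up row. Note that every value is still a string at this point, which is why the next step converts raisedAmt with int().

```python
cols = ["company", "raisedAmt", "round"]
data = ["Acme", "1000", "a"]  # made-up row, already split on commas

# zip() pairs each column name with the value in the same position.
company_dict = dict(zip(cols, data))

print(company_dict)
# {'company': 'Acme', 'raisedAmt': '1000', 'round': 'a'}
```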

In [5]:
# Using a fourth generator to filter the funding round you want and pull raisedAmt as well

funding = (
    int(company_dict["raisedAmt"])
    for company_dict in company_dicts
    if company_dict["round"] == "a"
)

# Here the generator expression iterates through the results of company_dicts
# and takes the raisedAmt for any company_dict where the round key is "a"
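
If the generator expression is hard to read, it may help to see it written as a generator function. The sketch below should behave the same way. The name `series_a_amounts` and the sample rows are hypothetical.

```python
def series_a_amounts(company_dicts):
    """Yield raisedAmt as an int for each series A round (hypothetical helper)."""
    for company_dict in company_dicts:
        if company_dict["round"] == "a":
            yield int(company_dict["raisedAmt"])


# Made-up rows standing in for the dictionaries built from the CSV.
sample_dicts = [
    {"company": "Acme", "raisedAmt": "1000", "round": "a"},
    {"company": "Globex", "raisedAmt": "2500", "round": "b"},
    {"company": "Initech", "raisedAmt": "3000", "round": "a"},
]

amounts = list(series_a_amounts(sample_dicts))
print(amounts)  # [1000, 3000]
```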

In [6]:
# Up to this point we aren’t iterating through anything
# To iterate through the 4 generators we would need a for loop or other 
# iterative expression like sum()

total_series_a = sum(funding)

# Calling sum() now iterates through the generators

In [7]:
print(f"Total series A fundraising: ${total_series_a}")

Total series A fundraising: $4376015000
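
One consequence worth knowing: a generator can only be consumed once. After sum() has run, funding is exhausted, and a second pass over it yields nothing. A minimal sketch with made-up amounts:

```python
# Made-up amounts standing in for the series A values.
funding = (amount for amount in [1000, 2500, 4000])

first_total = sum(funding)    # consumes every item
second_total = sum(funding)   # the generator is already exhausted

print(first_total)   # 7500
print(second_total)  # 0
```

To run the pipeline again, you would need to rebuild the generators starting from the file.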


This notebook pulls together every generator you’ve built, and they all function as one big data pipeline. Here’s a cell-by-cell breakdown:

* In [1] reads in each line of the file.
* In [2] splits each line into values and puts the values into a list.
* In [3] uses next() to store the column names in a list.
* In [4] creates dictionaries by pairing column names with row values in a zip() call:
  * The keys are the column names cols from In [3].
  * The values are the rows in list form, created in In [2].
* In [5] gets each company’s series A funding amount and filters out every other round.
* In [6] begins the iteration process by calling sum() to get the total amount of series A funding found in the CSV.

When you run this code on techcrunch.csv, you should find a total of $4,376,015,000 raised in series A funding rounds.
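
The goal stated at the start also includes the average, which the cells above do not compute. Because the generator is exhausted after one pass, the total and the count have to be gathered in the same loop. Here is one possible sketch. The helper name `total_and_average` and the sample amounts are hypothetical.

```python
def total_and_average(amounts):
    """Return (total, average) from a single pass over an iterable of numbers."""
    total = 0
    count = 0
    for amount in amounts:
        total += amount
        count += 1
    # Guard against an empty pipeline to avoid dividing by zero.
    average = total / count if count else 0.0
    return total, average


# Made-up amounts standing in for the funding generator.
sample_funding = (amount for amount in [1000, 2500, 4000])

total, average = total_and_average(sample_funding)
print(f"Total: ${total}, average: ${average:.2f}")
# Total: $7500, average: $2500.00
```

In the notebook, you would pass funding to this function instead of calling sum() on it.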