# Python Generator Functions and Expressions

The goal of this notebook is to understand how Python uses generator functions and generator expressions to complete a task. I am starting with a CSV file, "techcrunch.csv" ([techcrunch](https://github.com/realpython/materials/blob/master/generators/techcrunch.csv?__s=c4pbrqd9owazaumr6clc)), which holds structured data saved to disk. This differs from the data I will actually be using for my research, which is not structured.

## Motivation

I am building this notebook because I would first like to understand how Python generators work, so that I can apply this knowledge to a more advanced situation: data generators in Keras.

## Background

A **generator function** is defined like an ordinary function, but calling it returns a lazy iterable object called a generator instead of running the body right away. The body uses *yield* to hand back one value at a time as the generator is iterated, pausing between values. It does **NOT** build the full set of values and *return* them all at once.  
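
As a minimal sketch of this behavior (the function name `count_up_to` is made up for illustration), calling a generator function only creates a generator. Each `next()` call then resumes the body until the next `yield`:

```python
def count_up_to(limit):
    """Hypothetical example: lazily yield the integers 1..limit."""
    n = 1
    while n <= limit:
        yield n  # pause here and hand n to the caller
        n += 1

gen = count_up_to(3)   # nothing has run yet; gen is a generator object
first = next(gen)      # runs the body up to the first yield
rest = list(gen)       # consumes the remaining values

print(first, rest)
```

Once the values have been consumed the generator is empty; iterating over `gen` again produces nothing.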

A **generator expression** allows you to quickly create a generator without building and holding the entire object in memory before iteration, so it uses far less memory than building the equivalent list.  

What makes a generator different from other iterable Python containers such as a list is that it does not load all of the data into memory. This is beneficial because, if your data is too large to read into memory, a generator lets you process it one item at a time. However, if your data **does** fit in memory, it is usually best to load it and work from there.
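
To make the memory difference concrete, here is a small sketch that compares a list with the equivalent generator expression using `sys.getsizeof`. The value of `n` is arbitrary, and `getsizeof` reports only the size of the container object itself, so treat the numbers as a rough indication:

```python
import sys

n = 100_000
squares_list = [i * i for i in range(n)]   # builds all n values up front
squares_gen = (i * i for i in range(n))    # builds nothing until iterated

list_size = sys.getsizeof(squares_list)    # grows with n
gen_size = sys.getsizeof(squares_gen)      # small and independent of n

print(f"list: {list_size} bytes, generator: {gen_size} bytes")
```

The exact byte counts depend on the Python version and platform, which is why no expected output is shown.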

In [55]:
# Load the Pandas libraries with alias 'pd' 
import pandas as pd 

In [1]:
data_path = '/Users/tashaleebillings/Desktop/Research/ML/scripts/data_generator/sandbox/data/'
file_name = data_path+"techcrunch.csv"

row_count = 0
row_count_ = 0

**csvreader()** opens a file, reads its contents, and returns them as a list, which is stored in **csv_gen**. Then the program iterates over the list and increments **row_count** for each row. Specifically, line 2 of the function opens the file, which gives a file object that can be read lazily, line by line. On line 3, however, **.read()** pulls the entire dataset into memory and **.split("\n")** turns it into a list. If your data is too large to load into memory, this can raise a **MemoryError**.

In [6]:
def csvreader(file_name):
    file = open(file_name)  # returns a file object that can be iterated lazily, line by line
    result = file.read().split("\n")  # loads the full dataset into memory and could raise a MemoryError
    file.close()  # release the file handle once the contents have been read
    return result

csv_gen = csvreader(file_name) 
row_count = 0

for row in csv_gen:
    row_count += 1

print(f"Row count is {row_count}")

Row count is 1462


### Generator Functions

Below are three lines of code that define a generator function that yields one row at a time. This function, **csv_reader()**, differs from **csvreader()** above because it is a generator function. Iterating over it counts 1461 rows, one fewer than before, most likely because **split("\n")** also counted the empty string after the file's final newline. All of this is done without loading the whole file into memory.

In [3]:
def csv_reader(file_name):
    for row in open(file_name):
        yield row

In [4]:
csv_reader(file_name)

<generator object csv_reader at 0x10971ac00>

In [5]:
for row in csv_reader(file_name):
    row_count_ += 1

print(f"Row count is {row_count_}")

Row count is 1461
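
The off-by-one between the two row counts can be reproduced in memory. This sketch assumes the CSV file ends with a newline character (not verified here) and uses `io.StringIO` with made-up contents as a stand-in for the real file:

```python
import io

text = "a,b\n1,2\n3,4\n"              # three lines, ending with a newline

split_rows = text.split("\n")          # splitting leaves a trailing empty string
iter_rows = list(io.StringIO(text))    # line-by-line iteration does not

print(len(split_rows), len(iter_rows))
print(repr(split_rows[-1]))
```

Under that assumption, the list-based reader counts one extra "row" that is really just the empty string after the last newline.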


### Generator Expressions

Another interesting way to count the number of rows is to build a generator expression. As noted in the Background, a generator expression creates a generator object in a single line of code, without building and holding the entire object in memory before iteration.

In [2]:
csv_gen = (row for row in open(file_name))
row_count = 0  # reset so the count does not depend on which cells ran earlier
for row in csv_gen:
    row_count += 1

print(f"Row count is {row_count}")

Row count is 1461


### Data Pipelines With Generators

Pipeline Structure:  

- Read every line of the file without sacrificing memory.
- Split each line into a list of values.
- Extract the column names.
- Use the column names and lists to create a dictionary.
- Filter out the rounds you aren’t interested in.
- Calculate the total and average values for the rounds you are interested in.

In [46]:
# Create a generator expression
lines = (line for line in open(file_name))

In [47]:
lines

<generator object <genexpr> at 0x10983cf48>

In [48]:
# Split each line into a list of values. This too is a generator expression
list_line = (s.rstrip().split(",") for s in lines)

In [49]:
list_line

<generator object <genexpr> at 0x10971ab10>

The next line is a clever way to extract the column names. Typically, in a CSV file the column names are on the very first line. **next()** will grab the first line, and if you run it again it will grab the second line, and so on. As you can see, **next()** returns a list of values.
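
A small sketch of how **next()** steps through a generator, using a toy generator over the string `"ab"` rather than the CSV data. It also shows what happens once the generator runs out: without a default, **next()** raises `StopIteration`, and with a default it returns that value instead:

```python
letters = (ch for ch in "ab")

first = next(letters)            # 'a'
second = next(letters)           # 'b'
fallback = next(letters, None)   # exhausted, so the default (None) is returned

try:
    next(letters)                # no default given: raises StopIteration
    exhausted = False
except StopIteration:
    exhausted = True

print(first, second, fallback, exhausted)
```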

In [50]:
cols = next(list_line)

In [51]:
cols

['permalink',
 'company',
 'numEmps',
 'category',
 'city',
 'state',
 'fundedDate',
 'raisedAmt',
 'raisedCurrency',
 'round']

Now that we have a generator expression for the rows and a variable for the column names, we can create our dictionary generator expression.

In [52]:
company_dicts = (dict(zip(cols, data)) for data in list_line)

In [53]:
company_dicts

<generator object <genexpr> at 0x10971a9a8>

In [44]:
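# Quickly examine the first 10 dictionaries. Note: the loop still runs through
# every row, so company_dicts is exhausted afterwards and must be re-created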
n = 0
for c_d in company_dicts:
    if n < 10:
        print(c_d)
    n += 1

{'permalink': 'lifelock', 'company': 'LifeLock', 'numEmps': '', 'category': 'web', 'city': 'Tempe', 'state': 'AZ', 'fundedDate': '1-May-07', 'raisedAmt': '6850000', 'raisedCurrency': 'USD', 'round': 'b'}
{'permalink': 'lifelock', 'company': 'LifeLock', 'numEmps': '', 'category': 'web', 'city': 'Tempe', 'state': 'AZ', 'fundedDate': '1-Oct-06', 'raisedAmt': '6000000', 'raisedCurrency': 'USD', 'round': 'a'}
{'permalink': 'lifelock', 'company': 'LifeLock', 'numEmps': '', 'category': 'web', 'city': 'Tempe', 'state': 'AZ', 'fundedDate': '1-Jan-08', 'raisedAmt': '25000000', 'raisedCurrency': 'USD', 'round': 'c'}
{'permalink': 'mycityfaces', 'company': 'MyCityFaces', 'numEmps': '7', 'category': 'web', 'city': 'Scottsdale', 'state': 'AZ', 'fundedDate': '1-Jan-08', 'raisedAmt': '50000', 'raisedCurrency': 'USD', 'round': 'seed'}
{'permalink': 'flypaper', 'company': 'Flypaper', 'numEmps': '', 'category': 'web', 'city': 'Phoenix', 'state': 'AZ', 'fundedDate': '1-Feb-08', 'raisedAmt': '3000000', 'ra
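
One alternative for previewing a generator is `itertools.islice`, which stops pulling items after the requested number instead of walking through every row. The sketch below uses a toy generator of numbers rather than `company_dicts`. The previewed items are still consumed, so the generator resumes after them:

```python
from itertools import islice

numbers = (n * 10 for n in range(1_000_000))

preview = list(islice(numbers, 3))   # pulls only the first three items
following = next(numbers)            # the generator resumes where islice stopped

print(preview, following)
```

If the previewed rows are needed again later, the generator has to be re-created from the source.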

In the next cell we can have some fun and apply a filter condition in order to return, for example, the total amount raised in series 'a' rounds.

In [54]:
funding = (
    int(company_dict["raisedAmt"])
    for company_dict in company_dicts
    if company_dict["round"] == 'a'
)
total_series_a = sum(funding)
print(f"Total series 'a' fundraising: ${total_series_a}")

Total series 'a' fundraising: $4376015000
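
The pipeline outline also lists an average, which can be computed in the same pass. The sketch below rebuilds the whole pipeline on a three-row in-memory sample (rows copied from the outputs shown in this notebook, with `io.StringIO` standing in for `open(file_name)`), so the totals are for the sample only, not for the full file:

```python
import io

# Stand-in for open(file_name): a header plus three rows from the notebook output.
sample = io.StringIO(
    "permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round\n"
    "lifelock,LifeLock,,web,Tempe,AZ,1-May-07,6850000,USD,b\n"
    "lifelock,LifeLock,,web,Tempe,AZ,1-Oct-06,6000000,USD,a\n"
    "flypaper,Flypaper,,web,Phoenix,AZ,1-Feb-08,3000000,USD,a\n"
)

lines = (line for line in sample)
list_line = (s.rstrip().split(",") for s in lines)
cols = next(list_line)
company_dicts = (dict(zip(cols, data)) for data in list_line)
funding = (int(d["raisedAmt"]) for d in company_dicts if d["round"] == "a")

# Accumulate the sum and the count in one pass: a generator can only be consumed once.
total = 0
count = 0
for amount in funding:
    total += amount
    count += 1
average = total / count if count else 0.0

print(f"total: ${total}, average: ${average:.0f}")
```

Calling `sum(funding)` first and then trying to count the items would not work, because the generator is already empty after the first pass.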


## Doing the Same Thing with Pandas

In [106]:
num_rows = 0
pandas_gen = (df for df in pd.read_csv(file_name, delimiter=',', header = None, chunksize=1))
for p_row in pandas_gen:
    num_rows += 1

print(f"Row count is {num_rows}")

Row count is 1


In [98]:
pd.read_csv?

In [74]:
[line for line in pd.read_csv(file_name, delimiter=',', names = ['permalink',
 'company',
 'numEmps',
 'category',
 'city',
 'state',
 'fundedDate',
 'raisedAmt',
 'raisedCurrency',
 'round'])]

['permalink',
 'company',
 'numEmps',
 'category',
 'city',
 'state',
 'fundedDate',
 'raisedAmt',
 'raisedCurrency',
 'round']

In [101]:
[df for df in pd.read_csv(file_name, delimiter=',', header = None, chunksize=1)]

[           0        1        2         3     4      5           6          7  \
 0  permalink  company  numEmps  category  city  state  fundedDate  raisedAmt   
 
                 8      9  
 0  raisedCurrency  round  ,
           0         1   2    3      4   5         6        7    8  9
 1  lifelock  LifeLock NaN  web  Tempe  AZ  1-May-07  6850000  USD  b,
           0         1   2    3      4   5         6        7    8  9
 2  lifelock  LifeLock NaN  web  Tempe  AZ  1-Oct-06  6000000  USD  a,
           0         1   2    3      4   5         6         7    8  9
 3  lifelock  LifeLock NaN  web  Tempe  AZ  1-Jan-08  25000000  USD  c,
              0            1  2    3           4   5         6      7    8  \
 4  mycityfaces  MyCityFaces  7  web  Scottsdale  AZ  1-Jan-08  50000  USD   
 
       9  
 4  seed  ,
           0         1   2    3        4   5         6        7    8  9
 5  flypaper  Flypaper NaN  web  Phoenix  AZ  1-Feb-08  3000000  USD  a,
               0            

In [108]:
[df for df in pd.read_csv(file_name, delimiter=',', header = None, iterator=True)]

[                  0              1        2         3           4      5  \
 0         permalink        company  numEmps  category        city  state   
 1          lifelock       LifeLock      NaN       web       Tempe     AZ   
 2          lifelock       LifeLock      NaN       web       Tempe     AZ   
 3          lifelock       LifeLock      NaN       web       Tempe     AZ   
 4       mycityfaces    MyCityFaces        7       web  Scottsdale     AZ   
 ...             ...            ...      ...       ...         ...    ...   
 1456        trusera        Trusera       15       web     Seattle     WA   
 1457     alerts-com     Alerts.com      NaN       web    Bellevue     WA   
 1458          myrio          Myrio       75  software     Bothell     WA   
 1459  grid-networks  Grid Networks      NaN       web     Seattle     WA   
 1460  grid-networks  Grid Networks      NaN       web     Seattle     WA   
 
                6          7               8             9  
 0     funded

In [80]:
import numpy as np  # needed here; the notebook cells were executed out of order
np.array(pd.read_csv(file_name, delimiter=',')[['permalink',
 'company']])

array([['lifelock', 'LifeLock'],
       ['lifelock', 'LifeLock'],
       ['lifelock', 'LifeLock'],
       ...,
       ['myrio', 'Myrio'],
       ['grid-networks', 'Grid Networks'],
       ['grid-networks', 'Grid Networks']], dtype=object)

In [79]:
import numpy as np
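
For comparison with the generator pipeline, here is a sketch of the series 'a' total computed with pandas in chunks. It makes several assumptions: the header row is used for column names (no `header=None`), an in-memory sample of three rows from the outputs above replaces `file_name`, and the chunk size of 2 is arbitrary:

```python
import io

import pandas as pd

# Stand-in for file_name: a header plus three rows from the notebook output.
sample = io.StringIO(
    "permalink,company,numEmps,category,city,state,fundedDate,raisedAmt,raisedCurrency,round\n"
    "lifelock,LifeLock,,web,Tempe,AZ,1-May-07,6850000,USD,b\n"
    "lifelock,LifeLock,,web,Tempe,AZ,1-Oct-06,6000000,USD,a\n"
    "flypaper,Flypaper,,web,Phoenix,AZ,1-Feb-08,3000000,USD,a\n"
)

total_series_a = 0
num_chunks = 0
# With chunksize set, read_csv returns an iterator of DataFrames instead of one DataFrame.
for chunk in pd.read_csv(sample, chunksize=2):
    series_a = chunk[chunk["round"] == "a"]          # keep only series 'a' rows
    total_series_a += int(series_a["raisedAmt"].sum())
    num_chunks += 1

print(f"chunks: {num_chunks}, total series 'a': ${total_series_a}")
```

Only one chunk is held in memory at a time, which plays the same role as the lazy line-by-line iteration in the generator version.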