# Data Structures

1. Tuples
2. Dictionary
3. Class

## Classes
Some class variants:
1. Slots - save memory
2. Dataclasses - reduce boilerplate code
3. Named tuples - immutable records

### Slots

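For classes that mainly hold data, define `__slots__` to save memory: instances store their attributes in fixed slots instead of a per-instance `__dict__`.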

In [3]:
class Stock:
    __slots__ = ('name','shares','price') # used for performance optimization
    def __init__(self,name,shares,price):
        self.name = name
        self.shares = shares
        self.price = price

s = Stock('GOOG',100,440.10)
s.name,s.price,s.shares

('GOOG', 440.1, 100)
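
A quick way to see what `__slots__` changes, as a minimal sketch. The class names `PlainStock` and `SlottedStock` are illustrative and not part of the course material. The slotted instance has no per-instance `__dict__` and rejects attributes that are not listed in `__slots__`.

```python
class PlainStock:
    def __init__(self, name, shares, price):
        self.name = name
        self.shares = shares
        self.price = price

class SlottedStock:
    __slots__ = ('name', 'shares', 'price')
    def __init__(self, name, shares, price):
        self.name = name
        self.shares = shares
        self.price = price

plain = PlainStock('GOOG', 100, 440.10)
slotted = SlottedStock('GOOG', 100, 440.10)

has_dict_plain = hasattr(plain, '__dict__')      # regular instances carry a __dict__
has_dict_slotted = hasattr(slotted, '__dict__')  # slotted instances do not

try:
    slotted.date = '01/01/2001'   # 'date' is not listed in __slots__
    rejected = False
except AttributeError:
    rejected = True

print(has_dict_plain, has_dict_slotted, rejected)   # True False True
```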

### Dataclasses
`dataclass` is a decorator in Python's standard library that automatically generates special methods such as `__init__()`, `__repr__()`, and `__eq__()` for user-defined classes.

In [4]:
from dataclasses import dataclass

@dataclass
class Person:
    name : str
    age : int

p = Person(name="HE",age=10)
p

Person(name='HE', age=10)
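
The generated `__eq__()` is easy to miss because the cell above only shows `__repr__()`. A small sketch reusing the same `Person` fields: two separately created instances compare equal because their field values match, even though they are distinct objects.

```python
from dataclasses import dataclass

@dataclass
class Person:
    name: str
    age: int

a = Person(name="HE", age=10)
b = Person(name="HE", age=10)

print(repr(a))   # generated __repr__
print(a == b)    # generated __eq__ compares field values -> True
print(a is b)    # still two distinct objects -> False
```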

### Named Tuples
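There are two ways to define a named tuple.


#### From the `typing` module
`typing.NamedTuple` supports *type hints* and produces an immutable data structure whose fields can be accessed by name. `typing` is part of the standard library, so there is nothing to install with `pip`.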

In [9]:

from typing import NamedTuple

class Person(NamedTuple):
    name : str
    age : int

p = Person(name="HE",age=10)
p

Person(name='HE', age=10)

#### From `collections`

In [10]:
from collections import namedtuple

Person = namedtuple('Person',['name','age']) # class
s = Person("HE",10)
s

Person(name='HE', age=10)
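
Neither cell above shows the immutability itself. The following sketch, using the same `Person` definition, shows that assigning to a field fails, that `_replace()` builds a new tuple instead, and that a named tuple still unpacks like an ordinary tuple. The variable names `mutated` and `older` are only for illustration.

```python
from collections import namedtuple

Person = namedtuple('Person', ['name', 'age'])
p = Person('HE', 10)

try:
    p.age = 11                # fields cannot be reassigned
    mutated = True
except AttributeError:
    mutated = False

older = p._replace(age=11)    # returns a new tuple; p is unchanged
name, age = p                 # unpacks like a regular tuple

print(mutated, p, older, name, age)
```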

### Exercise 2.1

*Objectives:*

- Figure out the most memory-efficient way to store a lot of data.
- Learn about different ways of representing records including tuples,
dictionaries, classes, and named tuples.

In this exercise, we look at different choices for representing data
structures with an eye towards memory use and efficiency.  A lot of
people use Python to perform various kinds of data analysis so knowing
about different options and their tradeoffs is useful information.


#### (a) Stuck on the bus

The file `Data/ctabus.csv` is a CSV file containing
daily ridership data for the Chicago Transit Authority (CTA) bus
system from January 1, 2001 to August 31, 2013.  It contains
approximately 577000 rows of data.  Use Python to view a few lines
of data to see what it looks like:

```python
>>> f = open('Data/ctabus.csv')
>>> next(f)
'route,date,daytype,rides\n'
>>> next(f)
'3,01/01/2001,U,7354\n'
>>> next(f)
'4,01/01/2001,U,9288\n'
>>>
```

There are 4 columns of data.

- route: Column 0.  The bus route name.
- date: Column 1.  A date string of the form MM/DD/YYYY.
- daytype: Column 2. A day type code (U=Sunday/Holiday, A=Saturday, W=Weekday)
- rides: Column 3. Total number of riders (integer)

The `rides` column records the total number of people who boarded a
bus on that route on a given day. Thus, from the example, 7354 people
rode the number 3 bus on January 1, 2001.

In [2]:
def collect_data(filename):
    record = []
    with open(filename,mode='r') as file:
        for line in file:
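            # Note: `rides` stays a string and keeps its trailing '\n' (visible in the output);
            # line.rstrip('\n') and int(rides) would clean it up.
            route,date,daytype,rides = line.split(',')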
            record.append(
                {
                    'route':route,
                    'date' : date,
                    'daytype' : daytype,
                    'rides':rides
                }
            )
        
    return record[1:]  # drop the header row

file_path = "learning-python-mastery/Data/ctabus.csv"
data = collect_data(file_path)

limit = 5

for d in data[:limit]:
    print(d)

{'route': '3', 'date': '01/01/2001', 'daytype': 'U', 'rides': '7354\n'}
{'route': '4', 'date': '01/01/2001', 'daytype': 'U', 'rides': '9288\n'}
{'route': '6', 'date': '01/01/2001', 'daytype': 'U', 'rides': '6048\n'}
{'route': '8', 'date': '01/01/2001', 'daytype': 'U', 'rides': '6309\n'}
{'route': '9', 'date': '01/01/2001', 'daytype': 'U', 'rides': '11207\n'}



#### (b) Basic memory use of text

Let's get a baseline of the memory required to work with this
datafile.  First, restart Python and try a very simple experiment of
simply grabbing the file and storing its data in a single string:

```python
>>> # --- RESTART 
>>> import tracemalloc
>>> f = open('Data/ctabus.csv')
>>> tracemalloc.start()
>>> data = f.read()
>>> len(data)
12361039
>>> current, peak = tracemalloc.get_traced_memory()
>>> current
12369664
>>> peak
24730766
>>> 
```

Your results might vary somewhat, but you should see current
memory use in the range of 12MB with a peak of 24MB.

What happens if you read the entire file into a list of strings
instead?  Restart Python and try this:

```python
>>> # --- RESTART
>>> import tracemalloc
>>> f = open('Data/ctabus.csv')
>>> tracemalloc.start()
>>> lines = f.readlines()
>>> len(lines)
577564
>>> current, peak = tracemalloc.get_traced_memory()
>>> current
45828030
>>> peak
45867371
>>> 
```

You should see the memory use go up significantly into the range of 40-50MB.
Point to ponder: what might be the source of that extra overhead?

In [3]:
from tracemalloc import start,get_traced_memory

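def single_string_mem(file):
    with open(file) as f:
        start()
        data = f.read()              # one large string
        return get_traced_memory()   # (current, peak) in bytes

def list_mem(file):
    with open(file) as f:
        start()                      # tracing is already on here, so the peak is not reset
        lines = f.readlines()        # one string object per line
        return get_traced_memory()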

file_path = "learning-python-mastery/Data/ctabus.csv"
print("Single String : ",single_string_mem(file_path))
print("List :",list_mem(file_path))

Single String :  (12361616, 38816419)
List : (40740991, 40749556)


**Point to ponder**

*What might be the source of that extra overhead?*

**Solution**

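1. `read()` stores all the data in a single string object: one object header followed by the characters in one contiguous block.
2. `readlines()` produces a list of about 577,000 separate string objects. Each string carries its own object header (roughly 40-50 bytes in CPython, depending on the version), and the list holds an 8-byte pointer to each one on a 64-bit build. That per-line overhead accounts for most of the extra ~30MB.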
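
The per-object overhead can be checked on a small scale with `sys.getsizeof()`. This is a rough sketch using the first two data rows shown earlier; the exact byte counts depend on the CPython version, so none are listed here.

```python
import sys

# The first two data rows from the file, as readlines() would return them.
lines = ['3,01/01/2001,U,7354\n', '4,01/01/2001,U,9288\n']
text = ''.join(lines)          # what read() would return for the same content

single = sys.getsizeof(text)                      # one header + all characters
separate = sum(sys.getsizeof(s) for s in lines)   # one header per line
container = sys.getsizeof(lines)                  # the list holding the pointers

print(single, separate, container)
```

The same characters cost more when split across objects, and the list adds its own storage on top.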

#### (c) A List of Tuples

In practice, you might read the data into a list and convert each line
into some other data structure.  Here is a program `readrides.py` that
reads the entire file into a list of tuples using the `csv` module:

```python
# readrides.py

import csv

def read_rides_as_tuples(filename):
    '''
    Read the bus ride data as a list of tuples
    '''
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headings = next(rows)     # Skip headers
        for row in rows:
            route = row[0]
            date = row[1]
            daytype = row[2]
            rides = int(row[3])
            record = (route, date, daytype, rides)
            records.append(record)
    return records

if __name__ == '__main__':
    import tracemalloc
    tracemalloc.start()
    rows = read_rides_as_tuples('Data/ctabus.csv')
    print('Memory Use: Current %d, Peak %d' % tracemalloc.get_traced_memory())
```

Run this program using `python3 -i readrides.py` and look at the
resulting contents of `rows`. You should get a list of tuples like
this:

```python
>>> len(rows)
577563
>>> rows[0]
('3', '01/01/2001', 'U', 7354)
>>> rows[1]
('4', '01/01/2001', 'U', 9288)
```

Look at the resulting memory use. It should be substantially higher
than in part (b).



In [6]:
import tracemalloc
file_path = "learning-python-mastery/Data/ctabus.csv"
tracemalloc.start()
data = collect_data(file_path)
curr,peak = tracemalloc.get_traced_memory()
print(f"Current : {curr} Peak :{peak}")

Current : 191229264 Peak :386406418


#### (d) Memory Use of Other Data Structures

Python has many different choices for representing data structures.
For example:

```python
# A tuple
row = (route, date, daytype, rides)

# A dictionary
row = {
    'route': route,
    'date': date,
    'daytype': daytype,
    'rides': rides,
}

# A class
class Row:
    def __init__(self, route, date, daytype, rides):
        self.route = route
        self.date = date
        self.daytype = daytype
        self.rides = rides

# A named tuple
from collections import namedtuple
Row = namedtuple('Row', ['route', 'date', 'daytype', 'rides'])

# A class with __slots__
class Row:
    __slots__ = ['route', 'date', 'daytype', 'rides']
    def __init__(self, route, date, daytype, rides):
        self.route = route
        self.date = date
        self.daytype = daytype
        self.rides = rides
```
Your task is as follows:  Create different versions of the `read_rides()` function
that use each of these data structures to represent a single row of data.
Then, find out the resulting memory use of each option.   Find out which
approach offers the most efficient storage if you were working with a lot 
of data all at once.

In [9]:
import csv
import tracemalloc
from collections import namedtuple

def read_rides_tuple(filename):
    with open(filename) as f:
        rows = []
        for route, date, daytype, rides in csv.reader(f):
            rows.append((route, date, daytype, rides))
    return rows

def read_rides_dict(filename):
    with open(filename) as f:
        return [{'route': route, 'date': date, 'daytype': daytype, 'rides': rides}
                for route, date, daytype, rides in csv.reader(f)]

class Row:
    def __init__(self, route, date, daytype, rides):
        self.route = route
        self.date = date
        self.daytype = daytype
        self.rides = rides

def read_rides_class(filename):
    with open(filename) as f:
        return [Row(route, date, daytype, rides)
                for route, date, daytype, rides in csv.reader(f)]

NamedTupleRow = namedtuple('NamedTupleRow', ['route', 'date', 'daytype', 'rides'])

def read_rides_namedtuple(filename):
    with open(filename) as f:
        return [NamedTupleRow(route, date, daytype, rides)
                for route, date, daytype, rides in csv.reader(f)]

class SlottedRow:
    __slots__ = ['route', 'date', 'daytype', 'rides']
    def __init__(self, route, date, daytype, rides):
        self.route = route
        self.date = date
        self.daytype = daytype
        self.rides = rides

def read_rides_slotted(filename):
    with open(filename) as f:
        return [SlottedRow(route, date, daytype, rides)
                for route, date, daytype, rides in csv.reader(f)]

def measure_memory(func, filename):
    tracemalloc.start()
    result = func(filename)
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return len(result), current, peak

filename = 'learning-python-mastery/Data/ctabus.csv'
functions = [
    ('Tuple', read_rides_tuple),
    ('Dictionary', read_rides_dict),
    ('Class', read_rides_class),
    ('Named Tuple', read_rides_namedtuple),
    ('Slotted Class', read_rides_slotted)
]

print("Data Structure | Row Count | Current Memory (MB) | Peak Memory (MB)")
print("---------------|-----------|----------------------|------------------")

for name, func in functions:
    count, current, peak = measure_memory(func, filename)
    print(f"{name:<14} | {count:9d} | {current/1024/1024:20.2f} | {peak/1024/1024:16.2f}")

Data Structure | Row Count | Current Memory (MB) | Peak Memory (MB)
---------------|-----------|----------------------|------------------
Tuple          |    577564 |               120.02 |           120.05
Dictionary     |    577564 |               181.14 |           181.17
Class          |    577564 |               132.68 |           132.72
Named Tuple    |    577564 |               123.87 |           123.90
Slotted Class  |    577564 |               115.06 |           115.09


Ranking of data structures by memory use, from highest to lowest:
1. Dictionary
2. Class
3. Named Tuple
4. Tuple
5. Slotted Class
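
To see where the differences come from, the size of a single record in each representation can be inspected with `sys.getsizeof()`. This is only a sketch: `getsizeof()` is shallow (it ignores the strings and integers the record refers to), the numbers vary across CPython versions, and the class definitions are compact restatements of the ones above.

```python
import sys
from collections import namedtuple

route, date, daytype, rides = '3', '01/01/2001', 'U', 7354

class Row:
    def __init__(self, route, date, daytype, rides):
        self.route, self.date, self.daytype, self.rides = route, date, daytype, rides

class SlottedRow:
    __slots__ = ('route', 'date', 'daytype', 'rides')
    def __init__(self, route, date, daytype, rides):
        self.route, self.date, self.daytype, self.rides = route, date, daytype, rides

NamedTupleRow = namedtuple('NamedTupleRow', ['route', 'date', 'daytype', 'rides'])

plain = Row(route, date, daytype, rides)
slotted = SlottedRow(route, date, daytype, rides)
sizes = {
    'tuple': sys.getsizeof((route, date, daytype, rides)),
    'dict': sys.getsizeof({'route': route, 'date': date, 'daytype': daytype, 'rides': rides}),
    # A regular instance also owns an attribute dictionary.
    'class': sys.getsizeof(plain) + sys.getsizeof(plain.__dict__),
    'namedtuple': sys.getsizeof(NamedTupleRow(route, date, daytype, rides)),
    'slots': sys.getsizeof(slotted),
}
for kind, size in sizes.items():
    print(f'{kind:<10} {size:4d} bytes')
```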

# Containers and Collections

## Comprehensions

1. List Comprehension

`[expression for item in sequence if condition]`

2. Set Comprehension

`{expression for item in sequence if condition}`

3. Dict Comprehension

`{key:value for item in sequence if condition}`
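
A concrete instance of each form, as a small sketch. The names and prices are taken from the portfolio data used later in these notes; the variable names are illustrative.

```python
names = ['AA', 'IBM', 'CAT', 'MSFT', 'GE', 'MSFT', 'IBM']
prices = {'AA': 32.20, 'IBM': 91.10, 'CAT': 83.44, 'MSFT': 51.23, 'GE': 40.37}

long_names = [n for n in names if len(n) > 2]             # list: keeps order and duplicates
unique = {n for n in names if len(n) > 2}                 # set: duplicates collapse
expensive = {n: p for n, p in prices.items() if p > 50}   # dict: key: value pairs

print(long_names)   # ['IBM', 'CAT', 'MSFT', 'MSFT', 'IBM']
print(unique)       # same three names, in arbitrary order
print(expensive)    # {'IBM': 91.1, 'CAT': 83.44, 'MSFT': 51.23}
```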

## `collections` module
 

### Default Dict
A `defaultdict` supplies a default value for a missing key (by calling the factory function passed to its constructor) instead of raising `KeyError`.

In [1]:
from collections import defaultdict

d = defaultdict(list)
d['x'] = [1,2,3]
d['y'] = [4,5,6]
d

defaultdict(list, {'x': [1, 2, 3], 'y': [4, 5, 6]})
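
The cell above only assigns to keys, which a plain `dict` can do as well. The difference shows up when a missing key is read, as in this short sketch (variable names are illustrative):

```python
from collections import defaultdict

plain = {}
try:
    plain['x'].append(1)     # missing key in a regular dict
    plain_failed = False
except KeyError:
    plain_failed = True

d = defaultdict(list)
d['x'].append(1)             # missing key: list() is called, stored, then appended to
d['x'].append(2)
d['y']                       # merely reading a missing key also creates it

print(plain_failed, dict(d))   # True {'x': [1, 2], 'y': []}
```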

### Counter
* It is a subclass of `dict` designed to count hashable objects
* It counts the number of times each element appears in an iterable and stores the counts as dictionary values

In [8]:
from collections import Counter

counter = Counter()
counter['A'] += 20
counter['B'] += 40
counter,counter.most_common(2) # ranking

(Counter({'B': 40, 'A': 20}), [('B', 40), ('A', 20)])
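
A `Counter` can also be built directly from an iterable, which is the counting behaviour described in the bullets above. A minimal sketch with an arbitrary example string:

```python
from collections import Counter

letters = Counter('mississippi')   # counts each character of the string
top_two = letters.most_common(2)   # ties keep first-seen order
missing = letters['z']             # unseen items count as 0, no KeyError

print(letters['s'], top_two, missing)   # 4 [('i', 4), ('s', 4)] 0
```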

### Deque
* A double-ended queue that lets you add and remove elements at both ends efficiently.
* It provides fast appends and pops from both the left and right sides
* Useful for *keeping a history of last **N** things*

In [10]:
from collections import deque
q = deque()
q.append(1)
q.append(2)
q.appendleft(3)
q.appendleft(4)
print(q)
print(q.pop())
print(q.popleft())
print(q)

deque([4, 3, 1, 2])
2
4
deque([3, 1])
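
The "history of the last N things" use relies on the `maxlen` argument, which the cell above does not use. A short sketch, with `history` as an illustrative name:

```python
from collections import deque

history = deque(maxlen=3)    # keep only the last 3 items
for n in range(1, 6):
    history.append(n)        # once full, the oldest item falls off the left end

print(history)   # deque([3, 4, 5], maxlen=3)
```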


## Exercise 2.2

*Objectives:*

- Work with various containers
- List/Set/Dict Comprehensions
- Collections module
- Data analysis challenge

Most Python programmers are generally familiar with lists, dictionaries,
tuples, and other basic datatypes. In this exercise, we'll put that
knowledge to work to solve various data analysis problems.

### (a) Preliminaries

To get started, let's review some basics with a slightly simpler dataset--
a portfolio of stock holdings. Create a file `readport.py` and put this
code in it:

```python
# readport.py

import csv

# A function that reads a file into a list of dicts
def read_portfolio(filename):
    portfolio = []
    with open(filename) as f:
        rows = csv.reader(f)
        headers = next(rows)
        for row in rows:
            record = {
                'name' : row[0],
                'shares' : int(row[1]),
                'price' : float(row[2])
            }
            portfolio.append(record)
    return portfolio
```

This file reads some simple stock market data in the file `Data/portfolio.csv`.  Use
the function to read the file and look at the results:

```python
>>> portfolio = read_portfolio('Data/portfolio.csv')
>>> from pprint import pprint
>>> pprint(portfolio)
[{'name': 'AA', 'price': 32.2, 'shares': 100},
 {'name': 'IBM', 'price': 91.1, 'shares': 50},
 {'name': 'CAT', 'price': 83.44, 'shares': 150},
 {'name': 'MSFT', 'price': 51.23, 'shares': 200},
 {'name': 'GE', 'price': 40.37, 'shares': 95},
 {'name': 'MSFT', 'price': 65.1, 'shares': 50},
 {'name': 'IBM', 'price': 70.44, 'shares': 100}]
>>>
```

In this data, each row consists of a stock name, a number of held
shares, and a purchase price.   There are multiple entries for
certain stock names such as MSFT and IBM.

### (b) Comprehensions

List, set, and dictionary comprehensions can be a useful tool for manipulating
data.  For example, try these operations:

```python
>>> # Find all holdings more than 100 shares
>>> [s for s in portfolio if s['shares'] > 100]
[{'name': 'CAT', 'shares': 150, 'price': 83.44}, 
 {'name': 'MSFT', 'shares': 200, 'price': 51.23}]

>>> # Compute total cost (shares * price)
>>> sum([s['shares']*s['price'] for s in portfolio])
44671.15
>>>

>>> # Find all unique stock names (set)
>>> { s['name'] for s in portfolio }
{'MSFT', 'IBM', 'AA', 'GE', 'CAT'}
>>>

>>> # Count the total shares of each stock
>>> totals = { s['name']: 0 for s in portfolio }
>>> for s in portfolio:
        totals[s['name']] += s['shares']

>>> totals
{'AA': 100, 'IBM': 150, 'CAT': 150, 'MSFT': 250, 'GE': 95}
>>> 
```

### (c) Collections

The `collections` module has a variety of classes for more specialized data
manipulation.  For example, the last example could be solved with a `Counter` like this:

```python
>>> from collections import Counter
>>> totals = Counter()
>>> for s in portfolio:
        totals[s['name']] += s['shares']

>>> totals
Counter({'MSFT': 250, 'IBM': 150, 'CAT': 150, 'AA': 100, 'GE': 95})
>>>
```

Counters are interesting in that they support other kinds of operations such as ranking
and mathematics.  For example:

```python
>>> # Get the two most common holdings
>>> totals.most_common(2)
[('MSFT', 250), ('IBM', 150)]
>>>

>>> # Adding counters together
>>> more = Counter()
>>> more['IBM'] = 75
>>> more['AA'] = 200
>>> more['ACME'] = 30
>>> more
Counter({'AA': 200, 'IBM': 75, 'ACME': 30})
>>> totals
Counter({'MSFT': 250, 'IBM': 150, 'CAT': 150, 'AA': 100, 'GE': 95})
>>> totals + more
Counter({'AA': 300, 'MSFT': 250, 'IBM': 225, 'CAT': 150, 'GE': 95, 'ACME': 30})
>>> 
```

The `defaultdict` object can be used to group data.  For example, suppose
you want to make it easy to find all matching entries for a given name such as
IBM.  Try this:

```python
>>> from collections import defaultdict
>>> byname = defaultdict(list)
>>> for s in portfolio:
        byname[s['name']].append(s)

>>> byname['IBM']
[{'name': 'IBM', 'shares': 50, 'price': 91.1}, {'name': 'IBM', 'shares': 100, 'price': 70.44}]
>>> byname['AA']
[{'name': 'AA', 'shares': 100, 'price': 32.2}]
>>>
```

The key feature that makes this work is that a defaultdict
automatically initializes elements for you--allowing an insertion of a
new element and an `append()` operation to be combined together.

### (d) Data Analysis Challenge

In the last exercise you just wrote some code to read CSV-data related
to the Chicago Transit Authority.  For example, you can grab the data
as dictionaries like this:

```python
>>> import readrides
>>> rows = readrides.read_rides_as_dicts('Data/ctabus.csv')
>>>
```

It would be a shame to do all of that work and then do nothing with
the data.

In this exercise, your task is this: write a program to answer the
following four questions:

1. How many bus routes exist in Chicago?

2. How many people rode the number 22 bus on February 2, 2011?  What about any route on any date of your choosing?

3. What is the total number of rides taken on each bus route?

4. What five bus routes had the greatest ten-year increase in ridership from 2001 to 2011?

You are free to use any technique whatsoever to answer the above
questions as long as it's part of the Python standard library (i.e.,
built-in datatypes, standard library modules, etc.). 


In [36]:
import csv

def read_rides_as_dicts(filename):
    '''
    Read the bus ride data as a list of dicts
    '''
    records = []
    with open(filename) as f:
        rows = csv.reader(f)
        headings = next(rows)     # Skip headers
        for row in rows:
            route = row[0]
            date = row[1]
            daytype = row[2]
            rides = int(row[3])
            record = {
                'route': route, 
                'date': date, 
                'daytype': daytype, 
                'rides' : rides
                }
            records.append(record)
    return records
    
rows = read_rides_as_dicts("E:/jai/docs/code/python/tutorial/learning-python-mastery/Data/ctabus.csv")
rows[:5]

[{'route': '3', 'date': '01/01/2001', 'daytype': 'U', 'rides': 7354},
 {'route': '4', 'date': '01/01/2001', 'daytype': 'U', 'rides': 9288},
 {'route': '6', 'date': '01/01/2001', 'daytype': 'U', 'rides': 6048},
 {'route': '8', 'date': '01/01/2001', 'daytype': 'U', 'rides': 6309},
 {'route': '9', 'date': '01/01/2001', 'daytype': 'U', 'rides': 11207}]

In [30]:
# How many routes are in Chicago?
routes = set()
for row in rows:
    routes.add(row['route'])
print(len(routes), 'routes')

182 routes


In [32]:
# How many people rode the number 22 bus on February 2, 2011?  
# What about any route on any date of your choosing?
for i in rows:
    if i['route'] == '22' and i['date'] == '02/02/2011':
        print("Total rides :",i['rides'])
        break 

Total rides : 5055


In [37]:
# What is the total number of rides taken on each bus route?
from collections import Counter

rides_per_route = Counter()

for row in rows:
    rides_per_route[row['route']] += row['rides']

for route, count in rides_per_route.most_common():
    print('%5s %10d' % (route, count))

   79  133796763
    9  117923787
   49   95915008
    4   95309438
   66   93053461
  151   89524268
   22   89380790
    3   89211071
   77   88484043
   53   86884085
   20   86679321
   63   86609336
    8   84362672
   82   74427970
   36   68780696
   72   66184985
   87   65522213
   29   65048709
   81   59612673
   67   57880229
   62   53418231
   60   52860939
   55   52724358
    6   52654473
  147   52379267
   56   52280435
   74   52111931
   80   51543475
   12   51168351
   52   49851059
   85   49776812
   54   47976060
   76   47452382
   47   44111483
   70   42640960
  146   42089695
   71   41754367
   14   41094492
  152   40352452
   94   39817262
   78   35433146
   21   35216129
   50   34166525
  126   33268789
   28   33033639
   75   32597029
   91   32296856
  53A   31898628
   92   29983360
  155   29450367
  156   28882051
   65   28161390
   15   26149468
   34   25709239
  145   25428790
  X49   24110395
  119   24028072
  111   23663300
   35   222916

In [41]:
# What five bus routes had the greatest ten-year increase
# in ridership from 2001 to 2011?
from collections import defaultdict

rides_by_year = defaultdict(Counter)
for row in rows:
    year = row['date'].split('/')[2]
    rides_by_year[year][row['route']] += row['rides']

diffs = rides_by_year['2011'] - rides_by_year['2001']
for route, diff in diffs.most_common(5):
    print(route, diff)

15 2732209
147 2107910
66 1612958
12 1612067
14 1351308
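
One detail of the solution above is worth spelling out: subtracting two `Counter` objects with `-` keeps only positive results, so routes whose ridership fell simply disappear from `diffs`. The sketch below uses made-up route names and totals (hypothetical, not from the CTA data) to contrast `-` with `subtract()`, which keeps negative counts.

```python
from collections import Counter

rides_2001 = Counter({'A': 100, 'B': 300})   # hypothetical route totals
rides_2011 = Counter({'A': 250, 'B': 200})

increase = rides_2011 - rides_2001           # drops zero and negative results
signed = Counter(rides_2011)
signed.subtract(rides_2001)                  # in place; keeps negative counts

print(increase)   # Counter({'A': 150})
print(signed)     # Counter({'A': 150, 'B': -100})
```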


# Iteration and Iterables

1. For loop
2. Iterating on tuples
```python
for name,share,price in portfolio:
    ...
```
3. Looping over records of varying length:
```python
price = [
    ['GOOG',10,20,30],
    ['IBM',10,20],
    ['CAT',10,20,30,40]
]
for name,*values in price:
    print(name,values)
```
4. `zip()` function
```python
# making dict
rec = dict(zip(val1,val2))
```
5. Keeping a running count - `enumerate()`
```python
for n,name in enumerate(names):
    ...
```
6. Iterating on integers
```python
for i in range(start,end,step):
    ...
```
7. Sequence reduction
```python
sum(s),min(s),max(s),any(s),all(s)
```
8. Unpacking iterables - unlike `+`, this works across different iterable types (`a + b` below would raise `TypeError`)
```python
a = (1,2,3)
b = [4,5]
c = [*a,*b] # c = [1,2,3,4,5]
d = (*a,*b) # d = (1,2,3,4,5)
```
9. Unpacking dictionaries
```python
a = {'name':'GOOG','shares':100,'price':490.1}
b = {'date':'6/10/2001','time':'9:45am'}
c = {**a,**b} # combining into single dict
```
10. Argument passing
```python
a = (1,2,3)
b = (4,5)
c = {'x':1,'y':2}
func(*a,*b) # func(1,2,3,4,5)
func(**c) # func(x=1,y=2)
func(0,*a,*b,6,variable=37,**c) # order: positional arguments first, then keyword arguments
```
11. Generator expression - an alternative to a list comprehension; it can be consumed only once
```python
nums = [1,2,3,4]
sqs = (x*x for x in nums)
for n in sqs:
    print(n,end='\t')
```
* It acts as a filter/transform on an iterable
12. Generator functions
```python
def squares(nums):
    for x in nums:
        yield x*x

for n in squares([1,2,3,4]):
    ...
```

To know more about generators, see this [video](https://youtu.be/bD05uGo_sVI?si=p_lMsmWB2p0vu42o)
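
The argument-passing and star-unpacking items in the list above use an undefined `func`. The following sketch supplies a hypothetical `func` that simply returns what it received, so the ordering rules can be observed directly.

```python
def func(*args, **kwargs):
    """Hypothetical helper: return what was received so the call order is visible."""
    return args, kwargs

a = (1, 2, 3)
b = (4, 5)
c = {'x': 1, 'y': 2}

# Positional arguments (including unpacked iterables) come first,
# keyword arguments (including unpacked dicts) come last.
args, kwargs = func(0, *a, *b, 6, variable=37, **c)
print(args)     # (0, 1, 2, 3, 4, 5, 6)
print(kwargs)   # {'variable': 37, 'x': 1, 'y': 2}

# Star unpacking in a loop collects "the rest" of each record into a list.
price = [['GOOG', 10, 20, 30], ['IBM', 10, 20]]
rest = {name: values for name, *values in price}
print(rest)     # {'GOOG': [10, 20, 30], 'IBM': [10, 20]}
```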


## Exercise 2.3

*Objectives:*

- Iterate like a pro

*Files Modified:* None.

Iteration is an essential Python skill.  In this exercise, we look at
a number of common iteration idioms.

Start the exercise by grabbing some rows of data from a CSV file.

```python
>>> import csv
>>> f = open('Data/portfolio.csv')
>>> f_csv = csv.reader(f)
>>> headers = next(f_csv)
>>> headers
['name', 'shares', 'price']
>>> rows = list(f_csv)
>>> from pprint import pprint
>>> pprint(rows)
[['AA', '100', '32.20'],
 ['IBM', '50', '91.10'],
 ['CAT', '150', '83.44'],
 ['MSFT', '200', '51.23'],
 ['GE', '95', '40.37'],
 ['MSFT', '50', '65.10'],
 ['IBM', '100', '70.44']]
>>>
```

### (a) Basic Iteration and Unpacking

The `for` statement iterates over any sequence of data. For example:

```python
>>> for row in rows:
        print(row)

['AA', '100', '32.20']
['IBM', '50', '91.10']
['CAT', '150', '83.44']
['MSFT', '200', '51.23']
['GE', '95', '40.37']
['MSFT', '50', '65.10']
['IBM', '100', '70.44']
>>>
```

Unpack the values into separate variables if you need to:

```python
>>> for name, shares, price in rows:
        print(name, shares, price)

AA 100 32.20
IBM 50 91.10
CAT 150 83.44
MSFT 200 51.23
GE 95 40.37
MSFT 50 65.10
IBM 100 70.44
>>>
```

It's somewhat common to use `_` or `__` as a throw-away variable if you don't care
about one or more of the values.  For example:

```python
>>> for name, _, price in rows:
        print(name, price)

AA 32.20
IBM 91.10
CAT 83.44
MSFT 51.23
GE 40.37
MSFT 65.10
IBM 70.44
>>>
```

If you don't know how many values are being unpacked, you can use `*` as a wildcard.
Try this experiment in grouping the data by name:

```python
>>> from collections import defaultdict
>>> byname = defaultdict(list)
>>> for name, *data in rows:
        byname[name].append(data)

>>> byname['IBM']
[['50', '91.10'], ['100', '70.44']]
>>> byname['CAT']
[['150', '83.44']]
>>> for shares, price in byname['IBM']:
        print(shares, price)

50 91.10
100 70.44
>>>
```


### (b) Counting with enumerate()

`enumerate()` is a useful function if you ever need to keep a counter
or index while iterating. For example, suppose you wanted an extra row
number:

```python
>>> for rowno, row in enumerate(rows):
        print(rowno, row)

0 ['AA', '100', '32.20']
1 ['IBM', '50', '91.10']
2 ['CAT', '150', '83.44']
3 ['MSFT', '200', '51.23']
4 ['GE', '95', '40.37']
5 ['MSFT', '50', '65.10']
6 ['IBM', '100', '70.44']
>>>
```

You can combine this with unpacking if you're careful about how you structure it:

```python
>>> for rowno, (name, shares, price) in enumerate(rows):
        print(rowno, name, shares, price)

0 AA 100 32.20
1 IBM 50 91.10
2 CAT 150 83.44
3 MSFT 200 51.23
4 GE 95 40.37
5 MSFT 50 65.10
6 IBM 100 70.44
>>> 
```


### (c) Using the zip() function

The `zip()` function is most commonly used to pair data.  For example,
recall that you created a `headers` variable:

```python
>>> headers
['name', 'shares', 'price']
>>>
```

This might be useful to combine with the other row data:

```python
>>> row = rows[0]
>>> row
['AA', '100', '32.20']
>>> for col, val in zip(headers, row):
        print(col, val)

name AA
shares 100
price 32.20
>>>
```

Or maybe you can use it to make a dictionary:

```python
>>> dict(zip(headers, row))
{'name': 'AA', 'shares': '100', 'price': '32.20'}
>>>
```

Or maybe a sequence of dictionaries:

```python
>>> for row in rows:
        record = dict(zip(headers, row))
        print(record)

{'name': 'AA', 'shares': '100', 'price': '32.20'}
{'name': 'IBM', 'shares': '50', 'price': '91.10'}
{'name': 'CAT', 'shares': '150', 'price': '83.44'}
{'name': 'MSFT', 'shares': '200', 'price': '51.23'}
{'name': 'GE', 'shares': '95', 'price': '40.37'}
{'name': 'MSFT', 'shares': '50', 'price': '65.10'}
{'name': 'IBM', 'shares': '100', 'price': '70.44'}
>>>
```


### (d) Generator Expressions

A generator expression is almost exactly the same as a list
comprehension except that it does not create a list.  Instead, it
creates an object that produces the results incrementally--typically
for consumption by iteration. Try a simple example:

```python
>>> nums = [1,2,3,4,5]
>>> squares = (x*x for x in nums)
>>> squares
<generator object <genexpr> at 0x37caa8>
>>> for n in squares:
        print(n)

1
4
9
16
25
>>>
```

You will notice that a generator expression can only be used once.
Watch what happens if you do the for-loop again:

```python
>>> for n in squares:
        print(n)

>>>
```

You can manually get the results one-at-a-time if you use the
`next()` function. Try this:

```python
>>> squares = (x*x for x in nums)
>>> next(squares)
1
>>> next(squares)
4
>>> next(squares)
9
>>>
```

Keep typing `next()` to see what happens when there is no
more data.
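
For reference, a sketch of what that experiment leads to: an exhausted generator raises `StopIteration`, and `next()` accepts an optional default that is returned instead. The shorter `nums` list here is just to keep the example brief.

```python
nums = [1, 2]
squares = (x*x for x in nums)

first = next(squares)            # 1
second = next(squares)           # 4
try:
    next(squares)                # nothing left to produce
    exhausted = False
except StopIteration:
    exhausted = True

fallback = next(squares, None)   # a default suppresses the exception

print(first, second, exhausted, fallback)   # 1 4 True None
```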

If the task you are performing is more complicated, you can
still take advantage of generators by writing a generator function 
and using the `yield` statement instead.
For example:

```python
>>> def squares(nums):
        for x in nums:
            yield x*x

>>> for n in squares(nums):
        print(n)

1
4
9
16
25
>>>
```

We'll return to generator functions a little later in the course--for now,
just view such functions as having the interesting property of feeding
values to the `for`-statement.


### (e) Generator Expressions and Reduction Functions

Generator expressions are especially useful for feeding data into
functions such as `sum()`, `min()`, `max()`,
`any()`, etc.   Try some examples using the portfolio data from
earlier.  Carefully observe that these examples are missing some
extra square brackets ([]) that appeared when using list comprehensions.

```python
>>> from readport import read_portfolio
>>> portfolio = read_portfolio('Data/portfolio.csv')
>>> sum(s['shares']*s['price'] for s in portfolio)
44671.15
>>> min(s['shares'] for s in portfolio)
50
>>> any(s['name'] == 'IBM' for s in portfolio)
True
>>> all(s['name'] == 'IBM' for s in portfolio)
False
>>> sum(s['shares'] for s in portfolio if s['name'] == 'IBM')
150
>>>
```

Here is a subtle use of a generator expression in making comma
separated values:

```python
>>> s = ('GOOG',100,490.10)
>>> ','.join(s)
... observe that it fails ...
>>> ','.join(str(x) for x in s)    # This works
'GOOG,100,490.1'
>>>
```

The syntax in the above examples takes some getting used to, but the
critical point is that none of the operations ever create a fully
populated list of results.  This gives you a big memory savings.  However,
you do need to make sure you don't go overboard with the syntax.
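
The memory claim can be checked roughly with `sys.getsizeof()`. In this sketch (the value of `n` and the variable names are arbitrary), the list's size grows with the number of results while the generator object stays small; note that `getsizeof()` only measures the container, not the integers it refers to.

```python
import sys

n = 100_000
as_list = [x * x for x in range(n)]   # all results exist at once
as_gen = (x * x for x in range(n))    # results are produced on demand

list_size = sys.getsizeof(as_list)    # grows with n (pointer array only)
gen_size = sys.getsizeof(as_gen)      # small and independent of n

total_from_list = sum(as_list)
total_from_gen = sum(as_gen)          # consumes the generator

print(list_size, gen_size, total_from_list == total_from_gen)
```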


### (f) Saving a lot of memory

In [Exercise 2.1](ex2_1.md) you wrote a function
`read_rides_as_dicts()` that read the CTA bus data into a list of
dictionaries.  Using it requires a lot of memory. For example,
let's find the day on which the route 22 bus had the greatest
ridership:

```python
>>> import tracemalloc
>>> tracemalloc.start()
>>> import readrides
>>> rows = readrides.read_rides_as_dicts('Data/ctabus.csv')
>>> rt22 = [row for row in rows if row['route'] == '22']
>>> max(rt22, key=lambda row: row['rides'])
{'date': '06/11/2008', 'route': '22', 'daytype': 'W', 'rides': 26896}
>>> tracemalloc.get_traced_memory()
... look at result. Should be around 220MB
>>>
```

Now, let's try an example involving generators. Restart Python
and try this:

```python
>>> # RESTART
>>> import tracemalloc
>>> tracemalloc.start()
>>> import csv
>>> f = open('Data/ctabus.csv')
>>> f_csv = csv.reader(f)
>>> headers = next(f_csv)
>>> rows = (dict(zip(headers,row)) for row in f_csv)
>>> rt22 = (row for row in rows if row['route'] == '22')
>>> max(rt22, key=lambda row: int(row['rides']))
{'route': '22', 'date': '06/11/2008', 'daytype': 'W', 'rides': '26896'}
>>> tracemalloc.get_traced_memory()
... look at result. Should be a LOT smaller than before
>>>
```

Keep in mind that you just processed the entire dataset as if it was
stored as a sequence of dictionaries.  Yet, nowhere did you actually
create and store a list of dictionaries.   Not all problems can be
structured in this way, but if you can work with data in an
iterative manner, generator expressions can save a huge amount of memory.
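
The same pipeline can be tried without the data file. The sketch below feeds a tiny in-memory CSV through the identical generator stages; the sample rows and ride counts are hypothetical and only mimic the layout of `ctabus.csv`.

```python
import csv
import io

# Hypothetical sample in the same column layout as ctabus.csv.
sample = io.StringIO(
    'route,date,daytype,rides\n'
    '22,01/01/2001,U,100\n'
    '3,01/01/2001,U,500\n'
    '22,01/02/2001,W,300\n'
)

f_csv = csv.reader(sample)
headers = next(f_csv)
rows = (dict(zip(headers, row)) for row in f_csv)        # builds one dict at a time
rt22 = (row for row in rows if row['route'] == '22')     # filter stage, still lazy
busiest = max(rt22, key=lambda row: int(row['rides']))   # pulls rows through the pipeline

print(busiest)
# {'route': '22', 'date': '01/02/2001', 'daytype': 'W', 'rides': '300'}
```

Nothing is read from the source until `max()` starts asking for rows, and only one row dictionary is alive per step besides the current maximum.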


In [45]:
import tracemalloc

def convert_to_mb(nbytes):
    # tracemalloc reports sizes in bytes (not bits)
    return nbytes / 1_000_000

print("By Dict")
tracemalloc.start()
rows = read_rides_as_dicts("E:/jai/docs/code/python/tutorial/learning-python-mastery/Data/ctabus.csv")
rt22 = [row for row in rows if row['route'] == '22'] # list comprehension
print(max(rt22, key=lambda row: row['rides']))
curr,maxi = tracemalloc.get_traced_memory()
print("Current : ",convert_to_mb(curr),"MBs","Maxi : ",convert_to_mb(maxi),"MBs")

print("By Generators")

# Tracing is still active, so this start() does nothing and the peak ("Maxi")
# reported below still includes the dict run above.
tracemalloc.start()
import csv
with open("E:/jai/docs/code/python/tutorial/learning-python-mastery/Data/ctabus.csv") as f:
    f_csv = csv.reader(f)
    header = next(f_csv)
    rows = (dict(zip(header,row)) for row in f_csv) # generator
    rt22 = (row for row in rows if row['route'] == '22') # generator
    print(max(rt22,key=lambda row: int(row['rides'])))
    curr,maxi = tracemalloc.get_traced_memory()
    print("Current : ",convert_to_mb(curr),"MBs","Maxi : ",convert_to_mb(maxi),"MBs")

By Dict
{'route': '22', 'date': '06/11/2008', 'daytype': 'W', 'rides': 26896}
Current :  180.160624 MBs Maxi :  180.182761 MBs
By Generators
{'route': '22', 'date': '06/11/2008', 'daytype': 'W', 'rides': '26896'}
Current :  0.729973 MBs Maxi :  180.193492 MBs


# Understanding the builtins

# Object Model