# Python Data Science Toolbox, Part 2
Instructor: Hugo Bowne-Anderson  
Course [link](https://learn.datacamp.com/courses/python-data-science-toolbox-part-2)  
Note Taker: Paris Zhang on Wed, Aug 12, 2020

Course contents:
1. Iterators
2. List comprehensions and generators
3. Case Study - World Bank data

## Chapter 1 - Iterators
### 1.1 Intro to iterators
1. Iterating with a for loop over:
  + a list
  + a string
  + a range object

In [1]:
employees = ['Nick','Lore','Hugo']

for employee in employees:
    print(employee)

Nick
Lore
Hugo


In [2]:
for letter in 'Paris':
    print(letter)

P
a
r
i
s


In [3]:
for i in range(4):
    print(i)

0
1
2
3


2. Iterators vs. iterables
  + Iterable
    + Examples: lists, strings, dictionaries, file connections
    + An object that can be passed to `iter()` (it has an associated `__iter__()` method)
    + Applying `iter()` to an iterable creates an iterator
  + Iterator
    + Produces the next value with `next()`
    + Can be unpacked all at once with `*`, which exhausts the iterator

In [5]:
word = 'Da'
it = iter(word)
next(it)

'D'

In [6]:
next(it)

'a'

In [7]:
next(it)

StopIteration: 
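
A `for` loop performs the same steps behind the scenes: it calls `iter()` once, then calls `next()` repeatedly until `StopIteration` is raised. The sketch below spells this out by hand. `manual_loop` is a hypothetical helper name used only for illustration. It also shows that `next()` accepts an optional default, which is returned instead of raising the error.

```python
def manual_loop(iterable):
    """Collect items the way a for loop would, using iter() and next()."""
    items = []
    it = iter(iterable)              # ask the iterable for an iterator
    while True:
        try:
            items.append(next(it))   # fetch the next value
        except StopIteration:        # raised once the iterator is exhausted
            break
    return items

letters = manual_loop('Da')
print(letters)

# next() takes an optional default that is returned instead of raising
it = iter('Da')
next(it)
next(it)
leftover = next(it, None)
print(leftover)
```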

In [8]:
word = 'Data'
it = iter(word)
print(*it)

D a t a


In [9]:
print(*it)




* Iterating over dictionaries

In [10]:
pythonistas = {'hugo': 'bowne-anderson','francis': 'castro'}

for key, value in pythonistas.items():
    print(key, value)

hugo bowne-anderson
francis castro


* Iterating over file connections

In [22]:
file = open('/Users/pariszhang/Documents/WeillCornell/courses/FA19-DS1/Python/final/files/hp/hp1.txt')
it = iter(file)
print(next(it))

Harry Potter and the Sorcerer's Stone 



In [24]:
print(next(it))

CHAPTER ONE 



In [26]:
print(next(it))

THE BOY WHO LIVED 



In [28]:
print(next(it))

Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense. 
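
Each line returned by `next()` keeps its trailing newline, which is why `print()` shows an extra blank line after each one above. A minimal sketch of the same idea with a context manager, so the file is closed automatically. The temporary file and its contents are stand-ins invented for this example.

```python
import os
import tempfile

# Write a tiny stand-in file so the example runs anywhere (hypothetical content)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('line one\nline two\nline three\n')
    path = tmp.name

# The with statement closes the file automatically when the block ends
with open(path) as file:
    first = next(file)                                  # a file object is its own iterator
    remaining = [line.rstrip('\n') for line in file]    # continues after the first line

os.remove(path)  # clean up the temporary file

print(repr(first))
print(remaining)
```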



### 1.2 More iterators

1. Using `enumerate()`

In [29]:
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
e = enumerate(avengers)
print(type(e),"\n")

e_list = list(e)
print(e_list)

<class 'enumerate'> 

[(0, 'hawkeye'), (1, 'iron man'), (2, 'thor'), (3, 'quicksilver')]


In [30]:
for index, value in enumerate(avengers):
    print(index, value)

0 hawkeye
1 iron man
2 thor
3 quicksilver


In [31]:
for index, value in enumerate(avengers, start=10):
    print(index, value)

10 hawkeye
11 iron man
12 thor
13 quicksilver


2. Using `zip()`

In [37]:
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
names = ['barton', 'stark', 'odinson', 'maximoff']
z = zip(avengers, names)
print(type(z),"\n")

z_list = list(z)
print(z_list) # Each element is a tuple

print("\n")
for z1, z2 in zip(avengers, names):
    print(z1, z2)

<class 'zip'> 

[('hawkeye', 'barton'), ('iron man', 'stark'), ('thor', 'odinson'), ('quicksilver', 'maximoff')]


hawkeye barton
iron man stark
thor odinson
quicksilver maximoff


3. Print zip with `*`

In [39]:
z = zip(avengers, names)
print(*z)

('hawkeye', 'barton') ('iron man', 'stark') ('thor', 'odinson') ('quicksilver', 'maximoff')
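
The same `*` operator can also "unzip" a zip object back into separate tuples. The sketch below reuses the two lists from above. Note that a zip object is an iterator, so it is empty after it has been unpacked once.

```python
avengers = ['hawkeye', 'iron man', 'thor', 'quicksilver']
names = ['barton', 'stark', 'odinson', 'maximoff']

z = zip(avengers, names)

# Unpack the pairs and zip them again to recover the original columns
result1, result2 = zip(*z)
print(result1 == tuple(avengers))
print(result2 == tuple(names))

# The zip object was consumed by the unpacking, so nothing is left
remaining = list(z)
print(remaining)
```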


### 1.3 Using iterators to load large files into memory
* Processing large amounts of Twitter data

In [41]:
import pandas as pd

# Initialize an empty dictionary: counts_dict
counts_dict = {}

# Iterate over the file chunk by chunk
for chunk in pd.read_csv('tweets.csv',chunksize=10):

    for entry in chunk['lang']:
        if entry in counts_dict.keys():
            counts_dict[entry] += 1
        else:
            counts_dict[entry] = 1

print(counts_dict)

{'en': 97, 'et': 1, 'und': 2}


* Extracting information for large amounts of Twitter data

In [42]:
# Define count_entries()
def count_entries(csv_file,c_size,colname):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    for chunk in pd.read_csv(csv_file, chunksize=c_size):

        for entry in chunk[colname]:
            if entry in counts_dict.keys():
                counts_dict[entry] += 1
            else:
                counts_dict[entry] = 1

    return counts_dict

result_counts = count_entries(csv_file='tweets.csv',c_size=10,colname='lang')
print(result_counts)

{'en': 97, 'et': 1, 'und': 2}
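
The if/else tally can also be written with `collections.Counter`. This is an alternative sketch, not the course's solution. `count_entries_counter` is a hypothetical name, and it accepts any iterable of chunks that support `chunk[colname]`. Plain dictionaries stand in for DataFrame chunks here so the example runs without a CSV file; the reader returned by `pd.read_csv(csv_file, chunksize=c_size)` should be usable in their place.

```python
from collections import Counter

def count_entries_counter(chunks, colname):
    """Return a dict of value counts for colname across all chunks."""
    counts = Counter()
    for chunk in chunks:
        counts.update(chunk[colname])   # adds 1 for each value in the column
    return dict(counts)

# Stand-in chunks (made-up data for illustration only)
demo_chunks = [
    {'lang': ['en', 'en', 'et']},
    {'lang': ['en', 'und']},
]

demo_counts = count_entries_counter(demo_chunks, 'lang')
print(demo_counts)
```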


## Chapter 2 - List Comprehensions & Generators
### 2.1 - List Comprehensions
Syntax: `[output_expression for iterator_variable in iterable]`

1. Populating a list with a for loop vs. a list comprehension

In [44]:
nums = [12, 8, 21, 3, 16]

# for loop
new_nums = []
for num in nums:
    new_nums.append(num + 1)
print(new_nums)

# list comprehension
new_nums = [num + 1 for num in nums]
print(new_nums)

[13, 9, 22, 4, 17]
[13, 9, 22, 4, 17]


2. List comprehension with `range()`

In [45]:
result = [num for num in range(11)]
print(result)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


3. Nested loops

In [49]:
# for loop
pairs_1 = []
for num1 in range(0, 2):
    for num2 in range(6, 8):
        pairs_1.append([num1, num2])
print(pairs_1)

# list comprehension
pairs_2 = [(num1, num2) for num1 in range(0, 2) for num2 in range(6,8)]
print(pairs_2)

[[0, 6], [0, 7], [1, 6], [1, 7]]
[(0, 6), (0, 7), (1, 6), (1, 7)]


* Print the first letter of every word

In [51]:
doctor = ['house', 'cuddy', 'chase', 'thirteen', 'wilson']
result = [doc[0] for doc in doctor]
print(result)

['h', 'c', 'c', 't', 'w']


* Create a matrix using a list comprehension in a list comprehension

In [52]:
# Create a 5 x 5 matrix using a list of lists: matrix
matrix = [[col for col in range(5)] for row in range(5)]

for row in matrix:
    print(row)

[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]
[0, 1, 2, 3, 4]


### 2.2 Advanced comprehensions
1. Conditionals in comprehensions - Conditionals on the iterable

In [54]:
[num ** 2 for num in range(10) if num % 2 == 0]

[0, 4, 16, 36, 64]

* Keep only the strings with 7 or more characters.

In [58]:
# Keep members with at least 7 letters
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

new_fellowship = [member for member in fellowship if len(member)>=7]
print(new_fellowship)

['samwise', 'aragorn', 'legolas', 'boromir']


2. Conditionals in comprehensions - Conditionals on the output expression

In [55]:
[num ** 2 if num % 2 == 0 else 0 for num in range(10)]

[0, 0, 4, 0, 16, 0, 36, 0, 64, 0]

* In the output expression, keep the string as-is if the number of characters is >= 7, else replace it with an empty string - that is, `''` or `""`.

In [59]:
new_fellowship = [member if len(member)>=7 else '' for member in fellowship]
print(new_fellowship)

['', 'samwise', '', 'aragorn', 'legolas', 'boromir', '']
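
The two kinds of conditional can be combined in one comprehension: the `if` at the end filters which items are kept, and the `if ... else` in the output expression decides what each kept item becomes. A small sketch with made-up conditions:

```python
# Filter first (num > 5), then label each remaining number
labels = ['even' if num % 2 == 0 else 'odd' for num in range(10) if num > 5]
print(labels)

fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# Keep members containing the letter 'o'; capitalize those with 7+ letters
with_o = [member.title() if len(member) >= 7 else member
          for member in fellowship if 'o' in member]
print(with_o)
```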


3. Dict comprehensions
  + Create dictionaries
  + Use curly braces `{}` instead of brackets `[]`

In [60]:
pos_neg = {num: -num for num in range(9)}
print(pos_neg)

{0: 0, 1: -1, 2: -2, 3: -3, 4: -4, 5: -5, 6: -6, 7: -7, 8: -8}


* Create a dict comprehension where the key is a string in `fellowship` and the value is the length of the string.

In [61]:
new_fellowship = {member:len(member) for member in fellowship}
print(new_fellowship)

{'frodo': 5, 'samwise': 7, 'merry': 5, 'aragorn': 7, 'legolas': 7, 'boromir': 7, 'gimli': 5}


### 2.3 Intro to generator expressions
List comprehensions vs. generator expressions
1. List comprehension - returns a list
2. Generator expression - returns a generator object
3. Both can be iterated over

Generator functions
1. Produce generator objects when called
2. Defined like a regular function, with `def`
3. Yield a sequence of values instead of returning a single value
4. Generate each value with the `yield` keyword
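
The practical difference is that a list comprehension builds every value up front, while a generator computes values only when asked. A rough sketch of this, using `sys.getsizeof` (which measures only the container object itself, not the numbers it refers to):

```python
import sys

squares_list = [num ** 2 for num in range(100000)]   # all values built immediately
squares_gen = (num ** 2 for num in range(100000))    # nothing computed yet

list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
print(list_size > gen_size)

# Values are produced one at a time, on demand
first_three = [next(squares_gen) for _ in range(3)]
print(first_three)
```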

In [62]:
# List of strings
fellowship = ['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli']

# List comprehension
fellow1 = [member for member in fellowship if len(member) >= 7]

# Generator expression
fellow2 = (member for member in fellowship if len(member) >= 7)

print(fellowship,"\n",fellow1,"\n",fellow2)

['frodo', 'samwise', 'merry', 'aragorn', 'legolas', 'boromir', 'gimli'] 
 ['samwise', 'aragorn', 'legolas', 'boromir'] 
 <generator object <genexpr> at 0x7fc61e010ed0>


1. Write generator expressions

In [65]:
# Create generator object: result
result = (num for num in range(11))

# Print the first 5 values
print(next(result))
print(next(result))
print(next(result))
print(next(result))
print(next(result))

# Print the rest of the values
for value in result:
    print(value)

0
1
2
3
4
5
6
7
8
9
10


2. Changing the output in generator expressions

In [66]:
lannister = ['cersei', 'jaime', 'tywin', 'tyrion', 'joffrey']

# Create a generator object: lengths
lengths = (len(person) for person in lannister)

for value in lengths:
    print(value)

6
5
5
6
7


3. Build a generator function using `yield` instead of `return`
  + Create a generator function that works like the generator expression defined in the previous exercise: `lengths = (len(person) for person in lannister)`

In [67]:
# Define generator function get_lengths
def get_lengths(input_list):
    """Generator function that yields the
    length of the strings in input_list."""

    # Yield the length of a string
    for person in input_list:
        yield len(person)

for value in get_lengths(lannister):
    print(value)

6
5
5
6
7
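
Calling a generator function does not run its body; execution starts at the first `next()` and pauses at each `yield`. The sketch below makes this visible by recording which items have been processed. `get_lengths_verbose` and the `log` list are hypothetical additions for illustration.

```python
def get_lengths_verbose(input_list, log):
    """Yield string lengths, recording each processed item in log."""
    for person in input_list:
        log.append(person)       # record that this item was reached
        yield len(person)

log = []
gen = get_lengths_verbose(['cersei', 'jaime', 'tywin'], log)

log_before = list(log)           # body has not started running yet
first = next(gen)                # runs up to the first yield
log_after_first = list(log)      # only the first item was processed
rest = list(gen)                 # consume the remaining values

print(log_before, first, log_after_first, rest)
```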


### 2.4 Wrapping up list comprehensions and generators

1. List comprehensions for time-stamped data

In [70]:
import pandas as pd
df = pd.read_csv('tweets.csv')

tweet_time = df['created_at'] # returns a Series data structure
tweet_clock_time = [entry[11:19] for entry in tweet_time]

print(tweet_clock_time)

['23:40:17', '23:40:17', '23:40:17', '23:40:17', '23:40:17', '23:40:17', '23:40:18', '23:40:17', '23:40:18', '23:40:18', '23:40:18', '23:40:17', '23:40:18', '23:40:18', '23:40:17', '23:40:18', '23:40:18', '23:40:17', '23:40:18', '23:40:17', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:17', '23:40:18', '23:40:18', '23:40:17', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:18', '23:40:19', '23:40:18', '23:40:18', '23:40:18', '23:40:19', '23:40:19', '23:40:19', '23:40:18', '23:40:19', '23:40:19', '23:40:19', '23:40:18', '23:40:19', '23:40:19', '23:40:19', '23:40:18', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23

2. Conditional list comprehensions for time-stamped data

In [71]:
tweet_clock_time = [entry[11:19] for entry in tweet_time if entry[17:19] == '19']
print(tweet_clock_time)

['23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19', '23:40:19']
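
Slicing by position works only while every timestamp has exactly the same layout. A possible alternative is to parse the strings with `datetime.strptime`. The format string and the two sample timestamps below are assumptions (they are not taken from `tweets.csv`), chosen so that characters 11 to 18 hold the clock time as in the slices above.

```python
from datetime import datetime

# Assumed layout of the created_at strings (not confirmed by the data above)
TWEET_TIME_FORMAT = '%a %b %d %H:%M:%S %z %Y'

# Hypothetical sample values for illustration
sample_times = [
    'Tue Mar 29 23:40:17 +0000 2016',
    'Tue Mar 29 23:40:19 +0000 2016',
]

parsed = [datetime.strptime(entry, TWEET_TIME_FORMAT) for entry in sample_times]

# Same results as entry[11:19], but taken from parsed datetime objects
clock_times = [dt.strftime('%H:%M:%S') for dt in parsed]
print(clock_times)

# Same filter as entry[17:19] == '19'
at_19_seconds = [dt.strftime('%H:%M:%S') for dt in parsed if dt.second == 19]
print(at_19_seconds)
```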


## Chapter 3 - Case Study

World Bank data
* Data on world economies for over half a century
* Indicators
  + Population
  + Electricity consumption
  + CO2 emissions
  + Literacy rates
  + Unemployment
  + Mortality rates
  
### 3.1 Welcome to Case Study
1. Dictionaries for data science

In [72]:
feature_names = [
    'CountryName',
    'CountryCode',
    'IndicatorName',
    'IndicatorCode',
    'Year',
    'Value'
]

row_vals = [
    'Arab World',
    'ARB',
    'Adolescent fertility rate (births per 1,000 women ages 15-19)',
    'SP.ADO.TFRT',
    '1960',
    '133.56090740552298'
]

zipped_lists = zip(feature_names,row_vals)
rs_dict = dict(zipped_lists)
print(rs_dict)

{'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'Year': '1960', 'Value': '133.56090740552298'}


2. Writing a function instead

In [73]:
def lists2dict(list1, list2):
    """Return a dictionary where list1 provides
    the keys and list2 provides the values."""

    zipped_lists = zip(list1, list2)
    rs_dict = dict(zipped_lists)

    return rs_dict

rs_fxn = lists2dict(feature_names,row_vals)
print(rs_fxn)

{'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'Year': '1960', 'Value': '133.56090740552298'}
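
One thing to keep in mind: `zip()` stops at the shorter input, so a row with a missing value would silently lose a key. A defensive variant might check the lengths first. `lists2dict_checked` is a hypothetical name, and this is only a sketch. (Python 3.10 and later also offer `zip(..., strict=True)` for the same purpose.)

```python
def lists2dict_checked(keys, values):
    """Like lists2dict, but refuse inputs of different lengths."""
    if len(keys) != len(values):
        raise ValueError(
            f'expected {len(keys)} values, got {len(values)}'
        )
    return dict(zip(keys, values))

# Plain zip silently drops the unmatched key 'c'
truncated = dict(zip(['a', 'b', 'c'], [1, 2]))
print(truncated)

# The checked version raises instead
try:
    lists2dict_checked(['a', 'b', 'c'], [1, 2])
    raised = False
except ValueError as err:
    raised = True
    print(err)
```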


3. Using a list comprehension

In [77]:
row_lists = [['Arab World',
  'ARB',
  'Adolescent fertility rate (births per 1,000 women ages 15-19)',
  'SP.ADO.TFRT',
  '1960',
  '133.56090740552298'],
 ['Arab World',
  'ARB',
  'Age dependency ratio (% of working-age population)',
  'SP.POP.DPND',
  '1960',
  '87.7976011532547'],
 ['Arab World',
  'ARB',
  'Age dependency ratio, old (% of working-age population)',
  'SP.POP.DPND.OL',
  '1960',
  '6.634579191565161'],
 ['Arab World',
  'ARB',
  'Age dependency ratio, young (% of working-age population)',
  'SP.POP.DPND.YG',
  '1960',
  '81.02332950839141'],
 ['Arab World',
  'ARB',
  'Arms exports (SIPRI trend indicator values)',
  'MS.MIL.XPRT.KD',
  '1960',
  '3000000.0'],
 ['Arab World',
  'ARB',
  'Arms imports (SIPRI trend indicator values)',
  'MS.MIL.MPRT.KD',
  '1960',
  '538000000.0'],
 ['Arab World',
  'ARB',
  'Birth rate, crude (per 1,000 people)',
  'SP.DYN.CBRT.IN',
  '1960',
  '47.697888095096395'],
 ['Arab World',
  'ARB',
  'CO2 emissions (kt)',
  'EN.ATM.CO2E.KT',
  '1960',
  '59563.9892169935'],
 ['Arab World',
  'ARB',
  'CO2 emissions (metric tons per capita)',
  'EN.ATM.CO2E.PC',
  '1960',
  '0.6439635478877049'],
 ['Arab World',
  'ARB',
  'CO2 emissions from gaseous fuel consumption (% of total)',
  'EN.ATM.CO2E.GF.ZS',
  '1960',
  '5.041291753975099'],
 ['Arab World',
  'ARB',
  'CO2 emissions from liquid fuel consumption (% of total)',
  'EN.ATM.CO2E.LF.ZS',
  '1960',
  '84.8514729446567'],
 ['Arab World',
  'ARB',
  'CO2 emissions from liquid fuel consumption (kt)',
  'EN.ATM.CO2E.LF.KT',
  '1960',
  '49541.707291032304'],
 ['Arab World',
  'ARB',
  'CO2 emissions from solid fuel consumption (% of total)',
  'EN.ATM.CO2E.SF.ZS',
  '1960',
  '4.72698138789597'],
 ['Arab World',
  'ARB',
  'Death rate, crude (per 1,000 people)',
  'SP.DYN.CDRT.IN',
  '1960',
  '19.7544519237187'],
 ['Arab World',
  'ARB',
  'Fertility rate, total (births per woman)',
  'SP.DYN.TFRT.IN',
  '1960',
  '6.92402738655897'],
 ['Arab World',
  'ARB',
  'Fixed telephone subscriptions',
  'IT.MLT.MAIN',
  '1960',
  '406833.0'],
 ['Arab World',
  'ARB',
  'Fixed telephone subscriptions (per 100 people)',
  'IT.MLT.MAIN.P2',
  '1960',
  '0.6167005703199'],
 ['Arab World',
  'ARB',
  'Hospital beds (per 1,000 people)',
  'SH.MED.BEDS.ZS',
  '1960',
  '1.9296220724398703'],
 ['Arab World',
  'ARB',
  'International migrant stock (% of population)',
  'SM.POP.TOTL.ZS',
  '1960',
  '2.9906371279862403'],
 ['Arab World',
  'ARB',
  'International migrant stock, total',
  'SM.POP.TOTL',
  '1960',
  '3324685.0']]

# Print the first two lists in row_lists
print(row_lists[0])
print(row_lists[1],"\n")

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]

# Print the first two dictionaries in list_of_dicts
print(list_of_dicts[0])
print(list_of_dicts[1])

['Arab World', 'ARB', 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'SP.ADO.TFRT', '1960', '133.56090740552298']
['Arab World', 'ARB', 'Age dependency ratio (% of working-age population)', 'SP.POP.DPND', '1960', '87.7976011532547'] 

{'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'Year': '1960', 'Value': '133.56090740552298'}
{'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Age dependency ratio (% of working-age population)', 'IndicatorCode': 'SP.POP.DPND', 'Year': '1960', 'Value': '87.7976011532547'}
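
Note that `'Year'` and `'Value'` are still strings at this point. If numeric types are wanted, one option is to convert them per record before further processing. `convert_types` is a hypothetical helper, and the sample record copies the second dictionary printed above.

```python
def convert_types(record):
    """Return a copy of record with Year as int and Value as float."""
    converted = dict(record)                     # leave the original untouched
    converted['Year'] = int(record['Year'])
    converted['Value'] = float(record['Value'])
    return converted

sample = {
    'CountryName': 'Arab World',
    'CountryCode': 'ARB',
    'IndicatorName': 'Age dependency ratio (% of working-age population)',
    'IndicatorCode': 'SP.POP.DPND',
    'Year': '1960',
    'Value': '87.7976011532547',
}

typed = convert_types(sample)
print(type(typed['Year']), type(typed['Value']))
```

With the full data, the same helper could be applied in a list comprehension, for example `[convert_types(d) for d in list_of_dicts]`.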


4. Turning this all into a DataFrame

In [78]:
import pandas as pd

df = pd.DataFrame(list_of_dicts)
print(df.head())

  CountryName CountryCode                                      IndicatorName  \
0  Arab World         ARB  Adolescent fertility rate (births per 1,000 wo...   
1  Arab World         ARB  Age dependency ratio (% of working-age populat...   
2  Arab World         ARB  Age dependency ratio, old (% of working-age po...   
3  Arab World         ARB  Age dependency ratio, young (% of working-age ...   
4  Arab World         ARB        Arms exports (SIPRI trend indicator values)   

    IndicatorCode  Year               Value  
0     SP.ADO.TFRT  1960  133.56090740552298  
1     SP.POP.DPND  1960    87.7976011532547  
2  SP.POP.DPND.OL  1960   6.634579191565161  
3  SP.POP.DPND.YG  1960   81.02332950839141  
4  MS.MIL.XPRT.KD  1960           3000000.0  


### 3.2 Using Python generators for streaming data
1. Processing data in chunks
2. Writing a generator to load data in chunks

In [92]:
# Open a connection to the file
with open('world_ind_pop_data.csv') as file: # Context manager

    # Skip the column names
    file.readline()

    counts_dict = {}

    # Process only the first 1000 rows
    for j in range(1000):

        # Split the current line into a list: line
        line = file.readline().split(',')
        first_col = line[0]

        # If the column value is in the dict, increment its value
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1

        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1

print(list(counts_dict)[:5])

['Arab World', 'Caribbean small states', 'Central Europe and the Baltics', 'East Asia & Pacific (all income levels)', 'East Asia & Pacific (developing only)']


In [94]:
def read_large_file(file_object):
    """A generator function to read a large file lazily."""

    # Loop indefinitely until the end of the file
    while True:

        # Read a line from the file: data
        data = file_object.readline()

        # Break if this is the end of the file
        if not data:
            break

        yield data
        
with open('world_ind_pop_data.csv') as file:

    # Create a generator object for the file: gen_file
    gen_file = read_large_file(file)

    print(next(gen_file))
    print(next(gen_file))
    print(next(gen_file))

CountryName,CountryCode,Year,Total Population,Urban population (% of total)

Arab World,ARB,1960,92495902.0,31.285384211605397

Caribbean small states,CSS,1960,4190810.0,31.5974898513652



In [96]:
counts_dict = {}

with open('world_ind_pop_data.csv') as file:

    # The header row is not skipped here, so 'CountryName' is counted too
    for line in read_large_file(file):

        row = line.split(',')
        first_col = row[0]

        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1
        else:
            counts_dict[first_col] = 1

print(list(counts_dict)[:10])

['CountryName', 'Arab World', 'Caribbean small states', 'Central Europe and the Baltics', 'East Asia & Pacific (all income levels)', 'East Asia & Pacific (developing only)', 'Euro area', 'Europe & Central Asia (all income levels)', 'Europe & Central Asia (developing only)', 'European Union']
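
The output above includes `'CountryName'` because the header line is counted like any other row. A sketch of one way around this is to consume the header with `next()` first. It also uses `csv.reader` instead of `split(',')`, which copes with commas inside quoted fields. The in-memory text stands in for the file: the header and first two rows are copied from the output above, and the last row is hypothetical.

```python
import csv
import io
from collections import Counter

sample_csv = io.StringIO(
    'CountryName,CountryCode,Year,Total Population,Urban population (% of total)\n'
    'Arab World,ARB,1960,92495902.0,31.285384211605397\n'
    'Caribbean small states,CSS,1960,4190810.0,31.5974898513652\n'
    '"Example, Region",EXR,1960,1.0,50.0\n'    # hypothetical row with a quoted comma
)

reader = csv.reader(sample_csv)       # csv.reader is itself a lazy iterator
header = next(reader)                 # consume the header so it is not counted

# Count first-column values lazily, one row at a time
counts = Counter(row[0] for row in reader)
print(dict(counts))
```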


### 3.3 Using pandas' `read_csv` iterator for streaming data
- Writing an iterator to load data in chunks

In [112]:
import pandas as pd

urb_pop_reader = pd.read_csv('world_ind_pop_data.csv', chunksize=1000)
df_urb_pop = next(urb_pop_reader)
df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB']
pops = zip(df_pop_ceb['Total Population'], 
           df_pop_ceb['Urban population (% of total)'])
pops_list = list(pops)

# Use a list comprehension to compute the total urban population for each row
result = [int(val1 * val2 / 100) for val1,val2 in pops_list]
print(result)

# Plot urban population data
#df_pop_ceb.plot(kind="scatter", x='Year', y='Total Urban Population')
#plt.show()

[40680944, 41697325, 42662734, 43670267, 44717348]


In [None]:
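import matplotlib.pyplot as plt

# Re-create the reader so that iteration starts from the first chunk
urb_pop_reader = pd.read_csv('world_ind_pop_data.csv', chunksize=1000)

# Collect the filtered chunks in a list
chunks = []

for df_urb_pop in urb_pop_reader:

    # .copy() avoids a SettingWithCopyWarning when the new column is added
    df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB'].copy()

    pops = zip(df_pop_ceb['Total Population'],
               df_pop_ceb['Urban population (% of total)'])

    pops_list = list(pops)

    # Use list comprehension to create new DataFrame column 'Total Urban Population'
    df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]

    chunks.append(df_pop_ceb)

# DataFrame.append was removed in pandas 2.0, so concatenate the chunks instead
data = pd.concat(chunks)

# Plot urban population data
data.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()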

In [None]:
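import matplotlib.pyplot as plt

def plot_pop(filename, country_code):
    """Plot total urban population by year for one country code."""

    # Initialize reader object: urb_pop_reader
    urb_pop_reader = pd.read_csv(filename, chunksize=1000)

    # Collect the filtered chunks in a list
    chunks = []

    for df_urb_pop in urb_pop_reader:
        # .copy() avoids a SettingWithCopyWarning when the new column is added
        df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == country_code].copy()

        pops = zip(df_pop_ceb['Total Population'],
                   df_pop_ceb['Urban population (% of total)'])

        pops_list = list(pops)

        df_pop_ceb['Total Urban Population'] = [int(tup[0] * tup[1] * 0.01) for tup in pops_list]

        chunks.append(df_pop_ceb)

    # DataFrame.append was removed in pandas 2.0, so concatenate the chunks instead
    data = pd.concat(chunks)

    data.plot(kind='scatter', x='Year', y='Total Urban Population')
    plt.show()

# Same file as in the cells above
fn = 'world_ind_pop_data.csv'
plot_pop(fn,'CEB')
plot_pop(fn,'ARB')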