#### `Zipping dictionaries`
For this exercise, you'll use what you've learned about the __zip()__ function and combine two lists into a dictionary.

These lists are actually extracted from a [bigger dataset file of world development indicators from the World Bank](https://datacatalog.worldbank.org/search/dataset/0037712). For pedagogical purposes, we have pre-processed this dataset into the lists that you'll be working with.

The first list __feature_names__ contains header names of the dataset and the second list __row_vals__ contains actual values of a row from the dataset, corresponding to each of the header names.

- Create a _zip_ object by calling __zip()__ and passing to it __feature_names__ and __row_vals__. Assign the result to __zipped_lists__.
- Create a dictionary from the __zipped_lists__ zip object by calling __dict()__ with __zipped_lists__. Assign the resulting dictionary to __rs_dict__.

In [12]:
feature_names = ['CountryName', 'CountryCode', 'IndicatorName', 'IndicatorCode', 'Year', 'Value']
row_vals = ['Arab World', 'ARB', 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'SP.ADO.TFRT', '1960', '133.56090740552298']

In [13]:
# Zip lists: zipped_lists
zipped_lists = zip(feature_names, row_vals)

# Create a dictionary: rs_dict
rs_dict = dict(zipped_lists)

# Print the dictionary
print(rs_dict)

{'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'Year': '1960', 'Value': '133.56090740552298'}
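
Two behaviors of __zip()__ are worth keeping in mind when building dictionaries this way: a zip object is a single-use iterator, and __zip()__ stops at the shorter of its inputs. The sketch below illustrates both with shortened stand-in lists (`names` and `vals` are illustrative names, not the preloaded data).

```python
# Shortened stand-ins for feature_names and row_vals (illustrative only)
names = ['CountryName', 'CountryCode', 'Year']
vals = ['Arab World', 'ARB', '1960']

# A zip object is an iterator: dict() consumes it
zipped = zip(names, vals)
first = dict(zipped)
second = dict(zipped)  # nothing is left to consume

print(first)   # {'CountryName': 'Arab World', 'CountryCode': 'ARB', 'Year': '1960'}
print(second)  # {}

# zip() stops at the shorter input, so the unmatched key is silently dropped
short = dict(zip(names, ['Arab World', 'ARB']))
print(short)   # {'CountryName': 'Arab World', 'CountryCode': 'ARB'}
```

If a length mismatch would indicate bad data, comparing `len()` of the two lists before zipping is a simple safeguard.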


#### `Writing a function to help you`
Suppose you needed to repeat the same process done in the previous exercise to many, many rows of data. Rewriting your code again and again could become very tedious, repetitive, and unmaintainable.

In this exercise, you will create a function to house the code you wrote earlier to make things easier and much more concise. Why? This way, you only need to call the function and supply the appropriate lists to create your dictionaries! Again, the lists __feature_names__ and __row_vals__ are preloaded and these contain the header names of the dataset and actual values of a row from the dataset, respectively.

- Define the function __lists2dict()__ with two parameters: first is __list1__ and second is __list2__.
- Return the resulting dictionary __rs_dict__ in __lists2dict()__.
- Call the __lists2dict()__ function with the arguments __feature_names__ and __row_vals__. Assign the result of the function call to __rs_fxn__.

In [14]:
# Define lists2dict()
def lists2dict(list1, list2):
    """Return a dictionary where list1 provides
    the keys and list2 provides the values."""

    # Zip lists: zipped_lists
    zipped_lists = zip(list1, list2)

    # Create a dictionary: rs_dict
    rs_dict = dict(zipped_lists)

    return rs_dict


# Call lists2dict: rs_fxn
rs_fxn = lists2dict(feature_names, row_vals)

# Print rs_fxn
print(rs_fxn)

{'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'Year': '1960', 'Value': '133.56090740552298'}


#### `Using a list comprehension`
This time, you're going to use the __lists2dict()__ function you defined in the last exercise to turn a bunch of lists into a list of dictionaries with the help of a list comprehension.

The __lists2dict()__ function has already been preloaded, together with a couple of lists, __feature_names__ and __row_lists__. __feature_names__ contains the header names of the World Bank dataset and __row_lists__ is a list of lists, where each sublist is a list of actual values of a row from the dataset.

Your goal is to use a list comprehension to generate a list of dicts, where the keys are the header names and the values are the row entries.

In [15]:
row_lists = [['Arab World', 'ARB', 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'SP.ADO.TFRT', '1960', '133.56090740552298'], ['Arab World', 'ARB', 'Age dependency ratio (% of working-age population)', 'SP.POP.DPND', '1960', '87.7976011532547'], ['Arab World', 'ARB', 'Age dependency ratio, old (% of working-age population)', 'SP.POP.DPND.OL', '1960', '6.634579191565161'], ['Arab World', 'ARB', 'Age dependency ratio, young (% of working-age population)', 'SP.POP.DPND.YG', '1960', '81.02332950839141'], ['Arab World', 'ARB', 'Arms exports (SIPRI trend indicator values)', 'MS.MIL.XPRT.KD', '1960', '3000000.0'], ['Arab World', 'ARB', 'Arms imports (SIPRI trend indicator values)', 'MS.MIL.MPRT.KD', '1960', '538000000.0'], ['Arab World', 'ARB', 'Birth rate, crude (per 1,000 people)', 'SP.DYN.CBRT.IN', '1960', '47.697888095096395'], ['Arab World', 'ARB', 'CO2 emissions (kt)', 'EN.ATM.CO2E.KT', '1960', '59563.9892169935'], ['Arab World', 'ARB', 'CO2 emissions (metric tons per capita)', 'EN.ATM.CO2E.PC', '1960', '0.6439635478877049'], ['Arab World', 'ARB', 'CO2 emissions from gaseous fuel consumption (% of total)', 'EN.ATM.CO2E.GF.ZS', '1960', '5.041291753975099'], [
    'Arab World', 'ARB', 'CO2 emissions from liquid fuel consumption (% of total)', 'EN.ATM.CO2E.LF.ZS', '1960', '84.8514729446567'], ['Arab World', 'ARB', 'CO2 emissions from liquid fuel consumption (kt)', 'EN.ATM.CO2E.LF.KT', '1960', '49541.707291032304'], ['Arab World', 'ARB', 'CO2 emissions from solid fuel consumption (% of total)', 'EN.ATM.CO2E.SF.ZS', '1960', '4.72698138789597'], ['Arab World', 'ARB', 'Death rate, crude (per 1,000 people)', 'SP.DYN.CDRT.IN', '1960', '19.7544519237187'], ['Arab World', 'ARB', 'Fertility rate, total (births per woman)', 'SP.DYN.TFRT.IN', '1960', '6.92402738655897'], ['Arab World', 'ARB', 'Fixed telephone subscriptions', 'IT.MLT.MAIN', '1960', '406833.0'], ['Arab World', 'ARB', 'Fixed telephone subscriptions (per 100 people)', 'IT.MLT.MAIN.P2', '1960', '0.6167005703199'], ['Arab World', 'ARB', 'Hospital beds (per 1,000 people)', 'SH.MED.BEDS.ZS', '1960', '1.9296220724398703'], ['Arab World', 'ARB', 'International migrant stock (% of population)', 'SM.POP.TOTL.ZS', '1960', '2.9906371279862403'], ['Arab World', 'ARB', 'International migrant stock, total', 'SM.POP.TOTL', '1960', '3324685.0']]

- Inspect the contents of __row_lists__ by printing the first two lists in __row_lists__.
- Create a list comprehension that generates a dictionary using __lists2dict()__ for each sublist in __row_lists__. The keys are from the __feature_names__ list and the values are the row entries in __row_lists__. Use __sublist__ as your iterator variable and assign the resulting list of dictionaries to __list_of_dicts__.
- Look at the first two dictionaries in __list_of_dicts__ by printing them out.

In [16]:
# Print the first two lists in row_lists
print(row_lists[0])
print(row_lists[1])

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]

# Print the first two dictionaries in list_of_dicts
print(list_of_dicts[0])
print(list_of_dicts[1])

['Arab World', 'ARB', 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'SP.ADO.TFRT', '1960', '133.56090740552298']
['Arab World', 'ARB', 'Age dependency ratio (% of working-age population)', 'SP.POP.DPND', '1960', '87.7976011532547']
{'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Adolescent fertility rate (births per 1,000 women ages 15-19)', 'IndicatorCode': 'SP.ADO.TFRT', 'Year': '1960', 'Value': '133.56090740552298'}
{'CountryName': 'Arab World', 'CountryCode': 'ARB', 'IndicatorName': 'Age dependency ratio (% of working-age population)', 'IndicatorCode': 'SP.POP.DPND', 'Year': '1960', 'Value': '87.7976011532547'}
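
For comparison, the list comprehension above is shorthand for an explicit loop that appends one dictionary per sublist. A minimal sketch, using small stand-in lists rather than the preloaded data (the values are illustrative):

```python
def lists2dict(list1, list2):
    """Return a dictionary where list1 provides the keys and list2 the values."""
    return dict(zip(list1, list2))

# Small stand-ins for the preloaded lists (illustrative only)
feature_names = ['CountryName', 'Year']
row_lists = [['Arab World', '1960'], ['Arab World', '1961']]

# List comprehension, as in the exercise
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]

# Equivalent explicit loop
loop_result = []
for sublist in row_lists:
    loop_result.append(lists2dict(feature_names, sublist))

print(list_of_dicts == loop_result)  # True
```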


#### `Turning this all into a DataFrame`
You've zipped lists together, created a function to house your code, and even used the function in a list comprehension to generate a list of dictionaries. That was a lot of work and you did a great job!

You will now use all of these to convert the list of dictionaries into a pandas DataFrame. You will see how convenient it is to generate a DataFrame from dictionaries with the __DataFrame()__ function from the pandas package.

The __lists2dict()__ function, __feature_names__ list, and __row_lists__ list have been preloaded for this exercise.

Go for it!

- To use the __DataFrame()__ function you need, first import the pandas package with the alias __pd__.
- Create a DataFrame from the list of dictionaries in __list_of_dicts__ by calling __pd.DataFrame()__. Assign the resulting DataFrame to __df__.
- Inspect the contents of __df__ by printing the head of the DataFrame, which you can access by calling __df.head()__.

In [17]:
# Import the pandas package
import pandas as pd

# Turn list of lists into list of dicts: list_of_dicts
list_of_dicts = [lists2dict(feature_names, sublist) for sublist in row_lists]

# Turn list of dicts into a DataFrame: df
df = pd.DataFrame(list_of_dicts)

# Print the head of the DataFrame
print(df.head())

  CountryName CountryCode                                      IndicatorName  \
0  Arab World         ARB  Adolescent fertility rate (births per 1,000 wo...   
1  Arab World         ARB  Age dependency ratio (% of working-age populat...   
2  Arab World         ARB  Age dependency ratio, old (% of working-age po...   
3  Arab World         ARB  Age dependency ratio, young (% of working-age ...   
4  Arab World         ARB        Arms exports (SIPRI trend indicator values)   

    IndicatorCode  Year               Value  
0     SP.ADO.TFRT  1960  133.56090740552298  
1     SP.POP.DPND  1960    87.7976011532547  
2  SP.POP.DPND.OL  1960   6.634579191565161  
3  SP.POP.DPND.YG  1960   81.02332950839141  
4  MS.MIL.XPRT.KD  1960           3000000.0  


In [18]:
# df.to_csv('world_dev_ind.csv', index=False)

#### `Processing data in chunks (1)`
Sometimes, data sources can be so large in size that storing the entire dataset in memory becomes too resource-intensive. In this exercise, you will process the first 1000 rows of a file line by line, to create a dictionary of the counts of how many times each country appears in a column in the dataset.

The csv file '__world_dev_ind.csv__' is in your current directory for your use. To begin, you need to open a connection to this file using what is known as a context manager. For example, the command __with open('datacamp.csv') as datacamp__ binds the csv file '__datacamp.csv__' as __datacamp__ in the context manager. Here, the __with__ statement is the context manager, and its purpose is to ensure that resources are efficiently allocated when opening a connection to a file; in particular, the file is closed automatically when the block ends.

If you'd like to learn more about context managers, refer to the DataCamp course on Importing Data in Python.
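
As a small illustration of what the context manager does, the sketch below checks the file's `closed` attribute inside and after the block. It writes its own throwaway file so that it runs anywhere; the path and contents are illustrative, not the course data.

```python
import os
import tempfile

# Write a tiny throwaway file so the example is self-contained
# (the path and contents are illustrative, not the course data)
path = os.path.join(tempfile.mkdtemp(), 'example.csv')
with open(path, 'w') as out:
    out.write('CountryName,CountryCode\nArab World,ARB\n')

with open(path) as file:
    header = file.readline()
    inside = file.closed   # False: the file is open inside the block

after = file.closed        # True: the with statement closed it on exit

print(header.strip())  # CountryName,CountryCode
print(inside, after)   # False True
```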

- Use __open()__ to bind the csv file '__world_dev_ind.csv__' as __file__ in the context manager.
- Complete the __for__ loop so that it iterates ___1000___ times to perform the loop body and process only the first ___1000___ rows of data of the file.

In [21]:
# Open a connection to the file
with open('world_dev_ind.csv', 'r') as file:

    # Skip the column names
    file.readline()

    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Process only the first 1000 rows
    for j in range(1000):

        # Split the current line into a list: line
        line = file.readline().split(',')

        # Get the value for the first column: first_col
        first_col = line[0]

        # If the column value is in the dict, increment its value
        if first_col in counts_dict.keys():
            counts_dict[first_col] += 1

        # Else, add to the dict and set value to 1
        else:
            counts_dict[first_col] = 1

# Print the resulting dictionary
print(counts_dict)

{'Arab World': 20, '': 980}
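
The `'': 980` entry in the output above arises because __readline()__ returns an empty string once the file is exhausted. The counts suggest this copy of the file holds only 20 data rows, which is an inference from the output rather than something the exercise states. Below is a minimal sketch of one way to stop at the end of the file instead, using `itertools.islice`, `csv.reader`, and `collections.Counter` from the standard library. The in-memory sample rows are illustrative stand-ins for the real file.

```python
import csv
import io
from collections import Counter
from itertools import islice

# In-memory stand-in for world_dev_ind.csv (illustrative rows only)
sample = io.StringIO(
    'CountryName,CountryCode,IndicatorName\n'
    'Arab World,ARB,"Birth rate, crude (per 1,000 people)"\n'
    'Arab World,ARB,CO2 emissions (kt)\n'
)

reader = csv.reader(sample)
next(reader)  # skip the column names

# islice stops after 1000 rows or at end of file, whichever comes first
counts = Counter(row[0] for row in islice(reader, 1000))

print(dict(counts))  # {'Arab World': 2}
```

Unlike a plain `split(',')`, `csv.reader` also keeps quoted fields that contain commas in one piece, which matters for indicator names such as the one in the first sample row.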


#### `Writing a generator to load data in chunks (2)`
In the previous exercise, you processed a file line by line for a given number of lines. What if, however, you want to do this for the entire file?

In this case, it would be useful to use generators. ___Generators___ allow users to lazily evaluate data. This concept of lazy evaluation is useful when you have to deal with very large datasets because it lets you generate values in an efficient manner by yielding only chunks of data at a time instead of the whole thing at once.
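
To make lazy evaluation concrete, here is a minimal sketch with a hypothetical generator function, `count_up_to()`, that prints a message each time it produces a value. Nothing in the function body runs until __next()__ asks for a value.

```python
def count_up_to(n):
    """Yield 1..n one value at a time (hypothetical helper for illustration)."""
    i = 1
    while i <= n:
        print(f'producing {i}')
        yield i
        i += 1

gen = count_up_to(3)   # nothing runs yet: no 'producing' line is printed
first = next(gen)      # prints 'producing 1'
second = next(gen)     # prints 'producing 2'
rest = list(gen)       # prints 'producing 3'

print(first, second, rest)  # 1 2 [3]
```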

In this exercise, you will define a generator function __read_large_file()__ that produces a generator object which yields a single line from a file each time __next()__ is called on it. The csv file '__world_dev_ind.csv__' is in your current directory for your use.

Note that when you open a connection to a file, the resulting file object is already an iterator that yields one line at a time. Strictly speaking it is not a generator, but it is lazy in the same way. So out in the wild, you won't have to explicitly create generator objects in cases such as this. However, for pedagogical reasons, we are having you practice how to do this here with the __read_large_file()__ function. Go for it!
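
To illustrate the point that a file object can be consumed lazily without a wrapper, the sketch below calls __next()__ and loops directly on the file. It writes its own throwaway file; the path and contents are illustrative.

```python
import os
import tempfile

# Throwaway file so the example is self-contained (contents are illustrative)
path = os.path.join(tempfile.mkdtemp(), 'example.csv')
with open(path, 'w') as out:
    out.write('CountryName,CountryCode\nArab World,ARB\nArab World,ARB\n')

with open(path) as file:
    # next() works directly on the file object
    header = next(file)
    # Iteration continues from the line after the header
    rows = [line.rstrip('\n') for line in file]

print(header.rstrip('\n'))  # CountryName,CountryCode
print(rows)                 # ['Arab World,ARB', 'Arab World,ARB']
```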

- In the function __read_large_file()__, read a line from __file_object__ by using the method __readline()__. Assign the result to __data__.
- In the function __read_large_file()__, __yield__ the line read from the file __data__.
- In the context manager, create a generator object __gen_file__ by calling your generator function __read_large_file()__ and passing __file__ to it.
- Print the first three lines produced by the generator object __gen_file__ using __next()__.

In [22]:
# Define read_large_file()
def read_large_file(file_object):
    """A generator function to read a large file lazily."""

    # Loop indefinitely until the end of the file
    while True:

        # Read a line from the file: data
        data = file_object.readline()

        # Break if this is the end of the file
        if not data:
            break

        # Yield the line of data
        yield data


# Open a connection to the file
with open('world_dev_ind.csv') as file:

    # Create a generator object for the file: gen_file
    gen_file = read_large_file(file)

    # Print the first three lines of the file
    print(next(gen_file))
    print(next(gen_file))
    print(next(gen_file))

CountryName,CountryCode,IndicatorName,IndicatorCode,Year,Value

Arab World,ARB,"Adolescent fertility rate (births per 1,000 women ages 15-19)",SP.ADO.TFRT,1960,133.56090740552298

Arab World,ARB,Age dependency ratio (% of working-age population),SP.POP.DPND,1960,87.7976011532547
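
Returning to the question of processing the entire file: one possible approach is to loop over the generator until it is exhausted, so that no fixed row count is needed. This is a sketch rather than the exercise solution; it redefines __read_large_file()__ to stay self-contained and uses a throwaway file with illustrative rows in place of '__world_dev_ind.csv__'.

```python
import os
import tempfile

def read_large_file(file_object):
    """A generator function to read a large file lazily."""
    while True:
        data = file_object.readline()
        if not data:
            break
        yield data

# Throwaway stand-in for world_dev_ind.csv (illustrative rows only)
path = os.path.join(tempfile.mkdtemp(), 'example.csv')
with open(path, 'w') as out:
    out.write('CountryName,CountryCode\n'
              'Arab World,ARB\n'
              'Arab World,ARB\n'
              'Euro area,EMU\n')

counts_dict = {}
with open(path) as file:
    gen_file = read_large_file(file)
    next(gen_file)  # skip the column names

    # The loop ends when the generator is exhausted, so no row limit is needed
    for line in gen_file:
        first_col = line.split(',')[0]
        counts_dict[first_col] = counts_dict.get(first_col, 0) + 1

print(counts_dict)  # {'Arab World': 2, 'Euro area': 1}
```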



__Note:__ since a file object is already a lazy iterator over its lines, you don't have to wrap it in a generator function such as __read_large_file()__. However, it is still good practice to know how to create generators.

To recap, the exercises above covered:

- __Using the zip() function__ to pair up two lists so they can be turned into a dictionary. For example, __zip(feature_names, row_vals)__ pairs each element of __feature_names__ with the corresponding element in __row_vals__.
- __Creating dictionaries__ from zipped lists using the __dict()__ function, turning paired lists into a dictionary where the first list provides keys and the second list provides values.
- __Defining and using functions__ to automate repetitive tasks, such as converting lists into dictionaries. The __lists2dict(list1, list2)__ function was defined to zip two lists together and then convert the zipped object into a dictionary.
- __Employing list comprehensions__ to create a list of dictionaries from a list of lists, using a predefined function __lists2dict()__. This approach simplifies the process of converting row data into a structured format.
- __Converting a list of dictionaries into a pandas DataFrame__ to facilitate data analysis. By using __pd.DataFrame(list_of_dicts)__, you transformed the list of dictionaries into a DataFrame, making it easier to manipulate and analyze the data.

In [4]:
import pandas as pd
import matplotlib.pyplot as plt

In [5]:
# Initialize reader object: urb_pop_reader
urb_pop_reader = pd.read_csv('ind_pop_data.csv', chunksize=1000)

# Initialize empty DataFrame: data
data = pd.DataFrame()

# Iterate over each DataFrame chunk
for df_urb_pop in urb_pop_reader:

    # Check out specific country (copy so a new column can be added safely): df_pop_ceb
    df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == 'CEB'].copy()

    # Zip DataFrame columns of interest: pops
    pops = zip(df_pop_ceb['Total Population'],
               df_pop_ceb['Urban population (% of total)'])

    # Turn zip object into list: pops_list
    pops_list = list(pops)

    # Use list comprehension to create new DataFrame column 'Total Urban Population'
    df_pop_ceb['Total Urban Population'] = [
        int(tup[0] * tup[1] * 0.01) for tup in pops_list]

    # Concatenate DataFrame chunk to the end of data: data
    data = pd.concat([data, df_pop_ceb])

# Plot urban population data
data.plot(kind='scatter', x='Year', y='Total Urban Population')
plt.show()

FileNotFoundError: [Errno 2] No such file or directory: 'ind_pop_data.csv'

In [None]:
# Define plot_pop()
def plot_pop(filename, country_code):
    # Initialize reader object: urb_pop_reader
    urb_pop_reader = pd.read_csv(filename, chunksize=1000)

    # Initialize empty DataFrame: data
    data = pd.DataFrame()

    # Iterate over each DataFrame chunk
    for df_urb_pop in urb_pop_reader:
        # Check out specific country (copy so a new column can be added safely): df_pop_ceb
        df_pop_ceb = df_urb_pop[df_urb_pop['CountryCode'] == country_code].copy()

        # Zip DataFrame columns of interest: pops
        pops = zip(df_pop_ceb['Total Population'],
                   df_pop_ceb['Urban population (% of total)'])

        # Turn zip object into list: pops_list
        pops_list = list(pops)

        # Use list comprehension to create new DataFrame column 'Total Urban Population'
        df_pop_ceb['Total Urban Population'] = [
            int(tup[0] * tup[1] * 0.01) for tup in pops_list]

        # Concatenate DataFrame chunk to the end of data: data
        data = pd.concat([data, df_pop_ceb])

    # Plot urban population data
    data.plot(kind='scatter', x='Year', y='Total Urban Population')
    plt.show()


# Set the filename: fn
fn = 'ind_pop_data.csv'

# Call plot_pop for country code 'CEB'
plot_pop(fn, 'CEB')

# Call plot_pop for country code 'ARB'
plot_pop(fn, 'ARB')