# Data Serialization Formats - Cumulative Lab

## Introduction

Now that you have learned about CSV and JSON file formats individually, it's time to bring them together with a cumulative lab! Even as a junior data scientist, you can often produce novel, interesting analyses by combining multiple datasets that haven't been combined before.

## Objectives

You will be able to:

* Practice reading serialized JSON and CSV data from files into Python objects
* Practice extracting information from nested data structures
* Practice cleaning data (filtering, normalizing locations, converting types)
* Combine data from multiple sources into a single data structure
* Interpret descriptive statistics and data visualizations to present your findings

## Your Task: Analyze the Relationship between Population and World Cup Performance

![Russia 2018 branded soccer ball and trophy](images/world_cup.jpg)

<span>Photo by <a href="https://unsplash.com/@fznsr_?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Fauzan Saari</a> on <a href="https://unsplash.com/s/photos/soccer-world-cup?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a></span>

### Business Understanding

#### What is the relationship between the population of a country and their performance in the 2018 FIFA World Cup?

Intuitively, we might assume that countries with larger populations would have better performance in international sports competitions. While this has been demonstrated to be [true for the Olympics](https://www.researchgate.net/publication/308513557_Medals_at_the_Olympic_Games_The_Relationship_Between_Won_Medals_Gross_Domestic_Product_Population_Size_and_the_Weight_of_Sportive_Practice), the results for the FIFA World Cup are more mixed:

<p><a href="https://commons.wikimedia.org/wiki/File:World_cup_countries_best_results_and_hosts.PNG#/media/File:World_cup_countries_best_results_and_hosts.PNG"><img src="https://upload.wikimedia.org/wikipedia/commons/b/b7/World_cup_countries_best_results_and_hosts.PNG" alt="World cup countries best results and hosts.PNG" height="563" width="1280"></a><br><a href="http://creativecommons.org/licenses/by-sa/3.0/" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=578740">Link</a></p>

In this analysis, we are going to look specifically at the sample of World Cup games in 2018 and the corresponding 2018 populations of the participating nations, to determine the relationship between population and World Cup performance for that year.

### Data Understanding

The data for this analysis comes from two separate files.

#### `world_cup_2018.json`

* **Source**: This dataset comes from [`football.db`](http://openfootball.github.io/), a "free and open public domain football database & schema for use in any (programming) language"
* **Contents**: Data about all games in the 2018 World Cup, including date, location (city and stadium), teams, goals scored (and by whom), and tournament group
* **Format**: Nested JSON data (dictionary containing a list of rounds, each of which contains a list of matches, each of which contains information about the teams involved and the goals scored)

#### `country_populations.csv`

* **Source**: This dataset comes from a curated collection by [DataHub.io](https://datahub.io/core/population), originally sourced from the World Bank
* **Contents**: Data about populations by country for all available years from 1960 to 2018
* **Format**: CSV data, where each row contains a country name, a year, and a population

### Requirements

#### 1. List of Teams in 2018 World Cup

Create an alphabetically-sorted list of teams who competed in the 2018 FIFA World Cup.

#### 2. Associating Countries with 2018 World Cup Performance

Create a data structure that connects a team name (country name) to its performance in the 2018 FIFA World Cup. We'll use the count of games won in the entire tournament (group stage as well as knockout stage) to represent the performance.

This will let us create visualizations that help the reader understand the distribution of games won and the performance of each team.

#### 3. Associating Countries with 2018 Population

Add to the existing data structure so that it also connects each country name to its 2018 population, and create visualizations comparable to those from step 2.

#### 4. Analysis of Population vs. Performance

Choose an appropriate statistical measure to analyze the relationship between population and performance, and create a visualization representing this relationship.

### Checking for Understanding

Before moving on to the next step, pause and think about the strategy for this analysis.

Remember, our business question is:

> What is the relationship between the population of a country and their performance in the 2018 FIFA World Cup?

#### Unit of Analysis

First, what is our **unit of analysis**, and what is the **unique identifier**? In other words, what will one record in our final data structure represent, and what attribute uniquely describes it?

.

.

.

*Answer:* 

> What is the relationship between the population of a **country** and their performance in the 2018 FIFA World Cup?

*Our unit of analysis is a* ***country*** *and the unique identifier we'll use is the* ***country name***

#### Features

Next, what **features** are we analyzing? In other words, what attributes of each country are we interested in?

.

.

.

*Answer:* 

> What is the relationship between the **population** of a country and their **performance in the 2018 FIFA World Cup**?

*Our features are* ***2018 population*** *and* ***count of wins in the 2018 World Cup***

#### Dataset to Start With

Finally, which dataset should we **start** with? In this case, any record with missing data is not useful to us, so we want to start with the smaller dataset.

.

.

.

*Answer: Only 32 countries competed in the 2018 World Cup, compared to hundreds of countries in the world, so we should start with the* ***2018 World Cup*** *dataset. Then we can join it with the relevant records from the country population dataset.*
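
As a rough illustration of that join strategy, here is a minimal sketch. The team names, field names (`country`, `year`, `population`), and numbers are all hypothetical placeholders, not values from the real files.

```python
# Hypothetical sample data; names, keys, and numbers are illustrative only
teams = ["Atlantis", "Wakanda"]
population_records = [
    {"country": "Atlantis", "year": "2018", "population": "1000"},
    {"country": "Atlantis", "year": "2017", "population": "990"},
    {"country": "Genovia", "year": "2018", "population": "500"},
    {"country": "Wakanda", "year": "2018", "population": "2000"},
]

# Start from the smaller dataset: one (initially empty) record per team
combined = {team: {} for team in teams}

# Keep only the population rows that match a team and the year of interest
for record in population_records:
    if record["country"] in combined and record["year"] == "2018":
        combined[record["country"]]["population"] = int(record["population"])

print(combined)
```

Because the loop only fills in records that already exist in `combined`, countries that did not compete (here, the made-up "Genovia") never enter the final structure.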

## Getting the Data

Below we import the `json` and `csv` modules, which will be used for reading from `world_cup_2018.json` and `country_populations.csv`, respectively.

In [2]:
# Run this cell without changes
import json
import csv

Next, we open the relevant files. Each file is opened in a `with` block inside the cell that reads it, so it is closed automatically once we're done reading from it.

**Hint:** if your code below is not working (e.g. `ValueError: I/O operation on closed file.`, or you get an empty list or dictionary), try re-running the cell that opens the file, then re-run your code.

### 2018 World Cup Data

In the cell below, use the `json` module to load the data from `world_cup_file` into a dictionary called `world_cup_data`.

In [5]:
# Replace None with appropriate code
with open("data/world_cup_2018.json", encoding="utf8") as world_cup_file:
    # json.load parses the open file directly into Python objects
    world_cup_data = json.load(world_cup_file)

Make sure the `assert` passes, ensuring that `world_cup_data` has the correct type.

In [7]:
# Run this cell without changes

# Check that the overall data structure is a dictionary
assert type(world_cup_data) == dict

# Check that the dictionary has 2 keys, 'name' and 'rounds'
assert list(world_cup_data.keys()) == ['name', 'rounds']
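
For reference, the `json` module has two similarly named parsers: `json.load` reads from a file object, while `json.loads` parses a string. The sketch below uses a made-up miniature document and `io.StringIO` as a stand-in for an open file, so it does not touch the real data.

```python
import io
import json

# Hypothetical miniature document with the same two top-level keys
text = '{"name": "Sample Cup", "rounds": []}'

# json.loads parses a string
from_string = json.loads(text)

# json.load reads from a file-like object; io.StringIO stands in for an open file
from_file = json.load(io.StringIO(text))

print(type(from_file))
print(list(from_file.keys()))
```

Both calls produce the same dictionary, so there is no need to convert loaded data back to a string before using it.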

### Population Data

Now use the `csv` module to load the data from `population_file` into a list of dictionaries called `population_data`.

(Recall that you can convert a `csv.DictReader` object into a list of dictionaries using the built-in `list()` function.)
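
A minimal sketch of that conversion, assuming a tiny in-memory CSV. The column names `country`, `year`, and `population` are placeholders, and the real file's header may differ.

```python
import csv
import io

# Hypothetical CSV text; the real file's column names may differ
sample_csv = "country,year,population\nAtlantis,2017,990\nAtlantis,2018,1000\n"

# io.StringIO stands in for an open file object
reader = csv.DictReader(io.StringIO(sample_csv))

# list() consumes the reader, producing one dictionary per data row
rows = list(reader)

print(len(rows))
print(rows[0]["country"], rows[0]["population"])
```

Note that every value comes back as a string, which is why type conversion appears among this lab's cleaning steps.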

In [8]:
# Replace None with appropriate code
with open("data/country_populations.csv") as population_file:
    # Each row becomes a dictionary keyed by the column names in the header
    population_data = list(csv.DictReader(population_file))

# The with block closes the file now that we're done reading from it

In [9]:
type(population_data)

list


Make sure the `assert`s pass, ensuring that `population_data` has the correct type.

In [11]:
# Run this cell without changes

# Check that the overall data structure is a list
assert type(population_data) == list

# Check that the 0th element is a dictionary
# (csv.DictReader interface differs slightly by Python version;
# either a dict or an OrderedDict is fine here)
from collections import OrderedDict
assert type(population_data[0]) == dict or type(population_data[0]) == OrderedDict

## 1. List of Teams in 2018 World Cup

> Create an alphabetically-sorted list of teams who competed in the 2018 FIFA World Cup.

This will take several steps, some of which have been completed for you.

### Exploring the Structure of the World Cup Data JSON

Let's start by exploring the structure of `world_cup_data`. Here is a pretty-printed preview of its contents:

```
{
  "name": "World Cup 2018",
  "rounds": [
    {
      "name": "Matchday 1",
      "matches": [
        {
          "num": 1,
          "date": "2018-06-14",
          "time": "18:00",
          "team1": { "name": "Russia",       "code": "RUS" },
          "team2": { "name": "Saudi Arabia", "code": "KSA" },
          "score1":  5,
          "score2":  0,
          "score1i": 2,
          "score2i": 0,
          "goals1": [
            { "name": "Gazinsky",   "minute": 12,              "score1": 1, "score2": 0 },
            { "name": "Cheryshev",  "minute": 43,              "score1": 2, "score2": 0 },
            { "name": "Dzyuba",     "minute": 71,              "score1": 3, "score2": 0 },
            { "name": "Cheryshev",  "minute": 90, "offset": 1, "score1": 4, "score2": 0 },
            { "name": "Golovin",    "minute": 90, "offset": 4, "score1": 5, "score2": 0 }
          ],
          "goals2": [],
          "group": "Group A",
          "stadium": { "key": "luzhniki", "name": "Luzhniki Stadium" },
          "city": "Moscow",
          "timezone": "UTC+3"
        }
      ]
    },
    {
      "name": "Matchday 2",
      "matches": [
        {
          "num": 2,
          "date": "2018-06-15",
          "time": "17:00",
          "team1": { "name": "Egypt",   "code": "EGY" },
          "team2": { "name": "Uruguay", "code": "URU" },
          "score1":  0,
          "score2":  1,
          "score1i": 0,
          "score2i": 0,
          "goals1": [],
          "goals2": [
            { "name": "Giménez",  "minute": 89,  "score1": 0, "score2": 1 }
          ],
          "group": "Group A",
          "stadium": { "key": "ekaterinburg", "name": "Ekaterinburg Arena" },          
          "city": "Ekaterinburg",
          "timezone": "UTC+5"
        },
        ...
      ],
    },
  ],  
}
```

As noted previously, `world_cup_data` is a dictionary with two keys, 'name' and 'rounds'.

In [12]:
# Run this cell without changes
world_cup_data.keys()

dict_keys(['name', 'rounds'])

The value associated with the 'name' key simply identifies the dataset.

In [13]:
# Run this cell without changes
world_cup_data["name"]

'World Cup 2018'

### Extracting Rounds

The value associated with the 'rounds' key is a list containing all of the actual information about the rounds and the matches within those rounds.

In [14]:
# Run this cell without changes
rounds = world_cup_data["rounds"]

print("type(rounds):", type(rounds))
print("len(rounds):", len(rounds))
print("type(rounds[3])", type(rounds[3]))
print("rounds[3]:")
rounds[3]

type(rounds): <class 'list'>
len(rounds): 20
type(rounds[3]) <class 'dict'>
rounds[3]:


{'name': 'Matchday 4',
 'matches': [{'num': 9,
   'date': '2018-06-17',
   'time': '21:00',
   'team1': {'name': 'Brazil', 'code': 'BRA'},
   'team2': {'name': 'Switzerland', 'code': 'SUI'},
   'score1': 1,
   'score2': 1,
   'score1i': 1,
   'score2i': 0,
   'goals1': [{'name': 'Coutinho', 'minute': 20, 'score1': 1, 'score2': 0}],
   'goals2': [{'name': 'Zuber', 'minute': 50, 'score1': 1, 'score2': 1}],
   'group': 'Group E',
   'stadium': {'key': 'rostov', 'name': 'Rostov Arena'},
   'city': 'Rostov-on-Don',
   'timezone': 'UTC+3'},
  {'num': 10,
   'date': '2018-06-17',
   'time': '16:00',
   'team1': {'name': 'Costa Rica', 'code': 'CRC'},
   'team2': {'name': 'Serbia', 'code': 'SRB'},
   'score1': 0,
   'score2': 1,
   'score1i': 0,
   'score2i': 0,
   'goals1': [],
   'goals2': [{'name': 'Kolarov', 'minute': 56, 'score1': 0, 'score2': 1}],
   'group': 'Group E',
   'stadium': {'key': 'samara', 'name': 'Samara Arena'},
   'city': 'Samara',
   'timezone': 'UTC+4'},
  {'num': 11,
   

Translating this output into English:

Starting with the original `world_cup_data` dictionary, we used the key `"rounds"` to extract a list of rounds, which we assigned to the variable `rounds`.

`rounds` is a list of dictionaries. Each dictionary inside of `rounds` contains a name (e.g. `"Matchday 4"`) as well as a list of matches.
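
To make that shape concrete, here is a small sketch on a hypothetical miniature structure with the same nesting. The names and match numbers are invented for illustration.

```python
# Hypothetical miniature version of the structure described above
sample = {
    "name": "Sample Cup",
    "rounds": [
        {"name": "Matchday 1", "matches": [{"num": 1}, {"num": 2}]},
        {"name": "Matchday 2", "matches": [{"num": 3}]},
    ],
}

# Each round is a dictionary with a name and a list of matches
summary = [(round_["name"], len(round_["matches"])) for round_ in sample["rounds"]]
print(summary)
```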

### Extracting Matches

Now we can go one level deeper and extract all of the matches in the tournament. Because the round is irrelevant for this analysis, we can loop over all rounds and combine all of their matches into a single list.

**Hint:** This is a good use case for using the `.extend` list method rather than `.append`, since we want to combine several lists of dictionaries into a single list of dictionaries, not a list of lists of dictionaries. [Documentation here.](https://docs.python.org/3/tutorial/datastructures.html#more-on-lists)
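
The difference between the two methods can be seen in a short sketch on made-up data:

```python
# Hypothetical batches of records, standing in for per-round match lists
batches = [[{"num": 1}, {"num": 2}], [{"num": 3}]]

appended = []
extended = []
for batch in batches:
    appended.append(batch)  # adds the whole list as a single element
    extended.extend(batch)  # adds each dictionary individually

print(len(appended), type(appended[0]))
print(len(extended), type(extended[0]))
```

With `.append` the result has one element per batch (a list of lists), while `.extend` yields one element per record (a flat list of dictionaries).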

In [15]:
# Replace None with appropriate code
matches = []

# "round" is a built-in function in Python so we use "round_" instead
for round_ in rounds:
    # Extract the list of matches for this round
    round_matches = round_["matches"]
    # Add them to the overall list of matches
    matches.extend(round_matches)

matches[0]

{'num': 1,
 'date': '2018-06-14',
 'time': '18:00',
 'team1': {'name': 'Russia', 'code': 'RUS'},
 'team2': {'name': 'Saudi Arabia', 'code': 'KSA'},
 'score1': 5,
 'score2': 0,
 'score1i': 2,
 'score2i': 0,
 'goals1': [{'name': 'Gazinsky', 'minute': 12, 'score1': 1, 'score2': 0},
  {'name': 'Cheryshev', 'minute': 43, 'score1': 2, 'score2': 0},
  {'name': 'Dzyuba', 'minute': 71, 'score1': 3, 'score2': 0},
  {'name': 'Cheryshev', 'minute': 90, 'offset': 1, 'score1': 4, 'score2': 0},
  {'name': 'Golovin', 'minute': 90, 'offset': 4, 'score1': 5, 'score2': 0}],
 'goals2': [],
 'group': 'Group A',
 'stadium': {'key': 'luzhniki', 'name': 'Luzhniki Stadium'},
 'city': 'Moscow',
 'timezone': 'UTC+3'}

In [16]:
len(matches)

64

Make sure the `assert`s pass before moving on to the next step.

In [17]:
# Run this cell without changes

# There should be 64 matches. If the length is 20, that means
# you have a list of lists instead of a list of dictionaries
assert len(matches) == 64

# Each match in the list should be a dictionary
assert type(matches[0]) == dict

### Extracting Teams

Each match has a `team1` and a `team2`. 

In [18]:
# Run this cell without changes
print(type(matches[0]["team1"]))
print(matches[0]["team2"])

<class 'dict'>
{'name': 'Saudi Arabia', 'code': 'KSA'}


Create a list of all unique team names by looping over every match in `matches` and adding the `"name"` values associated with both `team1` and `team2`. (As before, when we created the list of matches, it doesn't matter right now whether a given team was "team1" or "team2"; we just add everything to `teams`.)

We'll use a `set` data type ([documentation here](https://docs.python.org/3/library/stdtypes.html#set-types-set-frozenset)) to ensure unique teams, then convert it to a sorted list at the end.
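
The general pattern, sketched on hypothetical matches with invented team names, looks like this:

```python
# Hypothetical matches; only the fields needed here are included
sample_matches = [
    {"team1": {"name": "Wakanda"}, "team2": {"name": "Atlantis"}},
    {"team1": {"name": "Atlantis"}, "team2": {"name": "Genovia"}},
]

names = set()
for match in sample_matches:
    # Adding a name that is already present leaves the set unchanged
    names.add(match["team1"]["name"])
    names.add(match["team2"]["name"])

# sorted() accepts any iterable and always returns a list
unique_sorted = sorted(names)
print(unique_sorted)
```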

In [38]:
# Replace None with appropriate code
teams_set = set()

for match in matches:
    # Add team1 name value to teams_set
    teams_set.add(match["team1"]["name"])
    # Add team2 name value to teams_set
    teams_set.add(match["team2"]["name"])

# Convert the set to an alphabetically sorted list
teams = sorted(teams_set)

Make sure the `assert`s pass before moving on to the next step.

In [None]:
# Run this cell without changes

# teams should be a list, not a set
assert type(teams) == list

# 32 teams competed in the 2018 World Cup
assert len(teams) == 32

# Each element of teams should be a string
# (the name), not a dictionary
assert type(teams[0]) == str

Great, step 1 complete! We have unique identifiers (names) for each of our records (countries) that we will be able to use to connect 2018 World Cup performance to 2018 population.

## 2. Associating Countries with 2018 World Cup Performance

> Create a data structure that connects a team name (country name) to its performance in the 2018 FIFA World Cup. We'll use the count of games won in the entire tournament (group stage as well as knockout stage) to represent the performance.

> Also, create visualizations to help the reader understand the distribution of games won and the performance of each team.

So, we are building a **data structure** that connects a country name to the number of wins. There is no universal correct format for a data structure with this purpose, but we are going to use a format that resembles the "dataframe" format that will be introduced later in the course.

Specifically, we'll build a **dictionary** where each key is the name of a country, and each value is a nested dictionary containing information about the number of wins and the 2018 population.

The final result will look something like this:
```
{
  'Argentina': { 'wins': 1, 'population': 44494502 },
  ...
  'Uruguay':   { 'wins': 4, 'population': 3449299  }
}
```

For the current step (step 2), we'll build a data structure that looks something like this:
```
{
  'Argentina': { 'wins': 1 },
  ...
  'Uruguay':   { 'wins': 4 }
}
```

### Initializing with Wins Set to Zero

Start by initializing a dictionary called `combined_data` containing:

* Keys: the strings from `teams`
* Values: each value the same, a dictionary containing the key `'wins'` with the associated value `0`. However, note that each value should be a distinct dictionary object in memory, not the same dictionary linked as a value in multiple places.

Initially `combined_data` will look something like this:
```
{
  'Argentina': { 'wins': 0 },
  ...
  'Uruguay':   { 'wins': 0 }
}
```
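
The "distinct dictionary object" requirement is easy to get wrong, so here is a minimal sketch of the pitfall. It assumes a hypothetical three-team list (`sample_teams`) rather than the real `teams`, and compares `dict.fromkeys`, which reuses one value object for every key, with a dictionary comprehension, which builds a new value each time.

```python
# Hypothetical three-team list, used only for illustration
sample_teams = ["Argentina", "Japan", "Uruguay"]

# Pitfall: dict.fromkeys reuses ONE dictionary object for every key
shared = dict.fromkeys(sample_teams, {"wins": 0})
shared["Japan"]["wins"] += 1

# A dict comprehension evaluates {"wins": 0} once per team,
# so each team gets its own dictionary
distinct = {team: {"wins": 0} for team in sample_teams}
distinct["Japan"]["wins"] += 1

print(shared["Argentina"])    # {'wins': 1} (changed by accident)
print(distinct["Argentina"])  # {'wins': 0} (unaffected)
```

With the shared version, a win recorded for one team would show up for all of them, which is why the comprehension form is the safer choice here.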

In [50]:
# Build one new {"wins": 0} dictionary per team, so that no two
# teams share the same dictionary object in memory
combined_data = {team: {"wins": 0} for team in teams}



Check that the `assert`s pass.

In [None]:
# Run this cell without changes

# combined_data should be a dictionary
assert type(combined_data) == dict

# the keys should be strings
assert type(list(combined_data.keys())[0]) == str

# the values should be dictionaries
assert combined_data["Japan"] == {"wins": 0}

### Adding Wins from Matches

Now it's time to revisit the `matches` list from earlier, in order to associate a team with the number of times it has won a match.

This time, let's write some functions to help organize our logic.

Write a function `find_winner` that takes in a `match` dictionary, and returns the name of the team that won the match.  Recall that a match is structured like this:

```
{
  'num': 1,
  'date': '2018-06-14',
  'time': '18:00',
  'team1': { 'name': 'Russia',       'code': 'RUS' },
  'team2': { 'name': 'Saudi Arabia', 'code': 'KSA' },
  'score1': 5,
  'score2': 0,
  'score1i': 2,
  'score2i': 0,
  'goals1': [
    { 'name': 'Gazinsky',  'minute': 12, 'score1': 1, 'score2': 0 },
    { 'name': 'Cheryshev', 'minute': 43, 'score1': 2, 'score2': 0 },
    { 'name': 'Dzyuba',    'minute': 71, 'score1': 3, 'score2': 0 },
    { 'name': 'Cheryshev', 'minute': 90, 'offset': 1, 'score1': 4, 'score2': 0 },
    { 'name': 'Golovin',   'minute': 90, 'offset': 4, 'score1': 5, 'score2': 0 }
  ],
  'goals2': [],
  'group': 'Group A',
  'stadium': { 'key': 'luzhniki', 'name': 'Luzhniki Stadium' },
  'city': 'Moscow',
  'timezone': 'UTC+3'
}
```

The winner is determined by comparing the values associated with the `'score1'` and `'score2'` keys. If `'score1'` is larger, the name associated with the `'team1'` key is the winner. If `'score2'` is larger, the name associated with the `'team2'` key is the winner. If the values are the same, there is no winner, so return `None`. (Unlike the World Cup group stage, which awards 3 points for a win and 1 point for a tie, we count only *wins* as our "performance" construct.)

In [None]:
# Replace None with appropriate code

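def find_winner(match):
    """
    Given a dictionary containing information about a match,
    return the name of the winner (or None in the case of a tie)
    """
    score1 = match["score1"]
    score2 = match["score2"]
    if score1 > score2:
        return match["team1"]["name"]
    elif score2 > score1:
        return match["team2"]["name"]
    else:
        return None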



In [None]:
# Run this cell without changes
assert find_winner(matches[0]) == "Russia"
assert find_winner(matches[1]) == "Uruguay"
assert find_winner(matches[2]) == None

Now that we have this helper function, loop over every match in `matches`, find the winner, and add 1 to the associated count of wins in `combined_data`. If the winner is `None`, skip adding it to the dictionary.

In [None]:
# Replace None with appropriate code

for match in matches:
    # Get the name of the winner
    winner = None
    # Only proceed to the next step if there was
    # a winner
    if winner:
        # Add 1 to the associated count of wins
        None
        
# Visually inspect the output to ensure the wins are
# different for different countries
combined_data
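
One possible shape for the tallying loop is sketched below on a small scale. The names `winner_name`, `sample_matches`, and `sample_data` are hypothetical stand-ins for `find_winner`, `matches`, and `combined_data`, and the match records are abbreviated to the keys the loop actually reads.

```python
def winner_name(match):
    """Return the winning team's name, or None for a tie."""
    if match["score1"] > match["score2"]:
        return match["team1"]["name"]
    if match["score2"] > match["score1"]:
        return match["team2"]["name"]
    return None

# Abbreviated, illustrative match records (only the keys used here)
sample_matches = [
    {"team1": {"name": "Russia"}, "team2": {"name": "Saudi Arabia"},
     "score1": 5, "score2": 0},
    {"team1": {"name": "Egypt"}, "team2": {"name": "Uruguay"},
     "score1": 0, "score2": 1},
    {"team1": {"name": "Portugal"}, "team2": {"name": "Spain"},
     "score1": 3, "score2": 3},
]

# Start every team that appears in the sample at zero wins
sample_data = {}
for match in sample_matches:
    for key in ("team1", "team2"):
        sample_data[match[key]["name"]] = {"wins": 0}

for match in sample_matches:
    winner = winner_name(match)
    # A tie returns None, which is falsy, so it is skipped
    if winner:
        sample_data[winner]["wins"] += 1

print(sample_data)
```

Because `None` is falsy, the `if winner:` check is enough to skip tied matches without a separate comparison.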

### Analysis of Wins

While we could try to understand all 32 of those numbers just by scanning through them, let's use some descriptive statistics and data visualizations instead!

#### Statistical Summary of Wins

The code below calculates the mean, median, and standard deviation of the number of wins. If it doesn't work, that is an indication that something went wrong with the creation of the `combined_data` variable, and you might want to look at the solution branch and fix your code before proceeding.

In [None]:
# Run this cell without changes
import numpy as np

wins = [val["wins"] for val in combined_data.values()]

print("Mean number of wins:", np.mean(wins))
print("Median number of wins:", np.median(wins))
print("Standard deviation of number of wins:", np.std(wins))
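
A side note on the standard deviation: `np.std` divides by `n` by default (`ddof=0`), which gives the *population* standard deviation. The sketch below uses Python's built-in `statistics` module on a made-up list of win counts (`toy_wins`, not the tournament data) to show how that differs from the sample version, which divides by `n - 1`.

```python
import statistics

# Made-up win counts, for illustration only
toy_wins = [0, 1, 1, 2, 6]

mean_wins = statistics.mean(toy_wins)
median_wins = statistics.median(toy_wins)

# Population standard deviation (divides by n), analogous to
# np.std with its default ddof=0
pop_std = statistics.pstdev(toy_wins)

# Sample standard deviation (divides by n - 1), analogous to
# np.std(..., ddof=1)
sample_std = statistics.stdev(toy_wins)

print("Mean:", mean_wins)
print("Median:", median_wins)
print("Population std:", pop_std)
print("Sample std:", sample_std)
```

On a skewed list like this one, the mean sits above the median because a single large value pulls it upward, which is the same pattern to look for in the real win counts.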

#### Visualizations of Wins

In addition to those numbers, let's make a histogram (showing the distribution of the number of wins) and a bar graph (showing the number of wins by country).

In [None]:
# Run this cell without changes
import matplotlib.pyplot as plt

# Set up figure and axes
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 7))
fig.set_tight_layout(True)

# Histogram of Wins and Frequencies
ax1.hist(x=wins, bins=range(8), align="left", color="green")
ax1.set_xticks(range(7))
ax1.set_xlabel("Wins in 2018 World Cup")
ax1.set_ylabel("Frequency")
ax1.set_title("Distribution of Wins")

# Horizontal Bar Graph of Wins by Country
ax2.barh(teams[::-1], wins[::-1], color="green")
ax2.set_xlabel("Wins in 2018 World Cup")
ax2.set_title("Wins by Country");

#### Interpretation of Win Analysis

Before we move to looking at the relationship between wins and population, it's useful to understand the distribution of wins alone. A few notes of interpretation:

* The number of wins is skewed and looks like a [negative binomial distribution](https://en.wikipedia.org/wiki/Negative_binomial_distribution), which makes sense conceptually
* The "typical" value here is 1 (both the median and the highest point of the histogram), meaning a typical team that qualifies for the World Cup wins once
* There are a few teams we might consider outliers: Belgium and France, with 6x the wins of the "typical" team and 1.5x the wins of the next "runner-up" (Uruguay, with 4 wins)
* This is a fairly small dataset, something that becomes more noticeable with such a "spiky" (not smooth) histogram


## 3. Associating Countries with 2018 Population

> Add to the existing data structure so that it also connects each country name to its 2018 population, and create visualizations comparable to those from step 2.

Now we're ready to add the 2018 population to `combined_data`, finally using the CSV file!

Recall that `combined_data` currently looks something like this:
```
{
  'Argentina': { 'wins': 1 },
  ...
  'Uruguay':   { 'wins': 4 }
}
```

And the goal is for it to look something like this:
```
{
  'Argentina': { 'wins': 1, 'population': 44494502 },
  ...
  'Uruguay':   { 'wins': 4, 'population': 3449299  }
}
```

To do that, we need to extract the 2018 population information from the CSV data.

### Exploring the Structure of the Population Data CSV

Recall that previously we loaded information from a CSV containing population data into a list of dictionaries called `population_data`.

In [None]:
# Run this cell without changes
len(population_data)

12,695 is a very large number of rows to print out, so let's look at some samples instead.

In [None]:
# Run this cell without changes
np.random.seed(42)
population_record_samples = np.random.choice(population_data, size=10)
population_record_samples

There are **2 filtering tasks**, **1 data normalization task**, and **1 type conversion task** to be completed, based on what we can see in this sample. We'll walk through each of them below.

(In a more realistic data cleaning environment, you most likely won't happen to get a sample that demonstrates all of the data cleaning steps needed, but this sample was chosen carefully for example purposes.)

### Filtering Population Data

We already should have suspected that this dataset would require some filtering, since there are 32 records in our current `combined_data` dataset and 12,695 records in `population_data`. Now that we have looked at this sample, we can identify 2 features we'll want to use in order to filter down the `population_data` records to just 32. Try to identify them before looking at the answer below.

.

.

.

*Answer: the two features to filter on are* ***`'Country Name'`*** *and* ***`'Year'`***. *We can see from the sample above that there are countries in `population_data` that are not present in `combined_data` (e.g. Malta) and there are years present that are not 2018.*

In the cell below, create a new variable `population_data_filtered` that only includes relevant records from `population_data`. Relevant records are records where the country name is one of the countries in the `teams` list, and the year is "2018".

(It's okay to leave 2018 as a string, since we are not performing any math operations on it. Just make sure you check for `"2018"` and not `2018`.)
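
The two-condition filter described above can be sketched on a toy list first. Everything here is a hypothetical stand-in: `sample_teams` and `sample_population_data` replace `teams` and `population_data`, and the `"0"` values are placeholders rather than real population figures.

```python
# Hypothetical stand-ins for `teams` and `population_data`
sample_teams = ["Argentina", "Uruguay"]
sample_population_data = [
    {"Country Name": "Argentina", "Year": "2018", "Value": "44494502"},
    {"Country Name": "Argentina", "Year": "2017", "Value": "0"},  # placeholder
    {"Country Name": "Malta", "Year": "2018", "Value": "0"},      # placeholder
    {"Country Name": "Uruguay", "Year": "2018", "Value": "3449299"},
]

sample_filtered = []
for record in sample_population_data:
    # Keep the record only if BOTH conditions hold
    is_team = record["Country Name"] in sample_teams
    is_2018 = record["Year"] == "2018"  # string comparison, not 2018
    if is_team and is_2018:
        sample_filtered.append(record)

print(len(sample_filtered))
```

Comparing against the integer `2018` would silently match nothing, because `"2018" == 2018` is `False` in Python.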

In [None]:
# Replace None with appropriate code

population_data_filtered = []

for record in population_data:
    # Add record to population_data_filtered if relevant
    None
    
len(population_data_filtered) # 27

Hmm...what went wrong? Why do we only have 27 records, and not 32?

Did we really get a dataset with 12k records that's missing 5 of the data points we need?

Let's take a closer look at the population data samples again, specifically the third one:

In [None]:
# Run this cell without changes
population_record_samples[2]

And compare that with the value for Iran in `teams`:

In [None]:
# Run this cell without changes
teams[13]

Ohhhh...we have a data normalization issue! One dataset refers to this country as `'Iran, Islamic Rep.'`, while the other refers to it as `'Iran'`. This is a common issue we face when using data about countries and regions, where there is no universally-accepted naming convention.

### Normalizing Locations in Population Data

Sometimes data normalization can be a very time-consuming task, where you need to find "crosswalk" data that can link the two formats together, or you need to write complex regular expressions to line everything up.

For this task, there are only 5 missing, so we'll just go ahead and give you a function that makes the appropriate substitutions.

In [None]:
# Run this cell without changes
def normalize_location(country_name):
    """
    Given a country name, return the name that the
    country uses when playing in the FIFA World Cup
    """
    name_sub_dict = {
        "Russian Federation": "Russia",
        "Egypt, Arab Rep.": "Egypt",
        "Iran, Islamic Rep.": "Iran",
        "Korea, Rep.": "South Korea",
        "United Kingdom": "England"
    }
    # The .get method returns the corresponding value from
    # the dict if present, otherwise returns country_name
    return name_sub_dict.get(country_name, country_name)

# Example where normalized location is different
print(normalize_location("Russian Federation"))
# Example where normalized location is the same
print(normalize_location("Argentina"))

Now, write new code to create `population_data_filtered` with normalized country names.

In [None]:
# Replace None with appropriate code

population_data_filtered = []

for record in population_data:
    # Get normalized country name
    None
    # Add record to population_data_filtered if relevant
    if None:
        # Replace the country name in the record
        None
        # Append to list
        None
        
len(population_data_filtered) # 32

Great, now we should have 32 records instead of 27!
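
One way the normalized filter could be written is sketched below on toy data. The names `name_subs`, `normalize`, `sample_teams`, and `sample_records` are hypothetical, the substitution dictionary is abbreviated to two entries, and the `"0"` values are placeholders rather than real population figures.

```python
# Abbreviated substitution table (the full function has five entries)
name_subs = {
    "Russian Federation": "Russia",
    "Iran, Islamic Rep.": "Iran",
}

def normalize(country_name):
    # Fall back to the original name when no substitution exists
    return name_subs.get(country_name, country_name)

sample_teams = ["Argentina", "Iran", "Russia"]
sample_records = [
    {"Country Name": "Russian Federation", "Year": "2018", "Value": "0"},  # placeholder
    {"Country Name": "Argentina", "Year": "2018", "Value": "44494502"},
    {"Country Name": "Iran, Islamic Rep.", "Year": "2017", "Value": "0"},  # placeholder
    {"Country Name": "Malta", "Year": "2018", "Value": "0"},               # placeholder
]

sample_filtered = []
for record in sample_records:
    # Normalize BEFORE testing membership in the team list
    normalized_name = normalize(record["Country Name"])
    if normalized_name in sample_teams and record["Year"] == "2018":
        # Store the normalized name so later lookups by team name work
        record["Country Name"] = normalized_name
        sample_filtered.append(record)

print([record["Country Name"] for record in sample_filtered])
```

The important design point is the order of operations: the membership test uses the normalized name, so `"Russian Federation"` is recognized as `"Russia"` before the filter decides whether to keep it.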

### Type Conversion of Population Data

We need to do one more thing before we'll have population data that is usable for analysis. Take a look at this record from `population_data_filtered` to see if you can spot it:

In [None]:
# Run this cell without changes
population_data_filtered[0]

Every value has the same data type (`str`), including the population value. In this example, it's `'44494502'`, when it needs to be `44494502` if we want to be able to compute statistics with it.

In the cell below, loop over `population_data_filtered` and convert the data type of the value associated with the `"Value"` key from a string to an integer, using the built-in `int()` function.

In [None]:
# Replace None with appropriate code
for record in population_data_filtered:
    # Convert the population value from str to int
    None
    
# Look at the last record to make sure the population
# value is an int
population_data_filtered[-1]

Check that it worked with the assert statement below:

In [None]:
# Run this cell without changes
assert type(population_data_filtered[-1]["Value"]) == int
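
As a small illustration of the conversion, here is a sketch on two toy records (`sample_records` is a hypothetical stand-in for `population_data_filtered`). It also shows, as a caution, that `int()` raises a `ValueError` when a string is not a plain whole number.

```python
# Hypothetical stand-in for population_data_filtered
sample_records = [
    {"Country Name": "Argentina", "Year": "2018", "Value": "44494502"},
    {"Country Name": "Uruguay", "Year": "2018", "Value": "3449299"},
]

for record in sample_records:
    # Overwrite the string with its integer equivalent
    record["Value"] = int(record["Value"])

print(sample_records[-1])

# int() is strict: a decimal-looking string is rejected
try:
    int("44494502.0")
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(conversion_failed)
```

If a dataset did contain decimal-looking strings, one option would be `int(float(value))`, but that is not needed for the values shown in this lab.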

### Adding Population Data

Now it's time to add the population data to `combined_data`! Recall that the data structure currently looks like this:

In [None]:
# Run this cell without changes
combined_data

The goal is for it to be structured like this:
```
{
  'Argentina': { 'wins': 1, 'population': 44494502 },
  ...
  'Uruguay':   { 'wins': 4, 'population': 3449299  }
}
```

In the cell below, loop over `population_data_filtered` and add information about population to each country in `combined_data`:

In [None]:
# Replace None with appropriate code
for record in population_data_filtered:
    # Extract the country name from the record
    country = None
    # Extract the population value from the record
    population = None
    # Add this information to combined_data
    None
    
# Look at combined_data
combined_data

Check that the types are correct with these assert statements:

In [None]:
# Run this cell without changes
assert type(combined_data["Uruguay"]) == dict
assert type(combined_data["Uruguay"]["population"]) == int
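
The merge step can be sketched on a two-country example. Here `sample_combined` and `sample_records` are hypothetical stand-ins for `combined_data` and `population_data_filtered`, using the two countries shown in the target structure above.

```python
# Hypothetical stand-ins for combined_data and population_data_filtered
sample_combined = {
    "Argentina": {"wins": 1},
    "Uruguay": {"wins": 4},
}
sample_records = [
    {"Country Name": "Argentina", "Year": "2018", "Value": 44494502},
    {"Country Name": "Uruguay", "Year": "2018", "Value": 3449299},
]

for record in sample_records:
    country = record["Country Name"]
    population = record["Value"]
    # Add a new key to the existing nested dictionary;
    # the "wins" key already stored there is left untouched
    sample_combined[country]["population"] = population

print(sample_combined)
```

This works only because the country names were normalized earlier: each `"Country Name"` must match a key in the dictionary exactly, or the lookup raises a `KeyError`.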

### Analysis of Population

Let's perform the same analysis for population that we performed for count of wins.

#### Statistical Analysis of Population

In [None]:
# Run this cell without changes
populations = [val["population"] for val in combined_data.values()]

print("Mean population:", np.mean(populations))
print("Median population:", np.median(populations))
print("Standard deviation of population:", np.std(populations))

#### Visualizations of Population

In [None]:
# Run this cell without changes

# Set up figure and axes
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 7))
fig.set_tight_layout(True)

# Histogram of Populations and Frequencies
ax1.hist(x=populations, color="blue")
ax1.set_xlabel("2018 Population")
ax1.set_ylabel("Frequency")
ax1.set_title("Distribution of Population")

# Horizontal Bar Graph of Population by Country
ax2.barh(teams[::-1], populations[::-1], color="blue")
ax2.set_xlabel("2018 Population")
ax2.set_title("Population by Country");

#### Interpretation of Population Analysis

* Similar to the distribution of the number of wins, the distribution of population is skewed.
* It's hard to choose a single "typical" value here because there is so much variation.
* The countries with the largest populations (Brazil, Nigeria, and Russia) do not overlap with the countries with the most wins (Belgium, France, and Uruguay)

## 4. Analysis of Population vs. Performance

> Choose an appropriate statistical measure to analyze the relationship between population and performance, and create a visualization representing this relationship.

### Statistical Measure
So far we have learned about only two statistics for understanding the *relationship* between variables: **covariance** and **correlation**. We will use correlation here, because that provides a more standardized, interpretable metric.
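
To make "more standardized" concrete, here is a minimal pure-Python sketch of the Pearson correlation coefficient, which is the covariance divided by the product of the two standard deviations. The function names and the toy lists are hypothetical and unrelated to the World Cup data.

```python
import math

def covariance(xs, ys):
    """Population covariance of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def pearson_r(xs, ys):
    """Covariance scaled by both standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
ys_rescaled = [y * 1000 for y in ys]  # same data in different units

# Covariance changes with the units...
print(covariance(xs, ys), covariance(xs, ys_rescaled))
# ...while correlation stays the same and always lies in [-1, 1]
print(pearson_r(xs, ys), pearson_r(xs, ys_rescaled))
```

This is why correlation is easier to interpret when the two variables have very different scales, as wins (single digits) and population (millions) do.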

In [None]:
# Run this cell without changes
np.corrcoef(wins, populations)[0][1]

In the cell below, interpret this number. What direction is this correlation? Is it strong or weak?

In [None]:
# Replace None with appropriate text
"""
None
"""

### Data Visualization

A **scatter plot** is the most sensible form of data visualization for showing this relationship, because we have two dimensions of data, but there is no "increasing" variable (e.g. time) that would indicate we should use a line graph.

In [None]:
# Run this cell without changes

# Set up figure
fig, ax = plt.subplots(figsize=(8, 5))

# Basic scatter plot
ax.scatter(
    x=populations,
    y=wins,
    color="gray", alpha=0.5, s=100
)
ax.set_xlabel("2018 Population")
ax.set_ylabel("2018 World Cup Wins")
ax.set_title("Population vs. World Cup Wins")

# Add annotations for specific points of interest
highlighted_points = {
    "Belgium": 2, # Numbers are the index of that
    "Brazil": 3,  # country in populations & wins
    "France": 10,
    "Nigeria": 17
}
for country, index in highlighted_points.items():
    # Get x and y position of data point
    x = populations[index]
    y = wins[index]
    # Move each point slightly down and to the left
    # (numbers were chosen by manually tweaking)
    xtext = x - (1.25e6 * len(country))
    ytext = y - 0.5
    # Annotate with relevant arguments
    ax.annotate(
        text=country,
        xy=(x, y),
        xytext=(xtext, ytext)
    )
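
The hard-coded indices in `highlighted_points` depend on the alphabetical order of `teams`. As an alternative sketch, the positions could be looked up with `list.index` instead. Here `sample_teams` is a hypothetical, abbreviated copy of the first four names in the sorted list.

```python
# Abbreviated stand-in for the alphabetically sorted `teams` list
sample_teams = ["Argentina", "Australia", "Belgium", "Brazil"]

countries_to_highlight = ["Belgium", "Brazil"]

# Look up each position instead of typing the numbers by hand
highlighted = {
    country: sample_teams.index(country)
    for country in countries_to_highlight
}

print(highlighted)  # {'Belgium': 2, 'Brazil': 3}
```

This keeps the annotations correct even if the list of teams changes, at the cost of a `ValueError` if a highlighted country is missing from the list.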

### Data Visualization Interpretation

Interpret this plot in the cell below. Does this align with the findings from the statistical measure (correlation), as well as the map shown at the beginning of this lab (showing the best results by country)?

In [None]:
# Replace None with appropriate text
"""
None
"""

### Final Analysis

> What is the relationship between the population of a country and their performance in the 2018 FIFA World Cup?

Overall, we found a very weakly positive relationship between the population of a country and their performance in the 2018 FIFA World Cup, as demonstrated by both the correlation between populations and wins, and the scatter plot.

In the cell below, write down your thoughts on these questions:

 - What are your thoughts on why you may see this result?
 - What would you research next?

In [None]:
# Replace None with appropriate text
"""
None
"""

## Summary

Congratulations! That was a long lab, pulling together a lot of material. You read data into Python, extracted the relevant information, cleaned the data, and combined the data into a new format to be used in analysis. While we will continue to introduce new tools and techniques, these essential steps will be present for the rest of your data science projects from here on out!