# Lab-P7: Analyzing Covid Vaccination Data

In [1]:
## Please make sure "vaccinations.csv" is in your "lab7" folder.

import csv
import datetime


## Segment 2: Loading Data from CSVs

### Task 2.1: Process the CSV file
[Chapter 14](https://automatetheboringstuff.com/chapter14/) of Automate the Boring Stuff introduces CSV files and provides a code snippet we can reuse. We will read in the CSV file with the same code we used for p6.

In [2]:
def process_csv(filename):
    example_file = open(filename, encoding="utf-8")
    example_reader = csv.reader(example_file)
    example_data = list(example_reader)
    example_file.close()
    return example_data
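
As a side note, the same reading logic can be written with a `with` block, which closes the file even if an error occurs while reading. The sketch below is only an illustration: the name `process_csv_with`, the file `tiny.csv`, and the country `Atlantis` are made up for the example and are not part of the lab.

```python
import csv
import os
import tempfile

def process_csv_with(filename):
    """Return the rows of a CSV file as a list of lists; the file closes automatically."""
    with open(filename, encoding="utf-8", newline="") as f:
        return list(csv.reader(f))

# Hypothetical miniature file standing in for vaccinations.csv
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tiny.csv")
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write("country,date,total_boosters\nAtlantis,01/25/2022,\n")
    tiny_rows = process_csv_with(path)

tiny_header = tiny_rows[0]   # first inner list holds the column names
tiny_data = tiny_rows[1:]    # everything after the header
print(tiny_header)
print(tiny_data)
```

Note that the empty last field comes back as an empty string, which is how missing data shows up when a CSV is read this way.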


In [3]:
# Use process_csv to pull out the header and data rows

csv_rows = process_csv("vaccinations.csv")

# Use indexing to extract the first inner list

csv_header = csv_rows[0]

# Use slicing to extract all the inner lists, except the first one

csv_data = csv_rows[1:]


In [4]:
# Question: What are the names of the columns in the dataset?
# We did this one for you:

csv_header


['country',
 'date',
 'daily_vaccinations',
 'total_vaccinations',
 'people_vaccinated',
 'people_fully_vaccinated',
 'total_boosters',
 'population']

In [5]:
# Question: How many rows of data (excluding header) are present in the dataset?
# Fill in the blank

print("Expected: 1218 \t Actual:", len(csv_data))


Expected: 1218 	 Actual: 1218


### About the dataset
The `vaccinations.csv` file has vaccination data about 174 different countries in the last week of January 2022. Each row in the file contains the following information about a country on a specific date:

1. `daily_vaccinations` - Number of vaccines administered on that day
2. `total_vaccinations` - Total number of vaccines administered up to and including that day
3. `people_vaccinated` - Total number of people who have received at least one dose of the vaccine.
4. `people_fully_vaccinated` - Total number of people who have received two doses of the vaccine.
5. `total_boosters` - Total number of COVID-19 vaccination booster doses administered (doses administered beyond the number prescribed by the vaccination protocol).
6. `population` - Population of the country

<b>Note:</b> As you write your project, keep in mind that some entries may be missing data for specific columns. Sadly, real-life data is often messy, and in p7 we will have to deal with missing data.
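
To see why missing data needs care, here is a minimal sketch with made-up numbers. It assumes missing entries are represented as `None`. Arithmetic on `None` raises a `TypeError`, so the missing entries are filtered out first.

```python
# Hypothetical values for one column over five days; None marks a missing entry
daily_values = [120, None, 95, None, 130]

# Keep only the entries that are actually present
present = [v for v in daily_values if v is not None]

total = sum(present)
average = total / len(present)
print(total, average)
```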



### Task 2.2: Modify the cell function from p6 to work with vaccinations.csv
The vaccination data is formatted similarly to the airbnb data from p6. Modify the `cell()` function you wrote in p6 so that it converts each value to the correct type. Keep in mind that:

1. `daily_vaccinations` - should be an int
2. `total_vaccinations` - should be an int
3. `people_vaccinated` - should be an int
4. `people_fully_vaccinated` - should be an int
5. `total_boosters` - should be an int
6. `population` - should be an int
7. `date` - should be a string
8. `country` - should be a string

In [6]:
def cell(row_idx, col_name):
    """
    Returns the data value (cell) corresponding to the row index and 
    the column name of a CSV file.
    """
    col_idx = csv_header.index(col_name)
    val = csv_data[row_idx][col_idx]
    
    int_list = ['daily_vaccinations', 'total_vaccinations', 'people_vaccinated',
                'people_fully_vaccinated', 'total_boosters', 'population']
    
    if val == "": # this is how we handle a missing value in the dataset
        return None
    
    elif col_name in int_list:
        return int(val)  # numeric columns are converted from str to int
    
    return val


In [7]:
# Test out your implementation:

print("Expected: Afghanistan with type string \t Actual:", cell(0, "country"), type(cell(0, "country")))


Expected: Afghanistan with type string 	 Actual: Afghanistan <class 'str'>


In [8]:
# Test out your implementation

print("Expected: 30887 with type int \t Actual:", cell(66, "daily_vaccinations"),
      type(cell(66, "daily_vaccinations")))


Expected: 30887 with type int 	 Actual: 30887 <class 'int'>
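
The conversion rules can also be tried on their own, without loading the CSV. The sketch below is one possible way to isolate them. The helper name `convert` and the constant `INT_COLUMNS` are hypothetical and not part of the lab stencil.

```python
# Columns whose values should become ints (taken from the list in Task 2.2)
INT_COLUMNS = {
    "daily_vaccinations", "total_vaccinations", "people_vaccinated",
    "people_fully_vaccinated", "total_boosters", "population",
}

def convert(col_name, raw_value):
    """Convert one raw CSV string to None, an int, or leave it as a str."""
    if raw_value == "":           # empty string means the value is missing
        return None
    if col_name in INT_COLUMNS:   # numeric column: convert str to int
        return int(raw_value)
    return raw_value              # country and date stay as strings

print(convert("daily_vaccinations", "30887"))
print(convert("country", "Afghanistan"))
print(convert("total_boosters", ""))
```

The missing-value check comes first on purpose: calling `int("")` would raise a `ValueError`.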


## Segment 3: Dictionaries

### Task 3.1: Use a dictionary to organize the booster data by country.
In this task, you will create a dictionary whose keys are country names, and the corresponding values are the total number of booster shots administered for that country. Note that we don't have booster data for many countries, so some values in the dictionary should be None.


In [9]:
boosters = {} #key: country name, value: total boosters

for row_idx in range(len(csv_data)):
    country = cell(row_idx, "country")
    total_boosted = cell(row_idx, "total_boosters")
                         
    if country not in boosters:
        boosters[country] = None # placeholder until a non-missing value is found
    
    if total_boosted is not None: # only update when the booster data is not missing
        boosters[country] = total_boosted # add (or refresh) the key-value pair in the dict

boosters
        

{'Afghanistan': None,
 'Albania': None,
 'Angola': None,
 'Anguilla': None,
 'Antigua and Barbuda': None,
 'Argentina': 12868208,
 'Armenia': None,
 'Aruba': None,
 'Austria': None,
 'Azerbaijan': 2028205,
 'Bahamas': None,
 'Bahrain': 939199,
 'Bangladesh': None,
 'Barbados': None,
 'Belarus': None,
 'Belgium': 6580264,
 'Belize': None,
 'Bermuda': None,
 'Bhutan': None,
 'Bolivia': 879174,
 'Bosnia and Herzegovina': None,
 'Brazil': 47268419,
 'British Virgin Islands': None,
 'Brunei': None,
 'Bulgaria': 611314,
 'Cambodia': 6083677,
 'Canada': 15516481,
 'Cayman Islands': None,
 'Central African Republic': None,
 'Chad': None,
 'Chile': 12713482,
 'China': None,
 'Colombia': 5715017,
 'Comoros': None,
 'Costa Rica': 579075,
 "Cote d'Ivoire": None,
 'Croatia': None,
 'Cuba': 5358553,
 'Curacao': 36730,
 'Cyprus': 404293,
 'Czechia': 3754377,
 'Democratic Republic of Congo': None,
 'Denmark': 3566714,
 'Djibouti': None,
 'Dominica': None,
 'Dominican Republic': 1948527,
 'Ecuador': 26

In [10]:
# Test your implementation here:

print("Expected: 89474239\t Actual:", boosters['United States'])


Expected: 89474239	 Actual: 89474239


In [11]:
print("Expected: None\t Actual:", boosters['Armenia'])


Expected: None	 Actual: None
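
Once a dictionary like this exists, the `None` values are easy to separate from the reported ones. The following sketch uses a four-country excerpt copied from the output above. The names `sample_boosters`, `missing`, and `reported` are made up for illustration.

```python
# Small excerpt in the same shape as the boosters dictionary
sample_boosters = {
    "Argentina": 12868208,
    "Armenia": None,
    "Bahrain": 939199,
    "Belize": None,
}

# Countries with no booster data
missing = [country for country, total in sample_boosters.items() if total is None]

# Countries that do report booster data
reported = {country: total for country, total in sample_boosters.items() if total is not None}

print(missing)
print(reported)
```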


### Task 3.2: Improve the dictionary so that it uses the most recent vaccination data that is not missing
You may have noticed that there are missing entries in the data. For example, Bosnia and Herzegovina has no data from Jan 30 onward. So, for Jan 30 and Jan 31, we will use the data from Jan 29, which is the most recent day with data.

<img src="https://github.com/msyamkumar/cs220-s22-projects/raw/main/lab-p7/images/bosniaherzegovina.png/" width="600">

For other countries, such as Rwanda, the data is only available on and after Jan 27. We have no information for Jan 25 and Jan 26, so we will set the values for those two days to None.

<img src="https://github.com/msyamkumar/cs220-s22-projects/raw/main/lab-p7/images/rwanda.png/" width="600">


Fill in the stencil below to create a dictionary that maps each country name to the most recent non-missing value in the given column `col_name`, using only rows dated on or before the given date. For countries that have no data on or before the given date (for example, countries with data missing on all seven days), the value should be None. You might find the `get_number_of_days` function from p5 useful for checking whether a row's date is on or before the given date, so we have copied it below.

In [23]:
def get_number_of_days(start_date, end_date):
    
    """Gets the number of days between the start_date and end_date"""
    
    # The second argument is a format string to tell the function how to process the date string
    
    day1 = datetime.datetime.strptime(start_date, '%m/%d/%Y') 
    day2 = datetime.datetime.strptime(end_date, '%m/%d/%Y')
    
    delta = day2 - day1
    
    return delta.days

get_number_of_days("01/25/2022", "01/22/2022")

-3
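
The sign of the result is what matters for this task. The sketch below repeats the function so that it runs on its own, then compares three sample row dates against a given date. The sample dates are chosen only for illustration. A result of zero or more means the row's date is on or before the given date.

```python
import datetime

def get_number_of_days(start_date, end_date):
    """Gets the number of days between the start_date and end_date"""
    day1 = datetime.datetime.strptime(start_date, "%m/%d/%Y")
    day2 = datetime.datetime.strptime(end_date, "%m/%d/%Y")
    return (day2 - day1).days

given_date = "01/27/2022"
results = []
for row_date in ["01/25/2022", "01/27/2022", "01/30/2022"]:
    diff = get_number_of_days(row_date, given_date)
    # diff >= 0 means row_date is on or before given_date
    results.append((row_date, diff, diff >= 0))

print(results)
```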

In [34]:
def most_recent_total(col_name, given_date):
    
    '''return a dictionary mapping each country to the most recent column value in the data 
    available by the given date; if no data is available, the value is None.'''
    
    country_info = {}
                
    for row_idx in range(len(csv_data)):
        country = cell(row_idx, "country")
        date = cell(row_idx, "date")
        col_value = cell(row_idx, col_name)
        date_diff = get_number_of_days(date, given_date)
                
        if country not in country_info:
            country_info[country] = None  # placeholder until usable data is found
        
        if col_value is not None and date_diff >= 0:  # not missing, and on or before given_date
            country_info[country] = col_value

    return country_info

most_recent_total("daily_vaccinations", "01/27/2022")


{'Afghanistan': 6868,
 'Albania': None,
 'Angola': None,
 'Anguilla': None,
 'Antigua and Barbuda': None,
 'Argentina': None,
 'Armenia': None,
 'Aruba': None,
 'Austria': None,
 'Azerbaijan': 29515,
 'Bahamas': None,
 'Bahrain': 2827,
 'Bangladesh': 1414487,
 'Barbados': 194,
 'Belarus': None,
 'Belgium': 42940,
 'Belize': None,
 'Bermuda': None,
 'Bhutan': None,
 'Bolivia': 50562,
 'Bosnia and Herzegovina': None,
 'Brazil': 1656940,
 'British Virgin Islands': None,
 'Brunei': None,
 'Bulgaria': 12621,
 'Cambodia': 79450,
 'Canada': 259273,
 'Cayman Islands': None,
 'Central African Republic': None,
 'Chad': None,
 'Chile': 71298,
 'China': 5425000,
 'Colombia': 245216,
 'Comoros': None,
 'Costa Rica': None,
 "Cote d'Ivoire": None,
 'Croatia': 9458,
 'Cuba': 99338,
 'Curacao': None,
 'Cyprus': 2596,
 'Czechia': 39833,
 'Democratic Republic of Congo': None,
 'Denmark': 12265,
 'Djibouti': None,
 'Dominica': None,
 'Dominican Republic': 50037,
 'Ecuador': 66792,
 'Egypt': None,
 'El Sal

Is your implementation correct? Test it with the following:

In [35]:
people_fully_vaccinated_by_Jan25 = most_recent_total("people_fully_vaccinated", "01/25/2022")
people_fully_vaccinated_by_Jan26 = most_recent_total("people_fully_vaccinated", "01/26/2022")
people_fully_vaccinated_by_Jan27 = most_recent_total("people_fully_vaccinated", "01/27/2022")
people_fully_vaccinated_by_Jan28 = most_recent_total("people_fully_vaccinated", "01/28/2022")
people_fully_vaccinated_by_Jan30 = most_recent_total("people_fully_vaccinated", "01/30/2022")

print("Expected: 842954\t Actual:", people_fully_vaccinated_by_Jan28['Bosnia and Herzegovina'])
# Jan 30 is missing for Bosnia and Herzegovina, so this should carry forward the Jan 29 value
print("Expected: 846080\t Actual:", people_fully_vaccinated_by_Jan30['Bosnia and Herzegovina'])

# Different country:

print("Expected: None\t Actual:", people_fully_vaccinated_by_Jan25['Rwanda'])
print("Expected: None\t Actual:", people_fully_vaccinated_by_Jan26['Rwanda'])
print("Expected: 7044723\t Actual:", people_fully_vaccinated_by_Jan27['Rwanda'])

# If you get None for the test below, you might have forgotten to make sure the col_value
# is not missing in the condition that updates country_info in most_recent_total

print("Expected: 4517380\t Actual:", most_recent_total("people_vaccinated", "01/28/2022")['Afghanistan'])


Expected: 842954	 Actual: 842954
Expected: 846080	 Actual: 846080
Expected: None	 Actual: None
Expected: None	 Actual: None
Expected: 7044723	 Actual: 7044723
Expected: 4517380	 Actual: 4517380


In [36]:
# Check to make sure your code works with different column names:

daily_vaccinations_by_Jan29 = most_recent_total("daily_vaccinations", "01/29/2022")
total_vaccinations_by_Jan29 = most_recent_total("total_vaccinations", "01/29/2022")
people_vaccinated_by_Jan29 = most_recent_total("people_vaccinated", "01/29/2022")


print("Expected: 6868\t Actual:", daily_vaccinations_by_Jan29['Afghanistan'])
print("Expected: 5081064\t Actual:", total_vaccinations_by_Jan29['Afghanistan'])
print("Expected: 4517380\t Actual:", people_vaccinated_by_Jan29['Afghanistan'])


Expected: 6868	 Actual: 6868
Expected: 5081064	 Actual: 5081064
Expected: 4517380	 Actual: 4517380


**Important**: If you are unsure whether your implementation is correct, raise your hand and confirm your implementation with a TA. The `most_recent_total` function will be important for the project.
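
The loop in `most_recent_total` overwrites a country's value each time it reaches a later row, so it depends on each country's rows appearing in chronological order in the file. If that order were not guaranteed, one option would be to remember the date of the best row seen so far. The sketch below is an assumption-laden illustration, not the lab's required solution: the function `most_recent_value`, the helper `parse_date`, the tuple-based `rows`, and the countries `A`, `B`, `C` are all hypothetical.

```python
import datetime

def parse_date(date_str):
    """Turn an mm/dd/yyyy string into a datetime for comparison."""
    return datetime.datetime.strptime(date_str, "%m/%d/%Y")

def most_recent_value(rows, given_date):
    """rows: list of (country, date_str, value_or_None) tuples, in any order."""
    cutoff = parse_date(given_date)
    latest_date = {}  # country -> date of the newest usable row seen so far
    result = {}       # country -> value from that row, or None
    for country, date_str, value in rows:
        result.setdefault(country, None)  # every country gets a key
        row_date = parse_date(date_str)
        if value is None or row_date > cutoff:
            continue                      # missing, or after the given date
        if country not in latest_date or row_date > latest_date[country]:
            latest_date[country] = row_date
            result[country] = value
    return result

# Made-up rows, deliberately out of order
rows = [
    ("B", "01/29/2022", 30),
    ("A", "01/27/2022", None),
    ("A", "01/25/2022", 10),
    ("B", "01/26/2022", None),
    ("A", "01/26/2022", 20),
    ("C", "01/28/2022", 5),
]
answer = most_recent_value(rows, "01/27/2022")
print(answer)
```

For country `A`, the Jan 27 entry is missing, so the Jan 26 value is used. Countries `B` and `C` have no usable data on or before Jan 27, so they map to None.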

## Segment 4: Operations on Dictionaries

### Task 4.1: Find the max value in a dictionary
Imagine that a dorm kept statistics on the number of noise complaint incidents in different years. Complete the following code to find the year with the highest number of incidents:

*Hint*: How did you find the highest speed hurricane in p5 and lab-p5? Try to apply the same idea here.

In [37]:
#find the year with the most incidents

incidents = {2017: 14, 2020: 18, 2018: 13, 2019: 16, 2021: 25, 2016: 10}

max_year = None
most_inc = 0

for year in incidents:
    val = incidents[year]
    
    if max_year is None or val > most_inc:
        most_inc = val
        max_year = year
        
max_year


2021
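
Once the loop version makes sense, it may help to know that Python's built-in `max` can do the same search when given a `key` function. This is only an alternative sketch. The loop above is still the pattern to practice for the project.

```python
incidents = {2017: 14, 2020: 18, 2018: 13, 2019: 16, 2021: 25, 2016: 10}

# Iterating over a dict yields its keys; key=incidents.get ranks each year by its count
max_year = max(incidents, key=incidents.get)
print(max_year, incidents[max_year])
```

If two years tie for the highest count, `max` returns the first one it encounters.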

### Task 4.2: Find the percentage of free throws made

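Consider the following example, where we have statistics about free throws for three basketball players. How can you calculate the percentage of free throws each player made? Note that the expected values in the tests below are expressed as fractions between 0 and 1 (such as 0.5), not as percentages out of 100.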

In [38]:
free_throws_made = {"Jim": 1, "Annie": 2, "Fred": 3}

total_free_throws = {"Jim": 2, "Annie": 4, "Fred": 4}

percentage_made = {}

key_list = list(free_throws_made)

for key in key_list:
    percent = free_throws_made[key] / total_free_throws[key]
    percentage_made[key] = percent

percentage_made


# TODO: fill in the percentage_made dictionary so that the keys are the player names
# and the values are the percentage of free throws they made


{'Jim': 0.5, 'Annie': 0.5, 'Fred': 0.75}

In [39]:
# Test your implementation below:

print("Expected: 0.5\t Actual:", percentage_made['Jim'])
print("Expected: 0.5\t Actual:", percentage_made['Annie'])
print("Expected: 0.75\t Actual:", percentage_made['Fred'])


Expected: 0.5	 Actual: 0.5
Expected: 0.5	 Actual: 0.5
Expected: 0.75	 Actual: 0.75
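
The same calculation can be written as a dictionary comprehension. The sketch below also guards against division by zero. The player `Sam`, who has no attempts, is a made-up addition to show the guard in action.

```python
free_throws_made = {"Jim": 1, "Annie": 2, "Fred": 3, "Sam": 0}    # "Sam" is hypothetical
total_free_throws = {"Jim": 2, "Annie": 4, "Fred": 4, "Sam": 0}

percentage_made = {
    player: free_throws_made[player] / total_free_throws[player]
    for player in free_throws_made
    if total_free_throws.get(player)  # skip players with zero (or no) recorded attempts
}
print(percentage_made)
```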


## Great work! You are now ready to start P7.
We have also provided some optional exercises below in case you want more practice with lists and dictionaries:

## Optional Exercises

### Dictionary from a list of Keys and a list of Values

Create a dictionary that maps the English words in list `keys` to their corresponding Spanish translations in list `vals`:

In [40]:
# dict from list of keys and values

keys = ["two", "zero"]
vals = ["dos", "cero"]

english_to_spanish = {}
          
for i in range(len(keys)):
    english_to_spanish[keys[i]] = vals[i]

english_to_spanish

# TODO: fill in the english_to_spanish dictionary so that the keys are english
# words, and the values are the spanish translations


{'two': 'dos', 'zero': 'cero'}

The resulting dictionary containing the mapping from English to Spanish
words should look like this:

```python
{'two': 'dos', 'zero': 'cero'}
```
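
For comparison, the built-ins `zip` and `dict` can pair up the two lists without an explicit index. This is a sketch of an alternative, not a replacement for the loop you wrote above.

```python
keys = ["two", "zero"]
vals = ["dos", "cero"]

# zip pairs items by position; dict turns the (key, value) pairs into a dictionary
english_to_spanish = dict(zip(keys, vals))
print(english_to_spanish)
```

One thing to keep in mind is that `zip` stops at the end of the shorter list, so any extra items in the longer list are silently ignored.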

Now let's try using your `english_to_spanish` dictionary to partially translate the following English sentence. It's not exactly a replacement for Google Translate just yet, but it's a good start...

In [41]:
#words = "I love Comp Sci two two zero".split(" ")

words = "The spanish translations for the english words: two zero".split(" ")
print(words)

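for i in range(len(words)):
    default = words[i] # default is to not translate it
    words[i] = english_to_spanish.get(words[i], default)

# join only after the loop finishes; the result must be stored to have any effect
translated_sentence = " ".join(words)
    
print(words)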


['The', 'spanish', 'translations', 'for', 'the', 'english', 'words:', 'two', 'zero']
['The', 'spanish', 'translations', 'for', 'the', 'english', 'words:', 'dos', 'cero']


*Question: What is the purpose of the `default` variable?*

*Question: What is the purpose of the line `words[i] = english_to_spanish.get(words[i], default)`?*

### Flipping Keys and Values

What if we want a dictionary to convert from Spanish back to English?


In [42]:
# flipping keys and values

spanish_to_english = {}

for key in english_to_spanish:
    val = english_to_spanish[key]
    spanish_to_english[val] = key 
    
spanish_to_english

# TODO: fill in spanish_to_english so that the keys are spanish 
# words and the values are the english translations.

# Hint: You should only need to use your english_to_spanish dictionary,
# and not the original keys and vals lists.


{'dos': 'two', 'cero': 'zero'}

Your `spanish_to_english` dictionary should look like this if you print it out:

```python
{'dos': 'two', 'cero': 'zero'}
```
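
The flip can also be expressed as a dictionary comprehension over `items()`. The sketch below assumes every Spanish word appears only once. If two English words shared the same translation, the later one would overwrite the earlier one in the flipped dictionary.

```python
english_to_spanish = {"two": "dos", "zero": "cero"}

# Swap each (key, value) pair so the Spanish word becomes the key
spanish_to_english = {spanish: english for english, spanish in english_to_spanish.items()}
print(spanish_to_english)
```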