### [Python Sets]

* A set is similar to a list or tuple, but it has no specified order and cannot contain duplicates.
* Empty curly braces (`{}`) create an empty dictionary in Python, not an empty set.
* To create an empty set, write `set()`.


In [None]:
"""
Using sets.
"""

print("Set Literals")
print("============")

numbers = {3, 2, 1, 4}
print(numbers)

letters = {"a", "b", "a", "c", "b"}
print(letters)

empty = {}
print(empty, type(empty))

empty2 = set()
print(empty2, type(empty2))

set1 = set([3, 1, 1, 3, 6, 5])
print(set1)

set2 = set(range(5))
print(set2)

print("")
print("Adding/Removing Elements")
print("========================")

set1.add(10)
print(set1)

element = set1.pop()
print(element)
print(set1)

set1.discard(3)
print(set1)

print("")
print("Set Iteration")
print("=============")

for item in letters:
    print(item)
    

Set Literals
{1, 2, 3, 4}
{'b', 'c', 'a'}
{} <class 'dict'>
set() <class 'set'>
{1, 3, 5, 6}
{0, 1, 2, 3, 4}

Adding/Removing Elements
{1, 3, 5, 6, 10}
1
{3, 5, 6, 10}
{5, 6, 10}

Set Iteration
b
c
a


### [Hashing]

#### * Sets
* With a list, you may have to search the entire list to determine whether a particular element is in it, so the time to test membership is proportional to the size of the list.
* With a set, the time is roughly constant no matter how many items are in the set, because a set is backed by a hash table.
> This means that when the collection gets large, fast membership tests can be a big win.
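
The difference can be sketched with `timeit` from the standard library. The names `SIZE`, `items_list`, `items_set`, and `missing` are illustrative assumptions, and the measured times will vary from machine to machine.

```python
import timeit

SIZE = 50_000                      # hypothetical collection size, for illustration only
items_list = list(range(SIZE))
items_set = set(items_list)
missing = -1                       # not present, so the list must be scanned to the end

# Time 50 membership tests against each collection.
list_time = timeit.timeit(lambda: missing in items_list, number=50)
set_time = timeit.timeit(lambda: missing in items_set, number=50)

print("list:", list_time, "seconds")
print("set: ", set_time, "seconds")
```

Both tests give the same answer (`False`); only the time taken to reach it differs.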

#### * Hashing
* A hash table is a table in which you look things up directly by index into that table.
* In order to use such a table, you need to be able to determine a consistent index for an arbitrary item (such as a string) that you want to store in the table.
* This is where hashing comes in. Python includes a set of hash functions, one for each hashable type.
* These hash functions have two important properties.
> 1. The hash function will always return the same hash for objects that have the same value.
> 2. The hash function will distribute the hash values across the possible space. (important for efficiency, not for correctness).

* A good hash function avoids collisions as much as possible. (A collision happens when two different objects map to the same index in the hash table.)
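
As a toy illustration of turning a hash into a table index, the sketch below reduces `hash()` modulo a table size. `TABLE_SIZE` and `slot_for` are hypothetical names, and CPython's real set and dictionary implementation is more elaborate than this.

```python
TABLE_SIZE = 8  # hypothetical number of slots, for illustration only


def slot_for(item, table_size=TABLE_SIZE):
    """Map a hashable item to a slot index in a table of the given size."""
    return hash(item) % table_size


# Property 1: equal values always produce equal hashes (within one run).
print(hash("county") == hash("county"))   # True

# In CPython, small non-negative integers hash to themselves.
print(slot_for(42))                       # 2, because 42 % 8 == 2

# A collision: two different objects that map to the same slot.
print(slot_for(2) == slot_for(10))        # True, both map to slot 2
```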

#### * Hashability

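* Only hashable objects can be stored in a hash table, and therefore in a Python set. Among the built-in types, these are the immutable ones; mutable built-ins such as lists, dictionaries, and sets are not hashable.
* A type is hashable if a hash function is defined for it. Python provides hash functions for its immutable built-in types (a tuple is hashable only if all of its elements are).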
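
A quick way to check hashability is to call `hash()` and see whether it raises `TypeError`. The helper name `is_hashable` is an assumption made for this sketch.

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds, False if it raises TypeError."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


print(is_hashable((3, 4)))             # True: tuple of ints
print(is_hashable("abc"))              # True: string
print(is_hashable(frozenset([1, 2])))  # True: immutable counterpart of set
print(is_hashable([3, 4]))             # False: list is mutable
print(is_hashable({1, 2}))             # False: set is mutable
print(is_hashable((1, [2])))           # False: tuple containing a list
```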

#### * Dictionaries

* Python dictionaries also use hash tables internally, which is why a dictionary cannot have duplicate keys and why its keys must be hashable. Looking up a key in a dictionary gets the same performance benefits as testing membership in a set.
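
A small sketch of both points, using a hypothetical `fips_to_county` dictionary with two sample entries:

```python
# Sample data for illustration only.
fips_to_county = {"08014": "Broomfield County", "51560": "Clifton Forge"}

# Assigning to an existing key overwrites its value; no duplicate key is created.
fips_to_county["08014"] = "Broomfield"
print(len(fips_to_county))               # 2

# "in" tests the keys through the hash table; it does not look at the values.
print("08014" in fips_to_county)         # True
print("Broomfield" in fips_to_county)    # False
```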


### [Analyzing the Efficiency of Your Code]

* Writing code that is correct, understandable, and easy for others to work with is important.
* However, as you progress, efficiency also becomes important.
* Efficiency here means how fast your code runs on a problem of a particular size.
> * [Algorithmic efficiency](https://en.wikipedia.org/wiki/Algorithmic_efficiency)
> * [TimeComplexity](https://wiki.python.org/moin/TimeComplexity)
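
One way to see how running time depends on problem size, without relying on a clock, is to count the work done. The sketch below counts how many elements a linear search examines; `linear_search_steps` is a hypothetical helper, not part of the course code.

```python
def linear_search_steps(items, target):
    """Return how many elements are examined before target is found (or all of them)."""
    steps = 0
    for item in items:
        steps += 1
        if item == target:
            break
    return steps


# Searching for a missing value examines every element,
# so the work grows in proportion to the size of the list.
for size in (1_000, 10_000, 100_000):
    print(size, linear_search_steps(list(range(size)), -1))
```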

### [Comparing Two Methods for Joining CSV Files]


In [None]:
PATH = '/content/drive/MyDrive/etc/'

In [None]:
"""
Some sample code illustrating the speed of the "in" operator for lists vs. dictionaries in Python
See https://wiki.python.org/moin/TimeComplexity for more details
"""

import time
import random
import csv


def read_csv_file(file_name):
    """
    Given a CSV file, read the data into a nested list
    Input: String corresponding to comma-separated  CSV file
    Output: Nested list consisting of the fields in the CSV file
    """ 
       
    with open(file_name, newline='') as csv_file:       # don't need to explicitly close the file now
        csv_table = []
        csv_reader = csv.reader(csv_file, delimiter=',')
        for row in csv_reader:
            csv_table.append(row)
    return csv_table



CANCER_RISK_FIPS_COL = 2
CENTER_FIPS_COL = 0


def test_CSV_join_efficiency(cancer_csv_file, center_csv_file):
    """
    Extract lists of FIPS codes from cancer-risk data set and county center data set
    Measure running time to determine whether FIPS codes in cancer-risk set are in county center set
    """
    
    # Read in both CSV files
    risk_table = read_csv_file(cancer_csv_file)
    risk_FIPS_list = [row[CANCER_RISK_FIPS_COL] for row in risk_table]
    print("Read", len(risk_FIPS_list), "cancer-risk FIPS codes")
       
    center_table = read_csv_file(center_csv_file)
    center_FIPS_list = [row[CENTER_FIPS_COL] for row in center_table]
    print("Read", len(center_FIPS_list), "county center FIPS codes")
    
    start_time = time.time()
    for code in risk_FIPS_list:
        if code in center_FIPS_list:      
            pass
    end_time = time.time()
    print("Checked for FIPS membership using list in", end_time-start_time, "seconds")
    
    
    center_FIPS_dict = {code : True for code in center_FIPS_list}
    start_time = time.time()
    for code in risk_FIPS_list:
        if code in center_FIPS_dict:  
            pass
    end_time = time.time()
    print("Checked for FIPS membership using dict in", end_time-start_time, "seconds")


test_CSV_join_efficiency(PATH+"cancer_risk_trimmed_solution.csv", PATH+"USA_Counties_with_FIPS_and_centers.csv")

Read 3276 cancer-risk FIPS codes
Read 3143 county center FIPS codes
Checked for FIPS membership using list in 0.09780097007751465 seconds
Checked for FIPS membership using dict in 0.0005114078521728516 seconds


In [None]:

# Code that tests the efficiency of the "in" operator on larger examples

TEST_SIZE = 20000
TEST_STRIDE = 3

def test_membership_efficiency():
    """
    Test the efficiency of the "in" operator on list/dictionaries of larger size
    """
    test_list = list(range(0, TEST_SIZE, TEST_STRIDE))  # Convert range() to list, membership in range is fast
    test_dict = {idx : True for idx in test_list}
    
    print()

    # Code to test efficiency of "in" for lists
    start_time = time.time()
    for idx in range(TEST_SIZE):
        if idx in test_list:      # "in" scans the list element by element until it finds a match
            pass
    end_time = time.time()
    print("Total time for", TEST_SIZE, "membership test for list is", end_time - start_time, "seconds")
    
    # Code to test efficiency of "in" for dicts
    start_time = time.time()
    for idx in range(TEST_SIZE):
        if idx in test_dict:      # "in" does NOT iterate through the dictionary to check membership
            pass
    end_time = time.time()
    print("Total time for", TEST_SIZE, "membership test for dict is", end_time - start_time, "seconds")
    
test_membership_efficiency()


Total time for 20000 membership test for list is 1.5725421905517578 seconds
Total time for 20000 membership test for dict is 0.0013015270233154297 seconds




---
## [Practice Project: Reconciling Cancer-Risk Data with the USA Map]

1. Merge two county-level data sets by common FIPS code
2. Write a script that reads both CSV files, merges them by FIPS code, and writes the result to a CSV file
3. Investigate discrepancies between the two sets of FIPS codes
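
For step 3, one possible approach is to compare the FIPS codes as Python sets. This is only a sketch: `risk_codes` and `center_codes` are hypothetical names, and the handful of codes below are sample values standing in for the columns read from the two CSV files.

```python
# Sample values for illustration only; the real codes come from the CSV files.
risk_codes = {"06000", "51560", "06037"}
center_codes = {"08014", "06037"}

# Codes that appear in only one of the two data sets.
only_in_risk = risk_codes - center_codes
only_in_center = center_codes - risk_codes

# Codes shared by both data sets (the rows that can be joined).
shared = risk_codes & center_codes

print(sorted(only_in_risk))     # ['06000', '51560']
print(sorted(only_in_center))   # ['08014']
print(sorted(shared))           # ['06037']
```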



In [14]:
"""
Week 3 practice project template for Python Data Visualization
Read two CSV files and join the resulting tables based on shared FIPS codes
Analyze both data sources for anomalous FIPS codes
"""

import csv



#########################################################
# Provided code for week 3

def print_table(table):
    """
    Echo a nested list to the console
    """
    for row in table:
        print(row)


def read_csv_file(file_name):
    """
    Given a CSV file, read the data into a nested list
    Input: String corresponding to comma-separated  CSV file
    Output: Nested list consisting of the fields in the CSV file
    """
       
    with open(file_name, newline='') as csv_file:       # don't need to explicitly close the file now
        csv_table = []
        csv_reader = csv.reader(csv_file, delimiter=',')
        for row in csv_reader:
            csv_table.append(row)
    return csv_table



def write_csv_file(csv_table, file_name):
    """
    Input: Nested list csv_table and a string file_name
    Action: Write fields in csv_table into a comma-separated CSV file with the name file_name
    """
    
    with open(file_name, 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',', quoting=csv.QUOTE_MINIMAL)
        for row in csv_table:
            csv_writer.writerow(row)



# Part 1 - function that creates a dictionary from a table

def make_dict(table, key_col):
    """
    Given a 2D table (list of lists) and a column index key_col,
    return a dictionary whose keys are entries of specified column
    and whose values are lists consisting of the remaining row entries
    """
    return_dict = {}
    for row in table:
        temp_list = row.copy()
        del temp_list[key_col]    # drop the key column by position, not by first matching value
        return_dict[row[key_col]] = temp_list
    return return_dict




def test_make_dict():
    """
    Some tests for make_dict()
    """
    table1 = [[1, 2], [3, 4], [5, 6]]
    print(make_dict(table1, 0))
    print(make_dict(table1, 1))
    table2 = [[1, 2, 3], [2, 4, 6], [3, 6, 9]]
    print(make_dict(table2, 1))
    print(make_dict(table2, 2))
    
test_make_dict()



# Part 2 - script for merging the CSV files

CANCER_RISK_FIPS_COL = 2
CENTER_FIPS_COL = 0

def merge_csv_files(cancer_csv_file, center_csv_file, joined_csv_file):
    """
    Read two specified CSV files as tables
    Join these tables by shared FIPS codes
    Write the resulting joined table as the specified file
    Analyze for problematic FIPS codes
    """
    # Read in both CSV files
    cancer_data = read_csv_file(cancer_csv_file) #3276
    center_data = read_csv_file(center_csv_file) #3143 rows; 3140 in the end result

    print("Read risk table of lengh", len(cancer_data))
    print("Read center table of lengh", len(center_data))

    center_dict = make_dict(center_data, CENTER_FIPS_COL)
    
    # Compute joined table, print warning about cancer-risk FIPS codes that are not in USA map
    joined_table = []
    for row in cancer_data:
        FIPS_code = row[CANCER_RISK_FIPS_COL]
        if FIPS_code in center_dict:
            joined_table.append(row+center_dict[FIPS_code])
        else:
            print("Row", row, "in cancer risk table not present in USA map")


    # Write joined table
    print("Wrote joined table of length", len(joined_table))
    write_csv_file(joined_table, joined_csv_file)
    
    # Print warning about FIPS codes in USA map that are missing from cancer risk data
    print()
    risk_codes = {row[CANCER_RISK_FIPS_COL] for row in cancer_data}    # set gives fast membership tests
    for center_code in center_dict:
        if center_code not in risk_codes:
            print("Code", center_code, "in center table not present in cancer risk table")


merge_csv_files(PATH+"cancer_risk_trimmed_solution.csv", PATH+"USA_Counties_with_FIPS_and_centers.csv", "cancer_risk_joined.csv")




## Part 3 - Explanation for anomalous FIPS codes

## https://www1.udel.edu/johnmack/frec682/fips_codes.html
##
## Output anomalies for cancer risk data
## Puerto Rico, Virgin Islands, Statewide, Nationwide - FIPS codes are all not present in USA map
## One specific county (Clifton Forge, VA - 51560) is also not present in USA map.
## According to the URL above, Clifton Forge was merged with another VA county prior to 2001
##
## Output anomalies for USA map
## State_Line, separator - FIPS codes are all not present in cancer-risk data
## One specific county (Broomfield County - 08014) is also not present in cancer-risk data
## According to the URL above, Broomfield County was created in 2001
##
## This implies the cancer risk FIPS codes were defined prior to 2001 and the USA map FIPS codes after 2001

{1: [2], 3: [4], 5: [6]}
{2: [1], 4: [3], 6: [5]}
{2: [1, 3], 4: [2, 6], 6: [3, 9]}
{3: [1, 2], 6: [2, 4], 9: [3, 6]}
Read risk table of lengh 3276
Read center table of lengh 3143
Row ['CA', 'Statewide', '06000', '33871648', '7.7E-05'] in cancer risk table not present in USA map
Row ['DC', 'Statewide', '11000', '572059', '7.7E-05'] in cancer risk table not present in USA map
Row ['NY', 'Statewide', '36000', '18976457', '7.2E-05'] in cancer risk table not present in USA map
Row ['MD', 'Statewide', '24000', '5296486', '5.7E-05'] in cancer risk table not present in USA map
Row ['NJ', 'Statewide', '34000', '8414350', '5.6E-05'] in cancer risk table not present in USA map
Row ['AZ', 'Statewide', '04000', '5130632', '5.5E-05'] in cancer risk table not present in USA map
Row ['OR', 'Statewide', '41000', '3421399', '5.5E-05'] in cancer risk table not present in USA map
Row ['CT', 'Statewide', '09000', '3405565', '5.3E-05'] in cancer risk table not present in USA map
Row ['GA', 'Statewide', '13