# CPSC 368 CSV to SQL Notebook (KNM Neighbours)

- All of the code below is to be run AFTER running `cpsc_368_project_knm.ipynb` for data cleaning.

## KFF2019_new
- `Location`: State within U.S.; `VARCHAR2`; PRIMARY KEY
- `All_Uninsured`: Proportion of uninsured individuals aged between 19 and 64; `DECIMAL`
- `Female_Uninsured`: Proportion of uninsured female individuals aged between 19 and 64; `DECIMAL`
- `Male_Uninsured`: Proportion of uninsured male individuals aged between 19 and 64; `DECIMAL`

## USCDI
Because of how the data is formatted, `YearStart`, `YearEnd`, and all of the `VARCHAR2` columns together form the composite primary key.
- `YearStart`: Start year of measurements; NUMBER(4, 0); PRIMARY KEY
- `YearEnd`: End year of measurements; NUMBER(4, 0); PRIMARY KEY
- `LocationDesc`: State within U.S.; `VARCHAR2`; PRIMARY KEY
- `Topic`: Topic of interest; `VARCHAR2`; PRIMARY KEY
- `Question`: Question of interest, based on `Topic`; `VARCHAR2(100)`; PRIMARY KEY
- `DataValueUnit`: Unit of data value depending on `Topic` and `Question`; `VARCHAR2`; PRIMARY KEY
- `DataValueType`: Type of data value (e.g. Crude value, age-adjusted); `VARCHAR2`; PRIMARY KEY
- `DataValue`: Data value, with specific interpretation dependent on its `DataValueType`, `DataValueUnit`, `Topic` and `Question`; NUMBER
- `StratificationCategory1`: Category to stratify data; includes "Age", "Sex", "Race/Ethnicity" and "Overall"; `VARCHAR2`; PRIMARY KEY
- `Stratification1`: Specific group within `StratificationCategory1`; `VARCHAR2`; PRIMARY KEY
- `Has2019`: Boolean on whether or not 2019 is in the data; NUMBER
- `Range`: Number of years between `YearStart` and `YearEnd`; NUMBER
- `AvgDataValue`: `DataValue`/`Range`; NUMBER
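
Since the primary key spans nine columns, it may be worth confirming that the combination is unique before any SQL is generated. The following is a minimal sketch, not part of the original notebook: the `find_key_violations` helper and the toy rows are hypothetical, and the rows are plain dictionaries of the kind `csv.DictReader` would yield.

```python
from collections import Counter

# Columns that together form the USCDI primary key, as listed above
USCDI_KEY = ['YearStart', 'YearEnd', 'LocationDesc', 'Topic', 'Question',
             'DataValueUnit', 'DataValueType', 'StratificationCategory1',
             'Stratification1']

def find_key_violations(rows, key_cols):
    """Return the key tuples that occur more than once in rows (a list of dicts)."""
    counts = Counter(tuple(row[col] for col in key_cols) for row in rows)
    return [key for key, n in counts.items() if n > 1]

# Hypothetical toy rows (not real data)
base = {'YearStart': '2019', 'YearEnd': '2019', 'LocationDesc': 'Alabama',
        'Topic': 'Topic A', 'Question': 'Question A', 'DataValueUnit': 'Unit',
        'DataValueType': 'Crude', 'StratificationCategory1': 'Sex',
        'Stratification1': 'Female', 'DataValue': '1.0'}
toy_rows = [
    base,
    {**base, 'DataValue': '2.0'},         # same key as the first row
    {**base, 'Stratification1': 'Male'},  # differs in one key column
]

duplicates = find_key_violations(toy_rows, USCDI_KEY)
print(len(duplicates))
```

An empty result would suggest the chosen columns are safe to declare as the key; any returned tuple would make the corresponding `INSERT` statements fail on the primary key constraint.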

## USCDI_CHD
- `LocationDesc`: State within U.S.; `VARCHAR2`; PRIMARY KEY
- `Frac_F`: Proportion of female mortalities by CHD; DECIMAL(19, 18)
- `CHD_Deaths`: Number of mortalities by CHD per 100,000 people for individuals aged between 19 and 64; DECIMAL(24, 18)
- `CHD_Deaths_F`: Number of female mortalities by CHD per 100,000 people for individuals aged between 19 and 64; DECIMAL(24, 18)
- `CHD_Deaths_M`: Number of male mortalities by CHD per 100,000 people for individuals aged between 19 and 64; DECIMAL(24, 18)
- `CHDPercentage`: Percentage of mortalities by CHD within 100,000 people for individuals aged between 19 and 64; DECIMAL(19, 18)
- `CHDPercentage_F`: Percentage of female mortalities by CHD within 100,000 people for individuals aged between 19 and 64; DECIMAL(19, 18)
- `CHDPercentage_M`: Percentage of male mortalities by CHD within 100,000 people for individuals aged between 19 and 64; DECIMAL(19, 18)

In [1]:
import numpy as np
import pandas as pd
import csv

KFF2019_new = pd.read_csv("final_datasets_V1/cleaned/KFF2019_new.csv")
display(KFF2019_new.head())
USCDI = pd.read_csv('final_datasets_V1/cleaned/USCDI.csv')
USCDI_CHD = pd.read_csv("final_datasets_V1/cleaned/USCDI_CHD.csv")

| Unnamed: 0 | Location | All_Uninsured | Female_Uninsured | Male_Uninsured |
|---|---|---|---|---|
| 0 | Alabama | 0.149 | 0.133 | 0.167 |
| 1 | Alaska | 0.153 | 0.119 | 0.187 |
| 2 | Arizona | 0.154 | 0.138 | 0.17 |
| 3 | Arkansas | 0.132 | 0.113 | 0.151 |
| 4 | California | 0.11 | 0.095 | 0.125 |


In [2]:
# https://www.geeksforgeeks.org/how-to-find-maximum-string-length-by-column-in-r-dataframe/
# https://stackoverflow.com/questions/69687640/how-to-iterate-over-columns-using-pandas

# To help determine VARCHAR length
USCDI_varchar = ['LocationDesc', 'Topic', 'Question', 'DataValueUnit', 'DataValueType', 'StratificationCategory1', 'Stratification1']
for col in USCDI_varchar:
    print(col)
    print(USCDI[col].str.len().max())
print("============================================")

USCDI_decimal = ['YearStart', 'YearEnd', 'DataValue', 'Has2019', 'Range', 'AvgDataValue']
print("USCDI_decimal")
for col in USCDI_decimal:
    print(col)
    print(USCDI[col].astype(str).str.len().max())

print("============================================")
USCDI_CHD_decimal = ['CHD_Deaths', 'CHD_Deaths_F', 'CHD_Deaths_M', 'CHDPercentage', 'CHDPercentage_F', 'CHDPercentage_M']
for col in USCDI_CHD_decimal:
    print(col)
    print(USCDI_CHD[col].astype(str).str.len().max())

LocationDesc
20
Topic
22
Question
81
DataValueUnit
17
DataValueType
17
StratificationCategory1
7
Stratification1
9
USCDI_decimal
YearStart
4
YearEnd
4
DataValue
6
Has2019
1
Range
1
AvgDataValue
18
CHD_Deaths
18
CHD_Deaths_F
18
CHD_Deaths_M
18
CHDPercentage
18
CHDPercentage_F
18
CHDPercentage_M
18
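
The lengths printed above count every character, including the decimal point, so they only approximate the `DECIMAL(precision, scale)` a column needs. As a rough sketch under the assumption that each value is a finite decimal string, the digits before and after the point can be counted separately. Both helper names and the sample values below are hypothetical.

```python
from decimal import Decimal

def precision_and_scale(text):
    """Return a (precision, scale) pair large enough for one decimal string.

    At least one integer digit is always reserved, so '0.149' yields (4, 3).
    """
    _sign, digits, exponent = Decimal(text).as_tuple()
    scale = max(-exponent, 0)                        # digits after the point
    integer_digits = max(len(digits) + exponent, 1)  # digits before the point
    return integer_digits + scale, scale

def column_precision_scale(values):
    """Combine per-value results into one (precision, scale) for a whole column."""
    max_int, max_scale = 0, 0
    for text in values:
        precision, scale = precision_and_scale(text)
        max_int = max(max_int, precision - scale)
        max_scale = max(max_scale, scale)
    return max_int + max_scale, max_scale

sample = ['0.149', '12.5', '123.4567']  # hypothetical values, not real data
print(column_precision_scale(sample))
```

The widest integer part and the widest fractional part can come from different rows, which is why the two maxima are tracked independently rather than taking the longest string.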


## Process

### `create_tbl_statements()`
- Returns a dictionary that maps each table name to its multi-line `CREATE TABLE` script.

In [3]:
def create_tbl_statements(): 
    return {"KFF2019_new":
        """
CREATE TABLE KFF2019_new(
    Location VARCHAR(50),
    All_Uninsured DECIMAL(19, 18),
    Female_Uninsured DECIMAL(19, 18),
    Male_Uninsured DECIMAL(19, 18),
    PRIMARY KEY(Location)
);
        """
        , "USCDI":
        """
CREATE TABLE USCDI(
    YearStart NUMBER(4,0),
    YearEnd NUMBER(4,0),
    LocationDesc VARCHAR(50),
    Topic VARCHAR(30),
    Question VARCHAR(100),
    DataValueUnit VARCHAR(20),
    DataValueType VARCHAR(20),
    DataValue DECIMAL(24, 18),
    StratificationCategory1 VARCHAR(10),
    Stratification1 VARCHAR(10),
    Has2019 NUMBER(1,0) NOT NULL,
    Range NUMBER(1,0) NOT NULL,
    AvgDataValue DECIMAL(24, 18),
    PRIMARY KEY(YearStart, YearEnd, LocationDesc, Topic, Question, DataValueUnit, DataValueType, StratificationCategory1, Stratification1)
);
        """
        , "USCDI_CHD":
        """
CREATE TABLE USCDI_CHD(
    LocationDesc VARCHAR(50), 
    Frac_F DECIMAL(19, 18), 
    CHD_Deaths DECIMAL(24, 18), 
    CHD_Deaths_F DECIMAL(24, 18), 
    CHD_Deaths_M DECIMAL(24, 18), 
    CHDPercentage DECIMAL(19, 18), 
    CHDPercentage_F DECIMAL(19, 18), 
    CHDPercentage_M DECIMAL(19, 18),
    PRIMARY KEY(LocationDesc)
);
        """
           }
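
A structural slip in a hand-written `CREATE TABLE` string, such as a missing comma, only surfaces when the script is run against the database. One possible quick check, sketched here under the assumption that an in-memory SQLite database is an acceptable stand-in, is to execute each statement locally first. SQLite is far more permissive about type names than Oracle, so this catches only gross syntax errors and does not validate Oracle semantics. The `ddl_parses` helper and `demo_tbl` are hypothetical.

```python
import sqlite3

def ddl_parses(statement):
    """Return True if SQLite can execute the statement in a scratch in-memory database."""
    conn = sqlite3.connect(':memory:')
    try:
        conn.execute(statement)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

# Hypothetical table used only for this illustration
good = """
CREATE TABLE demo_tbl(
    Location VARCHAR(50),
    All_Uninsured DECIMAL(19, 18),
    PRIMARY KEY(Location)
)
"""
# Same statement with the comma before PRIMARY KEY removed
bad = good.replace("DECIMAL(19, 18),", "DECIMAL(19, 18)")

print(ddl_parses(good), ddl_parses(bad))
```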

### `csv_to_insert()`
- Create a list of script lines.
- Open the file.
- Extract the header.
- For each row in file:
    - Create list of values to extract.
    - For each cell in row:
        - If the string is only digits, treat it as an integer and append it into the values list as is.
        - If the string is not only digits, attempt to convert it into a float and round it to 18 decimal places.
        - If float conversion works, convert it back to a string in order to append it into the values list.
        - If float conversion fails, add `'` quotes to it before appending it into the values list.
    - Join the values together into a value string.
    - Use the value string along with the headers joined by commas to create a script string.
    - Add the script string to the list of script lines.
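
The per-cell rule in the steps above can be sketched as a standalone helper, which makes it easier to try on individual values. The `format_cell` name and the sample cells are hypothetical and only illustrate the listed steps.

```python
def format_cell(cell):
    """Return the SQL literal for one CSV cell, following the steps listed above."""
    if cell.isdigit():
        return cell                         # only digits: use as is
    try:
        return str(round(float(cell), 18))  # decimal: round, then back to text
    except ValueError:
        return f"'{cell}'"                  # anything else: quote as a string

samples = ['2019', '0.149', 'Alabama', '-3']
literals = [format_cell(s) for s in samples]
print(', '.join(literals))
```

Note that `str.isdigit()` is false for a string with a sign, so a value such as `-3` takes the float branch and comes out as `-3.0`.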

In [4]:
# https://stackoverflow.com/questions/9282967/how-to-open-a-file-using-the-open-with-statement
# https://www.geeksforgeeks.org/check-if-value-is-int-or-float-in-python/
# https://www.geeksforgeeks.org/formatted-string-literals-f-strings-python/
# https://www.geeksforgeeks.org/python-string-join-method/

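def csv_to_insert(csv_input, table_name):
    scriptlist = []
    with open(csv_input, 'r', encoding='utf-8', newline='') as table:
        file = csv.reader(table)
        header = next(file)
        # Keep only named columns; this skips the unnamed index column that shows up as "Unnamed: 0" in pandas
        keep = [idx for idx, name in enumerate(header) if name.strip()]
        columns = ', '.join(header[idx] for idx in keep)
        for row in file:
            values = []
            for idx in keep:
                i = row[idx]
                if i.isdigit():
                    values.append(i) # Enter integer as is
                else:
                    try:
                        values.append(str(round(float(i), 18))) # Enter decimal value rounded to 18 decimal places
                    except ValueError:
                        escaped = i.replace("'", "''") # Double single quotes so the SQL literal stays valid
                        values.append(f"'{escaped}'") # Quotations for strings
            # Double-quoted f-string: reusing the outer quote inside an f-string requires Python 3.12+
            scriptlist.append(f"INSERT INTO {table_name} ({columns}) VALUES ({', '.join(values)});")
    return scriptlist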
# csv_to_insert('final_datasets_V1/cleaned/KFF2019_new.csv', 'KFF2019_new') # Sample

### `makesql()`
For each table: 
- DROP TABLE statement 
- CREATE TABLE statement from `create_tbl_statements`
- For each row in CSV, apply INSERT INTO statement
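
The layout of one table's section in the output file can be sketched with an in-memory buffer in place of a real file. This is an illustration under stated assumptions rather than the notebook's own code: `write_table_section`, `demo_tbl`, and the sample statements are hypothetical.

```python
import io

def write_table_section(out, table_name, create_statement, insert_lines):
    """Write the DROP, CREATE and INSERT statements for one table to out."""
    out.write(f"DROP TABLE {table_name};\n")
    out.write(create_statement.strip() + "\n")
    for line in insert_lines:
        out.write(line + "\n")
    out.write("\n")  # blank line separates this table from the next

buffer = io.StringIO()
write_table_section(
    buffer,
    "demo_tbl",  # hypothetical table name
    "CREATE TABLE demo_tbl(Location VARCHAR(50), PRIMARY KEY(Location));",
    ["INSERT INTO demo_tbl (Location) VALUES ('Alabama');"],
)
script = buffer.getvalue()
print(script)
```

Writing to any file-like object keeps the ordering logic testable without touching the disk; the same function would accept the handle returned by `open(output_file, 'w')`.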

In [5]:
# https://www.geeksforgeeks.org/writing-to-file-in-python/
def makesql(output_file):
    csv_table_tuples = [("final_datasets_V1/cleaned/KFF2019_new.csv", "KFF2019_new"), 
                       ("final_datasets_V1/cleaned/USCDI.csv", "USCDI"), 
                       ("final_datasets_V1/cleaned/USCDI_CHD.csv", "USCDI_CHD")]
    create_tbl = create_tbl_statements()
    
    # Write into output_file
    with open(output_file, 'w', encoding='utf-8') as sql_file:
        # For each table
        for csv_input, table_name in csv_table_tuples:
            # Drop table
            sql_file.write(f'DROP TABLE {table_name};' + '\n')
            
            # Create table
            sql_file.write(f'{create_tbl[table_name]}' + '\n')

            scriptlist = csv_to_insert(csv_input, table_name)

            # For each row, insert values
            for line in scriptlist:
                sql_file.write(f'{line}' + '\n')
            sql_file.write('\n')
    print(f"{output_file} written")
makesql("knm_datasetup.sql")

knm_datasetup.sql written
