# CPSC 368 CSV to SQL Notebook (KNM Neighbours)

## Loading Data and Packages

In [1]:
import numpy as np
import pandas as pd
import csv

- All of the code below is to be run AFTER running `cpsc_368_project_knm.ipynb` for data cleaning.

## KFF datasets (Adults Ages 19-64, Women Ages 19-64, Men Ages 19-64)

- There are 3 KFF datasets: one for all adults aged 19-64 (`KFF2019_adult`), and two for males (`KFF2019_male`) and females (`KFF2019_female`) aged 19-64.

| Column | Description | Data Type | Property |
| ------- | ------- | ------- | ------- |
| `Location`  | State within U.S. | `VARCHAR2(50)` | `PRIMARY KEY` |
| `Employer` | Includes those covered by employer-sponsored coverage either through their own job or as a dependent in the same household. | `DECIMAL(19, 18)` | N/A |
| `Non-Group` | Includes individuals and families that purchased or are covered as a dependent by non-group insurance. | `DECIMAL(19, 18)` | N/A |
| `Medicaid` | Proportion of individuals aged 19-64 in the dataset's group (all adults, women, or men) covered by Medicaid | `DECIMAL(19, 18)` | N/A |
| `Medicare` | Proportion of individuals aged 19-64 in the dataset's group covered by Medicare | `DECIMAL(19, 18)` | N/A |
| `Military` | Proportion of individuals aged 19-64 in the dataset's group with military-related coverage | `DECIMAL(19, 18)` | N/A |
| `Uninsured` | Proportion of individuals aged 19-64 in the dataset's group with no health insurance coverage | `DECIMAL(19, 18)` | N/A |
| `Total` | Sum of the coverage proportions for the dataset's group | `DECIMAL(19, 18)` | N/A |

## USCDI_filter

- The data is formatted such that each individual data value is identified by the combination of practically all the `VARCHAR2` columns, along with `YearStart` and `YearEnd`; together these columns form a composite primary key.
- The data provided is a version of the original dataset filtered to focus on the topics of 'Cardiovascular Disease' and 'Cancer'.

| Column | Description | Data Type | Property |
| ------- | ------- | ------- | ------- |
| `YearStart`  | Start year of measurements | `NUMBER(4, 0)` | `PRIMARY KEY` |
| `YearEnd` | End year of measurements | `NUMBER(4, 0)` | `PRIMARY KEY` |
| `LocationDesc` | State within U.S. | `VARCHAR2(50)` | `PRIMARY KEY` |
| `Topic` | Topic of interest | `VARCHAR2(30)` | `PRIMARY KEY` |
| `Question`  | Question of interest, based on `Topic` | `VARCHAR2(100)` | `PRIMARY KEY` |
| `DataValueUnit` | Unit of data value depending on `Topic` and `Question` | `VARCHAR2(20)` | `PRIMARY KEY` |
| `DataValueType` | Type of data value (e.g. Crude value, age-adjusted) | `VARCHAR2(20)` | `PRIMARY KEY` |
| `DataValue` | Data value, with specific interpretation dependent on its `DataValueType`, `DataValueUnit`, `Topic` and `Question` | `DECIMAL(24, 18)` | N/A |
| `StratificationCategory1` | Category to stratify data; includes "Age", "Sex", "Race/Ethnicity" and "Overall" | `VARCHAR(10)` | `PRIMARY KEY` |
| `Stratification1` | Specific group within `StratificationCategory1` | `VARCHAR(10)` | `PRIMARY KEY` |
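
Because the primary key spans nine columns, it can help to confirm that the combination really is unique before generating `INSERT` statements. The following is a minimal sketch with pandas that uses a small made-up frame in place of `USCDI_filter`; the toy rows and the helper name `has_duplicate_keys` are illustrative assumptions, not part of the project code.

```python
import pandas as pd

# Columns that together form the composite primary key of USCDI_filter.
KEY_COLS = ["YearStart", "YearEnd", "LocationDesc", "Topic", "Question",
            "DataValueUnit", "DataValueType",
            "StratificationCategory1", "Stratification1"]

def has_duplicate_keys(df, key_cols=KEY_COLS):
    """Return True if any two rows share the same composite key."""
    return bool(df.duplicated(subset=key_cols).any())

# Toy rows (made up for illustration): identical except for Stratification1.
toy = pd.DataFrame({
    "YearStart": [2019, 2019],
    "YearEnd": [2019, 2019],
    "LocationDesc": ["Alabama", "Alabama"],
    "Topic": ["Cancer", "Cancer"],
    "Question": ["Example question", "Example question"],
    "DataValueUnit": ["%", "%"],
    "DataValueType": ["Crude", "Crude"],
    "StratificationCategory1": ["Sex", "Sex"],
    "Stratification1": ["Female", "Male"],
})

print(has_duplicate_keys(toy))                              # False
print(has_duplicate_keys(pd.concat([toy, toy.iloc[[0]]])))  # True
```

On the real data, `has_duplicate_keys(USCDI_filter)` returning `True` would mean the `PRIMARY KEY` constraint rejects some of the generated rows.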

In [2]:
KFF2019_new = pd.read_csv("final_datasets_V1/cleaned/KFF2019_new.csv")
USCDI = pd.read_csv('final_datasets_V1/cleaned/USCDI.csv')
USCDI_CHD = pd.read_csv("final_datasets_V1/cleaned/USCDI_CHD.csv")

KFF2019_adult = pd.read_csv("final_datasets_V1/cleaned/KFF2019_adult.csv")
KFF2019_female = pd.read_csv("final_datasets_V1/cleaned/KFF2019_female.csv")
KFF2019_male = pd.read_csv("final_datasets_V1/cleaned/KFF2019_male.csv")
USCDI_filter = pd.read_csv("final_datasets_V1/cleaned/USCDI_filter.csv")

## Process

### `char_length()`
- Print the maximum string length of each column in a dataset, to help choose column sizes when writing the SQL.

In [3]:
# https://www.geeksforgeeks.org/how-to-find-maximum-string-length-by-column-in-r-dataframe/
# https://stackoverflow.com/questions/69687640/how-to-iterate-over-columns-using-pandas

def char_length(dataset, cols):
    for col in cols:
        print(col)
        print(dataset[col].astype(str).str.len().max())
    print("============================================")

# USCDI_varchar = ['LocationDesc', 'Topic', 'Question', 'DataValueUnit', 'DataValueType', 'StratificationCategory1', 'Stratification1']
# char_length(USCDI, USCDI_varchar)

# USCDI_decimal = ['YearStart', 'YearEnd', 'DataValue', 'Has2019', 'Range', 'AvgDataValue']
# char_length(USCDI, USCDI_decimal)
    
# USCDI_CHD_decimal = ['CHD_Deaths', 'CHD_Deaths_F', 'CHD_Deaths_M', 'CHDPercentage', 'CHDPercentage_F', 'CHDPercentage_M']
# char_length(USCDI_CHD, USCDI_CHD_decimal)

char_length(KFF2019_adult, list(KFF2019_adult.columns))
char_length(KFF2019_female, list(KFF2019_female.columns))
char_length(KFF2019_male, list(KFF2019_male.columns))
char_length(USCDI_filter, list(USCDI_filter.columns))

Location
20
Employer
5
Non-Group
5
Medicaid
5
Medicare
5
Military
5
Uninsured
5
Total
3
Location
20
Employer
5
Non-Group
5
Medicaid
5
Medicare
5
Military
5
Uninsured
5
Total
3
Location
20
Employer
5
Non-Group
5
Medicaid
5
Medicare
5
Military
5
Uninsured
5
Total
3
YearStart
4
YearEnd
4
LocationDesc
20
Topic
22
Question
81
DataValueUnit
17
DataValueType
17
DataValue
6
StratificationCategory1
7
Stratification1
9
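
The printed maxima can be compared against the declared `VARCHAR2` sizes to confirm that each text column has headroom. The sketch below copies the numbers from the `USCDI_filter` output and schema table; the dictionary names are illustrative.

```python
# Maximum observed string lengths for USCDI_filter (from the printed output).
observed = {"LocationDesc": 20, "Topic": 22, "Question": 81,
            "DataValueUnit": 17, "DataValueType": 17,
            "StratificationCategory1": 7, "Stratification1": 9}

# Declared character sizes from the USCDI_filter schema table.
declared = {"LocationDesc": 50, "Topic": 30, "Question": 100,
            "DataValueUnit": 20, "DataValueType": 20,
            "StratificationCategory1": 10, "Stratification1": 10}

# Columns whose declared size is too small for the data (should be empty).
too_small = [col for col, n in observed.items() if n > declared[col]]

# Spare characters per column.
headroom = {col: declared[col] - n for col, n in observed.items()}

print(too_small)  # []
print(headroom)
```

`Stratification1` has the least slack (one character), so it is the first column to revisit if the filter is ever widened.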


### `create_tbl_statements()`
- Returns a dictionary that maps each table name to its multi-line `CREATE TABLE` script.
- The size parameters in each script are based on the maximum string lengths reported by `char_length()`.

In [4]:
def create_tbl_statements(): 
    return {"KFF2019_adult":
        """
CREATE TABLE KFF2019_adult(
    Location VARCHAR(50),
    Employer DECIMAL(19, 18),
    "Non-Group" DECIMAL(19, 18),
    Medicaid DECIMAL(19, 18),
    Medicare DECIMAL(19, 18),
    Military DECIMAL(19, 18),
    Uninsured DECIMAL(19, 18),
    Total DECIMAL(19, 18),
    PRIMARY KEY(Location)
);
        """
        , "KFF2019_female":
        """
CREATE TABLE KFF2019_female(
    Location VARCHAR(50),
    Employer DECIMAL(19, 18),
    "Non-Group" DECIMAL(19, 18),
    Medicaid DECIMAL(19, 18),
    Medicare DECIMAL(19, 18),
    Military DECIMAL(19, 18),
    Uninsured DECIMAL(19, 18),
    Total DECIMAL(19, 18),
    PRIMARY KEY(Location)
);
        """
        , "KFF2019_male":
        """
CREATE TABLE KFF2019_male(
    Location VARCHAR(50),
    Employer DECIMAL(19, 18),
    "Non-Group" DECIMAL(19, 18),
    Medicaid DECIMAL(19, 18),
    Medicare DECIMAL(19, 18),
    Military DECIMAL(19, 18),
    Uninsured DECIMAL(19, 18),
    Total DECIMAL(19, 18),
    PRIMARY KEY(Location)
);
        """
        , "USCDI_filter":
        """
CREATE TABLE USCDI_filter(
    YearStart NUMBER(4,0),
    YearEnd NUMBER(4,0),
    LocationDesc VARCHAR(50),
    Topic VARCHAR(30),
    Question VARCHAR(100),
    DataValueUnit VARCHAR(20),
    DataValueType VARCHAR(20),
    DataValue DECIMAL(24, 18),
    StratificationCategory1 VARCHAR(10),
    Stratification1 VARCHAR(10),
    PRIMARY KEY(YearStart, YearEnd, LocationDesc, Topic, Question, DataValueUnit, DataValueType, StratificationCategory1, Stratification1)
);
        """
           }
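
The three KFF statements differ only in the table name, so they could also be generated from one template. The sketch below is an alternative, not the notebook's method. `KFF_TEMPLATE` and `kff_create_statement` are hypothetical names, and it assumes the hyphenated `Non-Group` column is written as a quoted identifier and that the `Total` column from the schema table is included.

```python
# Shared column layout for the three KFF tables (assumption: quoted
# "Non-Group" identifier, Total column included).
KFF_TEMPLATE = """
CREATE TABLE {table_name}(
    Location VARCHAR(50),
    Employer DECIMAL(19, 18),
    "Non-Group" DECIMAL(19, 18),
    Medicaid DECIMAL(19, 18),
    Medicare DECIMAL(19, 18),
    Military DECIMAL(19, 18),
    Uninsured DECIMAL(19, 18),
    Total DECIMAL(19, 18),
    PRIMARY KEY(Location)
);
"""

def kff_create_statement(table_name):
    """Fill the shared template with one table name."""
    return KFF_TEMPLATE.format(table_name=table_name)

kff_statements = {
    name: kff_create_statement(name)
    for name in ("KFF2019_adult", "KFF2019_female", "KFF2019_male")
}

print(kff_statements["KFF2019_male"])
```

A single template keeps the three schemas from drifting apart when a column is added or renamed.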

In [5]:
# def create_tbl_statements(): 
#     return {"KFF2019_new":
#         """
# CREATE TABLE KFF2019_new(
#     Location VARCHAR(50),
#     All_Uninsured DECIMAL(19, 18),
#     Female_Uninsured DECIMAL(19, 18),
#     Male_Uninsured DECIMAL(19, 18),
#     PRIMARY KEY(Location)
# );
#         """
#         , "USCDI":
#         """
# CREATE TABLE USCDI(
#     YearStart NUMBER(4,0),
#     YearEnd NUMBER(4,0),
#     LocationDesc VARCHAR(50),
#     Topic VARCHAR(30),
#     Question VARCHAR(100),
#     DataValueUnit VARCHAR(20),
#     DataValueType VARCHAR(20),
#     DataValue DECIMAL(24, 18),
#     StratificationCategory1 VARCHAR(10),
#     Stratification1 VARCHAR(10),
#     Has2019 NUMBER(1,0) NOT NULL,
#     Range NUMBER(2,0) NOT NULL,
#     AvgDataValue DECIMAL(24, 18),
#     PRIMARY KEY(YearStart, YearEnd, LocationDesc, Topic, Question, DataValueUnit, DataValueType, StratificationCategory1, Stratification1)
# );
#         """
#         , "USCDI_CHD":
#         """
# CREATE TABLE USCDI_CHD(
#     LocationDesc VARCHAR(50), 
#     Frac_F DECIMAL(19, 18), 
#     CHD_Deaths DECIMAL(24, 18), 
#     CHD_Deaths_F DECIMAL(24, 18), 
#     CHD_Deaths_M DECIMAL(24, 18), 
#     CHDPercentage DECIMAL(19, 18), 
#     CHDPercentage_F DECIMAL(19, 18), 
#     CHDPercentage_M DECIMAL(19, 18),
#     PRIMARY KEY(LocationDesc)
# );
#         """
#            }

### `csv_to_insert()`
- Create an empty list of script lines.
- Open the file.
- Extract the header.
- For each remaining row in the CSV file:
    - Create an empty list of values to extract.
    - For each cell in the row:
        - If the string is only digits, treat it as an integer and append it into the values list as is.
        - If the string is not only digits, attempt to convert it into a float and round it to 18 decimal places, then convert it back to a string in order to append it into the values list.
        - If float conversion fails, add `'` quotes to it before appending it into the values list.
    - Join the values together with commas into a value string.
    - Use the value string along with the headers joined by commas to create a script string.
    - Add the script string to the list of script lines.

In [6]:
# https://stackoverflow.com/questions/9282967/how-to-open-a-file-using-the-open-with-statement
# https://www.geeksforgeeks.org/check-if-value-is-int-or-float-in-python/
# https://www.geeksforgeeks.org/formatted-string-literals-f-strings-python/
# https://www.geeksforgeeks.org/python-string-join-method/

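def csv_to_insert(csv_input, table_name):
    scriptlist = []
    # newline='' lets the csv module handle line endings itself
    with open(csv_input, 'r', encoding='utf-8', newline='') as table:
        file = csv.reader(table)
        header = next(file)
        # Double-quote column names that are not plain identifiers (e.g. Non-Group)
        columns = ', '.join(h if h.isidentifier() else f'"{h}"' for h in header)
        for row in file:
            values = []
            for i in row:
                if i.isdigit():
                    values.append(i) # Enter integer as is
                else:
                    try:
                        values.append(str(round(float(i), 18))) # Enter decimal value rounded to 18 decimal places
                    except ValueError:
                        escaped = i.replace("'", "''") # Double any embedded single quotes
                        values.append(f"'{escaped}'") # Quotations for strings
            # Double-quoted f-string: reusing the same quote character inside the
            # expression is only valid from Python 3.12 onwards
            scriptlist.append(f"INSERT INTO {table_name} ({columns}) VALUES ({', '.join(values)});")
    return scriptlist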
# csv_to_insert('final_datasets_V1/cleaned/KFF2019_new.csv', 'KFF2019_new') # Sample
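
To make the three-way cell rule concrete, the sketch below isolates it in a helper and runs it on an in-memory CSV. The helper name `format_cell`, the sample row, and the doubling of embedded apostrophes are assumptions added for illustration.

```python
import csv
import io

def format_cell(cell):
    """Format one CSV cell as a SQL literal using the three-way rule."""
    if cell.isdigit():
        return cell                          # integer: keep as is
    try:
        return str(round(float(cell), 18))   # decimal: round to 18 places
    except ValueError:
        escaped = cell.replace("'", "''")    # assumption: escape apostrophes
        return f"'{escaped}'"                # text: wrap in single quotes

# Made-up sample with one text, one decimal and one integer cell.
sample = io.StringIO("Location,Uninsured,Year\nAlabama,0.125,2019\n")
reader = csv.reader(sample)
header = next(reader)
for row in reader:
    print([format_cell(c) for c in row])  # ["'Alabama'", '0.125', '2019']
```

One caveat of the `float()` fallback: it also accepts strings such as `nan` and `inf`, so a text cell with exactly that content would be emitted unquoted.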

### `makesql()`
For each table, write: 
- `DROP TABLE` statement 
- `CREATE TABLE` statement from `create_tbl_statements`
- For each row in CSV, apply `INSERT INTO` statement

In [7]:
# https://www.geeksforgeeks.org/writing-to-file-in-python/
def makesql(output_file):
    # csv_table_tuples = [("final_datasets_V1/cleaned/KFF2019_new.csv", "KFF2019_new"), 
    #                    ("final_datasets_V1/cleaned/USCDI.csv", "USCDI"), 
    #                    ("final_datasets_V1/cleaned/USCDI_CHD.csv", "USCDI_CHD")]
    
    csv_table_tuples = [("final_datasets_V1/cleaned/KFF2019_adult.csv", "KFF2019_adult"), 
                        ("final_datasets_V1/cleaned/KFF2019_female.csv", "KFF2019_female"), 
                        ("final_datasets_V1/cleaned/KFF2019_male.csv", "KFF2019_male"), 
                        ("final_datasets_V1/cleaned/USCDI_filter.csv", "USCDI_filter")]
    
    create_tbl = create_tbl_statements()
    
    # Write into output_file
    with open(output_file, 'w', encoding='utf-8') as sql_file:
        # For each table
        for csv_input, table_name in csv_table_tuples:
            
            # Drop table
            sql_file.write(f'DROP TABLE {table_name};' + '\n')
            
            # Create table
            sql_file.write(f'{create_tbl[table_name]}' + '\n')

            # For each row, insert values
            scriptlist = csv_to_insert(csv_input, table_name)
            for line in scriptlist:
                sql_file.write(f'{line}' + '\n')
            sql_file.write('\n')
            
    print(f"SQL written into {output_file}")

makesql("knm_datasetup.sql")

SQL written into knm_datasetup.sql
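
One quick sanity check on the generated script is to count the `INSERT` statements per table and compare them with the CSV row counts. The sketch below assumes each statement sits on its own line, as `makesql()` writes it; `count_inserts` and the sample text are illustrative.

```python
from collections import Counter

def count_inserts(sql_text):
    """Count INSERT INTO statements per table name in a SQL script."""
    counts = Counter()
    for line in sql_text.splitlines():
        parts = line.split()
        # An insert line looks like: INSERT INTO <table> (...) VALUES (...);
        if len(parts) >= 3 and parts[0] == "INSERT" and parts[1] == "INTO":
            counts[parts[2]] += 1
    return counts

# Made-up script fragment in the shape makesql() produces.
sample_sql = """DROP TABLE KFF2019_male;
INSERT INTO KFF2019_male (Location) VALUES ('Alabama');
INSERT INTO KFF2019_male (Location) VALUES ('Alaska');
"""

print(count_inserts(sample_sql)["KFF2019_male"])  # 2
```

In practice the text would be read from `knm_datasetup.sql`, and each count compared with the length of the matching DataFrame, such as `len(KFF2019_male)`.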


## KFF2019_new

| Column | Description | Data Type | Property |
| ------- | ------- | ------- | ------- |
| `Location`  | State within U.S. | `VARCHAR2(50)` | `PRIMARY KEY` |
| `All_Uninsured` | Proportion of uninsured individuals aged between 19 and 64 | `DECIMAL(19, 18)` | N/A |
| `Female_Uninsured` | Proportion of uninsured female individuals aged between 19 and 64 | `DECIMAL(19, 18)` | N/A |
| `Male_Uninsured` | Proportion of uninsured male individuals aged between 19 and 64 | `DECIMAL(19, 18)` | N/A |

## USCDI

| Column | Description | Data Type | Property |
| ------- | ------- | ------- | ------- |
| `YearStart`  | Start year of measurements | `NUMBER(4, 0)` | `PRIMARY KEY` |
| `YearEnd` | End year of measurements | `NUMBER(4, 0)` | `PRIMARY KEY` |
| `LocationDesc` | State within U.S. | `VARCHAR2(50)` | `PRIMARY KEY` |
| `Topic` | Topic of interest | `VARCHAR2(30)` | `PRIMARY KEY` |
| `Question`  | Question of interest, based on `Topic` | `VARCHAR2(100)` | `PRIMARY KEY` |
| `DataValueUnit` | Unit of data value depending on `Topic` and `Question` | `VARCHAR2(20)` | `PRIMARY KEY` |
| `DataValueType` | Type of data value (e.g. Crude value, age-adjusted) | `VARCHAR2(20)` | `PRIMARY KEY` |
| `DataValue` | Data value, with specific interpretation dependent on its `DataValueType`, `DataValueUnit`, `Topic` and `Question` | `DECIMAL(24, 18)` | N/A |
| `StratificationCategory1` | Category to stratify data; includes "Age", "Sex", "Race/Ethnicity" and "Overall" | `VARCHAR(10)` | `PRIMARY KEY` |
| `Stratification1` | Specific group within `StratificationCategory1` | `VARCHAR(10)` | `PRIMARY KEY` |
| `Has2019` | Boolean on whether or not 2019 is in the data | `NUMBER(1,0)` | `NOT NULL` |
| `Range` | Number of years between `YearStart` and `YearEnd` | `NUMBER(2,0)` | `NOT NULL` |
| `AvgDataValue`  | $\frac{DataValue}{Range}$ | `DECIMAL(24, 18)` | N/A |
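
As a worked example of the derived columns, the sketch below computes `Has2019`, `Range` and `AvgDataValue` for one made-up row. It assumes `Range` counts years inclusively (`YearEnd - YearStart + 1`) and that `Has2019` is 1 when 2019 lies within that span. The table does not spell out either definition, so both are assumptions, and `derived_columns` is a hypothetical helper.

```python
def derived_columns(year_start, year_end, data_value):
    """Return (Has2019, Range, AvgDataValue) for one row."""
    year_range = year_end - year_start + 1           # assumption: inclusive count
    has_2019 = int(year_start <= 2019 <= year_end)   # assumption: 2019 within span
    avg_data_value = data_value / year_range         # DataValue / Range
    return has_2019, year_range, avg_data_value

# Made-up row: a five-year measurement window ending in 2019.
print(derived_columns(2015, 2019, 250.0))  # (1, 5, 50.0)
```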