# Analysis

## Setup

First we download the required data and initialize the Python objects used throughout the notebook.

In [16]:
tables = [
    "careplans",
    "conditions",
    "observations",
    "patients",
]

files = [
    "data/others/",
    "data/asthma/",
    "data/gallstones/",
    "data/hypertension/",
]

In [17]:
!mkdir -p data/allergy

In [18]:
from urllib.request import urlopen
import os

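def ensure_file_has_been_downloaded(filename):
    """Download `filename` from the project repository unless a local copy exists."""
    # Local copies are stored relative to the parent directory of the notebook.
    full_filename = "../" + filename

    url = "https://raw.githubusercontent.com/Fuenfgeld/DMA2022DataProjectC/main/" + filename
    if os.path.isfile(full_filename):
        print("File {} already exists, skipping download".format(filename))
    else:
        print("Downloading {}".format(filename))
        download_file(url, full_filename)

def download_file(url, filename):
    """Fetch `url` and write the response body to `filename`."""
    with urlopen(url) as response:
        data = response.read()
    # Write only after a successful fetch, so a failed download leaves no
    # empty file behind that would later be mistaken for a finished one.
    with open(filename, 'wb') as out_file:
        out_file.write(data)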

if not os.path.isfile("extract.py"):
    download_file(
        "https://raw.githubusercontent.com/Fuenfgeld/DMA2022DataProjectC/main/src/extract.py",
        "extract.py"
    )

for file in files:
    for table in tables:
        ensure_file_has_been_downloaded(file+table+".csv")

File data/others/careplans.csv already exists, skipping download
File data/others/conditions.csv already exists, skipping download
File data/others/observations.csv already exists, skipping download
File data/others/patients.csv already exists, skipping download
File data/asthma/careplans.csv already exists, skipping download
File data/asthma/conditions.csv already exists, skipping download
File data/asthma/observations.csv already exists, skipping download
File data/asthma/patients.csv already exists, skipping download
File data/gallstones/careplans.csv already exists, skipping download
File data/gallstones/conditions.csv already exists, skipping download
File data/gallstones/observations.csv already exists, skipping download
File data/gallstones/patients.csv already exists, skipping download
File data/hypertension/careplans.csv already exists, skipping download
File data/hypertension/conditions.csv already exists, skipping download
File data/hypertension/observations.csv already exis
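
The download helper writes to `"../" + filename`, which works only if the target directory already exists. A minimal sketch of how missing directories could be created before writing; `ensure_parent_dir` is a hypothetical name introduced here for illustration and is not part of the project:

```python
import os

def ensure_parent_dir(path):
    """Create the directory that will contain `path` if it is missing.

    Returns the parent directory ("" when `path` has no directory part).
    """
    parent = os.path.dirname(path)
    if parent:  # nothing to create for a bare file name such as "extract.py"
        os.makedirs(parent, exist_ok=True)
    return parent
```

Calling such a helper at the start of `download_file` would make the download step independent of directories created by hand.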

In [19]:
from logger import Logger
from test_executer import TestExecutor

logger = Logger()
testExecutor = TestExecutor(logger)

## Connecting to the database

In [20]:
import extract
import time

databaseFile = "data.sqlite"

logger.startTimeMeasurement('open-db', 'Connected to db and created tables')
connection = extract.connect_to_db(logger, databaseFile)
logger.endTimeMeasurement('open-db')

In [21]:
def test_sqliteConnection(_logger):
    cursor = connection.cursor()
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
    # Each result row is a 1-tuple holding the table name.
    tablesInDb = sorted(row[0] for row in cursor.fetchall())

    for table in tables:
        if table not in tablesInDb:
            raise Exception('Table not found:', table)

testExecutor.execute('Test connection to database', test_sqliteConnection)

{"type": "info", "time": 1656252038707, "message": "✅ Test ran successfully: Test connection to database", "params": null}
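
The same check can be tried in isolation against a throwaway in-memory database. The following is a minimal sketch, not the project's test code; `missing_tables` and the `demo` connection are hypothetical names, and the two demo tables have made-up columns:

```python
import sqlite3

def missing_tables(connection, expected):
    """Return the expected table names that are absent from the database."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    present = {name for (name,) in rows}
    return sorted(set(expected) - present)

# In-memory database that contains only two of the four expected tables.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE patients (id TEXT)")
demo.execute("CREATE TABLE observations (patient TEXT, code TEXT)")

print(missing_tables(demo, ["careplans", "conditions", "observations", "patients"]))
# prints ['careplans', 'conditions']
```

Returning the full list of missing tables, rather than raising on the first one, reports all gaps in a single run.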


## Loading data into the database

The following tables are loaded into the database:

-   careplans
-   conditions
-   observations
-   patients

In [22]:
logger.startTimeMeasurement('load-data', 'Loading data into db')
for file in files:
    for table in tables:
        extract.insert_values_to_table(logger, connection.cursor(), table, "../" + file + table + ".csv")
        connection.commit()
# End the measurement once, after all files have been loaded.
logger.endTimeMeasurement('load-data')

{"type": "info", "time": 1656252038736, "message": "🏗 Extracting data from ../data/others/careplans.csv", "params": null}
{"type": "info", "time": 1656252038743, "message": "🏗 Extracting data from ../data/others/conditions.csv", "params": null}
{"type": "info", "time": 1656252038761, "message": "🏗 Extracting data from ../data/others/observations.csv", "params": null}
{"type": "info", "time": 1656252039228, "message": "🏗 Extracting data from ../data/others/patients.csv", "params": null}
{"type": "info", "time": 1656252039233, "message": "🏗 Extracting data from ../data/asthma/careplans.csv", "params": null}
{"type": "info", "time": 1656252039246, "message": "🏗 Extracting data from ../data/asthma/conditions.csv", "params": null}
{"type": "info", "time": 1656252039288, "message": "🏗 Extracting data from ../data/asthma/observations.csv", "params": null}
{"type": "info", "time": 1656252040663, "message": "🏗 Extracting data from ../data/asthma/patients.csv", "params": null}
{"type": "info", "
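
The body of `extract.insert_values_to_table` is not shown in this notebook. As a rough illustration of what loading a CSV file into an SQLite table can look like, here is a minimal sketch using only the standard library. `insert_csv_rows` is a hypothetical helper, the two-column `patients` table is made up, and the sketch assumes that the CSV header names match the table's column names; it is not the project's implementation.

```python
import csv
import io
import sqlite3

def insert_csv_rows(cursor, table, csv_file):
    """Insert every row of an open CSV file into `table`; return the row count.

    Assumes the CSV header names match the table's column names and that
    `table` comes from trusted code, since it is interpolated into the SQL.
    """
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames
    column_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    sql = f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})'
    rows = [tuple(row[c] for c in columns) for row in reader]
    cursor.executemany(sql, rows)
    return len(rows)

# Demo with an in-memory database and an in-memory CSV "file".
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE patients (id TEXT, birthdate TEXT)")
sample = io.StringIO("id,birthdate\np1,1990-01-01\np2,2005-06-30\n")
inserted = insert_csv_rows(demo.cursor(), "patients", sample)
demo.commit()
```

The row values are passed as bound parameters via `executemany`, so only the table and column names are interpolated into the SQL text.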

## Measuring data errors

For our research question, only records with a measured BMI are relevant. If the BMI was not measured or not recorded, the patient's data cannot be used for the research question and are therefore unusable.

In [59]:
all_patients_query = """
SELECT COUNT(id) FROM patients;"""
# Patients with at least one observation carrying the code used for BMI here.
count_bmi_query = """
SELECT COUNT(DISTINCT patients.id) FROM patients JOIN observations ON patients.id = observations.patient WHERE observations.Code = '59576-9'"""

# Total number of such observations across all patients.
count_all_bmi_query = """
SELECT COUNT(patient) FROM observations WHERE observations.Code = '59576-9'"""

patient_all_count = connection.execute(all_patients_query).fetchall()[0][0]
patient_bmi_count = connection.execute(count_bmi_query).fetchall()[0][0]
bmi_count = connection.execute(count_all_bmi_query).fetchall()[0][0]
# Scale to percent before rounding to avoid floating-point residue in the output.
ratio = round(100 * patient_bmi_count / patient_all_count, 1)

print(f"The data contain {patient_all_count} patients in total. "
      f"A BMI was recorded for only {patient_bmi_count} of them ({bmi_count} measurements overall). "
      f"Thus only {ratio}% of the patients are relevant to our research question.")

The data contain 396 patients in total. A BMI was recorded for only 140 of them (1157 measurements overall). Thus only 35.4% of the patients are relevant to our research question.
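
To make the three counts easier to verify, the same logic can be run against a small toy data set. This is a minimal sketch under stated assumptions: `bmi_coverage` is a hypothetical helper, the tables contain only the columns needed for the queries, and the observation code is passed as a bound parameter instead of being written into the SQL text.

```python
import sqlite3

def bmi_coverage(connection, code="59576-9"):
    """Return (all patients, patients with the code, observations with the code)."""
    total = connection.execute("SELECT COUNT(*) FROM patients").fetchone()[0]
    with_code = connection.execute(
        "SELECT COUNT(DISTINCT p.id) FROM patients AS p "
        "JOIN observations AS o ON p.id = o.patient WHERE o.code = ?",
        (code,),
    ).fetchone()[0]
    measurements = connection.execute(
        "SELECT COUNT(*) FROM observations WHERE code = ?", (code,)
    ).fetchone()[0]
    return total, with_code, measurements

# Toy data: four patients, two of whom have the code recorded.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE patients (id TEXT)")
demo.execute("CREATE TABLE observations (patient TEXT, code TEXT)")
demo.executemany("INSERT INTO patients VALUES (?)", [("p1",), ("p2",), ("p3",), ("p4",)])
demo.executemany(
    "INSERT INTO observations VALUES (?, ?)",
    [("p1", "59576-9"), ("p1", "59576-9"), ("p2", "59576-9"), ("p3", "other-code")],
)

total, with_code, measurements = bmi_coverage(demo)
share = round(100 * with_code / total, 1)
print(total, with_code, measurements, share)
# prints 4 2 3 50.0
```

Patient `p1` has two measurements but is counted once, which is the difference between the distinct patient count and the total number of observations.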


## Cleaning up & saving logs

In [23]:
#connection.close()
#logger.logTimings()
#logger.writeToFile("../artefacts-for-release/analysis-log.json")