# Data Extraction

The main purpose of this module is to retrieve, transform, clean, and load data from medical CSV files, which will serve as the initial dataset for our healthcare support system.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import json

# Path to the JSON secrets file, stored on the collaborators' Google Drive
json_path = '/content/drive/MyDrive/Colab Notebooks/Big Data/Final_Project/secret.json'

# Loading the json file
with open(json_path) as f:
  secrets = json.load(f)

# Secret info from json
#mongo_uri = secrets["MONGO_BASE_URI"]
mongo_uri = secrets["MONGO_M10_URI"]
collection_string_list = secrets["COLLECTION_STRING_LIST"]

In [3]:
import os
import pandas as pd
dataset_path = '/content/drive/MyDrive/Colab Notebooks/Big Data/Final_Project/Dataset/'

## Data Retrieval


First, we have to read all the data from the CSV files, which act as our database. The data also needs cleaning, and for that we want a framework that scales horizontally across a cluster. Let's use Spark!

In [4]:
#Install Java8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# download spark3.4.4 (list of mirrors)
#!wget -q https://apache.osuosl.org/spark/spark-3.4.4/spark-3.4.4-bin-hadoop3.tgz
#!wget -q https://dlcdn.apache.org/spark/spark-3.4.4/spark-3.4.4-bin-hadoop3.tgz
!wget -q https://archive.apache.org/dist/spark/spark-3.4.4/spark-3.4.4-bin-hadoop3.tgz

# unzip it
!tar xf spark-3.4.4-bin-hadoop3.tgz

# install findspark
!pip install -q findspark

# Download the MongoDB Spark connector and the MongoDB Java driver jars it depends on
!wget -q https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.12/10.4.1/mongo-spark-connector_2.12-10.4.1.jar
!wget -q https://repo1.maven.org/maven2/org/mongodb/mongodb-driver-sync/4.10.2/mongodb-driver-sync-4.10.2.jar
!wget -q https://repo1.maven.org/maven2/org/mongodb/bson/4.10.2/bson-4.10.2.jar
!wget -q https://repo1.maven.org/maven2/org/mongodb/mongodb-driver-core/4.10.2/mongodb-driver-core-4.10.2.jar



In the second part of this notebook, we will load all the cleaned data to a MongoDB server using a MongoDB Atlas connection URI. To do this directly with PySpark, we need to use a dedicated connector. The following cells will contain its configuration.

In [5]:
import os

# Environment variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.4.4-bin-hadoop3"
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--jars /content/mongo-spark-connector_2.12-10.4.1.jar,'
    '/content/mongodb-driver-sync-4.10.2.jar,'
    '/content/bson-4.10.2.jar,'
    '/content/mongodb-driver-core-4.10.2.jar pyspark-shell'
)
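
The jar list above is written out by hand. As a minimal sketch, the same string can be assembled from a single list, so that any other setting that needs the jar paths can reuse it. The helper name `build_submit_args` and the `jar_dir` parameter are illustrative assumptions, not part of the notebook.

```python
# Jar file names downloaded in the install cell
JAR_NAMES = [
    "mongo-spark-connector_2.12-10.4.1.jar",
    "mongodb-driver-sync-4.10.2.jar",
    "bson-4.10.2.jar",
    "mongodb-driver-core-4.10.2.jar",
]

def build_submit_args(jar_dir: str = "/content") -> str:
    """Hypothetical helper: join the jar paths into a PYSPARK_SUBMIT_ARGS value."""
    jars = ",".join(f"{jar_dir}/{name}" for name in JAR_NAMES)
    return f"--jars {jars} pyspark-shell"

submit_args = build_submit_args()
print(submit_args)
```

The result could then be assigned to `os.environ['PYSPARK_SUBMIT_ARGS']` in place of the hand-written string.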

In [6]:
import findspark
findspark.init()

In [7]:
# Libraries for SQL Spark
from pyspark.sql import SparkSession
from pyspark.sql import functions
from pyspark.sql.functions import split, explode, trim, count, sum, col, current_date, lower, regexp_replace
import time

# Spark Session configuration
spark = SparkSession.builder \
    .appName("MongoDBAtlasConnection") \
    .config("spark.mongodb.read.connection.uri", mongo_uri) \
    .config("spark.mongodb.write.connection.uri", mongo_uri) \
    .config("spark.jars", "/content/mongo-spark-connector_2.12-10.4.1.jar") \
    .getOrCreate()
print(spark)

<pyspark.sql.session.SparkSession object at 0x7d2b798d1fd0>


In [8]:
# Check that the connector jar is registered in the Spark configuration
print(spark.sparkContext.getConf().get("spark.jars"))

/content/mongo-spark-connector_2.12-10.4.1.jar


In [9]:
# Try to connect to MongoDB Atlas
try:
    df = spark.read \
        .format("mongodb") \
        .option("database", "CAMPANIA_SALUTE") \
        .option("collection", "ANAGRAFICA") \
        .load()
    print("Connection successful! Here are the first 5 documents:")
    df.show(5)
except Exception as e:
    print("Connection error:", str(e))

Connection successful! Here are the first 5 documents:
++
||
++
++



We need to load a fairly large number of tables (collections in MongoDB). To keep the workflow manageable, we first build a dictionary that maps each collection name to its CSV path, and then wrap loading, access, and saving in a small manager class.

In [10]:
# Dictionary of all csv paths
csvPaths = {}
healthDB_path = os.path.join(dataset_path, '2024-05-05-DATABASE')
for collection in collection_string_list:
  csvPaths[collection] = os.path.join(healthDB_path,collection + '.csv')

In [11]:
print(csvPaths['ANAGRAFICA'])

/content/drive/MyDrive/Colab Notebooks/Big Data/Final_Project/Dataset/2024-05-05-DATABASE/ANAGRAFICA.csv
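
Before handing the dictionary to Spark, it can be useful to confirm that every path actually exists. The following is a minimal sketch of such a check. `find_missing_files` is a hypothetical helper, and the demonstration uses a temporary directory instead of the real Drive folder.

```python
import os
import tempfile

def find_missing_files(paths: dict) -> list:
    """Hypothetical helper: return the dataset names whose file does not exist."""
    return [name for name, path in paths.items() if not os.path.isfile(path)]

# Demonstration on a temporary directory rather than the real dataset folder
with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "ANAGRAFICA.csv")
    with open(existing, "w", encoding="utf-8") as f:
        f.write("CODPAZ,COGNOME\n")
    demo_paths = {
        "ANAGRAFICA": existing,
        "ANAMNESI": os.path.join(tmp, "ANAMNESI.csv"),  # deliberately not created
    }
    missing = find_missing_files(demo_paths)

print(missing)
```

In the notebook, the same function could be called as `find_missing_files(csvPaths)`.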


In [12]:
import csv

class CSVLoaderManager:
    def __init__(self, spark: SparkSession, mongo_uri: str = None):
        """
        The constructor initializes the CSVLoaderManager with a SparkSession and an optional MongoDB URI.
        """
        self.spark = spark
        self.mongo_uri = mongo_uri
        self.datasets = {}

    def _detect_delimiter(self, name, file_path: str, sample_size: int = 2048) -> str:
        """
        Read the first `sample_size` characters of the file and infer its delimiter (for example comma or tab).
        """
        with open(file_path, 'r', encoding='utf-8') as f:
            # Reads some samples
            sample = f.read(sample_size)
            sniffer = csv.Sniffer()
            try:
                # Use sniffer for checking the delimiter
                dialect = sniffer.sniff(sample)
                print(f"Name: {name}; Delimiter: {dialect.delimiter}")
                return dialect.delimiter
            except csv.Error:
                raise ValueError(f"Unable to automatically detect the file delimiter format of: {file_path}")


    def load_csv(self, name: str, file_path: str) -> None:
        """
        This function loads a CSV file into a Spark DataFrame,
        it adds the dataframe to the manager's datasets
        """
        delimiter = self._detect_delimiter(name, file_path)
        ds = self.spark.read \
          .option("delimiter",delimiter) \
          .option("inferSchema", "true") \
          .option("header", "true") \
          .option("multiline", "true") \
          .option("quote", "\"") \
          .option("escape", "\"") \
          .csv(file_path)
        self.datasets[name] = ds

    def load_many(self, files) -> None:
        """
        This function loads multiple CSV files into Spark DataFrames
        """
        for name, path in files.items():
            self.load_csv(name, path)

    def get(self, name: str):
        return self.datasets.get(name)

    def set(self, name: str, df, overwrite: bool=True):
      """
      This function stores a DataFrame in the manager's datasets.
      If a dataset with the same name already exists, it is replaced only when 'overwrite' is True.
      """
      if not hasattr(df, 'schema'):
        raise TypeError("The given value is not a Spark dataframe.")
      if name in self.datasets and not overwrite:
        raise ValueError(f"The dataset '{name}' already exists. Use overwrite=True.")
      self.datasets[name] = df

    def list_datasets(self):
      """
      This function returns a list of the names of the datasets in the manager
      """
      return list(self.datasets.keys())

    def save_to_mongo(self, name: str, database: str, mode: str = "ignore") -> None:
      """
      This function saves a DataFrame to MongoDB ATLAS
      """
      if name not in self.datasets:
          raise ValueError(f"Dataset '{name}' not found.")
      if not self.mongo_uri:
          raise ValueError("Mongo URI not configured.")
      self.datasets[name].write \
          .format("mongodb") \
          .mode(mode) \
          .option("database", database) \
          .option("collection", name) \
          .save()

    def drop_collections(self, database: str, collections: list) -> None:
      """
      Drops specified collections from the given MongoDB database.
      """
      # Imported here because pymongo is only needed by this method;
      # it must be installed first (see the `pip install pymongo` cell)
      from pymongo import MongoClient

      if not self.mongo_uri:
          raise ValueError("Mongo URI not configured.")
      client = MongoClient(self.mongo_uri)
      db = client[database]
      # Fetch the existing collection names once, not on every iteration
      existing = set(db.list_collection_names())
      for collection in collections:
        if collection in existing:
          db.drop_collection(collection)
          print(f"Collection '{collection}' dropped successfully.")
        else:
          print(f"Collection '{collection}' not found in the database.")
      client.close()


Now we can load all the collections into DataFrames and start manipulating them.

In [13]:
manager = CSVLoaderManager(spark, mongo_uri)
manager.load_many(csvPaths)
manager.list_datasets()

Name: ANAGRAFICA; Delimiter: ,
Name: ANAMNESI; Delimiter: 	
Name: CORONAROGRAFIA_PTCA; Delimiter: 	
Name: ECOCARDIO_DATI; Delimiter: 	
Name: ECOCAROTIDI; Delimiter: 	
Name: ESAMI_LABORATORIO; Delimiter: 	
Name: ESAMI_SPECIALISTICI; Delimiter: 	
Name: ESAMI_STRUMENTALI_CARDIO; Delimiter: 	
Name: LISTA_EVENTI; Delimiter: 	
Name: PREVALENT; Delimiter: 	
Name: RICOVERO_OSPEDALIERO; Delimiter: 	
Name: VISITA_CONTROLLO_ECG; Delimiter: 	


['ANAGRAFICA',
 'ANAMNESI',
 'CORONAROGRAFIA_PTCA',
 'ECOCARDIO_DATI',
 'ECOCAROTIDI',
 'ESAMI_LABORATORIO',
 'ESAMI_SPECIALISTICI',
 'ESAMI_STRUMENTALI_CARDIO',
 'LISTA_EVENTI',
 'PREVALENT',
 'RICOVERO_OSPEDALIERO',
 'VISITA_CONTROLLO_ECG']
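
The output above shows that one file is comma-separated while the others are tab-separated. The sketch below illustrates how `csv.Sniffer` infers this from a text sample. Passing `delimiters` to restrict the candidates is an assumption added here (the class above calls `sniff` without it), and the sample strings are made up.

```python
import csv

def detect_delimiter(sample: str, candidates: str = ",\t;") -> str:
    """Infer the delimiter of a text sample, considering only `candidates`."""
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter

# Made-up samples, not taken from the real dataset
comma_sample = "CODPAZ,SESSO,SEZIONE\n1,F,1\n7,F,1\n"
tab_sample = "CODPAZ\tSESSO\tSEZIONE\n1\tF\t1\n7\tF\t1\n"

comma_result = detect_delimiter(comma_sample)
tab_result = detect_delimiter(tab_sample)
print(repr(comma_result), repr(tab_result))
```

Restricting the candidates reduces the chance that the sniffer picks a character that merely happens to occur regularly in the sampled text.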

In [14]:
manager.get('LISTA_EVENTI').printSchema()

root
 |-- SEZIONE: integer (nullable = true)
 |-- CODPAZ: integer (nullable = true)
 |-- DATA: date (nullable = true)
 |-- NUM_PROGRESSIVO: integer (nullable = true)
 |-- TIPO_EVENTO: string (nullable = true)
 |-- NUM_PROGRESSIVO_GLOBALE: integer (nullable = true)



In [None]:
# ------ TO DO ----------

### CONTINUE FROM HERE: LIST THE COLUMNS TO CONVERT TO LOWER CASE
### NORMALIZE EVERYTHING TO NONE
### CONCLUSIONI IN ESAMI_STRUMENTALI_CARDIO: UNKNOW -> NONE
### LOAD ALL THE COLLECTIONS INTO MONGODB
### TRY GENERATING QUERIES WITH AN LLM
### START THE GRAPHICAL MODULE FOR STREAMLIT


In general, the datasets are already fairly clean, but a few issues could cause trouble later. For example, some columns contain the same value written sometimes in lower case and sometimes in upper case.
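
One way to spot such columns is to group values by their upper-case form and keep the groups that have more than one spelling. This is a minimal plain-Python sketch on made-up values. `find_case_variants` is a hypothetical helper, and on the real data the input would be the distinct values of a column.

```python
from collections import defaultdict

def find_case_variants(values):
    """Group values that differ only by letter case; keep groups with 2+ spellings."""
    groups = defaultdict(set)
    for value in values:
        if isinstance(value, str):  # skip None and non-string values
            groups[value.upper()].add(value)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}

# Made-up sample, not real records
sample = ["ACERRA", "Acerra", "acerra", "NAPOLI", "Napoli", "SALERNO", None]
variants = find_case_variants(sample)
print(variants)
```

A column that returns a non-empty dictionary is a candidate for case normalization.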

In [15]:
# Convert to upper case the columns that may cause problems in query generation
from pyspark.sql.functions import col, upper

df = manager.get('ANAGRAFICA')

upper_df = df \
      .withColumn("COGNOME", upper(col("COGNOME"))) \
      .withColumn("NOMEPAZ", upper(col("NOMEPAZ"))) \
      .withColumn("COMUNE_DI_NASCITA", upper(col("COMUNE_DI_NASCITA")))

manager.set('ANAGRAFICA', upper_df)
manager.get('ANAGRAFICA').show(10)

+-------+------+------------+----------+-------------+-----+-----------------+------------------------+----------------+----------------+--------------------+------------+
|SEZIONE|CODPAZ|     COGNOME|   NOMEPAZ|DATADINASCITA|SESSO|COMUNE_DI_NASCITA|CODICE_COMUNE_DI_NASCITA|  CODICE_FISCALE|GATE_DI_INGRESSO|      MOTIVO_DECESSO|DATA_DECESSO|
+-------+------+------------+----------+-------------+-----+-----------------+------------------------+----------------+----------------+--------------------+------------+
|      1|     1|           A|     NELLO|   1937-02-16|    F|                -|                    null|---NLL37B56-----|         Esterno|                null|        null|
|      1|     7|       NAVAS| MADDALENA|   1937-01-18|    F|           ACERRA|                    A024|NVSMDL37A58A024E|    Ipertensione|Causa extracardio...|  2019-07-02|
|      1|    17|D`ALESSANDRO|  RAFFAELE|   1932-10-24|    M|                 |                        |DLSRFL32R24-----|    Ipertensione|   

Another problem is that missing values are represented in several different ways in some columns (for example `null`, `-`, or an empty string).

In [16]:
def normalize_null(value):
    """
    Map null-like or missing values to None; strings are stripped of surrounding whitespace first.
    """
    null_equivalents = {"", " ", "  ", "null", "None", "N/A", "na", "-", "--", "NaN"}

    if isinstance(value, str):
        value = value.strip()
    return None if value in null_equivalents else value
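
Note that the set above is matched case-sensitively, so a token such as `"NULL"` or `"nan"` would pass through unchanged. If that matters for the data, a case-insensitive variant could look like the sketch below. `normalize_null_ci` and `NULL_TOKENS` are hypothetical names, not part of the notebook.

```python
# Lower-case null-like tokens (hypothetical list, adapt it to the data)
NULL_TOKENS = {"", "null", "none", "n/a", "na", "-", "--", "nan"}

def normalize_null_ci(value):
    """Hypothetical variant: treat null-like tokens as None regardless of case."""
    if isinstance(value, str):
        stripped = value.strip()
        return None if stripped.lower() in NULL_TOKENS else stripped
    return value  # non-string values (including None) are returned unchanged

results = [normalize_null_ci(v) for v in ["NULL", " n/a ", "--", "Esterno", 42, None]]
print(results)
```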


In [17]:
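from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# Build the UDF once and reuse it for every column
norm_udf = udf(normalize_null, StringType())

for name in collection_string_list:
  df = manager.get(name)

  # Apply the UDF only to string columns, so that numeric and date
  # columns keep the types inferred when the CSV was read
  string_columns = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]
  for column_name in string_columns:
    df = df.withColumn(column_name, norm_udf(col(column_name)))

  manager.set(name, df)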


### Drop Collection 🚩 ***Run this only if you want to delete collections from MongoDB Atlas***

In [None]:
manager.drop_collections("CAMPANIA_SALUTE", ["LISTA_EVENTI"])

## Data Loading in Mongo DB

We now load the processed datasets into MongoDB Atlas through the MongoDB Spark connector, which writes each Spark DataFrame directly to a collection of the same name. The `pymongo` package installed below is only needed by `drop_collections`; the write itself goes through the connector.

In [19]:
!pip install pymongo

Collecting pymongo
  Downloading pymongo-4.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading pymongo-4.13.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
Downloading dnspython-2.7.0-py3-none-any.whl (313 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.7.0 pymongo-4.13.0


In [20]:
for name in collection_string_list:
  manager.save_to_mongo(name, "CAMPANIA_SALUTE", mode="overwrite")