# HW1 - Clinical Trials Analytics in PySpark

### 1. Spark Installation


Download and install Spark with all its dependencies

In [1]:
#Install Java8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# download spark3.4.4
!wget -q https://apache.osuosl.org/spark/spark-3.4.4/spark-3.4.4-bin-hadoop3.tgz

# unzip it
!tar xf spark-3.4.4-bin-hadoop3.tgz

# install findspark
!pip install -q findspark

Environment variables must be set so that the Spark runtime is visible to the Linux environment. This also makes it possible to install several versions of Spark and decide later which one to use.

In [2]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.4.4-bin-hadoop3"

Import the `findspark` library, which locates Spark and initializes its configuration automatically, without having to configure further environment variables and other options by hand.

In [3]:
import findspark
findspark.init()

# Spark version verification on cluster
from pyspark.context import SparkContext
sc = SparkContext.getOrCreate()

assert "3." in sc.version, "Verify that the cluster Spark's version is 3.x"

In [4]:
sc

### 2. Loading the Dataset on Spark


The RDD is Spark's basic data representation, but to keep coding and design simple a new, higher-level data model was introduced.

Spark SQL provides that model: it lets users work with Datasets/DataFrames. These are table-like objects: each column has a name and a type, and each row is a combination of column values.

The Spark SQL engine translates SQL-like operations into RDD operations and ultimately produces an RDD holding the results.

In [5]:
# Libraries for SQL Spark
from pyspark.sql import SparkSession
from pyspark.sql import functions
from pyspark.sql.functions import split, explode, trim, count, sum, col, current_date, lower, regexp_replace
import time

spark = SparkSession(sc)
print(spark)

<pyspark.sql.session.SparkSession object at 0x7bc80ef8b510>


The dataset can be loaded from several sources (remote or local). Since it is a single, fairly small CSV file, we store it on Google Drive and then mount the drive on the cluster file system.

In [6]:
from google.colab import drive
drive.mount('/content/drive')

# Print all directories in my environment
print(os.listdir('/content/drive/MyDrive/Colab Notebooks/Big Data'))

homework1Path = "/content/drive/MyDrive/Colab Notebooks/Big Data/Homework 1"

# Save the path to the CSV file
csvPath = os.path.join(homework1Path,"dimensions_clinicalTrials.csv")
print(csvPath)

Mounted at /content/drive
['Esercitazione Spark 25 03 2025', 'Esercitazione Hive 25 03 2025', 'Homework 1']
/content/drive/MyDrive/Colab Notebooks/Big Data/Homework 1/dimensions_clinicalTrials.csv


Create the dataset object through the `read` attribute of the Spark session.

In [7]:
# inferSchema: infer the type of each column (float, integer, string, etc.)
# header: take the column names from the first row of the CSV
ctDS = spark.read \
  .option("inferSchema", "true") \
  .option("header", "true") \
  .option("multiline", "true") \
  .option("quote", "\"") \
  .option("escape", "\"") \
  .csv(csvPath)

# Print the schema
ctDS.printSchema()

root
 |-- Rank: integer (nullable = true)
 |-- Trial ID: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- Brief title: string (nullable = true)
 |-- Acronym: string (nullable = true)
 |-- Abstract: string (nullable = true)
 |-- Start date: date (nullable = true)
 |-- Start Year: double (nullable = true)
 |-- End Date: date (nullable = true)
 |-- Completion Year: double (nullable = true)
 |-- Phase: string (nullable = true)
 |-- Study Type: string (nullable = true)
 |-- Study Design: string (nullable = true)
 |-- Conditions: string (nullable = true)
 |-- Recruitment Status: string (nullable = true)
 |-- Number of Participants: double (nullable = true)
 |-- Intervention: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Registry: string (nullable = true)
 |-- Investigators/Contacts: string (nullable = true)
 |-- Sponsors/Collaborators: string (nullable = true)
 |-- City of Sponsor/Collaborator: string (nullable = true

In [8]:
# Let's print some rows
ctDS.select("Study Design").show(20, truncate = False)

+----------------------------------------------------------------------------------------------------------------------------+
|Study Design                                                                                                                |
+----------------------------------------------------------------------------------------------------------------------------+
|Allocation: Randomized; Intervention Model: Parallel Assignment; Masking: None (Open Label); Primary Purpose: Treatment     |
|Allocation: N/A; Intervention Model: Single Group Assignment; Masking: None (Open Label); Primary Purpose: Supportive Care  |
|Allocation: Randomised Controlled Trial; Primary Purpose: Treatment                                                         |
|Allocation: Randomized; Intervention Model: Parallel Assignment; Masking: Double; Primary Purpose: Treatment                |
|Observational Model: Cohort                                                                                   

### 3. Preprocessing


It may not be obvious at first sight, but this dataset contains many duplicate rows. We clean it with `dropDuplicates()`, keeping one row per `Trial ID`.
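
To make the effect of deduplicating on a key concrete, here is a minimal plain-Python sketch (no Spark involved). The sample rows and the helper names `dedup_full_rows` and `dedup_by_key` are hypothetical. The sketch only illustrates that deduplicating on `Trial ID` can remove more rows than deduplicating on whole rows, because rows that share an ID but differ in another column collapse into one.

```python
# Hypothetical sample rows: (trial_id, title, rank). Not taken from the dataset.
rows = [
    ("T1", "Study A", 1),
    ("T1", "Study A", 1),   # exact duplicate of the first row
    ("T1", "Study A", 7),   # same trial, different rank
    ("T2", "Study B", 2),
]

def dedup_full_rows(rows):
    """Keep one copy of each fully identical row, preserving order."""
    seen = set()
    out = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

def dedup_by_key(rows, key_index=0):
    """Keep the first row seen for each key (here: the trial ID)."""
    first_by_key = {}
    for row in rows:
        first_by_key.setdefault(row[key_index], row)
    return list(first_by_key.values())

print(len(dedup_full_rows(rows)))  # 3
print(len(dedup_by_key(rows)))     # 2
```

This sketch always keeps the first row per key. Spark's `dropDuplicates` on a column subset also keeps one row per key, but which one survives should not be relied on when the rows differ in other columns.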



In [9]:
# Dirty Dataset
count_dirty = ctDS.count()

# Cleaned Dataset
#ctDS = ctDS.fillna("NA_TEMP")
ctDS = ctDS.dropDuplicates(["Trial ID"])
#ctDS = ctDS.replace("NA_TEMP", None)
count_clean = ctDS.count()

print("Dirty: " + str(count_dirty))
print("Clean: " + str(count_clean))


Dirty: 15990
Clean: 8356


In [10]:
# Before the cleaning there were 7 rows with this "title"
ctDS.select("*").where(ctDS["Title"] == "Phase III Study on STem cElls Mobilization in Acute Myocardial Infarction").show(20,truncate=False)

+----+-----------+-------------------------------------------------------------------------+--------------------------------------------------------------------+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Something is still wrong: some columns hold pseudo-structured data, such as dictionaries and lists. In particular:

**Dictionaries**

*   Study Design

**Lists**

*   Conditions
*   Investigators/Contacts
*   Sponsors/Collaborators
*   City of Sponsor/Collaborator
*   State of Sponsor/Collaborator
*   Country of Sponsor/Collaborator
*   Fields of Research (ANZSRC 2020)
*   RCDC Categories
*   HRCS HC Categories
*   HRCS RAC Categories
*   Cancer Types
*   CSO Categories

**List of two elements**

*   Age (min ; max)

For better analytics, we manually convert the type of these columns with a preprocessing script.
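
As a rough preview of the target shapes, the sketch below applies the three conversions to sample strings in plain Python. The `Study Design` string is copied from the output shown earlier; the `Conditions` and `Age` strings and the function names `parse_list`, `parse_dict` and `parse_age_range` are assumptions made for illustration, in particular the `min - max` layout of the age string.

```python
import re

def parse_list(text):
    """Split a ';'-separated string into trimmed items; empty items become None."""
    if text is None:
        return None
    return [item.strip() or None for item in text.split(";")]

def parse_dict(text):
    """Turn 'key: value; key: value' into a dict, skipping parts without ':'."""
    if text is None:
        return None
    result = {}
    for part in text.split(";"):
        if ":" in part:
            key, value = part.split(":", 1)
            result[key.strip()] = value.strip()
    return result

def parse_age_range(text):
    """Extract [min, max] from a 'min - max' string; a missing bound becomes None."""
    if text is None:
        return None
    parts = text.split("-")
    if len(parts) != 2:
        return None
    bounds = []
    for part in parts:
        match = re.search(r"\d+", part)
        bounds.append(int(match.group()) if match else None)
    return bounds

design = ("Allocation: Randomized; Intervention Model: Parallel Assignment; "
          "Masking: Double; Primary Purpose: Treatment")
print(parse_dict(design)["Masking"])           # Double
print(parse_list("Epilepsy; ;Heart Failure"))  # ['Epilepsy', None, 'Heart Failure']
print(parse_age_range("18 Years - 65 Years"))  # [18, 65]
print(parse_age_range("18 Years - N/A"))       # [18, None]
```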

In [11]:
from pyspark.sql.functions import expr

# All columns that are pseudo-lists
columnsList = ["Conditions","Investigators/Contacts","Sponsors/Collaborators","City of Sponsor/Collaborator","State of Sponsor/Collaborator","Country of Sponsor/Collaborator","Fields of Research (ANZSRC 2020)","RCDC Categories","HRCS HC Categories","HRCS RAC Categories","Cancer Types","CSO Categories"]

# Loop over all list-like columns
for column in columnsList:
  ctDS = ctDS.withColumn(
      column,
      # With the expr function we can write the preprocessing in SQL-like syntax
      expr(f"""
        transform(
            split(`{column}`, ';'),
            x -> CASE WHEN trim(x) = '' THEN NULL ELSE trim(x) END
        )
      """
    )
  )



In [12]:
from pyspark.sql.functions import udf
from pyspark.sql.types import MapType, StringType

def string_to_map(text):
    if text is None:
        return None
    result = {}
    parts = text.split(';')
    for part in parts:
        if ':' in part:
            key, value = part.split(':', 1)
            result[key.strip()] = value.strip()
    return result

string_to_map_udf = udf(string_to_map, MapType(StringType(), StringType()))

ctDS = ctDS.withColumn("Study Design", string_to_map_udf(col("Study Design")))


In [13]:
from pyspark.sql.types import ArrayType, IntegerType
import re

def age_preprocessing(text):
  if text is None:
    return None

  result = []
  parts = text.split("-")
  if len(parts) == 2:
    for part in parts:
      # The pattern can match a decimal such as "1.5"; go through float so int() does not raise
      matching = re.search(r'\d+(\.\d+)?',part)
      if matching:
        result.append(int(float(matching.group())))
      else:
        result.append(None)
  else:
    return None

  return result

age_preprocessing_udf = udf(age_preprocessing,ArrayType(IntegerType()))

ctDS = ctDS.withColumn("Age", age_preprocessing_udf(col("Age")))


In [14]:
#ctDS.select("*").show(100,truncate = True)
ctDS.select(col("Conditions")[0].alias("first_condition")).show(truncate = False)

+-------------------------+
|first_condition          |
+-------------------------+
|null                     |
|null                     |
|null                     |
|null                     |
|null                     |
|null                     |
|null                     |
|null                     |
|Chronic myeloid leukaemia|
|null                     |
|null                     |
|null                     |
|null                     |
|null                     |
|Epilepsy                 |
|Colorectal cancer NOS    |
|null                     |
|null                     |
|null                     |
|null                     |
+-------------------------+
only showing top 20 rows



In [15]:
total_rows = ctDS.count()
null_rows = ctDS.filter(ctDS["Study Design"].isNull()).count()

if total_rows == null_rows:
    print("The column is entirely null")
else:
    print(f"The column contains {total_rows - null_rows} non-null values")

ctDS.select("Study Design")  \
  .filter(ctDS["Study Design"].isNotNull()) \
  .show(truncate=False)

The column contains 7352 non-null values
+------------------------------------------------------------------------------------------------------------------------+
|Study Design                                                                                                            |
+------------------------------------------------------------------------------------------------------------------------+
|{Intervention Model -> Parallel, Masking -> No, Primary Purpose -> Treatment, Allocation -> Randomized Controlled Study}|
|{Intervention Model -> Parallel, Masking -> No, Primary Purpose -> Treatment, Allocation -> Randomized Controlled Study}|
|{Primary Purpose -> Basic Research/Physiological Study}                                                                 |
|{Primary Purpose -> Basic Research/Physiological Study}                                                                 |
|{Primary Purpose -> Treatment}                                                                   

In [16]:
ctDS.select("Study Design").where(ctDS["Trial ID"]=="NCT05817903").show(truncate=False)

+---------------------------------------------------------------------------------------------------------------------------------+
|Study Design                                                                                                                     |
+---------------------------------------------------------------------------------------------------------------------------------+
|{Intervention Model -> Parallel Assignment, Masking -> None (Open Label), Primary Purpose -> Treatment, Allocation -> Randomized}|
+---------------------------------------------------------------------------------------------------------------------------------+



In [17]:
ctDS.select("Age").where(ctDS["Age"][0].isNotNull()).show(truncate=False)

+----------+
|Age       |
+----------+
|[35, 65]  |
|[12, 17]  |
|[18, null]|
|[18, null]|
|[18, 60]  |
|[18, 60]  |
|[18, null]|
|[2, 100]  |
|[18, 100] |
|[18, 100] |
|[18, 44]  |
|[18, 100] |
|[2, 17]   |
|[18, 100] |
|[12, 64]  |
|[2, 100]  |
|[18, 64]  |
|[18, 100] |
|[2, 100]  |
|[18, 100] |
+----------+
only showing top 20 rows



### 4. Analytics

In [18]:
# Save all the results in CSV format

# Build the results directory path and create the directory if it is missing
resultsPath = os.path.join(homework1Path,"results")
os.makedirs(resultsPath, exist_ok=True)

In [19]:
# Number of studies started per year
studiesPerYear = ctDS.select("Start Year") \
  .filter(ctDS["Start Year"].isNotNull()) \
  .groupBy(ctDS["Start Year"]) \
  .count() \
  .withColumnRenamed("count","NumStudies per Year") \
  .orderBy(col("NumStudies per Year").desc())

studiesPerYear.show(50,truncate = False)

# Saving result in a csv
studiesPerYearResult = studiesPerYear.toPandas()
studiesPerYearResult.to_csv(os.path.join(resultsPath,"studiesPerYear.csv"))


+----------+-------------------+
|Start Year|NumStudies per Year|
+----------+-------------------+
|2021.0    |722                |
|2020.0    |661                |
|2019.0    |640                |
|2018.0    |589                |
|2022.0    |585                |
|2017.0    |548                |
|2015.0    |457                |
|2016.0    |446                |
|2023.0    |399                |
|2014.0    |397                |
|2013.0    |370                |
|2012.0    |367                |
|2011.0    |325                |
|2010.0    |295                |
|2009.0    |287                |
|2008.0    |287                |
|2007.0    |226                |
|2006.0    |200                |
|2005.0    |130                |
|2004.0    |101                |
|2003.0    |56                 |
|2001.0    |45                 |
|2002.0    |40                 |
|2024.0    |38                 |
|2000.0    |35                 |
|1998.0    |19                 |
|1999.0    |18                 |
|1997.0   

In [20]:
# Average number of participants per study title

# Check whether the Title column is unique; the question may be referring to the title type
tot_rows = ctDS.count()

distinct_rows = ctDS.select("Title").distinct().count()

if tot_rows == distinct_rows:
  print("They're the same")
else:
  print("They're NOT the same")

# As shown, they are not the same, so grouping by title is meaningful
print("Tot_rows: " + str(tot_rows))
print("Distinct_rows: " + str(distinct_rows))

They're NOT the same
Tot_rows: 8356
Distinct_rows: 8334


In [21]:
# Check which titles are shared by different trials
duplicates = ctDS.groupBy("Title") \
  .count() \
  .filter(col("count") > 1)

duplicates.select("Title","count").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|Title                                                                                                                                                                                                                                                                        |count|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|SAlute e LaVoro in Chirurgia Oncologica (SALVO)                                                                                                                      

In [22]:
from pyspark.sql.functions import avg

# Average number of participants per study title
averagePerTitle = ctDS.select("Title","Number of Participants") \
  .filter(ctDS["Title"].isNotNull()) \
  .groupBy("Title") \
  .agg(avg("Number of Participants").alias("Average per Title")) \
  .orderBy(col("Average per Title").desc())

averagePerTitle.show(20,truncate=False)

averagePerTitleResult = averagePerTitle.toPandas()
averagePerTitleResult.to_csv(os.path.join(resultsPath,"averagePerTitle.csv"))


+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+
|Title                                                                                                                                                                                                                                                                                                                                                                                                                                                               

In [23]:
# Top 10 most frequent medical conditions
ctDS.select(explode(ctDS["Conditions"]).alias("Condition")) \
  .filter(col("Condition").isNotNull()) \
  .groupBy("Condition") \
  .count() \
  .withColumnRenamed("count","Count per Condition") \
  .orderBy(col("Count per Condition").desc()) \
  .limit(10).show(truncate=False)

+-----------------------+-------------------+
|Condition              |Count per Condition|
+-----------------------+-------------------+
|Breast Cancer          |157                |
|Multiple Myeloma       |83                 |
|Coronary Artery Disease|71                 |
|Heart Failure          |62                 |
|Ovarian Cancer         |60                 |
|Lung Cancer            |60                 |
|Colorectal Cancer      |56                 |
|Ulcerative Colitis     |52                 |
|Melanoma               |51                 |
|Prostate Cancer        |47                 |
+-----------------------+-------------------+



In [24]:
# Countries with the highest average number of participants per study
from pyspark.sql.functions import array_distinct
from pyspark.sql.functions import avg, row_number
from pyspark.sql.window import Window

# First query: average number of participants per study type and country
allAvgParticipants = ctDS.withColumn("Unique_Countries", array_distinct(col("Country of Sponsor/Collaborator"))) \
  .select("Study Type", explode(col("Unique_Countries")).alias("Country"),"Number of Participants") \
  .filter(col("Study Type").isNotNull() & col("Country").isNotNull() & col("Number of Participants").isNotNull()) \
  .groupBy("Study Type","Country") \
  .agg(avg("Number of Participants").alias("Average per Type/Country")) \
  .orderBy(col("Average per Type/Country").desc())

# Define a window to rank the countries within each study type
windowSpec = Window.partitionBy("Study Type").orderBy(col("Average per Type/Country").desc())

# Keep only the country with the highest average for each study type
maxCountryAvgPerStudyType = allAvgParticipants.withColumn("rank",row_number().over(windowSpec)) \
  .filter(col("rank")==1) \
  .select("Study Type","rank", "Country", "Average per Type/Country")

# show() returns None, so call it separately to keep the DataFrame in the variable
maxCountryAvgPerStudyType.show(100,truncate=False)




+-------------------+----+--------+------------------------+
|Study Type         |rank|Country |Average per Type/Country|
+-------------------+----+--------+------------------------+
|Active surveillance|1   |Italy   |115000.0                |
|CCT                |1   |Colombia|520.0                   |
|Interventional     |1   |Iceland |7924.888888888889       |
|Non-interventional |1   |Belgium |8122.5                  |
|Observational      |1   |Iceland |609000.0                |
|Other              |1   |Italy   |202.0                   |
+-------------------+----+--------+------------------------+
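
The window step above keeps, for each study type, the row that `row_number()` ranks first. The following plain-Python sketch mirrors that logic on a few made-up `(study type, country, average)` tuples; the data and the function name `top_per_group` are hypothetical. It also makes one caveat visible: when two countries tie within a group, only one of them survives a `rank == 1` filter.

```python
def top_per_group(rows):
    """Return, for each group, the row with the highest value.

    rows: iterable of (group, label, value) tuples.
    On ties the first row encountered wins, so the other tied row is dropped.
    """
    best = {}
    for group, label, value in rows:
        if group not in best or value > best[group][2]:
            best[group] = (group, label, value)
    return sorted(best.values())

# Made-up averages, not taken from the dataset.
rows = [
    ("Interventional", "Italy", 120.0),
    ("Interventional", "Spain", 300.0),
    ("Observational", "France", 900.0),
    ("Observational", "Belgium", 900.0),  # ties with France
]

for row in top_per_group(rows):
    print(row)
```

In Spark, using `rank()` or `dense_rank()` in place of `row_number()` would keep all tied rows, at the cost of possibly returning more than one row per study type.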



In [25]:
# Range of years (dates) in which the most people took part in studies
# Cities that have treated a given condition the most - DONE
# Age range that is studied/enrolled for a given medical condition
# Country that sponsors the most clinical trials aimed at women - DONE
# Country with the highest number of studies that include minors - DONE
# State that studies a given cancer type the most - DONE
# Studies/trials that have currently ended, with info on the conditions they treat - DONE
# Studies with the highest media visibility (Altmetric Score) - DONE
# Conditions that on average have the highest media visibility (Altmetric Score)
# Conditions that are also studied in minors

##### WARNING: CONDITIONS CONTAINS A "MORE CONDITIONS" FIELD THAT IS A DICTIONARY #######

In [26]:
# Cities that have treated a given condition the most (per condition)

# Preprocessing, because a select cannot contain more than one explode at a time
cityConditionDS = ctDS.withColumn("City",explode(array_distinct(col("City of Sponsor/Collaborator")))) \
  .withColumn("Condition",explode(ctDS["Conditions"]))

# Defining the windowSpec
windowSpecCityCondition = Window.partitionBy("Condition").orderBy(col("NumStudies per City/Condition").desc())

# Here we could filter for a specific city, for instance where City == Bologna
cityConditionDS.select("City","Condition")\
  .groupBy("City","Condition") \
  .count() \
  .withColumnRenamed("count","NumStudies per City/Condition") \
  .withColumn("rank",row_number().over(windowSpecCityCondition)) \
  .filter((col("rank") == 1) & (col("City").isNotNull()) & (col("NumStudies per City/Condition") > 1)) \
  .select("City","Condition","NumStudies per City/Condition") \
  .show(20,truncate=False)


+----------------+---------------------------------------------------+-----------------------------+
|City            |Condition                                          |NumStudies per City/Condition|
+----------------+---------------------------------------------------+-----------------------------+
|Pavia           |AL Amyloidosis                                     |6                            |
|Milan           |ALS                                                |3                            |
|Naples          |Acid Maltase Deficiency                            |2                            |
|Orlando         |Acquired Immunodeficiency Syndrome                 |2                            |
|Yorkville       |Acral Lentiginous Melanoma                         |2                            |
|Brno            |Acute Ischemic Stroke                              |2                            |
|Rome            |Acute Myeloid Leukemia                             |29                   

In [27]:
# Nation that sponsors the most clinical trials for women

ctDS.select(explode(array_distinct(ctDS["Country of Sponsor/Collaborator"])).alias("Nation")) \
  .where((ctDS["Gender"] == "Female") & (col("Nation").isNotNull())) \
  .groupBy("Nation") \
  .count() \
  .withColumnRenamed("count","NumFemaleStudies per Nation") \
  .orderBy(col("NumFemaleStudies per Nation").desc()) \
  .show(10,truncate=False)


+--------------+---------------------------+
|Nation        |NumFemaleStudies per Nation|
+--------------+---------------------------+
|Italy         |582                        |
|United States |176                        |
|Spain         |133                        |
|Belgium       |109                        |
|France        |106                        |
|Germany       |102                        |
|United Kingdom|102                        |
|Canada        |75                         |
|Poland        |71                         |
|Netherlands   |59                         |
+--------------+---------------------------+
only showing top 10 rows
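
The queries in this section wrap the country array in `array_distinct` before `explode`. A minimal plain-Python sketch of why that matters, using made-up studies and a hypothetical helper `count_studies_per_country`: without deduplication, a study with two sponsors from the same country would be counted twice for that country.

```python
from collections import Counter

def count_studies_per_country(studies, distinct=True):
    """Count studies per country.

    studies: list of country lists, one list per study.
    distinct=True mimics array_distinct + explode (one count per study and country);
    distinct=False mimics a bare explode (one count per sponsor entry).
    """
    counts = Counter()
    for countries in studies:
        # dict.fromkeys removes repeats while keeping the original order
        items = dict.fromkeys(countries) if distinct else countries
        for country in items:
            if country is not None:
                counts[country] += 1
    return counts

# Made-up sponsor countries for three studies.
studies = [
    ["Italy", "Italy", "Spain"],
    ["Italy"],
    ["Spain", None],
]

print(count_studies_per_country(studies))
print(count_studies_per_country(studies, distinct=False))
```

With deduplication Italy is counted for two studies; without it the first study contributes twice and Italy reaches three.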



In [37]:
# Country with the highest number of studies that include minors
ctDS.select(explode(array_distinct("Country of Sponsor/Collaborator")).alias("Country")) \
  .filter((col("Age").isNotNull()) & (col("Age")[0] < 18) & (col("Country").isNotNull())) \
  .groupBy("Country") \
  .count() \
  .withColumnRenamed("count","Count of Studies per Country") \
  .orderBy(col("Count of Studies per Country").desc()) \
  .show(20,truncate=False)

+--------------+----------------------------+
|Country       |Count of Studies per Country|
+--------------+----------------------------+
|Italy         |828                         |
|United States |475                         |
|Spain         |361                         |
|France        |356                         |
|United Kingdom|342                         |
|Germany       |339                         |
|Belgium       |265                         |
|Canada        |242                         |
|Poland        |220                         |
|Netherlands   |218                         |
|Australia     |190                         |
|Israel        |158                         |
|Austria       |133                         |
|Switzerland   |129                         |
|Hungary       |127                         |
|Czechia       |126                         |
|Sweden        |123                         |
|Japan         |115                         |
|Brazil        |106               

In [42]:
# States that study a given cancer type the most
# Preprocessing, because a select cannot contain more than one explode at a time
stateCancerDS = ctDS.withColumn("State",explode(array_distinct(col("State of Sponsor/Collaborator")))) \
  .withColumn("Cancer Type",explode(ctDS["Cancer Types"]))

# Defining the windowSpec
windowSpecStateCancer = Window.partitionBy("Cancer Type").orderBy(col("NumStudies per State/Cancer").desc())

# Here we could filter for a specific state
stateCancerDS.select("State","Cancer Type")\
  .groupBy("State","Cancer Type") \
  .count() \
  .withColumnRenamed("count","NumStudies per State/Cancer") \
  .withColumn("rank",row_number().over(windowSpecStateCancer)) \
  .filter((col("rank") == 1) & (col("State").isNotNull())) \
  .show(20,truncate=False)

+-----+-----------+---------------------------+----+
|State|Cancer Type|NumStudies per State/Cancer|rank|
+-----+-----------+---------------------------+----+
+-----+-----------+---------------------------+----+



In [48]:
# This cannot be done by state, because the states with rank = 1 are null, so we do it by city
stateCancerDS.select("State","Cancer Type")\
  .groupBy("State","Cancer Type") \
  .count() \
  .withColumnRenamed("count","NumStudies per State/Cancer") \
  .withColumn("rank",row_number().over(windowSpecStateCancer)) \
  .orderBy(col("rank")) \
  .show(100,truncate=False)


+---------------+----------------------------------------------------------+---------------------------+----+
|State          |Cancer Type                                               |NumStudies per State/Cancer|rank|
+---------------+----------------------------------------------------------+---------------------------+----+
|null           |Kaposi's Sarcoma                                          |4                          |1   |
|null           |Anal Cancer                                               |4                          |1   |
|null           |Bladder Cancer                                            |58                         |1   |
|null           |Blood Cancer                                              |46                         |1   |
|null           |Bone Cancer, Osteosarcoma / Malignant Fibrous Histiocytoma|34                         |1   |
|null           |Breast Cancer                                             |340                        |1   |
|null     

In [51]:
# Cities that study a given cancer type the most
# Preprocessing, because a select cannot contain more than one explode at a time
cityCancerDS = ctDS.withColumn("City",explode(array_distinct(col("City of Sponsor/Collaborator")))) \
  .withColumn("Cancer Type",explode(ctDS["Cancer Types"]))

# Defining the windowSpec
windowSpecCityCancer = Window.partitionBy("Cancer Type").orderBy(col("NumStudies per City/Cancer").desc())

# Here we could filter for a specific city, for instance where City == Bologna
cityCancerDS.select("City","Cancer Type")\
  .groupBy("City","Cancer Type") \
  .count() \
  .withColumnRenamed("count","NumStudies per City/Cancer") \
  .withColumn("rank",row_number().over(windowSpecCityCancer)) \
  .filter((col("rank") == 1) & (col("City").isNotNull())) \
  .orderBy(col("NumStudies per City/Cancer").desc()) \
  .show(20,truncate=False)

+-------+----------------------------------------------------------+--------------------------+----+
|City   |Cancer Type                                               |NumStudies per City/Cancer|rank|
+-------+----------------------------------------------------------+--------------------------+----+
|Milan  |Lung Cancer                                               |255                       |1   |
|Rome   |Leukemia / Leukaemia                                      |218                       |1   |
|Milan  |Non-Hodgkin's Lymphoma                                    |216                       |1   |
|Milan  |Breast Cancer                                             |190                       |1   |
|Milan  |Colon and Rectal Cancer                                   |136                       |1   |
|Milan  |Not Site-Specific Cancer                                  |101                       |1   |
|Milan  |Liver Cancer                                              |89                     

In [61]:
# Studies/trials that have currently ended, with info on the conditions they treat
from pyspark.sql.functions import to_date, current_date

ctDS.withColumn("End Date",to_date("End Date","yyyy-MM-dd")) \
  .select("Trial ID","End Date",explode(col("Conditions")).alias("Condition")) \
  .where((col("End Date") < current_date()) & (col("Condition").isNotNull())) \
  .orderBy(col("End Date").desc()) \
  .show(truncate=False)

+--------------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Trial ID      |End Date  |Condition                                                                                                                                                                     |
+--------------+----------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|NCT03370172   |2025-04-24|Hemophilia A                                                                                                                                                                  |
|NCT05405114   |2025-04-16|Sickle Cell Disease                                                                                                                                              

In [62]:
# Studies with the highest media visibility (Altmetric Attention Score)
ctDS.select("Trial ID","Altmetric Attention Score",explode(col("Conditions")).alias("Condition")) \
  .where(col("Altmetric Attention Score").isNotNull() & col("Condition").isNotNull()) \
  .orderBy(col("Altmetric Attention Score").desc()) \
  .show(truncate=False)

+-----------+-------------------------+-----------------------------------------------+
|Trial ID   |Altmetric Attention Score|Condition                                      |
+-----------+-------------------------+-----------------------------------------------+
|NCT04575597|1703.0                   |Coronavirus Disease (COVID-19)                 |
|NCT02819635|1283.0                   |Ulcerative Colitis (UC)                        |
|NCT03104400|1247.0                   |Psoriatic Arthritis                            |
|NCT03105128|1129.0                   |Crohn's Disease                                |
|NCT04292899|1081.0                   |COVID-19                                       |
|NCT03398148|1061.0                   |Ulcerative Colitis (UC)                        |
|NCT03398135|1030.0                   |Ulcerative Colitis (UC)                        |
|NCT03569293|1010.0                   |Atopic Dermatitis                              |
|NCT03671148|962.0              

In [65]:
# Conditions that on average have the highest media visibility (Altmetric Attention Score)
ctDS.select(explode(col("Conditions")).alias("Condition"),"Altmetric Attention Score") \
  .filter(col("Condition").isNotNull()) \
  .groupBy("Condition") \
  .agg(avg("Altmetric Attention Score").alias("Average Attention Score")) \
  .orderBy(col("Average Attention Score").desc()) \
  .show(truncate=False)

+-----------------------------------------------+-----------------------+
|Condition                                      |Average Attention Score|
+-----------------------------------------------+-----------------------+
|Coronavirus Disease (COVID-19)                 |1703.0                 |
|Psoriatic Arthritis (PsA)                      |962.0                  |
|Untreated AML                                  |834.0                  |
|Newly Diagnosed Acute Myeloid Leukemia (AML)   |834.0                  |
|AML Arising From Myelodysplastic Syndrome (MDS)|834.0                  |
|KRAS p, G12c Mutated /Advanced Metastatic NSCLC|788.0                  |
|Central Nervous System Neoplasms               |784.0                  |
|Breast Diseases                                |784.0                  |
|Brain Neoplasms                                |784.0                  |
|Retinal Vein Occlusion                         |774.0                  |
|Glioblastoma Multiforme, Adult       