# EUDRA Vigilance
Data source : EudraVigilance, Human datasets
Data location : https://www.adrreports.eu/

```
EudraVigilance is the system for managing and analysing information on suspected adverse reactions to medicines which have been authorised or being studied in clinical trials in the European Economic Area (EEA). The European Medicines Agency (EMA) operates the system on behalf of the European Union (EU) medicines regulatory network.
```

* [COVID-19 MRNA VACCINE MODERNA CX-024414](https://dap.ema.europa.eu/analyticsSOAP/saw.dll?PortalPages&amp;PortalPath=%2Fshared%2FPHV%20DAP%2F_portal%2FDAP&amp;Action=Navigate&amp;P0=1&amp;P1=eq&amp;P2=%22Line%20Listing%20Objects%22.%22Substance%20High%20Level%20Code%22&amp;P3=1+40983312)
* [COVID-19 MRNA VACCINE PFIZER-BIONTECH TOZINAMERAN](https://dap.ema.europa.eu/analyticsSOAP/saw.dll?PortalPages&amp;PortalPath=%2Fshared%2FPHV%20DAP%2F_portal%2FDAP&amp;Action=Navigate&amp;P0=1&amp;P1=eq&amp;P2=%22Line%20Listing%20Objects%22.%22Substance%20High%20Level%20Code%22&amp;P3=1+42325700)
* [COVID-19 VACCINE ASTRAZENECA (CHADOX1 NCOV-19)](https://dap.ema.europa.eu/analyticsSOAP/saw.dll?PortalPages&amp;PortalPath=%2Fshared%2FPHV%20DAP%2F_portal%2FDAP&amp;Action=Navigate&amp;P0=1&amp;P1=eq&amp;P2=%22Line%20Listing%20Objects%22.%22Substance%20High%20Level%20Code%22&amp;P3=1+40995439)
* [COVID-19 VACCINE JANSSEN AD26.COV2.S](https://dap.ema.europa.eu/analyticsSOAP/saw.dll?PortalPages&amp;PortalPath=%2Fshared%2FPHV%20DAP%2F_portal%2FDAP&amp;Action=Navigate&amp;P0=1&amp;P1=eq&amp;P2=%22Line%20Listing%20Objects%22.%22Substance%20High%20Level%20Code%22&amp;P3=1+42287887)
* and more ...

On an ADRReports report page:
1. open the **`Line listing`** tab
2. set `YEAR=2021`
3. open the line listing report via **`Run Line Listing Report`**
4. from the `Line listing` report page, choose **`Export`/`Data`/`Tab delimited`** to download the TSV files.


**Warning : convert files to UTF-8**

In my experience, one downloaded file was UTF-8 encoded, while the others were UTF-16LE encoded.
* to check the encoding, use ```file -i yourfile```
* to convert the encoding, you can use iconv, for example:
```iconv -f utf-16le -t utf-8 your-utf16le-file --output your-utf8-file```
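
If iconv is not at hand, the same conversion can be sketched in Scala with the JDK only. The helper name `utf16leToUtf8` is hypothetical and not part of this notebook. Unlike the iconv call, this sketch also drops a leading byte-order mark if one is present, on the assumption that it should not end up inside the first column header.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}

// Hypothetical helper: re-encode a UTF-16LE file as UTF-8.
// The whole file is read in memory, which assumes reasonably small exports.
def utf16leToUtf8(source: Path, target: Path): Unit = {
  val text = new String(Files.readAllBytes(source), StandardCharsets.UTF_16LE)
  // Drop a leading byte-order mark (U+FEFF) if the file starts with one.
  val cleaned = text.stripPrefix("\uFEFF")
  Files.write(target, cleaned.getBytes(StandardCharsets.UTF_8))
}
```

Once every file is converted, `sourceEncoding` can stay at `"UTF-8"` for the whole directory.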


This notebook transforms locally downloaded TSV-formatted data into rows.

# Source path & encoding

In [None]:
val sourceEudraTSV="file:///path/to/the/files"
val sourceEncoding="UTF-8"

# Tooling for data preparation
* imports
* case classes
* user defined functions for data parsing

In [None]:
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.functions.udf
import spark.implicits._
import java.sql.Date

case class Reaction(
    description: String = ""
    , duration: String = ""
    , outcome: String = ""
    , seriousnessCriteria: String = "") 

case class Drug(
    name: String = ""
  , letter: String = ""
  , indication: String = ""
  , actionTaken: String = ""
  , duration: String = ""
  , dose: String = ""
  , route: String = "" ) 

//Eudra records expose all values as strings in the CSV. To reduce the memory footprint, we'll use Booleans, Dates or Integers when possible. User defined functions will help us.
case class EudraRecord(
    reportEULocalNumber : String 
    , reportReceived:Date 
    , ageGroup:Integer
    , isSpontaneous:Boolean
    , isHealthCareProfessionnal:  Boolean 
    , isSourceRegionEU: Boolean 
    , isPatientMale:Boolean
    , isPatientFemale:Boolean
 ) 

case class EudraAgeGroup (
    ageGroupCode:Byte
    , minAgeMonth:Int
    ,maxAgeMonth:Int
    ,ageGroupName:String
)



/**
  transform reaction string into List[Reaction] 
**/
def getReactionList  = (eudraNestedString: String) => {
    val reactionPattern= """(?<description>.*)\s\((?<duration>[^\t]*)\t(?<outcome>[^\t]*)\t(?<seriousness>.*)\)"""
        .r("description","duration","outcome","seriousness")
    val result= ListBuffer.empty[Reaction]
    if (! (eudraNestedString == null || eudraNestedString.length==0)) try {
      val reactionArray:Array[String]=eudraNestedString.split( "\n")
      reactionArray.toList.foreach(reactionString=> {
          val matched = reactionPattern.findFirstMatchIn(reactionString)
           if (!matched.isEmpty) 
               result +=  {Reaction(matched.get.group("description"), matched.get.group("duration"), matched.get.group("outcome"), matched.get.group("seriousness"))}
           } 
        )
    } catch { case e: Exception => {/* do nothing */} }
    result.toList
}
val getReactionListUDF = udf(getReactionList)

/** 
  transform drug string into List[Drug] 
**/ 
def getDrugList = (eudraNestedString: String) =>   {
    val drugPattern= """(?<name>.*)\s\((?<drugchar>[^\t]{1})\t(?<indication>[^\t]*)\t(?<action>[^\t]*)\t\[(?<duration>[^\t]*)\t(?<dose>[^\t]*)\t(?<route>[^\]]*)"""
      .r("name","drugchar","indication","action","duration","dose","route")
    var result = ListBuffer.empty[Drug]
    if (! (eudraNestedString == null || eudraNestedString.length==0)) try {
      val drugsArray:Array[String]=eudraNestedString.split( "\n")
      drugsArray.toList.foreach(drugString=> {
        val matched = drugPattern.findFirstMatchIn(drugString)
        if (!matched.isEmpty)
          result += Drug( matched.get.group("name")
            ,  matched.get.group("drugchar")
            ,  matched.get.group("indication")
            ,  matched.get.group("action")
            ,  matched.get.group("duration")
            ,  matched.get.group("dose")
            ,  matched.get.group("route"))
      })
    } catch { case e: Exception => { /* do nothing */ } }
    result.toList
  }
val getDrugListUDF = udf(getDrugList)


/**
    extract the drug name enclosed in leading square brackets (pattern ^\[([^,]*)\]),
    or return the same string when there is no such prefix.
**/

def getCleanerDrugName = (eudraDrugName:String)=> {
    // null-safe: a null name is returned as null
    if (eudraDrugName==null) null
    else {
        val drugPattern= """^\[([^,]*)\]""".r
        drugPattern.findFirstMatchIn(eudraDrugName) match {
            case Some(matched) => matched.group(1)
            case None => eudraDrugName
        }
    }
}
val getCleanerDrugNameUDF = udf(getCleanerDrugName)   
spark.udf.register("getCleanerDrugName", getCleanerDrugName)


/*  
 * getCovidVaccineCode maps a drug name to a COVID-19 vaccine code,
 * to enable correlation between EUDRAVigilance and the ECDC Vaccine tracker.
 */
def getCovidVaccineCode = (eudraDrugName:String)=> {
    // null-safe: a null name yields None
    if (eudraDrugName==null) None
    else {
        val name=eudraDrugName.toUpperCase
        if (name.contains("TOZINAMERAN"))  Some("COM") 
        else if (name.contains("CHADOX1")) Some ("AZ")
        else if (name.contains("CX-024414"))  Some("MOD")
        else if (name.contains("SPIKEVAX"))  Some("MOD") 
        else if (name.contains("COMIRNATY"))  Some("COM") 
        else if (name.contains("STRAIN CZ02")) Some ("SIN")
        else if (name.contains("HB02")) Some ("BBIBP")
        else if (name.contains("AD26.COV2.S"))  Some("JANSS")
        else if (name.contains("BNT 162")) Some("COM")
        else if (name.contains("PFIZER")) Some("COM")
        else if (name.contains("BIONTECH")) Some("COM")
        else if (name.contains("ASTRAZENECA")) Some("AZ")
        else if (name.contains("MODERNA")) Some("MOD")
        else if (name.contains("JANSSEN")) Some("JANSS")
        //not found specific 
        else if (name.contains("COVID-19")) Some("UNK")
        else None
    }
}
val getCovidVaccineCodeUDF = udf(getCovidVaccineCode)   
spark.udf.register("getCovidVaccineCode",getCovidVaccineCode)




val isSpontaneous = (s:String)=> {
    if ("Spontaneous".equals(s)) Some(true)
    else if (s!=null) Some(false)
    else None
}
val  isSpontaneousUDF = udf( isSpontaneous )   

val isPatientMale = (s:String)=> {
    if ("Male".equals(s)) Some(true)
    else if (s !=null && ! "".equals(s)) Some(false)
    else None
}
val  isPatientMaleUDF = udf( isPatientMale )

val isPatientFemale = (s:String)=> {
    if ("Female".equals(s)) Some(true)
    else if (s !=null && ! "".equals(s)) Some(false)
    else None
}
val  isPatientFemaleUDF = udf( isPatientFemale )


val isHealthCareProfessional = (eudraReportType:String)=> {
    if ("Healthcare Professional".equals(eudraReportType)) Some(true)
    else if ("Non Healthcare Professional".equals(eudraReportType)) Some(false)
    else None
}
val  isHealthCareProfessionalUDF = udf( isHealthCareProfessional)   
spark.udf.register("isHealthCareProfessional",isHealthCareProfessional)


val isSourceRegionEU = (eudraReportType:String)=> {
    if ("European Economic Area".equals(eudraReportType)) Some(true)
    else if ("Non European Economic Area".equals(eudraReportType)) Some(false)
 
    else None
}
val  isSourceRegionEUUDF = udf( isSourceRegionEU)   
spark.udf.register("isSourceRegionEU",isSourceRegionEU)

val getAgeGroupCode =(ageGroupLabel:String) => {
    ageGroupLabel match {
        case "0-1 Month"=>Some(0)
        case "2 Months - 2 Years"=>Some(1)
        case "3-11 Years"=>Some(2)
        case "12-17 Years"=>Some(3)
        case "18-64 Years"=>Some(4)
        case "64-85 Years"=>Some(5)
        case "more than 85 Years"=>Some(6)
        case _  => None
    } 
}
val  getAgeGroupCodeUDF = udf( getAgeGroupCode)   


// val eudraAgeGroupDF= spark.createDataset(
//     Seq( EudraAgeGroup(0,0,    (1+1),     "0-1 Month")
//         ,EudraAgeGroup(1,2,    (2+1)*12,  "2 Months - 2 Years")
//         ,EudraAgeGroup(2,3*12, (11+1)*12, "3-11 Years")
//         ,EudraAgeGroup(3,12*12,(17+1)*12, "12-17 Years")
//         ,EudraAgeGroup(4,18*12,(64+1)*12, "18-64 Years")
//         ,EudraAgeGroup(5,64*12,(85+1)*12, "64-85 Years")
//         ,EudraAgeGroup(6,85*12,Integer.MAX_VALUE,"more than 85 Years")
//     ) )
// eudraAgeGroupDF.createOrReplaceTempView("VEudraAgeGroup")
// eudraAgeGroupDF.printSchema


  spark.udf.register("getReactionList",getReactionList)


# Reading and parsing source file
In this first step we read the tab-separated values file. The nested lists it contains are later exploded into `List[<case class>]` objects via user-defined functions.
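
The parsing of one nested cell can be sketched outside Spark as follows. The sample string is made up for illustration and only imitates the `name (duration - outcome - seriousness)` layout with `,<BR><BR>` between entries; `parseReactions` is a hypothetical name, and `Reaction` is redeclared here so that the snippet stands alone. The pattern is the one used by `getReactionList`, read through positional groups.

```scala
// Redeclared so the snippet is self-contained; mirrors the notebook's Reaction.
case class Reaction(description: String, duration: String, outcome: String, seriousnessCriteria: String)

// Same pattern as getReactionList, with unnamed (positional) groups.
val reactionPattern = """(.*)\s\(([^\t]*)\t([^\t]*)\t(.*)\)""".r

// Hypothetical helper: turn one raw cell into a list of Reaction objects.
def parseReactions(cell: String): List[Reaction] =
  Option(cell).toList                       // a null cell yields an empty list
    .flatMap(_.replace(" - ", "\t")         // field separator becomes a tab
              .replace(",<BR><BR>", "\n")   // entry separator becomes a newline
              .split("\n").toList)
    .flatMap(entry => reactionPattern.findFirstMatchIn(entry)) // skip entries that do not match
    .map(m => Reaction(m.group(1), m.group(2), m.group(3), m.group(4)))

// Made-up sample cell with two entries.
val sample =
  "Pyrexia (2d - Recovered/Resolved - Other Medically Important Condition),<BR><BR>Headache (n/a - Recovering/Resolving - )"
val parsed = parseReactions(sample)
```

Entries that do not match the pattern are silently skipped, as in the notebook's UDF, so a row count lower than expected may point at a layout the pattern does not cover.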

In [None]:

val eudraDF=spark.read
    .option("header","true")
    .option("encoding", sourceEncoding)
    .option("delimiter","\t")
    .csv(sourceEudraTSV)
    .withColumnRenamed ("EU Local Number","reportEULocalNumber")
    .withColumn( "input_file_name", input_file_name())
.cache

 eudraDF.printSchema
println (s"Source CSV data imported :  ${eudraDF.count} rows in DataFrame")
// eudraDF.show

# Exploding source data

## Virtual tables
Working with lists of elements is possible but not easy in SQL. To make the data easier to use, let's explode the lists into dedicated DataFrames and views, all of which share the identifier `reportEULocalNumber` with the `Records`:
* `reactions` 
* `concomitant drugs` 
* `suspect drugs` 
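
What exploding means can be illustrated on plain Scala collections. This is only a sketch: `NestedRecord` and `explodeReactions` are hypothetical names, and the record is reduced to an identifier and a list of reaction names.

```scala
// Hypothetical, simplified record: an identifier plus a nested list of reaction names.
case class NestedRecord(reportEULocalNumber: String, reactions: List[String])

// One output row per list element, each carrying the identifier of its parent record.
def explodeReactions(records: Seq[NestedRecord]): Seq[(String, String)] =
  for {
    record   <- records
    reaction <- record.reactions
  } yield (record.reportEULocalNumber, reaction)

val exploded = explodeReactions(Seq(
  NestedRecord("EU-1", List("Pyrexia", "Headache")),
  NestedRecord("EU-2", Nil)   // no reaction: produces no row
))
```

As with Spark's `explode`, a record whose list is empty produces no row, so counts on the exploded views only cover reports that have at least one parsed element.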
 

### Definition of eudraAgeGroupDF and view VEudraAgeGroup

### Definition of recordsDF and view VRecords

In [None]:
val recordsDF=eudraDF
    .withColumn("reportReceived", to_date(col("EV Gateway Receipt Date")))
    .withColumn("ageGroup",getAgeGroupCodeUDF(col("Patient Age Group")))
    .withColumnRenamed("Patient Sex", "patientSex")
     .withColumn("isHealthCareProfessionnal", isHealthCareProfessionalUDF(col("Primary Source Qualification")))
     .withColumn("isSourceRegionEU", isSourceRegionEUUDF(col("Primary Source Country for Regulatory Purposes")))
     .withColumn("isSpontaneous", isSpontaneousUDF(col("Report Type")))
     .withColumn("isPatientMale", isPatientMaleUDF(col("patientSex")))
     .withColumn("isPatientFemale", isPatientFemaleUDF(col("patientSex")))
    .select("reportEULocalNumber","reportReceived", "ageGroup", "isPatientMale","isPatientFemale", "isHealthCareProfessionnal",  "isSourceRegionEU" , "isSpontaneous")
    .as[EudraRecord]
println (s"recordsDF:  ${recordsDF.count} rows in DataFrame")
recordsDF.createOrReplaceTempView("VRecords")
// recordsDF.show

### Definition of reactionsDF and view VReactions


In [None]:
val reactionsDF = eudraDF
 .withColumn("reactionList"
        , getReactionListUDF(regexp_replace( regexp_replace(col("Reaction List PT (Duration – Outcome - Seriousness Criteria)"), " - ", "\t") , ",<BR><BR>", "\n"  ) ))
   .select("reportEULocalNumber", "reactionList") 
   .withColumn("reaction", explode(col("reactionList")))
   .select("reportEULocalNumber","reaction")
   .as[(String,Reaction)]

println (s"reactionsDF:  ${reactionsDF.count} rows in DataFrame")
reactionsDF.createOrReplaceTempView("VReactions")
// reactionsDF.show

### Definition of suspectDrugsDF and view VSuspectDrugs

In [None]:
val suspectDrugsDF = eudraDF
    .withColumn("suspectList"
        , getDrugListUDF(regexp_replace(  regexp_replace(col("Suspect/interacting Drug List (Drug Char - Indication PT - Action taken - [Duration - Dose - Route])"), " - ", "\t") , ",<BR><BR>", "\n"  ) ) )
    .select("reportEULocalNumber","suspectList")
    .withColumn("suspectDrug", explode(col("suspectList")))
    .select("reportEULocalNumber","suspectDrug")
    .as[(String,Drug)]
println (s"suspectDrugsDF:  ${suspectDrugsDF.count} rows in DataFrame")
suspectDrugsDF.createOrReplaceTempView("VSuspectDrugs")
// suspectDrugsDF.show

### Definition of concomitantDrugsDF and view VConcomitantDrugs

In [None]:
val concomitantDrugsDF = eudraDF
    .withColumn("concomitantList"
        , getDrugListUDF(regexp_replace(  regexp_replace(col("Concomitant/Not Administered Drug List (Drug Char - Indication PT - Action taken - [Duration - Dose - Route])"), " - ", "\t") , ",<BR><BR>", "\n"  ) ) )
    .withColumn("concomitantDrug", explode(col("concomitantList")))
    .select("reportEULocalNumber","concomitantDrug")
    .as[(String,Drug)]
println (s"concomitantDrugsDF:  ${concomitantDrugsDF.count} rows in DataFrame")       
concomitantDrugsDF.createOrReplaceTempView("VConcomitantDrugs")


<hr />


# Data preparation Done
Here are the Spark SQL tables

In [None]:
println("List of exposed Spark SQL tables:")
spark.sql("show tables")
.select("tableName")
.as[String]
.collect
.foreach(t=>{
    println(s"table $t:")
    println("="*(s"table $t:").length)
     spark.sql ( s"select * from $t").printSchema
})


# Quality score 
Suspicious records:
* coming from a non-medical source
* having too many symptoms declared => exclude the reports whose count of simultaneous symptoms is in the top 1% per vaccine
* suspected to be a duplicate of a previous one
* inconsistent timing between vaccination and symptom => NOT POSSIBLE WITH EudraVigilance
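
The "top 1% per vaccine" rule could look like the sketch below, written on plain Scala collections. It is one possible reading of the rule, not a settled definition: it uses a nearest-rank percentile and drops reports strictly above the per-vaccine threshold. `ReportCount`, `percentile` and `dropTopShare` are hypothetical names.

```scala
// Hypothetical row: one report, its vaccine code and its number of declared reactions.
case class ReportCount(reportId: String, vaccine: String, reactionCount: Int)

// Nearest-rank percentile of a non-empty sequence, for p in (0, 1].
def percentile(values: Seq[Int], p: Double): Int = {
  val sorted = values.sorted
  val rank = math.ceil(p * sorted.length).toInt
  sorted(math.max(rank, 1) - 1)
}

// Keep the reports whose reaction count does not exceed the threshold of their vaccine.
def dropTopShare(reports: Seq[ReportCount], p: Double = 0.99): Seq[ReportCount] = {
  val thresholds: Map[String, Int] = reports.groupBy(_.vaccine).map {
    case (vaccine, rows) => vaccine -> percentile(rows.map(_.reactionCount), p)
  }
  reports.filter(r => r.reactionCount <= thresholds(r.vaccine))
}
```

Reports tied with the threshold are kept, so on small or heavily tied groups fewer than 1% of the reports are removed. On the full dataset, Spark SQL's `percentile_approx` would play the role of `percentile`.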


In [None]:
//Let's look at the number of reactions, suspect drugs and concomitant drugs declared per record

eudraDF
 .withColumn("count of concomitant drugs" 
             , size(getDrugListUDF(
                     regexp_replace(  regexp_replace(col("Concomitant/Not Administered Drug List (Drug Char - Indication PT - Action taken - [Duration - Dose - Route])"), " - ", "\t") , ",<BR><BR>", "\n"  ) 
                 ) ) )
 .withColumn("count of suspect drugs" 
             , size(getDrugListUDF(
                     regexp_replace(  regexp_replace(col("Suspect/interacting Drug List (Drug Char - Indication PT - Action taken - [Duration - Dose - Route])"), " - ", "\t") , ",<BR><BR>", "\n"  ) ) ) 
            )
 .withColumn("count of reactions" 
             , size(getReactionListUDF(
                     regexp_replace( regexp_replace(col("Reaction List PT (Duration – Outcome - Seriousness Criteria)"), " - ", "\t") , ",<BR><BR>", "\n"  ) ) ) 
            ) 
 .withColumn("isHealthCareProfessionnal", isHealthCareProfessionalUDF(col("Primary Source Qualification")))
 .select("reportEULocalNumber","isHealthCareProfessionnal","count of concomitant drugs", "count of suspect drugs", "count of reactions").cache.createOrReplaceTempView("linksCount")

println("Statistics can take a few seconds.")
for (counter <- ("count of concomitant drugs"::"count of suspect drugs"::"count of reactions"::Nil))
{ 
    println(s"Checking ${counter} per EUDRAVigilance report submitted by healthcare professional")
    spark.sql(s"""SELECT
        percentile_approx(`${counter}`
        , array(0.90, 0.95, 0.99, 0.999)) as `${counter}-0.90/0.95/0.99/0.999`
        , min(`${counter}`) `Min`
        , avg(`${counter}`) `Avg`
        , max(`${counter}`) `Max`
        FROM linksCount where isHealthCareProfessionnal """).show(100)
    println(s"Checking ${counter} per EUDRAVigilance report submitted by non healthcare professional")
    spark.sql(s"""SELECT
        percentile_approx(`${counter}`
        , array(0.90, 0.95, 0.99, 0.999)) as `${counter}-0.90/0.95/0.99/0.999`
        , min(`${counter}`) `Min`
        , avg(`${counter}`) `Avg`
        , max(`${counter}`) `Max`
        FROM linksCount where  isHealthCareProfessionnal=FALSE """).show(100)
}

println ("Should we consider that reports with too many concomitant drugs, or too many reactions, should be discarded?")

In [None]:
spark.sql(s"""SELECT * FROM linksCount """).show(100)


# Some test queries

In [None]:
spark.sql("select * from vrecords").collect


In [None]:
// println("Show tables: ")
// spark.sql("show tables").show

// println("Records received dates: ")
// spark.sql("select min(reportReceived) , max(reportReceived) from VRecords").show

// // println("Top reactions in reports")
// // spark.sql("select reaction.description, count(*) n from VReactions group by reaction.description order by 2 desc ").show(25)

// println("Top concomitant drug in reports")
// spark.sql("select getCleanerDrugName(concomitantDrug.name), count(*) n from VConcomitantDrugs group by concomitantDrug.name order by 2 desc").show(25)

// println("Top suspect drug in reports")
// spark.sql("select getCleanerDrugName(suspectDrug.name), count(*) n from VSuspectDrugs group by suspectDrug.name order by 2 desc").show(25)


In [None]:
// %%python
// import pandas as pd
// pd.set_option('max_columns', None)
// #spark.sql("select reportType, count(*) reportCount from TEudraRecords group by reportType order by 2 desc ").toPandas()
// #spark.sql("select sourceQualification, count(*) from TEudraRecords group by sourceQualification order by 2 desc ").toPandas()
// spark.sql("select sourceQualification, patientAgeGroup, count(*) from TEudraRecords group by sourceQualification, patientAgeGroup")
// .groupBy($"patientAgeGroup")
// .pivot("sourceQualification")
// .agg(count($"reportCount"))
// .toPandas()


In [None]:
// println("Number of reports per age and source")
// spark.sql("select sourceQualification, patientAgeGroup, count(*) as reportCount from TEudraRecords where sourceQualification != 'Primary Source Qualification' group by sourceQualification, patientAgeGroup")
// .groupBy(col("patientAgeGroup"))
// .pivot("sourceQualification")
// .agg(sum(col("reportCount")))
// .show

In [None]:
// println("Number of reports per source over time desc")
// spark.sql("select sourceQualification, left(reportReceived,7) as reportReceived, count(*) as reportCount from TEudraRecords where sourceQualification != 'Primary Source Qualification' group by sourceQualification, left(reportReceived,7)")
// .groupBy(col("reportReceived"))
// .pivot("sourceQualification")
// .agg(sum(col("reportCount")))
// .sort (desc ("reportReceived"))
// .show

In [None]:
// 
// println("Number of report received from Non Healthcare Professional in EUDRAVigilance per month per COVID-19 vaccine" )
// eudraDF
//     .where ("sourceQualification = 'Non Healthcare Professional'")
//     .withColumn("suspectDrug", explode(col("suspectDrugs.name")))
//     .where ("suspectDrug like '%COVID-19%'")
//     .withColumn("drug", getSimplifiedCovidNameUDF(col("suspectDrug")))
//     .withColumn( "year-month", substring(col("reportReceived"),1,7))
//     .select("drug", "year-month" ) 
//     .groupBy("drug", "year-month" ).agg(count("drug") as "countPresence")
   
// .groupBy(col("year-month"))
// .pivot("drug")
// .agg(sum(col("countPresence")))
// .sort (desc ("year-month"))
// .na.fill(0)
// .show
    

In [None]:


// println("Number of report received from  Healthcare Professional in EUDRAVigilance per month per COVID-19 vaccine" )
// eudraDF.where ("sourceQualification = 'Healthcare Professional'")
//     .withColumn("suspectDrug", explode(col("suspectDrugs.name")))
//     .where ("suspectDrug like '%COVID-19%'")
//     .withColumn("drug", getSimplifiedCovidNameUDF(col("suspectDrug")))
//     .withColumn( "year-month", substring(col("reportReceived"),1,7))
//     .select("drug", "year-month" ) 
//     .groupBy("drug", "year-month" ).agg(count("drug") as "countPresence")
   
// .groupBy(col("year-month"))
// .pivot("drug")
// .agg(sum(col("countPresence")))
// .sort (desc ("year-month"))
// .na.fill(0)
// .show


In [None]:
//  println("Number of report received from NON Healthcare Professional in EUDRAVigilance per month per COVID-19 vaccine" )
// eudraDF
//     .filter("hasCovid19Drug = true")
//     .where ("sourceQualification != 'Healthcare Professional'")
//     .withColumn("suspectDrug", explode(col("suspectDrugs.name")))
//     .where ("suspectDrug like '%COVID-19%'")
//     .withColumn("drug", getSimplifiedCovidNameUDF(col("suspectDrug")))
//     .withColumn( "year-month", substring(col("reportReceived"),1,7))
//     .select("drug", "year-month" ) 
//     .groupBy("drug", "year-month" ).agg(count("drug") as "countPresence")
   
// .groupBy(col("year-month"))
// .pivot("drug")
// .agg(sum(col("countPresence")))
// .sort (desc ("year-month"))
// .na.fill(0)
// .show


In [None]:
 
// eudraDF
//  .filter("hasCovid19Drug = true")
//     .withColumn("suspectDrug", explode(col("suspectDrugs.name")))
//     .where ("suspectDrug like '%COVID-19%'")
//     .withColumn("drug", getSimplifiedCovidNameUDF(col("suspectDrug")))
//     .withColumn( "year-month", substring(col("reportReceived"),1,4))
//     .select("drug", "year-month" ) 
//     .groupBy("drug", "year-month" ).agg(count("drug") as "countPresence")
   
// .groupBy(col("year-month"))
// .pivot("drug")
// .agg(sum(col("countPresence")))
// .sort (desc ("year-month"))
// .na.fill(0)
// .show

In [None]:

// eudraDF
//  .filter("hasCovid19Drug = true")
//     .withColumn("suspectDrug", explode(col("suspectDrugs.name")))
//     .where ("suspectDrug like '%COVID-19%'")
//     .withColumn("drug", getSimplifiedCovidNameUDF(col("suspectDrug")))
//     .withColumn( "year-month", substring(col("reportReceived"),1,4))
//     .select("drug", "sourceQualification" ) 
//     .groupBy("drug", "sourceQualification" ).agg(count("drug") as "countPresence")
   
// .groupBy(col("sourceQualification"))
// .pivot("drug")
// .agg(sum(col("countPresence")))
// .sort (desc ("sourceQualification"))
// .na.fill(0)
// .show

In [None]:

// eudraDF
//     .filter("hasCovid19Drug = true")
//     .filter("sourceQualification='Healthcare Professional'")
//     .withColumn("suspectDrug", explode(col("suspectDrugs.name")))
//     .where ("suspectDrug like '%COVID-19%'")
//  .withColumn("drug", getSimplifiedCovidNameUDF(col("suspectDrug")))

//     .withColumn("reaction", explode(col("reactions")))
//     .select("drug", "reaction.description" ) 

//      .groupBy("drug", "description" ).agg(count("drug") as "countPresence")
//     .repartition(1)
//     .write.option("header", "true").csv("file:///home/taccart/Downloads/symptoms")
// //  .groupBy(col("description"))
// //  .pivot("drug")
// //  .agg(sum(col("countPresence")))
// //   .sort (desc ("COVID-19 ASTZ"))
// //  .na.fill(0)
// //  .show(500)