#### Assignment: Using Spark to find lethal Drug X

**Goal: Use Spark to test the hypothesis that a single drug was taken by all of the deceased patients and may therefore be lethal**

A research institution is testing multiple drugs simultaneously and has experienced a suspicious number of deaths. A clinician has observed that "all of the deceased patients shared one drug in common".

The dataset has two tables:
 - "patients.txt": list of patients and their status, "Live" or "Deceased"
 - "druglog.txt": record of each drug dose administered to the patients



In [6]:
# Create a Spark session and reuse its SparkContext for the RDD API
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BigDataAnalysis").getOrCreate()
sc = spark.sparkContext

In [7]:
# Loading and reading the dataset
drugrdd = sc.textFile("druglog.txt")
patientrdd = sc.textFile("patients.txt")

# Split each comma-separated line into a [name, value] pair
drug_rdd = drugrdd.map(lambda line: line.split(','))
patient_rdd = patientrdd.map(lambda line: line.split(','))
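
The split above assumes every line is exactly `name,value` with no stray whitespace. A slightly more defensive parser is sketched below; `parse_line` is a hypothetical helper, not part of the original notebook, and the second sample line is made up to show the trimming.

```python
def parse_line(line):
    """Split one 'name,value' line into a (name, value) tuple, trimming spaces."""
    # maxsplit=1 keeps any later commas inside the second field
    name, value = line.split(",", 1)
    return name.strip(), value.strip()


# Plain-Python check on two sample lines in the assumed format
sample = ["Mary Yates,Deceased", " James Parker , Live "]
parsed = [parse_line(line) for line in sample]
print(parsed)  # [('Mary Yates', 'Deceased'), ('James Parker', 'Live')]
```

Because it is an ordinary function, it could be passed to the RDD in place of the lambda, for example `patientrdd.map(parse_line)`.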

In [11]:
drug_rdd.take(10)

[['Susan Shull', 'GenemodZ'],
 ['Edward Hurley', 'AgentA'],
 ['Denise Wyrick', 'SerumQ'],
 ['Willie Mitchell', 'SerumQ'],
 ['Rita Pefferkorn', 'SerumQ'],
 ['Roberto White', 'AgentA'],
 ['Tessie Whitehouse', 'AgentA'],
 ['Elizabeth Pinson', 'SerumQ'],
 ['Gertrude Mccormick', 'AntibodyY'],
 ['Robert Rodkey', 'SerumQ']]

In [10]:
patient_rdd.take(10)

[['Mary Yates', 'Deceased'],
 ['James Parker', 'Live'],
 ['Kim Bond', 'Deceased'],
 ['Thomas Broadnax', 'Deceased'],
 ['Stephen Williams', 'Live'],
 ['Mark Hinton', 'Deceased'],
 ['Deborah Lloyd', 'Live'],
 ['Linda Barnes', 'Live'],
 ['Donald Zawacki', 'Live'],
 ['Rene Spencer', 'Deceased']]

In [12]:
combined_rdd = patient_rdd.join(drug_rdd)
combined_rdd.take(10)

[('Donald Zawacki', ('Live', 'AntibodyY')),
 ('Christi Bailey', ('Live', 'GenemodZ')),
 ('Derek Auger', ('Deceased', 'AntibodyY')),
 ('Derek Auger', ('Deceased', 'SerumQ')),
 ('Herbert Hamilton', ('Deceased', 'SerumQ')),
 ('Herbert Hamilton', ('Deceased', 'AgentA')),
 ('Herbert Hamilton', ('Deceased', 'SubstanceX')),
 ('Michelle Tutwiler', ('Live', 'SubstanceX')),
 ('Michelle Tutwiler', ('Live', 'AntibodyY')),
 ('Ryan Coleman', ('Live', 'AgentA'))]
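
The join is an inner join on the patient name, which is why a patient with several drug records (such as Derek Auger) appears once per record, and why a patient with no drug records would not appear at all. The behaviour can be mimicked in plain Python as a rough sketch; `inner_join` and the "No Records" patient are hypothetical and only illustrate the semantics.

```python
from collections import defaultdict


def inner_join(left, right):
    """Mimic an inner join of two lists of (key, value) pairs."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    # One output row per matching (left value, right value) combination;
    # keys missing from either side produce no rows
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]


patients = [
    ("Derek Auger", "Deceased"),
    ("Ryan Coleman", "Live"),
    ("No Records", "Live"),  # hypothetical patient with no drug log entries
]
drugs = [
    ("Derek Auger", "AntibodyY"),
    ("Derek Auger", "SerumQ"),
    ("Ryan Coleman", "AgentA"),
]
joined = inner_join(patients, drugs)
for row in joined:
    print(row)
```

One consequence for the analysis: any count taken from the joined data covers only patients who have at least one entry in the drug log.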

In [13]:
def flatten_tuple(element):
    # Unpack (name, (status, drug)) into a flat (name, status, drug) tuple
    name, (status, drug) = element
    return name, status, drug
 
combined_rdd1 = combined_rdd.map(flatten_tuple)
combined_rdd1.take(10)

[('Donald Zawacki', 'Live', 'AntibodyY'),
 ('Christi Bailey', 'Live', 'GenemodZ'),
 ('Derek Auger', 'Deceased', 'AntibodyY'),
 ('Derek Auger', 'Deceased', 'SerumQ'),
 ('Herbert Hamilton', 'Deceased', 'SerumQ'),
 ('Herbert Hamilton', 'Deceased', 'AgentA'),
 ('Herbert Hamilton', 'Deceased', 'SubstanceX'),
 ('Michelle Tutwiler', 'Live', 'SubstanceX'),
 ('Michelle Tutwiler', 'Live', 'AntibodyY'),
 ('Ryan Coleman', 'Live', 'AgentA')]

In [14]:
# Keep only the records of deceased patients
filtered_rdd = combined_rdd1.filter(lambda x: x[1] == 'Deceased')
filtered_rdd.take(10)

[('Derek Auger', 'Deceased', 'AntibodyY'),
 ('Derek Auger', 'Deceased', 'SerumQ'),
 ('Herbert Hamilton', 'Deceased', 'SerumQ'),
 ('Herbert Hamilton', 'Deceased', 'AgentA'),
 ('Herbert Hamilton', 'Deceased', 'SubstanceX'),
 ('Terry Averitt', 'Deceased', 'SerumQ'),
 ('Terry Averitt', 'Deceased', 'AgentA'),
 ('Natasha Echeverria', 'Deceased', 'SerumQ'),
 ('Amanda Pollock', 'Deceased', 'SerumQ'),
 ('Claudia Knight', 'Deceased', 'SerumQ')]

In the ten rows above, every deceased patient shown has a "SerumQ" record, which makes SerumQ the candidate for the drug they all share.
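
Ten rows can only suggest a candidate. A tally of distinct deceased patients per drug makes the comparison systematic. The sketch below shows the idea in plain Python on a tiny sample; `deceased_patients_per_drug` is a hypothetical helper and the repeated-dose row is made up. On the RDD, one assumed equivalent would be to map each row to `(drug, name)`, call `distinct()`, and then `countByKey()`.

```python
from collections import defaultdict


def deceased_patients_per_drug(rows):
    """Map each drug to the set of deceased patients with at least one record of it."""
    takers = defaultdict(set)
    for name, status, drug in rows:
        if status == "Deceased":
            takers[drug].add(name)  # a set ignores repeated doses
    return dict(takers)


rows = [
    ("Derek Auger", "Deceased", "AntibodyY"),
    ("Derek Auger", "Deceased", "SerumQ"),
    ("Herbert Hamilton", "Deceased", "SerumQ"),
    ("Herbert Hamilton", "Deceased", "AgentA"),
    ("Herbert Hamilton", "Deceased", "SerumQ"),  # hypothetical repeated dose
    ("Ryan Coleman", "Live", "AgentA"),
]
per_drug = deceased_patients_per_drug(rows)
counts = {drug: len(names) for drug, names in per_drug.items()}
print(counts)  # {'AntibodyY': 1, 'SerumQ': 2, 'AgentA': 1}
```

A drug whose count equals the total number of distinct deceased patients is one that all of them took.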

In [17]:
# Count the SerumQ records that belong to deceased patients
filtered_rdd1 = filtered_rdd.filter(lambda element: element[2] == 'SerumQ')
filtered_rdd1.count()

4999

In [16]:
# Count the distinct deceased patients
distinct_first_elements = filtered_rdd.map(lambda element: element[0]).distinct()
distinct_first_elements.count()

4999

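Both counts are 4999: filtered_rdd1 holds 4999 SerumQ records for deceased patients, and distinct_first_elements holds 4999 distinct deceased patients. Assuming no deceased patient has more than one SerumQ record, the counts can only match if every deceased patient took SerumQ. This supports the clinician's observation that all of the deceased patients shared one drug in common.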
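
Comparing a record count with a distinct-patient count depends on each patient having a single SerumQ record. A check that does not depend on this is to subtract the set of deceased patients who took the drug from the set of all deceased patients and confirm that nothing is left. The plain-Python sketch below uses a small sample; `deceased_without_drug` is a hypothetical helper and the repeated-dose row is made up. On the RDDs, an assumed equivalent would combine `distinct()` with `subtract()`.

```python
def deceased_without_drug(rows, drug):
    """Return the deceased patients who have no record of the given drug."""
    deceased = {name for name, status, _ in rows if status == "Deceased"}
    took_drug = {
        name for name, status, d in rows if status == "Deceased" and d == drug
    }
    return deceased - took_drug


rows = [
    ("Derek Auger", "Deceased", "AntibodyY"),
    ("Derek Auger", "Deceased", "SerumQ"),
    ("Terry Averitt", "Deceased", "SerumQ"),
    ("Terry Averitt", "Deceased", "SerumQ"),  # hypothetical repeated dose
    ("Terry Averitt", "Deceased", "AgentA"),
    ("Ryan Coleman", "Live", "AgentA"),
]

# Record count and distinct-patient count differ once a dose is repeated
serumq_records = sum(1 for _, s, d in rows if s == "Deceased" and d == "SerumQ")
distinct_deceased = len({n for n, s, _ in rows if s == "Deceased"})
print(serumq_records, distinct_deceased)  # 3 2

# The set difference is unaffected by repeated doses
print(deceased_without_drug(rows, "SerumQ"))  # set()
print(deceased_without_drug(rows, "AgentA"))  # {'Derek Auger'}
```

An empty result for a drug means every deceased patient in the data has at least one record of it.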
