<h3>Task 5 - Predictive Health Care</h3>
Comparing adverse effects of pain medications

<h3>Task Overview</h3>

<b>What you'll learn</b>
 - How to analyze adverse drug effects using provided data.

<b>What you'll do</b>
 - Analyze 2019 FAERS data to find the top 10 Tramal adverse effects.
 - Compare Tramal and Lyrica's adverse effects.
 - Suggest further investigations based on dataset findings.

<h3>Here is your task:</h3>

<b>Jakob asks you to create a PowerPoint slide deck while tackling the following steps. Use screenshots and diagrams to illustrate your findings as well.</b>

<b>Step 1</b>

Create a descriptive overview of the adverse effects of Tramal based on the available FAERS datasets, which you'll find in your resource section. For your analysis, use only the FAERS data from 2019.

Show the 10 most common adverse effects as reported in the FAERS database. Jakob loves bar plots, so it would be great if you used one.

<b>Step 2</b>

Compare Tramal with another medication, Lyrica, which is also commonly used to treat neurological pain. Are the adverse effects similar?

Use an R script to solve the task and make sure to include it in your presentation.
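
One way to put a number on "similar" is to measure how much the two top-10 lists overlap. The sketch below uses Jaccard similarity (shared terms divided by all distinct terms) on made-up reaction terms; the helper names `top_effects` and `jaccard` and the toy data are illustrative assumptions, not part of the FAERS files or of the task material.

```python
import pandas as pd


def top_effects(reactions, n=10):
    """Return the n most frequent reaction terms, most common first."""
    return reactions.str.lower().value_counts().head(n).index.tolist()


def jaccard(a, b):
    """Share of distinct terms that appear in both lists (0 = disjoint, 1 = identical)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Toy reaction terms (hypothetical, for illustration only)
drug_a = pd.Series(["Nausea", "nausea", "Dizziness", "Pain", "pain", "pain"])
drug_b = pd.Series(["Dizziness", "dizziness", "Pain", "pain", "pain", "Somnolence"])

top_a = top_effects(drug_a, n=2)
top_b = top_effects(drug_b, n=2)
overlap = jaccard(top_a, top_b)
print(top_a, top_b, round(overlap, 2))
```

With the real data, the two inputs would be the `pt` columns of the Tramal and Lyrica subsets. A raw overlap score ignores how often each effect occurs, so it complements the bar plots rather than replacing them.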

<b>Step 3</b>

Define which further investigations might help determine whether one drug is preferable to another. Base your solution on the results of your dataset work.
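
One candidate for such a further investigation is a disproportionality analysis, because raw report counts depend on how often a drug is prescribed and reported. The sketch below computes a reporting odds ratio (ROR) with an approximate 95% confidence interval from a 2x2 table of report counts. The function name and the counts are hypothetical and only illustrate the arithmetic; they are not results from the 2019 data.

```python
import math


def reporting_odds_ratio(a, b, c, d):
    """ROR and approximate 95% CI from a 2x2 table of report counts.

    a: reports with the drug and the event
    b: reports with the drug, without the event
    c: reports with other drugs and the event
    d: reports with other drugs, without the event
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    ror = (a / b) / (c / d)
    # Standard error of log(ROR)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ror) - 1.96 * se)
    upper = math.exp(math.log(ror) + 1.96 * se)
    return ror, lower, upper


# Hypothetical counts, for illustration only
ror, lower, upper = reporting_odds_ratio(a=20, b=80, c=10, d=90)
print(f"ROR = {ror:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval that lies entirely above 1 suggests the event is reported disproportionately often for the drug. FAERS contains spontaneous reports, so this is a signal worth investigating, not evidence of causation.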

In [None]:
import pandas as pd

# Set the working directory to the folder that holds the downloaded FAERS .txt files.
# For simplicity, we assume the data folders are in the current working directory.
# Adjust the path if necessary, for example:
# import os; os.chdir(os.path.expanduser("~/Documents/Faers"))

# Define years we later use in the loop for the path. Here, it's only 1 year
years = ["19"]

# Define quarters we later use in the loop for the path
quarters = ["1", "2", "3", "4"]

# Define generic path that we later use for the loop
generic = "faers_ascii_20"

# Create empty master datasets with the right column names
relevant_demo_columns = ["primaryid", "age", "sex", "wt", "reporter_country", "event_dt", "init_fda_dt"]
relevant_drug_columns = ["primaryid", "drugname", "drug_seq", "Freq"]
relevant_ther_columns = ["primaryid", "dsg_drug_seq", "start_dt", "end_dt", "dur"]
relevant_react_columns = ["primaryid", "pt"]

demo_df = pd.DataFrame(columns=relevant_demo_columns)
drug_df = pd.DataFrame(columns=relevant_drug_columns)
drug_df_all = pd.DataFrame(columns=relevant_drug_columns)
ther_df = pd.DataFrame(columns=relevant_ther_columns)
react_df = pd.DataFrame(columns=relevant_react_columns)

# Define all drug names we are looking for. The same drug can appear under different names.
# Here we focus only on the brand names Tramal (tramadol) and Lyrica (pregabalin).
# If you included other drugs or name variants too, that is totally fine!
drug_names = ["tramal", "lyrica"]

# Create a loop that goes over all folders/files to read in the data from all years and quarters
# and appends them to each other
for year in years:  # Loop Over Years
    for quarter in quarters:  # Loop Over Quarters

        # Print both to check whether it is working
        print(year)
        print(quarter)

        # Create Paths depending on year and quarter
        path_demo = f"{generic}{year}q{quarter}/ASCII/DEMO{year}q{quarter}.txt"
        path_drug = f"{generic}{year}q{quarter}/ASCII/DRUG{year}q{quarter}.txt"
        path_ther = f"{generic}{year}q{quarter}/ASCII/THER{year}q{quarter}.txt"
        path_react = f"{generic}{year}q{quarter}/ASCII/REAC{year}q{quarter}.txt"

        # Read in the files
        demo = pd.read_csv(path_demo, sep="$", engine="python")
        drug = pd.read_csv(path_drug, sep="$", engine="python")
        ther = pd.read_csv(path_ther, sep="$", engine="python")
        react = pd.read_csv(path_react, sep="$", engine="python")

        # Change all column names of read-in files to lower case
        demo.columns = demo.columns.str.lower()
        drug.columns = drug.columns.str.lower()
        ther.columns = ther.columns.str.lower()
        react.columns = react.columns.str.lower()

        # If the demo DataFrame has no "sex" column, rename "gndr_cod" to "sex".
        # Earlier datasets used the older column name.
        if "sex" not in demo.columns:
            demo.rename(columns={"gndr_cod": "sex"}, inplace=True)

        # If there is a column with the name "isr", then change it to "primaryid".
        # Again, naming issue with earlier datasets
        if "isr" in demo.columns:
            demo.rename(columns={"isr": "primaryid"}, inplace=True)
            drug.rename(columns={"isr": "primaryid"}, inplace=True)
            ther.rename(columns={"isr": "primaryid"}, inplace=True)
            react.rename(columns={"isr": "primaryid"}, inplace=True)

        # Same for drug_seq and dsg_drug_seq
        if "drug_seq" in ther.columns:
            ther.rename(columns={"drug_seq": "dsg_drug_seq"}, inplace=True)

        # Count the distinct drugs per report (primaryid) and attach the count
        # to every drug row as "Freq"
        drug_unique = drug[~drug.duplicated(subset=["primaryid", "drugname"])]
        drug_freq = drug_unique.groupby("primaryid").size().reset_index(name="Freq")
        drug = drug.merge(drug_freq, on="primaryid", how="left")

        # Only select relevant columns from datasets
        demo = demo[relevant_demo_columns]
        drug = drug[relevant_drug_columns]
        ther = ther[relevant_ther_columns]
        react = react[relevant_react_columns]

        # Change entries in drugname column to lower case
        drug["drugname"] = drug["drugname"].str.lower()

        drug_all = drug.copy()

        # Only select entries that contain our drugs of interest.
        # .copy() avoids a SettingWithCopyWarning when we add columns below.
        drug = drug[drug["drugname"].isin(drug_names)].copy()

        # Create year and yearquarter columns.
        # Build the periods directly: the quarter number is not a month number.
        drug["datequarter"] = pd.Period(f"20{year}Q{quarter}", freq="Q")
        drug["dateyear"] = pd.Period(f"20{year}", freq="Y")

        # Append our dataset from the loop to our master datasets, we later use
        drug_df = pd.concat([drug_df, drug])
        drug_df_all = pd.concat([drug_df_all, drug_all])
        demo_df = pd.concat([demo_df, demo])
        ther_df = pd.concat([ther_df, ther])
        react_df = pd.concat([react_df, react])

# Save DataFrames to files
drug_df.to_pickle("drugDFCase.pkl")
drug_df_all.to_pickle("drugDFAllCase.pkl")
demo_df.to_pickle("demoDFCase.pkl")
ther_df.to_pickle("therDFCase.pkl")
react_df.to_pickle("reactDFCase.pkl")

# Merge
# Find all reactions per event and drug
# Find unique event/drug
drug_unique = drug_df[~drug_df.duplicated(subset=["primaryid", "drugname"])]

# Merge the adverse reactions with the drugs using the primary id as a key
react_drug = pd.merge(react_df, drug_unique, on="primaryid", how="left")

# Remove observations that contain NAs
react_drug = react_drug.dropna()

# Remove duplicate entries (adverse effects for each patient)
react_drug = react_drug[~react_drug.duplicated(subset=["primaryid", "pt"])]

# Optional: Merge with demographics of patients. This could be used later on for creating models
react_drug_demo = pd.merge(react_drug, demo_df, on="primaryid", how="left")

# Optional: Create a unique drugseq key, so we can identify the drug within an event.
# Here, we create a unique key for the drug in the therapy dataset.
# The two IDs are joined as strings; adding them as numbers would not give a unique key.
ther_df["drugkey"] = (
    ther_df["primaryid"].astype("Int64").astype(str)
    + "_"
    + ther_df["dsg_drug_seq"].astype("Int64").astype(str)
)

# The same we do for the already created dataset
react_drug_demo["drugkey"] = (
    react_drug_demo["primaryid"].astype("Int64").astype(str)
    + "_"
    + react_drug_demo["drug_seq"].astype("Int64").astype(str)
)

# Now we can merge the therapy master dataset with the previously merged dataset
# based on the drugkey we defined earlier
drug_demo_ther_react = pd.merge(react_drug_demo, ther_df, on="drugkey", how="left")

# We make all adverse effects lower case
drug_demo_ther_react["pt"] = drug_demo_ther_react["pt"].str.lower()

# Rename dataset for the sake of ease
tl = drug_demo_ther_react

# Filter by drug
tramal_df = tl[tl["drugname"] == "tramal"]
lyrica_df = tl[tl["drugname"] == "lyrica"]

# Count the top 10 adverse effects for Tramal and Lyrica.
# rename_axis/reset_index(name=...) yields the same column names
# ("adverse_effect", "n") regardless of the pandas version.
top_10_adverse_effects_tramadol = (
    tramal_df["pt"].value_counts().head(10)
    .rename_axis("adverse_effect").reset_index(name="n")
)
top_10_adverse_effects_lyrica = (
    lyrica_df["pt"].value_counts().head(10)
    .rename_axis("adverse_effect").reset_index(name="n")
)

# Plot the bar charts for the top 10 adverse effects
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.barplot(x="n", y="adverse_effect", data=top_10_adverse_effects_tramadol, palette="viridis")
plt.title("Top 10 Adverse Effects for Tramal")
plt.xlabel("Frequency")
plt.ylabel("Adverse Effect")

plt.subplot(1, 2, 2)
sns.barplot(x="n", y="adverse_effect", data=top_10_adverse_effects_lyrica, palette="viridis")
plt.title("Top 10 Adverse Effects for Lyrica")
plt.xlabel("Frequency")
plt.ylabel("Adverse Effect")

plt.tight_layout()
plt.show()