# Loading the Questions into Anki and a CSV File

This section focuses on loading data from the data warehouse into an Anki deck and creating a consolidated CSV file containing all exams.

## Importing libraries

We will use the following libraries:

* **pandas:** For efficient data manipulation.
* **numpy:** To represent missing values as NaN (Not a Number).
* **sqlite3:** To read cleaned data from the 'bir_warehouse.db' database.
* **json and urllib.request:** To interact with the AnkiConnect add-on API.

In [3]:
import pandas as pd
import numpy as np
import sqlite3
import json
import urllib.request

## Importing code to interface with the AnkiConnect API

In [4]:
# Copied from anki connect site
def request(action: str, **params) -> dict:
    return {"action": action, "params": params, "version": 6}


# Copied from anki connect site
def invoke(action: str, **params):
    requestJson = json.dumps(request(action, **params)).encode("utf-8")
    response = json.load(
        urllib.request.urlopen(
            urllib.request.Request("http://127.0.0.1:8765", requestJson)
        )
    )
    if len(response) != 2:
        raise Exception("response has an unexpected number of fields")
    if "error" not in response:
        raise Exception("response is missing required error field")
    if "result" not in response:
        raise Exception("response is missing required result field")
    if response["error"] is not None:
        raise Exception(response["error"])
    return response["result"]
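As a minimal sketch of what `invoke` sends over the wire, the snippet below rebuilds the JSON payload for one action without contacting Anki. The `createDeck` action and its `deck` parameter come from the AnkiConnect documentation; the deck name is only an illustrative value.

```python
import json


def request(action: str, **params) -> dict:
    # Same helper as above: wraps an action and its parameters in the
    # envelope that AnkiConnect expects (API version 6).
    return {"action": action, "params": params, "version": 6}


# Build the payload only; nothing is sent to Anki here.
payload = json.dumps(request("createDeck", deck="BIR_Examenes"))
print(payload)
# {"action": "createDeck", "params": {"deck": "BIR_Examenes"}, "version": 6}
```

With Anki running and the add-on installed, the equivalent live call would be `invoke("createDeck", deck="BIR_Examenes")`.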

## Retrieving Data from the Data Warehouse

The `sqlite3` library opens the connection to the data warehouse, the query joins the tables and selects the relevant columns, and the pandas function `read_sql()` loads the result into a DataFrame.

In [5]:
db_path = "../data/clean/bir_warehouse.db"
query: str = """
SELECT y.year_name, ex.exam_type, q.question, qo.option_text, qo.is_correct
FROM questions_options AS qo
JOIN questions AS q ON qo.question_id = q.id
JOIN year AS y ON q.exam_year = y.id_year
JOIN exam AS ex ON q.exam_subject = ex.id_type;
"""

con = sqlite3.connect(db_path)
exams_df: pd.DataFrame = pd.read_sql(query,con)
con.close()

exams_df.head()

Unnamed: 0,year_name,exam_type,question,option_text,is_correct
0,2024,bir,1. La barrera hematoencefálica:,1. Es permeable a todas las sustancias present...,0
1,2024,bir,1. La barrera hematoencefálica:,2. Es permeable al O2 y al CO2.,1
2,2024,bir,1. La barrera hematoencefálica:,3. Es impermeable al etanol.,0
3,2024,bir,1. La barrera hematoencefálica:,4. Es impermeable al agua.,0
4,2024,bir,2. El espacio subaracnoideo se encuentra:,1. Entre la aracnoides y la duramadre.,0
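The same read pattern can be sketched against an in-memory database, which is handy for trying a query without touching the warehouse file. The two-table schema below is a cut-down, hypothetical stand-in for the real one, and `contextlib.closing` is used so the connection is closed even if the query fails.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Hypothetical miniature schema, only for illustration.
SETUP = """
CREATE TABLE year (id_year INTEGER PRIMARY KEY, year_name TEXT);
CREATE TABLE questions (id INTEGER PRIMARY KEY, question TEXT, exam_year INTEGER);
INSERT INTO year VALUES (1, '2024');
INSERT INTO questions VALUES (1, '1. Example question:', 1);
"""

DEMO_QUERY = """
SELECT y.year_name, q.question
FROM questions AS q
JOIN year AS y ON q.exam_year = y.id_year;
"""

# closing() guarantees con.close() runs when the block exits.
with closing(sqlite3.connect(":memory:")) as con:
    con.executescript(SETUP)
    demo_df = pd.read_sql(DEMO_QUERY, con)

print(demo_df)
```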


## Formatting the Extracted Data

This step converts the extracted data into a suitable format, partly as practice with pandas.

### Changing Column Data Types

Using `.astype`, `year_name` is converted to an integer and `is_correct` to a Boolean.

In [6]:
exams_df["year_name"] = exams_df["year_name"].astype(int)
exams_df["is_correct"] = exams_df["is_correct"].astype(bool)
exams_df.head()

Unnamed: 0,year_name,exam_type,question,option_text,is_correct
0,2024,bir,1. La barrera hematoencefálica:,1. Es permeable a todas las sustancias present...,False
1,2024,bir,1. La barrera hematoencefálica:,2. Es permeable al O2 y al CO2.,True
2,2024,bir,1. La barrera hematoencefálica:,3. Es impermeable al etanol.,False
3,2024,bir,1. La barrera hematoencefálica:,4. Es impermeable al agua.,False
4,2024,bir,2. El espacio subaracnoideo se encuentra:,1. Entre la aracnoides y la duramadre.,False


### Grouping the Questions

To get the desired DataFrame, the rows are grouped by `question`, `exam_type`, and `year_name`, since this combination uniquely identifies a question.

In [7]:
exams_df_group = exams_df.groupby(["question", "exam_type", "year_name"])
exams_df_group.head()

Unnamed: 0,year_name,exam_type,question,option_text,is_correct
0,2024,bir,1. La barrera hematoencefálica:,1. Es permeable a todas las sustancias present...,False
1,2024,bir,1. La barrera hematoencefálica:,2. Es permeable al O2 y al CO2.,True
2,2024,bir,1. La barrera hematoencefálica:,3. Es impermeable al etanol.,False
3,2024,bir,1. La barrera hematoencefálica:,4. Es impermeable al agua.,False
4,2024,bir,2. El espacio subaracnoideo se encuentra:,1. Entre la aracnoides y la duramadre.,False
...,...,...,...,...,...
24276,2013,bir,235. En la electroforesis de hemoglobina en ac...,1. Se emplean como colorantes el Rojo Neutro o...,False
24277,2013,bir,235. En la electroforesis de hemoglobina en ac...,2. Se obtienen tres fraccione s diferenciadas:...,True
24278,2013,bir,235. En la electroforesis de hemoglobina en ac...,3. La fracción mayoritaria es la HbA2.,False
24279,2013,bir,235. En la electroforesis de hemoglobina en ac...,4. Se obtiene el hemolizado empleando CINH4.,False
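The output above is almost the whole table because `.head()` on a grouped DataFrame returns the first five rows of every group, and no question has more than five options. A minimal sketch with made-up rows shows what the grouping actually holds: iterating over it yields one `(key, sub-DataFrame)` pair per question.

```python
import pandas as pd

# Made-up rows: one 4-option question and one 5-option question.
demo = pd.DataFrame({
    "year_name": [2024] * 4 + [2014] * 5,
    "exam_type": ["bir"] * 9,
    "question": ["1. Question A:"] * 4 + ["1. Question B:"] * 5,
    "option_text": [f"{n}. text" for n in (1, 2, 3, 4, 1, 2, 3, 4, 5)],
})

grouped = demo.groupby(["question", "exam_type", "year_name"])

# Each iteration yields the key tuple and the rows that share it.
sizes = {key: len(group) for key, group in grouped}
for key, n_rows in sizes.items():
    print(key, n_rows)
```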


Next, an empty DataFrame with the final columns is created.

In [9]:
col_names = ["year", "exam", "question", "option1", "option2", "option3", "option4", "option5", "correct"]
exams_df_pivoted = pd.DataFrame(columns= col_names)

To pivot the DataFrame so that the correct answer option is computed automatically from the `is_correct` column, I decided to use a for loop over the grouped DataFrame. Iterating over the individual DataFrame of each question lets me extract all the relevant information while also handling questions with 4 or 5 options (a `"blank"` placeholder fills `option5` when there are only 4) and annulled questions that have no correct option (recorded as `0`).

In [10]:
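n_question: int = 0
for _, group in exams_df_group:
    # Start each row with the year, the exam type, and the question text.
    question: list = [group["year_name"].iloc[0], group["exam_type"].iloc[0], group["question"].iloc[0]]
    # Append every option text in its original order.
    for row_tuple in group.itertuples():
        question.append(row_tuple.option_text)
    # Years after 2014 are treated as 4-option exams, so option5 gets a placeholder.
    if group["year_name"].iloc[0] > 2014:
        question.append("blank")
    # 1-based position of the correct option, or 0 when no option is marked correct.
    correct = ([i+1 for i, boolean in enumerate(group["is_correct"]) if boolean] or [0])[0]
    question.append(correct)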
    if len(question) > 9:
        question.append("0")
    exams_df_pivoted.loc[n_question] = [str(x) for x in question]
    n_question += 1
exams_df_pivoted.head()

Unnamed: 0,year,exam,question,option1,option2,option3,option4,option5,correct
0,2014,bir,1. Confiere a la membrana plasmática alta pe r...,1. Acuaporinas.,2. Canales iónicos.,3. La Na+/K+-ATPasa.,4. Intercambiador Cl/HCO3.,5. Su composición lípidica.,1
1,2018,bir,1. El fosfolamban:,1. Regula la Ca2+-ATPasa del retículo sarcoplá...,2. Es un fosfolípido de la membrana plasmática.,3. Bloquea el receptor de dihidropiridinas.,4. Activa receptores de rianodina en retículo ...,blank,1
2,2007,bir,1. El movimiento de rotación es detectado por el:,1. Utrículo.,2. Sáculo.,3. Canales semicirculares.,4. Órgano de Corti.,5. Otolitos.,3
3,2017,bir,1. El periodo refractario absoluto de una fibr...,1. Inactivación de canales de Ca2+.,2. Inactivación de canales de K+.,3. Inactivación de canales de Na+.,4. La posthiperpolarización que sigue al poten...,blank,3
4,2015,bir,1. El potencial de equilibrio para un ión perm...,1. El equilibrio de Gibbs-Donnan.,2. La ecuación de Goldman-Hodgkin-Katz.,3. La ecuación de Ohm.,4. La ecuación de Nernst.,blank,4
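For comparison only, the same reshaping can be sketched without an explicit loop by numbering the options inside each group and calling `pivot`. This is not the method used in this notebook; it assumes the options of each question appear in order, it runs on made-up rows, and it leaves a missing `option5` as NaN instead of `"blank"`.

```python
import pandas as pd

# Made-up long-format rows: a 4-option question and an annulled 5-option one.
long_df = pd.DataFrame({
    "year_name": [2024] * 4 + [2014] * 5,
    "exam_type": ["bir"] * 9,
    "question": ["1. Question A:"] * 4 + ["1. Question B:"] * 5,
    "option_text": [f"{n}. text" for n in (1, 2, 3, 4, 1, 2, 3, 4, 5)],
    "is_correct": [False, True, False, False] + [False] * 5,
})

keys = ["year_name", "exam_type", "question"]

# Number the options 1..n inside each question.
long_df["option_n"] = long_df.groupby(keys).cumcount() + 1

# One row per question, one column per option number.
wide = long_df.pivot(index=keys, columns="option_n", values="option_text")
wide.columns = [f"option{n}" for n in wide.columns]

# Position of the correct option; questions without one get 0.
correct = long_df[long_df["is_correct"]].groupby(keys)["option_n"].first()
wide["correct"] = correct.reindex(wide.index).fillna(0).astype(int)

wide = wide.reset_index()
print(wide)
```

The loop keeps the per-question logic easy to read, while a pivot of this kind tends to be faster on large tables.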


In this step, the correct data types are assigned to the columns, and a new column named `num_q` is created by splitting the `question` column at the first period and taking the first item. This field is needed for sorting the DataFrame in a later step.

In [13]:
# Work on a copy so the pivoted DataFrame is not modified in place.
exams_df_clean = exams_df_pivoted.copy()
exams_df_clean["year"] = exams_df_clean["year"].astype(int)
exams_df_clean["correct"] = exams_df_clean["correct"].astype(int)
exams_df_clean["num_q"] = exams_df_clean["question"].apply(lambda x: x.split(".", 1)[0])
exams_df_clean["num_q"] = exams_df_clean["num_q"].astype(int)
exams_df_clean["option5"] = exams_df_clean["option5"].replace("blank", np.nan)
exams_df_clean.head()

Unnamed: 0,year,exam,question,option1,option2,option3,option4,option5,correct,num_q
0,2014,bir,1. Confiere a la membrana plasmática alta pe r...,1. Acuaporinas.,2. Canales iónicos.,3. La Na+/K+-ATPasa.,4. Intercambiador Cl/HCO3.,5. Su composición lípidica.,1,1
1,2018,bir,1. El fosfolamban:,1. Regula la Ca2+-ATPasa del retículo sarcoplá...,2. Es un fosfolípido de la membrana plasmática.,3. Bloquea el receptor de dihidropiridinas.,4. Activa receptores de rianodina en retículo ...,,1,1
2,2007,bir,1. El movimiento de rotación es detectado por el:,1. Utrículo.,2. Sáculo.,3. Canales semicirculares.,4. Órgano de Corti.,5. Otolitos.,3,1
3,2017,bir,1. El periodo refractario absoluto de una fibr...,1. Inactivación de canales de Ca2+.,2. Inactivación de canales de K+.,3. Inactivación de canales de Na+.,4. La posthiperpolarización que sigue al poten...,,3,1
4,2015,bir,1. El potencial de equilibrio para un ión perm...,1. El equilibrio de Gibbs-Donnan.,2. La ecuación de Goldman-Hodgkin-Katz.,3. La ecuación de Ohm.,4. La ecuación de Nernst.,,4,1
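The question-number extraction can also be written with the vectorized `.str` accessor instead of `apply`. A minimal sketch on made-up question strings, where `n=1` splits only at the first period so later periods in the text are ignored:

```python
import pandas as pd

# Made-up question texts; the second one contains an extra period.
questions = pd.Series([
    "1. First question:",
    "12. A dose of 3.5 mg is:",
    "235. Last question:",
])

# Split once at the first period, keep the left part, convert to int.
num_q = questions.str.split(".", n=1).str[0].astype(int)
print(num_q.tolist())
```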


## Validating the Data

This step performs a simple check based on the fact that every option column should begin with its option number. `startswith` tests each value, and `all()` returns `True` only if every value in the column passes.

Because `option5` is missing for exams that have only 4 options, a Boolean mask selects only the rows that contain text.

Finally, an `if` statement combined with `all()` checks whether every column passed. If not, a `Warning` is raised so the data can be checked.

In [14]:
validate_option_1 = (exams_df_clean["option1"].str.startswith("1")).all()
validate_option_2 = (exams_df_clean["option2"].str.startswith("2")).all()
validate_option_3 = (exams_df_clean["option3"].str.startswith("3")).all()
validate_option_4 = (exams_df_clean["option4"].str.startswith("4")).all()
validate_option_5 = (exams_df_clean["option5"][exams_df_clean["option5"].notna()].str.startswith("5")).all()

if not all([validate_option_1, validate_option_2, validate_option_3, validate_option_4, validate_option_5]):
    raise Warning("Data Validation failed, please check the DataFrame for missing values")
print("DataFrame validated")

DataFrame validated
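The five near-identical checks can be collapsed into a loop that also reports which column failed. This is a sketch on made-up data; the helper name `check_option_prefixes` is hypothetical, and it tests for the stricter prefix `"1."` rather than `"1"`, which is an assumption about the expected format.

```python
import numpy as np
import pandas as pd


def check_option_prefixes(df: pd.DataFrame, numbers: tuple) -> dict:
    """Return, per option column, whether every non-missing value starts with 'N.'."""
    results = {}
    for n in numbers:
        # dropna() skips rows where the option does not exist (4-option exams).
        values = df[f"option{n}"].dropna()
        results[f"option{n}"] = bool(values.str.startswith(f"{n}.").all())
    return results


# Made-up data with one deliberately misnumbered value in option2.
demo = pd.DataFrame({
    "option1": ["1. a", "1. b"],
    "option2": ["2. a", "3. misnumbered"],
    "option5": [np.nan, "5. e"],
})

report = check_option_prefixes(demo, numbers=(1, 2, 5))
failed = [column for column, ok in report.items() if not ok]
print(failed)
```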


## Final Sorting

In this step, the DataFrame will be sorted by:

* Year in descending order.
* Type of exam in ascending order.
* Question number in ascending order.

Also, the `num_q` column will be deleted once the DataFrame is sorted.

In [15]:
exams_df_sorted = exams_df_clean.sort_values(by=["year", "exam", "num_q"], ascending= [False, True, True])
exams_df_sorted = exams_df_sorted.reset_index(drop= True)
del exams_df_sorted["num_q"]
exams_df_sorted.head()

Unnamed: 0,year,exam,question,option1,option2,option3,option4,option5,correct
0,2024,bir,1. La barrera hematoencefálica:,1. Es permeable a todas las sustancias present...,2. Es permeable al O2 y al CO2.,3. Es impermeable al etanol.,4. Es impermeable al agua.,,2
1,2024,bir,2. El espacio subaracnoideo se encuentra:,1. Entre la aracnoides y la duramadre.,2. Entre la aracnoides y la piamadre.,3. Lleno de sangre.,4. En el sistema nervioso periférico.,,2
2,2024,bir,3. La endolinfa:,1. Rellena el laberinto óseo del oído interno.,2. Contiene baja concentración de K+.,3. Rellena el laberinto membranoso del oído in...,4. Contiene alta concentración de Na+.,,3
3,2024,bir,4. Las células olfatorias humanas:,1. Son células nerviosas bipolares.,2. Son células nerviosas multipolares.,3. Son células nerviosas unipolares.,4. Son células epiteliales.,,1
4,2024,bir,5. La esclerosis múltiple es una patología que...,1. Hipermielinización de los axones con pérdid...,2. Desmielinización de los axones con pérdida ...,3. Incremento en la resistencia eléctrica axonal.,4. Reducción en la capacitancia axonal.,,2
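The reason `num_q` was converted to an integer before sorting can be seen in a small pure-Python sketch: text is ordered character by character, so question 10 would land before question 2.

```python
numbers_as_text = ["1", "2", "10", "100", "21"]

# Lexicographic order compares characters, not numeric values.
lexicographic = sorted(numbers_as_text)
print(lexicographic)  # ['1', '10', '100', '2', '21']

# Converting to int gives the order expected for question numbers.
numeric = sorted(numbers_as_text, key=int)
print(numeric)  # ['1', '2', '10', '21', '100']
```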


## Loading

This step focuses on loading the formatted data.

### CSV File

For ease of use, a combined CSV file with all the questions is created.

In [None]:
exams_df_sorted.to_csv("../data/clean/bir_examenes.csv", index= False)
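A quick round trip is one way to confirm that the CSV preserves the data, in particular the missing `option5` values. The sketch below writes made-up rows to an in-memory buffer instead of a file and reads them back.

```python
import io

import numpy as np
import pandas as pd

# Made-up rows: one 4-option question (option5 missing) and one 5-option question.
demo = pd.DataFrame({
    "year": [2024, 2014],
    "option5": [np.nan, "5. Quinta opción."],
    "correct": [2, 1],
})

# Write to an in-memory buffer rather than to disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Read it back: the missing value is written as an empty field and returns as NaN.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```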

### Anki Deck

In this step, a for loop over `itertuples` builds one note per question, formatted as described in the AnkiConnect documentation. The notes are then added to the Anki deck with the `addNotes` action.

The questions will have the following structure:

Front:

* Year and exam type, followed by a line break (`<br>`).
* The question, followed by three line breaks.
* The options, separated by two line breaks.

Back:

* The number of the correct option (`0` for annulled questions).

In [None]:
deck_name: str = "BIR_Examenes"
notes_list: list = []

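for row in exams_df_sorted.itertuples():
    # Front: header line, question, then the options separated by blank lines.
    front: str = f"{row.year} {row.exam.upper()}<br>{row.question}<br><br><br>{row.option1}<br><br>{row.option2}<br><br>{row.option3}<br><br>{row.option4}"
    back: str = str(row.correct)
    tags: list = [f"Examen::{row.exam.upper()}::{row.year}"]
    # option5 is NaN for 4-option exams, so it is appended only when present.
    if not pd.isna(row.option5):
        front += f"<br><br>{row.option5}"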
    note = {
        "deckName": deck_name,
        "modelName": "Basic",
        "fields": {"Front": front, "Back": back},
        "tags": tags
    }
    notes_list.append(note)

invoke("addNotes", notes = notes_list)
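According to the AnkiConnect documentation, `addNotes` returns a list of note identifiers in which notes that could not be created appear as `null` (`None` in Python); this may differ between add-on versions, so treat it as an assumption to verify. Under that assumption, a small hypothetical helper can summarize the outcome. The example result below is made up and no call to Anki is made.

```python
def summarize_add_notes(result: list) -> dict:
    """Count created and rejected notes in an addNotes-style result list."""
    # Assumption: a rejected note is reported as None in the result.
    added = [note_id for note_id in result if note_id is not None]
    return {"added": len(added), "rejected": len(result) - len(added)}


# Made-up result: the second note was rejected (for example, a duplicate).
example_result = [1496198395707, None, 1496198395709]
summary = summarize_add_notes(example_result)
print(summary)
```

With Anki running, the helper would wrap the real call as `summarize_add_notes(invoke("addNotes", notes=notes_list))`. The target deck has to exist beforehand; `invoke("createDeck", deck=deck_name)` is the documented action for creating it.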