# Dataset Analysis

This notebook performs a basic analysis of the dataset: it inspects the columns, checks for missing values and other inconsistencies, and normalizes the questions and answers.
 
##### No AI-based assistance or code completion was used to solve this problem.

In [104]:
import re
import json
import spacy
import pandas as pd
from collections import defaultdict, Counter
from utils import write_json

In [105]:
dataset = pd.read_csv('./data/medical_dataset.csv')
print('Total dataset size: ',len(dataset))

Total dataset size:  16406


In [106]:
dataset.head(15)

Unnamed: 0,question,answer
0,What is (are) Glaucoma ?,Glaucoma is a group of diseases that can damag...
1,What is (are) Glaucoma ?,The optic nerve is a bundle of more than 1 mil...
2,What is (are) Glaucoma ?,Open-angle glaucoma is the most common form of...
3,Who is at risk for Glaucoma? ?,Anyone can develop glaucoma. Some people are a...
4,How to prevent Glaucoma ?,"At this time, we do not know how to prevent gl..."
5,What are the symptoms of Glaucoma ?,"At first, open-angle glaucoma has no symptoms...."
6,What are the treatments for Glaucoma ?,"Yes. Immediate treatment for early stage, open..."
7,what research (or clinical trials) is being do...,Through studies in the laboratory and with pat...
8,Who is at risk for Glaucoma? ?,Encourage them to have a comprehensive dilated...
9,What is (are) Glaucoma ?,National Eye Institute National Institutes of...


In [107]:
# check for nan values
print(dataset.isna().any(),dataset['answer'].isna().sum()) 

question    False
answer       True
dtype: bool 5


In [108]:
rows_with_nan = dataset[dataset.isna().any(axis=1)]
rows_with_nan

Unnamed: 0,question,answer
3587,What is (are) HELLP syndrome ?,
3836,What is (are) X-linked lymphoproliferative syn...,
4196,What is (are) Familial HDL deficiency ?,
4429,What is (are) Emery-Dreifuss muscular dystroph...,
6689,What is (are) Emery-Dreifuss muscular dystroph...,


In [109]:
# remove rows with NaN as there are just 5 rows.
dataset.dropna(inplace=True)
dataset.shape

(16401, 2)
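`isna()` only flags missing values, so an answer that is an empty or whitespace-only string would pass the check above. A minimal sketch of an additional check follows; the `sample` frame is a made-up illustration, not rows from the dataset.

```python
import pandas as pd

# Toy frame for illustration only; these rows are not from the medical dataset.
sample = pd.DataFrame({
    'question': ['q1', 'q2', 'q3'],
    'answer': ['a real answer', '   ', None],
})

# NaN check, as in the cell above.
nan_mask = sample['answer'].isna()

# Blank check: strings that are empty once surrounding whitespace is stripped.
blank_mask = sample['answer'].fillna('').str.strip().eq('') & ~nan_mask

print('NaN answers:', int(nan_mask.sum()))
print('Blank answers:', int(blank_mask.sum()))
```

On the real data the same two masks could be computed on `dataset['answer']` before calling `dropna`.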

# Question: What is (are) Glaucoma ?

Understanding how the answers are structured for a sample question.

In [110]:
dataset.iloc[9]['answer']

'National Eye Institute  National Institutes of Health  2020 Vision Place  Bethesda, MD 20892-3655  301-496-5248  E-mail: 2020@nei.nih.gov  www.nei.nih.gov The Glaucoma Foundation  80 Maiden Lane, Suite 700  New York, NY 10038  212-285-0080 Glaucoma Research Foundation  251 Post Street, Suite 600  San Francisco, CA 94108  1-800-826-6693'

In [111]:
dataset.iloc[1]['answer']

'The optic nerve is a bundle of more than 1 million nerve fibers. It connects the retina to the brain.'

In [112]:
dataset.iloc[2]['answer']

'Open-angle glaucoma is the most common form of glaucoma. In the normal eye, the clear fluid leaves the anterior chamber at the open angle where the cornea and iris meet. When the fluid reaches the angle, it flows through a spongy meshwork, like a drain, and leaves the eye. Sometimes, when the fluid reaches the angle, it passes too slowly through the meshwork drain, causing the pressure inside the eye to build. If the pressure damages the optic nerve, open-angle glaucoma -- and vision loss -- may result.'

In [113]:
dataset.iloc[0]['answer']

"Glaucoma is a group of diseases that can damage the eye's optic nerve and result in vision loss and blindness. The most common form of the disease is open-angle glaucoma. With early treatment, you can often protect your eyes against serious vision loss. (Watch the video to learn more about glaucoma. To enlarge the video, click the brackets in the lower right-hand corner. To reduce the video, press the Escape (Esc) button on your keyboard.)  See this graphic for a quick overview of glaucoma, including how many people it affects, whos at risk, what to do if you have it, and how to learn more.  See a glossary of glaucoma terms."

# Question to Answers Counter

In [116]:
# Counts number of answers per question before cleaning.

question_to_answers = defaultdict(list)

for _, row in dataset.iterrows():  # the index is not needed here
    q = row['question']
    a = row['answer']
    question_to_answers[q].append(a)

question_to_answers_cnt = Counter(dataset['question'].values)
    
print('Total unique Questions: ',len(question_to_answers))

Total unique Questions:  14976
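The question-to-answers grouping can also be built without `iterrows()`. The sketch below uses `groupby` on a toy frame; the `toy` rows and the `toy_*` names are illustrative, and on the real data the same calls would be applied to `dataset`.

```python
import pandas as pd
from collections import Counter

# Toy frame for illustration only.
toy = pd.DataFrame({
    'question': ['What is (are) Glaucoma ?',
                 'How to prevent Glaucoma ?',
                 'What is (are) Glaucoma ?'],
    'answer': ['first answer', 'second answer', 'third answer'],
})

# Map each question to the list of its answers, keeping row order within a group.
toy_q_to_a = toy.groupby('question', sort=False)['answer'].apply(list).to_dict()

# Number of answers per question.
toy_q_to_a_cnt = Counter(toy['question'])

print(len(toy_q_to_a))
print(toy_q_to_a_cnt.most_common(1))
```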


In [117]:
question_to_answers_cnt

Counter({'What causes Causes of Diabetes ?': 20,
         'What is (are) High Blood Cholesterol ?': 19,
         'What is (are) Medicare and Continuing Care ?': 14,
         'What is (are) Stroke ?': 13,
         'What is (are) Skin Cancer ?': 13,
         'What is (are) Breast Cancer ?': 12,
         'What are the treatments for Breast Cancer ?': 12,
         'What is (are) Colorectal Cancer ?': 12,
         'What is (are) Kidney Failure: Eat Right to Feel Right on Hemodialysis ?': 12,
         'What is (are) High Blood Pressure ?': 11,
         "What is (are) Parkinson's Disease ?": 11,
         'What are the treatments for Prostate Cancer ?': 10,
         'What is (are) Leukemia ?': 10,
         'What is (are) Nutrition for Advanced Chronic Kidney Disease in Adults ?': 10,
         'What is (are) Prostate Cancer ?': 9,
         'Who is at risk for Prostate Cancer? ?': 9,
         'What is (are) Age-related Macular Degeneration ?': 9,
         'Who is at risk for Breast Cancer? ?': 9

# Normalize questions

In [115]:
def clean_question(q):
    """
    Removes unnecessary spaces and repeated '?' from a question and lowercases it.
    Args:
        q (str): a string of question
    Returns:
        str : clean question.
    """
    q = q.strip().lower()
    q = re.sub(r'\s+', ' ', q)                   # collapse multiple spaces
    q = re.sub(r'\s+([?.!,])', r'\1', q)         # remove space before punctuation
    q = re.sub(r'([?.!,]){2,}', r'\1', q)        # collapse repeated punctuation
    return q

def clean_answer(text):
    """
    Removes emails, URLs, bullet markers, and extra whitespace from an answer.
    Args:
        text (str) : the answer associated with questions.

    Returns:
        str : clean text.
    """
    text = re.sub(r'\S+@\S+', '', text)  # Remove emails
    text = re.sub(r'http\S+|www\.\S+', '', text)  # remove URLs
    text = re.sub(r'-\s+', '', text)  # remove bullet markers; note this also matches any '-' followed by whitespace inside prose
    text = re.sub(r'\s+', ' ', text)  # normalize whitespace
    text = text.strip()
    return text
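A short demonstration of what the two cleaning functions do on strings taken from the outputs above. The functions are repeated here only so the snippet runs on its own; `q2` is a hypothetical variant added to show that differently formatted questions collapse to the same key.

```python
import re

# Copies of the two functions above, repeated so this snippet is self-contained.
def clean_question(q):
    q = q.strip().lower()
    q = re.sub(r'\s+', ' ', q)
    q = re.sub(r'\s+([?.!,])', r'\1', q)
    q = re.sub(r'([?.!,]){2,}', r'\1', q)
    return q

def clean_answer(text):
    text = re.sub(r'\S+@\S+', '', text)
    text = re.sub(r'http\S+|www\.\S+', '', text)
    text = re.sub(r'-\s+', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# A raw question from the dataset and a hypothetical variant of it.
q1 = clean_question('Who is at risk for Glaucoma? ?')
q2 = clean_question('who is at risk for  Glaucoma ?')
print(q1, q1 == q2)

# Email and URL are removed from a fragment of the contact-details answer.
a1 = clean_answer('E-mail: 2020@nei.nih.gov  www.nei.nih.gov The Glaucoma Foundation')
print(a1)

# Caveat: the bullet pattern also strips a dash followed by whitespace in prose.
a2 = clean_answer('open-angle glaucoma -- and vision loss -- may result.')
print(a2)
```

The last call shows a side effect worth keeping in mind: `--` used as a dash in running text loses its second hyphen and the following space.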

In [121]:
# Count of answers per question post cleaning.

clean_q_to_a = defaultdict(list)
clean_q_to_a_cnt = {}

for question, answers in question_to_answers.items():
    clean_q = clean_question(question) 
    clean_a = [clean_answer(answer) for answer in answers]
    clean_q_to_a[clean_q].extend(clean_a)

for q, a in clean_q_to_a.items():
    clean_q_to_a_cnt[q] = len(a)

clean_q_to_a_cnt = dict(sorted(clean_q_to_a_cnt.items(), key=lambda x:x[1], reverse=True))

In [122]:
print('Total unique questions post cleaning and normalization: ',len(clean_q_to_a))

Total unique questions post cleaning and normalization:  14342


In [123]:
clean_q_to_a_cnt

{'what causes causes of diabetes?': 20,
 'what is (are) high blood cholesterol?': 19,
 'what is (are) medicare and continuing care?': 14,
 'what is (are) breast cancer?': 13,
 'what are the treatments for breast cancer?': 13,
 'what is (are) stroke?': 13,
 'what is (are) skin cancer?': 13,
 'what is (are) colorectal cancer?': 12,
 'what is (are) kidney failure: eat right to feel right on hemodialysis?': 12,
 'what is (are) high blood pressure?': 11,
 'what are the treatments for prostate cancer?': 11,
 "what is (are) parkinson's disease?": 11,
 'what is (are) prostate cancer?': 10,
 'what is (are) age-related macular degeneration?': 10,
 'what is (are) leukemia?': 10,
 'what is (are) nutrition for advanced chronic kidney disease in adults?': 10,
 'who is at risk for prostate cancer?': 9,
 'who is at risk for breast cancer?': 9,
 'what is (are) lung cancer?': 9,
 'what is (are) diabetes?': 8,
 'what is (are) dry mouth?': 8,
 "what is (are) alzheimer's disease?": 8,
 'what are the treatm

### Find the minimum and maximum number of words in an answer
This code also collects questions that have no answers, and questions whose single answer has fewer than 10 words (including empty strings). These questions are excluded from the minimum and maximum.

In [125]:
q_to_word_counts = {}
min_q_to_word_counts = {}
min_word = 999999
max_word = 0


for question, answers in clean_q_to_a.items():
    word_count = [len(answer.split()) for answer in answers]
    if len(word_count)==0: # no answers for this question
        min_q_to_word_counts[question] = [0]
        continue
    elif len(word_count)==1 and word_count[0]<10: # a single answer with fewer than 10 words
        min_q_to_word_counts[question] = word_count 
        continue
    min_word = min(min_word, min(word_count))
    max_word = max(max_word, max(word_count))
    q_to_word_counts[question] = word_count

In [127]:
print('Minimum words: ', min_word)
print('Maximum words: ', max_word)

Minimum words:  4
Maximum words:  4188
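Word counts can also be computed per answer with pandas string methods, which makes summary statistics easy to obtain. A sketch on a small Series; `toy_answers` is illustrative and not a sample of the cleaned data.

```python
import pandas as pd

# Illustrative answers only.
toy_answers = pd.Series([
    'Frequently Asked Questions',
    'The optic nerve is a bundle of more than 1 million nerve fibers.',
    'Topics',
])

# Split on whitespace and count tokens, equivalent to len(answer.split()).
word_counts = toy_answers.str.split().str.len()

print(word_counts.min(), word_counts.max())
print(word_counts.tolist())
```

Unlike the loop above, this counts every answer, including short single answers, so the minimum it reports on the real data may differ.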


In [129]:
min_q_to_word_counts # display all questions with single answers having less than 10 words.

{'what causes childhood brain stem glioma?': [9],
 'what causes childhood ependymoma?': [9],
 "what are the symptoms of bell's palsy?": [7],
 "what causes bell's palsy?": [4],
 'what are the treatments for epidermolysis bullosa acquisita?': [7],
 'what are the treatments for progressive hemifacial atrophy?': [7],
 'how to diagnose occipital horn syndrome?': [8],
 'what is (are) parasites - toxocariasis (also known as roundworm infection)?': [6],
 'what can i do to prevent poisoning by marine toxins?': [6],
 'how to prevent acanthamoeba - granulomatous amebic encephalitis (gae); keratitis?': [1],
 'what is (are) parasites - african trypanosomiasis (also known as sleeping sickness)?': [4],
 'what is (are) parasites - paragonimiasis (also known as paragonimus infection)?': [4],
 'what is (are) parasites - lymphatic filariasis?': [6],
 'what is the outlook for thyrotoxic myopathy?': [9],
 'what is the outlook for agnosia?': [6],
 'how many people are affected by hirschsprung disease?': [9]

### Removing questions with non-informative single answers.

In [132]:
final_clean_q_to_a = {}

for question, answers in clean_q_to_a.items():
    word_count = [len(answer.split()) for answer in answers]
    if len(word_count)<1 or (len(word_count)==1 and word_count[0]<5):
        print(question, answers) # non informative answers
        continue
    final_clean_q_to_a[question] = answers

len(final_clean_q_to_a)

what causes bell's palsy? ["What causes Bell's palsy?"]
how to prevent acanthamoeba - granulomatous amebic encephalitis (gae); keratitis? ['Topics']
what is (are) parasites - african trypanosomiasis (also known as sleeping sickness)? ['Frequently Asked Queestions (FAQs)']
what is (are) parasites - paragonimiasis (also known as paragonimus infection)? ['Frequently Asked Queestions (FAQs)']


14338

## Observations:
1. In some cases, the question itself appears as part of the answer.
2. Multiple answers have fewer than 6 words.
3. Most of the answers with fewer than 6 words have no informative content.
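Observation 1 can be checked programmatically. The sketch below flags questions whose answers are just the question repeated after normalization. `normalize` and `echoes_question` are hypothetical helper names, and `sample_q_to_a` reuses two entries shown earlier in this notebook.

```python
import re

def normalize(text):
    # Lowercase, collapse whitespace, and drop spaces before punctuation.
    text = re.sub(r'\s+', ' ', text.strip().lower())
    return re.sub(r'\s+([?.!,])', r'\1', text)

def echoes_question(question, answers):
    # True when there is at least one answer and every answer repeats the question.
    return len(answers) > 0 and all(
        normalize(a) == normalize(question) for a in answers
    )

sample_q_to_a = {
    "what causes bell's palsy?": ["What causes Bell's palsy?"],
    'what is (are) glaucoma?': [
        'The optic nerve is a bundle of more than 1 million nerve fibers. '
        'It connects the retina to the brain.'
    ],
}

echoed = [q for q, answers in sample_q_to_a.items() if echoes_question(q, answers)]
print(echoed)
```

This only catches exact repeats; answers that contain the question plus extra text would need a looser comparison.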

### Save clean question to answers dictionary map

In [80]:
file_path = 'clean_questions_to_answers_dataset_v1.json'
write_json(file_path, final_clean_q_to_a)
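`write_json` comes from a local `utils` module that is not shown in this notebook. A plausible minimal version, assuming it simply serializes the dictionary with `json.dump`, might look like the sketch below; the signature, encoding, and `indent` setting are assumptions.

```python
import json
import os
import tempfile

def write_json(file_path, data):
    # Assumed behavior: write `data` to `file_path` as UTF-8 encoded JSON.
    with open(file_path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)

# Round-trip check on a temporary file with an illustrative dictionary.
sample = {'what is (are) glaucoma?': ['first answer', 'second answer']}
tmp_path = os.path.join(tempfile.mkdtemp(), 'sample.json')
write_json(tmp_path, sample)

with open(tmp_path, encoding='utf-8') as f:
    loaded = json.load(f)

print(loaded == sample)
```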