# BioASQ challenge: Query-Focused Summarization Project on Medical Questions

## Objectives

- Work on a query-focused summarization task using a dataset derived from the **BioASQ challenge** (http://www.bioasq.org/), which covers medical questions and relevant answers extracted from medical publications.
- Develop an understanding of the data, which consists of medical questions paired with sentences labeled by their relevance to the question.
- Implement deep neural networks that identify which sentences in the list answer each medical question.

```python
!unzip data.zip
```

```python
import pandas as pd

dataset = pd.read_csv("bioasq10b_labelled.csv")
dataset.head()
```

| Unnamed: 0 | qid | sentid | question | sentence text | label |
|---|---|---|---|---|---|
| 0 | 0 | 0 | Is Hirschsprung disease a mendelian or a multi... | Hirschsprung disease (HSCR) is a multifactoria... | 0 |
| 1 | 0 | 1 | Is Hirschsprung disease a mendelian or a multi... | In this study, we review the identification of... | 1 |
| 2 | 0 | 2 | Is Hirschsprung disease a mendelian or a multi... | The majority of the identified genes are relat... | 1 |
| 3 | 0 | 3 | Is Hirschsprung disease a mendelian or a multi... | The non-Mendelian inheritance of sporadic non-... | 1 |
| 4 | 0 | 4 | Is Hirschsprung disease a mendelian or a multi... | Coding sequence mutations in e.g. | 0 |


The columns of the CSV file are:

* `qid`: an ID for a question. Several rows may have the same question ID, as we can see above.
* `sentid`: an ID for a sentence.
* `question`: The text of the question. In the above example, the first rows all have the same question: "Is Hirschsprung disease a mendelian or a multifactorial disorder?"
* `sentence text`: The text of the sentence.
* `label`: 1 if the sentence is a part of the answer, 0 if the sentence is not part of the answer.
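
Because several rows share a `qid`, grouping by that column gives a quick view of how many candidate sentences and how many answer sentences each question has. The sketch below uses a small hand-made frame rather than the real CSV; the rows, the placeholder text values, and the names `toy` and `per_question` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy rows that mimic the CSV layout; the text values are placeholders.
toy = pd.DataFrame(
    {
        "qid": [0, 0, 0, 1, 1],
        "sentid": [0, 1, 2, 0, 1],
        "question": ["question A"] * 3 + ["question B"] * 2,
        "sentence text": ["s0", "s1", "s2", "s3", "s4"],
        "label": [0, 1, 1, 1, 0],
    }
)

# One row per question: number of candidate sentences and number labeled 1.
per_question = toy.groupby("qid")["label"].agg(n_sentences="size", n_answer="sum")
print(per_question)
```

The same `groupby` call should work on the `dataset` frame loaded earlier, since it only relies on the `qid` and `label` columns.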


### 1. Part of Speech Statistics Analysis

Developed a function, `stats_pos`, to analyze and report the normalized frequency of part-of-speech tags in medical questions and answers using NLTK's Universal tag set.
- Concatenated the distinct questions and the distinct answers separately, so that each group is tagged as a whole and repeated questions are not counted more than once.
- Used NLTK for sentence tokenization, word tokenization, and part-of-speech tagging to compare linguistic features between questions and answers.
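
The steps above can be sketched roughly as follows. This is not the original `stats_pos`; the names `normalized_tag_freq` and `stats_pos_sketch`, the rounding to four decimals, and the space-joined concatenation are assumptions made for illustration. The NLTK calls assume that the tokenizer, tagger, and `universal_tagset` resources have already been downloaded.

```python
from collections import Counter


def normalized_tag_freq(tagged_tokens):
    """Return sorted (tag, share) pairs, where share = tag count / total tokens."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    return sorted((tag, round(count / total, 4)) for tag, count in counts.items())


def stats_pos_sketch(texts):
    """Tag the concatenation of the distinct texts with the Universal tag set."""
    # Imported here so the counting helper above works without NLTK installed.
    # Assumes the NLTK tokenizer, tagger, and universal_tagset data are available.
    import nltk

    joined = " ".join(dict.fromkeys(texts))  # drop duplicates, keep first-seen order
    tagged = []
    for sentence in nltk.sent_tokenize(joined):
        tokens = nltk.word_tokenize(sentence)
        tagged.extend(nltk.pos_tag(tokens, tagset="universal"))
    return normalized_tag_freq(tagged)


# The helper can be checked on hand-tagged tokens without any NLTK data.
example = [("What", "PRON"), ("is", "VERB"), ("HSCR", "NOUN"), ("?", ".")]
print(normalized_tag_freq(example))
```

With the real data, the comparison would come from calling the function once on `dataset["question"]` and once on `dataset["sentence text"]`.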


#### PoS Tag Differences:

- Noun Use: Both questions and answers lean heavily on nouns, which makes sense because they are often talking about specifics.

- Verb Use: Questions are more densely packed with verbs than answers.

- Punctuation and Particles: There is more punctuation in answers. I think this is because answers list items and spell things out in more detail.

- Adjectives (ADJ): Answers use adjectives more than questions do. This is mostly due to the nature of answers, which tend to be more descriptive.

- Adverbs (ADV) and Conjunctions (CONJ): More adverbs and conjunctions come up in answers. This probably means answers are connecting thoughts and adding depth to the explanations.

- Determiners (DET) and Pronouns (PRON): Questions make frequent use of determiners and pronouns.

- Numerals (NUM): Answers include more numerals than questions, due to their role in providing quantified information.

### 2. N-gram Frequency Analysis

Created a function `stats_top_stem_ngrams` to identify and report the top N most frequent n-grams of word stems for medical questions and answers, together with their normalized frequency, using NLTK's tools.
- Concatenated the distinct questions and the distinct answers separately before counting n-grams.
- Used NLTK's tokenization and Porter Stemmer for stem extraction, and built n-grams within individual sentences so that no n-gram crosses a sentence boundary.
- Compared the resulting patterns between questions and answers to see which phrasings are common in medical queries.
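
A minimal sketch of this counting scheme is shown below. It is not the original `stats_top_stem_ngrams`; the names `top_ngrams` and `stats_top_stem_ngrams_sketch`, the four-decimal rounding, and the choice to normalize by the total number of n-grams are assumptions. The NLTK tokenizers are assumed to have their data downloaded.

```python
from collections import Counter


def top_ngrams(stemmed_sentences, n, top_k):
    """Return the top_k n-grams with their share of all n-grams.

    stemmed_sentences holds one list of stems per sentence, so an n-gram
    can never span two sentences.
    """
    counts = Counter()
    for stems in stemmed_sentences:
        # Zipping n shifted copies of the sentence yields its n-grams.
        counts.update(zip(*(stems[i:] for i in range(n))))
    total = sum(counts.values())
    return [(gram, round(count / total, 4)) for gram, count in counts.most_common(top_k)]


def stats_top_stem_ngrams_sketch(texts, n, top_k):
    """Tokenize and stem the distinct texts, then count n-grams per sentence."""
    # Imported here so top_ngrams works without NLTK installed.
    import nltk
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    joined = " ".join(dict.fromkeys(texts))  # drop duplicates, keep first-seen order
    stemmed = [
        [stemmer.stem(token) for token in nltk.word_tokenize(sentence)]
        for sentence in nltk.sent_tokenize(joined)
    ]
    return top_ngrams(stemmed, n, top_k)


# Toy input, already split into sentences and stems.
toy_sentences = [["what", "is", "the", "gene"], ["is", "the", "gene", "known"]]
print(top_ngrams(toy_sentences, n=2, top_k=2))
```

In the toy input, the pair `("gene", "is")` never appears in the counts even though the two words are adjacent once the sentences are joined, which is the sentence-boundary behavior described above.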

- Shared Bigrams: Both questions and answers use the bigrams ('of', 'the') and ('in', 'the'). I think this is because both deal with descriptions or specifications involving locations or possession.

- Question-Specific Bigrams: The bigrams ('what', 'is') and ('are', 'the') come up mostly in questions. This makes sense, as questions such as "What is this?" or "Are these the facts?" are common.

- Answer-Specific Bigrams: Answers use (',', 'and') and (')', ','), which points to them often listing items or explaining details.


### 3. Named Entity Recognition Analysis
Developed the function `stats_ne` to compute and report the normalized frequency of named entity types in medical questions and answers using spaCy's `en_core_web_sm` model.
- Kept the analysis consistent across datasets by using only spaCy's default entity types.
- Reported the distribution of named entity types to show which kinds of information dominate the medical questions and their answers.
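
One way such a function could look is sketched below. It is not the original `stats_ne`; the names `normalized_label_freq` and `stats_ne_sketch`, the deduplication of input texts, and the four-decimal rounding are assumptions. The spaCy part assumes the `en_core_web_sm` model is installed.

```python
from collections import Counter


def normalized_label_freq(labels):
    """Return sorted (label, share) pairs, where share = label count / total entities."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sorted((label, round(count / total, 4)) for label, count in counts.items())


def stats_ne_sketch(texts):
    """Collect spaCy's default entity labels over the distinct texts."""
    # Imported here so the counting helper above works without spaCy installed.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    distinct_texts = list(dict.fromkeys(texts))  # drop duplicates, keep order
    labels = [ent.label_ for doc in nlp.pipe(distinct_texts) for ent in doc.ents]
    return normalized_label_freq(labels)


# The helper can be checked on a hand-written list of entity labels.
toy_labels = ["ORG", "GPE", "ORG", "CARDINAL"]
print(normalized_label_freq(toy_labels))
```

Since `en_core_web_sm` is a general-purpose model, the labels it assigns to biomedical terms may not match their medical meaning.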

### 4. TF-IDF Similarity Analysis

Implemented the `stats_tfidf` function to determine the ratio of questions whose most similar sentence, based on tf-idf cosine similarity, is actually part of the answer set.
- Configured Scikit-learn's `TfidfVectorizer` with English stop word removal to build the text features.
- Fitted the vectorizer on the unique questions together with their respective answers before computing similarities.
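
The idea can be sketched as follows. This is not the original `stats_tfidf`; the name `stats_tfidf_sketch`, the per-`qid` grouping, and the toy rows are assumptions made for illustration. The column names follow the CSV described earlier.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def stats_tfidf_sketch(df):
    """Share of questions whose most tf-idf-similar sentence has label 1."""
    questions = df["question"].drop_duplicates()
    sentences = df["sentence text"].drop_duplicates()

    # Fit on the distinct questions and sentences together.
    vectorizer = TfidfVectorizer(stop_words="english")
    vectorizer.fit(pd.concat([questions, sentences]))

    hits = 0
    groups = df.groupby("qid")
    for _, group in groups:
        q_vec = vectorizer.transform([group["question"].iloc[0]])
        s_vecs = vectorizer.transform(group["sentence text"])
        sims = cosine_similarity(q_vec, s_vecs)[0]
        best = sims.argmax()  # on ties, the first sentence wins
        hits += int(group["label"].iloc[best] == 1)
    return hits / groups.ngroups


# Hypothetical toy rows: the first question's best match is an answer sentence,
# the second question's best match is not.
toy = pd.DataFrame(
    {
        "qid": [0, 0, 1, 1],
        "question": ["Is aspirin used for headache?"] * 2
        + ["Which gene causes cystic fibrosis?"] * 2,
        "sentence text": [
            "Aspirin relieves headache pain.",
            "The weather was cloudy.",
            "Bananas are yellow.",
            "Mutations in the CFTR gene cause cystic fibrosis.",
        ],
        "label": [1, 0, 1, 0],
    }
)
print(stats_tfidf_sketch(toy))
```

On the real data the call would be `stats_tfidf_sketch(dataset)`. A low ratio would suggest that plain lexical overlap is a weak signal for this task.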
