# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint

## Problem Statement

Determine research areas and the corresponding research investigators based on the research interests of an individual.

## Learning Objectives

At the end of the Mini Hackathon, you will be able to:

* cluster similar research areas from the given abstracts using K-means
* identify the top research investigators of those research areas

In [2]:
#@title Mini-hackathon Walkthrough
from IPython.display import HTML

HTML("""<video width="800" height="600" controls>
  <source src="https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/Walkthrough/Clustering_MH_Walkthrough.mp4" type="video/mp4">
</video>
""")

## Background

Every year, millions of students apply to graduate schools worldwide. Graduate school selection can be based on several criteria, such as location, weather, affordability, school reputation, faculty, areas of research interest, and funding. Choosing an area of research that advances the student's academic or professional goals is key to attaining career success. Currently, there are insufficient tools to search for schools and faculty based on areas of research. Students need to search through publications, explore individual faculty web pages, or browse through many web search results.

A search tool to identify academic groups in graduate schools, working in specific research areas, will enable better decision making in the selection of graduate schools. It will also increase the chances of professional success through a better match of candidates and their research interests and goals.

## Methodology

This is an exploratory data mining approach. Using a large, real-world dataset of biomedical research topics, abstracts, research investigators, and their funding records, we will perform NLP and clustering (unsupervised learning) to obtain investigator clusters grouped by research area.

## Dataset

[World RePORT](https://worldreport.nih.gov/app/#!/) is an open-access database that provides data on biomedical research funding for worldwide projects. It contains information on more than 1 lakh (100,000) funded proposals, including the names of the research organizations, principal investigators, research topics, research abstracts, funding received, etc. The given dataset contains the text of ~7000 research abstracts, extracted via the abstract links in the World RePORT database, along with the corresponding investigator and funding data.

## Grading = 20 Marks

## Setup Steps

In [None]:
#@title Run this cell to download the dataset

from IPython import get_ipython
ipython = get_ipython()

def setup():
    # run_line_magic replaces the deprecated ipython.magic() call
    ipython.run_line_magic("sx", "wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/Funding_Organizations_Records.zip")
    ipython.run_line_magic("sx", "unzip Funding_Organizations_Records.zip")
    print("Setup completed successfully")

setup()

**Import Required Packages**

In [None]:
import re
import pandas as pd
import numpy
import numpy as np
import gensim
from sklearn.cluster import KMeans
from gensim.models import Doc2Vec
import nltk
from nltk.corpus import stopwords
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

## **Stage 1:** Data Loading and Pre-processing

### 3 Marks - > Perform basic cleanup operations and pre-process the data

1. Load and Explore Train data

2. Data cleaning (Drop missing data) and reset the indices of the dataframe

3. Preprocess the abstracts of train data by following pre-processing steps:
  * Remove Stopwords
  * Remove special characters and alphanumeric words
  * Lemmatization
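
The cleaning and pre-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the expected solution: the column name `Abstract` and the toy rows are assumptions, scikit-learn's built-in `ENGLISH_STOP_WORDS` stands in for the NLTK stopword list to avoid a corpus download, and lemmatization is left as an optional callable (for example `nltk.stem.WordNetLemmatizer().lemmatize`, once the WordNet data is available).

```python
import re
import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def preprocess(text, lemmatize=None):
    """Lowercase, then drop special characters, non-alphabetic tokens, and stopwords."""
    # Replace anything that is not an ASCII letter, digit, or whitespace with a space.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text.lower())
    tokens = []
    for tok in text.split():
        # Keep purely alphabetic tokens; this drops numbers and mixed tokens like "1200mg".
        if not tok.isalpha() or tok in ENGLISH_STOP_WORDS:
            continue
        # Apply the optional lemmatizer (hypothetical hook, see the note above).
        tokens.append(lemmatize(tok) if lemmatize else tok)
    return " ".join(tokens)

# Toy stand-in for the training data; the column name "Abstract" is an assumption.
df = pd.DataFrame({"Abstract": ["Rifampicin at 1200mg was tested in the patients!",
                                None,
                                "Tuberculosis is treated with four drugs."]})

# Drop rows with missing abstracts and reset the index so it is contiguous again.
df = df.dropna(subset=["Abstract"]).reset_index(drop=True)
df["clean"] = df["Abstract"].apply(preprocess)
print(df["clean"].tolist())
```

Resetting the index matters later on, because row positions in the feature matrix are matched back to dataframe rows.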





In [None]:
# YOUR CODE HERE

## **Stage 2:**  Feature Extraction 

### 3 Marks - > Extract feature vectors of the abstracts using TF-IDF or Doc2Vec

Provide the following parameters when using TfidfVectorizer:
  * Ignore the least frequent words with a threshold value of 0.01.

    Hint: Use the min_df parameter.

  * Set binary to True and norm to 'l1'.

  Refer to [sklearn TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) for more details.
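
As a rough sketch of the TF-IDF option, the parameters listed above map directly onto `TfidfVectorizer` arguments. The four toy documents are assumptions standing in for the pre-processed abstracts.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the pre-processed abstracts (assumed data).
docs = [
    "tuberculosis drug treatment trial",
    "rifampicin dose tuberculosis",
    "malaria vaccine trial",
    "malaria parasite genome",
]

# min_df=0.01 ignores terms that appear in fewer than 1% of the documents,
# binary=True uses term presence instead of counts,
# norm="l1" scales each row so its absolute values sum to 1.
vectorizer = TfidfVectorizer(min_df=0.01, binary=True, norm="l1")
features = vectorizer.fit_transform(docs)

row_sums = np.asarray(features.sum(axis=1)).ravel()
print(features.shape, row_sums)
```

The result is a sparse matrix with one row per abstract. Keep the fitted `vectorizer` around, since the search item in Stage 4 has to be transformed with the same vocabulary.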

               

>>  **(OR)**


While using Doc2Vec, follow the below steps:

* Tag the documents.
* Initialize the Doc2Vec.
* Build the Vocabulary.
* Train the model by giving total_examples=model.corpus_count, epochs=10, start_alpha=0.002, end_alpha=-0.016.

Refer to [Doc2Vec 1](https://medium.com/@ermolushka/text-clusterization-using-python-and-doc2vec-8c499668fa61) (or) [Doc2Vec 2](https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5) for more details.

In [None]:
# YOUR CODE HERE

## **Stage 3:** K-means clustering
Perform K-means clustering on the abstracts

### 3 Marks - > Find the optimal number of clusters (K) by using the [Elbow method](https://pythonprogramminglanguage.com/kmeans-elbow-method/). 
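
One possible way to compute the elbow curve is sketched below. It assumes synthetic data from `make_blobs` in place of the abstract feature vectors and uses the mean distance to the nearest centroid (via `cdist`) as the distortion measure; `KMeans.inertia_` would be an equally valid choice.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the feature vectors (assumed data with 4 true groups).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

ks = list(range(2, 10))
distortions = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Distance from every point to every centroid; keep the nearest one per point.
    dists = cdist(X, km.cluster_centers_, "euclidean")
    distortions.append(dists.min(axis=1).mean())

for k, d in zip(ks, distortions):
    print(k, round(d, 3))
```

Plotting `ks` against `distortions` (for example with `plt.plot(ks, distortions, "bx-")`) shows the elbow. With sparse TF-IDF features, convert to a dense array with `.toarray()` before calling `cdist`.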

In [None]:
# YOUR CODE HERE
# Hint: Experiment with different ranges of cluster counts, e.g. (2, 20), until the curve shows a point where the rapid decline levels off.

### 2 Marks - > Train the k-Means model with the optimal number of clusters found above.

1. Initialize the k-Means with optimal K value.
2. Fit the k-Means model with the feature vectors.
3. Predict the labels (i.e., clusters) of the feature vectors. 
4. Add the predicted labels to the existing train dataframe.
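
The four steps above might look like this in outline. The toy dataframe, the `clean` column name, and `optimal_k = 2` are all assumptions; in the hackathon the value of K comes from the elbow plot.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the cleaned training data (assumed column name "clean").
df = pd.DataFrame({"clean": [
    "tuberculosis drug treatment",
    "tuberculosis rifampicin dose",
    "tuberculosis lung infection",
    "malaria vaccine trial",
    "malaria parasite genome",
    "malaria mosquito control",
]})
features = TfidfVectorizer().fit_transform(df["clean"])

optimal_k = 2  # placeholder; use the K chosen from the elbow method

# 1. Initialize, 2. fit, 3. predict, 4. attach the labels to the dataframe.
kmeans = KMeans(n_clusters=optimal_k, n_init=10, random_state=0)
kmeans.fit(features)
df["cluster"] = kmeans.predict(features)
print(df)
```

Because the dataframe index was reset after cleaning, row `i` of `features` corresponds to row `i` of `df`, so the labels can be assigned directly.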

In [None]:
# YOUR CODE HERE

### 4 Marks - > Visualize the top frequent words in any 2 clusters' abstracts, using a [word cloud](https://programmerbackpack.com/word-cloud-python-tutorial-create-wordcloud-from-text/) approach. 

#### This will allow you to identify the research areas in the different clusters, based on the most frequently occurring words

1. Combine all the abstracts of each chosen cluster.
2. Generate and display the word cloud of the chosen clusters.
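
As a lightweight sketch of step 1, the snippet below combines the abstracts of each cluster and counts the most frequent words with `collections.Counter`. This is a plain-text substitute for the word cloud, not the word cloud itself; the toy dataframe and its column names are assumptions.

```python
from collections import Counter
import pandas as pd

# Toy stand-in for the clustered training data (assumed columns).
df = pd.DataFrame({
    "clean": ["tuberculosis drug treatment", "tuberculosis rifampicin dose",
              "malaria vaccine trial", "malaria parasite genome"],
    "cluster": [0, 0, 1, 1],
})

def cluster_text(frame, label):
    """Join all cleaned abstracts that belong to one cluster into a single string."""
    return " ".join(frame.loc[frame["cluster"] == label, "clean"])

def top_words(text, n=5):
    """Return the n most frequent whitespace-separated words with their counts."""
    return Counter(text.split()).most_common(n)

for label in (0, 1):
    print(label, top_words(cluster_text(df, label)))
```

The combined string from `cluster_text` is also the input a word cloud needs: with the `wordcloud` package installed, `WordCloud().generate(text)` produces an image that can be shown with `plt.imshow`.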


In [None]:
# YOUR CODE HERE

## **Stage 4:**  Deriving Insights


### 1 Mark - > List the PI names of each cluster
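
A minimal sketch of listing investigators per cluster with a pandas `groupby`. The column name `PI Name` and the sample names are assumptions; use whichever column holds the investigator names in the dataset.

```python
import pandas as pd

# Toy stand-in for the clustered training data (assumed column names and values).
df = pd.DataFrame({
    "PI Name": ["A. Rao", "B. Singh", "C. Lee", "A. Rao"],
    "cluster": [0, 1, 0, 0],
})

# Collect the distinct PI names in each cluster, sorted for readable output.
pi_by_cluster = (
    df.groupby("cluster")["PI Name"]
      .agg(lambda names: sorted(set(names)))
      .to_dict()
)

for label, names in pi_by_cluster.items():
    print(label, names)
```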

In [None]:
# YOUR CODE HERE

### 2 Marks - > Predict the label (cluster) for the given search item



*   Get the vectors of the search item by transforming with TfidfVectorizer or Doc2Vec

*   Predict the label of the search item using the k-Means model.
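
The two steps can be sketched as below for the TF-IDF route. The toy corpus and query are assumptions. The key point is that the search item is passed through `transform` (not `fit_transform`) of the vectorizer already fitted on the training abstracts, so it lands in the same feature space the k-Means model was trained on.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the cleaned training abstracts (assumed data).
docs = [
    "tuberculosis drug treatment",
    "tuberculosis rifampicin dose",
    "tuberculosis lung infection",
    "malaria vaccine trial",
    "malaria parasite genome",
    "malaria mosquito control",
]
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(docs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# Hypothetical search item, already cleaned the same way as the training data.
query = ["rifampicin dose for tuberculosis treatment"]

# Reuse the fitted vocabulary; words unseen during fitting are ignored.
query_vec = vectorizer.transform(query)
predicted = int(kmeans.predict(query_vec)[0])
print("predicted cluster:", predicted)
```

Apply the same pre-processing function from Stage 1 to the search item before vectorizing it, otherwise its tokens may not match the training vocabulary.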

In [None]:
search_item = ["""Approximately 20 million people globally are infected with tuberculosis, and about 1.5 million people die of the disease annually, i.e. one death every 20 seconds. Currently, tuberculosis of the lungs is treated with four drugs ethambutol, isoniazid, rifampicin, and pyrazinamide daily for the first two months, followed by the two drugs isoniazid and rifampicin for the next four months. This drug combination is recommended by the World Health Organisation and is used in most countries of the world.
                The combination is highly effective if taken properly, but despite this about 15% patients worldwide are not cured. Factors such as patients not completing the course, missing multiple doses, or taking (or being prescribed) the wrong dose contribute to treatment failure. Although the drugs are free to patients, there is a substantial cost, in terms of time and administration, to both the patient and the treatment services. A recent study by Gospodarevskaya et al (Int J Tub Lung Dis. 18: 810-817) has found that patients have to terminate productive/economic activities and are often forced to borrow money and/or sell assets to cover cost of treatment, which can amount to more than three-quarters of patients' income, in the last 2 months of treatment. Reducing the duration of treatment should increase the number of people who successfully complete treatment and reduce the cost to them.
                A reduction could be achieved in one of two ways: using combinations of the new drugs currently under development, or by using the currently available drugs more effectively. Given the enormous cost and long time required to develop new drugs the second option is attractive. Increasing the dose of one of the currently available drugs may allow the duration of treatment to be shortened in the very near future.
                Three recently published Phase III trials (RIFAQUIN, ReMOX, OFLOTUB) have failed to demonstrate that treatment shortening can be achieved with the quinolones. Thus, the rifamycins offer the best hope if higher doses can be shown to be safe.
                Rifampicin which is responsible for killing most tuberculosis bacteria, appears to be the best choice since increasing doses of rifampicin increases its ability to kill TB bacilli in vitro and animal studies. A similar result could be obtained in human tuberculosis. However, one concern would be a possible increase in unwanted serious side effects with increasing doses. Liver damage by rifampicin appears to be rare and not connected to dose size. In the RIFATOX Trial, a dose of 1200mg, in 100 patients did not increase its toxicity.
                The central question this trial aims to answer is therefore: does an increase in the dosage of rifampicin allow us to shorten treatment from 6 to 4 months? We are assessing whether giving double or triple the usual dose of rifampicin (1200mg, or 1800mg rather than 600mg daily) is safe and, when given for 4 months only, will result in relapse rates similar to (or better than) those found in the standard 6 month course of treatment. Patients with newly diagnosed tuberculosis of the lung, who agree to participate and have signed a consent form, will receive either the standard 6 month treatment or a 4 month treatment containing the standard drugs but with a double or triple dose of rifampicin. Treatment allocation will be random. The success of treatment in each method will be closely monitored both clinically and by regular microscopic examination of sputum, and the safety of the increased dose of rifampicin will be monitored clinically and with blood tests.
                If the trial is successful, it will lead to a shorter treatment course for pulmonary tuberculosis. The expected consequences would be: more patients completing the course and higher rates of cure, reduction in rates of transmission of tuberculosis with fewer people becoming infected, a reduced cost of treatment for both patients and treatment facilities and, perhaps, a reduction in the emergence of bacterial drug resistance.
                """]

In [None]:
# YOUR CODE HERE

### 2 Marks - > Find the top-10 corresponding **PI Names** from the predicted cluster, which are most relevant to the given search item.

Step 1 : Get the feature vectors of the documents (abstracts) of the above predicted cluster.
      
Hint: Use the indices of the documents that belong to the predicted cluster and get their feature vectors.

Step 2 : Calculate the distance between **search item feature vector** and **predicted cluster feature vectors**.

Hint: Use cdist from scipy for calculating the distance.


Step 3 : Find the top 10 feature vectors that have the least distance from the search item feature vector.

Step 4 : Give the PI Names corresponding to the top 10 feature vectors.
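
Steps 1 to 4 could be sketched as follows. Random vectors stand in for the real feature matrix, and the `PI Name` column, the cluster assignment, and the query vector are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Assumed stand-ins: 60 random feature vectors, PI names, and cluster labels.
rng = np.random.default_rng(0)
features = rng.random((60, 5))
df = pd.DataFrame({
    "PI Name": [f"PI_{i}" for i in range(60)],
    "cluster": np.arange(60) % 3,
})

# Hypothetical search item: a vector very close to document 3 (cluster 0).
query_vec = features[3:4] + 0.001
predicted = 0

# Step 1: indices and feature vectors of the documents in the predicted cluster.
idx = np.flatnonzero(df["cluster"].to_numpy() == predicted)
cluster_vecs = features[idx]

# Step 2: distance from the search item to every document in that cluster.
dists = cdist(query_vec, cluster_vecs, "euclidean").ravel()

# Step 3: positions of the 10 smallest distances, mapped back to dataframe rows.
top = idx[np.argsort(dists)[:10]]

# Step 4: PI names for those rows, nearest first.
top_pis = df.loc[top, "PI Name"].tolist()
print(top_pis)
```

With sparse TF-IDF features, call `.toarray()` on both the cluster rows and the query vector before passing them to `cdist`. Indexing with `df.loc[top]` relies on the dataframe having the default 0..n-1 index, which is why the index was reset in Stage 1.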

In [None]:
# YOUR CODE HERE

### (Optional): Identify the top funded research investigators most relevant to the search item

In [None]:
# YOUR CODE HERE