# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint

## Problem Statement

Determine Research Areas and corresponding Research Investigators based on the research interest of an individual

## Learning Objectives

At the end of the Mini Hackathon, you will be able to:

* cluster similar research areas from the given abstracts using K-means
* identify the top research investigators of those research areas

In [None]:
#@title Mini-hackathon Walkthrough
from IPython.display import HTML

HTML("""<video width="800" height="500" controls>
  <source src="https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/Walkthrough/Clustering_MH_Walkthrough.mp4" type="video/mp4">
</video>
""")

## Background

Every year, millions of students apply to graduate schools worldwide. The process of graduate school selection could be based on several criteria such as location, weather, affordability, school reputation, faculty, areas of research interest, funding, etc. Choosing an area of research that enhances the student's academic or professional goals is key to attaining career success. Currently, there are insufficient tools to search for schools and faculty based on areas of research. Students either need to search through publications, explore independent faculty web pages, or browse through several search results obtained through a web search.

A search tool to identify academic groups in graduate schools, working in specific research areas, will enable better decision making in the selection of graduate schools. It will also increase the chances of professional success through a better match of candidates and their research interests and goals.

## Methodology

This is an Exploratory **Data Mining Approach**. Using a large, real-world dataset of **biomedical research topics**, abstracts, research investigators, and their funding records, we will perform **NLP and Clustering** (Unsupervised Learning) to **obtain research-area-based investigator clusters**.

## Dataset

[World RePORT](https://worldreport.nih.gov/app/#!/) is an open-access database that provides data on biomedical research funding for worldwide projects. It contains information on more than 1 lakh (100,000) funded proposals and includes names of the research organizations, principal investigator, research topic, research abstract, funding received, etc. The given dataset **contains ~7000 research abstracts**, whose text was extracted from the abstract links in the World RePORT database, along with the corresponding investigator and funding data.

## Grading = 20 Marks

## Setup Steps

In [7]:
#@title Run this cell to download the dataset

from IPython import get_ipython
ipython = get_ipython()

def setup():
    # run_line_magic is the supported replacement for the deprecated ipython.magic()
    ipython.run_line_magic("sx", "wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/Funding_Organizations_Records.zip")
    ipython.run_line_magic("sx", "unzip Funding_Organizations_Records.zip")
    print("Setup completed successfully")
    return

setup()

Setup completed successfully


**Import Required Packages**

In [8]:
import re
import pandas as pd
import numpy as np
import gensim
from sklearn.cluster import KMeans
from gensim.models import Doc2Vec
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
from nltk.corpus import stopwords
stopwords = set(stopwords.words("english"))   
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import RegexpTokenizer


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## **Stage 1:** Data Loading and Pre-processing

### 3 Marks - >  Perform basic cleanup operations and pre-process the data 

1. Load and Explore Train data

2. Data cleaning (Drop missing data) and reset the indices of the dataframe

3. Preprocess the abstracts of train data by following pre-processing steps:
  * Remove Stopwords
  * Remove special characters and alphanumeric words
  * Lemmatization
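
The three pre-processing steps listed above can be sketched on a single sentence as follows. This is a minimal illustration, not the notebook's solution: `DEMO_STOPWORDS` and `preprocess_abstract` are hypothetical names, the stop-word list is a tiny stand-in for NLTK's English list, and lemmatization defaults to an identity function so the sketch runs without downloading WordNet.

```python
# Tiny illustrative stop-word list (assumption); the notebook uses NLTK's English list.
DEMO_STOPWORDS = {"a", "an", "the", "in", "of", "by", "and", "is", "to"}

def preprocess_abstract(text, stop_words=DEMO_STOPWORDS, lemmatize=lambda w: w):
    """Clean one abstract: tokenize, drop stop words and non-alphabetic tokens, lemmatize."""
    # Split on whitespace and strip surrounding punctuation from each token
    tokens = [t.strip(".,;:()[]\"'") for t in text.split()]
    # Keep purely alphabetic tokens (drops numbers and alphanumeric words such as "1200mg")
    words = [t for t in tokens if t.isalpha()]
    # Compare in lower case because stop-word lists are lower case
    words = [w for w in words if w.lower() not in stop_words]
    # Apply the supplied lemmatizer (identity by default)
    return " ".join(lemmatize(w) for w in words)

sample = "JC virus (JCV) causes a fatal disease in the central nervous system; 1200mg dose."
cleaned = preprocess_abstract(sample)
print(cleaned)
```

To lemmatize for real, one option is to pass `nltk.stem.WordNetLemmatizer().lemmatize` as the `lemmatize` argument once the `wordnet` corpus has been downloaded.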





In [9]:
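# YOUR CODE HERE
# File names are case-sensitive on Linux/Colab. A later cell reads 'Train_Data.csv',
# so check the extracted file name if this line raises FileNotFoundError.
abstract_df = pd.read_csv('Train_data.csv')
print(abstract_df.head())
print(abstract_df.columns)
print(abstract_df.shape)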

FileNotFoundError: ignored

In [None]:
print(abstract_df['PI Name'])

In [None]:
#Data cleaning (Drop missing data) and reset the indices of the dataframe
abstract_df.dropna(inplace=True)
abstract_df.reset_index(drop=True, inplace=True)
print(abstract_df.shape)

In [None]:
def identify_tokens(row):
    abstract = row['Abstracts']
    tokens = nltk.word_tokenize(abstract)
    # keep only alphabetic tokens (drops punctuation, numbers and alphanumeric words)
    token_words = [w for w in tokens if w.isalpha()]
    return token_words
abstract_df['words'] = abstract_df.apply(identify_tokens, axis=1)


In [None]:
# Before removing stop words 
print(len(abstract_df['words']))
print(len(abstract_df['words'][666]))
print(abstract_df['words'][666])

In [None]:
print(stopwords)

In [None]:
x1 = ['DESCRIPTION', 'provided', 'by', 'applicant', 'JC', 'virus', 'JCV', 'causes', 'a', 'fatal', 'disease', 'in', 'the', 'central']
print(len(x1))
y1 = [w for w in x1 if not w in stopwords]
print(len(y1))

In [None]:
#stopwords = set(stopwords.words("english"))

def remove_stops(row):
    abstract_list = row['words']
    # NLTK stop words are lower case, so compare in lower case to also drop "The", "In", ...
    meaningful_words = [w for w in abstract_list if w.lower() not in stopwords]
    return meaningful_words

abstract_df['token_meaningful'] = abstract_df.apply(remove_stops, axis=1)

In [None]:
# After removing stop words 
print(len(abstract_df['token_meaningful']))
print(len(abstract_df['token_meaningful'][666]))
print(abstract_df['token_meaningful'][666])

In [None]:
def rejoin_words(row):
    abstract_list = row['token_meaningful']
    joined_words = ( " ".join(abstract_list))
    return joined_words

abstract_df['token_processed'] = abstract_df.apply(rejoin_words, axis=1)

In [None]:
# Before removing special characters and alphanumeric words
print(abstract_df['token_processed'][5211])

In [None]:
# Remove special characters and alphanumeric words on token processed 

# Replace every run of characters other than letters, digits and underscores with a space
abstract_df['token_meaningful_alpha'] = abstract_df.token_processed.str.replace(r"[^a-zA-Z\d_]+", " ", regex=True).str.strip()


In [None]:
abstract_df['token_meaningful_alpha'] = abstract_df['token_processed'].map(lambda x: re.sub(r'\W+', ' ', x))


In [None]:
# After removing special characters and alphanumeric words

print(len(abstract_df['token_meaningful_alpha']))
print(len(abstract_df['token_meaningful_alpha'][666]))
print(abstract_df['token_meaningful_alpha'][666])

In [None]:
# Apply lemmatization
lemmatizer = nltk.stem.WordNetLemmatizer()
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()

def lemmatize_abstract(abstract):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(abstract)]

abstract_df['abstract_lemmatized'] = abstract_df['token_meaningful_alpha'].apply(lemmatize_abstract)

In [None]:
# After applying lemmatizer  
print(abstract_df['abstract_lemmatized'][666])

In [None]:
def rejoin_words(row):
    abstract_list = row['abstract_lemmatized']
    joined_words = ( " ".join(abstract_list))
    return joined_words
abstract_df['lemmatized_processed'] = abstract_df.apply(rejoin_words, axis=1)

In [None]:
# After joined lemmatizer  
print(abstract_df['lemmatized_processed'][5211])

## **Stage 2:**  Feature Extraction 

### 3 Marks - > Extract feature vectors of the abstracts using TF-IDF or Doc2Vec

Provide the below parameters while using TfidfVectorizer
  * Ignore the least frequent words with a threshold value of 0.01.

    Hint: Use min_df parameters.

  * Give binary as True and norm as L1

  Refer to [sklearn TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) for more details.

               

>>  **(OR)**


While using Doc2Vec, follow the below steps:

* Tag the documents.
* Initialize the Doc2Vec.
* Build the Vocabulary.
* Train the model by giving total_examples=model.corpus_count, epochs=10, start_alpha=0.002, end_alpha=-0.016.

Refer to [Doc2Vec 1](https://medium.com/@ermolushka/text-clusterization-using-python-and-doc2vec-8c499668fa61) (or) [Doc2Vec 2](https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5) for more details.

In [None]:
# Extract feature vectors of the abstracts using TF-IDF 
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df= 0.01, binary= True, norm= 'l1')
X = vectorizer.fit_transform(abstract_df.lemmatized_processed)

In [None]:
X

In [None]:
print(abstract_df.lemmatized_processed.iloc[-1])

In [None]:
print(X.shape)
print(X[-1,:])
Pred_x = X[-1,:]

In [None]:
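# A sparse matrix cannot be stored directly as a column; keep one dense row vector per abstract
abstract_df['features'] = X.toarray().tolist()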

In [None]:
print(abstract_df.columns)

In [None]:
print(abstract_df.head())

In [None]:
abstract_df.to_csv('df_with_features.csv')

In [None]:
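# Using DataFrame.insert() to add a column 
# Scratch example, left commented out: the value list must have one entry per row,
# so a fixed 4-element list raises a length-mismatch error on this dataframe.
# abstract_df.insert(2, "Age", [21, 23, 24, 21], True) 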

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out())
print(X.shape)

## **Stage 0_3**  Kmeans clustering using Doc2Vec features
Perform Kmeans clustering for the abstracts

In [None]:
# Features based on Doc2Vec
#Import all the dependencies
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

In [None]:
LabeledSentence1 = gensim.models.doc2vec.TaggedDocument
all_content_train = []
for j, em in enumerate(abstract_df['lemmatized_processed'].values):
    # TaggedDocument expects a list of word tokens, not a raw string
    all_content_train.append(LabeledSentence1(em.split(), [j]))
print("Number of texts processed: ", len(all_content_train))

In [None]:
# gensim >= 4 names the embedding dimension vector_size (formerly size)
d2v_model = Doc2Vec(all_content_train, vector_size=100, window=10,
                    min_count=500, workers=7, dm=1, alpha=0.025, min_alpha=0.001)
d2v_model.train(all_content_train, total_examples=d2v_model.corpus_count,
                epochs=10, start_alpha=0.002, end_alpha=-0.016)

In [None]:
kmeans_doc2vec = KMeans(n_clusters=11, init='k-means++', random_state=42) 
# gensim >= 4 exposes the document vectors as model.dv.vectors (formerly docvecs.doctag_syn0)
X_D2V = kmeans_doc2vec.fit(d2v_model.dv.vectors)
labels_D2V = kmeans_doc2vec.labels_.tolist()

In [None]:
# fit the model, then project the document vectors with PCA for plotting
# (gensim >= 4 exposes the document vectors as model.dv.vectors)
from sklearn.decomposition import PCA
model_D2V = kmeans_doc2vec.fit_predict(d2v_model.dv.vectors)
pca = PCA(n_components=11).fit(d2v_model.dv.vectors)
datapoint = pca.transform(d2v_model.dv.vectors)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure()
label1 = ['#FFFF00', '#008000', '#0000FF', '#800080', '#00FF00', '#FF0000','#FFFFFF',  '#FFFF00',  '#800080', '#FFF8DC',  '#8B008B']
color = [label1[i] for i in labels_D2V]
plt.scatter(datapoint[:, 0], datapoint[:, 1], c=color)
centroids = kmeans_doc2vec.cluster_centers_
centroidpoint = pca.transform(centroids)
plt.scatter(centroidpoint[:, 0], centroidpoint[:, 1], marker='^', s=300, c='#000000')
plt.show()

## **Stage 3**  Kmeans clustering
Perform Kmeans clustering for the abstracts

### 3 Marks - > Find the optimal number of clusters (K) by using the [Elbow method](https://pythonprogramminglanguage.com/kmeans-elbow-method/). 
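
To make the quantity behind the elbow plot concrete, the sketch below uses synthetic 2-D points (an assumption, not the abstracts) and checks that `inertia_` equals the within-cluster sum of squares (WCSS), i.e. the sum of squared distances from each point to its assigned centroid. With three well-separated blobs, the drop in WCSS flattens after K = 3, which is the "elbow".

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs in 2-D (toy data)
centers = ([0, 0], [5, 5], [0, 5])
points = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in centers])

# WCSS for K = 1..6
wcss = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit(points)
    wcss[k] = km.inertia_

# Recompute the WCSS by hand for K = 3 and compare with inertia_
km3 = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42).fit(points)
manual = ((points - km3.cluster_centers_[km3.labels_]) ** 2).sum()

print(wcss)
print(manual, km3.inertia_)
```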

In [None]:
# YOUR CODE HERE
# Hint: Experiment with different range of clusters until a rapid decline is found at a point. for eg: (2, 20)
from sklearn.cluster import KMeans
wcss = []
for i in range(2,20):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state= 42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(2,20), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('wcss')
plt.show()

### 2 Marks - > Train the k-Means model with the arrived optimal number of clusters.

1. Initialize the k-Means with optimal K value.
2. Fit the k-Means model with the feature vectors.
3. Predict the labels (i.e., clusters) of the feature vectors. 
4. Add the predicted labels to the existing train dataframe.
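
The four steps above can be sketched on a toy dataframe as follows. The sentences, the column name `abstract`, and `n_clusters=2` are illustrative assumptions; in the notebook the features come from the pre-processed abstracts and K comes from the elbow plot.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy abstracts (assumption): two about tuberculosis, two about cancer
toy_df = pd.DataFrame({"abstract": [
    "tuberculosis lung infection treatment",
    "tuberculosis infection drug resistance",
    "cancer tumor gene therapy",
    "cancer tumor immune therapy",
]})

feats = TfidfVectorizer().fit_transform(toy_df["abstract"])

# Steps 1-3: initialize with the chosen K, fit, and predict a label per row
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(feats)

# Step 4: store the labels as a new column (a plain column name, not a list of names)
toy_df["cluster"] = labels

print(toy_df)
print(toy_df["cluster"].value_counts())
```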

In [None]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=11 , init='k-means++', random_state= 42) # k-means++ avoids the random initialization trap
y_pred_kmeans = kmeans.fit_predict(X) # cluster label assigned to each abstract

In [None]:
# pickle the model
# sklearn.externals.joblib was removed from scikit-learn; import joblib directly
import joblib
joblib.dump(kmeans,  'abstract_clusters.pkl')

In [None]:
# extract list clusters
clusters = kmeans.labels_.tolist()

In [None]:
print((clusters))

In [None]:
# assign each cluster label to the corresponding abstract
abstract_df['cluster'] = clusters

In [None]:
#number of abstracts per cluster (clusters from 0 to 10)
abstract_df['cluster'].value_counts() 


In [None]:
abstract_df.to_csv('allclusters.csv')

In [None]:
allclusters = pd.read_csv('allclusters.csv')

In [None]:
print(list(allclusters))

In [None]:
print(len(y_pred_kmeans)) # number of predicted cluster labels (one per abstract)

In [None]:
# Visualize kmeans on TFIDF features, projected to 2-D
# (TruncatedSVD is used here because it works directly on the sparse TF-IDF matrix)
from sklearn.decomposition import TruncatedSVD

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print(centroids)
print(labels)

svd = TruncatedSVD(n_components=2, random_state=42)
points_2d = svd.fit_transform(X)
centroids_2d = svd.transform(centroids)

# One color per cluster label; centroids are drawn as black triangles
plt.scatter(points_2d[:, 0], points_2d[:, 1], c=labels, cmap='tab20', s=10)
plt.scatter(centroids_2d[:, 0], centroids_2d[:, 1], marker='^', s=300, c='#000000')
plt.show()

### 4 Marks - > Visualize the top frequent words in any 2 clusters' abstracts, using a [word cloud](https://programmerbackpack.com/word-cloud-python-tutorial-create-wordcloud-from-text/) approach. 

#### This will allow you to identify the research areas in the different clusters, based on the most frequently occurring words

1. Combine all the abstracts of each chosen cluster.
2. Generate and display the word cloud of the chosen clusters.
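
Before drawing a word cloud, it can help to look at the raw word counts per cluster, since a word cloud is essentially a picture of such frequencies. The sketch below is a dependency-light alternative using `collections.Counter`, not the `wordcloud` approach itself; the toy dataframe and the helper name `top_words` are assumptions.

```python
from collections import Counter

import pandas as pd

# Toy data (assumption) with the same column names the notebook uses
toy_df = pd.DataFrame({
    "lemmatized_processed": [
        "tuberculosis drug trial",
        "tuberculosis vaccine",
        "cancer gene therapy",
        "cancer drug",
    ],
    "cluster": [0, 0, 1, 1],
})

def top_words(df, cluster_id, n=3):
    """Return the n most frequent words in the abstracts of one cluster."""
    # Step 1: combine all abstracts of the chosen cluster into one string
    text = " ".join(df.loc[df["cluster"] == cluster_id, "lemmatized_processed"])
    # Step 2: count how often each word occurs
    return Counter(text.split()).most_common(n)

print(top_words(toy_df, 0))
print(top_words(toy_df, 1))
```

If the `wordcloud` package is installed, the same counts can be passed to `WordCloud().generate_from_frequencies(dict(top_words(toy_df, 0, n=50)))` to draw one cloud per cluster.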


In [None]:
#YOUR CODE HERE (first) select the abstracts that belong to clusters 6 and 2
#!pip install wordcloud
filterinfDataframe = abstract_df[(abstract_df['cluster'] == 6) | (abstract_df['cluster'] == 2) ]
filterinfDataframe.to_csv('sixth_second_clusters.csv')

In [None]:
#number of abstracts on clusters 6 and 2 
filterinfDataframe['cluster'].value_counts() 

In [None]:
# (second) Generate and display the word cloud of the 6 and 2 clusters
from wordcloud import WordCloud
field = ['lemmatized_processed']
cluster_6_2_df = pd.read_csv('sixth_second_clusters.csv', usecols=field)


In [None]:
wordcloud2 = WordCloud().generate(' '.join(cluster_6_2_df['lemmatized_processed']))
# Generate plot
plt.figure(figsize = (40,6))
plt.imshow(wordcloud2, interpolation='nearest')
plt.axis('off')  # hide the axis ticks around the image
plt.show()

## **Stage 4:**  Deriving Insights


### 1 Mark - > List the PI names of each cluster
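
One way to list the PI names of each cluster is a `groupby` on the cluster label. The sketch below uses a made-up dataframe; the names are hypothetical, and only the column names `PI Name` and `cluster` follow the notebook.

```python
import pandas as pd

# Hypothetical investigators and cluster labels
toy = pd.DataFrame({
    "PI Name": ["A, Alice", "B, Bob", "C, Carol", "A, Alice"],
    "cluster": [0, 1, 0, 0],
})

# Unique PI names per cluster, in order of first appearance
pi_by_cluster = toy.groupby("cluster")["PI Name"].unique()

for cluster_id, names in pi_by_cluster.items():
    print(cluster_id, list(names))
```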

In [None]:
print(allclusters['PI Name'] )
print( allclusters['cluster']  )

In [None]:
data ={'PI Name': allclusters['PI Name'], 'Cluster': allclusters['cluster']}

In [None]:
cluster_PI_Name = pd.DataFrame(data )
print(cluster_PI_Name.head())

In [None]:
# YOUR CODE HERE
filterinfDataframe['cluster'].values

### 2 Marks - > Predict the label (cluster) for the given search item



*   Get the vectors of the search item by transforming with TfidfVectorizer or Doc2Vec

*   Predict the label of the search item using k-Means model.
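
The key point in these two steps is that the search item must be transformed with the vectorizer that was already fitted on the training abstracts, so that it lands in the same feature space as the k-Means centroids. A minimal sketch on toy sentences (all data here is assumed for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training abstracts (assumption)
train_docs = [
    "tuberculosis lung infection treatment",
    "tuberculosis infection drug resistance",
    "cancer tumor gene therapy",
    "cancer tumor immune therapy",
]

vec = TfidfVectorizer()
train_features = vec.fit_transform(train_docs)
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42).fit(train_features)

# Transform only: calling fit or fit_transform again would build a new, incompatible vocabulary
query = ["new tuberculosis drug treatment"]
query_vec = vec.transform(query)
predicted = km.predict(query_vec)

print(query_vec.shape, train_features.shape)
print(predicted)
```

Words that were not seen during fitting (here "new") are simply ignored by `transform`.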

In [None]:
search_item = ["""Approximately 20 million people globally are infected with tuberculosis, and about 1.5 million people die of the disease annually, i.e. one death every 20 seconds. Currently, tuberculosis of the lungs is treated with four drugs ethambutol, isoniazid, rifampicin, and pyrazinamide daily for the first two months, followed by the two drugs isoniazid and rifampicin for the next four months. This drug combination is recommended by the World Health Organisation and is used in most countries of the world.
                The combination is highly effective if taken properly, but despite this about 15% patients worldwide are not cured. Factors such as patients not completing the course, missing multiple doses, or taking (or being prescribed) the wrong dose contribute to treatment failure. Although the drugs are free to patients, there is a substantial cost, in terms of time and administration, to both the patient and the treatment services. A recent study by Gospodarevskaya et al (Int J Tub Lung Dis. 18: 810-817) has found that patients have to terminate productive/economic activities and are often forced to borrow money and/or sell assets to cover cost of treatment, which can amount to more than three-quarters of patients' income, in the last 2 months of treatment. Reducing the duration of treatment should increase the number of people who successfully complete treatment and reduce the cost to them.
                A reduction could be achieved in one of two ways: using combinations of the new drugs currently under development, or by using the currently available drugs more effectively. Given the enormous cost and long time required to develop new drugs the second option is attractive. Increasing the dose of one of the currently available drugs may allow the duration of treatment to be shortened in the very near future.
                Three recently published Phase III trials (RIFAQUIN, ReMOX, OFLOTUB) have failed to demonstrate that treatment shortening can be achieved with the quinolones. Thus, the rifamycins offer the best hope if higher doses can be shown to be safe.
                Rifampicin which is responsible for killing most tuberculosis bacteria, appears to be the best choice since increasing doses of rifampicin increases its ability to kill TB bacilli in vitro and animal studies. A similar result could be obtained in human tuberculosis. However, one concern would be a possible increase in unwanted serious side effects with increasing doses. Liver damage by rifampicin appears to be rare and not connected to dose size. In the RIFATOX Trial, a dose of 1200mg, in 100 patients did not increase its toxicity.
                The central question this trial aims to answer is therefore: does an increase in the dosage of rifampicin allow us to shorten treatment from 6 to 4 months? We are assessing whether giving double or triple the usual dose of rifampicin (1200mg, or 1800mg rather than 600mg daily) is safe and, when given for 4 months only, will result in relapse rates similar to (or better than) those found in the standard 6 month course of treatment. Patients with newly diagnosed tuberculosis of the lung, who agree to participate and have signed a consent form, will receive either the standard 6 month treatment or a 4 month treatment containing the standard drugs but with a double or triple dose of rifampicin. Treatment allocation will be random. The success of treatment in each method will be closely monitored both clinically and by regular microscopic examination of sputum, and the safety of the increased dose of rifampicin will be monitored clinically and with blood tests.
                If the trial is successful, it will lead to a shorter treatment course for pulmonary tuberculosis. The expected consequences would be: more patients completing the course and higher rates of cure, reduction in rates of transmission of tuberculosis with fewer people becoming infected, a reduced cost of treatment for both patients and treatment facilities and, perhaps, a reduction in the emergence of bacterial drug resistance.
                """]

In [None]:
df1 = pd.read_csv('Train_Data.csv')
print(df1.shape)
print(df1.columns)

In [None]:
x = 'dummy'
y = 'dummy'
x1= 'dummy'
y1 = 0



In [None]:
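# Append the search item as a new row, using the next free integer label
new_idx = len(df1)
df1.loc[new_idx, 'Program Title'] = x
df1.loc[new_idx, 'Funding Organization'] = y
df1.loc[new_idx, 'PI Name'] = x1
df1.loc[new_idx, 'Funding Amount - 2015 and later only'] = y1
df1.loc[new_idx, 'Abstracts'] = search_item[0]  # store the string, not the one-element list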

df1.to_csv('searchitem.csv')

In [None]:
print(df1.shape)
print(df1['Abstracts'][6812])


In [None]:
df2 = pd.read_csv('searchitem.csv')
print(df2.shape)
print(df2.columns)
print(df2['PI Name'][6813])
print(df2.tail())

In [None]:
# YOUR CODE HERE
abstract_searchitem_df = pd.read_csv('searchitem.csv')
print(abstract_searchitem_df.head())
print(abstract_searchitem_df.columns)
print(abstract_searchitem_df.shape)

In [None]:
from nltk.tokenize import word_tokenize
word_tokens = []
for sent in search_item:
    print(word_tokenize(sent))
    word_tokens.append(word_tokenize(sent))

In [None]:
print(type(word_tokens))
print(word_tokens)
print(len(word_tokens[0])) 
 

In [None]:
type(word_tokens)

In [None]:
# remove special character
import re
test_str = ' '.join([re.sub('[^a-zA-Z]+', ' ', _) for _ in search_item])

In [None]:
print(type(test_str))
print(test_str)
print(len(test_str[0])) 

In [None]:
# remove stop words
from nltk.corpus import stopwords
stops = set(stopwords.words("english"))     
# test_str holds the cleaned search text from the previous step; compare in lower case
test_str = ' '.join([word for word in test_str.split() if word.lower() not in stops])

In [None]:
# remove words with 1 length
new_test_string = ' '.join([w for w in test_str.split() if len(w)>1])


In [None]:
# convert the above processed string to list
def Convert(string): 
    li = list(string.split(" ")) 
    return li

In [None]:
new_test_list = Convert(new_test_string)
 

In [None]:
print(type(new_test_list))
print(new_test_list)
print(len(new_test_list[8])) 

In [None]:
# YOUR CODE HERE
# Transform the search item with the vectorizer already fitted on the training
# abstracts. Fitting a new TfidfVectorizer here would build a different vocabulary,
# so the k-Means model could not use the resulting vector.
# Note: the search text should go through the same pre-processing as the abstracts.
test_X = vectorizer.transform([new_test_string])
print(test_X.shape)

In [None]:
print(X.shape)

In [None]:
# Load the model
import joblib
kmeans = joblib.load('abstract_clusters.pkl')


In [None]:
#Predict the label of the search item using k-Means model.
y_predict = kmeans.predict(test_X)

In [None]:
print(y_predict)

In [None]:
# Save to df
#abstract_df['y_predict'] = y_predict
abstract_df.to_csv('predict_cluster.csv')
list(abstract_df)

### 2 Marks - > Find the top-10 corresponding **PI Names** from the predicted cluster, which are most relevant to the given search item.

Step 1 : Get the feature vectors of the documents (abstracts) of the above predicted cluster.
      
Hint: Use the indices of the documents that belong to the predicted cluster and get their feature vectors.

Step 2 : Calculate the distance between **search item feature vector** and **predicted cluster feature vectors**.

Hint: Use cdist from scipy for calculating the distance.


Step 3 : Find the top 10 feature vectors that have the least distance from the search item feature vector.

Step 4 : Give the PI Names corresponding to the top 10 feature vectors.
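
Steps 1 to 4 above can be sketched with `cdist` and `argsort`. Everything in this example is synthetic: `cluster_vectors` stands in for the feature vectors of the predicted cluster, `pi_names` holds hypothetical names, and the search vector is deliberately placed next to row 3 so the expected nearest neighbour is known.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

# Step 1 (stand-in): feature vectors and PI names of the abstracts in the predicted cluster
cluster_vectors = rng.random((25, 8))
pi_names = np.array([f"PI_{i}" for i in range(25)])

# Search item vector, shape (1, n_features); placed close to row 3 on purpose
search_vector = cluster_vectors[3:4] + 0.001

# Step 2: distance from every cluster vector to the search vector
distances = cdist(cluster_vectors, search_vector, metric="euclidean").ravel()

# Step 3: indices of the 10 smallest distances, nearest first
top10_idx = np.argsort(distances)[:10]

# Step 4: the corresponding PI names
top10_names = pi_names[top10_idx]
print(top10_names)
```

`cdist` needs dense arrays, so sparse TF-IDF rows would have to be converted with `.toarray()` first.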

In [None]:
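# the feature vectors of the documents (abstracts) of the above predicted cluster.
# 'df_with_features.csv' was written before clustering, so it has no 'cluster' column;
# 'allclusters.csv' was saved after the labels were added.
df_with_features = pd.read_csv('allclusters.csv')
print(df_with_features.head())
print(df_with_features['cluster'].head())
print(df_with_features['cluster'].tail())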

In [None]:
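# Query vector for the distance computation. X[-1, :] is the last row of the TF-IDF
# matrix; substitute the search item's own TF-IDF vector if it was transformed separately.
search_item_vector = X[-1, :]
print(search_item_vector.shape)
print(search_item_vector)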


In [None]:
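# Feature vectors of every abstract assigned to the predicted cluster
cluster_indices = np.where(kmeans.labels_ == y_predict[0])[0]
df_with_features_9 = X[cluster_indices]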

In [None]:
#df_with_features_9 = X[-1, :]
print(df_with_features_9.shape)
print(df_with_features_9)

In [None]:
#features_cluser_9 = df_with_features['cluster'] == 9 
#features_cluser_9.loc[features_cluser_9['cluster'].isin(9)]

In [None]:
# cdist needs dense arrays, so convert the sparse TF-IDF rows first
dx = search_item_vector.toarray()
dy = df_with_features_9.toarray()

In [None]:
from scipy.spatial import distance
dist = distance.cdist(dy, dx, 'euclidean')

In [None]:
type(df_with_features_9)

In [None]:
# scipy.sparse.csr_matrix.toarray(order=None, out=None) returns a dense ndarray
print(df_with_features_9.toarray().shape)



In [None]:
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# euclidean_distances accepts scipy sparse matrices directly, so no conversion is needed
X_testing = search_item_vector
test = df_with_features_9
dist = euclidean_distances(test, X_testing)
print(dist)  

In [None]:
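from scipy.spatial import distance_matrix
# distance_matrix needs dense inputs with the same number of columns
distance_matrix(df_with_features_9.toarray(), search_item_vector.toarray())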

In [None]:
import numpy as np
from scipy.spatial import distance



### (Optional): Identify the top funded research investigators most relevant to the search item
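
One possible approach to this optional task is to take the abstracts nearest to the search item, sum the funding per investigator, and sort in descending order. The sketch below assumes the nearest abstracts have already been collected into a dataframe; the names and amounts are made up, and only the column names follow the dataset.

```python
import pandas as pd

FUNDING_COL = "Funding Amount - 2015 and later only"

# Hypothetical nearest abstracts for a search item (one row per funded project)
nearest = pd.DataFrame({
    "PI Name": ["A, Alice", "B, Bob", "C, Carol", "A, Alice", "D, Dan"],
    FUNDING_COL: [100.0, 300.0, 200.0, 50.0, 10.0],
})

# Total funding per investigator, highest first, keeping the top 3
top_funded = (
    nearest.groupby("PI Name", as_index=False)[FUNDING_COL]
    .sum()
    .sort_values(FUNDING_COL, ascending=False)
    .head(3)
)
print(top_funded)
```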

In [None]:
# YOUR CODE HERE