# Mission field analysis

We start the analisys of the enriched GTR dataset (see `01` and `02` notebooks)

## Load data

In [1]:
projects = pd.read_csv('../data/processed/22_1_2019_projects_clean.csv')

### Word embeddings query / Clio query?

To keep things simple, we will train a w2v model, identify synonyms for a set of seed terms and query the data for those

In [9]:
# %load lda_pipeline.py
from gensim import corpora, models
from string import punctuation
from string import digits
import re
import pandas as pd
import numpy as np
from gensim.models import Word2Vec


#Characters to drop
drop_characters = re.sub('-','',punctuation)+digits

#Stopwords
from nltk.corpus import stopwords

stop = stopwords.words('English')

#Stem functions
from nltk.stem import *
stemmer = PorterStemmer()

def flatten_list(a_list):
    return([x for el in a_list for x in el])


def clean_tokenise(string,drop_characters=drop_characters,stopwords=stop):
    '''
    Takes a string and cleans (makes lowercase and removes stopwords)
    
    '''
    

    #Lowercase
    str_low = string.lower()
    
    
    #Remove symbols and numbers
    str_letters = re.sub('[{drop}]'.format(drop=drop_characters),'',str_low)
    
    
    #Remove stopwords
    clean = [x for x in str_letters.split(' ') if (x not in stop) & (x!='')]
    
    return(clean)


class CleanTokenize():
    '''
    This class takes a list of strings and returns a tokenised, clean list of token lists ready
    to be processed with the LdaPipeline
    
    It has a clean method to remove symbols and stopwords
    
    It has a bigram method to detect collocated words
    
    It has a stem method to stem words
    
    '''
    
    def __init__(self,corpus):
        '''
        Takes a corpus (list where each element is a string)
        '''
        
        #Store
        self.corpus = corpus
        
    def clean(self,drop=drop_characters,stopwords=stop):
        '''
        Removes strings and stopwords, 
        
        '''
        
        cleaned = [clean_tokenise(doc,drop_characters=drop,stopwords=stop) for doc in self.corpus]
        
        self.tokenised = cleaned
        return(self)
    
    def stem(self):
        '''
        Optional: stems words
        
        '''
        #Stems each word in each tokenised sentence
        stemmed = [[stemmer.stem(word) for word in sentence] for sentence in self.tokenised]
    
        self.tokenised = stemmed
        return(self)
        
    
    def bigram(self,threshold=10):
        '''
        Optional Create bigrams.
        
        '''
        
        #Colocation detector trained on the data
        phrases = models.Phrases(self.tokenised,threshold=threshold)
        
        bigram = models.phrases.Phraser(phrases)
        
        self.tokenised = bigram[self.tokenised]
        
        return(self)
        
        
        
        

class LdaPipeline():
    '''
    This class processes lists of keywords.
    How does it work?
    -It is initialised with a list where every element is a collection of keywords
    -It has a method to filter keywords removing those that appear less than a set number of times
    
    -It has a method to process the filtered df into an object that gensim can work with
    -It has a method to train the LDA model with the right parameters
    -It has a method to predict the topics in a corpus
    
    '''
    
    def __init__(self,corpus):
        '''
        Takes the list of terms
        '''
        
        #Store the corpus
        self.tokenised = corpus
        
    def filter(self,minimum=5):
        '''
        Removes keywords that appear less than 5 times.
        
        '''
        
        #Load
        tokenised = self.tokenised
        
        #Count tokens
        token_counts = pd.Series([x for el in tokenised for x in el]).value_counts()
        
        #Tokens to keep
        keep = token_counts.index[token_counts>minimum]
        
        #Filter
        tokenised_filtered = [[x for x in el if x in keep] for el in tokenised]
        
        #Store
        self.tokenised = tokenised_filtered
        self.empty_groups = np.sum([len(x)==0 for x in tokenised_filtered])
        
        return(self)
    
    def clean(self):
        '''
        Remove symbols and numbers
        
        '''
        
        
        
    
        
    def process(self):
        '''
        This creates the bag of words we use in the gensim analysis
        
        '''
        #Load the list of keywords
        tokenised = self.tokenised
        
        #Create the dictionary
        dictionary = corpora.Dictionary(tokenised)
        
        #Create the Bag of words. This converts keywords into ids
        corpus = [dictionary.doc2bow(x) for x in tokenised]
        
        self.corpus = corpus
        self.dictionary = dictionary
        return(self)
        
    def tfidf(self):
        '''
        This is optional: We extract the term-frequency inverse document frequency of the words in
        the corpus. The idea is to identify those keywords that are more salient in a document by normalising over
        their frequency in the whole corpus
        
        '''
        #Load the corpus
        corpus = self.corpus
        
        #Fit a TFIDF model on the data
        tfidf = models.TfidfModel(corpus)
        
        #Transform the corpus and save it
        self.corpus = tfidf[corpus]
        
        return(self)
    
    def fit_lda(self,num_topics=20,passes=5,iterations=75,random_state=1803):
        '''
        
        This fits the LDA model taking a set of keyword arguments.
        #Number of passes, iterations and random state for reproducibility. We will have to consider
        reproducibility eventually.
        
        '''
        
        #Load the corpus
        corpus = self.corpus
        
        #Train the LDA model with the parameters we supplied
        lda = models.LdaModel(corpus,id2word=self.dictionary,
                              num_topics=num_topics,passes=passes,iterations=iterations,random_state=random_state)
        
        #Save the outputs
        self.lda_model = lda
        self.lda_topics = lda.show_topics(num_topics=num_topics)
        

        return(self)
    
    def predict_topics(self):
        '''
        This predicts the topic mix for every observation in the corpus
        
        '''
        #Load the attributes we will be working with
        lda = self.lda_model
        corpus = self.corpus
        
        #Now we create a df
        predicted = lda[corpus]
        
        #Convert this into a dataframe
        predicted_df = pd.concat([pd.DataFrame({x[0]:x[1] for x in topics},
                                              index=[num]) for num,topics in enumerate(predicted)]).fillna(0)
        
        self.predicted_df = predicted_df
        
        return(self)
    

In [7]:
#Create sentence corpus
sentence_corpus = flatten_list([x.split('. ') for x in projects['abstract']])


#Tokenize etc using the classes above
sentence_tokenised = CleanTokenize(sentence_corpus).clean().bigram()

#Also tokenise by documents so we can query them later
corpus_tokenised = CleanTokenize(projects['abstract']).clean().bigram()


In [10]:
#Training W2V
w2v = Word2Vec(sentence_tokenised.tokenised)

In [11]:
with open(f'../models/{today_str}_word_embeddings.p','wb') as outfile:
    pickle.dump(w2v,outfile)

In [12]:
def synonym_chaser(seed_list,model,similarity,occurrences=1):
    '''
    Takes a seed term and expands it with synonyms (above a certain similarity threshold)
    
    '''
    
    #All synonyms of the terms in the seed_list above a certain threshold
    set_ws = flatten_list([[term[0] for term in model.most_similar(seed) if term[1]>similarity] for seed in seed_list])
    
    #return(set_ws)
    
    #This is the list of unique occurrences (what we want to return at the end)
    set_ws_list = list(set(set_ws))
    
    #For each term, if it appears multiple times, we expand
    for w in set_ws:
        if set_ws.count(w)>occurrences:
            
            #As before
            extra_words = [term[0] for term in model.most_similar(w) if term[1]>similarity]
            
            set_ws_list + extra_words
            
    #return(list(set(set_ws_list)))
    return(set_ws_list)
    

    
def querier(corpus,keywords):
    '''
    Loops over a tokenised corpus and returns the number of hits (number of times that any of the terms appears in the document)
    
    '''
    #Intersection of tokens
    out = [len(set(keywords) & set(document)) for document in corpus]
    
    return(out)
    
    

### AI and Chronic diseases (crude keyword search-based)

In [13]:
ai_expanded = synonym_chaser(seed_list=['machine_learning','artificial_intelligence','deep_learning','ai','machine_vision'],model=w2v,similarity=0.8)
chronic_expanded = synonym_chaser(seed_list=['chronic_disease','chronic'],model=w2v,similarity=0.8)

  
  if np.issubdtype(vec.dtype, np.int):


In [14]:
projects['has_ai'],projects['has_chronic'] = [querier(corpus_tokenised.tokenised,keys) for keys in [ai_expanded,chronic_expanded]]

In [15]:
100*pd.crosstab(projects['has_ai']>0,projects['has_chronic']>0,normalize=1)

has_chronic,False,True
has_ai,Unnamed: 1_level_1,Unnamed: 2_level_1
False,95.769475,97.681704
True,4.230525,2.318296


In [21]:
projects.loc[(projects['has_ai']>0) & (projects['has_chronic']>0)].head()

Unnamed: 0,index,title,year,abstract,status,grant_category,funder,amount,biological_sciences,physics,...,arts_humanities,prods,ip,tech,spin,pubs,has_ai,has_chronic,has_age,has_inclusion
409,01B6A723-34E3-4F13-9318-5257D1FC1D54,Non-invasive assessment and management of coro...,2018.0,The main aim is to improve and test software c...,Active,Fellowship,MRC,284559.0,0.000366,1.280349e-07,...,1.989821e-08,0.0,0.0,0.0,0.0,0.0,1,1,0,0
545,0248EE56-D9BA-4D29-A438-BA2BD05A3168,Micromechanical measurements in living embryos,2013.0,The embryo is a complex system wherein local t...,Closed,Research Grant,BBSRC,585065.0,0.469256,2.409735e-09,...,3.811508e-07,0.0,0.0,2.0,0.0,4.0,1,1,0,0
619,0283F735-409F-49A2-9DEB-2DCF4E8884D5,VIRTUAL REALITY ASSESSMENT AND REHABILITATION ...,2017.0,"The proposed PhD project will use innovative, ...",Active,Studentship,EPSRC,0.0,0.000403,5.355883e-06,...,0.000915907,0.0,0.0,0.0,0.0,0.0,1,1,0,0
2586,0A62A025-8483-4E00-B724-7286B6DF772E,A Universal PAN Architecture for Monitoring Mu...,2008.0,People living with chronic medical conditions ...,Closed,Research Grant,EPSRC,179286.0,0.000147,0.0003143689,...,0.03501907,0.0,0.0,0.0,0.0,0.0,1,1,0,0
3097,0C7B23FD-07CE-420B-B9FE-7C0860B83199,Learning MRI and histology image mappings for ...,2017.0,This project aims to exploit recent advances i...,Active,Research Grant,EPSRC,774254.0,0.00012,1.584342e-05,...,8.204334e-06,0.0,0.0,0.0,0.0,0.0,2,1,0,0


### Ageing and inclusion/inequality (crude keyword search-based)

In [22]:
age_expanded = synonym_chaser(seed_list=['ageing','aging'],model=w2v,similarity=0.8)
inclusion_expanded = synonym_chaser(seed_list=['inclusion','inclusiveness','inclusive','inequality'],model=w2v,similarity=0.8)

  
  if np.issubdtype(vec.dtype, np.int):


In [23]:
projects['has_age'],projects['has_inclusion'] = [querier(corpus_tokenised.tokenised,keys) for keys in [age_expanded,inclusion_expanded]]

In [24]:
pd.crosstab(projects['has_age']>0,projects['has_inclusion']>0)

has_inclusion,False,True
has_age,Unnamed: 1_level_1,Unnamed: 2_level_1
False,68628,1727
True,1964,37


In [20]:
projects.loc[(projects['has_age']>0) & (projects['has_inclusion']>0)].head()

Unnamed: 0,index,title,year,abstract,status,grant_category,funder,amount,biological_sciences,physics,...,arts_humanities,prods,ip,tech,spin,pubs,has_ai,has_chronic,has_age,has_inclusion
2370,098AFC90-05C9-4F53-A584-F780EA2BD004,MECHANISM OF INFLAMMATION IN ENVIRONMENTAL ENT...,2018.0,Malnutrition is the greatest barrier to health...,Active,Fellowship,MRC,815863.0,2.7e-05,1.19212e-08,...,1e-06,0.0,0.0,0.0,0.0,0.0,0,1,1,1
2656,0AA6BCA6-50DD-4767-87B2-AFE0D7399233,Enabling Ongoingness: Content Creation &amp; C...,2017.0,The 'oldest old' are the fastest growing age g...,Active,Research Grant,EPSRC,885437.0,2.9e-05,1.483308e-09,...,0.415132,0.0,0.0,0.0,0.0,0.0,0,0,1,1
10048,27E22B78-1328-4874-B4A1-A2BBD48F00D7,Causes of heterogeneity in ageing - the Whiteh...,2010.0,When the Whitehall II study started in 1985 it...,Closed,Research Grant,MRC,2099998.0,0.000966,4.291643e-08,...,0.000847,0.0,0.0,0.0,0.0,944.0,0,1,3,1
10142,2835D915-D062-4B01-89BC-140637C6A54D,How do neighbourhood deprivation and neighbour...,2017.0,How do neighbourhood deprivation and neighbour...,Active,Studentship,ESRC,0.0,0.001661,4.205033e-06,...,0.001017,0.0,0.0,0.0,0.0,0.0,0,0,1,1
10178,28649AB9-D8A5-4017-905A-7A4147D98915,Family Demography and Health in Low- and Middl...,2013.0,Intergenerational relations involve the exchan...,Closed,Research Grant,ESRC,17070.0,0.000142,1.276702e-07,...,0.002074,0.0,0.0,0.0,0.0,14.0,0,0,1,1


Looks like the beginning of an approach

### Next steps

* Integrate with TRL analysis
* Integrate with SDG analysis
* Generate metrics
* Check social media discussion around papers
