The goal of this notebook is to explain how to use spaCy to extract key words from the project name, which could be helpful when clustering the loans.

We begin by importing the data frame from the previous team so that we can start experimenting right away. However, we hope that our approach is flexible enough to remain applicable to our own data frame.

In [2]:
#We import the necessary packages
import pandas as pd
import re
import numpy as np
import datetime
import os.path
import random

DATA_DIR = os.path.abspath('/Users/attiliocastano/Documents/GitHub/amethyst-erdos-may-2021/data')
DF_Extracted=pd.read_csv(os.path.join(DATA_DIR,'Extracted_Attributes_LoanUSD_Country.csv'))

In [3]:
#The features of interest for this notebook are just 'filename', 'Project_name', and 'Project_desc'.
#We first select this sub-data frame, and then we drop the loans whose information was not
#extracted properly (that is, the rows in which any of these features is not a string).

text_features = ['filename', 'Project_name', 'Project_desc']
DF = DF_Extracted[text_features]

#Keep only the rows in which every text feature is a string
for col in text_features:
    DF = DF[DF[col].map(lambda value: isinstance(value, str))]

DF.shape

(2318, 3)
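
The filter above keeps a row only if each text feature is a string. This works because pandas represents a missing CSV cell as NaN, which is a float and not a string. A minimal sketch of the same test on hand-made rows (the rows below are illustrative, not taken from the data set):

```python
# Hypothetical rows: the second one imitates a failed extraction,
# where the missing cell is represented as NaN (a float).
rows = [
    {"Project_name": "forestry sector project"},
    {"Project_name": float("nan")},
]

# Keep a row only if its project name is a string
kept = [row for row in rows if isinstance(row["Project_name"], str)]

print(len(rows), len(kept))
```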

In [4]:
DF.head()

,filename,Project_name,Project_desc
0,1990_april_24_587321468019152780_conformed-cop...,forestry sector project,Description of the Project The objectives of t...
2,1990_april_25_904191468298750561_conformed-cop...,environment management project,Description of the Project The objectives of t...
3,1990_april_30_410811468040573756_conformed-cop...,rural electrification project,Description of the Project The objective of th...
4,1990_april_30_725911468042268845_conformed-cop...,third telecommunications project,Description of the Project The objectives of t...
5,1990_april_5_790191468211457471_conformed-copy...,west mitidja irrigation project,Description of the Project The objectives of t...


We proceed by importing spaCy and the related tools that we will use in this exercise.

In [5]:
import spacy
#spaCy ships with pretrained English pipelines, which will be extremely useful for us.
#One big reason to prefer spaCy over a general-purpose framework such as TensorFlow is that
#spaCy's English pipeline is ready to use, while with TensorFlow we would have to train
#(or find) a model ourselves.
nlp = spacy.load("en_core_web_sm")

from spacy.matcher import Matcher
from spacy import displacy
#displacy is spaCy's built-in visualizer

from spacy.matcher import DependencyMatcher
#Using the visualization we are able to determine a pattern for the Project_name, which we will use to
#extract the essential words. This is what DependencyMatcher is for

We want to understand the basic structure of Project_name, as a first step in clustering the loans into groups. So we begin by visualizing a sample of project names and their structure.

spaCy provides us with helpful tools for this.

In [6]:
for i in DF.sample(3).index:
    #First we transform the string into a 'doc', the main object of spaCy,
    #and we print the original string for convenience
    doc = nlp(DF.loc[i, 'Project_name'])
    print('Project_name: ' + doc.text)
    print()

    #Even though displacy already shows the pos_ and dep_ of each token, we print them
    #to get a little practice with the syntax
    for token in doc:
        print(token.text, token.pos_, token.dep_)

    #Finally we print the visualization
    displacy.render(doc, jupyter = True)
    print('-----')

Project_name: forestry and environment project

forestry NOUN nmod
and CCONJ cc
environment NOUN conj
project NOUN ROOT


-----
Project_name: bam earthquake emergency reconstruction project

bam NOUN compound
earthquake NOUN compound
emergency NOUN compound
reconstruction NOUN compound
project NOUN ROOT


-----
Project_name: health sector reform project

health NOUN compound
sector NOUN compound
reform NOUN compound
project NOUN ROOT


-----


After analysing some of the project names and their structure we conclude the following:

* The word 'project' appears in almost every Project_name
* spaCy uses a tree-like model (a dependency tree) to represent the syntax of the sentence
* The nouns that sit below 'project' in that tree seem to provide a lot of key words relating to the project.

We will extract all the nouns and proper nouns that descend from 'project' in the dependency tree.

In what follows, the page of spaCy's documentation that we rely on most is [Rule-based Matching](https://spacy.io/usage/rule-based-matching), especially the section 'Dependency Matcher'.
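
The pattern used below relies on the `>>` operator, which reaches descendants of 'project' at any depth, whereas `>` would only reach its direct children. The following sketch illustrates the difference on a toy tree written by hand, without spaCy. The head assignments are hypothetical and chosen only for illustration; the real parse of this name may attach the tokens differently.

```python
# Hypothetical head assignments for "health sector reform project".
# Each token maps to its head; the root has no head.
heads = {
    "health": "sector",
    "sector": "reform",
    "reform": "project",
    "project": None,
}

def children(word):
    """Tokens whose head is exactly `word` (the analogue of '>')."""
    return [tok for tok, head in heads.items() if head == word]

def descendants(word):
    """Tokens reachable from `word` through a chain of heads (the analogue of '>>')."""
    found = []
    for child in children(word):
        found.append(child)
        found.extend(descendants(child))
    return found

print(children("project"))     # ['reform']
print(descendants("project"))  # ['reform', 'sector', 'health']
```

With this toy tree, restricting the search to direct children would miss 'sector' and 'health', which is one reason to prefer the descendant operator here.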



In [7]:
#We now describe the pattern we are going to use

pattern = [
    #First we look for the word 'project'; matching on LOWER makes the search case-insensitive
    {
        'RIGHT_ID': 'Root',
        'RIGHT_ATTRS': {'LOWER': 'project'}
    },
    #Then we search for every descendant of 'project' (the '>>' operator) that is a noun or a proper noun
    {
        'LEFT_ID': 'Root',
        'REL_OP': '>>',
        'RIGHT_ID': 'Topic',
        'RIGHT_ATTRS': {'POS': {'IN': ['NOUN', 'PROPN']}}
    }
]

In [8]:
#The matcher does not depend on the document, so we build it once, outside the loop
matcher = DependencyMatcher(nlp.vocab, validate = True)
matcher.add('Key', [pattern])

for i in DF.sample(20).index:
    doc = nlp(DF.loc[i, 'Project_name'])
    print(doc.text)
    matches = matcher(doc)

    #Each match is a pair (match_id, token_ids); token_ids[1] is the token matched by 'Topic'.
    #We collect the nouns and proper nouns found below 'project'
    L = [doc[token_ids[1]].text for match_id, token_ids in matches]

    if len(L) > 0:
        print(L)
    else:
        print('No patterns')

    displacy.render(doc, jupyter = True)

    print('--------')

education sector development project
['education', 'sector', 'development']


--------
investment recovery project
['investment', 'recovery']


--------
transport project general
['transport']


--------
additional financing for the tarbela fourth extension hydropower project
['tarbela', 'extension', 'hydropower']


--------
mining development technical assistance project
['mining', 'development', 'assistance']


--------
northern region transmission project
['region', 'transmission']


--------
public hospital modernization project
['hospital', 'modernization']


--------
leytecebu geothermal project npc
['leytecebu', 'npc']


--------
lifelong learning and training project
['learning', 'training']


--------
coal pilot project
['coal', 'pilot']


--------
basic education quality improvement project
['education', 'quality', 'improvement']


--------
irrigated agriculture intensification ii project
['agriculture', 'intensification', 'ii']


--------
yunnan early childhood education innovation project
['yunnan', 'childhood', 'education', 'innovation']


--------
karnataka municipal reform project
['karnataka', 'reform']


--------
power market development project
['power', 'market', 'development']


--------
additional financing for social protection system project
['protection', 'system']


--------
water resources management modernization project
['water', 'resources', 'management', 'modernization']


--------
second gujarat state highway project gshpii
['gujarat', 'state', 'highway']


--------
highway rehabilitation and maintenance precject
No patterns


--------
support to oportunidades project
['oportunidades']


--------
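
Since the aim is to cluster the loans, one possible next step is to count how often each extracted key word occurs. The sketch below does this for a handful of the key word lists printed above, copied by hand; the variable names are illustrative and this is not part of the original pipeline.

```python
from collections import Counter

# A few of the key word lists printed above, copied by hand
keyword_lists = [
    ['education', 'sector', 'development'],
    ['mining', 'development', 'assistance'],
    ['education', 'quality', 'improvement'],
    ['power', 'market', 'development'],
    ['yunnan', 'childhood', 'education', 'innovation'],
]

# Count every key word across all lists
counts = Counter(word for keywords in keyword_lists for word in keywords)

# Show the most frequent key words first
for word, n in counts.most_common(3):
    print(word, n)
```

Frequent key words such as 'education' or 'development' could then serve as candidate labels or features for the clusters.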
