# A Knowledge-Based Recommendation System for Project Management: Ontology Learning Approach

An ontology can be constructed in one of three ways: manual construction; cooperative construction, which requires human intervention during the construction process; and (semi-)automatic construction, known as the Ontology Learning (OL) approach. OL from text is the process of acquiring knowledge from structured (database), semi-structured (XML file), or unstructured (.txt, PDF, etc.) sources and representing it in a machine-understandable form such as OWL, RDF (Resource Description Framework), or RDFS (Resource Description Framework Schema), by applying a set of methods and techniques (NLP, data mining, and machine learning).
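To make "machine-understandable form" concrete, the following minimal sketch serializes one extracted concept as RDF triples in N-Triples syntax, using only the standard library. The namespace `http://example.org/pm#`, the choice to model a concept as an `owl:Class`, and the helper names `concept_uri` and `concept_to_ntriples` are assumptions made for illustration; they are not taken from the notebook, which imports rdflib for this purpose.

```python
from urllib.parse import quote

# Hypothetical namespace for the project-management ontology (an assumption).
PM = "http://example.org/pm#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"


def concept_uri(name):
    """Build a URI for a concept name, e.g. 'PROJECT CHARTER' -> PM + 'Project_Charter'."""
    return PM + quote(name.strip().title().replace(" ", "_"))


def concept_to_ntriples(name):
    """Return N-Triples lines declaring the concept as an owl:Class with a label."""
    uri = concept_uri(name)
    # Escape backslashes and double quotes so the literal stays valid N-Triples.
    label = name.strip().replace("\\", "\\\\").replace('"', '\\"')
    return [
        f"<{uri}> <{RDF_TYPE}> <{OWL_CLASS}> .",
        f'<{uri}> <{RDFS_LABEL}> "{label}" .',
    ]


for line in concept_to_ntriples("PROJECT CHARTER"):
    print(line)
```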

- Natural Language Processing via NLTK and the spaCy Matcher, with linguistic preprocessing techniques: (1) tokenization and normalization, (2) part-of-speech (POS) tagging with a POS tagger, (3) stop-word removal, (4) lemmatization (stemming), (5) chunking.
- Measures and models used: Levenshtein measure, TF-IDF measure, cosine similarity measure, LDA topic modeling, n-grams.
- Recommendation techniques based on ML: hierarchical clustering, classification, KNN, etc.
- Performance measures: precision, recall, and F-measure.
- Programming: Python, Java.
- Semantic web languages/tools: OWL 2, RDF, and SWRL.
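Two of the measures listed above, the Levenshtein measure and cosine similarity, can be sketched in plain Python as follows. This is an illustrative sketch rather than the notebook's own code: the function names are hypothetical, and the cosine similarity here uses raw word counts instead of TF-IDF weights to keep the example dependency-free.

```python
import math
from collections import Counter


def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute, each cost 1)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the bag-of-words count vectors of two texts."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values()) * sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


print(levenshtein("kitten", "sitting"))
print(cosine_similarity("scope plan", "plan cost"))
```

A small edit distance between two concept names suggests a spelling variant of the same concept, while cosine similarity compares whole descriptions by the words they share.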

# Imports

In [1]:
import sys
import fitz  # PyMuPDF
import pandas as pd
import re
import string

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy import displacy


import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer,SnowballStemmer
from nltk.corpus import wordnet, stopwords
from nltk import pos_tag, RegexpParser

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer 



from rdflib.namespace import DC, DCTERMS, DOAP, FOAF, OWL, RDF, RDFS, SKOS, VOID, XMLNS
from rdflib import URIRef, BNode, Literal, Namespace, Graph
from rdflib.extras import describer
from rdflib.namespace import XSD


#nltk.download('conll2000')
from nltk.corpus import conll2000
from nltk.tag import UnigramTagger, BigramTagger
from nltk.chunk import ChunkParserI

from nltk.corpus import wordnet as wn
from keybert import KeyBERT


#kw_model = KeyBERT(model='all-mpnet-base-v2')


#stop = stopwords.words('english')
#nlp = spacy.load("en_core_web_sm")
#nltk.download('omw-1.4')
#nltk.download('wordnet')
#nltk.download('stopwords')
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')

# Utility Functions For Text Preprocessing

In [2]:
def remove_punctuation(text):
    """
        Remove the punctuation
    """
    return re.sub(r'[]!"$%&\'()*+/:;=#@?[\\^_`{|}~-]+', " ", text)

In [3]:
def remove_number(text):
    """
        Remove the digits
    """
    pattern = r'[0-9]'
    # Match all digits in the string and replace each of them with a space
    new_string = re.sub(pattern, ' ', text)

    return new_string

In [4]:
def remove_non_ascii(text):
    """
        Remove non-ASCII characters 
    """
    return re.sub(r'[^\x00-\x7f]',r' ', text)
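One caveat with dropping non-ASCII characters directly: text extracted from a PDF often contains typographic ligatures (for example the single character U+FB01 in place of "fi"), and replacing such a character with a space splits the word in two. A possible remedy, sketched below under the assumption that ligatures should be expanded rather than removed, is to apply Unicode NFKC normalization first. The function name `normalize_then_remove_non_ascii` is hypothetical.

```python
import re
import unicodedata


def normalize_then_remove_non_ascii(text):
    """Expand compatibility characters (e.g. the 'fi' ligature), then drop what is still non-ASCII."""
    text = unicodedata.normalize('NFKC', text)
    return re.sub(r'[^\x00-\x7f]', ' ', text)


# '\ufb01' is the single-character 'fi' ligature
print(normalize_then_remove_non_ascii('de\ufb01ned'))  # prints "defined"
```

Characters with no ASCII equivalent, such as bullet points, are still replaced by a space.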

In [5]:
def remove_lineBreak(text):
    """
        Remove line break
    """
    return re.sub("\n"," ",text)

In [6]:
def lowerCase(text):
    """
        Transform all the text to lower case
    """
    return text.lower()

In [7]:
def remove_extra_whitespaces_func(text):
    '''
    Removes extra whitespaces from a string, if present
    
    Args:
        text (str): String to which the function is to be applied, string
    
    Returns:
        Clean string without extra whitespaces
    ''' 
    return re.sub(r'\s+', ' ', text).strip()
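The cleaning helpers defined so far are meant to be chained. The sketch below composes the same regular expressions in a single function to show their combined effect; the order of the steps and the name `clean_text` are assumptions, not taken from the notebook.

```python
import re


def clean_text(text):
    """Apply the cleaning steps in sequence (the order is an assumption)."""
    text = re.sub(r'[^\x00-\x7f]', ' ', text)                       # non-ASCII characters
    text = re.sub(r'[]!"$%&\'()*+/:;=#@?[\\^_`{|}~-]+', ' ', text)  # punctuation
    text = re.sub(r'[0-9]', ' ', text)                              # digits
    text = text.replace('\n', ' ').lower()                          # line breaks, case
    return re.sub(r'\s+', ' ', text).strip()                        # extra whitespace


print(clean_text("Plan Scope\nManagement: Inputs & Outputs (PMBOK 6)"))
# prints "plan scope management inputs outputs pmbok"
```

Note that the punctuation pattern does not include periods or commas, so sentence boundaries survive this step and remain available for sentence tokenization.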

In [8]:
def find_title(text):
    '''
    Find the titles in the text with a regex pattern (e.g. 5.1 PLAN SCOPE MANAGEMENT)
    
    Args:
        text (str): String to which the function is to be applied
    
    Returns:
        the list of titles found in the text 
    '''
    return re.findall(r"[0-9].*[A-Z]\n",text)

In [9]:
def get_section(text):
    '''
    Get the section references in a text (e.g. Section 4.1.2)
    
    Args:
        text (str): String to which the function is to be applied
    
    Returns:
        the list of the section references found in the paragraph 
    '''
    return re.findall(r'Section\s[0-9].{1,7}',text)
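Because `.{1,7}` matches any seven characters after the first digit, the pattern in `get_section` can capture trailing text along with the section number. A tighter alternative, sketched below, matches only dot-separated digits; the name `get_section_numbers` is hypothetical and this variant is a suggestion, not the notebook's method.

```python
import re

# One or more digits, followed by any number of ".digits" groups
SECTION_PATTERN = re.compile(r'Section\s\d+(?:\.\d+)*')


def get_section_numbers(text):
    """Return section references such as 'Section 4.1.3.1' without trailing text."""
    return SECTION_PATTERN.findall(text)


sample = "Described in Section 1.2.6. A business document (see Section 4.2.3.1), and"
print(get_section_numbers(sample))
# prints ['Section 1.2.6', 'Section 4.2.3.1']
```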

In [10]:
def get_figure(text):
    '''
    Get the figure captions in a text (e.g. Figure 5-1. Project Scope Management Overview)
    
    Args:
        text (str): String to which the function is to be applied
    
    Returns:
        the list of the figure captions found in the paragraph 
    '''
    return re.findall(r"Figure\s[0-9]-[0-9]{1,2}.\s[A-Z]{1}.*",text)

In [11]:
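def get_def(text):
    '''
    Get the definition from a concept's content: empty if the content starts
    with "Described in Section", otherwise the content without the sentence
    "The inputs, tools and techniques, and outputs ..." and what follows it
    '''
    if re.match('Described in Section', text):
        return ''
    return re.sub('The inputs, tools and techniques, and outputs of this process are depicted in (.*)',' ',text)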

In [12]:
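def get_clean_figure(text):
    '''Remove the text starting at two consecutive figure references (e.g. "Figure 5-2. Figure 5-3 ...")'''
    return re.sub(r'(Figure\s\d-\d{1,2}.\sFigure\s[0-9].*)', ' ', text)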

In [13]:
def get_process_of_concept(df):
    '''
    Get the topic of each concept  (e.g. concept = PROJECT CHARTER -- topic = PLAN SCOPE MANAGEMENT)
    
    Args:
        df (DataFrame): the dataframe with the concept and content columns
    
    Returns:
        the topic column of the dataframe
    '''
    # Rows whose content is a single space are headings that open a new topic
    index = []
    for i in range(0,len(df['content'])) :
        if df['content'][i] == ' ':
            index.append(i)
    # Build the column as a list to avoid chained assignment on the dataframe
    topic = [''] * len(df)
    for j in range(0,len(index)):
        end = index[j+1] if j < len(index)-1 else len(df)
        topic[index[j]:end] = [df['concept'][index[j]]] * (end - index[j])
    df['topic'] = topic
    return df['topic']
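The topic assignment amounts to a forward fill: each row inherits the concept of the most recent heading row. The sketch below shows that logic on plain dictionaries, independent of pandas; the function name `assign_topics` and the sample rows are illustrative assumptions.

```python
def assign_topics(rows):
    """Give each row the concept of the most recent heading row (content == ' ')."""
    current_topic = ''
    topics = []
    for row in rows:
        if row['content'] == ' ':
            current_topic = row['concept']  # a heading row opens a new topic
        topics.append(current_topic)
    return topics


rows = [
    {'concept': '5.1 PLAN SCOPE MANAGEMENT', 'content': 'Plan Scope Management is ...'},
    {'concept': '5.1.1 PLAN SCOPE MANAGEMENT: INPUTS', 'content': ' '},
    {'concept': '5.1.1.1 PROJECT CHARTER', 'content': 'Described in Section 4.1.3.1.'},
]
print(assign_topics(rows))
```

Rows that appear before the first heading keep an empty topic, as in the dataframe version.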

In [14]:
def create_columns_from_each_chapter(text,title,content,figures):
    '''
    Create the dataframe from the text passed
    
    Args:
        text (str): the text from the PMBOK that will be preprocessed
        title (list): unused; the titles are recomputed from the text
        content (list): the list that will receive the content of each title
        figures (list): the list that contains the figures
    Returns:
        A dataframe with the following columns: concept, content, section, figure, type
    '''
    dic = {}
    title = find_title(text)
    for i in range(0,len(title)):
        value = re.sub('\n',' ',title[i])
        title[i] = value.strip()
    
    text = remove_lineBreak(text)
    
    # re.escape keeps characters such as '.' or '(' in a title from being read as regex syntax
    for i in range(0,len(title)):
        if i == (len(title)-1) : 
            content.append(re.findall(re.escape(title[-1])+'(.*)'+'endchapter',text))
            break
        else:
            content.append(re.findall(re.escape(title[i])+'(.*)'+re.escape(title[i+1]),text))
    for i in range(0,len(content)):
        dic[title[i]] = content[i][0]
        
    df = pd.DataFrame(list(dic.items()),columns=['concept','content'])
    
    df['section'] = df['content'].apply(lambda x : get_section(x))
    # Fill the figure column row by row without chained assignment
    figure_col = [None] * len(df)
    for i in range(0,min(len(figures),len(df))):
        figure_col[i] = figures[i]
    df['figure'] = pd.Series(figure_col, index=df.index, dtype=object)
    df['type'] = get_type(df)
    
    return df
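The core of this function is splitting the chapter text at its titles. The sketch below isolates that step with only the standard library. It differs from the notebook in one respect, stated as an assumption: it uses a non-greedy `(.*?)` so each section stops at the first occurrence of the next title. The name `split_by_titles` and the sample text are hypothetical.

```python
import re


def split_by_titles(text, titles, end_marker='endchapter'):
    """Map each title to the text between it and the next title (or the end marker)."""
    sections = {}
    boundaries = titles[1:] + [end_marker]
    for title, boundary in zip(titles, boundaries):
        # re.escape treats dots and parentheses in titles as literal characters
        pattern = re.escape(title) + r'(.*?)' + re.escape(boundary)
        match = re.search(pattern, text, flags=re.DOTALL)
        sections[title] = match.group(1) if match else ''
    return sections


text = ("5.1 PLAN SCOPE MANAGEMENT Plan text. "
        "5.1.1 INPUTS 5.1.1.1 PROJECT CHARTER Charter text. endchapter")
titles = ["5.1 PLAN SCOPE MANAGEMENT", "5.1.1 INPUTS", "5.1.1.1 PROJECT CHARTER"]
print(split_by_titles(text, titles))
```

A heading that is immediately followed by the next title yields a single space as content, which is the marker the topic assignment relies on.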

In [15]:
def get_type(df):
    
    '''
    Get the type of each concept  (e.g. concept = PROJECT CHARTER -- type = inputs of PLAN SCOPE MANAGEMENT)
    
    Args:
        df (DataFrame): the dataframe whose concept column holds the titles
    
    Returns:
        the type column (inputs, tools_and_techniques, or outputs) of the dataframe
    '''
    
    start_input = []
    start_tools_and_techniques = []
    start_output = []
    
    title_list = df['concept'].apply(lambda x : lowerCase(x))
    for i in range(0,len(title_list)):
        if 'inputs' in title_list[i] :
            start_input.append(i)
        if 'tools and techniques' in title_list[i] :
            start_tools_and_techniques.append(i)
        if 'outputs' in title_list[i] :
            start_output.append(i)
    for i in range(0,len(start_input)) :
        if i == len(start_input)-1 :
            df.loc[start_input[-1]:start_tools_and_techniques[-1],'type'] = 'inputs'
            df.loc[start_tools_and_techniques[-1]:start_output[-1],'type'] = 'tools_and_techniques'
            df.loc[start_output[-1]:,'type'] = 'outputs'
            break
        df.loc[start_input[i]:start_tools_and_techniques[i],'type'] = 'inputs'
        df.loc[start_tools_and_techniques[i]:start_output[i],'type'] = 'tools_and_techniques'
        df.loc[start_output[i]:start_input[i+1],'type'] = 'outputs'
    return df['type']

# Import Data From The PDF File

## 1. Retrieve The Text From The PMBOK

In [16]:
# Get the text of each page of the PDF file with the fitz (PyMuPDF) library
fdoc = fitz.open("PMBOK6-2017.pdf")
page = []
for i  in range(0,573):
    page.append(fdoc[i].get_text())  # plain text of page i
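The loop above hard-codes the page count. A slightly more defensive variant, sketched below, bounds the range by the document length. It relies only on `len(doc)`, `doc[i]`, and `page.get_text()`, which PyMuPDF documents provide; the function name `extract_page_texts` is hypothetical.

```python
def extract_page_texts(doc, start=0, stop=None):
    """Return the plain text of pages start..stop-1 of a PyMuPDF-like document."""
    if stop is None or stop > len(doc):
        stop = len(doc)  # never read past the last page
    return [doc[i].get_text() for i in range(start, stop)]
```

With the notebook's file this would be called as `page = extract_page_texts(fitz.open("PMBOK6-2017.pdf"))`.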

## 2. Get the text for each chapter

### 2.1 Initialize Variables

In [17]:
# Lists that hold the page texts of each chapter we work with
scope = []
schedule = []
cost = []

# For each chapter: the titles found, the full chapter text, and the content of each title
scope_title = []
scope_title_without_number = []
scope_text = ' '
scope_content = []

schedule_title = []
schedule_title_without_number = []
schedule_text = ' '
schedule_content = []

cost_title = []
cost_title_without_number = []
cost_text = ' '
cost_content = []

figures_scope = []
figures_schedule = []
figures_cost = []

### 2.2 Project scope management

In [18]:
for i in range(164,207):
    # Strip the distribution notice and the running footer from each page
    p = re.sub(r'Not For Distribution, Sale or Reproduction\.', ' ', page[i])
    p = re.sub(r'[0-9]+\s\sPart 1 - Guide', ' ', p)
    scope.append(p)
    scope_title.append(re.findall(r"[0-9].*[A-Z]\n",p))
    figures_scope.append(re.findall(r"Figure\s[0-9]-[0-9]{1,2}.\s[A-Z]{1}.*",p))

In [19]:
for i in range(0,len(scope)):
    scope_text = scope_text + scope[i]
scope_text = scope_text + ' endchapter'

In [20]:
df_scope = create_columns_from_each_chapter(scope_text,scope_title,scope_content,figures_scope)
df_scope['topic'] = ''
df_scope['topic'] = get_process_of_concept(df_scope)
df_scope['content'] = df_scope['content'].apply(lambda x : remove_extra_whitespaces_func(x))
df_scope['def'] = df_scope['content'].apply(lambda x : get_def(x))
df_scope['ref'] = df_scope['section'] + df_scope['figure']

In [21]:
df_scope.head()

| | concept | content | section | figure | type | topic | def | ref |
|---|---|---|---|---|---|---|---|---|
| 0 | 5.4 Create WBS | —The process of subdividing project deliverabl... | [] | [] | | | —The process of subdividing project deliverabl... | [] |
| 1 | 5.1 PLAN SCOPE MANAGEMENT | Plan Scope Management is the process of creati... | [Section 4.2.3.1), Section 2.3), an, Section 2... | [Figure 5-1. Project Scope Management Overview] | | | Plan Scope Management is the process of creati... | [Section 4.2.3.1), Section 2.3), an, Section 2... |
| 2 | 5.1.1 PLAN SCOPE MANAGEMENT: INPUTS | | [] | [] | inputs | 5.1.1 PLAN SCOPE MANAGEMENT: INPUTS | | [] |
| 3 | 5.1.1.1 PROJECT CHARTER | Described in Section 4.1.3.1. The project char... | [Section 4.1.3.1.] | [] | inputs | 5.1.1 PLAN SCOPE MANAGEMENT: INPUTS | | [Section 4.1.3.1.] |
| 4 | 5.1.1.2 PROJECT MANAGEMENT PLAN | Described in Section 4.2.3.1. Project manageme... | [Section 4.2.3.1., Section 8.1.3.1.] | [] | inputs | 5.1.1 PLAN SCOPE MANAGEMENT: INPUTS | | [Section 4.2.3.1., Section 8.1.3.1.] |


In [22]:
df_scope['def'][1]

'Plan Scope Management is the process of creating a scope management plan that documents how the project and product scope will be deﬁned, validated, and controlled. The key beneﬁt of this process is that it provides guidance and direction on how scope will be managed throughout the project. This process is performed once or at predeﬁned points in the project.  '

In [23]:
df_scope['content'][1]

'Plan Scope Management is the process of creating a scope management plan that documents how the project and product scope will be deﬁned, validated, and controlled. The key beneﬁt of this process is that it provides guidance and direction on how scope will be managed throughout the project. This process is performed once or at predeﬁned points in the project. The inputs, tools and techniques, and outputs of this process are depicted in Figure 5-2. Figure 5-3 depicts the data ﬂow diagram of the process. Figure 5-2. Plan Scope Management: Inputs, Tools & Techniques, and Outputs Figure 5-3. Plan Scope Management: Data Flow Diagram Tools & Techniques Inputs Outputs Plan Scope Management .1 Expert judgment .2 Data analysis • Alternatives analysis .3 Meetings .1 Project charter .2 Project management plan • Quality management plan • Project life cycle description • Development approach .3 Enterprise environmental factors .4 Organizational process assets .1 Scope management plan .2 Requiremen

In [24]:
df_scope[16:30]

| | concept | content | section | figure | type | topic | def | ref |
|---|---|---|---|---|---|---|---|---|
| 16 | 5.2.1.1 PROJECT CHARTER | Described in Section 4.1.3.1. The project char... | [Section 4.1.3.1.] | [] | inputs | 5.2.1 COLLECT REQUIREMENTS: INPUTS | | [Section 4.1.3.1.] |
| 17 | 5.2.1.2 PROJECT MANAGEMENT PLAN | Described in Section 4.2.3.1. Project manageme... | [Section 4.2.3.1., Section 5.1.3.1., Section 5... | [Figure 5-6. Context Diagram] | inputs | 5.2.1 COLLECT REQUIREMENTS: INPUTS | | [Section 4.2.3.1., Section 5.1.3.1., Section 5... |
| 18 | 5.2.1.3 PROJECT DOCUMENTS | Examples of project documents that can be cons... | [Section 4.1.3.2., Section 4.4.3.1., Section 1... | [] | inputs | 5.2.1 COLLECT REQUIREMENTS: INPUTS | Examples of project documents that can be cons... | [Section 4.1.3.2., Section 4.4.3.1., Section 1... |
| 19 | 5.2.1.4 BUSINESS DOCUMENTS | Described in Section 1.2.6. A business documen... | [Section 1.2.6. A] | [] | inputs | 5.2.1 COLLECT REQUIREMENTS: INPUTS | | [Section 1.2.6. A] |
| 20 | 5.2.1.5 AGREEMENTS | Described in Section 12.2.3.2. Agreements can ... | [Section 12.2.3.2] | [Figure 5-7. Example of a Requirements Traceab... | inputs | 5.2.1 COLLECT REQUIREMENTS: INPUTS | | [Section 12.2.3.2, Figure 5-7. Example of a Re... |
| 21 | 5.2.1.6 ENTERPRISE ENVIRONMENTAL FACTORS | The enterprise environmental factors that can ... | [] | [Figure 5-8. Figure 5-9 depicts the data flow ... | inputs | 5.2.1 COLLECT REQUIREMENTS: INPUTS | The enterprise environmental factors that can ... | [Figure 5-8. Figure 5-9 depicts the data flow ... |
| 22 | 5.2.1.7 ORGANIZATIONAL PROCESS ASSETS | The organizational process assets that can inﬂ... | [] | [Figure 5-9. Define Scope: Data Flow Diagram] | inputs | 5.2.1 COLLECT REQUIREMENTS: INPUTS | The organizational process assets that can inﬂ... | [Figure 5-9. Define Scope: Data Flow Diagram] |
| 23 | 5.2.2 COLLECT REQUIREMENTS: TOOLS AND TECHNIQUES | | [] | [] | tools_and_techniques | 5.2.2 COLLECT REQUIREMENTS: TOOLS AND TECHNIQUES | | [] |
| 24 | 5.2.2.1 EXPERT JUDGMENT | Described in Section 4.1.2.1. Expertise should... | [Section 4.1.2.1.] | [] | tools_and_techniques | 5.2.2 COLLECT REQUIREMENTS: TOOLS AND TECHNIQUES | | [Section 4.1.2.1.] |
| 25 | 5.2.2.2 DATA GATHERING | Data-gathering techniques that can be used for... | [Section 4.1.2.2., Section 8.1.2.2.] | [] | tools_and_techniques | 5.2.2 COLLECT REQUIREMENTS: TOOLS AND TECHNIQUES | Data-gathering techniques that can be used for... | [Section 4.1.2.2., Section 8.1.2.2.] |


# Text Preprocessing

# Final DataFrame

# OWL File Creation

# Model Creation 