#### Purpose of this algorithm
* Build an MVP algorithm that answers questions from semi-structured documents such as Word files.
* Note: This algorithm only works if the Word document has structured headings.

#### High level steps of the algorithm
* Reads one or more Word documents from the target folder.
* Loops through each document to find its headings.
* Applies several transformations and cleanup steps, then frames each heading as a question and the content under that heading as the answer.
* Creates the knowledge base from all the available Word documents.
* Builds a search index over all the questions so that it is ready to answer queries.
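
The heading-to-question framing in the steps above can be sketched without python-docx, assuming each paragraph is already available as a `(style_name, text)` pair. The function name `group_by_heading` and the sample data are hypothetical, and this simplified version does not reproduce the numbering or table-of-contents handling of the notebook's own `create_knowledge`.

```python
def group_by_heading(paragraphs):
    """Group (style_name, text) pairs into (heading, answer) tuples."""
    sections = []
    current = None
    for style_name, text in paragraphs:
        if style_name.startswith("Heading"):
            # A heading starts a new section: (heading text, body paragraphs)
            current = (text, [])
            sections.append(current)
        elif current is not None and text.strip():
            # Non-empty body paragraph belonging to the current heading
            current[1].append(text)
    # Keep only headings that actually have body text under them
    return [(heading, " ".join(body)) for heading, body in sections if body]


# Hypothetical sample standing in for the paragraphs of a Word document
sample = [
    ("Title", "Admin Guide"),
    ("Heading 1", "Logging in to the portal"),
    ("Normal", "Open the login page."),
    ("Normal", ""),
    ("Normal", "Enter your credentials."),
    ("Heading 1", "Empty section"),
    ("Heading 2", "User Groups"),
    ("Normal", "Groups are associated with users."),
]

qa_pairs = group_by_heading(sample)
for question, answer in qa_pairs:
    print(question, "->", answer)
```

Text before the first heading and headings without any body are dropped here, which is one possible reading of the cleanup step.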

#### Algorithm
* Uses gensim's tf-idf model to build the search index. As explained in question_answering_structured.ipynb, which algorithm works best depends on the data and the problem at hand; in this case tf-idf worked better than approaches that bring in the semantic or contextual meaning of words and sentences. This may be because most of the question sentences are short and consist mainly of technical terms.
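
To make the idea concrete, here is a minimal pure-Python sketch of tf-idf ranking with cosine similarity. It is not gensim's implementation: the helper names (`normalize`, `build_tfidf`, `rank`) and the toy questions are assumptions for illustration, and the weighting (raw term count times a base-2 log idf, unit-length vectors) is one common choice that gensim's exact scores may differ from.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a sparse {token: weight} vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_tfidf(docs):
    """Return idf weights and one unit-length tf-idf vector per document."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    idf = {t: math.log2(n_docs / df) for t, df in doc_freq.items()}
    vectors = [normalize({t: tf * idf[t] for t, tf in Counter(doc).items()})
               for doc in docs]
    return idf, vectors


def rank(query_tokens, idf, vectors):
    """Rank documents by cosine similarity to the query, best first."""
    counts = Counter(t for t in query_tokens if t in idf)
    q = normalize({t: tf * idf[t] for t, tf in counts.items()})
    scores = [sum(w * vec.get(t, 0.0) for t, w in q.items()) for vec in vectors]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Toy tokenized questions (hypothetical)
docs = [["user", "groups"], ["adding", "user", "groups"],
        ["user", "roles"], ["overview"]]
idf, vectors = build_tfidf(docs)
ranked = rank(["user", "groups"], idf, vectors)
for doc_id, score in ranked:
    print(doc_id, round(score, 3), docs[doc_id])
```

Because short technical headings share few words, an exact token overlap tends to dominate the score, which fits the observation above.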

In [35]:
# Do all the package imports 
import os
#from pprint import pprint
#from tabulate import tabulate

from docx import Document
import glob

from gensim import corpora, models, similarities
import pandas as pd

from nltk.corpus import stopwords
import re

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [7]:
"""
Start creating the knowledge base. The extraction below currently works only for
Word documents whose structure includes headings.
"""
def create_knowledge(doc):
    doc_answers = []
    temp_answers = []
    """
    Loop through the paragraphs of the document. For each heading, accumulate the
    heading and the paragraphs under it to form one entry of the knowledge base.
    
    The heading and its paragraphs are first collected in a temp list. When the next
    heading starts, the temp list is moved into doc_answers and reset for the new
    heading.
    """
    for paragraph in doc.paragraphs:
        # A heading closes the previous section (if any) and starts a new one
        if paragraph.style.name.startswith("Heading"):
            if temp_answers:
                doc_answers.append(temp_answers)
            temp_answers = []
        temp_answers.append(paragraph.text)
    
    """
    Append the last section to doc_answers, since the loop above only flushes a
    section when the next heading is encountered
    """
    doc_answers.append(temp_answers)
    """
    Remove the first element from doc_answers, as it contains the table of contents
    and other junk data with a lot of empty elements
    """
    doc_answers = doc_answers[1:]
    
    """
    Remove all the empty elements from the lists
    """
    doc_answers_1 = []
    for doc_answer in doc_answers:
        t_paragraphs = list(filter(None, doc_answer))
        doc_answers_1.append(t_paragraphs)
    
    """
    Remove the lists that contain just one element (a heading with no content),
    as they do not contribute much to the knowledge base
    """
    doc_answers_2 = []
    for doc_answer in doc_answers_1:
        if(len(doc_answer) > 1):
            doc_answers_2.append(doc_answer) 
    
    """
    Add numbering to the answer. Take the first element of each list as the topic
    or heading, which is later used to build the machine learning model. Create a
    (heading, answer) tuple for each list and return the knowledge for the Word
    document.
    """
    doc_answers_3 = []
    for doc_answer in doc_answers_2:
        # Prefix every paragraph with its position: "(1) heading (2) text ..."
        t_answer = ' '.join('({}) {}'.format(i, answer)
                            for i, answer in enumerate(doc_answer, start=1))
        doc_answers_3.append((doc_answer[0], t_answer.strip()))
    
    return doc_answers_3


In [23]:
# Frame the data path
data_dir = "C:\\srini\\word"
file_extn = "*.docx"
data_path = os.path.join(data_dir, file_extn)

"""
Loop through each word document and accumulate all the headings and their respective 
paragraphs to create the final knowledge base. Create tuples using file name as document
or guide name, topic or heading and its respective paragraphs to create the complete 
knowledge base.
"""
knowledge_base = []
for file_name in glob.glob(data_path):
    doc = Document(file_name)
    knowledge_doc = create_knowledge(doc)
    for topic, answer in knowledge_doc:
        #knowledge_base.append((file_name, topic, answer))
        knowledge_base.append((os.path.basename(file_name), topic, answer))

"""
Create a data frame with final knowledge base to feed into the machine learning model
"""
knowledge_base = pd.DataFrame(knowledge_base, columns =['guide_name', 'question', 'answer']) 


In [45]:
from IPython.core.interactiveshell import InteractiveShell  
InteractiveShell.ast_node_interactivity = "all"

# Sanity check the knowledge base
print('total question and answers framed --> {}' .format(len(knowledge_base)))
knowledge_base.head()
knowledge_base.tail()

total question and answers framed --> 308


Unnamed: 0,guide_name,question,answer
0,IAP_AdminGuide_v5.docx,Overview,(1) Overview (2) This document provides inform...
1,IAP_AdminGuide_v5.docx,Logging in to the portal,(1) Logging in to the portal (2) IAP allows us...
2,IAP_AdminGuide_v5.docx,Expanding/Collapsing the left-pane,(1) Expanding/Collapsing the left-pane (2) The...
3,IAP_AdminGuide_v5.docx,Performing Search on the Portal,(1) Performing Search on the Portal (2) Admins...
4,IAP_AdminGuide_v5.docx,Navigating through Records,(1) Navigating through Records (2) The user ca...


Unnamed: 0,guide_name,question,answer
303,IAP_EventManagementGuide_v5.docx,Advance Search,(1) Advance Search (2) User can search for a p...
304,IAP_EventManagementGuide_v5.docx,Reset,(1) Reset (2) In order to clear all the param...
305,IAP_EventManagementGuide_v5.docx,Maximizing the Message,(1) Maximizing the Message (2) This icon is di...
306,IAP_EventManagementGuide_v5.docx,Knowledge Management,(1) Knowledge Management (2) Knowledge managem...
307,IAP_EventManagementGuide_v5.docx,Ticket Flow,(1) Ticket Flow (2) This feature will allow us...


In [46]:
# Get stopwords from NLTK
stopWords = stopwords.words('english')

In [48]:
# pre process or clean data for each question/record
def clean_data(sentence):
    # Replace separators (slash, hyphen, full stop) with spaces so that the
    # words they join are kept as separate tokens
    sentence = str(sentence)
    sentence = sentence.replace('/', ' ')
    sentence = sentence.replace('-', ' ')
    sentence = sentence.replace('.', ' ')
    # Convert to lowercase and keep only alphanumeric characters and whitespace
    sentence = re.sub(r'[^A-Za-z0-9\s]', r'', sentence.lower())
    sentence = re.sub(r'\n', r' ', sentence)
    # Replace digits with spaces
    sentence = re.sub(r'\d+', ' ', sentence)
    
    # Remove stop words and return the remaining tokens
    return [word for word in sentence.split() if word not in stopWords]

# clean up all the questions with above function
questions_list = knowledge_base.question.map(lambda x: clean_data(x))
# make a list of lists for algorithm feed
questions_list = questions_list.tolist()
# print few questions after clean up
questions_list[10:15]


[['adding', 'user', 'groups'],
 ['modifying', 'user', 'groups'],
 ['de', 'activating', 'user', 'groups'],
 ['activating', 'user', 'groups'],
 ['user', 'roles']]
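
The token `'de'` in the output above comes from splitting hyphenated words before tokenizing. The following stdlib-only sketch shows the same cleanup idea on a few headings; `clean_heading` and the tiny `STOP_WORDS` set are hypothetical stand-ins (the notebook itself uses NLTK's English stop-word list), so results on other text may differ.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list
STOP_WORDS = {"the", "to", "in", "on", "of", "a", "an", "and"}


def clean_heading(sentence):
    """Lowercase, split on separators and digits, drop punctuation and stop words."""
    # Slashes, hyphens, full stops and digits become spaces
    sentence = re.sub(r"[\d/\-.]", " ", sentence.lower())
    # Anything that is not a letter or whitespace is removed
    sentence = re.sub(r"[^a-z\s]", "", sentence)
    return [word for word in sentence.split() if word not in STOP_WORDS]


for heading in ["De-activating User Groups",
                "Expanding/Collapsing the left-pane",
                "Logging in to the portal"]:
    print(heading, "->", clean_heading(heading))
```

Splitting on `/` and `-` keeps both halves of compound headings searchable, at the cost of fragments such as `de`.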

In [49]:
# Build the similarity index model
dictionary = corpora.Dictionary(questions_list)
feature_cnt = len(dictionary.token2id)

corpus = [dictionary.doc2bow(question_corpus) for question_corpus in questions_list]
tfidf = models.TfidfModel(corpus) 
sms_index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features = feature_cnt, num_best=5)


2020-02-25 19:07:59,185 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-02-25 19:07:59,198 : INFO : built Dictionary(270 unique tokens: ['overview', 'logging', 'portal', 'collapsing', 'expanding']...) from 308 documents (total 812 corpus positions)
2020-02-25 19:07:59,204 : INFO : collecting document frequencies
2020-02-25 19:07:59,206 : INFO : PROGRESS: processing document #0
2020-02-25 19:07:59,208 : INFO : calculating IDF weights for 308 documents and 270 features (808 matrix non-zeros)
2020-02-25 19:07:59,212 : INFO : creating sparse index
2020-02-25 19:07:59,214 : INFO : creating sparse matrix from corpus
2020-02-25 19:07:59,216 : INFO : PROGRESS: at document #0
2020-02-25 19:07:59,238 : INFO : created <308x270 sparse matrix of type '<class 'numpy.float32'>'
	with 808 stored elements in Compressed Sparse Row format>


In [51]:
# Build a test question for inference
test_question = "user groups"
q_vector = dictionary.doc2bow(clean_data(test_question.lower()))
tfidf_sims = sms_index[tfidf[q_vector]]

In [52]:
# Print the retrieved questions from the model index
for doc_id, score in tfidf_sims:
    print("confidence score --> {} for question --> {} -- > guide name --> {}" 
          .format(score, knowledge_base.question[doc_id],
                  knowledge_base.guide_name[doc_id]))

confidence score --> 1.0 for question --> User Groups -- > guide name --> IAP_AdminGuide_v5.docx
confidence score --> 0.9425783157348633 for question --> Viewing User Groups -- > guide name --> IAP_AdminGuide_v5.docx
confidence score --> 0.8166521787643433 for question --> Adding User Groups  -- > guide name --> IAP_AdminGuide_v5.docx
confidence score --> 0.8097742795944214 for question --> Activating User Groups  -- > guide name --> IAP_AdminGuide_v5.docx
confidence score --> 0.6644854545593262 for question --> Modifying User Groups  -- > guide name --> IAP_AdminGuide_v5.docx
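
Since the index returns `(position, score)` pairs, a small helper could keep only the matches above a cut-off. This is a sketch, not part of the notebook: `top_matches`, the `min_score` default of 0.7, and the list positions are hypothetical, with scores rounded from the output above.

```python
def top_matches(hits, questions, min_score=0.7):
    """Keep (question, score) pairs whose score reaches min_score."""
    return [(questions[idx], round(float(score), 3))
            for idx, score in hits
            if score >= min_score]


# Illustrative data: positions are made up, scores rounded from the run above
questions = ["User Groups", "Viewing User Groups", "Adding User Groups",
             "Activating User Groups", "Modifying User Groups"]
hits = [(0, 1.0), (1, 0.9426), (2, 0.8167), (3, 0.8098), (4, 0.6645)]

matches = top_matches(hits, questions)
for question, score in matches:
    print(score, question)
```

A suitable threshold would need to be tuned on real queries; the value here is only a placeholder.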


In [53]:
# Print the answers for the retrieved questions
for doc_id, _ in tfidf_sims:
    print("answer from guide --> {} --> answer --> {}" 
          .format(knowledge_base.guide_name[doc_id], 
                  knowledge_base.answer[doc_id]))

answer from guide --> IAP_AdminGuide_v5.docx --> answer --> (1) User Groups (2) Groups will be associated with users and it will be associated with Business service. To gain access to the business service, the user has to be part of the group which is associated with the Business Service.
answer from guide --> IAP_AdminGuide_v5.docx --> answer --> (1) Viewing User Groups (2) This page lists all the existing active groups records.   (3) To View Groups, On the Admin home page, under General Settings, click Groups. Once clicked, the Groups page is displayed. The Group page is loaded with Active tab selected by default. The groups homepage displays groups information in a tabular format. The available columns are: Groups, Description and Actions. (4) 
 (5) View Groups
answer from guide --> IAP_AdminGuide_v5.docx --> answer --> (1) Adding User Groups  (2) Admins can add new groups to the active list of group records. (3) To Add Groups: On the Admin homepage, click Groups. The New Group page