# Overview
This program reads process data from a CSV file with the following columns:
Process Task, Process Category, Process, Inputs1, Inputs2, Inputs3, Inputs4, Inputs5, Inputs6, Inputs7, Inputs8, T&T1, T&T2, T&T3, T&T4, T&T5, T&T6, T&T7, T&T8, T&T9, Outputs1, Outputs2, Outputs3, Outputs4, Outputs5, Outputs6, Outputs7, Outputs8, Outputs9, Outputs10
For example:
Monitoring and Controlling, Quality, Executing, Project management Plan, Project documents, Organizational process assets, , , , , , Data gathering, Data analysis, Decision making, Data representation, Audits, Design for X, Problem solving, Quality improvement methods, , Quality reports, Test and evaluation documents, Change requests, Project management plan updates, Project documents updates , , , , ,

It then creates an association network that links each process to its process category, its process tasks, and its inputs and outputs.
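
As a rough illustration of that grouping step, the sketch below parses the example row above into its column groups and adds it to a small dictionary-based network. The helper names `parse_row` and `add_to_network`, and the dictionary layout, are assumptions made for this example; they are not the classes used later in the notebook.

```python
import csv
import io

def parse_row(row):
    """Group one CSV row by the header layout above (hypothetical helper).

    Columns 0-2 are task, category, process; 3-10 inputs;
    11-19 tools and techniques (T&T); 20-29 outputs. Empty cells are dropped.
    """
    cells = [cell.strip() for cell in row]
    return {
        "task": cells[0],
        "category": cells[1],
        "process": cells[2],
        "inputs": [c for c in cells[3:11] if c],
        "tools": [c for c in cells[11:20] if c],
        "outputs": [c for c in cells[20:30] if c],
    }

def add_to_network(network, record):
    """Attach one parsed row to its process node (assumed layout: dict of sets)."""
    node = network.setdefault(
        record["process"],
        {"categories": set(), "tasks": set(), "inputs": set(), "outputs": set()},
    )
    node["categories"].add(record["category"])
    node["tasks"].add(record["task"])
    node["inputs"].update(record["inputs"])
    node["outputs"].update(record["outputs"])
    return network

# The example row from the overview, copied as-is.
sample = (
    "Monitoring and Controlling, Quality, Executing, Project management Plan, "
    "Project documents, Organizational process assets, , , , , , Data gathering, "
    "Data analysis, Decision making, Data representation, Audits, Design for X, "
    "Problem solving, Quality improvement methods, , Quality reports, "
    "Test and evaluation documents, Change requests, Project management plan updates, "
    "Project documents updates , , , , ,"
)

row = next(csv.reader(io.StringIO(sample)))
record = parse_row(row)
network = add_to_network({}, record)
print(record["inputs"])
print(sorted(network["Executing"]["outputs"]))
```

Stripping each cell before testing it matters here because the example row has a space after every comma, so an "empty" cell is actually a single space.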

After the inputs and outputs are associated with each process, a second CSV of sample file names is imported and each file name is compared against the inputs and outputs.
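
One simple way to picture that comparison is plain token overlap. The sketch below is not the WordNet-based approach the notebook attempts; it swaps in a Jaccard score over lowercase word tokens, with a naive plural strip standing in for real lemmatization. The function names, the sample file name, and the label list are all hypothetical.

```python
import re

def tokens(text):
    """Lowercase alphabetic tokens with a naive plural strip (assumption, not a real lemmatizer)."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words}

def jaccard(a, b):
    """Size of the intersection divided by size of the union; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_match(filename, labels):
    """Return (score, label) for the input/output label that best overlaps the file name."""
    stem = filename.rsplit(".", 1)[0]  # drop the extension
    name_tokens = tokens(stem)
    scored = [(jaccard(name_tokens, tokens(label)), label) for label in labels]
    return max(scored)

# Illustrative labels taken from the example row; the file name is made up.
labels = ["Quality reports", "Change requests", "Project documents"]
score, label = best_match("2021_quality_report.pdf", labels)
print(score, label)
```

This only works when the file name already contains separators. Run-together names such as useragreement would still need a word-segmentation step first.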

This effort has several procedural problems. More time is needed to create a true "synonym list" for each input/output, because relying on strict grammatical synonyms is not accurate enough. An industry standard for what exactly counts as, for example, Organizational process assets is needed. Additionally, getting hold of real files to evaluate their names was impossible in the time I had available, so I ended up generating fake names. Since there is little repetition between file names, the AI, if it were trained, would be heavily overfit.

If I had more time, I would attempt to put together a probability function based on the number of input/output references in each process. If Project Documents occurs more often in one process than in another, then a file identified as a project document is more likely to be associated with that process than with a less likely one. The AI would then train to min/max these relations.
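
A minimal sketch of what such a probability function might look like, assuming it is just a relative frequency: count how often each input/output label appears under each process, then divide by the label's total count. The function names and the sample rows are invented for illustration and do not come from the real data set.

```python
from collections import Counter, defaultdict

def label_counts(rows):
    """rows: iterable of (process, labels) pairs. Returns {label: Counter({process: n})}."""
    counts = defaultdict(Counter)
    for process, labels in rows:
        for label in labels:
            counts[label.strip().lower()][process] += 1
    return counts

def process_probabilities(counts, label):
    """Estimate P(process | label) as the share of the label's references in each process."""
    by_process = counts.get(label.strip().lower(), Counter())
    total = sum(by_process.values())
    return {p: n / total for p, n in by_process.items()} if total else {}

# Made-up rows: "Project documents" is referenced three times under Planning, once under Executing.
rows = [
    ("Planning", ["Project documents", "Project charter"]),
    ("Planning", ["Project documents"]),
    ("Planning", ["Project documents"]),
    ("Executing", ["Project documents", "Change requests"]),
]
probs = process_probabilities(label_counts(rows), "Project documents")
print(probs)
```

With these rows, a file identified as a project document would be assigned to Planning with an estimated probability of 0.75. A trained model could use such frequencies as a prior rather than as the final answer.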

Links to the GitHub repository and the Docker image are below:

GitHub:
https://github.com/Staizer/module_14_System_project.git

Docker:
https://hub.docker.com/r/staizer/module14_system_project

In [1]:
#import modules

import nltk #intended to do some AI computation on text
from nltk.stem.wordnet import WordNetLemmatizer as WNL #used to develop the lemmas of file names and input/output types
from nltk.corpus import wordnet as wn #used to find synonyms for file names and input/outputs
import pandas as pd #unused for the moment
from wordsegment import load, segment #since file names often come as a single run-together string (e.g. useragreement),
# I needed a way to segment these strings into possible words. It still segments some strings incorrectly
# (e.g. studentinfo segments to student, in, f, o).

In [None]:
#Generate several classes to create objects for correlation
class Category: #class intended to store category name and associated tasks
    def __init__(self, category):
        self.category = category
        self.tasks = []

    def __str__(self): # returns the category name when the object is printed
        return f"{self.category}"

class Task: #class intended to store task name and associated inputs and outputs
    def __init__(self, task):
        self.task = task
        self.inputs = []
        self.outputs = []

    def __str__(self): # returns the task name when the object is printed
        return f"{self.task}"

class Word: #class intended to store the word and its synonyms
    def __init__(self,word):
        self.word = word
        self.synonyms = []

class Process: #class intended to store the process name, and its inputs, outputs, tasks, and categories
    def __init__(self,process):
        self.process = process
        self.inputs = []
        self.outputs = []
        self.categories = []
        self.tasks = []
        
    def __str__(self): # returns the process name when the object is printed
        return f"{self.process}"

In [None]:
#various functions that will be used to separate and manipulate the data sets
def open_file(file): #used to take in a csv and make a list of fields from each line
    line_list = []
    with open(file, "r") as f:
        lines = f.readlines()
    for line in lines[1:]: #skip the header row
        line = line.strip("\n")
        line_list.append(line.split(',')) #plain split: assumes no field contains a quoted comma
    return line_list
#O(n) where n is the number of lines in the file

#finds the process column and stores all processes in a list, then assigns each process name to a process object
def process(lines): 
    processes = []
    process_list = []
    for i in lines:
        processes.append(i[2])
    
    processes = list(set(processes))
    
    for i in processes:
        P = Process(i)
        process_list.append(P)
    return process_list
#O(lines)


#finds the category column and stores all categories in a list, then assigns each category name to a category object
def category(lines):
    categories = []
    category_list = []
    for i in lines:
        categories.append(i[1])
 
    categories = list(set(categories))
    for i in categories:
        C = Category(i)
        category_list.append(C)
    return category_list
#O(lines)

#finds the task column and stores all tasks in a list, then assigns each task name to a task object
def tasks(lines):
    task_names = []
    task_list = []
    for i in lines:
        task_names.append(i[0])
    #remove duplicate names but keep file order; a set of Task objects would not remove anything
    task_names = list(dict.fromkeys(task_names))
    for i in task_names:
        T = Task(i)
        task_list.append(T)
    return task_list
#O(lines)

#finds the input columns and stores all inputs in a list after cleaning it
def inputs(lines):
    inputs = []
    for i in lines:
        #columns 3-10 hold the inputs; skip cells that are empty or only whitespace
        inputs.append([j.strip() for j in i[3:11] if j.strip() != ''])
    return inputs
#O(lines), since every line has a fixed 8 input columns

#finds the output columns and stores all outputs in a list after cleaning it
def outputs(lines):
    outputs = []
    for i in lines:
        #columns 20-29 hold the outputs; skip cells that are empty or only whitespace
        outputs.append([j.strip() for j in i[20:30] if j.strip() != ''])
    return outputs
#O(lines), since every line has a fixed 10 output columns

#finds the lemmas of each word in a list of phrases
def lemmize(words):
    lemmas = []
    lemmatizer = WNL() #create the lemmatizer once instead of once per token
    wtoken = nltk.word_tokenize(' '.join(words)) #an empty list gives an empty string and no tokens
    for f in wtoken:
        lemmas.append(lemmatizer.lemmatize(f, 'v')) #'v' lemmatizes every token as a verb
    return lemmas
#O(words)

#finds the WordNet synsets (groups of synonyms) for a lemma
def synonyms(word):
    return wn.synsets(word) #a list of Synset objects, empty if WordNet does not know the word
#one WordNet lookup per call

In [None]:
#main part of program, creates various lists to be used later
file = 'Project Management Inputs Outputs and Tasks.csv'
train = 'file_names.csv'
opened = open_file(file) #has 49 lines
open_train = open_file(train) #train has 13000 file names
training = []
for i in open_train:
    for j in i:
        x = j.split('_',5)[-1] #keep the text after the leading underscore-separated fields (at most 5 splits)
        x = x.split('.')[0] #drop the file extension
        training.append(x)
#O(total number of fields in the file), one pass rather than 13000^2
training = sorted(training)

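lines = []
for i in opened:
    lines.append([j for j in i if j != '\n']) #drop any stray newline fields
#O(49*30), one pass over every field
processes = process(lines)
categories = category(lines)
#the next three calls rebind the function names to their results, so this cell
#cannot be re-run without first re-running the function definitions
tasks = tasks(lines)
inputs = inputs(lines)
outputs = outputs(lines)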


In [None]:
#The beginning of my troubles, far too many for loops for the amount of work that I need them to do. The intent here is to
#create the nodal connections between processes, process categories, process tasks, and inputs/outputs
for b,i in enumerate(lines):
    for j in categories:
        if i[1] == j.category:
            j.tasks.append(i[0])
    for j in tasks:
        if i[0] == j.task:
            #index by line number b: inputs and outputs hold one entry per line, not one per task object
            j.inputs = lemmize(inputs[b])
            j.outputs = lemmize(outputs[b])
    for j in processes:
        if i[2] == j.process:
            for k in categories:
                if i[1] == k.category:
                    j.categories.append(k)
            j.tasks.append(i[0])
            j.inputs.append(lemmize(inputs[b]))
            j.outputs.append(lemmize(outputs[b]))
#O(lines*(category+task+process))
    


In [None]:
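#my attempt at generating a list of synonyms for both files. This is where the program finally breaks completely:
#the last nested loop compares every input/output synset with every file-name synset.
load() #wordsegment must load its word data before segment() is called
synonym = []

for line in inputs:
    for word in lemmize(line): #look up single lemmas, since WordNet has no entry for most multi-word phrases
        synonym.append(synonyms(word))
#O(total words across all inputs)
for line in outputs:
    for word in lemmize(line):
        synonym.append(synonyms(word))
#O(total words across all outputs)
train_syn = []
for name in training:
    for word in segment(name): #split the file name into words instead of iterating over its characters
        train_syn.append(synonyms(word))
#O(total words across all file names)
similarities = []
s_one = [s for syn in synonym for s in syn] #flatten into one list of synsets
#O(synonyms)
s_two = [t for syn in train_syn for t in syn]
#O(training synonyms)
for s in s_one:
    for t in s_two:
        similarities.append(s.wup_similarity(t)) #may be None when WordNet finds no common ancestor
print(similarities)
#O(n*m) where n and m are the lengths of s_one and s_two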


If I had more time, I would attempt to rewrite this code, but I fear that any change I make will only make matters worse and won't solve my fundamental problems. I need better data and a more refined list of inputs/outputs to train an AI to identify.