## Data preparation for the ML algorithm

In [1]:
using CSV
using DataFrames
using StringEncodings
using RDatasets 
using BenchmarkTools
using Distributions
using Dates
using SparseArrays, SharedArrays
using Distributed
using Base.Threads
using StatsBase
using LinearAlgebra

## read_data
The function below reads one file: `a` is the path to the data file and `data` is the table that is returned.

In [2]:
# Read one file: a is the path to the data file, data is the table that is returned
function read_data(a)
    f=open(a,"r")
    s=StringDecoder(f,"LATIN1", "UTF-8")
    data= CSV.read(s)
    close(s)
    close(f)
    return data
end
    

read_data (generic function with 1 method)
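
To illustrate what the decoding step in `read_data` does, here is a minimal standard-library sketch. It assumes the input is Latin-1 (ISO-8859-1), where every byte value equals the Unicode code point it encodes. The helper name `latin1_to_utf8` is hypothetical and is not part of `StringEncodings`; `read_data` itself relies on `StringDecoder`.

```julia
# Hypothetical helper: decode Latin-1 bytes into a Julia String (stored as UTF-8).
# In Latin-1 each byte value is the Unicode code point of the character.
latin1_to_utf8(bytes::AbstractVector{UInt8}) = String(map(Char, bytes))

# "Préféré" encoded in Latin-1: the accented e is the single byte 0xE9.
raw = UInt8[0x50, 0x72, 0xE9, 0x66, 0xE9, 0x72, 0xE9]
decoded = latin1_to_utf8(raw)
println(decoded)
```

After decoding, each accented character takes two bytes in UTF-8, which is why the files are converted before they are handed to the CSV parser.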

## create_features

We add one output and one input to this function: 
* The output is the feature vector of a study case.
* The input is the indices of the entries in the controller connection matrix that vary between two study cases. In our project, only the values of the connections SAS_RCSL, PS_PAS, PS_SAS, PS_RCSL and DAS_PAS vary between two different study cases, so this input is equal to [2 5;3 1;3 2;3 5;4 1]. Each pair gives the row and the column of one connection.
<br> All the other outputs and inputs remain the same.
<br>The value of the percentage initialization parameter is related to the complexity of the study case. Hence, we have to extract features that represent this complexity, which depends mainly on three factors:

* The number of non-allocated functions. 
* The number and the values of connections for each non-allocated function. Each function can connect to a certain number of functions, and these connections have certain capacity values. This factor can be represented by:
<br>&nbsp;&nbsp;      1: maximum of these numbers of connections.
<br>&nbsp;&nbsp;      2: minimum of these numbers of connections.
<br>&nbsp;&nbsp;      3: mean of these numbers of connections.
<br>&nbsp;&nbsp;      4: standard deviation of these numbers of connections.
<br>&nbsp;&nbsp;      5: mean of the connection capacity values.
<br>&nbsp;&nbsp;      6: standard deviation of the connection capacity values.

* The values of connections between the controllers: only the values of the connections SAS_RCSL, PS_PAS, PS_SAS, PS_RCSL and DAS_PAS vary between two different study cases. Each of these five connections goes through the following steps:
<br>&nbsp;&nbsp;       1: Calculate the value of the connections between the functions already allocated to controller i and the functions already allocated to controller j. 
<br>&nbsp;&nbsp;       2: Subtract the value calculated in the previous step from the initial connection value between controllers i and j.
<br>&nbsp;&nbsp;       3: Add the result to the features vector of the study case.
<br>These steps are already performed when the remain_connection_ctr matrix is calculated. We just have to add the values of these five connections to the features vector.

<br> The total number of features will be twelve.
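
As a rough illustration of the six connection statistics listed above, the following sketch computes them for toy data. The helper name `connection_stats` and the toy vectors are assumptions made for illustration only. Note that `create_features` derives the maximum and minimum from separate output and input counts, whereas this sketch uses a single total per function.

```julia
using Statistics  # mean, std

# Toy data (assumed): number of connections and summed connection
# capacity for four non-allocated functions.
nb_connections  = [3, 1, 4, 2]
capacity_values = [30.0, 5.0, 42.0, 11.0]

# Hypothetical helper that mirrors the six statistics described above.
function connection_stats(nb, cap)
    return (max_connections  = maximum(nb),
            min_connections  = minimum(nb),
            mean_connections = mean(nb),
            std_connections  = std(nb),   # sample standard deviation
            mean_capacity    = mean(cap),
            std_capacity     = std(cap))
end

stats = connection_stats(nb_connections, capacity_values)
println(stats)
```

Together with the number of non-allocated functions and the five controller connection values, these six numbers make up the twelve features.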

In [17]:
function create_features(m,q,indice_connections)
    F1 = read_data(m); # FunctionsWithDomains
    connection_ctr1 = read_data(q); #controllers_connection

    "convert connection controllers to 2d array"
    connection_ctr = connection_ctr1[1:size(connection_ctr1,1),2:size(connection_ctr1,2)];
    connection_ctr = Matrix(connection_ctr);
    controllers = String.(names(connection_ctr1));  # get the name vector of the controllers
    controllers = controllers[2:length(controllers)];
    #
    #
    nb_ctr = length(controllers);  # nb_ctr: number of controllers
    choixprefere = F1.ChoixPrefere;  # the ChoixPrefere attribute (contains numbers)
    nb_f = length(choixprefere);  # nb_f: number of functions
    allowed_f = F1.DejaAllouee; # the DejaAllouee attribute (contains names of controllers)
    
    # create the connection matrix
    # output1 is a vector of length nb_f that holds the output connections of each function
    # weights_output holds the values of the output connections
    # input1 is a vector of length nb_f that holds the input connections of each function
    # weights_input holds the values of the input connections
    # temp is a vector of length nb_f that holds the number of output connections of each function
    # features is the feature vector of the study case; its length is 12
    
    output1 =  F1.outputN1; 
    weights_output = F1.PoidsOutputN1; 
    input1 = F1.inputN1; 
    weights_input = F1.PoidsInuputN1;
    temp = zeros(Int,nb_f);
    features = zeros(12);
    # max_out and max_in must be defined before the loop below reads them
    max_out = 0;
    max_in = 0;
    
    for i in 1:nb_f
        "calculate maximum of out connections"
        inputs = output1[i];
        if length(inputs)!=2  "output1[i] different from [], it is not empty"
            inputs = inputs[2:length(inputs)-1]; "to remove [] from output[i]"
            inputs = split(inputs,";");
            inputs = [parse(Int, x) for x in inputs];  "convert from string to int"
            temp[i] += length(inputs);   "add output connections"
            if(length(inputs)>max_out)
                max_out = length(inputs);
            end
        end
        "calculate maximum of input connections"
        inputs = input1[i];
        if length(inputs)!=2  "input1[i] is not empty" 
            inputs = inputs[2:length(inputs)-1];
            inputs = split(inputs,";");
            inputs = [parse(Int, x) for x in inputs];
            if(length(inputs)>max_in)
                max_in = length(inputs);
            end
        end
    end
    
    output_f =  zeros(Int,nb_f,max_out);  "create output_f" 
    output_weight = zeros(Int,nb_f,max_out); "create output_weight"
    input_f = zeros(Int,nb_f,max_in);
    input_weight = zeros(Int,nb_f,max_in);
    #
    # this for loop determines the functions connected in both directions to each function and the weights of these connections
    # the first part is for output connections
    # the second part is for input connections
    for i in 1:nb_f
        "output connections"
        inputs = output1[i];
        weight = weights_output[i];
        if length(inputs)!=2
            inputs = inputs[2:length(inputs)-1];
            inputs = split(inputs,";");
            inputs = [parse(Int, x) for x in inputs];
            weight = weight[2:length(weight)-1];
            weight = split(weight,";");
            weight = [parse(Int, x) for x in weight];

            for j in 1:length(inputs)
                output_f[i,j] = inputs[j]+1;
                output_weight[i,j] = weight[j];
            end
        end
        "input connections"
        inputs = input1[i];
        weight = weights_input[i];
        if length(inputs)!=2
            inputs = inputs[2:length(inputs)-1];
            inputs = split(inputs,";");
            inputs = [parse(Int, x) for x in inputs];
            weight = weight[2:length(weight)-1];
            weight = split(weight,";");
            weight = [parse(Int, x) for x in weight];
            for j in 1:length(inputs)
                input_f[i,j] = inputs[j]+1;
                input_weight[i,j] = weight[j];
            end
        end
    end
    #
    #
    G = zeros(Int,nb_f,nb_ctr);
    id1 = zeros(Int,0);
    id2 = zeros(Int,0);
    for i in 1:nb_f 
        groups = allowed_f[i];   # get the names of the controllers allowed for function i
        choix = choixprefere[i]; # get the degrees of the controllers allowed for function i
        choix = choix[2:length(choix)-1];
        choix = split(choix,";");
        choix = [parse(Int, x) for x in choix];
                        
        groups = groups[2:length(groups)-1];
        groups = split(groups,";");
        if (length(choix)==1)  "choix = a ( one number),  function i is allocated"
            id1 = append!(id1,i);
            group = findall(x-> x==groups[1],controllers); "find the index of controller where function i is allocated"
            G[i,group] = choix; 
        "if the function i is not allocated"
        else  
            for j in 1:length(choix)
                "find the index of each controller allowed to function i  and add its degree G[i,:]"
                index_weight = findall(x-> x==groups[j],controllers);
                G[i,index_weight[1]] = choix[j];
            end
            id2 = append!(id2,i);
        end
    end
    # y_value_connections holds the connection values of each non-allocated function
    y_value_connections = zeros(length(id2));
    for i in 1:length(id2)
        " calculation of the output connections values of a non allocated function"
        y_value_connections[i] += sum(output_weight[id2[i],:]);
        " this for loop is to calculate the number and the values of input connections of a non allocated function
          with the allocated functions"
        for j in 1:size(input_f,2)
            if(input_f[id2[i],j]==0)
                break;
            elseif length(findall(x-> x!=0,G[input_f[id2[i],j],1:nb_ctr]))==1
                y_value_connections[i] += input_weight[id2[i],j];
                temp[id2[i]] += 1;
            end
        end
    end
    features[1] = length(id2);
    features[4] = mean(temp[id2]);
    features[5] = std(temp[id2]);
    features[6] = mean( y_value_connections);
    features[7] = std( y_value_connections);
    "connection_G1 is a matrix of dimension nb_ctr by nb_ctr. 
    It contains the values of connections between the allocated functions."
    
    connection_G1 = zeros(Int,nb_ctr,nb_ctr);
    # connection between G1
    
    "the first for loop is to pass through all the functions
    the first if is to verify that the function is allocated
    second for loop is to pass through all the output connections of the allocated function
    second if is to verify that the connected function to function i is not zero and allocated"
    for i in 1:nb_f
        if (length(findall(x-> x!=0,G[i,1:nb_ctr]))==1)
            for j in  1:size(output_f,2) 
                if (output_weight[i,j]!=0) && (length(findall(x-> x!=0,G[output_f[i,j],1:nb_ctr]))==1) 
                        a = findall(x-> x!=0,G[i,1:nb_ctr]); "find the group of function i" 
                        b = findall(x-> x!=0,G[output_f[i,j],1:nb_ctr]); "find the group of connected function"
                        if(a[1]!=b[1])  "if the 2 functions are not in the same controller"
                            connection_G1[a[1],b[1]] += output_weight[i,j];
                        end
                end
            end
        end
    end
    
    # calculate the remaining connection capacity between the controllers
    remain_connection_ctr = connection_ctr - connection_G1;
    for z in 1:size(indice_connections,1)
        features[7+z] = remain_connection_ctr[indice_connections[z,1],indice_connections[z,2]];
    end
  
    type_ctr_connection = zeros(Int,nb_ctr,nb_ctr);
    for i in 1:nb_ctr
        for j in 1:nb_ctr
            if((connection_ctr[i,j] !=0) && (connection_ctr[j,i] !=0))
                type_ctr_connection[i,j] = type_ctr_connection[j,i] = 3;
            elseif ((connection_ctr[i,j] !=0) && (connection_ctr[j,i] ==0))
                    type_ctr_connection[i,j] = 1;
            elseif ((connection_ctr[i,j] ==0) && (connection_ctr[j,i] !=0))
                    type_ctr_connection[i,j] = 2;
            else
                    type_ctr_connection[i,j]  = 0;
            end
        end
    end
    max_out = 0;
    max_in = 0;
    min_out = 1000;
    min_in = 1000;
    for i in 1:length(id2)
        output_connections = findall(x-> x!=0,output_f[id2[i],:])
        if length(output_connections)>max_out
            max_out = length(output_connections)
        end
        if length(output_connections)<min_out
            min_out = length(output_connections)
        end
        input_connections = findall(x-> x!=0,input_f[id2[i],:])
        if length(input_connections)>max_in
            max_in = length(input_connections)
        end
        if length(input_connections)<min_in
            min_in = length(input_connections)
        end
    end
    features[2] = max_out+max_in;
    features[3] = min_out+min_in;
    nb_connections = temp[id2];
    return G,id1,id2,type_ctr_connection,remain_connection_ctr,output_f,output_weight,input_f,input_weight,
           nb_connections,features;
end

create_features (generic function with 1 method)
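
The remaining-capacity step at the end of `create_features` can be illustrated on a toy example. The matrices and the index pairs below are assumed values for three controllers, not data from the project; only the subtraction and the row/column lookup mirror the function.

```julia
# Toy 3-controller example (assumed values).
connection_ctr = [0 10 5;
                  8  0 0;
                  0  6 0]   # initial capacity between the controllers
connection_G1  = [0  4 5;
                  3  0 0;
                  0  0 0]   # capacity already used by the allocated functions

remain_connection_ctr = connection_ctr - connection_G1

# Each row is a (row, column) pair, in the same layout as indice_connections.
indice_connections = [1 2; 2 1; 3 2]
selected = [remain_connection_ctr[indice_connections[z, 1], indice_connections[z, 2]]
            for z in 1:size(indice_connections, 1)]
println(selected)
```

Here the connection from controller 1 to controller 2 starts with a capacity of 10, of which 4 is already used, so 6 remains and becomes one feature value.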

## evaluation
<br>This function takes as input the degree matrix G (built from the ChoixPrefere column of the functionsWithDomains file), the population (pop), the outputs of the create_features function and the upper bound of the study case (up_bound).
<br> It calculates the fitness function of all the individuals and returns:
* <b> y:</b> a vector whose length equals the population size; it contains the value of the fitness function for each individual.
* <b> best_ind:</b> for the individual that has the maximum value of y, it contains the sum of degrees before normalization and penalties, the number of violating functions (violate_f) and the capacity violation (violate_capacity).
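
Assuming the raw degree sums and the two violation measures are already known for each individual, the scoring performed at the end of `evaluation` can be sketched as follows. `penalised_fitness` is a hypothetical helper name and the numbers are toy values.

```julia
# Hypothetical helper reproducing the penalised score computed in evaluation:
# normalised degree minus two penalties, each scaled by (maximum + 1).
function penalised_fitness(degree_sum, violate_f, violate_capacity, up_bound)
    y = degree_sum ./ up_bound
    y = y .- violate_f ./ (maximum(violate_f) + 1)
    y = y .- violate_capacity ./ (maximum(violate_capacity) + 1)
    return y
end

degree_sum       = [80.0, 90.0, 70.0]  # raw sum of degrees per individual (toy values)
violate_f        = [0, 3, 0]           # functions violating connectivity
violate_capacity = [0, 9, 4]           # capacity exceeded between controllers
y = penalised_fitness(degree_sum, violate_f, violate_capacity, 100.0)
best = argmax(y)
println(y, " best individual: ", best)
```

The second individual has the highest raw degree, but both penalties push its score below the others, so the first individual is selected.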


In [4]:

function evaluation(G,pop,type_ctr_connection,remain_connection_ctr,up_bound,output_f,output_weight,
                    input_f,input_weight)
    y = zeros(size(pop,1));
    y1 = zeros(Int,size(pop,1));
    # nb_ctr is the number of controllers
    # violate_f is the number of functions that cannot make all their connections
    # without violating the connection constraints between the controllers in each individual.
    # Its length equals the population size.
    # violate_capacity is the amount by which the connections between the controllers
    # exceed their limit in each individual
    
    nb_ctr = size(remain_connection_ctr,1);
    violate_f = zeros(Int,size(pop,1));
    violate_capacity = zeros(Int,size(pop,1));
    
    "first for loop to visit each individual"
    for i in 1:size(pop,1) 
        capacity_ctr = zeros(Int,nb_ctr,nb_ctr); # connection capacity used between the controllers by individual i
        # second for loop to visit each function in each individual
        for j in 1:size(pop,2)  
            "verify the output connections"
            l = 0;
            if(length(findall(x-> x!=0,G[j,:]))!=1)  "if the function is not allocated"
                inputs = output_f[j,:]; 
                weight = output_weight[j,:];
                for k in 1:length(inputs) "visit the connected functions to function j"
                    if(weight[k]==0)  "means no more connection or there is no connection"
                        break;
                    elseif (pop[i,j]!=pop[i,inputs[k]]) "if j and the connected f are in different controllers"
                        "pop[i,j]: controller of function j
                         pop[i,inputs[k]] controller of function connected to j"
                        
                        capacity_ctr[pop[i,j],pop[i,inputs[k]]] += weight[k];  
                            
                        "the if below is to verify that the 2 functions are in 2 controllers 
                        can be connected in one direction"
                            
                        if((type_ctr_connection[pop[i,j],pop[i,inputs[k]]] == 2) || 
                                (type_ctr_connection[pop[i,j],pop[i,inputs[k]]]== 0))
                                    if(l==0)  # function j has not yet violated the
                                              # connectivity with any of its output-connected functions
                                        l = 1;
                                        violate_f[i] += 1;
                                    end
                            end
                    end   
                end
        
                "verify function j not allocated  can receive data from functions already allocated"
           
                inputs = input_f[j,:];
                weight = input_weight[j,:];
                for k in 1:length(inputs)
                    if (weight[k]==0)
                        break;
                    "if the function that send data to function j is allocated"
                    elseif(length(findall(x-> x!=0,G[inputs[k],:]))==1) 
                            if(pop[i,j]!=pop[i,inputs[k]]) "if the functions are not in the same group"
                                capacity_ctr[pop[i,inputs[k]],pop[i,j]] += weight[k];
                                if((type_ctr_connection[pop[i,j],pop[i,inputs[k]]] == 1) ||
                                    (type_ctr_connection[pop[i,j],pop[i,inputs[k]]]== 0))
                                    "if function j has not violated any constraint of connectivity yet"
                                    if(l==0)
                                        l = 1;
                                        violate_f[i] += 1;
                                    end
                                end
                            end

                    end
                end
            end
            #
            "if function j respect all the connectivity constraints"
            if(l==0)
                y[i] += G[j,pop[i,j]];
            end
        end 
        "end of second loop"  
        "calculate the remaining capacity between the controllers"
        capacity_ctr = remain_connection_ctr - capacity_ctr;
        for k in 1:nb_ctr
            for z in 1:nb_ctr
                "verify if any remain capacity is negative if initially this capacity is not zero"
                if(capacity_ctr[k,z]<0) && (remain_connection_ctr[k,z]!=0)
                    violate_capacity[i] += -capacity_ctr[k,z]
                end
            end
        end
    end
    y1 = y;
    y = y./up_bound;
    y = y - violate_f./(maximum(violate_f)+1);
    y = y - violate_capacity./(maximum(violate_capacity)+1);
    best_indice = argmax(y);
    best_ind = zeros(Int,3);
    best_ind[1] = y1[best_indice];
    best_ind[2] = violate_f[best_indice];
    best_ind[3] = violate_capacity[best_indice];
    return y,best_ind
end

evaluation (generic function with 1 method)

## heuristic_initialization

In [5]:
function heuristic_initialization(G,population_size,initialization_perc,id2,nb_connections)
    pop = zeros(Int,population_size,size(G,1));
    # e is the number of functions that are allocated to the controller with the highest degree among
    # their available controllers.
    # idx holds the indexes of these functions. sortperm sorts in ascending order, so these are the e
    # functions with the fewest connections.
    e = initialization_perc*length(id2);
    e = convert(Int64, round(e, digits=0));
    idx = sortperm(nb_connections)[1:e]; 
    for j in 1:population_size
        for i in 1:e
            a = argmax(G[id2[idx[i]],:]);
            pop[j,id2[idx[i]]] = a[1];
        end
        for i in 1:size(G,1)
            c = findall(x-> x==i,id2[idx]);
            if(length(c)==0)
                b = findall(x-> x!=0, G[i,:]);  # all controllers allowed for function i
                id = rand(1:length(b));
                pop[j,i] = b[id[1]];
            end
        end
    end
    return pop;
end
        

heuristic_initialization (generic function with 1 method)
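
As a small illustration of the greedy part of the initialization, the sketch below selects `e` functions and assigns each one to its highest-degree controller. The matrix `G` and the vectors are toy values chosen for illustration, and the `greedy` dictionary is a hypothetical stand-in for the entries written into `pop`.

```julia
# Toy data (assumed): degrees of 4 functions over 3 controllers;
# a zero means the controller is not allowed for that function.
G = [5 0 2;
     0 3 4;
     1 6 0;
     2 2 7]
id2 = [1, 2, 3, 4]              # all four functions are non-allocated
nb_connections = [4, 1, 3, 2]   # connections per non-allocated function
initialization_perc = 0.5

e = round(Int, initialization_perc * length(id2))
idx = sortperm(nb_connections)[1:e]   # ascending order, as sortperm is called in the function
greedy = Dict(id2[k] => argmax(G[id2[k], :]) for k in idx)
println(greedy)
```

The functions that are not selected here would receive a random controller among those with a non-zero degree.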

### Create CSV file for the features

In [None]:
features_df = DataFrame(case_study=String[],nb_functions_G2=[],max_connections=[],min_connections=[],
                        mean_connections = [],variance_connections = [],mean_val_connect=[],std_val_connect=[],
                        SAS_RCSL=[],PS_PAS=[],PS_SAS=[],PS_RCSL=[],DAS_PAS=[]);
CSV.write("C:\\Users\\AH262855\\Desktop\\Nouveau dossier\\Data_ML\\perc_initialization" *
          "\\features_train_data.csv", features_df);

## Case Study execution

The value of the heuristic initialization lies in its ability to place the algorithm at a point close to the global optimum. So, the best value of the initialization percentage parameter is the one that places the algorithm at the closest point. The closest point is the point that, first, respects all the constraints and, second, has the highest degree. Hence, it is not necessary to run the GA until convergence to determine the label of a study case; we can determine it using only the results obtained directly after the initialization.

Each study case is executed with different values of the target parameter (percentage initialization), and we then choose the best one. However, the results show that the effect of this parameter becomes negative when it exceeds 0.5. To cover all the values between 0 and 0.5, we split this interval into five intervals: [0,0.1], [0.1,0.2], [0.2,0.3], [0.3,0.4] and [0.4,0.5]. The process to get the best value for a study case is as follows:
* Generate 8 random values that lie in one interval.
* Execute the study case using these values.
* Take the mean of the results obtained from these executions.
* Repeat the first three steps for the five intervals.
* The midpoint of the interval that gives the best results is the label of the case.
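
The steps above can be sketched with a toy scoring function in place of the real initialization and evaluation. `toy_score` and `best_interval_label` are hypothetical names, and the score (which peaks near 0.15) is an assumption chosen only to make the example self-contained.

```julia
using Random, Statistics

# Hypothetical stand-in for "initialise with percentage p, then evaluate".
# Assumed toy score that peaks near p = 0.15.
toy_score(p) = 1 - abs(p - 0.15)

function best_interval_label(score, edges; n_samples = 8, rng = Random.default_rng())
    mids  = [(edges[z] + edges[z + 1]) / 2 for z in 1:length(edges) - 1]
    means = zeros(length(mids))
    for z in eachindex(mids)
        lo, hi = edges[z], edges[z + 1]
        samples = lo .+ (hi - lo) .* rand(rng, n_samples)  # uniform draws in [lo, hi)
        means[z] = mean(score.(samples))                   # mean result over the draws
    end
    return mids[argmax(means)]   # midpoint of the best interval is the label
end

label = best_interval_label(toy_score, [0, 0.1, 0.2, 0.3, 0.4, 0.5]; rng = MersenneTwister(1))
println(label)
```

In the notebook the score is not a single number: the mean degree and the two violation measures are stored per interval, and the label is chosen from them in a later step.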



In [22]:
results_modify = zeros(3);
population_size = 100;
perc_ini = [0,0.1,0.2,0.3,0.4,0.5];
indice_connections = [2 5;3 1;3 2;3 5;4 1];
nb_functions =[1000,2000,3000,4000,5000];
indices = [100,-100,200,-200,300,-300,400,-400,500,-500];
for m in 1:length(nb_functions)
    nb_f = nb_functions[m]
    for i in 1:length(indices)
        indice = indices[i]
        b = "_";
        C = "C";
        for k in 0:99
            for x in 0:0
                X = "X$x"
                F2 = "C:\\Users\\AH262855\\Desktop\\Nouveau dossier\\Data_ML\\perc_initialization\\" *
                     "NbFunctions_$nb_f\\NbCommunications_$indice\\functionsWithDomains_F$nb_f$b$C$indice$b$k.csv";
                connection_ctr1 = "C:\\Users\\AH262855\\Desktop\\Nouveau dossier\\Data_ML\\perc_initialization\\" *
                                  "NbFunctions_$nb_f\\NbCommunications_$indice\\ComN1N1_F$nb_f$b$C$indice$b$X$b$k.csv";
                G,id1,id2,type_ctr_connection,remain_connection_ctr,output_f,output_weight,input_f,input_weight,
                nb_connections,features = create_features(F2,connection_ctr1,indice_connections);
                nb_ctr = size(remain_connection_ctr,1);
               up_bound =0;
               for z in 1:size(G,1)
                 up_bound += maximum(G[z,1:nb_ctr]);
               end
                name_case = "F$nb_f$b$C$indice$b$X$b$k";
                features_df = DataFrame(features');
                insert!(features_df, 1, name_case,:case_study); # add the name of the case to the feature vector
                CSV.write("C:\\Users\\AH262855\\Desktop\\Nouveau dossier\\Data_ML\\perc_initialization\\" *
                          "features_train_data.csv", features_df,append=true);
                println("F$nb_f$b$C$indice$b$k = " ,up_bound);
                results_df = DataFrame(perc_ini =[],max = [],violate_f = [],violate_capacity = []);
                CSV.write("C:\\Users\\AH262855\\Desktop\\Nouveau dossier\\Data_ML\\perc_initialization\\Train_Data" *
                          "\\F$nb_f$b$C$indice$b$X$b$k.csv",results_df);
                results = zeros(length(perc_ini)-1,4); 
                for z in 1:(length(perc_ini)-1)
                        results[z,1] = (perc_ini[z] + perc_ini[z+1])/2;
                        runs = zeros(8,3);  # one row per run, so that threads never write to the same entries
                        @inbounds Threads.@threads for j in 1:8
                             temp = rand(Uniform(perc_ini[z], perc_ini[z+1]));
                             pop =heuristic_initialization(G,population_size,temp,id2,nb_connections);
                             y,performance = evaluation(G,pop,type_ctr_connection,
                                                        remain_connection_ctr,up_bound,output_f,output_weight
                                                        ,input_f,input_weight);
                             runs[j,:] = performance;
                        end
                        results[z,2:4] = vec(sum(runs,dims=1)) ./ 8;
                        println("Mean results of 8 iterations of F$nb_f$b$C$indice$b$X$b$k :",
                                    " ",results[z,:]);
                        results_df = DataFrame(results[z,:]'); 
                        CSV.write("C:\\Users\\AH262855\\Desktop\\Nouveau dossier\\Data_ML\\perc_initialization\\" *
                                  "Train_Data\\F$nb_f$b$C$indice$b$X$b$k.csv",results_df,append = true);

                end
            end
        end
        println("end of case study: F$nb_f$b$C$indice");
    end
end


end of case study: F5000_C100
end of case study: F5000_C-100
end of case study: F5000_C200
end of case study: F5000_C-200
end of case study: F5000_C300
end of case study: F5000_C-300
end of case study: F5000_C400
end of case study: F5000_C-400
end of case study: F5000_C500
end of case study: F5000_C-500


### Create CSV file for the output label

In [None]:
df = DataFrame(case_study=String[],best_perc_ini=[])
CSV.write("C:\\Users\\AH262855\\Desktop\\Nouveau dossier\\Data_ML\\perc_initialization\\best_perc_ini.csv",df);

## Calculate best percentage of initialization  for each case

 The new initialization method places the algorithm, most of the time, at a point where all the functions can make their connections. The remaining problem is the violation of the connection capacities between the controllers.
 <br> Each value of the target parameter that places the algorithm at a point where this violation is more than 50 units is eliminated.
 <br> Among the remaining values, the best value is the one that has the highest degree. 
 <br> If all the values are eliminated, the best value is the one that has the lowest violation value. 
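
The selection rule above can be sketched as a small function. The name `pick_best_perc`, the `threshold` keyword and the toy vectors are assumptions made for illustration; the three vectors play the role of the `perc_ini`, `max` and `violate_capacity` columns of a result file.

```julia
# Hypothetical helper implementing the selection rule described above.
function pick_best_perc(perc, degree, violation; threshold = 50)
    ok = findall(v -> v <= threshold, violation)   # values that are not eliminated
    if isempty(ok)
        return perc[argmin(violation)]             # all eliminated: lowest violation wins
    end
    return perc[ok[argmax(degree[ok])]]            # highest degree among the remaining values
end

perc      = [0.05, 0.15, 0.25, 0.35, 0.45]
degree    = [900.0, 950.0, 990.0, 940.0, 910.0]
violation = [10.0, 20.0, 80.0, 30.0, 60.0]
println(pick_best_perc(perc, degree, violation))
```

In this toy case 0.25 has the highest degree but is eliminated because its violation exceeds 50, so 0.15 is returned.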

In [16]:
# counts holds the number of cases in each class.
nb_functions =[2000,3000,4000,5000];
perc_ini = [0.05,0.15,0.25,0.35,0.45];
indices = [-100,200,-200,300,-300,400,-400,500,-500];
counts = zeros(length(perc_ini));
for z in 1:length(nb_functions)
    nb_f = nb_functions[z]
    for i in 1:length(indices)
        indice = indices[i]
        for k in 0:99
            b = "_"; 
            C = "C";
            for x in 0:0
                X = "X$x"
                F = "C:\\Users\\AH262855\\Desktop\\Nouveau dossier\\Data_ML\\perc_initialization\\" *
                    "Train_Data\\F$nb_f$b$C$indice$b$X$b$k.csv"
                F1 = read_data(F)
                name = "F$nb_f$b$C$indice$b$X$b$k";
                best_perc = 0;
                violate_c = F1.violate_capacity
                a = findall(x-> x<=50,violate_c)
                if (length(a)==1)
                    best_perc = F1.perc_ini[a[1]]
                elseif (length(a)==0)
                    e = argmin(violate_c)
                    best_perc = F1.perc_ini[e[1]]
                elseif (length(a)>1)
                    # among the remaining rows only, keep the one with the highest degree
                    d = argmax(F1.max[a])
                    best_perc = F1.perc_ini[a[d]] 
                end
                best_perc  = convert(Float64, round(best_perc, digits=2));
                counts[findall(x-> x==best_perc,perc_ini)[1]] += 1;
                df = DataFrame(case_study=name,best_perc_ini = best_perc);
                CSV.write("C:\\Users\\AH262855\\Desktop\\Nouveau dossier\\Data_ML\\perc_initialization\\" *
                          "best_perc_ini.csv",df,append=true);

            end
        end
    end
end
println(counts)                

[9.0, 82.0, 70.0, 18.0, 1.0]
