# <b>BIG DATA PROJECT - GROUP 1</b> 🚀

+ NGANKAM Paul-Henry
+ BEHLE Ralph
+ DOKOULA Yann
+ NOUMEN Darryl
+ TEMA Gregori
+ TANKWA Jordan

## Summary
---
+ [Context](#context)
+ [Base problem](#base-problem)
    + [Mathematical modeling](#mathematical-modeling)
    + [Algorithmic modeling](#algorithmic-modeling)
    + [Python implementation](#python-implementation)
    + [Tests](#tests)
    + [Statistics & Behavioral study](#statistics--behavioral-study)
+ [Additional constraints](#additionnal-constraints)

## Context
---
This project is set against the global awareness, growing since the 1990s, of the need to reduce energy consumption and greenhouse gas emissions. While some commitments have been made, many scientists consider them insufficient to slow down climate change. The focus is now on changing behaviors, with a shift towards a more circular economy, better transportation modes, and energy-efficient buildings. The ADEME has launched a call for expressions of interest to promote the implementation of new mobility solutions for people and goods, adapted to different types of territories. The Ucac-Icam structure is already active in this domain and has conducted several studies on Intelligent Multimodal Mobility. The present study is a response to the ADEME call and aims to address a major challenge: optimizing the management of resources in transportation logistics, an activity with a significant environmental impact.

The aim of this project is to generate an optimal delivery tour using operational research techniques. We will need to calculate a tour that connects a set of cities, minimizing the distance traveled while respecting the problem's constraints.

## Base problem
---
### Mathematical modeling

#### Vehicle Routing Problem
The <b>Vehicle Routing Problem</b> is a well-known optimization and combinatorial problem that aims to determine the most efficient routes for vehicles that need to visit multiple customers in a single trip, while respecting constraints such as vehicle capacity, time slots, and delivery priorities. 

The basic version of this problem considers a single vehicle, which brings us back to the <b>Travelling Salesman Problem.</b>

#### The Travelling Salesman Problem

In its simplest form (one vehicle), our problem is similar to the famous <b>Travelling Salesman Problem</b> (TSP). Indeed, in both cases, it is a matter of finding a path that connects a set of cities while minimizing the total distance travelled. The main difference with the situation described before is that in the TSP all cities need to be visited <b>exactly once</b>, whereas in our problem only a subset of cities needs to be connected. However, as in the TSP, our problem requires finding a path that allows us to return to the starting point.

#### Optimization & Decisional forms

Consider an undirected weighted graph G = (N, M, w), where N is the set of cities, M is the set of streets connecting the cities, and w is a weight function that gives the length of the street associated with each edge. Let G' = (N', M', w') be a subgraph of G that represents the set of cities concerned by the tour.

+ <b>Optimization problem :</b> Find the shortest Hamiltonian cycle in the subgraph G'. 

+ <b>Associated decision problem :</b> Given a threshold k, is there a Hamiltonian cycle for the delivery round whose total distance travelled is less than or equal to k? 
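To make the two forms concrete, here is a minimal sketch that solves both by exhaustive search on a tiny complete graph. The matrix `w_example` and the function names are hypothetical and only serve as an illustration; exhaustive search is only usable for a handful of cities.

```python
from itertools import permutations


def tour_length(w, tour):
    """Length of the closed tour, including the edge back to the start."""
    return sum(w[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))


def shortest_hamiltonian_cycle(w):
    """Optimization form: exhaustive search over all cycles starting at city 0."""
    n = len(w)
    best_tour, best_len = None, float("inf")
    # City 0 is fixed as the start to avoid counting rotations of the same cycle.
    for perm in permutations(range(1, n)):
        tour = [0, *perm]
        length = tour_length(w, tour)
        if length < best_len:
            best_tour, best_len = tour, length
    return best_tour, best_len


def exists_cycle_at_most(w, k):
    """Decision form: is there a Hamiltonian cycle of length <= k?"""
    return shortest_hamiltonian_cycle(w)[1] <= k


# Hypothetical 4-city complete graph G' with symmetric distances.
w_example = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
best_tour, best_len = shortest_hamiltonian_cycle(w_example)
print(best_tour, best_len)
```

The decision form only answers yes or no for a given threshold k, while the optimization form returns the cycle itself.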
  
#### Integer Linear optimization problem

The Travelling Salesman Problem can be formulated as an Integer linear optimization problem as follows:

##### Decision Variables 

The formulation uses the following indices, parameters and decision variables :
+ i and j, the indices of two cities of an undirected, weighted and complete graph G with n cities
+ $c_{ij}$, a parameter that represents the distance between the two cities i and j in G
+ $x_{ij}$, the binary decision variable that takes value 1 if the salesman travels from city i to city j and 0 otherwise

##### Objective function 

The main aim of the TSP is to minimize the total distance or cost of the tour. This can be translated into the following objective :
$$\min z = \sum_{i=1}^{n}\sum_{j=1}^{n} c_{ij} x_{ij}$$


##### Constraints

+ Each city is left exactly once :

$$\sum_{j=1, j \neq i}^{n} x_{ij} = 1$$  
$\forall$  $i \in \{1,2,...,n\}$  

+ Each city is entered exactly once (together with the previous constraint, this means each city is visited exactly once) :

$$\sum_{i=1, i \neq j}^{n} x_{ij} = 1$$  
$\forall$  $j \in \{1,2,...,n\}$ 

+ Subtour elimination constraints :

These constraints ensure that the solution does not split into subtours, i.e. separate cycles that each cover only a subset of the cities. 

They can be written in the Miller-Tucker-Zemlin form :

$u_{i} - u_{j} + nx_{ij} \leq n-1$

$\forall$  $i, j \in \{2,...,n\}$, $i \neq j$, where $u_{i}$ represents the position of city i in the tour, n the total number of cities, and city 1 is taken as the starting city.
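As an illustration of this formulation, the following sketch builds the binary matrix x from a tour and checks the constraints by hand. It is not a solver: the helper names and the example data are hypothetical, cities are numbered from 0 (so city 0 plays the role of the starting city), and `mtz_feasible` simply tries every ordering of positions, which is only reasonable for very small n.

```python
from itertools import permutations


def tour_to_x(tour):
    """Binary matrix with x[i][j] = 1 if the tour goes directly from i to j."""
    n = len(tour)
    x = [[0] * n for _ in range(n)]
    for pos in range(n):
        x[tour[pos]][tour[(pos + 1) % n]] = 1
    return x


def objective(c, x):
    """Value of z = sum of c_ij * x_ij."""
    n = len(c)
    return sum(c[i][j] * x[i][j] for i in range(n) for j in range(n))


def satisfies_degree_constraints(x):
    """Each city is left exactly once and entered exactly once."""
    n = len(x)
    leaves_once = all(sum(x[i][j] for j in range(n)) == 1 for i in range(n))
    enters_once = all(sum(x[i][j] for i in range(n)) == 1 for j in range(n))
    return leaves_once and enters_once


def satisfies_mtz(x, u):
    """u_i - u_j + n * x_ij <= n - 1 for all i != j, start city 0 excluded."""
    n = len(x)
    return all(
        u[i] - u[j] + n * x[i][j] <= n - 1
        for i in range(1, n)
        for j in range(1, n)
        if i != j
    )


def mtz_feasible(x):
    """Try every assignment of positions 1..n-1 to cities 1..n-1."""
    n = len(x)
    for order in permutations(range(1, n)):
        u = [0] * n
        for position, city in enumerate(order, start=1):
            u[city] = position
        if satisfies_mtz(x, u):
            return True
    return False


# Hypothetical distance matrix and two candidate solutions.
c_example = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
x_tour = tour_to_x([0, 1, 3, 2])  # a single tour through all cities
x_subtours = [  # two disjoint cycles: 0 <-> 1 and 2 <-> 3
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
print(objective(c_example, x_tour), mtz_feasible(x_tour), mtz_feasible(x_subtours))
```

Both candidates satisfy the two degree constraints, but only the single tour admits positions $u_{i}$ that satisfy the subtour elimination constraints, which is exactly why those constraints are needed.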


### Algorithmic modeling
---

#### TSP complexity study

The travelling salesman problem (TSP) is a well-known combinatorial optimization problem that involves finding the shortest possible route that visits a given set of cities and returns to the starting city. Its decision version (is there a tour of length at most k?) is NP-complete, and its optimization version is NP-hard. This means that no polynomial-time exact algorithm is known, so finding an exact solution in a reasonable amount of time is computationally difficult for large problem instances.

To demonstrate that the decision version of the TSP is NP-complete, we need to show that it is in the NP complexity class and that it is at least as hard as any other problem in NP (i.e. NP-hard).

+ First, let's show that the TSP is in the NP complexity class. This means that given a candidate solution (a certificate) to a TSP instance, we can verify its correctness in polynomial time. To do this, we can simply check that the solution is a valid tour that visits each city exactly once and returns to the starting city, and that the total distance traveled is less than or equal to a given threshold. This verification can be done in polynomial time, so the TSP is in NP.

+ Next, let's show that the TSP is at least as hard as any other problem in NP. It is enough to reduce, in polynomial time, one problem that is already known to be NP-complete to the TSP: since every problem in NP reduces to that problem, every problem in NP then reduces to the TSP as well. This means that if we could solve the TSP in polynomial time, we could also solve any other problem in NP in polynomial time. One way to do this is through [a polynomial-time reduction from the Hamiltonian cycle problem to the TSP](https://opendsa-server.cs.vt.edu/ODSA/Books/Everything/html/hamiltonianCycle_to_TSP.html).

In conclusion, the decision version of the TSP is in NP and is at least as hard as any other problem in NP. Therefore, we can conclude that it is NP-complete.
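The two steps of the argument can be sketched in code. The sketch below uses one common way to write the reduction (weight 1 for existing edges, weight 2 for missing ones, threshold k = n); this encoding and the function names are our own illustration and are not taken from the linked page.

```python
from itertools import permutations


def hamiltonian_cycle_to_tsp(n, edges):
    """Build a TSP decision instance (weights, k) from an undirected graph."""
    edge_set = {frozenset(e) for e in edges}
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # Existing edges cost 1, missing edges cost 2.
                w[i][j] = 1 if frozenset((i, j)) in edge_set else 2
    # A tour of length exactly n can only use existing edges.
    return w, n


def verify_certificate(w, k, tour):
    """Polynomial-time check that `tour` is a Hamiltonian cycle of length <= k."""
    n = len(w)
    if sorted(tour) != list(range(n)):
        return False  # some city is missing or visited twice
    length = sum(w[tour[i]][tour[(i + 1) % n]] for i in range(n))
    return length <= k


# A square (cycle on 4 nodes) has a Hamiltonian cycle, a star does not.
w_square, k_square = hamiltonian_cycle_to_tsp(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
w_star, k_star = hamiltonian_cycle_to_tsp(4, [(0, 1), (0, 2), (0, 3)])

print(verify_certificate(w_square, k_square, [0, 1, 2, 3]))
print(any(verify_certificate(w_star, k_star, [0, *p]) for p in permutations([1, 2, 3])))
```

Building the weight matrix and checking a certificate both take a number of steps polynomial in n, whereas the last line, which tries every tour, does not.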

#### Presentation of TSP solving methods

The methods for solving TSP can be classified into three categories : exact, heuristic, and metaheuristic.

Exact methods guarantee to find the optimal solution but are generally limited to relatively small instances. Exact methods include integer linear programming, dynamic programming, exhaustive search, and branch and cut.

Heuristic methods do not guarantee to find the optimal solution but are often faster and can find reasonable quality solutions for larger instances. Heuristic methods include construction algorithms (which build a tour step by step) and local search algorithms (which repeatedly improve an existing tour).

Metaheuristic methods are general-purpose search strategies that can be applied to a wide range of difficult problems, including TSP. Metaheuristic methods include genetic algorithms, ant colony algorithms, simulated annealing algorithms, tabu search algorithms, and swarm optimization algorithms.

Here are some examples of algorithms for each category :

+ <b>Exact methods</b> : Integer linear programming (ILP), branch and cut, branch and price, cutting planes, etc. 
+ <b>Heuristic methods</b> : Nearest neighbor, furthest insertion, 2-opt, 3-opt, Lin-Kernighan, etc. 
+ <b>Metaheuristic methods</b> : Genetic algorithm, ant colony optimization, simulated annealing, tabu search, swarm optimization, etc.
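As an example of the heuristic category, here is a minimal sketch of a construction heuristic (nearest neighbor) followed by a local search (2-opt). The function names and the distance matrix are hypothetical; this is a simple, unoptimized version meant only to show the principle.

```python
def tour_length(w, tour):
    """Length of the closed tour, including the edge back to the start."""
    return sum(w[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))


def nearest_neighbor(w, start=0):
    """Construction heuristic: always move to the closest unvisited city."""
    tour, unvisited = [start], set(range(len(w))) - {start}
    while unvisited:
        last = tour[-1]
        # Ties are broken by the smallest city index.
        nxt = min(unvisited, key=lambda city: (w[last][city], city))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour


def two_opt(w, tour):
    """Local search: reverse segments of the tour while this shortens it."""
    best = tour[:]
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if tour_length(w, candidate) < tour_length(w, best):
                    best, improved = candidate, True
    return best


# Hypothetical 4-city distance matrix.
w_example = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 3],
    [10, 4, 3, 0],
]
nn_tour = nearest_neighbor(w_example)
improved_tour = two_opt(w_example, [0, 2, 1, 3])
print(nn_tour, tour_length(w_example, nn_tour))
print(improved_tour, tour_length(w_example, improved_tour))
```

Neither function guarantees an optimal tour in general; they only return a tour quickly, which a metaheuristic can then use as a starting point.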

#### Comparison of TSP solving methods :

In general, exact methods are recommended for relatively small instances and for problems where the quality of the solution is critical. Heuristic and metaheuristic methods are recommended for larger instances and for problems where computation time is critical.
More precisely, the choice of a TSP solving method depends on several criteria, such as :  

+ The size of the problem instance : Exact methods are more suitable for small instances, while heuristic and metaheuristic methods are more suitable for larger instances.

+ The required solution quality : Exact methods guarantee the optimal solution, while heuristic and metaheuristic methods do not guarantee the quality of the solution but can provide a reasonable quality solution in a reasonable amount of time.  

+ The constraints of the problem : Some TSP problems may have specific constraints, such as time, capacity, or distance constraints. Some algorithms may be better suited for handling these constraints.  

+ Available computing resources : Exact methods may require a lot of computation time and memory for larger instances, while heuristic and metaheuristic methods are generally faster and require fewer computing resources.  

+ The complexity of the problem : Some TSP problems may be more complex than others in terms of graph topology or cost distribution. Some algorithms may be better suited for solving more complex problems.  

+ The flexibility of the algorithm : Some TSP algorithms are more flexible and can be easily adapted to other combinatorial optimization problems.  

The choice of the TSP solving method should be based on a comprehensive evaluation of the above criteria. Besides, TSP solving algorithms can also be combined to improve solution quality and reduce computation time. 

Metaheuristic methods are often used to solve the TSP because the problem is known to be NP-hard, which means that no known algorithm finds the exact optimal solution for large instances within a reasonable amount of time. Metaheuristic methods offer a way to find solutions that are close to the optimal solution in a reasonable amount of time, even for large instances. There are many different metaheuristic methods that can be used to solve the TSP, including **genetic algorithms, ant colony optimization, simulated annealing, tabu search, and swarm optimization**. Each of these methods has its own strengths and weaknesses, and the choice of method will depend on the specific requirements of the problem and the available resources.  

In summary, we use **metaheuristic methods** to solve the TSP because they offer a way to find good quality solutions in a reasonable amount of time, even for large instances, when exact methods are computationally infeasible. 

#### Choice of the suitable algorithm

More precisely, we choose the genetic algorithm (GA), a metaheuristic that is frequently used to solve the Travelling Salesman Problem (TSP) because it can find high-quality solutions in a reasonable amount of time.  

The GA is inspired by the process of natural selection: solutions evolve through selection, crossover, and mutation. A population of candidate solutions is evolved over several generations, and the fittest solutions are more likely to be selected for reproduction to produce the next generation. The crossover and mutation operators introduce diversity into the population and help to avoid getting stuck in local optima. Genetic algorithms have been extensively studied and have been shown to perform well on TSP instances of various sizes and on other combinatorial optimization problems.  


#### Pseudo-code of the genetic algorithm

+ Generate an initial population $P_{0}$ of p individuals
+ Set i ← 0
+ As long as no stopping criterion is met :
    + Set i ← i + 1
    + Set $P_{i}$ ← ∅
    + Repeat p times :
        + Create e by crossing 2 individuals of $P_{i-1}$
        + Mutate e and add it to $P_{i}$



### Python implementation

#### Data structures

+ **Graph**

For representing a weighted graph inside the program, we can use various data structures, including:

1. <b>Adjacency Matrix</b>: This data structure uses a two-dimensional matrix to represent the graph. The rows and columns of the matrix represent the vertices, and the values in the matrix represent the weights of the edges between the vertices.

2. <b>Adjacency List</b>: This data structure uses an array of linked lists or arrays to represent the graph. Each vertex has a corresponding list/array containing the vertices it is connected to, along with their respective weights.

3. <b>Edge List</b>: This data structure is a simple list of edges, where each edge contains the source vertex, destination vertex, and weight.

Each data structure has its own advantages and disadvantages, depending on the specific requirements. For the case of our work, we will be using the <b>Adjacency Matrix</b>. Here's an example for a graph with 3 nodes:

In [None]:
matrix = [
    [0, 2, 1],
    [2, 0, 1],
    [1, 1, 0],
]
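
For comparison, the same graph can be converted to the two other representations. The helper names below are hypothetical, and the sketch assumes that a weight of 0 means "no edge", which is the convention used for the adjacency matrices in this notebook.

```python
def matrix_to_adjacency_list(matrix):
    """Map each vertex to the list of (neighbor, weight) pairs."""
    return {
        i: [(j, weight) for j, weight in enumerate(row) if weight != 0]
        for i, row in enumerate(matrix)
    }


def matrix_to_edge_list(matrix):
    """List each undirected edge once as (source, destination, weight)."""
    n = len(matrix)
    return [
        (i, j, matrix[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if matrix[i][j] != 0
    ]


matrix = [
    [0, 2, 1],
    [2, 0, 1],
    [1, 1, 0],
]
adjacency_list = matrix_to_adjacency_list(matrix)
edge_list = matrix_to_edge_list(matrix)
print(adjacency_list)
print(edge_list)
```

The adjacency matrix gives direct access to the distance between any two cities, which is the operation the cost function relies on most.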

+ **Solution candidates**

A candidate solution will be represented as a list of the considered nodes (cities), randomly shuffled. For example:

In [None]:
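candidate_solution = [1, 2, 4, 6, 8, 5]

This list represents the path 1 -> 2 -> 4 -> 6 -> 8 -> 5, in the listing order. The return from the last city (5) to the first one (1) is implicit and is added when the cost of the tour is computed.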

#### Utility functions
We are now going to write a set of functions that are going to help us implement the algorithm.

+ **Cost function**

This function calculates the cost (distance) of a solution, in order to compare candidate solutions with one another:

In [None]:
# Calculate the cost of a tour
def cost(matrix, tour):
    n = len(tour)  # number of cities in the tour

    total_distance = 0
    # Add the distance between each pair of consecutive cities
    for i in range(n - 1):
        total_distance += matrix[tour[i]][tour[i + 1]]
    total_distance += matrix[tour[-1]][tour[0]]  # Return to the first city
    return total_distance

+ **Generate random population**

This function will generate a random population of solutions that will be used as a starting point for the genetic algorithm:

In [None]:
import random

# Initialize a random population of solutions
def initialize_population(matrix, population_size):
    population = []
    for _ in range(population_size):
        individual = list(range(len(matrix)))
        random.shuffle(individual)

        while cost(matrix, individual) == 0:
            random.shuffle(individual)

        population.append(individual)
    return population

+ **Selection function**

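This function performs the selection phase of the genetic algorithm. It uses fitness-proportionate (roulette wheel) selection: two 'parents' are drawn at random for the next generation, each with a probability proportional to its fitness (the inverse of its tour cost):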

In [None]:
# Selection
def selection(matrix, population):
    fitness_values = []
    for individual in population:
        fitness_values.append(1 / cost(matrix, individual))
    sum_fitness = sum(fitness_values)
    probabilities = [fitness / sum_fitness for fitness in fitness_values]
    parents = random.choices(population, probabilities, k=2)
    return parents

+ **Crossover function**

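This function performs the crossover operation on the selected parents in order to generate an offspring: a random slice is copied from the first parent, and the remaining positions are filled with the other cities in the order in which they appear in the second parent: 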

In [None]:
def crossover(matrix, parents):
    n = len(matrix)
    parent1, parent2 = parents
    child = [None] * n
    start, end = sorted(random.sample(range(n), 2))
    child[start:end] = parent1[start:end]
    remaining_cities = [city for city in parent2 if city not in child[start:end]]
    index = 0
    for i in range(n):
        if child[i] is None:
            child[i] = remaining_cities[index]
            index += 1
    return child

+ **Mutation function**

The mutation function randomly modifies children solutions for the current generation:

In [None]:
def mutate(matrix, individual, mutation_rate):
    n = len(matrix)
    for i in range(n):
        if random.random() < mutation_rate:
            j = random.randint(0, n - 1)
            individual[i], individual[j] = (
                individual[j],
                individual[i],
            )
    return individual

#### Algorithm implementation

The following function uses the previously defined utility functions to build a full implementation of the genetic algorithm for solving the TSP (and VRP):

In [None]:
def genetic_algorithm(matrix, population_size, generations, mutation_rate):
    population = initialize_population(matrix, population_size)
    best_individual = None
    best_fitness = float("inf")

    for _ in range(generations):
        new_population = []

        while len(new_population) < population_size:
            parents = selection(matrix, population)
            child = crossover(matrix, parents)
            child = mutate(matrix, child, mutation_rate)
            new_population.append(child)

        population = new_population

        for individual in population:
            fitness = cost(matrix, individual)
            if fitness < best_fitness:
                best_fitness = fitness
                best_individual = individual

    return best_individual, best_fitness

### Tests
---

#### Define the appropriate method for random graph generation

There are several methods for generating random graphs. Here are some of the most common methods:

__Erdős-Rényi model :__ This model is based on a fixed probability for each pair of nodes to have an edge between them. The graphs generated by this model have a binomial degree distribution, which is well approximated by a Poisson distribution for large, sparse graphs.

__Barabási-Albert model :__ This model is based on the principle that new nodes connect to existing nodes with a probability proportional to their degree. The graphs generated by this model have a power law degree distribution, which means that there are a small number of nodes with a high degree and a large number of nodes with a low degree.

__Watts-Strogatz model :__ This model is based on creating a regular graph and randomly rewiring edges to create a more random graph. The graphs generated by this model have a degree distribution that has a bell shape.

__Configuration model :__ This model is based on creating a graph by specifying the desired degree distribution of the nodes. The edges are then created in a way that respects these degrees. The graphs generated by this model have a specified degree distribution in advance.

__Kleinberg's random graph model :__ This model is based on the idea that nodes are more likely to connect to nearby nodes, but there can be long-distance connections. Nodes are placed on a grid and linked to their neighbors, and each node also receives long-range links whose probability decreases as a power of the grid distance. The graphs generated by this model have a small characteristic path length.

These methods for generating random graphs are widely used in computer science and mathematics research to study the properties of random graphs and their behavior in different situations. The *Erdős-Rényi model* is a popular method for producing random graphs because it is simple to implement and can generate graphs with a wide range of densities. The model is based on the idea of randomly adding edges to a set of n nodes, with each possible edge being included with a fixed probability p. The resulting graph is known as an *Erdős-Rényi random graph*, denoted by G(n,p).

The Erdős-Rényi model has several advantages, including its simplicity, ease of implementation, and flexibility in generating graphs with a wide range of densities. It has been used in a variety of applications, including the study of random networks, the modeling of social networks, and the analysis of communication networks.

However, one limitation of the Erdős-Rényi model is that it assumes a uniform distribution of edges across all pairs of nodes, which may not reflect the structure of real-world networks. Additionally, the model does not take into account any underlying structure or clustering in the network.

Despite these limitations, the Erdős-Rényi model remains a popular choice for generating random graphs and has been used extensively in the field of graph theory and network science.

#### Probability formula

The formula for generating an Erdős-Rényi random graph is as follows:

+ Start with n isolated nodes.
+ For each pair of nodes i and j, add an edge between them with probability p, independently of all other pairs.
+ The resulting graph has n nodes and an expected number of m = p * n * (n-1) / 2 edges.
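
The procedure above can be sketched with the standard library only, and the expected number of edges can be checked empirically. The function names, the seed and the parameter values below are hypothetical choices made for this illustration.

```python
import random


def gnp_edges(n, p, rng):
    """Return the edge list of one G(n, p) sample: each pair is kept with probability p."""
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]


def expected_edges(n, p):
    """Expected number of edges m = p * n * (n - 1) / 2."""
    return p * n * (n - 1) / 2


rng = random.Random(42)  # fixed seed so that the experiment is reproducible
n, p = 20, 0.7
counts = [len(gnp_edges(n, p, rng)) for _ in range(200)]
average = sum(counts) / len(counts)
print("expected:", expected_edges(n, p), "observed average:", average)
```

Over many samples, the average number of edges should stay close to the expected value, while individual graphs fluctuate around it.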


#### Translate the formula into Python code

The previous methods are implemented to create a random instance generator that will output a random graph as well as one or more (VRP) subgraph(s) representing the delivery points (clients) for the vehicle(s):

In [None]:

import numpy as np
import networkx as nx
import random
import matplotlib.pyplot as plt
import json
import string


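def save_json(g):
    # Save the graph in node-link format under a random 6-character file name
    # (the data/instances/ directory must already exist)
    file_name = "".join(random.choices(string.ascii_lowercase + string.digits, k=6))
    graph_data = nx.node_link_data(g)

    with open(f"data/instances/{file_name}.json", "w") as f:
        json.dump(graph_data, f)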


def generate_adj_matrix(n, p):
    adj_matrix = np.zeros((n, n))

    for i in range(n):
        for j in range(i + 1, n):
            if random.random() < p:
                weight = random.randint(1, 10)
                adj_matrix[i][j] = weight
                adj_matrix[j][i] = weight

    return adj_matrix


def generate_instance(n, p, min, max, n_vehicles):
    # Note: `min` and `max` shadow the built-ins inside this function
    # Generate a random weighted adjacency matrix
    adj_matrix = np.zeros((n, n))
    is_valid = False

    # Regenerate the matrix until no node is isolated
    while not is_valid:
        adj_matrix = generate_adj_matrix(n, p)
        is_valid = True

        for i in range(n):
            if sum(adj_matrix[i]) == 0:
                is_valid = False
                break

    # Build the graph from the adjacency matrix
    G = nx.from_numpy_array(adj_matrix)

    subgraphs = []

    for _ in range(n_vehicles):
        # Generate the subgraph from a random list of nodes of G,
        # whose size is drawn between min and max
        subgraph_nodes = random.sample(
            sorted(G.nodes()), k=random.randint(min, max)
        )
        subgraph = G.subgraph(subgraph_nodes)
        subgraphs.append(subgraph)

    save_json(G)

    return G, subgraphs


Utility functions have been added to handle random adjacency matrix generation as well as graph storage in JSON format.

### Statistics & Behavioral study

---

#### Determine the appropriate characteristics for the study

In order to measure the efficiency and quality of the algorithm we chose to solve the VRP, we carry out a descriptive statistical study of the solution we put in place. It is therefore necessary to first determine the characteristics (variables) that will serve as a basis for our statistical measures, as well as for the predictive statistical study that follows. 

+ **Number of vehicles**

This characteristic allows us to compare the different solutions while taking into account the varying number of vehicles set at the start of the algorithm.
    
+ **Number of Iterations**

This characteristic is the number of iterations (generations) the algorithm runs before returning its best solution. As such, we can use it as a unit of comparison in our sample.
    
+ **Number of clients**

The number of clients, which is also the number of vertices to cover in the subgraph, changes for each individual of the sample. As such, it can be used to compare the efficiency of our algorithm.
    
+ **Covered distance**

This is the sum of the weights of the edges along the path chosen by the algorithm in the subgraph.
    
+ **Convergence Time**

This is the time taken by our algorithm, for each individual of our sample, to return its best solution. It is one of the main factors that allow us to determine the efficiency of our solution.

+ **Generate a sample**

To generate a study sample, we run the algorithm multiple times on different instances, varying the characteristics defined previously. The following script is used to create a sample of executions in an Excel file:


In [None]:
import networkx as nx
import matplotlib.pyplot as plt
import genetic_algorithm as ga
import instance_generator as ig
import pathfinder as pf
import time
import random
import create_excel_file as xls
import math

n_sample = int(input("How many records ? "))
MUTATION_RATE = 0.01
sample_data = []


for i in range(n_sample):
    N_CITIES = random.randint(5, 20)
    N_VEHICLES = random.randint(1, 10)
    POPULATION_SIZE = random.randint(30, 100)
    N_GENERATIONS = random.randint(30, 100)

    min_subgraph_size = int(N_CITIES / 2)
    max_subgraph_size = int(N_CITIES - 1)

    # Generate a test graph and a random subgraph
    graph, subgraphs = ig.generate_instance(
        N_CITIES, 0.7, min_subgraph_size, max_subgraph_size, N_VEHICLES
    )

    pos = nx.spring_layout(graph)

    graph_matrix = nx.adjacency_matrix(graph).todense().tolist()

    k = 1
    pos = nx.spring_layout(graph)
    execution_time = 0
    total_distance = 0
    total_cities = 0

    print(f"--- Record {i + 1} ----")

    for subgraph in subgraphs:
        start_time = time.time()

        complete_subgraph = nx.transitive_closure(subgraph)

        subgraph_matrix = nx.adjacency_matrix(subgraph).todense().tolist()

        nx.set_edge_attributes(
            complete_subgraph,
            {
                e: {"weight": pf.shortest_distance(graph_matrix, e[0], e[1])}
                for e in complete_subgraph.edges
            },
        )

        complete_subgraph_matrix = (
            nx.adjacency_matrix(complete_subgraph).todense().tolist()
        )

        best_tour, best_distance = ga.genetic_algorithm(
            complete_subgraph_matrix, POPULATION_SIZE, N_GENERATIONS, MUTATION_RATE
        )

        # Indexed solution
        node_list = sorted(list(complete_subgraph))
        indexed_tour = [node_list[idx] for idx in best_tour]
        indexed_tour.append(indexed_tour[0])

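        # Replace edges that do not exist in the original graph with the
        # shortest path between their endpoints. The tour is rebuilt in a new
        # list so that inserted nodes do not shift the positions being scanned.
        final_tour = [indexed_tour[0]]
        for a, b in zip(indexed_tour, indexed_tour[1:]):
            if graph_matrix[a][b] == 0:
                # shortest_path returns the path from a to b; a is already in the tour
                final_tour.extend(pf.shortest_path(graph_matrix, a, b)[1:])
            else:
                final_tour.append(b)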

        execution_time += time.time() - start_time
        total_distance += best_distance
        total_cities += len(subgraph)

        print(
            "vehicle",
            k,
            ":",
            final_tour,
            "(",
            best_distance,
            ") --",
            execution_time,
            "seconds elapsed",
        )

        k += 1

    sample_data.append(
        [
            N_VEHICLES,
            N_GENERATIONS,
            total_cities,
            total_distance,
            round(execution_time, 2),
        ]
    )

xls.create_excel_file(sample_data, "sample.xlsx")


#### Descriptive statistical study
To describe the quality and efficiency of our algorithm statistically, we carried out the following study in R on the generated sample :

In [None]:
library(readxl)

#Specifying the excel file location
path <- "C:/Users/hp/Documents/X3/U.E_5_Traitement_de_données/Projet_DATA/projet-data/data/sample.xlsx"

#Accessing the excel file using R 
dataFile <- read_excel(path)

#Attaching the data base(Excel file) to R's search path
attach(dataFile)

#Prints descriptive statistics (minimum, quartiles, median, mean, maximum) for each column
summary(dataFile)

#Shows the standard deviation of a chosen characteristic
sd(Covered_dist)
sd(Convergence_Time)


#Create a list with the characteristics to study
#Then loop over each element of the list to display its mode
lst <- list(Covered_dist, Convergence_Time)
for (i in 1:length(lst)) {
  elmt <- table(lst[[i]])
  mode_function <- rownames(elmt)[which.max(elmt)]
  
  print(paste("Mode for element", i, ":", mode_function))
}

#Draws a histogram of the convergence time then delays for 2 seconds
hist(Convergence_Time, main = "Histogram of Convergence Time", xlab = "Values")
Sys.sleep(2)

#Draws a histogram of the covered distance then delays for 2 seconds
hist(Covered_dist, main = "Histogram of Covered Distance", xlab = "Values")
Sys.sleep(2)

#Draws a box plot (box-and-whisker plot) of the Convergence Time then delays for 2 seconds
boxplot(Convergence_Time, main = "Boxplot Convergence Time", col = c("skyblue"), ylab = "Values", xlab = "Data")
Sys.sleep(2)

#Draws a box plot (box-and-whisker plot) of the Covered distance
boxplot(Covered_dist, main = "Boxplot Covered Distances", col = c("red"), ylab = "Values", xlab = "Data")


After Execution of the above code, we obtained the following plots:

##### Histogram of the Convergence Time
![Histogram_conv_time.png](attachment:Histogram_conv_time.png)

##### Histogram of the Covered Distance
![Histogram_cov_dist.png](attachment:Histogram_cov_dist.png)

##### Box plot of the Convergence Time
![boxplot_time1.png](attachment:boxplot_time1.png)

##### Box plot of the Covered Distance
![boxplot_dist1.png](attachment:boxplot_dist1.png)

#### Predictive statistical study
In order to predict the performance of our solving algorithm for a larger number of cities, we can fit a linear regression of the convergence time on the number of clients with the following code :

In [None]:
library(ggplot2)
library(readxl)

# import the sample and set the path
path_file = "C:/Users/hp/Documents/X3/U.E_5_Traitement_de_données/Projet_DATA/projet-data/data/sample.xlsx"

#Accessing the excel file using R 
data_file <- read_excel(path_file)

# attach the data frame so that its columns can be referenced by name
attach(data_file)

# fit the linear regression between two characteristics
fit <- lm(Convergence_Time ~ Clients_no, data = data_file)

# draw the linear regression between two characteristics
plot(Clients_no, Convergence_Time)
abline(fit)

ggplot(data_file, aes(x = Clients_no, y = Convergence_Time)) + geom_point() + geom_smooth(method = "lm") 

# predict values for larger instances
new_instance <- data.frame(Clients_no=50)
predict(fit, new_instance)

After execution of the above code, we obtained the following plot :

##### Linear Regression Plot on long term
![predictive_stats.png](attachment:predictive_stats.png)
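
To make explicit what the regression step computes, here is a minimal sketch of an ordinary least squares fit in plain Python. The data points are hypothetical and are not taken from our sample; they only illustrate the calculation. Note that predicting for 50 clients extrapolates beyond the sizes used to build such a sample, so the result should be read with caution.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements: number of clients and convergence time in seconds.
clients = [5, 10, 15, 20]
times = [0.5, 1.0, 1.5, 2.0]

slope, intercept = fit_line(clients, times)
predicted = intercept + slope * 50  # same idea as predict(fit, new_instance) in R
print(slope, intercept, predicted)
```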

## Additionnal constraints
--- 

### Capacitated Vehicle Routing Problem

The constraint related to the capacity of the different vehicles leads us to solve the <b>CVRP</b>. The changes in the formulation of the linear optimization problem are as follows:

##### Objective function 

The objective of CVRP (Capacitated Vehicle Routing Problem) is to find the best way to deliver goods from a central warehouse to a set of customers using a set of vehicles with a limited capacity. It is about minimizing the total distance traveled by vehicles while satisfying vehicle capacity constraints and visiting each customer exactly once. It can be translated mathematically by the following function, where m denotes the number of trucks:

$$\min z = \sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{m} c_{ij} x_{ijk}$$

##### Decision Variables  

The above objective function uses the following indices, parameters and decision variables :
+ i and j, the indices of two cities of an undirected, weighted and complete graph G with n cities
+ k, the index of the truck which will have to make the delivery, with $k \in \{1,2,...,m\}$
+ $c_{ij}$, a parameter that represents the distance between the two cities i and j in G
+ $x_{ijk}$, the binary decision variable that takes value 1 if the truck k travels from city i to city j and 0 otherwise

##### Constraints

In the constraints below, we assume that city 1 is the warehouse and that cities 2 to n are the customers.

+ Each customer city should be visited exactly once, by exactly one truck :
$$\sum_{k=1}^{m}\sum_{i=1}^{n} x_{ijk} = 1$$  
$\forall$  $j \in \{2,...,n\}$  

+ A truck that arrives at a city must also leave it, so that each route returns to its starting point :
$$\sum_{i=1}^{n} x_{ijk} = \sum_{i=1}^{n} x_{jik}$$  
$\forall$  $j \in \{1,2,...,n\}$, $\forall$  $k \in \{1,2,...,m\}$ 

+ The maximum capacity of each vehicle must not be exceeded :
$$\sum_{i=1}^{n}\sum_{o=2}^{n} q_{o}x_{iok} \leq Q$$  
$\forall$  $k \in \{1,2,...,m\}$, where $q_{o}$ is the size of the object to be delivered to city o and Q is the maximum capacity of the truck.


+ Subtour elimination constraints :

These constraints ensure that no route splits into subtours, i.e. separate cycles that do not pass through the warehouse. 

These constraints can be written as :

$u_{i} - u_{j} + nx_{ijk} \leq n-1$

$\forall$  $i, j \in \{2,...,n\}$, $i \neq j$, $\forall$  $k \in \{1,2,...,m\}$, where $u_{i}$ represents the position of city i in its route and n the total number of cities.
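
As an illustration of these constraints, the following sketch checks whether a set of routes is a feasible CVRP solution and computes its total distance. Everything here is hypothetical: the function names, the distance matrix, the demands and the capacity are made up for the example, cities are numbered from 0, and city 0 is assumed to be the warehouse.

```python
def route_distance(c, route, depot=0):
    """Distance of one route: depot -> customers in order -> depot."""
    path = [depot, *route, depot]
    return sum(c[a][b] for a, b in zip(path, path[1:]))


def check_cvrp_solution(c, routes, q, Q, depot=0):
    """Return (is_feasible, total_distance) for one route per truck."""
    visited = sorted(city for route in routes for city in route)
    customers = [city for city in range(len(c)) if city != depot]
    # Each customer must appear exactly once over all routes.
    visits_ok = visited == customers
    # The load of each truck must not exceed the capacity Q.
    capacity_ok = all(sum(q[city] for city in route) <= Q for route in routes)
    total = sum(route_distance(c, route, depot) for route in routes)
    return visits_ok and capacity_ok, total


# Hypothetical instance: 1 warehouse (city 0) and 4 customers.
c_example = [
    [0, 4, 5, 6, 7],
    [4, 0, 2, 8, 9],
    [5, 2, 0, 9, 8],
    [6, 8, 9, 0, 3],
    [7, 9, 8, 3, 0],
]
q_example = [0, 3, 4, 2, 5]  # size of the object to deliver to each city
Q_example = 8  # capacity of each truck

print(check_cvrp_solution(c_example, [[1, 2], [3, 4]], q_example, Q_example))
print(check_cvrp_solution(c_example, [[1, 2, 3], [4]], q_example, Q_example))
```

The second set of routes is rejected because the first truck would carry 3 + 4 + 2 = 9 units, which exceeds the capacity of 8.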