## **Objective:**
#### The goal of this project is to suggest the optimal route for a user to reach a destination in France.

## **Features:**
*   The user records an audio clip containing a sentence such as: ***Je suis chez moi à Paris et je veux me rendre à Dijon pour une remise de diplôme*** ("I am at home in Paris and I want to go to Dijon for a graduation ceremony").

*   The program extracts the departure and arrival cities, then proposes the shortest path to the destination.


## **Techniques used:**
*   NLP (Natural Language Processing)
*   Speech recognition
*   Graph traversal algorithms


## **Datasets:**
*   We created a dataset of example sentences used to train our model to recognize the departure and arrival cities in the user's request. https://drive.google.com/uc?export=download&id=1n8vKheaOfZfekyUn94ax40RB3cU0BS5P

*   We also use an SNCF dataset that lists travel durations by departure city and arrival city. This dataset lets us determine the shortest path between a departure city and an arrival city. https://drive.google.com/uc?export=download&id=1CL_z0bH464x42fRfWwM1qJaGKjd9fCtp








# Code

## Module 1: Speech to text

In [None]:
import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding
!pip install ffmpeg-python

Collecting ffmpeg-python
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)
Installing collected packages: ffmpeg-python
Successfully installed ffmpeg-python-0.2.0


In [None]:
from IPython.display import HTML, Audio
from google.colab.output import eval_js
from base64 import b64decode
import numpy as np
from scipy.io.wavfile import read as wav_read
from scipy.io.wavfile import write
import io
import ffmpeg

AUDIO_HTML = """
<script>
var my_div = document.createElement("DIV");
var my_p = document.createElement("P");
var my_btn = document.createElement("BUTTON");
var t = document.createTextNode("Press to start recording");

my_btn.appendChild(t);
//my_p.appendChild(my_btn);
my_div.appendChild(my_btn);
document.body.appendChild(my_div);

var base64data = 0;
var reader;
var recorder, gumStream;
var recordButton = my_btn;

var handleSuccess = function(stream) {
  gumStream = stream;
  var options = {
    //bitsPerSecond: 8000, //chrome seems to ignore, always 48k
    mimeType : 'audio/webm;codecs=opus'
    //mimeType : 'audio/webm;codecs=pcm'
  };
  //recorder = new MediaRecorder(stream, options);
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = function(e) {
    var url = URL.createObjectURL(e.data);
    var preview = document.createElement('audio');
    preview.controls = true;
    preview.src = url;
    document.body.appendChild(preview);

    reader = new FileReader();
    reader.readAsDataURL(e.data);
    reader.onloadend = function() {
      base64data = reader.result;
      //console.log("Inside FileReader:" + base64data);
    }
  };
  recorder.start();
  };

recordButton.innerText = "Recording... press to stop";

navigator.mediaDevices.getUserMedia({audio: true}).then(handleSuccess);


function toggleRecording() {
  if (recorder && recorder.state == "recording") {
      recorder.stop();
      gumStream.getAudioTracks()[0].stop();
      recordButton.innerText = ""
  }
}

// https://stackoverflow.com/a/951057
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

var data = new Promise(resolve=>{
//recordButton.addEventListener("click", toggleRecording);
recordButton.onclick = ()=>{
toggleRecording()

sleep(2000).then(() => {
  // wait 2000ms for the data to be available...
  // ideally this should use something like await...
  //console.log("Inside data:" + base64data)
  resolve(base64data.toString())

});

}
});

</script>
"""

def get_audio():
  display(HTML(AUDIO_HTML))
  data = eval_js("data")
  binary = b64decode(data.split(',')[1])

  process = (ffmpeg
    .input('pipe:0')
    .output('pipe:1', format='wav')
    .run_async(pipe_stdin=True, pipe_stdout=True, pipe_stderr=True, quiet=True, overwrite_output=True)
  )
  output, err = process.communicate(input=binary)

  riff_chunk_size = len(output) - 8
  # Break up the chunk size into four bytes, held in b.
  q = riff_chunk_size
  b = []
  for i in range(4):
      q, r = divmod(q, 256)
      b.append(r)

  # Replace bytes 4:8 in proc.stdout with the actual size of the RIFF chunk.
  riff = output[:4] + bytes(b) + output[8:]

  sr, audio = wav_read(io.BytesIO(riff))

  byte_io = io.BytesIO(bytes())
  write(byte_io, sr, audio)
  result_bytes = byte_io.read()

  audio_data = speech_recognition.AudioData(result_bytes, sr, 2)

  return audio_data
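
The `divmod` loop in `get_audio` rewrites the RIFF chunk-size field (bytes 4 to 8 of the WAV header) as a little-endian 32-bit integer, presumably because the size written by ffmpeg is unreliable when the output goes to a pipe. A minimal sketch of the same patch using `struct.pack`; the helper name `patch_riff_size` and the toy header are illustrative assumptions, not part of the notebook:

```python
import struct

def patch_riff_size(wav_bytes: bytes) -> bytes:
    """Return a copy of wav_bytes whose RIFF chunk-size field matches its length.

    Hypothetical helper mirroring the divmod loop in get_audio.
    """
    # The RIFF chunk size excludes the 4-byte "RIFF" tag and the 4-byte size field.
    riff_chunk_size = len(wav_bytes) - 8
    # "<I" packs an unsigned 32-bit integer in little-endian byte order.
    return wav_bytes[:4] + struct.pack("<I", riff_chunk_size) + wav_bytes[8:]

# Toy header: a placeholder size field followed by 36 more bytes.
fake_wav = b"RIFF" + b"\xff\xff\xff\xff" + b"WAVE" + bytes(32)
patched = patch_riff_size(fake_wav)
print(int.from_bytes(patched[4:8], "little"))  # 36
```

Only the four size bytes change; the audio payload is left untouched.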

In [None]:
!pip install speechRecognition

Collecting speechRecognition
  Downloading SpeechRecognition-3.10.0-py2.py3-none-any.whl (32.8 MB)
Installing collected packages: speechRecognition
Successfully installed speechRecognition-3.10.0


In [None]:
import speech_recognition

In [None]:
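def speech_to_text(audio_data):
  """Transcribe French speech with recognize_google; return None on failure."""
  r = speech_recognition.Recognizer()

  try:
    output = r.recognize_google(audio_data, language='fr-FR', show_all=True)
  except speech_recognition.RequestError as e:
    print("Could not request results; {0}".format(e))
    return None
  except speech_recognition.UnknownValueError:
    print("Speech could not be understood")
    return None

  # With show_all=True the raw response is returned and may be empty when
  # nothing was recognized, so guard before indexing into it.
  if not output or not output.get("alternative"):
    print("No transcript returned")
    return None

  travel_request = output["alternative"][0]["transcript"]
  print("The travel request sentence is : ", travel_request)
  return travel_request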
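
`speech_to_text` reads the first entry of the `"alternative"` list in the recognizer's response. The sketch below isolates that lookup on a hand-written dictionary so the empty-result case is easy to see; the helper name `best_transcript` and the sample payload are assumptions made for illustration, shaped only after the keys the notebook accesses:

```python
def best_transcript(result):
    """Return the first transcript from a show_all-style result, or None."""
    # Assumption: a successful result is a dict with an "alternative" list,
    # as accessed in speech_to_text; anything else means nothing was recognized.
    if not isinstance(result, dict):
        return None
    alternatives = result.get("alternative") or []
    if not alternatives:
        return None
    return alternatives[0].get("transcript")

sample = {"alternative": [{"transcript": "je suis à Paris et je veux aller à Toulouse"}]}
print(best_transcript(sample))
print(best_transcript([]))  # None
```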


## Module 2: NLP

### 3.0 Load data

In [None]:
import urllib.request
import pandas as pd
from io import BytesIO


urllib.request.urlretrieve("https://drive.google.com/uc?export=download&id=1n8vKheaOfZfekyUn94ax40RB3cU0BS5P", "sentence_dataset.xlsx")

excel_data = pd.read_excel("/content/sentence_dataset.xlsx", sheet_name="Test")
print(f'Test set size: {len(excel_data)}')
print(excel_data[0:5])
print(excel_data.dtypes)

Test set size: 90
                                            sentence    departure  destination
0  Je voudrais aller à Paris en partant de Montpe...  Montpellier        Paris
1  Est-ce que tu peux me trouver un itinéraire de...        Paris  Montpellier
2  Je souhaite aller à Lyon en partant de Montpel...  Montpellier         Lyon
3        Je suis à Montpellier, je vous aller à Lyon  Montpellier         Lyon
4           Trouve moi un itinéraire pour Lyon Paris         Lyon        Paris
sentence       object
departure      object
destination    object
dtype: object


In [None]:
!python -m spacy download fr_core_news_md


### 3.1 Algorithmic approach

In [None]:
import spacy

nlp = spacy.load("fr_core_news_md")

def location_count(sentence):
    """Return True when spaCy finds exactly two LOC entities in the sentence."""
    doc = nlp(sentence)
    locations = [ent for ent in doc.ents if ent.label_ == "LOC"]
    return len(locations) == 2

def predict(d):
    """Decide which of the two detected locations is the departure and which is the arrival.

    d maps each location text to {"label", "head", "children"}, as built
    by extract_cities_with_algo.
    """
    first, second = list(d.keys())[:2]

    children_0_strings = [i.text for i in d[first]["children"]]
    children_1_strings = [i.text for i in d[second]["children"]]

    head_0_string = d[first]["head"]
    head_1_string = d[second]["head"]

    # Prepositions and head words that mark a departure.
    departure_children = ["de", "depuis"]
    departure_head = ["suis", "partant"]

    dep_child_0 = any(i in departure_children for i in children_0_strings)
    dep_child_1 = any(i in departure_children for i in children_1_strings)
    # Each head is a single word, so test membership directly
    # (iterating over the string would compare individual characters).
    dep_head_0 = head_0_string in departure_head
    dep_head_1 = head_1_string in departure_head

    if dep_child_0 and not dep_child_1:
        departure, arrival = first, second
    elif dep_child_1 and not dep_child_0:
        departure, arrival = second, first
    elif dep_head_0 and not dep_head_1:
        departure, arrival = first, second
    elif dep_head_1 and not dep_head_0:
        departure, arrival = second, first
    else:
        # Default: the first location mentioned is the departure.
        departure, arrival = first, second

    return [departure, arrival]

# Extraction method

def extract_cities_with_algo(sentence):
    if not location_count(sentence):
        print("sentence is not valid")

    doc = nlp(sentence)
    data = {}

    for ent in doc.ents:
        if ent.label_ == "LOC":
            data[ent.text] = {"label": ent.label_}
            # For multi-token entities, the last token's head and children are kept.
            for token in ent:
                data[ent.text]["head"] = token.head.text
                data[ent.text]["children"] = list(token.children)

    # predict needs two distinct locations to work with.
    if len(data) < 2:
        raise ValueError("expected two distinct locations in the sentence")

    prediction = predict(data)
    print("Departure : ", prediction[0])
    print("Destination : ", prediction[1])
    return prediction
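
The rules in `predict` rely on which preposition is attached to each location in spaCy's dependency parse. The sketch below shows the same idea in a deliberately simplified form that runs without spaCy: it only looks at the word immediately before each city. The function name `order_cities`, the assumption that each city is a single whitespace-separated word, and the reduced rule set are all illustrative and not the notebook's method:

```python
DEPARTURE_MARKERS = {"de", "depuis"}

def order_cities(sentence, cities):
    """Return [departure, arrival] for exactly two single-word city names.

    Simplification: inspects the word right before each city instead of
    the dependency tree used in the notebook.
    """
    words = sentence.replace(",", " ").split()
    first, second = cities

    def preceded_by_departure_marker(city):
        idx = words.index(city)  # raises ValueError if the city is not a word of the sentence
        return idx > 0 and words[idx - 1].lower() in DEPARTURE_MARKERS

    if preceded_by_departure_marker(second) and not preceded_by_departure_marker(first):
        return [second, first]
    # Default, as in predict: the first city mentioned is the departure.
    return [first, second]

print(order_cities("Je voudrais aller à Paris en partant de Montpellier", ["Paris", "Montpellier"]))
# ['Montpellier', 'Paris']
```

A word-window rule like this breaks on elisions such as "d'Avignon", which is one reason a dependency parse is the more robust choice.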

### 3.2 Custom NER

### Train model



In [None]:
training_data = []

for index, row in excel_data.iterrows():
    sentence = row['sentence']
    departure = row['departure']
    destination = row['destination']
    training_data.append((row['sentence'], {"entities": [
        (sentence.find(departure), sentence.find(departure) + len(departure), "DEPARTURE_LOC")
        ,(sentence.find(destination), sentence.find(destination) + len(destination), "DESTINATION_LOC")
    ]}))

print(training_data[0:5])

[('Je voudrais aller à Paris en partant de Montpellier', {'entities': [(40, 51, 'DEPARTURE_LOC'), (20, 25, 'DESTINATION_LOC')]}), ('Est-ce que tu peux me trouver un itinéraire de Paris à Montpellier', {'entities': [(47, 52, 'DEPARTURE_LOC'), (55, 66, 'DESTINATION_LOC')]}), ('Je souhaite aller à Lyon en partant de Montpellier', {'entities': [(39, 50, 'DEPARTURE_LOC'), (20, 24, 'DESTINATION_LOC')]}), ('Je suis à Montpellier, je vous aller à Lyon', {'entities': [(10, 21, 'DEPARTURE_LOC'), (39, 43, 'DESTINATION_LOC')]}), ('Trouve moi un itinéraire pour Lyon Paris', {'entities': [(30, 34, 'DEPARTURE_LOC'), (35, 40, 'DESTINATION_LOC')]})]
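
The entity annotations printed above are character offsets obtained with `str.find`. A minimal sketch of that computation for one sentence, with an added guard for a city that does not occur in the text (`str.find` returns -1 in that case, which would otherwise yield a meaningless span). The helper name `entity_span` is hypothetical:

```python
def entity_span(sentence, city, label):
    """Return a (start, end, label) tuple, or None if the city is not in the sentence."""
    start = sentence.find(city)
    if start == -1:
        # Without this check the span would silently become (-1, len(city) - 1, label).
        return None
    return (start, start + len(city), label)

sentence = "Je voudrais aller à Paris en partant de Montpellier"
spans = [
    entity_span(sentence, "Montpellier", "DEPARTURE_LOC"),
    entity_span(sentence, "Paris", "DESTINATION_LOC"),
]
print(spans)
# [(40, 51, 'DEPARTURE_LOC'), (20, 25, 'DESTINATION_LOC')]
```

Note that `find` returns the first occurrence only, so a city name that also appears inside another word earlier in the sentence would be annotated at the wrong position.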


In [None]:
# shuffle training_data
import random

random.shuffle(training_data)

#split training data into train and test data sets
print(len(training_data))

test_data = training_data[:len(training_data)//5]
train_data = training_data[len(training_data)//5:]

print(len(train_data))
print(len(test_data))

90
72
18
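
Because `random.shuffle` is unseeded here, each run produces a different 72/18 split. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable for this project; the function name `split_examples` and the seed value are illustrative:

```python
import random

def split_examples(examples, seed=42):
    """Shuffle a copy of the examples and hold out one fifth for testing."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = len(shuffled) // 5  # same ratio as the notebook
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_examples(range(90))
print(len(train), len(test))  # 72 18
```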


In [None]:
import pandas as pd
import os
from tqdm import tqdm
import spacy
from spacy.tokens import DocBin

nlp = spacy.load("fr_core_news_md")

db_train = DocBin()
db_test = DocBin()

for text, annot in tqdm(train_data):
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)

    doc.ents = ents
    db_train.add(doc)



for text, annot in tqdm(test_data):
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents
    db_test.add(doc)

db_train.to_disk("./train.spacy")
db_test.to_disk("./test.spacy")

100%|██████████| 72/72 [00:00<00:00, 2115.63it/s]


Skipping entity


100%|██████████| 18/18 [00:00<00:00, 1797.09it/s]

Skipping entity
Skipping entity





In [None]:
# download spacy config
urllib.request.urlretrieve("https://drive.google.com/uc?export=download&id=1NloGxCPXKFaLfMpwZANnOBi2taeZrIJy", "config.cfg")

('config.cfg', <http.client.HTTPMessage at 0x79fc1fcffe20>)

In [None]:
!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./test.spacy

✔ Created output directory: output
ℹ Saving to output directory: output
ℹ Using CPU
ℹ To switch to GPU 0, use the option: --gpu-id 0
[2023-08-06 11:53:41,749] [INFO] Set up nlp object from config
[2023-08-06 11:53:41,779] [INFO] Pipeline: ['tok2vec', 'ner']
[2023-08-06 11:53:41,785] [INFO] Created vocabulary
[2023-08-06 11:53:43,987] [INFO] Added vectors: fr_core_news_md
[2023-08-06 11:53:43,988] [INFO] Finished initializing nlp object
[2023-08-06 11:53:44,451] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00     43.60    0.00    0.00    0.00    0.00
 12     200         13.90    856.12    0.00    0.00    0.00    0.00
 28     400         

In [None]:
def extract_cities_with_custom_NER(sentence):

  nlp_custom_ner = spacy.load("/content/output/model-best")
  doc = nlp_custom_ner(sentence)

  departure= ""
  destination = ""
  for ent in doc.ents:
    if ent.label_ == "DEPARTURE_LOC":
      departure = ent.text
    if ent.label_== "DESTINATION_LOC":
      destination = ent.text

  print("Departure : ", departure)
  print("Destination : ", destination)

  return departure, destination

### 3.3 OpenAI API

In [None]:
!pip install openai

Collecting openai
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)
Installing collected packages: openai
Successfully installed openai-0.27.8


In [None]:
import openai

#construct prompt with context and example
def generate_prompt(sentence):
	return """Trouve la gare de départ et la gare d'arrivée.
\nPhrase: Trouve moi le trajet le plus court entre Paris et Montpellier en passant par Lyon.
\nReponse:Paris;Montpellier
\nPhrase:{}
\nReponse:""".format(sentence)

openai.api_key = ""
#api call
def openai_api_call(travel_request):
	call_response = openai.Completion.create(
						model="text-davinci-003",
						prompt=generate_prompt(travel_request),
						temperature=0.7,
						max_tokens=100)
	return call_response.choices[0].text

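def format_response(response):
	# Strip the whitespace and newlines that can surround the completion text.
	splitted_res = [part.strip() for part in response.split(';')]
	return splitted_res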

def extract_cities_with_openai(travel_request):
	res = openai_api_call(travel_request)
	formatted_response = format_response(res)
	print("Departure : ",formatted_response[0])
	print("Destination : ",formatted_response[1])
	return formatted_response
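
The prompt asks the model to answer in the form `departure;arrival`, but nothing forces a completion to respect that format. A minimal sketch of a stricter parser that can be tested without calling the API; the function name `parse_city_pair` and the sample string are assumptions for illustration:

```python
def parse_city_pair(response_text):
    """Split a 'departure;arrival' completion into two clean city names."""
    parts = [part.strip() for part in response_text.split(";")]
    # Reject anything that is not exactly two non-empty fields.
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"Unexpected response format: {response_text!r}")
    return parts[0], parts[1]

print(parse_city_pair(" Paris;Toulouse\n"))
# ('Paris', 'Toulouse')
```

Raising on a malformed answer makes the failure visible at the extraction step instead of later, during station lookup.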


## Module 3: Pathfinding

In [None]:
from collections import defaultdict
import csv
import urllib.request

In [None]:
class Graph():
    def __init__(self):
        self.edges = defaultdict(list)
        self.weights = {}

    def add_edge(self, from_node, to_node, weight):
        # Note: assumes edges are bi-directional
        self.edges[from_node].append(to_node)
        self.edges[to_node].append(from_node)
        self.weights[(from_node, to_node)] = weight
        self.weights[(to_node, from_node)] = weight

graph = Graph()
urllib.request.urlretrieve("https://drive.google.com/uc?export=download&id=1CL_z0bH464x42fRfWwM1qJaGKjd9fCtp", "sncf_data.csv")

with open('/content/sncf_data.csv', newline='', encoding='UTF-8') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')

    next(reader)

    for row in reader:
        graph.add_edge(row[1].lower(), row[2].lower(), int(row[3]))
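
The loading loop assumes that columns 1, 2 and 3 of each CSV row hold the departure station, the arrival station and an integer duration. The sketch below replays that logic on a tiny in-memory CSV so the resulting structure can be inspected; the header names and the two rows are made-up sample data, not values from the SNCF file:

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; only the column positions follow the notebook.
SAMPLE_CSV = """id,departure,arrival,duration
0,Paris-Austerlitz,Vierzon,90
1,Vierzon,Limoges-Bénédictins,100
"""

def load_edges(csv_text):
    """Build an undirected weighted adjacency map from CSV text."""
    edges = defaultdict(dict)
    reader = csv.reader(io.StringIO(csv_text), delimiter=",")
    next(reader)  # skip the header row
    for row in reader:
        a, b, duration = row[1].lower(), row[2].lower(), int(row[3])
        # Store the edge in both directions, like Graph.add_edge.
        edges[a][b] = duration
        edges[b][a] = duration
    return edges

edges = load_edges(SAMPLE_CSV)
print(dict(edges["vierzon"]))
# {'paris-austerlitz': 90, 'limoges-bénédictins': 100}
```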

In [None]:
def pathfinder(graph, initial, end):
    initial = initial.lower()
    end = end.lower()
    # shortest paths is a dict of nodes
    # whose value is a tuple of (previous node, weight)
    shortest_paths = {initial: (None, 0)}
    current_node = initial
    visited = set()

    while current_node != end:
        visited.add(current_node)
        destinations = graph.edges[current_node]
        weight_to_current_node = shortest_paths[current_node][1]

        for next_node in destinations:
            weight = graph.weights[(current_node, next_node)] + weight_to_current_node
            if next_node not in shortest_paths:
                shortest_paths[next_node] = (current_node, weight)
            else:
                current_shortest_weight = shortest_paths[next_node][1]
                if current_shortest_weight > weight:
                    shortest_paths[next_node] = (current_node, weight)

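        next_destinations = {node: shortest_paths[node] for node in shortest_paths if node not in visited}
        if not next_destinations:
            # No route: return the same (path, weight) shape as the success case
            # so callers can always unpack two values.
            return None, float("inf")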
        # next node is the destination with the lowest weight
        current_node = min(next_destinations, key=lambda k: next_destinations[k][1])

    # Total weight of the shortest path to the destination
    total_weight = shortest_paths[end][1]

    # Work back through destinations in shortest path
    path = []
    while current_node is not None:
        path.append(current_node)
        next_node = shortest_paths[current_node][0]
        current_node = next_node
    # Reverse path
    path = path[::-1]
    return path, total_weight

def cityToStation(graph, city):
    # Collect every station whose name contains the city name (substring match).
    stations = []
    for key in graph.edges:
        if city in key:
            stations.append(key)
    return stations

def get_shortest_path(departure, arrival):
    stationsDeparture = cityToStation(graph, departure)
    stationsArrival = cityToStation(graph, arrival)
    # Start from infinity so any real route is shorter.
    min_weight = float("inf")
    final_path = None
    # Try every departure/arrival station pair and keep the fastest route.
    for stationD in stationsDeparture:
        for stationA in stationsArrival:
            path, weight = pathfinder(graph, stationD, stationA)
            if weight < min_weight:
                final_path = path
                min_weight = weight

    print("The shortest path is : ", final_path)
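
`pathfinder` is a Dijkstra-style search that rescans all unvisited nodes to pick the next one. For comparison, here is a minimal sketch of the same algorithm driven by a priority queue (`heapq`), run on a toy graph. The function name `dijkstra`, the adjacency-dict input format and the toy weights are assumptions for illustration; this is not the notebook's `Graph` class:

```python
import heapq

def dijkstra(edges, start, goal):
    """Return (path, total_weight), or (None, inf) if goal is unreachable.

    edges: dict mapping node -> {neighbor: weight}.
    """
    best = {start: 0}
    previous = {start: None}
    queue = [(0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            # Walk back through predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1], dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbor, weight in edges.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    return None, float("inf")

toy = {
    "a": {"b": 4, "c": 1},
    "b": {"a": 4, "c": 2, "d": 5},
    "c": {"a": 1, "b": 2, "d": 8},
    "d": {"b": 5, "c": 8},
}
print(dijkstra(toy, "a", "d"))
# (['a', 'c', 'b', 'd'], 8)
```

The returned weight is the distance recorded for the goal node itself, and an unreachable goal yields a tuple of the same shape, so callers can always unpack two values.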

# DEMO

## 1 - Record audio

In [None]:
audio_data = get_audio()

## 2 - Get the travel request using Google's speech-to-text API

In [None]:
travel_request = speech_to_text(audio_data)

The travel request sentence is :  salut je suis une fille je suis actuellement à Paris et j'aimerais bien aller à Toulouse


## 3 - Extract departure and destination cities from the travel request

### 3.1 - Algorithmic approach

In [None]:
departure, destination = extract_cities_with_algo(travel_request)

Departure :  Paris
Destination :  Toulouse


### 3.2 Custom trained NER approach

In [None]:
departure, destination = extract_cities_with_custom_NER(travel_request)

Departure :  Paris
Destination :  Toulouse


### 3.3 OpenAI API approach

In [None]:
departure, destination = extract_cities_with_openai(travel_request)

Departure :  Paris
Destination :  Toulouse


## 4 - Pathfinding

In [None]:
get_shortest_path(departure.lower(), destination.lower())

The shortest path is :  ['paris-austerlitz', 'vierzon', 'limoges-bénédictins', 'agen', 'toulouse-matabiau']
