# __Web scraping process for gathering prices from the Rungis website__

Here we present the different stages we went through to gather data for our simulations. 

We start by importing the required libraries. The most important one is `BeautifulSoup`, which is used to navigate the HTML code fetched with `urllib.request`.

In [1]:
from bs4 import BeautifulSoup
import urllib.request
import csv
import pandas as pd 
import numpy as np 
import os
import threading

In [2]:
urlpage = "https://rnm.franceagrimer.fr/prix?FRUITS-ET-LEGUMES" #Defining the URL for the base

page = urllib.request.urlopen(urlpage) #Fetching the page

soup = BeautifulSoup(page, 'html.parser') #Parsing the HTML code

In [3]:
print(soup)

<!DOCTYPE html>

<html lang="fr">
<head>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta charset="utf-8"/>
<link href="/css/basernm.css?051020" rel="stylesheet" type="text/css"/>
<script async="" src="/css/basernm.js?270919"></script>
<link href="/ico/icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>
<link href="/ico/icon-60x60.png" rel="apple-touch-icon" sizes="60x60"/>
<link href="/ico/icon-72x72.png" rel="apple-touch-icon" sizes="72x72"/>
<link href="/ico/icon-76x76.png" rel="apple-touch-icon" sizes="76x76"/>
<link href="/ico/icon-114x114.png" rel="apple-touch-icon" sizes="114x114"/>
<link href="/ico/icon-120x120.png" rel="apple-touch-icon" sizes="120x120"/>
<link href="/ico/icon-144x144.png" rel="apple-touch-icon" sizes="144x144"/>
<link href="/ico/icon-152x152.png" rel="apple-touch-icon" sizes="152x152"/>
<link href="/ico/icon-180x180.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="/ico/icon-192x192.png" rel="icon" sizes="192x192" type=

To fetch the data we want, we first need to know the structure of the page itself.
Here, we notice that each product has its own page, linked from a __div__ of class `listunproduit` through an __href__ of the form `/prix?product`. To get the prices of the last X months, we append `&XMOIS` to the URL (for example `&12MOIS` for the last twelve months). So we collect the URLs contained in every __div__ of class `listunproduit`.
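The URL construction described above can be sketched as a small helper. This is only an illustration: `build_product_url` and `BASE_URL` are hypothetical names that do not appear in the notebook, and the suffix format is assumed from the `&12MOIS` example.

```python
# Base address of the site, as used elsewhere in this notebook.
BASE_URL = "https://rnm.franceagrimer.fr"

def build_product_url(href, months):
    """Append the month-window suffix to a relative product link.

    `href` is a relative link such as "/prix?ABRICOT"; `months` is the
    number of months of history requested (assumed to be a positive integer).
    """
    if months < 1:
        raise ValueError("months must be a positive integer")
    return f"{BASE_URL}{href}&{months}MOIS"

print(build_product_url("/prix?ABRICOT", 12))
# https://rnm.franceagrimer.fr/prix?ABRICOT&12MOIS
```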

In [4]:
table = soup.find_all('div', attrs={'class': 'listunproduit'}) #Listing all the class listunproduit

urls = [] #Initializing a list of all the urls for the product

for produit in table:
    urls.append("https://rnm.franceagrimer.fr"+produit.find('a',href = True)['href']+'&12MOIS') #Fetching all the urls of all the products

N = len(urls)

In [5]:
print(f"The urls list contains urls like : {urls[0]}")

The urls list contains urls like : https://rnm.franceagrimer.fr/prix?ABRICOT&12MOIS


The next step is the core of this part. Once we have the URLs, we need to fetch the prices of all the products. To make this as fast as possible, we use multithreading with 4 threads.

But as we did before, we first need to know the structure of all the pages we visit. So we begin by printing the HTML code of one page. Fortunately all the pages have the same structure.

In [6]:
page = urllib.request.urlopen(urls[0])
soup = BeautifulSoup(page, 'html.parser')
print(soup)

<!DOCTYPE html>

<html lang="fr">
<head>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta charset="utf-8"/>
<link href="/css/basernm.css?051020" rel="stylesheet" type="text/css"/>
<script async="" src="/css/basernm.js?270919"></script>
<link href="/ico/icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>
<link href="/ico/icon-60x60.png" rel="apple-touch-icon" sizes="60x60"/>
<link href="/ico/icon-72x72.png" rel="apple-touch-icon" sizes="72x72"/>
<link href="/ico/icon-76x76.png" rel="apple-touch-icon" sizes="76x76"/>
<link href="/ico/icon-114x114.png" rel="apple-touch-icon" sizes="114x114"/>
<link href="/ico/icon-120x120.png" rel="apple-touch-icon" sizes="120x120"/>
<link href="/ico/icon-144x144.png" rel="apple-touch-icon" sizes="144x144"/>
<link href="/ico/icon-152x152.png" rel="apple-touch-icon" sizes="152x152"/>
<link href="/ico/icon-180x180.png" rel="apple-touch-icon" sizes="180x180"/>
<link href="/ico/icon-192x192.png" rel="icon" sizes="192x192" type=

As we can see, prices are contained in a __table__ with the id `tabcotmar`. So for each page we fetch that table and go through its __tr__ rows: in each row, the product name is in a __td__ of class `tdcotl` and the monthly prices are in the __td__ cells of class `tdcotr m12`. Rows that contain a __td__ of class `tdcotcolspan` carry no prices and are skipped.

In [7]:
produits = []

n_thread = 4 
def getPrice(tid):
    #Each thread processes its own contiguous slice of urls, selected by its index tid
    i_start = int(tid*N/n_thread)
    i_end = min(N,int((tid+1)*N/n_thread))
    for k in range(i_start,i_end):
        url = urls[k]
        page = urllib.request.urlopen(url)
        soup = BeautifulSoup(page, 'html.parser')
        table = soup.find('table',attrs={'id' : 'tabcotmar'})
        for line in table.find('tbody').find_all('tr'):
            location_title = line.find('td', attrs={'class' : 'tdcotcolspan'})
            if location_title is None: #Keeping only the rows that are not title rows
                produit = line.find('td',attrs={'class' : 'tdcotl'})
                prix = line.find_all('td',attrs={'class' : 'tdcotr m12'})
                #Product label followed by its 12 monthly price cells
                produits.append([produit.getText()] +
                                [prix[j].getText() for j in range(12)])
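To check how the index arithmetic splits the work between threads, the bounds can be computed on their own. The sketch below is a standalone illustration: `chunk_bounds` is a hypothetical helper, and the values 10 and 4 are arbitrary example inputs, not the real number of products.

```python
def chunk_bounds(tid, n_items, n_threads):
    """Return the half-open range [start, end) of indices handled by thread `tid`."""
    start = int(tid * n_items / n_threads)
    end = min(n_items, int((tid + 1) * n_items / n_threads))
    return start, end

n_items, n_threads = 10, 4
bounds = [chunk_bounds(t, n_items, n_threads) for t in range(n_threads)]
print(bounds)
# [(0, 2), (2, 5), (5, 7), (7, 10)]

# Every index is covered exactly once, in order.
covered = [k for start, end in bounds for k in range(start, end)]
print(covered == list(range(n_items)))
# True
```

The slices are contiguous and do not overlap, so no URL is fetched twice even when the number of items is not a multiple of the number of threads.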

In [8]:
class myThread (threading.Thread):
    def __init__(self, threadID,name, counter):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.counter = counter
    def run(self):
        getPrice(self.threadID)

In [9]:
threads = [myThread(i,"T"+str(i),i) for i in range(n_thread)]
print(threads)
for i in range(n_thread):
    threads[i].start()
for thread in threads:
    thread.join()

[<myThread(T0, initial)>, <myThread(T1, initial)>, <myThread(T2, initial)>, <myThread(T3, initial)>]
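As an alternative to subclassing `threading.Thread`, the same fan-out can be written with the standard library's `concurrent.futures.ThreadPoolExecutor`. This is a sketch of a different technique, not the method used in this notebook: `fetch_rows` is a hypothetical stand-in that returns placeholder rows instead of downloading and parsing a page, and `scrape_all` is an assumed helper name.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_rows(url):
    """Hypothetical stand-in for the per-page scraping: returns a list of rows."""
    return [[url, "placeholder"]]

def scrape_all(urls, fetch=fetch_rows, max_workers=4):
    """Run `fetch` on every URL with a pool of worker threads and merge the rows."""
    rows = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields the results in the same order as the input urls
        for page_rows in pool.map(fetch, urls):
            rows.extend(page_rows)
    return rows

demo = scrape_all(["u1", "u2", "u3"])
print(demo)
# [['u1', 'placeholder'], ['u2', 'placeholder'], ['u3', 'placeholder']]
```

With this approach the pool hands out URLs one at a time, so no manual slicing of the list is needed, and the rows come back in the order of the input URLs rather than in the order the threads finish.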


In [13]:
print("The data we gathered look like this :")
print(produits[0])

The data we gathered look like this :
[' CLÉMENTINE Méditerranée biologique tout-calibre (le kg) ', ' \xa0', ' \xa0', ' \xa0', ' \xa0', ' \xa0', ' 3.71', ' 3.70', ' 3.70', ' \xa0', ' \xa0', ' \xa0', ' \xa0']


We notice that the data contains many __'\xa0'__ values (non-breaking spaces), which mark missing prices: prices are not reported every month. To deal with this problem we need to fill in the missing values. That step is described in another notebook located in __`./programms/data_computing`__.
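Before any filling, the raw cells can be turned into numbers, with the non-breaking spaces mapped to an explicit missing value. The sketch below is a minimal illustration under that assumption: `parse_price` is a hypothetical helper, `None` is an arbitrary choice of missing-value marker, and the row is the sample printed above.

```python
def parse_price(cell):
    """Convert a scraped cell to a float, or None when the cell is empty.

    str.strip() removes the non-breaking space '\xa0' as well as ordinary
    whitespace, so an empty price cell becomes the empty string.
    """
    text = cell.strip()
    return float(text) if text else None

row = [' CLÉMENTINE Méditerranée biologique tout-calibre (le kg) ',
       ' \xa0', ' \xa0', ' \xa0', ' \xa0', ' \xa0', ' 3.71', ' 3.70', ' 3.70',
       ' \xa0', ' \xa0', ' \xa0', ' \xa0']

name = row[0].strip()
prices = [parse_price(cell) for cell in row[1:]]
print(name)
print(prices)
```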

In [10]:
tableau = np.array(produits)
print("Creating csv file")
pd.DataFrame(tableau).to_csv("prix3.csv",header = ["Produit","Prix 1","Prix 2","Prix 3","Prix 4","Prix 5","Prix 6","Prix 7","Prix 8","Prix 9","Prix 10","Prix 11","Prix 12"],sep=";")
print("Done")

Creating csv file
Done
