# Fiverr Scraping Project

## Explanation


We decided to build a data analysis project around Fiverr, a site where people offer all kinds of services to users. The first step was to find a way to collect as much information as possible from the website through scraping. How long this step takes depends entirely on how carefully the developers structured the site's markup. Unfortunately, we lost that gamble.

![](./images/1.png)

The markup switches from \<b> to \<strong> for no apparent reason between two different items of the same list.

![](./images/2.png)

The information is also very poorly organized.

In short, these are just a few unavoidable difficulties.

We had two ideas for scripts:

- We pick a subject (photoshop, python, or piano) and, for each results page, collect every service offer the site returns. Each page contains about 40 offers, and Fiverr provides additional information such as the author's average rating, the number of reviews, the description, and the profile link. We extract all of it, store it in a database, and analyze it. The idea is to compare all the offers with one another and return the most effective one according to its rating and number of reviews.
- From the database, we scrape every profile using the links collected by the first script. With this additional information, we had the idea of building a profiler: we feed all of this data to a machine learning model, and it returns the profile most likely to succeed.

## Code

In [1]:
import re
import os
import math
import time
import unidecode
import base64
import pandas as pd
import numpy as np
import plotly.express as px
from collections import Counter
from wordcloud import WordCloud
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from jupyter_dash import JupyterDash
from dash import Dash, html, dcc, Input, Output
from nltk.tokenize import word_tokenize # Passing the string text into word tokenize for breaking the sentences
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import FrenchStemmer
import nltk.corpus



Choose the subject and create the folder that will hold the CSV files, which serve as our database.

In [2]:
# ["python", "data science", "copyright", "comptability", "design", "excel, javascript", "adobe premiere", "photoshop", " c++", "guitare", "math", "piano", "violon", "java"]
subject = 'javascript'
CARDS_COLUMNS = ["name", "price", "nb_comments", "note", "profil_link", "short_description"]
PROFILS_COLUMS = ["name", "location", "created_at", "response_time", "last_order", "languages", "linked_acc", "skills", "education", "description"]
# Create ./data/<subject>, including ./data itself if it does not exist yet
os.makedirs(f'./data/{subject}', exist_ok=True)

This function returns the driver needed to retrieve the HTML of a web page. The options help avoid bot detection (websites use various techniques to prevent scraping).

In [3]:
def get_driver():
    options = Options()
    options.add_argument("--incognito")
    options.add_argument("--headless")
    options.add_argument('--disable-browser-side-navigation')
    return webdriver.Firefox(options=options, executable_path='/home/skyw4rds/Téléchargements/geckodriver-v0.30.0-linux64/geckodriver')

The script that scrapes the result pages for a chosen subject and writes the collected information to a .csv file.

In [4]:
def scrape_cards(subject, page=10) -> None:
    with open(f'./data/{subject}/cards.csv', 'a') as f:
        for p in range(page):
            driver = get_driver()
            driver.get(f"https://fr.fiverr.com/search/gigs?query={subject}&source=pagination&search_in=everywhere&search-autocomplete-original-term={subject}&page={p}")

            if 'block.fiverr.com' in driver.current_url or '/404' in driver.current_url:
                print('blocked')
                driver.close()
                raise Exception

            elements = driver.find_elements(By.XPATH, "//div[contains(@class, 'gig-card-layout')]")
            for element in elements:
                try: 
                    name = element.find_element(By.XPATH, ".//div[contains(@class, 'seller-name')]/a").text
                    profil_link = element.find_element(By.XPATH, ".//div[contains(@class, 'seller-name')]/a").get_attribute('href')
                    description = element.find_element(By.XPATH, ".//h3/a").text.replace(';', ',')
                    ratingText = element.find_element(By.XPATH, ".//span[contains(@class, 'gig-rating')]").text
                    groups = re.match(r'(\d)(?:,)(\d)(?:\()(\d+)(?:\))', ratingText).groups()
                    note = int(groups[0])+int(groups[1])/10
                    price = int(element.find_element(By.XPATH, ".//a[contains(@class, 'price')]/span").text[:-2])/100
                    
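                except Exception as e:
                    # Skip a card that lacks a field or whose rating text does not match,
                    # and keep the driver open for the remaining cards on the page
                    print(f"skipped card -> {e}")
                    continue

                f.write(f"{name};{price};{int(groups[2])};{note};{profil_link};{description}\n")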
                
            driver.close()
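
The rating text is the most fragile part of the card parsing, so it can help to isolate it. Below is a minimal sketch of that step on its own, assuming the rating string looks like `4,9(123)` (a decimal comma followed by the review count in parentheses), as the regular expression above suggests. `parse_rating` is a hypothetical helper name; it returns `None` instead of raising when the text does not match.

```python
import re
from typing import Optional, Tuple

# Assumed format: "<digit>,<digit>(<count>)", e.g. "4,9(123)"
RATING_RE = re.compile(r'(\d),(\d)\((\d+)\)')

def parse_rating(text: str) -> Optional[Tuple[float, int]]:
    """Return (note, nb_comments), or None if the text does not match."""
    match = RATING_RE.match(text.strip())
    if match is None:
        return None
    units, tenths, count = match.groups()
    return float(f"{units}.{tenths}"), int(count)
```

Returning `None` lets the caller decide whether to skip the card or store it without a rating.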


In [5]:
# cards = scrape_cards(subject)

We load this information:

In [6]:
cards = pd.read_csv(f'./data/{subject}/cards.csv', sep=';', names=CARDS_COLUMNS) \
    .drop_duplicates(subset=['short_description', 'name'])
cards.shape

(76, 6)

We scrape every profile using the profile links stored in cards.csv.  
Because of scraping detection, it is hard to scrape everything in one go. This function lets us split the scraping into several runs by working out where the previous run stopped.

In [7]:
def get_start_index():
    with open(f'./data/{subject}/profils.csv', 'ab+') as file:
        if file.tell() == 0:
            return 0
        while file.read(1) != b'\n':
            try:
                file.seek(-1, os.SEEK_CUR)
                file.seek(-1, os.SEEK_CUR)
            except OSError:
                break
        last_profil_name = file.readline().decode('utf-8').split(';')[0]
        
    return cards.index[cards["name"] == last_profil_name][0] + 1
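
The resume logic depends on finding the name stored on the last line of the file. As an illustration only, here is a simpler sketch of that lookup that reads the whole file instead of seeking backwards, which is reasonable for small CSV files. `last_line_first_field` is a hypothetical helper, not part of the notebook.

```python
import os

def last_line_first_field(path: str, sep: str = ';') -> str:
    """Return the first field of the last non-empty line of a text file.

    Returns '' when the file is missing or empty.
    """
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return ''
    with open(path, 'rb') as f:
        lines = f.read().splitlines()
    # Walk backwards to skip trailing blank lines
    for raw in reversed(lines):
        if raw.strip():
            return raw.decode('utf-8').split(sep)[0]
    return ''
```

An empty result would mean starting again from index 0.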

The script that scrapes every profile and writes its information to a .csv file.

In [8]:
def scrape_profils(subject, start, end):
    with open(f'./data/{subject}/profils.csv', 'a') as f:
            
        for id in range(start, end):
            
            driver = get_driver()
            driver.get(cards.loc[id]['profil_link'])
            
            # The site may block the request; if the page or the profile link does not exist, it redirects to a 404 page or to the home page
            if 'block.fiverr.com' in driver.current_url:
                driver.close()
                raise Exception
                
            name = cards.loc[id]['name']
            f.write(f"{name}")
            
            if  '/404' in driver.current_url or len(driver.current_url) < 25:
                f.write("\n")
                # Close the browser before moving on so drivers do not pile up
                driver.close()
                continue

            stats = [i.text.strip() for i in driver.find_elements(By.XPATH, "//ul[contains(@class, 'user-stats')]/li/*[self::b or self::strong]")]
            location = stats[0]
            created_at = stats[1]
            response_time = stats[2]
            last_order = stats[3]
            languages = [f"{i.text[i.text.find('(')+1:i.text.find(')')]}-{i.text.split('-')[-1].strip()}" for i in driver.find_elements(By.XPATH, "//div[contains(@class, 'languages')]/ul/li")]
            linked_acc = [i.text for i in driver.find_elements(By.XPATH, "//div[contains(@class, 'linked-accounts')]/ul/li/span[@class='text']")]
            skills = [i.text for i in driver.find_elements(By.XPATH, "//div[contains(@class, 'skills')]/ul/li/a")]
            
            try:
                education = driver.find_element(By.XPATH, "//div[contains(@class, 'education-list')]/ul/li/p").text
            except Exception:
                education = ''
            description = driver.find_element(By.XPATH, "//div[contains(@class, 'description')]/p").text.replace(';', ',')
            
            f.write(f";{location};{created_at};{response_time};{last_order};{'|'.join(languages)};{'|'.join(linked_acc)};{'|'.join(skills)};{education};{description}\n")
            time.sleep(5)
            driver.close()
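
The rows above are assembled by hand, and semicolons in the description are replaced with commas so they do not break the `;` separator. As an alternative sketch (not the approach used in this notebook), Python's `csv` module can quote such fields instead of altering them. `format_row` is a hypothetical helper and the sample values are made up.

```python
import csv
import io

def format_row(fields, delimiter=';'):
    """Serialize one row; fields containing the delimiter are quoted."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator='\n')
    writer.writerow(fields)
    return buffer.getvalue()

# The description keeps its semicolon and is wrapped in double quotes
row = format_row(["alice", "France", "I build sites; fast"])

# Reading the line back restores the three original fields
parsed = next(csv.reader(io.StringIO(row), delimiter=';'))
```

`pd.read_csv(..., sep=';')` understands double-quoted fields by default, so the loading code would not need to change.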

In [9]:
# start = get_start_index()
# end = start + 100
# scrape_profils(subject, start, end)

Formatting the DataFrame

In [54]:
MONTH_CORRESPONDING = {
    'janv.': '01',
    'févr.': '02',
    'mars': '03',
    'avr.': '04',
    'mai': '05',
    'juin':'06',
    'juil.': '07',
    'août': '08',
    'sept.': '09',
    'oct.': '10',
    'nov.': '11',
    'déc.': '12',
}
# Raw string with an escaped dot: a month name with an optional trailing period, then a number
regex = re.compile(r'(\w+\.?) (\d+)')

def sigmoid(x,weight=1):
    return 1 / (1 + math.exp(-x/weight))

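# Build the stopword set once instead of rebuilding it for every token
FRENCH_STOPWORDS = set(stopwords.words('french'))
filtre_stopfr = lambda text: [token for token in text if token.lower() not in FRENCH_STOPWORDS]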
wordcloud = WordCloud(background_color="white", max_words=5000, contour_width=3, contour_color='steelblue')
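
The `sigmoid` helper is later combined with the rating as `note * sigmoid(nb_comments, 50)`. The sketch below shows how that weighting behaves on two made-up profiles; `score` is a hypothetical wrapper name and the numbers are illustrative, not scraped data.

```python
import math

def sigmoid(x: float, weight: float = 1) -> float:
    return 1 / (1 + math.exp(-x / weight))

def score(note: float, nb_comments: int, weight: float = 50) -> float:
    """Rating scaled by a confidence factor that grows with the review count."""
    return note * sigmoid(nb_comments, weight)

few = score(5.0, 2)     # perfect rating, 2 reviews    -> about 2.55
many = score(4.8, 300)  # lower rating, 300 reviews    -> about 4.79
```

Because `sigmoid(0)` is 0.5, the factor ranges from 0.5 up to (but never reaching) 1: a profile with no reviews keeps half of its rating, and with `weight=50` the factor is already above 0.99 around 250 reviews.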

In [11]:
def preprossessing(row: pd.Series):
    
    ## FROM skills TO skill_1, skill_2, ...
    try:
        skills = row["skills"].split("|")
    except AttributeError:
        skills = []
    
    for i, skill in enumerate(skills):
        row[f"skill_{i+1}"] = skill
    
    ## FROM linked_accs TO linked_acc_1, linked_acc_2, ...
    try:
        linked_accs = row["linked_acc"].split("|")
    except AttributeError:
        linked_accs = []
    
    for i, linked_acc in enumerate(linked_accs):
        row[f"linked_acc_{i+1}"] = linked_acc
    
    ## FROM languages TO language_1, language_2, ...
    ## ADD level_1, level_2, ...
    try:
        languages = row["languages"].split("|")
    except AttributeError:
        languages = []
        
    languages = [tuple(language.split('-', 1)) for language in languages]
    
    for i, (language, level) in enumerate(languages):
        if '/' in level: level = level.split('/')[0].strip()
        row[f"language_{i+1}"] = language
        row[f"level_{i+1}"] = level

    ## FROM "mai 2005" TO "2005-05" (parsed into a date later by pd.to_datetime)
    row["created_at"] = re.sub(regex, lambda x: f"{x.group(2)}-{MONTH_CORRESPONDING[x.group(1)]}", row["created_at"])
        
    ## FROM response_time "3 heures" TO response_time_day, response_time_hour
    response_time = re.match(r'(\d+) (\w+)', row["response_time"]).groups()
    if 'jour' in response_time[1]:
        row["response_time_day"] = response_time[0]
        row["response_time_hour"] = 0
    else:
        row["response_time_day"] = 0
        row["response_time_hour"] = response_time[0]
    
    ## FROM last_order "3 jours" TO last_order_day, last_order_hour
    last_order = re.match(r'(\d+) (\w+)', row["last_order"]).groups()
    if 'jour' in last_order[1]:
        row["last_order_day"] = last_order[0]
        row["last_order_hour"] = 0
    else:
        row["last_order_day"] = 0
        row["last_order_hour"] = last_order[0]
    
    ## ADD score 
    row['score'] = row['note'] * sigmoid(row['nb_comments'], 50)
    
    ## Drop columns
    row.drop(["languages", "linked_acc", "skills", "response_time", "last_order"], inplace=True)

    return row
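
The response-time and last-order fields go through the same transformation, so it can be factored out. Here is a minimal sketch, assuming the text looks like `3 heures` or `2 jours`; `split_duration` is a hypothetical helper. As in the function above, any unit that does not contain `jour` is treated as hours.

```python
import re

def split_duration(text: str):
    """Split a French duration such as '3 heures' or '2 jours' into (days, hours)."""
    match = re.match(r'(\d+) (\w+)', text)
    if match is None:
        # Unparseable text: report zero rather than raising
        return 0, 0
    value, unit = int(match.group(1)), match.group(2)
    return (value, 0) if 'jour' in unit else (0, value)
```

Returning integers directly would also make the later `astype('int')` conversion unnecessary for these columns.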

In [12]:
def convert_type_column(df: pd.DataFrame) -> pd.DataFrame:
    df["created_at"] = pd.to_datetime(df["created_at"])
    df["last_order_day"] = df["last_order_day"].astype('int')
    df["last_order_hour"] = df["last_order_hour"].astype('int')
    df["response_time_hour"] = df["response_time_hour"].astype('int')
    df["response_time_day"] = df["response_time_day"].astype('int')
    
    return df
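
Before `pd.to_datetime` can parse `created_at`, the French month abbreviation has to be turned into a number. The sketch below isolates that step with the same month table as above; `to_year_month` is a hypothetical helper, and leaving unknown month names untouched is an assumption of this sketch (the notebook code would raise a `KeyError` instead).

```python
import re

MONTHS = {
    'janv.': '01', 'févr.': '02', 'mars': '03', 'avr.': '04',
    'mai': '05', 'juin': '06', 'juil.': '07', 'août': '08',
    'sept.': '09', 'oct.': '10', 'nov.': '11', 'déc.': '12',
}
DATE_RE = re.compile(r'(\w+\.?) (\d+)')

def to_year_month(text: str) -> str:
    """Rewrite 'mai 2005' as '2005-05'; unknown month names are left as-is."""
    def repl(m):
        month = MONTHS.get(m.group(1))
        return f"{m.group(2)}-{month}" if month else m.group(0)
    return DATE_RE.sub(repl, text)
```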

In [13]:
profils = pd.read_csv(f'./data/{subject}/profils.csv', sep=';', names=PROFILS_COLUMS)

Data analysis

In [52]:
def handle_skill(df: pd.DataFrame):
    skills = Counter()
    for rowIndex, row in df.filter(regex=r'skill_\d+').iterrows():
        for columnIndex, value in row.items():
            if not pd.isnull(value): skills[value] += 1

    return skills.most_common(10)

def handle_languages(df: pd.DataFrame):
    languages = Counter()
    for rowIndex, row in df.filter(regex=r'language_\d+').iterrows():
        for columnIndex, value in row.items():
            if not pd.isnull(value): languages[value] += 1

    return languages.most_common(10)

def handle_note(df: pd.DataFrame):
    print(pd.cut(df['note'], bins=np.linspace(0, 5, num=50)))

def get_best(df: pd.DataFrame, nb: int) -> pd.DataFrame:
    return df.sort_values(by='score', ascending=False).iloc[:nb]
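
`handle_skill` and `handle_languages` differ only in the column prefix they look at. As a sketch of one possible generalization, the counting can be done without explicit loops by melting the `<prefix>_N` columns into a single column. `top_values` is a hypothetical helper and the sample frame is made up.

```python
import pandas as pd

def top_values(df: pd.DataFrame, prefix: str, n: int = 10):
    """Count non-null values across all columns named '<prefix>_<number>'."""
    wide = df.filter(regex=rf'^{prefix}_\d+$')
    # melt() stacks the selected columns into one 'value' column;
    # value_counts() ignores missing values and sorts by frequency
    counts = wide.melt(value_name='value')['value'].value_counts()
    return list(counts.head(n).items())

sample = pd.DataFrame({
    'skill_1': ['JS', 'JS', 'Python'],
    'skill_2': ['Python', None, 'JS'],
    'other': [1, 2, 3],
})
top_skills = top_values(sample, 'skill')
```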

In [16]:
def normalize_text(text):
    tokenizer = RegexpTokenizer(r'\w+')
    stemmer = FrenchStemmer()

    # remove accents
    text = unidecode.unidecode(str(text))
    
    # remove punctuation
    text = " ".join(tokenizer.tokenize(text))
    
    # remove useless words
    words = filtre_stopfr(word_tokenize(text, language="french"))
    
    # normalize words
    words = " ".join([stemmer.stem(w) for w in words])

    return words

In [30]:
def get_div(df: pd.DataFrame, category: str):

    skills = handle_skill(df)
    languages = handle_languages(df)
    profils = get_best(df, 3)

    category_title = html.H2(category.title(), style={'font-size': '1.7rem', 'text-align': 'center'})

    best_profils_title = html.H3('💪 Meilleurs profils 💪', style={'font-size': '1.5rem', 'text-align': 'center'})

    graph_skills = html.Div(dcc.Graph(
        figure=px.bar(df, x=[i[0] for i in skills], y=[i[1] for i in skills], barmode="group", title='Compétences', labels={'x': '', 'y': ''})
    ))

    graph_languages = html.Div(dcc.Graph(
        figure=px.bar(df, x=[i[0] for i in languages], y=[i[1] for i in languages], barmode="group", title='Langues', labels={'x': '', 'y': ''})
    ))


    txt = normalize_text(' '.join(df['description'].to_list()))
    wordcloud.generate(txt)
    wordcloud.to_file(filename='images/words.png')
    
    encoded_image = base64.b64encode(open('images/words.png', 'rb').read())
    image_words = html.Div(html.Img(src=f'data:image/png;base64,{encoded_image.decode()}'), style={'margin': '20px 0'})

    profils_div = []
    for index, profil in profils.iterrows():
        
        div = html.Div([
            html.Span(profil['name'], style={'font-size': '1.2rem'}),
            html.Span(f"⭐ {profil['note']} ({profil['nb_comments']})", style={'margin': '5px 0'}),
            html.Span(f"🏆 Score : {round(profil['score'], 3)}", style={'margin': '5px 0'}),
            html.Span(profil['description']),
            html.A('Lien du profil', href=profil['profil_link'], style={'font-size': '1.2rem'})
        ], style={'margin': '0 20px', 'padding': '10px', 'border': '1px solid white', 'width': '400px', 'display': 'flex', 'flex-direction': 'column', 'align-items': 'center'})

        profils_div.append(div)

    profils_render = html.Div(profils_div, style={'display': 'flex'})

    return html.Div(children=[
        category_title,
        best_profils_title,
        profils_render,
        image_words,
        graph_skills,
        graph_languages,
    ], style={'color': 'white', 'display': 'flex', 'flex-direction': 'column', 'align-items': 'center'})

I create a Dash server to visualize some of the information.

In [18]:
app = JupyterDash(__name__)

In [19]:
categories = ['javascript']

In [20]:
dfs = []
for category in categories:

    profils = pd.read_csv(f'./data/{category}/profils.csv', sep=';', names=PROFILS_COLUMS)
    
    new_profils = (profils
        .merge(cards, how='left', on="name") 
        .apply(preprossessing, axis=1) 
    )
    new_profils = convert_type_column(new_profils)

    dfs.append(new_profils)

In [53]:
handle_note(dfs[0])

0      (4.898, 5.0]
1      (4.898, 5.0]
2      (4.898, 5.0]
3      (4.898, 5.0]
4      (4.898, 5.0]
5      (4.898, 5.0]
6      (4.898, 5.0]
7      (4.898, 5.0]
8      (4.898, 5.0]
9      (4.898, 5.0]
10     (4.898, 5.0]
11     (4.898, 5.0]
12     (4.898, 5.0]
13     (4.898, 5.0]
14     (4.898, 5.0]
15     (4.898, 5.0]
16     (4.898, 5.0]
17     (4.898, 5.0]
18     (4.898, 5.0]
19     (4.898, 5.0]
20     (4.898, 5.0]
21     (4.898, 5.0]
22     (4.898, 5.0]
23     (4.898, 5.0]
24     (4.898, 5.0]
25     (4.898, 5.0]
26     (4.898, 5.0]
27     (4.898, 5.0]
28     (4.898, 5.0]
29     (4.898, 5.0]
30     (4.898, 5.0]
31     (4.898, 5.0]
32    (4.388, 4.49]
33     (4.898, 5.0]
34     (4.898, 5.0]
35     (4.898, 5.0]
36     (4.898, 5.0]
37     (4.898, 5.0]
38     (4.898, 5.0]
39     (4.898, 5.0]
40     (4.898, 5.0]
41     (4.898, 5.0]
42     (4.898, 5.0]
43     (4.898, 5.0]
44     (4.898, 5.0]
45     (4.898, 5.0]
46     (4.898, 5.0]
47     (4.898, 5.0]
48     (4.898, 5.0]
Name: note, dtype: category

In [31]:
# @app.callback(
#     Output('dd-output-container', 'children'),
#     Input('demo-dropdown', 'value'),
# )
# def update_output(value: str):
#     return f'{value.title()}'

app.layout = html.Div(children=[get_div(df, category) for df, category in zip(dfs, categories)])

if __name__ == '__main__':
    app.run_server(
        debug=True,
        mode='inline'
    )

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
5     NaN
6     NaN
7     NaN
8     NaN
9     NaN
10    NaN
11    NaN
12    NaN
13    NaN
14    NaN
15    NaN
16    NaN
17    NaN
18    NaN
19    NaN
20    NaN
21    NaN
22    NaN
23    NaN
24    NaN
25    NaN
26    NaN
27    NaN
28    NaN
29    NaN
30    NaN
31    NaN
32    NaN
33    NaN
34    NaN
35    NaN
36    NaN
37    NaN
38    NaN
39    NaN
40    NaN
41    NaN
42    NaN
43    NaN
44    NaN
45    NaN
46    NaN
47    NaN
48    NaN
Name: note, dtype: category
Categories (0, interval[int64, right]): []
