# Data Analysis of Comeup Website Services

This notebook analyzes web development services data scraped from Comeup.com, a French freelance marketplace. The dataset contains information about various web development services offered by freelancers, including pricing, ratings, technologies used, and service categories.

## Dataset Overview

The dataset includes **50 services** from different vendors with the following key attributes:
- **Vendor information**: Names and service descriptions
- **Pricing data**: Service costs in GBP
- **Quality metrics**: Ratings and number of reviews
- **Technical details**: Technologies and platforms used
- **Service categorization**: Type of web development service offered

## Key Insights Preview

From the initial data exploration, we can observe:
- Starting prices range from £12.64 for the cheapest listing to £674.11 for the most expensive
- Every rated service scores between 4.9 and 5.0 stars
- WordPress dominates as the primary technology platform, mentioned in 28 of the 49 cleaned listings
- Services span multiple categories including website development, web design, and SEO optimization

In [7]:
# Import necessary libraries
import pandas as pd
from collections import Counter
import re

In [None]:
# Set data directory path and load the dataset
data_raw_dir = '../../../data/raw/'

# Load the dataset directly into a DataFrame
df = pd.read_csv(data_raw_dir + 'comeup-category-site-vitrine.csv')
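The concatenation above works only because `data_raw_dir` ends with a slash. A minimal sketch of building the same path with `pathlib`, assuming the directory layout shown in the cell; the helper name `build_raw_path` is hypothetical and not part of the notebook:

```python
from pathlib import Path


def build_raw_path(base_dir: str, filename: str) -> Path:
    """Join a directory and a file name; a trailing slash on base_dir is optional."""
    return Path(base_dir) / filename


raw_path = build_raw_path("../../../data/raw/", "comeup-category-site-vitrine.csv")
print(raw_path.name)  # comeup-category-site-vitrine.csv

# Assumed usage in this notebook: df = pd.read_csv(raw_path)
```

`pd.read_csv` accepts a `Path` object, so the rest of the notebook would not need to change.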

In [8]:
# Drop scraper-artifact columns that hold irrelevant or null values
df = df.drop(columns=['colorGrey350 2', 'colorGrey350 3', 'colorGrey350 4', 'colorGrey350 5', 'colorGrey200', 'colorGrey200 2', 'affiche'])

# Report the columns that remain
print(f"Remaining columns: {df.columns.tolist()}")
print(f"Shape after dropping: {df.shape}")

Remaining columns: ['d-block src', 'me-3 src', 'vendeur', 'description', 'stretched-link href', 'mb-1', 'note', 'nb-rate']
Shape after dropping: (50, 8)
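The dropped columns are named after CSS classes picked up by the scraper. As a sketch, they could be selected by name pattern instead of being listed by hand. The pattern, the helper name `columns_to_drop`, and the column order in `scraped_columns` are assumptions based only on the names visible above:

```python
import re

# Assumption: artifact columns look like 'colorGrey350 2' or 'colorGrey200',
# i.e. a CSS colour class optionally followed by a space and a counter.
ARTIFACT_RE = re.compile(r"^colorGrey\d+(?: \d+)?$")


def columns_to_drop(columns, extra=("affiche",)):
    """Return the names that match the artifact pattern or are listed in `extra`."""
    return [c for c in columns if ARTIFACT_RE.match(c) or c in extra]


# Illustrative column list; the real order in the CSV is not known.
scraped_columns = [
    "d-block src", "me-3 src", "vendeur", "description",
    "stretched-link href", "mb-1", "note", "nb-rate",
    "colorGrey350 2", "colorGrey350 3", "colorGrey350 4",
    "colorGrey350 5", "colorGrey200", "colorGrey200 2", "affiche",
]
print(columns_to_drop(scraped_columns))

# Assumed usage: df = df.drop(columns=columns_to_drop(df.columns))
```

A pattern-based selection keeps working if a later scrape produces a different number of `colorGrey…` columns, at the cost of hiding exactly which columns were removed.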


In [9]:
# Rename columns for better readability
df = df.rename(columns={
    'd-block src': 'Image_URL',
    'me-3 src': 'Vendor_Image_URL',
    'vendeur': 'Vendor_Name',
    'description': 'Service_Description',
    'stretched-link href': 'Service_Link',
    'mb-1': 'Price',
    'note': 'Rating',
    'nb-rate': 'Number_of_Ratings'
})

print("Renamed columns:")
print(df.columns.tolist())

Renamed columns:
['Image_URL', 'Vendor_Image_URL', 'Vendor_Name', 'Service_Description', 'Service_Link', 'Price', 'Rating', 'Number_of_Ratings']


In [10]:
# Check for missing values before cleaning
print("Missing values per column:")
print(df.isnull().sum())
print(f"\nTotal number of rows: {len(df)}")

Missing values per column:
Image_URL              0
Vendor_Image_URL       0
Vendor_Name            0
Service_Description    0
Service_Link           0
Price                  0
Rating                 1
Number_of_Ratings      1
dtype: int64

Total number of rows: 50


In [11]:
# Drop rows with missing values in the critical columns
colonnes_critiques = ['Vendor_Name', 'Service_Description', 'Price', 'Rating', 'Number_of_Ratings']
df = df.dropna(subset=colonnes_critiques)

print(f"Rows remaining after dropping: {len(df)}")
print("Missing values after cleaning:")
print(df.isnull().sum())

Rows remaining after dropping: 49
Missing values after cleaning:
Image_URL              0
Vendor_Image_URL       0
Vendor_Name            0
Service_Description    0
Service_Link           0
Price                  0
Rating                 0
Number_of_Ratings      0
dtype: int64


In [12]:
# Convert the price to a numeric value
print("Sample prices before cleaning:")
print(df['Price'].head(10))

# Keep the first number in the string, then swap the decimal comma for a dot
df['Price'] = (
    df['Price']
    .str.extract(r'(\d+(?:[.,]\d+)?)', expand=False)
    .str.replace(',', '.', regex=False)
    .astype(float)
)

print("\nPrices after cleaning:")
print(df['Price'].head(10))
print(f"Min price: {df['Price'].min()}, Max price: {df['Price'].max()}")

Sample prices before cleaning:
1     À partir de 328,63 £GB
2     À partir de 332,84 £GB
3      À partir de 42,13 £GB
4     À partir de 328,63 £GB
5     À partir de 589,85 £GB
6     À partir de 674,11 £GB
7     À partir de 164,31 £GB
8     À partir de 164,31 £GB
9     À partir de 417,11 £GB
10    À partir de 210,66 £GB
Name: Price, dtype: object

Prices after cleaning:
1     328.63
2     332.84
3      42.13
4     328.63
5     589.85
6     674.11
7     164.31
8     164.31
9     417.11
10    210.66
Name: Price, dtype: float64
Min price: 12.64, Max price: 674.11
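The price strings follow the pattern "À partir de 328,63 £GB": a French label, a number with a decimal comma, and a currency suffix. The parsing step can be sketched as a standalone function that is easy to test on individual strings. The function name `parse_price` and the "Sur devis" input are hypothetical; only the "À partir de …" format comes from the data above:

```python
import re

# One or more digits, optionally followed by a decimal part introduced by ',' or '.'
PRICE_RE = re.compile(r"\d+(?:[.,]\d+)?")


def parse_price(text):
    """Return the first number in `text` as a float, or None if there is none."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    return float(match.group().replace(",", "."))


print(parse_price("À partir de 328,63 £GB"))  # 328.63
print(parse_price("À partir de 42,13 £GB"))   # 42.13
print(parse_price("Sur devis"))               # None
```

This sketch does not handle thousands separators: a string such as "1 200,00" would be read as `1.0`. No such price appears in the sample shown, but it is worth checking before reusing the pattern on a larger scrape.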


In [13]:
# Clean the Rating column
print("Sample ratings before cleaning:")
print(df['Rating'].head(10))

# Swap the decimal comma for a dot (literal replacement, not a regex)
df['Rating'] = df['Rating'].str.replace(',', '.', regex=False).astype(float)

print("\nRatings after cleaning:")
print(df['Rating'].head(10))
print(f"Rating min: {df['Rating'].min()}, Rating max: {df['Rating'].max()}")

Sample ratings before cleaning:
1     5,0
2     5,0
3     5,0
4     5,0
5     5,0
6     4,9
7     5,0
8     5,0
9     5,0
10    5,0
Name: Rating, dtype: object

Ratings after cleaning:
1     5.0
2     5.0
3     5.0
4     5.0
5     5.0
6     4.9
7     5.0
8     5.0
9     5.0
10    5.0
Name: Rating, dtype: float64
Rating min: 4.9, Rating max: 5.0


In [14]:
# Extract the numeric value from Number_of_Ratings
print("Sample Number_of_Ratings values before cleaning:")
print(df['Number_of_Ratings'].head(10))

# expand=False returns a Series rather than a one-column DataFrame
df['Number_of_Ratings'] = df['Number_of_Ratings'].str.extract(r'(\d+)', expand=False).astype(int)

print("\nNumber_of_Ratings after cleaning:")
print(df['Number_of_Ratings'].head(10))
print(f"Number of ratings min: {df['Number_of_Ratings'].min()}, max: {df['Number_of_Ratings'].max()}")

Sample Number_of_Ratings values before cleaning:
1     (141)
2     (384)
3      (10)
4      (83)
5      (33)
6      (46)
7      (32)
8     (281)
9      (43)
10     (78)
Name: Number_of_Ratings, dtype: object

Number_of_Ratings after cleaning:
1     141
2     384
3      10
4      83
5      33
6      46
7      32
8     281
9      43
10     78
Name: Number_of_Ratings, dtype: int32
Number of ratings min: 1, max: 384


In [15]:
# Reset the index
df = df.reset_index(drop=True)
print(f"Final dataframe shape: {df.shape}")
print("\nFirst rows of the cleaned dataframe:")
df.head()

Final dataframe shape: (49, 8)

First rows of the cleaned dataframe:

| | Image_URL | Vendor_Image_URL | Vendor_Name | Service_Description | Service_Link | Price | Rating | Number_of_Ratings |
|---|---|---|---|---|---|---|---|---|
| 0 | https://thumbor.comeup.com/nANntK1xwVY8UAdLIRO... | https://thumbor.comeup.com/QfSpUO0DaPYjoMIGGxh... | UpWeb_Agency | Je vais créer votre site web, vitrine WordPres... | https://comeup.com/fr/service/143078/creer-vot... | 328.63 | 5.0 | 141 |
| 1 | https://thumbor.comeup.com/I6QTBL8dkCj0DS33mRl... | https://thumbor.comeup.com/W-_IGvlC1p2f-mPTSBy... | Caroline_WordPress | Je vais créer votre site web WordPress personn... | https://comeup.com/fr/service/141075/creer-vot... | 332.84 | 5.0 | 384 |
| 2 | https://thumbor.comeup.com/aUORgiOzZB1wCeSXsTU... | https://thumbor.comeup.com/k1c7EOqwSJshNHpmLq3... | Netcoaching | Je vais créer ou faire la refonte Premium de v... | https://comeup.com/api/stats/click/9b27b3f6907... | 42.13 | 5.0 | 10 |
| 3 | https://thumbor.comeup.com/Niq9MJKl4hYZ9e0LbRx... | https://thumbor.comeup.com/ssg3Cav2fLmtJe3IzJT... | EcomDev | Je vais créer votre site web optimisé SEO avec... | https://comeup.com/fr/service/270709/creer-vot... | 328.63 | 5.0 | 83 |
| 4 | https://thumbor.comeup.com/NhWS2Q82LvT_EO1_1Kp... | https://thumbor.comeup.com/MKqdGSoYYMaJaJhUuuQ... | hbconsultant | Je vais vous aider à développer votre site des... | https://comeup.com/fr/service/55271/vous-aider... | 589.85 | 5.0 | 33 |


In [16]:
# Service categorization function
def categorize_service(description):
    # Keywords are matched case-sensitively and the first match wins,
    # so a description mentioning both WordPress and SEO is "Website Development".
    if "WordPress" in description:
        return "Website Development"
    elif "SEO" in description:
        return "SEO Optimization"
    elif "Elementor" in description or "Divi" in description:
        return "Web Design"
    else:
        return "Other"

# Apply the categorization
df['Category'] = df['Service_Description'].apply(categorize_service)

print("Category distribution:")
print(df['Category'].value_counts())

Category distribution:
Category
Website Development    21
Other                  21
Web Design              5
SEO Optimization        2
Name: count, dtype: int64
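Because `categorize_service` compares keywords with exact case, a description spelled "wordpress" would fall through to "Other". The sketch below shows one possible case-insensitive variant with the same first-match-wins order. The names `CATEGORY_RULES` and `categorize_service_ci` and the sample sentences are hypothetical, and whether this variant would change the counts above has not been checked:

```python
import re

# Ordered rules: the first matching pattern wins, mirroring the if/elif chain.
# Word boundaries matter once case is ignored: a bare "divi" substring
# would otherwise match the French word "individuel".
CATEGORY_RULES = [
    ("Website Development", re.compile(r"wordpress", re.IGNORECASE)),
    ("SEO Optimization", re.compile(r"\bseo\b", re.IGNORECASE)),
    ("Web Design", re.compile(r"\b(?:elementor|divi)\b", re.IGNORECASE)),
]


def categorize_service_ci(description):
    """Return the first category whose pattern matches, else 'Other'."""
    for category, pattern in CATEGORY_RULES:
        if pattern.search(description):
            return category
    return "Other"


print(categorize_service_ci("Je vais créer votre site wordpress"))  # Website Development
print(categorize_service_ci("Un accompagnement individuel"))        # Other
```

Unlike the original substring test, the SEO rule here also uses word boundaries, which is a deliberate difference in this sketch.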


In [17]:
# Select the columns to keep for the market study
columns_to_keep = ["Vendor_Name", "Service_Description", "Price", "Rating", "Number_of_Ratings", "Category"]
df = df[columns_to_keep]

print(f"Final columns: {df.columns.tolist()}")
print(f"Final shape before technology analysis: {df.shape}")

Final columns: ['Vendor_Name', 'Service_Description', 'Price', 'Rating', 'Number_of_Ratings', 'Category']
Final shape before technology analysis: (49, 6)


In [18]:
# Dictionary of technologies and their spelling variants (regex patterns)
tech_patterns = {
    'WordPress': r'\b(wordpress|wp)\b',
    'Webflow': r'\b(webflow)\b',
    'Divi': r'\b(divi)\b', 
    'Elementor': r'\b(elementor)\b',
    'Wix': r'\b(wix)\b',
    'Drupal': r'\b(drupal)\b',
    'NextJs': r'\b(nextjs|next\.js|next js)\b',
    'Framer': r'\b(framer)\b',
    'Django': r'\b(django)\b',
    'Woocommerce': r'\b(woocommerce|woo commerce)\b',
    'HTML': r'\b(html|html5)\b',
    'CSS': r'\b(css|css3)\b',
    'JavaScript': r'\b(javascript|js)\b',
    'PHP': r'\b(php)\b',
    'React': r'\b(react|reactjs)\b',
    'Vue': r'\b(vue|vuejs)\b'
}

print(f"Technologies to detect: {list(tech_patterns.keys())}")

Technologies to detect: ['WordPress', 'Webflow', 'Divi', 'Elementor', 'Wix', 'Drupal', 'NextJs', 'Framer', 'Django', 'Woocommerce', 'HTML', 'CSS', 'JavaScript', 'PHP', 'React', 'Vue']


In [19]:
def categorize_techno_improved(description):
    """Return every technology found in the description."""
    if pd.isna(description):
        return ['Unprecised']
    
    # Special case: the listing explicitly excludes WordPress
    if re.search(r'\b(pas de wordpress|sans wordpress|no wordpress)\b', description, re.IGNORECASE):
        return ['Pas de WordPress']
    
    found_techs = []
    description_lower = description.lower()
    
    for tech, pattern in tech_patterns.items():
        if re.search(pattern, description_lower, re.IGNORECASE):
            found_techs.append(tech)
    
    return found_techs if found_techs else ['Unprecised']

print("Technology categorization function defined")

Technology categorization function defined


In [20]:
# Apply the technology categorization
df['Techno_List'] = df['Service_Description'].apply(categorize_techno_improved)

print("Sample of detected technologies:")
for i in range(min(5, len(df))):
    print(f"Service {i+1}: {df.iloc[i]['Techno_List']}")

Sample of detected technologies:
Service 1: ['WordPress', 'Divi']
Service 2: ['WordPress', 'Divi', 'Elementor']
Service 3: ['Divi', 'Elementor']
Service 4: ['WordPress']
Service 5: ['Unprecised']
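The early return in `categorize_techno_improved` means that a description containing "sans WordPress" is labelled `['Pas de WordPress']` and any other technology it mentions is dropped. One possible alternative, sketched here with a subset of the patterns, reports the exclusion as a separate flag and still detects the remaining technologies. The names `detect_technologies`, `TECH_PATTERNS`, and `NEGATION_RE` and the sample sentences are hypothetical:

```python
import re

# Subset of the notebook's patterns, precompiled once with IGNORECASE.
TECH_PATTERNS = {
    "WordPress": re.compile(r"\b(?:wordpress|wp)\b", re.IGNORECASE),
    "Elementor": re.compile(r"\belementor\b", re.IGNORECASE),
    "PHP": re.compile(r"\bphp\b", re.IGNORECASE),
}
NEGATION_RE = re.compile(r"\b(?:pas de|sans|no) wordpress\b", re.IGNORECASE)


def detect_technologies(description):
    """Return (technologies, excludes_wordpress) for one description."""
    excludes_wordpress = NEGATION_RE.search(description) is not None
    # Remove the negated mention so it is not counted as a WordPress match.
    cleaned = NEGATION_RE.sub(" ", description)
    found = [name for name, pattern in TECH_PATTERNS.items() if pattern.search(cleaned)]
    return found, excludes_wordpress


print(detect_technologies("Site vitrine sans WordPress, codé en PHP"))
# (['PHP'], True)
print(detect_technologies("Site WordPress avec Elementor"))
# (['WordPress', 'Elementor'], False)
```

Returning an empty list instead of a placeholder string also keeps "no technology found" distinguishable from a real technology in later counts.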


In [21]:
# Create the technology analysis columns
df['Techno_Main'] = df['Techno_List'].apply(
    lambda x: x[0] if len(x) == 1 else 'Multiple' if len(x) > 1 else 'Unprecised'
)

# Note: placeholder labels such as 'Unprecised' also count as one entry,
# so the single-label total below includes the services with no technology specified.
df['Techno_Count'] = df['Techno_List'].apply(len)
df['Is_Multi_Tech'] = df['Techno_Count'] > 1

print(f"Services with a single technology label: {sum(df['Techno_Count'] == 1)}")
print(f"Multi-technology services: {sum(df['Is_Multi_Tech'])}")
print(f"Services with no technology specified: {sum(df['Techno_List'].apply(lambda x: x == ['Unprecised']))}")

Services with a single technology label: 38
Multi-technology services: 11
Services with no technology specified: 10
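The 38 single-label services include the 10 rows labelled `['Unprecised']`, because `len` counts the placeholder as one entry. A minimal sketch of a count that ignores placeholder labels; the helper `real_tech_count`, the `PLACEHOLDERS` set, and the sample lists are illustrative assumptions:

```python
# Labels produced by the categorization function that do not name a technology.
PLACEHOLDERS = {"Unprecised", "Pas de WordPress"}


def real_tech_count(techno_list):
    """Count entries that name an actual technology, ignoring placeholder labels."""
    return sum(1 for tech in techno_list if tech not in PLACEHOLDERS)


samples = [["WordPress", "Divi"], ["WordPress"], ["Unprecised"], ["Pas de WordPress"]]
print([real_tech_count(t) for t in samples])  # [2, 1, 0, 0]

# Assumed usage: df['Techno_Count'] = df['Techno_List'].apply(real_tech_count)
```

With this count, "exactly one technology" and "no technology specified" become separate groups instead of overlapping ones.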


In [22]:
# Technology statistics: flatten all lists and count each label
all_technologies = [tech for tech_list in df['Techno_List'] for tech in tech_list]
tech_counts = Counter(all_technologies)

print("Most used technologies:")
for tech, count in tech_counts.most_common():
    print(f"{tech}: {count}")

Most used technologies:
WordPress: 28
Unprecised: 10
Divi: 8
Elementor: 6
Wix: 2
NextJs: 2
HTML: 1
CSS: 1
JavaScript: 1
PHP: 1
Pas de WordPress: 1
Webflow: 1
Drupal: 1
Framer: 1
Django: 1
Woocommerce: 1


In [23]:
# Multi-technology services
multi_tech_services = df[df['Is_Multi_Tech']]
print(f"Number of multi-tech services: {len(multi_tech_services)}")

# Most frequent combinations
if len(multi_tech_services) > 0:
    combinations = multi_tech_services['Techno_List'].apply(
        lambda x: ', '.join(sorted(x))
    ).value_counts()
    
    print("\nMost frequent combinations:")
    print(combinations.head())
else:
    print("No technology combinations found")

Number of multi-tech services: 11

Most frequent combinations:
Techno_List
Divi, Elementor, WordPress    3
Divi, WordPress               2
Divi, Elementor               2
CSS, HTML, JavaScript, PHP    1
Drupal, Wix, WordPress        1
Name: count, dtype: int64
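Counting full combinations treats "Divi, Elementor, WordPress" and "Divi, WordPress" as unrelated, even though both contain the Divi and WordPress pair. As a complementary sketch, pairwise co-occurrence can be counted with `itertools.combinations`. The lists below are illustrative values shaped like the `Techno_List` column, not the real data:

```python
from collections import Counter
from itertools import combinations

# Illustrative lists only, shaped like the Techno_List column.
techno_lists = [
    ["WordPress", "Divi"],
    ["WordPress", "Divi", "Elementor"],
    ["Divi", "Elementor"],
    ["WordPress"],
]

pair_counts = Counter()
for techs in techno_lists:
    # sorted() makes ('Divi', 'WordPress') and ('WordPress', 'Divi') the same key
    pair_counts.update(combinations(sorted(techs), 2))

for pair, count in pair_counts.most_common():
    print(pair, count)

# Assumed usage: iterate over df['Techno_List'] instead of techno_lists
```

Lists with a single entry contribute no pairs, so single-technology services are ignored automatically.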


In [25]:
# Final overview of the dataframe
print("Final columns:")
print(df.columns.tolist())
print(f"\nFinal shape: {df.shape}")
print("\nFirst rows:")

df.head(10)

Final columns:
['Vendor_Name', 'Service_Description', 'Price', 'Rating', 'Number_of_Ratings', 'Category', 'Techno_List', 'Techno_Main', 'Techno_Count', 'Is_Multi_Tech']

Final shape: (49, 10)

First rows:

| | Vendor_Name | Service_Description | Price | Rating | Number_of_Ratings | Category | Techno_List | Techno_Main | Techno_Count | Is_Multi_Tech |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | UpWeb_Agency | Je vais créer votre site web, vitrine WordPres... | 328.63 | 5.0 | 141 | Website Development | [WordPress, Divi] | Multiple | 2 | True |
| 1 | Caroline_WordPress | Je vais créer votre site web WordPress personn... | 332.84 | 5.0 | 384 | Website Development | [WordPress, Divi, Elementor] | Multiple | 3 | True |
| 2 | Netcoaching | Je vais créer ou faire la refonte Premium de v... | 42.13 | 5.0 | 10 | Web Design | [Divi, Elementor] | Multiple | 2 | True |
| 3 | EcomDev | Je vais créer votre site web optimisé SEO avec... | 328.63 | 5.0 | 83 | Website Development | [WordPress] | WordPress | 1 | False |
| 4 | hbconsultant | Je vais vous aider à développer votre site des... | 589.85 | 5.0 | 33 | Other | [Unprecised] | Unprecised | 1 | False |
| 5 | gataka_web | Je vais créer un site internet professionnel a... | 674.11 | 4.9 | 46 | Website Development | [WordPress] | WordPress | 1 | False |
| 6 | Franki1607 | Je vais coder votre site professionnel avec ht... | 164.31 | 5.0 | 32 | Other | [HTML, CSS, JavaScript, PHP] | Multiple | 4 | True |
| 7 | SolutionWeb | Je vais créer votre site web sur mesure optimi... | 164.31 | 5.0 | 281 | Website Development | [WordPress] | WordPress | 1 | False |
| 8 | ConsultantWeb | Je vais créer un site web professionnel | 417.11 | 5.0 | 43 | Other | [Unprecised] | Unprecised | 1 | False |
| 9 | horizonplus | Je vais créer votre site web ou refondre votre... | 210.66 | 5.0 | 78 | Website Development | [WordPress] | WordPress | 1 | False |


In [26]:
data_processed_dir = "../../../data/processed/"

# Save the cleaned and enriched DataFrame
df.to_csv(data_processed_dir + 'comeup-category-site-vitrine-cleaned.csv', index=False)
print(f"DataFrame saved to {data_processed_dir + 'comeup-category-site-vitrine-cleaned.csv'}")

DataFrame saved to ../../../data/processed/comeup-category-site-vitrine-cleaned.csv
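One caveat when reloading this file: CSV has no list type, so the `Techno_List` column is written as the text form of each list (for example `"['WordPress', 'Divi']"`) and comes back as a plain string. The sketch below reproduces that round trip with the standard `csv` module and restores the lists with `ast.literal_eval`. The two rows are illustrative values taken from the preview above:

```python
import ast
import csv
import io

# Simulate what the saved file contains for two rows (illustrative values).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Vendor_Name", "Techno_List"])
writer.writerow(["UpWeb_Agency", str(["WordPress", "Divi"])])
writer.writerow(["hbconsultant", str(["Unprecised"])])

buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(type(rows[0]["Techno_List"]))  # <class 'str'>

# literal_eval safely parses the Python-literal string back into a list.
for row in rows:
    row["Techno_List"] = ast.literal_eval(row["Techno_List"])
print(rows[0]["Techno_List"])  # ['WordPress', 'Divi']
```

With pandas, the same conversion could presumably be applied at load time through the `converters` argument, for example `pd.read_csv(path, converters={'Techno_List': ast.literal_eval})`.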
