# Business Brief: Working with Scraped Data Using Pandas

## Professional Context

Following the **scraping** phase carried out on Castorama's competitor sites, you have obtained two CSV files:

- categories.csv: information about categories and subcategories.
- products.csv: information about products (name, price, availability, promotions, etc.).

As a **Data Analyst / Data Engineer**, your role is now to **clean**, **prepare**, and **analyze** this data in order to extract **relevant insights**. These insights will help Castorama better understand the state of the market, set a competitive pricing strategy, and anticipate trends.

### Installations

In [None]:
!pip install --upgrade pip

In [None]:
!pip install ipykernel

In [None]:
!pip install pandas

In [None]:
!pip install numpy

In [None]:
!pip install matplotlib

In [None]:
!pip install seaborn

## Imports

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

## Loading the Data

In [None]:
# Load category data

category_data = pd.read_csv("castorama_categories.csv")

# Load products data

product_data = pd.read_csv("castorama_products.csv")

In [None]:
# Confirm category data loaded correctly

category_data

In [None]:
# Confirm product data loaded correctly

product_data

## Exploration and First Manipulations

### Data Overview

In [None]:
# Get information about category_data

category_data.info()

In [None]:
# Get information about product_data

product_data.info()

In [9]:
# View category_data summary statistics 

category_data.describe()

| | category | is_page_list | url |
|---|---|---|---|
| count | 2207 | 2207 | 2207 |
| unique | 1871 | 2 | 1412 |
| top | lame-et-accessoire-de-scie | False | https://www.castorama.fr/salle-de-bains-et-wc/... |
| freq | 6 | 1695 | 6 |


In [10]:
# View product_data summary statistics 

product_data.describe()

| | unique_id | category | subcategory | subsubcategory | subsubsubcategory | title | price | url |
|---|---|---|---|---|---|---|---|---|
| count | 28347 | 28347 | 28347 | 28347 | 13428 | 28347 | 28347 | 28347 |
| unique | 28347 | 16 | 52 | 150 | 75 | 27665 | 5740 | 28347 |
| top | 7610583118042_CAFR.prd | salle-de-bains-et-wc | decoration-textile | coussin-plaid-et-pouf | tapis | Tapis de bain uni en polyester 50x80cm | 1990 | https://www.castorama.fr/salle-de-bains-et-wc/... |
| freq | 1 | 8456 | 8096 | 3550 | 1589 | 15 | 325 | 1 |
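
The `count` row above reports 13428 non-null values for `subsubsubcategory` against 28347 rows, so a little over half of the products have no fourth-level category. A minimal sketch of measuring that share per column is shown here on a toy frame; `toy_products` and its values are invented for illustration and stand in for `product_data`.

```python
import pandas as pd

# Toy frame standing in for product_data; the values are invented.
toy_products = pd.DataFrame({
    "category": ["salle-de-bains-et-wc", "jardin-et-terrasse", "peinture", "cuisine"],
    "subsubsubcategory": ["tapis", None, None, "evier"],
})

# isna() gives a boolean frame; the column-wise mean is the share of missing values.
missing_share = toy_products.isna().mean()
print(missing_share)
```

On the real data, the same expression applied to `product_data` would list every column with its fraction of missing values, which can help decide whether a column is worth keeping.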


In [None]:
# View the first 5 rows of category_data

category_data.head()

In [None]:
# View the first 5 rows of product_data

product_data.head()

## Data Cleaning and Preparation

### Category_data.csv

In [None]:
# Duplicate raw data

df = category_data.copy()
df

In [14]:
# Check for missing data

df.isna().sum()

category        0
is_page_list    0
url             0
dtype: int64

In [None]:
# View duplicated categories 

pd.set_option('display.max_rows', None)

duplicates = df[df["category"].duplicated(keep=False)]

duplicates
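
The cell above passes `keep=False` so that every copy of a repeated category is shown, not only the later ones. A small sketch of the difference, using an invented toy series:

```python
import pandas as pd

# Invented toy data: "tapis" appears three times.
toy = pd.DataFrame({"category": ["tapis", "evier", "tapis", "douche", "tapis"]})

# Default keep="first": the first occurrence is not flagged, later copies are.
later_copies = toy["category"].duplicated()

# keep=False: every row whose value occurs more than once is flagged.
all_copies = toy["category"].duplicated(keep=False)

print(int(later_copies.sum()), int(all_copies.sum()))  # prints: 2 3
```

The default is the mask that `drop_duplicates` uses to decide which rows to remove, while `keep=False` is the more convenient view for inspecting the groups by eye.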

In [39]:
# Drop duplicates (keep only first occurrence)

df.drop_duplicates(subset=["category"], inplace=True)
df.describe()

| | category | is_page_list | url |
|---|---|---|---|
| count | 1871 | 1871 | 1871 |
| unique | 1871 | 2 | 1384 |
| top | arrosage-enterre | False | https://www.castorama.fr/peinture/peinture-interieure/preparation-des-murs-et-plafonds/sous-couche/cat_id_2171.cat |
| freq | 1 | 1371 | 3 |


In [None]:
# View duplicated urls

pd.set_option('display.max_colwidth', None)

duplicated_urls = df[df["url"].duplicated(keep=False)]

# Sort by url
duplicates_sorted = duplicated_urls.sort_values(by="url")

duplicates_sorted
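
Sorting by URL puts the repeated rows next to each other. As an alternative view, the labels sharing a URL can be collected into one row per URL. This is only a sketch: the frame and the `example.com` URLs below are invented for illustration.

```python
import pandas as pd

# Invented toy data: two labels point at the same URL.
toy = pd.DataFrame({
    "category": ["Sous-couche", "sous-couche", "Serre de jardin"],
    "url": [
        "https://example.com/a.cat",
        "https://example.com/a.cat",
        "https://example.com/b.cat",
    ],
})

# One row per URL, holding the list of category labels that point to it.
labels_per_url = toy.groupby("url")["category"].agg(list)

# Keep only URLs that are shared by more than one label.
shared = labels_per_url[labels_per_url.map(len) > 1]
print(shared)
```

Reading the grouped labels side by side makes it easier to judge why a URL is repeated before deciding which row to keep.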

In [49]:
# Drop duplicate URLs, keeping the first occurrence (observation: the duplicated entries appear to exist for SEO and diacritic variants)

df.drop_duplicates(subset=["url"], inplace=True)
df.describe()

| | category | is_page_list | url |
|---|---|---|---|
| count | 1384 | 1384 | 1384 |
| unique | 1383 | 2 | 1384 |
| top | Porte-savon | False | https://www.castorama.fr/jardin-et-terrasse/serre-de-jardin-tunnel-et-voile-d-hivernage/serre-de-jardin/cat_id_26.cat |
| freq | 2 | 1369 | 1 |
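
The summary above still shows one category name occurring twice (`Porte-savon`, 1383 unique values for 1384 rows). One possible cause, assumed here and not confirmed by the data, is that two names differed only in case or surrounding whitespace when duplicates were dropped and only became identical after cleaning. A minimal sketch of deduplicating on a normalized comparison key, with invented toy rows:

```python
import pandas as pd

# Invented toy data: the first two names differ only by case and a leading space.
toy = pd.DataFrame({
    "category": ["porte-savon", " Porte-savon", "tapis"],
    "url": [
        "https://example.com/1.cat",
        "https://example.com/2.cat",
        "https://example.com/3.cat",
    ],
})

# Hypothetical comparison key: trimmed and case-folded, used only to detect duplicates.
key = toy["category"].str.strip().str.casefold()

# Keep the first row for each key; the original spelling of that row is preserved.
deduped = toy.loc[~key.duplicated()].reset_index(drop=True)
print(deduped)
```

Running the string cleaning before `drop_duplicates` would have a similar effect; the key-based version simply leaves the original column untouched until the cleaning rules are settled.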


- String manipulation:

In [None]:
# Strip leading and trailing whitespace from category names
df["category"] = df["category"].str.strip()
df

In [None]:
# Capitalize category names (first letter upper case, the rest lower case)
df["category"] = df["category"].str.capitalize()
df

In [None]:
# Remove commas and replace spaces with underscores

replacements = {",": "", " ": "_"}

def replace_commas_and_spaces(input_str, replacement):
    # Apply every old -> new pair in turn, then return the final string
    for old, new in replacement.items():
        input_str = input_str.replace(old, new)
    return input_str

df["category"] = df["category"].apply(lambda x: replace_commas_and_spaces(str(x), replacements))

df
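
The same substitution can also be written with pandas' vectorized string methods in place of a Python-level `apply`. The sketch below runs on an invented toy series, not on the notebook's `df`.

```python
import pandas as pd

# Invented sample names for illustration.
names = pd.Series(["Lame et accessoire de scie", "Evier, robinet", "Tapis"])

cleaned = (
    names
    .str.replace(",", "", regex=False)   # drop commas
    .str.replace(" ", "_", regex=False)  # turn spaces into underscores
)
print(cleaned.tolist())
```

Passing `regex=False` makes it explicit that the patterns are literal characters, which avoids surprises if a pattern such as `.` is added later.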

In [None]:
# Handling non-breaking spaces and apostrophes explicitly

df["category"] = df["category"].str.replace('\xa0', '_').str.replace(r'\s+', '_', regex=True).str.replace("'",'_')

df

In [None]:
# Remove accents

replacements = {"à": "a", "á": "a", "â": "a", "ä": "a",
                "é": "e", "è": "e", "ê": "e", "ë": "e", "É":"E", "È":"E",
                "î": "i", "ï":"i", "ì": "i", "í": "i",
                "ö": "o", "ô": "o", "ò": "o", "ó": "o",
                "ü": "u", "û": "u", "ù": "u", "ú": "u",
                "ç": "c", "Ç": "C"}

def replace_accents(input_str, replacement):
    for old, new in replacement.items():
        input_str = input_str.replace(old, new)
    return input_str

df["category"] = df["category"].apply(lambda x: replace_accents(str(x), replacements))
df

In [None]:
# Reset index 

#df.reset_index(drop=True)

In [None]:
#pd.set_option('display.max_rows', None)

#pd.reset_option('display.max_rows')


# import unicodedata

# def robust_remove_accents(input_str):
#     # Normalize to decomposed form
#     normalized = unicodedata.normalize('NFD', input_str)
#     # Remove combining characters (accents)
#     without_accents = ''.join(c for c in normalized if unicodedata.category(c) != 'Mn')
#     # Explicitly replace known problematic characters
#     replacements = {"é": "e", "è": "e", "à": "a", "ü": "u", "ö": "o", "ë": "e", "ô": "o", "ê": "e", "î": "i", "É":"E"}
#     for accented_char, replacement in replacements.items():
#         without_accents = without_accents.replace(accented_char, replacement)
#     # Handle lingering issues and strip
#     return without_accents.replace('\xa0', ' ').strip()

# df["category"] = df["category"].apply(
#     lambda x: robust_remove_accents(str(x)) if isinstance(x, str) else x
# )
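
The commented-out draft above relies on Unicode decomposition. A reduced, standard-library-only sketch of that idea follows; `strip_accents` is a hypothetical helper name, and the sample strings are invented.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Remove combining marks (accents, cedillas) from a string."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("Évier à poser, façade côté déco"))
```

Because the decomposition handles every accented letter, no explicit replacement table is needed, and `ç` is covered as well. Characters that have no decomposition, such as the ligature `œ`, pass through unchanged. On the notebook's data the helper could be applied with `df["category"].map(strip_accents)`.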

**Notes:** 

- price to be converted to float type *
- some category names start with a space (" ")
- some category names separated by "-"
- replace spaces and "," by "_" *
- capitalize each category name * (only first letter)
- replace accented letters (é, ç, à, ö) ***
- special case (Concept Rand : une solution de rangement modulable pour tous les usages)
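
For the first note (converting `price` to float), a hedged sketch follows. The raw price format is not shown in this notebook, so the strings below are invented and assume a decimal comma, optional currency symbols, and spaces as thousands separators; the real column may need different rules.

```python
import pandas as pd

# Invented examples of what raw price strings might look like (assumption).
raw_prices = pd.Series(["19,90", "5.50 €", "1 299,00", "n/a"])

normalized = (
    raw_prices
    .str.replace(r"[^\d,.]", "", regex=True)  # drop currency symbols, spaces, letters
    .str.replace(",", ".", regex=False)       # decimal comma becomes a decimal point
)

# errors="coerce" turns anything that still cannot be parsed into NaN.
prices = pd.to_numeric(normalized, errors="coerce")
print(prices)
```

Counting the `NaN` values after conversion is a quick way to see how many rows did not match the assumed format. Note that a dot used as a thousands separator (for example `1.299,00`) would not survive this sketch and would need its own rule.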