

# **Data Handling Techniques**


# **Main Goal:**

This notebook equips users with practical tools and techniques for cleaning and preparing datasets, so that the data used for analysis or modeling is reliable and representative of the underlying patterns. It provides a foundation for robust data preprocessing and quality assurance.

In [None]:
# Handling missing values and outliers

# Import the required libraries
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns



In [None]:
# Example dataset
np.random.seed(42)  # Kept for reproducibility; the values below are fixed
data = {
    "A": [1, 2, np.nan, 4, 5, np.nan, 7, 8, 9, 10],
    "B": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    "C": [np.nan, 1, 2, 3, 4, 5, 6, np.nan, 8, 1000]  # Contains an outlier (1000)
}
df = pd.DataFrame(data)

print("\n=== Initial dataset ===")
print(df)


The purpose of Notebook 2 is to demonstrate data handling techniques, specifically focusing on addressing missing values and identifying/managing outliers in datasets. These tasks are critical in preparing data for machine learning models to ensure they perform accurately and robustly.

# **1. Handling Missing Data:**

- Understand different strategies to deal with missing values, such as:

  * Removing rows or columns that contain missing values.
  * Imputing missing values with a column statistic (mean, median, etc.).

- Learn how to make datasets complete for model training.
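
The strategies above include removing columns as well as rows, which requires a rule for when a column is too sparse to keep. The following is a minimal sketch, assuming a hypothetical toy frame `toy` and an arbitrary 50% cutoff `max_missing`; neither comes from this notebook's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, for illustration only
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": [np.nan, np.nan, np.nan, 8.0],
    "z": [1.0, 2.0, 3.0, 4.0],
})

# Share of missing values in each column
missing_share = toy.isna().mean()
print(missing_share)

# Assumed rule: drop columns with more than 50% missing values
max_missing = 0.5
toy_cols = toy.loc[:, missing_share <= max_missing]

# Then drop the rows that still contain a missing value
toy_complete = toy_cols.dropna(axis=0)
print(toy_complete)
```

Dropping the sparse column first keeps three of the four rows here; calling `dropna()` on the full frame would keep only one.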

In [None]:
# 1. Handling missing data

## a. Drop rows that contain missing values
print("\n=== Dropping rows that contain missing values ===")
df_dropna = df.dropna()
print(df_dropna)

## b. Mean imputation
print("\n=== Imputing missing values with the mean ===")
imputer_mean = SimpleImputer(strategy="mean")
df_mean = pd.DataFrame(imputer_mean.fit_transform(df), columns=df.columns)
print(df_mean)

## c. Median imputation
print("\n=== Imputing missing values with the median ===")
imputer_median = SimpleImputer(strategy="median")
df_median = pd.DataFrame(imputer_median.fit_transform(df), columns=df.columns)
print(df_median)
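
Mean and median imputation can produce very different fill values when a column contains an extreme value. The sketch below recomputes the two fill values for column C with plain pandas. The values are copied from the example dataset, and the names `c`, `mean_fill` and `median_fill` are illustrative. `SimpleImputer` with `strategy="mean"` or `strategy="median"` relies on the same statistics, computed over the non-missing entries.

```python
import numpy as np
import pandas as pd

# Values of column C from the example dataset
c = pd.Series([np.nan, 1, 2, 3, 4, 5, 6, np.nan, 8, 1000], name="C")

# pandas skips NaN by default when computing these statistics
mean_fill = c.mean()      # 128.625, pulled up by the value 1000
median_fill = c.median()  # 4.5, unaffected by the extreme value

c_mean = c.fillna(mean_fill)
c_median = c.fillna(median_fill)

print("Mean fill value:  ", mean_fill)
print("Median fill value:", median_fill)
```

When a column may contain outliers, the median is usually the safer default, because a single extreme value cannot shift it far.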


# **2. Handling Outliers:**

*   Identify outliers using statistical methods such as the Z-score.
*   Explore techniques to address outliers by either removing or capping them.
*   Visualize data distributions to better understand anomalies.   
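
Capping replaces extreme values with a boundary value instead of dropping the row. The following is a minimal sketch, assuming the common 1.5 × IQR fences as caps. That rule is a swapped-in convention, not something this notebook prescribes. The values are copied from column C, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Values of column C from the example dataset
c = pd.Series([np.nan, 1, 2, 3, 4, 5, 6, np.nan, 8, 1000], name="C")

# Quartiles are computed over the non-missing values
q1, q3 = c.quantile(0.25), c.quantile(0.75)
iqr = q3 - q1

# Assumed rule: fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag values outside the fences (NaN compares as False)
is_outlier = (c < lower_fence) | (c > upper_fence)

# Cap instead of removing; missing values stay missing
c_capped = c.clip(lower=lower_fence, upper=upper_fence)
print(c_capped)
```

Capping keeps every row, which matters on small datasets, at the cost of replacing the original value with an artificial one.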
    

In [None]:
# 2. Handling outliers

## a. Outlier detection with the Z-score
print("\n=== Outlier detection with the Z-score ===")
z_scores = stats.zscore(df["C"].to_numpy(), nan_policy="omit")
# With only 8 non-missing values, |Z| cannot exceed sqrt(8 - 1), about 2.65,
# so the usual |Z| > 3 cutoff would never flag the value 1000.
z_threshold = 2.5
outlier_indices = np.where(np.abs(z_scores) > z_threshold)[0]
print("Outlier indices:", outlier_indices)

## b. Outlier removal
print("\n=== Outlier removal ===")
# Keep rows within the threshold, plus rows where C is missing
keep_mask = (np.abs(z_scores) <= z_threshold) | df["C"].isna().to_numpy()
df_no_outliers = df[keep_mask]
print(df_no_outliers)

## c. Visualizing outliers
print("\n=== Data visualization ===")
plt.figure(figsize=(8, 5))
sns.boxplot(x=df["C"], color="skyblue")
plt.title("Boxplot of column C (with outliers)")
plt.show()

# Summary of the steps applied
print("\n=== Summary ===")
print("1. Handling missing values: row removal and imputation (mean, median)")
print("2. Outlier detection and removal (Z-score)")
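
On small samples the Z-score is bounded: with n values and the population standard deviation (`ddof=0`, the default of `scipy.stats.zscore`), no |Z| can exceed sqrt(n - 1). It is therefore worth checking the largest |Z| before settling on a cutoff. The sketch below does this for column C with plain NumPy; the variable names are illustrative.

```python
import numpy as np

# Values of column C from the example dataset
c = np.array([np.nan, 1, 2, 3, 4, 5, 6, np.nan, 8, 1000], dtype=float)

# Z-scores over the non-missing values (population std, ddof=0)
z = (c - np.nanmean(c)) / np.nanstd(c)

n = np.count_nonzero(~np.isnan(c))   # number of non-missing values
max_abs_z = np.nanmax(np.abs(z))     # largest observed |Z|
bound = np.sqrt(n - 1)               # theoretical maximum of |Z|

print(f"n = {n}, largest |Z| = {max_abs_z:.3f}, bound = {bound:.3f}")
```

Here the value 1000 sits almost exactly on the bound and still stays below 3, which is why a quartile-based rule or a lower cutoff tends to work better on very small datasets.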
