# 🔍 CICIDS2017 - Pre-Cleaned Dataset

**Dataset**: `/kaggle/input/cicids2017-cleaned-and-preprocessed`

This notebook processes the **pre-cleaned** version of CICIDS2017, which labels traffic with 7 `Attack Type` categories.

---

## 📦 Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')


## 📂 Load the Dataset

In [None]:
# Load the pre-cleaned dataset
df = pd.read_csv('/kaggle/input/cicids2017-cleaned-and-preprocessed/cicids2017_cleaned.csv')

print(f"Shape: {df.shape}")
print(f"Columns: {len(df.columns)}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MiB")

## 🔍 Initial Exploration

In [None]:
# First rows
df.head()

In [None]:
# Column information
df.info()

## 🏷️ Label Analysis

In [None]:
# Label distribution
print("Attack type distribution:")
print(df['Attack Type'].value_counts())
print("\nPercentages:")
print(df['Attack Type'].value_counts(normalize=True) * 100)

In [None]:
# Check the unique values (printing the length helps spot stray whitespace)
print("Unique labels:")
for label in df['Attack Type'].unique():
    count = (df['Attack Type'] == label).sum()
    print(f"  '{label}' (len={len(label)}) : {count:,} instances")
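If the length check reveals labels with stray whitespace, one possible fix is `Series.str.strip()`. The sketch below runs on a small made-up DataFrame; the padded label values are illustrative assumptions, not observations from the dataset.

```python
import pandas as pd

# Toy frame standing in for df; the padded labels are invented for illustration.
toy = pd.DataFrame({"Attack Type": ["DoS", " DoS", "DDoS ", "Normal Traffic"]})

# Strip leading/trailing whitespace so variants of the same label collapse together.
toy["Attack Type"] = toy["Attack Type"].str.strip()

label_counts = toy["Attack Type"].value_counts()
print(label_counts)
```

After stripping, `' DoS'` and `'DoS'` are counted as a single class.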

In [None]:
# Visualization
plt.figure(figsize=(14, 6))
df['Attack Type'].value_counts().plot(kind='bar', color='skyblue', edgecolor='black')
plt.title('Attack Type Distribution - Pre-Cleaned CICIDS2017', fontsize=16, fontweight='bold')
plt.xlabel('Attack Type', fontsize=12)
plt.ylabel('Number of Instances', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.savefig('attack_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

## 🧹 Quality Checks

In [None]:
# Check for missing values
print("Missing values:")
nan_counts = df.isnull().sum()
if nan_counts.sum() > 0:
    print(nan_counts[nan_counts > 0])
else:
    print("✅ No NaN values")

In [None]:
# Check for infinite values
print("Infinite values:")
numeric_cols = df.select_dtypes(include=[np.number]).columns
inf_counts = {}
for col in numeric_cols:
    inf_count = np.isinf(df[col]).sum()
    if inf_count > 0:
        inf_counts[col] = inf_count

if inf_counts:
    for col, count in sorted(inf_counts.items(), key=lambda x: x[1], reverse=True)[:5]:
        print(f"  {col}: {count:,}")
else:
    print("✅ No infinite values")

In [None]:
# Check for duplicate rows
dup_count = df.duplicated().sum()
print(f"Duplicates: {dup_count:,} ({dup_count/len(df)*100:.2f}%)")
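The three checks in this section can be bundled into one helper when the same report is needed more than once. This is a minimal sketch on a toy DataFrame; `quality_report` is a hypothetical name, not part of pandas or of this notebook.

```python
import numpy as np
import pandas as pd

def quality_report(frame: pd.DataFrame) -> dict:
    """Count NaN cells, infinite cells, and duplicated rows (hypothetical helper)."""
    numeric = frame.select_dtypes(include=[np.number])
    return {
        "nan_cells": int(frame.isnull().sum().sum()),
        # np.isinf only applies to numeric data, so restrict to numeric columns.
        "inf_cells": int(np.isinf(numeric.to_numpy(dtype=float)).sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
    }

# Toy data: one NaN, one infinite value, and one repeated row.
toy = pd.DataFrame({
    "a": [1.0, np.inf, 3.0, 3.0],
    "b": [np.nan, 2.0, 5.0, 5.0],
    "label": ["x", "y", "z", "z"],
})
report = quality_report(toy)
print(report)
```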

## 🔧 Cleaning (if needed)

In [None]:
# Clean only if the checks above found problems
df_clean = df.copy()

# Replace infinite values with NaN
df_clean.replace([np.inf, -np.inf], np.nan, inplace=True)

# Fill NaN with the column median (assign back instead of a chained inplace fillna,
# which newer pandas versions warn about and may not apply)
for col in df_clean.select_dtypes(include=[np.number]).columns:
    if df_clean[col].isnull().any():
        df_clean[col] = df_clean[col].fillna(df_clean[col].median())

# Drop duplicate rows
df_clean = df_clean.drop_duplicates()

print("✅ Cleaning complete")
print(f"Final shape: {df_clean.shape}")
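The same cleaning steps can also be written without the per-column loop, since `DataFrame.fillna` accepts a Series of per-column fill values. The sketch below shows the effect on a toy frame whose values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one infinite value, one NaN, and one repeated row.
toy = pd.DataFrame({"a": [1.0, np.inf, 3.0, 3.0], "b": [np.nan, 2.0, 5.0, 5.0]})

# 1) inf -> NaN, 2) NaN -> column median, 3) drop exact duplicate rows.
cleaned = toy.replace([np.inf, -np.inf], np.nan)
cleaned = cleaned.fillna(cleaned.median(numeric_only=True))
cleaned = cleaned.drop_duplicates().reset_index(drop=True)
print(cleaned)
```

Note that imputation can itself create duplicates: here the filled second row stays unique, but in general it is worth running `drop_duplicates()` last, as the notebook does.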

## 🏷️ Strategy 1: Binary Classification

In [None]:
# Create binary labels: 0 = Normal, 1 = Attack
df_clean['Binary_Label'] = (df_clean['Attack Type'] != 'Normal Traffic').astype(int)

print("Binary distribution:")
print(df_clean['Binary_Label'].value_counts())
print("\nPercentages:")
print(df_clean['Binary_Label'].value_counts(normalize=True) * 100)

## 🏷️ Strategy 2: Multi-Class Classification (6 categories)

In [None]:
# Merge DoS and DDoS into a single class
df_clean['Attack_Type_Merged'] = df_clean['Attack Type'].replace({
    'DoS': 'DoS_DDoS',
    'DDoS': 'DoS_DDoS'
})

print("Distribution after merging DoS/DDoS:")
print(df_clean['Attack_Type_Merged'].value_counts())

In [None]:
# Encode the labels as integers
le = LabelEncoder()
df_clean['Label_Encoded'] = le.fit_transform(df_clean['Attack_Type_Merged'])

# Show the mapping
print("Numeric mapping:")
for i, label in enumerate(le.classes_):
    count = (df_clean['Label_Encoded'] == i).sum()
    print(f"{i}: {label} ({count:,} instances)")
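`LabelEncoder` assigns integer codes in sorted order of the class names, and `inverse_transform` maps the codes back to the original strings. A small round-trip sketch, using a few illustrative label values:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative labels; the real notebook encodes df_clean['Attack_Type_Merged'].
labels = np.array(["Normal Traffic", "DoS_DDoS", "Bots", "DoS_DDoS"])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)        # strings -> integer codes
decoded = encoder.inverse_transform(codes)   # integer codes -> strings

# Codes follow the sorted order of encoder.classes_.
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
```

The same `inverse_transform` call is what turns model predictions back into readable attack names later on.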

## 🎯 Feature Preparation

In [None]:
# Separate features and labels
label_cols = ['Attack Type', 'Attack_Type_Merged', 'Label_Encoded', 'Binary_Label']
feature_cols = [col for col in df_clean.columns if col not in label_cols]

X = df_clean[feature_cols]
y = df_clean['Label_Encoded']  # Or 'Binary_Label' for binary classification

print(f"Features shape: {X.shape}")
print(f"Labels shape: {y.shape}")

In [None]:
# Standardize the features (zero mean, unit variance)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("✅ Features standardized")
print(f"Shape: {X_scaled.shape}")
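The cell above fits the scaler on the full dataset. If the data is later split into training and test sets, a common alternative is to split first and fit the scaler on the training portion only, so that test-set statistics do not leak into preprocessing. The sketch below illustrates that variant on synthetic data; the split ratio and random seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (200 samples, 4 features, 2 classes).
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(200, 4))
y_toy = rng.integers(0, 2, size=200)

# Stratify so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Fit on the training split only, then reuse the same statistics for the test split.
toy_scaler = StandardScaler()
X_train_scaled = toy_scaler.fit_transform(X_train)
X_test_scaled = toy_scaler.transform(X_test)
```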

## 💾 Save the Results

In [None]:
# Save the full dataset
df_clean.to_csv('cicids2017_final.csv', index=False)
print("✅ Full dataset saved: cicids2017_final.csv")

In [None]:
# Save X and y for ML
np.save('X_scaled.npy', X_scaled)
np.save('y.npy', y.values)
print("✅ Features and labels saved: X_scaled.npy, y.npy")

In [None]:
# Save the encoders
import joblib

joblib.dump(le, 'label_encoder.pkl')
joblib.dump(scaler, 'scaler.pkl')
print("✅ Encoders saved: label_encoder.pkl, scaler.pkl")
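On the training side, the saved artifacts are read back with `np.load` and `joblib.load`. The sketch below demonstrates the round trip with a tiny scaler written to a temporary directory, so it does not depend on the notebook's output files; the toy array is an assumption for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

with tempfile.TemporaryDirectory() as tmp:
    scaler_path = os.path.join(tmp, "scaler.pkl")
    array_path = os.path.join(tmp, "X_scaled.npy")

    # Fit and persist a small scaler plus its transformed data.
    data = np.array([[1.0, 10.0], [3.0, 30.0]])
    fitted = StandardScaler().fit(data)
    joblib.dump(fitted, scaler_path)
    np.save(array_path, fitted.transform(data))

    # Reload both artifacts, as a training script would.
    restored_scaler = joblib.load(scaler_path)
    restored_array = np.load(array_path)

print(restored_array)
```

Pickled scikit-learn objects are generally only safe to load with the same scikit-learn version that wrote them, so it helps to record the version alongside the files.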

## 📊 Final Statistics

In [None]:
print("=" * 70)
print("PREPROCESSING SUMMARY")
print("=" * 70)
print("\nFinal dataset:")
print(f"  - Rows: {df_clean.shape[0]:,}")
print(f"  - Features: {X.shape[1]}")
print(f"  - Classes: {len(le.classes_)}")
print("\nClass distribution:")
for i, label in enumerate(le.classes_):
    count = (y == i).sum()
    pct = (count / len(y)) * 100
    print(f"  {i}: {label:20s} {count:8,} ({pct:5.2f}%)")
print("\n" + "=" * 70)
print("✅ Preprocessing complete! Ready for training.")
print("=" * 70)

## 📥 Next Steps

1. ✅ Dataset explored and cleaned
2. ✅ Labels encoded (binary or multi-class)
3. ✅ Features standardized
4. ✅ Files saved

**To do next:**
- Download the files from Kaggle Output
- Place them in `data/processed/` in your local project
- Commit to Git
- Move on to ML/DL training!

---

**Happy training! 🚀**