# Summary

For the rest of the course, we will learn by doing, working through a real-world Machine Learning project.

Project description: predict house prices in Ames, Iowa.

The data is available through [sklearn.datasets](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html#sklearn.datasets.fetch_openml).


Summary of the objectives:
- Perform an exploratory analysis.
- Test several prediction models to find the one that best addresses the problem.

__Useful resources__

- [Introduction to MLOps](https://ashutoshtripathi.com/2021/08/18/mlops-a-complete-guide-to-machine-learning-operations-mlops-vs-devops/)

- [MLFLOW - Reference site](https://mlflow.org/docs/latest/index.html)
- [MLFLOW - Tutorial](https://mlflow.org/docs/latest/tutorials-and-examples/tutorial.html)
- [MLFLOW - Tracking](https://mlflow.org/docs/latest/tracking.html)
- [MLFLOW - Model Registry](https://mlflow.org/docs/latest/model-registry.html#)
- [MLFLOW - Serve a model](https://mlflow.org/docs/latest/model-registry.html#serving-an-mlflow-model-from-model-registry)

- [Evidently - Data drift analysis tutorial](https://github.com/evidentlyai/evidently/tree/main/examples/sample_notebooks)
- [Flask API - Implementation approach](http://web.univ-ubs.fr/lmba/lardjane/python/c4.pdf)
- [FastAPI - Implementation approach](https://towardsdatascience.com/how-to-build-and-deploy-a-machine-learning-model-with-fastapi-64c505213857)
- [Azure - Web application deployment tutorial](https://learn.microsoft.com/fr-fr/azure/app-service/quickstart-python?tabs=flask%2Cwindows%2Cazure-portal%2Cvscode-deploy%2Cdeploy-instructions-azportal%2Cterminal-bash%2Cdeploy-instructions-zip-azcli)
- [Unit tests - Unittest or Pytest](https://www.sitepoint.com/python-unit-testing-unittest-pytest/)

- [Pythonanywhere](https://www.pythonanywhere.com/)
- [Heroku](https://www.heroku.com/)
- [Azure webapp - Automated deployment via GitHub](https://learn.microsoft.com/fr-fr/azure/app-service/deploy-continuous-deployment?tabs=github)
- Streamlit or Gradio for building a dashboard


# Import libraries

In [1]:
%reload_ext autoreload
%autoreload 2

import sys
from pathlib import Path

import mlflow
import matplotlib.pyplot as plt
import missingno as msno
import numpy as np
import pandas as pd
import pendulum
import plotly.express as px
import ppscore as pps
import seaborn as sns
from loguru import logger
from mlflow.models import infer_signature
from sklearn import set_config
from sklearn.compose import ColumnTransformer, make_column_selector, TransformedTargetRegressor
from sklearn.datasets import fetch_openml
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, VotingRegressor, GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.metrics import (r2_score,
                             root_mean_squared_error,
                             mean_absolute_percentage_error,
                             mean_absolute_error,
                             max_error,
                            )
from sklearn.model_selection import train_test_split, learning_curve, LearningCurveDisplay
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, OneHotEncoder
from ydata_profiling import ProfileReport
from yellowbrick.regressor import PredictionError, ResidualsPlot

sys.path.append(str(Path.cwd().parent))
from settings.params import MODEL_PARAMS, SEED
from src.make_dataset import load_data


pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)
set_config(display="diagram", print_changed_only=False)  # display sklearn pipeline as diagram

# Settings

In [2]:
# Set logging format
log_fmt = "<green>{time:YYYY-MM-DD HH:mm:ss.SSS!UTC}</green> | <level>{level: <8}</level> | <cyan>{name}</cyan>:<cyan>{function}</cyan>:<cyan>{line}</cyan> - {message}"
logger.configure(handlers=[{"sink": sys.stderr, "format": log_fmt}])

# current date
CURRENT_DATE = pendulum.now(tz="UTC")

# target name definition
TARGET_NAME = MODEL_PARAMS["TARGET_NAME"]
logger.info(f"Target name: {TARGET_NAME}")


# directories
PROJECT_DIR = Path.cwd().parent
REPORTS_DIR = Path(PROJECT_DIR, "reports")

logger.info(f"\nProject directory: {PROJECT_DIR} \nReports dir: {REPORTS_DIR}")

2025-07-30 10:35:54.836 | INFO     | __main__:<module>:10 - Target name: target
2025-07-30 10:35:54.842 | INFO     | __main__:<module>:17 - 
Project directory: c:\Users\DELL\Desktop\mlops-project - Tweets 
Reports dir: c:\Users\DELL\Desktop\mlops-project - Tweets\reports


In [3]:
MODEL_PARAMS

{'TARGET_NAME': 'target',
 'MIN_COMPLETION_RATE': 0.75,
 'MIN_PPS': 0.1,
 'DEFAULT_FEATURE_NAMES': ['Alley',
  'BsmtQual',
  'ExterQual',
  'Foundation',
  'FullBath',
  'GarageArea',
  'GarageCars',
  'GarageFinish',
  'GarageType',
  'GrLivArea',
  'KitchenQualMSSubClass',
  'Neighborhood',
  'OverallQual',
  'TotRmsAbvGrd',
  'building_age',
  'remodel_age',
  'garage_age'],
 'TEST_SIZE': 0.25}

# Data collection

In [4]:
# get data like pandas.DataFrame
data = load_data(dataset_name="tweets")
data.shape

(7613, 5)

In [5]:
data.head()

|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 0 | 1 |  |  | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 |  |  | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 |  |  | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 |  |  | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 |  |  | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [None]:
data.info()

In [None]:
data.describe(include="all")

# EDA: Exploratory Data Analysis

In [None]:
# barplot for missing value rate
msno.bar(data,
         filter="top",
         p=MODEL_PARAMS["MIN_COMPLETION_RATE"],
        );

## Target analysis

Todo:

1. Add a parameter named TARGET_NAME to the MODEL_PARAMS dictionary, with the value "SalePrice".
2. Plot the histogram of the target variable, reading its name from the MODEL_PARAMS dictionary.
3. Plot a second chart on the same row (log-transformed target variable).
4. Continue the exploratory analysis of the explanatory variables (quantitative and qualitative).
5. Make a first selection of the explanatory variables to use in the model.
6. Add a key named FEATURE_NAMES to MODEL_PARAMS that stores the list of selected explanatory variables.

In [None]:
MODEL_PARAMS["TARGET_NAME"]

### Class distribution

In [None]:
# sort_index() fixes the order to class 0, then class 1, so positional labels in later plots stay correct
target_counts = data[TARGET_NAME].value_counts().sort_index()
target_pct = data[TARGET_NAME].value_counts(normalize=True).sort_index() * 100

print("Class distribution:")
print(f"  Class 0 (Non-Disaster): {target_counts[0]} ({target_pct[0]:.1f}%)")
print(f"  Class 1 (Disaster): {target_counts[1]} ({target_pct[1]:.1f}%)")

In [None]:
# Visualisation de la distribution
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Graphique en barres
target_counts.plot(kind='bar', ax=axes[0], color=['skyblue', 'salmon'])
axes[0].set_title('Distribution des Classes')
axes[0].set_xlabel('Target (0=Non-Disaster, 1=Disaster)')
axes[0].set_ylabel('Nombre de tweets')
axes[0].tick_params(axis='x', rotation=0)

# Graphique en secteurs
axes[1].pie(target_counts.values, labels=['Non-Disaster', 'Disaster'], 
           autopct='%1.1f%%', colors=['skyblue', 'salmon'])
axes[1].set_title('Répartition des Classes')

# Graphique d'équilibre
class_balance = min(target_counts) / max(target_counts)
axes[2].bar(['Équilibre des Classes'], [class_balance], color='orange')
axes[2].set_title(f'Équilibre: {class_balance:.2f}')
axes[2].set_ylabel('Ratio (min/max)')
axes[2].set_ylim(0, 1)

plt.tight_layout()
plt.show()

## Categorical feature analysis

In [None]:
print("\n📊 4. ANALYSE DES FEATURES CATÉGORIELLES")
print("-" * 50)

# Analyse des keywords
print("🔑 ANALYSE DES KEYWORDS:")
keyword_counts = data['keyword'].value_counts(dropna=False).head(15)
print(f"Top 15 keywords les plus fréquents:")
print(keyword_counts)

# Keywords par classe
keyword_disaster = data[data['target'] == 1]['keyword'].value_counts(dropna=False).head(10)
keyword_normal = data[data['target'] == 0]['keyword'].value_counts(dropna=False).head(10)

print(f"\nTop 10 keywords - Disaster tweets:")
print(keyword_disaster)
print(f"\nTop 10 keywords - Normal tweets:")
print(keyword_normal)

# Analyse des locations
print("\n📍 ANALYSE DES LOCATIONS:")
location_counts = data['location'].value_counts(dropna=False).head(15)
print(f"Top 15 locations les plus fréquentes:")
print(location_counts)

In [None]:
# Visualisation
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Keywords globaux
keyword_counts.plot(kind='barh', ax=axes[0,0], color='lightblue')
axes[0,0].set_title('Top 15 Keywords')
axes[0,0].set_xlabel('Fréquence')

# Locations globales
location_counts.plot(kind='barh', ax=axes[0,1], color='lightgreen')
axes[0,1].set_title('Top 15 Locations')
axes[0,1].set_xlabel('Fréquence')

# Keywords par classe
keyword_disaster.head(8).plot(kind='barh', ax=axes[1,0], color='salmon')
axes[1,0].set_title('Top Keywords - Disaster Tweets')
axes[1,0].set_xlabel('Fréquence')

keyword_normal.head(8).plot(kind='barh', ax=axes[1,1], color='skyblue')
axes[1,1].set_title('Top Keywords - Normal Tweets')
axes[1,1].set_xlabel('Fréquence')

plt.tight_layout()
plt.show()

### In-depth text analysis

In [None]:
print("\n📊 5. ANALYSE APPROFONDIE DU TEXTE")
print("-" * 50)

# Longueur des textes
data['text_length'] = data['text'].str.len()
data['word_count'] = data['text'].str.split().str.len()

print("📏 STATISTIQUES DE LONGUEUR:")
length_stats = data.groupby('target')[['text_length', 'word_count']].describe()
print(length_stats)

# Caractères spéciaux
data['url_count'] = data['text'].str.count(r'http\S+|www\S+')
data['mention_count'] = data['text'].str.count(r'@\w+')
data['hashtag_count'] = data['text'].str.count(r'#\w+')
data['exclamation_count'] = data['text'].str.count(r'!')
data['question_count'] = data['text'].str.count(r'\?')
data['caps_count'] = data['text'].str.count(r'[A-Z]')

print("\n🔍 CARACTÈRES SPÉCIAUX par classe:")
special_chars = ['url_count', 'mention_count', 'hashtag_count', 'exclamation_count', 'question_count', 'caps_count']
special_stats = data.groupby('target')[special_chars].mean()
print(special_stats)
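
To make the surface features above concrete, the sketch below applies the same regular expressions to a single made-up tweet with the standard `re` module. The sample text and the `count_patterns` helper are illustrative assumptions, not part of the notebook.

```python
import re

# Same patterns as the pandas `.str.count(...)` calls in the cell above.
PATTERNS = {
    "url_count": r"http\S+|www\S+",
    "mention_count": r"@\w+",
    "hashtag_count": r"#\w+",
    "exclamation_count": r"!",
    "question_count": r"\?",
    "caps_count": r"[A-Z]",
}


def count_patterns(text: str) -> dict:
    """Count non-overlapping matches of each pattern in one string (hypothetical helper)."""
    return {name: len(re.findall(pattern, text)) for name, pattern in PATTERNS.items()}


# Made-up example tweet, for illustration only.
sample = "Fire near @CityHall! Stay safe #wildfire http://example.com Why now?"
counts = count_patterns(sample)
print(counts)
```

Note that `caps_count` counts every uppercase letter, including those inside mentions and hashtags, so it partly overlaps with the other features.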

In [None]:
# Visualisation des longueurs
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Character-length distribution per class
for target in [0, 1]:
    df = data[data[TARGET_NAME] == target]['text_length']
    axes[0, 0].hist(df, bins=30, alpha=0.7, label=f'Target {target}')
axes[0, 0].set_title('Distribution - Longueur en Caractères')
axes[0, 0].set_xlabel('Nombre de caractères')
axes[0, 0].legend()

# Distribution nombre de mots
for i, target in enumerate([0, 1]):
    df = data[data[TARGET_NAME] == target]['word_count']
    axes[0, 1].hist(df, bins=30, alpha=0.7, label=f'Target {target}')
axes[0, 1].set_title('Distribution - Nombre de Mots')
axes[0, 1].set_xlabel('Nombre de mots')
axes[0, 1].legend()

# Boxplot longueurs par classe
data.boxplot(column='text_length', by='target', ax=axes[0, 2])
axes[0, 2].set_title('Longueur par Classe')
axes[0, 2].set_xlabel('Target')

# Caractères spéciaux
special_stats.T.plot(kind='bar', ax=axes[1, 0])
axes[1, 0].set_title('Caractères Spéciaux par Classe')
axes[1, 0].tick_params(axis='x', rotation=45)
axes[1, 0].legend(['Non-Disaster', 'Disaster'])

# Corrélation longueur vs target
correlation_length = data[['text_length', 'word_count', 'target']].corr()
sns.heatmap(correlation_length, annot=True, ax=axes[1, 1], cmap='coolwarm')
axes[1, 1].set_title('Corrélation Longueur vs Target')

# Distribution caps par classe
for i, target in enumerate([0, 1]):
    df = data[data[TARGET_NAME] == target]['caps_count']
    axes[1, 2].hist(df, bins=20, alpha=0.7, label=f'Target {target}')
axes[1, 2].set_title('Distribution - Lettres Majuscules')
axes[1, 2].set_xlabel('Nombre de majuscules')
axes[1, 2].legend()

plt.tight_layout()
plt.show()

In [None]:
# Install NLTK
%pip install nltk

### Most frequent words

In [None]:
# Pour l'analyse textuelle
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
from wordcloud import WordCloud

print("\n📊 6. ANALYSE DES MOTS LES PLUS FRÉQUENTS")
print("-" * 50)

def clean_text_for_analysis(text):
    """Nettoyage basique pour l'analyse des mots"""
    if pd.isna(text):
        return ""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+|https\S+', '', text)
    text = re.sub(r'@\w+|#\w+', '', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text

# Préparer les textes
disaster_texts = data[data['target'] == 1]['text'].apply(clean_text_for_analysis)
normal_texts = data[data['target'] == 0]['text'].apply(clean_text_for_analysis)

# Most frequent words: load the English stop words, downloading them on first use
try:
    stop_words = set(stopwords.words('english'))
except LookupError:
    nltk.download('stopwords')
    stop_words = set(stopwords.words('english'))

def get_top_words(texts, n=20):
    """Obtenir les mots les plus fréquents"""
    all_words = []
    for text in texts:
        words = text.split()
        words = [word for word in words if word not in stop_words and len(word) > 2]
        all_words.extend(words)
    return Counter(all_words).most_common(n)

disaster_words = get_top_words(disaster_texts)
normal_words = get_top_words(normal_texts)


# Visualisation des mots fréquents
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# Disaster words
disaster_df = pd.DataFrame(disaster_words, columns=['Word', 'Count'])
disaster_df.plot(x='Word', y='Count', kind='barh', ax=axes[0], color='salmon')
axes[0].set_title('Top 20 Mots - Disaster Tweets')
axes[0].set_xlabel('Fréquence')

# Normal words
normal_df = pd.DataFrame(normal_words, columns=['Word', 'Count'])
normal_df.plot(x='Word', y='Count', kind='barh', ax=axes[1], color='skyblue')
axes[1].set_title('Top 20 Mots - Normal Tweets')
axes[1].set_xlabel('Fréquence')

plt.tight_layout()
plt.show()
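
The cleaning and counting steps of this cell can be traced on a tiny example. The sketch below is a stdlib-only approximation: the three tweets are made up, and `STOP_WORDS` is a small stand-in for NLTK's English list, which the notebook actually uses.

```python
import re
from collections import Counter

# Tiny stand-in stop-word list (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"the", "is", "in", "and", "are", "a"}


def clean(text: str) -> str:
    """Lowercase, then strip URLs, mentions, hashtags and non-letters."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)
    text = re.sub(r"@\w+|#\w+", "", text)
    return re.sub(r"[^a-z\s]", "", text)


def top_words(texts, n=3):
    """Most common words longer than two letters that are not stop words."""
    words = [w for t in texts for w in clean(t).split()
             if w not in STOP_WORDS and len(w) > 2]
    return Counter(words).most_common(n)


# Made-up tweets, for illustration only.
tweets = [
    "Flood warning in the city! http://example.com",
    "The flood is rising, @mayor says #flood",
    "City roads are closed",
]
print(top_words(tweets))
```

Because hashtags are removed entirely, `#flood` in the second tweet does not add to the count of "flood". If hashtag words matter for the analysis, stripping only the `#` character would be an alternative.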


### Word Clouds

In [None]:

print("\n📊 7. NUAGES DE MOTS")
print("-" * 50)

try:
    # Word cloud pour disaster tweets
    disaster_text_combined = ' '.join(disaster_texts)
    wordcloud_disaster = WordCloud(width=800, height=400, 
                                 background_color='white',
                                 stopwords=stop_words,
                                 max_words=100).generate(disaster_text_combined)

    # Word cloud pour normal tweets
    normal_text_combined = ' '.join(normal_texts)
    wordcloud_normal = WordCloud(width=800, height=400, 
                               background_color='white',
                               stopwords=stop_words,
                               max_words=100).generate(normal_text_combined)

    # Affichage
    fig, axes = plt.subplots(1, 2, figsize=(16, 8))

    axes[0].imshow(wordcloud_disaster, interpolation='bilinear')
    axes[0].set_title('Word Cloud - Disaster Tweets', fontsize=16)
    axes[0].axis('off')

    axes[1].imshow(wordcloud_normal, interpolation='bilinear')
    axes[1].set_title('Word Cloud - Normal Tweets', fontsize=16)
    axes[1].axis('off')

    plt.tight_layout()
    plt.show()
    
except ImportError:
    print("⚠️  WordCloud non disponible. Installer avec: pip install wordcloud")

### Correlation analysis

In [None]:
print("\n📊 8. ANALYSE DE CORRÉLATION")
print("-" * 50)

# Matrice de corrélation pour les variables numériques
numeric_features = ['text_length', 'word_count', 'url_count', 'mention_count', 
                   'hashtag_count', 'exclamation_count', 'question_count', 'caps_count', 'target']

correlation_matrix = data[numeric_features].corr()

print("🔗 Corrélations avec la variable target:")
target_correlations = correlation_matrix['target'].sort_values(key=abs, ascending=False)
print(target_correlations)

# Heatmap de corrélation
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5)
plt.title('Matrice de Corrélation')
plt.tight_layout()
plt.show()

### Detailed comparative analysis

In [None]:
print("\n📊 9. ANALYSE COMPARATIVE DÉTAILLÉE")
print("-" * 50)

# Tests statistiques
from scipy.stats import ttest_ind, chi2_contingency

print("🧪 TESTS STATISTIQUES:")

# Test t pour longueur de texte
disaster_lengths = data[data['target'] == 1]['text_length']
normal_lengths = data[data['target'] == 0]['text_length']
t_stat, p_value = ttest_ind(disaster_lengths, normal_lengths)

print(f"Test t - Longueur texte: t={t_stat:.3f}, p={p_value:.6f}")
if p_value < 0.05:
    print("  ✅ Différence significative dans la longueur des textes")
else:
    print("  ❌ Pas de différence significative dans la longueur des textes")

# Test du chi-carré pour keywords
if not data['keyword'].isna().all():
    keyword_target_crosstab = pd.crosstab(data['keyword'].fillna('Missing'), data['target'])
    chi2, p_val, dof, expected = chi2_contingency(keyword_target_crosstab)
    print(f"Test Chi² - Keywords: χ²={chi2:.3f}, p={p_val:.6f}")
    if p_val < 0.05:
        print("  ✅ Association significative entre keywords et target")
    else:
        print("  ❌ Pas d'association significative entre keywords et target")
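
`ttest_ind` is called above with its default `equal_var=True`, which assumes both classes have the same variance. When that assumption is doubtful, Welch's variant (`equal_var=False` in SciPy) is a common alternative. The sketch below computes Welch's statistic by hand on toy numbers to show what changes; the `welch_t` helper and the sample values are illustrative assumptions.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)  # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof


# Toy samples with clearly different spreads (illustrative values only).
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, dof = {dof:.2f}")
```

Each group contributes its own variance to the standard error, and the degrees of freedom shrink when the variances differ, which makes the test more conservative than the pooled version.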


## Preprocessing

#### Full imports

In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score, classification_report
import xgboost as xgb
import lightgbm as lgb

# Pour Word2Vec et FastText
from gensim.models import Word2Vec, FastText
from nltk.tokenize import word_tokenize
import nltk
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

import re
from scipy.sparse import hstack, csr_matrix
import warnings
warnings.filterwarnings('ignore')

#### Full preprocessing

In [10]:

def extract_features(df):
    """Extract all the important features identified in the analysis."""
    features = df.copy()
    
    # Top features (dans l'ordre de corrélation)
    features['url_count'] = df['text'].str.count(r'http\S+|www\S+')           # +0.196
    features['text_length'] = df['text'].str.len()                            # +0.182
    features['mention_count'] = df['text'].str.count(r'@\w+')                 # -0.103
    features['question_count'] = df['text'].str.count(r'\?')                  # -0.084
    features['caps_count'] = df['text'].str.count(r'[A-Z]')                   # +0.078
    features['exclamation_count'] = df['text'].str.count(r'!')                # -0.075
    features['hashtag_count'] = df['text'].str.count(r'#\w+')                 # +0.052
    features['word_count'] = df['text'].str.split().str.len()                 # +0.040
    
    # Features composées
    features['urgency_score'] = features['url_count'] + features['caps_count']
    features['social_score'] = features['mention_count'] + features['hashtag_count']
    
    return features

def clean_text(text):
    """Clean the text: lowercase, strip URLs, mentions, hashtags and non-letters."""
    if pd.isna(text):
        return ""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'@\w+|#\w+', '', text)
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    return ' '.join(text.split())

def preprocess_data(df):
    """Full preprocessing; replaces your class."""
    print("🔧 Preprocessing complet...")
    
    # Features numériques
    df_features = extract_features(df)
    
    # Nettoyage du texte
    df_features['clean_text'] = df['text'].apply(clean_text)
    
    # Encodage des catégories
    le_location = LabelEncoder()
    le_keyword = LabelEncoder()
    
    df_features['location_encoded'] = le_location.fit_transform(df['location'].fillna('unknown'))
    df_features['keyword_encoded'] = le_keyword.fit_transform(df['keyword'].fillna('none'))
    
    return df_features, le_location, le_keyword

print(extract_features(data).columns)

Index(['id', 'keyword', 'location', 'text', 'target', 'url_count',
       'text_length', 'mention_count', 'question_count', 'caps_count',
       'exclamation_count', 'hashtag_count', 'word_count', 'urgency_score',
       'social_score'],
      dtype='object')
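
One point to watch: `preprocess_data` creates fresh `LabelEncoder` objects on every call, and later cells call it once on the training split and once on the test split. The same integer can then stand for different categories in each split. The sketch below shows the usual remedy in plain Python: learn the mapping on the training values only, then reuse it. The helper names `fit_codes` and `apply_codes` are hypothetical; scikit-learn's `OrdinalEncoder` with `handle_unknown="use_encoded_value"` offers comparable behaviour.

```python
def fit_codes(values):
    """Map each distinct training category to an integer (sorted, like LabelEncoder)."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def apply_codes(values, codes, unknown=-1):
    """Encode values with an existing mapping; unseen categories get `unknown`."""
    return [codes.get(value, unknown) for value in values]


# Made-up keyword columns, for illustration only.
train_keywords = ["fire", "flood", "none", "fire"]
test_keywords = ["flood", "earthquake", "none"]

codes = fit_codes(train_keywords)                  # learned on the training split only
train_encoded = apply_codes(train_keywords, codes)
test_encoded = apply_codes(test_keywords, codes)   # same mapping reused
print(codes, train_encoded, test_encoded)
```

Even with a consistent mapping, integer codes impose an arbitrary order on nominal categories, which matters for linear models; one-hot encoding is a possible alternative there.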


#### All vectorizers

In [17]:
def get_tfidf_features(X_train, X_test, max_features=5000):
    """TF-IDF Vectorization"""
    print("  📝 TF-IDF...")
    tfidf = TfidfVectorizer(max_features=max_features, ngram_range=(1,2))
    X_train_vec = tfidf.fit_transform(X_train['clean_text'])
    X_test_vec = tfidf.transform(X_test['clean_text'])
    return X_train_vec, X_test_vec, tfidf

def get_word2vec_features(X_train, X_test, vector_size=100):
    """Word2Vec Vectorization"""
    print("  🧠 Word2Vec...")
    
    # Tokenisation
    train_tokens = [text.split() for text in X_train['clean_text']]
    test_tokens = [text.split() for text in X_test['clean_text']]
    
    # Entraînement Word2Vec
    w2v_model = Word2Vec(sentences=train_tokens, vector_size=vector_size, window=5, min_count=2, workers=4)
    
    # Moyenne des vecteurs par document
    def average_vectors(tokens, model):
        vectors = [model.wv[word] for word in tokens if word in model.wv]
        return np.mean(vectors, axis=0) if vectors else np.zeros(vector_size)
    
    X_train_vec = np.array([average_vectors(tokens, w2v_model) for tokens in train_tokens])
    X_test_vec = np.array([average_vectors(tokens, w2v_model) for tokens in test_tokens])
    
    return csr_matrix(X_train_vec), csr_matrix(X_test_vec), w2v_model

def get_fasttext_features(X_train, X_test, vector_size=100):
    """FastText Vectorization"""
    print("  ⚡ FastText...")
    
    # Tokenisation
    train_tokens = [text.split() for text in X_train['clean_text']]
    test_tokens = [text.split() for text in X_test['clean_text']]
    
    # Entraînement FastText
    ft_model = FastText(sentences=train_tokens, vector_size=vector_size, window=5, min_count=2, workers=4)
    
    # Moyenne des vecteurs par document
    def average_vectors(tokens, model):
        vectors = [model.wv[word] for word in tokens if word in model.wv]
        return np.mean(vectors, axis=0) if vectors else np.zeros(vector_size)
    
    X_train_vec = np.array([average_vectors(tokens, ft_model) for tokens in train_tokens])
    X_test_vec = np.array([average_vectors(tokens, ft_model) for tokens in test_tokens])
    
    return csr_matrix(X_train_vec), csr_matrix(X_test_vec), ft_model
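
Both embedding-based vectorizers represent a tweet as the mean of its word vectors, falling back to a zero vector when no token is known. The sketch below reproduces that logic with plain lists and a made-up two-dimensional embedding table, so the numbers are illustrative only.

```python
def average_vector(tokens, embeddings, size):
    """Mean of the known tokens' vectors, or a zero vector if none is known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * size
    return [sum(dim) / len(known) for dim in zip(*known)]


# Toy 2-dimensional embeddings (made-up numbers, not trained vectors).
toy_embeddings = {"fire": [1.0, 0.0], "flood": [0.0, 1.0]}

doc_vec = average_vector(["fire", "flood", "unknown"], toy_embeddings, size=2)
empty_vec = average_vector(["unknown"], toy_embeddings, size=2)
print(doc_vec, empty_vec)
```

Averaging discards word order and gives every token the same weight, which may partly explain why these representations trail TF-IDF on short texts such as tweets.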

#### Models to test

In [18]:
models_config = {
    'Logistic': LogisticRegression(random_state=42, max_iter=1000),
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'XGBoost': xgb.XGBClassifier(random_state=42, eval_metric='logloss', verbosity=0),
    'LightGBM': lgb.LGBMClassifier(random_state=42, verbose=-1)
}

vectorizers_config = {
    'TF-IDF': get_tfidf_features,
    'Word2Vec': get_word2vec_features,
    'FastText': get_fasttext_features
}

#### Test all combinations

In [15]:
def test_all_combinations(X_train, X_test, y_train, y_test):
    """Test every model + vectorizer combination."""
    
    print("🚀 TEST DE TOUTES LES COMBINAISONS")
    print("=" * 50)
    
    # Features numériques (communes à tous)
    num_features = ['url_count', 'text_length', 'mention_count', 'question_count',
                   'caps_count', 'exclamation_count', 'hashtag_count', 'word_count',
                   'urgency_score', 'social_score', 'location_encoded', 'keyword_encoded']
    
    scaler = StandardScaler()
    X_train_num = scaler.fit_transform(X_train[num_features])
    X_test_num = scaler.transform(X_test[num_features])
    
    results = []
    
    # Boucle sur tous les vectorizers
    for vec_name, vec_func in vectorizers_config.items():
        print(f"\n📊 VECTORIZER: {vec_name}")
        print("-" * 25)
        
        try:
            # Obtenir les features textuelles
            X_train_text, X_test_text, vectorizer = vec_func(X_train, X_test)
            
            # Combiner avec les features numériques
            X_train_final = hstack([X_train_text, csr_matrix(X_train_num)])
            X_test_final = hstack([X_test_text, csr_matrix(X_test_num)])
            
            # Tester tous les modèles avec ce vectorizer
            for model_name, model in models_config.items():
                print(f"  🎯 {model_name}...", end=" ")
                
                try:
                    # Entraînement
                    model.fit(X_train_final, y_train)
                    
                    # Prédictions
                    y_pred = model.predict(X_test_final)
                    
                    # Métriques
                    f1 = f1_score(y_test, y_pred)
                    accuracy = accuracy_score(y_test, y_pred)
                    
                    # Cross-validation
                    cv_scores = cross_val_score(model, X_train_final, y_train, cv=3, scoring='f1')
                    cv_mean = cv_scores.mean()
                    cv_std = cv_scores.std()
                    
                    results.append({
                        'Vectorizer': vec_name,
                        'Model': model_name,
                        'F1_Test': f1,
                        'Accuracy_Test': accuracy,
                        'F1_CV_Mean': cv_mean,
                        'F1_CV_Std': cv_std,
                        'Combination': f"{vec_name} + {model_name}"
                    })
                    
                    print(f"F1: {f1:.4f} | CV: {cv_mean:.4f}±{cv_std:.4f}")
                    
                except Exception as e:
                    print(f"❌ Erreur: {str(e)}")
            
        except Exception as e:
            print(f"❌ Erreur avec {vec_name}: {str(e)}")
    
    return results
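
In `test_all_combinations`, the vectorizer is fitted on the whole training split before `cross_val_score` runs, so each validation fold has already influenced the vocabulary and IDF weights. A possible way to avoid this is to put the vectorizer and the classifier in one scikit-learn pipeline, which is then refitted inside every fold. The sketch below assumes a made-up toy corpus and is not the notebook's own setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up toy corpus: 1 = disaster, 0 = not a disaster.
texts = [
    "forest fire spreading fast", "flood destroys bridge", "earthquake hits city",
    "wildfire evacuation ordered", "storm floods the streets", "fire burns houses",
    "i love this song", "great pizza tonight", "my phone is on fire lol",
    "what a lovely day", "watching a movie tonight", "this song is great",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The vectorizer is part of the estimator, so each fold refits it on its own training part.
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000, random_state=42),
)
scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="f1")
print(scores.mean(), scores.std())
```

With a pipeline, the cross-validated score estimates the whole text-to-prediction chain rather than the classifier alone.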




#### Results analysis

In [14]:
def analyze_results(results):
    """Full analysis of the results."""
    
    if not results:
        print("❌ Aucun résultat à analyser")
        return None
    
    df = pd.DataFrame(results)
    df_sorted = df.sort_values('F1_Test', ascending=False)
    
    print("\n🏆 TOP 10 COMBINAISONS")
    print("=" * 50)
    print(df_sorted.head(10)[['Combination', 'F1_Test', 'Accuracy_Test', 'F1_CV_Mean']].to_string(index=False))
    
    print(f"\n🥇 MEILLEURE COMBINAISON:")
    best = df_sorted.iloc[0]
    print(f"   {best['Combination']}")
    print(f"   F1 Test: {best['F1_Test']:.4f}")
    print(f"   Accuracy: {best['Accuracy_Test']:.4f}")
    print(f"   F1 CV: {best['F1_CV_Mean']:.4f} ± {best['F1_CV_Std']:.4f}")
    
    # Analyse par vectorizer
    print(f"\n📈 ANALYSE PAR VECTORIZER:")
    vec_analysis = df.groupby('Vectorizer').agg({
        'F1_Test': ['mean', 'max', 'std'],
        'F1_CV_Mean': 'mean'
    }).round(4)
    print(vec_analysis)
    
    # Analyse par modèle
    print(f"\n🤖 ANALYSE PAR MODÈLE:")
    model_analysis = df.groupby('Model').agg({
        'F1_Test': ['mean', 'max', 'std'],
        'F1_CV_Mean': 'mean'
    }).round(4)
    print(model_analysis)
    
    return df_sorted

#### Quick Start

In [19]:
def quick_start(df, target_col='target'):
    """Quick start: run every test automatically."""
    
    print("⚡ QUICK START - Test complet automatique")
    print("=" * 50)
    
    # Split des données
    X = df[['text', 'location', 'keyword']]
    y = df[target_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    
    # Preprocessing
    X_train_processed, _, _ = preprocess_data(X_train)
    X_test_processed, _, _ = preprocess_data(X_test)
    
    # Test toutes les combinaisons
    results = test_all_combinations(X_train_processed, X_test_processed, y_train, y_test)
    
    # Analyse
    df_results = analyze_results(results)
    
    return df_results, results 

quick_start(data)

⚡ QUICK START - Test complet automatique
🔧 Preprocessing complet...
🔧 Preprocessing complet...
🚀 TEST DE TOUTES LES COMBINAISONS

📊 VECTORIZER: TF-IDF
-------------------------
  📝 TF-IDF...
  🎯 Logistic... F1: 0.7530 | CV: 0.7277±0.0058
  🎯 RandomForest... F1: 0.6955 | CV: 0.6758±0.0092
  🎯 XGBoost... F1: 0.7221 | CV: 0.7003±0.0086
  🎯 LightGBM... F1: 0.7331 | CV: 0.6987±0.0083

📊 VECTORIZER: Word2Vec
-------------------------
  🧠 Word2Vec...
  🎯 Logistic... F1: 0.6128 | CV: 0.5993±0.0061
  🎯 RandomForest... F1: 0.6386 | CV: 0.6349±0.0039
  🎯 XGBoost... F1: 0.6567 | CV: 0.6498±0.0105
  🎯 LightGBM... F1: 0.6889 | CV: 0.6596±0.0020

📊 VECTORIZER: FastText
-------------------------
  ⚡ FastText...
  🎯 Logistic... F1: 0.5683 | CV: 0.5607±0.0085
  🎯 RandomForest... F1: 0.6323 | CV: 0.6368±0.0126
  🎯 XGBoost... F1: 0.6874 | CV: 0.6543±0.0118
  🎯 LightGBM... F1: 0.6834 | CV: 0.6627±0.0094

🏆 TOP 10 COMBINAISONS
            Combination  F1_Test  Accuracy_Test  F1_CV_Mean
      TF-IDF + Logist

(   Vectorizer         Model   F1_Test  Accuracy_Test  F1_CV_Mean  F1_CV_Std  \
 0      TF-IDF      Logistic  0.753036       0.799737    0.727661   0.005801   
 3      TF-IDF      LightGBM  0.733061       0.785292    0.698744   0.008280   
 2      TF-IDF       XGBoost  0.722130       0.780696    0.700260   0.008590   
 1      TF-IDF  RandomForest  0.695499       0.773473    0.675841   0.009174   
 7    Word2Vec      LightGBM  0.688871       0.748523    0.659589   0.001971   
 10   FastText       XGBoost  0.687403       0.736047    0.654262   0.011773   
 11   FastText      LightGBM  0.683401       0.743270    0.662724   0.009449   
 6    Word2Vec       XGBoost  0.656669       0.714380    0.649792   0.010472   
 5    Word2Vec  RandomForest  0.638614       0.712410    0.634936   0.003852   
 9    FastText  RandomForest  0.632280       0.709783    0.636828   0.012617   
 4    Word2Vec      Logistic  0.612800       0.682206    0.599289   0.006132   
 8    FastText      Logistic  0.568278  

### Best model

#### Data preprocessing (train/test split + TF-IDF)

In [20]:
# Step 1: select the useful columns and split
X = data[['text', 'location', 'keyword']]
y = data['target']

from sklearn.model_selection import train_test_split
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: preprocessing (defined earlier in the notebook)
X_train_processed, _, _ = preprocess_data(X_train_raw)
X_test_processed, _, _ = preprocess_data(X_test_raw)

# Step 3: TF-IDF vectorization (also defined earlier)
X_train_vec, X_test_vec, tfidf = get_tfidf_features(X_train_processed, X_test_processed)


🔧 Preprocessing complet...
🔧 Preprocessing complet...
  📝 TF-IDF...


#### Optimization: Logistic Regression

In [31]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.01, 0.1, 1, 2, 3],
    'penalty': ['l2'],
    'solver': ['liblinear', 'lbfgs']
}

grid_lr = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42, class_weight='balanced'),
    param_grid,
    scoring='f1',
    cv=3,
    n_jobs=-1
)

grid_lr.fit(X_train_vec, y_train)
print("Best params LR:", grid_lr.best_params_)
print("Best F1 CV LR:", grid_lr.best_score_)

Best params LR: {'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}
Best F1 CV LR: 0.7389304268413587
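
To gauge the cost of a grid search before launching it, the number of model fits can be counted from the grid itself. The sketch below enumerates the logistic regression grid used above with `itertools.product`; the variable names are illustrative.

```python
from itertools import product

param_grid = {
    "C": [0.01, 0.1, 1, 2, 3],
    "penalty": ["l2"],
    "solver": ["liblinear", "lbfgs"],
}
cv_folds = 3

# Every combination of values is one candidate; each candidate is fitted once per fold.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_fits = len(candidates) * cv_folds
print(len(candidates), n_fits)
```

By default `GridSearchCV` also refits the best candidate once on the full training data (`refit=True`), so the total is one fit higher than this count.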


#### Optimization: XGBoost

In [None]:
import xgboost as xgb

param_grid_xgb = {
    'n_estimators': [700, 800, 600],
    'max_depth': [7, 5, 6],
    'learning_rate': [0.5, 0.3, 0.2]
}

grid_xgb = GridSearchCV(
    xgb.XGBClassifier(random_state=42, eval_metric='logloss'),  # use_label_encoder is deprecated and no longer needed
    param_grid_xgb,
    scoring='f1',
    cv=3,
    n_jobs=-1
)

grid_xgb.fit(X_train_vec, y_train)
print("Best params XGB:", grid_xgb.best_params_)
print("Best F1 CV XGB:", grid_xgb.best_score_)

Best params XGB: {'learning_rate': 0.2, 'max_depth': 5, 'n_estimators': 800}
Best F1 CV XGB: 0.715595017305107


#### Optimization: LightGBM

In [None]:
import lightgbm as lgb

param_grid_lgb = {
    'n_estimators': [600, 400, 200],
    # Note: in LightGBM any max_depth <= 0 means "no limit", so these three values are equivalent
    'max_depth': [-1, -3, 0],
    'learning_rate': [0.05, 0.1, 0.3]
}

grid_lgb = GridSearchCV(
    lgb.LGBMClassifier(random_state=42),
    param_grid_lgb,
    scoring='f1',
    cv=3,
    n_jobs=-1
)

grid_lgb.fit(X_train_vec, y_train)
print("Best params LGBM:", grid_lgb.best_params_)
print("Best F1 CV LGBM:", grid_lgb.best_score_)

Best params LGBM: {'learning_rate': 0.1, 'max_depth': -1, 'n_estimators': 200}
Best F1 CV LGBM: 0.6870690104510798


#### Evaluating the best model on the test set

In [32]:
from sklearn.metrics import f1_score, accuracy_score, classification_report

best_model = grid_lr.best_estimator_
y_pred = best_model.predict(X_test_vec)

print("F1 test:", f1_score(y_test, y_pred))
print("Accuracy test:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

F1 test: 0.75642965204236
Accuracy test: 0.788575180564675
              precision    recall  f1-score   support

           0       0.82      0.81      0.81       869
           1       0.75      0.76      0.76       654

    accuracy                           0.79      1523
   macro avg       0.78      0.79      0.78      1523
weighted avg       0.79      0.79      0.79      1523



## Data profiling

## Feature selection

In [None]:
SEED

In [None]:
# Predictive Power Score (PPS) : https://github.com/8080labs/ppscore/
pps_predictors = pps.predictors(df=data.drop(["Id", "YrSold", "YearBuilt", "YearRemodAdd", "GarageYrBlt"], axis=1),
                                y=TARGET_NAME, output="df", random_seed=SEED)

In [None]:
pps_predictors

In [None]:
pps.__version__

In [None]:
# check whether the output contains any invalid PPS scores
pps_predictors.is_valid_score.value_counts()

In [None]:
# get feature names
FEATURE_NAMES = pps_predictors.loc[pps_predictors.ppscore >= MODEL_PARAMS["MIN_PPS"], "x"].values
set(FEATURE_NAMES)

__Data leakage__

Watch out for data leakage.

Some variables score as important even though they will not be available at prediction time (for example, SaleCondition).
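
One way to act on this warning is to keep an explicit list of columns that are only known once the sale has happened and to remove them from the selected features. The sketch below is an illustration under assumptions: the `LEAKY_COLUMNS` name and the sample feature array are hypothetical, and only `SaleCondition` comes from the text above.

```python
import numpy as np

# Hypothetical exclusion list; SaleCondition is the example cited in the notebook
LEAKY_COLUMNS = {"SaleCondition"}

# Stand-in for the FEATURE_NAMES array produced by the PPS selection
selected = np.array(["OverallQual", "GrLivArea", "SaleCondition", "Neighborhood"])

# Keep the original order while dropping the leaky columns
safe_features = [str(name) for name in selected if name not in LEAKY_COLUMNS]
print(safe_features)
```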

In [None]:
plt.figure(figsize=(10, 8))
ax = sns.barplot(data=pps_predictors.loc[lambda dfr: dfr.ppscore > 0], y="x", x="ppscore", orient="h")
ax.set_title(F"Predictive Power Score (PPS) for {TARGET_NAME}")

# add the annotation
ax.bar_label(ax.containers[-1], fmt='%.3f', label_type='edge');

# Modeling

## Pipeline

![mlflow-tracking](https://mlflow.org/docs/latest/_images/quickstart_tracking_overview.png)

In [None]:
from typing import Union, Dict, Any


def eval_metrics(y_actual: Union[pd.DataFrame, pd.Series, np.ndarray],
                 y_pred: Union[pd.DataFrame, pd.Series, np.ndarray]
                 ) -> Dict[str, float]:
    """Compute evaluation metrics.

    Args:
        y_actual: Ground truth (correct) target values
        y_pred: Estimated target values.

    Returns:
        Dict[str, float]: dictionary of evaluation metrics.
            Expected keys are: "rmse", "mae", "mape", "r2", "max_error"

    """
    # Calculate Root mean squared error, named rmse
    rmse = root_mean_squared_error(y_actual, y_pred)
    # Calculate mean absolute error, named mae
    mae = mean_absolute_error(y_actual, y_pred)
    # Mean absolute percentage error (MAPE)
    mape = mean_absolute_percentage_error(y_actual, y_pred)
    # Calculate R-squared: coefficient of determination, named r2
    r2 = r2_score(y_actual, y_pred)
    # Calculate max error: maximum value of absolute error (y_actual - y_pred), named maxerror
    maxerror = max_error(y_actual, y_pred)
    return {"rmse": rmse,
            "mae": mae,
            "mape": mape,
            "r2": r2,
            "max_error": maxerror
           }
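
To make the five metrics concrete, here is a small worked example computed with plain NumPy rather than with the scikit-learn functions used in `eval_metrics`. The prices are made up for illustration only.

```python
import numpy as np

# Made-up prices for illustration only
y_actual = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 360.0])

errors = y_actual - y_pred                          # [-10, 10, -30, 40]
rmse = np.sqrt(np.mean(errors ** 2))                # penalizes large errors more
mae = np.mean(np.abs(errors))                       # average absolute error
mape = np.mean(np.abs(errors) / np.abs(y_actual))   # relative error, as a fraction
max_err = np.max(np.abs(errors))                    # worst single prediction
# R2 = 1 - residual sum of squares / total sum of squares
r2 = 1 - np.sum(errors ** 2) / np.sum((y_actual - y_actual.mean()) ** 2)

print({"rmse": rmse, "mae": mae, "mape": mape, "r2": r2, "max_error": max_err})
```

RMSE and MAE are expressed in the unit of the target, while MAPE and R2 are unitless, which makes them easier to compare across datasets.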

In [None]:
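def define_pipeline(numerical_transformer: list,
                    categorical_transformer: list,
                    estimator: Any,
                    target_transformer: bool=False,
                    **kwargs: dict) -> Union[Pipeline, TransformedTargetRegressor]:
    """Define pipeline for modeling.

    Args:
        numerical_transformer: transformers applied, in order, to numeric columns.
        categorical_transformer: transformers applied, in order, to object and bool columns.
        estimator: scikit-learn estimator instance appended after the preprocessing.
        target_transformer: if True, fit on log(y) and map predictions back with exp.
        kwargs: extra keyword arguments (currently unused).

    Returns:
        The preprocessing + estimator Pipeline, wrapped in a
        TransformedTargetRegressor when target_transformer is True.
    """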
    numerical_transformer = make_pipeline(*numerical_transformer)

    categorical_transformer = make_pipeline(*categorical_transformer)

    preprocessor = ColumnTransformer(
        transformers=[
            ("num", numerical_transformer, make_column_selector(dtype_include=["number"])),
            ("cat", categorical_transformer, make_column_selector(dtype_include=["object", "bool"])),
        ],
        remainder="drop",  # non-specified columns are dropped
        verbose_feature_names_out=False,  # will not prefix any feature names with the name of the transformer
    )
    # Append regressor to preprocessing pipeline.
    # Now we have a full prediction pipeline.
    if target_transformer:
        # Train on log(target); predictions are mapped back with exp
        inner_pipe = Pipeline(steps=[("preprocessor", preprocessor),
                                     ("estimator", estimator)])
        model_pipe = TransformedTargetRegressor(regressor=inner_pipe,
                                                func=np.log,
                                                inverse_func=np.exp)
    else:
        model_pipe = Pipeline(steps=[("preprocessor", preprocessor), ("estimator", estimator)])

    # logger.info(f"{model_pipe}")
    return model_pipe
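
The `target_transformer=True` branch relies on `TransformedTargetRegressor`. The following self-contained sketch shows its behavior on synthetic data, not on the Ames dataset: the inner regressor is trained on `log(y)` and `predict` returns values on the original scale. The linear model and the data-generating process are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(200, 1))
# Synthetic, strictly positive target that is linear on the log scale
y = np.exp(1.0 + 0.5 * X[:, 0] + rng.normal(0, 0.05, size=200))

model = TransformedTargetRegressor(regressor=LinearRegression(),
                                   func=np.log,
                                   inverse_func=np.exp)
model.fit(X, y)

# The fitted inner regressor lives in regressor_ and works on log(y)
slope = model.regressor_.coef_[0]
# Predictions come back on the original (price-like) scale
pred = model.predict(np.array([[1.0]]))[0]
print("slope on log scale:", slope)
print("prediction for x=1:", pred)
```

Because `np.log` is only defined for strictly positive values, this approach suits targets such as prices; a target containing zeros would need a different transform, for example `np.log1p` paired with `np.expm1`.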

## Train / Test split

In [None]:
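# Split the data into train and test sets (25% for the test set)

# Fall back to the default feature list when the PPS step selected no feature
FEATURES = FEATURE_NAMES if len(FEATURE_NAMES) > 0 else MODEL_PARAMS["DEFAULT_FEATURE_NAMES"]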

x_train, x_test, y_train, y_test = train_test_split(data.loc[:, FEATURES],
                                                    data[TARGET_NAME],
                                                    test_size=MODEL_PARAMS["TEST_SIZE"],
                                                    random_state=SEED
                                                   )

logger.info(f"\nX train: {x_train.shape}\nY train: {y_train.shape}\n"
            f"X test: {x_test.shape}\nY test: {y_test.shape}")

## Baseline

In [None]:
# Model definition
reg = define_pipeline(numerical_transformer=[SimpleImputer(strategy="median"),
                                             RobustScaler()],
                      categorical_transformer=[SimpleImputer(strategy="constant", fill_value="undefined"),
                                               OneHotEncoder(drop="if_binary", handle_unknown="ignore")],
                      target_transformer=False,
                      # fix the seed so the baseline is reproducible
                      estimator=RandomForestRegressor(n_estimators=30, random_state=SEED)
                 )

reg

### Training

In [None]:
# model training and selection

reg.fit(x_train, y_train)

### Model evaluation

In [None]:
# evaluate the trained model on both the train and the test set

# Compute the evaluation metrics
y_train_pred = reg.predict(x_train)
y_test_pred = reg.predict(x_test)
train_metrics = eval_metrics(y_train, y_train_pred)
test_metrics = eval_metrics(y_test, y_test_pred)

# log out metrics
logger.info(f"""Performances\n{pd.DataFrame({"train": train_metrics, "test": test_metrics}).T}""")

#### Residuals analysis

In [None]:
visualizer = ResidualsPlot(reg, is_fitted="auto")

visualizer.fit(x_train, y_train)  # Fit the training data to the visualizer
visualizer.score(x_test, y_test)  # Evaluate the model on the test data
visualizer.show();                # Finalize and render the figure

#### Prediction plot

In [None]:
visualizer = PredictionError(reg, is_fitted="auto", bestfit=True, identity=True)

visualizer.fit(x_train, y_train)  # Fit the training data to the visualizer
visualizer.score(x_test, y_test)  # Evaluate the model on the test data
visualizer.show();                # Finalize and render the figure

### Try one of the following packages:

__Exo 1__:

1- [PyCaret](https://pycaret.org/): An open source, low-code machine learning library in Python.

2- [LazyPredict](https://pypi.org/project/lazypredict/): Lazy Predict helps build many basic models with little code and shows which models work better without any parameter tuning.


__Exo 2__:

1- Test the prediction with the log-transformed target variable


__Exo 3__:

1- Read the MLflow documentation via the links given at the beginning of the notebook



import pycaret.regression as pyr
from lazypredict.Supervised import LazyRegressor


### Tracking

In [None]:
# Create an experiment if not exists
exp_name = "house-price"
experiment = mlflow.get_experiment_by_name(exp_name)
if not experiment:
    experiment_id = mlflow.create_experiment(exp_name)
else:
    experiment_id = experiment.experiment_id
    
logger.info(f"Experiment id: {experiment_id}")

In [None]:
import mlflow
for e in mlflow.search_experiments():
    print(f"{e.experiment_id} - {e.name} - {e.artifact_location}")

In [None]:
# Define models and parameters to benchmark
ESTIMATOR_PARAMS = {DummyRegressor.__name__: {"estimator": DummyRegressor,
                                              "params": {"strategy": "median"}
                                             },
                    RandomForestRegressor.__name__: {"estimator": RandomForestRegressor,
                                                     "params": {"n_estimators": 30,
                                                                "max_depth": 3,
                                                                "random_state": SEED
                                                               }
                                             },
                    GradientBoostingRegressor.__name__: {"estimator": GradientBoostingRegressor,
                                                         "params": {"n_estimators": 30,
                                                                    "learning_rate": 0.01,
                                                                    "max_depth": 3,
                                                                    "random_state": SEED
                                                                   }
                                                        }
}

ESTIMATOR_PARAMS

In [None]:
for model_name, model_configs in ESTIMATOR_PARAMS.items():
    logger.info(f"{model_name} \n{model_configs}")
    
    estimator = model_configs["estimator"]
    params = model_configs["params"]
    
    # One MLflow run per model; the timestamp in the name tells repeated runs apart
    # (%M is minutes; %m would repeat the month)
    with mlflow.start_run(run_name=f"{CURRENT_DATE.strftime('%Y%m%d_%H%M%S')}-house_price-{model_name}",
                          experiment_id=experiment_id,
                          tags={"version": "v1", "priority": "P1"},
                          description="house price modeling",
                         ) as mlf_run:
        logger.info(f"run_id: {mlf_run.info.run_id}")
        logger.info(f"version tag value: {mlf_run.data.tags.get('version')} -------------------------------")

        # log parameters
        mlflow.log_params(params)
        
        
        # Model definition
        reg = define_pipeline(numerical_transformer=[SimpleImputer(strategy="median"),
                                                     RobustScaler()],
                              categorical_transformer=[SimpleImputer(strategy="constant", fill_value="undefined"),
                                                       OneHotEncoder(drop="if_binary", handle_unknown="ignore")],
                              target_transformer=False,
                              estimator=estimator(**params)
                         )

        reg.fit(x_train, y_train)

        # Evaluate Metrics
        y_train_pred = reg.predict(x_train)
        y_test_pred = reg.predict(x_test)
        train_metrics = eval_metrics(y_train, y_train_pred)
        test_metrics = eval_metrics(y_test, y_test_pred)
        
        # log out metrics
        logger.info(f"Train: {train_metrics}")
        logger.info(f"Test: {test_metrics}")

        # Infer model signature with a sample
        predictions = reg.predict(x_train[:30])
        signature = mlflow.models.infer_signature(x_train[:30], predictions)

        # Log the test metrics and the model to MLflow
        mlflow.log_metrics(test_metrics)
        mlflow.sklearn.log_model(reg,
                                 artifact_path=reg[-1].__class__.__name__,
                                 signature=signature,
                                 input_example=x_train[:30])


Try the following models:
    
    - DummyRegressor
    - Ridge regression
    - Boosting or another ensemble model
    
30 mins
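
As a possible starting point for this exercise, the sketch below compares the three model families on synthetic data generated with `make_regression`, not on the Ames data. The hyperparameter values are arbitrary assumptions. In the notebook, the same estimators could instead be added to `ESTIMATOR_PARAMS` and run through the MLflow loop.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (stand-in for the real features and target)
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Arbitrary hyperparameters, chosen only for illustration
candidates = {
    "DummyRegressor": DummyRegressor(strategy="median"),
    "Ridge": Ridge(alpha=1.0),
    "GradientBoostingRegressor": GradientBoostingRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = mean_absolute_error(y_te, model.predict(X_te))

# Lower MAE is better; the dummy model is the reference any real model should beat
for name, mae in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: MAE = {mae:.2f}")
```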

### Performance analysis

### Feature importances

### Save model

## Hyperparameters tuning

## Session info

In [None]:
import session_info

In [None]:
session_info.show(dependencies=False)