# 🎬 Movie Tycoon IA – Success Prediction
Training notebook for a machine learning model that predicts whether a film is a **success**, using data from `dataset.csv`.

**Success criterion**: `Average_rating ≥ 3.5`

⚙️ Required dependencies:
```bash
pip install pandas scikit-learn joblib
```

In [1]:
import re, pathlib, joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

## 🔄 Loading and preparing the data

In [2]:
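DATA_PATH = pathlib.Path('dataset.csv')
if not DATA_PATH.exists():
    raise FileNotFoundError('dataset.csv not found.')

df = pd.read_csv(DATA_PATH)
# Drop rows that lack the target or any feature used by the model.
df = df.dropna(subset=['Average_rating', 'Genres', 'Runtime', 'Original_language', 'Countries']).copy()

# Binary target: 1 when the average rating reaches the 3.5 threshold.
df['success'] = (df['Average_rating'] >= 3.5).astype(int)
# Lowercase the genre string and replace commas, slashes, and whitespace runs with spaces.
df['Genres_clean'] = df['Genres'].fillna('').apply(lambda g: re.sub(r'[,/]|\s+', ' ', g.lower().strip()))
# The fills below are defensive no-ops here, since dropna already removed rows missing these columns.
df['Runtime'] = df['Runtime'].fillna(df['Runtime'].median())
df[['Original_language', 'Countries']] = df[['Original_language', 'Countries']].fillna('unknown')

df.head()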

| Unnamed: 0 | Film_title | Release_year | Director | Cast | Average_rating | Owner_rating | Genres | Runtime | Countries | Original_language | ... | ★★½ | ★★★ | ★★★½ | ★★★★ | ★★★★½ | ★★★★★ | Total_ratings | Film_URL | success | Genres_clean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Fan | | Eckhart Schmidt | ['Désirée Nosbusch', 'Bodo Staiger', 'Simone B... | 3.57 | | ['Horror', 'Drama'] | 92.0 | ['Germany'] | German | ... | 525 | 1660 | 1950 | 2646 | 808 | 714 | 9042 | https://letterboxd.com/film/the-fan-1982/ | 1 | ['horror' 'drama'] |
| 1 | Mad Max: Fury Road | | George Miller | ['Tom Hardy', 'Charlize Theron', 'Nicholas Hou... | 4.18 | 4.5 | ['Adventure', 'Science Fiction', 'Action'] | 121.0 | ['Australia', 'USA'] | English | ... | 30112 | 158356 | 163753 | 477901 | 280815 | 511140 | 1682389 | https://letterboxd.com/film/mad-max-fury-road/ | 1 | ['adventure' 'science fiction' 'action'] |
| 2 | Suspiria | | Dario Argento | ['Jessica Harper', 'Stefania Casini', 'Flavio ... | 3.93 | 4.0 | ['Horror'] | 99.0 | ['Italy'] | English | ... | 14397 | 53427 | 70309 | 138742 | 60628 | 88628 | 443757 | https://letterboxd.com/film/suspiria/ | 1 | ['horror'] |
| 3 | Lost in Translation | | Sofia Coppola | ['Bill Murray', 'Scarlett Johansson', 'Akiko T... | 3.79 | 4.5 | ['Drama', 'Comedy', 'Romance'] | 102.0 | ['UK', 'USA'] | English | ... | 46716 | 155110 | 166638 | 314160 | 122359 | 193717 | 1076949 | https://letterboxd.com/film/lost-in-translation/ | 1 | ['drama' 'comedy' 'romance'] |
| 4 | Akira | | Katsuhiro Otomo | ['Mitsuo Iwata', 'Nozomu Sasaki', 'Mami Koyama... | 4.28 | 5.0 | ['Animation', 'Action', 'Science Fiction'] | 124.0 | ['Japan'] | Japanese | ... | 9544 | 40850 | 61104 | 168485 | 112657 | 196532 | 600721 | https://letterboxd.com/film/akira/ | 1 | ['animation' 'action' 'science fiction'] |
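
The `Genres_clean` values in the preview still carry the brackets and quotes of the original list string, and a multi-word genre such as `science fiction` will be split into two tokens by the default `TfidfVectorizer` tokenizer. One possible alternative is sketched below. It assumes that `Genres` always holds a Python-style list literal, as in the rows shown; the helper name `genres_to_tokens` is hypothetical and not part of the notebook.

```python
import ast


def genres_to_tokens(raw):
    """Turn "['Science Fiction', 'Action']" into "science_fiction action"."""
    try:
        genres = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        # The value is not a parsable literal (empty string, NaN, free text).
        return ''
    if not isinstance(genres, (list, tuple)):
        return ''
    # Join multi-word genres with underscores so each genre stays one token.
    return ' '.join(str(g).strip().lower().replace(' ', '_') for g in genres)


example = genres_to_tokens("['Adventure', 'Science Fiction', 'Action']")
print(example)  # adventure science_fiction action
```

If this variant were adopted, it could replace the lambda via `df['Genres'].apply(genres_to_tokens)`. The scores reported in this notebook were obtained with the regex version, so they would not carry over unchanged.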


## ⚙️ Building the pipeline

In [3]:
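text_feature = 'Genres_clean'  # plain string, so TfidfVectorizer receives a 1-D column
cat_features = ['Original_language', 'Countries']
num_features = ['Runtime']

# Apply a dedicated transformer to each column group and concatenate the results.
preprocessor = ColumnTransformer([
    ('text', TfidfVectorizer(max_features=5000), text_feature),
    # Categories unseen during training are encoded as all zeros instead of raising.
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_features),
    ('num', StandardScaler(), num_features),
])

pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', LogisticRegression(max_iter=2000, n_jobs=-1))
])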
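
To make the shape of the assembled feature matrix concrete, here is a minimal sketch that runs the same three transformers on a few toy rows. The rows are invented for illustration and are not taken from `dataset.csv`; the names `toy` and `toy_preprocessor` are likewise hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    'Genres_clean': ['horror drama', 'action adventure', 'horror'],
    'Original_language': ['German', 'English', 'English'],
    'Countries': ["['Germany']", "['USA']", "['Italy']"],
    'Runtime': [92.0, 121.0, 99.0],
})

toy_preprocessor = ColumnTransformer([
    ('text', TfidfVectorizer(max_features=5000), 'Genres_clean'),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Original_language', 'Countries']),
    ('num', StandardScaler(), ['Runtime']),
])

features = toy_preprocessor.fit_transform(toy)

# 4 genre terms + (2 languages + 3 country strings) + 1 scaled runtime = 10 columns.
print(features.shape)  # (3, 10)
```

Note that `OneHotEncoder` treats each distinct `Countries` string as a single category, so a co-production such as `['Australia', 'USA']` gets its own column rather than sharing one with `['USA']`.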

## 🚂 Training and evaluating the model

In [4]:
X = df[[text_feature] + cat_features + num_features]
y = df['success']

# Stratified 80/20 split keeps roughly the same class ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
# Probability of the positive class (success = 1), needed for ROC AUC.
y_proba = pipeline.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred, digits=3))
print(f'ROC AUC: {roc_auc_score(y_test, y_proba):.3f}')

              precision    recall  f1-score   support

           0      0.745     0.862     0.799      1217
           1      0.636     0.450     0.527       653

    accuracy                          0.718      1870
   macro avg      0.691     0.656     0.663      1870
weighted avg      0.707     0.718     0.704      1870

ROC AUC: 0.761
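
As a sanity check on the report, the short sketch below recomputes three quantities from the printed figures alone: the accuracy of a trivial model that always predicts class 0, the F1 score of class 1, and the overall accuracy. All inputs are copied from the report; the variable names are illustrative.

```python
# Values copied from the classification report.
support_neg, support_pos = 1217, 653
precision_pos, recall_pos = 0.636, 0.450
recall_neg = 0.862

total = support_neg + support_pos

# Accuracy of a trivial model that always predicts the majority class (0).
baseline_accuracy = support_neg / total

# F1 is the harmonic mean of precision and recall.
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Accuracy is the support-weighted average of the per-class recalls.
accuracy = (recall_neg * support_neg + recall_pos * support_pos) / total

print(f'baseline accuracy: {baseline_accuracy:.3f}')  # 0.651
print(f'F1 (class 1):      {f1_pos:.3f}')             # 0.527
print(f'accuracy:          {accuracy:.3f}')           # 0.718
```

Read this way, the model improves on the majority-class baseline by about seven points of accuracy, while a recall of 0.450 on class 1 means it misses more than half of the successful films in the test set.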


## 💾 Saving the trained model

In [None]:
joblib.dump(pipeline, 'movie_success_model.pkl')
print('Model saved to movie_success_model.pkl')
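
The saved file can later be reloaded with `joblib.load` and used for prediction. The sketch below shows that round trip on a self-contained toy pipeline so that it runs without `dataset.csv`. The toy rows, labels, file name `toy_model.pkl`, and the example film are all invented for illustration.

```python
import pathlib
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training rows and labels, invented for illustration only.
toy = pd.DataFrame({
    'Genres_clean': ['horror drama', 'action adventure', 'horror', 'drama romance'],
    'Original_language': ['German', 'English', 'English', 'French'],
    'Countries': ["['Germany']", "['USA']", "['Italy']", "['France']"],
    'Runtime': [92.0, 121.0, 99.0, 102.0],
})
toy_labels = [1, 0, 1, 0]

toy_pipeline = Pipeline([
    ('preprocessing', ColumnTransformer([
        ('text', TfidfVectorizer(), 'Genres_clean'),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Original_language', 'Countries']),
        ('num', StandardScaler(), ['Runtime']),
    ])),
    ('classifier', LogisticRegression(max_iter=2000)),
]).fit(toy, toy_labels)

# Save to a temporary directory, then reload, mirroring joblib.dump above.
with tempfile.TemporaryDirectory() as tmp:
    model_path = pathlib.Path(tmp) / 'toy_model.pkl'
    joblib.dump(toy_pipeline, model_path)
    reloaded = joblib.load(model_path)

# A new film must provide the same four columns used at training time.
new_film = pd.DataFrame([{
    'Genres_clean': 'horror thriller',
    'Original_language': 'Spanish',  # unseen category, encoded as all zeros
    'Countries': "['Spain']",
    'Runtime': 95.0,
}])
proba = reloaded.predict_proba(new_film)[0, 1]
print(f'Estimated probability of success: {proba:.2f}')
```

With the real model, the same pattern would apply to `movie_success_model.pkl`. Pickle-based files should only be loaded from trusted sources, and preferably with the scikit-learn version that was used for training.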