# 🧪 Lab: Modelling & Model Lifecycle — Predicting Plant Production (GIST Steel Dataset)

---

## 🎯 Learning Outcomes

By completing this lab, you will be able to:

- Prepare and analyse a dataset for modelling.  
- Train and evaluate regression models.  
- Apply cross-validation and hyperparameter tuning using scikit-learn.  
- Track experiments and store models using MLflow and/or Optuna.  
- Reflect on the practical aspects of managing the ML lifecycle.

---


In [1]:
import pandas as pd


## 🧩 1. Data Setup and Exploration
⏱ *Estimated time: 30–40 min*

### 🧭 Objective  
Understand the dataset structure and the target variable (“plant-level production”).

---

### **Task 1.1 – Load and Inspect Data**
- Load the GIST Steel dataset.
- Display basic info (shape, column names, missing values, and data types).  
- Identify the target variable (production) and key features (capacity, ...).


In [None]:
file_path = "/content/Plant-level-data-Global-Iron-and-Steel-Tracker-September-2025-V1.xlsx"

# Read the sheet named "Plant data"
df = pd.read_excel(file_path, sheet_name="Plant data")

# Preview the first five rows (this produces the table below)
df.head()

Unnamed: 0,Plant ID,Plant name (English),Plant name (other language),Other plant names (English),Other plant names (other language),Owner,Owner (other language),Owner GEM ID,Owner PermID,SOE Status,...,Steel products,Steel sector end users,Workforce size,ISO 14001,ISO 50001,ResponsibleSteel Certification,Main production equipment,Power source,Iron ore source,Met coal source
0,P100000120004,Kurum International Elbasan steel plant,Kurum Kombinati metalurgjik,,,Kurum International ShA,,E100000130992,5037939021,,...,"billet, wire rod, rebar",unknown,1000,Yes,unknown,No,EAF,"Hydraulic, integrated plants; Four hydropower ...",unknown,unknown
1,P100000120439,Algerian Qatari Steel Jijel plant,الجزائرية القطرية للصلب,AQS,,Algerian Qatari Steel,,E100001000957,5076384326,Partial,...,"billet, wire rod, rebar",unknown,2400,Yes,unknown,No,EAF; DRI,unknown,unknown,unknown
2,P100000120442,ETRHB Annaba steel plant,,,,ETRHB Industrie SpA,,E100001010275,5074513855,,...,unknown,unknown,2000,unknown,unknown,No,EAF,unknown,unknown,unknown
3,P100000121198,Ozmert Algeria steel plant,,,,Ozmert Algeria SARL,,E100001012196,unknown,,...,unknown,unknown,unknown,unknown,unknown,No,EAF; DRI,unknown,Alwaznah and Bu Khadhrah mines,Bechar
4,P100000120440,Sider El Hadjar Annaba steel plant,مركب الحجار للحديد والصلب,"ArcelorMittal Annaba (predecessor), El Hadjar ...",,Groupe Industriel Sider SpA,,E100001000960,5000941519,Full,...,"coil, rebar, sheet",unknown,5748,unknown,unknown,No,BF; BOF; EAF; DRI,unknown,unknown,unknown


In [4]:
print("Dataset Shape:", df.shape)

Dataset Shape: (1209, 44)


In [5]:
print("\nColumn Names:", df.columns.tolist())


Column Names: ['Plant ID', 'Plant name (English)', 'Plant name (other language)', 'Other plant names (English)', 'Other plant names (other language)', 'Owner', 'Owner (other language)', 'Owner GEM ID', 'Owner PermID', 'SOE Status', 'Parent', 'Parent GEM ID', 'Parent PermID', 'Location address', 'Municipality', 'Subnational unit (province/state)', 'Country/Area', 'Region', 'Other language location address', 'Coordinates', 'Coordinate accuracy', 'GEM wiki page', 'Plant age (years)', 'Announced date', 'Construction date', 'Start date', 'Pre-retirement announcement date', 'Idled date', 'Retired date', 'Ferronickel capacity (ttpa)', 'Sinter plant capacity (ttpa)', 'Coking plant capacity (ttpa)', 'Pelletizing plant capacity (ttpa)', 'Category steel product', 'Steel products', 'Steel sector end users', 'Workforce size', 'ISO 14001', 'ISO 50001', 'ResponsibleSteel Certification', 'Main production equipment', 'Power source', 'Iron ore source', 'Met coal source']


In [6]:
print("\nMissing Values:\n", df.isnull().sum())


Missing Values:
 Plant ID                                 0
Plant name (English)                     0
Plant name (other language)            512
Other plant names (English)            507
Other plant names (other language)     922
Owner                                    0
Owner (other language)                 655
Owner GEM ID                             0
Owner PermID                             0
SOE Status                            1007
Parent                                   0
Parent GEM ID                            0
Parent PermID                            0
Location address                         0
Municipality                             0
Subnational unit (province/state)        0
Country/Area                             0
Region                                   0
Other language location address        764
Coordinates                              0
Coordinate accuracy                      1
GEM wiki page                            0
Plant age (years)                   

In [7]:
print("\nData Types:\n", df.dtypes)


Data Types:
 Plant ID                              object
Plant name (English)                  object
Plant name (other language)           object
Other plant names (English)           object
Other plant names (other language)    object
Owner                                 object
Owner (other language)                object
Owner GEM ID                          object
Owner PermID                          object
SOE Status                            object
Parent                                object
Parent GEM ID                         object
Parent PermID                         object
Location address                      object
Municipality                          object
Subnational unit (province/state)     object
Country/Area                          object
Region                                object
Other language location address       object
Coordinates                           object
Coordinate accuracy                   object
GEM wiki page                         obj

In [10]:
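target = 'Plant-level production'  # NOTE: not among the column names printed above; confirm it exists in df before modelling
key_features = ['Workforce size', 'ISO 14001', 'ISO 50001', 'ResponsibleSteel Certification']  # example operational features (stored as text, with 'unknown' placeholders)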
print(f"\nTarget variable: {target}")
print(f"Key features: {key_features}")


Target variable: Plant-level production
Key features: ['Workforce size', 'ISO 14001', 'ISO 50001', 'ResponsibleSteel Certification']


Markdown Prompt:

The dataset contains information about steel plants worldwide, including ownership, production capacity, workforce size, certifications, and raw material sources.

Target variable: `Plant-level production` (if available in the dataset, otherwise use relevant capacity columns).

Observations:

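*   Several columns have missing values (e.g., `SOE Status` is empty for 1007 of the 1209 rows). Others, such as the ISO certifications and workforce size, encode missing data as the string `unknown`, which `isnull()` does not count.
*   Some numeric columns need type conversion: `Workforce size` mixes numbers with `unknown` entries, so it cannot be used as a numeric feature as-is.
*   Geospatial analysis would require parsing the `Coordinates` column (stored as text) into numeric latitude and longitude.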


> 📝 *Markdown prompt:*  
Describe any patterns or potential data quality issues you notice. Which variables might strongly influence production?

---
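The preview rows above contain the string `unknown` in columns for which `isnull()` reports zero missing values. The sketch below shows one way to make such placeholders visible. It uses a small toy frame with illustrative values (not rows from the real dataset), and it assumes `unknown` is the only placeholder in use.

```python
import numpy as np
import pandas as pd

# Toy stand-in for three GIST columns; the values are illustrative only.
toy = pd.DataFrame({
    "Owner PermID": ["5037939021", "unknown", "5074513855"],
    "Workforce size": [1000, "unknown", 2000],
    "ISO 14001": ["Yes", "unknown", None],
})

# isnull() only counts real missing values (NaN/None) ...
missing_raw = toy.isnull().sum()

# ... so recode the "unknown" placeholder as NaN and count again.
toy_na = toy.replace("unknown", np.nan)
missing_recoded = toy_na.isna().sum()

print(pd.DataFrame({"isnull()": missing_raw, "after recoding": missing_recoded}))
```

Running the same recoding on `df` before Task 1.2 should give a more realistic picture of how much data is actually missing.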

### **Task 1.2 – Data Cleaning**
- Handle missing values appropriately (e.g., imputation, removal).  
- Check for outliers or incorrect entries in numerical columns.  
- Apply transformations if needed (e.g., log-transform for skewed distributions).

> 📝 *Markdown prompt:*  
Explain your cleaning choices. Why did you treat the missing or skewed data in that way?

---
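One possible cleaning sequence for a skewed, partly textual column is sketched below: coerce to numeric, impute with the median, log-transform, and flag outliers with the 1.5 × IQR rule. The values are made up for illustration, and the IQR threshold is a common convention rather than a requirement of this lab.

```python
import numpy as np
import pandas as pd

# Illustrative values only; the last row is deliberately extreme.
toy = pd.DataFrame({"Workforce size": ["1000", "2400", "unknown", "5748", "120000"]})

# Coerce text to numbers; anything unparseable (e.g. "unknown") becomes NaN.
workforce = pd.to_numeric(toy["Workforce size"], errors="coerce")

# Median imputation is less sensitive to the extreme value than the mean.
workforce_filled = workforce.fillna(workforce.median())

# log1p compresses the right tail and stays defined at zero.
toy["log_workforce"] = np.log1p(workforce_filled)

# Flag (rather than drop) values outside 1.5 * IQR so they can be reviewed.
q1, q3 = workforce_filled.quantile([0.25, 0.75])
iqr = q3 - q1
toy["is_outlier"] = (workforce_filled < q1 - 1.5 * iqr) | (workforce_filled > q3 + 1.5 * iqr)

print(toy)
```

Flagging instead of deleting keeps the decision reversible, which is useful when a very large plant is a genuine observation rather than a data-entry error.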


In [None]:
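# 1️⃣ Handle missing values
# Drop rows where the target is missing.
# NOTE: this raises the KeyError shown below when `target` is not a column of df.
# .copy() avoids pandas' SettingWithCopyWarning on the assignments that follow.
df_clean = df.dropna(subset=[target]).copy()

# Impute missing numeric features with the column median (target excluded)
numeric_cols = [c for c in df_clean.select_dtypes(include="number").columns if c != target]
for col in numeric_cols:
    # Assign back instead of calling fillna(inplace=True) on a column selection
    df_clean[col] = df_clean[col].fillna(df_clean[col].median())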


KeyError: ['Plant-level production']



### **Task 1.3 – Feature Engineering**
- Create at least two new variables that might improve model performance (e.g., “capacity per worker”, “energy efficiency”).  
- Encode categorical variables and standardize numeric ones.
- Bonus: you are free to use external socioeconomic or environmental data sources to enhance your feature set.

> 📝 *Markdown prompt:*  
Document your new feature(s). What business or operational insight do they represent?

---
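A minimal sketch of the two steps in this task: deriving new features, then encoding and scaling them. The column names `capacity_ttpa`, `workforce`, and `equipment` are hypothetical stand-ins, and the numbers are invented; only the equipment strings mirror the format seen in the preview.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical columns with invented values.
toy = pd.DataFrame({
    "capacity_ttpa": [700.0, 2200.0, 1500.0, 4000.0],
    "workforce": [1000.0, 2400.0, 2000.0, 5748.0],
    "equipment": ["EAF", "EAF; DRI", "EAF", "BF; BOF; EAF; DRI"],
})

# New feature 1: capacity per worker (a rough productivity proxy).
toy["capacity_per_worker"] = toy["capacity_ttpa"] / toy["workforce"]

# New feature 2: number of equipment types listed for the plant.
toy["n_equipment_types"] = toy["equipment"].str.split(";").str.len()

numeric = ["capacity_ttpa", "workforce", "capacity_per_worker", "n_equipment_types"]
categorical = ["equipment"]

# Standardize numeric columns and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
X = preprocess.fit_transform(toy)
print(X.shape)
```

In the actual lab, the transformer would normally be fitted on the training split only (for example inside a scikit-learn `Pipeline`) so that test data does not leak into the scaling statistics.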



## 🔍 1.4 Feature Relationships and Correlations
⏱ *Estimated time: 20–25 min*

### 🧭 Objective  
Before training models, it’s essential to understand how features relate to each other and to the target variable — both linearly and nonlinearly. This helps identify redundant or uninformative predictors and guides model choice.

---

### **Task 1.4.1 – Correlation Matrix (Linear Relationships)**
- Compute a **correlation matrix** (e.g., using `df.corr()`, `seaborn.heatmap`, `skrub`) to examine pairwise linear relationships among numerical features.  
- Focus on correlations between each feature and the target (`production`), as well as between features themselves.

> 📝 *Markdown prompt:*  
Which variables show the strongest correlation with production?  
Do any features appear redundant or highly correlated with each other?

---
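The correlation analysis could look roughly like the sketch below. The data is synthetic (a `production` column built from a hypothetical `capacity` column plus noise, and a deliberately redundant `capacity_copy`), so the numbers say nothing about the real plants. The 0.9 redundancy threshold is an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
capacity = rng.uniform(500, 5000, size=200)
toy = pd.DataFrame({
    "capacity": capacity,
    "capacity_copy": capacity * 1.02,                 # deliberately redundant
    "workforce": rng.uniform(100, 6000, size=200),    # unrelated to the target here
    "production": 0.8 * capacity + rng.normal(0, 200, size=200),
    "region": rng.choice(["Asia", "Europe"], size=200),
})

# numeric_only=True skips text columns such as "region".
corr = toy.corr(numeric_only=True)

# Correlation of each feature with the target, strongest first.
target_corr = corr["production"].drop("production").sort_values(key=np.abs, ascending=False)
print(target_corr)

# Feature pairs that are highly correlated with each other.
features = corr.drop(index="production", columns="production")
redundant = [
    (a, b)
    for i, a in enumerate(features.columns)
    for b in features.columns[i + 1:]
    if abs(features.loc[a, b]) > 0.9
]
print(redundant)

# Optional visual check (requires seaborn):
# import seaborn as sns; sns.heatmap(corr, annot=True, cmap="coolwarm")
```

Pearson correlation only captures linear association, so a low value does not rule out a nonlinear relationship with the target.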


## 🧮 2. Building Baseline & Linear Models
⏱ *Estimated time: 25–30 min*

### 🧭 Objective  
Establish a simple baseline, then train and interpret a linear model.

---

### **Task 2.1 – Baseline**
- Compute a simple baseline predictor (e.g., mean or median production).  
- Measure RMSE or MAE compared to actual values.

> 📝 *Markdown prompt:*  
Why is it useful to have a baseline model before trying more complex ones?

---
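A baseline of this kind can be sketched with scikit-learn's `DummyRegressor`, which ignores the features and always predicts the training mean or median. The data below is synthetic and only serves to make the example runnable.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: one feature and a noisy linear target.
rng = np.random.default_rng(42)
X = rng.uniform(500, 5000, size=(300, 1))
y = 0.8 * X[:, 0] + rng.normal(0, 300, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

results = {}
for strategy in ("mean", "median"):
    # The baseline learns a single constant from y_train.
    baseline = DummyRegressor(strategy=strategy).fit(X_train, y_train)
    pred = baseline.predict(X_test)
    results[strategy] = {
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "mae": float(mean_absolute_error(y_test, pred)),
    }
    print(f"{strategy:>6}: RMSE={results[strategy]['rmse']:.1f}  MAE={results[strategy]['mae']:.1f}")
```

Any later model should beat these numbers on the same test split; if it does not, the features are adding no usable signal.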



### **Task 2.2 – Linear Regression**
- Train a multiple linear regression model using the key plant variables.  
- Display coefficients and interpret their meaning.  
- Evaluate the model on training and test data.

> 📝 *Markdown prompt:*  
Interpret one positive and one negative coefficient. What do they tell you about plant performance drivers?

---
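The sketch below fits a multiple linear regression on synthetic data whose true coefficients are known (positive for a hypothetical `capacity_ttpa` column, negative for `plant_age_years`), so the fitted signs can be checked against the data-generating process. The relationship is invented for illustration and is not a claim about real steel plants.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "capacity_ttpa": rng.uniform(500, 5000, size=n),   # hypothetical column name
    "plant_age_years": rng.uniform(0, 60, size=n),
})
# Invented ground truth: +0.8 per unit of capacity, -50 per year of age.
y = 0.8 * X["capacity_ttpa"] - 50.0 * X["plant_age_years"] + rng.normal(0, 200, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# Each coefficient is the expected change in y per unit change in that
# feature, holding the other features fixed.
coefs = pd.Series(model.coef_, index=X.columns)
print(coefs)

for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(X_part)
    rmse = np.sqrt(mean_squared_error(y_part, pred))
    print(f"{name}: RMSE={rmse:.1f}  R2={r2_score(y_part, pred):.3f}")
```

A large gap between the train and test scores would suggest overfitting; similar scores suggest the linear form generalizes on this data.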


## 🔁 3. Model Evaluation and Selection
⏱ *Estimated time: 45–60 min*

### 🧭 Objective  
Use cross-validation to estimate generalization performance and compare multiple model types.

---

### **Task 3.1 – Cross-Validation**
- Apply **K-Fold cross-validation** (e.g., K=5).  
- Record the average RMSE, MAE, and R² across folds.

> 📝 *Markdown prompt:*  
Summarize your results. How stable is performance across folds? What might this indicate about model variance?

---
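A minimal K-fold sketch using `cross_validate`, which can score several metrics in one pass. The data comes from `make_regression` as a placeholder for the prepared feature matrix. scikit-learn reports error metrics as negated scores, so the sign is flipped before summarizing.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Placeholder data; replace with the prepared features and target.
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scoring = {
    "rmse": "neg_root_mean_squared_error",
    "mae": "neg_mean_absolute_error",
    "r2": "r2",
}
scores = cross_validate(LinearRegression(), X, y, cv=cv, scoring=scoring)

# Error scorers are negated ("higher is better"), so flip the sign back.
rmse = -scores["test_rmse"]
mae = -scores["test_mae"]
r2 = scores["test_r2"]

print(f"RMSE: {rmse.mean():.2f} (std {rmse.std():.2f})")
print(f"MAE:  {mae.mean():.2f} (std {mae.std():.2f})")
print(f"R2:   {r2.mean():.3f} (std {r2.std():.3f})")
```

The standard deviation across folds is a rough indicator of how much performance depends on which rows end up in the training set.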


### **Task 3.2 – Model Comparison**
Train and compare at least **three models**:
- Linear Regression  
- Ridge Regression (regularized linear)  
- Random Forest Regressor  

Record cross-validation performance for each model.

> 📝 *Markdown prompt:*  
Create a small results table. Which model performs best? Why might that be the case given the dataset’s characteristics?

---
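The comparison can be organized as a loop over candidate models that share the same folds and metrics, as sketched below. The data is again a `make_regression` placeholder, and the hyperparameters (`alpha=1.0`, 100 trees) are arbitrary defaults, not recommendations.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)

# Reusing one KFold object gives every model identical splits.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scoring = {"rmse": "neg_root_mean_squared_error", "mae": "neg_mean_absolute_error", "r2": "r2"}

models = {
    "Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge (alpha=1.0)": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

rows = []
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    rows.append({
        "model": name,
        "RMSE": -scores["test_rmse"].mean(),
        "MAE": -scores["test_mae"].mean(),
        "R2": scores["test_r2"].mean(),
    })

results = pd.DataFrame(rows).set_index("model").sort_values("RMSE")
print(results.round(3))
```

Because the placeholder target is linear by construction, the ranking here is not informative about the steel data; the point is the structure of the comparison.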

### **Task 3.3 – Hyperparameter Optimization**
- Use **RandomizedSearchCV** or **GridSearchCV** to tune the top model (e.g., Random Forest).  
- Report the best parameters and corresponding validation score.

> 📝 *Markdown prompt:*  
Discuss the role of hyperparameter tuning. How did tuning change your model’s performance compared to default settings?

---
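A small `RandomizedSearchCV` sketch for a Random Forest, with a default-settings score computed on the same folds for comparison. The search space and `n_iter=8` are illustrative assumptions chosen to keep the run short, and the data is a `make_regression` placeholder.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Illustrative search space; widen or narrow it for the real data.
param_distributions = {
    "n_estimators": randint(50, 200),
    "max_depth": [None, 5, 10],
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=8,                                   # number of sampled configurations
    cv=cv,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X, y)

# Default-settings reference on the same folds.
default_rmse = -cross_val_score(
    RandomForestRegressor(random_state=0), X, y, cv=cv,
    scoring="neg_root_mean_squared_error",
).mean()

print("Best parameters:", search.best_params_)
print(f"Tuned CV RMSE:   {-search.best_score_:.2f}")
print(f"Default CV RMSE: {default_rmse:.2f}")
```

The best cross-validated score is slightly optimistic because the same folds were used to choose the parameters, so a final check on a held-out test set is still advisable.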


## ⚙️ 4. Model Lifecycle: Tracking, Saving, and Loading
⏱ *Estimated time: 30–40 min*

### 🧭 Objective  
Apply tools that support reproducible ML experiments.

---

### **Task 4.1 – Experiment Tracking with MLflow**
- Use MLflow to log parameters (model type, hyperparameters), metrics (RMSE, R²), and artifacts (plots or model files).  
- Run and record at least two model experiments.

> 📝 *Markdown prompt:*  
Describe how MLflow helps manage your experiments. What advantages does it give compared to manual tracking?

---

### **Task 4.2 – Hyperparameter Optimization with Optuna**
- Define an Optuna study to optimize one model (e.g., Ridge or Random Forest).  
- Record the number of trials and best result.

> 📝 *Markdown prompt:*  
Explain what Optuna is doing behind the scenes. How is it different from Grid or Random Search?

---

### **Task 4.3 – Model Storage**
- Save the best performing model to a file (e.g., using joblib or MLflow’s model registry).  
- Demonstrate loading the saved model and re-evaluating it on the test set.

> 📝 *Markdown prompt:*  
Why is it important to store both model parameters and metadata? How would you ensure version control of models in a production setting?

---
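One way to save and reload a model with `joblib`, together with a small metadata file, is sketched below. The file names, the metadata fields, and the temporary output directory are assumptions for illustration; the data is a `make_regression` placeholder.

```python
import json
import tempfile
from pathlib import Path

import joblib
import numpy as np
import sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Assumption: write to a temporary directory; use a versioned path in practice.
out_dir = Path(tempfile.mkdtemp())
model_path = out_dir / "model.joblib"
joblib.dump(model, model_path)

# Store metadata next to the model so it can be interpreted later.
metadata = {
    "model_type": type(model).__name__,
    "sklearn_version": sklearn.__version__,
    "params": model.get_params(),
    "test_r2": float(model.score(X_test, y_test)),
}
(out_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))

# Reload and confirm the model reproduces the original predictions.
loaded = joblib.load(model_path)
same_predictions = bool(np.allclose(loaded.predict(X_test), model.predict(X_test)))
print("Reloaded test R2:", round(loaded.score(X_test, y_test), 3))
print("Predictions identical:", same_predictions)
```

Recording the scikit-learn version matters because pickled estimators are not guaranteed to load correctly under a different library version.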


## 🚀 5. Deployment & Monitoring (Conceptual)
⏱ *Estimated time: 15–20 min*

### 🧭 Objective  
Reflect on how models transition from training to production and stay reliable over time.

---

### **Task 5.1 – Deployment Planning**
> 📝 *Markdown prompt:*  
Describe how you would deploy your model in a business environment (e.g., via REST API, batch pipeline).  
Which metrics would you monitor in production?

---

### **Task 5.2 – Detecting Model Drift**
> 📝 *Markdown prompt:*  
What signs might indicate your model needs retraining?  
Give one example of **data drift** and one of **concept drift** relevant to steel plant production.

---
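As one possible monitoring check for data drift, the sketch below compares the training-time distribution of a feature with a recent batch using a two-sample Kolmogorov-Smirnov test. The feature (a hypothetical capacity column), the simulated shift, and the 0.01 alert threshold are all assumptions for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Simulated feature values: training data and two "live" batches.
train_capacity = rng.normal(2000, 500, size=1000)
live_stable = rng.normal(2000, 500, size=1000)    # same distribution
live_shifted = rng.normal(2600, 500, size=1000)   # mean has moved

results = {}
for name, live in [("stable", live_stable), ("shifted", live_shifted)]:
    statistic, p_value = ks_2samp(train_capacity, live)
    results[name] = (statistic, p_value)
    # Assumed alert rule: flag the batch when the p-value is below 0.01.
    flag = "possible drift" if p_value < 0.01 else "no alert"
    print(f"{name:>8}: KS statistic={statistic:.3f}  p-value={p_value:.3g}  -> {flag}")
```

This only detects changes in the input distribution. Concept drift, where the relationship between inputs and production changes, usually shows up as rising prediction error once actual production figures become available.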

## 💬 6. Reflection
⏱ *Estimated time: 10–15 min*

> 📝 *Markdown prompt:*  
1. Which step of the modelling lifecycle did you find most challenging and why?  
2. What would you do differently if you had access to additional plant-level data?  
3. How would you communicate model insights to a business audience?

---

✅ **End of Lab**

Next week: Short quiz on theoretical concepts (distributions, regression, model selection, and experiment tracking).

