# Machine Learning Lab - Steel Industry
## Part 1: Data Exploration

### Introduction to the dataset

This lab uses a dataset from the steel industry, collected over the period 2018 to 2019.
It contains detailed energy measurements from an industrial facility, which makes it well suited to analyzing and optimizing energy consumption in a real industrial context.

#### Description of variables:

1. **Temporal variables:**
   - `date`: Date and time of the measurement
   - `Day_of_week`: Day of the week (Monday to Sunday)
   - `NSM`: Number of Seconds from Midnight
   - `WeekStatus`: Type of day (Weekday/Weekend)

2. **Main energy variables:**
   - `Usage_kWh`: Energy consumption in kilowatt-hours (TARGET)
   - `Lagging_Current_Reactive.Power_kVarh`: Lagging reactive power
   - `Leading_Current_Reactive_Power_kVarh`: Leading reactive power
   - `CO2(tCO2)`: CO2 emissions in tons

3. **Power factors:**
   - `Lagging_Current_Power_Factor`: Lagging power factor
   - `Leading_Current_Power_Factor`: Leading power factor

4. **Load variable:**
   - `Load_Type`: Load category of the facility (categorical)

#### Possible applications:

1. **Consumption prediction:**
   - Forecasting energy consumption
   - Estimating CO2 emissions
   - Production planning

2. **Energy optimization:**
   - Identifying periods of high consumption
   - Analyzing energy efficiency
   - Reducing CO2 emissions

3. **Anomaly detection:**
   - Identifying unusual consumption
   - Detecting malfunctions
   - Predictive maintenance

4. **Pattern analysis:**
   - Daily and weekly variations
   - Impact of weekdays vs weekends
   - Correlations between energy variables

In [None]:
# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from scipy.stats import normaltest

# Display configuration
sns.set_theme()
%matplotlib inline

# Download and extract the data
!wget -O steel_industry_data.zip https://archive.ics.uci.edu/static/public/851/steel+industry+energy+consumption.zip
!unzip -o steel_industry_data.zip

# Load the data
df = pd.read_csv('Steel_industry_data.csv')

# Convert dates with European format (day/month/year)
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y %H:%M')

# Display the first rows with better formatting
print("\nPreview of the first rows:")
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
print(df.head().to_string())

# Information about the dataset structure
print("\nDataset structure:")
print(f"Number of observations: {df.shape[0]:,}")
print(f"Number of variables: {df.shape[1]:,}")

# Summary of variable types
print("\nVariable types:")
display(df.dtypes)

# Example values for categorical variables
print("\nUnique values in categorical variables:")
for col in ['Day_of_week', 'WeekStatus', 'Load_Type']:
    print(f"\n{col} :")
    print(df[col].value_counts())

# Detailed descriptive statistics
print("\nDescriptive statistics of numerical variables:")
# /!\ Complete the '...' to get a description (Pandas) of the dataset /!\
desc_stats = ...
display(desc_stats)

# Check temporal coverage
print("\nPeriod covered by the dataset:")
print(f"Start: {df['date'].min()}")
print(f"End: {df['date'].max()}")
print(f"Duration: {(df['date'].max() - df['date'].min()).days} days")

# Check for missing values
print("\nMissing values per variable:")
print(df.isnull().sum())
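
The loading cell passes an explicit `format='%d/%m/%Y %H:%M'` when parsing dates. The sketch below, on a single made-up timestamp (not taken from the dataset), illustrates why this matters: the same string is a valid date under both the day-first and the month-first reading, so a wrong format would silently shift measurements to another month.

```python
import pandas as pd

# Made-up timestamp for illustration: valid both as 3 January and as 1 March
raw = pd.Series(['03/01/2018 00:15'])

# European convention (day/month/year), as used in the lab
day_first = pd.to_datetime(raw, format='%d/%m/%Y %H:%M')

# US convention (month/day/year), shown only for comparison
month_first = pd.to_datetime(raw, format='%m/%d/%Y %H:%M')

print("Day first:  ", day_first[0])
print("Month first:", month_first[0])
```

A quick sanity check after parsing is to confirm that the minimum and maximum dates match the period the dataset is expected to cover.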

### Points to consider for the analysis:

1. **Necessary preprocessing:**
   - Standardization of numerical variables
   - Encoding of categorical variables
   - Handling temporal aspects

2. **Business aspects to consider:**
   - Industrial production cycles
   - Energy constraints
   - Environmental objectives (CO2)

3. **Analysis opportunities:**
   - Consumption patterns
   - Energy efficiency
   - Cost optimization
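
As a reminder of what standardization does, here is a minimal sketch on made-up values (not taken from the dataset). It assumes the usual definition z = (x - mean) / std and checks that scikit-learn's `StandardScaler` gives the same result as the manual computation with the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up measurements, one feature in a single column
values = np.array([[3.0], [4.5], [6.0], [10.5]])

# Manual standardization: subtract the mean, divide by the population std (ddof=0)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# Same operation with scikit-learn
scaled = StandardScaler().fit_transform(values)

print("Manual:", manual.ravel().round(3))
print("Scaler:", scaled.ravel().round(3))
print("Identical:", np.allclose(manual, scaled))
```

After standardization each feature has mean 0 and standard deviation 1, which puts variables measured in different units on a comparable scale.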

In [None]:
# Initial visualization of distributions
plt.figure(figsize=(15, 10))

# Distribution of energy consumption
plt.subplot(2, 2, 1)
sns.histplot(data=df, x='Usage_kWh', bins=50)
plt.title('Distribution of energy consumption')

# Consumption by day of the week
plt.subplot(2, 2, 2)
sns.boxplot(data=df, x='Day_of_week', y='Usage_kWh')
# /!\ Complete the '...' to rotate the x-axis labels by 45° (Matplotlib) /!\
plt.xticks(...)
plt.title('Consumption by day')

# Temporal evolution
plt.subplot(2, 2, 3)
df.set_index('date')['Usage_kWh'].plot()
plt.title('Evolution of consumption over time')

# CO2/Consumption relationship
plt.subplot(2, 2, 4)
plt.scatter(df['Usage_kWh'], df['CO2(tCO2)'], alpha=0.5)
plt.xlabel('Consumption (kWh)')
plt.ylabel('CO2 emissions (tCO2)')
plt.title('Consumption/Emissions relationship')

plt.tight_layout()
plt.show()

❓ **Questions:**
1. How many numerical and categorical variables do we have?
2. Are there any missing values to handle?
3. What are the value ranges for each variable?
4. What are the main characteristics of energy consumption?
5. How does consumption vary by day of the week?
6. What is the nature of the relationship between consumption and CO2 emissions?

### 2. Distribution analysis
Let's visualize the distribution of our main variables.

In [None]:
# Visualization of the distribution of numerical variables
# DataFrame.hist creates its own figure, so the size is passed to it directly
df.select_dtypes(include=['float64']).hist(bins=30, figsize=(15, 10))
plt.tight_layout()
plt.show()
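
The import cell loads `scipy.stats.normaltest`, which can complement a visual reading of the histograms. The sketch below runs it on two synthetic samples (random data, not the dataset) to show how the p-value behaves for a symmetric and for a strongly skewed distribution. Applying it to a column such as `Usage_kWh` is left as a possible extension.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)

# Synthetic samples: one drawn from a normal law, one from an exponential law
symmetric_sample = rng.normal(loc=50, scale=10, size=2000)
skewed_sample = rng.exponential(scale=10, size=2000)

# normaltest combines skewness and kurtosis; H0 = "the sample is normally distributed"
stat_sym, p_sym = normaltest(symmetric_sample)
stat_skew, p_skew = normaltest(skewed_sample)

print(f"Normal sample:      statistic={stat_sym:.2f}, p-value={p_sym:.3g}")
print(f"Exponential sample: statistic={stat_skew:.2f}, p-value={p_skew:.3g}")
```

A very small p-value leads to rejecting normality. On large samples the test becomes sensitive to small deviations, so it is best read together with the histogram.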

### 3. Correlation analysis
Let's study the relationships between our variables.

In [None]:
# Data preprocessing for correlation
# Remove non-numeric columns
df_num = df.select_dtypes(include=['float64', 'int64'])

# Correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(df_num.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation matrix')
plt.show()

# Scatter matrix for all numerical variables
# Use smaller fonts for this figure only; rc_context restores the previous settings on exit
with plt.rc_context({'axes.labelsize': 8, 'xtick.labelsize': 6, 'ytick.labelsize': 6}):
    # Create the scatter matrix
    axes = pd.plotting.scatter_matrix(df_num,
                                      figsize=(8, 8),
                                      diagonal='kde',
                                      alpha=0.5,
                                      density_kwds={'alpha': 0.2},
                                      marker='.',
                                      s=20)  # Reduced point size

    # Rotate labels for better readability
    for ax in axes.flatten():
        ax.xaxis.label.set_rotation(90)
        ax.yaxis.label.set_rotation(0)
        ax.yaxis.label.set_ha('right')

    plt.show()
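
When the heatmap contains many variables, it can help to extract only the correlations with the target and sort them. The sketch below does this on a synthetic frame. The column names `target`, `feature_a` and `feature_b` are hypothetical and the data are random, so the numbers say nothing about the steel dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Synthetic data: the target depends on feature_a, feature_b is pure noise
feature_a = rng.normal(size=n)
feature_b = rng.normal(size=n)
toy = pd.DataFrame({
    'target': 2 * feature_a + rng.normal(scale=0.5, size=n),
    'feature_a': feature_a,
    'feature_b': feature_b,
})

# Correlation of every feature with the target, sorted by absolute value
ranking = (
    toy.corr()['target']
    .drop('target')
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

The same pattern should work on `df_num` with `'Usage_kWh'` in place of the hypothetical `'target'` column.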

❓ **Questions:**
1. Which variables are most correlated with energy consumption?
2. Do you observe any surprising correlations?
3. Which variables seem the least important?

### 4. In-depth analysis of relationships between variables

So far, we have analyzed linear (Pearson) correlations between our variables.
However, in real data, relationships can be more complex.
Let's deepen our analysis in three steps:

1. **Creation of new indicators:** Relevant ratios for energy analysis
2. **Analysis of monotonic correlations:** Using the Spearman coefficient, which also captures non-linear monotonic relationships
3. **Visualization of interactions:** Impact of different factors on consumption

#### 4.1 Creation of energy indicators

In [None]:
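# Creation of relevant energy ratios (a zero denominator gives inf, replaced here by NaN)
df['power_factor_ratio'] = df['Lagging_Current_Power_Factor'] / df['Leading_Current_Power_Factor']
df['reactive_power_ratio'] = df['Lagging_Current_Reactive.Power_kVarh'] / df['Leading_Current_Reactive_Power_kVarh']
ratio_cols = ['power_factor_ratio', 'reactive_power_ratio']
df[ratio_cols] = df[ratio_cols].replace([np.inf, -np.inf], np.nan)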
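
A ratio is undefined whenever its denominator is zero. The sketch below uses made-up values in hypothetical columns `lagging` and `leading` to show what pandas returns in that case, and one possible way to neutralize the result before computing correlations.

```python
import numpy as np
import pandas as pd

# Made-up values: the second and third rows have a zero denominator
toy = pd.DataFrame({
    'lagging': [80.0, 50.0, 0.0],
    'leading': [100.0, 0.0, 0.0],
})

# pandas does not raise on division by zero: x/0 gives inf and 0/0 gives NaN
raw_ratio = toy['lagging'] / toy['leading']

# Replace infinite values by NaN so that they are ignored by corr() and by plots
clean_ratio = raw_ratio.replace([np.inf, -np.inf], np.nan)

print(pd.DataFrame({'raw': raw_ratio, 'clean': clean_ratio}))
```

Counting the NaN values in a ratio column (for example with `isna().sum()`) shows how many observations are affected before interpreting its correlations.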

#### 4.2 Comparison of linear and monotonic correlations

- **Pearson correlation** (seen previously): measures linear relationships
- **Spearman correlation**: measures monotonic relationships (including non-linear ones)
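
To make the distinction concrete, here is a minimal sketch on synthetic data (not the dataset): `y` is an exponential function of `x`, so the relationship is perfectly monotonic but far from linear. It uses `scipy.stats.pearsonr` and `scipy.stats.spearmanr` for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)

# Synthetic data: y increases strictly with x, but not linearly
x = rng.uniform(0, 5, size=500)
y = np.exp(x)

# Pearson works on the raw values, Spearman on their ranks
pearson_r = pearsonr(x, y)[0]
spearman_rho = spearmanr(x, y)[0]

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_rho:.3f}")
```

Spearman reaches 1 because the ranks of `x` and `y` coincide exactly, while Pearson stays below 1 because the points do not lie on a straight line. A large gap between the two coefficients is therefore a hint of a non-linear monotonic relationship.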

Let's compare the two approaches:

In [None]:
# Selection of numeric columns
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns

# Creation of the two correlation matrices
pearson_corr = df[numeric_cols].corr(method='pearson')
# /!\ Complete the '...' to get a correlation matrix with the Spearman method /!\
spearman_corr = df[numeric_cols]...

# Side-by-side visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

sns.heatmap(pearson_corr, annot=True, cmap='coolwarm', fmt='.2f', ax=ax1)
ax1.set_title('Pearson correlations\n(linear relationships)')

sns.heatmap(spearman_corr, annot=True, cmap='coolwarm', fmt='.2f', ax=ax2)
ax2.set_title('Spearman correlations\n(monotonic relationships)')

plt.tight_layout()
plt.show()

#### 4.3 Analysis of specific interactions

Let's visualize some important relationships to understand their nature:

In [None]:
# Visualization of key interactions
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
sns.scatterplot(data=df, x='power_factor_ratio', y='Usage_kWh', alpha=0.5)
plt.title('Consumption vs Power factor ratio')

plt.subplot(1, 3, 2)
# /!\ Complete the '...' to get a scatterplot (Seaborn) of energy consumption as a function of the reactive power ratio /!\
sns.scatterplot(...)
plt.title('Consumption vs Reactive power ratio')

plt.subplot(1, 3, 3)
sns.scatterplot(data=df, x='CO2(tCO2)', y='Usage_kWh',
                hue='WeekStatus', alpha=0.5)
plt.title('Consumption vs CO2 by day type')

plt.tight_layout()
plt.show()

❓ **Analysis questions:**

1. **Comparison of correlations**
   - What differences do you observe between Pearson and Spearman correlations?
   - For which variables are the differences most marked?
   - What does this teach us about the nature of the relationships between variables?

2. **Energy ratios**
   - Why were these specific ratios created?
   - What do they reveal about energy efficiency?

3. **Consumption patterns**
   - How does the CO2/consumption relationship vary by type of day?
   - What implications does this have for energy management?
   - What recommendations could you make?

### 5. Data preparation for learning

In [None]:
# Data preparation for learning
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Define the explicit order of days
days_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Store days as an ordered categorical so that groupby results follow the calendar order
df['Day_of_week'] = pd.Categorical(df['Day_of_week'], categories=days_order, ordered=True)

# Encoding of categorical variables
categorical_features = ['Day_of_week', 'WeekStatus']
# OneHotEncoder sorts categories alphabetically by default and ignores the pandas
# categorical order, so the day order is passed explicitly.
# drop='first' is deliberately not used, so that every day keeps its own column.
encoder = OneHotEncoder(
    categories=[days_order, sorted(df['WeekStatus'].unique())],
    sparse_output=False
)

# Encode the variables
encoded_features = encoder.fit_transform(df[categorical_features])

# Create column names for encoded variables
day_names = [f'Day_{day}' for day in encoder.categories_[0]]  # All days
week_status_names = [f'Status_{status}' for status in encoder.categories_[1]]
encoded_columns = day_names + week_status_names

# Display for verification
print("Day categories:", encoder.categories_[0])
print("Encoded days:", day_names)

# Create DataFrame with encoded variables
df_encoded = pd.DataFrame(encoded_features, columns=encoded_columns)

# Select numeric features
numeric_features = [
    'Usage_kWh',
    'Lagging_Current_Reactive.Power_kVarh',
    'Leading_Current_Reactive_Power_kVarh',
    'CO2(tCO2)',
    'Lagging_Current_Power_Factor',
    'Leading_Current_Power_Factor',
    'NSM'
]

# Standardization of numeric variables
scaler = StandardScaler()
df_scaled = pd.DataFrame(
    scaler.fit_transform(df[numeric_features]),
    columns=numeric_features
)

# Combine numeric and encoded features
# /!\ Complete the '...' to get a concatenation (Pandas) of df_scaled and df_encoded /!\
df_final = ...

display(df_final)
# display(df_final.loc[2000:2300])

# Check residual correlations
plt.figure(figsize=(12, 8))
sns.heatmap(df_final.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlations of prepared features')
plt.show()

❓ **Questions:**

1. **Relationships between energy variables**
   - What is the relationship between consumption (Usage_kWh) and CO2 emissions?
   - Why is there a strong correlation between these variables?
   - What other variables are strongly correlated with consumption?

2. **Relationships between power factors**
   - How should the correlation between Lagging and Leading Power Factor be interpreted?
   - Why do these factors have different relationships with consumption?
   - What impact can this have on energy efficiency?

3. **Data structure**
   - Are there redundant variables that could be eliminated?
   - Which variables seem most important for prediction?

4. **Practical implications**
   - How can these correlations guide energy optimization?
   - Which variables should be monitored as a priority?
   - What business recommendations can be drawn from this?

### 6. In-depth temporal analysis

Temporal analysis is crucial to understanding energy consumption patterns.

In [None]:
# Conversion of the NSM column (Number of Seconds from Midnight) to hours since midnight
# (the result is fractional whenever a measurement does not fall on a full hour)
df['hour'] = df['NSM'] / 3600

# Hourly analysis
plt.figure(figsize=(15, 5))

plt.subplot(1, 2, 1)
hourly_consumption = df.groupby('hour')['Usage_kWh'].mean()
plt.plot(hourly_consumption.index, hourly_consumption.values)
plt.title('Average consumption by hour')
plt.xlabel('Hour')
plt.ylabel('Consumption (kWh)')

plt.subplot(1, 2, 2)
sns.boxplot(data=df, x='Day_of_week', y='Usage_kWh')
plt.title('Consumption distribution by day')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

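# Weekly pattern analysis
# observed=True avoids the pandas FutureWarning raised when grouping on a categorical column
weekly_stats = df.groupby('Day_of_week', observed=True).agg({
    'Usage_kWh': ['mean', 'std', 'min', 'max'],
    'CO2(tCO2)': ['mean', 'std']
}).round(2)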

print("\nWeekly statistics:")
display(weekly_stats)
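
Dividing `NSM` by 3600 gives fractional hours, so grouping on that column produces one average per distinct time of day rather than one per clock hour. The sketch below, on made-up readings with hypothetical column names `hour_fractional` and `hour_bin`, contrasts this with integer division, which groups all readings of the same hour together.

```python
import pandas as pd

# Made-up readings: four in the first hour, two in the second
toy = pd.DataFrame({
    'NSM': [0, 900, 1800, 2700, 3600, 4500],
    'Usage_kWh': [3.0, 4.0, 5.0, 4.0, 10.0, 12.0],
})

# True division keeps the position inside the hour (0.0, 0.25, 0.5, ...)
toy['hour_fractional'] = toy['NSM'] / 3600

# Integer division maps every reading to the hour it belongs to (0, 1, ...)
toy['hour_bin'] = toy['NSM'] // 3600

per_reading = toy.groupby('hour_fractional')['Usage_kWh'].mean()
per_hour = toy.groupby('hour_bin')['Usage_kWh'].mean()

print(per_reading)
print(per_hour)
```

Both views are legitimate: the fractional version gives a finer daily profile, while the binned version gives a smoother curve with one point per hour.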

❓ **Questions:**
1. What are the peak consumption hours?
2. Is there a significant difference between days of the week?
3. How can the observed variations be explained?