# Case Study: CPU Utilization Prediction

This notebook walks through predicting CPU utilization from historical data: loading the data, feature engineering, exploratory data analysis (EDA), model training, and evaluation.

## 1. Importing Libraries
Let's start by importing the necessary Python libraries.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
import seaborn as sns
sns.set()

## 2. Load and Explore the Data
Next, we'll load the CPU utilization dataset and perform some initial exploration to understand the data.

In [2]:
# Load the data
df = pd.read_csv('../data/raw/cpu_data.csv')  # Adjust the path as necessary

# Display the first few rows of the dataframe
df.head()

### Data Summary
Let's summarize the data to understand its structure and content.

In [3]:
# Data summary
df.info()
df.describe()
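
`df.info()` reports dtypes and non-null counts, but a per-column tally of missing values is often easier to scan before moving on. The sketch below uses a tiny made-up frame rather than the real CSV; `timestamp` and `cpu_utilization` are the column names used in this notebook, while `memory_usage` is a hypothetical extra column.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for cpu_data.csv; values are illustrative only.
sample = pd.DataFrame({
    "timestamp": ["2024-01-01 00:00:00", "2024-01-01 01:00:00", "2024-01-01 02:00:00"],
    "cpu_utilization": [35.0, np.nan, 42.5],
    "memory_usage": [60.1, 61.3, np.nan],  # hypothetical column
})

# Count missing values per column.
missing_counts = sample.isna().sum()
print(missing_counts)

# Drop rows with any missing value (one simple option; imputation is another).
cleaned = sample.dropna()
print(len(cleaned))
```

Dropping rows is only one way to handle gaps. Whether it is appropriate depends on how much data is missing and why.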

## 3. Feature Engineering
We'll extract useful features from the `timestamp` column, such as `hour` and `day_of_week`, which might influence CPU utilization.

In [4]:
# Convert timestamp to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Extract hour and day of week
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek

# Drop the timestamp column
df = df.drop(columns=['timestamp'])

# Display the updated dataframe
df.head()
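
To make the encoding concrete, here is a small sketch of what the `.dt` accessors return, using two made-up timestamps. Note that `dayofweek` counts from Monday as 0 through Sunday as 6. The `is_weekend` column is a hypothetical extra feature and is not part of the original pipeline.

```python
import pandas as pd

# Made-up timestamps: 2024-01-01 is a Monday, 2024-01-06 is a Saturday.
demo = pd.DataFrame({
    "timestamp": ["2024-01-01 09:30:00", "2024-01-06 23:15:00"],
})
demo["timestamp"] = pd.to_datetime(demo["timestamp"])

demo["hour"] = demo["timestamp"].dt.hour              # integer in 0-23
demo["day_of_week"] = demo["timestamp"].dt.dayofweek  # Monday=0 ... Sunday=6
demo["is_weekend"] = demo["day_of_week"] >= 5         # hypothetical extra feature

print(demo)
```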

## 4. Exploratory Data Analysis (EDA)
Let's visualize the relationships between the features and the target variable (`cpu_utilization`).

In [5]:
# Pairplot to visualize relationships
sns.pairplot(df)
plt.show()

### Correlation Matrix
We'll also look at the correlation between different features.

In [6]:
# Correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.show()
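
A heatmap shows every pairwise correlation, but for modeling the column of interest is usually the one for the target. The sketch below ranks features by their correlation with `cpu_utilization` on made-up data. The `memory_usage` and `host` columns are hypothetical, and `numeric_only=True` assumes pandas 1.5 or newer.

```python
import pandas as pd

# Made-up data; `memory_usage` and `host` are hypothetical column names.
demo = pd.DataFrame({
    "hour": [0, 6, 12, 18, 23],
    "memory_usage": [30.0, 40.0, 55.0, 70.0, 80.0],
    "cpu_utilization": [20.0, 35.0, 50.0, 65.0, 78.0],
    "host": ["a", "a", "b", "b", "a"],  # non-numeric column
})

# numeric_only=True skips non-numeric columns such as `host`.
target_corr = (
    demo.corr(numeric_only=True)["cpu_utilization"]
    .drop("cpu_utilization")        # remove the target's correlation with itself
    .sort_values(ascending=False)   # strongest positive correlation first
)
print(target_corr)
```

Correlation only captures linear association, so a feature such as `hour` can matter even when its coefficient is small.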

## 5. Split Data into Training and Testing Sets
We'll hold out 20% of the rows as a test set so the model can be evaluated on data it did not see during training. Setting `random_state=42` makes the shuffled split reproducible.

In [7]:
# Define features and target
X = df.drop(columns=['cpu_utilization'])
y = df['cpu_utilization']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
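
By default `train_test_split` shuffles rows before splitting. For time-ordered data, one alternative is a chronological split, where the model trains on earlier rows and is tested on later ones. The sketch below shows that variant on made-up data. It assumes the rows are already sorted by time, which this notebook does not verify.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, time-ordered data: 10 consecutive readings.
demo = pd.DataFrame({
    "hour": np.arange(10),
    "cpu_utilization": np.linspace(20.0, 65.0, 10),
})
X_demo = demo.drop(columns=["cpu_utilization"])
y_demo = demo["cpu_utilization"]

# shuffle=False keeps row order: the first 80% train, the last 20% test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)

print(X_tr.index.tolist())
print(X_te.index.tolist())
```

Which split is more appropriate depends on how the model will be used. A chronological split is closer to forecasting future load from past observations.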

## 6. Model Training
We'll train a linear regression model to predict CPU utilization.

In [8]:
# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
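
A fitted linear regression is easy to inspect: each feature gets one coefficient, plus a shared intercept. The sketch below fits a model on synthetic, noise-free data generated from a known formula, so the recovered values can be checked by eye. The data and the name `demo_model` are illustrative and unrelated to the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated from y = 2*hour + 3*day_of_week + 5.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "hour": rng.integers(0, 24, size=50),
    "day_of_week": rng.integers(0, 7, size=50),
})
y_demo = 2 * X_demo["hour"] + 3 * X_demo["day_of_week"] + 5

demo_model = LinearRegression().fit(X_demo, y_demo)

# Pair each learned coefficient with its feature name.
coefs = pd.Series(demo_model.coef_, index=X_demo.columns)
print(coefs)
print(demo_model.intercept_)
```

The same two attributes, `coef_` and `intercept_`, are available on the `model` trained above.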

## 7. Model Evaluation
We'll evaluate the model's performance using Mean Absolute Error (MAE) and Mean Squared Error (MSE).

In [9]:
# Calculate MAE and MSE
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(f'Mean Absolute Error: {mae}')
print(f'Mean Squared Error: {mse}')
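
To show what these metrics measure, the sketch below computes them by hand on three made-up values and checks them against scikit-learn. It also derives the root mean squared error (RMSE), which is in the same units as the target, and the MAE of a naive baseline that always predicts the mean. The numbers are illustrative, not results from this dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual and predicted values.
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

errors = y_true - y_hat                # [-2, 2, -3]
mae_manual = np.mean(np.abs(errors))   # average absolute error
mse_manual = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse_manual)             # back in the units of the target

# The manual values should agree with scikit-learn.
mae_sklearn = mean_absolute_error(y_true, y_hat)
mse_sklearn = mean_squared_error(y_true, y_hat)

# Naive baseline: always predict the mean. In real use, take the mean
# from the training targets rather than from the test targets.
baseline_pred = np.full_like(y_true, y_true.mean())
baseline_mae = mean_absolute_error(y_true, baseline_pred)

print(mae_manual, mse_manual, rmse, baseline_mae)
```

A model is only useful if its error is clearly lower than the baseline's.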

## 8. Visualization
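Let's plot the actual and predicted CPU utilization for each test sample. Because the split shuffles rows, the x-axis is a sample's position in the test set, not time.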

In [10]:
# Plot actual vs predicted CPU utilization
plt.figure(figsize=(10, 6))
plt.plot(y_test.values, label='Actual CPU Utilization')
plt.plot(y_pred, label='Predicted CPU Utilization')
plt.legend()
plt.xlabel('Sample Index')
plt.ylabel('CPU Utilization (%)')
plt.title('Actual vs Predicted CPU Utilization')
plt.show()

## 9. Save the Model (Optional)
You can save the trained model to a file for later use.

In [11]:
import joblib

# Save the model (the ../models directory must already exist)
joblib.dump(model, '../models/cpu_utilization_model.pkl')
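
Loading the model back is the mirror image of saving it. The sketch below round-trips a small stand-in model through a temporary directory, so it does not touch `../models`. The names `demo_model` and `demo_model.pkl` are illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Small stand-in model fitted on made-up data following y = 2x + 1.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
demo_model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_model, model_path)   # write the model to disk
    restored = joblib.load(model_path)    # read it back

# The restored model predicts exactly like the original.
restored_pred = restored.predict(np.array([[4.0]]))
print(restored_pred)
```

A saved model expects the same feature columns, in the same order, as it was trained on. Only load pickle files from sources you trust, and ideally with the same scikit-learn version that wrote them.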

## Conclusion

In this notebook, we loaded the data, engineered time-based features, performed exploratory data analysis, trained a linear regression model, and evaluated it with MAE and MSE. The trained model can be used to predict CPU utilization for new data that has the same feature columns.