# Part - 1
# Predicting Laptop Prices with Neural Networks

In this exercise, you will learn how to use PyTorch, a powerful machine learning library, to solve computer systems related problems. More specifically, we want to create a neural network that predicts their values based on their specifications. This task will take you through loading and preprocessing data, creating a neural network model, training the model, and evaluating its performance.

## Objective

- Understand how to handle and preprocess data for a machine learning task.
- Learn the basics of PyTorch by creating a simple neural network.
- Train the neural network on a dataset of laptop specifications and prices.
- Evaluate the model's performance using various metrics.

Let's get started by importing the necessary libraries.


In [21]:
# Import necessary libraries
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score



- Familiarize yourself with the libraries imported above. PyTorch will be used for creating the neural network, pandas for data manipulation, and scikit-learn for data preprocessing and evaluation metrics.
# Loading and Inspecting the Dataset

The first step in any machine learning task is to understand the data we're working with. We'll load our dataset using pandas and inspect the first few rows to see what our data looks like.


In [22]:
# Load the dataset
dataset_path = 'laptop_data_cleaned.csv'  # Make sure to replace this with the actual path to your dataset
df = pd.read_csv(dataset_path)


# Understanding Numerical and Categorical Features

When we work with datasets in machine learning, it's important to distinguish between different types of data. Primarily, we deal with two types: numerical and categorical features. Understanding the difference between these two types is crucial for preprocessing data correctly before training a model.

## Numerical Features

Numerical features are data types that represent quantitative measurements. They are numbers that can be measured or counted. These features can be further divided into two sub-categories:

- **Continuous Features**: These are measurements that can take on any value within a range. Examples include height, weight, temperature, and price. The key characteristic of continuous data is that it can be infinitely fine-grained.
- **Discrete Features**: These are numeric values that have a finite number of possible values. They often represent counts of objects or occurrences. Examples include the number of bedrooms in a house, the number of pets a person has, or the number of times a customer has made a purchase.

## Categorical Features

Categorical features, on the other hand, represent qualitative data. These features can take on a limited number of categories or distinct groups. Categorical data cannot be naturally ordered. Examples include colors (red, blue, green), brand names (Nike, Adidas, Puma), and product categories (electronics, furniture, clothing). Categorical features can be further classified as:

- **Nominal Features**: These are categories without any inherent order. For example, the brand names of cars or the type of cuisine.
- **Ordinal Features**: These are categories that do have a natural order or ranking to them, but the difference between the categories is not uniform. Examples include satisfaction ratings (satisfied, neutral, dissatisfied) and education level (high school, bachelor's, master's).

## Why the Distinction Matters

The distinction between numerical and categorical features matters because it dictates the type of preprocessing needed before using the data for training a machine learning model. For example:

- Numerical data might need to be normalized or standardized to bring all the features to a similar scale.
- Categorical data needs to be encoded before it can be used in most machine learning models. Common encoding techniques include one-hot encoding for nominal features and ordinal encoding for ordinal features.

Understanding and correctly preprocessing numerical and categorical data can significantly impact the performance of your machine learning model.

## Task-1 (5 points):

- Look at the output of `df.head()`. Can you identify which columns are numerical and which are categorical?
- Identify the target variable we are trying to predict.


In [23]:
# Task-1

# Numerical: Ram, Weight, TouchScreen, Ips, Ppi, HDD, SSD
# Categorical: Company, TypeName, Cpu_brand, Gpu_brand, Os

# target variable for prediction: Price

df.head()

Unnamed: 0,Company,TypeName,Ram,Weight,Price,TouchScreen,Ips,Ppi,Cpu_brand,HDD,SSD,Gpu_brand,Os
0,Apple,Ultrabook,8,1.37,11.175755,0,1,226.983005,Intel Core i5,0,128,Intel,Mac
1,Apple,Ultrabook,8,1.34,10.776777,0,0,127.67794,Intel Core i5,0,0,Intel,Mac
2,HP,Notebook,8,1.86,10.329931,0,0,141.211998,Intel Core i5,0,256,Intel,Others
3,Apple,Ultrabook,16,1.83,11.814476,0,1,220.534624,Intel Core i7,0,512,AMD,Mac
4,Apple,Ultrabook,8,1.37,11.473101,0,1,226.983005,Intel Core i5,0,256,Intel,Mac


# Preprocessing Data for Machine Learning

Before we feed our data into a machine learning model, it's crucial to preprocess it. This step ensures that our data is in the right format and is standardized or normalized, making it easier for the model to learn and make accurate predictions. In this notebook, we'll discuss why preprocessing is necessary and go over the mathematics behind the standard scaler and one-hot encoder, two common preprocessing techniques.

## Why Preprocess Data?

- **Compatibility**: Most machine learning algorithms expect numerical input, so we need to convert categorical data into a numerical format.
- **Scale**: Features might be on different scales (e.g., age vs income). Differences in scale can lead to biases where the algorithm disproportionately favors features with larger scales.
- **Normalization/Standardization**: This helps to ensure that each feature contributes equally to the prediction.

## The Mathematics Behind Preprocessing

### Standard Scaler (Standardization)

The Standard Scaler standardizes features by removing the mean and scaling to unit variance. This process is also known as "Z-score normalization". The formula for calculating the standardized value of a feature is:

$$ z = \frac{x - \mu}{\sigma} $$

where:
- $z$ is the standardized value.
- $x$ is the original value of the feature.
- $\mu$ is the mean of the feature values.
- $\sigma$ is the standard deviation of the feature values.

### One-Hot Encoder (for Categorical Data)

One-Hot Encoding converts categorical variables into a format that can be provided to ML algorithms to do a better job in prediction. For each unique category in a feature, one-hot encoding creates a new column (binary) where 1 indicates the presence of the category and 0 indicates absence. For example, if we have a `Color` feature with three categories ['Red', 'Green', 'Blue'], one-hot encoding it will result in three new columns, one for each category:

- Color_Red: [1, 0, 0]
- Color_Green: [0, 1, 0]
- Color_Blue: [0, 0, 1]

This process does not involve complex mathematics but is crucial for handling categorical data.

## Applying Preprocessing

In our dataset, we use `StandardScaler` for numerical features to standardize them, and `OneHotEncoder` for categorical features to convert them into a numerical format that our machine learning model can work with.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['numerical_features_here']),
        ('cat', OneHotEncoder(), ['categorical_features_here'])
    ])
```
# Task-2 (5 points):

- Put the numrical and cateorical features in the code below:


In [24]:
# Separate target and features
X = df.drop('Price', axis=1)
y = df['Price']


# Task-2
# Define preprocessing for numerical and categorical features
preprocessor = ColumnTransformer(
    transformers=[
        # numerical features
        ('num', StandardScaler(), ['Ram', 'Weight', 'TouchScreen', 'Ips', 'Ppi', 'HDD', 'SSD']),
        # categorical features
        ('cat', OneHotEncoder(), ['Company', 'TypeName', 'Cpu_brand', 'Gpu_brand', 'Os'])
    ])

# Apply preprocessing
X_preprocessed = preprocessor.fit_transform(X)
y = y.to_numpy().reshape(-1, 1)

# Convert to PyTorch tensors
X_tensor = torch.tensor(X_preprocessed.toarray().astype(np.float32))
y_tensor = torch.tensor(y.astype(np.float32))


# Splitting the Dataset

It's important to split our dataset into a training set and a testing set. This way, we can train our model on one portion of the data and evaluate its performance on another set that it hasn't seen before, ensuring our model can generalize well to new data.



In [25]:
# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X_tensor, y_tensor, test_size=0.2, random_state=42)

# Convert to PyTorch DataLoader
train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=32, shuffle=True)
test_loader = DataLoader(TensorDataset(X_test, y_test), batch_size=32)


# Creating the Neural Network

Now comes the exciting part! We'll define our neural network architecture.

## Task-3 (10 points):

- Define a neural network class named `LaptopPricePredictor`. This class should inherit from `nn.Module` and define the layers of the network in the `__init__` method. Then, implement the forward pass in the `forward` method.


In [26]:
# Task-3
class LaptopPricePredictor(nn.Module):
    def __init__(self):
        super(LaptopPricePredictor, self).__init__()
        # initialize the layers of the network
        self.fc1 = nn.Linear(X_train.shape[1], 64)
        self.fc2 = nn.Linear(64, 1)

    def forward(self, x):
        # put x through the layer(s) and activation function(s) (like ReLU)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Initialize the model
model = LaptopPricePredictor()


# Training the Model

With our data prepared and our model defined, it's now time to train our model. This involves feeding it the input data, calculating the loss (difference between the model's predictions and the actual prices), and adjusting the model's weights through backpropagation.

## Task-4 (20 points):

- Complete the training loop below. Fill in the missing parts to calculate the loss, perform backpropagation, and update the model's weights.
- Adjust hyperparameters (number of epochs and learning rate) as appropriate.

In [27]:
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Task-4
def train_model(model, train_loader, criterion, optimizer, epochs=100):
    model.train()  # Set the model to training mode
    for epoch in range(epochs):
        for inputs, targets in train_loader:

            # Zeroring the gradients strored in optimizer
            optimizer.zero_grad()

            # Write the forward pass
            output = model(X_train)
            # Calculate loss
            loss = criterion(output, y_train)
            # Calculate gradients
            loss.backward()
            # Update the model's weights. Tip: step.
            optimizer.step()

        if epoch % 10 == 0:  # Print the loss every 10 epochs
            print(f'Epoch {epoch+1}/{epochs}, Loss: {loss.item():.4f}')

# Call the train_model function
train_model(model, train_loader, criterion, optimizer, epochs=100)


Epoch 1/100, Loss: 101.6531
Epoch 11/100, Loss: 0.1515
Epoch 21/100, Loss: 0.0751
Epoch 31/100, Loss: 0.0582
Epoch 41/100, Loss: 0.0522
Epoch 51/100, Loss: 0.0485
Epoch 61/100, Loss: 0.0446
Epoch 71/100, Loss: 0.0414
Epoch 81/100, Loss: 0.0386
Epoch 91/100, Loss: 0.0359


# Evaluating the Model

After training the model, it's crucial to evaluate its performance on the test set to see how well it predicts laptop prices on data it hasn't seen before. We'll use several metrics for a comprehensive evaluation.

## Task-5 (10 points):

- Write the missing code to evaluate the model on the test set. Calculate and print the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R2) score.


In [28]:
# Task-5
def evaluate_model(model, test_loader):
    model.eval()  # Set the model to evaluation mode
    predictions = []
    actuals = []
    with torch.no_grad():
        for inputs, targets in test_loader:
            # make an inference
            outputs = model(inputs)
            predictions.extend(outputs.view(-1).tolist())
            actuals.extend(targets.view(-1).tolist())

    mse = mean_squared_error(actuals, predictions)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(actuals, predictions)
    r2 = r2_score(actuals, predictions)

    print(f'MSE: {mse:.4f}, RMSE: {rmse:.4f}, MAE: {mae:.4f}, R2: {r2:.4f}')

# Call the evaluate_model function
evaluate_model(model, test_loader)

MSE: 0.0695, RMSE: 0.2636, MAE: 0.1893, R2: 0.8246


# NB: The highest R2 scores and the lowest MSEs get bonus (10) points.

# Part - 2
# Predicting Actual Computer Performance

In our Operating Systems (OS) class, we've explored how computers can be monitored by operating systems, from the intricacies of hardware components to the complexities of software operations. We've discussed how an OS is not just the backbone of our computing environments but also a rich source of data that can be analyzed to understand and predict computer performance. Through our discussions and previous homework assignments, we've seen how various metrics and measurements logged by operating systems can offer insights into the efficiency and capability of computers.

This assignment takes us a step further into the practical application of what we've learned. We have a dataset, logged by the operating systems of individual computers. It encompasses a variety of attributes that contribute to a computer's performance.

Our objective is to predict the Estimated Real Performance (ERP) of these computers, as reported by the OS, leveraging the dataset's attributes. This endeavor will not only reinforce our understanding of the operating system's role in monitoring performance but also enhance our skills in data analysis and machine learning, preparing us for the real-world challenges of optimizing and predicting computer performance.
## Dataset Description

The dataset contains the following attributes:
- **Vendor Name**: The manufacturer of the computer.
- **Model Name**: The model of the computer.
- **MYCT**: Machine cycle time in nanoseconds, indicating the speed at which the computer operates.
- **MMIN**: Minimum main memory in kilobytes, a crucial component for running applications.
- **MMAX**: Maximum main memory in kilobytes, defining the upper limit of what the computer can handle in terms of memory.
- **CACH**: Cache memory in kilobytes, essential for reducing the average time to access data from the main memory.
- **CHMIN**: Minimum channels in units, representing the minimum number of I/O devices the computer can handle.
- **CHMAX**: Maximum channels in units, indicating the computer's capability to manage multiple I/O operations.
- **PRP**: Performance as Reported by the Producer, offering a benchmark for what we might expect in terms of computer performance.
- **ERP**: Estimated Real Performance reported by the OS, the metric we aim to predict to understand the computer's actual performance in real-world operations.




In [29]:
import pandas as pd

# Load the dataset
file_path = 'machine-data.txt'  # Update this path for your file path
columns = ['Vendor Name', 'Model Name', 'MYCT', 'MMIN', 'MMAX', 'CACH', 'CHMIN', 'CHMAX', 'PRP', 'ERP']
data = pd.read_csv(file_path, names=columns)

# Display the first few rows of the dataframe
data.head()


Unnamed: 0,Vendor Name,Model Name,MYCT,MMIN,MMAX,CACH,CHMIN,CHMAX,PRP,ERP
0,adviser,32/60,125,256,6000,256,16,128,198,199
1,amdahl,470v/7,29,8000,32000,32,8,32,269,253
2,amdahl,470v/7a,29,8000,32000,32,8,32,220,253
3,amdahl,470v/7b,29,8000,32000,32,8,32,172,253
4,amdahl,470v/7c,29,8000,16000,32,8,16,132,132


# Data Preprocessing

- Before we can train our model, we need to preprocess our data. We will standardize the numerical features and encode the categorical features. Since our target, `CACH`, is numerical, we will predict it as is without any encoding.



In [30]:

# Identifying features and target variable
X = data.drop('ERP', axis=1)
y = data['ERP']

# Preprocessing steps
categorical_features = ['Vendor Name']
numerical_features = ['MYCT', 'MMIN', 'MMAX', 'CHMIN', 'CHMAX', 'PRP', 'CACH']

# Create the preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(), categorical_features)
    ])

X_processed = preprocessor.fit_transform(X)
y = y.ravel()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)
# Prepare the data for PyTorch
X_train_t = torch.FloatTensor(X_train.toarray())
X_test_t = torch.FloatTensor(X_test.toarray())
y_train_t = torch.FloatTensor(y_train)
y_test_t = torch.FloatTensor(y_test)


# Building and Training the Neural Network

- Now, it's time to build our neural network that will predict the ERP. We'll define a simple architecture in PyTorch, compile the model, and then train it on our preprocessed dataset.

# Task-6 (20 points):
- Implement the neural network

In [31]:
# Task-6
# Create the model

class PerformancePredictor(nn.Module):
    def __init__(self):
        super(PerformancePredictor, self).__init__()
        # initialize the layers of the network
        self.fc1 = nn.Linear(X_train.shape[1], 2048)
        self.fc2 = nn.Linear(2048, 1)

    def forward(self, x):
        # put x through the layer(s) and activation function(s) (like ReLU)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = PerformancePredictor()


# Task-7 (30 points):
- Train the Neural Network

In [32]:
# Create dataloaders
train_data = TensorDataset(X_train_t, y_train_t)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)

# Model instantiation
#model = PerformancePredictor(X_train.shape[1])
model = PerformancePredictor()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

#X_train_t_torch = torch.FloatTensor(X_train_t)
#y_train_t_torch = torch.FloatTensor(y_train_t)

# Training loop
epochs = 1500
for epoch in range(epochs):
  model.train()
  optimizer.zero_grad()
  outputs = model(X_train_t)
  loss = criterion(outputs.squeeze(1), y_train_t)
  loss.backward()
  optimizer.step()

  if epoch % 100 == 0:
    print(f'Epoch {epoch}, Loss: {loss.item()}')


Epoch 0, Loss: 24494.734375
Epoch 100, Loss: 2744.623046875
Epoch 200, Loss: 633.4061889648438
Epoch 300, Loss: 345.3572082519531
Epoch 400, Loss: 229.3052520751953
Epoch 500, Loss: 170.43516540527344
Epoch 600, Loss: 132.0568084716797
Epoch 700, Loss: 101.63204956054688
Epoch 800, Loss: 76.54833984375
Epoch 900, Loss: 56.404788970947266
Epoch 1000, Loss: 41.00255584716797
Epoch 1100, Loss: 29.11956214904785
Epoch 1200, Loss: 20.952884674072266
Epoch 1300, Loss: 15.337861061096191
Epoch 1400, Loss: 11.383468627929688


# Model Evaluation

- After training our model, let's evaluate its performance on the test set to see how well it predicts the ERP.


In [33]:
model.eval()
with torch.no_grad():
    predictions = model(X_test_t).squeeze()

# Calculate the test loss
test_loss = criterion(predictions, y_test_t)
print(f'Test Loss: {test_loss.item()}')

# Convert predictions and actual values to a pandas DataFrame for easier comparison
results_comparison = pd.DataFrame({'Actual ERP': y_test_t.numpy(), 'Predicted ERP': predictions.numpy()})
results_comparison = results_comparison.head(10)  # Display the first 10 results for a quick comparison

results_comparison


Test Loss: 967.8099975585938


Unnamed: 0,Actual ERP,Predicted ERP
0,102.0,93.885033
1,25.0,21.604725
2,25.0,21.144485
3,919.0,862.994873
4,34.0,34.03347
5,267.0,266.967438
6,41.0,40.55854
7,19.0,22.815035
8,1238.0,1050.170532
9,227.0,229.080505
