# **3.1.1 Step -1- Data Understanding, Analysis and Preparations:**
# In this step we will read the data, understand the data, perform some basic data cleaning, and store everything in the matrix as shown below.

**– Objective of the Task**

**To Predict the marks obtained in writing based on the marks of Math and Reading.**

• To - Do - 1:

1. Read and Observe the Dataset.
2. Print the top 5 and bottom 5 rows of the dataset {Hint: DataFrame.head and DataFrame.tail}.
3. Print the information of the dataset {Hint: DataFrame.info}.
4. Gather the descriptive statistics of the dataset {Hint: DataFrame.describe}.
5. Split your data into Feature (X) and Label (Y).

In [1]:
import pandas as pd

# Loaded the dataset
file_path = '/content/drive/MyDrive/Data Set/Copy of student.csv'
data = pd.read_csv(file_path)

# Step 2: Printing top(5) and bottom(5) of the dataset
print("Top 5 rows of the dataset:")
print(data.head())

print("\nBottom 5 rows of the dataset:")
print(data.tail())

# Step 3: Printing the Information of Datasets
print("\nDataset Information:")
data.info()

# Step 4: Gathering the Descriptive info about the Dataset
print("\nDescriptive Statistics of the Dataset:")
print(data.describe())

# Step 5: Splitting the data into Feature (X) and Label (Y)

X = data[['Math', 'Reading']]
y = data['Writing']

print("\nFeatures (X):")
print(X.head())

print("\nLabel (y):")
print(y.head())


Top 5 rows of the dataset:
   Math  Reading  Writing
0    48       68       63
1    62       81       72
2    79       80       78
3    76       83       79
4    59       64       62

Bottom 5 rows of the dataset:
     Math  Reading  Writing
995    72       74       70
996    73       86       90
997    89       87       94
998    83       82       78
999    66       66       72

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Math     1000 non-null   int64
 1   Reading  1000 non-null   int64
 2   Writing  1000 non-null   int64
dtypes: int64(3)
memory usage: 23.6 KB

Descriptive Statistics of the Dataset:
              Math      Reading      Writing
count  1000.000000  1000.000000  1000.000000
mean     67.290000    69.872000    68.616000
std      15.085008    14.657027    15.241287
min      13.000000    19.000000    14.000000
25%    

In [2]:
import numpy as np

# Initializing the weight vector W with random values (same dimension as features in X)
W = np.random.rand(X.shape[1])  # W \in R^d

# Printing the initialized weight vector W
print("\nWeight vector (W):")
print(W)

# Convert y to a NumPy array and reshape it into a column vector
Y = np.asarray(y).reshape(-1, 1)  # Y \in R^n

# Printing the target vector Y
print("\nTarget vector (Y):")
print(Y[:5])

# Verifying the dimensions
print("\nDimensions:")
print(f"W: {W.shape}, X: {X.shape}, Y: {Y.shape}")



Weight vector (W):
[0.4822373  0.58066864]

Target vector (Y):
[[63]
 [72]
 [78]
 [79]
 [62]]

Dimensions:
W: (2,), X: (1000, 2), Y: (1000, 1)
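The shapes printed above matter later: `np.dot(X, W)` with a 1-D `W` returns a 1-D array of shape `(n,)`, and subtracting a column vector of shape `(n, 1)` from it broadcasts to an `(n, n)` matrix instead of `n` errors. This is presumably why the later functions call `Y.flatten()`. A minimal sketch of the pitfall on made-up data (the `*_demo` names are illustrative only, not part of the assignment):

```python
import numpy as np

# Tiny made-up example: 4 samples, 2 features
X_demo = np.ones((4, 2))
W_demo = np.array([0.5, 0.5])           # 1-D weight vector, shape (2,)
Y_col = np.arange(4).reshape(-1, 1)     # column vector, shape (4, 1)

pred = X_demo @ W_demo                  # shape (4,)
bad = pred - Y_col                      # broadcasts to shape (4, 4)
good = pred - Y_col.flatten()           # shape (4,), one error per sample

print(pred.shape, bad.shape, good.shape)
```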


# • To - Do - 3:
**1. Split the dataset into training and test sets.**

**2. You can use an 80-20 or 70-30 split, with 80% (or 70%) of the data used for training and the rest for testing.**

In [3]:
# Splitting the dataset
split_ratio = 0.8  # 80-20 split
split_index = int(len(X) * split_ratio)

# Shuffling the data
indices = np.arange(len(X))
np.random.shuffle(indices)

# Converting to NumPy arrays (if not already)
X = np.array(X)
y = np.array(y)

# Applying the shuffled indices
X_shuffled = X[indices]
y_shuffled = y[indices]

# Splitting the data
X_train = X_shuffled[:split_index]
X_test = X_shuffled[split_index:]
y_train = y_shuffled[:split_index]
y_test = y_shuffled[split_index:]

print("\nTraining Features (X_train):")
print(X_train[:5])

print("\nTesting Features (X_test):")
print(X_test[:5])

print("\nTraining Labels (y_train):")
print(y_train[:5])

print("\nTesting Labels (y_test):")
print(y_test[:5])



Training Features (X_train):
[[77 79]
 [70 67]
 [77 85]
 [85 82]
 [54 51]]

Testing Features (X_test):
[[73 64]
 [56 65]
 [44 55]
 [70 81]
 [71 64]]

Training Labels (y_train):
[84 64 81 83 53]

Testing Labels (y_test):
[69 66 66 78 58]
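The split above is shuffled without a fixed seed, so the rows shown will differ on every run. If reproducibility is wanted, one option is a small seeded helper. The sketch below is an assumption, not part of the assignment: `shuffle_split`, `train_fraction`, and `seed` are hypothetical names, and the demo arrays are made up.

```python
import numpy as np

def shuffle_split(X, y, train_fraction=0.8, seed=0):
    """Shuffle rows with a fixed seed, then split into train and test parts."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))          # shuffled row indices
    cut = int(len(X) * train_fraction)       # index where the test part starts
    train_idx, test_idx = order[:cut], order[cut:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Made-up data: row i of X_demo is [2*i, 2*i + 1] and its label is i
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
X_tr, X_te, y_tr, y_te = shuffle_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

Using one permutation for both `X` and `y` keeps every feature row paired with its own label.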


# **3.1.2 Step -2- Build a Cost Function:**
# The cost function is the average of the loss function measured across all data points.
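Written out for this notebook's setting (squared-error loss, no bias term), with $m$ the number of samples, $x_i$ the feature vector of sample $i$, and $y_i$ its target, the cost computed by `cost_function` is:

$$
J(W) = \frac{1}{2m} \sum_{i=1}^{m} \left( x_i^{\top} W - y_i \right)^2
$$

The factor $\frac{1}{2}$ is a convention that cancels the 2 produced when differentiating the square; it does not change where the minimum lies.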

In [4]:
def cost_function(X, Y, W):
    """
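    Parameters:
    X: Feature matrix, shape (m, n).
    Y: Target values, shape (m,) or (m, 1); flattened internally.
    W: Weight vector, shape (n,).

    Returns:
    cost: Half the mean squared error, (1 / (2m)) * sum(errors ** 2).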
    """
    # Hypothesis function
    y_pred = np.dot(X, W)

    # Mean Squared Error calculation
    errors = y_pred - Y.flatten()
    cost = (1 / (2 * len(Y))) * np.sum(errors ** 2)
    return cost

# Designing a Test Case for Cost Function:
**We will first calculate the loss value manually and then verify the output via our code. If the computed value matches, we will proceed further.**
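The manual calculation can be spelled out step by step. The sketch below uses the same numbers as the test case and adds a second, nonzero case with all-zero weights as an extra check; that second case and the `*_check` names are additions for illustration, not part of the assignment.

```python
import numpy as np

X_check = np.array([[1, 2], [3, 4], [5, 6]])
Y_check = np.array([3, 7, 11])
W_check = np.array([1, 1])

# Case 1: predictions are [1+2, 3+4, 5+6] = [3, 7, 11], so every error is 0
predictions = X_check @ W_check
errors = predictions - Y_check
manual_cost = np.sum(errors ** 2) / (2 * len(Y_check))

# Case 2: zero weights predict 0 everywhere, so the errors are [-3, -7, -11]
# and the cost is (9 + 49 + 121) / (2 * 3) = 179 / 6
W_zero = np.zeros(2)
errors_zero = X_check @ W_zero - Y_check
manual_cost_zero = np.sum(errors_zero ** 2) / (2 * len(Y_check))

print(manual_cost, manual_cost_zero)
```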

In [5]:
# Test case for Cost Function as given in the question
# (separate names so the X_test from the train-test split is not overwritten)
X_case = np.array([[1, 2], [3, 4], [5, 6]])
Y_case = np.array([3, 7, 11])
W_case = np.array([1, 1])

# Calculating the cost function
cost = cost_function(X_case, Y_case, W_case)
if cost == 0:
    print("\nProceed Further")
else:
    print("\nSomething went wrong: Reimplement the cost function")

print("Cost function output:", cost)


Proceed Further
Cost function output: 0.0


# **3.1.3 Step -3- Gradient Descent for Simple Linear Regression:**
# Objective: Learn the Parameters
**To learn the parameters w (weights) and b (bias), we assume b = 0 for simplicity, so there is no bias term (w0) to update.**
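For the half mean squared error cost $J(W) = \frac{1}{2m} \lVert XW - Y \rVert^2$, with $m$ samples, feature matrix $X$, target vector $Y$, and learning rate $\alpha$, each iteration of gradient descent computes the gradient and steps against it:

$$
\nabla_W J(W) = \frac{1}{m} X^{\top} (XW - Y), \qquad W \leftarrow W - \alpha \, \nabla_W J(W)
$$

In the code, `loss` corresponds to $XW - Y$ and `dw` to $\nabla_W J(W)$.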

# ***Gradient Descent Code below***

In [6]:
# Implementing Gradient Descent function
def gradient_descent(X, Y, W, alpha, iterations):
    """
    Perform gradient descent to optimize the parameters of a linear regression model.
    Parameters:
    X (numpy.ndarray): Feature matrix (m x n).
    Y (numpy.ndarray): Target values, shape (m,) or (m, 1); flattened internally.
    W (numpy.ndarray): Initial guess for parameters, 1-D with shape (n,).
    alpha (float): Learning rate.
    iterations (int): Number of iterations for gradient descent.
    Returns:
    tuple: (W_update, cost_history).
    W_update (numpy.ndarray): Updated parameters, shape (n,).
    cost_history (list): Cost value after each iteration.
    """
    # Initializing cost history
    cost_history = [0] * iterations
    # Number of samples
    m = len(Y)

    W_update = W.copy()
    for iteration in range(iterations):
        # Step 1: Hypothesis Values
        Y_pred = np.dot(X, W_update)

        # Step 2: Difference between Hypothesis and Actual Y
        loss = Y_pred - Y.flatten()

        # Step 3: Gradient Calculation
        dw = (1 / m) * np.dot(X.T, loss)

        # Step 4: Updating Values of W using Gradient
        W_update = W_update - alpha * dw

        # Step 5: New Cost Value
        cost = cost_function(X, Y, W_update)
        cost_history[iteration] = cost

    return W_update, cost_history

# ***Test Code for Gradient Descent function below:***

In [7]:
# Testing the gradient_descent function on random data
# (separate names so the student-data split is not overwritten)
np.random.seed(0)  # For reproducibility
X_rand = np.random.rand(100, 3)  # 100 samples, 3 features
Y_rand = np.random.rand(100)
W_rand = np.random.rand(3)  # Initial guess for parameters

# Set hyperparameters
alpha = 0.01
iterations = 1000

# Perform Gradient Descent
final_params, cost_history = gradient_descent(X_rand, Y_rand, W_rand, alpha, iterations)

# Print the final parameters and cost history
print("\nFinal Parameters:", final_params)
print("\nCost History:", cost_history[:10], "...")  # Display first 10 cost values


Final Parameters: [0.20551667 0.54295081 0.10388027]

Cost History: [0.10711197094660153, 0.10634880599939901, 0.10559826315680618, 0.10486012948320558, 0.1041341956428534, 0.10342025583900626, 0.1027181077540776, 0.1020275524908062, 0.10134839451441931, 0.1006804415957737] ...


# **3.1.4 Step -4- Evaluate the Model:**
**Evaluation in machine learning measures the goodness of fit of the model you built. Let's see how good the model designed above is. As discussed in class, for regression we can use the following functions as evaluation measures.**
# ***1. Root Mean Square Error:***
***The Root Mean Squared Error (RMSE) is a commonly used metric for measuring the average magnitude of the errors between predicted and actual values.***
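In symbols, with $m$ samples, actual values $y_i$, and predictions $\hat{y}_i$:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2}
$$

RMSE is expressed in the same units as the target, here marks.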

# **The code for RMSE Model is below**

In [8]:
#Model Evaluation For RMSE
def rmse(Y, Y_pred):
    """
    This function calculates the Root Mean Squared Error (RMSE).
    Input Arguments:
    Y: Array of actual (target) values of the dependent variable.
    Y_pred: Array of predicted values of the dependent variable.
    Output Arguments:
    rmse: Root Mean Squared Error.
    """
    return np.sqrt(np.mean((Y - Y_pred) ** 2))

# ***R2 or Coefficient of Determination:***
***R-squared, or the coefficient of determination, measures the proportion of the variance in the dependent variable that is predictable from the independent variables.***
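In symbols, with actual values $y_i$, predictions $\hat{y}_i$, and $\bar{y}$ the mean of the actual values:

$$
R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}, \qquad SS_{\mathrm{res}} = \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2, \qquad SS_{\mathrm{tot}} = \sum_{i=1}^{m} \left( y_i - \bar{y} \right)^2
$$

A value of 1 means perfect predictions, and 0 means the model does no better than always predicting $\bar{y}$.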
# **The code for R2 Model is below**

In [9]:
#Model Evaluation For R2
def r2(Y, Y_pred):
    """
    This function calculates the R-squared score (coefficient of determination).
    Input Arguments:
    Y: Array of actual (target) values of the dependent variable.
    Y_pred: Array of predicted values of the dependent variable.
    Output Arguments:
    r2: R-squared score.
    """
    mean_y = np.mean(Y)
    ss_tot = np.sum((Y - mean_y) ** 2)
    ss_res = np.sum((Y - Y_pred) ** 2)
    return 1 - (ss_res / ss_tot)
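Both metrics can be checked against a small hand calculation, in the same spirit as the cost function test case. The sketch below repeats compact copies of the two functions so it runs on its own; the three-point dataset is made up for illustration.

```python
import numpy as np

def rmse(Y, Y_pred):
    # Compact copy of the notebook's RMSE function
    return np.sqrt(np.mean((Y - Y_pred) ** 2))

def r2(Y, Y_pred):
    # Compact copy of the notebook's R-squared function
    ss_tot = np.sum((Y - np.mean(Y)) ** 2)
    ss_res = np.sum((Y - Y_pred) ** 2)
    return 1 - (ss_res / ss_tot)

Y_true = np.array([3.0, 5.0, 7.0])
Y_hat = np.array([2.0, 5.0, 9.0])

# Errors are [1, 0, -2], squared errors [1, 0, 4]:
#   RMSE = sqrt(5 / 3)
#   SS_res = 5, SS_tot = (3-5)^2 + 0 + (7-5)^2 = 8, so R2 = 1 - 5/8 = 0.375
rmse_value = rmse(Y_true, Y_hat)
r2_value = r2(Y_true, Y_hat)
print(rmse_value, r2_value)
```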

# **3.1.5 Step -5- Main Function to Integrate All Steps:**
**In this section, we will create a main function that integrates the data loading, preprocessing, cost function, gradient descent, and model evaluation. This will help in running the entire workflow with minimal effort.**

# • Objective:
**The objective of the main function is to execute the full process, from loading the data to performing linear regression using gradient descent and evaluating the results using metrics like RMSE and R2.**

**• To - Do:**

We will define a function that:

1. Loads the data and splits it into training and test sets.
2. Prepares the feature matrix (X) and target vector (Y).
3. Defines the weight matrix (W) and initializes the learning rate and number of iterations.
4. Calls the gradient descent function to learn the parameters.
5. Evaluates the model using RMSE and R2.

The rewritten code is below.

In [11]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Main Function
def main():
    # Step 1: Load the dataset
    data = pd.read_csv('/content/drive/MyDrive/Data Set/Copy of student.csv')

    # Step 2: Split the data into features (X) and target (Y)
    X = data[['Math', 'Reading']].values  # Features: Math and Reading marks
    Y = data['Writing'].values            # Target: Writing marks

    # Step 3: Split the data into training and test sets (80% train, 20% test)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

    # Step 4: Initialize weights (W) to zeros, learning rate, and number of iterations
    W = np.zeros(X_train.shape[1])  # Initialize weights
    alpha = 0.00001  # Learning rate
    iterations = 1000  # Number of iterations for gradient descent

    # Step 5: Perform Gradient Descent
    W_optimal, cost_history = gradient_descent(X_train, Y_train, W, alpha, iterations)

    # Step 6: Make predictions on the test set
    Y_pred = np.dot(X_test, W_optimal)

    # Step 7: Evaluate the model using RMSE and R-Squared
    model_rmse = rmse(Y_test, Y_pred)
    model_r2 = r2(Y_test, Y_pred)

    # Step 8: Output the results
    print("Final Weights:", W_optimal)
    print("Cost History (First 10 iterations):", cost_history[:10])
    print("RMSE on Test Set:", model_rmse)
    print("R-Squared on Test Set:", model_r2)

# Execute the main function
if __name__ == "__main__":
    main()


Final Weights: [0.34811659 0.64614558]
Cost History (First 10 iterations): [2013.165570783755, 1640.286832599692, 1337.0619994901588, 1090.4794892850578, 889.9583270083234, 726.8940993009545, 594.2897260808594, 486.4552052951635, 398.7634463599484, 327.4517147324688]
RMSE on Test Set: 5.2798239764188635
R-Squared on Test Set: 0.8886354462786421


Present your findings:
# **1. Did your model overfit, underfit, or is its performance acceptable?**

# **Ans:-**
**The results show that the R-squared value on the test set is 0.88863, indicating that the model explains 88.9% of the variance in the data, which is generally a strong performance.**

**The first 10 values of the cost function history demonstrate a decrease from 2013.16 to 327.45, signifying that the model is learning from its errors and improving over time.**

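**Overall, the model performs well, capturing a significant portion of the data's variance. The RMSE of about 5.28 marks is low relative to the spread of the Writing scores (standard deviation of about 15.2), and the decreasing cost history confirms that gradient descent is reducing the training error. The high test R-squared argues against underfitting. Only test-set metrics are reported here, so strictly ruling out overfitting would require comparing them with the same metrics on the training set; still, a model with only two weights and no bias has little capacity to overfit, and its performance is satisfactory.**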
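One way to back up an overfitting judgment is to compute the same metrics on both the training and the test split and look at the gap. The sketch below is a hedged illustration: `compare_train_test` is a hypothetical helper, and the data is synthetic (random marks-like features with known weights plus noise), not the student dataset, so its numbers say nothing about the results above.

```python
import numpy as np

def rmse(Y, Y_pred):
    return np.sqrt(np.mean((Y - Y_pred) ** 2))

def r2(Y, Y_pred):
    ss_tot = np.sum((Y - np.mean(Y)) ** 2)
    ss_res = np.sum((Y - Y_pred) ** 2)
    return 1 - (ss_res / ss_tot)

def compare_train_test(X_train, Y_train, X_test, Y_test, W):
    """Return RMSE and R-squared for both splits using the weights W."""
    report = {}
    for name, X_part, Y_part in [("train", X_train, Y_train),
                                 ("test", X_test, Y_test)]:
        Y_hat = X_part @ W
        report[name] = {"rmse": rmse(Y_part, Y_hat), "r2": r2(Y_part, Y_hat)}
    return report

# Synthetic stand-in data (assumption): 200 samples, 2 features, noisy target
rng = np.random.default_rng(0)
X_demo = rng.uniform(40, 100, size=(200, 2))
W_demo = np.array([0.35, 0.65])
Y_demo = X_demo @ W_demo + rng.normal(0, 5, size=200)

report = compare_train_test(X_demo[:160], Y_demo[:160],
                            X_demo[160:], Y_demo[160:], W_demo)
for split, metrics in report.items():
    print(split, metrics)
```

A test RMSE far above the training RMSE would point to overfitting, while poor values on both splits would point to underfitting.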

In [None]:
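# Experimenting with different learning rates
alpha_values = [0.0001, 0.0005, 0.001, 0.01]  # Example learning rates

# main() keeps its variables local, so rebuild the split and settings here.
# X and y are the NumPy arrays created in the To-Do 3 cell.
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=42)
W = np.zeros(X_train.shape[1])  # Same zero initialization as in main()
iterations = 1000

for alpha in alpha_values:
    print(f"\nTesting with alpha = {alpha}")
    W_optimal, cost_history = gradient_descent(X_train, Y_train, W, alpha, iterations)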

    Y_pred = np.dot(X_test, W_optimal)
    model_rmse = rmse(Y_test, Y_pred)
    model_r2 = r2(Y_test, Y_pred)

    print("Final Weights:", W_optimal)
    print("Cost History (First 10 iterations):", cost_history[:10])
    print("RMSE on Test Set:", model_rmse)
    print("R-Squared on Test Set:", model_r2)

# **2. Experiment with different values of the learning rate, making it higher and lower, and observe the results.**

# **Ans:-**

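**1. When alpha is set too high (e.g., 0.01), each update overshoots the minimum. Because the marks are unscaled values in roughly the 0-100 range, the gradients are large, so at such a rate the cost is expected to grow from one iteration to the next instead of decreasing.**


**2. When alpha is too low, each update is tiny, so the model converges very slowly and may not reach a satisfactory solution within the 1000 iterations. Note that 0.0001, the smallest value in the list above, is still ten times larger than the 0.00001 used in main(), so observing the "too low" case requires trying values below that baseline.**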


**3. The optimal alpha ensures steady convergence, with a consistently decreasing cost and acceptable RMSE and R-squared values. Adjusting alpha based on the model's performance is essential for achieving the best results.**
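The effect of the learning rate can be illustrated without the dataset. For a squared-error cost, gradient descent is stable only when alpha is below 2 divided by the largest eigenvalue of X^T X / m, which is why unscaled marks force such small rates. The sketch below uses synthetic marks-like data and a hypothetical helper `cost_after_descent`; it is an illustration under those assumptions, not a rerun of the experiment above.

```python
import numpy as np

def cost_after_descent(X, Y, alpha, iterations=50):
    """Run plain gradient descent from zero weights and return the final cost."""
    m = len(Y)
    W = np.zeros(X.shape[1])
    for _ in range(iterations):
        W = W - alpha * (X.T @ (X @ W - Y)) / m
    return np.sum((X @ W - Y) ** 2) / (2 * m)

# Synthetic stand-in data (assumption): features in a marks-like range
rng = np.random.default_rng(0)
X_demo = rng.uniform(40, 100, size=(200, 2))
Y_demo = X_demo @ np.array([0.35, 0.65])

# Stability limit for the learning rate on this data
largest_eig = np.linalg.eigvalsh(X_demo.T @ X_demo / len(Y_demo)).max()
alpha_limit = 2 / largest_eig

initial_cost = np.sum(Y_demo ** 2) / (2 * len(Y_demo))  # cost at W = 0
alphas = [1e-5, 1e-4, 1e-3]
costs = [cost_after_descent(X_demo, Y_demo, a) for a in alphas]

print("alpha limit:", alpha_limit)
print("initial cost:", initial_cost)
for a, c in zip(alphas, costs):
    print(f"alpha = {a}: cost after 50 iterations = {c}")
```

Rates below the limit reduce the cost, while the rate above it makes the cost grow. Scaling the features (for example, dividing the marks by 100) would raise the limit and allow larger learning rates.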