
<a href="https://www.zero-grad.com/">
         <img alt="Zero Grad" src="https://i.postimg.cc/y8LZ0CM6/linear-Algebra.png" >
      </a>

# 🧪 Assignment 1: Build Your Own `train_test_split_np()` Using NumPy


## Introduction




In Machine Learning, we build models that **learn from data** and then **make predictions** on new, unseen data.

But if we train and test our model on the same data, we won't know if it's actually good: it might just be memorizing the data (overfitting). To avoid this, we **split the dataset** into two parts:

- **Training Set** 🧠  
  Used by the model to learn patterns from data.

- **Testing Set** 🧪  
  Used to evaluate how well the model performs on new, unseen data.

This approach helps us **estimate how the model will perform in the real world**.

---

### ⚖️ Common Split Ratios

- **80% Train / 20% Test**: most common for general tasks.
- **70% Train / 30% Test**: when you want more test data.
- **90% Train / 10% Test**: if your dataset is large.

There's no one "right" ratio; it depends on your dataset size and use case.
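To make the ratios concrete, here is a small sketch of how a test ratio turns into sample counts. The helper name `split_sizes` is hypothetical (it is not part of the assignment); it rounds the test size down with `int()`, which is the same rule the implementation later in this notebook uses.

```python
def split_sizes(n_samples, test_ratio):
    """Return (n_train, n_test) for a given dataset size and test ratio.

    Hypothetical helper for illustration only.
    """
    n_test = int(n_samples * test_ratio)  # test size is rounded down
    n_train = n_samples - n_test          # everything else goes to training
    return n_train, n_test


for ratio in (0.2, 0.3, 0.1):
    n_train, n_test = split_sizes(1000, ratio)
    print(f"test_ratio={ratio}: {n_train} train / {n_test} test")
```

Because the test size is rounded down, very small datasets can end up with a tiny test set: 5 samples at a 20% ratio leave a single test sample.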

---

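### 🔀 Should We Shuffle the Data?

Usually, yes.  
If your data is stored in a meaningful order (e.g. sorted by class label or by collection batch), you should **shuffle** it before splitting to avoid bias. The main exception is time-based data used for forecasting: there, a chronological split is usually the better choice, because shuffling lets the model train on samples that come after the ones it is tested on.

Shuffling makes it far more likely that both the training and test sets represent the overall data distribution fairly.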
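The effect of data order can be seen on a toy example. The sketch below assumes a made-up label array sorted by class and takes the first 40% of the samples as the test set, once without shuffling and once after shuffling with NumPy's `default_rng` (the seed value `0` is arbitrary).

```python
import numpy as np

# Toy labels sorted by class: five 0s followed by five 1s (made-up data)
labels = np.repeat([0, 1], 5)
test_size = int(len(labels) * 0.4)  # 4 test samples

# Without shuffling, the test set is just the first 4 labels
ordered_test = labels[:test_size]

# With shuffling, the labels are permuted before slicing
rng = np.random.default_rng(0)
shuffled = labels[rng.permutation(len(labels))]
shuffled_test = shuffled[:test_size]

print("classes in test set without shuffling:", np.unique(ordered_test))
print("classes in test set with shuffling:   ", np.unique(shuffled_test))
```

Without shuffling, the test set here contains only class 0, so it says nothing about how the model handles class 1. With shuffling, a mix of both classes is much more likely, though not guaranteed on a dataset this small.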

---

### 🧠 Real-Life Analogy

Imagine you're studying for an exam.

- You **practice with sample questions** (training set).
- Then you **test yourself with new questions** you've never seen (test set).

If you only "test" yourself using the same practice questions, you're not really testing your understanding; you're just repeating.

---

### 🧪 Summary

| Term           | Purpose                          |
|----------------|----------------------------------|
| Training Set   | Learn from it                    |
| Testing Set    | Evaluate model on unseen data    |
| Shuffle        | Avoid bias from data order       |
| Test Ratio     | Controls how much data is held out for testing |

Understanding this concept is the first step toward building reliable, real-world ML models.


## Implementation




Create a custom `train_test_split_np()` function using **NumPy only**, similar to scikit-learn's `train_test_split`, with full control over:

- Test ratio (e.g., 20% test)
- Shuffle behavior
- Random seed for reproducibility

In [9]:
import numpy as np

def train_test_split_np(X, y, test_ratio=0.2, seed=None, shuffle=True):
    """
    Splits X and y into training and testing sets using NumPy only.

    Parameters:
        X (ndarray): Feature array of shape (n_samples, n_features)
        y (ndarray): Target array of shape (n_samples, 1) or (n_samples,)
        test_ratio (float): Ratio of data to use for testing (0 < test_ratio < 1)
        seed (int): Optional random seed for reproducibility
        shuffle (bool): Whether to shuffle data before splitting

    Returns:
        X_train, X_test, y_train, y_test
    """
    if seed is not None:
        np.random.seed(seed)  # seed NumPy's global RNG so the shuffle is reproducible
    n_samples = X.shape[0]  # y.shape[0] would work too, since X and y must have the same number of samples
    indices = np.arange(n_samples)  # indices 0 .. n_samples-1, in the original order

    if shuffle:
        indices = np.random.permutation(n_samples)  # the same indices in random order

    test_size = int(n_samples * test_ratio)  # number of test samples (rounded down)
    test_indices = indices[:test_size]  # the first test_size indices form the test set
    train_indices = indices[test_size:]  # the remaining indices form the training set

    # Select the rows with NumPy integer-array indexing
    X_train = X[train_indices]  # training features
    X_test = X[test_indices]  # testing features
    y_train = y[train_indices]  # training targets
    y_test = y[test_indices]  # testing targets

    return X_train, X_test, y_train, y_test

In [10]:
# Example usage
# The original data arrays X and y
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
y = np.array([1, 2, 3, 4, 5])

print("Original X:\n", X)
print("Original y:\n", y)


Original X:
 [[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]]
Original y:
 [1 2 3 4 5]


In [None]:
# Check that the function works as expected
X_train, X_test, y_train, y_test = train_test_split_np(X, y, test_ratio=0.2, seed=42, shuffle=True)
# seed=42 (any fixed integer works) makes the split reproducible,
# and shuffle=True shuffles the data before splitting.
# test_ratio can be any value strictly between 0 and 1; changing it changes the sizes of the train and test sets.

print("X_train:\n", X_train)
print("X_test:\n", X_test)
print("y_train:\n", y_train)
print("y_test:\n", y_test)

X_train:
 [[ 9 10]
 [ 5  6]
 [ 1  2]
 [ 7  8]]
X_test:
 [[3 4]]
y_train:
 [5 3 1 4]
y_test:
 [2]
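One design point worth noting: `np.random.seed` changes NumPy's global random state, which can affect other code that draws random numbers. The sketch below is a variant, not the assignment's function: the name `split_with_generator` and the two input checks are assumptions added for illustration. It uses a local `np.random.default_rng(seed)` generator instead, so it will not reproduce the exact permutation shown above for the same seed.

```python
import numpy as np


def split_with_generator(X, y, test_ratio=0.2, seed=None, shuffle=True):
    """Variant split that uses a local Generator instead of the global RNG.

    Hypothetical alternative for illustration; the checks below are additions.
    """
    if X.shape[0] != y.shape[0]:
        raise ValueError("X and y must have the same number of samples")
    if not 0 < test_ratio < 1:
        raise ValueError("test_ratio must be strictly between 0 and 1")

    n_samples = X.shape[0]
    rng = np.random.default_rng(seed)  # local RNG, global state is untouched
    indices = rng.permutation(n_samples) if shuffle else np.arange(n_samples)

    test_size = int(n_samples * test_ratio)
    test_idx, train_idx = indices[:test_size], indices[test_size:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Demo data: row i of X_demo is [2*i, 2*i + 1], and y_demo[i] = i
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

first = split_with_generator(X_demo, y_demo, test_ratio=0.2, seed=0)
second = split_with_generator(X_demo, y_demo, test_ratio=0.2, seed=0)

# Same seed, same split
print(all(np.array_equal(a, b) for a, b in zip(first, second)))
# Features and targets stay paired after shuffling
print(np.array_equal(first[0][:, 0], 2 * first[2]))
```

Both checks should print `True`: the same seed gives the same split, and each feature row still lines up with its target.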


# 📈 Assignment 2: Build Linear Regression from Scratch (Using Your Train-Test Split)


## 🧠 What You'll Learn



In this assignment, you'll build a **Linear Regression model** step-by-step using only **NumPy**, and apply it to training and test sets created with your function from **Assignment 1**.

You will:

- Use your own `train_test_split_np()` function  
- Fit a line to training data using the **Normal Equation**  
- Make predictions on test data



🧪 **The Idea**

Linear Regression tries to find the best-fitting line:

```
y = θ₀ + θ₁x
```

We use a closed-form formula (the Normal Equation) to find the best values for `θ₀` and `θ₁`:

```
θ = (XᵀX)⁻¹ Xᵀy
```

Then we use this line to make predictions for new data.
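For readers who wonder where the Normal Equation comes from, here is a short derivation sketch. It assumes the usual least-squares setup: `X` is the feature matrix including the column of ones, `m` is the number of samples, and `J(θ)` is the mean squared error. The symbols `m` and `J` are introduced here for illustration and are not used elsewhere in this notebook.

$$
\begin{aligned}
J(\theta) &= \frac{1}{m}\,\lVert X\theta - y \rVert^2 \\
\nabla_\theta J(\theta) &= \frac{2}{m}\,X^\top (X\theta - y) = 0 \\
X^\top X\,\theta &= X^\top y \\
\theta &= (X^\top X)^{-1} X^\top y
\end{aligned}
$$

The last step requires `XᵀX` to be invertible, which holds when the columns of `X` are linearly independent.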


## 📝 Tasks

### ✅ Step 1: Generate the Dataset


In [4]:
import numpy as np

np.random.seed(42)
# Alternative: X = 2 * np.random.rand(100, 1) would spread the values over [0, 2).
# Here the factor of 2 is dropped, so the values stay in [0, 1).
X = np.random.rand(100, 1)  # 100 samples, 1 feature
y = 4 + 3 * X + np.random.randn(100, 1)

print("X shape:", X.shape)
print("y shape:", y.shape)

X shape: (100, 1)
y shape: (100, 1)


In [5]:
print(X)

[[0.37454012]
 [0.95071431]
 [0.73199394]
 [0.59865848]
 [0.15601864]
 [0.15599452]
 [0.05808361]
 [0.86617615]
 [0.60111501]
 [0.70807258]
 [0.02058449]
 [0.96990985]
 [0.83244264]
 [0.21233911]
 [0.18182497]
 [0.18340451]
 [0.30424224]
 [0.52475643]
 [0.43194502]
 [0.29122914]
 [0.61185289]
 [0.13949386]
 [0.29214465]
 [0.36636184]
 [0.45606998]
 [0.78517596]
 [0.19967378]
 [0.51423444]
 [0.59241457]
 [0.04645041]
 [0.60754485]
 [0.17052412]
 [0.06505159]
 [0.94888554]
 [0.96563203]
 [0.80839735]
 [0.30461377]
 [0.09767211]
 [0.68423303]
 [0.44015249]
 [0.12203823]
 [0.49517691]
 [0.03438852]
 [0.9093204 ]
 [0.25877998]
 [0.66252228]
 [0.31171108]
 [0.52006802]
 [0.54671028]
 [0.18485446]
 [0.96958463]
 [0.77513282]
 [0.93949894]
 [0.89482735]
 [0.59789998]
 [0.92187424]
 [0.0884925 ]
 [0.19598286]
 [0.04522729]
 [0.32533033]
 [0.38867729]
 [0.27134903]
 [0.82873751]
 [0.35675333]
 [0.28093451]
 [0.54269608]
 [0.14092422]
 [0.80219698]
 [0.07455064]
 [0.98688694]
 [0.77224477]
 [0.19

In [6]:
print(y)

[[5.21066742]
 [6.55313557]
 [6.2877426 ]
 [3.80840654]
 [4.24838403]
 [4.82509613]
 [5.65214488]
 [6.08025822]
 [4.99485143]
 [5.62246069]
 [4.9771556 ]
 [7.23848067]
 [5.96756772]
 [5.15028477]
 [4.64255245]
 [5.51885852]
 [4.21067364]
 [5.24660715]
 [4.9037269 ]
 [3.41017247]
 [6.13167896]
 [4.67953685]
 [4.8815474 ]
 [4.8644984 ]
 [3.95283921]
 [5.93488256]
 [4.25630683]
 [4.74042605]
 [5.61595799]
 [4.54340209]
 [7.70882046]
 [4.68615018]
 [4.45270517]
 [6.7722107 ]
 [4.97812488]
 [6.39867817]
 [4.97407152]
 [6.75625845]
 [5.86033811]
 [5.62200482]
 [4.33140293]
 [4.31685269]
 [5.24598838]
 [7.47989424]
 [5.56737189]
 [5.0781794 ]
 [6.33792754]
 [4.158353  ]
 [6.22698793]
 [6.74501899]
 [5.91821756]
 [5.75910074]
 [6.91814819]
 [6.1810064 ]
 [4.24303651]
 [6.83418568]
 [3.20317379]
 [5.06154102]
 [3.21625763]
 [6.5259254 ]
 [4.38277858]
 [4.49198558]
 [7.29972974]
 [3.83939566]
 [5.07026346]
 [6.935231  ]
 [2.81528944]
 [6.5912248 ]
 [4.48353473]
 [7.74248368]
 [5.0797836 ]
 [3.27

### ✅ Step 2: Add Intercept Column (x₀ = 1)


> Example:

![Adding 1](https://i.ibb.co/d0zpGpcB/adding-1.png)


In [7]:
# Build Xb by prepending a column of ones (the intercept term)
ones = np.ones((X.shape[0], 1))  # column of ones with the same number of rows as X
Xb = np.c_[ones, X]  # ones as the first column, X as the second
print("Xb shape:", Xb.shape)
print(Xb)


Xb shape: (100, 2)
[[1.         0.37454012]
 [1.         0.95071431]
 [1.         0.73199394]
 [1.         0.59865848]
 [1.         0.15601864]
 [1.         0.15599452]
 [1.         0.05808361]
 [1.         0.86617615]
 [1.         0.60111501]
 [1.         0.70807258]
 [1.         0.02058449]
 [1.         0.96990985]
 [1.         0.83244264]
 [1.         0.21233911]
 [1.         0.18182497]
 [1.         0.18340451]
 [1.         0.30424224]
 [1.         0.52475643]
 [1.         0.43194502]
 [1.         0.29122914]
 [1.         0.61185289]
 [1.         0.13949386]
 [1.         0.29214465]
 [1.         0.36636184]
 [1.         0.45606998]
 [1.         0.78517596]
 [1.         0.19967378]
 [1.         0.51423444]
 [1.         0.59241457]
 [1.         0.04645041]
 [1.         0.60754485]
 [1.         0.17052412]
 [1.         0.06505159]
 [1.         0.94888554]
 [1.         0.96563203]
 [1.         0.80839735]
 [1.         0.30461377]
 [1.         0.09767211]
 [1.         0.68423303]
 [1.  
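The reason for the column of ones can be checked numerically: with it, a single matrix product computes the intercept plus slope times `x` for every row at once. The sketch below uses made-up values for `x` and `theta` (they are not fitted from the dataset above).

```python
import numpy as np

# Made-up inputs and parameters for illustration: intercept 4, slope 3
x = np.array([[0.0], [1.0], [2.0]])
theta = np.array([[4.0], [3.0]])

# Prepend the column of ones, as in Step 2
x_b = np.c_[np.ones((x.shape[0], 1)), x]

via_matrix = x_b @ theta                     # one matrix product
via_formula = theta[0, 0] + theta[1, 0] * x  # intercept + slope * x

print(via_matrix.ravel())                    # [ 4.  7. 10.]
print(np.allclose(via_matrix, via_formula))  # True
```

Without the ones column, the intercept would have to be handled as a separate term in every formula.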

### ✅ Step 3: Split the Data


Use your function to split `Xb` (the features with the intercept column) and `y`:

```python
X_train, X_test, y_train, y_test = train_test_split_np(Xb, y, test_ratio=0.2, seed=42)
```

In [None]:
X_train, X_test, y_train, y_test = train_test_split_np(Xb, y, test_ratio=0.2, seed=42, shuffle=True)
# Split into training and test sets: test_ratio=0.2 (the default), seed=42 for reproducibility, shuffle=True
print(" X_train Shape:", X_train.shape)
print(" X_test Shape:", X_test.shape)
print(" y_train Shape:", y_train.shape)
print(" y_test Shape:", y_test.shape)

 X_train Shape: (80, 2)
 X_test Shape: (20, 2)
 y_train Shape: (80, 1)
 y_test Shape: (20, 1)


### ✅ Step 4: Compute θ Using the Normal Equation

![image.png](https://miro.medium.com/v2/resize:fit:1120/1*7ZiWm6xAF4oWiYfWklUMEw.jpeg)

In [23]:
# Compute Theta_best with the Normal Equation, one step at a time
X_transpose = X_train.T
print("X_train Transpose Shape:", X_transpose.shape)
dot = X_transpose @ X_train  # X^T X, shape (2, 2)
part1 = np.linalg.inv(dot)  # inverse of X^T X
part2 = X_transpose @ y_train  # X^T y, shape (2, 1)
Theta_best = part1 @ part2  # theta = inv(X^T X) @ X^T y
print("Theta_best Shape:", Theta_best.shape)
print("theta_best:\n", Theta_best)

X_train Transpose Shape: (2, 80)
Theta_best Shape: (2, 1)
theta_best:
 [[4.14291332]
 [2.59864731]]


In [None]:
# Compact one-line version using X_train and y_train
Theta_best = np.linalg.inv(X_train.T @ X_train) @ (X_train.T) @ (y_train)
print("Theta_best Shape:", Theta_best.shape)
print("theta_best:\n", Theta_best)

Theta_best Shape: (2, 1)
theta_best:
 [[4.14291332]
 [2.59864731]]


In [21]:
# Zero Grad's version: fit on the full Xb and y (no train/test split)
theta_best = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y

print("theta_best shape:", theta_best.shape)
print("theta_best:\n", theta_best)

theta_best shape: (2, 1)
theta_best:
 [[4.21509616]
 [2.54022677]]
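Computing an explicit inverse works for this small problem, but it is generally less numerically stable than solving the linear system directly. The sketch below regenerates the same dataset as Step 1 and compares the explicit inverse with two standard NumPy alternatives, `np.linalg.solve` and `np.linalg.lstsq`. The variable names `theta_inv`, `theta_solve` and `theta_lstsq` are chosen here for illustration.

```python
import numpy as np

# Same dataset as in Step 1
np.random.seed(42)
X = np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
Xb = np.c_[np.ones((X.shape[0], 1)), X]

# 1) Explicit inverse, as in the cells above
theta_inv = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y

# 2) Solve (X^T X) theta = X^T y without forming the inverse
theta_solve = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# 3) Least-squares solver working on Xb directly
theta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(Xb, y, rcond=None)

print(np.allclose(theta_inv, theta_solve))
print(np.allclose(theta_inv, theta_lstsq))
```

All three approaches should agree here to within floating-point tolerance. The difference matters mostly when `XᵀX` is close to singular, for example when features are nearly collinear.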


### ✅ Step 5: Predict on the Test Set

In [24]:
y_pred = X_test @ Theta_best  # predict on the test set with the theta fitted on the training set
print("y_pred shape:", y_pred.shape)
print("y_pred:\n", y_pred)



y_pred shape: (20, 1)
y_pred:
 [[4.30807906]
 [6.46825401]
 [6.14970512]
 [5.86457507]
 [4.81539122]
 [5.28671442]
 [4.90209423]
 [6.38581472]
 [4.19640516]
 [5.11621099]
 [5.26538608]
 [5.72170812]
 [6.26200997]
 [6.60873217]
 [4.45369659]
 [4.54835074]
 [6.14717293]
 [4.33532925]
 [6.30613815]
 [4.58604538]]


### ✅ Step 6: Evaluate the Model

Use **Mean Squared Error (MSE)** to evaluate your model's performance:

<img src="https://www.i2tutorials.com/wp-content/media/2019/11/Differences-between-MSE-and-RMSE-1-i2tutorials.jpg" width="400">


In [25]:
MSE = np.mean((y_test - y_pred) ** 2)
print("Mean Squared Error (MSE):", MSE)

Mean Squared Error (MSE): 0.6536995137170009
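MSE is in squared units of the target, so it is often reported together with RMSE (its square root) and R². The sketch below wraps these in a hypothetical helper, `regression_metrics`, and runs it on made-up numbers. It flattens both inputs first, because subtracting an `(n,)` array from an `(n, 1)` array broadcasts to an `(n, n)` matrix and silently gives a wrong MSE.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (mse, rmse, r2). Hypothetical helper for illustration."""
    # Flatten so that (n, 1) and (n,) inputs cannot broadcast to (n, n)
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()

    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)  # same units as the target

    ss_res = np.sum(residuals ** 2)                 # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return mse, rmse, r2


# Made-up example: errors are 1, 0 and -2
mse, rmse, r2 = regression_metrics([[3.0], [5.0], [7.0]], [2.0, 5.0, 9.0])
print("MSE:", mse)
print("RMSE:", rmse)
print("R^2:", r2)
```

For this toy input the squared errors are 1, 0 and 4, so the MSE is 5/3, and R² is 1 - 5/8 = 0.375.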


In [34]:
z = np.array([[1, 2], [3, 4]])
np.linalg.inv(z)

array([[-2. ,  1. ],
       [ 1.5, -0.5]])

In [35]:
z**-1

ValueError: Integers to negative integer powers are not allowed.
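The error above comes from the fact that `**` on a NumPy array is an element-wise power, not a matrix inverse, and NumPy refuses negative powers on integer arrays. The sketch below converts `z` to float to show what `** -1` actually computes and compares it with `np.linalg.inv`. The names `matrix_inverse` and `elementwise` are chosen here for illustration.

```python
import numpy as np

z = np.array([[1, 2], [3, 4]])

matrix_inverse = np.linalg.inv(z)    # true matrix inverse
elementwise = z.astype(float) ** -1  # reciprocal of each entry: 1 / z[i, j]

print(matrix_inverse)
print(elementwise)

# Only the matrix inverse satisfies z @ inverse = identity
print(np.allclose(z @ matrix_inverse, np.eye(2)))  # True
print(np.allclose(z @ elementwise, np.eye(2)))     # False
```

So even with a float array, `z ** -1` would run without error but would not give the inverse needed by the Normal Equation.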