
"""
# 🧩 Train-Test Split Function from Scratch

This Jupyter Notebook demonstrates how to build a custom implementation
of the `train_test_split` function using **NumPy** and **Pandas**,
without relying on scikit-learn.

### 🔍 Project Overview:
The goal of this notebook is to:
1. Understand how dataset splitting works conceptually.
2. Implement a reproducible train/test split manually.
3. Learn how randomization and indexing work using NumPy.

### 📦 Libraries Used:
- **NumPy**: For random permutation and array operations.
- **Pandas**: For handling the dataset and returning DataFrames.

### ⚙️ Function Description:
`my_own_train_test_split(df, train_ratio=0.8, test_ratio=0.2, seed=None)`

**Parameters:**
- `df`: Input DataFrame to be split.
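- `train_ratio`: Fraction of data intended for training. The training set receives every row not assigned to the test set, so this should equal `1 - test_ratio`.
- `test_ratio`: Fraction of data to allocate for testing. The test size is rounded down to a whole number of rows.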
- `seed`: Random seed for reproducibility.

**Returns:**
- `train_df`: DataFrame containing training samples.
- `test_df`: DataFrame containing testing samples.

### 🧠 Learning Outcome:
By the end of this notebook, you’ll understand:
- How data is randomly shuffled and split.
- Why using a random seed ensures consistent results.
- The basic shuffle-and-slice logic behind scikit-learn’s `train_test_split`.

---
"""
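
As a small illustration of the seed behaviour described above, the sketch below reseeds NumPy's global generator with the same value and draws a permutation twice. The variable names `first` and `second` are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np

# Seed the global generator, then draw a shuffled ordering of 0..9
np.random.seed(123)
first = np.random.permutation(10)

# Reseeding with the same value restarts the random sequence,
# so the second draw reproduces the first
np.random.seed(123)
second = np.random.permutation(10)

print(np.array_equal(first, second))
```

Without the second `np.random.seed(123)` call, the generator would continue from where it left off and `second` would most likely be a different ordering.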


In [7]:
import numpy as np
import pandas as pd

# Define a custom train-test split function
def my_own_train_test_split(df, train_ratio=0.8, test_ratio=0.2, seed=None):
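    """
    Custom implementation of train-test split using NumPy and Pandas.

    Shuffles the row positions of `df` and returns (train_df, test_df).
    `train_ratio` and `test_ratio` must sum to 1; the test size is rounded
    down and the training set receives the remaining rows.
    """
    # The two ratios describe one partition, so reject inconsistent values
    # instead of silently ignoring train_ratio
    if not np.isclose(train_ratio + test_ratio, 1.0):
        raise ValueError("train_ratio and test_ratio must sum to 1")

    # Set the random seed (for reproducibility)
    if seed is not None:
        np.random.seed(seed)

    # Calculate total number of rows
    total_size = len(df)

    # Determine the size of test and train sets
    test_size = int(total_size * test_ratio)
    train_size = total_size - test_size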

    # Generate random permutation of indices (to shuffle data)
    indices = np.random.permutation(total_size)

    # Split indices for training and testing
    train_idx = indices[:train_size]
    test_idx = indices[train_size:]

    # Select the actual rows from the DataFrame
    train_df = df.iloc[train_idx]
    test_df = df.iloc[test_idx]

    # Return the resulting datasets
    return train_df, test_df
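

The size calculation truncates with `int()`, so the test set is rounded down and the training set absorbs the remainder. The sketch below isolates that arithmetic in a hypothetical helper, `split_sizes`, which is not part of the notebook and exists only to make the rounding visible.

```python
def split_sizes(total_size, test_ratio=0.2):
    # Same arithmetic as in my_own_train_test_split: truncate the test size,
    # then give every remaining row to the training set
    test_size = int(total_size * test_ratio)
    train_size = total_size - test_size
    return train_size, test_size

for n in (10, 103, 4):
    train_size, test_size = split_sizes(n)
    print(f"rows={n} train={train_size} test={test_size}")
# rows=10 train=8 test=2
# rows=103 train=83 test=20
# rows=4 train=4 test=0
```

The last case shows a consequence of truncation: with very few rows the test set can end up empty.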


In [21]:
df = pd.read_csv("SMSSpamCollection.tsv", sep="\t", header=None, names=["label", "sms"])

In [22]:
df

index,label,sms
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!
...,...,...
5563,spam,This is the 2nd time we have tried 2 contact u...
5564,ham,Will ü b going to esplanade fr home?
5565,ham,"Pity, * was in mood for that. So...any other s..."
5566,ham,The guy did some bitching but I acted like i'd...


In [23]:
train_df, test_df = my_own_train_test_split(df, train_ratio=0.8, test_ratio=0.2, seed=123)
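

A split of this kind should leave the two frames disjoint and, together, the same size as the original. The sketch below checks both properties on a small made-up DataFrame, repeating the shuffle-and-slice steps inline so that it runs on its own. The helper `check_split` and the toy rows are assumptions for illustration, not part of the notebook or the SMS dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the SMS data (made-up rows, for illustration only)
toy_df = pd.DataFrame({
    "label": ["ham", "spam"] * 5,
    "sms": [f"message {i}" for i in range(10)],
})

# Same shuffle-and-slice steps as the notebook's function, with test_ratio=0.2
np.random.seed(123)
indices = np.random.permutation(len(toy_df))
test_size = int(len(toy_df) * 0.2)
train_size = len(toy_df) - test_size
toy_train = toy_df.iloc[indices[:train_size]]
toy_test = toy_df.iloc[indices[train_size:]]

def check_split(train_part, test_part, full_df):
    """Return True if the two parts share no index labels and their
    row counts add up to the full frame (assumes a unique index)."""
    no_overlap = train_part.index.intersection(test_part.index).empty
    covers_all = len(train_part) + len(test_part) == len(full_df)
    return no_overlap and covers_all

print(check_split(toy_train, toy_test, toy_df))
```

The same helper could be called on the notebook's own `train_df`, `test_df`, and `df` once the split cell has run.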

In [28]:
train_df.head(5)

index,label,sms
3000,ham,I will see in half an hour
457,ham,Where did u go? My phone is gonna die you have...
2982,ham,No break time one... How... I come out n get m...
1762,ham,Hi this is yijue... It's regarding the 3230 te...
2148,ham,I surely dont forgot to come:)i will always be...


In [29]:
test_df.head(5)

