In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load raw data
df = pd.read_csv("raw_data.csv")

# Split into train (70%), validation (15%), and test (15%)
train, temp = train_test_split(df, test_size=0.3, random_state=46)
validation, test = train_test_split(temp, test_size=0.5, random_state=46)

# Save the splits
train.to_csv("train.csv", index=False)
validation.to_csv("validation.csv", index=False)
test.to_csv("test.csv", index=False)

print("Data split and saved successfully!")

Data split and saved successfully!


In [4]:
df = pd.read_csv("raw_data.csv")
print(df.columns)

Index(['Label', 'Message'], dtype='object')


In [5]:
for split in ["train.csv", "validation.csv", "test.csv"]:
    df = pd.read_csv(split)
    print(f"{split}:")
    print(df["Label"].value_counts(), "\n")  # The column is named "Label" (capital L)

train.csv:
Label
ham     3362
spam     538
Name: count, dtype: int64 

validation.csv:
Label
ham     736
spam    100
Name: count, dtype: int64 

test.csv:
Label
ham     727
spam    109
Name: count, dtype: int64 



In [6]:
from sklearn.model_selection import train_test_split

# Load raw data
df = pd.read_csv("raw_data.csv")

# Re-split with a different random seed
train, temp = train_test_split(df, test_size=0.3, random_state=12000)  # New random seed
validation, test = train_test_split(temp, test_size=0.5, random_state=12000)  # New random seed

# Save the updated splits
train.to_csv("train.csv", index=False)
validation.to_csv("validation.csv", index=False)
test.to_csv("test.csv", index=False)

print("Data re-split and saved successfully!")

Data re-split and saved successfully!


In [7]:
for split in ["train.csv", "validation.csv", "test.csv"]:
    df = pd.read_csv(split)
    print(f"{split}:")
    print(df["Label"].value_counts(), "\n")

train.csv:
Label
ham     3368
spam     532
Name: count, dtype: int64 

validation.csv:
Label
ham     731
spam    105
Name: count, dtype: int64 

test.csv:
Label
ham     726
spam    110
Name: count, dtype: int64 



In [8]:
for split in ["train.csv", "validation.csv", "test.csv"]:
    df = pd.read_csv(split)
    print(f"{split} (Previous):")
    print(df["Label"].value_counts(), "\n")

train.csv (Previous):
Label
ham     3362
spam     538
Name: count, dtype: int64 

validation.csv (Previous):
Label
ham     736
spam    100
Name: count, dtype: int64 

test.csv (Previous):
Label
ham     727
spam    109
Name: count, dtype: int64 



In [9]:
for split in ["train.csv", "validation.csv", "test.csv"]:
    df = pd.read_csv(split)
    print(f"{split} (Updated):")
    print(df["Label"].value_counts(), "\n")

train.csv (Updated):
Label
ham     3368
spam     532
Name: count, dtype: int64 

validation.csv (Updated):
Label
ham     731
spam    105
Name: count, dtype: int64 

test.csv (Updated):
Label
ham     726
spam    110
Name: count, dtype: int64 



## Conclusion: Data Preparation

In this notebook, we performed the following steps to prepare the data for machine learning:

### **1. Loading the Raw Data**
- The raw dataset (`raw_data.csv`) was loaded using `pandas`.
- The dataset contains two columns:
  - `Label`: The target variable (e.g., `ham` or `spam`).
  - `Message`: The text data (e.g., SMS messages).

### **2. Splitting the Data**
- The dataset was split into three subsets:
  - **Training Set (70%)**: Used to train the machine learning models.
  - **Validation Set (15%)**: Used to tune hyperparameters and evaluate model performance during training.
  - **Test Set (15%)**: Used to evaluate the final model performance.
- The splits were created using `train_test_split` from `scikit-learn` with a fixed random seed (`random_state=46`) for reproducibility.
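
The two-stage split described above can be sketched on a small synthetic DataFrame. The toy data and the helper name `three_way_split` are assumptions made for illustration; only `train_test_split`, the seed, and the 70/15/15 proportions come from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(df, seed):
    """Split df into 70% train, 15% validation, 15% test (hypothetical helper)."""
    # First hold out 30% of the rows, then cut that hold-out set in half.
    train_df, temp_df = train_test_split(df, test_size=0.3, random_state=seed)
    val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=seed)
    return train_df, val_df, test_df


# Toy stand-in for raw_data.csv: 100 rows with the same two columns.
toy = pd.DataFrame({
    "Label": ["spam" if i % 7 == 0 else "ham" for i in range(100)],
    "Message": [f"message {i}" for i in range(100)],
})

train_df, val_df, test_df = three_way_split(toy, seed=46)
print(len(train_df), len(val_df), len(test_df))
```

Because the second call only sees the held-out 30%, every row lands in exactly one of the three subsets.
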

### **3. Saving the Splits**
- The training, validation, and test sets were saved as separate CSV files:
  - `train.csv`
  - `validation.csv`
  - `test.csv`

### **4. Tracking Data with DVC**
- The data files (`train.csv`, `validation.csv`, `test.csv`) were tracked using **DVC (Data Version Control)**.
- DVC allows us to version control large data files without storing them in Git.
- The following commands were executed in the terminal to track the data:
  ```bash
  dvc add train.csv validation.csv test.csv
  git add train.csv.dvc validation.csv.dvc test.csv.dvc .gitignore
  git commit -m "Track train/validation/test splits with DVC"
  dvc push
  ```

### **5. Updating Data Splits with a New Random Seed**
- To check that the class distribution does not depend heavily on one particular seed, the data was re-split using a different random seed (`random_state=12000`).
- The updated splits were saved and tracked with DVC using the following commands:
  ```bash
  dvc add train.csv validation.csv test.csv
  git add train.csv.dvc validation.csv.dvc test.csv.dvc
  git commit -m "Update data split with new random seed (12000)"
  dvc push
  ```

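
Re-splitting with a new seed changes the contents of the three CSV files, which is why `dvc add` has to be run again. DVC identifies each tracked file by a hash of its contents, recorded in the corresponding `.dvc` file. The sketch below illustrates that idea with Python's `hashlib` rather than DVC itself, so the digest DVC computes for a real file may differ. The toy data and the helper `split_digest` are assumptions for illustration.

```python
import hashlib

import pandas as pd
from sklearn.model_selection import train_test_split


def split_digest(df, seed):
    """Return an MD5 digest of the training split's CSV text (hypothetical helper)."""
    train_df, _ = train_test_split(df, test_size=0.3, random_state=seed)
    csv_text = train_df.to_csv(index=False)
    return hashlib.md5(csv_text.encode("utf-8")).hexdigest()


# Toy stand-in for raw_data.csv.
toy = pd.DataFrame({
    "Label": ["spam" if i % 7 == 0 else "ham" for i in range(100)],
    "Message": [f"message {i}" for i in range(100)],
})

# The same seed reproduces the same file contents; a new seed does not.
same_seed = split_digest(toy, 46) == split_digest(toy, 46)
new_seed = split_digest(toy, 46) == split_digest(toy, 12000)
print("same seed, same hash:", same_seed)
print("new seed, same hash:", new_seed)
```
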
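### **6. Verifying Data Distributions**
- The distribution of the target variable (`Label`) was checked for each split using `value_counts()`. With the updated seed (`random_state=12000`):
  - **Training Set**: 3368 ham, 532 spam.
  - **Validation Set**: 731 ham, 105 spam.
  - **Test Set**: 726 ham, 110 spam.
- Spam makes up roughly 13% of each split (about 13.6%, 12.6%, and 13.2% respectively), so the splits have similar class proportions and are representative of the original dataset, even though the split was not stratified. The classes themselves are imbalanced: ham outnumbers spam by more than six to one.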
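
Raw counts are hard to compare across splits of different sizes, so it can help to look at class proportions instead. The following is a minimal sketch, assuming a hypothetical helper `label_proportions` and a toy DataFrame standing in for one split; `value_counts(normalize=True)` is the standard pandas call for fractions.

```python
import pandas as pd


def label_proportions(df, column="Label"):
    """Return each class's share of the rows, as fractions that sum to 1."""
    return df[column].value_counts(normalize=True)


# Toy stand-in for one split; in the notebook this would be pd.read_csv("train.csv").
toy_split = pd.DataFrame({"Label": ["ham"] * 6 + ["spam"] * 2})

proportions = label_proportions(toy_split)
print(proportions)
```

Running the helper on each of the three CSV files gives figures that can be compared directly across splits.
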

### **7. Checking Out Previous Versions**
- To compare the data distributions before and after updating the random seed, the following commands were executed:
  ```bash
  git checkout HEAD~1  # Check out the previous commit
  dvc checkout
  ```
- After verifying the previous splits, the updated version was restored using:
  ```bash
  git checkout master  # Check out the latest commit
  dvc checkout
  ```

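
The comparison between the two versions can also be done numerically. The sketch below copies the label counts printed earlier in this notebook into two DataFrames (the variable names are illustrative) and computes the change in counts and the spam share of each split.

```python
import pandas as pd

splits = ["train", "validation", "test"]

# Label counts copied from the notebook outputs (random_state=46 and 12000).
previous = pd.DataFrame({"ham": [3362, 736, 727], "spam": [538, 100, 109]}, index=splits)
updated = pd.DataFrame({"ham": [3368, 731, 726], "spam": [532, 105, 110]}, index=splits)

# Change in raw counts between the two versions of each split.
count_change = updated - previous

# Spam share of each split, before and after the re-split.
spam_share = pd.DataFrame({
    "previous": previous["spam"] / previous.sum(axis=1),
    "updated": updated["spam"] / updated.sum(axis=1),
})

print(count_change)
print(spam_share.round(3))
```

The split sizes are identical in both versions, so each row of `count_change` sums to zero: a message that moves from ham to spam in one split's counts is offset elsewhere.
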
### **Key Takeaways**
- The data preparation process leaves the dataset split, saved, and versioned, ready for model training.
- Using DVC allows us to version control the data and track changes over time, so an earlier split can be restored and compared.
- Splitting the data into training, validation, and test sets lets the model be trained, tuned, and evaluated on separate data.