Title: Data Splitting (Train-Test-Validation)


Task 1: House Prices Dataset (Regression)<br>
Use the House Prices dataset to predict house prices.<br>
Split the data into training, validation, and test sets (70% train, 15% validation, 15% test).

In [1]:
# House Prices Dataset - Data Splitting (Train, Validation, Test)

import pandas as pd
from sklearn.model_selection import train_test_split

# Example: Simulated house prices dataset (replace with actual dataset as needed)
data = {
    'SquareFootage': [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700],
    'Bedrooms': [3, 3, 2, 4, 2, 3, 4, 4, 2, 3],
    'Age': [20, 15, 18, 10, 25, 12, 5, 7, 30, 16],
    'Price': [245000, 312000, 279000, 308000, 199000, 219000, 405000, 324000, 220000, 295000]
}
df = pd.DataFrame(data)

# Features and target
X = df.drop('Price', axis=1)
y = df['Price']

# First split: train (70%) and temp (30%)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Second split: halve temp (30% of data) into validation and test (nominally 15% each;
# with only 10 rows the 3 temp rows split as 1 validation and 2 test)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

print("Train set size:", X_train.shape[0])
print("Validation set size:", X_val.shape[0])
print("Test set size:", X_test.shape[0])

Train set size: 7
Validation set size: 1
Test set size: 2
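
The sizes above come out as 7/1/2 rather than an exact 70/15/15 because 10 rows cannot be divided that finely. The sketch below reproduces the arithmetic, assuming (as the printed sizes suggest) that `train_test_split` rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the other side. `expected_split_sizes` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import math


def expected_split_sizes(n_samples, temp_fraction=0.3, test_fraction_of_temp=0.5):
    """Predict (train, validation, test) row counts for the two-step split.

    Assumption: a fractional test_size is rounded up (ceil) and the
    remaining rows go to the other side of the split.
    """
    # First split: hold out temp_fraction of the rows, rounded up.
    n_temp = math.ceil(temp_fraction * n_samples)
    n_train = n_samples - n_temp
    # Second split: the test side of the held-out rows is rounded up,
    # validation receives whatever is left.
    n_test = math.ceil(test_fraction_of_temp * n_temp)
    n_val = n_temp - n_test
    return n_train, n_val, n_test


print(expected_split_sizes(10))   # (7, 1, 2)
print(expected_split_sizes(150))  # (105, 22, 23)
```

With a dataset this small, a single validation row gives a very noisy estimate, so the 70/15/15 ratio is only meaningful once the real dataset replaces the simulated one.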


Task 2: Iris Dataset (Classification)<br>
Apply data splitting to the Iris dataset.<br>
Split it into train (70%), validation (15%), and test (15%).


In [2]:
# Iris Dataset - Data Splitting (Train, Validation, Test)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

# First split: train (70%) and temp (30%)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Second split: validation (15%) and test (15%) from temp (which is 30% of data)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

print("Train set size:", X_train.shape[0])
print("Validation set size:", X_val.shape[0])
print("Test set size:", X_test.shape[0])

Train set size: 105
Validation set size: 22
Test set size: 23
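
To confirm that `stratify` kept the three species balanced, one can count the labels in each split. The following is a minimal check that repeats the split from the cell above so it runs on its own; `class_counts` and the series name `species` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Reload the data and repeat the same stratified two-step split
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name="species")

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

# One row per class label (0, 1, 2), one column per split
class_counts = pd.DataFrame({
    "train": y_train.value_counts().sort_index(),
    "val": y_val.value_counts().sort_index(),
    "test": y_test.value_counts().sort_index(),
})
print(class_counts)
```

Iris has 50 samples per class, so the training column should show 35 for each label. The 15 remaining samples per class cannot be shared evenly between 22 validation and 23 test rows, so counts of 7 or 8 per class are expected there.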



Task 3: Customer Churn Dataset (Classification)<br>
Predict customer churn using the telecom dataset.<br>
Split the data into training, validation, and test sets (70% train, 15% validation, 15% test), stratifying the first split on the churn label.

In [4]:
# Customer Churn Dataset - Data Splitting (Train, Validation, Test)

import pandas as pd
from sklearn.model_selection import train_test_split

# Example: Simulated telecom churn dataset (replace with actual dataset as needed)
data = {
    'monthly_minutes': [300, 250, 400, 150, 500, 100, 350, 200, 450, 120],
    'customer_support_calls': [1, 3, 0, 5, 2, 6, 1, 4, 0, 7],
    'contract_length_months': [12, 24, 12, 6, 24, 6, 12, 6, 24, 6],
    'churn': [0, 0, 0, 1, 0, 1, 0, 1, 0, 1]
}
df = pd.DataFrame(data)

# Features and target
X = df.drop('churn', axis=1)
y = df['churn']

# First split: train (70%) and temp (30%) with stratify
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Second split: halve temp into validation and test (nominally 15% each).
# No stratify here: the 3-row temp set is too small to keep both classes in each half.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42
)

print("Train set size:", X_train.shape[0])
print("Validation set size:", X_val.shape[0])
print("Test set size:", X_test.shape[0])

Train set size: 7
Validation set size: 1
Test set size: 2
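
All three tasks repeat the same two-step pattern, so it can be wrapped in one function. The sketch below is one possible way to do that; `train_val_test_split` is a hypothetical helper (scikit-learn does not ship a three-way splitter under this name), and the synthetic `X_demo`/`y_demo` arrays are made up purely to exercise it.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def train_val_test_split(X, y, val_size=0.15, test_size=0.15,
                         stratify=False, random_state=42):
    """Split X, y into train/validation/test with two train_test_split calls."""
    temp_size = val_size + test_size
    # First split: training rows vs. everything held out
    X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=temp_size, random_state=random_state,
        stratify=y if stratify else None,
    )
    # Second split: share of the held-out rows that becomes the test set
    X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=test_size / temp_size,
        random_state=random_state,
        stratify=y_temp if stratify else None,
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic, balanced binary data: 100 rows, 2 features (illustration only)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

X_tr, X_va, X_te, y_tr, y_va, y_te = train_val_test_split(
    X_demo, y_demo, stratify=True
)
print(len(X_tr), len(X_va), len(X_te))  # 70 15 15
```

Passing `stratify=True` stratifies both steps, which needs enough rows per class in the held-out portion. On a 10-row dataset like the simulated churn data, the second stratified split would likely fail, which is why the cell above leaves it unstratified.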
