# Creating a Test Set
When building machine learning models, one of the most important steps is
splitting your dataset into training and test sets. This ensures your model is
evaluated on data it has never seen before, which is critical for assessing its ability
to generalize.

## 1. The Problem of Data Snooping Bias

<b>Data snooping bias</b> occurs when information from the test set leaks into the
training process. This can lead to overly optimistic performance metrics and
models that don’t perform well in real-world scenarios.

To avoid this, the test set must be isolated before any data exploration, feature
selection, or model training begins.
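
As a minimal sketch of what "isolating the test set first" can look like in practice, the example below splits a small made-up DataFrame before computing any statistic, then derives a fill value from the training rows only. The `toy` data, the column names, and the choice of median imputation are all assumptions made for illustration; they are not part of the housing workflow in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
toy = pd.DataFrame({
    "rooms": [3.0, 4.0, np.nan, 5.0, 6.0, np.nan, 2.0, 8.0],
    "price": [100, 150, 120, 200, 230, 180, 90, 300],
})

# 1. Set the test rows aside first, before looking at any statistics.
rng = np.random.default_rng(0)
shuffled = rng.permutation(len(toy))
test_part = toy.iloc[shuffled[:2]].copy()
train_part = toy.iloc[shuffled[2:]].copy()

# 2. Learn the preprocessing value from the training rows only.
rooms_median = train_part["rooms"].median()

# 3. Apply the training-derived value to both parts.
train_part["rooms"] = train_part["rooms"].fillna(rooms_median)
test_part["rooms"] = test_part["rooms"].fillna(rooms_median)
```

Computing `rooms_median` on the full `toy` frame instead would let test rows influence the training data, which is a small instance of the leakage described above.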

## 2. Random Sampling: A Basic Approach

A simple way to split the data is to shuffle the rows randomly and then set aside a fixed fraction of them as the test set.

In [1]:
import pandas as pd
df = pd.read_csv("housing.csv")
df

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41 | 880 | 129.0 | 322 | 126 | 8.3252 | 452600 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21 | 7099 | 1106.0 | 2401 | 1138 | 8.3014 | 358500 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52 | 1467 | 190.0 | 496 | 177 | 7.2574 | 352100 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52 | 1274 | 235.0 | 558 | 219 | 5.6431 | 341300 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52 | 1627 | 280.0 | 565 | 259 | 3.8462 | 342200 | NEAR BAY |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | -121.09 | 39.48 | 25 | 1665 | 374.0 | 845 | 330 | 1.5603 | 78100 | INLAND |
| 20636 | -121.21 | 39.49 | 18 | 697 | 150.0 | 356 | 114 | 2.5568 | 77100 | INLAND |
| 20637 | -121.22 | 39.43 | 17 | 2254 | 485.0 | 1007 | 433 | 1.7000 | 92300 | INLAND |
| 20638 | -121.32 | 39.43 | 18 | 1860 | 409.0 | 741 | 349 | 1.8672 | 84700 | INLAND |


In [2]:
import numpy as np  # NumPy provides the random permutation used below

def shuffle_and_split_data(data, test_ratio):
    """Shuffle the rows of `data` and split them into (train, test) DataFrames."""
    np.random.seed(42)  # fix the global seed so the split is the same on every run

    shuffled_indices = np.random.permutation(len(data))  # row positions in random order
    test_set_size = int(len(data) * test_ratio)          # number of rows in the test set
    test_indices = shuffled_indices[:test_set_size]      # first block goes to the test set
    train_indices = shuffled_indices[test_set_size:]     # the rest goes to the training set
    return data.iloc[train_indices], data.iloc[test_indices]

> Setting the random seed (e.g., with `np.random.seed(42)`) makes the split identical
> across runs, which is crucial for debugging and comparing models fairly.
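
One side effect worth knowing: `np.random.seed(42)` resets NumPy's global random state, so it also affects any other code that draws random numbers afterwards. A possible alternative, sketched below, keeps the seed local by using `np.random.default_rng`. The function name `split_with_local_rng`, its `seed` parameter, and the `demo` frame are hypothetical. This variant will generally select different rows than `shuffle_and_split_data`, so it would not reproduce the outputs shown in this notebook.

```python
import numpy as np
import pandas as pd

def split_with_local_rng(data, test_ratio, seed=42):
    """Hypothetical variant of shuffle_and_split_data that uses a local RNG."""
    rng = np.random.default_rng(seed)              # local generator; global state is untouched
    shuffled_indices = rng.permutation(len(data))  # row positions in random order
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

# Small made-up frame, used only to show that the split is repeatable.
demo = pd.DataFrame({"x": range(10)})
train_a, test_a = split_with_local_rng(demo, 0.2)
train_b, test_b = split_with_local_rng(demo, 0.2)
print(test_a.index.equals(test_b.index))  # same seed, same test rows
```

Because the generator is created inside the function from a fixed seed, two calls with the same arguments return the same rows without touching random numbers used elsewhere in the session.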

In [3]:
train, test = shuffle_and_split_data(df, 0.2)

In [4]:
test.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 20046 | -119.01 | 36.06 | 25 | 1505 | NaN | 1392 | 359 | 1.6812 | 47700 | INLAND |
| 3024 | -119.46 | 35.14 | 30 | 2943 | NaN | 1565 | 584 | 2.5313 | 45800 | INLAND |
| 15663 | -122.44 | 37.8 | 52 | 3830 | NaN | 1310 | 963 | 3.4801 | 500001 | NEAR BAY |
| 20484 | -118.72 | 34.28 | 17 | 3051 | NaN | 1705 | 495 | 5.7376 | 218600 | <1H OCEAN |
| 9814 | -121.93 | 36.62 | 34 | 2351 | NaN | 1063 | 428 | 3.725 | 278000 | NEAR OCEAN |


In [5]:
train.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 14196 | -117.03 | 32.71 | 33 | 3126 | 627.0 | 2300 | 623 | 3.2596 | 103000 | NEAR OCEAN |
| 8267 | -118.16 | 33.77 | 49 | 3382 | 787.0 | 1314 | 756 | 3.8125 | 382100 | NEAR OCEAN |
| 17445 | -120.48 | 34.66 | 4 | 1897 | 331.0 | 915 | 336 | 4.1563 | 172600 | NEAR OCEAN |
| 14265 | -117.11 | 32.69 | 36 | 1421 | 367.0 | 1418 | 355 | 1.9425 | 93400 | NEAR OCEAN |
| 2271 | -119.8 | 36.78 | 43 | 2382 | 431.0 | 874 | 380 | 3.5542 | 96500 | INLAND |


In [6]:
test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4128 entries, 20046 to 3665
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           4128 non-null   float64
 1   latitude            4128 non-null   float64
 2   housing_median_age  4128 non-null   int64  
 3   total_rooms         4128 non-null   int64  
 4   total_bedrooms      3921 non-null   float64
 5   population          4128 non-null   int64  
 6   households          4128 non-null   int64  
 7   median_income       4128 non-null   float64
 8   median_house_value  4128 non-null   int64  
 9   ocean_proximity     4128 non-null   object 
dtypes: float64(4), int64(5), object(1)
memory usage: 354.8+ KB


In [7]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16512 entries, 14196 to 15795
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           16512 non-null  float64
 1   latitude            16512 non-null  float64
 2   housing_median_age  16512 non-null  int64  
 3   total_rooms         16512 non-null  int64  
 4   total_bedrooms      16512 non-null  float64
 5   population          16512 non-null  int64  
 6   households          16512 non-null  int64  
 7   median_income       16512 non-null  float64
 8   median_house_value  16512 non-null  int64  
 9   ocean_proximity     16512 non-null  object 
dtypes: float64(4), int64(5), object(1)
memory usage: 1.4+ MB
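
The two `info()` summaries can also be checked programmatically. The sketch below verifies that the split sizes match `test_ratio`, that no row appears in both parts, and reports missing values per column in each part. The helper name `check_split` and the small `full` frame are hypothetical and used only for illustration.

```python
import pandas as pd

def check_split(full, train_part, test_part, test_ratio):
    """Hypothetical helper: basic sanity checks on a train/test split."""
    expected_test = int(len(full) * test_ratio)
    assert len(test_part) == expected_test                 # test size matches the ratio
    assert len(train_part) == len(full) - expected_test    # the rest is training data
    assert train_part.index.intersection(test_part.index).empty  # no shared rows
    # Count missing values per column in each part.
    return pd.DataFrame({
        "train_missing": train_part.isna().sum(),
        "test_missing": test_part.isna().sum(),
    })

# Made-up data: one missing value, which lands in the test part.
full = pd.DataFrame({"a": [1.0, None, 3.0, 4.0, 5.0], "b": list("vwxyz")})
report = check_split(full, full.iloc[[0, 2, 3, 4]], full.iloc[[1]], 0.2)
print(report)
```

Calling `check_split(df, train, test, 0.2)` on the housing split would be one way to confirm the row counts reported above (4128 test rows and 16512 training rows) and to compare the non-null counts of `total_bedrooms` in the two parts.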
