📊Loading Data

Download the housing dataset as a compressed tarball and extract it, unless a local copy already exists.

The extracted CSV file is then loaded into a Pandas DataFrame for further analysis.

In [9]:
from pathlib import Path
import pandas as pd
import tarfile
import urllib.request

def load_housing_data():
    """
    This function checks if the housing dataset exists locally.
    If not, it downloads, extracts, and loads it into a Pandas DataFrame.
    """
    tarball_path = Path('datasets/housing.tgz')

    if not tarball_path.is_file():
        Path('datasets').mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"

        urllib.request.urlretrieve(url, tarball_path)

        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path='datasets')

    return pd.read_csv(Path('datasets/housing/housing.csv'))

housing = load_housing_data()


In [10]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


→ Data Shuffling and Splitting

Using np.random.permutation()

In [11]:
import numpy as np
np.random.seed(42)  # 🌱 Fix the seed of the random generator so that each run produces the same shuffle

def shuffle_and_split_data(data, test_ratio):
    """
    Shuffle the dataset randomly and split it into training and test sets.

    Arguments:
    - data: Pandas DataFrame, the dataset to split
    - test_ratio: float, the proportion of data to allocate to the test set

    Returns:
    - train_set: DataFrame containing the training data
    - test_set: DataFrame containing the test data
    """
    shuffled_indices = np.random.permutation(len(data))  # Random ordering of the row positions

    print('Total number of shuffled indices', len(shuffled_indices))  # Should equal the number of rows
    test_set_size = int(len(data) * test_ratio)  # Number of rows in the test set
    test_indices = shuffled_indices[:test_set_size]  # First block of shuffled positions
    train_indices = shuffled_indices[test_set_size:]  # Remaining positions
    return data.iloc[train_indices], data.iloc[test_indices]

# Split the housing data: 20% for testing and 80% for training
train_set, test_set = shuffle_and_split_data(housing, 0.2)
print("First 5 rows of training set:")
print(train_set.head())
print('------------------------------------------------------------------------')
print("Training set size: ", len(train_set))
print("Test set size: ", len(test_set))

Total number of shuffled indices 20640
First 5 rows of training set:
       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
14196    -117.22     32.75                34.0       6001.0          1111.0   
8267     -117.03     32.69                10.0        901.0           163.0   
17445    -122.27     37.74                28.0       6909.0          1554.0   
14265    -121.82     37.25                25.0       4021.0           634.0   
2271     -115.98     33.32                 8.0        240.0            46.0   

       population  households  median_income  median_house_value  \
14196      2654.0      1072.0         4.5878            291000.0   
8267        698.0       167.0         4.6648            156100.0   
17445      2974.0      1484.0         3.6875            353900.0   
14265      2178.0       650.0         5.1663            241200.0   
2271         63.0        24.0         1.4688             53800.0   

      ocean_proximity  
14196      NEAR OCEAN 
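
To see the shuffle-and-slice idea on a small scale, here is a minimal sketch using 10 row positions instead of the full dataset. It swaps in NumPy's `default_rng` generator rather than the global `np.random.seed` used above, and the names `rng`, `n_rows`, `test_idx` and `train_idx` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)  # local generator; the seed value is arbitrary
n_rows = 10                      # hypothetical number of rows
test_ratio = 0.2

shuffled = rng.permutation(n_rows)        # every position 0..9 appears exactly once
test_size = int(n_rows * test_ratio)      # 2 positions go to the test set
test_idx = shuffled[:test_size]           # first block of the shuffled positions
train_idx = shuffled[test_size:]          # everything else

print("test :", test_idx)
print("train:", train_idx)
```

Because the two slices come from one permutation, no row can end up in both sets.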

→ Data Splitting

Using train_test_split

In [16]:
from sklearn.model_selection import train_test_split

# Divide the median_income column into 5 categories (income_cat) using the bin edges in bins
# labels=[1, 2, 3, 4, 5] assigns an integer label to each category
# pd.cut() maps each median_income value to the bin it falls in
housing['income_cat'] = pd.cut(housing['median_income'], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])
print(housing['income_cat'])  # Print the income category of each row

# Split the data using train_test_split()
# test_size=0.2 : 20% test set, 80% training set
# stratify=housing['income_cat'] : keeps the proportion of each income_cat nearly the same in the training and test sets
# random_state=42 : makes the split reproducible across runs
strat_train_set, strat_test_set = train_test_split(housing, test_size=0.2, stratify=housing['income_cat'], random_state=42)

0        5
1        5
2        5
3        4
4        3
        ..
20635    2
20636    2
20637    2
20638    2
20639    2
Name: income_cat, Length: 20640, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]


In [15]:
from sklearn.model_selection import train_test_split

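print(len(housing["median_income"].unique()))  # Number of distinct median_income values
# Wrong: stratifying directly on the continuous median_income column fails,
# because some of its values occur only once
#strat_train_set, strat_test_set = train_test_split(housing, test_size=0.2, stratify=housing["median_income"], random_state=42)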
"""
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

"""

12928


ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

NOTES:
Stratified sampling is a method for splitting data while preserving the proportions of the different classes in each subset (e.g., train, validation, test). This is especially useful when the classes are imbalanced.

Why is it helpful?

- It ensures that each subset has approximately the same class distribution as the original dataset.
- The test set is then representative of the full data, so evaluation results are less distorted by sampling bias, especially with imbalanced data.
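
The effect on class proportions can be checked directly. The sketch below uses a hypothetical toy DataFrame (`df`, with a 90/10 class imbalance) rather than the housing data, and compares the class shares in the full data, the training set and the test set after a stratified split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 rows, 90% class "a" and 10% class "b"
df = pd.DataFrame({
    "x": np.arange(1000),
    "label": ["a"] * 900 + ["b"] * 100,
})

train, test = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)

# Share of each class in the full data and in each subset
overall = df["label"].value_counts(normalize=True)
in_train = train["label"].value_counts(normalize=True)
in_test = test["label"].value_counts(normalize=True)
print(pd.DataFrame({"overall": overall, "train": in_train, "test": in_test}))
```

All three columns should show nearly the same shares, which is what stratification is meant to guarantee.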
------------------------------------------------------------------------------
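⚠️ERROR NOTES:

Handling Continuous Data for Stratified Sampling

When the stratification column is continuous, **stratified sampling** cannot be applied directly, because each distinct value is treated as its own class and many values occur only once.

Each class needs at least two rows, so that at least one row can go to the training set and one to the test set.

If a class has only one row, which subset should it go to? It cannot be represented in both, so scikit-learn raises the ValueError shown above.

**Solution**

Define ranges (binning):
convert the continuous values into classes, for example with pd.cut(), and stratify on those classes.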
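
The failure and the binning fix can be reproduced on a small scale. This is a minimal sketch on a hypothetical continuous column (`income`, drawn uniformly at random), not on the housing data; the bin edges are borrowed from the `pd.cut()` call used earlier.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.uniform(0.5, 15.0, size=500)})  # hypothetical continuous column

# Almost every continuous value is unique, so each "class" has a single member
try:
    train_test_split(df, test_size=0.2, stratify=df["income"], random_state=42)
    failed = False
except ValueError as err:
    failed = True
    print("Stratifying on raw values failed:", err)

# Bin the values into ranges, then stratify on the resulting categories
df["income_cat"] = pd.cut(
    df["income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)
train, test = train_test_split(
    df, test_size=0.2, stratify=df["income_cat"], random_state=42
)
print(test["income_cat"].value_counts().sort_index())
```

The choice of bin edges is a judgment call: each bin needs enough rows to be split across both subsets.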


𓂃🖊 Summary

Stratified sampling is useful when dealing with imbalanced datasets, as it keeps the proportion of each class approximately the same in every subset. However, if the stratification variable is continuous, it must first be converted into classes by binning the values into predefined ranges. Only then can stratified sampling be applied.

_______________________________________________________________________________________________

→ K-Fold Cross-Validation

The data is divided into 5 parts (k=5).
In each round, the model is trained on 4 parts and evaluated on the remaining part, so every part serves as the test set exactly once.

The RMSE is calculated for each round, and then the mean and standard deviation of the 5 scores are calculated.
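
To make the rotation of folds concrete, the sketch below prints the train and test positions that `KFold` produces for 10 hypothetical samples. The names `X_toy`, `kf_toy` and `folds` are illustrative and chosen so they do not clash with the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # 10 hypothetical samples, one feature
kf_toy = KFold(n_splits=5, shuffle=True, random_state=42)

folds = []
for fold, (train_idx, test_idx) in enumerate(kf_toy.split(X_toy), start=1):
    folds.append((train_idx, test_idx))
    # 8 positions are used for training and 2 for testing in each round
    print(f"fold {fold}: train={train_idx} test={test_idx}")
```

Across the 5 rounds, each sample appears in a test fold exactly once.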

In [17]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

X = housing[['median_income', 'housing_median_age', 'population', 'total_rooms']]  # Features used to predict the house value
y = housing['median_house_value']  # Target

model = LinearRegression() # Create a linear regression model

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)
"""
k = 5: Split the data into 5 folds
shuffle=True: Rows are shuffled before splitting, so the folds do not depend on the original row order
random_state=42: Makes the shuffle, and therefore the folds, reproducible across runs
"""

cv_scores = cross_val_score(model, X, y, cv=kf, scoring='neg_root_mean_squared_error')
"""
cross_val_score() does the following:
For each fold, trains the model on the other 4 parts and tests it on that fold
Calculates the score for each fold
The scores are negated RMSE values (neg_root_mean_squared_error) because scikit-learn scorers follow the convention that higher is better
"""

rmse_scores = -cv_scores  # Flip the sign to recover the RMSE values

# Print Result
print(f'RMSE scores for each fold : {rmse_scores}')
print(f'Mean RMSE : {rmse_scores.mean()}')
print(f'Standard Deviation of RMSE : {rmse_scores.std()}')

RMSE scores for each fold : [81223.17364865 79504.08481614 81550.91401948 79609.37454749
 79349.31733515]
Mean RMSE : 80247.37287338135
Standard Deviation of RMSE : 939.939298274745


------------------------------------------------------------------------------------
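RMSE (Root Mean Squared Error):
A metric for evaluating regression models. It is the square root of the average squared difference between the predicted and the actual values, expressed in the same units as the target (here, median_house_value).

The smaller the RMSE, the closer the predictions are to the actual values on average ✨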
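
As a worked example, RMSE can be computed by hand with NumPy. The three values in `y_true` and `y_pred` below are made up for illustration and are not taken from the housing data.

```python
import numpy as np

y_true = np.array([200_000.0, 150_000.0, 300_000.0])  # hypothetical actual values
y_pred = np.array([210_000.0, 140_000.0, 330_000.0])  # hypothetical predictions

errors = y_pred - y_true               # [10000, -10000, 30000]
rmse = np.sqrt(np.mean(errors ** 2))   # sqrt((1e8 + 1e8 + 9e8) / 3)
print(f"RMSE: {rmse:.2f}")             # about 19148.54
```

Squaring before averaging means the single 30,000 error dominates the result, which is why RMSE is sensitive to large mistakes.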



