## 1.1 Reading a Dataset

* So far we have worked with synthetic data that arrived in ready-made tensors.
* Real-world data is often messy and disorganized. Fortunately, the `pandas` library handles much of the dirty work.
* Here we use CSV (comma-separated values) files, which are everywhere for storing tabular data.
* To show how to load CSV files with `pandas`, we use a CSV file of California housing prices.

In [4]:
from sklearn.datasets import fetch_california_housing
import pandas as pd

# 1. Fetch the data
housing = fetch_california_housing(as_frame=True)

# 2. Combine features and target into one DataFrame
df = housing.frame

# 3. Save to CSV
df.to_csv('california_housing.csv', index=False)

print("File saved as california_housing.csv")


File saved as california_housing.csv


In [5]:
data = pd.read_csv('california_housing.csv')
print(data.head())

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  MedHouseVal  
0    -122.23        4.526  
1    -122.22        3.585  
2    -122.24        3.521  
3    -122.25        3.413  
4    -122.25        3.422  
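
The housing file happens to be clean, but real CSV files often contain empty fields. As a minimal sketch, using a small made-up table rather than the housing data, `pd.read_csv` with default settings parses both empty fields and the string `NA` as `NaN`:

```python
import io

import pandas as pd

# Hypothetical toy CSV (not the housing data): one empty field and one "NA".
csv_text = """rooms,roof,price
,slate,127500
2,NA,106000
4,slate,178100
"""

# io.StringIO lets read_csv treat the string as if it were a file on disk.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)

# Boolean mask marking every cell that was parsed as missing.
missing_mask = toy.isnull()
print(missing_mask)
```

The column names and values here are invented for illustration only.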


## 1.2 Data Preparation

* In supervised learning, we train models to predict a designated `target` value given a set of input values.
* Our first step in processing the dataset is to separate the columns into inputs and target.
* We can select columns either by name or via integer-location based indexing (`iloc`).
* Let's first get some general information about our dataset.

In [9]:
## getting general information about the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


In [10]:
##getting statistical summary
df.describe()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | 3.870671 | 28.639486 | 5.429 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 | 2.068558 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.38605 | 2.135952 | 2.003532 | 1.153956 |
| min | 0.4999 | 1.0 | 0.846154 | 0.333333 | 3.0 | 0.692308 | 32.54 | -124.35 | 0.14999 |
| 25% | 2.5634 | 18.0 | 4.440716 | 1.006079 | 787.0 | 2.429741 | 33.93 | -121.8 | 1.196 |
| 50% | 3.5348 | 29.0 | 5.229129 | 1.04878 | 1166.0 | 2.818116 | 34.26 | -118.49 | 1.797 |
| 75% | 4.74325 | 37.0 | 6.052381 | 1.099526 | 1725.0 | 3.282261 | 37.71 | -118.01 | 2.64725 |
| max | 15.0001 | 52.0 | 141.909091 | 34.066667 | 35682.0 | 1243.333333 | 41.95 | -114.31 | 5.00001 |


In [13]:
## checking if any column has a missing value
df.isnull().values.any()

np.False_
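
The check above collapses the whole table into a single boolean. When some values are missing, a per-column count is usually more informative. A minimal sketch on a small made-up DataFrame (the values are invented and only the column names mirror the housing data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few deliberately missing entries.
toy = pd.DataFrame({
    "MedInc": [8.3, np.nan, 7.2, np.nan],
    "HouseAge": [41.0, 21.0, np.nan, 52.0],
    "AveRooms": [6.9, 6.2, 8.2, 5.8],
})

# isnull() gives a boolean mask; summing it counts the NaN entries per column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# The single-boolean check used above, for comparison.
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```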

* From the information above we can see that our dataset is complete and has no missing values. But what if some columns did have missing values?
* What would we need to do to fix this? Missing values are often called the `bed bugs` of data science.
* Depending on the context, missing values can be handled via either `imputation` or `deletion`.
* Imputation replaces missing values with estimates of their values, while deletion simply discards the rows or columns that contain missing values.
* To try this out, let's first overwrite the first 10 entries of the `MedInc` and `HouseAge` columns with `NaN`, the value that marks a missing entry.

In [17]:
import numpy as np

# Set rows 0 through 9 of the 'MedInc' and 'HouseAge' columns to NaN
df.loc[0:9, 'MedInc'] = np.nan
df.loc[0:9,'HouseAge'] = np.nan
df

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | NaN | NaN | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | NaN | NaN | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | NaN | NaN | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | NaN | NaN | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 | 0.781 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |


* Since `MedInc` and `HouseAge` are numerical columns, a common approach is to replace the `NaN` entries with the mean value of the corresponding column.

In [18]:
df['MedInc'] = df['MedInc'].fillna(df['MedInc'].mean())
df['HouseAge'] = df['HouseAge'].fillna(df['HouseAge'].mean())
df

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.870125 | 28.630683 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 3.870125 | 28.630683 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 3.870125 | 28.630683 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 3.870125 | 28.630683 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.870125 | 28.630683 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.560300 | 25.000000 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 | 0.781 |
| 20636 | 2.556800 | 18.000000 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
| 20637 | 1.700000 | 17.000000 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
| 20638 | 1.867200 | 18.000000 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |
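
The cell above uses mean imputation. Deletion, the other option mentioned earlier, can be sketched with `DataFrame.dropna`. The example below uses a small made-up DataFrame rather than the housing data, so the values are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the first row is missing two values.
toy = pd.DataFrame({
    "MedInc": [np.nan, 8.3, 7.2, 5.6],
    "HouseAge": [np.nan, 21.0, 52.0, 52.0],
    "AveRooms": [6.9, 6.2, 8.2, 5.8],
})

# Row-wise deletion: drop every row that contains at least one NaN.
rows_kept = toy.dropna(axis=0)

# Column-wise deletion: drop every column that contains at least one NaN.
cols_kept = toy.dropna(axis=1)

print(rows_kept.shape)          # rows remaining after row-wise deletion
print(list(cols_kept.columns))  # columns remaining after column-wise deletion
```

Row-wise deletion loses one example here, while column-wise deletion throws away two whole features. Which trade-off is acceptable depends on how much data is missing and where.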


* Since `MedHouseVal` is our target variable, we separate it from the rest of the dataset to obtain two variables: `X` (the input variables) and `y` (the target variable).
* To obtain `X`, we drop `MedHouseVal` along axis 1 (the column axis).

In [22]:
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']
X = pd.DataFrame(X)
y = pd.DataFrame(y)
print(X)
print(y)

         MedInc   HouseAge  AveRooms  AveBedrms  Population  AveOccup  \
0      3.870125  28.630683  6.984127   1.023810       322.0  2.555556   
1      3.870125  28.630683  6.238137   0.971880      2401.0  2.109842   
2      3.870125  28.630683  8.288136   1.073446       496.0  2.802260   
3      3.870125  28.630683  5.817352   1.073059       558.0  2.547945   
4      3.870125  28.630683  6.281853   1.081081       565.0  2.181467   
...         ...        ...       ...        ...         ...       ...   
20635  1.560300  25.000000  5.045455   1.133333       845.0  2.560606   
20636  2.556800  18.000000  6.114035   1.315789       356.0  3.122807   
20637  1.700000  17.000000  5.205543   1.120092      1007.0  2.325635   
20638  1.867200  18.000000  5.329513   1.171920       741.0  2.123209   
20639  2.388600  16.000000  5.254717   1.162264      1387.0  2.616981   

       Latitude  Longitude  
0         37.88    -122.23  
1         37.86    -122.22  
2         37.85    -122.24  
3      
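
Section 1.2 notes that columns can also be selected by integer position with `iloc`. A minimal sketch on a made-up DataFrame, assuming the target is the last column (as `MedHouseVal` is in the housing data):

```python
import pandas as pd

# Hypothetical toy frame; values are invented, the last column is the target.
toy = pd.DataFrame({
    "MedInc": [8.3, 7.2, 5.6],
    "HouseAge": [41.0, 52.0, 52.0],
    "MedHouseVal": [4.5, 3.5, 3.4],
})

# Selection by name.
X_by_name = toy.drop("MedHouseVal", axis=1)
y_by_name = toy["MedHouseVal"]

# Selection by position: all rows, every column except the last / only the last.
X_by_pos = toy.iloc[:, :-1]
y_by_pos = toy.iloc[:, -1]

# Both routes yield the same inputs and target.
same_inputs = X_by_name.equals(X_by_pos)
same_target = y_by_name.equals(y_by_pos)
print(same_inputs, same_target)
```

Selecting by name is more robust if the column order ever changes, while `iloc` is convenient when the columns have no meaningful names.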

## 1.3 Converting to tensor format

In [29]:
## converting X and y into tensor format
import torch
import numpy as np
import pandas as pd


X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

# converting X and y into DataFrames
X = pd.DataFrame(X)
y = pd.DataFrame(y)

# Convert pandas DataFrames to NumPy arrays with float32 dtype
X_np = X.astype(np.float32).to_numpy()
y_np = y.astype(np.float32).to_numpy()

# Convert NumPy arrays to PyTorch tensors
X = torch.tensor(X_np)
y = torch.tensor(y_np)

X, y

(tensor([[   3.8701,   28.6307,    6.9841,  ...,    2.5556,   37.8800,
          -122.2300],
         [   3.8701,   28.6307,    6.2381,  ...,    2.1098,   37.8600,
          -122.2200],
         [   3.8701,   28.6307,    8.2881,  ...,    2.8023,   37.8500,
          -122.2400],
         ...,
         [   1.7000,   17.0000,    5.2055,  ...,    2.3256,   39.4300,
          -121.2200],
         [   1.8672,   18.0000,    5.3295,  ...,    2.1232,   39.4300,
          -121.3200],
         [   2.3886,   16.0000,    5.2547,  ...,    2.6170,   39.3700,
          -121.2400]]),
 tensor([[4.5260],
         [3.5850],
         [3.5210],
         ...,
         [0.9230],
         [0.8470],
         [0.8940]]))
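
After a conversion like this, it is worth checking the shapes and dtypes of the resulting tensors. A minimal sketch on a small made-up DataFrame (not the housing data), assuming the same layout of feature columns plus a `MedHouseVal` target:

```python
import numpy as np
import pandas as pd
import torch

# Hypothetical toy frame standing in for the housing DataFrame.
toy = pd.DataFrame({
    "MedInc": [8.3, 7.2, 5.6],
    "HouseAge": [41.0, 52.0, 52.0],
    "MedHouseVal": [4.5, 3.5, 3.4],
})

# to_numpy(dtype=...) casts and converts in a single step.
X_np = toy.drop("MedHouseVal", axis=1).to_numpy(dtype=np.float32)
# Double brackets keep the target two-dimensional, with shape (n, 1).
y_np = toy[["MedHouseVal"]].to_numpy(dtype=np.float32)

X_t = torch.tensor(X_np)
y_t = torch.tensor(y_np)

print(X_t.shape, X_t.dtype)
print(y_t.shape, y_t.dtype)
```

Both tensors should report `torch.float32`, with one row per example: two feature columns in `X_t` and a single target column in `y_t`.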