<a href="https://colab.research.google.com/github/DavoodSZ1993/Dive-into-Deep-Learning-Notes-/blob/main/5_mlp_kaggle.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## `os` Built-in Module

This module provides a portable way of using **operating system** dependent functionality.

* `os.mkdir()`: Creates a directory at the given path with the specified numeric mode.
* `os.path.join()`: Joins one or more path components intelligently.
* `os.path.exists()`: Checks whether the given path exists.
* `os.path.dirname()`: Returns the directory name of the given path.
* `os.path.splitext()`: Splits the pathname into a pair `(root, ext)`.
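
The path helpers listed above can be tried together in a short sketch. The directory and file names (`data`, `train.csv`) are hypothetical, and everything happens inside a temporary directory so nothing is left on disk.

```python
import os
import tempfile

# Work inside a throwaway directory that is deleted when the block exits.
with tempfile.TemporaryDirectory() as base:
    data_dir = os.path.join(base, "data")        # build the path portably
    if not os.path.exists(data_dir):             # avoid FileExistsError
        os.mkdir(data_dir)

    file_path = os.path.join(data_dir, "train.csv")   # hypothetical file name
    parent = os.path.dirname(file_path)          # directory part of the path
    root, ext = os.path.splitext(file_path)      # split off the extension

    created = os.path.exists(data_dir)
    same_parent = parent == data_dir

print(created, same_parent, ext)
```

Note that `os.path.dirname()` and `os.path.splitext()` only manipulate strings; the file itself does not need to exist.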

## `tarfile` Module

This module reads and writes tar archives, including those compressed with gzip, bz2, and lzma.

* `tarfile.open()`: Returns a `TarFile` object for the given pathname.
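
As a minimal sketch of `tarfile.open()`, the snippet below writes a gzip-compressed archive and reads it back. The file names are hypothetical and the work is done in a temporary directory.

```python
import os
import tarfile
import tempfile

with tempfile.TemporaryDirectory() as base:
    src = os.path.join(base, "notes.txt")        # hypothetical file to archive
    with open(src, "w") as f:
        f.write("hello")

    archive = os.path.join(base, "notes.tar.gz")
    # Mode "w:gz" writes a gzip-compressed archive.
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname="notes.txt")

    # Mode "r" lets tarfile detect the compression on its own.
    with tarfile.open(archive, "r") as tar:
        member_names = tar.getnames()
        extracted_text = tar.extractfile("notes.txt").read().decode()

print(member_names, extracted_text)
```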

## `zipfile` Module

This module provides tools to create, read, write, append, and list ZIP files.

* `zipfile.ZipFile()`: Open a ZIP file.
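
A comparable sketch for `zipfile.ZipFile()`, assuming a hypothetical `example.csv` member with made-up contents: it writes an archive, lists it, reads one member, and extracts everything.

```python
import os
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as base:
    archive = os.path.join(base, "data.zip")

    # Mode "w" creates a new archive; writestr() adds a member from a string.
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("example.csv", "a,b\n1,2\n")

    # Mode "r" opens an existing archive for reading.
    with zipfile.ZipFile(archive, "r") as zf:
        zip_names = zf.namelist()
        first_line = zf.read("example.csv").decode().splitlines()[0]
        zf.extractall(os.path.join(base, "unzipped"))

    extracted_ok = os.path.exists(os.path.join(base, "unzipped", "example.csv"))

print(zip_names, first_line, extracted_ok)
```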

## With Open in Python

* `open()`: Opens a file in Python. On its own, `open()` does not close the file, so the file must be closed explicitly with the `close()` method.
* The `with` statement wraps the `open()` call in a context manager.
* Unlike a bare `open()` call, which needs a matching `close()`, the `with` statement closes the file automatically when its block exits, even if an exception is raised.
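
The difference between the two styles can be sketched as follows, using a hypothetical `example.txt` in a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    path = os.path.join(base, "example.txt")     # hypothetical file name

    # Manual style: close() must be called explicitly; try/finally
    # makes sure it runs even if the write fails.
    f = open(path, "w")
    try:
        f.write("first line\n")
    finally:
        f.close()

    # Context-manager style: the file is closed when the block exits.
    with open(path) as g:
        content = g.read()

    closed_manually = f.closed
    closed_by_with = g.closed

print(repr(content), closed_manually, closed_by_with)
```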

## Preprocessing the Data
### `opendatasets`
`opendatasets` is a Python library for downloading datasets from online sources like Kaggle and Google Drive using a simple Python command.

In [1]:
!pip install opendatasets --upgrade --quiet

In [8]:
import opendatasets as od

dataset_url = 'https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data'
od.download(dataset_url)

Skipping, found downloaded files in "./house-prices-advanced-regression-techniques" (use force=True to force download)


In [13]:
import pandas as pd

df_train = pd.read_csv('./house-prices-advanced-regression-techniques/train.csv')
df_val = pd.read_csv('./house-prices-advanced-regression-techniques/test.csv')

df_train.shape, df_val.shape

((1460, 81), (1459, 80))

* `pd.concat()`: Concatenates pandas objects along a particular axis.
* `df.dtypes`: Attribute (not a method) that returns the dtype of each column in the DataFrame.
* `df.apply()`: Applies a function along an axis of the DataFrame.
* `df.fillna()`: Fills NA/NaN values with the specified value.
* `pd.get_dummies()`: Converts categorical variables into dummy/indicator variables.
* `df.index`: The index (row labels) of the DataFrame.
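
Before applying these calls to the real data, it may help to see them on a toy example. The frames and column names below are hypothetical stand-ins for the train and test tables, and `select_dtypes(include="number")` is used here as one possible way to pick numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for the train and test tables.
a = pd.DataFrame({"area": [50.0, 70.0], "street": ["Pave", "Grvl"]})
b = pd.DataFrame({"area": [np.nan, 90.0], "street": ["Pave", np.nan]})

toy = pd.concat((a, b))          # stack rows; the index labels repeat
print(toy.dtypes)                # dtypes is an attribute, not a method

numeric_cols = toy.select_dtypes(include="number").columns

# Standardize each numeric column, then fill missing values with 0,
# which is the column mean after standardization.
toy[numeric_cols] = toy[numeric_cols].apply(lambda x: (x - x.mean()) / x.std())
toy[numeric_cols] = toy[numeric_cols].fillna(0)

toy = pd.get_dummies(toy)        # one-hot encode the string column
print(list(toy.index))
print(list(toy.columns))
```

Because `pd.concat()` keeps the original row labels, the index of the combined frame is not unique; positional slicing is therefore the safer way to split it again later.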

In [17]:
# Removing index and label columns

label = 'SalePrice'

raw_train = df_train.drop(columns=['Id', label])
raw_val = df_val.drop(columns = ['Id'])

raw_train.shape, raw_val.shape

((1460, 79), (1459, 79))

In [21]:
# Concatenating the two datasets so they are preprocessed together.

features = pd.concat((raw_train, raw_val))  # raw_train (1460 x 79)
features.shape                              # --------
                                            # raw_val (1459 x 79)

(2919, 79)

In [33]:
# Processing numerical columns (mean = 0, standard deviation = 1)

numeric_features = features.dtypes[features.dtypes != 'object'].index
numeric_features

Index(['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces',
       'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold', 'YrSold'],
      dtype='object')

In [38]:
features[numeric_features] = features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std()))    # function along each column

features.head()

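| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.06732 | RL | -0.184443 | -0.217841 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | -0.285886 | -0.063139 | NaN | NaN | NaN | -0.089577 | -1.551918 | 0.157619 | WD | Normal |
| 1 | -0.873466 | RL | 0.458096 | -0.072032 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | -0.285886 | -0.063139 | NaN | NaN | NaN | -0.089577 | -0.446848 | -0.602858 | WD | Normal |
| 2 | 0.06732 | RL | -0.055935 | 0.137173 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | -0.285886 | -0.063139 | NaN | NaN | NaN | -0.089577 | 1.026577 | 0.157619 | WD | Normal |
| 3 | 0.302516 | RL | -0.398622 | -0.078371 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | -0.285886 | -0.063139 | NaN | NaN | NaN | -0.089577 | -1.551918 | -1.363335 | WD | Abnorml |
| 4 | 0.06732 | RL | 0.629439 | 0.518814 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | ... | -0.285886 | -0.063139 | NaN | NaN | NaN | -0.089577 | 2.131647 | 0.157619 | WD | Normal |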


In [39]:
# Replacing NaN values with zero in numerical columns (zero is the mean after standardization).

features[numeric_features] = features[numeric_features].fillna(0)

* `pd.get_dummies(dummy_na=True)`: Adds an indicator column for NaN values. With `dummy_na=False` (the default), NaNs are ignored and the affected row gets zeros in all indicator columns.
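
The effect of `dummy_na` is easiest to see on a single column with one missing value. The `alley` column below is a hypothetical example, not taken from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"alley": ["Grvl", np.nan, "Pave"]})  # hypothetical column

without_na = pd.get_dummies(toy, dummy_na=False)
with_na = pd.get_dummies(toy, dummy_na=True)

# dummy_na=True adds one extra indicator column that marks the NaN row.
print(list(without_na.columns))
print(list(with_na.columns))
```

With `dummy_na=False`, the row holding the missing value is encoded as all zeros, so "missing" cannot be told apart from "none of the listed categories".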

In [41]:
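# Convert categorical values to discrete indicator columns

features = pd.get_dummies(features, dummy_na=False)
print(features.shape)

# Once the categorical columns are encoded, no object columns remain,
# so this second call has nothing left to convert and the shape is unchanged.
features = pd.get_dummies(features, dummy_na=True)
print(features.shape)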

(2919, 331)
(2919, 331)


In [43]:
# Separate training and evaluation data from features.

train_data = features[: raw_train.shape[0]].copy()
train_data[label] = df_train[label].copy()

val_data = features[raw_train.shape[0]:].copy()

train_data.shape, val_data.shape

((1460, 332), (1459, 331))

## Converting the Data to PyTorch Datasets.

* `slice()`: Built-in function that returns a slice object, which can be used anywhere `start:stop:step` indexing is accepted.
* `df.values`: Returns a NumPy representation of the DataFrame.
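
A small sketch of how the two fit together, using a hypothetical four-row frame: `df.values` yields an array, and a `slice` object selects rows from it just like `start:stop` would.

```python
import pandas as pd

# Hypothetical toy frame with four rows and two columns.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [10.0, 20.0, 30.0, 40.0]})
arr = toy.values                 # NumPy array of shape (4, 2)

everything = slice(0, None)      # equivalent to arr[0:]
first_two = slice(0, 2)          # equivalent to arr[0:2]

print(arr[everything].shape)
print(arr[first_two].tolist())
```

Storing the slice in a variable is what allows it to be passed around as a function argument, as `indices` is below.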

In [45]:
import torch

In [65]:
get_tensor = lambda x: torch.tensor(x.values, dtype=torch.float32)

tensors = get_tensor(train_data.drop(columns=[label])) # converts the NumPy representation of the DataFrame to a tensor.

type(train_data.values), type(tensors)

(numpy.ndarray, torch.Tensor)

In [77]:
indices = slice(0, None)

tensors = tuple(a[indices] for a in tensors)  # iterating over the 2-D tensor yields its rows, so each sample (331 features) becomes one element of the tuple.
indices, type(tensors), len(tensors)

(slice(0, None, None), tuple, 1460)

In [47]:
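def get_tensorloader(tensors, indices=slice(0, None)):
  # `tensors` is expected to be a tuple of tensors (e.g. features and labels)
  # that share the same first dimension; `indices` selects the rows to keep.
  tensors = tuple(a[indices] for a in tensors)
  dataset = torch.utils.data.TensorDataset(*tensors)
  # Shuffled mini-batches of 32 samples.
  return torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)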
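
One way to call this helper is sketched below. It assumes the helper is meant to receive a tuple of a feature tensor and a label tensor; `X` and `y` are random, hypothetical stand-ins rather than the house-price data, and the helper is repeated so the snippet runs on its own.

```python
import torch

def get_tensorloader(tensors, indices=slice(0, None)):
    # Same helper as above, repeated so this sketch is self-contained.
    tensors = tuple(a[indices] for a in tensors)
    dataset = torch.utils.data.TensorDataset(*tensors)
    return torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

# Hypothetical stand-ins: 100 samples with 5 features and one target each.
X = torch.randn(100, 5)
y = torch.randn(100, 1)

# Keep only the first 80 samples, e.g. as a training split.
loader = get_tensorloader((X, y), indices=slice(0, 80))

xb, yb = next(iter(loader))      # one shuffled mini-batch
print(xb.shape, yb.shape, len(loader))
```

Passing a `slice` as `indices` makes it possible to build training and validation loaders from the same tensors without copying the selection logic.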