## Data Preprocessing

### Reading the dataset

Comma-separated values (CSV) files are ubiquitous for storing
tabular (spreadsheet-like) data. In them, each line corresponds to one
record and consists of several (comma-separated) fields, e.g., "Albert
Einstein,March 14 1879,Ulm,Federal polytechnic school,field of
gravitational physics". To demonstrate how to load CSV files with
``pandas``, we create the CSV file ``../data/house_tiny.csv`` below. This
file represents a dataset of homes, where each row corresponds to a
distinct home and the columns correspond to the number of rooms
(``NumRooms``), the roof type (``RoofType``), and the price (``Price``).

In [3]:
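import os

# Create ../data if it does not exist, then write a tiny CSV file.
# The string "NA" marks a missing entry.
os.makedirs(os.path.join("..", "data"), exist_ok=True)
data_file = os.path.join("..", "data", "house_tiny.csv")
with open(data_file, "w") as f: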
    f.write('''NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000''')


In [4]:
import pandas as pd

# read_csv parses the "NA" entries as missing values (NaN).
data = pd.read_csv(data_file)
data

   NumRooms RoofType   Price
0       NaN      NaN  127500
1       2.0      NaN  106000
2       4.0    Slate  178100
3       NaN      NaN  140000


### Data Preparation


In supervised learning, we train models to predict a designated *target*
value, given some set of *input* values. Our first step in processing
the dataset is to separate out columns corresponding to input versus
target values. We can select columns either by name or via
integer-location based indexing (``iloc``).
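
As a minimal sketch of the two selection styles, the snippet below picks the input and target columns once by name and once by position. It reads the same four rows from an in-memory string (an assumption made here so the example runs without the file on disk); the variable names are illustrative only.

```python
import io

import pandas as pd

# Same rows as house_tiny.csv, held in memory so no file is required.
csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""
data = pd.read_csv(io.StringIO(csv_text))

# Selection by column name.
inputs_by_name = data[["NumRooms", "RoofType"]]
targets_by_name = data["Price"]

# Selection by integer position: the first two columns, then the third.
inputs_by_pos = data.iloc[:, 0:2]
targets_by_pos = data.iloc[:, 2]

# Both styles pick out the same data (equals treats aligned NaNs as equal).
print(inputs_by_name.equals(inputs_by_pos))
print(targets_by_name.equals(targets_by_pos))
```

Selecting by name tends to be more robust when the column order of the file may change, whereas ``iloc`` is convenient when only positions are known.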

You might have noticed that ``pandas`` replaced every CSV entry whose
value is ``NA`` with a special ``NaN`` (*not a number*) value. This can
also happen whenever an entry is empty, e.g., "3,,,270000". These are
called *missing values* and they are the "bed bugs" of data science, a
persistent menace that you will confront throughout your career.
Depending upon the context, missing values might be handled either via
*imputation* or *deletion*. Imputation replaces missing values with
estimates of their values while deletion simply discards either those
rows or those columns that contain missing values.
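
The two strategies can be sketched with standard ``pandas`` calls. The example below is only an illustration under stated assumptions: it rebuilds the same four rows from an in-memory string, and it uses mean imputation for ``NumRooms`` as one possible heuristic among many.

```python
import io

import pandas as pd

# Same rows as house_tiny.csv, held in memory so no file is required.
csv_text = """NumRooms,RoofType,Price
NA,NA,127500
2,NA,106000
4,Slate,178100
NA,NA,140000"""
data = pd.read_csv(io.StringIO(csv_text))

# Count the missing values in each column.
missing_per_column = data.isna().sum()

# Deletion, variant 1: drop every row with at least one missing value.
rows_dropped = data.dropna(axis=0)

# Deletion, variant 2: drop every column with at least one missing value.
cols_dropped = data.dropna(axis=1)

# Imputation (assumed heuristic): fill missing NumRooms with the column mean.
imputed = data.copy()
imputed["NumRooms"] = imputed["NumRooms"].fillna(imputed["NumRooms"].mean())

print(missing_per_column)
print(rows_dropped)
print(cols_dropped)
print(imputed)
```

On a dataset this small, deletion discards most of the information: only one row or one column survives. That is one reason imputation is often preferred when missing values are frequent.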

Here are some common imputation heuristics. For categorical input
fields, we can treat ``NaN`` as a category. Since the ``RoofType``
column takes values ``Slate`` and ``NaN``, ``pandas`` can convert this
column into two columns ``RoofType_Slate`` and ``RoofType_nan``. A row
whose roof type is ``Slate`` will set values of ``RoofType_Slate`` and
``RoofType_nan`` to 1 and 0, respectively (displayed as ``True`` and
``False`` when the new columns have boolean dtype). The converse holds
for a row with a missing ``RoofType`` value.


In [None]:
# Inputs are the first two columns (NumRooms, RoofType); the target is the third (Price).
inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]
inputs

   NumRooms RoofType
0       NaN      NaN
1       2.0      NaN
2       4.0    Slate
3       NaN      NaN


In [7]:
targets

0    127500
1    106000
2    178100
3    140000
Name: Price, dtype: int64

In [8]:
# One-hot encode the categorical column; dummy_na=True adds a column for NaN.
inputs = pd.get_dummies(inputs, dummy_na=True)
inputs

   NumRooms  RoofType_Slate  RoofType_nan
0       NaN           False          True
1       2.0           False          True
2       4.0            True         False
3       NaN           False          True


### get_dummies()
Convert categorical variable into dummy/indicator variables.
```python
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
```
Each variable is converted into as many 0/1 variables as there are different values. Columns in the output are each named after a value; if the input is a DataFrame, the name of the original variable is prepended to the value.

#### Parameters

- **data** : array-like, Series, or DataFrame  
  Data of which to get dummy indicators.

- **prefix** : str, list of str, or dict of str, default `None`  
  String to prepend to the names of the new dummy columns (by default, the original column name).  
  - Pass a list with length equal to the number of columns when calling `get_dummies` on a DataFrame.  
  - Alternatively, `prefix` can be a dictionary mapping column names to prefixes.

- **prefix_sep** : str, default `'_'`  
  Separator/delimiter placed between the prefix and the value.  
  - Can also pass a list or dictionary as with `prefix`.

- **dummy_na** : bool, default `False`  
  - If `True`, add a column to indicate NaNs.  
  - If `False`, NaNs are ignored.

- **columns** : list-like, default `None`  
  Column names in the DataFrame to be encoded.  
  - If `None`, then all the columns with object, string, or category dtype will be converted.

- **sparse** : bool, default `False`  
  Whether the dummy-encoded columns should be backed by a `SparseArray` (`True`) or a regular NumPy array (`False`).

- **drop_first** : bool, default `False`  
  Whether to get `k-1` dummies out of `k` categorical levels by removing the first level.

- **dtype** : dtype, default `bool`  
  Data type for new columns. Only a single dtype is allowed.

---

#### Returns

- **DataFrame**  
  Dummy-coded data.  
  - If `data` contains other columns than the dummy-coded one(s), these will be prepended, unaltered, to the result.
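
The effect of several of these parameters can be seen in a small sketch. The DataFrame below is a hypothetical toy example (the `Tile` category and the `Roof` prefix are invented for illustration and do not appear in the housing data above).

```python
import pandas as pd

# Hypothetical toy frame: one categorical column with a missing entry,
# plus one numeric column that get_dummies leaves untouched.
df = pd.DataFrame({"RoofType": ["Slate", None, "Tile"], "NumRooms": [4, 2, 3]})

# Defaults: one indicator column per observed category; NaN is ignored.
default = pd.get_dummies(df)

# dummy_na=True: an extra indicator column marks the missing entries.
with_na = pd.get_dummies(df, dummy_na=True)

# Custom prefix and separator, with integer 0/1 values instead of booleans.
custom = pd.get_dummies(
    df, columns=["RoofType"], prefix="Roof", prefix_sep="-", dtype=int
)

# drop_first=True: k-1 indicators for k categories (the first level is dropped).
dropped = pd.get_dummies(df, drop_first=True)

print(list(default.columns))
print(list(with_na.columns))
print(custom)
print(list(dropped.columns))
```

In each result the untouched `NumRooms` column comes first, followed by the dummy-coded columns, as described under Returns.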
