<a href="https://colab.research.google.com/github/GeoLabUniLaSalle/Python/blob/main/6_7_Part_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Science**

Machine Learning

# **Data Preprocessing**
Lyes LAKHAL
lyes.lakhal@unilasalle.fr




We will briefly walk through steps for preprocessing raw data with `pandas`
and converting them into the tensor format.
We will cover more data preprocessing techniques in later chapters.

# **2.1. Reading the Dataset**

As an example, we begin by creating an artificial dataset that is stored in a
CSV (comma-separated values) file `../data/house_tiny.csv`. Data stored in other
formats may be processed in similar ways.



In [None]:
import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n')  # Column names
    f.write('NA,Pave,127500\n')  # Each row represents a data example
    f.write('2,NA,106000\n')
    f.write('4,titi,178100\n')
    f.write('NA,lulu,140000\n')

To load the raw dataset from the CSV file we just created,
we import the `pandas` package and call the `read_csv` function.



In [None]:
# If pandas is not installed, just uncomment the following line:
# !pip install pandas
import pandas as pd

data = pd.read_csv(data_file)
print(data)

   NumRooms Alley   Price
0       NaN  Pave  127500
1       2.0   NaN  106000
2       4.0  titi  178100
3       NaN  lulu  140000
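
Before handling the gaps, it can help to count how many entries `read_csv` parsed as missing. The sketch below rebuilds the same four rows in memory with `io.StringIO` (an assumption made only so that the snippet runs on its own) and counts missing entries per column with `isna().sum()`. The names `demo_data` and `missing_per_column` are illustrative.

```python
import io

import pandas as pd

# Same rows as house_tiny.csv, held in memory so the snippet is self-contained.
csv_text = (
    "NumRooms,Alley,Price\n"
    "NA,Pave,127500\n"
    "2,NA,106000\n"
    "4,titi,178100\n"
    "NA,lulu,140000\n"
)
demo_data = pd.read_csv(io.StringIO(csv_text))

# By default, read_csv parses the string "NA" as a missing value (NaN).
missing_per_column = demo_data.isna().sum()
print(missing_per_column)
```

Here `NumRooms` has two missing entries, `Alley` has one, and `Price` has none.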


# **2.2. Handling Missing Data**

Note that the "NaN" entries are missing values.
To handle missing data, typical methods include *imputation* and *deletion*:
imputation replaces missing values with substituted ones,
while deletion drops the rows or columns that contain missing values. Here we will consider imputation.
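
For comparison, deletion could be sketched with `dropna`. The snippet below rebuilds the example table inline (an assumption so that it runs on its own; `demo`, `rows_kept`, and `cols_kept` are illustrative names) and shows how much data deletion discards on such a small dataset.

```python
import pandas as pd

# Inline copy of the example table; None marks a missing entry.
demo = pd.DataFrame({
    "NumRooms": [None, 2.0, 4.0, None],
    "Alley": ["Pave", None, "titi", "lulu"],
    "Price": [127500, 106000, 178100, 140000],
})

rows_kept = demo.dropna()        # drop every row that contains a missing value
cols_kept = demo.dropna(axis=1)  # drop every column that contains a missing value

print(rows_kept)
print(cols_kept)
```

Only one row survives row-wise deletion, and only the `Price` column survives column-wise deletion, which is one reason imputation is attractive here.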

By integer-location based indexing (`iloc`), we split `data` into `inputs` and `outputs`,
where the former takes the first two columns while the latter only keeps the last column.
For numerical values in `inputs` that are missing,
we replace the "NaN" entries with the mean value of the same column.


In [None]:
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
print(inputs)
print(outputs)

   NumRooms Alley
0       NaN  Pave
1       2.0   NaN
2       4.0  titi
3       NaN  lulu
0    127500
1    106000
2    178100
3    140000
Name: Price, dtype: int64


In [None]:
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]
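# Average only the numeric columns: recent pandas versions raise an error
# when asked for the mean of a text column such as "Alley".
inputs = inputs.fillna(inputs.mean(numeric_only=True))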
print(inputs)


   NumRooms Alley
0       3.0  Pave
1       2.0   NaN
2       4.0  titi
3       3.0  lulu


For categorical or discrete values in `inputs`, we consider "NaN" as a category.
Since the "Alley" column takes the categorical values "Pave", "titi", "lulu", and "NaN",
`pandas` can automatically convert this column to four indicator columns: "Alley_Pave", "Alley_lulu", "Alley_titi", and "Alley_nan".


In [None]:
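# dtype=int keeps the indicators as 0/1; recent pandas versions return booleans by default.
inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)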
print(inputs)

   NumRooms  Alley_Pave  Alley_lulu  Alley_titi  Alley_nan
0       3.0           1           0           0          0
1       2.0           0           0           0          1
2       4.0           0           0           1          0
3       3.0           0           1           0          0
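
To make the encoding explicit, the sketch below applies `get_dummies` to the "Alley" column alone and checks that every row ends up with exactly one indicator set. The table is rebuilt inline and `dtype=int` is passed so that the indicators print as 0/1; both are assumptions of this sketch, and `alley`, `encoded`, and `ones_per_row` are illustrative names.

```python
import pandas as pd

# Inline copy of the "Alley" column; None marks the missing entry.
alley = pd.DataFrame({"Alley": ["Pave", None, "titi", "lulu"]})

# dummy_na=True adds an extra indicator column for missing values.
encoded = pd.get_dummies(alley, dummy_na=True, dtype=int)
print(encoded)

# One-hot encoding: each row has exactly one indicator equal to 1.
ones_per_row = encoded.sum(axis=1)
print(ones_per_row.tolist())
```

Because the missing entry gets its own indicator, no row is left with all zeros.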


# **2.3. Conversion to the Tensor Format**

Now that all the entries in `inputs` and `outputs` are numerical, they can be converted to the tensor format.
Once the data are in this format, they can be further manipulated with the tensor operations introduced previously.


In [None]:
import torch

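# Convert to a float array explicitly so the result does not depend on the dtype of the indicator columns.
X, y = torch.tensor(inputs.to_numpy(dtype=float)), torch.tensor(outputs.to_numpy())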
X, y

(tensor([[3., 1., 0., 0., 0.],
         [2., 0., 0., 0., 1.],
         [4., 0., 0., 1., 0.],
         [3., 0., 1., 0., 0.]], dtype=torch.float64),
 tensor([127500, 106000, 178100, 140000]))
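
Before the conversion it can be worth checking the shape and dtype of the underlying NumPy array, since `torch.tensor` infers its dtype from the data it receives. The sketch below uses only `pandas` and NumPy on an inline copy of the encoded table (an assumption so that it runs on its own; `features`, `as_float64`, and `as_float32` are illustrative names). It also shows a cast to 32-bit floats, a common choice for model inputs.

```python
import pandas as pd

# Inline copy of the imputed, one-hot encoded inputs shown above.
features = pd.DataFrame({
    "NumRooms": [3.0, 2.0, 4.0, 3.0],
    "Alley_Pave": [1, 0, 0, 0],
    "Alley_lulu": [0, 0, 0, 1],
    "Alley_titi": [0, 0, 1, 0],
    "Alley_nan": [0, 1, 0, 0],
})

as_float64 = features.to_numpy(dtype=float)      # 64-bit floats, matching X above
as_float32 = features.to_numpy(dtype="float32")  # half the memory per entry

print(as_float64.shape, as_float64.dtype)
print(as_float32.shape, as_float32.dtype)
```

Either array can then be passed to `torch.tensor` in the same way as in the cell above.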

## Summary

* Like many other extension packages in the vast ecosystem of Python, `pandas` can work together with tensors.
* Imputation and deletion can be used to handle missing data.

