In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris
from utils import ruin_data

Scikit-learn offers some toy datasets. For this notebook, we will use the iris dataset, which can be loaded as follows:

In [None]:
data, targets = load_iris(return_X_y=True, as_frame=True)

### Brief excursus on obtaining datasets
Scikit-learn offers a convenient way to access the datasets hosted in the openml.org dataset repository, which can be searched [here](https://www.openml.org/search?type=data&sort=runs&status=active):

```python
from sklearn.datasets import fetch_openml

data_openml = fetch_openml(name='soybean', version=1) # by name
data_openml = fetch_openml(data_id=42) # or by id
```

In many cases, datasets are provided as pandas DataFrames, as is the case here because we passed `as_frame=True`:

In [None]:
data

In [None]:
targets

# Preprocessing
Before analysing the data, it is crucial to inspect it.
In this case we know the data is "clean", but real-life data often is not.
So, let's introduce some issues into the data :)

In [None]:
data = ruin_data(data)
data["target"] = targets


## Check for missing values
Some data points may be missing from the dataset. Use the `describe` method to see summary statistics, including the number of non-missing values (`count`) in each column.

Looking at the counts, this seems to be the case. Depending on the situation, you can:
- remove the rows where data is missing (using the `dropna` method)
- replace the missing data with the mean of the column (`fillna` with `data.mean()`)
- remove the columns where data is missing
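
The three options above can be sketched on a small made-up frame. The `toy` DataFrame and its column names are hypothetical and only serve as an illustration; they are not the iris data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries (not the iris data)
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.nan, 30.0, 40.0],
    "c": [5.0, 6.0, 7.0, 8.0],
})

# Count the missing values in each column
n_missing = toy.isna().sum()

# Option 1: drop every row that contains at least one missing value
rows_dropped = toy.dropna()

# Option 2: replace missing values with the mean of their column
mean_filled = toy.fillna(toy.mean())

# Option 3: drop every column that contains at least one missing value
cols_dropped = toy.dropna(axis=1)
```

Dropping rows is the simplest choice when only a few rows are affected; filling with the mean keeps all rows but reduces the variance of the filled column.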

In [None]:
data = data.dropna()
data.describe()

## Check for outliers
Outliers can be identified visually or with statistics such as the z-score or the interquartile range (IQR).
Try using `sns.pairplot`, coloring by target.

You can also use `zscore` from `scipy.stats` to calculate z-scores and select the points whose absolute score exceeds a chosen threshold.

After removing the outliers, use `pairplot` again to check the result.
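
As a minimal sketch of the z-score approach, the example below filters a made-up single-feature frame. The `toy` data and the threshold of 3 are assumptions chosen for illustration; a suitable threshold depends on the data.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical toy feature: 20 well-behaved values and one extreme value
toy = pd.DataFrame({"x": [1.0] * 10 + [2.0] * 10 + [50.0]})

# Absolute z-score of every value, computed column by column
z = np.abs(zscore(toy.to_numpy(), axis=0))

# Keep only the rows in which every feature is below the threshold
threshold = 3  # assumed cut-off, adjust to the data at hand
toy_clean = toy[(z < threshold).all(axis=1)]
```

Note that a single extreme value inflates the standard deviation, so in very small samples an outlier may not reach the threshold at all.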

## Data scaling
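Distance-based techniques (e.g. k-nearest neighbours or k-means) can be dominated by the features with the largest numeric range when the dimensions have different scales.
The cell below uses `StandardScaler`; try some other scaling methods as well.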

In [None]:
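from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Scale the feature columns only; the target labels must stay unchanged
features = data.drop(columns="target")
scaler = StandardScaler()
data_scaled = pd.DataFrame(
    scaler.fit_transform(features), columns=features.columns, index=features.index
)
data_scaled["target"] = data["target"]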
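
Two alternative scalers from `sklearn.preprocessing` could be tried along the following lines. The `toy` frame is hypothetical and only illustrates how the scalers behave; the same calls should apply to the feature columns of the notebook's data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical toy features on very different scales (not the iris data)
toy = pd.DataFrame({
    "small": [0.1, 0.2, 0.3, 0.4],
    "large": [100.0, 200.0, 300.0, 400.0],
})

# MinMaxScaler maps each column linearly onto the range [0, 1]
toy_minmax = pd.DataFrame(
    MinMaxScaler().fit_transform(toy), columns=toy.columns, index=toy.index
)

# RobustScaler centres each column on its median and divides by the
# interquartile range, which makes it less sensitive to outliers
toy_robust = pd.DataFrame(
    RobustScaler().fit_transform(toy), columns=toy.columns, index=toy.index
)
```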


# Exploratory analysis

You can plot the correlation matrix between the features using `heatmap` from seaborn, together with the `corr` method of a DataFrame.
Set `vmin` and `vmax` appropriately (correlation coefficients range from -1 to 1).
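
One possible way to draw such a heatmap is sketched here on the clean iris features. The variable names and the `coolwarm` colour map are illustrative choices, not requirements.

```python
import seaborn as sns
from sklearn.datasets import load_iris

# Load the clean iris features as a DataFrame
features, _ = load_iris(return_X_y=True, as_frame=True)

# Pairwise Pearson correlation between the feature columns
corr = features.corr()

# Correlations lie in [-1, 1], so fix the colour scale to that range
ax = sns.heatmap(corr, vmin=-1, vmax=1, cmap="coolwarm", annot=True)
```

Fixing `vmin` and `vmax` keeps the colour scale symmetric around zero, so positive and negative correlations of equal strength appear equally intense.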

Convert the data to NumPy format (use the `to_numpy()` method) and save it (use `np.save()`).

Don't forget to separate the features and the labels.
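
A minimal sketch of the split and the round trip to disk, using a made-up frame. The `toy` data, the `_toy` names and the temporary output directory are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the preprocessed data
toy = pd.DataFrame({
    "f1": [0.1, 0.2, 0.3],
    "f2": [1.0, 2.0, 3.0],
    "target": [0, 1, 1],
})

# Separate the features from the labels and convert both to NumPy arrays
X_toy = toy.drop(columns="target").to_numpy()
y_toy = toy["target"].to_numpy()

# Save both arrays in a temporary directory
out_dir = Path(tempfile.mkdtemp())
np.save(out_dir / "X_toy.npy", X_toy)
np.save(out_dir / "y_toy.npy", y_toy)

# Load the features back to check the round trip
X_loaded = np.load(out_dir / "X_toy.npy")
```

When the file name has no extension, `np.save` appends `.npy` to it.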

In [None]:
X

In [None]:
y

In [None]:
np.save("X", X)
np.save("y", y)