# Handling Categorical Values in Scikit-Learn 🏠

Most machine learning algorithms expect **numerical data**, but real-world datasets often contain **categorical** or **text attributes**. This section shows how to handle them, using the `ocean_proximity` column from the California housing dataset as the example.

---

### 1. Categorical Attributes 📝

Text columns like **"ocean\_proximity"** are not free-form text, but limited to a fixed set of values (e.g., "NEAR BAY", "INLAND"). These are known as **categorical attributes**.

Example:

```python
housing_cat = housing[["ocean_proximity"]]
housing_cat.head()
```

---

### 2. Ordinal Encoding 🎲

Scikit-Learn's **`OrdinalEncoder`** maps each category to an integer code. This is appropriate when the categories have an inherent order.

```python
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
```

`fit_transform` returns a **2D NumPy array** of category codes, with one column per input feature.

To see the mapping:

```python
ordinal_encoder.categories_
```

Output:

```python
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]
```

⚠️ **Caution**: Ordinal encoding implies an order between categories, which may not be meaningful. By default the codes follow the sorted order of the category names, so `<1H OCEAN` (0) ends up numerically closer to `INLAND` (1) than to `NEAR OCEAN` (4), even though that says nothing about how similar the locations really are.
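
When a feature really is ordered, the order can be passed to the encoder explicitly instead of relying on the default sorted order. The snippet below is a minimal sketch on a hypothetical `size` column (not part of the housing dataset) and assumes the `categories` parameter of `OrdinalEncoder`:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: a feature with a genuine order.
sizes = pd.DataFrame({"size": ["medium", "small", "large", "small"]})

# Default behaviour: categories are sorted, so large=0, medium=1, small=2.
default_codes = OrdinalEncoder().fit_transform(sizes)

# Explicit order: small=0, medium=1, large=2.
ordered_encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordered_codes = ordered_encoder.fit_transform(sizes)

print(default_codes.ravel())  # [1. 2. 0. 2.]
print(ordered_codes.ravel())  # [1. 0. 2. 0.]
```

With the explicit list, the numeric distance between codes matches the real ordering of the categories.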

---

### 3. One-Hot Encoding 🔥

For **unordered categories**, **one-hot encoding** is a better choice. It creates one binary column per category, so the model doesn't assume any order between them.

```python
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
```

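By default this returns a SciPy **sparse matrix**, which stores only the nonzero entries. That saves memory here because each row contains a single 1 and zeros everywhere else.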

To convert it to a **regular NumPy array**:

```python
housing_cat_1hot.toarray()
```
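
To make the sparse-versus-dense difference concrete, here is a small sketch on a hypothetical four-row frame (the values are made up for illustration). It checks that the sparse result stores one entry per row and that the dense version has exactly one 1 in each row:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with three distinct categories.
toy = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]})

encoded = OneHotEncoder().fit_transform(toy)  # SciPy sparse matrix by default

print(encoded.shape)  # (4, 3): one column per distinct category
print(encoded.nnz)    # 4: a single stored value per row

dense = encoded.toarray()
print(dense.sum(axis=1))  # [1. 1. 1. 1.]
```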

Or directly get a **dense array**:

```python
# scikit-learn >= 1.2 uses sparse_output; older releases used sparse=False
cat_encoder = OneHotEncoder(sparse_output=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
```

To check the category order:

```python
cat_encoder.categories_
```

---

### 4. Summary 🗂️

| **Method**         | **Use When**             | **Output Type** |
| ------------------ | ------------------------ | --------------- |
| **OrdinalEncoder** | Categories have an order | 2D NumPy array  |
| **OneHotEncoder**  | Categories are unordered | Sparse or dense |

Using the right encoding ensures your model learns correctly from **categorical features**.
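
In practice an encoder should only see the categorical columns, not the numeric ones. One possible way to do this, sketched below on a hypothetical two-column frame, is `ColumnTransformer` from `sklearn.compose`. The transformer name `"cat"` and the toy values are assumptions made for illustration, and `sparse_threshold=0` is used only to force a dense result:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one numeric and one categorical column.
toy = pd.DataFrame({
    "median_income": [2.1, 6.3, 2.9],
    "ocean_proximity": ["INLAND", "NEAR OCEAN", "INLAND"],
})

# One-hot encode only ocean_proximity; pass the numeric column through unchanged.
transformer = ColumnTransformer(
    [("cat", OneHotEncoder(), ["ocean_proximity"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense NumPy array
)
result = transformer.fit_transform(toy)

print(result.shape)  # (3, 3): two one-hot columns plus median_income
print(result)
```

The encoded columns come first in the output, followed by the passed-through numeric column.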

In [2]:
import pandas as pd
import numpy as np

In [3]:
data = pd.read_csv("xlsx/California Housing Prices/housing.csv")

In [4]:
data["income_cat"] = pd.cut(data["median_income"], bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf], labels=[1, 2, 3, 4, 5])

In [5]:
import matplotlib.pyplot as plt

In [6]:
from sklearn.model_selection import StratifiedShuffleSplit

In [7]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

In [8]:
for train_index, test_index in split.split(data, data["income_cat"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]

In [9]:
# Remove the income_cat column
for sett in (strat_train_set, strat_test_set):
    sett.drop("income_cat", axis=1, inplace=True)

In [10]:
df = strat_train_set.copy()

In [11]:
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

In [12]:
from sklearn.impute import SimpleImputer

In [13]:
imputer = SimpleImputer(strategy="median")

In [14]:
housing_num = housing.select_dtypes(include=[np.number])

In [15]:
imputer.fit(housing_num)

In [16]:
imputer.statistics_

array([-118.51   ,   34.26   ,   29.     , 2119.     ,  433.     ,
       1164.     ,  408.     ,    3.54155])

In [17]:
X = imputer.transform(housing_num)

In [18]:
housing = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)

In [19]:
housing["ocean_proximity"] = df["ocean_proximity"]

In [20]:
housing_for_onehot = housing[["ocean_proximity"]]

In [21]:
housing.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
12655,-121.46,38.52,29.0,3873.0,797.0,2237.0,706.0,2.1736,INLAND
15502,-117.23,33.09,7.0,5320.0,855.0,2015.0,768.0,6.3373,NEAR OCEAN
2908,-119.04,35.37,44.0,1618.0,310.0,667.0,300.0,2.875,INLAND
14053,-117.13,32.75,24.0,1877.0,519.0,898.0,483.0,2.2264,NEAR OCEAN
20496,-118.7,34.28,27.0,3536.0,646.0,1837.0,580.0,4.4964,<1H OCEAN


In [22]:
housing["ocean_proximity"].unique()

array(['INLAND', 'NEAR OCEAN', '<1H OCEAN', 'NEAR BAY', 'ISLAND'],
      dtype=object)

In [23]:
from sklearn.preprocessing import OrdinalEncoder

In [24]:
ordinal_encoder = OrdinalEncoder()

In [25]:
# Encode only the categorical column. Passing the whole DataFrame would also
# replace every numeric value with its rank among that column's unique values.
housing_cat = ordinal_encoder.fit_transform(housing[["ocean_proximity"]])

In [26]:
housing_cat = pd.DataFrame(housing_cat, columns=["ocean_proximity"], index=housing.index)

In [27]:
housing_cat.head()

Unnamed: 0,ocean_proximity
12655,1.0
15502,4.0
2908,1.0
14053,4.0
20496,0.0


In [28]:
from sklearn.preprocessing import OneHotEncoder

In [29]:
one_hot_encoder = OneHotEncoder()

In [30]:
housing_cat = one_hot_encoder.fit_transform(housing_for_onehot)

In [31]:
housing_cat

<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>

In [32]:
housing_cat.toarray()

array([[0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       ...,
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])

In [33]:
one_hot_encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

In [34]:
# Take the column names from the fitted encoder instead of hard-coding them
housing_cat = pd.DataFrame(housing_cat.toarray(), columns=one_hot_encoder.categories_[0], index=housing.index)

In [35]:
housing_cat

Unnamed: 0,<1H OCEAN,INLAND,ISLAND,NEAR BAY,NEAR OCEAN
12655,0.0,1.0,0.0,0.0,0.0
15502,0.0,0.0,0.0,0.0,1.0
2908,0.0,1.0,0.0,0.0,0.0
14053,0.0,0.0,0.0,0.0,1.0
20496,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...
15174,1.0,0.0,0.0,0.0,0.0
12661,0.0,1.0,0.0,0.0,0.0
19263,1.0,0.0,0.0,0.0,0.0
19140,1.0,0.0,0.0,0.0,0.0
