# Handling Text and Categorical Attributes
    - Most Machine Learning algorithms prefer to work with numbers, so it's recommended to convert these categories from text to numbers.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

housing = pd.read_csv(r"C:\Users\georg\Desktop\end_end\datasets\housing\housing.csv") 

housing["income_category"] = pd.cut(housing["median_income"], bins=[0, 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])
split_indices = StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=42)
for train_index, test_index in split_indices.split(housing,housing["income_category"]): 
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index] 
# standard by now

In [7]:
housing_category = housing["ocean_proximity"]  # single square brackets [] return a pandas Series
print(type(housing_category))
housing_category1 = housing[["ocean_proximity"]]  # two square brackets [[]] return a pandas DataFrame
print(type(housing_category1))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


In [13]:
housing_category1.value_counts()

ocean_proximity
<1H OCEAN          9136
INLAND             6551
NEAR OCEAN         2658
NEAR BAY           2290
ISLAND                5
dtype: int64

### We will use Scikit-Learn’s OrdinalEncoder class
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html

In [14]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()

In [17]:
housing_cat_encoded = ordinal_encoder.fit_transform(housing_category)  
# !!! this is why the book uses double brackets [[ ]]: fit_transform does not accept a 1-dimensional array

ValueError: Expected 2D array, got 1D array instead:
array=['NEAR BAY' 'NEAR BAY' 'NEAR BAY' ... 'INLAND' 'INLAND' 'INLAND'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
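
The error message above suggests two ways around the problem. The sketch below shows both on a small made-up Series (the values in `toy_series` are an assumption for illustration, not the real housing data): `to_frame()` gives a one-column DataFrame, and `reshape(-1, 1)` gives a one-column NumPy array. Either shape is accepted by the encoder.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for housing["ocean_proximity"] (hypothetical values)
toy_series = pd.Series(["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"],
                       name="ocean_proximity")

as_frame = toy_series.to_frame()                 # DataFrame, shape (4, 1)
as_array = toy_series.to_numpy().reshape(-1, 1)  # NumPy array, shape (4, 1)

# Both 2D inputs are accepted and give the same codes
encoded_from_frame = OrdinalEncoder().fit_transform(as_frame)
encoded_from_array = OrdinalEncoder().fit_transform(as_array)

print(as_frame.shape, as_array.shape)
print(encoded_from_frame.ravel())
```

Selecting with double brackets, as in the next cell, is equivalent to the `to_frame()` route.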

In [19]:
housing_cat_encoded = ordinal_encoder.fit_transform(housing_category1)
ordinal_encoder.categories_  # the encoder stores the categories of each categorical attribute (in our case only ocean_proximity)

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

#### the encoder assigns a numerical value to each distinct category it finds, starting from 0

In [36]:
print(housing_cat_encoded[5:8])
print(housing_category1[5:8])  #so NEAR BAY being at index 3 in the list ordinal_encoder.categories_ gets assigned numerical 3

[[3.]
 [3.]
 [3.]]
  ocean_proximity
5        NEAR BAY
6        NEAR BAY
7        NEAR BAY
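
To make the code-to-category mapping explicit, here is a minimal sketch. It fits a separate encoder (`toy_encoder`, a name introduced only for this example) on the five category names listed above, then builds a lookup table from `categories_` and decodes a couple of codes with `inverse_transform`.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# The five categories reported by ordinal_encoder.categories_ above
categories = np.array([["<1H OCEAN"], ["INLAND"], ["ISLAND"],
                       ["NEAR BAY"], ["NEAR OCEAN"]], dtype=object)

toy_encoder = OrdinalEncoder().fit(categories)

# Position in categories_[0] is the numerical code assigned to the category
code_to_label = dict(enumerate(toy_encoder.categories_[0]))
print(code_to_label)

# inverse_transform goes back from codes to the original labels
decoded = toy_encoder.inverse_transform([[3.0], [0.0]])
print(decoded.ravel())
```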


### One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values.
### This may be fine in some cases (e.g., for ordered categories such as “bad”, “average”, “good”, “excellent”), but it is obviously not the case for the ocean_proximity column (for example, categories 0 and 4 are clearly more similar than categories 0 and 1).
### To fix this issue, a common solution is to create _one binary attribute per category_: one attribute equal to 1 when the category is “<1H OCEAN” (and 0 otherwise), another attribute equal to 1 when the category is “INLAND” (and 0 otherwise), and so on. This is called one-hot encoding, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new attributes are sometimes called _dummy attributes_.
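
Before turning to Scikit-Learn, the idea can be sketched by hand with NumPy alone. This is only an illustration on a few made-up labels, not how the library implements it: each label is compared against the sorted list of distinct categories, which yields one binary column per category.

```python
import numpy as np

# A few made-up labels (hypothetical sample, not the real dataset)
labels = np.array(["NEAR BAY", "INLAND", "<1H OCEAN", "INLAND"])

# Sorted distinct categories: one output column per category
categories = np.unique(labels)

# Compare every label with every category, True becomes 1.0 and False 0.0
one_hot = (labels[:, None] == categories[None, :]).astype(float)

print(categories)
print(one_hot)
```

Each row contains exactly one 1, in the column of its own category.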

#### Scikit-Learn provides a _OneHotEncoder_ class to convert categorical values into one-hot vectors

In [37]:
from sklearn.preprocessing import OneHotEncoder  #import the utility

cat_encoder = OneHotEncoder() # make an instance of that class
housing_cat_1hot = cat_encoder.fit_transform(housing_category1) #use fit_transform to "learn" and then transform a dataset
housing_cat_1hot # it will return a sparse matrix

<20640x5 sparse matrix of type '<class 'numpy.float64'>'
	with 20640 stored elements in Compressed Sparse Row format>

#### Notice that the output is a SciPy sparse matrix instead of a NumPy array. This is very useful when you have categorical attributes with thousands of categories. After one-hot encoding we get a matrix with thousands of columns, and the matrix is full of zeros except for a single 1 per row. Using up tons of memory mostly to store zeros would be very wasteful, so instead a sparse matrix only stores the location of the non-zero elements. You can use it mostly like a normal 2D array, but if you really want to convert it to a (dense) NumPy array, just call the _toarray()_ method:

In [38]:
housing_cat_1hot.toarray()

array([[0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       ...,
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])
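
To see what the sparse format saves, the sketch below encodes a tiny made-up column (the values in `toy` are an assumption for illustration) and compares the number of stored elements (`nnz`) with the number of cells in the dense array.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up column with three distinct categories
toy = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["ISLAND"]], dtype=object)

sparse_1hot = OneHotEncoder().fit_transform(toy)  # sparse by default
dense_1hot = sparse_1hot.toarray()                # dense NumPy array

print(sparse_1hot.shape)  # rows x categories
print(sparse_1hot.nnz)    # stored elements: one per row
print(dense_1hot.size)    # cells in the dense array, mostly zeros
```

On the full dataset the same comparison is 20640 stored elements against 20640 x 5 dense cells, as the output of the earlier cell shows.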

#### If a categorical attribute has a large number of possible categories (e.g., country code, profession, species, etc.), then one-hot encoding will result in a large number of input features. This may slow down training and degrade performance. If this happens, you may want to replace the categorical input with useful numerical features related to the categories: for example, you could replace the ocean_proximity feature with the distance to the ocean.

# One-hot encoding explained:

Turning each category of an attribute into its own attribute helps the ML algorithm stay **"objective"**.

##### In our first case (ordinal_encoder):
    - each category was assigned a numerical value from 0 to 4 (5 categories);
    - so INLAND = 1 and ISLAND = 2: in reality the two categories are very different, but numerically they are neighbours;
    - to the ML algorithm, ISLAND now looks more similar to INLAND than to, say, <1H OCEAN;
    - when the algorithm starts looking for correlations between the values, it may misinterpret the data.
    
##### In the second case (OneHotEncoder):
    - each category is transformed into an attribute;
    - each of these attributes then gets only a binary value, 0 or 1;
    - for the ML algorithm there is now a clear distinction between them: one is present, the rest are not;
    - when the algorithm starts looking for correlations, the category codes no longer suggest a false ordering.
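
The two cases above can be checked numerically. The sketch below is an illustration using plain NumPy and Euclidean distance (the helper `pairwise_distances` is defined here for the example, it is not a Scikit-Learn call). Under ordinal codes, some categories look closer than others; under one-hot encoding, every pair of distinct categories is equally far apart.

```python
import numpy as np

categories = ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"]

# Ordinal encoding: codes 0..4 in a single column
ordinal = np.arange(len(categories), dtype=float).reshape(-1, 1)
# One-hot encoding: one row of the identity matrix per category
one_hot = np.eye(len(categories))

def pairwise_distances(encoded):
    """Euclidean distance between every pair of encoded rows."""
    diff = encoded[:, None, :] - encoded[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

ord_dist = pairwise_distances(ordinal)
hot_dist = pairwise_distances(one_hot)

# INLAND vs ISLAND, then <1H OCEAN vs NEAR OCEAN
print(ord_dist[1, 2], ord_dist[0, 4])  # differ a lot under ordinal codes
print(hot_dist[1, 2], hot_dist[0, 4])  # identical under one-hot encoding
```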