# **Categorical Variables**

A categorical variable takes one of a limited set of values (categories), each describing some characteristic of the observation.

Three approaches to preprocessing categorical data :
1. Drop Categorical Variables : Simply remove them from the dataset.
2. Ordinal Encoding : Assign a different integer to each unique value.

  Ordinal variable : a categorical variable whose categories have a natural ranking.

  Ex : Breakfast : {every day,never,rarely,most days,never}

  breakfast : {3,0,1,2,0}

  Assume the order never(0) < rarely(1) < most days(2) < every day(3)
  
3. One Hot Encoding : creates one new column per possible value, indicating the presence (1) or absence (0) of that value in the original data.

  Nominal variable : a categorical variable without an intrinsic ranking.

  Ex : Color : {red,red,yellow,green,yellow}

  color : {(1,0,0),(1,0,0),(0,1,0),(0,0,1),(0,1,0)}, with columns ordered (red, yellow, green)

  *Generally not useful when the variable takes a large number of values (as a rule of thumb, more than 15), because each value adds a column.
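
The two toy examples above can be reproduced with a few lines of pandas. This is a minimal sketch: the DataFrame `df`, the mapping `breakfast_order`, and the column order `color_columns` are names introduced here for illustration only.

```python
import pandas as pd

# Toy data taken from the Breakfast and Color examples above
df = pd.DataFrame({
    "Breakfast": ["every day", "never", "rarely", "most days", "never"],
    "Color": ["red", "red", "yellow", "green", "yellow"],
})

# Ordinal encoding: map each category to its rank in the assumed order
breakfast_order = {"never": 0, "rarely": 1, "most days": 2, "every day": 3}
breakfast_encoded = df["Breakfast"].map(breakfast_order).tolist()

# One-hot encoding: one indicator column per color, in a fixed column order
color_columns = ["red", "yellow", "green"]
color_encoded = pd.get_dummies(df["Color"])[color_columns].astype(int).values.tolist()

print(breakfast_encoded)
print(color_encoded)
```

The explicit mapping is what makes the ordinal codes meaningful: the integers follow the ranking, not the order in which the values happen to appear.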

In [None]:
#importing libraries
import pandas as pd
from sklearn.model_selection import train_test_split

#reading data
data = pd.read_csv('/content/melb_data.csv')

#separating the target from the features
y = data.Price
X = data.drop(['Price'],axis=1)

#splitting data into training and validation sets
X_train_full,X_valid_full,y_train,y_val = train_test_split(X,y,train_size=0.8,test_size=0.2,random_state=0)

#dropping columns with missing values (identified on the training set)
missing_columns = [col for col in X_train_full.columns if X_train_full[col].isnull().any()]
X_train_full = X_train_full.drop(missing_columns, axis=1)
X_valid_full = X_valid_full.drop(missing_columns, axis=1)

#selecting categorical columns with low cardinality (fewer than 10 unique values)
low_cardinality_columns = [col for col in X_train_full.columns if X_train_full[col].nunique()<10 and X_train_full[col].dtype=='object']

#selecting numerical columns
numerical_columns = [col for col in X_train_full if X_train_full[col].dtype in ['int64','float64']]

#keeping selected columns only
selected_cols = low_cardinality_columns + numerical_columns
X_train = X_train_full[selected_cols].copy()
X_val = X_valid_full[selected_cols].copy()

X_train.head()


| | Type | Method | Regionname | Rooms | Distance | Postcode | Bedroom2 | Bathroom | Landsize | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12167 | u | S | Southern Metropolitan | 1 | 5.0 | 3182.0 | 1.0 | 1.0 | 0.0 | -37.85984 | 144.9867 | 13240.0 |
| 6524 | h | SA | Western Metropolitan | 2 | 8.0 | 3016.0 | 2.0 | 2.0 | 193.0 | -37.858 | 144.9005 | 6380.0 |
| 8413 | h | S | Western Metropolitan | 3 | 12.6 | 3020.0 | 3.0 | 1.0 | 555.0 | -37.7988 | 144.822 | 3755.0 |
| 2919 | u | SP | Northern Metropolitan | 3 | 13.0 | 3046.0 | 3.0 | 1.0 | 265.0 | -37.7083 | 144.9158 | 8870.0 |
| 6043 | h | S | Western Metropolitan | 3 | 13.3 | 3020.0 | 3.0 | 1.0 | 673.0 | -37.7623 | 144.8272 | 4217.0 |


In [None]:
#list of categorical variables
s = (X_train.dtypes == 'object')
categorical_cols = list(s[s].index)

print("Categorical variables are:")
print(categorical_cols)

Categorical variables are:
['Type', 'Method', 'Regionname']


**Function for finding Mean Absolute Error**

In [None]:
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor

def absolute_error(X_train, X_valid, y_train, y_val):
  """Fit a random forest on the training data and return the MAE on the validation data."""
  model = RandomForestRegressor(random_state=0)
  model.fit(X_train, y_train)
  predictions = model.predict(X_valid)
  return mean_absolute_error(y_val, predictions)
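
Mean absolute error is the average of the absolute differences between the true and predicted values. As a minimal sketch, the snippet below computes it by hand and with `mean_absolute_error`; the arrays `y_true` and `y_pred` hold made-up numbers used only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical prices, chosen only for illustration
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# MAE is the mean of the absolute differences: (10 + 10 + 30) / 3
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

A lower MAE means the predictions are, on average, closer to the true prices, which is how the three approaches are compared here.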


**Approach 1 : Drop categorical variables**

In [None]:
dropped_X_train = X_train.select_dtypes(exclude=['object'])
dropped_X_valid = X_val.select_dtypes(exclude=['object'])

print("MAE for approach 1 is : ")
print(absolute_error(dropped_X_train,dropped_X_valid,y_train,y_val))

MAE for approach 1 is : 
175703.48185157913


**Approach 2 : Ordinal Encoding**

In [None]:
from sklearn.preprocessing import OrdinalEncoder

labeled_X_train_full = X_train.copy()
labeled_X_valid_full = X_val.copy()

ordinalencoder = OrdinalEncoder()

labeled_X_train_full[categorical_cols] = ordinalencoder.fit_transform(X_train[categorical_cols])
labeled_X_valid_full[categorical_cols] = ordinalencoder.transform(X_val[categorical_cols])

print("\nMAE for approach 2 is : ")
print(absolute_error(labeled_X_train_full,labeled_X_valid_full,y_train,y_val))


MAE for approach 2 is : 
165936.40548390493
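
By default, OrdinalEncoder assigns integers to the categories in sorted order, which need not match any meaningful ranking, and it raises an error on categories it did not see during fitting. The sketch below shows one way to supply an explicit ranking and to tolerate unseen values. It reuses the breakfast example rather than the Melbourne data; the frames `train` and `valid` and the value "sometimes" are invented for illustration, and `handle_unknown="use_encoded_value"` assumes scikit-learn 0.24 or newer.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column reusing the breakfast example; not part of the Melbourne data
train = pd.DataFrame({"Breakfast": ["every day", "never", "rarely", "most days"]})
valid = pd.DataFrame({"Breakfast": ["never", "sometimes"]})  # "sometimes" is unseen

encoder = OrdinalEncoder(
    categories=[["never", "rarely", "most days", "every day"]],  # explicit ranking
    handle_unknown="use_encoded_value",
    unknown_value=-1,  # code assigned to categories not seen during fit
)

train_codes = encoder.fit_transform(train[["Breakfast"]]).ravel().tolist()
valid_codes = encoder.transform(valid[["Breakfast"]]).ravel().tolist()

print(train_codes)
print(valid_codes)
```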


**Approach 3 : One Hot Encoding**

handle_unknown='ignore': Categories that appear in the validation data but not in the training data are encoded as all zeros instead of raising an error.

sparse_output=False: Returns a dense NumPy array instead of a sparse matrix (this parameter was named sparse before scikit-learn 1.2).

.fit_transform fits the encoder to the categorical columns in the training data and transforms them into one-hot encoded format.

.transform uses the already fitted encoder to transform the validation data.

In [None]:
from sklearn.preprocessing import OneHotEncoder

#Applying One-Hot Encoding
ohEncoder = OneHotEncoder(handle_unknown = 'ignore', sparse_output=False)
oh_train_cols = pd.DataFrame(ohEncoder.fit_transform(X_train[categorical_cols]))
oh_val_cols = pd.DataFrame(ohEncoder.transform(X_val[categorical_cols]))


#restoring the index
oh_train_cols.index = X_train.index
oh_val_cols.index = X_val.index

#removing the categorical columns and replacing them with the one-hot encoded columns
numeric_X_train = X_train.drop(categorical_cols, axis=1)
numeric_X_val = X_val.drop(categorical_cols,axis=1)

oh_X_train = pd.concat([numeric_X_train, oh_train_cols], axis=1)
oh_X_val = pd.concat([numeric_X_val, oh_val_cols], axis=1)

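#converting column names to strings (the one-hot columns have integer names, and scikit-learn requires all feature names to be strings)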
oh_X_train.columns = oh_X_train.columns.astype(str)
oh_X_val.columns = oh_X_val.columns.astype(str)

print("MAE for approach 3 is :")
print(absolute_error(oh_X_train, oh_X_val, y_train, y_val))

MAE for approach 3 is :
166089.4893009678
