---
<h1 align='center' style="color:blue">Feature Selection with Categorical Data</h1>

---

https://machinelearningmastery.com/

#### Feature selection is the process of identifying and selecting a subset of input features that are most relevant to the target variable.

#### Feature selection is often straightforward when working with real-valued data, for example by using Pearson’s correlation coefficient, but can be challenging when working with categorical data.

#### The two most commonly used feature selection methods for categorical input data when the target variable is also categorical (e.g. classification predictive modeling) are the **chi-squared statistic** and the **mutual information statistic**.

#### Here we will discover how to perform feature selection with categorical input data.

#### After completing this tutorial, you will know:
>- The breast cancer predictive modeling problem with categorical inputs and binary classification target variable.
>- How to evaluate the importance of categorical features using the chi-squared and mutual information statistics.
>- How to perform feature selection for categorical data when fitting and evaluating a classification model.

---
## **This tutorial is divided into three parts:**

#### 1. Breast Cancer Categorical Dataset
#### 2. Categorical Feature Selection
#### 3. Modeling With Selected Features

---
---
## 1. **Breast Cancer Categorical Dataset**

#### The dataset classifies breast cancer patient data as either a recurrence or no recurrence of cancer. There are 286 examples and nine input variables. It is a binary classification problem.

#### Looking at the data, we can see that all nine input variables are categorical.

#### Specifically, all variables are quoted strings; some are ordinal and some are not.

### Load the dataset

In [29]:
import pandas as pd
data=pd.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv",header=None)
# save a local copy so that later cells can load it by filename
data.to_csv("breast-cancer.csv",index=False,header=False)
# preview the first rows
data.head()

### Split the columns into input (X) and output (y) for modeling.

In [30]:
dataset=data.values
# dataset

# split the data
X=dataset[:,:-1]
X

array([["'40-49'", "'premeno'", "'15-19'", ..., "'right'", "'left_up'",
        "'no'"],
       ["'50-59'", "'ge40'", "'15-19'", ..., "'right'", "'central'",
        "'no'"],
       ["'50-59'", "'ge40'", "'35-39'", ..., "'left'", "'left_low'",
        "'no'"],
       ...,
       ["'30-39'", "'premeno'", "'30-34'", ..., "'right'", "'right_up'",
        "'no'"],
       ["'50-59'", "'premeno'", "'15-19'", ..., "'right'", "'left_low'",
        "'no'"],
       ["'50-59'", "'ge40'", "'40-44'", ..., "'left'", "'right_up'",
        "'no'"]], dtype=object)

In [31]:
# select the last column (the class label) for every row
y=dataset[:,-1]
y.shape

(286,)

#### We can force all fields in the input data to be strings, just in case Pandas tried to map some of them automatically to numbers.

#### Format all fields as strings

In [32]:
X=X.astype(str)
X.dtype

dtype('<U11')
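To see why the cast can matter, consider a column that happens to contain only digits: Pandas would parse it as integers, and the resulting array would mix numbers and strings. The sketch below uses a made-up two-column CSV held in memory (the values and the names `csv_text`, `frame`, `raw` and `as_text` are illustrative assumptions, not part of the dataset).

```python
import io
import pandas as pd

# hypothetical CSV: the first column looks numeric, the second is quoted text
csv_text = "1,'left'\n2,'right'\n3,'left'\n"
frame = pd.read_csv(io.StringIO(csv_text), header=None)

raw = frame.values          # object array holding numbers and strings side by side
as_text = raw.astype(str)   # every field becomes a NumPy unicode string

print(raw.dtype, as_text.dtype)
print(as_text[0])
```

After the cast, every column can be treated uniformly as a categorical string, which is what the encoders used later expect.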

---
### We can tie all of this together into a helpful function that we can reuse later.

In [33]:
# load the dataset
# helper function that loads the dataset from a CSV file
def load_dataset(filename):
    # load the dataset as a pandas DataFrame
    data=pd.read_csv(filename,header=None)
    
    # retrieve the underlying numpy array
    dataset=data.values
    
    # split into input (X) and output (y) variables
    X=dataset[:,:-1]
    y=dataset[:,-1]
    
    # format all input fields as strings
    X=X.astype(str)
    return X,y
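The two slices inside load_dataset() depend on where the comma sits: the part before the comma selects rows and the part after it selects columns. A minimal sketch on a made-up 3x3 array (the values and the names `toy`, `X_toy`, `y_toy` and `wrong` are illustrative only):

```python
import numpy as np

# toy array: two input columns followed by a label column (illustrative values)
toy = np.array([
    ["a", "x", "yes"],
    ["b", "y", "no"],
    ["c", "z", "yes"],
], dtype=object)

X_toy = toy[:, :-1]   # all rows, every column except the last -> inputs
y_toy = toy[:, -1]    # all rows, only the last column -> labels
wrong = toy[:-1]      # every row except the last, all columns (not a label vector)

print(X_toy.shape, y_toy.shape, wrong.shape)
```

Dropping the comma, as in `toy[:-1]`, silently returns a 2-D array with one row missing instead of a 1-D label vector, so it is worth checking the shapes after splitting.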

#### Once loaded, we can split the data into training and test sets so that we can fit and evaluate a learning model.

#### We will use the train_test_split() function from scikit-learn.

In [45]:
# load the dataset
X,y=load_dataset("breast-cancer.csv")

# Split the data into train & test
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=30,random_state=1)

### Tying all of these elements together, the complete example of loading, splitting, and summarizing the raw categorical dataset is listed below.

In [47]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split


# load the dataset
# helper function that loads the dataset from a CSV file
def load_dataset(filename):
    # load the dataset as a pandas DataFrame
    data=pd.read_csv(filename,header=None)
    
    # retrieve the underlying numpy array
    dataset=data.values
    
    # split into input (X) and output (y) variables
    X=dataset[:,:-1]
    y=dataset[:,-1]
    
    # format all input fields as strings
    X=X.astype(str)
    return X,y

# load the dataset
X,y=load_dataset("breast-cancer.csv")

# Split the data into train & test
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=30,random_state=1)

# summarize
print('Train', X_train.shape, y_train.shape)
print('Test', X_test.shape, y_test.shape)

Train (256, 9) (256,)
Test (30, 9) (30,)
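Note that `test_size` is interpreted differently depending on its type: an integer is an absolute number of test rows, while a float between 0 and 1 is a fraction of the dataset. The sketch below illustrates this on placeholder data with 286 rows (the arrays `X_demo` and `y_demo` are stand-ins, not the real dataset).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# placeholder data with the same number of rows as the dataset (286)
X_demo = np.arange(286 * 2).reshape(286, 2)
y_demo = np.arange(286) % 2

# integer test_size: exactly that many rows go to the test set
_, X_te_int, _, _ = train_test_split(X_demo, y_demo, test_size=30, random_state=1)

# float test_size: that fraction of the rows (rounded up) goes to the test set
_, X_te_frac, _, _ = train_test_split(X_demo, y_demo, test_size=0.33, random_state=1)

print(X_te_int.shape[0], X_te_frac.shape[0])
```

With 286 rows, `test_size=30` holds out 30 rows, whereas `test_size=0.33` holds out 95, so the two settings give noticeably different train and test sizes.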


---
---
## **Encode the Data for Modeling**

#### Let’s look at how we can encode the data for modeling.

#### We can use the **OrdinalEncoder()** from scikit-learn to encode each variable to integers. This is a flexible class and does allow the order of the categories to be specified as arguments if any such order is known.

#### **Note:** I will leave it as an exercise to you to update the example below to try specifying the order for those variables that have a natural ordering and see if it has an impact on model performance.
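As a starting point for that exercise, here is a minimal sketch of passing an explicit order through the `categories` argument of OrdinalEncoder. The size brackets below are illustrative values chosen because they do not sort numerically as strings; they are not the full set of categories in the dataset.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# illustrative size brackets; as strings they do not sort in numeric order
sizes = np.array([["5-9"], ["10-14"], ["0-4"], ["15-19"]])

# default: categories are sorted as strings, so "5-9" lands after "15-19"
default_codes = OrdinalEncoder().fit_transform(sizes).ravel()

# explicit order: one list of categories per input column
ordered = OrdinalEncoder(categories=[["0-4", "5-9", "10-14", "15-19"]])
ordered_codes = ordered.fit_transform(sizes).ravel()

print(default_codes)
print(ordered_codes)
```

With the default, the integer codes follow alphabetical order; with the explicit list, they follow the natural order of the brackets. For a dataset with several columns, `categories` needs one list per column.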

#### The best practice when encoding variables is to fit the encoding on the training dataset, then apply it to the train and test datasets.

#### The function below named prepare_inputs() takes the input data for the train and test sets and encodes it using an ordinal encoding.

### Prepare input data

In [41]:
from sklearn.preprocessing import OrdinalEncoder
# prepare input data
def prepare_inputs(X_train,X_test):
    oe=OrdinalEncoder()
    oe.fit(X_train)
    X_train_enc=oe.transform(X_train)
    X_test_enc=oe.transform(X_test)
    return X_train_enc,X_test_enc
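One caveat of fitting the encoder on the training set only: if the test set contains a category that never appeared in training, `transform()` raises an error by default. The sketch below shows that behaviour and one possible workaround, assuming scikit-learn 0.24 or later, where `handle_unknown="use_encoded_value"` is available. The toy values and the names `strict` and `tolerant` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# toy split: "central" appears only in the test rows (illustrative values)
train_col = np.array([["left_up"], ["left_low"], ["right_up"]])
test_col = np.array([["left_low"], ["central"]])

# default behaviour: an unseen category makes transform() raise ValueError
strict = OrdinalEncoder().fit(train_col)
try:
    strict.transform(test_col)
    raised = False
except ValueError:
    raised = True

# alternative: map unseen categories to a sentinel value instead of failing
tolerant = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
tolerant.fit(train_col)
test_codes = tolerant.transform(test_col).ravel()

print(raised, test_codes)
```

Whether a sentinel such as -1 is appropriate depends on the downstream model and on any statistics computed from the encoded values, so treat this as one option rather than a default.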

#### We also need to prepare the target variable.



#### It is a binary classification problem, so we need to map the two class labels to 0 and 1. This is a type of ordinal encoding, and scikit-learn provides the LabelEncoder class specifically designed for this purpose. We could just as easily use the OrdinalEncoder and achieve the same result, although the LabelEncoder is designed for encoding a single variable.

#### The prepare_targets() function integer encodes the output data for the train and test sets.

In [43]:
from sklearn.preprocessing import LabelEncoder

# Prepare target
def prepare_targets(y_train,y_test):
    le=LabelEncoder()
    le.fit(y_train)
    y_train_enc=le.transform(y_train)
    y_test_enc=le.transform(y_test)
    return y_train_enc,y_test_enc
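To check which integer each class label receives, the fitted LabelEncoder exposes `classes_`, and `inverse_transform()` maps the codes back to labels. A short sketch using the two label strings from this dataset (the three-element `labels` array is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# the two class labels as they appear in the file (the quotes are part of the string)
labels = np.array(["'recurrence-events'", "'no-recurrence-events'", "'no-recurrence-events'"])

le = LabelEncoder()
codes = le.fit_transform(labels)

print(le.classes_)                   # classes in sorted order; the index is the integer code
print(codes)
print(le.inverse_transform(codes))   # map the integer codes back to the original labels
```

Because the classes are sorted alphabetically, `'no-recurrence-events'` is encoded as 0 and `'recurrence-events'` as 1.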

#### We can call these functions to prepare our data.

In [44]:
# prepare input data
X_train_enc,X_test_enc=prepare_inputs(X_train,X_test)

# prepare output data
y_train_enc,y_test_enc=prepare_targets(y_train,y_test)

### Tying this all together, the complete example of loading and encoding the input and output variables for the breast cancer categorical dataset is listed below.

In [48]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder


# load the dataset
# helper function that loads the dataset from a CSV file
def load_dataset(filename):
    # load the dataset as a pandas DataFrame
    data=pd.read_csv(filename,header=None)
    
    # retrieve the underlying numpy array
    dataset=data.values
    
    # split into input (X) and output (y) variables
    X=dataset[:,:-1]
    y=dataset[:,-1]
    
    # format all input fields as strings
    X=X.astype(str)
    return X,y

# prepare input data
def prepare_inputs(X_train,X_test):
    oe=OrdinalEncoder()
    oe.fit(X_train)
    X_train_enc=oe.transform(X_train)
    X_test_enc=oe.transform(X_test)
    return X_train_enc,X_test_enc


# Prepare target
def prepare_targets(y_train,y_test):
    le=LabelEncoder()
    le.fit(y_train)
    y_train_enc=le.transform(y_train)
    y_test_enc=le.transform(y_test)
    return y_train_enc,y_test_enc

# load the dataset
X, y = load_dataset('breast-cancer.csv')
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# prepare input data
X_train_enc, X_test_enc = prepare_inputs(X_train, X_test)
# prepare output data
y_train_enc, y_test_enc = prepare_targets(y_train, y_test)