<a href="https://colab.research.google.com/github/Omarsawan/Feature-construction-and-Categorical-features-tutorial/blob/master/feature_construction_and_categorical_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

In this tutorial, we discuss categorical features and how to use the original features of a data set to construct new ones. Often, the features of a data set in their raw form do not provide enough information, or the most useful information, to train a well-performing model. In some cases, model performance can be improved by transforming one or more features into a different representation that gives the model better information. This is known as feature construction.

# Categorical features

A categorical feature takes only a limited, and usually fixed, number of possible values rather than continuous values. Each value assigns an individual or other unit of observation to a particular group or category.
Consider a survey that asks how often you eat breakfast and provides four options: "Never", "Rarely", "Most days", or "Every day". In this case, the data is categorical, because responses fall into a fixed set of categories.
If people responded to a survey about which brand of car they owned, the responses would fall into categories like "Honda", "Toyota", and "Ford". In this case, the data is also categorical.

# Types of categorical features

Categorical features can be classified into two types: ordinal features and nominal features.

### 1 - Ordinal features

Ordinal features have values that can be ordered. For example, a question about how often you do something might have the following answers:
"Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3).
The values of an ordinal feature fall into categories, and those categories have a natural order.
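
As a minimal sketch of the ordering above, pandas can represent an ordinal feature with an ordered categorical dtype. The survey answers here are hypothetical and are not part of the data set used later in this tutorial.

```python
import pandas as pd

# Hypothetical survey answers (illustrative only).
answers = pd.Series(["Rarely", "Every day", "Never", "Most days", "Rarely"])

# Declare the categories in their natural order, lowest to highest.
order = ["Never", "Rarely", "Most days", "Every day"]
ordinal = answers.astype(pd.CategoricalDtype(categories=order, ordered=True))

# Integer codes follow the declared order: Never=0, ..., Every day=3.
codes = ordinal.cat.codes
print(codes.tolist())

# Because the dtype is ordered, comparisons between categories are meaningful.
at_least_most_days = ordinal >= "Most days"
print(at_least_most_days.tolist())
```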

### 2 - Nominal features

Nominal features have values that fall into categories with no relative order between them. For example, car brands such as "Honda", "Ford", and "Toyota" are categories, but they cannot be ordered because they have no intrinsic ranking.

# Examples

First, I will import the data set to show examples of categorical features.
Our data set is a collection of information about projects, including the date each project was launched, its category, and other features. The goal is to predict the state of a project: whether it succeeded, failed, or ended in some other state.

In [None]:
#import the pandas library to be able to load the dataset
import pandas as pd
#read the dataset
data=pd.read_csv('https://raw.githubusercontent.com/Omarsawan/Feature-construction-and-Categorical-features-tutorial/master/data/ks-projects-201801.csv',parse_dates=['deadline', 'launched'], encoding='latin-1')
#show the first 7 rows
data.head(7)

As a simple heuristic, we treat columns whose data type is object or datetime as categorical features, since these data types do not hold continuous numeric values.

In [None]:
#make a data series with index equal to the column name and the value of the index is whether this column is of type object
indicesObjects=(data.dtypes=='object')
#make a data series with index equal to the column name and the value of the index is whether this column is of type datetime
indicesDate=(data.dtypes=='datetime64[ns]')

#make a data series with columns that have true only
objectColumns=indicesObjects[indicesObjects]
dateColumns=indicesDate[indicesDate]

#show only the index (which is the column name of categorical features)
objectsList=list(objectColumns.index)
datetimeList=list(dateColumns.index)

categoricalFeatures=(objectsList+datetimeList)

print('Columns with data type object: ',objectsList)
print('Columns with data type datetime: ',datetimeList)
print('Categorical features are',categoricalFeatures)
print('Count of Categorical features is',len(categoricalFeatures))
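
The same selection can be written more compactly with `select_dtypes`. The sketch below uses a small hypothetical frame (`toy`) in place of the real data set, so the column names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data set.
toy = pd.DataFrame({
    "name": pd.Series(["a", "b"], dtype="object"),
    "goal": [100.0, 250.0],
    "launched": pd.to_datetime(["2017-01-01", "2017-02-01"]),
})

# select_dtypes accepts a list of dtypes; "datetime" matches datetime64 columns.
categorical_cols = toy.select_dtypes(include=["object", "datetime"]).columns.tolist()
print(categorical_cols)
```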

So our data set has 8 categorical features.

# Approaches to handle categorical features

You will get an error if you try to plug these variables into most machine learning models in Python without preprocessing them first. There are several approaches you can use to prepare your categorical data.

### 1 - Dropping features

The first and easiest approach is to remove these variables from the data set. This approach only works well if the columns do not contain useful information.

In [None]:
dropCategorical=data.select_dtypes(exclude='object').select_dtypes(exclude='datetime64[ns]')
dropCategorical.head(7)

### 2 - Label encoding

The second approach is label encoding, which assigns each unique value a different integer.
This approach assumes that the encoded features are ordinal.
This assumption makes sense for some features but not for others. In our data set, the column 'name' could be considered an ordinal feature if we order its values lexicographically. The other features are nominal, so this method is not suitable for encoding them. We will therefore encode only the column 'name'.

Scikit-learn has a LabelEncoder class that can be used to get label encodings. We apply the label encoder separately to each column.

In [None]:
from sklearn.preprocessing import LabelEncoder

# Make copy to avoid changing original data 
label_data = data.copy()

# Apply label encoder to column 'name'
label_encoder = LabelEncoder()
label_data['name'] = label_encoder.fit_transform(label_data['name'])
label_data.head(7)
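
One caveat worth illustrating: `LabelEncoder` assigns integers according to the sorted order of the unique values, which for strings is alphabetical. The sketch below uses hypothetical survey answers (not the tutorial's data set) to show that this can differ from the intended ordinal order, and that an explicit mapping is one way to keep the order you want.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical ordinal answers, listed from lowest to highest frequency.
answers = pd.Series(["Never", "Rarely", "Most days", "Every day"])

# LabelEncoder sorts the unique values, so the codes follow alphabetical order.
alphabetical = LabelEncoder().fit_transform(answers)
print(dict(zip(answers, alphabetical)))

# An explicit mapping (an assumption about the intended order) preserves it.
order = {"Never": 0, "Rarely": 1, "Most days": 2, "Every day": 3}
explicit = answers.map(order)
print(explicit.tolist())
```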

### 3 - One-Hot Encoding

The third approach is one-hot encoding, which creates new columns indicating the presence (or absence) of each possible value in the original data.
For example, if a column color has only 3 categories (RED, YELLOW, BLUE), one-hot encoding converts it into 3 columns. A row with the value 'RED' is encoded as (1,0,0), and a row with the value 'YELLOW' is encoded as (0,1,0).
The resulting one-hot encoding contains one column for each possible value and one row for each row in the original data set.

In contrast to label encoding, one-hot encoding does not assume an ordering of the categories. Thus, you can expect this approach to work particularly well if there is no clear ordering in the categorical data (e.g., "Red" is neither more nor less than "Yellow"). So this method can work with nominal and ordinal variables.

One-hot encoding generally does not perform well if the categorical variable takes on a large number of values (i.e., you generally won't use it for variables taking more than 15 different values).
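
The color example can be reproduced with a short sketch using `pd.get_dummies`. The `colors` frame is hypothetical and exists only to mirror the example in the text.

```python
import pandas as pd

# Hypothetical column with the three categories from the example.
colors = pd.DataFrame({"color": ["RED", "YELLOW", "BLUE", "RED"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(colors["color"], dtype=int)

# get_dummies orders columns alphabetically; reorder to match the text.
one_hot = one_hot[["RED", "YELLOW", "BLUE"]]
print(one_hot)
```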


In [None]:
#count the number of unique values in each categorical column
#(dropna=False counts NaN as a value, matching len(col.unique()))
uniqueCount = {col: data[col].nunique(dropna=False) for col in categoricalFeatures}
print(uniqueCount)
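
Unique-value counts like these can be used to pick the columns that are reasonable candidates for one-hot encoding. The sketch below is one possible approach on a hypothetical frame. The threshold of 15 comes from the rule of thumb above, and the names `toy` and `max_cardinality` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame: 'state' has few unique values, 'name' has many.
toy = pd.DataFrame({
    "state": ["failed", "successful"] * 10,
    "name": [f"project {i}" for i in range(20)],
})

# Rule-of-thumb threshold from the text (an assumption, not a hard limit).
max_cardinality = 15

# Keep only the columns whose number of unique values is within the threshold.
low_cardinality = [col for col in toy.columns if toy[col].nunique() <= max_cardinality]
print(low_cardinality)
```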

So let's apply one-hot encoding only to the columns state and currency, so that the result is easy to visualize.

In [None]:
from sklearn.preprocessing import OneHotEncoder

cols=['state','currency']

# Make copy to avoid changing original data 
OH_data = data.copy()

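# Apply one-hot encoder to the columns we have chosen
# (sparse_output replaced the sparse argument in scikit-learn 1.2; use sparse=False on older versions)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols = pd.DataFrame(OH_encoder.fit_transform(OH_data[cols]),
                       columns=OH_encoder.get_feature_names_out(cols))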

# One-hot encoding removed index; put it back
OH_cols.index = OH_data.index

# Remove categorical columns (will be replaced with one-hot encoding)
num_X = OH_data.drop(cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X = pd.concat([num_X, OH_cols], axis=1)

OH_X.head(7)

The world is filled with categorical data. You will be a much more effective data scientist if you know how to use this common data type!

# Feature construction

Creating new features from the raw data is one of the best ways to improve your model. It can be done with encoding methods, two of which we introduced above, and also through other methods.

Let's consider our column 'launched', which is of datetime type. Instead of encoding it with one of the encoding methods, we can simply replace it with 4 columns: hour, day, month, and year.

In [None]:
# Make copy to avoid changing original data 
data_copy = data.copy()

#add the four columns
data_copy=data_copy.assign(hour=data_copy.launched.dt.hour,
                           day=data_copy.launched.dt.day,
                           month=data_copy.launched.dt.month,
                           year=data_copy.launched.dt.year)
#remove the launched column
data_copy=data_copy.drop(['launched'], axis=1)

data_copy.head(7)

One of the easiest ways to create new features is by combining categorical variables. For example, if one record has the country "CA" and category "Music", you can create a new value "CA_Music". This is a new categorical feature that can provide information about correlations between categorical variables. This type of feature is typically called an interaction.
In general, you would build interaction features from all pairs of categorical features. You can make interactions from three or more features as well, but you'll tend to get diminishing returns.

Pandas lets us simply add string columns together like normal Python strings.

In [None]:
# Make copy to avoid changing original data 
data_copy = data.copy()
#make new feature
interactions = data['category'] + "_" + data['country']
#give this new column a name
interactions.name='category-country'
#add the column to the data
dataInteraction=pd.concat([interactions, data_copy], axis=1)

#remove the category and country columns
dataInteraction=dataInteraction.drop(['category','country'], axis=1)

dataInteraction.head(7)
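
To build interactions from all pairs of categorical features, as mentioned above, one option is `itertools.combinations`. This is a minimal sketch on a hypothetical frame. The column list and the `left_right` naming scheme are assumptions, not something the data set prescribes.

```python
from itertools import combinations

import pandas as pd

# Hypothetical frame with three categorical columns.
toy = pd.DataFrame({
    "category": ["Music", "Games"],
    "country": ["CA", "US"],
    "currency": ["CAD", "USD"],
})
cat_cols = ["category", "country", "currency"]

# Build one interaction column for every unordered pair of columns.
interactions = pd.DataFrame(index=toy.index)
for left, right in combinations(cat_cols, 2):
    interactions[f"{left}_{right}"] = toy[left].astype(str) + "_" + toy[right].astype(str)

print(interactions)
```

With three columns this yields three interaction features. The number of pairs grows quadratically with the number of columns, so it can help to restrict `cat_cols` to the features you expect to matter.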

We can also construct a new feature by examining the data set and its initial features. For example, the number of projects launched in the last week might affect our model's performance. Since we have the launch date of each project, we can count the number of projects launched in the 7 days before each project.

In [None]:
# First, create a Series with a timestamp index and the values in the series are the original index of rows
# then sort it by the timestamp index
launched = pd.Series(data.index, index=data.launched, name="count_last_week").sort_index()
launched.head(20)

Using a time series as the index allows us to define the rolling window size in terms of hours, days, or weeks. You can use .rolling() to select time periods as the window. For example, launched.rolling('7d') creates a rolling window that contains all the data in the previous 7 days. The window contains the current record, so if we want to count all the previous projects but not the current one, we need to subtract 1.

In [None]:
count_last_week = launched.rolling('7d').count() - 1
count_last_week.head(20)

In [None]:
#now that we have the counts, we need to adjust the index so we can join it with the other training data.
count_last_week.index = launched.values
count_last_week = count_last_week.reindex(data.index)
count_last_week.head(20)

In [None]:
#now join the new feature with the other data again using .join since we've matched the index.
data.join(count_last_week).head(10)
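
To make the steps easier to check by hand, here is the same procedure on a tiny hypothetical frame with four launch dates. The dates are made up for illustration, and the window is written with the uppercase day alias `"7D"`.

```python
import pandas as pd

# Hypothetical launch dates, deliberately not in chronological order.
toy = pd.DataFrame({
    "launched": pd.to_datetime(["2017-01-07", "2017-01-01", "2017-01-03", "2017-01-20"]),
})

# Series of original row labels, indexed by launch time and sorted chronologically.
launched = pd.Series(toy.index, index=toy["launched"], name="count_last_week").sort_index()

# Count the rows in each 7-day window, then subtract 1 to exclude the current row.
counts = launched.rolling("7D").count() - 1

# Restore the original row labels and order so the result can be joined back.
counts.index = launched.values
counts = counts.reindex(toy.index)

result = toy.join(counts)
print(result)
```

The project launched on 2017-01-07 has two earlier launches within its 7-day window (2017-01-01 and 2017-01-03), while the one launched on 2017-01-20 has none.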

We have now constructed a new feature.