## Date and time

We can create a bunch of new columns from a datetime column. Let’s look at some of the sample features that can be created.

In [1]:
import pandas as pd
# create a series of datetime with a frequency of 10 hours
s = pd.date_range('2020-01-06', '2020-01-10', freq='10H').to_series()
# create some features based on datetime
features = {
 "dayofweek": s.dt.dayofweek.values,
 "dayofyear": s.dt.dayofyear.values,
 "hour": s.dt.hour.values,
 "is_leap_year": s.dt.is_leap_year.values,
 "quarter": s.dt.quarter.values,
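 # Series.dt.weekofyear was removed in pandas 2.0;
 # isocalendar().week gives the ISO week number instead
 "weekofyear": s.dt.isocalendar().week.values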
}

Using aggregates in pandas, it is quite easy to create features like these.

In [2]:
def generate_features(df):
 # create a bunch of features using the date column
 df.loc[:, 'year'] = df['date'].dt.year
 # Series.dt.weekofyear was removed in pandas 2.0
 df.loc[:, 'weekofyear'] = df['date'].dt.isocalendar().week
 df.loc[:, 'month'] = df['date'].dt.month
 df.loc[:, 'dayofweek'] = df['date'].dt.dayofweek
 df.loc[:, 'weekend'] = (df['date'].dt.weekday >=5).astype(int)
 
 # create an aggregate dictionary
 aggs = {}
 # for aggregation by month, we calculate the
 # number of unique month values and also the mean
 aggs['month'] = ['nunique', 'mean']
 aggs['weekofyear'] = ['nunique', 'mean']
 # we aggregate by num1 and calculate sum, max, min 
 # and mean values of this column
 aggs['num1'] = ['sum','max','min','mean']
 # for customer_id, we calculate the total count and the
 # number of unique values; both go in one list, because
 # a second assignment to the same key would overwrite the first
 aggs['customer_id'] = ['size', 'nunique']
 
 # we group by customer_id and calculate the aggregates
 agg_df = df.groupby('customer_id').agg(aggs)
 agg_df = agg_df.reset_index()
 return agg_df
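As a rough usage sketch, the same kind of dictionary-based aggregation can be tried on a toy frame. The column names `customer_id`, `date` and `num1` follow the function above; the sample values and the flattened column names are assumptions made up for illustration.

```python
import pandas as pd

# toy transactions; values are invented for illustration
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime([
        "2020-01-06", "2020-01-11", "2020-02-03",
        "2020-01-07", "2020-01-08",
    ]),
    "num1": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# date-derived columns
df["month"] = df["date"].dt.month
df["weekend"] = (df["date"].dt.weekday >= 5).astype(int)

# one row per customer, with a MultiIndex on the columns
agg_df = df.groupby("customer_id").agg({
    "month": ["nunique", "mean"],
    "num1": ["sum", "max", "min", "mean"],
    "weekend": ["sum"],
})

# flatten ("num1", "sum") -> "num1_sum" for easier downstream use
agg_df.columns = ["_".join(col) for col in agg_df.columns]
agg_df = agg_df.reset_index()
```

Flattening the column MultiIndex is optional, but it usually makes it easier to merge the aggregated features back onto the original rows by `customer_id`.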

### Statistical features

Sometimes, for example when dealing with time-series problems, a feature is not 
a single value but a list of values, such as the transactions made by a customer 
in a given period of time. Likewise, when you group numerical features on a 
categorical column, each group yields a list of values distributed over time. 
In these cases, you can create a bunch of statistical features such as:
- Mean
- Max
- Min
- Unique
- Skew
- Kurtosis
- Kstat
- Percentile
- Quantile
- Peak to peak
- And many more

In [3]:
import numpy as np
# x is a list of values; random numbers are used here
# only as a stand-in for real data
x = np.random.rand(100)
feature_dict = {}
# calculate mean
feature_dict['mean'] = np.mean(x)
# calculate max
feature_dict['max'] = np.max(x)
# calculate min
feature_dict['min'] = np.min(x)
# calculate standard deviation
feature_dict['std'] = np.std(x)
# calculate variance
feature_dict['var'] = np.var(x)
# peak-to-peak (max minus min)
feature_dict['ptp'] = np.ptp(x)
# percentile features
feature_dict['percentile_10'] = np.percentile(x, 10)
feature_dict['percentile_60'] = np.percentile(x, 60)
feature_dict['percentile_90'] = np.percentile(x, 90)
# quantile features
feature_dict['quantile_5'] = np.quantile(x, 0.05)
feature_dict['quantile_95'] = np.quantile(x, 0.95)
feature_dict['quantile_99'] = np.quantile(x, 0.99)
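The list of statistical features also mentions skew, kurtosis and the number of unique values. A minimal sketch of these, using only NumPy, could look like the following. The helper names `skewness` and `excess_kurtosis` are hypothetical, and the population (biased) definitions are an assumption; they should agree with `scipy.stats.skew` and `scipy.stats.kurtosis` under their default settings.

```python
import numpy as np

def skewness(x):
    """Population skewness: third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

def excess_kurtosis(x):
    """Population excess kurtosis: fourth standardized moment minus 3."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered ** 4) / np.mean(centered ** 2) ** 2 - 3.0

# illustrative values only: one large value gives a long right tail
x = [1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 10.0]
feature_dict = {
    "skew": skewness(x),
    "kurtosis": excess_kurtosis(x),
    "n_unique": len(np.unique(x)),
}
```

A positive skew here reflects the long right tail caused by the single large value.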

### Time series data

Time series data (a list of values) can be converted into a lot of features. 
A Python library called tsfresh is very useful in this case.

In [4]:
# tsfresh is a third-party package and must be installed
# first, e.g. with: pip install tsfresh
from tsfresh.feature_extraction import feature_calculators as fc
# tsfresh based features
feature_dict['abs_energy'] = fc.abs_energy(x)
feature_dict['count_above_mean'] = fc.count_above_mean(x)
feature_dict['count_below_mean'] = fc.count_below_mean(x)
feature_dict['mean_abs_change'] = fc.mean_abs_change(x)
feature_dict['mean_change'] = fc.mean_change(x)
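If tsfresh is not installed, the five features above can be approximated with plain NumPy. This is only a sketch based on my understanding of what these calculators compute, so it is worth checking against the tsfresh documentation before relying on it. The sample values are made up.

```python
import numpy as np

# illustrative values only
x = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

feature_dict = {}
# sum of squared values
feature_dict["abs_energy"] = float(np.dot(x, x))
# number of values strictly above / below the mean
feature_dict["count_above_mean"] = int(np.sum(x > x.mean()))
feature_dict["count_below_mean"] = int(np.sum(x < x.mean()))
# mean of absolute differences between consecutive values
feature_dict["mean_abs_change"] = float(np.mean(np.abs(np.diff(x))))
# mean of signed differences between consecutive values
feature_dict["mean_change"] = float(np.mean(np.diff(x)))
```

For this input, the absolute energy is 55.0, two values lie above the mean of 3.0 and two below, the mean absolute change is 1.75 and the mean change is 0.75.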

tsfresh offers hundreds of features, and tens of variations of different 
features, that you can use for time series (list of values) based features. In the 
examples above, x is a list of values. But that’s not all. There are many other features 
that you can create for numerical data with or without categorical data. A simple 
way to generate many features is just to create a bunch of polynomial features. For 
example, second-degree polynomial features from two features “a” and “b” would 
include: “a”, “b”, “ab”, “a²” and “b²”.

In [5]:
import numpy as np
# generate a random dataframe with 
# 2 columns and 100 rows
df = pd.DataFrame(
 np.random.rand(100, 2),
 columns=[f"f_{i}" for i in range(1, 3)]
)

In [6]:
df

|     | f_1 | f_2 |
| --- | --- | --- |
| 0 | 0.385130 | 0.224575 |
| 1 | 0.912021 | 0.443297 |
| 2 | 0.641941 | 0.992088 |
| 3 | 0.005152 | 0.685992 |
| 4 | 0.805110 | 0.933088 |
| ... | ... | ... |
| 95 | 0.837821 | 0.783369 |
| 96 | 0.485438 | 0.585550 |
| 97 | 0.926228 | 0.359476 |
| 98 | 0.518330 | 0.348956 |


We can create second-degree polynomial features using PolynomialFeatures from 
scikit-learn.

In [7]:
from sklearn import preprocessing
# initialize polynomial features class object
# for two-degree polynomial features
pf = preprocessing.PolynomialFeatures(
 degree=2,
 interaction_only=False,
 include_bias=False
)
# fit to the features
pf.fit(df)
# create polynomial features
poly_feats = pf.transform(df)
# create a dataframe with all the features
num_feats = poly_feats.shape[1]
df_transformed = pd.DataFrame(
 poly_feats,
 columns=[f"f_{i}" for i in range(1, num_feats + 1)])

In [9]:
df_transformed.head()

|     | f_1 | f_2 | f_3 | f_4 | f_5 |
| --- | --- | --- | --- | --- | --- |
| 0 | 0.38513 | 0.224575 | 0.148325 | 0.086491 | 0.050434 |
| 1 | 0.912021 | 0.443297 | 0.831782 | 0.404296 | 0.196512 |
| 2 | 0.641941 | 0.992088 | 0.412088 | 0.636862 | 0.984239 |
| 3 | 0.005152 | 0.685992 | 2.7e-05 | 0.003534 | 0.470585 |
| 4 | 0.80511 | 0.933088 | 0.648203 | 0.751239 | 0.870653 |


If you create third-degree polynomial features, you will end up with nine features 
in total.
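The count of nine can be checked by enumerating the terms. The sketch below lists every monomial of total degree 1 up to the chosen degree, with no bias term; the helper name `polynomial_terms` is hypothetical and only serves this illustration.

```python
from itertools import combinations_with_replacement

def polynomial_terms(names, degree):
    """List all monomials of total degree 1..degree (no bias term)."""
    terms = []
    for d in range(1, degree + 1):
        # each combination with replacement is one monomial,
        # e.g. ("a", "a", "b") stands for a*a*b
        for combo in combinations_with_replacement(names, d):
            terms.append("*".join(combo))
    return terms

terms_2 = polynomial_terms(["a", "b"], degree=2)
terms_3 = polynomial_terms(["a", "b"], degree=3)
```

With two input features, `terms_2` has five entries and `terms_3` has nine: the five second-degree terms plus `a*a*a`, `a*a*b`, `a*b*b` and `b*b*b`.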

### Binning

Another interesting technique converts numbers to categories. It’s known as 
binning. Each value is replaced by the index of the interval (bin) it falls into.

In [10]:
# create bins of the numerical columns
# 10 bins
df["f_bin_10"] = pd.cut(df["f_1"], bins=10, labels=False)
# 100 bins
df["f_bin_100"] = pd.cut(df["f_1"], bins=100, labels=False)

In [11]:
df.head()

|     | f_1 | f_2 | f_bin_10 | f_bin_100 |
| --- | --- | --- | --- | --- |
| 0 | 0.38513 | 0.224575 | 3 | 38 |
| 1 | 0.912021 | 0.443297 | 9 | 91 |
| 2 | 0.641941 | 0.992088 | 6 | 64 |
| 3 | 0.005152 | 0.685992 | 0 | 0 |
| 4 | 0.80511 | 0.933088 | 8 | 80 |
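To see which intervals the bin indices refer to, `pd.cut` can also return the bin edges through its `retbins` parameter. The sketch below uses synthetic values, and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# synthetic values in [0, 1), seeded for reproducibility
rng = np.random.default_rng(0)
values = pd.Series(rng.random(100))

# labels=False returns integer bin indices;
# retbins=True additionally returns the array of bin edges
codes, edges = pd.cut(values, bins=10, labels=False, retbins=True)
```

With an integer `bins` argument, the edges split the observed range of the data into equal-width intervals, so 10 bins produce 11 edges.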


### Variance without and with the log transformation

In [15]:
df_transformed.f_3.var()

0.08612085456683241

In [19]:
df_transformed.f_3.apply(lambda x: np.log(1 + x)).var()

0.04424523282580211
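The effect is much stronger on a heavily skewed feature. The following sketch uses synthetic exponentially distributed values (an assumption, not data from this notebook) and `np.log1p`, which computes `log(1 + x)`.

```python
import numpy as np

rng = np.random.default_rng(42)
# skewed, non-negative feature (values are synthetic)
feature = rng.exponential(scale=1000.0, size=1000)

# variance on the original scale
var_raw = feature.var()
# variance after the log(1 + x) transformation
var_log = np.log1p(feature).var()
```

Here `var_log` comes out several orders of magnitude smaller than `var_raw`, because the logarithm compresses the long right tail.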

### k-nearest neighbour

A fancy way of filling in missing values is to use a k-nearest neighbour 
method. You select a sample with missing values and find its nearest 
neighbours using some kind of distance metric, for example, Euclidean distance. 
Then you take the mean of the nearest neighbours and use it to fill in the missing 
value. You can use the KNN imputer implementation from scikit-learn like this:

In [20]:
import numpy as np
from sklearn import impute
# create a random numpy array with 10 samples
# and 6 features and values ranging from 1 to 14
# (the upper bound of randint is exclusive)
X = np.random.randint(1, 15, (10, 6))
# convert the array to float
X = X.astype(float)
# randomly assign 10 elements to NaN (missing)
X.ravel()[np.random.choice(X.size, 10, replace=False)] = np.nan
# use 2 nearest neighbours to fill na values
knn_imputer = impute.KNNImputer(n_neighbors=2)
knn_imputer.fit_transform(X)

array([[ 8. ,  8.5, 14. , 10. , 12. ,  6.5],
       [ 6. , 11. , 12. ,  7. , 10. ,  3. ],
       [ 6. ,  8. , 14. , 12. , 13. , 10. ],
       [ 5. ,  8. ,  5. , 14. , 11. ,  4. ],
       [ 5.5,  6. ,  9.5, 14. , 12.5,  9. ],
       [14. ,  3. ,  6. ,  8. ,  5. ,  6. ],
       [ 8. , 10. , 10. ,  3. ,  9. ,  6. ],
       [ 8. ,  9.5,  8. , 13. , 12. ,  4. ],
       [ 8. , 14. ,  7. ,  8.5,  5. , 13. ],
       [ 7. ,  4. ,  1. ,  8. , 12. , 11. ]])
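Because the array above is random, it is hard to check the imputed values by hand. A smaller, fixed example makes the mechanics easier to follow. The input values below are chosen only for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer

# 4 samples, 3 features, two missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# each missing entry becomes the mean of that feature
# over the 2 nearest rows that have it observed
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

The missing value in the first row becomes 4.0, the mean of 3.0 and 5.0 from its two nearest rows. The missing value in the third row becomes 5.5, the mean of 3.0 and 8.0. Distances are computed only over the features that both rows have observed.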