# Feature Engineering
---
1) Importing Necessary Libraries
2) Loading EDA Performed Dataset
3) Feature Engineering

        a) Creating new feature (off-peak-diff)
        b) Creating new date based feature
        c) Transforming Categorical Attributes
        d) Feature Scaling
---
**1. Importing Necessary Libraries**

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

**2. Loading EDA Performed Dataset**

In [5]:
df = pd.read_csv('clean_data_after_eda.csv')
df.head()

Unnamed: 0,id,channel_sales,cons_12m,cons_gas_12m,cons_last_month,date_activ,date_end,date_modif_prod,date_renewal,forecast_cons_12m,...,var_6m_price_off_peak_var,var_6m_price_peak_var,var_6m_price_mid_peak_var,var_6m_price_off_peak_fix,var_6m_price_peak_fix,var_6m_price_mid_peak_fix,var_6m_price_off_peak,var_6m_price_peak,var_6m_price_mid_peak,churn
0,24011ae4ebbe3035111d65fa7c15bc57,foosdfpfkusacimwkcsosbicdxkicaua,0,54946,0,2013-06-15,2016-06-15,2015-11-01,2015-06-23,0.0,...,0.000131,4.100838e-05,0.0009084737,2.086294,99.530517,44.235794,2.086425,99.53056,44.2367,1
1,d29c2c54acc38ff3c0614d0a653813dd,MISSING,4660,0,0,2009-08-21,2016-08-30,2009-08-21,2015-08-31,189.95,...,3e-06,0.001217891,0.0,0.009482,0.0,0.0,0.009485,0.001217891,0.0,0
2,764c75f661154dac3a6c254cd082ea7d,foosdfpfkusacimwkcsosbicdxkicaua,544,0,0,2010-04-16,2016-04-16,2010-04-16,2015-04-17,47.96,...,4e-06,9.45015e-08,0.0,0.0,0.0,0.0,4e-06,9.45015e-08,0.0,0
3,bba03439a292a1e166f80264c16191cb,lmkebamcaaclubfxadlmueccxoimlema,1584,0,0,2010-03-30,2016-03-30,2010-03-30,2015-03-31,240.04,...,3e-06,0.0,0.0,0.0,0.0,0.0,3e-06,0.0,0.0,0
4,149d57cf92fc41cf94415803a877cb4b,MISSING,4425,0,526,2010-01-13,2016-03-07,2010-01-13,2015-03-09,445.75,...,1.1e-05,2.89676e-06,4.86e-10,0.0,0.0,0.0,1.1e-05,2.89676e-06,4.86e-10,0


**3. Feature Engineering**

**a) Creating New Feature: off-peak-diff (Dec-Jan)**

In [6]:
df['activation_month'] = pd.to_datetime(df['date_activ']).dt.month
df['activation_year'] = pd.to_datetime(df['date_activ']).dt.year

In [7]:
december_data = df[df['activation_month'] == 12]
january_data = df[df['activation_month'] == 1]

In [8]:
december_off_peak_price = december_data['forecast_price_energy_off_peak'].mean()
january_off_peak_price = january_data['forecast_price_energy_off_peak'].mean()

In [9]:
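# Difference between a customer's forecast off-peak price and the mean
# off-peak price of the opposite month (December vs. January)

def off_peak_diff(row):
    if row['activation_month'] == 12:
        # December activation: own price minus the January mean
        return row['forecast_price_energy_off_peak'] - january_off_peak_price
    elif row['activation_month'] == 1:
        # January activation: December mean minus own price
        return december_off_peak_price - row['forecast_price_energy_off_peak']
    else:
        # Activations in any other month have no Dec-Jan difference
        return np.nan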

In [10]:
df['off_peak_diff_dec_jan'] = df.apply(off_peak_diff, axis=1)

df[['off_peak_diff_dec_jan']].head()

| | off_peak_diff_dec_jan |
|---|---|
| 0 | NaN |
| 1 | NaN |
| 2 | NaN |
| 3 | NaN |
| 4 | 0.0215 |
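The row-wise `apply` above can be slow on large frames. As a minimal sketch, assuming the same column names as in the cells above, the same rule can be written in vectorized form with `np.select`. The `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (values are invented).
toy = pd.DataFrame({
    "date_activ": ["2010-12-05", "2011-01-13", "2012-06-15", "2013-12-20"],
    "forecast_price_energy_off_peak": [0.15, 0.12, 0.14, 0.11],
})
toy["activation_month"] = pd.to_datetime(toy["date_activ"]).dt.month

price = toy["forecast_price_energy_off_peak"]
is_dec = toy["activation_month"] == 12
is_jan = toy["activation_month"] == 1

# Mean off-peak price of customers activated in each month.
dec_mean = price[is_dec].mean()
jan_mean = price[is_jan].mean()

# Same branches as off_peak_diff(), evaluated for all rows at once;
# rows activated in any other month fall through to NaN.
toy["off_peak_diff_dec_jan"] = np.select(
    [is_dec, is_jan],
    [price - jan_mean, dec_mean - price],
    default=np.nan,
)
```

The result should match what `df.apply(off_peak_diff, axis=1)` produces for the same rows, without calling a Python function once per row.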


**b) Creating New Date Based Feature**

In [11]:
df['date_activ'] = pd.to_datetime(df['date_activ'])
df['date_end'] = pd.to_datetime(df['date_end'])
df['date_modif_prod'] = pd.to_datetime(df['date_modif_prod'])
df['date_renewal'] = pd.to_datetime(df['date_renewal'])

In [12]:
df['contract_duration'] = (df['date_end'] - df['date_activ']).dt.days
df['renewal_gap'] = (df['date_renewal'] - df['date_modif_prod']).dt.days

In [13]:
df[['contract_duration', 'renewal_gap']].head()

| | contract_duration | renewal_gap |
|---|---|---|
| 0 | 1096 | -131 |
| 1 | 2566 | 2201 |
| 2 | 2192 | 1827 |
| 3 | 2192 | 1827 |
| 4 | 2245 | 1881 |
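As a quick check of how these values arise, the sketch below recomputes both features for the first row of the dataset, using the four dates shown in the `df.head()` output above. The frame name `dates` is only illustrative.

```python
import pandas as pd

# Dates copied from row 0 of the df.head() output above.
dates = pd.DataFrame({
    "date_activ": ["2013-06-15"],
    "date_end": ["2016-06-15"],
    "date_modif_prod": ["2015-11-01"],
    "date_renewal": ["2015-06-23"],
})
for col in dates.columns:
    dates[col] = pd.to_datetime(dates[col])

# Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days.
dates["contract_duration"] = (dates["date_end"] - dates["date_activ"]).dt.days
dates["renewal_gap"] = (dates["date_renewal"] - dates["date_modif_prod"]).dt.days
```

The contract spans three years including one leap day, hence 1096 days. The renewal gap is negative (-131) because, for this customer, the product was last modified after the renewal date.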


**c) Transforming Categorical Attributes**

In [16]:
df['channel_sales_encoded'] = df['channel_sales'].astype('category').cat.codes
df['origin_up_encoded'] = df['origin_up'].astype('category').cat.codes

In [17]:
df[['channel_sales_encoded', 'origin_up_encoded']].head()

| | channel_sales_encoded | origin_up_encoded |
|---|---|---|
| 0 | 4 | 4 |
| 1 | 0 | 2 |
| 2 | 4 | 2 |
| 3 | 5 | 2 |
| 4 | 0 | 2 |


The categorical features above are label encoded: each distinct category is mapped to an integer code. The same conversion can also be done with scikit-learn's `LabelEncoder`.
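A minimal sketch of the scikit-learn alternative, assuming string categories with no missing values. The `channels` series reuses three channel values visible in the output above; because it contains only three categories, the codes differ from those in the full dataset. Note that the scikit-learn documentation describes `LabelEncoder` as intended for target labels, with `OrdinalEncoder` as the counterpart for input features.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A small sample of channel_sales values taken from the output above.
channels = pd.Series([
    "foosdfpfkusacimwkcsosbicdxkicaua",
    "MISSING",
    "foosdfpfkusacimwkcsosbicdxkicaua",
    "lmkebamcaaclubfxadlmueccxoimlema",
    "MISSING",
])

# pandas: integer codes follow the sorted order of the categories.
codes_pandas = channels.astype("category").cat.codes

# scikit-learn: classes are also sorted, so the codes agree here.
encoder = LabelEncoder()
codes_sklearn = encoder.fit_transform(channels)

# The fitted encoder can map codes back to the original labels.
original_labels = encoder.inverse_transform(codes_sklearn)
```

One practical difference is that the fitted `encoder` object can be stored and reused to encode new data with the same mapping, whereas `cat.codes` derives the mapping from whatever values are present at the time.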

**d) Feature Scaling**

In [19]:
scaler = StandardScaler()

numeric_col = ['cons_12m', 'forecast_cons_12m', 'net_margin', 'pow_max', 'off_peak_diff_dec_jan']
df[numeric_col] = scaler.fit_transform(df[numeric_col])

In [21]:
df[numeric_col].head()

| | cons_12m | forecast_cons_12m | net_margin | pow_max | off_peak_diff_dec_jan |
|---|---|---|---|---|---|
| 0 | -0.277655 | -0.782669 | 1.570703 | 1.885055 | NaN |
| 1 | -0.269529 | -0.703109 | -0.546444 | -0.320308 | NaN |
| 2 | -0.276707 | -0.762581 | -0.585862 | -0.31617 | NaN |
| 3 | -0.274893 | -0.682129 | -0.525372 | -0.36464 | NaN |
| 4 | -0.269939 | -0.595967 | -0.453144 | 0.123011 | 0.603219 |
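The `off_peak_diff_dec_jan` column still contains NaN after scaling. This is expected: `StandardScaler` disregards NaN values when fitting and keeps them in place when transforming. The sketch below illustrates this on a made-up array and compares the result with the formula $z = (x - \mu) / \sigma$, where $\sigma$ is the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature column with a missing value (toy numbers).
x = np.array([[1.0], [2.0], [np.nan], [3.0]])

scaled = StandardScaler().fit_transform(x)

# Manual z-score computed over the non-missing values only.
# np.nanstd uses ddof=0 by default, matching StandardScaler.
manual = (x - np.nanmean(x)) / np.nanstd(x)
```

Both arrays agree on the observed entries, and the missing entry stays NaN in each.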


**Final Feature-Engineered Data**

In [26]:
df.head()

Unnamed: 0,id,channel_sales,cons_12m,cons_gas_12m,cons_last_month,date_activ,date_end,date_modif_prod,date_renewal,forecast_cons_12m,...,var_6m_price_mid_peak,churn,activation_month,activation_year,off_peak_diff_dec_jan,contract_duration,renewal_gap,channel_sales_encoded,encoded_up_encoded,origin_up_encoded
0,24011ae4ebbe3035111d65fa7c15bc57,foosdfpfkusacimwkcsosbicdxkicaua,-0.277655,54946,0,2013-06-15,2016-06-15,2015-11-01,2015-06-23,-0.782669,...,44.2367,1,6,2013,,1096,-131,4,4,4
1,d29c2c54acc38ff3c0614d0a653813dd,MISSING,-0.269529,0,0,2009-08-21,2016-08-30,2009-08-21,2015-08-31,-0.703109,...,0.0,0,8,2009,,2566,2201,0,2,2
2,764c75f661154dac3a6c254cd082ea7d,foosdfpfkusacimwkcsosbicdxkicaua,-0.276707,0,0,2010-04-16,2016-04-16,2010-04-16,2015-04-17,-0.762581,...,0.0,0,4,2010,,2192,1827,4,2,2
3,bba03439a292a1e166f80264c16191cb,lmkebamcaaclubfxadlmueccxoimlema,-0.274893,0,0,2010-03-30,2016-03-30,2010-03-30,2015-03-31,-0.682129,...,0.0,0,3,2010,,2192,1827,5,2,2
4,149d57cf92fc41cf94415803a877cb4b,MISSING,-0.269939,0,526,2010-01-13,2016-03-07,2010-01-13,2015-03-09,-0.595967,...,4.86e-10,0,1,2010,0.603219,2245,1881,0,2,2


Now that the features are engineered, several columns contain missing values. For example, `off_peak_diff_dec_jan` is NaN for every customer who was not activated in December or January. These missing values must be handled before the data is fed to the predictive models.
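One possible way to handle this, sketched under assumptions rather than taken from the notebook: count the missing values per column, then fill the gaps with a constant while keeping an indicator column so the model can still tell which rows were originally missing. The `features` frame, the indicator name `off_peak_diff_missing`, and the fill value 0.0 are illustrative choices, and the value -0.2 is invented.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as the engineered data.
features = pd.DataFrame({
    "off_peak_diff_dec_jan": [np.nan, np.nan, 0.603219, -0.2],
    "contract_duration": [1096, 2566, 2245, 2192],
})

# Number of missing values in each column.
missing_counts = features.isna().sum()

# Record which rows were missing before filling them.
features["off_peak_diff_missing"] = (
    features["off_peak_diff_dec_jan"].isna().astype(int)
)

# Fill with a constant; 0.0 is an assumption, not a recommendation.
features["off_peak_diff_dec_jan"] = features["off_peak_diff_dec_jan"].fillna(0.0)
```

Whether a constant fill, a statistic such as the median, or dropping the column is appropriate depends on the model and on how many rows are affected.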