# Feature Engineering

---

1. Import packages
2. Load data
3. Feature engineering

---

## 1. Import packages

In [889]:
import pandas as pd

---
## 2. Load data

In [890]:
df = pd.read_csv('/kaggle/input/bcg-eda-forage/clean_data_after_eda.csv')
df["date_activ"] = pd.to_datetime(df["date_activ"], format='%Y-%m-%d')
df["date_end"] = pd.to_datetime(df["date_end"], format='%Y-%m-%d')
df["date_modif_prod"] = pd.to_datetime(df["date_modif_prod"], format='%Y-%m-%d')
df["date_renewal"] = pd.to_datetime(df["date_renewal"], format='%Y-%m-%d')

In [891]:
df.head(3)

Unnamed: 0.1,Unnamed: 0,id,channel_sales,cons_12m,cons_gas_12m,cons_last_month,date_activ,date_end,date_modif_prod,date_renewal,...,mean_3m_price_off_peak_var,mean_3m_price_peak_var,mean_3m_price_mid_peak_var,mean_3m_price_off_peak_fix,mean_3m_price_peak_fix,mean_3m_price_mid_peak_fix,mean_3m_price_off_peak,mean_3m_price_peak,mean_3m_price_mid_peak,churn
0,0,24011ae4ebbe3035111d65fa7c15bc57,foosdfpfkusacimwkcsosbicdxkicaua,0,54946,0,2013-06-15,2016-06-15,2015-11-01,2015-06-23,...,0.131756,0.092638,0.036909,42.497907,12.218665,8.145777,42.629663,12.311304,8.182687,1
1,1,d29c2c54acc38ff3c0614d0a653813dd,MISSING,4660,0,0,2009-08-21,2016-08-30,2009-08-21,2015-08-31,...,0.1476,0.0,0.0,44.44471,0.0,0.0,44.59231,0.0,0.0,0
2,2,764c75f661154dac3a6c254cd082ea7d,foosdfpfkusacimwkcsosbicdxkicaua,544,0,0,2010-04-16,2016-04-16,2010-04-16,2015-04-17,...,0.167798,0.088409,0.0,44.44471,0.0,0.0,44.612508,0.088409,0.0,0


---

## 3. Feature engineering

### Difference between off-peak prices in December and preceding January

Below is the code created by your colleague to calculate the feature described above. Use this code to re-create this feature and then think about ways to build on this feature to create features with a higher predictive power.

In [892]:
price_df = pd.read_csv('/kaggle/input/bcgvirtualforage/price_data.csv')
price_df["price_date"] = pd.to_datetime(price_df["price_date"], format='%Y-%m-%d')
price_df.head()

Unnamed: 0,id,price_date,price_off_peak_var,price_peak_var,price_mid_peak_var,price_off_peak_fix,price_peak_fix,price_mid_peak_fix
0,038af19179925da21a25619c5a24b745,2015-01-01,0.151367,0.0,0.0,44.266931,0.0,0.0
1,038af19179925da21a25619c5a24b745,2015-02-01,0.151367,0.0,0.0,44.266931,0.0,0.0
2,038af19179925da21a25619c5a24b745,2015-03-01,0.151367,0.0,0.0,44.266931,0.0,0.0
3,038af19179925da21a25619c5a24b745,2015-04-01,0.149626,0.0,0.0,44.266931,0.0,0.0
4,038af19179925da21a25619c5a24b745,2015-05-01,0.149626,0.0,0.0,44.266931,0.0,0.0


In [893]:
# Group off-peak prices by companies and month
monthly_price_by_id = price_df.groupby(['id', 'price_date']).agg({'price_off_peak_var': 'mean', 'price_off_peak_fix': 'mean'}).reset_index()

# Get january and december prices
jan_prices = monthly_price_by_id.groupby('id').first().reset_index()
dec_prices = monthly_price_by_id.groupby('id').last().reset_index()

# Calculate the difference
diff = pd.merge(dec_prices.rename(columns={'price_off_peak_var': 'dec_1', 'price_off_peak_fix': 'dec_2'}), jan_prices.drop(columns='price_date'), on='id')
diff['offpeak_diff_dec_january_energy'] = diff['dec_1'] - diff['price_off_peak_var']
diff['offpeak_diff_dec_january_power'] = diff['dec_2'] - diff['price_off_peak_fix']
diff = diff[['id', 'offpeak_diff_dec_january_energy','offpeak_diff_dec_january_power']]
diff.head()

Unnamed: 0,id,offpeak_diff_dec_january_energy,offpeak_diff_dec_january_power
0,0002203ffbb812588b632b9e628cc38d,-0.006192,0.162916
1,0004351ebdd665e6ee664792efc4fd13,-0.004104,0.177779
2,0010bcc39e42b3c2131ed2ce55246e3c,0.050443,1.5
3,0010ee3855fdea87602a5b7aba8e42de,-0.010018,0.162916
4,00114d74e963e47177db89bc70108537,-0.003994,-1e-06
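
One possible way to build on this feature, sketched under assumptions: alongside the December-minus-January difference, the price range within the year (maximum minus minimum) per customer captures swings that the two end points can miss. The `toy_prices` frame, its values, and the column names `offpeak_diff_dec_jan_energy` and `offpeak_range_energy` are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical toy data with the same layout as price_df (values are made up)
toy_prices = pd.DataFrame({
    "id": ["a"] * 3 + ["b"] * 3,
    "price_date": pd.to_datetime(["2015-01-01", "2015-06-01", "2015-12-01"] * 2),
    "price_off_peak_var": [0.10, 0.14, 0.12, 0.20, 0.18, 0.15],
})

# Sort explicitly so that "first" and "last" mean earliest and latest month
toy_prices = toy_prices.sort_values(["id", "price_date"])
grouped = toy_prices.groupby("id")["price_off_peak_var"]

toy_features = pd.DataFrame({
    # Same idea as the colleague's feature: last month minus first month
    "offpeak_diff_dec_jan_energy": grouped.last() - grouped.first(),
    # Extra candidate feature: largest price swing within the year
    "offpeak_range_energy": grouped.max() - grouped.min(),
}).reset_index()

print(toy_features)
```

Customer `a` peaks mid-year, so its range is larger than its end-to-end difference; that is the kind of signal the range feature would add.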


In [894]:
new_df=df.merge(diff,on='id')

In [895]:
new_df.head()

Unnamed: 0.1,Unnamed: 0,id,channel_sales,cons_12m,cons_gas_12m,cons_last_month,date_activ,date_end,date_modif_prod,date_renewal,...,mean_3m_price_mid_peak_var,mean_3m_price_off_peak_fix,mean_3m_price_peak_fix,mean_3m_price_mid_peak_fix,mean_3m_price_off_peak,mean_3m_price_peak,mean_3m_price_mid_peak,churn,offpeak_diff_dec_january_energy,offpeak_diff_dec_january_power
0,0,24011ae4ebbe3035111d65fa7c15bc57,foosdfpfkusacimwkcsosbicdxkicaua,0,54946,0,2013-06-15,2016-06-15,2015-11-01,2015-06-23,...,0.036909,42.497907,12.218665,8.145777,42.629663,12.311304,8.182687,1,0.020057,3.700961
1,1,d29c2c54acc38ff3c0614d0a653813dd,MISSING,4660,0,0,2009-08-21,2016-08-30,2009-08-21,2015-08-31,...,0.0,44.44471,0.0,0.0,44.59231,0.0,0.0,0,-0.003767,0.177779
2,2,764c75f661154dac3a6c254cd082ea7d,foosdfpfkusacimwkcsosbicdxkicaua,544,0,0,2010-04-16,2016-04-16,2010-04-16,2015-04-17,...,0.0,44.44471,0.0,0.0,44.612508,0.088409,0.0,0,-0.00467,0.177779
3,3,bba03439a292a1e166f80264c16191cb,lmkebamcaaclubfxadlmueccxoimlema,1584,0,0,2010-03-30,2016-03-30,2010-03-30,2015-03-31,...,0.0,44.44471,0.0,0.0,44.593296,0.0,0.0,0,-0.004547,0.177779
4,4,149d57cf92fc41cf94415803a877cb4b,MISSING,4425,0,526,2010-01-13,2016-03-07,2010-01-13,2015-03-09,...,0.073719,40.728885,24.43733,16.291555,40.848791,24.539003,16.365274,0,-0.006192,0.162916


In [896]:
# Drop the leftover CSV index column and the customer id: neither is a meaningful predictor
new_df=new_df.drop(columns=['Unnamed: 0','id'])

In [897]:
# Replace the date_activ and date_end columns with the total contract length in days
new_df['days']=(new_df['date_end']- new_df['date_activ']).dt.days
new_df=new_df.drop(columns=['date_activ','date_end'])
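
A minimal sketch of the same date-to-number conversion on made-up rows, to show what `.dt.days` returns. The `activ_month` column is a hypothetical extra feature added only for illustration; it is not used in this notebook.

```python
import pandas as pd

# Two made-up contracts in the same date format as the dataset
toy = pd.DataFrame({
    "date_activ": pd.to_datetime(["2013-06-15", "2010-04-16"]),
    "date_end": pd.to_datetime(["2016-06-15", "2016-04-16"]),
})

# Subtracting two datetime columns gives a timedelta; .dt.days turns it into whole days
toy["days"] = (toy["date_end"] - toy["date_activ"]).dt.days

# Hypothetical extra feature: calendar month in which the contract was activated
toy["activ_month"] = toy["date_activ"].dt.month

# The raw datetime columns are dropped once the numeric features exist
toy = toy.drop(columns=["date_activ", "date_end"])
print(toy)
```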

In [898]:
new_df['days'].value_counts()

1461    2451
2557    2401
2192    2203
1827    2069
1096     645
        ... 
2782       1
4454       1
2525       1
1770       1
2579       1
Name: days, Length: 1410, dtype: int64

In [899]:
# Drop the remaining datetime columns, which are not used as features
new_df=new_df.drop(columns=['date_modif_prod','date_renewal'])

In [900]:
# Checking for imbalance of churn 
new_df['churn'].value_counts()

0    13186
1     1419
Name: churn, dtype: int64

In [901]:
# The target is imbalanced: churn=0 is the majority class and churn=1 the minority

In [902]:
# A model trained on this data may be biased towards churn=0, so we need to upsample the minority class
# We shall use SMOTE for upsampling
# Before that, we split the data into train and test sets so that no synthetic samples leak into the test set
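
The order of operations can be sketched on synthetic data: split first, then rebalance only the training part. To stay dependency-free, this sketch uses plain random oversampling (duplicating minority rows) as a stand-in for SMOTE. The `stratify` argument is an addition not used in this notebook, and all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic imbalanced data: 20 positives, 180 negatives
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([1] * 20 + [0] * 180)

# 1) Split first; stratify keeps the class ratio identical in both parts (assumption)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=41, stratify=y_toy
)

# 2) Rebalance the TRAIN part only (random oversampling, not SMOTE)
minority_idx = np.flatnonzero(y_tr == 1)
n_extra = int((y_tr == 0).sum() - minority_idx.size)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
X_bal = np.vstack([X_tr, X_tr[extra_idx]])
y_bal = np.concatenate([y_tr, y_tr[extra_idx]])

print("train counts after balancing:", np.bincount(y_bal))
print("test counts (left untouched):", np.bincount(y_te))
```

The test part keeps its original class ratio, so scores measured on it reflect the imbalance the model would face in practice.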

In [903]:

from sklearn.model_selection import train_test_split
X=new_df.drop(columns=['churn'],axis=1)
y=new_df['churn']

In [904]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=41)

In [905]:
X_train.head()

Unnamed: 0,channel_sales,cons_12m,cons_gas_12m,cons_last_month,forecast_cons_12m,forecast_cons_year,forecast_discount_energy,forecast_meter_rent_12m,forecast_price_energy_off_peak,forecast_price_energy_peak,...,mean_3m_price_mid_peak_var,mean_3m_price_off_peak_fix,mean_3m_price_peak_fix,mean_3m_price_mid_peak_fix,mean_3m_price_off_peak,mean_3m_price_peak,mean_3m_price_mid_peak,offpeak_diff_dec_january_energy,offpeak_diff_dec_january_power,days
12947,foosdfpfkusacimwkcsosbicdxkicaua,36116,0,2654,3784.55,2654,0.0,132.01,0.115939,0.100823,...,0.074744,40.728885,24.43733,16.291555,40.848632,24.539777,16.366299,-0.009779,0.162916,1827
3800,lmkebamcaaclubfxadlmueccxoimlema,26159,4670,0,3614.23,0,0.0,16.3,0.144149,0.0,...,0.0,44.26693,0.0,0.0,44.41237,0.0,0.0,-0.003994,-1e-06,1094
2158,foosdfpfkusacimwkcsosbicdxkicaua,6943,0,785,476.28,431,0.0,130.31,0.11691,0.100572,...,0.076257,40.728885,24.43733,16.291555,40.848801,24.539562,16.367812,-0.009528,0.162916,1461
11537,ewpakwlliwisiwduibdlfmalxowmwpci,19591,0,2721,1987.99,2721,0.0,116.82,0.115182,0.098841,...,0.074516,40.728885,24.43733,16.291555,40.84706,24.537821,16.366071,-0.011269,0.162916,1460
5874,foosdfpfkusacimwkcsosbicdxkicaua,14337,0,3518,1424.62,3518,0.0,129.01,0.1169,0.100015,...,0.073719,40.728885,24.43733,16.291555,40.848791,24.539003,16.365274,-0.006192,0.162916,2191


In [906]:
y_train.head()

12947    0
3800     1
2158     0
11537    1
5874     0
Name: churn, dtype: int64

In [907]:
num_feature=[ft for ft in new_df.columns if new_df[ft].dtype !='O']
cat_feature=[ft for ft in new_df.columns if new_df[ft].dtype =='O']

In [908]:
num_feature

['cons_12m',
 'cons_gas_12m',
 'cons_last_month',
 'forecast_cons_12m',
 'forecast_cons_year',
 'forecast_discount_energy',
 'forecast_meter_rent_12m',
 'forecast_price_energy_off_peak',
 'forecast_price_energy_peak',
 'forecast_price_pow_off_peak',
 'imp_cons',
 'margin_gross_pow_ele',
 'margin_net_pow_ele',
 'nb_prod_act',
 'net_margin',
 'num_years_antig',
 'pow_max',
 'mean_year_price_off_peak_var',
 'mean_year_price_peak_var',
 'mean_year_price_mid_peak_var',
 'mean_year_price_off_peak_fix',
 'mean_year_price_peak_fix',
 'mean_year_price_mid_peak_fix',
 'mean_year_price_off_peak',
 'mean_year_price_peak',
 'mean_year_price_mid_peak',
 'mean_6m_price_off_peak_var',
 'mean_6m_price_peak_var',
 'mean_6m_price_mid_peak_var',
 'mean_6m_price_off_peak_fix',
 'mean_6m_price_peak_fix',
 'mean_6m_price_mid_peak_fix',
 'mean_6m_price_off_peak',
 'mean_6m_price_peak',
 'mean_6m_price_mid_peak',
 'mean_3m_price_off_peak_var',
 'mean_3m_price_peak_var',
 'mean_3m_price_mid_peak_var',
 'mean_3m

In [909]:
cat_feature

['channel_sales', 'has_gas', 'origin_up']

In [910]:
# imp_num_feature = num_feature without the target column churn
imp_num_feature=[ft for ft in num_feature if ft!='churn']

In [911]:
# Build a data preprocessing pipeline for imputation and one-hot encoding
# (there is no feature scaling step: the tree-based model used later does not need it)
from sklearn.impute import SimpleImputer # handling missing values
from sklearn.preprocessing import OneHotEncoder # categorical encoding


# Pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer


In [912]:
# Numerical pipeline
num_pipeline=Pipeline(
    steps=[
        ('imputer',SimpleImputer(strategy='median')),

    ]
)
# Categorical pipeline
cat_pipeline=Pipeline(
    steps=[
        ('imputer',SimpleImputer(strategy='most_frequent')),
        # Note: from scikit-learn 1.2 onwards this argument is named sparse_output
        ("encoder", OneHotEncoder(handle_unknown="ignore",sparse=False)),

    ]
)


preprocessor=ColumnTransformer([
    ('num_pipeline',num_pipeline,imp_num_feature),
    ('cat_pipeline',cat_pipeline,cat_feature),
])

In [913]:
# Preprocessing
# Fit the pipelines on the train set only, then apply them to both the train set and the test set
X_train=pd.DataFrame(preprocessor.fit_transform(X_train),columns=preprocessor.get_feature_names_out())
X_test=pd.DataFrame(preprocessor.transform(X_test),columns=preprocessor.get_feature_names_out())



In [914]:
#Apply SMOTE
to_smote_df=X_train
to_smote_df['churn']=y_train.tolist()


In [915]:
to_smote_df.head()

Unnamed: 0,num_pipeline__cons_12m,num_pipeline__cons_gas_12m,num_pipeline__cons_last_month,num_pipeline__forecast_cons_12m,num_pipeline__forecast_cons_year,num_pipeline__forecast_discount_energy,num_pipeline__forecast_meter_rent_12m,num_pipeline__forecast_price_energy_off_peak,num_pipeline__forecast_price_energy_peak,num_pipeline__forecast_price_pow_off_peak,...,cat_pipeline__channel_sales_usilxuppasemubllopkaafesmlibmsdf,cat_pipeline__has_gas_f,cat_pipeline__has_gas_t,cat_pipeline__origin_up_MISSING,cat_pipeline__origin_up_ewxeelcelemmiwuafmddpobolfuxioce,cat_pipeline__origin_up_kamkkxfxxuwbdslkwifmmcsiusiuosws,cat_pipeline__origin_up_ldkssxwpmemidmecebumciepifcamkci,cat_pipeline__origin_up_lxidpiddsbxsbosboudacockeimpuepw,cat_pipeline__origin_up_usapbepcfoloekilkwsdiboslwaxobdp,churn
0,36116.0,0.0,2654.0,3784.55,2654.0,0.0,132.01,0.115939,0.100823,40.606701,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
1,26159.0,4670.0,0.0,3614.23,0.0,0.0,16.3,0.144149,0.0,44.311378,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1
2,6943.0,0.0,785.0,476.28,431.0,0.0,130.31,0.11691,0.100572,40.606701,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
3,19591.0,0.0,2721.0,1987.99,2721.0,0.0,116.82,0.115182,0.098841,40.606701,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1
4,14337.0,0.0,3518.0,1424.62,3518.0,0.0,129.01,0.1169,0.100015,40.606701,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0


In [916]:
from imblearn.over_sampling import SMOTE
oversample=SMOTE()
X_train,y_train=oversample.fit_resample(to_smote_df.iloc[:,:-1],to_smote_df.iloc[:,-1])

In [917]:
# After SMOTE
y_train.value_counts()

0    10534
1    10534
Name: churn, dtype: int64

In [931]:
# Select the top 15 features; by default SelectKBest scores them with the ANOVA F-test (f_classif)
from sklearn.feature_selection import SelectKBest
select = SelectKBest(k=15)
X_new_train=pd.DataFrame(data=select.fit_transform(X_train,y_train),columns=select.get_feature_names_out())

In [932]:
X_new_train.head()

Unnamed: 0,num_pipeline__cons_12m,num_pipeline__cons_gas_12m,num_pipeline__cons_last_month,num_pipeline__margin_gross_pow_ele,num_pipeline__margin_net_pow_ele,num_pipeline__num_years_antig,num_pipeline__mean_year_price_mid_peak_var,num_pipeline__mean_year_price_mid_peak_fix,num_pipeline__mean_year_price_mid_peak,num_pipeline__days,cat_pipeline__channel_sales_MISSING,cat_pipeline__channel_sales_foosdfpfkusacimwkcsosbicdxkicaua,cat_pipeline__channel_sales_lmkebamcaaclubfxadlmueccxoimlema,cat_pipeline__origin_up_kamkkxfxxuwbdslkwifmmcsiusiuosws,cat_pipeline__origin_up_lxidpiddsbxsbosboudacockeimpuepw
0,36116.0,0.0,2654.0,3.81,3.81,4.0,0.074639,16.242678,16.317318,1827.0,0.0,1.0,0.0,0.0,1.0
1,26159.0,4670.0,0.0,31.2,31.2,3.0,0.0,0.0,0.0,1094.0,0.0,0.0,1.0,0.0,0.0
2,6943.0,0.0,785.0,29.67,29.67,4.0,0.075744,16.264402,16.340146,1461.0,0.0,1.0,0.0,0.0,1.0
3,19591.0,0.0,2721.0,25.02,25.02,3.0,0.074305,16.258971,16.333277,1460.0,0.0,0.0,0.0,0.0,1.0
4,14337.0,0.0,3518.0,36.0,36.0,6.0,0.07316,16.280694,16.353854,2191.0,0.0,1.0,0.0,0.0,1.0


In [933]:
selected_feature=X_new_train.columns

In [934]:
# Model training using RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
# The keyword is max_features (plural); max_feature raises a TypeError
rfc=RandomForestClassifier(n_estimators=100,max_depth=1,max_features=1)


In [935]:
rfc.fit(X_new_train,y_train)

# Selection of the best metrics for evaluation

## As the dataset is imbalanced, accuracy is not a good metric for evaluation.
## In this predictive model, our interest is in correctly identifying the customers who are likely to churn. A customer who actually churned but was predicted as not churned (a False Negative) is the error we most need to avoid. A customer who did not churn but is predicted to churn (a False Positive) is much less of an issue.
## So we can go ahead with Recall and F2 score for model evaluation.
## We shall also use the ROC_AUC score, because Recall on its own can be inflated simply by lowering the decision threshold, whereas ROC_AUC summarises performance across all thresholds.
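
To make the weighting concrete, here is a small sketch of the F-beta formula computed from confusion-matrix counts for the positive (churn) class. The counts are made up. The notebook's own call uses `average='weighted'`, which averages over both classes, so its numbers are not directly comparable to this single-class version.

```python
def fbeta_from_counts(tp, fp, fn, beta):
    """F-beta for the positive class; beta > 1 weights recall above precision."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Made-up counts: 60 churners caught, 40 churners missed, 140 false alarms
tp, fp, fn = 60, 140, 40

f1 = fbeta_from_counts(tp, fp, fn, beta=1)
f2 = fbeta_from_counts(tp, fp, fn, beta=2)
print(round(f1, 3), round(f2, 3))
```

Here recall (0.6) is higher than precision (0.3), so F2 comes out above F1: the metric rewards catching churners even at the cost of false alarms.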



In [936]:
from sklearn.metrics import recall_score,fbeta_score,roc_auc_score

The threshold value can be changed based on input from domain experts. Changing it will also affect the recall score.

In [937]:
threshold = 0.5

predicted_proba = rfc.predict_proba(X_test[selected_feature])
y_pred = (predicted_proba [:,1] >= threshold).astype('int')

train_predicted_proba=rfc.predict_proba(X_new_train)
y_pred_train = (train_predicted_proba [:,1] >= threshold).astype('int')
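
The effect of the threshold can be illustrated with a handful of made-up probabilities (plain Python, no model involved): lowering the threshold raises recall but lets more false positives through.

```python
# Made-up labels and predicted churn probabilities
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
proba = [0.9, 0.6, 0.45, 0.2, 0.7, 0.4, 0.35, 0.3, 0.1, 0.05]

def recall_precision(threshold):
    """Recall and precision of the positive class at a given threshold."""
    pred = [int(p >= threshold) for p in proba]
    tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

for thr in (0.5, 0.3):
    rec, prec = recall_precision(thr)
    print(f"threshold={thr}: recall={rec:.2f}, precision={prec:.2f}")
```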

In [938]:
print("Recall Score TEST: ",recall_score(y_test,y_pred))
print("Recall Score TRAIN: ",recall_score(y_train,y_pred_train))
print("F2 Score TEST: ",fbeta_score(y_test,y_pred,beta=2,average='weighted'))
print("F2 Score TRAIN: ",fbeta_score(y_train,y_pred_train,beta=2,average='weighted'))


Recall Score TEST:  0.6171003717472119
Recall Score TRAIN:  0.7892538446933738
F2 Score TEST:  0.6044600649378931
F2 Score TRAIN:  0.6766065010138366


In [939]:
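# Note: this scores the thresholded labels; passing predicted_proba[:,1] instead gives the threshold-independent AUC
roc_auc_score(y_test,y_pred)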

0.5996512416805441
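
A short sketch, on made-up values, of why the input to `roc_auc_score` matters: hard 0/1 labels collapse the ROC curve to a single operating point, while probabilities use the full ranking of the samples.

```python
from sklearn.metrics import roc_auc_score

# Made-up ground truth and predicted probabilities of the positive class
y_true_demo = [0, 0, 0, 1, 1, 1]
proba_demo = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]

# Hard labels at a 0.5 threshold
labels_demo = [int(p >= 0.5) for p in proba_demo]

auc_from_labels = roc_auc_score(y_true_demo, labels_demo)
auc_from_proba = roc_auc_score(y_true_demo, proba_demo)
print(round(auc_from_labels, 3), round(auc_from_proba, 3))
```

On this toy example the label-based value is lower than the probability-based one; the two generally differ, so it helps to state which was used when reporting the score.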

In [940]:
from sklearn.model_selection import GridSearchCV
rfc=RandomForestClassifier()

In [954]:
# Hyperparameter grid for tuning
parameter={
    'n_estimators':[100,150],
    'max_depth':[1],
    'max_features':[1,3,5],
}

In [953]:
grid=GridSearchCV(estimator=rfc,param_grid=parameter,cv=5,scoring='recall',return_train_score=True,verbose=3)
grid.fit(X_new_train,y_train)

Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 1/5] END max_depth=1, max_features=1, n_estimators=100;, score=(train=0.794, test=0.662) total time=   0.6s
[CV 2/5] END max_depth=1, max_features=1, n_estimators=100;, score=(train=0.710, test=0.739) total time=   0.6s
[CV 3/5] END max_depth=1, max_features=1, n_estimators=100;, score=(train=0.748, test=0.788) total time=   0.6s
[CV 4/5] END max_depth=1, max_features=1, n_estimators=100;, score=(train=0.759, test=0.795) total time=   0.6s
[CV 5/5] END max_depth=1, max_features=1, n_estimators=100;, score=(train=0.779, test=0.822) total time=   0.6s


In [None]:
print("Best Parameters:",grid.best_params_)
print("Best CV Score:",grid.best_score_)  # mean cross-validated recall of the best candidate, not a train score
print("Test Score:",grid.score(X_test[selected_feature],y_test))

In [959]:
# Checking CV score of best parameters
rfc_cv=RandomForestClassifier(max_depth=1,n_estimators=100,max_features=1)
grid=GridSearchCV(estimator=rfc_cv,param_grid={},cv=5,scoring='recall',return_train_score=True,verbose=3)
grid.fit(X_new_train,y_train)

Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV 1/5] END ..............., score=(train=0.750, test=0.627) total time=   0.6s
[CV 2/5] END ..............., score=(train=0.753, test=0.778) total time=   0.6s
[CV 3/5] END ..............., score=(train=0.779, test=0.812) total time=   0.6s
[CV 4/5] END ..............., score=(train=0.715, test=0.726) total time=   0.6s
[CV 5/5] END ..............., score=(train=0.718, test=0.748) total time=   0.6s
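
As an alternative to calling `GridSearchCV` with an empty grid, scikit-learn's `cross_validate` returns the per-fold train and test scores of a single fixed model. This is a sketch on synthetic data, with `random_state` added for reproducibility; it does not reproduce the numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the training data
X_cv, y_cv = make_classification(n_samples=300, n_features=8, random_state=0)

model = RandomForestClassifier(
    n_estimators=100, max_depth=1, max_features=1, random_state=0
)

# Five-fold cross-validation scored on recall, keeping the train scores as well
cv_results = cross_validate(
    model, X_cv, y_cv, cv=5, scoring="recall", return_train_score=True
)

print("train recall per fold:", cv_results["train_score"].round(3))
print("test recall per fold: ", cv_results["test_score"].round(3))
```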


# Performance of the model:
## The model achieved a recall of about 0.62 on the test data, compared with about 0.79 on the training data. The cross-validation scores, however, show no sign of overfitting. The F2 score was about 0.60 on the test data and about 0.68 on the training data. The ROC_AUC score was 0.6.

# Why did the model underperform?
## The dataset was imbalanced. Although we balanced the training data using SMOTE, the added points were synthetically generated and may not resemble real-life data, which can be one reason for the underperformance. There were also many features, some of which may be noise; we tried to overcome this by selecting the top 15 features.

# Choice of evaluation metrics?
## As the dataset is imbalanced, accuracy is not a good metric for evaluation.
## Missing a customer who actually churns (a False Negative) is the costly error, while flagging a customer who does not churn (a False Positive) matters much less, so Recall and F2 score were used.
## The ROC_AUC score was used as well, because Recall on its own can be inflated simply by lowering the decision threshold.

# Advantages:
## RandomForestClassifier is a bagging ensemble technique, which reduces variance and so helped us avoid overfitting.
## Feature scaling was not required, as RFC is built on decision trees, which split on thresholds and are unaffected by the scale of a feature.
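
The point about scaling can be checked with a small sketch on synthetic data: a decision tree fitted on features multiplied by a constant makes the same predictions as one fitted on the raw features. The data and the factor of 1000 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a rescaled copy of it
X_raw, y_lbl = make_classification(n_samples=200, n_features=5, random_state=0)
X_scaled = X_raw * 1000.0

# Same tree settings and seed, fitted once per version of the features
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_raw, y_lbl)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y_lbl)

# Splits depend only on the ordering of values, so the predictions should match
same_predictions = np.array_equal(
    tree_raw.predict(X_raw), tree_scaled.predict(X_scaled)
)
print(same_predictions)
```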

# Disadvantages:
## The recall and F2 scores on the test data were only average (about 0.6).
## Poor results were observed when too many (more than 15) features were used.


## The model performed decently on the test data, yielding a recall of about 0.6 and an F2 score of about 0.6. This might have happened because our test set was derived from an imbalanced dataset, with many more 0s than 1s, whereas the model was trained on balanced data and hence was not biased towards either class. The test scores were decent, and performance may improve on new incoming data, though this would need to be verified.

# How could the client save money?
## Our model correctly identifies 61.7% of the customers who actually churn (its recall on the test data). Had the client assumed that churn is price sensitive, it would have given the 20% discount to these customers. As confirmed earlier, however, churn shows no such correlation with prices, so the client can save these resources. For example, if 100 customers churn in total and the discount costs 100 dollars per customer on average, the client would have incurred a cost of 61.7 × 100 = 6170 dollars. As the hypothesis is not true, the company can save this amount, i.e. 61.7 dollars per churned customer on average.
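
The arithmetic of this example can be written out as a short calculation. The figures are the ones assumed in the text (100 churners, 100 dollars per discount, recall of 0.617); the variable names are illustrative.

```python
n_churners = 100        # customers who churn (assumed in the example)
recall = 0.617          # share of churners the model identifies
discount_cost = 100     # average cost of the discount per customer, in dollars

# Churners who would have been offered the discount
n_flagged = recall * n_churners

# Total spend avoided, and the same amount averaged over all churners
total_saving = n_flagged * discount_cost
saving_per_churner = total_saving / n_churners

print(round(total_saving, 2), round(saving_per_churner, 2))
```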