# Renew Power Hiring Hackathon MachineHack

**Name**: Mahesh Chandra Duddu

**Email-Id**: duddumaheshchandra@gmail.com

**PhNo**: +91-9440642368

Thank you **Renew Power** and **MachineHack** for providing me with this opportunity.

Run this notebook on Kaggle with the updated dataset attached. To run it locally, first change the data paths in the cell that reads the dataset.
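
One possible way to keep a single notebook working in both environments is to pick the data directory at runtime. This is only a sketch: the Kaggle path is the one used in this notebook, while `LOCAL_DIR` is a hypothetical folder name that you would replace with your own location.

```python
from pathlib import Path

# Path used by this notebook on Kaggle
KAGGLE_DIR = Path("../input/renew-power-hiring-hackathon/ReNew_Participants_Data")
# Hypothetical local folder; change it to wherever the CSV files are stored
LOCAL_DIR = Path("data")

# Use the Kaggle directory when it exists, otherwise fall back to the local one
DATA_DIR = KAGGLE_DIR if KAGGLE_DIR.exists() else LOCAL_DIR

train_path = DATA_DIR / "train.csv"
test_path = DATA_DIR / "test.csv"
print(train_path, test_path)
```

The resulting paths can then be passed to `pd.read_csv` in place of the hard-coded strings.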

# Proposed Approach(Moderately Scalable & Highly Generalizable)
The proposed approach trains a separate model for each turbine id on fully engineered features with a square root data transformation, uses 5-fold cross-validation, and takes the median of the test predictions made by the fold models.
* **Data Cleaning**: No null values or duplicate rows were found.
* **Handling Outliers**: A square root transformation is applied to handle skewness in the data.
* **Feature Engineering**: Every pair of features is taken without repetition, and new features are created from the quotient, sum, product and difference of each pair.
* **Feature Selection**: Did not improve the model results.
* **Feature Scaling**: Standardization followed by normalization is performed.
* **Model Building**: Several algorithms were tried, but ExtraTreesRegressor performed best among them and is selected as the model in the proposed approach.
* **Hyperparameter Tuning**: Did not improve the model results.
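
The pairwise feature engineering step listed above can be illustrated on a toy frame. This is a minimal sketch rather than the notebook's exact code: the helper name `add_pairwise_features` and the toy columns `a`, `b`, `c` are made up for illustration, and the `1 +` in the denominator mirrors the feature engineering cell later in this notebook.

```python
from itertools import combinations

import pandas as pd


def add_pairwise_features(df, cols):
    """Return a copy of df with ratio, product, sum and difference columns for every pair in cols."""
    new_cols = {}
    for a, b in combinations(cols, 2):  # each unordered pair exactly once
        new_cols[f"{a}/{b}"] = df[a] / (1 + df[b])  # 1 is added to the denominator, as in this notebook
        new_cols[f"{a}*{b}"] = df[a] * df[b]
        new_cols[f"{a}+{b}"] = df[a] + df[b]
        new_cols[f"{a}-{b}"] = df[a] - df[b]
    # Concatenate once instead of inserting columns one by one
    return pd.concat([df, pd.DataFrame(new_cols, index=df.index)], axis=1)


toy = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]})
engineered = add_pairwise_features(toy, ["a", "b", "c"])
print(engineered.shape)  # 3 original columns + 3 pairs * 4 operations = 15 columns
```

With 13 input features there are 78 pairs, so this scheme adds 312 columns per data frame.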

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Visualization Libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Feature Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Libraries required for Cross Validation and Hyperparameter Tuning
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import KFold, RandomizedSearchCV, train_test_split
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor, HistGradientBoostingRegressor, StackingRegressor, VotingRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.linear_model import LinearRegression, HuberRegressor, Lasso, Ridge
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor

In [None]:
# Reading dataset using pandas and storing it in dataframes
train = pd.read_csv('../input/renew-power-hiring-hackathon/ReNew_Participants_Data/train.csv')
test = pd.read_csv('../input/renew-power-hiring-hackathon/ReNew_Participants_Data/test.csv')
sample = pd.read_csv('../input/renew-power-hiring-hackathon/ReNew_Participants_Data/submission.csv')

In [None]:
# Printing first 5 lines of train data
train.head()

In [None]:
# Statistics of train data
train.describe()

In [None]:
# Type of features in the train data
train.info()

In [None]:
# Columns present in the train dataset
train.columns

# Null Values

In [None]:
#Number of null values in dataset
train.isnull().sum()

In [None]:
test.isnull().sum()

No null values!

# Duplicates

In [None]:
# Number of duplicates in the dataset
train.duplicated().sum()

In [None]:
test.duplicated().sum()

No duplicate rows:)

In [None]:
# Timestamp is the unique identifier of the data; it shouldn’t be used as an input to the model
# Remove the timestamp column from train data (the same drop for test data is left commented out below).
train.drop('timestamp', axis = 1, inplace = True)
# test.drop('timestamp', axis = 1, inplace = True)

In [None]:
# Rename column reactice_power_calculated_by_converter to fix the typo in its name
train.rename({'reactice_power_calculated_by_converter': 'reactive_power_calculated_by_converter'}, axis = 1, inplace = True)
test.rename({'reactice_power_calculated_by_converter': 'reactive_power_calculated_by_converter'}, axis = 1, inplace = True)

## Exploratory Data Analysis(EDA)

In [None]:
inp_features = ['active_power_calculated_by_converter', 'active_power_raw',
       'ambient_temperature', 'generator_speed', 'generator_winding_temp_max',
       'grid_power10min_average', 'nc1_inside_temp', 'nacelle_temp',
       'reactive_power_calculated_by_converter', 'reactive_power',
       'wind_direction_raw', 'wind_speed_raw', 'wind_speed_turbulence',
       'turbine_id']

## Univariate Distributions

# Histogram and Density Plots

In [None]:
ig = plt.figure(figsize = (30,28))

for i in range(13):
    plt.subplot(4,4,i+1)
    plt.title(inp_features[i])
    sns.histplot(train[inp_features[i]])

In [None]:
plt.title('Target')
sns.histplot(train['Target'])

In [None]:
ig = plt.figure(figsize = (30,28))

for i in range(13):
    plt.subplot(4,4,i+1)
    plt.title(inp_features[i])
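    # sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is used instead
    sns.histplot(train[inp_features[i]], kde=True, stat='density')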

In [None]:
plt.title('Target')
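# sns.distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is used instead
sns.histplot(train['Target'], kde=True, stat='density')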

# Box Plots(Univariate Distributions)

In [None]:
ig = plt.figure(figsize = (30,28))

for i in range(13):
    plt.subplot(4,4,i+1)
    sns.boxplot(x = train[inp_features[i]])

# Bar Plots

In [None]:
fig = plt.figure(figsize=(12,6))
plt.xlabel('Turbine id')
plt.title("Number of rows per turbine id")
plt.ylabel('Number of rows')
train['turbine_id'].value_counts(sort = True).plot.bar()
plt.show()

Target feature vs. independent features

Specifically, Target vs ('active_power_calculated_by_converter', 'active_power_raw',
       'ambient_temperature', 'generator_speed', 'generator_winding_temp_max',
       'grid_power10min_average', 'nc1_inside_temp', 'nacelle_temp',
       'reactive_power_calculated_by_converter', 'reactive_power',
       'wind_direction_raw', 'wind_speed_raw', 'wind_speed_turbulence',
       'turbine_id')

## Bivariate Plots

# Scatter Plot

In [None]:
ig = plt.figure(figsize = (50,50))

for i in range(13):
    plt.subplot(4,4,i+1)
    plt.xlabel(inp_features[i])
    plt.ylabel("Temperature of rotor bearing")
    plt.scatter(x = train[inp_features[i]], y = train['Target'])

In [None]:
fig = plt.figure(figsize=(20,15))
plt.scatter(x = train[inp_features[13]], y = train['Target'])
plt.xlabel(inp_features[13])
plt.ylabel("Temperature of rotor bearing")
plt.show()

# BoxPlot

In [None]:
fig = plt.figure(figsize=(25, 25))
ax=sns.boxplot(data = train, x = inp_features[13], y = 'Target');
ax.set_yscale('log')

# Joint Plot

In [None]:
ind = 0

In [None]:
plt.figure(figsize=(25,15))
sns.jointplot(x = inp_features[ind], y = 'Target', data = train)
plt.show()
ind += 1

In [None]:
plt.figure(figsize=(25,15))
sns.jointplot(x = inp_features[ind], y = 'Target', data = train)
plt.show()
ind += 1

In [None]:
plt.figure(figsize=(25,15))
sns.jointplot(x = inp_features[ind], y = 'Target', data = train)
plt.show()
ind += 1

In [None]:
plt.figure(figsize=(25,15))
sns.jointplot(x = inp_features[ind], y = 'Target', data = train)
plt.show()
ind += 1

In [None]:
plt.figure(figsize=(25,15))
sns.jointplot(x = inp_features[ind], y = 'Target', data = train)
plt.show()
ind += 1

In [None]:
plt.figure(figsize=(25,15))
sns.jointplot(x = inp_features[ind], y = 'Target', data = train)
plt.show()
ind += 1

In [None]:
plt.figure(figsize=(25,15))
sns.jointplot(x = inp_features[ind], y = 'Target', data = train)
plt.show()
ind += 1

In [None]:
plt.figure(figsize=(25,15))
sns.jointplot(x = inp_features[ind], y = 'Target', data = train)
plt.show()
ind += 1

In [None]:
plt.figure(figsize=(25,15))
sns.jointplot(x = inp_features[ind], y = 'Target', data = train)
plt.show()
ind += 1

In [None]:
plt.figure(figsize=(25,15))
sns.jointplot(x = inp_features[ind], y = 'Target', data = train)
plt.show()
ind += 1

In [None]:
plt.figure(figsize=(25,15))
sns.jointplot(x = inp_features[ind], y = 'Target', data = train)
plt.show()
ind += 1

In [None]:
plt.figure(figsize=(25,15))
sns.jointplot(x = inp_features[ind], y = 'Target', data = train)
plt.show()
ind += 1

In [None]:
plt.figure(figsize=(15,55))
sns.jointplot(x = inp_features[ind], y = 'Target', data = train)
plt.show()
ind +=1

In [None]:
plt.figure(figsize=(15,65))
sns.jointplot(x = inp_features[ind], y = 'Target', data = train)
plt.show()

# Bar Plots

In [None]:
plt.figure(figsize=(25,25))
sns.barplot(x='turbine_id', y='Target', data = train)
plt.show()

## MultiVariate Plots

# Heatmap

In [None]:
plt.figure(figsize=(15,15))
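# Restrict to numeric columns so that corr() does not fail on string columns such as turbine_id in recent pandas versions
sns.heatmap(train.select_dtypes('number').corr(), cmap = 'BrBG', annot=True);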

# Pair Plot

In [None]:
sns.pairplot(train)

# LMplot

In [None]:
plt.figure(figsize=(15,5))
sns.lmplot(x = inp_features[0], y = 'Target', data = train, hue = 'turbine_id', fit_reg=False)

In [None]:
plt.figure(figsize=(15,5))
sns.lmplot(x = inp_features[1], y = 'Target', data = train, hue = 'turbine_id', fit_reg=False)

In [None]:
plt.figure(figsize=(15,5))
sns.lmplot(x = inp_features[2], y = 'Target', data = train, hue = 'turbine_id', fit_reg=False)

In [None]:
plt.figure(figsize=(15,5))
sns.lmplot(x = inp_features[3], y = 'Target', data = train, hue = 'turbine_id', fit_reg=False)

In [None]:
plt.figure(figsize=(15,5))
sns.lmplot(x = inp_features[4], y = 'Target', data = train, hue = 'turbine_id', fit_reg=False)

In [None]:
plt.figure(figsize=(15,5))
sns.lmplot(x = inp_features[5], y = 'Target', data = train, hue = 'turbine_id', fit_reg=False)

In [None]:
plt.figure(figsize=(15,5))
sns.lmplot(x = inp_features[6], y = 'Target', data = train, hue = 'turbine_id', fit_reg=False)

In [None]:
plt.figure(figsize=(15,5))
sns.lmplot(x = inp_features[7], y = 'Target', data = train, hue = 'turbine_id', fit_reg=False)

In [None]:
plt.figure(figsize=(15,5))
sns.lmplot(x = inp_features[8], y = 'Target', data = train, hue = 'turbine_id', fit_reg=False)

In [None]:
plt.figure(figsize=(15,5))
sns.lmplot(x = inp_features[9], y = 'Target', data = train, hue = 'turbine_id', fit_reg=False)

In [None]:
plt.figure(figsize=(15,5))
sns.lmplot(x = inp_features[10], y = 'Target', data = train, hue = 'turbine_id', fit_reg=False)

In [None]:
plt.figure(figsize=(15,5))
sns.lmplot(x = inp_features[11], y = 'Target', data = train, hue = 'turbine_id', fit_reg=False)

In [None]:
plt.figure(figsize=(15,5))
sns.lmplot(x = inp_features[12], y = 'Target', data = train, hue = 'turbine_id', fit_reg=False)

# Handling Outliers

* I tried flooring and capping of outliers. The model score improved a lot when checked on train data, but the model failed to generalize well to test data. Hence, this part is kept only as a commented-out experiment.
* I tried a data transformation on the features whose distributions are highly skewed (skewness greater than 1). This improved the model score slightly.
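
To see why a sign-preserving square root helps with skewed features, the sketch below applies it to synthetic data. The data here is randomly generated for illustration only (a shifted lognormal sample, so that some values are negative), and `signed_sqrt` is a hypothetical helper name, not part of this notebook.

```python
import numpy as np
import pandas as pd


def signed_sqrt(s):
    """Square root of the magnitude, keeping the original sign: sign(x) * sqrt(|x|)."""
    return np.sign(s) * np.sqrt(np.abs(s))


# Synthetic right-skewed sample; the shift by 0.5 produces some negative values
rng = np.random.default_rng(0)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000) - 0.5)

transformed = signed_sqrt(skewed)

print("skew before:", skewed.skew())
print("skew after: ", transformed.skew())
```

The transformation compresses the long right tail while leaving the sign of every value unchanged, which is what the data transformation cell in this notebook relies on.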



# Flooring and Capping

In [None]:
# outlier_rmv_features = ['active_power_calculated_by_converter',
#  'active_power_raw',
#  'generator_speed',
#  'generator_winding_temp_max',
#  'grid_power10min_average',
#  'nc1_inside_temp',
#  'nacelle_temp',
#  'reactive_power_calculated_by_converter',
#  'reactive_power',
#  'wind_speed_raw',
#  'wind_speed_turbulence',
#  'Target']

In [None]:
# for i in outlier_rmv_features:
#     lower_quartile = train[i].quantile(0.05)
#     upper_quartile = train[i].quantile(0.95)
#     train[i] = np.where(train[i] < lower_quartile, lower_quartile, train[i])
#     train[i] = np.where(train[i] > upper_quartile, upper_quartile, train[i])
#     if i != 'Target':
#         lower_quartile = test[i].quantile(0.05)
#         upper_quartile = test[i].quantile(0.95)
#         test[i] = np.where(test[i] < lower_quartile, lower_quartile, test[i])
#         test[i] = np.where(test[i] > upper_quartile, upper_quartile, test[i])

## Feature Engineering

In [None]:
train.skew(numeric_only = True).sort_values(ascending = False)

In [None]:
test.skew(numeric_only = True).sort_values(ascending = False)

# Data Transformation

In [None]:
# A few columns are highly skewed.
# Apply a square root transformation that preserves the sign: sign(x) * sqrt(|x|)
skewed_cols = ['wind_speed_turbulence', 'reactive_power_calculated_by_converter', 'reactive_power',
               'active_power_calculated_by_converter', 'active_power_raw', 'grid_power10min_average']

for col in skewed_cols:
    train[col] = np.sign(train[col]) * np.sqrt(np.abs(train[col]))
    test[col] = np.sign(test[col]) * np.sqrt(np.abs(test[col]))

# The target is left untransformed:
# train['Target'] = train['Target'].apply(lambda x: -np.sqrt(np.abs(x)) if x < 0 else np.sqrt(np.abs(x)))

In [None]:
train.skew(numeric_only = True)

In [None]:
train.kurtosis(numeric_only = True).sort_values(ascending = False)

In [None]:
turbines = list(np.sort(train['turbine_id'].unique()))
turbines

In [None]:
# Ratio Features, Product features, Additive, Subtractive Features
tr = train.drop(['Target', 'turbine_id'], axis = 1).columns

for i in range(len(tr)):
    for j in range(i+1, len(tr)):
        train[tr[i] + '/' + tr[j]] = train[tr[i]]/(1+train[tr[j]])
        train[tr[i] + '*' + tr[j]] = train[tr[i]]*(train[tr[j]])
        train[tr[i] + '+' + tr[j]] = train[tr[i]]+train[tr[j]]
        train[tr[i] + '-' + tr[j]] = train[tr[i]]-train[tr[j]]
        
        test[tr[i] + '/' + tr[j]] = test[tr[i]]/(1+test[tr[j]])
        test[tr[i] + '*' + tr[j]] = test[tr[i]]*(test[tr[j]])
        test[tr[i] + '+' + tr[j]] = test[tr[i]]+test[tr[j]]
        test[tr[i] + '-' + tr[j]] = test[tr[i]]-test[tr[j]]

In [None]:
train.columns

# Feature Selection

I tried different methods for feature selection, but the model score didn't improve, so feature selection was not used.
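
For reference, a model-based selection experiment of this kind can be sketched as follows. This runs on synthetic data from `make_regression`, not on the turbine data, and the column names `f0` to `f7` are made up; the selected columns are read from `get_support()`, which returns the boolean mask of kept features.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 8 features, only 3 of which carry signal
X_arr, y_toy = make_regression(n_samples=200, n_features=8, n_informative=3, random_state=42)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Keep features whose importance is above the mean importance (the default threshold here)
sel = SelectFromModel(RandomForestRegressor(n_estimators=50, random_state=42))
sel.fit(X_toy, y_toy)

mask = sel.get_support()  # boolean array, True for selected features
selected_feat = X_toy.columns[mask]
print(len(selected_feat), list(selected_feat))
```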

In [None]:
# X = train.drop(['turbine_id', 'Target'], axis = 1)
# X.reset_index(drop = True, inplace = True)
# y = train['Target']
# y.reset_index(drop = True, inplace = True)

In [None]:
# sc = StandardScaler()
# nc = MinMaxScaler()
# x = sc.fit_transform(X)
# X_scaled = pd.DataFrame(nc.fit_transform(x), columns = X.columns)

In [None]:
# # X_train, X_test, y_train, y_test = train_test_split(X_scaled,y,test_size=0.3, random_state = 42)

# sel = SelectFromModel(RandomForestRegressor(n_jobs = -1, random_state = 42))#code


# # fit sel on training data
# sel.fit(X, y)

In [None]:
# selected_feat = X_scaled.columns[sel.get_support()]
# # print length of selected_feat
# len(selected_feat)

In [None]:
# fs = list(selected_feat)
# print(fs)

In [None]:
# fs = ['active_power_calculated_by_converter', 'active_power_raw', 'ambient_temperature', 'generator_speed', 'generator_winding_temp_max', 'grid_power10min_average', 'nc1_inside_temp', 'nacelle_temp', 'reactive_power_calculated_by_converter', 'reactive_power', 'wind_direction_raw', 'wind_speed_raw', 'wind_speed_turbulence']

In [None]:
# train.drop(fs, axis = 1, inplace = True)
# test.drop(fs, axis = 1, inplace = True)

# Model Building(Based on Turbine ID using 5 Fold Cross Validation)

In [None]:
# Let's create a separate model for each turbine id
full_preds = []
full_scores = []
shp = []

for i in turbines:
    print("Model on ", i)
    turb_data = train[train['turbine_id'] == i]
    X = turb_data.drop(['turbine_id', 'Target'], axis = 1)
    X.reset_index(drop = True, inplace = True)
    y = turb_data['Target']
    y.reset_index(drop = True, inplace = True)
    te = test[test['turbine_id'] == i]
    te = te.drop(['turbine_id'], axis = 1)
    te.reset_index(drop = True, inplace = True)
    shp.append(te.shape[0])
    kf = KFold(n_splits=5,shuffle=True,random_state=42)
    pred_test_full = []
    cv_score = []
    j = 1

    for train_index, test_index in kf.split(X, y):
        print("{} of KFold {}".format(j, kf.n_splits))
        xtr,xvl = X.loc[train_index],X.loc[test_index]
        ytr,yvl = y.loc[train_index],y.loc[test_index]
        sc = StandardScaler()
        nc = MinMaxScaler()
        xt = sc.fit_transform(xtr)
        xv = sc.transform(xvl)
        t = sc.transform(te)
        xtr = pd.DataFrame(nc.fit_transform(xt), columns = xtr.columns)
        xvl = pd.DataFrame(nc.transform(xv), columns = xvl.columns)
        te_df = pd.DataFrame(nc.transform(t), columns = te.columns)
#         estimators = [('xtree', ExtraTreesRegressor(n_jobs = -1, random_state = 42)), ('rf', RandomForestRegressor(n_jobs = -1, random_state = 42))]
#         model = VotingRegressor(estimators = estimators)
        model = ExtraTreesRegressor(n_jobs = -1, random_state = 42)
        model.fit(xtr, ytr)
        score = mean_absolute_percentage_error(yvl,model.predict(xvl))
        print('MAPE score:',score)
        cv_score.append(score)    
        pred_test = model.predict(te_df)
        pred_test_full.append(pred_test)
        j += 1
    print("MEAN CV = ",np.mean(cv_score))
    lis = np.median(pred_test_full, axis = 0)
    full_scores.append(np.mean(cv_score))
    full_preds.append(lis)
#     if i == 'Turbine_01':
#         break

In [None]:
print("Mean of scores = ", np.mean(full_scores))

# Proposed Approach Result in local CV
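**Configuration: per-turbine models with the median of fold predictions, square root transform (target not included), full feature engineering (sum, difference, product and ratio features), standardization followed by normalization**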

**Mean of scores =  0.010849082802329121**
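
The median step can be shown with a small example. The numbers below are made up for illustration; each array stands for the test predictions of one fold model.

```python
import numpy as np

# Hypothetical test predictions from three fold models for three test rows
fold_preds = [
    np.array([10.0, 20.0, 30.0]),
    np.array([11.0, 19.0, 30.0]),
    np.array([12.0, 50.0, 31.0]),  # the second value is an outlying prediction
]

# Element-wise median and mean across folds (axis 0 runs over the fold models)
median_pred = np.median(fold_preds, axis=0)
mean_pred = np.mean(fold_preds, axis=0)

print(median_pred)  # [11. 20. 30.]
print(mean_pred)
```

For the second row the median stays at 20 while the mean is pulled towards the outlying fold, which is one reason to prefer the median when combining fold predictions.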


Other experimental results obtained:

* Mean of scores =  0.009773862662543736 with flooring and capping, standardization and normalization, and fully engineered features.
* Mean of scores =  0.010934307841619816 with std + norm and `+ - / *` features.
* Mean of scores =  0.011149893671289131 with std + norm and `/ *` features.
* Mean of scores =  0.011153299861025825 with std only and `/ *` features.

# Hyperparameter Tuning
Hyperparameter tuning did not improve the model score, so it is not used in the final approach.
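
The tuning code is not included in this notebook. As a rough idea of what such an experiment could look like, here is a minimal sketch with `RandomizedSearchCV` on synthetic data. The search space, the number of iterations and the data are all assumptions chosen to keep the example small; they are not the settings that were actually tried.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic regression data; the target is shifted to be positive so that MAPE is well behaved
X_toy, y_toy = make_regression(n_samples=200, n_features=6, noise=0.1, random_state=42)
y_toy = y_toy - y_toy.min() + 1.0

# Assumed search space, for illustration only
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": [1.0, 0.5],
}

search = RandomizedSearchCV(
    ExtraTreesRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=4,
    scoring="neg_mean_absolute_percentage_error",  # scikit-learn maximizes scores, hence the negation
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print("best CV MAPE:", -search.best_score_)
```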

In [None]:
# Keeping the predictions with respect to turbine id and their respective rows in submission file
for i in turbines:
    tes = test[test['turbine_id'] == i]
    indexes = list(tes.index)
    k = 0
    for j in indexes:
        sample.iloc[j] = full_preds[turbines.index(i)][k]
        k += 1
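
The row-by-row loop above could also be written with one assignment per turbine. The sketch below uses toy stand-ins (`test_toy`, `sample_toy`, `turbines_toy`, `preds_toy`) and assumes the submission file has a single column, here called `Target`; that column name is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for test, sample, turbines and full_preds
test_toy = pd.DataFrame({"turbine_id": ["T1", "T2", "T1", "T2", "T1"]})
sample_toy = pd.DataFrame({"Target": np.zeros(len(test_toy))})
turbines_toy = ["T1", "T2"]
preds_toy = [np.array([1.0, 2.0, 3.0]), np.array([10.0, 20.0])]

for turbine, preds in zip(turbines_toy, preds_toy):
    # Row positions of this turbine in the test data, in their original order
    positions = np.flatnonzero((test_toy["turbine_id"] == turbine).to_numpy())
    # Write all predictions for this turbine in one step
    sample_toy.iloc[positions, 0] = preds

print(sample_toy["Target"].tolist())  # [1.0, 10.0, 2.0, 20.0, 3.0]
```

Using row positions rather than index labels keeps the assignment correct even if the test frame does not have a default integer index.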

In [None]:
# Exporting the final predictions file
sample.to_csv('model_turbine_median_with_sqrttransform.csv', index = False)