# Global Power Plant Database

Problem Statement:

The Global Power Plant Database is a comprehensive, open source database of power plants around the world. It centralizes power plant data to make it easier to navigate, compare and draw insights for one’s own analysis. The database covers approximately 35,000 power plants from 167 countries and includes thermal plants (e.g. coal, gas, oil, nuclear, biomass, waste, geothermal) and renewables (e.g. hydro, wind, solar). Each power plant is geolocated and entries contain information on plant capacity, generation, ownership, and fuel type. It will be continuously updated as data becomes available.

Outcomes: we have to make two predictions: 1) primary fuel, 2) capacity in megawatts.

Key attributes of the database
The database includes the following indicators:

`country` (text): 3 character country code corresponding to the ISO 3166-1 alpha-3 specification [5]

`country_long` (text): longer form of the country designation

`name` (text): name or title of the power plant, generally in Romanized form

`gppd_idnr` (text): 10 or 12 character identifier for the power plant

`capacity_mw` (number): electrical generating capacity in megawatts

`latitude` (number): geolocation in decimal degrees; WGS84 (EPSG:4326)

`longitude` (number): geolocation in decimal degrees; WGS84 (EPSG:4326)

`primary_fuel` (text): energy source used in primary electricity generation or export

`other_fuel1` (text): energy source used in electricity generation or export

`other_fuel2` (text): energy source used in electricity generation or export


`other_fuel3` (text): energy source used in electricity generation or export

`commissioning_year` (number): year of plant operation, weighted by unit-capacity when data is available

`owner` (text): majority shareholder of the power plant, generally in Romanized form

`source` (text): entity reporting the data; could be an organization, report, or document, generally in Romanized form

`url` (text): web document corresponding to the `source` field

`geolocation_source` (text): attribution for geolocation information

`wepp_id` (text): a reference to a unique plant identifier in the widely-used PLATTS-WEPP database.

`year_of_capacity_data` (number): year the capacity information was reported

`generation_gwh_2013` (number): electricity generation in gigawatt-hours reported for the year 2013

`generation_gwh_2014` (number): electricity generation in gigawatt-hours reported for the year 2014

`generation_gwh_2015` (number): electricity generation in gigawatt-hours reported for the year 2015

`generation_gwh_2016` (number): electricity generation in gigawatt-hours reported for the year 2016

`generation_gwh_2017` (number): electricity generation in gigawatt-hours reported for the year 2017

`generation_gwh_2018` (number): electricity generation in gigawatt-hours reported for the year 2018

`generation_gwh_2019` (number): electricity generation in gigawatt-hours reported for the year 2019

`generation_data_source` (text): attribution for the reported generation information

`estimated_generation_gwh_2013` (number): estimated electricity generation in gigawatt-hours for the year 2013

`estimated_generation_gwh_2014` (number): estimated electricity generation in gigawatt-hours for the year 2014 

`estimated_generation_gwh_2015` (number): estimated electricity generation in gigawatt-hours for the year 2015 

`estimated_generation_gwh_2016` (number): estimated electricity generation in gigawatt-hours for the year 2016 

`estimated_generation_gwh_2017` (number): estimated electricity generation in gigawatt-hours for the year 2017 

`estimated_generation_note_2013` (text): label of the model/method used to estimate generation for the year 2013

`estimated_generation_note_2014` (text): label of the model/method used to estimate generation for the year 2014 

`estimated_generation_note_2015` (text): label of the model/method used to estimate generation for the year 2015

`estimated_generation_note_2016` (text): label of the model/method used to estimate generation for the year 2016

`estimated_generation_note_2017` (text): label of the model/method used to estimate generation for the year 2017 

# Import necessary libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Loading dataset

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/wri/global-power-plant-database/master/source_databases_csv/database_IND.csv')

In [None]:
data.head()

# EDA

In [None]:
data.shape

This dataset contains 907 rows and 27 columns, including the labels.

In [None]:
data.columns

In [None]:
data.dtypes

This dataset contains:

15 columns of float type, i.e. numeric data

12 columns of object type, i.e. categorical data

In [None]:
# to check overview information of dataset
data.info()

Here we can observe that most of the columns have missing values.

In [None]:
#to check the missing values in column
data.isnull().sum()

As we can see, most of the columns have missing values; we will impute them later.

In [None]:
data.nunique()

Here we can see that some columns have only 1 unique value, which makes them irrelevant for prediction, and some columns have 0 unique values, which means they contain only null values.

The columns [country, country_long, other_fuel2, year_of_capacity_data, generation_data_source] have only 1 unique value each, so we can drop them as irrelevant for prediction.

The columns [other_fuel3, wepp_id, generation_gwh_2013, generation_gwh_2019, estimated_generation_gwh] have 0 unique values, which means they contain only NaN values and can be dropped.
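The same two groups of columns can also be found programmatically instead of by reading the `nunique()` output. The sketch below is a minimal illustration on a toy frame; the helper name `find_uninformative_columns` and the toy values are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def find_uninformative_columns(df):
    """Return (constant_cols, empty_cols) from the count of distinct non-null values."""
    n_unique = df.nunique()  # NaN is not counted as a value
    constant_cols = n_unique[n_unique == 1].index.tolist()
    empty_cols = n_unique[n_unique == 0].index.tolist()
    return constant_cols, empty_cols

# Toy frame (hypothetical values) with one constant and one all-NaN column
toy = pd.DataFrame({
    "country": ["IND", "IND", "IND"],
    "wepp_id": [np.nan, np.nan, np.nan],
    "capacity_mw": [2.5, 98.0, 39.2],
})
constant_cols, empty_cols = find_uninformative_columns(toy)
print(constant_cols, empty_cols)
```

A column with a single value plus missing entries also lands in the first group, so it is worth checking whether the missingness itself carries information before dropping it.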

In [None]:
# check the unique values of the 'country' column
data['country'].value_counts()

In [None]:
data.columns

In [None]:
# dropping irrelevant columns
data.drop(columns=['country','country_long','other_fuel2','year_of_capacity_data','generation_data_source','other_fuel3','wepp_id','generation_gwh_2013','generation_gwh_2019'],axis=1,inplace=True)

In [None]:
data.drop(columns=['estimated_generation_gwh','owner'],axis=1,inplace=True)

In [None]:
data.columns

In [None]:
#checking the count of name column
data['name'].value_counts()

This column holds a unique name for each power plant, so it can be dropped.

In [None]:
data.drop(['name'],axis=1, inplace=True)

In [None]:
#checking the count of gppd_idnr column
data['gppd_idnr'].value_counts()

This column also contains a unique ID for each power plant, so it can be dropped as well.

In [None]:
data.drop(['gppd_idnr'],axis=1,inplace=True)

In [None]:
# dropping url column as it is irrelevant for prediction
data.drop(['url'],axis=1,inplace=True)

In [None]:
data.columns

# Missing values

In [None]:
data.isnull().sum()

Before applying imputation technique, let us check the skewness in dataset

In [None]:
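# numeric_only=True skips the object columns, which recent pandas versions refuse to skew
data.skew(numeric_only=True)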

The columns [longitude, commissioning_year, generation_gwh_2014, generation_gwh_2015, generation_gwh_2016, generation_gwh_2017, generation_gwh_2018] are numeric and strongly skewed, so their null values can be replaced by the median, which is less sensitive to extreme values than the mean.

In [None]:
#filling missing values by median
data['longitude'] = data['longitude'].fillna(data['longitude'].median())
data['commissioning_year'] = data['commissioning_year'].fillna(data['commissioning_year'].median())
data['generation_gwh_2014'] = data['generation_gwh_2014'].fillna(data['generation_gwh_2014'].median())
data['generation_gwh_2015'] = data['generation_gwh_2015'].fillna(data['generation_gwh_2015'].median())
data['generation_gwh_2016'] = data['generation_gwh_2016'].fillna(data['generation_gwh_2016'].median())
data['generation_gwh_2017'] = data['generation_gwh_2017'].fillna(data['generation_gwh_2017'].median())
data['generation_gwh_2018'] = data['generation_gwh_2018'].fillna(data['generation_gwh_2018'].median())

The column `latitude` has very little skewness, so its missing values can be replaced by the mean.

In [None]:
data['latitude'] = data['latitude'].fillna(data['latitude'].mean())

The columns ['other_fuel1', 'geolocation_source'] are categorical, so their missing values can be replaced by the mode.

In [None]:
# let us check the mode of other_fuel1
data['other_fuel1'].mode()

In [None]:
# let us check the mode of geolocation_source
data['geolocation_source'].mode()

In [None]:
#Replacing missing values in other_fuel1 and geolocation_source
data['other_fuel1'] = data['other_fuel1'].fillna(data['other_fuel1'].mode()[0])
data['geolocation_source'] = data['geolocation_source'].fillna(data['geolocation_source'].mode()[0])
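The imputation rule used above (median for skewed numeric columns, mean for roughly symmetric ones, mode for categorical ones) can be written as one helper. This is a sketch under assumptions: the function name `impute_by_skew` and the 0.5 skewness cut-off are illustrative choices, and the toy values are made up.

```python
import numpy as np
import pandas as pd

def impute_by_skew(df, skew_threshold=0.5):
    """Fill numeric NaNs with the median (skewed) or mean (symmetric), others with the mode."""
    out = df.copy()
    for col in out.columns:
        if out[col].isnull().sum() == 0:
            continue  # nothing to fill
        if pd.api.types.is_numeric_dtype(out[col]):
            skewed = abs(out[col].skew()) > skew_threshold
            fill = out[col].median() if skewed else out[col].mean()
        else:
            fill = out[col].mode()[0]
        out[col] = out[col].fillna(fill)
    return out

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "generation": [1.0, 2.0, 3.0, 100.0, np.nan],  # strongly right-skewed -> median
    "latitude": [10.0, 20.0, 30.0, np.nan, 20.0],  # symmetric -> mean
    "fuel": ["Oil", "Oil", None, "Gas", "Oil"],    # categorical -> mode
})
filled = impute_by_skew(toy)
print(filled)
```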

In [None]:
data.isnull().sum()

In [None]:
#let's visualize the null values with heatmap
sns.heatmap(data.isnull())

LET'S CHECK INFORMATION REGARDING TARGET COLUMN

In [None]:
data['capacity_mw'].value_counts()

In [None]:
data['primary_fuel'].value_counts()

In [None]:
# checking the list of values of commissioning_year
data['commissioning_year'].value_counts()

In [None]:
#Age calculation of power plant by subtracting commissioning year from 2018
data['power_plant_age'] = 2018-data['commissioning_year']
data.drop(['commissioning_year'],axis=1,inplace = True)   

In [None]:
data.head()

# Description of dataset

In [None]:
data.describe()

This gives summary statistics for the numeric columns. The summary looks plausible: no negative or otherwise invalid values are present. From the description above we can observe the following:

The counts are the same for all columns, which means no missing values remain in the dataset.

The mean is greater than the median (50%) in every column except latitude, which indicates right skew.

In latitude the median is slightly greater than the mean, which indicates left skew.

The large gap between the maximum and the 75th percentile suggests outliers in most of the columns; we will remove them before model building.

The minimum capacity_mw is 0 and the maximum is 4760, and the mean and standard deviation differ widely.

SEPARATING NUMERICAL AND CATEGORICAL COLUMNS

In [None]:
# Checking for Categorical columns
categorical_col = []
for i in data.dtypes.index:
    if data.dtypes[i]=='object':
        categorical_col.append(i)
print(categorical_col)

In [None]:
# Checking for numerical columns
numerical_col = []
for i in data.dtypes.index:
    if data.dtypes[i]!='object':
        numerical_col.append(i)
print(numerical_col)

# Data visualization

In [None]:
# Visualizing the types of fuel in primary_fuel
print(data["primary_fuel"].value_counts())
plt.figure(figsize=(10,10))
sns.countplot(x='primary_fuel', data=data)
plt.show()

Coal is the most common primary fuel. The plot counts plants per fuel type, not electricity generated. The classes are imbalanced, so we will balance them later.

In [None]:
# Visualizing the types of fuel in other_fuel1
print(data["other_fuel1"].value_counts())
plt.figure(figsize=(10,10))
sns.countplot(x='other_fuel1', data=data)
plt.show()

Here the count of oil is higher than that of cogeneration and gas.

In [None]:
# Visualizing the counts of geolocation_source
counts = data["geolocation_source"].value_counts()
print(counts)
fig, ax = plt.subplots(figsize=(10,8))
# take the labels from the counts so each slice is matched to its own category
ax.pie(counts, labels=counts.index, autopct='%1.2f%%', shadow=True)
plt.show()

Here we can see that the World Resources Institute (WRI) accounts for more of the geolocation data than the other two sources.

In [None]:
# Let's check the relation between geolocation_source and capacity_mw
plt.figure(figsize = (10,6))
sns.barplot(x = "geolocation_source", y = "capacity_mw", data = data)
plt.show()

The mean capacity of plants geolocated by WRI is about 350 MW, roughly 280 MW higher than for the other sources.

In [None]:
# Let's check how power_plant_age affects the capacity of the power plant
plt.figure(figsize=[10,6])
plt.title('Comparison between power_plant_age and capacity_mw')
sns.regplot(x=data['power_plant_age'],y=data['capacity_mw'],color = "g")

From the above plot we can observe a negative linear relation between the feature and the label: as power plant age increases, capacity tends to decrease.

In [None]:
# Let's check how latitude affects the capacity of the power plant
plt.figure(figsize=[10,6])
plt.title('Comparison between latitude and capacity_mw')
sns.regplot(x=data['latitude'],y=data['capacity_mw'],color = "g")

In [None]:
# Let's check how longitude affects the capacity of the power plant
plt.figure(figsize=[10,6])
plt.title('Comparison between longitude and capacity_mw')
sns.regplot(x=data['longitude'],y=data['capacity_mw'],color = "g")

In [None]:
# Let's check how generation_gwh_2014 relates to the capacity of the power plant
plt.figure(figsize=[10,6])
plt.title('Comparison between generation_gwh_2014 and capacity_mw')
sns.regplot(x=data['generation_gwh_2014'],y=data['capacity_mw'],color = "g")

In [None]:
# Let's check how generation_gwh_2015 relates to the capacity of the power plant
plt.figure(figsize=[10,6])
plt.title('Comparison between generation_gwh_2015 and capacity_mw')
sns.regplot(x=data['generation_gwh_2015'],y=data['capacity_mw'],color = "g")

In [None]:
# Let's check how generation_gwh_2016 relates to the capacity of the power plant
plt.figure(figsize=[10,6])
plt.title('Comparison between generation_gwh_2016 and capacity_mw')
sns.regplot(x=data['generation_gwh_2016'],y=data['capacity_mw'],color = "g")

In [None]:
# Let's check how generation_gwh_2017 relates to the capacity of the power plant
plt.figure(figsize=[10,6])
plt.title('Comparison between generation_gwh_2017 and capacity_mw')
sns.regplot(x=data['generation_gwh_2017'],y=data['capacity_mw'],color = "g")

In [None]:
# Let's check how generation_gwh_2018 relates to the capacity of the power plant
plt.figure(figsize=[10,6])
plt.title('Comparison between generation_gwh_2018 and capacity_mw')
sns.regplot(x=data['generation_gwh_2018'],y=data['capacity_mw'],color = "g")

In [None]:
# Let's check how the Power_plant_age affects the primary_fuel of the power plant
plt.figure(figsize=[10,6])
plt.title('Comparison between power_plant_age and primary_fuel')
sns.barplot(y='power_plant_age',x='primary_fuel',data=data)
plt.show()

From the above plot it can be observed that the oldest power plants mostly use hydro (water) to generate electricity, followed by nuclear and oil.

More recently constructed power plants use solar, coal, wind, and gas to generate electricity.

In [None]:
# Let's check how latitude relates to the primary_fuel of the power plant
plt.figure(figsize=[10,6])
plt.title('Comparison between latitude and primary_fuel')
sns.barplot(y='latitude',x='primary_fuel',data=data)
plt.show()

In [None]:
# Let's check how the longitude affects the primary_fuel of the power plant
plt.figure(figsize=[10,6])
plt.title('Comparison between longitude and primary_fuel')
sns.barplot(y='longitude',x='primary_fuel',data=data)
plt.show()

In [None]:
# Let's check how the generation_gwh_2014 affects the primary_fuel of the power plant
plt.figure(figsize=[10,6])
plt.title('Comparison between generation_gwh_2014 and primary_fuel')
sns.barplot(y='generation_gwh_2014',x='primary_fuel',data=data)
plt.show()

In [None]:
# Let's check how the generation_gwh_2015 affects the primary_fuel of the power plant
plt.figure(figsize=[10,6])
plt.title('Comparison between generation_gwh_2015 and primary_fuel')
sns.barplot(y='generation_gwh_2015',x='primary_fuel',data=data)
plt.show()

In [None]:
# Let's check how the generation_gwh_2016 affects the primary_fuel of the power plant
plt.figure(figsize=[10,6])
plt.title('Comparison between generation_gwh_2016 and primary_fuel')
sns.barplot(y='generation_gwh_2016',x='primary_fuel',data=data)
plt.show()

In [None]:
# Let's check how the generation_gwh_2017 affects the primary_fuel of the power plant
plt.figure(figsize=[10,6])
plt.title('Comparison between generation_gwh_2017 and primary_fuel')
sns.barplot(y='generation_gwh_2017',x='primary_fuel',data=data)
plt.show()

In [None]:
# Let's check how the generation_gwh_2018 affects the primary_fuel of the power plant
plt.figure(figsize=[10,6])
plt.title('Comparison between generation_gwh_2018 and primary_fuel')
sns.barplot(y='generation_gwh_2018',x='primary_fuel',data=data)
plt.show()

#### Let's check the relation between targets 

In [None]:
#Lets check the relation between primary_fuel and capacity_mw
plt.figure(figsize = (10,6))
plt.title("Comparison between primary_fuel and capacity_mw")
sns.barplot(x = "primary_fuel", y = "capacity_mw", data = data)
plt.show()

Here it can be observed that nuclear plants have the highest mean capacity, followed by coal and gas.

In [None]:
# checking the relation between feature to feature and feature to label
sns.pairplot(data=data)

This pairplot shows the pairwise relations between the columns, including the target variables.

In [None]:
#data distribution in every column
plt.figure(figsize=(25,35),facecolor= 'blue')
plotnumber = 1


for column in numerical_col:
    if plotnumber<=9:
        plt.subplot(3,3,plotnumber)
        ax=sns.histplot(data[column], kde=True)  # distplot is deprecated in recent seaborn
        plt.xlabel(column,fontsize=20)
        
    plotnumber+=1
plt.show()


From the above distribution plots we can observe that the data is not normally distributed, except for the longitude and latitude columns.

### Check for outliers 

In [None]:
# box plot of every numerical column to spot outliers
plt.figure(figsize=(25,35),facecolor= 'blue')
plotnumber = 1


for column in numerical_col:
    if plotnumber<=9:
        plt.subplot(3,3,plotnumber)
        ax=sns.boxplot(y=data[column])  # vertical box plot
        plt.xlabel(column,fontsize=20)
        
    plotnumber+=1
plt.show()

Here it can be observed that all the columns have outliers except latitude.

## Removing Outliers 

#### BY USING ZSCORE METHOD 

In [None]:
# Features having outliers
features = data[['longitude','generation_gwh_2014','generation_gwh_2015','generation_gwh_2016','generation_gwh_2017','generation_gwh_2018','power_plant_age']]

In [None]:
# Using zscore to remove outliers
from scipy.stats import zscore
z=np.abs(zscore(features))
z

In [None]:
#Creating new dataframe
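# keep rows whose z-score is below 3 in every feature; copy() lets us safely modify data_new later
data_new = data[(z<3).all(axis=1)].copy()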
data_new

New dataframe after removing outliers

In [None]:
print(data.shape)
print(data_new.shape)

Here we can see that 56 rows have been removed.

##### Checking data loss

In [None]:
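# percentage of rows removed, computed from the two shapes rather than hard-coded
data_loss = (data.shape[0]-data_new.shape[0])/data.shape[0]*100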
data_loss
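To make the z-score rule concrete, the sketch below applies the same filter (keep a row only if every selected column has |z| < 3) to a toy frame and reports the share of rows lost. The helper name `drop_zscore_outliers` and the toy data are assumptions; the z-score is computed by hand with the population standard deviation, which is what `scipy.stats.zscore` uses by default.

```python
import numpy as np
import pandas as pd

def drop_zscore_outliers(df, columns, threshold=3.0):
    """Keep rows whose |z-score| is below the threshold in every listed column."""
    values = df[columns].to_numpy(dtype=float)
    # population standard deviation (ddof=0); assumes no column is constant
    z = np.abs((values - values.mean(axis=0)) / values.std(axis=0))
    kept = df[(z < threshold).all(axis=1)].copy()
    loss_pct = (len(df) - len(kept)) / len(df) * 100
    return kept, loss_pct

# Toy frame (hypothetical values): one extreme generation value among 20 rows
toy = pd.DataFrame({"generation": [10.0] * 19 + [1000.0], "age": list(range(20))})
kept, loss_pct = drop_zscore_outliers(toy, ["generation", "age"])
print(len(kept), loss_pct)
```

Because extreme values inflate the mean and standard deviation they are measured against, a very small sample can hide its own outliers; the rule works better on datasets of a few hundred rows, as here.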

### Encoding the categorical column 

In [None]:
categorical_col = ['primary_fuel','other_fuel1','source','geolocation_source']

In [None]:
from sklearn.preprocessing import LabelEncoder
lbl = LabelEncoder()
data_new[categorical_col] = data_new[categorical_col].apply(lbl.fit_transform)

In [None]:
data_new[categorical_col]

This is the dataset after encoding
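Because the single `LabelEncoder` instance is refit on each column, only the classes of the last column stay available afterwards, so the integer codes of `primary_fuel` cannot be mapped back to fuel names from `lbl`. One possible alternative, sketched here with pandas categories rather than the notebook's encoder, keeps a code-to-label mapping per column. The helper name `encode_with_mapping` is an assumption; like `LabelEncoder`, it assigns codes in sorted label order.

```python
import pandas as pd

def encode_with_mapping(df, columns):
    """Integer-encode each column and return the code -> label mapping per column."""
    out = df.copy()
    mappings = {}
    for col in columns:
        cat = out[col].astype("category")  # categories are sorted alphabetically
        mappings[col] = dict(enumerate(cat.cat.categories))
        out[col] = cat.cat.codes
    return out, mappings

# Toy frame (hypothetical values)
toy = pd.DataFrame({"primary_fuel": ["Solar", "Coal", "Hydro", "Coal"]})
encoded, mappings = encode_with_mapping(toy, ["primary_fuel"])
print(encoded["primary_fuel"].tolist())
print(mappings["primary_fuel"])
```

With the mapping stored, predicted class numbers can later be reported as fuel names.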

## Correlation

In [None]:
cor = data_new.corr()
cor

### Correlation matrix using heatmap

In [None]:
# Visualizing the correlation matrix by plotting heatmap
plt.figure(figsize=(30,25))
sns.heatmap(data_new.corr(),linewidths=0.1,annot=True,linecolor='black',fmt='.2g',annot_kws={'size':15},cmap="YlGnBu")
plt.show()

This heatmap shows the correlation matrix. We can visualize the relations from feature to feature and from feature to label. The heatmap contains both positive and negative correlations.

CORRELATION BETWEEN CAPACITY_MW AND FEATURES

- The label capacity_mw is highly positively correlated with the features generation_gwh_2017, generation_gwh_2016, generation_gwh_2015 and generation_gwh_2014.

- The label is negatively correlated with the features primary_fuel, source and power_plant_age.

- The columns other_fuel1 and latitude show almost no relation with the label, so we can drop them.


CORRELATION BETWEEN PRIMARY_FUEL AND FEATURES

- The label primary_fuel is weakly correlated with power_plant_age and source.

- The label is negatively correlated with geolocation_source, longitude, capacity_mw, and all generation_gwh years.

- The features other_fuel1 and latitude have very little correlation with both labels, so we can drop these columns.
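Reading the strongest and weakest relations off a large heatmap is error-prone, so it can help to rank the features by the absolute value of their correlation with a label. The sketch below shows the idea on synthetic data; the helper name `rank_by_correlation` and the toy columns are assumptions.

```python
import numpy as np
import pandas as pd

def rank_by_correlation(df, target):
    """Return correlations with the target, ordered by absolute strength."""
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Synthetic data: 'generation' drives the target, 'noise' is unrelated
rng = np.random.default_rng(0)
generation = rng.normal(size=200)
toy = pd.DataFrame({
    "capacity_mw": 3 * generation + rng.normal(scale=0.1, size=200),
    "generation": generation,
    "noise": rng.normal(size=200),
})
ranked = rank_by_correlation(toy, "capacity_mw")
print(ranked)
```

Pearson correlation only captures linear association, and on label-encoded categorical columns the sign and size depend on the arbitrary code order, so a low value is a hint rather than proof that a column is useless.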

### Visualization of correlation of capacity_mw vs features

In [None]:
plt.figure(figsize=(25,20))
data_new.corr()['capacity_mw'].sort_values(ascending=False).drop(['capacity_mw']).plot(kind='bar',color='c')
plt.xlabel('Features',fontsize=10)
plt.ylabel('target',fontsize=10)
plt.title('Correlation between label and features using bar plot',fontsize=20)
plt.show()

### Visualization of correlation of primary_fuel vs features

In [None]:
plt.figure(figsize=(25,20))
data_new.corr()['primary_fuel'].sort_values(ascending=False).drop(['primary_fuel']).plot(kind='bar',color='c')
plt.xlabel('Features',fontsize=10)
plt.ylabel('target',fontsize=10)
plt.title('Correlation between label and features using bar plot',fontsize=20)
plt.show()

In [None]:
# Dropping irrelevant columns
data_new.drop("other_fuel1",axis=1,inplace=True)
data_new.drop("latitude",axis=1,inplace=True)

In [None]:
data_new

## Task - 1, Prediction for capacity_mw

In [None]:
x = data_new.drop("capacity_mw",axis=1)
y = data_new["capacity_mw"]

In [None]:
x.shape

In [None]:
y.shape

In [None]:
#checking for skewness
x.skew()

## Removing skewness using yeo-johnson method 

In [None]:
skew = ['longitude','generation_gwh_2014','generation_gwh_2015','generation_gwh_2016','generation_gwh_2017','generation_gwh_2018','power_plant_age']
from sklearn.preprocessing import PowerTransformer
# PowerTransformer's method parameter accepts 'box-cox' or 'yeo-johnson'
scaler = PowerTransformer(method = 'yeo-johnson')

In [None]:
x[skew] = scaler.fit_transform(x[skew].values)
x[skew].head()

In [None]:
# checking skewness after using yeo-johnson method
x.skew()

The skewness has now been reduced.
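For intuition about what the Yeo-Johnson transform does, the sketch below implements its formula directly in NumPy for a fixed `lmbda` and compares skewness before and after on synthetic right-skewed data. It is an illustration only: `PowerTransformer` estimates a separate lambda per column by maximum likelihood and also standardizes the output by default, whereas here `lmbda=0.0` is simply assumed, and the function names are hypothetical.

```python
import numpy as np

def yeo_johnson(x, lmbda):
    """Apply the Yeo-Johnson transform for a fixed lambda."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    if lmbda != 0:
        out[pos] = ((x[pos] + 1) ** lmbda - 1) / lmbda
    else:
        out[pos] = np.log1p(x[pos])
    if lmbda != 2:
        out[~pos] = -(((-x[~pos] + 1) ** (2 - lmbda)) - 1) / (2 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

def sample_skewness(x):
    """Third standardized moment (biased estimator)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic right-skewed data standing in for a generation column
rng = np.random.default_rng(0)
generation = rng.lognormal(mean=5.0, sigma=1.0, size=1000)
before = sample_skewness(generation)
after = sample_skewness(yeo_johnson(generation, lmbda=0.0))
print(before, after)
```

Unlike Box-Cox, the transform is defined for zero and negative inputs, which is why it suits columns such as `longitude` or generation values that contain zeros.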

In [None]:
# Checking distribution after removing skewness
plt.figure(figsize=(20,25))
plotnumber=1
for col in x[skew]:
    if plotnumber<=9:
        ax = plt.subplot(3,3,plotnumber)
        sns.kdeplot(x[col],color='g',fill=True)  # replaces the deprecated distplot call
        plt.xlabel(col,fontsize=20)
    plotnumber+=1
plt.show()

The data is still not normally distributed, but the skewness has been reduced.

## Feature scaling using standardization

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x = pd.DataFrame(scaler.fit_transform(x), columns=x.columns)
x

The features are now standardized: each column has zero mean and unit variance, so no feature dominates because of its scale.

## Check for multicollinearity

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif["VIF values"] = [variance_inflation_factor(x.values,i)
              for i in range(len(x.columns))]
vif["Features"] = x.columns
vif

Here all the columns have VIF values below 10, which means these columns are free from serious multicollinearity.
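The VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. With an intercept in each regression, the same numbers are the diagonal of the inverse correlation matrix, so for standardized features the result should agree with the statsmodels values above. The sketch below uses that identity on synthetic data; the helper name `vif_table` and the toy columns are assumptions.

```python
import numpy as np
import pandas as pd

def vif_table(features):
    """VIF per column, computed as the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(features.to_numpy(dtype=float), rowvar=False)
    vif = np.diag(np.linalg.inv(corr))
    return pd.DataFrame({"Features": features.columns, "VIF values": vif})

# Synthetic data: 'a_copy' is almost a duplicate of 'a', 'b' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
toy = pd.DataFrame({"a": a, "b": b, "a_copy": a + rng.normal(scale=0.1, size=500)})
table = vif_table(toy)
print(table)
```

The near-duplicate pair gets a very large VIF while the independent column stays close to 1, which is the pattern the threshold of 10 is meant to catch.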

## Modeling

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

###  Finding the best Random state

In [None]:
from sklearn.ensemble import RandomForestRegressor
maxAccu=0
maxRS=0
for i in range(1,200):
    x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.30,random_state=i)
    mod = RandomForestRegressor()
    mod.fit(x_train,y_train)
    pred = mod.predict(x_test)
    acc = r2_score(y_test,pred)
    if acc>maxAccu:
        maxAccu=acc
        maxRS=i
print("Maximum r2 score is ",maxAccu,"at Random_state",maxRS)

### Creating train test split 

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=.30,random_state=185)

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor as KNN
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn import metrics

## Random forest regressor

In [None]:
# Checking R2 score for RandomForestRegressor
RFR = RandomForestRegressor()
RFR.fit(x_train,y_train)
predRFR = RFR.predict(x_test)
print("R2_Score:",r2_score(y_test,predRFR))
print("MAE:",metrics.mean_absolute_error(y_test,predRFR))
print("MSE:",metrics.mean_squared_error(y_test,predRFR))
print("RMSE:",np.sqrt(metrics.mean_squared_error(y_test, predRFR)))
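The four numbers printed for each model follow directly from the residuals. As a reference for how they relate, the sketch below computes them with plain NumPy on made-up values; the helper name `regression_metrics` is an assumption and the numbers are not results from this dataset.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R2, MAE, MSE and RMSE computed from the residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.abs(residuals).mean()
    mse = (residuals ** 2).mean()
    ss_res = (residuals ** 2).sum()                 # unexplained variation
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()  # total variation
    return {"R2": 1.0 - ss_res / ss_tot, "MAE": mae, "MSE": mse, "RMSE": np.sqrt(mse)}

# Made-up capacities in MW
scores = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 310.0])
print(scores)
```

MAE and RMSE are in megawatts, the unit of the target, while R² is unitless; RMSE penalizes large errors more heavily than MAE.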

## Decision tree regressor

In [None]:
# Checking R2 score for DecisionTreeRegressor
DTR = DecisionTreeRegressor()
DTR.fit(x_train,y_train)
predDTR = DTR.predict(x_test)
print("R2_Score:",r2_score(y_test,predDTR))
print("MAE:",metrics.mean_absolute_error(y_test,predDTR))
print("MSE:",metrics.mean_squared_error(y_test,predDTR))
print("RMSE:",np.sqrt(metrics.mean_squared_error(y_test, predDTR)))

## Kneighbors Regressor 

In [None]:
# Checking R2 score for KNN Regressor
knn = KNN()
knn.fit(x_train,y_train)
predknn = knn.predict(x_test)
print("R2_Score:",r2_score(y_test,predknn))
print("MAE:",metrics.mean_absolute_error(y_test,predknn))
print("MSE:",metrics.mean_squared_error(y_test,predknn))
print("RMSE:",np.sqrt(metrics.mean_squared_error(y_test,predknn)))

## Gradient boosting Regressor 

In [None]:
# Checking R2 score for GradientBoostingRegressor
GB = GradientBoostingRegressor()
GB.fit(x_train,y_train)
predGB = GB.predict(x_test)
print("R2_Score:",r2_score(y_test,predGB))
print("MAE:",metrics.mean_absolute_error(y_test,predGB))
print("MSE:",metrics.mean_squared_error(y_test,predGB))
print("RMSE:",np.sqrt(metrics.mean_squared_error(y_test,predGB)))

## Bagging Regressor 

In [None]:
# Checking R2 score for BaggingRegressor
BR = BaggingRegressor()
BR.fit(x_train,y_train)
predBR = BR.predict(x_test)
print("R2_Score:",r2_score(y_test,predBR))
print("MAE:",metrics.mean_absolute_error(y_test,predBR))
print("MSE:",metrics.mean_squared_error(y_test,predBR))
print("RMSE:",np.sqrt(metrics.mean_squared_error(y_test,predBR)))

## Check for cross validation score 

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
# Checking cv score for RandomForestRegressor
print(cross_val_score(RFR,x,y,cv=5).mean())

In [None]:
# Checking cv score for DecisionTreeRegressor
print(cross_val_score(DTR,x,y,cv=5).mean())

In [None]:
# Checking cv score for KNN Regressor
print(cross_val_score(knn,x,y,cv=5).mean())

In [None]:
# Checking cv score for Gradient Boosting Regressor
print(cross_val_score(GB,x,y,cv=5).mean())

In [None]:
# Checking cv score for Bagging Regressor
print(cross_val_score(BR,x,y,cv=5).mean())

Judging by the difference between the R2 score and the cross-validation score, the random forest regressor fits this dataset best. Let's try to improve its score with hyperparameter tuning.

## Hyperparameter tuning

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
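# RandomForestRegressor
# Option names assume scikit-learn >= 1.0: 'squared_error'/'absolute_error' are the
# current names for 'mse'/'mae', and 1.0 (all features) stands in for 'auto'
parameters = {'criterion':['squared_error', 'absolute_error'],
             'max_features':[1.0, 'sqrt', 'log2'],
             'n_estimators':[100,200],  # must be at least 1
             'max_depth':[2,3,4,6]}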

In [None]:
GCV=GridSearchCV(RandomForestRegressor(),parameters,cv=5)

In [None]:
GCV.fit(x_train,y_train)

In [None]:
GCV.best_params_

These are the best parameters for the random forest regressor.

In [None]:
capacity = RandomForestRegressor(criterion='squared_error', max_depth=6, max_features='log2', n_estimators=200)
capacity.fit(x_train, y_train)
pred = capacity.predict(x_test)
# evaluate the tuned model on its own predictions
print("RMSE value:",np.sqrt(metrics.mean_squared_error(y_test, pred)))
print('R2_Score:',r2_score(y_test,pred)*100)

## Saving the model 

In [None]:
import joblib
joblib.dump(capacity,"Global_Power_Plant_capacity_mw.pkl")

In [None]:
capacity = joblib.load("Global_Power_Plant_capacity_mw.pkl")

In [None]:
import numpy as np
a = np.array(y_test)
predicted = np.array(capacity.predict(x_test))
df_new = pd.DataFrame({"Original":a,"Predicted":predicted},index= range(len(a)))
df_new

# Task - 2,  Prediction for primary_fuel

In [None]:
x_new = data_new.drop("primary_fuel",axis=1)
y_new = data_new['primary_fuel']

In [None]:
x_new.shape

In [None]:
y_new.shape

##  Checking for skewness

In [None]:
x_new.skew()

Here we can see that all the columns are skewed, and this skewness needs to be removed.

## Removing skewness by using yeo-johnson method 

In [None]:
# Making the skew less than or equal to 0.5 for better prediction using the yeo-johnson method
skew = ['capacity_mw','longitude','generation_gwh_2014','generation_gwh_2015','generation_gwh_2016','generation_gwh_2017','generation_gwh_2018','power_plant_age']

from sklearn.preprocessing import PowerTransformer
# PowerTransformer's method parameter accepts 'box-cox' or 'yeo-johnson'
scaler = PowerTransformer(method='yeo-johnson')

In [None]:
x_new[skew] = scaler.fit_transform(x_new[skew].values)
x_new[skew].head()

In [None]:
#Checking skewness after applying yeo-johnson method
x_new.skew()

The skewness has now been reduced in all the numerical columns.

In [None]:
# Visualizing the distribution after removing skewness
plt.figure(figsize=(20,25), facecolor='white')
plotnumber = 1

for column in x_new[skew]:
    if plotnumber<=9:
        ax = plt.subplot(3,3,plotnumber)
        sns.kdeplot(x_new[column],color='indigo',fill=True)  # replaces the deprecated distplot call
        plt.xlabel(column,fontsize=20)
    plotnumber+=1
plt.show()

Here we can notice that skewness has been removed but plot is still not normal

### SCALING

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_new = pd.DataFrame(scaler.fit_transform(x_new),columns = x_new.columns)
x_new

### Check for multicollinearity

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif["VIF values"] = [variance_inflation_factor(x_new.values,i)
              for i in range(len(x_new.columns))]
vif["Features"] = x_new.columns
vif

Here we can see that the VIF values for all the columns are below 10, which means they are free from serious multicollinearity.

In [None]:
#check for label distribution
y_new.value_counts()

Here the label is not balanced, so we will use an oversampling method to balance the data.

## Oversampling 

In [None]:
# oversampling the data
from imblearn.over_sampling import SMOTE
SM = SMOTE()
x_new,y_new = SM.fit_resample(x_new,y_new)

In [None]:
y_new.value_counts()

The label is balanced now.
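SMOTE balances the classes by creating synthetic minority samples rather than duplicating rows: it picks a minority point, picks one of its nearest minority neighbours, and places a new point somewhere on the line segment between them. The sketch below shows only that interpolation step in NumPy; it is a simplification, not the imblearn implementation, and the function name `smote_like_sample` and the choice `k=2` are assumptions.

```python
import numpy as np

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a random minority sample and one of its k nearest neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    minority = np.asarray(minority, dtype=float)
    i = rng.integers(len(minority))
    distances = np.linalg.norm(minority - minority[i], axis=1)
    # index 0 of the sorted order is the point itself (assuming no duplicate rows)
    neighbours = np.argsort(distances)[1:k + 1]
    j = rng.choice(neighbours)
    gap = rng.random()  # uniform in [0, 1)
    return minority[i] + gap * (minority[j] - minority[i])

# Four minority points on the corners of the unit square (toy data)
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(minority, rng=np.random.default_rng(0))
print(synthetic)
```

Since the synthetic rows are interpolated from existing ones, oversampling before the train/test split lets information from test rows leak into training rows; applying it to the training portion only gives a more honest accuracy estimate.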

In [None]:
x_new.head()

## Modeling 

###  finding the best random state

In [None]:
# primary_fuel is a class label, so the search uses a classifier and accuracy
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
maxAccu=0
maxRS=0
for i in range(1,200):
    x_new_train,x_new_test,y_new_train,y_new_test = train_test_split(x_new,y_new,test_size=0.30,random_state=i)
    mod = RandomForestClassifier()
    mod.fit(x_new_train,y_new_train)
    pred = mod.predict(x_new_test)
    acc = accuracy_score(y_new_test,pred)
    if acc>maxAccu:
        maxAccu=acc
        maxRS=i
print("Maximum accuracy is ",maxAccu,"at Random_state",maxRS)

In [None]:
x_new_train,x_new_test,y_new_train,y_new_test = train_test_split(x_new,y_new,test_size=.30,random_state=183)

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,BaggingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import classification_report,confusion_matrix,roc_curve,accuracy_score

## Decision Tree classifier 

In [None]:
# Checking Accuracy for DecisionTreeClassifier
DTC = DecisionTreeClassifier()
DTC.fit(x_new_train,y_new_train)
predDTC = DTC.predict(x_new_test)
print(accuracy_score(y_new_test,predDTC))
print(confusion_matrix(y_new_test,predDTC))
print(classification_report(y_new_test,predDTC))
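The accuracy and the per-class recall in the classification report can both be read off the confusion matrix, where rows are true labels and columns are predicted labels (the convention scikit-learn uses). The sketch below builds the matrix by hand on made-up labels; the helper name `confusion_counts` is an assumption and the numbers are not results from this dataset.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Count (true, predicted) pairs; rows are true classes, columns are predictions."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Made-up encoded labels for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_counts(y_true, y_pred, n_classes=3)
accuracy = np.trace(cm) / cm.sum()               # correct predictions lie on the diagonal
recall_per_class = np.diag(cm) / cm.sum(axis=1)  # share of each true class that was found
print(cm)
print(accuracy, recall_per_class)
```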

In [None]:
# Lets plot confusion matrix for DTC
cm = confusion_matrix(y_new_test,predDTC)
x_axis_labels = ["0","1","2","3","4","5","6","7"]
y_axis_labels = ["0","1","2","3","4","5","6","7"]
f , ax = plt.subplots(figsize=(7,7))
sns.heatmap(cm,annot=True,linewidths=.2,linecolor="black",fmt=".0f",ax=ax,cmap="ocean_r",xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion matrix for DTC")
plt.show()

## Random Forest Classifier 

In [None]:
# Checking Accuracy for RandomForestClassifier
RFC = RandomForestClassifier()
RFC.fit(x_new_train,y_new_train)
predRFC = RFC.predict(x_new_test)
print(accuracy_score(y_new_test,predRFC))
print(confusion_matrix(y_new_test,predRFC))
print(classification_report(y_new_test,predRFC))

The accuracy of the random forest classifier is 92%.

In [None]:
# Lets plot confusion matrix for RFC
cm = confusion_matrix(y_new_test,predRFC)
x_axis_labels = ["0","1","2","3","4","5","6","7"]
y_axis_labels = ["0","1","2","3","4","5","6","7"]
f , ax = plt.subplots(figsize=(7,7))
sns.heatmap(cm,annot=True,linewidths=.2,linecolor="black",fmt=".0f",ax=ax,cmap="ocean_r",xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion matrix for RFC")
plt.show()

##  Support vector machine classifier

In [None]:
# Checking Accuracy for SVC
svc = SVC()
svc.fit(x_new_train,y_new_train)
predsvc = svc.predict(x_new_test)
print(accuracy_score(y_new_test,predsvc))
print(confusion_matrix(y_new_test,predsvc))
print(classification_report(y_new_test,predsvc))

In [None]:
# Lets plot confusion matrix for SVC
cm = confusion_matrix(y_new_test,predsvc)
x_axis_labels = ["0","1","2","3","4","5","6","7"]
y_axis_labels = ["0","1","2","3","4","5","6","7"]
f , ax = plt.subplots(figsize=(7,7))
sns.heatmap(cm,annot=True,linewidths=.2,linecolor="black",fmt=".0f",ax=ax,cmap="ocean_r",xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion matrix for SVC")
plt.show()

## K Neighbors classifier 

In [None]:
# Checking Accuracy for KNeighborsClassifier
knn = KNN()
knn.fit(x_new_train,y_new_train)
predknn = knn.predict(x_new_test)
print(accuracy_score(y_new_test,predknn))
print(confusion_matrix(y_new_test,predknn))
print(classification_report(y_new_test,predknn))

In [None]:
# Lets plot confusion matrix for KNN
cm = confusion_matrix(y_new_test,predknn)
x_axis_labels = ["0","1","2","3","4","5","6","7"]
y_axis_labels = ["0","1","2","3","4","5","6","7"]
f , ax = plt.subplots(figsize=(7,7))
sns.heatmap(cm,annot=True,linewidths=.2,linecolor="black",fmt=".0f",ax=ax,cmap="ocean_r",xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion matrix for KNN")
plt.show()

## Gradient Boosting Classifier 

In [None]:
# Checking Accuracy for GradientBoostingClassifier
GB = GradientBoostingClassifier()
GB.fit(x_new_train,y_new_train)
predGB = GB.predict(x_new_test)
print(accuracy_score(y_new_test,predGB))
print(confusion_matrix(y_new_test,predGB))
print(classification_report(y_new_test,predGB))

In [None]:
# Lets plot confusion matrix for GradientBoostingClassifier
cm = confusion_matrix(y_new_test,predGB)
x_axis_labels = ["0","1","2","3","4","5","6","7"]
y_axis_labels = ["0","1","2","3","4","5","6","7"]
f , ax = plt.subplots(figsize=(7,7))
sns.heatmap(cm,annot=True,linewidths=.2,linecolor="black",fmt=".0f",ax=ax,cmap="ocean_r",xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion matrix for GradientBoostingClassifier")
plt.show()

## Bagging Classifier 

In [None]:
# Checking Accuracy for BaggingClassifier
BC = BaggingClassifier()
BC.fit(x_new_train,y_new_train)
predBC = BC.predict(x_new_test)
print(accuracy_score(y_new_test,predBC))
print(confusion_matrix(y_new_test,predBC))
print(classification_report(y_new_test,predBC))

In [None]:
# Lets plot confusion matrix for BaggingClassifier
cm = confusion_matrix(y_new_test,predBC)
x_axis_labels = ["0","1","2","3","4","5","6","7"]
y_axis_labels = ["0","1","2","3","4","5","6","7"]
f , ax = plt.subplots(figsize=(7,7))
sns.heatmap(cm,annot=True,linewidths=.2,linecolor="black",fmt=".0f",ax=ax,cmap="ocean_r",xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion matrix for BaggingClassifier")
plt.show()

## Check for cross validation score 

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
# cv score for DecisionTreeClassifier
print(cross_val_score(DTC,x_new,y_new,cv=5).mean())

In [None]:
# cv score for Random forest Classifier
print(cross_val_score(RFC,x_new,y_new,cv=5).mean())

In [None]:
# cv score for Support vector Classifier
print(cross_val_score(svc,x_new,y_new,cv=5).mean())

In [None]:
# cv score for Knn Classifier
print(cross_val_score(knn,x_new,y_new,cv=5).mean())

In [None]:
# cv score for Gradient boosting Classifier
print(cross_val_score(GB,x_new,y_new,cv=5).mean())

In [None]:
# cv score for Bagging Classifier
print(cross_val_score(BC,x_new,y_new,cv=5).mean())

The difference between the accuracy score and the cross-validation score is smaller for the gradient boosting classifier than for the other models, so the Gradient Boosting Classifier is selected as the best-fitting model.
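The selection rule described here (smallest gap between test accuracy and mean cross-validation score) can be written out explicitly. The following is a minimal sketch; the helper `smallest_gap` and the model names are hypothetical, and the numbers are placeholders, not results from this notebook.

```python
def smallest_gap(scores):
    """Return (model_name, gap) with the smallest |test accuracy - CV mean|."""
    gaps = {
        name: abs(test_acc - cv_mean)
        for name, (test_acc, cv_mean) in scores.items()
    }
    best = min(gaps, key=gaps.get)
    return best, gaps[best]

# Placeholder numbers for illustration only; they are NOT results from this notebook.
example_scores = {
    "model_a": (0.90, 0.85),
    "model_b": (0.88, 0.87),
    "model_c": (0.92, 0.80),
}
best_name, best_gap = smallest_gap(example_scores)
print(best_name, round(best_gap, 3))
```

In the notebook, each tuple would hold the value printed by `accuracy_score` and the mean returned by `cross_val_score` for the same model.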

## Hyperparameter tuning

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
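# Gradient Boosting Classifier
# The 'mse'/'mae' criteria and max_features='auto' have been removed from recent
# scikit-learn releases, and n_estimators must be at least 1 (100 is the default).
parameters = {'criterion':['friedman_mse','squared_error'],
             'max_features':['sqrt', 'log2', None],
             'n_estimators':[100,200],
             'max_depth':[2,3,4,5,6,8]}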

In [None]:
GCV=GridSearchCV(GradientBoostingClassifier(),parameters,cv=5)

In [None]:
GCV.fit(x_new_train,y_new_train)

In [None]:
GCV.best_params_
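Because `GridSearchCV` refits the best parameter combination on the full training data by default (`refit=True`), the fitted search object can also be used for prediction through `best_estimator_`. The sketch below shows this on synthetic data from `make_classification`, not on the power plant data; the grid values and the names `X_demo`, `small_grid`, and `search` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic 3-class data standing in for the real features (assumption).
X_demo, y_demo = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# A deliberately tiny grid so the sketch runs quickly.
small_grid = {"n_estimators": [20, 40], "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), small_grid, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)
# best_estimator_ is already refit on X_tr, so it can score the held-out split.
print(search.best_estimator_.score(X_te, y_te))
```

Using `best_estimator_` avoids copying the winning parameters by hand into a new model, which is one way the tuned model and the search result can drift apart.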

In [None]:
primary_fuel = GradientBoostingClassifier(criterion='friedman_mse', max_depth=8, max_features='sqrt', n_estimators=200)
primary_fuel.fit(x_new_train, y_new_train)
pred = primary_fuel.predict(x_new_test)
acc=accuracy_score(y_new_test,pred)
print(acc*100)

## Plotting ROC curves for the final model

In [None]:
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier


classifier = OneVsRestClassifier(primary_fuel)
y_score = classifier.fit(x_new_train, y_new_train).predict_proba(x_new_test)

#Binarize the output
y_new_test_bin = label_binarize(y_new_test, classes=[0,1,2,3,4,5,6,7])
n_classes = 8

# Compute ROC curve and AUC for all the classes
false_positive_rate = dict()
true_positive_rate = dict()
roc_auc = dict()
for i in range(n_classes):
    false_positive_rate[i], true_positive_rate[i], _ = roc_curve(y_new_test_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(false_positive_rate[i], true_positive_rate[i])
    
   
for i in range(n_classes):
    plt.plot(false_positive_rate[i], true_positive_rate[i], lw=2,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))

plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for multiclassification data')
plt.legend(loc="lower right")
plt.show()
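The one-vs-rest ROC curves above rely on turning the multiclass labels into one binary column per class. A plain-Python sketch of that step is shown here; `binarize_labels` and the toy labels are hypothetical, and for three or more classes `label_binarize` produces the same kind of indicator matrix.

```python
def binarize_labels(labels, classes):
    """Return one row per label with a 1 in the column of its class, else 0."""
    return [[1 if label == cls else 0 for cls in classes] for label in labels]

toy_labels = [0, 2, 1, 2]   # hypothetical encoded fuel types
toy_classes = [0, 1, 2]
rows = binarize_labels(toy_labels, toy_classes)
for row in rows:
    print(row)
```

Column `i` of the result is the binary target that `roc_curve` compares against the predicted probability for class `i`.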

## Saving the model 

In [None]:
# Saving the model using .pkl
import joblib
joblib.dump(primary_fuel,"Global_Power_Plant_Fuel_Type.pkl")
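To reuse a saved model later, `joblib.load` reads the file back into an equivalent object. The round trip is sketched below with a plain dictionary standing in for the fitted classifier; the file name `demo_model.pkl` and the variable names are assumptions for illustration.

```python
import os
import tempfile

import joblib

# A plain dictionary stands in for the fitted model (assumption for illustration).
stand_in_model = {"name": "demo", "classes": [0, 1, 2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    joblib.dump(stand_in_model, path)
    restored = joblib.load(path)

print(restored == stand_in_model)
```

With the real model, the loaded object would expose the same `predict` method as `primary_fuel`, provided a compatible scikit-learn version is installed when loading.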