#GLOBAL POWER PLANT
A power station, also referred to as a power plant and sometimes generating station or generating plant, is an industrial facility for the generation of electric power. Power stations are generally connected to an electrical grid.

Many power stations contain one or more generators, rotating machines that convert mechanical power into three-phase electric power. The relative motion between a magnetic field and a conductor creates an electric current.

The energy source harnessed to turn the generator varies widely. Most power stations in the world burn fossil fuels such as coal, oil, and natural gas to generate electricity. Low-carbon power sources include nuclear power and, increasingly, renewables such as solar, wind, geothermal, and hydroelectric.

#DATA DESCRIPTION
The Global Power Plant Database is a comprehensive, open source database of power plants around the world. It centralizes power plant data to make it easier to navigate, compare and draw insights for one’s own analysis. The database covers approximately 35,000 power plants from 167 countries and includes thermal plants (e.g. coal, gas, oil, nuclear, biomass, waste, geothermal) and renewables (e.g. hydro, wind, solar). Each power plant is geolocated and entries contain information on plant capacity, generation, ownership, and fuel type. It will be continuously updated as data becomes available.

In [None]:
import pandas as pd
import numpy as np

df= pd.read_csv('https://raw.githubusercontent.com/wri/global-power-plant-database/master/source_databases_csv/database_IND.csv')
df

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.columns

In [None]:
#Checking for the Columns containing Null

In [None]:
df.isnull().sum()

In [None]:
#checking of the data types of the dataset
df.dtypes

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull())
plt.title("Null Values")
plt.show()

In [None]:
df.describe(include=['O'])

Filling the columns whose null, blank, or empty values are within range or manageable up to some limit.

For numeric columns with missing values we use the mean (latitude) or the median (longitude, commissioning_year, and the generation columns); the median is less sensitive to skew and outliers.

For object-datatype columns with missing values we use the mode.

In [None]:
df["latitude"] = df["latitude"].fillna(df["latitude"].mean())
df["other_fuel1"] = df["other_fuel1"].fillna(df["other_fuel1"].mode()[0])
df["geolocation_source"] = df["geolocation_source"].fillna(df["geolocation_source"].mode()[0])
df["longitude"] = df["longitude"].fillna(df["longitude"].median())
df["commissioning_year"] = df["commissioning_year"].fillna(df["commissioning_year"].median())
df["generation_gwh_2014"] = df["generation_gwh_2014"].fillna(df["generation_gwh_2014"].median())
df["generation_gwh_2015"] = df["generation_gwh_2015"].fillna(df["generation_gwh_2015"].median())
df["generation_gwh_2016"] = df["generation_gwh_2016"].fillna(df["generation_gwh_2016"].median())
df["generation_gwh_2017"] = df["generation_gwh_2017"].fillna(df["generation_gwh_2017"].median())
df["generation_gwh_2018"] = df["generation_gwh_2018"].fillna(df["generation_gwh_2018"].median())
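
The per-column fills above can also be written as one dtype-driven helper. The sketch below is only an illustration on toy data: the function name `impute_by_dtype` is hypothetical, and it assumes the median for every numeric column (the cell above uses the mean for latitude) and the mode for everything else.

```python
import numpy as np
import pandas as pd

def impute_by_dtype(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric NaNs filled by the median and other columns by the mode."""
    out = frame.copy()
    for col in out.columns:
        if out[col].isnull().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            modes = out[col].mode()
            if not modes.empty:  # an all-null column has no mode
                out[col] = out[col].fillna(modes[0])
    return out

# Toy data, not taken from the database
toy = pd.DataFrame({
    "longitude": [70.0, np.nan, 80.0, 90.0],
    "geolocation_source": ["WRI", None, "WRI", "Industry About"],
})
filled = impute_by_dtype(toy)
print(filled)
```

Working on a copy keeps the original frame available for comparing the distributions before and after imputation.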


 We remove the unwanted columns:
    only IND is listed as the country, so the two country features ("country", "country_long") have no impact on prediction;
    several columns have no values at all, that is, they are completely blank;
    several columns are irrelevant to the prediction;
    some columns combine many unique values with many nulls, such as "owner", "gppd_idnr", and "url".

In [None]:
# Dropping all the  irrelevant columns..

df.drop(columns=["url","owner","gppd_idnr","name","country","country_long","other_fuel2", "year_of_capacity_data","generation_data_source","other_fuel3","wepp_id","estimated_generation_gwh","generation_gwh_2013","generation_gwh_2019"], axis=1, inplace=True)
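
Two of the reasons listed above (a single country value, completely blank columns) can be detected programmatically rather than by inspection. This is a small sketch on toy data; the helper name `uninformative_columns` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def uninformative_columns(frame: pd.DataFrame) -> list:
    """List columns that are entirely null or hold a single distinct value."""
    # nunique() ignores NaN, so an all-null column reports 0 distinct values
    return [col for col in frame.columns if frame[col].nunique() <= 1]

# Toy data, not taken from the database
toy = pd.DataFrame({
    "country": ["IND", "IND", "IND"],
    "wepp_id": [np.nan, np.nan, np.nan],
    "capacity_mw": [2.5, 98.0, 39.2],
})
to_drop = uninformative_columns(toy)
print(to_drop)
reduced = toy.drop(columns=to_drop)
```

Identifier-like columns with many unique values still need a manual decision, since high cardinality alone does not make a column irrelevant.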


In [None]:
df.isnull().sum()

In [None]:
df.shape

In [None]:
##Checking and Transforming the Data types of the Columns To Same DataTypes for Better Analysis
df.info()

In [None]:
df.describe()

In [None]:
# Checking the list of counts of capacity_mw
df['capacity_mw'].value_counts()

In [None]:
# Checking the list of counts of primary_fuel
df['primary_fuel'].value_counts()

In [None]:
# Checking the uniqueness of primary_fuel
df["primary_fuel"].unique()

In [None]:
from sklearn.preprocessing import LabelEncoder
le =LabelEncoder()

list1=['primary_fuel','other_fuel1','source','geolocation_source']
for val in list1:
  df[val]=le.fit_transform(df[val].astype(str))
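
Because the loop above reuses one `LabelEncoder`, only the mapping of the last column remains available afterwards. If the integer codes need to be translated back to fuel names later (for example when reading a confusion matrix), one option is to keep a separate encoder per column. A minimal sketch on toy data, with the `encoders` dictionary as an assumed name:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data, not taken from the database
toy = pd.DataFrame({
    "primary_fuel": ["Coal", "Hydro", "Solar", "Coal"],
    "geolocation_source": ["WRI", "WRI", "Industry About", "WRI"],
})

encoders = {}  # one fitted encoder per column
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col].astype(str))
    encoders[col] = enc

# Code -> label lookup for the fuel column
fuel_codes = dict(enumerate(encoders["primary_fuel"].classes_))
print(fuel_codes)

# Recover the original labels from the integer codes
decoded = encoders["primary_fuel"].inverse_transform(toy["primary_fuel"])
```

`LabelEncoder` assigns codes in sorted label order, so the numbering is alphabetical rather than meaningful.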

In [None]:
df

In [None]:
df['commissioning_year'].value_counts()

In [None]:
 #Let's extract power plant age from commissioning year by subtracting it from the year 2018

df["Power_plant_age"] = 2018 - df["commissioning_year"]
df.drop(columns=["commissioning_year"], inplace = True)

In [None]:
df

# EDA

In [None]:
print(df['capacity_mw'].value_counts())  
plt.figure(figsize=(10,10))
sns.countplot(x=df['capacity_mw'])  # recent seaborn versions require the keyword form
plt.show()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

print(df['primary_fuel'].value_counts())  
plt.figure(figsize=(5,5))
sns.countplot(x=df['primary_fuel'])
plt.show()

In [None]:
plt.figure(figsize = (10,6))
plt.title("Comparison between Primary Fuel and capacity_mw")
sns.barplot(x = "primary_fuel", y = "capacity_mw", data = df)
plt.show()

In [None]:
#checking the count of fuel1
print(df['other_fuel1'].value_counts())
plt.figure(figsize=(5,5))
sns.countplot(x=df['other_fuel1'])
plt.show()

In [None]:
# Visualizing the counts of geolocation_source
print(df["geolocation_source"].value_counts())
labels='WRI','Industry About','National Renewable Energy Laboratory'
plt.figure(figsize=(8,5))
sns.countplot(x=df['geolocation_source'])
plt.show()

In [None]:
##Correlation between features and target 'capacity_mw' ( EDA )

In [None]:
#Checking the relation between target capacity_mw and variable geolocation source

plt.figure(figsize=[10,6])
plt.style.use('ggplot')
plt.title('Comparison between geolocation_source and capacity_mw')
sns.scatterplot(x=df['geolocation_source'], y=df["capacity_mw"])

In [None]:
#Checking the relation between power plant age and capacity_mw

plt.figure(figsize=[10,6])
plt.style.use('ggplot')
plt.title('Comparison between Power_plant_age and capacity_mw')
sns.scatterplot(x=df['Power_plant_age'], y=df["capacity_mw"])

In [None]:
# Checking the relation between feature latitude and target capacity_mw

plt.figure(figsize=[10,6])
plt.style.use('ggplot')
plt.title('Comparison between latitude and capacity_mw')
sns.scatterplot(x=df['latitude'], y=df["capacity_mw"])

In [None]:
# Checking the relationship between feature longitude and target capacity_mw

plt.figure(figsize=[10,6])
plt.style.use('ggplot')
plt.title('Comparison between longitude and capacity_mw')
sns.regplot(x=df['longitude'], y=df["capacity_mw"]);


In [None]:
fig,axes=plt.subplots(2,2,figsize=(15,12))

#Checking the relation between feature generation_gwh_2014 and target capacity_mw
sns.scatterplot(x='generation_gwh_2014',y='capacity_mw',ax=axes[0,0],data=df,color="b")

#Checking the relation between feature generation_gwh_2015 and target capacity_mw
sns.scatterplot(x='generation_gwh_2015',y='capacity_mw',ax=axes[0,1],data=df,color="b")

#Checking the relation between feature generation_gwh_2016 and target capacity_mw
sns.scatterplot(x='generation_gwh_2016',y='capacity_mw',ax=axes[1,0],data=df,color="b")

#Checking the relation between feature generation_gwh_2017 and target capacity_mw
sns.scatterplot(x='generation_gwh_2017',y='capacity_mw',ax=axes[1,1],data=df,color="b")
plt.show()

In [None]:
#Checking the relation between target primary_fuel and feature Power_plant_age

plt.figure(figsize=[10,6])
plt.title('Comparison between Power_plant_age and Primary Fuel')
sns.barplot(x=df['Power_plant_age'], y=df["primary_fuel"])

In [None]:
# Checking the relation between feature latitude and target primary_fuel

plt.figure(figsize=[10,6])
plt.title('Comparison between latitude and Primary_Fuel')
sns.barplot(x=df['latitude'], y=df["primary_fuel"])

In [None]:
# Checking the relationship between feature longitude and target primary_fuel

plt.figure(figsize=[10,6])
plt.title('Comparison between longitude and Primary_Fuel')
sns.barplot(x=df['longitude'], y=df["primary_fuel"]);

In [None]:
fig,axes=plt.subplots(2,2,figsize=(15,12))

#Checking the relation between feature generation_gwh_2014 and target primary_fuel
sns.barplot(x='generation_gwh_2014',y='primary_fuel',ax=axes[0,0],data=df,color="b")

#Checking the relation between feature generation_gwh_2015 and target primary_fuel
sns.barplot(x='generation_gwh_2015',y='primary_fuel',ax=axes[0,1],data=df,color="b")

#Checking the relation between feature generation_gwh_2016 and target primary_fuel
sns.barplot(x='generation_gwh_2016',y='primary_fuel',ax=axes[1,0],data=df,color="b")

#Checking the relation between feature generation_gwh_2017 and target primary_fuel
sns.barplot(x='generation_gwh_2017',y='primary_fuel',ax=axes[1,1],data=df,color="b")
plt.show()

In [None]:
##Correlation
df.corr()

In [None]:
# Correlation with the target column primary_fuel

df.corr()['primary_fuel'].sort_values()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(15,7))
sns.heatmap(df.corr(), annot=True, linewidths=0.5,linecolor="black", fmt='.2f')


In [None]:
#Descriptive Statistics
df.describe()

In [None]:
plt.figure(figsize=(15,7))
sns.heatmap(round(df.describe()[1:].transpose(),2), annot=True, linewidths=0.5,linecolor="black", fmt='f')


In [None]:
df.info()

In [None]:
##Skewness

my_column = df.pop('capacity_mw')
df.insert(12, 'capacity_mw', my_column) 

my_column1 = df.pop('primary_fuel')
df.insert(12, 'primary_fuel', my_column1) 


df.head()

In [None]:
df.iloc[:,:-2].skew()

In [None]:
df.dtypes

In [None]:
from sklearn.preprocessing import power_transform
x_new=power_transform(df.iloc[:,:-2],method='yeo-johnson',standardize=True)

df.iloc[:,:-2]=pd.DataFrame(x_new,columns=df.iloc[:,:-2].columns)
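
To see what the Yeo-Johnson step does to a skewed column, the sketch below applies the same `power_transform` call to a synthetic log-normal sample (an assumption standing in for a generation column) and compares the skewness before and after.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import power_transform

# Synthetic right-skewed data, not taken from the database
rng = np.random.default_rng(0)
toy = pd.DataFrame({"generation": rng.lognormal(mean=0.0, sigma=1.0, size=500)})

before = toy["generation"].skew()

# Same call as in the notebook: Yeo-Johnson, then zero mean and unit variance
transformed = power_transform(toy[["generation"]], method="yeo-johnson", standardize=True)
after = pd.Series(transformed[:, 0]).skew()

print(round(before, 2), round(after, 2))
```

With `standardize=True` the output is already centred and scaled, so a later `StandardScaler` on the same columns changes very little.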


In [None]:
df.iloc[:,:-2].skew()

In [None]:
##checking the outliers
import warnings
warnings.filterwarnings('ignore')
df.plot(kind='box',subplots=True, layout=(3,5), figsize=[20,8])

# IQR Proximity Rule

In [None]:
#Z - Score Technique

from scipy.stats import zscore
import numpy as np
z=np.abs(zscore(df))
z.shape

In [None]:
threshold=3
print(np.where(z>threshold))

In [None]:
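# Note: this call is not assigned back and does not use inplace=True, so it only displays a copy;
# the filter in the next cell is what actually removes the outliers.
df.drop([ 15, 15, 112, 130, 137, 143, 147, 209, 308, 308, 363, 364, 364,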
       364, 364, 364, 364, 375, 387, 393, 404, 415, 482, 493, 493, 493,
       493, 493, 493, 494, 494, 494, 494, 494, 494, 538, 648, 648, 648,
       648, 648, 648, 657, 657, 657, 657, 657, 695, 695, 695, 695, 695,
       695, 709, 721, 726, 726, 726, 726, 726, 726, 728, 767, 786, 786,
       786, 786, 786, 786, 788, 808, 808, 808, 808, 808, 811, 813, 817,
       880, 880, 880, 880, 880, 880, 888, 894],axis=0)


In [None]:
df=df[(z<3).all(axis=1)]
df.shape
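
The Z-score filter used above can be wrapped in a small function so the threshold is explicit. This sketch uses plain pandas/NumPy on toy data; the function name `drop_zscore_outliers` is hypothetical.

```python
import numpy as np
import pandas as pd

def drop_zscore_outliers(frame: pd.DataFrame, threshold: float = 3.0) -> pd.DataFrame:
    """Keep rows whose absolute z-score is below the threshold in every column."""
    # ddof=0 matches the default of scipy.stats.zscore
    z = np.abs((frame - frame.mean()) / frame.std(ddof=0))
    return frame[(z < threshold).all(axis=1)]

# Toy data: one extreme value in column "a"
toy = pd.DataFrame({"a": [1.0] * 19 + [100.0], "b": np.arange(20, dtype=float)})
cleaned = drop_zscore_outliers(toy)
print(toy.shape, cleaned.shape)
```

Comparing the two shapes shows how many rows the rule removes, which is worth checking before modelling because the filter also runs over the target columns.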

In [None]:
##Feature Engineering ( Variance Inflation Factor )
from statsmodels.stats.outliers_influence import variance_inflation_factor
df.corr()

In [None]:
plt.figure(figsize=(25,22))
sns.heatmap(df.corr(),linewidths=.1,vmin=-1, vmax=1, fmt='.2g', annot = True, linecolor="black",annot_kws={'size':15},cmap="YlGnBu")
plt.yticks(rotation=0)

In [None]:
x1=df.drop('capacity_mw',axis=1)
y1=df['capacity_mw']


x2=df.drop('primary_fuel',axis=1)
y2=df['primary_fuel']

In [None]:
x1

In [None]:
y1

In [None]:
def vif_calc1():
  vif=pd.DataFrame()
  vif["VIF Factor"]=[variance_inflation_factor(x1.values,i) for i in range(x1.shape[1])]
  vif["features"]=x1.columns
  print(vif)


In [None]:
vif_calc1()

In [None]:
x1.drop(['generation_gwh_2016'],axis=1,inplace=True)
vif_calc1()

In [None]:
x2

In [None]:
y2

In [None]:
def vif_calc2():
  vif=pd.DataFrame()
  vif["VIF Factor"]=[variance_inflation_factor(x2.values,i) for i in range(x2.shape[1])]
  vif["features"]=x2.columns
  print(vif)
vif_calc2()

In [None]:
x2.drop(['generation_gwh_2016'],axis=1,inplace=True)
vif_calc2()
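
For reference, the variance inflation factor of feature j is 1 / (1 - R²_j), where R²_j comes from regressing that feature on all the others. The sketch below computes it directly with NumPy on synthetic data. It adds an intercept to each regression, which is an assumption of this sketch; statsmodels' `variance_inflation_factor` uses the design matrix exactly as given, so its numbers can differ. The function name `vif_table` and the toy column names are hypothetical.

```python
import numpy as np
import pandas as pd

def vif_table(features: pd.DataFrame) -> pd.DataFrame:
    """VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j on the others."""
    X = features.to_numpy(dtype=float)
    rows = []
    for j, name in enumerate(features.columns):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Intercept column added here (an assumption of this sketch)
        design = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        r2 = 1.0 - residual.var() / y.var()
        rows.append({"features": name, "VIF Factor": 1.0 / (1.0 - r2)})
    return pd.DataFrame(rows)

# Synthetic data: gen_2016 is nearly a copy of gen_2015
rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "gen_2015": a,
    "gen_2016": a + rng.normal(scale=0.05, size=200),
    "latitude": rng.normal(size=200),
})
print(vif_table(toy))
```

The two near-duplicate columns receive very large factors while the independent one stays close to 1, which is the pattern that motivates dropping generation_gwh_2016 above.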

In [None]:
##Scaling the Data
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x1=pd.DataFrame(sc.fit_transform(x1), columns=x1.columns)
x1

In [None]:
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x2=pd.DataFrame(sc.fit_transform(x2), columns=x2.columns)
x2


##MODELLING FOR CAPACITY_MW
Building a regression model, as the target column's datatype is float

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn import metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, accuracy_score

# Regressors
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, BaggingRegressor

# Classifiers (the regressor and classifier KNN classes no longer share one alias)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [None]:
#getting the best random state
maxAccu=0
maxRS=0
for i in range(1,100): 
    x_train,x_test, y_train, y_test =train_test_split(x1,y1, test_size=.30,random_state=i)
    mod=RandomForestRegressor()
    mod.fit(x_train, y_train)
    pred=mod.predict(x_test)
    acc=r2_score(y_test,pred)
    if acc>maxAccu:
        maxAccu=acc
        maxRS=i
print('R2 Score=', maxAccu*100, 'Random_State',maxRS)

R2 Score= 86.65659092699795 Random_State 4

Setting Train Test Split

In [None]:
x_train,x_test, y_train, y_test=train_test_split(x1,y1,test_size=.30, random_state=maxRS)


In [None]:
# Checking r2score for Linear Regression
LR = LinearRegression()
LR.fit(x_train,y_train)

# prediction
predLR=LR.predict(x_test)
print('R2_score:',(r2_score(y_test,predLR))*100)

# Mean Absolute Error (MAE)
print('MAE:',metrics.mean_absolute_error(y_test, predLR))

# Mean Squared Error (MSE)
print('MSE:',metrics.mean_squared_error(y_test, predLR))

# Root Mean Squared Error (RMSE)
print("RMSE:",np.sqrt(metrics.mean_squared_error(y_test, predLR)))


In [None]:
##Random Forest Regressor
rf=RandomForestRegressor()
rf.fit(x_train, y_train)

# prediction
predrf=rf.predict(x_test)
print('R2_score:',(r2_score(y_test,predrf))*100)

# Mean Absolute Error (MAE)
print('MAE:',metrics.mean_absolute_error(y_test, predrf))

# Mean Squared Error (MSE)
print('MSE:',metrics.mean_squared_error(y_test, predrf))

# Root Mean Squared Error (RMSE)
print("RMSE:",np.sqrt(metrics.mean_squared_error(y_test, predrf)))


In [None]:
##Gradient Boosting Regressor
gb=GradientBoostingRegressor()
gb.fit(x_train,y_train)

predgb=gb.predict(x_test)
print('R2_Score:',(r2_score(y_test,predgb))*100)

# Mean Absolute Error (MAE)
print('MAE:',metrics.mean_absolute_error(y_test, predgb))

# Mean Squared Error (MSE)
print('MSE:',metrics.mean_squared_error(y_test, predgb))

# Root Mean Squared Error (RMSE)
print("RMSE:",np.sqrt(metrics.mean_squared_error(y_test, predgb)))


In [None]:
##Bagging Regressor
br=BaggingRegressor()
br.fit(x_train,y_train)

predbr=br.predict(x_test)
print('R2_Score:',(r2_score(y_test,predbr))*100)

# Mean Absolute Error (MAE)
print('MAE:',metrics.mean_absolute_error(y_test, predbr))

# Mean Squared Error (MSE)
print('MSE:',metrics.mean_squared_error(y_test, predbr))

# Root Mean Squared Error (RMSE)
print("RMSE:",np.sqrt(metrics.mean_squared_error(y_test, predbr)))
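
The four regressor cells above repeat the same fit, predict, and metric steps. One way to avoid the repetition is a small helper that returns the metrics as a dictionary. The sketch below runs on synthetic data from `make_regression` as a stand-in for `x1` and `y1`; the helper name `evaluate_regressor` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def evaluate_regressor(model, x_train, x_test, y_train, y_test) -> dict:
    """Fit the model and return R2 (in percent), MAE, MSE and RMSE on the test split."""
    model.fit(x_train, y_train)
    pred = model.predict(x_test)
    mse = mean_squared_error(y_test, pred)
    return {
        "R2_score": r2_score(y_test, pred) * 100,
        "MAE": mean_absolute_error(y_test, pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
    }

# Synthetic stand-in for x1, y1
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
split = train_test_split(X, y, test_size=0.30, random_state=0)

results = {
    "LinearRegression": evaluate_regressor(LinearRegression(), *split),
    "RandomForestRegressor": evaluate_regressor(
        RandomForestRegressor(n_estimators=50, random_state=0), *split
    ),
}
for name, scores in results.items():
    print(name, scores)
```

Collecting the scores in one dictionary also makes it easy to build a comparison table with `pd.DataFrame(results).T`.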


Cross-Validation

In [None]:
# Checking cv score for Linear Regression
print(cross_val_score(LR,x1,y1,cv=5).mean()*100)

# Checking cv score for Random Forest Regressor
print(cross_val_score(rf,x1,y1,cv=5).mean()*100)

#Checking the cv score for GradientBoostingRegressor
print(cross_val_score(gb,x1,y1,cv=5).mean()*100)

#Checking the cv score for BaggingRegressor
print(cross_val_score(br,x1,y1,cv=5).mean()*100)

In [None]:
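##Hyper Parameter Tuning for the model with best acc and cv score
#RandomForestRegressor
# Note: scikit-learn >= 1.0 names the criteria 'squared_error' and 'absolute_error'
# (formerly 'mse' and 'mae'); max_features=1.0 replaces the removed 'auto' option,
# and n_estimators must be at least 1, so 100 (the default) stands in for 0.
parameters = {'criterion':['squared_error', 'absolute_error'],
             'max_features':[1.0, 'sqrt', 'log2'],
             'n_estimators':[100,200],
             'max_depth':[2,3,4,6]}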
GCV=GridSearchCV(RandomForestRegressor(),parameters,cv=5)
GCV.fit(x_train,y_train)

In [None]:
GCV.best_params_
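
Instead of copying the best parameters by hand into a new estimator, the fitted search object also exposes the refitted model as `best_estimator_`. A minimal sketch on synthetic data, with a deliberately tiny grid (the grid values here are illustrative assumptions, not the notebook's):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for x_train, y_train
X, y = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)

grid = {"max_depth": [2, 4], "n_estimators": [20, 40]}  # illustrative values only
search = GridSearchCV(RandomForestRegressor(random_state=0), grid, cv=3)
search.fit(X, y)

# With the default refit=True, the best model is already refit on all of X, y
best_model = search.best_estimator_
print(search.best_params_)
```

Using `best_estimator_` avoids transcription mistakes when the grid changes and the chosen parameters change with it.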

In [None]:
# criterion='absolute_error' is the scikit-learn >= 1.0 name for 'mae'
capacity_mw = RandomForestRegressor(criterion='absolute_error', max_depth=6, max_features='sqrt', n_estimators=200)
capacity_mw.fit(x_train, y_train)
pred = capacity_mw.predict(x_test)
print('R2_Score:',r2_score(y_test,pred)*100)
print("RMSE value:",np.sqrt(metrics.mean_squared_error(y_test, pred)))


Saving the model

In [None]:
import joblib
joblib.dump(capacity_mw,"Global_Power_Plant_capacity_mw.pkl")

##MODELLING FOR PRIMARY_FUEL
Building a classification model, as the target column holds integer class labels (8 classes)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

print(df['primary_fuel'].value_counts())  
plt.figure(figsize=(5,5))
sns.countplot(x=df['primary_fuel'])
plt.show()

In [None]:
from imblearn.over_sampling import SMOTE
sm = SMOTE()
x2, y2 = sm.fit_resample(x2,y2)
y2.value_counts()

In [None]:
##Modelling

In [None]:
maxAccu=0
maxRS=0

for i in range(1,200):
    x_2_train,x_2_test, y_2_train, y_2_test=train_test_split(x2,y2,test_size=.30, random_state=i)
    rfc=RandomForestClassifier()
    rfc.fit(x_2_train,y_2_train)
    pred=rfc.predict(x_2_test)
    acc=accuracy_score(y_2_test,pred)
    if acc>maxAccu:
        maxAccu=acc
        maxRS=i
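print("Best accuracy is ",maxAccu*100," on Random_state ",maxRS)

# Re-create the split with the best random state; otherwise the variables hold the split from the last loop iteration
x_2_train,x_2_test, y_2_train, y_2_test=train_test_split(x2,y2,test_size=.30, random_state=maxRS)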


In [None]:
##Logistic Regression
# Checking Accuracy for Logistic Regression
log = LogisticRegression()
log.fit(x_2_train,y_2_train)

#Prediction
predlog = log.predict(x_2_test)

print(accuracy_score(y_2_test, predlog)*100)
print(confusion_matrix(y_2_test, predlog))
print(classification_report(y_2_test,predlog))

In [None]:
# Plotting Confusion_Matrix
cm = confusion_matrix(y_2_test,predlog)

x_axis_labels = ["0","1","2","3","4","5","6","7"]
y_axis_labels = ["0","1","2","3","4","5","6","7"]

f , ax = plt.subplots(figsize=(7,7))
sns.heatmap(cm, annot = True,linewidths=.2, linecolor="black", fmt = ".0f", ax=ax, cmap="BuGn",xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title('Confusion Matrix for LogisticRegression')
plt.show()

In [None]:
##Random Forest Classifier
# Checking accuracy for Random Forest Classifier
rf = RandomForestClassifier()
rf.fit(x_2_train,y_2_train)

# Prediction
predrf = rf.predict(x_2_test)

print(accuracy_score(y_2_test, predrf)*100)
print(confusion_matrix(y_2_test, predrf))
print(classification_report(y_2_test,predrf))

In [None]:
# Lets plot confusion matrix for RandomForestClassifier
cm = confusion_matrix(y_2_test,predrf)

x_axis_labels = ["0","1","2","3","4","5","6","7"]
y_axis_labels = ["0","1","2","3","4","5","6","7"]

f , ax = plt.subplots(figsize=(7,7))
sns.heatmap(cm, annot = True,linewidths=0.2, linecolor="black", fmt = ".0f", ax=ax, cmap="BuGn",xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title('Confusion Matrix for RandomForestClassifier')
plt.show()

In [None]:
##Decision Tree Classifier
# Checking Accuracy for Decision Tree Classifier
dtc = DecisionTreeClassifier()
dtc.fit(x_2_train,y_2_train)

#Prediction
preddtc = dtc.predict(x_2_test)

print(accuracy_score(y_2_test, preddtc)*100)
print(confusion_matrix(y_2_test, preddtc))
print(classification_report(y_2_test,preddtc))

In [None]:
# Lets plot confusion matrix for DTC
cm = confusion_matrix(y_2_test,preddtc)

x_axis_labels = ["0","1","2","3","4","5","6","7"]
y_axis_labels = ["0","1","2","3","4","5","6","7"]

f , ax = plt.subplots(figsize=(7,7))
sns.heatmap(cm, annot = True,linewidths=.2, linecolor="black", fmt = ".0f", ax=ax, cmap="BuGn",xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title('Confusion Matrix for Decision Tree Classifier')
plt.show()

In [None]:
#Support Vector Machine Classifier
# Checking accuracy for Support Vector Machine Classifier
svc = SVC()
svc.fit(x_2_train,y_2_train)

# Prediction
predsvc = svc.predict(x_2_test)

print(accuracy_score(y_2_test, predsvc)*100)
print(confusion_matrix(y_2_test, predsvc))
print(classification_report(y_2_test,predsvc))

In [None]:
# Lets plot confusion matrix for Support Vector Machine Classifier
cm = confusion_matrix(y_2_test,predsvc)

x_axis_labels = ["0","1","2","3","4","5","6","7"]
y_axis_labels = ["0","1","2","3","4","5","6","7"]

f , ax = plt.subplots(figsize=(7,7))
sns.heatmap(cm, annot = True,linewidths=.2, linecolor="black", fmt = ".0f", ax=ax, cmap="BuGn",xticklabels=x_axis_labels,yticklabels=y_axis_labels)

plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title('Confusion Matrix for Support Vector Machine Classifier')
plt.show()

In [None]:
##Gradient Boosting Classifier
# Checking accuracy for Gradient Boosting Classifier
GB = GradientBoostingClassifier()
GB.fit(x_2_train,y_2_train)

# Prediction
predGB = GB.predict(x_2_test)

print(accuracy_score(y_2_test, predGB)*100)
print(confusion_matrix(y_2_test, predGB))
print(classification_report(y_2_test,predGB))

In [None]:
# Lets plot confusion matrix for Gradient Boosting Classifier
cm = confusion_matrix(y_2_test,predGB)

x_axis_labels = ["0","1","2","3","4","5","6","7"]
y_axis_labels = ["0","1","2","3","4","5","6","7"]

f , ax = plt.subplots(figsize=(7,7))
sns.heatmap(cm, annot = True,linewidths=.2, linecolor="black", fmt = ".0f", ax=ax, cmap="BuGn",xticklabels=x_axis_labels,yticklabels=y_axis_labels)

plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title('Confusion Matrix for Gradient Boosting Classifier')


In [None]:
#cv score for Logistic Regression
print(cross_val_score(log,x2,y2,cv=5).mean()*100)

# cv score for Decision Tree Classifier
print(cross_val_score(dtc,x2,y2,cv=5).mean()*100)

# cv score for Random Forest Classifier
print(cross_val_score(rf,x2,y2,cv=5).mean()*100)

# cv score for Support Vector  Classifier
print(cross_val_score(svc,x2,y2,cv=5).mean()*100)

# cv score for Gradient Boosting Classifier
print(cross_val_score(GB,x2,y2,cv=5).mean()*100)

In [None]:
#HyperParameter Tuning for the model with best score
#Random Forest Classifier

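parameters = {'criterion':['gini'],
             'max_features':['sqrt'],   # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
             'n_estimators':[100,200],  # n_estimators must be at least 1, so 100 (the default) stands in for 0
             'max_depth':[2,3,4,5,6,8]}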
GCV=GridSearchCV(RandomForestClassifier(),parameters,cv=5)
GCV.fit(x_2_train,y_2_train)

In [None]:
GCV.best_params_

In [None]:
fuel = RandomForestClassifier(criterion='gini', max_depth=8, max_features='sqrt', n_estimators=200)  # 'sqrt' is what 'auto' meant for classifiers
fuel.fit(x_2_train, y_2_train)
pred = fuel.predict(x_2_test)
acc=accuracy_score(y_2_test,pred)
print(acc*100)

In [None]:
##Plotting ROC and compare AUC for the final model
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier
classifier = OneVsRestClassifier(fuel)
y_score = classifier.fit(x_2_train, y_2_train).predict_proba(x_2_test)

#Binarize the output
y_2_test_bin  = label_binarize(y_2_test, classes=[0,1,2,3,4,5,6,7])
n_classes = 8

# Compute ROC curve and AUC for all the classes
false_positive_rate = dict()
true_positive_rate = dict()
roc_auc = dict()
for i in range(n_classes):
    false_positive_rate[i], true_positive_rate[i], _ = roc_curve(y_2_test_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(false_positive_rate[i], true_positive_rate[i])
    
for i in range(n_classes):
    plt.plot(false_positive_rate[i], true_positive_rate[i], lw=2,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))

plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for multiclassification data')
plt.legend(loc="lower right")
plt.show()
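
The per-class curves above can be summarised in a single number with `roc_auc_score`, which supports the one-vs-rest scheme directly from class probabilities. A minimal sketch on synthetic three-class data (a stand-in for the eight fuel classes):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for x2, y2
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=3, random_state=0
)
x_tr, x_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(x_tr, y_tr)
proba = clf.predict_proba(x_te)  # one probability column per class

# Macro average: unweighted mean of the one-vs-rest AUC of each class
macro_auc = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")
print(round(macro_auc, 3))
```

The macro average weights every class equally, which matches the balanced class counts produced by the oversampling step.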

##Conclusion:
The accuracy score for Primary_Fuel is 94%
The R2 score for Capacity_mw is 87%

In [None]:
##Saving the model
import joblib
joblib.dump(fuel,"Global_Power_Plant_Fuel_Type.pkl")
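
A saved model can be read back with `joblib.load` and used for prediction straight away. The round trip below is a sketch on a small synthetic classifier written to a temporary directory; the file name `toy_model.pkl` is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model, standing in for the tuned classifier
X, y = make_classification(n_samples=100, n_features=6, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_model.pkl")  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same = (restored.predict(X) == model.predict(X)).all()
print(same)
```

New data must go through the same encoding, power transform, and scaling as the training features before it is passed to the restored model.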