# RENEWIFY 
### Machine Learning in Global Energy Sustainability

__________

Name : Manish Subhash Vankudre

## Web Scraping

In [1]:
# importing libraries

import wikipedia as wp
import wikipediaapi
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.interpolate import make_interp_spline
from IPython.display import display, HTML
import ipywidgets as widgets
from ipywidgets import interact, interactive_output
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


The code below uses the `wikipediaapi` library to fetch the content of Wikipedia pages. It defines a function, `fetch_wikipedia_html_content`, which returns a page's plain-text content (`page.text`, not its HTML) if the page exists and `None` otherwise. The function is then called with the page titles for carbon dioxide emissions and renewable electricity production.

In [2]:
wiki_wiki = wikipediaapi.Wikipedia('Manish/RENEWIFY', 'en')

def fetch_wikipedia_html_content(title):
    try:
        page = wiki_wiki.page(title)
        if page.exists():
            html_content = page.text
            return html_content
        else:
            print(f"Page not found: {title}")
            return None
    except wikipediaapi.exceptions.HTTPTimeoutError as e:
        print(f"Timeout error: {e}")
        return None

co2_emissions_title = "List_of_countries_by_carbon_dioxide_emissions"
renewable_energy_title = "List_of_countries_by_renewable_electricity_production"

co2_emissions_html_content = fetch_wikipedia_html_content(co2_emissions_title)
renewable_energy_html_content = fetch_wikipedia_html_content(renewable_energy_title)

In [3]:
print(co2_emissions_html_content)

This is a list of sovereign states and territories by carbon dioxide emissions due to certain forms of human activity, based on the EDGAR database created by European Commission and Netherlands Environmental Assessment Agency. The following table lists the 1970, 1990, 2005, 2017 and 2022 annual CO2 emissions estimates (in kilotons of CO2 per year) along with a list of calculated emissions per capita (in tons of CO2 per year).
The data only consider carbon dioxide emissions from the burning of fossil fuels and cement manufacture, but not emissions from land use, land-use change and forestry. Over the last 150 years, estimated cumulative emissions from land use and land-use change represent approximately one-third of total cumulative anthropogenic CO2 emissions. Emissions from international shipping or bunker fuels are also not included in national figures, which can make a large difference for small countries with important ports. 
In 2022, CO2 emissions from the top 10 countries with t

In [4]:
print(renewable_energy_html_content)

This is a list of countries and dependencies by electricity generation from renewable sources each year.
Renewables accounted for 28% of electric generation in 2021, consisting of hydro (55%), wind (23%), biomass (13%), solar (7%) and geothermal (1%). China produced 31% of global renewable electricity, followed by the United States (11%), Brazil (6.4%), Canada (5.4%) and India (3.9%).
Renewable investment reached almost $500 billion globally in 2022, amounting to 83% of new electric capacity that year. The renewable energy industry employs almost 14 million people.

List
Data are from IRENA unless otherwise specified, and are for the year 2021.

See also

List of countries by carbon dioxide emissions
List of countries by carbon dioxide emissions per capita
List of countries by electricity consumption
List of countries by electricity production
List of countries by energy intensity
List of countries by greenhouse gas emissions
List of countries by greenhouse gas emissions per person
Lis

__________


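The code below sets a user agent and defines a function that fetches the rendered HTML of a Wikipedia page using the `wikipedia` library. Setting a user agent follows Wikipedia's guidelines for API usage. If the title is ambiguous, the page is missing, or the request times out, the function prints a message and returns `None`.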

In [5]:
wp.set_user_agent("Manish/Renewable_Revolution")

def fetch_wikipedia_html(title):
    try:
        page = wp.page(title)
        html_content = page.html()
        return html_content
    except wp.exceptions.DisambiguationError as e:
        print(f"Ambiguous term: {e.options}")
    except wp.exceptions.HTTPTimeoutError as e:
        print(f"Timeout error: {e}")
    except wp.exceptions.PageError as e:
        print(f"Page not found: {e}")

The function `parse_html_table` takes HTML content as input, finds the first table with the class 'wikitable', and extracts the text of every header and data cell, row by row. It pads short rows with `pd.NA` so that each row has at least as many entries as the widest row and, when `include_change_direction=True`, appends the direction indicated by the arrow image in that row. The result is a list of rows, which is converted to a DataFrame in a later cell.
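
The row-by-row extraction described above can be tried on a small hand-written table. This is a minimal sketch: the HTML string, the fictional country names, and the helper name `table_to_rows` are assumptions made for illustration and are not part of the notebook's pipeline.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical sample markup standing in for a Wikipedia 'wikitable'
sample_html = """
<table class="wikitable">
  <tr><th>Country</th><th>Emissions</th></tr>
  <tr><td>Freedonia</td><td>1,234.5</td></tr>
  <tr><td>Sylvania</td></tr>
</table>
"""

def table_to_rows(html):
    """Return the first 'wikitable' as a list of rows, padding short rows with pd.NA."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "wikitable"})
    if table is None:
        return None
    rows = table.find_all("tr")
    width = max(len(row.find_all(["th", "td"])) for row in rows)
    data = []
    for row in rows:
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        cells += [pd.NA] * (width - len(cells))  # keep every row the same length
        data.append(cells)
    return data

rows = table_to_rows(sample_html)
sample_df = pd.DataFrame(rows[1:], columns=rows[0])  # first row holds the headers
print(sample_df)
```

Padding before building the DataFrame matters because `pd.DataFrame` needs rows that fit the header list.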

In [6]:
def parse_html_table(html, include_change_direction=False):
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', {'class': 'wikitable'})
    
    if not table:
        print("No table found.")
        return None

    rows = table.find_all('tr')
    data = []
    
    num_columns = max(len(row.find_all(['th', 'td'])) for row in rows)

    for row in rows:
        columns = row.find_all(['th', 'td'])
        row_data = [col.get_text(strip=True) for col in columns]
        
        if include_change_direction:
            img_tag = row.find('img', alt=['Positive decrease', 'Negative increase'])
            if img_tag:
                change_direction = 'positive_decrease' if 'Positive decrease' in img_tag['alt'] else 'negative_increase'
            else:
                change_direction = pd.NA
            row_data.append(change_direction)

        row_data += [pd.NA] * (num_columns - len(row_data))

        data.append(row_data)
    
    return data

In [None]:
co2_emissions_title = "List_of_countries_by_carbon_dioxide_emissions"
renewable_energy_title = "List_of_countries_by_renewable_electricity_production"

co2_emissions_html = fetch_wikipedia_html(co2_emissions_title)
renewable_energy_html = fetch_wikipedia_html(renewable_energy_title)

In [None]:
co2_emissions_data = parse_html_table(co2_emissions_html, include_change_direction=True)

co2_emissions_columns = co2_emissions_data[0] + ['Change Direction']

co2_emissions_df = pd.DataFrame(co2_emissions_data[1:], columns=co2_emissions_columns)

In [None]:
co2_emissions_df.head()

In [None]:
co2_emissions_df.info()

In [None]:
renewable_energy_data = parse_html_table(renewable_energy_html, include_change_direction=False)

renewable_energy_columns = renewable_energy_data[0]

renewable_energy_df = pd.DataFrame(renewable_energy_data[1:], columns=renewable_energy_columns)

In [None]:
renewable_energy_df.head()

In [None]:
renewable_energy_df.info()

## Data Cleaning


#### Data Cleaning - co2_emissions_df

In [None]:
print("Unique column names in the DataFrame:")
print(co2_emissions_df.columns)

In [None]:
ce_new_column_names = ['country/territory', 'fossil_CO2_emissions_1970', 'fossil_CO2_emissions_1990', 'fossil_CO2_emissions_2005', 'fossil_CO2_emissions_2017', 'fossil_CO2_emissions_2022', 'per_capita_co2_2022', 'percent_of_world', 'percent_change' ,'change_direction']

co2_emissions_df.columns = ce_new_column_names

This code assigns descriptive column names to `co2_emissions_df`: fossil CO2 emissions for each benchmark year, per capita CO2 emissions in 2022, the share of world emissions, and the percent change together with its direction.

In [None]:
co2_emissions_df.head()

In [None]:
co2_emissions_df = co2_emissions_df.iloc[6:]

co2_emissions_df.reset_index(drop=True, inplace=True)

This keeps the rows of `co2_emissions_df` from the 7th row onward, dropping the first six rows, which are irrelevant to the analysis. The index is then reset so that it starts again at 0.
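
As a small illustration of positional slicing followed by an index reset, the sketch below uses a made-up four-row table (the column names and values are hypothetical) and drops its first two rows.

```python
import pandas as pd

# Hypothetical table whose first two rows are leftover header/summary rows
toy_df = pd.DataFrame({"country": ["header", "World", "A", "B"],
                       "value": ["unit", "100", "60", "40"]})

# Keep row position 2 onward, then renumber the index from 0
trimmed = toy_df.iloc[2:].reset_index(drop=True)
print(trimmed)
```

Without `drop=True`, the old index would be kept as an extra column.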

In [None]:
co2_emissions_df.head()

In [None]:
co2_emissions_df.replace('', np.nan, inplace=True)

In [None]:
co2_emissions_df['country/territory'] = co2_emissions_df['country/territory'].str.capitalize()

In [None]:
co2_emissions_df['percent_of_world'] = co2_emissions_df['percent_of_world'].str.replace(',', '').str.extract(r'(\d+\.\d{3})').astype(float)
co2_emissions_df['percent_change'] = co2_emissions_df['percent_change'].str.replace(',', '').str.extract(r'(\d+\.\d{1})').astype(float)

columns_to_convert = co2_emissions_df.columns.difference(['country/territory' ,'percent_of_world' , 'percent_change' ,'change_direction'])
co2_emissions_df[columns_to_convert] = co2_emissions_df[columns_to_convert].apply(lambda x: x.str.replace(',', '')).astype(float)

The code converts the numeric columns from strings to floats. For 'percent_of_world' and 'percent_change' it removes commas and extracts the first number matching the given pattern (three decimal places and one decimal place, respectively); any sign or trailing '%' is discarded. The remaining numeric columns are converted by removing thousands separators and casting to float.
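
The two conversions can be seen in isolation on a few sample strings. This is a sketch with invented values; the series names are hypothetical.

```python
import pandas as pd

# Hypothetical raw cell values with thousands separators
raw = pd.Series(["1,234.56", "98.7"])
as_float = raw.str.replace(",", "", regex=False).astype(float)

# Raw-string regex: digits, a dot, then exactly three decimals; other text is ignored
share = pd.Series(["0.123%", "12.345%", "n/a"])
share_float = share.str.extract(r"(\d+\.\d{3})")[0].astype(float)

print(as_float.tolist())
print(share_float.tolist())
```

Values that do not match the pattern, such as `"n/a"` here, become `NaN` rather than raising an error.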

In [None]:
co2_emissions_df.head()

In [None]:
co2_emissions_df['percent_change'] = np.where(co2_emissions_df['change_direction'] == 'positive_decrease', -co2_emissions_df['percent_change'], co2_emissions_df['percent_change'])

Based on the arrow images on the website, a 'change_direction' column was created. Because the extracted 'percent_change' values carry no sign, rows marked 'positive_decrease' (a down arrow, meaning emissions fell) are negated, while rows marked 'negative_increase' (an up arrow, meaning emissions rose) stay positive. This gives every percentage change a consistent sign for analysis and interpretation of the carbon dioxide emissions data.

In [None]:
co2_emissions_df.head()

In [None]:
co2_emissions_df.info()

In [None]:
co2_emissions_df.columns

#### Data Cleaning - renewable_energy_df

In [None]:
renewable_energy_df = renewable_energy_df.iloc[1: ,:]

renewable_energy_df.reset_index(drop=True, inplace=True)

In [None]:
re_new_column_names = ['country/dependency', 'percent_renewable', 'renewable_generation(GWh)', 'percent_hydro', 'percent_wind', 'percent_bio', 'percent_solar' , 'percent_geo']

renewable_energy_df.columns = re_new_column_names

The columns are renamed to more descriptive names: 'country/dependency', 'percent_renewable', 'renewable_generation(GWh)', 'percent_hydro', 'percent_wind', 'percent_bio', 'percent_solar', and 'percent_geo'. This makes the renewable energy data easier to read and analyze.

In [None]:
renewable_energy_df.head()

In [None]:
renewable_energy_df.columns

In [None]:
renewable_energy_df.replace('', np.nan, inplace=True)

In [None]:
renewable_energy_df['country/dependency'] = renewable_energy_df['country/dependency'].str.capitalize()

In [None]:
columns_to_process = ['percent_renewable', 'percent_hydro', 'percent_wind', 'percent_bio', 'percent_solar', 'percent_geo']

# Strip the trailing '%' and any commas, then cast to float (missing values stay NaN)
renewable_energy_df[columns_to_process] = renewable_energy_df[columns_to_process].apply(lambda x: x.str.rstrip('%').str.replace(',', '').astype(float))

In [None]:
renewable_energy_df['renewable_generation(GWh)'] = renewable_energy_df['renewable_generation(GWh)'].str.replace(',', '').astype(float)

The data cleaning step involves converting percentage values in the columns ['percent_renewable', 'percent_hydro', 'percent_wind', 'percent_bio', 'percent_solar', 'percent_geo'] from strings to numerical format. We removed the percentage sign, commas, and converted the values to float. This ensures accurate analysis and visualization as numerical data is easier to work with, providing meaningful insights into renewable energy statistics.
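
A minimal sketch of the same percentage conversion on invented strings is shown below. The values are artificial (one includes a comma purely to exercise that step), and `np.nan` stands in for an empty cell.

```python
import numpy as np
import pandas as pd

# Hypothetical percentage strings; np.nan represents a cell that was empty
pct = pd.Series(["55.3%", "1,234.5%", np.nan])

# Remove the trailing percent sign and commas, then cast to float
pct_float = pct.str.rstrip("%").str.replace(",", "", regex=False).astype(float)
print(pct_float.tolist())
```

Using the `.str` accessor leaves missing values as `NaN`, whereas slicing each value with `y[:-1]` would fail on a non-string entry.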

In [None]:
renewable_energy_df.head()

In [None]:
renewable_energy_df.info()

In [None]:
co2_emissions_df.columns

In [None]:
renewable_energy_df.columns

## Data Visualization

In [None]:
co2_emissions_df.head()

In [None]:
co2_emissions_df.columns

In [None]:
renewable_energy_df.head()

In [None]:
co2_emissions_df.describe()

In [None]:
co2_emissions_df[['per_capita_co2_2022']].hist(bins=20, figsize=(8, 6))
plt.title('Distribution of Per Capita CO2 Emissions in 2022')
plt.xlabel('Per Capita CO2 Emissions (Metric Tons)')
plt.ylabel('Frequency')
plt.show()


The histogram visualizes the distribution of per capita CO2 emissions in 2022. The x-axis represents the range of emissions, and the y-axis shows the frequency of countries falling into each range. This helps analyze the variation in individual emission levels, providing insights into global environmental impact and guiding sustainable policies.

_____

In [None]:
top_countries = co2_emissions_df.nlargest(15, 'fossil_CO2_emissions_2022')
plt.figure(figsize=(12, 6))
plt.bar(top_countries['country/territory'], top_countries['fossil_CO2_emissions_2022'] , color="orange")
plt.xlabel('Country')
plt.ylabel('Total CO2 Emissions in 2022 (Million Metric Tons)')
plt.title('Top 15 Countries by Total CO2 Emissions in 2022')
plt.xticks(rotation=45, ha='right')
plt.show()

This bar plot illustrates the top 15 countries by their total CO2 emissions in 2022. Each bar represents a country, and the height of the bar corresponds to its emissions. The purpose is to visually compare and identify the countries with the highest emissions, providing insights into global carbon output.

____

In [None]:
top_countries = co2_emissions_df.sort_values(by='per_capita_co2_2022', ascending=False).head(15)
plt.figure(figsize=(12, 6))
plt.bar(top_countries['country/territory'], top_countries['per_capita_co2_2022'] , color ="magenta")
plt.xlabel('Country')
plt.ylabel('Per Capita CO2 Emissions in 2022 (Metric Tons)')
plt.title('Top 15 Countries by Per Capita CO2 Emissions (2022)')
plt.xticks(rotation=45, ha='right')
plt.show()



The bar chart displays the 15 countries with the highest per capita CO2 emissions in 2022. Each bar represents a country, and its height indicates the amount of CO2 emitted per person. This visualization helps identify the countries with the largest per-person carbon footprint.

____

In [None]:
sns.kdeplot(data=co2_emissions_df, x='percent_change', fill=True, common_norm=False)
plt.title('Kernel Density Estimate (KDE) Plot for Percent Change in CO2 Emissions from 1990 - 2022')
plt.xlabel('Percent Change')
plt.ylabel('Density')
plt.show()

The Kernel Density Estimate (KDE) plot visualizes the distribution of percent changes in CO2 emissions from 1990 to 2022. It shows a smoothed estimate of how densely countries are concentrated around each value of percent change. This helps identify patterns, such as concentration around specific values, and supports environmental trend analysis.

____

In [None]:
def update_plot(selected_country):
    country_data = co2_emissions_df[co2_emissions_df['country/territory'] == selected_country]
    plt.figure(figsize=(12, 6))
    plt.plot(country_data.columns[1:6], country_data.iloc[0, 1:6], marker='o')
    plt.xlabel('Year')
    plt.ylabel('CO2 Emissions (Million Metric Tons)')
    plt.title(f'CO2 Emissions Over Time for {selected_country}')
    plt.show()

country_names = co2_emissions_df['country/territory'].tolist()

country_dropdown = widgets.Dropdown(
    options=country_names,
    value='United states',  # Default selected country
    description='Select Country:',
    disabled=False,
    layout={'width': '300px'},
    style={'description_width': 'initial'}
)

output = interactive_output(update_plot, {'selected_country': country_dropdown})

widgets.VBox([country_dropdown, output])

The line plot displays the annual CO2 emissions in million metric tons for a selected country at five benchmark years (1970, 1990, 2005, 2017, and 2022). Each point on the line represents the emissions for one of those years. This visualization helps observe trends and fluctuations in a country's carbon footprint. The dropdown menu lets users switch between countries for comparison, giving a better understanding of global carbon emission patterns.

____

In [None]:
top_countries = co2_emissions_df.nlargest(5, 'fossil_CO2_emissions_2022')

plt.figure(figsize=(12, 6))

for country in top_countries['country/territory']:
    country_data = co2_emissions_df[co2_emissions_df['country/territory'] == country]
    plt.plot(country_data.columns[1:6], country_data.iloc[0, 1:6], label=country, linewidth=3)

plt.xlabel('Year')
plt.ylabel('CO2 Emissions')
plt.title('Top 5 Countries: Change in Fossil CO2 Emissions (1970-2022)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0., fontsize='medium')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()


This line plot compares the change in fossil CO2 emissions from 1970 to 2022 for the top 5 countries. Each country has a distinct line, showcasing its emissions trend over time. This visualization helps identify patterns and variations in carbon emissions, crucial for understanding environmental impacts and policy assessments.

____

In [None]:
custom_colors = ['cyan', 'purple', 'maroon', 'yellow', 'green']

world_total_distribution = renewable_energy_df.iloc[:, 3:].sum()

plt.figure(figsize=(8, 8))
plt.pie(world_total_distribution, labels=renewable_energy_df.columns[3:], autopct='%1.1f%%', startangle=140, colors=custom_colors)
plt.title('Renewable Energy Distribution for the Entire World')
plt.show()

The pie chart illustrates the mix of renewable energy sources, with one color per source: hydro, wind, bio, solar, and geothermal. Each slice is the sum of that source's percentage column across all countries, so every country counts equally regardless of how much electricity it generates. The chart therefore approximates, rather than measures, the contribution of each source to total renewable energy production worldwide. It still helps us understand the relative importance of each source in our efforts to use sustainable and environmentally friendly energy.
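
To illustrate how summing per-country percentages differs from weighting by generation, the sketch below compares the two on made-up numbers. The two countries are hypothetical, and the percent columns are assumed here to be shares of each country's own renewable generation, which may not match how the scraped table defines them.

```python
import pandas as pd

# Made-up figures for two hypothetical countries (assumption: percentages are
# shares of each country's own renewable generation)
toy = pd.DataFrame({
    "country": ["Large", "Small"],
    "renewable_generation(GWh)": [900.0, 100.0],
    "percent_hydro": [80.0, 20.0],
    "percent_wind": [20.0, 80.0],
})
source_cols = ["percent_hydro", "percent_wind"]

# Unweighted: sum the percentage columns, then normalise to 100
unweighted = toy[source_cols].sum()
unweighted_share = 100 * unweighted / unweighted.sum()

# Weighted: convert each percentage to GWh first, then aggregate
gwh = toy[source_cols].mul(toy["renewable_generation(GWh)"], axis=0) / 100
weighted_share = 100 * gwh.sum() / gwh.sum().sum()

print(unweighted_share.round(1).to_dict())
print(weighted_share.round(1).to_dict())
```

In this toy case the unweighted mix is an even split, while the weighted mix is dominated by the larger producer's hydro share.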

___

In [None]:
top_countries_renewable_generation = renewable_energy_df.nlargest(5, 'renewable_generation(GWh)')
plt.figure(figsize=(10, 6))
plt.bar(top_countries_renewable_generation['country/dependency'], top_countries_renewable_generation['renewable_generation(GWh)'], color='lightgreen')
plt.xlabel('Country/Dependency')
plt.ylabel('Renewable Energy Generation (GWh)')
plt.title('Top 5 Countries by Renewable Energy Generation')
plt.xticks(rotation=45, ha='right')
plt.show()


This bar plot displays the top 5 countries/dependencies with the highest renewable energy generation (in gigawatt-hours). Each bar represents a country, and the bar's height indicates the amount of renewable energy it produces. We use this plot to easily compare and identify the leading contributors to global renewable energy production. It helps visualize which regions are making significant strides in sustainable energy generation.

______

In [None]:
def update_plot(selected_energy_type):
    top_countries_energy_type = renewable_energy_df.nlargest(10, selected_energy_type)
    plt.figure(figsize=(12, 8))
    
    for country in top_countries_energy_type['country/dependency']:
        country_data = renewable_energy_df[renewable_energy_df['country/dependency'] == country]
        plt.bar(country_data['country/dependency'], country_data[selected_energy_type], label=country)
    
    plt.xlabel('Country/Dependency')
    plt.ylabel(f'Percentage of {selected_energy_type} Energy')
    plt.title(f'Top 10 Countries by {selected_energy_type} Percentage')
    plt.xticks(rotation=45, ha='right')
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.show()

energy_types = ['percent_hydro', 'percent_wind', 'percent_bio', 'percent_solar', 'percent_geo']

energy_type_dropdown = widgets.Dropdown(
    options=energy_types,
    value='percent_solar',  # Default selected energy type
    description='Select Energy Type:',
    disabled=False,
    layout={'width': '300px'},
    style={'description_width': 'initial'}
)

output = interactive_output(update_plot, {'selected_energy_type': energy_type_dropdown})

widgets.VBox([energy_type_dropdown, output])

The bar plot displays the top 10 countries/dependencies based on their percentage of renewable energy in a selected category like solar, wind, bio, hydro, or geo. Each bar represents a country, and its height shows the percentage of renewable energy in that category. This visualization helps compare and identify leading countries in different renewable energy sources, providing insights into global sustainability efforts and energy diversity.

___

In [None]:
data = renewable_energy_df[['percent_hydro', 'percent_wind', 'percent_bio', 'percent_solar', 'percent_geo']]
plt.figure(figsize=(12, 8))

sns.boxplot(data=data, palette="Set2")

plt.title('Box Plot: Distribution of Renewable Energy Types')
plt.xlabel('Renewable Energy Types')
plt.ylabel('Percentage')
plt.show()

The box plot illustrates the distribution of each renewable energy type: hydro, wind, bio, solar, and geothermal. Each box spans the interquartile range (IQR), which contains the middle 50% of the values, and the line inside the box marks the median. Outliers, if any, are displayed as individual points. This plot helps us understand the variability and central tendency of each renewable energy source, aiding analysis and comparison.
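
The quantities a box plot draws can be computed directly. The sketch below uses invented values and the common 1.5 × IQR rule for flagging outliers, which is the default whisker setting in matplotlib and seaborn.

```python
import numpy as np

# Hypothetical percentages for one energy type
values = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 40.0])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1                   # height of the box
lower_fence = q1 - 1.5 * iqr    # points below this are drawn as outliers
upper_fence = q3 + 1.5 * iqr    # points above this are drawn as outliers
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr, outliers)
```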

___

## Clustering

In [None]:
#importing libraries 

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.metrics.pairwise import cosine_similarity

from sklearn.decomposition import PCA

In [None]:
numeric_columns = ['percent_renewable', 'renewable_generation(GWh)', 'percent_hydro', 'percent_wind', 'percent_bio', 'percent_solar', 'percent_geo']
data_for_clustering = renewable_energy_df[numeric_columns]

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data_for_clustering)

scaled_data_df = pd.DataFrame(scaled_data, columns=numeric_columns)

scaled_data_df.head()

In [None]:
numeric_columns = ['percent_renewable', 'renewable_generation(GWh)', 'percent_hydro', 'percent_wind', 'percent_bio', 'percent_solar', 'percent_geo']
data_for_clustering = renewable_energy_df[numeric_columns]

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data_for_clustering)

wcss = [] 

for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(scaled_data)
    wcss.append(kmeans.inertia_)

# Elbow curve plot
plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.show()

In [None]:
silhouette_scores = []

for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(scaled_data)
    silhouette_avg = silhouette_score(scaled_data, labels)
    silhouette_scores.append(silhouette_avg)

plt.plot(range(2, 11), silhouette_scores, marker='o')
plt.title('Silhouette Score for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.show()

From the elbow method and the silhouette scores, we observe that k = 6 is the optimal number of clusters.
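
Reading k off a plot can be complemented by picking the silhouette maximum programmatically. This is a sketch on synthetic data from `make_blobs` (three well-separated groups chosen for illustration, not the notebook's data); the names `X_demo`, `scores`, and `best_k` are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data: three well-separated groups
X_demo, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)  # k with the highest average silhouette
print(best_k, round(scores[best_k], 3))
```

On real data the maximum is often less clear-cut, so it is worth checking it against the elbow curve as the notebook does.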

In [None]:
numeric_columns_renewable = renewable_energy_df.select_dtypes(include=['float64']).columns
data_for_clustering_renewable = renewable_energy_df[numeric_columns_renewable]

scaler_renewable = StandardScaler()
scaled_data_renewable = scaler_renewable.fit_transform(data_for_clustering_renewable)

# We are trying for different values of k
for k in [5, 6, 7]:
    kmeans_renewable = KMeans(n_clusters=k, random_state=42)
    renewable_energy_df['cluster'] = kmeans_renewable.fit_predict(scaled_data_renewable)

    # Plot clusters based on percent_renewable and percent_wind
    plt.figure(figsize=(8, 6))
    plt.scatter(renewable_energy_df['percent_renewable'], renewable_energy_df['percent_wind'], c=renewable_energy_df['cluster'], cmap='plasma', s=20)
    plt.title(f'K-Means Clustering (k={k}): Percent Renewable vs Percent Wind')
    plt.xlabel('Percent Renewable')
    plt.ylabel('Percent Wind')
    plt.show()

### Hierarchical Clustering

In [None]:
numeric_columns_renewable = renewable_energy_df.select_dtypes(include=['float64']).columns
data_for_clustering_renewable = renewable_energy_df[numeric_columns_renewable]

scaler_renewable = StandardScaler()
scaled_data_renewable = scaler_renewable.fit_transform(data_for_clustering_renewable)

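# cosine similarity matrix
cosine_similarity_matrix = cosine_similarity(scaled_data_renewable)

# cosine distance = 1 - similarity; clip tiny negative values caused by rounding
cosine_distance_matrix = np.clip(1 - cosine_similarity_matrix, 0, None)
np.fill_diagonal(cosine_distance_matrix, 0)

# linkage() expects a condensed distance vector (a square matrix would be read as raw observations);
# 'average' linkage is used because 'ward' is only defined for Euclidean distances
from scipy.spatial.distance import squareform
linkage_matrix_cosine = linkage(squareform(cosine_distance_matrix, checks=False), method='average')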

# Dendrogram plot
plt.figure(figsize=(15, 8))
dendrogram(linkage_matrix_cosine, labels=renewable_energy_df['country/dependency'].tolist(), orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram with Cosine Similarity')
plt.xlabel('Countries/Dependencies')
plt.ylabel('Distance')
plt.show()

From the dendrogram above, we can observe the clusters formed by hierarchical clustering.
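
Beyond inspecting the dendrogram visually, the tree can be cut into flat cluster labels. The sketch below is one possible approach on invented feature vectors, assuming cosine distance computed with `pdist` and average linkage; the array and variable names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical feature vectors: rows 0-1 point one way, rows 2-3 another
features = np.array([[1.0, 0.1], [2.0, 0.1], [0.1, 1.0], [0.2, 3.0]])

condensed = pdist(features, metric="cosine")          # condensed pairwise cosine distances
Z = linkage(condensed, method="average")              # average linkage on those distances
flat_labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two clusters
print(flat_labels)
```

Because cosine distance ignores vector length, rows 0 and 1 fall in the same cluster even though one is twice as long as the other.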

### PCA

In [None]:
# PCA to get eigenvalues and eigenvectors
pca = PCA()
pca.fit(scaled_data)

In [None]:
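# Plot cumulative explained variance ratio (x-axis starts at 1 component)
plt.figure(figsize=(12, 6))
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), np.cumsum(pca.explained_variance_ratio_), marker='o')
plt.title('Cumulative Explained Variance Ratio')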
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.show()

In [None]:
# Plot the eigenvalues
plt.figure(figsize=(12, 6))
plt.plot(range(1, len(pca.explained_variance_) + 1), pca.explained_variance_, marker='o')
plt.title('Scree Plot: Eigenvalues')
plt.xlabel('Principal Component Index')
plt.ylabel('Eigenvalue')
plt.show()

In [None]:
renewable_energy_df.head()

In [None]:
pca = PCA(n_components=2)
pca_result = pca.fit_transform(scaled_data)

renewable_energy_df['PCA1'] = pca_result[:, 0]
renewable_energy_df['PCA2'] = pca_result[:, 1]

# Clusters in PCA space
plt.figure(figsize=(12, 8))
sns.scatterplot(x='PCA1', y='PCA2', hue='cluster', data=renewable_energy_df, palette='viridis', s=50)
plt.title('K-Means Clustering in PCA Space')
plt.xlabel('Principal Component 1 (PCA1)')
plt.ylabel('Principal Component 2 (PCA2)')
plt.show()


In [None]:
renewable_energy_df.head()

In [None]:
# loading scores
loading_scores = pd.DataFrame(pca.components_.T, columns=['PC1', 'PC2'], index=data_for_clustering.columns)

plt.figure(figsize=(12, 8))
sns.heatmap(loading_scores, cmap='coolwarm', annot=True, fmt=".2f", linewidths=.5)
plt.title('Loading Scores: Contribution of Each Column to Principal Components')
plt.xlabel('Principal Components')
plt.ylabel('Columns')
plt.show()

In [None]:
data_for_clustering.head()

In [None]:
renewable_energy_df_clustering = renewable_energy_df.copy()

# Remove the cluster and PCA columns from renewable_energy_df (the copy above keeps them)
columns_to_remove = ['cluster', 'PCA1', 'PCA2']
renewable_energy_df.drop(columns=columns_to_remove, inplace=True)

# Check the result
renewable_energy_df.head()


## Naïve Bayes

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix

In [None]:
renewable_energy_df.head()

In [None]:
# Define the bins and labels for the categories
bins = [-1, 25, 50, 75, 100]
labels = [-2 , -1 , 1, 2]

renewable_energy_df_classified = renewable_energy_df.copy()

# Create a new column with the categories
renewable_energy_df_classified['renewable_category'] = pd.cut(renewable_energy_df['percent_renewable'], bins=bins, labels=labels, include_lowest=True)

renewable_energy_df_classified.head()

In [None]:
for col in ['percent_hydro', 'percent_wind', 'percent_bio', 'percent_solar', 'percent_geo']:
    threshold = renewable_energy_df[col].median()  
    renewable_energy_df_classified[f'{col}_category'] = renewable_energy_df[col].apply(lambda x: -1 if x <= threshold else 1)

renewable_energy_df_classified.head()

In [None]:
selected_columns = ['country/dependency', 'renewable_category', 'percent_hydro_category', 
                    'percent_wind_category', 'percent_bio_category', 'percent_solar_category', 
                    'percent_geo_category']

# Create a new DataFrame with only the selected columns
renewable_energy_df_classified = renewable_energy_df_classified[selected_columns]

renewable_energy_df_classified.head()

In [None]:
for col in renewable_energy_df_classified.columns:
    if col.endswith('_category'):
        print(renewable_energy_df_classified[col].value_counts())

In [None]:
# Data Preparation
X = renewable_energy_df_classified[['percent_hydro_category', 'percent_wind_category', 'percent_bio_category', 'percent_solar_category', 'percent_geo_category']]
y = renewable_energy_df_classified['renewable_category']  # target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=32)

In [None]:
X_train.head()

In [None]:
X_test.head()

In [None]:
y_train.head()

In [None]:
y_test.head()

In [None]:
#Model Training
naive_bayes = GaussianNB()
naive_bayes.fit(X_train, y_train)

y_pred = naive_bayes.predict(X_test)

# accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

#classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))


In [None]:
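# Fix the label order so the matrix always matches the tick labels used in the plot
conf_matrix_nb = confusion_matrix(y_test, y_pred, labels=[-2, -1, 1, 2])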
print("Confusion Matrix:")
print(conf_matrix_nb)

In [None]:
# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix_nb, annot=True, cmap='Blues', fmt='g', cbar=False,
            xticklabels=['-2', '-1', '1', '2'], yticklabels=['-2', '-1', '1', '2'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

## Decision Trees

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Data Preparation
X = renewable_energy_df_classified.drop(columns=['country/dependency', 'renewable_category'])
y = renewable_energy_df_classified['renewable_category']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=32)


In [None]:
X_train.head()

In [None]:
X_test.head()

In [None]:
y_train.head()

In [None]:
y_test.head()

In [None]:

#Model Training
decision_tree = DecisionTreeClassifier(random_state=42)
decision_tree.fit(X_train, y_train)

y_pred = decision_tree.predict(X_test)

# accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))


In [None]:
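# Fix the label order so the matrix always matches the tick labels used in the plot
conf_matrix_dc = confusion_matrix(y_test, y_pred, labels=[-2, -1, 1, 2])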
print("Confusion Matrix:")
print(conf_matrix_dc)

In [None]:
# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix_dc, annot=True, cmap='Blues', fmt='g', cbar=False,
            xticklabels=['-2', '-1', '1', '2'], yticklabels=['-2', '-1', '1', '2'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

In [None]:
from sklearn.tree import plot_tree
# Plot the Decision Tree
plt.figure(figsize=(20,10))
plot_tree(decision_tree, filled=True, feature_names=X.columns, class_names=['-2' ,'-1','1' , '2'], rounded=True)
plt.title("Decision Tree Visualization")
plt.savefig('Media/Decision_Tree/decision_tree_visualization.png')
plt.show()

### Decision Tree Versions with Different Maximum Leaf Nodes

In [None]:
# Version 1: max_leaf_nodes = 3
decision_tree_v1 = DecisionTreeClassifier(max_leaf_nodes=3, random_state=42)
decision_tree_v1.fit(X_train, y_train)

# Plot the Decision Tree (plot_tree is already imported above)
plt.figure(figsize=(20,10))
plot_tree(decision_tree_v1, filled=True, feature_names=list(X.columns),
          class_names=[str(c) for c in decision_tree_v1.classes_], rounded=True)
plt.title("Decision Tree Visualization (max_leaf_nodes=3)")
plt.show()


In [None]:
# Version 2: max_leaf_nodes = 5
decision_tree_v2 = DecisionTreeClassifier(max_leaf_nodes=5, random_state=42)
decision_tree_v2.fit(X_train, y_train)

# Plot the Decision Tree (plot_tree is already imported above)
plt.figure(figsize=(20,10))
plot_tree(decision_tree_v2, filled=True, feature_names=list(X.columns),
          class_names=[str(c) for c in decision_tree_v2.classes_], rounded=True)
plt.title("Decision Tree Visualization (max_leaf_nodes=5)")
plt.show()


In [None]:
# Version 3: max_leaf_nodes = 8
decision_tree_v3 = DecisionTreeClassifier(max_leaf_nodes=8, random_state=42)
decision_tree_v3.fit(X_train, y_train)

# Plot the Decision Tree (plot_tree is already imported above)
plt.figure(figsize=(20,10))
plot_tree(decision_tree_v3, filled=True, feature_names=list(X.columns),
          class_names=[str(c) for c in decision_tree_v3.classes_], rounded=True)
plt.title("Decision Tree Visualization (max_leaf_nodes=8)")
plt.show()
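
The three trees above differ only in `max_leaf_nodes`, and the plots show their structure but not how well each one predicts. A rough way to compare them is to record training and test accuracy for each limit. The sketch below assumes accuracy is the metric of interest and runs on synthetic data from `make_classification`; the helper `compare_leaf_limits` and the `*_demo` variables are hypothetical names introduced here, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def compare_leaf_limits(X_train, X_test, y_train, y_test, leaf_limits=(3, 5, 8, None)):
    """Fit one tree per max_leaf_nodes value and collect train/test accuracy."""
    results = {}
    for limit in leaf_limits:
        tree = DecisionTreeClassifier(max_leaf_nodes=limit, random_state=42)
        tree.fit(X_train, y_train)
        results[limit] = {
            "n_leaves": tree.get_n_leaves(),
            "train_accuracy": accuracy_score(y_train, tree.predict(X_train)),
            "test_accuracy": accuracy_score(y_test, tree.predict(X_test)),
        }
    return results


# Synthetic four-class stand-in data (assumption: replaces the notebook's features)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

leaf_results = compare_leaf_limits(X_tr_demo, X_te_demo, y_tr_demo, y_te_demo)
for limit, scores in leaf_results.items():
    print(f"max_leaf_nodes={limit}: leaves={scores['n_leaves']}, "
          f"train={scores['train_accuracy']:.3f}, test={scores['test_accuracy']:.3f}")
```

Passing the notebook's own `X_train, X_test, y_train, y_test` to the helper would give the corresponding numbers for the trees plotted above. A large gap between training and test accuracy for the unrestricted tree (`None`) would point to overfitting.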


## Support Vector Machines (SVMs)

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Data Preparation
X = renewable_energy_df_classified.drop(columns=['country/dependency', 'renewable_category'])
y = renewable_energy_df_classified['renewable_category']

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
X_train.head()

In [None]:
X_test.head()

In [None]:
y_train.head()

In [None]:
y_test.head()

In [None]:
# Values of the regularization parameter C to try with each kernel
c_values = [0.1, 1, 10]

def train_svm(kernel, c):
    """Fit an SVC with the given kernel and C and return its test-set confusion matrix."""
    svm_classifier = SVC(kernel=kernel, C=c, random_state=32)
    svm_classifier.fit(X_train, y_train)

    y_pred = svm_classifier.predict(X_test)

    return confusion_matrix(y_test, y_pred)


In [None]:
# Training and evaluating SVM classifiers for each kernel and C value

conf_matrices = {}
for kernel in ['linear', 'poly', 'rbf']:
    conf_matrices[kernel] = {}
    for c in c_values:
        conf_matrices[kernel][c] = train_svm(kernel, c)

In [None]:
# Plot confusion matrices for each kernel and C value

plt.figure(figsize=(20, 12))
for i, kernel in enumerate(['linear', 'poly', 'rbf']):
    for j, c in enumerate(c_values):
        plt.subplot(3, 3, i * 3 + j + 1)
        conf_matrix = conf_matrices[kernel][c]  # Reuse the matrix computed above instead of retraining
        sns.heatmap(conf_matrix, annot=True, cmap='Blues', fmt='g', cbar=False)
        plt.title(f'{kernel.capitalize()} Kernel, C={c}')
        plt.xlabel('Predicted Label')
        plt.ylabel('True Label')
plt.tight_layout()

plt.savefig("Media/SVM/SVM.png")

plt.show()



In [None]:
# Test accuracy of each kernel at every C value, used to compare the kernels
accuracies = {
    kernel: [
        accuracy_score(y_test, SVC(kernel=kernel, C=c, random_state=42).fit(X_train, y_train).predict(X_test))
        for c in c_values
    ]
    for kernel in ['linear', 'poly', 'rbf']
}

In [None]:
for kernel, acc_list in accuracies.items():
    print(f"Accuracy for {kernel.capitalize()} kernel:")
    for c, acc in zip(c_values, acc_list):
        print(f"C={c}: {acc:.4f}")
    print()

In [None]:
plt.figure(figsize=(10, 6))
for kernel in ['linear', 'poly', 'rbf']:
    plt.plot(c_values, accuracies[kernel], marker='o', label=kernel.capitalize())
plt.xscale('log')  # C spans two orders of magnitude, so space the points evenly
plt.title('Accuracy vs. C for Different Kernels')
plt.xlabel('Cost Parameter (C)')
plt.ylabel('Accuracy')
plt.xticks(c_values, [str(c) for c in c_values])
plt.legend()
plt.grid(True)
plt.show()

In [None]:
# Selected configuration: linear kernel with C=1

svm_classifier_optmial = SVC(kernel='linear', C=1, random_state=32)
svm_classifier_optmial.fit(X_train, y_train)

# Model Evaluation
y_pred = svm_classifier_optmial.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Linear kernel, C=1")
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))
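
The comparison above scores every kernel and C value on the test split, and the features go into the SVM unscaled. SVMs are sensitive to feature scale, so a possible refinement is to standardize the features inside a pipeline and choose the kernel and C by cross-validation on the training data only. The code below is a minimal sketch of that idea on synthetic data, not the notebook's method; the `*_demo` variables, `svm_pipeline`, and `search` are names introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic four-class stand-in data (assumption: replaces the notebook's features)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted inside each cross-validation fold, so no test data leaks in
svm_pipeline = make_pipeline(StandardScaler(), SVC())

# Same kernels and C values as in the cells above
param_grid = {
    "svc__kernel": ["linear", "poly", "rbf"],
    "svc__C": [0.1, 1, 10],
}

search = GridSearchCV(svm_pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr_demo, y_tr_demo)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", round(search.best_score_, 4))
print("Held-out test accuracy:", round(search.score(X_te_demo, y_te_demo), 4))
```

With this setup the test split is touched only once, after the kernel and C have been chosen, so the reported test accuracy is less likely to be optimistic.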