In [None]:
# Importing necessary libraries for data manipulation and visualization
import pandas as pd  # For data manipulation and analysis
import seaborn as sns  # For data visualization
import matplotlib.pyplot as plt  # For plotting
from sklearn.linear_model import LinearRegression  # For linear regression modeling
from sklearn.impute import SimpleImputer  # For imputing (filling in) missing NaN values
import numpy as np  # For numerical operations

# Import data from .csv file
dataset = pd.read_csv('owid-energy-data full-1.csv')

# Display a preview of the dataset to verify successful import and understand its structure
dataset

In [None]:
# Display the summary statistics of the dataset for initial insights
dataset.describe()

In [None]:
# List of ASEAN member states for filtering the dataset
asean_member_states = [
    "Brunei", 
    "Cambodia", 
    "Indonesia", 
    "Laos", 
    "Malaysia", 
    "Myanmar", 
    "Philippines", 
    "Singapore", 
    "Thailand", 
    "Vietnam"]

# List of years to be included in the filtered dataset
included_years = [2017, 2018, 2019, 2020, 2021, 2022]

# List of column names to be retained in the filtered dataset
column_names = [
    'country',
    'year',
    'population',
    'gdp',
    'biofuel_consumption',
    'coal_consumption',
    'fossil_fuel_consumption',
    'gas_consumption',
    'hydro_consumption',
    'oil_consumption',
    'wind_consumption',
    'other_renewable_consumption',
    'solar_consumption'
]

# Filtering the dataset for the specified years and countries
filtered_year_dataset = dataset[(dataset['year'].isin(included_years)) & (dataset['country'].isin(asean_member_states))]

# Selecting only the specified columns
filtered_year_dataset = filtered_year_dataset[column_names]

# Displaying the filtered dataset
filtered_year_dataset

In [None]:
# Drop 'country' and 'year' columns from the filtered dataset
filtered_dataset = filtered_year_dataset.drop(columns=['country', 'year'])

In [None]:
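def plot_gdp_regression(column_number):
    """Plot GDP against the feature at position `column_number` in `filtered_dataset`, with a fitted regression line."""
    # Set the figure size
    plt.figure(figsize=(20,10))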

    # Create a scatter plot of GDP against the selected feature
    # (the title is set further down, once the correlation coefficient is known)
    sns.scatterplot(x=filtered_dataset[filtered_dataset.columns[column_number]], y=filtered_dataset['gdp'])

    # Extract the independent and dependent variables
    X = filtered_dataset[filtered_dataset.columns[column_number]].values.reshape(-1, 1)  # Independent variable
    y = filtered_dataset['gdp'].values.reshape(-1, 1)  # Dependent variable

    # Initialize the SimpleImputer with the mean strategy
    imputer = SimpleImputer(strategy='mean')

    # Apply the imputer to handle any missing values
    X_imputed = imputer.fit_transform(X)
    y_imputed = imputer.fit_transform(y)

    # Fit the linear regression model
    model = LinearRegression()
    model.fit(X_imputed, y_imputed)

    # Extract the slope (coefficient) and intercept
    slope = model.coef_[0][0]  
    intercept = model.intercept_[0]

    # Add the regression line (use the imputed x-values so they match the predictions and contain no NaNs)
    plt.plot(X_imputed.flatten(), model.predict(X_imputed).flatten(), 'r', label=f'y = {slope:.4g}*x + {intercept:.4g}')

    # Calculate the correlation coefficient
    correlation_coefficient = np.corrcoef(X_imputed.flatten(), y_imputed.flatten())[0, 1]

    # Add the correlation coefficient to the plot title
    plt.title(f'Scatter Plot of GDP vs {filtered_dataset.columns[column_number]} (Correlation: {correlation_coefficient:.2f})')

    # Add legend and grid
    plt.legend()
    plt.grid(True) 

    # Show the plot
    plt.show()

In [None]:
plot_gdp_regression(2)

- **By looking at the scatterplot, is there a correlation between each pair?**

    - The scatterplot of GDP against biofuel consumption shows a very weak positive correlation, bordering on no correlation at all.

- **Are the correlations intuitive or do they make sense?**

    - Although the correlation in the scatterplot is very weak, a positive link between GDP and biofuel consumption is logical. Biofuel use signifies a progressive, eco-friendly approach to energy sourcing, which is consistent with the idea that a country embracing biofuel is also advancing its economic development.

- **Are there any outliers?**

    - We can see two outliers in this scatterplot near the top-left corner, almost reaching the middle of the plot.
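
The verbal labels used in these answers ("very weak", "weak") can be tied to the correlation coefficient shown in each plot title. Below is a minimal sketch of one way to do that. The function name `describe_correlation` and the cut-off values are assumptions chosen for illustration, not a standard and not part of the original notebook. Unlike `plot_gdp_regression`, it drops incomplete pairs rather than mean-imputing them.

```python
import numpy as np

def describe_correlation(x, y):
    """Return Pearson's r and a rough verbal label for its strength.

    The cut-offs below are illustrative assumptions, not an agreed standard.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Keep only the pairs where both values are present
    mask = ~np.isnan(x) & ~np.isnan(y)
    r = np.corrcoef(x[mask], y[mask])[0, 1]

    strength = abs(r)
    if strength < 0.2:
        label = "very weak"
    elif strength < 0.4:
        label = "weak"
    elif strength < 0.6:
        label = "moderate"
    elif strength < 0.8:
        label = "strong"
    else:
        label = "very strong"

    direction = "positive" if r >= 0 else "negative"
    return r, f"{label} {direction}"
```

In the notebook, this could be called as `describe_correlation(filtered_dataset['biofuel_consumption'], filtered_dataset['gdp'])`.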

In [None]:
plot_gdp_regression(3)

- **By looking at the scatterplot, is there a correlation between each pair?**

    - The scatterplot of GDP against coal consumption shows a weak positive correlation.

- **Are the correlations intuitive or do they make sense?**

    - The weak positive correlation between GDP and coal consumption aligns with economic logic. Coal is typically used in production, so increased coal usage indicates higher industrial activity, implying that a country is manufacturing more goods.

- **Are there any outliers?**

    - In this scatterplot, we can clearly see two outliers at the top-middle of the plot.
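
The outliers in these answers are identified by eye. One way to make that check reproducible is Tukey's IQR rule, sketched below. This is a technique swapped in for illustration, not something the original notebook does. The helper name `flag_iqr_outliers` is hypothetical, and the small `toy` frame holds placeholder values, not figures from the OWID dataset.

```python
import pandas as pd

def flag_iqr_outliers(df, column, k=1.5):
    """Return the rows whose `column` value lies outside the Tukey fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] < lower) | (df[column] > upper)]

# Placeholder values for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "gdp": [1.0, 1.2, 0.9, 1.1, 1.3, 9.0],
})

outliers = flag_iqr_outliers(toy, "gdp")
print(outliers)
```

Applied to `filtered_year_dataset`, which still contains the `country` and `year` columns, the same call would show which country and year each flagged point belongs to.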

In [None]:
plot_gdp_regression(4)

- **By looking at the scatterplot, is there a correlation between each pair?**

    - The scatterplot of GDP against fossil fuel consumption shows a weak positive correlation between the two variables.

- **Are the correlations intuitive or do they make sense?**

    - The scatterplot illustrates a weak positive correlation between the two variables, mirroring the pattern observed in the previous scatterplot. Higher fossil fuel consumption often signifies higher industrial activity and increased production of goods, which aligns with economic logic.

- **Are there any outliers?**

    - The scatterplot shows two outliers near the edge of the plot, at the top-right corner.

In [None]:
plot_gdp_regression(5)

- **By looking at the scatterplot, is there a correlation between each pair?**

    - In the scatterplot depicting the relationship between GDP and gas consumption, a weak positive correlation between the two variables is evident.

- **Are the correlations intuitive or do they make sense?**

    - As in the preceding two scatterplots, the weak positive correlation between GDP and gas consumption is intuitive and aligns with economic logic. Like coal and fossil fuel consumption, gas consumption is primarily linked to industrial activity, so higher consumption indicates heightened production of goods.

- **Are there any outliers?**

    - The scatterplot shows two apparent outliers, both situated at the top-right corner of the plot.

In [None]:
plot_gdp_regression(6)

- **By looking at the scatterplot, is there a correlation between each pair?**
 
    - Observing the relationship between GDP and hydro consumption, we can discern a very weak negative correlation between the two variables.

- **Are the correlations intuitive or do they make sense?**

    - The relationship depicted in the scatterplot is counterintuitive. It suggests that GDP decreases as hydro consumption increases, which contradicts conventional knowledge: greater utilization of renewable energy, such as hydroelectric power, typically reduces energy costs for a country, freeing up financial resources for investment in other areas.

- **Are there any outliers?**

    - Two outliers are noticeable on the upper left of the scatterplot, positioned just above the main cluster of data points.

In [None]:
plot_gdp_regression(7)

- **By looking at the scatterplot, is there a correlation between each pair?**

    - The scatterplot of GDP against oil consumption shows a weak positive correlation between the two variables.

- **Are the correlations intuitive or do they make sense?**

    - The weak positive relationship between GDP and oil consumption evident on the scatterplot is intuitive. Increased oil consumption signifies extensive use across various transportation modes (land, air, and water), suggesting heightened economic activity. This encompasses everything from increased employment to the delivery of goods.

- **Are there any outliers?**

    - Examining the scatterplot, we can identify two outliers positioned at the top right-hand corner of the plot.

In [None]:
plot_gdp_regression(8)

- **By looking at the scatterplot, is there a correlation between each pair?**

    - The scatterplot of GDP against wind consumption shows a very weak negative correlation, bordering on no correlation at all.

- **Are the correlations intuitive or do they make sense?**

    - The expected relationship between GDP and wind consumption is intuitive, but the very weak negative correlation observed in the scatterplot, potentially influenced by incomplete data, contradicts it. As mentioned earlier for GDP and hydro consumption, when a country increases its use of renewable energy sources such as wind, the funds allocated for energy can be redirected to other industries, potentially boosting GDP. The scatterplot does not reflect this expectation.

- **Are there any outliers?**

    - The outliers are easy to identify in this scatterplot. Two outliers sit just above the primary cluster of data points, at the upper-left corner of the plot.
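
The possible influence of incomplete data can be checked directly. `plot_gdp_regression` fills missing values with the column mean before computing the correlation, and mean imputation tends to pull the coefficient toward zero when many values are missing. The sketch below compares that approach with using only complete rows. The function names and the placeholder arrays are assumptions for illustration, not values from the dataset.

```python
import numpy as np

def corr_mean_imputed(x, y):
    """Pearson's r after replacing NaNs in each variable with that variable's mean."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_filled = np.where(np.isnan(x), np.nanmean(x), x)
    y_filled = np.where(np.isnan(y), np.nanmean(y), y)
    return np.corrcoef(x_filled, y_filled)[0, 1]

def corr_complete_rows(x, y):
    """Pearson's r using only the rows where both values are present."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = ~np.isnan(x) & ~np.isnan(y)
    return np.corrcoef(x[mask], y[mask])[0, 1]

# Placeholder values: x is missing for half of the rows
x = np.array([1.0, 2.0, 3.0, np.nan, np.nan, np.nan])
y = np.array([1.0, 2.0, 3.0, 10.0, 20.0, 30.0])

print("complete rows:", corr_complete_rows(x, y))  # perfectly correlated where both are observed
print("mean-imputed: ", corr_mean_imputed(x, y))   # much closer to zero
```

In the notebook, `filtered_dataset.isna().sum()` would show how many values are missing per column, which indicates how much a given plot depends on imputed points.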

In [None]:
plot_gdp_regression(9)

- **By looking at the scatterplot, is there a correlation between each pair?**

    - Analyzing the scatterplot depicting the relationship between GDP and other renewable energy consumption, we can observe a weak positive correlation between the two variables.

- **Are the correlations intuitive or do they make sense?**

    - Unlike the previous scatterplots of renewable energy sources, this relationship is intuitive. As previously discussed, countries that invest more in renewable energy can often allocate additional resources to other sectors, consequently enhancing their GDP, as reflected in this scatterplot.

- **Are there any outliers?**

    - In this scatterplot, two outliers are evident at the uppermost region. They are distinctly positioned apart from each other, with one leaning more towards the left side, while the other is situated towards the right. These outliers stand out from the rest of the data points.

In [None]:
plot_gdp_regression(10)

- **By looking at the scatterplot, is there a correlation between each pair?**

    - Upon examining the scatterplot illustrating the relationship between GDP and solar consumption, we can observe a very weak negative correlation between the two variables.

- **Are the correlations intuitive or do they make sense?**

    - Like the earlier scatterplots of renewable energy sources (with the exception of other renewable energy consumption), this correlation is not intuitive. When a country increases its use of renewable energy, such as solar energy, its GDP would be expected to rise. Lower costs of producing and consuming energy create opportunities for the country to allocate more budget to sectors that can further enhance its GDP.
    
- **Are there any outliers?**

    - When analyzing the scatterplot, we can see two outliers positioned above the main cluster of data points, towards the upper-left section of the plot.
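
The nine pairwise comparisons above can also be summarized in a single table. The sketch below ranks every numeric column by its Pearson correlation with GDP. The helper name `gdp_correlations` and the toy frame are assumptions for illustration. Note that `DataFrame.corr` excludes missing values pair by pair rather than mean-imputing them, so the coefficients can differ from those in the plot titles.

```python
import numpy as np
import pandas as pd

def gdp_correlations(df, target="gdp"):
    """Pearson's r between `target` and every other numeric column, highest first."""
    numeric = df.select_dtypes("number")
    corr = numeric.corr()[target].drop(target)
    return corr.sort_values(ascending=False)

# Placeholder values for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "gdp": [1.0, 2.0, 3.0, 4.0],
    "a": [2.0, 4.0, 6.0, 8.0],     # rises with gdp
    "b": [4.0, 3.0, 2.0, 1.0],     # falls as gdp rises
    "c": [1.0, np.nan, 3.0, 5.0],  # one missing value, skipped pairwise
})

result = gdp_correlations(toy)
print(result)
```

In the notebook, `gdp_correlations(filtered_dataset)` would list all features, including `population`, in one view.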