

---



**Step 1: Define Research Questions and Hypotheses**

**Introduction to Your Research Study**


---


My research explores the relationship between environmental factors and economic growth, focusing on air pollution, temperature anomalies, and GDP fluctuations. By analyzing datasets from NASA (Temperature Anomalies), WHO (Air Pollution), and the World Bank (GDP), the study aims to uncover how climate change and pollution impact economic performance globally.


**Key Research Objectives:**

- Impact of Air Pollution on GDP – Investigating whether higher pollution levels negatively affect economic growth.

- Effect of Temperature Anomalies on GDP – Understanding how rising global temperatures correlate with GDP fluctuations.

- Combined Influence of Climate & Pollution on Economy – Examining whether air pollution and temperature anomalies together have a stronger adverse effect on GDP.

This study leverages data-driven analysis, correlation studies, and predictive modeling to provide insights into the long-term consequences of climate change on economies. The findings can help policymakers and industries develop sustainable economic strategies while addressing environmental challenges.

Step 1.1 -> Import Required Libraries

In [1]:
import pandas as pd

Step 1.2 -> Load the Datasets

In [2]:
# Load datasets
nasa_temp_df = pd.read_csv('/content/NASA_Temperature_Anomalies_Cleaned.csv')
who_pollution_df = pd.read_excel('/content/WHO_Air_Pollution_Data.xlsx')
world_bank_gdp_df = pd.read_csv('/content/World_Bank_GDP_Cleaned.csv')

Step 1.3 -> Data Cleaning and Preprocessing

In [3]:
# Rename columns for clarity and consistency
nasa_temp_df.rename(columns={'Temperature_Anomaly': 'Temp_Anomaly'}, inplace=True)  # 'Year' already has the right name
who_pollution_df.rename(columns={'DIM_GEO_NAME': 'Country', 'DIM_TIME_YEAR': 'Year', 'VALUE_NUMERIC': 'Air_Pollution_Index'}, inplace=True)
world_bank_gdp_df.rename(columns={'Country Name': 'Country'}, inplace=True)

# Transform World Bank GDP dataset from wide to long format
gdp_long_df = world_bank_gdp_df.melt(id_vars=['Country', 'Country Code'], var_name='Year', value_name='GDP')
gdp_long_df['Year'] = gdp_long_df['Year'].astype(int)  # Convert Year to integer

# Drop unnecessary columns in WHO dataset
who_pollution_df = who_pollution_df[['Country', 'Year', 'Air_Pollution_Index']]

# Drop unnecessary columns in GDP dataset
gdp_long_df = gdp_long_df[['Country', 'Year', 'GDP']]
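The wide-to-long step can be illustrated on a toy table. This is a minimal sketch with made-up country names and GDP values; the `pd.to_numeric(..., errors="coerce")` guard is an added assumption for files that carry non-year columns, not something the original cell does.

```python
import pandas as pd

# Toy wide-format table; names and values are invented for illustration only
wide = pd.DataFrame({
    "Country": ["Aland", "Borduria"],
    "Country Code": ["ALD", "BRD"],
    "2018": [1.0e9, 2.0e9],
    "2019": [1.1e9, 2.2e9],
})

# One row per (country, year) instead of one column per year
gdp_long = wide.melt(id_vars=["Country", "Country Code"],
                     var_name="Year", value_name="GDP")

# Year labels are strings after melt; coerce so any stray non-year label becomes NaN
gdp_long["Year"] = pd.to_numeric(gdp_long["Year"], errors="coerce")
gdp_long = gdp_long.dropna(subset=["Year"]).astype({"Year": int})

print(gdp_long.sort_values(["Country", "Year"]).to_string(index=False))
```

Coercing instead of calling `astype(int)` directly avoids a hard failure if the real file contains a column whose name is not a year.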

Step 1.4 -> Merging the Datasets

In [4]:
# Merge WHO Air Pollution Data with World Bank GDP Data on 'Country' and 'Year'
merged_country_data = pd.merge(who_pollution_df, gdp_long_df, on=['Country', 'Year'], how='inner')

# Merge the global NASA temperature anomaly data using 'Year' as the key
final_merged_df = pd.merge(merged_country_data, nasa_temp_df, on='Year', how='left')

# Display the final merged dataset
print(final_merged_df.head())  # Shows first few rows

       Country  Year  Air_Pollution_Index           GDP  Temp_Anomaly
0  Afghanistan  2019            265.66452  1.879944e+10         -1.49
1  Afghanistan  2019            265.66452  1.879944e+10         -1.21
2  Afghanistan  2019            265.66452  1.879944e+10         -0.31
3  Afghanistan  2019            265.66452  1.879944e+10          0.54
4  Afghanistan  2019            265.66452  1.879944e+10          1.33


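Note that the preview repeats the same Afghanistan 2019 row with a different Temp_Anomaly each time: the NASA file appears to hold several temperature records per year, so the merge on 'Year' duplicates every country row once per record.

Now that the data is merged, we can proceed with data analysis such as:  
-> Summary statistics  
-> Correlation analysis  
-> Data visualization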
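If the NASA table holds more than one record per year, merging on 'Year' multiplies the country rows. A possible way to avoid that, sketched here on made-up data, is to collapse the temperature table to one value per year first and let pandas verify the key. Using the annual mean is an assumption; the frames below are stand-ins, not the study's data.

```python
import pandas as pd

# Made-up stand-ins for the merged country table and the NASA table
country = pd.DataFrame({
    "Country": ["Aland", "Borduria"],
    "Year": [2019, 2019],
    "Air_Pollution_Index": [30.0, 12.0],
    "GDP": [1.1e9, 2.2e9],
})
temps = pd.DataFrame({
    "Year": [2019, 2019, 2019],        # several records for the same year
    "Temp_Anomaly": [0.9, 1.0, 1.1],
})

# Collapse to one temperature value per year (annual mean is an assumption)
annual_temp = temps.groupby("Year", as_index=False)["Temp_Anomaly"].mean()

# validate raises pandas.errors.MergeError if the right side still has duplicate years
merged = country.merge(annual_temp, on="Year", how="left", validate="many_to_one")

print(len(country), len(merged))
```

With `validate="many_to_one"` the merge fails loudly instead of silently inflating the row count.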

**Step 2 -> Exploratory Data Analysis (EDA)**

2.1 : Summary Statistics

In [5]:
# Display basic statistics of the dataset
print(final_merged_df.describe())

         Year  Air_Pollution_Index           GDP  Temp_Anomaly
count  3744.0          3744.000000  3.744000e+03   3744.000000
mean   2019.0            60.662719  3.874391e+11      0.518333
std       0.0            67.783442  1.320738e+12      1.344444
min    2019.0             6.202750  1.248711e+08     -1.490000
25%    2019.0            14.484995  1.267586e+10     -0.620000
50%    2019.0            30.480490  4.117623e+10      0.585000
75%    2019.0            78.456236  2.529564e+11      1.707500
max    2019.0           305.136688  1.427997e+13      2.360000


2.2 : Check for Missing Values

In [6]:
# Check for missing values in the dataset
print(final_merged_df.isnull().sum())

Country                0
Year                   0
Air_Pollution_Index    0
GDP                    0
Temp_Anomaly           0
dtype: int64


2.3 : Correlation Analysis

In [7]:
# Calculate correlation between variables
correlation_matrix = final_merged_df[['Air_Pollution_Index', 'GDP', 'Temp_Anomaly']].corr()
print(correlation_matrix)

                     Air_Pollution_Index           GDP  Temp_Anomaly
Air_Pollution_Index         1.000000e+00 -9.130070e-02 -2.782982e-17
GDP                        -9.130070e-02  1.000000e+00  1.439948e-17
Temp_Anomaly               -2.782982e-17  1.439948e-17  1.000000e+00
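Correlations of order 1e-17 with Temp_Anomaly are numerically zero. One plausible explanation, assuming the merge paired every country row with every temperature record of the same year, is that such a cross join makes the two columns uncorrelated by construction. The sketch below reproduces the effect on made-up values; it does not use the study's data.

```python
import pandas as pd

# Made-up values: three countries and four temperature records for one year
left = pd.DataFrame({"Year": 2019, "GDP": [1.0, 5.0, 9.0]})
right = pd.DataFrame({"Year": 2019, "Temp_Anomaly": [-1.0, 0.0, 0.5, 2.0]})

# Every GDP value is paired with every temperature value: 3 x 4 = 12 rows
crossed = left.merge(right, on="Year")

# In a full cross product the sample covariance is zero up to rounding error
r = crossed["GDP"].corr(crossed["Temp_Anomaly"])
print(len(crossed), abs(r) < 1e-12)
```

Under that reading, the near-zero values say nothing about how temperature and GDP are related; they only reflect how the table was built.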


**Step 3 -> Data Visualization**

We need to import **Matplotlib** and **Seaborn** (the interactive charts below use Plotly Express).

In [8]:
import matplotlib.pyplot as plt
import seaborn as sns

3.1 : Line Chart: GDP, Air Pollution, and Temperature Anomalies Over Time

In [19]:
import pandas as pd
import plotly.express as px

# Load your datasets
nasa_temp_df = pd.read_csv("NASA_Temperature_Anomalies_Cleaned.csv")
world_bank_gdp_df = pd.read_csv("World_Bank_GDP_Cleaned.csv")
who_air_pollution_df = pd.read_excel("WHO_Air_Pollution_Data.xlsx")

# Convert GDP data to long format (Year, Country, GDP)
gdp_long_df = world_bank_gdp_df.melt(id_vars=["Country Name", "Country Code"],
                                     var_name="Year", value_name="GDP")
gdp_long_df["Year"] = gdp_long_df["Year"].astype(int)

# Select relevant air pollution data
who_air_pollution_df = who_air_pollution_df[who_air_pollution_df["IND_CODE"] == "SDGPM25"]
who_air_pollution_df = who_air_pollution_df.rename(columns={"DIM_GEO_NAME": "Country",
                                                             "DIM_TIME_YEAR": "Year",
                                                             "VALUE_NUMERIC": "PM2.5"})
who_air_pollution_df = who_air_pollution_df[["Country", "Year", "PM2.5"]]

# Merge datasets
merged_df = gdp_long_df.merge(who_air_pollution_df, left_on=["Country Name", "Year"], right_on=["Country", "Year"], how="inner")
merged_df = merged_df.merge(nasa_temp_df, on="Year", how="inner")
merged_df = merged_df.drop(columns=["Country", "Country Code"])

# Reshape data for interactive visualization
melted_df = merged_df.melt(id_vars=["Year"], value_vars=["GDP", "PM2.5", "Temperature_Anomaly"],
                            var_name="Variable", value_name="Value")

# Create interactive line chart with dropdown menu
fig = px.line(melted_df, x="Year", y="Value", color="Variable",
              title="Trends of GDP, Air Pollution, and Temperature Anomalies Over Time",
              labels={"Value": "Metric Value", "Year": "Year", "Variable": "Indicator"},
              markers=True)

# Update layout to include dropdown menu
fig.update_layout(
    updatemenus=[{
        "buttons": [
            {"label": "All", "method": "update", "args": [{"visible": [True, True, True]}]},
            {"label": "GDP", "method": "update", "args": [{"visible": [True, False, False]}]},
            {"label": "Air Pollution (PM2.5)", "method": "update", "args": [{"visible": [False, True, False]}]},
            {"label": "Temperature Anomalies", "method": "update", "args": [{"visible": [False, False, True]}]}
        ],
        "direction": "down",
        "showactive": True
    }]
)

# Show interactive chart
fig.show()
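GDP, PM2.5, and temperature anomalies sit on very different scales, and the merged table has many rows per year, so a single line chart can be hard to read. A possible preparation step, sketched on made-up numbers, is to reduce the data to one value per year and indicator and rescale each indicator to [0, 1] before melting. The per-year mean and the min-max scaling are both assumptions, not part of the original cell.

```python
import pandas as pd

# Made-up merged table: two countries, two years
merged = pd.DataFrame({
    "Year": [2018, 2018, 2019, 2019],
    "GDP": [1.0e9, 3.0e9, 2.0e9, 4.0e9],
    "PM2.5": [30.0, 10.0, 28.0, 8.0],
    "Temperature_Anomaly": [0.8, 0.8, 1.0, 1.0],
})

# One value per year and indicator (the mean is an assumption; GDP could be summed instead)
yearly = merged.groupby("Year")[["GDP", "PM2.5", "Temperature_Anomaly"]].mean()

# Min-max scale each column to [0, 1] so the three lines can share one axis
scaled = (yearly - yearly.min()) / (yearly.max() - yearly.min())

# Long format with the same column names the chart code expects
plot_df = scaled.reset_index().melt(id_vars="Year", var_name="Variable", value_name="Value")
print(plot_df)
```

A frame shaped like `plot_df` can be passed to `px.line` with `x="Year"`, `y="Value"`, `color="Variable"`. Min-max scaling yields NaN for an indicator that is constant over all years, so that case would need separate handling.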


3.2 : Scatter Plot: Temperature Anomalies vs GDP

In [20]:
# Convert GDP data to long format (Year, Country, GDP)
gdp_long_df = world_bank_gdp_df.melt(id_vars=["Country Name", "Country Code"],
                                     var_name="Year", value_name="GDP")
gdp_long_df["Year"] = gdp_long_df["Year"].astype(int)

# Merge datasets on Year
merged_df = gdp_long_df.merge(nasa_temp_df, on="Year", how="inner")

# Create an interactive scatter plot
fig = px.scatter(merged_df, x="Temperature_Anomaly", y="GDP", color="Year",
                 title="Impact of Temperature Anomalies on GDP",
                 labels={"Temperature_Anomaly": "Temperature Anomaly", "GDP": "Gross Domestic Product"},
                 hover_data=["Year"], size_max=10)

# Update layout to include a dropdown menu. A numeric colour column gives a single
# trace, so each button restyles that trace's x, y, and marker colours together;
# "All Years" restores the full data after a filter has been applied.
def year_button(label, df):
    return {"label": label, "method": "restyle",
            "args": [{"x": [df["Temperature_Anomaly"].tolist()],
                      "y": [df["GDP"].tolist()],
                      "marker.color": [df["Year"].tolist()]}]}

fig.update_layout(
    updatemenus=[{
        "buttons": [
            year_button("All Years", merged_df),
            year_button("Recent Years (2000+)", merged_df[merged_df['Year'] >= 2000]),
            year_button("Pre-2000 Data", merged_df[merged_df['Year'] < 2000])
        ],
        "direction": "down",
        "showactive": True
    }]
)

# Show interactive plot
fig.show()


3.3 : Line Plot : Global Temperature Anomalies Over Time

In [21]:
import pandas as pd
import plotly.express as px

# Load dataset
nasa_temp_df = pd.read_csv("NASA_Temperature_Anomalies_Cleaned.csv")

# Create an interactive line plot for Temperature Anomalies over time
fig = px.line(nasa_temp_df, x="Year", y="Temperature_Anomaly",
              title="Global Temperature Anomalies Over Time",
              labels={"Temperature_Anomaly": "Temperature Anomaly", "Year": "Year"},
              markers=True)

# Update layout to include a dropdown menu for filtering years. Each button restyles
# the single line trace; "All Years" restores the full series after a filter.
def year_button(label, df):
    return {"label": label, "method": "restyle",
            "args": [{"x": [df["Year"].tolist()],
                      "y": [df["Temperature_Anomaly"].tolist()]}]}

fig.update_layout(
    updatemenus=[{
        "buttons": [
            year_button("All Years", nasa_temp_df),
            year_button("Recent Years (2000+)", nasa_temp_df[nasa_temp_df['Year'] >= 2000]),
            year_button("Pre-2000 Data", nasa_temp_df[nasa_temp_df['Year'] < 2000])
        ],
        "direction": "down",
        "showactive": True
    }]
)

# Show interactive plot
fig.show()


3.4 : Heatmap of Correlation Matrix

In [28]:
# Create an interactive detailed correlation heatmap
import plotly.figure_factory as ff

fig = ff.create_annotated_heatmap(
    z=correlation_matrix.values,
    x=correlation_matrix.columns.tolist(),
    y=correlation_matrix.index.tolist(),
    colorscale="RdBu",  # A valid colorscale from Plotly
    annotation_text=correlation_matrix.round(2).values,
    showscale=True
)

# Update layout to include a dropdown menu for filtering specific correlations.
# Each restyle value is wrapped in a list (one entry per trace). The cell
# annotations are layout objects, so they are not changed by these buttons.
cols = correlation_matrix.columns.tolist()
env_rows = ['Air_Pollution_Index', 'Temp_Anomaly']

def rows_button(label, rows):
    sub = correlation_matrix.loc[rows]
    return {"label": label, "method": "restyle",
            "args": [{"z": [sub.values.tolist()], "y": [rows], "x": [cols]}]}

fig.update_layout(
    title="Detailed Correlation Heatmap",
    updatemenus=[{
        "buttons": [
            rows_button("All Variables", correlation_matrix.index.tolist()),
            rows_button("GDP-Related Correlations", ['GDP']),
            rows_button("Environmental Factors (Air Pollution & Temp)", env_rows)
        ],
        "direction": "down",
        "showactive": True
    }]
)

# Show interactive heatmap
fig.show()

**Step 4 -> Statistical Modeling - Regression Analysis**

To quantify the relationship between Air Pollution, Temperature Anomalies, and GDP, I use Linear Regression.

4.1 : Import Required Libraries

In [13]:
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

4.2 : Prepare Data for Regression

In [14]:
# Select relevant columns
df_regression = final_merged_df[['Air_Pollution_Index', 'Temp_Anomaly', 'GDP']].dropna()

# Define independent variables (X) and dependent variable (y)
X = df_regression[['Air_Pollution_Index', 'Temp_Anomaly']]  # Predictors
y = df_regression['GDP']  # Target variable

# Add a constant (intercept) to X for statsmodels
X = sm.add_constant(X)

# Split data into Training (80%) and Testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

4.3 : Train Linear Regression Model

In [15]:
# Train the model using OLS (Ordinary Least Squares)
model = sm.OLS(y_train, X_train).fit()

# Print the regression results summary
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                    GDP   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.007
Method:                 Least Squares   F-statistic:                     11.58
Date:                Thu, 13 Mar 2025   Prob (F-statistic):           9.82e-06
Time:                        22:51:33   Log-Likelihood:                -87894.
No. Observations:                2995   AIC:                         1.758e+05
Df Residuals:                    2992   BIC:                         1.758e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                4.978e+11   3

4.4 : Predict and Evaluate the Model

In [16]:
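# Train the model using sklearn.
# Note: X_train still holds the 'const' column added for statsmodels; sklearn fits
# its own intercept by default, so that column is redundant here.
regressor = LinearRegression()
regressor.fit(X_train, y_train)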

# Predict GDP using the test set
y_pred = regressor.predict(X_test)

# Model evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'R-squared Score: {r2}')

Mean Squared Error: 1.4027177141673478e+24
R-squared Score: 0.011455478598730084
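To make the two metrics easier to read, the sketch below computes MSE and R² from their definitions on made-up numbers and converts the reported MSE into an RMSE in GDP units. The helper name `mse_and_r2` is hypothetical. The resulting RMSE is roughly 1.2e12, the same order of magnitude as the GDP standard deviation shown in the summary statistics.

```python
import numpy as np

def mse_and_r2(y_true, y_hat):
    """Return (MSE, R^2) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)           # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return ss_res / len(y_true), 1.0 - ss_res / ss_tot

# Made-up numbers, used only to exercise the function
mse_toy, r2_toy = mse_and_r2([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])

# RMSE for the MSE reported above, expressed in the same units as GDP
rmse_reported = np.sqrt(1.4027177141673478e24)
print(mse_toy, r2_toy, f"{rmse_reported:.3e}")
```

An R² near zero means the predictions are barely better than always predicting the mean GDP.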


4.5 : Visualizing Regression Results

In [32]:
import numpy as np
import pandas as pd
import plotly.express as px

# Use the actual test targets and predictions from Step 4.4
# (np.asarray drops the shuffled pandas index so rows are numbered 0..n-1)
gdp_comparison_df = pd.DataFrame({"Actual GDP": np.asarray(y_test),
                                  "Predicted GDP": np.asarray(y_pred)})

# Create an interactive bar chart using Plotly
fig = px.bar(gdp_comparison_df,
             x=gdp_comparison_df.index,
             y=["Actual GDP", "Predicted GDP"],
             title="Actual vs Predicted GDP",
             labels={"index": "Data Point", "value": "GDP", "variable": "Type"},
             barmode="group")

# Add a dropdown menu to toggle between Actual, Predicted, and Both
fig.update_layout(
    updatemenus=[{
        "buttons": [
            {"label": "Both (Actual & Predicted)", "method": "update", "args": [{"visible": [True, True]}]},
            {"label": "Actual GDP Only", "method": "update", "args": [{"visible": [True, False]}]},
            {"label": "Predicted GDP Only", "method": "update", "args": [{"visible": [False, True]}]}
        ],
        "direction": "down",
        "showactive": True
    }]
)

# Show the interactive bar chart
fig.show()


-> Plot: Residuals Distribution

In [33]:
import numpy as np
import pandas as pd
import plotly.express as px

# Residuals (actual minus predicted GDP) from the test targets and predictions of Step 4.4
residuals_df = pd.DataFrame({"Residuals": np.asarray(y_test) - np.asarray(y_pred)})

# Create an interactive violin plot for residuals
fig = px.violin(residuals_df, y="Residuals", box=True, points="all",
                title="Residuals Distribution (Actual - Predicted GDP)",
                labels={"Residuals": "Residual Value"})

# Add a dropdown menu to toggle different visualization styles
fig.update_layout(
    updatemenus=[{
        "buttons": [
            {"label": "Violin Plot (Default)", "method": "update", "args": [{"type": "violin"}]},
            {"label": "Box Plot", "method": "update", "args": [{"type": "box"}]},
            {"label": "Histogram", "method": "update", "args": [{"type": "histogram"}]}
        ],
        "direction": "down",
        "showactive": True
    }]
)

# Show interactive residuals plot
fig.show()


-> Interpret Model Results :
*   Check p-values in model.summary() to determine statistical significance.
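*   Check the R² score for model performance; the values reported above (0.008 on the training set, about 0.011 on the test set) mean the two predictors explain roughly 1% of the variation in GDP.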

-> Consider More Features :
*   Additional economic indicators, population, CO2 emissions, etc.

-> Perform More Advanced Analysis :
*   Polynomial regression, Time-Series Analysis, etc.
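As an illustration of the polynomial regression idea, the sketch below compares a linear and a quadratic fit on synthetic data with a deliberately curved relationship. The data, the noise level, and the helper name `r2_of_fit` are assumptions made for the example; nothing here is taken from the study's datasets.

```python
import numpy as np

# Synthetic data, not from the study: a curved relationship plus noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 0.5 * x - 0.3 * x**2 + rng.normal(scale=1.0, size=x.size)

def r2_of_fit(degree):
    """Fit a polynomial of the given degree by least squares and return its in-sample R^2."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    return 1.0 - resid.var() / y.var()

# The quadratic fit should explain more of the variation than the straight line
print(r2_of_fit(1), r2_of_fit(2))
```

In-sample R² never decreases when the degree is raised, so a comparison on real data would need a held-out test set to be meaningful.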





