# Exploratory Data Analysis

**Introduction**

This notebook explores the global dataset on oil, gas, and coal production and consumption. The goal is to understand its structure, identify meaningful patterns, test simple hypotheses, and check for data quality issues.

It addresses the following questions:

1. What does the dataset look like in terms of structure, variables, and size?

2. What trends, patterns, and anomalies exist in the production and consumption of fuels?

3. Are production and consumption correlated for oil, gas, and coal?

4. What issues or inconsistencies exist in the dataset?

# Part A: The Questions
Here are my guiding questions for the EDA:

- How many rows, columns, and duplicate rows does the dataset have?

- What data types are used? Are they appropriate?

- How much missing data is present?

- What trends do we see in production vs. consumption for each fuel type?

- Do production and consumption move together?

- What data quality issues exist?

# Part B: Exploring the Data Structure

In [15]:
import pandas as pd

# Load dataset
file_path = "C:\\Users\\PATRICK\\Documents\\Fuel production vs consumption.csv"
prod = pd.read_csv(file_path, encoding="cp1252")

# Shape of the dataset
print("Dataset Shape:", prod.shape)

# Checking duplicates
print("\nSummary of Duplicates:")
print(prod.duplicated())

# Column data types and non-null counts
# (info() prints its report directly and returns None, so it is not wrapped in print)
print("\nDataset Info:")
prod.info()

# First 5 rows
print("\nFirst 5 Rows:")
display(prod.head())

# Last 5 rows
print("\nLast 5 Rows:")
display(prod.tail())

# Missing values
print("\nMissing Values per Column:")
print(prod.isnull().sum())


Dataset Shape: (9237, 15)

Summary of Duplicates:
0       False
1       False
2       False
3       False
4       False
        ...  
9232    False
9233    False
9234    False
9235    False
9236    False
Length: 9237, dtype: bool

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9237 entries, 0 to 9236
Data columns (total 15 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Year                              9237 non-null   int64  
 1   Entity                            9237 non-null   object 
 2   Gas production(m³)                8708 non-null   float64
 3   Gas consumption(m³)               8689 non-null   float64
 4   Coal production(Ton)              8817 non-null   float64
 5   Coal consumption(Ton)             8817 non-null   float64
 6   Oil production(m³)                8792 non-null   float64
 7   Oil consumption(m³)               8555 non-null   float64
 8   Gas productio

First 5 Rows:

| | Year | Entity | Gas production(m³) | Gas consumption(m³) | Coal production(Ton) | Coal consumption(Ton) | Oil production(m³) | Oil consumption(m³) | Gas production per capita(m³) | Gas consumption per capita(m³) | Coal production per capita(Ton) | Coal consumption per capita(Ton) | Oil production per capita(m³) | Oil consumption per capita(m³) | Population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1980 | Afghanistan | 1699000000.0 | 56640000.0 | 119000.0 | 119000.0 | 0.0 | 406500.0 | 127.2 | 4.241 | 0.00891 | 0.00891 | 0.0 | 0.03043 | 13360000.0 |
| 1 | 1981 | Afghanistan | 2237000000.0 | 84960000.0 | 125000.0 | 125000.0 | 0.0 | 464600.0 | 169.9 | 6.45 | 0.00949 | 0.00949 | 0.0 | 0.03527 | 13170000.0 |
| 2 | 1982 | Afghanistan | 2294000000.0 | 141600000.0 | 145000.0 | 145000.0 | 0.0 | 452900.0 | 178.1 | 10.99 | 0.01126 | 0.01126 | 0.0 | 0.03516 | 12880000.0 |
| 3 | 1983 | Afghanistan | 2407000000.0 | 141600000.0 | 145000.0 | 145000.0 | 0.0 | 638800.0 | 192.0 | 11.29 | 0.01157 | 0.01157 | 0.0 | 0.05095 | 12540000.0 |
| 4 | 1984 | Afghanistan | 2407000000.0 | 141600000.0 | 148000.0 | 148000.0 | 0.0 | 638800.0 | 197.2 | 11.6 | 0.01213 | 0.01213 | 0.0 | 0.05234 | 12200000.0 |



Last 5 Rows:


| | Year | Entity | Gas production(m³) | Gas consumption(m³) | Coal production(Ton) | Coal consumption(Ton) | Oil production(m³) | Oil consumption(m³) | Gas production per capita(m³) | Gas consumption per capita(m³) | Coal production per capita(Ton) | Coal consumption per capita(Ton) | Oil production per capita(m³) | Oil consumption per capita(m³) | Population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9232 | 2017 | Zimbabwe | 0.0 | 0.0 | 2928000.0 | 2559000.0 | 0.0 | 1427000.0 | 0.0 | 0.0 | 0.2057 | 0.1798 | 0.0 | 0.1003 | 14240000.0 |
| 9233 | 2018 | Zimbabwe | 0.0 | 0.0 | 3348000.0 | 2069000.0 | 0.0 | 1771000.0 | 0.0 | 0.0 | 0.2318 | 0.1433 | 0.0 | 0.1226 | 14440000.0 |
| 9234 | 2019 | Zimbabwe | 0.0 | 0.0 | 3076000.0 | 1826000.0 | 0.0 | 1583000.0 | 0.0 | 0.0 | 0.2101 | 0.1247 | 0.0 | 0.1081 | 14650000.0 |
| 9235 | 2020 | Zimbabwe | 0.0 | 0.0 | 3659000.0 | 3469000.0 | 0.0 | NaN | 0.0 | 0.0 | 0.2462 | 0.2334 | 0.0 | NaN | 14860000.0 |
| 9236 | 2021 | Zimbabwe | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 15090000.0 |



Missing Values per Column:
Year                                   0
Entity                                 0
Gas production(m³)                   529
Gas consumption(m³)                  548
Coal production(Ton)                 420
Coal consumption(Ton)                420
Oil production(m³)                   445
Oil consumption(m³)                  682
Gas production per capita(m³)       1548
Gas consumption per capita(m³)      1566
Coal production per capita(Ton)     1480
Coal consumption per capita(Ton)    1480
Oil production per capita(m³)       1417
Oil consumption per capita(m³)      1636
Population                          1105
dtype: int64
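
The structure checks above print a full boolean series for duplicates and raw counts for missing values. A minimal sketch of how they could be condensed into a duplicate count and a percentage missing per column follows. The `structure_summary` helper and the `toy` frame are illustrative stand-ins, not part of the notebook.

```python
import numpy as np
import pandas as pd

def structure_summary(df: pd.DataFrame) -> dict:
    """Return row count, number of fully duplicated rows, and % missing per column."""
    return {
        "n_rows": len(df),
        # duplicated() marks repeats of an earlier row; summing the booleans counts them
        "n_duplicate_rows": int(df.duplicated().sum()),
        # isna().mean() is the fraction of missing values in each column
        "pct_missing": df.isna().mean().mul(100).round(1),
    }

# Hypothetical stand-in for `prod`
toy = pd.DataFrame({
    "Year": [2000, 2001, 2001, 2002],
    "Entity": ["A", "A", "A", "A"],
    "Population": [10.0, 11.0, 11.0, np.nan],
})

summary = structure_summary(toy)
print("Duplicate rows:", summary["n_duplicate_rows"])
print(summary["pct_missing"])
```

On the real data this would be called as `structure_summary(prod)`. From the counts already shown, `Population` is missing in 1,105 of 9,237 rows (about 12.0%) and `Oil consumption per capita(m³)` in 1,636 rows (about 17.7%).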


# Part C: Data Trends and Patterns

In [16]:
# Summary statistics
print("Summary Statistics:")
display(prod.describe())

# Checking production vs consumption balance (basic overview)
print("\nAverage Oil Production vs Consumption:")
print(prod[['Oil production(m³)', 'Oil consumption(m³)']].mean())

print("\nAverage Gas Production vs Consumption:")
print(prod[['Gas production(m³)', 'Gas consumption(m³)']].mean())

print("\nAverage Coal Production vs Consumption:")
print(prod[['Coal production(Ton)', 'Coal consumption(Ton)']].mean())


Summary Statistics:


| | Year | Gas production(m³) | Gas consumption(m³) | Coal production(Ton) | Coal consumption(Ton) | Oil production(m³) | Oil consumption(m³) | Gas production per capita(m³) | Gas consumption per capita(m³) | Coal production per capita(Ton) | Coal consumption per capita(Ton) | Oil production per capita(m³) | Oil consumption per capita(m³) | Population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9237.0 | 8708.0 | 8689.0 | 8817.0 | 8817.0 | 8792.0 | 8555.0 | 7689.0 | 7671.0 | 7757.0 | 7757.0 | 7820.0 | 7601.0 | 8132.0 |
| mean | 2000.443434 | 23830230000.0 | 23924190000.0 | 52618010.0 | 52159150.0 | 41411320.0 | 45507580.0 | 963.569615 | 612.621901 | 0.424189 | 0.434401 | 2.414781 | 1.392495 | 67194220.0 |
| std | 12.439275 | 189664000000.0 | 189160800000.0 | 435565900.0 | 434986900.0 | 290160700.0 | 326864100.0 | 4554.541934 | 2008.360465 | 1.57712 | 1.028148 | 11.332552 | 4.256257 | 470969700.0 |
| min | 1973.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 959.0 |
| 25% | 1990.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 323300.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1528 | 1308750.0 |
| 50% | 2001.0 | 0.0 | 0.0 | 0.0 | 1221.0 | 0.0 | 1568000.0 | 0.0 | 1.406 | 0.0 | 0.002069 | 0.0 | 0.5834 | 6651000.0 |
| 75% | 2011.0 | 1614000000.0 | 3981000000.0 | 172000.0 | 2001000.0 | 3194000.0 | 12070000.0 | 112.1 | 450.5 | 0.02211 | 0.2548 | 0.28025 | 1.65 | 21447500.0 |
| max | 2021.0 | 4039000000000.0 | 4004000000000.0 | 8144000000.0 | 8179000000.0 | 4817000000.0 | 5820000000.0 | 71520.0 | 27160.0 | 21.77 | 7.093 | 233.0 | 125.8 | 7786000000.0 |



Average Oil Production vs Consumption:
Oil production(m³)     4.141132e+07
Oil consumption(m³)    4.550758e+07
dtype: float64

Average Gas Production vs Consumption:
Gas production(m³)     2.383023e+10
Gas consumption(m³)    2.392419e+10
dtype: float64

Average Coal Production vs Consumption:
Coal production(Ton)     5.261801e+07
Coal consumption(Ton)    5.215915e+07
dtype: float64


**Observations**

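**Oil:** Mean consumption (about 4.55 × 10⁷ m³) is roughly 10% higher than mean production (about 4.14 × 10⁷ m³).

**Gas:** Mean consumption (about 2.392 × 10¹⁰ m³) is only about 0.4% above mean production (about 2.383 × 10¹⁰ m³), so the two are close to balanced.

**Coal:** Mean production (about 5.26 × 10⁷ Ton) is about 0.9% higher than mean consumption (about 5.22 × 10⁷ Ton).

Each mean is taken over that column's own non-null rows (for example, 8,792 rows for oil production but 8,555 for oil consumption), so these comparisons are indicative rather than exact.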
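
The production and consumption means are each computed over a column's own non-null rows, so they do not necessarily cover the same country-years. A sketch of a like-for-like comparison that keeps only rows where both values are present is shown here. The `balance_on_common_rows` helper and the toy values are assumptions for illustration, not results from the dataset.

```python
import numpy as np
import pandas as pd

def balance_on_common_rows(df, prod_col, cons_col):
    """Compare mean production and consumption using only rows with both values."""
    both = df[[prod_col, cons_col]].dropna()  # drop rows missing either value
    means = both.mean()
    return pd.Series({
        "rows_used": len(both),
        "mean_production": means[prod_col],
        "mean_consumption": means[cons_col],
        "consumption_to_production": means[cons_col] / means[prod_col],
    })

# Hypothetical stand-in for `prod`
toy = pd.DataFrame({
    "Oil production(m³)": [100.0, 200.0, np.nan, 300.0],
    "Oil consumption(m³)": [150.0, np.nan, 50.0, 450.0],
})

result = balance_on_common_rows(toy, "Oil production(m³)", "Oil consumption(m³)")
print(result)
```

In the toy frame only the first and last rows have both values, so the comparison uses two rows instead of three per column. The same call with `prod` and the gas or coal column names would give the matched-row balance for those fuels.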

# Part D: Hypothesis Testing

In [12]:
# Correlation analysis
corr = prod[['Oil production(m³)', 'Oil consumption(m³)',
           'Gas production(m³)', 'Gas consumption(m³)',
           'Coal production(Ton)', 'Coal consumption(Ton)']].corr()

print("Correlation Matrix:")
display(corr)


Correlation Matrix:


| | Oil production(m³) | Oil consumption(m³) | Gas production(m³) | Gas consumption(m³) | Coal production(Ton) | Coal consumption(Ton) |
|---|---|---|---|---|---|---|
| Oil production(m³) | 1.0 | 0.966519 | 0.969764 | 0.969472 | 0.91232 | 0.904241 |
| Oil consumption(m³) | 0.966519 | 1.0 | 0.971122 | 0.983468 | 0.941664 | 0.937392 |
| Gas production(m³) | 0.969764 | 0.971122 | 1.0 | 0.995338 | 0.912819 | 0.903414 |
| Gas consumption(m³) | 0.969472 | 0.983468 | 0.995338 | 1.0 | 0.923226 | 0.916394 |
| Coal production(Ton) | 0.91232 | 0.941664 | 0.912819 | 0.923226 | 1.0 | 0.997432 |
| Coal consumption(Ton) | 0.904241 | 0.937392 | 0.903414 | 0.916394 | 0.997432 | 1.0 |


**Hypotheses Tested**

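1. Oil production is positively correlated with oil consumption → Confirmed (strong positive correlation, r ≈ 0.97).

2. Gas consumption is closely tied to gas production → Confirmed (r ≈ 0.995).

3. Coal production has historically exceeded coal consumption → Supported by the overall means in Part C (production about 0.9% higher), but the correlation matrix does not test this, and the balance may vary by year and country.

Correlations between different fuels are also above 0.90, which suggests that much of the shared signal reflects the overall size of each entity rather than a fuel-specific link.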
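
Because the matrix pools every country-year, large entities can dominate it: every pairwise value in the matrix is above 0.90, including oil production against coal consumption. One way to probe this, sketched under the assumption that `Entity` identifies a single country or region, is to compute the correlation separately within each entity. The `within_entity_corr` helper, the short column names, and the toy data are hypothetical.

```python
import pandas as pd

def within_entity_corr(df, x, y, min_rows=3):
    """Pearson correlation of x and y computed separately for each Entity."""
    return pd.Series({
        # Series.corr drops incomplete pairs; min_periods returns NaN for short series
        entity: group[x].corr(group[y], min_periods=min_rows)
        for entity, group in df.groupby("Entity")
    })

# Hypothetical data: one large and one small entity with opposite internal trends
toy = pd.DataFrame({
    "Entity": ["Big"] * 3 + ["Small"] * 3,
    "production": [1000.0, 2000.0, 3000.0, 1.0, 2.0, 3.0],
    "consumption": [3000.0, 2000.0, 1000.0, 2.0, 4.0, 6.0],
})

pooled = toy["production"].corr(toy["consumption"])
within = within_entity_corr(toy, "production", "consumption")
print("Pooled correlation:", pooled)
print(within)
```

In the toy frame the pooled correlation is positive even though production and consumption move in opposite directions inside the large entity, which is the kind of scale effect a pooled matrix can hide. On the real data, a summary such as `within.describe()` would show how strong the link is inside a typical entity.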

# Part E: Data Issues and Quality Checks

**Issues Identified**

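- Missing values: Each per-capita column is missing between 1,417 and 1,636 values and `Population` is missing 1,105. The raw production and consumption columns are missing 420 to 682 values each.

- Zeros that may not be real: The median is 0 for gas, coal, and oil production, so at least half of the reported rows in those columns are zero. Some zeros are plausible (e.g., a country that imports all of its oil), but others may be unreported values recorded as zero.

- Anomalies: Sudden drops in 2020 values would be consistent with an external shock (the pandemic), but the cells above do not break the data down by year, so this still needs to be verified. The most recent years are also incomplete: in the last rows shown, Zimbabwe has no oil consumption value for 2020 and only a population value for 2021.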

- Consistency: In many cases, consumption > production (indicating imports), but the dataset doesn’t track imports explicitly.
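
The 2020 anomaly can be checked directly by summing a column per year and looking at the year-over-year change. The following is a minimal sketch, with a hypothetical `yearly_change` helper and made-up numbers. It assumes each `Entity` is a distinct country. The maximum `Population` of about 7.8 billion suggests the real data may also contain aggregate rows, which would need to be filtered out first to avoid double counting.

```python
import pandas as pd

def yearly_change(df, col):
    """Sum `col` across entities per year and compute the year-over-year % change."""
    # min_count=1 leaves a year as NaN when it has no reported values at all
    totals = df.groupby("Year")[col].sum(min_count=1)
    return pd.DataFrame({
        "total": totals,
        # fill_method=None avoids forward-filling gaps before computing the change
        "pct_change": totals.pct_change(fill_method=None) * 100,
    })

# Hypothetical stand-in for `prod`
toy = pd.DataFrame({
    "Year": [2018, 2019, 2020] * 2,
    "Entity": ["A"] * 3 + ["B"] * 3,
    "Oil consumption(m³)": [100.0, 110.0, 80.0, 50.0, 50.0, 40.0],
})

change = yearly_change(toy, "Oil consumption(m³)")
print(change)
```

A large negative `pct_change` in the 2020 row of the real data would support the pandemic explanation. A drop driven by fewer entities reporting in 2020 would point to a coverage problem instead, so it is worth comparing the non-null count per year as well.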