<a href="https://colab.research.google.com/github/Mohawkins/Gasoline-Prices-Increase-CS-668-Capstone-Project/blob/main/Price_Increase_of_Gasoline_Financial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Price Increase of Gasoline

Monica Hawkins
Pace University, NY, USA

# Reference

https://www.eia.gov/energyexplained/gasoline/price-fluctuations.php

## Introduction

Retail gasoline prices are mainly affected by crude oil prices and the amount of gasoline available to meet demand. Strong and increasing demand for gasoline and other petroleum products in the United States and the rest of the world can place intense pressure on available supplies. Gasoline prices tend to increase when the available gasoline supply decreases relative to real or expected gasoline demand or consumption. Gasoline prices can change rapidly if something disrupts crude oil supplies, refinery operations, or gasoline pipeline deliveries. Even when crude oil prices are stable, gasoline prices fluctuate because of seasonal changes in demand and in gasoline specifications.

## Business Problem Statement

Gasoline prices tend to rise during the spring and summer, while heating oil demand increases in the winter. This project aims to analyze how seasonal trends in refinery production and fuel consumption affect fuel prices throughout the year. By examining refinery output data, fuel price patterns, and seasonal demand, the goal is to determine whether adjusting refinery operations could help reduce price spikes and stabilize fuel costs during the winter season.
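
One way to make the seasonal comparison concrete is to label each month with a season and average prices within each label. The sketch below is illustrative only: the `month_to_season` helper, the column names `date` and `price`, and the sample values are assumptions, not data from this project.

```python
import pandas as pd


def month_to_season(month: int) -> str:
    """Map a calendar month (1-12) to a Northern Hemisphere meteorological season."""
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    return "fall"


# Hypothetical monthly prices in long format (placeholder values, not real data)
prices = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=12, freq="MS"),
    "price": [3.0, 3.1, 3.3, 3.5, 3.6, 3.7, 3.6, 3.5, 3.3, 3.2, 3.1, 3.0],
})

# Derive the season from the month, then average the price within each season
prices["season"] = prices["date"].dt.month.map(month_to_season)
seasonal_mean = prices.groupby("season")["price"].mean()
print(seasonal_mean)
```

With real data, the same grouping could be applied to refinery output or consumption columns to compare their seasonal profiles against price.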

## Goal

To build a machine learning model that predicts gasoline price increases, particularly the extreme price surges that occur during the spring and summer seasons.
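
A prediction goal like this needs an explicit target. One possible definition, sketched below under stated assumptions, flags a month as a price increase when the month-over-month percentage change exceeds a threshold. The function name `label_price_increases`, the 5% threshold, and the sample prices are hypothetical choices for illustration, not values taken from this project.

```python
import pandas as pd


def label_price_increases(prices: pd.Series, threshold: float = 0.05) -> pd.DataFrame:
    """Return prices with their period-over-period change and a 0/1 increase flag.

    `threshold` is an assumed cutoff (5% by default) for a "significant" increase.
    """
    pct = prices.pct_change()
    labeled = pd.DataFrame({"price": prices, "pct_change": pct})
    labeled["increase"] = (pct > threshold).astype(int)
    # The first period has no previous price to compare against, so drop it
    return labeled.dropna(subset=["pct_change"])


# Hypothetical monthly prices (placeholder values, not real data)
monthly = pd.Series(
    [3.00, 3.03, 3.30, 3.25, 3.60],
    index=pd.date_range("2024-01-01", periods=5, freq="MS"),
)
labeled = label_price_increases(monthly)
print(labeled)
```

The `increase` column could then serve as the target for a classifier, while `pct_change` itself would suit a regression model.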

## Exploratory Data Analysis

## Importing Libraries and Loading Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
from sklearn import linear_model, ensemble, neighbors, tree, svm

# ===============================
# Load Dataset
# ===============================
try:
    df = pd.read_csv("/content/U.S._Natural_Gas_Prices.csv")
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print("Error: U.S._Natural_Gas_Prices.csv not found. Please check the file path.")
    # Re-raise so execution stops here; calling exit() would shut down the notebook kernel
    raise

# Preview data
print("Dataset Shape:", df.shape)
print("\nFirst 5 Rows:\n", df.head())

# ===============================
# Data Cleaning and Preprocessing
# ===============================

# Convert potential price columns to numeric, coercing errors to NaN
price_cols = ['25-Feb', '25-Mar', '25-Apr', '25-May', '25-Jun', '25-Jul']
for col in price_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Drop rows with missing values for simplicity, given the small dataset size.
# For a real project, consider imputation strategies.
df.dropna(inplace=True)
print(f"\nDataset shape after dropping rows with missing values: {df.shape}")

# Basic Data Info after cleaning
print("\nData Info after cleaning:")
print(df.info())

print("\nMissing Values after cleaning:\n", df.isnull().sum())

# At this point, consider if 'Category' and 'History' columns are relevant features
# for predicting gasoline prices. 'History' seems to contain year ranges,
# and 'Category' describes the type of price.
# For a time-series prediction, the date information is crucial,
# which is currently embedded in the column names ('25-Feb', etc.).
# This data structure is not ideal for time-series analysis.

# If 'Category' is to be used, encode it.
# Be cautious with encoding 'History' as it might not be a meaningful categorical feature.
# le = LabelEncoder()
# if 'Category' in df.columns:
#     df['Category_Encoded'] = le.fit_transform(df['Category'])
#     # Drop the original 'Category' column if encoded
#     # df = df.drop('Category', axis=1)

# The current structure of the data (prices in column names) makes it difficult
# to directly apply standard machine learning models for time-series prediction.
# A more suitable format would be a long format with columns like 'Date', 'Category', 'Price'.
# However, converting to this format requires more information about the exact dates
# corresponding to the '25-Feb', '25-Mar', etc. columns.

# Given the dataset's current structure and small size, a direct application of
# the previously attempted classification models for price prediction is likely not appropriate
# and will face challenges with feature engineering, target definition, and model training.

# ===============================
# Next Steps and Considerations
# ===============================
print("\n===============================")
print("Next Steps and Considerations")
print("===============================")
print("1. Data Structure: The current data format (prices in column names) is not ideal for time-series analysis. Reshaping the data into a long format with explicit date, category, and price columns would be beneficial.")
print("2. Target Variable: The goal is to predict gasoline price increases. A clear target variable needs to be defined based on the price data (e.g., percentage change in price, or a binary indicator for significant price increases).")
print("3. Dataset Size: With only a few rows after cleaning, training robust machine learning models is challenging. Consider obtaining a larger dataset with more historical price data and relevant features.")
print("4. Modeling Approach: For price prediction (a continuous value), regression models are typically used, not classification models. If the goal is to predict *increases*, a classification approach might be possible by defining a binary target (increase/no increase), but this still requires a suitable dataset and feature engineering.")
print("5. Feature Engineering: Extracting time-based features (e.g., month, year, season) from the date information (once the data is in a suitable format) will be important for capturing seasonal trends.")

print("\nBased on the current data, I cannot proceed with training the machine learning models as originally intended without further data preparation and clarification on the target variable and desired modeling approach. ")
print("Please provide guidance on how you would like to proceed, perhaps by providing a dataset with a more suitable structure for time-series analysis or clarifying the target variable for prediction.")

Dataset loaded successfully.
Dataset Shape: (12, 8)

First 5 Rows:
                            Category  25-Feb     25-Mar 25-Apr  25-May 25-Jun  \
0  Wellhead Price eo NA NA NA NA NA     NaN  1973-2025    NaN     NaN    NaN   
1                  Imports Price eo  436.00        251    224  201.00    193   
2                    By Pipeline eo  436.00        251    2.2  201.00    192   
3                  Exports Price eo    6.65       6.66   6.32    5.48    562   
4                    By Pipeline eo    4.00        321    288    2.66    278   

   25-Jul    History  
0     NaN        NaN  
1   201.0  1989-2025  
2   201.0  1997-2025  
3   576.0  1989-2025  
4   292.0  1997-2025  

Dataset shape after dropping rows with missing values: (8, 8)

Data Info after cleaning:
<class 'pandas.core.frame.DataFrame'>
Index: 8 entries, 1 to 11
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Category  8 non-null      object 
 1   25-Fe