# Short-Term Temperature Predictor Using a k-NN (k-Nearest Neighbors) Machine Learning Model

## Purpose:

This script creates a short-term temperature predictor using the k-Nearest Neighbor (k-NN) machine learning algorithm.

The script in this [Jupyter](https://jupyter.org/install) notebook will:
- Fetch historical weather data for a given location from [WeatherAPI](https://www.weatherapi.com/)
- Preprocess the data
- Save weather data to a .CSV file
- Create and train a [k-NN ML model](https://www.geeksforgeeks.org/k-nearest-neighbours/) using your downloaded data
- Predict the temperature for a future date entered by the user
	- In practice, the model's error is typically 2-6 degrees Fahrenheit ([RMSE](https://help.sap.com/docs/SAP_PREDICTIVE_ANALYTICS/41d1a6d4e7574e32b815f1cc87c00f42/5e5198fd4afe4ae5b48fefe0d3161810.html) = 2.00 - 5.79), depending on the quantity and quality of the weather data you provide for training the model.
		- Accuracy generally improves as you provide more historical data.
		- Providing only a week's worth of data will give poor results if you ask the model to predict temperatures more than a day or two past the last data point in the model.
		- At least 6 months of valid historical weather data from your chosen weather API is recommended.
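
To make the RMSE figure above concrete, here is a minimal sketch of how it is computed. The helper name `rmse` and the temperatures are made up for illustration; they are not taken from the notebook's data.

```python
import numpy as np


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Invented temperatures in degrees Fahrenheit, for illustration only
observed = [40.0, 42.0, 45.0, 43.0]
forecast = [38.0, 44.0, 45.0, 47.0]
print(f"RMSE: {rmse(observed, forecast):.2f} °F")
```

Because the errors are squared before averaging, one large miss raises the RMSE more than several small ones.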

## Platform Prerequisites:

- [Python](https://www.python.org/downloads/) 3.x or later (Developed with Python [3.13.1](https://www.python.org/downloads/release/python-3131/))
- [Jupyter](https://jupyter.org/install) notebook
- [pip](https://pypi.org/project/pip/#description)
- Required Python libraries: [pandas](https://pandas.pydata.org/docs/getting_started/install.html), [numpy](https://numpy.org/install/), [scikit-learn](https://scikit-learn.org/stable/install.html) (imported as `sklearn`), [requests](https://pypi.org/project/requests/), and the built-in [csv](https://docs.python.org/3/library/csv.html) and [datetime](https://docs.python.org/3/library/datetime.html) modules

## Weather Dataset Prerequisites:

- If you just want to try out the code, you can use my preexisting `weather_data.csv` file included in the repository.
- If you want to get your own weather data, you will need to have your own personal API key to access either the [WeatherAPI.com](https://www.weatherapi.com/signup.aspx) or [NOAA](https://www.ncdc.noaa.gov/cdo-web/token) datasets. Do *NOT* share your API key with anyone else or post it publicly.
- To retrieve your own set of weather data, you must edit the code blocks and insert your own API key.
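
One way to keep the key out of the notebook entirely is to read it from an environment variable. This is only a sketch: the helper `load_api_key` and the variable name `WEATHERAPI_KEY` are hypothetical and are not used anywhere in the original code.

```python
import os


def load_api_key(env_var="WEATHERAPI_KEY"):
    """Return the API key stored in an environment variable, or raise if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key
```

With the variable set in your shell before starting Jupyter, `API_KEY = load_api_key()` could stand in for the hard-coded string, so the key never appears in a file you might share.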


## Installation:

1. Open a terminal or command prompt.
2. Navigate to the directory where you want to save the script.
3. Download the script (this file) `ML_Temp_Predict.ipynb` either manually or by using the [git](https://github.com/git-guides/install-git) command:
	- ```git clone https://github.com/DWNewton/KNN_Temperature_Predictor```
	- Navigate to the cloned directory: `cd KNN_Temperature_Predictor`
	- (Optional) 
			If you want to run the script in a virtual environment, create a new one and activate it:

		`python -m venv venv`

		`source venv/bin/activate` [macOS/Linux]

		`venv\Scripts\activate` [Windows]

		NOTE:	To remove the virtual environment later, deactivate it and delete its folder:

		`deactivate && rm -rf venv` [macOS/Linux]
			
		`deactivate`, then `rmdir /s /q venv` [Windows]

	- Install the required libraries:
		```pip install pandas numpy scikit-learn requests```
		If you encounter any issues while installing libraries, try updating your [pip](https://pypi.org/project/pip/#description) or [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) versions.
		
		`csv` and `datetime` are built-in Python libraries and do not need to be installed separately.

		(Optional)
			To run the script in a [Docker](https://www.docker.com/) container:

	- build the image: ```docker build -t knn_temperature_predictor .```
				
	- run the container: ```docker run -it knn_temperature_predictor```


# Weather API Fetcher
- The script below fetches weather data from a weather API provider.
- The example used in the script fetches a week's worth of weather data from api.weatherapi.com.

NOTE: You must provide your own API key. [WeatherAPI](https://www.weatherapi.com/signup.aspx) provides a 14-day trial that lets you download a large amount of historical weather data for your area. As of my code release date, WeatherAPI automatically switches you to the free level after the 14-day trial expires, so you don't need to worry about a surprise subscription charge.
- If you have problems with WeatherAPI, I have also provided an example script using NOAA weather data, which is freely available but limited in the locations it will provide data for.
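
The fetcher keeps only the readings taken at a few chosen hours of each day. The sketch below shows that filtering step in isolation. The function name `select_hours` is hypothetical, and `sample` is a hand-written stand-in for one day's response with invented values; only the field names mirror those the fetcher reads.

```python
from datetime import datetime


def select_hours(payload, wanted_hours):
    """Return (HH:MM, temp_f) pairs for the hours of interest in one day's response."""
    rows = []
    for hour in payload["forecast"]["forecastday"][0]["hour"]:
        # The API reports each hour as "YYYY-MM-DD HH:MM"; keep only the clock time
        hhmm = datetime.strptime(hour["time"], "%Y-%m-%d %H:%M").strftime("%H:%M")
        if hhmm in wanted_hours:
            rows.append((hhmm, hour["temp_f"]))
    return rows


# Hand-written stand-in for an API response; the values are invented
sample = {
    "forecast": {
        "forecastday": [
            {
                "hour": [
                    {"time": "2025-01-22 05:00", "temp_f": 20.1},
                    {"time": "2025-01-22 06:00", "temp_f": 19.8},
                    {"time": "2025-01-22 12:00", "temp_f": 27.5},
                ]
            }
        ]
    }
}
print(select_hours(sample, {"06:00", "12:00", "18:00"}))
```

Testing the parsing against a small fixed payload like this avoids spending API calls while you adjust the target hours.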




In [None]:
import requests
import csv
from datetime import datetime, timedelta

# API Key and base URL
API_KEY = (
    "Your API Key Here"  # You must register with WeatherAPI.com to receive an API key
)
BASE_URL = "https://api.weatherapi.com/v1/history.json"  # HTTPS keeps the API key encrypted in transit

# Parameters
location = "New York, NY"
# NOTE:
# Insert your preferred start date, formatted as YYYY, MM, DD.
# The 'free' level of http://api.weatherapi.com does not support fetching
# weather data from more than a week in the past.
start_date = datetime(2025, 1, 22)
end_date = datetime(2025, 1, 29)  # Your end date, formatted as YYYY, MM, DD
time_hours = ["06:00", "12:00", "18:00"]  # Target times in 24-hour format. Adjust as desired

# Output CSV file
csv_file = "weather_data.csv"  # You can name this whatever you like, as long as it has a .csv extension

# Prepare CSV file for writing
with open(csv_file, mode="w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(
        ["date", "time", "temp_f", "cloud", "wind_mph", "precip_in"]
    )  # Headers

    # Loop through each date
    current_date = start_date
    while current_date <= end_date:
        # Format the date for the API request
        formatted_date = current_date.strftime("%Y-%m-%d")
        print(f"Fetching data for {formatted_date}...")

        # Make API request
        response = requests.get(
            BASE_URL,
            params={"key": API_KEY, "q": location, "dt": formatted_date},
            timeout=30,  # Raise an error instead of waiting forever on a stalled connection
        )

        if response.status_code == 200:
            data = response.json()

            # Extract hourly weather data
            for hour_data in data["forecast"]["forecastday"][0]["hour"]:
                # Get the hour in "HH:MM" format
                hour_time = datetime.strptime(
                    hour_data["time"], "%Y-%m-%d %H:%M"
                ).strftime("%H:%M")

                # Check if the time matches the desired hours
                if hour_time in time_hours:
                    writer.writerow(
                        [
                            formatted_date,
                            hour_time,
                            hour_data["temp_f"],
                            hour_data["cloud"],
                            hour_data["wind_mph"],
                            hour_data["precip_in"],
                        ]
                    )
        else:
            print(
                f"Failed to fetch data for {formatted_date}. Error: {response.status_code}"
            )

        # Move to the next day
        current_date += timedelta(days=1)

print(f"Weather data saved to {csv_file}.")

# Example NOAA weather data fetcher

In [None]:
import requests
import csv
from datetime import datetime, timedelta

# NOAA API Token
# NOAA also requires you to register to get a personal API key,
# but there's no subscription or price to use their service
NOAA_API_KEY = "Your API Key"
BASE_URL = "https://www.ncdc.noaa.gov/cdo-web/api/v2/data"

# Parameters
dataset_id = "GHCND"  # Daily summaries
location_id = "CITY:US510051"
start_date = "2024-07-10"
end_date = "2025-01-10"  # Approx. 6 months

# Output CSV file
csv_file = "noaa_weather_data.csv"

# Headers for the API request
headers = {"token": NOAA_API_KEY}

# Prepare CSV file for writing
with open(csv_file, mode="w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["date", "datatype", "value"])  # Headers

    # Make the API request
    try:
        response = requests.get(
            BASE_URL,
            headers=headers,
            params={
                "datasetid": dataset_id,
                "locationid": location_id,
                "startdate": start_date,
                "enddate": end_date,
                "datatypeid": "TMAX,TMIN,PRCP",  # Max temp, min temp, precipitation
                "limit": 1000,  # Only one request is made, so rows beyond this limit are not fetched
            },
        )

        if response.status_code == 200:
            data = response.json()

            # Write each data point to the CSV
            for result in data.get("results", []):
                writer.writerow([result["date"], result["datatype"], result["value"]])
        else:
            print(
                f"Failed to fetch data: HTTP {response.status_code} - {response.text}"
            )

    except Exception as e:
        print(f"Error during API request: {e}")

print(f"Weather data saved to {csv_file}.")
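
The NOAA script writes one row per date and datatype, while the k-NN code further down expects one row per observation with a column for each measurement. As a hedged sketch, pandas can reshape the long layout into a wide one. The rows below are invented stand-ins in the same `date,datatype,value` layout, and averaging duplicate readings is my assumption about how you might want to combine several stations.

```python
import io

import pandas as pd

# A few invented rows in the same date/datatype/value layout the NOAA script writes
sample_csv = io.StringIO(
    "date,datatype,value\n"
    "2024-07-10T00:00:00,TMAX,300\n"
    "2024-07-10T00:00:00,TMAX,310\n"
    "2024-07-10T00:00:00,TMIN,200\n"
    "2024-07-10T00:00:00,PRCP,0\n"
    "2024-07-11T00:00:00,TMAX,320\n"
    "2024-07-11T00:00:00,TMIN,210\n"
    "2024-07-11T00:00:00,PRCP,5\n"
)
long_df = pd.read_csv(sample_csv, parse_dates=["date"])

# More than one reading can exist for the same date and datatype, so average them
wide_df = long_df.pivot_table(
    index="date", columns="datatype", values="value", aggfunc="mean"
)
print(wide_df)
```

The result has one row per date and one column each for `PRCP`, `TMAX`, and `TMIN`. Check the units of the values you actually receive before mixing them with the WeatherAPI data.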

# Machine Learning Model: k-NN
Now for the fun stuff!

We're going to use the scikit-learn library to build a simple k-Nearest Neighbors (k-NN) model for predicting temperature. k-NN stores a dataset of historical weather data and predicts a temperature by averaging the temperatures of the k historical records whose features are most similar to the new input.
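
To show the idea without any library machinery, here is a minimal hand-rolled sketch of k-NN regression using Euclidean distance and a plain average, which matches scikit-learn's `KNeighborsRegressor` defaults as far as I know. The function `knn_predict` and the numbers are invented for illustration; the notebook itself uses scikit-learn.

```python
import numpy as np


def knn_predict(train_X, train_y, query, k=3):
    """Average the targets of the k training rows closest to `query`."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y, dtype=float)
    # Euclidean distance from the query to every training row
    distances = np.linalg.norm(train_X - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(distances)[:k]
    return float(train_y[nearest].mean())


# Invented data: one feature (a previous average temperature) and the temperature that followed
X = [[30.0], [32.0], [35.0], [50.0], [52.0]]
y = [31.0, 33.0, 36.0, 51.0, 53.0]
print(knn_predict(X, y, [33.0], k=3))
```

The three rows nearest to 33.0 are the ones at 32.0, 35.0, and 30.0, so the prediction is the average of their targets. The two warm rows are too far away to influence it.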

First, let's load the dataset we created earlier.
On line 8 of the code block below, make sure to replace `weather_data.csv` with the path and name of your actual CSV file.

Or, if you just want to run the program, you can download and use the preexisting `weather_data.csv` dataset from my repository!
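
The model's input features are rolling aggregates of earlier readings, built with `rolling(window=3)` followed by `shift(1)`. The toy series below (invented values) is a small sketch of what that combination produces.

```python
import pandas as pd

# Five invented temperature readings in time order
temps = pd.Series([30.0, 33.0, 36.0, 39.0, 42.0], name="temp_f")

# Mean of a 3-row window, then shifted down one row so that each row
# only sees the three readings that came before it
prev_avg = temps.rolling(window=3).mean().shift(1)
print(pd.DataFrame({"temp_f": temps, "prev_3_avg": prev_avg}))
```

The first three rows have no complete history and come out as `NaN`, which is why the notebook drops them. Note that the window counts rows, not calendar days: with three readings per day, a window of 3 covers roughly one day.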


In [None]:
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
import numpy as np

# Load the weather data from the CSV file
file_path = "weather_data.csv"  # Update this path as needed
weather_data = pd.read_csv(file_path)

# Convert 'date' and 'time' columns into a single datetime column for temporal features
weather_data["datetime"] = pd.to_datetime(
    weather_data["date"] + " " + weather_data["time"]
)

# Drop unnecessary columns for prediction
weather_data = weather_data.drop(columns=["date", "time"])

# Sort by datetime to maintain temporal order
weather_data = weather_data.sort_values(by="datetime")

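# Build each row's features from the readings that precede it.
# rolling(window=3) spans 3 rows (readings), not 3 calendar days: with the
# three readings per day fetched above, that is roughly one day of history.
# shift(1) keeps the current row's own values out of its features.
# The "prev_3day_*" names are kept as-is; a larger window (for example 9)
# would be needed to cover three full days.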
weather_data["prev_3day_avg_temp"] = (
    weather_data["temp_f"].rolling(window=3).mean().shift(1)
)
weather_data["prev_3day_avg_cloud"] = (
    weather_data["cloud"].rolling(window=3).mean().shift(1)
)
weather_data["prev_3day_avg_wind"] = (
    weather_data["wind_mph"].rolling(window=3).mean().shift(1)
)
weather_data["prev_3day_total_precip"] = (
    weather_data["precip_in"].rolling(window=3).sum().shift(1)
)

# Drop rows with NaN values (due to rolling window)
# This gets rid of any rows that don't contain all the required data points
weather_data = weather_data.dropna()

# Extract features and target
features = weather_data[
    [
        "prev_3day_avg_temp",
        "prev_3day_avg_cloud",
        "prev_3day_avg_wind",
        "prev_3day_total_precip",
    ]
]
target = weather_data["temp_f"]

# Split data into training and testing sets
# Here's where scikit-learn starts to spring into action!
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

# Perform cross-validation to find the optimal k
# Using 6 months of data, my optimal value was 15
# Yours may differ; your result will be printed when the program runs
param_grid = {"n_neighbors": range(1, 21)}  # Test k values from 1 to 20
grid = GridSearchCV(
    KNeighborsRegressor(), param_grid, scoring="neg_mean_squared_error", cv=5
)
grid.fit(X_train, y_train)

# Extract the best k value and refit the model
best_k = grid.best_params_["n_neighbors"]
print(f"Optimal k: {best_k}")
knn = KNeighborsRegressor(n_neighbors=best_k)
knn.fit(X_train, y_train)

# Evaluate the model using RMSE
y_pred = knn.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Print the RMSE - This shows your trained model's typical prediction error on the test set, in degrees Fahrenheit
print(f"Model RMSE with optimal k ({best_k}): {rmse:.3f}")

# Prompt user for a date
# Ensure you use the correct format: YYYY-MM-DD  e.g.: 2025-01-31
input_date = input("Enter a date (YYYY-MM-DD) to predict the temperature: ")
try:
    input_date = pd.to_datetime(input_date)

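    # Ensure there are at least 3 earlier readings to aggregate
    recent_days = weather_data[weather_data["datetime"] < input_date]
    if len(recent_days) >= 3:
        last_three = recent_days.tail(3)
        # Build a one-row DataFrame with the training column names so that
        # scikit-learn sees the same feature names it was fitted with
        input_features = pd.DataFrame(
            [
                {
                    "prev_3day_avg_temp": last_three["temp_f"].mean(),
                    "prev_3day_avg_cloud": last_three["cloud"].mean(),
                    "prev_3day_avg_wind": last_three["wind_mph"].mean(),
                    "prev_3day_total_precip": last_three["precip_in"].sum(),
                }
            ]
        )
        prediction = knn.predict(input_features)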
        print(
            f"Predicted Temperature for {input_date.strftime('%Y-%m-%d')}: {prediction[0]:.2f}°F"
        )
    else:
        print("Not enough recent data available for prediction.")
except ValueError:
    print("Invalid date format. Please enter a valid date in YYYY-MM-DD format.")