<a href="https://colab.research.google.com/github/Bhuvika-Agrawal/Weather-prediction-model/blob/main/weather_prediction_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Project Introduction

This project predicts daily maximum (`tmax`) and minimum (`tmin`) temperatures from historical weather data. We preprocess the data, train one RandomForestRegressor per target, and evaluate both models.


In [None]:
import os

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.preprocessing import MinMaxScaler

## Data Loading and Preprocessing

We load every weather CSV file in the working directory, tag each row with the label taken from its file name, combine the files into a single DataFrame, drop duplicate rows, and save the result for the later steps.


In [None]:
# Function to process uploaded files
def process_files():
    dataframes = []

    # List files in the current directory
    file_list = os.listdir()

    for file_name in file_list:
        if file_name.endswith(".csv"):
            # Use the first underscore-separated token of the file name as the city label
            city_name = file_name.split('_')[0]
            print(f"Processing file: {file_name}, Extracted city name: {city_name}")

            # Read the CSV file into a DataFrame
            try:
                df = pd.read_csv(file_name, usecols=["time", "tavg", "tmax", "tmin", "prcp"], dtype={"time": str})
            except ValueError as e:
                print(f"Error reading {file_name}: {e}")
                continue

            # Add the city name as a new column in the DataFrame
            df['city'] = city_name

            # Clean the 'time' column if needed
            df['time'] = df['time'].str.strip()

            print(f"DataFrame for {city_name}:")
            print(df.head().to_string(index=False))  # Display without index

            # Append the DataFrame to the list
            dataframes.append(df)

    # Check if any DataFrames were created
    if not dataframes:
        print("No DataFrames were created. Please check your CSV files.")
        return None

    # Concatenate all DataFrames into a single DataFrame
    combined_df = pd.concat(dataframes, ignore_index=True)

    # Drop duplicates if needed
    combined_df.drop_duplicates(inplace=True)

    # Standardize column names (lowercase and strip whitespace)
    combined_df.columns = [col.lower().strip() for col in combined_df.columns]

    return combined_df

# Process the files
combined_df = process_files()

# Verify the combined DataFrame before saving
if combined_df is not None:
    print("Combined DataFrame before saving:")
    print(combined_df.head().to_string(index=False))  # Display without index

    output_file = 'cleaned_weather_data.csv'
    combined_df.to_csv(output_file, index=False)
    print(f"Combined DataFrame saved to: {output_file}")


Processing file: Rajasthan_1990_2022_Jodhpur.csv, Extracted city name: Rajasthan
DataFrame for Rajasthan:
      time  tavg  tmin  tmax  prcp      city
01-01-1990  22.9  19.1  28.4   NaN Rajasthan
02-01-1990  21.7   NaN  26.5   0.0 Rajasthan
03-01-1990  21.0  16.4  26.5   0.0 Rajasthan
04-01-1990  20.8   NaN  27.4   0.0 Rajasthan
05-01-1990  20.4  14.2  26.1   0.0 Rajasthan
Processing file: Mumbai_1990_2022_Santacruz.csv, Extracted city name: Mumbai
DataFrame for Mumbai:
      time  tavg  tmin  tmax  prcp   city
01-01-1990  23.2  17.0   NaN   0.0 Mumbai
02-01-1990  22.2  16.5  29.9   0.0 Mumbai
03-01-1990  21.8  16.3  30.7   0.0 Mumbai
04-01-1990  25.4  17.9  31.8   0.0 Mumbai
05-01-1990  26.5  19.3  33.7   0.0 Mumbai
Processing file: Chennai_1990_2022_Madras.csv, Extracted city name: Chennai
DataFrame for Chennai:
      time  tavg  tmin  tmax  prcp    city
01-01-1990  25.2  22.8  28.4   0.5 Chennai
02-01-1990  24.9  21.7  29.1   0.0 Chennai
03-01-1990  25.6  21.4  29.8   0.0 Chennai
04
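
The loader keeps only the first token of each file name. As a minimal sketch, assuming the file names follow the pattern `<region>_<start year>_<end year>_<station>.csv` suggested by the names printed above, the remaining tokens could be kept as well. The helper name `parse_weather_filename` and the field names are hypothetical and are not part of the notebook.

```python
from pathlib import Path


def parse_weather_filename(file_name: str) -> dict:
    """Split '<region>_<start>_<end>_<station>.csv' into labelled parts.

    The four-token pattern is an assumption based on the file names in the output.
    """
    parts = Path(file_name).stem.split("_")
    if len(parts) != 4:
        raise ValueError(f"Unexpected file name pattern: {file_name}")
    region, start_year, end_year, station = parts
    return {
        "region": region,
        "start_year": int(start_year),
        "end_year": int(end_year),
        "station": station,
    }


info = parse_weather_filename("Rajasthan_1990_2022_Jodhpur.csv")
print(info)
```

Raising on an unexpected pattern makes a misnamed file visible immediately instead of silently producing a wrong label.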

## Handling Missing Values

We'll handle missing values using KNN imputation to ensure the dataset is complete and ready for analysis.


In [None]:
# Separate numeric and non-numeric columns
numeric_features = combined_df.select_dtypes(include=['number']).columns
non_numeric_features = combined_df.select_dtypes(exclude=['number']).columns

# Initialize the KNNImputer
imputer = KNNImputer(n_neighbors=5)

# Impute missing values on numeric features
numeric_features_imputed = imputer.fit_transform(combined_df[numeric_features])

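# Create a DataFrame from the imputed numeric features and combine it with non-numeric features.
# Reuse combined_df's index so rows stay aligned: drop_duplicates() may leave gaps in the index,
# and pd.concat(axis=1) aligns on index labels, not on row position.
features_imputed = pd.DataFrame(numeric_features_imputed, columns=numeric_features, index=combined_df.index)
features = pd.concat([features_imputed, combined_df[non_numeric_features]], axis=1)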

print("Missing values handled.")


Missing values handled.
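
To make the imputation step concrete, here is a minimal sketch on a toy frame built from the first Rajasthan rows printed earlier. The non-default index, `n_neighbors=2`, and the variable names are assumptions chosen for illustration. Each missing `tmin` is filled with the mean `tmin` of the rows that are closest in the observed columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame with two missing tmin values and a deliberately non-default index
toy = pd.DataFrame(
    {
        "tavg": [22.9, 21.7, 21.0, 20.8, 20.4],
        "tmin": [19.1, np.nan, 16.4, np.nan, 14.2],
        "tmax": [28.4, 26.5, 26.5, 27.4, 26.1],
    },
    index=[10, 11, 12, 13, 14],
)

imputer = KNNImputer(n_neighbors=2)

# fit_transform returns a plain NumPy array, so column names and the index
# have to be restored explicitly
toy_imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(toy_imputed)
```

Passing `index=toy.index` matters whenever the imputed frame is later joined with other columns of the original frame, because pandas aligns on index labels.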


## Feature Engineering

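We derive new features from the existing columns: the daily temperature range (`temp_diff`), the midpoint of `tmax` and `tmin` (`avg_temp`), previous-day lag values, and 7-day and 30-day rolling statistics of precipitation. Note that these columns are added to `combined_df`, while the training cell further down builds `X` from `features`, so they only reach the model if they are carried over to that frame.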


In [None]:
# Feature engineering
combined_df['temp_diff'] = combined_df['tmax'] - combined_df['tmin']
combined_df['avg_temp'] = (combined_df['tmax'] + combined_df['tmin']) / 2
combined_df['time'] = pd.to_datetime(combined_df['time'], format='%d-%m-%Y')

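# Sort by city, then time, so each city's rows are contiguous and chronological
combined_df.sort_values(['city', 'time'], inplace=True)

# Add lag features per city so values do not carry over from one city to the next
combined_df['prev_day_tmax'] = combined_df.groupby('city')['tmax'].shift(1)
combined_df['prev_day_prcp'] = combined_df.groupby('city')['prcp'].shift(1)

# Calculate per-city rolling mean and standard deviation of precipitation
prcp_by_city = combined_df.groupby('city')['prcp']
combined_df['rolling_mean_7'] = prcp_by_city.transform(lambda s: s.rolling(window=7).mean())
combined_df['rolling_std_7'] = prcp_by_city.transform(lambda s: s.rolling(window=7).std())
combined_df['rolling_mean_30'] = prcp_by_city.transform(lambda s: s.rolling(window=30).mean())
combined_df['rolling_std_30'] = prcp_by_city.transform(lambda s: s.rolling(window=30).std())

# Back-fill NaN values left by the lag and rolling calculations
# (fillna(method='bfill') is deprecated in recent pandas versions)
combined_df.bfill(inplace=True)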
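
Because several cities share one frame, it helps to see how a lag behaves at the boundary between two cities. The following is a small sketch on made-up data; the city labels `A` and `B`, the values, and the column names `prev_plain`, `prev_by_city`, and `roll2_by_city` are hypothetical. It compares a plain `shift` with a per-city `shift`.

```python
import pandas as pd

# Made-up data: three days for each of two cities
toy = pd.DataFrame(
    {
        "city": ["A", "A", "A", "B", "B", "B"],
        "time": pd.to_datetime(["1990-01-01", "1990-01-02", "1990-01-03"] * 2),
        "tmax": [28.0, 27.0, 26.0, 31.0, 32.0, 33.0],
    }
)
toy = toy.sort_values(["city", "time"]).reset_index(drop=True)

# Plain shift: the first row of city B receives the last value of city A
toy["prev_plain"] = toy["tmax"].shift(1)

# Per-city shift: the first row of every city has no previous day, so it is NaN
toy["prev_by_city"] = toy.groupby("city")["tmax"].shift(1)

# Per-city rolling mean over a 2-day window
toy["roll2_by_city"] = toy.groupby("city")["tmax"].transform(
    lambda s: s.rolling(window=2).mean()
)
print(toy)
```

The same grouping idea applies to any window length; only the `window` argument changes.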


## Feature Scaling

Next we scale the temperature columns (`tavg`, `tmax`, `tmin`) to the [0, 1] range with a Min-Max scaler.


In [None]:
# Scale features using Min-Max Scaler
scaler = MinMaxScaler()
combined_df[['tavg', 'tmax', 'tmin']] = scaler.fit_transform(combined_df[['tavg', 'tmax', 'tmin']])
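
The cell above fits the scaler on the whole dataset. A common alternative, sketched here on synthetic data and not taken from the notebook, is to fit the scaler on the training rows only and reuse it for the test rows, so that test-set minima and maxima do not influence the scaling. The array `temps` and the value range are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for three temperature columns
rng = np.random.default_rng(0)
temps = rng.uniform(10.0, 45.0, size=(200, 3))

train, test = train_test_split(temps, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learn min/max from training rows only
test_scaled = scaler.transform(test)        # reuse those min/max values

# inverse_transform maps scaled values back to the original units
restored = scaler.inverse_transform(test_scaled)
print(train_scaled.min(axis=0), train_scaled.max(axis=0))
```

With this approach, test values can fall slightly outside [0, 1] when the test set contains a value beyond the training range, which is expected.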


## Train-Test Split

We split the dataset into training (80%) and testing (20%) sets so that model performance can be evaluated on rows the models have not seen.


In [None]:
# Define features and target variables
X = features.drop(columns=['time', 'city', 'tmax', 'tmin'])
y_tmax = features['tmax']
y_tmin = features['tmin']

# Split the data once so both targets share the same training and testing rows
X_train, X_test, y_train_tmax, y_test_tmax, y_train_tmin, y_test_tmin = train_test_split(
    X, y_tmax, y_tmin, test_size=0.2, random_state=42
)


## Model Training

We'll train two Random Forest Regressor models, one to predict Tmax and one to predict Tmin, and run 5-fold cross-validation on the training set.


In [None]:
# Define one Random Forest Regressor per target
model_tmax = RandomForestRegressor(n_estimators=100, random_state=42)
model_tmin = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the models on the full training set
model_tmax.fit(X_train, y_train_tmax)
model_tmin.fit(X_train, y_train_tmin)

# Set up 5-fold cross-validation; cross_validate fits clones of each model,
# so the models fitted above are left unchanged
num_folds = 5

# Perform cross-validation for Tmax
scores_tmax = cross_validate(model_tmax, X_train, y_train_tmax, cv=num_folds, scoring=['neg_mean_squared_error', 'neg_mean_absolute_error'], return_train_score=True)

# Perform cross-validation for Tmin
scores_tmin = cross_validate(model_tmin, X_train, y_train_tmin, cv=num_folds, scoring=['neg_mean_squared_error', 'neg_mean_absolute_error'], return_train_score=True)


## Model Evaluation

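We average the 5-fold cross-validation scores computed on the training set and report Mean Squared Error (MSE) and Mean Absolute Error (MAE). scikit-learn returns these scores negated, so we flip the sign before printing.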


In [None]:
# Calculate average scores for Tmax
mse_tmax_cv = -1 * scores_tmax['test_neg_mean_squared_error'].mean()
mae_tmax_cv = -1 * scores_tmax['test_neg_mean_absolute_error'].mean()

# Calculate average scores for Tmin
mse_tmin_cv = -1 * scores_tmin['test_neg_mean_squared_error'].mean()
mae_tmin_cv = -1 * scores_tmin['test_neg_mean_absolute_error'].mean()

print(f"Tmax - Cross-validation MSE: {mse_tmax_cv:.4f}, MAE: {mae_tmax_cv:.4f}")
print(f"Tmin - Cross-validation MSE: {mse_tmin_cv:.4f}, MAE: {mae_tmin_cv:.4f}")


Tmax - Cross-validation MSE: 2.9629, MAE: 1.3081
Tmin - Cross-validation MSE: 3.5784, MAE: 1.4287
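
MSE is expressed in squared temperature units, so its square root (RMSE) is often easier to compare with MAE. For the values reported above, sqrt(2.9629) ≈ 1.72 for Tmax and sqrt(3.5784) ≈ 1.89 for Tmin. The sketch below shows the sign flip and the RMSE step on synthetic data; `X_demo`, `y_demo`, and the smaller forest size are assumptions made to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic regression problem standing in for the weather features
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)

model = RandomForestRegressor(n_estimators=20, random_state=0)
scores = cross_validate(
    model,
    X_demo,
    y_demo,
    cv=5,
    scoring=["neg_mean_squared_error", "neg_mean_absolute_error"],
)

# Scores come back negated so that "higher is better"; flip the sign to get errors
fold_mse = -scores["test_neg_mean_squared_error"]
fold_mae = -scores["test_neg_mean_absolute_error"]

cv_mse = fold_mse.mean()
cv_mae = fold_mae.mean()
cv_rmse = np.sqrt(cv_mse)  # same units as the target

print(f"CV MSE: {cv_mse:.4f}, RMSE: {cv_rmse:.4f}, MAE: {cv_mae:.4f}")
```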


## Prediction

Using the trained models (`model_tmax` and `model_tmin`), we make predictions on the test set (`X_test`) and calculate R², Mean Squared Error (MSE), and Mean Absolute Error (MAE).

In [None]:
# Predictions for Tmax and Tmin
y_pred_tmax = model_tmax.predict(X_test)
y_pred_tmin = model_tmin.predict(X_test)

# Calculate R-squared
r2_tmax = r2_score(y_test_tmax, y_pred_tmax)
r2_tmin = r2_score(y_test_tmin, y_pred_tmin)

# Calculate MSE and MAE
mse_tmax = mean_squared_error(y_test_tmax, y_pred_tmax)
mae_tmax = mean_absolute_error(y_test_tmax, y_pred_tmax)

mse_tmin = mean_squared_error(y_test_tmin, y_pred_tmin)
mae_tmin = mean_absolute_error(y_test_tmin, y_pred_tmin)

# Print the evaluation metrics
print(f"Tmax - R²: {r2_tmax:.4f}, MSE: {mse_tmax:.4f}, MAE: {mae_tmax:.4f}")
print(f"Tmin - R²: {r2_tmin:.4f}, MSE: {mse_tmin:.4f}, MAE: {mae_tmin:.4f}")


Tmax - R²: 0.8675, MSE: 2.9113, MAE: 1.2947
Tmin - R²: 0.8764, MSE: 3.4317, MAE: 1.3985
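
To show what the three metrics measure, here is a short sketch that computes them by hand with NumPy and checks the result against scikit-learn. The observed values are the first Tmax readings printed earlier; the predicted values are made up for illustration and are not model output.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([28.4, 26.5, 26.5, 27.4, 26.1])  # observed values
y_pred = np.array([27.9, 26.9, 26.0, 27.0, 27.1])  # made-up predictions

residuals = y_true - y_pred

# MSE: mean of squared residuals; MAE: mean of absolute residuals
mse_manual = np.mean(residuals ** 2)
mae_manual = np.mean(np.abs(residuals))

# R²: 1 minus the ratio of residual variation to total variation around the mean
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"R²: {r2_manual:.4f}, MSE: {mse_manual:.4f}, MAE: {mae_manual:.4f}")

# The hand-computed values agree with scikit-learn's implementations
assert np.isclose(mse_manual, mean_squared_error(y_true, y_pred))
assert np.isclose(mae_manual, mean_absolute_error(y_true, y_pred))
assert np.isclose(r2_manual, r2_score(y_true, y_pred))
```

An R² of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the observed values.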
