This notebook builds a Random Forest model that predicts temperature from measurements such as latitude, longitude, altitude, humidity, and pressure.
Feel free to use the Google Colab link instead: https://colab.research.google.com/drive/1fLzJgUMWTj6y6gqluc-0-VzuMufthjHB?usp=sharing

# Cleaning the data

We will first clean and scale the data. Here, we will:
*   drop any unnecessary columns
    *   including duplicate columns and weather balloon flight data
*   convert all columns to numeric and fill in any missing data with the median
*   separate our data into features and target variables
*   split data for test/train
*   scale the data
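
As a small illustration of the coercion and median-fill steps listed above, the sketch below runs them on a toy table. The values are made up for this example and are not taken from `PastLaunchData.csv`; only the column names `TEMP` and `HUMIDITY` are borrowed from the real data.

```python
import pandas as pd

# Hypothetical toy data: one unparseable string and one missing value
toy = pd.DataFrame({
    "TEMP": ["20.5", "bad", "19.0", "18.5"],
    "HUMIDITY": [40.0, 42.0, None, 50.0],
})

# Non-numeric entries such as "bad" are coerced to NaN
toy = toy.apply(pd.to_numeric, errors="coerce")

# Each NaN is replaced by the median of its own column
toy = toy.fillna(toy.median())

print(toy)
```

After these two steps the table contains no missing values, which is what the scaler and the model require.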

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler
import numpy as np

# Load the data
file_path = "/PastLaunchData.csv"
data = pd.read_csv(file_path)

# Drop unnecessary columns (like unnamed or duplicated temperature columns)
data = data.drop(columns=["Unnamed: 1", "TEMP.1", "EULERX", "EULERY", "EULERZ",
                          "COURSE", "NUM SATS", "VEL DIFF"], errors='ignore')

# Convert all columns to numeric, coercing errors to NaN
data = data.apply(pd.to_numeric, errors='coerce')

# Fill missing values with the median of each column
data.fillna(data.median(), inplace=True)

# Separate features and target variable
X = data.drop(columns=["TEMP"], errors='ignore')  # Features (exclude TEMP)
y = data["TEMP"]  # Target variable (temperature)

# Re-coerce the TIME column to numeric as a safeguard; non-numeric entries become NaN
X['TIME'] = pd.to_numeric(X['TIME'], errors='coerce')

# Fill NaN values in TIME with the column median. Assigning the result back,
# rather than calling fillna(..., inplace=True) on X['TIME'], avoids the pandas
# chained-assignment FutureWarning and keeps working in pandas 3.0.
X['TIME'] = X['TIME'].fillna(X['TIME'].median())

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the feature data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# Training the Random Forest model
In this section, we will use our cleaned data to train a Random Forest model.

In [None]:
# Initialize and train the Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = rf_model.predict(X_test_scaled)

# Evaluate the model's performance
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("Mean Absolute Error (MAE):", mae)
print("Root Mean Squared Error (RMSE):", rmse)


Mean Absolute Error (MAE): 0.02030306391752736
Root Mean Squared Error (RMSE): 0.046310774122926994


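After fitting the model to our training data and evaluating it on the held-out test set, we arrive at an MAE of 0.0203 and an RMSE of 0.0463. Both are expressed in the units of the TEMP column, so on the test set the predictions are off by roughly 0.02 units on average.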
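
To make the two metrics concrete, here is a minimal sketch that computes MAE and RMSE directly from their definitions and compares them with a naive baseline that always predicts the training mean. All numbers are made up for illustration; the baseline via `DummyRegressor` is an addition for context and is not part of the original notebook.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Made-up targets and predictions, for illustration only
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_hat = np.array([10.5, 11.5, 14.0, 17.0])

# MAE is the mean of the absolute errors
mae_manual = np.mean(np.abs(y_true - y_hat))

# RMSE is the square root of the mean squared error
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Baseline: always predict the mean of a (made-up) training target
y_train_toy = np.array([11.0, 13.0, 15.0])
baseline = DummyRegressor(strategy="mean")
baseline.fit(np.zeros((len(y_train_toy), 1)), y_train_toy)
baseline_pred = baseline.predict(np.zeros((len(y_true), 1)))
baseline_mae = mean_absolute_error(y_true, baseline_pred)

print("MAE:", mae_manual)
print("RMSE:", rmse_manual)
print("Baseline MAE:", baseline_mae)
```

Running the same comparison on `y_test` and `y_pred` would show how much the Random Forest improves on a constant prediction, which gives the raw error values a point of reference.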

In [None]:
# Print all feature variable names (excluding the target variable "TEMP")
feature_columns = X.columns.tolist()
print("Feature Variables:")
for feature in feature_columns:
  print(feature)

Feature Variables:
TIME
MILLIS
latitude (degrees)
longitude (degrees)
ALT
SPEED
HUMIDITY
ATM DIFF
PRESSURE
altitude (meters)
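
A natural follow-up to listing the feature names is to see how much each one contributes to the fitted forest. The sketch below shows the general pattern on synthetic data, pairing the `feature_importances_` attribute of a fitted `RandomForestRegressor` with the column names. The data here is randomly generated and the target is constructed to depend only on `ALT`, so the result says nothing about the real launch data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature table; values are random, not real measurements
X_demo = pd.DataFrame({
    "ALT": rng.uniform(0, 30000, 200),
    "HUMIDITY": rng.uniform(0, 100, 200),
})

# Synthetic target that depends only on ALT by construction
y_demo = 15.0 - 0.0065 * X_demo["ALT"]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_demo, y_demo)

# Pair each importance score with its column name; the scores sum to 1
importances = pd.Series(model.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

In the notebook itself, the same pattern would presumably use `rf_model.feature_importances_` with `feature_columns` as the index, since scaling with `StandardScaler` preserves the column order.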
