# AOSC498 Final Project - Machine Learning Tool Applied to a Mixing Layer Height Dataset

## Author: Rahim Kamara
## Date: July 25, 2022

## Comments explaining the code follow each line. I would like everybody, at any level of programming experience, to be able to understand this project and maybe find interest in continuing it.

## Import Libraries
### Think of a library as a collection of resources. These resources can be used to learn patterns in data, such as mixing layer height.

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from math import sqrt


## Read in Data

In [4]:
# This data was obtained from a previous project that measured mixing layer heights at Dulles (IAD) from June 29, 2021 to July 6, 2021.
# For each sounding in this period, the mixing layer height was either found or could not be determined using the method in the journal article by Wang, X. Y., & Wang, K. C. (2014), excluding the 1-2-1 smoother. If this project were to be continued, the 1-2-1 smoother should be applied to the data.
# A total of 10 radiosonde profiles during this period showed a clear indicator of mixing layer height.
df = pd.read_csv('mixinglayerheights_nbviewer_method.csv') # This line reads in the dataset.
df # This line displays the dataset.
# Reading from left to right, the dataset columns are: row index (Unnamed: 0), datehour (YearMonthDayUTCHour), pressure (hPa), height (meters), temperature (Celsius), dewpoint (Celsius), direction (degrees), speed (knots), u wind component (knots), v wind component (knots), station number, latitude, longitude, elevation (feet), and pw (meaning uncertain).

Unnamed: 0,datehour,pressure,height,temperature,dewpoint,direction,speed,u_wind,v_wind,station_number,latitude,longitude,elevation,pw
0,2021062912,801.4,2134,15.7,0.3,270.0,3.0,3.0,0.0,72403,38.98,-77.46,93.0,32.26
1,2021063000,814.0,1979,17.2,9.2,258.0,10.0,9.781476,2.079117,72403,38.98,-77.46,93.0,32.44
2,2021063012,989.0,263,26.6,16.6,271.0,7.0,6.998934,-0.122167,72403,38.98,-77.46,93.0,34.92
3,2021070100,1000.0,138,24.6,20.1,270.0,7.0,7.0,0.0,72403,38.98,-77.46,93.0,47.8
4,2021070212,998.0,93,20.8,19.8,200.0,3.0,1.02606,2.819078,72403,38.98,-77.46,93.0,40.44
5,2021070300,719.0,2835,0.6,-2.0,296.0,27.0,24.267439,-11.836021,72403,38.98,-77.46,93.0,19.46
6,2021070412,717.0,2891,2.2,-3.9,315.0,22.0,15.556349,-15.556349,72403,38.98,-77.46,93.0,22.39
7,2021070512,732.0,2788,5.4,-0.6,323.0,12.0,7.22178,-9.583626,72403,38.98,-77.46,93.0,27.06
8,2021070600,735.0,2793,12.4,-3.6,16.0,17.0,-4.685835,-16.341449,72403,38.98,-77.46,93.0,26.3
9,2021070612,824.7,1829,17.4,6.8,300.0,9.0,7.794229,-4.5,72403,38.98,-77.46,93.0,36.3
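
The comments in the cell above recommend applying a 1-2-1 smoother in future work. As a minimal sketch, assuming the usual definition in which each interior point is replaced by a weighted average of itself (weight 2) and its two neighbours (weight 1 each), it could look like the function below. The name `smooth_121` and the choice to leave the two end points unchanged are assumptions made for illustration; they are not details taken from Wang and Wang (2014).

```python
import numpy as np

def smooth_121(values):
    """Apply one pass of a 1-2-1 smoother to a 1-D profile.

    Hypothetical helper: interior points become
    (previous + 2 * current + next) / 4, end points are left as they are.
    """
    values = np.asarray(values, dtype=float)
    smoothed = values.copy()
    # Weighted average of each interior point with its two neighbours.
    smoothed[1:-1] = (values[:-2] + 2.0 * values[1:-1] + values[2:]) / 4.0
    return smoothed

# A noisy zig-zag profile is flattened, while a straight line is unchanged.
noisy_profile = smooth_121([0.0, 4.0, 0.0, 4.0, 0.0])
linear_profile = smooth_121([1.0, 2.0, 3.0, 4.0])
print(noisy_profile)
print(linear_profile)
```

In a continuation of this project, a function like this would be applied to each radiosonde profile (for example temperature or dewpoint versus height) before searching for the mixing layer height.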


In [5]:
#  df.head() # This line shows just the top 5 rows

In [6]:
df.isnull().values.any() # This line checks whether there are any null (missing) values in the dataset. False means nothing is missing.

False

In [7]:
# df.describe() # This line displays a summary of the data

In [8]:
X = df.drop(["height"], axis=1) # All columns except the height column are used as the input features (X).

In [9]:
y = df["height"] # The height column is the target (y) that the models learn to predict.
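
Several of the input columns shown in the table above (station number, latitude, longitude, elevation) hold the same value in every row, so they cannot help a model tell one sounding from another. The sketch below shows one possible way to find such columns. It uses three rows copied from the table; the names `sample`, `constant_columns`, and `informative` are hypothetical, and this step is not part of the original notebook.

```python
import pandas as pd

# Three rows copied from the dataset shown above (a subset of the columns).
sample = pd.DataFrame({
    "pressure": [801.4, 814.0, 989.0],
    "temperature": [15.7, 17.2, 26.6],
    "station_number": [72403, 72403, 72403],
    "latitude": [38.98, 38.98, 38.98],
})

# A column with only one unique value is constant across all rows.
constant_columns = [col for col in sample.columns if sample[col].nunique() == 1]

# Keep only the columns that actually vary from row to row.
informative = sample.drop(columns=constant_columns)

print("Constant columns:", constant_columns)
print("Columns that vary:", list(informative.columns))
```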

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123) # This line splits the rows into a training set (80%) and a testing set (20%). Setting random_state makes the split reproducible.

In [11]:
X_train.shape # This line tells us the number of rows and columns for training.

(8, 13)

In [12]:
X_test.shape # This line tells us the number of rows and columns for testing.

(2, 13)
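
To see what the split does without the real dataset, here is a small sketch on a made-up table of 10 rows. The column names and values in `toy` are invented for illustration; only the `train_test_split` call mirrors the one used above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, one feature column and one target column.
toy = pd.DataFrame({
    "feature": range(10),
    "height": range(100, 1100, 100),
})
X_toy = toy.drop(columns=["height"])
y_toy = toy["height"]

# Same settings as the notebook: 20% of the rows are held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123
)

# Repeating the call with the same random_state gives the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123
)

print(X_tr.shape, X_te.shape)
print("Rows held out for testing:", list(X_te.index))
```

With 10 rows and `test_size=0.2`, 8 rows go to training and 2 to testing, and no row appears in both sets.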

In [13]:
scaler = StandardScaler() # This line creates a scaler that standardizes each column to a mean of zero and a standard deviation of 1.

In [14]:
train_scaled = scaler.fit_transform(X_train) # This line learns each column's mean and standard deviation from the training data (fit), applies the scaling (transform), and returns a new array.

In [15]:
test_scaled = scaler.transform(X_test) # The testing data is scaled with the mean and standard deviation learned from the training data. Since the input variables have different units (scales), we standardize them to make the problem easier to model and to help performance during learning.
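
The scaling step can be checked by hand: subtract each column's training mean and divide by its training standard deviation. The sketch below does this on a few values copied from the table above (pressure, temperature, station number) and compares the result with `StandardScaler`. The array names are hypothetical. Note that a column that never changes, such as the station number, simply becomes all zeros.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: pressure (hPa), temperature (Celsius), station number (constant).
train = np.array([
    [801.4, 15.7, 72403.0],
    [814.0, 17.2, 72403.0],
    [989.0, 26.6, 72403.0],
])
test = np.array([[1000.0, 24.6, 72403.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std, then scale
test_scaled = scaler.transform(test)        # reuse the training mean/std

# The same calculation by hand.
mean = train.mean(axis=0)
std = train.std(axis=0)                     # population standard deviation
std_safe = np.where(std == 0, 1.0, std)     # avoid dividing by zero
manual_train = (train - mean) / std_safe
manual_test = (test - mean) / std_safe

print(np.allclose(train_scaled, manual_train))
print(np.allclose(test_scaled, manual_test))
```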

In [16]:
tree_model = DecisionTreeRegressor() # A decision tree is a model that learns how to split the data into different branches to capture non-linear relationships, like temperature versus altitude.
rf_model = RandomForestRegressor() # A random forest is a model that builds many decision trees and averages their outputs to make a prediction.

In [17]:
tree_model.fit(train_scaled, y_train) # We then train the decision tree using the scaled input variables (datehour, air pressure, temperature, etc.) and the unscaled output variable (height).
rf_model.fit(train_scaled, y_train) # We then train the random forest using the same scaled input variables and the same unscaled output variable (height).

RandomForestRegressor()

In [22]:
tree_train_mse = mean_squared_error(y_train, tree_model.predict(train_scaled)) # We then calculate the mean squared error (mse) for the decision tree on the training data. An error is the difference between a predicted height and the observed height. Each error is squared, which removes negative values and gives more weight to larger errors. We then take the average of the squared errors. The lower the mse, the better the forecast.
tree_train_mae = mean_absolute_error(y_train, tree_model.predict(train_scaled)) # We then calculate the mean absolute error (mae) for the decision tree on the training data. Each error is made positive by taking its absolute value, and these values are then averaged. The mae has the same units as height (meters). We want to keep this value within an acceptable range for accurate forecasting. This is useful for comparing year-by-year results.
rf_train_mse = mean_squared_error(y_train, rf_model.predict(train_scaled)) # The same mean squared error calculation, now for the random forest on the training data. The lower the mse, the better the forecast.
rf_train_mae = mean_absolute_error(y_train, rf_model.predict(train_scaled)) # The same mean absolute error calculation, now for the random forest on the training data.
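
The three error measures can be worked out by hand on a tiny example. The sketch below uses three made-up observed and predicted heights (not values from the dataset) and checks the hand calculation against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Made-up heights in meters, for illustration only.
observed = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

errors = predicted - observed        # [10, -10, 30]

mse = np.mean(errors ** 2)           # (100 + 100 + 900) / 3
mae = np.mean(np.abs(errors))        # (10 + 10 + 30) / 3
rmse = np.sqrt(mse)                  # back in meters, like the mae

print("mse =", mse, " mae =", mae, " rmse =", rmse)
print("sklearn mse =", mean_squared_error(observed, predicted))
print("sklearn mae =", mean_absolute_error(observed, predicted))
```

Because the errors are squared before averaging, the single 30 m error contributes 900 of the 1100 total, which is why the rmse ends up larger than the mae.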

## Decision Tree and Random Forest Training Results

In [19]:
print("Decision Tree training mse = ",tree_train_mse," & mae = ",tree_train_mae," & rmse = ", sqrt(tree_train_mse)) # This line shows the results below of the decision tree training mean squared error, mean absolute error, and root mean squared error
print("Random Forest training mse = ",rf_train_mse," & mae = ",rf_train_mae," & rmse = ", sqrt(rf_train_mse)) # This line prints the results below of the random forest training mean squared error, mean absolute error, and root mean squared error

Decision Tree training mse =  0.0  & mae =  0.0  & rmse =  0.0
Random Forest training mse =  34440.635162499995  & mae =  141.44875000000002  & rmse =  185.58188263540166


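## The decision tree reproduces the training data exactly (zero error), which means it has memorized the 8 training rows; this alone does not show that it can predict new data. The random forest training predictions were off by about 141 m on average (mae).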
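
A training error of exactly zero is what an unrestricted decision tree produces whenever every training row is different, even if the target values are random. The sketch below shows this on made-up data: `X_toy` and `y_toy` are invented, and `max_depth=1` is used only to show a restricted tree for contrast.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Made-up data: 8 distinct inputs and 8 random "heights" with no real pattern.
rng = np.random.default_rng(0)
X_toy = np.arange(8, dtype=float).reshape(-1, 1)
y_toy = rng.uniform(0.0, 3000.0, size=8)

# A tree with no depth limit keeps splitting until each leaf holds one row.
full_tree = DecisionTreeRegressor(random_state=0).fit(X_toy, y_toy)
# A tree limited to one split cannot memorize all 8 rows.
shallow_tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_toy, y_toy)

full_mse = mean_squared_error(y_toy, full_tree.predict(X_toy))
shallow_mse = mean_squared_error(y_toy, shallow_tree.predict(X_toy))

print("Unrestricted tree training mse:", full_mse)
print("Depth-1 tree training mse:", shallow_mse)
```

The unrestricted tree scores a perfect training mse on random targets, so a zero training error says little about how well the model will do on rows it has not seen.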

In [20]:
tree_test_mse = mean_squared_error(y_test, tree_model.predict(test_scaled)) # We then calculate the mean squared error (mse) for the decision tree on the testing data. An error is the difference between a predicted height and the observed height. Each error is squared, which removes negative values and gives more weight to larger errors. We then take the average of the squared errors. The lower the mse, the better the forecast.
tree_test_mae = mean_absolute_error(y_test, tree_model.predict(test_scaled)) # We then calculate the mean absolute error (mae) for the decision tree on the testing data. Each error is made positive by taking its absolute value, and these values are then averaged. The mae has the same units as height (meters). We want to keep this value within an acceptable range for accurate forecasting. This is useful for comparing year-by-year results.
rf_test_mse = mean_squared_error(y_test, rf_model.predict(test_scaled)) # The same mean squared error calculation, now for the random forest on the testing data. The lower the mse, the better the forecast.
rf_test_mae = mean_absolute_error(y_test, rf_model.predict(test_scaled)) # The same mean absolute error calculation, now for the random forest on the testing data.

## Decision Tree and Random Forest Testing Results

In [21]:
print("Decision Tree testing mse = ",tree_test_mse," & mae = ",tree_test_mae," & rmse = ", sqrt(tree_test_mse)) # This line shows the results below of the decision tree testing mean squared error, mean absolute error, and root mean squared error
print("Random Forest testing mse = ",rf_test_mse," & mae = ",rf_test_mae," & rmse = ", sqrt(rf_test_mse)) # This line prints the results below of the random forest testing mean squared error, mean absolute error, and root mean squared error

Decision Tree testing mse =  1764770.5  & mae =  1020.5  & rmse =  1328.446649286301
Random Forest testing mse =  396043.72265000007  & mae =  628.9250000000001  & rmse =  629.3200478691268
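
Because the testing set has only two rows (see `X_test.shape` above), the printed mse and mae are enough to recover the size of each individual error. If the two absolute errors are a and b, then a + b = 2 × mae and a² + b² = 2 × mse, so (a − b)² = 2(a² + b²) − (a + b)². The helper `two_point_abs_errors` below is a hypothetical name for this arithmetic; it is a sketch, not part of the original notebook.

```python
from math import sqrt

def two_point_abs_errors(mse, mae):
    """Recover the two absolute errors behind an mse/mae pair.

    Only valid when the metrics were computed from exactly two samples.
    """
    total = 2.0 * mae                      # a + b
    sum_sq = 2.0 * mse                     # a^2 + b^2
    diff_sq = 2.0 * sum_sq - total ** 2    # (a - b)^2
    diff = sqrt(max(diff_sq, 0.0))         # guard against tiny negative rounding
    return (total + diff) / 2.0, (total - diff) / 2.0

# Values copied from the printed testing results above.
tree_large, tree_small = two_point_abs_errors(1764770.5, 1020.5)
rf_large, rf_small = two_point_abs_errors(396043.72265000007, 628.9250000000001)

print("Decision tree errors (m):", tree_large, tree_small)
print("Random forest errors (m):", round(rf_large, 2), round(rf_small, 2))
```

For the decision tree this gives one error of 1871 m and one of 170 m, so a single badly missed sounding dominates its mse. The random forest misses both testing rows by a similar amount, roughly 651 m and 607 m.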


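## Both models predict the testing data much worse than the training data. The random forest rmse rises from about 186 m (training) to about 629 m (testing), and the decision tree rmse rises from 0 m to about 1328 m.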

## Discussion

## Both the decision tree and random forest testing sets display large errors as measured by the mean squared error, mean absolute error, and root mean squared error. Something to consider if this project were to be continued is using this tool for different time regimes. For example, the mixing layer height is commonly found at lower altitudes during the night and higher altitudes during the day. We could instead separate the dataset into 0 UTC and 12 UTC soundings. This might improve the accuracy of the model. Again, a 1-2-1 smoother should be applied to the data initially collected by rawinsonde. I attempted to create a tool that would predict the mixing layer height days in advance, but, as the word "advance" suggests, this task was too advanced for me to do. This project has further piqued my interest in computer science, as I see that it can add so much to our understanding of the natural world.
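
As a starting point for the 0 UTC and 12 UTC idea, the sketch below separates the soundings by hour. The `datehour` and `height` values are copied from the dataset shown earlier. The names `soundings`, `utc_hour`, `df_00utc`, and `df_12utc` are hypothetical, and the sketch assumes the last two digits of `datehour` are always the UTC hour.

```python
import pandas as pd

# datehour (YearMonthDayUTCHour) and height (meters) copied from the dataset above.
soundings = pd.DataFrame({
    "datehour": [2021062912, 2021063000, 2021063012, 2021070100, 2021070212,
                 2021070300, 2021070412, 2021070512, 2021070600, 2021070612],
    "height": [2134, 1979, 263, 138, 93, 2835, 2891, 2788, 2793, 1829],
})

# The last two digits of datehour are the UTC hour (0 or 12).
soundings["utc_hour"] = soundings["datehour"] % 100

# One table per launch time.
df_00utc = soundings[soundings["utc_hour"] == 0]
df_12utc = soundings[soundings["utc_hour"] == 12]

print("0 UTC soundings:", len(df_00utc))
print("12 UTC soundings:", len(df_12utc))
```

This leaves only 4 soundings at 0 UTC and 6 at 12 UTC, so training a separate model for each time would likely need a longer data record than the 10 profiles used here.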

## Acknowledgements

### I would like to thank GitHub user srnghn for their template on machine learning using decision tree and random forest models. Their comments on the script were really helpful for understanding a topic that is novel to me. I would like to thank Dr. Ruben Delgado for being an amazing research mentor and suggesting this topic. This was probably the most mentally challenging task that I have taken on so far, and I have come around to loving it! There is no growth without strain. I would like to thank Dr. Timothy Canty for being understanding of my situation and teaching his students very well! I would like to thank Dr. Alexandra Jones for taking a risk on me and sharing her abundance of resources, allowing me to be where I am currently. I would like to thank AOSC students for being an amazing support group.

### I deeply thank NCAS-M for providing monthly workshops with professionals aligned with NOAA’s mission, funding for conferences, and education support. I would like to thank Dr. Xin-Zhong Liang for being my NCAS-M mentor and teaching the importance of having curiosity. I wish I had shed the fear of being surrounded by established scientists so that I could ask more questions.

### I would like to thank my family for raising me and teaching me well. Finally, I would like to thank Kaya for inspiring me to take chances and risks in life in order to see tomorrow.