---
<h1 style="text-align: center;">
CSCI 4521: Applied Machine Learning (Fall 2024)
</h1>

<h1 style="text-align: center;">
Homework 4
</h1>

<h3 style="text-align: center;">
(Due Tue, Nov. 12, 11:59 PM CT)
</h3>

---

### Weather impacts life each and every day in both relatively minor and significant ways. Many people check the weather daily to know what the predicted temperature and precipitation will be so they can plan how to dress, what activities to do, and how early to leave on their daily commute.

### In this homework, your task is to predict the temperature (in degrees Celsius) at different points in time. You need to use machine learning and develop regression models to accomplish this task. The only data you have available is the weather data in the dataset `weather_csci4521_hw4.csv`. Each row in the dataset is a different point in time and the columns are the features consisting of Date and Daily Summary, and many features computed from Visibility, Wind Speed and Bearing, Humidity, Pressure, and Loud Cover. The target variable is in column "Temperature (C)".

### You must clean and preprocess the data then decide which regression algorithms to use, which and how to tune any hyperparameters, how to measure performance, which models to select, and which final model to use.

### You must use **PySpark** to clean and preprocess the data. If you use anything other than PySpark, you will receive no credit for this homework. The one exception is for feature selection. You are allowed to use Pandas to decide how many features to keep but you must use PySpark to select those features. After cleaning and preprocessing, you can use any of the coding packages we've used in class (Numpy, Pandas, PySpark, Scikit-learn, etc.). Make sure to write and submit clean, working code. Reminder, you cannot use ChatGPT or similar technologies. Please see the syllabus for more details.

### You also need to submit a short report of your work describing all steps you took, explanations of why you took those steps, results, what you learned, how you might use what you learned in the future, and your conclusions. We expect the report to be well-written and clearly describe everything you've done and why.

---

### Write your code here

# Initializing all the packages used for preprocessing

In [41]:
#Packages used for data pre processing
!pip install pyspark
from pyspark.sql import DataFrame
from pyspark.sql.functions import mean as mean, stddev as stddev, median as median
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession
from pyspark.ml.feature import StandardScaler
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType
import numpy as np
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.functions import vector_to_array




# Shape of initial data

In [42]:
spark = SparkSession.builder.appName("PySpark_Tutorial").getOrCreate()
df=spark.read.csv("/content/weather_csci4521_hw4.csv",header=True)

col=len(df.columns)
rows=df.count()
print("initial num rows:",rows,"\n")
print("initial num cols:",col-1,"\n")
print("Data set")
df.show(n=10)

initial num rows: 96453 

initial num cols: 108 

Data set

# Converting Formatted Date into separate features

In [43]:
#Converting the date column into separate features
from pyspark.sql.functions import split


#Split the Formatted Date column into year, month, and day1 (which holds the day, the time, and the zone)
df = df.withColumn('year', split(df['Formatted Date'], '-').getItem(0)) \
       .withColumn('month', split(df['Formatted Date'], '-').getItem(1)) \
       .withColumn('day1', split(df['Formatted Date'], '-').getItem(2))

#Split day1 into columns day time zones
df=df.withColumn("day",split(df["day1"]," ").getItem(0))\
       .withColumn("time",split(df["day1"]," ").getItem(1))\
       .withColumn("Zones",split(df["day1"]," ").getItem(2))

#split time into hour minutes and seconds
df=df.withColumn("hour",split(df["time"],":").getItem(0))\
       .withColumn("minute",split(df["time"],":").getItem(1))\
       .withColumn("seconds",split(df["time"],":").getItem(2))

# Dropping unnecessary columns after the split
df=df.drop("day1","time","Formatted Date")
print("Data Frame after adding columns: year,month,Zones,hour,minute,seconds\n")
df.show(truncate=False,n=10)







Data Frame after adding columns: year,month,Zones,hour,minute,seconds

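The split-based parsing above can also be expressed with a date parser. The sketch below is a plain-Python illustration, not the PySpark code used in this notebook. It assumes the `Formatted Date` strings look like `2006-04-01 00:00:00.000 +0200`; the raw values are not visible in the truncated output, so that format is an assumption. The function name `parse_formatted_date` and the key `utc_offset_hours` are hypothetical. One reason a parser may be preferable: splitting on `-` would lose the sign of a negative UTC offset, whereas `%z` keeps it.

```python
from datetime import datetime


def parse_formatted_date(text):
    """Parse one timestamp, assuming the form 'YYYY-MM-DD HH:MM:SS.fff +ZZZZ'."""
    stamp = datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f %z")
    # utcoffset() is a timedelta; convert it to signed hours.
    offset_hours = stamp.utcoffset().total_seconds() / 3600
    return {
        "year": stamp.year,
        "month": stamp.month,
        "day": stamp.day,
        "hour": stamp.hour,
        "minute": stamp.minute,
        "second": stamp.second,
        "utc_offset_hours": offset_hours,
    }


# Hypothetical example values in the assumed format.
example = parse_formatted_date("2006-04-01 00:00:00.000 +0200")
negative = parse_formatted_date("2006-04-01 00:00:00.000 -0500")
```

Here `example["utc_offset_hours"]` is positive and `negative["utc_offset_hours"]` is negative, so the sign of the zone survives parsing.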

# Encoding Daily Summary

In [44]:
#Encode the categorical variable Daily Summary
print("Data Frame before encoding Daily Summary:\n")
df.show(n=10)

indexer = StringIndexer(inputCol="Daily Summary", outputCol="encoded_Summary")
df = indexer.fit(df).transform(df)
df=df.drop("Daily Summary")

print("Data Frame after encoding Daily Summary:\n")
df.show(n=10)

Data Frame before encoding Daily Summary:


# Reordering the data frame for ease of use

In [45]:
#Temperature (C) sits between some of the feature columns, which makes the data harder to work with
col_array=df.columns[:]
print("Data before ordering:\n")
df.show(n=10)
# Move the target to the end so every feature column comes before it
correct_col_ord=[c for c in col_array if c!="Temperature (C)"]
correct_col_ord.append("Temperature (C)")
df=df[correct_col_ord]
print("Data after ordering:\n")
df.show(n=10)

Data before ordering:


# Removing features with only 0 in them

In [46]:
# remove features seconds, minute and feature_17 as these columns have only 0s
print("number of columns before removing features which only have zeros:",len(df.columns),"\n")
df=df.drop("seconds","minute","feature_17")
print("number of columns after removing features which only have zeros:",len(df.columns))



number of columns before removing features which only have zeros: 115 

number of columns after removing features which only have zeros: 112


# Converting data type of features to float

In [47]:
columns=df.columns[:]
#Changing data types of columns from string to float
for col in columns:
  df= df.withColumn(col, F.col(col).cast('float'))

# Replacing null values with median of data

In [48]:
#Replace null values with the median of each column
#The median is used because it is generally not affected by outliers
print("Data before replacing null values:\n")
df.show(n=10)

cols_to_replace_missing_vals=df.columns[:-1]

for col in cols_to_replace_missing_vals:
  Median=df.select(median(col)).collect()[0][0]
  df = df.fillna(Median, subset=col)

print("data after replacement:\n")
df.show(n=10)

Data before replacing null values:


# Scaling the data after assembling it into a vector

In [49]:
print("data before scaling:\n")
df.show(n=10)

# Assemble every feature column into one vector; the last column is the target and is left out
cols = df.columns[:-1]
vec_assembler = VectorAssembler(inputCols=cols, outputCol="vec_feats")
df_used= vec_assembler.transform(df)

# Apply the standard scaler to the assembled vector
scaler = StandardScaler(inputCol = "vec_feats", outputCol = "scaled_features")# Standardized the data as it provided a better model when tested relative to the min-max scaler
scaler_model = scaler.fit(df_used)
df_used= scaler_model.transform(df_used)

# Split the scaled vector back into one column per feature, keeping the original column names
df3 = df_used.withColumn("split_int", vector_to_array(F.col("scaled_features"))).select("Temperature (C)","vec_feats","scaled_features",*[F.col("split_int")[i].alias(cols[i]) for i in range(len(cols))])

df=df3
print("Data after scaling:\n")
df.show(n=10)


data before scaling:

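PySpark's `StandardScaler` defaults to `withStd=True` and `withMean=False`, so the scaling cell above divides each feature by its standard deviation without subtracting the mean. The sketch below illustrates the difference in plain Python on a made-up column. The helper names `scale_by_std` and `standardize` and the sample values are hypothetical, and returning 0.0 for a constant column is a choice made for this sketch.

```python
import statistics


def scale_by_std(values):
    """Divide by the sample standard deviation only (no mean centering)."""
    std = statistics.stdev(values)  # n - 1 in the denominator
    if std == 0:
        # Constant column: nothing to scale, so map everything to 0.0 here.
        return [0.0 for _ in values]
    return [v / std for v in values]


def standardize(values):
    """Subtract the mean, then divide by the sample standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


# Made-up column values, for illustration only.
column = [2.0, 4.0, 6.0]
unit_std_only = scale_by_std(column)
centered = standardize(column)
```

With these values the sample standard deviation is 2, so `unit_std_only` keeps the column's offset from zero while `centered` is symmetric around zero. Passing `withMean=True` to the scaler would give the centered variant.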

# Drops all duplicates in data

In [50]:
# Remove data duplicates using drop duplicates
print("Num rows before dropping duplicates: ",df.count(),"\n")

df=df.drop_duplicates()

print("Num rows after dropping duplicates: ",df.count(),"\n")

Num rows before dropping duplicates:  96453 

Num rows after dropping duplicates:  96212 



# Removing outliers using the three-standard-deviation method

In [51]:
cols=df.columns[3:]
# Use the standard deviation method to remove any data point more than 3 standard deviations below or above the column mean
print("Shape of a data before removing outliers")
print(("num_samples:",df.count()))
print("\n")
total_outlier=0
initialdf=df
outliers_collector=[]
upper_list=[]
lower_list=[]
mean_list=[]
std_list=[]
summary_df = df.select(cols).describe()
summary_stats = summary_df.filter(summary_df["summary"].isin("mean", "stddev")).collect()
for j in cols:
  data_mean =float(summary_stats[0][j])
  data_std = float(summary_stats[1][j])

  cut_off = data_std * 3
  lower, upper = data_mean - cut_off, data_mean + cut_off

  lower_list.append(lower)
  upper_list.append(upper)

broadcast_lower_list = spark.sparkContext.broadcast(lower_list)
broadcast_upper_list = spark.sparkContext.broadcast(upper_list)
upper_list=broadcast_upper_list.value
lower_list=broadcast_lower_list.value


for i in range(0,len(cols)):
  outliers_df = df.where((F.col(cols[i]) < lower_list[i]) | (F.col(cols[i]) > upper_list[i]))
  outliers_collector.append(outliers_df)
  df = df.where((F.col(cols[i]) >= lower_list[i]) & (F.col(cols[i]) <= upper_list[i]))




print("Shape of a data after removing outliers")
print(("num_samples:",df.count()))





Shape of a data before removing outliers
('num_samples:', 96212)


Shape of a data after removing outliers
('num_samples:', 91720)
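The three-standard-deviation rule used above can be sketched in a few lines of plain Python. This is an illustration on made-up numbers, not the PySpark code from the cell; the helper names `three_sigma_bounds` and `drop_outliers` and the column name `pressure` are hypothetical. It uses the sample standard deviation, which is also what Spark's `describe()` reports as `stddev`.

```python
import statistics


def three_sigma_bounds(values, k=3.0):
    """Return (lower, upper) = mean minus/plus k sample standard deviations."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return mean - k * std, mean + k * std


def drop_outliers(rows, column, k=3.0):
    """Keep only the rows whose value in `column` lies inside the bounds."""
    lower, upper = three_sigma_bounds([row[column] for row in rows], k)
    return [row for row in rows if lower <= row[column] <= upper]


# Nineteen typical readings and one extreme reading (made-up data).
rows = [{"pressure": 10.0} for _ in range(19)] + [{"pressure": 100.0}]
lower, upper = three_sigma_bounds([row["pressure"] for row in rows])
kept = drop_outliers(rows, "pressure")
```

As in the notebook cell, the bounds are computed once from the full column before any rows are removed, so the extreme reading itself inflates the standard deviation that is used to judge it.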


# Using a univariate F-test to select the top 20 features

In [52]:
from pyspark.ml.feature import UnivariateFeatureSelector

selector = UnivariateFeatureSelector(featuresCol = "scaled_features", outputCol="selected_features", labelCol = "Temperature (C)")
selector.setFeatureType("continuous").setLabelType("continuous").setSelectionThreshold(20)
model = selector.fit(df)
df_feat_sel = model.transform(df)
col_name=[]
cols=df.columns
col_array=[]
for i in model.selectedFeatures:
  col_name.append(cols[i+3])
  col_array.append(df[cols[i+3]])

# The features and the label are both continuous, so the selector scores each feature with an F-test
print(f"Features selected using F-test: {model.selectedFeatures}\n")
print("Names of col selected:",col_name)



Features selected using F-test: [7, 8, 10, 18, 86, 106, 108, 109, 110, 105, 34, 100, 83, 43, 47, 59, 69, 87, 103, 54]

Names of col selected: ['feature_7', 'feature_8', 'feature_10', 'feature_19', 'feature_87', 'month', 'Zones', 'hour', 'encoded_Summary', 'year', 'feature_35', 'feature_101', 'feature_84', 'feature_44', 'feature_48', 'feature_60', 'feature_70', 'feature_88', 'feature_104', 'feature_55']
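For a continuous feature and a continuous label, a univariate F-test ranks each feature by how well a one-variable linear fit explains the label. The sketch below computes the standard statistic F = r² / (1 - r²) · (n - 2), where r is the Pearson correlation, in plain Python. It is meant as an illustration of the idea rather than a reproduction of Spark's internals; the function names and the two toy features (`humidity`, `noise`) are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def f_statistic(xs, ys):
    """F-statistic of a single-feature linear regression of ys on xs."""
    r = pearson_r(xs, ys)
    dof = len(xs) - 2
    return (r * r) / (1.0 - r * r) * dof


def top_k_features(columns, target, k):
    """Return the names of the k features with the largest F-statistic."""
    scores = {name: f_statistic(vals, target) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: one feature tracks the target closely, the other barely does.
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
columns = {
    "humidity": [6.1, 5.0, 3.9, 3.2, 1.8, 1.1],
    "noise": [2.0, 5.0, 1.0, 4.0, 3.0, 6.0],
}
best = top_k_features(columns, target, 1)
```

Because the statistic depends only on r², a strongly negative correlation scores as highly as a strongly positive one, which is why `humidity` wins here. A feature with a perfect linear relationship (r² = 1) would need special handling, which this sketch omits.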


# Final data frame

In [61]:
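# Build the final column list without mutating col_array, so re-running this cell cannot append the target column a second time
final_cols=col_array+[df["Temperature (C)"]]

final_df=df[final_cols]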
print("Shape of final data frame")
print(" final rows: ",final_df.count()," final columns:",len(final_df.columns))
print("Final data frame\n")
final_df.show(n=10)

Shape of final data frame
 final rows:  91720  final columns: 22
Final data frame

+-----------------+-----------------+------------------+-------------------+-------------------+-----------------+------------------+------------------+------------------+-----------------+-------------------+--------------------+--------------------+--------------------+--------------------+--------------------+------------------+--------------------+--------------------+--------------------+---------------+---------------+
|        feature_7|        feature_8|        feature_10|         feature_19|         feature_87|            month|             Zones|              hour|   encoded_Summary|             year|         feature_35|         feature_101|          feature_84|          feature_44|          feature_48|          feature_60|        feature_70|          feature_88|         feature_104|          feature_55|Temperature (C)|Temperature (C)|
+-----------------+-----------------+------------------+---

# Initializing packages used to train the regression models

In [54]:
from sklearn.model_selection import train_test_split,RandomizedSearchCV, KFold,GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,BaggingRegressor,AdaBoostRegressor,GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score



# Preparing train and test split

In [55]:

X_panda=final_df.toPandas()
y_panda=X_panda["Temperature (C)"]
X_panda=X_panda.drop(columns=["Temperature (C)"])
x_numpy=X_panda.to_numpy()
y_numpy=y_panda.to_numpy()
X_train, X_test, y_train, y_test = train_test_split(x_numpy, y_numpy, test_size=0.2, random_state=42,shuffle=True)

# Model 1: Linear regression

In [56]:
#Train model
k_lin = KFold(n_splits=2)
lin_reg = LinearRegression(n_jobs=-1)
lin_reg_param_grid = {"fit_intercept": [True, False],"positive": [True, False]}
lin_reg_search = GridSearchCV(lin_reg, lin_reg_param_grid, cv=k_lin, n_jobs=-1)
lin_reg_search.fit(X_train, y_train)

# Collect important values
lin_reg_model = lin_reg_search.best_estimator_
cv_results = lin_reg_search.cv_results_
lin_best_score = lin_reg_search.best_score_

# Test model
y_train_pred = lin_reg_model.predict(X_train)
y_test_pred = lin_reg_model.predict(X_test)

# Check model for mse and r square
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

# Show cross-val results
mean_scores = cv_results['mean_test_score']
std_scores = cv_results['std_test_score']
params = cv_results['params']

print("Cross-Validation Results:\n")
for i in range(len(mean_scores)):
    print("Parameters:", params[i], " Mean Test Score:", mean_scores[i], " Std Test Score:", std_scores[i], "\n")

# Best hyperparameters
print("\nBest Hyperparameters Chosen for model:", lin_reg_search.best_params_, "with cross-validation score:", lin_best_score, "\n")

#results
print("Linear Regression Results for model:\n")
print("Train MSE:", train_mse)
print("Train RMSE:", np.sqrt(train_mse))
print("Train R²:", train_r2, "\n")

print("Test MSE:", test_mse)
print("Test RMSE:", np.sqrt(test_mse))
print("Test R²:", test_r2, "\n")

Cross-Validation Results:

Parameters: {'fit_intercept': True, 'positive': True}  Mean Test Score: 0.5973740422691918  Std Test Score: 0.00033690931339935837 

Parameters: {'fit_intercept': True, 'positive': False}  Mean Test Score: 0.7403280980502294  Std Test Score: 0.0012940207835558937 

Parameters: {'fit_intercept': False, 'positive': True}  Mean Test Score: 0.4520104963561851  Std Test Score: 0.0010772151887122328 

Parameters: {'fit_intercept': False, 'positive': False}  Mean Test Score: 0.7401431351698418  Std Test Score: 0.0012826487573549539 


Best Hyperparameters Chosen for model: {'fit_intercept': True, 'positive': False} with cross-validation score: 0.7403280980502294 

Linear Regression Results for model:

Train MSE: 23.77297364655211
Train RMSE: 4.875753649083607
Train R²: 0.7404392640170112 

Test MSE: 23.65912785717958
Test RMSE: 4.8640649519902155
Test R²: 0.7399586882406577 



# Model 2: Random forest regressor

In [57]:


# Train model
k_rf = KFold(n_splits=2)
rf = RandomForestRegressor(random_state=46,min_samples_leaf=20,max_features=None,min_samples_split=10,n_jobs=-1,oob_score=True)
rf_param_grid = {"n_estimators": [50,60,70,80,90,100],"criterion":["squared_error","friedman_mse"]}
rf_model = RandomizedSearchCV(rf, rf_param_grid, cv=k_rf, n_jobs=-1,random_state=46,n_iter=5)
rf_model.fit(X_train, y_train)

# Collect model information
rf_best_model = rf_model.best_estimator_
rf_cv_results = rf_model.cv_results_
rf_best_score=rf_model.best_score_
rf_best_params=rf_model.best_params_

# Test model
y_pred = rf_best_model.predict(X_test)
y_pred_train=rf_best_model.predict(X_train)

#Check performance using metrics
test_mse = mean_squared_error(y_test, y_pred)
test_r2 = r2_score(y_test, y_pred)
train_mse = mean_squared_error(y_train, y_pred_train)
train_r2 = r2_score(y_train, y_pred_train)

# Show cross-val results
mean_scores = rf_cv_results['mean_test_score']
std_scores = rf_cv_results['std_test_score']
params = rf_cv_results['params']

print("Cross-Validation Results:")
for i in range(len(mean_scores)):
    print("Parameters:",params[i]," Mean Test Score:",mean_scores[i]," Std Test Score:",std_scores[i],"\n")

#Best hyper parameter
print("\nBest Hyperparameters Choosen for model:", rf_best_params,"its cross val score",rf_best_score,"\n")

# Show train and test on metric mse,rmse,Rsquare
print("Random forest Regression Results for model:\n")
print("Train MSE:", train_mse,)
print("Train RMSE:", np.sqrt(train_mse))
print("Train R²:", train_r2,"\n")

print("Test MSE:", test_mse)
print("Test RMSE:", np.sqrt(test_mse))
print("Test R²:", test_r2,"\n")


Cross-Validation Results:
Parameters: {'n_estimators': 60, 'criterion': 'friedman_mse'}  Mean Test Score: 0.8878134576653254  Std Test Score: 0.0007462179195630302 

Parameters: {'n_estimators': 60, 'criterion': 'squared_error'}  Mean Test Score: 0.8878134576653254  Std Test Score: 0.0007462179195630302 

Parameters: {'n_estimators': 50, 'criterion': 'squared_error'}  Mean Test Score: 0.8875466418776212  Std Test Score: 0.000776880140104852 

Parameters: {'n_estimators': 50, 'criterion': 'friedman_mse'}  Mean Test Score: 0.8875466418776212  Std Test Score: 0.000776880140104852 

Parameters: {'n_estimators': 80, 'criterion': 'friedman_mse'}  Mean Test Score: 0.888054669125351  Std Test Score: 0.0007767575262858095 


Best Hyperparameters Choosen for model: {'n_estimators': 80, 'criterion': 'friedman_mse'} its cross val score 0.888054669125351 

Random forest Regression Results for model:

Train MSE: 6.68597576432191
Train RMSE: 2.585725384553029
Train R²: 0.9270004326781596 

Test MSE: 

# Model 3: decision tree regression with data bagging

In [58]:
#Train model
k_bagging = KFold(n_splits=2)
base_estimator = DecisionTreeRegressor(random_state=46)
bagging = BaggingRegressor(estimator=base_estimator,random_state=46,n_jobs=-1)
bagging_param_grid = {"n_estimators": [10, 20, 30, 40, 50],"max_samples": [0.5, 0.7, 0.9, 1.0],"bootstrap": [True, False]}
bagging_search = RandomizedSearchCV(bagging,bagging_param_grid,cv=k_bagging, n_jobs=-1,random_state=46,n_iter=5)
bagging_search.fit(X_train, y_train)

#collection of important information
bagging_model = bagging_search.best_estimator_
cv_results = bagging_search.cv_results_
bg_best_score=bagging_search.best_score_

#Use model to predict
y_train_pred = bagging_model.predict(X_train)
y_test_pred = bagging_model.predict(X_test)

# Calculate metrics
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

# Show cross-validation results
mean_scores = cv_results['mean_test_score']
std_scores = cv_results['std_test_score']
params = cv_results['params']

print("Cross-Validation Results:\n")
for i in range(len(mean_scores)):
    print("Parameters:",params[i]," Mean Test Score:",mean_scores[i]," Std Test Score:",std_scores[i],"\n")

# Best hyper parameter chosen for model
print("\nBest Hyperparameters Choosen for model:", bagging_search.best_params_,"its cross val score",bg_best_score,"\n")

# Model results
print("Bagging Regression Results for model:\n")
print("Train MSE:", train_mse,)
print("Train RMSE:", np.sqrt(train_mse))
print("Train R²:", train_r2,"\n")

print("Test MSE:", test_mse)
print("Test RMSE:", np.sqrt(test_mse))
print("Test R²:", test_r2,"\n")


Cross-Validation Results:

Parameters: {'n_estimators': 40, 'max_samples': 0.5, 'bootstrap': True}  Mean Test Score: 0.8940178838514858  Std Test Score: 0.0009400871884050477 

Parameters: {'n_estimators': 40, 'max_samples': 1.0, 'bootstrap': False}  Mean Test Score: 0.8222734813606395  Std Test Score: 1.8340103755565274e-05 

Parameters: {'n_estimators': 20, 'max_samples': 0.7, 'bootstrap': True}  Mean Test Score: 0.8950569573809106  Std Test Score: 0.00027353626584614155 

Parameters: {'n_estimators': 50, 'max_samples': 0.5, 'bootstrap': True}  Mean Test Score: 0.8947801056792424  Std Test Score: 0.0007500390586305805 

Parameters: {'n_estimators': 30, 'max_samples': 1.0, 'bootstrap': True}  Mean Test Score: 0.9006116276442846  Std Test Score: 0.0006727184402920972 


Best Hyperparameters Choosen for model: {'n_estimators': 30, 'max_samples': 1.0, 'bootstrap': True} its cross val score 0.9006116276442846 

Bagging Regression Results for model:

Train MSE: 1.1258544711565195
Train RMS

# Model 4: decision tree regression with ada boosting

In [59]:
#Train model
k_boost= KFold(n_splits=2)
estimator=DecisionTreeRegressor(random_state=46)
ada_boost=AdaBoostRegressor(estimator,random_state=46)
ada_param_grid ={"n_estimators": [5,10,15,20],"learning_rate": [1.0,1.2,1.4,1.5],"loss": ['square', 'exponential']}
ada_random = RandomizedSearchCV(ada_boost,ada_param_grid,cv=k_boost, n_iter=5, n_jobs=-1,random_state=46)
ada_random.fit(X_train, y_train)

#collection of values from model built
best_ada= ada_random.best_estimator_
ada_cv_results = ada_random.cv_results_
ada_best_score=ada_random.best_score_

#Use model to predict
y_train_pred = best_ada.predict(X_train)
y_test_pred = best_ada.predict(X_test)

# Calculate metrics
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

#Cross val scores

mean_scores = ada_cv_results['mean_test_score']
std_scores = ada_cv_results['std_test_score']
params = ada_cv_results['params']

print("Cross-Validation Results:\n")
for i in range(len(mean_scores)):
    print("Parameters:",params[i]," Mean Test Score:",mean_scores[i]," Std Test Score:",std_scores[i],"\n")

# choose one with highest
print("\nBest Hyperparameters Choosen for model:", ada_random.best_params_,"its cross val score",ada_best_score,"\n")

print(" Ada boost Regression on descision tree result Results for model:\n")
print("Train MSE:", train_mse,)
print("Train RMSE:", np.sqrt(train_mse))
print("Train R²:", train_r2,"\n")

print("Test MSE:", test_mse)
print("Test RMSE:", np.sqrt(test_mse))
print("Test R²:", test_r2,"\n")


Cross-Validation Results:

Parameters: {'n_estimators': 10, 'loss': 'square', 'learning_rate': 1.0}  Mean Test Score: 0.8850455069001257  Std Test Score: 0.0011029623048905424 

Parameters: {'n_estimators': 5, 'loss': 'exponential', 'learning_rate': 1.2}  Mean Test Score: 0.8671858079029791  Std Test Score: 0.0009409581729593053 

Parameters: {'n_estimators': 15, 'loss': 'exponential', 'learning_rate': 1.5}  Mean Test Score: 0.8936364861343945  Std Test Score: 2.877321590455395e-05 

Parameters: {'n_estimators': 15, 'loss': 'square', 'learning_rate': 1.2}  Mean Test Score: 0.8926445937613932  Std Test Score: 0.00020346041548874316 

Parameters: {'n_estimators': 5, 'loss': 'exponential', 'learning_rate': 1.0}  Mean Test Score: 0.8687426497187516  Std Test Score: 0.0014498120979817841 


Best Hyperparameters Choosen for model: {'n_estimators': 15, 'loss': 'exponential', 'learning_rate': 1.5} its cross val score 0.8936364861343945 

 Ada boost Regression on descision tree result Results f

# Model 5: Gradient boosting regressor (boosted decision trees)

In [60]:
#Train model

k_grad = KFold(n_splits=2)
# GradientBoostingRegressor builds its own decision trees, so no separate base estimator is needed
Grad_boost=GradientBoostingRegressor(random_state=46)
Grad_param_grid ={"max_depth":[1,2,3,4,5],"n_estimators": [10,20,30,40],"learning_rate": [0.1,0.5, 1.0],"loss": ['absolute_error','quantile','squared_error']}
Grad_random = RandomizedSearchCV(Grad_boost,Grad_param_grid,cv=k_grad, n_iter=5, n_jobs=-1,random_state=46)
Grad_random.fit(X_train, y_train)

#Best estimator
best_Grad = Grad_random.best_estimator_
best_score=Grad_random.best_score_
Grad_cv_results = Grad_random.cv_results_

#prediction
y_train_pred = best_Grad.predict(X_train)
y_test_pred = best_Grad.predict(X_test)

#Check model efficacy using metric
train_mse = mean_squared_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
test_r2 = r2_score(y_test, y_test_pred)

# Showing cross val score
mean_scores = Grad_cv_results['mean_test_score']
std_scores = Grad_cv_results['std_test_score']
params = Grad_cv_results['params']

print("Cross-Validation Results:\n")
for i in range(len(mean_scores)):
    print("Parameters:",params[i]," Mean Test Score:",mean_scores[i]," Std Test Score:",std_scores[i],"\n")

# choose parameters provide best model
print("\nBest Hyperparameters Choosen for model:", Grad_random.best_params_,"its cross val score",best_score,"\n")

#Gradient boosting results
print("Gradient boosting regression results for model:\n")
print("Train MSE:", train_mse,)
print("Train RMSE:", np.sqrt(train_mse))
print("Train R²:", train_r2,"\n")

print("Test MSE:", test_mse)
print("Test RMSE:", np.sqrt(test_mse))
print("Test R²:", test_r2,"\n")

Cross-Validation Results:

Parameters: {'n_estimators': 30, 'max_depth': 5, 'loss': 'squared_error', 'learning_rate': 1.0}  Mean Test Score: 0.8749682867047199  Std Test Score: 0.0010598886870301083 

Parameters: {'n_estimators': 10, 'max_depth': 4, 'loss': 'squared_error', 'learning_rate': 0.1}  Mean Test Score: 0.6951465862180681  Std Test Score: 0.0010031497501261644 

Parameters: {'n_estimators': 10, 'max_depth': 2, 'loss': 'absolute_error', 'learning_rate': 0.1}  Mean Test Score: 0.5861689722085606  Std Test Score: 0.001676495173164205 

Parameters: {'n_estimators': 10, 'max_depth': 4, 'loss': 'absolute_error', 'learning_rate': 0.1}  Mean Test Score: 0.6834224262730962  Std Test Score: 0.0019147283849872965 

Parameters: {'n_estimators': 10, 'max_depth': 2, 'loss': 'quantile', 'learning_rate': 0.1}  Mean Test Score: -0.504920975153677  Std Test Score: 0.03477082783479846 


Best Hyperparameters Choosen for model: {'n_estimators': 30, 'max_depth': 5, 'loss': 'squared_error', 'learn

---

# Write your report here

**Data preprocessing**

The first thing I did was load the data and check how it was structured. Two columns had to be transformed before the rest of the preprocessing could run: Formatted Date and Daily Summary. I split Formatted Date into several individual features so that its information was retained: year, month, day, hour, minute, second, and time zone. Adding these features left the label, Temperature (C), in the middle of the data frame rather than at the end, so I reordered the columns to put it last. For Daily Summary, I used a StringIndexer to encode the categories as numeric indices. Some columns contained only zeros (feature_17, minute, and seconds), so I dropped them because they carry no information.

I then replaced the null values in each column with that column's median. I chose the median because it is generally not affected by outliers. To scale the data, I used a VectorAssembler to combine the features into a single vector column and applied a StandardScaler to it. I chose the standard scaler over a min-max scaler because it gave better model performance when I tested both. After scaling, I split the scaled vector back into individual columns, which gave me the standardized data frame.

Next I removed outliers, which I defined as any value more than three standard deviations from the mean of its column. I chose three standard deviations because a smaller cutoff would have flagged too many data points as outliers.
I found that if I implemented any of my previous methods of calculating mean and standard deviation by column it would take too long so instead I simply used a function describe to get all the statistics for the data for each column in one go and simply collected the means and standard deviations of them. I also used broadcasting to calculate the lower and upper list of data which I think helped my code run faster. As the majority of the data was continuous and the target value was continuous I simply used anova test in the univariate feature selector where I decided to take 20 approximately the top 20 columns in the data. I took this much as if I had taken any more than the model training time would have risen exponentially and would have resulted in a much longer time model to work.
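The three-standard-deviation rule described above can be sketched in plain Python. This is only an illustration, not the PySpark code used in the notebook: the helper names `three_sigma_bounds` and `drop_outliers` and the toy `humidity` and `pressure` values are hypothetical. The idea it shows is the same, though: compute the mean and standard deviation of every column together, turn them into lower and upper bounds, and keep only the rows that fall inside all of the bounds.

```python
import statistics


def three_sigma_bounds(columns):
    """Map each column name to (mean - 3*std, mean + 3*std)."""
    bounds = {}
    for name, values in columns.items():
        mean = statistics.fmean(values)
        std = statistics.stdev(values)  # sample standard deviation
        bounds[name] = (mean - 3 * std, mean + 3 * std)
    return bounds


def drop_outliers(rows, bounds):
    """Keep rows whose value in every bounded column lies within its bounds."""
    return [
        row
        for row in rows
        if all(low <= row[name] <= high for name, (low, high) in bounds.items())
    ]


# Hypothetical toy data: 20 ordinary readings and one extreme humidity value.
rows = [{"humidity": 0.5, "pressure": 1000.0 + 10.0 * (i % 2)} for i in range(20)]
rows.append({"humidity": 100.0, "pressure": 1000.0})

# Gather every column's values once, then compute all bounds together.
columns = {name: [row[name] for row in rows] for name in rows[0]}
bounds = three_sigma_bounds(columns)
cleaned = drop_outliers(rows, bounds)
```

With very few rows a single extreme value inflates the standard deviation so much that it can never lie three standard deviations out, which is why the toy data uses 21 rows.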


**Model phase**

The first thing I realized was that, because of the enormous amount of data, I needed to shorten training time, so I used k-fold cross-validation with only 2 folds. When preparing the data I initially did not shuffle it, because some of the data is a time series and the order of the rows could matter. However, after checking the performance of my models with and without shuffling, I realized that shuffling was an essential step. I used five models in total: linear regression, random forest regressor, decision tree bagging, decision tree gradient boosting (referred to as ada grad in this report), and decision tree AdaBoost.

The primary metrics I use are R squared (R2), mean squared error (MSE), and root mean squared error (RMSE). I used R2 to measure how well my model's predictions explain the variance of the actual output. I used MSE as an estimate of the squared error of my model, and RMSE to put that error back into the units of the target (degrees Celsius). The main reason I use R2 and RMSE together is that the error alone tells me how accurately my model is performing but not whether it would generalize well, so I use both to come to my conclusion.

For hyperparameter tuning I used grid search CV for linear regression and randomized search CV for the rest of the models. The main reason was that many of the tree-based models took too long to run, so I used randomized search to try 5 random combinations of hyperparameters for each model. I chose the models with the highest mean cross-validation scores. I also checked the standard deviations of these scores; for most models they were quite low, indicating that the models were performing consistently across the folds.
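The idea behind trying only 5 random hyperparameter combinations can be sketched as follows. This is a simplified stand-in rather than the search code used in the notebook or scikit-learn's own implementation: the helper `sample_param_combinations` is hypothetical, and the grid values are assumptions chosen to match the parameter names printed in the output above. It enumerates every combination in a small grid and draws a fixed number of them at random without replacement.

```python
import itertools
import random


def sample_param_combinations(param_grid, n_iter, seed=0):
    """Draw up to n_iter distinct combinations from a grid of value lists."""
    names = sorted(param_grid)
    combos = [
        dict(zip(names, values))
        for values in itertools.product(*(param_grid[name] for name in names))
    ]
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return rng.sample(combos, min(n_iter, len(combos)))


# Assumed grid for illustration only: 2 * 2 * 2 * 1 = 8 combinations.
param_grid = {
    "n_estimators": [10, 30],
    "max_depth": [2, 5],
    "loss": ["squared_error", "quantile"],
    "learning_rate": [0.1],
}

# Only 5 of the 8 combinations would be trained and cross-validated.
candidates = sample_param_combinations(param_grid, n_iter=5)
```

The saving grows with the grid: a full grid search trains every combination on every fold, while the randomized version trains a fixed number regardless of grid size.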


**Model analysis**

To choose the best model, I looked for a high R squared together with a low mean squared error and root mean squared error on both the training and testing data. If a model had a very good train R squared and RMSE but significantly worse test R squared and RMSE, I considered it to be overfitting.
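As a rough illustration of this selection rule, the sketch below computes MSE, RMSE, and R squared from their definitions and flags a model when its train R squared exceeds its test R squared by more than a chosen margin. The function names and the `max_r2_drop=0.05` margin are assumptions made for the example; the report does not state a numeric cutoff.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R squared for paired true and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1 - ss_res / ss_tot}


def looks_overfit(train_metrics, test_metrics, max_r2_drop=0.05):
    """Flag a model whose R squared falls by more than max_r2_drop on test."""
    return train_metrics["r2"] - test_metrics["r2"] > max_r2_drop


# Hypothetical temperatures (degrees Celsius) and predictions.
demo = regression_metrics([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 3.0, 3.0])

# A small train/test gap is tolerated; a large one is flagged.
small_gap = looks_overfit({"r2": 0.92}, {"r2": 0.90})
large_gap = looks_overfit({"r2": 0.99}, {"r2": 0.80})
```

Because RMSE is the square root of MSE, it is expressed in the same units as the target, which makes it easier to compare against actual temperatures.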


For my first model, linear regression, the train RMSE and R2 were quite bad, at 4.875753649083607 RMSE and 0.7404392640170112 R2. This indicated to me that the underlying function for the overall distribution was nonlinear.


For my random forest model I got a fairly high R2 of 0.92 on train and 0.9 on test, which indicated that it was learning the data quite well. The RMSE was also decent, at around 2.58 for train and 2.9 for test. This small gap may indicate that the model is slightly overfitting. I believe this model performs so well because of the nonlinear nature of the error functions it is defined with. For this model I chose two error functions: squared error and Friedman MSE. I chose these two because they behave differently: squared error generally performs well on data that is less noisy and more centralized, while Friedman MSE is a more complex error function, and it turned out to be the better choice. I also tried the Poisson error function, but it did not work on my data, possibly because it requires non-negative target values and Celsius temperatures can be negative. Finally, I tried absolute_error, which was too computationally expensive because it took too long to run. As a result I ended up with these two error functions.




As my random forest worked so well, I decided to try more tree-based algorithms. The third model I used was bagging with a simple decision tree. Initially I wanted to use bagging with random forest, since random forest did very well, but I realized it would be too computationally expensive and take too long to run, so I ended up using bagging with a simple decision tree. The model performed remarkably well on the training data, with a train R2 of around 0.98 and a train RMSE of around 1. The test results were not as good, with an R2 of around 0.92 and an RMSE of around 2.6, although that RMSE is still quite low. The gap between train and test indicates some level of overfitting. Even so, the model still seems quite good, as its test R2 and RMSE are better than those of the random forest model, which may indicate that bagging is doing well despite the slight overfit.




The fourth model I trained was AdaBoost with decision trees, where multiple weak decision trees are combined into a final model. It has a very good train RMSE of 0.12 and R2 of 0.99. However, it is obvious that this model is overfitting, as its test performance is significantly worse, indicating that the model is simply memorizing the training data.


The fifth model, ada grad, did not overfit, but it also did not perform that well, with a train R2 of around 0.9 and an RMSE of 3.19. This is no better than the test performance of random forest: its train RMSE of 3.19 is higher than random forest's test RMSE of around 2.9. As a result, I ruled this model out.




**Model conclusion**

After looking through the model results I concluded that random forest was the best model. The main reason is that random forest has a good train RMSE and R2, which even if not the best are still quite good, along with a test RMSE and R2 that are similar to the train values. My next option was decision tree bagging, which had better R2 and RMSE on test but had a very high train R2, indicating that it was overfitting. The same criticism applies to decision tree AdaBoost, which had a train R2 of 0.99, again indicating overfitting. Ada grad does not overfit, but its performance is inferior to random forest. For these reasons I chose random forest as my final model.







**Insights and how I plan to use these insights in the future**

The main insight I gained through this homework is how time-consuming it can be to clean a very large data set column by column. Initially, when I tried removing outliers column by column, the computation took a very long time to finish, which led to a long wait for data preprocessing. What I learnt was that I had to collect these statistics more efficiently, using any type of function that leverages PySpark's nature and is able to compute them all at once.


I am going to use these insights in future projects: if a method I use takes far too long, I will try to refactor the program so that the computation runs much faster. In the future I will no longer calculate the mean and standard deviation column by column but will instead use a function like describe or aggregate to collect everything I need in one go. I also plan to use broadcasting throughout my code to make it more efficient, and to use randomized search when building models on large data sets, as it generally speeds up model development. I believe all of these steps will help me make better models when I work with a lot of data in future homeworks.


