
**Comparative Performance Analysis for Predicting Sea Level Rise Using Linear Regression and Support Vector Machine**
Objective:
In this assignment, you will analyze sea level data to predict the rise in sea levels over time using linear regression and support vector machine (SVM). You will perform data cleaning, modeling, and visualize the results. Your predictions will be used to estimate future sea levels.

**Deliverable** Upload your completed code to Canvas by the due date.

Dataset:
The dataset contains historical data on sea level measurements from the CSIRO (Commonwealth Scientific and Industrial Research Organisation). The columns of the dataset are as follows:

-- Year: The year of the measurement.

-- CSIRO Adjusted Sea Level: The adjusted sea level measurement (in millimeters).

-- Lower Error Bound: The lower bound of the sea level measurement.

-- Upper Error Bound: The upper bound of the sea level measurement.

-- NOAA Adjusted Sea Level: The NOAA adjusted sea level (containing missing values).

### Tasks: Write the code for each of the following:

* Task 1: Data Exploration and Preprocessing

1.1
-- Load and Explore the Data:
  
1.2
-- Load the dataset into a Pandas DataFrame.

1.3
-- Display the first few rows of the dataset to understand its structure.

1.4
-- Identify and handle any missing data.

1.5
-- Describe the dataset and summarize the statistics:

1.6
-- Identify any potential outliers or anomalies in the data.


* Task 2: Focused Prediction from Year 2000 to Present
  
  2.1
-- Filter the data from the year 2000 to the most recent year available in the dataset.
  
2.2
-- Shuffle the dataset and split it into 70% train and 30% test.

2.3
-- Fit a linear regression and SVM model based on the 70% of the dataset (from 2000 to the most recent year). Use the SKLearn library.

2.4
-- Visualize the observed data and the fitted regression line for this range of years.

2.5
-- Display the values of all the weights (coefficients) obtained from Linear Regression and SVM.


* Task 3: Predict Sea Level in 2040 Using the SVM and Linear Regression:
  
3.1
-- Using the linear regression model and SVM (from the year 2000 onwards), predict the sea level rise using the 30% test.

3.2
-- Report the predicted sea levels from both the models.

* Task 4: Reflection and Analysis
  
4.1
-- Compare the predicted sea level for 2030 and 2040 from both the models.

4.2
-- Discuss how the different models might lead to different predictions and why this is the case.


* Task 5: Interpretation of Results:
  
5.1
-- Report a metric that you used to compare the performance of the Linear Regression and SVM. Which model performed the best?

5.2
-- Explain how the regression models are helping us understand the relationship between the year and the sea level rise.



In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

Task 1

1.1 and 1.2

In [None]:
df = pd.read_csv("epa-sea-level.csv")
df = df[["Year","CSIRO Adjusted Sea Level"]]

1.3

In [None]:
df.head()

1.4

In [None]:
# dropna() returns a new DataFrame, so assign the result back to keep the cleaned data
df = df.dropna()

1.5

In [None]:
df.describe()

In [None]:
ax = df["CSIRO Adjusted Sea Level"].plot.hist(bins=15,figsize=(10,5),alpha=0.5,color='#1A4D3B');

The histogram suggests the data are right-skewed, with considerable spread. There are no apparent outliers (see the box plot under 1.6).
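The skew read off the histogram can also be checked numerically. The snippet below is a minimal sketch on a small made-up sample (the values are hypothetical, not taken from the sea level data), using pandas' `Series.skew()`; a positive value points to a longer right tail.

```python
import pandas as pd

# Hypothetical sample with a long right tail (illustration only, not the sea level data)
sample = pd.Series([1, 1, 2, 2, 3, 3, 4, 10])

# Sample skewness: > 0 suggests right skew, < 0 suggests left skew
skewness = sample.skew()
print("Skewness:", round(skewness, 3))
```

The same call on `df["CSIRO Adjusted Sea Level"]` would give a number to compare against what the histogram shows.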

1.6

In [None]:
Q1 = df["CSIRO Adjusted Sea Level"].quantile(0.25)
Q3 = df["CSIRO Adjusted Sea Level"].quantile(0.75)
IQR = Q3 - Q1
outliers = ((df["CSIRO Adjusted Sea Level"] < (Q1 - 1.5 * IQR)) | (df["CSIRO Adjusted Sea Level"] > (Q3 + 1.5 * IQR)))
df_outliers = df["CSIRO Adjusted Sea Level"][outliers]
print(df_outliers)

In [None]:
sns.catplot(y = "CSIRO Adjusted Sea Level", kind="box", data=df);


There don't appear to be any outliers: our outlier code returns an empty Series, and the box plot shows no points beyond the whiskers (1.5 × IQR from the quartiles).

Task 2

2.1

In [None]:
df = df[df["Year"]>=2000]

2.2

In [None]:
df_shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
train_df, test_df = train_test_split(df_shuffled, test_size=0.3, random_state=42)

X_train = train_df[["Year"]].values
y_train = train_df["CSIRO Adjusted Sea Level"].values
X_test = test_df[["Year"]].values
y_test = test_df["CSIRO Adjusted Sea Level"].values

2.3

In [None]:
lin_model = LinearRegression()
lin_model.fit(X_train, y_train)

svm_model = SVR(kernel = 'linear')
svm_model.fit(X_train, y_train)

2.4

In [None]:
lin_pred_plot = lin_model.predict(X_train)
svm_pred_plot = svm_model.predict(X_train)

plt.figure(figsize=(10, 6))
plt.scatter(df["Year"], df["CSIRO Adjusted Sea Level"], color='lightblue', label="Observed Data")
plt.plot(X_train, lin_pred_plot, color='red', label="Linear Regression")
plt.plot(X_train, svm_pred_plot, color='green', linestyle='--', label="SVM Regression")
plt.xlabel("Year")
plt.ylabel("CSIRO Adjusted Sea Level (mm)")
plt.title("Sea Level Prediction (2000–Present): Linear vs SVM")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

#This plot uses the train dataset; the plot for the test dataset (the actual answer) is below.

2.5

In [None]:
lin_pred_plot = lin_model.predict(X_test)
svm_pred_plot = svm_model.predict(X_test)

plt.figure(figsize=(10, 6))
plt.scatter(df["Year"], df["CSIRO Adjusted Sea Level"], color='lightblue', label="Observed Data")
plt.plot(X_test, lin_pred_plot, color='red', label="Linear Regression")
plt.plot(X_test, svm_pred_plot, color='green', linestyle='--', label="SVM Regression")
plt.xlabel("Year")
plt.ylabel("CSIRO Adjusted Sea Level (mm)")
plt.title("Sea Level Prediction (2000–Present): Linear vs SVM")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
print("Linear Regression Coefficients:")
print("Intercept", lin_model.intercept_)
print("Coefficient", lin_model.coef_)

print("SVM Coefficients:")
print("Intercept", svm_model.intercept_)
print("Slope", svm_model.coef_)

Task 3

3.1 

In [None]:
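year_2040 = np.array([[2040]])
year_max = np.array([[X_test.max()]])
# Predicted rise between the latest year in the test set and 2040, for each model
print("Linear rise to 2040:", lin_model.predict(year_2040) - lin_model.predict(year_max))
print("SVM rise to 2040:", svm_model.predict(year_2040) - svm_model.predict(year_max))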

3.2

In [None]:
print(lin_model.predict(year_2040))
print(svm_model.predict(year_2040))

Task 4

4.1

In [None]:
year_2030 = np.array([[2030]])
print("2030 lin model", lin_model.predict(year_2030))
print("2030 svm model", svm_model.predict(year_2030))
print("2040 lin model", lin_model.predict(year_2040))
print("2040 svm model", svm_model.predict(year_2040))

4.2

The two models can give different predictions because they minimize different errors. The linear model minimizes the sum of squared errors over all data points, while our SVM model only penalizes errors that fall outside of its margin lines. In other words, the linear model counts the error from every point, while the SVM ignores errors inside the margin, so the fitted lines, and therefore the predictions, can be slightly different.
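To make that contrast concrete, here is a minimal sketch of the two loss functions applied to a few made-up residuals. The residual values are hypothetical, and the helper names `squared_loss` and `epsilon_insensitive_loss` are ours for illustration; `epsilon=0.1` is assumed because it is the default for scikit-learn's `SVR`. The sketch leaves out the regularization term that SVR also includes.

```python
import numpy as np

def squared_loss(residuals):
    # Every residual contributes, and large residuals are penalized quadratically
    return residuals ** 2

def epsilon_insensitive_loss(residuals, epsilon=0.1):
    # Residuals inside the margin (|r| <= epsilon) cost nothing;
    # beyond it, the cost grows linearly with the distance from the margin
    return np.maximum(0.0, np.abs(residuals) - epsilon)

# Hypothetical residuals (observed minus predicted), for illustration only
residuals = np.array([-0.5, -0.05, 0.0, 0.08, 0.3])

sq = squared_loss(residuals)
eps = epsilon_insensitive_loss(residuals, epsilon=0.1)

for r, s, e in zip(residuals, sq, eps):
    print(f"residual={r:+.2f}  squared={s:.4f}  epsilon-insensitive={e:.4f}")
```

The three residuals smaller than 0.1 in absolute value get zero epsilon-insensitive loss but a nonzero squared loss, which is why the two fits need not agree.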

Task 5

5.1 

In [None]:
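# Evaluate both models on the held-out 30% test set
y_pred_lin = lin_model.predict(X_test)
y_pred_svm = svm_model.predict(X_test)
mse_lin = mean_squared_error(y_test, y_pred_lin)
mse_svm = mean_squared_error(y_test, y_pred_svm)
print("Linear MSE", mse_lin)
print("SVM MSE", mse_svm)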

Using mean squared error (MSE) on the test set as our metric, we see that the SVM performed slightly worse than our linear model, so the linear model performed best.
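MSE is in squared units, so it can help to look at RMSE (same units as the sea level) and R² next to it. The snippet below is a minimal sketch on made-up numbers: `y_true`, `y_pred_a`, and `y_pred_b` are hypothetical values, not results from our models, and `report` is a helper we introduce only for this illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical observations and predictions (illustration only)
y_true = np.array([7.0, 7.5, 8.0, 8.5])
y_pred_a = np.array([7.1, 7.4, 8.1, 8.4])   # every prediction off by 0.1
y_pred_b = np.array([7.3, 7.2, 8.3, 8.2])   # every prediction off by 0.3

def report(name, y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    rmse = float(np.sqrt(mse))          # back in the units of the target
    r2 = r2_score(y_true, y_pred)       # 1.0 would be a perfect fit
    print(f"{name}: MSE={mse:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")
    return mse, rmse, r2

mse_a, rmse_a, r2_a = report("model A", y_true, y_pred_a)
mse_b, rmse_b, r2_b = report("model B", y_true, y_pred_b)
```

A lower MSE or RMSE and a higher R² all point the same way here; on real data they usually agree on the ranking, but RMSE is the easier one to read as a typical error size.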

5.2

The regression models take our data points and fit a line that minimizes the amount of error, which allows us to predict the next data point as accurately as possible. The coefficients we get out tell us our baseline value (the intercept) and, most importantly, how much our output (in this case sea level) changes for a 1 unit increase (in this case a 1 year increase) in our explanatory variable. How well the line fits the data also helps us judge whether a linear relationship is reasonable and how strong it is. Here, we see that a 1 year increase is associated with a rise in sea level of roughly 0.16 inches for both models.
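The "change per 1 year increase" reading of the slope can be checked directly: the difference between predictions for two consecutive years equals the coefficient. Below is a minimal sketch on made-up data that lies exactly on a line with slope 0.16 per year; the intercept of 7.0 and the year range are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data on an exact line: 0.16 per year, starting at 7.0 in 2000
years = np.arange(2000, 2014).reshape(-1, 1)
levels = 0.16 * (years.ravel() - 2000) + 7.0

model = LinearRegression().fit(years, levels)

slope = model.coef_[0]
# Difference between predictions one year apart
step = model.predict([[2021]])[0] - model.predict([[2020]])[0]

print("Fitted slope:", round(slope, 4))
print("Prediction(2021) - Prediction(2020):", round(step, 4))
```

The same check works on `lin_model` from Task 2; for a straight-line model the one-year step is the same no matter which pair of consecutive years is used.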