# EDS232 Lab 1: Regression

## General Lab Template
1. Look at the big picture.
2. Get the data.
3. Explore and visualize the data to gain insights.
4. Prepare the data for machine learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.

## Overview
In this lab, we will introduce the basics of machine learning in **Python** by focusing on **regression**, a core technique used to predict continuous outcomes. We will use the popular **scikit-learn** library, which provides easy-to-use tools for building and evaluating machine learning models.
Specifically, we will focus on how regression algorithms can help us model and predict XXXXX data.

## Objectives
By the end of this lab, you will be able to:
- Understand the concept of regression and its implementation in Python
- Implement simple and multiple linear regression models
- Evaluate model performance using various metrics like **R²**, **MAE**, and **RMSE**
- Visualize regression prediction results 

## Key Concepts
- **Machine Learning**: A subset of artificial intelligence where algorithms learn patterns from data to make predictions or decisions without being explicitly programmed.
- **Regression**: A machine learning method for predicting continuous values.
  - **Simple Linear Regression**: A regression model with one independent variable.
  - **Multiple Linear Regression**: A regression model with two or more independent variables.
  
- **Scikit-learn**: A Python library that provides simple and efficient tools for data mining and machine learning. We will use it for:
  - **Data Preprocessing**: Preparing data for the model.
  - **Model Training**: Fitting the regression model to our data.
  - **Model Evaluation**: Assessing model performance with the evaluation metrics described below.

- **Model Evaluation Metrics**: Tools to assess how well our model fits the data, such as:
  - **R² (R-squared)**: Measures the proportion of variance in the dependent variable that is predictable from the independent variable(s).
  - **MAE (Mean Absolute Error)**: The average of the absolute differences between predicted and actual values.
  - **RMSE (Root Mean Square Error)**: The square root of the average squared differences between predicted and actual values.
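
The three metrics above can be computed by hand and checked against scikit-learn. The sketch below uses four made-up actual and predicted values (not lab data), and the variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical values, invented purely to illustrate the formulas
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_hat_demo = np.array([2.5, 5.5, 6.0, 10.0])

errors = y_true_demo - y_hat_demo  # [0.5, -0.5, 1.0, -1.0]

# MAE: average of the absolute errors
mae = np.mean(np.abs(errors))  # 0.75

# RMSE: square root of the average squared error
rmse = np.sqrt(np.mean(errors ** 2))  # sqrt(0.625), about 0.791

# R²: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2 = 1 - ss_res / ss_tot  # 0.875

# The same quantities from scikit-learn
sk_mae = mean_absolute_error(y_true_demo, y_hat_demo)
sk_rmse = np.sqrt(mean_squared_error(y_true_demo, y_hat_demo))
sk_r2 = r2_score(y_true_demo, y_hat_demo)

print(f"MAE: {mae:.3f}  RMSE: {rmse:.3f}  R²: {r2:.3f}")
```

MAE and RMSE are in the units of the target variable, and RMSE penalizes large errors more heavily than MAE. R² is unitless.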




In [None]:
# Load data 

In [None]:
# Explore and Visualize data 
# ex: scatterplot, correlation matrix
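
As a rough sketch of the exploration step, the example below builds a small synthetic table and prints summary statistics and a correlation matrix. The column names (`temperature`, `rainfall`, `growth`) and the data are assumptions made up for illustration. Substitute the lab's real dataset once it is loaded.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the columns are hypothetical, not the lab dataset
rng = np.random.default_rng(0)
n = 200
temperature = rng.normal(15, 5, n)
rainfall = rng.normal(50, 10, n)
growth = 2.0 * temperature + 0.5 * rainfall + rng.normal(0, 2, n)

df_demo = pd.DataFrame(
    {"temperature": temperature, "rainfall": rainfall, "growth": growth}
)

# Summary statistics: count, mean, std, min, quartiles, max
print(df_demo.describe())

# Pairwise Pearson correlations between all numeric columns
corr = df_demo.corr()
print(corr.round(2))
```

Features that correlate strongly with the target are natural candidates for a simple linear regression. Strong correlations between features are worth noting before fitting a multiple regression.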

In [None]:
# Prepare the Data for Machine Learning 
# Preprocessing: handling missing values, feature scaling, splitting data into training/test 
# train_test_split()
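
A minimal sketch of the split-then-scale pattern, assuming synthetic data in place of the lab dataset. The names `X_demo`, `y_demo`, and the 80/20 split are illustrative choices, not requirements of the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and target, for illustration only
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=100)

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the scaler on the training set only, then apply it to both sets,
# so no information from the test set leaks into preprocessing
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr.shape, X_te.shape)
```

Scaling does not change the predictions of ordinary least squares, but it makes coefficients comparable across features and matters for regularized models.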

In [None]:
# Select and Train Model
# Import and create a regression model using LinearRegression() from scikit-learn
# Train the model on the training data with .fit()

from sklearn.linear_model import LinearRegression

# Initialize and train the model
model = LinearRegression()
model.fit(X_train, y_train)

In [None]:
# Evaluate the Model
# Predict the target variable on the test set and calculate evaluation metrics using scikit-learn's mean_squared_error and r2_score
# Visualize predictions vs. actual values

import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R² Score: {r2}")

# Plot predictions vs actual
plt.scatter(y_test, y_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs Predicted')
plt.show()


In [None]:
# Fine-tune the Model
# Use cross_val_score() to check how consistently the model performs across folds of the training data

from sklearn.model_selection import cross_val_score

# Perform cross-validation
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
print(f"Cross-validation MSE per fold: {-scores}")
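
Per-fold MSE values are easier to interpret once converted to RMSE and summarized. The sketch below does this on synthetic data. `X_cv` and `y_cv` are hypothetical stand-ins for `X_train` and `y_train`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic linear data with noise of standard deviation 1 (illustration only)
rng = np.random.default_rng(1)
X_cv = rng.uniform(0, 10, size=(120, 1))
y_cv = 2.0 * X_cv[:, 0] + 1.0 + rng.normal(scale=1.0, size=120)

# scikit-learn scorers follow "higher is better", so MSE is returned negated
neg_mse = cross_val_score(
    LinearRegression(), X_cv, y_cv, cv=5, scoring="neg_mean_squared_error"
)

# Flip the sign, then take the square root to get RMSE in the target's units
rmse_per_fold = np.sqrt(-neg_mse)

print(f"RMSE per fold: {rmse_per_fold.round(3)}")
print(f"Mean RMSE: {rmse_per_fold.mean():.3f} (std {rmse_per_fold.std():.3f})")
```

A large spread across folds suggests the model's performance depends heavily on which rows it sees, which is worth investigating before trusting a single test-set score.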


## Present the Solution
The linear regression model provides reasonably accurate predictions of environmental data with an R² score of `x.xx` and a Mean Squared Error of `y.yy`. Future analysis could include...

In [None]:
# Transform features for polynomial regression 

from sklearn.preprocessing import PolynomialFeatures

# Transform features to include polynomial terms (degree 2 for quadratic terms)
poly = PolynomialFeatures(degree=2)
X_poly_train = poly.fit_transform(X_train)
X_poly_test = poly.transform(X_test)

# View the transformed feature set (for insight)
print(X_poly_train)

In [None]:
# Train the model on polynomial features 
poly_model = LinearRegression()
poly_model.fit(X_poly_train, y_train)

In [None]:
# Evaluate the polynomial regression model
# Make predictions using the polynomial model
y_poly_pred = poly_model.predict(X_poly_test)

# Evaluate the polynomial model
poly_mse = mean_squared_error(y_test, y_poly_pred)
poly_r2 = r2_score(y_test, y_poly_pred)

print(f"Polynomial Regression Mean Squared Error: {poly_mse}")
print(f"Polynomial Regression R² Score: {poly_r2}")

# Plot predictions vs actual
plt.scatter(y_test, y_poly_pred)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Polynomial Regression: Actual vs Predicted')
plt.show()

In [None]:
# Compare polynomial and linear regression results
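
One possible way to set up the comparison is to collect both models' test metrics side by side. The sketch below assumes synthetic data with a deliberately quadratic relationship, so the polynomial model is expected to fit better. That will not necessarily hold for the lab dataset. All names ending in `_q` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic trend (an assumption for illustration)
rng = np.random.default_rng(7)
X_q = rng.uniform(-3, 3, size=(200, 1))
y_q = 0.5 * X_q[:, 0] ** 2 + X_q[:, 0] + rng.normal(scale=0.3, size=200)

X_q_tr, X_q_te, y_q_tr, y_q_te = train_test_split(
    X_q, y_q, test_size=0.25, random_state=7
)

# Plain linear model
lin = LinearRegression().fit(X_q_tr, y_q_tr)
lin_pred = lin.predict(X_q_te)

# Linear model on degree-2 polynomial features
poly_q = PolynomialFeatures(degree=2, include_bias=False)
quad = LinearRegression().fit(poly_q.fit_transform(X_q_tr), y_q_tr)
quad_pred = quad.predict(poly_q.transform(X_q_te))

# Collect (RMSE, R²) on the same held-out test set for each model
results = {
    "linear": (np.sqrt(mean_squared_error(y_q_te, lin_pred)), r2_score(y_q_te, lin_pred)),
    "polynomial": (np.sqrt(mean_squared_error(y_q_te, quad_pred)), r2_score(y_q_te, quad_pred)),
}

for name, (rmse_val, r2_val) in results.items():
    print(f"{name:>10}: RMSE = {rmse_val:.3f}, R² = {r2_val:.3f}")
```

When comparing the two models on real data, a lower test RMSE for the polynomial model is only meaningful if it also holds up under cross-validation, since added polynomial terms can overfit.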