# Exploratory Data Analysis: US Transportation
## Authors: Yasmine Thandi, Kyle Truong, Bin Xu
**Original Dataset Source: Monthly Transportation Statistics (Updated 2024). Kaggle Data Science Platform. https://www.kaggle.com/datasets/utkarshx27/monthly-transportation-statistics/data**

**Modified Dataset: https://raw.githubusercontent.com/HenryCROSS/eecs3401_final_project/main/data/Monthly_Transportation_Statistics.csv**

## Transportation Dataset Description
We removed all records prior to 1967 from the original dataset, because the Bureau of Transportation Statistics recorded too little data for those years.

Most of the attributes in the dataset are not needed for the task we want to focus on, so we reduced the 136 unique attributes to the 26 that we thought would be useful for our model.
### Attributes Used:
1. **Date** - The date the data was recorded (Typically the first day of each month at 12:00AM)
1. **Transit Ridership - Other Transit Modes - Adjusted** - Total number of riders on other transit modes.
1. **Transit Ridership - Fixed Route Bus - Adjusted** - Total number of riders on any bus routes.
1. **Transit Ridership - Urban Rail - Adjusted** - Total number of riders on any methods of urban rail (i.e. Subway, Local Trains, etc.)
1. **Freight Rail Intermodal Units** - Number of freight cars used per month.
1. **Freight Rail Carloads** - Number of freight cars with cargo loaded per month.
1. **Highway Vehicle Miles Traveled - All Systems** - Total combined miles travelled on a highway.
1. **Highway Fuel Price - Regular Gasoline** - Price of regular gasoline per gallon.
1. **Highway Fuel Price - On-highway Diesel** - Price of diesel per gallon.
1. **Personal Spending on Transportation - Transportation Services - Seasonally Adjusted** - Average monthly spending on transportation services.
1. **Personal Spending on Transportation - Gasoline and Other Energy Goods - Seasonally Adjusted** - Average monthly spending on gasoline, diesel or electricity.
1. **Personal Spending on Transportation - Motor Vehicles and Parts - Seasonally Adjusted** - Average monthly spending on autoshops, repair parts and services.
1. **Passenger Rail Passengers** - Number of passengers who use the trains every month
1. **Transportation Services Index - Freight** - Month to month performance output measure of freight services
1. **Transportation Services Index - Passenger** - Month to month performance output measure of passenger services
1. **Real Gross Domestic Product - Seasonally Adjusted** - Monetary value of all transportation services
1. **U.S.-Canada Incoming Person Crossings** - Number of people entering the United States from Canada
1. **U.S.-Canada Incoming Truck Crossings** - Number of trucks entering the United States from Canada
1. **U.S.-Mexico Incoming Person Crossings** - Number of people entering the United States from Mexico
1. **U.S.-Mexico Incoming Truck Crossings** - Number of trucks entering the United States from Mexico
1. **U.S. Airline Traffic - Domestic - Non Seasonally Adjusted** - Amount of airline traffic travelling within the United States
1. **U.S. Airline Traffic - Total - Non Seasonally Adjusted** - Amount of airline traffic travelling collectively involving the United States
1. **U.S. Airline Traffic - International - Non Seasonally Adjusted** - Amount of airline traffic travelling in and out of the United States
1. **Transborder - Total North American Freight** - Total freight travelled across North America
1. **Transborder - U.S. - Mexico Freight** - Total freight travelled across the US-Mexico border into the United States
1. **Transborder - U.S. - Canada Freight** - Total freight travelled across the US-Canada border into the United States





# 1- Look at the big picture

### Frame the problem
1. Supervised learning.
2. A regression task – predict a value.
3. Batch learning 
    - Small data set
    - No need to adapt continuously to incoming data, because the last record is from December 2023

### Look at the big picture
Predictions will inform transportation operators in the US about future transportation metrics, using historical data on border crossings, ridership, freight values, prices and revenue. We will predict the future cost of transportation and the future size of ridership, which helps operators allocate resources and anticipate demand for their services. By understanding the relationship between demand and revenue in the dataset, we aim to provide a suitable budget that operators can use as a reference when optimizing pricing strategies for transportation services.

In [None]:
# Import libraries
# Any missing library can be installed with pip, e.g. pip install numpy

import sklearn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 2- Load the data

In [None]:
url = "https://raw.githubusercontent.com/HenryCROSS/eecs3401_final_project/main/data/Monthly_Transportation_Statistics.csv"
data = pd.read_csv(url, sep=',')
data_bak = data.copy()  # independent backup of the raw data, not an alias

In [None]:
data

## Task 2.1
At this point we reduce the number of attributes in our dataset, which keeps the rest of the workflow simpler. As mentioned above, these attributes were selected because we believe they will be useful to our algorithm.

In [None]:
tokeep = ["Date", "Transit Ridership - Other Transit Modes - Adjusted", "Transit Ridership - Fixed Route Bus - Adjusted", "Transit Ridership - Urban Rail - Adjusted", "Freight Rail Intermodal Units", "Freight Rail Carloads",  "State and Local Government Construction Spending - Total", "Highway Fuel Price - Regular Gasoline", 
          "Highway Fuel Price - On-highway Diesel", "Personal Spending on Transportation - Transportation Services - Seasonally Adjusted", "Personal Spending on Transportation - Gasoline and Other Energy Goods - Seasonally Adjusted", "Personal Spending on Transportation - Motor Vehicles and Parts - Seasonally Adjusted",
          "Passenger Rail Passengers", "Transportation Services Index - Freight", "Transportation Services Index - Passenger", "Real Gross Domestic Product - Seasonally Adjusted", "U.S.-Canada Incoming Person Crossings", "U.S.-Canada Incoming Truck Crossings", "U.S.-Mexico Incoming Person Crossings", 
          "U.S.-Mexico Incoming Truck Crossings", "U.S. Airline Traffic - Domestic - Non Seasonally Adjusted", "U.S. Airline Traffic - Total - Non Seasonally Adjusted", "U.S. Airline Traffic - International - Non Seasonally Adjusted", "Transborder - Total North American Freight", "Transborder - U.S. - Mexico Freight","Transborder - U.S. - Canada Freight"]
data = data[tokeep].copy()  # copy so that dropna does not act on a slice of the full frame
data.dropna(axis=0, inplace=True)

In [None]:
data

Here, we call the info() method to get a summary of our data, including the number of non-null values in each column. We determined that it would be best to remove Highway Vehicle Miles Traveled - All Systems, since it has only 50 non-null values. This helps us scale our data from 2017 - 2022.

In [None]:
data.info()
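
As a sketch of how the non-null counts reported by info() can drive column selection, the snippet below counts non-null values per column on a small made-up frame and lists the columns that fall under a cutoff. The frame, its column names and the `min_non_null` cutoff are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the transportation data (values are made up)
toy = pd.DataFrame({
    "dense": [1.0, 2.0, 3.0, 4.0],
    "sparse": [np.nan, np.nan, np.nan, 7.0],
    "half": [1.0, np.nan, 3.0, np.nan],
})

# Same counts that info() prints in its "Non-Null Count" column
non_null = toy.notna().sum()

# Hypothetical cutoff: keep columns with at least this many recorded values
min_non_null = 2
too_sparse = non_null[non_null < min_non_null].index.tolist()

reduced = toy.drop(columns=too_sparse)
print(non_null.to_dict())
print(too_sparse)
```

The same pattern should apply to the full frame by replacing `toy` with `data` and choosing a cutoff that suits the analysis.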

# 3. Explore and visualize the data to gain insights.

We have decided to create custom heading titles to make the following graphs in our data visualization more readable. Here is the key to the following acronyms of the attribute names:

- TR = Transit Ridership, FR = Freight Rail, H = Highway, V = Vehicle, F = Fuel, P = Price, PST = Personal Spending on Transportation, TSI = Transportation Services Index, Pass = Passenger, PC = Person Crossings, TC = Truck Crossings, Mex = Mexico, AT = Airline Traffic, TB = Transborder
- AS = All Systems, A = Adjusted, SA = Seasonally Adjusted, NSA = Non Seasonally Adjusted

In [None]:
# Custom headers
custom_headers = [
    "Date",
    "TROtherTransit(A)",
    "TRFixedRouteBus(A)",
    "TRUrbanRail(A)",
    "FRIntermodalUnits",
    "FRCarloads",
    "HFP-RegGas",
    "HFP-HDiesel",
    "PST-TransServ(SA)",
    "PST-EnergyGoods(SA)",
    "PST-AutoServ(SA)",
    "PassRailPass",
    "TSI-Freight",
    "TSI-Pass",
    "RealGDP(SA)",
    "US-CA PC",
    "US-CA TC",
    "US-Mex PC",
    "US-Mex TC",
    "US AT - Domestic(NSA)",
    "US AT - Total(NSA)",
    "US AT - International(NSA)",
    "TB - TotalFreight",
    "TB - US-MexFreight",
    "TB - US-CAFreight"
]

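# Load the selected columns and rename them by name. read_csv returns usecols
# in file order, so a positional assignment to data.columns could mislabel them.
# tokeep contains one column that has no custom header; it is left out here.
unlabeled = "State and Local Government Construction Spending - Total"
header_map = dict(zip([c for c in tokeep if c != unlabeled], custom_headers))
data = pd.read_csv(url, usecols=list(header_map)).rename(columns=header_map)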
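
One detail to keep in mind when relabelling columns: `pd.read_csv` ignores the order of the `usecols` list and returns the columns in the order they appear in the file. The sketch below shows this on a tiny in-memory CSV with made-up column names, and shows that renaming with a dictionary does not depend on column order.

```python
from io import StringIO

import pandas as pd

# Made-up CSV whose file order is c_first, b_second, a_third
csv_text = "c_first,b_second,a_third\n1,2,3\n"

# usecols selects columns, but the result keeps the file's column order
wanted = ["a_third", "c_first"]
df = pd.read_csv(StringIO(csv_text), usecols=wanted)
file_order = df.columns.tolist()
print(file_order)

# Renaming by name is independent of the column order
short_names = {"a_third": "A", "c_first": "C"}
df = df.rename(columns=short_names)
print(df.columns.tolist())
```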


Here we use the describe() method to see a summary of the numerical attributes. This way we can clean our data more efficiently by replacing missing data with median/average values.

In [None]:
data.describe()
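
As a minimal sketch of the median replacement mentioned above, the snippet below fills missing values with the column median using scikit-learn's `SimpleImputer`. The toy frame and its values are assumptions for illustration only. The median is the "50%" row of describe().

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up values with one gap per column
toy = pd.DataFrame({
    "price": [2.0, np.nan, 4.0, 100.0],
    "riders": [10.0, 20.0, np.nan, 30.0],
})

# describe() reports the median as the "50%" row
medians = toy.describe().loc["50%"]

# Replace each missing value with the median of its column
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

The median is less sensitive than the mean to extreme values such as the 100.0 in the `price` column.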

## 3.1 Plot a histogram of the data using hist()

In [None]:
data.hist(bins=50, figsize=(18, 16))

plt.show()

## Histogram on combined target values

We will create a histogram of our target value, transit ridership, by summing TROtherTransit(A), TRFixedRouteBus(A) and TRUrbanRail(A). We use this graph to decide which method of feature scaling to use. Looking at the graph, the data seems to be skewed to the left. However, before deciding whether to use the min-max scaler, we want to check whether the data is normally distributed.

In [None]:
data['TRTotal'] = data['TROtherTransit(A)']+data['TRFixedRouteBus(A)']+data['TRUrbanRail(A)']

data['TRTotal'].hist(bins=50, figsize=(14, 10))

plt.show()
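
A histogram gives a visual impression of skew, and the sample skewness puts a number on it. The sketch below uses two made-up series to show the sign convention of pandas' `Series.skew()`: negative values indicate a longer left tail, positive values a longer right tail.

```python
import pandas as pd

right_tail = pd.Series([1, 1, 1, 2, 2, 3, 10])  # a few large values
left_tail = -right_tail                          # mirror image of the above

print(right_tail.skew() > 0)  # longer right tail: positive skewness
print(left_tail.skew() < 0)   # longer left tail: negative skewness
```

Calling `data['TRTotal'].skew()` would give the corresponding figure for the combined ridership column.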

In [None]:
# Data obtained from data['TRTotal'], without the months that have no ridership record
TotalTRData = data['TRTotal'].dropna()

# Convert the data to a numpy array and sort the array
TotalTRData = np.array(TotalTRData)

TotalTRData_sorted = np.sort(TotalTRData)

# Create sample data and sort it
new_data = np.random.normal(loc=0, scale=1, size=len(TotalTRData))

new_data_sorted = np.sort(new_data)

We will create a Q-Q plot to determine whether the data is normally distributed. The points below follow the line only loosely, so we will try both feature scaling methods (min-max and a standard scaler) to see which works best in our pipeline.

In [None]:
# Plot the Q-Q plot
plt.figure(figsize=(6, 6))
plt.scatter(new_data_sorted, TotalTRData_sorted)
plt.plot([np.min(new_data_sorted), np.max(new_data_sorted)],
         [np.min(TotalTRData_sorted), np.max(TotalTRData_sorted)], color='red')
plt.title('Q-Q Plot')
plt.xlabel('Ordered Values (new_data)')
plt.ylabel('Ordered Values (TotalTRData)')
plt.grid(True)

plt.show()
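
The plot above compares the data against one random normal sample, so it changes slightly on every run. A variant, sketched here under the assumption that a deterministic reference is preferred, uses the theoretical quantiles of the standard normal distribution instead. The helper name `normal_qq` and the synthetic sample are illustrative and not part of the original notebook.

```python
from statistics import NormalDist

import numpy as np


def normal_qq(values):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    observed = np.sort(np.asarray(values, dtype=float))
    n = len(observed)
    # Plotting positions (i - 0.5) / n avoid the infinite quantiles at 0 and 1
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, observed


# Synthetic stand-in for the ridership totals
rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=200)
theo, obs = normal_qq(sample)

# A correlation close to 1 means the points lie close to a straight line
r = np.corrcoef(theo, obs)[0, 1]
print(round(r, 3))
```

The two arrays can be passed to `plt.scatter(theo, obs)` in place of the sorted random sample. Missing values should be dropped before calling the helper, since `np.sort` places NaN at the end of the array.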

## 3.2 Plot a Box Plot of Highway Fuel Prices using boxplot()

In [None]:
# Boxplot of Highway Fuel Prices 
# Compares the distribution of regular gasoline against highway diesel prices so that operators can make informed decisions regarding budgeting 
# and pricing on transportation services after examining the interquartile range (the spread of the prices)
plt.figure(figsize=(12, 8))  
sns.boxplot(data=data[['HFP-RegGas', 'HFP-HDiesel']])
plt.title('Distribution of Highway Fuel Prices')
plt.ylabel('Price (per gallon)')
plt.xlabel('Fuel Type')
plt.xticks(ticks=[0, 1], labels=['Regular Gasoline', 'On-highway Diesel'])

plt.show()

## 3.3 Plot a histogram of Personal Spending on Transportation using histplot()

In [None]:
# Histogram of Personal Spending on Transportation
# Examines personal spending on transportation services, motor vehicles and gas so that operators can make informed decisions regarding 
# budgeting and pricing on transportation and adjust prices according to individual and customer expenditures

plt.figure(figsize=(10, 7))
sns.histplot(data=data[['PST-TransServ(SA)',
                        'PST-EnergyGoods(SA)',
                        'PST-AutoServ(SA)']], bins=20, kde=True)
plt.title('Distribution of Personal Spending on Transportation')
plt.xlabel('Spending (USD)')
plt.ylabel('Frequency')
plt.legend(['Transportation Services', 'Gasoline and Energy Goods', 'Motor Vehicles and Parts'])

plt.show()


## 3.4 Plot a Time Series Analysis of Transit Ridership 

In [None]:
# Plot a Time Series Analysis of Transit Ridership: plots the trend of transit ridership over time

plt.figure(figsize=(12, 6))
sns.lineplot(x='Date', y='TRFixedRouteBus(A)', data=data)
sns.lineplot(x='Date', y='TRUrbanRail(A)', data=data)
plt.title('Transit Ridership Over Time')
plt.xlabel('Date')
plt.ylabel('Ridership')
plt.legend(['Fixed Route Bus', 'Urban Rail'])
plt.xticks(range(660,len(data), 12), rotation=90)

plt.show()
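
The tick positions above are computed from row indices. If the Date column is parsed into datetimes, rows can be selected by date and the plotting libraries can place date ticks themselves. A minimal sketch, assuming the dates are stored as strings such as `01/01/2017 12:00:00 AM` (the format string is an assumption about the CSV and should be checked against the actual file):

```python
import pandas as pd

# Two made-up rows in the assumed date format
raw = pd.DataFrame({
    "Date": ["01/01/2017 12:00:00 AM", "02/01/2017 12:00:00 AM"],
    "riders": [100, 110],
})

# Parse the strings into datetimes using the assumed format
raw["Date"] = pd.to_datetime(raw["Date"], format="%m/%d/%Y %I:%M:%S %p")

# With a datetime column, rows can be filtered by date instead of by position
recent = raw[raw["Date"] >= "2017-02-01"]
print(raw["Date"].dt.month.tolist())
print(len(recent))
```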

## 3.5 Plot a Heat Map Using heatmap() to Compare GDP with Different Transportation Metrics

In [None]:
# Heatmap for GDP vs. Transportation Metrics 
# Helps determine which transportation metrics are strongly correlated with the GDP and helps operators make adjustments to transportation service plans
selected_columns = ['RealGDP(SA)',
                   'TB - TotalFreight',
                   'US AT - Total(NSA)',
                   'TSI-Freight']

correlation_matrix = data[selected_columns].corr()

plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap: GDP vs. Transportation Metrics')

plt.show()


## Look for correlations between the features

## 3.6 Check for correlation between attributes using sns.pairplot.

In [None]:
#Check for correlation between attributes using sns.pairplot.
#sns.pairplot(data)

## 3.7 Look for correlations using the Pearson correlation coefficient

In [None]:
corr_matrix = data.corr(numeric_only=True)
corr_matrix

Let's look at correlations with regard to our targets...

In [None]:
corr_matrix["TROtherTransit(A)"].sort_values(ascending=False)

In [None]:
corr_matrix["TRFixedRouteBus(A)"].sort_values(ascending=False)

In [None]:
corr_matrix["TRUrbanRail(A)"].sort_values(ascending=False)

# 4. Prepare the data for Machine Learning Algorithms

## 4.1 Remove duplicate rows.

In [None]:
print(data.duplicated().sum())  # number of duplicate rows before removal
data.drop_duplicates(inplace=True)

## 4.2 Handle the missing values

In [None]:
data.isna().sum()

In [None]:
data

Drop the Date column, since it is not needed for training.

In [None]:
data.drop(labels=['Date'], axis=1, inplace=True)

In [None]:
data = data.dropna(axis='rows', thresh=int(0.05 * data.shape[1]))
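
In `dropna`, `thresh` is the minimum number of non-missing values a row must have to be kept. The sketch below, on a made-up three-column frame, shows the effect of two thresholds and evaluates the expression used above for an assumed 25 columns.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, np.nan, np.nan],
})

# Keep rows with at least 1 recorded value: only the all-NaN row is dropped
kept_1 = toy.dropna(axis="rows", thresh=1)

# Keep rows with at least 2 recorded values: only the complete row survives
kept_2 = toy.dropna(axis="rows", thresh=2)

# For an assumed 25 columns, 5% of the column count truncates to 1
thresh_25 = int(0.05 * 25)
print(kept_1.index.tolist(), kept_2.index.tolist(), thresh_25)
```

With a threshold of 1, only rows that are entirely empty are removed. All other gaps are left for the imputer in the pipeline.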

In [None]:
data

## 4.3 Create a pipeline

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

### Pipeline with StandardScaler, replacing missing values with 0

In [None]:
num_cols = data.select_dtypes(include='number').columns.to_list()

num_pipeline = make_pipeline(SimpleImputer(strategy='constant', fill_value=0), StandardScaler())

preprocessing = ColumnTransformer([('num', num_pipeline, num_cols)],
                                    remainder='passthrough')

preprocessing

In [None]:
data_prepared = preprocessing.fit_transform(data)

feature_names=preprocessing.get_feature_names_out()
data_prepared = pd.DataFrame(data=data_prepared, columns=feature_names)
data_prepared.shape
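
The cells above fit the imputer and the scaler on the full dataset. A common alternative, sketched here on made-up data, is to put the preprocessing and the model into one pipeline and fit it on the training rows only, so that the test rows do not influence the scaling statistics. The synthetic `X`, `y` and the coefficients are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up predictors and a target that depends linearly on them
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The imputer and scaler are fitted inside fit(), on the training rows only
model = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0),
    StandardScaler(),
    LinearRegression(),
)
model.fit(X_train, y_train)

# The scaler's statistics come from the training split alone
scaler = model.named_steps["standardscaler"]
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))
print(model.score(X_test, y_test))
```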

# 5. Select a model and train it

#### split data

In [None]:
from sklearn.model_selection import train_test_split

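# The transit ridership columns are the prediction targets
targets = ["num__TROtherTransit(A)", "num__TRFixedRouteBus(A)", "num__TRUrbanRail(A)"]

# TRTotal is the sum of the three targets, so it must not be used as a predictor
X = data_prepared.drop(columns=targets + ["num__TRTotal"], errors="ignore")
y = data_prepared[targets]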

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

#### Train a Linear Regression model without any regularization 

In [None]:
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()

lr_model.fit(X_train,y_train)

In [None]:
lr_y_predict = lr_model.predict(X_test)

from sklearn.metrics import mean_squared_error as mse
lr_mse=mse(y_test, lr_y_predict)
lr_mse
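
When there are several target columns, `mean_squared_error` averages the per-column errors into a single number by default. The sketch below, on made-up values, shows how `multioutput="raw_values"` reports one MSE per target instead, which can make it easier to see which target is predicted poorly.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values for two targets
y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
y_pred = np.array([[1.0, 12.0], [2.0, 18.0], [4.0, 30.0]])

# Default: one number, the average of the per-target errors
overall = mean_squared_error(y_true, y_pred)

# One MSE per target column
per_target = mean_squared_error(y_true, y_pred, multioutput="raw_values")
print(overall, per_target)
```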

In [None]:
# Scatter plot of the predicted values against the actual test-set values
plt.scatter(lr_y_predict, y_test)
plt.xlabel("TR Predicted Values")
plt.ylabel("TR Actual Values")
plt.title("Predicted vs. Actual Values")
plt.show()

#### Train a Linear Regression model using Ridge regularization, with alpha selected by grid search

In [None]:
from sklearn.linear_model import Ridge

RidgeRegression = Ridge()
param_grid = {'alpha':[0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 500]}
RidgeRegression = GridSearchCV(RidgeRegression, param_grid)
ridge_model = RidgeRegression.fit(X_train, y_train)
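
GridSearchCV scores regressors with R² by default. The sketch below, on synthetic data, runs the same alpha grid with MSE as the selection criterion and reads back the chosen alpha. The data, the `cv=5` setting and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Made-up predictors and target
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.2, size=60)

param_grid = {"alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 500]}

# scikit-learn maximizes scores, so MSE is exposed as its negative
search = GridSearchCV(Ridge(), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)

print(search.best_params_)   # alpha with the lowest cross-validated MSE
print(-search.best_score_)   # that MSE, with the sign flipped back
```

After fitting, `search.best_estimator_` holds a Ridge model refitted on all of the data passed to `fit` with the selected alpha.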


In [None]:
Ridge_y_predict = ridge_model.predict(X_test)
ridge_mse = mse(y_test, Ridge_y_predict)

print(f'Ridge Regression MSE: {ridge_mse}')

In [None]:
Ridge_y_predict = ridge_model.predict(X_test)
plt.scatter(Ridge_y_predict, y_test)  # y is your actual target values
plt.xlabel("TR Predicted Values")
plt.ylabel("TR Actual Values")
plt.title("Predicted vs. Actual Values")
plt.show()

#### Train a Linear Regression model using Lasso regularization, with alpha selected by grid search

In [None]:
from sklearn.linear_model import Lasso

LassoRegression = Lasso()
param_grid = {'alpha':[0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 500]}
LassoRegression = GridSearchCV(LassoRegression, param_grid)
lasso_model = LassoRegression.fit(X_train, y_train)

In [None]:
Lasso_y_predict = lasso_model.predict(X_test)
lasso_mse=mse(y_test, Lasso_y_predict)

print(f'Lasso Regression MSE: {lasso_mse}')

In [None]:
Lasso_y_predict = lasso_model.predict(X_test)
plt.scatter(Lasso_y_predict, y_test)  # y is your actual target values
plt.xlabel("TR Predicted Values")
plt.ylabel("TR Actual Values")
plt.title("Predicted vs. Actual Values")
plt.show()

#### Decision Tree model

In [None]:
from sklearn.tree import DecisionTreeRegressor

decision_tree = DecisionTreeRegressor(random_state=42)
decision_tree.fit(X_train, y_train)
dt_y_pred = decision_tree.predict(X_test)

from sklearn.metrics import mean_squared_error as mse
dt_mse=mse(y_test, dt_y_pred)
dt_mse

In [None]:
plt.scatter(dt_y_pred, y_test)  # y is your actual target values
plt.xlabel("TR Predicted Values")
plt.ylabel("TR Actual Values")
plt.title("Predicted vs. Actual Values")
plt.show()

### Pipeline with MinMaxScaler, replacing missing values with 0

In [None]:
from sklearn.preprocessing import MinMaxScaler

num_cols = data.select_dtypes(include='number').columns.to_list()

#create pipelines for numeric columns
num_pipeline = make_pipeline(SimpleImputer(strategy='constant', fill_value=0), MinMaxScaler())

preprocessing = ColumnTransformer([('num', num_pipeline, num_cols)],
                                    remainder='passthrough')

preprocessing

In [None]:
data_prepared = preprocessing.fit_transform(data)

feature_names=preprocessing.get_feature_names_out()
data_prepared = pd.DataFrame(data=data_prepared, columns=feature_names)
data_prepared.shape

In [None]:
from sklearn.model_selection import train_test_split

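# The transit ridership columns are the prediction targets
targets = ["num__TROtherTransit(A)", "num__TRFixedRouteBus(A)", "num__TRUrbanRail(A)"]

# TRTotal is the sum of the three targets, so it must not be used as a predictor
X = data_prepared.drop(columns=targets + ["num__TRTotal"], errors="ignore")
y = data_prepared[targets]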

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

In [None]:
from sklearn.linear_model import LinearRegression

lr_minmax_model = LinearRegression()

lr_minmax_model.fit(X_train,y_train)

In [None]:
lr_minmax_y_predict = lr_minmax_model.predict(X_test)

from sklearn.metrics import mean_squared_error as mse
lr_minmax_mse=mse(y_test, lr_minmax_y_predict)
lr_minmax_mse

In [None]:
plt.scatter(lr_minmax_y_predict, y_test)  # y is your actual target values
plt.xlabel("TR Predicted Values")
plt.ylabel("TR Actual Values")
plt.title("Predicted vs. Actual Values")
plt.show()

#### Decision Tree with min-max scaler

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

decision_tree = DecisionTreeRegressor(random_state=42)
decision_tree.fit(X_train, y_train)
dt_minmax_y_pred = decision_tree.predict(X_test)

from sklearn.metrics import mean_squared_error as mse
dt_minmax_mse=mse(y_test, dt_minmax_y_pred)
dt_minmax_mse

In [None]:
plt.scatter(dt_minmax_y_pred, y_test)  # y is your actual target values
plt.xlabel("TR Predicted Values")
plt.ylabel("TR Actual Values")
plt.title("Predicted vs. Actual Values")
plt.show()

#### Train a Linear Regression model using Ridge regularization, with alpha selected by grid search, and MinMax Scaler

In [None]:
from sklearn.linear_model import Ridge

RidgeRegression = Ridge()
param_grid = {'alpha':[0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 500]}
RidgeRegression = GridSearchCV(RidgeRegression, param_grid)
ridge_minmax_model = RidgeRegression.fit(X_train, y_train)

In [None]:
Ridge_minmax_y_predict = ridge_minmax_model.predict(X_test)
ridge_minmax_mse = mse(y_test, Ridge_minmax_y_predict)

print(f'Ridge Regression MSE: {ridge_minmax_mse}')

In [None]:
plt.scatter(Ridge_minmax_y_predict, y_test)  # y is your actual target values
plt.xlabel("TR Predicted Values")
plt.ylabel("TR Actual Values")
plt.title("Predicted vs. Actual Values")
plt.show()

#### Train a Linear Regression model using Lasso regularization, with alpha selected by grid search, and MinMax Scaler

In [None]:
from sklearn.linear_model import Lasso

LassoRegression = Lasso()
param_grid = {'alpha':[0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 500]}
LassoRegression = GridSearchCV(LassoRegression, param_grid)
lasso_minmax_model = LassoRegression.fit(X_train, y_train)

In [None]:
Lasso_minmax_y_predict = lasso_minmax_model.predict(X_test)
lasso_minmax_mse=mse(y_test, Lasso_minmax_y_predict)

print(f'Lasso Regression MSE: {lasso_minmax_mse}')

In [None]:
plt.scatter(Lasso_minmax_y_predict, y_test)  # y is your actual target values
plt.xlabel("TR Predicted Values")
plt.ylabel("TR Actual Values")
plt.title("Predicted vs. Actual Values")
plt.show()

## Polynomial Regression

In [None]:
from sklearn.preprocessing import PolynomialFeatures

### Pipeline with StandardScaler and PolynomialFeatures with degree = 3, replacing missing values with 0

In [None]:
num_cols = data.select_dtypes(include='number').columns.to_list()

#create pipelines for numeric columns
num_pipeline = make_pipeline(SimpleImputer(strategy='constant', fill_value=0), StandardScaler(),
                             PolynomialFeatures(degree=3))

preprocessing = ColumnTransformer([('num', num_pipeline, num_cols)],
                                    remainder='passthrough')

preprocessing

In [None]:
data_prepared = preprocessing.fit_transform(data)

feature_names=preprocessing.get_feature_names_out()
data_prepared = pd.DataFrame(data=data_prepared, columns=feature_names)
data_prepared.shape
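
PolynomialFeatures grows the column count quickly: for n input columns and degree d it produces C(n + d, d) columns, including the bias term. The sketch below checks this for an assumed 25 input columns at degree 3. The column count is an illustrative assumption, not a value read from the dataset.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_columns, degree = 25, 3  # assumed input width, for illustration

# Fitting on a single dummy row is enough to count the output columns
poly = PolynomialFeatures(degree=degree).fit(np.zeros((1, n_columns)))

expected = comb(n_columns + degree, degree)
print(poly.n_output_features_, expected)  # 3276 3276
```

With only a few hundred monthly rows, a feature count in the thousands makes overfitting likely, which is one reason to compare against the regularized models.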

In [None]:
from sklearn.model_selection import train_test_split

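# The transit ridership columns are the prediction targets
targets = ["num__TROtherTransit(A)", "num__TRFixedRouteBus(A)", "num__TRUrbanRail(A)"]

# Every polynomial term built from a ridership column (including TRTotal) would
# leak the targets, so keep only the terms that involve no ridership column
tr_names = ["TROtherTransit(A)", "TRFixedRouteBus(A)", "TRUrbanRail(A)", "TRTotal"]
predictors = [c for c in data_prepared.columns if not any(t in c for t in tr_names)]

X = data_prepared[predictors]
y = data_prepared[targets]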

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

In [None]:
from sklearn.linear_model import LinearRegression

pr_model = LinearRegression()

pr_model.fit(X_train,y_train)

In [None]:
pr_y_predict = pr_model.predict(X_test)

from sklearn.metrics import mean_squared_error as mse
pr_mse=mse(y_test, pr_y_predict)
pr_mse

In [None]:
plt.scatter(pr_y_predict, y_test)  # y is your actual target values
plt.xlabel("TR Predicted Values")
plt.ylabel("TR Actual Values")
plt.title("Predicted vs. Actual Values")
plt.show()

#### Decision Tree with polynomial features

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

decision_tree = DecisionTreeRegressor(random_state=42)
decision_tree.fit(X_train, y_train)
dt_pr_y_pred = decision_tree.predict(X_test)

from sklearn.metrics import mean_squared_error as mse
dt_pr_mse=mse(y_test, dt_pr_y_pred)
dt_pr_mse

In [None]:
plt.scatter(dt_pr_y_pred, y_test)  # y is your actual target values
plt.xlabel("TR Predicted Values")
plt.ylabel("TR Actual Values")
plt.title("Predicted vs. Actual Values")
plt.show()

#### Train a Polynomial Regression model using Lasso regularization, with alpha selected by grid search

In [None]:
from sklearn.linear_model import Lasso

LassoRegression = Lasso()
param_grid = {'alpha':[0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 500]}
LassoRegression = GridSearchCV(LassoRegression, param_grid)
pr_lasso_model = LassoRegression.fit(X_train, y_train)

In [None]:
pr_Lasso_y_predict = pr_lasso_model.predict(X_test)
pr_lasso_mse=mse(y_test, pr_Lasso_y_predict)

print(f'Lasso Regression MSE: {pr_lasso_mse}')

In [None]:
plt.scatter(pr_Lasso_y_predict, y_test)  # y is your actual target values
plt.xlabel("TR Predicted Values")
plt.ylabel("TR Actual Values")
plt.title("Predicted vs. Actual Values")
plt.show()

#### Train a Polynomial Regression model using Ridge regularization, with alpha selected by grid search

In [None]:
from sklearn.linear_model import Ridge

RidgeRegression = Ridge()
param_grid = {'alpha':[0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 500]}
RidgeRegression = GridSearchCV(RidgeRegression, param_grid)
pr_ridge_model = RidgeRegression.fit(X_train, y_train)

In [None]:
pr_Ridge_y_predict = pr_ridge_model.predict(X_test)
pr_ridge_mse = mse(y_test, pr_Ridge_y_predict)

print(f'Ridge Regression MSE: {pr_ridge_mse}')

In [None]:
plt.scatter(pr_Ridge_y_predict, y_test)  # y is your actual target values
plt.xlabel("TR Predicted Values")
plt.ylabel("TR Actual Values")
plt.title("Predicted vs. Actual Values")
plt.show()

In [None]:
print(f'Linear Regression MSE: {lr_mse}')
print(f'Ridge Regression MSE: {ridge_mse}')
print(f'Lasso Regression MSE: {lasso_mse}')
print(f'Decision Tree MSE: {dt_mse}')
print()
print(f'Linear Regression with MinMaxScaler MSE: {lr_minmax_mse}')
print(f'Ridge Regression with MinMaxScaler MSE: {ridge_minmax_mse}')
print(f'Lasso Regression with MinMaxScaler MSE: {lasso_minmax_mse}')
print(f'Decision Tree with MinMaxScaler MSE: {dt_minmax_mse}')
print()
print(f'Polynomial Regression MSE: {pr_mse}')
print(f'Ridge Regression with Polynomial Regression MSE: {pr_ridge_mse}')
print(f'Lasso Regression with Polynomial Regression MSE: {pr_lasso_mse}')
print(f'Decision Tree with Polynomial Regression MSE: {dt_pr_mse}')


# 6. Fine-Tune Your Model