
# **Objective of the Workshop**
The objective of this workshop is to provide students with hands-on experience in applying linear and multiple regression techniques to analyze environmental data. By working with real-world climate and ecosystem variables, students will gain insights into how regression models can be used to identify relationships between key environmental factors, make predictions, and support data-driven decision-making.

**Through this workshop, you will:**

* Understand the concepts of linear and multiple regression and their applications in environmental science.
* Explore feature selection and correlation analysis to determine key influencing factors.
* Implement regression models using Python (Pandas, Scikit-learn, and Matplotlib/Seaborn).
* Evaluate model performance using metrics such as R-squared and Mean Squared Error (MSE).
* Interpret regression outputs and derive meaningful conclusions from the results.

## Importing the required packages

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

### Task 1: Import the FuelConsumption data and create a Pandas DataFrame

In [None]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

In [None]:
%cd "/content/drive/MyDrive/CMP7005/Semester 2/Week 7"
# please change the path according to the location of your data

In [None]:
ls

# **Reading the FuelConsumption data**

In [None]:
df=pd.read_csv('FuelConsumption.csv')

## Data exploration

###  Task 2: Inspect the first few rows of the DataFrame and summarise the descriptive statistics of the data

In [None]:
# Inspect the first few rows
display(df.head())
# Summarise the data
df.describe()
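Beyond the descriptive statistics, a few quick checks on shape, column types, and missing values are often useful before modelling. The sketch below uses a small hypothetical DataFrame named `toy` with made-up values, so it runs without the CSV file; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for the FuelConsumption data; values are illustrative only.
toy = pd.DataFrame({
    "ENGINESIZE": [2.0, 2.4, 1.5, 3.5, 3.5, 3.7],
    "CYLINDERS": [4, 4, 4, 6, 6, 6],
    "FUELCONSUMPTION_COMB": [8.5, 9.6, 5.9, 11.1, 10.6, 11.6],
    "CO2EMISSIONS": [196, 221, 136, 255, 244, 267],
})

n_rows, n_cols = toy.shape     # number of rows and columns
dtypes = toy.dtypes            # data type of each column
missing = toy.isna().sum()     # count of missing values per column

print(n_rows, n_cols)
print(dtypes)
print(missing)
```

A non-zero count in `missing` would suggest cleaning or imputing those columns before fitting a model.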

In [None]:
cdf=df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
cdf

## Task 3: Plot each independent variable against the dependent variable to check whether the relationship is linear

In [None]:
plt.scatter(df.ENGINESIZE,df.CO2EMISSIONS, color='hotpink')
plt.xlabel("ENGINESIZE")
plt.ylabel("CO2EMISSIONS")
plt.show()

In [None]:
plt.scatter(df.CYLINDERS,df.CO2EMISSIONS, color='cyan')
plt.xlabel("CYLINDERS")
plt.ylabel("CO2EMISSIONS")
plt.show()

In [None]:
plt.scatter(df.FUELCONSUMPTION_COMB,df.CO2EMISSIONS, color='blue')
plt.xlabel("FUELCONSUMPTION_COMB")
plt.ylabel("CO2EMISSIONS")
plt.show()


Which of the variables plotted above do you think has the strongest linear relationship with CO2 emissions?
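One way to back up the visual impression with a number is the Pearson correlation coefficient, which measures the strength of a linear association (values near 1 or -1 indicate a strong linear relationship). The following is a minimal sketch on a hypothetical DataFrame `toy` with made-up values; on the workshop data, the same call could be applied to `cdf`.

```python
import pandas as pd

# Hypothetical stand-in for the selected columns; values are illustrative only.
toy = pd.DataFrame({
    "ENGINESIZE": [2.0, 2.4, 1.5, 3.5, 3.5, 3.7],
    "CYLINDERS": [4, 4, 4, 6, 6, 6],
    "FUELCONSUMPTION_COMB": [8.5, 9.6, 5.9, 11.1, 10.6, 11.6],
    "CO2EMISSIONS": [196, 221, 136, 255, 244, 267],
})

# Pearson correlation of every feature with the target column
corr_with_target = toy.corr()["CO2EMISSIONS"].drop("CO2EMISSIONS")

# Features listed from the strongest positive correlation downwards
print(corr_with_target.sort_values(ascending=False))
```

Correlation only captures linear association, so it complements the scatter plots rather than replacing them.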

## Task 4: Train-test data preparation

In [None]:
X=df[['ENGINESIZE']]
X


In [None]:
y=df[['CO2EMISSIONS']]
y

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
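Because `train_test_split` shuffles the rows randomly, each run of the cell above produces a different split, and therefore slightly different coefficients and scores. If reproducible results are wanted, the `random_state` parameter can be fixed. The sketch below illustrates this on hypothetical data (`X_toy`, `y_toy`, and the seed value 42 are arbitrary choices for the example).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 engine sizes and a made-up linear emission value.
X_toy = pd.DataFrame({"ENGINESIZE": np.linspace(1.0, 6.0, 20)})
y_toy = pd.DataFrame({"CO2EMISSIONS": 125 + 39 * X_toy["ENGINESIZE"]})

# Two splits with the same seed select the same rows.
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
X_test_b = split_b[1]

print(len(X_train_a), len(X_test_a))          # sizes of the 80/20 split
print(X_test_a.index.equals(X_test_b.index))  # same test rows in both splits
```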

In [None]:
print(X_train)

In [None]:
print(X_test)

In [None]:
print(y_train)

In [None]:
print(y_test)

In [None]:
# Train data distribution
plt.scatter(X_train,y_train, color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()

## Task 5: Using the sklearn package for data modelling

In [None]:
from sklearn import linear_model
regr=linear_model.LinearRegression()
regr.fit(X_train,y_train)

# The coefficients
print('Coefficients:', regr.coef_)
print('Intercept:', regr.intercept_)
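The coefficient is the slope of the fitted line (the change in predicted emission per unit of engine size) and the intercept is the predicted value when the feature is zero. As a rough illustration, the sketch below fits a model on hypothetical data that follows y = 100 + 40x exactly, then shows that a manual calculation from the slope and intercept matches `predict`. The names `X_toy`, `y_toy`, `model`, and `new_size` are made up for this example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line y = 100 + 40 * x
X_toy = pd.DataFrame({"ENGINESIZE": [1.0, 2.0, 3.0, 4.0, 5.0]})
y_toy = pd.DataFrame({"CO2EMISSIONS": [140.0, 180.0, 220.0, 260.0, 300.0]})

model = LinearRegression().fit(X_toy, y_toy)

# With a 2-D target, coef_ has shape (1, 1) and intercept_ has shape (1,)
slope = model.coef_[0][0]
intercept = model.intercept_[0]

# Predict the emission for a new engine size in two equivalent ways
new_size = 3.5
manual = slope * new_size + intercept
predicted = model.predict(pd.DataFrame({"ENGINESIZE": [new_size]}))[0][0]

print(round(slope, 2), round(intercept, 2))
print(round(manual, 2), round(predicted, 2))
```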

In [None]:
# Plot outputs
plt.scatter(X_train,y_train,color='blue')
plt.plot(X_train,regr.coef_[0][0]*X_train + regr.intercept_[0],'-r')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()

## Task 6: Model evaluation

In [None]:
from sklearn.metrics import r2_score
test_y_ = regr.predict(X_test)

In [None]:
# Compare predictions with the true test values, both as NumPy arrays
y_true = y_test.to_numpy()
print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - y_true)))
print("Mean squared error (MSE): %.2f" % np.mean((test_y_ - y_true) ** 2))
print("R2-score: %.2f" % r2_score(y_true, test_y_))
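The same metrics are also available as functions in `sklearn.metrics`. Each takes the true values first and the predictions second. The order does not change MAE or MSE, but it does change R2, because R2 = 1 - SS_res / SS_tot and SS_tot is computed around the mean of the first argument. The sketch below uses small made-up arrays (`y_true` and `y_pred` are hypothetical) to show the difference.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted emission values
y_true = np.array([200.0, 250.0, 300.0, 350.0])
y_pred = np.array([210.0, 240.0, 310.0, 330.0])

mae = mean_absolute_error(y_true, y_pred)  # mean of |error|
mse = mean_squared_error(y_true, y_pred)   # mean of error squared
r2 = r2_score(y_true, y_pred)              # true values first, predictions second

# Swapping the arguments gives a different (incorrect) R2 value
r2_swapped = r2_score(y_pred, y_true)

print("MAE: %.2f" % mae)
print("MSE: %.2f" % mse)
print("R2: %.3f, R2 with swapped arguments: %.3f" % (r2, r2_swapped))
```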