**A) Linear regression: Single feature vs multiple features**

a. Download dataset as per your batch.

In [None]:
import pandas as pd
df=pd.read_excel('AirQualityUCI.xlsx')
print(df.columns)

Index(['Date', 'Time', 'CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)',
       'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)',
       'PT08.S5(O3)', 'T', 'RH', 'AH'],
      dtype='object')


b. Preprocessing: Null value handling, standardization, replace categorical values with numeric values (e.g. 0, 1, 2 etc.)

In [32]:

# a. Null value handling
print("\nMissing values before handling:")
print(df.isnull().sum())

# Handle missing values by dropping rows with any null value
df.dropna(inplace=True)

# b. Standardization
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Select numeric columns for standardization
# (this includes the target column 'AH', so the errors reported later are in standardized units)
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns

# Standardize the selected numeric columns
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])


# c. Replace categorical values with numeric values (label encoding)

from sklearn.preprocessing import LabelEncoder

# Report which columns still hold categorical (object) data
for column in df.columns:
    if df[column].dtype == 'object':
        print(f"{column} contains categorical data.")

print("\nBefore:")
print(df['Time'].head())

# Initialize the LabelEncoder and encode 'Time' as integer labels
label_encoder = LabelEncoder()
df['Time'] = label_encoder.fit_transform(df['Time'])

print("\nAfter:")
print(df['Time'].head())

# Print the dataset after preprocessing
print("\nProcessed dataset:")
print(df.head())


Missing values before handling:
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
NMHC(GT)         0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64

Before:
0    0.939133
1    1.083583
2    1.228033
3    1.372483
4    1.516933
Name: Time, dtype: float64

After:
0    18
1    19
2    20
3    21
4    22
Name: Time, dtype: int64

Processed dataset:
        Date  Time    CO(GT)  PT08.S1(CO)  NMHC(GT)  C6H6(GT)  PT08.S2(NMHC)  \
0 2004-03-10    18  0.474000     1.302055  2.211236  0.242065       0.530761   
1 2004-03-10    19  0.466273     1.028437  1.939383  0.182019       0.182278   
2 2004-03-10    20  0.468849     1.460851  1.767687  0.172368       0.122335   
3 2004-03-10    21  0.468849     1.361909  1.710454  0.177950       0.156878   
4 2004-03-10    22  0.461122     0.940489  1.502988  0.112443      -0.
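The missing-value check above reports zero nulls, so `dropna` removes nothing. One possible reason, stated here as an assumption rather than something this notebook verifies: the UCI description of the Air Quality dataset says missing readings are tagged with the value -200 instead of being left empty, in which case `isnull()` would not detect them. A minimal sketch of converting such a sentinel to NaN before dropping rows, using a made-up toy frame (the name `toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "CO(GT)": [2.6, -200.0, 2.2, 1.6],
    "AH": [0.76, 0.73, -200.0, 0.79],
})

# Assumption: -200 is a missing-value sentinel, so turn it into NaN first.
with_nan = toy.replace(-200, np.nan)
print(with_nan.isnull().sum())

# Now dropna actually removes the rows that contain a missing reading.
cleaned = with_nan.dropna()
print(cleaned.shape)
```

Whether rows should be dropped or imputed depends on how many readings are affected; this sketch only shows the detection step.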

c. Data splitting: Split data as 70% train and 30% test using the train_test_split function.

In [None]:
from sklearn.model_selection import train_test_split
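# For a single feature
# (this split uses 'CO(GT)'; the single-feature model fitted in part d takes 'C6H6(GT)' from the multi-feature split in the next cell)
X = df[['CO(GT)']]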
y = df['AH']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (6549, 1)
X_test shape: (2808, 1)
y_train shape: (6549,)
y_test shape: (2808,)


In [None]:
from sklearn.model_selection import train_test_split
# For multiple features
X = df[['CO(GT)', 'PT08.S1(CO)', 'NMHC(GT)', 'C6H6(GT)','PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)']]
y = df['AH']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (6549, 9)
X_test shape: (2808, 9)
y_train shape: (6549,)
y_test shape: (2808,)
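In the preprocessing step, the scaler is fitted on the whole dataset before the split, so test rows influence the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that order on synthetic data; the names `X_demo`, `y_demo` and `scaler_demo` are hypothetical and not part of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 rows, 3 features (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Split first: 70% train, 30% test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print("Train shape:", X_tr_scaled.shape, "Test shape:", X_te_scaled.shape)
print("Train column means after scaling:", X_tr_scaled.mean(axis=0).round(6))
```

The test columns will generally not have exactly zero mean after this transform, which is expected: they are scaled with the training statistics.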


d. Fit the model using the fit function, first with a single feature and then with all independent features.




In [None]:
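from sklearn.linear_model import LinearRegression

# Single-feature model: use only 'C6H6(GT)' from the multi-feature training split
X_train_single_feature = X_train[['C6H6(GT)']]
y_train_single_feature = y_train
model_single_feature = LinearRegression()
model_single_feature.fit(X_train_single_feature, y_train_single_feature)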

In [None]:
# Multiple-feature model: fit on all columns of the training split
model_all_features = LinearRegression()
model_all_features.fit(X_train, y_train)

e. Report parameter values, training error, test error, and model accuracy for linear regression with a single feature and with multiple features.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# NOTE: the R2 score evaluates the goodness of fit of a regression model.

# Single feature
X_test_single_feature = X_test[['C6H6(GT)']]
y_test_single_feature = y_test
# Parameter values
coef = model_single_feature.coef_
intercept = model_single_feature.intercept_
# Predicted values on the training and test sets
y_train_pred = model_single_feature.predict(X_train_single_feature)
y_test_pred = model_single_feature.predict(X_test_single_feature)
# Training and test error (mean squared error)
train_error = mean_squared_error(y_train, y_train_pred)
test_error = mean_squared_error(y_test, y_test_pred)
# R-squared on the test set, reported below as model accuracy
r2 = r2_score(y_test, y_test_pred)

print("Parameter values:")
print("Coefficient:", coef)
print("Intercept:", intercept)
print("\nTraining Error:", train_error)
print("Test Error:", test_error)
print("R-squared (Model Accuracy):", r2)

Parameter values:
Coefficient: [0.98585899]
Intercept: 0.0024334274633714885

Training Error: 0.02985000657335495
Test Error: 0.03254161048185446
R-squared (Model Accuracy): 0.9675033474889237


In [None]:
# For multiple features
# Parameter values
coef_all_features = model_all_features.coef_
intercept_all_features = model_all_features.intercept_

# Predict on training and testing data
y_train_pred_all_features = model_all_features.predict(X_train)
y_test_pred_all_features = model_all_features.predict(X_test)

# Calculate training and test errors
train_error_all_features = mean_squared_error(y_train, y_train_pred_all_features)
test_error_all_features = mean_squared_error(y_test, y_test_pred_all_features)

# Calculate R-squared (model accuracy)
r2_all_features = r2_score(y_test, y_test_pred_all_features)

# Print parameter values, training error, test error, and model accuracy
print("Parameter values:")
print("Coefficient:", coef_all_features)
print("Intercept:", intercept_all_features)
print("\nTraining Error:", train_error_all_features)
print("Test Error:", test_error_all_features)
print("R-squared (Model Accuracy):", r2_all_features)


Parameter values:
Coefficient: [ 0.00239395 -0.01198363 -0.00312823  1.0876342  -0.18555708 -0.04382193
 -0.02395594  0.02896879  0.00117067]
Intercept: 0.0006072525791409266

Training Error: 0.0026779843396792487
Test Error: 0.0033700267199065037
R-squared (Model Accuracy): 0.9966346291517776


**B) Answer the following questions (include question and answer as markdown cell in your notebook)**

a. Provide a general multiple linear regression equation and explain all the terms.


A multiple linear regression equation models the relationship between multiple independent variables (features) and a dependent variable (target).

Equation:

$$y = b + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$


1.   $y$ is the dependent variable (target) that we want to predict.
2.   $b$ is the intercept (constant term). It is the predicted value of $y$ when all independent variables are equal to zero.
3.   $x_1, x_2, \dots, x_n$ are the independent variables (features) used to predict $y$.
4.   $w_1, w_2, \dots, w_n$ are the coefficients associated with each independent variable. Each $w_i$ is the change in $y$ for a one-unit change in the corresponding $x_i$, holding all other variables constant. Together with $b$, these are the parameters that the model learns during training.
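As a small worked illustration of the equation, the sketch below evaluates it by hand for one observation and then checks that `LinearRegression` recovers the same intercept and coefficients from noise-free synthetic data. The numbers for `b`, `w` and `x` are arbitrary values chosen for this example, not taken from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Arbitrary illustrative parameters: intercept b and coefficients w1..w3.
b = 0.5
w = np.array([2.0, -1.0, 0.25])

# One observation with features x1..x3.
x = np.array([1.0, 3.0, 4.0])

# y = b + w1*x1 + w2*x2 + w3*x3 = 0.5 + 2.0 - 3.0 + 1.0
y_hat = b + np.dot(w, x)
print("Prediction for one observation:", y_hat)

# Generate noise-free data from the same equation and fit a model to it.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(50, 3))
y_syn = b + X_syn @ w

fitted = LinearRegression().fit(X_syn, y_syn)
print("Learned intercept:", fitted.intercept_)
print("Learned coefficients:", fitted.coef_)
```

Because the synthetic target contains no noise, the learned parameters match `b` and `w` up to floating-point error; with real data they are estimates.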








b. Explain the concept of a dummy variable and how such variables are calculated. Why is it necessary to convert nominal variables to dummy variables when performing linear regression?

A dummy variable, also known as an indicator variable, is a binary variable used to represent categorical data in statistical models like linear regression. It's necessary to convert nominal variables to dummy variables when performing linear regression because linear regression requires numerical inputs, and nominal variables cannot be directly used in their original form.

Here's how dummy variables are calculated and why they are necessary:

* **Calculating Dummy Variables:**

Suppose you have a nominal variable with k categories. To represent it with dummy variables, you create k−1 binary variables.
Each dummy variable represents one category; the remaining category is the reference (baseline) category.
For each observation, the dummy variable for its category is set to 1 and all other dummy variables are set to 0. An observation in the reference category has all dummy variables set to 0.
Example:

Suppose you have a nominal variable "Color" with categories "Red," "Green," and "Blue."
You create two dummy variables: "Dummy_Red" and "Dummy_Green."
If an observation has the color "Red," the "Dummy_Red" variable would be 1 and the "Dummy_Green" variable would be 0.
If an observation has the color "Green," the "Dummy_Red" variable would be 0 and the "Dummy_Green" variable would be 1.
If an observation has the color "Blue," both dummy variables would be 0, so "Blue" is the reference category.
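The Color example above can be reproduced with `pandas.get_dummies`. This is a minimal sketch on a made-up four-row frame (the name `data` is hypothetical); `drop_first=True` drops the alphabetically first category, which here is "Blue", so it becomes the reference category.

```python
import pandas as pd

# Made-up observations for the "Color" example.
data = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green"]})

# k = 3 categories -> k - 1 = 2 dummy columns; "Blue" is dropped as the reference.
dummies = pd.get_dummies(data["Color"], prefix="Dummy", drop_first=True, dtype=int)

print(pd.concat([data, dummies], axis=1))
```

The "Blue" row has 0 in both dummy columns, which is how the reference category is represented.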

*   **Necessity of Dummy Variables:**

Linear regression models require numerical inputs, so nominal (unordered categorical) variables cannot be used directly.
Replacing the categories with integer codes such as 0, 1, 2 would impose an order and equal spacing that the categories do not have.
By converting nominal variables to dummy variables, we represent categorical data numerically without implying an order, allowing us to include them as predictors in regression models.
Dummy variables also enable linear regression to capture the effect of categorical variables on the dependent variable by giving each non-reference category its own coefficient.

*  **Avoiding Multicollinearity:**

Including all k categories as separate variables together with an intercept leads to perfect multicollinearity: the k dummy columns always sum to 1, so any one of them can be predicted exactly from the others.
By creating only k−1 dummy variables, we avoid perfect multicollinearity; the omitted category becomes the reference category against which the others are compared.

In summary, dummy variables are essential in linear regression to represent categorical data numerically, enabling the model to incorporate the effects of categorical variables while avoiding multicollinearity.
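The multicollinearity point can be checked numerically by comparing the rank of the design matrix with its number of columns. The sketch below uses made-up Color observations; a rank lower than the column count means the columns are linearly dependent.

```python
import numpy as np
import pandas as pd

# Made-up observations of a nominal variable with k = 3 categories.
colors = pd.Series(["Red", "Green", "Blue", "Red", "Blue", "Green"], name="Color")
ones = np.ones(len(colors))

# Intercept column plus all k dummy columns.
full = pd.get_dummies(colors, dtype=float)
X_full = np.column_stack([ones, full.to_numpy()])

# Intercept column plus k - 1 dummy columns.
reduced = pd.get_dummies(colors, drop_first=True, dtype=float)
X_reduced = np.column_stack([ones, reduced.to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print("All k dummies:", X_full.shape[1], "columns, rank", rank_full)
print("k-1 dummies:  ", X_reduced.shape[1], "columns, rank", rank_reduced)
```

With all k dummies the matrix has four columns but rank three; with k−1 dummies the column count and the rank agree.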

c. Explore and mention the assumptions of linear regression with suitable explanation.


Linear regression relies on several assumptions for its validity:

1. **Linearity:** The relationship between the independent variables and the dependent variable is linear. This means that the change in the dependent variable is proportional to the change in the independent variables.

2. **Independence:** The observations in the dataset are independent of each other. There should be no correlation between the residuals (the differences between the observed and predicted values).

3. **Homoscedasticity:** The variance of the residuals is constant across all levels of the independent variables. In other words, the spread of the residuals should remain constant as the values of the independent variables change.

4. **Normality of Residuals:** The residuals should be normally distributed. This means that the errors should follow a normal distribution with a mean of 0.

5. **No Multicollinearity:** The independent variables should not be highly correlated with each other. High multicollinearity can lead to unstable estimates of the coefficients.

Violation of these assumptions can lead to biased or inefficient estimates of the regression coefficients and to unreliable standard errors, affecting the reliability and interpretability of the model. Therefore, it's essential to assess these assumptions and, if necessary, take appropriate measures to address any violations.
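Some of these assumptions can be checked with simple numeric diagnostics on the residuals. The sketch below is one possible set of checks on synthetic data, not a complete procedure: it compares residual spread for low and high fitted values (homoscedasticity), computes the Durbin-Watson statistic (independence, where values near 2 suggest no first-order autocorrelation), and computes variance inflation factors (multicollinearity). The helper `vif` and all variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with two deliberately correlated features.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=n)
X_syn = np.column_stack([x1, x2])
y_syn = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

model = LinearRegression().fit(X_syn, y_syn)
fitted_values = model.predict(X_syn)
residuals = y_syn - fitted_values

# With an intercept, ordinary least squares residuals average to zero on the training data.
print("Mean residual:", residuals.mean())

# Homoscedasticity: residual spread should be similar for low and high fitted values.
median_fit = np.median(fitted_values)
std_low = residuals[fitted_values <= median_fit].std()
std_high = residuals[fitted_values > median_fit].std()
print("Residual std (low / high fitted):", std_low, std_high)

# Independence: Durbin-Watson statistic, which always lies between 0 and 4.
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print("Durbin-Watson:", dw)


def vif(X, j):
    """Variance inflation factor of column j: 1 / (1 - R^2 of column j on the others)."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)


# Multicollinearity: larger VIF values indicate stronger correlation among features.
vifs = [vif(X_syn, j) for j in range(X_syn.shape[1])]
print("VIF per feature:", vifs)
```

Linearity and normality of residuals are usually inspected visually, for example with a residuals-versus-fitted plot and a Q-Q plot.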