# 1. Perform exploratory data analysis (EDA) to gain insights into the dataset.


How to perform EDA:

Observe your dataset.

Find any missing values.

Categorize your values.

Find the shape of your dataset.

Identify relationships in your dataset.

Locate any outliers in your dataset.
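The checklist above can be sketched on a small, made-up table. The `toy` DataFrame and its column names are assumptions for illustration only, not the Toyota Corolla data, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the real dataset), used only to illustrate the steps.
toy = pd.DataFrame({
    "price": [9500, 10200, 11000, 9800, 32000, 10500],
    "km": [60000, 72000, np.nan, 65000, 1000, 70000],
    "fuel": ["Petrol", "Diesel", "Petrol", "Petrol", "Petrol", "CNG"],
})

# Shape of the dataset: (rows, columns).
n_rows, n_cols = toy.shape

# Missing values per column.
missing = toy.isna().sum()

# Categorical (non-numeric) columns.
categorical = toy.select_dtypes(exclude="number").columns.tolist()

# Pairwise relationships between the numeric columns.
corr = toy.select_dtypes(include="number").corr()

# Outliers in one column using the 1.5 * IQR rule.
q1, q3 = toy["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = toy[(toy["price"] < q1 - 1.5 * iqr) | (toy["price"] > q3 + 1.5 * iqr)]

print(n_rows, n_cols)
print(missing)
print(categorical)
print(outliers)
```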

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv('https://raw.githubusercontent.com/Shrikrishna-jadhavar/Data-Science-Material/main/Dataset/ToyotaCorolla%20-%20MLR.csv')
df

In [None]:
df.info()

In [None]:
df.dtypes

**Provide visualizations and summary statistics of the variables.**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style = 'whitegrid')

pairplot = sns.pairplot(df, diag_kind="kde", corner = True) #Pairplot to visualize relationships and distributions.
pairplot.fig.suptitle("Pairplot of Variables", y=1.02)      #Adjust the title position.

In [None]:
numeric_columns = df.select_dtypes(include = 'number').columns  # Exclude 'Fuel_Type' from the heatmap since it's non-numeric

plt.figure(figsize = (15, 8))
correlation_matrix = df[numeric_columns].corr() # Correlation heatmap between numeric variables
heatmap = sns.heatmap(correlation_matrix, annot = True, cmap = 'coolwarm', fmt = '.2f')
plt.title("Correlation Heatmap of Variables")

In [None]:
plt.figure(figsize = (10, 8)) # Visualizing distribution of Price
price_dist = sns.histplot(df['Price'], kde = True, color = 'blue')
plt.title("Distribution of Car Prices")
plt.xlabel("Price (Euros)")
plt.ylabel("Frequency")

plt.show()

In [None]:
df.describe()

Summary Statistics

Price: Ranges from €4,350 to €32,500 with a mean of €10,730.    
Age: Ranges from 1 month to 80 months, with a mean of around 56 months.          
Mileage (KM): Ranges from 1 km to 243,000 km, with a mean of around 68,533 km.   
Horsepower (HP): Ranges from 69 to 192 with a mean of around 101.                
Weight: Ranges from 1,000 kg to 1,615 kg, with a mean of around 1,072 kg.

**Preprocess the data to apply Multiple Linear Regression.**

**To preprocess the data for Multiple Linear Regression (MLR), we'll follow these steps**:

Handle Categorical Data: Convert the Fuel_Type categorical variable into numerical format using one-hot encoding.

Feature Selection: Select the relevant features for the MLR model.

Normalization/Standardization: Scale the features to ensure they are on a similar scale.

Split the Data: Divide the data into training and testing sets.

In [None]:
data_encoded = pd.get_dummies(df, columns=['Fuel_Type'], drop_first=True) #Handle Categorical Data using one-hot encoding for the 'Fuel_Type' column

X = data_encoded.drop('Price', axis=1)
y = data_encoded['Price']

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
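The cells above fit the scaler on the full dataset before splitting, so the test rows influence the scaling statistics. A stricter variant splits first and lets a scikit-learn pipeline learn the scaling from the training rows only. The sketch below uses synthetic data; `X_demo`, `y_demo`, and the coefficients used to generate them are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical), three features on very different scales.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=[50.0, 70000.0, 100.0],
                    scale=[20.0, 30000.0, 15.0],
                    size=(200, 3))
y_demo = (20000 - 120 * X_demo[:, 0] - 0.02 * X_demo[:, 1]
          + 30 * X_demo[:, 2] + rng.normal(scale=500, size=200))

# Split first, so the test rows never influence the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training rows only and
# reuses those statistics when transforming the test rows.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

r2_demo = pipe.score(X_te, y_te)
scaler_mean = pipe.named_steps["standardscaler"].mean_
print(f"Test R-squared: {r2_demo:.4f}")
```

For plain least squares the predictions do not depend on feature scaling, so the difference matters mainly once a penalty such as Lasso or Ridge is added.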

In [48]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

In [49]:
y_pred = model.predict(X_test)

# 2. Split the dataset into training and testing sets (e.g., 80% training, 20% testing).

In [50]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)  # X_scaled is the feature matrix and y is the target vector.

X_train.shape, X_test.shape, y_train.shape, y_test.shape  # Show the shapes of the resulting datasets to confirm the split

((1148, 11), (288, 11), (1148,), (288,))

# 3. Build a multiple linear regression model using the training dataset.

Interpret the coefficients of the model. Build a minimum of three different models.

In [19]:
import pandas as pd
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

X_train_df = pd.DataFrame(X_train, columns=X.columns).reset_index(drop=True)
y_train_reset = y_train.reset_index(drop=True)

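# Model 1: all features
model_1 = LinearRegression()
model_1.fit(X_train_df, y_train_reset)
coefficients_1 = model_1.coef_

# Models 2 and 3 are fit on the same features as model 1, so all three
# currently produce identical coefficients; vary the feature set to compare models.
model_2 = LinearRegression()
model_2.fit(X_train_df, y_train_reset)
coefficients_2 = model_2.coef_

model_3 = LinearRegression()
model_3.fit(X_train_df, y_train_reset)
coefficients_3 = model_3.coef_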

# Using statsmodels for detailed summary
X_train_sm = sm.add_constant(X_train_df)  # Adds a constant term to the predictor
sm_model_1 = sm.OLS(y_train_reset, X_train_sm).fit()
sm_model_2 = sm.OLS(y_train_reset, X_train_sm).fit()
sm_model_3 = sm.OLS(y_train_reset, X_train_sm).fit()

print(sm_model_1.summary())
print(sm_model_2.summary())
print(sm_model_3.summary())


                            OLS Regression Results                            
Dep. Variable:                  Price   R-squared:                       0.870
Model:                            OLS   Adj. R-squared:                  0.869
Method:                 Least Squares   F-statistic:                     762.7
Date:                Wed, 14 Aug 2024   Prob (F-statistic):               0.00
Time:                        10:39:49   Log-Likelihood:                -9863.2
No. Observations:                1148   AIC:                         1.975e+04
Df Residuals:                    1137   BIC:                         1.980e+04
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const             1.075e+04     38.680  
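One way to obtain three genuinely different models is to fit each on a different feature subset and compare coefficients and fit. The sketch below uses synthetic, already standardized data; the column names (`age`, `km`, `hp`, `weight`), the subsets, and the generating coefficients are hypothetical choices for illustration, not the notebook's actual selection.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic standardized features (hypothetical names and effects).
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "age": rng.normal(size=n),
    "km": rng.normal(size=n),
    "hp": rng.normal(size=n),
    "weight": rng.normal(size=n),
})
price = (10000 - 2000 * demo["age"] - 600 * demo["km"]
         + 200 * demo["hp"] + 1300 * demo["weight"]
         + rng.normal(scale=500, size=n))

# Three nested feature subsets, from the full set down to a single predictor.
feature_sets = {
    "model_a": ["age", "km", "hp", "weight"],
    "model_b": ["age", "km"],
    "model_c": ["age"],
}

fitted = {}
for name, cols in feature_sets.items():
    m = LinearRegression().fit(demo[cols], price)
    fitted[name] = {
        "coef": dict(zip(cols, m.coef_)),
        "intercept": m.intercept_,
        "r2": m.score(demo[cols], price),  # training R-squared
    }
    print(name, fitted[name])
```

With standardized predictors, each coefficient is the change in predicted price for a one-standard-deviation increase in that feature with the others held fixed, and the intercept is the predicted price when every feature is at its mean.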

# 4. Evaluate the performance of the model using appropriate evaluation metrics on the testing dataset.

To evaluate the performance of the multiple linear regression models, I'll use the following metrics on the testing dataset:

R-squared (R²).

Mean Absolute Error (MAE).

Mean Squared Error (MSE).

Root Mean Squared Error (RMSE).
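To make the four metrics concrete, they can be computed by hand with NumPy. The arrays `y_true` and `y_hat` below are made-up values for illustration; the formulas are the standard definitions that `r2_score`, `mean_absolute_error`, and `mean_squared_error` implement.

```python
import numpy as np

# Hypothetical true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true - y_hat

mae = np.mean(np.abs(residuals))   # average absolute error
mse = np.mean(residuals ** 2)      # average squared error
rmse = np.sqrt(mse)                # same units as the target

ss_res = np.sum(residuals ** 2)                 # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
r2 = 1.0 - ss_res / ss_tot                      # share of variation explained

print(f"MAE={mae:.4f}  MSE={mse:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")
```

RMSE is often the easiest to read because it is in the same units as the target, while R² is unitless and compares the model against always predicting the mean.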

In [35]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

y_pred_1 = model_1.predict(X_test)

r2_1 = r2_score(y_test, y_pred_1)
mae_1 = mean_absolute_error(y_test, y_pred_1)
mse_1 = mean_squared_error(y_test, y_pred_1)
rmse_1 = np.sqrt(mse_1)

print("Model 1 Performance:")
print(f"R-squared: {r2_1:.4f}")
print(f"Mean Absolute Error : {mae_1:.2f}")
print(f"Mean Squared Error : {mse_1:.2f}")
print(f"Root Mean Squared Error : {rmse_1:.2f}")


Model 1 Performance:
R-squared: 0.8349
Mean Absolute Error : 990.89
Mean Squared Error : 2203043.82
Root Mean Squared Error : 1484.27




# 5. Apply Lasso and Ridge methods on the model.

In [51]:
from sklearn.linear_model import Lasso  #Least Absolute Shrinkage and Selection Operator
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

lasso = Lasso(alpha=0.1)    # Lasso regression with regularization strength alpha = 0.1
lasso.fit(X_train_df, y_train_reset)

y_pred_lasso = lasso.predict(X_test)  # Predict on the test set using Lasso

r2_lasso = r2_score(y_test, y_pred_lasso)   # Evaluate the Lasso model
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
rmse_lasso = np.sqrt(mse_lasso)

print("Lasso Regression Performance:")
print(f"R-squared: {r2_lasso:.4f}")
print(f"Mean Squared Error:{mse_lasso:.2f}")
print(f"Root Mean Squared Error: {rmse_lasso:.2f}")
print(f"Lasso Coefficients: {lasso.coef_}")

Lasso Regression Performance:
R-squared: 0.8349
Mean Squared Error:2202734.65
Root Mean Squared Error: 1484.16
Lasso Coefficients: [-2246.64304209  -608.62440915   210.36488824    34.07822185
   -12.7865696    -57.25480263     0.           103.88310379
  1361.7055931    -21.02979021   444.95563866]




In [30]:
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=0.1)  # Ridge regression with regularization strength alpha = 0.1
ridge.fit(X_train_df, y_train_reset)

y_pred_ridge = ridge.predict(X_test)   # Predict on the test set using Ridge

r2_ridge = r2_score(y_test, y_pred_ridge)   # Evaluate the Ridge model
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
rmse_ridge = np.sqrt(mse_ridge)

print("\nRidge Regression Performance:")
print(f"R-squared: {r2_ridge:.4f}")
print(f"Mean Squared Error : {mse_ridge:.2f}")
print(f"Root Mean Squared Error : {rmse_ridge:.2f}")
print(f"Ridge Coefficients: {ridge.coef_}")



Ridge Regression Performance:
R-squared: 0.8349
Mean Squared Error : 2202805.77
Root Mean Squared Error : 1484.19
Ridge Coefficients: [-2246.44992968  -608.73867965   210.38374177    34.15758066
   -12.87150784   -57.34763231     0.           103.95726249
  1361.86187163   -21.32804161   444.71000491]
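With `alpha=0.1`, Lasso and Ridge both report the same test R² (0.8349) as the unpenalized model, which suggests the penalty is too weak to change much. A common next step is to choose `alpha` by cross-validation. The sketch below uses scikit-learn's `LassoCV` and `RidgeCV` on synthetic data from `make_regression`; the data, the alpha grid, and the fold count are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV

# Synthetic data (hypothetical): 10 features, only 4 of which carry signal.
# make_regression draws features on a comparable scale, so no extra scaling here.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=4, noise=10.0, random_state=0
)

# Candidate regularization strengths on a log scale.
alphas = np.logspace(-3, 2, 30)

# Each estimator picks the alpha with the best 5-fold cross-validated score.
lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=10000).fit(X_demo, y_demo)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_demo, y_demo)

# Lasso can set coefficients exactly to zero; Ridge only shrinks them.
n_zeroed = int(np.sum(lasso_cv.coef_ == 0))

print(f"Best Lasso alpha: {lasso_cv.alpha_:.4f}, coefficients set to zero: {n_zeroed}")
print(f"Best Ridge alpha: {ridge_cv.alpha_:.4f}")
```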




# Interview Questions

**1. What are Normalization and Standardization, and how are they helpful?**

Normalization and standardization are two techniques used in data preprocessing to scale numerical data, making it more suitable for machine learning models. Both techniques adjust the range and distribution of data, but they do so in different ways.

**Normalization**

Def: Normalization, also known as min-max scaling, transforms the data to fit within a specific range, typically [0, 1]. It adjusts the values by subtracting the minimum value of the feature and then dividing by the range (the difference between the maximum and minimum value).

Formula:
$$\text{Normalized value} = \frac{x - \min(X)}{\max(X) - \min(X)}$$

Use:

Normalization is particularly useful when you know that the distribution of data does not follow a Gaussian (Normal) distribution or when you want to scale features to be between a specific range, especially for algorithms like k-nearest neighbors or neural networks, where distances between points are important.

**Standardization**

Def: Standardization, also known as z-score normalization, transforms the data to have a mean of 0 and a standard deviation of 1. This technique centers the data by subtracting the mean of the feature and scales it by dividing by the standard deviation.

Formula:
$$\text{Standardized value} = \frac{x - \mu}{\sigma}$$

Use:

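Standardization is useful when the features have different units or scales but you want them to be comparable. It is often preferred when the data roughly follows a Gaussian distribution. Scale-sensitive algorithms such as regularized linear regression, logistic regression, and support vector machines often perform better with standardized data. Ordinary least squares predictions do not change under standardization, but the coefficients become directly comparable across features.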
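Both formulas can be applied directly with NumPy. The array `x` below is a made-up feature for illustration. The standard deviation here is the population form (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np

# Hypothetical feature values.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-max normalization: rescale to the range [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): mean 0 and standard deviation 1.
x_std = (x - x.mean()) / x.std()

print("normalized:  ", x_norm)
print("standardized:", np.round(x_std, 4))
```

Normalization pins the minimum to 0 and the maximum to 1, so a single extreme value compresses everything else; standardization has no fixed bounds and is usually less distorted by one outlier.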

**How are They Helpful?**

**Improving Model Performance:**

Gradient Descent Convergence: For optimization algorithms like gradient descent, normalized or standardized data helps the algorithm converge faster, because features on a similar scale produce gradients of similar magnitude in every direction.

Reducing Bias: Models that are sensitive to the scale of input data (like SVM, KNN, and neural networks) can be biased toward larger scale features. Normalization or standardization ensures that all features contribute equally to the model.

**Preventing Overfitting:**

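Regularization: Techniques like Lasso and Ridge regression apply penalties based on the magnitude of coefficients. Because a coefficient's magnitude depends on the scale of its feature, unscaled features are penalized unevenly; standardizing first lets the penalty treat all features on equal terms.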

Normalization and Standardization are crucial preprocessing steps that can significantly impact the performance and reliability of machine learning models.

**2. What techniques can be used to address multicollinearity in multiple linear regression?**

Multicollinearity occurs in multiple linear regression when two or more independent variables are highly correlated with each other. This situation can cause issues because it makes it difficult to determine the individual effect of each predictor on the dependent variable.

Here are several techniques to address multicollinearity:

**Remove Highly Correlated Predictors**

Correlation Matrix: Calculate the correlation matrix of the independent variables and identify pairs with high correlations (e.g., > 0.8 or < -0.8). You can then remove one of the highly correlated variables.

Variance Inflation Factor (VIF): Calculate the VIF for each predictor. VIF measures how much the variance of a regression coefficient is inflated due to multicollinearity. A VIF value greater than 5 or 10 is often considered problematic, and you may consider removing the variable with a high VIF.
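The VIF of predictor j is 1 / (1 − R²ⱼ), where R²ⱼ comes from regressing predictor j on all the other predictors. A minimal NumPy sketch of that definition follows; the helper `vif` and the synthetic columns are hypothetical and written for illustration. statsmodels also provides `variance_inflation_factor` for the same purpose.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of a 2-D array."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic predictors (hypothetical): c is almost a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```

In this setup the two nearly duplicate columns get large VIFs while the independent column stays close to 1, which is the pattern that flags a predictor for removal.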

**Combine Predictors**

Principal Component Analysis (PCA): PCA reduces the dimensionality of the data by combining correlated variables into a smaller number of uncorrelated components. You can then use these components as predictors in the regression model.
    
Factor Analysis: Similar to PCA, factor analysis identifies underlying factors that explain the pattern of correlations within a set of observed variables. These factors can then be used as predictors.
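As a rough illustration of combining predictors, the sketch below runs scikit-learn's `PCA` on three synthetic columns, two of which are nearly identical, and checks that the resulting components are uncorrelated. The data and the choice of two components are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic predictors (hypothetical): the first two columns are highly correlated.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
X_demo = np.column_stack([
    a,
    a + 0.1 * rng.normal(size=500),
    rng.normal(size=500),
])

# Keep two components; the near-duplicate columns collapse into one direction.
pca = PCA(n_components=2)
components = pca.fit_transform(X_demo)

# Correlation between the two components (uncorrelated by construction).
corr_components = np.corrcoef(components, rowvar=False)
explained = pca.explained_variance_ratio_.sum()

print(np.round(corr_components, 6))
print(f"Variance explained by two components: {explained:.4f}")
```

The components can then replace the original columns as regression inputs, at the cost of coefficients that no longer map to a single original variable.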

**Regularization Techniques**

Lasso Regression: Lasso regression adds an L1 penalty, which can shrink some coefficients to zero, effectively performing variable selection and reducing multicollinearity.

Ridge Regression: Ridge regression adds an L2 penalty to the loss function, which shrinks the coefficients of correlated predictors towards zero but does not eliminate them. This reduces the impact of multicollinearity while keeping all predictors in the model.
    
Elastic Net: Elastic Net combines both L1 and L2 penalties and can be particularly effective when there are multiple correlated predictors.

**Centered Variables**

Mean Centering: Subtract the mean of each predictor from the values of that predictor. This can help reduce multicollinearity in models that include interaction terms or polynomial features, although it may not fully eliminate it.
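The effect of mean centering on a polynomial term can be checked numerically. In this sketch `x` is a synthetic, strictly positive predictor (an assumption for illustration), so `x` and `x**2` move together until `x` is centered.

```python
import numpy as np

# Synthetic positive predictor (hypothetical values).
rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=1000)

# Correlation between x and its square before centering.
corr_raw = np.corrcoef(x, x ** 2)[0, 1]

# Center x, then build the squared term from the centered values.
xc = x - x.mean()
corr_centered = np.corrcoef(xc, xc ** 2)[0, 1]

print(f"corr(x, x^2) before centering: {corr_raw:.3f}")
print(f"corr(x, x^2) after centering:  {corr_centered:.3f}")
```

Centering helps here because the predictor is roughly symmetric around its mean; for strongly skewed predictors the correlation drops less.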

The choice of technique depends on the specific dataset and the goals of the analysis. In many cases, a combination of these methods might be the most effective way to address multicollinearity.