
## Interview Question 1: What are normalization and standardization, and how are they helpful?

### **Normalization**:
Normalization (min-max scaling) rescales the values of each feature into a fixed range, typically 0 to 1. This technique is especially useful when features have very different ranges, as it brings them all onto a common scale.

- **Formula for Normalization**:
  \[
  X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}
  \]
  
Normalization is particularly useful in algorithms that rely on distance metrics (like K-Nearest Neighbors or clustering algorithms) as it ensures that features with large values do not dominate features with smaller values.
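
As a minimal sketch of the formula above, min-max normalization can be applied column by column with NumPy. The feature values and the helper name `min_max_normalize` are made up for illustration and do not come from the dataset used later.

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each column of X to [0, 1] using (X - min) / (max - min)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Guard against constant columns, where max == min would divide by zero
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

# Hypothetical data: column 0 is an age in months, column 1 is a mileage in km
X = np.array([[12.0, 10_000.0],
              [36.0, 60_000.0],
              [60.0, 110_000.0]])

X_norm = min_max_normalize(X)
print(X_norm)
```

After scaling, both columns span the same range, so neither dominates a Euclidean distance simply because it is measured in larger units.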

### **Standardization**:
Standardization rescales each feature so that it has a mean of 0 and a standard deviation of 1. It only shifts and scales the values and does not change the shape of the distribution, so the result follows a standard normal distribution only if the original feature was already normally distributed.

- **Formula for Standardization**:
  \[
  X_{std} = \frac{X - \mu}{\sigma}
  \]
  where:
  - \(X\) is the feature value
  - \(\mu\) is the mean of the feature
  - \(\sigma\) is the standard deviation

Standardization is often preferred when features have different units or spreads. It is useful for scale-sensitive models such as regularized regression, SVM, and PCA.
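
The z-score formula can be sketched in a few lines of NumPy. The data and the helper names `standardize` and `skewness` are illustrative assumptions. The sketch also checks that the skewness of a feature is the same before and after scaling, since z-scoring is a linear transformation.

```python
import numpy as np

def standardize(X):
    """Return column-wise z-scores: (X - mean) / std."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)      # population std (ddof=0), as StandardScaler uses
    sigma[sigma == 0] = 1.0    # keep constant columns at 0 instead of dividing by zero
    return (X - mu) / sigma

def skewness(x):
    """Skewness: mean of the cubed z-scores."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

# Hypothetical right-skewed feature with one large value
x = np.array([[1.0], [2.0], [2.0], [3.0], [20.0]])
z = standardize(x)

print(z.mean(axis=0), z.std(axis=0))          # approximately 0 and 1
print(skewness(x[:, 0]), skewness(z[:, 0]))   # identical: the shape is unchanged
```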

### **Benefits**:
- Both techniques often speed up convergence for models trained with gradient-based optimization.
- They prevent features with large numeric ranges from dominating the learning process purely because of their scale.



## Interview Question 2: What techniques can be used to address multicollinearity in multiple linear regression?

### **Multicollinearity**:
Multicollinearity occurs when two or more independent variables in a multiple regression model are highly correlated. This can lead to instability in the model coefficients, making it difficult to determine the individual effect of each variable on the dependent variable.

### **Techniques to Address Multicollinearity**:

1. **Variance Inflation Factor (VIF)**:
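   - VIF quantifies multicollinearity by measuring how much the variance of a regression coefficient is inflated because that feature is correlated with the other features. It is a diagnostic rather than a fix in itself.
   - As a common rule of thumb, a VIF above 10 signals serious multicollinearity (some practitioners use a stricter cutoff of 5).
   - Removing variables with high VIF, one at a time and recomputing after each removal, can reduce multicollinearity.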

2. **Removing or Combining Highly Correlated Features**:
   - If two variables are highly correlated, one can be removed, or both can be combined (e.g., using principal component analysis) to reduce redundancy.

3. **Regularization Techniques**:
   - **Ridge Regression**: Ridge adds a penalty to the size of coefficients, effectively shrinking them, which reduces the impact of multicollinearity.
   - **Lasso Regression**: Lasso adds a penalty that can shrink some coefficients to zero, thus performing feature selection and reducing multicollinearity.
   
4. **Principal Component Analysis (PCA)**:
   - PCA can reduce multicollinearity by transforming correlated variables into a smaller set of uncorrelated components, which can then be used in regression.

5. **Dropping Variables**:
   - Another straightforward technique is to drop variables that exhibit high correlation with others. This can reduce multicollinearity but may also lead to loss of information.
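
To make the VIF idea concrete, the sketch below computes it from its definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. It is a sketch on synthetic data: the helper `vif` and the variables `x1`, `x2`, `x3` are hypothetical. Unlike `statsmodels`' `variance_inflation_factor`, which expects the constant column to be part of the design matrix, this helper adds the intercept itself.

```python
import numpy as np

def vif(X):
    """VIF for each column of X, using an auxiliary regression with intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + other features
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)               # unrelated to the others

vif_all = vif(np.column_stack([x1, x2, x3]))
vif_dropped = vif(np.column_stack([x1, x3]))   # after dropping x2

print(vif_all)       # x1 and x2 are far above 10, x3 is close to 1
print(vif_dropped)   # both remaining features are close to 1
```

Dropping one of the two near-duplicate features brings the remaining VIF values back to roughly 1, which is the effect that techniques 1, 2, and 5 above rely on.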



## Visualizing Normalization & Standardization

### **Standardization Example**:
Standardization transforms each feature to have a mean of 0 and a standard deviation of 1. This helps scale-sensitive methods such as regularized linear regression, logistic regression, and principal component analysis (PCA). It does not make the data normally distributed; it only shifts and rescales it.

```python
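import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Assumes `df` is the Toyota Corolla DataFrame loaded in the notebook cells further below
cols = ['Age_08_04', 'KM', 'HP']

scaler = StandardScaler()
# Wrap the result in a DataFrame so the plot legend keeps the feature names
X_scaled = pd.DataFrame(scaler.fit_transform(df[cols]), columns=cols, index=df.index)

# Plotting the distributions before and after standardization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

# Original Data
sns.histplot(df[cols], ax=ax1, kde=True)
ax1.set_title('Original Data Distribution')

# Standardized Data
sns.histplot(X_scaled, ax=ax2, kde=True)
ax2.set_title('Standardized Data Distribution')

plt.show()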
```

### **Normalization Example**:
Normalization rescales the data into a defined range, usually 0 to 1. This is important for distance-based algorithms such as K-Nearest Neighbors or clustering, where unscaled features with large ranges would otherwise dominate the distance.

```python
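import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

# Assumes `df` is the Toyota Corolla DataFrame loaded in the notebook cells further below
cols = ['Age_08_04', 'KM', 'HP']

scaler = MinMaxScaler()
# Wrap the result in a DataFrame so the plot legend keeps the feature names
X_normalized = pd.DataFrame(scaler.fit_transform(df[cols]), columns=cols, index=df.index)

# Plotting the distributions before and after normalization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))

# Original Data
sns.histplot(df[cols], ax=ax1, kde=True)
ax1.set_title('Original Data Distribution')

# Normalized Data
sns.histplot(X_normalized, ax=ax2, kde=True)
ax2.set_title('Normalized Data Distribution')

plt.show()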
```


# Multiple Linear Regression with Multicollinearity Handling


## Visualizing Multicollinearity

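Two of the most common ways to detect multicollinearity are a correlation matrix, which shows pairwise relationships, and VIF values, which also capture a feature's dependence on several other features at once.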

```python
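import matplotlib.pyplot as plt
import seaborn as sns

# Assumes `df_encoded` and `vif_data` have been created by the notebook cells further below

# Plotting the correlation matrix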
plt.figure(figsize=(10,8))
sns.heatmap(df_encoded.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix to Detect Multicollinearity')
plt.show()

# Bar plot for VIF values
vif_data.plot(kind='bar', x='Feature', y='VIF', title='Variance Inflation Factor (VIF)', figsize=(10,6))
plt.axhline(y=10, color='red', linestyle='--', label='VIF > 10 indicates multicollinearity')
plt.legend()
plt.show()
```

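In the heatmap, feature pairs whose absolute correlation is close to 1 are candidates for multicollinearity. In the bar chart, any feature whose bar rises above the dashed line at 10 exceeds the rule-of-thumb VIF threshold.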


In [1]:

# Import necessary libraries
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Load the dataset
df = pd.read_csv('/mnt/data/ToyotaCorolla_MLR.csv')

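# One-hot encoding the 'Fuel_Type' column
# dtype=float keeps the dummy columns numeric so that X.values is a float array for the VIF step
df_encoded = pd.get_dummies(df, columns=['Fuel_Type'], drop_first=True, dtype=float)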

# Defining the feature set (independent variables) and target variable (Price)
X = df_encoded.drop(columns=['Price'])
y = df_encoded['Price']

# Adding a constant for the intercept term in the regression model
X = sm.add_constant(X)

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Calculating VIF to detect multicollinearity
# The 'const' row belongs to the intercept column and is not interpreted as a feature VIF
vif_data = pd.DataFrame()
vif_data['Feature'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
vif_data


### Adjusting the model to reduce multicollinearity by removing highly collinear features

In [None]:

# Dropping 'Cylinders', 'Weight', and 'Gears' due to high VIF
X_reduced = X.drop(columns=['Cylinders', 'Weight', 'Gears'])

# Recalculate VIF after reduction
vif_data_reduced = pd.DataFrame()
vif_data_reduced['Feature'] = X_reduced.columns
vif_data_reduced['VIF'] = [variance_inflation_factor(X_reduced.values, i) for i in range(len(X_reduced.columns))]
vif_data_reduced

### Building and evaluating the multiple linear regression model

In [None]:

# Aligning indices before fitting the model
X_train_aligned, y_train_aligned = X_reduced.loc[y_train.index], y_train

# Fitting the OLS model
model = sm.OLS(y_train_aligned, X_train_aligned).fit()
model.summary()
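
The cell above fits the model on the training split only. A held-out check could look like the following self-contained sketch, which uses synthetic data and plain NumPy least squares in place of the notebook's objects; all names and values here are illustrative assumptions. With the notebook's own variables, the analogous step would presumably be `model.predict(X_reduced.loc[y_test.index])` followed by `r2_score` and `mean_squared_error`.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
X_demo = rng.normal(size=(n, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y_demo = 10.0 + X_demo @ true_beta + rng.normal(scale=0.5, size=n)

# 80/20 train/test split on shuffled indices
idx = rng.permutation(n)
train_idx, test_idx = idx[:160], idx[160:]

def add_const(X):
    """Prepend a column of ones for the intercept term."""
    return np.column_stack([np.ones(len(X)), X])

# Fit ordinary least squares on the training rows only
beta_hat, *_ = np.linalg.lstsq(add_const(X_demo[train_idx]), y_demo[train_idx], rcond=None)

# Evaluate on the rows the model has not seen
y_pred = add_const(X_demo[test_idx]) @ beta_hat
y_true = y_demo[test_idx]
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
r2 = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"test RMSE = {rmse:.3f}, test R^2 = {r2:.3f}")
```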


### Applying Ridge and Lasso Regression to address multicollinearity

In [None]:

# Ridge and Lasso regression
# X_train_aligned still contains the 'const' column from sm.add_constant; scikit-learn
# fits its own intercept, so that column ends up with a coefficient of 0.
ridge_model = Ridge(alpha=1.0).fit(X_train_aligned, y_train_aligned)
lasso_model = Lasso(alpha=1.0).fit(X_train_aligned, y_train_aligned)

# Approximate AIC and BIC on the training data:
#   AIC = n * ln(MSE) + 2p,   BIC = n * ln(MSE) + p * ln(n)
# Using the raw column count p ignores the shrinkage penalty, so treat these as rough guides.
n = len(y_train_aligned)
p = X_train_aligned.shape[1]

ridge_mse = mean_squared_error(y_train_aligned, ridge_model.predict(X_train_aligned))
lasso_mse = mean_squared_error(y_train_aligned, lasso_model.predict(X_train_aligned))

ridge_aic = n * np.log(ridge_mse) + 2 * p
ridge_bic = n * np.log(ridge_mse) + p * np.log(n)

lasso_aic = n * np.log(lasso_mse) + 2 * p
lasso_bic = n * np.log(lasso_mse) + p * np.log(n)

# Training R^2, AIC, and BIC for Ridge, then for Lasso
ridge_model.score(X_train_aligned, y_train_aligned), ridge_aic, ridge_bic, lasso_model.score(X_train_aligned, y_train_aligned), lasso_aic, lasso_bic
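
To see why a ridge penalty stabilizes coefficients under multicollinearity, the sketch below solves ridge regression in closed form on centered data, beta = (X'X + alpha I)^(-1) X'y, for two almost identical synthetic features. The helper `ridge_fit` and the data are assumptions made for illustration; this is not the solver scikit-learn uses internally.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients on centered data (intercept handled by centering)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    k = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(k), Xc.T @ yc)

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)        # almost collinear with x1
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + rng.normal(scale=0.5, size=n)

alphas = [0.0, 1.0, 100.0]
coefs = [ridge_fit(X_demo, y_demo, a) for a in alphas]
norms = [np.linalg.norm(c) for c in coefs]

for a, c in zip(alphas, coefs):
    print(a, c)
```

With `alpha = 0` (plain least squares) the two coefficients can take large, offsetting values because the data cannot tell the features apart. A small penalty pulls them toward a shared, moderate value whose sum stays close to the true effect of 3, and a large penalty shrinks both further.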


## Visualizing Correlation Matrix and Pair Plots

In [None]:

# Correlation heatmap to visualize relationships
plt.figure(figsize=(10,8))
sns.heatmap(df_encoded.corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix of Features')
plt.show()

# Pairplot to visualize pairwise relationships between variables
sns.pairplot(df_encoded[['Price', 'Age_08_04', 'KM', 'HP', 'cc', 'Weight']])
plt.show()


## Regression Model Residual Plot

In [None]:

# Residual plot on the training data: a roughly flat LOWESS line around zero
# suggests no obvious non-linearity left in the residuals
plt.figure(figsize=(10,6))
sns.residplot(x=model.predict(X_train_aligned), y=y_train_aligned, lowess=True, line_kws={'color': 'red'})
plt.title('Residual Plot')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.show()


## Ridge and Lasso Regression Coefficient Plot

In [None]:

# Coefficients from Ridge and Lasso models
ridge_coefficients = ridge_model.coef_
lasso_coefficients = lasso_model.coef_
feature_names = X_train_aligned.columns

# Plotting the coefficients, labelled by feature name
plt.figure(figsize=(10,6))
plt.plot(ridge_coefficients, label='Ridge Coefficients', marker='o')
plt.plot(lasso_coefficients, label='Lasso Coefficients', marker='x')
plt.axhline(0, color='gray', linestyle='--')
plt.xticks(range(len(feature_names)), feature_names, rotation=45, ha='right')
plt.title('Ridge and Lasso Coefficients')
plt.legend()
plt.tight_layout()
plt.show()
