# 🏠 Task 3: Linear Regression using Scikit-learn

This notebook implements simple linear regression on a housing dataset using `scikit-learn`, `pandas`, and `matplotlib`.

Dataset: [Housing Price Prediction](https://www.kaggle.com/datasets/harishkumardatalab/housing-price-prediction)


In [None]:
# Step 0: Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


In [None]:
# Step 1: Load and preview the dataset
# Make sure housing.csv is in the same directory or adjust the path accordingly
df = pd.read_csv("housing.csv")
print("First 5 rows of the dataset:")
df.head()


In [None]:
# Step 2: Data Cleaning and Preprocessing
# Check for missing values
print("Missing values in dataset:")
print(df.isnull().sum())

# Drop missing rows
df = df.dropna()


In [None]:
# Step 3: Feature Selection
# Simple Linear Regression using 'area' as independent variable
X = df[['area']]
y = df['price']


In [None]:
# Step 4: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# Step 5: Fit Linear Regression Model
model = LinearRegression()
model.fit(X_train, y_train)
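
For a single feature, the least-squares line fitted in Step 5 has a closed form: the slope is the covariance of area and price divided by the variance of area, and the intercept makes the line pass through the point of means. A minimal NumPy sketch of that calculation, using made-up numbers (`toy_area`, `toy_price`, and the helper `fit_simple_line` are assumptions for illustration, not taken from `housing.csv` or from scikit-learn):

```python
import numpy as np

def fit_simple_line(x, y):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    # Slope = sum of centered cross-products / sum of squared centered x
    slope = np.sum(x_centered * (y - y.mean())) / np.sum(x_centered ** 2)
    # The fitted line always passes through (mean of x, mean of y)
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

# Hypothetical toy data lying exactly on price = 50_000 + 300 * area
toy_area = np.array([1000.0, 1500.0, 2000.0, 2500.0])
toy_price = 50_000 + 300 * toy_area

slope, intercept = fit_simple_line(toy_area, toy_price)
print(f"slope: {slope:.2f}, intercept: {intercept:.2f}")
```

On the fitted scikit-learn model, the same two quantities are available as `model.coef_[0]` and `model.intercept_`.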


In [None]:
# Step 6: Evaluate the model
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error: {mae}")
print(f"Mean Squared Error: {mse}")
print(f"R² Score: {r2}")
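
The three metrics in Step 6 reduce to a few lines of arithmetic on the residuals. The sketch below recomputes them with NumPy on a small made-up example (`toy_true`, `toy_pred`, and the helper `regression_metrics` are assumptions for illustration). It also adds RMSE, the square root of MSE, which is expressed in the same unit as price.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R² from actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))   # average absolute error
    mse = np.mean(errors ** 2)      # average squared error
    rmse = np.sqrt(mse)             # back in the unit of the target

    ss_res = np.sum(errors ** 2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Hypothetical values, chosen so the arithmetic is easy to check by hand
toy_true = [100.0, 200.0, 300.0, 400.0]
toy_pred = [110.0, 190.0, 320.0, 380.0]

metrics = regression_metrics(toy_true, toy_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Here the residuals are -10, 10, -20 and 20, so the two larger errors account for most of the MSE while contributing to the MAE only in proportion to their size.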


In [None]:
# Step 7: Visualization
plt.scatter(X_test, y_test, color='blue', label='Actual')
plt.plot(X_test, y_pred, color='red', label='Predicted')
plt.xlabel('Area')
plt.ylabel('Price')
plt.title('Linear Regression: Area vs Price')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


## 🧠 Interview Questions

1. **What assumptions does linear regression make?**  
   - Linearity, independence of errors, homoscedasticity (constant error variance), normality of residuals, and, in multiple regression, no strong multicollinearity among the predictors.

2. **How do you interpret the coefficients?**  
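   - Each coefficient is the expected change in the dependent variable for a one-unit increase in that independent variable, holding the other variables constant. In this notebook, the single coefficient is the estimated change in price per additional unit of area.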

3. **What is R² score and its significance?**  
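   - R² is the proportion of the variance in the dependent variable that the model explains. A value of 1 is a perfect fit, 0 means the model does no better than always predicting the mean, and on test data it can be negative.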

4. **When would you prefer MSE over MAE?**  
   - Use MSE when large errors are disproportionately costly: squaring the errors penalizes them more heavily than MAE does, which also makes MSE more sensitive to outliers.

5. **How do you detect multicollinearity?**  
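   - Compute the Variance Inflation Factor (VIF) for each predictor, or inspect the correlation matrix of the predictors. A VIF above roughly 5 to 10 is a common rule of thumb for problematic collinearity.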

6. **What is the difference between simple and multiple regression?**  
   - Simple regression has one independent variable; multiple regression has two or more.

7. **Can linear regression be used for classification?**  
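   - Not directly. Linear regression predicts an unbounded continuous value, so for classification tasks use logistic regression, which models class probabilities.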

8. **What happens if you violate regression assumptions?**  
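   - Coefficient estimates can become biased or unstable, and standard errors, confidence intervals, and p-values become unreliable, so both inference and predictions may be misleading.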
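
The VIF from question 5 can be computed by regressing each predictor on the remaining ones and taking 1 / (1 - R²). A minimal NumPy sketch under assumed synthetic data: the three columns are generated here and are not columns of the housing dataset, and `variance_inflation_factors` is a helper written for this example, not a library function.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return one VIF per column of X (rows are samples, columns are predictors)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = []
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the other columns plus an intercept term
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic predictors (assumption): x2 is nearly a copy of x1, x3 is unrelated
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
x3 = rng.normal(size=500)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs)
```

The two nearly identical columns should show large VIFs, while the unrelated column should stay close to 1. The sketch does not guard against perfectly collinear columns, where R² equals 1 and the division breaks down.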
