### **3. Mini-Project: Predicting Car Prices**
**Objective**: Build a model to predict car prices based on features such as **car age**, **mileage**, and **engine size**.

**Steps**:
1. **Collect Data**: Obtain a dataset (for example, from Kaggle) that includes features such as age, mileage, and engine size, along with the price.
   
2. **Preprocess Data**: Clean and process the dataset (handle missing values, outliers, etc.).
   
3. **Apply Multiple Linear Regression**: Model the relationship between car features (predictor variables) and price (dependent variable).
   
4. **Evaluate the Model**: Use R-squared (R²), Mean Absolute Error (MAE), and Mean Squared Error (MSE) to evaluate your model.
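
As a rough illustration of step 4, the metrics can be computed by hand with NumPy. The four prices and predictions below are made-up values chosen only to keep the arithmetic easy to follow, and `p = 1` is an assumed predictor count used for the adjusted R² line.

```python
import numpy as np

# Hypothetical actual and predicted prices (illustrative values only)
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_hat = np.array([11.0, 11.0, 16.0, 18.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))   # Mean Absolute Error -> 1.25
mse = np.mean(errors ** 2)      # Mean Squared Error  -> 1.75
rmse = np.sqrt(mse)             # Root MSE, in the same units as the price

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot      # about 0.877

# Adjusted R-squared penalizes extra predictors (n samples, p predictors)
n, p = len(y_true), 1           # p = 1 is an assumption for this toy example
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse)
print("R-squared:", r2, "Adjusted R-squared:", adj_r2)
```

MAE and RMSE are in the same units as the price, which often makes them easier to interpret than MSE.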

**Code Example** (using `pandas` for preprocessing and scikit-learn for modeling):
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load dataset
df = pd.read_csv('car_data.csv')

# Preprocessing (handling missing values, encoding categorical variables, etc.)
df = df.dropna()  # Example of handling missing values

# Define predictors (X) and target variable (y)
X = df[['Age', 'Mileage', 'EngineSize']]  # Features
y = df['Price']  # Target variable

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("R-squared:", r2_score(y_test, y_pred))
print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
```
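
The example above handles missing values by dropping rows and does not encode any categorical columns. A slightly fuller preprocessing step might look like the sketch below. The small in-memory table, the choice of median imputation, and the `Fuel_Type` column are assumptions made for illustration, not a description of any particular dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with gaps and one categorical column
raw = pd.DataFrame({
    "Age": [3, 7, np.nan, 12, 5],
    "Mileage": [18.5, 14.2, 21.0, np.nan, 16.8],
    "EngineSize": [1.2, 2.0, 1.5, 3.0, 1.8],
    "Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol", "Diesel"],
    "Price": [6.5, 4.1, 5.0, 2.3, np.nan],
})

# Rows without a target value cannot be used for training, so drop them
clean = raw.dropna(subset=["Price"]).copy()

# Fill missing numeric features with the column median instead of dropping rows
numeric_cols = ["Age", "Mileage", "EngineSize"]
clean[numeric_cols] = clean[numeric_cols].fillna(clean[numeric_cols].median())

# One-hot encode the categorical column; drop_first avoids a redundant column
clean = pd.get_dummies(clean, columns=["Fuel_Type"], drop_first=True)

print(clean)
```

Median imputation keeps more rows than `dropna()`, at the cost of slightly understating the variance of the imputed columns. Which trade-off is appropriate depends on how much data is missing.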

---

### **4. Practical Tips for Linear Regression in Data Science**
1. **Check Assumptions**: Ensure that your data meets the assumptions of linear regression (linearity, homoscedasticity, independence, normality of errors).
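2. **Scaling**: Ordinary least squares predictions do not depend on the scale of the features, but standardizing them (StandardScaler or MinMaxScaler in scikit-learn) makes coefficients comparable and is needed for regularized variants such as Ridge and Lasso.
3. **Correlation**: Check for multicollinearity. If predictors are highly correlated with one another, the coefficient estimates become unstable and hard to interpret. The variance inflation factor (VIF) is a common way to detect this.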
4. **Model Evaluation**: Use metrics like R-squared, Adjusted R-squared, RMSE, MAE, and MSE to evaluate the performance of your model.
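
To make the multicollinearity tip concrete, here is a minimal sketch that computes the VIF of each predictor using only NumPy, via the definition VIF = 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The helper name `vif` and the synthetic data (in which kilometers driven is made to grow with age) are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    result = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r2 = 1.0 - residuals.var() / target.var()
        result[j] = 1.0 / (1.0 - r2)
    return result

rng = np.random.default_rng(0)
n = 300
age = rng.uniform(1, 20, size=n)
# Hypothetical relationship: older cars have been driven further
kms_driven = 12000 * age + rng.normal(0, 15000, size=n)
engine_size = rng.uniform(1.0, 5.0, size=n)  # independent of the other two

vifs = vif(np.column_stack([age, kms_driven, engine_size]))
for name, value in zip(["Age", "Kms_Driven", "EngineSize"], vifs):
    print(f"{name}: VIF = {value:.2f}")
```

A VIF close to 1 means the predictor is nearly uncorrelated with the others. Values above roughly 5 to 10 are often treated as a warning sign, although that threshold is a rule of thumb rather than a strict cutoff.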

---

### **In Conclusion**
By applying linear regression to real-world datasets and working through these exercises and mini-projects, you'll gain a deeper understanding of how linear regression works and how it can be applied to various data science tasks.

In [None]:
import pandas as pd
import numpy as np

# Seed for reproducibility
np.random.seed(42)

# Number of records
num_records = 500

# Generate synthetic data
data = {
    "Car_Name": np.random.choice(
        ["Maruti", "Hyundai", "Honda", "Toyota", "Ford", "BMW", "Audi", "Tata", "Mahindra", "Suzuki"],
        size=num_records
    ),
    "Year": np.random.randint(2000, 2023, size=num_records),
    "Present_Price": np.round(np.random.uniform(2.0, 60.0, size=num_records), 2),
    "Kms_Driven": np.random.randint(5000, 200000, size=num_records),
    "Fuel_Type": np.random.choice(["Petrol", "Diesel", "CNG"], size=num_records),
    "Seller_Type": np.random.choice(["Individual", "Dealer"], size=num_records),
    "Transmission": np.random.choice(["Manual", "Automatic"], size=num_records),
    "Owner": np.random.choice([0, 1, 2], size=num_records),
    "Mileage": np.round(np.random.uniform(10.0, 30.0, size=num_records), 2),  # in kmpl
    "EngineSize": np.round(np.random.uniform(1.0, 5.0, size=num_records), 2),  # in liters
    "Age": 2023 - np.random.randint(2000, 2023, size=num_records),  # NOTE: drawn independently, so it does not match "Year"
    "Selling_Price": np.round(np.random.uniform(1.0, 50.0, size=num_records), 2)
}

# Create a DataFrame
df = pd.DataFrame(data)

# Save to CSV
df.to_csv("car_data.csv", index=False)

print("Dataset generated and saved as 'car_data.csv'")


Dataset generated and saved as 'car_data.csv'
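
In the cell above, `Age` comes from its own random draw, so it does not agree with the `Year` column. One way to keep the two consistent is to draw the year once and derive the age from it, as in the sketch below. The variable names `reference_year` and `cars` are assumptions, and this variant produces different random values from the dataset saved above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
num_records = 500
reference_year = 2023  # same reference year as in the cell above

# Draw the manufacturing year once, then derive the age from it
year = rng.integers(2000, 2023, size=num_records)  # 2000..2022 inclusive
cars = pd.DataFrame({
    "Year": year,
    "Age": reference_year - year,
})

print(cars.head())
```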


In [None]:
df

| | Car_Name | Year | Present_Price | Kms_Driven | Fuel_Type | Seller_Type | Transmission | Owner | Mileage | EngineSize | Age | Selling_Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Audi | 2008 | 15.28 | 189697 | Petrol | Dealer | Manual | 2 | 13.31 | 2.94 | 2 | 47.35 |
| 1 | Toyota | 2016 | 33.49 | 57098 | Petrol | Individual | Manual | 2 | 18.91 | 2.33 | 16 | 15.50 |
| 2 | Tata | 2016 | 27.03 | 151323 | CNG | Individual | Manual | 2 | 14.18 | 4.17 | 1 | 1.05 |
| 3 | Ford | 2019 | 21.30 | 127413 | Petrol | Dealer | Automatic | 2 | 11.00 | 2.81 | 21 | 14.33 |
| 4 | Audi | 2015 | 44.37 | 5526 | Petrol | Dealer | Automatic | 1 | 26.87 | 1.73 | 16 | 11.68 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | Maruti | 2022 | 32.76 | 36007 | CNG | Individual | Manual | 1 | 24.47 | 2.38 | 20 | 8.19 |
| 496 | Audi | 2001 | 3.04 | 175604 | Diesel | Individual | Automatic | 0 | 20.71 | 3.93 | 23 | 34.83 |
| 497 | Audi | 2013 | 48.21 | 59253 | Petrol | Dealer | Manual | 2 | 26.72 | 3.63 | 11 | 47.31 |
| 498 | Mahindra | 2020 | 19.06 | 17323 | Diesel | Individual | Automatic | 1 | 26.43 | 4.73 | 20 | 34.44 |


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load dataset
df = pd.read_csv('car_data.csv')

# Preprocessing (handling missing values, encoding categorical variables, etc.)
df = df.dropna()

# Define predictors (x) and target variable (y)
x = df[['Age', 'Mileage', 'EngineSize']]
y = df['Selling_Price']

# Split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Create and train model
model = LinearRegression()
model.fit(x_train, y_train)

# Make predictions
y_pred = model.predict(x_test)

# Evaluate the model
print("R-squared: ", r2_score(y_test, y_pred))
print("Mean Absolute Error: ", mean_absolute_error(y_test, y_pred))
print("Mean Squared Error: ", mean_squared_error(y_test, y_pred))

R-squared:  -0.002120796950441095
Mean Absolute Error:  10.997476839540905
Mean Squared Error:  172.6692195597508
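
An R-squared of roughly zero is what one would expect here: in the synthetic dataset, `Selling_Price` is drawn uniformly at random and has no relationship to `Age`, `Mileage`, or `EngineSize`, so the model cannot do better than predicting the mean. The sketch below contrasts a pure-noise target with a target that actually depends on the features. The coefficients in `y_signal` and the helper `holdout_r2` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Features on roughly the same ranges as the synthetic dataset
X = np.column_stack([
    rng.integers(1, 24, size=n),       # age in years
    rng.uniform(10.0, 30.0, size=n),   # mileage (kmpl)
    rng.uniform(1.0, 5.0, size=n),     # engine size (liters)
]).astype(float)

def holdout_r2(X, y, n_train=400):
    """Fit least squares on the first n_train rows, score R-squared on the rest."""
    A = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    coef, *_ = np.linalg.lstsq(A[:n_train], y[:n_train], rcond=None)
    y_test, y_hat = y[n_train:], A[n_train:] @ coef
    ss_res = np.sum((y_test - y_hat) ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Target unrelated to the features, as in the generated dataset
y_noise = rng.uniform(1.0, 50.0, size=n)

# Hypothetical target that depends on the features, plus some noise
y_signal = (40.0 - 1.2 * X[:, 0] + 0.3 * X[:, 1] + 2.0 * X[:, 2]
            + rng.normal(0.0, 2.0, size=n))

r2_noise = holdout_r2(X, y_noise)
r2_signal = holdout_r2(X, y_signal)
print("R-squared, random target:   ", r2_noise)
print("R-squared, dependent target:", r2_signal)
```

To make the mini-project informative, the synthetic `Selling_Price` would need to be generated from the features in a similar way, or a real dataset used instead.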
