### Linear Regression

# 1️⃣ What is Linear Regression?
Linear Regression is a supervised learning algorithm that predicts a continuous target variable from one or more independent variables (features). It models the target as a linear combination of the features and chooses the coefficients that minimize the sum of squared differences between actual and predicted values (ordinary least squares).
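
The fitting goal described above can be written compactly. The notation here ($\beta_j$ for coefficients, $p$ features, $n$ training rows) is introduced only for illustration and does not come from the original notes:

$$
\hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij},
\qquad
\min_{\beta_0, \dots, \beta_p} \; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$

Here $\beta_0$ is the intercept, and the quantity being minimized is the sum of squared prediction errors over the training rows.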

# 2️⃣ Types of Linear Regression
#### Simple Linear Regression
Uses one independent variable to predict the target.  
Example: Predict house price based on size.  
Data format:

| Size (sq ft) | Price  |
|--------------|--------|
| 1000         | 150000 |
| 1500         | 200000 |
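
As a rough illustration of what "fitting a line" means for the two rows above, the sketch below computes the least-squares slope and intercept by hand in plain Python. The helper name `fit_simple_ols` and the 1200 sq ft query are made up for this example.

```python
# House-size data from the example above
sizes = [1000.0, 1500.0]          # sq ft
prices = [150000.0, 200000.0]

def fit_simple_ols(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Covariance-like and variance-like sums (unnormalized)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

slope, intercept = fit_simple_ols(sizes, prices)
predicted_price = intercept + slope * 1200   # hypothetical 1200 sq ft house
print(slope, intercept, predicted_price)
```

With only two rows the line passes through both points exactly; a realistic dataset needs many more rows for the fit to mean anything.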

#### Multiple Linear Regression 
Uses multiple independent variables to predict the target.  
Example: Predict house price based on size, number of bedrooms, and location. 
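Data format:

| Size (sq ft) | Bedrooms | Location (encoded) | Price  |
|--------------|----------|--------------------|--------|
| 1000         | 2        | 1                  | 150000 |
| 1500         | 3        | 2                  | 200000 |
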
# 3️⃣ Suitable Data Types for Linear Regression  
Numerical Data (Continuous or Integer) – Features must be numeric for the model to use them, and the target must be continuous.  
Categorical Data (After Encoding) – Usable once converted to a numerical format with techniques like One-Hot Encoding (OHE).  
# 4️⃣ Data Preprocessing for Linear Regression 
To ensure the data is properly formatted and ready for modeling:  
#### 1. Handle Missing Values  
Drop or impute missing data:  
data.dropna(inplace=True)  # Drop rows with missing values  
or  
data.fillna(data.mean(numeric_only=True), inplace=True)  # Fill numeric columns with their mean (median or mode are alternatives)  
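
To see the difference between the two options, here is a small sketch on a made-up three-row frame (the column names and values are illustrative, not taken from any dataset in these notes):

```python
import pandas as pd

# Toy frame with one gap in each column (values are made up)
df = pd.DataFrame({
    "size": [1000.0, None, 1500.0],
    "price": [150000.0, 200000.0, None],
})

dropped = df.dropna()                            # keeps only fully populated rows
imputed = df.fillna(df.mean(numeric_only=True))  # fills each gap with its column mean

print(len(dropped))
print(imputed)
```

Dropping loses two of the three rows here, while imputing keeps every row at the cost of inventing values, which is the trade-off to weigh on real data.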
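#### 2. Encode Categorical Variables  
One-Hot Encoding (OHE) converts each categorical feature into binary (0/1) columns. Dropping the first category avoids the dummy variable trap. Use either pandas or scikit-learn:  
import pandas as pd  
from sklearn.preprocessing import OneHotEncoder  
encoded_data = pd.get_dummies(data, drop_first=True)  # Option A: pandas  
ohe = OneHotEncoder(drop='first', sparse_output=False)  # Option B: scikit-learn (sparse_output replaced sparse in version 1.2)  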
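
A minimal sketch of what `drop_first=True` does, using a made-up frame with a three-category `region` column (the data and column names are assumptions for illustration):

```python
import pandas as pd

# Toy frame: one numeric and one categorical column (values are made up)
df = pd.DataFrame({
    "size": [1000, 1500, 1200],
    "region": ["north", "south", "east"],
})

# Categories are sorted alphabetically; the first one ("east") is dropped
encoded = pd.get_dummies(df, columns=["region"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

The dropped category becomes the baseline: a row with zeros in every remaining `region_*` column belongs to it.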
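#### 3. Feature Scaling   
Ordinary least squares gives the same predictions with or without scaling, but standardized features make coefficients comparable and matter for gradient-based solvers and regularized variants such as Ridge or Lasso. Fit the scaler on the training split only (see the split in section 5) so that test data does not leak into it.

from sklearn.preprocessing import StandardScaler 
scaler = StandardScaler()   
X_train_scaled = scaler.fit_transform(X_train)  # Learn mean and std from the training data  
X_test_scaled = scaler.transform(X_test)  # Reuse the training statistics on the test data  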
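
The sketch below checks, on synthetic data, how standardization interacts with a plain `LinearRegression`: the scaler is fitted on the training rows only, and predictions from the raw and scaled models are compared. The data, the coefficients used to generate it, and the variable names are all assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on a large scale plus some noise
rng = np.random.default_rng(0)
X_train = rng.uniform(500, 3000, size=(50, 2))
X_test = rng.uniform(500, 3000, size=(10, 2))
y_train = 100 * X_train[:, 0] + 20 * X_train[:, 1] + rng.normal(0, 1000, size=50)

# Fit the scaler on training data only, then reuse it for the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

pred_raw = LinearRegression().fit(X_train, y_train).predict(X_test)
pred_scaled = LinearRegression().fit(X_train_scaled, y_train).predict(X_test_scaled)

# Compare the two sets of predictions
print(np.allclose(pred_raw, pred_scaled))
```

For ordinary least squares the two models agree up to floating-point error; the scaled version mainly changes how the coefficients read, since each one then refers to a one-standard-deviation change in its feature.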
# 5️⃣ Splitting Data for Training & Testing
from sklearn.model_selection import train_test_split

X = data.iloc[:, :-1].values # Independent variables 
y = data.iloc[:, -1].values # Dependent variable 


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  
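
A quick way to confirm what `test_size=0.2` does is to split a small made-up array and look at the shapes (the arrays here are placeholders, not real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten placeholder rows with two features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the shuffle repeatable, so the same rows land in the test set on every run.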
# 6️⃣ Model Training & Prediction  

from sklearn.linear_model import LinearRegression  



model = LinearRegression()  
model.fit(X_train, y_train)   
y_pred = model.predict(X_test)   
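
After fitting, the learned line can be inspected through the `coef_` and `intercept_` attributes. The sketch below uses synthetic, noise-free data generated from known coefficients (an assumption made so the result is easy to check):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated as y = 3*x1 + 2*x2 + 5
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 3 * X[:, 0] + 2 * X[:, 1] + 5

model = LinearRegression().fit(X, y)
print(model.coef_)       # one coefficient per feature
print(model.intercept_)  # the constant term
```

Because the data has no noise, the fit should recover coefficients close to 3 and 2 and an intercept close to 5; on real data the values are estimates with uncertainty.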

# 7️⃣ Evaluation Metrics for Linear Regression 

1. Mean Squared Error (MSE)  
   Measures average squared difference between actual and predicted values.
   
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}") 
2. Root Mean Squared Error (RMSE)
Square root of MSE, interpretable in the same units as the target.
import numpy as np
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error: {rmse}") 
3. Mean Absolute Error (MAE)
Average absolute difference between actual and predicted values.
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae}") 

4. R-squared (R²) Score
Proportion of variance in the target variable explained by the model.

from sklearn.metrics import r2_score  
r2 = r2_score(y_test, y_pred)  
print(f"R-Squared Score: {r2}")   
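
To make the four definitions concrete, the sketch below computes them by hand in plain Python on four made-up values. The function name `regression_metrics` and the numbers are illustrative only.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R² from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errors) / n
    y_mean = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)              # unexplained variation
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)   # total variation
    r2 = 1 - ss_res / ss_tot
    return mse, rmse, mae, r2

# Made-up actual and predicted values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 10.0]

mse, rmse, mae, r2 = regression_metrics(y_true, y_pred)
print(mse, rmse, mae, r2)
```

For these values the squared errors sum to 1.5, giving an MSE of 0.375, an MAE of 0.5, and an R² of 0.925.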

# 8️⃣ Model Performance Interpretation
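MSE/RMSE/MAE: Lower values indicate a better fit. All three depend on the scale of the target, so compare them only across models trained on the same target.  
R² Score:   
Close to 1 → The model explains most of the variance in the target  
Close to 0 → The model does no better than always predicting the mean of the target  
Negative → Worse than always predicting the mean  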
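
The R² reference points can be checked with a few lines of plain Python. The helper `r2` and the values below are made up for illustration:

```python
def r2(y_true, y_pred):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

r2_perfect = r2(y_true, [1.0, 2.0, 3.0, 4.0])    # predictions match exactly
r2_mean = r2(y_true, [2.5, 2.5, 2.5, 2.5])       # always predict the mean
r2_reversed = r2(y_true, [4.0, 3.0, 2.0, 1.0])   # predictions in reverse order

print(r2_perfect, r2_mean, r2_reversed)
```

The perfect predictor scores 1, the constant mean predictor scores 0, and the reversed predictions score -3, which shows how R² goes negative once the model does worse than the mean.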
# 9️⃣ Common Mistakes & Pitfalls to Avoid  
❌ Overfitting: Model fits training data too well and performs poorly on unseen data.  
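✅ Solution: Use a train-test split and cross-validation to detect it, then reduce the number of features or switch to a regularized model (Ridge, Lasso) to address it.  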

❌ Multicollinearity: High correlation between independent variables makes the coefficient estimates unstable and hard to interpret.  
✅ Solution: Check the correlation matrix and drop or combine highly correlated features.  

❌ Ignoring Outliers: Outliers can skew the model.  
✅ Solution: Detect outliers with the IQR or Z-score method, then remove, cap, or investigate them.  
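
As one possible way to apply the IQR method mentioned above, the sketch below flags values more than 1.5 × IQR outside the quartiles. The 1.5 multiplier is the common convention rather than something stated in these notes, and the data is made up.

```python
import numpy as np

# Made-up feature values with one obvious outlier
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # conventional lower fence
upper = q3 + 1.5 * iqr   # conventional upper fence

mask = (values >= lower) & (values <= upper)
kept = values[mask]       # rows inside the fences
outliers = values[~mask]  # rows flagged as outliers

print(kept, outliers)
```

Whether flagged rows should be dropped, capped, or kept depends on whether they are data errors or genuine extreme cases.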

# 🔟 Real-Life Use Cases of Linear Regression  
🏡 House Price Prediction – Predict prices based on features like size, location, etc.   
📈 Stock Market Analysis – Estimate future stock prices. 
🚗 Car Price Estimation – Predict car price using model, mileage, and brand.  
📊 Sales Forecasting – Predict sales revenue based on historical data.


In [2]:
#Linear Regression
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

from sklearn.preprocessing import LabelEncoder


In [3]:
data = pd.read_csv("medical.csv")

In [None]:
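# Preparing data
# LabelEncoder maps each category to an integer. For a nominal column with more than
# two categories this imposes an arbitrary order, which one-hot encoding avoids.
encoder=LabelEncoder()
data['sex']=encoder.fit_transform(data['sex'])
data['smoker']=encoder.fit_transform(data['smoker'])
data['region']=encoder.fit_transform(data['region'])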

# Feature selection
X = data.iloc[:, :-1].values  # Features: every column except the last
y = data.iloc[:, -1].values   # Target: the last column (continuous)

# Splitting the data
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

#model creation
model = LinearRegression()
#training the model
model.fit(x_train, y_train)

# predictions
y_pred = model.predict(x_test)


#Evaluation
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print("R2 Score:", r2)
print("Mean Squared Error:", mse)


R2 Score: 0.799874714544996
Mean Squared Error: 31845929.134159416
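
MSE is expressed in squared units of the target, so the figure above is hard to read on its own. Taking the square root puts the error back on the target's scale. A minimal sketch using the printed value (the variable names are illustrative):

```python
import math

mse = 31845929.134159416   # MSE value printed above
rmse = math.sqrt(mse)      # back in the units of the target
print(f"RMSE: {rmse:.2f}")
```

This comes out at roughly 5,643 in the target's units, which is easier to judge against typical target values than the raw MSE.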
