
# Project 1: Comprehensive Regression Analysis
### Course: Introduction to Machine Learning

This notebook guides you through a comprehensive regression analysis. You will implement several regression methods and regularization techniques, then compare the models on error metrics (MSE, MAE, R²) and computational time.


## Submission Instructions

Once you are finished, follow these steps:

1. Make sure you have provided the team name and the name and ID of each team member.
2. Restart the kernel and re-run this notebook from beginning to end by going to Kernel > Restart Kernel and Run All Cells. If the run stops partway through, there was an error; correct it and repeat this step until the notebook runs from beginning to end.
3. Double check that there is a number next to each code cell and that these numbers are in order.
4. Go to File > Print > Save as PDF. Double check that the entire notebook, from beginning to end, is in this PDF file.
5. Upload the PDF and the notebook to Google Classroom.




### Team Name: **Insight Engineers** 
### Name and ID of Member 1: **Sayan Das** - ``B2430035``
### Name and ID of Member 2: **Raihan Uddin** - ``B2430070``

In [2]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time
# Add any other libraries you need



## 1. Load the Dataset
**Instruction:** Load the chosen dataset and display its basic information and statistics. You may use any well-known dataset.

In [7]:

# Loading a particular dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

# Build a DataFrame of the features and display the first few rows
dataset = pd.DataFrame(housing.data, columns=housing.feature_names)
dataset.head()



{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
          37.88      , -122.23      ],
       [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
          37.86      , -122.22      ],
       [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
          37.85      , -122.24      ],
       ...,
       [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
          39.43      , -121.22      ],
       [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
          39.43      , -121.32      ],
       [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
          39.37      , -121.24      ]]), 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]), 'frame': None, 'target_names': ['MedHouseVal'], 'feature_names': ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude'], 'DESCR': '.. _california_housing_dataset:\n\nCalifornia Housing dataset\n-

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |


In [4]:

# Display basic statistics
print(dataset.describe())


             MedInc      HouseAge      AveRooms     AveBedrms    Population  \
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000   
mean       3.870671     28.639486      5.429000      1.096675   1425.476744   
std        1.899822     12.585558      2.474173      0.473911   1132.462122   
min        0.499900      1.000000      0.846154      0.333333      3.000000   
25%        2.563400     18.000000      4.440716      1.006079    787.000000   
50%        3.534800     29.000000      5.229129      1.048780   1166.000000   
75%        4.743250     37.000000      6.052381      1.099526   1725.000000   
max       15.000100     52.000000    141.909091     34.066667  35682.000000   

           AveOccup      Latitude     Longitude  
count  20640.000000  20640.000000  20640.000000  
mean       3.070655     35.631861   -119.569704  
std       10.386050      2.135952      2.003532  
min        0.692308     32.540000   -124.350000  
25%        2.429741     33.930000   -1


## 2. Data Preprocessing
**Instruction:** Perform any necessary preprocessing steps, including handling missing values, encoding categorical variables, and scaling features if required.
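
As a rough guide to the split-and-scale step, here is a minimal sketch. It uses a synthetic feature matrix as a stand-in for `housing.data` and `housing.target` (an assumption, so the cell runs without a download); the names `X_train_scaled` and `X_test_scaled` are illustrative and not part of the template.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for housing.data / housing.target (assumption, illustration only)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=1000)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits,
# so no information from the test set leaks into the preprocessing
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Scaling matters most for the gradient descent and regularized models later in the notebook; ordinary least squares predictions are unaffected by it.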


In [6]:

# Checking for missing values
print(dataset.isnull().sum())

# Split the dataset into features (X) and target (y)
# X = 
# y = 

# Split the data into training and test sets (e.g., 80% training, 20% testing)
# X_train, X_test, y_train, y_test = 


MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
dtype: int64



## 3. Simple Linear Regression
**Instruction:** Implement a simple linear regression model using scikit-learn.
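
One possible shape for this cell is sketched below on synthetic data (an assumption; in the notebook you would use the `X_train`, `X_test`, `y_train`, `y_test` from Section 2). The variable names `mse`, `mae`, and `r2` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic linear data with known coefficients (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y = X @ true_coef + 3.0 + rng.normal(scale=0.1, size=500)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define and fit the model, then predict on the held-out data
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
y_pred = linear_model.predict(X_test)

# Performance metrics on the test set
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}  MAE: {mae:.4f}  R^2: {r2:.4f}")
```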


In [None]:

# Define the linear regression model


# Fit the model on the training data


# Predict on the test data


# Calculate performance metrics [MSE, MAE, R^2]


# Print the metrics




## 4. Polynomial Regression
**Instruction:** Implement polynomial regression for degrees 2, 3, and 4.
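
A hedged sketch of one way to loop over the three degrees follows. It chains `PolynomialFeatures`, `StandardScaler`, and `LinearRegression` in a pipeline, which is a design choice rather than something the template prescribes, and it runs on synthetic quadratic data (an assumption). The dictionary `poly_mse` is an illustrative name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a genuinely quadratic relationship (illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = 1.0 + X[:, 0] ** 2 - X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

poly_mse = {}
for degree in (2, 3, 4):
    # Expand the features, rescale them, then fit ordinary least squares
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    poly_mse[degree] = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree}: test MSE = {poly_mse[degree]:.4f}")
```

With 8 input features, the number of polynomial terms grows quickly with the degree, so expect degree 4 to be noticeably slower to fit on the housing data.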


In [None]:

# Define polynomial features


# Split the transformed data into training and test sets
# X_train_poly, X_test_poly, y_train, y_test = 

# Fit the linear model on polynomial features


# Predict and evaluate performance
# y_pred_poly = 

# Calculate performance metrics for polynomial regression


# Print the metrics for polynomial regression



## 5. Gradient Descent Methods
**Instruction:** Implement batch, stochastic, and mini-batch gradient descent for linear regression.
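
The three variants differ only in how many rows feed each gradient step, so one way to think about them is a single mini-batch routine where `batch_size=len(y)` gives batch gradient descent and `batch_size=1` gives stochastic gradient descent. The sketch below takes that view; the function name `mini_batch_gradient_descent`, its defaults, and the synthetic data are assumptions for illustration, and `X` is expected to already contain a bias column of ones.

```python
import numpy as np

def mini_batch_gradient_descent(X, y, learning_rate=0.05, epochs=200,
                                batch_size=32, seed=0):
    """Minimise the half-MSE cost; returns theta and the per-epoch cost history."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(epochs):
        order = rng.permutation(m)  # reshuffle the rows every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ theta - y[idx]
            gradient = X[idx].T @ error / len(idx)
            theta -= learning_rate * gradient
        # Cost on the full data set after this epoch
        history.append(np.mean((X @ theta - y) ** 2) / 2)
    return theta, history

# Synthetic data with a known parameter vector (illustration only)
rng = np.random.default_rng(1)
X_b = np.c_[np.ones(500), rng.normal(size=(500, 2))]
true_theta = np.array([4.0, 3.0, -2.0])
y = X_b @ true_theta + rng.normal(scale=0.1, size=500)

theta_mb, history_mb = mini_batch_gradient_descent(X_b, y)
theta_batch, history_batch = mini_batch_gradient_descent(X_b, y, batch_size=len(y))
theta_sgd, history_sgd = mini_batch_gradient_descent(
    X_b, y, learning_rate=0.01, epochs=50, batch_size=1
)
print(theta_mb, theta_batch, theta_sgd)
```

Gradient descent is sensitive to feature scale, so on the housing data it is worth standardizing the features first.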


In [None]:

# Example function for batch gradient descent. 
# Return theta and history (a list that stores cost for each epoch)
def batch_gradient_descent(X, y, learning_rate=0.01, epochs=1000):

    return theta, history

# Add similar functions for stochastic and mini-batch gradient descent
# Run them and output the performance



## 6. Regularization Techniques (Ridge, Lasso, Elastic Net, and Early Stopping)
**Instruction:** Implement Ridge, Lasso, and Elastic Net regression, plus early stopping, using scikit-learn, and compare the results.
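
The sketch below shows one way these four could be set up side by side. The `alpha` and `l1_ratio` values are arbitrary placeholders, not tuned recommendations, and early stopping is illustrated with `SGDRegressor(early_stopping=True)`, which is one of several ways to do it in scikit-learn. The data is synthetic (an assumption) and the `results` dictionary is an illustrative name.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge, SGDRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data in which two of the five features are irrelevant
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(scale=0.2, size=600)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.05),
    "Elastic Net": ElasticNet(alpha=0.05, l1_ratio=0.5),
    # Holds out 10% of the training data and stops once the validation
    # score has not improved for 5 consecutive epochs
    "SGD + early stopping": SGDRegressor(
        early_stopping=True, validation_fraction=0.1,
        n_iter_no_change=5, random_state=42,
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    y_pred = model.predict(X_test_s)
    results[name] = {
        "MSE": mean_squared_error(y_test, y_pred),
        "R2": r2_score(y_test, y_pred),
    }
    print(f"{name}: MSE={results[name]['MSE']:.4f}  R^2={results[name]['R2']:.4f}")
```

Inspecting `models["Lasso"].coef_` afterwards is a quick way to see whether the L1 penalty has driven the irrelevant coefficients to zero.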


In [None]:

# Ridge Regression


# Lasso Regression

# Elastic Net Regression

# Early Stopping

# Calculate performance metrics for each of these



# Print the performance metrics




## 7. Normal Equation
**Instruction:** Implement the normal equation method for linear regression.
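
For reference, the normal equation gives the least-squares solution in closed form as $\hat{\theta} = (X^\top X)^{-1} X^\top y$. A minimal sketch is below; it solves the linear system with `np.linalg.solve` rather than forming the inverse explicitly, which is a common numerical choice and not something the template requires. It assumes `X` already includes a bias column, and the synthetic data is for illustration only.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta; X must include a bias column."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Noise-free synthetic data, so the exact parameters should be recovered
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
X_b = np.c_[np.ones(len(features)), features]  # prepend the bias column
true_theta = np.array([1.5, -2.0, 0.5])
y = X_b @ true_theta

theta_ne = normal_equation(X_b, y)
y_pred_ne = X_b @ theta_ne
print(theta_ne)
```

This approach fails when $X^\top X$ is singular (for example, with duplicated features), which is one motivation for the SVD method in the next section.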


In [None]:
# Define Normal equation. Inputs: X and y, Output: theta
def normal_equation(X, y):
    # Put your code and output accordingly
    return

# Use the normal equation to find theta
theta_ne = normal_equation(X_train, y_train)

# Predict using the normal equation
# y_pred_ne = 

# Calculate and print the performance metrics




## 8. Linear Regression Using SVD
**Instruction:** Implement linear regression using singular value decomposition (SVD).
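
One way to approach this is through the pseudoinverse: with $X = U \Sigma V^\top$, the minimum-norm least-squares solution is $\hat{\theta} = V \Sigma^{+} U^\top y$, where $\Sigma^{+}$ inverts only the non-negligible singular values. The sketch below follows that recipe; the `rcond` cutoff and its default are assumptions chosen for illustration, and `X` is expected to include a bias column.

```python
import numpy as np

def svd_equation(X, y, rcond=1e-10):
    """Least-squares theta via the SVD-based pseudoinverse of X."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Invert only singular values that are large relative to the biggest one
    keep = s > rcond * s.max()
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]
    return Vt.T @ (s_inv * (U.T @ y))

# Noise-free synthetic data (illustration only)
rng = np.random.default_rng(0)
X_b = np.c_[np.ones(200), rng.normal(size=(200, 2))]
true_theta = np.array([1.5, -2.0, 0.5])
y = X_b @ true_theta

theta_svd = svd_equation(X_b, y)

# A duplicated column makes X^T X singular, but the SVD route still works
X_dup = np.c_[X_b, X_b[:, 1]]
theta_dup = svd_equation(X_dup, y)
print(theta_svd, theta_dup)
```

In the rank-deficient case the weight of the duplicated feature is split evenly between its two copies, which is what "minimum-norm" means in practice.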


In [None]:
# Define SVD equation. Inputs: X and y, Output: theta
def svd_equation(X, y):
    # Put your code and output accordingly
    return

# Use the svd equation to find theta

# Predict using the svd equation

# Calculate the performance metrics


# Print the performance metrics




## 9. Performance Metrics and Computational Analysis
**Instruction:** Compare the performance and computational time of all models implemented.
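
A possible pattern for collecting timings and plotting them is sketched below. The helper `time_call` is hypothetical, and only two closed-form solvers on synthetic data are compared (an assumption to keep the cell short); in the notebook the same loop could cover every model from the earlier sections.

```python
import time

import matplotlib.pyplot as plt
import numpy as np

def time_call(fn, *args, repeats=5):
    """Return fn's result and its best wall-clock time over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return result, best

# Synthetic data with a bias column (illustration only)
rng = np.random.default_rng(0)
X = np.c_[np.ones(2000), rng.normal(size=(2000, 8))]
y = X @ rng.normal(size=9) + rng.normal(scale=0.1, size=2000)

solvers = {
    "Normal equation": lambda X, y: np.linalg.solve(X.T @ X, X.T @ y),
    "SVD (pinv)": lambda X, y: np.linalg.pinv(X) @ y,
}

timings, errors = {}, {}
for name, solver in solvers.items():
    theta, timings[name] = time_call(solver, X, y)
    errors[name] = np.mean((X @ theta - y) ** 2)

# Side-by-side bar charts: accuracy on the left, fit time on the right
fig, (ax_mse, ax_time) = plt.subplots(1, 2, figsize=(9, 3.5))
ax_mse.bar(list(errors), list(errors.values()))
ax_mse.set_ylabel("Training MSE")
ax_time.bar(list(timings), list(timings.values()))
ax_time.set_ylabel("Best fit time (s)")
fig.tight_layout()
```

Taking the best of several runs reduces the effect of one-off delays; `time.perf_counter()` is used because it is intended for measuring short durations.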


In [None]:

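# Example: timing the training process
# (assumes linear_model, X_train and y_train are defined in the sections above)
start_time = time.perf_counter()
linear_model.fit(X_train, y_train)
end_time = time.perf_counter()
print(f'Training Time (Linear Regression): {end_time - start_time:.4f} seconds')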

# Plot performance comparison (students to fill in the details)
# Students should create plots comparing the performance metrics (MSE, MAE, R2) and computational time for each model.



## 10. Conclusion
**Instruction:** Summarize the findings from the analysis, including which models performed best in terms of accuracy and computational efficiency.
