Problem Statement: **Data Analytics I**
* Import required libraries (e.g., pandas, sklearn, matplotlib).
* Load the **Housing Dataset**.
* Explore the dataset: check shape, features, and summary stats.
* Preprocess the data (handle missing values if any, encode if needed).
* Split data into training and testing sets.
* Build a **Linear Regression model** using Python or R.
* Train the model on the training data.
* Predict house prices on test data.
* Evaluate model performance using metrics like **R² score**, **MAE**, or **RMSE**.
* Visualize results (optional).

### Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Read CSV File

In [2]:
df = pd.read_csv("4_HousingData.csv")

In [3]:
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,,36.2


### Understand Data

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     486 non-null    float64
 1   ZN       486 non-null    float64
 2   INDUS    486 non-null    float64
 3   CHAS     486 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      486 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    int64  
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    486 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB


In [5]:
df.describe()  # Summary statistics

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
count,486.0,486.0,486.0,486.0,506.0,506.0,486.0,506.0,506.0,506.0,506.0,506.0,486.0,506.0
mean,3.611874,11.211934,11.083992,0.069959,0.554695,6.284634,68.518519,3.795043,9.549407,408.237154,18.455534,356.674032,12.715432,22.532806
std,8.720192,23.388876,6.835896,0.25534,0.115878,0.702617,27.999513,2.10571,8.707259,168.537116,2.164946,91.294864,7.155871,9.197104
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73,5.0
25%,0.0819,0.0,5.19,0.0,0.449,5.8855,45.175,2.100175,4.0,279.0,17.4,375.3775,7.125,17.025
50%,0.253715,0.0,9.69,0.0,0.538,6.2085,76.8,3.20745,5.0,330.0,19.05,391.44,11.43,21.2
75%,3.560263,12.5,18.1,0.0,0.624,6.6235,93.975,5.188425,24.0,666.0,20.2,396.225,16.955,25.0
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97,50.0


In [6]:
df.isnull().sum()

CRIM       20
ZN         20
INDUS      20
CHAS       20
NOX         0
RM          0
AGE        20
DIS         0
RAD         0
TAX         0
PTRATIO     0
B           0
LSTAT      20
MEDV        0
dtype: int64

### Fill Missing Values using SimpleImputer

### **SimpleImputer**
SimpleImputer is a class in scikit-learn used for handling missing data by replacing missing values with a specified strategy, such as mean, median, most frequent, or a constant value.  

Parameters:
<table>
  <thead>
    <tr>
      <th>Parameter</th>
      <th>Description</th>
      <th>Valid Values</th>
      <th>Use Case</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code>strategy</code></td>
      <td>The imputation strategy.</td>
      <td>'mean', 'median', 'most_frequent', 'constant'</td>
      <td>Choose based on the nature of your data. 'mean' and 'median' for numerical, 'most_frequent' for categorical, 'constant' for a fixed value.</td>
    </tr>
    <tr>
      <td><code>fill_value</code></td>
      <td>When strategy='constant', this value is used to replace missing values.</td>
      <td>str or numerical value</td>
      <td>Specify the constant value to replace missing entries.</td>
    </tr>
  </tbody>
</table>


When to Use:
Use SimpleImputer when your dataset contains missing values, and you want to fill them in using a simple strategy. This is a common preprocessing step before training machine learning models.
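
To make the `strategy` and `fill_value` parameters concrete, here is a minimal sketch on a hypothetical one-column toy frame (the column name `rooms` and its values are made up for illustration and are not part of the housing data). It shows how the same missing entry is filled differently under three strategies, and why the median can be preferable when a column contains an outlier.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one numeric column with a gap and one outlier (100.0)
toy = pd.DataFrame({"rooms": [4.0, 6.0, np.nan, 8.0, 100.0]})

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")
constant_imputer = SimpleImputer(strategy="constant", fill_value=0)

# fit_transform returns a 2D NumPy array; row 2 held the missing value
filled_mean = mean_imputer.fit_transform(toy)[2, 0]
filled_median = median_imputer.fit_transform(toy)[2, 0]
filled_constant = constant_imputer.fit_transform(toy)[2, 0]

print(filled_mean)      # mean of 4, 6, 8, 100 -> 29.5 (pulled up by the outlier)
print(filled_median)    # median of 4, 6, 8, 100 -> 7.0
print(filled_constant)  # the fill_value, 0
```

The learned fill values are stored on the fitted object in `statistics_`, which is handy for checking what was actually imputed.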

In [7]:
from sklearn.impute import SimpleImputer

In [8]:
si = SimpleImputer(strategy='mean')

In [9]:
# Columns that contain missing values (see df.isnull().sum() above)
cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'AGE', 'LSTAT']

In [10]:
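# Impute all listed columns in one call; each column is filled with its own mean.
# Note: CHAS is a 0/1 indicator, so the mean leaves a fractional value in its gaps.
df[cols] = si.fit_transform(df[cols])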
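
The cells above fit the imputer on the full dataset before any train/test split, so the fill values are computed partly from rows that later end up in the test set. A common alternative, sketched below under stated assumptions, is to split first, fit the imputer on the training rows only, and reuse the learned means on the test rows. The toy frame here is hypothetical (column names are borrowed from the housing data purely for readability, and the values are illustrative); this is not the notebook's own workflow.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy frame, NOT the real housing data
demo = pd.DataFrame({
    "LSTAT": [4.98, 9.14, np.nan, 2.94, 5.33, np.nan, 12.43, 19.15],
    "MEDV":  [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1],
})
X_demo, y_demo = demo[["LSTAT"]], demo["MEDV"]

# Split first, so the test rows play no part in computing the fill value
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

imputer = SimpleImputer(strategy="mean")
X_tr_filled = imputer.fit_transform(X_tr)  # mean learned from training rows only
X_te_filled = imputer.transform(X_te)      # same training mean reused on test rows

print(imputer.statistics_)  # the training-set mean used for both subsets
```

On a dataset of this size the difference in results is likely small, but the split-then-fit order keeps the test set untouched until evaluation.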

In [11]:
df

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.200000,4.0900,1,296,15.3,396.90,4.980000,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.900000,4.9671,2,242,17.8,396.90,9.140000,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.100000,4.9671,2,242,17.8,392.83,4.030000,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.800000,6.0622,3,222,18.7,394.63,2.940000,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.200000,6.0622,3,222,18.7,396.90,12.715432,36.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.100000,2.4786,1,273,21.0,391.99,12.715432,22.4
502,0.04527,0.0,11.93,0.0,0.573,6.120,76.700000,2.2875,1,273,21.0,396.90,9.080000,20.6
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.000000,2.1675,1,273,21.0,396.90,5.640000,23.9
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.300000,2.3889,1,273,21.0,393.45,6.480000,22.0


### Separate Features (X) and Target (y) for Supervised Learning

In [12]:
X = df.drop("MEDV", axis=1)
y = df["MEDV"]

### Split the Dataset into Training and Testing Sets (80% Train, 20% Test)

### **train_test_split()**
train_test_split() is a function in scikit-learn **used to split arrays or matrices into random train and test subsets**. 

Output: It returns a list containing train-test splits of inputs. For supervised learning, it typically returns: <code> X_train, X_test, y_train, y_test </code>

Parameters:
<table>
  <thead>
    <tr>
      <th>Parameter</th>
      <th>Description</th>
      <th>Valid Values</th>
      <th>Use Case</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code>test_size</code></td>
      <td>Proportion or absolute number of the test set.</td>
      <td>float (0.0 to 1.0) or int</td>
      <td>Specify the size of the test set. E.g., 0.2 for 20% test data.</td>
    </tr>
    <tr>
      <td><code>random_state</code></td>
      <td>Controls the shuffling applied to the data before splitting.</td>
      <td>int or None</td>
      <td>Set for reproducibility of the split.</td>
    </tr>
  </tbody>
</table>
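
As a quick illustration of the two parameters in the table, the sketch below splits a small hypothetical array (10 rows, unrelated to the housing data) twice with the same `random_state`. It shows the resulting subset sizes for `test_size=0.2` and that a fixed seed reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical demo data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Same seed used twice to check reproducibility
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)        # 8 training rows, 2 test rows
print(np.array_equal(X_te, X_te2))   # identical test rows across both calls
```

Leaving `random_state=None` would shuffle differently on each run, so reported metrics would vary between executions.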


In [13]:
from sklearn.model_selection import train_test_split

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### Model Building

<h2>Linear Regression</h2>

<h3>Definition:</h3>
<p>
  Linear regression is a statistical method used to model the relationship between a dependent variable 
  <strong>y</strong> and one or more independent variables <strong>x</strong>. It predicts the value of 
  <strong>y</strong> based on the value(s) of <strong>x</strong>.
</p>

<h3>Mathematical Formula:</h3>

<p><strong>Simple Linear Regression (one independent variable):</strong></p>
<p><code>y = β₀ + β₁x + ε</code></p>

<p><strong>Multiple Linear Regression (multiple independent variables):</strong></p>
<p><code>y = β₀ + β₁x₁ + β₂x₂ + ⋯ + βₙxₙ + ε</code></p>

<h3>Where:</h3>
<ul>
  <li><strong>y:</strong> Dependent variable (target)</li>
  <li><strong>x:</strong> Independent variable(s) (features)</li>
  <li><strong>β₀:</strong> Intercept term</li>
  <li><strong>β₁, β₂, ..., βₙ:</strong> Coefficients (slopes)</li>
  <li><strong>ε:</strong> Error term (residual)</li>
</ul>

<h3>Objective:</h3>
<p>
  Estimate the coefficients <strong>β</strong> that minimize the sum of squared residuals (differences between observed and predicted values),
  using a method known as <strong>Ordinary Least Squares (OLS)</strong>.
</p>
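
The OLS objective above can be checked numerically. The following is a minimal sketch on hypothetical, noise-free synthetic data (the coefficients 3, 2 and -1 are chosen for illustration): it solves the least-squares problem directly with NumPy and compares the result with scikit-learn's `LinearRegression`, which should recover the same intercept and slopes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: y = 3 + 2*x1 - 1*x2, with no noise term
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept (beta_0)
A = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least-squares solution: minimizes the sum of squared residuals ||A @ beta - y||^2
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

# scikit-learn fits the same model and stores intercept and slopes separately
lr = LinearRegression().fit(X_demo, y_demo)

print(beta)                    # approximately [3, 2, -1]
print(lr.intercept_, lr.coef_) # approximately 3 and [2, -1]
```

With real data the error term is nonzero, so the estimates only approximate the underlying coefficients rather than matching them exactly.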


In [15]:
from sklearn.linear_model import LinearRegression

In [16]:
model = LinearRegression()

In [17]:
model.fit(X_train, y_train)

### Predict Results on Test Data

In [18]:
y_pred = model.predict(X_test)

### Evaluate Result

In [19]:
# accuracy_score is a classification metric and does not apply to regression
from sklearn.metrics import mean_squared_error, r2_score

In [20]:
print(mean_squared_error(y_test, y_pred))   # Mean squared error (MSE)
print(r2_score(y_test, y_pred)*100)         # R² score, expressed as a percentage

25.017672023842703
65.8852019550814
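
The problem statement also lists MAE and RMSE, while the cell above reports MSE and R². Taking the square root of the MSE printed above gives an RMSE of roughly 5.0, in the same units as MEDV. The sketch below shows how both metrics can be computed; it uses small hypothetical arrays rather than the notebook's `y_test` and `y_pred` so that it runs on its own, and the same calls should apply to the real predictions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical true and predicted values, for illustration only
y_true_demo = np.array([24.0, 21.6, 34.7, 33.4])
y_pred_demo = np.array([26.0, 20.6, 30.7, 34.4])

# MAE: average absolute error, here (2 + 1 + 4 + 1) / 4
mae = mean_absolute_error(y_true_demo, y_pred_demo)

# RMSE: square root of the MSE, here sqrt((4 + 1 + 16 + 1) / 4)
rmse = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

print(round(mae, 3))   # about 2.0
print(round(rmse, 3))  # about 2.345
```

RMSE penalizes large errors more heavily than MAE, so a noticeable gap between the two suggests a few predictions are far off.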
