Data Analytics I
    Create a Linear Regression Model using Python/R to predict home prices using Boston Housing 
    Dataset (https://www.kaggle.com/c/boston-housing). The Boston Housing dataset contains 
    information about various houses in Boston through different parameters. There are 506 samples 
    and 14 feature variables in this dataset.
    
The objective is to predict the value of prices of the house using the given features

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("boston_housing.csv")

In [3]:
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,,36.2


In [4]:
df.tail()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,,22.4
502,0.04527,0.0,11.93,0.0,0.573,6.12,76.7,2.2875,1,273,21.0,396.9,9.08,20.6
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0,2.1675,1,273,21.0,396.9,5.64,23.9
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0
505,0.04741,0.0,11.93,0.0,0.573,6.03,,2.505,1,273,21.0,396.9,7.88,11.9


In [5]:
df.isnull().sum()

CRIM       20
ZN         20
INDUS      20
CHAS       20
NOX         0
RM          0
AGE        20
DIS         0
RAD         0
TAX         0
PTRATIO     0
B           0
LSTAT      20
MEDV        0
dtype: int64

In [6]:
df['CRIM'].fillna(df['CRIM'].mean(),inplace=True)       #replace all the null values with the mean of each
df['ZN'].fillna(df['ZN'].mean(),inplace=True)  
df['INDUS'].fillna(df['INDUS'].mean(),inplace=True)
df['CHAS'].fillna(df['CHAS'].mean(),inplace=True)
df['AGE'].fillna(df['AGE'].mean(),inplace=True)
df['LSTAT'].fillna(df['LSTAT'].mean(),inplace=True)

In [7]:
df.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

In [8]:
df.columns

Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT', 'MEDV'],
      dtype='object')

In [9]:
X = df[['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']]
Y = df['MEDV'] #divide the dataset into X represents the input features or independent variables.
                                    #Y represents the target variable or dependent variable.

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size = 0.3, random_state = 42)

test_size=0.3 specifies the proportion of the dataset that should be allocated for testing. 
In this case, 30% of the data will be used for testing, and the remaining 70% will be used for training.
random_state=42 sets the random seed, ensuring reproducibility. The same random seed will always result 
in the same train-test split.
The function returns four variables: X_train, X_test, y_train, and y_test. These variables hold the 
    resulting splits of the dataset, where:

X_train contains the training set features.
X_test contains the testing set features.
y_train contains the corresponding target values for the training set.
y_test contains the corresponding target values for the testing set.
By using train_test_split, you can separate your data into training and test

In [11]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)

The fit_transform method calculates the mean and standard deviation of each feature in X_train and scales the features based on these statistics. 
The result is that X_train is transformed and standardized.

In [14]:
from sklearn.linear_model import LinearRegression

In [15]:
lr = LinearRegression() #create an variable to store the object

In [16]:
lr.fit(X_train, y_train) #fit your dataset into that variable

In [17]:
y_pred = lr.predict(X_test)

y_pred = lr.predict(X_test)

This line calls the predict method on a trained machine learning model lr, using the testing set features X_test as input. The predict method takes the input features and generates predictions for the corresponding target variable.

The resulting predictions are assigned to the variable y_pred, which represents the predicted values for the target variable based on the input features in X_test.

In this context, lr is assumed to be a trained model, such as a linear regression model, logistic regression model, or any other model that has a predict method implemented.

After executing this line of code, y_pred will contain the predicted values for the target variable based on the input features in X_test. These predicted values can be further evaluated against the actual target values (y_test) to assess the performance of the machine learning model

In [18]:
print("Testing accuracy is:")
lr.score(X_test, y_test)

Testing accuracy is:


0.6709481059106917

In [19]:
print("Training accuracy is:")
lr.score(X_train, y_train)

Training accuracy is:


0.7323332967523362

 The training accuracy measures how well the model fits the training data, indicating how effectively it has learned from the available information.

In [21]:
from sklearn.metrics import mean_squared_error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mead squared Error is:")
print(rmse)

Root Mead squared Error is:
4.95163359216575


Linear regression is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the variables, meaning that changes in the independent variables are associated with proportional changes in the dependent variable.

The goal of linear regression is to find the best-fitting line (or hyperplane) that minimizes the difference between the predicted values and the actual values of the dependent variable. This line is determined by estimating the coefficients or weights for each independent variable, which represent the slope of the line.

In simple linear regression, there is only one independent variable, and the relationship is represented by a straight line equation:

y = mx + b

where 'y' is the dependent variable, 'x' is the independent variable, 'm' is the slope (coefficient), and 'b' is the y-intercept.

Multiple linear regression extends this concept to multiple independent variables, resulting in an equation:

y = b0 + b1x1 + b2x2 + ... + bnxn

# 

In linear regression, 'x' represents the independent variable or predictor variable, and 'y' represents the dependent variable or response variable.


# 


In linear regression, training the dataset is the process of fitting the model to the given data. The main purpose of training is to estimate the coefficients or weights of the linear regression model that best represent the relationship between the independent variables (predictors) and the dependent variable (response).

Here are the key reasons why we need to train the dataset in linear regression:

Model Learning: The training process allows the linear regression model to learn and understand the underlying patterns and relationships in the data. By analyzing the training data, the model adjusts its coefficients to minimize the difference between the predicted values and the actual values of the dependent variable.