# Assignment 1: Linear Regression Analysis on USA Housing Dataset

#### DSCI 6601: Practical Machine Learning

Linear regression is a statistical method used in supervised learning to model the relationship between a dependent variable (the target) and one or more independent variables (the features). It is mostly applied when the target is numerical and continuous.
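
As a minimal sketch of the idea, the one-feature case can be fitted by hand with the closed-form least-squares solution. The function name and the data points below are made up for illustration and are not part of the assignment.

```python
# Minimal sketch of simple (one-feature) linear regression using the
# closed-form least-squares solution. The data below is hypothetical.

def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical feature/target pairs lying exactly on y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_simple_linear_regression(xs, ys)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

With several features, the same least-squares idea applies, with one coefficient per feature plus an intercept.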

## Task 1: Data Splitting and Model Fitting

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset
dataset = pd.read_csv("USA_Housing.csv")
# Check the dataset's features
dataset.head(5)

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price,Address
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1059034.0,"208 Michael Ferry Apt. 674\nLaurabury, NE 3701..."
1,79248.642455,6.0029,6.730821,3.09,40173.072174,1505891.0,"188 Johnson Views Suite 079\nLake Kathleen, CA..."
2,61287.067179,5.86589,8.512727,5.13,36882.1594,1058988.0,"9127 Elizabeth Stravenue\nDanieltown, WI 06482..."
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1260617.0,USS Barnett\nFPO AP 44820
4,59982.197226,5.040555,7.839388,4.23,26354.109472,630943.5,USNS Raymond\nFPO AE 09386


In [10]:
# Drop the non-numerical feature from the dataset
dataset1 = dataset.drop(columns=['Address'])
dataset1.head(5)

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1059034.0
1,79248.642455,6.0029,6.730821,3.09,40173.072174,1505891.0
2,61287.067179,5.86589,8.512727,5.13,36882.1594,1058988.0
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1260617.0
4,59982.197226,5.040555,7.839388,4.23,26354.109472,630943.5


In [11]:
# Features dataset X: all columns except 'Price'
X = dataset1.drop(columns=['Price'])
print(X.head(5))
print()
# Target y: the 'Price' column
y = dataset1['Price']
y.head(5)

   Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  \
0      79545.458574             5.682861                   7.009188   
1      79248.642455             6.002900                   6.730821   
2      61287.067179             5.865890                   8.512727   
3      63345.240046             7.188236                   5.586729   
4      59982.197226             5.040555                   7.839388   

   Avg. Area Number of Bedrooms  Area Population  
0                          4.09     23086.800503  
1                          3.09     40173.072174  
2                          5.13     36882.159400  
3                          3.26     34310.242831  
4                          4.23     26354.109472  



0    1.059034e+06
1    1.505891e+06
2    1.058988e+06
3    1.260617e+06
4    6.309435e+05
Name: Price, dtype: float64

In [31]:
# Split the dataset into 80% for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
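
Conceptually, a seeded split shuffles the row indices reproducibly and slices off the test share. The sketch below illustrates that idea with the standard library only. It is not scikit-learn's actual implementation, and `simple_train_test_split` is a hypothetical helper.

```python
import random

def simple_train_test_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed, then slice off the test share."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle
    n_test = round(len(rows) * test_size)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(100))  # hypothetical stand-in for 100 dataset rows
train, test = simple_train_test_split(rows)
print(len(train), len(test))
```

Fixing the seed (the role `random_state=42` plays above) makes the split repeatable, so results can be compared across runs.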

In [30]:
# Create and fit the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

## Task 2: Reporting Coefficients and Model Evaluation

In [32]:
# Output the model's coefficients for each feature and the intercept
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

Coefficients: [2.16522058e+01 1.64666481e+05 1.19624012e+05 2.44037761e+03
 1.52703134e+01]
Intercept: -2635072.900933358


In [44]:
# Making predictions
y_pred = model.predict(X_test)

# Evaluating the model
# mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# print("Mean Squared Error:", mse)
print("R-squared:", r2)

R-squared: 0.9179971706834289
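
The score printed above follows the definition R² = 1 - SS_res / SS_tot. As a minimal sketch, it can be computed by hand alongside the mean squared error. The helper names and the four-point data below are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets and two sets of predictions
y_true = [3.0, 5.0, 7.0, 9.0]
good_pred = [2.5, 5.5, 7.0, 9.5]   # close to the targets
bad_pred = [9.0, 7.0, 5.0, 3.0]    # worse than predicting the mean

print("R^2 (good):", r_squared(y_true, good_pred))
print("R^2 (bad): ", r_squared(y_true, bad_pred))
print("MSE (good):", mean_squared_error_manual(y_true, good_pred))
```

The second case shows that R² can drop below 0 when the predictions are worse than always predicting the mean of the targets.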


## R-squared
R-squared typically ranges from 0 to 1 (it can be negative on held-out data when the model fits worse than always predicting the mean).

A value at or close to 0 means the model explains almost none of the variance in the target variable; essentially, the model fails to predict the target variable.

A value at or close to 1 means the model explains almost all of the variance in the target variable, indicating a near-perfect fit.

An R-squared of approximately 0.918 on the test set in this case study suggests that the model has strong predictive ability: about 91.8% of the variability in house price can be explained by the features used in this model.

## Task 3: Predictions on Sample Data

In [34]:
# Checking the test dataset
print(X_test.head(5))

      Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  \
1501      61907.593345             7.017838                   6.440256   
2586      57160.202243             6.893260                   6.921532   
2653      70190.796445             6.745054                   6.662567   
1055      69316.796889             6.300409                   7.873576   
705       72991.481649             3.412866                   6.494081   

      Avg. Area Number of Bedrooms  Area Population  
1501                          3.25     43828.947207  
2586                          3.13     43467.147035  
2653                          2.01     29215.136112  
1055                          4.28     24448.211461  
705                           2.48     50626.495426  


In [35]:
# I did some googling and used ChatGPT here
# Randomly select 20 samples (rows) from the test set
random_samples = X_test.sample(n=20, random_state=42)
random_samples.head(5)

Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population
3707,76153.304902,5.322412,7.989237,6.08,32925.347152
828,52420.525533,7.326977,6.010628,4.29,39766.419573
2664,65885.135759,7.652591,6.196093,4.02,34100.916771
1047,66236.840561,6.856135,5.066614,3.22,38906.857363
3197,64354.773272,4.312952,6.329671,3.09,25495.35594


In [36]:
# Use the fitted model to make predictions on the 20 randomly selected samples
random_samples_pred = model.predict(random_samples)

In [40]:
# Get the actual values for the selected samples from the test set
actual_values = y_test.loc[random_samples.index]

In [42]:
# Did some googling here as well 
# Create a DataFrame to compare predicted and actual values
comparison_df = pd.DataFrame({
    'Actual Price': actual_values,
    'Predicted Price': random_samples_pred
})
comparison_df

Unnamed: 0,Actual Price,Predicted Price
3707,1220133.0,1363559.0
828,1128403.0,1043185.0
2664,1440737.0,1323354.0
1047,1113937.0,1136143.0
3197,520217.9,622592.0
4200,685355.4,655836.5
1505,1522100.0,1547218.0
996,1255576.0,1067322.0
2750,920540.7,924307.9
4445,1673538.0,1674049.0
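
To put a number on how close the predictions are, a relative error per row can be computed. The sketch below uses only the ten rows displayed above, copied by hand, so its summary values describe those rows and not the full test set.

```python
# (row index, actual price, predicted price) copied from the displayed rows
rows = [
    (3707, 1220133.0, 1363559.0),
    (828, 1128403.0, 1043185.0),
    (2664, 1440737.0, 1323354.0),
    (1047, 1113937.0, 1136143.0),
    (3197, 520217.9, 622592.0),
    (4200, 685355.4, 655836.5),
    (1505, 1522100.0, 1547218.0),
    (996, 1255576.0, 1067322.0),
    (2750, 920540.7, 924307.9),
    (4445, 1673538.0, 1674049.0),
]

# Absolute percentage error per row: |actual - predicted| / actual
abs_pct_errors = {idx: abs(actual - pred) / actual for idx, actual, pred in rows}
mape = sum(abs_pct_errors.values()) / len(abs_pct_errors)
worst_idx = max(abs_pct_errors, key=abs_pct_errors.get)

for idx, err in abs_pct_errors.items():
    print(f"{idx}: {err:.1%}")
print(f"Mean absolute percentage error: {mape:.1%}")
print(f"Largest relative error at row {worst_idx}")
```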


## Task 4: Feature Ranking and Model Refinement

In [46]:
# Checking the features and their coefficients 
print("Features:", X_train.columns)
print("Coefficients:", model.coef_)

Features: Index(['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
       'Avg. Area Number of Bedrooms', 'Area Population'],
      dtype='object')
Coefficients: [2.16522058e+01 1.64666481e+05 1.19624012e+05 2.44037761e+03
 1.52703134e+01]


In [50]:
# Zip the two variables to visualize better
features = X_train.columns
coefficients = model.coef_

for feature, coefficient in zip(features, coefficients):
    print()
    print(f"Feature: {feature}, Coefficient: {coefficient}")



Feature: Avg. Area Income, Coefficient: 21.652205763623375

Feature: Avg. Area House Age, Coefficient: 164666.48072189192

Feature: Avg. Area Number of Rooms, Coefficient: 119624.01223205797

Feature: Avg. Area Number of Bedrooms, Coefficient: 2440.377611031628

Feature: Area Population, Coefficient: 15.270313429966336
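
As a small sketch, the printed coefficients can be ranked by absolute value. The values are copied from the output above. Note that this ranks raw coefficients only, and a raw coefficient depends on the units of its feature.

```python
# Coefficients copied from the model output above
coefficients = {
    "Avg. Area Income": 21.652205763623375,
    "Avg. Area House Age": 164666.48072189192,
    "Avg. Area Number of Rooms": 119624.01223205797,
    "Avg. Area Number of Bedrooms": 2440.377611031628,
    "Area Population": 15.270313429966336,
}

# Sort from largest to smallest absolute coefficient
ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)
for rank, (feature, coef) in enumerate(ranked, start=1):
    print(f"{rank}. {feature}: {coef:,.2f}")
```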


### Observation
Looking at the features and their corresponding coefficients, we can observe that (Avg. Area House Age)
and (Avg. Area Number of Rooms) have the largest coefficients and, thus, appear to have the most impact on the model.
Meanwhile, the coefficients of (Avg. Area Income), (Avg. Area Number of Bedrooms), and (Area Population) appear negligible by comparison.

In [51]:
# Let's drop the lowest-ranked features and re-fit the model using the remaining features

dataset1.head(5)


Unnamed: 0,Avg. Area Income,Avg. Area House Age,Avg. Area Number of Rooms,Avg. Area Number of Bedrooms,Area Population,Price
0,79545.458574,5.682861,7.009188,4.09,23086.800503,1059034.0
1,79248.642455,6.0029,6.730821,3.09,40173.072174,1505891.0
2,61287.067179,5.86589,8.512727,5.13,36882.1594,1058988.0
3,63345.240046,7.188236,5.586729,3.26,34310.242831,1260617.0
4,59982.197226,5.040555,7.839388,4.23,26354.109472,630943.5


In [54]:
# New features dataset X_1: columns (Avg. Area House Age) and (Avg. Area Number of Rooms)
X_1 = dataset1.drop(columns=[ 'Avg. Area Income', 'Avg. Area Number of Bedrooms', 'Area Population', 'Price'])
print(X_1.head(5))
print()
# Target y_1: the 'Price' column
y_1 = dataset1['Price']
y_1.head(5)

   Avg. Area House Age  Avg. Area Number of Rooms
0             5.682861                   7.009188
1             6.002900                   6.730821
2             5.865890                   8.512727
3             7.188236                   5.586729
4             5.040555                   7.839388



0    1.059034e+06
1    1.505891e+06
2    1.058988e+06
3    1.260617e+06
4    6.309435e+05
Name: Price, dtype: float64

In [58]:
# Split the new dataset into 80% for training and 20% for testing
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_1, y_1, test_size=0.2, random_state=42)

In [59]:
# Create and fit a new Linear Regression model on the reduced feature set
model = LinearRegression()
model.fit(X_train1, y_train1)

In [61]:
# Making predictions with the new dataset
y_pred1 = model.predict(X_test1)

# Evaluating the model
# mse = mean_squared_error(y_test1, y_pred1)
r2 = r2_score(y_test1, y_pred1)

# print("Mean Squared Error:", mse)
print("R-squared:", r2)

R-squared: 0.34043009017731085


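R-squared, after dropping the features with the lowest-ranked coefficients, decreased significantly from 91.8% to 34.04%,
which means the reduced model has poor predictive ability and is therefore unreliable.
Coefficient rank, in this particular case, is not a reliable basis for deciding which features to keep. A raw coefficient measures the change in price per unit of its feature, and the features are on very different scales (income and population are in the tens of thousands, while house age and room counts are single digits), so a small coefficient can still correspond to a large effect on the prediction. The dropped features prove to be strong factors in this model.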
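
One way to see why the dropped features matter is to multiply each coefficient of the first (five-feature) model by a typical feature value. The sketch below does this for row 0 of the dataset, using the rounded values displayed earlier, so the result is illustrative only. Row 0 may also have been part of the training set.

```python
# Coefficients and intercept copied from the first model's output
coefficients = {
    "Avg. Area Income": 21.652205763623375,
    "Avg. Area House Age": 164666.48072189192,
    "Avg. Area Number of Rooms": 119624.01223205797,
    "Avg. Area Number of Bedrooms": 2440.377611031628,
    "Area Population": 15.270313429966336,
}
intercept = -2635072.900933358

# Feature values of row 0 as displayed by dataset.head(5)
row0 = {
    "Avg. Area Income": 79545.458574,
    "Avg. Area House Age": 5.682861,
    "Avg. Area Number of Rooms": 7.009188,
    "Avg. Area Number of Bedrooms": 4.09,
    "Area Population": 23086.800503,
}

# Contribution of each feature to the prediction: coefficient * value
contributions = {name: coefficients[name] * row0[name] for name in coefficients}
prediction = intercept + sum(contributions.values())

for name, value in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:,.0f}")
print(f"Predicted price: {prediction:,.0f}")
```

For this row, (Avg. Area Income) makes the largest contribution to the predicted price despite having one of the smallest raw coefficients, which is consistent with the drop in R-squared once it was removed.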