## **Import Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
plt.rcParams["figure.figsize"] = [10,8]

In [None]:
import warnings
warnings.simplefilter(action = 'ignore', category = FutureWarning)

## **Load the dataset**

In [None]:
df = pd.read_csv('../Datasets/USA_Housing.csv')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.nunique()

In [None]:
df.isnull().sum()

In [None]:
df.info()

## **1. Perform EDA on the dataset which should include**

### **a. Visualization** and explore the data using seaborn
#### **i.** Add your findings about the data under each graph in the colab notebook

In [None]:
sns.histplot(data=df['Area Population'], kde=False)
plt.title("Histogram of Area Population")
plt.show()

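- This histogram shows the distribution of **Area Population**: the x-axis is the population of an area and the height of each bar is the number of rows (areas) whose population falls in that bin. It describes a single variable, so on its own it does not show a relationship between population and the number of houses.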

In [None]:
sns.boxenplot(data=df['Area Population'])
plt.show()

In [None]:
sns.histplot(data=df['Price'], kde=True)
plt.title("Histogram of Price")
plt.show()

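- This histogram shows the distribution of **Price**: the x-axis is the price and the height of each bar is the number of rows whose price falls in that bin, with the KDE curve overlaying a smoothed estimate of the same distribution. It describes a single variable, so it does not show how price changes with the number of houses.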

In [None]:
sns.set_theme(style="darkgrid")
sns.regplot(data=df, x='Avg. Area Number of Rooms', y='Price')
plt.title("Regplot of Rooms and Price")
plt.show()

- This regplot shows the relationship between Avg. Area Number of Rooms and Price together with a fitted regression line. The relationship is positive: areas with more rooms on average tend to have higher prices, which agrees with the line plot of the same two columns in section a.

In [None]:
sns.set_theme(style="whitegrid")
# Number of rows to sample for visualization
sample_size = 2000
# Randomly sample rows from the DataFrame (fixed seed for reproducibility)
sampled_data = df.sample(n=sample_size, random_state=42)
# Now create the box plot using the sampled data
plt.boxplot(sampled_data['Area Population'])
plt.title('Box Plot of Area Population (sampled)')
plt.show()

- In this box plot I have used a random sample of 2000 rows rather than the whole **Area Population** column, because of the large number of rows. The plot shows outliers on both sides: some areas have a much larger population and some a much smaller population than the rest, and these are the outliers in our **Area Population** column.
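
The points a box plot draws as outliers are, with matplotlib's default settings, the values lying more than 1.5 × IQR beyond the quartiles. A minimal sketch of counting them is given here; the helper name `iqr_outliers` and the synthetic values standing in for the **Area Population** column are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def iqr_outliers(values, whis=1.5):
    """Return the values outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < lower) | (values > upper)]

# Synthetic stand-in for df['Area Population'] (assumption for illustration)
rng = np.random.default_rng(42)
population = rng.normal(loc=36000, scale=10000, size=2000)
population = np.append(population, [90000.0, -5000.0])  # two planted extremes

outliers = iqr_outliers(population)
print(f"{outliers.size} of {population.size} values fall outside the whiskers")
```

On the real data the same helper could be called as `iqr_outliers(df['Area Population'])` to put a number on what the plot shows.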

In [None]:
# Number of rows to sample for visualization
sample_size = 2000

# Randomly sample rows from the DataFrame (fixed seed for reproducibility)
sampled_data = df.sample(n=sample_size, random_state=42)

# Use the sampled data for x and y values
x_values = sampled_data['Avg. Area Number of Rooms']
y_values = sampled_data['Area Population']

# Draw a scatter plot of the sampled x and y values
plt.scatter(x_values, y_values)
plt.title('Scatter Plot with Sampled Data')

plt.xlabel("Avg. Area Number of Rooms")
plt.ylabel("Area Population")
# Show the plot
plt.show()

In [None]:
sns.violinplot(data=df['Avg. Area House Age'])
plt.title("Violin plot of Avg. Area House Age")
plt.show()

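- The violin plot above shows the distribution of Avg. Area House Age: the width of the violin at a given age is proportional to the estimated density of areas with that average age, so the widest part marks the most common house ages.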

In [None]:
sns.lineplot(data = df, x='Avg. Area House Age', y='Price')
plt.title("Line plot of Average Area House Age and Price")
plt.show()

- Because of the sheer number of rows the line is noisy and a bit difficult to read, but on close inspection it shows that as the average house age increases, the price also tends to increase. In simple words, areas with older houses tend to have higher prices than others.

In [None]:
sns.lineplot(data = df, x='Avg. Area Number of Rooms', y='Price')
plt.title("Line plot of Average Area Number of Rooms and Price")
plt.show()

- The same holds for Avg. Area Number of Rooms: as the number of rooms increases, the price also increases.

In [None]:
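# lmplot is a figure-level function that creates its own figure, so a preceding
# plt.figure() only leaves an empty figure behind; size is set via height/aspect
sns.lmplot(data=df, x='Avg. Area Income', y='Price', aspect=1.5)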
plt.title("lmplot of Price and Income")
plt.show()

- It shows that there is a positive relationship between Price and Income.

In [None]:
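# 'Address' is a text column, so restrict the correlation matrix to numeric columns
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap="viridis", cbar=True, fmt='.2f')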
plt.title("Heatmap")
plt.show()

- The heatmap shows that most pairs of features are only weakly correlated, so the features are largely independent of each other.
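
Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `DataFrame.corr`. The sketch below computes that coefficient by hand and checks it against `np.corrcoef`. The function name `pearson_r` and the synthetic `income`, `price` and `rooms` arrays are illustrative assumptions, not values from the dataset.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))

# Synthetic data (assumption): price depends on income, rooms is unrelated
rng = np.random.default_rng(0)
income = rng.normal(68000, 10000, size=500)
price = 20 * income + rng.normal(0, 150000, size=500)
rooms = rng.normal(7, 1, size=500)

r_income_price = pearson_r(income, price)  # strongly positive by construction
r_income_rooms = pearson_r(income, rooms)  # near zero by construction
print(f"income vs price: {r_income_price:.2f}")
print(f"income vs rooms: {r_income_rooms:.2f}")
```

A coefficient near zero, like the second one, is what "weakly correlated" means in the heatmap; it rules out a linear relationship but not every kind of dependence.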

In [None]:
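sns.pairplot(data=df)
# pairplot builds a grid of axes, so title the whole figure rather than the last subplot
plt.suptitle("Pair plot", y=1.02)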
plt.show()

- The pair plot shows the relationship between every pair of numeric features in the dataset, with each feature's own distribution on the diagonal, so all pairwise relationships can be inspected in a single place.

### **b. Identify the data patterns** if exist for single/multiple variables
#### **i.** Write your findings under the plots or code that identify the pattern

In [None]:
sns.histplot(data=df['Area Population'], kde=False)
plt.title("Histogram of Area Population")
plt.show()

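- This histogram shows the distribution of **Area Population**: the x-axis is the population of an area and the height of each bar is the number of rows (areas) whose population falls in that bin. It describes a single variable, so on its own it does not show a relationship between population and the number of houses.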

In [None]:
sns.set_theme(style="darkgrid")
sns.regplot(data=df, x='Avg. Area Number of Rooms', y='Price')
plt.title("Regplot of Rooms and Price")
plt.show()

- This regplot shows the relationship between Avg. Area Number of Rooms and Price together with a fitted regression line. The relationship is positive: areas with more rooms on average tend to have higher prices, which agrees with the line plot of the same two columns in section a.

In [None]:
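# lmplot is a figure-level function that creates its own figure, so a preceding
# plt.figure() only leaves an empty figure behind; size is set via height/aspect
sns.lmplot(data=df, x='Avg. Area Income', y='Price', aspect=1.5)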
plt.title("lmplot of Price and Income")
plt.show()

- It shows that there is a positive relationship between Price and Income.

### **c. Clean the dataset,** remove the missing values
#### i. Explain your approach in the colab notebook cell

In [None]:
# heatmap to see the missing values in the dataset
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap="viridis")
plt.title("Missing Data")
plt.show()

In [None]:
df.isnull().sum()

In [None]:
df.drop('Address', axis = 1, inplace = True)

In [None]:
df.dropna(inplace = True)
df.head()

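- **My Approach:**\
I used a heatmap of `df.isnull()` and `df.isnull().sum()` to look for missing values, and both show that there are none in the dataset. I also dropped the **Address** column, which is text rather than a numeric feature. I still called `.dropna()` as a precaution, so that any missing values would be removed from the dataset.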
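
The check-then-drop approach described above can be sketched on a small frame that does contain gaps. The frame `toy` and its column names `a` and `b` are hypothetical and only serve to show what `.isnull().sum()` and `.dropna()` return.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, np.nan, 40.0],
})

# Count missing values per column before deciding what to do
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
cleaned = toy.dropna()
print(f"rows before: {len(toy)}, rows after: {len(cleaned)}")
```

On a frame with no missing values, as reported for this dataset, `dropna()` returns all rows unchanged, which is why calling it as a precaution is harmless.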

### **d. Select the target variable** and clearly mention reason for selecting it

**Target variable:**\
Price\
**Reason:**\
I am selecting **Price** as my target variable because the algorithms we are going to use are regressors, which predict continuous numerical values. **Price** is a continuous numerical column and the natural quantity to predict from the other features, so it is the appropriate choice of target variable.

In [None]:
x = df.drop('Price', axis = 1)
y = df['Price']

### **e. Transform the Dataset**
#### i. Transform the whole dataset (Features, Target Variable)

In [None]:
from sklearn import preprocessing

In [None]:
# Transforming features: fit_transform both fits the scaler and applies it,
# so a separate .fit() call beforehand is not needed
pre_process_x = preprocessing.StandardScaler()
x_transform = pre_process_x.fit_transform(x)

In [None]:
# Transforming target variable: StandardScaler expects a 2-D array,
# so reshape the target into a single column first
y_array = y.to_numpy()
y_reshaped_column = y_array.reshape(-1, 1)
pre_process_y = preprocessing.StandardScaler()
y_transform = pre_process_y.fit_transform(y_reshaped_column)
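
`StandardScaler` subtracts the column mean and divides by the population standard deviation (`ddof=0`), i.e. z = (x - mean) / std. A minimal NumPy sketch of the same computation and its inverse is shown below; the helper `standardize` and the five example values are assumptions for illustration.

```python
import numpy as np

def standardize(column):
    """Return the z-scores of a column plus the mean and std used."""
    column = np.asarray(column, dtype=float)
    mean = column.mean()
    std = column.std()  # population standard deviation (ddof=0)
    return (column - mean) / std, mean, std

# Hypothetical example values
values = np.array([100.0, 200.0, 300.0, 400.0, 500.0])
z, mean, std = standardize(values)

# The transform is invertible: multiply by std and add the mean back
restored = z * std + mean
print(z)
print(restored)
```

After the transform the column has mean 0 and standard deviation 1, and keeping `mean` and `std` allows scaled predictions to be mapped back to the original units.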

### **f. Split the Dataset** into train and test set

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_transform, y_transform, test_size = .20, random_state=101)
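
The cells above scale the whole dataset before splitting it, as the task asks. A common alternative is to split first and fit the scaler on the training rows only, so that no statistics from the test rows reach the model. This is a sketch under that assumption, using a synthetic feature matrix rather than the housing data; the names `features`, `target`, `f_train` and so on are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (assumption for illustration)
rng = np.random.default_rng(101)
features = rng.normal(loc=[50.0, 6.0], scale=[10.0, 1.0], size=(200, 2))
target = 3.0 * features[:, 0] + rng.normal(0.0, 5.0, size=200)

# Split first ...
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.20, random_state=101
)

# ... then learn the scaling statistics from the training rows only
scaler = StandardScaler()
f_train_scaled = scaler.fit_transform(f_train)
f_test_scaled = scaler.transform(f_test)  # reuses the training mean and std

print(f_train_scaled.mean(axis=0))  # approximately 0 for each column
print(f_test_scaled.mean(axis=0))   # close to, but not exactly, 0
```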

## **2. Use the Scikit Learn Library to fit the Regression Models**

### **a.** Use the different regression models
#### **i.** Linear regression
#### **ii.** Decision Tree Regressor
#### **iii.** Random forest Regressor
#### **iv.** Gradient boosting Regressor

## **i. Linear Regression**

In [None]:
# Import model
from sklearn.linear_model import LinearRegression

# Creating instance of the model
lin_reg = LinearRegression()

# Pass training data to model
lin_reg.fit(x_train, y_train)

In [None]:
# Predict
y_pred_lreg = lin_reg.predict(x_test)

In [None]:
# Convert y_test and y_pred to 1-dimensional arrays using .flatten()
y_test_1d = y_test.flatten()
y_pred_1d = y_pred_lreg.flatten()

# Plot actual against predicted values; points on the red line are perfect predictions
sns.scatterplot(x=y_test_1d, y=y_pred_1d, color='blue', label='Test data points')
plt.plot([min(y_test_1d), max(y_test_1d)], [min(y_test_1d), max(y_test_1d)], color='red', label='Ideal Line')
plt.xlabel('Actual Price (scaled)')
plt.ylabel('Predicted Price (scaled)')
plt.legend()
plt.show()

In [None]:
# Combine actual and predicted values side by side
results = np.column_stack((y_test, y_pred_lreg))

# Printing the results
print("Actual Values  |  Predicted Values")
print("-----------------------------")
for actual, predicted in results:
    print(f"{actual:14.2f} | {predicted:12.2f}")

In [None]:
# Score It
from sklearn.metrics import mean_squared_error

print('Linear Regression Model')
# Results
print('--'*30)
# Compute MSE and RMSE on the test set
mse_lreg = mean_squared_error(y_test, y_pred_lreg)
rmse_lreg = np.sqrt(mse_lreg)

# Print evaluation metrics
print("Mean Squared Error:", mse_lreg)
print("Root Mean Squared Error:", rmse_lreg)
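
For reference, `mean_squared_error` averages the squared differences between actual and predicted values, and RMSE is its square root. The sketch below computes both by hand on four hypothetical values; the helper `mse` and the arrays are assumptions for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical actual and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse_value = mse(y_true, y_pred)         # (0.25 + 0.25 + 0 + 1) / 4
rmse_value = float(np.sqrt(mse_value))  # same units as the target
print("Mean Squared Error:", mse_value)
print("Root Mean Squared Error:", rmse_value)
```

RMSE is often easier to interpret than MSE because it is expressed in the same units as the target variable.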

## **ii. Decision tree Regressor**

In [None]:
# Import model
from sklearn.tree import DecisionTreeRegressor

# Creating instance of the model
Dtr = DecisionTreeRegressor()

# Pass training data to model
Dtr.fit(x_train, y_train)

y_pred_dtr = Dtr.predict(x_test)

In [None]:
print('Decision Tree Regressor')
# Results
print('--'*30)
# Compute MSE and RMSE on the test set
mse_dtr = mean_squared_error(y_test, y_pred_dtr)
rmse_dtr = np.sqrt(mse_dtr)

# Print evaluation metrics
print("Mean Squared Error:", mse_dtr)
print("Root Mean Squared Error:", rmse_dtr)

## **iii. Random forest Regressor**

In [None]:
# Import model
from sklearn.ensemble import RandomForestRegressor

# Creating instance of the model
Rfr = RandomForestRegressor()

# Pass training data to model (ravel() gives the 1-D target this estimator expects)
Rfr.fit(x_train, y_train.ravel())

y_pred_rfr = Rfr.predict(x_test)

In [None]:
print('Random Forest Regressor')
# Results
print('--'*30)
# Compute MSE and RMSE on the test set
mse_rfr = mean_squared_error(y_test, y_pred_rfr)
rmse_rfr = np.sqrt(mse_rfr)

# Print evaluation metrics
print("Mean Squared Error:", mse_rfr)
print("Root Mean Squared Error:", rmse_rfr)

## **iv. Gradient boosting Regressor**

In [None]:
# Import model
from sklearn.ensemble import GradientBoostingRegressor

# Creating instance of the model
Gbr = GradientBoostingRegressor()

# Pass training data to model (ravel() gives the 1-D target this estimator expects)
Gbr.fit(x_train, y_train.ravel())

y_pred_gbr = Gbr.predict(x_test)

In [None]:
print('Gradient Boosting Regressor')
# Results
print('--'*30)
# Compute MSE and RMSE on the test set
mse_gbr = mean_squared_error(y_test, y_pred_gbr)
rmse_gbr = np.sqrt(mse_gbr)

# Print evaluation metrics
print("Mean Squared Error:", mse_gbr)
print("Root Mean Squared Error:", rmse_gbr)

### **b.** You have to report the **MSE** result with the following combinations
#### **i.** Without feature scaling
#### **ii.** With only feature scaling (without target variable)
#### **iii.** With feature and target variable scaling

## **i. Without feature scaling**

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = .20, random_state = 101)

models = [LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor(), GradientBoostingRegressor()]
model_names = ('Linear Regression', 'Decision Tree Regressor', 'Random Forest Regressor', 'Gradient Boosting Regressor')

models_score = []
for model, model_name in zip(models, model_names):
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    mse = mean_squared_error(y_test, y_pred)
    models_score.append([model_name, mse])

sorted_models = sorted(models_score, key=lambda x: x[1], reverse=True)
for model in sorted_models:
    print(f'Model: {model[0]}, Mean Squared Error (MSE): {model[1]:.2f}')

## **ii. With only feature scaling (without target variable)**

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

from sklearn.model_selection import train_test_split

# The features were standardized earlier with StandardScaler and are stored in 'x_transform';
# the target 'y' is left in its original units
x_train, x_test, y_train, y_test = train_test_split(x_transform, y, test_size=0.20, random_state=101)

models = [LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor(), GradientBoostingRegressor()]
model_names = ('Linear Regression', 'Decision Tree Regressor', 'Random Forest Regressor', 'Gradient Boosting Regressor')

models_score = []
for model, model_name in zip(models, model_names):
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    mse = mean_squared_error(y_test, y_pred)
    models_score.append([model_name, mse])

sorted_models = sorted(models_score, key=lambda x: x[1], reverse=True)
for model in sorted_models:
    print(f'Model: {model[0]}, Mean Squared Error (MSE): {model[1]:.2f}')

print("Note: Features have been transformed using StandardScaler.")

## **iii. With feature and target variable scaling**

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

from sklearn.model_selection import train_test_split

# The features and the target were standardized earlier with StandardScaler
# and are stored in 'x_transform' and 'y_transform'
x_train, x_test, y_train, y_test = train_test_split(x_transform, y_transform, test_size=0.20, random_state=101)

models = [LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor(), GradientBoostingRegressor()]
model_names = ('Linear Regression', 'Decision Tree Regressor', 'Random Forest Regressor', 'Gradient Boosting Regressor')

models_score = []
for model, model_name in zip(models, model_names):
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    mse = mean_squared_error(y_test, y_pred)
    models_score.append([model_name, mse])

sorted_models = sorted(models_score, key=lambda x: x[1], reverse=True)
for model in sorted_models:
    print(f'Model: {model[0]}, Mean Squared Error (MSE): {model[1]:.2f}')

print("Note: Features and target variable both have been transformed using StandardScaler.")

## **Comparison of MSE:**
### **Without feature scaling:**
- Decision Tree Regressor (MSE): 32320110401.78
- Random Forest Regressor (MSE): 15118290670.43
- Gradient Boosting Regressor (MSE): 12408033260.39
- Linear Regression (MSE): 10100187858.86

### **With only feature scaling:**
- Decision Tree Regressor (MSE): 32523773080.99
- Random Forest Regressor (MSE): 15173002261.80
- Gradient Boosting Regressor (MSE): 12400765586.94
- Linear Regression (MSE): 10100187858.87

### **With feature and target variable scaling:**
- Decision Tree Regressor (MSE): 0.25
- Random Forest Regressor (MSE): 0.12
- Gradient Boosting Regressor (MSE): 0.10
- Linear Regression (MSE): 0.08
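
The MSE values with target scaling are much smaller only because they are measured in standardized units: dividing the target by its standard deviation divides the MSE by the target's variance. A scaled MSE of 0.08 therefore means the squared error is about 8% of the target's variance, and multiplying by that variance recovers the original units. The sketch below checks this relationship on synthetic numbers; the arrays `y` and `y_hat` are assumptions, not the housing data.

```python
import numpy as np

# Synthetic target and predictions (assumption for illustration)
rng = np.random.default_rng(1)
y = rng.normal(1_200_000, 350_000, size=1000)
y_hat = y + rng.normal(0, 100_000, size=1000)

# Standardize both with the target's own mean and standard deviation
mu, sigma = y.mean(), y.std()
y_s = (y - mu) / sigma
y_hat_s = (y_hat - mu) / sigma

mse_original = np.mean((y - y_hat) ** 2)
mse_scaled = np.mean((y_s - y_hat_s) ** 2)

# Converting back: MSE in original units = scaled MSE * variance of the target
print(f"original-unit MSE: {mse_original:.2f}")
print(f"scaled MSE:        {mse_scaled:.4f}")
print(f"scaled MSE * var:  {mse_scaled * sigma ** 2:.2f}")
```

For this reason the three tables are best compared by model ranking within each table, not by the magnitude of the numbers across tables.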

### **c.** Display the ranking of different models according to their **MSE** values

In [None]:
model_scores = {
    "Linear Regression": 0.08101725519794249,
    "Decision Tree Regressor": 0.25137569765775214,
    "Random Forest Regressor": 0.12042240672361741,
    "Gradient Boosting Regressor": 0.09946292746379987
}

# Sort the model scores in ascending order based on their values (lower values first)
sorted_scores = sorted(model_scores.items(), key=lambda x: x[1])

# Display the ranking of the models
print("Model Rankings according to their MSE values:")
for rank, (model_name, score) in enumerate(sorted_scores, start=1):
    print(f"{rank}. {model_name}: {score}")
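
The ranking can also be produced straight from a `[model_name, mse]` list like the `models_score` list built in section b, which avoids copying values by hand. In this sketch the list is filled with the four values reported above so that it runs on its own; in the notebook it would come from the training loop.

```python
# [model_name, mse] pairs, here filled with the values reported above
models_score = [
    ["Linear Regression", 0.08101725519794249],
    ["Decision Tree Regressor", 0.25137569765775214],
    ["Random Forest Regressor", 0.12042240672361741],
    ["Gradient Boosting Regressor", 0.09946292746379987],
]

# Lower MSE is better, so sort in ascending order
ranking = sorted(models_score, key=lambda item: item[1])

print("Model Rankings according to their MSE values:")
for rank, (model_name, score) in enumerate(ranking, start=1):
    print(f"{rank}. {model_name}: {score:.4f}")
```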