In [1]:
# import the dataset using pandas
import pandas as pd
housing = pd.read_csv('housing.csv')

In [2]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=10)

In [3]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

num_attribs = ["longitude", "latitude", "housing_median_age", "total_rooms",
               "total_bedrooms", "population", "households", "median_income"]
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
])

cat_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("oneHot", OneHotEncoder()),
])

preprocessing = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
])

In [5]:
housing = train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = train_set["median_house_value"].copy()

In [6]:
# develop a linear regression model using the prepared dataset

from sklearn.linear_model import LinearRegression

housing_prepared = preprocessing.fit_transform(housing)
lin_reg = LinearRegression()

lin_reg.fit(housing_prepared, housing_labels)


In [7]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

lin_reg = make_pipeline(preprocessing, LinearRegression())
lin_reg.fit(housing, housing_labels)

In [8]:
housing_predictions = lin_reg.predict(housing)
housing_predictions[:5].round(-2)  # -2 = rounded to the nearest hundred

array([183600., 218100., 314300., 157900., 234400.])

In [15]:
#shape of the new data
housing_prepared.shape

(16512, 13)

In [21]:
# the actual labels for the same five districts
housing_labels.iloc[:5].values

array([145200., 117000., 263900., 163700., 236100.])

The **error_ratios** are the relative differences between the first five predicted housing values (housing_predictions[:5]) and the corresponding actual labels (housing_labels.iloc[:5]). Each prediction is first rounded to the nearest hundred (round(-2)), then divided by the actual label, and finally 1 is subtracted; multiplying by 100 expresses the result as a percentage. A positive ratio means the model overestimated the value, and a negative ratio means it underestimated it.

In [24]:
#calculating error ratio
error_ratios = housing_predictions[:5].round(-2) / housing_labels.iloc[:5].values - 1
print(", ".join([f"{100 * ratio:.1f}%" for ratio in error_ratios]))

26.4%, 86.4%, 19.1%, -3.5%, -0.7%


In [25]:
#evaluate the model performance
from sklearn.metrics import mean_squared_error

# RMSE = sqrt(MSE); the `squared=False` argument is deprecated in newer scikit-learn releases
lin_rmse = mean_squared_error(housing_labels, housing_predictions) ** 0.5
lin_rmse



68539.44416127144

This is better than nothing, but clearly not a great score: the median_house_value of most districts ranges between 120,000 dollars and 265,000 dollars, so a typical prediction error of 68,539 dollars is not very satisfying.
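
To make the metric concrete, here is a minimal sketch that computes the RMSE by hand for only the five districts shown earlier. It uses the rounded predictions and labels copied from the outputs above, so the result is an illustration of the formula, not the full training-set score; the variable names are chosen for this example.

```python
import math

# First five (rounded) predictions and actual labels, copied from the outputs above
predictions = [183600.0, 218100.0, 314300.0, 157900.0, 234400.0]
labels = [145200.0, 117000.0, 263900.0, 163700.0, 236100.0]

# RMSE = square root of the mean of the squared errors
squared_errors = [(p - y) ** 2 for p, y in zip(predictions, labels)]
rmse_first_five = math.sqrt(sum(squared_errors) / len(squared_errors))

print(f"RMSE over these five districts: {rmse_first_five:,.0f}")
```

Because the errors are squared before averaging, the single large miss (the second district) dominates the result.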

This is an example of a model underfitting the training data. When this happens, it can mean that the features do not provide enough information to make good predictions, or that the model is not powerful enough.

**The main ways to fix underfitting are to:**

* select a more powerful model,
* feed the training algorithm with better features, or
* reduce the constraints on the model.
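
As a small illustration of the second remedy, the sketch below fits a linear model to synthetic quadratic data, once with the raw feature and once with an added squared feature. The data, the variable names, and the choice of PolynomialFeatures are assumptions made for this example; they are not part of the housing workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic relationship (assumption: toy data, not the housing set)
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(scale=0.5, size=200)

# A plain linear model cannot follow the curve, so it underfits
plain = LinearRegression().fit(X, y)

# The same linear model with an extra x^2 feature
richer = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X, y)

plain_rmse = mean_squared_error(y, plain.predict(X)) ** 0.5
richer_rmse = mean_squared_error(y, richer.predict(X)) ** 0.5
print(f"raw feature RMSE: {plain_rmse:.2f}, with x^2 feature RMSE: {richer_rmse:.2f}")
```

The learning algorithm is unchanged in both cases; only the features it receives differ.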

Let's train a **DecisionTreeRegressor**, a powerful model that can capture intricate nonlinear patterns in the data.

In [27]:
# Let's train a DT model
from sklearn.tree import DecisionTreeRegressor

tree_reg = make_pipeline(preprocessing, DecisionTreeRegressor(random_state=42))
tree_reg.fit(housing, housing_labels)

In [29]:
#evaluate the model fit on the training set
from sklearn.metrics import mean_squared_error
housing_predictions = tree_reg.predict(housing)
# RMSE = sqrt(MSE); the `squared=False` argument is deprecated in newer scikit-learn releases
tree_rmse = mean_squared_error(housing_labels, housing_predictions) ** 0.5
tree_rmse



0.0

Zero error? Is this model actually flawless? It is far more likely that the model has severely overfit the training data. How can you be certain? As we discussed earlier, you should avoid using the test set until you are confident in a model and ready to deploy it. Instead, split the training data into two subsets: one for training and one for model validation.
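
The hold-out idea described above can be sketched as follows. To stay self-contained, this example uses synthetic data from make_regression as a stand-in for the prepared housing features; the split ratio and all variable names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training data (assumption: not the housing dataset)
X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=42)

# Hold out 20% of the *training* data for validation; the test set is never touched
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)

# Compare the error on data the tree has seen with data it has not seen
train_rmse = mean_squared_error(y_train, tree.predict(X_train)) ** 0.5
val_rmse = mean_squared_error(y_val, tree.predict(X_val)) ** 0.5
print(f"train RMSE: {train_rmse:.1f}, validation RMSE: {val_rmse:.1f}")
```

A training error near zero paired with a much larger validation error is the usual sign of overfitting.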