**Question : Welcome to the Machine Learning Housing Corporation! Your first task is to use California census data to build a model of housing prices in the state. This data includes metrics such as the population, median income, and median housing price for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). We will call them “districts” for short. Your model should learn from this data and be able to predict the median housing price in any district, given all the other metrics.**

# Loading the dataset from a CSV file.

In [None]:
%matplotlib inline
import os
import pandas as pd
import matplotlib.pyplot as plt
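# Raw string, so the backslashes in the Windows path are not read as escape sequences.
HOUSING_PATH = r'D:\Python Runtime\Machine Learning\Chapter 2 (End To End)'

def load_housing_data(housing_path=HOUSING_PATH):
    """Read housing.csv from housing_path into a DataFrame."""
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)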

In [None]:
data = load_housing_data()

data.head() prints the first 5 rows from the dataset.

In [None]:
data.head()

data.info() summarizes the dataset: the column names, the number of non-null values in each column, each column's data type, and so on.

In [None]:
data.info()

In [None]:
data.describe()

Count the frequency of every unique value in the column **ocean_proximity**:

In [None]:
data["ocean_proximity"].value_counts()

**Plotting histograms for all the numeric columns with 50 bins**

In [None]:
data.hist(bins=50, figsize=(20,15))
plt.show()

# Splitting the dataset

In [None]:
import numpy as np
np.random.seed(42)

In [None]:
data["income_cat"] = pd.cut(data["median_income"],
bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
labels=[1, 2, 3, 4, 5])
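
To see how the bin edges map incomes to categories, here is a minimal sketch on made-up income values (the `toy_income` numbers are illustrative assumptions, not rows from the dataset). `pd.cut` uses right-closed intervals by default, so a value that sits exactly on an edge falls into the lower category.

```python
import numpy as np
import pandas as pd

# Hypothetical median incomes, chosen to hit the bin edges.
toy_income = pd.Series([0.9, 1.5, 2.7, 4.5, 5.2, 12.0])

# Same bin edges and labels as the cell above.
# Intervals are (0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, inf].
toy_cat = pd.cut(toy_income,
                 bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                 labels=[1, 2, 3, 4, 5])

print(toy_cat.tolist())  # [1, 1, 2, 3, 4, 5]
```

Everything above 6.0 lands in category 5, which keeps the number of strata small enough for each one to hold a reasonable number of districts.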

In [None]:
data["income_cat"].hist()

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["income_cat"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]

In [None]:
print("Training data :", len(strat_train_set))
print("Testing data :", len(strat_test_set))

In [None]:
strat_test_set["income_cat"].value_counts() / len(strat_test_set)
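
The point of stratifying is that the test set keeps roughly the same category proportions as the full dataset. The sketch below checks this on synthetic data; the `toy` frame, its `stratum` column, and the class probabilities are assumptions made up for illustration, and a purely random split is included for comparison.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic data with an imbalanced category column (illustrative only).
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "value": rng.normal(size=1000),
    "stratum": rng.choice([1, 2, 3], size=1000, p=[0.1, 0.3, 0.6]),
})

# Stratified split: the splitter yields positional indices, hence iloc.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, toy["stratum"]))
strat_test = toy.iloc[test_idx]

# Purely random split for comparison.
_, random_test = train_test_split(toy, test_size=0.2, random_state=42)

# Category proportions side by side.
comparison = pd.DataFrame({
    "overall": toy["stratum"].value_counts(normalize=True),
    "stratified": strat_test["stratum"].value_counts(normalize=True),
    "random": random_test["stratum"].value_counts(normalize=True),
}).sort_index()
print(comparison)
```

The `stratified` column should track `overall` closely, while the `random` column is free to drift, especially for the rarest category.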

In [None]:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

# Some inferences on the data : 

In [None]:
sample_data = strat_train_set.copy()

**Geographical distribution of the districts :**

In [None]:
sample_data.plot(kind = "scatter" , x = 'longitude' , y = 'latitude' , alpha = 0.1)

In [None]:
sample_data.plot(
    kind="scatter", 
    x="longitude", 
    y="latitude", 
    alpha=0.4,
    s=sample_data["population"]/100, 
    label="population", 
    figsize=(10,7),
    c="median_house_value", 
    cmap=plt.get_cmap("jet"), 
    colorbar=True,
)
plt.legend()

**Correlation between various features.**

In [None]:
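# numeric_only=True (pandas >= 1.5) skips the text column ocean_proximity,
# which would otherwise raise an error in pandas 2.x.
corr_matrix = sample_data.corr(numeric_only=True)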

In [None]:
corr_matrix

**Correlation between median_house_value and other features.**

In [None]:
corr_matrix["median_house_value"].sort_values(ascending=False)

**Correlation graph for some columns in the dataset.**

In [None]:
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(sample_data[attributes], figsize=(12, 8))
plt.show()

In [None]:
sample_data.plot(kind="scatter", x="median_income", y="median_house_value",alpha=0.1)

**Generating new features like:**
* rooms_per_household
* bedrooms_per_room
* population_per_household

In [None]:
sample_data["rooms_per_household"] = sample_data["total_rooms"]/sample_data["households"]
sample_data["bedrooms_per_room"] = sample_data["total_bedrooms"]/sample_data["total_rooms"]
sample_data["population_per_household"]=sample_data["population"]/sample_data["households"]

In [None]:
sample_data.head()

In [None]:
# numeric_only=True (pandas >= 1.5) skips the text column ocean_proximity.
corr_matrix = sample_data.corr(numeric_only=True)

In [None]:
corr_matrix["median_house_value"].sort_values(ascending=False)

# Preparing the training dataset for various regression models :

In [None]:
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

**Handling missing data:**

* housing.dropna(subset=["total_bedrooms"]) # option 1 - Drop only the rows with missing values

* housing.drop("total_bedrooms", axis=1) # option 2 - Drop the whole column

* median = housing["total_bedrooms"].median() # option 3 - Replace the NaN values with the median, 0, etc.
  housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)
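
The three options can be compared on a tiny frame. This is a minimal sketch; the `toy` values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Four hypothetical districts, one with a missing total_bedrooms value.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

# Option 1: drop the rows that have a missing value.
option1 = toy.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole column.
option2 = toy.drop("total_bedrooms", axis=1)

# Option 3: fill the gaps with the median of the observed values (190.0 here).
median = toy["total_bedrooms"].median()
option3 = toy.assign(total_bedrooms=toy["total_bedrooms"].fillna(median))

print(option1.shape)                        # (3, 2)
print(option2.shape)                        # (4, 1)
print(option3["total_bedrooms"].tolist())   # [129.0, 190.0, 190.0, 235.0]
```

With option 3, the median computed on the training set has to be kept so that the same value can fill gaps in the test set and in new data later.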


**Using sklearn imputer to handle missing data:**

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

In [None]:
# Numeric data only
housing_num = housing.drop("ocean_proximity", axis=1)

In [None]:
imputer.fit(housing_num)

In [None]:
imputer.statistics_

In [None]:
X = imputer.transform(housing_num)

In [None]:
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)

In [None]:
housing_tr.info()

**Dealing with the categorical data (textual data) :**

In [None]:
housing_cat = housing[["ocean_proximity"]]

In [None]:
housing_cat.head(10)

**Ordinal encoder for the ocean_proximity column**

In [None]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]

In [None]:
ordinal_encoder.categories_

**One hot encoder for the ocean_proximity column**

In [None]:
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

In [None]:
housing_cat_1hot.toarray()
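
The two encoders differ in what they imply about the categories. An ordinal code suggests that nearby numbers are similar categories, which is not obviously true for ocean proximity, whereas one-hot encoding gives each category its own 0/1 column. A minimal sketch on four hand-picked example values (the `toy_cat` frame is an illustrative assumption):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical sample of the categorical column.
toy_cat = pd.DataFrame(
    {"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN"]}
)

# Ordinal encoding: one integer code per category, assigned in sorted order.
ordinal = OrdinalEncoder()
codes = ordinal.fit_transform(toy_cat)
print(ordinal.categories_[0].tolist())  # ['<1H OCEAN', 'INLAND', 'NEAR BAY']
print(codes.ravel().tolist())           # [1.0, 2.0, 1.0, 0.0]

# One-hot encoding: one column per category, exactly one 1 per row.
onehot = OneHotEncoder()
dense = onehot.fit_transform(toy_cat).toarray()  # sparse matrix -> dense array
print(dense)
```

`OneHotEncoder` returns a sparse matrix by default, which is why `toarray()` is needed to view it as a regular NumPy array.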

**Creating a custom transformer class to apply all the required transformations :**

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
 
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self # nothing else to do

    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,bedrooms_per_room]
        else:  
            return np.c_[X, rooms_per_household, population_per_household]
        # np.c_[...] concatenates arrays along the second axis (column-wise)

In [None]:
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=True)
housing_extra_attribs = attr_adder.transform(housing.values)

In [None]:
housing_extra_attribs[0]
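
The hard-coded indices 3, 4, 5, 6 only work if the columns arrive in a particular order. One possible safeguard, sketched below, is to look the positions up by column name. The column order in `toy_num` is an assumption about housing.csv and should be checked against `housing_num.columns`.

```python
import pandas as pd

# Assumed order of the numeric columns (verify against housing_num.columns).
toy_num = pd.DataFrame(columns=[
    "longitude", "latitude", "housing_median_age", "total_rooms",
    "total_bedrooms", "population", "households", "median_income",
])

# Look up each position by name instead of hard-coding it.
rooms_ix, bedrooms_ix, population_ix, households_ix = [
    toy_num.columns.get_loc(name)
    for name in ("total_rooms", "total_bedrooms", "population", "households")
]

print(rooms_ix, bedrooms_ix, population_ix, households_ix)  # 3 4 5 6
```

If a column is ever renamed or removed, `get_loc` raises a `KeyError` instead of silently dividing the wrong columns.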

**Transformations using the Pipeline :**

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
    ])
housing_num_tr = num_pipeline.fit_transform(housing_num)

In [None]:
from sklearn.compose import ColumnTransformer
num_attribs = list(housing_num) # List of all numeric column names.
cat_attribs = ["ocean_proximity"]
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
    ])
housing_prepared = full_pipeline.fit_transform(housing)

# Linear regression model:

**Training the regression model :**

In [None]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

In [None]:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", lin_reg.predict(some_data_prepared))
print("Actual data : \n" , list(some_labels))

**Calculating the RMSE error :**

In [None]:
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
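
RMSE is the square root of the mean of the squared differences between the labels and the predictions, so it is expressed in the same unit as the label. A small worked example with made-up numbers (the `y_true` and `y_pred` arrays are illustrative assumptions), computed by hand and with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical labels and predictions.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# By hand: errors are -10, 10, -30, so the squared errors are 100, 100, 900.
manual_rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity through scikit-learn.
sklearn_rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(manual_rmse, sklearn_rmse)  # both about 19.15
```

Because the errors are squared before averaging, the single error of 30 dominates the result: large misses are penalized more than small ones.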

# DecisionTreeRegressor model:

In [None]:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)

In [None]:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", tree_reg.predict(some_data_prepared))
print("Actual data : \n" , list(some_labels))

**Calculating the RMSE error : (Model overfits badly)**

In [None]:
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
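
A training error close to zero is a warning sign rather than a success: an unconstrained decision tree can memorize the training set. The sketch below illustrates this on synthetic data (the dataset from `make_regression` and all names such as `toy_tree` are assumptions for illustration, not the housing data) by comparing the training RMSE with the RMSE on held-out rows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic regression problem.
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# No depth limit, so the tree is free to fit every training point.
toy_tree = DecisionTreeRegressor(random_state=42)
toy_tree.fit(X_train, y_train)

train_rmse = np.sqrt(mean_squared_error(y_train, toy_tree.predict(X_train)))
val_rmse = np.sqrt(mean_squared_error(y_val, toy_tree.predict(X_val)))

print("Training RMSE:", train_rmse)
print("Validation RMSE:", val_rmse)
```

The gap between the two numbers is what the training-set RMSE alone cannot show, which is why the model needs to be evaluated on data it has not seen.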

# RandomForestRegressor model

In [None]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)

In [None]:
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", forest_reg.predict(some_data_prepared))
print("Actual data : \n" , list(some_labels))

**Calculating the RMSE error :**

In [None]:
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse

# Cross-validation

## Cross-validation on DecisionTreeRegressor

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

In [None]:
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

In [None]:
display_scores(tree_rmse_scores)
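
The minus sign in `np.sqrt(-scores)` is needed because scikit-learn's cross-validation expects a score where greater is better, so `"neg_mean_squared_error"` returns the MSE negated. A minimal sketch on synthetic data (the dataset and the names `neg_mse` and `toy_rmse` are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Small synthetic regression problem.
X, y = make_regression(n_samples=200, n_features=4, noise=15.0, random_state=0)

# One score per fold; each one is the negated MSE on that fold.
neg_mse = cross_val_score(
    LinearRegression(), X, y, scoring="neg_mean_squared_error", cv=5
)

# Flip the sign back before taking the square root.
toy_rmse = np.sqrt(-neg_mse)

print("Negated MSE per fold:", neg_mse)
print("RMSE per fold:", toy_rmse)
```

Taking the square root without flipping the sign would produce NaN values, since every entry of `neg_mse` is negative or zero.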

## Cross-validation on LinearRegression

In [None]:
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)

## Cross-validation on RandomForestRegressor

In [None]:
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

# Fine Tuning the model :