### In this homework, we will use the California Housing Prices dataset again


You have to:
* Use the same dataset as in homework #3.
* Filter the data to include only records where ocean_proximity is either '<1H OCEAN' or 'INLAND'.
* Unlike homework #3, use all columns of the dataset for this analysis.

What to do:

* Replace any missing values with zeros.
* Apply a logarithmic transformation to the target variable median_house_value.
* Divide the dataset into train, validation, and test sets with a 60%/20%/20% split.
* Use the train_test_split function from scikit-learn and set the random_state parameter to 1 for reproducibility.
* Use DictVectorizer(sparse=True) to convert the dataframes into sparse matrices.
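
The log transform in the steps above means the model is trained and evaluated in log space. As a minimal sketch, assuming a few made-up prices (not rows from the dataset), the snippet below shows how the natural log compresses a wide range of values and how the exponential maps them back to the original scale:

```python
import math

# Made-up house prices, used only for illustration.
prices = [50_000.0, 150_000.0, 500_000.0]

# Natural log, the scalar counterpart of np.log.
log_prices = [math.log(p) for p in prices]

# The exponential inverts the transform and recovers the original prices.
recovered = [math.exp(v) for v in log_prices]

print([round(v, 3) for v in log_prices])
print([round(p) for p in recovered])
```

A tenfold difference in price becomes a constant difference of log(10) in log space, which is why errors on expensive houses no longer dominate the metric.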


In [1]:
# Do all the preprocessing steps above here:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

In [2]:
housing_data = pd.read_csv('housing.csv')

In [3]:
filtered_data = housing_data[housing_data['ocean_proximity'].isin(['<1H OCEAN', 'INLAND'])]

In [4]:
# Replace missing values with zeros; plain reassignment avoids writing into a slice of housing_data.
filtered_data = filtered_data.fillna(0)

In [5]:
filtered_data.loc[:, 'median_house_value'] = np.log(filtered_data['median_house_value'])

In [6]:
# Hold out 40% of the rows, then split that part in half: 60% train / 20% val / 20% test.
train_data, holdout_data = train_test_split(filtered_data, test_size=0.4, random_state=1)
val_data, test_data = train_test_split(holdout_data, test_size=0.5, random_state=1)

In [7]:
dv = DictVectorizer(sparse=True)

In [8]:
X_train = dv.fit_transform(train_data.drop(columns=['median_house_value']).to_dict(orient='records'))
X_val = dv.transform(val_data.drop(columns=['median_house_value']).to_dict(orient='records'))
X_test = dv.transform(test_data.drop(columns=['median_house_value']).to_dict(orient='records'))

y_train = train_data['median_house_value']
y_val = val_data['median_house_value']
y_test = test_data['median_house_value']
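
The two train_test_split calls above first hold out 40% of the rows and then halve the held-out part. A small stdlib sketch of that arithmetic is shown below, assuming 1000 rows; the helper two_stage_split is hypothetical and does not reproduce how scikit-learn shuffles internally, it only shows why the result is a 60%/20%/20% split:

```python
import random


def two_stage_split(n_rows, seed=1):
    """Hold out 40% of the row indices, then split the held-out part in half."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    n_holdout = round(n_rows * 0.4)
    holdout_idx = indices[:n_holdout]
    train_idx = indices[n_holdout:]

    n_test = round(len(holdout_idx) * 0.5)
    test_idx = holdout_idx[:n_test]
    val_idx = holdout_idx[n_test:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = two_stage_split(1000)
split_sizes = (len(train_idx), len(val_idx), len(test_idx))
print(split_sizes)
```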

### **Question 1**
Train a decision tree regressor with max_depth=1 to predict median_house_value.

Which feature is used for the root split?

- ocean_proximity
- total_rooms
- latitude
- population

In [9]:
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(max_depth=1, random_state=1)
dt.fit(X_train, y_train)

root_feature = dv.feature_names_[dt.tree_.feature[0]]
print("Answer:", root_feature)

Answer: ocean_proximity=INLAND
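
The reported name is ocean_proximity=INLAND rather than ocean_proximity because DictVectorizer one-hot encodes string values into separate name=value columns. The pure-Python sketch below imitates only that naming convention on two made-up records; the helpers one_hot_feature_names and encode are hypothetical and are not the library's implementation:

```python
def one_hot_feature_names(records):
    """Collect column names: numeric keys stay as-is, strings become 'key=value'."""
    names = set()
    for record in records:
        for key, value in record.items():
            if isinstance(value, str):
                names.add(f"{key}={value}")
            else:
                names.add(key)
    return sorted(names)


def encode(record, feature_names):
    """Turn one record into a dense row aligned with feature_names."""
    row = [0.0] * len(feature_names)
    for key, value in record.items():
        is_category = isinstance(value, str)
        name = f"{key}={value}" if is_category else key
        if name in feature_names:
            row[feature_names.index(name)] = 1.0 if is_category else float(value)
    return row


# Made-up records, for illustration only.
records = [
    {"ocean_proximity": "INLAND", "median_income": 3.5},
    {"ocean_proximity": "<1H OCEAN", "median_income": 5.0},
]

feature_names = one_hot_feature_names(records)
encoded = [encode(record, feature_names) for record in records]
print(feature_names)
print(encoded)
```

A root split on ocean_proximity=INLAND therefore separates inland rows from the rest, so the matching answer option is ocean_proximity.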


### **Question 2**
Train a random forest regressor with:
* 10 trees (n_estimators=10),
* random_state=1,
* n_jobs=-1.

What is the RMSE of this model on the validation dataset?

- 0.045
- 0.245
- 0.545
- 0.845

In [10]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

y_pred_val = rf.predict(X_val)
rmse_val = np.sqrt(mean_squared_error(y_val, y_pred_val))
print("RMSE answer:", round(rmse_val, 3))

RMSE answer: 0.235
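
Because the target was log-transformed, this RMSE is measured in log units rather than in dollars. As a minimal sketch of what np.sqrt(mean_squared_error(...)) computes, here is a stdlib version applied to made-up log-prices (not values from this notebook):

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up log-prices, for illustration only.
example_rmse = rmse([12.0, 12.5, 13.0], [12.0, 12.5, 12.4])
print(round(example_rmse, 3))
```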


### **Question 3**
Experiment with the random forest model:

* Vary n_estimators from 10 to 200 in steps of 10
* Keep random_state=1
* Evaluate RMSE on the validation set for each n_estimators value

At which n_estimators value does the RMSE stop improving (consider 3 decimal places)?

- 10
- 25
- 50
- 160

In [11]:
n_estimators_range = range(10, 201, 10)
rmse_values = []

for n in n_estimators_range:
    rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_train, y_train)
    y_pred_val = rf.predict(X_val)
    rmse = np.sqrt(mean_squared_error(y_val, y_pred_val))
    rmse_values.append(rmse)

best_n_estimators = n_estimators_range[np.argmin(rmse_values)]
print("Optimal number of trees:", best_n_estimators)


Optimal number of trees: 150
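
np.argmin on the unrounded values returns the global minimum, which is not necessarily the point where the RMSE stops improving at 3 decimal places. One possible reading of the question is sketched below with made-up RMSE values (not results from this notebook); the helper name first_best_rounded is hypothetical:

```python
def first_best_rounded(n_values, rmse_values, decimals=3):
    """Return the first n whose rounded RMSE equals the best rounded RMSE."""
    rounded = [round(value, decimals) for value in rmse_values]
    return n_values[rounded.index(min(rounded))]


# Made-up numbers, for illustration only.
n_values = [10, 20, 30, 40, 50, 60]
rmse_values = [0.2451, 0.2384, 0.2369, 0.2352, 0.2349, 0.2351]

plateau_n = first_best_rounded(n_values, rmse_values)
global_min_n = n_values[rmse_values.index(min(rmse_values))]
print(plateau_n, global_min_n)
```

In this made-up series the two criteria disagree: the rounded RMSE already reaches its best value at an earlier n than the unrounded minimum.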


### **Question 4**
Experiment to find the best max_depth for the random forest.
What to do:
* Test max_depth values: 10, 15, 20, 25
* For each max_depth vary n_estimators from 10 to 200 (step 10)
* Calculate mean RMSE across all n_estimators
* Use random_state=1

Which max_depth is best according to the mean RMSE?

- 10
- 15
- 20
- 25

In [12]:
max_depth_values = [10, 15, 20, 25]
n_estimators_range = range(10, 210, 10)
random_state = 13  # note: the task statement above asks for random_state=1

In [13]:
from math import sqrt

mean_rmse_dict = {}

for max_depth in max_depth_values:
    rmse_list = []
    for n_estimators in n_estimators_range:

        rf_model = RandomForestRegressor(max_depth=max_depth, n_estimators=n_estimators, random_state=random_state)
        rf_model.fit(X_train, y_train)

        y_pred = rf_model.predict(X_val)

        rmse = sqrt(mean_squared_error(y_val, y_pred))
        rmse_list.append(rmse)

    mean_rmse_dict[max_depth] = np.mean(rmse_list)

best_max_depth = min(mean_rmse_dict, key=mean_rmse_dict.get)

print("Mean RMSE by max_depth:", mean_rmse_dict)
print("Best tree depth (max_depth):", best_max_depth)

Mean RMSE by max_depth: {10: 0.2326386603841415, 15: 0.22358645186317871, 20: 0.22262417744647184, 25: 0.22276327257114859}
Best tree depth (max_depth): 20


### **Question 5**
Tree-based models use a "gain" metric when splitting nodes: the reduction in impurity achieved by each split. This gain helps identify important features. In scikit-learn, this information is stored in the feature_importances_ attribute.
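
For regression trees, one common impurity measure is the variance of the target in a node, so the gain of a split is the parent variance minus the size-weighted variance of the children. The sketch below illustrates that idea on made-up targets; the helper variance_reduction is hypothetical and is not scikit-learn's internal code:

```python
from statistics import pvariance


def variance_reduction(parent, left, right):
    """Gain of a split: parent variance minus size-weighted child variance."""
    n = len(parent)
    weighted_children = (
        len(left) / n * pvariance(left) + len(right) / n * pvariance(right)
    )
    return pvariance(parent) - weighted_children


# Made-up target values, for illustration only.
parent = [1.0, 1.0, 3.0, 3.0]

# A split that separates low from high targets removes all the variance.
good_gain = variance_reduction(parent, [1.0, 1.0], [3.0, 3.0])

# A split that leaves both children as mixed as the parent gains nothing.
poor_gain = variance_reduction(parent, [1.0, 3.0], [1.0, 3.0])

print(good_gain, poor_gain)
```

Roughly speaking, a feature's importance accumulates the gains of the splits that use it, weighted by how many samples reach those splits, and is then normalized across features.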

Train a random forest model with these parameters:

* n_estimators=10
* max_depth=20
* random_state=1
* n_jobs=-1 (optional)

Extract the feature importances from the trained model.

Which of these 4 features has the highest importance score?
- total_rooms
- median_income 
- total_bedrooms
- longitude


In [14]:
dv = DictVectorizer(sparse=True)
X_train = dv.fit_transform(train_data.drop(columns=['median_house_value']).to_dict(orient='records'))
y_train = train_data['median_house_value']

rf_model = RandomForestRegressor(n_estimators=10, max_depth=20, random_state=1, n_jobs=-1)
rf_model.fit(X_train, y_train)

feature_importances = rf_model.feature_importances_

feature_names = dv.get_feature_names_out()
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

top_4_features = importance_df[importance_df['Feature'].isin(['total_rooms', 'median_income', 'total_bedrooms', 'longitude'])].sort_values(by='Importance', ascending=False)

most_important_feature = top_4_features.iloc[0]['Feature']
print(f"Most important feature: {most_important_feature}")

Most important feature: median_income
