## Build a Model

In [None]:
import pandas as pd
import numpy as np

In [None]:
test_data = pd.read_csv("./data/test.csv", index_col="index", low_memory=False)
train_data = pd.read_csv("./data/train.csv", index_col="index", low_memory=False)

### About the data

In [None]:
# How large are the training and test sets?
print("test data size: ", test_data.shape)
print("train data size: ", train_data.shape)

In [None]:
# Class balance of the target in the training data
train_data.target__office.value_counts()

### Features

## Baseline Model Review

The provided baseline model leaves several points ambiguous. Each is listed below with a recommendation.

### Feature Selection

- **Issue:** The baseline does not explain why `officearea`, `comarea`, and `yearbuilt` were chosen as features.

- **Recommendation:** State the rationale for each selected column: why it is relevant to predicting `target__office`, and whether that relevance comes from the data or from domain knowledge.
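One way to back up such a rationale is to compare each candidate feature across the two classes. The snippet below is a minimal sketch on a made-up toy frame (the values are invented and only the column names match the notebook); on the real data the same `groupby` would run on `train_data`.

```python
import pandas as pd

# Toy stand-in for train_data; every value below is made up for illustration.
toy = pd.DataFrame({
    "officearea": [5000.0, 0.0, 12000.0, 0.0, 300.0, 8000.0],
    "comarea": [5200.0, 900.0, 12500.0, 0.0, 2500.0, 8100.0],
    "yearbuilt": [1965, 1920, 1988, 1931, 1950, 2001],
    "target__office": [True, False, True, False, False, True],
})
feature_cols = ["officearea", "comarea", "yearbuilt"]

# Median of each candidate feature per class. A feature whose medians
# differ clearly between the classes is easier to justify than one whose
# medians are nearly identical.
class_medians = toy.groupby("target__office")[feature_cols].median()
print(class_medians)
```

A per-class summary like this is only a first check. It does not replace a domain argument for why a column should matter.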

### Missing Values Handling

- **Issue:** Missing values (NaN) in the dataset are not handled.

- **Recommendation:** Choose and document a strategy for missing values, such as imputation or removal. Before imputing, check whether missingness is itself informative: if a value is absent more often for one class than the other, that pattern may be worth keeping as a feature. Document the findings.
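The recommendation above could be sketched as follows. This is an illustration under assumptions, not the notebook's method: the toy values are invented, median imputation is just one possible strategy, and the `_missing` column suffix is a naming choice made here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for train_data[feature_cols]; the values are made up.
toy = pd.DataFrame({
    "officearea": [1200.0, np.nan, 0.0, 5400.0],
    "comarea": [1500.0, 300.0, np.nan, 6000.0],
    "yearbuilt": [1931.0, 1965.0, np.nan, 2004.0],
})

# 1. Quantify missingness per column before deciding on a strategy.
missing_share = toy.isna().mean()
print(missing_share)

# 2. Median imputation. add_indicator=True appends one 0/1 column for each
#    feature that had missing values, so the "was missing" signal is kept.
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputed = imputer.fit_transform(toy)

# Indicator columns follow the original columns, in the original order.
cols_with_nan = toy.columns[toy.isna().any()]
columns = list(toy.columns) + [f"{c}_missing" for c in cols_with_nan]
toy_imputed = pd.DataFrame(imputed, columns=columns, index=toy.index)
print(toy_imputed)
```

In a real workflow the imputer would be fitted on the training split only and then applied to the held-out data, so that no statistics leak from the evaluation set.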

### Feature Engineering and Selection

- **Issue:** The analysis has no dedicated feature engineering or feature selection step.

- **Recommendation:** Derive new features that might improve the model's predictive power, then apply a feature selection technique to identify the variables that contribute most. Dropping uninformative features keeps the model simpler and cheaper to train.
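A minimal sketch of both steps, on synthetic data. Everything beyond the three original column names is an assumption made for illustration: the derived features `office_share` and `building_age`, the reference year 2024, the synthetic label, and the choice of mutual information as the ranking score.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the training features; not the real data.
office = rng.uniform(0, 10_000, n)
other_commercial = rng.uniform(0, 10_000, n)
toy = pd.DataFrame({
    "officearea": office,
    "comarea": office + other_commercial,
    "yearbuilt": rng.integers(1900, 2020, n),
})
# Synthetic label: "office" when office space is over half the commercial area.
y = (toy["officearea"] > 0.5 * toy["comarea"]).to_numpy()

# Feature engineering: two hypothetical derived features.
X = toy.copy()
safe_comarea = toy["comarea"].where(toy["comarea"] > 0)  # avoid division by zero
X["office_share"] = (toy["officearea"] / safe_comarea).fillna(0.0)
X["building_age"] = 2024 - toy["yearbuilt"]  # reference year is an assumption

# Feature selection: rank features by estimated mutual information with the label.
mi = mutual_info_classif(X, y, random_state=0)
scores = pd.Series(mi, index=X.columns).sort_values(ascending=False)
print(scores)
```

Note that `building_age` is a linear transform of `yearbuilt`, so the two carry the same information. A selection step would typically keep only one of them.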

In [None]:
# Numerical features
train_data.select_dtypes(include="number").columns

In [None]:
train_data[train_data.target__office == True]["officearea"].describe()

In [None]:
train_data[train_data.target__office == True]["landuse"].value_counts()

In [None]:
train_data[train_data.target__office == False]["officearea"].describe()

In [None]:
train_data[train_data.target__office == False]["landuse"].value_counts()

### Train the Model

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [None]:
feature_cols = ["officearea", "comarea", "yearbuilt"]

In [None]:
X = train_data[feature_cols].copy(deep=True)
y = train_data["target__office"].copy(deep=True)


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
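If the target classes turn out to be imbalanced, a purely random split can leave the held-out set with a different class ratio than the training set. The sketch below, on synthetic data with an assumed 10% positive rate, shows the `stratify` argument of `train_test_split`, which keeps the ratio the same in both parts. Whether stratification is needed for this dataset depends on the actual class counts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic data: 1000 rows, 3 features, 10% positive labels (assumed ratio).
X_toy = rng.normal(size=(1000, 3))
y_toy = np.zeros(1000, dtype=bool)
y_toy[:100] = True

# stratify=y_toy preserves the class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy
)

print("positive share overall:", y_toy.mean())
print("positive share in train:", y_tr.mean())
print("positive share in test:", y_te.mean())
```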

In [None]:
clf = RandomForestClassifier(max_depth=2, random_state=0)

In [None]:
clf.fit(X_train, y_train)

In [None]:
y_hat = clf.predict(X_test)

### How does my model perform?

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

In [None]:
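# With 0/1 labels, MSE equals the misclassification rate (1 - accuracy),
# so it carries no information beyond the accuracy score.
print("MSE:", mean_squared_error(y_test.astype(int), y_hat.astype(int)))
print("Accuracy:", accuracy_score(y_test.astype(int), y_hat.astype(int)))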
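Accuracy alone can hide how the errors are distributed between the two classes. The following sketch uses made-up labels and predictions (not output from the model above) to show a confusion matrix, precision, and recall next to accuracy, and to confirm that MSE on 0/1 labels is just `1 - accuracy`.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    mean_squared_error,
    precision_score,
    recall_score,
)

# Made-up labels and predictions standing in for y_test and y_hat.
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 0, 1, 0, 1])

acc = accuracy_score(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in label order [0, 1].
cm = confusion_matrix(y_true, y_pred)

# Precision: share of predicted positives that are correct.
# Recall: share of actual positives that were found.
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print("accuracy:", acc, "| MSE:", mse)
print(cm)
print("precision:", precision, "| recall:", recall)
```

On the real data, the same calls would take `y_test.astype(int)` and `y_hat.astype(int)` in place of the toy arrays.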