# The main file of the project.

The Boston Housing Dataset was removed from the Scikit-Learn library in version 1.2 (December 2022), after decades of use for teaching purposes.

Importing the dataset in the traditional way now fails with the following error:

In [None]:
from sklearn import datasets

try:
    boston = datasets.load_boston()
except Exception as e:
    print(e)


# Obtaining the data "by hand"

The Scikit-Learn documentation advises users to obtain the dataset in CSV format from its original source. We follow that advice and then convert the data to a well-formatted Pandas DataFrame.

In the dataset, there are 14 variables:

- **CRIM**: per capita crime rate by town
- **ZN**: proportion of residential land zoned for lots over 25,000 sq.ft.
- **INDUS**: proportion of non-retail business acres per town
- **CHAS**: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- **NOX**: nitric oxides concentration (parts per 10 million)
- **RM**: average number of rooms per dwelling
- **AGE**: proportion of owner-occupied units built prior to 1940
- **DIS**: weighted distances to five Boston employment centres
- **RAD**: index of accessibility to radial highways
- **TAX**: full-value property-tax rate per $10,000
- **PTRATIO**: pupil-teacher ratio by town
- **B**: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- **LSTAT**: % lower status of the population
- **MEDV**: Median value of owner-occupied homes in $1000's


In [None]:
# Import the data as instructed by documentation

import pandas as pd
import numpy as np

data_url = "http://lib.stat.cmu.edu/datasets/boston"
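# Skip the 22-line preamble; fields are whitespace-separated
# (the raw string avoids an invalid-escape warning for "\s")
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

# Split the data into data and target.
# Each record spans two physical lines: even rows hold the first 11 features,
# odd rows hold the remaining 2 features followed by the target (MEDV).
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]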

# Convert the data to a well-formatted dataframe

column_names = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
    "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"
]

feature_columns = column_names[:-1]  # All columns except the target
target_column = column_names[-1]     # The target column

# Construct the features DataFrame
boston_df = pd.DataFrame(data, columns=feature_columns)

# Add the target to the DataFrame
boston_df[target_column] = target
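
The slicing used to build `data` and `target` relies on the layout of the source file, in which every record is wrapped over two physical lines. A minimal sketch on a made-up array (the numbers and the 4-column width are purely illustrative, not taken from the dataset) shows how the even and odd row slices are stitched back into one row per record:

```python
import numpy as np

# Toy stand-in for raw_df.values: 2 records, each wrapped over two rows.
# Even rows carry 4 features; odd rows carry 2 more features plus the target,
# padded with NaN the way read_csv pads the shorter physical lines.
raw = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 10.0, np.nan],
    [1.1, 1.2, 1.3, 1.4],
    [1.5, 1.6, 20.0, np.nan],
])

# Even rows in full, plus the first two columns of the odd rows -> features
features = np.hstack([raw[::2, :], raw[1::2, :2]])
# Third column of the odd rows -> target
target = raw[1::2, 2]

print(features.shape)  # (2, 6)
print(target)          # [10. 20.]
```

In the real file the same pattern yields 13 feature columns (11 + 2) and the MEDV target.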


# Exploration & Cleaning

In [None]:
boston_df.describe()

In [None]:
boston_df.isnull().sum()

There are no missing values. Let us apply a few checks that could uncover filled-in (imputed) values that would distort the original data:

## Any filled-in values?

**Check whether the mean, median, min or max values occur suspiciously frequently in the dataset.**

Values in the dataset are rounded to 6 decimal places, so the statistics are rounded to match before counting.

In [None]:
n_rows = len(boston_df)

for column in boston_df.columns:
    # Count each distinct (rounded) value once per column
    counts = boston_df[column].round(6).value_counts()

    # Candidate fill-in values, rounded to match the data
    stats = {
        "Mean": round(boston_df[column].mean(), 6),
        "Median": round(boston_df[column].median(), 6),
        "Min": round(boston_df[column].min(), 6),
        "Max": round(boston_df[column].max(), 6),
    }

    # Share of rows (in %) that equal each statistic
    report = ", ".join(
        f"{name} Frequency: {counts.get(stat, 0) / n_rows * 100:.2f}%"
        for name, stat in stats.items()
    )
    print(f"{column} - {report}")
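
To see what this check is looking for, consider a hypothetical column in which two missing entries were filled with the column mean. The sketch below uses made-up numbers and a helper name, `stat_share`, that is not part of the project; it only illustrates how an imputed value shows up as an unusually frequent statistic:

```python
import pandas as pd

def stat_share(series: pd.Series, stat: float, decimals: int = 6) -> float:
    """Return the percentage of entries equal to `stat` after rounding both."""
    return float((series.round(decimals) == round(stat, decimals)).mean() * 100)

# Hypothetical column with two gaps
observed = pd.Series([1.0, 2.0, 3.0, 4.0, None, None])
# Mean imputation: fill the gaps with the mean of the observed values (2.5)
filled = observed.fillna(observed.mean())

print(f"{stat_share(filled, filled.mean()):.2f}%")    # 33.33%
print(f"{stat_share(filled, filled.median()):.2f}%")  # 33.33%
print(f"{stat_share(filled, filled.min()):.2f}%")     # 16.67%
```

In a continuous variable the mean would normally match no observation at all, so a share like this would be a strong hint of imputation. For discrete or capped variables a frequent min or max can be entirely natural.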


In [None]:
import matplotlib.pyplot as plt

# Plotting small histograms for each variable
boston_df.hist(figsize=(12, 10), bins=50, grid=False)
plt.tight_layout()
plt.show()


Taken together, the frequency checks on the mean, median, min and max values and the histograms raise no major concern about filled-in values. While some of the four statistics do occur quite often in certain columns, this is consistent with the nature of each specific variable.

The only questionable column is the problematic B column, which will be given more attention later.

## The LSTAT Variable

One of the problematic variables is LSTAT, the percentage of "lower status population". The original paper describes it as

> Proportion of population that is lower status = 1/2 (proportion of adults without some high school education and proportion of male workers classified as laborers)

The authors suggest that "the effect on price is higher in the upper brackets of society", and that a logarithmic transform should therefore be applied in the models.

The mere inclusion of this variable raises concerns. For now, we simply explore its properties and return to it later on.
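
The effect of a logarithmic transform can be illustrated with plain arithmetic. The following sketch is not the authors' model. It only shows, for arbitrarily chosen LSTAT values and a hypothetical helper `log_step`, that after taking logs a one-point change moves the transformed variable far more at low LSTAT (the "upper brackets") than at high LSTAT:

```python
import math

def log_step(lstat: float, delta: float = 1.0) -> float:
    """Change in log(LSTAT) when LSTAT rises by `delta` percentage points."""
    return math.log(lstat + delta) - math.log(lstat)

# Illustrative LSTAT levels (in %), chosen arbitrarily
for level in (2.0, 10.0, 30.0):
    print(f"LSTAT {level:>4.1f} -> {level + 1:>4.1f}: "
          f"change in log = {log_step(level):.3f}")
```

The step shrinks from roughly 0.41 at LSTAT = 2 to about 0.03 at LSTAT = 30, which is the asymmetry the log specification builds into a model.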

In [None]:
import app.var_examination as ve

ve.var_examination(boston_df, "LSTAT")


## TODO LSTAT findings description

## The B Variable

A key interest of this project is the B variable. We do not plan to comment on the reasons for its inclusion in the original research paper and in the dataset, nor on why the issues were recognized only a few years ago. Instead, we examine the variable and later show its role in the dataset.

In the original research paper, the variable is described as 

> Black proportion of population. At low or moderate levels of B, an increase of B should have [a] negative [influence] on housing values if Blacks are regarded as undesirable by Whites. However, market discrimination means that market prices are higher at very high levels of B. One expects, therefore, a parabolic relationship between proportion Black in a neighborhood and housing values.

We should note that the above **definition is wrong on many levels** and should not be considered at all in a real-world scenario.

TODO

In [None]:
ve.var_examination(boston_df, "B")

TODO comments on B

TODO comments on the quadratic transformation, sus? Data distortion? We do not know what was the original number Bk

In [None]:
import numpy as np
import matplotlib.pyplot as plt

Bk = np.linspace(0, 1, 100)
Bf = 1000 * (Bk - 0.63) ** 2

plt.plot(Bk, Bf)
plt.xlabel('Bk')
plt.ylabel('Bf')
plt.title('Bf = 1000(Bk-0.63)^2')
plt.grid(True)
plt.show()
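
Because the transform is quadratic, it cannot always be undone. Solving B = 1000(Bk - 0.63)^2 for Bk gives Bk = 0.63 ± sqrt(B / 1000), and both roots are valid proportions whenever the upper root stays at or below 1, that is for B ≤ 1000 · 0.37^2 = 136.9. A minimal sketch of this ambiguity (the helper name `candidate_bk` is introduced here only for illustration):

```python
import math

def candidate_bk(b: float, tol: float = 1e-9) -> list[float]:
    """Return every proportion Bk in [0, 1] with 1000 * (Bk - 0.63)**2 == b."""
    root = math.sqrt(b / 1000)
    candidates = {0.63 - root, 0.63 + root}  # a set merges the roots when b == 0
    return sorted(bk for bk in candidates if -tol <= bk <= 1 + tol)

print(candidate_bk(396.9))  # one candidate, Bk close to 0
print(candidate_bk(100.0))  # two candidates, one on each side of 0.63
```

A value such as B = 100 is therefore compatible with a proportion of roughly 31% or roughly 95%, and the dataset alone cannot tell which one was the original Bk.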
