# Model building: predicting neighbourhood rating and community belonging

In [None]:
# Libraries
import pandas as pd
import numpy as np
from janitor import clean_names
from pandas_profiling import ProfileReport
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
import pickle

# Data preparation

The approach differs from the data preparation in the exploratory analysis. The aim is to preserve as many data points as possible whilst reducing the cardinality of the many categorical variables. Given the targets to be predicted, only two datasets are necessary: they contain the same columns except for the two dependent variables, neighbourhood rating and community belonging.

In [None]:
# Read in neighbourhood ratings
neighbourhood = pd.read_csv("data/neighbourhood_rating.csv").clean_names()

# Read in community ratings
community = pd.read_csv("data/community_belonging.csv").clean_names()

In [None]:
# Joining the two datasets.
survey = (
    pd.merge(neighbourhood, community, how='outer')
    .query("featurecode != 'S92000003' & measurement == 'Percent'")
    .drop(columns=['measurement', 'units', 'featurecode'])
    .fillna('All')
)

survey.sample(10)

As already mentioned, many of the variables are categorical and show high cardinality, but do not fully explain the variation in the percentage of adults that fall into each category (see the exploratory analysis report for more details). This is particularly true of our targets: community belonging and neighbourhood rating. For these reasons, many features will be engineered into binary (0/1) indicators that will hopefully capture the variation in the data better and simplify model building.

Additionally, confidence intervals and region codes have been dropped, and the remaining data points are treated as country-wide statistical variation. To make dropping these two columns possible, the rows holding the upper and lower confidence bounds and the rows with the feature code for Scotland as a whole were filtered out first.

## Feature engineering

The following features will be re-binned or engineered:

- `walking_distance_to_nearest_greenspace` will become `green_access`, 1 for those who live within 10 min of a green space, 0 otherwise.

- `neighbourhood_rating` and `community_belonging` will become `good_neighbourhood` and `good_community`, 1 if they are very/fairly good/strong, and 0 otherwise.

- `type_of_tenure` will be replaced by `owners` and `tenants` and their respective binary encoding.

- `household_type` will be replaced by `pensioners` and `children` and their respective binary encoding.

- The rest of the variables will also be binary encoded.
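
The re-binning rules listed above can be sketched as one small helper. This is only an illustration under stated assumptions: the `add_binary_flags` function and the toy `demo` frame are hypothetical and not part of the notebook, and the category labels are assumed to match those found in the survey data.

```python
import numpy as np
import pandas as pd


def add_binary_flags(df, rules):
    """Return a copy of df with one 0/1 column per rule.

    rules maps new_column -> (source_column, list_of_positive_labels).
    """
    out = df.copy()
    for new_col, (source_col, positive_labels) in rules.items():
        out[new_col] = np.where(out[source_col].isin(positive_labels), 1, 0)
    return out


# Toy rows (hypothetical); labels assumed to match the survey's categories
demo = pd.DataFrame({
    "type_of_tenure": ["Owned Outright", "Private Rented", "All"],
    "household_type": ["All", "Pensioners", "With Children"],
})

rules = {
    "owners": ("type_of_tenure", ["Owned Outright", "Owned Mortgage/Loan"]),
    "tenants": ("type_of_tenure", ["Social Rented", "Private Rented"]),
    "pensioners": ("household_type", ["Pensioners"]),
    "children": ("household_type", ["With Children"]),
}

flagged = add_binary_flags(demo, rules)
print(flagged[["owners", "tenants", "pensioners", "children"]])
```

Keeping the rules in a dictionary makes it easy to check each new column against the list above; the explicit `np.where` calls in the next cell do the same thing one column at a time.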

In [None]:
# Good neighbourhood rating
survey.loc[:, 'good_neighbourhood'] = np.where(
    survey['neighbourhood_rating'].isin(['Very good', 'Fairly good']),
    1, 0)

# Good community belonging feeling
survey.loc[:, 'good_community'] = np.where(
    survey['community_belonging'].isin(['Very strongly', 'Fairly strongly']),
    1, 0)

# Green space access
survey.loc[:, 'green_access'] = np.where(
    survey['walking_distance_to_nearest_greenspace'] == 'Less than 10 minutes',
    1, 0)

# Owners
survey.loc[:, 'owners'] = np.where(
    survey['type_of_tenure'].isin(['Owned Outright', 'Owned Mortgage/Loan']),
    1, 0)

# Tenants
survey.loc[:, 'tenants'] = np.where(
    survey['type_of_tenure'].isin(['Social Rented', 'Private Rented']),
    1, 0)

# Pensioners
survey.loc[:, 'pensioners'] = np.where(
    survey['household_type'] == 'Pensioners',
    1, 0)

# Children
survey.loc[:, 'children'] = np.where(
    survey['household_type'] == 'With Children',
    1, 0)

# Urban
survey.loc[:, 'urban'] = np.where(
    survey['urban_rural_classification'] == 'Urban',
    1, 0)

# Female
survey.loc[:, 'female'] = np.where(
    survey['gender'] == 'Female',
    1, 0)

# Lowest quintile
survey.loc[:, 'lowest_quintile'] = np.where(
    survey['simd_quintiles'] == '20% most deprived',
    1, 0)

# White ethnicity
survey.loc[:, 'white_ethnicity'] = np.where(
    survey['ethnicity'] == 'White',
    1, 0)

survey = survey.drop(columns=[
    'neighbourhood_rating', 'gender', 'urban_rural_classification',
    'simd_quintiles', 'type_of_tenure', 'household_type', 'ethnicity',
    'walking_distance_to_nearest_greenspace', 'community_belonging']).copy()

In [None]:
survey.sample(20)

In [None]:
survey.good_neighbourhood.value_counts()[1] / survey.good_neighbourhood.value_counts()[0]

By encoding data in this way, we can treat the constant value 'All' as the absence of a feature rather than as a feature being held constant. With enough data points this should be sufficient to generate predictions, although the model may not help explain why certain features correlate with higher ratings.
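
A consequence of this encoding can be seen on a three-row toy example (the `demo` frame is hypothetical): once `gender` is reduced to a `female` flag, the rows labelled 'Male' and 'All' become indistinguishable, which is one reason the model may be of limited use for explanation.

```python
import numpy as np
import pandas as pd

# Toy column with the three labels that appear under `gender`
demo = pd.DataFrame({"gender": ["Female", "Male", "All"]})
demo["female"] = np.where(demo["gender"] == "Female", 1, 0)

# Group the original labels by the value of the new flag:
# 'Male' and 'All' both end up under 0
collapsed = demo.groupby("female")["gender"].apply(list).to_dict()
print(collapsed)
```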

# Splitting test and train sets

Since each record carries a year (`datecode`), let us split the data by date instead of at random. 2019 will become the test set and the model will be trained on the remaining years.

In [None]:
survey.datecode.value_counts()

Each year accounts for approximately one seventh of the data, or 14%. Both ratings are dropped from the predictor sets (train and test), as there might be a strong correlation between how people rate their community and their neighbourhood, so using one rating to predict the other would be circular.
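
The arithmetic and the split can be sketched on a toy frame. The years and values below are hypothetical stand-ins for `survey`; the only assumption carried over from the text is that there are seven years of roughly equal size and that the most recent one is held out.

```python
import pandas as pd

# Toy frame standing in for `survey` (hypothetical years and values)
demo = pd.DataFrame({
    "datecode": [2013, 2014, 2015, 2016, 2017, 2018, 2019] * 2,
    "value": range(14),
})

# Share of rows per year; with seven equally sized years each share is 1/7
year_share = demo["datecode"].value_counts(normalize=True).sort_index()

# Hold out the most recent year and train on everything before it
latest = demo["datecode"].max()
train_demo = demo[demo["datecode"] < latest]
test_demo = demo[demo["datecode"] == latest]

print(year_share.round(3))
print(len(train_demo), len(test_demo))
```

Splitting on the latest year mimics how the model would be used in practice, predicting a new survey wave from earlier ones, at the cost of a test set that is only about 14% of the data.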

In [355]:
train

Unnamed: 0,value,neighbourhood_rating,gender,urban_rural_classification,simd_quintiles,type_of_tenure,household_type,ethnicity,walking_distance_to_nearest_greenspace,community_belonging
14,26.0,Fairly good,All,All,All,All,All,All,More than 10 minutes,All
15,46.0,Fairly good,All,All,All,All,All,All,More than 10 minutes,All
20,5.0,Very poor,All,All,All,All,All,All,More than 10 minutes,All
26,46.0,Fairly good,All,All,All,All,All,All,More than 10 minutes,All
27,37.0,Fairly good,All,All,All,All,All,All,More than 10 minutes,All
...,...,...,...,...,...,...,...,...,...,...
77931,25.0,All,All,All,All,Private Rented,All,All,All,Very strongly
77935,30.0,All,All,All,All,Private Rented,All,All,All,Not very strongly
77936,28.0,All,All,All,All,Private Rented,All,All,All,Not very strongly
77938,18.0,All,All,All,All,Private Rented,All,All,All,Not at all strongly


In [None]:
# Train and test split
train = survey.query("datecode != 2019").drop('datecode', axis=1)

test = survey.query("datecode == 2019").drop('datecode', axis=1)

# Now drop both targets from the predictor sets
X_test = test.drop(columns=['good_neighbourhood', 'good_community'])
X_train = train.drop(columns=['good_neighbourhood', 'good_community'])

# Create both targets
community_test = test.good_community
community_train = train.good_community

neighbourhood_test = test.good_neighbourhood
neighbourhood_train = train.good_neighbourhood

# Building the pipelines

In [None]:
# Binary 0/1 features, each one-hot encoded with the first category dropped
binary_features = [
    'green_access', 'owners', 'tenants', 'pensioners', 'children',
    'urban', 'female', 'lowest_quintile', 'white_ethnicity',
]

log_pipe = Pipeline([
    ('split', ColumnTransformer(
        [
            (name,
             OneHotEncoder(categories=[survey[name].unique()], drop='first', handle_unknown='error'),
             [name])
            for name in binary_features
        ] + [
            # Standardise the numeric percentage
            ('value', Pipeline([('scale', StandardScaler())]), ['value'])
        ]
    )),
    # ('poly', PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)),
    ('classifier', LogisticRegression())
])

# Scoring the models

## Neighbourhood ratings model

In [None]:
# Fitting the model to the train set for the neighbourhood rating
log_pipe.fit(X_train, neighbourhood_train);

In [None]:
print('MAE train for Logistic Regression:', mean_absolute_error(neighbourhood_train, log_pipe.predict(X_train)))
print('MAE test for Logistic Regression:', mean_absolute_error(neighbourhood_test, log_pipe.predict(X_test)))

For 0/1 labels the mean absolute error is simply the misclassification rate (1 minus accuracy). It is very similar for the test and train sets, suggesting that the model is not over-fitting.
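
For a binary target, each absolute error is either 0 (correct) or 1 (wrong), so the mean absolute error is the share of misclassified rows. A short check with made-up labels (the arrays below are illustrative, not survey data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error

# Made-up 0/1 labels and predictions; two of the eight predictions are wrong
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])

mae = mean_absolute_error(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

# Both numbers describe the same quantity: MAE == 1 - accuracy
print(mae, 1 - acc)
```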

In [None]:
print(f"Logistic Regression model train mean accuracy (5-fold CV): {cross_val_score(log_pipe, X_train, neighbourhood_train, scoring='accuracy', cv=5).mean()}.")
# Score the pipeline fitted on the train set against the held-out 2019 data
print(f"Logistic Regression model test accuracy: {log_pipe.score(X_test, neighbourhood_test)}.")

Accuracy is good for this model, and similar in the test and train sets.
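
Whether an accuracy figure is good depends on the class balance, so it helps to compare it with a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on a hypothetical target with 70% positives; the real balance of `good_neighbourhood` is not assumed here.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced 0/1 target: 70 positives, 30 negatives
y = np.array([1] * 70 + [0] * 30)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = accuracy_score(y, baseline.predict(X))
print(baseline_acc)
```

A model is only adding information if its accuracy clearly exceeds this baseline.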

In [None]:
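# ROC AUC is a ranking metric, so use the predicted probability of class 1 rather than hard labels
print('AUC train for Logistic Regression:', roc_auc_score(neighbourhood_train, log_pipe.predict_proba(X_train)[:, 1]))
print('AUC test for Logistic Regression:', roc_auc_score(neighbourhood_test, log_pipe.predict_proba(X_test)[:, 1]))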
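
ROC AUC measures how well the scores rank positives above negatives, so it is normally computed from predicted probabilities; hard 0/1 labels discard the ranking. A small sketch with made-up scores (all values hypothetical) shows how much the two can differ:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])
# Hypothetical P(class = 1): perfectly ranked, but all below the 0.5 threshold
proba = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.45])
# What .predict() would return with the default 0.5 cut-off: all zeros
hard = (proba >= 0.5).astype(int)

auc_proba = roc_auc_score(y_true, proba)
auc_hard = roc_auc_score(y_true, hard)
print(auc_proba, auc_hard)
```

With probabilities the ranking is perfect; with thresholded labels every row looks the same and the score falls to chance level.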

## Community belonging model

In [None]:
# Fitting the model to the train set for the community belonging rating
log_pipe.fit(X_train, community_train);

In [None]:
print('MAE train for Logistic Regression:', mean_absolute_error(community_train, log_pipe.predict(X_train)))
print('MAE test for Logistic Regression:', mean_absolute_error(community_test, log_pipe.predict(X_test)))

In [None]:
print(f"Logistic Regression model train mean accuracy (5-fold CV): {cross_val_score(log_pipe, X_train, community_train, scoring='accuracy', cv=5).mean()}.")
# Score the pipeline fitted on the train set against the held-out 2019 data
print(f"Logistic Regression model test accuracy: {log_pipe.score(X_test, community_test)}.")

In [None]:
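# ROC AUC is a ranking metric, so use the predicted probability of class 1 rather than hard labels
print('AUC train for Logistic Regression:', roc_auc_score(community_train, log_pipe.predict_proba(X_train)[:, 1]))
print('AUC test for Logistic Regression:', roc_auc_score(community_test, log_pipe.predict_proba(X_test)[:, 1]))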

# Second approach: random forest classifier

Given that the feature engineering and logistic regression have not been successful in producing an accurate model, another approach will be considered.

In [349]:
raw_survey = (
    pd.merge(neighbourhood, community, how='outer')
    .query("featurecode != 'S92000003' & measurement == 'Percent'")
    .drop(columns=['measurement', 'units', 'featurecode'])
    .fillna('All')
)

raw_survey.sample(10)

Unnamed: 0,datecode,value,neighbourhood_rating,gender,urban_rural_classification,simd_quintiles,type_of_tenure,household_type,ethnicity,walking_distance_to_nearest_greenspace,community_belonging
35931,2017,66.0,Very good,All,All,All,All,All,All,All,All
26396,2013,8.0,Fairly poor,All,All,All,All,With Children,All,All,All
48137,2017,40.0,All,All,All,All,Owned Mortgage/Loan,All,All,All,Very strongly
19243,2018,56.0,Very good,Male,All,All,All,All,All,All,All
60041,2017,45.0,All,Male,All,All,All,All,All,All,Very strongly
77789,2019,19.0,All,All,All,All,Social Rented,All,All,All,Not very strongly
54752,2018,9.0,All,Female,All,All,All,All,All,All,Not very strongly
17237,2015,22.0,Fairly good,All,All,All,All,All,White,All,All
66429,2018,8.0,All,All,All,All,All,Pensioners,All,All,Not very strongly
8907,2014,54.0,Fairly good,All,All,All,Owned Mortgage/Loan,All,All,All,All


In [None]:
ProfileReport(raw_survey)

In [350]:
# Original categorical predictors, each one-hot encoded with the first category dropped
categorical_features = [
    'gender', 'urban_rural_classification', 'simd_quintiles', 'type_of_tenure',
    'household_type', 'ethnicity', 'walking_distance_to_nearest_greenspace',
]

rf_pipe = Pipeline([
    ('split', ColumnTransformer(
        [
            (name,
             OneHotEncoder(categories=[raw_survey[name].unique()], drop='first', handle_unknown='error'),
             [name])
            for name in categorical_features
        ] + [
            # Standardise the numeric percentage
            ('value', Pipeline([('scale', StandardScaler())]), ['value'])
        ]
    )),
    ('classifier', RandomForestClassifier())
])

In [343]:
# Convert all objects to categorical data
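# Select the object columns by name so the new dtype is actually stored
cat_cols = raw_survey.select_dtypes(include='object').columns
raw_survey[cat_cols] = raw_survey[cat_cols].astype('category')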

In [351]:
# Train and test split
train = raw_survey.query("datecode != 2019").drop('datecode', axis=1)

test = raw_survey.query("datecode == 2019").drop('datecode', axis=1)

# Now drop both targets from the predictor sets
X_test = test.drop(columns=['neighbourhood_rating', 'community_belonging'])
X_train = train.drop(columns=['neighbourhood_rating', 'community_belonging'])

# Create both targets
community_test = test.community_belonging
community_train = train.community_belonging

neighbourhood_test = test.neighbourhood_rating
neighbourhood_train = train.neighbourhood_rating

In [339]:
X_train.dtypes

value                                      float64
gender                                    category
urban_rural_classification                category
simd_quintiles                            category
type_of_tenure                            category
household_type                            category
ethnicity                                 category
walking_distance_to_nearest_greenspace    category
dtype: object

## Neighbourhood rating

In [352]:
# Fitting the model to the train set for the neighbourhood rating
rf_pipe.fit(X_train, neighbourhood_train);

In [347]:
X_train.type_of_tenure.unique()

['All', 'Owned Mortgage/Loan', 'Owned Outright', 'Social Rented', 'Private Rented']
Categories (5, object): ['All', 'Owned Mortgage/Loan', 'Owned Outright', 'Private Rented', 'Social Rented']

In [346]:
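# RandomForestClassifier needs numeric input: fitting on the raw string
# categories raises "ValueError: could not convert string to float: 'All'",
# so the predictors are one-hot encoded first.
X_train_novalue = pd.get_dummies(X_train.drop('value', axis=1))

rf = RandomForestClassifier()
rf.fit(X_train_novalue, neighbourhood_train)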
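
scikit-learn's random forest does not accept string predictors directly, which is why the encoding step has to sit in front of the classifier. A minimal sketch on toy data (the frame, the labels and the `encode_then_fit` name are hypothetical) shows the pattern: string features go through a `OneHotEncoder`, while string class labels in the target are fine as they are.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy predictors and target (hypothetical rows, survey-like labels)
X_demo = pd.DataFrame({
    "type_of_tenure": ["All", "Private Rented", "Owned Outright", "All"] * 5,
    "gender": ["Female", "All", "Male", "All"] * 5,
})
y_demo = pd.Series(["Very good", "Fairly good", "Very good", "Fairly poor"] * 5)

encode_then_fit = Pipeline([
    # Turn each string category into its own 0/1 column
    ("encode", ColumnTransformer([
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["type_of_tenure", "gender"]),
    ])),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])

encode_then_fit.fit(X_demo, y_demo)
predictions = encode_then_fit.predict(X_demo)
print(predictions[:4])
```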

In [304]:
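# MAE is undefined for string class labels (the original call raised
# "ValueError: could not convert string to float: 'Fairly good'"),
# so report the misclassification rate (1 - accuracy) instead.
print('Misclassification rate train for Random Forest:', 1 - rf_pipe.score(X_train, neighbourhood_train))
print('Misclassification rate test for Random Forest:', 1 - rf_pipe.score(X_test, neighbourhood_test))

The mean absolute error cannot be computed for this multi-class string target, so the train and test misclassification rates are compared instead; a large gap between them would suggest over-fitting.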

In [353]:
print(f"Random Forest model train mean accuracy: {cross_val_score(rf_pipe, X_train, neighbourhood_train, scoring='accuracy', cv=5).mean()}.")
print(f"Random Forest model test mean accuracy: {cross_val_score(rf_pipe, X_test, neighbourhood_test, scoring='accuracy', cv=5).mean()}.")

Random Forest model train mean accuracy: 0.46291486551455296.
Random Forest model test mean accuracy: 0.4047675588317749.

Accuracy is poor for this model: the 5-fold cross-validated mean is about 0.46 within the train set and about 0.40 within the 2019 test set, so fewer than half of the ratings are classified correctly.

In [354]:
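# With a multi-class string target, ROC AUC needs class probabilities and a
# one-vs-rest average; hard labels raise "ValueError: could not convert string to float: 'All'".
# Assumes every class seen in training also occurs in the 2019 test set.
print('AUC train for Random Forest:', roc_auc_score(neighbourhood_train, rf_pipe.predict_proba(X_train), multi_class='ovr', labels=rf_pipe.classes_))
print('AUC test for Random Forest:', roc_auc_score(neighbourhood_test, rf_pipe.predict_proba(X_test), multi_class='ovr', labels=rf_pipe.classes_))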
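
For a target with more than two classes, `roc_auc_score` expects one probability column per class and an explicit averaging scheme. The sketch below uses made-up probabilities for three of the rating labels; the numbers are hypothetical and chosen so that every class is ranked perfectly.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Class labels in sorted order, matching the columns of `proba`
labels = ["Fairly good", "Fairly poor", "Very good"]
y_true = np.array([
    "Very good", "Fairly good", "Fairly poor",
    "Very good", "Fairly good", "Fairly poor",
])

# Hypothetical class probabilities; each row sums to 1
proba = np.array([
    [0.1, 0.1, 0.8],
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.2, 0.1, 0.7],
    [0.6, 0.3, 0.1],
    [0.3, 0.5, 0.2],
])

# One-vs-rest: compute an AUC per class against all others, then average
auc = roc_auc_score(y_true, proba, multi_class="ovr", labels=labels)
print(auc)
```

In a fitted scikit-learn classifier the column order of `predict_proba` follows `classes_`, which is why that attribute is the natural value for `labels`.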