# EPC-rating classification from the Scottish EPC register
This notebook classifies houses into EPC rating bands (the lettered categories A, B, C, etc., which are based on a percentage score) using their reported features. The open data was extracted from 
<a href="https://www.scottishepcregister.org.uk/CustomerFacingPortal/DataExtract" >Energy Saving Trust's "Scottish Energy Performance Certificate Register"</a>.

## Data extraction and cleaning

### Extraction
First, the data needs to be extracted and stored.

In [1]:
## Import NumPy and pandas
import numpy as np
import pandas as pd

## From my computer's path
path = 'D:/tahad/Files/Pythonx.y/ScriptFiles/Data/Home-Energy/D_EPC_data_2012-2020_extract_0221/'
file = 'D_EPC_data_2020_Q4_extract_0221.csv'
df = pd.read_csv(path+file, 
                 header=1, 
                 encoding="ISO-8859-1", 
                 low_memory=False)
df.head()

## Web download
# web = 'http://statistics.gov.scot/downloads/'\
#       'file?id=fc0dde21-014b-4ccb-ba05-2f3b9cf07182%2FD_EPC_data_2012-2020_extract_0221.zip'
# file = 'D_EPC_data_2020_Q4_extract_0221.csv'
# df = pd.read_csv(web+file)
# df.head()
# Unable to perform web-download because only the containing zip-file is accessible --
# it would be possible to download the file, extract it, and import the data-file in 
# a script, but it seems rather impractical to do so

Unnamed: 0,BUILDING_REFERENCE_NUMBER,OSG_REFERENCE_NUMBER,ADDRESS1,ADDRESS2,ADDRESS3,POSTCODE,INSPECTION_DATE,TYPE_OF_ASSESSMENT,LODGEMENT_DATE,ENERGY_CONSUMPTION_CURRENT,...,PHOTO_SUPPLY,SOLAR_WATER_HEATING_FLAG,TENURE,TRANSACTION_TYPE,UNHEATED_CORRIDOR_LENGTH,CONSTITUENCY,CONSTITUENCY_LABEL,WIND_TURBINE_COUNT,BUILT_FORM,PROPERTY_TYPE
0,1000136000.0,116023891.0,21 OLD TOWN,AYTON,EYEMOUTH,TD14 5RA,30/09/2020,"RdSAP, existing dwelling",01/10/2020,620.0,...,Array: Roof Area: 0%; Connection: not applicab...,N,rented (social),none of the above,,00QEMG,East Berwickshire,0.0,End-Terrace,Flat
1,1002185000.0,484108772.0,3 MARIGOLD WAY,,CARLUKE,ML8 5TL,28/09/2020,"RdSAP, existing dwelling",01/10/2020,287.0,...,Array: Roof Area: 0%; Connection: not applicab...,N,owner-occupied,marketed sale,,00RFMA,Clydesdale West,0.0,Semi-Detached,House
2,1001076000.0,137044546.0,BANK OF SCOTLAND BUILDINGS,2 JOHN STREET,LANGHOLM,DG13 0AD,30/09/2020,"RdSAP, existing dwelling",01/10/2020,366.0,...,Array: Roof Area: 0%; Connection: not applicab...,N,owner-occupied,marketed sale,14.62,00QHMN,Annandale East and Eskdale,0.0,Detached,Flat
3,1001377000.0,117115511.0,5 ELM RISE,BALDOVIE,DUNDEE,DD5 3UY,29/09/2020,"RdSAP, existing dwelling",01/10/2020,145.0,...,Array: Roof Area: 0%; Connection: not applicab...,N,owner-occupied,marketed sale,,00QCMD,Monifeith and Sidlaw,0.0,Detached,Bungalow
4,1001413000.0,137069631.0,PATHHEAD,,LANGHOLM,DG13 0ND,29/09/2020,"RdSAP, existing dwelling",01/10/2020,1010.0,...,Array: Roof Area: 0%; Connection: not applicab...,N,owner-occupied,marketed sale,,00QHMN,Annandale East and Eskdale,0.0,Detached,House
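
The commented-out web download above notes that only the containing zip file is accessible. As a minimal sketch, assuming the archive's bytes are already available (from disk or from a download), a single CSV member can be read without extracting the archive to disk. The helper name `read_csv_from_zip` and the tiny in-memory archive are illustrative assumptions, not part of the original notebook.

```python
import io
import zipfile

import pandas as pd


def read_csv_from_zip(zip_source, member, **read_csv_kwargs):
    """Read one CSV member of a zip archive into a DataFrame.

    zip_source may be a path or a file-like object, e.g. io.BytesIO
    wrapping the bytes of a downloaded archive (hypothetical helper).
    """
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as handle:
            return pd.read_csv(handle, **read_csv_kwargs)


# Self-contained demonstration with a tiny in-memory archive (made-up rows)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "demo.csv",
        "POSTCODE,CURRENT_ENERGY_RATING\nTD14 5RA,E\nML8 5TL,C\n",
    )
buffer.seek(0)

demo = read_csv_from_zip(buffer, "demo.csv")
print(demo)
```

For the real extract, the same keyword arguments used in the local read (`header=1`, `encoding="ISO-8859-1"`, `low_memory=False`) would presumably be passed through to `pd.read_csv`.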


Resizing the DataFrame to make it digestible

In [2]:
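df.describe  # no parentheses: this displays the bound method (the whole frame); df.describe() would give summary statistics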

<bound method NDFrame.describe of        BUILDING_REFERENCE_NUMBER  OSG_REFERENCE_NUMBER  \
0                   1.000136e+09          1.160239e+08   
1                   1.002185e+09          4.841088e+08   
2                   1.001076e+09          1.370445e+08   
3                   1.001377e+09          1.171155e+08   
4                   1.001413e+09          1.370696e+08   
...                          ...                   ...   
48023               1.001583e+09          9.051120e+09   
48024               1.001029e+09          9.051117e+09   
48025               1.000902e+09          9.051116e+09   
48026               1.002091e+09          1.301427e+08   
48027               1.000232e+09          9.067000e+11   

                          ADDRESS1             ADDRESS2       ADDRESS3  \
0                     21 OLD TOWN                AYTON       EYEMOUTH    
1                  3 MARIGOLD WAY                   NaN       CARLUKE    
2      BANK OF SCOTLAND BUILDINGS        2 JOHN

In [3]:
df = df[0:10000]

The data can be downloaded from the <a href='https://statistics.gov.scot/downloads/file?id=fc0dde21-014b-4ccb-ba05-2f3b9cf07182%2FD_EPC_data_2012-2020_extract_0221.zip'>Data Extract download link</a>. <br>
The zip file contains a .pdf document describing the contents of the datasets. The candidate y-values of interest are:
* (Efficiency) ratings:
    * Current Energy Efficiency Rating; CURRENT_ENERGY_EFFICIENCY; <blockquote>Current energy cost rating (EER or ‘SAP rating’) for the building which is calculated using both the energy efficiency of the building and the cost of fuels used.</blockquote>
    * Current Energy Efficiency Band; CURRENT_ENERGY_RATING; <blockquote>Current Energy Efficiency Rating expressed on a Scale of G to A, with A being the highest (best) rating band.</blockquote>
    * Potential Energy Efficiency Band; POTENTIAL_ENERGY_RATING; <blockquote>Potential Energy Efficiency Rating expressed on a Scale of G to A, with A being the highest (best) rating band.</blockquote>
    * Current Environmental Impact Band; ENVIRONMENT_IMPACT_RATING; <blockquote>Current Environmental Impact Rating expressed on a Scale of G to A, with A being the highest (best) rating band.</blockquote>
    * Potential Environmental Impact Band; POTENTIAL_ENVIRONMENTAL_RATING; <blockquote>Potential Environmental Impact Rating expressed on a Scale of G to A, with A being the highest (best) rating band.</blockquote>
* Values
    * Primary Energy Indicator; ENERGY_CONSUMPTION_CURRENT; <blockquote>Reports Primary Energy - the amount of energy required at source, before conversion and transmission, to meet the calculated energy demand of the dwelling (Units: kWh/m²/year).</blockquote>
    * Total floor area; TOTAL_FLOOR_AREA; <blockquote>The total floor area of the dwelling. This excludes any unheated ancillary buildings. (Units: m²)</blockquote>
    * Total current energy costs over 3 years; 3_YR_ENERGY_COST_CURRENT; <blockquote>Calculation illustrating the total energy cost for heating, cooling, lighting and ventilating the building. Based on standardised occupancy patterns and fuel costs. (Units: £)</blockquote>
    * CO2 Emissions Current Per Floor Area; CO2_EMISS_CURR_PER_FLOOR_AREA; <blockquote>Annual CO2 equivalent emissions per square metre of floor area (units: kg.CO2e/m²/yr)</blockquote>

Note that I did not list every "Potential"/"Current" and "Rating"/"Band" variant separately. I believe the "Band" data are sensible y-values for a classification tree, since these are discrete categories, and the result can later be compared to a regression model.<br>
Area could be used as an x-parameter, but it would also be interesting to see whether a model could correctly predict a house's area from reported energy data using as few features as possible.<br>
"Potential" data is as likely a y-parameter as "Current" data; finding "Potential" energy bands from a small number of features would have practical use. It should be noted that all these values are calculated, so an algorithm ought to be able to fit the parameters or coefficients behind the calculated value or category; there should be a clear trend.<br>
Between environmental and energy data, it matters little which one I use for this model. I am curious, though, whether using both as outcome values affects the model compared to fitting trees on the separate variables.<br><br>
As fo

### Selection
Below we will extract only the columns that are useful, along with a unique identifier to serve as a primary key so that the tables can be merged appropriately.
* Primary key
    * OSG UPRN; OSG_REFERENCE_NUMBER; <blockquote>Unique 9-12-digit property reference number assigned to a building by local authority data custodians and recorded centrally on the One Scotland Gazetteer – www.osg.scot.</blockquote>

#### y-data columns

In [4]:
## Primary key
pk = 'OSG_REFERENCE_NUMBER'
df[pk] = df[pk].astype('Int64')

## Extracting y-data into df_y
col_y = ['CURRENT_ENERGY_RATING', 'CURRENT_ENVIRONMENTAL_RATING']
df_y = df[[pk] + col_y]
df_y.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._set_item(key, value)


Unnamed: 0,OSG_REFERENCE_NUMBER,CURRENT_ENERGY_RATING,CURRENT_ENVIRONMENTAL_RATING
0,116023891,E,E
1,484108772,C,D
2,137044546,E,E
3,117115511,C,C
4,137069631,G,G
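
The `SettingWithCopyWarning` printed above most likely comes from assigning a column on `df` after it was reduced with `df[0:10000]`, since pandas cannot tell whether that slice is a view or a copy. A minimal sketch of one way to avoid it, assuming an independent working subset is what is wanted, is to take an explicit `.copy()` before assigning. The small frame `raw` below is made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the raw extract; one key is missing on purpose
raw = pd.DataFrame({
    "OSG_REFERENCE_NUMBER": [116023891.0, 484108772.0, None, 117115511.0],
    "CURRENT_ENERGY_RATING": ["E", "C", "D", "C"],
})

pk = "OSG_REFERENCE_NUMBER"

# An explicit copy makes the subset an independent DataFrame,
# so assigning to one of its columns no longer triggers the warning
subset = raw.iloc[:3].copy()

# Nullable integer dtype keeps the missing key as <NA> instead of failing
subset[pk] = subset[pk].astype("Int64")

print(subset.dtypes)
```

The original `raw` frame is left untouched by the assignment, which is usually the behaviour one expects from a "resized" working copy.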


In [5]:
## Inspecting the data types
df_y.dtypes

OSG_REFERENCE_NUMBER             Int64
CURRENT_ENERGY_RATING           object
CURRENT_ENVIRONMENTAL_RATING    object
dtype: object
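
Both rating columns are still stored as `object`. One option, sketched here as an assumption rather than as the step this notebook takes, is to convert them to an ordered categorical so that the band order G < F < ... < A is known to pandas. The frame `demo_y` reuses the five band values shown above.

```python
import pandas as pd

# Bands from worst to best, as described in the dataset documentation
bands = ["G", "F", "E", "D", "C", "B", "A"]
band_dtype = pd.CategoricalDtype(categories=bands, ordered=True)

demo_y = pd.DataFrame({
    "CURRENT_ENERGY_RATING": ["E", "C", "E", "C", "G"],
    "CURRENT_ENVIRONMENTAL_RATING": ["E", "D", "E", "C", "G"],
})
demo_y = demo_y.astype(band_dtype)

# Ordered categories support comparisons and expose integer codes
better_than_e = demo_y["CURRENT_ENERGY_RATING"] > "E"
codes = demo_y["CURRENT_ENERGY_RATING"].cat.codes.tolist()

print(better_than_e.tolist())
print(codes)
```

The integer codes could later serve as targets for a classifier, while the ordering keeps comparisons between bands meaningful.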

In [6]:
df_y.head()

Unnamed: 0,OSG_REFERENCE_NUMBER,CURRENT_ENERGY_RATING,CURRENT_ENVIRONMENTAL_RATING
0,116023891,E,E
1,484108772,C,D
2,137044546,E,E
3,117115511,C,C
4,137069631,G,G


For now, we will continue on to the x-data before cleaning. Note, though, that it is more useful to keep the data together and split it after cleaning; I will do the reverse, just for some data-merging practice.
#### x-data columns

In [7]:
## Extracting x-data into df_x
"""
> This is made into a comment because it is impractical to
> copy all column header names.
> Instead, since these are repeating attributes, I looked
> at the index manually.
> I stubbornly wanted to first select the columns and then
> the data (in hindsight there might have been easier ways
> to slice/extract data). Looking at the data now, I could
> have more easily selected these columns by filtering
>> df.filter(regex='...$')
>> ... == _EFF / _DESCRIPTION
> or
>> df.filter(regex='^...')
>> ... == WALL_ / ROOF_ / etc.
> also
>> df.filter(like='...')
>> ... == _ENERGY / _ENV / _DESCRIPTION
> can (potentially, with axis=1) be used to select full
> columns with labels containing those strings; this way no
> column selection has to be done separately. Then, the
> dataframes might be 'join'ed or 'merge'd afterwards.
col_x = ['WALL_DESCRIPTION'
        ,'WALL_ENERGY_EFF'
        ,'WALL_ENV_EFF'
        ,'ROOF_DESCRIPTION'
        ,'ROOF_ENERGY_EFF'
        ,'ROOF_ENV_EFF'
        ,'FLOOR_DESCRIPTION'
        ,'FLOOR_ENERGY_EFF'
        ,'FLOOR_ENV_EFF'
        ,'WINDOWS_DESCRIPTION'
        ,'WINDOWS_ENERGY_EFF'
        ,'WINDOWS_ENV_EFF'
        ,'MAINHEAT_DESCRIPTION'
        ,'MAINHEAT_ENERGY_EFF'
        ,'MAINHEAT_ENV_EFF'
        ,'MAINHEAT']
"""
# For the first chunk of columns
# x_col1a = 23      # Index of first column of chain
# x_col1b = 53      # Index of last column of chain
# col_x1 = df.iloc[:, x_col1a:x_col1b].columns.tolist()
x_col1c = list(range(23,53,3))
col_x1 = df.iloc[:, x_col1c].columns.tolist()

# Selecting other columns; NOT the following from CONSTRUCTION_AGE_BAND:
# ENERGY_CONSUMPTION_POTENTIAL; TENURE; TRANSACTION_TYPE; excluding location
x_col2a = [69, 70, 97, 100, 101, 102]
x_col2b = 73
x_col2c = 82
x_col2d = 85
x_col2e = 94
col_x2 = df.iloc[:, np.r_[x_col2a, x_col2b:x_col2c, x_col2d:x_col2e]].columns.tolist()
print(col_x2)

# Pasting them all together
df_x = df[[pk]+['TOTAL_FLOOR_AREA']+col_x1+col_x2]
df_x.head()

['CONSTRUCTION_AGE_BAND', 'FLOOR_HEIGHT', 'UNHEATED_CORRIDOR_LENGTH', 'WIND_TURBINE_COUNT', 'BUILT_FORM', 'PROPERTY_TYPE', 'EXTENSION_COUNT', 'FIXED_LIGHTING_OUTLETS_COUNT', 'LOW_ENERGY_FIXED_LIGHT_COUNT', 'LOW_ENERGY_LIGHTING', 'FLOOR_LEVEL', 'FLAT_TOP_STOREY', 'GLAZED_AREA', 'NUMBER_HABITABLE_ROOMS', 'HEAT_LOSS_CORRIDOOR', 'MAIN_HEATING_CATEGORY', 'MAIN_FUEL', 'MAIN_HEATING_CONTROLS', 'MECHANICAL_VENTILATION', 'ENERGY_TARIFF', 'MULTI_GLAZE_PROPORTION', 'GLAZED_TYPE', 'NUMBER_OPEN_FIREPLACES', 'PHOTO_SUPPLY']


Unnamed: 0,OSG_REFERENCE_NUMBER,TOTAL_FLOOR_AREA,WALL_DESCRIPTION,ROOF_DESCRIPTION,FLOOR_DESCRIPTION,WINDOWS_DESCRIPTION,MAINHEAT_DESCRIPTION,MAINHEATCONT_DESCRIPTION,SECONDHEAT_DESCRIPTION,HOTWATER_DESCRIPTION,...,HEAT_LOSS_CORRIDOOR,MAIN_HEATING_CATEGORY,MAIN_FUEL,MAIN_HEATING_CONTROLS,MECHANICAL_VENTILATION,ENERGY_TARIFF,MULTI_GLAZE_PROPORTION,GLAZED_TYPE,NUMBER_OPEN_FIREPLACES,PHOTO_SUPPLY
0,116023891,47.0,"Cavity wall, filled cavity |",(another dwelling above) |,"Suspended, no insulation (assumed) |",Description: Fully double glazed |,Electric storage heaters |,Manual charge control |,Portable electric heaters (assumed) |,"Electric immersion, off-peak |",...,no corridor,electric storage heaters,electricity (not community),2401.0,natural,dual,100.0,"double glazing, unknown install date",0.0,Array: Roof Area: 0%; Connection: not applicab...
1,484108772,59.0,"Cavity wall, filled cavity |","Pitched, limited insulation (assumed) |","Suspended, no insulation (assumed) |",Description: Fully double glazed |,"Boiler and radiators, mains gas |","Programmer, TRVs and bypass |",None |,From main system |,...,,boiler with radiators or underfloor heating,mains gas (not community),2107.0,natural,Single,100.0,double glazing installed before 2002,0.0,Array: Roof Area: 0%; Connection: not applicab...
2,137044546,155.0,"Sandstone or limestone, as built, no insulatio...","Pitched, 150 mm loft insulation |",(other premises below) |,Description: Some double glazing |,"Boiler and radiators, mains gas |","Programmer, room thermostat and TRVs |","Room heaters, dual fuel (mineral and wood) |",From main system |,...,unheated corridor,boiler with radiators or underfloor heating,mains gas (not community),2106.0,natural,Single,10.0,double glazing installed during or after 2002,2.0,Array: Roof Area: 0%; Connection: not applicab...
3,117115511,168.0,"Timber frame, as built, insulated (assumed) |","Pitched, 250 mm loft insulation |","Solid, insulated (assumed) |",Description: Fully double glazed |,"Boiler and radiators, mains gas |","Programmer, room thermostat and TRVs |",None |,From main system |,...,,boiler with radiators or underfloor heating,mains gas (not community),2106.0,natural,Single,100.0,double glazing installed during or after 2002,0.0,Array: Roof Area: 0%; Connection: not applicab...
4,137069631,140.0,"Sandstone or limestone, as built, no insulatio...","Pitched, 250 mm loft insulation | Pitched, no ...","Solid, no insulation (assumed) | Suspended, no...",Description: Fully double glazed |,Electric storage heaters |,Manual charge control |,Portable electric heaters (assumed) |,"Electric immersion, off-peak |",...,,electric storage heaters,electricity (not community),2401.0,natural,dual,100.0,double glazing installed before 2002,0.0,Array: Roof Area: 0%; Connection: not applicab...
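
The comment in the cell above mentions `df.filter` as an alternative to selecting columns by position. A minimal sketch of that idea, using an empty frame with a handful of the column names from this dataset, could look as follows; the variable names are illustrative only.

```python
import pandas as pd

# Empty frame carrying a few of the dataset's column names
demo = pd.DataFrame(columns=[
    "OSG_REFERENCE_NUMBER",
    "WALL_DESCRIPTION", "WALL_ENERGY_EFF", "WALL_ENV_EFF",
    "ROOF_DESCRIPTION", "ROOF_ENERGY_EFF", "ROOF_ENV_EFF",
    "TOTAL_FLOOR_AREA",
])

# Columns whose label ends in _DESCRIPTION
description_cols = demo.filter(regex=r"_DESCRIPTION$").columns.tolist()

# Columns whose label starts with WALL_
wall_cols = demo.filter(regex=r"^WALL_").columns.tolist()

# Columns whose label contains the substring _ENERGY
energy_cols = demo.filter(like="_ENERGY").columns.tolist()

print(description_cols)
print(wall_cols)
print(energy_cols)
```

Selecting by label pattern avoids hard-coded column positions, which would silently shift if a later extract added or reordered columns.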


In [8]:
df_x.dtypes

OSG_REFERENCE_NUMBER              Int64
TOTAL_FLOOR_AREA                float64
WALL_DESCRIPTION                 object
ROOF_DESCRIPTION                 object
FLOOR_DESCRIPTION                object
WINDOWS_DESCRIPTION              object
MAINHEAT_DESCRIPTION             object
MAINHEATCONT_DESCRIPTION         object
SECONDHEAT_DESCRIPTION           object
HOTWATER_DESCRIPTION             object
LIGHTING_DESCRIPTION             object
AIR_TIGHTNESS_DESCRIPTION        object
CONSTRUCTION_AGE_BAND            object
FLOOR_HEIGHT                    float64
UNHEATED_CORRIDOR_LENGTH        float64
WIND_TURBINE_COUNT              float64
BUILT_FORM                       object
PROPERTY_TYPE                    object
EXTENSION_COUNT                 float64
FIXED_LIGHTING_OUTLETS_COUNT    float64
LOW_ENERGY_FIXED_LIGHT_COUNT    float64
LOW_ENERGY_LIGHTING             float64
FLOOR_LEVEL                     float64
FLAT_TOP_STOREY                  object
GLAZED_AREA                     float64


#### Location columns
Lastly, there are several features indicating location, but since I already use numerous features above, I want to simplify the location data by picking one feature, or by comparing two against each other.
* Location data
    * Property Address; ADDRESS1 / ADDRESS2 / ADDRESS3 (i.e. POST_TOWN)
    * Postcode; POSTCODE
    * Data Zone; DATA_ZONE
    * Local Authority; LOCAL_AUTHORITY_LABEL
    * Ward Code / Ward Name; CONSTITUENCY / CONSTITUENCY_LABEL

In [9]:
## Scouting location data
x_loc1a = [2, 3, 4, 5, 71, 83, 98, 99]
loc_col = df.iloc[:, x_loc1a].columns.to_list()
df_xloc = df[[pk]+loc_col]
df_xloc.head()

Unnamed: 0,OSG_REFERENCE_NUMBER,ADDRESS1,ADDRESS2,ADDRESS3,POSTCODE,DATA_ZONE,LOCAL_AUTHORITY_LABEL,CONSTITUENCY,CONSTITUENCY_LABEL
0,116023891,21 OLD TOWN,AYTON,EYEMOUTH,TD14 5RA,S01005484 (Berwickshire East),Scottish Borders,00QEMG,East Berwickshire
1,484108772,3 MARIGOLD WAY,,CARLUKE,ML8 5TL,S01005754 (Carluke South),South Lanarkshire,00RFMA,Clydesdale West
2,137044546,BANK OF SCOTLAND BUILDINGS,2 JOHN STREET,LANGHOLM,DG13 0AD,S01001064 (Langholm and Canonbie),Dumfries and Galloway,00QHMN,Annandale East and Eskdale
3,117115511,5 ELM RISE,BALDOVIE,DUNDEE,DD5 3UY,S01000598 (South Angus),Angus,00QCMD,Monifeith and Sidlaw
4,137069631,PATHHEAD,,LANGHOLM,DG13 0ND,S01001074 (Langholm and Canonbie),Dumfries and Galloway,00QHMN,Annandale East and Eskdale


CONSTITUENCY_LABEL or ADDRESS3 (i.e. POST_TOWN) might be the best choice for location. CONSTITUENCY_LABEL and CONSTITUENCY are equivalent; they denote a fairly small area (but larger than POSTCODE). I believe CONSTITUENCY and DATA_ZONE to be equivalent as well, as both denote an area of about the size of a constituency. There is also the option of using the first 3 or 4 characters of POSTCODE along with another feature like ADDRESS3; a similar approach can be taken for DATA_ZONE (combining a small-scale and a large-scale location feature).<br>
Note, though, that providing two different inputs for location data, both meaning the same thing, might affect the model. For now, I will use DATA_ZONE.
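
The two location ideas mentioned above (the leading part of POSTCODE, and the code inside DATA_ZONE) can be sketched with pandas string methods. This is an illustration on three rows copied from the output above; the new column names `POSTCODE_OUTWARD`, `DATA_ZONE_CODE` and `DATA_ZONE_NAME` are assumptions, not columns of the dataset.

```python
import pandas as pd

demo_loc = pd.DataFrame({
    "POSTCODE": ["TD14 5RA", "ML8 5TL", "DG13 0AD"],
    "DATA_ZONE": [
        "S01005484 (Berwickshire East)",
        "S01005754 (Carluke South)",
        "S01001064 (Langholm and Canonbie)",
    ],
})

# Outward code: the part of the postcode before the space (3 or 4 characters here)
demo_loc["POSTCODE_OUTWARD"] = demo_loc["POSTCODE"].str.split().str[0]

# Split DATA_ZONE into its code and the name given in parentheses
parts = demo_loc["DATA_ZONE"].str.extract(
    r"^(?P<DATA_ZONE_CODE>S\d+)\s*\((?P<DATA_ZONE_NAME>.*)\)$"
)
demo_loc = demo_loc.join(parts)

print(demo_loc)
```

Several data zones can share one name (two codes above map to "Langholm and Canonbie"), so the name acts as the coarser feature and the code as the finer one.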

In [10]:
## Picking location data
loc = ['DATA_ZONE']
df_xloc = df_xloc[[pk]+loc]
df_xloc.head()

Unnamed: 0,OSG_REFERENCE_NUMBER,DATA_ZONE
0,116023891,S01005484 (Berwickshire East)
1,484108772,S01005754 (Carluke South)
2,137044546,S01001064 (Langholm and Canonbie)
3,117115511,S01000598 (South Angus)
4,137069631,S01001074 (Langholm and Canonbie)


Now the data (df_y and df_x) can be merged into one DataFrame.

#### Merging

In [11]:
## Creating df_clf1 from df_y and df_x
df_clf1 = df_y.merge(df_x, on='OSG_REFERENCE_NUMBER')

df_clf1.describe  # no parentheses: this displays the bound method (the whole frame) rather than summary statistics

<bound method NDFrame.describe of          OSG_REFERENCE_NUMBER CURRENT_ENERGY_RATING  \
0                   116023891                     E   
1                   484108772                     C   
2                   137044546                     E   
3                   117115511                     C   
4                   137069631                     G   
...                       ...                   ...   
1495245             117070231                     C   
1495246             906323133                     C   
1495247            9051108251                     D   
1495248             141043792                     D   
1495249             484149765                     A   

        CURRENT_ENVIRONMENTAL_RATING  TOTAL_FLOOR_AREA  \
0                                  E              47.0   
1                                  D              59.0   
2                                  E             155.0   
3                                  C             168.0   
4              
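
The merged frame above has roughly 1.5 million rows, although `df` was cut to 10,000 rows beforehand. A likely cause is that OSG_REFERENCE_NUMBER is not unique: every repeated key on the left is paired with every repeat on the right. (The pandas documentation also notes that rows with a null key on both sides are matched against each other.) The sketch below reproduces the effect on made-up frames and shows two guards, `validate="one_to_one"` and de-duplication; it is an illustration, not the notebook's actual fix.

```python
import pandas as pd

pk = "OSG_REFERENCE_NUMBER"

# Made-up frames in which key 1 occurs twice on both sides
df_y_demo = pd.DataFrame({pk: [1, 1, 2],
                          "CURRENT_ENERGY_RATING": ["E", "E", "C"]})
df_x_demo = pd.DataFrame({pk: [1, 1, 2],
                          "TOTAL_FLOOR_AREA": [47.0, 47.0, 59.0]})

# Duplicate keys multiply: 2 x 2 rows for key 1, plus 1 row for key 2
exploded = df_y_demo.merge(df_x_demo, on=pk)

# validate= makes pandas raise instead of silently multiplying rows
try:
    df_y_demo.merge(df_x_demo, on=pk, validate="one_to_one")
    duplicates_detected = False
except pd.errors.MergeError:
    duplicates_detected = True

# One possible remedy: keep a single row per key before merging
clean = (
    df_y_demo.drop_duplicates(subset=pk)
    .merge(df_x_demo.drop_duplicates(subset=pk), on=pk, validate="one_to_one")
)

print(len(exploded), duplicates_detected, len(clean))
```

Because `df_y` and `df_x` are column subsets of the same `df`, selecting all wanted columns in one step would avoid the merge altogether.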

### Exploring and Cleaning
When looking at these datasets, some interesting things become apparent. For example:
- Some of the '..._EFF' cells contain multiple inputs (e.g. Poor | Poor |);
- Most of these x-features use '|' as a separator, with one trailing at the end;
- Data Zones contain both a unique code (Sxxxxxxxx) and a location name in parentheses;
- Some reference numbers are 8 digits long, some 9; some are duplicates and some are empty;
- N/A is sometimes used instead of NaN (empty cell);
- Missing values.
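
Two of the issues listed above, the trailing '|' separators and the literal "N/A" strings, can be handled with a few string operations. The following is a minimal sketch on made-up rows in the style of WALL_DESCRIPTION; keeping only the first entry of a multi-part cell is an assumption made for illustration.

```python
import pandas as pd

raw_wall = pd.Series([
    "Cavity wall, filled cavity | ",
    "Sandstone or limestone, as built, no insulation (assumed) | "
    "Solid brick, as built, no insulation (assumed) | ",
    "N/A",
], name="WALL_DESCRIPTION")

# Turn the literal string "N/A" into a proper missing value
cleaned = raw_wall.mask(raw_wall == "N/A")

# Strip spaces and pipe characters from both ends of each cell
cleaned = cleaned.str.strip(" |")

# For multi-part cells, keep only the first entry (illustrative choice)
first_part = cleaned.str.split("|").str[0].str.strip()

print(cleaned.tolist())
print(first_part.tolist())
```

Missing values pass through the `.str` methods unchanged, so they can still be counted or dropped afterwards.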

In this situation, I think it would be best to first explore how much data is missing, to judge whether it would make a significant difference to the model; I can even train multiple models with and without the missing data to see how it affects accuracy. Considering that there are over forty-eight thousand data points, I think missing or anomalous data is best deleted.
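
As a minimal sketch of that exploration, the share of missing values per column can be computed with `isna().mean()`. The frame and the 50% cut-off below are made up for illustration and are not values chosen in this notebook.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "TOTAL_FLOOR_AREA": [47.0, 59.0, np.nan, 168.0],
    "UNHEATED_CORRIDOR_LENGTH": [np.nan, np.nan, 14.62, np.nan],
    "BUILT_FORM": ["End-Terrace", "Semi-Detached", "Detached", "Detached"],
})

# Fraction of missing cells per column, largest first
missing_share = demo.isna().mean().sort_values(ascending=False)

# Hypothetical rule: drop columns that are more than half empty,
# then drop the rows that still contain a gap
threshold = 0.5
keep_cols = missing_share[missing_share <= threshold].index
reduced = demo[keep_cols].dropna()

print(missing_share)
print(reduced)
```

Comparing `len(demo)` with `len(reduced)` shows how many rows such a rule would cost before any model is trained.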

#### Data exploration

In [12]:
df_clf1.dtypes

OSG_REFERENCE_NUMBER              Int64
CURRENT_ENERGY_RATING            object
CURRENT_ENVIRONMENTAL_RATING     object
TOTAL_FLOOR_AREA                float64
WALL_DESCRIPTION                 object
ROOF_DESCRIPTION                 object
FLOOR_DESCRIPTION                object
WINDOWS_DESCRIPTION              object
MAINHEAT_DESCRIPTION             object
MAINHEATCONT_DESCRIPTION         object
SECONDHEAT_DESCRIPTION           object
HOTWATER_DESCRIPTION             object
LIGHTING_DESCRIPTION             object
AIR_TIGHTNESS_DESCRIPTION        object
CONSTRUCTION_AGE_BAND            object
FLOOR_HEIGHT                    float64
UNHEATED_CORRIDOR_LENGTH        float64
WIND_TURBINE_COUNT              float64
BUILT_FORM                       object
PROPERTY_TYPE                    object
EXTENSION_COUNT                 float64
FIXED_LIGHTING_OUTLETS_COUNT    float64
LOW_ENERGY_FIXED_LIGHT_COUNT    float64
LOW_ENERGY_LIGHTING             float64
FLOOR_LEVEL                     float64


In [14]:
# df_clf1['WALL_ENERGY_EFF'].unique()

Here we have something that can be found under every '_EFF' column. Interestingly, these should correspond one-to-one with the '_DESCRIPTION' information, so I should be able to pick whichever I prefer to work with; in this case probably '_DESCRIPTION'.
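
Whether each description really maps to a single efficiency label can be checked rather than assumed. The sketch below does so on made-up rows (the "Good" labels are invented for the example); on the real data the same `groupby` would be run on a frame that still contains the '_EFF' columns.

```python
import pandas as pd

demo = pd.DataFrame({
    "WALL_DESCRIPTION": [
        "Cavity wall, filled cavity",
        "Cavity wall, filled cavity",
        "Timber frame, as built, insulated (assumed)",
    ],
    # Invented efficiency labels, for illustration only
    "WALL_ENERGY_EFF": ["Good", "Good", "Good"],
})

# Number of distinct efficiency labels observed per description
eff_per_description = demo.groupby("WALL_DESCRIPTION")["WALL_ENERGY_EFF"].nunique()

# True if no description is associated with more than one label
description_determines_eff = bool((eff_per_description <= 1).all())

print(eff_per_description)
print(description_determines_eff)
```

Note that this only tests one direction: several descriptions may still share the same label, so the description carries at least as much information as the efficiency column.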

In [16]:
# df_clf1 = df_clf1.filter(regex=r'^(?!.*_EFF$)')  # drops columns ending in _EFF; note that 'like' only matches substrings, not regexes
df_clf1.columns

Index(['OSG_REFERENCE_NUMBER', 'CURRENT_ENERGY_RATING',
       'CURRENT_ENVIRONMENTAL_RATING', 'TOTAL_FLOOR_AREA', 'WALL_DESCRIPTION',
       'ROOF_DESCRIPTION', 'FLOOR_DESCRIPTION', 'WINDOWS_DESCRIPTION',
       'MAINHEAT_DESCRIPTION', 'MAINHEATCONT_DESCRIPTION',
       'SECONDHEAT_DESCRIPTION', 'HOTWATER_DESCRIPTION',
       'LIGHTING_DESCRIPTION', 'AIR_TIGHTNESS_DESCRIPTION',
       'CONSTRUCTION_AGE_BAND', 'FLOOR_HEIGHT', 'UNHEATED_CORRIDOR_LENGTH',
       'WIND_TURBINE_COUNT', 'BUILT_FORM', 'PROPERTY_TYPE', 'EXTENSION_COUNT',
       'FIXED_LIGHTING_OUTLETS_COUNT', 'LOW_ENERGY_FIXED_LIGHT_COUNT',
       'LOW_ENERGY_LIGHTING', 'FLOOR_LEVEL', 'FLAT_TOP_STOREY', 'GLAZED_AREA',
       'NUMBER_HABITABLE_ROOMS', 'HEAT_LOSS_CORRIDOOR',
       'MAIN_HEATING_CATEGORY', 'MAIN_FUEL', 'MAIN_HEATING_CONTROLS',
       'MECHANICAL_VENTILATION', 'ENERGY_TARIFF', 'MULTI_GLAZE_PROPORTION',
       'GLAZED_TYPE', 'NUMBER_OPEN_FIREPLACES', 'PHOTO_SUPPLY'],
      dtype='object')

In [19]:
df_clf1['WALL_DESCRIPTION'].unique()

array(['Cavity wall, filled cavity | ',
       'Sandstone or limestone, as built, no insulation (assumed) | Solid brick, as built, no insulation (assumed) | ',
       'Timber frame, as built, insulated (assumed) | ',
       'Cavity wall, as built, insulated (assumed) | ',
       'System built, as built, insulated (assumed) | Timber frame, as built, insulated (assumed) | ',
       'Sandstone or limestone, as built, partial insulation (assumed) | Solid brick, as built, partial insulation (assumed) | ',
       'Average thermal transmittance 0.16 W/m²K | ',
       'Average thermal transmittance 0.15 W/m²K | ',
       'Average thermal transmittance 0.13 W/m²K | ',
       'Average thermal transmittance 0.21 W/m²K | ',
       'Solid brick, as built, insulated (assumed) | Timber frame, as built, insulated (assumed) | ',
       'Average thermal transmittance 0.22 W/m²K | ',
       'Sandstone or limestone, with internal insulation | ',
       'Cavity wall, as built, insulated (assumed) | Sandsto