# Extracting from a zip file of EPC data

We've downloaded the full set of domestic certificates from the [Energy Performance Certificates dataset](https://epc.opendatacommunities.org/domestic/search). It comes as a 5GB zip file, and the laptop doesn't have the capacity to unzip it in the usual way, so we need some Python.

## Import the libraries

We need the `zipfile` library, which is part of Python's standard library, to work with the zip file.

In [1]:
#import a library for dealing with zip files
import zipfile

## Unzip and analyse the files

We can use the `.namelist()` method to look at the names of the files in the zip.

In [2]:
#code adapted from https://realpython.com/python-zipfile/
with zipfile.ZipFile("all-domestic-certificates.zip", mode="r") as archive:
    #extract the list of names
    filelist = archive.namelist()
    #how many items
    print(len(filelist))
    #show the first 7 items
    print(filelist[:7])

1030
['LICENCE.txt', 'domestic-E07000044-South-Hams/', 'domestic-E07000044-South-Hams/recommendations.csv', 'domestic-E07000044-South-Hams/certificates.csv', 'domestic-E07000078-Cheltenham/', 'domestic-E07000078-Cheltenham/recommendations.csv', 'domestic-E07000078-Cheltenham/certificates.csv']
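
Because space is tight, it can help to check how big the contents are before extracting anything. The sketch below uses `.infolist()`, which returns a `ZipInfo` object per item with `file_size` (uncompressed) and `compress_size` attributes. It builds a tiny stand-in archive in memory so it runs without the real download; the folder name and CSV contents are made up for illustration.

```python
import io
import zipfile

# Build a tiny stand-in archive in memory (hypothetical contents),
# so the sketch runs without the real 5GB file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("LICENCE.txt", "licence text\n")
    archive.writestr("domestic-example/certificates.csv", "id,value\n1,a\n")
    archive.writestr("domestic-example/recommendations.csv", "id,item\n1,x\n")

with zipfile.ZipFile(buffer, mode="r") as archive:
    infos = archive.infolist()
    # keep only the certificates files
    certificate_files = [
        info.filename for info in infos if info.filename.endswith("certificates.csv")
    ]
    # add up the sizes before and after compression, in bytes
    total_uncompressed = sum(info.file_size for info in infos)
    total_compressed = sum(info.compress_size for info in infos)

print(certificate_files)
print(total_uncompressed, total_compressed)
```

With the real file, swapping `buffer` for `"all-domestic-certificates.zip"` in the second `with` block should give the same kind of summary.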


## Read some files

We can use `.read()` to grab any item from that list. It returns bytes, so we decode them to get text.

In [3]:
#code adapted from https://realpython.com/python-zipfile/
with zipfile.ZipFile("all-domestic-certificates.zip", mode="r") as archive:
    #extract the list of names
    filelist = archive.namelist()
    #open the first item and store in a variable called text
    text = archive.read(filelist[0]).decode(encoding="utf-8")

#print the new variable, the first 100 chars
print(text[:100])

# Terms of Use

## Copyright and Database Right Information

The Department of Levelling Up, Housing
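
`.read()` loads the whole item into memory, which is fine for the licence but may be heavy for a large CSV. One possible alternative is `.open()`, which gives a file-like object that can be read a bit at a time. The sketch below wraps it in `io.TextIOWrapper` and uses the `csv` module to read only the header and the first two rows. The archive here is a small stand-in built in memory, and its folder name and columns are hypothetical.

```python
import csv
import io
import itertools
import zipfile

# Build a small stand-in archive in memory (hypothetical name and columns).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    archive.writestr("domestic-example/certificates.csv", "id,value\n1,a\n2,b\n3,c\n")

with zipfile.ZipFile(buffer, mode="r") as archive:
    # open the item as a stream rather than reading it all at once
    with archive.open("domestic-example/certificates.csv") as raw:
        # decode the bytes to text as they are read
        text_stream = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        reader = csv.reader(text_stream)
        # the first line holds the column names
        header = next(reader)
        # take just the next two rows and stop
        first_rows = list(itertools.islice(reader, 2))

print(header)
print(first_rows)
```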


## Extract into a folder

We can use `.extract()` to extract a specified file into a specified location.

In [4]:
#code adapted from https://realpython.com/python-zipfile/
with zipfile.ZipFile("all-domestic-certificates.zip", mode="r") as archive:
    #extract the list of names
    filelist = archive.namelist()
    #extract first item to a subdirectory
    archive.extract(filelist[0], path="output_dir/")
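
The same method can be used in a loop to pull out only the files we care about. As a rough sketch, the code below filters the name list for `certificates.csv` files and extracts just those, leaving everything else in the zip. It works on a stand-in archive written to a temporary folder, with made-up folder names and contents, so it can run without the real file.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as workdir:
    # Build a small stand-in archive on disk (hypothetical names and contents).
    zip_path = Path(workdir) / "stand-in.zip"
    with zipfile.ZipFile(zip_path, mode="w") as archive:
        archive.writestr("LICENCE.txt", "licence text\n")
        archive.writestr("domestic-example-a/certificates.csv", "id,value\n1,a\n")
        archive.writestr("domestic-example-a/recommendations.csv", "id,item\n1,x\n")
        archive.writestr("domestic-example-b/certificates.csv", "id,value\n2,b\n")

    output_dir = Path(workdir) / "output_dir"
    with zipfile.ZipFile(zip_path, mode="r") as archive:
        # keep only the certificates files
        wanted = [name for name in archive.namelist() if name.endswith("/certificates.csv")]
        # extract each one; .extract() returns the path it wrote to
        extracted_paths = [archive.extract(name, path=output_dir) for name in wanted]

    # show the extracted paths relative to the output folder
    extracted_names = sorted(
        Path(p).relative_to(output_dir).as_posix() for p in extracted_paths
    )
    # count every file that ended up on disk
    files_on_disk = sum(1 for p in output_dir.rglob("*") if p.is_file())

print(extracted_names)
print(files_on_disk)
```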


## Extract all to a folder

The `.extractall()` method will extract all the files, so all it needs to know is the directory you want to extract to.

In [5]:
#code adapted from https://realpython.com/python-zipfile/
with zipfile.ZipFile("all-domestic-certificates.zip", mode="r") as archive:
    #extract all to a subdirectory
    archive.extractall(path="output_dir/")


This appears to work, so we turn to the command line to solve the problem of combining all these files into one.