Although the HTML file provides a wealth of data, its format is hard to work with when automating any process that needs this high-level information. This code scrapes the data from the HTML file and writes it to a simple .csv file named "**Summary**":

- **Summary**

 | Column name | Type of variable | Number of values | Kind of measurement | Coding used | Cardinality |
 | --- | --- | --- |--- | --- | --- |
 | 	'23-0.0' |	'Categorical (single)' | '456606' | 'Spirometry method'| '100270'  |'5' |

- **Meaning**

 | Column | Meaning / Relevance |
 | --- | --- | 
 |Column name |	Name used in the UKBB | 
 |Type of variable |	'Categorical (single)' , 'Integer' , 'Date' , 'Time'  ...|
 |Number of values |	Number of people on which the measurement was performed (related to the number of missing values for this measurement) |
 |Kind of measurement | Short description of what the column measures (e.g. 'Spirometry method') |
 |Coding used|	Identifier of the data coding applied to the values (many measurements are encoded differently) |
 |Cardinality| Number of similar questions (same number before the first "-" in "Column name") |
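
The cardinality definition in the table can be illustrated with a short sketch. It assumes that everything before the first "-" in a column name identifies the question, as in the example row (`23-0.0`). The sample names and the helper `field_id` are hypothetical, and this is not how the notebook obtains the value (it is scraped from the HTML file).

```python
from collections import Counter

def field_id(column_name):
    # Hypothetical helper: everything before the first "-" identifies the question
    return column_name.split("-", 1)[0]

# Made-up column names, for illustration only
column_names = ["23-0.0", "23-1.0", "23-2.0", "31-0.0"]

# Count how many columns share each identifier
cardinality = Counter(field_id(name) for name in column_names)

for name in column_names:
    print(name, cardinality[field_id(name)])
```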

In [None]:
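import csv
import re

from bs4 import BeautifulSoup

url = "./ukb24899.html"  # insert the path to the HTML file
# Parse the whole file once; the context manager closes it afterwards.
# No parser is named, so BeautifulSoup falls back to the best one installed.
with open(url, encoding="ISO-8859-1") as page:
    soup = BeautifulSoup(page.read())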

### Function to pull all the information from the HTML

In [None]:
def variable_type_opt(column_number):
    """Return (UDI, type, count, description, data coding, cardinality) for a column."""
    # Find the right-aligned cell holding the column number; the cell after it
    # must contain the link to the UDI.
    pattern = re.compile(r'^(%s)$' % column_number)
    tag_aux = None
    tag = soup.find('td', style="text-align: right;", text=pattern)
    if tag is not None and tag.next_sibling is not None and tag.next_sibling.a is not None:
        tag_aux = tag.next_sibling
    else:
        # The first match is not followed by a UDI link: check every match
        for item in soup.find_all('td', style="text-align: right;", text=pattern):
            if item.next_sibling is not None and item.next_sibling.a is not None:
                tag_aux = item.next_sibling
                break
    if tag_aux is None:
        raise ValueError("Column %s was not found in the HTML file" % column_number)
    # The digits before the "-" identify the tag we look for next;
    # all columns sharing them pertain to the same subtype
    up_to = tag_aux.string.find('-')
    udi_aux = tag_aux.string[:up_to + 1]
    # The cell after the UDI holds the number of measurements
    tag6 = tag_aux.next_sibling
    # Find the first link whose text starts with that prefix, then walk along its row
    tag_aux_2 = soup.find('a', text=re.compile(r'^(%s)' % udi_aux))
    tag_aux_3 = tag_aux_2.parent
    tag4 = tag_aux_3.next_sibling.next_sibling  # type of variable
    tag5 = tag4.next_sibling  # description cell
    # Collect the text fragments of the description cell
    list_1 = list(tag5.stripped_strings)
    if len(list_1) == 1:
        a1 = list_1[0]  # meaning of the column
        a2 = "0"  # placeholder: no data coding listed
        a3 = "0"  # placeholder: no number of values listed
    else:
        a1 = list_1[0]  # meaning of the column
        a2 = list_1[2]  # data coding
        a3 = list_1[3].split()[1]  # number of values (stored as cardinality)
    return (str(tag_aux.string), str(tag4.string), str(tag6.string), a1, a2, a3)

The result lists are pre-allocated with one slot per column; this operation is very fast, even on big lists.
Creating the objects that are later assigned to the list elements (here, one call to `variable_type_opt` per column) takes much longer and is the bottleneck of the program, performance-wise.
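
A minimal sketch of the two ways to fill a list, for comparison. The function names, the list size and the placeholder work `str(i)` are assumptions made for illustration; timings depend on the machine, so none are quoted here.

```python
import timeit

N = 100_000  # arbitrary size for the illustration

def fill_preallocated(n):
    # Reserve every slot up front, then assign by index
    out = [None] * n
    for i in range(n):
        out[i] = str(i)
    return out

def fill_by_append(n):
    # Grow the list one element at a time
    out = []
    for i in range(n):
        out.append(str(i))
    return out

# Both strategies build the same list
assert fill_preallocated(N) == fill_by_append(N)

print("pre-allocated:", timeit.timeit(lambda: fill_preallocated(N), number=5))
print("append:       ", timeit.timeit(lambda: fill_by_append(N), number=5))
```

In both versions the per-element work (`str(i)` here, one HTML lookup per column in this notebook) is separate from the cost of managing the list itself.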

In [None]:
import time

N_COLUMNS = 16021  # index 0 is "eid"; columns 1 to 16020 are read from the HTML
t = time.time()
# Pre-allocate one slot per column
List_types = [None] * N_COLUMNS
List_names = [None] * N_COLUMNS
List_count = [None] * N_COLUMNS
List_meaning = [None] * N_COLUMNS
List_coding = [None] * N_COLUMNS
List_cardinality = [None] * N_COLUMNS
# The participant ID column is filled in by hand
List_types[0] = "Sequence"
List_names[0] = "eid"
List_count[0] = 502543
List_meaning[0] = "Encoded anonymised participant ID"
List_coding[0] = "0"
List_cardinality[0] = "0"

for i in range(1, N_COLUMNS):
    Aux = variable_type_opt(i)
    List_names[i] = Aux[0]
    List_types[i] = Aux[1]
    List_count[i] = Aux[2]
    List_meaning[i] = Aux[3]
    List_coding[i] = Aux[4]
    List_cardinality[i] = Aux[5]
    if i % 500 == 0:
        print(i)  # progress indicator
elapsed = time.time() - t
print(elapsed / 60)  # elapsed time in minutes

Even with pre-allocation, it took about 3 hours 20 minutes (199.39 minutes) to perform all the calculations on a MacBook Pro (2.7 GHz Intel Core i5, 8 GB 1867 MHz DDR3).

### Saving each list in csv files

In [None]:
# Write each list as a single CSV row in a file named after the list
lists_to_save = {
    'List_count': List_count,
    'List_names': List_names,
    'List_types': List_types,
    'List_meaning': List_meaning,
    'List_coding': List_coding,
    'List_cardinality': List_cardinality,
}
for filename, values in lists_to_save.items():
    with open(filename, 'w', newline='') as f:
        thewriter = csv.writer(f)  # create the writer object
        thewriter.writerow(values)

### Saving to Summary.csv

In [1]:
import csv  # needed when this cell runs in a fresh kernel
import pandas as pd

def read_single_row(filename):
    # Each file holds one list, written as a single CSV row
    with open(filename, newline='') as f:
        return next(csv.reader(f))

# One list per output column, in the order of the header below
summary_T = [read_single_row(name) for name in
             ("List_names", "List_types", "List_count",
              "List_meaning", "List_coding", "List_cardinality")]

# Transpose so that each row describes one UKBB column
summary = list(map(list, zip(*summary_T)))
df = pd.DataFrame(summary, columns=["Column_name", "Type_variable", "Number_values",
                                    "Kind_measurement", "Coding_used", "Cardinality"])
df.to_csv("Summary.csv", index=False)
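
As an alternative sketch, the six lists could be written to one file directly with `zip`, skipping the intermediate single-row files. The function name `write_summary`, the two-row sample lists and the output file name `Summary_example.csv` are assumptions made for illustration; the header names are the ones used for `Summary.csv`.

```python
import csv

HEADER = ["Column_name", "Type_variable", "Number_values",
          "Kind_measurement", "Coding_used", "Cardinality"]

def write_summary(path, names, types, counts, meanings, codings, cardinalities):
    # zip pairs up the i-th element of every list, giving one row per column
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(HEADER)
        writer.writerows(zip(names, types, counts, meanings, codings, cardinalities))

# Two sample rows built from values shown earlier in this notebook
write_summary(
    "Summary_example.csv",
    ["eid", "23-0.0"],
    ["Sequence", "Categorical (single)"],
    [502543, "456606"],
    ["Encoded anonymised participant ID", "Spirometry method"],
    ["0", "100270"],
    ["0", "5"],
)
```

The same call would accept the full `List_*` lists once they are filled.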