# Extract the government SOU dataset from CSV data

This notebook parses a previously created CSV file (`sou_data.csv`) and generates a bash script that downloads the SOU dataset from https://sou.kb.se.

The resulting bash script uses `wget` to download the PDF files and gives them sanitized names based on the titles in the CSV file. It also sorts them into a folder structure organized by year. A pre-generated example script can be found in the subfolder [run-csv/](https://github.com/CDHUppsala/cdhu-kb-scraping/tree/main/run-csv). You can either edit and execute that script directly, or run this notebook to customize the download script to your liking.
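As a rough illustration of the naming scheme described above, the sketch below turns a title into a sanitized file path under a per-year folder. The helper names `sanitize_title` and `build_path` and the example titles are hypothetical and only used here; the character filter and the 120-character limit follow the code further down in this notebook.

```python
def sanitize_title(title, max_len=120):
    """Keep letters, digits and spaces, then replace spaces with underscores."""
    kept = "".join(c for c in title if c.isalpha() or c.isdigit() or c == " ").rstrip()
    # A double space (left behind when punctuation is dropped) becomes one underscore
    return kept.replace("  ", "_").replace(" ", "_")[:max_len]


def build_path(year, nr, extra, title):
    """Assemble '<year>/sou-<year>-<nr><extra>-<sanitized title>.pdf'."""
    return f"{year}/sou-{year}-{nr}{extra}-{sanitize_title(title)}.pdf"


# Hypothetical example values, not taken from sou_data.csv
print(build_path("1922", "001", "", "Ett exempel"))
# 1922/sou-1922-001-Ett_exempel.pdf
```

Swedish letters such as `ä` and `ö` pass the `isalpha()` test and are therefore kept in the file name.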


```python
CDHU="""\
 ____ ____ ____ ____ ________ 
||C |||D |||H |||U |||       
||__|||__|||__|||__|||_______
|/__\|/__\|/__\|/__\|/_______

 EXTRACT SOU / BEAUTIFUL SOUP
"""
```



In [11]:
# Code: Matts L/CDHU
# Requires: pandas
import re
import pandas as pd 


In [12]:

filename='sou_data.csv'
try:
    df = pd.read_csv(f'./{filename}', sep=',', index_col=False)
    print(df.head())
except Exception as e:
    # report the actual exception instance rather than the Exception class
    print(f'-- Error! {e}')
else:
    print(f'++ Read {filename}')

                  titel                                                pdf  \
0  1922:1 första serien  https://weburn.kb.se/sou/580/urn-nbn-se-kb-dig...   
1                1922:1  https://weburn.kb.se/sou/190/urn-nbn-se-kb-dig...   
2  1922:2 första serien  https://weburn.kb.se/sou/580/urn-nbn-se-kb-dig...   
3                1922:2  https://weburn.kb.se/sou/190/urn-nbn-se-kb-dig...   
4  1922:3 första serien  https://weburn.kb.se/sou/580/urn-nbn-se-kb-dig...   

                                                 urn                sou-nr  \
0  http://urn.kb.se/resolve?urn=urn:nbn:se:kb:sou...  1922:1 första serien   
1  http://urn.kb.se/resolve?urn=urn:nbn:se:kb:sou...                1922:1   
2  http://urn.kb.se/resolve?urn=urn:nbn:se:kb:sou...  1922:2 första serien   
3  http://urn.kb.se/resolve?urn=urn:nbn:se:kb:sou...                1922:2   
4  http://urn.kb.se/resolve?urn=urn:nbn:se:kb:sou...  1922:3 första serien   

                                          full_titel  
0      
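The `titel` values in the preview above combine a year, a running number and an optional suffix such as `första serien`. The sketch below shows one way to split such a value, using the same pattern idea as the next cell. The function name `parse_titel` is hypothetical, and the grouping differs slightly from the notebook's regular expression, which also captures the colon.

```python
import re

# Four-digit year, a colon, a one- to three-digit number, then any suffix
TITEL_RE = re.compile(r"(\d{4}):(\d{1,3})(.*)")


def parse_titel(titel):
    """Return (year, zero-padded number, suffix), or None if nothing matches."""
    match = TITEL_RE.search(titel)
    if match is None:
        return None
    year, nr, extra = match.groups()
    return year, nr.zfill(3), extra


print(parse_titel("1922:1 första serien"))
# ('1922', '001', ' första serien')
```

Zero-padding the number to three digits keeps the files sorted in numerical order within each year folder.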

In [13]:
count=0
deactivated=0
DECADE=1920 # set SOU series starting decade
DECADE_BLACKLIST=[]
# Uncomment and edit one of the lines below to exclude whole decades from the download script
#DECADE_BLACKLIST=[1920,1930,1940,1950,1960,1980,1990]  # <== blacklist all except 1970s
DECADE_BLACKLIST=[1920,1930,1940,1990]  # <== blacklist 1920-1940, 1990

# generate output and write to file
# 
try:
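  # write as UTF-8 so Swedish characters in the file names survive on every platform
  with open('wget_all_csv.sh', 'w', encoding='utf-8') as fp: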
    fp.write("#!/bin/bash\n")
    for ind in df.index:
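          df_titel = df['titel'][ind]
          # split e.g. "1922:1 första serien" into year, number and optional suffix
          result = re.search(r'(\d{4})(:)(\d{1,3})(.*)', df_titel)
          if result is None:
            continue # skip rows whose title does not start with "YYYY:N"
          year = result.group(1)
          nr = result.group(3).zfill(3) # zero-pad the number to three digits
          extra = result.group(4)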
          full_titel = df['full_titel'][ind]
          full_titel = "".join([c for c in full_titel if c.isalpha() or c.isdigit() or c==' ']).rstrip()
          full_titel = full_titel.replace("  ", "_")
          full_titel = full_titel.replace(" ", "_")
          full_titel = full_titel[:120]

          # this code block produces a commented line indicating the start of a new decade (for easier editing)
          # (a while loop also catches up if the data skips a whole decade)
          while int(year)-DECADE >= 10:
            DECADE=DECADE+10 # inc 10 yrs
            fp.write("# "+str(DECADE)+"\n")

          # This code block checks if the current SOU belongs to a blacklisted decade or not
          test = int(year)-int(year)%10 # subtracting the remainder floors the year to its decade
          if test in DECADE_BLACKLIST:
            DO_COMMENT = "#" # deactivate line by inserting a comment if year is blacklisted
            deactivated += 1
          else: # ... if not, clear DO_COMMENT (this also covers an empty blacklist)
            DO_COMMENT = ""
          
          #c=1920

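          url = df['pdf'][ind]
          # sanitized SOU number; computed here but not used in the command below
          sou_id = df['sou-nr'][ind] # renamed from "id" to avoid shadowing the built-in
          #sanitize
          #sou_id = "".join([c for c in sou_id if c.isalpha() or c.isdigit() or c==' ']).rstrip()
          sou_id = sou_id.replace(" ", "_")
          sou_id = sou_id.replace(":", "-")
          sou_id = sou_id.replace("/", "-")
          sou_id = 'sou-'+sou_id
          longtitle = df['full_titel'][ind]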

          # set up download command and output filename and path
          command = f'wget --continue -O "{year}/sou-{year}-{nr}{extra}-{full_titel}.pdf" {url}'
          # set up mkdir command to prefix wget
          prefix = f'mkdir -p {year}/ && '
          command = prefix+command
          # deactivate line if blacklisted
          command = DO_COMMENT+command
          # write each item on a new line
          fp.write(f"{command}\n")
          count += 1
except Exception as e:
  # show what went wrong instead of silently swallowing every error
  print(f"-- Error! {e}")
else:
  print(f'++ Wrote {count} lines as "wget_all_csv.sh" ({deactivated} deactivated lines)')

++ Wrote 6114 lines as "wget_all_csv.sh" (3267 deactivated lines)
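To double-check the counts reported above, the generated script can be scanned for active and deactivated download lines. The sketch below is a minimal example, assuming every download line starts with `mkdir -p ` and every deactivated one with `#mkdir -p `, as written by the cell above. The function name `summarize_script` and the `example.org` URLs are placeholders, not part of the dataset.

```python
def summarize_script(text):
    """Count active and deactivated download lines in a generated script."""
    active = 0
    deactivated = 0
    for line in text.splitlines():
        if line.startswith("mkdir -p "):
            active += 1
        elif line.startswith("#mkdir -p "):
            deactivated += 1
        # the shebang and the "# 1930"-style decade headers are ignored
    return active, deactivated


# Small hand-written sample in the same shape as wget_all_csv.sh
sample = """#!/bin/bash
#mkdir -p 1922/ && wget --continue -O "1922/sou-1922-001-Title.pdf" https://example.org/a.pdf
# 1970
mkdir -p 1970/ && wget --continue -O "1970/sou-1970-001-Title.pdf" https://example.org/b.pdf
"""
print(summarize_script(sample))
# (1, 1)
```

To check the real script, read `wget_all_csv.sh` and pass its contents to the function. The two numbers should add up to the total line count printed by the notebook, because that count includes the deactivated lines.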
