# Web Scraping


### Extracting text files from webscraper.io CSV results

In [None]:
import pandas as pd
with open('sample-wireless-news.csv', encoding='utf-8') as f:  # change the file name for your file
    df = pd.read_csv(f) # read csv into a pandas dataframe
df.head(5) # display the first five rows of the dataframe

### Export script

Use the script below to export your scraped content to a directory of text files *if your text column contains plain text*. 

Please note the following:

- this will only work with data in the CSV format exported from webscraper.io
- you should inspect the webscraper output in CSV format first, to save repeating this process if changes are needed
- you must load the CSV into this notebook as a pandas dataframe using the cell above FIRST
- you must create a directory called 'textfiles' in the same directory as this notebook (or wherever you run the code from)
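
The last requirement can also be met from inside the notebook rather than by hand. The following is a minimal sketch, assuming the default directory name `textfiles` used by the export cells; `ensure_output_dir` is a hypothetical helper name introduced here for illustration only.

```python
from pathlib import Path

def ensure_output_dir(name='textfiles'):
    """Create the output directory if it does not already exist and return its path."""
    out_dir = Path(name)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if the directory is already there
    return out_dir

out_dir = ensure_output_dir()
print('Output directory:', out_dir.resolve())
```

Running this before the export cell should make the manual step unnecessary, and running it again is harmless.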

In [None]:
# once your data is loaded in the cell above and you've created a 'textfiles' directory, run this cell

text_column_name = 'story_text'  # modify this if your column is named something else

for idx, row in df.iterrows():
    # only export rows where the title, date and text are all strings (this skips empty cells)
    if isinstance(row['title'], str) and isinstance(row['date'], str) and isinstance(row[text_column_name], str):
        # the output filename is built from the first 35 characters of the title plus the date
        filename = 'textfiles/{}.txt'.format(row['title'][:35] + '-' + row['date'])
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(row[text_column_name])
        print('Writing file ' + str(idx), filename)
    else:
        print('No string data - ignoring row', idx)
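
Because the filename is built from the title and the date, a row whose title or date contains a character such as `/` or `:` can make `open()` fail or write to an unexpected path. The sketch below shows one possible way to clean the name first. `safe_filename` is a hypothetical helper, not part of webscraper.io or pandas, and the set of replaced characters is an assumption that you may want to adjust.

```python
import re

def safe_filename(title, date, max_title_len=35):
    """Build '<title>-<date>.txt', replacing characters that are unsafe in file names."""
    raw = title[:max_title_len] + '-' + date
    # replace path separators, reserved punctuation and line breaks with underscores
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', '_', raw)
    return cleaned.strip() + '.txt'

# made-up title and date, for illustration only
print(safe_filename('5G/LTE: what next?', '12/03/2019'))
# prints: 5G_LTE_ what next_-12_03_2019.txt
```

To use it, the filename line in the export cell could become `filename = 'textfiles/' + safe_filename(title, date)`, passing in the row's title and date values.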
            

### Note about the 'Grouped' selector type

Text columns exported from webscraper.io may contain JSON rather than plain text (e.g. if you capture multiple paragraphs using the 'Grouped' selector). In that case, use this exporter instead.

In [None]:
# once your data is loaded in the cell above and you've created a 'textfiles' directory, run this cell
import json

text_column_name = 'story_text'  # modify this if your column is named something else

for idx, row in df.iterrows():
    # only export rows where the title, date and text are all strings (this skips empty cells)
    if isinstance(row['title'], str) and isinstance(row['date'], str) and isinstance(row[text_column_name], str):
        # the output filename is built from the first 35 characters of the title plus the date
        filename = 'textfiles/{}.txt'.format(row['title'][:35] + '-' + row['date'])
        with open(filename, 'w', encoding='utf-8') as f:
            chunks = json.loads(row[text_column_name])  # parse the JSON string into Python objects
            for chunk in chunks:
                f.write(chunk[text_column_name] + ' ')  # write each piece followed by a space
        print('Writing file ' + str(idx), filename)
    else:
        print('No string data - ignoring row', idx)
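
To see what the exporter above does to a single cell, it may help to parse one value by hand. This sketch assumes the JSON is a list of objects keyed by the column name, which is the shape the exporter expects. The `cell_value` string is made up for illustration and is not real scraper output.

```python
import json

text_column_name = 'story_text'  # same column name as in the cells above

# hypothetical cell value, made up for illustration
cell_value = '[{"story_text": "First paragraph."}, {"story_text": "Second paragraph."}]'

chunks = json.loads(cell_value)  # here: a list of dictionaries
joined = ' '.join(chunk[text_column_name] for chunk in chunks)
print(joined)
# prints: First paragraph. Second paragraph.
```

If `json.loads` raises an error on your own data, the column probably contains plain text and the first export script is the better fit.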

### Note if you are trying to join multiple elements from a page

Depending on how you set up your scrape in webscraper.io, the CSV may contain several rows for the same URL. This exporter consolidates them into one text file per title and date.

In [None]:
df = df.sort_values(['web-scraper-order'], ascending=[True])  # reorder the dataframe by web-scraper-order so you get text in order

text_column_name = 'story_text'  # modify this if your column is named something else

# files are opened in append mode, so empty the 'textfiles' directory before re-running this cell,
# otherwise the same text will be added to each file again
for idx, row in df.iterrows():
    if isinstance(row['title'], str) and isinstance(row['date'], str) and isinstance(row[text_column_name], str):
        # rows that share a title and date get the same filename, so their text ends up in one file
        filename = 'textfiles/{}.txt'.format(row['title'][:35] + '-' + row['date'])
        with open(filename, 'a', encoding='utf-8') as f:  # appending rather than overwriting
            f.write(row[text_column_name] + ' ')  # add a space so consecutive pieces do not run together
        print('Writing file ' + str(idx), filename)
    else:
        print('No string data - ignoring row', idx)
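
An alternative to appending is to combine the rows in pandas first and then write each file once, which avoids duplicated text if the cell is run twice. This is a sketch under assumptions, not the method used above: `consolidate_rows` is a hypothetical helper, the sample rows are made up, and it assumes that rows belonging together share the same `title` and `date` and that all text values are strings.

```python
import pandas as pd

def consolidate_rows(frame, text_column_name='story_text'):
    """Join the text of rows that share a title and date, in web-scraper-order."""
    # drop rows with missing values, mirroring the isinstance checks in the export cells
    ordered = frame.dropna(subset=['title', 'date', text_column_name])
    ordered = ordered.sort_values('web-scraper-order')
    grouped = ordered.groupby(['title', 'date'], sort=False)[text_column_name]
    return grouped.agg(' '.join).reset_index()

# made-up sample rows, for illustration only
sample = pd.DataFrame({
    'web-scraper-order': ['1-2', '1-1', '1-3'],
    'title': ['Story A', 'Story A', 'Story B'],
    'date': ['2019-01-01', '2019-01-01', '2019-01-02'],
    'story_text': ['second part', 'first part', 'other story'],
})

combined = consolidate_rows(sample)
print(combined)
```

The resulting dataframe has one row per title and date, so it can be passed to the first export script in place of `df`.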