Install dependencies

In [1]:
%%capture
! pip install markdown2 beautifulsoup4

Import dependencies

In [2]:
import markdown2
from bs4 import BeautifulSoup
import re

Define two functions: one extracts the text visible to the user, and the other additionally includes the href value of each link

In [3]:
def extract_raw_text_from_markdown(markdown_content):
  html_content = markdown2.markdown(markdown_content)
  soup = BeautifulSoup(html_content, 'html.parser')

  raw_text = soup.get_text()
  stripped_text = raw_text.strip()
  cleaned_text = re.sub(r'\n{2,}', '\n', stripped_text)

  return cleaned_text
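
The clean-up step at the end of the function above can be tried on its own, without markdown2 or BeautifulSoup. This is a minimal sketch: `collapse_blank_lines` and the sample string are hypothetical names introduced here for illustration. Note that the pattern only matches consecutive newline characters, so a "blank" line that contains spaces is left untouched.

```python
import re


def collapse_blank_lines(text):
  # Hypothetical helper mirroring the strip + re.sub clean-up used above:
  # trim leading/trailing whitespace, then squeeze runs of newlines into one.
  return re.sub(r'\n{2,}', '\n', text.strip())


sample = '\n\nInstallation\n\n\nInstallation instructions can be found in the wiki.\n\n'
print(repr(collapse_blank_lines(sample)))

# A line holding only a space separates the newlines, so nothing is collapsed.
print(repr(collapse_blank_lines('Getting started\n \nRead the tutorial')))
```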

In [4]:
def extract_raw_text_from_markdown_with_links(markdown_content):
  html_content = markdown2.markdown(markdown_content)
  soup = BeautifulSoup(html_content, 'html.parser')

  # Append "(href) " inside each link so the URL follows the link text.
  # Walking every descendant and also calling get_text() on <a> tags would
  # emit the link text twice: once for the tag and once for its text node.
  for link in soup.find_all('a'):
    link_href = link.get('href')
    if link_href:
      link.append(f'({link_href}) ')

  return soup.get_text()
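
To make the intended output format concrete ("link text" followed by "(href) "), here is a dependency-free sketch of the same idea. It swaps BeautifulSoup for the standard library's `html.parser` and works on HTML directly rather than on Markdown; `LinkTextExtractor`, `html_to_text_with_links`, and the example URL are hypothetical and not part of the notebook.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
  """Collect visible text and append "(href) " after each link's text."""

  def __init__(self):
    super().__init__()
    self.parts = []
    self.href_stack = []

  def handle_starttag(self, tag, attrs):
    if tag == 'a':
      # Remember the href (may be None) until the closing </a> arrives.
      self.href_stack.append(dict(attrs).get('href'))

  def handle_data(self, data):
    # Every text node is emitted exactly once.
    self.parts.append(data)

  def handle_endtag(self, tag):
    if tag == 'a' and self.href_stack:
      href = self.href_stack.pop()
      if href:
        self.parts.append(f'({href}) ')


def html_to_text_with_links(html_content):
  parser = LinkTextExtractor()
  parser.feed(html_content)
  parser.close()
  return ''.join(parser.parts)


sample_html = '<p>Read more in the <a href="https://example.org/wiki">wiki</a>.</p>'
print(html_to_text_with_links(sample_html))
```

Because text is only collected in `handle_data`, the link text cannot be emitted twice, which is the main pitfall when mixing a full-tree walk with `get_text()`.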

Run the code against an example file

In [5]:
input_file = 'example.md'
with open(input_file, 'r', encoding='utf-8') as f:
  markdown_content = f.read()
raw_text = extract_raw_text_from_markdown(markdown_content)

print(raw_text)

ObsPy is an open-source project dedicated to provide a Python framework for processing seismological data. It provides parsers for common file formats, clients to access data centers and seismological signal processing routines which allow the manipulation of seismological time series (see Beyreuther et al. 2010, Megies et al. 2011, Krischer et al. 2015).
The goal of the ObsPy project is to facilitate rapid application development for seismology.
ObsPy is licensed under the GNU Lesser General Public License (LGPL) v3.0.
A one-hour introduction to ObsPy is available at YouTube.
Read more in our GitHub wiki
Installation
Installation instructions can be found in the wiki.
Getting started
Read about how to get started in the wiki and in our Tutorial section in the documentation.
ObsPy Tutorial notebooks -- and much more on specific seismology topics -- can also be found on Seismo-Live, both as a static preview and as interactively runnable version.
python
from obspy import read
st = read()

Load the fetched JSON files and export them to a CSV file

In [8]:
import os
import json
import csv
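
As a self-contained illustration of the load-and-export step in this section, the sketch below walks a `<year>/<field>/<term>.json` tree and writes the repositories to CSV, with an optional cap counted per repository. It is an assumption-laden sketch, not the notebook's code: `iter_repos`, `write_csv`, `demo`, the three-column subset, and the generated sample data are all hypothetical.

```python
import csv
import itertools
import json
import tempfile
from pathlib import Path

FIELDS = ['full_name', 'url', 'stars']  # subset of the columns, for brevity


def iter_repos(root):
  """Yield every repository dict found under root/<year>/<field>/<term>.json."""
  for json_path in sorted(Path(root).glob('*/*/*.json')):
    with open(json_path, encoding='utf-8') as json_file:
      yield from json.load(json_file)


def write_csv(root, csv_path, max_rows=None):
  """Write at most max_rows repositories to csv_path; return the row count."""
  repos = itertools.islice(iter_repos(root), max_rows)
  with open(csv_path, 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=FIELDS, extrasaction='ignore')
    writer.writeheader()
    count = 0
    for repo in repos:
      writer.writerow(repo)
      count += 1
  return count


def demo():
  # Build a tiny, made-up directory tree so the sketch runs anywhere.
  with tempfile.TemporaryDirectory() as tmp:
    term_dir = Path(tmp) / 'out' / '2015' / 'biology'
    term_dir.mkdir(parents=True)
    for term in ('Genetics', 'Genomics'):
      repos = [{'full_name': f'example/{term}-{n}',
                'url': 'https://example.org',
                'stars': n} for n in range(3)]
      (term_dir / f'{term}.json').write_text(json.dumps(repos), encoding='utf-8')
    csv_path = Path(tmp) / 'out.csv'
    written = write_csv(Path(tmp) / 'out', csv_path, max_rows=4)
    with open(csv_path, newline='', encoding='utf-8') as f:
      rows = list(csv.reader(f))
    return written, rows


written, rows = demo()
print(written, len(rows))
```

Using a generator plus `itertools.islice` keeps the row limit in one place, so no counter has to be threaded through the nested loops.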

In [11]:
with open('out.csv', 'w', newline='', encoding='utf-8') as csv_file:
  csv_writer = csv.writer(csv_file)
  csv_writer.writerow([
    'full_name',
    'url',
    'homepage',
    'programming_language',
    'forks',
    'stars',
    'description',
    'contents',
    'readme'
  ])

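  # Debugging limit: i counts term files, not repositories. Only the first
  # 101 term files contribute rows; later files are still opened and parsed,
  # but their repositories are skipped by the break below.
  i = 0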
  for year in os.listdir('out'):
    print(f'reading year {year}')
    for field in os.listdir(f'out/{year}'):
      print(f'reading field {field}')
      for term in os.listdir(f'out/{year}/{field}'):
        print(f'reading term {term}')
        with open(f'out/{year}/{field}/{term}', encoding='utf-8') as json_file:
          contents = json.load(json_file)
          for repo in contents:
            if i > 100:
              break
            csv_writer.writerow([
              repo['full_name'],
              repo['url'],
              repo['homepage'],
              repo['programming_language'],
              repo['forks'],
              repo['stars'],
              repo['description'],
              repo['contents'],
              repo['readme']
            ])
        i += 1

reading year 2015
reading field computer_sciences
reading term Robotics.json
reading term Networking.json
reading term Privacy.json
reading term Security.json
reading term Software Engineering.json
reading term Artificial Intelligence.json
reading term Vision.json
reading term Information Visualization.json
reading term Theoretical Computer Science.json
reading term Graphics.json
reading term Systems.json
reading term Machine Learning.json
reading term Computer Engineering.json
reading term Human-Computer Interaction.json
reading field biology
reading term Biophysics.json
reading term Genomics.json
reading term Genetics.json
reading term Cancer.json
reading term Neuroscience.json
reading term Virology.json
reading term Structural Biology.json
reading term Biochemistry.json
reading term Cell Biology.json
reading term Microbiology.json
reading field medicine
reading term Clinical Innovation .json
reading term Internal Medicine.json
reading term Cardiology.json
reading term Geriatric Medi