# Web Scraping and cleaning

In this notebook we will be scraping the data from any given url and then use cleaning methods like regular expression to remove unwanted contents from the scraped text like unwanted spaces, \n, \t. Remove the citation marks like [6] or (ab) so we are left with only useful text.

In [8]:
import bs4 as bs
import requests
import re

In [2]:
URL = 'https://en.wikipedia.org/wiki/Artificial_intelligence'
html_page = requests.get(URL).text

In [3]:
soup = bs.BeautifulSoup(html_page, 'lxml')

In [4]:
main_heading = soup.find('h1').text
main_heading

'Artificial intelligence'

In [5]:
heading_tags = soup.find_all('h2')
headings = []
for heading in heading_tags:
    h = heading.text
    h = h.replace("\n", "")
    headings.append(h)
headings

['Contents',
 'History',
 'Goals',
 'Tools',
 'Applications',
 'Philosophy',
 'Future',
 'In fiction',
 'See also',
 'Explanatory notes',
 'Citations',
 'References',
 'Further reading',
 'External links',
 'Sources',
 'Navigation menu']

In [6]:
subheading_tags = soup.find_all('h3')
subheadings = []
for subheading in subheading_tags:
    s = subheading.text
    s = s.replace("\n", "")
    subheadings.append(s)
subheadings

['Reasoning, problem solving',
 'Knowledge representation',
 'Planning',
 'Learning',
 'Natural language processing',
 'Perception',
 'Motion and manipulation',
 'Social intelligence',
 'General intelligence',
 'Search and optimization',
 'Logic',
 'Probabilistic methods for uncertain reasoning',
 'Classifiers and statistical learning methods',
 'Artificial neural networks',
 'Specialized languages and hardware',
 'Defining artificial intelligence',
 'Evaluating approaches to AI',
 'Machine consciousness, sentience and mind',
 'Superintelligence',
 'Risks',
 'Ethical machines',
 'Regulation',
 'AI textbooks',
 'History of AI',
 'Other sources',
 'Personal tools',
 'Namespaces',
 'Variantsexpandedcollapsed',
 'Views',
 'Moreexpandedcollapsed',
 'Search',
 'Navigation',
 'Contribute',
 'Tools',
 'Print/export',
 'In other projects',
 'Languages']

In [7]:
para_tags = soup.find_all('p')
content = []
citation_pattern = r"([\[]\w*|[\]])"
for para in para_tags:
    p = para.text
    p = p.replace("\n", "")
    p = re.sub(citation_pattern, "", p)
    if(p):
        content.append(p)
print(content[18])

Early researchers developed algorithms that imitated step-by-step reasoning that humans use when they solve puzzles or make logical deductions.By the late 1980s and 1990s, AI research had developed methods for dealing with uncertain or incomplete information, employing concepts from probability and economics.


## Making it into a function

In [8]:
def get_content(URL):
    html_page = requests.get(URL).text
    soup = bs.BeautifulSoup(html_page, 'lxml')
    main_tags = soup.find_all('h1')
    main_headings = []
#     spaces_pattern = r"(  |\r|\n|\t)"
    space_pattern = r'\s+'
    for main in main_tags:
        m = main.text
        m = re.sub(spaces_pattern, "", m)
        main_headings.append(m)
    
    heading_tags = soup.find_all('h2')
    headings = []
    for heading in heading_tags:
        h = heading.text
        h = re.sub(spaces_pattern, "", h)
        headings.append(h)
    
    subheading_tags = soup.find_all('h3')
    subheadings = []
    for subheading in subheading_tags:
        s = subheading.text
        s = re.sub(spaces_pattern, "", s)
        subheadings.append(s)
    
    para_tags = soup.find_all('p')
    content = []
#     citation_pattern = r"([\[]\w*|[\]])|([\(]\w*|[\)])"
    citation_pattern = r'\[[0-9a-zA-Z]*\]'
    for para in para_tags:
        p = para.text
        p = re.sub(spaces_pattern, "", p)
        p = re.sub(citation_pattern, "", p)
        if(p):
            content.append(p)
    
    return (main_headings, headings, subheadings, content)

In [9]:
data = get_content('https://www.frontiersin.org/articles/10.3389/fpubh.2020.00014/full')

In [10]:
data[3][9]

'The interactions between humans and their physical surroundings have been extensively studied, as multiple human activities influence the environment. The environment is a coupling of the biotic  organisms and microorganisms and the abiotic , lithosphere, and atmosphere.'

In [21]:
def clean_text(text):
    space_pattern = r"(  |\r|\n|\t)"
    citation_pattern = r'\[[0-9a-zA-Z]*\]'
    text = re.sub(space_pattern, "", text)
    text = re.sub(citation_pattern, "", text)
    
    return text

In [22]:
temp = '''Direct Automatic Generation of Mind Maps from text with M2Gen  M. Abdeen, R. El-Sahan, A. Ismaeil, S. El-Harouny, M. Shalaby The Faculty of Computers and Information Sciences Ain-Shams University Cairo, Egypt mabdeen@alumni.uottawa.ca M. C.E. Yagoub  The school of Information Technology and Engineering University of Ottawa Ottawa, Ontario,  myagoub@site.uottawa.ca  Abstract—A  mind  map  is  a  diagram  used  to  represent  words, ideas,  or  other  items  linked  to  and  arranged  around  a  central keyword  or  idea.  Mind  maps  are  used  to  generate,  visualize, structure, and classify ideas, and as an aid in organization, study, project  management,    problem  solving,  decision  making,  and writing.  It  has  been  long  used  in  brainstorming  and  as  an effective educational tool. There are numerous tools in the  market, either  as freeware or as proprietary  software,  that  help  users  generate  mind  maps.. However,  these  tools  are  more  of  mind  map  “editing”  tools  to help  users  project  their  ideas  from  their  minds  into  the  tool mapping space. These tools also provide a comprehensive library of images that  suits  the most popular mind map  types.  The tools act  as  the  media  into  which  users  projects  the  maps  that  has already more-or-less matured in their minds. In  this  work,  we  present  a  software  tool  that  automatically generates  mind  maps  directly  from  text.  This  tool  provides  a prospect  to  transform  many  literatures automatically  into  mind maps. One  significant  application  of  this tool is education. Many students  finds  it  easier  to  follow  and  remember  information presented in the mid map form rather than pure text.  Keywords- text processing; semantic analysis; web mining. I. INTRODUCTION Mind mapping is a popular brainstorming tool and thinking technique  of  visually  arranging  ideas  and  their interconnections.  It  is  a  way  of  representing  associated thoughts with symbols rather than with extraneous words. The human  mind  forms  associations  almost  instantaneously,  and "mapping"  allows  capturing  these  ideas  quicker  than expressing them using only words or phrases. Originated  in  the  late  1960s  by  Tony  Buzan,   [1]  mind mapping  harnesses  the  full  range  of  cortical  skills  -  word, image, number, logic, rhythm, color and spatial awareness - in a single, uniquely powerful manner. In so doing, mind mapping gives the freedom to roam the infinite expanses of your brain. It is now used by millions of people around the world. A  mind  map  is  a diagram  used  to  represent  words, ideas, tasks, or  other  items linked  to  and  arranged  radially  around  a central  keyword.  As  an  example,   0  depicts  a  mind  map  of Google tools.  Figure 1   Google tools mind map  Manually  constructing  mind  maps  requires  thorough reading  and  good  understanding  the  text  which  takes  much time  and  effort.  In  addition  to  that  not  all  people  are  creative enough to  draw  elegant and expressive mind maps. Therefore, automatically  generating  mind  maps  saves  much  time  and effort and serves better and quicker various applications. Mind  mapping  applications  are  numerous.  Organizing, meetings, planning, note taking, presentation, and  above all, in education  [2]. There are  numerous tools in the market, either  as  freeware or as proprietary software, that help users generate mind maps. Wisemapping and Mindomo  [3] are  examples  of  the  freeware and Buzan’s iMindMap and Inspiration  [4] are examples of the proprietary ones. These softwares help the user in drawing the mind map and have some ready designs and diagrams which can be used. But the user must read, understand the text well and come up with a design for the mind map himself. Automatically  generating  mind  maps  out  of  pure  text requires  many  stages  of  text  processing.  In  the  following sections, we provide details of the main modules of the tool and the stages used to produce the final mind map.  TIC-STH 2009978-1-4244-3878-5/09/$25.00 ©2009 IEEE 95
II.'''

In [23]:
temp

'Direct Automatic Generation of Mind Maps from text with M2Gen  M. Abdeen, R. El-Sahan, A. Ismaeil, S. El-Harouny, M. Shalaby The Faculty of Computers and Information Sciences Ain-Shams University Cairo, Egypt mabdeen@alumni.uottawa.ca M. C.E. Yagoub  The school of Information Technology and Engineering University of Ottawa Ottawa, Ontario,  myagoub@site.uottawa.ca  Abstract—A  mind  map  is  a  diagram  used  to  represent  words, ideas,  or  other  items  linked  to  and  arranged  around  a  central keyword  or  idea.  Mind  maps  are  used  to  generate,  visualize, structure, and classify ideas, and as an aid in organization, study, project  management,    problem  solving,  decision  making,  and writing.  It  has  been  long  used  in  brainstorming  and  as  an effective educational tool. There are numerous tools in the  market, either  as freeware or as proprietary  software,  that  help  users  generate  mind  maps.. However,  these  tools  are  more  of  mind  map  “editing”

In [17]:
clean_text(temp)

'Direct Automatic Generation of Mind Maps from text with M2GenM. Abdeen, R. El-Sahan, A. Ismaeil, S. El-Harouny, M. Shalaby The Faculty of Computers and Information Sciences Ain-Shams University Cairo, Egypt mabdeen@alumni.uottawa.ca M. C.E. YagoubThe school of Information Technology and Engineering University of Ottawa Ottawa, Ontario,myagoub@site.uottawa.caAbstract—Amindmapisadiagramusedtorepresentwords, ideas,orotheritemslinkedtoandarrangedaroundacentral keywordoridea.Mindmapsareusedtogenerate,visualize, structure, and classify ideas, and as an aid in organization, study, projectmanagement,problemsolving,decisionmaking,and writing.Ithasbeenlongusedinbrainstormingandasan effective educational tool. There are numerous tools in themarket, eitheras freeware or as proprietarysoftware,thathelpusersgeneratemindmaps.. However,thesetoolsaremoreofmindmap“editing”toolsto helpusersprojecttheirideasfromtheirmindsintothetool mapping space. These tools also provide a comprehensive library of image