## Reading the web page into Python

The first step is to download the HTML of the arXiv listing page into Python, which we'll do using the requests library.

In [1]:
import requests

In [2]:
# download the arXiv cs.AI "pastweek" listing page
r = requests.get('https://arxiv.org/list/cs.AI/pastweek?show=124')

The code above fetches our web page from the URL, and stores the result in a "response" object called r. That response object has a text attribute, which contains the same HTML code we saw when viewing the source from our web browser:

In [3]:
# print the first 500 characters of the HTML
print(r.text[0:500])

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head>
<title>Artificial Intelligence  authors/titles recent submissions</title>
<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon" />
<link rel="stylesheet" type="text/css" media="screen" href="/css/arXiv.css?v=20190306" />
<link rel="stylesheet" type="text/css" media=
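Before relying on r.text, it can help to confirm that the request actually succeeded. The following is a minimal sketch rather than part of the original notebook: the helper name fetch_html and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising if the request fails.

    `fetch_html` and the default timeout are illustrative choices,
    not part of the original notebook.
    """
    response = requests.get(url, timeout=timeout)
    # raise requests.exceptions.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.text


# Example call (needs network access):
# html = fetch_html('https://arxiv.org/list/cs.AI/pastweek?show=124')
```

With a check like this, a failed download surfaces as an exception at the fetch step instead of as an empty result list later on.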


## Parsing the HTML using Beautiful Soup

We're going to parse the HTML using the Beautiful Soup 4 library, which is a popular Python library for web scraping.

In [4]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'html.parser')

The code above parses the HTML (stored in r.text) into a special object called soup that the Beautiful Soup library understands. In other words, Beautiful Soup is reading the HTML and making sense of its structure.

Let's ask Beautiful Soup to find all of the records. Each paper on the page is wrapped in a div tag with the class meta:

In [5]:
results = soup.find_all('div', attrs={'class':'meta'})

In [6]:
len(results)

124
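To see what find_all returns without touching the network, here is a small sketch on a hard-coded HTML string. The snippet is a hand-written imitation of the listing markup with placeholder titles, and the names demo_html, demo_soup and demo_results are this example's own.

```python
from bs4 import BeautifulSoup

# hand-written stand-in for the listing page; the titles are placeholders
demo_html = """
<div class="meta"><div class="list-title mathjax">Title: First paper</div></div>
<div class="other">not a record</div>
<div class="meta"><div class="list-title mathjax">Title: Second paper</div></div>
"""

demo_soup = BeautifulSoup(demo_html, 'html.parser')

# find_all returns a list-like collection of every matching tag
demo_results = demo_soup.find_all('div', attrs={'class': 'meta'})
print(len(demo_results))  # prints 2

# a CSS selector is another way to ask for the same tags
demo_selected = demo_soup.select('div.meta')
print(len(demo_selected))  # prints 2
```

Only the two div tags with class meta are returned; the nested title divs and the unrelated div are skipped.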

In [7]:
results[0]

<div class="meta">
<div class="list-title mathjax">
<span class="descriptor">Title:</span> Automatic Programming of Cellular Automata and Artificial Neural  Networks Guided by Philosophy
</div>
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="/search/cs?searchtype=author&amp;query=Christen%2C+P">Patrik Christen</a>, 
<a href="/search/cs?searchtype=author&amp;query=Del+Fabbro%2C+O">Olivier Del Fabbro</a>
</div>
<div class="list-comments mathjax">
<span class="descriptor">Comments:</span> 19 pages, 1 figure
</div>
<div class="list-subjects">
<span class="descriptor">Subjects:</span> <span class="primary-subject">Artificial Intelligence (cs.AI)</span>; Machine Learning (cs.LG); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE); Programming Languages (cs.PL)

</div>
</div>

In [8]:
first_result = results[0]

In [9]:
# .text starts with "\nTitle: " (8 characters) and ends with a newline; the slice removes both
first_result.find('div', attrs={'class':'list-title mathjax'}).text[8:-1]

'Automatic Programming of Cellular Automata and Artificial Neural  Networks Guided by Philosophy'

In [10]:
# skip the 10-character "\nAuthors:\n" prefix and drop the trailing newline
first_result.find('div', attrs={'class':'list-authors'}).text[10:-1]

'Patrik Christen, \nOlivier Del Fabbro'

In [11]:
first_result.find('div', attrs={'class':'list-comments mathjax'})

<div class="list-comments mathjax">
<span class="descriptor">Comments:</span> 19 pages, 1 figure
</div>

In [14]:
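# skip the 11-character "\nSubjects: " prefix; here the text ends with two newlines, so one remains
first_result.find('div', attrs={'class':'list-subjects'}).text[11:-1]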

'Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE); Programming Languages (cs.PL)\n'
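The fixed slices above depend on the exact length of each label. As an alternative sketch (not the notebook's method), the label can be located through its descriptor span and removed by name, which also copes with a field that is missing. The helper field_text is hypothetical, and the record below is hand-written, modelled on the markup shown above with the comments block deliberately left out.

```python
from bs4 import BeautifulSoup

# hand-written record modelled on the listing markup; links are simplified
demo_record_html = """<div class="meta">
<div class="list-title mathjax">
<span class="descriptor">Title:</span> Robust Goal Recognition with Operator-Counting Heuristics
</div>
<div class="list-authors">
<span class="descriptor">Authors:</span>
<a href="#">Felipe Meneguzzi</a>,
<a href="#">André Grahl Pereira</a>,
<a href="#">Ramon Fraga Pereira</a>
</div>
</div>"""


def field_text(record, css_class):
    """Return the text of one field without its label, or None if absent."""
    div = record.find('div', class_=css_class)
    if div is None:
        return None
    text = div.get_text()
    label = div.find('span', class_='descriptor')
    if label is not None:
        # drop the first occurrence of the label, e.g. "Title:"
        text = text.replace(label.get_text(), '', 1)
    # collapse runs of whitespace (including newlines) into single spaces
    return ' '.join(text.split())


demo_record = BeautifulSoup(demo_record_html, 'html.parser').find('div', class_='meta')
print(field_text(demo_record, 'list-title'))
print(field_text(demo_record, 'list-authors'))
print(field_text(demo_record, 'list-comments'))  # None: no comments block here
```

Passing a single class name such as 'list-title' to class_ matches any div that carries that class, so the extra mathjax class does not need to be spelled out.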

## Building the dataset

In [15]:
records = []
for result in results:
    # each slice skips the field label ("Title:", "Authors:", "Subjects:") and the final newline
    title = result.find('div', attrs={'class':'list-title mathjax'}).text[8:-1]
    authors = result.find('div', attrs={'class':'list-authors'}).text[10:-1]
    # comments = result.find('div', attrs={'class':'list-comments mathjax'}).text[11:-1]
    subjects = result.find('div', attrs={'class':'list-subjects'}).text[11:-1]
    records.append((title, authors, subjects))

In [16]:
len(records)

124

In [17]:
records[0:3]

[('Automatic Programming of Cellular Automata and Artificial Neural  Networks Guided by Philosophy',
  'Patrik Christen, \nOlivier Del Fabbro',
  'Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE); Programming Languages (cs.PL)\n'),
 ('Robust Goal Recognition with Operator-Counting Heuristics',
  'Felipe Meneguzzi, \nAndré Grahl Pereira, \nRamon Fraga Pereira',
  'Artificial Intelligence (cs.AI)'),
 ('AI in the media and creative industries',
  'Giuseppe Amato (CNR PISA), \nMalte Behrmann, \nFrédéric Bimbot (PANAMA), \nBaptiste Caramiaux (LRI, EX-SITU), \nFabrizio Falchi (CNR PISA), \nAnder Garcia, \nJoost Geurts (Inria), \nJaume Gibert, \nGuillaume Gravier (LinkMedia), \nHadmut Holken, \nHartmut Koenitz (HKU), \nSylvain Lefebvre (MFX), \nAntoine Liutkus (LORIA, ZENITH), \nFabien Lotte (Potioc, LaBRI), \nAndrew Perkis (NTNU), \nRafael Redondo, \nEnrico Turrin (FEP), \nThierry Vieville (Mnemosyne), \nEmmanuel
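The author and subject fields in these tuples still carry line breaks and separators from the page. A possible clean-up step, sketched here with plain string methods, turns each field into a list. The helper names split_authors and split_subjects are hypothetical; the sample strings are copied from the records shown above.

```python
# raw field values as they appear in the records above
raw_authors = 'Felipe Meneguzzi, \nAndré Grahl Pereira, \nRamon Fraga Pereira'
raw_subjects = ('Artificial Intelligence (cs.AI); Machine Learning (cs.LG); '
                'Multiagent Systems (cs.MA); Neural and Evolutionary Computing (cs.NE); '
                'Programming Languages (cs.PL)\n')


def split_authors(raw):
    """One author per line: trim whitespace and the comma that ends each line."""
    names = []
    for line in raw.split('\n'):
        name = line.strip().rstrip(',').strip()
        if name:
            names.append(name)
    return names


def split_subjects(raw):
    """Subjects are separated by semicolons; trim whitespace around each."""
    return [part.strip() for part in raw.split(';') if part.strip()]


print(split_authors(raw_authors))
print(split_subjects(raw_subjects))
```

Splitting authors on line breaks rather than on commas keeps affiliations such as "(LRI, EX-SITU)" intact, since those commas sit inside a single line.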

## Applying a tabular data structure


In [18]:
import pandas as pd
df = pd.DataFrame(records, columns=['Title', 'Authors', 'Subjects'])

In [20]:
df.head()

|   | Title | Authors | Subjects |
|---|-------|---------|----------|
| 0 | Automatic Programming of Cellular Automata and... | Patrik Christen, \nOlivier Del Fabbro | Artificial Intelligence (cs.AI); Machine Learn... |
| 1 | Robust Goal Recognition with Operator-Counting... | Felipe Meneguzzi, \nAndré Grahl Pereira, \nRam... | Artificial Intelligence (cs.AI) |
| 2 | AI in the media and creative industries | Giuseppe Amato (CNR PISA), \nMalte Behrmann, \... | Artificial Intelligence (cs.AI); Computer Visi... |
| 3 | Memory Bounded Open-Loop Planning in Large POM... | Thomy Phan, \nLenz Belzner, \nMarie Kiermeier,... | Artificial Intelligence (cs.AI) |
| 4 | Integrating Artificial Intelligence into Weapo... | Philip Feldman, \nAaron Dant, \nAaron Massey | Artificial Intelligence (cs.AI); Computers and... |


In [23]:
df

|   | Title | Authors | Subjects |
|---|-------|---------|----------|
| 0 | Automatic Programming of Cellular Automata and... | Patrik Christen, \nOlivier Del Fabbro | Artificial Intelligence (cs.AI); Machine Learn... |
| 1 | Robust Goal Recognition with Operator-Counting... | Felipe Meneguzzi, \nAndré Grahl Pereira, \nRam... | Artificial Intelligence (cs.AI) |
| 2 | AI in the media and creative industries | Giuseppe Amato (CNR PISA), \nMalte Behrmann, \... | Artificial Intelligence (cs.AI); Computer Visi... |
| 3 | Memory Bounded Open-Loop Planning in Large POM... | Thomy Phan, \nLenz Belzner, \nMarie Kiermeier,... | Artificial Intelligence (cs.AI) |
| 4 | Integrating Artificial Intelligence into Weapo... | Philip Feldman, \nAaron Dant, \nAaron Massey | Artificial Intelligence (cs.AI); Computers and... |
| 5 | Hybrid Predictive Model: When an Interpretable... | Tong Wang, \nQihang Lin | Machine Learning (cs.LG); Artificial Intellige... |
| 6 | Semi-supervised and Population Based Training ... | Oguz H. Elibol, \nGokce Keskin, \nAnil Thomas | Audio and Speech Processing (eess.AS); Artific... |
| 7 | Quantifying Teaching Behaviour in Robot Learni... | Aran Sena, \nMatthew J Howard | Robotics (cs.RO); Artificial Intelligence (cs.... |
| 8 | The Regression Tsetlin Machine: A Tsetlin Mach... | K. Darshana Abeyrathna, \nOle-Christoffer Gran... | Machine Learning (cs.LG); Artificial Intellige... |
| 9 | On the Detection of Mutual Influences and Thei... | Stefan Rudolph, \nSven Tomforde, \nJörg Hähner | Multiagent Systems (cs.MA); Artificial Intelli... |


## Exporting the dataset to a CSV file

In [24]:
df.to_csv('Parser for arxiv papers on AI.csv', index=False, encoding='utf-8')

In [27]:
# read the file back to check the export; every column is plain text, so no date parsing is needed
df = pd.read_csv('Parser for arxiv papers on AI.csv', encoding='utf-8')

In [28]:
df

|   | Title | Authors | Subjects |
|---|-------|---------|----------|
| 0 | Automatic Programming of Cellular Automata and... | Patrik Christen, \nOlivier Del Fabbro | Artificial Intelligence (cs.AI); Machine Learn... |
| 1 | Robust Goal Recognition with Operator-Counting... | Felipe Meneguzzi, \nAndré Grahl Pereira, \nRam... | Artificial Intelligence (cs.AI) |
| 2 | AI in the media and creative industries | Giuseppe Amato (CNR PISA), \nMalte Behrmann, \... | Artificial Intelligence (cs.AI); Computer Visi... |
| 3 | Memory Bounded Open-Loop Planning in Large POM... | Thomy Phan, \nLenz Belzner, \nMarie Kiermeier,... | Artificial Intelligence (cs.AI) |
| 4 | Integrating Artificial Intelligence into Weapo... | Philip Feldman, \nAaron Dant, \nAaron Massey | Artificial Intelligence (cs.AI); Computers and... |
| 5 | Hybrid Predictive Model: When an Interpretable... | Tong Wang, \nQihang Lin | Machine Learning (cs.LG); Artificial Intellige... |
| 6 | Semi-supervised and Population Based Training ... | Oguz H. Elibol, \nGokce Keskin, \nAnil Thomas | Audio and Speech Processing (eess.AS); Artific... |
| 7 | Quantifying Teaching Behaviour in Robot Learni... | Aran Sena, \nMatthew J Howard | Robotics (cs.RO); Artificial Intelligence (cs.... |
| 8 | The Regression Tsetlin Machine: A Tsetlin Mach... | K. Darshana Abeyrathna, \nOle-Christoffer Gran... | Machine Learning (cs.LG); Artificial Intellige... |
| 9 | On the Detection of Mutual Influences and Thei... | Stefan Rudolph, \nSven Tomforde, \nJörg Hähner | Multiagent Systems (cs.MA); Artificial Intelli... |
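The write-then-read round trip can be checked on a small scale. The sketch below uses two rows copied from the records earlier in the notebook and a temporary directory, so it does not touch the real CSV file. The names df_demo, demo.csv and with_index.csv, and the column labels, are this example's own.

```python
import os
import tempfile

import pandas as pd

# two rows copied from the scraped records; column labels are the example's own
df_demo = pd.DataFrame(
    [
        ('Automatic Programming of Cellular Automata and Artificial Neural  Networks Guided by Philosophy',
         'Patrik Christen, \nOlivier Del Fabbro',
         'Artificial Intelligence (cs.AI); Machine Learning (cs.LG)'),
        ('Robust Goal Recognition with Operator-Counting Heuristics',
         'Felipe Meneguzzi, \nAndré Grahl Pereira, \nRamon Fraga Pereira',
         'Artificial Intelligence (cs.AI)'),
    ],
    columns=['Title', 'Authors', 'Subjects'],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # index=False keeps the row numbers out of the file
    path = os.path.join(tmp_dir, 'demo.csv')
    df_demo.to_csv(path, index=False, encoding='utf-8')
    reloaded = pd.read_csv(path, encoding='utf-8')

    # for contrast: with the default index=True the row numbers are written
    # as an unlabeled first column, which read_csv names "Unnamed: 0"
    path_with_index = os.path.join(tmp_dir, 'with_index.csv')
    df_demo.to_csv(path_with_index, encoding='utf-8')
    reloaded_with_index = pd.read_csv(path_with_index, encoding='utf-8')

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
# the embedded line breaks in Authors survive because the field is quoted
print(reloaded['Authors'].tolist() == df_demo['Authors'].tolist())
```

Writing with index=False, as the notebook does, means the file contains only the three data columns and reads back with the same headers.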
