# Acquire the Data

## Sources of Data

We want to understand the important trends in machine learning at the moment, so we need a list of machine learning articles that people are talking about. Many sources could provide this, but we decided to pick three.

1. [Reddit.com - Machine Learning](https://www.reddit.com/r/MachineLearning/) - Reddit is a user-generated discussion forum where the community discusses recent articles and topics in machine learning.

2. [Data Tau](http://www.datatau.com/) - Data Tau is the Hacker News for machine learning. Users post articles about the latest trends in data science and machine learning and can discuss them.

3. [Twitter #machinelearning](https://twitter.com/search?q=%23machinelearning&src=typd) - We can also search Twitter for the #machinelearning tag to find the latest articles and posts about machine learning that are being discussed on social media.


## Working with Data Tau

Let us start with the Data Tau site and scrape it to acquire the data.

![](img/datatau.png)

We want to scrape the title and date of each article on this page.

In [2]:
import requests
from bs4 import BeautifulSoup 
import re
import pandas as pd

In [3]:
base_url = 'http://www.datatau.com'

## Understand the HTML Structure

In [6]:
# Let us use requests to get the url
dataTau = requests.get(base_url)

In [7]:
# Check if the page has been scraped - we should see Response 200
dataTau

<Response [200]>

In [8]:
# Let us see the text content of the page
# dataTau.content

In [9]:
# Start the beautifulsoup library and create a soup!
soup = BeautifulSoup(dataTau.content,'html.parser')

In [10]:
# See the pretty form HTML - Not so pretty though!
# print (soup.prettify())

### Get the title in each page

We have 30 articles on each page. Let us find the HTML tag and attribute that hold the title.

Inspecting the page shows that the selector we need is '`td .title`'.

![](img/title.png)

In [11]:
title_class = soup.select('td .title')

In [12]:
len(title_class)

61

We are getting roughly double the expected number. Let us see why by examining the first two elements in the list.

In [13]:
title_class[0:2]

[<td align="right" class="title" valign="top">1.</td>,
 <td class="title"><a href="https://www.youtube.com/watch?v=KeJINHjyzOU">Deep Advances in Generative Modeling</a><span class="comhead"> (youtube.com) </span></td>]

Aha - we are getting both the rank number and the title. We need to be even more specific and pick only the `<a>` link inside each cell.

In [14]:
title_class = soup.select('td .title a')

In [15]:
len(title_class)

31

Why do we get 31 and not 30 articles? Let's check.

In [16]:
title_class[0]

<a href="https://www.youtube.com/watch?v=KeJINHjyzOU">Deep Advances in Generative Modeling</a>

In [18]:
title_class[0].get_text()

'Deep Advances in Generative Modeling'

In [20]:
title_class[-1]

<a href="/x?fnid=zmYKlAHgNy" rel="nofollow">More</a>

OK, so the last link is the "More" link, which points to the next page. That is good: we can use it to get the next URL to scrape.
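The "More" link carries a relative `href`, so it has to be combined with the site address before it can be requested. The following is a minimal sketch of that step, assuming the standard-library `urljoin` is an acceptable alternative to joining the strings by hand. The `fnid` value is copied from the output above and the `item` href from the page's "discuss" links; the variable names are illustrative only.

```python
from urllib.parse import urljoin

base_url = 'http://www.datatau.com'

# href copied from the "More" link shown above
more_href = '/x?fnid=zmYKlAHgNy'

# urljoin resolves a relative href against the base address
next_url = urljoin(base_url, more_href)
print(next_url)

# It also copes with hrefs that have no leading slash,
# such as the "discuss" links on the page
item_url = urljoin(base_url, 'item?id=11973')
print(item_url)
```

Plain concatenation (`base_url + href`) gives the same result for the "More" link because its `href` starts with a slash, but it would produce a broken address for an `href` without one.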

**NOTE: Taking care of the edge cases**

When we run this on multiple pages, we find that sometimes there is more than one `<a>` link in the title. To take care of this, we rewrite the selector to pick only the first `<a>` link in each title.

In [31]:
title_class = soup.select('td .title > a:nth-of-type(1)')

In [32]:
title_class[0].get_text()

'Deep Advances in Generative Modeling'
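To see how the three selectors used so far differ, they can be run against a small hand-written snippet. This is only a sketch: the HTML below is a made-up stand-in modeled on the cells printed above (nested tables, a rank cell, a title cell, and a "More" row), not the real Data Tau page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating the structure of the listing page
html = """
<table><tr><td>
  <table>
    <tr>
      <td align="right" class="title">1.</td>
      <td class="title"><a href="https://example.com/a">First article</a>
        <span class="comhead"> (example.com) </span></td>
    </tr>
    <tr>
      <td align="right" class="title">2.</td>
      <td class="title"><a href="https://example.com/b">Second article</a>
        <a href="https://example.com/b2">extra link</a></td>
    </tr>
    <tr><td class="title"><a href="/x?fnid=abc" rel="nofollow">More</a></td></tr>
  </table>
</td></tr></table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Every element with class "title" that sits inside a <td>
all_cells = soup.select('td .title')

# Every <a> anywhere inside those cells
all_links = soup.select('td .title a')

# Only the first <a> that is a direct child of each cell
first_links = soup.select('td .title > a:nth-of-type(1)')

print(len(all_cells), len(all_links), len(first_links))
print([a.get_text() for a in first_links])
```

On this snippet the selectors match 5, 4, and 3 elements respectively: the rank cells drop out once we ask for `<a>`, and the extra link in the second row drops out once we ask for the first direct child only.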

### Get the date for each title

To get the date for each title, we need the HTML tag and class - '`td .subtext`'. The class on its own, '`.subtext`', is enough to select these cells.

![](img/date.png)

In [21]:
date_class = soup.select('.subtext')

In [22]:
date_class[0]

<td class="subtext"><span id="score_11973">6 points</span> by <a href="user?id=gwulfs">gwulfs</a> 5 hours ago  | <a href="item?id=11973">discuss</a></td>

In [24]:
date_class[0].get_text()

'6 points by gwulfs 5 hours ago  | discuss'
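The string above mixes the score, the user name, and a relative age. One possible way to split it is a regular expression, sketched below. The pattern is an assumption based only on the strings seen in this notebook ("hours ago", "day ago"); minutes are included as a guess, and other units would need to be added if the site uses them.

```python
import re
from datetime import timedelta

subtext = '6 points by gwulfs 5 hours ago  | discuss'

# Assumed layout: "<n> point(s) by <user> <n> <unit>(s) ago"
pattern = re.compile(r'(\d+) points? by (\S+) (\d+) (minute|hour|day)s? ago')

match = pattern.search(subtext)
points = int(match.group(1))
user = match.group(2)
amount = int(match.group(3))
unit = match.group(4)

# timedelta takes plural keyword names: minutes, hours, days
age = timedelta(**{unit + 's': amount})

print(points, user, age)
```

Subtracting `age` from the time of the scrape would give an approximate posting time, accurate only to the unit the site reports.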

## Automate the Scraping Process

We now write a function that scrapes one page, appends every title and date string to a dataframe, and returns the "More" link so that we can move on to the next page.

In [35]:
# Let us create an empty dataframe to store the data
df = pd.DataFrame(columns=['title','date'])
df.count()

title    0
date     0
dtype: int64

In [36]:
def get_data_from_tau(url):
    # Scrape one page, append its rows to the global df, return the "More" link
    print(url)
    dataTau = requests.get(url)
    soup = BeautifulSoup(dataTau.content,'html.parser')
    title_class = soup.select('td .title > a:nth-of-type(1)')
    date_class = soup.select('.subtext')
    print(len(title_class),len(date_class))
    # The last title link is "More", so skip it when pairing titles with dates
    for i in range(len(title_class)-1):
        df.loc[df.shape[0]] = [title_class[i].get_text(),date_class[i].get_text()]
    print('updated df with data')
    # Return the "More" <a> tag; its href points to the next page
    return title_class[-1]

In [37]:
url = base_url
for i in range(0,6):
    more_url = get_data_from_tau(url)
    url = base_url+more_url['href']

http://www.datatau.com
31 30
updated df with data
http://www.datatau.com/x?fnid=MZdGRW3odn
31 30
updated df with data
http://www.datatau.com/x?fnid=SuhXPcl1FI
31 30
updated df with data
http://www.datatau.com/x?fnid=ek8XXt9Rac
31 30
updated df with data
http://www.datatau.com/x?fnid=KdMQRvGXDC
31 30
updated df with data
http://www.datatau.com/x?fnid=brd5WnrBqZ
31 30
updated df with data
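The loop above fetches and parses in one function and writes into a global dataframe. As a sketch of an alternative layout, the parsing step can be kept separate so that it can be tried on a saved or hand-written page without touching the network. `parse_page` and the sample HTML are hypothetical names and data introduced here for illustration; the selectors are the ones used in this notebook.

```python
from bs4 import BeautifulSoup

def parse_page(html):
    """Return (rows, more_href) for one listing page given as an HTML string."""
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.select('td .title > a:nth-of-type(1)')
    dates = soup.select('.subtext')
    # zip stops at the shorter list, so the trailing "More" link
    # is never paired with a date
    rows = [(a.get_text(), d.get_text()) for a, d in zip(links, dates)]
    # Treat the last link as "More" only if there is one link beyond the dated rows
    more_href = links[-1]['href'] if len(links) > len(dates) else None
    return rows, more_href

# Made-up page with one article and a "More" row
sample = """
<table><tr><td><table>
  <tr><td class="title">1.</td>
      <td class="title"><a href="https://example.com/a">First article</a></td></tr>
  <tr><td colspan="2"></td>
      <td class="subtext">6 points by alice 5 hours ago | discuss</td></tr>
  <tr><td class="title"><a href="/x?fnid=abc" rel="nofollow">More</a></td></tr>
</table></td></tr></table>
"""

rows, more_href = parse_page(sample)
print(rows)
print(more_href)
```

The rows from every page can be collected in one list and turned into a dataframe once at the end with `pd.DataFrame(rows, columns=['title', 'date'])`, and a `more_href` of `None` gives the loop a natural place to stop.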


In [38]:
df.shape

(180, 2)

In [39]:
df.head()

|   | title | date |
|---|-------|------|
| 0 | Deep Advances in Generative Modeling | 6 points by gwulfs 5 hours ago \| discuss |
| 1 | A Neural Network in 11 lines of Python | 2 points by dekhtiar 5 hours ago \| discuss |
| 2 | Python, Machine Learning, and Language Wars | 3 points by pmigdal 7 hours ago \| discuss |
| 3 | Markov Chains Explained Visually | 11 points by zeroviscosity 1 day ago \| 1 comment |
| 4 | Dplython: Dplyr for Python | 10 points by thenaturalist 1 day ago \| 3 comm... |


In [41]:
df.to_csv('data_tau.csv', index = False)