# WEB SCRAPER USING PYTHON 3

#### WEB SCRAPING
It is defined as the extraction of information from the web by exploiting patterns visible in a page or site. Here, a pattern is a similarity in the markup surrounding the text we want to scrape (extract).
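
As a minimal sketch of what "pattern" means here, the snippet below pulls every element that shares one tag and class out of a small, made-up HTML string. It uses Python's built-in `html.parser` so that it runs without extra packages. The sample markup, the `title` class, and the `ClassTextCollector` name are illustrative assumptions, not part of the pages scraped later.

```python
from html.parser import HTMLParser

# Hypothetical sample markup: the items we want share the same tag and class.
SAMPLE_HTML = """
<ul>
  <li><span class="title">First headline</span> <span class="meta">2017</span></li>
  <li><span class="title">Second headline</span> <span class="meta">2018</span></li>
</ul>
"""

class ClassTextCollector(HTMLParser):
    """Collect the text of every <span> whose class equals `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.inside = False   # True while we are inside a matching <span>
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == self.wanted:
            self.inside = True
            self.found.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.found[-1] += data

collector = ClassTextCollector("title")
collector.feed(SAMPLE_HTML)
titles = collector.found
print(titles)
```

The shared `class="title"` attribute is the pattern: once it is known, every matching item can be collected in one pass.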

This file has two parts: Part 1 scrapes data on Trump's lies, and Part 2 scrapes data on Sachin's centuries.

### PART 1

This part extracts Trump's lies from the New York Times page (https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html).

In [78]:
# First, import the requests library to fetch the web page (if it is not installed, run "pip3 install requests" in a terminal)
import requests
page = requests.get("https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html") # page is a Response object
print(page.text[0:1000])

<!DOCTYPE html>
<!--[if (gt IE 9)|!(IE)]> <!--><html lang="en" class="no-js page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" itemid="https://www.nytimes.com/interactive/2017/06/23/opinion/trumps-lies.html" itemtype="http://schema.org/NewsArticle" itemscope xmlns:og="http://opengraphprotocol.org/schema/"><!--<![endif]-->
<!--[if IE 9]> <html lang="en" class="no-js ie9 lt-ie10 page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
<!--[if IE 8]> <html lang="en" class="no-js ie8 lt-ie10 lt-ie9 page-interactive section-opinion page-theme-standard tone-opinion page-interactive-default limit-small layout-xlarge app-interactive" xmlns:og="http://opengraphprotocol.org/schema/"> <![endif]-->
<!--[if (lt IE 8)]> <html lang="en" class="no-js lt-ie10 lt-ie9 lt-ie8 page-intera

In [79]:
# Importing BeautifulSoup from bs4
from bs4 import BeautifulSoup
htmlSoup = BeautifulSoup(page.text, "html.parser")
# htmlSoup

In [80]:
results = htmlSoup.find_all('span', attrs = {'class' : 'short-desc'})
len(results)
print(results[-1]) # Check the last tag to confirm it was scraped correctly

<span class="short-desc"><strong>Nov. 11 </strong>“I'd rather have him  – you know, work with him on the Ukraine than standing and arguing about whether or not  – because that whole thing was set up by the Democrats.” <span class="short-truth"><a href="https://www.nytimes.com/interactive/2017/12/10/us/politics/trump-and-russia.html" target="_blank">(There is no evidence that Democrats "set up" Russian interference in the election.)</a></span></span>


In [81]:
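records = []
for result in results:
    date = result.find('strong').text[0:-1]+', 2017'  # drop the trailing space, append the year
    lie = result.contents[1][1:-2]                    # strip the curly quotes and the trailing space
    urlText = result.find('a').text[1:-1]             # strip the surrounding parentheses
    urlLink = result.find('a')['href']                # link to the supporting article
    records.append((date, lie, urlText, urlLink))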

# print(records)
print(records[-1])

('Nov. 11, 2017', "I'd rather have him  – you know, work with him on the Ukraine than standing and arguing about whether or not  – because that whole thing was set up by the Democrats.", 'There is no evidence that Democrats "set up" Russian interference in the election.', 'https://www.nytimes.com/interactive/2017/12/10/us/politics/trump-and-russia.html')
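
The fixed-offset slices in the loop are easier to follow on plain strings. The sketch below applies the same three slices to text shaped like the last tag printed earlier. The date and the reason are copied from that output, while the quoted statement is a shortened stand-in.

```python
# Raw pieces as they come out of one <span class="short-desc"> tag.
raw_date = "Nov. 11 "                          # text of the <strong> tag, note the trailing space
raw_lie = "“An example quoted statement.” "    # shortened stand-in for the quoted text
raw_reason = '(There is no evidence that Democrats "set up" Russian interference in the election.)'

date = raw_date[0:-1] + ", 2017"   # drop the trailing space, then append the year
lie = raw_lie[1:-2]                # drop the opening quote, the closing quote and the trailing space
reason = raw_reason[1:-1]          # drop the surrounding parentheses

print(date)
print(lie)
print(reason)
```

These slices depend on the exact whitespace and punctuation around each piece, so they would need adjusting if the page markup changed.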


In [82]:
import pandas as pd
df = pd.DataFrame(records, columns = ['Date', 'Lie', 'Reason', 'URL'])
df['Date'] = pd.to_datetime(df['Date'])
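
To see what the date conversion does for a single value, here is a small sketch using the standard library's `datetime.strptime` instead of pandas. The explicit format string is an assumption that matches strings shaped like `'Nov. 11, 2017'`. Month names written differently would need a different format, and `%b` follows the current locale's month abbreviations.

```python
from datetime import datetime

date_text = "Nov. 11, 2017"   # same shape as the 'Date' strings built in the loop

# %b = abbreviated month name, %d = day of the month, %Y = four-digit year
parsed = datetime.strptime(date_text, "%b. %d, %Y")

print(parsed.date())
```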

In [83]:
df.to_csv("trump_lies.csv", index = False)
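
The scraped text contains commas and double quotes, so those fields have to be quoted when written to CSV. `DataFrame.to_csv` takes care of this with its default settings. The sketch below uses the standard library's `csv` module on an in-memory buffer to show the same idea. The record is a stand-in shaped like the tuples collected earlier, with a shortened statement.

```python
import csv
import io

# One record shaped like the scraped tuples (the statement is a shortened stand-in).
record = (
    "2017-11-11",
    "An example statement, with a comma.",
    'There is no evidence that Democrats "set up" Russian interference in the election.',
    "https://www.nytimes.com/interactive/2017/12/10/us/politics/trump-and-russia.html",
)

buffer = io.StringIO()          # in-memory stand-in for a file on disk
writer = csv.writer(buffer)     # default quoting is csv.QUOTE_MINIMAL
writer.writerow(["Date", "Lie", "Reason", "URL"])
writer.writerow(record)

print(buffer.getvalue())

# Reading the text back recovers the original fields unchanged.
buffer.seek(0)
rows = list(csv.reader(buffer))
```

Fields that contain a comma or a double quote are wrapped in quotes, and embedded double quotes are doubled, so the file can be read back without losing any text.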

### PART 2
This part extracts Sachin Tendulkar's One Day International centuries from (http://www.fastcricket.com/entry/1270/).

In [84]:
cricketPage = requests.get("http://www.fastcricket.com/entry/1270/")
# print(cricketPage.text)

In [85]:
sachinSoup = BeautifulSoup(cricketPage.text, 'html.parser')
# sachinSoup

In [86]:
results = sachinSoup.find_all('tr')
print(len(results))
print(results[-1])
# results

51
<tr>
<td><b>100</b></td>
<td style="text-align:center">113</td>
<td style="text-align:center">10</td>
<td style="text-align:center">1</td>
<td style="text-align:center">2</td>
<td>Pakistan</td>
<td>Peshawar</td>
<td>6 Feb 2006</td>
</tr>


In [87]:
results[2].contents

['\n',
 <td class="bg"><b>200*</b></td>,
 '\n',
 <td class="bg" style="text-align:center">147</td>,
 '\n',
 <td class="bg" style="text-align:center">25</td>,
 '\n',
 <td class="bg" style="text-align:center">3</td>,
 '\n',
 <td class="bg" style="text-align:center">2</td>,
 '\n',
 <td class="bg">South Africa</td>,
 '\n',
 <td class="bg">Gwalior</td>,
 '\n',
 <td class="bg"><a href="http://www.fastcricket.com/entry/1853/">24 Feb 2010</a></td>,
 '\n']

##### From the block above, it can be inferred that the even positions (odd indices) hold the useful information; the remaining entries are newline strings.
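
The odd-index observation can be checked with a plain list. In this sketch, `contents` is a simplified stand-in for `results[2].contents` in which each `<td>` tag has been replaced by its text. With real BeautifulSoup tags, `.text` would still be needed on each cell.

```python
# Simplified stand-in for results[2].contents: the <td> tags are replaced by their text.
contents = ['\n', '200*', '\n', '147', '\n', '25', '\n', '3', '\n', '2',
            '\n', 'South Africa', '\n', 'Gwalior', '\n', '24 Feb 2010', '\n']

# Start at index 1 and take every second item: indices 1, 3, 5, ...
cells = contents[1::2]
print(cells)

# Unpack the cells by name instead of indexing each one separately.
runs, balls, fours, sixes, position, against, venue, date = cells
```

Calling `find_all('td')` on the row is another way to skip the newline strings.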

In [88]:
records = []

# Skip the first two rows, which do not hold century data
for result in results[2:]:
    result = result.contents
    runs = result[1].text
    balls = result[3].text
    fours = result[5].text
    sixes = result[7].text
    position = result[9].text
    against = result[11].text
    venue = result[13].text
    records.append((runs, balls, fours, sixes, position, against, venue))

print(len(records), "Total One Day International Centuries")

49 Total One Day International Centuries


In [89]:
df1 = pd.DataFrame(records, columns = ['Runs', 'Balls', '4s', '6s', 'Position', 'Against', 'Venue'])
df1.head()


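| | Runs | Balls | 4s | 6s | Position | Against | Venue |
|---|---|---|---|---|---|---|---|
| 0 | 200* | 147 | 25 | 3 | 2 | South Africa | Gwalior |
| 1 | 186* | 150 | 20 | 3 | 2 | New Zealand | Hyd'bad |
| 2 | 175 | 141 | 19 | 6 | 2 | Australia | Hyd'bad |
| 3 | 163* | 133 | 16 | 5 | 2 | New Zealand | C'church |
| 4 | 152 | 151 | 18 | 0 | 2 | Namibia | P'sburg |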
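
The `Runs` values are still text, and a trailing asterisk marks a not-out innings. Below is a minimal sketch of splitting such a value into a number and a flag. `parse_runs` is a hypothetical helper, and it assumes the asterisk is the only non-digit character in the cell.

```python
def parse_runs(text):
    """Split a score such as '200*' into (runs, not_out)."""
    not_out = text.endswith("*")      # a trailing asterisk marks a not-out innings
    runs = int(text.rstrip("*"))      # remove the marker before converting to int
    return runs, not_out

scores = ["200*", "186*", "175", "163*", "152"]   # the 'Runs' values from the first five rows
parsed = [parse_runs(s) for s in scores]
print(parsed)
```

With the runs stored as integers, the column can be sorted or summed numerically.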


In [90]:
df1.to_csv("sachin_centuries.csv", index = False)