# INFO 2950 Discussion Section 2

### Part 1: Getting Data
For your project you will need to collect data. There are many useful datasets freely available; do not try to reinvent the wheel!

First and foremost: Cornell Library's [list of databases](https://newcatalog.library.cornell.edu/databases) has data on almost any subject you are interested in. It also provides subscriptions that give students access to otherwise unavailable material. There is also [a list of dataset ideas](https://docs.google.com/document/d/1UbrWP8y4R9QgrytLdQz7KRc8bmNGh4csYNdiNnK9nfs/edit#heading=h.rj95q2rlptpz) in the INFO 2950 Student Handbook.

You should first try to find pre-collected data relevant to your interests. If this fails, many companies provide open access to their data via [APIs](https://www.howtogeek.com/343877/what-is-an-api/).

If that fails then we can turn to web-scraping. Keep in mind, many companies attempt to prevent web-scraping. If you encounter this issue, it will likely be difficult to succeed and you should consider looking for a different data source, and/or reframing your research question(s).

The following cell installs two Python packages useful for web scraping. (If you already have them installed, the code will produce a message that "Requirement already satisfied".) **If either package is newly installed by this cell, you may need to close Jupyter and restart it in order to use the libraries.**

In [None]:
# install requests and beautiful soup 
import sys
!conda install --yes --prefix {sys.prefix} requests
!conda install --yes --prefix {sys.prefix} bs4

In [1]:
import requests # package for http requests
import bs4 # package for html parsing
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Let's illustrate an example of web-scraping by downloading the [Wikipedia article for Web Scraping](https://en.wikipedia.org/wiki/Web_scraping) using the Python [requests](https://requests.readthedocs.io/en/master/) package.

In [2]:
wikipedia_web_scraping = 'https://en.wikipedia.org/wiki/Web_scraping'
requests.get(wikipedia_web_scraping)

<Response [200]>

A response of `<Response [200]>` indicates that we have received what we asked for. If there is another number (such as `404`), then there was likely an error. A list of HTTP response codes can be found [here](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).
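When scraping from a script, it can help to check the status code in code rather than reading it off the screen. With `requests`, the integer code is available as the `status_code` attribute of the response, and `raise_for_status()` raises an exception for 4xx and 5xx codes. The sketch below uses only the standard library to show one way of interpreting a code; the helper name `describe_status` is made up for this illustration.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of an HTTP status code (hypothetical helper)."""
    phrase = HTTPStatus(code).phrase   # e.g. 'OK' or 'Not Found'
    if 200 <= code < 300:
        kind = "success"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase}: {kind}"

print(describe_status(200))  # 200 OK: success
print(describe_status(404))  # 404 Not Found: client error
```

In a notebook you could pass `wikiResponse.status_code` to a helper like this once the response object exists.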

The Python response object contains all the data the server would normally send a browser, including the contents of the website. Here we are interested in the data contained within the `text` attribute:

In [3]:
wikiResponse = requests.get(wikipedia_web_scraping)
wikiResponse.text[:2000]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Web scraping - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"a7832dc2-0c08-4dfa-ba0b-1214e3e211a3","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Web_scraping","wgTitle":"Web scraping","wgCurRevisionId":977675232,"wgRevisionId":977675232,"wgArticleId":2696619,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 Danish-language sources (da)","CS1 French-language sources (fr)","Articles with short description","Short description matches Wikidata","Articles needing addition

This data is not exactly what we were looking for. It is raw HTML, which is meant to be read by computers; we want to parse out the human-readable text. For this, we use the Python package [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/).

In [4]:
## parse the raw wiki HTML into a BeautifulSoup object
soup = bs4.BeautifulSoup(wikiResponse.text, 'html.parser')
## let's see what it looks like by looking specifically
## at the text attribute of the soup object
print(soup.text[100:500])

yclopedia



Jump to navigation
Jump to search
Data scraping used for extracting data from websites
This article needs additional citations for verification. Please help improve this article by adding citations to reliable sources. Unsourced material may be challenged and removed.Find sources: "Web scraping" – news · newspapers · books · scholar · JSTOR (June 2017) (Learn how and when to remove th


The `.text` attribute of our `BeautifulSoup` object has excessive white space and more text than just the article body that we really want. Luckily, HTML has structure that we can take advantage of to isolate the specific information we want. HTML stores the information displayed on the webpage in tags. For instance, hyperlinks are encoded in `a` tags:

```
<a href="www.google.com">This text will be hyperlinked to the specified URL.</a>
```

The basic tag is `<a>`. Notice that it has a matching end tag: `</a>`. The text between the tags is what actually displays on the webpage. This tag has an attribute `href`, which defines the URL to which the text links. If you're not very familiar with HTML, here is [a basic guide](https://www.w3schools.com/html/html_basic.asp). This [comprehensive reference of HTML tags](https://www.w3schools.com/tags/) may also be useful.
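As a small illustration of how these pieces map onto Beautiful Soup, the sketch below parses a tiny hand-written snippet (not taken from any real page) and reads the tag name, the `href` attribute, and the text between the tags. The variable names `snippet` and `link` are chosen for this example only.

```python
import bs4

# A tiny hand-written HTML snippet, used only for illustration
html = '<p>Search with <a href="www.google.com">this link</a>.</p>'
snippet = bs4.BeautifulSoup(html, 'html.parser')

link = snippet.find('a')   # first <a> tag in the snippet
print(link.name)           # tag name: a
print(link['href'])        # attribute value: www.google.com
print(link.text)           # text between the tags: this link
```

Attributes are read with dictionary-style indexing on the tag, while the displayed text comes from the `.text` attribute.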

To extract only the specific text we want from the Wikipedia webpage, we need to inspect the raw HTML to figure out how the data we want is stored (e.g. in which tags or which attributes of tags). One way to do this is to open the page in a web browser, right-click anywhere on the page, and select 'Inspect' or 'View Page Source'. In some web browsers, like Google Chrome, as you mouse over various tags in the raw HTML, the corresponding element is highlighted in the displayed page. Keep in mind that if the webpage uses JavaScript to load or modify content dynamically, what you see in the browser tool may not match the data received via the `requests` package. In that case, dump the static version of the web page downloaded via `requests` to an `.html` file on your computer, and then inspect it in a browser as above.

Let's dump the scraped text into a file that we can open to inspect:

In [6]:
wiki_file = 'wikipedia_web_scrape.html'

with open(wiki_file, mode='w', encoding='utf-8') as f:
    f.write(wikiResponse.text)

You can also open the raw HTML file in a text editor and use the 'Find' tool (CTRL/CMD + F) to help figure out how data of interest is stored.

After inspecting this saved file, it looks like the main text of the web page is within paragraph tags (`<p>`) inside the `<body>` tag.

In [7]:
tags = soup.body.find_all('p')  # find_all is the current name; findAll is a legacy alias
for pTag in tags[2:3]:
    print(pTag.text)

Web scraping is used for contact scraping, and as a component of applications used for web indexing, web mining and data mining, online price change monitoring and price comparison, product review scraping (to watch the competition), gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence and reputation, web mashup and, web data integration.



We can also save this text as a `pandas` dataframe to make it easier to analyse down the line:

In [8]:
paragraphs = []
for pTag in tags:
    paragraphs.append(pTag.text)
## convert the list to pandas
df = pd.DataFrame({'paragraph' : paragraphs})
df.head()

|   | paragraph |
|---|-----------|
| 0 | Web scraping, web harvesting, or web data extr... |
| 1 | Web scraping a web page involves fetching it a... |
| 2 | Web scraping is used for contact scraping, and... |
| 3 | Web pages are built using text-based mark-up l... |
| 4 | Newer forms of web scraping involve listening ... |
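
Once the paragraphs are in a dataframe, ordinary column operations apply. The following is a minimal sketch using a short, hypothetical list of strings standing in for scraped paragraphs; the names `sample_paragraphs`, `sample_df`, and `n_words` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical stand-in for a list of scraped paragraph strings
sample_paragraphs = [
    "Web scraping extracts data from websites.\n",
    "\n",
    "Parsing turns raw HTML into structured text.\n",
]
sample_df = pd.DataFrame({'paragraph': sample_paragraphs})

# Strip surrounding whitespace and drop paragraphs that are now empty
sample_df['paragraph'] = sample_df['paragraph'].str.strip()
sample_df = sample_df[sample_df['paragraph'] != ''].reset_index(drop=True)

# Add a word count for each remaining paragraph
sample_df['n_words'] = sample_df['paragraph'].str.split().str.len()
print(sample_df)
```

Scraped paragraph lists often contain whitespace-only entries, so a cleaning step like this is a common first move before analysis.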


### Part 2: NumPy Crash Course

NumPy is a Python package fundamental for data science. Refer to the [documentation](https://numpy.org/doc/stable/) if you get stuck, and take a look at the [quickstart tutorial](https://numpy.org/doc/stable/user/quickstart.html) to help develop your skills.

The main array object used in NumPy is the `ndarray`. NumPy arrays are stored in one contiguous block of memory, which lets your computer process them much faster than Python lists (speedups of up to 50x are often quoted, though the actual gain depends on the operation and the data). An `ndarray` supports integers, booleans, floats, strings, and more, but unlike Python lists, all entries of an `ndarray` must have the same data type.
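
To see the single-data-type rule in action, the sketch below builds arrays from mixed inputs and checks what NumPy does with them. The variable names are made up for this illustration.

```python
import numpy as np

mixed_numbers = np.array([1, 2.5, 3])   # ints and a float
print(mixed_numbers.dtype)              # float64: the ints were converted to floats

mixed_text = np.array([1, 'a'])         # an int and a string
print(mixed_text.dtype.kind)            # U: every entry is now a Unicode string
print(mixed_text[0] == '1')             # True: the integer 1 became the string '1'
```

NumPy silently picks a type that can hold every entry, so it is worth checking `.dtype` after building an array from messy data.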



In [9]:
arr1 = np.array([1,7,5,2,4])

In [10]:
arr1.dtype

dtype('int64')

In [11]:
arr2 = np.array([2.1,3,4,1,5])

In [12]:
arr2.dtype

dtype('float64')

NumPy arrays support basic element-wise arithmetic:

In [13]:
arr1+arr2

array([ 3.1, 10. ,  9. ,  3. ,  9. ])

In [14]:
arr1*2

array([ 2, 14, 10,  4,  8])

In [15]:
arr1**2

array([ 1, 49, 25,  4, 16])

and conditionals:

In [16]:
arr1>=5

array([False,  True,  True, False, False])

Indexing the original array with this boolean array returns only the elements for which the condition is `True`:

In [17]:
arr1[arr1>=5]

array([7, 5])
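
Boolean masks can also be combined and used for assignment. The sketch below re-creates the same values as `arr1` under a new name (`arr`, chosen for this example) so that it runs on its own.

```python
import numpy as np

arr = np.array([1, 7, 5, 2, 4])   # same values as arr1

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
in_range = (arr >= 2) & (arr < 7)
print(in_range)        # [False False  True  True  True]
print(arr[in_range])   # [5 2 4]

# A mask can also be used to assign to the selected elements
capped = arr.copy()
capped[capped >= 5] = 5
print(capped)          # [1 5 5 2 4]
```

Python's `and` and `or` keywords do not work element-wise on arrays, which is why the `&` and `|` operators are used here.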

NumPy arrays can be multidimensional and support matrix multiplication with the `@` operator:

In [18]:
arr3= np.array([[1,1],[0,1]])

In [19]:
arr3

array([[1, 1],
       [0, 1]])

In [20]:
arr3 @ arr3

array([[1, 2],
       [0, 1]])
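
Note that `@` is different from `*`, which multiplies element by element. The short sketch below contrasts the two on the same values as `arr3` (re-created under the name `m` so the block is self-contained).

```python
import numpy as np

m = np.array([[1, 1], [0, 1]])   # same values as arr3

elementwise = m * m    # multiplies matching entries
matrix_prod = m @ m    # rows-times-columns matrix product

print(elementwise)     # [[1 1]
                       #  [0 1]]
print(matrix_prod)     # [[1 2]
                       #  [0 1]]
print(np.array_equal(matrix_prod, np.matmul(m, m)))  # True
```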

In [21]:
arr3.shape

(2, 2)

In [22]:
arr3.size

4

NumPy array elements can be accessed and modified directly using an index (keep in mind Python indices start at 0):

In [23]:
arr3[1,0] = 4

In [24]:
arr3

array([[1, 1],
       [4, 1]])
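
The same indexing syntax extends to whole rows and columns. The following is a minimal sketch, again re-creating the starting values of `arr3` under the name `m`; `backup` is a hypothetical name used to show that assignment changes the array in place.

```python
import numpy as np

m = np.array([[1, 1], [0, 1]])   # same starting values as arr3
backup = m.copy()                # independent copy, unaffected by later changes

print(m[0])       # first row: [1 1]
print(m[:, 1])    # second column: [1 1]
print(m[-1, -1])  # last row, last column: 1

m[1, 0] = 4           # modifies m in place
print(m[1, 0])        # 4
print(backup[1, 0])   # 0
```

Because assignment modifies the array itself, take a `.copy()` first if you need to keep the original values.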