# SCRAPE THE DATA FROM THE WEBSITE

In [1]:
import bs4 as bs
import urllib.request


In [2]:
source = urllib.request.urlopen("https://en.wikipedia.org/wiki/Data_science#:~:text=Data%20science%20is%20an%20interdisciplinary,broad%20range%20of%20application%20domains.").read()
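The part of the URL after `#` is a fragment, which is handled on the client side and is not sent to the server, so the page can be requested without it. Below is a minimal sketch, assuming we want to drop the fragment and send an explicit User-Agent header with a timeout. The helper names `build_request` and `fetch_html`, the header value, and the 10-second timeout are illustrative assumptions, not part of the original notebook:

```python
import urllib.request
from urllib.parse import urldefrag

PAGE_URL = ("https://en.wikipedia.org/wiki/Data_science"
            "#:~:text=Data%20science%20is%20an%20interdisciplinary")


def build_request(url):
    """Return a Request for url without its fragment (hypothetical helper)."""
    clean_url = urldefrag(url).url  # everything before the '#'
    # The User-Agent value below is an illustrative assumption.
    return urllib.request.Request(
        clean_url, headers={"User-Agent": "scraping-tutorial/0.1"}
    )


def fetch_html(url, timeout=10):
    """Download the page and return the raw response bytes (hypothetical helper)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network; fetch_html() does.
request = build_request(PAGE_URL)
print(request.full_url)
```

Calling `fetch_html(PAGE_URL)` should then return the page's HTML as bytes, which can be passed to BeautifulSoup in the same way as `source`.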

In [3]:
#Next, we create the "soup", a BeautifulSoup object built from the downloaded HTML:

In [4]:
soup = bs.BeautifulSoup(source, 'lxml')  # the 'lxml' parser requires the lxml package to be installed

In [5]:
#If you run print(soup) and print(source), the output looks much the same. The difference is that source is just the raw response data, whereas soup is a parsed object that we can navigate by tag, like so:
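To make the difference concrete, here is a small sketch on a made-up HTML snippet rather than the live page. The names `sample_source` and `sample_soup` are illustrative, and `html.parser` (bundled with Python) is used here only so the example runs without extra packages; the notebook itself uses `lxml`:

```python
import bs4 as bs

# A small stand-in for the downloaded page (made up for this example).
sample_source = (
    b"<html><head><title>Sample page</title></head>"
    b"<body><p>Hello</p></body></html>"
)

# Parse the raw bytes into a navigable tree.
sample_soup = bs.BeautifulSoup(sample_source, "html.parser")

print(type(sample_source).__name__)   # bytes
print(type(sample_soup).__name__)     # BeautifulSoup
print(sample_soup.title.string)       # Sample page
print(sample_soup.title.parent.name)  # head
```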

In [6]:
#title of the page

print(soup.title)

<title>Data science - Wikipedia</title>


In [7]:
#get the name of the tag

print(soup.title.name)


title


In [8]:
#get the text inside the tag

print(soup.title.string)

Data science - Wikipedia


In [9]:
#beginning navigation: the name of the title tag's parent

print(soup.title.parent.name)

head


In [10]:
#get the first <p> tag on the page

print(soup.p)

<p class="mw-empty-elt">
</p>
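The first `<p>` on this page is an empty placeholder, so `soup.p` is not very useful on its own. Below is a minimal sketch, on a made-up snippet, of one way to pick the first paragraph that actually contains text. The helper `first_nonempty_paragraph` is hypothetical and not part of Beautiful Soup:

```python
import bs4 as bs

# Made-up snippet that mirrors the structure seen above.
sample_html = """
<body>
<p class="mw-empty-elt">
</p>
<p><b>Data science</b> is an interdisciplinary field.</p>
</body>
"""
sample_soup = bs.BeautifulSoup(sample_html, "html.parser")


def first_nonempty_paragraph(soup):
    """Return the first <p> tag with visible text, or None (hypothetical helper)."""
    for paragraph in soup.find_all("p"):
        # get_text(strip=True) is empty for whitespace-only paragraphs.
        if paragraph.get_text(strip=True):
            return paragraph
    return None


paragraph = first_nonempty_paragraph(sample_soup)
print(paragraph.get_text())
```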


In [11]:
#Finding paragraph tags <p> is a fairly common task. In the case above, we're just finding the first one. What if we wanted to find them all?

In [12]:
print(soup.find_all('p'))

[<p class="mw-empty-elt">
</p>, <p><b>Data science</b> is an <a href="/wiki/Interdisciplinarity" title="Interdisciplinarity">interdisciplinary</a> field that uses scientific methods, processes, algorithms and systems to extract <a href="/wiki/Knowledge" title="Knowledge">knowledge</a> and insights from noisy, structured and <a href="/wiki/Unstructured_data" title="Unstructured data">unstructured data</a>,<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup><sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup> and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to <a href="/wiki/Data_mining" title="Data mining">data mining</a>, <a href="/wiki/Machine_learning" title="Machine learning">machine learning</a> and <a href="/wiki/Big_data" title="Big data">big data</a>.
</p>, <p>Data science is a "concept to unify <a href="/wiki/Statistics" title="Statistics">statistics</a>, <a h

In [13]:
#We can also iterate through them:

In [14]:
for paragraph in soup.find_all('p'):
    print(paragraph.string)  # None when the paragraph has more than one child
    print(paragraph.text)    # all text inside the tag, child tags included





None
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data,[1][2] and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data.

None
Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" in order to "understand and analyze actual phenomena" with data.[3] It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. However, data science is different from computer science and information science. Turing Award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational, and now data-driven) and asserted that "everything about science is changing beca

In [15]:
#The difference between .string and .text is that .string returns a NavigableString object, while .text returns an ordinary Python string containing all the text inside the tag. Notice that .string only works when the tag has a single child: if the paragraph mixes text with child tags such as links, .string returns None.
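The behavior of `.string` and `.text` can be checked on a small made-up snippet. This is only a sketch; the `id` values and variable names are illustrative assumptions:

```python
import bs4 as bs

sample_html = (
    "<p id='plain'>Just text</p>"
    "<p id='mixed'><b>Data science</b> is a field.</p>"
    "<p id='nested'><b>Only child</b></p>"
)
sample_soup = bs.BeautifulSoup(sample_html, "html.parser")

plain = sample_soup.find("p", id="plain")
mixed = sample_soup.find("p", id="mixed")
nested = sample_soup.find("p", id="nested")

# One text child: .string is a NavigableString.
print(repr(plain.string), type(plain.string).__name__)

# A child tag plus text (two children): .string is None, .text still works.
print(mixed.string)
print(mixed.text)

# A single child tag that itself holds one string: .string looks inside it.
print(nested.string)
```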

#Another common task is to grab links. For example:

In [17]:
for url in soup.find_all('a'):
    print(url.get('href'))

None
#mw-head
#searchInput
/wiki/Information_science
/wiki/File:PIA23792-1600x1200(1).jpg
/wiki/File:PIA23792-1600x1200(1).jpg
/wiki/Comet_NEOWISE
/wiki/Astronomical_survey
/wiki/Space_telescope
/wiki/Wide-field_Infrared_Survey_Explorer
/wiki/Machine_learning
/wiki/Data_mining
/wiki/File:Kernel_Machine.svg
/wiki/Statistical_classification
/wiki/Cluster_analysis
/wiki/Regression_analysis
/wiki/Anomaly_detection
/wiki/Data_Cleaning
/wiki/Automated_machine_learning
/wiki/Association_rule_learning
/wiki/Reinforcement_learning
/wiki/Structured_prediction
/wiki/Feature_engineering
/wiki/Feature_learning
/wiki/Online_machine_learning
/wiki/Semi-supervised_learning
/wiki/Unsupervised_learning
/wiki/Learning_to_rank
/wiki/Grammar_induction
/wiki/Supervised_learning
/wiki/Statistical_classification
/wiki/Regression_analysis
/wiki/Decision_tree_learning
/wiki/Ensemble_learning
/wiki/Bootstrap_aggregating
/wiki/Boosting_(machine_learning)
/wiki/Random_forest
/wiki/K-nearest_neighbors_algorithm
/wi

In [18]:
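#In this case, if we just grabbed the .text from each tag, we would get the anchor text, but we actually want the link itself. That's why we use .get('href'), which returns the value of the href attribute, or None for an <a> tag that has no href. Note that many of the values are relative paths rather than full URLs.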
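The link output above includes `None`, same-page anchors such as `#mw-head`, and relative paths such as `/wiki/Machine_learning`. Below is a minimal sketch of one way to turn these into absolute URLs, assuming the page's base URL is known. The snippet and the `BASE_URL` constant are made up for illustration:

```python
import bs4 as bs
from urllib.parse import urljoin

# Assumed base URL of the page the links were scraped from.
BASE_URL = "https://en.wikipedia.org/wiki/Data_science"

sample_html = """
<a id="top"></a>
<a href="#mw-head">Jump to navigation</a>
<a href="/wiki/Machine_learning">Machine learning</a>
<a href="/wiki/Data_mining">Data mining</a>
"""
sample_soup = bs.BeautifulSoup(sample_html, "html.parser")

absolute_links = []
# href=True keeps only <a> tags that actually have an href attribute.
for link in sample_soup.find_all("a", href=True):
    href = link["href"]
    if href.startswith("#"):
        continue  # skip same-page anchors
    absolute_links.append(urljoin(BASE_URL, href))

print(absolute_links)
```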

#Finally, you may just want to grab text. You can use .get_text() on a Beautiful Soup object, including the full soup:

In [19]:
print(soup.get_text())




Data science - Wikipedia









































Data science

From Wikipedia, the free encyclopedia



Jump to navigation
Jump to search
Interdisciplinary field of study focused on deriving knowledge and insights from data
Not to be confused with information science.


 The existence of Comet NEOWISE (here depicted as a series of red dots) was discovered by analyzing astronomical survey data acquired by a space telescope, the Wide-field Infrared Survey Explorer.
Part of a series onMachine learningand data mining
Problems
Classification
Clustering
Regression
Anomaly detection
Data Cleaning
AutoML
Association rules
Reinforcement learning
Structured prediction
Feature engineering
Feature learning
Online learning
Semi-supervised learning
Unsupervised learning
Learning to rank
Grammar induction

Supervised learning(classification • regression) 
Decision trees
Ensembles
Bagging
Boosting
Random forest
k-NN
Linear regression
Naive Bayes
Artificial neural networks
Logistic 

In [20]:
#Next, we can grab the links from just the nav bar:

In [21]:
nav = soup.nav

In [22]:
for url in nav.find_all('a'):
    print(url.get('href'))

/wiki/Special:MyTalk
/wiki/Special:MyContributions
/w/index.php?title=Special:CreateAccount&returnto=Data+science
/w/index.php?title=Special:UserLogin&returnto=Data+science
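`soup.nav` returns only the first `<nav>` tag, and it is `None` when the page has no such tag, in which case `nav.find_all` would raise an error. A minimal sketch of a guarded version; the helper `nav_links` and both snippets are made up for illustration:

```python
import bs4 as bs


def nav_links(soup):
    """Return hrefs from the first <nav> tag, or [] if there is none (hypothetical helper)."""
    nav = soup.nav
    if nav is None:
        return []
    return [a["href"] for a in nav.find_all("a", href=True)]


with_nav = bs.BeautifulSoup(
    '<nav><a href="/wiki/Special:MyTalk">Talk</a></nav>'
    '<nav><a href="/other">Other</a></nav>',
    "html.parser",
)
without_nav = bs.BeautifulSoup("<p>No navigation here</p>", "html.parser")

print(nav_links(with_nav))     # only links from the first <nav>
print(nav_links(without_nav))  # empty list instead of an error
```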


In [23]:
#In this case, we're grabbing the first <nav> tag that we can find (the navigation bar). You could also use soup.body to get the body section, then grab the .text of each paragraph from there:

In [24]:
body = soup.body
for paragraph in body.find_all('p'):
    print(paragraph.text)



Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data,[1][2] and apply knowledge and actionable insights from data across a broad range of application domains. Data science is related to data mining, machine learning and big data.

Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" in order to "understand and analyze actual phenomena" with data.[3] It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge. However, data science is different from computer science and information science. Turing Award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational, and now data-driven) and asserted that "everything about science is changing because of the i

In [25]:
#Finally, a page may contain several tags with the same name but different classes, and you might want to grab information only from the tags with a specific class. For example, the page we're working with has a div tag with the class "body". We can work with this data like so:

In [27]:
for div in soup.find_all('div', class_='body'):
    print(div.text)
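The keyword is spelled `class_` because `class` is a reserved word in Python. It matches any tag that has the given class among its classes, not only tags whose class attribute is exactly that value. A minimal sketch on a made-up snippet, also showing the equivalent `attrs` dictionary form:

```python
import bs4 as bs

# Made-up snippet with several div tags and different classes.
sample_html = """
<div class="body">First body</div>
<div class="sidebar">Sidebar</div>
<div class="body highlighted">Second body</div>
"""
sample_soup = bs.BeautifulSoup(sample_html, "html.parser")

# class_="body" also matches the div whose classes are "body highlighted".
by_keyword = [div.text for div in sample_soup.find_all("div", class_="body")]

# The same filter written with an attrs dictionary.
by_attrs = [div.text for div in sample_soup.find_all("div", attrs={"class": "body"})]

print(by_keyword)
print(by_attrs)
```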