### Overview

This project scrapes ***Wikipedia*** articles in order to extract and organize the information available in one of the largest and most widely used knowledge bases on the internet. Using the Python libraries requests and BeautifulSoup, it shows how to fetch a Wikipedia page, pick out its title and body text, and format the result so that it is easy to read. The notebook produces a plain text file of the article with the reference markers removed. Web scraping is a good project to get started with: you can explore a lot of data, and it gives you the chance to build practical applications on real, unstructured content. The end result helps convert raw articles into usable datasets.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [23]:
res = requests.get('https://en.wikipedia.org/wiki/Martin_Luther_King_Jr.', timeout=10)
res.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
soup = BeautifulSoup(res.text, 'html.parser')

In [24]:
print(soup.text)





Martin Luther King Jr. - Wikipedia



































Jump to content







Main menu





Main menu
move to sidebar
hide



		Navigation
	


Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us





		Contribute
	


HelpLearn to editCommunity portalRecent changesUpload file



















Search











Search






















Appearance
















Donate

Create account

Log in








Personal tools





Donate Create account Log in





		Pages for logged out editors learn more



ContributionsTalk




























Contents
move to sidebar
hide




(Top)





1
Early life and education




Toggle Early life and education subsection





1.1
Birth








1.2
Early childhood








1.3
Adolescence








1.4
Morehouse College










2
Religious education








3
Marriage and family








4
Activism and organizational leadership




Toggle Activism and organizational leadership subsection





4.1
Montgomery 

In [25]:
soup.find('h1')

<h1 class="firstHeading mw-first-heading" id="firstHeading"><span class="mw-page-title-main">Martin Luther King Jr.</span></h1>

In [26]:
heading = soup.find('h1').text

In [27]:
heading

'Martin Luther King Jr.'

In [28]:
print(soup.text.replace('\n\n', ''))

Martin Luther King Jr. - WikipediaJump to contentMain menuMain menu
move to sidebar
hide		Navigation
	
Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us		Contribute
	
HelpLearn to editCommunity portalRecent changesUpload fileSearchSearch
Appearance
DonateCreate accountLog in
Personal toolsDonate Create account Log in		Pages for logged out editors learn moreContributionsTalk
Contents
move to sidebar
hide
(Top)1
Early life and education
Toggle Early life and education subsection1.1
Birth
1.2
Early childhood
1.3
Adolescence
1.4
Morehouse College
2
Religious education
3
Marriage and family
4
Activism and organizational leadership
Toggle Activism and organizational leadership subsection4.1
Montgomery bus boycott, 1955
4.2
Southern Christian Leadership Conference
4.3
Survived knife attack, 1958
4.4
Atlanta sit-ins, prison sentence, and the 1960 elections
4.5
Albany Movement, 1961
4.6
Birmingham campaign, 1963
4.7
March on Washington, 1963
4.8
St. Augustine, Florida, 1964
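
The `replace('\n\n', '')` call above removes the blank lines, but it also glues neighbouring labels together (for example `DonateCreate accountLog in`). A minimal sketch of an alternative that keeps one line break between items, assuming only the standard-library `re` module; `collapse_blank_lines` is a hypothetical helper name, not part of the notebook:

```python
import re

def collapse_blank_lines(text: str) -> str:
    """Collapse every run of blank or whitespace-only lines into one newline."""
    # A newline, any amount of whitespace (including further newlines),
    # then one more newline: the whole run becomes a single line break.
    collapsed = re.sub(r'\n\s*\n', '\n', text)
    return collapsed.strip()

# Hypothetical sample that mimics the spacing of the page text.
sample = "Donate\n\nCreate account\n\n\n\t\nLog in\n"
print(collapse_blank_lines(sample))
```

In the notebook this could be called as `collapse_blank_lines(soup.text)`. BeautifulSoup's `get_text(separator='\n', strip=True)` is another option worth trying for the same purpose.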

In [29]:
soup.find_all('p')

[<p class="mw-empty-elt">
 </p>,
 <p class="mw-empty-elt">
 </p>,
 <p><b>Campaigns</b>
 </p>,
 <p><b>Death and memorial</b>
 </p>,
 <p><b>Martin Luther King Jr.</b> (born <b>Michael King Jr.</b>; January 15, 1929 – April 4, 1968) was an American <a class="mw-redirect" href="/wiki/Baptist" title="Baptist">Baptist</a> minister, <a class="mw-redirect" href="/wiki/Activist" title="Activist">activist</a>, and <a class="mw-redirect" href="/wiki/Political_philosopher" title="Political philosopher">political philosopher</a> who was one of the most prominent leaders in the <a href="/wiki/Civil_rights_movement" title="Civil rights movement">civil rights movement</a> from 1955 until <a href="/wiki/Assassination_of_Martin_Luther_King_Jr." title="Assassination of Martin Luther King Jr.">his assassination</a> in 1968. King advanced <a class="mw-redirect" href="/wiki/Civil_rights" title="Civil rights">civil rights</a> for <a class="mw-redirect" href="/wiki/People_of_color" title="People of color">peo

In [30]:
for p in soup.find_all('p'):
    print(p.text)
    print('-'*10)



----------


----------
Campaigns

----------
Death and memorial

----------
Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 – April 4, 1968) was an American Baptist minister, activist, and political philosopher who was one of the most prominent leaders in the civil rights movement from 1955 until his assassination in 1968. King advanced civil rights for people of color in the United States through the use of nonviolent resistance and nonviolent civil disobedience against Jim Crow laws and other forms of legalized discrimination.

----------
A black church leader, King participated in and led marches for the right to vote, desegregation, labor rights, and other civil rights.[1] He oversaw the 1955 Montgomery bus boycott and later became the first president of the Southern Christian Leadership Conference (SCLC). As president of the SCLC, he led the unsuccessful Albany Movement in Albany, Georgia, and helped organize some of the nonviolent 1963 protests in Birmingham, A
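
The output above starts with empty paragraphs (the `mw-empty-elt` elements) and short sidebar labels such as `Campaigns`. A minimal sketch for filtering them out before building the corpus, assuming the paragraph texts are already plain strings; `keep_body_paragraphs` and its `min_words` threshold are hypothetical, and the threshold is a rough heuristic rather than anything Wikipedia defines:

```python
def keep_body_paragraphs(texts, min_words=5):
    """Return stripped paragraphs that contain at least `min_words` words."""
    kept = []
    for text in texts:
        cleaned = text.strip()
        # Empty placeholders and short sidebar labels fall below the threshold.
        if len(cleaned.split()) >= min_words:
            kept.append(cleaned)
    return kept

# Shortened sample strings in the shape of the paragraphs printed above.
paragraphs = [
    "\n",
    "Campaigns\n",
    "Death and memorial\n",
    "Martin Luther King Jr. was an American Baptist minister.\n",
]
print(keep_body_paragraphs(paragraphs))
```

In the notebook the input could be `[p.text for p in soup.find_all('p')]`. A word-count cut-off can also drop genuinely short paragraphs, so the value is worth checking against the article at hand.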

In [31]:
# Join the text of every <p> element, one paragraph per line
corpus = '\n'.join(p.text for p in soup.find_all('p')).strip()

In [32]:
print(corpus)

Campaigns

Death and memorial

Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 – April 4, 1968) was an American Baptist minister, activist, and political philosopher who was one of the most prominent leaders in the civil rights movement from 1955 until his assassination in 1968. King advanced civil rights for people of color in the United States through the use of nonviolent resistance and nonviolent civil disobedience against Jim Crow laws and other forms of legalized discrimination.

A black church leader, King participated in and led marches for the right to vote, desegregation, labor rights, and other civil rights.[1] He oversaw the 1955 Montgomery bus boycott and later became the first president of the Southern Christian Leadership Conference (SCLC). As president of the SCLC, he led the unsuccessful Albany Movement in Albany, Georgia, and helped organize some of the nonviolent 1963 protests in Birmingham, Alabama. King was one of the leaders of the 1963 March on Wa

In [33]:
# Start at 1 so that the first reference marker, [1], is removed as well
for i in range(1, 467):
    corpus = corpus.replace('['+str(i)+']', '')

In [34]:
print(corpus)

Campaigns

Death and memorial

Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 – April 4, 1968) was an American Baptist minister, activist, and political philosopher who was one of the most prominent leaders in the civil rights movement from 1955 until his assassination in 1968. King advanced civil rights for people of color in the United States through the use of nonviolent resistance and nonviolent civil disobedience against Jim Crow laws and other forms of legalized discrimination.

A black church leader, King participated in and led marches for the right to vote, desegregation, labor rights, and other civil rights. He oversaw the 1955 Montgomery bus boycott and later became the first president of the Southern Christian Leadership Conference (SCLC). As president of the SCLC, he led the unsuccessful Albany Movement in Albany, Georgia, and helped organize some of the nonviolent 1963 protests in Birmingham, Alabama. King was one of the leaders of the 1963 March on Wa
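
The loop above depends on knowing the highest reference number in advance. A minimal sketch of a pattern-based alternative, assuming citation markers always look like a number in square brackets; `strip_numeric_refs` and `REF_PATTERN` are hypothetical names:

```python
import re

# One or more digits inside square brackets, e.g. [1] or [466].
REF_PATTERN = re.compile(r'\[\d+\]')

def strip_numeric_refs(text: str) -> str:
    """Remove every numeric citation marker such as [12] from the text."""
    return REF_PATTERN.sub('', text)

# Hypothetical sample sentence with single and stacked markers.
sample = "other civil rights.[1] He oversaw the boycott.[23][24]"
print(strip_numeric_refs(sample))
```

This version does not touch non-numeric markers such as `[a]`, which would need their own pattern if the article uses them.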

In [35]:
# The with block closes the file even if the write fails
with open(heading + '.txt', 'w', encoding='utf-8') as fd:
    fd.write(corpus)