Web Scraping

Introduction
Welcome to this Jupyter Notebook, which focuses on extracting data from Wikipedia for educational purposes. Wikipedia offers a large, freely accessible body of information on a wide range of topics. This notebook guides you through retrieving part of that information effectively and efficiently.

Purpose
The primary aim of this notebook is to demonstrate web scraping, specifically how to methodically extract data from Wikipedia. It is intended for educators, students, and anyone who is curious, and it provides a foundational tool for gathering data for educational projects, research, and learning.

What to Expect
As we journey through this notebook, you will be introduced to the following steps:

Understanding Wikipedia's Structure: Gain insights into how information is organized on Wikipedia, making it easier to target the data you need.
Identifying Data Points: Learn how to pinpoint the specific pieces of information that are most relevant to your educational goals.
Extracting Data: Utilize Python and its powerful libraries to scrape the identified data from Wikipedia pages, focusing on techniques that are both efficient and respectful of Wikipedia's resources.

Tools and Libraries
This notebook leverages Python for its ease of use and the robust libraries it offers for web scraping, including:

Requests/BeautifulSoup: These libraries will be our primary tools for downloading Wikipedia pages and parsing their HTML content to extract the needed data.
Pandas: While our focus is not on data analysis, Pandas may still be used for its ability to structure the extracted data into a readable and manageable format.

Ethical Considerations
It's imperative to approach the task of web scraping with a conscientious mindset. We will ensure our data extraction methods are compliant with Wikipedia's usage policies and the ethical standards of web scraping. Our goal is to minimize our impact on Wikipedia's servers and respect the availability of its resources to the wider internet community.
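
One way to put these considerations into practice is to identify the scraper in its requests and to pause between them. The sketch below is an illustration rather than part of the original notebook: the `USER_AGENT` string, the helper names `make_session` and `polite_get`, and the one-second delay are all assumptions to adapt. Wikimedia's User-Agent policy asks automated clients to send a descriptive User-Agent header with contact information.

```python
import time

import requests

# Hypothetical identification string; replace it with your own project and contact.
USER_AGENT = "EducationalScraperNotebook/0.1 (contact: you@example.com)"


def make_session(user_agent=USER_AGENT):
    """Return a requests.Session that sends the same User-Agent on every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def polite_get(session, url, delay_seconds=1.0, timeout=10):
    """Fetch one URL, then pause so that consecutive calls are spaced out."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    time.sleep(delay_seconds)    # assumed delay; adjust to the site's guidance
    return response


session = make_session()
# Uncomment to fetch the page used in this notebook:
# page = polite_get(session, "https://en.wikipedia.org/wiki/List_of_companies_of_Kenya")
```

The session is created without contacting the network; a request is made only when `polite_get` is called.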

Conclusion
This notebook shows how to use Wikipedia as an educational data source. By focusing on data extraction, it gives you the knowledge and tools to collect information that supports a variety of educational objectives. Let's proceed with curiosity and respect as we extract data from one of the world's largest repositories of knowledge.

In [None]:
# Import BeautifulSoup for parsing HTML and requests for downloading pages
from bs4 import BeautifulSoup
import requests

In [2]:
# Download the page; a 200 response means the request succeeded
url = "https://en.wikipedia.org/wiki/List_of_companies_of_Kenya"
page = requests.get(url)
print(page)

<Response [200]>


In [3]:
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(page.text, "html.parser")

In [4]:
#print (soup)

In [5]:
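# Inspect the page in your browser's developer tools to find the element that holds the data you want,
# then copy its tag name or class for use in the searches below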

In [6]:
# find() returns only the first <table> on the page
finding_the_analysis_table = soup.find("table")

In [7]:
# If the page has several tables, find_all() returns all of them as a list
soup.find_all("table")

[<table class="wikitable sortable">
 <caption>Notable companies <br/>
 <span style="margin:0px; font-size:90%;"><span style="border:1px solid #000; background-color:#f9f9f9; color:#f9f9f9;">    </span> Active</span>
 <span style="margin:0px; font-size:90%;"><span style="border:1px solid #000; background-color:#f9e6ff; color:#f9e6ff;">    </span> State-owned</span>
 <span style="margin:0px; font-size:90%;"><span style="border:1px solid #000; background-color:#dddddd; color:#dddddd;">    </span> Defunct</span>
 </caption>
 <tbody><tr>
 <th>Name
 </th>
 <th>Industry
 </th>
 <th>Sector
 </th>
 <th>Headquarters
 </th>
 <th>Founded
 </th>
 <th class="unsortable">Notes
 </th></tr>
 <tr>
 <td><a href="/wiki/98.4_Capital_FM" title="98.4 Capital FM">98.4 Capital FM</a>
 </td>
 <td>Consumer services
 </td>
 <td>Broadcasting &amp; entertainment
 </td>
 <td><a href="/wiki/Nairobi" title="Nairobi">Nairobi</a>
 </td>
 <td>1996
 </td>
 <td>Radio
 </td></tr>
 <tr>
 <td><a href="/wiki/ABC_Bank_(Kenya)" 

In [8]:
# With multiple tables, pick one by its index in the list
table = soup.find_all("table")[0]

In [9]:
# From here on, work only with this table
table = soup.find_all("table")[0]
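
Picking a table by position works only while the page layout stays the same. A possible alternative, sketched below on stand-in HTML rather than the live page, is to list the classes of every table and then select by class. The `wikitable` class appears in the output above; the `infobox` table and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in HTML with two tables; the real page's markup will differ.
sample_html = """
<table class="infobox"><tr><td>Summary box</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Name</th><th>Founded</th></tr>
  <tr><td>ABC Bank</td><td>1981</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Print each table's position and CSS classes to see which one holds the data.
for index, candidate in enumerate(sample_soup.find_all("table")):
    print(index, candidate.get("class"))

# class_ matches when "wikitable" is one of the table's classes.
company_table = sample_soup.find("table", class_="wikitable")
print([th.text.strip() for th in company_table.find_all("th")])
```

Selecting by class keeps working if another table is later added above the one of interest, whereas a fixed index would then point at the wrong table.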

In [10]:
# Header cells use <th> and data cells use <td>, so they can be extracted separately
world_titles = table.find_all("th")

In [11]:
world_titles

[<th>Name
 </th>,
 <th>Industry
 </th>,
 <th>Sector
 </th>,
 <th>Headquarters
 </th>,
 <th>Founded
 </th>,
 <th class="unsortable">Notes
 </th>]

In [12]:
# Extract the text of each header cell; note the trailing newlines in the result
world_table_titles = [title.text for title in world_titles]
world_table_titles

['Name\n', 'Industry\n', 'Sector\n', 'Headquarters\n', 'Founded\n', 'Notes\n']

In [13]:
# strip() removes the surrounding whitespace, including the trailing newline
world_table_titles = [title.text.strip() for title in world_titles]
world_table_titles

['Name', 'Industry', 'Sector', 'Headquarters', 'Founded', 'Notes']

In [14]:
# The same result written as an explicit loop
titles = []
for title in world_titles:
    titles.append(title.text.strip())
titles

['Name', 'Industry', 'Sector', 'Headquarters', 'Founded', 'Notes']

In [15]:
import pandas as pd

In [16]:
pd.DataFrame(columns=world_table_titles)

Unnamed: 0,Name,Industry,Sector,Headquarters,Founded,Notes


In [17]:
# Each <tr> element in the selected table is one row
column_data = table.find_all("tr")
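
Each `<tr>` holds either header cells (`<th>`) or data cells (`<td>`). The sketch below shows one way to turn the rows of a single table into lists while checking that every data row has as many cells as there are headers. `table_to_rows` is a hypothetical helper, and the HTML is a stand-in built from values that appear in this notebook's output.

```python
from bs4 import BeautifulSoup

# Stand-in HTML: one data table followed by an unrelated table.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Name</th><th>Industry</th><th>Founded</th></tr>
  <tr><td>Bamburi Cement</td><td>Industrials</td><td>1951</td></tr>
  <tr><td>CIC Insurance Group</td><td>Financials</td><td>1968</td></tr>
</table>
<table class="navbox"><tr><td>Unrelated navigation links</td></tr></table>
"""


def table_to_rows(table):
    """Return (headers, rows) for one <table>, skipping rows without <td> cells."""
    headers = [th.text.strip() for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.text.strip() for td in tr.find_all("td")]
        if not cells:
            continue  # the header row contains only <th> cells
        if len(cells) != len(headers):
            raise ValueError(f"expected {len(headers)} cells, got {len(cells)}: {cells}")
        rows.append(cells)
    return headers, rows


sample_soup = BeautifulSoup(sample_html, "html.parser")
headers, rows = table_to_rows(sample_soup.find("table", class_="wikitable"))
print(headers)
print(rows)
```

Because the search starts from the selected table rather than from the whole document, the row from the second table never reaches the result.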


In [18]:
# Search only the selected table so rows from other tables on the page are excluded
rows = table.find_all('tr')

extracted_data = []

# Iterate over each row
for row in rows:
    # Extract the table data (td) from each row
    cells = row.find_all('td')
    extracted_row = [cell.text.strip() for cell in cells]
    #[cell.get_text(strip=True) for cell in cells]
    
    # Append the row to the extracted_data list if it's not empty
    if extracted_row:
        extracted_data.append(extracted_row)

# Display the extracted data
for row in extracted_data:
    print(row)

['98.4 Capital FM', 'Consumer services', 'Broadcasting & entertainment', 'Nairobi', '1996', 'Radio']
['ABC Bank', 'Financials', 'Banks', 'Nairobi', '1981', 'Commercial bank']
['ALS – Aircraft Leasing Services', 'Consumer services', 'Airlines', 'Nairobi', '1985', 'Regional airline']
['ARM Cement Limited', 'Industrials', 'Building materials & fixtures', 'Athi River', '1974', 'Cement, fertilizers, minerals, mining, KN: ARM']
['Bamburi Cement', 'Industrials', 'Building materials & fixtures', 'Nairobi', '1951', 'Cement, KN: BAMB']
['Carbacid Investments', 'Basic materials', 'Commodity chemicals', 'Nairobi', '1961', 'Carbon dioxide, KN: CARB']
['Chase Bank Kenya Limited', 'Financials', 'Banks', 'Nairobi', '1996', 'Commercial bank']
['CIC Insurance Group', 'Financials', 'Full line insurance', 'Nairobi', '1968', 'Insurance']
['CMC Aviation', 'Consumer services', 'Airlines', 'Nairobi', '1961[3]', 'Charter airline, defunct 2011']
['Commercial Bank of Africa', 'Financials', 'Banks', 'Nairobi', '1

In [19]:
# Build the DataFrame from the extracted rows and the header names
df = pd.DataFrame(data=extracted_data, columns=world_table_titles)

df

Unnamed: 0,Name,Industry,Sector,Headquarters,Founded,Notes
0,98.4 Capital FM,Consumer services,Broadcasting & entertainment,Nairobi,1996,Radio
1,ABC Bank,Financials,Banks,Nairobi,1981,Commercial bank
2,ALS – Aircraft Leasing Services,Consumer services,Airlines,Nairobi,1985,Regional airline
3,ARM Cement Limited,Industrials,Building materials & fixtures,Athi River,1974,"Cement, fertilizers, minerals, mining, KN: ARM"
4,Bamburi Cement,Industrials,Building materials & fixtures,Nairobi,1951,"Cement, KN: BAMB"
5,Carbacid Investments,Basic materials,Commodity chemicals,Nairobi,1961,"Carbon dioxide, KN: CARB"
6,Chase Bank Kenya Limited,Financials,Banks,Nairobi,1996,Commercial bank
7,CIC Insurance Group,Financials,Full line insurance,Nairobi,1968,Insurance
8,CMC Aviation,Consumer services,Airlines,Nairobi,1961[3],"Charter airline, defunct 2011"
9,Commercial Bank of Africa,Financials,Banks,Nairobi,1962,Commercial bank
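
The `Founded` value for CMC Aviation, `1961[3]`, still carries a citation marker from the page. A minimal sketch of stripping such markers and converting the column to integers follows. It uses a small hand-built frame with values copied from the output above; the name `sample` is illustrative.

```python
import pandas as pd

# A few rows copied from the output above.
sample = pd.DataFrame(
    {
        "Name": ["Chase Bank Kenya Limited", "CIC Insurance Group", "CMC Aviation"],
        "Founded": ["1996", "1968", "1961[3]"],
    }
)

# Remove bracketed citation markers such as "[3]".
cleaned = sample["Founded"].str.replace(r"\[\d+\]", "", regex=True)

# Convert to a nullable integer column; values that fail to parse become <NA>.
sample["Founded"] = pd.to_numeric(cleaned, errors="coerce").astype("Int64")

print(sample)
```

The same two steps could be applied to `df["Founded"]`, assuming every remaining value is a plain year once the markers are removed.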
