# **Scrape Wikipedia articles and create summaries**
- 2600 randomly selected Wikipedia articles are scraped.
- A short summary, consisting of the first few sentences of each article, is then created (later used as evidence).
- 2000 samples are then saved.

### **Imports**

In [1]:
# Functions from my other script:
from scrape_wiki import scrape_data, count_tokens, short_summary
# Other packages:
import pandas as pd

### **Scrape Wikipedia articles**
2600 randomly selected articles were scraped. This number was chosen to guarantee 2000 samples after cleaning.

In [2]:
scraped_wiki = scrape_data(2600)
scraped_wiki.to_csv("wikipedia_articles.csv", index=False)

### **Clean scraped data**

In [3]:
# Scraped data, without preprocessing
#scraped_wiki = pd.read_csv("wikipedia_articles.csv")
print(f"Shape of scraped data {scraped_wiki.shape}")
scraped_wiki.head()

Shape of scraped data (2600, 2)


| | Title | Article |
|---|---|---|
| 0 | USS SC-40 | USS SC-40, until July 1920 known as USS Submar... |
| 1 | Valentina Zenere | Valentina Zenere (born 15 January 1997) is an... |
| 2 | From M.E. to Myself | From M.E. to Myself (simplified Chinese: 和自己对话... |
| 3 | Charalampos Papaioannou | Charalampos Papaioannou (born January 4, 1971)... |
| 4 | First United Methodist Church (Aberdeen, South... | Aberdeen First United Methodist Church is a hi... |


### **Drop missing values**

In [4]:
scraped_wiki = scraped_wiki.dropna()
print(f"Shape of data without missing values {scraped_wiki.shape}")
# 2 missing values

Shape of data without missing values (2598, 2)
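
To see which column the missing values come from before dropping them, the missing entries can be counted per column. A minimal sketch on a toy frame (the `toy` data is made up for illustration and is not the scraped data):

```python
import pandas as pd

# Toy stand-in for scraped_wiki: one missing title and one missing article.
toy = pd.DataFrame({
    "Title": ["A", "B", None],
    "Article": ["text a", None, "text c"],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value.
cleaned = toy.dropna()
print(f"Shape without missing values {cleaned.shape}")
```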


### **Tokens per article**
Add new column: Number of tokens per article

In [5]:
scraped_wiki["Article tokens"] = scraped_wiki["Article"].apply(count_tokens)
scraped_wiki.head()

| | Title | Article | Article tokens |
|---|---|---|---|
| 0 | USS SC-40 | USS SC-40, until July 1920 known as USS Submar... | 337 |
| 1 | Valentina Zenere | Valentina Zenere (born 15 January 1997) is an... | 478 |
| 2 | From M.E. to Myself | From M.E. to Myself (simplified Chinese: 和自己对话... | 92 |
| 3 | Charalampos Papaioannou | Charalampos Papaioannou (born January 4, 1971)... | 43 |
| 4 | First United Methodist Church (Aberdeen, South... | Aberdeen First United Methodist Church is a hi... | 87 |
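
`count_tokens` is imported from `scrape_wiki` and is not shown in this notebook. As a rough illustration only, a token counter could be as simple as splitting on whitespace. The function name `count_tokens_sketch` and the whitespace rule are assumptions; the real function may use a different tokenizer and give different counts.

```python
def count_tokens_sketch(text: str) -> int:
    """Hypothetical stand-in for count_tokens: counts whitespace-separated tokens."""
    # str.split() with no arguments splits on any run of whitespace
    # and ignores leading/trailing whitespace.
    return len(text.split())


print(count_tokens_sketch("Shrout is an unincorporated community"))  # 5
```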


### **Short summary**
Create new column: Short summary with max 100 tokens (min 20)

In [6]:
scraped_wiki["Summary"] = scraped_wiki["Article"].apply(short_summary)
scraped_wiki.head()

| | Title | Article | Article tokens | Summary |
|---|---|---|---|---|
| 0 | USS SC-40 | USS SC-40, until July 1920 known as USS Submar... | 337 | USS SC-40, until July 1920 known as USS Submar... |
| 1 | Valentina Zenere | Valentina Zenere (born 15 January 1997) is an... | 478 | Valentina Zenere (born 15 January 1997) is an... |
| 2 | From M.E. to Myself | From M.E. to Myself (simplified Chinese: 和自己对话... | 92 | From M.E. to Myself (simplified Chinese: 和自己对话... |
| 3 | Charalampos Papaioannou | Charalampos Papaioannou (born January 4, 1971)... | 43 | Charalampos Papaioannou (born January 4, 1971)... |
| 4 | First United Methodist Church (Aberdeen, South... | Aberdeen First United Methodist Church is a hi... | 87 | Aberdeen First United Methodist Church is a hi... |
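
`short_summary` also lives in `scrape_wiki` and is not shown here. A minimal sketch of how such a function might work, assuming whitespace tokens and a naive regex sentence split (both are assumptions, and `short_summary_sketch` is a hypothetical name): whole sentences are added until the 100-token cap would be exceeded, and `None` is returned when fewer than 20 tokens are kept, which also covers the case of a first sentence that is too long.

```python
import re
from typing import Optional


def short_summary_sketch(article: str, max_tokens: int = 100,
                         min_tokens: int = 20) -> Optional[str]:
    """Hypothetical stand-in for short_summary (not the author's implementation)."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    kept = []
    total = 0
    for sentence in sentences:
        n_tokens = len(sentence.split())
        # Stop before the sentence that would push the summary over the cap.
        if total + n_tokens > max_tokens:
            break
        kept.append(sentence)
        total += n_tokens
    # Too short (including "first sentence too long", where nothing was kept).
    if total < min_tokens:
        return None
    return " ".join(kept)


# A 30-token sentence followed by an 80-token sentence: only the first fits.
example = " ".join(["word"] * 30) + ". " + " ".join(["more"] * 80) + "."
summary = short_summary_sketch(example)
print(len(summary.split()))  # 30
```

Applied with `.apply(...)`, the `None` results become missing values in the column, which is why a `dropna()` step follows.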


In [8]:
# Drop NaN summaries, created when the first sentence is too long in short_summary()
scraped_wiki = scraped_wiki.dropna()
print(f"Data now consists of {scraped_wiki.shape[0]} rows")

Data now consists of 2559 rows


In [9]:
# To check that the summaries have the right length, count the tokens in each summary
scraped_wiki["Summary tokens"] = scraped_wiki["Summary"].apply(count_tokens)
scraped_wiki.head()

| | Title | Article | Article tokens | Summary | Summary tokens |
|---|---|---|---|---|---|
| 0 | USS SC-40 | USS SC-40, until July 1920 known as USS Submar... | 337 | USS SC-40, until July 1920 known as USS Submar... | 81 |
| 1 | Valentina Zenere | Valentina Zenere (born 15 January 1997) is an... | 478 | Valentina Zenere (born 15 January 1997) is an... | 87 |
| 2 | From M.E. to Myself | From M.E. to Myself (simplified Chinese: 和自己对话... | 92 | From M.E. to Myself (simplified Chinese: 和自己对话... | 93 |
| 3 | Charalampos Papaioannou | Charalampos Papaioannou (born January 4, 1971)... | 43 | Charalampos Papaioannou (born January 4, 1971)... | 43 |
| 4 | First United Methodist Church (Aberdeen, South... | Aberdeen First United Methodist Church is a hi... | 87 | Aberdeen First United Methodist Church is a hi... | 89 |
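
Rather than checking the length bounds by eye, they can be asserted over the whole column. A small sketch on toy numbers (the values are illustrative, not taken from the dataset), using the 20 and 100 token limits stated above:

```python
import pandas as pd

# Toy stand-in for the "Summary tokens" column.
toy = pd.DataFrame({"Summary tokens": [21, 100, 57]})

# between() is inclusive on both ends by default.
in_range = toy["Summary tokens"].between(20, 100)
print(f"All summaries within bounds: {in_range.all()}")
print(toy["Summary tokens"].agg(["min", "max"]))
```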


### **Check for duplicates**
Since the articles are randomly sampled from all Wikipedia articles, the risk of having duplicates is very small. But we check it anyway!

In [10]:
num_dup = scraped_wiki.duplicated(subset=["Title"]).sum()
print(f"Number of duplicates in data: {num_dup}")

Number of duplicates in data: 0
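
Had the count been non-zero, the repeated titles could be removed with `drop_duplicates`. A minimal sketch on a toy frame (the data is made up for illustration):

```python
import pandas as pd

# Toy frame where the title "A" appears twice.
toy = pd.DataFrame({
    "Title": ["A", "B", "A"],
    "Article": ["first", "second", "third"],
})

# duplicated() marks every repeat after the first occurrence as True.
num_dup = toy.duplicated(subset=["Title"]).sum()
print(f"Number of duplicates in data: {num_dup}")

# Keep the first occurrence of each title and renumber the rows.
deduped = toy.drop_duplicates(subset=["Title"], keep="first").reset_index(drop=True)
print(deduped)
```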


### **Look at the final dataset**

In [17]:
sorted_scraped_wiki = scraped_wiki.sort_values(by='Summary tokens').reset_index(drop=True)[["Title", "Summary", "Summary tokens"]]
# Shortest and longest summaries
pd.concat([sorted_scraped_wiki.head(), sorted_scraped_wiki.tail()])

| | Title | Summary | Summary tokens |
|---|---|---|---|
| 0 | Mordellistena rayi | Mordellistena rayi is a species of beetle in t... | 21 |
| 1 | Bradley (automobile) | The Bradley was an automobile manufactured in ... | 21 |
| 2 | List of awards and nominations received by Chr... | Christopher Walken (born March 31, 1943) is an... | 22 |
| 3 | Bloc Québécois candidates in the 2015 Canadian... | This is a list of nominated candidates for the... | 23 |
| 4 | Shrout, Kentucky | Shrout is an unincorporated community located ... | 23 |
| 1995 | The Dead Authors Podcast | The Dead Authors Podcast is an improvised come... | 100 |
| 1996 | Ludwig Julius Budge | Ludwig Julius Budge (11 September 1811, in Wet... | 100 |
| 1997 | Fabian Dawkins | Fabian Dawkins (born 7 February 1981) is a Jam... | 100 |
| 1998 | Lady Bardales | Lady Bardales (born November 29, 1982) is a Pe... | 100 |
| 1999 | Taggart Siegel | Taggart Siegel is an American documentary film... | 100 |


### **Save 2000 samples**
Saving the final dataset, keeping only the relevant columns.

In [36]:
res = pd.DataFrame({"Title": scraped_wiki["Title"],
                    "Article":scraped_wiki["Article"],
                    "Summary":scraped_wiki["Summary"]})
res = res.iloc[:2000].reset_index(drop=True)
print(f"Final shape of dataset: {res.shape}")
res.to_csv("wikipedia_data.csv", index=False)
res.head()

Final shape of dataset: (2000, 3)


| | Title | Article | Summary |
|---|---|---|---|
| 0 | USS SC-40 | USS SC-40, until July 1920 known as USS Submar... | USS SC-40, until July 1920 known as USS Submar... |
| 1 | Valentina Zenere | Valentina Zenere (born 15 January 1997) is an... | Valentina Zenere (born 15 January 1997) is an... |
| 2 | From M.E. to Myself | From M.E. to Myself (simplified Chinese: 和自己对话... | From M.E. to Myself (simplified Chinese: 和自己对话... |
| 3 | Charalampos Papaioannou | Charalampos Papaioannou (born January 4, 1971)... | Charalampos Papaioannou (born January 4, 1971)... |
| 4 | First United Methodist Church (Aberdeen, South... | Aberdeen First United Methodist Church is a hi... | Aberdeen First United Methodist Church is a hi... |
