In [1]:
import nltk
import re

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

#nltk.download('wordnet')
#nltk.download('stopwords')

Load the articles collected by the scraper

In [2]:
articles = dict()
with open("data/articles.txt", "r") as file: #the with block closes the file once it has been read
    contents = file.readlines()

In [3]:
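for line in contents[1:]:
    title, content = line.split(";", 1) #Split on the first ";" only, so a ";" inside the content cannot break the unpacking
    articles[title] = content[1:-2] #Omit the first character and the last two: the surrounding quotation marks and the trailing newline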
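
The slice above assumes every line ends with a quotation mark followed by a newline. A minimal sketch of a more tolerant parse, assuming the same title;"content" layout as data/articles.txt. The helper name parse_line and the sample lines are made up for illustration and are not part of the notebook:

```python
def parse_line(line):
    """Split one 'title;"content"' line into (title, content)."""
    # maxsplit=1 keeps any later semicolons inside the content
    title, content = line.rstrip("\n").split(";", 1)
    # Strip the surrounding quotation marks only when both are present
    if len(content) >= 2 and content[0] == '"' and content[-1] == '"':
        content = content[1:-1]
    return title, content

# Hypothetical sample: first line is skipped, as in the loop above
sample = [
    'Title;Content\n',
    'Some Dam;"A dam; built in 1956."\n',
]
parsed = dict(parse_line(line) for line in sample[1:])
print(parsed)
```

Checking for the quotation marks instead of slicing blindly means a final line without a trailing newline does not lose a real character.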

In [4]:
print(len(articles.keys()))

1000


Function used to:
1. Remove annotations and "this article is a stub.." from the content
2. Tokenize and remove stopwords and isolated punctuation
3. Lemmatize the tokens using wordnet

In [5]:
def prepareArticle(content):
    #1.remove the annotations => [1], [23] etc., also remove "this article is a stub..."
    no_annotations = re.sub(r'\[\d+\]', '', content)
    #(?!This ) makes the match start at the last "This" before "is a stub", so earlier text is kept
    no_stubs = re.sub(r'This (?:(?!This ).)* is a stub\. You can help Wikipedia by expanding it\.', '', no_annotations)
    #2.tokenize and remove stopwords and punctuation
    words = word_tokenize(no_stubs)
    stop = set(stopwords.words("english")) #the list is lowercase, so capitalized words such as "The" are kept
    punctuation = set(",./!?#%*()[]{}:\"'\\-=+_`") | {"``", "''"} #includes the quote tokens emitted by word_tokenize
    tokenized = [word for word in words if word not in stop and word not in punctuation]
    #3.lemmatize using wordnet
    wordnet = WordNetLemmatizer()
    lemmatized = [wordnet.lemmatize(word) for word in tokenized]
    return " ".join(lemmatized)
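
The cleanup step (removing annotations and the stub notice) can be exercised on its own, without NLTK. A minimal sketch, assuming annotations may carry more than one digit and that the stub notice begins at the last "This" before "is a stub". The name strip_wiki_noise and the sample text are made up for illustration:

```python
import re

# Bracketed reference markers such as [1] or [23]
ANNOTATION = re.compile(r"\[\d+\]")
# The (?!This ) guard stops the match from spanning an earlier "This ",
# so only the stub notice itself is removed
STUB = re.compile(
    r"This (?:(?!This ).)*? is a stub\. "
    r"You can help Wikipedia by expanding it\."
)

def strip_wiki_noise(text):
    """Remove reference markers and the trailing stub notice."""
    text = ANNOTATION.sub("", text)
    return STUB.sub("", text)

sample = (
    "This river is in the U.S.[12] It flows.[3]"
    "This U.S.-related article is a stub. "
    "You can help Wikipedia by expanding it."
)
print(strip_wiki_noise(sample))
```

The sample deliberately opens with "This" to show that only the stub notice is removed and the opening sentence survives.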

Some examples of the function at work

In [6]:
print(articles["Mandraka Dam"])
print("=======================================")
print(prepareArticle(articles["Mandraka Dam"]))

Mandraka Dam is a gravity dam on the Mandraka River near Mandraka in the Analamanga Region of Madagascar. The dam was constructed by a French firm by 1956 and creates Lake Mandraka.[1]The dam supplies water to a 24 megawatts (32,000 hp) hydroelectric power station 1.9 km (1.2 mi) to the east, down in the valley. The change in elevation between the dam and power station affords a hydraulic head on 226 metres (741 ft).[2][3] The dam and power station are operated and owned by Jirama and the four 6 megawatts (8,000 hp) Pelton turbine-generators were commissioned between 1958 and 1972.[4]
Mandraka Dam gravity dam Mandraka River near Mandraka Analamanga Region Madagascar The dam constructed French firm 1956 creates Lake Mandraka.The dam supply water 24 megawatt 32,000 hp hydroelectric power station 1.9 km 1.2 mi east valley The change elevation dam power station affords hydraulic head 226 metre 741 ft The dam power station operated owned Jirama four 6 megawatt 8,000 hp Pelton turbine-genera

In [7]:
print(articles["Leptocypris taiaensis"])
print("=======================================")
print(prepareArticle(articles["Leptocypris taiaensis"]))

Leptocypris taiaensis is a species of cyprinid fish endemic to Taia River, Little Scarcies River and Waanje River in Sierre Leone.[2]This Cyprinidae-related article is a stub. You can help Wikipedia by expanding it.
Leptocypris taiaensis specie cyprinid fish endemic Taia River Little Scarcies River Waanje River Sierre Leone


In [8]:
print(articles["2014 Coates Hire Ipswich 400"])
print("=======================================")
print(prepareArticle(articles["2014 Coates Hire Ipswich 400"]))

The 2014 Coates Hire Ipswich 400 was a motor race meeting for the Australian sedan-based V8 Supercars. It was the eighth event of the 2014 International V8 Supercars Championship. It was held on the weekend of 1–3 August at the Queensland Raceway, near Ipswich, Queensland.This article related to sport in Australia is a stub. You can help Wikipedia by expanding it.
The 2014 Coates Hire Ipswich 400 motor race meeting Australian sedan-based V8 Supercars It eighth event 2014 International V8 Supercars Championship It held weekend 1–3 August Queensland Raceway near Ipswich Queensland
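
In the outputs above, capitalized words such as "The" and "It" survive the stopword filter, because the NLTK English stopword list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter, using a small hand-written stopword set as a stand-in for stopwords.words("english") so that it runs without NLTK data. The names STOP and drop_stopwords are hypothetical:

```python
# Stand-in for the NLTK list; an assumption for illustration only
STOP = {"the", "it", "was", "a", "of", "on"}

def drop_stopwords(tokens, stop=STOP):
    """Keep tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in stop]

tokens = "The dam was constructed on the river".split()
print(drop_stopwords(tokens))
```

Comparing on the lowercase form leaves the surviving tokens in their original case, so proper nouns such as "Mandraka" are unaffected.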


Convert all articles

In [9]:
converted = dict()
for title, content in articles.items():
    converted[title] = prepareArticle(content)

In [10]:
print(len(converted.keys()))

1000


Save the converted articles to data/articles.csv

In [12]:
with open("data/articles.csv", "w") as out: #the with block flushes and closes the file
    out.write("Title;Content\n")
    for title, content in converted.items():
        out.write(title+";"+content+"\n")
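
The punctuation filter in prepareArticle does not drop ";", so a converted article can still contain the delimiter. A minimal sketch of writing the same two-column layout with the standard csv module, which quotes such fields automatically. It writes to an in-memory buffer instead of data/articles.csv, and the sample dictionary is made up for illustration:

```python
import csv
import io

# Hypothetical stand-in for the converted dictionary built above
converted = {"Some Dam": "dam ; built 1956", "Plain": "text"}

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
writer.writerow(["Title", "Content"])
for title, content in converted.items():
    # Fields containing the delimiter are quoted by the writer
    writer.writerow([title, content])

# Read the data back to confirm the round trip preserves both columns
buffer.seek(0)
rows = list(csv.reader(buffer, delimiter=";"))
print(rows)
```

Reading the file back with csv.reader and the same delimiter then recovers exactly two fields per row.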