# The Challenge/Goal:
## Given a set of Wikipedia articles, the challenge is to build a model to identify the primary topic of each article and analyze the content and structure of that article. The second part of the project will focus on conducting similar analyses of a group of Wikipedia articles. After topic modeling is complete for the corpus, I will cluster the articles based on their content and recommend similar articles based on those primary topics.


Installing dependencies and relevant packages and libraries..

In [1]:
import wikipedia as wiki
import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import gensim
from gensim.corpora.dictionary import Dictionary
from collections import Counter, defaultdict
import matplotlib.pyplot as plt
import itertools
import re
import pandas as pd
from gensim.models.tfidfmodel import TfidfModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# PART 1: 
# Analyzing one Wikipedia article

In [2]:
page = wiki.page(wiki.random(1))

In [3]:
print(page.title)

Status of First Nations treaties in British Columbia


In [4]:
print(page.url)

https://en.wikipedia.org/wiki/Status_of_First_Nations_treaties_in_British_Columbia


In [5]:
print(page.images)

['https://upload.wikimedia.org/wikipedia/commons/c/cc/Aboriginal_War_Veterans_monument_%28close%29.JPG', 'https://upload.wikimedia.org/wikipedia/commons/f/fc/Maple_Leaf_%28from_roundel%29.svg', 'https://upload.wikimedia.org/wikipedia/commons/d/de/Nisgaa_mask_Louvre_MH_81-22-1.jpg', 'https://upload.wikimedia.org/wikipedia/commons/a/aa/Sir_James_Douglas.jpg', 'https://upload.wikimedia.org/wikipedia/commons/5/5b/Tsawwassen.png']


In [6]:
print(page.links)

['1969 White Paper', 'Abbotsford, British Columbia', 'Aboriginal English in Canada', 'Aboriginal rights', 'Aboriginal title', 'Acho Dene Koe First Nation', 'Adams Lake Indian Band', 'Agreement Respecting a New Relationship Between the Cree Nation and the Government of Quebec', 'Ahousaht First Nation', 'Aitchelitz Band', 'Alberta', 'Alert Bay', 'Alexandria First Nation', 'Alexis Creek First Nation', 'Alkali Lake Indian Band', 'Americanist phonetic notation', 'Ashcroft Indian Band', 'Atlin Country', 'Attorney General of Canada v. Lavell', 'BC Liberal Party', 'BC Treaty Process', 'Bella Bella, British Columbia', 'Bella Coola, British Columbia', 'Blueberry River First Nations', 'Bonaparte Indian Band', 'Boothroyd Indian Band', 'Boston Bar Indian Band', 'Bridge River Country', 'Bridge River Indian Band', 'British Columbia', 'British Columbia Coast', 'British Columbia Interior', 'British Columbia Treaty Commission', 'British Columbia Treaty Process', 'British Columbia aboriginal treaty refer

In [7]:
print(page.summary)

The status of the First Nations, Aboriginal people of British Columbia, Canada, is a long-standing problem that has become a major issue in recent years.  In 1763 the British Crown declared that only it could acquire land from First Nations through treaties.  Historically only two treaties were signed with the First Nations of BC.  The first of which was the Douglas Treaties, negotiated by Sir James Douglas with the native people of southern Vancouver Island from 1850-1854.  The second treaty, Treaty 8, signed in 1899 was part of the Numbered Treaties that were signed with First Nations outside of British Columbia.  British Columbian Treaty 8 signatories are located in the Peace River Country or the far North East of BC.  For over nine decades no more treaties were signed with First Nations of BC; many Native people wished to negotiate treaties, but successive BC provincial governments refused until the 1990s.  A major development was the 1997 decision of the Supreme Court of Canada in

In [8]:
print(page.content)

The status of the First Nations, Aboriginal people of British Columbia, Canada, is a long-standing problem that has become a major issue in recent years.  In 1763 the British Crown declared that only it could acquire land from First Nations through treaties.  Historically only two treaties were signed with the First Nations of BC.  The first of which was the Douglas Treaties, negotiated by Sir James Douglas with the native people of southern Vancouver Island from 1850-1854.  The second treaty, Treaty 8, signed in 1899 was part of the Numbered Treaties that were signed with First Nations outside of British Columbia.  British Columbian Treaty 8 signatories are located in the Peace River Country or the far North East of BC.  For over nine decades no more treaties were signed with First Nations of BC; many Native people wished to negotiate treaties, but successive BC provincial governments refused until the 1990s.  A major development was the 1997 decision of the Supreme Court of Canada in

In [9]:
tokens = word_tokenize(page.content)

In [10]:
lower_tokens = [t.lower() for t in tokens]

In [11]:
# Creating an initial bag of words.
bag = Counter(lower_tokens)

In [12]:
print(bag)

Counter({'the': 65, 'of': 34, 'treaty': 34, '.': 29, ',': 28, 'in': 22, 'first': 21, 'to': 17, 'nations': 16, 'process': 15, 'british': 14, 'bc': 14, 'and': 14, 'a': 13, 'treaties': 11, 'columbia': 10, 'that': 10, '==': 10, 'with': 9, 'was': 8, 'aboriginal': 6, 'signed': 6, 'are': 6, 'have': 6, 'commission': 6, 'canada': 5, 'it': 5, 'for': 5, '2009': 5, 'people': 4, 'is': 4, 'has': 4, 'crown': 4, 'only': 4, 'from': 4, 'were': 4, 'when': 4, 'government': 4, 'there': 4, 'on': 4, 'nation': 4, '$': 4, 'million': 4, '``': 4, 'xeni': 4, "gwet'in": 4, 'their': 4, 'land': 3, 'through': 3, 'which': 3, 'by': 3, 'native': 3, 'outside': 3, 'negotiate': 3, 'provincial': 3, 'governments': 3, 'court': 3, 'title': 3, 'canadian': 3, '1992': 3, 'as': 3, 'negotiations': 3, 'negotiation': 3, '%': 3, 'referendum': 3, 'would': 3, 'status': 2, 'major': 2, '1763': 2, 'could': 2, 'douglas': 2, '8': 2, 'part': 2, 'or': 2, 'over': 2, 'no': 2, 'more': 2, ';': 2, 'many': 2, 'refused': 2, 'still': 2, 'may': 2, 'be'

### Clean up and preprocessing to remove non-alphabetic characters, stop words, and to lemmatize.

In [13]:
# Removing non-alphabetic characters.
alpha_only = [t for t in lower_tokens if t.isalpha()]

In [14]:
# Removing English stop words.
english_stops = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn', '']
no_stops = [t for t in alpha_only if t not in english_stops]

In [15]:
# Lemmatizing.
word_lemm = WordNetLemmatizer()
lemm = [word_lemm.lemmatize(t) for t in no_stops]

In [16]:
# Creating an updated and improved bag of words.
newbag = Counter(lemm)
print(newbag.most_common(10))

[('treaty', 45), ('first', 21), ('nation', 20), ('process', 15), ('british', 14), ('bc', 14), ('columbia', 10), ('government', 7), ('aboriginal', 6), ('signed', 6)]


You can see the most common words in this particular article and glean key points to aid in topic identification. For this particular article--which was "Status of First Nations treaties in British Columbia"--terms such as 'treaty', 'nation', 'process', 'government', etc. make perfect sense.

# PART 2:
# Analyzing multiple articles

## Gensim

In [21]:
multiple_pages = wiki.random(100)

In [22]:
titles = []
for t in multiple_pages:
    try:
        titles.append(wiki.page(t).title)
    except wiki.DisambiguationError as e:
        titles.append(wiki.page(e.options[0]).title) 
        continue

In [23]:
print(len(titles))

100


In [24]:
contentofpages= []
for c in multiple_pages:
    try:
        contentofpages.append(wiki.page(c).content)
    except wiki.DisambiguationError as e:
        contentofpages.append(wiki.page(e.options[0]).content) 
        continue

In [25]:
print(len(contentofpages))

100


In [26]:
clean_articles = []

for article in contentofpages:
    tokens = word_tokenize(article)
    lower_tokens = [t.lower() for t in tokens]
    alpha_only = [t for t in lower_tokens if t.isalpha()]
    no_stops = [t for t in alpha_only if t not in english_stops]
    lemm = [word_lemm.lemmatize(t) for t in no_stops]
    clean_articles.append(lemm)

In [27]:
dictionary = Dictionary(clean_articles)

In [28]:
# Creating a gensim corpus.
corpus = [dictionary.doc2bow(article) for article in clean_articles]

### Gensim Bag-of-Words

In [29]:
total_word_count = defaultdict(int)
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count

In [30]:
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True) 

In [31]:
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)

first 144
also 123
system 121
writing 99
team 97


### TF-IDF

In [32]:
tfidf = TfidfModel(corpus)

In [33]:
# As an example, we'll use the 2nd article to calculate the significant terms.
tfidf_weights = tfidf[corpus[1]]
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

In [34]:
# The top 5 weighted words for that particular article.
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)

goapele 0.4785326687587307
dawn 0.20508542946802744
drumma 0.20508542946802744
honey 0.20508542946802744
want 0.20508542946802744


You can see that both the bag of words looks across the entire dataset to find the most common words. The TF-IDF method, on the other hand, allows us to focus on one particular article to determine how the terms in it are represented across the rest of the dataset.

**Depending on our interests and use cases, you can see how both methods could be valuable.

### Clustering (k-means)

Now, we will use k-means clustering to provide recommendations to the reader/user for additional articles with similar topics. 

In [35]:
def tokenize_and_stem(text):
    for article in contentofpages:
        tokens = word_tokenize(article)
        lower_tokens = [t.lower() for t in tokens]
        alpha_only = [t for t in lower_tokens if t.isalpha()]
        no_stops = [t for t in alpha_only if t not in english_stops]
        stems = [word_lemm.lemmatize(t) for t in no_stops]
        return stems

def tokenize_only(text):
    for article in contentofpages:
        tokens = word_tokenize(article)
        lower_tokens = [t.lower() for t in tokens]
        alpha_only = [t for t in lower_tokens if t.isalpha()]
        filtered_tokens = [t for t in alpha_only if t not in english_stops]
        return filtered_tokens

In [36]:
totalvocab_stemmed =[]
totalvocab_tokenized = []

for i in contentofpages:
    allwords_stemmed = tokenize_and_stem(i) #for each item in 'synopses', tokenize/stem
    totalvocab_stemmed.extend(allwords_stemmed) #extend the 'totalvocab_stemmed' list
    
    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)

In [37]:
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print('There are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame.')

There are 99800 items in vocab_frame.


In [38]:
#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(contentofpages) #fit the vectorizer to the content

print(tfidf_matrix.shape)

(100, 8876)


In [39]:
terms = tfidf_vectorizer.get_feature_names()

In [40]:
dist = 1 - cosine_similarity(tfidf_matrix)

In [41]:
km = KMeans(n_clusters=5)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()

In [42]:
cluster_articles = {'title': titles, 'text': contentofpages, 'cluster': clusters}

frame = pd.DataFrame(cluster_articles, index = [clusters] , columns = ['title', 'text', 'cluster'])

In [43]:
frame['cluster'].value_counts() #number of articles per cluster (clusters from 0 to 4)

1    72
0    11
3     6
2     6
4     5
Name: cluster, dtype: int64

Finally, we can see the title of each article found in each cluster.

In [44]:
#sort cluster centers by proximity to centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1] 

for i in range(5):
    print("Wikipedia articles found in cluster %d include the following:\n" %i, end='')
    for title in frame.loc[i]['title'].values.tolist():
        print(' %s,' % title, end='')
    print() #add whitespace
    print() #add whitespace
 

Wikipedia articles found in cluster 0 include the following:
 2008–09 Swiss Super League, 1985–86 Eerste Divisie, List of 2018–19 Top 14 transfers, List of Nicholls Colonels football seasons, List of mayors of Canoas, 1985 President's Cup Football Tournament, Newbridge GAC, 2005 Kazakhstan Hockey Cup, 2014 Fed Cup Americas Zone Group I – Pool A, 1910 Connecticut Aggies football team, Hell's Kitchen (U.S. season 6),

Wikipedia articles found in cluster 1 include the following:
 Pathophysiology of asthma, Barbara Higbie, Henry Weaver House, Lark, Viktor Kirpichov, Stapfer, Pennsylvania Railroad class D14, My Family (series 9), Miss World Bermuda, Lola Marois, Liu Wen-hsiung (1954–2017), MAPPER, Paulina Radziulytė, The Pearl of Africa (film), Daman and Diu Portuguese creole, U.S. Route 7 in Connecticut, Denham Island, List of first women lawyers and judges in Tennessee, Yaroslav Cherstvy, Laughing Boy (novel), John Ramsay (magician), Oscar D. Skelton, Podcasting in India, Patricia Gibson,