## Text extraction

Let's scrape some Wikipedia pages in different languages and then cluster them! We use three languages here: English, French and Spanish. Because there are three languages, each one gets its own stop-word list and its own stemmer.

In [18]:
import wikipedia
import random
from random_words import RandomWords
import pandas as pd

rw = RandomWords()
lang = ['en','fr','es']

###### Building the dataset

We start with the Wikipedia page on telephone numbers, since the task asks us to extract phone numbers. Sometimes the randomly chosen word has no Wikipedia page or is ambiguous, so we have to catch the resulting exceptions and try another word.
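
The skip-and-retry idea can be looked at separately from the network calls. The sketch below is a hypothetical helper, not part of the `wikipedia` package: `fetch` stands in for something like `wikipedia.summary`, `pick_word` for `rw.random_word`, and `max_attempts` is an added safeguard (an assumption, not in the notebook) so the loop cannot spin forever.

```python
def collect_texts(fetch, pick_word, wanted, skip=(LookupError,), max_attempts=1000):
    """Collect `wanted` (word, text) pairs, skipping words whose lookup fails.

    fetch, pick_word, skip and max_attempts are illustrative names,
    not an API of the wikipedia package.
    """
    collected = []
    attempts = 0
    while len(collected) < wanted and attempts < max_attempts:
        attempts += 1
        word = pick_word()
        try:
            text = fetch(word)
        except skip:
            # Lookup failed (missing or ambiguous page): try another word.
            continue
        collected.append((word, text))
    return collected
```

In the notebook's setting, `skip` would presumably be `(wikipedia.exceptions.DisambiguationError, wikipedia.exceptions.PageError)`, the two exceptions the loop already handles.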

In [19]:
Index = ['Telephone number']
Text = []
Language = ['en']

wikipedia.set_lang('en')
word = 'Telephone number'
Sentence = wikipedia.page(word).content.replace('\n', '')
Text.append(Sentence)

s = 0
while s<49:
    while True:
        language = random.choice(lang)
        wikipedia.set_lang(language)
        word = rw.random_word()
        try:
            Sentence = wikipedia.summary(word, sentences=4)
            Index.append(word)
            Text.append(Sentence)
            Language.append(language) 
            s+=1
            break
        except wikipedia.exceptions.DisambiguationError:
            pass
        except wikipedia.exceptions.PageError:
            pass





In [288]:
dic = {'Text': Text,
      'Language': Language,
      'Word' : Index}

df = pd.DataFrame.from_dict(dic)

In [289]:
df

Unnamed: 0,Language,Text,Word
0,en,A telephone number is a sequence of digits ass...,Telephone number
1,fr,"Le terme stencil, qui signifie « pochoir » en ...",stencils
2,fr,Une « capabilité » ou « capacité » ou « libert...,capabilities
3,en,Diagnosis is the identification of the nature ...,diagnostics
4,en,The Savage Islands or Selvagens Islands (Portu...,salvages
5,fr,Le climat est la distribution statistique des ...,cleat
6,es,Progesterex es el nombre que fue dado a una dr...,sterilizer
7,fr,Feeling est le deuxième maxi-single de Leila s...,feeling
8,en,"A mechanic is a tradesman, craftsman, or techn...",mechanic
9,fr,L'official ou vicaire judiciaire est un juge e...,official


## Find phone numbers

In [290]:
import re

telephone_numbers = []

In [291]:
for i in range(len(df)):
    phone_numbers = re.findall(r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})',
                               df.loc[i,'Text'])
    telephone_numbers.append(phone_numbers)
    if phone_numbers:
        print("The phone numbers found on the wikipedia page '%s' :"%df.loc[i,'Word'])
        for w in phone_numbers:
            print(w)
            
df['Phone Numbers found'] = telephone_numbers

The phone numbers found on the wikipedia page 'Telephone number' :
555-1212
555-1212
776-2323
212-736-5000
867-5309
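
To make the pattern easier to read, here is a sketch of the same three alternatives written with `re.VERBOSE`, one per line. The function name `find_phone_numbers` is introduced here for illustration only. The pattern targets North American style numbers and has no word boundaries, so any run of seven digits also counts as a match.

```python
import re

PHONE_RE = re.compile(
    r"""
    \d{3}[-.\s]??\d{3}[-.\s]??\d{4}      # 10 digits, e.g. 212-736-5000
    | \(\d{3}\)\s*\d{3}[-.\s]??\d{4}     # area code in parentheses
    | \d{3}[-.\s]??\d{4}                 # 7 digits, e.g. 555-1212
    """,
    re.VERBOSE,
)

def find_phone_numbers(text):
    """Return every substring of `text` that matches PHONE_RE."""
    return PHONE_RE.findall(text)

print(find_phone_numbers("Call 212-736-5000, 555-1212 or (212) 736-5000."))
```

Because a plain digit run such as `1234567` matches too, years, identifiers or ISBN fragments in an article can show up as false positives.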


## Determining the language

Using `langdetect`, a Python port of Google's language-detection library.

In [292]:
from langdetect import detect
lang_detect = []
Test = []

In [293]:
for i in range(len(df)):
    # Call detect() once per text: langdetect is non-deterministic, so two calls can disagree
    detected_language = detect(df.loc[i,'Text'])
    lang_detect.append(detected_language)
    test = detected_language == df.loc[i,'Language']
    Test.append(test)
    print('Real Language : %s'%df.loc[i,'Language'])
    print('Detected Languages : %s'%detected_language)
    print('True if both languages are the same : %s' %test)
    print('\n')

df['Language detected'] = lang_detect
df['Same language ?'] = Test

Real Language : en
Detected Languages : en
True if both languages are the same : True


Real Language : fr
Detected Languages : fr
True if both languages are the same : True


Real Language : fr
Detected Languages : fr
True if both languages are the same : True


Real Language : en
Detected Languages : en
True if both languages are the same : True


Real Language : en
Detected Languages : en
True if both languages are the same : True


Real Language : fr
Detected Languages : fr
True if both languages are the same : True


Real Language : es
Detected Languages : es
True if both languages are the same : True


Real Language : fr
Detected Languages : fr
True if both languages are the same : True


Real Language : en
Detected Languages : en
True if both languages are the same : True


Real Language : fr
Detected Languages : fr
True if both languages are the same : True


Real Language : en
Detected Languages : en
True if both languages are the same : True


Real Language : es
Detected Lang
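
Rather than reading the comparison row by row, the result can be condensed into a single agreement rate. This is a minimal sketch on plain lists; `agreement_rate` is a name introduced here, and the sample values are made up for illustration.

```python
def agreement_rate(expected, detected):
    """Fraction of positions where the detected language equals the expected one."""
    if len(expected) != len(detected):
        raise ValueError("both lists must have the same length")
    if not expected:
        return 0.0
    matches = sum(e == d for e, d in zip(expected, detected))
    return matches / len(expected)

# Made-up sample: three of four detections agree.
print(agreement_rate(["en", "fr", "es", "fr"], ["en", "fr", "es", "en"]))
```

On the notebook's DataFrame, the boolean column should give the same figure with `df['Same language ?'].mean()`.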

## Part of speech tagging

In [294]:
from nltk.tag import pos_tag

proper_nouns = []

In [295]:
for i in range(len(df)):
    text = df.loc[i,'Text']
    tagged_sent = pos_tag(text.split(), lang=df.loc[i,'Language'])  
    propernouns = [word.replace("«",'').replace("»",'').replace('==','').replace('=','') for word,pos in tagged_sent if pos == 'NNP']
    print(*propernouns[:5], sep='\n')
    proper_nouns.append(propernouns)
    
df['Proper Nouns'] = [' '.join(liste) for liste in proper_nouns]

(PSTN)
PSTN
Lowell,
Massachusetts,
Concept
Le


En
Selon
Une




Diagnosis
Savage
Selvagens
Islands
(Portuguese:
Ilhas
Le
Il
L'étude
La
Progesterex
Internet
Tagged,
Facebook,
Bebo,
Leila
Rephlex
Records,
Weather.
Les
Auto
De
Il
être
établi
Le

Automatic
Writing
álbum
Ataxia,
Record
Fifties)
Gregorian
January
December
World
El
Sol
(S/)[2]​
PEN)
Perú
Cork
(irlandés:
Corcaigh,
República
Irlanda,
A
However,
L'amusie
L'amusie
être
L'amusie
être
Rain
Earth.
October
Calendars
Roman
October
Latin
Intellectual
Property
(alias
Dark
Mind
El
Platte,
Chato[1]​
Platte
River)?,
La
(/)
Se
(sin
Véase
Thought
Thinking
Their
("doping")
Se
(palabra
Jan
Baptista
Helmont
La
médaille
Fields
Abel
Elle
(or
Earth,
Earth's
Sun,
Moon
Solder
North
America
Middle
English
Tommy
Pistol,
Queens
à
New
SI

A
K
(374
Ship
Le
Dernier
Navire
Québec

Roman
Chinese
England
Un
Es
La
El
Antony
Charles
Robert
Armstrong-Jones,
à
"more",
"less",
Quantity
(as
Religious
Mediodía-Pirineos,
Tarn
Garona,
Castelsarrasin
y
(also
Las
Escu

In [296]:
df.tail()

Unnamed: 0,Language,Text,Word,Phone Numbers found,Language detected,Same language ?,Proper Nouns
45,fr,Le côt N est un cépage de cuve noir français. ...,cot,[],fr,True,"Le N Il N. C'est Cahors, Argentine Chili. Ori..."
46,es,El Sikorsky SH-60 / MH-60 Seahawk es un helicó...,lamps,[],es,True,El Sikorsky SH-60 / MH-60 Seahawk Sikorsky Air...
47,es,Texas o Tejas[2]​ es uno de los cincuenta esta...,texts,[],es,True,"Texas Tejas[2]​ Washington D. C., Estados Unid..."
48,en,"In physics, a wave is an oscillation accompani...",wave,[],en,True,Frequency Wave
49,fr,Une magnéto d'allumage est une génératrice tra...,magnetos,[],fr,True,Une Elle André-Marie Ampère XIXe électrique ét...


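Be careful with the NNP tags on the French (and Spanish) texts! NLTK's `pos_tag` only provides English and Russian models, so these texts appear to be tagged with the English model, and sentence-initial words such as Le, Il, La or El come out as proper nouns.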
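
One cheap way to reduce this noise is to post-filter the candidates: trim the punctuation that `text.split()` leaves attached and drop capitalised function words. The sketch below is a heuristic, not a real French or Spanish tagger, and the word lists are small hand-picked samples (an assumption); `nltk.corpus.stopwords` could supply fuller lists.

```python
import string

# Hand-picked function words per language, for illustration only.
FUNCTION_WORDS = {
    "fr": {"le", "la", "les", "il", "elle", "une", "un", "en", "selon", "de"},
    "es": {"el", "la", "las", "los", "un", "una", "se", "es", "y"},
    "en": {"a", "the", "however", "their"},
}

# Characters trimmed from both ends of a token, including French quotes.
TRIM_CHARS = string.punctuation + "«»"

def clean_proper_nouns(tokens, language):
    """Trim punctuation and drop tokens that are known function words."""
    ignored = FUNCTION_WORDS.get(language, set())
    cleaned = []
    for token in tokens:
        word = token.strip(TRIM_CHARS)
        if word and word.lower() not in ignored:
            cleaned.append(word)
    return cleaned

print(clean_proper_nouns(["Le", "Lowell,", "(PSTN)", "Il", "Québec"], "fr"))
```

This only removes the most obvious false positives; words such as "être" that the English model mislabels would still need a tagger trained on the right language.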

## Remove irrelevant words

In [297]:
from nltk.corpus import stopwords

# One stop-word set per language code used in df['Language']
stopwords_by_lang = {'en': set(stopwords.words('english')),
                     'fr': set(stopwords.words('french')),
                     'es': set(stopwords.words('spanish'))}

Filtered = []

In [298]:
for i in range(len(df)):
    text = df.loc[i,'Text']
    language_stopwords = stopwords_by_lang[df.loc[i,'Language']]
    # Drop stop words, then strip French quotation marks and wiki heading markers
    filtered = [word.replace("«",'').replace("»",'').replace('==','') for word in text.split() if word.lower() not in language_stopwords]
    Filtered.append(filtered)
    
df['Filtered'] = [' '.join(liste) for liste in Filtered]

In [299]:
df

Unnamed: 0,Language,Text,Word,Phone Numbers found,Language detected,Same language ?,Proper Nouns,Filtered
0,en,A telephone number is a sequence of digits ass...,Telephone number,"[555-1212, 555-1212, 776-2323, 212-736-5000, 8...",en,True,"(PSTN) PSTN Lowell, Massachusetts, Concept Whe...",telephone number sequence digits assigned fixe...
1,fr,"Le terme stencil, qui signifie « pochoir » en ...",stencils,[],fr,True,Le En Selon,"terme stencil, signifie pochoir anglais, pre..."
2,fr,Une « capabilité » ou « capacité » ou « libert...,capabilities,[],fr,True,"Une Amartya Sen, la L’approche Martha Nu...",capabilité capacité liberté substantielle...
3,en,Diagnosis is the identification of the nature ...,diagnostics,[],en,True,Diagnosis,Diagnosis identification nature cause certain ...
4,en,The Savage Islands or Selvagens Islands (Portu...,salvages,[],en,True,Savage Selvagens Islands (Portuguese: Ilhas Se...,Savage Islands Selvagens Islands (Portuguese: ...
5,fr,Le climat est la distribution statistique des ...,cleat,[],fr,True,Le Il L'étude La,climat distribution statistique conditions l'a...
6,es,Progesterex es el nombre que fue dado a una dr...,sterilizer,[],es,True,"Progesterex Internet Tagged, Facebook, Bebo, y...",Progesterex nombre dado droga inexistente menc...
7,fr,Feeling est le deuxième maxi-single de Leila s...,feeling,[],fr,True,"Leila Rephlex Records, Weather. Les Il Feeling,",Feeling deuxième maxi-single Leila sorti 1998 ...
8,en,"A mechanic is a tradesman, craftsman, or techn...",mechanic,[],en,True,Auto,"mechanic tradesman, craftsman, technician uses..."
9,fr,L'official ou vicaire judiciaire est un juge e...,official,[],fr,True,De Il être établi Le,L'official vicaire judiciaire juge ecclésiasti...
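
Splitting on whitespace keeps punctuation attached to the words, so a stop word followed by a comma (for example `est,`) is not recognised and survives the filter. A possible refinement, sketched here with a regular-expression tokenizer and a tiny illustrative stop-word set (both are assumptions, not the notebook's code):

```python
import re

# Illustrative subset only; the notebook uses NLTK's full lists.
SAMPLE_STOPWORDS_FR = {"le", "la", "qui", "en", "est", "ou", "une"}

# A token is a run of word characters, optionally joined by - or apostrophes.
TOKEN_RE = re.compile(r"\w+(?:[-'’]\w+)*")

def filter_tokens(text, stopwords):
    """Tokenize without punctuation, then drop stop words (case-insensitive)."""
    return [tok for tok in TOKEN_RE.findall(text) if tok.lower() not in stopwords]

sample = "Le terme stencil, qui signifie « pochoir » en anglais, est,"
print(filter_tokens(sample, SAMPLE_STOPWORDS_FR))
```

Hyphenated words such as "maxi-single" and elided forms such as "L'official" stay in one piece with this pattern, while the guillemets and trailing commas disappear.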


## Shrink the vector space

In [300]:
from nltk.stem import SnowballStemmer
from nltk.stem.snowball import FrenchStemmer

# One stemmer per language code used in df['Language']
stemmers = {'en': SnowballStemmer('english'),
            'fr': FrenchStemmer(),
            'es': SnowballStemmer('spanish')}

Stemmed = []

In [301]:
for i in range(len(df)):
    text = df.loc[i,'Filtered']
    stemmer = stemmers[df.loc[i,'Language']]
    stemmed = [stemmer.stem(word) for word in text.split()]
    Stemmed.append(stemmed)
    print('#'*75)
    print('The word %s is in the %s language' %(df.loc[i,'Word'],df.loc[i,'Language']))
    print(*stemmed[:5], sep = '\n')
    
df['Stemmed'] = [' '.join(liste) for liste in Stemmed]

###########################################################################
The word Telephone number is in the en language
telephon
number
sequenc
digit
assign
###########################################################################
The word stencils is in the fr language
term
stencil,
signif
pochoir
anglais,
###########################################################################
The word capabilities is in the fr language
capabl
capac
libert
substantiel
est,
###########################################################################
The word diagnostics is in the en language
diagnosi
identif
natur
caus
certain
###########################################################################
The word salvages is in the en language
savag
island
selvagen
island
(portuguese:
###########################################################################
The word cleat is in the fr language
climat
distribu
statist
condit
l'atmospher
###########################################################

## Cluster documents into logical groups

In [275]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
 
documents = df['Stemmed']
 

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
 
true_k = 8
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)
 
print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; releases before 1.0 only offer get_feature_names()
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :5]:
        print(' %s' % terms[ind])

Cluster = []

for i in range(len(df)):
    Y = vectorizer.transform([df.loc[i,'Stemmed']])
    prediction = model.predict(Y)
    Cluster.append(prediction)
    #print(df.loc[i,'Word'],prediction)

df['Cluster'] = Cluster

Top terms per cluster:
Cluster 0:
 octob
 solder
 island
 metal
 conduct
Cluster 1:
 may
 mechanics
 highlin
 public
 human
Cluster 2:
 number
 mile
 use
 quantiti
 official
Cluster 3:
 le
 cet
 mar
 prix
 en
Cluster 4:
 2007
 personaj
 internet
 red
 meat
Cluster 5:
 water
 vapor
 wave
 liquid
 temperatur
Cluster 6:
 sol
 mineral
 gas
 molecul
 sh
Cluster 7:
 rio
 un
 amus
 sigl
 cork




In [276]:
df.head()

Unnamed: 0,Language,Text,Word,Phone Numbers found,Language detected,Same language ?,Proper Nouns,Filtered,Stemmed,Cluster
0,en,A telephone number is a sequence of digits ass...,Telephone number,"[555-1212, 555-1212, 776-2323, 212-736-5000, 8...",en,True,"(PSTN) PSTN Lowell, Massachusetts, Concept Whe...",telephone number sequence digits assigned fixe...,telephon number sequenc digit assign fixed-lin...,[2]
1,fr,"Le terme stencil, qui signifie « pochoir » en ...",stencils,[],fr,True,Le En Selon,"terme stencil, signifie pochoir anglais, pre...","term stencil, signif pochoir anglais, prend di...",[2]
2,fr,Une « capabilité » ou « capacité » ou « libert...,capabilities,[],fr,True,"Une Amartya Sen, la L’approche Martha Nu...",capabilité capacité liberté substantielle...,"capabl capac libert substantiel est, suiv défi...",[3]
3,en,Diagnosis is the identification of the nature ...,diagnostics,[],en,True,Diagnosis,Diagnosis identification nature cause certain ...,diagnosi identif natur caus certain phenomenon...,[2]
4,en,The Savage Islands or Selvagens Islands (Portu...,salvages,[],en,True,Savage Selvagens Islands (Portuguese: Ilhas Se...,Savage Islands Selvagens Islands (Portuguese: ...,savag island selvagen island (portuguese: ilha...,[0]


In [283]:
df[df['Cluster'] == 5]

Unnamed: 0,Language,Text,Word,Phone Numbers found,Language detected,Same language ?,Proper Nouns,Filtered,Stemmed,Cluster
17,en,Rain is liquid water in the form of droplets t...,rains,[],en,True,Rain Earth.,Rain liquid water form droplets condensed atmo...,rain liquid water form droplet condens atmosph...,[5]
31,en,In physics a vapor (American) or vapour (Briti...,vapor,[],en,True,A K (374,physics vapor (American) vapour (British) subs...,physic vapor (american) vapour (british) subst...,[5]
33,en,Canals are human-made channels for water conve...,canals,[],en,True,,"Canals human-made channels water conveyance, s...","canal human-mad channel water conveyance, serv...",[5]
48,en,"In physics, a wave is an oscillation accompani...",wave,[],en,True,Frequency Wave,"physics, wave oscillation accompanied transfer...","physics, wave oscil accompani transfer energy....",[5]


## Produce a basic analysis of the result

Interesting: the words may be random, but K-means still found a cluster, mixing several languages, that relates to climate and the environment. The words in it are skiers, canals and meat.

Then there is a more physics-related cluster, with words like rains, vapor, canals and wave.
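
Whether a cluster really mixes languages can be checked by counting languages per cluster. A minimal sketch on plain lists follows; `languages_per_cluster` is a name introduced here and the sample labels are made up. The notebook stores each cluster label as a one-element array, so the labels would first need unpacking, for example with `int(c[0])`.

```python
from collections import Counter, defaultdict

def languages_per_cluster(languages, clusters):
    """Map each cluster label to a {language: count} dict."""
    summary = defaultdict(Counter)
    for language, cluster in zip(languages, clusters):
        summary[cluster][language] += 1
    return {cluster: dict(counts) for cluster, counts in summary.items()}

# Made-up labels: cluster 5 is English only, cluster 3 mixes French and Spanish.
print(languages_per_cluster(["en", "fr", "en", "es"], [5, 3, 5, 3]))
```

A cluster whose counts are dominated by one language may reflect shared vocabulary of that language rather than a shared topic.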

## Summary

It is hard to cluster Wikipedia articles into distinct topics. Not all the climate-related articles ended up in the right cluster.

Since this code fetches different random words on every run, I hope this Jupyter notebook still shows the analysis for the data set that was scraped in the first place.
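
Part of this run-to-run variation can be controlled with a seeded random generator. The sketch below only covers the language choice; `pick_languages` is a hypothetical helper, and whether `random_words` can be seeded the same way is not verified here. Wikipedia content also changes over time, so saving the scraped DataFrame is the more reliable way to freeze a data set.

```python
import random

def pick_languages(seed, count, languages=("en", "fr", "es")):
    """Draw `count` language codes reproducibly from a private seeded generator."""
    rng = random.Random(seed)  # independent of the global random state
    return [rng.choice(languages) for _ in range(count)]

# The same seed gives the same sequence on every call.
print(pick_languages(42, 5) == pick_languages(42, 5))
```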