In [1]:
import pandas as pd
import time

# Curated Wikia index

This notebook analyzes the checked URLs in order to:

- Remove dead wikis
- Remove redirects
- Resolve URL anomalies

In [2]:
df = pd.read_csv("../data/20180916-checked_index.csv")
df.head()

Unnamed: 0,url,redirect,redirect-short
0,http://spellmagotm.wikia.com/,http://spellmagotm.wikia.com/wiki/Spellmagotm_...,http://spellmagotm.wikia.com/
1,http://2017-monster-energy-nascar-cup-series.w...,http://2017-monster-energy-nascar-cup-series.w...,http://2017-monster-energy-nascar-cup-series.w...
2,http://10low46japreligion.wikia.com/,http://10low46japreligion.wikia.com/wiki/Ancie...,http://10low46japreligion.wikia.com/
3,http://de.bibel.wikia.com/,http://de.bibel.wikia.com/wiki/Bibel_Wiki,http://de.bibel.wikia.com/
4,http://indigo-showdown.wikia.com/,http://indigo-showdown.wikia.com/wiki/Indigo_s...,http://indigo-showdown.wikia.com/


In [3]:
len(df)

304091

In [4]:
len(df[df['redirect-short']=="NOT AVAILABLE"])

8144

In [5]:
len(df[df['redirect'].str.startswith('4')]['url'])

8047

Some wikis do not provide a canonical URL: their `redirect` column contains a 2xx status code, but their `redirect-short` column shows `NOT AVAILABLE`. These URLs must be resolved manually (there are no more than 10-15 of them).
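The two conditions described above can be combined into a single boolean mask. The following is a minimal sketch on a hypothetical three-row frame (the `example` URLs are invented for illustration), assuming, as in the checked index, that `redirect` holds either a target URL or an HTTP status code stored as a string:

```python
import pandas as pd

# Hypothetical sample mimicking the columns of the checked index.
sample = pd.DataFrame({
    "url": ["http://a.example/", "http://b.example/", "http://c.example/"],
    "redirect": ["http://a.example/wiki/Main", "404", "200"],
    "redirect-short": ["http://a.example/", "NOT AVAILABLE", "NOT AVAILABLE"],
})

# Rows answered with a 2xx status code but without a canonical URL.
needs_manual_check = sample[
    sample["redirect"].str.startswith("2")
    & (sample["redirect-short"] == "NOT AVAILABLE")
]
manual_urls = needs_manual_check["url"].tolist()
print(manual_urls)
```

Filtering on both columns at once keeps rows with a 4xx code out of the manual review list.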

In [6]:
df[df['redirect'].str.startswith('2')]['url']

4989                          http://the-first.wikia.com/
9119                       http://explained-tv.wikia.com/
12080                http://gbenson-template-1.wikia.com/
13518                               http://dnc.wikia.com/
17358                            http://lizlux.wikia.com/
25143                             http://rainb.wikia.com/
26067                         http://bibledata.wikia.com/
29036             http://the-alec-baldwin-show.wikia.com/
30226                     http://the-bold-type.wikia.com/
31561            http://fandom-taxonomy-system.wikia.com/
33760                      http://mr-inbetween.wikia.com/
40076               http://homicide-hunter-new.wikia.com/
43004                      http://my-community.wikia.com/
44042                          http://calboytv.wikia.com/
54058                    http://treasure-quest.wikia.com/
56350                       http://camping-hbo.wikia.com/
56495                           http://lodge49.wikia.com/
57333         

The `redirect-short` column contains the actual URL of each link in the Wikia Sitemap. First, we analyse repeated target URLs.

In [15]:
df[df['redirect-short'] == 'NOT AVAILABLE']

Unnamed: 0,url,redirect,redirect-short
69,http://yandere-simulator-mod.wikia.com/,404,NOT AVAILABLE
108,http://es.universograndtheftauto.wikia.com/,404,NOT AVAILABLE
114,http://he.marvelcinematicuniverse.wikia.com/,404,NOT AVAILABLE
144,http://ru.havennheart.wikia.com/,404,NOT AVAILABLE
213,http://gamerslifestudios.wikia.com/,404,NOT AVAILABLE
226,http://de.ass.wikia.com/,404,NOT AVAILABLE
316,http://disneycrossovers.wikia.com/,404,NOT AVAILABLE
394,http://pt-br.puffitopictures.wikia.com/,404,NOT AVAILABLE
566,http://es.jorgeelsalvadortic2.wikia.com/,404,NOT AVAILABLE
569,http://es.laloa.wikia.com/,404,NOT AVAILABLE


In [7]:
repetitions = df.groupby(by="redirect-short")['url'].count()
repetitions.sort_values(ascending=False)

redirect-short
NOT AVAILABLE                                                        8144
http://community.wikia.com/                                             6
http://pt-br.liberproeliis.wikia.com/                                   3
http://brosstoons.wikia.com/                                            3
http://harrypotter.wikia.com/                                           2
http://ruby-redfort.wikia.com/                                          2
http://ru.vkinfo.wikia.com/                                             2
http://ru.mario.wikia.com/                                              2
http://battlerite-royale.wikia.com/                                     2
http://sonicshow.wikia.com/                                             2
http://sus4784.wikia.com/                                               2
http://fr.lovelive.wikia.com/                                           2
http://blockfortress.wikia.com/                                         2
http://fowl-language.wi
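The counts above show how many source URLs share a target, but not which ones. A possible way to list them is sketched below on hypothetical data (the `example` URLs are invented), using `DataFrame.duplicated` with `keep=False`:

```python
import pandas as pd

# Hypothetical sample: two source urls resolve to the same target.
sample = pd.DataFrame({
    "url": ["http://old.example/", "http://new.example/", "http://solo.example/"],
    "redirect-short": ["http://new.example/", "http://new.example/", "http://solo.example/"],
})

# keep=False marks every member of a duplicated group, not only the later ones.
shared = sample[sample.duplicated(subset="redirect-short", keep=False)]

# Map each repeated target to the list of source urls pointing at it.
groups = shared.groupby("redirect-short")["url"].apply(list).to_dict()
print(groups)
```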

There are 8144 URLs marked `NOT AVAILABLE` and 6 URLs pointing to the Wikia Community wiki (`http://community.wikia.com/`). We will check whether those 6 all redirect to the same page by studying the content of the `redirect` column.

In [8]:
pd.set_option('max_colwidth',90)
df[df['redirect-short']=="http://community.wikia.com/"].groupby(by="redirect").count()

Unnamed: 0_level_0,url,redirect-short
redirect,Unnamed: 1_level_1,Unnamed: 2_level_1
http://community.wikia.com/wiki/Community_Central,2,2
http://community.wikia.com/wiki/Community_Central:Not_a_valid_community,4,4


As we can see, in addition to the "NOT AVAILABLE" wikis, we also have to remove the "dead" wikis: the URLs that point to the special Wikia page stating that the wiki is not a valid community.

We have also manually checked the rest of the wikis pointing to community.wikia, and we will update them with their actual URLs.

In [9]:
# Dead wikis: mark them so they are dropped together with the other "NOT AVAILABLE" rows.
df.loc[df['redirect']=="http://community.wikia.com/wiki/Community_Central:Not_a_valid_community",['redirect-short']]="NOT AVAILABLE"

# Manually checked wikis: keep them under their original url.
df.loc[df['redirect']=="http://community.wikia.com/wiki/Hub:Lifestyle",['redirect-short']]=df['url']
df.loc[df['redirect']=="http://community.wikia.com/wiki/Special:Chat",['redirect-short']]=df['url']
df.loc[df['redirect']=="http://community.wikia.com/wiki/Special:Forum",['redirect-short']]=df['url']
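The three assignments that restore the original URL could also be expressed as a single one with `Series.isin`. This is a minimal sketch on hypothetical rows (the `example` URLs are invented), not the notebook's own code:

```python
import pandas as pd

COMMUNITY = "http://community.wikia.com/wiki/"

# Hypothetical sample: one manually checked wiki, one dead wiki, one regular wiki.
sample = pd.DataFrame({
    "url": ["http://x.example/", "http://y.example/", "http://z.example/"],
    "redirect": [
        COMMUNITY + "Special:Chat",
        COMMUNITY + "Community_Central:Not_a_valid_community",
        "http://z.example/wiki/Main",
    ],
    "redirect-short": ["http://community.wikia.com/"] * 2 + ["http://z.example/"],
})

# Dead wikis are flagged for removal.
dead = sample["redirect"] == COMMUNITY + "Community_Central:Not_a_valid_community"
sample.loc[dead, "redirect-short"] = "NOT AVAILABLE"

# Manually checked wikis keep their original url as the target.
keep_pages = [COMMUNITY + page for page in ("Hub:Lifestyle", "Special:Chat", "Special:Forum")]
alive = sample["redirect"].isin(keep_pages)
sample.loc[alive, "redirect-short"] = sample.loc[alive, "url"]

result = sample["redirect-short"].tolist()
print(result)
```

Restricting both sides of the assignment to the same mask makes the row alignment explicit.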

In [10]:
repetitions = df.groupby(by="redirect-short")['url'].count()
repetitions.sort_values(ascending=False)

redirect-short
NOT AVAILABLE                                                        8148
http://pt-br.liberproeliis.wikia.com/                                   3
http://brosstoons.wikia.com/                                            3
http://fowl-language.wikia.com/                                         2
http://ruby-redfort.wikia.com/                                          2
http://ru.vkinfo.wikia.com/                                             2
http://ru.mario.wikia.com/                                              2
http://battlerite-royale.wikia.com/                                     2
http://sonicshow.wikia.com/                                             2
http://sus4784.wikia.com/                                               2
http://fr.lovelive.wikia.com/                                           2
http://harrypotter.wikia.com/                                           2
http://blockfortress.wikia.com/                                         2
http://matthewtheroblox

Next, we remove the "NOT AVAILABLE" wikis.

In [11]:
curatedIndex = df[df['redirect-short']!="NOT AVAILABLE"].copy()
len(curatedIndex)

295943

In [12]:
curatedIndex.drop_duplicates(subset="url", inplace=True)
len(curatedIndex)

295943

In [13]:
repetitions = curatedIndex.groupby(by="redirect-short")['url'].count()
repetitions.sort_values(ascending=False)

redirect-short
http://brosstoons.wikia.com/                               3
http://pt-br.liberproeliis.wikia.com/                      3
http://teenwolf.wikia.com/                                 2
http://madoka.wikia.com/                                   2
http://ruby-redfort.wikia.com/                             2
http://ru.vkinfo.wikia.com/                                2
http://ru.mario.wikia.com/                                 2
http://battlerite-royale.wikia.com/                        2
http://sonicshow.wikia.com/                                2
http://sus4784.wikia.com/                                  2
http://fr.lovelive.wikia.com/                              2
http://harrypotter.wikia.com/                              2
http://blockfortress.wikia.com/                            2
http://matthewtherobloxiangamerx3.wikia.com/               2
http://pvzcc.wikia.com/                                    2
http://pt-br.monkz.wikia.com/                              2
http://ax

Finally, we save only the URLs in `redirect-short`, removing duplicates.

In [14]:
timestr = time.strftime("%Y%m%d")
# Write one unique target URL per line; the context manager closes the file.
with open('../data/{}-{}.txt'.format(timestr,'curatedIndex'), 'w') as thefile:
    for item in curatedIndex['redirect-short'].unique():
        thefile.write("%s\n" % item)
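As a quick sanity check, the written file can be read back to confirm that it holds one unique URL per line. The sketch below uses a temporary directory and a hypothetical three-item series (invented `example` URLs) instead of the real index:

```python
import os
import tempfile

import pandas as pd

# Hypothetical targets containing one duplicate.
targets = pd.Series(["http://a.example/", "http://b.example/", "http://a.example/"])

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "curatedIndex.txt")
    # unique() keeps the order of first appearance.
    with open(path, "w") as handle:
        for item in targets.unique():
            handle.write("%s\n" % item)
    # Read the file back before the temporary directory is deleted.
    with open(path) as handle:
        written = handle.read().splitlines()

print(written)
```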