In [1]:
import pandas as pd
import time

# Curated Wikia index

This notebook analyzes the checked URLs in order to:

- Remove dead wikis
- Remove redirects
- Resolve URL anomalies

In [2]:
df = pd.read_csv("../data/20180220-checked_index.csv")
df.head()

Unnamed: 0,url,redirect,redirect-short
0,http://0ad.wikia.com/,http://0ad.wikia.com/wiki/0_A.D._Wiki,http://0ad.wikia.com/
1,http://0hourmysticknights.wikia.com/,http://0hourmysticknights.wikia.com/wiki/0_Hou...,http://0hourmysticknights.wikia.com/
2,http://0-xxii.wikia.com/,http://0-xxii.wikia.com/wiki/Main_Page,http://0-xxii.wikia.com/
3,http://00fanon.wikia.com/,http://00fanon.wikia.com/wiki/00_Fanon_Wiki,http://00fanon.wikia.com/
4,http://0002oifos.wikia.com/,http://0002oifos.wikia.com/wiki/Main_Page,http://0002oifos.wikia.com/


In [3]:
len(df)

406096

In [4]:
len(df[df['redirect-short']=="NOT AVAILABLE"])

55308

In [5]:
len(df[df['redirect'].str.startswith('4')]['url'])

55293

Some wikis do not provide a canonical URL: their `redirect` column contains a 2xx code, but their `redirect-short` column shows a NOT AVAILABLE message. These URLs must be resolved manually (no more than 10-15 links).

In [6]:
df[df['redirect'].str.startswith('2')]['url']

Series([], Name: url, dtype: object)
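The two counts above differ by a few rows (55308 marked NOT AVAILABLE versus 55293 whose `redirect` starts with `4`). One way to inspect that gap is to group the unavailable rows by the first character of `redirect`. The following is a minimal sketch on an invented toy frame; it assumes that `redirect` holds either a URL or a status-code string, and the `503` row is a hypothetical example, not a value taken from the real index.

```python
import pandas as pd

# Toy stand-in for the checked index; all rows are invented for illustration.
toy = pd.DataFrame({
    "url": ["http://a.wikia.com/", "http://b.wikia.com/",
            "http://c.wikia.com/", "http://d.wikia.com/"],
    "redirect": ["404", "503", "http://c.wikia.com/wiki/Main_Page", "410"],
    "redirect-short": ["NOT AVAILABLE", "NOT AVAILABLE",
                       "http://c.wikia.com/", "NOT AVAILABLE"],
})

unavailable = toy[toy["redirect-short"] == "NOT AVAILABLE"]

# The first character of `redirect` acts as a rough status class ('4', '5', ...).
status_class = unavailable["redirect"].astype(str).str[0]
by_class = status_class.value_counts()

# Rows that are unavailable for a reason other than a 4xx response.
non_4xx = unavailable[status_class != "4"]
```

On the real data, replacing `toy` with `df` should list the handful of rows that need manual resolution.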

The `redirect-short` column contains the actual URL that each link in the Wikia Sitemap resolves to. First, we analyse the target URLs that appear more than once.

In [7]:
repetitions = df.groupby(by="redirect-short")['url'].count()
repetitions.sort_values(ascending=False)

redirect-short
NOT AVAILABLE                                                         55308
http://community.wikia.com/                                           10381
http://es.pokemon.wikia.com/                                             12
http://es.supercampeones.wikia.com/                                      12
http://anne-happy.wikia.com/                                              6
http://martin-mystery.wikia.com/                                          6
http://yolosweg.wikia.com/                                                5
http://xhamrhyl09.wikia.com/                                              5
http://yourguide.wikia.com/                                               5
http://zh.tol.wikia.com/                                                  5
http://whenstarclangetsboredfanfic.wikia.com/                             5
http://were-a-group-stray-dog-rp.wikia.com/                               5
http://official-cup.wikia.com/                                           
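The counts show how many source URLs share a target, but not which ones. A minimal sketch for listing them, using an invented toy frame (the column names match the notebook, the rows do not come from the real index):

```python
import pandas as pd

# Toy data: two source URLs resolve to the same target.
toy = pd.DataFrame({
    "url": ["http://x.wikia.com/", "http://y.wikia.com/", "http://z.wikia.com/"],
    "redirect-short": ["http://x.wikia.com/", "http://x.wikia.com/",
                       "http://z.wikia.com/"],
})

# keep=False marks every member of a duplicated group, not only the later ones.
shared = toy[toy.duplicated(subset="redirect-short", keep=False)]

# Map each shared target to the list of source URLs that resolve to it.
sources = shared.groupby("redirect-short")["url"].apply(list)
```

Applied to `df`, this would give, for each repeated target, the sitemap entries that collapse into it.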

There are 10381 URLs pointing to the Wikia Community wiki. To see whether they all land on the same page, we examine the content of the `redirect` column.

In [8]:
pd.set_option('display.max_colwidth', 90)
df[df['redirect-short']=="http://community.wikia.com/"].groupby(by="redirect").count()

redirect,url,redirect-short
http://community.wikia.com/wiki/Community_Central,3,3
http://community.wikia.com/wiki/Community_Central:Not_a_valid_community,10375,10375
http://community.wikia.com/wiki/Hub:Lifestyle,1,1
http://community.wikia.com/wiki/Special:Chat,1,1
http://community.wikia.com/wiki/Special:Forum,1,1


As we can see, in addition to the "NOT AVAILABLE" wikis, we also have to remove the "dead" wikis: the URLs that point to the special Wikia page stating that the wiki is not a valid community.

We have also manually checked the rest of the wikis pointing to community.wikia, and we will update them with their actual URLs.

In [9]:
# Dead wikis: the redirect lands on Wikia's "Not a valid community" page
df.loc[df['redirect']=="http://community.wikia.com/wiki/Community_Central:Not_a_valid_community", 'redirect-short'] = "NOT AVAILABLE"

# Manually checked wikis that are alive: restore their original URL
checked = df['redirect'].isin([
    "http://community.wikia.com/wiki/Hub:Lifestyle",
    "http://community.wikia.com/wiki/Special:Chat",
    "http://community.wikia.com/wiki/Special:Forum",
])
df.loc[checked, 'redirect-short'] = df.loc[checked, 'url']

In [10]:
repetitions = df.groupby(by="redirect-short")['url'].count()
repetitions.sort_values(ascending=False)

redirect-short
NOT AVAILABLE                                                         65683
http://es.pokemon.wikia.com/                                             12
http://es.supercampeones.wikia.com/                                      12
http://anne-happy.wikia.com/                                              6
http://martin-mystery.wikia.com/                                          6
http://zh.tol.wikia.com/                                                  5
http://skaza.wikia.com/                                                   5
http://were-a-group-stray-dog-rp.wikia.com/                               5
http://whenstarclangetsboredfanfic.wikia.com/                             5
http://yourguide.wikia.com/                                               5
http://yolosweg.wikia.com/                                                5
http://the-nordia.wikia.com/                                              5
http://worldproblems.wikia.com/                                          

Next, we remove the "NOT AVAILABLE" wikis.

In [11]:
curatedIndex = df[df['redirect-short']!="NOT AVAILABLE"].copy()
len(curatedIndex)

340413

In [12]:
curatedIndex.drop_duplicates(subset="url", inplace=True)
len(curatedIndex)

339871

In [13]:
repetitions = curatedIndex.groupby(by="redirect-short")['url'].count()
repetitions.sort_values(ascending=False)

redirect-short
http://es.pokemon.wikia.com/                     12
http://es.supercampeones.wikia.com/              12
http://martin-mystery.wikia.com/                  6
http://de.anubis.wikia.com/                       4
http://ru.dragon-mania-legends.wikia.com/         4
http://lacasadepapel.wikia.com/                   4
http://de.battlefield.wikia.com/                  4
http://de.beyond-two-souls.wikia.com/             4
http://transformers.wikia.com/                    4
http://ru.mightandmagic.wikia.com/                3
http://zh.community.wikia.com/                    3
http://es.ben10.wikia.com/                        3
http://ru.kingdom-come-deliverance.wikia.com/     3
http://shadowhunters.wikia.com/                   3
http://de.nintendogs.wikia.com/                   3
http://de.myheroacademia.wikia.com/               3
http://es.prettylittleliars.wikia.com/            3
http://greatest-movies.wikia.com/                 3
http://community.wikia.com/                      

Finally, we save only the URLs in `redirect-short`, removing duplicates.

In [14]:
timestr = time.strftime("%Y%m%d")
# The context manager closes the file, so every line is flushed to disk
with open('../data/{}-{}.txt'.format(timestr, 'curatedIndex'), 'w') as thefile:
    for item in curatedIndex['redirect-short'].unique():
        thefile.write("%s\n" % item)
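As an alternative to an explicit loop, pandas can write the deduplicated column directly. The sketch below is only an illustration on invented data: the toy rows and the temporary output path are assumptions, and it reads the file back to confirm that each URL appears once.

```python
import os
import tempfile

import pandas as pd

# Toy column with one repeated target URL (invented rows).
toy = pd.DataFrame({
    "redirect-short": ["http://x.wikia.com/", "http://x.wikia.com/",
                       "http://z.wikia.com/"],
})

# Hypothetical output location; a temporary directory avoids touching ../data.
out_path = os.path.join(tempfile.mkdtemp(), "curatedIndex.txt")

# drop_duplicates keeps the first occurrence of each URL, preserving order.
toy["redirect-short"].drop_duplicates().to_csv(out_path, index=False, header=False)

# Read the file back to check the result: one URL per line, no repeats.
with open(out_path) as fh:
    written = fh.read().splitlines()
```

This relies on the URLs containing no commas or quotes; otherwise `to_csv` would quote those lines and the plain loop would be the safer choice.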