# Mining meta information about Wikipedia so we can mine Wikipedia

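We are looking for the template messages that sometimes show up in the article text in the `...content.json` dataset. For example:
> Denne norske filmrelaterte artikkelen er foreløpig kort eller mangelfull, og du kan hjelpe Wikipedia ved å utvide den.
>
> (English: "This Norwegian film-related article is currently short or incomplete, and you can help Wikipedia by expanding it.")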

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import array_contains

spark = SparkSession \
    .builder \
    .appName("Analysing Wikipedia") \
    .getOrCreate()

In [2]:
df = spark.read.json("./nowiki-20210111-cirrussearch-general.json")

In [3]:
df.printSchema()

root
 |-- auxiliary_text: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- category: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- content_model: string (nullable = true)
 |-- coordinates: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- coord: struct (nullable = true)
 |    |    |    |-- lat: double (nullable = true)
 |    |    |    |-- lon: double (nullable = true)
 |    |    |-- country: string (nullable = true)
 |    |    |-- dim: long (nullable = true)
 |    |    |-- globe: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- primary: boolean (nullable = true)
 |    |    |-- region: string (nullable = true)
 |    |    |-- type: string (nullable = true)
 |-- create_timestamp: string (nullable = true)
 |-- defaultsort: string (nullable = true)
 |-- display_title: string (nullable = true)
 |-- external_link: array (nullable = true)
 |    |-- element: strin

In [4]:
# We probably only need wikitext
df.select("content_model").distinct().show()

+-------------+
|content_model|
+-------------+
|   flow-board|
|    Scribunto|
|         null|
|         json|
|sanitized-css|
|     wikitext|
|   javascript|
|          css|
+-------------+
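
The `null` row is probably not a real content model. Assuming the dump uses the Elasticsearch bulk layout, where every page document is preceded by a small `index` action line (the `index` column that gets dropped further down points that way), those action lines carry no `content_model` and come out as `null`. A minimal pure-Python sketch of that idea, with made-up sample records and a hypothetical helper `iter_pages`:

```python
import json

# Hypothetical sample in bulk layout: an action line, then the page document.
sample_lines = [
    '{"index": {"_type": "page", "_id": "1"}}',
    '{"namespace": 10, "content_model": "wikitext", "title": "Aktuell"}',
    '{"index": {"_type": "page", "_id": "2"}}',
    '{"namespace": 828, "content_model": "Scribunto", "title": "Example"}',
]


def iter_pages(lines):
    """Yield page documents only, skipping the bulk `index` action lines."""
    for line in lines:
        record = json.loads(line)
        # Assumption: action lines have an "index" key and no page fields.
        if "index" in record and "title" not in record:
            continue
        yield record


pages = list(iter_pages(sample_lines))
print([page["title"] for page in pages])
```

If that assumption holds, filtering the Spark DataFrame on a non-null page field (for example `df["title"].isNotNull()`) should remove the same rows.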



In [7]:
# About namespaces https://en.wikipedia.org/wiki/Wikipedia:Namespace
# Templates have their own namespace! It is number 10
df.select("namespace").distinct().show(50)

+---------+
|namespace|
+---------+
|        7|
|      828|
|     null|
|        6|
|        9|
|        5|
|        1|
|       10|
|      100|
|        3|
|      101|
|       12|
|        8|
|       11|
|      829|
|        2|
|        4|
|       13|
|     2600|
|       14|
|       15|
+---------+
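
To make the numbers above readable, here is a small lookup table. The names for 0 to 15, 828, 829 and 2600 are the standard MediaWiki ones; 100 and 101 are wiki-specific, and treating them as Portal namespaces is an assumption. The helper name `describe_namespace` is made up for this sketch.

```python
# Namespace numbers mapped to their usual English names.
NAMESPACE_NAMES = {
    0: "Main (articles)",
    1: "Talk",
    2: "User",
    3: "User talk",
    4: "Project (Wikipedia)",
    5: "Project talk",
    6: "File",
    7: "File talk",
    8: "MediaWiki",
    9: "MediaWiki talk",
    10: "Template",
    11: "Template talk",
    12: "Help",
    13: "Help talk",
    14: "Category",
    15: "Category talk",
    100: "Portal",       # assumption: wiki-specific namespace
    101: "Portal talk",  # assumption: wiki-specific namespace
    828: "Module",
    829: "Module talk",
    2600: "Topic",
}


def describe_namespace(number):
    """Return a readable name for a namespace number, or 'unknown'."""
    return NAMESPACE_NAMES.get(number, "unknown")


for number in (10, 828, 2600, None):
    print(number, "->", describe_namespace(number))
```

Namespace 0 does not show up in the output above, which would fit a dump that only holds the non-article namespaces.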



In [47]:
# Keep only wikitext pages in the Template namespace (10),
# then drop the columns we don't need.
filtered_df = df \
    .filter((df["namespace"] == 10) & (df["content_model"] == "wikitext")) \
    .drop("content_model", "language", "category", "coordinates", "defaultsort",
          "external_link", "heading", "incoming_links", "namespace", "namespace_text",
          "outgoing_link", "redirect", "text_bytes", "template", "wiki",
          "wikibase_item", "version_type", "file_bits", "file_height", "file_media_type",
          "file_resolution", "file_size", "file_text", "file_width", "index",
          "file_mime", "ores_articletopic", "ores_articletopics", "score", "popularity_score",
          "display_title", "auxiliary_text", "create_timestamp", "timestamp", "version",
          "opening_text", "source_text")

In [48]:
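# Find the templates whose text contains the stub message from the introduction
stub_message = "Denne norske filmrelaterte artikkelen er foreløpig kort eller mangelfull, og du kan hjelpe Wikipedia ved å utvide den."

filtered_df \
    .filter(filtered_df["text"].contains(stub_message)) \
    .show(truncate=False)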

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------+
|text                                                                                                                                                                                                                                                                                                                                                                                                                     |title          |
+-----------------------------------------------------------------------------------------------------------------------------------------------

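The `text` column contains the template's message, but also more than that. And the last template in the output above should show

> Aktuell artikkel: Denne artikkelen omhandler en aktuell hendelse. Vær ekstra oppmerksom på at innholdet kan være utdatert eller feilaktig, og at hyppige redigeringer kan forekomme.
>
> (English: "Current article: This article covers a current event. Be especially aware that the content may be outdated or incorrect, and that frequent edits may occur.")

but its `text` value is not that at all!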
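
Since the `text` of a template page can't be used directly as the string to remove, one fallback would be a hand-collected list of rendered messages. This is only a sketch of that idea, not something run against the dump: `KNOWN_MESSAGES`, `strip_template_messages` and the sample article sentence are all made up for illustration.

```python
# Assumption: a hand-curated list of rendered template messages.
KNOWN_MESSAGES = [
    "Denne norske filmrelaterte artikkelen er foreløpig kort eller mangelfull, "
    "og du kan hjelpe Wikipedia ved å utvide den.",
]


def strip_template_messages(text, messages=KNOWN_MESSAGES):
    """Remove every known template message from text and tidy the whitespace."""
    for message in messages:
        text = text.replace(message, "")
    # Collapse the gaps left behind by the removed messages.
    return " ".join(text.split())


# Hypothetical article text ending in the stub message.
article = "Filmen hadde premiere i 1975. " + KNOWN_MESSAGES[0]
print(strip_template_messages(article))
```

This only catches messages that appear verbatim, so any template that fills in article-specific words would slip through.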

In [49]:
# Look up the "Aktuell" template directly by its title
filtered_df.filter(filtered_df["title"] == "Aktuell").show(truncate=False)

+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|text                                                                                                                                                 

# Conclusion

This was hard and I give up. I'll just keep the templates in.