-
Notifications
You must be signed in to change notification settings - Fork 0
/
GLAMorousToHTML.py
136 lines (110 loc) · 6.77 KB
/
GLAMorousToHTML.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
"""
This script, GLAMorousToHTML.py, automates the conversion of GLAMorous tool output into an HTML page, specifically focusing on
displaying Wikipedia articles across all languages that include images or media from a specified Wikimedia Commons category.
It ensures that only main namespace articles from Wikipedia are considered, excluding pages from Wikimedia Commons, Wikidata,
or other Wiki projects. The process involves fetching XML data, filtering and transforming this data, and ultimately producing
both Excel and HTML outputs that list all unique Wikipedia articles that utilize the specified media in the selected
Wikimedia Commons category.
Features:
- Fetches and processes XML data from the GLAMorous tool based on predefined configurations.
- Transforms data from an image-centric to a Wikimedia project-centric structure for easier analysis.
- Filters out non-language-specific Wikimedia projects to focus on genuine language editions of Wikipedia.
- Deduplicates article URLs within each project and orders projects based on the count of unique articles.
- Enriches the dataset with full language names retrieved from Wikidata, facilitating a more readable output.
- Converts the processed data into a pandas DataFrame for further manipulation and output generation.
- Generates an Excel file summarizing the Wikimedia project data and associated images used across Wikipedia articles.
- Builds a detailed HTML page listing Wikipedia articles, by language, that use the images from the specified Commons category.
Usage:
The script is designed to be executed as a standalone Python program. It requires configurations to be set for the GLAMorous
input (see 'setup.py'), specifically the Wikimedia Commons category to analyze.
Users can adjust the `read_mode`, `wp_fulllanguagelabel_lang`, and other parameters in the `setup.py` module
to cater to different institutions or analysis requirements.
Dependencies:
Relies on custom project imports from `general.py` for utility functions, `setup.py` for configuration settings, `buildHTML.py`
for HTML output generation, and `buildExcel.py` for creating Excel files. External libraries such as pandas are utilized for
data manipulation, and requests are used for fetching data from web sources.
Output:
- An Excel file (`*.xlsx`) containing a summary of Wikipedia articles that use the images from the specified Commons category.
- An HTML page providing a navigable list of Wikipedia articles by language, showcasing the usage of Commons category images.
Author:
Olaf Janssen, Wikimedia coordinator at KB, the national library of the Netherlands
Latest Update:
14 February 2024
Supported by ChatGPT, this script exemplifies the integration of Wikimedia tools with Python for enhancing the visibility
and analysis of media usage across Wikipedia, aiding cultural institutions in understanding the impact of their digital
collections in the Wikimedia ecosystem.
"""
# Custom project imports
from general import *
from setup import read_mode, wp_fulllanguagelabel_lang
from buildHTML import build_html
from buildExcel import build_excel
def main():
""" Main function of the script GLAMorousToHTML.py."""
"""
Read remote XML over http
data = A dictionary representation of this XML data
"""
data = read_xml_data(read_mode)
images = data.get('results', 'XX').get('details', 'XX').get('image', 'XX')
"""
This step achieves two things:
1) Retrieve a list of wikiprojects ['nl.wikipedia', 'en.wikipedia', ...] returned by the Glamorous tool, and
2) count the length of this list = number of wikiprojects
"""
wikiprojects = get_wikiprojects(data)[0] # List
#nwikiprojects = get_wikiprojects(data)[1] # Integer
"""
This step achieves two things:
1) Filter out non-language, test, event or defunct Wikimedia projects, such as
'outreach.wikipedia', 'meta.wikipedia', 'incubator.wikipedia' etc.
2) Also count the length of this list = number of filtered wikiprojects
"""
wikiprojects_filtered = filter_wikiprojects(wikiprojects)[0] # List
#nwikiprojects_filtered = filter_wikiprojects(wikiprojects)[1] # Integer
""""
This step transforms Glamorous data from using image names (such as 'AMH-7230-KB_Map_of_Borneo.jpg') as primary keys
("image key based") to using wikiprojects (such as 'fr.wikipedia') as primary keys ("wikiproject key based")
In other words: Transforms data from an image-centric structure to a Wikimedia project-centric structure.
"""
projectsdict = transform_imagekeybased_to_wikiprojectkeybased(data, wikiprojects_filtered)
# Let's process & enrich 'projectsdict' and convert into a Pandas Dataframe: Steps 1-7
""" 1,2,3)
This step performs three operations on a dictionary mapping Wikimedia project names to lists of article URLs:
1) Remove duplicate Wikipedia URLs per wikiproject/language.
2) Sort the list of deduplicated Wikipedia article URLs alphabetically.
3) Order the dictionary by the number of Wikipedia article URLs per wikiproject, in descending order.
"""
deduped_sorted_ordered_projectsdict = dedup_sort_order_projectsdict(projectsdict, wikiprojects_filtered)
""" Helper function/step to bring full language names to the table
Retrieve a list of dicts containing full Wikipedia language names, from Wikidata
"""
langdictlist = get_languages_dict(wp_fulllanguagelabel_lang) #dict of full Wikipedia language labels in the language specified
""" 4)
This step adds full language names as new key-value pair to 'deduped_sorted_ordered_projectsdict' (short: 'dso_pdict')
"""
full_language_dso_pdict = add_full_language_names_to_dict(deduped_sorted_ordered_projectsdict, langdictlist)
""" 5)
This step sorts the 'full_language_dso_pdict' dict first by the descending count of URLs associated with each project
and then alphabetically by the project's full language name in case of equal URL counts.
"""
sorted_fulllanguage_dso_pdict = sort_projects_by_urlcount_and_fulllanguage_name(full_language_dso_pdict)
""" 6)
For every Wikipedia article in this dict, this steps adds the KB images that are included in it.
"""
images_projectdict = add_images_to_dict(sorted_fulllanguage_dso_pdict, images)
""" 7)
This step turns 'images_projectdict' into Pandas Dataframe - as preparation for conversion into Excel and HTML
"""
wp_df = convert_to_dataframe(images_projectdict)
##### So far for all data transformations and manipulations, let's now create an Excel file and a HTML page from this data
""" 8)
This step writes the dataframe to Excel
"""
build_excel(wp_df)
""" 9)
This step transform dataframe to HTML components/building blocks and writes all these components to an output HTML file
"""
build_html(wp_df)
if __name__ == "__main__":
main()