# Fundamentals of Social Data Science 
## Week 2 Day 2 Lab: Downloading from Wikipedia

Today we will review some changes to the Wikipedia code. These changes considerably expand what you can do with it. The end result will be two folders, `data` and `DataFrames`, which you can use to analyze Wikipedia.

I have altered the code in several ways:
- It now uses `curl` with Special:Export to get the complete history of a page, and reports on the download.
- Reporting and commenting are considerably expanded.
- New arguments are available to the script, including `--count_only`.

There is also a second script, `xml_to_dataframe.py`, which processes the downloaded files and turns them into separate DataFrames, one per article. These DataFrames are stored as `.feather` files and can be loaded with the code below.

You should review the `xml_to_dataframe.py` file. Every operation in it has been covered in class except `tqdm`, and you can see how that works in practice by running the script.
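Since `tqdm` is the one piece not covered in class, here is a minimal sketch of its typical use: wrap any iterable in `tqdm(...)` and a progress bar is printed as the loop advances. The batch names and the `process_batch` function are made-up placeholders, not taken from `xml_to_dataframe.py`, and the `desc` and `unit` values are only a guess at how labels such as `Processing Cat: ... batch/s` in the output further down might be produced.

```python
from tqdm import tqdm


def process_batch(name):
    """Hypothetical stand-in for real work; returns the length of the name."""
    return len(name)


batches = ["batch_1", "batch_2", "batch_3"]  # placeholder work items

results = []
# desc sets the label before the bar; unit sets the word shown in the rate
for batch in tqdm(batches, desc="Processing example", unit="batch"):
    results.append(process_batch(batch))

print(results)
```

The loop body is unchanged from an ordinary `for` loop; `tqdm` only observes how many items have been consumed.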

You will note that this version does not use recursion to count the files. Instead it more literally looks within year and month. This is sufficient for this work, but with a deeper folder structure and one where the structure is less certain this approach would not be robust. On the other hand, by assuming year and month it allows for some interesting statistics about the year and month to be displayed. In your own work you may now consider whether to approach a task with a more general but often more abstract solution or a more specific but often more fragile solution. You can see in Jon's solution that he used a clever way to simply count all the files using a global and letting the global handle the recursion (`download_and_count_revisions_solution.py`).
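The trade-off above can be sketched as follows. This is an illustration under assumptions, not the code from either script: the layout `<article>/<year>/<month>/<files>` and the `.xml` file names are assumed, and `Path.rglob` is just one general way to count files at any depth, not necessarily what the solution file does.

```python
import tempfile
from pathlib import Path


def count_by_year_month(article_dir):
    """Specific approach: assume article_dir/<year>/<month>/<files>."""
    counts = {}
    for year_dir in sorted(p for p in article_dir.iterdir() if p.is_dir()):
        total = 0
        for month_dir in (p for p in year_dir.iterdir() if p.is_dir()):
            total += sum(1 for f in month_dir.iterdir() if f.is_file())
        counts[year_dir.name] = total
    return counts


def count_recursively(article_dir):
    """General approach: count every file at any depth."""
    return sum(1 for p in article_dir.rglob("*") if p.is_file())


# Build a tiny throwaway tree so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    article_dir = Path(tmp) / "Example_article"
    for year, month, n_files in [("2023", "01", 2), ("2023", "02", 1), ("2024", "10", 3)]:
        month_dir = article_dir / year / month
        month_dir.mkdir(parents=True)
        for i in range(n_files):
            (month_dir / f"{i}.xml").write_text("<revision/>")

    by_year = count_by_year_month(article_dir)
    total = count_recursively(article_dir)

print(by_year)  # {'2023': 3, '2024': 3}
print(total)    # 6
```

The specific version yields a per-year breakdown for free but silently miscounts if files sit at another depth; the recursive version tolerates any layout but only returns a total.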

You should now be able to download the complete history of a single Wikipedia page and process it as a DataFrame. Confirm that you can do this with the code yourself. Then discuss among your group:
1. Which two (or more) public figures are worth comparing and why. 
2. Prior to any specific time series analysis, consider your expectations for this exploratory comparison.  

Draw upon your group's potential expertise in social science to come up with a theoretically informed rationale for a given comparison. 

## Merging in Changes to a Repository 

First, merge the files from the upstream repository (mine). These instructions show how to do that from the terminal; run the commands from inside the `oii-fsds-wikipedia` folder. Note especially **Step 3**: the merge in Step 4 can overwrite your `download_wiki_revisions.py`, so back up your own version first.

1. **Add the original repository as a remote:**
   ```sh
   git remote add upstream https://github.com/berniehogan/oii-fsds-wikipedia.git
   ```

2. **Fetch the changes from the original repository:**
   ```sh
   git fetch upstream
   ```

3. **Backup any local changes:**
   If you have your own versions of files like `download_wiki_revisions.py`, you should rename the file first to avoid conflicts:
   ```sh
   mv download_wiki_revisions.py download_wiki_revisions_backup.py
   ```

4. **Merge upstream changes into your local main branch:**
   ```sh
   git merge upstream/main
   ```

5. **Resolve any conflicts and commit the changes:**
   You should resolve any conflicts that arise during the merge and then commit the changes:
   ```sh
   git add .
   git commit -m "Merge changes from upstream"
   ```

6. **Push the changes to your GitHub repository:**
   ```sh
   git push origin main
   ```

7. **Test your code after merging:**
   You should test your code to ensure everything works correctly after the integration.

By following these steps, you should be able to integrate the latest changes from my repository while preserving your own custom modifications.

Once this is done, you can use the notebook cells below, if you wish, to run the commands directly within a Jupyter notebook rather than via the terminal.

## Important note for writing terminal commands in a Python notebook

**Use `os.system(<terminal code>)`**
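`os.system` runs a command in a shell and returns its exit status, which is why a bare `0` appears at the end of some cell outputs below: 0 signals success. A minimal sketch follows, with `subprocess.run` shown as a standard-library alternative that can also capture what the command prints. The `echo` command and the inline Python snippet are placeholders.

```python
import os
import subprocess
import sys

# os.system returns the command's exit status; 0 means success
status = os.system("echo hello from the shell")
print(status)

# subprocess.run with a list of arguments avoids shell quoting issues
# and lets us capture the command's output as text
result = subprocess.run(
    [sys.executable, "-c", "print('hello from python')"],
    capture_output=True,
    text=True,
)
print(result.returncode)
print(result.stdout.strip())
```

Checking the returned status is a cheap way to notice that a download or conversion command failed before moving on to the next cell.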

In [17]:
# Import prerequisite packages
import os  # must be imported before the os.system calls below

# Install the packages the scripts need and record them in requirements.txt
# Note: pathlib, datetime and argparse ship with Python 3, so those three installs are normally unnecessary
os.system('pip install pandas && pip freeze > requirements.txt')
os.system('pip install requests && pip freeze > requirements.txt')
os.system('pip install bs4 && pip freeze > requirements.txt')
os.system('pip install pathlib && pip freeze > requirements.txt')
os.system('pip install datetime && pip freeze > requirements.txt')
os.system('pip install tqdm && pip freeze > requirements.txt')
os.system('pip install argparse && pip freeze > requirements.txt')
os.system('pip install lxml && pip freeze > requirements.txt')
os.system('pip install pyarrow && pip freeze > requirements.txt')

import pandas as pd
import requests


Collecting argparse
  Using cached argparse-1.4.0-py2.py3-none-any.whl.metadata (2.8 kB)
Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0
Collecting pyarrow
  Using cached pyarrow-17.0.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (3.3 kB)
Using cached pyarrow-17.0.0-cp312-cp312-macosx_11_0_arm64.whl (27.2 MB)
Installing collected packages: pyarrow
Successfully installed pyarrow-17.0.0


In [3]:
# Define articles we want to download
article1 = "Data_science"
article2 = "Machine_learning"

In [5]:
# Create necessary directories if they don't exist (if it exists it won't overwrite or throw an error)
os.makedirs("data", exist_ok=True)
os.makedirs("DataFrames", exist_ok=True)

In [15]:
# Download revisions for both articles
# Using `os.system(<code>)` lets us run a command as if it were typed in the terminal
# Note that download_wiki_revisions.py imports its own packages,
    # so we only need to make sure they are installed in the environment the terminal uses

print("Downloading revisions for first article...")
os.system(f'python download_wiki_revisions.py "{article1}"') 
print("\nDownloading revisions for second article...")
os.system(f'python download_wiki_revisions.py "{article2}"')

Downloading revisions for first article...
Downloading complete history of Data_science


Downloading revisions: 35.3MiB [00:01, 24.9MiB/s]


Found 1709 revisions. Organizing into directory structure...


100%|██████████| 1709/1709 [00:02<00:00, 821.75it/s] 



Final revision counts:
Found 1709 total revisions for 'Data_science'.

Breakdown by year:
  2012: 91 revisions
  2013: 127 revisions
  2014: 73 revisions
  2015: 143 revisions
  2016: 103 revisions
  2017: 135 revisions
  2018: 190 revisions
  2019: 130 revisions
  2020: 168 revisions
  2021: 133 revisions
  2022: 185 revisions
  2023: 110 revisions
  2024: 121 revisions

Downloading revisions for second article...
Downloading complete history of Machine_learning


Downloading revisions: 238MiB [00:20, 11.4MiB/s] 


Found 3887 revisions. Organizing into directory structure...


100%|██████████| 3887/3887 [00:08<00:00, 464.47it/s]



Final revision counts:
Found 3887 total revisions for 'Machine_learning'.

Breakdown by year:
  2003: 6 revisions
  2004: 33 revisions
  2005: 103 revisions
  2006: 138 revisions
  2007: 130 revisions
  2008: 71 revisions
  2009: 74 revisions
  2010: 132 revisions
  2011: 129 revisions
  2012: 113 revisions
  2013: 96 revisions
  2014: 152 revisions
  2015: 219 revisions
  2016: 261 revisions
  2017: 263 revisions
  2018: 270 revisions
  2019: 293 revisions
  2020: 297 revisions
  2021: 244 revisions
  2022: 298 revisions
  2023: 328 revisions
  2024: 237 revisions


0

In [18]:
# Convert all downloaded revisions to DataFrames
print("\nConverting revisions to DataFrames...")
os.system('python xml_to_dataframe.py --data-dir ./data --output-dir ./DataFrames')


Converting revisions to DataFrames...
Processing with text length only


Processing Cat: 100%|██████████| 1/1 [00:00<00:00, 51.13batch/s]



Summary for Cat:
Total revisions: 10
Date range: 2024-10-14 05:29:09+00:00 to 2024-10-16 14:43:33+00:00
Unique contributors: 3
Average text length: 164003.1 characters

Summary for Climate Change:
Total revisions: 18
Date range: 2002-02-25 15:51:15+00:00 to 2024-01-04 18:20:42+00:00
Unique contributors: 14
Average text length: 293.2 characters

Summary for Dog:
Total revisions: 10
Date range: 2024-10-19 20:54:43+00:00 to 2024-10-21 12:37:43+00:00
Unique contributors: 8
Average text length: 190122.7 characters

Summary for Leopard:
Total revisions: 50
Date range: 2024-07-20 22:40:23+00:00 to 2024-10-19 01:05:05+00:00
Unique contributors: 13
Average text length: 109459.4 characters


Processing Climate Change: 100%|██████████| 1/1 [00:00<00:00, 109.46batch/s]
Processing Dog: 100%|██████████| 1/1 [00:00<00:00, 39.71batch/s]
Processing Leopard: 100%|██████████| 1/1 [00:00<00:00, 11.48batch/s]
Processing Machine_learning: 100%|██████████| 4/4 [00:02<00:00,  1.64batch/s]
Processing Hamster:   0%|          | 0/1 [00:00<?, ?batch/s]


Summary for Machine_learning:
Total revisions: 3887
Date range: 2003-05-25 06:03:17+00:00 to 2024-10-21 15:03:51+00:00
Unique contributors: 1098
Average text length: 59622.2 characters


Processing Hamster: 100%|██████████| 1/1 [00:00<00:00,  1.71batch/s]
Processing Data_science:   0%|          | 0/2 [00:00<?, ?batch/s]


Summary for Hamster:
Total revisions: 1000
Date range: 2011-02-04 18:00:15+00:00 to 2024-08-12 13:16:07+00:00
Unique contributors: 343
Average text length: 24421.3 characters


Processing Data_science: 100%|██████████| 2/2 [00:00<00:00,  2.38batch/s]
Processing Tiger:   0%|          | 0/1 [00:00<?, ?batch/s]


Summary for Data_science:
Total revisions: 1709
Date range: 2012-04-11 17:34:10+00:00 to 2024-09-04 22:32:11+00:00
Unique contributors: 466
Average text length: 19660.1 characters


Processing Tiger: 100%|██████████| 1/1 [00:01<00:00,  1.74s/batch]
Processing Cheetah: 100%|██████████| 1/1 [00:00<00:00,  9.89batch/s]
Processing Singapore:   0%|          | 0/1 [00:00<?, ?batch/s]


Summary for Tiger:
Total revisions: 1000
Date range: 2024-03-16 17:22:00+00:00 to 2024-10-21 10:33:06+00:00
Unique contributors: 51
Average text length: 146214.8 characters

Summary for Cheetah:
Total revisions: 50
Date range: 2024-05-19 15:12:28+00:00 to 2024-10-07 03:39:46+00:00
Unique contributors: 14
Average text length: 190603.3 characters


Processing Singapore: 100%|██████████| 1/1 [00:03<00:00,  3.06s/batch]
Processing Kangaroo: 100%|██████████| 1/1 [00:00<00:00, 18.10batch/s]
Processing New Zealand:   0%|          | 0/1 [00:00<?, ?batch/s]


Summary for Singapore:
Total revisions: 1000
Date range: 2021-08-08 10:07:14+00:00 to 2024-10-21 11:34:25+00:00
Unique contributors: 352
Average text length: 286322.4 characters

Summary for Kangaroo:
Total revisions: 50
Date range: 2022-10-06 11:48:28+00:00 to 2024-09-26 12:40:57+00:00
Unique contributors: 36
Average text length: 69492.4 characters


Processing New Zealand: 100%|██████████| 1/1 [00:00<00:00,  6.05batch/s]
Processing Hedgehog: 100%|██████████| 1/1 [00:00<00:00, 27.75batch/s]
Processing Cow: 100%|██████████| 1/1 [00:00<00:00, 33.17batch/s]
Processing Lion: 100%|██████████| 1/1 [00:00<00:00, 10.95batch/s]
Processing Elephant:   0%|          | 0/1 [00:00<?, ?batch/s]


Summary for New Zealand:
Total revisions: 50
Date range: 2024-09-05 22:08:40+00:00 to 2024-10-19 05:06:02+00:00
Unique contributors: 24
Average text length: 272857.7 characters

Summary for Hedgehog:
Total revisions: 50
Date range: 2023-02-24 01:48:59+00:00 to 2024-10-18 23:58:46+00:00
Unique contributors: 28
Average text length: 32363.1 characters

Summary for Cow:
Total revisions: 87
Date range: 2001-05-17 08:19:29+00:00 to 2023-02-01 15:02:19+00:00
Unique contributors: 41
Average text length: 78.2 characters

Summary for Lion:
Total revisions: 50
Date range: 2024-08-19 19:25:04+00:00 to 2024-10-18 14:43:17+00:00
Unique contributors: 21
Average text length: 147368.1 characters

Summary for Elephant:
Total revisions: 50
Date range: 2024-05-15 16:14:19+00:00 to 2024-10-13 01:25:45+00:00
Unique contributors: 31
Average text length: 128240.7 characters


Processing Elephant: 100%|██████████| 1/1 [00:00<00:00, 12.28batch/s]


0
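To make the conversion step less of a black box, here is a minimal sketch of how one `<revision>` element could be flattened into the columns seen in the resulting DataFrame. It is an illustration under assumptions, not the logic of `xml_to_dataframe.py`: the XML is a hand-written, simplified fragment modelled on the MediaWiki export format (real exports carry an XML namespace and more fields), the users are made up, and the standard-library `xml.etree.ElementTree` parser is used here for self-containment.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hand-written, simplified sample; real exports carry a namespace and more fields
SAMPLE = """
<page>
  <revision>
    <id>101</id>
    <timestamp>2024-09-02T09:49:07Z</timestamp>
    <contributor><username>ExampleUser</username><id>1</id></contributor>
    <comment>example edit</comment>
    <text>Hello world</text>
  </revision>
  <revision>
    <id>102</id>
    <timestamp>2024-10-01T12:00:00Z</timestamp>
    <contributor><ip>192.0.2.1</ip></contributor>
    <text>Hello world again</text>
  </revision>
</page>
"""


def revision_to_row(rev):
    """Flatten one <revision> element into a dict of column values."""
    ts = datetime.strptime(rev.findtext("timestamp"), "%Y-%m-%dT%H:%M:%SZ")
    return {
        "revision_id": rev.findtext("id"),
        "timestamp": ts,
        "username": rev.findtext("contributor/username"),  # None when absent
        "userid": rev.findtext("contributor/id"),
        "comment": rev.findtext("comment"),
        "text_length": len(rev.findtext("text") or ""),
        "year": str(ts.year),
        "month": str(ts.month),
    }


root = ET.fromstring(SAMPLE.strip())
rows = [revision_to_row(rev) for rev in root.iter("revision")]
for row in rows:
    print(row["revision_id"], row["username"], row["text_length"], row["year"], row["month"])
```

Passing `rows` to `pd.DataFrame(rows)` would give a table with the same column names. The second revision shows one way a missing `username` can arise: the contributor is recorded by IP address rather than by account.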

In [19]:
# Load and verify one of the DataFrames
print("\nVerifying DataFrame contents...")
df = pd.read_feather(f"DataFrames/{article1}.feather")


Verifying DataFrame contents...


In [23]:
# Display basic information about the DataFrame
print("\nDataFrame Info:")
df.info()  # info() prints its summary itself and returns None, so it needs no display()

print("\nFirst few rows:")
display(df.head())


DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
Index: 1709 entries, 479 to 759
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype              
---  ------       --------------  -----              
 0   revision_id  1709 non-null   object             
 1   timestamp    1709 non-null   datetime64[ns, UTC]
 2   username     1123 non-null   object             
 3   userid       1123 non-null   object             
 4   comment      1372 non-null   object             
 5   text_length  1709 non-null   int64              
 6   year         1709 non-null   object             
 7   month        1709 non-null   object             
dtypes: datetime64[ns, UTC](1), int64(1), object(6)
memory usage: 120.2+ KB

First few rows:


Unnamed: 0,revision_id,timestamp,username,userid,comment,text_length,year,month
479,1244076203,2024-09-04 22:32:11+00:00,Arachnidly,47739713,removed cuz scientific method linked in intro ...,28671,2024,9
475,1244075613,2024-09-04 22:27:46+00:00,Arachnidly,47739713,"removed unrelated journal ""scientific data"" pa...",28695,2024,9
474,1243594551,2024-09-02 10:24:08+00:00,Michaelmalak,14994222,Undid revision [[Special:Diff/1243592758|12435...,28719,2024,9
477,1243592758,2024-09-02 10:03:39+00:00,Iniyavalsha333,48376643,I added new info about data science.,29437,2024,9
478,1243591540,2024-09-02 09:49:07+00:00,Michaelmalak,14994222,Undid revision [[Special:Diff/1243589808|12435...,28719,2024,9


In [24]:
# Display some basic statistics
print("\nBasic statistics:")
print(f"Total number of revisions: {len(df)}")
print(f"Date range: from {df['timestamp'].min()} to {df['timestamp'].max()}")
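# nunique() ignores missing usernames, so revisions without a recorded username are not counted here
print(f"Number of unique editors: {df['username'].nunique()}")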


Basic statistics:
Total number of revisions: 1709
Date range: from 2012-04-11 17:34:10+00:00 to 2024-09-04 22:32:11+00:00
Number of unique editors: 466
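For the group comparison, one possible starting point is to line up revisions per year for two articles. The sketch below uses two tiny made-up DataFrames with the same `timestamp` and `username` columns as above so that it runs on its own; with real data you would load each article with `pd.read_feather` instead. The names `df_a`, `df_b` and `revisions_per_year` are hypothetical.

```python
import pandas as pd


def revisions_per_year(df):
    """Count revisions in each calendar year, based on the timestamp column."""
    return df.groupby(df["timestamp"].dt.year).size()


# Tiny made-up frames standing in for two loaded articles
df_a = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-01-05", "2023-06-01", "2024-02-10"], utc=True),
    "username": ["Alice", None, "Alice"],
})
df_b = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-03-01", "2024-03-02"], utc=True),
    "username": ["Bob", "Carol"],
})

# Align the yearly counts side by side; a year missing from one article becomes 0
comparison = pd.DataFrame({
    "article_a": revisions_per_year(df_a),
    "article_b": revisions_per_year(df_b),
}).fillna(0).astype(int)
print(comparison)

# nunique() skips missing usernames, so count those rows separately
named_editors = df_a["username"].nunique()
unnamed_revisions = int(df_a["username"].isna().sum())
print(named_editors, unnamed_revisions)
```

Reporting the number of revisions with no username alongside the unique-editor count makes it clear how much activity the editor count leaves out.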
