# Installation

## Overview 

As explained in the README file, you should have already fetched the datasets locally. 

See `../README.md` and `../script/fetch_drupal_data.sh`.

<details>
    <summary>Click to check data folder structure</summary>
    
```bash
data
├── csv
│   └── countries.csv
└── json
    ├── pages_event
    │   ├── page_0.json
    │   └── page_x.json
    ├── pages_organization
    │   ├── page_0.json
    │   └── page_x.json
    ├── pages_user
    │   ├── page_0.json
    │   └── page_x.json
```
</details>


## Conversion

There is too much data fetched from *d.o* to simply load it into Pandas.

For instance, as of today (March 23rd, 2025), there are approximately 42k pages of users (see [this endpoint](https://www.drupal.org/api-d7/user.json?sort=uid&direction)).

Each page lists 50 users, which adds up to almost 2.1 million users.
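
The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this notebook (about 42k pages, 50 users per page, and the 2,093,637 rows reported once the user data is converted); the variable names are illustrative.

```python
# Rough cross-check of the user counts quoted in this notebook.
USERS_PER_PAGE = 50               # page size stated above
APPROX_PAGES = 42_000             # "approximately 42k pages"
ROWS_IN_USER_PARQUET = 2_093_637  # row count reported after conversion

# Estimate from the rounded page count
estimated_users = APPROX_PAGES * USERS_PER_PAGE

# Pages needed to hold the actual row count (ceiling division)
pages_needed = -(-ROWS_IN_USER_PARQUET // USERS_PER_PAGE)

print(f"Estimated users: {estimated_users:,}")
print(f"Pages needed for {ROWS_IN_USER_PARQUET:,} users: {pages_needed:,}")
```

Both numbers agree with the rounded values given in the text.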

We must convert this data from JSON to a more efficient format for big data.

We chose `parquet` files, a compressed columnar format that Pandas can read and write directly.

In [None]:
import os
import pandas as pd


def merge_drupal_pages(input_folder: str) -> pd.DataFrame:
    """
    Read every JSON page in a folder and merge them into a single dataframe.

    Parameters
    ----------
    input_folder : str
        The path to the folder containing the JSON files.

    Returns
    -------
    pandas.DataFrame
        Merged dataframe.
    """

    # Initialize an empty list to store dataframes
    dataframes = []

    # List the JSON page files once
    page_files = [f for f in os.listdir(input_folder) if f.endswith(".json")]

    # Process every page by default
    max_pages = len(page_files)

    # For testing purposes, limit the number of pages to process.
    # max_pages = 1000

    # Sort by page number (e.g. "page_12.json" -> 12), then keep max_pages files
    json_files = sorted(
        page_files,
        key=lambda x: int(x.split("_")[-1].split(".")[0])
    )[:max_pages]

    # Iterate over the selected JSON files
    for filename in json_files:
        file_path = os.path.join(input_folder, filename)
        # Read the JSON file into a dataframe
        df = pd.read_json(file_path, lines=True)
        # Append the dataframe to the list
        dataframes.append(df)

    # Concatenate all dataframes
    merged_df = pd.concat(dataframes, ignore_index=True)
    print("Data concatenated and ready to be merged!")
    return merged_df
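
The numeric sort key used in `merge_drupal_pages` matters because a plain alphabetical sort would place `page_10.json` before `page_2.json`. The short sketch below illustrates the difference on a made-up list of file names; the helper name `page_number` is hypothetical and not part of the notebook.

```python
def page_number(filename: str) -> int:
    """Extract the page index from a name such as 'page_12.json'."""
    return int(filename.split("_")[-1].split(".")[0])


# Made-up file names, for illustration only
filenames = ["page_10.json", "page_2.json", "page_0.json", "page_1.json"]

alphabetical = sorted(filenames)              # string comparison
numeric = sorted(filenames, key=page_number)  # comparison by page index

print(alphabetical)
print(numeric)
```

With the numeric key, the pages are concatenated in the order they were fetched, so row order in the merged dataframe follows the API's paging.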


**⚠️ Warning**: the cell below can take a long time to run (the last run took ~8.5 minutes).

In [None]:
# Merge the JSON pages of each dataset into one DataFrame,
# then save it as a parquet file.
# for name in ['organization', 'event', 'user', 'module', 'module_terms']:
for name in ['module_terms']:
    output_file = f"../data/{name}.parquet"
    merged_df = merge_drupal_pages(f"../data/json/pages_{name}")
    merged_df.to_parquet(output_file, index=False)
    print(f"Data successfully merged and saved to {output_file}")


Data concatenated and ready to be merged!
Data successfully merged and saved to ../data/module_terms.parquet


In [None]:
import pandas as pd
print(f"Reading the parquet file: {output_file}")
df = pd.read_parquet(output_file)
df.shape


Reading the parquet file: ../data/user.parquet


(2093637, 18)

## Fetching user activity log


### Comments 

Since a user can have dozens or even hundreds of pages of comments, and there are more than two million users, fetching every comment would mean roughly 200 million records, which is too much data.

We can still get relevant information from *d.o*.

For each user, we only need the number of comments and the dates of the first and last comments.
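
One way to reduce a user's comments to these three values is sketched below. It is only an illustration: the helper name `summarize_comments` is hypothetical, and it assumes each comment record is a dict with a `created` field holding a Unix timestamp in seconds, which should be checked against the actual API response. No network call is made here.

```python
from datetime import datetime, timezone
from typing import Optional


def summarize_comments(
    comments: list[dict],
) -> tuple[int, Optional[datetime], Optional[datetime]]:
    """Return (count, first comment date, last comment date).

    Assumption: each record has a 'created' field holding a Unix
    timestamp in seconds, either as a string or as an int.
    """
    if not comments:
        return 0, None, None
    timestamps = [int(c["created"]) for c in comments]
    first = datetime.fromtimestamp(min(timestamps), tz=timezone.utc)
    last = datetime.fromtimestamp(max(timestamps), tz=timezone.utc)
    return len(timestamps), first, last


# Made-up records, for illustration only
sample = [
    {"created": "1700000000"},
    {"created": "1600000000"},
    {"created": "1650000000"},
]
count, first, last = summarize_comments(sample)
print(count, first.date(), last.date())
```

Keeping only a count and two dates per user means storing one small tuple per row instead of every comment record.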

In [None]:
import json
df = pd.read_parquet('../data/user.parquet')
uids = df['id'].unique()
# Write the UIDs to a single JSON file.
with open('../data/json/user_uids.json', 'w') as f:
    json.dump(uids.tolist(), f)
# Print the number of unique UIDs
print(f"Number of unique UIDs: {len(uids)}")


Number of unique UIDs: 2093637


In [None]:
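# Fetch the comment summary for each user and add it to the dataframe.
# `fetch_comments_by_user` is not defined in this section and must be available beforehand.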
df['comments_count'] = df['id'].apply(fetch_comments_by_user)


In [None]:
df['comments_count']


0          (1, 2, 3)
1          (1, 2, 3)
2          (1, 2, 3)
3          (1, 2, 3)
4          (1, 2, 3)
             ...    
2093632    (1, 2, 3)
2093633    (1, 2, 3)
2093634    (1, 2, 3)
2093635    (1, 2, 3)
2093636    (1, 2, 3)
Name: comments_count, Length: 2093637, dtype: object

## Next step

🔎 You can now open [the exploration](./exploration.ipynb) notebook.