# Exploratory Data Analysis

Exploratory Data Analysis (EDA) of the Drupal Community.

## Installation

### Overview 

As explained in the README file, you should have already fetched the datasets locally. 

See `../README.md` and `../script/fetch_drupal_data.sh`.

<details>
    <summary>Click to check data folder structure</summary>
    
```bash
data
├── csv
│   └── countries.csv
└── json
    ├── pages_event
    │   ├── page_0.json
    │   └── page_x.json
    ├── pages_organization
    │   ├── page_0.json
    │   └── page_x.json
    ├── pages_user
    │   ├── page_0.json
    │   └── page_x.json
```
</details>
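
Before converting anything, it can help to confirm that the folders are actually populated. The snippet below is a minimal sketch that counts the `page_*.json` files in each `pages_*` folder; `count_pages` is a hypothetical helper, and the `../data` path is assumed from the code cells later in this notebook.

```python
from pathlib import Path


def count_pages(data_root):
    """Return the number of page_*.json files in each pages_* folder under data_root/json."""
    json_root = Path(data_root) / "json"
    return {
        folder.name: len(list(folder.glob("page_*.json")))
        for folder in sorted(json_root.glob("pages_*"))
        if folder.is_dir()
    }


# Example usage from the notebook's directory (assumed layout):
# print(count_pages("../data"))
```

A folder reported with `0` pages suggests the fetch script has not been run yet, or did not finish.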


### Conversion

As of March 23rd, 2025, Drupal.org's REST API exposes approximately 42k pages [at this endpoint](https://www.drupal.org/api-d7/user.json?sort=uid&direction).

Each page contains a list of 50 users with 16 fields, for a total of almost 2.1 million users.
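
As a quick sanity check, the page count can be derived from the user count. This is only arithmetic on figures quoted in this notebook (the row count comes from the `df.shape` output further down); the variable names are illustrative.

```python
import math

# Figures quoted in this notebook.
total_users = 2_093_637   # row count reported by df.shape
users_per_page = 50       # users returned per API page

# Number of pages needed to hold every user (the last page may be partial).
expected_pages = math.ceil(total_users / users_per_page)
print(expected_pages)  # 41873, i.e. roughly 42k pages
```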

That is too much data to load comfortably into Pandas as JSON.

Let's convert it to a more efficient format: Parquet.

**Warning**: the cell below can take a long time to run (the last run took 8.5 minutes).

In [None]:
import os
import pandas as pd

# Define the input folder
input_folder = "../data/json/pages_user/"

# Initialize an empty list to store dataframes
dataframes = []

# Dynamically set max_pages to the total number of JSON files in the directory
max_pages = len([f for f in os.listdir(input_folder) if f.endswith(".json")])

# For testing purposes, limit the number of pages to process.
# max_pages = 1000

# Get a sorted list of JSON files
json_files = sorted(
    [f for f in os.listdir(input_folder) if f.endswith(".json")],
    key=lambda x: int(x.split("_")[-1].split(".")[0])  # Sort by page number
)[:max_pages]  # Limit to max_pages files

# Iterate over the selected JSON files
for filename in json_files:
    file_path = os.path.join(input_folder, filename)
    # Read the JSON file into a dataframe
    df = pd.read_json(file_path, lines=True, dtype={
        "id": "int32",
        "title": "string",
        "fname": "string",
        "lname": "string",
        "created": "int32",
        "da_membership": "string",
        "slack": "string",
        "timezone": "string",
        "region": "string",
        "mentors": "object",
        "countries": "object",
        "languages": "object",
        "organizations": "object",
        "industries": "object",
        "contributions": "object",
        "events": "object",
    })
    # Append the dataframe to the list
    dataframes.append(df)

# Concatenate all dataframes
merged_df = pd.concat(dataframes, ignore_index=True)

print("Data concatenated and ready to be saved!")
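
The `key` function in the cell above matters because a plain `sorted()` compares file names as text, which puts `page_10.json` before `page_2.json`. The sketch below illustrates the difference on a few made-up file names; `page_number` is a hypothetical helper that mirrors the lambda used in the cell.

```python
# Hypothetical file names following the page_<n>.json pattern.
names = ["page_10.json", "page_2.json", "page_0.json", "page_1.json"]


def page_number(filename: str) -> int:
    """Extract the integer page number from 'page_<n>.json'."""
    return int(filename.split("_")[-1].split(".")[0])


lexicographic = sorted(names)            # text order
numeric = sorted(names, key=page_number)  # page order

print(lexicographic)  # ['page_0.json', 'page_1.json', 'page_10.json', 'page_2.json']
print(numeric)        # ['page_0.json', 'page_1.json', 'page_2.json', 'page_10.json']
```

Sorting numerically keeps users in ascending order when the pages are concatenated, assuming the pages were fetched sorted by `uid`.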



Now convert our data to a Parquet file:

In [None]:
# Save the merged dataframe as a parquet file
output_file = "../data/users.parquet"
merged_df.to_parquet(output_file, index=False)
print(f"Data successfully merged and saved to {output_file}")

Data successfully merged and saved to ../data/users.parquet


Check the dataset:

In [None]:
df = pd.read_parquet(output_file)
df.shape


(2093637, 18)

## Users data

We now have a dataset of approximately 2.1 million users.

In [41]:
df.shape

(2093637, 18)

### Cleaning

There are a lot of empty values in this dataset.

In [52]:
df.isnull().sum()

id                     0
title                  0
fname            1609139
lname            1620055
created                0
da_membership    2091243
slack            2090244
mentors          2086708
countries         807238
language         2080667
languages        1986539
timezone         1426273
region           1426273
city             1426273
organizations    1824671
industries       2093637
contributions    2064886
events           2079715
registered_on          0
dtype: int64
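
Raw counts are easier to compare as percentages of the total row count. The sketch below redoes that arithmetic for a few columns, using numbers copied from the output above; on the real dataframe, `df.isnull().mean() * 100` should give the same ratios for every column.

```python
# Row count and a few null counts copied from the output above.
total_rows = 2_093_637
null_counts = {
    "fname": 1_609_139,
    "countries": 807_238,
    "timezone": 1_426_273,
    "industries": 2_093_637,
}

# Percentage of missing values per column, rounded to one decimal place.
null_pct = {col: round(100 * n / total_rows, 1) for col, n in null_counts.items()}

for col, pct in null_pct.items():
    print(f"{col:<12} {pct:5.1f}%")
```

Note that `industries` is empty for every single user, so that column carries no information in this snapshot.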

We can normalize empty values to `None` and derive proper datetime values for the registration date.

In [None]:
# Cleaning data.
df = df.replace({pd.NA: None})

# Replace empty arrays with None.
for col in df.columns:
    if df[col].dtype == 'O':
        df[col] = df[col].apply(lambda x: None if (x is None or len(x) == 0) else x)
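
The lambda above calls `len()` on every non-`None` value, which assumes each value in an object column is a list, array, or string. A slightly more defensive variant is sketched here on a toy frame; `toy` and `empty_to_none` are illustrative names and the values are made up.

```python
import pandas as pd

# Toy frame standing in for the users dataset (values are made up).
toy = pd.DataFrame({
    "title": ["alice", "bob", "carol"],
    "countries": [["BE"], [], None],
})


def empty_to_none(value):
    """Return None for missing values and empty sequences, else the value unchanged."""
    if value is None:
        return None
    # Only sized objects (lists, arrays, strings) can be empty.
    if hasattr(value, "__len__") and len(value) == 0:
        return None
    return value


toy["countries"] = toy["countries"].apply(empty_to_none)
print(toy["countries"].isna().tolist())  # [False, True, True]
```

Values without a length, such as stray numbers, pass through unchanged instead of raising a `TypeError`.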


In [None]:
import datetime

# Add formatted registration date.
if 'registered_on' not in df.columns:
    df['registered_on'] = df['created'].apply(lambda d: datetime.datetime.fromtimestamp(d))
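
Note that `datetime.datetime.fromtimestamp()` without a `tz` argument converts to the local time zone of the machine running the notebook, so the resulting values depend on where it is run. If time-zone-independent values are preferred, one option is a vectorized conversion to UTC with `pd.to_datetime`. This is a sketch on a two-row series, not what the cell above does; `created` and `registered_on_utc` are illustrative names.

```python
import pandas as pd

# Unix timestamps in seconds; 986038980 is the `created` value of the first users.
created = pd.Series([986038980, 986038980], name="created")

# Vectorized conversion to timezone-aware UTC datetimes.
registered_on_utc = pd.to_datetime(created, unit="s", utc=True)

print(registered_on_utc.iloc[0])  # 2001-03-31 11:43:00+00:00
```

The same timestamp appears as `13:43:00` in the data shown in this notebook, which suggests the notebook was run on a machine two hours ahead of UTC.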

Now let's take a look at the actual data:

In [None]:
df.head()

| | id | title | fname | lname | created | da_membership | slack | mentors | countries | language | languages | timezone | region | city | organizations | industries | contributions | events | registered_on |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | dries | Dries | Buytaert | 986038980 | Current | | | [BE] | | [Dutch, English] | America/New_York | America | New_York | [434463, 502475, 1291956] | | | | 2001-03-31 13:43:00 |
| 1 | 2 | Kjartan | Kjartan | Mannes | 986038980 | | | | [NO] | | [English, French, Norwegian Bokmål] | Europe/Oslo | Europe | Oslo | [434465] | | [patches, modules, issues, drupalorg, document... | [antwerp_2005, brussels_2006, denver_2012, mun... | 2001-03-31 13:43:00 |
| 2 | 3 | Drupal | | | 986038980 | | | | | | | | | | | | | | 2001-03-31 13:43:00 |
| 3 | 4 | gnudist | | | 986038980 | | | | | | | | | | | | | | 2001-03-31 13:43:00 |
| 4 | 5 | bitziz | | | 986038980 | | | | | | | | | | | | | | 2001-03-31 13:43:00 |


The only numerical data we can *describe* is the registration date.

In [50]:
df['registered_on'].describe()

count                          2093637
mean     2012-08-26 21:11:03.192140032
min                2001-03-31 13:43:00
25%                2009-11-06 16:05:58
50%                2012-01-10 21:21:42
75%                2014-06-21 16:46:53
max                2025-03-26 20:30:33
Name: registered_on, dtype: object

---

@todo Define next steps of this exploratory analysis