# Exploratory Data Analysis

Exploratory Data Analysis (EDA) of the Drupal Community.

## Installation

As explained in the README file, you should have already fetched the datasets locally. 

See `../README.md` and `../script/fetch_drupal_data.sh`.

<details>
    <summary>Click to check data folder structure</summary>
    
```bash
data
├── csv
│   └── countries.csv
└── json
    ├── event.json
    ├── organization.json
    ├── pages_event
    │   ├── page_0.json
    │   └── page_x.json
    ├── pages_organization
    │   ├── page_0.json
    │   └── page_x.json
    ├── pages_user
    │   ├── page_0.json
    │   └── page_x.json
    └── user.json

6 directories, XXXX files
```
</details>


In [1]:
import datetime
import pandas as pd
import matplotlib.pyplot as plt

## Users

We have extracted the user data exposed by the Drupal.org (aka _d.o_) REST API [at this endpoint](https://www.drupal.org/api-d7/user.json?sort=uid&direction) and compiled it into a single dataset named `user.json`.

This file contains 16 fields for each user, as explained in the cell below.

In [None]:
# Load user data.
# This is a large file. Loading can easily take more than 2 minutes.
dtypes = {
    "id": "int32",
    "title": "string",
    "fname": "string",
    "lname": "string",
    "created": "int32",
    "da_membership": "string",
    "slack": "string",
    "timezone": "string",
    "region": "string",
    "mentors": "object",
    "countries": "object",
    "languages": "object",
    "organizations": "object",
    "industries": "object",
    "contributions": "object",
    "events": "object",
}

df = pd.read_json('../data/json/user.json', dtype=dtypes)

In [4]:
# Add formatted registration date.
if 'registered_on' not in df.columns:
    df['registered_on'] = df['created'].apply(lambda d: datetime.datetime.fromtimestamp(d))
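
Note that `datetime.datetime.fromtimestamp` without a `tz` argument returns a naive datetime in the local timezone of the machine running the notebook, so `registered_on` can shift by a few hours from one machine to another. A minimal sketch of a timezone-independent conversion, assuming UTC is the desired reference (the helper name `to_utc` is hypothetical and not part of this notebook):

```python
import datetime


def to_utc(ts: int) -> datetime.datetime:
    """Convert a Unix timestamp (seconds) to a timezone-aware UTC datetime."""
    return datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc)


# 986038980 is the `created` value of the earliest accounts in this dataset.
first = to_utc(986038980)
print(first.isoformat())
```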

In [10]:
# Cleaning data.
df = df.replace({pd.NA: None})

# Replace empty arrays with None.
for col in df.columns:
    if df[col].dtype == 'O':
        df[col] = df[col].apply(lambda x: None if (x is None or len(x) == 0) else x)
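
The lambda above calls `len()` on every non-null value, which raises a `TypeError` if an object column ever holds a scalar such as an integer. A slightly more defensive sketch, where `empty_to_none` is a hypothetical helper name and the sample values are made up for illustration:

```python
from collections.abc import Sized


def empty_to_none(value):
    """Return None for None and for empty containers or strings; keep anything else."""
    if value is None:
        return None
    if isinstance(value, Sized) and len(value) == 0:
        return None
    return value


samples = [["BE"], [], "", "Current", None, 42]
cleaned = [empty_to_none(v) for v in samples]
print(cleaned)

# In the notebook this could be applied as: df[col] = df[col].apply(empty_to_none)
```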


In [14]:
df.shape

(2093461, 19)

In [12]:
# Display a sample
df.head(3)

| | id | title | fname | lname | created | da_membership | slack | mentors | countries | language | languages | timezone | region | city | organizations | industries | contributions | events | registered_on |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | dries | Dries | Buytaert | 986038980 | Current | | | [BE] | | [Dutch, English] | America/New_York | America | New_York | [434463, 502475, 1291956] | | | | 2001-03-31 13:43:00 |
| 1 | 2 | Kjartan | Kjartan | Mannes | 986038980 | | | | [NO] | | [English, French, Norwegian Bokmål] | Europe/Oslo | Europe | Oslo | [434465] | | [patches, modules, issues, drupalorg, document... | [antwerp_2005, brussels_2006, denver_2012, mun... | 2001-03-31 13:43:00 |
| 2 | 3 | Drupal | | | 986038980 | | | | | | | | | | | | | | 2001-03-31 13:43:00 |


In [13]:
# Save cleaned data.
df.to_csv("../data/csv/user_data.csv", index=False)
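
One caveat with this export: CSV has no list type, so array columns such as `countries` are written as their string representation and come back as plain strings when the file is read again. A minimal sketch of a round trip that restores them, using an in-memory buffer instead of the real file and a hypothetical `parse_list` helper (the two rows are a made-up stand-in for the cleaned data):

```python
import ast
import io

import pandas as pd


def parse_list(cell: str):
    """Parse a stringified list such as ['BE'] back into a list; empty cells become None."""
    return ast.literal_eval(cell) if cell else None


# Tiny stand-in for the cleaned user data.
demo = pd.DataFrame({"id": [1, 3], "countries": [["BE"], None]})

# Write to an in-memory CSV, then rewind the buffer for reading.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# `converters` receives the raw cell text for the named column.
restored = pd.read_csv(buffer, converters={"countries": parse_list})
print(restored["countries"].tolist())
```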

##
...

## Preprocessing

...

In [13]:
# Get people with mentors.
mentored_users = users[users['mentors'].notnull()]
mentored_users.sample(1000)




### Overview

Analysis of the cleaned `user.json` data file.

It contains approximately 2.1 million rows as of March 23rd, 2025.

In [None]:
# Shape, columns, and sample of the data.
display(users.shape, users.columns, users.sample(5))

In [None]:
# Data pipeline
# get a test set of 1k users.
users = users.sample(1000)

In [None]:
display("Empty values:", users.isnull().sum())





In [None]:
slackers = users[~users['slack'].isnull()]
display(slackers.shape, slackers.sample(5))





In [None]:
slackers[slackers['slack'].str.contains('mattgyver')]




In [None]:
# Display count of unique values for interesting columns.
for col in ['da_membership', 'language', 'city', 'region']:
    display(f"Column: {col}", users[col].value_counts())

# Display total of each array field with a list of values.
from collections import Counter

totals = {}
for col in users.columns:
    if users[col].dtype == 'O' and str(col) not in ['da_membership', 'language', 'city', 'region']:
        counts = Counter()
        users[col].dropna().apply(lambda x: counts.update(x))
        sorted_counts = counts.most_common()
        totals[col] = sorted_counts

for col, col_totals in totals.items():
    display(f"Column: {col}", col_totals)
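
`Counter.update` accepts any iterable, so the loop above tallies list items for array fields, but it would tally individual characters for a column holding plain strings. A small sketch of the difference, with made-up values:

```python
from collections import Counter

# Array field: each row is a list of values (or None).
languages = [["Dutch", "English"], ["English", "French"], None]

counts = Counter()
for row in languages:
    if isinstance(row, list):  # skip nulls and plain strings
        counts.update(row)
print(counts.most_common())

# Plain string: update() iterates over its characters instead.
chars = Counter()
chars.update("Oslo")
print(chars)
```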

### Cleaning

There are a lot of empty values in this dataset. Let's try to clean it!

In [None]:
# Identify SPAM users.
# There are certainly **not** 2 million active users on *d.o*.
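
One possible starting point, offered as an assumption and not as a method used in this notebook, is to count how many optional profile fields each account has filled in. Accounts with none are only candidates for a closer look. The table and the choice of `profile_cols` below are hypothetical:

```python
import pandas as pd

# Hypothetical miniature of the users table.
users_demo = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["dries", "Kjartan", "Drupal"],
    "fname": ["Dries", "Kjartan", None],
    "countries": [["BE"], ["NO"], None],
    "organizations": [[434463], [434465], None],
})

# Columns a user fills in voluntarily (an assumed selection).
profile_cols = ["fname", "countries", "organizations"]

# Number of profile fields each user has filled in.
users_demo["filled"] = users_demo[profile_cols].notnull().sum(axis=1)

# Accounts with nothing filled in are candidates for review, not proven spam.
candidates = users_demo[users_demo["filled"] == 0]
print(candidates[["id", "title"]])
```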

### Users analysis

...

#### Assumptions

#### Questions

In [None]:
# User registration.
first_user = users.sort_values('registered_on').iloc[0]
display(f"First user registered: ({first_user.id}) {first_user.title} registered on: {first_user.registered_on}")

# Number of users registered on each day.
# Number of users registered on each month.
# Number of users registered on each year.
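
A minimal sketch of how the counts listed above could be computed with pandas. The dates are hypothetical and stand in for `users['registered_on']`:

```python
import pandas as pd

# Hypothetical registration dates standing in for users["registered_on"].
registered_on = pd.Series(pd.to_datetime([
    "2001-03-31 13:43:00",
    "2001-03-31 18:05:00",
    "2001-04-02 09:10:00",
    "2002-01-15 12:00:00",
]))

# Truncate each timestamp to a period, then count and sort chronologically.
per_day = registered_on.dt.to_period("D").value_counts().sort_index()
per_month = registered_on.dt.to_period("M").value_counts().sort_index()
per_year = registered_on.dt.year.value_counts().sort_index()

print(per_day, per_month, per_year, sep="\n\n")
```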



## Geographical distribution

Exploration of how the Drupal community is distributed around the world.

In [None]:
# Plot region in bar
# Scatter cities on a map?
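
A minimal sketch of the first step, counting users per region. The values are a hypothetical sample of the `region` column:

```python
import pandas as pd

# Hypothetical sample of the `region` column.
regions = pd.Series(["Europe", "America", "Europe", None, "Asia", "Europe"])

# value_counts() drops nulls by default and sorts by frequency, highest first.
region_counts = regions.value_counts()
print(region_counts)

# region_counts.plot.bar() would then draw the bar chart (requires matplotlib).
```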

## Expertise

Analyze the **expertise** of the user community.

* Are certain professional sectors over-represented?
* Are there regions of the world where certain areas of expertise are favored?

In [None]:
# Group by `field_industries_worked_in` and histplot user counts
# Segment by `field_country` 
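
Because `industries` and `countries` hold lists, they need to be flattened before grouping. A minimal sketch using `DataFrame.explode`, with hypothetical users and industry names:

```python
import pandas as pd

# Hypothetical users; both columns hold lists, as in the cleaned dataset.
demo = pd.DataFrame({
    "id": [1, 2, 3],
    "industries": [["Education", "Media"], ["Education"], None],
    "countries": [["BE"], ["NO", "BE"], ["FR"]],
})

# One row per (user, industry, country) combination; rows missing either value are dropped.
flat = (
    demo.explode("industries")
        .reset_index(drop=True)
        .explode("countries")
        .dropna(subset=["industries", "countries"])
)

# Distinct users per industry, then the same counts segmented by country.
per_industry = flat.groupby("industries")["id"].nunique()
per_industry_country = pd.crosstab(flat["industries"], flat["countries"])
print(per_industry, per_industry_country, sep="\n\n")
```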

## Mentors and mentees

Exploration of mentorship within the Drupal community.
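
A possible first step is counting mentees per mentor. The sketch below assumes that `mentors` holds a list of mentor user ids for each user, which this notebook does not confirm. The data is hypothetical:

```python
import pandas as pd

# Hypothetical data: each user lists the ids of their mentors (assumed format).
demo = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "mentors": [[1, 2], [1], None, [2, 1]],
})

# One row per (mentee, mentor) pair; users without mentors are dropped first.
pairs = demo.dropna(subset=["mentors"]).explode("mentors")

# Number of distinct mentees per mentor, most active mentors first.
mentees_per_mentor = (
    pairs.groupby("mentors")["id"].nunique().sort_values(ascending=False)
)
print(mentees_per_mentor)
```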