# Data Preparation

Before the analyses can be carried out, the DataFrame generated from the CSV file has to be prepared.

In [1]:
import pandas as pd

In [5]:
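# Fields are separated by '#' and quoted with '|'; malformed lines raise a warning and are skipped
df = pd.read_csv('enwiki-p1p857.csv', quotechar='|', sep='#', engine='python', on_bad_lines='warn')
# Parse the ISO 8601 timestamps (e.g. 2001-01-21T02:12:21Z) into datetime64 values
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%dT%H:%M:%SZ')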
df.head()

Unnamed: 0,page_id,page_title,revision_id,timestamp,comment,contributor_id,contributor_name,bytes,revtext
0,10,AccessibleComputing,233192,2001-01-21 02:12:21,*,99,RoseParks,124,This subject covers* AssistiveTechnology* Acce...
1,10,AccessibleComputing,862220,2002-02-25 15:43:11,Automated conversion,1226483,Conversion script,35,#REDIRECT [[Accessible Computing]]
2,10,AccessibleComputing,15898945,2003-04-25 22:18:38,Fixing redirect,7543,Ams80,34,#REDIRECT [[Accessible_computing]]
3,10,AccessibleComputing,56681914,2006-06-03 16:55:41,fix double redirect,516514,Nzd,36,#REDIRECT [[Computer accessibility]]
4,10,AccessibleComputing,74466685,2006-09-08 04:16:04,cat rd,750223,Rory096,57,#REDIRECT [[Computer accessibility]] {{R from ...
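
The effect of the `sep` and `quotechar` arguments can be tried out on a small in-memory sample. The following is only a sketch: the two sample rows and the name `demo` are made up for illustration and cover just a few of the columns of the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same layout as the real file:
# '#' separates the fields, '|' quotes a field that itself contains a '#'.
sample = (
    "page_id#page_title#timestamp#revtext\n"
    "10#AccessibleComputing#2002-02-25T15:43:11Z#|#REDIRECT [[Accessible Computing]]|\n"
    "10#AccessibleComputing#2003-04-25T22:18:38Z#|#REDIRECT [[Accessible_computing]]|\n"
)

demo = pd.read_csv(io.StringIO(sample), sep="#", quotechar="|", engine="python")

# Convert the ISO 8601 strings into datetime64 values
demo["timestamp"] = pd.to_datetime(demo["timestamp"], format="%Y-%m-%dT%H:%M:%SZ")
print(demo.dtypes)
```

Because the redirect text is wrapped in `|`, its leading `#` is read as part of the field and not as a separator.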


First, we take a look at the data in the DataFrame:

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 548305 entries, 0 to 548304
Data columns (total 9 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   page_id           548305 non-null  int64         
 1   page_title        548305 non-null  object        
 2   revision_id       548305 non-null  int64         
 3   timestamp         548305 non-null  datetime64[ns]
 4   comment           448614 non-null  object        
 5   contributor_id    548291 non-null  object        
 6   contributor_name  375963 non-null  object        
 7   bytes             548305 non-null  int64         
 8   revtext           546854 non-null  object        
dtypes: datetime64[ns](1), int64(3), object(5)
memory usage: 37.6+ MB


In [11]:
df.describe()

Unnamed: 0,page_id,revision_id,bytes
count,548305.0,548305.0,548305.0
mean,615.323158,351128400.0,65480.75
std,223.177424,298081800.0,53293.9
min,10.0,443.0,0.0
25%,579.0,95938740.0,27517.0
50%,689.0,269724000.0,52817.0
75%,771.0,552128100.0,88660.0
max,857.0,1069247000.0,1788058.0


In [12]:
df.isna().sum()

page_id                  0
page_title               0
revision_id              0
timestamp                0
comment              99691
contributor_id          14
contributor_name    172342
bytes                    0
revtext               1451
dtype: int64
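
The absolute counts can also be expressed as a share of all rows, which makes columns of different sparsity easier to compare. A minimal sketch on a made-up frame (`toy` and its values are assumptions, not data from the file):

```python
import pandas as pd

# Hypothetical frame with a few missing entries
toy = pd.DataFrame({
    "comment": ["fix", None, "typo", None],
    "contributor_name": ["A", "B", None, "C"],
})

# isna() yields booleans; their mean is the fraction of missing values per column
missing_share = toy.isna().mean() * 100
print(missing_share)
```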

First, the column `timestamp` is split into the date and the time:

In [13]:
df['dates'] = df['timestamp'].dt.date
df['times'] = df['timestamp'].dt.time
df.head()

Unnamed: 0,page_id,page_title,revision_id,timestamp,comment,contributor_id,contributor_name,bytes,revtext,dates,times
0,10,AccessibleComputing,233192,2001-01-21 02:12:21,*,99,RoseParks,124,This subject covers* AssistiveTechnology* Acce...,2001-01-21,02:12:21
1,10,AccessibleComputing,862220,2002-02-25 15:43:11,Automated conversion,1226483,Conversion script,35,#REDIRECT [[Accessible Computing]],2002-02-25,15:43:11
2,10,AccessibleComputing,15898945,2003-04-25 22:18:38,Fixing redirect,7543,Ams80,34,#REDIRECT [[Accessible_computing]],2003-04-25,22:18:38
3,10,AccessibleComputing,56681914,2006-06-03 16:55:41,fix double redirect,516514,Nzd,36,#REDIRECT [[Computer accessibility]],2006-06-03,16:55:41
4,10,AccessibleComputing,74466685,2006-09-08 04:16:04,cat rd,750223,Rory096,57,#REDIRECT [[Computer accessibility]] {{R from ...,2006-09-08,04:16:04
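
One point worth knowing about this split: `.dt.date` and `.dt.time` return plain Python `date` and `time` objects, so the new columns have dtype `object` and no longer support the `.dt` accessor. If datetime arithmetic on the day is needed later, `.dt.normalize()` is one possible alternative. The sketch below uses a made-up two-element Series (`ts`) to show the difference:

```python
import pandas as pd

# Hypothetical timestamps taken from the first two rows shown above
ts = pd.Series(pd.to_datetime(["2001-01-21 02:12:21", "2002-02-25 15:43:11"]))

dates = ts.dt.date        # Python datetime.date objects, dtype object
times = ts.dt.time        # Python datetime.time objects, dtype object
days = ts.dt.normalize()  # still datetime64, time part set to midnight

print(dates.dtype, times.dtype, days.dtype)
```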


To get the total number of bytes across all revisions of an article, we create a column named `revbytes`.

In [14]:
df['revbytes'] = df.groupby(by='page_title')['bytes'].transform('sum')
df.head()

Unnamed: 0,page_id,page_title,revision_id,timestamp,comment,contributor_id,contributor_name,bytes,revtext,dates,times,revbytes
0,10,AccessibleComputing,233192,2001-01-21 02:12:21,*,99,RoseParks,124,This subject covers* AssistiveTechnology* Acce...,2001-01-21,02:12:21,2848
1,10,AccessibleComputing,862220,2002-02-25 15:43:11,Automated conversion,1226483,Conversion script,35,#REDIRECT [[Accessible Computing]],2002-02-25,15:43:11,2848
2,10,AccessibleComputing,15898945,2003-04-25 22:18:38,Fixing redirect,7543,Ams80,34,#REDIRECT [[Accessible_computing]],2003-04-25,22:18:38,2848
3,10,AccessibleComputing,56681914,2006-06-03 16:55:41,fix double redirect,516514,Nzd,36,#REDIRECT [[Computer accessibility]],2006-06-03,16:55:41,2848
4,10,AccessibleComputing,74466685,2006-09-08 04:16:04,cat rd,750223,Rory096,57,#REDIRECT [[Computer accessibility]] {{R from ...,2006-09-08,04:16:04,2848
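
Unlike a plain `groupby(...).sum()`, which returns one row per group, `transform('sum')` returns a result aligned with the original rows, so every revision carries the total of its article. A small sketch with invented numbers (the frame `toy` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical revisions of two articles
toy = pd.DataFrame({
    "page_title": ["A", "A", "B"],
    "bytes": [100, 300, 50],
})

# One row per group
per_article = toy.groupby("page_title")["bytes"].sum()

# Same length as toy: the group total is repeated on every row of the group
toy["revbytes"] = toy.groupby("page_title")["bytes"].transform("sum")
print(toy)
```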


The new column `revperc` gives the percentage of an article's total revision bytes that each individual revision, and thus its contributor, accounts for.

In [15]:
df['revperc'] = 100/df['revbytes']*df['bytes']
df.head()

Unnamed: 0,page_id,page_title,revision_id,timestamp,comment,contributor_id,contributor_name,bytes,revtext,dates,times,revbytes,revperc
0,10,AccessibleComputing,233192,2001-01-21 02:12:21,*,99,RoseParks,124,This subject covers* AssistiveTechnology* Acce...,2001-01-21,02:12:21,2848,4.353933
1,10,AccessibleComputing,862220,2002-02-25 15:43:11,Automated conversion,1226483,Conversion script,35,#REDIRECT [[Accessible Computing]],2002-02-25,15:43:11,2848,1.228933
2,10,AccessibleComputing,15898945,2003-04-25 22:18:38,Fixing redirect,7543,Ams80,34,#REDIRECT [[Accessible_computing]],2003-04-25,22:18:38,2848,1.19382
3,10,AccessibleComputing,56681914,2006-06-03 16:55:41,fix double redirect,516514,Nzd,36,#REDIRECT [[Computer accessibility]],2006-06-03,16:55:41,2848,1.264045
4,10,AccessibleComputing,74466685,2006-09-08 04:16:04,cat rd,750223,Rory096,57,#REDIRECT [[Computer accessibility]] {{R from ...,2006-09-08,04:16:04,2848,2.001404
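
A quick sanity check for such a column is that the percentages of each article add up to 100. The sketch below repeats the calculation on a made-up frame (`toy` is hypothetical) and sums the result per article:

```python
import pandas as pd

# Hypothetical revisions of two articles
toy = pd.DataFrame({
    "page_title": ["A", "A", "B"],
    "bytes": [100, 300, 50],
})
toy["revbytes"] = toy.groupby("page_title")["bytes"].transform("sum")

# Share of each revision in the total bytes of its article, in percent
toy["revperc"] = toy["bytes"] / toy["revbytes"] * 100

# Per article, the shares should sum to (approximately) 100
check = toy.groupby("page_title")["revperc"].sum()
print(check)
```

Note that an article whose revisions all have 0 bytes would lead to a division by zero here; the toy data does not cover that case.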


Now we replace the missing values in the columns `comment`, `contributor_id`, `contributor_name` and `revtext` with placeholder strings.

In [17]:
df['comment'] = df['comment'].fillna('no comment')
df['contributor_id'] = df['contributor_id'].fillna('no id')
df['contributor_name'] = df['contributor_name'].fillna('anon')
df['revtext'] = df['revtext'].fillna('no text')
df.isna().sum()

page_id             0
page_title          0
revision_id         0
timestamp           0
comment             0
contributor_id      0
contributor_name    0
bytes               0
revtext             0
dates               0
times               0
revbytes            0
revperc             0
dtype: int64
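
The same replacement can also be written as a single call by passing `fillna` a dictionary that maps column names to fill values. This is only an equivalent sketch on a made-up frame (`toy` is an assumption), not a change to the cell above:

```python
import pandas as pd

# Hypothetical frame with missing entries in two columns
toy = pd.DataFrame({
    "comment": ["fix", None],
    "contributor_name": [None, "Ams80"],
})

# One fill value per column; columns not listed in the dict are left untouched
toy = toy.fillna({"comment": "no comment", "contributor_name": "anon"})
print(toy)
```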

To make this prepared DataFrame available for future analyses, we write it to a new CSV file.

In [18]:
df.to_csv('p1p857.csv')
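
When the file is read back later, two things are worth keeping in mind: with its default settings, `to_csv` writes the index as an unnamed first column, and datetime columns come back as plain strings unless they are parsed again. The sketch below shows one way to handle both, using an in-memory buffer and a made-up frame (`toy`, `buffer` and `reloaded` are hypothetical names) in place of the real file:

```python
import io
import pandas as pd

# Hypothetical frame standing in for the prepared DataFrame
toy = pd.DataFrame({
    "page_id": [10, 10],
    "timestamp": pd.to_datetime(["2001-01-21 02:12:21", "2002-02-25 15:43:11"]),
})

buffer = io.StringIO()
toy.to_csv(buffer)   # default settings: the index is written as an unnamed first column
buffer.seek(0)

# index_col=0 restores the index, parse_dates converts the strings back to datetime64
reloaded = pd.read_csv(buffer, index_col=0, parse_dates=["timestamp"])
print(reloaded.dtypes)
```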