### Data

We will use data from the file `../data/rowiki_2006.txt` in the `data` repository on GitHub. 
The data were originally obtained from the Wikipedia XML Dumps (https://dumps.wikimedia.org/mirrors.html) and include every article edit made on Romanian Wikipedia from its launch through the end of 2006. Each line in the file is an edit and includes the title of the edited article, the time when the edit was submitted, whether the edit was a revert, the version of the article, and the user who submitted the edit. To detect the article versions, a hash was calculated for the complete article text following each revision and the hashes were compared between edits.

The table below describes the variables in the data:

| Variable   | Explanation   
|:-----------|:-------
| title      | title of the edited article               
| time       | time in the format YYYY-MM-DD HH:MM:SS when the edit was completed  
| revert     | 1 if the edit was detected to revert to a previous article version, 0 otherwise 
| version    | an integer indicating a unique state of the article, generally increasing over time; -1 indicates the article was empty (usually due to vandalism); if the same number appears more than once, then the article was exactly in the same state at these different time points  
| user       | the editor's username or if not logged in, the editor's IP address  
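
As a minimal sketch of how one line of the file maps onto these variables, the helper below splits a line on tabs and converts the fields to typed values. The function name `parse_edit` is hypothetical, and stripping the trailing space from the title is an assumption based on the sample output; the example line is taken from the file's first rows.

```python
from datetime import datetime

def parse_edit(line):
    """Turn one tab-separated line into a dict with typed fields."""
    title, time, revert, version, user = line.rstrip("\n").split("\t")
    return {
        "title": title.strip(),  # titles in the file carry a trailing space
        "time": datetime.strptime(time, "%Y-%m-%d %H:%M:%S"),
        "revert": int(revert),   # 1 = revert, 0 = ordinary edit
        "version": int(version), # -1 marks an empty article
        "user": user,
    }

edit = parse_edit("Tata,_Ungaria \t2006-09-12 22:35:24\t0\t14\tMishuletz\n")
print(edit["time"].year, edit["version"], edit["user"])
```

Converting `time` to a `datetime` is optional: strings in the format YYYY-MM-DD HH:MM:SS already sort chronologically.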

### 1. Who reverted whom?

Your goal is to create a network (e.g., an edge list), where an edge goes from the editor who restored an earlier version of the article (the "reverter") to the editor who made the revision immediately after that version (the "reverted"). For every edge, you should know who the reverter was, who got reverted, when the revert occurred, and what the "seniority" of the reverter and the reverted were at that point in time.

We will ignore the article titles for the analyses so you don't need to save these.

In addition, you will need to clean up the self-reverts – we will not use them in the analyses here.

We will estimate seniority $s_i$ of editor $i$ as the base-ten logarithm of the number of edits $i$ has completed by the time of the revert in question. Transforming the number of edits with the logarithm makes sense because edit counts follow a power-law distribution (the majority of individuals have very few edits, while a handful of individuals are responsible for most of the work). This operationalization allows us to express the difference in seniority between two editors as the base-ten logarithm of the ratio of their numbers of edits, since $s_i - s_j = \log_{10} e_i - \log_{10} e_j = \log_{10} \frac{e_i}{e_j}$, where $e_i$ is the number of edits of editor $i$ and $e_j$ is the number of edits of editor $j$. In essence, we assume that an editor who has 10 edits compares to one with 100 edits the same way that an editor with 1,000 edits compares to one with 10,000.
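
The definition above can be checked with a short sketch. The helper name `seniority` and the toy user list are made up for illustration. The running counter shows one possible way to track "edits completed so far" while walking through edits in chronological order; it assumes the current edit is included in the count, so the logarithm is never taken of zero.

```python
import math
from collections import defaultdict

def seniority(n_edits):
    """Base-ten logarithm of an editor's edit count (requires n_edits >= 1)."""
    return math.log10(n_edits)

# Running edit counts over a made-up, chronologically ordered list of users.
edits_so_far = defaultdict(int)
for user in ["ann", "bob", "ann", "ann", "bob"]:
    edits_so_far[user] += 1

# The seniority difference depends only on the ratio of edit counts:
# 100 vs 10 edits gives the same gap as 10,000 vs 1,000 edits.
diff_low = seniority(100) - seniority(10)
diff_high = seniority(10_000) - seniority(1_000)
print(math.isclose(diff_low, diff_high))
print(dict(edits_so_far))
```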

**Print the first 5 data points** in your network (what these look like will depend on the data type you are using).

Then **print the number of nodes and edges** in the network.

#### Hints

There are multiple ways to save the network data: you can use a single list, or multiple lists, or a list and dictionaries, or just dictionaries, or create your own network class. You should consider how you are going to use the data to decide on a reasonable data structure.
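
One of the options mentioned above, a single list of records, could look like the sketch below. The `Edge` field names and the three example edges are invented for illustration and do not come from the data. Counting each revert as its own edge, even when the same pair of editors appears more than once, is an assumption.

```python
from collections import namedtuple

# Hypothetical record type: one entry per revert.
Edge = namedtuple("Edge", ["reverter", "reverted", "time", "s_reverter", "s_reverted"])

# Made-up example edges, not taken from the real file.
edges = [
    Edge("A", "B", "2006-01-01 10:00:00", 2.0, 0.0),
    Edge("C", "B", "2006-01-02 11:30:00", 1.0, 0.3),
    Edge("A", "B", "2006-01-03 09:15:00", 2.1, 0.5),
]

# Nodes are all editors that appear at either end of an edge.
nodes = {e.reverter for e in edges} | {e.reverted for e in edges}

print(edges[:5])
print(len(nodes), "nodes,", len(edges), "edges")
```

A list of named tuples keeps each edge readable (`e.reverter`, `e.time`) while still supporting slicing for the "first 5 data points" printout.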

What to consider:
- no pandas, scikit-learn, etc.
- no self-reverts (e.g., filter by username)
- no reverts without a change in version: it is not a revert if old_version == new_version
- create an edge list (network) from reverter to reverted, storing the names, the time of the revert, and the seniority of both editors
- for seniority, what matters is the total number of edits each editor has made up to that time
- drop the title values
- estimate the seniority $s_i$ of editor $i$ as $\log_{10} e_i$, where $e_i$ is the number of edits of $i$ at the time of the revert
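
The considerations above can be sketched as a single chronological pass over the edits, using only the standard library. This is one possible approach under stated assumptions, not a reference solution: the function name `build_edges` is hypothetical, the restored version is matched to its most recent earlier occurrence in the same article, and an editor's edit count includes the edit being processed. The toy edits are made up.

```python
import math
from collections import defaultdict

def build_edges(edits):
    """edits: iterable of (title, time, revert, version, user) tuples."""
    history = defaultdict(list)  # title -> [(version, user), ...], oldest first
    n_edits = defaultdict(int)   # user -> number of edits completed so far
    edges = []
    # Timestamps in YYYY-MM-DD HH:MM:SS format sort chronologically as strings.
    for title, time, revert, version, user in sorted(edits, key=lambda e: e[1]):
        n_edits[user] += 1
        past = history[title]
        if revert == 1:
            # Most recent earlier position that holds the restored version.
            j = next((k for k in range(len(past) - 1, -1, -1)
                      if past[k][0] == version), None)
            # j + 1 < len(past) skips "reverts" that do not change the version.
            if j is not None and j + 1 < len(past):
                reverted = past[j + 1][1]  # editor of the revision right after
                if reverted != user:       # drop self-reverts
                    edges.append((user, reverted, time,
                                  math.log10(n_edits[user]),
                                  math.log10(n_edits[reverted])))
        past.append((version, user))
    return edges

toy_edits = [
    ("X", "2006-01-01 10:00:00", 0, 1, "ann"),
    ("X", "2006-01-01 11:00:00", 0, 2, "bob"),
    ("X", "2006-01-01 12:00:00", 1, 1, "cat"),  # cat reverts bob
    ("Y", "2006-01-02 10:00:00", 0, 1, "ann"),
    ("Y", "2006-01-02 11:00:00", 0, 2, "ann"),
    ("Y", "2006-01-02 12:00:00", 1, 1, "ann"),  # self-revert, ignored
]
toy_edges = build_edges(toy_edits)
print(toy_edges)
```

Sorting all edits by time, rather than article by article, lets the same loop keep the per-article version history and the global edit counts needed for seniority.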

In [1]:

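# Open the text file and read all lines into a list.
# The file is assumed to be UTF-8 encoded (titles contain non-ASCII characters).
with open('../data/rowiki_2006.txt', 'r', encoding='utf-8') as file:
    data = file.readlines()

print(data[:100])
type(data)

# Keep a small sample for quick tests
data_small = data[:300]
print(data_small)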

['title\ttime\trevert\tversion\tuser\n', 'Academiei_Române \t2004-05-24 19:07:47\t0\t1\tBogdan\n', 'Tata,_Ungaria \t2006-09-12 22:35:24\t0\t14\tMishuletz\n', 'Tata,_Ungaria \t2006-08-04 09:12:15\t0\t13\tYurikBot\n', 'Tata,_Ungaria \t2006-07-21 21:13:26\t0\t12\tYurikBot\n', 'Tata,_Ungaria \t2006-07-21 14:54:46\t0\t11\tOvidiu.\n', 'Tata,_Ungaria \t2006-07-21 14:50:59\t0\t10\tOvidiu.\n', 'Tata,_Ungaria \t2006-07-16 21:22:29\t0\t9\tOvidiu.\n', 'Tata,_Ungaria \t2006-07-16 21:17:23\t0\t8\tOvidiu.\n', 'Tata,_Ungaria \t2006-07-16 20:59:10\t0\t7\tOvidiu.\n', 'Tata,_Ungaria \t2006-07-16 20:55:48\t0\t6\tOvidiu.\n', 'Tata,_Ungaria \t2006-07-16 20:54:06\t0\t5\tOvidiu.\n', 'Tata,_Ungaria \t2006-07-16 20:51:03\t0\t4\tOvidiu.\n', 'Tata,_Ungaria \t2006-07-16 20:39:46\t0\t3\tOvidiu.\n', 'Tata,_Ungaria \t2006-07-16 20:37:27\t0\t2\tOvidiu.\n', 'Tata,_Ungaria \t2006-07-16 20:34:07\t0\t1\tOvidiu.\n', 'Rakke \t2006-09-23 08:19:48\t0\t5\tMishuletz\n', 'Rakke \t2006-06-29 22:11:00\t0\t4\t82.181.169.249\n', 'Ra

In [None]:
# Store the sample as a nested list, skipping the header line
data_small_list = [line.strip().split('\t') for line in data_small[1:]]
# data_small_list is now a list of lists; each inner list holds the fields of one line

print(data_small_list[:10])


# Same for the full data set
data_list = [line.strip().split('\t') for line in data[1:]]

print(data_list[:10])

In [3]:
# Initialize an empty dictionary to store column-wise data

wiki_dict = {
    "title": [],
    "time": [],
    "revert": [],
    "version": [],
    "user": [],
}

# Process each line in data, skipping the header row
for line in data[1:]:
    # Split the line into fields using tab as a separator
    fields = line.strip().split('\t')

    # Populate the dictionary with column-wise data
    wiki_dict["title"].append(fields[0])
    wiki_dict["time"].append(fields[1])
    wiki_dict["revert"].append(fields[2])
    wiki_dict["version"].append(fields[3])
    wiki_dict["user"].append(fields[4])

# Print the resulting dictionary
print(wiki_dict["title"][:10])


['Academiei_Române ', 'Tata,_Ungaria ', 'Tata,_Ungaria ', 'Tata,_Ungaria ', 'Tata,_Ungaria ', 'Tata,_Ungaria ', 'Tata,_Ungaria ', 'Tata,_Ungaria ', 'Tata,_Ungaria ', 'Tata,_Ungaria ']


In [2]:
wiki_dict_small = {
    "title": [],
    "time": [],
    "revert": [],
    "version": [],
    "user": [],
}

# Process each line in data_small, skipping the header row
for line in data_small[1:]:
    # Split the line into fields using tab as a separator
    fields = line.strip().split('\t')

    # Populate the dictionary with column-wise data
    wiki_dict_small["title"].append(fields[0])
    wiki_dict_small["time"].append(fields[1])
    wiki_dict_small["revert"].append(fields[2])
    wiki_dict_small["version"].append(fields[3])
    wiki_dict_small["user"].append(fields[4])

# Print the resulting dictionary
print(wiki_dict_small["title"][:10])

print(type(wiki_dict_small))

print(wiki_dict_small.keys())

['Academiei_Române ', 'Tata,_Ungaria ', 'Tata,_Ungaria ', 'Tata,_Ungaria ', 'Tata,_Ungaria ', 'Tata,_Ungaria ', 'Tata,_Ungaria ', 'Tata,_Ungaria ', 'Tata,_Ungaria ', 'Tata,_Ungaria ']
<class 'dict'>
dict_keys(['title', 'time', 'revert', 'version', 'user'])


In [None]:
# pandas is used here only to inspect the data; the solution itself must not rely on it
import pandas as pd

wiki_df = pd.DataFrame(wiki_dict)

wiki_df_small = pd.DataFrame(wiki_dict_small)
print(wiki_df_small.head())

## accessing variables in a dataframe
wiki_df_small.revert

In [None]:
# test code trying the +1 version
# correct 


# pseudocode
# For each edit with revert == 1:
#   take the version number the article has after the edit and find the previous time this version occurred;
#   the reverted editor is the one who made the edit immediately after that earlier occurrence.
# If the reverter and the reverted are different users, record the revert.


# Known problem: the first of two consecutive reverts is not captured

reverted_by_other = []  # Create an empty list to store the reverts
reverter = None
reverted = None
date_revert = None
date_reverted = None
new_version = None
old_version = None

# Iterate over items in the dictionary
for i in range(len(wiki_dict['user'])):
    user = wiki_dict['user'][i]
    date = wiki_dict['time'][i]
    revert = wiki_dict['revert'][i]
    version = wiki_dict['version'][i]
    title = wiki_dict['title'][i]
    
    if revert == "1":
        if reverter is None:
            reverter = user
            date_revert = date
            new_version = version
        else:  
            #reverter = reverter
            #date_revert = date_revert
            #new_version = version  
            #reverted = user 
            #date_reverted = date 

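            if int(new_version) == int(version):
                reverted = wiki_dict['user'][i-1]
                date_reverted = wiki_dict['time'][i-1]
                # Skip self-reverts and store the same five fields as the branch below
                if reverter != reverted:
                    reverted_by_other.append((reverter, date_revert, new_version, reverted, date_reverted))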

            reverter = user
            date_revert = date
            new_version = version

    else:
        if reverter is not None:
            #reverted = reverter
            #date_revert = date_revert
            #new_version = new_version
            if int(new_version)  == int(version):  
                # Access the stored values from before updating new_version
                reverted = wiki_dict['user'][i-1]
                date_reverted = wiki_dict['time'][i-1]
                if reverter != reverted:
                    reverted_by_other.append((reverter, date_revert, new_version, reverted, date_reverted))
            
                reverter = None
                date_revert = None
            

# Check if the last edit was a revert
#if reverter is not None and reverted is not None and reverter != reverted:
 #   reverted_by_other_small.append((reverter, date_revert, new_version, reverted, date_reverted))

print(reverted_by_other)
print(len(reverted_by_other))

print(wiki_df[307:315])
