# Citations Notebook

This notebook helps you use JSTOR Data for Research (DfR) metadata for journal runs to create citation networks. It takes as input a CSV file produced from the DfR metadata. The important columns for this program are "article_author" and "citation_general", the latter being the entire text of each "mixed-citation" field in the DfR metadata files.

In [None]:
import pandas as pd
import csv
import re
import pickle

In [None]:
df = pd.read_csv('YOUR CITATIONS CSV FILE')

In [None]:
df.head()

We first compiled a list of the authors of the research articles:

In [None]:
art_authors = df['article_author'].tolist()
art_authors_set = set(art_authors)
art_authors = list(art_authors_set)
print(art_authors)

We noticed that there were some strange characters in our list, caused by invisible Unicode formatting characters (here U+202E, right-to-left override, and U+202C, pop directional formatting). We used the following code to remove them. Your own strange characters may, of course, differ.

In [None]:
x = []
for elem in art_authors:
    elem = str(elem)
    elem = elem.replace('\u202e','')
    elem = elem.replace('\u202c','')
    x.append(elem)

print(x)
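
If you would rather not list each offending character by hand, a more general approach is possible. The following is a minimal sketch, assuming you are happy to drop every character in Unicode's "Cf" (format) category; `strip_format_chars` is a hypothetical helper name and the sample names are invented.

```python
import unicodedata

def strip_format_chars(text):
    """Remove invisible Unicode format characters (general category 'Cf')."""
    return ''.join(ch for ch in str(text) if unicodedata.category(ch) != 'Cf')

# Invented sample values standing in for art_authors
sample_authors = ['\u202eJane Doe\u202c', 'John Smith']
cleaned = [strip_format_chars(name) for name in sample_authors]
print(cleaned)
```

Note that the "Cf" category also covers characters such as the soft hyphen and zero-width joiner, so check that removing them is acceptable for the scripts in your data.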

We next extracted the likely names of authors from the citation_general column. To do this, we pattern-matched on series of two or three words, each of which started with a capital letter. This match is far from perfect and results in several false positives, so it can probably be refined.

In [None]:
def match(text):
    # True if the word ends with one or more capitals followed by lowercase
    # letters; re.search does not anchor at the start, so "McDonald" also passes
    pattern = '[A-Z]+[a-z]+$'
    return bool(re.search(pattern, text))

cit_author_2 = df['citation_general']

newcits = []
for item in cit_author_2:
    try:
        # text after a newline, up to the last comma on that line
        found = re.search('\n(.+),', item).group(1)
    except (AttributeError, TypeError):
        # no match (AttributeError) or a missing, non-string citation (TypeError)
        continue
    found_lst = found.split()
    # keep only two- or three-word candidates in which every word passes match()
    if len(found_lst) in (2, 3) and all(match(word) for word in found_lst):
        newcits.append(found)

#gets unique values
newcits_set = set(newcits)
newcits = list(newcits_set)

In [None]:
print(newcits)
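
One possible refinement of the name matching, offered as a sketch rather than a tested replacement: a single regular expression that also accepts initials (e.g. "J. R.") and hyphenated surnames. The names `TOKEN`, `NAME_RE` and `looks_like_name`, and the sample strings, are invented for illustration.

```python
import re

# A name token is either an initial ("J.") or a capitalised word,
# optionally hyphenated ("Smith-Jones").
TOKEN = r"(?:[A-Z]\.|[A-Z][a-z]+(?:-[A-Z][a-z]+)?)"
# Two or three tokens separated by single spaces; the last must be a full word.
NAME_RE = re.compile(rf"{TOKEN}(?: {TOKEN})? [A-Z][a-z]+(?:-[A-Z][a-z]+)?")

def looks_like_name(text):
    """Return True if the whole string looks like a two- or three-part name."""
    return NAME_RE.fullmatch(text.strip()) is not None

samples = ["Jane Doe", "J. R. Smith", "Mary Smith-Jones",
           "The Origin Of Species", "jane doe"]
for candidate in samples:
    print(candidate, looks_like_name(candidate))
```

This still lets through two- or three-word titles in title case, and it rejects surnames with internal capitals such as "McDonald", so manual review remains necessary.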

We now combine the list of article authors (x) with the list of names that we extracted (newcits) into a new dataframe (df1), and create a sorted version (df2).

In [None]:
templist = []
combined_auth = x + newcits
for i in combined_auth:
    a = str(i)
    templist.append(a)
df1 = pd.DataFrame({'Author':templist})
authseries = pd.Series(templist)
authseries.shape

In [None]:
df2 = df1.sort_values("Author")
df2.head()

The dataframe is now saved in CSV form, so it can be manually cleaned.  

In [None]:
df2.to_csv('sorted_citations.csv')

The "sorted_citations.csv" file was then manually cleaned using OpenRefine and Excel. We had to strip some punctuation and delete the false positives, which were usually book titles. For our particular data, we found one case where a list of author names was given a single id; in this case we broke the author names apart into separate lines and assigned a new id to each author name. The trickiest part was finding the same person represented in different ways (e.g., with or without a middle initial) and changing the ids so that they all point to the same numerical id. For the next part of the program, it is important not to delete these variant names: they are needed for the pattern matching that generates the citation vertices.
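
To help spot the same person written in different ways, one option is to group names by a crude key before the manual pass. This is a sketch only, assuming Western-order names; `name_key` is a hypothetical helper and the sample names are invented.

```python
from collections import defaultdict

def name_key(name):
    """Crude grouping key: lower-cased surname plus first initial."""
    parts = name.replace('.', ' ').split()
    if len(parts) < 2:
        return name.lower()
    return f"{parts[-1].lower()}, {parts[0][0].lower()}"

# Invented sample names
names = ["Jane Doe", "Jane Q. Doe", "J. Doe", "John Smith"]

groups = defaultdict(list)
for n in names:
    groups[name_key(n)].append(n)

# Keys with more than one spelling are candidates for sharing a single id
candidates = {k: v for k, v in groups.items() if len(v) > 1}
print(candidates)
```

The key is deliberately loose, so it will also group different people (a "John Doe" would land with "Jane Doe"); treat the output as a list of pairs to check by hand, not as a merge decision.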

Assuming that you closed out this program and cleaned your data, we start again by reloading the dataframes:

In [None]:
df = pd.read_csv('YOUR CITATIONS CSV FILE')
df2 = pd.read_csv('sorted_citations.csv')

This routine goes through each citation in df and checks citation_general for a match to one of the names in df2. If there is a match, it creates a tuple of the id numbers of the article's author and the cited author. "Vertices" is thus a list of tuples (in network terms, each tuple is an edge from a citing author to a cited author). The tricky part here is the column indices and the returns from the iterrows method. If your dataframes are like ours (df2 has two columns, index number and name, and df has seven columns, with the article author in column 3 and citation_general in column 7, i.e. positions 2 and 6 when counting from zero as the code does), it should work. This routine takes a long time to run (about four hours for our data); there must be more efficient ways to do it. Just to be safe, we save a copy when it ends.

In [None]:
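def getAuthId (auth_name):
    # look up the id (first column of df2) for a name (second column)
    for a, b in df2.iterrows():
        authtocheck = b.iloc[1]
        authartid = b.iloc[0]
        if authtocheck == auth_name:
            return authartid
    # falls through and returns None if the name is not in df2

vertices = []
for i, j in df.iterrows():
    # .iloc selects by position: article author is column 3, citation_general column 7
    # apply the same character cleaning used earlier so the names can match df2
    auth_name = str(j.iloc[2]).replace('\u202e','').replace('\u202c','')
    authid = getAuthId(auth_name)
    citation = j.iloc[6]
    if not isinstance(citation, str):
        continue  # skip rows with a missing citation
    
    for k, l in df2.iterrows():
        authindex = l.iloc[0]
        authname = l.iloc[1]
        if isinstance(authname, str) and authname in citation:
            vertices.append((authid, authindex))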
            
with open ('vertices.pickle','wb') as f:
    pickle.dump(vertices,f)
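
On the running time: much of the cost probably comes from the nested iterrows calls, since the id lookup rescans df2 for every row of df. A sketch of one alternative, using plain Python structures and invented sample rows in place of the real dataframes; `author_rows`, `article_rows`, `name_to_id` and `edges` are hypothetical names.

```python
# Invented stand-ins for the rows of sorted_citations.csv: (id, name) pairs
author_rows = [(0, "Jane Doe"), (1, "John Smith"), (0, "J. Doe")]
# Invented stand-ins for the rows of df: (article author, citation text) pairs
article_rows = [
    ("Jane Doe", "\nJohn Smith, A Study of Things, 1990"),
    ("John Smith", "\nJ. Doe, Another Study, 1985"),
    ("John Smith", None),  # a missing citation is skipped
]

# Build the name -> id lookup once; setdefault keeps the first id seen,
# matching what a top-to-bottom scan would return.
name_to_id = {}
for auth_id, name in author_rows:
    name_to_id.setdefault(name, auth_id)

edges = []
for art_author, citation in article_rows:
    if not isinstance(citation, str):
        continue
    source = name_to_id.get(art_author)  # None if the author is not listed
    for name, target in name_to_id.items():
        if name in citation:
            edges.append((source, target))

print(edges)
```

With real data, the row pairs could be pulled from the dataframes with `itertuples(index=False)`. The substring test is unchanged, so the matches should be the same apart from names that appear more than once in the cleaned file, which are checked only once here.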

Assuming that the variable vertices is still active (you may need to reload the vertices.pickle file), we check that the list of tuples is of an appropriate length. We then export the list as a CSV file. Both the vertices.csv and sorted_citations.csv files can then be imported into Gephi or used with the network analysis program of your choice.
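
If you are starting from a fresh session, a minimal sketch for reloading the saved list, assuming vertices.pickle sits in the working directory; `load_vertices` is a hypothetical helper name.

```python
import os
import pickle

def load_vertices(path='vertices.pickle'):
    """Reload the list of (source, target) tuples saved earlier."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Only attempt the reload if the file is actually present
if os.path.exists('vertices.pickle'):
    vertices = load_vertices()
    print(len(vertices))
```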

In [None]:
print(len(vertices))

In [None]:
with open ('vertices.csv','w', newline='') as out:
    csv_out = csv.writer(out)
    csv_out.writerow(['source','target'])
    for row in vertices:
        csv_out.writerow(row)
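
The same author pair can appear many times in the list. If your network tool expects one row per pair with a weight, a sketch along these lines may help; the file name weighted_edges.csv, the 'weight' column and the sample data are assumptions for illustration.

```python
import csv
from collections import Counter

# Invented sample edge list standing in for vertices
sample_vertices = [(0, 1), (0, 1), (1, 2)]

# Count how many times each (source, target) pair occurs
weights = Counter(sample_vertices)

with open('weighted_edges.csv', 'w', newline='') as out:
    csv_out = csv.writer(out)
    csv_out.writerow(['source', 'target', 'weight'])
    for (source, target), weight in weights.items():
        csv_out.writerow([source, target, weight])
```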