# Calculating Impact Factors

For any given journal, the impact factor for year $y$ is:

$$ \textit{IF} = \frac{Citations_y}{Publications_{y-1} + Publications_{y-2}} $$

That is, the numerator is all the citations ($Citations_{y}$) received in year *y* by publications from the previous two years, and the denominator is the number of those publications ($Publications_{y-1} + Publications_{y-2}$).

To calculate this we need a dataframe with the following columns:

`JournalId|PaperId|Year Published|Citing PaperID| Citing Paper Publish Year`
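As a rough sketch of how the formula maps onto a table with those columns, here is a toy example. The column names (`YearPublished`, `CitingPaperId`, `CitingYear`), the helper `impact_factor`, and all of the data are made up for illustration. It also assumes a separate one-row-per-paper table for the denominator, since papers with zero citations never show up in a citation table.

```python
import pandas as pd

# Toy citation table: one row per (cited paper, citing paper) pair.
# Column names are placeholders mirroring the layout described above.
citations = pd.DataFrame({
    "JournalId":     [1, 1, 1, 1, 2, 2],
    "PaperId":       [10, 10, 11, 12, 20, 21],
    "YearPublished": [2018, 2018, 2019, 2017, 2019, 2018],
    "CitingPaperId": [100, 101, 102, 103, 104, 105],
    "CitingYear":    [2020, 2020, 2020, 2020, 2020, 2019],
})

# Toy publication table: one row per paper, including uncited ones (paper 13).
papers = pd.DataFrame({
    "JournalId":     [1, 1, 1, 1, 2, 2],
    "PaperId":       [10, 11, 12, 13, 20, 21],
    "YearPublished": [2018, 2019, 2017, 2019, 2019, 2018],
})


def impact_factor(citations, papers, year):
    """Citations received in `year` by papers from the two prior years,
    divided by the number of papers published in those two years."""
    window = [year - 1, year - 2]
    in_window = citations["YearPublished"].isin(window)
    numerator = (
        citations[(citations["CitingYear"] == year) & in_window]
        .groupby("JournalId")
        .size()
    )
    denominator = (
        papers[papers["YearPublished"].isin(window)]
        .groupby("JournalId")
        .size()
    )
    # Journals with publications but no citations get 0 rather than NaN.
    numerator = numerator.reindex(denominator.index, fill_value=0)
    return (numerator / denominator).rename("IF")


result = impact_factor(citations, papers, 2020)
print(result)
```

In the toy data, journal 1 has three papers in the 2018-2019 window that collect three citations in 2020, and journal 2 has two papers that collect one.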

In the `citing.csv` file, the `PaperReferenceId` column should contain only papers published by our focal authors, and the `PaperId` column holds the papers that cite our focal authors' papers. Grouping by the `PaperReferenceId` column and counting gives the number of times each paper is cited.

**Note as of October 23rd**: Calculating the journals' impact factors is going to be a lot harder than I originally thought. First we have to find all the papers that our focal authors have published, which we have in the `papers.csv` file. Then we need to find all the journals in which those papers appear, which I have done; those are in the `journals.csv` file. **But** then we need to go back to the MAG corpus, get all the papers that appear in those journals, and then get all the papers that reference the papers in those journals. I think that can be done in a single script using dask, but it is a pain in the ass and I hadn't thought about it before.

Once we have all these dataframes, though, and get the columns *just* right, I think it will be trivially easy to calculate impact factors.

**Further note as of October 23rd**: All of these calculations, and the subsetting of the full MAG corpus, depend on who is and who isn't a probable sociologist and on the author IDs affiliated with the names we have. The subsetting, filtering, and all that nonsense will ultimately be refined by finding a better way of disambiguating author IDs. As of today, though, I can make a first pass at calculating the number of papers people have and the number of citations they have.

**NOTE as of October 24th**
The next step for getting the edge list shorter, and then looking at centralities, is to run an interactive job with 500 GB of memory allocated to a single task. You can also write your large matrices to the /scratch/midway3/timothyelder space and then SCP the files after pickling or compressing them.

Further, you can subset the edge list based on the type of paper. Load the papers.csv file, subset to only original articles, check how many rows are dropped, then keep only the edges in the edge list whose papers are articles. Then you can drop the `PaperId` column, create the binary network, and project it to a one-mode network. Hopefully that will get the size down to a point where it can be held in memory on my laptop, but I suspect it won't.
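A minimal sketch of that subset-then-project idea on toy data, assuming the edge list has the `PaperId`, `AuthorId`, and `JournalId` columns loaded below. Projecting onto authors (rather than journals) is my assumption here, and the toy frames are invented. Keeping everything in `scipy.sparse` is what gives this a chance of fitting in memory.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins for edge_list.csv and papers.csv.
edges = pd.DataFrame({
    "PaperId":   [1, 2, 3, 4],
    "AuthorId":  [10, 10, 11, 12],
    "JournalId": [100, 101, 100, 101],
})
papers = pd.DataFrame({
    "PaperId": [1, 2, 3, 4],
    "DocType": ["Journal", "Journal", "Journal", "Book"],
})

# Keep only edges whose paper is a journal article.
article_ids = papers.loc[papers["DocType"] == "Journal", "PaperId"]
edges = edges[edges["PaperId"].isin(article_ids)]

# Drop PaperId and collapse to a binary author-journal relation.
pairs = edges[["AuthorId", "JournalId"]].drop_duplicates()

# Map the raw ids to compact 0..n-1 row/column positions.
author_codes, author_ids = pd.factorize(pairs["AuthorId"])
journal_codes, journal_ids = pd.factorize(pairs["JournalId"])

# Binary incidence matrix B (authors x journals).
B = sparse.csr_matrix(
    (np.ones(len(pairs), dtype=int), (author_codes, journal_codes)),
    shape=(len(author_ids), len(journal_ids)),
)

# One-mode projection: entry (i, j) counts journals shared by authors i and j.
A = (B @ B.T).tolil()
A.setdiag(0)  # remove self-ties
A = A.tocsr()
A.eliminate_zeros()

print(list(author_ids))
print(A.toarray())
```

Swapping the product to `B.T @ B` would give the journal-by-journal projection instead.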

Further, we can drop all the papers published before a person received their PhD. We can do that individually for authors, but we can also do it for the whole dataset: find the first year that any of our faculty received their PhD and drop all the papers before that.
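Both versions of the PhD-year filter could look something like the sketch below. The `PhdYear` column name is hypothetical (I haven't shown what `faculty_df_complete.csv` actually calls it), and the frames are toy data.

```python
import pandas as pd

# Toy author-paper table and faculty table; `PhdYear` is a hypothetical name.
papers = pd.DataFrame({
    "PaperId":  [1, 2, 3, 4],
    "AuthorId": [10, 10, 11, 11],
    "Year":     [1990, 2001, 1985, 1999],
})
faculty = pd.DataFrame({"AuthorId": [10, 11], "PhdYear": [1995, 1980]})

# Dataset-wide cutoff: drop everything before the earliest PhD in the sample.
earliest_phd = faculty["PhdYear"].min()
global_subset = papers[papers["Year"] >= earliest_phd]

# Per-author cutoff: attach each author's PhD year, then compare row by row.
merged = papers.merge(faculty, on="AuthorId", how="left")
per_author_subset = merged[merged["Year"] >= merged["PhdYear"]]

print(len(papers) - len(global_subset), "rows dropped by the global cutoff")
print(len(papers) - len(per_author_subset), "rows dropped by the per-author cutoff")
```

The per-author version is always at least as strict as the global one, so it is the better bet if the goal is shrinking the edge list.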

## Loading Libraries and Dataframes 

In [22]:
import re 
import os 
import json
import scipy
import networkx
import pandas as pd 

os.chdir('/home/timothyelder/mag')

authors_df = pd.read_csv("data/authors.csv", low_memory=False) 
authors2papers_df = pd.read_csv("data/authors2papers.csv")
papers_df = pd.read_csv("data/papers.csv")
journals_df = pd.read_csv("data/journals.csv")
citing_df = pd.read_csv("data/citing.csv")
citing_papers_df = pd.read_csv("data/citing_papers.csv")
faculty_df = pd.read_csv("data/faculty_df_complete.csv")
papers2journals = pd.read_csv("data/edge_list.csv", dtype = {"PaperId": int, "AuthorId": int, "JournalId": int})



The first thing I want to do is check how many rows we can drop by subsetting `papers_df` to include only the "Journal" DocType.

In [12]:
len(papers_df) - len(papers_df[papers_df['DocType'] == 'Journal'])

602839

Subsetting to only Journal entries means we drop 602839 rows. Now let's see how many edges we can drop from the edge list. 

In [20]:
papers_df = papers_df[papers_df['DocType'] == 'Journal']

len(papers2journals) - len(papers2journals[papers2journals['PaperId'].isin(papers_df['PaperId'])])

0

Unfortunately, we don't drop any rows with this method. Let's change to the year published method. 

In [40]:
len(papers_df) - len(papers_df[papers_df['Year'] >= 1963])

7159

Again, this method didn't really work: we only dropped 7159 rows from `papers_df`.

In [2]:
authors2papers_df.head()

Unnamed: 0,PaperId,AuthorId,AffiliationId,AuthorSequenceNumber,OriginalAuthor,OriginalAffiliation
0,4794,2123758620,189158971.0,1,Paul Miller,United States Naval Academy#TAB#
1,28888,2435814755,,1,Feng Wang,Key Laboratory of Enhanced Heat Transfer and E...
2,38178,3074655252,,11,Li Zhang,
3,51663,2163596281,,1,William P. Bridges,The United States of America as represented by...
4,63235,614356337,78577930.0,2,Denise Kandel,Columbia University and New York Psychiatric I...


In [4]:
journals_df.head()

Unnamed: 0,JournalId,NormalizedName
0,1137746,the artist and journal of home culture
1,3164724,physiological measurement
2,17807283,theoretical population biology
3,18204665,international journal of multiphase flow
4,27908409,acta veterinaria hungarica


In [39]:
authors_df.head()

Unnamed: 0,AuthorId,Rank,NormalizedName,DisplayName,LastKnownAffiliationId,PaperCount,PaperFamilyCount,CitationCount,CreatedDate
0,184369,15782,charles kurzman,Charles Kurzman,114027177.0,79,79,1497,2016-06-24
1,284723,16745,charles n halaby,Charles N. Halaby,135310074.0,9,9,1029,2016-06-24
2,633513,16244,karen a hegtvedt,Karen A. Hegtvedt,150468666.0,48,48,1392,2016-06-24
3,2033759,18047,georgi derluguian,Georgi Derluguian,111979921.0,16,16,71,2016-06-24
4,2828213,18740,albert j bergesen,Albert J. Bergesen,,11,11,11,2016-06-24


In [40]:
papers_df.head()

Unnamed: 0,PaperId,Rank,Doi,DocType,PaperTitle,OriginalTitle,BookTitle,Year,Date,OnlineDate,...,FirstPage,LastPage,ReferenceCount,CitationCount,EstimatedCitation,OriginalVenue,FamilyId,FamilyRank,DocSubTypes,CreatedDate
0,204697957,23086,,Repository,1998 annual school leavers survey of 1996 97 l...,1998 annual school leavers' survey of 1996/ '9...,,1999.0,1999-01-01,,...,,,0.0,3.0,3.0,Research Papers in Economics,204697957.0,22661.0,,2016-06-24
1,3125546681,23723,,Repository,1998 annual school leavers survey of 1996 97 l...,1998 Annual School Leavers' Survey of 1996/97 ...,,1999.0,1999-01-01,,...,,,0.0,0.0,0.0,Research Series,204697957.0,22661.0,,2021-02-01
2,232961755,27231,,Book,ohio timber products output 1983,"Ohio Timber Products Output, 1983",,2017.0,2017-12-13,,...,,,0.0,0.0,0.0,,232961755.0,26848.0,,2016-06-24
3,3023278565,27995,,,ohio timber products output 1983,Ohio timber products output - 1983,,1986.0,1986-01-01,,...,,,0.0,0.0,0.0,"Resour. Bull. NE-95. Broomall, PA: U.S. Depart...",232961755.0,26848.0,,2020-05-13
4,3141748840,24173,,Repository,the role of company networks in low tech indus...,The Role of Company Networks in Low-tech Indus...,,2011.0,2011-01-01,,...,,,0.0,1.0,1.0,Chapters,581819997.0,22698.0,,2021-04-13


In [13]:
len(set(citing_df.PaperReferenceId))

559711

In [14]:
len(set(citing_df.PaperId))

8223936

In [10]:
len(papers_df)

1017962

In [18]:
df = citing_df.groupby(['PaperReferenceId']).size().reset_index(name='counts')
df.head()

Unnamed: 0,PaperReferenceId,counts
0,4794,1
1,28888,2
2,38178,3
3,51663,5
4,63235,60


## Lots o' Merges

Basic merge syntax:
```
new_df = authors_df.join(authors2papers.set_index("AuthorId"), on="AuthorId")
```
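Since so much of what follows hinges on merges behaving as expected, here is a hedged sketch of the same kind of join written with `DataFrame.merge`, using its `validate` and `indicator` options to catch duplicate keys and unmatched rows. The two small frames are toy data, not the real files.

```python
import pandas as pd

# Toy author table (one row per author) and author-paper link table.
authors = pd.DataFrame({"AuthorId": [1, 2], "NormalizedName": ["a", "b"]})
links = pd.DataFrame({"PaperId": [10, 11, 12], "AuthorId": [1, 1, 3]})

merged = links.merge(
    authors,
    on="AuthorId",
    how="left",              # keep every link, even if the author is missing
    validate="many_to_one",  # raises MergeError if AuthorId repeats in `authors`
    indicator=True,          # adds a `_merge` column describing each row's source
)

# Rows whose AuthorId was not found in the author table.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched)
```

Checking `len(merged)` against `len(links)` after a left merge is another cheap way to see whether a join silently multiplied rows.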

In [36]:
citing_df.head()

Unnamed: 0,PaperId,PaperReferenceId
0,285,2128720819
1,417,2162106818
2,2370,1964941906
3,2678,1811781384
4,3066,2139033805
...,...,...
14638088,3187075103,2742847392
14638089,3187075103,2907375416
14638090,3187075169,2753051611
14638091,3187075269,2491315640


In [50]:
# get the attributes for the paperreferenceid
new_df = citing_df.merge(papers_df.drop(columns=['Rank', 'Doi', 'DocType', 'PaperTitle', 'OriginalTitle',
       'BookTitle','Date', 'OnlineDate', 'Publisher',
       'ConferenceSeriesId', 'ConferenceInstanceId', 'Volume', 'Issue',
       'FirstPage', 'LastPage', 'ReferenceCount', 'CitationCount',
       'EstimatedCitation', 'OriginalVenue', 'FamilyId', 'FamilyRank',
       'DocSubTypes', 'CreatedDate']), left_on="PaperReferenceId", right_on="PaperId")

In [51]:
new_df

Unnamed: 0,PaperId_x,PaperReferenceId,PaperId_y,Year,JournalId
0,285,2128720819,2128720819,2011.0,118093565.0
1,1523727732,2128720819,2128720819,2011.0,118093565.0
2,1708215696,2128720819,2128720819,2011.0,118093565.0
3,1808310069,2128720819,2128720819,2011.0,118093565.0
4,1988195532,2128720819,2128720819,2011.0,118093565.0
...,...,...,...,...,...
14638088,3187058323,2991017726,2991017726,2020.0,36178057.0
14638089,3187059224,3138417050,3138417050,2020.0,
14638090,3187065337,3139323945,3139323945,2021.0,189917590.0
14638091,3187066185,3160607702,3160607702,2021.0,39260535.0


In [44]:
papers_df.columns

Index(['PaperId', 'Rank', 'Doi', 'DocType', 'PaperTitle', 'OriginalTitle',
       'BookTitle', 'Year', 'Date', 'OnlineDate', 'Publisher', 'JournalId',
       'ConferenceSeriesId', 'ConferenceInstanceId', 'Volume', 'Issue',
       'FirstPage', 'LastPage', 'ReferenceCount', 'CitationCount',
       'EstimatedCitation', 'OriginalVenue', 'FamilyId', 'FamilyRank',
       'DocSubTypes', 'CreatedDate'],
      dtype='object')

# Counting papers and citations by authors

I want a dataframe for counting publications by author with the following columns:

`AuthorId|Name|PaperId|Year`

I want a dataframe for counting citations by author with the following columns:

`AuthorId|Name|PaperId|Year Published|Citing PaperId|Citing Year Published`

First, drop the columns we don't need.

In [141]:
number_papers = len(papers_df)
number_authors = len(authors_df)


In [236]:
author_papers = papers_df.drop(columns = ['Rank', 'Doi', 'DocType', 'PaperTitle', 'OriginalTitle',
       'BookTitle', 'Date', 'OnlineDate', 'Publisher', 'JournalId',
       'ConferenceSeriesId', 'ConferenceInstanceId', 'Volume', 'Issue',
       'FirstPage', 'LastPage', 'ReferenceCount', 'CitationCount',
       'EstimatedCitation', 'OriginalVenue', 'FamilyId', 'FamilyRank',
       'DocSubTypes', 'CreatedDate'])

authors2papers = authors2papers_df.drop(columns=['AffiliationId', 'AuthorSequenceNumber',
       'OriginalAuthor', 'OriginalAffiliation'])

In [237]:
authors_df = authors_df.drop(columns=['Rank',
       'LastKnownAffiliationId', 'PaperCount', 'PaperFamilyCount',
       'CitationCount', 'CreatedDate'])

citing_papers_df = citing_papers_df.drop(columns=['Rank', 'Doi', 'DocType', 'PaperTitle', 'OriginalTitle',
       'BookTitle','Date', 'OnlineDate', 'Publisher', 'JournalId',
       'ConferenceSeriesId', 'ConferenceInstanceId', 'Volume', 'Issue',
       'FirstPage', 'LastPage', 'ReferenceCount', 'CitationCount',
       'EstimatedCitation', 'OriginalVenue', 'FamilyId', 'FamilyRank',
       'DocSubTypes', 'CreatedDate'])

author_papers = author_papers.merge(authors2papers, on= "PaperId") # first merge
author_papers = author_papers.merge(authors_df, on= "AuthorId") # second merge

Let's first do counts by year.

In [238]:
author_papers.groupby(['NormalizedName','Year']).size().reset_index(name='counts')

Unnamed: 0,NormalizedName,Year,counts
0,a aneesh,1998.0,1
1,a aneesh,2000.0,1
2,a aneesh,2002.0,1
3,a aneesh,2004.0,3
4,a aneesh,2006.0,2
...,...,...,...
99337,zulema valdez,2017.0,3
99338,zulema valdez,2018.0,1
99339,zulema valdez,2019.0,2
99340,zulema valdez,2020.0,5


In [239]:
CountByYear = author_papers.groupby(['NormalizedName','Year']).size().reset_index(name='counts')
CountByYear.to_csv('data/year_authors_paper_counts.csv', index=False)

In [240]:
jlm = CountByYear[CountByYear['NormalizedName'] == "john levi martin"]
jlm

Unnamed: 0,NormalizedName,Year,counts
45050,john levi martin,1962.0,1
45051,john levi martin,1995.0,1
45052,john levi martin,1996.0,1
45053,john levi martin,1998.0,2
45054,john levi martin,1999.0,4
45055,john levi martin,2000.0,3
45056,john levi martin,2001.0,3
45057,john levi martin,2002.0,2
45058,john levi martin,2003.0,5
45059,john levi martin,2004.0,1


Instead of counts by year, it will be helpful to have a cumulative sum for all the people. That way, for any given year in which we have an observation, we can subset the dataframe and kick it over to R.

In [245]:
author_papers.groupby(['NormalizedName', 'Year']).size().groupby(level=0).cumsum().reset_index()

Unnamed: 0,NormalizedName,Year,0
0,a aneesh,1998.0,1
1,a aneesh,2000.0,2
2,a aneesh,2002.0,3
3,a aneesh,2004.0,6
4,a aneesh,2006.0,8
...,...,...,...
99337,zulema valdez,2017.0,30
99338,zulema valdez,2018.0,31
99339,zulema valdez,2019.0,33
99340,zulema valdez,2020.0,38


In [242]:
cumsumyear = author_papers.groupby(['NormalizedName', 'Year']).size().groupby(level=0).cumsum().reset_index()

cumsumyear.to_csv('data/cum_author_paper_counts.csv', index=False)

In [243]:
cumsumyear[cumsumyear['NormalizedName'] == "john levi martin"]

Unnamed: 0,NormalizedName,Year,0
45050,john levi martin,1962.0,1
45051,john levi martin,1995.0,2
45052,john levi martin,1996.0,3
45053,john levi martin,1998.0,5
45054,john levi martin,1999.0,9
45055,john levi martin,2000.0,12
45056,john levi martin,2001.0,15
45057,john levi martin,2002.0,17
45058,john levi martin,2003.0,22
45059,john levi martin,2004.0,23


### Now for counting Citations

Rename columns in the citing papers df so they are explicit

In [185]:
citing_df = citing_df.rename(columns={"PaperId": "CitingPaperId", "PaperReferenceId": "PaperId"})
citing_papers_df = citing_papers_df.rename(columns={"PaperId": "CitingPaperId", "Year":"CitingYear"})

In [186]:
author_papers

Unnamed: 0,PaperId,PaperYear,AuthorId,NormalizedName,DisplayName
0,204697957,1999.0,2168830315,james williams,James Williams
1,3125546681,1999.0,2168830315,james williams,James Williams
2,2070450629,1997.0,2168830315,james williams,James Williams
3,1976317998,2004.0,2168830315,james williams,James Williams
4,2019022785,1989.0,2168830315,james williams,James Williams
...,...,...,...,...,...
1056926,2832267950,2014.0,3071142340,bin xu,Xu Bin
1056927,2934042183,2015.0,3076683674,wei zhang,Zhang Wei
1056928,2961656015,2019.0,3177573677,feng wang,Wang Feng
1056929,3108675323,2020.0,3146168402,yang yang,Yang Yang


In [187]:
citing_df

Unnamed: 0,CitingPaperId,PaperId
0,285,2128720819
1,417,2162106818
2,2370,1964941906
3,2678,1811781384
4,3066,2139033805
...,...,...
14638088,3187075103,2742847392
14638089,3187075103,2907375416
14638090,3187075169,2753051611
14638091,3187075269,2491315640


In [188]:
citing_papers_df

Unnamed: 0,CitingPaperId,CitingYear
0,84606111,2018.0
1,166500554,2004.0
2,91275118,2002.0
3,179792389,2014.0
4,185077135,2004.0
...,...,...
8223931,3136264816,2021.0
8223932,3158097200,2021.0
8223933,3165273089,2021.0
8223934,3176882034,2021.0


Now a bunch of merges.

In [189]:
author_papers = author_papers.rename(columns={"Year":"PaperYear"}) #rename year column

author_papers = author_papers.merge(citing_df, on = "PaperId") # merging to get citing papers

author_papers = author_papers.merge(citing_papers_df, on= "CitingPaperId")

author_papers

Unnamed: 0,PaperId,PaperYear,AuthorId,NormalizedName,DisplayName,CitingPaperId,CitingYear
0,204697957,1999.0,2168830315,james williams,James Williams,50739872,2000.0
1,2144373764,1991.0,311813037,sara mclanahan,Sara McLanahan,50739872,2000.0
2,2073328109,1994.0,2809991721,kelly r damphousse,Kelly R. Damphousse,50739872,2000.0
3,2073328109,1994.0,2971303771,howard b kaplan,Howard B. Kaplan,50739872,2000.0
4,1996242568,1996.0,2809991721,kelly r damphousse,Kelly R. Damphousse,50739872,2000.0
...,...,...,...,...,...,...,...
16036541,2832267950,2014.0,3071142340,bin xu,Xu Bin,2838463062,2017.0
16036542,2832267950,2014.0,3071142340,bin xu,Xu Bin,2930292877,2016.0
16036543,2934042183,2015.0,3076683674,wei zhang,Zhang Wei,2846737028,2016.0
16036544,2934042183,2015.0,3076683674,wei zhang,Zhang Wei,2932089498,2018.0


In [234]:
cite_counts_year = author_papers.groupby(['NormalizedName','PaperId', "CitingYear"]).size().reset_index(name='counts')
cite_counts = author_papers.groupby(['NormalizedName','PaperId']).size().reset_index(name='counts')
cite_counts.to_csv("data/author_paper_cite_counts.csv", index=False)

In [216]:
cite_counts[cite_counts["NormalizedName"] == "john levi martin"].sort_values(by="counts")

Unnamed: 0,NormalizedName,PaperId,counts
188117,john levi martin,3089988150,1
188114,john levi martin,2941543147,1
188079,john levi martin,2048796275,1
188103,john levi martin,2556943354,1
188110,john levi martin,2796332085,1
...,...,...,...
188067,john levi martin,2000132240,103
188077,john levi martin,2047314324,104
188085,john levi martin,2073018323,111
188058,john levi martin,406505337,148


As we can see above and below, John's most cited paper is "What is Field Theory?", so it seems to have kind of worked. But in looking at this I realized that coauthored papers are included in the dataset and there is no way of disambiguating that. See PaperId `2000132240` for an example from John's papers.

In [225]:
papers_df[papers_df["PaperId"] == 1543353770]

Unnamed: 0,PaperId,Rank,Doi,DocType,PaperTitle,OriginalTitle,BookTitle,Year,Date,OnlineDate,...,FirstPage,LastPage,ReferenceCount,CitationCount,EstimatedCitation,OriginalVenue,FamilyId,FamilyRank,DocSubTypes,CreatedDate
906456,1543353770,18345,10.1086/375201,Journal,what is field theory 1,What Is Field Theory?1,,2003.0,2003-07-01,,...,1,49,105.0,593.0,921.0,American Journal of Sociology,,,,2016-06-24
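On the coauthorship caveat: one partial workaround might be to flag papers that have more than one author in the link table, so they can at least be separated out later. This is only a sketch on toy data. It assumes the author-paper table lists every author of a paper, which may not hold for our subset, and the `n_authors` and `coauthored` column names are made up.

```python
import pandas as pd

# Toy author-paper links and per-paper citation counts.
links = pd.DataFrame({
    "PaperId":  [1, 1, 2, 3, 3, 3],
    "AuthorId": [10, 11, 10, 10, 12, 13],
})
cite_counts = pd.DataFrame({
    "NormalizedName": ["x", "x", "x"],
    "PaperId":        [1, 2, 3],
    "counts":         [5, 2, 9],
})

# Number of distinct authors recorded for each paper.
n_authors = (
    links.groupby("PaperId")["AuthorId"]
    .nunique()
    .rename("n_authors")
    .reset_index()
)

# Attach the author count and flag anything with more than one author.
flagged = cite_counts.merge(n_authors, on="PaperId", how="left")
flagged["coauthored"] = flagged["n_authors"] > 1
print(flagged)
```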


Let's try to do the same cumsum thing we did above, here for citations.

In [230]:
cite_counts_year

Unnamed: 0,NormalizedName,PaperId,CitingYear,counts
0,a aneesh,230858940,2017.0,1
1,a aneesh,570668191,2014.0,1
2,a aneesh,570668191,2015.0,1
3,a aneesh,570668191,2020.0,1
4,a aneesh,606888222,2007.0,2
...,...,...,...,...
3211425,zulema valdez,2943340270,2020.0,4
3211426,zulema valdez,2943340270,2021.0,1
3211427,zulema valdez,3011657546,2021.0,1
3211428,zulema valdez,3085416121,2020.0,1


In [233]:
d = cite_counts_year.groupby(['NormalizedName', 'CitingYear']).size().groupby(level=0).cumsum().reset_index()
d[d["NormalizedName"] == "john levi martin"]

Unnamed: 0,NormalizedName,CitingYear,0
62437,john levi martin,1981.0,1
62438,john levi martin,1988.0,2
62439,john levi martin,1996.0,3
62440,john levi martin,1998.0,4
62441,john levi martin,1999.0,6
62442,john levi martin,2000.0,10
62443,john levi martin,2001.0,14
62444,john levi martin,2002.0,24
62445,john levi martin,2003.0,30
62446,john levi martin,2004.0,40
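One thing to double check in the cell above: `cite_counts_year` already has one row per (author, paper, citing year) with the citations in `counts`, so calling `.size()` on it counts rows (cited papers per year) rather than citations. A small sketch of the difference on toy data, with `cum_rows` and `cum_citations` as made-up column names:

```python
import pandas as pd

# Toy frame shaped like cite_counts_year.
cite_counts_year = pd.DataFrame({
    "NormalizedName": ["a", "a", "a", "b"],
    "PaperId":        [1, 2, 1, 3],
    "CitingYear":     [2000, 2000, 2001, 2000],
    "counts":         [3, 2, 4, 1],
})

keys = ["NormalizedName", "CitingYear"]

# .size() counts rows, i.e. how many distinct papers were cited that year.
by_size = (
    cite_counts_year.groupby(keys).size()
    .groupby(level=0).cumsum()
    .reset_index(name="cum_rows")
)

# Summing `counts` gives the cumulative number of citations instead.
by_sum = (
    cite_counts_year.groupby(keys)["counts"].sum()
    .groupby(level=0).cumsum()
    .reset_index(name="cum_citations")
)

print(by_size)
print(by_sum)
```

Passing `name=` to `reset_index` also avoids the unhelpful `0` column header in the output.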


In [228]:
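# cite_counts_year has no 'PaperYear' column, so group on 'CitingYear';
# sum the per-paper 'counts' so the running total is citations, not rows
cite_cumsumyear = cite_counts_year.groupby(['NormalizedName', 'CitingYear'])['counts'].sum().groupby(level=0).cumsum().reset_index()
cite_cumsumyear
#cite_cumsumyear.to_csv('data/cum_author_paper_counts.csv', index=False)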

In [125]:
citing_papers_df

Unnamed: 0,CitingPaperId,Year
0,84606111,2018.0
1,166500554,2004.0
2,91275118,2002.0
3,179792389,2014.0
4,185077135,2004.0
...,...,...
8223931,3136264816,2021.0
8223932,3158097200,2021.0
8223933,3165273089,2021.0
8223934,3176882034,2021.0


In [136]:
author_papers = papers_df.drop(columns = ['Rank', 'Doi', 'DocType', 'PaperTitle', 'OriginalTitle',
       'BookTitle', 'Date', 'OnlineDate', 'Publisher', 'JournalId',
       'ConferenceSeriesId', 'ConferenceInstanceId', 'Volume', 'Issue',
       'FirstPage', 'LastPage', 'ReferenceCount', 'CitationCount',
       'EstimatedCitation', 'OriginalVenue', 'FamilyId', 'FamilyRank',
       'DocSubTypes', 'CreatedDate'])

authors2papers = authors2papers_df.drop(columns=['AffiliationId', 'AuthorSequenceNumber',
       'OriginalAuthor', 'OriginalAffiliation'])

authors_df = authors_df.drop(columns=['Rank',
       'LastKnownAffiliationId', 'PaperCount', 'PaperFamilyCount',
       'CitationCount', 'CreatedDate'])

citing_papers_df = citing_papers_df.drop(columns=['Rank', 'Doi', 'DocType', 'PaperTitle', 'OriginalTitle',
       'BookTitle','Date', 'OnlineDate', 'Publisher', 'JournalId',
       'ConferenceSeriesId', 'ConferenceInstanceId', 'Volume', 'Issue',
       'FirstPage', 'LastPage', 'ReferenceCount', 'CitationCount',
       'EstimatedCitation', 'OriginalVenue', 'FamilyId', 'FamilyRank',
       'DocSubTypes', 'CreatedDate'])

KeyError: "['Rank' 'LastKnownAffiliationId' 'PaperCount' 'PaperFamilyCount'\n 'CitationCount' 'CreatedDate'] not found in axis"
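The `KeyError` above is consistent with re-running a drop cell after those columns had already been removed from `authors_df`. A small sketch of making such a cell safe to re-run, using the `errors="ignore"` option of `DataFrame.drop` on a toy frame:

```python
import pandas as pd

# Toy frame with only some of the columns we want to drop.
authors_df = pd.DataFrame({
    "AuthorId": [1],
    "Rank": [5],
    "NormalizedName": ["a"],
})
unwanted = ["Rank", "CreatedDate"]

# errors="ignore" skips labels that are not present instead of raising,
# so running the cell twice is harmless.
authors_df = authors_df.drop(columns=unwanted, errors="ignore")
authors_df = authors_df.drop(columns=unwanted, errors="ignore")

print(list(authors_df.columns))
```

The trade-off is that a misspelled column name will also be skipped silently, so it is worth eyeballing the resulting columns once.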

In [137]:
citing_df = citing_df.rename(columns={"PaperId": "CitingPaperId", "PaperReferenceId": "PaperId"})
citing_papers_df = citing_papers_df.rename(columns={"PaperId": "CitingPaperId"})

In [None]:
author_papers = author_papers.merge(authors2papers, on= "PaperId") # first merge
author_papers = author_papers.merge(authors_df, on= "AuthorId") # second merge

author_papers = author_papers.rename(columns={"Year":"PaperYear"}) #rename year column

author_papers = author_papers.merge(citing_df, on = "PaperId") # merging to get citing papers

author_papers.merge(citing_papers_df, on= "CitingPaperId")