# PaperMill - Data Preprocessing
This notebook contains the scripts used to clean the data for PaperMill.
First, the dataset is broken into smaller chunks for easier manipulation during analysis. During this step, it is also divided into two subsets: Relationship and Detail (RD) and Textual Content.
The RD data is then recombined for use in Neo4j.
The Textual Content is cleaned for use in the recommender system: it is separated into paper IDs and text data, which are then recombined.
Finally, the text data is used to generate a similarity matrix, which forms the basis of the recommender system.
<br>
<br>
To begin, the DBLP dataset can be downloaded [here](https://lfs.aminer.cn/misc/dblp.v11.zip).

<br>

## Splitting the Dataset
First, the dataset is split into the RD dataset and the Textual Content dataset. The RD data is converted to CSV format, while the Textual Content remains as JSON.
The "split_dataset" function takes five parameters:
1. file (string): Path to the dataset file
2. numRows (integer): Number of rows within the dataset
3. splits (integer): Number of chunks to split the dataset into
4. output (string): Directory location of the output files
5. badIDs (list of strings): List of the corrupted entries identified in the dataset

In [1]:
from utils.split import split_dataset

fileName = "dataset/DBLP_V11.json"
outputDir = "dataset/split/"
badIDs = ["133370419","1978543128","1995561360","2015718743","2114368344","2124878095","2143245624","2171998353","2344857557","2405346676"]

split_dataset(fileName, 4107340, 4, outputDir, badIDs)

2020-05-17 16:16:53.020776: Splitting initiated.
Bad ID caught: 133370419.
Bad ID caught: 1978543128.
Bad ID caught: 1995561360.
2020-05-17 16:32:27.918113: Split 1 complete.
Bad ID caught: 2015718743.
2020-05-17 16:49:06.626307: Split 2 complete.
Bad ID caught: 2114368344.
Bad ID caught: 2124878095.
Bad ID caught: 2143245624.
Bad ID caught: 2171998353.
Bad ID caught: 2344857557.
2020-05-17 17:06:03.531514: Split 3 complete.
Bad ID caught: 2405346676.
2020-05-17 17:22:02.218399: Split 4 complete.
2020-05-17 17:22:02.218399: Datset split.
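
The chunking and bad-ID filtering described above can be illustrated with a minimal sketch. It is not the actual `split_dataset` implementation: the helper name `split_records`, the assumption that the dataset stores one JSON object per line, and the regular expression used to read the `id` field are all hypothetical. The ID is checked on the raw line before parsing, so that a corrupted record never reaches the JSON parser.

```python
import json
import math
import re

# Hypothetical pattern for reading the "id" field from a raw line
ID_PATTERN = re.compile(r'"id"\s*:\s*"([^"]+)"')

def split_records(lines, num_rows, splits, bad_ids):
    """Distribute JSON-lines records across `splits` chunks, skipping bad IDs."""
    chunk_size = math.ceil(num_rows / splits)
    bad = set(bad_ids)
    chunks = [[] for _ in range(splits)]
    for row_index, line in enumerate(lines):
        match = ID_PATTERN.search(line)
        if match and match.group(1) in bad:
            # Skip known-corrupted entries before attempting to parse them
            continue
        # Clamp the index so trailing rows always land in the last chunk
        chunk_index = min(row_index // chunk_size, splits - 1)
        chunks[chunk_index].append(json.loads(line))
    return chunks

# Tiny made-up sample; the third line is deliberately malformed JSON
sample = [
    '{"id": "1", "title": "A"}',
    '{"id": "2", "title": "B"}',
    '{"id": "3", "title": ',
    '{"id": "4", "title": "D"}',
]
chunks = split_records(sample, num_rows=4, splits=2, bad_ids=["3"])
print([len(chunk) for chunk in chunks])
```

Chunk boundaries here are based on the row position in the input, so a chunk that contains skipped entries ends up slightly smaller than the others.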


## Merging the RD Dataset
The RD dataset is then recombined using the "merge_csvs" function. This function takes six parameters:
1. inputDir (string): Location of the split data
2. outputDir (string): Location for the output file
3. prefix (string): Filename prefix identifying the split files to merge
4. outputFileName (string): Name of the output file, without the extension
5. splits (integer): Number of files the input data has been split into
6. headers (boolean): Whether the split files have headers or not (default is True)

In [1]:
from utils.merge import merge_csvs

inputDir = "dataset/split/"
outputDir = "dataset/neo4j/"
prefix = "Split_DBLP_RD_"
outputFileName = "DBLP_RD"

merge_csvs(inputDir, outputDir, prefix, outputFileName, 4)

2020-05-17 17:28:02.726589: Merge initiated.
2020-05-17 17:28:15.387427: dataset/split/Split_DBLP_RD_1.csv merged.
2020-05-17 17:28:21.921405: dataset/split/Split_DBLP_RD_2.csv merged.
2020-05-17 17:28:28.455670: dataset/split/Split_DBLP_RD_3.csv merged.
2020-05-17 17:28:34.210036: dataset/split/Split_DBLP_RD_4.csv merged.
2020-05-17 17:28:34.211029: DBLP_RD.csv successfully merged.
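
One way such a merge could work is sketched below. This is an illustration under assumptions, not the code in `utils.merge`: the function name `merge_csv_files` and the `Part_` file prefix are hypothetical, and it assumes that, when the parts carry headers, only the header of the first part should be written to the merged file.

```python
import csv
import os
import tempfile

def merge_csv_files(input_dir, output_path, prefix, splits, headers=True):
    """Concatenate <prefix>1.csv .. <prefix>N.csv into a single CSV file."""
    with open(output_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        for part in range(1, splits + 1):
            part_path = os.path.join(input_dir, f"{prefix}{part}.csv")
            with open(part_path, newline="", encoding="utf-8") as in_file:
                reader = csv.reader(in_file)
                if headers:
                    header = next(reader, None)
                    # Keep the header from the first part only
                    if part == 1 and header is not None:
                        writer.writerow(header)
                writer.writerows(reader)

# Demonstration on two small throwaway files
with tempfile.TemporaryDirectory() as tmp:
    parts = [
        [["id", "title"], ["1", "A"]],
        [["id", "title"], ["2", "B"]],
    ]
    for part, rows in enumerate(parts, start=1):
        path = os.path.join(tmp, f"Part_{part}.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(rows)
    merged_path = os.path.join(tmp, "merged.csv")
    merge_csv_files(tmp, merged_path, "Part_", splits=2)
    with open(merged_path, newline="", encoding="utf-8") as f:
        merged_rows = list(csv.reader(f))

print(merged_rows)
```

Streaming rows through `csv.reader` and `csv.writer` keeps memory use flat, which matters for parts as large as the ones produced from the full dataset.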


## Cleaning and Formatting the Textual Content
The text data needs to be cleaned for use in the recommender system. This is carried out by the "text_preprocessing" function, which takes three parameters:
1. inputDir (string): Location of the split textual content dataset
2. outputDir (string): Location for the cleaned file
3. splits (integer): Number of files the original DBLP dataset was split into

In [None]:
from utils.recommender import text_preprocessing

inputDir = "dataset/split/"
outputDir = "dataset/recommender/"

text_preprocessing(inputDir, outputDir, 4)
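
The exact cleaning steps are not listed in this notebook, so the following is only a sketch of a typical pipeline: lowercasing, stripping non-letter characters, and removing stopwords. The function name `clean_text` is hypothetical, and the small hard-coded stopword set is a stand-in for a full list such as the NLTK stopwords corpus.

```python
import re

# Stand-in stopword list; a real pipeline would use a much fuller one
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}

def clean_text(text):
    """Lowercase the text, drop non-letters, and remove stopwords."""
    text = text.lower()
    # Replace anything that is not a lowercase letter or whitespace with a space
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)

cleaned = clean_text("A Survey of Graph-Based Methods for Recommender Systems!")
print(cleaned)
```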

## Generating Recommender System Framework
The final step is to create the framework for the recommender system. This is carried out by the "generate_recommender" function, which takes seven parameters:
1. textFile (string): Location of the specified text file used in dictionary generation
2. outputDir (string): Location to save recommender parts
3. weighting (string): SMART-IRS TF-IDF weighting, default is "ntc"
4. dictExists (boolean): Whether the Gensim dictionary exists already, default is False
5. corpusExists (boolean): Whether the Gensim corpus exists already, default is False
6. tfidfExists (boolean): Whether the Gensim TF-IDF model exists already, default is False
7. createSimMatrix (boolean): Whether to create a disk-based version of the similarity matrix, default is True

For more information on the createSimMatrix parameter, please refer to the README file of this project.
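
To make the "ntc" weighting concrete, here is a small pure-Python sketch of what it computes: natural (raw) term frequency, multiplied by inverse document frequency, followed by cosine normalization. It is not how `generate_recommender` is implemented, since the project uses Gensim for this step. The function names, the toy documents, and the choice of a base-2 logarithm for the IDF are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_ntc(documents):
    """Return one unit-length TF-IDF vector (term -> weight) per document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        # n: raw term frequency, t: idf (base-2 log assumed here)
        weights = {
            term: tf * math.log2(n_docs / doc_freq[term])
            for term, tf in counts.items()
        }
        # c: cosine normalization, so each vector has length 1
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())

# Toy tokenized documents, made up for this example
docs = [
    ["graph", "database", "query"],
    ["graph", "neural", "network"],
    ["text", "similarity", "query"],
]
vectors = tfidf_ntc(docs)
similarity = [[cosine(u, v) for v in vectors] for u in vectors]
for row in similarity:
    print([round(value, 3) for value in row])
```

Because every vector is normalized to unit length, the similarity between two papers reduces to a dot product, and a paper's nearest neighbours are the largest entries in its row of the matrix.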

In [1]:
from utils.recommender import generate_recommender

textFile = "dataset/recommender/Text.csv"
outputDir = "src/static/data/"

generate_recommender(textFile, outputDir, weighting="ntc", dictExists=False, corpusExists=False, 
                     tfidfExists=False, createSimMatrix=True)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ronan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


TypeError: unsupported operand type(s) for +: 'datetime.datetime' and 'str'
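
The `TypeError` above is what Python raises when a `datetime` object is concatenated directly with a string. The output does not show where in `generate_recommender` this happens, but a timestamped log message is a likely candidate. The sketch below reproduces the error and shows one possible fix; the helper name `log_message` and the message text are hypothetical.

```python
from datetime import datetime

def log_message(message, now=None):
    """Build a timestamped log line; the f-string converts the datetime to text."""
    timestamp = now if now is not None else datetime.now()
    return f"{timestamp}: {message}"

# Reproduce the failure: datetime + str is not a supported operation
try:
    datetime.now() + ": Dictionary created."
    reproduced = False
except TypeError as error:
    reproduced = "datetime.datetime" in str(error)

# A fixed timestamp is passed in here only to make the result predictable
line = log_message("Dictionary created.", now=datetime(2020, 5, 17, 17, 30, 0))
print(line)
```

Wrapping the timestamp in `str(...)` before concatenating would work equally well.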