# New word combinations

## Overview
This notebook analyzes a dataset of papers to identify new word combinations (unordered pairs of distinct words that occur in the same paper).
For each new word combination, the ID of the first paper that uses it is recorded, and the number of subsequent papers that reuse the combination is counted.

A baseline of word combinations is defined. Word combinations that appear in the baseline are not considered new.
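
The baseline check can be illustrated with a minimal sketch. The helper name `word_pairs` and the example phrases are hypothetical and only serve the illustration; the pair construction (sorted tuples from `itertools.combinations`) follows the approach used in the code cells of this notebook.

```python
import itertools as it

def word_pairs(words):
    """Return the set of unordered word pairs, each stored as a sorted tuple."""
    return {tuple(sorted(pair)) for pair in it.combinations(set(words), 2)}

# Hypothetical baseline paper and later paper
baseline = word_pairs("graph neural network".split())
candidate = word_pairs("graph attention network".split())

# Pairs already in the baseline are not new
new_pairs = candidate - baseline
print(sorted(new_pairs))
# [('attention', 'graph'), ('attention', 'network')]
```

Sorting each pair means that `('graph', 'network')` and `('network', 'graph')` are treated as the same combination.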

The script reads processed data from a CSV file, compares each word combination against the baseline, and counts the occurrences of the new word combinations. The results are then written to a new CSV file.

## Workflow
- **Setting Up the Environment:** The notebook starts by importing the necessary libraries and raising the `csv` module's maximum field size, which avoids errors when reading very long fields from the CSV file.

- **Counting the Number of Papers:** It calculates the total number of papers to be processed by counting the lines in the processed data CSV file. This is needed to keep track of the process with a progress bar (tqdm).

- **Creating the Baseline:** A baseline set of word combinations is created from papers published before a specified baseline year. The notebook reads each paper and adds word combinations to the baseline set if the paper’s publication year is before the baseline year.

- **Counting New Words:** The notebook then reads the processed data of each paper and looks for word combinations that are not in the baseline set. For each new word combination, it stores the ID of the paper in which the combination first appeared and the number of later papers that reuse it.

- **Exporting the Results:** The new word combinations, along with the ID of the paper in which each combination first appeared and the number of later papers that reuse it, are written to a new CSV file. Word combinations that appear in only one paper are filtered out.
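
The steps above can be sketched end to end on in-memory data. This is a simplified illustration, not the notebook's implementation: the function `count_new_pairs`, the `(paper_id, year, words)` tuples, and the toy papers are all assumptions, and the papers are assumed to be listed in chronological order so that the first occurrence in the list is the first appearance.

```python
import collections
import itertools as it

def word_pairs(words):
    """Return the set of unordered word pairs, each stored as a sorted tuple."""
    return {tuple(sorted(pair)) for pair in it.combinations(set(words), 2)}

def count_new_pairs(papers, baseline_year):
    """papers: list of (paper_id, year, words), assumed to be in chronological order."""
    # Baseline: every pair used in a paper published before the baseline year
    baseline = set()
    for _, year, words in papers:
        if year < baseline_year:
            baseline |= word_pairs(words)

    reuse = collections.Counter()
    first_paper = {}
    for paper_id, _, words in papers:
        for pair in word_pairs(words) - baseline:
            if pair not in first_paper:
                first_paper[pair] = paper_id  # first appearance, no reuse yet
                reuse[pair] = 0
            else:
                reuse[pair] += 1              # one more paper reusing the pair

    # Keep only pairs that were reused at least once
    return {pair: (first_paper[pair], n) for pair, n in reuse.items() if n > 0}

# Hypothetical toy dataset
papers = [
    (1, 1998, ["graph", "network"]),
    (2, 2001, ["graph", "attention"]),
    (3, 2003, ["attention", "graph", "network"]),
    (4, 2005, ["attention", "graph"]),
]
print(count_new_pairs(papers, baseline_year=2000))
# {('attention', 'graph'): (2, 2)}
```

In this toy run, `('graph', 'network')` belongs to the baseline, `('attention', 'network')` appears in only one paper and is dropped, and `('attention', 'graph')` first appears in paper 2 and is reused by papers 3 and 4.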

## Output
The notebook generates a CSV file containing each new word combination (one that is not part of the baseline), the ID of the paper in which the combination first appeared, and the number of later papers that reuse it (the `reuse` column, which does not count the first appearance). Each row in the file represents a unique new word combination.
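
As a hedged usage sketch, the output file could be inspected as follows. The column names `word1`, `word2`, `PaperID`, and `reuse` are taken from the header written by the export step; the helper `top_reused` and the sample rows are hypothetical, and an in-memory file stands in for the real output so that the example is self-contained.

```python
import csv
import io

def top_reused(csv_file, n=10):
    """Return the n most reused word combinations from an open CSV file."""
    reader = csv.DictReader(csv_file)
    rows = [
        (row["word1"], row["word2"], int(row["PaperID"]), int(row["reuse"]))
        for row in reader
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)[:n]

# Hypothetical sample with the same header as the exported file
sample = io.StringIO(
    "word1,word2,PaperID,reuse\n"
    "attention,network,3,1\n"
    "attention,graph,2,2\n"
)
print(top_reused(sample, n=1))
# [('attention', 'graph', 2, 2)]
```

To run this on the real output, an open file handle for the exported CSV would be passed in place of `sample`.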

In [2]:
import csv
from tqdm.notebook import tqdm
import collections
import sys
import itertools as it

# Raise the csv field size limit; otherwise reading very long fields raises an error
maxInt = sys.maxsize

while True:
    # Decrease maxInt by a factor of 10 for as long as
    # csv.field_size_limit raises an OverflowError.
    try:
        csv.field_size_limit(maxInt)
        break
    except OverflowError:
        maxInt = maxInt // 10  # integer division keeps the value an exact int
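
A small sketch of why this setup is needed: the `csv` module rejects any field longer than its limit (131072 characters by default). The helper `raise_field_size_limit` wraps the loop above as a function, and the 200,000-character row is a hypothetical stand-in for a long paper text.

```python
import csv
import io
import sys

def raise_field_size_limit():
    """Set the csv field size limit to the largest value the platform accepts."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            limit //= 10

# Hypothetical row whose text field holds 200,000 characters
long_text = "word " * 40000
data = f'PaperID,text\n1,"{long_text}"\n'

csv.field_size_limit(131072)  # the csv module's default limit
try:
    list(csv.reader(io.StringIO(data)))
    failed_with_default = False
except csv.Error:
    # "field larger than field limit" is raised for the long field
    failed_with_default = True

raise_field_size_limit()
rows = list(csv.reader(io.StringIO(data)))
print(failed_with_default, len(rows[1][1]))
```

With the default limit the read fails, and after the limit is raised the same row is read in full.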

In [3]:

# Count the number of papers
print('Get the number of papers to process...')
with open('../data/processed/papers_words.csv', 'r', encoding='utf-8') as file:
    line_count = sum(1 for line in file)
total_papers = line_count - 1  # Subtract 1 for the header

print('Creating the baseline...')
# Creating a baseline set of word combinations from papers published before the baseline year
baseline_year = 2000
baseline = set()

print('Iterating over the baseline...')
with open('../data/raw/papers_raw.csv', 'r', encoding='utf-8') as raw_reader, \
        open('../data/processed/papers_words.csv', 'r', encoding='utf-8') as processed_reader:
        
    csv_raw_reader = csv.reader(raw_reader, delimiter='\t', quotechar='"')
    csv_processed_reader = csv.reader(processed_reader, delimiter=',', quotechar='"')

    # Skipping the headers
    next(csv_raw_reader)
    next(csv_processed_reader)
    
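    # Iterating over each paper and adding its word combinations to the baseline
    # if the paper was published before the baseline year
    for line_raw, line_processed in tqdm(zip(csv_raw_reader, csv_processed_reader), total=total_papers):
        # The year is the leading part of column 1 (before the first '-').
        # Papers from the baseline year itself are not "before" it, so they are skipped too.
        if int(line_raw[1].split('-')[0]) >= baseline_year:
            continue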
            
        text = set(line_processed[1].split() + line_processed[2].split())
        
        # Unordered pairs of distinct words, each stored as a sorted tuple
        combs = {tuple(sorted(comb)) for comb in it.combinations(text, 2)}
        
        baseline.update(combs)
        
# Counting the reuse of new word combinations that are not in the baseline
counter = collections.Counter()
paperIds = {}  # maps each new word combination to the ID of its first paper

print('Calculating new words...')
# Reading the processed papers data and counting new word combinations
with open('../data/processed/papers_words.csv', 'r', encoding='utf-8') as reader:
    csv_reader = csv.reader(reader, delimiter=',', quotechar='"')
    next(csv_reader)  # Skip header

    for line in tqdm(csv_reader, total=total_papers):
        paperID = int(line[0])
        text = set(line[1].split() + line[2].split())
        
        # Unordered pairs of distinct words, each stored as a sorted tuple
        combs = {tuple(sorted(comb)) for comb in it.combinations(text, 2)}
        
        for comb in combs:
            if comb in baseline:
                continue
                
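            if comb not in counter:
                # First appearance (in file order): record the paper ID, no reuse yet
                counter[comb] = 0
                paperIds[comb] = paperID
            else:
                # Every later paper containing the combination counts as one reuse
                counter[comb] += 1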
                
print('Exporting the results...')
# Exporting the results to a new CSV file
with open('../data/metrics/new_word_comb.csv', 'w', encoding='utf-8', newline='') as writer:
    # csv.writer quotes words that contain commas or quotes
    csv_writer = csv.writer(writer, lineterminator='\n')
    csv_writer.writerow(['word1', 'word2', 'PaperID', 'reuse'])  # Header

    for (word1, word2), reuse in tqdm(counter.items(), total=len(counter)):
        # Skip combinations that were never reused (they appear in only one paper)
        if reuse == 0:
            continue

        csv_writer.writerow([word1, word2, paperIds[(word1, word2)], reuse])


Get the number of papers to process...
Creating the baseline...
Iterating over the baseline...


  0%|          | 0/355 [00:00<?, ?it/s]

Calculating new words...


  0%|          | 0/355 [00:00<?, ?it/s]

Exporting the results...


  0%|          | 0/274544 [00:00<?, ?it/s]