In [1181]:
import ndjson
import json
import random
import re
from functools import partial
from tqdm import tqdm

# The Stack Code

Stack stats:

In [1133]:
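# Running totals across all languages
cumsize = 0
cumtokens = 0
with open("stack-code/stats.json") as f:
    stats = json.load(f)

for key in stats:
    print(key.upper())
    # token count stored under "neox_tokens", reported in billions
    tokens = stats[key]["neox_tokens"]/10**9
    cumtokens += tokens
    print(f"tokens: {tokens:.4f} B")
    # size reported in GB (10**9 units of the stored "size" value)
    size = stats[key]["size"]/10**9
    cumsize += size
    print(f"size: {size:.4f} GB\n")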

print("CUMULATIVE:")
print(f"tokens: {cumtokens:.4f} B")
print(f"size: {cumsize:.4f} GB\n")



MATLAB
tokens: 0.0380 B
size: 0.0480 GB

JULIA
tokens: 0.6699 B
size: 1.7519 GB

R
tokens: 0.1385 B
size: 0.3165 GB

SAGE
tokens: 0.0063 B
size: 0.0148 GB

MATHEMATICA
tokens: 0.9193 B
size: 1.8779 GB

MAPLE
tokens: 0.0135 B
size: 0.0268 GB

GAP
tokens: 0.0053 B
size: 0.0126 GB

LEAN
tokens: 0.0695 B
size: 0.1628 GB

ISABELLE
tokens: 0.0393 B
size: 0.0989 GB

PYTHON
tokens: 6.8227 B
size: 21.0366 GB

C
tokens: 0.0254 B
size: 0.0680 GB

C++
tokens: 1.3958 B
size: 4.2658 GB

TEX
tokens: 1.0167 B
size: 2.9576 GB

CUMULATIVE:
tokens: 11.1601 B
size: 32.6383 GB



**Problems with the stack**
- Issue: The Matlab data is wrong. Only 111 Matlab files match the regex `[a-df-zA-Z]`. Most of the Matlab files appear to be arrays saved as text, so very little of the actual code was captured.
    - [x] Fix 1: Regex filter to delete arrays
    - [ ] Fix 2: Find rest of matlab files
- Issue: The R data contains macOS "resource fork" files that aren't related to R at all.
    - [x] Fix: filter out resource forks
- Issue: .sagews files have a bunch of hashes all over the place.
    - [ ] Fix: figure out how to delete hashes, or render notebooks. 
- Issue: .sage files tend to contain long strings of hard-coded numbers. Is this ok? e.g. `ClathomasPrime/CompetitiveStableMatching:Plotting/plots.sage`
- Issue: Wolfram Mathematica has three file formats: `.wls` (Wolfram Language script, handled ok); `.m` (Wolfram Language package, handled ok); and `.nb` (notebook), whose plaintext has a bunch of noise. Notebooks need to be exported as `.wls`.
    - [ ] Fix: convert notebooks to tex or wls
- Issue: There is one Mathematica repo, `dendaxD/QAOA-MaxCut-amplitudes`, that contains about half of all Mathematica files in the stack. All these files are extremely similar and should be excluded on data diversity grounds.
    - [x] Fix: filter out this repo. 
- Issue: Some maple files are actually xml
    - [x] Fix: filter out xml
- Issue: Lots of auto-generated tex files in directories called `latex`. 
    - [x] Fix: remove these
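
The Matlab regex in the list above keeps a file only if it contains at least one ASCII letter other than lowercase `e`. A plausible reason for excluding `e` is that numeric dumps in scientific notation (such as `1.0e+03`) contain it. A minimal sketch on made-up inputs; the two sample strings are hypothetical and not taken from the Stack:

```python
import re

# Same pattern as the Matlab filter: any ASCII letter except lowercase "e".
has_code_letter = re.compile("[a-df-zA-Z]")

# Hypothetical file contents, made up for illustration.
numeric_dump = "1.0e+03 2.5e-01\n3.0e+00 4.2e+01\n"
real_code = "x = linspace(0, 1, 10);\nplot(x, x.^2)\n"

# True means the file would be kept by the filter.
keep_dump = bool(has_code_letter.search(numeric_dump))
keep_code = bool(has_code_letter.search(real_code))
print(keep_dump, keep_code)
```

Note that a numeric dump containing `NaN` or `Inf` would still pass this check, since those tokens contain letters other than `e`.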

Languages the stack handles ok:
- Lean is fine
- Julia is fine (possibly want to remove files that conform to the jsonl spec)
- Python is clean (maybe get rid of Chinese characters?)
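
For the Julia note above, one way to detect files that are really jsonl data is to check whether every non-empty line parses as a JSON object or array. This is only a sketch: `looks_like_jsonl`, `julia_fix`, and the `min_lines` threshold are hypothetical names and choices, not part of the notebook.

```python
import json

def looks_like_jsonl(content, min_lines=2):
    """Return True if every non-empty line parses as a JSON object or array."""
    lines = [ln for ln in content.splitlines() if ln.strip()]
    if len(lines) < min_lines:
        return False
    for ln in lines:
        try:
            value = json.loads(ln)
        except json.JSONDecodeError:
            return False
        # Bare numbers or strings are not treated as jsonl records here.
        if not isinstance(value, (dict, list)):
            return False
    return True

def julia_fix(example):
    # Hypothetical filter in the same style as the other *_fix functions.
    return not looks_like_jsonl(example["content"])
```

Requiring at least two lines is an arbitrary guard against very short files that happen to parse as JSON.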

I'm not sure if my C/C++ filtering is good at all. Am I getting too many `.h` files?

Do we want Chinese in our Python?

Another issue to consider: non-Latin characters, e.g. Chinese.
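
To help decide the question above, it may be useful to measure how much of a file is non-Latin text instead of dropping any file that contains a single such character. A rough sketch, reusing the character ranges from this notebook's TeX filter; `non_latin_fraction` is a hypothetical helper and any cutoff applied to its result would be an assumption:

```python
import re

# Same character ranges as the notebook's TeX filter.
non_latin = re.compile("[\u0370-\u18aa\u3000-\U0001047f]")

def non_latin_fraction(text):
    """Fraction of characters in `text` that fall in the ranges above."""
    if not text:
        return 0.0
    return len(non_latin.findall(text)) / len(text)

# Made-up example: two of the four characters are Chinese.
print(non_latin_fraction("ab中文"))
```

A Python file with a few Chinese comments would then score low, while a file that is mostly Chinese prose would score high.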

In [1262]:
def matlab_rexp(example, rexp):
    return bool(rexp.search(example["content"]))

h = re.compile('[a-df-zA-Z]')
matlab_fix = partial(matlab_rexp, rexp=h)

def r_fix(example): 
    return "/* Resource fork" not in example["content"]

def mathematica_fix(example): 
    return example["max_stars_repo_name"] != "dendaxD/QAOA-MaxCut-amplitudes"

def maple_fix(example): 
    return not example["content"].startswith("<?xml")

def tex_not_rexp(example, rexp):
    return (not rexp.search(example["content"])) and "latex/" not in example["max_stars_repo_path"]

# matches characters from non-Latin scripts (e.g. Greek, Cyrillic, CJK);
# tex_filter drops any file that contains one
h = re.compile("[\u0370-\u18aa\u3000-\U0001047f]")
tex_filter = partial(tex_not_rexp, rexp=h)
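
The per-language filters above could be collected in one lookup table so that the loading cell does not need to name a specific filter. A minimal sketch under assumptions: `FILTERS` and `apply_filters` are hypothetical names, only two of the filters are copied in as stand-ins, and the toy records are made up.

```python
# Stand-in copies of two of the filters defined above.
def r_fix(example):
    return "/* Resource fork" not in example["content"]

def maple_fix(example):
    return not example["content"].startswith("<?xml")

# Hypothetical registry: language name -> list of predicates to apply.
FILTERS = {"r": [r_fix], "maple": [maple_fix]}

def apply_filters(examples, lang):
    """Keep the examples that pass every predicate registered for `lang`."""
    preds = FILTERS.get(lang, [])
    return [ex for ex in examples if all(p(ex) for p in preds)]

# Made-up records for illustration.
toy = [{"content": "<?xml version='1.0'?>"}, {"content": "f := x -> x^2;"}]
kept = apply_filters(toy, "maple")
print(len(kept))
```

Languages with no registered filter pass through unchanged, which matches the notes that Lean, Julia, and Python need no fix so far.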

In [1234]:
lang = "tex"
with open(f"stack-code/{lang}/0000000.jsonl") as f: 
    ds = ndjson.load(f)

print("before filter len: ", len(ds))
ds = list(filter(tex_filter, tqdm(ds)))
print("after filter len: ", len(ds))

before filter len:  100000


100%|██████████████████████████████████████████████████| 100000/100000 [00:09<00:00, 10338.68it/s]


after filter len:  63025


In [1185]:
i = 0 
random.shuffle(ds)

In [1261]:
i += 1
text = ds[i]
print(i)
print(text["max_stars_repo_name"])
print(text["max_stars_repo_path"] + "\n" + "#"*40 + "\n")
print(text["content"])

73
lemoxiao/Awesome-Beamer-Collection
200+ beamer 模板合集/TeXTemplates(论文，报告，beamer，学术报告)/0_0_Preamble/Preamble_BibLaTeX_AER.tex
########################################

% !TeX TXS-program:compile = txs:///pdflatex/
% !TeX TXS-program:bibliography = txs:///biber
% !TeX program = pdflatex
% !BIB program = biber




%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%  CITATION COMMANDS AND BIBLIOGRAPHY STYLE %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


% AER/JEL/JEP style

\usepackage[backend=biber, natbib=true, bibencoding=inputenc, bibstyle=authoryear, citestyle=authoryear-comp, mincitenames=1, maxcitenames=3, minbibnames=99, maxbibnames=99, uniquename=false, uniquelist=true, backref=true, backrefstyle=three, doi=true, isbn=false, dashed=false, sorting=ynt, sortcites=true, mergedate=true, dateabbrev=false, abbreviate=false, citetracker=true]{biblatex}
% sortcites sorts the in-text citations by year of publication
\DeclareBibliographyAlias{newspaper}{article}

% Full author list on first

In [1263]:
j = 10
print(ds[j]["max_stars_repo_name"])
print(ds[j]["max_stars_repo_path"])
print(ds[j]["content"])

nicolair/maths-cours
C2195.tex
\input{courspdf.tex}
\debutcours{Approximations des zéros d'une fonction}{alpha}

L'approximation des zéros (ou racines) d'une fonction comporte deux temps : la séparation des racines et l'approximation proprement dite.\newline
La séparation des racines consiste à former des intervalles sur lesquels la restriction de la fonction a de bonnes propriétés et admet une seule racine. Les méthodes proposées ici ne portent que sur les manières de former des valeurs approchées de l'unique zéro dans l'intervalle considéré.\newline
Dans les trois cas, on supposera que la fonction est strictement croissante sur un intervalle $[a,b]$ avec $f(a)<0$ et $f(b)>0$.

\section{Dichotomie}
La méthode de dichotomie repose sur le diagramme suivant et se met en oeuvre très facilement informatiquement. Il est à noter que l'on dispose automatiquement d'une majorations de l'erreur car après $n$ itérations, la racine est entre $a$ et $b$ avec 
\begin{displaymath}
 0<b-a=\frac{b-a}{2

In [101]:
matlab_fix(ds[2])

False