## Outdated Information For Future Reference
This file will attempt to perform sentence completion a little differently by using [numpy-ml](https://numpy-ml.readthedocs.io/en/latest/numpy_ml.ngram.goodturing.html) instead of nltk. Other numpy-ml smoothing algorithms can be found [here](https://numpy-ml.readthedocs.io/en/latest/numpy_ml.ngram.html#:~:text=Laplace%20smoothing%20is%20the%20assumption,time%20than%20it%20actually%20does.&text=where%20c(a)%20denotes%20the,n%20%2Dgrams%20in%20the%20corpus.).

## Dataset Processing
The main goal of this notebook is now to investigate alternative processing methods for our input data. This is necessary so we can more easily add more information to the model.

# Input Creation
This file creates and preprocesses the data used for model training. Our model was originally trained only on transcripts of Khan Academy lectures. These lectures served us well as a proof of concept but, being spoken text, introduced biases and limitations. We aim to alleviate some of this by expanding the input corpus used to train the model. Many of the additional datasets are too large to include directly in our repository because of GitHub's 100 MB file size limit and 2 GB repository size limit. To work around this while still allowing for more data, this file downloads the datasets and sets them up locally. Note that some of the datasets downloaded by this file can reach up to 37 GB, so downloads may take a long time on slower networks. This file also allows ebooks to be added to the model: place any ebooks to be used for training in the Epubs folder under the Datasets directory.

## General Process Order
1. Download the datasets and unzip them if necessary. Additionally, store each document in its own .txt file where needed, as is the case with the Khan Academy transcripts.
2. Load in the dataset from disk.
3. Run the standardized cleaning method over the text in each file.
4. Store the updated text to disk for use in the primary model training file.
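
Steps 2 through 4 above can be sketched as one small helper. This is a minimal sketch, not the notebook's actual implementation: the function name `clean_directory` and its parameters are hypothetical, and the cleaning function is passed in so that any standardized cleaner (such as the `clean_base` function defined later in this file) can be used.

```python
from pathlib import Path
from typing import Callable


def clean_directory(src_dir: Path, dst_dir: Path, clean: Callable[[str], str]) -> int:
    """Load every .txt file under src_dir, clean it, and store it in dst_dir.

    Returns the number of files written. Hypothetical helper for illustration.
    """
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for src in sorted(src_dir.glob("**/*.txt")):
        text = src.read_text(encoding="utf-8")    # step 2: load from disk
        cleaned = clean(text)                     # step 3: standardized cleaning
        (dst_dir / src.name).write_text(cleaned, encoding="utf-8")  # step 4: store
        written += 1
    return written
```

Writing to a separate output directory keeps the raw download intact, so the cleaning step can be rerun if the cleaning rules change. Note that files with the same name in different subdirectories would overwrite each other in this sketch.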

## Datasets
- [Khan Academy Transcripts](https://www.khanacademy.org/)
- [Kaggle Ted Talks Dataset](https://www.kaggle.com/datasets/rounakbanik/ted-talks)
- [1.7GB of Project Gutenberg text files](https://zenodo.org/record/3360392#.ZFU5nXbMIuV)
- [Bookcorpus part 1: 2GB](https://the-eye.eu/public/AI/pile_preliminary_components/)
- [Bookcorpus part 3: 37GB](https://the-eye.eu/public/AI/pile_preliminary_components/)

In [1]:
from pathlib import Path
from collections import Counter, defaultdict
import pandas as pd
import random
from tqdm.notebook import tqdm
import nltk
from nltk.util import ngrams
import nltk.data
import re
import contractions
from bs4 import BeautifulSoup
import unidecode
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import f1_score, recall_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
import math
import numpy as np
import copy
import seaborn as sns
import matplotlib.pyplot as plt
import json
import pickle
from nltk.lm.preprocessing import flatten
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm.preprocessing import pad_both_ends
from nltk.util import pad_sequence
from nltk.util import everygrams
from nltk.lm import MLE
from functools import partial
import requests
import sys
import time
import urllib.request
import shutil
import os

nltk.download([
"names",
"stopwords",
"state_union",
"twitter_samples",
"movie_reviews",
"averaged_perceptron_tagger",
"vader_lexicon",
"punkt",
])

lemmatizer = WordNetLemmatizer()
sia = SentimentIntensityAnalyzer()
encoder = LabelEncoder()
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
tqdm.pandas()
n = 3
input_path = "Datasets/Input"

[nltk_data] Downloading package names to /home/fassg/nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package stopwords to /home/fassg/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package state_union to
[nltk_data]     /home/fassg/nltk_data...
[nltk_data]   Package state_union is already up-to-date!
[nltk_data] Downloading package twitter_samples to
[nltk_data]     /home/fassg/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/fassg/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/fassg/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/fassg/nltk_data...
[nltk_data]   Package vader

In [2]:
def clean_base(x: str):
        # remove any html tags
        x = BeautifulSoup(x, "html.parser").get_text(separator=" ")
        # # set all to lower
        # x = x.lower()
        # clean up the contractions
        x = contractions.fix(x)
        # remove accented characters
        x = unidecode.unidecode(x)
        # # remove stopwords: https://stackoverflow.com/questions/19560498/faster-way-to-remove-stop-words-in-python
        # x = ' '.join([word for word in x.split() if word not in cachedStopWords]) # slower to use word tokenize
        # # fix punctuation spacing
        # x = re.sub(r'(?<=[\.\,\?])(?=[^\s])', r' ', x)
        # # strip punctuation
        # x = re.sub(r'[\.\,\?\\\/\<\>\;\:\[\]\{\}]', r'', x)
        # strip quotes
        # x = x.replace('\'', '').replace('\"', '')
        x = x.replace('\"', '')
        # # remove some actions
        # remove_list = ['(Laughter)', '(laughter)', '(Music)', '(music)', '(Music ends)', '(Audience cheers)', '(Applause)', '(Applause ends)', '(Applause continues)', '(Bells)', '(Trumpet)', '(Clears throat)']
        # x = ' '.join([word for word in x.split() if word not in remove_list])
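        # replace dashes and ellipses that stand between words with a single space
        # (replacing ' -- ' with '' would glue the neighbouring words together)
        x = x.replace(' -- ', ' ').replace(' .. ', ' ').replace(' ... ', ' ')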
        # remove extra whitespace
        x = ' '.join(x.strip().split())
        # # may want to add lematization
        # x = ' '.join([lemmatizer.lemmatize(word) for word in x.split()])
        # remove some of the extra bracket tags
        x = re.sub(r"\s{2,}", " ", re.sub(r"[\(\[\{][^\)\]\}]*[\)\]\}]", "", x))
        # # Strip newlines
        x = re.sub(r"\n", " ", x)
        return x
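
To make the bracket-stripping step of `clean_base` easier to reason about, here is a small standalone sketch of the same two regular expressions. The function name `strip_bracketed` and the sample strings are made up for illustration, and the trailing `.strip()` is an addition that the notebook function does not apply at this step.

```python
import re


def strip_bracketed(text: str) -> str:
    """Drop (...), [...] and {...} spans, then collapse the leftover whitespace."""
    no_tags = re.sub(r"[\(\[\{][^\)\]\}]*[\)\]\}]", "", text)
    return re.sub(r"\s{2,}", " ", no_tags).strip()


sample = "So this is the idea. (Laughter) And it works [Music] every time."
print(strip_bracketed(sample))
# Mathematical notation is affected as well:
print(strip_bracketed("f(x) = 2"))
```

Because the pattern removes every bracketed span, it also strips legitimate parentheticals and function arguments such as `f(x)`, which may matter for the math-heavy transcripts.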

## Khan Academy Transcripts
These transcripts are currently stored as five separate CSV files. Each CSV file represents a different domain of lectures on the site, and each domain contains multiple lectures from a variety of courses and subjects. The primary step in this section is to store a cleaned version of each transcript in its own text file. Each file name is based on the video title and the course of the lecture.

In [3]:
computing_df = pd.read_csv(Path("Datasets/KhanAcademy/Computing.csv"))
computing_df = computing_df.dropna()

economics_df = pd.read_csv(Path("Datasets/KhanAcademy/Economics.csv"))
economics_df = economics_df.dropna()

humanities_df = pd.read_csv(Path("Datasets/KhanAcademy/Humanities.csv"))
humanities_df = humanities_df.dropna()

math_df = pd.read_csv(Path("Datasets/KhanAcademy/Math.csv"))
math_df = math_df.dropna()

science_df = pd.read_csv(Path("Datasets/KhanAcademy/Science.csv"))
science_df = science_df.dropna()

khan_dfs = [computing_df, economics_df, humanities_df, math_df, science_df]
khan = pd.concat(khan_dfs, axis=0)
khan.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8261 entries, 0 to 2789
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   course       8261 non-null   object
 1   unit         8261 non-null   object
 2   lesson       8261 non-null   object
 3   video_title  8261 non-null   object
 4   about        8261 non-null   object
 5   transcript   8261 non-null   object
dtypes: object(6)
memory usage: 451.8+ KB


In [4]:
def store_khan_lecture(row):
    new_title = f"khan - {row['course']} - {row['video_title']}"
    new_title = re.sub(r'[\.\,\?\\\/\<\>\;\:\[\]\{\}\!\"®︎\|\*\(\)]', r'', new_title)
    new_fp = Path(f"{input_path}/Khan/{new_title}.txt")
    with open(new_fp, 'w', encoding="utf-8") as f:
        # f.write(row['transcript'])
        f.write(clean_base(row['transcript']))

khan.progress_apply(lambda row: store_khan_lecture(row), axis=1)
print("Done")

  0%|          | 0/8261 [00:00<?, ?it/s]

Done
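
The file-name sanitization used when storing each lecture can be factored into a reusable helper. The sketch below is an assumption-laden illustration: `safe_title` is a hypothetical name, and the character class is meant to mirror the one passed to `re.sub` in the cell above.

```python
import re

# Characters stripped from titles; intended to mirror the notebook's pattern.
_UNSAFE = re.compile(r'[.,?\\/<>;:\[\]{}!"®\uFE0E|*()]')


def safe_title(title: str) -> str:
    """Remove characters that are awkward or invalid in file names."""
    return _UNSAFE.sub("", title).strip()


print(safe_title("ted - Why? Because: (it) works!"))
```

One consequence worth keeping in mind: two titles that differ only in stripped characters map to the same file name, and since the files are opened in `'w'` mode, the later transcript silently overwrites the earlier one.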


## Ted Talk Transcripts
The TED Talks dataset consists of two CSV files. `ted_main.csv` contains metadata about each talk, while `transcripts.csv` contains each transcript along with the URL of the talk. As with the Khan Academy dataset, we clean each transcript and store it in its own text file.

In [6]:
ted_main = pd.read_csv("Datasets/TEDTalksDataset/ted_main.csv")
transcripts = pd.read_csv("Datasets/TEDTalksDataset/transcripts.csv")
ted = ted_main.join(transcripts, lsuffix='url', rsuffix='url', sort=True)
ted = ted.dropna()
ted.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2461 entries, 0 to 2466
Data columns (total 19 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   comments            2461 non-null   int64 
 1   description         2461 non-null   object
 2   duration            2461 non-null   int64 
 3   event               2461 non-null   object
 4   film_date           2461 non-null   int64 
 5   languages           2461 non-null   int64 
 6   main_speaker        2461 non-null   object
 7   name                2461 non-null   object
 8   num_speaker         2461 non-null   int64 
 9   published_date      2461 non-null   int64 
 10  ratings             2461 non-null   object
 11  related_talks       2461 non-null   object
 12  speaker_occupation  2461 non-null   object
 13  tags                2461 non-null   object
 14  title               2461 non-null   object
 15  urlurl              2461 non-null   object
 16  views               2461
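
The `urlurl` column in the output above suggests that both CSV files carry a `url` column. `DataFrame.join` without an `on` argument pairs rows by index position, so if the two files are not row-aligned, a title can end up next to the wrong transcript. The sketch below uses small made-up frames (not the real data) to contrast a positional join with a key-based merge on `url`.

```python
import pandas as pd

# Toy stand-ins for ted_main.csv and transcripts.csv (hypothetical values).
ted_main_demo = pd.DataFrame({
    "title": ["Talk A", "Talk B", "Talk C"],
    "url": ["https://example.org/a", "https://example.org/b", "https://example.org/c"],
})
transcripts_demo = pd.DataFrame({
    "transcript": ["text of C", "text of A"],
    "url": ["https://example.org/c", "https://example.org/a"],
})

# Positional join: row 0 of one frame is paired with row 0 of the other.
by_position = ted_main_demo.join(transcripts_demo, lsuffix="_main", rsuffix="_tr")

# Key-based merge: rows are paired only when their url values match.
by_url = ted_main_demo.merge(transcripts_demo, on="url", how="inner")

print(by_position[["title", "transcript"]])
print(by_url[["title", "transcript"]])
```

Whether the real files are row-aligned has not been checked here, so comparing the two results on the actual data would be a sensible first step before switching.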

In [7]:
def store_ted_lecture(row):
    new_title = f"ted - {row['title']}"
    new_title = re.sub(r'[\.\,\?\\\/\<\>\;\:\[\]\{\}\!\"®︎\|\*\(\)]', r'', new_title)
    new_fp = Path(f"{input_path}/Ted/{new_title}.txt")
    with open(new_fp, 'w', encoding="utf-8") as f:
        # f.write(row['transcript'])
        f.write(clean_base(row['transcript']))

ted.progress_apply(lambda row: store_ted_lecture(row), axis=1)
print("Done")

  0%|          | 0/2461 [00:00<?, ?it/s]



Done


## Method for Reporting Downloads
The function below reports progress while a download is running.

In [8]:
# https://blog.shichao.io/2012/10/04/progress_speed_indicator_for_urlretrieve_in_python.html
def reporthook(count, block_size, total_size):
    global start_time
    if count == 0:
        start_time = time.time()
        return
    duration = time.time() - start_time
    duration = 1 if duration == 0 else duration
    progress_size = int(count * block_size)
    speed = int(progress_size / (1024 * duration))
    percent = int(count * block_size * 100 / total_size)
    sys.stdout.write("\r...%d%%, %d MB, %d KB/s, %d seconds passed" %
                     (percent, progress_size / (1024 * 1024), speed, duration))
    sys.stdout.flush()
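
A hook that writes on every received block can produce a very large number of output messages in a notebook. As a possible alternative, the sketch below throttles output to at most one line per interval and avoids the global variable. The name `make_reporthook` and its parameters are hypothetical; the hook signature `(count, block_size, total_size)` is the one `urllib.request.urlretrieve` uses, where `total_size` may be non-positive when the server does not report a size.

```python
import sys
import time


def make_reporthook(min_interval=1.0, out=sys.stdout):
    """Build a urlretrieve-compatible hook that prints at most once per interval."""
    state = {"start": None, "last": 0.0}

    def hook(count, block_size, total_size):
        now = time.time()
        if state["start"] is None:
            state["start"] = now
        done = count * block_size
        finished = total_size > 0 and done >= total_size
        # Skip this update unless enough time has passed or the download is complete.
        if now - state["last"] < min_interval and not finished:
            return
        state["last"] = now
        elapsed = max(now - state["start"], 1e-6)
        speed_kb = done / 1024 / elapsed
        megabytes = done // (1024 * 1024)
        if total_size > 0:
            percent = min(100, done * 100 // total_size)
            out.write(f"\r...{percent}%, {megabytes} MB, {speed_kb:.0f} KB/s, {elapsed:.0f} seconds passed")
        else:
            # Total size unknown: report only what has been received so far.
            out.write(f"\r...{megabytes} MB, {speed_kb:.0f} KB/s, {elapsed:.0f} seconds passed")
        out.flush()

    return hook
```

It would be passed in the same way as the original hook, for example `urllib.request.urlretrieve(url, new_file_path, reporthook=make_reporthook())`.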

## Project Gutenberg
These files come from Zenodo. More text files can be added to this directory manually as needed.

In [10]:
new_file_path = Path("Downloads/D1.7GB.zip")
if not os.path.exists(new_file_path):
    url = 'https://zenodo.org/record/3360392/files/D1.7GB.zip?download=1'
    urllib.request.urlretrieve(url, new_file_path, reporthook=reporthook)

...6%, 33 MB, 1715 KB/s, 19 seconds passed

IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)



...11%, 62 MB, 2759 KB/s, 23 seconds passed
...66%, 357 MB, 7507 KB/s, 48 seconds passed
...78%, 419 MB, 7977 KB/s, 53 seconds passed
...84%, 452 MB, 8191 KB/s, 56 seconds passed
...97%, 524 MB, 8600 KB/s, 62 seconds passed
...100%, 537 MB, 8658 KB/s, 63 seconds passed

In [11]:
# url = 'https://zenodo.org/record/3360392/files/D1.7GB.zip?download=1'
# r = requests.get(url, allow_redirects=True)
# filename = getFilename_fromCd(r.headers.get('content-disposition'))
# # print(f"downloading {filename}")
# open(Path(f"Downloads/{filename}"), 'wb').write(r.content)

In [12]:
extract_dir = Path("Datasets/Input/")
shutil.unpack_archive(new_file_path, extract_dir)

## Bookcorpus


In [None]:
new_file_path = Path("Downloads/books1.tar.gz")
if not os.path.exists(new_file_path):
    url = 'https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz'
    urllib.request.urlretrieve(url, new_file_path, reporthook=reporthook)

...13%, 303 MB, 6184 KB/s, 50 seconds passed

IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)



...14%, 333 MB, 6440 KB/s, 52 seconds passed
...16%, 369 MB, 6527 KB/s, 57 seconds passed
...17%, 399 MB, 6642 KB/s, 61 seconds passed
...18%, 429 MB, 6659 KB/s, 66 seconds passed
...22%, 512 MB, 6214 KB/s, 84 seconds passed
...25%, 590 MB, 6390 KB/s, 94 seconds passed
...28%, 660 MB, 6420 KB/s, 105 seconds passed
...31%, 722 MB, 6319 KB/s, 116 seconds passed
...34%, 783 MB, 6211 KB/s, 129 seconds passed
...36%, 842 MB, 5941 KB/s, 145 seconds passed
...38%, 893 MB, 5976 KB/s, 153 seconds passed
...63%, 1453 MB, 5685 KB/s, 261 seconds passed
...67%, 1538 MB, 5444 KB/s, 289 seconds passed
...70%, 1614 MB, 5477 KB/s, 301 seconds passed
...81%, 1873 MB, 4864 KB/s, 394 seconds passed
...85%, 1967 MB, 4760 KB/s, 423 seconds passed
...89%, 2054 MB, 4860 KB/s, 432 seconds passed
...93%, 2133 MB, 4834 KB/s, 451 seconds passed
...94%, 2172 MB, 4855 KB/s, 458 seconds passed
...96%, 2209 MB, 4895 KB/s, 462 seconds passed
...97%, 2241 MB, 4934 KB/s, 465 seconds passed
...99%, 2271 MB, 4967 KB/s, 468 seconds passed

In [14]:
extract_dir = Path("Datasets/Input/")
shutil.unpack_archive(new_file_path, extract_dir)

In [15]:
directory_path = Path('Datasets/Input/books1/epubtxt')
# documents = []
for data_path in tqdm(list(directory_path.glob("**/*.txt"))):
    text = ""
    try:
        with open(Path(data_path), "r", encoding="utf-8") as f:
            text = clean_base(f.read())
            # f.write(text)
            # documents.append(text)
        with open(Path(data_path), "w", encoding="utf-8") as f:
            # text = clean_base(f.read())
            f.write(text)
    except Exception as e:
        print(f"{data_path} failed with exception {e}")

  0%|          | 0/17868 [00:00<?, ?it/s]

Datasets/Input/books1/epubtxt/the-microworld-miracle.epub.txt failed with exception string index out of range
Datasets/Input/books1/epubtxt/a-helping-hand-for-refugees.epub.txt failed with exception string index out of range
Datasets/Input/books1/epubtxt/20-soruda-evrim-teorisinin-cokusu.epub.txt failed with exception string index out of range
Datasets/Input/books1/epubtxt/the-pkks-treachery-and-oppression.epub.txt failed with exception string index out of range
Datasets/Input/books1/epubtxt/gunumuz-cahiliye-toplumunun-adi-konmamis-karanlik-dini-adaml.epub.txt failed with exception string index out of range
Datasets/Input/books1/epubtxt/70th-aacc-annual-scientific-meeting.epub.txt failed with exception string index out of range
Datasets/Input/books1/epubtxt/the-voyage-edited-by-chandani-lokuge-david-morley.epub.txt failed with exception string index out of range




Datasets/Input/books1/epubtxt/sufism.epub.txt failed with exception string index out of range
Datasets/Input/books1/epubtxt/allahin-renk-sanati.epub.txt failed with exception string index out of range
Datasets/Input/books1/epubtxt/kuran-ile-hayat-nasil-yasanir.epub.txt failed with exception string index out of range
Datasets/Input/books1/epubtxt/finance-guide.epub.txt failed with exception string index out of range
Datasets/Input/books1/epubtxt/has-the-bible-been-changed-the-reliability-of-the-scriptures.epub.txt failed with exception string index out of range
Datasets/Input/books1/epubtxt/the-error-of-the-evolution-of-species.epub.txt failed with exception string index out of range
Datasets/Input/books1/epubtxt/istanbul-intrigues.epub.txt failed with exception string index out of range
Datasets/Input/books1/epubtxt/allahin-detay-sanati.epub.txt failed with exception string index out of range
Datasets/Input/books1/epubtxt/americas-failure-to-perceive-the-pkk.epub.txt failed with except

In [16]:
# new_file_path = Path("Downloads/books3.tar.gz")
# if not os.path.exists(new_file_path):
#     url = 'https://the-eye.eu/public/AI/pile_preliminary_components/books3.tar.gz'
#     urllib.request.urlretrieve(url, new_file_path, reporthook=reporthook)

In [17]:
# extract_dir = Path("Datasets/Input/")
# shutil.unpack_archive(new_file_path, extract_dir)

In [18]:
# import tarfile
# with tarfile.open(new_file_path) as tar:  # extractall() returns None, so chaining .close() on it would fail
#     tar.extractall(extract_dir)

## Determine how to read data back in
This section investigates how we can read this data back into the program for use.

In [20]:
# directory_path = Path(new_dir)
# documents = []
# for data_path in tqdm(list(directory_path.glob("**/*.txt"))):
#     text = ""
#     with open(Path(data_path), "r", encoding="utf-8") as f:
#         text = clean_base(f.read())
#         documents.append(text)
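
Since some of the datasets are very large, collecting every document in one list may not fit in memory. One option, sketched below under that assumption, is a generator that yields one document at a time. The name `iter_documents` and its parameters are hypothetical and not part of the notebook.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_documents(root: Path, pattern: str = "**/*.txt") -> Iterator[Tuple[Path, str]]:
    """Yield (path, text) pairs one file at a time instead of building a full list."""
    for path in sorted(root.glob(pattern)):
        try:
            yield path, path.read_text(encoding="utf-8")
        except (OSError, UnicodeDecodeError) as exc:
            # Report unreadable files and keep going with the rest of the corpus.
            print(f"{path} skipped: {exc}")
```

A training loop could then consume it with `for path, text in iter_documents(Path("Datasets/Input")):` and process each document as it arrives.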

Additional text files were sourced from [Project Gutenberg](https://www.gutenberg.org/) indirectly through [Zenodo](https://zenodo.org/record/3360392#.ZFUirnbMIuU).

## Ebooks:
The section below investigates how to read ebooks, using an organic chemistry ebook I found online as the test case. I investigated two options. The first is more concise but sometimes drops the spaces between words during HTML parsing. The second is a little slower but more accurate, so we will go with that.

In [21]:
import ebooklib
from ebooklib import epub

book = epub.read_epub(Path("Datasets/Epubs/Organic-Chemistry-I-1639153167.epub"))

text = ""
for doc in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    # print(doc.content)
    text += clean_base(doc.content)
# print(text)



In [22]:
# !pip install epub-conversion ebooklib xml_cleaner

In [25]:
# from epub_conversion import Converter
# converter = Converter(Path("Datasets/Epubs/"))
# converter.convert("epubs.txt")
from epub_conversion.utils import open_book, convert_epub_to_lines

directory_path = Path("Datasets/Epubs")
documents = []
for data_path in tqdm(list(directory_path.glob("**/*.epub"))):
    book = open_book(Path(data_path))
    lines = [clean_base(line) for line in convert_epub_to_lines(book)]
    lines = [i for i in lines if i != '']
    text = '\n'.join(lines)
    with open(Path(f'{input_path}/Epubs/{data_path.stem}.txt'), 'w', encoding='utf-8') as f:
        f.write(text)

  0%|          | 0/1 [00:00<?, ?it/s]



FileNotFoundError: [Errno 2] No such file or directory: 'Datasets/Input/Epubs/Organic-Chemistry-I-1639153167.txt'
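
The `FileNotFoundError` above is what `open(..., 'w')` raises when the parent directory of the target file does not exist; here `Datasets/Input/Epubs` had presumably not been created yet. A minimal sketch of a fix is to create the directory before writing. The helper name `write_text_file` is hypothetical.

```python
from pathlib import Path


def write_text_file(path: Path, text: str) -> None:
    """Write text to path, creating any missing parent directories first."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
```

The same guard would apply to the Khan and Ted output folders and to the `Downloads` directory used by the download cells.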