# Doing Things with Text 3a: Import and clean multiple text files

This notebook shows how to automatically clean and save multiple text files.

### Step 1: Setting Up NLTK

NLTK (Natural Language Toolkit) is a library for working with text. To use it, you'll need to download some additional language data the first time you use NLTK. Run the following cell once:

In [None]:
# Import NLTK and download required packages
import nltk
nltk.download('punkt')  # Tokenizer models
nltk.download('punkt_tab')  # Tokenizer data that newer NLTK releases look for
nltk.download('stopwords')  # Stop word lists

### Step 2: Importing Required Packages

Here, we're loading a few packages to help with text cleaning:
- `BeautifulSoup`: To clean up HTML text.
- `os`: Helps with interacting with the operating system, such as managing file paths and directories.
- `re`: For regular expressions (patterns used for finding and cleaning text).
- `nltk.tokenize`: For splitting text into individual words.
- `nltk.corpus.stopwords`: A collection of common words like 'the', 'and', 'is', which are often removed in analysis.
- `matplotlib.pyplot`: Allows for creating visualizations like charts and graphs to represent data visually.
- `pandas`: Provides tools for handling and analyzing structured data in tables, making it easier to work with datasets.

In [None]:
# Import necessary libraries
from bs4 import BeautifulSoup
import os
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
import pandas as pd

### Step 3a: Define Input and Output Paths

Define where your text files are located (input) and where you want to save the cleaned text (output). Building the paths with `os.path.join()` keeps them cross-platform, so the same code works on Windows, macOS, and Linux.

Replace the placeholder path components with the actual location of your folders. The output folder does not need to exist yet: if it is missing, the code creates it for you.

In [None]:
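# Define input and output paths (replace the placeholder components with your own)
indir = os.path.join('path', 'to', 'your', 'input', 'folder')
outdir = os.path.join('path', 'to', 'your', 'output', 'folder')
os.makedirs(outdir, exist_ok=True)  # Create output directory if it doesn't exist

dataset = 'dataset'  # Replace with the name of your dataset; it is used in the output file names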
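
If you want to see how `os.path.join()` and `os.makedirs()` behave before pointing them at your own folders, the sketch below tries them out inside a temporary directory. The folder names `demo_project`, `raw`, and `cleaned` are placeholders made up for this illustration.

```python
import os
import tempfile

# Work inside a throwaway directory so nothing on disk is overwritten
base = tempfile.mkdtemp()

# os.path.join inserts the separator that matches the operating system
demo_indir = os.path.join(base, 'demo_project', 'raw')
demo_outdir = os.path.join(base, 'demo_project', 'cleaned')

# makedirs also creates missing parent folders (here: demo_project)
os.makedirs(demo_indir, exist_ok=True)
os.makedirs(demo_outdir, exist_ok=True)

# With exist_ok=True, repeating the call on an existing folder raises no error
os.makedirs(demo_outdir, exist_ok=True)

print(demo_outdir)
print(os.path.isdir(demo_outdir))
```

Because of `exist_ok=True`, you can rerun the path cell as often as you like without getting an error about an existing folder.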

### Step 3b: Load text documents, clean, and write to outdir

Everything that we did step by step in notebook 1 is combined in the next code block. First, it creates the variables that we need. Then it starts a `for` loop that goes through the directory stored in `indir`, opens the `.txt` files one by one, cleans them, and saves them to the directory stored in `outdir`.

In [None]:
file_names = []           # Names of the processed files
token_counts_before = []  # Tokens per file before stop word and short word removal
token_counts_after = []   # Tokens per file after stop word and short word removal
all_cleaned_text = []     # Cleaned text per file, used for the merged output file
cleaned_data = []         # One dictionary per file, used for the CSV output

stop_words = set(stopwords.words('english'))  # Load the stop word list once, before the loop

for filename in sorted(os.listdir(indir)):
    if filename.endswith('.txt'):
        file_path = os.path.join(indir, filename)
        file_names.append(filename)
        with open(file_path, 'r', encoding='utf8') as f:
            text = f.read()

        # Cleaning steps
        text = text.lower()  # Convert text to lowercase
        text = BeautifulSoup(text, 'html.parser').get_text()  # Remove HTML tags
        text = re.sub(r"[^a-z\s']", '', text)  # Keep only letters, whitespace, and apostrophes
        tokens = word_tokenize(text)  # Tokenize the text
        clean_tokens = [word for word in tokens if word not in stop_words]  # Remove stop words
        clean_tokens = [word for word in clean_tokens if len(word) >= 4]  # Remove words shorter than 4 characters

        token_counts_before.append(len(tokens))  # Token count before stop word and short word removal
        token_counts_after.append(len(clean_tokens))  # Token count after stop word and short word removal

        cleaned_text = ' '.join(clean_tokens)
        all_cleaned_text.append(cleaned_text)
        cleaned_data.append({'filename': filename, 'cleaned_text': cleaned_text})

        # Save the cleaned text under the same file name in the output directory
        output_file_path = os.path.join(outdir, filename)
        with open(output_file_path, 'w', encoding='utf8') as f:
            f.write(cleaned_text)

        print(f'Processed and saved: {filename}')
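
To get a feel for what the cleaning steps in the loop above do to a single text, here is a minimal sketch on one made-up sentence. So that it runs without any downloads, it makes two simplifying assumptions: `str.split()` stands in for `word_tokenize`, and `demo_stop_words` is a tiny hand-picked set standing in for NLTK's English stop word list. The HTML step is left out because the sentence contains no tags.

```python
import re

text = 'The quick brown fox jumps over 3 lazy dogs!'  # made-up example sentence
demo_stop_words = {'the', 'over'}  # hypothetical stand-in for stopwords.words('english')

text = text.lower()                    # Convert text to lowercase
text = re.sub(r"[^a-z\s']", '', text)  # Drops the digit and the exclamation mark
tokens = text.split()                  # Simplified stand-in for word_tokenize

no_stop = [word for word in tokens if word not in demo_stop_words]  # Remove stop words
clean_tokens = [word for word in no_stop if len(word) >= 4]         # Remove short words

print(len(tokens), tokens)
print(len(clean_tokens), clean_tokens)
```

In this sketch, 'fox' disappears in the last step because it has only three characters, even though it is not a stop word. Keep this in mind if short words matter for your research question.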


### (Optional) Step 4: Save All Cleaned Text to a Single File

In [None]:
def save_dataset(dataset):
    dataset_out = dataset.replace(" ", "_").lower()
    return dataset_out

merged_output_file = os.path.join(outdir, 'cleaned_text_%s.txt' %(save_dataset(dataset)))
with open(merged_output_file, 'w', encoding='utf8') as f:
    f.write('\n'.join(all_cleaned_text))
print('Merged cleaned text saved to:', merged_output_file)
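
Since each cleaned text is a single line of space-separated words, the merged file holds one document per line. The sketch below illustrates this by writing two made-up documents to a temporary file and reading them back. The names `demo_texts` and `cleaned_text_demo.txt` are placeholders for this illustration only.

```python
import os
import tempfile

demo_texts = ['quick brown jumps', 'lazy dogs sleep']  # hypothetical cleaned documents

# Write the documents the same way as the merged output: joined with newlines
demo_path = os.path.join(tempfile.mkdtemp(), 'cleaned_text_demo.txt')
with open(demo_path, 'w', encoding='utf8') as f:
    f.write('\n'.join(demo_texts))

# Read the file back and split it into one string per document
with open(demo_path, 'r', encoding='utf8') as f:
    documents = f.read().split('\n')

print(len(documents))
print(documents)
```

The lines appear in the same order as the sorted file names, so the first line belongs to the first file that was processed.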

### (Optional) Step 5: Save Cleaned Text in a CSV

In [None]:
def save_dataset(dataset):
    dataset_out = dataset.replace(" ", "_").lower()
    return dataset_out

df_cleaned = pd.DataFrame(cleaned_data)
csv_output_file = os.path.join(outdir, 'cleaned_text_%s.csv' %(save_dataset(dataset))) 
df_cleaned.to_csv(csv_output_file, index=False, encoding='utf8')
print('All cleaned text saved to CSV:', csv_output_file)

### Step 6: Count total number of words

#### Step 6a: Visualize total number of words before and after preprocessing in a bar chart

In [None]:
plt.figure(figsize=(10, 6))
bar_width = 0.35
index = range(len(file_names))

plt.bar(index, token_counts_before, bar_width, label='Before Cleaning')
plt.bar([i + bar_width for i in index], token_counts_after, bar_width, label='After Cleaning')

plt.xlabel('Files')
plt.ylabel('Number of Tokens')
plt.title(f'Token Counts Before and After Cleaning in {dataset}')
plt.xticks([i + bar_width / 2 for i in index], file_names, rotation=90)
plt.legend()
plt.tight_layout()
plt.show()

#### Step 6b: Count total number of words in your dataset before preprocessing

In [None]:
print("The total number of words in \'%s\' before preprocessing is: %s" 
      %(str(indir), sum(token_counts_before)))

#### Step 6c: Count total number of words in your dataset after preprocessing

In [None]:
print("The total number of words in \'%s\' after preprocessing is: %s" 
      %(str(indir), sum(token_counts_after)))

#### Step 6d: Calculate total number of words removed by preprocessing

In [None]:
print("The total number of words removed by preprocessing is: %s" 
      %(sum(token_counts_before) - sum(token_counts_after)))
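
An absolute number of removed words is easier to interpret next to the share of the dataset it represents. The sketch below shows one way to calculate that share. The counts in `demo_before` and `demo_after` are made-up numbers for illustration; in the notebook you would use `token_counts_before` and `token_counts_after` instead.

```python
demo_before = [120, 80, 200]  # hypothetical token counts per file before cleaning
demo_after = [45, 30, 90]     # hypothetical token counts per file after cleaning

total_before = sum(demo_before)
total_after = sum(demo_after)
removed = total_before - total_after

# Guard against division by zero in case no files were processed
share_removed = removed * 100 / total_before if total_before else 0.0

print(f'{removed} of {total_before} tokens were removed ({share_removed:.1f}%)')
```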