### About:
A Python script that reads MS Word (.DOCX) files downloaded from LexisNexis, extracts each article together with its headline, and writes the results to .CSV files. Each cell is commented to help the user understand what it does.

### Requirements:
- **Folder Path** Put the MS Word documents in a folder and set the `folder_path` variable in the code below to the path of that folder. The folder may contain nested sub-folders with many files; the script searches them recursively.

- **File Names** Make sure each .DOCX file is named in this format: `Files (1-100).DOCX`
This is mandatory because the script parses the values in the parentheses to skip the part of the file that is not required. Not every file contains 100 articles; a file that contains, e.g., 36 articles should be named `Files (1-36).DOCX`.

- **Errors** If the script fails, the last file name printed is the file that caused the error. The usual cause is a badly named file, e.g., one that is missing a parenthesis, like `Files (1-36.DOCX`, or one that has only a single value in the parentheses, like `Files (100).DOCX`. Correct such issues yourself and then rerun the script.

- **Output** The script writes one CSV file per input file, in the same directory and with the input file's name plus a `.csv` suffix. For example, processing `Files (1-36).DOCX` produces `Files (1-36).DOCX.csv` in the same directory. Each output CSV file has two columns: the first holds the headline and the second holds the text of the article. You can change the names of both columns through the `header` variable in the code below.
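
Because a badly named file only surfaces as an error halfway through a run, it can help to check the names first. The following is a minimal sketch of such a check, not part of the original script: the helper `parse_article_range` and the pattern `NAME_PATTERN` are hypothetical names introduced here, and the sketch assumes the naming format described above.

```python
import re

# Hypothetical helper, not part of the original script.
# Matches the "(first-last).DOCX" ending described in the requirements.
NAME_PATTERN = re.compile(r"\((\d+)-(\d+)\)\.DOCX$", re.IGNORECASE)


def parse_article_range(file_name):
    """Return (first, last) for a name like 'Files (1-36).DOCX', or None if malformed."""
    match = NAME_PATTERN.search(file_name)
    if match is None:
        return None
    first, last = int(match.group(1)), int(match.group(2))
    # A range that runs backwards is treated as malformed as well.
    return (first, last) if first <= last else None


# Names taken from the examples in the requirements above.
for name in ["Files (1-36).DOCX", "Files (1-36.DOCX", "Files (100).DOCX"]:
    print(name, "->", parse_article_range(name))
```

Any name for which the helper returns `None` would need to be renamed before the extraction is started.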

**Note:** This work was done as part of Sana Rabbani's Master's thesis at CUI.

In [15]:
#import libraries
import docx2python
from os import listdir
import csv
from os.path import isfile, join, isdir


In [16]:
# Specify folder path
folder_path = "Air Pollution Data"

# list that collects the paths of all matching files
file_paths = []

# specify the extension of MS WORD files. Default value is .DOCX.
file_extension = ".DOCX"

# specify the header values of the CSV files. Defaults: 'headline' for the first column, 'body' for the second.
header = ['headline', 'body']

In [17]:
def read_files_from_dir(path):
    files = get_files(path)
    for file_path in files:
        # Take the file name from the path; handles both "/" and "\" separators.
        file_name = file_path.replace("\\", "/").split("/")[-1]
        # Text between the parentheses, e.g. "1-100" for "Files (1-100).DOCX".
        limits = file_name[file_name.find("(") + 1:file_name.find(")")]
        last_article = int(limits.split("-")[1])
        print("Processing file: " + file_path)
        # Position of the last article within its batch of 100.
        remainder = last_article % 100
        if remainder == 0:
            # A full batch, e.g. "Files (101-200).DOCX".
            init_reader_writer(file_path, 200)
        else:
            init_reader_writer(file_path, remainder * 2)

def get_files(path):
    # Recursively collect every file under `path` whose name ends with
    # file_extension. Results accumulate in the global file_paths list.
    for f in listdir(path):
        full_path = join(path, f)
        if isfile(full_path):
            if f.endswith(file_extension):
                file_paths.append(full_path)
        elif isdir(full_path):
            get_files(full_path)
    return file_paths
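
The start index that `read_files_from_dir` passes on is easier to follow with concrete numbers. The sketch below only mirrors the arithmetic of the function above; `start_index` is a hypothetical name used for illustration, and what the doubled value skips inside the document is not stated in the source.

```python
def start_index(last_article_number):
    """Mirror the arithmetic used in read_files_from_dir (illustrative helper)."""
    remainder = last_article_number % 100
    if remainder == 0:
        # A file that ends on a multiple of 100, e.g. 'Files (101-200).DOCX'.
        return 200
    return remainder * 2


# The second value in the parentheses of each file name is the input.
for last in (36, 200, 446, 809):
    print(last, "->", start_index(last))
```

For example, `Files (401-446).DOCX` gives `446 % 100 = 46`, so the start index is `92`, while `Files (101-200).DOCX` gives a remainder of `0` and therefore `200`.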

The following three functions parse the content of each file and write the CSV files.

In [18]:
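def init_reader_writer(file_path, index_to_start_from):
    # Open the output file for writing. newline='' is what the csv module
    # expects; it prevents blank rows between records on Windows.
    with open(file_path + ".csv", 'w', newline='', encoding='utf-8') as f:
        # create the csv writer
        writer = csv.writer(f)
        # write the header row to the csv file
        writer.writerow(header)
        read_word_file(file_path, writer, index_to_start_from)
    # the file is closed automatically when the with block ends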

def read_word_file(address, writer, index_to_start_from):
    raw_document = docx2python.docx2python(address)
    content = raw_document.body_runs
    parse_content(content[index_to_start_from][0][0], writer)

def parse_content(content, writer):
    row = []
    for counter in range(0, len(content)):
        if len(content[counter]) > 0:
            if "<a href=\"http" in content[counter][0]:
                title = content[counter][0]
                title = (title.split(">")[1].split("<"))[0]
                row.append(title)

            if content[counter][0] == 'Body':
                # get content now
                local_counter = counter + 1
                body_content_indices = []

                while len(content[local_counter]) == 0:
                    local_counter += 1

                while "\nLoad-Date" not in content[local_counter][0]:
                    if len(content[local_counter]) > 0:
                        body_content_indices.append(local_counter)
                    local_counter += 1
                    while len(content[local_counter]) == 0:
                        local_counter += 1
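                # Build the article text: concatenate the runs of each paragraph
                # and separate the paragraphs with newlines. (os.path.join is
                # meant for file paths: it inserts "/" between pieces and drops
                # the text collected so far whenever a run starts with "/".)
                text = "\n".join("".join(content[para]) for para in body_content_indices)
                row.append(text)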
                writer.writerow(row)
                row = []
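
The headline extraction in `parse_content` relies on the headline appearing as a hyperlink run of the form `<a href="http...">Title</a>`. The sketch below isolates that step; the function names `extract_title` and `extract_title_regex` and the sample string are made up for illustration and do not come from the original data. The second function is an alternative, not the script's method, shown because the split-based approach cuts off a headline that itself contains `>`.

```python
import re


def extract_title(run_text):
    """Same split logic as parse_content: text between the first '>' and the next '<'."""
    return run_text.split(">")[1].split("<")[0]


def extract_title_regex(run_text):
    """Alternative sketch: capture everything between the opening and closing <a> tags."""
    match = re.search(r'<a href="[^"]*">(.*?)</a>', run_text)
    return match.group(1) if match else None


# Made-up sample run, for illustration only.
sample = '<a href="https://example.com/doc">Delhi air quality worsens</a>'
print(extract_title(sample))
print(extract_title_regex(sample))
```

Both functions return the same headline for the sample; they differ only when the link text contains a `>` character.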

In [19]:
# method call to start extraction
read_files_from_dir(folder_path)

['Air Pollution Data/India/Hindustan Times/JAN2020-DEC2022/Files (401-446).DOCX', 'Air Pollution Data/India/Hindustan Times/JAN2020-DEC2022/Files (101-200).DOCX', 'Air Pollution Data/India/Hindustan Times/JAN2020-DEC2022/Files (201-300).DOCX', 'Air Pollution Data/India/Hindustan Times/JAN2020-DEC2022/Files (1-100).DOCX', 'Air Pollution Data/India/Hindustan Times/JAN2020-DEC2022/Files (301-400).DOCX', 'Air Pollution Data/India/Hindustan Times/JAN2017-DEC2019/Files (401-500).DOCX', 'Air Pollution Data/India/Hindustan Times/JAN2017-DEC2019/Files (801-809).DOCX', 'Air Pollution Data/India/Hindustan Times/JAN2017-DEC2019/Files (701-800).DOCX', 'Air Pollution Data/India/Hindustan Times/JAN2017-DEC2019/Files (601-700).DOCX', 'Air Pollution Data/India/Hindustan Times/JAN2017-DEC2019/Files (101-200).DOCX', 'Air Pollution Data/India/Hindustan Times/JAN2017-DEC2019/Files (201-300).DOCX', 'Air Pollution Data/India/Hindustan Times/JAN2017-DEC2019/Files (501-600).DOCX', 'Air Pollution Data/India/Hin