![ukds.png](attachment:ukds.png)

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-required-modules" data-toc-modified-id="Import-required-modules-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import required modules</a></span></li><li><span><a href="#Import-one-named-RTF-file" data-toc-modified-id="Import-one-named-RTF-file-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import one named RTF file</a></span></li><li><span><a href="#Import-multiple-named-RTF-files" data-toc-modified-id="Import-multiple-named-RTF-files-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Import multiple named RTF files</a></span></li></ul></div>

# Read-in multiple Rich Text Format files

The first step in any text-mining project is reading your text files into your coding environment. RTF files can be particularly awkward to read in, because they include extra data that defines how the text is laid out and formatted. We will therefore use the `striprtf` library, which converts RTF files into Python strings.
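To see what that "extra data" looks like, here is a minimal sketch using a small hand-written RTF snippet. The `sample_rtf` string is made up for illustration and is not taken from the dataset. It uses only the standard library's `re` module to pick out the formatting commands (known as control words). This is not how `striprtf` works internally, just a quick way to show the markup that surrounds the text.

```python
import re

# A tiny hand-written RTF document (hypothetical sample for illustration).
sample_rtf = r"{\rtf1\ansi{\fonttbl{\f0 Arial;}}\f0\fs24 Hello \b farm\b0  diary}"

# RTF control words start with a backslash followed by letters,
# optionally followed by a number (e.g. \fs24 sets the font size).
control_words = re.findall(r"\\[a-z]+-?\d*", sample_rtf)

print(sample_rtf)
print(control_words)
```

Even in this short snippet, most of the characters are formatting instructions rather than text we would want to analyse.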

## Import required modules

In [1]:
# !pip install pandas
# !pip install striprtf

In [2]:
import os
# provides functions for interacting with underlying operating system
# e.g. change working directory, locate files

import csv
# includes functions for reading-in and writing tabular data in CSV format

import pandas as pd
# includes useful functions for manipulating data 

from striprtf.striprtf import rtf_to_text
# simple library to convert rtf files to python strings

In [3]:
# chdir() changes our current working directory
# i.e. "where we are now"; os.getcwd() lets us check it
os.chdir('/Users/loucap/Documents/GitWork/Text-Mining-Health')
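If you are not sure where your notebook is running from, a quick check like the sketch below can help before changing directory. The `project_dir` value here is only a stand-in (it simply reuses the current directory), so swap in the path to your own project folder.

```python
import os

# Stand-in path for illustration: replace with your own project folder.
project_dir = os.getcwd()

print("Before:", os.getcwd())

# Only change directory if the folder actually exists,
# otherwise os.chdir() raises FileNotFoundError.
if os.path.isdir(project_dir):
    os.chdir(project_dir)

print("After:", os.getcwd())
```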

Let's go ahead and list the files in the `Data/rtf` folder.

In [4]:
os.listdir("Data/rtf")

['5407int26.rtf',
 '5407int32.rtf',
 '5407diary43.rtf',
 '5407diary42.rtf',
 '5407int27.rtf',
 '5407int19.rtf',
 '5407int31.rtf',
 '5407diary40.rtf',
 '5407diary54.rtf',
 '5407diary55.rtf',
 '5407diary41.rtf',
 '5407int24.rtf',
 '5407int30.rtf',
 '5407int18.rtf',
 '5407int34.rtf',
 '5407int20.rtf',
 '5407int08.rtf',
 '.DS_Store',
 '5407diary44.rtf',
 '5407int09.rtf',
 '5407int21.rtf',
 '5407int23.rtf',
 '5407int37.rtf',
 '5407diary52.rtf',
 '5407diary47.rtf',
 '5407diary53.rtf',
 '5407int36.rtf',
 '5407int22.rtf',
 '5407diary08.rtf',
 '5407diary34.rtf',
 '5407diary21.rtf',
 '5407diary09.rtf',
 '5407int44.rtf',
 '5407int52.rtf',
 '5407diary23.rtf',
 '5407diary37.rtf',
 '5407diary36.rtf',
 '5407diary22.rtf',
 '5407int47.rtf',
 '5407int53.rtf',
 '5407int43.rtf',
 '5407diary26.rtf',
 '5407diary32.rtf',
 '5407diary27.rtf',
 '5407int42.rtf',
 '5407int40.rtf',
 '5407int54.rtf',
 '5407diary31.rtf',
 '5407diary19.rtf',
 '5407diary18.rtf',
 '5407diary24.rtf',
 '5407diary30.rtf',
 '5407int55.rtf'
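Notice that the listing includes `.DS_Store`, a hidden file that macOS creates, alongside the RTF files. One way to keep only the files we want is to filter on the file extension. The sketch below uses a hypothetical helper, `list_rtf_files`, and a temporary folder with made-up filenames so that it runs anywhere.

```python
import os
import tempfile

def list_rtf_files(folder):
    # Return the .rtf filenames in 'folder', sorted alphanumerically.
    return sorted(
        name for name in os.listdir(folder)
        if name.lower().endswith(".rtf")
    )

# Build a throwaway folder with made-up filenames to try the helper on.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["5407int26.rtf", ".DS_Store", "5407diary43.rtf"]:
        open(os.path.join(tmp, name), "w").close()
    rtf_files = list_rtf_files(tmp)

print(rtf_files)
```

Filtering on the extension also skips any other stray files, not just `.DS_Store`.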

## Import one named RTF file

In [5]:
def import_one_rtf(input_rtf):
    # 'input_rtf' parameter takes the path of the input file as its argument
    with open(input_rtf, 'r') as file:
        # open() opens the input file and names it 'file'
        # 'r' opens a file for reading
        text = file.read()
        # reads the whole file into the 'text' variable
        stripped_text = rtf_to_text(text)
        # converts the raw RTF to plain text using rtf_to_text

    print(stripped_text)

### Example

In [6]:
import_one_rtf("Data/rtf/5407diary02.rtf")



Information about diarist
Date of birth: 1975
Gender: M
Occupation: Group 6
Geographic region: North Cumbria


Diary 1         
Thursday Meeting @ N Lakes
Friday TB testing on restocking farm. Usual chat and DEFRA comments
The meeting (research panel gp 6) at the North Lakes was interesting. It surprises me sometimes how people (myself included) never seem to tire of the same stories and complaints over how the crisis was handled. Some of the episodes recounted must have been told dozens of times over the last year but whoever says it always seems just as keen to say it again – Perhaps a reflection of how deeply people feel about the events of the last year. Having said that, most of the resentments and rants that I hear on daily farm visits are focused fairly and squarely at DEFRA and not FMD virus. Farmers seem far more upset at the constriction put on them by DEFRA than they do by the loss of stock now, although I know and saw how utterly devastated most were when they were actual

## Import multiple named RTF files

Note: I had to delete one unreadable file from the `rtf` folder before this step would run.
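An alternative to deleting a file by hand is to skip any file that cannot be read. The sketch below is one way to do that, assuming the problem shows up as a `UnicodeDecodeError` when the file is opened in text mode (the original error is not recorded here, so this is a guess). `read_or_skip` is a hypothetical helper, not part of `striprtf`, and the two test files are made up.

```python
import os
import tempfile

def read_or_skip(path):
    # Return the file's text, or None if it cannot be decoded as UTF-8.
    try:
        with open(path, "r", encoding="utf-8") as file:
            return file.read()
    except UnicodeDecodeError:
        print("Skipping unreadable file:", path)
        return None

# Try it on one readable and one deliberately broken file.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.rtf")
    bad = os.path.join(tmp, "bad.rtf")
    with open(good, "w", encoding="utf-8") as f:
        f.write(r"{\rtf1 Diary 1}")
    with open(bad, "wb") as f:
        f.write(b"\xff\xfe\x00broken")
    good_text = read_or_skip(good)
    bad_text = read_or_skip(bad)
```

Printing the skipped filename keeps a record of which files were left out, which is easy to lose track of when files are deleted manually.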

In [7]:
def import_rtf(input_rtf):
    # 'input_rtf' parameter takes the input file directory as its argument
    lines = []
    # initialise empty list

    for filename in sorted(os.listdir(input_rtf)):
        # sorted() ensures the filenames are processed in alphanumeric order
        if filename != '.DS_Store':
            # skip the hidden macOS .DS_Store file
            with open(os.path.join(input_rtf, filename), 'r') as file:
                # open each file in the input folder and name it 'file'
                print(filename)
                text = file.read()
                # reads the whole file into the 'text' variable
            stripped_text = rtf_to_text(text)
            # converts the raw RTF to plain text using rtf_to_text
            row_contents = [filename, stripped_text]
            # specifies what goes on each CSV row
            lines.append(row_contents)
            # append the row to the list

    df = pd.DataFrame(lines)
    df.to_csv("Code/Data/" + "text.csv")
    # build the DataFrame and write the CSV once, after the loop has finished
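The `csv` module imported at the top can do the same job without pandas. Below is a minimal sketch that writes a list of `[filename, text]` rows to a CSV file with a header row. The rows, the column names and the temporary output location are all made up for illustration.

```python
import csv
import os
import tempfile

# Made-up rows in the same [filename, text] shape that import_rtf builds.
lines = [
    ["5407diary02.rtf", "Diary 1\nThursday Meeting"],
    ["5407int02.rtf", "Interview transcript"],
]

out_path = os.path.join(tempfile.mkdtemp(), "text.csv")

# newline="" stops the csv module adding blank rows on Windows.
with open(out_path, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["filename", "text"])  # header row
    writer.writerows(lines)

# Read the file back to check that multi-line text survived.
with open(out_path, newline="", encoding="utf-8") as file:
    rows = list(csv.reader(file))

print(rows[0])
```

Because the writer quotes any field that contains a line break, each document still occupies a single CSV row even when its text spans many lines.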

In [8]:
import_rtf("Data/rtf")

5407diary02.rtf
5407diary03.rtf
5407diary07.rtf
5407diary08.rtf
5407diary09.rtf
5407diary10.rtf
5407diary13.rtf
5407diary14.rtf
5407diary15.rtf
5407diary16.rtf
5407diary17.rtf
5407diary18.rtf
5407diary19.rtf
5407diary21.rtf
5407diary22.rtf
5407diary23.rtf
5407diary24.rtf
5407diary26.rtf
5407diary27.rtf
5407diary28.rtf
5407diary29.rtf
5407diary30.rtf
5407diary31.rtf
5407diary32.rtf
5407diary34.rtf
5407diary36.rtf
5407diary37.rtf
5407diary39.rtf
5407diary40.rtf
5407diary41.rtf
5407diary42.rtf
5407diary43.rtf
5407diary44.rtf
5407diary47.rtf
5407diary48.rtf
5407diary49.rtf
5407diary52.rtf
5407diary53.rtf
5407diary54.rtf
5407diary55.rtf
5407fg01.rtf
5407fg02.rtf
5407fg03.rtf
5407fg04.rtf
5407fg05.rtf
5407fg06.rtf
5407int02.rtf
5407int03.rtf
5407int07.rtf
5407int08.rtf
5407int09.rtf
5407int10.rtf
5407int13.rtf
5407int14.rtf
5407int15.rtf
5407int16.rtf
5407int17.rtf
5407int18.rtf
5407int19.rtf
5407int20.rtf
5407int21.rtf
5407int22.rtf
5407int23.rtf
5407int24.rtf
5407int26.rtf
5407int27.rtf
54