### Process KY COVID-19 confirmed cases from daily press releases (3-21-2020 format)

Data Source: https://governor.ky.gov/news  &  https://governor.ky.gov/attachments/COVID19_Case-Information.pdf  

This notebook shows how to extract the text of a PDF file with Tika and convert it into data ready for pandas.

In [1]:
import tika
from tika import parser
import pandas as pd
from pathlib import Path

In [2]:
file_date = '03_21_2020'
filePath = Path("data")   # the file path for data
full_path = '/Users/mark/Documents/github-public/covid-19/data/covid19_case_information_' + file_date + '.pdf'
full_path

'/Users/mark/Documents/github-public/covid-19/data/covid19_case_information_03_20_2020.pdf'
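The cell above defines `filePath` but builds `full_path` from a hard-coded absolute string. A minimal sketch of assembling the same file name with `pathlib`, assuming the PDFs sit in a relative `data` folder. The names `release_date`, `data_dir`, and `pdf_path` are illustrative and not part of the original notebook:

```python
from datetime import date
from pathlib import Path

# Assumption: the release date is known and the PDFs live in ./data
release_date = date(2020, 3, 21)
file_date = release_date.strftime('%m_%d_%Y')   # same MM_DD_YYYY layout as above

data_dir = Path('data')
pdf_path = data_dir / f'covid19_case_information_{file_date}.pdf'
print(pdf_path)
```

Deriving the string from a `date` object keeps the zero padding consistent from one daily file to the next.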

#### Read the PDF file with the Tika parser

In [3]:
raw = parser.from_file(full_path)
raw = raw['content'].strip()  # Remove leading and trailing whitespace
# Collapse double newlines, drop commas, then squeeze runs of spaces (four passes)
raw = raw.replace('\n\n', '\n').replace(',', '')
for _ in range(4):
    raw = raw.replace('  ', ' ')
raw

'COVID-19 case information \n \nAs of 5 p.m. March 20 the state’s COVID-19 patient information includes 63 who have tested \npositive. \n27 F Harrison recovered \n69 M Jefferson \n67 F Harrison \n40 F Fayette \n68 M Harrison \n46 M Fayette \n60 M Harrison \n54 F Harrison \n31 F Fayette \n51 M Harrison \n56 M Montgomery \n80 F Jefferson \n68 F Jefferson \n66 M Bourbon passed away \n53 M Nelson \n67 F Jefferson \n47 M Fayette \n31 M Fayette \n49 M Clark \n73 F Jefferson \n54 M Jefferson \n51 M Montgomery \n34 F Jefferson \n33 F Out of State \n74 M Jefferson \n66 M Jefferson \n69 M Lyon \n88 F Bourbon \n27 F Clark \n51 M Daviess \n26 F Fayette \n61 F Franklin \n50 M Harrison \n F Henderson \n66 F Kenton \n59 F Pulaski \n73 M Warren \n61 F Christian \n F Fayette \n45 F Jefferson \n46 F Jefferson \n F Henderson \n17 F Jefferson \n F Warren \n48 F Pulaski \n28 M Calloway \n79 F Jefferson \n34 M Fayette \n67 M Fayette \n55 F Fayette \n F Jefferson \n F Jefferson \n F Jefferson \n  Jefferson \
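The clean-up above squeezes spaces by repeating the same replacement several times. As a hedged alternative, a regular expression can collapse a run of any length in one pass. The helper name `normalize` and the sample string are made up for illustration:

```python
import re

def normalize(text):
    """Trim the text, merge double newlines, drop commas, squeeze spaces."""
    text = text.strip()
    text = text.replace('\n\n', '\n').replace(',', '')
    # ' {2,}' matches two or more consecutive spaces, however long the run
    return re.sub(r' {2,}', ' ', text)

sample = '27  F   Harrison,  recovered \n\n69 M Jefferson '
print(repr(normalize(sample)))
```

Like the original cell, this leaves the single space before each newline in place, which the next step relies on when it splits on `' \n'`.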

#### Turn the string into a list, breaking on newline characters

In [4]:
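# Split on ' \n'; commas were removed above, so split(',') just wraps each line in a one-item list
string_list = [x.split(',') for x in raw.split(' \n')]
string_list = string_list[2:-2]  # drop the first two and last two entries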
string_list[:10]

[['As of 5 p.m. March 20 the state’s COVID-19 patient information includes 63 who have tested'],
 ['positive.'],
 ['27 F Harrison recovered'],
 ['69 M Jefferson'],
 ['67 F Harrison'],
 ['40 F Fayette'],
 ['68 M Harrison'],
 ['46 M Fayette'],
 ['60 M Harrison'],
 ['54 F Harrison']]
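The first two entries in the output above are the preamble sentence rather than case records. One possible way to skip them, sketched under the assumption that the first real record starts with an age, a space, and `M` or `F`, is to drop lines until that pattern first appears. The `lines` sample and the `is_record` helper are illustrative:

```python
import re
from itertools import dropwhile

lines = [
    'As of 5 p.m. March 20 the state’s COVID-19 patient information includes 63 who have tested',
    'positive.',
    '27 F Harrison recovered',
    '69 M Jefferson',
    ' F Henderson',          # a later record with no age is still kept
]

def is_record(line):
    """True when the line starts like 'NN M ' or 'NN F '."""
    return re.match(r'\d+ [MF] ', line) is not None

# dropwhile removes only the leading run of non-record lines
records = list(dropwhile(lambda s: not is_record(s), lines))
print(records)
```

Because only the leading run is dropped, records with a missing age further down the list survive the filter.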

#### Split each line into age, sex, and county fields (a list of lists)

In [5]:
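new_string = ''
for items in string_list:
    # Turn the first two spaces into commas: 'age sex county...' -> 'age,sex,county...'
    new_string = new_string + str(items).replace(' ', ',', 2)
# Strip the list-repr punctuation (including the final ']'), leaving one record per line
new_string = new_string.replace("'", "").replace('][', '\n').replace('[', '').replace(']', '')
new_list = [x.split(',') for x in new_string.split('\n')]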
new_list[:10]

[['As',
  'of',
  '5 p.m. March 20 the state’s COVID-19 patient information includes 63 who have tested'],
 ['positive.'],
 ['27', 'F', 'Harrison recovered'],
 ['69', 'M', 'Jefferson'],
 ['67', 'F', 'Harrison'],
 ['40', 'F', 'Fayette'],
 ['68', 'M', 'Harrison'],
 ['46', 'M', 'Fayette'],
 ['60', 'M', 'Harrison'],
 ['54', 'F', 'Harrison']]
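In the output above, status notes such as "recovered" end up inside the county field. A minimal sketch of splitting a single record directly with `str.split` and moving the note into its own field follows. The `parse_record` helper, the `Status` key, and the `STATUSES` tuple are assumptions for illustration, and the tuple lists only the two notes visible in this release:

```python
STATUSES = ('recovered', 'passed away')  # assumption: the only notes that occur

def parse_record(line):
    """Split 'age sex county [status]' into a dict; age or sex may be blank."""
    # maxsplit=2 keeps multi-word counties such as 'Out of State' intact
    age, sex, county = line.split(' ', 2)
    status = ''
    for note in STATUSES:
        if county.endswith(' ' + note):
            county, status = county[:-len(note) - 1], note
    return {
        'Age': int(age) if age else None,
        'M/F': sex or None,
        'County': county,
        'Status': status,
    }

for line in ['27 F Harrison recovered', '66 M Bourbon passed away',
             ' F Henderson', '33 F Out of State']:
    print(parse_record(line))
```

A line with fewer than two spaces, such as the preamble fragment `positive.`, would raise a `ValueError` here, so this sketch assumes non-record lines have already been removed.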

#### Create a dataframe from the list of lists

In [6]:
df = pd.DataFrame(new_list,columns=['Age','M/F','County'])
df.head()

|   | Age | M/F | County |
|---|-----|-----|--------|
| 0 | As | of | 5 p.m. March 20 the state’s COVID-19 patient i... |
| 1 | positive. | | |
| 2 | 27 | F | Harrison recovered |
| 3 | 69 | M | Jefferson |
| 4 | 67 | F | Harrison |


#### Write dataframe to CSV file

In [7]:
file_name = 'ky_'+file_date+'.csv'
file_out = filePath.joinpath(file_name)  # path and filename

df.to_csv(file_out)  # output to csv; the row index is written as an unnamed first column

In [8]:
len(df)

66
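The 66 rows include the two preamble rows shown earlier, so the frame length is not the case count. As a rough sanity check, the total quoted in the preamble can be pulled out and compared with the number of parsed records. The `stated_case_count` helper is hypothetical, and the pattern assumes the wording "includes N who have tested" used in this release:

```python
import re

header = ('As of 5 p.m. March 20 the state’s COVID-19 patient information '
          'includes 63 who have tested positive.')

def stated_case_count(text):
    """Return the case total quoted in the release preamble, or None."""
    match = re.search(r'includes (\d+) who have tested', text)
    return int(match.group(1)) if match else None

print(stated_case_count(header))
```

If the wording of a later release changes, the function returns `None` rather than guessing.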