# <center>Week 5 Assignment</center>

This week you will be retrieving and cleansing data from a survey generated by the National Center for Immunization and Respiratory Diseases about National Immunizations in Children. In completing this assignment, you will be able to combine topics discussed in several of our prior FTEs.

File needed to complete this assignment are located in the data_5 folder:
* NISPUF14_CODEBOOK.PDF
* nispuf14.dat

Assignment Requirements:
* Retrieve all of the data within nispuf14.dat and store it in a more <i> accessible format</i>
* <i> Accessible format </i> can be any of the following:
    - csv file
    - json file
    - relational database
* For this assignment, feel free to use a dataframe for intermediate steps. 

<hr>

### What's in these two files?
I'm glad you asked that! And to be honest, you probably are not going to like the answer.

NISPUF14_CODEBOOK.PDF is a PDF that contains a description of the format for the data in nispuf14.dat. In other words, the PDF tells you how to read the data in nispuf14.dat.

Why would we need a PDF to tell us how to read our data?  Well, this data file is stored in a positional format. This means that both the value and relative position of each character provides meaning within the dataset.

Here's what the data in nispuf14.dat looks like.
<img align="left" style="padding-right:10px;" src="figures_5/positional_data.jpeg" width = 800><br>

Ugly? Yes! And very much so. However, data in this format is not all that uncommon. Mainframe computers operate on positional formating. 

Q - Who still uses mainframe computers?<br>
A - Mainframes are more prevalent than you'd think. Any industry that has a large volume of daily mathematical calculations to do, most likely use a mainframe computer as part of their normal operations. For example, the banking industry. Certainly, the website and customer-facing applications are not run on a mainframe computer, but the nightly accounting processes probably are. 

The following article walks through the history of the mainframe computer and how it has evolved over the years. 
https://www.thocp.net/hardware/mainframe.htm

<hr>

### How are we supposed to read that?
This is where NISPUF14_CODEBOOK.PDF comes into the picture. Section 1 of the PDF contains the description of the positional formatting information for each data field. Here's how it works!

As an example, let's say that our data file looked like this:<br>
CAT  FLUFFY410<br>
DOG  FIDO  522<br>
BIRD CHIRP 2 1<br>

At a glance, we can determine that each line contains information about animals. We can see a field representing an animal_type and perhaps an animal_name.  However, we have little to no information about what the numerics at the end of each line mean. Or even how many fields the numeric group is representing. The last line is leading us to believe that there might be more than one field represented, but we are not confident at this point.


### Does this come with a 'Magical Decoder Ring'?
Short of an actual magical ring, I'd settle for a description of each field and their relative position in the line.  It would be even better if the description was written down for future reference.

Let's look at the above animal dataset in conjunction with the  following description:<br>
Type 1 5<br>
Name 6 11<br>
Age 12 12<br>
Weight 13 14<br>

Aaahhhhh! Now everything is starting to come together!!! We can now confirm that the first field is indeed animal_type, and the second is animal_name. However, we now know that the numeric grouping is really two fields, animal_age and animal_weight. We can also see that animal_age is a single digit, and animal_weight is a 2-digit numeric. We are also able to determine at this point that the animal_name on the first line is actually 'FLUFFY' and not 'FLUFFY410'.

Time to add a little code to this example.

<hr>

In [None]:
# Load the sample data into a list
animal_data = ['CAT  FLUFFY410', 'DOG  FIDO  522', 'BIRD CHIRP 2 1' ]

# processing each animal_data line
for  line in animal_data:
    animal_type = line[0:5]
    animal_name = line[5:11]
    age = line[11]
    weight = line[12:14]
    
    print(f'{animal_name} is a {animal_type} that is {age} years old and weights {weight} pound(s)')


Hopefully, things are looking less scary at this point? 

Retrieving data from a positionally formatted file is just a matter of chunking the larger string up into smaller pieces. The trick is in determining where to make those chunks. 

The key to all this is the 'magical decoder' description because there are no other clues in the file itself. Unlike a csv type file, positional formatted files don't have a delimited to help identify individual data elements. 

That being said, positional formats do account for every character within a row.  Meaning that even unused characters are given a value. In our example above, a blank  character(' ') was used to fill unused characters. The value used to represent unused characters can literally be anything. For example, if '-' was used instead of a ' ' our sample data would have looked like:

CAT--FLUFFY410<br>
DOG--FIDO--522<br>
BIRD-CHIRP-2-1<br>

Let's see if our code above will still work?

In [None]:
# Load the sample data into a list
animal_data2 = ['CAT--FLUFFY410', 'DOG--FIDO--522', 'BIRD-CHIRP-2-1' ]

# processing each animal_data line
for  line in animal_data2:
    animal_type = line[0:5]
    animal_name = line[5:11]
    age = line[11]
    weight = line[12:14]
    
    print(f'{animal_name} is a {animal_type} that is {age} years old and weights {weight} pound(s)')

Aside from changing the initial list that contains our dataset, no coding changes were needed. 

Our output looks a little different, but that's because of the different unused character representation. Both of the above examples have their respective unused character values in the data elements.  It's just easier to see in the second example over the first.

Let's try stripping out the unused characters in both examples.

In [None]:
# working with the second dataset, animal_data2, first.

# processing each animal_data line
for  line in animal_data2:
    animal_type = line[0:5].strip('-')
    animal_name = line[5:11].strip('-')
    age = line[11].strip('-')
    weight = line[12:14].strip('-')
    
    print(f'{animal_name} is a {animal_type} that is {age} years old and weights {weight} pound(s)')

In [None]:
# repeat the same things with the first set, animal_data.

# processing each animal_data line
for  line in animal_data:
    animal_type = line[0:5].strip('-')
    animal_name = line[5:11].strip('-')
    age = line[11].strip('-')
    weight = line[12:14].strip('-')
    
    print(f'{animal_name} is a {animal_type} that is {age} years old and weights {weight} pound(s)')

<div class="alert alert-success">
Success!! The two outputs match!
</div>

<hr>

### Back to our assignment
Section 1 of NISPUF14_CODEBOOK.PDF contains the description of the positional format for nispuf14.dat.

<div class="alert alert-block alert-info">
<b>Helpful Hint::</b> Combining pyPDF2 and Tabula would work great for  parsing the information within section 1 of NISPUF14_CODEBOOK.PDF. pyPDF2 to retrieve section 1 of the PDF and Tabula for getting the positional formatting information off the PDF and into a pandas dataframe.
</div>

Installation reminders from FTE for week3.
<div class="alert alert-block alert-success">
<b>Installation - PyPDF2::</b> PyPDF2 can be installed as normal using pip.
</div>

<div class="alert alert-block alert-success">
<b>Installation - Tabula::</b> To install the tabula package, you can use pip as shown before. https://pypi.org/project/tabula-py/
</div>

<div class="alert alert-block alert-success">
<b>Installation - Java::</b> Note: in order to use tabula, you need to have the latest version of java installed. https://aegis4048.github.io/parse-pdf-files-while-retaining-structure-with-tabula-py has some useful information if you need help getting java installed on your machine.
</div>

#### Assignment Approach

<div class="alert alert-block alert-warning">
<b>One possible solution: </b> Students are encouraged to define their approach when completing any assignment in this class.  Below, I have shared my approach to the assignment for this week.  Feel free to use some or all of this design, if you'd like.
</div>



for each line in the file:

    data_line = new list
    for each variable (line) found in the dataframe:
        create a dictionary with variable name as key, 
        use start / end position numbers as a slice to give the dictionary's value
        append dictionary to data_line
    write data_line to CSV file

<hr>

In [1]:
import tabula, pandas as pd, csv
path='C:/Users/eltac/Downloads/data_5/data_5/'

In [2]:
df = tabula.read_pdf(path+'NISPUF14_CODEBOOK.PDF', pages='5-21')

In [3]:
df

Unnamed: 0,Variable Name,Position,Position.1,Section,Variable Label
0,SEQNUMC,1,6,1,UNIQUE CHILD IDENTIFIER
1,SEQNUMHH,7,11,1,UNIQUE HOUSEHOLD IDENTIFIER
2,PDAT,12,12,1,CHILD HAS ADEQUATE PROVIDER DATA
3,PROVWT_D,13,31,1,FINAL DUAL-FRAME PROVIDER-PHASE WEIGHT (EXCLUDES
4,,,,,TERRITORIES)
...,...,...,...,...,...
704,INS_4_5,861,862,10,"IS CHILD COVERED BY INDIAN HEALTH SERVICE, MIL..."
705,,,,,"CARE, TRICARE, CHAMPUS, OR CHAMP-VA?"
706,INS_6,863,864,10,IS CHILD COVERED BY ANY OTHER HEALTH INSURANCE...
707,,,,,CARE PLAN?


In [4]:
print(df.shape)

(709, 5)


In [5]:
list(df.columns)

['Variable Name', 'Position', 'Position.1', 'Section', 'Variable Label']

In [6]:
df=df.drop(['Section','Variable Label'], axis=1)

In [7]:
df.head()

Unnamed: 0,Variable Name,Position,Position.1
0,SEQNUMC,1.0,6.0
1,SEQNUMHH,7.0,11.0
2,PDAT,12.0,12.0
3,PROVWT_D,13.0,31.0
4,,,


In [8]:
df=df.dropna()

In [9]:
df.head()

Unnamed: 0,Variable Name,Position,Position.1
0,SEQNUMC,1,6
1,SEQNUMHH,7,11
2,PDAT,12,12
3,PROVWT_D,13,31
5,PROVWT_D_TERR,32,50


In [10]:
indexNames = df[(df['Variable Name'] == 'Variable Name')  ].index

In [11]:
indexNames

Int64Index([41, 86, 136, 184, 233, 284, 334, 381, 421, 460, 497, 539, 578, 617,
            656, 695],
           dtype='int64')

In [12]:
df.drop(indexNames , inplace=True)

In [13]:
print(df.dtypes)

Variable Name    object
Position         object
Position.1       object
dtype: object


In [14]:
df['Variable Name'] = df['Variable Name'].astype(str)
df['Position'] = df['Position'].astype(int)
df['Position.1'] = df['Position.1'].astype(int)

In [15]:
print(df.dtypes)

Variable Name    object
Position          int32
Position.1        int32
dtype: object


In [16]:
lines_list = []
files = [path+'nispuf14.dat']
for file in files:
    with open(file) as infile:
        for line in infile:
            if len(line) > 1:    # Blank lines at the end of files.
                lines_list.append(line.strip()) # strip() leaves empty blank lines -- skip these
    

In [17]:
with open(path+'nispuf14.dat') as fn:
    content = fn.readlines()

In [18]:
len(lines_list)

24897

In [19]:
data_line = []
data_line2 = []
Line_control=lines_list[:]
keys=dict.fromkeys(df['Variable Name'])
for line in Line_control:
    for index,row in df.iterrows():
        head=row['Variable Name']
        var=line[row['Position']:row['Position.1']]
        var2=var.replace('.','')
        data_line2.append(var2.strip())
        #####data_line.append(var)
        
    data_line.append(dict(zip(keys, data_line2)))


In [20]:
data_line[:5]

[{'SEQNUMC': '00011',
  'SEQNUMHH': '0001',
  'PDAT': '',
  'PROVWT_D': '',
  'PROVWT_D_TERR': '',
  'RDDWT_D': '21830024855484000',
  'RDDWT_D_TERR': '21830024855484000',
  'STRATUM': '022',
  'YEAR': '014',
  'AGECPOXR': '',
  'HAD_CPOX': '2',
  'SHOTCARD': '',
  'AGEGRP': '',
  'BF_ENDR06': '652500',
  'BF_EXCLR06': '521875',
  'BF_FORMR08': '826250',
  'BFENDFL06': '',
  'BFFORMFL06': '',
  'C1R': '',
  'C5R': '2',
  'CBF_01': '1',
  'CEN_REG': '',
  'CHILDNM': '',
  'CWIC_01': '2',
  'CWIC_02': '',
  'EDUC1': '',
  'FRSTBRN': '',
  'I_HISP_K': '',
  'INCPORAR': '00000000000000',
  'INCPOV1': '',
  'INCQ298A': '4',
  'INTRP': '',
  'LANGUAGE': '',
  'M_AGEGRP': '',
  'MARITAL2': '',
  'MOBIL_I': '',
  'NUM_PHONE': '9',
  'NUM_CELLS_HH': '9',
  'NUM_CELLS_PARENTS': '9',
  'RACE_K': '',
  'RACEETHK': '',
  'RENT_OWN': '1',
  'SEX': '',
  'ESTIAP14': '22',
  'EST_GRANT': '2',
  'STATE': '2',
  'D6R': '',
  'D7': '',
  'N_PRVR': '',
  'PROV_FAC': '',
  'REGISTRY': '',
  'VFC_ORDER': ''

In [21]:
with open(path+'gbernal_week5.csv', 'w',newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=keys)
    writer.writeheader()
    for row in data_line:
        writer.writerow(row)
csvfile.close()