# Project Overview
The goal is to **acquire engine performance data** from a source yet to be determined, run **statistical analysis** on horsepower and torque (the targets) against the other engine data, then **build features** and **create regression models** to predict horsepower.

In [1]:
import pandas as pd
import re
import tabula

import prepare

# Setup
## Bottom Line Up Front
1. Used tabula-py for the first time.
2. Issue #1: read_pdf concatenates each row's values on pages after page one into a single column. Fixed with: **prepare.py/fix_dyno_pdf(df)**
3. Issue #2: the conversion and the fix need to run in a loop over many PDFs. Fixed with: **multiple prepare.py functions**
4. Issue #3: the dataframes in the dataframe dictionary need a function that stores them locally.

### Initial try with tabula-py
First I will attempt to read data from a Cobb Tuning PDF.

In [2]:
# pull pdf into dataframe
df = tabula.read_pdf('cobb_dyno_data/subaru/subaru_impreza_wrx_sti/2013_model_subaru_wrx_sti_Adam_B_1.pdf',
                     pages='all',
                     multiple_tables=False)[0]

In [3]:
# check work
df

                            RPM      HP  Torque   AFR  Boost
0                          2240   88.00   207.0  12.5    7.5
1                          2260   90.00   210.0  12.3    7.7
2                          2280   92.00   212.0  12.2    8.1
3                          2300   93.00   213.0  12.0    8.4
4                          2320   94.00   214.0  11.9    8.8
...                         ...     ...     ...   ...    ...
213                        6500  300.00   243.0  10.7   15.3
214  6520299.00241.0010.7015.30
215                        6540  297.00   239.0  10.7   15.2
216  6560296.00238.0010.7015.20


### Issue #1: Fix concatenation issue
Because pages after page one do not have column names, the read_pdf function concatenates each row's values into the RPM column. Next, I will create a function that fixes this.

In [4]:
# Function outlined here before definition
cols = ['RPM', 'HP', 'Torque', 'AFR', 'Boost']
for i, item in enumerate(df['RPM']):
    item = str(item)
    # rows from later pages look like '6520299.00241.0010.7015.30'
    if len(item) > 5:
        # RPM is assumed to be the first four digits
        values = [item[:4]]
        item = item[4:]
        # peel HP, Torque, AFR and Boost off the front; each ends in two decimals
        for _ in cols[1:]:
            value = re.match(r'\d+\.\d\d', item).group()
            values.append(value)
            item = item[len(value):]
        df.loc[i, cols] = [float(v) for v in values]

In [5]:
# test work
df = df.astype('float')
print(df.sample(5), '\n\n')
df.info()

        RPM     HP  Torque   AFR  Boost
79   3820.0  241.0   333.0  10.7   18.2
147  5180.0  300.0   305.0  10.6   17.4
153  5300.0  297.0   295.0  10.5   17.2
141  5060.0  296.0   308.0  10.7   17.6
75   3740.0  239.0   337.0  10.7   18.3 


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 218 entries, 0 to 217
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   RPM     218 non-null    float64
 1   HP      218 non-null    float64
 2   Torque  218 non-null    float64
 3   AFR     218 non-null    float64
 4   Boost   218 non-null    float64
dtypes: float64(5)
memory usage: 8.6 KB


### Solution to Issue #1
I've moved the function into prepare.py as fix_dyno_pdf. Since prepare is already imported, I will now test the function on a new PDF.
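prepare.py itself is not shown in this notebook, so the following is only a minimal sketch of how the row-splitting step could be written as a standalone helper. It swaps the step-by-step slicing for a single anchored pattern and assumes a four-digit RPM followed by four values with exactly two decimals each. The name `split_dyno_row` is hypothetical and is not part of prepare.py.

```python
import re

# Assumed layout: a four-digit RPM, then HP, Torque, AFR and Boost,
# each written with exactly two decimal places and no separators.
ROW_PATTERN = re.compile(r'(\d{4})' + r'(\d+\.\d\d)' * 4)


def split_dyno_row(text):
    """Return [RPM, HP, Torque, AFR, Boost] as floats, or None if text is not a concatenated row."""
    match = ROW_PATTERN.fullmatch(text)
    if match is None:
        return None
    return [float(value) for value in match.groups()]


# A concatenated row taken from the output above, then a normal RPM value.
print(split_dyno_row('6520299.00241.0010.7015.30'))  # [6520.0, 299.0, 241.0, 10.7, 15.3]
print(split_dyno_row('2240'))                        # None
```

Because the pattern must match the whole string, normal page-one values such as '2240' are left alone rather than partially parsed.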

In [6]:
# read second pdf as test
df2 = tabula.read_pdf('cobb_dyno_data/subaru/subaru_impreza_wrx_sti/2013_model_subaru_wrx_sti_Adam_B_2.pdf',
                      pages='all',
                      multiple_tables=False)[0]

In [7]:
# try out function
df2 = prepare.fix_dyno_pdf(df2)
df2.sample(5)

        RPM     HP  Torque   AFR  Boost
166  5560.0  313.0   296.0  10.6   16.5
158  5400.0  312.0   304.0  10.6   16.9
115  4540.0  293.0   340.0  10.7   18.4
81   3860.0  255.0   349.0  10.6   19.0
11   2460.0  113.0   242.0  11.0   11.1


The function works when it is called on a single dataframe. Next, I need to call it in a loop that iterates through all of the PDFs.

### Issue #2: Convert PDFs programmatically
I need to create three functions:
1. Uses tabula-py to convert a PDF to dataframe
2. Identifies filepaths of PDFs in a given directory
3. Loops through filepaths from #2 and calls #1 on each PDF. 

The end goal is to convert all PDFs into individual CSV files.

### Solution to Issue #2
All three functions have been added to prepare.py:
1. prepare.py/pdf_to_df(filepath)
2. prepare.py/iterate_pdfs(folder_path)
3. prepare.py/fix_pdfs(folder_path)
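The bodies of these functions live in prepare.py and are not reproduced here. As a rough sketch of how the three pieces could fit together, the version below reuses the read_pdf call from earlier in this notebook. The `convert` and `fix` parameters are hypothetical additions for illustration (the real fix_pdfs takes only a folder path), and the sorted ordering is my own choice.

```python
from pathlib import Path


def pdf_to_df(filepath):
    """Convert one dyno PDF to a dataframe with the same read_pdf call used above."""
    import tabula  # imported lazily so the path helpers below work without tabula or Java
    return tabula.read_pdf(str(filepath), pages='all', multiple_tables=False)[0]


def iterate_pdfs(folder_path):
    """Return the sorted paths of every .pdf file directly inside folder_path."""
    folder = Path(folder_path)
    return sorted(p for p in folder.iterdir()
                  if p.is_file() and p.suffix.lower() == '.pdf')


def fix_pdfs(folder_path, convert=pdf_to_df, fix=lambda df: df):
    """Map each PDF file name to its converted and repaired dataframe.

    convert and fix are hypothetical hooks; in this project fix would be
    prepare.fix_dyno_pdf.
    """
    return {path.name: fix(convert(path)) for path in iterate_pdfs(folder_path)}
```

Under these assumptions, a call such as fix_pdfs(folder, fix=prepare.fix_dyno_pdf) would return a dictionary keyed by file name, the same shape as the output shown in the next cell.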

In [8]:
# testing all three functions
df_dict = prepare.fix_pdfs('cobb_dyno_data/subaru/subaru_impreza_wrx_sti/')
df_dict

{'2013_model_subaru_wrx_sti_Marco_B.pdf':         RPM     HP  Torque   AFR  Boost
 0    2240.0   75.0   177.0  12.8    3.8
 1    2260.0   77.0   179.0  12.7    4.0
 2    2280.0   78.0   182.0  12.6    4.2
 3    2300.0   80.0   184.0  12.4    4.3
 4    2320.0   82.0   187.0  12.3    4.5
 ..      ...    ...     ...   ...    ...
 197  6180.0  291.0   248.0  10.6   14.1
 198  6200.0  292.0   248.0  10.6   14.0
 199  6220.0  292.0   247.0  10.6   14.0
 200  6240.0  292.0   247.0  10.6   13.9
 201  6260.0  293.0   246.0  10.6   13.9
 
 [202 rows x 5 columns],
 '2013_model_subaru_wrx_sti_Eric_B_1.pdf':         RPM     HP  Torque   AFR  Boost
 0    2260.0   76.0   178.0  12.4    6.6
 1    2280.0   78.0   180.0  12.3    6.8
 2    2300.0   80.0   183.0  12.1    7.0
 3    2320.0   81.0   185.0  12.0    7.3
 4    2340.0   83.0   188.0  11.8    7.5
 ..      ...    ...     ...   ...    ...
 206  6380.0  298.0   246.0  10.8   14.4
 207  6400.0  297.0   244.0  10.8   14.4
 208  6420.0  296.0   243.0  

Looking good! Now that these PDFs are cached, I can push each of these dataframes to CSV or use them inline.
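Pushing each dataframe to CSV could look something like the sketch below. It is an assumption about one possible approach, not a function from prepare.py: the name `save_dfs_to_csv`, the output folder argument, and the choice to drop the index are all hypothetical. It works with any object that has a pandas-style to_csv method.

```python
from pathlib import Path


def save_dfs_to_csv(df_dict, out_dir):
    """Write each dataframe in df_dict to out_dir, swapping the .pdf suffix for .csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, frame in df_dict.items():
        target = out_dir / Path(name).with_suffix('.csv').name
        # index=False assumes the default integer index carries no information
        frame.to_csv(target, index=False)
        written.append(target)
    return written
```

For the dictionary above, a key such as '2013_model_subaru_wrx_sti_Marco_B.pdf' would be written as '2013_model_subaru_wrx_sti_Marco_B.csv' inside the chosen folder.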

### Issue #3: Programmatically store new dataframes to CSVs