# Project Overview
The goal is to **acquire engine performance data** from a source yet to be determined, run **statistical analysis** on horsepower and torque (the targets) against the other engine data, then **build features** and **create regression models** to predict horsepower.

In [1]:
import pandas as pd
import re
import tabula

import prepare

# Setup
## Bottom Line Up Front
1. Used tabula-py for the first time.
2. Issue #1: read_pdf concatenates values past page one into a single column. Fixed with **prepare.py/fix_dyno_pdf(df)**.
3. Issue #2: needed functions to run the conversion and the fix in a loop. Fixed with **multiple prepare.py functions**.
4. Issue #3: needed a way to store dataframes locally from the dataframe dictionary. Fixed by **manual assignment** because tabula-py had issues reading larger PDFs.
5. Issue #4: needed to decide what to do for the project. Chose **max horsepower analysis** as the initial approach.

## Initial try with tabula-py
First I will attempt to read data from a Cobb Tuning PDF.

In [2]:
# pull pdf into dataframe
# df = tabula.read_pdf('cobb_dyno_data/subaru/subaru_impreza_wrx_sti/2013_model_subaru_wrx_sti_Adam_B_1.pdf', 
#                             pages='all', 
#                             multiple_tables=False)[0]

In [3]:
# check work
# df

## Issue #1: Fix concatenation issue
Because pages after page one do not have column names, the read_pdf function concatenates each row's values into the RPM column. Next, I will create a function that fixes this.
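
To make the failure mode concrete, here is a minimal, self-contained sketch of the split. The sample string and the fixed field widths (four-digit RPM, `ddd.dd` for HP and torque, `dd.dd` for AFR and boost) are assumptions for illustration rather than the exact layout of the Cobb PDFs, and `split_dyno_row` is a hypothetical helper, not the function in prepare.py.

```python
import re

# Assumed layout: 4-digit RPM, then HP and Torque as ddd.dd, then AFR and Boost as dd.dd.
ROW_PATTERN = re.compile(
    r'(?P<RPM>\d{4})'
    r'(?P<HP>\d{3}\.\d{2})'
    r'(?P<Torque>\d{3}\.\d{2})'
    r'(?P<AFR>\d{2}\.\d{2})'
    r'(?P<Boost>\d{2}\.\d{2})'
)

def split_dyno_row(raw):
    """Split one concatenated row into its five fields, or return None if it does not fit."""
    match = ROW_PATTERN.fullmatch(raw.strip())
    if match is None:
        return None
    return {name: float(value) for name, value in match.groupdict().items()}

sample = '2500150.25300.5012.5010.25'  # made-up row, not taken from a real PDF
print(split_dyno_row(sample))
```

Matching the whole row with one anchored pattern avoids picking up a later field by accident, and rows that do not fit the assumed layout come back as `None` instead of raising.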

In [4]:
# Function outlined here before definition
# for i, item in enumerate(df['RPM']):
#     if len(item) > 5:
#         RPM = item[:4]
#         item = item[4:]
#         HP = re.findall(r'...\.\w\w', item)[0]
#         item = item[len(HP):]
#         Torque = re.findall(r'...\.\w\w', item)[0]
#         item = item[len(Torque):]
#         AFR = re.findall(r'..\.\w\w', item)[0]
#         item = item[len(AFR):]
#         Boost = re.findall(r'..\.\w\w', item)[0]
#         df.loc[i] = {'RPM':RPM, 'HP':HP, 'Torque':Torque, 'AFR':AFR, 'Boost':Boost}
#     else:
#         continue

In [5]:
# test work
# df = df.astype('float')
# print(df.sample(5), '\n\n')
# df.info()

### Solution to Issue #1
I've pushed the function to prepare.py. Now I will import it and test it on a new PDF.

In [6]:
# read second pdf as test
# df2 = tabula.read_pdf('cobb_dyno_data/subaru/subaru_impreza_wrx_sti/2013_model_subaru_wrx_sti_Adam_B_2.pdf', 
#                             pages='all', 
#                             multiple_tables=False)[0]

In [7]:
# try out function
# df2 = prepare.fix_dyno_pdf(df2)
# df2.sample(5)

So the function works when you call it with a dataframe. I will need to call the function in a loop to iterate through PDFs.

## Issue #2: Convert PDFs programmatically
I need to create three functions:
1. Uses tabula-py to convert a PDF to dataframe
2. Identifies filepaths of PDFs in a given directory
3. Loops through filepaths from #2 and calls #1 on each PDF. 

The end goal is to convert all PDFs into individual CSV files.
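
The three steps above can be sketched roughly as follows. `find_pdfs` and `convert_folder` are hypothetical names for illustration and are not the functions in prepare.py; the converter is passed in as an argument so the loop can be tried without tabula-py installed.

```python
from pathlib import Path

def find_pdfs(folder_path):
    """Return the PDF paths directly inside folder_path, sorted for a stable order."""
    return sorted(Path(folder_path).glob('*.pdf'))

def convert_folder(folder_path, convert):
    """Call convert(path) on every PDF in the folder and key the results by file stem."""
    return {pdf.stem: convert(pdf) for pdf in find_pdfs(folder_path)}

# Usage (assumes tabula-py is installed and the folder exists):
# import tabula
# frames = convert_folder(
#     'cobb_dyno_data/subaru/subaru_impreza_wrx_sti/',
#     lambda p: tabula.read_pdf(str(p), pages='all', multiple_tables=False)[0],
# )
```

Each resulting dataframe could then be written out with its stem as the CSV file name.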

### Solution to Issue #2
All three functions were added to prepare.py:
1. prepare.py/pdf_to_df(filepath)
2. prepare.py/iterate_pdfs(folder_path)
3. prepare.py/fix_pdfs(folder_path)

In [8]:
# testing all three functions
# df_dict = prepare.fix_pdfs('cobb_dyno_data/subaru/subaru_impreza_wrx_sti/')
# df_dict

Looking good! Now that these PDFs are cached, I can push each of these dataframes to CSV or use them inline.

## Issue #3: Programmatically store new dataframes to CSVs
I ran into a new issue: tabula apparently has trouble with larger PDFs due to memory allocation. After some experimenting, I found that following the URL in each PDF (getrundetails.php?runid1=number) gives a page with no column issues. I can copy the table straight from that page into a CSV manually. It's a better solution and a worse solution at the same time, but it's guaranteed to work. Well, it was fun while it lasted.

ID list: 
4591
4592
4721
4722
4443
4876
4496
4542 
4642 
5648 
4494 
4495 
4794 
4162 
4163 
5385 
5386 
4825 
4624 
4625 
4623 
4501 
4502 
4500
5168
4938
4091
4092
4093
5174
5366
5367
5170
4834
4872
4994
4764
5262
5263
5123
5389
4576
5022
5023
5310
4724
4713
4797
4634
4632
4633
4827
5554

In [9]:
runlist = [4591,4592,4721,4722,4443,4876,4496,4542,4642,5648,4494,4495,4794,4162,4163,
           5385,5386,4825,4624,4625,4623,4501,4502,4500,5168,4938,4091,4092,4093,5174,
           5366,5367,5170,4834,4872,4994,4764,5262,5263,5123,5389,4576,5022,5023,5310,
           4724,4713,4797,4634,4632,4633,4827,5554]
runlist.sort()

- 4091, OTS
- 4092, Stage 2 Sport
- 4093, Stage 2 Sport#
- 4162, Stage 2 17.5psi 93 Octane
- 4163, Stage 2 20.5psi 93 Octane
- 4443, Cobb TBE, Cobb SF Intake
- 4494, COBB Tuning Accessport, COBB Tuning Turboback- Econ- 92 Octane
- 4495, COBB Tuning Accessport, COBB Tuning Turboback- 15psi- 92 Octane
- 4496, COBB Tuning Accessport, COBB Tuning Turboback- 18psi- 92 Octane
- 4500, COBB Tuning Accessport, Turboback, COBB Tuning SF Intake- 16psi- 92 Octane
- 4501, COBB Tuning Accessport, Turboback, COBB Tuning SF Intake- 18psi- 92 Octane
- 4502, COBB Tuning Accessport, Turboback, COBB Tuning SF Intake- 19psi- 92 Octane
- 4542, Periin CAI, BOV, TBE, DW 265lph
- 4576, PW TMIC, TBE, EBCS, Tomei Header, Uppipe
- 4591, COBB Downpipe
- 4592, COBB Downpipe + Torco
- 4623, COBB Tuning Accessport, Invidia Turboback, COBB Tuning EWG (Rerouted), COBB Tuning 3-Port EBCS, COBB Tuning SF Intake/Airbox, Deatschwerks 65c Fuel Pump, Injector Dynamics 1000cc Injectors- Wastegate- 92 Octane
- 4624, COBB Tuning Accessport, Invidia Turboback, COBB Tuning EWG (Rerouted), COBB Tuning 3-Port EBCS, COBB Tuning SF Intake/Airbox, Deatschwerks 65c Fuel Pump, Injector Dynamics 1000cc Injectors- 16psi- 92 Octane
- 4625, COBB Tuning Accessport, Invidia Turboback, COBB Tuning EWG (Rerouted), COBB Tuning 3-Port EBCS, COBB Tuning SF Intake/Airbox, Deatschwerks 65c Fuel Pump, Injector Dynamics 1000cc Injectors- 18psi- 92 Octane
- 4632, COBB Tuning Accessport, COBB Tuning Downpipe, COBB Tuning EWG w/Reroute,SPT Catback, COBB Tuning EBCS, ETS TMIC, Injector Dynamics 1000cc Injectors, Deatschwerks Fuel Pump, COBB Tuning SF Intake/Airbox, LPC FPR Backdate, Perrin Big Tube Header- Wastegate-
- 4633, COBB Tuning Accessport, COBB Tuning Downpipe, COBB Tuning EWG w/Reroute,SPT Catback, COBB Tuning EBCS, ETS TMIC, Injector Dynamics 1000cc Injectors, Deatschwerks Fuel Pump, COBB Tuning SF Intake/Airbox, LPC FPR Backdate, Perrin Big Tube Header- 15psi- 92
- 4634, COBB Tuning Accessport, COBB Tuning Downpipe, COBB Tuning EWG w/Reroute,SPT Catback, COBB Tuning EBCS, ETS TMIC, Injector Dynamics 1000cc Injectors, Deatschwerks Fuel Pump, COBB Tuning SF Intake/Airbox, LPC FPR Backdate, Perrin Big Tube Header- 18.5psi- 9
- 4642 Anthony L, TBE, EBCS
- 4713, Invidia TBE, COBB SF Intake, COBB EBCS, DW FP, PW TMIC, Ported WG
- 4721, COBB SF Intake, COBB Downpipe, DW FP, Protune
- 4722, COBB SF Intake, COBB Downpipe, DW FP, OTS Map (Stage 2 + SF ACN91)
- 4724, Invidia TBE, Tomei ELH, COBB SF Intake
- 4764, COBB Tuning Accessport, COBB Tuning Turboback, Deatschwerks Fuel Pump, Injector Dymamics 1000cc Injectors, '04 STi Fuel Pressure Regulator- 15psi- 92 Octane
- 4794, COBB Tuning Accessport, Perrin Turboback, Perrin Big Tube Header/Up Pipe, Perrin CAI w/Airbox- 14psi- 92 Octane
- 4797, COBB Tuning Accessport, 3" Turboback Exhaust, '04 STi FPR Backdate- 13psi- 92 Octane
- 4825, TBE, COBB SF Intake
- 4827, COBB Downpipe, SF Intake
- 4834, Downpipe, SF Intake
- 4872, TBE, SF Intake
- 4876, SF Intake, TBE
- 4938, COBB Tuning Accessport, Nameless TBE, Tial 38mm EWG- DTA, Perrin Header, Grimmspeed EBCS, Process West TMIC, COBB Tuning SF Intake, Deatschwerks 65c Fuel Pump- 18psi- 92 Octane
- 4994, Cobb Tuning Accessport, Cobb Tuning downpipe, SPT catback, GD STi FPR, COBB 1000cc, DW 65c- 14psi - 92 octane
- 5022, Built Block Stage 2 w/ Headers E85
- 5023, Built Block Stage 2 w/ Headers ACN91
- 5123, Stage 2 19.5psi 93 Octane
- 5168, 18.5psi 93 octane
- 5170, Stage 2 19psi 93 Octane
- 5174, 20G 22psi 91 Octane
- 5262, Stage 2 OTS
- 5263, Stage 2 19.5psi 93 Octane
- 5310, Stage 2 19.5psi 93 Octane
- 5366, GTX3076 23psi 91 Octane
- 5367, GTX3076 28psi 110 Octane
- 5385, Stage 3 93 Octane 19.5psi
- 5386, Stage 3 E85 19.5psi
- 5389, COBB 20g 20psi 93 Octane
- 5554, Stage 3 20psi 93 octane
- 5648, Stage 2 20psi 93 Octane

Alright... it's unfortunate that I ended up doing it this way, but it's done.

### Issue #3, Part Two
I still wanted to read the data programmatically, and I found a solution in the pandas read_html function, which I didn't know about until now.

In [10]:
pd.read_html('https://dyno.cobbtuning.com/getrundetails.php?runid1=4591')[1]

Unnamed: 0,RPM,HP,Torque,AFR,Boost
0,2240,88.0,207.0,12.5,7.5
1,2260,90.0,210.0,12.3,7.7
2,2280,92.0,212.0,12.2,8.1
3,2300,93.0,213.0,12.0,8.4
4,2320,94.0,214.0,11.9,8.8
...,...,...,...,...,...
213,6500,300.0,243.0,10.7,15.3
214,6520,299.0,241.0,10.7,15.3
215,6540,297.0,239.0,10.7,15.2
216,6560,296.0,238.0,10.7,15.2


As you can see, this fairly simple approach can work for a lot of dyno runs. Here we go...

In [11]:
# checking if easy append works
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
run_4591 = pd.read_html('https://dyno.cobbtuning.com/getrundetails.php?runid1=4591')[1]
pd.concat([run_4591, run_4591])

Unnamed: 0,RPM,HP,Torque,AFR,Boost
0,2240,88.0,207.0,12.5,7.5
1,2260,90.0,210.0,12.3,7.7
2,2280,92.0,212.0,12.2,8.1
3,2300,93.0,213.0,12.0,8.4
4,2320,94.0,214.0,11.9,8.8
...,...,...,...,...,...
213,6500,300.0,243.0,10.7,15.3
214,6520,299.0,241.0,10.7,15.3
215,6540,297.0,239.0,10.7,15.2
216,6560,296.0,238.0,10.7,15.2


Car info is in the first dataframe, dyno runs are in the second dataframe, so all we have to do is store the runs to one dataframe and the info to another, then merge the two dataframes. 

I've done this below for the originally planned 2013 WRX STI analysis. Because it was so successful, I ended up running the same code for every run ID from 1 up to 10,000.

In [12]:
dyno_run_df = pd.DataFrame({'Run':[], 'RPM':[], 'HP':[], 'Torque':[], 'AFR':[], 'Boost':[]})
car_info_df = pd.DataFrame({'Run':[], 'Date':[], 'Car':[], 'Name':[], 'Specs':[]})

for run in runlist:
    # read_html returns every table on the page: [0] is car info, [1] is the dyno run
    all_df = pd.read_html('https://dyno.cobbtuning.com/getrundetails.php?runid1=' + str(run))
    # add dyno run data to dyno_run_df
    dyno_run = all_df[1]
    dyno_run['Run'] = str(run)
    # DataFrame.append was removed in pandas 2.0, so use pd.concat
    dyno_run_df = pd.concat([dyno_run_df, dyno_run])
    # add car info to car_info_df
    date = all_df[0][1][0]
    car = all_df[0][0][0]
    name = all_df[0][0][1]
    specs = all_df[0][0][2]
    car_info = pd.DataFrame({'Run':[str(run)], 'Date':[date], 'Car':[car], 'Name':[name], 'Specs':[specs]})
    car_info_df = pd.concat([car_info_df, car_info])
    

print(dyno_run_df.sample(10))
print(car_info_df.sample(3))

      Run     RPM     HP  Torque    AFR  Boost
212  5170  4150.0  261.0   333.0  11.50  19.06
2    4764  2020.0   58.0   151.0  15.10   3.50
424  4625  6080.0  299.0   259.0  11.19  14.40
111  4633  2940.0  144.0   259.0  11.17  13.20
83   4938  2670.0  129.0   257.0  20.00  13.00
307  5554  5110.0  296.0   306.0  10.50  17.50
17   4642  2580.0  115.0   235.0  12.40   9.40
118  4624  3020.0  161.0   284.0  10.96  15.40
424  4502  6240.0  286.0   241.0  11.30  14.20
249  4634  4330.0  255.0   312.0  11.05  18.50
    Run                 Date                          Car            Name  \
0  4834  08-30-2013 01:35 pm  2013 Subaru Impreza WRX STI         Lucas L   
0  4634  06-21-2013 04:01 pm  2013 Subaru Impreza WRX STI  Sterling Brand   
0  4872  10-03-2013 02:56 pm  2013 Subaru Impreza WRX STI         Mason L   

                                               Specs  
0                                Downpipe, SF Intake  
0  COBB Tuning Accessport, COBB Tuning Downpipe, ...  
0        

In [13]:
all_run_df = pd.merge(left=dyno_run_df, right=car_info_df, on='Run')
all_run_df

Unnamed: 0,Run,RPM,HP,Torque,AFR,Boost,Date,Car,Name,Specs
0,4091,2030.0,48.0,126.0,11.7,3.28,11-13-2012 05:16 pm,2013 Subaru Impreza WRX STI,Jacob Gardner,OTS
1,4091,2040.0,48.0,125.0,11.7,3.40,11-13-2012 05:16 pm,2013 Subaru Impreza WRX STI,Jacob Gardner,OTS
2,4091,2050.0,49.0,125.0,11.7,3.52,11-13-2012 05:16 pm,2013 Subaru Impreza WRX STI,Jacob Gardner,OTS
3,4091,2060.0,49.0,125.0,11.7,3.63,11-13-2012 05:16 pm,2013 Subaru Impreza WRX STI,Jacob Gardner,OTS
4,4091,2070.0,49.0,126.0,11.7,3.74,11-13-2012 05:16 pm,2013 Subaru Impreza WRX STI,Jacob Gardner,OTS
...,...,...,...,...,...,...,...,...,...,...
20668,5648,6730.0,288.0,225.0,11.2,14.05,12-09-2015 03:52 pm,2013 Subaru Impreza WRX STI,Arnold Marroquin,Stage 2 20psi 93 Octane
20669,5648,6740.0,287.0,225.0,11.2,14.02,12-09-2015 03:52 pm,2013 Subaru Impreza WRX STI,Arnold Marroquin,Stage 2 20psi 93 Octane
20670,5648,6750.0,287.0,224.0,11.2,14.00,12-09-2015 03:52 pm,2013 Subaru Impreza WRX STI,Arnold Marroquin,Stage 2 20psi 93 Octane
20671,5648,6760.0,287.0,224.0,11.2,13.98,12-09-2015 03:52 pm,2013 Subaru Impreza WRX STI,Arnold Marroquin,Stage 2 20psi 93 Octane


Here is the 10,000-run web scrape that I ran. It's commented out because the query took about 45 minutes to complete, and I've stored the results to CSVs, which I plan to upload to Kaggle in some form. If you're reading this, I've likely already done so and the link is somewhere in my repository (I hope).

In [14]:
# dyno_run_df = pd.DataFrame({'Run':[], 'RPM':[], 'HP':[], 'Torque':[], 'AFR':[], 'Boost':[]})
# car_info_df = pd.DataFrame({'Run':[], 'Date':[], 'Car':[], 'Name':[], 'Specs':[]})

# for run in range(1,10000):
#     all_df = pd.read_html('https://dyno.cobbtuning.com/getrundetails.php?runid1=' + str(run))
#     # add dyno run data to dyno_run_df
#     dyno_run = all_df[1]
#     dyno_run['Run'] = str(run)
#     # DataFrame.append was removed in pandas 2.0, so use pd.concat
#     dyno_run_df = pd.concat([dyno_run_df, dyno_run])
#     # add car info to car_info_df
#     date = all_df[0][1][0]
#     car = all_df[0][0][0]
#     name = all_df[0][0][1]
#     specs = all_df[0][0][2]
#     car_info = pd.DataFrame({'Run':[str(run)], 'Date':[date], 'Car':[car], 'Name':[name], 'Specs':[specs]})
#     car_info_df = pd.concat([car_info_df, car_info])
    
# all_run_df = pd.merge(left=dyno_run_df, right=car_info_df, on='Run')
# all_run_df
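
For a scrape this large, a slightly more defensive version may be worth considering. The sketch below is not what I ran: `scrape_runs` and its `fetch` parameter are hypothetical, and the assumption that a page without a run either raises or lacks the expected tables is untested. It collects frames in lists and concatenates once, which is generally cheaper than growing a dataframe inside the loop.

```python
import pandas as pd

BASE_URL = 'https://dyno.cobbtuning.com/getrundetails.php?runid1='

def scrape_runs(run_ids, fetch=pd.read_html):
    """Collect dyno and car-info frames for each run ID, skipping IDs that fail."""
    dyno_frames, info_rows, skipped = [], [], []
    for run in run_ids:
        try:
            tables = fetch(BASE_URL + str(run))
            info, dyno = tables[0], tables[1].copy()
            row = {'Run': str(run), 'Date': info[1][0], 'Car': info[0][0],
                   'Name': info[0][1], 'Specs': info[0][2]}
        except (ValueError, IndexError, KeyError, OSError):
            # Assumption: pages without a run raise or lack the expected tables.
            skipped.append(run)
            continue
        dyno['Run'] = str(run)
        dyno_frames.append(dyno)
        info_rows.append(row)
    dyno_df = pd.concat(dyno_frames, ignore_index=True) if dyno_frames else pd.DataFrame()
    return dyno_df, pd.DataFrame(info_rows), skipped
```

The returned `skipped` list makes it easy to see which run IDs produced no data, and passing a stand-in for `fetch` lets the loop be tested offline.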

Storing to CSVs...

In [15]:
# all_run_df.to_csv('cobb_dyno_data.csv')

In [16]:
# dyno_run_df.to_csv('dyno_runs.csv')

In [17]:
# car_info_df.to_csv('car_info.csv')

## Issue #4: Deciding what to do with the data
I now have CSVs for each car's dyno test. Some CSVs are runs on the same car with different gas octane and tuning. I could look at each car's max horsepower as an observation and create features for boost PSI levels, gas octane levels, stages, etc. I could also look at max torque, or other maximums. 

### Understanding the data
1. RPM: Revolutions Per Minute, referring to how many times the crankshaft spins in a minute. Independent variable.
2. HP: Horsepower, referring to the power output of the engine, likely calculated at the wheels (WHP). Dependent.
3. Torque: refers to the rotational force the engine produces, in lb-ft. Dependent.
4. AFR: Air-Fuel Ratio, refers to how lean the engine is running (higher number = more lean). Independent (you may consider this dependent, but the issue is that you can tune the car to run more lean or more rich).
5. Boost: refers to the additional air pressure in PSI from a turbo. Independent.
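
HP and torque are not independent of each other: with torque expressed in lb-ft, horsepower is torque times RPM divided by 5252. The short check below uses the first row of run 4591 shown earlier and assumes the dyno reports torque in lb-ft; `horsepower` is a helper name introduced here for illustration.

```python
def horsepower(torque_lbft, rpm):
    """Horsepower from torque in lb-ft and engine speed in RPM."""
    return torque_lbft * rpm / 5252

# First row of run 4591 above: 2240 RPM, torque 207, reported HP 88.
print(round(horsepower(207, 2240), 1))  # 88.3
```

The small gap from the reported 88 is consistent with the site rounding both columns to whole numbers.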

### Decision
For the initial analysis, I will look at max horsepower as an observation. I'll add features that capture the different specifications of each car, then perform statistical analysis on those features against horsepower. Later, I may look at max torque (would be fairly easy to swap with max horsepower as a variable). I could look into area-under-curve analysis for horsepower and torque to see which cars have best performance through the entire cycle, but I'd need to be careful with cars having a wider RPM spread (torque and HP held even, a car with wider RPM spread will have larger area under curve).
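
One way to handle the RPM-spread concern is to divide the area under the curve by the RPM span, which gives the average value over the band instead of a raw area. This is only a sketch of that idea using the trapezoidal rule: `mean_over_band` is a hypothetical helper, and the two flat curves are made-up numbers.

```python
def mean_over_band(rpm, values):
    """Trapezoidal area under the curve divided by the RPM span (the average value).

    Assumes rpm is sorted in ascending order.
    """
    if len(rpm) != len(values) or len(rpm) < 2:
        raise ValueError('need at least two matching points')
    area = sum((rpm[i + 1] - rpm[i]) * (values[i] + values[i + 1]) / 2
               for i in range(len(rpm) - 1))
    return area / (rpm[-1] - rpm[0])

# Two made-up flat 200 HP curves: the wide band has twice the raw area
# but the same average once the RPM span is divided out.
narrow = mean_over_band([3000, 4000, 5000], [200, 200, 200])
wide = mean_over_band([2000, 4000, 6000], [200, 200, 200])
print(narrow, wide)  # 200.0 200.0
```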

# Prepare
I will first create a function that locates the max HP row in each CSV and appends the row to a max_hp dataframe along with an index. Then I will create a feature list and add each feature as a column with one-hot encoded values (manually-entered) for whether the observation had the feature or not. After this is done, I will push the max-HP dataframe to a CSV.

In [18]:
# function here
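
Until the real function is written, here is one possible shape for it, assuming the combined data has the `Run` and `HP` columns shown earlier. `max_hp_rows` is a hypothetical name and the demo frame is made up.

```python
import pandas as pd

def max_hp_rows(runs):
    """Return one row per run: the row where HP peaks (first row wins on ties)."""
    runs = runs.reset_index(drop=True)  # idxmax needs unique index labels
    return runs.loc[runs.groupby('Run')['HP'].idxmax()].reset_index(drop=True)

# Made-up demo data, not real dyno results.
demo = pd.DataFrame({
    'Run': ['1', '1', '2', '2'],
    'RPM': [5000, 6000, 5000, 6000],
    'HP': [250.0, 240.0, 280.0, 300.0],
})
print(max_hp_rows(demo))
```

Resetting the index first matters for a frame built by stacking runs, where the same index label can appear in more than one run.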

Feature candidates:
1. OTS: Off-The-Shelf, a standard tuning setting for a car of that make, model, and year
2. Stages 1,2,3: Specific tune and parts, each stage increases performance. See: https://thinktuning.com/wrx-stages/. Downpipes are part of Stage 2, but dyno tests say downpipe *or* Stage 2, so I will need to combine the two in some way, potentially relaxing Stage 2 into its component parts and making each part a feature
3. PSI: Boost amounts, lowest is 13, highest is 28
4. Octane: Fuel quality, E85 is 105 octane, lowest is 91, highest is 110
5. Cobb Tuning Accessport
6. Parts: There are a lot of different parts combinations, but it might be useful to look at each of them against horsepower.
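
Some of these candidates could be pulled out of the Specs text with regular expressions instead of being entered by hand. The sketch below is a starting point under that assumption: `parse_specs` is a hypothetical helper, the patterns only cover spellings like "19.5psi" and "93 Octane", and strings such as "ACN91" that omit the word "Octane" are not handled.

```python
import re

def parse_specs(specs):
    """Pull stage, boost PSI, octane and a few flags out of a free-text Specs string."""
    stage = re.search(r'stage\s*(\d)', specs, re.IGNORECASE)
    psi = re.search(r'(\d+(?:\.\d+)?)\s*psi', specs, re.IGNORECASE)
    octane = re.search(r'(\d{2,3})\s*octane', specs, re.IGNORECASE)
    return {
        'stage': int(stage.group(1)) if stage else None,
        'psi': float(psi.group(1)) if psi else None,
        'octane': int(octane.group(1)) if octane else None,
        'e85': 'e85' in specs.lower(),
        'ots': bool(re.search(r'\bOTS\b', specs)),
        'accessport': 'accessport' in specs.lower(),
    }

print(parse_specs('Stage 2 19.5psi 93 Octane'))
```

Fields that are missing from a string come back as `None`, so rows the patterns cannot read can still be filled in manually afterwards.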