# Overview
This notebook covers my acquisition and preparation work on the scraped dyno run data.

# Imports

In [1]:
import pandas as pd
import re

# Car Info CSV work
Let's look at car_info.csv first.

In [2]:
# ingest car info csv
df = pd.read_csv('car_info.csv')
df

Unnamed: 0,Run,Date,Car,Name,Specs
0,4,11-08-2009 05:51 pm,2009 Nissan GT-R,GOTO:Racing,"Stock engine internals, AMS turbo upgrade, Dea..."
1,5,10-21-2009 01:08 pm,2002 Subaru Impreza WRX,Neil Bywater,"2.5L Sti longblock (No avcs)FP Green, APS 525 ..."
2,31,11-12-2009 10:56 am,2010 Nissan GT-R,Jason McCartney,2010 GTR - Completely stock - 100 octane fuel ...
3,32,11-12-2009 10:57 am,2010 Nissan GT-R,Mike Cheng,2010 GTR - stock with high flow downpipe - 94 ...
4,33,11-02-2009 10:58 am,2009 Nissan GT-R,Dave Pickering,2009 GTR - stage 2 with full exhaust and stock...
...,...,...,...,...,...
4950,5926,10-13-2017 12:25 pm,2016 Volkswagen Golf R,COBB Tuning,Stage 1
4951,5927,10-13-2017 12:25 pm,2016 Volkswagen Golf R,COBB Tuning,Stage 2
4952,5933,10-13-2017 12:47 pm,2016 Volkswagen Golf R,COBB Tuning,Stock
4953,5934,10-13-2017 12:47 pm,2016 Volkswagen Golf R,COBB Tuning,Stage 1 High Boost


In [3]:
# check nulls
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4955 entries, 0 to 4954
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Run     4955 non-null   int64 
 1   Date    4955 non-null   object
 2   Car     4955 non-null   object
 3   Name    4955 non-null   object
 4   Specs   4955 non-null   object
dtypes: int64(1), object(4)
memory usage: 193.7+ KB


**Looks good: no nulls. We will need to do a few things before Explore:**
- Split the Car column into year, make, and model columns
- Convert the Date column to a datetime format
- Split the runs into Train, Validate, and Test

## 'Car' column split into year, make, and model columns

In [4]:
# checking how we could split the Car column
df.Car.unique().tolist()

['2009 Nissan GT-R',
 '2002 Subaru Impreza WRX',
 '2010 Nissan GT-R',
 '2004 Subaru Impreza WRX STI',
 '2010 Mazda Mazdaspeed3',
 '2006 Subaru Impreza WRX STI',
 '2009 Honda Civic Si',
 '2007 Mazda Mazdaspeed3',
 '2005 Subaru Outback XT',
 '1998 Subaru Impreza RS',
 '2004 Subaru Fat Albert',
 '2008 Subaru Impreza WRX STI',
 '2008 Mitsubishi EVO X GSR',
 '2008 Mitsubishi EVO X MR',
 '2005 Subaru Impreza WRX',
 '2009 Nissan 370Z',
 '2007 Subaru Legacy 2.5 spec.B',
 '2005 Subaru Impreza WRX STI',
 '2006 Subaru Impreza WRX',
 '2008 Subaru Impreza WRX',
 '2008 Subaru Legacy 2.5 spec.B',
 '1998 Subaru Legacy',
 '2005 Subaru Legacy 2.5GT',
 '2003 Subaru Impreza WRX',
 '2009 Subaru Impreza WRX',
 '2008 Mitsubishi EVO X',
 '2006 Subaru Legacy 2.5GT',
 '2008 Mazda Mazdaspeed3',
 '2007 Subaru Impreza WRX STI',
 '2009 Subaru Impreza WRX STI',
 '2006 Mitsubishi EVO VIII',
 '2006 Nissan 350Z',
 '2005 Nissan 350Z',
 '2007 Subaru Impreza WRX',
 '2008 Honda Civic Si',
 '2009 Mazda Mazdaspeed3',
 '2009 

In [5]:
# use the first four characters as the year (kept as a string)
df['car_year'] = df.Car.str[:4]
df.car_year.sample(3, random_state=1)

1271    2010
2360    2008
2369    2011
Name: car_year, dtype: object

In [6]:
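# split the rest of the string into make and model:
# the first \W consumes the space after the year, the lazy (.*?) captures the make
# up to the next non-word character, and the second group captures the remainder
# note: \W also matches a hyphen, so a hyphenated make would be cut at the hyphen
make_model = df.Car.str.extract(r'\W(.*?)\W(.*?)$')
# the second word is the make
df['car_make'] = make_model[0]
# everything after the make is the model
df['car_model'] = make_model[1]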

# check work
df.head(3)

Unnamed: 0,Run,Date,Car,Name,Specs,car_year,car_make,car_model
0,4,11-08-2009 05:51 pm,2009 Nissan GT-R,GOTO:Racing,"Stock engine internals, AMS turbo upgrade, Dea...",2009,Nissan,GT-R
1,5,10-21-2009 01:08 pm,2002 Subaru Impreza WRX,Neil Bywater,"2.5L Sti longblock (No avcs)FP Green, APS 525 ...",2002,Subaru,Impreza WRX
2,31,11-12-2009 10:56 am,2010 Nissan GT-R,Jason McCartney,2010 GTR - Completely stock - 100 octane fuel ...,2010,Nissan,GT-R
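
As a cross-check on the split above, here is a minimal sketch using only the standard `re` module. The pattern and the `split_car` helper are my own illustration, not part of the notebook's pipeline. It assumes every value has the shape "four-digit year, one-word make, then model"; a make containing a space, if any exist in the data, would be split incorrectly.

```python
import re

# Hypothetical pattern: four-digit year, one-word make, remainder as the model.
CAR_PATTERN = re.compile(r'^(?P<year>\d{4})\s+(?P<make>\S+)\s+(?P<model>.+)$')


def split_car(car):
    """Return (year, make, model) for a 'YYYY Make Model' string, or None if it does not match."""
    match = CAR_PATTERN.match(car)
    if match is None:
        return None
    return match.group('year'), match.group('make'), match.group('model')


# Sample values taken from the unique Car list above.
for car in ['2009 Nissan GT-R', '2007 Subaru Legacy 2.5 spec.B', '2008 Mitsubishi EVO X MR']:
    print(split_car(car))
```

Returning `None` for non-matching strings makes it easy to list any Car values that do not follow the expected shape before trusting the new columns.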


In [7]:
# drop redundant Car column
df = df.drop(columns='Car')
df.head(3)

Unnamed: 0,Run,Date,Name,Specs,car_year,car_make,car_model
0,4,11-08-2009 05:51 pm,GOTO:Racing,"Stock engine internals, AMS turbo upgrade, Dea...",2009,Nissan,GT-R
1,5,10-21-2009 01:08 pm,Neil Bywater,"2.5L Sti longblock (No avcs)FP Green, APS 525 ...",2002,Subaru,Impreza WRX
2,31,11-12-2009 10:56 am,Jason McCartney,2010 GTR - Completely stock - 100 octane fuel ...,2010,Nissan,GT-R


## Datetime column

In [8]:
# capture date and time into columns
date_time = df.Date.str.extract(r'^(.*?) (.*?)$')

In [9]:
# show discrepancy in date
date_time[0].unique().tolist()[53:58]

['10-27-2009', '11-05-2009', '10-26-2009', '11-30--0001', '12-01-2009']
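
Values like `11-30--0001` can also be flagged by attempting a strict parse and recording the failures. The sketch below uses the standard library's `datetime.strptime`. The format string is an assumption inferred from the sample rows (for example `11-08-2009 05:51 pm`), and `parse_run_date` is a hypothetical helper name.

```python
from datetime import datetime

# Assumed layout of the Date column, e.g. '11-08-2009 05:51 pm'.
DATE_FORMAT = '%m-%d-%Y %I:%M %p'


def parse_run_date(text):
    """Return a datetime for a well-formed run date, or None if it cannot be parsed."""
    try:
        return datetime.strptime(text, DATE_FORMAT)
    except ValueError:
        return None


print(parse_run_date('11-08-2009 05:51 pm'))
print(parse_run_date('11-30--0001 04:00 pm'))  # the malformed value seen above
```

Any row for which the helper returns `None` is a candidate for repair or removal.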

Runs look chronological from what I can tell, so I bet we can fix the discrepancy by filling in the correct year from the neighboring runs...

In [10]:
# check runs with the incorrect year
df[df.Date.str.contains('11-30--0001')]

Unnamed: 0,Run,Date,Name,Specs,car_year,car_make,car_model
98,160,11-30--0001 04:00 pm,Joey Hauser,98 Octane,2008,Mitsubishi,EVO X
117,180,11-30--0001 04:00 pm,Jeff Sponaugle,"Built 2.5L, GT30R .63 Rotated Kit, 92 Octane",2008,Subaru,Impreza WRX STI
118,181,11-30--0001 04:00 pm,Jeff Sponaugle,"Built H6, Custom GT35R Kit, 92 Octane",2002,Subaru,Impreza WRX
119,182,11-30--0001 04:00 pm,Jeff Sponaugle,"Built H6, Custom GT35R Kit, 100 Octane",2002,Subaru,Impreza WRX
120,183,11-30--0001 04:00 pm,Tim Bailey,"Built 2.5L, Cosworth Heads & Cams, GT35R, AEM ...",2004,Subaru,Impreza WRX STI
216,299,11-30--0001 04:00 pm,Sage Merrill,"Evo3 16g, TMIC, 750cc Injectors, SF Intake, 18...",2002,Subaru,Impreza WRX
217,300,11-30--0001 04:00 pm,Sage Merrill,"Evo3 16g, TMIC, 750cc Injectors, SF Intake, 16...",2002,Subaru,Impreza WRX
1338,1625,11-30--0001 03:35 pm,Mark Dickins,Stage 2 19psi 91 Octane,2007,Subaru,Impreza WRX STI
1339,1626,11-30--0001 04:34 pm,Mark Dickins,Stage 2 21psi E85,2007,Subaru,Impreza WRX STI
1404,1695,11-30--0001 04:41 pm,RUF RUF,Stock Baseline,2007,Porsche,911


In [11]:
def run_date_fixer(col):
    """ Return the run numbers with erroneous dates plus the run numbers just before and after each """
    col_prev = col - 1
    col_next = col + 1
    # Series.append was removed in pandas 2.0; pd.concat keeps the same ordering
    new_col = pd.concat([col, col_prev, col_next]).tolist()
    return new_col

In [12]:
# get list of runs
run_checks = run_date_fixer(df[df.Date.str.contains('11-30--0001')].Run)

# check work
run_checks.sort()
print(run_checks)

[159, 160, 161, 179, 180, 180, 181, 181, 181, 182, 182, 182, 183, 183, 184, 298, 299, 299, 300, 300, 301, 1624, 1625, 1625, 1626, 1626, 1627, 1694, 1695, 1696, 1696, 1697, 1698, 2502, 2503, 2504, 2882, 2883, 2883, 2884, 2884, 2885, 3402, 3403, 3403, 3404, 3404, 3404, 3405, 3405, 3405, 3406, 3406, 3406, 3407, 3407, 3407, 3408, 3408, 3408, 3409, 3409, 3410, 4132, 4133, 4133, 4134, 4134, 4134, 4135, 4135, 4135, 4136, 4136, 4137, 4298, 4299, 4299, 4300, 4300, 4300, 4301, 4301, 4301, 4302, 4302, 4303]
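
The list above repeats run numbers because consecutive bad runs share neighbors (181 is both a bad run and a neighbor of 180 and 182). A set-based variant avoids the repeats. This is a sketch only: `neighboring_runs` is a hypothetical name, and it takes a plain list of run numbers rather than a pandas Series.

```python
def neighboring_runs(bad_runs):
    """Return the sorted, de-duplicated run numbers for each bad run plus the run before and after it."""
    neighbors = set()
    for run in bad_runs:
        # keep the bad run itself along with its immediate neighbors
        neighbors.update((run - 1, run, run + 1))
    return sorted(neighbors)


print(neighboring_runs([160, 180, 181, 182, 183]))
# [159, 160, 161, 179, 180, 181, 182, 183, 184]
```

With the duplicates removed, a slice such as `[:20]` covers more distinct runs than the same slice of the original list.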


In [13]:
# use the run_checks list to pull the rows with those run numbers (first 20 entries only)
df[df.Run.isin(run_checks[:20])]

Unnamed: 0,Run,Date,Name,Specs,car_year,car_make,car_model
97,159,10-26-2009 11:38 am,Jake Zvirzdys,"Stage 2, 92 Octane",2006,Subaru,Impreza WRX
98,160,11-30--0001 04:00 pm,Joey Hauser,98 Octane,2008,Mitsubishi,EVO X
99,161,12-07-2009 11:42 am,John Wulf,"Stage 2, AEM Intake, AVO TMIC, 17.5psi, 92 Octane",2006,Subaru,Legacy 2.5GT
116,179,10-08-2009 03:38 pm,Tad Ogland,"Stage 2, 92 Octane",2008,Mitsubishi,EVO X GSR
117,180,11-30--0001 04:00 pm,Jeff Sponaugle,"Built 2.5L, GT30R .63 Rotated Kit, 92 Octane",2008,Subaru,Impreza WRX STI
118,181,11-30--0001 04:00 pm,Jeff Sponaugle,"Built H6, Custom GT35R Kit, 92 Octane",2002,Subaru,Impreza WRX
119,182,11-30--0001 04:00 pm,Jeff Sponaugle,"Built H6, Custom GT35R Kit, 100 Octane",2002,Subaru,Impreza WRX
120,183,11-30--0001 04:00 pm,Tim Bailey,"Built 2.5L, Cosworth Heads & Cams, GT35R, AEM ...",2004,Subaru,Impreza WRX STI
121,184,11-30-2009 03:46 pm,Tim Diens,"Stock, 92 Octane",2009,Subaru,Impreza WRX STI
215,298,01-01-2010 05:08 pm,Lance Lucas,"GT35R .82, Built 2.5L Longblock, 272 Cams, 21-...",2004,Subaru,Impreza WRX


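The neighboring runs are not in strict date order (run 159 is dated 10-26-2009, run 161 is 12-07-2009, and run 179 is 10-08-2009), so the bad dates can't be reliably inferred from them. Given the cleaning the 'Date' column would need and how little value it adds, I'll just drop it.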

In [14]:
# drop Date column
df = df.drop(columns='Date')
df.head(3)

Unnamed: 0,Run,Name,Specs,car_year,car_make,car_model
0,4,GOTO:Racing,"Stock engine internals, AMS turbo upgrade, Dea...",2009,Nissan,GT-R
1,5,Neil Bywater,"2.5L Sti longblock (No avcs)FP Green, APS 525 ...",2002,Subaru,Impreza WRX
2,31,Jason McCartney,2010 GTR - Completely stock - 100 octane fuel ...,2010,Nissan,GT-R


Looks good. We'll split the data in our wrangle.py module.
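
The contents of wrangle.py are not shown here, so the following is only a rough sketch of what a Train, Validate, and Test split over run numbers could look like. The `split_runs` name, the 60/20/20 proportions, and the seed are all assumptions for illustration, and it uses the standard `random` module rather than whatever the module actually uses.

```python
import random


def split_runs(run_ids, validate_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle run ids reproducibly and return (train, validate, test) lists."""
    ids = list(run_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is repeatable
    n_test = int(len(ids) * test_frac)
    n_validate = int(len(ids) * validate_frac)
    test = ids[:n_test]
    validate = ids[n_test:n_test + n_validate]
    train = ids[n_test + n_validate:]
    return train, validate, test


train, validate, test = split_runs(range(100))
print(len(train), len(validate), len(test))
```

Splitting on run numbers keeps each dyno run in exactly one subset, which matters if the run-level measurements are later joined back onto this table.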