# Grand Circus Final Project
### Car Crash and Safety Data Comparisons/Evaluations

This project compares safety ratings from crash tests with real-world fatal crash data. Fatal crash data is better suited to severe crashes where occupants' lives were in danger, and it provides more relevant entries than fender benders or other minor 'traffic incidents'. This analysis could be useful for car buyers, car manufacturers, government testers, and insurance companies.

## Extraction
To start the ETL process, data must be extracted and placed into usable structures. To do this, we will import the data from the API(s) and any other flat-file sources.

In [4]:
import pandas as pd
import requests
import json
import tqdm

# Begin pulling make names and ID's for internal use
# Definitions endpoint query
make_url = "https://crashviewer.nhtsa.dot.gov/CrashAPI/definitions/GetVariableAttributes?variable=make&caseYear=2021&format=json"

# Get response
response = requests.get(make_url)
# Turn response into json
data = response.json()

In [5]:
# Drill down JSON to a list of dictionaries
results = data['Results'][0]

In [51]:
# split data into lists
id_list = []
name_list = []
for entry in results:
    id_list.append(int(entry['ID']))
    name_list.append(entry['TEXT'])

# Make columns dictionary based on lists
data = {'MakeID': id_list, 'Name': name_list}

# Create df using dictionary
manufacturer_df = pd.DataFrame(data)

# Sort by Id instead of name
manufacturer_df = manufacturer_df.sort_values(by=['MakeID'])
manufacturer_df.head()

Unnamed: 0,MakeID,Name
3,1,American Motors
38,2,Jeep / Kaiser-Jeep / Willys- Jeep
2,3,AM General
13,6,Chrysler
18,7,Dodge


## Only Taking Top 11 Best-Selling Makes
Since the API contains data for every vehicle involved in crashes, such as the American Motors Ambassador made from 1952-1974, a fair portion of vehicles are not statistically relevant, or would be outweighed by more common vehicles. To prevent a weighting issue where more prevalent vehicles skew the results toward suggesting that their crashes are more common, we will use only some of the most popular makes.

In [53]:
to_keep = ['Nissan/Datsun', 'Toyota', 'KIA', 'Honda', 'Subaru', 'Ford', 'Chevrolet', 'Hyundai', 'Jeep / Kaiser-Jeep / Willys- Jeep', 'GMC', 'Dodge']
# .copy() makes the filtered frame independent, so the rename does not act on a slice
manufacturer_df = manufacturer_df[manufacturer_df['Name'].isin(to_keep)].copy()
manufacturer_df = manufacturer_df.rename(columns={'Name': 'MakeName'})
manufacturer_df.head(10)

Unnamed: 0,MakeID,MakeName
38,2,Jeep / Kaiser-Jeep / Willys- Jeep
18,7,Dodge
23,12,Ford
12,20,Chevrolet
27,23,GMC
55,35,Nissan/Datsun
30,37,Honda
73,48,Subaru
76,49,Toyota
31,55,Hyundai
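
Because `isin` silently ignores names that do not match exactly, it may be worth checking that every requested make was actually found. The following is a minimal sketch of such a check; `missing_names` is a hypothetical helper, and the sample values are illustrative rather than taken from the API.

```python
def missing_names(wanted, available):
    """Return the wanted names that do not appear in `available`, in order."""
    available_set = set(available)
    return [name for name in wanted if name not in available_set]


# Illustrative values only; in the notebook these would be `to_keep`
# and the make-name column of manufacturer_df.
wanted = ['Toyota', 'KIA', 'Kia Motors']
available = ['Toyota', 'KIA', 'Honda']
print(missing_names(wanted, available))  # ['Kia Motors']
```

An empty list would mean every name in `to_keep` matched a make returned by the API.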


### Fetching Model IDs

In [55]:
import time

all_models = []
for make_ID in manufacturer_df['MakeID']:
    model_url = f'https://crashviewer.nhtsa.dot.gov/CrashAPI/definitions/GetVariableAttributesForModel?variable=model&caseYear=2021&make={make_ID}&format=json'
    response = requests.get(model_url)
    model_data = response.json()

    # 'Results' is a list of lists; each inner list holds the model dictionaries
    results_model = model_data.get('Results') or []

    # sleep for polite scraping
    time.sleep(1)
    for model in results_model:
        all_models.append({
            'MakeID': make_ID,
            'Models': model
        })
# Drill down into JSON
drill_down = all_models[0]['Models']
drill_down

[{'ID': 404,
  'MODELNAME': 'Cherokee (1984-on) (For Grand Cherokee for 2014 on use 02-422.)',
  'Make': None},
 {'ID': 421, 'MODELNAME': 'Cherokee (thru 1983)', 'Make': None},
 {'ID': 401, 'MODELNAME': 'CJ-2/CJ-3/CJ-4', 'Make': None},
 {'ID': 402, 'MODELNAME': 'CJ-5/CJ-6/CJ-7/CJ-8', 'Make': None},
 {'ID': 482, 'MODELNAME': 'Comanche', 'Make': None},
 {'ID': 406, 'MODELNAME': 'Commander', 'Make': None},
 {'ID': 1, 'MODELNAME': 'Compass', 'Make': None},
 {'ID': 483, 'MODELNAME': 'Gladiator', 'Make': None},
 {'ID': 422,
  'MODELNAME': 'Grand Cherokee (For 2014 on.  Use model 404 for model years prior to 2013.)',
  'Make': None},
 {'ID': 431, 'MODELNAME': 'Grand Wagoneer', 'Make': None},
 {'ID': 405, 'MODELNAME': 'Liberty', 'Make': None},
 {'ID': 498, 'MODELNAME': 'Other (light truck)', 'Make': None},
 {'ID': 407, 'MODELNAME': 'Patriot', 'Make': None},
 {'ID': 481, 'MODELNAME': 'Pick-up', 'Make': None},
 {'ID': 408, 'MODELNAME': 'Renegade', 'Make': None},
 {'ID': 999, 'MODELNAME': 'Unknow

In [58]:
models_df = pd.DataFrame(all_models).sort_values(by='MakeID')

In [88]:
# Merge manufacturer_df & models_df
merged_df = pd.merge(manufacturer_df, models_df, on="MakeID", how="left")
merged_df = merged_df.sort_values(by='MakeID')

In [94]:
# Explode the Models column to separate rows and renumber the index
exploded_df = merged_df.explode('Models').reset_index(drop=True)
exploded_df

Unnamed: 0,MakeID,MakeName,Models
0,2,Jeep / Kaiser-Jeep / Willys- Jeep,"{'ID': 404, 'MODELNAME': 'Cherokee (1984-on) (..."
1,2,Jeep / Kaiser-Jeep / Willys- Jeep,"{'ID': 421, 'MODELNAME': 'Cherokee (thru 1983)..."
2,2,Jeep / Kaiser-Jeep / Willys- Jeep,"{'ID': 401, 'MODELNAME': 'CJ-2/CJ-3/CJ-4', 'Ma..."
3,2,Jeep / Kaiser-Jeep / Willys- Jeep,"{'ID': 402, 'MODELNAME': 'CJ-5/CJ-6/CJ-7/CJ-8'..."
4,2,Jeep / Kaiser-Jeep / Willys- Jeep,"{'ID': 482, 'MODELNAME': 'Comanche', 'Make': N..."
...,...,...,...
468,63,KIA,"{'ID': 41, 'MODELNAME': 'Stinger', 'Make': None}"
469,63,KIA,"{'ID': 422, 'MODELNAME': 'Telluride', 'Make': ..."
470,63,KIA,"{'ID': 399, 'MODELNAME': 'Unknown (automobile)..."
471,63,KIA,"{'ID': 999, 'MODELNAME': 'Unknown (KIA)', 'Mak..."


In [96]:
# Extract ID and MODELNAME from the dictionaries in the Models column
exploded_df['ModelID'] = exploded_df['Models'].apply(lambda x: x['ID'] if isinstance(x, dict) else None)
exploded_df['ModelName'] = exploded_df['Models'].apply(lambda x: x['MODELNAME'] if isinstance(x, dict) else None)

In [98]:
# Drop the original Models column
df = exploded_df.drop(columns=['Models'])
df

Unnamed: 0,MakeID,MakeName,ModelID,ModelName
0,2,Jeep / Kaiser-Jeep / Willys- Jeep,404,Cherokee (1984-on) (For Grand Cherokee for 201...
1,2,Jeep / Kaiser-Jeep / Willys- Jeep,421,Cherokee (thru 1983)
2,2,Jeep / Kaiser-Jeep / Willys- Jeep,401,CJ-2/CJ-3/CJ-4
3,2,Jeep / Kaiser-Jeep / Willys- Jeep,402,CJ-5/CJ-6/CJ-7/CJ-8
4,2,Jeep / Kaiser-Jeep / Willys- Jeep,482,Comanche
...,...,...,...,...
468,63,KIA,41,Stinger
469,63,KIA,422,Telluride
470,63,KIA,399,Unknown (automobile)
471,63,KIA,999,Unknown (KIA)
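
The merge, explode, and extract steps above could also be collapsed by flattening each response as it arrives. The sketch below assumes the response shape seen in the output above, `{'Results': [[{...}, ...]]}`; `flatten_models` is a hypothetical helper, not part of the notebook.

```python
def flatten_models(make_id, model_data):
    """Turn one model-definitions response into flat rows.

    Assumes the response shape {'Results': [[{...}, ...]]}.
    """
    rows = []
    for group in model_data.get('Results') or []:
        for model in group:
            rows.append({
                'MakeID': make_id,
                'ModelID': model['ID'],
                'ModelName': model['MODELNAME'],
            })
    return rows


# Two entries copied from the Jeep output above
sample = {'Results': [[
    {'ID': 421, 'MODELNAME': 'Cherokee (thru 1983)', 'Make': None},
    {'ID': 482, 'MODELNAME': 'Comanche', 'Make': None},
]]}
rows = flatten_models(2, sample)
print(rows[0])  # {'MakeID': 2, 'ModelID': 421, 'ModelName': 'Cherokee (thru 1983)'}
```

A list of such rows can be passed straight to `pd.DataFrame`, which would leave only the make names to merge in.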


In [100]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 473 entries, 0 to 472
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   MakeID     473 non-null    int64 
 1   MakeName   473 non-null    object
 2   ModelID    473 non-null    int64 
 3   ModelName  473 non-null    object
dtypes: int64(2), object(2)
memory usage: 14.9+ KB


## Fetching Body Types

In [102]:
# Every car needs a body type to query the api with
import os.path
base_url = "https://crashviewer.nhtsa.dot.gov/CrashAPI/definitions/GetVariableAttributesForbodyType"

bodytypes = []

# loop through every row in dataframe
for car in tqdm.tqdm(range(len(df))):
    # skip fetching if cached results already exist on disk
    if os.path.isfile("body-types.json"):
        break
    # build the query for this car's make and model
    params = f"?variable=bodytype&make={df.iloc[car]['MakeID']}&model={df.iloc[car]['ModelID']}&format=json"
    # get "BODY_ID" from responses and append to each row
    # Get response
    response = requests.get(base_url + params)

    # check if successful
    if response.status_code != 200:
        print(f"Error: Received status code {response.status_code}")
        print(f"Response content: {response.text}")
        raise Exception(f"API request failed with status code {response.status_code}")
    # Turn response into json
    data = response.json()

    # drill down
    results = data['Results'][0]

    # pull data from each bodytype per car
    # bodytypes becomes a list of dictionaries, one per car, mapping body type name to BODY_ID
    extracted = {entry['BODY_DEF'].split('(')[0].strip(): entry['BODY_ID'] for entry in results}

    # append extracted to main list
    bodytypes.append(extracted)

    # sleep for polite scraping
    time.sleep(.5)

100%|██████████| 473/473 [08:30<00:00,  1.08s/it]


In [104]:
if not os.path.isfile("body-types.json"):
    with open("body-types.json", "w") as outfile:
        outfile.write(json.dumps(bodytypes))
else:
    with open('body-types.json', 'r') as openfile:
        bodytypes = json.load(openfile)
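
The "use the file if it exists, otherwise build and save it" pattern used above can be wrapped in one helper so the check lives in a single place. This is a minimal sketch; `load_or_build` and `fake_build` are hypothetical names, and the demo writes to a temporary directory rather than the notebook's working directory.

```python
import json
import os
import tempfile


def load_or_build(path, build):
    """Return cached JSON from `path`, or call `build()` and cache its result."""
    if os.path.isfile(path):
        with open(path, 'r') as infile:
            return json.load(infile)
    result = build()
    with open(path, 'w') as outfile:
        json.dump(result, outfile)
    return result


calls = []

def fake_build():
    """Stand-in for the slow API loop; records each time it runs."""
    calls.append(1)
    return [{'Compact utility': 14}]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, 'body-types.json')
    first = load_or_build(cache_path, fake_build)   # builds and writes the file
    second = load_or_build(cache_path, fake_build)  # reads the file back

print(len(calls))  # 1
```

In the notebook, `build` would be a function wrapping the request loop, so the eight-minute fetch only runs when no cache file is present.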

In [106]:
# Keep only the first body type listed for each car
BodyDef = []
BodyId = []
for dictionary in bodytypes:
    for key, value in dictionary.items():
        BodyDef.append(key)
        BodyId.append(int(value))
        break  # stop after the first entry
        
df['BodyID'] = BodyId
df['BodyType'] = BodyDef

df.head()

Unnamed: 0,MakeID,MakeName,ModelID,ModelName,BodyID,BodyType
0,2,Jeep / Kaiser-Jeep / Willys- Jeep,404,Cherokee (1984-on) (For Grand Cherokee for 201...,14,Compact utility
1,2,Jeep / Kaiser-Jeep / Willys- Jeep,421,Cherokee (thru 1983),15,Large utility
2,2,Jeep / Kaiser-Jeep / Willys- Jeep,401,CJ-2/CJ-3/CJ-4,14,Compact utility
3,2,Jeep / Kaiser-Jeep / Willys- Jeep,402,CJ-5/CJ-6/CJ-7/CJ-8,14,Compact utility
4,2,Jeep / Kaiser-Jeep / Willys- Jeep,482,Comanche,31,Standard pickup


## Getting Crashes Per Year Per Car

In [116]:
# Need to add crash totals per model to above dataframe 
# this will be done by simply tallying responses for each car
# Since the api has a max return limit, querying by each year (2010-onwards) will ensure all data is gathered, and allow for year grouping

# Base URL for NHTSA API
base_url = "https://crashviewer.nhtsa.dot.gov/CrashAPI/crashes/GetCrashesByVehicle"


year_totals = {}
for year in range(2010, 2011):  # TODO: change to case years 2011-2021
    total = []
    for row in tqdm.tqdm(df[['MakeID','ModelID', 'BodyID']].itertuples(index=False, name=None)):
        fatalities = 0
        for model_year in range(2010, 2021):  # TODO: change to model years 2011-2021
            for state in range(1, 56):
                params = f"?make={row[0]}&model={row[1]}&modelyear={model_year}&bodyType={row[2]}&fromCaseYear={year}&toCaseYear={year}&state={state}&format=json"
                
                # get response(s)
                response = requests.get(base_url + params)
                # check for success/fail
                if response.status_code != 200:
                    print(f"Error: Received status code {response.status_code}")
                    print(f"Response content: {response.text}")
                    raise Exception(f"API request failed with status code {response.status_code}")

                data = response.json()
                
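                # if success increment fatalities
                # NOTE: this adds one per successful query (make/model/model year/state),
                # not one per crash record in the response
                if data['Message'] == "Results returned successfully":
                    fatalities += 1

                # TEMP TEST PROOF OF CONCEPT 
                #fatalities += 1
                
                # sleep for half a second between requests
                time.sleep(.5)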
                # end state loop
            # end model_year loop
        total.append(fatalities)
        # end car loop
    year_totals[year] = total
    with open(f"{year}.json", "w") as outfile:
        outfile.write(json.dumps(year_totals))
    # end year loop


0it [01:21, ?it/s]
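
The progress line shows no row finished in 81 seconds, which is consistent with the size of the nested loop. The arithmetic below is a rough lower bound that counts only the `time.sleep(.5)` pauses and ignores network time; the figures come from the loop ranges and `df.info()` above.

```python
rows = 473                             # rows in df, per df.info()
model_years = len(range(2010, 2021))   # 11 model years
states = len(range(1, 56))             # 55 state codes
sleep_seconds = 0.5                    # pause after each request

requests_per_row = model_years * states
requests_per_case_year = rows * requests_per_row
min_seconds_per_row = requests_per_row * sleep_seconds
min_hours_per_case_year = requests_per_case_year * sleep_seconds / 3600

print(requests_per_row)                    # 605
print(min_seconds_per_row)                 # 302.5
print(requests_per_case_year)              # 286165
print(round(min_hours_per_case_year, 1))   # 39.7
```

At roughly 40 hours of sleeping per case year, dropping the per-state split or querying fewer model years may be needed to make the extraction practical.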


ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
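
A dropped connection like the one above ends the whole run, so a long extraction may benefit from retrying failed requests with an increasing delay. The sketch below is one way to do that; `get_with_retries` and `flaky_fetch` are hypothetical names, and the demo uses a fake fetch function instead of the real API. With `requests`, the exception to pass as `retry_on` would be `requests.exceptions.ConnectionError`, which is not a subclass of the built-in `ConnectionError`.

```python
import time


def get_with_retries(fetch, url, attempts=4, base_delay=1.0,
                     retry_on=(ConnectionError,), sleep=time.sleep):
    """Call fetch(url), retrying on `retry_on` errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the error propagate
            sleep(base_delay * 2 ** attempt)


calls = {'count': 0}

def flaky_fetch(url):
    """Fake fetch that fails twice, then succeeds."""
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('Connection aborted.')
    return {'url': url, 'ok': True}


delays = []  # record the requested delays instead of actually sleeping
result = get_with_retries(flaky_fetch, 'https://example.invalid', sleep=delays.append)
print(result['ok'], delays)  # True [1.0, 2.0]
```

In the notebook, `fetch` could be `requests.get`, with the existing status-code check applied to the returned response.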

In [None]:
len(year_totals[2011])
# expected shape of year_totals, one total per row of df, e.g. {2011: [23, 34, 12, 55, 23, 4534]}
year_df = pd.DataFrame.from_dict(year_totals)
year_df.head()

In [None]:
year_df.info()
df.info()

In [None]:
df.head()

## Transformation
Now that we have usable, workable data, we can begin cleaning and organizing.

In [None]:
# Transformation code

# Drop any unneeded columns/rows
    # duplicates
    # nulls
    # outliers

# Merge/Join Data into one dataframe



## Load
With the data curated, it can now be loaded into Postgres.

In [None]:
# import SQLAlchemy's engine factory
from sqlalchemy import create_engine

with open('credentials.json', 'r') as openfile:
    credentials = json.load(openfile)


TABLE_NAME = 'car_data'

DB_NAME = "safecars"
DB_USER = credentials['user']
DB_PASS = credentials['pass']
DB_HOST = "localhost"
DB_PORT = "5432"

# create engine from the constants defined above
engine = create_engine(f"postgresql+psycopg2://{DB_USER}:{DB_PASS}@{DB_HOST}:{DB_PORT}/{DB_NAME}")
# send the df over
#df.to_sql(name=TABLE_NAME,
#          con=engine,
#          index=False)
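
If the password in `credentials.json` contains characters such as `@` or `/`, placing it directly in the connection string can break URL parsing. A minimal sketch of one workaround is to percent-encode the credentials first; `postgres_url` is a hypothetical helper and the credentials shown are made up.

```python
from urllib.parse import quote_plus


def postgres_url(user, password, host, port, database):
    """Build a PostgreSQL connection URL with the credentials percent-encoded."""
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Made-up credentials for illustration only
url = postgres_url('analyst', 'p@ss/word', 'localhost', '5432', 'safecars')
print(url)
# postgresql+psycopg2://analyst:p%40ss%2Fword@localhost:5432/safecars
```

The resulting string can be passed to `create_engine` in place of the f-string above.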


In [None]:
sql = "SELECT * FROM car_data" # simple query for all rows
#sql_df = pd.read_sql(sql, engine) # make a df from postgres
#sql_df.head()