# Dowdle's Fuel Efficiency Prediction
**Author:** Brittany Dowdle   
**Date:** 4/13/2025   
**Objective:** This project will demonstrate my ability to apply regression modeling techniques to a real-world dataset. I will:
* Create and save a pipeline-based model.
* Create a RESTful service using the model.
* Demonstrate receipt of results from the service.

## Introduction
This project uses the UCI Auto MPG dataset to predict each vehicle's fuel efficiency (MPG) from features such as cylinders, horsepower, and weight. I will build a regression model, split the data into training and test sets, train the model, evaluate its performance using key metrics, and create visualizations to interpret the results.

****

## Imports
The code cell below imports the Python libraries this notebook needs. *Pro Tip: All imports should be at the top of the notebook.*

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, BaggingClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

****
## Section 1. Import and Inspect the Data

### 1.1 Load the dataset and display the first 10 rows

In [9]:
# Load the dataset
df = pd.read_csv(r"C:\Projects\ml_regression_dowdle\data\auto-mpg.csv", delimiter=",")

# Display the first 10 rows
df.head(10)

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |
| 5 | 15.0 | 8 | 429.0 | 198.0 | 4341.0 | 10.0 | 70 | 1 | ford galaxie 500 |
| 6 | 14.0 | 8 | 454.0 | 220.0 | 4354.0 | 9.0 | 70 | 1 | chevrolet impala |
| 7 | 14.0 | 8 | 440.0 | 215.0 | 4312.0 | 8.5 | 70 | 1 | plymouth fury iii |
| 8 | 14.0 | 8 | 455.0 | 225.0 | 4425.0 | 10.0 | 70 | 1 | pontiac catalina |
| 9 | 15.0 | 8 | 390.0 | 190.0 | 3850.0 | 8.5 | 70 | 1 | amc ambassador dpl |


### 1.2 Check for missing values and display summary statistics

In [11]:
# Jupyter only auto-displays the last expression in a cell, so wrap earlier output in print().
# Display missing values
print('Missing Values:')
print(df.isnull().sum(), '\n') 

# Display summary statistics
# For numerical columns
print('Summary Statistics (Numerical):')
print(df.describe(include=[np.number]), '\n')
# For categorical columns
print('Summary Statistics (Categorical):')
print(df.describe(include=[object]))

Missing Values:
mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
car_name        0
dtype: int64 

Summary Statistics (Numerical):
              mpg   cylinders  displacement  horsepower       weight  \
count  398.000000  398.000000    398.000000  392.000000   398.000000   
mean    23.514573    5.454774    193.425879  104.469388  2970.424623   
std      7.815984    1.701004    104.269838   38.491160   846.841774   
min      9.000000    3.000000     68.000000   46.000000  1613.000000   
25%     17.500000    4.000000    104.250000   75.000000  2223.750000   
50%     23.000000    4.000000    148.500000   93.500000  2803.500000   
75%     29.000000    8.000000    262.000000  126.000000  3608.000000   
max     46.600000    8.000000    455.000000  230.000000  5140.000000   

       acceleration  model_year      origin  
count    398.000000  398.000000  398.000000  
mean      15.568090   76.010050   

### Reflection 1: What do you notice about the dataset? Are there any data issues?
The only column with missing values is horsepower, with 6 missing entries (392 of 398 rows are populated). The car_name field has a high number of unique values, so it won't be useful for grouping or modeling without some preprocessing. Features like displacement and weight have very different ranges (68 to 455 versus 1,613 to 5,140), which could affect model training if they are not scaled appropriately. Most of the vehicles seem to be from region 1, based on the median and mode of the origin column.
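
One possible way to act on these observations is sketched below. It is an illustration under assumptions, not necessarily the preprocessing used later in this notebook: the `sample` frame reuses the first five rows displayed in section 1.1, with one horsepower value blanked out on purpose to mimic the missing entries, and the names `sample`, `numeric_prep`, and `make` are hypothetical. The sketch fills missing values with the column median, standardizes the numeric features, and takes the first word of car_name as a rough manufacturer label.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative sample (assumption): first five rows from section 1.1,
# with one horsepower value replaced by NaN to imitate the missing data.
sample = pd.DataFrame({
    "displacement": [307.0, 350.0, 318.0, 304.0, 302.0],
    "horsepower": [130.0, 165.0, np.nan, 150.0, 140.0],
    "weight": [3504.0, 3693.0, 3436.0, 3433.0, 3449.0],
    "car_name": [
        "chevrolet chevelle malibu",
        "buick skylark 320",
        "plymouth satellite",
        "amc rebel sst",
        "ford torino",
    ],
})

numeric_cols = ["displacement", "horsepower", "weight"]

# Step 1: fill missing values with each column's median.
# Step 2: rescale every column to zero mean and unit variance.
numeric_prep = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

scaled = numeric_prep.fit_transform(sample[numeric_cols])
scaled_df = pd.DataFrame(scaled, columns=numeric_cols, index=sample.index)

# Reduce the high-cardinality car_name to its first word,
# a rough stand-in for the manufacturer.
sample["make"] = sample["car_name"].str.split().str[0]

print(scaled_df.round(2))
print(sample["make"].tolist())
```

In a full workflow, the pipeline would typically be fit on the training split only and then applied to the test split, so that the median and scaling statistics do not leak information from the test data.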

****