# **Model selection for prediction of mpg of cars and possible explanatory variables**

Group 25

Team members: Felix Li, Naaimur Reza, ... 

In [None]:
library(tidyverse)
library(tidymodels)
library(repr)
library(infer)
library(cowplot)
library(broom)
library(GGally)
library(AER)

set.seed(1111)

# Introduction

https://archive.ics.uci.edu/ml/datasets/Auto+MPG

# Preliminary Results

In [None]:
mpg_data <- read.delim("auto-mpg-1.data",  header = FALSE, sep= "")
head(mpg_data)

**Explanation**: The raw data file `auto-mpg-1.data` has 9 columns, V1-V9, containing real-valued, integer, and categorical variables.
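As a hedged sketch of how the column types could be made explicit at read time, the example below parses three made-up rows in a whitespace-separated layout with 9 fields. The toy rows, the column names, and the assumption that missing values are written as `?` are illustrative and are not taken from the notebook.

```r
# Toy rows in a whitespace-separated layout with 9 fields (values are made up).
raw_text <- '18.0 8 307.0 130.0 3504 12.0 70 1 "car a"
25.0 4 98.00 ? 2046 19.0 71 1 "car b"
24.0 4 113.0 95.00 2372 15.0 70 3 "car c"'

# Hypothetical column names; the notebook assigns its own names in a later cell.
col_names <- c("mpg", "cylinders", "displacement", "horsepower", "weight",
               "acceleration", "model_year", "origin", "car_name")

# Assumption: "?" marks a missing value, so it is read as NA.
toy <- read.table(text = raw_text, header = FALSE, col.names = col_names,
                  na.strings = "?", stringsAsFactors = FALSE)

# Inspect the type that read.table inferred for each column.
column_types <- sapply(toy, class)
print(column_types)
```

Declaring the missing-value marker up front lets a numeric column stay numeric instead of being read as text, which would make a later type conversion unnecessary.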

In [None]:
# Give the raw columns V1-V8 descriptive names
mpg_data_2 <- mpg_data %>% mutate(mpg = V1, cylinders = V2, displacement = V3, hoursepower = V4,
                    weight = V5, acceleration = V6, model_year = V7, origin = V8)

mpg_data_tidy <- mpg_data_2 %>% select(mpg, cylinders, displacement, hoursepower, weight, acceleration, model_year, origin)

# cylinders is categorical; non-numeric hoursepower entries become NA on conversion
mpg_data_tidy$cylinders <- as.factor(mpg_data_tidy$cylinders)
mpg_data_tidy$hoursepower <- as.integer(as.character(mpg_data_tidy$hoursepower))

# Keep only the response and the candidate explanatory variables
tidy_data_2 <- mpg_data_tidy %>% select(mpg, cylinders, displacement, hoursepower, weight, acceleration)

# Drop rows where hoursepower is missing
tidy_data <- filter(tidy_data_2, !is.na(hoursepower))
head(tidy_data)
levels(tidy_data$cylinders)

**Explanation**: We rename all the columns to descriptive names and convert `cylinders` to a factor, since it is a categorical variable with only 5 levels. We also convert `hoursepower` (horsepower) to integer and delete the rows where it is missing. Finally, we drop columns V7, V8, and V9, which hold `model_year`, `origin`, and the car name. The car name takes over 100 distinct string values, so it makes no sense to treat it as a categorical variable. `model_year` is a discrete variable with over 10 distinct values, so we drop it as well. For `origin`, the data does not say what the codes 1, 2, and 3 represent, so we do not use this column.
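The choices above depend on how many distinct values each column takes. A minimal sketch of that check, followed by the same conversion and filtering steps, is shown here on a small made-up data frame. The values, the `?` marker, and the name `n_distinct_values` are hypothetical and are not from the notebook.

```r
# Made-up data frame mimicking a few of the columns; values are illustrative only.
toy <- data.frame(
  mpg         = c(18, 25, 24, 31, 16, 22),
  cylinders   = c(8, 4, 4, 4, 8, 6),
  hoursepower = c("130", "?", "95", "65", "150", "100"),
  model_year  = c(70, 71, 70, 74, 72, 76),
  stringsAsFactors = FALSE
)

# Hypothetical helper result: number of distinct values in each column.
n_distinct_values <- sapply(toy, function(x) length(unique(x)))
print(n_distinct_values)

# A column with few distinct values is a candidate for a factor.
toy$cylinders <- as.factor(toy$cylinders)

# Non-numeric entries such as "?" become NA on conversion (R warns about this),
# so the affected rows can then be filtered out.
toy$hoursepower <- suppressWarnings(as.integer(toy$hoursepower))
toy_clean <- toy[!is.na(toy$hoursepower), ]

print(levels(toy_clean$cylinders))
print(nrow(toy_clean))
```

Counting distinct values first gives a simple, reproducible basis for deciding which columns to encode as factors and which to leave out.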

In [None]:
options(repr.plot.width = 15, repr.plot.height = 12)

correlation_plots <- tidy_data %>%
  select(- cylinders) %>%
  ggpairs(progress = FALSE) +
  theme(
    text = element_text(size = 15),
    plot.title = element_text(face = "bold"),
    axis.title = element_text(face = "bold")
  )
correlation_plots