_Use this notebook to practice your exploratory data analysis and visualization skills._



> The **original dataset** is obtained from FuelEconomy.gov Web Services. The 1984-2017 fuel economy data is produced during vehicle testing at the **Environmental Protection Agency's (EPA) National Vehicle and Fuel Emissions Laboratory** in Ann Arbor, Michigan, and by vehicle manufacturers with EPA oversight. See also the data on this [Kaggle page](https://www.kaggle.com/epa/fuel-economy).
The version of the data used in this notebook is also available in [this repo](https://github.com/reisanar/datasets/blob/master/fuel.csv).

## Load packages


In [None]:
# use data science tools from the tidyverse
library(tidyverse)


## Read the data

The adapted dataset used in this notebook contains more than 38,000 observations and 81 variables. (We will focus on a small subset of the attributes for this initial exploration.) A related data dictionary can be found at https://www.fueleconomy.gov/feg/ws/

EPA's fuel economy values are good **estimates of the fuel economy a typical driver will achieve under average driving conditions** and provide a good basis to compare one vehicle to another. Fuel economy varies, sometimes significantly, based on driving conditions, driving style, and other factors.

Below we read the `.csv` file using `readr::read_csv()` (the `readr` package is part of the `tidyverse`).
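As a minimal sketch of how `read_csv()` behaves, the example below writes a tiny, made-up CSV to a temporary file and reads it back. The file contents and the object name `toy` are illustrative assumptions and are not part of the fuel dataset.

```r
library(readr)

# write a small, made-up CSV to a temporary file (illustrative data only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("make,year,city_mpg",
             "Toyota,2015,30",
             "Ford,2016,22"), tmp)

# read_csv() guesses the column types and returns a tibble
toy <- read_csv(tmp)
toy
```

`read_csv()` also accepts a URL in place of a local path, which is why the raw version of the file is needed for the exercise.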


In [None]:
# read the dataset and store it in an object named fuel
# make sure to use the raw data from https://github.com/reisanar/datasets/blob/master/fuel.csv
# fuel <- 


Check the dimensions of the dataset:
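A few base R functions report the size of a data frame. The sketch below uses a small, made-up tibble named `toy` (an assumption for illustration); the same calls should work on `fuel`.

```r
library(tibble)

# small, made-up data frame used only for illustration
toy <- tibble(
  make = c("Toyota", "Ford", "Honda"),
  year = c(2015, 2016, 2017)
)

dim(toy)   # number of rows, then number of columns
nrow(toy)  # rows only
ncol(toy)  # columns only
```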



In [None]:
# use any function(s) to find the number of rows and columns



## Data Exploration

Print a random sample of 7 rows from the data frame.
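One possible approach is `dplyr::slice_sample()`, sketched here on a made-up tibble named `toy` (an assumption for illustration). Setting the seed first makes the same rows come back on every run.

```r
library(dplyr)

# made-up data with 20 rows, used only for illustration
toy <- tibble(id = 1:20, value = (1:20)^2)

set.seed(217)  # fix the random number generator for reproducibility
sampled <- toy %>%
  slice_sample(n = 7)  # draw 7 rows without replacement

sampled
```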


In [None]:
# random sample of the data
set.seed(217)      # this sets a random seed for reproducibility


We can see the range (minimum and maximum) of a variable using the `range()` function: 



In [None]:
# find the minimum and maximum values for the year variable



We can also use the `dplyr::summarize()` function to get some summaries for certain variables:
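The sketch below contrasts `range()` with `summarize()` on a made-up tibble; the object `toy` and the column `cost` are illustrative assumptions, not columns of the fuel dataset.

```r
library(dplyr)

# made-up data used only for illustration
toy <- tibble(
  year = c(1984, 2001, 2017),
  cost = c(1500, 2200, 900)
)

# range() returns the minimum and maximum of a single vector
range(toy$year)

# summarize() can compute several named summaries in one call
toy_summary <- toy %>%
  summarize(
    min_year = min(year),
    max_year = max(year),
    min_cost = min(cost),
    max_cost = max(cost)
  )

toy_summary
```

If a column contains missing values, adding `na.rm = TRUE` to `min()` and `max()` keeps the result from being `NA`.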



In [None]:
# create the summary of minimum and maximum values for
# year, annual_fuel_cost_ft1, annual_consumption_in_barrels_ft1


For variables that are encoded as categorical, we can also get counts. First, below is a trick to find which variables are encoded as _character_. This will help you determine which ones are actually categorical variables: for example, an email address is stored as a character, but we may not treat it as a category since it may be unique, while colors and brands could be treated as categorical.



In [None]:
# select variables that are of type character
fuel %>%
   select_if(is.character)


Let us check the number of observations for each class of vehicle (`class`):
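Two equivalent ways of counting rows per category are sketched below on a made-up tibble named `toy` (the class labels are illustrative assumptions).

```r
library(dplyr)

# made-up vehicle classes used only for illustration
toy <- tibble(
  class = c("Compact Cars", "Compact Cars", "Minivan",
            "Pickup", "Pickup", "Pickup")
)

# count() groups and tallies in a single step
by_count <- toy %>%
  count(class, sort = TRUE)

# the same result using group_by() followed by summarize() and n()
by_group <- toy %>%
  group_by(class) %>%
  summarize(n = n()) %>%
  arrange(desc(n))

by_count
```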



In [None]:
# number of observations for each class

# you could also use group_by() followed by summarize() where the 
# summary counts the number of rows using the n() function


In [None]:
# alternative: using the group_by + summarize combination



When working with larger datasets like this one, chances are that several observations have missing values (`NA`) in some of the attributes. It is good practice to get a sense of the proportion of missing values for different variables. This may help you make design choices when exploring predictive models (e.g., whether and how to impute data, or which variables have enough variation to be good choices for further analysis).
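For a single vector, the proportion of missing values can be sketched with base R alone. The vector `x` below is made up for illustration.

```r
# made-up vector with 2 missing values out of 5
x <- c(10, NA, 25, NA, 40)

# is.na() flags each missing element as TRUE
is.na(x)

# TRUE counts as 1 and FALSE as 0, so the sum is the number of NAs
prop_missing <- sum(is.na(x)) / length(x)

# mean() of a logical vector gives the same proportion directly
prop_missing_alt <- mean(is.na(x))

prop_missing  # 0.4
```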

Below is a trick to easily get this information using tools from `dplyr`:


In [None]:
# find the proportion of missing values for each attribute
fuel %>%
  summarize_all(~sum(is.na(.))/n())


> The code above tells you that we have no missing values for the variables `year`, `make`, `model` and others; and it also indicates that the attribute `range_ft2` is an empty column (all observations have a missing value there).<br><br>
**Quick explanation**: `is.na()` returns `TRUE` if the element is missing, and `FALSE` otherwise. When combined with the function `sum()`, any value of `TRUE` will be understood as a `1`, and instances of `FALSE` as `0` (this is known as _**coercion**_). Therefore, adding all the `1`s will tell you how many observations have a missing value, and dividing by the number of observations (i.e., using `n()`) will give the proportion. Documentation for the `summarize_all()` function (and other similar functions) can be found [here](https://dplyr.tidyverse.org/reference/summarise_all.html). 
<br><br>Again, this shows the power of `dplyr`: just a few lines of code can give you very good information.


> **Practice**: what other types of summaries can you create? Try grouping by _multiple_ variables to analyze that set of observations (e.g., grouping by `make` and `transmission` to analyze the fuel efficiency of cars in the different groups).


In [None]:
# create at least 1 more summary



## Some data visualizations

There are many observations and attributes (variables) available in this dataset. We will generate some data visualizations in this notebook that can help us confirm some of the things we would expect from the evolution and progress made in car manufacturing in recent years.

The purpose of EPA's fuel economy estimates is to provide a reliable basis for comparing vehicles. Most vehicles in the database (other than plug-in hybrids) have three fuel economy estimates: 

- a "city" estimate that represents urban driving, in which a vehicle is started in the morning (after being parked all night) and driven in stop-and-go traffic; 

- a "highway" estimate that represents a mixture of rural and interstate highway driving in a warmed-up vehicle, typical of longer trips in free-flowing traffic; 

- and a "combined" estimate that represents a combination of city driving (55%) and highway driving (45%). Estimates for all vehicles are based on laboratory testing under standardized conditions to allow for fair comparisons.

The dataset also provides annual fuel cost estimates, rounded to the nearest \$50, for each vehicle. The estimates are based on the assumptions that you travel 15,000 miles per year (55% under city driving conditions and 45% under highway conditions) and that fuel costs \$2.33/gallon for regular unleaded gasoline, \$2.58/gallon for mid-grade unleaded gasoline, and \$2.82/gallon for premium.


Create a **bar plot** showing the number of observations for each type of `engine_cylinders`:
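A minimal sketch of a count-based bar plot follows, using a made-up data frame named `toy` (an assumption for illustration). Wrapping the variable in `factor()` is one design choice that treats each cylinder count as its own category.

```r
library(ggplot2)

# made-up cylinder counts used only for illustration
toy <- data.frame(engine_cylinders = c(4, 4, 4, 6, 6, 8, 16))

# geom_bar() counts the rows in each category of x
p <- ggplot(toy, aes(x = factor(engine_cylinders))) +
  geom_bar() +
  labs(x = "Engine cylinders", y = "Number of observations")

p
```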


In [None]:
# bar plot showing counts for engine_cylinders



If you look to the far right in the above plot, you will notice some vehicles with 16 cylinders. Use the `dplyr::filter()` function to find those observations:
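As a sketch of how `filter()` keeps only the rows that satisfy a condition, the example below uses a made-up tibble named `toy`; its rows are illustrative and not taken from the dataset.

```r
library(dplyr)

# made-up rows used only for illustration
toy <- tibble(
  make = c("Bugatti", "Toyota", "Ford"),
  model = c("Veyron", "Corolla", "F150"),
  engine_cylinders = c(16, 4, 8)
)

# keep only the rows where the condition is TRUE
sixteen <- toy %>%
  filter(engine_cylinders == 16)

sixteen
```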



Have you heard of the _Bugatti Veyron_? It is a million-dollar car! Learn more about it [here](https://en.wikipedia.org/wiki/Bugatti_Veyron). 

<img src="https://raw.githubusercontent.com/reisanar/figs/master/bugatti.jpeg" alt="Bugatti Veyron" width="40%"/>

In the set of slides for `ggplot2` we studied the relationship between the engine size (`engine_displacement`) and the fuel efficiency. Complete something similar here:
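A scatterplot of this kind can be sketched as follows. The data frame `toy` and its values are made up for illustration; only the column names mirror the dataset.

```r
library(ggplot2)

# made-up values used only for illustration
toy <- data.frame(
  engine_displacement = c(1.6, 2.0, 3.5, 5.0, 6.2),
  city_mpg_ft1 = c(30, 26, 19, 15, 13)
)

# engine size on the x axis, city fuel efficiency on the y axis
p <- ggplot(toy, aes(x = engine_displacement, y = city_mpg_ft1)) +
  geom_point(alpha = 0.5) +  # transparency helps when points overlap
  labs(x = "Engine displacement (liters)", y = "City MPG")

p
```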


In [None]:
# create a scatterplot of city_mpg_ft1 versus engine_displacement



## Renewable energy

Let us aggregate data by fuel type and create a new variable called `renewable` to identify whether the energy source is renewable or not. We can also generate an estimate of efficiency and [tailpipe carbon dioxide](https://www.epa.gov/greenvehicles/greenhouse-gas-emissions-typical-passenger-vehicle) (CO2) (`tailpipe_co2_in_grams_mile_ft1`) averages by fuel type (`fuel_type`).

A good reference to learn more about this can be found in the (US Department of Energy) _Energy Efficiency and Renewable Energy_ page: https://afdc.energy.gov/fuels/

Find the number of observations available in each `fuel_type`.


In [None]:
# count the number of observations per fuel_type



Below we use the `dplyr::mutate()` function to create a new column, based on the value of `fuel_type` containing the word "Electricity" or "E85". This is done with the help of the `str_detect()` function from the `stringr` package (part of the `tidyverse`), and the `case_when()` function.



In [None]:
fuel %>%
  mutate(renewable = case_when(
           str_detect(fuel_type, pattern = "Elect") ~ "Yes", 
           str_detect(fuel_type, pattern = "E85") ~ "Yes", 
           TRUE ~ "No"
          )
        ) %>%
  group_by(renewable) %>%
  count()


Use the above pipeline to create a summary showing the average `tailpipe_co2_in_grams_mile_ft1` and `combined_mpg_ft1` values per renewable type.
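One way such a summary might look is sketched below on a made-up tibble named `toy`. The fuel type labels and numeric values are illustrative assumptions, and `avg_co2` and `avg_mpg` are hypothetical column names.

```r
library(dplyr)
library(stringr)

# made-up rows used only for illustration
toy <- tibble(
  fuel_type = c("Regular", "Premium", "Gasoline or E85", "Electricity"),
  tailpipe_co2_in_grams_mile_ft1 = c(400, 500, 450, 0),
  combined_mpg_ft1 = c(22, 18, 20, 100)
)

toy_summary <- toy %>%
  mutate(renewable = case_when(
    str_detect(fuel_type, pattern = "Elect") ~ "Yes",
    str_detect(fuel_type, pattern = "E85") ~ "Yes",
    TRUE ~ "No"
  )) %>%
  group_by(renewable) %>%
  summarize(
    # na.rm = TRUE skips missing values instead of returning NA
    avg_co2 = mean(tailpipe_co2_in_grams_mile_ft1, na.rm = TRUE),
    avg_mpg = mean(combined_mpg_ft1, na.rm = TRUE)
  )

toy_summary
```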



In [None]:
# average tailpipe_co2_in_grams_mile_ft1



Create a simple plot comparing the available observations and the variable `renewable`:



In [None]:
# create auxiliary dataframe with summaries
# if needed for your visualization


In [None]:
# create scatterplot. Use at least one more variable
# mapped to the aesthetics color or size in your plot


Finally, analyze the relationship between gas emissions and fuel efficiency (ignoring electric cars):
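A possible starting point is sketched below on a made-up tibble named `toy` (labels and values are illustrative assumptions). It drops the electric rows by negating `str_detect()` and maps `fuel_type` to color.

```r
library(dplyr)
library(stringr)
library(ggplot2)

# made-up rows used only for illustration
toy <- tibble(
  fuel_type = c("Regular", "Premium", "Electricity", "Diesel"),
  combined_mpg_ft1 = c(25, 20, 105, 30),
  tailpipe_co2_in_grams_mile_ft1 = c(355, 444, 0, 340)
)

# keep only the rows whose fuel type does not mention electricity
non_electric <- toy %>%
  filter(!str_detect(fuel_type, pattern = "Elect"))

p <- ggplot(non_electric,
            aes(x = combined_mpg_ft1,
                y = tailpipe_co2_in_grams_mile_ft1,
                color = fuel_type)) +
  geom_point() +
  labs(x = "Combined MPG", y = "Tailpipe CO2 (grams per mile)")

p
```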



In [None]:
# create a sample scatterplot. Use at least one more variable
# mapped to the aesthetics color or size in your plot


> **Practice**: explore the relationship between other variables. Can you characterize the trends you observe? How about the number of unique models over the years? 



In [None]:
# sample solution: Unique models considered by EPA over the years

