# Activity: Dataframes with pandas

## Introduction

Your work as a data professional for the U.S. Environmental Protection Agency (EPA) requires you to analyze air quality index data collected from the United States and Mexico.

The air quality index (AQI) is a number that runs from 0 to 500. The higher the AQI value, the greater the level of air pollution and the greater the health concern. For example, an AQI value of 50 or below represents good air quality, while an AQI value over 300 represents hazardous air quality. Refer to this guide from [AirNow.gov](https://www.airnow.gov/aqi/aqi-basics/) for more information.

In this lab, you will practice working in pandas. You will load a dataframe, examine its metadata and summary statistics, and explore it using `iloc` indexing and sorting. You will also practice Boolean masking, grouping, and concatenating data.

## Task 1: Read data from a CSV file into a pandas dataframe

You are given two files of data. Begin with the first file, which contains the three states with the most observations (rows): California, Texas, and Pennsylvania.

### 1a: Import statements

Import numpy and pandas. Use their standard aliases.

In [None]:
### YOUR CODE HERE ###


### 1b: Read in the first file

1. The dataset is loaded for you by the `read_csv()` call already present in the cell below. You do not need to download the .csv file or write any additional code to access it. Continue with the activity by completing the following instruction.

2. Use the `head()` method on the `top3` dataframe to inspect the first five rows.

In [None]:
# 1. ### YOUR CODE HERE ###
top3 = pd.read_csv('epa_ca_tx_pa.csv')

# 2. ### YOUR CODE HERE ###
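
The pattern of reading a CSV file and previewing it can be sketched on a small in-memory example. This is only an illustration: the column names `county_name` and `aqi` and all values are hypothetical stand-ins, and the lab file's actual columns may differ.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file; the lab file's columns may differ.
csv_text = """state_name,county_name,aqi
California,Los Angeles,93
Texas,Harris,41
Pennsylvania,Allegheny,55
California,Fresno,120
Texas,Dallas,38
California,Kern,77
"""

# read_csv() accepts a file path or any file-like object.
demo = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default; pass n for a different count.
first_rows = demo.head()
print(first_rows)
```

Here `io.StringIO` stands in for a file on disk so that the example runs without any external data.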


## Task 2: Summary information

Now that you have a dataframe with the AQI data for California, Texas, and Pennsylvania, get some high-level summary information about it.

### 2a: Metadata

Use a DataFrame method to examine the number of rows and columns, the column names, the data type contained in each column, the number of non-null values in each column, and the amount of memory the dataframe uses.

In [None]:
### YOUR CODE HERE ###
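
As a rough illustration of inspecting metadata, the sketch below uses a tiny made-up dataframe (the `aqi` column name and the values are assumptions). It shows `info()`, which prints a combined report, alongside attributes that expose the same pieces individually.

```python
import pandas as pd

# Hypothetical data; one missing value so the non-null counts differ.
demo = pd.DataFrame({
    "state_name": ["California", "Texas", "Pennsylvania", "California"],
    "aqi": [93, 41, None, 120],
})

# info() prints row/column counts, dtypes, non-null counts, and memory usage.
# It prints directly and returns None, so there is nothing to assign.
demo.info()

# The same facts are also available piece by piece.
n_rows, n_cols = demo.shape        # (rows, columns)
column_types = demo.dtypes         # dtype of each column
non_null_counts = demo.count()     # non-null values per column
```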


### 2b: Summary statistics

Examine the summary statistics of the dataframe's numeric columns. The output should be a table that includes row count, mean, standard deviation, min, max, and quartile values.

In [None]:
### YOUR CODE HERE ###
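
A minimal sketch of generating summary statistics, again on hypothetical data with an assumed `aqi` column. By default, `describe()` summarizes only the numeric columns.

```python
import pandas as pd

# Hypothetical data with one text column and one numeric column.
demo = pd.DataFrame({
    "state_name": ["California", "Texas", "Pennsylvania", "California"],
    "aqi": [10, 20, 30, 40],
})

# describe() returns a dataframe whose rows are the statistics
# (count, mean, std, min, quartiles, max) and whose columns are
# the numeric columns of the original dataframe.
stats = demo.describe()
print(stats)

# Individual statistics can be read back with loc[row_label, column_label].
mean_aqi = stats.loc["mean", "aqi"]
```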


## Task 3: Explore your data

Practice exploring your data by completing the following exercises.

### 3a: Rows per state

Select the `state_name` column and use the `value_counts()` method on it to check how many rows there are for each state in the dataframe.

In [None]:
### YOUR CODE HERE ###
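
To illustrate how `value_counts()` behaves, here is a short sketch on a hypothetical `state_name` column. The result is a Series sorted from the most frequent value to the least frequent.

```python
import pandas as pd

# Hypothetical data: three California rows, two Texas rows, one Pennsylvania row.
demo = pd.DataFrame({
    "state_name": ["California", "Texas", "California",
                   "Pennsylvania", "California", "Texas"],
})

# Selecting one column gives a Series; value_counts() tallies its unique values.
counts = demo["state_name"].value_counts()
print(counts)
```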


### 3b: Sort by AQI

1.  Create a new dataframe called `top3_sorted` by using the `sort_values()` method on the `top3` dataframe. Refer to the [sort_values pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html#) for more information about how to use this method.
    *  The new dataframe should contain the data sorted by AQI, beginning with the rows with the highest AQI values.
2.  Print the top 10 rows of `top3_sorted`.

In [None]:
# 1. ### YOUR CODE HERE ###


# 2. ### YOUR CODE HERE ###
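
The sorting step can be sketched as follows, assuming a numeric column named `aqi` (a hypothetical name; check the actual column names in your dataframe). Note that `sort_values()` returns a new dataframe and keeps the original index labels.

```python
import pandas as pd

# Hypothetical data with the default index labels 0..3.
demo = pd.DataFrame({
    "state_name": ["Texas", "California", "Pennsylvania", "California"],
    "aqi": [41, 120, 55, 93],
})

# ascending=False puts the largest values first.
demo_sorted = demo.sort_values(by="aqi", ascending=False)

# head(n) returns the first n rows of the sorted result.
print(demo_sorted.head(2))
```

Because the index labels travel with their rows, the sorted frame's index is no longer in order, which matters when choosing between label-based and position-based selection.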


### 3c: Use `iloc` to select rows

Use `iloc` to select the two rows at positions 10 and 11 of the `top3_sorted` dataframe. `iloc` selects by zero-based integer position, not by index label.

In [None]:
### YOUR CODE HERE ###
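
A small sketch of position-based selection, using a hypothetical frame whose index labels are deliberately out of order (as they would be after sorting). The slice end is exclusive, so `1:3` selects positions 1 and 2.

```python
import pandas as pd

# Hypothetical sorted data; the index labels no longer match the row positions.
demo_sorted = pd.DataFrame(
    {"aqi": [120, 93, 55, 41]},
    index=[1, 3, 2, 0],
)

# iloc uses integer positions: this selects the second and third rows.
by_slice = demo_sorted.iloc[1:3]

# An explicit list of positions selects the same rows.
by_list = demo_sorted.iloc[[1, 2]]

print(by_slice)
```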


## Task 4: Examine California data

You notice that the rows with the highest AQI represent data from California, so you want to examine the data for just the state of California.

### 4a: Basic Boolean masking

1. Create a Boolean mask that selects only the observations of the `top3_sorted` dataframe that are from California.
2. Apply the Boolean mask to the `top3_sorted` dataframe and assign the result to a variable called `ca_df`.
3. Print the first five rows of `ca_df`.

In [None]:
# 1. ### YOUR CODE HERE ###


# 2. ### YOUR CODE HERE ###


# 3. ### YOUR CODE HERE ###
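
The three masking steps can be sketched on hypothetical data as shown below. The comparison produces a Series of `True`/`False` values, and indexing the dataframe with that Series keeps only the `True` rows.

```python
import pandas as pd

# Hypothetical data; the aqi column name is an assumption.
demo = pd.DataFrame({
    "state_name": ["California", "Texas", "California", "Pennsylvania"],
    "aqi": [120, 41, 93, 55],
})

# Step 1: build the mask, one Boolean per row.
mask = demo["state_name"] == "California"

# Step 2: apply the mask to keep only the matching rows.
ca_demo = demo[mask]

# Step 3: preview the result.
print(ca_demo.head())
```

Since `True` counts as 1, `mask.sum()` gives the number of matching rows, which is a quick way to check the filtered frame's length.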


### 4b: Validate CA data

Inspect the shape of your new `ca_df` dataframe. Does its row count match the number of California rows determined in Task 3a?

In [None]:
### YOUR CODE HERE ###


### 4c: Rows per CA county

Display the number of rows for each county in the California data.

In [None]:
### YOUR CODE HERE ###


### 4d: Calculate mean AQI for Los Angeles County

You notice that Los Angeles County has more than twice as many rows as the next-most-represented county in California, and you want to learn more about it.

*  Calculate the mean AQI for Los Angeles County.

In [None]:
### YOUR CODE HERE ###
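
One way to compute a mean for a single group of rows is to filter with a Boolean mask and then call `mean()` on the column of interest. The sketch below assumes hypothetical column names `county_name` and `aqi`, which may not match the lab data.

```python
import pandas as pd

# Hypothetical California-only data.
ca_demo = pd.DataFrame({
    "county_name": ["Los Angeles", "Fresno", "Los Angeles", "Kern"],
    "aqi": [90, 120, 110, 77],
})

# Mask for the county of interest.
la_mask = ca_demo["county_name"] == "Los Angeles"

# loc[rows, column] selects the aqi values of the masked rows in one step.
la_mean = ca_demo.loc[la_mask, "aqi"].mean()
print(la_mean)
```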


## Task 5: Groupby

Group the original dataframe (`top3`) by state and calculate the mean AQI for each state.

In [None]:
### YOUR CODE HERE ###
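
As an illustration of the split-apply-combine pattern, the sketch below groups hypothetical rows by `state_name` and averages an assumed `aqi` column. Selecting the column before aggregating keeps the result limited to that column.

```python
import pandas as pd

# Hypothetical data; the aqi column name is an assumption.
demo = pd.DataFrame({
    "state_name": ["California", "Texas", "California", "Texas", "Pennsylvania"],
    "aqi": [90, 40, 110, 60, 55],
})

# groupby() splits rows by state; mean() is then computed within each group.
# The result is a Series indexed by state name, sorted by the group key.
state_means = demo.groupby("state_name")["aqi"].mean()
print(state_means)
```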


## Task 6: Add more data

Now that you have performed a short examination of the file with AQI data for California, Texas, and Pennsylvania, you want to add more data from your second file.

### 6a: Read in the second file

1. Read in the data for the remaining territories. The file is called `'epa_others.csv'` and is already in your working directory. Assign the resulting dataframe to a variable named `other_states`.

2. Use the `head()` method on the `other_states` dataframe to inspect the first five rows.

In [None]:
# 1. ### YOUR CODE HERE ###


# 2. ### YOUR CODE HERE ###


### 6b: Concatenate the data

The data from `other_states` is in the same format as the data from `top3`. It has the same columns in the same order.

1. Add the data from `other_states` as new rows beneath the data from `top3`. Assign the result to a new dataframe called `combined_df`.

2. Verify that the length of `combined_df` is equal to the sum of the lengths of `top3` and `other_states`.

In [None]:
# 1. ### YOUR CODE HERE ###


# 2. ### YOUR CODE HERE ###
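
Stacking two dataframes that share the same columns can be sketched with `pd.concat()`. The frames and the `aqi` column below are hypothetical. By default each frame keeps its own index labels, so labels can repeat in the result; passing `ignore_index=True` renumbers the rows instead.

```python
import pandas as pd

# Two hypothetical frames with identical columns in the same order.
first = pd.DataFrame({"state_name": ["California", "Texas"], "aqi": [93, 41]})
second = pd.DataFrame({"state_name": ["Ohio", "Utah", "Iowa"], "aqi": [30, 52, 28]})

# axis=0 (the default) appends the rows of `second` beneath those of `first`.
combined = pd.concat([first, second], axis=0)

# Optional variant: discard the original labels and renumber from 0.
combined_renumbered = pd.concat([first, second], ignore_index=True)

# Verify that no rows were lost or duplicated.
lengths_match = len(combined) == len(first) + len(second)
print(lengths_match)
```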


## Task 7: Complex Boolean masking

According to the EPA, AQI values of 51-100 are classified as "Moderate." You've been tasked with examining some data for the state of Washington.

*  Use Boolean masking to return the rows that represent data from the state of Washington with AQI values of 51 or higher (that is, "Moderate" or worse).

In [None]:
### YOUR CODE HERE ###
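
Combining two conditions in one mask can be sketched as follows, on hypothetical data with an assumed `aqi` column. Each condition needs its own parentheses because `&` binds more tightly than comparison operators such as `==` and `>=`.

```python
import pandas as pd

# Hypothetical data; the aqi column name is an assumption.
demo = pd.DataFrame({
    "state_name": ["Washington", "Washington", "Washington", "Oregon"],
    "aqi": [30, 51, 88, 70],
})

# & is the element-wise "and"; use | for "or" and ~ for "not".
mask = (demo["state_name"] == "Washington") & (demo["aqi"] >= 51)

result = demo[mask]
print(result)
```

Python's `and` keyword does not work here, because it tries to reduce each whole Series to a single truth value instead of comparing row by row.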


## Conclusion

**What are your key takeaways from this lab?**

[Double-click here to record your response.]