# 1: Pandas

Pandas is a library that unifies the most common workflows for which data analysts and data scientists previously relied on many different libraries. Pandas has quickly become an important tool in a data professional's toolbelt and is the most popular library for working with tabular data in Python. Tabular data is any data that can be represented as rows and columns.

To represent tabular data, Pandas uses a custom data structure called a **DataFrame**. A DataFrame is a highly efficient, 2-dimensional data structure that provides a suite of methods and attributes to quickly explore, analyze, and visualize data. The DataFrame object is similar to the NumPy 2D array but adds support for many features that help you work with tabular data.

One of the biggest advantages that Pandas has over NumPy is the ability to store mixed data types in one table. A NumPy array holds values of a single data type, while each column of a DataFrame can have its own type, so Pandas handles the range of data types found in many tabular datasets with little effort. Pandas DataFrames can also handle missing values gracefully, using the marker **NaN** ("not a number") to represent them. A common complaint with NumPy is that it has limited built-in support for missing values, so people end up having to find and replace these values manually. In addition, Pandas DataFrames contain axis labels for both rows and columns, which let you refer to elements in the DataFrame more intuitively. Since many tabular datasets contain column titles, DataFrames can preserve this metadata from the file alongside the data.
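
As a rough illustration of the points above, the sketch below builds a small DataFrame by hand and compares it with a NumPy array built from mixed values. The column names and values are made up for this example and are not taken from the Auto dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a text column, an integer column, and a float
# column that contains one missing value.
cars = pd.DataFrame({
    "name": ["car a", "car b", "car c"],
    "cylinders": [8, 4, 6],
    "mpg": [18.0, np.nan, 21.5],
})

# Each column keeps its own dtype.
print(cars.dtypes)

# Missing values are stored as NaN and can be counted per column.
missing_per_column = cars.isna().sum()
print(missing_per_column)

# A NumPy array has a single dtype, so mixing text and numbers
# converts the numbers to strings.
mixed = np.array([["car a", 8], ["car b", 4]])
print(mixed.dtype)
```

The row and column labels are what let later sections refer to values by name instead of by position.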



## 2: Dataset

In this mission, you'll learn the basics of Pandas while exploring the **Auto** dataset from the textbook. This dataset contains gas mileage, horsepower, and other information for cars.




## 3: Read In A CSV File
To use the Pandas library, we need to import it into the environment using the import keyword:

    import pandas
    
We can then refer to the module using pandas and use dot notation to call its functions. To read a CSV file into a DataFrame, we use the Pandas function read_csv() and pass in the file name as a string:



    # Read the file `crime_rates.csv` into a DataFrame object named crime_rates.
    crime_rates = pandas.read_csv("crime_rates.csv")

The read_csv() function takes many parameters that customize how a file is read in. You can read more about them on the documentation page.
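
One parameter worth knowing is na_values, which the code cell in this section also uses. The sketch below is a minimal example under stated assumptions: the CSV text is made up and read from memory with io.StringIO so the example runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file; "?" marks a missing value.
csv_text = """mpg,cylinders,horsepower,name
18.0,8,130,car a
25.0,4,?,car b
16.0,8,150,car c
"""

# Without na_values, the "?" keeps the horsepower column non-numeric.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values="?", the "?" is read as NaN and the column becomes float64.
cleaned = pd.read_csv(io.StringIO(csv_text), na_values="?")

# dropna() removes every row that contains at least one missing value.
complete = cleaned.dropna()

print(type(cleaned))
print(raw.dtypes)
print(cleaned.dtypes)
print(complete.shape)
```

Note that dropna() keeps the original row labels of the surviving rows, so the labels are no longer consecutive after rows are removed.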

In [1]:
# Instructions
# Import the Pandas library.
# Use the Pandas function read_csv() to read the file "Auto.csv" into a DataFrame named auto.
# Use the type() and print() functions to display the type of auto to confirm that it's a DataFrame object.

import pandas as pd
auto = pd.read_csv('Data/Auto.csv', na_values='?').dropna()
auto.info()
auto.head()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 396
Data columns (total 9 columns):
mpg             392 non-null float64
cylinders       392 non-null int64
displacement    392 non-null float64
horsepower      392 non-null float64
weight          392 non-null int64
acceleration    392 non-null float64
year            392 non-null int64
origin          392 non-null int64
name            392 non-null object
dtypes: float64(4), int64(4), object(1)
memory usage: 30.6+ KB


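|   | mpg | cylinders | displacement | horsepower | weight | acceleration | year | origin | name |
|---|------|---|-------|-------|------|------|----|---|---------------------------|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | 1 | ford torino |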


## 4: Exploring The DataFrame
Now that we've read the dataset into a DataFrame, we can start using the DataFrame methods to explore the data. The examples in the text use a DataFrame named food_info, while the code cells work with auto. To select the first 5 rows of a DataFrame, use the DataFrame method head(). When you call the head() method, Pandas will return a new DataFrame containing just the first 5 rows:



    first_rows = food_info.head()
    
If you peek at the documentation, you'll notice that you can pass an integer (n) to the head() method to display the first n rows instead of the first 5:



    # First 3 rows.
    print(food_info.head(3))
    
    
Since this DataFrame contains many columns and rows, Pandas uses an ellipsis (...) to hide the columns and rows in the middle. Only the first few and the last few columns and rows are displayed to conserve space.

To access the full list of column names, use the columns attribute:



    column_names = food_info.columns

Lastly, you can use the shape attribute to understand the dimensions of the DataFrame. The shape attribute returns a tuple of integers representing the number of rows followed by the number of columns:



    # Returns the tuple (8618,36) and assigns to `dimensions`.
    dimensions = food_info.shape
    
    # The number of rows, 8618.
    num_rows = dimensions[0]
    
    # The number of columns, 36.
    num_cols = dimensions[1]
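
The methods and attributes from this section can be tried together on a small stand-in DataFrame. This is a minimal sketch: the name toy and its values are hypothetical and chosen only so that the example runs without a data file.

```python
import pandas as pd

# Hypothetical toy DataFrame with 6 rows and 3 columns.
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0, 16.0, 17.0, 15.0],
    "cylinders": [8, 8, 8, 8, 8, 8],
    "year": [70, 70, 70, 70, 70, 70],
})

first_five = toy.head()    # no argument: the first 5 rows
first_two = toy.head(2)    # the first 2 rows

# The columns attribute holds the column labels.
column_names = list(toy.columns)

# shape is a (rows, columns) tuple, so it can be unpacked directly.
num_rows, num_cols = toy.shape

print(first_two)
print(column_names)
print(num_rows, num_cols)
```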


In [3]:
# Instructions
# Select the first 20 rows from Auto and assign to the variable first_twenty.

first_twenty = auto.head(20)
first_twenty.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to 19
Data columns (total 9 columns):
mpg             20 non-null float64
cylinders       20 non-null int64
displacement    20 non-null float64
horsepower      20 non-null float64
weight          20 non-null int64
acceleration    20 non-null float64
year            20 non-null int64
origin          20 non-null int64
name            20 non-null object
dtypes: float64(4), int64(4), object(1)
memory usage: 1.6+ KB


## 5: Indexing
When you read a file into a DataFrame, Pandas uses the values in the first row (also known as the header) for the column labels and the row number for the row labels. Collectively, the labels are referred to as the index. DataFrames contain both a row index and a column index. Here's a diagram that displays some of the column and row labels for food_info:

The labels allow us to refer to values in the DataFrame, which we'll learn more about in the rest of this mission.
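
As a small preview, the sketch below prints both sets of labels for a hypothetical two-row DataFrame (the name toy and its values are made up for illustration). The last line uses loc[] with a row label and a column label, which is a standard Pandas form that goes slightly beyond what this mission covers.

```python
import pandas as pd

# Hypothetical toy DataFrame built by hand.
toy = pd.DataFrame({"mpg": [18.0, 15.0], "cylinders": [8, 8]})

# Row labels default to the row numbers, starting at 0.
row_labels = list(toy.index)

# Column labels come from the keys of the dictionary (or the file header).
column_labels = list(toy.columns)

# One value, addressed by its row label and its column label.
value = toy.loc[1, "mpg"]

print(row_labels)
print(column_labels)
print(value)
```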

## 6: Series
The Series object is a core data structure that Pandas uses to represent rows and columns. A Series is a labelled collection of values similar to the NumPy vector. The main advantage of Series objects is the ability to use non-integer labels, whereas NumPy arrays can only be indexed by integer position.

Pandas utilizes this feature to provide more context when returning a row or a column from a DataFrame. For example, when you select a row from a DataFrame, instead of just returning the values in that row as a list, Pandas returns a Series object that contains the column labels as well as the corresponding values:
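
A minimal sketch of this behaviour follows, using a hypothetical hand-built DataFrame named toy. The row is selected with loc[], which the next section introduces.

```python
import pandas as pd

# Hypothetical toy DataFrame with three labelled columns.
toy = pd.DataFrame({
    "mpg": [18.0, 15.0],
    "cylinders": [8, 8],
    "name": ["car a", "car b"],
})

# Selecting one row returns a Series, not a plain list.
first_row = toy.loc[0]
print(type(first_row))
print(first_row)

# The labels of the Series are the column labels of the DataFrame,
# so each value can be looked up by column name.
row_labels = list(first_row.index)
print(row_labels)
print(first_row["name"])
```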

    
## 7: Selecting A Row
While we use bracket notation to access elements in a NumPy array or a standard list, we need to use the Pandas indexer loc[] to select rows in a DataFrame. loc[] allows you to select rows by row label. Recall that when you read a file into a DataFrame, Pandas uses the row number (or position) as each row's label. These default labels start at 0, so the first row has label 0, the second row has label 1, and so on.

If you're interested in accessing a single row, pass the row label to loc[] inside the square brackets. Pandas will raise a KeyError if you don't pass in a valid row label:


    # Series object representing the row at index 0.
    food_info.loc[0]
    # Series object representing the seventh row.
    food_info.loc[6]
    # Raises a KeyError because the label 8620 is not in the index.
    food_info.loc[8620]
    
When accessing an individual row, Pandas returns a Series object containing the column names and that row's value for each column. In the following code cell, we select the row with label 99 and display it using the print() function.
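
Because loc[] works with labels rather than positions, the two can differ once rows have been removed, as happens when dropna() is applied to auto (its index reads "392 entries, 0 to 396"). The sketch below shows the effect on a hypothetical four-row DataFrame; the names toy and kept are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame; the row with label 1 has a missing value.
toy = pd.DataFrame({
    "mpg": [18.0, np.nan, 16.0, 17.0],
    "name": ["car a", "car b", "car c", "car d"],
})

# Dropping that row keeps the original labels of the remaining rows.
kept = toy.dropna()
print(list(kept.index))

# Label 2 is now the second remaining row, but loc[] still finds it by label.
row = kept.loc[2]
print(row["name"])

# Label 1 no longer exists, so loc[] raises a KeyError.
try:
    kept.loc[1]
    missing_label_raised = False
except KeyError as err:
    missing_label_raised = True
    print("KeyError:", err)
```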



In [4]:
# Instructions
# Assign the 100th row of Auto to the variable hundredth_row.
# Display hundredth_row using the print() function.

hundredth_row = auto.loc[99]
print(hundredth_row)



mpg                     18
cylinders                6
displacement           232
horsepower             100
weight                2945
acceleration            16
year                    73
origin                   1
name            amc hornet
Name: 99, dtype: object


## 8: Data Types
When you displayed individual rows, represented as Series objects, you may have noticed the text "dtype: object" after the last value. This refers to the data type, or dtype, of that Series. The object dtype can hold any Python object. Pandas uses it here for string values such as the name column, and also for a row that mixes numbers and strings. Pandas borrows from the NumPy type system and contains the following dtypes:

- object - for representing string values.
- int - for representing integer values.
- float - for representing float values.
- datetime - for representing time values.
- bool - for representing Boolean values.

When reading a file into a DataFrame, Pandas analyzes the values and infers each column's types. To access the types for each column, use the DataFrame attribute dtypes to return a Series containing each column name and its corresponding type. Read more about data types on the Pandas documentation.
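
The sketch below shows the dtypes attribute on a hypothetical hand-built DataFrame (toy and its values are made up for this example). It also checks the dtype of a single row, which is object because the row mixes numbers and text.

```python
import pandas as pd

# Hypothetical toy DataFrame with float, integer, and text columns.
toy = pd.DataFrame({
    "mpg": [18.0, 15.0],
    "cylinders": [8, 8],
    "name": ["car a", "car b"],
})

# dtypes is a Series: the labels are column names, the values are types.
column_types = toy.dtypes
print(column_types)

# Look up the type of one column by its name.
print(column_types["mpg"])

# A row that mixes numbers and text is stored with the object dtype.
row_dtype = toy.loc[0].dtype
print(row_dtype)
```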



## 9: Selecting Multiple Rows
If you're interested in accessing multiple rows of the DataFrame, you can pass in either a slice of row labels or a list of row labels and Pandas will return a DataFrame object. Note that unlike slicing lists in Python, a slice of a DataFrame using .loc[] will include both the start and the end row:



    # DataFrame containing the rows at index 3, 4, 5, and 6 returned.
    food_info.loc[3:6]
    # DataFrame containing the rows at index 2, 5, and 10 returned. Either of the following works.
    # Method 1
    two_five_ten = [2,5,10] 
    food_info.loc[two_five_ten]
    # Method 2
    food_info.loc[[2,5,10]]
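
To make the difference from list slicing concrete, here is a minimal sketch on a hypothetical six-row DataFrame named toy. It also uses tail(), a DataFrame method not covered in the text above, as one possible way to select the last rows.

```python
import pandas as pd

# Hypothetical toy DataFrame with the default row labels 0 to 5.
toy = pd.DataFrame({"mpg": [18.0, 15.0, 18.0, 16.0, 17.0, 15.0]})

# loc[] slices by label and includes the end label: rows 1, 2, and 3.
label_slice = toy.loc[1:3]

# A plain Python list slice excludes the end position: items 1 and 2 only.
list_slice = toy["mpg"].tolist()[1:3]

# A list of labels returns exactly those rows, in the order given.
picked = toy.loc[[0, 2, 5]]

# tail(n) returns the last n rows (assumption: not part of this mission's text).
last_two = toy.tail(2)

print(label_slice)
print(list_slice)
print(picked)
print(last_two)
```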


In [4]:
# Instructions
# Select the last 5 rows of Auto and assign to the variable last_rows.





## 10: Selecting Individual Columns
When accessing a column in a DataFrame, Pandas returns a Series object containing the row label and each row's value for that column. To access a single column, use bracket notation and pass in the column name as a string:



    # Series object representing the "NDB_No" column.
    ndb_col = food_info["NDB_No"]
    # You can instead access a column by passing in a string variable.
    col_name = "NDB_No"
    ndb_col = food_info[col_name]
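
The same pattern can be checked on a hypothetical toy DataFrame (the name toy and its values are made up for this sketch), which also shows that the returned Series keeps the row labels and remembers the column name.

```python
import pandas as pd

# Hypothetical toy DataFrame with two columns.
toy = pd.DataFrame({"mpg": [18.0, 15.0, 16.0], "cylinders": [8, 8, 6]})

# Bracket notation with a column name returns a Series.
mpg = toy["mpg"]
print(type(mpg))

# The Series carries the column name and the row labels of the DataFrame.
print(mpg.name)
print(list(mpg.index))

# Individual values can then be selected by row label.
second_value = mpg.loc[1]
print(second_value)
```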


In [5]:
# Instructions
# Assign the "mpg" column to the variable mpg.
# Assign the "cylinders" column to the variable cylinders.


  








## 11: Selecting Multiple Columns By Name
To select multiple columns, pass in a list of strings representing the column names and Pandas will return a DataFrame containing only the values in those columns. The following code returns a DataFrame containing the "Zinc_(mg)" and "Copper_(mg)" columns, in that order:

    # Store the column names in a list, then pass the list in.
    columns = ["Zinc_(mg)", "Copper_(mg)"]
    zinc_copper = food_info[columns]
    # Same result without the intermediate variable.
    zinc_copper = food_info[["Zinc_(mg)", "Copper_(mg)"]]

When selecting multiple columns, the order of the columns in the returned DataFrame matches the order of the column names in the list of strings that you passed in. This allows you to easily explore specific columns that may not be positioned next to each other in the DataFrame.
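
A minimal sketch of the ordering behaviour follows, using a hypothetical three-column DataFrame named toy with made-up values.

```python
import pandas as pd

# Hypothetical toy DataFrame whose columns are stored as mpg, cylinders, year.
toy = pd.DataFrame({
    "mpg": [18.0, 15.0],
    "cylinders": [8, 8],
    "year": [70, 70],
})

# The result follows the order of the list, not the order in toy.
subset = toy[["year", "mpg"]]
print(list(subset.columns))

# A list with a single name still returns a DataFrame, not a Series.
single = toy[["mpg"]]
print(type(single))
```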



In [6]:
# Instructions

# Select the 'mpg' and 'cylinders' columns and assign the resulting DataFrame to mpg_cylinders.







