# Introduction to the Pandas module

## Skills

1. **Load data into Python using the pandas module.**
2. **Select columns using `[]` and rows using `DataFrame.loc[]`.**
3. **Summarize columns with basic descriptive statistics.**
4. Summarize by category using `DataFrame.groupby()`.
5. Create new columns.
6. Use built-in Pandas string manipulation functions.
7. Visualize data using Seaborn and Matplotlib.

## Vocabulary List

**DataFrame.** A pandas object representing structured data.

**module.** A module extends what you can do in Python, usually by adding functions that can be called. Modules can also contain other things, such as new types of objects and data.

**Series.** A pandas object representing a single column of a DataFrame. Some functions (like `.value_counts()`) return Series instead of DataFrames.

**structured data.** Data which is organized into rows and columns, like a spreadsheet. Every column of structured data must have a single data type.

## Import Modules
The code below imports the *pandas* module, allowing us to make use of more powerful data-manipulating functions and types of objects. We are binding it to a nickname of *pd*, which means that our function calls will be written as `pd.function_name()`.

In [None]:
import pandas as pd

## Pandas DataFrame Basics

The pandas module contains many functions for loading and manipulating **structured** data. 

To get started, we'll use the pandas function `read_csv()` to load in some data. Because `read_csv()` is part of the pandas module (which we imported with the nickname *pd*), we call the function with `pd.read_csv()`.

The following code loads in the Netflix description dataset and stores it in a variable called `netflix`. For this to work, the "netflix.csv" file *must be in the same folder as your notebook file.* Otherwise, you will get a very descriptively named *FileNotFoundError*.

In [None]:
netflix = pd.read_csv("netflix.csv")
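
As a small aside, the sketch below shows `pd.read_csv()` in isolation. It does not use the real "netflix.csv" file: the CSV text, the `sample` variable, and the missing file name are all made up for illustration. It also shows what happens when the file cannot be found.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a real file (this is NOT the Netflix data).
csv_text = """title,type,release_year
Example Movie,MOVIE,1999
Example Show,SHOW,2015
"""

# read_csv() accepts a file name, or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)  # (number of rows, number of columns)

# Asking for a file that does not exist raises FileNotFoundError.
try:
    pd.read_csv("this_file_does_not_exist.csv")  # hypothetical file name
    missing_file_error = False
except FileNotFoundError:
    missing_file_error = True
print(missing_file_error)
```

If you see this error with your own data, the usual fix is to move the CSV file into the notebook's folder or to correct the file name.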

The variable `netflix` now holds a **DataFrame**, a special data type that contains everything the "netflix.csv" file has. There are many functions associated with DataFrames, which we'll go over below.

Now that we've imported the data, what can we do with it? First, the DataFrame's `head()` function allows us to peek at a few rows of the dataset. This will be extremely useful for making sure that we haven't made any mistakes. `head()` takes a single optional argument, which is how many rows you want to see. If you leave it out, you get the first five rows.

In [None]:
netflix.head()
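
To see the effect of the argument, here is a minimal sketch using a made-up ten-row table (the `toy` DataFrame and its column `n` are placeholders, not part of the Netflix data).

```python
import pandas as pd

# Hypothetical ten-row table, used only to illustrate head().
toy = pd.DataFrame({"n": range(10)})

first_five = toy.head()    # no argument: the first 5 rows
first_three = toy.head(3)  # explicit argument: the first 3 rows

print(len(first_five), len(first_three))  # 5 3
```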

Each column of a DataFrame must have a single data type. We can check what type each column is with the DataFrame's `dtypes` attribute. Note that `dtypes` is not a function, and so is not followed by parentheses.

The four data types you're likely to encounter with Pandas are:
* **int64** for integers.
* **float64** for floating-point (decimal) values.
* **datetime64** for dates and times (none in the Netflix data).
* **object** for anything else. This is what you'll see for strings.

In [None]:
netflix.dtypes
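
Since the Netflix data has no date columns, the sketch below builds a tiny made-up table that has one column of each kind. All of the values, and the `added` column in particular, are invented for illustration.

```python
import pandas as pd

# Made-up rows; the "added" column is hypothetical and only here to show dates.
toy = pd.DataFrame({
    "title": ["Example A", "Example B"],                    # strings
    "release_year": [1999, 2015],                           # integers
    "imdb_score": [7.1, 6.4],                               # decimals
    "added": pd.to_datetime(["2020-01-01", "2021-06-15"]),  # dates
})

# dtypes is an attribute, so there are no parentheses.
print(toy.dtypes)
```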

If instead of seeing a few rows, we want to see a particular column, we can index our DataFrame with `[]`, much like a list, but pass in the name of a column instead of a position.

In [None]:
netflix["release_year"]

Alternatively, you can select multiple columns by passing a list of column names. This looks weird, so let's take it one step at a time. Instead of just looking at the `release_year`, let's also see the show or movie's `title`.

This means that we want both column names in a list, i.e. `["title", "release_year"]`.

We then pass that list as an index to our `netflix` DataFrame (spaces added for readability):

In [None]:
netflix[ ["title","release_year"] ]
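
The difference between the two kinds of selection is easier to see side by side. This is a minimal sketch on a made-up table (`toy` and its values are placeholders): a single column name gives back a Series, while a list of names gives back a DataFrame.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "title": ["Example A", "Example B", "Example C"],
    "type": ["MOVIE", "SHOW", "MOVIE"],
    "release_year": [1999, 2015, 2021],
})

one_column = toy["release_year"]               # a single name -> Series
two_columns = toy[["title", "release_year"]]   # a list of names -> DataFrame

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
```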

### Selecting Rows

Instead of selecting specific columns, we can also select specific rows using the logic we learned earlier. Let's say we want to find all of the titles released after the year 2000. We can start by using the `release_year` column:

> `netflix["release_year"] > 2000`
  
This gives us a Series of values (think of it as a list with one entry per row) which is *True* whenever the release year is after 2000 and *False* whenever it is not.

This Series can then be passed to our original DataFrame, using `DataFrame.loc[]`, which takes in such a set of *True/False* values as an index and keeps only the rows marked *True*.

In [None]:
netflix.loc[ netflix["release_year"] > 2000 ]
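
It can help to look at the *True/False* values on their own before using them. The sketch below does this in two steps on a made-up table (`toy`, `mask`, and `recent` are names chosen for this example).

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "title": ["Old Film", "New Film", "New Show"],
    "release_year": [1985, 2005, 2019],
})

# Step 1: the comparison produces one True/False value per row.
mask = toy["release_year"] > 2000
print(mask.tolist())  # [False, True, True]

# Step 2: .loc[] keeps only the rows where the value is True.
recent = toy.loc[mask]
print(recent["title"].tolist())  # ['New Film', 'New Show']
```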

We could also just select the movies, using `netflix["type"] == "MOVIE"`:

In [None]:
netflix.loc[ netflix["type"] == "MOVIE" ]

Or both, selecting all of the movies released after the year 2000.

Note that when combining logical statements with the *and* `&` or *or* `|` operators, each piece should be surrounded by parentheses.

In [None]:
netflix.loc[ (netflix["type"] == "MOVIE") & (netflix["release_year"] > 2000) ]
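
Here is a short sketch comparing *and* with *or* on a made-up table (all names and values are placeholders). The parentheses matter because Python evaluates `&` and `|` before comparisons like `==` and `>`, so without them the pieces get grouped in the wrong way.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "type": ["MOVIE", "MOVIE", "SHOW", "SHOW"],
    "release_year": [1995, 2010, 1998, 2020],
})

is_movie = toy["type"] == "MOVIE"
is_recent = toy["release_year"] > 2000

# and (&): both conditions must be True.
both = toy.loc[(toy["type"] == "MOVIE") & (toy["release_year"] > 2000)]
print(both["title"].tolist())  # ['B']

# or (|): at least one condition must be True.
either = toy.loc[(toy["type"] == "MOVIE") | (toy["release_year"] > 2000)]
print(either["title"].tolist())  # ['A', 'B', 'D']

# Storing each piece in a variable first is another way to keep it readable.
same_as_both = toy.loc[is_movie & is_recent]
```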

## Summary Functions

We can get summary statistics of our DataFrame using the `DataFrame.describe()` function. This will work on either the entire DataFrame or on individual columns.

In [None]:
netflix["release_year"].describe()

If you want just one of these statistics, you can use `DataFrame.count()`, `.mean()`, `.median()`, `.min()`, `.max()` and so on. There are also many more that aren't included in the output of `.describe()`, such as `DataFrame.sum()`. Here's one example:

In [None]:
netflix["release_year"].mean()
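
To connect `.describe()` with the individual functions, the sketch below uses four made-up years (not the real Netflix column) and checks that the numbers agree.

```python
import pandas as pd

# Four made-up release years, chosen so the arithmetic is easy to check.
years = pd.Series([1990, 2000, 2010, 2020], name="release_year")

summary = years.describe()  # count, mean, std, min, quartiles, max
print(summary["mean"])      # 2005.0
print(years.mean())         # 2005.0, the same value
print(years.median())       # 2005.0

# sum() is available too, even though describe() does not report it.
print(years.sum())          # 8020
```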

Unsurprisingly, statistics like the mean only make sense for numeric (*int64* and *float64*) columns. What is the mean title on Netflix? That question doesn't even make sense, and asking it will raise a `TypeError` because the column is the wrong data *type*.

In [None]:
netflix["title"].mean()

We will be doing quite a bit with *string* data, but for now, the simplest thing we can do is count how often each value appears using `.value_counts()`. Here we see that the most common rating on Netflix titles is "TV-MA" with 841 titles, followed by "R" with 575. Only 14 titles have an "NC-17" rating.

In [None]:
netflix["age_certification"].value_counts()

You can count the values of any column, regardless of data type. However, it only makes sense when a column has repeated values, for example the `age_certification` or `type` columns. Counting the `title` column will tell you that most movies and TV shows have unique titles. Counting the `release_year` column will tell you how many Netflix titles are from a particular year.

In [None]:
netflix["title"].value_counts()
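
As a quick check on how the counting works, here is a minimal sketch with six made-up ratings (these are not the real Netflix counts). The result is a Series, sorted from most common to least common.

```python
import pandas as pd

# Six made-up ratings for illustration only.
ratings = pd.Series(
    ["TV-MA", "R", "TV-MA", "PG", "R", "TV-MA"],
    name="age_certification",
)

counts = ratings.value_counts()
print(counts)

# Individual counts can be looked up by value.
print(counts["TV-MA"])  # 3
```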

## Putting it Together

Each of these functions returns a DataFrame (or, if the result is only a single column, a **Series** which is very similar), and so you can use multiple functions in a row.

**Example 1.** Let's say we want to look at recent (after the year 2000) movies, and see what their ratings are. We start by making a logical expression for recent movies:

> `(netflix["type"] == "MOVIE") & (netflix["release_year"] > 2000)`

And then select those using `DataFrame.loc[]`. To make this easier to read, I'm going to store this result in a new variable, `recentmovies`.

> `recentmovies = netflix.loc[ (netflix["type"] == "MOVIE") & (netflix["release_year"] > 2000) ]`

Now, with `recentmovies`, we can count how many there are of each rating:

> `recentmovies["age_certification"].value_counts()`

In [None]:
recentmovies = netflix.loc[ (netflix["type"] == "MOVIE") & (netflix["release_year"] > 2000) ]

recentmovies["age_certification"].value_counts()

**Example 2.** Now, let's find the average IMDB user rating of television shows. Let's build this up step by step. First, select only television shows. This requires the logical statement:

> `netflix["type"] == "SHOW"`

Now, select those using `DataFrame.loc[]`:

> `netflix.loc[ netflix["type"] == "SHOW" ]`

As above, we could store this new DataFrame as a new variable, like `shows`, but we don't have to. We can instead just keep adding indices and member functions, and it will work just fine. It's a bit less readable, though.

Since we want the IMDB user ratings, we'll select that column:

> `netflix.loc[ netflix["type"] == "SHOW" ]["imdb_score"]`

And then finally, we take the mean of that column using `.mean()`:

In [None]:
netflix.loc[ netflix["type"] == "SHOW" ]["imdb_score"].mean()
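
For comparison, the sketch below computes the same kind of average three ways on a made-up table (the scores are invented): the chained version from above, a version that stores the shows in a variable first, and a version that hands `.loc[]` the rows and the column name together. All three should give the same number, so which one to use is mostly a matter of readability.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "type": ["SHOW", "MOVIE", "SHOW"],
    "imdb_score": [8.0, 5.0, 6.0],
})

# 1. Chained, as in the example above.
chained = toy.loc[toy["type"] == "SHOW"]["imdb_score"].mean()

# 2. Step by step, storing the selected rows in a variable first.
shows = toy.loc[toy["type"] == "SHOW"]
stepwise = shows["imdb_score"].mean()

# 3. .loc[] also accepts the row selection and a column name together.
combined = toy.loc[toy["type"] == "SHOW", "imdb_score"].mean()

print(chained, stepwise, combined)  # 7.0 7.0 7.0
```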