# Introduction to Python Pandas Library <a id='Start'></a>

The objective of this assignment is to provide an overview of the Python Pandas Library and its fundamentals. After completing this assignment, you should be able to understand and use the library effectively for data analysis and manipulation. This is not an all-encompassing overview: there are additional functions that we will cover in class and many that we will not cover.

Use the links below to move to each section of the notebook directly.

Introduction to Pandas:
- [What is Pandas?](#Pandas)
- Features of Pandas.

Pandas Data Structures:
- [Series](#Series): Creating a series, indexing, accessing elements.
- [DataFrame](#DataFrame): Creating a DataFrame, indexing, accessing elements, data manipulation.

Data Import and Export:
- [Writing](#Write) data to a CSV file.
- [Reading](#Read) data from a CSV file.

Data Manipulation:
- [Viewing](#View) data.
- [Describing](#Describe) data.
- [Selecting](#Select) data.
- [Sorting](#Sort) data.
- [Filtering](#Filter) data.
- [Merging](#Merge) data.
- [Grouping](#Group) data.
- [Aggregating](#Aggregate) data.

Data Visualization:
- [Plotting](#Plot) Pandas data.

Go to [End](#End)

Resources:

- Pandas documentation: https://pandas.pydata.org/docs/

***

## What is the Pandas library? <a name="Pandas"></a>

Pandas is a popular open-source Python library used for data manipulation, analysis, and visualization. It provides high-performance, easy-to-use data structures and data analysis tools that make working with structured data fast, easy, and efficient. Pandas can handle a wide variety of data formats including tabular data (in CSV, Excel, or SQL database formats), time series data, and multidimensional data with ease. The library provides a variety of data manipulation functions such as merging, reshaping, and filtering, as well as statistical and mathematical functions for data analysis. Pandas also has built-in data visualization capabilities, making it a powerful tool for exploratory data analysis. Overall, Pandas is a powerful and essential tool for any data analyst, data scientist, or machine learning engineer working with Python.

### Features of Pandas: 
Some of the key features of Pandas are:

- <ins>Input/Output</ins>: Pandas provides functions for reading and writing data in a variety of formats, including CSV, Excel, SQL, JSON, and more.

- <ins>Data Structures</ins>: Pandas provides two main data structures - Series and DataFrame - which are powerful, flexible, and efficient for storing and manipulating data.

- <ins>Data Manipulation</ins>: Pandas has a wide range of functions for filtering, grouping, reshaping, pivoting, merging, and sorting data, making it easy to transform and manipulate data.

- <ins>Grouping and aggregation</ins>: Pandas provides methods to group data based on one or more columns and perform aggregation functions such as sum, mean, min, max, count, and more.

- <ins>Data Cleaning</ins>: Pandas has functions for handling missing data, removing duplicates, and correcting erroneous data, making it easy to clean up data before analysis.

- <ins>Time Series Analysis</ins>: Pandas provides powerful tools for working with time series data, including date and time functions, resampling, and windowing functions.

- <ins>Data Visualization</ins>: Pandas has built-in data visualization capabilities, making it easy to create a wide range of plots and charts for exploratory data analysis.


Pandas can help you answer questions about your data, such as:

- Is there a correlation between two or more columns?
- What is the average value?
- What is the maximum value?
- What is the minimum value?

Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
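
As a rough illustration of the questions above, the sketch below uses a small made-up DataFrame. The column names hours and score and all of the values are assumptions chosen only for this example. It drops a row with a missing value and then computes the average, maximum, minimum, and correlation. (Importing Pandas is explained in the next section.)

```python
import pandas as pd

# Hypothetical sample data; None marks a missing value
df = pd.DataFrame({
    'hours': [1.0, 2.0, 3.0, 4.0, None],
    'score': [52.0, 61.0, 70.0, 79.0, 88.0],
})

# Cleaning: drop every row that contains a missing value
cleaned = df.dropna()

avg_score = cleaned['score'].mean()   # average value
max_score = cleaned['score'].max()    # maximum value
min_score = cleaned['score'].min()    # minimum value
corr = cleaned['hours'].corr(cleaned['score'])  # Pearson correlation

print(len(df), "rows before cleaning,", len(cleaned), "rows after")
print(avg_score, max_score, min_score, corr)
```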
***

### Import Pandas

Pandas is imported like other Python libraries by adding the import keyword:

In [None]:
import pandas

Now Pandas is imported and ready to use.

In [None]:
import pandas

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pandas.DataFrame(mydataset)

print(myvar) 

Pandas is commonly imported under the pd alias.

Create an alias with the as keyword while importing:

In [None]:
import pandas as pd 

Now the Pandas package can be referred to as pd instead of pandas.

In [None]:
import pandas as pd

mydataset = {
  'cars': ["BMW", "Volvo", "Ford"],
  'passings': [3, 7, 2]
}

myvar = pd.DataFrame(mydataset)

print(myvar)

[Return to top](#Start)
***

## Pandas Data Structures
Data in Pandas is held in two types of structures, Series or DataFrames.

### Series <a name="Series"></a>

Series documentation: https://pandas.pydata.org/docs/reference/api/pandas.Series.html

A Series is a one-dimensional labeled array capable of holding any data type (integer, float, string, Python objects, etc.). It is similar to a column in a spreadsheet or a SQL table. It consists of two arrays - one for the data and another for the index.

The <ins>index</ins> is a sequence of labels that identifies each element in the data array. If an index is not specified, then it is created automatically as a sequence of integers starting from zero.

To create a Series, you can pass a Python list, dictionary, or a scalar value as input. For example, to create a Series with a list of integers, you can use the following code:

In [None]:
import pandas as pd
data = [1, 2, 3, 4, 5]
s = pd.Series(data)
print(s)

In this example, the index is automatically created as a sequence of integers from 0 to 4, and the data is a list of integers.
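
The other two input types mentioned above, a dictionary and a scalar value, can be sketched as follows. The labels and values are made up for illustration. With a dictionary, the keys become the index labels; with a scalar, an index must be supplied so Pandas knows how many times to repeat the value.

```python
import pandas as pd

# From a dictionary: the keys become the index labels
s_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})

# From a scalar: the value is repeated for every label in the index
s_scalar = pd.Series(5, index=['x', 'y', 'z'])

print(s_dict)
print(s_scalar)
```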

You can access elements in a Series using their index. For example, to access the element at index 2, you can use the following code:

In [None]:
print(s[2])

You can also perform operations on a Series, such as arithmetic operations and boolean indexing, just like you would with NumPy arrays.
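
A minimal sketch of both kinds of operation, reusing the same five integers as the earlier example (the variable names doubled, mask, and large are just illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

doubled = s * 2   # arithmetic is applied to every element
mask = s > 2      # boolean Series: True where the condition holds
large = s[mask]   # boolean indexing keeps only the True positions

print(doubled.tolist())
print(large)
```

Note that the filtered Series keeps the original index labels of the elements that were selected.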

### DataFrame <a name="DataFrame"></a>

DataFrame documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

A DataFrame is a two-dimensional labeled data structure, similar to a spreadsheet or a SQL table. It consists of rows and columns, where each column can have a different data type (integer, float, string, etc.). You can think of a DataFrame as a collection of Series that share the same index.

To create a DataFrame, you can pass a dictionary of lists, where each key represents a column name, and each value represents the data for that column. For example, to create a DataFrame with three columns, you can use the following code:

In [None]:
import pandas as pd
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
        'age': [25, 30, 35, 40],
        'city': ['New York', 'Paris', 'London', 'Tokyo']}
df = pd.DataFrame(data)
print(df)

In this example, the keys of the dictionary ('name', 'age', 'city') become the column names, and the values of the dictionary become the data for each column.

You can access elements in a DataFrame using various methods such as .loc[], .iloc[], and .at[]. For example, to access the element at row 1 and column 'name', you can use the following code:

In [None]:
print(df.loc[1, 'name'])

You can also perform various operations on a DataFrame, such as filtering, grouping, merging, joining, and more. Pandas provides many powerful methods to manipulate data in a DataFrame, making it a popular choice for data analysis in Python.
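
To compare the three accessors mentioned earlier in this section (.loc[], .iloc[], and .at[]), here is a short sketch that retrieves the same cell three ways, using the same sample data as above:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'age': [25, 30, 35, 40],
                   'city': ['New York', 'Paris', 'London', 'Tokyo']})

by_label = df.loc[1, 'name']      # row label 1, column label 'name'
by_position = df.iloc[1, 0]       # second row, first column (by position)
single_value = df.at[1, 'name']   # label-based access to a single value

print(by_label, by_position, single_value)
```

With the default integer index, row label 1 and row position 1 refer to the same row, which is why all three return the same value here.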

[Return to top](#Start)
***

## Data Import/Export

### Writing data to CSV <a name="Write"></a>

to_csv documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html

To write data to a CSV file using Pandas in Python, you can use the 'to_csv()' method of a Pandas DataFrame object.

Here is an example code snippet that demonstrates how to do this:

In [None]:
import pandas as pd

# create a sample dataframe
data = {'name': ['John', 'Emma', 'Sarah', 'Daniel'],
        'age': [25, 27, 29, 31],
        'country': ['USA', 'UK', 'Australia', 'Canada']}
df = pd.DataFrame(data)

# write the dataframe to a CSV file
df.to_csv('data.csv', index=False)

In this example, we first create a sample dataframe with some data. We then call the to_csv() method on the dataframe, passing the name of the output file as an argument. The index=False argument tells Pandas not to write the index column to the output file.

After running this code, you should find a new file named "data.csv" in the current working directory containing the data from the dataframe.

[Return to top](#Start)
***

### Reading data from CSV <a name="Read"></a>

read_csv documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

To import CSV data into a pandas DataFrame, you can use the read_csv() function provided by the pandas library. Here's how to do it:

First, import the pandas library using the following code:

In [None]:
import pandas as pd

You only need to include this once. It is included multiple times here to allow you to run individual parts of the file.

Next, use the read_csv() function to load the CSV data into a pandas DataFrame. The syntax for the function is as follows:

<b>pd.read_csv('filename.csv')</b>

Replace filename.csv with the path to your CSV file. If your CSV file is in the same directory as your Python script, you can simply specify the filename. Otherwise, you'll need to provide the full path to the file.

If the file cannot be found, you will receive a FileNotFoundError, and the error message output will end with FileNotFoundError: [Errno 2] No such file or directory: 'filename.csv'.

For example, if you have a CSV file called data.csv in the same directory as your Python script, you can load it into a DataFrame like this:

In [None]:
import pandas as pd

df = pd.read_csv('data.csv')

By default, the read_csv() function assumes that the first row of the CSV file contains column headers. 

If your CSV file doesn't have column headers, set the <ins>header</ins> parameter to None and supply the column names with the names parameter:

In [None]:
df = pd.read_csv('data.csv', header=None, names=['col1', 'col2', 'col3'])

This will create column headers named 'col1', 'col2', and 'col3' for your DataFrame.

There are many other options you can use with the read_csv() function to customize how the CSV data is loaded. 

You can find more information in the pandas documentation.
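
As a sketch of a few of those options, the example below uses sep, usecols, and nrows. To keep it self-contained, it reads from an in-memory string through io.StringIO instead of a file on disk; the semicolon-separated data is made up for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for a semicolon-separated file (hypothetical data)
csv_text = "name;age;country\nJohn;25;USA\nEmma;27;UK\nSarah;29;Australia\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=';',                  # field delimiter (the default is ',')
    usecols=['name', 'age'],  # load only these columns
    nrows=2,                  # read only the first 2 data rows
)
print(df)
```

With a real file, you would pass the file path in place of io.StringIO(csv_text).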

[Return to top](#Start)
***

## Data Manipulation 

- [Viewing](#View) data.
- [Describing](#Describe) data.
- [Selecting](#Select) data.
- [Sorting](#Sort) data.
- [Filtering](#Filter) data.
- [Merging](#Merge) data.
- [Grouping](#Group) data.
- [Aggregating](#Aggregate) data.

### Viewing data <a name="View"></a>

head() documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html

When first working with a dataset, it is best practice to understand the data you are working with. Many datasets can be quite large and difficult or impossible to open with Excel, Notepad++, etc. because those programs try to open the entire dataset. The .head() function in Pandas is a method that is used to display the first n rows of a DataFrame or Series. By default, n is 5, which means that the .head() function will display the first 5 rows of the DataFrame or Series.

In [None]:
import pandas as pd

# create a sample dataframe
data = {'name': ['John', 'Emma', 'Sarah', 'Daniel', 'Raphael', 'Leonardo', 'Michelangelo', 'Mary', 'Vincent', 'Pablo'],
        'age': [27, 31, 25, 22, 28, 25, 37, 43, 33, 25],
        'country': ['USA', 'UK', 'Australia', 'Canada', 'Italy', 'Italy', 'Italy', 'USA', 'Netherlands', 'Spain']}
df = pd.DataFrame(data)

# display the first rows of the dataframe
print(df.head())

In this example, we first create a sample DataFrame df with some data. The .head() function is then called on df with no argument, so the first 5 rows are displayed. We can also pass a number to tell the function to display the first n rows of the DataFrame. The example below uses n=3:

In [None]:
# display the first 3 rows of the dataframe
print(df.head(3))

You can use the .head() function to quickly inspect the first few rows of a DataFrame or Series and get a sense of the data. By default, the function displays the first 5 rows, but you can pass a different number to the function to display a different number of rows.

Alternatively, you can display the end of the dataset using tail() to display the last n rows of a DataFrame or Series. By default, n is 5, which means that the .tail() function will display the last 5 rows of the DataFrame or Series.

In [None]:
# display the last 3 rows of the dataframe
print(df.tail(3))

[Return to top](#Start)
***

### Describing data <a name="Describe"></a>

describe() documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html

The describe() function is a useful method in the Pandas library for quickly generating descriptive statistics of a DataFrame or a specific column in a DataFrame.

When applied to a DataFrame, describe() returns a summary of statistics for each <ins>numeric</ins> column in the DataFrame such as count, mean, standard deviation, minimum, maximum, and the quartile values. When applied to a non-numeric column, it will return the count, number of unique values, top, and frequency of the top value.

The statistics that are included in the output of the describe() function are:
- count: the number of non-missing values in each column.
- mean: the arithmetic mean (average) of the values in each column.
- std: the standard deviation of the values in each column.
- min: the minimum value in each column.
- 25%: the 25th percentile value in each column.
- 50%: the median value (50th percentile) in each column.
- 75%: the 75th percentile value in each column.
- max: the maximum value in each column.

It is important to note that the describe() function only generates statistics for the columns with numeric data types by default. However, if you set the include parameter to 'all', it will include summary statistics for both numeric and non-numeric columns.

Here is an example of how to use the describe() function:

In [None]:
import pandas as pd

# Create a DataFrame with some sample data
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
        'Age': [25, 30, 35, 40, 45],
        'Salary': [50000, 60000, 70000, 80000, 90000]}
df = pd.DataFrame(data)

# Call the describe() function on the DataFrame
print(df.describe())

From the output, we can see that the describe() function has generated summary statistics for the numeric columns Age and Salary in the DataFrame.
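
The two other uses described above, calling describe() on a single non-numeric column and passing include='all', can be sketched with the same sample data (the variable names name_stats and all_stats are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
                   'Age': [25, 30, 35, 40, 45],
                   'Salary': [50000, 60000, 70000, 80000, 90000]})

# Non-numeric column: count, unique, top, and freq
name_stats = df['Name'].describe()

# All columns at once; statistics that do not apply are shown as NaN
all_stats = df.describe(include='all')

print(name_stats)
print(all_stats)
```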

[Return to top](#Start)
***

### Selecting data <a name="Select"></a>

loc() documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html

iloc() documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html


### loc()

The .loc() function in Pandas is used to access a group of rows and columns in a DataFrame by label(s) or a boolean array. Note that it is an indexer used with square brackets (df.loc[...]) rather than called with parentheses. It allows you to subset or filter your data based on a specific row or column label or a specific condition.

The general syntax for using the .loc() function is as follows:

df.loc[row_labels, column_labels]

where df is the DataFrame you want to access, row_labels is a label, a list of labels, or a boolean array for selecting specific rows, and column_labels is a label or a list of labels for selecting specific columns. Inside the square brackets, the first entry specifies the row label(s), and the second one specifies the column label(s). <b>You can use a colon to select a range of labels</b>; unlike ordinary Python slices, a label range includes both endpoints. For example, if you want to select all rows and some specific columns, you can use .loc[:, ['column1', 'column2']].

Here are some examples of how to use the .loc() function:

In [None]:
import pandas as pd

# create a sample dataframe
data = {'name': ['John', 'Emma', 'Sarah', 'Daniel', 'Raphael', 'Leonardo', 'Michelangelo', 'Mary', 'Vincent', 'Pablo'],
        'age': [27, 31, 25, 22, 28, 25, 37, 43, 33, 25],
        'country': ['USA', 'UK', 'Australia', 'Canada', 'Italy', 'Italy', 'Italy', 'USA', 'Netherlands', 'Spain']}
df = pd.DataFrame(data)

# access the row with index label 2
print(df.loc[2])
print('\n')

# access the row with index labels 1, 3, and 5
print(df.loc[[1, 3, 5]])
print('\n')

# access the rows with boolean array
print(df.loc[df['age'] > 28])
print('\n')

# access the rows and columns with label or list of labels
print(df.loc[[1, 3, 5], ['name', 'country']])
print('\n')

# access all the rows in specific columns with label or list of labels
print(df.loc[:, ['name']])
print('\n')

# access a specific slice of rows (rows 2-4) and all columns
print(df.loc[2:4, :])

The .loc() function is useful when you need to select data from a DataFrame based on the label or index of the rows and columns. It is an efficient way to retrieve specific data from a DataFrame without having to iterate over the entire DataFrame.
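
Label-based selection is easiest to see when the index is not the default range of integers. The sketch below uses a small made-up DataFrame with names as the index and compares a label slice with .loc[] against a position slice with .iloc[]:

```python
import pandas as pd

df = pd.DataFrame({'age': [27, 31, 25, 22]},
                  index=['John', 'Emma', 'Sarah', 'Daniel'])

# Label slice: BOTH endpoints ('Emma' and 'Sarah') are included
label_slice = df.loc['Emma':'Sarah']

# Position slice: the stop position (2) is excluded
position_slice = df.iloc[1:2]

print(label_slice)
print(position_slice)
```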


### iloc()

iloc() documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html

The .iloc() function in Pandas is an integer-based indexing method used to select rows and columns from a DataFrame. It is used to retrieve data from a Pandas DataFrame based on the position or index of the rows and columns.

The .iloc() function takes two arguments: the first one specifies the row position, and the second one specifies the column position. You can use a colon to select a range of positions; as with ordinary Python slices, the stop position is excluded. For example, if you want to select all rows and some specific columns, you can use .iloc[:, [0, 1, 2]].

Here's an example of how to use the .iloc() function:

In [None]:
import pandas as pd

# create a sample dataframe
data = {'name': ['John', 'Emma', 'Sarah', 'Daniel', 'Raphael', 'Leonardo', 'Michelangelo', 'Mary', 'Vincent', 'Pablo'],
        'age': [27, 31, 25, 22, 28, 25, 37, 43, 33, 25],
        'country': ['USA', 'UK', 'Australia', 'Canada', 'Italy', 'Italy', 'Italy', 'USA', 'Netherlands', 'Spain']}
df = pd.DataFrame(data)

# select the row at position 2 and the column at position 2 (positions start at 0)
print(df.iloc[2, 2])
print('\n')

# select the rows at positions 1 through 3 and all columns
print(df.iloc[1:4, :])
print('\n')

# select the rows at positions 0, 2, and 4 and the columns at positions 0 and 2
print(df.iloc[[0, 2, 4], [0, 2]])


[Return to top](#Start)
***

### Sorting <a name="Sort"></a>

sort_values() documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html

sort_index() documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html

You can sort the data in a Pandas DataFrame using the 'sort_values()' method. The 'sort_values()' method sorts a DataFrame by one or more columns.

Here's an example code that demonstrates how to sort a DataFrame by a single column:

In [None]:
import pandas as pd

# create a sample dataframe
data = {'name': ['John', 'Emma', 'Sarah', 'Daniel', 'Raphael', 'Leonardo', 'Michelangelo', 'Mary', 'Vincent', 'Pablo'],
        'age': [27, 31, 25, 22, 28, 25, 37, 43, 33, 25],
        'country': ['USA', 'UK', 'Australia', 'Canada', 'Italy', 'Italy', 'Italy', 'USA', 'Netherlands', 'Spain']}
df = pd.DataFrame(data)
print("Original DataFrame")
print(df) # display the unsorted dataframe

# sort the dataframe by the 'age' column in ascending order
df = df.sort_values(by='age', ascending=True)

# display the sorted dataframe
print("\n","DataFrame sorted by 'Age' column")
print(df)

In this example, we first create a sample DataFrame with some data. We then call the sort_values() method on the DataFrame, passing the name of the column to sort by as an argument (in this case, "age"). By default, 'sort_values()' sorts in ascending order, so the ascending=True argument shown here is optional. If we wanted to sort in descending order, we could pass ascending=False as an argument.

After running this code, the DataFrame will be sorted by the "age" column in ascending order, and the sorted DataFrame will be displayed.

You can also sort a DataFrame by multiple columns by passing a list of column names to the by argument of the 'sort_values()' method. For example:

In [None]:
# sort the dataframe by the 'country' column in ascending order,
# and then by the 'age' column in descending order
df = df.sort_values(by=['country', 'age'], ascending=[True, False])
print(df)

In this case, the DataFrame will be first sorted by the "country" column in ascending order, and then by the "age" column in descending order.

Note that the index column still has the original order preserved after using 'sort_values()'. This can be helpful if you want to put the data back in the original order using 'sort_index()'.


In [None]:
print(df.sort_index()) # reorder the DataFrame by index and print the result

# note that we did not save the DataFrame this time. Try to use print(df) and check the output here.

You can also reset the index to match the new order by using 'reset_index()' with the drop=True argument, which removes the original index and replaces it with a new range index starting from 0. Setting drop=False will preserve the original index in a new column named 'index' while also renumbering the DataFrame.
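
The difference between the two settings can be sketched on a small made-up DataFrame (the variable names kept and dropped are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Emma', 'Sarah'],
                   'age': [27, 31, 25]})

sorted_df = df.sort_values(by='age')         # index order is now 2, 0, 1

kept = sorted_df.reset_index(drop=False)     # old index saved in an 'index' column
dropped = sorted_df.reset_index(drop=True)   # old index discarded

print(kept)
print(dropped)
```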

In [None]:
df = df.sort_values(by='name', ascending=True)
print("Current DataFrame")
print(df)

df = df.reset_index(drop=True)
print("\n","DataFrame sorted by 'Name' column")
print(df)

[Return to top](#Start)
***

### Filtering Data <a name="Filter"></a>

You can filter data in a Pandas DataFrame using one or more conditions. For example, to filter our DataFrame by age and keep only the people older than 25, we can use the conditional statement below.

In [None]:
import pandas as pd

# create a sample dataframe
data = {'name': ['John', 'Emma', 'Sarah', 'Daniel', 'Raphael', 'Leonardo', 'Michelangelo', 'Mary', 'Vincent', 'Pablo'],
        'age': [27, 31, 25, 22, 28, 25, 37, 43, 33, 25],
        'country': ['USA', 'UK', 'Australia', 'Canada', 'Italy', 'Italy', 'Italy', 'USA', 'Netherlands', 'Spain']}
df = pd.DataFrame(data)
print("Original DataFrame")
print(df) # display the unsorted dataframe

# filter the dataframe by the condition where age is greater than 25
filtered_df = df[df['age'] > 25]

# display the filtered dataframe
print("\n","Filtered DataFrame")
print(filtered_df)

In this example, we first create a sample DataFrame with some data. We then filtered the DataFrame based on the condition where age is greater than 25 and assigned that output to a new DataFrame called 'filtered_df'.

The df['age'] > 25 part of this code creates a boolean array with True or False values depending on whether each row satisfies the condition. The df[df['age'] > 25] part of the code selects only the rows where the condition is True.

We can see the generated boolean array below. Only the values that met the condition and are marked as True will be passed to filtered_df.

In [None]:
df['age'] > 25

After running this code, the filtered DataFrame will contain only the rows where the age is greater than 25, and the filtered DataFrame will be displayed.

You can also combine multiple conditions using logical operators such as & (and) and | (or), and you can filter on non-numeric columns such as names or countries. Each condition must be wrapped in parentheses. For example, let's keep only the people who are over the age of 25 and are from either the USA or Italy:

In [None]:
# filter the dataframe by the condition where age is greater than 25
# and country is either USA or Italy
filtered_df = df[(df['age'] > 25) & ((df['country'] == 'USA') | (df['country'] == 'Italy'))]

print(filtered_df)

In this case, the filtered DataFrame will contain only the rows where the age is greater than 25 and the country is either "USA" or "Italy".

Other methods of filtering include using 'query()'.
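
As a sketch of that alternative, the example below writes the same "over 25 and from the USA or Italy" filter two ways: once as a boolean mask using isin(), and once as a query() string. The sample rows are a made-up subset of the data used above.

```python
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Emma', 'Raphael', 'Mary', 'Pablo'],
                   'age': [27, 31, 28, 43, 25],
                   'country': ['USA', 'UK', 'Italy', 'USA', 'Spain']})

# Boolean-mask form, using isin() for the "either USA or Italy" test
mask_result = df[(df['age'] > 25) & (df['country'].isin(['USA', 'Italy']))]

# The same filter written as a query string
query_result = df.query("age > 25 and country in ['USA', 'Italy']")

print(query_result)
```

Both forms select the same rows; which one reads better is largely a matter of preference.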

[Return to top](#Start)
***

## Merging Data <a name="Merge"></a>

In Pandas, you can merge two or more DataFrames into a single DataFrame based on one or more common columns. This is similar to the SQL JOIN operation. Pandas provides various methods for easily combining together Series or DataFrame with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

In addition, pandas also provides utilities to compare two Series or DataFrame and summarize their differences.

For a great overview with visuals, check out the Merge, Join, Concatenate, and Compare functions documentation page: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

### merge()

merge documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html

The merge() function in Pandas can be used to merge two DataFrames <ins>based on the values within one or more columns</ins>. Here's an example that demonstrates how to merge two DataFrames based on a common column:

In [None]:
import pandas as pd

# create a sample dataframe1
data1 = {'name': ['John', 'Emma', 'Sarah', 'Daniel'],
        'age': [25, 27, 29, 31],
        'country': ['USA', 'UK', 'Australia', 'Canada']}
df1 = pd.DataFrame(data1)
print("DataFrame #1")
print(df1)

# create a sample dataframe2
data2 = {'name': ['John', 'Emma', 'Sarah', 'Daniel'],
        'salary': [50000, 60000, 70000, 80000]}
df2 = pd.DataFrame(data2)
print("\n","DataFrame #2")
print(df2)

# merge the two dataframes based on the 'name' column
merged_df = pd.merge(df1, df2, on='name')

# display the merged dataframe
print("\n","Merged DataFrame")
print(merged_df)

In this example, we first create two sample DataFrames with some data. The first DataFrame df1 contains columns for name, age, and country, while the second DataFrame df2 contains columns for name and salary. We then use the following line of code to merge the two DataFrames based on the 'name' column:

In [None]:
merged_df = pd.merge(df1, df2, on='name')

The on='name' part of this code specifies that we want to merge the two DataFrames based on the 'name' column. By default, merge() performs an inner join, which means that only the rows with matching values in both DataFrames will be included in the merged DataFrame.

After running this code, the merged DataFrame will contain all the columns from both DataFrames and will be displayed.

You can also merge DataFrames based on multiple columns by passing a list of column names to the on argument. Additionally, you can specify different types of joins using the how argument, such as "left", "right", "outer", and "inner". For more information on merging DataFrames in Pandas, refer to the official documentation.
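
To illustrate the how argument, the sketch below merges two small made-up DataFrames whose names only partly overlap, so the join types give different results:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['John', 'Emma', 'Sarah'],
                    'age': [25, 27, 29]})
df2 = pd.DataFrame({'name': ['Emma', 'Sarah', 'Daniel'],
                    'salary': [60000, 70000, 80000]})

inner = pd.merge(df1, df2, on='name')               # only names found in both
left = pd.merge(df1, df2, on='name', how='left')    # every name from df1
outer = pd.merge(df1, df2, on='name', how='outer')  # names from either one

print(inner)
print(left)
print(outer)
```

Rows without a match on the other side are filled with NaN, for example John's salary in the left join.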
***

### concat()

concat documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

In Pandas, the concat() function is used to concatenate two or more DataFrames <ins>along a particular axis</ins> (either rows or columns). The concat() function can be used to combine DataFrames even if they have different shapes, column names, and indexes.

Here's an example that demonstrates how to use the concat() function to concatenate two DataFrames:

In [None]:
import pandas as pd

# create a sample dataframe1
data1 = {'name': ['John', 'Emma', 'Sarah', 'Daniel'],
        'age': [25, 27, 29, 31]}
df1 = pd.DataFrame(data1)
print("DataFrame #1")
print(df1)

# create a sample dataframe2
data2 = {'name': ['Olivia', 'Sophia', 'Ethan', 'Liam'],
        'age': [24, 26, 28, 30]}
df2 = pd.DataFrame(data2)
print("\n","DataFrame #2")
print(df2)

# concatenate the two dataframes along rows
concatenated_df = pd.concat([df1, df2])

# display the concatenated dataframe
print("\n","Concatenated DataFrame")
print(concatenated_df)

In this example, we first create two sample DataFrames with some data. Both DataFrames, df1 and df2, contain columns for name and age. We then use the following line of code to concatenate the two DataFrames along the rows:

In [None]:
concatenated_df = pd.concat([df1, df2])
print(concatenated_df)

The pd.concat() function is passed a list of DataFrames to concatenate. By default, the function concatenates the DataFrames along the rows (axis=0). If you want to concatenate the DataFrames along the columns (axis=1), you can specify the axis argument as follows:

In [None]:
concatenated_df = pd.concat([df1, df2], axis=1)
print(concatenated_df)

With axis=1, the DataFrames are placed side by side and aligned on their index, so the result contains the columns from both DataFrames. With the default axis=0, the concatenated DataFrame contains all the rows from both DataFrames. If the two DataFrames have different columns when concatenating along the rows, the resulting DataFrame will have all the columns from both DataFrames, with missing values (NaN) in the cells where data is missing.
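
The case of DataFrames with different columns can be sketched as follows. The data is made up for illustration: only df1 has an age column and only df2 has a city column.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['John', 'Emma'], 'age': [25, 27]})
df2 = pd.DataFrame({'name': ['Olivia', 'Liam'], 'city': ['Paris', 'Tokyo']})

# Concatenate along the rows; columns missing from one input become NaN
combined = pd.concat([df1, df2])
print(combined)
```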

Take note of the index in both cases. When using concat along the rows (axis=0), there are now multiple index entries that have the same value. You can also specify how the indexes should be handled when concatenating DataFrames using the ignore_index argument. If ignore_index=True, the resulting concatenated DataFrame will have a new index that ignores the original indexes of the input DataFrames. If ignore_index=False (the default), the original indexes of the input DataFrames will be preserved in the resulting concatenated DataFrame.

In [None]:
concatenated_df = pd.concat([df1, df2], ignore_index=True)
print(concatenated_df)

***

### join()

join documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html

The join() function is used to join two or more DataFrames <ins>based on the indexes or columns of the DataFrames</ins>. The join() function is similar to the merge() function, but it is a convenient method for combining DataFrames that have the same or similar indexes.

Here's an example that demonstrates how to use the join() function to join two DataFrames based on their indexes:

In [None]:
# create a sample dataframe1 containing age and using names as the index values
data1 = {'age': [25, 27, 29, 31]}
df1 = pd.DataFrame(data1, index=['John', 'Emma', 'Sarah', 'Daniel']) # here we assign the index to be names
print("DataFrame #1")
print(df1)

# create a sample dataframe2
data2 = {'salary': [50000, 60000, 70000, 80000]}
df2 = pd.DataFrame(data2, index=['John', 'Emma', 'Sarah', 'Daniel']) # here we assign the index to be the same names
print("\n","DataFrame #2")
print(df2)

# join the two dataframes based on their indexes
joined_df = df1.join(df2)

# display the joined dataframe
print("\n","Joined DataFrame")
print(joined_df)


In this example, we first create two sample DataFrames with some data. The first DataFrame df1 contains a column for age and has the same index as the second DataFrame df2. The second DataFrame df2 contains a column for salary. We then use the following line of code to join the two DataFrames based on their indexes:

In [None]:
joined_df = df1.join(df2)

The join() function is called on the first DataFrame, and the second DataFrame is passed as an argument to the function. By default, the join() function performs a left join, which means that all the rows from the first DataFrame are included in the resulting DataFrame, and only the matching rows from the second DataFrame are included. Where a row in the first DataFrame has no match in the second, the cells will be filled with NaN.

After running this code, the joined DataFrame will contain all the columns from both DataFrames and will be displayed.

You can also specify how to join the DataFrames using the how argument. For example, you can perform an inner join, outer join, or right join by specifying how='inner', how='outer', or how='right', respectively. You can also join on a column of the calling DataFrame instead of its index by passing the column name to the on argument; that column is then matched against the index of the other DataFrame.
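
A short sketch of the how and on arguments, using made-up data in which df1 keeps the names in a regular column and df2 uses the names as its index:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['John', 'Emma', 'Sarah'],
                    'age': [25, 27, 29]})
df2 = pd.DataFrame({'salary': [60000, 70000, 80000]},
                   index=['Emma', 'Sarah', 'Daniel'])

# Match df1's 'name' column against df2's index (default left join)
left_join = df1.join(df2, on='name')

# Keep only the rows whose name appears in both DataFrames
inner_join = df1.join(df2, on='name', how='inner')

print(left_join)
print(inner_join)
```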
***

### compare()

compare documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html

The compare() function is used to compare two DataFrames or two Series and returns a DataFrame that shows the differences between them. The two objects must have identical labels (the same index and columns) to be compared.

Here's an example code snippet that demonstrates how to use the compare() function to compare two DataFrames:

In [None]:
import pandas as pd

# create two sample dataframes
data1 = {'name': ['John', 'Emma', 'Sarah', 'Daniel'],
        'age': [25, 27, 29, 31]}
df1 = pd.DataFrame(data1)
print("DataFrame #1")
print(df1)

data2 = {'name': ['John', 'Emma', 'Sarah', 'David'],
        'age': [25, 27, 29, 31]}
df2 = pd.DataFrame(data2)
print("\n","DataFrame #2")
print(df2)

# compare the two dataframes
compared_df = df1.compare(df2)

# display the compared dataframe
print("\n","Differences between the two")
print(compared_df)


In this example, we first create two sample DataFrames with some data. The first DataFrame df1 contains columns for name and age, while the second DataFrame df2 contains columns for name and age but with a different name "David" instead of "Daniel". We then use the following line of code to compare the two DataFrames:

In [None]:
compared_df = df1.compare(df2)

The compare() function is called on the first DataFrame, and the second DataFrame is passed as an argument to the function. The two DataFrames must have identical row and column labels. By default, compare() checks them element-wise and returns a new DataFrame that contains only the rows and columns in which at least one value differs.

After running this code, the compared DataFrame contains a single row (index 3) and a single original column (name), because that is the only cell that differs. Under each reported column there are two sub-columns: self holds the value from df1 ('Daniel') and other holds the value from df2 ('David'), based on how we used them in the function.

You can control the layout of the result with the align_axis argument: align_axis=1 (the default) places self and other side by side as columns, while align_axis=0 stacks them as rows. Setting keep_shape=True keeps every row and column of the original DataFrames, and keep_equal=True shows equal values instead of replacing them with NaN.
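
As a rough sketch of these options, the cell below reuses the two DataFrames from the example and calls compare() with keep_shape, keep_equal, and align_axis. The last line uses eq(), a different DataFrame method that is not covered above, to show one way of getting a True/False table if that is what you need.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['John', 'Emma', 'Sarah', 'Daniel'],
                    'age': [25, 27, 29, 31]})
df2 = pd.DataFrame({'name': ['John', 'Emma', 'Sarah', 'David'],
                    'age': [25, 27, 29, 31]})

# Default: only the differing row and column, with self/other sub-columns
diff_default = df1.compare(df2)

# Keep every row and column, and show equal values instead of NaN
diff_full = df1.compare(df2, keep_shape=True, keep_equal=True)

# Stack self and other as rows instead of columns
diff_rows = df1.compare(df2, align_axis=0)

# A Boolean table of element-wise equality comes from eq(), not compare()
equal_mask = df1.eq(df2)

print(diff_default)
print(diff_full)
print(diff_rows)
print(equal_mask)
```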

[Return to top](#Start)
***


## Grouping data <a name="Group"></a>

### groupby()
groupby() documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html

In Pandas, grouping data is the process of splitting data into groups based on some criteria, optionally applying a function to each group independently, and optionally combining the results back into a single DataFrame. The groupby() function is used to group data in Pandas.

Here's an example code snippet that demonstrates how to use the groupby() function to group data in a Pandas DataFrame:


In [None]:
import pandas as pd

# create a sample dataframe
data = {'name': ['John', 'Emma', 'Sarah', 'Daniel', 'Jessica', 'Tom'],
        'age': [25, 27, 29, 31, 22, 28],
        'state': ['SC', 'SC', 'NC', 'NC', 'SC', 'NC'],
        'salary': [50000, 65000, 75000, 80000, 55000, 65000]}

df = pd.DataFrame(data)
print(df)

# group the dataframe by state
grouped_df = df.groupby(['state'])

# calculate the mean salary for each group
mean_salary = grouped_df['salary'].mean()

# display the mean salary for each group
print("\n","Mean Salary by State")
print(mean_salary)

In this example, we first create a sample DataFrame df with some data. The DataFrame contains columns for name, age, state, and salary. We then use the following line of code to group the DataFrame by state:

In [None]:
grouped_df = df.groupby(['state'])

The groupby() function is called on the DataFrame, and the column name 'state' is passed as an argument to the function. This creates a DataFrameGroupBy object that contains the original DataFrame split into groups based on the values in the 'state' column.

After grouping the DataFrame, we use the following line of code to calculate the mean salary for each group:

In [None]:
mean_salary = grouped_df['salary'].mean()

The mean() function is called on the 'salary' column of the DataFrameGroupBy object, which calculates the mean salary for each group. This creates a new Series object containing the mean salary for each group.

You can also apply other aggregation functions to each group, such as sum(), count(), min(), max(), and median(). You can also group the DataFrame by multiple columns by passing a list of column names to the groupby() function. 

In [None]:
# calculate and display the sum of salary for each group
sum_salary = grouped_df['salary'].sum()
print("\n","Sum of Salary by State")
print(sum_salary)

# calculate and display the count of salary for each group
count_salary = grouped_df['salary'].count()
print("\n","Count of Salary by State")
print(count_salary)

# calculate and display the min salary for each group
min_salary = grouped_df['salary'].min()
print("\n","Min Salary by State")
print(min_salary)

# calculate and display the max salary for each group
max_salary = grouped_df['salary'].max()
print("\n","Max Salary by State")
print(max_salary)

# calculate and display the median salary for each group
med_salary = grouped_df['salary'].median()
print("\n","Median Salary by State")
print(med_salary)
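
Grouping by multiple columns, mentioned above, can be sketched as follows. The 'dept' column and its values are hypothetical and are added here only so that there is a second column to group by.

```python
import pandas as pd

# Sample data; the 'dept' column is a hypothetical addition for illustration
data = {'state': ['SC', 'SC', 'NC', 'NC', 'SC', 'NC'],
        'dept': ['Eng', 'Ops', 'Eng', 'Eng', 'Ops', 'Ops'],
        'salary': [50000, 65000, 75000, 80000, 55000, 65000]}
df = pd.DataFrame(data)

# Group by state first and then by dept; the result has a two-level index
mean_by_state_dept = df.groupby(['state', 'dept'])['salary'].mean()
print(mean_by_state_dept)

# reset_index() turns the two index levels back into ordinary columns
flat = mean_by_state_dept.reset_index()
print(flat)
```

The result is a Series with one row per (state, dept) combination that occurs in the data.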

Each group can also be accessed individually. In Pandas, the get_group() function is used to retrieve a single group of data from a grouped DataFrame. This function is applied on a GroupBy object, which is created by grouping a DataFrame by one or more columns.

### get_group()

get_group documentation: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.get_group.html

Here's an example code snippet that demonstrates how to use the get_group() function:

In [None]:
import pandas as pd

# create a sample dataframe
data = {'name': ['John', 'Emma', 'Sarah', 'Daniel', 'Jessica', 'Tom'],
        'age': [25, 27, 29, 31, 22, 28],
        'state': ['SC', 'SC', 'NC', 'NC', 'SC', 'NC'],
        'salary': [50000, 65000, 75000, 80000, 55000, 65000]}

df = pd.DataFrame(data)

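# group the dataframe by state
# (pass the column name as a string so that get_group() accepts the plain key 'SC';
# with a one-item list, newer Pandas versions expect the key as a tuple)
grouped_df = df.groupby('state')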

# get the group of data for the 'SC' state
m_group = grouped_df.get_group('SC')

# display the group of data for the 'SC' state
print(m_group)


The groupby() function is called on the DataFrame, and the column name 'state' is passed as an argument to the function. This creates a DataFrameGroupBy object that contains the original DataFrame split into groups based on the values in the 'state' column.

After grouping the DataFrame, we use the following line of code to retrieve the group of data for the 'SC' state:

In [None]:
m_group = grouped_df.get_group('SC')

The get_group() function is called on the DataFrameGroupBy object, with the value 'SC' passed as an argument to the function. This creates a new DataFrame object that contains only the rows of the original DataFrame where the state column has the value 'SC'.

You can use the get_group() function to retrieve a single group of data based on the values in any column that was used to group the original DataFrame. You can also apply other functions to the group of data, such as filtering or aggregation functions. For more information on the get_group() function and other functions related to grouping data in Pandas, refer to the official documentation.
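
Two follow-ups to the paragraph above can be sketched as follows: filtering a group after retrieving it, and looping over all groups instead of fetching them one at a time. The 52000 salary threshold is an arbitrary value chosen for illustration.

```python
import pandas as pd

data = {'name': ['John', 'Emma', 'Sarah', 'Daniel', 'Jessica', 'Tom'],
        'state': ['SC', 'SC', 'NC', 'NC', 'SC', 'NC'],
        'salary': [50000, 65000, 75000, 80000, 55000, 65000]}
df = pd.DataFrame(data)

grouped_df = df.groupby('state')

# Retrieve one group, then filter it like any other DataFrame
sc_group = grouped_df.get_group('SC')
sc_high = sc_group[sc_group['salary'] > 52000]  # arbitrary threshold
print(sc_high)

# Iterating over a GroupBy object yields (group key, group DataFrame) pairs
group_sizes = {}
for state, group in grouped_df:
    group_sizes[state] = len(group)
print(group_sizes)
```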

[Return to top](#Start)
***

## Aggregating data <a name="Aggregate"></a>

Aggregating data in pandas refers to the process of grouping and summarizing data based on certain criteria. It is a powerful feature of pandas that allows you to calculate summary statistics, apply functions to subsets of data, and generate pivot tables. Grouping, covered above, is one common method. Two others are the agg() function and pivot tables.

### agg()

agg documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html

Once you've grouped your data, you can apply aggregate functions to each group using the agg() function. One common form takes a dictionary that maps column names to aggregate functions. For example, you can calculate the sum of one column, the mean of another, and the count of a third for each group. The results that we calculated above with separate function calls after grouping can be tabulated in a single call to agg().

In [None]:
import pandas as pd

# create a sample dataframe
data = {'column_name': ['A', 'A', 'B', 'B', 'B', 'C'],
    'column1': [1, 1, 3, 3, 5, 7],
    'column2': [2, 2, 4, 4, 6, 8],
    'column3': [10.1, 20.2, 30.3, 40.4, 50.5, 60.6]}

df = pd.DataFrame(data)

# group the dataframe by column_name
grouped_df = df.groupby(['column_name'])

grouped_df.agg({'column1': 'sum', 'column2': 'mean', 'column3': 'count'})

In this example, the data has four columns: column_name, column1, column2, and column3. The values in column_name are either 'A', 'B', or 'C', and the values in column1 and column2 are integers. The values in column3 are floating-point numbers that stand in for the data you want to analyze.

As you can see, grouped_df.agg() calculates the sum of column1, the mean of column2, and the count of column3 for each group defined by the values in column_name. This provides a useful summary of the data that can be used for further analysis or visualization.
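
Besides a dictionary, agg() also accepts a list of function names and keyword-style "named aggregation". The sketch below shows both forms on the same sample data. The output column names total_col1 and max_col3 are made up for this illustration.

```python
import pandas as pd

data = {'column_name': ['A', 'A', 'B', 'B', 'B', 'C'],
        'column1': [1, 1, 3, 3, 5, 7],
        'column3': [10.1, 20.2, 30.3, 40.4, 50.5, 60.6]}
df = pd.DataFrame(data)

# A list of function names applies every function to the selected column
multi = df.groupby('column_name')['column1'].agg(['sum', 'mean', 'count'])
print(multi)

# Named aggregation: new_column_name=(source_column, function)
named = df.groupby('column_name').agg(
    total_col1=('column1', 'sum'),
    max_col3=('column3', 'max'),
)
print(named)
```

Named aggregation is convenient when you want to choose the column names of the summary table yourself.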

[Return to top](#Start)
***

### pivot_table()

pivot_table documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot_table.html

Pivot tables are a powerful tool for summarizing and analyzing data. They create a spreadsheet-style pivot table that summarizes and aggregates data in a DataFrame. This function allows you to group data by one or more columns, apply a function to one or more columns of values, and reshape the results into a new DataFrame with a hierarchical index.

In [None]:
import pandas as pd

# create a sample dataframe
data = {'column1': ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'column2': [1, 2, 1, 2, 2, 1, 2, 1, 1],
    'column3': [10.1, 20.2, 30.3, 40.4, 50.5, 60.6, 70.7, 80.8, 90.9]}

df = pd.DataFrame(data)

pivot_table = pd.pivot_table(df, index=['column1', 'column2'], values='column3', aggfunc='sum')
print(pivot_table)

This will create a pivot table that groups the data by 'column1' first and then by 'column2', and calculates the sum of 'column3' for each group. The result has a hierarchical (two-level) index and provides a useful summary of the data that can be used for further analysis or visualization.

Here's the basic syntax of the pivot_table() function:

<b>pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')</b>

The parameters of this function are:

- <b>data</b>: This is the DataFrame that you want to use to create the pivot table.
- <b>values</b>: This is the column or list of columns that you want to apply the aggregation function to.
- <b>index</b>: This is the column or list of columns that you want to group the data by.
- <b>columns</b>: This is the column or list of columns that you want to use as the columns in the pivot table.
- <b>aggfunc</b>: This is the aggregation function that you want to apply to the values column or columns. The default is mean, but you can use other functions like sum, min, max, count, std, var, and so on.
- <b>fill_value</b>: This is the value that you want to use to replace missing values in the pivot table. The default is None.
- <b>margins</b>: This is a Boolean value that indicates whether to include row and column totals in the pivot table. The default is False.
- <b>dropna</b>: This is a Boolean value that indicates whether to exclude columns whose entries are all missing (NaN) from the pivot table. The default is True.
- <b>margins_name</b>: This is the name that you want to use for the row and column totals in the pivot table. The default is 'All'.
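
Several of these parameters can be combined in one call, as in the sketch below. The sample data is hypothetical and is built so that group 'A' has no rows with column2 equal to 2, which lets fill_value show its effect.

```python
import pandas as pd

# Hypothetical sample data: there is no row with column1 == 'A' and column2 == 2
data = {'column1': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
        'column2': [1, 1, 1, 2, 2, 1, 2],
        'column3': [10, 20, 30, 40, 50, 60, 70]}
df = pd.DataFrame(data)

wide = pd.pivot_table(df,
                      index='column1',       # one row per value of column1
                      columns='column2',     # one column per value of column2
                      values='column3',
                      aggfunc='sum',
                      fill_value=0,          # empty cells become 0 instead of NaN
                      margins=True,          # add row and column totals
                      margins_name='Total')
print(wide)
```

Using the columns parameter spreads the values of column2 across the top of the table, which is often easier to read than a two-level row index.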

[Return to top](#Start)
***



## Plotting data basics <a name="Plot"></a>

The plot function in the Python Pandas library is used to create different types of plots such as line plots, bar plots, histogram plots, scatter plots, and many others. This function is applied on a Pandas DataFrame or Series to visualize the data in a graphical format.

The plot function has several parameters that can be used to customize the appearance and behavior of the plot. Some of the most commonly used parameters are:

- kind: This parameter specifies the type of plot to be created. The available options include 'line', 'bar', 'hist' (histogram), 'scatter', and many others.
- x and y: These parameters specify the column names or indices to be plotted on the x and y-axes, respectively.
- title: This parameter is used to set the title of the plot.
- xlabel and ylabel: These parameters are used to set the labels for the x and y-axes.
- color: This parameter is used to specify the color of the plot.
- legend: This parameter is used to display the legend on the plot.

A great reference for ideas and code for plotting: https://pandas.pydata.org/docs/user_guide/visualization.html

Here is an example of how to use the plot function to create a line plot:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a sample DataFrame
data = pd.DataFrame({'x': np.arange(10),
                     'y': np.random.randn(10)})

# Plot the data as a line plot
data.plot(x='x', y='y', kind='line')

# Add title and axis labels
plt.title('Sample Line Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')

# Display the plot
plt.show()

This code will create a line plot of the data in the DataFrame, with the x-axis showing the values of the 'x' column and the y-axis showing the values of the 'y' column. The plot will also have a title and axis labels.

In addition to creating basic line plots, bar plots, and scatter plots, the plot function in Pandas provides many customization options and supports many other types of plots such as:

- Create stacked or grouped bar plots: A bar plot of a DataFrame with several columns is grouped by default, which shows the bars for each row side-by-side. Setting the stacked parameter to True stacks the bars on top of each other instead.
- Create box plots: You can create box plots by setting the kind parameter to 'box'.
- Create area plots: You can create area plots by setting the kind parameter to 'area'.
- Create pie charts: You can create pie charts by setting the kind parameter to 'pie'.
- Set the style and color of the plot: You can customize the style and color of the plot by setting the style and color parameters, respectively.
- Display multiple plots on the same figure: You can display multiple plots on the same figure by creating subplots with the Matplotlib subplots function and passing the returned axis objects to the plot function.
- Create log-scale plots: You can create log-scale plots by setting the logx or logy parameters to True.
- Save the plot as an image: You can save the plot as an image by using the Matplotlib savefig function.
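
A few of the options in this list can be sketched together as follows. The sample values, the panel titles, and the output file name are arbitrary choices for illustration. The file is written to the system temporary folder here only so that the sketch does not clutter your working directory.

```python
import os
import tempfile

import pandas as pd
import matplotlib.pyplot as plt

# Arbitrary sample data spanning several orders of magnitude
data = pd.DataFrame({'A': [1, 20, 300],
                     'B': [4, 50, 600]},
                    index=['x', 'y', 'z'])

# One figure with three axes, created with Matplotlib
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 3))

# Grouped bars are the default for a DataFrame with several columns
data.plot(kind='bar', ax=axes[0], title='Grouped (default)')

# stacked=True stacks the bars on top of each other
data.plot(kind='bar', stacked=True, ax=axes[1], title='Stacked')

# logy=True puts the y-axis on a log scale
data.plot(kind='line', logy=True, ax=axes[2], title='Log-scale y-axis')

# Save the whole figure as an image with Matplotlib's savefig
out_path = os.path.join(tempfile.gettempdir(), 'pandas_plot_options.png')
fig.savefig(out_path)
print(out_path)
```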

Here is an example of how to use the plot function to create a stacked bar plot:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Create a sample DataFrame
data = pd.DataFrame({'A': [1, 2, 3],
                     'B': [4, 5, 6],
                     'C': [7, 8, 9]})

# Create a stacked bar plot
data.plot(kind='bar', stacked=True)

# Add title and axis labels
plt.title('Sample Stacked Bar Plot')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')

# Display the plot
plt.show()

This code will create a stacked bar plot of the data in the DataFrame, with each bar showing the values of the 'A', 'B', and 'C' columns. The plot will also have a title and axis labels.

You can add multiple plots to the same figure using Pandas by creating subplots and passing the returned axis object to the plot function. The subplots function creates a grid of subplots with the specified number of rows and columns.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Create a sample DataFrame
data = pd.DataFrame({'x': [1, 2, 3, 4],
                     'y1': [10, 20, 30, 40],
                     'y2': [5, 15, 25, 35]})

# Create a subplot with two rows and one column
fig, axes = plt.subplots(nrows=2, ncols=1)

# Plot the first data series on the first subplot
data.plot(x='x', y='y1', kind='line', ax=axes[0])

# Plot the second data series on the second subplot
data.plot(x='x', y='y2', kind='line', ax=axes[1])

# Add title and axis labels to the figure
fig.suptitle('Sample Multiple Plots')
axes[0].set_ylabel('Y1-axis')
axes[1].set_xlabel('X-axis')
axes[1].set_ylabel('Y2-axis')

# Display the plot
plt.show()


This code will create a figure with two subplots, each displaying a line plot of a different data series from the DataFrame. The subplots function returns a tuple containing the figure object and an array of axis objects. The ax parameter is used to specify which axis object to use for each plot. The suptitle, set_xlabel, and set_ylabel functions are used to add a title and axis labels to the figure.

You may be wondering at this point why Matplotlib is being used. Matplotlib is used to create the grid of subplots and to add a title and axis labels to the figure. While the plot function in Pandas provides basic plotting functionality and allows you to create many types of plots with just one line of code, it may not provide all the customization options you need to create publication-quality plots.

Matplotlib is a powerful plotting library in Python that provides many customization options for creating high-quality plots. You can use Matplotlib in combination with Pandas to customize your plots and add more advanced features, such as annotations, legends, and custom color maps.

In the last example, the subplots are created with Matplotlib, Pandas draws each line plot onto the axis object passed through the ax parameter, and Matplotlib is then used to add a title and axis labels to the figure. The suptitle, set_xlabel, and set_ylabel functions are provided by Matplotlib and are used to customize the plots drawn by Pandas.

In summary, while Pandas provides basic plotting functionality, Matplotlib can be used to customize your plots and add more advanced features, and is often used in combination with Pandas to create high-quality plots.
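
One way to see how the two libraries fit together is to keep the object that the Pandas plot function returns. For a single plot this is a Matplotlib axis object, so Matplotlib methods can be called on it directly. The sketch below uses made-up data, and the annotation text and position are arbitrary.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up sample data
data = pd.DataFrame({'x': [1, 2, 3, 4],
                     'y': [10, 20, 30, 40]})

# Pandas draws the plot and returns the Matplotlib axis object it drew on
ax = data.plot(x='x', y='y', kind='line', legend=False)

# From here on, everything is plain Matplotlib
ax.set_title('Sample Line Plot')
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')
ax.annotate('last point', xy=(4, 40), xytext=(2.5, 38),
            arrowprops={'arrowstyle': '->'})
```

Working through the returned axis object instead of the plt functions makes it explicit which plot is being customized, which helps once a figure holds more than one plot.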

Some other useful visualizations follow to get you thinking about being creative with your plotting.

In [None]:
# Plot all data from a DataFrame in a matrix plot to examine the relationships between the data

import pandas as pd
import numpy as np

# Create dataframe and fill with random generated data
df = pd.DataFrame(np.random.randn(1000, 4), columns=["a", "b", "c", "d"])

# Plot all data in one matrix using scatter plots with kde distribution plots on the diagonal axis
pd.plotting.scatter_matrix(df, alpha=0.2, figsize=(6, 6), diagonal="kde");

In [None]:
# Create and plot a dataframe of four columns of random data and label with dates

import pandas as pd
import numpy as np

# Create a Series of random values indexed by dates; we will reuse its date index below
ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))

# Create DataFrame with four columns (ABCD) and fill with random data and label with the date Series ts
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list("ABCD"))

# Return the cumulative sum over the DataFrame
df = df.cumsum()

# Plot the cumulative sum data
df.plot(legend=True);

In [None]:
# Using the same data, plot A and B on separate y-axes
# Note how different this plot is from keeping the data on the same axis

df['A'].plot();

df['B'].plot(secondary_y=True, style='g');

In [None]:
# this data can also be easily plotted separately by adding the subplots keyword

df.plot(subplots=True, figsize=(6, 6));

In [None]:
# We can also control the layout and plot individual data
df.plot(subplots=True, layout=(2, 3), figsize=(6, 6), sharex=False);

In [None]:
# A helpful tip: we can also let Pandas calculate the number of rows
# or columns needed by replacing the respective value with -1.


# control number of rows (2) but not columns
df.plot(subplots=True, layout=(2, -1), figsize=(6, 6), sharex=False);


# control number of columns (3) but not rows
df.plot(subplots=True, layout=(-1, 3), figsize=(6, 6), sharex=False);

This is a small selection of the possible plotting options available to you. Try to be creative with your plotting.

*** 

### Seaborn

Recommended Tutorial for capabilities, ideas, and code: https://seaborn.pydata.org/tutorial/introduction.html

Below is another example of plotting DataFrames, but for this section we will use the Seaborn library. Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for creating informative and attractive statistical graphics. Seaborn comes with several built-in themes and color palettes to enhance the visual aesthetics of plots.

Seaborn is often used for statistical data visualization and exploration, and it includes several types of plots, including scatter plots, line plots, bar plots, histograms, kernel density plots, box plots, violin plots, heatmaps, and more. Seaborn also provides support for complex data types such as multi-panel categorical plots, and it can also be used for visualizing relationships between variables using techniques such as linear regression or correlation analysis. Overall, Seaborn is a powerful and flexible library for creating high-quality data visualizations in Python.

We will use historical NBA performance data. First we will need to get the data. We can either download it directly or use Python to read it from the online source as needed, holding it in memory rather than saving it to our drive. This can be very helpful when using online data sources or if we don't want to store the data on our system.

Download: https://github.com/fivethirtyeight/data/tree/master/nba-elo

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

# Load the NBA Elo dataset from the online source 
nba_elo_df = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv')

# Print the first few rows of the DataFrame to verify that the data was loaded correctly
print(nba_elo_df.head())

# Print the column names so that we know what data might be available
print('\n\n', 'Columns in NBA Dataset')
for col in nba_elo_df.columns:
    print(col)

Now that we have the data and have a basic understanding of what might be available, let's start plotting it.

In [None]:
# Create a bar chart showing the total number of points scored by the top ten franchises during playoffs across all time

# We need to first group the data we want. From our prompt, 
    # we only want data from playoff games
    # we want to group it by franchise
    # we want to summarize all data available (across all time included)

# First select only data from playoff games using conditionals 
playoffs_true = nba_elo_df['is_playoffs'] == True 
# Now we have found where the is_playoffs column is True (outputs True/False for each row)

# select all of the rows where it is a playoff game (only outputs the rows where True)
nba_playoffs = nba_elo_df[playoffs_true] 
# Now we have only the playoff data

# group the data by franchise ID
nba_playoffs_grouped = nba_playoffs.groupby('fran_id')

# find the sum of the points columns
nba_playoffs_grouped_sum = nba_playoffs_grouped['pts'].sum().reset_index()

# sort the DataFrame based on points
nba_playoffs_sorted = nba_playoffs_grouped_sum.sort_values('pts', ascending=False).reset_index(drop=True)

# keep only the top ten teams
top_ten = nba_playoffs_sorted.head(10)

# output the top ten to check our work
print(top_ten)

# now we can plot the data
sns.barplot(data=top_ten, x='fran_id', y='pts');

# plot formatting
plt.xticks(rotation=45)
plt.xlabel('Franchise ID')
plt.ylabel('Number of points')
plt.title('Top 10 Number of points scored per franchise');

In [None]:
# The last script showed you step by step how to get to the data we desired.
# Commonly, you will see this in a shorter form, which you will also use as you become more comfortable with Python.
# The two lines below do the same thing as the six lines we used previously. This can also be made
# into one line, but for readability it is two lines here. The functions are applied from left to right.

champions_df = nba_elo_df[nba_elo_df['is_playoffs'] == True].groupby('fran_id')['pts'].sum().reset_index()
champions_df = champions_df.sort_values('pts', ascending=False).head(10).reset_index(drop=True)

print(champions_df)

sns.barplot(data=champions_df, x='fran_id', y='pts');

plt.xticks(rotation=45)
plt.xlabel('Franchise ID')
plt.ylabel('Number of points')
plt.title('Top 10 Number of points scored per franchise');
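
To check your understanding of the short form, it can help to run the same chain on a tiny table whose answer you can work out by hand. The games DataFrame below is a hypothetical miniature stand-in for the NBA data, with made-up point totals. The chain is wrapped in parentheses so that each step can sit on its own line.

```python
import pandas as pd

# Hypothetical miniature stand-in for the NBA data (made-up values)
games = pd.DataFrame({
    'fran_id': ['Celtics', 'Lakers', 'Celtics', 'Spurs', 'Lakers', 'Spurs'],
    'is_playoffs': [1, 1, 0, 1, 1, 0],
    'pts': [100, 110, 95, 90, 105, 120]})

top_two = (
    games[games['is_playoffs'] == 1]     # keep only playoff rows
    .groupby('fran_id')['pts'].sum()     # total points per franchise
    .sort_values(ascending=False)        # highest totals first
    .head(2)                             # keep the top two
    .reset_index()                       # turn fran_id back into a column
)
print(top_two)
```

The regular-season rows (is_playoffs equal to 0) are dropped before the totals are calculated, so the Spurs' 120-point game does not count.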

In [None]:
# Create a line plot of the Elo rating over time for the Boston Celtics
team = 'Celtics'

boston_df = nba_elo_df[nba_elo_df['fran_id'] == team]
sns.lineplot(data=boston_df, x='year_id', y='elo_n')

# plot formatting using matplotlib
plt.title(f'{team} Elo rating by year'); # we can use python fstrings to dynamically update our plots for us
# the ; here suppresses text output from the plt function. 
# Depending on your Jupyter settings, you may or may not see this text

In [None]:
# Create a scatter plot of the Elo rating vs. the Equivalent number of wins for all games in the 2014-2015 season
# (year_id labels a season by the year in which it ended, so the 2014-2015 season is year_id 2015)
season_df = nba_elo_df[nba_elo_df['year_id'] == 2015]
sns.scatterplot(data=season_df, x='elo_n', y='win_equiv')

# Plot formatting
plt.title('Elo rating vs Equivalent number of wins for all games in 2014-2015 season');

In [None]:
# Create a box plot of the Elo rating by team for the 2014-2015 season
# (year_id labels a season by the year in which it ended, so the 2014-2015 season is year_id 2015)
season_df = nba_elo_df[nba_elo_df['year_id'] == 2015]
sns.boxplot(data=season_df, x='fran_id', y='elo_n')

# plot formatting
plt.xticks(rotation=90); # the ; here suppresses text output from the plt function. Depending on your Jupyter settings, you may or may not see this text
plt.xlabel('Franchise ID') 
plt.ylabel('Team Elo following game')
plt.title('Elo rating by team for 2014-2015 season');

In [None]:
# we can also examine this data with a pivot table showing the historical mean points each
# team scored based on location (Home-H, Away-A) and whether they won (W) or lost (L)

pivot_table = pd.pivot_table(nba_elo_df, index=['fran_id', 'game_location'], columns='game_result', values='pts', aggfunc='mean')

print(pivot_table.head(15)) # Print the first 15 rows of the output
print('\n') # Add a blank line between output

# Print the same table but let's display only 1 decimal point
print(pivot_table.head(15).round(1))

In [None]:
# Now plot the pivot table to make it quick to reference highs (dark blue) and lows (light yellow)

# This is a lot of data to plot, so we have to manually set the size of the plot so that everything is visible
# Since Seaborn is based on Matplotlib, we can use it to our advantage and set the size of the plot 
# by making a 1x1 subplot in the size desired.
fig, ax = plt.subplots(figsize=(5, 40))

# plot the data in a heatmap with the YlGnBu colormap and place it in the current plot ax
sns.heatmap(data=pivot_table, cmap='YlGnBu', ax=ax); 
plt.show()

# Seaborn has great colormaps available and you can define your own
# Find more info here: https://seaborn.pydata.org/tutorial/color_palettes.html

In [None]:
# Create a heatmap showing the number of victories by year and game location
victories_df = nba_elo_df[nba_elo_df['game_result'] == 'W'].groupby(['year_id', 'game_location'])['game_result'].count().reset_index()
victories_pivot = victories_df.pivot_table(values='game_result', index='game_location', columns='year_id', aggfunc='sum')

sns.heatmap(data=victories_pivot, cmap='YlGnBu');

[Return to top](#Start)
***
<a name="End"></a>

# END