<h1 align="center">Python Data Science Guides</h1>
<h2 align="center">Pandas - Creating DataFrames</h2>

&nbsp;

### Contents

Section 1 - Introduction to Pandas

Section 2 - Importing and Configuring Pandas

Section 3 - Creating DataFrames

Section 4 - Exploring DataFrames

Section 5 - Renaming Rows and Columns

Conclusion

&nbsp;

### Overview

Pandas is one of the most important data science libraries in Python and is essential for working with large amounts of data. The notebooks in this guide give an introduction to the library, and provide resources to dive deeper into the complexities of the package.

The example data used here was collected by an airline to study customer satisfaction across a number of categories for ~130,000 passengers. The dataset also includes information about the passengers themselves, such as their age, gender, and whether they are a first-time flyer with the airline or a returning passenger. More information on this dataset can be found [here](https://www.kaggle.com/datasets/mysarahmadbhat/airline-passenger-satisfaction).

&nbsp;


<h2 align="center">Section 1 - Introduction to Pandas</h2>

### 1.1 - Pandas Overview

Pandas is a third-party data science library for Python and is one of the core libraries in the PyData stack - a collection of libraries for working with data. Other popular libraries in this group include NumPy (which provides mathematical tools), Matplotlib (which provides graphing tools) and SciPy (which provides scientific tools). scikit-learn and TensorFlow are also commonly used in the PyData stack, contributing machine learning tools.

Pandas makes it simple to work with tabular data through its two-dimensional data structures called **DataFrames**. These are tables made up of rows and columns, and can be used to store and manipulate data. Pandas also supports reading data from many sources (such as the web, Excel files, CSVs and SQL databases), and provides a wide range of features for manipulating the data once it has been loaded. These features include many operations familiar from SQL, such as aggregation and filtering, but extend far beyond the capabilities of most data analytics tools. This series of guides aims to cover the main functions included in the Pandas library. A complete list can be found in the [Pandas Documentation](https://pandas.pydata.org/docs/).

Pandas is built on top of the NumPy library, which gives a significant performance boost over pure Python. As a result, Pandas is very powerful, and is the data science library of choice for many developers today.

&nbsp;

### 1.2 - The Two Main Data Structures in Pandas

Pandas introduces a few new data structures to Python, the two most important being DataFrames and Series. A brief overview of both is given below, and more can be found in the [documentation](https://pandas.pydata.org/docs/user_guide/dsintro.html).

&nbsp;

**DataFrames:**

DataFrames are two-dimensional data structures that store tabular data, much like an Excel spreadsheet. They behave much like Python dictionaries in which each column name maps to a column of data, and so are called *dict-like* objects. The reason for this will become clearer later on, when using column names as *keys* to access column data. DataFrames are the main addition that Pandas brings to the Python language.

**Series:**

Series are one-dimensional data structures, and form the individual columns of a DataFrame. Series objects are also *dict-like* and contain an index which can be used to access each value. Sometimes the index is an integer, but it can also be a textual label describing what the value represents. As shown later in these notebooks, subsets of data can be extracted from DataFrames. If the subset is two-dimensional then another DataFrame is returned; if the subset is one-dimensional then a Series is returned. The index of that Series will be the row labels (when a column is extracted) or the column labels (when a row is extracted), describing what the values are.
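
The difference between a one-dimensional and a two-dimensional subset can be sketched with a small example. The DataFrame below is hypothetical toy data and is not part of the airline dataset.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({'Name': ['Adam', 'Bob', 'Charlie'],
                   'Age': [28, 22, 18]})

one_column = df['Age']            # a single column -> Series
two_dim = df[['Name', 'Age']]     # a list of columns -> DataFrame
one_row = df.loc[0]               # a single row -> Series

print(type(one_column).__name__)  # Series
print(type(two_dim).__name__)     # DataFrame
print(list(one_column.index))     # row labels: [0, 1, 2]
print(list(one_row.index))        # column labels: ['Name', 'Age']
```

Note that the index of the extracted column holds the row labels, while the index of the extracted row holds the column labels.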

&nbsp;

### 1.3 - Data Types in Pandas

DataFrames and Series objects can store many different data types, including those introduced in NumPy. Below is an overview of the data types that are not included in the Python standard library, as well as standard data types which are given different names.

&nbsp;

| Pandas Data Type |      Python Equivalent      |                                              Description                                              |
|:----------------:|:---------------------------:|:-----------------------------------------------------------------------------------------------------:|
|      object      | string, or mixed data types |                        Data made up entirely of strings, or of mixed data types                        |
|       int64      |             int             |                                          64-bit whole numbers                                          |
|      float64     |            float            |                                 64-bit floating-point (decimal) numbers                                |
|       bool       |             bool            |                                      Boolean values (True/False)                                      |
|    datetime64    |              -              |                               Date and time values stored using 64-bits                               |
|     timedelta    |              -              |                                 Differences between datetime64 values                                 |
|     category     |              -              | Used for categorical data, appears to be strings but is stored internally as integers for performance |

&nbsp;

<style>
table,td,tr,th {border:none!important}
</style>
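
As a rough illustration of the data types in the table above, the sketch below builds a small DataFrame from hypothetical values and prints the type of each column. It assumes a Pandas version similar to the one used in this guide (1.3.4); newer releases may report a dedicated string type instead of `object` for text columns.

```python
import pandas as pd

# Hypothetical toy data covering several of the data types listed above
df = pd.DataFrame({
    'name': ['Adam', 'Bob', 'Adam'],        # text
    'age': [28, 22, 18],                    # whole numbers
    'height': [1.80, 1.75, 1.68],           # decimal numbers
    'member': [True, False, True],          # Boolean values
    'joined': pd.to_datetime(['2021-01-05', '2021-03-17', '2022-02-01']),
})

# Subtracting two datetime columns gives a timedelta column
df['since_first'] = df['joined'] - df['joined'].min()

# Convert repeated text values to the category type
df['name_cat'] = df['name'].astype('category')

print(df.dtypes)
```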

### 1.4 - *Preview then Save Changes* Philosophy

Pandas follows the philosophy that changes to a DataFrame or Series object should not be permanent unless explicitly specified. That is, many of the methods introduced below return a modified copy, which is shown as a preview in the cell output, while the original object is left unchanged. As a result, the changes appear to have been reversed in the next cell. To save changes, either overwrite the variable containing the object (e.g. `df = df.some_method()`), or pass `inplace=True` if the method offers an `inplace` parameter. The `inplace` parameter tells Pandas whether to update the object itself rather than return a new one, and is a common parameter in Pandas methods.
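
A minimal sketch of this behaviour is shown below, using the `rename` method on a hypothetical two-column DataFrame. The column names are chosen purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Adam', 'Bob'], 'Age': [28, 22]})

# 1. Calling a method returns a new object; df itself is untouched
preview = df.rename(columns={'Age': 'Age (years)'})
print(list(preview.columns))  # ['Name', 'Age (years)']
print(list(df.columns))       # ['Name', 'Age']

# 2. Save the change by overwriting the variable
df = df.rename(columns={'Age': 'Age (years)'})

# 3. Or update the object directly with inplace=True (the method then returns None)
df.rename(columns={'Name': 'First Name'}, inplace=True)
print(list(df.columns))       # ['First Name', 'Age (years)']
```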

&nbsp;


<h2 align="center">Section 2 - Importing and Configuring Pandas</h2>

### 2.1 - Import Pandas

Pandas is usually imported as `pd`, a convention used throughout the official Pandas documentation. To see which version of Pandas is running, access the `pd.__version__` attribute. As stated above, Pandas is built on top of NumPy, and it also relies on a number of other dependencies. To see a list of these dependencies and the installed version of each (as well as some system information), call the `pd.show_versions` function.

In [781]:
import pandas as pd

print(f'Pandas version: {pd.__version__}')
pd.show_versions()  # prints the report itself and returns None, so no print() is needed

Pandas version: 1.3.4

INSTALLED VERSIONS
------------------
commit           : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5
python           : 3.9.7.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 21.3.0
Version          : Darwin Kernel Version 21.3.0: Wed Jan  5 21:37:58 PST 2022; root:xnu-8019.80.24~20/RELEASE_ARM64_T8101
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8

pandas           : 1.3.4
numpy            : 1.20.3
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 21.2.4
setuptools       : 58.0.4
Cython           : 0.29.24
pytest           : 6.2.4
hypothesis       : None
sphinx           : 4.2.0
blosc            : None
feather          : None
xlsxwriter       : 3.0.1
lxml.etree       : 4.6.3
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.3
IPython          : 7.29.0
pandas_datareader: 

### 2.2 - Change the Default Display Options

By default, only a limited number of rows and columns are displayed when printing a DataFrame in a Jupyter notebook, and the rest are truncated with an ellipsis. The maximum number of rows and columns can be changed with the `pd.set_option` function by passing `display.max_rows` or `display.max_columns` as the first argument, followed by the desired number of rows/columns.

In [782]:
pd.set_option('display.max_rows',100)
pd.set_option('display.max_columns',100)
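
Display options can also be read, changed temporarily, or reset. The sketch below uses the `pd.get_option`, `pd.option_context` and `pd.reset_option` functions; the value of 10 is an arbitrary choice for illustration.

```python
import pandas as pd

before = pd.get_option('display.max_rows')       # read the current setting

# Temporarily change the setting for a single block of code
with pd.option_context('display.max_rows', 10):
    inside = pd.get_option('display.max_rows')   # 10 while inside the block

after = pd.get_option('display.max_rows')        # restored on leaving the block
print(before, inside, after)

# Return an option to the library default
pd.reset_option('display.max_columns')
```

Using `option_context` may be preferable when a larger display limit is only needed for one or two cells.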

<h2 align="center">Section 3 - Creating DataFrames</h2>

### 3.1 - Create a DataFrame using `pd.DataFrame()`

DataFrames can be created in multiple ways, one of which is by using the `pd.DataFrame` constructor. The constructor takes the data to use in the DataFrame as its first argument, as well as some optional arguments. The data can be in the form of a NumPy ndarray, a Python iterable (including lists and tuples), a dictionary, or another DataFrame. Some of the most commonly used arguments are summarised below, and a complete list can be found [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).


|         |                                                                                                                                                                                                                    |
|---------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| *index*   | The index (row labels) to use. This can be a Pandas Index object or any array-like object (including Python lists). If `None` the default is to enumerate each row starting from 0. |
| *columns* | The column labels to use. As with the *index* argument, this can be a Pandas Index object or any array-like object (including Python lists). If `None` the default is to enumerate each column starting from 0. |
| *dtype*   | A single data type to force the elements in the DataFrame to take. If `None` the data type of each column is inferred. |
| *copy*    | Whether to copy the input data. If `None` (the default), dictionary data is copied but ndarray data is not. |

&nbsp;

<style>
table,td,tr,th {border:none!important}
</style>
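
The optional arguments in the table above can be sketched together as follows. The scores and labels are hypothetical values chosen for illustration.

```python
import pandas as pd

# Hypothetical scores for three students across two tests
data = [[71, 88], [64, 92], [80, 75]]

scores_df = pd.DataFrame(
    data,
    index=['Adam', 'Bob', 'Charlie'],   # row labels
    columns=['Test 1', 'Test 2'],       # column labels
    dtype='float64',                    # force every element to be a float
)

print(scores_df)
print(scores_df.loc['Bob', 'Test 2'])   # 92.0
```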

### 3.2 - Create a DataFrame from a Dictionary using `pd.DataFrame()`

As stated above, the *data* parameter in the `pd.DataFrame` constructor accepts Python dictionaries as one source of data. This example uses a dictionary of data about 5 people, including their name, age and job. The keys of the dictionary form the column names, and the lists making up the values form the columns.


In [783]:
person_dict = {'Name': ['Adam', 'Bob', 'Charlie', 'Dan', 'Edward'],
               'Age': [28,22,18,23,25],
               'Job': ['Electrician', 'Programmer', 'Builder', 'Doctor', 'Accountant']}

person_df = pd.DataFrame(person_dict)
person_df

,Name,Age,Job
0,Adam,28,Electrician
1,Bob,22,Programmer
2,Charlie,18,Builder
3,Dan,23,Doctor
4,Edward,25,Accountant


### 3.3 - Create a DataFrame from an ndarray using `pd.DataFrame()`

The `pd.DataFrame` constructor also accepts ndarrays as a source of data. This example uses a 5-row by 3-column NumPy array populated with random numbers to create a DataFrame. Generating random numbers in this way is a quick way to build an example DataFrame from toy data. Note that the *columns* argument is more useful when working with NumPy arrays, since the array does not have textual names like the keys in a dictionary. The *columns* argument is still optional, however, and if it is `None` the column names default to integers starting from 0.

In [784]:
import numpy as np

number_df = pd.DataFrame(np.random.rand(5,3), columns=['Col A', 'Col B', 'Col C'])
number_df

,Col A,Col B,Col C
0,0.501332,0.172422,0.591807
1,0.686454,0.045325,0.285053
2,0.604577,0.394338,0.575509
3,0.897241,0.854519,0.729816
4,0.804204,0.294538,0.688883


### 3.4 - Create a DataFrame from a Dictionary using `pd.DataFrame.from_dict()`

The `pd.DataFrame.from_dict` method works near-identically to the regular `pd.DataFrame` constructor when passed a dictionary, but offers one additional optional argument: *orient*. If *orient* is set to `'index'` the data is transposed, so each key becomes a row label and its values form that row. In this case the *columns* argument may be useful to give names to the new columns.

|           |                                                                                                                                                                                                                                                                                                                                                                        |
|-----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| *orient*  | Handles the orientation of the data. If `None` the default values is *columns* which creates a column from each of the values using the key as the column header, that is it defaults to the behaviour of the regular constructor. If *index* is passed the data is transposed, and the values become rows instead of columns, with the keys forming the index labels. |
| *index*   | See `pd.DataFrame()`                                                                                                                                                                                                                                                                                                                                                   |
| *columns* | See `pd.DataFrame()`                                                                                                                                                                                                                                                                                                                                                   |
| *dtype*   | See `pd.DataFrame()`                                                                                                                                                                                                                                                                                                                                                   |
| *copy*    | See `pd.DataFrame()`                                                                                                   
<style>
table,td,tr,th {border:none!important}
</style>

In [785]:
person_df = pd.DataFrame.from_dict(person_dict, orient='index')
person_df

,0,1,2,3,4
Name,Adam,Bob,Charlie,Dan,Edward
Age,28,22,18,23,25
Job,Electrician,Programmer,Builder,Doctor,Accountant
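
When the dictionary holds one row per key, the *columns* argument can be combined with `orient='index'` to name the resulting columns. The sketch below assumes a hypothetical dictionary in that layout.

```python
import pandas as pd

# Hypothetical data: each key holds one row rather than one column
rows = {'Adam': [28, 'Electrician'],
        'Bob': [22, 'Programmer'],
        'Charlie': [18, 'Builder']}

# The keys become the index labels; `columns` names the resulting columns
people_df = pd.DataFrame.from_dict(rows, orient='index', columns=['Age', 'Job'])
print(people_df)
```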


### 3.5 - Create a DataFrame from a Comma-Separated Values (.csv) File using `read_csv()`

Comma-separated values (.csv) files are a common way to store tabular data. Pandas provides the `pd.read_csv` function, which takes the source of the data as a required argument along with many optional arguments. A few of the most commonly used arguments are described below, and a complete list can be found [here](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv).

|                               |                                                                                                                                                                                                                                                                         |
|-------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| filepath_or_buffer (required) | An absolute or relative filepath, or a URL, pointing to a .csv file |
| sep                           | The separator character used to divide values in the file. The default is a comma. Other common formats use tabs to separate data, in so-called *tab-separated values* (.tsv) files |
| header                        | The row number (or list of row numbers) containing the column names, which also marks the start of the data. By default the first row of the file (row 0) is used, unless *names* is passed. Pass `None` if the file has no header row. |
| names                         | A list of column names to use. If the file already contains a header row, set the *header* argument to 0 explicitly so that the row is replaced rather than read as data. |
| index_col                     | A column name or position (or a list of them) to use as the index/indices. If `None` an index will be created enumerating from 0. This argument is useful when a dataset already features a column to uniquely identify each row (such as an ID column). |
| usecols                       | A list of integers or column names specifying which columns should be read. If `None` every column will be used to create the DataFrame. |
| dtype                         | The data type to use for the whole DataFrame, or a dictionary mapping column names to data types (this also accepts NumPy data types such as np.float32) |
| skiprows                      | A list of row numbers to skip (counting from 0), or an integer number of rows to skip at the top of the file. If `None` every row will be read. |
| skipfooter                    | The number of rows to skip at the bottom of the file when reading the data. The default is 0, so every row will be read. |
| na_values                     | A list of additional values to treat as NaN values. If `None` only the following values will be replaced with NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’. |

&nbsp;

<style>
table,td,tr,th {border:none!important}
</style>

In [786]:
# Read in a CSV file
airline_df = pd.read_csv('airline_passenger_satisfaction.csv')
airline_df

,ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,Ease of Online Booking,Check-in Service,Online Boarding,Gate Location,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
0,1,Male,48,First-time,Business,Business,821,2,5.0,3,3,4,3,3,3,5,2,5,5,5,3,5,5,Neutral or Dissatisfied
1,2,Female,35,Returning,Business,Business,821,26,39.0,2,2,3,5,2,5,4,5,5,3,5,2,5,5,Satisfied
2,3,Male,41,Returning,Business,Business,853,0,0.0,4,4,4,5,4,3,5,3,5,5,3,4,3,3,Satisfied
3,4,Male,50,Returning,Business,Business,1905,0,0.0,2,2,3,4,2,5,5,5,4,4,5,2,5,5,Satisfied
4,5,Female,49,Returning,Business,Business,3470,0,1.0,3,3,3,5,3,3,4,4,5,4,3,3,3,3,Satisfied
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129875,129876,Male,28,Returning,Personal,Economy Plus,447,2,3.0,4,4,4,4,2,5,1,4,4,4,5,4,4,4,Neutral or Dissatisfied
129876,129877,Male,41,Returning,Personal,Economy Plus,308,0,0.0,5,3,5,3,4,5,2,5,2,2,4,3,2,5,Neutral or Dissatisfied
129877,129878,Male,42,Returning,Personal,Economy Plus,337,6,14.0,5,2,4,2,1,3,3,4,3,3,4,2,3,5,Neutral or Dissatisfied
129878,129879,Male,50,Returning,Personal,Economy Plus,337,31,22.0,4,4,3,4,1,4,4,5,3,3,4,5,3,5,Satisfied


In [787]:
# Read in only the ID, Gender and Age columns 
airline_df = pd.read_csv('airline_passenger_satisfaction.csv', usecols=['ID', 'Gender', 'Age'])
airline_df

,ID,Gender,Age
0,1,Male,48
1,2,Female,35
2,3,Male,41
3,4,Male,50
4,5,Female,49
...,...,...,...
129875,129876,Male,28
129876,129877,Male,41
129877,129878,Male,42
129878,129879,Male,50


In [788]:
# Set the index to the ID column
airline_df = pd.read_csv('airline_passenger_satisfaction.csv', index_col='ID')
airline_df

ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,Ease of Online Booking,Check-in Service,Online Boarding,Gate Location,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
1,Male,48,First-time,Business,Business,821,2,5.0,3,3,4,3,3,3,5,2,5,5,5,3,5,5,Neutral or Dissatisfied
2,Female,35,Returning,Business,Business,821,26,39.0,2,2,3,5,2,5,4,5,5,3,5,2,5,5,Satisfied
3,Male,41,Returning,Business,Business,853,0,0.0,4,4,4,5,4,3,5,3,5,5,3,4,3,3,Satisfied
4,Male,50,Returning,Business,Business,1905,0,0.0,2,2,3,4,2,5,5,5,4,4,5,2,5,5,Satisfied
5,Female,49,Returning,Business,Business,3470,0,1.0,3,3,3,5,3,3,4,4,5,4,3,3,3,3,Satisfied
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129876,Male,28,Returning,Personal,Economy Plus,447,2,3.0,4,4,4,4,2,5,1,4,4,4,5,4,4,4,Neutral or Dissatisfied
129877,Male,41,Returning,Personal,Economy Plus,308,0,0.0,5,3,5,3,4,5,2,5,2,2,4,3,2,5,Neutral or Dissatisfied
129878,Male,42,Returning,Personal,Economy Plus,337,6,14.0,5,2,4,2,1,3,3,4,3,3,4,2,3,5,Neutral or Dissatisfied
129879,Male,50,Returning,Personal,Economy Plus,337,31,22.0,4,4,3,4,1,4,4,5,3,3,4,5,3,5,Satisfied
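
Several of the other `read_csv` arguments can be sketched without needing a file on disk by reading from an in-memory string. The file contents, column names and the `'unknown'` placeholder below are hypothetical and chosen only for illustration.

```python
import io
import pandas as pd

# Hypothetical file contents, held in memory so the example is self-contained
raw = """Exported by booking system
id,gender,age
1,Male,48
2,Female,unknown
3,Male,41
"""

small_df = pd.read_csv(
    io.StringIO(raw),
    skiprows=1,                      # skip the note above the header row
    header=0,                        # the first remaining row holds the column names...
    names=['ID', 'Gender', 'Age'],   # ...which are replaced by these names
    index_col='ID',                  # use the ID column as the row labels
    na_values=['unknown'],           # treat this extra string as a missing value
)
print(small_df)
```

Because one value in the `Age` column is read as missing, Pandas stores that column as floats rather than integers.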


<h2 align="center">Section 4 - Exploring DataFrames</h2>

### 4.1 - Display Rows using `head()`

Both DataFrame and Series objects have a `head` method which shows the first *n* rows of data. If no argument is passed *n* defaults to 5, but it can be set to any integer. If a negative value is passed for *n*, all rows are displayed except for the final |*n*| rows. For example, `df.head(-3)` is equivalent to `df[:-3]`.

In [789]:
airline_df.head(3)

ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,Ease of Online Booking,Check-in Service,Online Boarding,Gate Location,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
1,Male,48,First-time,Business,Business,821,2,5.0,3,3,4,3,3,3,5,2,5,5,5,3,5,5,Neutral or Dissatisfied
2,Female,35,Returning,Business,Business,821,26,39.0,2,2,3,5,2,5,4,5,5,3,5,2,5,5,Satisfied
3,Male,41,Returning,Business,Business,853,0,0.0,4,4,4,5,4,3,5,3,5,5,3,4,3,3,Satisfied


In [790]:
# Same result using a negative n: 129,880 rows minus the final 129,877 leaves the first 3
airline_df.head(-129877)

ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,Ease of Online Booking,Check-in Service,Online Boarding,Gate Location,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
1,Male,48,First-time,Business,Business,821,2,5.0,3,3,4,3,3,3,5,2,5,5,5,3,5,5,Neutral or Dissatisfied
2,Female,35,Returning,Business,Business,821,26,39.0,2,2,3,5,2,5,4,5,5,3,5,2,5,5,Satisfied
3,Male,41,Returning,Business,Business,853,0,0.0,4,4,4,5,4,3,5,3,5,5,3,4,3,3,Satisfied


### 4.2 - Display Rows using `tail()`

The `tail` method is very similar to the `head` method, and shows the final *n* rows. Both DataFrame and Series objects have a `tail` method, and as before 5 rows are shown if no argument is passed. If a negative value is passed for *n*, all rows are displayed except for the first |*n*| rows. For example, `df.tail(-3)` is equivalent to `df[3:]`.

In [791]:
airline_df.tail(3)

ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,Ease of Online Booking,Check-in Service,Online Boarding,Gate Location,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
129878,Male,42,Returning,Personal,Economy Plus,337,6,14.0,5,2,4,2,1,3,3,4,3,3,4,2,3,5,Neutral or Dissatisfied
129879,Male,50,Returning,Personal,Economy Plus,337,31,22.0,4,4,3,4,1,4,4,5,3,3,4,5,3,5,Satisfied
129880,Female,20,Returning,Personal,Economy Plus,337,0,0.0,1,3,4,3,2,4,2,4,2,2,2,3,2,1,Neutral or Dissatisfied


In [792]:
# Same result using a negative n: 129,880 rows minus the first 129,877 leaves the final 3
airline_df.tail(-129877)

ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,Ease of Online Booking,Check-in Service,Online Boarding,Gate Location,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
129878,Male,42,Returning,Personal,Economy Plus,337,6,14.0,5,2,4,2,1,3,3,4,3,3,4,2,3,5,Neutral or Dissatisfied
129879,Male,50,Returning,Personal,Economy Plus,337,31,22.0,4,4,3,4,1,4,4,5,3,3,4,5,3,5,Satisfied
129880,Female,20,Returning,Personal,Economy Plus,337,0,0.0,1,3,4,3,2,4,2,4,2,2,2,3,2,1,Neutral or Dissatisfied
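
The behaviour of negative values in `head` and `tail` can be checked on a small example. The sketch below uses a hypothetical 10-row DataFrame rather than the airline data so that the results are easy to verify by eye.

```python
import pandas as pd

df = pd.DataFrame({'value': range(10)})   # toy data: 10 rows labelled 0 to 9

# head(-3): everything except the final 3 rows
assert df.head(-3).equals(df[:-3])
print(list(df.head(-3).index))   # [0, 1, 2, 3, 4, 5, 6]

# tail(-3): everything except the first 3 rows
assert df.tail(-3).equals(df[3:])
print(list(df.tail(-3).index))   # [3, 4, 5, 6, 7, 8, 9]
```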


### 4.3 - Display Dimensions using `shape`

The `shape` attribute of a DataFrame returns a tuple containing the number of rows and columns (for a Series, the tuple contains only the number of rows). This can be useful for accessing the number of rows or columns programmatically, such as in *for loops*.

In [793]:
airline_df.shape

(129880, 23)
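
As a brief sketch of using `shape` programmatically, the example below unpacks the tuple and uses the row count to drive a loop. The DataFrame is hypothetical toy data.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Adam', 'Bob', 'Charlie'], 'Age': [28, 22, 18]})

n_rows, n_cols = df.shape          # unpack the (rows, columns) tuple
print(n_rows, n_cols)              # 3 2
print(df['Age'].shape)             # (3,) since a Series has only one dimension

# Use the row count to loop over row positions
for i in range(n_rows):
    print(i, df.iloc[i]['Name'])
```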

### 4.4 - Display Column Information using `info()`

The `info` method prints a summary of each column, including the column name, count of non-null values and data type. Each column is assigned an integer value starting from 0, which corresponds to the column order.

In [794]:
airline_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 129880 entries, 1 to 129880
Data columns (total 23 columns):
 #   Column                                  Non-Null Count   Dtype  
---  ------                                  --------------   -----  
 0   Gender                                  129880 non-null  object 
 1   Age                                     129880 non-null  int64  
 2   Customer Type                           129880 non-null  object 
 3   Type of Travel                          129880 non-null  object 
 4   Class                                   129880 non-null  object 
 5   Flight Distance                         129880 non-null  int64  
 6   Departure Delay                         129880 non-null  int64  
 7   Arrival Delay                           129487 non-null  float64
 8   Departure and Arrival Time Convenience  129880 non-null  int64  
 9   Ease of Online Booking                  129880 non-null  int64  
 10  Check-in Service                        1298

### 4.5 - Display Data Types using `dtypes`

If only the data type of each column is required (and not the full information provided using `info()`), then the `dtypes` attribute returns a Series of the data types for each column.

In [795]:
airline_df.dtypes

Gender                                     object
Age                                         int64
Customer Type                              object
Type of Travel                             object
Class                                      object
Flight Distance                             int64
Departure Delay                             int64
Arrival Delay                             float64
Departure and Arrival Time Convenience      int64
Ease of Online Booking                      int64
Check-in Service                            int64
Online Boarding                             int64
Gate Location                               int64
On-board Service                            int64
Seat Comfort                                int64
Leg Room Service                            int64
Cleanliness                                 int64
Food and Drink                              int64
In-flight Service                           int64
In-flight Wifi Service                      int64
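
Because `dtypes` returns a Series indexed by column name, it can be queried like any other Series. The sketch below uses a hypothetical three-column DataFrame whose column names are borrowed from the airline data.

```python
import pandas as pd

df = pd.DataFrame({'Age': [48, 35],
                   'Arrival Delay': [5.0, 39.0],
                   'Flight Distance': [821, 853]})

types = df.dtypes                        # a Series indexed by column name
print(types['Arrival Delay'])            # float64

# Select the names of the columns that hold 64-bit integers
int_columns = list(types[types == 'int64'].index)
print(int_columns)                       # ['Age', 'Flight Distance']
```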


<h2 align="center">Section 5 - Renaming Rows and Columns</h2>

There are a number of ways to rename row and column labels depending on the situation. All of the different ways are detailed in the following cells, but a guide for which method may be best is given below:

|                                                 |                                                                           |
|-------------------------------------------------|---------------------------------------------------------------------------|
| Rename columns when reading from a file         | See *3.5 - Create a DataFrame from a Comma-Separated Values (.csv) File using `read_csv()`* |
| Simply rename all columns                       | See *5.4 - Rename All Columns by Overwriting the `columns` Attribute* |
| Simply rename all rows                          | See *5.5 - Rename All Rows by Overwriting the `index` Attribute*      |
| Rename only specific rows/columns               | See *5.1 - Rename Specific Rows or Columns using  `rename()`*         |
| Rename all rows/columns with additional options | See *5.2 - Rename All Rows using `set_index()`* or *5.3 - Rename All Rows or Columns using `set_axis()`* |
| Replace substring in row/column names           | See *5.6 - Edit Row or Column Names using `str.replace()`*            |
| Add prefix to all column names                  | See *5.7 - Edit Column Names using `add_prefix()`*                    |
| Add suffix to all column names                  | See *5.8 - Edit Column Names using `add_suffix()`*                    |

&nbsp;

<style>
table,td,tr,th {border:none!important}
</style>

### 5.1 - Rename Specific Rows or Columns using `rename()`

The `rename` method can be used to rename any number of rows or columns for both DataFrame and Series objects. Its first argument, *mapper*, takes a dictionary or function that defines how each existing label maps to its new name. To state whether rows or columns are to be changed, the *axis* argument can be passed. The most common arguments are summarised below. A full list of arguments can be found [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename).

|                    |                                                                                                                 |
|--------------------|-----------------------------------------------------------------------------------------------------------------|
| *mapper* (required) | A dictionary or function which maps the column(s) or row(s) to change to the new names.                         |
| *inplace*          | If `True`, modify the DataFrame/Series itself (the method then returns `None`). The default is `False`, which returns a renamed copy and leaves the original object unchanged. Leaving this argument out is useful if a preview of a change is needed, without committing any permanent change to the object.   |
| *axis*             | Set the axis to apply changes along: 0 (or 'index') changes the row names (this is the default), 1 (or 'columns') changes the column names. |

<style>
table,td,tr,th {border:none!important}
</style>

In [796]:
# Rename the column names and overwrite the old DataFrame
person_df = person_df.rename({0: 'Person #0',
                              1: 'Person #1',
                              2: 'Person #2',
                              3: 'Person #3',
                              4: 'Person #4'},
                             axis=1)
person_df

Unnamed: 0,Person #0,Person #1,Person #2,Person #3,Person #4
Name,Adam,Bob,Charlie,Dan,Edward
Age,28,22,18,23,25
Job,Electrician,Programmer,Builder,Doctor,Accountant


In [797]:
# Rename the column names in-place
person_df.rename({'Person #0':'Person 0', 
                  'Person #1':'Person 1',
                  'Person #2':'Person 2',
                  'Person #3':'Person 3',
                  'Person #4':'Person 4'},
                  axis=1,
                  inplace=True)
person_df

Unnamed: 0,Person 0,Person 1,Person 2,Person 3,Person 4
Name,Adam,Bob,Charlie,Dan,Edward
Age,28,22,18,23,25
Job,Electrician,Programmer,Builder,Doctor,Accountant


In [798]:
# Rename the row names (inplace is not set, so this only previews the change and person_df is left unchanged)
person_df.rename({'Name':'name', 'Age': 'age', 'Job': 'job'})

Unnamed: 0,Person 0,Person 1,Person 2,Person 3,Person 4
name,Adam,Bob,Charlie,Dan,Edward
age,28,22,18,23,25
job,Electrician,Programmer,Builder,Doctor,Accountant
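
The cells above only pass dictionaries to `rename`. The sketch below shows the other forms mentioned in the text: a function as the mapper, and the `index` and `columns` keywords as an alternative to *axis*. The `people` DataFrame is a hypothetical stand-in for `person_df`, so the cell can be run on its own.

```python
import pandas as pd

# Hypothetical stand-in for person_df (an assumption for illustration)
people = pd.DataFrame(
    {'Person 0': ['Adam', 28], 'Person 1': ['Bob', 22]},
    index=['Name', 'Age'],
)

# A function is applied to every label on the chosen axis
lowered = people.rename(str.lower, axis=0)

# The index and columns keywords rename both axes in a single call
renamed = people.rename(index={'Name': 'name'}, columns={'Person 0': 'P0'})

# Labels in the mapping that do not exist are ignored by default
unchanged = people.rename({'Missing': 'x'}, axis=1)

print(lowered.index.tolist())
print(renamed.columns.tolist())
```

None of these calls use `inplace=True`, so `people` keeps its original labels throughout.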


### 5.2 - Rename All Rows using `set_index()`

An Index object is an immutable sequence in Pandas used for labelling rows and columns (the labels are usually unique, although this is not enforced by default). These objects are used to read or modify the existing row and column labels in a DataFrame or Series. For example, `df.index` returns an Index of the row labels, and `df.columns` returns an Index of the column labels. To replace the row labels, use the `set_index` method. This method takes 1 required argument and 4 optional arguments:

|                    |                                                                                                                                                                                                                                                             |
|--------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
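| *keys* (required)  | A label for a column in the DataFrame, or an array-like containing row labels matching the length of the DataFrame to use as the row labels. Array-like data includes Series, Index and NumPy ndarrays. Note that a plain Python list is treated as a list of column labels (or arrays) rather than as the row labels themselves. |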
| *drop*             | If a column in the DataFrame is passed to the *keys* parameter, the column will be removed from the DataFrame by default. This behaviour can be disabled by setting the *drop* parameter to `False`.                                                        |
| *append*           | Add the new index column alongside the old index column to create multiple indices.                                                                                                                                                                         |
| *inplace*          | If `True`, modify the DataFrame itself (the method then returns `None`). The default is `False`, which returns a new DataFrame and leaves the original unchanged. Leaving this argument out is useful if a preview of a change is needed, without committing any permanent change to the object. |
| *verify_integrity* | If `True` check the index provided is valid by verifying each value uniquely identifies a row (check for duplicates). If `False` postpone this check for later in the program.                                                                                                               |

&nbsp;

<style>
table,td,tr,th {border:none!important}
</style>

In [799]:
# Set the index to the Person 0 column (and drop the column from the DataFrame)
person_df.set_index('Person 0')

Unnamed: 0_level_0,Person 1,Person 2,Person 3,Person 4
Person 0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Adam,Bob,Charlie,Dan,Edward
28,22,18,23,25
Electrician,Programmer,Builder,Doctor,Accountant


In [800]:
# Set the index to the Person 0 column (and do not drop the column from the DataFrame)
person_df.set_index('Person 0', drop=False)

Unnamed: 0_level_0,Person 0,Person 1,Person 2,Person 3,Person 4
Person 0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Adam,Adam,Bob,Charlie,Dan,Edward
28,28,22,18,23,25
Electrician,Electrician,Programmer,Builder,Doctor,Accountant


In [801]:
# Add the new index alongside the old index (append it), this creates multiple indices
person_df.set_index('Person 0', append=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,Person 1,Person 2,Person 3,Person 4
Unnamed: 0_level_1,Person 0,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Name,Adam,Bob,Charlie,Dan,Edward
Age,28,22,18,23,25
Job,Electrician,Programmer,Builder,Doctor,Accountant


In [802]:
# Verify the new index contains a unique value for each row (check for duplicates)
person_df.set_index('Person 0', verify_integrity=True)

Unnamed: 0_level_0,Person 1,Person 2,Person 3,Person 4
Person 0,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Adam,Bob,Charlie,Dan,Edward
28,22,18,23,25
Electrician,Programmer,Builder,Doctor,Accountant
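
The examples above all pass a column label to *keys*. As a rough sketch of the remaining behaviour described in the table, the cell below passes an Index of new row labels and shows what `verify_integrity=True` does when duplicates are present. The `people` DataFrame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data with a repeated name (an assumption for illustration)
people = pd.DataFrame({
    'Name': ['Adam', 'Bob', 'Adam'],
    'Age': [28, 22, 31],
})

# Pass an Index of new row labels, one label per row
labelled = people.set_index(pd.Index(['a', 'b', 'c']))

# verify_integrity=True raises a ValueError when the new labels repeat
try:
    people.set_index('Name', verify_integrity=True)
    duplicates_found = False
except ValueError:
    duplicates_found = True

print(labelled.index.tolist(), duplicates_found)
```

Without `verify_integrity=True` the same call succeeds and produces an index with the label `Adam` appearing twice.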


### 5.3 - Rename All Rows or Columns using `set_axis()`

The `set_axis` method can be used in a similar way to the `set_index` method, but works for both the index and the columns. Unlike `set_index`, a column name cannot be passed as a label: the *labels* argument must be list-like. The method also offers fewer options than `set_index`, taking only 3 arguments, since it simply replaces the labels on one axis and has no equivalent of *drop* or *append*.

|                     |                                                                                                                                                                                                                                                             |
|---------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| *labels* (required) | A list-like containing the new labels, whose length matches the number of rows (for `axis=0`) or columns (for `axis=1`). List-like data includes Series, Index, NumPy ndarrays and lists.                                              |
| *axis*              | Set the axis to apply changes along: 0 (or 'index') changes the row names (this is the default), 1 (or 'columns') changes the column names.                                                                                                                 |
| *inplace*           | If `True`, modify the DataFrame/Series itself. The default is `False`, which returns a relabelled copy and leaves the original unchanged. This argument was deprecated in Pandas 1.5 and removed in Pandas 2.0, so on recent versions the result should be assigned back to a variable instead. |

&nbsp;

<style>
table,td,tr,th {border:none!important}
</style>

In [803]:
# Set the row labels to Person 1 column
person_df.set_axis(person_df['Person 1'], axis=0)

Unnamed: 0_level_0,Person 0,Person 1,Person 2,Person 3,Person 4
Person 1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Bob,Adam,Bob,Charlie,Dan,Edward
22,28,22,18,23,25
Programmer,Electrician,Programmer,Builder,Doctor,Accountant


In [804]:
# Set the column labels
person_df.set_axis(['Col A', 'Col B', 'Col C', 'Col D', 'Col E'], axis=1)

Unnamed: 0,Col A,Col B,Col C,Col D,Col E
Name,Adam,Bob,Charlie,Dan,Edward
Age,28,22,18,23,25
Job,Electrician,Programmer,Builder,Doctor,Accountant
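
Because *labels* must match the length of the chosen axis, a mismatch is rejected. The sketch below uses a hypothetical `people` DataFrame to show `set_axis` on each axis, followed by the error raised when too few labels are supplied.

```python
import pandas as pd

# Hypothetical DataFrame with 2 rows and 2 columns (illustration only)
people = pd.DataFrame({'Name': ['Adam', 'Bob'], 'Age': [28, 22]})

# Replace the row labels (axis=0) and the column labels (axis='columns')
relabelled = people.set_axis(['first', 'second'], axis=0)
recolumned = people.set_axis(['name', 'age'], axis='columns')

# Supplying one label for two columns raises a ValueError
try:
    people.set_axis(['only_one'], axis=1)
    length_error = False
except ValueError:
    length_error = True

print(relabelled.index.tolist(), recolumned.columns.tolist(), length_error)
```

Each call returns a new DataFrame, so `people` still has its original labels afterwards.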


### 5.4 - Rename All Columns by Overwriting the `columns` Attribute

If all the columns in an object need to be renamed without any additional options, it is easier to overwrite the `columns` attribute directly. Accessing `DataFrame.columns` returns an Index containing the name of each column in the DataFrame. The attribute can be set equal to a list or Pandas Index object containing one element per column, where the elements are assigned by position (the first element becomes the name of the first column, and so on).

In [805]:
# Reassign the columns attribute
person_df.columns = ['Column 1', 'Column 2', 'Column 3', 'Column 4', 'Column 5']
person_df

Unnamed: 0,Column 1,Column 2,Column 3,Column 4,Column 5
Name,Adam,Bob,Charlie,Dan,Edward
Age,28,22,18,23,25
Job,Electrician,Programmer,Builder,Doctor,Accountant


### 5.5 - Rename All Rows by Overwriting the `index` Attribute

In a similar way to the `columns` attribute, the `index` attribute can be reassigned to a list-like object whose length is equal to the number of rows in the DataFrame.

In [806]:
# Reassign the index attribute
person_df.index = ['Row 1', 'Row 2', 'Row 3']
person_df

Unnamed: 0,Column 1,Column 2,Column 3,Column 4,Column 5
Row 1,Adam,Bob,Charlie,Dan,Edward
Row 2,28,22,18,23,25
Row 3,Electrician,Programmer,Builder,Doctor,Accountant
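
When there are many rows or columns, the new labels can be generated rather than typed out. The sketch below is one way to do this for both the `columns` and `index` attributes, and shows that a list of the wrong length is rejected. The `people` DataFrame is a hypothetical, smaller stand-in for `person_df`.

```python
import pandas as pd

# Hypothetical 3-row, 2-column DataFrame (illustration only)
people = pd.DataFrame([['Adam', 'Bob'],
                       [28, 22],
                       ['Electrician', 'Programmer']])

# Build one label per column and one label per row
people.columns = [f'Column {i}' for i in range(1, people.shape[1] + 1)]
people.index = [f'Row {i}' for i in range(1, people.shape[0] + 1)]

# Assigning two labels to three rows raises a ValueError
try:
    people.index = ['Row 1', 'Row 2']
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(people.columns.tolist(), people.index.tolist(), mismatch_rejected)
```

Unlike the methods in the earlier sections, assigning to these attributes always changes the DataFrame itself; there is no preview step.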


### 5.6 - Edit Row or Column Names using `str.replace()`

The `index` and `columns` attributes can be overwritten with the result of the `str.replace` method, which replaces a substring in every label. This is useful for tasks such as replacing a space with an underscore, or replacing a certain character with an empty string in order to remove that character from all row/column names.

In [807]:
# Replace spaces with underscores in the column names
person_df.columns = person_df.columns.str.replace(' ', '_')
person_df

Unnamed: 0,Column_1,Column_2,Column_3,Column_4,Column_5
Row 1,Adam,Bob,Charlie,Dan,Edward
Row 2,28,22,18,23,25
Row 3,Electrician,Programmer,Builder,Doctor,Accountant


In [810]:
# Replace spaces with underscores in the row names
person_df.index = person_df.index.str.replace(' ', '_')
person_df

Unnamed: 0,Column_1,Column_2,Column_3,Column_4,Column_5
Row_1,Adam,Bob,Charlie,Dan,Edward
Row_2,28,22,18,23,25
Row_3,Electrician,Programmer,Builder,Doctor,Accountant
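
The second use mentioned above, removing a character entirely, can be sketched as follows. The `scores` DataFrame and its labels are hypothetical, and `regex=False` is passed explicitly so that the pattern is treated as plain text regardless of the Pandas version in use.

```python
import pandas as pd

# Hypothetical DataFrame whose labels contain an unwanted '#' character
scores = pd.DataFrame([[1, 2], [3, 4]],
                      index=['Row #1', 'Row #2'],
                      columns=['Col #A', 'Col #B'])

# Replace '#' with an empty string to remove it from the column names
scores.columns = scores.columns.str.replace('#', '', regex=False)

# Replace the two-character substring ' #' with an underscore in the row names
scores.index = scores.index.str.replace(' #', '_', regex=False)

print(scores.columns.tolist(), scores.index.tolist())
```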


### 5.7 - Edit Column Names using `add_prefix()`

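The `add_prefix` method can be called on a DataFrame to add a prefix to the beginning of each column name; when called on a Series, the prefix is added to each row label instead. The prefix string is the only required argument, and the method returns a new object rather than modifying the original.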

In [808]:
person_df.add_prefix('Person_')

Unnamed: 0,Person_Column_1,Person_Column_2,Person_Column_3,Person_Column_4,Person_Column_5
Row 1,Adam,Bob,Charlie,Dan,Edward
Row 2,28,22,18,23,25
Row 3,Electrician,Programmer,Builder,Doctor,Accountant


### 5.8 - Edit Column Names using `add_suffix()`

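The `add_suffix` method can be called on a DataFrame to add a suffix to the end of each column name; when called on a Series, the suffix is added to each row label instead. The suffix string is the only required argument, and the method returns a new object rather than modifying the original.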

In [809]:
person_df.add_suffix('_Column')

Unnamed: 0,Column_1_Column,Column_2_Column,Column_3_Column,Column_4_Column,Column_5_Column
Row 1,Adam,Bob,Charlie,Dan,Edward
Row 2,28,22,18,23,25
Row 3,Electrician,Programmer,Builder,Doctor,Accountant
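
The sketch below illustrates two points about `add_prefix` and `add_suffix` that the cells above do not show: neither method changes the original object, and on a Series the labels being edited are the row labels. The `people` and `ages` objects are hypothetical.

```python
import pandas as pd

# Hypothetical DataFrame and Series (illustration only)
people = pd.DataFrame({'Name': ['Adam', 'Bob'], 'Age': [28, 22]})
ages = pd.Series([28, 22], index=['Adam', 'Bob'])

# On a DataFrame the column names are edited and a new object is returned
prefixed = people.add_prefix('Person_')
suffixed = people.add_suffix('_Value')

# On a Series the row labels are edited instead
prefixed_ages = ages.add_prefix('Age_')

print(prefixed.columns.tolist())
print(suffixed.columns.tolist())
print(prefixed_ages.index.tolist())
```

To keep the new names, assign the result back to a variable, for example `people = people.add_prefix('Person_')`.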


<h2 align="center">Conclusion</h2>

Tabular data can easily be read into Pandas from a variety of formats. Once read, the data is stored in a 2-dimensional object called a DataFrame, which is composed of rows and columns. The rows and columns can be renamed to better present the data. These fundamental concepts are used to create DataFrames, which are very versatile objects. Later notebooks introduce the methods for filtering, manipulating, dropping and creating new data, which build on the concepts shown here.