<a href="https://colab.research.google.com/github/ProfessorPatrickSlatraigh/CST2312/blob/main/Pandas_29_Nov_2021.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Pandas**

Pandas is an open-source Python library that provides powerful, flexible and high-performance tools to process data. It was developed in 2008 by Wes McKinney.

Some of Pandas' features include:

*   High-performance DataFrame objects with effective indexing.
*   Tools for loading data into memory from multiple file formats.
*   High performance for operations such as data merges and joins.
*   Manipulation of time series data.
*   Easy manipulation of row and column data.
*   SQL-like operations.
*   Vectorized operations.
*   Label-based slicing, indexing, and subsetting of large data sets.

**Pandas topics covered in this notebook.**

1.   Manipulating tabular data.
2.   Aggregating data across multiple dimensions.
3.   SQL-like joining, grouping, and sorting of data.
4.   Powerful data filtering.
5.   I/O for multiple file formats.
6.   Extensive type support, such as int, float, string, and datetime.
7.   Advanced usage of pandas, such as reducing memory usage and speeding up file loading and other operations.

The standard Python installation does not include the Pandas library.

You can install Pandas using pip on a local machine:

```
pip install pandas
```

*Pandas is available on Colab without the need to pip install it.*

In Colab you can just import pandas.  We typically **import pandas as pd** to save typing later.


```python
import pandas as pd
```

## Pandas Data Structures

### Series in Pandas

A Series is a one-dimensional labeled array that can hold data of any type, such as int, float, string, or arbitrary Python objects.

A Series is like a one-dimensional NumPy array (ndarray) with an index. You can explicitly specify an index for a Series object; the labels are commonly int or string values. When no index is provided, pandas automatically creates an integer index that starts at zero and increases in steps of one.

A Series can be created using any of the following:

*   **Dictionary** or **list**
*   **NumPy ndarray**
*   **Scalar** value
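
As a minimal sketch of the options listed above, the following builds one Series from each kind of input. The variable names and sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a dictionary: the keys become the index labels.
from_dict = pd.Series({"a": 10, "b": 20, "c": 30})

# From a list: pandas assigns the default integer index 0, 1, 2.
from_list = pd.Series([10, 20, 30])

# From a NumPy ndarray: the array's dtype carries over to the Series.
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))

# From a scalar: the value is repeated once for each index label supplied.
from_scalar = pd.Series(7, index=["x", "y", "z"])

for s in (from_dict, from_list, from_array, from_scalar):
    print(s)
    print("------------------")
```

Note that a scalar needs an explicit index if you want more than one element, because the index determines how many times the value is repeated.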


**Dictionaries** are data structures, more generally known as associative arrays, that store data values in key:value pairs. A dictionary is a collection that is ordered (by insertion, as of Python 3.7), changeable, and does not allow duplicate keys.

**Lists** are used to store multiple items in a single variable in Python.  A list is a data structure in Python that is a mutable, or changeable, ordered sequence of elements.

An **ndarray** is a (usually fixed-size) multidimensional container of items of the same type and size. The number of dimensions and items in an array is defined by its shape, which is a tuple of N non-negative integers that specify the sizes of each dimension.
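
To make the idea of shape concrete, here is a small sketch using a made-up 2-by-3 array named `grid`.

```python
import numpy as np

# A 2-by-3 array: two dimensions, six items, all sharing one dtype.
grid = np.array([[1, 2, 3], [4, 5, 6]])

print(grid.shape)  # (2, 3): two rows, three columns
print(grid.ndim)   # 2: the number of dimensions
print(grid.size)   # 6: the total number of items
print(grid.dtype)  # a platform-dependent integer type
```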

**Scalars** are single values representing one unit of data, such as an integer or bool, as opposed to data structures like a list or tuple, which are composed of scalars.


### DataFrames in Pandas

A DataFrame is a two-dimensional data structure with columns of potentially different types. You can think of it like an MS Excel or SQL table, or a dictionary of Series objects.

Pandas provides many APIs to manipulate DataFrame objects using slices, indexing, aggregation, and statistical operations.
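
The short sketch below illustrates these ideas on a small DataFrame. The column names and values are hypothetical sample data, not taken from any real data set.

```python
import pandas as pd

# Hypothetical sample data: each key becomes a column, and each column is a Series.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],
    "age": [31, 25, 40],
    "score": [88.5, 92.0, 79.5],
})

print(df.dtypes)           # each column has its own dtype
print(type(df["age"]))     # selecting one column returns a Series
print(df[0:2])             # slicing: the first two rows
print(df.loc[1, "name"])   # indexing by row label and column label
print(df["score"].mean())  # aggregation over one column
print(df.describe())       # summary statistics for the numeric columns
```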

A DataFrame can be created from:

*   A **dictionary** of lists or a **dictionary** of Series.

*   A **CSV**, **TSV**, or **MS Excel** file. A **CSV** is a comma-separated values file, which stores data in a tabular format. A **TSV** is similar to a CSV, but the separator is a tab instead of a comma.

*   A **NumPy ndarray** (see the definition in the section on Series above).
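
As a rough sketch of these options, the example below builds a DataFrame from each source. All names and values are invented for illustration, and `io.StringIO` stands in for a file on disk so that the example runs without any external files. Excel files can be read in a similar way with `pd.read_excel`, which needs an optional Excel engine installed and is therefore left out here.

```python
import io

import numpy as np
import pandas as pd

# From a dictionary of lists: the keys become column names.
df_lists = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

# From a dictionary of Series: rows are aligned on the Series index.
df_series = pd.DataFrame({
    "x": pd.Series([1, 2], index=["a", "b"]),
    "y": pd.Series([3, 4], index=["a", "b"]),
})

# From CSV text (made-up data); io.StringIO plays the role of a file.
csv_text = "item,price\npen,2\nbook,12\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# From TSV text: same reader, with a tab as the separator.
tsv_text = csv_text.replace(",", "\t")
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# From a NumPy ndarray: column names are supplied separately.
df_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["p", "q"])

for frame in (df_lists, df_series, df_csv, df_tsv, df_array):
    print(frame)
    print("------------------")
```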

**Creating a Series**

*(one-dimensional array)*



**Create a Series from a list without an index**

Create a Series (**series1**) from the Python list (**list1**) below.

By default the index starts at zero.  We use an arbitrary counter (**i**) to reference index numbers for printing. 



```python
# Create a pandas Series from a list and print a range of its values.

list1 = list(range(1, 10))  # list1 is [1, 2, ..., 9]

series1 = pd.Series(list1)
print("The pandas series series1 is: ")
print(series1)

print("\n------------------\n")

i = 3  # index value at which the while loop starts
print("Beginning with the element at index {}, the values through the end of series1 are: ".format(i))
while i < len(series1):
    print("For index {} -> the value is {}".format(i, series1[i]))
    i += 1
```

In the output of **print(series1)**, we can see two columns. The first column is the index number and the second column is the corresponding data value.

We access a specific value of the series using its index, as in the while loop above where we are accessing the **i**th value of the series.
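
As an alternative sketch of the same loop, `Series.items()` yields (index, value) pairs, so no manual counter is needed. The Series is rebuilt here so that the example stands on its own, and `.iloc[3:]` selects everything from position 3 onward.

```python
import pandas as pd

series1 = pd.Series(list(range(1, 10)))

# Take the elements from position 3 to the end, then walk the (index, value) pairs.
tail = series1.iloc[3:]
for idx, value in tail.items():
    print("For index {} -> the value is {}".format(idx, value))
```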



**Creating a Series from a list with a specified index**

In the last example, we created a Series without specifying an index. Here is a Series whose index is made up of strings rather than int64 values.

A value of the Series can be accessed by a string index just as we did with an integer index. The only difference is that we now write the string label inside the brackets, [], when accessing an element.






```python
list2 = list(range(1, 6))
# list2 is [1, 2, 3, 4, 5]

# Pass the labels through the index keyword so the intent is explicit.
index2 = ["aa", "bb", "ccc", "dddd", "eeeeee"]
series2 = pd.Series(list2, index=index2)

print("The pandas series series2 is: ")
print(series2)

print("\n------------------\n")

i = "ccc"
print("the index {} -> value is {}".format(i, series2[i]))
```


```
The pandas series series2 is: 
aa        1
bb        2
ccc       3
dddd      4
eeeeee    5
dtype: int64

------------------

the index ccc -> value is 3
```


A Series object is created by passing two arguments. The first is a list of the values 1 through 5 (**list2**). The second is a list of index labels (**index2**).

In the output of print(**series2**) we can see two columns. The first column is the index and the second column is the corresponding data value.

We access a specific value of the series using its index.  In the example above we access the value of index **ccc**.
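
When a Series has string labels, it can help to say explicitly whether you mean a label or a position. The sketch below, which rebuilds the same data under the hypothetical names `labels` and `series`, uses `.loc` for label-based access and `.iloc` for position-based access.

```python
import pandas as pd

labels = ["aa", "bb", "ccc", "dddd", "eeeeee"]
series = pd.Series([1, 2, 3, 4, 5], index=labels)

print(series.loc["ccc"])        # by label: 3
print(series.iloc[2])           # by position: 3
print(series.loc["bb":"dddd"])  # label slices include both endpoints
print(series.iloc[1:3])         # position slices exclude the end position
```

The two slices differ in length: the label slice returns the three elements from **bb** through **dddd**, while the position slice returns only the two elements at positions 1 and 2.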