# Fundamental concepts

**Table of contents**<a id='toc0_'></a>    
- [Pandas primary data structures](#toc1_1_1_)    
- [The notion of Index](#toc1_1_2_)    
- [Data types](#toc1_1_3_)    
- [Some quick hacks](#toc1_1_4_)    


-------------------------
#### <a id='toc1_1_1_'></a>[Pandas primary data structures](#toc0_)
----------------------

There are two main data structures in pandas:
- Series: a one-dimensional labeled array, similar to a single column of data.
- DataFrame: a two-dimensional labeled table, which can be thought of as a collection of Series objects.

In [1]:
# import statements
import numpy as np
import pandas as pd

- #### *constructing Series objects*

> `pd.Series(data, index, dtype, name, copy)` is the constructor used to create a Series object

In [None]:
gpa = pd.Series(
    data=[3.38, 3.58, 3.76, 3.1],
    index=["Hasan", "Ali", "Pial", "David"],
    dtype="float64",
    name="GPA_of_Term_01",
)

In [3]:
gpa

Hasan    3.38
Ali      3.58
Pial     3.76
David    3.10
Name: GPA_of_Term_01, dtype: float64

In [4]:
# index values
gpa.index

Index(['Hasan', 'Ali', 'Pial', 'David'], dtype='object')

- #### *constructing DataFrame objects*

In [5]:
# transpose so that each term's GPAs form one column (one row per student)
term_gpa = np.array([[3.38, 3.58, 3.76, 3.1], [3.6, 3.65, 3.71, 3.4]]).T

> `pd.DataFrame(data, index, columns, dtype, copy)` is the constructor used to create a pandas DataFrame object

In [None]:
gpa_term = pd.DataFrame(
    data=term_gpa,
    index=["Hasan", "Ali", "Pial", "David"],
    columns=["GPA of Term 01", "GPA of Term 02"],
    copy=True,
)

In [7]:
gpa_term

|       | GPA of Term 01 | GPA of Term 02 |
|-------|----------------|----------------|
| Hasan | 3.38           | 3.6            |
| Ali   | 3.58           | 3.65           |
| Pial  | 3.76           | 3.71           |
| David | 3.1            | 3.4            |


----------------------------------
#### <a id='toc1_1_2_'></a>[The notion of Index](#toc0_)
---------------------------------

The notion of **index** is very important in understanding Pandas data structures.

Along with the data, a pandas data structure carries some extra bits of information as attributes, such as **an index** and a name (only in Series objects).

The index is not part of the values. It is usually referred to as an **axis**, and the entries of the index (e.g., 0, 1, 2) are called **axis labels**, whereas the actual **data** is called the **values**.

A Series object has one index, and a DataFrame object has two (one for the rows and one for the columns).
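
As a minimal sketch of how these pieces can be inspected, the snippet below rebuilds a two-student version of the GPA data used earlier and pulls out the labels, the values, and the name. The variable names on the left-hand side (`series_labels`, `row_labels`, and so on) are only illustrative.

```python
import pandas as pd

# A small Series and DataFrame modelled on the GPA example above
gpa = pd.Series([3.38, 3.58], index=["Hasan", "Ali"], name="GPA_of_Term_01")
gpa_term = pd.DataFrame(
    {"GPA of Term 01": [3.38, 3.58], "GPA of Term 02": [3.6, 3.65]},
    index=["Hasan", "Ali"],
)

# A Series has a single index (its row labels), plus values and a name
series_labels = list(gpa.index)
series_values = gpa.to_numpy()  # the data only, without any labels
series_name = gpa.name

# A DataFrame has two indexes: one for the rows and one for the columns
row_labels = list(gpa_term.index)
column_labels = list(gpa_term.columns)
```

Note that `series_values` is a plain NumPy array: the labels live in the index, not in the values.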

By default, the index consists of increasing integers (starting from 0 with a step of 1). However, non-integer labels such as strings or dates, as well as arbitrarily ordered labels, can also be used as axis labels, which makes it possible to select and align data by label. Duplicate index values are allowed too.
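
The following sketch illustrates the three cases mentioned here: the default integer index, a date-based index, and an index with duplicate labels. The data values are made up for illustration.

```python
import pandas as pd

# No index given: pandas falls back to the integers 0, 1, 2, ...
default_idx = pd.Series([10, 20, 30])

# Dates used as axis labels
dated = pd.Series(
    [1.0, 2.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02"]),
)
second_day = dated.loc[pd.Timestamp("2024-01-02")]

# Duplicate labels are allowed; selecting one returns every matching entry
dup = pd.Series([3.38, 3.6, 3.58], index=["Hasan", "Hasan", "Ali"])
hasan_rows = dup.loc["Hasan"]      # a Series holding both "Hasan" entries
has_unique = dup.index.is_unique   # reports whether all labels are distinct
```

Checking `index.is_unique` can be a quick way to find out whether a label lookup might return more than one row.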

------------------------
#### <a id='toc1_1_3_'></a>[Data types](#toc0_)
---------------------

As data values, a pandas data structure can hold numerical data, time data, strings, or arbitrary Python objects.

To convert between different datatypes we can use the `<Series/DataFrame Object>.astype(dtype)` method.

**Note:** The first thing we should do after loading a dataset is to check the datatype of each column and cast columns to a more suitable datatype where needed. This can save memory and speed up later operations.
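
A minimal sketch of that workflow is shown below. The `raw` frame and its columns are hypothetical, and the target types (`int8`, `float32`) are just one possible choice: `float32` halves the storage of `float64` at the cost of precision, so it is an assumption that the lower precision is acceptable here.

```python
import pandas as pd

# Hypothetical freshly loaded data: "term" holds numbers stored as text
raw = pd.DataFrame(
    {
        "student": ["Hasan", "Ali", "Pial", "David"],
        "term": ["1", "1", "2", "2"],
        "gpa": [3.38, 3.58, 3.76, 3.10],
    }
)

# Step 1: inspect the datatype of each column
dtypes_before = raw.dtypes

# Step 2: cast selected columns by passing a {column: dtype} mapping
typed = raw.astype({"term": "int8", "gpa": "float32"})
dtypes_after = typed.dtypes

# Step 3: compare the memory footprint (deep=True also counts string contents)
bytes_before = raw.memory_usage(deep=True).sum()
bytes_after = typed.memory_usage(deep=True).sum()
```

Passing a dictionary to `astype` leaves the columns that are not listed (here `student`) unchanged.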

In [8]:
gpa_term.astype(dtype="str")

|       | GPA of Term 01 | GPA of Term 02 |
|-------|----------------|----------------|
| Hasan | 3.38           | 3.6            |
| Ali   | 3.58           | 3.65           |
| Pial  | 3.76           | 3.71           |
| David | 3.1            | 3.4            |


For a basic overview of the pandas data types, see this article: https://pbpython.com/pandas_dtypes.html

------------------------
#### <a id='toc1_1_4_'></a>[Some quick hacks](#toc0_)
------------------------

To see some entries from the start of the series or dataframe we can use,

    <Series|DataFrame>.head(n)
    
To see some entries from the end of the series or dataframe we can use,

    <Series|DataFrame>.tail(n)
    
To see some random entries from the series or dataframe we can use,

    <Series|DataFrame>.sample(n)
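
As a short illustration of these three methods, the sketch below uses a made-up Series of 100 integers. The `random_state` argument is added only to make the random pick repeatable.

```python
import pandas as pd

# Made-up data: the integers 0..99
scores = pd.Series(range(100), name="score")

first_three = scores.head(3)   # entries from the start
last_three = scores.tail(3)    # entries from the end
random_three = scores.sample(3, random_state=0)  # random entries, reproducible
```

When `n` is omitted, `head` and `tail` return five entries and `sample` returns one.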

To display a series or dataframe in full, instead of a truncated view, we can raise the display limits to `m` columns and `n` rows,

    pd.set_option("display.max_columns", m)
    pd.set_option("display.max_rows", n)
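
If the wider display is only needed for one cell, a possible alternative is `pd.option_context`, which applies the options temporarily and restores the previous settings afterwards. The sketch below assumes a made-up 30-column frame and uses `None` as the limit, which pandas treats as "no limit".

```python
import pandas as pd

# Made-up wide frame: 30 columns named col_0 .. col_29
wide = pd.DataFrame({f"col_{i}": range(3) for i in range(30)})

default_max_rows = pd.get_option("display.max_rows")

# Options set here apply only inside the with-block
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    inside_max_rows = pd.get_option("display.max_rows")
    full_repr = repr(wide)  # text representation with no columns left out

# Outside the block the earlier setting is back in place
restored_max_rows = pd.get_option("display.max_rows")
```

Compared with `pd.set_option`, this avoids changing the display settings for the rest of the notebook.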