# **Pandas Foundations**
Pandas is a Python library for working with structured data.

#### **Structured Data**
Data that is stored in tables, such as CSV files, Excel spreadsheets or database tables.

`pd.Series` is a 1D collection of data: a single column with multiple rows.
 - It is a ***homogeneous*** data structure.
    - All elements have the same data type.
    - Example:
        - All integers
        - All floats
        - All strings
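
One way to see the single-dtype rule in practice is to check `.dtype` on a few small Series. This is a minimal sketch with made-up values and illustrative variable names; it also shows that mixing integers and floats makes pandas store every value as a float.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])      # all integers
floats = pd.Series([1.5, 2.5])   # all floats
mixed = pd.Series([1, 2.5])      # an integer and a float together

print(ints.dtype)    # an integer dtype
print(floats.dtype)  # float64
print(mixed.dtype)   # float64: the integer 1 is stored as 1.0
```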

`pd.DataFrame` is a 2D collection of data. It stores information the way an Excel sheet or a database table does, and it can have multiple columns and rows.
- It is a ***heterogeneous*** data structure.
    - Columns can have different data types (a mix of numeric, text, boolean, etc.).
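
As a rough illustration, the DataFrame below mixes integer, float and boolean columns. The column names and values are invented for this sketch; the point is that `.dtypes` reports a separate type for each column.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "age": [24, 42],              # integers
        "height_m": [1.80, 1.66],     # floats
        "is_student": [True, False],  # booleans
    }
)

# Each column keeps its own dtype
print(people.dtypes)
```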


`pd.Index` labels data and allows fast lookup, alignment, joining and selection.
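
Alignment is easiest to see with two Series whose labels are the same but in a different order. In this small sketch (labels and values are made up), the addition matches values by label rather than by position.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])  # same labels, different order

# Values are paired up by label before being added
total = a + b

print(total["x"])  # 31 (1 + 30)
print(total["y"])  # 22 (2 + 20)
print(total["z"])  # 13 (3 + 10)
```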


## **Importing Pandas**

`import pandas as pd` loads the pandas library and assigns it the short alias `pd` for convenient use.

Along with pandas, we import *NumPy*, *Matplotlib* and *PyArrow*, which are commonly used together with it.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pyarrow as pa

### Series

Passing data as a list:

In [2]:
pd.Series(
    [0, 1, 2]
)

Unnamed: 0,0
0,0
1,1
2,2


Data can also be passed as a tuple:

In [3]:
pd.Series(
(12.34, 56.78, 91.01)
)

Unnamed: 0,0
0,12.34
1,56.78
2,91.01


`range` - a Python built-in that can be used to generate sample data

In [8]:
# 1 is the start of the range, 12 is the end (exclusive, so 12 itself is never
# produced), and 3 is the step, i.e. the difference between consecutive values

pd.Series(range(1, 12, 3))


Unnamed: 0,0
0,1
1,4
2,7
3,10


Pandas automatically infers data types, but we can explicitly set `dtype` to control the type and memory usage. For example, using `dtype="int8"` instead of the default `int64` for small integers saves memory and can make it easier to match column types in systems such as SQL databases.

In [9]:
pd.Series(range(3), dtype="int8")

Unnamed: 0,0
0,0
1,1
2,2
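
To get a feel for the memory difference, the sketch below stores the same 100 values as `int8` and as `int64` and compares `nbytes`. The variable names are illustrative, and the values stay below 128 because `int8` can only hold integers from -128 to 127.

```python
import pandas as pd

small = pd.Series(range(100), dtype="int8")   # 1 byte per value
large = pd.Series(range(100), dtype="int64")  # 8 bytes per value

# nbytes reports the size of the stored values in bytes
print(small.nbytes)  # 100
print(large.nbytes)  # 800
```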


A `pd.Series` can be given a label using the `name=` argument.
If it is not provided, the Series name defaults to `None`.

In [10]:
pd.Series(
    ["apple","banana","orange"], name="fruits"
)

Unnamed: 0,fruits
0,apple
1,banana
2,orange
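
The default can be checked directly. This short sketch (values and the name `counts` are made up) also shows `rename`, which is one way to attach a name after the Series has been created.

```python
import pandas as pd

unnamed = pd.Series([1, 2, 3])
print(unnamed.name)  # None

# rename() with a scalar returns a new Series carrying that name;
# the original Series is left unchanged
named = unnamed.rename("counts")
print(named.name)    # counts
```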


### DataFrame
`pd.DataFrame` is the main and most-used pandas object. While data is usually imported from files or databases, it can also be created directly in code.

In [11]:
pd.DataFrame(
    [
        [0,1,2],
        [3,4,5],
        [6,7,8]
    ]
)

Unnamed: 0,0,1,2
0,0,1,2
1,3,4,5
2,6,7,8


When creating a `pd.DataFrame` from a list of lists, pandas auto-numbers rows and columns, but using the `columns=` argument to name the columns makes indexing and selection clearer and easier.

In [12]:
pd.DataFrame(
    [
        [1, 2],
        [4, 8]
    ], columns=["col_a","col_b"]
)

Unnamed: 0,col_a,col_b
0,1,2
1,4,8
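
With named columns, a single column can be picked out by its label. The sketch below reuses the same data and assumes square-bracket selection, which the text above has not introduced yet.

```python
import pandas as pd

df = pd.DataFrame(
    [
        [1, 2],
        [4, 8]
    ], columns=["col_a", "col_b"]
)

# Selecting by column label returns that column as a pd.Series
col = df["col_a"]
print(col.tolist())  # [1, 4]
print(col.name)      # col_a
```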


A `pd.DataFrame` can be created from a dictionary, where the dictionary keys become the column names and the values become the column data.

In [13]:
pd.DataFrame(
    {
        "first_name" : ["John","Grisham"],
        "last_name" : ["Harry", "Potter"]
    }
)

Unnamed: 0,first_name,last_name
0,John,Harry
1,Grisham,Potter


When creating a `pd.DataFrame` from a dictionary, the values can be any sequence type (including `pd.Series`), not just lists.

In [14]:
ser1 = pd.Series(range(3), dtype="int8", name="int8_col")
ser2 = pd.Series(range(3), dtype="int16", name="int16_col")
pd.DataFrame(
    {ser1.name: ser1, ser2.name:ser2}
)

Unnamed: 0,int8_col,int16_col
0,0,0
1,1,1
2,2,2
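
One practical consequence of passing `pd.Series` values is that each column keeps the dtype of the Series it came from. A minimal check, repeating the same two Series:

```python
import pandas as pd

ser1 = pd.Series(range(3), dtype="int8", name="int8_col")
ser2 = pd.Series(range(3), dtype="int16", name="int16_col")
df = pd.DataFrame({ser1.name: ser1, ser2.name: ser2})

# The columns are not converted to a common type
print(df.dtypes)
```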


### Index
The labels shown on the left of a `pd.Series` or `pd.DataFrame` (numbers by default) are the `pd.Index`, which labels rows and enables data selection and alignment.

A `pd.DataFrame` has two `pd.Index` objects: one for rows (row index) and one for columns (column index).

- By default, pandas assigns an auto-numbered `pd.RangeIndex`.
- It is fine for rows but rarely kept for columns, where meaningful labels (e.g. City, Data) are preferred. Both the row and column indexes can be customized when the object is created.
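
The default index can be inspected directly. In this small sketch (the values are arbitrary), neither object is given any labels, so pandas falls back to a `pd.RangeIndex` for the rows and, for the DataFrame, for the columns too.

```python
import pandas as pd

ser = pd.Series(["a", "b", "c"])
print(ser.index)   # RangeIndex(start=0, stop=3, step=1)

df = pd.DataFrame([[0, 1], [2, 3]])
print(df.index)    # RangeIndex(start=0, stop=2, step=1)
print(df.columns)  # RangeIndex(start=0, stop=2, step=1)
```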

For a `pd.Series`, you can easily set or change the row index by passing a sequence of labels to the `index=` argument.

In [15]:
pd.Series([4, 4, 2], index=["dog","cat","human"])

Unnamed: 0,0
dog,4
cat,4
human,2


For more control, you can explicitly create a `pd.Index` (e.g., naming it *animal*) and assign it via `index=`, while also naming the `pd.Series` (e.g., *num_legs*) for clearer context.

In [17]:
index = pd.Index(
    ["dog","cat","human"], name="animal"
)
pd.Series([4, 4, 2], name="num_legs", index=index)

animal,num_legs
dog,4
cat,4
human,2


A `pd.DataFrame` uses a `pd.Index` for both rows and columns: set row labels with `index=` and column labels with `columns=` when creating the DataFrame.

In [18]:
pd.DataFrame(
    [
        [24, 180],
        [42, 166]
    ],
    columns=["age", "height_cm"],
    index=["Jack", "Peter"]
)

Unnamed: 0,age,height_cm
Jack,24,180
Peter,42,166
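
Once both axes are labelled, a single value can be looked up by its row and column labels. The sketch below assumes the `.loc` accessor, which is not covered above, and reuses the same data.

```python
import pandas as pd

df = pd.DataFrame(
    [[24, 180], [42, 166]],
    columns=["age", "height_cm"],
    index=["Jack", "Peter"],
)

# .loc selects by label: row label first, then column label
print(df.loc["Jack", "age"])         # 24
print(df.loc["Peter", "height_cm"])  # 166
```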


#### Series Attributes
Series attributes allow you to inspect data type, name, index, shape, and size for quick understanding of your data.

- `ser.dtype` → shows the data type of values in the Series (e.g., int64)

- `ser.name` → returns the name of the Series (or None if not set)

- `ser.index` → gives the index labels associated with the Series

- `ser.index.name` → returns the name of the Index

- `ser.shape` → shows number of rows as a one-value tuple (e.g., (3,))

- `ser.size` → returns total number of elements

- `len(ser)` → returns number of rows in the Series

In [20]:
index = pd.Index(
    ["dog","cat","human"], name="animal"
)
series = pd.Series([4, 4, 2], name="num_legs", index=index)
series

animal,num_legs
dog,4
cat,4
human,2


In [21]:
series.dtype

dtype('int64')

In [22]:
series.name

'num_legs'

In [23]:
series.index

Index(['dog', 'cat', 'human'], dtype='object', name='animal')

In [24]:
series.shape

(3,)

In [25]:
series.size

3

In [26]:
len(series)

3
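
The cells above do not show `ser.index.name`, so here is a brief sketch of it on the same Series. It also checks that `shape`, `size` and `len` agree, which is always the case for a one-dimensional object.

```python
import pandas as pd

index = pd.Index(["dog", "cat", "human"], name="animal")
series = pd.Series([4, 4, 2], name="num_legs", index=index)

print(series.index.name)  # animal

# For a Series, these three all describe the same number of rows
print(series.shape[0], series.size, len(series))  # 3 3 3
```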

#### DataFrame Attributes
DataFrame attributes help inspect column types, row/column labels, and dataset size and shape.

 - `df.dtypes` → shows data type of each column (returned as a Series)

 - `df.index` → displays row labels (row index)

 - `df.columns` → displays column labels (column index)

 - `df.shape` → returns (rows, columns)

 - `df.size` → total number of elements = rows × columns

 - `len(df)` → number of rows

In [27]:
index = pd.Index(["Jack", "Jill"], name="person")
df = pd.DataFrame([
    [24, 180, "red"],
    [42, 166, "blue"],
], columns=["age", "height_cm", "favorite_color"], index=index)
df


person,age,height_cm,favorite_color
Jack,24,180,red
Jill,42,166,blue


In [29]:
df.dtypes

Unnamed: 0,0
age,int64
height_cm,int64
favorite_color,object


In [30]:
df.index

Index(['Jack', 'Jill'], dtype='object', name='person')

In [31]:
df.columns

Index(['age', 'height_cm', 'favorite_color'], dtype='object')

In [32]:
df.shape

(2, 3)

In [33]:
df.size

6

In [34]:
len(df)

2
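
The relationship between `shape`, `size` and `len` can be verified in a few lines. This sketch rebuilds the same DataFrame and also looks at the names of the two indexes; only the row index was given a name here, so the column index name is `None`.

```python
import pandas as pd

index = pd.Index(["Jack", "Jill"], name="person")
df = pd.DataFrame([
    [24, 180, "red"],
    [42, 166, "blue"],
], columns=["age", "height_cm", "favorite_color"], index=index)

n_rows, n_cols = df.shape
print(n_rows * n_cols == df.size)  # True
print(len(df) == n_rows)           # True

print(df.index.name)    # person
print(df.columns.name)  # None
```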