# Pandas Overview

## Topics
Introduction<br>
Data Structures<br>
- Series
- DataFrame


## Introduction
Pandas is an open-source Python library that provides high-performance data manipulation and analysis tools built on its data structures.<br>
Pandas lets you clean your data before analysing it. “Cleaning” your data, often called “data wrangling” or “data munging”, is the process of correcting or removing erroneous data from your dataset (dealing with missing values, binning, dealing with categorical data, etc.) before processing it and drawing insights from it.<br>
Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and analyse.<br>
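
As a rough illustration of the cleaning steps named above (missing values, binning, categorical data), here is a minimal sketch. The column names, values, and bin edges are made up for the example and do not come from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one missing age and a text column with few distinct values
raw = pd.DataFrame({
    'age': [23, np.nan, 41, 67],
    'group': ['a', 'b', 'a', 'b'],
})

# Missing values: fill the gap with the median of the observed ages
clean = raw.copy()
clean['age'] = clean['age'].fillna(clean['age'].median())

# Binning: turn the numeric age into labelled intervals (bin edges are an assumption)
clean['age_band'] = pd.cut(clean['age'], bins=[0, 30, 60, 120],
                           labels=['young', 'middle', 'senior'])

# Categorical data: store the repeated text values with the category dtype
clean['group'] = clean['group'].astype('category')

print(clean)
print(clean.dtypes)
```

Filling with the median is only one possible choice; dropping incomplete rows with `dropna` is another common option.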

### Key Features of Pandas:
1. Fast and efficient DataFrame object with default and customised indexing.<br>
2. Tools for loading data into in-memory data objects from different file formats.<br>
3. Data alignment and integrated handling of missing data.<br>
4. Reshaping and pivoting of data sets.<br>
5. Label-based slicing, indexing and subsetting of large data sets.<br>
6. Columns from a data structure can be deleted or inserted.<br>
7. Group by data for aggregation and transformations.<br>
8. High performance merging and joining of data.<br>
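
To make features 7 and 8 a little more concrete, the following sketch groups a small table and then joins the result with a second table. The tables `sales` and `regions` and their contents are hypothetical.

```python
import pandas as pd

# Hypothetical sales and region tables
sales = pd.DataFrame({
    'store': ['s1', 's1', 's2', 's3'],
    'amount': [10, 20, 5, 7],
})
regions = pd.DataFrame({
    'store': ['s1', 's2'],
    'region': ['north', 'south'],
})

# Feature 7: group the rows by store and aggregate the amounts
totals = sales.groupby('store', as_index=False)['amount'].sum()

# Feature 8: join the totals with the region table; how='left' keeps store s3,
# which has no match, so its region is missing (NaN)
merged = totals.merge(regions, on='store', how='left')
print(merged)
```

The missing region for `s3` also shows feature 3 at work: the unmatched row is kept and marked as missing rather than raising an error.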

## Introduction to Data Structures:

Pandas has the following main data structures:
- Series<br>
- DataFrame<br>

The best way to think of these data structures is that the higher dimensional data structure (DataFrame) is a container of its lower dimensional data structure (Series). <br>
All Pandas data structures are value mutable (can be changed) and DataFrame is size mutable (Series is size immutable).
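
A small sketch of what mutability looks like in practice, using made-up values: a value in a Series is changed in place, and a column is added to and removed from a DataFrame.

```python
import pandas as pd

s = pd.Series([10, 23, 56], index=['a', 'b', 'c'])
s['b'] = 99              # value mutation: the existing label 'b' gets a new value

df = pd.DataFrame({'x': [1, 2, 3]})
df['y'] = df['x'] * 2    # size mutation: a new column is inserted in place
del df['x']              # a column can be removed again

print(s)
print(df)
```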

### Series

Series is a one-dimensional array-like structure with homogeneous data.<br>

For example, the following series is a collection of integers 10, 23, 56, 17, 52, 61, 73, 90, 26, 72.<br>

Key Points:<br>
- Homogeneous data (each value has the same type)
- Data is mutable (values can be changed)
- Size immutable (not possible to add/remove values)

__Syntax:__  pandas.Series(data, index, dtype, name, copy)

where:
    
- data contains the data; this can be an ndarray, a list, a dictionary, another Series, or a scalar
- index (optional): the labels for the values; if not provided, a default counter (0, 1, ...) is used; if the data is a dictionary, its keys are used as the index
- dtype (data type), optional, if not provided it is inferred
- name (optional), if provided, the Series will be named
- copy; advanced setting, see documentation [https://pandas.pydata.org/docs/reference/api/pandas.Series.html](https://pandas.pydata.org/docs/reference/api/pandas.Series.html)

__Returns:__ A Series
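
As a quick illustration of the optional parameters, the sketch below passes `index`, `dtype`, and `name` explicitly. The labels and the name `scores` are arbitrary choices for the example.

```python
import pandas as pd

# index supplies the labels, dtype forces floats, name labels the Series itself
s = pd.Series([10, 23, 56], index=['x', 'y', 'z'], dtype='float64', name='scores')

print(s)
print(s.dtype, s.name)
```

Without `dtype`, pandas would infer an integer type from these values.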

#### Create a Series from ndarray
If data is an ndarray, then the index passed must be of the same length. If no index is passed, the default index is range(n), where n is the array length, i.e., 0, 1, 2, ..., len(array) - 1.


In [None]:
#import pandas and numpy
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
print('type data:', type(data))
s = pd.Series(data)
print('s:', s)

In [None]:
s = pd.Series(data,index=[100,101,102,103])
print(s)

#### Create a Series from dict
A dict can be passed as input. If no index is specified, the dictionary keys are used as the index in insertion order (pandas 0.23 or later on Python 3.6 or later; older versions sorted the keys). If an index is passed, the values in data corresponding to the labels in the index are pulled out, and labels without a matching key get NaN.<br>

In [None]:
data = {'a' : 0.0, 'b' : 1.0, 'c' : 2.0}
s = pd.Series(data)
print(s)

In [None]:
s = pd.Series(data,index=['b','c','d','a'])
print(s)

#### Create a Series from Scalar
If data is a scalar value, an index must be provided. The value will be repeated to match the length of the index.<br>

In [None]:
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)

#### Accessing Data from Series with Position
Data in the series can be accessed by position, similar to an ndarray.

In [None]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
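#retrieve the element at position 2 (third element)
#.iloc makes the positional lookup explicit; s[2] on a label-indexed Series is deprecated in recent pandas
print('s.iloc[2]:', s.iloc[2])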

In [None]:
#retrieve the first two elements (position zero and one)
print(s[0:2])

In [None]:
#retrieve the last three elements
print(s[-3:])

#### Retrieve Data Using Index
A Series is like a fixed-size dict in that you can get and set values by index label.<br>

> Note: 'label' is a named value of the index 

In [None]:
s = pd.Series([1,2,3,4,5],index = ['a','b','c','d','e'])
#retrieve a single element
print(s['a'])

In [None]:
#retrieve multiple elements
print(s[['a','c','d']])
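
The cells above only read values by label. Since the text says values can also be set by label, here is a minimal sketch of that, together with the dict-like `get` method and membership test. The replacement value and the default string are arbitrary.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

s['a'] = 100                     # set a value through its label
print(s['a'])

# .get returns a default instead of raising KeyError for a missing label
print(s.get('z', 'not found'))

# membership tests look at the index labels, as with a dict
print('b' in s, 'z' in s)
```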

### DataFrame

DataFrame is a two-dimensional array with heterogeneous data, where the structure is like a spreadsheet (each variable in a column). <br> 

Key Points:<br>
- Heterogeneous data
- Size Mutable
- Data Mutable

A pandas DataFrame can be created using the following constructor:
    
__Syntax:__  pandas.DataFrame( data, index, columns, dtype, copy)

where:
    
- data contains the data; this can be an ndarray, a list, a dictionary, a Series, or another DataFrame
- index (optional): the row labels; if not provided, a default counter (0, 1, ...) is used
- columns (optional): the column labels; if the data is a dictionary, its keys are used as the column labels by default
- dtype (data type), optional, if not provided it is inferred
- copy; advanced setting, see documentation [https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)

__Returns:__ A DataFrame
    
A DataFrame has two sets of labels: the column labels (`columns`), which identify each variable, and the row labels (`index`), which identify each observation.
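
The two kinds of labels can be inspected and used as in the sketch below. The column names, row labels, and values are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({'age': [28, 34], 'city': ['Leeds', 'York']},
                  index=['r1', 'r2'])

print(df.columns.tolist())   # labels that identify the variables (columns)
print(df.index.tolist())     # labels that identify the rows

col = df['age']              # select a column by its column label
row = df.loc['r2']           # select a row by its row label
cell = df.loc['r2', 'city']  # select one element with both labels
print(cell)
```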


#### Create an Empty DataFrame
The most basic DataFrame that can be created is an empty DataFrame.<br>

In [None]:
df = pd.DataFrame()
print(df)

To create a Pandas DataFrame, you can pass in a variety of data structures, such as:

1. A NumPy array, with optional row and column labels
2. A list of lists, with optional row and column labels
3. A dictionary of lists, with optional row labels
4. A list of dictionaries, with optional row labels
5. A dictionary of dictionaries, with optional row labels

source: chat at openai.com

In [None]:
# Example 1: Create a DataFrame from a NumPy array, with row and column labels
data = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(data, columns=['a', 'b', 'c'], index=['row1', 'row2'])
print('example 1\n', df)

In [None]:
# Example 2: Create a DataFrame from a list of lists, with row and column labels
data = [[1, 2, 3], [4, 5, 6]]
df = pd.DataFrame(data, columns=['a', 'b', 'c'], index=['row1', 'row2'])
print('\nexample 2\n', df)

In [None]:
# Example 3: Create a DataFrame from a dictionary of lists, with row labels
data = {'a': [1, 4], 'b': [2, 5], 'c': [3, 6]}
df = pd.DataFrame(data, index=['row1', 'row2'])
print('\nexample 3\n', df)

In [None]:
# Example 4: Create a DataFrame from a list of dictionaries, with row labels
data = [{'a': 1, 'b': 2, 'c': 3}, {'a': 4, 'b': 5, 'c': 6}]
df = pd.DataFrame(data, index=['row1', 'row2'])
print('\nexample 4\n',df)

In [None]:
# Example 5: Create a DataFrame from a dictionary of dictionaries
# the outer keys ('row1', 'row2') become the column names; the inner keys become the row labels
data = {'row1': {'a': 1, 'b': 2, 'c': 3}, 'row2': {'a': 4, 'b': 5, 'c': 6}}
df = pd.DataFrame(data)
print('\nexample 5\n', df)

#### Create a DataFrame from Lists
The DataFrame can be created using a single list or a list of lists.<br>



In [None]:
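data = [['Amanda',10],['Billy',12],['Claire',13]]
df = pd.DataFrame(data, columns=['Name','Age'])
# cast only the numeric column: dtype=float on the whole frame would also
# apply to the text column 'Name', which fails in recent pandas versions
df['Age'] = df['Age'].astype(float)
print(df)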

### A few more examples

#### Create a DataFrame from Dict of ndarrays / Lists
All the ndarrays must be of the same length. If an index is passed, its length should equal the length of the arrays.<br>
If no index is passed, then by default, index will be range(n), where n is the array length.<br>

In [None]:
# 'Name' and 'Age' are indices for accessing the variables/columns
# no index is passed (for accessing the rows), so by default the index is 0, 1, ...
data = {'Name':['Tasha', 'Jack', 'Steve', 'Rishi'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)

#### Create a DataFrame from List of Dicts
A list of dictionaries can be passed as input data to create a DataFrame. The dictionary keys are taken as column names by default.<br>

In [None]:
# data is a list of dictionaries; each element is a dictionary with key-value pairs,
# where the key is the variable (column) name
# no index is passed (for accessing the rows), so by default the index is 0, 1, ...
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

In [None]:
# index is specified, so rows can be selected by label with .loc['first'] and .loc['second']
# (in addition to by position with .iloc[0] and .iloc[1])
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
# With two column labels that match dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])
# With two column labels, one of which ('b1') matches no dictionary key:
# pandas looks up the keys 'a' and 'b1', so 'b' is ignored;
# since 'b1' is missing, the values for this variable are NaN
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b1'])
print('df1\n',df1)
print('\ndf2\n',df2)

#### Create a DataFrame from Dict of Series
Dictionary of Series can be passed to form a DataFrame. <br> 

In [None]:
# the row indices are specified as part of the series
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)

## Column selection, adding and deleting columns 

### Selecting a column

In [None]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df ['one'])

In [None]:
# selecting multiple columns
selected_columns = df[ ['one', 'two'] ]
print(selected_columns)

### Creating (adding) a new column 

In [None]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# Adding a new column to an existing DataFrame object with column label by passing new series
print ("Adding a new column by passing as Series:")
df['three']=pd.Series([10,20,30],index=['a','b','c'])
print(df)

print ("\nAdding a new column using the existing columns in DataFrame:")
df['four']=df['one']+df['three']
print(df)

### Deleting a column 

In [None]:
# Using the previous DataFrame, we will delete columns,
# first with the del statement, then with the pop method
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), 
   'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']), 
   'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print ("Our dataframe is:")
print (df)
# using del function
print ("Deleting the first column using DEL function:")
del df['one']
print (df)
# using pop function
print ("Deleting another column using POP function:")
# the pop function will return the removed column
colRm = df.pop('two')
print (df)
print('\nRemoved column\n', colRm)

## Row selection, addition and deletion

### Selecting rows

Use the `index` attribute, the `loc` and `iloc` indexers, and boolean filtering to select rows; see [pandas-loc.ipynb](pandas-loc.ipynb)
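
For quick orientation before opening the linked notebook, here is a minimal sketch of the row-selection options mentioned above, on a made-up DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [10, 20, 30]},
                  index=['a', 'b', 'c'])

by_label = df.loc['b']           # row with index label 'b' (returned as a Series)
by_position = df.iloc[0]         # first row, whatever its label
label_slice = df.loc['a':'b']    # label slices include both end points
filtered = df[df['two'] > 10]    # boolean filtering keeps rows where the condition holds

print(filtered)
```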

### Adding rows 

Use the `concat` function to add rows; see [pandas-6-merging-joining.ipynb](pandas-6-merging-joining.ipynb)
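
A minimal sketch of adding rows with `concat`, using two small made-up DataFrames with the same columns:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
new_rows = pd.DataFrame({'a': [5], 'b': [6]})

# concat stacks the frames vertically and returns a new DataFrame;
# ignore_index=True renumbers the rows 0..n-1
combined = pd.concat([df, new_rows], ignore_index=True)
print(combined)
```

Without `ignore_index=True`, the original row labels are kept, which can produce duplicated labels.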

### Deleting a row

Use an index label to delete (drop) rows from a DataFrame. If the label is duplicated, all rows with that label are dropped.<br>

In [None]:
df

In [None]:
# Drop the row with label 'a' (the DataFrame above is indexed by 'a', 'b', 'c', 'd',
# so drop(0) would raise a KeyError)
df = df.drop('a')
print(df)
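
The remark about duplicated labels can be checked with a small, hypothetical DataFrame in which the label 0 occurs twice:

```python
import pandas as pd

# Hypothetical frame in which the label 0 occurs twice
dup = pd.DataFrame({'a': [1, 2, 3]}, index=[0, 1, 0])

# drop returns a new DataFrame; both rows labelled 0 are removed
remaining = dup.drop(0)
print(remaining)
```

Because `drop` returns a new object, `dup` itself still holds all three rows unless the result is assigned back.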

## Panel data and multilevel indices

Pandas used to have another data structure called `Panel`. It has been removed in favour of a 'multi-index', also called a multi-level or hierarchical index.
This allows you to have multiple columns acting together as a row identifier.

If you have:
    
- a cross section of one variable (for example, a list of ages of the students in this section): you can use a Series, or a DataFrame (with one variable)
- a timeseries of one variable (for example, quarterly GDP for the US): you can use a Series, or a DataFrame (with one variable)
- a cross section of multiple variables (for example, a list of ages of the students in this class, one for each time the class is offered): you can use a DataFrame
- a timeseries of multiple variables (for example, state production (GSP), nontax income and tax income for each of the 51 States): you can use a DataFrame (one column for each variable)
- panel data: a cross section with values for multiple periods (for example, tax income for each of the 51 States for several years): you can use a DataFrame with multi-index (one index level to identify the State, another to identify the period)
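
A minimal sketch of the panel-data case, assuming a DataFrame with one row per state and year. The state codes, years, and figures are invented for the example.

```python
import pandas as pd

# Hypothetical tax income figures for two states over two years
panel = pd.DataFrame({
    'state': ['CA', 'CA', 'NY', 'NY'],
    'year': [2020, 2021, 2020, 2021],
    'tax_income': [100, 110, 80, 85],
})

# Two columns together identify a row, so both become index levels
panel = panel.set_index(['state', 'year'])
print(panel)

ca = panel.loc['CA']                            # all years for one state
value = panel.loc[('NY', 2021), 'tax_income']   # one state-year observation
y2020 = panel.xs(2020, level='year')            # cross section: all states in one year
print(value)
```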
    
    
