<center> <img src ="https://i.postimg.cc/1X8H7YYt/BITS-Logo.png" width = "400" alt="BITS Pilani Logo" /> </center>

<font color='green'> <h1> <center> Pandas Objects  </center> </h1> </font>



"pandas" is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

pandas is well suited for many different kinds of data:<br>

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure


The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

Usually, pandas and numpy are imported as follows:

In [1]:
import pandas as pd
import numpy as np

In [2]:
print("Pandas version", pd.__version__)

Pandas version 2.1.4


In [3]:
print("Numpy version ", np.__version__)

Numpy version  1.24.3


Series - 1D labeled homogeneously-typed array

DataFrame - General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns

The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and Series is a container for scalars. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.
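
The dictionary-like behaviour described above can be sketched as follows. This is only a minimal illustration: the name `frame`, the column names and the values are made up for the example.

```python
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
frame = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Insert a new column (a Series) with dictionary-style assignment.
frame["total"] = frame["a"] + frame["b"]

# Remove a column with del, just like deleting a dictionary key.
del frame["b"]

# A column is itself a Series, and each element of it is a scalar.
column = frame["total"]
first_value = column[0]

print(frame)
print(type(column).__name__, first_value)
```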

# <b> Pandas Series</b>

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

In [11]:
#series = pd.Series(data, index=index)

Here, data can be many different things: <br>

- a Python dict
- an ndarray
- a scalar value (like 5)

The passed index is a list of axis labels. The behaviour therefore depends on what data is:

<b>From Python dict <br></b>

Series can be instantiated from dicts:

In [13]:
data_dict = {"key1" : 1, "key2" : 1, "key3": 2}
data_dict

{'key1': 1, 'key2': 1, 'key3': 2}

In [17]:
series = pd.Series(data_dict)

print("Series elements : ")
print(series)

print("\n data type : " , type(series))

print("\n data type of series elements: ", series.dtype)

print("\n number of series elements : ", len(series))

Series elements : 
key1    1
key2    1
key3    2
dtype: int64

 data type :  <class 'pandas.core.series.Series'>

 data type of series elements:  int64

 number of series elements :  3
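
When an explicit index is passed together with a dict, pandas pulls out the values whose keys match the index labels. The sketch below illustrates this under assumed data: the label "key4" is deliberately absent from the dict to show how a missing key is handled.

```python
import pandas as pd

# Hypothetical data, reusing the key names from the example above.
data = {"key1": 1, "key2": 1, "key3": 2}

# Select and reorder entries by passing an explicit index.
# "key4" is not in the dict, so its value becomes NaN and the
# dtype is upcast from int64 to float64.
series = pd.Series(data, index=["key3", "key1", "key4"])

print(series)
print(series.isna())
```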


<b> From ndarray </b>

If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].

In [21]:
array = np.array([10, 23, 34, 45, 45])
series = pd.Series(array)
series

0    10
1    23
2    34
3    45
4    45
dtype: int32

The first value in each row is the index label, which runs from 0 to the number of elements minus one. The second value is the actual data value.

In [9]:
array = np.array([12, 23, 45, 56, 56])
index = ["a", "b", "c", "d", "e"]
series = pd.Series(array, index)
series

a    12
b    23
c    45
d    56
e    56
dtype: int32

Here the first value in each row is the user-defined index label, and the second value is the actual data value.
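
The requirement that the index has the same length as the data can be checked with a short sketch. The values and labels here are illustrative; the point is only that a mismatch raises a `ValueError`.

```python
import numpy as np
import pandas as pd

values = np.array([12, 23, 45])

# Three values but only two labels: pandas refuses to build the Series.
try:
    pd.Series(values, index=["a", "b"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```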

The index of a Series can be obtained through the "index" property of the Series object.

In [10]:
series.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
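
Once a Series has labels, they can be used to look values up. The following sketch rebuilds the Series from the example above and shows label-based and position-based access side by side; the variable names `by_label`, `by_position` and `subset` are chosen for illustration.

```python
import numpy as np
import pandas as pd

series = pd.Series(np.array([12, 23, 45, 56, 56]),
                   index=["a", "b", "c", "d", "e"])

by_label = series["c"]         # label-based lookup
by_position = series.iloc[2]   # position-based lookup
subset = series[["a", "e"]]    # several labels at once

print(by_label, by_position)
print(subset)
print(series.index.tolist())
```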

<b>From scalar value</b>

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.

In [11]:
series = pd.Series(1., index=['a', 'b', 'c'])
series

a    1.0
b    1.0
c    1.0
dtype: float64

Here data value '1.0' is repeated thrice (based on the number of index values provided).

# DataFrame

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input: <br>

- Dictionary of Series, 1D ndarrays, lists, dicts
- 2-D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame <br>

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments.
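
As a minimal sketch of the optional index and columns arguments, the example below builds a data frame from a 2-D ndarray. The array values and the labels ("row1", "x", and so on) are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

# A 2-D array with 2 rows and 3 columns (illustrative values).
matrix = np.array([[1, 2, 3], [4, 5, 6]])

# index supplies the row labels, columns supplies the column labels.
frame = pd.DataFrame(matrix, index=["row1", "row2"],
                     columns=["x", "y", "z"])

print(frame)
print(frame.loc["row2", "y"])
```

If either argument is omitted, pandas falls back to the default integer labels 0, 1, 2, and so on for that axis.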

<b> From dictionary of Series </b>

A data frame can be created from a dictionary whose values are Series. Here the dictionary contains four Series (name, age, salary and working), each holding the data values of one column.

In [12]:
series_dict = {
    "name" : pd.Series(["A", "B", "c"]),
    "age": pd.Series([21, 22, 23]),
    "salary": pd.Series([11.5, 23.4, 56]),
    "working" : pd.Series([True, False, True])
}
series_dict

{'name': 0    A
 1    B
 2    c
 dtype: object,
 'age': 0    21
 1    22
 2    23
 dtype: int64,
 'salary': 0    11.5
 1    23.4
 2    56.0
 dtype: float64,
 'working': 0     True
 1    False
 2     True
 dtype: bool}

In [13]:
data_frame = pd.DataFrame(series_dict)
data_frame

   name  age  salary  working
0     A   21    11.5     True
1     B   22    23.4    False
2     c   23    56.0     True


Various properties of the data frame can be inspected as follows:

In [14]:
print("Type of data frame object : ", type(data_frame))
print("Data types of data frame columns : ", data_frame.dtypes)
print("Columns of data frame : ", data_frame.columns)
print("Row labels of data frame : ", data_frame.index)
print("Number of entries in data frame : ", len(data_frame))
print("Number of cells in data frame : " , data_frame.size)
print()

Type of data frame object :  <class 'pandas.core.frame.DataFrame'>
Data types of data frame columns :  name        object
age          int64
salary     float64
working       bool
dtype: object
Columns of data frame :  Index(['name', 'age', 'salary', 'working'], dtype='object')
Row labels of data frame :  RangeIndex(start=0, stop=3, step=1)
Number of entries in data frame :  3
Number of cells in data frame :  12
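
The row count and the cell count are tied together by the `shape` property. The short sketch below uses a small illustrative frame (not the one above) to show the relationship.

```python
import pandas as pd

# Illustrative frame with 3 rows and 2 columns.
frame = pd.DataFrame({"name": ["A", "B", "c"], "age": [21, 22, 23]})

rows, columns = frame.shape  # (number of rows, number of columns)

print(frame.shape)
print(len(frame) == rows)             # len() counts rows
print(frame.size == rows * columns)   # size counts cells
```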



A dictionary can also hold Series with different numbers of data values. For example, the first Series can have four values while the second has only three and the third only two. Let's see how a data frame is created from such a dictionary.

In [15]:
series_dict = {
    "one" : pd.Series(['A', 'B', 'C', 'D'], index=["a", "b", "c", "d"]),
    "two" : pd.Series([1, 2, 3], index=["a", "b", "c"]),
    "three" : pd.Series([11, 22], index=["a", "b"])
}
series_dict

{'one': a    A
 b    B
 c    C
 d    D
 dtype: object,
 'two': a    1
 b    2
 c    3
 dtype: int64,
 'three': a    11
 b    22
 dtype: int64}

In [16]:
data_frame = pd.DataFrame(series_dict)
data_frame

   one  two  three
a    A  1.0   11.0
b    B  2.0   22.0
c    C  3.0    NaN
d    D  NaN    NaN


As you can see, the rows of the data frame are the union of the Series indexes, and the positions where a Series has no value are filled with NaN (Not a Number), which is how pandas represents a missing value.
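
The missing values can also be located programmatically. The sketch below rebuilds a frame like the one above and counts the NaN entries per column; the variable name `missing_per_column` is chosen for the illustration.

```python
import pandas as pd

frame = pd.DataFrame({
    "one": pd.Series(["A", "B", "C", "D"], index=["a", "b", "c", "d"]),
    "two": pd.Series([1, 2, 3], index=["a", "b", "c"]),
    "three": pd.Series([11, 22], index=["a", "b"]),
})

# Gaps are filled with NaN, which forces the affected integer
# columns to float64.
print(frame.dtypes)

# isna() marks missing cells; summing counts them per column.
missing_per_column = frame.isna().sum()
print(missing_per_column)
```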

<b> From dict of ndarrays / lists</b>

A data frame can also be created from a dictionary whose values are ndarrays or lists.

In [17]:
array = np.random.randint(5, size = 5)
array

array([4, 1, 2, 4, 2])

In [18]:
list1 = [ i*i for i in range(5)]
list1

[0, 1, 4, 9, 16]

In [19]:
column_dict = {
    "column1" : array,
    "column2" : list1
}
column_dict

{'column1': array([4, 1, 2, 4, 2]), 'column2': [0, 1, 4, 9, 16]}

In [20]:
data_frame = pd.DataFrame(column_dict)
data_frame

   column1  column2
0        4        0
1        1        1
2        2        4
3        4        9
4        2       16
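
Unlike a dictionary of Series, a dictionary of lists or ndarrays is not aligned by label, so all values must have the same length. The sketch below, using made-up values and row labels, shows a custom index and what happens when the lengths differ.

```python
import pandas as pd

# Lists of equal length, with an explicit row index of the same length.
columns = {"column1": [4, 1, 2], "column2": [0, 1, 4]}
frame = pd.DataFrame(columns, index=["r1", "r2", "r3"])
print(frame)

# Lists of different lengths cannot be combined into a frame.
try:
    pd.DataFrame({"column1": [1, 2, 3], "column2": [1, 2]})
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```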


<b> From series</b>

If each Series contains the data values of one column, follow these steps to create a data frame from them. <br>
   - Create one Series per column from its data values, with the index specified
   - Create a list of the Series
   - Use pd.concat along the horizontal axis (axis=1) to combine the Series into a data frame
   - Rename the data frame's columns

In [23]:
#Create an index that will be used along with each series
list_of_indices = ["index1", "index2", "index3", "index4", "index5"]

In [24]:
#Create one Series per column from its data values, with the index specified
series1 = pd.Series([1, 2, 3, 4, 5], index = list_of_indices)
series2 = pd.Series([11, 22, 33, 44, 55], index = list_of_indices)
series3 = pd.Series(["A", "B", "C", "D", "E"], index = list_of_indices)

In [25]:
#Create a list of the Series
list_of_series = [ series1, series2, series3]

In [26]:
#Use pd.concat along the horizontal axis to combine the Series into a data frame
data_frame = pd.concat(list_of_series, axis=1)
data_frame

        0   1  2
index1  1  11  A
index2  2  22  B
index3  3  33  C
index4  4  44  D
index5  5  55  E


In [28]:
#Rename the data frame's columns
list_of_columns = ["column1", "column2", "column3"]
data_frame.columns = list_of_columns
data_frame

        column1  column2  column3
index1        1       11        A
index2        2       22        B
index3        3       33        C
index4        4       44        D
index5        5       55        E
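
The separate renaming step can be avoided. As a sketch of two alternatives, the column names can be supplied through the `keys` argument of `pd.concat`, or the Series can be passed as a dictionary to `pd.DataFrame`. The shortened data and the names `with_keys` and `from_dict` are illustrative.

```python
import pandas as pd

labels = ["index1", "index2", "index3"]
series1 = pd.Series([1, 2, 3], index=labels)
series2 = pd.Series(["A", "B", "C"], index=labels)

# keys= names the resulting columns at concatenation time.
with_keys = pd.concat([series1, series2], axis=1,
                      keys=["column1", "column2"])

# A dictionary of Series labels the columns by its keys.
from_dict = pd.DataFrame({"column1": series1, "column2": series2})

print(with_keys)
print(from_dict)
```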


If each Series holds the data values of one row, follow these steps to create a data frame from them. <br>
   - Create one Series per row from its data values, with the column labels as its index
   - Create a list of the Series
   - Pass the list of Series to pd.DataFrame to build the data frame
   - The index labels of the Series become the column names, so no renaming is needed

In [29]:
#Create an index that will be used along with each series
list_of_indices = ["column1", "column2", "column3", "column4", "column5"]

In [32]:
#Create one Series per row from its data values, with the column labels as its index
series1 = pd.Series([1, 2, "A", True, 5.0], index = list_of_indices)
series2 = pd.Series([11, 22, "B", False, 55.0], index = list_of_indices)
series3 = pd.Series([21, 34, "C", True, 12], index = list_of_indices)

In [33]:
#Create a list of the Series
list_of_series = [ series1, series2, series3]

In [34]:
#Pass the list of Series to pd.DataFrame; each Series becomes one row
data_frame = pd.DataFrame(list_of_series)
data_frame

   column1  column2  column3  column4  column5
0        1        2        A     True      5.0
1       11       22        B    False     55.0
2       21       34        C     True     12.0
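
In the output above the rows are labelled 0, 1, 2 because the Series have no names. As a minimal sketch, giving each Series a `name` turns that name into its row label; the names "row1" and "row2" and the shortened data are assumptions for the illustration.

```python
import pandas as pd

columns = ["column1", "column2", "column3"]

# The name of each Series (hypothetical labels here) becomes its row label.
row1 = pd.Series([1, "A", True], index=columns, name="row1")
row2 = pd.Series([11, "B", False], index=columns, name="row2")

frame = pd.DataFrame([row1, row2])

print(frame)
print(frame.dtypes)
```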
