
# Introducing Pandas Objects

Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified by labels rather than simple integer indices.

In [35]:
import pandas as pd
import numpy as np

## Pandas Series object
A Series is a 1D array of indexed data. It can be created from a list or an array as follows:

In [2]:
x=pd.Series([1,2,3])

A Series wraps both a sequence of values and a sequence of indices, which can be accessed using the `values` and `index` attributes.

In [3]:
x.values

array([1, 2, 3])

This is just like a Numpy array.

In [4]:
x.index

RangeIndex(start=0, stop=3, step=1)

This is an array-like object of type `pd.Index`.

We can access a specific value with square-bracket indexing, just as we do for the values in a NumPy array.

In [5]:
x[2]

3

We can also slice the Series. As the following sections show, the Series is more flexible than the NumPy array it resembles.
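As a minimal sketch of slicing, the cell below re-creates the Series `x` from above and takes a slice with integer bounds. The name `tail` is introduced here only for illustration.

```python
import pandas as pd

x = pd.Series([1, 2, 3])

# An integer slice works by position, and the stop position is excluded,
# as with a NumPy array
tail = x[1:3]
print(tail)
```

Note that the slice keeps the original index labels (1 and 2) alongside the values.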

### Series as a generalised Numpy array

The essential difference between a Pandas Series and a NumPy array is the index.

A NumPy array has an implicitly defined integer index that is used to access the values, whereas a Pandas Series has an explicitly defined index associated with the values.

This explicitly defined index gives the Series additional capabilities. The index need not be an integer; it can consist of values of any desired type.

For example, we can use strings as the index:

In [7]:
s=pd.Series([2,4,6,8],index=['a','b','c','d'])

The index can also consist of noncontiguous or nonsequential values:

In [8]:
k=pd.Series([3,5,7,9],index=[2,4,6,8])
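The sketch below shows how values are looked up through these explicit indices. It re-creates `s` and `k` from the cells above; the names `b_value` and `four_value` are introduced only for illustration.

```python
import pandas as pd

s = pd.Series([2, 4, 6, 8], index=['a', 'b', 'c', 'd'])
k = pd.Series([3, 5, 7, 9], index=[2, 4, 6, 8])

# Look up the value stored under the string label 'b'
b_value = s['b']
print(b_value)

# With an integer index, 4 is treated as a label, not as a position
four_value = k[4]
print(four_value)
```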

### Series as specialised dictionary

**What's a dictionary?**<br>
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, whereas a Pandas Series is a structure that maps typed keys to a set of typed values.

Just as the type-specific compiled code behind a **NumPy array** makes it more efficient than a **Python list** for certain operations, the type information in a **Series** makes it more efficient than a **Python dictionary** for certain operations.

Constructing a Series from a dictionary.

In [10]:
population_dict = {'California': 38332521,'Texas': 26448193,'New York': 19651127,'Florida': 19552860,'Illinois': 12882135}
pop=pd.Series(population_dict)
pop

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

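The Series is created with its index drawn from the dictionary keys. As the output above shows, the keys appear in the order in which they were inserted into the dictionary rather than in sorted order.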
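The dictionary analogy can be sketched as follows, assuming the same population data as above. The names `california` and `subset` are introduced only for illustration. Unlike a dictionary, the Series also supports array-style operations such as slicing by label.

```python
import pandas as pd

population_dict = {'California': 38332521, 'Texas': 26448193,
                   'New York': 19651127, 'Florida': 19552860,
                   'Illinois': 12882135}
pop = pd.Series(population_dict)

# Dictionary-style item access by key
california = pop['California']
print(california)

# Label-based slice: both endpoints are included
subset = pop['California':'New York']
print(subset)
```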

### Constructing Series object

There are many ways to construct a Series, and most of them are a variation of the following:
```
pd.Series(data,index)
```

Here `data` can be one of several kinds of object, and `index` is an optional parameter.

#### data can be a python list/Numpy array

In [14]:
l=pd.Series([11,12,13,14])
l

0    11
1    12
2    13
3    14
dtype: int64

Here the index defaults to an integer sequence starting at 0.

#### data can be a scalar

In [18]:
sc=pd.Series(6,index=[200,300,400])
sc

200    6
300    6
400    6
dtype: int64

The Series is filled with the scalar 6, repeated once for each of the given indices.

#### data can be a dictionary, where the index defaults to the dictionary keys

In [15]:
d=pd.Series({'2':6,'1':8})
d

2    6
1    8
dtype: int64

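The index can also be set explicitly to control the order or the subset of keys used. In that case the Series is populated only with the explicitly identified keys.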

In [19]:
d2=pd.Series({2:'a',1:'b',3:'c'},index=[2,1])
d2

2    a
1    b
dtype: object
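A related case, sketched below, is an explicit index that names a key the dictionary does not contain. The key `5` and the name `d3` are hypothetical and not part of the original notes. Pandas marks the entry without a matching key as missing (NaN).

```python
import pandas as pd

# Hypothetical example: the key 5 does not appear in the dictionary
d3 = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[1, 5])
print(d3)

# Entries without a matching dictionary key are marked as missing
print(d3.isna())
```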

## Pandas DataFrame object
The next fundamental object in Pandas is the DataFrame. <br>
Like the Series discussed before, it can be thought of either as a generalisation of a NumPy array or as a specialisation of a Python dictionary.

### DataFrame as a generalised Numpy array

If a **Series** is an analog of a one-dimensional array with flexible indices, then a **DataFrame** is an analog of a two-dimensional array with flexible row and column indices.

Just as we can think of a 2D array as an aligned sequence of one-dimensional columns, we can think of a DataFrame as an aligned sequence of Series objects.
Here "aligned" means that they share the same index.

In [20]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,'Florida': 170312, 'Illinois': 149995}
area=pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [22]:
states=pd.DataFrame({'area':area,'population': pop})
states

              area  population
California  423967    38332521
Texas       695662    26448193
New York    141297    19651127
Florida     170312    19552860
Illinois    149995    12882135


The DataFrame is built from the two Series, which are aligned on their shared index.

Like the Series, the DataFrame has an *index* attribute that gives access to the index labels.

In [23]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

Additionally, it has a *columns* attribute, which is an *Index object* holding the column labels.

In [26]:
states.columns

Index(['area', 'population'], dtype='object')

### DataFrames as a specialised dictionary
A dictionary maps a key to a value, whereas a DataFrame maps a column name to a Series of column data.
We access the data in a column as
```
dataframe[col_name]
```
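As a concrete sketch of this column access, the cell below rebuilds `states` from the same data used above and retrieves the `area` column. The name `area_col` is introduced only for illustration.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
states = pd.DataFrame({'area': area, 'population': pop})

# Indexing with a column name returns that column as a Series
area_col = states['area']
print(type(area_col))
print(area_col)
```

This is a point where the array analogy breaks down: in a two-dimensional NumPy array, `data[0]` returns the first row, whereas for a DataFrame `data['col']` returns a column.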

### Constructing DataFrame object
A DataFrame object can be constructed in many ways.

#### From a single Series object
A DataFrame is a collection of Series objects, so a single-column DataFrame can be constructed from a single Series.

In [28]:
df1=pd.DataFrame(pop,columns=['population'])
df1

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


#### From a list of dicts
A DataFrame can be constructed from a list of dicts. We'll use a list comprehension to generate some data.

In [30]:
df2=pd.DataFrame([{'a':i,'b':2*i} for i in range(3)])
df2

   a  b
0  0  0
1  1  2
2  2  4


In [31]:
df3=pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
df3

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0



Even when some keys are missing from some of the dictionaries, Pandas fills the gaps with NaN ("not a number") values.

#### From a dictionary of Series objects

In [32]:
df4=pd.DataFrame({'population': pop,'area':area})
df4

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


#### From 2D Numpy array

Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names. If they are not specified, an integer index is used for each.

In [36]:
df4=pd.DataFrame(np.random.rand(3,2),columns=['heelo','green'],index=['a','b','c'])
df4

      heelo     green
a  0.268075  0.010392
b  0.886056  0.597612
c  0.597371  0.738740
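The default behaviour mentioned above can be sketched by omitting `columns` and `index`. The array `arr` and the name `df_default` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# A small 3x2 array with predictable contents
arr = np.arange(6).reshape(3, 2)

# No index or columns given, so both default to integer ranges
df_default = pd.DataFrame(arr)
print(df_default)
print(df_default.index)
print(df_default.columns)
```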



#### From structured arrays

In [39]:
a=np.zeros(3,dtype=[('A','i8'),('B','f8')])
a

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])
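The notes stop at the structured array itself. A minimal sketch of the remaining step is to pass it to `pd.DataFrame`, which uses the field names as column labels. The name `df_struct` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Structured array with an integer field 'A' and a float field 'B'
a = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])

# The field names become the column labels of the DataFrame
df_struct = pd.DataFrame(a)
print(df_struct)
print(df_struct.dtypes)
```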

## Index object
So far we've seen that both **Series** and **DataFrame** objects have an explicitly defined index that lets you access and modify the data.

The Index object is an interesting structure in itself. It can be thought of either as an immutable array or as an ordered set (technically a multiset, since an Index object may contain repeated values).

Creating a simple Index from a list:

In [40]:
ind=pd.Index([2,4,6,7])
ind

Int64Index([2, 4, 6, 7], dtype='int64')

### Index as an immutable array
Just as with a NumPy array, we can use indexing and slicing to retrieve values from an Index.

In [41]:
ind[0]

2

In [42]:
ind[::-1]

Int64Index([7, 6, 4, 2], dtype='int64')

An Index also has many of the attributes familiar from NumPy arrays:

In [47]:
print(ind.shape, ind.ndim, ind.dtype, ind.size)

(4,) 1 int64 4


One difference is that an **Index is immutable**, i.e. it cannot be modified by the normal means.

This immutability makes it safer to share indices between multiple DataFrames and arrays without the risk of side effects from inadvertent index modification.
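The sketch below shows what happens when we try to assign to an element of `ind`. The flag `raised` is introduced here only for illustration; the assignment is expected to fail with a `TypeError`.

```python
import pandas as pd

ind = pd.Index([2, 4, 6, 7])

raised = False
try:
    # Item assignment is not supported on an Index
    ind[1] = 0
except TypeError as err:
    raised = True
    print("TypeError:", err)

# The Index is unchanged
print(ind)
```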

### Index as ordered set
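Pandas objects are designed to facilitate operations such as joins across datasets, which depend on set arithmetic. Index objects therefore provide set operations such as intersection, union and symmetric difference.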

In [48]:
indA=pd.Index([1,4,6,8])
indB=pd.Index([1,3,8,9])

In [50]:
indA.intersection(indB)

Int64Index([1, 8], dtype='int64')

In [51]:
indA.union(indB)

Int64Index([1, 3, 4, 6, 8, 9], dtype='int64')

In [52]:
indA.symmetric_difference(indB)

Int64Index([3, 4, 6, 9], dtype='int64')
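Two further points can be sketched briefly. The `difference` method and the `is_unique` attribute are not used in the cells above; they are shown here as one way to illustrate the one-sided set difference and the earlier remark that an Index may contain repeated values. The names `only_in_a` and `dup` are introduced only for illustration.

```python
import pandas as pd

indA = pd.Index([1, 4, 6, 8])
indB = pd.Index([1, 3, 8, 9])

# Labels present in indA but not in indB
only_in_a = indA.difference(indB)
print(only_in_a)

# An Index may hold repeated values, which makes it a multiset
dup = pd.Index([1, 1, 2])
print(dup.is_unique)
```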