
# Introduction to Pandas

The name pandas is derived from "panel data". The library is designed for working with tabular or heterogeneous data.

The library is imported with the following code:

In [1]:
import pandas as pd
import numpy as np # NumPy is not required to use pandas, but it is needed later in this notebook

# Pandas Data Structures

## Series

A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.

The simplest Series is formed from only an array of data:

In [2]:
ds = pd.Series([4, 7, -5, 3]) # Passing a List as Argument
ds

Unnamed: 0,0
0,4
1,7
2,-5
3,3
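The two parts of a Series, its values and its index, can be inspected separately. The following is a minimal sketch that rebuilds the same data as `ds`; the names `s`, `values`, and `index` are only illustrative.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3])

# The data as a NumPy array
values = s.to_numpy()

# The labels; when no index is given, pandas builds a RangeIndex 0..n-1
index = s.index

print(values)
print(list(index))
```

Because no index was passed, the labels are simply the integer positions.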


Often it will be desirable to create a Series with an index identifying each data point with a label:


In [3]:
ds_2 = pd.Series([31,22,33,14,95], index=[4,5,'a',7,8]) # Values come from the first list, labels from the list passed as index
ds_2

Unnamed: 0,0
4,31
5,22
a,33
7,14
8,95


You can use labels in the index when selecting single
values or a set of values:

In [4]:
print(ds_2['a']) # Selecting a single label from a Series returns a scalar

33


In [5]:
ds_2[7] = 45 # Assigning a new value at Index 7
ds_2

Unnamed: 0,0
4,31
5,22
a,33
7,45
8,95


In [6]:
ds_2[[8, 'a', 4]] # Selecting several labels returns a new Series in the requested order

Unnamed: 0,0
8,95
a,33
4,31


Using NumPy functions or NumPy-like operations, such as filtering with a boolean
array, scalar multiplication, or applying math functions, will preserve the index-value link:

In [7]:
ds_2[ds_2>32] # Filtering a Data Series

Unnamed: 0,0
a,33
7,45
8,95


In [8]:
ds_2 * 2 # Mathematical Operation with a Data Series

Unnamed: 0,0
4,62
5,44
a,66
7,90
8,190


In [9]:
np.square(ds_2) # NumPy Function on a Data Series

Unnamed: 0,0
4,961
5,484
a,1089
7,2025
8,9025
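The three operations above can be checked side by side. This sketch assumes an illustrative Series `s` with string labels (not the mixed labels of `ds_2`) and confirms that each result keeps the labels of the rows it came from.

```python
import numpy as np
import pandas as pd

s = pd.Series([31, 22, 33, 45, 95], index=['w', 'x', 'a', 'y', 'z'])

filtered = s[s > 32]   # boolean filtering keeps the labels of the surviving rows
doubled = s * 2        # scalar multiplication keeps every label
roots = np.sqrt(s)     # a NumPy function returns a Series, not a bare array

print(filtered)
print(doubled)
print(roots)
```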


The membership operator `in` can be used to check whether a label is present in the index of a Series. It tests the index labels, not the values.

In [10]:
'a' in ds_2 # 'a' is an index label, so the result is True

True

In [11]:
33 in ds_2 # 33 is a value, not an index label, so the result is False

False
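To test for a value rather than a label, the check has to be made against the values. A minimal sketch, assuming the same data as `ds_2` after the assignment in cell 5; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([31, 22, 33, 45, 95], index=[4, 5, 'a', 7, 8])

label_present = 'a' in s                  # looks in the index labels
value_as_label = 33 in s                  # 33 is not a label
value_present = 33 in s.to_numpy()        # looks in the values
value_present_isin = s.isin([33]).any()   # same check using a Series method

print(label_present, value_as_label, value_present, value_present_isin)
```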

Instead of a list or a tuple, pd.Series() also accepts a dictionary. The keys become the index labels and the dictionary values become the data.

In [12]:
dict_3 = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
ds_3 = pd.Series(dict_3)
ds_3

Unnamed: 0,0
Ohio,35000
Texas,71000
Oregon,16000
Utah,5000


A Series can also be created from another Series together with a new index:

In [13]:
ds_4 = pd.Series(ds_3, index=['California', 'Ohio', 'Oregon', 'Texas']) # Labels missing from the passed Series (here 'California') get NaN; labels left out of the new index (here 'Utah') are dropped
ds_4

Unnamed: 0,0
California,
Ohio,35000.0
Oregon,16000.0
Texas,71000.0


The isnull and notnull functions in pandas should be used to detect missing data:

In [14]:
pd.isnull(ds_4)

Unnamed: 0,0
California,True
Ohio,False
Oregon,False
Texas,False


In [15]:
pd.notnull(ds_4)

Unnamed: 0,0
California,False
Ohio,True
Oregon,True
Texas,True


Series also has these as Instance Methods.

In [16]:
ds_4.isnull()

Unnamed: 0,0
California,True
Ohio,False
Oregon,False
Texas,False
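The boolean results of these checks are usually put to work, for example to count or filter out the missing entries. A small sketch, assuming a Series with the same contents as `ds_4`; the names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
              index=['California', 'Ohio', 'Oregon', 'Texas'])

n_missing = int(s.isnull().sum())   # True counts as 1 when summed
present = s[s.notnull()]            # keep only the non-missing entries
dropped = s.dropna()                # same result using a dedicated method

print(n_missing)
print(present)
```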


A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations:

In [17]:
ds_3

Unnamed: 0,0
Ohio,35000
Texas,71000
Oregon,16000
Utah,5000


In [18]:
ds_4

Unnamed: 0,0
California,
Ohio,35000.0
Oregon,16000.0
Texas,71000.0


In [19]:
ds_5 = ds_3 + ds_4
ds_5 # A label needs a non-missing numeric value in both Series; otherwise the result for that label is NaN

Unnamed: 0,0
California,
Ohio,70000.0
Oregon,32000.0
Texas,142000.0
Utah,
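If NaN for labels found in only one Series is not wanted, the `add` method with `fill_value` is one alternative to the `+` operator. The sketch below rebuilds the same data as `ds_3` and `ds_4` under the illustrative names `a` and `b`.

```python
import pandas as pd

a = pd.Series({'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000})
b = pd.Series(a, index=['California', 'Ohio', 'Oregon', 'Texas'])

# fill_value=0 stands in for a value that is missing on one side only
total = a.add(b, fill_value=0)

print(total)
```

Utah now keeps its value from `a`, while California stays NaN because it has no value in either Series.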


We can assign a name to a Series and to its index as well.

In [20]:
ds_5.name = 'Population' # Giving Data Series a Name
ds_5.index.name = 'State' # Giving Index Column a Name
ds_5

Unnamed: 0_level_0,Population
State,Unnamed: 1_level_1
California,
Ohio,70000.0
Oregon,32000.0
Texas,142000.0
Utah,


A Series's index can be altered in place by assignment:

In [21]:
ds_5.index = ['California - N/a', 'Ohio', 'Oregon', 'Texas','Utah - N/a'] # Reassigning the entire Index
ds_5

Unnamed: 0,Population
California - N/a,
Ohio,70000.0
Oregon,32000.0
Texas,142000.0
Utah - N/a,


In [22]:
# Individual index labels can also be renamed
ds_6 = ds_5.rename(index={'Texas':'Texas City'})
ds_6

Unnamed: 0,Population
California - N/a,
Ohio,70000.0
Oregon,32000.0
Texas City,142000.0
Utah - N/a,
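Unlike assigning to `.index`, `rename` leaves the original Series untouched and returns a new one, which is why the result above was stored in `ds_6`. A minimal sketch with a shortened, illustrative Series:

```python
import pandas as pd

s = pd.Series([70000.0, 142000.0], index=['Ohio', 'Texas'], name='Population')

renamed = s.rename(index={'Texas': 'Texas City'})

print(list(s.index))        # the original labels are unchanged
print(list(renamed.index))  # the new Series carries the renamed label
```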


## DataFrame

A pandas DataFrame can be thought of as a collection of Series that share the same index. A common way to build one is from a dictionary of equal-length lists: the keys become the column names and the lists become the column contents.

In [41]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002, 2003],
 'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]} # Constructing Dictionary

df = pd.DataFrame(data) # passing the dictionary as Argument in pd.DataFrame function

In [42]:
df

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [43]:
df = pd.DataFrame(data, columns=['year', 'state', 'pop']) # Specifying Column Arrangement
df

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


In [44]:
df = pd.DataFrame(data, columns=['year','state','pop','debt'], index=['one','two','three','four','five','six']) # Custom row labels; 'debt' is not in data, so that column is filled with NaN
df

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


In [45]:
df['state'] # Extracting Data Series from Data Frame

Unnamed: 0,state
one,Ohio
two,Ohio
three,Ohio
four,Nevada
five,Nevada
six,Nevada


In [46]:
df[['state','year']] # Extracting Data Frame from Data Frame

Unnamed: 0,state,year
one,Ohio,2000
two,Ohio,2001
three,Ohio,2002
four,Nevada,2001
five,Nevada,2002
six,Nevada,2003


Rows can be retrieved with the .loc (label-based) and .iloc (position-based) attributes.

In [47]:
df.loc['six'] # Retrieve a row by its index label

Unnamed: 0,six
year,2003
state,Nevada
pop,3.2
debt,


In [48]:
df.iloc[1] # Retrieve a row by its integer position

Unnamed: 0,two
year,2001
state,Ohio
pop,1.7
debt,


Columns can be modified by assignment. For example, the empty 'debt' column
could be assigned a scalar value or an array of values:


In [49]:
df['debt'] = 16.5

In [50]:
df

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5
six,2003,Nevada,3.2,16.5


In [51]:
df['debt'] = np.arange(6)

In [52]:
df

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0
two,2001,Ohio,1.7,1
three,2002,Ohio,3.6,2
four,2001,Nevada,2.4,3
five,2002,Nevada,2.9,4
six,2003,Nevada,3.2,5


In [53]:
val = pd.Series([-1.2, -1.5, -1.7], index=['one','two','six'])

In [54]:
df['debt'] = val

In [55]:
df

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,-1.2
two,2001,Ohio,1.7,-1.5
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,-1.7
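When a Series is assigned to a column, its labels are matched against the DataFrame's index, as the output above shows. The sketch below uses an illustrative frame and a Series that also contains a label ('ten') absent from the frame, to show that such labels are ignored rather than added as new rows.

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])
val = pd.Series([-1.2, -9.9], index=['one', 'ten'])

frame['debt'] = val   # aligned on index labels, not on position

print(frame)
```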


Assigning to a column that doesn't exist creates a new column. The del keyword deletes columns, as with a dict.

In [56]:
df['eastern'] = (df['state'] == 'Ohio') # Creating additional Column of Boolean Values
df

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,-1.2,True
two,2001,Ohio,1.7,-1.5,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,,False
five,2002,Nevada,2.9,,False
six,2003,Nevada,3.2,-1.7,False


In [57]:
del df['eastern'] # del Keyword deletes an entire Column

In [58]:
df

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,-1.2
two,2001,Ohio,1.7,-1.5
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,-1.7
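`del` changes the DataFrame in place. When the original should be kept, the `drop` method is an alternative that returns a new DataFrame instead. A minimal sketch with an illustrative two-row frame:

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})
frame['eastern'] = frame['state'] == 'Ohio'

trimmed = frame.drop(columns=['eastern'])        # new frame without the column
still_there = 'eastern' in frame.columns         # the original is unchanged so far

del frame['eastern']                             # removes the column in place
gone_after_del = 'eastern' not in frame.columns

print(still_there, gone_after_del)
```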


The head and tail methods return the first five and the last five rows of a DataFrame by default:

In [63]:
df.head() # Returns First 5 Rows of the Data Frame

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,-1.2
two,2001,Ohio,1.7,-1.5
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


In [64]:
df.tail() # Returns Bottom 5 Rows of the Data Frame

Unnamed: 0,year,state,pop,debt
two,2001,Ohio,1.7,-1.5
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,-1.7
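Both methods accept the number of rows to return, with five being only the default. A short sketch on an illustrative single-column frame:

```python
import pandas as pd

frame = pd.DataFrame(
    {'year': [2000, 2001, 2002, 2001, 2002, 2003]},
    index=['one', 'two', 'three', 'four', 'five', 'six'],
)

first_two = frame.head(2)    # first 2 rows
last_three = frame.tail(3)   # last 3 rows

print(first_two)
print(last_three)
```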


In [66]:
df.columns # Returns the Column Names of the Data Frame

Index(['year', 'state', 'pop', 'debt'], dtype='object')

Just as with a Series, the index of a DataFrame can be given a name, and so can the columns:

In [69]:
df.index.name = 'Number'
df.columns.name = 'Attribute'
df

Attribute,year,state,pop,debt
Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
one,2000,Ohio,1.5,-1.2
two,2001,Ohio,1.7,-1.5
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,-1.7


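The .values attribute returns the data as a two-dimensional NumPy ndarray. Because the columns here hold different types, the dtype of the array is object.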

In [75]:
df.values

array([[2000, 'Ohio', 1.5, -1.2],
       [2001, 'Ohio', 1.7, -1.5],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, nan],
       [2003, 'Nevada', 3.2, -1.7]], dtype=object)
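The object dtype appears because one array has to hold numbers and strings together. Selecting only the numeric columns first gives a numeric array, as this sketch on a shortened, illustrative frame shows. It uses `to_numpy()`, the method form of the same conversion.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001],
                      'state': ['Ohio', 'Nevada'],
                      'pop': [1.5, 2.4]})

mixed = frame.to_numpy()                      # numbers and strings together
numeric = frame[['year', 'pop']].to_numpy()   # integers and floats only

print(mixed.dtype)
print(numeric.dtype)
```

The integer and float columns are combined into a single floating-point array, so the years come back as floats.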