Putting Some Pandas In Your Python 🐼
# Introduction to Pandas 🐼

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

Reference: https://pandas.pydata.org/docs/getting_started/index.html

Question: What are the Data Structures in Pandas?
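Answer: Series (a one-dimensional labeled array, similar to a 1-D NumPy array) and DataFrame (a two-dimensional labeled table, similar to a 2-D NumPy array, but its columns can hold different data types).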

Installation Command
! pip install pandas

Importing Pandas
import pandas as pd

### What's covered in this notebook?
1. Pandas Data Structure - Series (ndarray-like)
   - Creating Series using Python list or dict
   - Creating Series from Numpy ndarray
   - Creating Series from scalar
   - Accessing Properties/Attributes and Methods of Series
   - Accessing data using Indexing and Slicing
2. Pandas Data Structure - DataFrame
   - Creating DataFrame using Python dict, list or tuple
   - Creating DataFrame using Numpy Array
   - Accessing Attributes/Properties and Methods of DataFrame
3. Working with Tabular Data
   - DataFrame to .csv & .xlsx
   - Reading .xlsx File
   - Reading .csv File - Iris Dataset
4. Non-Visual Data Analysis using Pandas (Statistical Analysis)
   - sum()
   - min() and max()
   - mean(), median(), var() and std()
   - describe() to summarize the data
   - corr(), skew() and kurt()
   - count(), unique() and value_counts() for categorical column
   - DataFrame.agg()
5. Accessing Data in a DataFrame using Indexing and Slicing in Pandas DataFrame
   - Reading .csv File - Weather Dataset
   - Filtering Single Column vs Multiple Columns from a DataFrame
   - Filtering Rows from a DataFrame
   - Filtering specific rows and columns from a DataFrame
   - .loc[] vs .iloc[]
6. Renaming Columns, Modifying DataTypes, Creating New Columns and Deleting Columns in Pandas DataFrame
   - Reading .csv File - Retail Store Sales Data
   - Renaming Columns
   - Modifying Columns DataTypes
   - Creating a Derived Column
   - Creating columns using apply() function
   - Deleting column(s) in DataFrame
7. Adding/Inserting Row(s)
   - Reading .xlsx File - Weather Data
   - Insert Row(s) using pandas.concat()
   - Inserting a Row using List - .loc[] and .iloc[]
   - Inserting a Row at a Specific Index of a DataFrame
   - Saving DataFrame to .xlsx
8. Handling TimeSeries Data
   - Reading .csv File - Online Store Sales Data
   - pd.to_datetime()
   - Working with DateTime in Pandas
   - Creating a Column containing only the Order Month
   - Calculating Delivery Time from Order Date and Ship Date
   - pandas.Timedelta
   - Creating a Column containing Delivery Time in Number of Days
   - Improve Performance by Setting Date Column as the Index
   - Sorting Data Based on Index vs Values and Resetting Index
9. Summary

Import Pandas Module

In [1]:
import pandas as pd
import numpy as np

## Pandas Data Structure - Series (ndarray-like)
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

The basic method to create a Series is to call:
s = pd.Series(data, index=index)

Important Note: Series data structures are value-mutable (the values they contain can be altered) but not size-mutable.

Here, data can be many different things:

a Python list or dict
an ndarray
a scalar value (like 5)

Creating Series using Python list or dict

In [2]:
# pd.Series(data, index=index)
# index -> hashable labels, same length as data (uniqueness is not required).
# When omitted, pandas uses a default RangeIndex: 0, 1, ..., n-1.
import pandas as pd

s = pd.Series([1, 2, 3, 4])
print(s)


0    1
1    2
2    3
3    4
dtype: int64


In [3]:
s = pd.Series(['x', 'y', 'z', 'abc'])

print(s)

0      x
1      y
2      z
3    abc
dtype: object


In [4]:
s = pd.Series(['Alade','Idris'])
print(s)

0    Alade
1    Idris
dtype: object


In [5]:
d = {"b": 1, "a": 0, "c": 2}

s = pd.Series(d)

print(s)

b    1
a    0
c    2
dtype: int64
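
When a dict is passed together with an explicit index, pandas matches values by label rather than by position, and labels that are missing from the dict become NaN. A minimal sketch reusing the dict above; the extra label "d" is an illustrative assumption, not part of the original example:

```python
import pandas as pd

d = {"b": 1, "a": 0, "c": 2}

# Values are matched to the requested labels, not taken in dict order.
# "d" is a hypothetical label that has no matching key in the dict.
s = pd.Series(d, index=["a", "b", "c", "d"])

print(s)
print(s.isna().sum())  # number of requested labels without a matching key
```

Because NaN is a floating point value, the resulting Series is stored as float64 even though the dict holds integers.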


Creating Series from Numpy ndarray

In [7]:
data = np.arange(10)
s = pd.Series(data)
print(s)

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64


Creating Series from scalar

In [9]:
pd.Series(5.0, index=["a", "b", "c", "d", "e"])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

Accessing Properties/Attributes and Methods of Series

In [10]:
data = np.array([10, 20, 30, 40, 50, 60, 70, 80])

s = pd.Series(data)
print("Data Type:", s.dtype)
print("Shape:", s.shape)
print("Values:", s.values)
print("Array:", s.array)

Data Type: int64
Shape: (8,)
Values: [10 20 30 40 50 60 70 80]
Array: <NumpyExtensionArray>
[10, 20, 30, 40, 50, 60, 70, 80]
Length: 8, dtype: int64


In [11]:
print("Method to extract actual numpy ndarray:", s.to_numpy())

Method to extract actual numpy ndarray: [10 20 30 40 50 60 70 80]


In [12]:
s.head()

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [13]:
s.tail()

3    40
4    50
5    60
6    70
7    80
dtype: int64

In [14]:
s.info()

<class 'pandas.core.series.Series'>
RangeIndex: 8 entries, 0 to 7
Series name: None
Non-Null Count  Dtype
--------------  -----
8 non-null      int64
dtypes: int64(1)
memory usage: 192.0 bytes


Accessing data using Indexing and Slicing

In [19]:
s = pd.Series([1, 2, 3, 4, 5])

print(s[2])
print(s[[1,4]])
print(s[2:])

3
1    2
4    5
dtype: int64
2    3
3    4
4    5
dtype: int64


In [20]:
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

print(s)

a    1
b    2
c    3
d    4
e    5
dtype: int64
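
With a labeled index like the one above, elements can be selected either by label or by position. The sketch below uses the .loc and .iloc indexers on the same Series to make the difference explicit; it is one possible illustration, not the only way to index a Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

# Label-based access with .loc
print(s.loc["c"])          # single label
print(s.loc[["a", "e"]])   # list of labels
print(s.loc["b":"d"])      # label slice: both endpoints are included

# Position-based access with .iloc
print(s.iloc[2])           # third element
print(s.iloc[1:3])         # positional slice: the stop position is excluded
```

Note the asymmetry: a label slice includes its end label, while a positional slice follows the usual Python convention and excludes the stop position.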


## Pandas Data Structure - DataFrame

A DataFrame is a general 2D labeled, value- and size-mutable tabular structure with potentially heterogeneously-typed columns.

Important Note: DataFrames are value-mutable (the values they contain can be altered) as well as size-mutable (columns can be inserted and deleted).


Question: What kind of data does pandas handle?
Answer: When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.

Remember
Import the package, aka import pandas as pd
A table of data is stored as a pandas DataFrame
Each column in a DataFrame is a Series
You can do things by applying a method to a DataFrame or Series
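
The points above can be checked directly. The following sketch builds a small DataFrame (the values are hypothetical sample data), selects one column, and confirms that it is a Series sharing the DataFrame's row labels:

```python
import pandas as pd

# Hypothetical sample values, for illustration only
df = pd.DataFrame({
    "Name": ["Tom", "Jack", "Steve"],
    "Age": [28, 34, 42],
})

age = df["Age"]                      # selecting one column returns a Series
print(type(age).__name__)
print(age.index.equals(df.index))    # the column shares the row labels

# A method can be applied to a single column (Series) or to the whole DataFrame
print(age.max())
print(df.max(numeric_only=True))
```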

Creating a Pandas DataFrame
Syntax
df = pd.DataFrame(data, index=idxs, columns=cols)

Here data can be many different things:

Python Dict, List or Tuple
Numpy array

Creating DataFrame using Python dict, list or tuple

In [21]:
# Creating dataframe using Python Dictionary

data = {
        'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 
        'Age': [28,34,np.nan,42],
        'Gender': ['Male', 'Female', 'Female', 'Male']
       }

df = pd.DataFrame(data)

df

Unnamed: 0,Name,Age,Gender
0,Tom,28.0,Male
1,Jack,34.0,Female
2,Steve,,Female
3,Ricky,42.0,Male


In [22]:
# Creating a DataFrame from a list of tuples (column names default to 0, 1, 2, ...)

data = [('1/1/2019', 13, 6, 'Rain'),
       ('2/1/2019', 11, 7, 'Fog'),
       ('3/1/2019', 12, 8, 'Sunny'),
       ('4/1/2019', 8, 5, 'Snow'),
       ('5/1/2019', 9, 6, 'Rain')]

df = pd.DataFrame(data)

df

Unnamed: 0,0,1,2,3
0,1/1/2019,13,6,Rain
1,2/1/2019,11,7,Fog
2,3/1/2019,12,8,Sunny
3,4/1/2019,8,5,Snow
4,5/1/2019,9,6,Rain


In [23]:
# Creating a DataFrame from a list of tuples, with explicit column names

data = [('1/1/2019', 13, 6, 'Rain'),
       ('2/1/2019', 11, 7, 'Fog'),
       ('3/1/2019', 12, 8, 'Sunny'),
       ('4/1/2019', 8, 5, 'Snow'),
       ('5/1/2019', 9, 6, 'Rain')]

df = pd.DataFrame(data, columns=['Day', 'Temperature', 'WindSpeed', 'Event'])

df

Unnamed: 0,Day,Temperature,WindSpeed,Event
0,1/1/2019,13,6,Rain
1,2/1/2019,11,7,Fog
2,3/1/2019,12,8,Sunny
3,4/1/2019,8,5,Snow
4,5/1/2019,9,6,Rain


In [24]:
# Creating a DataFrame from a tuple of lists, with explicit row labels and column names

data = (['1/1/2019', 13, 6, 'Rain'],
       ['2/1/2019', 11, 7, 'Fog'],
       ['3/1/2019', 12, 8, 'Sunny'],
       ['4/1/2019', 8, 5, 'Snow'],
       ['5/1/2019', 9, 6, 'Rain'])

df = pd.DataFrame(data, 
                  index=['I1', 'I2', 'I3', 'I4', 'I5'], 
                  columns=['Day', 'Temperature', 'WindSpeed', 'Event'])

df

Unnamed: 0,Day,Temperature,WindSpeed,Event
I1,1/1/2019,13,6,Rain
I2,2/1/2019,11,7,Fog
I3,3/1/2019,12,8,Sunny
I4,4/1/2019,8,5,Snow
I5,5/1/2019,9,6,Rain


In [26]:
print(df['Temperature'])

I1    13
I2    11
I3    12
I4     8
I5     9
Name: Temperature, dtype: int64


Creating DataFrame using Numpy Array

In [27]:
arr = np.random.randint(100,1999, size=(1000,100))
print(arr.shape)

(1000, 100)


In [28]:
df = pd.DataFrame(arr)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,1043,1509,1614,762,372,1753,1140,370,443,645,...,1106,381,388,1074,1898,678,1462,1975,177,1106
1,1991,1442,1255,1229,885,112,1963,1341,1595,133,...,1115,1953,1879,1817,477,669,255,569,561,949
2,642,1131,1851,941,733,1913,577,584,582,1919,...,939,828,1252,675,522,1353,1108,681,527,1658
3,1371,152,507,1564,1303,351,461,871,414,1585,...,1028,1376,1646,1725,612,1756,524,1608,1185,236
4,1007,765,735,385,561,1341,191,1213,1211,1397,...,1787,1958,849,443,187,1773,233,584,494,443
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,734,1450,1443,1767,1243,1698,1776,1757,272,1135,...,780,1981,1484,325,1394,810,410,671,1863,1384
996,151,247,740,1300,1872,601,1823,773,1589,613,...,1545,1144,879,635,198,176,1832,361,1613,1605
997,1984,631,1879,1298,863,1325,577,1446,1498,1059,...,429,616,451,1505,1310,1492,318,1134,1086,1369
998,1992,1065,447,1200,635,1446,1092,1771,807,781,...,1955,1098,920,739,1363,921,1652,524,1912,1612


In [30]:
df = pd.DataFrame(arr, columns=['col_'+str(i) for i in range(1,101)])
df

Unnamed: 0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,col_10,...,col_91,col_92,col_93,col_94,col_95,col_96,col_97,col_98,col_99,col_100
0,1043,1509,1614,762,372,1753,1140,370,443,645,...,1106,381,388,1074,1898,678,1462,1975,177,1106
1,1991,1442,1255,1229,885,112,1963,1341,1595,133,...,1115,1953,1879,1817,477,669,255,569,561,949
2,642,1131,1851,941,733,1913,577,584,582,1919,...,939,828,1252,675,522,1353,1108,681,527,1658
3,1371,152,507,1564,1303,351,461,871,414,1585,...,1028,1376,1646,1725,612,1756,524,1608,1185,236
4,1007,765,735,385,561,1341,191,1213,1211,1397,...,1787,1958,849,443,187,1773,233,584,494,443
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,734,1450,1443,1767,1243,1698,1776,1757,272,1135,...,780,1981,1484,325,1394,810,410,671,1863,1384
996,151,247,740,1300,1872,601,1823,773,1589,613,...,1545,1144,879,635,198,176,1832,361,1613,1605
997,1984,631,1879,1298,863,1325,577,1446,1498,1059,...,429,616,451,1505,1310,1492,318,1134,1086,1369
998,1992,1065,447,1200,635,1446,1092,1771,807,781,...,1955,1098,920,739,1363,921,1652,524,1912,1612


Accessing Attributes/Properties and Methods of DataFrame

In [31]:
# Create Dictionary of Series
import pandas as pd
import numpy as np

data = {'Name':pd.Series(['Tom', 'Jack', 'Steve', 'Ricky', 'Vin', 'James', 'Vin']),
       'Age':pd.Series([25,26,25,35,23,33,31]),
       'Rating':pd.Series([4.23,4.1,3.4,5,2.9,np.nan,3.1])}

df = pd.DataFrame(data)

df

Unnamed: 0,Name,Age,Rating
0,Tom,25,4.23
1,Jack,26,4.1
2,Steve,25,3.4
3,Ricky,35,5.0
4,Vin,23,2.9
5,James,33,
6,Vin,31,3.1


In [32]:
print('Shape of DataFrame:', df.shape)
print()
print('Name of each column:', df.columns)
print()
print('Data Types of each Columns:\n', df.dtypes)
print()
print('Axes:\n', df.axes)
print()
print('Return data as numpy array:\n', df.values)

Shape of DataFrame: (7, 3)

Name of each column: Index(['Name', 'Age', 'Rating'], dtype='object')

Data Types of each Columns:
 Name       object
Age         int64
Rating    float64
dtype: object

Axes:
 [RangeIndex(start=0, stop=7, step=1), Index(['Name', 'Age', 'Rating'], dtype='object')]

Return data as numpy array:
 [['Tom' 25 4.23]
 ['Jack' 26 4.1]
 ['Steve' 25 3.4]
 ['Ricky' 35 5.0]
 ['Vin' 23 2.9]
 ['James' 33 nan]
 ['Vin' 31 3.1]]


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    7 non-null      object 
 1   Age     7 non-null      int64  
 2   Rating  6 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 296.0+ bytes


The method info() provides technical information about a DataFrame, so let’s explain the output in more detail:

It is indeed a DataFrame.
There are 7 entries, i.e. 7 rows.
Each row has a row label (aka the index) with values ranging from 0 to 6.
The table has 3 columns. The Name and Age columns have a value for each row (all 7 values are non-null). The Rating column has one missing value, so it holds only 6 non-null values.
The Name column consists of textual data (strings, shown as object). Age holds whole numbers (int64) and Rating holds real numbers (float64).
The kinds of data (strings, integers, floats) in the different columns are summarized in the dtypes line.
The approximate amount of RAM used to hold the DataFrame is provided as well.
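
Each fact reported by info() is also available on its own, which is handy when the numbers are needed in code rather than on screen. A minimal sketch, rebuilding the same DataFrame from plain lists so that it runs on its own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tom", "Jack", "Steve", "Ricky", "Vin", "James", "Vin"],
    "Age": [25, 26, 25, 35, 23, 33, 31],
    "Rating": [4.23, 4.1, 3.4, 5, 2.9, np.nan, 3.1],
})

n_rows, n_cols = df.shape            # number of rows and columns
non_null = df.count()                # non-null values per column
missing = df.isna().sum()            # missing values per column
memory = df.memory_usage(deep=True)  # bytes per column, plus the index

print(n_rows, n_cols)
print(non_null)
print(missing)
print(memory.sum())
```

Passing deep=True asks pandas to measure the actual size of the strings in object columns; the exact byte count depends on the platform and pandas version.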

In [34]:
# head -> by default head returns first 5 rows

df.head()

Unnamed: 0,Name,Age,Rating
0,Tom,25,4.23
1,Jack,26,4.1
2,Steve,25,3.4
3,Ricky,35,5.0
4,Vin,23,2.9


In [35]:
# head(n) -> returns the first n rows (here n=2)

df.head(2)

Unnamed: 0,Name,Age,Rating
0,Tom,25,4.23
1,Jack,26,4.1


In [36]:
# tail -> by default tail returns last 5 rows

df.tail()

Unnamed: 0,Name,Age,Rating
2,Steve,25,3.4
3,Ricky,35,5.0
4,Vin,23,2.9
5,James,33,
6,Vin,31,3.1


In [37]:
# tail(n) -> returns the last n rows (here n=2)

df.tail(2)

Unnamed: 0,Name,Age,Rating
5,James,33,
6,Vin,31,3.1


## Working with Tabular Data
Question: How do I read and write tabular data?
Answer: pandas supports integration with many file formats and data sources out of the box (csv, excel, sql, json, parquet,…). 
Data is imported from each of these sources by a function with the prefix read_*. Similarly, the to_* methods are used to store data.

Remember
Getting data into pandas from many different file formats or data sources is supported by read_* functions.
Exporting data out of pandas is provided by different to_* methods.
The head/tail/info methods and the dtypes attribute are convenient for a first check.
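
As a quick illustration of the read_*/to_* pairing, the sketch below writes a small DataFrame to CSV and JSON text and reads it back. The data is hypothetical, and in-memory buffers (io.StringIO) stand in for files so the example needs nothing on disk:

```python
import io

import pandas as pd

# Hypothetical sample values, for illustration only
df = pd.DataFrame({"Name": ["Tom", "Jack"], "Age": [25, 26]})

# to_* methods export the data; with no path given they return text
csv_text = df.to_csv(index=False)
json_text = df.to_json(orient="records")

# read_* functions load the data back into a DataFrame
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text), orient="records")

print(from_csv)
print(from_json)
print(from_csv.dtypes)  # a first check after loading
```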

DataFrame to .csv & .xlsx

In [2]:
import pandas as pd
import numpy as np

# Create Dictionary of Series
data = {'Name':pd.Series(['Tom', 'Jack', 'Steve', 'Ricky', 'Vin', 'James', 'Smith']),
       'Age':pd.Series([25,26,25,35,23,33,31]),
       'Rating':pd.Series([4.23,4.1,3.4,5,np.nan,4.7,3.1])}

df = pd.DataFrame(data)

In [3]:
df.head()

Unnamed: 0,Name,Age,Rating
0,Tom,25,4.23
1,Jack,26,4.1
2,Steve,25,3.4
3,Ricky,35,5.0
4,Vin,23,


In [39]:
df.tail()

Unnamed: 0,Name,Age,Rating
2,Steve,25,3.4
3,Ricky,35,5.0
4,Vin,23,
5,James,33,4.7
6,Smith,31,3.1


In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    7 non-null      object 
 1   Age     7 non-null      int64  
 2   Rating  6 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 296.0+ bytes


In [43]:
# Write DataFrame to CSV (the index is written as the first column)
df.to_csv('temp/new_csv_file.csv')

In [44]:
# Write Dataframe to CSV without index

df.to_csv('temp/new_csv_file_no_index.csv', index=False)
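
The difference between the two calls shows up when the files are read back. The sketch below is a hedged illustration with hypothetical data and file names: it writes both variants into a temporary folder and compares the resulting columns. It also creates the target folder first, since to_csv does not create missing directories:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample values, for illustration only
df = pd.DataFrame({"Name": ["Tom", "Jack"], "Age": [25, 26]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "temp"
    out_dir.mkdir(parents=True, exist_ok=True)  # to_csv does not create folders

    with_index = out_dir / "with_index.csv"    # hypothetical file names
    no_index = out_dir / "no_index.csv"
    df.to_csv(with_index)
    df.to_csv(no_index, index=False)

    # The saved index comes back as an extra, unnamed column ...
    cols_default = list(pd.read_csv(with_index).columns)
    # ... unless it is declared as the index when reading
    cols_index_col = list(pd.read_csv(with_index, index_col=0).columns)
    cols_no_index = list(pd.read_csv(no_index).columns)

print(cols_default)
print(cols_index_col)
print(cols_no_index)
```

Writing with index=False is usually the simpler choice when the index is just the default 0..n-1 row counter and carries no information.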

In [4]:
# Write Dataframe to XLSX

df.to_excel('temp/new_excel_file.xlsx', sheet_name='stud_data')

In [5]:
# Write Dataframe to XLSX without index

df.to_excel('temp/new_excel_file_noIndex.xlsx', sheet_name='stud_data', index=False)

Reading .xlsx File


In [8]:
# Reading .xlsx file
df = pd.read_excel('temp/new_excel_file_noIndex.xlsx')
df.head()

Unnamed: 0,Name,Age,Rating
0,Tom,25,4.23
1,Jack,26,4.1
2,Steve,25,3.4
3,Ricky,35,5.0
4,Vin,23,


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    7 non-null      object 
 1   Age     7 non-null      int64  
 2   Rating  6 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 296.0+ bytes


Reading .csv File

In [11]:
df = pd.read_csv('temp/Iris.csv')
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


Data Description
The Iris dataset contains 150 samples: 50 from each of three Iris species (Iris setosa, Iris virginica and Iris versicolor). Each sample has four features: the length and width of the sepals and of the petals, in centimetres.

The iris data set is widely used as a beginner's dataset for machine learning purposes.

## Non-Visual Data Analysis using Pandas (Statistical Analysis)
Question: How to calculate summary statistics?
Answer: Basic statistics (mean, median, min, max, counts…) are easily calculable. These or custom aggregations can be applied on the entire data set, a sliding window of the data, or grouped by categories. The latter is also known as the split-apply-combine approach.

Remember
Aggregation statistics (mean, median, min, max, counts…) can be calculated on entire columns or rows.
groupby provides the power of the split-apply-combine pattern.
value_counts is a convenient shortcut to count the number of entries in each category of a variable.
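
The three ideas above can be sketched on a tiny table. The values below are hypothetical and are not taken from the Iris file; the column names merely mimic its layout:

```python
import pandas as pd

# Hypothetical measurements, for illustration only
df = pd.DataFrame({
    "Species": ["setosa", "setosa", "virginica", "virginica", "virginica"],
    "PetalLengthCm": [1.4, 1.6, 5.0, 6.0, 7.0],
})

# Aggregation over an entire column
overall_mean = df["PetalLengthCm"].mean()

# Split-apply-combine: split by species, apply mean, combine into one result
mean_by_species = df.groupby("Species")["PetalLengthCm"].mean()

# Number of rows in each category
counts = df["Species"].value_counts()

print(overall_mean)
print(mean_by_species)
print(counts)
```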

In [13]:
df = pd.read_csv('temp/Iris.csv')
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [14]:
df.shape

(150, 6)

In [15]:
df.describe()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.843333,3.054,3.758667,1.198667
std,43.445368,0.828066,0.433594,1.76442,0.763161
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,75.5,5.8,3.0,4.35,1.3
75%,112.75,6.4,3.3,5.1,1.8
max,150.0,7.9,4.4,6.9,2.5


In [16]:
df.columns

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [17]:
df.sum()

Id                                                           11325
SepalLengthCm                                                876.5
SepalWidthCm                                                 458.1
PetalLengthCm                                                563.8
PetalWidthCm                                                 179.8
Species          Iris-setosaIris-setosaIris-setosaIris-setosaIr...
dtype: object
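
In the output above, Species is a text column, so sum() concatenates its strings instead of adding numbers. One way to avoid this, sketched here on hypothetical rows laid out like the Iris data, is to restrict the aggregation to numeric columns:

```python
import pandas as pd

# Hypothetical rows in the same layout as the Iris data
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 6.3],
    "SepalWidthCm": [3.5, 3.0, 3.3],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-virginica"],
})

# Restrict the aggregation to numeric columns
totals = df.sum(numeric_only=True)
print(totals)

# Same idea in two steps: pick the numeric columns first, then aggregate
totals_selected = df.select_dtypes(include="number").sum()
print(totals_selected)
```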

In [19]:
df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].sum(axis=1)

0      10.2
1       9.5
2       9.4
3       9.4
4      10.2
       ... 
145    17.2
146    15.7
147    16.7
148    17.3
149    15.8
Length: 150, dtype: float64

min() and max()

In [20]:
df.min()

Id                         1
SepalLengthCm            4.3
SepalWidthCm             2.0
PetalLengthCm            1.0
PetalWidthCm             0.1
Species          Iris-setosa
dtype: object

In [21]:
df.max()

Id                          150
SepalLengthCm               7.9
SepalWidthCm                4.4
PetalLengthCm               6.9
PetalWidthCm                2.5
Species          Iris-virginica
dtype: object

mean(), median(), var() and std()

In [22]:
df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].mean()

SepalLengthCm    5.843333
SepalWidthCm     3.054000
PetalLengthCm    3.758667
PetalWidthCm     1.198667
dtype: float64