Putting Some Pandas In Your Python 🐼
# Introduction to Pandas 🐼

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

Reference: https://pandas.pydata.org/docs/getting_started/index.html

Question: What are the Data Structures in Pandas?
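Answer: Series (a one-dimensional labeled array, similar to a 1-D NumPy array) and DataFrame (a two-dimensional labeled table, similar to a 2-D NumPy array, but with columns that may hold different data types)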

Installation Command
! pip install pandas

Importing Pandas
import pandas as pd

### What's covered in this notebook?
1. Pandas Data Structure - Series (ndarray-like)
Creating Series using Python list or dict
Creating Series from Numpy ndarray
Creating Series from scalar
Accessing Properties/Attributes and Methods of Series
Accessing data using Indexing and Slicing
2. Pandas Data Structure - DataFrame
Creating DataFrame using Python dict, list or tuple
Creating DataFrame using Numpy Array
Accessing Attributes/Properties and Methods of DataFrame
3. Working with Tabular Data
Dataframe to .csv & .xlsx
Reading .xlsx File
Reading .csv File - Iris Dataset
4. Non-Visual Data Analysis using Pandas (Statistical Analysis)
sum()
min() and max()
mean(), median(), var() and std()
describe() to summarize the data
corr(), skew() and kurt()
count(), unique() and value_counts() for categorical column
DataFrame.agg()
5. Accessing Data in a DataFrame using Indexing and Slicing in Pandas DataFrame
Reading .csv File - Weather Dataset
Filtering Single Column vs Multiple Columns from a DataFrame
Filtering Rows from a DataFrame
Filtering specific rows and columns from a DataFrame
loc[] vs iloc[]
6. Renaming Columns, Modifying DataTypes, Creating New Columns and Deleting Columns in Pandas DataFrame
Reading .csv File - Retail Store Sales Data
Renaming Columns
Modifying Columns DataTypes
Creating a Derived Column
Creating columns using apply() function
Deleting column(s) in DataFrame
7. Adding/Inserting Row(s)
Reading .xlsx File - Weather Data
Insert Row(s) using pandas.concat()
Inserting a Row using List - .loc[] and .iloc[]
Inserting a Row at a Specific Index of a DataFrame
Saving DataFrame to .xlsx
8. Handling TimeSeries Data
Reading .csv File - Online Store Sales Data
pd.to_datetime()
Working with DateTime in Pandas
Creating a Column containing only the Order Month
Calculating Delivery Time from Order Date and Ship Date
pandas.Timedelta
Creating a Column containing Delivery Time in Number of Days
Improve Performance by Setting Date Column as the Index
Sorting Data Based on Index vs Values and Resetting Index
9. Summary

Import Pandas Module

In [1]:
import pandas as pd
import numpy as np

## Pandas Data Structure - Series (ndarray-like)
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

The basic method to create a Series is to call:
s = pd.Series(data, index=index)

Important Note: Series data structures are value-mutable (the values they contain can be altered) but not size-mutable.

Here, data can be many different things:

a Python list or dict
an ndarray
a scalar value (like 5)

Creating Series using Python list or dict

In [2]:
# pd.Series(data, index)
# index -> hashable labels, same length as data; duplicate labels are allowed.
# Defaults to a RangeIndex (0, 1, ..., n-1) when omitted.
import pandas as pd

s = pd.Series([1, 2, 3, 4])
print(s)


0    1
1    2
2    3
3    4
dtype: int64


In [3]:
s = pd.Series(['x', 'y', 'z', 'abc'])

print(s)

0      x
1      y
2      z
3    abc
dtype: object


In [4]:
s = pd.Series(['Alade','Idris'])
print(s)

0    Alade
1    Idris
dtype: object


In [5]:
d = {"b": 1, "a": 0, "c": 2}

s = pd.Series(d)

print(s)

b    1
a    0
c    2
dtype: int64


Creating Series from Numpy ndarray

In [7]:
data = np.arange(10)
s = pd.Series(data)
print(s)

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64


Creating Series from scalar

In [9]:
pd.Series(5.0, index=["a", "b", "c", "d", "e"])

a    5.0
b    5.0
c    5.0
d    5.0
e    5.0
dtype: float64

Accessing Properties/Attributes and Methods of Series

In [10]:
data = np.array([10, 20, 30, 40, 50, 60, 70, 80])

s = pd.Series(data)
print("Data Type:", s.dtype)
print("Shape:", s.shape)
print("Values:", s.values)
print("Array:", s.array)

Data Type: int64
Shape: (8,)
Values: [10 20 30 40 50 60 70 80]
Array: <NumpyExtensionArray>
[10, 20, 30, 40, 50, 60, 70, 80]
Length: 8, dtype: int64


In [11]:
print("Method to extract actual numpy ndarray:", s.to_numpy())

Method to extract actual numpy ndarray: [10 20 30 40 50 60 70 80]


In [12]:
s.head()

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [13]:
s.tail()

3    40
4    50
5    60
6    70
7    80
dtype: int64

In [14]:
s.info()

<class 'pandas.core.series.Series'>
RangeIndex: 8 entries, 0 to 7
Series name: None
Non-Null Count  Dtype
--------------  -----
8 non-null      int64
dtypes: int64(1)
memory usage: 192.0 bytes


Accessing data using Indexing and Slicing

In [19]:
s = pd.Series([1, 2, 3, 4, 5])

print(s[2])
print(s[[1,4]])
print(s[2:])

3
1    2
4    5
dtype: int64
2    3
3    4
4    5
dtype: int64


In [20]:
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

print(s)

a    1
b    2
c    3
d    4
e    5
dtype: int64
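
Once a Series carries custom labels, values can be looked up by label as well as by position. The following is a minimal sketch of that idea, reusing the Series from the cell above; the variable names are only illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

by_label = s['c']            # single label -> scalar value
by_labels = s[['a', 'e']]    # list of labels -> Series
label_slice = s['b':'d']     # label slices include BOTH endpoints
by_position = s.iloc[0]      # explicit positional access

print(by_label)
print(by_labels.tolist())
print(label_slice.tolist())
print(by_position)
```

Note that a label slice such as `s['b':'d']` includes the end label, unlike a positional slice such as `s[1:3]`, which excludes the end position.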


## Pandas Data Structure - DataFrame

DataFrame is a general 2D labeled, value- and size-mutable tabular structure with potentially heterogeneously-typed columns.

Important Note: A DataFrame is value-mutable (the values it contains can be altered) as well as size-mutable (columns and rows can be added or removed).


Question: What kind of data does pandas handle?
Answer: When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.

Remember
Import the package, aka import pandas as pd
A table of data is stored as a pandas DataFrame
Each column in a DataFrame is a Series
You can do things by applying a method to a DataFrame or Series
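
The points above can be checked with a small sketch. The table contents here are made up for illustration and are not part of any dataset used in this notebook.

```python
import pandas as pd

# A tiny illustrative table (values are hypothetical)
df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve'], 'Age': [28, 34, 42]})

age = df['Age']          # selecting one column gives a Series
oldest = age.max()       # a method applied to a Series
table_shape = df.shape   # an attribute of the DataFrame

print(type(age).__name__)
print(oldest)
print(table_shape)
```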

Creating a Pandas DataFrame
Syntax
df = pd.DataFrame(data, index=idxs, columns=cols)

Here data can be many different things:

Python Dict, List or Tuple
Numpy array

Creating DataFrame using Python dict, list or tuple

In [21]:
# Creating dataframe using Python Dictionary

data = {
        'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 
        'Age': [28,34,np.nan,42],
        'Gender': ['Male', 'Female', 'Female', 'Male']
       }

df = pd.DataFrame(data)

df

Unnamed: 0,Name,Age,Gender
0,Tom,28.0,Male
1,Jack,34.0,Female
2,Steve,,Female
3,Ricky,42.0,Male


In [22]:
# Creating a DataFrame from a list of tuples (default integer column names)

data = [('1/1/2019', 13, 6, 'Rain'),
       ('2/1/2019', 11, 7, 'Fog'),
       ('3/1/2019', 12, 8, 'Sunny'),
       ('4/1/2019', 8, 5, 'Snow'),
       ('5/1/2019', 9, 6, 'Rain')]

df = pd.DataFrame(data)

df

Unnamed: 0,0,1,2,3
0,1/1/2019,13,6,Rain
1,2/1/2019,11,7,Fog
2,3/1/2019,12,8,Sunny
3,4/1/2019,8,5,Snow
4,5/1/2019,9,6,Rain


In [23]:
# Creating a DataFrame from a list of tuples with explicit column names

data = [('1/1/2019', 13, 6, 'Rain'),
       ('2/1/2019', 11, 7, 'Fog'),
       ('3/1/2019', 12, 8, 'Sunny'),
       ('4/1/2019', 8, 5, 'Snow'),
       ('5/1/2019', 9, 6, 'Rain')]

df = pd.DataFrame(data, columns=['Day', 'Temperature', 'WindSpeed', 'Event'])

df

Unnamed: 0,Day,Temperature,WindSpeed,Event
0,1/1/2019,13,6,Rain
1,2/1/2019,11,7,Fog
2,3/1/2019,12,8,Sunny
3,4/1/2019,8,5,Snow
4,5/1/2019,9,6,Rain


In [24]:
# Creating a DataFrame from a tuple of lists with explicit index and column names

data = (['1/1/2019', 13, 6, 'Rain'],
       ['2/1/2019', 11, 7, 'Fog'],
       ['3/1/2019', 12, 8, 'Sunny'],
       ['4/1/2019', 8, 5, 'Snow'],
       ['5/1/2019', 9, 6, 'Rain'])

df = pd.DataFrame(data, 
                  index=['I1', 'I2', 'I3', 'I4', 'I5'], 
                  columns=['Day', 'Temperature', 'WindSpeed', 'Event'])

df

Unnamed: 0,Day,Temperature,WindSpeed,Event
I1,1/1/2019,13,6,Rain
I2,2/1/2019,11,7,Fog
I3,3/1/2019,12,8,Sunny
I4,4/1/2019,8,5,Snow
I5,5/1/2019,9,6,Rain


In [26]:
print(df['Temperature'])

I1    13
I2    11
I3    12
I4     8
I5     9
Name: Temperature, dtype: int64


Creating DataFrame using Numpy Array

In [27]:
arr = np.random.randint(100,1999, size=(1000,100))
print(arr.shape)

(1000, 100)


In [28]:
df = pd.DataFrame(arr)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,1043,1509,1614,762,372,1753,1140,370,443,645,...,1106,381,388,1074,1898,678,1462,1975,177,1106
1,1991,1442,1255,1229,885,112,1963,1341,1595,133,...,1115,1953,1879,1817,477,669,255,569,561,949
2,642,1131,1851,941,733,1913,577,584,582,1919,...,939,828,1252,675,522,1353,1108,681,527,1658
3,1371,152,507,1564,1303,351,461,871,414,1585,...,1028,1376,1646,1725,612,1756,524,1608,1185,236
4,1007,765,735,385,561,1341,191,1213,1211,1397,...,1787,1958,849,443,187,1773,233,584,494,443
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,734,1450,1443,1767,1243,1698,1776,1757,272,1135,...,780,1981,1484,325,1394,810,410,671,1863,1384
996,151,247,740,1300,1872,601,1823,773,1589,613,...,1545,1144,879,635,198,176,1832,361,1613,1605
997,1984,631,1879,1298,863,1325,577,1446,1498,1059,...,429,616,451,1505,1310,1492,318,1134,1086,1369
998,1992,1065,447,1200,635,1446,1092,1771,807,781,...,1955,1098,920,739,1363,921,1652,524,1912,1612


In [30]:
df = pd.DataFrame(arr, columns=['col_'+str(i) for i in range(1,101)])
df

Unnamed: 0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,col_10,...,col_91,col_92,col_93,col_94,col_95,col_96,col_97,col_98,col_99,col_100
0,1043,1509,1614,762,372,1753,1140,370,443,645,...,1106,381,388,1074,1898,678,1462,1975,177,1106
1,1991,1442,1255,1229,885,112,1963,1341,1595,133,...,1115,1953,1879,1817,477,669,255,569,561,949
2,642,1131,1851,941,733,1913,577,584,582,1919,...,939,828,1252,675,522,1353,1108,681,527,1658
3,1371,152,507,1564,1303,351,461,871,414,1585,...,1028,1376,1646,1725,612,1756,524,1608,1185,236
4,1007,765,735,385,561,1341,191,1213,1211,1397,...,1787,1958,849,443,187,1773,233,584,494,443
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,734,1450,1443,1767,1243,1698,1776,1757,272,1135,...,780,1981,1484,325,1394,810,410,671,1863,1384
996,151,247,740,1300,1872,601,1823,773,1589,613,...,1545,1144,879,635,198,176,1832,361,1613,1605
997,1984,631,1879,1298,863,1325,577,1446,1498,1059,...,429,616,451,1505,1310,1492,318,1134,1086,1369
998,1992,1065,447,1200,635,1446,1092,1771,807,781,...,1955,1098,920,739,1363,921,1652,524,1912,1612


Accessing Attributes/Properties and Methods of DataFrame

In [31]:
# Create Dictionary of Series
import pandas as pd
import numpy as np

data = {'Name':pd.Series(['Tom', 'Jack', 'Steve', 'Ricky', 'Vin', 'James', 'Vin']),
       'Age':pd.Series([25,26,25,35,23,33,31]),
       'Rating':pd.Series([4.23,4.1,3.4,5,2.9,np.nan,3.1])}

df = pd.DataFrame(data)

df

Unnamed: 0,Name,Age,Rating
0,Tom,25,4.23
1,Jack,26,4.1
2,Steve,25,3.4
3,Ricky,35,5.0
4,Vin,23,2.9
5,James,33,
6,Vin,31,3.1


In [32]:
print('Shape of DataFrame:', df.shape)
print()
print('Name of each column:', df.columns)
print()
print('Data Types of each Columns:\n', df.dtypes)
print()
print('Axes:\n', df.axes)
print()
print('Return data as numpy array:\n', df.values)

Shape of DataFrame: (7, 3)

Name of each column: Index(['Name', 'Age', 'Rating'], dtype='object')

Data Types of each Columns:
 Name       object
Age         int64
Rating    float64
dtype: object

Axes:
 [RangeIndex(start=0, stop=7, step=1), Index(['Name', 'Age', 'Rating'], dtype='object')]

Return data as numpy array:
 [['Tom' 25 4.23]
 ['Jack' 26 4.1]
 ['Steve' 25 3.4]
 ['Ricky' 35 5.0]
 ['Vin' 23 2.9]
 ['James' 33 nan]
 ['Vin' 31 3.1]]


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    7 non-null      object 
 1   Age     7 non-null      int64  
 2   Rating  6 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 296.0+ bytes


The info() method provides technical information about a DataFrame. The output above reads as follows:

It is indeed a DataFrame.
There are 7 entries, i.e. 7 rows.
Each row has a row label (aka the index) with values ranging from 0 to 6.
The table has 3 columns. The Name and Age columns have a value for each row (all 7 values are non-null). The Rating column has a missing value, so it reports only 6 non-null values.
The Name column consists of textual data (strings, aka object). The other columns are numerical: Age holds whole numbers (aka integer) and Rating holds real numbers (aka float).
The kinds of data (characters, integers, …) in the different columns are summarized by listing the dtypes.
The approximate amount of RAM used to hold the DataFrame is provided as well.
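
The non-null counts reported by info() can also be obtained directly. The sketch below uses a shortened, illustrative version of the table and assumes the goal is simply to count missing values per column.

```python
import numpy as np
import pandas as pd

# Shortened illustrative table with one missing Rating
df = pd.DataFrame({
    'Name': ['Tom', 'Jack', 'James'],
    'Age': [25, 26, 33],
    'Rating': [4.23, 4.1, np.nan],
})

missing_per_column = df.isna().sum()   # number of missing values per column
non_null_per_column = df.count()       # number of non-null values per column

print(missing_per_column)
print(non_null_per_column)
```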

In [34]:
# head -> by default head returns first 5 rows

df.head()

Unnamed: 0,Name,Age,Rating
0,Tom,25,4.23
1,Jack,26,4.1
2,Steve,25,3.4
3,Ricky,35,5.0
4,Vin,23,2.9


In [35]:
# head(n) -> returns the first n rows

df.head(2)

Unnamed: 0,Name,Age,Rating
0,Tom,25,4.23
1,Jack,26,4.1


In [36]:
# tail -> by default tail returns last 5 rows

df.tail()

Unnamed: 0,Name,Age,Rating
2,Steve,25,3.4
3,Ricky,35,5.0
4,Vin,23,2.9
5,James,33,
6,Vin,31,3.1


In [37]:
# tail(n) -> returns the last n rows

df.tail(2)

Unnamed: 0,Name,Age,Rating
5,James,33,
6,Vin,31,3.1


### Working with Tabular Data
Question: How do I read and write tabular data?
Answer: pandas supports the integration with many file formats or data sources out of the box (csv, excel, sql, json, parquet,…). 
Each of these data sources is imported with a function whose name starts with read_*. Similarly, the to_* methods are used to store data.

Remember
Getting data into pandas from many different file formats or data sources is supported by read_* functions.
Exporting data out of pandas is provided by different to_* methods.
The head/tail/info methods and the dtypes attribute are convenient for a first check.
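
As a minimal sketch of the read_*/to_* pairing, the example below writes a small, made-up table to CSV text and reads it back. It uses an in-memory buffer (io.StringIO) instead of a file so that it runs without any directory setup; that choice is an assumption of this sketch, not something the notebook does.

```python
import io

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Tom', 'Jack', 'Steve'],
    'Age': [25, 26, 25],
    'Rating': [4.23, 4.1, np.nan],
})

# to_csv() without a path returns the CSV text as a string
with_index = pd.read_csv(io.StringIO(df.to_csv()))
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

When the index is written to the file, reading it back produces an extra column named `Unnamed: 0`; passing `index=False` on export avoids it.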

Dataframe to .csv & .xlsx

In [2]:
import pandas as pd
import numpy as np

# Create Dictionary of Series
data = {'Name':pd.Series(['Tom', 'Jack', 'Steve', 'Ricky', 'Vin', 'James', 'Smith']),
       'Age':pd.Series([25,26,25,35,23,33,31]),
       'Rating':pd.Series([4.23,4.1,3.4,5,np.nan,4.7,3.1])}

df = pd.DataFrame(data)

In [3]:
df.head()

Unnamed: 0,Name,Age,Rating
0,Tom,25,4.23
1,Jack,26,4.1
2,Steve,25,3.4
3,Ricky,35,5.0
4,Vin,23,


In [39]:
df.tail()

Unnamed: 0,Name,Age,Rating
2,Steve,25,3.4
3,Ricky,35,5.0
4,Vin,23,
5,James,33,4.7
6,Smith,31,3.1


In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    7 non-null      object 
 1   Age     7 non-null      int64  
 2   Rating  6 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 296.0+ bytes


In [43]:
# Write DataFrame to CSV (to_csv does not create the 'temp' directory; it must already exist)
df.to_csv('temp/new_csv_file.csv')

In [44]:
# Write Dataframe to CSV without index

df.to_csv('temp/new_csv_file_no_index.csv', index=False)

In [4]:
# Write Dataframe to XLSX

df.to_excel('temp/new_excel_file.xlsx', sheet_name='stud_data')

In [5]:
# Write Dataframe to XLSX without index

df.to_excel('temp/new_excel_file_noIndex.xlsx', sheet_name='stud_data', index=False)


In [8]:
# Reading .xlsx file
df = pd.read_excel('temp/new_excel_file_noIndex.xlsx')
df.head()

Unnamed: 0,Name,Age,Rating
0,Tom,25,4.23
1,Jack,26,4.1
2,Steve,25,3.4
3,Ricky,35,5.0
4,Vin,23,


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    7 non-null      object 
 1   Age     7 non-null      int64  
 2   Rating  6 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 296.0+ bytes


Reading .csv File

In [11]:
df = pd.read_csv('temp/Iris.csv')
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


Data Description
The Iris dataset contains four features (length and width of sepals and petals) for 150 samples: 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor).

The iris data set is widely used as a beginner's dataset for machine learning purposes.

### Non-Visual Data Analysis using Pandas (Statistical Analysis)
Question: How to calculate summary statistics?
Answer: Basic statistics (mean, median, min, max, counts…) are easily calculable. These or custom aggregations can be applied on the entire data set, a sliding window of the data, or grouped by categories. The latter is also known as the split-apply-combine approach.

Remember
Aggregation statistics (mean, median, min, max, counts…) can be calculated on entire columns or rows.
groupby provides the power of the split-apply-combine pattern.
value_counts is a convenient shortcut to count the number of entries in each category of a variable.
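
The split-apply-combine pattern mentioned above can be sketched on a tiny table. The column names mirror the Iris dataset, but the four rows are made up for illustration.

```python
import pandas as pd

# Hypothetical mini-table using Iris-style column names
df = pd.DataFrame({
    'Species': ['Iris-setosa', 'Iris-setosa', 'Iris-virginica', 'Iris-virginica'],
    'PetalLengthCm': [1.4, 1.3, 6.0, 5.1],
})

# split by Species, apply mean() to each group, combine into one Series
mean_by_species = df.groupby('Species')['PetalLengthCm'].mean()

# count how many rows fall into each category
counts = df['Species'].value_counts()

print(mean_by_species)
print(counts)
```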

In [3]:
import numpy as np
import pandas as pd
df = pd.read_csv('temp/Iris.csv')
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [14]:
df.shape

(150, 6)

In [15]:
df.describe()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.843333,3.054,3.758667,1.198667
std,43.445368,0.828066,0.433594,1.76442,0.763161
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,75.5,5.8,3.0,4.35,1.3
75%,112.75,6.4,3.3,5.1,1.8
max,150.0,7.9,4.4,6.9,2.5


In [16]:
df.columns

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [17]:
df.sum()

Id                                                           11325
SepalLengthCm                                                876.5
SepalWidthCm                                                 458.1
PetalLengthCm                                                563.8
PetalWidthCm                                                 179.8
Species          Iris-setosaIris-setosaIris-setosaIris-setosaIr...
dtype: object

In [19]:
df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].sum(axis=1)

0      10.2
1       9.5
2       9.4
3       9.4
4      10.2
       ... 
145    17.2
146    15.7
147    16.7
148    17.3
149    15.8
Length: 150, dtype: float64
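
The two sum() calls above differ only in the axis argument. A small sketch with made-up numbers shows the difference: the default (axis=0) aggregates down each column, while axis=1 aggregates across each row.

```python
import pandas as pd

# Made-up numbers, chosen so the totals are easy to verify by hand
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

column_totals = df.sum()        # axis=0 (default): one value per column
row_totals = df.sum(axis=1)     # axis=1: one value per row

print(column_totals.tolist())
print(row_totals.tolist())
```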

min() and max()

In [20]:
df.min()

Id                         1
SepalLengthCm            4.3
SepalWidthCm             2.0
PetalLengthCm            1.0
PetalWidthCm             0.1
Species          Iris-setosa
dtype: object

In [21]:
df.max()

Id                          150
SepalLengthCm               7.9
SepalWidthCm                4.4
PetalLengthCm               6.9
PetalWidthCm                2.5
Species          Iris-virginica
dtype: object

mean(), median(), var() and std()

In [22]:
df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].mean()

SepalLengthCm    5.843333
SepalWidthCm     3.054000
PetalLengthCm    3.758667
PetalWidthCm     1.198667
dtype: float64

In [4]:
df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].median()

SepalLengthCm    5.80
SepalWidthCm     3.00
PetalLengthCm    4.35
PetalWidthCm     1.30
dtype: float64

In [5]:
# Syntax: DataFrame.select_dtypes(include=None, exclude=None)
num_cols = df.select_dtypes(include=['float64']).columns

print(num_cols)

Index(['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'], dtype='object')


In [6]:
df[num_cols].median()

SepalLengthCm    5.80
SepalWidthCm     3.00
PetalLengthCm    4.35
PetalWidthCm     1.30
dtype: float64

In [7]:
df[num_cols].var()

SepalLengthCm    0.685694
SepalWidthCm     0.188004
PetalLengthCm    3.113179
PetalWidthCm     0.582414
dtype: float64

In [8]:
df[num_cols].std()

SepalLengthCm    0.828066
SepalWidthCm     0.433594
PetalLengthCm    1.764420
PetalWidthCm     0.763161
dtype: float64

count(), nunique(), unique() and value_counts() for categorical column

In [9]:
df['Species'].count()

150

In [10]:
df['Species'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [11]:
df['Species'].nunique()

3

In [12]:
df['Species'].value_counts()

Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64

describe() to summarize the data

In [13]:
df.describe()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.843333,3.054,3.758667,1.198667
std,43.445368,0.828066,0.433594,1.76442,0.763161
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,75.5,5.8,3.0,4.35,1.3
75%,112.75,6.4,3.3,5.1,1.8
max,150.0,7.9,4.4,6.9,2.5


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [14]:
# include object, number, all

df.describe(include=['object'])

Unnamed: 0,Species
count,150
unique,3
top,Iris-setosa
freq,50


In [16]:
df.describe(include=['number'])

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.843333,3.054,3.758667,1.198667
std,43.445368,0.828066,0.433594,1.76442,0.763161
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,75.5,5.8,3.0,4.35,1.3
75%,112.75,6.4,3.3,5.1,1.8
max,150.0,7.9,4.4,6.9,2.5


In [18]:
df.describe(include='all')

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
count,150.0,150.0,150.0,150.0,150.0,150
unique,,,,,,3
top,,,,,,Iris-setosa
freq,,,,,,50
mean,75.5,5.843333,3.054,3.758667,1.198667,
std,43.445368,0.828066,0.433594,1.76442,0.763161,
min,1.0,4.3,2.0,1.0,0.1,
25%,38.25,5.1,2.8,1.6,0.3,
50%,75.5,5.8,3.0,4.35,1.3,
75%,112.75,6.4,3.3,5.1,1.8,
max,150.0,7.9,4.4,6.9,2.5,


corr(), skew() and kurt()

In [21]:
num_cols = df.select_dtypes(include=['float64']).columns

df[num_cols].corr()

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
SepalLengthCm,1.0,-0.109369,0.871754,0.817954
SepalWidthCm,-0.109369,1.0,-0.420516,-0.356544
PetalLengthCm,0.871754,-0.420516,1.0,0.962757
PetalWidthCm,0.817954,-0.356544,0.962757,1.0


In [20]:
num_cols = df.select_dtypes(include=['float64']).columns

df[num_cols].kurt()

SepalLengthCm   -0.552064
SepalWidthCm     0.290781
PetalLengthCm   -1.401921
PetalWidthCm    -1.339754
dtype: float64
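
skew() is called the same way as corr() and kurt(). The sketch below uses two made-up columns rather than the Iris data: one symmetric and one with a long right tail, to show how the sign of the result reflects the shape of the distribution.

```python
import pandas as pd

# Illustrative data: a symmetric column and a right-tailed column
frame = pd.DataFrame({
    'symmetric': [1, 2, 3, 4, 5],
    'right_tailed': [1, 2, 3, 4, 100],
})

skewness = frame.skew()   # one skewness value per numeric column

print(skewness)
```

A value near zero indicates a roughly symmetric column, and a positive value indicates a longer tail towards large values.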

DataFrame.agg()
Instead of the predefined statistics, specific combinations of aggregating statistics for given columns can be defined using the DataFrame.agg() method.

A list of all the aggregating statistics can be found in the reference below:
Reference: https://pandas.pydata.org/docs/user_guide/basics.html#basics-stats

In [22]:
df.columns

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [23]:
df.agg(
    {
        "SepalLengthCm" : ["min", "max", "median", "count"],
        "PetalWidthCm" : ["min", "max", "mean", "count"],
        "Species" : ["count"]
    }
)

Unnamed: 0,SepalLengthCm,PetalWidthCm,Species
min,4.3,0.1,
max,7.9,2.5,
median,5.8,,
count,150.0,150.0,150.0
mean,,1.198667,


### Accessing Data in a DataFrame using Indexing and Slicing in Pandas DataFrame
Question: How do I select a subset of a table?
Answer: Selecting or filtering specific rows and/or columns? Filtering the data on a condition? Methods for slicing, selecting, and extracting the data you need are available in pandas.

Remember
When selecting subsets of data, square brackets [] are used.
Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression or a colon.
Select specific rows and/or columns using loc when using the row and column names.
Select specific rows and/or columns using iloc when using the positions in the table.
You can assign new values to a selection based on loc/iloc.
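
The last point, assigning new values through loc/iloc, can be sketched as follows. The table is a made-up three-row frame with weather-style column names, and the replacement label 'No Event' is a hypothetical choice.

```python
import pandas as pd

# Made-up rows; column names follow the weather data style
df = pd.DataFrame({
    'Temperature': [38, 36, 40],
    'Events': [None, 'Rain', None],
})

# loc: rows chosen by a boolean condition, column chosen by label
df.loc[df['Events'].isna(), 'Events'] = 'No Event'

# iloc: row and column chosen by integer position
df.iloc[0, 0] = 39

print(df)
```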

#### Reading .csv File - Weather Dataset

Data Description
Weather data collected from the National Weather Service for a weather station in Central Park. The file used here contains one row per day for January 2016 (31 rows). For each day it records the temperature, dew point, humidity, sea level pressure, visibility, wind speed, precipitation, cloud cover, weather events, and wind direction. The temperature is measured in Fahrenheit and the precipitation in inches. In the PrecipitationIn column, T means that there is a trace of precipitation.

In [26]:
# importing pandas
import pandas as pd
df = pd.read_csv('temp/nyc_weather.csv')
df.head()

Unnamed: 0,EST,Temperature,DewPoint,Humidity,Sea Level PressureIn,VisibilityMiles,WindSpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
0,1/1/2016,38,23,52,30.03,10,8.0,0,5,,281
1,1/2/2016,36,18,46,30.02,10,7.0,0,3,,275
2,1/3/2016,40,21,47,29.86,10,8.0,0,1,,277
3,1/4/2016,25,9,44,30.05,10,9.0,0,3,,345
4,1/5/2016,20,-3,41,30.57,10,5.0,0,0,,333


In [27]:
df.tail()

Unnamed: 0,EST,Temperature,DewPoint,Humidity,Sea Level PressureIn,VisibilityMiles,WindSpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
26,1/27/2016,41,22,45,30.03,10,7.0,T,3,Rain,311
27,1/28/2016,37,20,51,29.9,10,5.0,0,1,,234
28,1/29/2016,36,21,50,29.58,10,8.0,0,4,,298
29,1/30/2016,34,16,46,30.01,10,7.0,0,0,,257
30,1/31/2016,46,28,52,29.9,10,5.0,0,0,,241


In [28]:
df.shape

(31, 11)

In [29]:
df.columns

Index(['EST', 'Temperature', 'DewPoint', 'Humidity', 'Sea Level PressureIn',
       'VisibilityMiles', 'WindSpeedMPH', 'PrecipitationIn', 'CloudCover',
       'Events', 'WindDirDegrees'],
      dtype='object')

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31 entries, 0 to 30
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   EST                   31 non-null     object 
 1   Temperature           31 non-null     int64  
 2   DewPoint              31 non-null     int64  
 3   Humidity              31 non-null     int64  
 4   Sea Level PressureIn  31 non-null     float64
 5   VisibilityMiles       31 non-null     int64  
 6   WindSpeedMPH          28 non-null     float64
 7   PrecipitationIn       31 non-null     object 
 8   CloudCover            31 non-null     int64  
 9   Events                9 non-null      object 
 10  WindDirDegrees        31 non-null     int64  
dtypes: float64(2), int64(6), object(3)
memory usage: 2.8+ KB


In [31]:
df.describe()

Unnamed: 0,Temperature,DewPoint,Humidity,Sea Level PressureIn,VisibilityMiles,WindSpeedMPH,CloudCover,WindDirDegrees
count,31.0,31.0,31.0,31.0,31.0,28.0,31.0,31.0
mean,34.677419,17.83871,51.677419,29.992903,9.193548,6.892857,3.129032,247.129032
std,7.639315,11.378626,11.634395,0.237237,1.939405,2.871821,2.629853,92.308086
min,20.0,-3.0,33.0,29.52,1.0,2.0,0.0,34.0
25%,29.0,10.0,44.5,29.855,9.0,5.0,1.0,238.0
50%,35.0,18.0,50.0,30.01,10.0,6.5,3.0,281.0
75%,39.5,23.0,55.0,30.14,10.0,8.0,4.5,300.0
max,50.0,46.0,78.0,30.57,10.0,16.0,8.0,345.0


In [33]:
df['Temperature'].max()

50

In [34]:
df['Temperature'].min()

20

Filtering Single Column vs Multiple Columns from a DataFrame
To select a single column, use square brackets [] with the column name of the column of interest.

In [36]:
# selecting a single column (returns a Series)
max_temp = df['Temperature']
max_temp.head()

0    38
1    36
2    40
3    25
4    20
Name: Temperature, dtype: int64

In [37]:
max_temp.shape

(31,)

In [38]:
# selecting a single column with a list of names (returns a DataFrame)
max_temp = df[['Temperature']]
max_temp.tail()

Unnamed: 0,Temperature
26,41
27,37
28,36
29,34
30,46


In [40]:
# selecting multiple columns
max_temp = df[['Temperature', 'DewPoint', 'Humidity']]
max_temp.head()

Unnamed: 0,Temperature,DewPoint,Humidity
0,38,23,52
1,36,18,46
2,40,21,47
3,25,9,44
4,20,-3,41


#### Filtering Rows from a DataFrame
Way 1
Select rows by position with a slice.
Syntax: df[ starting_row_index : ending_row_index : step ]

Way 2
Similar to NumPy, pandas accepts boolean indexes.
To select rows based on a conditional expression, use a condition inside the selection brackets [].
Syntax: df[ CONDITION ]

In [42]:
df[1:15]

Unnamed: 0,EST,Temperature,DewPoint,Humidity,Sea Level PressureIn,VisibilityMiles,WindSpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
1,1/2/2016,36,18,46,30.02,10,7.0,0,3,,275
2,1/3/2016,40,21,47,29.86,10,8.0,0,1,,277
3,1/4/2016,25,9,44,30.05,10,9.0,0,3,,345
4,1/5/2016,20,-3,41,30.57,10,5.0,0,0,,333
5,1/6/2016,33,4,35,30.5,10,4.0,0,0,,259
6,1/7/2016,39,11,33,30.28,10,2.0,0,3,,293
7,1/8/2016,39,29,64,30.2,10,4.0,0,8,,79
8,1/9/2016,44,38,77,30.16,9,8.0,T,8,Rain,76
9,1/10/2016,50,46,71,29.59,4,,1.8,7,Rain,109
10,1/11/2016,33,8,37,29.92,10,,0,1,,289


In [47]:
df[['Temperature']]>30


Unnamed: 0,Temperature
0,True
1,True
2,True
3,False
4,False
5,True
6,True
7,True
8,True
9,True


In [48]:
# Similar to NumPy, pandas accepts boolean indexes
df[ df["Temperature"] > 45 ]

Unnamed: 0,EST,Temperature,DewPoint,Humidity,Sea Level PressureIn,VisibilityMiles,WindSpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
9,1/10/2016,50,46,71,29.59,4,,1.8,7,Rain,109
15,1/16/2016,47,37,70,29.52,8,7.0,0.24,7,Rain,340
30,1/31/2016,46,28,52,29.9,10,5.0,0,0,,241


The output of the conditional expression (>, but also ==, !=, <, <=,… would work) is actually a pandas Series of boolean values (either True or False) with the same number of rows as the original DataFrame. Such a Series of boolean values can be used to filter the DataFrame by putting it in between the selection brackets []. Only rows for which the value is True will be selected.
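
To make the intermediate boolean Series visible, the sketch below stores it in a variable before filtering. The three rows are copied from the weather table (1/9 to 1/11/2016); the variable names `mask` and `hot_days` are only illustrative.

```python
import pandas as pd

# Three rows taken from the weather data shown above
df = pd.DataFrame({
    'EST': ['1/9/2016', '1/10/2016', '1/11/2016'],
    'Temperature': [44, 50, 33],
})

mask = df['Temperature'] > 45   # boolean Series, one value per row
hot_days = df[mask]             # keeps only rows where mask is True

print(mask.tolist())
print(hot_days)
```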

In [49]:
df[df['DewPoint']>=30]

Unnamed: 0,EST,Temperature,DewPoint,Humidity,Sea Level PressureIn,VisibilityMiles,WindSpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
8,1/9/2016,44,38,77,30.16,9,8.0,T,8,Rain,76
9,1/10/2016,50,46,71,29.59,4,,1.8,7,Rain,109
14,1/15/2016,43,31,62,29.82,9,5.0,T,2,,101
15,1/16/2016,47,37,70,29.52,8,7.0,0.24,7,Rain,340


In [51]:
df[df['EST'].isin(['1/10/2016', '1/16/2016', '1/2/2016'])]

Unnamed: 0,EST,Temperature,DewPoint,Humidity,Sea Level PressureIn,VisibilityMiles,WindSpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
1,1/2/2016,36,18,46,30.02,10,7.0,0,3,,275
9,1/10/2016,50,46,71,29.59,4,,1.8,7,Rain,109
15,1/16/2016,47,37,70,29.52,8,7.0,0.24,7,Rain,340


Similar to the conditional expression, the isin() conditional function returns True for each row whose value is in the provided list. To filter the rows based on such a function, use the conditional function inside the selection brackets [].

An isin() filter is equivalent to testing each value separately with == and combining the tests with the | (or) operator. The next cell uses that form to select the rows for '1/3/2016', '1/15/2016' and '1/2/2016':

In [54]:
df[ (df['EST']=='1/3/2016')  | 
    (df['EST']=='1/15/2016') | 
    (df['EST']=='1/2/2016') 
  ]

Unnamed: 0,EST,Temperature,DewPoint,Humidity,Sea Level PressureIn,VisibilityMiles,WindSpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
1,1/2/2016,36,18,46,30.02,10,7.0,0,3,,275
2,1/3/2016,40,21,47,29.86,10,8.0,0,1,,277
14,1/15/2016,43,31,62,29.82,9,5.0,T,2,,101


Remember
When combining multiple conditional statements, each condition must be surrounded by parentheses (). Moreover, you cannot use the Python keywords or/and; use the operator | for "or" and the operator & for "and".
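The sketch below illustrates this rule on a made-up DataFrame (the `toy` data is hypothetical). It also checks that chaining `==` comparisons with `|` selects the same rows as `isin()`, and shows what happens when the keyword `or` is used by mistake.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({"EST": ["1/1/2016", "1/2/2016", "1/3/2016", "1/4/2016"],
                    "Temperature": [38, 36, 40, 25]})

# Each condition sits in its own parentheses; | means "or", & means "and"
either_date = toy[(toy["EST"] == "1/2/2016") | (toy["EST"] == "1/3/2016")]
same_with_isin = toy[toy["EST"].isin(["1/2/2016", "1/3/2016"])]
print(either_date.equals(same_with_isin))   # True

# Rows that satisfy both conditions at once
warm_recent = toy[(toy["Temperature"] > 35) & (toy["EST"] != "1/1/2016")]
print(warm_recent["EST"].tolist())          # ['1/2/2016', '1/3/2016']

# The keyword `or` fails, because a Series has no single truth value
try:
    toy[(toy["Temperature"] > 35) or (toy["Temperature"] < 30)]
    keyword_error = None
except ValueError as err:
    keyword_error = str(err)
print("ValueError:", keyword_error)
```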

Filtering specific rows and columns from a DataFrame

In [55]:
df[1:5]

Unnamed: 0,EST,Temperature,DewPoint,Humidity,Sea Level PressureIn,VisibilityMiles,WindSpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
1,1/2/2016,36,18,46,30.02,10,7.0,0,3,,275
2,1/3/2016,40,21,47,29.86,10,8.0,0,1,,277
3,1/4/2016,25,9,44,30.05,10,9.0,0,3,,345
4,1/5/2016,20,-3,41,30.57,10,5.0,0,0,,333


If you want a subset of both rows and columns in one go, the selection brackets [] alone are no longer sufficient.
Here the loc/iloc accessors are required in front of the selection brackets []. When using loc/iloc, the part before the comma selects the rows, and the part after the comma selects the columns.

Syntax:
df.loc[row_label, col_label]
df.iloc[row_index, col_index]

loc() vs iloc()

In [67]:
# label based accessing
df.loc[30]

EST                     1/31/2016
Temperature                    46
DewPoint                       28
Humidity                       52
Sea Level PressureIn         29.9
VisibilityMiles                10
WindSpeedMPH                  5.0
PrecipitationIn                 0
CloudCover                      0
Events                        NaN
WindDirDegrees                241
Name: 30, dtype: object

In [70]:
df.loc[30, ['EST','DewPoint']]

EST         1/31/2016
DewPoint           28
Name: 30, dtype: object

In [71]:
# Position based accessing (integer position, not label)
df.iloc[30]

EST                     1/31/2016
Temperature                    46
DewPoint                       28
Humidity                       52
Sea Level PressureIn         29.9
VisibilityMiles                10
WindSpeedMPH                  5.0
PrecipitationIn                 0
CloudCover                      0
Events                        NaN
WindDirDegrees                241
Name: 30, dtype: object

In [72]:
df.iloc[10,[0,5]]

EST                1/11/2016
VisibilityMiles           10
Name: 10, dtype: object

In [74]:
# Slicing with labels

df.loc[ 10:15, "Temperature":'DewPoint' ]

# Observe that label-based slicing includes both the start and the end label

Unnamed: 0,Temperature,DewPoint
10,33,8
11,35,15
12,26,4
13,30,12
14,43,31
15,47,37
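
To make the difference between label-based and position-based slicing concrete, here is a small sketch. The `toy` DataFrame is hypothetical, and its row labels deliberately start at 10 so that labels and positions differ.

```python
import pandas as pd

# Hypothetical toy data with row labels that differ from row positions
toy = pd.DataFrame({"Temperature": [33, 35, 26, 30],
                    "DewPoint": [8, 15, 4, 12]},
                   index=[10, 11, 12, 13])

# loc slices by label and includes both endpoints
by_label = toy.loc[10:12, "Temperature":"DewPoint"]
print(by_label.shape)      # (3, 2)

# iloc slices by position and excludes the end position, like Python lists
by_position = toy.iloc[0:2, 0:1]
print(by_position.shape)   # (2, 1)

# Label 10 sits at position 0, so both expressions read the same cell
print(toy.loc[10, "Temperature"] == toy.iloc[0, 0])   # True
```

With the default 0-based index, as in the weather data, labels and positions coincide, which is why df.loc[30] and df.iloc[30] return the same row.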


Renaming Columns, Modifying DataTypes, Creating New Columns and Deleting Columns in Pandas DataFrame

Question: How to create new columns derived from existing columns?
Answer: There is no need to loop over all rows of your data table to do calculations. Data manipulations on a column work elementwise. Adding a column to a DataFrame based on existing data in other columns is straightforward.

Remember
Create a new column by assigning the output to the DataFrame with a new column name in between the [].
Operations are element-wise, no need to loop over rows.
Use rename() with a dictionary or function to rename row labels or column names.
If you need more advanced logic, you can use arbitrary Python code via apply().

Reading .xlsx File - Retail Store Sales Data

In [2]:
import numpy as np
import pandas as pd

df = pd.read_excel('retail_store_sales.xlsx')
df.head()

Unnamed: 0,Invoice No,Stock-Code,Description,Quantity,Invoice Date,Unit Price,Customer ID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [4]:
df.columns

Index(['Invoice No', ' Stock-Code ', 'Description', 'Quantity', 'Invoice Date',
       'Unit Price', 'Customer ID', 'Country'],
      dtype='object')

In [5]:
df.shape

(541909, 8)

What comes to my mind immediately after looking at the dataset?

How many sales records do we have in the dataset?
How many customers do we have?
What is the date range of data?
Which country recorded maximum sales count?
What is the minimum order amount and maximum order amount?
How many orders for each customer?
What is the revenue contributed by each customer?
What is the revenue generated each year?
Which customer contributed to the maximum revenue each year and how much?
Are there more orders placed on weekends?
How many customers churned (i.e. Customers not making any purchases for more than or equal to 2 months)?
Remember that, as data analysts, we should first be able to ask the right questions. All of these questions can be answered with the help of the pandas module, and we will learn how to answer each of them later. For now, let's understand how to create new columns derived from the existing columns.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   Invoice No    541909 non-null  object        
 1    Stock-Code   541909 non-null  object        
 2   Description   540455 non-null  object        
 3   Quantity      541909 non-null  int64         
 4   Invoice Date  541909 non-null  datetime64[ns]
 5   Unit Price    541909 non-null  float64       
 6   Customer ID   406829 non-null  float64       
 7   Country       541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


In [8]:
print("Total Sales Record:", df.shape[0])
print("Total Customers:", df['Customer ID'].nunique())
print("Date Range:", df['Invoice Date'].min(), "to", df['Invoice Date'].max())

Total Sales Record: 541909
Total Customers: 4372
Date Range: 2010-12-01 08:26:00 to 2011-12-09 12:50:00


In [9]:
# checking all the unique countries
df['Country'].unique()

array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
       'Italy', 'Belgium', 'Lithuania', 'Japan', 'Iceland',
       'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria',
       'Israel', 'Finland', 'Bahrain', 'Greece', 'Hong Kong', 'Singapore',
       'Lebanon', 'United Arab Emirates', 'Saudi Arabia',
       'Czech Republic', 'Canada', 'Unspecified', 'Brazil', 'USA',
       'European Community', 'Malta', 'RSA'], dtype=object)

In [10]:
# Number of sales records per country

df['Country'].value_counts()

Country
United Kingdom          495478
Germany                   9495
France                    8557
EIRE                      8196
Spain                     2533
Netherlands               2371
Belgium                   2069
Switzerland               2002
Portugal                  1519
Australia                 1259
Norway                    1086
Italy                      803
Channel Islands            758
Finland                    695
Cyprus                     622
Sweden                     462
Unspecified                446
Austria                    401
Denmark                    389
Japan                      358
Poland                     341
Israel                     297
USA                        291
Hong Kong                  288
Singapore                  229
Iceland                    182
Canada                     151
Greece                     146
Malta                      127
United Arab Emirates        68
European Community          61
RSA                         58


Renaming Columns
Syntax to rename columns
df.rename(index=None, columns=None)

The rename() function can be used for both row labels and column labels. Provide a dictionary whose keys are the current names and whose values are the new names.
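
A minimal sketch of both uses, on a hypothetical two-row DataFrame named `toy`: a dictionary for selected columns and row labels, and a function (here `str.upper`) applied to every column name.

```python
import pandas as pd

# Hypothetical toy frame for illustration
toy = pd.DataFrame({"Customer ID": [1, 2], "Country": ["France", "Spain"]})

# Dictionary: keys are the current names, values are the new names
renamed = toy.rename(columns={"Customer ID": "Cust ID"},
                     index={0: "first", 1: "second"})
print(renamed.columns.tolist())   # ['Cust ID', 'Country']
print(renamed.index.tolist())     # ['first', 'second']

# A function is applied to every column name
upper = toy.rename(columns=str.upper)
print(upper.columns.tolist())     # ['CUSTOMER ID', 'COUNTRY']

# rename() returns a new DataFrame; the original is left unchanged
print(toy.columns.tolist())       # ['Customer ID', 'Country']
```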

In [11]:
df.columns

Index(['Invoice No', ' Stock-Code ', 'Description', 'Quantity', 'Invoice Date',
       'Unit Price', 'Customer ID', 'Country'],
      dtype='object')

In [12]:
df_renamed = df.rename(columns={'Description': 'Product Description', 'Customer ID': 'Cust ID'})

df_renamed.columns

Index(['Invoice No', ' Stock-Code ', 'Product Description', 'Quantity',
       'Invoice Date', 'Unit Price', 'Cust ID', 'Country'],
      dtype='object')

In [13]:
df_renamed.head()

Unnamed: 0,Invoice No,Stock-Code,Product Description,Quantity,Invoice Date,Unit Price,Cust ID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


A very common column renaming strategy
Let's convert the column names by performing the operations below:

Strip extra spaces
Convert to lower case
Replace spaces and hyphens with underscores
The benefit is that we can then access the columns of the dataframe using dot notation, similar to how we access the properties/attributes of a Python object. For example:
Accessing Invoice No can be done using: df_renamed.invoice_no

In [20]:
col_names = [ col.strip().lower().replace(' ', '_').replace('-', '_') for col in df_renamed ]

print(col_names)

['invoice_no', 'stock_code', 'product_description', 'quantity', 'invoice_date', 'unit_price', 'cust_id', 'country']
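
The same clean-up can also be sketched with the vectorised string methods of the column Index instead of a list comprehension. This is an alternative, not the approach used in the notebook; the regular expression is an assumption that treats any run of characters other than lower-case letters and digits as a separator.

```python
import pandas as pd

# Column names copied from the df_renamed.columns output shown earlier
raw_columns = pd.Index(['Invoice No', ' Stock-Code ', 'Product Description',
                        'Quantity', 'Invoice Date', 'Unit Price', 'Cust ID', 'Country'])

# Strip whitespace, lower-case, then collapse runs of other characters into "_"
clean_columns = (raw_columns.str.strip()
                            .str.lower()
                            .str.replace(r'[^0-9a-z]+', '_', regex=True))
print(clean_columns.tolist())
```

The result can be assigned to df_renamed.columns in the same way as the col_names list.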


Modifying Columns DataType
Modifying the DataType using DataFrame.astype()
We can pass any Python, Numpy, or Pandas datatype to change all columns of a Dataframe to that type, or we can pass a dictionary having column names as keys and datatype as values to change the type of selected columns.

Modifying the DataType using DataFrame.apply()
We can pass pandas.to_numeric, pandas.to_datetime, or pandas.to_timedelta to apply() to change the data type of one or more columns to numeric, datetime, or timedelta respectively.

Modifying the DataType using DataFrame.astype()

In [21]:
df_renamed.columns

Index(['Invoice No', ' Stock-Code ', 'Product Description', 'Quantity',
       'Invoice Date', 'Unit Price', 'Cust ID', 'Country'],
      dtype='object')

In [22]:
df_renamed.columns = col_names
df_renamed.columns

Index(['invoice_no', 'stock_code', 'product_description', 'quantity',
       'invoice_date', 'unit_price', 'cust_id', 'country'],
      dtype='object')

In [24]:
# converting all columns to string type
df_renamed = df_renamed.astype(str)
df_renamed.dtypes

invoice_no             object
stock_code             object
product_description    object
quantity               object
invoice_date           object
unit_price             object
cust_id                object
country                object
dtype: object

In [25]:
df_renamed[['quantity', 'unit_price', 'cust_id']] = df_renamed[['quantity', 'unit_price', 'cust_id']].astype(float)

df_renamed.dtypes

invoice_no              object
stock_code              object
product_description     object
quantity               float64
invoice_date            object
unit_price             float64
cust_id                float64
country                 object
dtype: object

In [26]:
# using dictionary to convert specific columns
convert_dict = {'quantity': int,
                'country': str
                }
 
df_renamed = df_renamed.astype(convert_dict)

df_renamed.dtypes

invoice_no              object
stock_code              object
product_description     object
quantity                 int64
invoice_date            object
unit_price             float64
cust_id                float64
country                 object
dtype: object

Modifying the DataType using DataFrame.apply()

In [27]:
# using apply method to convert datatype
df_renamed['invoice_date'] = df_renamed['invoice_date'].apply(pd.to_datetime)
df_renamed.dtypes

invoice_no                     object
stock_code                     object
product_description            object
quantity                        int64
invoice_date           datetime64[ns]
unit_price                    float64
cust_id                       float64
country                        object
dtype: object
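
The conversion functions can also be called directly on a whole column, which avoids going through apply(). The sketch below uses a hypothetical `toy` DataFrame whose values are stored as strings, including one value that is not a number, to show the `errors="coerce"` option of pd.to_numeric.

```python
import pandas as pd

# Hypothetical toy data stored as strings, with one value that is not a number
toy = pd.DataFrame({"quantity": ["6", "8", "n/a"],
                    "invoice_date": ["2010-12-01 08:26:00",
                                     "2010-12-01 08:28:00",
                                     "2010-12-01 08:34:00"]})

# Passing the whole column converts it in one vectorised call
toy["invoice_date"] = pd.to_datetime(toy["invoice_date"])

# errors="coerce" turns unparseable values into NaN instead of raising an error
toy["quantity"] = pd.to_numeric(toy["quantity"], errors="coerce")

print(toy.dtypes)
print(toy["quantity"].isna().sum())   # 1
```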

Creating a Derived Column

- Creating an amount column by multiplying quantity and unit price
- Think about how you would perform the same operation in NumPy.

In [28]:
df_renamed['amount'] = df_renamed['quantity']*df_renamed['unit_price']
df_renamed.head()

Unnamed: 0,invoice_no,stock_code,product_description,quantity,invoice_date,unit_price,cust_id,country,amount
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,15.3
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34


Remember
The calculation is again element-wise, so the * is applied to the values in each row. Other mathematical operators (+, -, *, /,…) and logical operators (<, >, ==,…) also work element-wise.
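
A short sketch of element-wise behaviour, together with the NumPy counterpart asked about earlier. The `toy` DataFrame is hypothetical and only mirrors the quantity and unit_price columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mirroring the quantity and unit_price columns
toy = pd.DataFrame({"quantity": [6, 6, 8], "unit_price": [2.55, 3.39, 2.75]})

# Arithmetic between two columns is applied row by row, without a loop
toy["amount"] = toy["quantity"] * toy["unit_price"]

# A comparison is element-wise too and yields a boolean column
toy["is_large"] = toy["amount"] > 20

# The same multiplication on the underlying NumPy arrays
amount_np = toy["quantity"].to_numpy() * toy["unit_price"].to_numpy()
print(np.allclose(toy["amount"].to_numpy(), amount_np))   # True
print(toy["is_large"].tolist())                           # [False, True, True]
```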

#### Creating Columns using apply() function
Syntax for DataFrame
df.apply(function, axis=0)
Applies the function column wise.
Axis Parameter
Axis along which the function is applied. Axis can be {0 or ‘index’, 1 or ‘columns’}, default 0:

0 or ‘index’: apply function to each column.
1 or ‘columns’: apply function to each row.
Syntax for Series
series.apply(function)
Applies the function element-wise. Series.apply() has no axis parameter.
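
The three cases can be sketched side by side on a hypothetical numeric DataFrame named `toy`: what the function receives depends on whether apply() is called on a DataFrame with axis=0, on a DataFrame with axis=1, or on a Series.

```python
import pandas as pd

# Hypothetical numeric toy frame
toy = pd.DataFrame({"quantity": [6, 8, 4], "unit_price": [2.5, 3.0, 1.5]})

# axis=0 (default): the function receives each column as a Series
col_max = toy.apply(lambda col: col.max())
print(col_max)

# axis=1: the function receives each row as a Series
row_amount = toy.apply(lambda row: row["quantity"] * row["unit_price"], axis=1)
print(row_amount.tolist())   # [15.0, 24.0, 6.0]

# Series.apply: the function receives one element at a time
doubled = toy["quantity"].apply(lambda q: q * 2)
print(doubled.tolist())      # [12, 16, 8]
```

Row-wise apply() calls the Python function once per row, so on large frames a direct column expression such as toy["quantity"] * toy["unit_price"] is usually much faster.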

In [29]:
df_renamed.dtypes

invoice_no                     object
stock_code                     object
product_description            object
quantity                        int64
invoice_date           datetime64[ns]
unit_price                    float64
cust_id                       float64
country                        object
amount                        float64
dtype: object

In [30]:
# np.max function is applied column wise by default - i.e. axis=0

df_renamed.apply(np.max)

invoice_no                         C581569
stock_code                               m
product_description      wrongly sold sets
quantity                             80995
invoice_date           2011-12-09 12:50:00
unit_price                         38970.0
cust_id                            18287.0
country                        Unspecified
amount                            168469.6
dtype: object

In [31]:
# Apply a function on the complete column at once
df_renamed[['amount']].apply(np.mean)

amount    17.987795
dtype: float64

In [None]:
# There is a much better way of performing the above operation: call mean() directly on the column

df_renamed['amount'].mean()

In [32]:
# Apply a function to each element of the column. Returns a Series.
# np.mean of a single value is the value itself, so the column comes back unchanged.

df_renamed['amount'].apply(np.mean)

0         15.30
1         20.34
2         22.00
3         20.34
4         20.34
          ...  
541904    10.20
541905    12.60
541906    16.60
541907    16.60
541908    14.85
Name: amount, Length: 541909, dtype: float64

In [36]:
# Creating new column using apply()
# Let's assume we have to create a column - new_amount
# new_amount = quantity * unit_price
# we already saw how to perform this using df['amount'] = df['quantity'] * df['unit_price']
# Let's do the same operation using apply() function now

df_renamed['new_amount'] = df_renamed.apply(lambda row: row['quantity'] * row['unit_price'], axis=1)

df_renamed.head()

Unnamed: 0,invoice_no,stock_code,product_description,quantity,invoice_date,unit_price,cust_id,country,amount,new_amount
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,15.3,15.3
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,22.0,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34,20.34


Deleting column(s) in DataFrame
Syntax 1 - Dropping columns by column name

# Dropping two columns by passing column names
# inplace=True performs the operation and saves the result back to the dataframe
df.drop(['col1', 'col3'], axis=1, inplace=True)
Syntax 2 - Removing columns by column name using loc[]

# Removing all columns from col2 to col4 (both included)
df.drop(df.loc[:, 'col2':'col4'], inplace=True, axis=1)
Syntax 3 - Removing columns by position

# Removing three columns by position
df.drop(df.columns[[0, 4, 2]], axis=1, inplace=True)
Syntax 4 - Removing columns by position using iloc[]

# Removing the two columns at positions 1 and 2 (the end position 3 is excluded)
df.drop(df.iloc[:, 1:3], inplace=True, axis=1)
Syntax 5 - DataFrame.pop() method

# Using pop() we can delete a single column at a time
df.pop("Col4")

In [37]:
df_renamed.columns

Index(['invoice_no', 'stock_code', 'product_description', 'quantity',
       'invoice_date', 'unit_price', 'cust_id', 'country', 'amount',
       'new_amount'],
      dtype='object')

In [40]:
# syntax 1
df_renamed.drop(['amount'], axis=1)

Unnamed: 0,invoice_no,stock_code,product_description,quantity,invoice_date,unit_price,cust_id,country,new_amount
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,15.30
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,22.00
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
...,...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France,10.20
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France,12.60
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France,16.60
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France,16.60


In [41]:
df_renamed.columns

Index(['invoice_no', 'stock_code', 'product_description', 'quantity',
       'invoice_date', 'unit_price', 'cust_id', 'country', 'amount',
       'new_amount'],
      dtype='object')

Observation
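Observe that the amount column has not been removed from the dataframe: drop() returns a new DataFrame and leaves the original unchanged. To make the change permanent, pass inplace=True or assign the result back to the variable.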

In [42]:
df_renamed.drop(['amount'], axis=1, inplace=True)

In [43]:
df_renamed.columns

Index(['invoice_no', 'stock_code', 'product_description', 'quantity',
       'invoice_date', 'unit_price', 'cust_id', 'country', 'new_amount'],
      dtype='object')

In [44]:
# Syntax 2

df_renamed.drop(df_renamed.loc[:, 'invoice_no':'invoice_date'], axis=1)

Unnamed: 0,unit_price,cust_id,country,new_amount
0,2.55,17850.0,United Kingdom,15.30
1,3.39,17850.0,United Kingdom,20.34
2,2.75,17850.0,United Kingdom,22.00
3,3.39,17850.0,United Kingdom,20.34
4,3.39,17850.0,United Kingdom,20.34
...,...,...,...,...
541904,0.85,12680.0,France,10.20
541905,2.10,12680.0,France,12.60
541906,4.15,12680.0,France,16.60
541907,4.15,12680.0,France,16.60


In [51]:
df_renamed.loc[[50000]]

Unnamed: 0,invoice_no,stock_code,product_description,quantity,invoice_date,unit_price,cust_id,country,new_amount
50000,540542,22639,SET OF 4 NAPKIN CHARMS HEARTS,6,2011-01-09 15:18:00,2.55,15107.0,United Kingdom,15.3


In [52]:
df_renamed.iloc[1:5]

Unnamed: 0,invoice_no,stock_code,product_description,quantity,invoice_date,unit_price,cust_id,country,new_amount
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34


In [53]:
# Syntax 3

df_renamed.drop(df_renamed.columns[[0, 4, 2]], axis=1)

Unnamed: 0,stock_code,quantity,unit_price,cust_id,country,new_amount
0,85123A,6,2.55,17850.0,United Kingdom,15.30
1,71053,6,3.39,17850.0,United Kingdom,20.34
2,84406B,8,2.75,17850.0,United Kingdom,22.00
3,84029G,6,3.39,17850.0,United Kingdom,20.34
4,84029E,6,3.39,17850.0,United Kingdom,20.34
...,...,...,...,...,...,...
541904,22613,12,0.85,12680.0,France,10.20
541905,22899,6,2.10,12680.0,France,12.60
541906,23254,4,4.15,12680.0,France,16.60
541907,23255,4,4.15,12680.0,France,16.60


In [54]:
df_renamed.columns

Index(['invoice_no', 'stock_code', 'product_description', 'quantity',
       'invoice_date', 'unit_price', 'cust_id', 'country', 'new_amount'],
      dtype='object')

Observation
Observe that the columns have not been removed from the dataframe, because drop() returns a new DataFrame. To make the change permanent, pass inplace=True or assign the result back to the variable.

In [55]:
# Syntax 4

df_renamed.drop(df_renamed.iloc[:, 1:3], axis=1)

Unnamed: 0,invoice_no,quantity,invoice_date,unit_price,cust_id,country,new_amount
0,536365,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,15.30
1,536365,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
2,536365,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,22.00
3,536365,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
4,536365,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
...,...,...,...,...,...,...,...
541904,581587,12,2011-12-09 12:50:00,0.85,12680.0,France,10.20
541905,581587,6,2011-12-09 12:50:00,2.10,12680.0,France,12.60
541906,581587,4,2011-12-09 12:50:00,4.15,12680.0,France,16.60
541907,581587,4,2011-12-09 12:50:00,4.15,12680.0,France,16.60


In [57]:
# Syntax 5

df_renamed.pop("new_amount")

0         15.30
1         20.34
2         22.00
3         20.34
4         20.34
          ...  
541904    10.20
541905    12.60
541906    16.60
541907    16.60
541908    14.85
Name: new_amount, Length: 541909, dtype: float64

In [58]:
df_renamed.columns

Index(['invoice_no', 'stock_code', 'product_description', 'quantity',
       'invoice_date', 'unit_price', 'cust_id', 'country'],
      dtype='object')

Remember that the DataFrame.pop("Col_Name") function:

Removes a single column and returns the deleted column.
Applies the change to the dataframe directly, without any need for inplace=True.
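
The difference between drop() and pop() can be sketched on a hypothetical three-column DataFrame. The example uses the `columns=` keyword of drop(), which is equivalent to passing the labels with axis=1.

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# drop() returns a new DataFrame and leaves toy untouched
without_b = toy.drop(columns=["b"])
print(toy.columns.tolist())         # ['a', 'b', 'c']
print(without_b.columns.tolist())   # ['a', 'c']

# Assigning the result back is an alternative to inplace=True
toy = toy.drop(columns=["b"])

# pop() removes the column from toy itself and returns it as a Series
popped = toy.pop("c")
print(popped.tolist())              # [5, 6]
print(toy.columns.tolist())         # ['a']
```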

Adding/Inserting Row(s)
Reading a .xlsx File - Weather Data

In [59]:
import pandas as pd
import numpy as np

In [60]:
df = pd.read_excel('weather_data.xlsx')
df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny


In [61]:
df.shape


(6, 4)

Insert Row(s) using Dictionary - pandas.concat()
Syntax 1 - Inserting a single row

# Create a new record using a dictionary
new_record = pd.DataFrame([{'day': '1/7/2017', 'temperature': 36, 'windspeed': 4, 'event': 'Sunny'}])

# Inserting the row at the end
df = pd.concat([df, new_record], ignore_index=True)

# Or inserting the row at the top
df = pd.concat([new_record, df], ignore_index=True)
Syntax 2 - Inserting multiple rows (i.e. a batch of data)

# Create new records using a list of dictionaries
batch_records = pd.DataFrame([{'day': '1/8/2017', 'temperature': 30, 'windspeed': 3, 'event': 'Rain'}, {'day': '1/9/2017', 'temperature': 27, 'windspeed': 4, 'event': 'Snow'}])

# Inserting the rows at the end
df = pd.concat([df, batch_records], ignore_index=True)

# Or inserting the rows at the top
df = pd.concat([batch_records, df], ignore_index=True)

In [62]:
# Create a new record using a dictionary
new_record = pd.DataFrame([{'day': '1/7/2017', 
                            'temperature': 36, 
                            'windspeed': 4, 
                            'event': 'Sunny'}])

# Inserting row at the end
df = pd.concat([df, new_record], ignore_index=True)

df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
6,1/7/2017,36,4,Sunny


In [63]:
# Create new records using a list of dictionaries
batch_records = pd.DataFrame([{'day': '1/8/2017', 'temperature': 30, 'windspeed': 3, 'event': 'Rain'}, 
                              {'day': '1/9/2017', 'temperature': 27, 'windspeed': 4, 'event': 'Snow'}])

# Inserting row at the end
df = pd.concat([df, batch_records], ignore_index=True)

df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
6,1/7/2017,36,4,Sunny
7,1/8/2017,30,3,Rain
8,1/9/2017,27,4,Snow


Inserting a Row using List - .loc[] and .iloc[]
Adding a list to a pandas DataFrame works a bit differently, since a plain list cannot be passed to pd.concat() directly. Instead, we use the loc accessor. The label that we pass to loc is the length of the DataFrame, which (with the default 0-based integer index) is one past the last existing label, so the assignment creates a new row.

Syntax - Using DataFrame.loc[]

df.loc[len(df)] = ['1/12/2017', 28, 2, 'Rain']
Syntax - Using DataFrame.iloc[]
Generates an error - .iloc cannot enlarge its target object (i.e. .iloc can't be used to add new rows).
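
The contrast can be sketched as follows on a hypothetical two-row DataFrame with the same columns as the weather data. The exception type caught here (IndexError) reflects current pandas behaviour as far as I know; treat it as an assumption if you run a different version.

```python
import pandas as pd

# Hypothetical toy frame with the same columns as the weather data
toy = pd.DataFrame({"day": ["1/1/2017", "1/2/2017"],
                    "temperature": [32, 35],
                    "windspeed": [6, 7],
                    "event": ["Rain", "Sunny"]})

# loc with a label that does not exist yet appends a new row
toy.loc[len(toy)] = ["1/3/2017", 28, 2, "Snow"]
print(len(toy))   # 3

# iloc only addresses existing positions, so the same assignment fails
try:
    toy.iloc[len(toy)] = ["1/4/2017", 24, 7, "Snow"]
    iloc_error = None
except IndexError as err:
    iloc_error = str(err)
print(iloc_error)
```

Note that len(df) is only a safe new label when the index is the default 0-based range; if that label already exists, the assignment overwrites the existing row instead of adding one.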

In [64]:
df.loc[len(df)] = ['1/12/2017', 28, 2, 'Rain']
df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
6,1/7/2017,36,4,Sunny
7,1/8/2017,30,3,Rain
8,1/9/2017,27,4,Snow
9,1/12/2017,28,2,Rain


Inserting a Row at a Specific Index of a DataFrame
Adding a row at a specific index is a bit different. As in the list example, we use the loc accessor. However, assigning to a label that already exists only overwrites that row. What we can do instead is assign to a label that sorts between the two neighbouring rows, then sort and reset the index.

For example, if the current indices run from 0 to 9 and we want the new row to end up at index 9, we can assign it using label 8.5. Let's see how this works:

Syntax - Inserting a row at a specific index

# Adding at row label 8.5
df.loc[8.5] = ['1/11/2017', 30, 3, 'Rain']

# sort index
df = df.sort_index().reset_index(drop=True)

df
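
An alternative that avoids fractional labels is to split the DataFrame at the target position and concatenate the pieces around the new row. The sketch below assumes a default integer index; `insert_row` is a hypothetical helper written for this example, not a pandas function.

```python
import pandas as pd

def insert_row(frame, position, row):
    """Hypothetical helper: return a copy of frame with row (a dict) inserted at position."""
    new_row = pd.DataFrame([row])
    return pd.concat([frame.iloc[:position], new_row, frame.iloc[position:]],
                     ignore_index=True)

# Hypothetical toy frame
toy = pd.DataFrame({"day": ["1/1/2017", "1/3/2017"],
                    "temperature": [32, 28]})

toy = insert_row(toy, 1, {"day": "1/2/2017", "temperature": 35})
print(toy["day"].tolist())   # ['1/1/2017', '1/2/2017', '1/3/2017']
print(toy.index.tolist())    # [0, 1, 2]
```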

In [65]:
df.loc[8.5] = ['1/10/2017', 30, 3, 'Rain']

#sort index
df = df.sort_index().reset_index(drop=True)

df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
6,1/7/2017,36,4,Sunny
7,1/8/2017,30,3,Rain
8,1/9/2017,27,4,Snow
9,1/10/2017,30,3,Rain
10,1/12/2017,28,2,Rain


In [66]:
# Adding at row label 9.5
df.loc[9.5] = ['1/11/2017', 27, 1, 'Snow']

#sort index
df = df.sort_index().reset_index(drop=True)

df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
6,1/7/2017,36,4,Sunny
7,1/8/2017,30,3,Rain
8,1/9/2017,27,4,Snow
9,1/10/2017,30,3,Rain
10,1/11/2017,27,1,Snow
11,1/12/2017,28,2,Rain


Saving DataFrame to .xlsx

In [67]:
df.to_excel('temp/updated_weather_data.xlsx', sheet_name='weather_data')

### Handling TimeSeries Data

Question: How to handle time series data?
Answer: pandas has great support for time series and has an extensive set of tools for working with dates, times, and time-indexed data.

Remember
Valid date strings can be converted to datetime objects using the to_datetime function or as part of the read functions.
Datetime (pandas.Timestamp) objects in pandas support calculations, logical operations and convenient date-related properties through the dt accessor, like year, month, day, day_of_week, day_of_year, is_leap_year, etc.
We can also access datetime methods through the dt accessor, like day_name(), month_name(), etc.
pandas.Timedelta represents a duration, the difference between two dates or times. Many properties of a timedelta column can be accessed through dt, like components, days, seconds, etc.
We can also access timedelta methods through the dt accessor, like total_seconds().
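
A minimal sketch of these points, using a hypothetical two-row DataFrame with day/month/year date strings. The column names are made up for the example.

```python
import pandas as pd

# Hypothetical order and ship dates in day/month/year format
toy = pd.DataFrame({"order_date": ["08/11/2017", "12/06/2017"],
                    "ship_date": ["11/11/2017", "16/06/2017"]})
toy["order_date"] = pd.to_datetime(toy["order_date"], format="%d/%m/%Y")
toy["ship_date"] = pd.to_datetime(toy["ship_date"], format="%d/%m/%Y")

# Datetime properties and methods through the dt accessor
print(toy["order_date"].dt.year.tolist())         # [2017, 2017]
print(toy["order_date"].dt.day_name().tolist())   # ['Wednesday', 'Monday']

# Subtracting two datetime columns gives a timedelta column
toy["ship_delay"] = toy["ship_date"] - toy["order_date"]
print(toy["ship_delay"].dt.days.tolist())              # [3, 4]
print(toy["ship_delay"].dt.total_seconds().tolist())   # [259200.0, 345600.0]
```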

Reading .csv File - Online Store Sales Data

In [2]:
import pandas as pd
df = pd.read_csv('online_store_sales.csv')
df.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales
0,1,CA-2017-152156,08/11/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96
1,2,CA-2017-152156,08/11/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94
2,3,CA-2017-138688,12/06/2017,16/06/2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62
3,4,US-2016-108966,11/10/2016,18/10/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,5,US-2016-108966,11/10/2016,18/10/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368


In [69]:
df.shape

(9800, 18)

In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Row ID         9800 non-null   int64  
 1   Order ID       9800 non-null   object 
 2   Order Date     9800 non-null   object 
 3   Ship Date      9800 non-null   object 
 4   Ship Mode      9800 non-null   object 
 5   Customer ID    9800 non-null   object 
 6   Customer Name  9800 non-null   object 
 7   Segment        9800 non-null   object 
 8   Country        9800 non-null   object 
 9   City           9800 non-null   object 
 10  State          9800 non-null   object 
 11  Postal Code    9789 non-null   float64
 12  Region         9800 non-null   object 
 13  Product ID     9800 non-null   object 
 14  Category       9800 non-null   object 
 15  Sub-Category   9800 non-null   object 
 16  Product Name   9800 non-null   object 
 17  Sales          9800 non-null   float64
dtypes: float64(2), int64(1), object(15)

What comes to my mind immediately after looking at the dataset?

What are the different customer segments?
How many sales records do we have in the dataset?
Which region recorded maximum sales count?
What are the different product categories?
What is the minimum order amount and maximum order amount?
What is the revenue generated in the year 2017?
Which customer contributed to the maximum revenue in 2017 and how much?
Which product category is doing best? (revenue and count)
Are there more orders placed on weekends?
How many days on average does it take for the products to get shipped?
As a data analyst, the first skill to develop is asking the right questions. Each of these questions can be answered with the help of the Pandas module, and we will learn how to answer them later. For now, let's understand how to create new columns derived from the existing columns.

In [73]:
pd.to_datetime(df['Ship Date'], dayfirst=True)

0      2017-11-11
1      2017-11-11
2      2017-06-16
3      2016-10-18
4      2016-10-18
          ...    
9795   2017-05-28
9796   2016-01-17
9797   2016-01-17
9798   2016-01-17
9799   2016-01-17
Name: Ship Date, Length: 9800, dtype: datetime64[ns]

In [74]:
pd.to_datetime(df['Ship Date'], format='%d/%m/%Y')

0      2017-11-11
1      2017-11-11
2      2017-06-16
3      2016-10-18
4      2016-10-18
          ...    
9795   2017-05-28
9796   2016-01-17
9797   2016-01-17
9798   2016-01-17
9799   2016-01-17
Name: Ship Date, Length: 9800, dtype: datetime64[ns]

In [75]:
df['Ship Date'] = pd.to_datetime(df['Ship Date'], format="%d/%m/%Y")
df['Order Date'] = pd.to_datetime(df['Order Date'], format="%d/%m/%Y")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Row ID         9800 non-null   int64         
 1   Order ID       9800 non-null   object        
 2   Order Date     9800 non-null   datetime64[ns]
 3   Ship Date      9800 non-null   datetime64[ns]
 4   Ship Mode      9800 non-null   object        
 5   Customer ID    9800 non-null   object        
 6   Customer Name  9800 non-null   object        
 7   Segment        9800 non-null   object        
 8   Country        9800 non-null   object        
 9   City           9800 non-null   object        
 10  State          9800 non-null   object        
 11  Postal Code    9789 non-null   float64       
 12  Region         9800 non-null   object        
 13  Product ID     9800 non-null   object        
 14  Category       9800 non-null   object        
 15  Sub-Category   9800 n

Initially, the values in Order Date and Ship Date were character strings, which do not support any datetime operations (e.g. extracting the year or the day of the week). By applying the to_datetime function, pandas interprets the strings and converts them to datetime (i.e. datetime64[ns]) objects, as the df.info() output above shows.
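To make the conversion concrete, here is a minimal sketch on a small hand-made Series (the values are hypothetical and only mimic the day/month/year layout of the Ship Date column). It shows that an explicit format and dayfirst=True lead to the same parsed dates for this kind of input.

```python
import pandas as pd

# Hypothetical sample in day/month/year order, like the Ship Date column
raw = pd.Series(["11/11/2017", "16/06/2017", "05/03/2016"])

# Explicit format: every string is read as day/month/year
parsed = pd.to_datetime(raw, format="%d/%m/%Y")

# dayfirst=True asks pandas to prefer day/month/year when inferring the format
parsed_dayfirst = pd.to_datetime(raw, dayfirst=True)

print(parsed.dt.month.tolist())        # [11, 6, 3]
print(parsed_dayfirst.dt.day.tolist()) # [11, 16, 5]
```

Once the column is datetime-like, the dt accessor becomes available; on the original string column it is not.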

Important Note
Because many datasets contain datetime information in one of their columns, pandas input functions such as pandas.read_csv() can convert those columns to dates while reading the data, when the parse_dates parameter is given a list of the columns to read as Timestamp (pandas.read_json() has a similar convert_dates parameter):
pd.read_csv(PATH, parse_dates=["cols"])

Remember the warnings while parsing dates?
You can avoid those warnings by passing one of two parameters: dayfirst=True or, in pandas 2.0 and later, date_format (e.g. date_format='%d/%m/%Y').
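As a rough illustration, the sketch below reads a tiny in-memory CSV instead of the real file (the two rows and the name csv_text are made up for the example) and parses both date columns while loading.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real sales file
csv_text = (
    "Order Date,Ship Date,Sales\n"
    "08/11/2017,11/11/2017,261.96\n"
    "12/06/2017,16/06/2017,14.62\n"
)

sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Order Date", "Ship Date"],
    dayfirst=True,  # read 08/11/2017 as 8 November, not 11 August
)

print(sample.dtypes)
print(sample["Order Date"].dt.month.tolist())  # [11, 6]
```

Parsing at read time avoids a separate to_datetime step for each date column.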

In [3]:
df = pd.read_csv('online_store_sales.csv', parse_dates=["Order Date", "Ship Date"], dayfirst=True)

df.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales
0,1,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96
1,2,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94
2,3,CA-2017-138688,2017-06-12,2017-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62
3,4,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,5,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368


In [5]:
col_names = [ col.strip().lower().replace(' ', '_').replace('-', '_') for col in df.columns ]

df.columns = col_names

df.columns

Index(['row_id', 'order_id', 'order_date', 'ship_date', 'ship_mode',
       'customer_id', 'customer_name', 'segment', 'country', 'city', 'state',
       'postal_code', 'region', 'product_id', 'category', 'sub_category',
       'product_name', 'sales'],
      dtype='object')
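The same clean-up can also be written with the vectorized str methods on the column Index. This is only an alternative sketch on an empty, hypothetical DataFrame, not a change to the approach used above.

```python
import pandas as pd

# Hypothetical frame with messy column names (no rows needed for the demo)
sample = pd.DataFrame(columns=["Row ID", "Sub-Category", " Ship Date "])

# Chain the string operations on the Index instead of looping in Python
sample.columns = (
    sample.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
    .str.replace("-", "_", regex=False)
)

print(list(sample.columns))  # ['row_id', 'sub_category', 'ship_date']
```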

In [78]:
df['order_date'].min()

Timestamp('2015-01-03 00:00:00')

In [80]:
print("Orders starting from", df['order_date'].min(), "till", df['order_date'].max())

Orders starting from 2015-01-03 00:00:00 till 2018-12-30 00:00:00


In [81]:
df['order_date'].max() - df['order_date'].min()

Timedelta('1457 days 00:00:00')
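Subtracting two Timestamp values gives a Timedelta, and its days attribute returns the span as a plain integer. A small sketch using the two dates printed above:

```python
import pandas as pd

# The first and last order dates reported by min() and max() above
first_order = pd.Timestamp("2015-01-03")
last_order = pd.Timestamp("2018-12-30")

span = last_order - first_order  # a Timedelta
print(span.days)                 # 1457
```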

Working with DateTime in Pandas
Get year, month, and day
df['year']= df['DoB'].dt.year
df['month']= df['DoB'].dt.month
df['day']= df['DoB'].dt.day
Get the week of year, the day of week and leap year
df['week_of_year'] = df['DoB'].dt.isocalendar().week  # .dt.week is not available in pandas 2.x
df['day_of_week'] = df['DoB'].dt.dayofweek
df['is_leap_year'] = df['DoB'].dt.is_leap_year

dw_mapping={
    0: 'Monday', 
    1: 'Tuesday', 
    2: 'Wednesday', 
    3: 'Thursday', 
    4: 'Friday',
    5: 'Saturday', 
    6: 'Sunday'
} 
df['day_of_week_name']=df['DoB'].dt.weekday.map(dw_mapping)
Get the age from the date of birth
today = pd.to_datetime('today')
df['age'] = today.year - df['DoB'].dt.year
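The snippets above can be tried on a small, self-contained example. The DataFrame people, its two birth dates, and the fixed reference date below are all hypothetical. The sketch also shows one possible refinement of the age calculation: subtracting only the years overstates the age by one when the birthday has not yet occurred in the reference year.

```python
import pandas as pd

# Hypothetical data: two dates of birth
people = pd.DataFrame({"DoB": pd.to_datetime(["1990-06-15", "2000-02-29"])})

# ISO week number and weekday name
people["week_of_year"] = people["DoB"].dt.isocalendar().week
people["day_of_week_name"] = people["DoB"].dt.day_name()

# Fixed reference date (an assumption, so the result is reproducible)
today = pd.Timestamp("2023-06-01")

# True where the birthday has already occurred in the reference year
birthday_passed = (people["DoB"].dt.month < today.month) | (
    (people["DoB"].dt.month == today.month) & (people["DoB"].dt.day <= today.day)
)

# Subtract one year where the birthday is still ahead
people["age"] = today.year - people["DoB"].dt.year - (~birthday_passed).astype(int)

print(people)
```

With the reference date 2023-06-01, the first person is still 32 because 15 June has not been reached, while the plain year difference would give 33.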

In [10]:
today = pd.to_datetime('today')
print(today)

2023-12-30 11:19:49.002854


In [6]:
df['order_date'].dt.year

0       2017
1       2017
2       2017
3       2016
4       2016
        ... 
9795    2017
9796    2016
9797    2016
9798    2016
9799    2016
Name: order_date, Length: 9800, dtype: int32

In [9]:
today = pd.to_datetime('today')
today.year - df['order_date'].dt.year

0       6
1       6
2       6
3       7
4       7
       ..
9795    6
9796    7
9797    7
9798    7
9799    7
Name: order_date, Length: 9800, dtype: int32

In [13]:
df['order_date'].dt.day_name()

0       Wednesday
1       Wednesday
2          Monday
3         Tuesday
4         Tuesday
          ...    
9795       Sunday
9796      Tuesday
9797      Tuesday
9798      Tuesday
9799      Tuesday
Name: order_date, Length: 9800, dtype: object

In [14]:
df['order_date'].dt.month_name()

0       November
1       November
2           June
3        October
4        October
          ...   
9795         May
9796     January
9797     January
9798     January
9799     January
Name: order_date, Length: 9800, dtype: object

Creating a Column containing only the Order Month
Because the dates are stored as Timestamp objects, pandas provides many time-related properties for them, such as the month, year, and quarter. These properties are accessible through the dt accessor, for example year, month, day, quarter, day_of_week, day_of_year, and is_leap_year. The dt accessor also provides methods such as day_name(), month_name(), and isocalendar().
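A few of these properties, shown on a hypothetical two-value Series rather than on the full dataset:

```python
import pandas as pd

# Hypothetical order dates
dates = pd.Series(pd.to_datetime(["2017-11-08", "2017-06-12"]))

print(dates.dt.quarter.tolist())       # [4, 2]
print(dates.dt.day_of_year.tolist())   # [312, 163]
print(dates.dt.is_leap_year.tolist())  # [False, False]
```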

In [15]:
df['order_month'] = df['order_date'].dt.month

df.head()

Unnamed: 0,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales,order_month
0,1,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,11
1,2,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,11
2,3,CA-2017-138688,2017-06-12,2017-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,6
3,4,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,10
4,5,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,10


Calculating Delivery Time from Order Date and Ship Date

In [16]:
df['delivery_time'] = df['ship_date'] - df['order_date']

df.head()

Unnamed: 0,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales,order_month,delivery_time
0,1,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,11,3 days
1,2,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,11,3 days
2,3,CA-2017-138688,2017-06-12,2017-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,6,4 days
3,4,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,10,7 days
4,5,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,10,7 days
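The delivery_time column holds Timedelta values. If a plain number of days is easier to work with, dt.days extracts it. The sketch below uses three hypothetical rows taken from the dates shown above, and delivery_days is an illustrative column name that is not part of the notebook.

```python
import pandas as pd

# Hypothetical subset of the order and ship dates shown above
sample = pd.DataFrame({
    "order_date": pd.to_datetime(["2017-11-08", "2017-06-12", "2016-10-11"]),
    "ship_date": pd.to_datetime(["2017-11-11", "2017-06-16", "2016-10-18"]),
})

# Subtracting two datetime columns gives a timedelta column
sample["delivery_time"] = sample["ship_date"] - sample["order_date"]

# Whole days as integers (delivery_days is an illustrative name)
sample["delivery_days"] = sample["delivery_time"].dt.days

print(sample["delivery_days"].tolist())  # [3, 4, 7]
print(sample["delivery_time"].mean())    # average delivery time as a Timedelta
```

Taking the mean of such a column is one way to approach the earlier question about the average number of days it takes for products to get shipped.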


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype          
---  ------         --------------  -----          
 0   row_id         9800 non-null   int64          
 1   order_id       9800 non-null   object         
 2   order_date     9800 non-null   datetime64[ns] 
 3   ship_date      9800 non-null   datetime64[ns] 
 4   ship_mode      9800 non-null   object         
 5   customer_id    9800 non-null   object         
 6   customer_name  9800 non-null   object         
 7   segment        9800 non-null   object         
 8   country        9800 non-null   object         
 9   city           9800 non-null   object         
 10  state          9800 non-null   object         
 11  postal_code    9789 non-null   float64        
 12  region         9800 non-null   object         
 13  product_id     9800 non-null   object         
 14  category       9800 non-null   object         
 15  sub_