Putting Some Pandas In Your Python 🐼
# Introduction to Pandas 🐼

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

Reference: https://pandas.pydata.org/docs/getting_started/index.html

- Question: What are the Data Structures in Pandas?
- Answer: Series (a one-dimensional labeled array, similar to a 1-D NumPy array) and DataFrame (a two-dimensional labeled table, similar to a 2-D NumPy array)

Installation Command
! pip install pandas

Importing Pandas
import pandas as pd
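
With pandas imported, the two data structures named above can be sketched as follows. This is a minimal illustration with made-up values; the variable names `temperatures` and `weather` are assumptions and do not come from the notebook.

```python
import pandas as pd

# A Series is a one-dimensional array of values with an index of labels.
temperatures = pd.Series([32, 35, 28], index=["mon", "tue", "wed"])

# A DataFrame is a two-dimensional table; each column is itself a Series.
weather = pd.DataFrame(
    {"temperature": [32, 35, 28], "event": ["Rain", "Sunny", "Snow"]},
    index=["mon", "tue", "wed"],
)

print(temperatures["tue"])            # label-based lookup on a Series
print(type(weather["temperature"]))   # a single column is a Series
print(weather.shape)                  # (number of rows, number of columns)
```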

### What's covered in this notebook?
1. Pandas Data Structure - Series (ndarray-like)
   - Creating Series using Python list or dict
   - Creating Series from Numpy ndarray
   - Creating Series from scalar
   - Accessing Properties/Attributes and Methods of Series
   - Accessing data using Indexing and Slicing
2. Pandas Data Structure - DataFrame
   - Creating DataFrame using Python dict, list or tuple
   - Creating DataFrame using Numpy Array
   - Accessing Attributes/Properties and Methods of DataFrame
3. Working with Tabular Data
   - Dataframe to .csv & .xlsx
   - Reading .xlsx File
   - Reading .csv File - Iris Dataset
4. Non-Visual Data Analysis using Pandas (Statistical Analysis)
   - sum()
   - min() and max()
   - mean(), median(), var() and std()
   - describe() to summarize the data
   - corr(), skew() and kurt()
   - count(), unique() and value_counts() for categorical column
   - DataFrame.agg()
5. Accessing Data in a DataFrame using Indexing and Slicing in Pandas DataFrame
   - Reading .csv File - Weather Dataset
   - Filtering Single Column vs Multiple Columns from a DataFrame
   - Filtering Rows from a DataFrame
   - Filtering specific rows and columns from a DataFrame
   - loc[] vs iloc[]
6. Renaming Columns, Modifying DataTypes, Creating New Columns and Deleting Columns in Pandas DataFrame
   - Reading .xlsx File - Retail Store Sales Data
   - Renaming Columns
   - Modifying Columns DataTypes
   - Creating a Derived Column
   - Creating columns using apply() function
   - Deleting column(s) in DataFrame
7. Adding/Inserting Row(s)
   - Reading .xlsx File - Weather Data
   - Insert Row(s) using pandas.concat()
   - Inserting a Row using List - .loc[] and .iloc[]
   - Inserting a Row at a Specific Index of a DataFrame
   - Saving DataFrame to .xlsx
8. Handling TimeSeries Data
   - Reading .csv File - Online Store Sales Data
   - pd.to_datetime()
   - Working with DateTime in Pandas
   - Creating a Column containing only the Order Month
   - Calculating Delivery Time from Order Date and Ship Date
   - pandas.Timedelta
   - Creating a Column containing Delivery Time in Number of Days
   - Improve Performance by Setting Date Column as the Index
   - Sorting Data Based on Index vs Values and Resetting Index
9. Summary

Q1. Import Pandas Module and numpy module

Q2. Create Series using Python list or dict

Q3. Create Series from Numpy ndarray

Q4. Create a Series from a scalar; the index should start from 1 or 'a'

Accessing Properties/Attributes and Methods of Series

Q1. Create array of numbers, check for the data type, shape, values and the length

Q2. Convert s = pd.Series([1,2,3,4,5,6,7,8,9]) to a NumPy array using to_numpy()

Q3. Inspect the Series using head(), tail() and info()

Accessing data using Indexing and Slicing

Q1. Access the 2nd and 4th elements in s = pd.Series([1, 2, 3, 4, 5]) using the indexing or slicing method

Q2. Create a pandas series with index=['a', 'b', 'c', 'd', 'e']

## Pandas Data Structure - DataFrame

DataFrame is a general 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns.

Important Note: Pandas data structures are value-mutable (the values they contain can be altered) as well as size-mutable.

Q1. Create a DataFrame using Python dict, list or tuple to print Name, Age and Gender

Q2. Create a DataFrame using tuples and assign it to a variable named data
       ('1/1/2019', 13, 6, 'Rain'),
       ('2/1/2019', 11, 7, 'Fog'),
       ('3/1/2019', 12, 8, 'Sunny'),
       ('4/1/2019', 8, 5, 'Snow'),
       ('5/1/2019', 9, 6, 'Rain')
Note: The DataFrame should have at least 50 rows

Q3. Name the columns of the Q2 DataFrame: Day, Temperature, Windspeed and Event

Q4. Print the Temperature column

Creating DataFrame using Numpy Array

Q1. Create an array using random.randint. choose the range and the size should be (1000,100)

Q2. Convert Q1 to a DataFrame 

Q3. Write a program to output col_1 to col_100 for the Dataframe in Q2
Hint: col_ + str(i) followed by a for loop
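
The hint above can be sketched like this. The integer range 0 to 99 is an assumption (the exercise leaves the range open), and a list comprehension is used as a compact form of the for loop.

```python
import numpy as np
import pandas as pd

# Assumed range: integers from 0 up to (not including) 100.
arr = np.random.randint(0, 100, size=(1000, 100))

# Build the labels col_1 ... col_100 with "col_" + str(i).
col_names = ["col_" + str(i) for i in range(1, 101)]

df = pd.DataFrame(arr, columns=col_names)
print(df.columns[:3].tolist(), df.columns[-1])
```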

Accessing Attributes/Properties and Methods of DataFrame

Q1. Create a dictionary of series, assign it to a variable name data

Q2. Print the shape, columns, data types, axes and values of Q1

Q3. Check the DataFrame information

Q4. Output the last 2 rows of the DataFrame

### Working with Tabular Data

Remember
Getting data in to pandas from many different file formats or data sources is supported by read_* functions.
Exporting data out of pandas is provided by different to_* methods.
The head/tail/info methods and the dtypes attribute are convenient for a first check.
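
As a rough illustration of the read_*/to_* pairing, the sketch below round-trips a tiny made-up DataFrame through CSV text. An in-memory `io.StringIO` buffer stands in for a file on disk, which is an assumption made only to keep the example self-contained.

```python
import io

import pandas as pd

df = pd.DataFrame({"day": ["1/1/2019", "2/1/2019"], "temperature": [13, 11]})

# With no path, to_csv returns the CSV text; index=False omits the row labels.
csv_text = df.to_csv(index=False)

# read_csv accepts any file-like object, so StringIO can replace a file path.
df_back = pd.read_csv(io.StringIO(csv_text))

# A quick first check of what was loaded.
print(df_back.head())
print(df_back.dtypes)
```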

Q1. Create an array using random.randint. choose the range and the size should be (1000,100)

Q2. Write Q1 dataframe to csv with the name new_data.csv without index

Q3. Write the Q1 DataFrame to .xlsx with sheet_name = new_data.xlsx, without the index

Q4. Read the data in temp/iris.csv. Use head(), tail(), info() and dtypes

The iris data set is widely used as a beginner's dataset for machine learning purposes.

##### Non-Visual Data Analysis using Pandas (Statistical Analysis)
groupby provides the power of the split-apply-combine pattern.
value_counts is a convenient shortcut to count the number of entries in each category of a variable.
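
A small sketch of both ideas on toy data. The values are made up; only the column names imitate the iris dataset used in this section.

```python
import pandas as pd

# Toy data: made-up measurements with iris-style column names.
df = pd.DataFrame({
    "Species": ["setosa", "setosa", "virginica", "virginica", "virginica"],
    "PetalWidthCm": [0.2, 0.4, 2.0, 2.2, 1.8],
})

# value_counts: number of rows in each category.
counts = df["Species"].value_counts()

# groupby: split the rows by Species, apply mean() to each group, combine the results.
mean_width = df.groupby("Species")["PetalWidthCm"].mean()

print(counts)
print(mean_width)
```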

Q5. print the columns in Q3

Q6. Sum each column, then output the min, max, mean, median, var and std

count(), nunique(), unique() and value_counts() for categorical column

Q7. Use the above for analysing the data in Q3

Q8. Describe the data, including the object columns (include='object')

corr(), skew() and kurt()

df.agg(
    {
        "SepalLengthCm" : ["min", "max", "median", "count"],
        "PetalWidthCm" : ["min", "max", "mean", "count"],
        "Species" : ["count"]
    }
)

##### Accessing Data in a DataFrame using Indexing and Slicing in Pandas DataFrame
You can assign new values to a selection based on loc/iloc.
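
A minimal sketch of assigning through loc and iloc, on a made-up three-row frame (not the weather dataset itself):

```python
import pandas as pd

df = pd.DataFrame({"temperature": [32, 35, 28], "event": ["Rain", "Sunny", "Snow"]})

# loc selects by label or boolean mask: relabel the event where temperature exceeds 30.
df.loc[df["temperature"] > 30, "event"] = "Hot"

# iloc selects by integer position: overwrite the value in row 2, column 0.
df.iloc[2, 0] = 29

print(df)
```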

Reading .csv File - Weather Dataset

Q1. Read the weather Dataset(nyc_weather.csv)

Q2. Access the data using head(), tail(), info, columns and shape

Q3. Check for the min and max Temperature

Q4. Select multiple columns (Temperature, DewPoint and Humidity)

Q5. Filter the rows using slicing: rows 1 to 30

Q6. Check which values in the Temperature column are greater than 30

Q7. Show the full DataFrame rows for which the Q6 condition is True

Q8. Using isin: check which rows of the EST column contain '1/10/2016', '1/16/2016' or '1/2/2016'

Q9. Use loc and iloc to print row 30

Q10. Show EST and DewPoint for Q9
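
The filtering techniques asked for above can be sketched on a toy stand-in for the weather file. The values are made up and only the column names follow the questions; treat this as an illustration, not as the solution for nyc_weather.csv.

```python
import pandas as pd

# Toy stand-in for the weather data (made-up values).
df = pd.DataFrame({
    "EST": ["1/1/2016", "1/2/2016", "1/3/2016", "1/4/2016"],
    "Temperature": [38, 36, 40, 25],
    "DewPoint": [23, 18, 21, 9],
})

mask = df["Temperature"] > 30      # boolean Series, one entry per row
warm = df[mask]                    # keeps only the rows where the mask is True
picked = df[df["EST"].isin(["1/2/2016", "1/4/2016"])]

# loc works with labels, iloc with integer positions.
by_label = df.loc[3, ["EST", "DewPoint"]]
by_position = df.iloc[3, [0, 2]]

# A label slice includes its end point; a position slice excludes it.
slice_loc = df.loc[0:2]
slice_iloc = df.iloc[0:2]
```

With the default RangeIndex, labels and positions coincide, which is why loc and iloc pick the same row here; they differ as soon as the index is something else, such as dates.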

Reading .xlsx File - Retail Store Sales Data

Q1. Read retail_store_sales.xlsx file after importing the necessary module

Q2. What comes to your mind immediately after looking at the dataset?

Answer the following:
How many sales records do we have in the dataset?
How many customers do we have?
What is the date range of data?
Which country recorded maximum sales count?
What is the minimum order amount and maximum order amount?
How many orders for each customer?
What is the revenue contributed by each customer?
What is the revenue generated each year?
Which customer contributed to the maximum revenue each year and how much?
Are there more orders placed on weekends?
How many customers churned (i.e. Customers not making any purchases for more than or equal to 2 months)?

NB:
Keep in mind that, as data analysts, we should first be able to ask the right questions. Answering them can then be done with the help of the Pandas module.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   Invoice No    541909 non-null  object        
 1    Stock-Code   541909 non-null  object        
 2   Description   540455 non-null  object        
 3   Quantity      541909 non-null  int64         
 4   Invoice Date  541909 non-null  datetime64[ns]
 5   Unit Price    541909 non-null  float64       
 6   Customer ID   406829 non-null  float64       
 7   Country       541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


In [8]:
print("Total Sales Record:", df.shape[0])
print("Total Customers:", df['Customer ID'].nunique())
print("Date Range:", df['Invoice Date'].min(), "to", df['Invoice Date'].max())

Total Sales Record: 541909
Total Customers: 4372
Date Range: 2010-12-01 08:26:00 to 2011-12-09 12:50:00


In [9]:
# checking all the unique countries
df['Country'].unique()

array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
       'Italy', 'Belgium', 'Lithuania', 'Japan', 'Iceland',
       'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria',
       'Israel', 'Finland', 'Bahrain', 'Greece', 'Hong Kong', 'Singapore',
       'Lebanon', 'United Arab Emirates', 'Saudi Arabia',
       'Czech Republic', 'Canada', 'Unspecified', 'Brazil', 'USA',
       'European Community', 'Malta', 'RSA'], dtype=object)

In [10]:
# Countries with total number of sales record

df['Country'].value_counts()

Country
United Kingdom          495478
Germany                   9495
France                    8557
EIRE                      8196
Spain                     2533
Netherlands               2371
Belgium                   2069
Switzerland               2002
Portugal                  1519
Australia                 1259
Norway                    1086
Italy                      803
Channel Islands            758
Finland                    695
Cyprus                     622
Sweden                     462
Unspecified                446
Austria                    401
Denmark                    389
Japan                      358
Poland                     341
Israel                     297
USA                        291
Hong Kong                  288
Singapore                  229
Iceland                    182
Canada                     151
Greece                     146
Malta                      127
United Arab Emirates        68
European Community          61
RSA                         58


Renaming Columns
Syntax to rename columns
df.rename(index=None, columns=None)

The rename() function can be used for both row labels and column labels. Pass a dictionary whose keys are the current names and whose values are the new names.

In [11]:
df.columns

Index(['Invoice No', ' Stock-Code ', 'Description', 'Quantity', 'Invoice Date',
       'Unit Price', 'Customer ID', 'Country'],
      dtype='object')

In [12]:
df_renamed = df.rename(columns={'Description': 'Product Description', 'Customer ID': 'Cust ID'})

df_renamed.columns

Index(['Invoice No', ' Stock-Code ', 'Product Description', 'Quantity',
       'Invoice Date', 'Unit Price', 'Cust ID', 'Country'],
      dtype='object')

In [13]:
df_renamed.head()

Unnamed: 0,Invoice No,Stock-Code,Product Description,Quantity,Invoice Date,Unit Price,Cust ID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


A very common column renaming strategy
Let's clean up the column names with the following operations:

Strip leading and trailing spaces
Convert to lower case
Replace spaces and special characters such as '-' with underscores

The benefit is that we can then access the columns of the dataframe using dot notation, the same way we access the attributes of a Python object. For example, once the cleaned names are assigned to the dataframe, the Invoice No column can be accessed as df_renamed.invoice_no

In [20]:
col_names = [ col.strip().lower().replace(' ', '_').replace('-', '_') for col in df_renamed ]

print(col_names)

['invoice_no', 'stock_code', 'product_description', 'quantity', 'invoice_date', 'unit_price', 'cust_id', 'country']


Modifying Columns DataType
Modifying the DataType using DataFrame.astype()
We can pass any Python, Numpy, or Pandas datatype to change all columns of a Dataframe to that type, or we can pass a dictionary having column names as keys and datatype as values to change the type of selected columns.

Modifying the DataType using DataFrame.apply()
We can pass pandas.to_numeric, pandas.to_datetime, or pandas.to_timedelta to the apply() function to convert one or more columns to numeric, datetime, or timedelta type respectively.

Modifying the DataType using DataFrame.astype()

In [21]:
df_renamed.columns

Index(['Invoice No', ' Stock-Code ', 'Product Description', 'Quantity',
       'Invoice Date', 'Unit Price', 'Cust ID', 'Country'],
      dtype='object')

In [22]:
df_renamed.columns = col_names
df_renamed.columns

Index(['invoice_no', 'stock_code', 'product_description', 'quantity',
       'invoice_date', 'unit_price', 'cust_id', 'country'],
      dtype='object')

In [24]:
# converting all columns to string type
df_renamed = df_renamed.astype(str)
df_renamed.dtypes

invoice_no             object
stock_code             object
product_description    object
quantity               object
invoice_date           object
unit_price             object
cust_id                object
country                object
dtype: object

In [25]:
df_renamed[['quantity', 'unit_price', 'cust_id']] = df_renamed[['quantity', 'unit_price', 'cust_id']].astype(float)

df_renamed.dtypes

invoice_no              object
stock_code              object
product_description     object
quantity               float64
invoice_date            object
unit_price             float64
cust_id                float64
country                 object
dtype: object

In [26]:
# using dictionary to convert specific columns
convert_dict = {'quantity': int,
                'country': str
                }
 
df_renamed = df_renamed.astype(convert_dict)

df_renamed.dtypes

invoice_no              object
stock_code              object
product_description     object
quantity                 int64
invoice_date            object
unit_price             float64
cust_id                float64
country                 object
dtype: object

Modifying the DataType using DataFrame.apply()

In [27]:
# using apply method to convert datatype
df_renamed['invoice_date'] = df_renamed['invoice_date'].apply(pd.to_datetime)
df_renamed.dtypes

invoice_no                     object
stock_code                     object
product_description            object
quantity                        int64
invoice_date           datetime64[ns]
unit_price                    float64
cust_id                       float64
country                        object
dtype: object
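
As a complement to the apply() cell above, here is a hedged sketch of converting several text columns at once. The toy values, including the unparseable "n/a", are made up; `errors="coerce"` is a real `pd.to_numeric` option that turns such entries into NaN instead of raising.

```python
import pandas as pd

df = pd.DataFrame({
    "quantity": ["6", "8", "n/a"],
    "unit_price": ["2.55", "3.39", "2.75"],
})

# DataFrame.apply passes each selected column (a Series) to pd.to_numeric;
# extra keyword arguments such as errors="coerce" are forwarded to it.
cols = ["quantity", "unit_price"]
df[cols] = df[cols].apply(pd.to_numeric, errors="coerce")

# For a single column, the converter can also be called on the Series directly.
dates = pd.Series(["2010-12-01 08:26:00", "2011-12-09 12:50:00"])
parsed = pd.to_datetime(dates)

print(df.dtypes)
print(parsed.dt.year.tolist())
```

Calling the converter on the whole Series, as in the last step, is generally faster than applying it value by value.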

Creating a Derived Column

- Creating an amount column by multiplying quantity and unit_price
- Think about how you would perform the same operation in NumPy.

In [28]:
df_renamed['amount'] = df_renamed['quantity']*df_renamed['unit_price']
df_renamed.head()

Unnamed: 0,invoice_no,stock_code,product_description,quantity,invoice_date,unit_price,cust_id,country,amount
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,15.3
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34


Remember
The calculation is again element-wise, so the * is applied to the values in each row. Other mathematical operators (+, -, *, /, …) and logical operators (<, >, ==, …) also work element-wise.
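
A short sketch of element-wise arithmetic and comparison on made-up numbers; the column names `is_bulk` and `bulk_and_cheap` are hypothetical and used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"quantity": [6, 8, 2], "unit_price": [2.5, 3.0, 10.0]})

# Arithmetic is applied row by row across the two columns.
df["amount"] = df["quantity"] * df["unit_price"]

# Comparisons are element-wise too and produce boolean columns.
df["is_bulk"] = df["quantity"] > 5

# Combine conditions with & (and) or | (or); the parentheses are required.
df["bulk_and_cheap"] = (df["quantity"] > 5) & (df["unit_price"] < 3)

print(df)
```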

#### Creating Columns using apply() function
Syntax for DataFrame
df.apply(function, axis=0)
Applies the function to each column by default.

Axis Parameter
The axis along which the function is applied. axis can be 0 or 'index', or 1 or 'columns'; the default is 0:

0 or 'index': apply the function to each column.
1 or 'columns': apply the function to each row.

Syntax for Series
series.apply(function)
Applies the function element-wise. Series.apply() takes no axis parameter, because a Series has only one axis.
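
The difference between the two axis values, and between DataFrame.apply and Series.apply, can be sketched on a tiny made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (default): the function receives each column as a Series.
col_range = df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series.
row_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Series.apply: the function receives one element at a time.
doubled = df["a"].apply(lambda x: x * 2)

print(col_range.tolist(), row_sum.tolist(), doubled.tolist())
```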

In [29]:
df_renamed.dtypes

invoice_no                     object
stock_code                     object
product_description            object
quantity                        int64
invoice_date           datetime64[ns]
unit_price                    float64
cust_id                       float64
country                        object
amount                        float64
dtype: object

In [30]:
# np.max function is applied column wise by default - i.e. axis=0

df_renamed.apply(np.max)

invoice_no                         C581569
stock_code                               m
product_description      wrongly sold sets
quantity                             80995
invoice_date           2011-12-09 12:50:00
unit_price                         38970.0
cust_id                            18287.0
country                        Unspecified
amount                            168469.6
dtype: object

In [31]:
# Apply a function on the complete column at once
df_renamed[['amount']].apply(np.mean)

amount    17.987795
dtype: float64

In [None]:
# There is a much better way of performing the above operation - df_renamed['amount'].mean()

df_renamed['amount'].mean()

In [32]:
# Apply a function to each element of the column. Returns a Series of the same length.

df_renamed['amount'].apply(np.mean)

0         15.30
1         20.34
2         22.00
3         20.34
4         20.34
          ...  
541904    10.20
541905    12.60
541906    16.60
541907    16.60
541908    14.85
Name: amount, Length: 541909, dtype: float64

In [36]:
# Creating new column using apply()
# Let's assume we have to create a column - new_amount
# new_amount = quantity * unit_price
# we already saw how to perform this using df['amount'] = df['quantity'] * df['unit_price']
# Let's do the same operation using apply() function now

df_renamed['new_amount'] = df_renamed.apply(lambda row: row['quantity'] * row['unit_price'], axis=1)

df_renamed.head()

Unnamed: 0,invoice_no,stock_code,product_description,quantity,invoice_date,unit_price,cust_id,country,amount,new_amount
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,15.3,15.3
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,22.0,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34,20.34


Deleting column(s) in DataFrame
Syntax 1 - Dropping columns by column name

# Drop two columns by passing their names
# inplace=True modifies the DataFrame itself instead of returning a new one
df.drop(['col1', 'col3'], axis=1, inplace=True)

Syntax 2 - Dropping a range of columns by name using loc[]

# Drop all columns from col2 through col4 (label slices include both ends)
df.drop(df.loc[:, 'col2':'col4'], inplace=True, axis=1)

Syntax 3 - Dropping columns by position

# Drop the three columns at positions 0, 4 and 2
df.drop(df.columns[[0, 4, 2]], axis=1, inplace=True)

Syntax 4 - Dropping a range of columns by position using iloc[]

# Drop the two columns at positions 1 and 2 (the end of the slice, 3, is excluded)
df.drop(df.iloc[:, 1:3], inplace=True, axis=1)

Syntax 5 - DataFrame.pop() method

# pop() deletes a single column at a time and returns it
df.pop("Col4")

In [37]:
df_renamed.columns

Index(['invoice_no', 'stock_code', 'product_description', 'quantity',
       'invoice_date', 'unit_price', 'cust_id', 'country', 'amount',
       'new_amount'],
      dtype='object')

In [40]:
# syntax 1
df_renamed.drop(['amount'], axis=1)

Unnamed: 0,invoice_no,stock_code,product_description,quantity,invoice_date,unit_price,cust_id,country,new_amount
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,15.30
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,22.00
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
...,...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France,10.20
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France,12.60
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France,16.60
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France,16.60


In [41]:
df_renamed.columns

Index(['invoice_no', 'stock_code', 'product_description', 'quantity',
       'invoice_date', 'unit_price', 'cust_id', 'country', 'amount',
       'new_amount'],
      dtype='object')

Observation
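Observe that the amount column has not been removed from the dataframe: drop() returns a new DataFrame and leaves the original unchanged. To make the change permanent, assign the result back to the variable or pass inplace=True.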
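
The same behaviour on a two-column toy frame, as a minimal sketch. Reassigning the result is shown here as an alternative to inplace=True; the variable names `trimmed` and `columns_after_plain_drop` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"quantity": [6, 8], "amount": [15.3, 22.0]})

# drop() returns a new DataFrame; df itself is left untouched.
trimmed = df.drop(columns=["amount"])
columns_after_plain_drop = list(df.columns)

# Reassigning the result makes the change stick.
df = df.drop(columns=["amount"])

print(columns_after_plain_drop)
print(list(df.columns))
```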

In [42]:
df_renamed.drop(['amount'], axis=1, inplace=True)

In [43]:
df_renamed.columns

Index(['invoice_no', 'stock_code', 'product_description', 'quantity',
       'invoice_date', 'unit_price', 'cust_id', 'country', 'new_amount'],
      dtype='object')

In [44]:
# Syntax 2

df_renamed.drop(df_renamed.loc[:, 'invoice_no':'invoice_date'], axis=1)

Unnamed: 0,unit_price,cust_id,country,new_amount
0,2.55,17850.0,United Kingdom,15.30
1,3.39,17850.0,United Kingdom,20.34
2,2.75,17850.0,United Kingdom,22.00
3,3.39,17850.0,United Kingdom,20.34
4,3.39,17850.0,United Kingdom,20.34
...,...,...,...,...
541904,0.85,12680.0,France,10.20
541905,2.10,12680.0,France,12.60
541906,4.15,12680.0,France,16.60
541907,4.15,12680.0,France,16.60


In [51]:
df_renamed.loc[[50000]]

Unnamed: 0,invoice_no,stock_code,product_description,quantity,invoice_date,unit_price,cust_id,country,new_amount
50000,540542,22639,SET OF 4 NAPKIN CHARMS HEARTS,6,2011-01-09 15:18:00,2.55,15107.0,United Kingdom,15.3


In [52]:
df_renamed.iloc[1:5]

Unnamed: 0,invoice_no,stock_code,product_description,quantity,invoice_date,unit_price,cust_id,country,new_amount
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34


In [53]:
# Syntax 3

df_renamed.drop(df_renamed.columns[[0, 4, 2]], axis=1)

Unnamed: 0,stock_code,quantity,unit_price,cust_id,country,new_amount
0,85123A,6,2.55,17850.0,United Kingdom,15.30
1,71053,6,3.39,17850.0,United Kingdom,20.34
2,84406B,8,2.75,17850.0,United Kingdom,22.00
3,84029G,6,3.39,17850.0,United Kingdom,20.34
4,84029E,6,3.39,17850.0,United Kingdom,20.34
...,...,...,...,...,...,...
541904,22613,12,0.85,12680.0,France,10.20
541905,22899,6,2.10,12680.0,France,12.60
541906,23254,4,4.15,12680.0,France,16.60
541907,23255,4,4.15,12680.0,France,16.60


In [54]:
df_renamed.columns

Index(['invoice_no', 'stock_code', 'product_description', 'quantity',
       'invoice_date', 'unit_price', 'cust_id', 'country', 'new_amount'],
      dtype='object')

Observation
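Observe that the columns have not been removed from the dataframe: drop() returns a new DataFrame and leaves the original unchanged. To make the change permanent, assign the result back to the variable or pass inplace=True.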

In [55]:
# Syntax 4

df_renamed.drop(df_renamed.iloc[:, 1:3], axis=1)

Unnamed: 0,invoice_no,quantity,invoice_date,unit_price,cust_id,country,new_amount
0,536365,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,15.30
1,536365,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
2,536365,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,22.00
3,536365,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
4,536365,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
...,...,...,...,...,...,...,...
541904,581587,12,2011-12-09 12:50:00,0.85,12680.0,France,10.20
541905,581587,6,2011-12-09 12:50:00,2.10,12680.0,France,12.60
541906,581587,4,2011-12-09 12:50:00,4.15,12680.0,France,16.60
541907,581587,4,2011-12-09 12:50:00,4.15,12680.0,France,16.60


In [57]:
# Syntax 5

df_renamed.pop("new_amount")

0         15.30
1         20.34
2         22.00
3         20.34
4         20.34
          ...  
541904    10.20
541905    12.60
541906    16.60
541907    16.60
541908    14.85
Name: new_amount, Length: 541909, dtype: float64

In [58]:
df_renamed.columns

Index(['invoice_no', 'stock_code', 'product_description', 'quantity',
       'invoice_date', 'unit_price', 'cust_id', 'country'],
      dtype='object')

Remember that DataFrame.pop("Col_Name"):

Removes a single column and returns it as a Series.
Modifies the dataframe in place, with no need for inplace=True.

Adding/Inserting Row(s)
Reading a .xlsx File - Weather Data

In [59]:
import pandas as pd
import numpy as np

In [60]:
df = pd.read_excel('weather_data.xlsx')
df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny


In [61]:
df.shape


(6, 4)

Insert Row(s) using Dictionary - pandas.concat()

Syntax 1 - Inserting a Single Row

# Create a new record using a dictionary
new_record = pd.DataFrame([{'day': '1/7/2017', 'temperature': 36, 'windspeed': 4, 'event': 'Sunny'}])

# Inserting the row at the end
df = pd.concat([df, new_record], ignore_index=True)

# Inserting the row at the top
df = pd.concat([new_record, df], ignore_index=True)

Syntax 2 - Insert multiple rows (i.e. a batch of data)

# Create a batch of records using a list of dictionaries
batch_records = pd.DataFrame([{'day': '1/8/2017', 'temperature': 30, 'windspeed': 3, 'event': 'Rain'}, {'day': '1/9/2017', 'temperature': 27, 'windspeed': 4, 'event': 'Snow'}])

# Inserting the rows at the end
df = pd.concat([df, batch_records], ignore_index=True)

# Inserting the rows at the top
df = pd.concat([batch_records, df], ignore_index=True)

In [62]:
# Create a new record using a dictionary
new_record = pd.DataFrame([{'day': '1/7/2017', 
                            'temperature': 36, 
                            'windspeed': 4, 
                            'event': 'Sunny'}])

# Inserting row at the end
df = pd.concat([df, new_record], ignore_index=True)

df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
6,1/7/2017,36,4,Sunny


In [63]:
# Create a batch of records using a list of dictionaries
batch_records = pd.DataFrame([{'day': '1/8/2017', 'temperature': 30, 'windspeed': 3, 'event': 'Rain'}, 
                              {'day': '1/9/2017', 'temperature': 27, 'windspeed': 4, 'event': 'Snow'}])

# Inserting row at the end
df = pd.concat([df, batch_records], ignore_index=True)

df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
6,1/7/2017,36,4,Sunny
7,1/8/2017,30,3,Rain
8,1/9/2017,27,4,Snow


Inserting a Row using List - .loc[] and .iloc[]
Adding a list to a Pandas DataFrame works a bit differently, since pd.concat() expects Series or DataFrame objects rather than plain lists. Instead, we use the loc accessor with a label that does not exist yet. With the default integer index, len(df) is the next unused label, so assigning to it creates a new row.

Syntax - Using DataFrame.loc[]

df.loc[len(df)] = ['1/12/2017', 28, 2, 'Rain']

Syntax - Using DataFrame.iloc[]
Generates an error: .iloc cannot enlarge its target object (i.e. .iloc can't be used to add new rows).
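
The contrast between the two accessors can be sketched on a two-row toy frame. The try/except is there only to capture the error for display; `error_message` is a hypothetical name.

```python
import pandas as pd

df = pd.DataFrame({"day": ["1/1/2017", "1/2/2017"], "temperature": [32, 35]})

# iloc addresses existing positions only, so writing past the end fails.
try:
    df.iloc[len(df)] = ["1/3/2017", 28]
    error_message = None
except IndexError as exc:
    error_message = str(exc)

# loc with a label that does not exist yet appends a new row.
df.loc[len(df)] = ["1/3/2017", 28]

print(error_message)
print(df)
```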

In [64]:
df.loc[len(df)] = ['1/12/2017', 28, 2, 'Rain']
df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
6,1/7/2017,36,4,Sunny
7,1/8/2017,30,3,Rain
8,1/9/2017,27,4,Snow
9,1/12/2017,28,2,Rain


Inserting a Row at a Specific Index of a DataFrame
Adding a row at a specific position is a bit different. As in the list example, we use the loc accessor. However, assigning to a label that already exists only overwrites that row. What we can do instead is assign to a fractional label that falls between the two rows where the new row should go, then sort the index and reset it.

For example, if the current labels run from 0 to 9 and we want the new row to end up at position 9, we can assign it to label 8.5. Let's see how this works:

Syntax - Inserting a row at a specific index

# Adding at row label 8.5
df.loc[8.5] = ['1/11/2017', 30, 3, 'Rain']

# sort index
df = df.sort_index().reset_index(drop=True)

df

In [65]:
df.loc[8.5] = ['1/10/2017', 30, 3, 'Rain']

#sort index
df = df.sort_index().reset_index(drop=True)

df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
6,1/7/2017,36,4,Sunny
7,1/8/2017,30,3,Rain
8,1/9/2017,27,4,Snow
9,1/10/2017,30,3,Rain


In [66]:
# Adding at row label 9.5
df.loc[9.5] = ['1/11/2017', 27, 1, 'Snow']

#sort index
df = df.sort_index().reset_index(drop=True)

df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
6,1/7/2017,36,4,Sunny
7,1/8/2017,30,3,Rain
8,1/9/2017,27,4,Snow
9,1/10/2017,30,3,Rain
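
An alternative to the fractional-label approach, sketched under the assumption that the target position is known as an integer: split the frame with iloc and stitch the pieces back together with pd.concat. This is not the method used in the cells above, and the `position` variable and toy rows are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["1/1/2017", "1/2/2017", "1/4/2017"],
    "temperature": [32, 35, 24],
})
new_row = pd.DataFrame([{"day": "1/3/2017", "temperature": 28}])

position = 2  # hypothetical position at which the new row should appear

# Rows before the position, then the new row, then the remaining rows.
df = pd.concat(
    [df.iloc[:position], new_row, df.iloc[position:]],
    ignore_index=True,  # renumber the index 0..n-1 in one step
)

print(df)
```

This keeps the index as integers throughout, so no sort_index() step is needed.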


Saving DataFrame to .xlsx

In [67]:
df.to_excel('temp/updated_weather_data.xlsx', sheet_name='weather_data')

### Handling TimeSeries Data

Question: How to handle time series data?
Answer: pandas has great support for time series and has an extensive set of tools for working with dates, times, and time-indexed data.

Remember
Valid date strings can be converted to datetime objects using the to_datetime function or as part of the read functions.
Datetime columns (whose elements are pandas.Timestamp objects) support calculations, logical operations and convenient date-related properties through the dt accessor, such as year, month, day, day_of_week, day_of_year and is_leap_year. The ISO week number is available through dt.isocalendar().week.
We can also call datetime methods through the dt accessor, such as day_name() and month_name().
pandas.Timedelta represents a duration, the difference between two dates or times. Many properties of a timedelta column can be accessed through dt, such as components, days and seconds.
We can also call timedelta methods through the dt accessor, such as total_seconds().
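
The points above can be sketched on two made-up rows. The day-first format string is an assumption based on dates such as 16/06/2017 in the sales file, and the derived column names (`order_month`, `delivery_time`, `delivery_days`) are hypothetical.

```python
import pandas as pd

# Made-up rows in day/month/year format.
df = pd.DataFrame({
    "Order Date": ["08/11/2017", "12/06/2017"],
    "Ship Date": ["11/11/2017", "16/06/2017"],
})

# An explicit format avoids day/month ambiguity when parsing.
for col in ["Order Date", "Ship Date"]:
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y")

# Properties and methods through the dt accessor.
df["order_month"] = df["Order Date"].dt.month
df["order_day_name"] = df["Order Date"].dt.day_name()

# Subtracting two datetime columns gives a timedelta column.
df["delivery_time"] = df["Ship Date"] - df["Order Date"]
df["delivery_days"] = df["delivery_time"].dt.days

print(df)
```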

Reading .csv File - Online Store Sales Data

In [2]:
import pandas as pd
df = pd.read_csv('online_store_sales.csv')
df.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales
0,1,CA-2017-152156,08/11/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96
1,2,CA-2017-152156,08/11/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94
2,3,CA-2017-138688,12/06/2017,16/06/2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62
3,4,US-2016-108966,11/10/2016,18/10/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,5,US-2016-108966,11/10/2016,18/10/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368


In [69]:
df.shape

(9800, 18)

In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Row ID         9800 non-null   int64  
 1   Order ID       9800 non-null   object 
 2   Order Date     9800 non-null   object 
 3   Ship Date      9800 non-null   object 
 4   Ship Mode      9800 non-null   object 
 5   Customer ID    9800 non-null   object 
 6   Customer Name  9800 non-null   object 
 7   Segment        9800 non-null   object 
 8   Country        9800 non-null   object 
 9   City           9800 non-null   object 
 10  State          9800 non-null   object 
 11  Postal Code    9789 non-null   float64
 12  Region         9800 non-null   object 
 13  Product ID     9800 non-null   object 
 14  Category       9800 non-null   object 
 15  Sub-Category   9800 non-null   object 
 16  Product Name   9800 non-null   object 
 17  Sales          9800 non-null   float64
dtypes: float

What comes to my mind immediately after looking at the dataset?

What are the different customer segments?
How many sales records do we have in the dataset?
Which region recorded the maximum number of sales?
What are the different product categories?
What are the minimum and maximum order amounts?
What is the revenue generated in the year 2017?
Which customer contributed the maximum revenue in 2017, and how much?
Which product category is doing best? (revenue and count)
Are more orders placed on weekends?
How many days, on average, does it take for products to get shipped?

As data analysts, we should first be able to ask the right questions. Pandas can then help us answer them. We will learn later how to answer each of these questions. For now, let's understand how to create new columns derived from the existing columns.

In [73]:
pd.to_datetime(df['Ship Date'], dayfirst=True)

0      2017-11-11
1      2017-11-11
2      2017-06-16
3      2016-10-18
4      2016-10-18
          ...    
9795   2017-05-28
9796   2016-01-17
9797   2016-01-17
9798   2016-01-17
9799   2016-01-17
Name: Ship Date, Length: 9800, dtype: datetime64[ns]

In [74]:
pd.to_datetime(df['Ship Date'], format='%d/%m/%Y')

0      2017-11-11
1      2017-11-11
2      2017-06-16
3      2016-10-18
4      2016-10-18
          ...    
9795   2017-05-28
9796   2016-01-17
9797   2016-01-17
9798   2016-01-17
9799   2016-01-17
Name: Ship Date, Length: 9800, dtype: datetime64[ns]

In [75]:
df['Ship Date'] = pd.to_datetime(df['Ship Date'], format="%d/%m/%Y")
df['Order Date'] = pd.to_datetime(df['Order Date'], format="%d/%m/%Y")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Row ID         9800 non-null   int64         
 1   Order ID       9800 non-null   object        
 2   Order Date     9800 non-null   datetime64[ns]
 3   Ship Date      9800 non-null   datetime64[ns]
 4   Ship Mode      9800 non-null   object        
 5   Customer ID    9800 non-null   object        
 6   Customer Name  9800 non-null   object        
 7   Segment        9800 non-null   object        
 8   Country        9800 non-null   object        
 9   City           9800 non-null   object        
 10  State          9800 non-null   object        
 11  Postal Code    9789 non-null   float64       
 12  Region         9800 non-null   object        
 13  Product ID     9800 non-null   object        
 14  Category       9800 non-null   object        
 15  Sub-Category   9800 n

Initially, the values in Order Date and Ship Date were character strings, which do not support any datetime operations (e.g. extracting the year or the day of the week). By applying the to_datetime function, pandas interprets the strings and converts them to datetime (i.e. datetime64[ns]) objects.

Important Note
As many data sets contain datetime information in one of the columns, pandas input functions can convert it to dates while reading the data. With pandas.read_csv(), pass the parse_dates parameter a list of the columns to read as Timestamp (pandas.read_json() offers the convert_dates parameter for the same purpose):
pd.read_csv(PATH, parse_dates=["cols"])

Remember the warnings raised while parsing dates?
You can avoid them by passing one of two parameters: dayfirst=True or date_format (the latter is available from pandas 2.0).

In [3]:
df = pd.read_csv('online_store_sales.csv', parse_dates=["Order Date", "Ship Date"], dayfirst=True)

df.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales
0,1,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96
1,2,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94
2,3,CA-2017-138688,2017-06-12,2017-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62
3,4,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,5,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368
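
The two options mentioned above, dayfirst=True and date_format, can be compared side by side. The sketch below reads a tiny inline CSV (a hypothetical stand-in for the sales file) and assumes pandas 2.0 or newer, since date_format is not available in older versions:

```python
import io
import pandas as pd

# Two-row stand-in for the CSV file (hypothetical inline data)
csv_text = (
    "Order Date,Ship Date,Sales\n"
    "08/11/2017,11/11/2017,261.96\n"
    "12/06/2017,16/06/2017,14.62\n"
)

# Option 1: tell the parser that the day comes first
by_dayfirst = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Order Date", "Ship Date"],
    dayfirst=True,
)

# Option 2 (pandas 2.0+): state the exact format of the date strings
by_format = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["Order Date", "Ship Date"],
    date_format="%d/%m/%Y",
)

# Both readings should agree on the parsed month of each order
print(by_dayfirst["Order Date"].dt.month.tolist())
print(by_format["Order Date"].dt.month.tolist())
```

Stating the format explicitly is the stricter of the two choices: every value has to match the pattern, whereas dayfirst=True only guides how ambiguous strings are interpreted.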


In [5]:
col_names = [ col.strip().lower().replace(' ', '_').replace('-', '_') for col in df.columns ]

df.columns = col_names

df.columns

Index(['row_id', 'order_id', 'order_date', 'ship_date', 'ship_mode',
       'customer_id', 'customer_name', 'segment', 'country', 'city', 'state',
       'postal_code', 'region', 'product_id', 'category', 'sub_category',
       'product_name', 'sales'],
      dtype='object')
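
The same clean-up can also be written with the vectorized string methods on the column index instead of a list comprehension. This is a sketch on an empty frame with made-up column names (the `raw` name is an assumption for illustration):

```python
import pandas as pd

# Empty frame whose column names mimic the raw headers
raw = pd.DataFrame(columns=["Row ID", "Order Date", "Sub-Category", " Sales "])

# Chain the str methods: trim, lowercase, then replace spaces and hyphens
raw.columns = (
    raw.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
    .str.replace("-", "_", regex=False)
)

print(list(raw.columns))
```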

In [78]:
df['order_date'].min()

Timestamp('2015-01-03 00:00:00')

In [80]:
print("Orders starting from", df['order_date'].min(), "till", df['order_date'].max())

Orders starting from 2015-01-03 00:00:00 till 2018-12-30 00:00:00


In [81]:
df['order_date'].max() - df['order_date'].min()

Timedelta('1457 days 00:00:00')

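Working with DateTime in Pandas
The snippets below assume a DataFrame with a datetime column named DoB (date of birth); the sales dataset has no such column.
Get year, month, and day
df['year'] = df['DoB'].dt.year
df['month'] = df['DoB'].dt.month
df['day'] = df['DoB'].dt.day
Get the week of year, the day of week and leap year
# dt.week was removed in pandas 2.0; isocalendar() returns the ISO year, week, and day
df['week_of_year'] = df['DoB'].dt.isocalendar().week
df['day_of_week'] = df['DoB'].dt.dayofweek
df['is_leap_year'] = df['DoB'].dt.is_leap_year

# Map the weekday number (Monday=0 ... Sunday=6) to its name
dw_mapping = {
    0: 'Monday',
    1: 'Tuesday',
    2: 'Wednesday',
    3: 'Thursday',
    4: 'Friday',
    5: 'Saturday',
    6: 'Sunday'
}
df['day_of_week_name'] = df['DoB'].dt.weekday.map(dw_mapping)
Get the age from the date of birth
# Year difference only: this overstates the age by one until the birthday has passed
today = pd.to_datetime('today')
df['age'] = today.year - df['DoB'].dt.year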
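
Subtracting the years alone counts a birthday that has not happened yet this year. A possible refinement is sketched below on a made-up `people` frame, with a fixed reference date so the result is reproducible (both the frame and the reference date are assumptions for illustration):

```python
import pandas as pd

# Hypothetical table with a datetime DoB column
people = pd.DataFrame(
    {"DoB": pd.to_datetime(["1990-03-15", "2000-12-31", "1984-02-29"])}
)

# ISO week number and weekday name, without a manual mapping
people["week_of_year"] = people["DoB"].dt.isocalendar().week
people["day_of_week_name"] = people["DoB"].dt.day_name()

# Fixed reference date instead of 'today', so the output does not change over time
ref = pd.Timestamp("2023-12-30")

# True where the birthday has already occurred in the reference year
had_birthday = (people["DoB"].dt.month < ref.month) | (
    (people["DoB"].dt.month == ref.month) & (people["DoB"].dt.day <= ref.day)
)

# Age in completed years: subtract one more year if the birthday is still ahead
people["age"] = ref.year - people["DoB"].dt.year - (~had_birthday).astype(int)

print(people)
```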

In [10]:
today = pd.to_datetime('today')
print(today)

2023-12-30 11:19:49.002854


In [6]:
df['order_date'].dt.year

0       2017
1       2017
2       2017
3       2016
4       2016
        ... 
9795    2017
9796    2016
9797    2016
9798    2016
9799    2016
Name: order_date, Length: 9800, dtype: int32

In [9]:
today = pd.to_datetime('today')
today.year - df['order_date'].dt.year

0       6
1       6
2       6
3       7
4       7
       ..
9795    6
9796    7
9797    7
9798    7
9799    7
Name: order_date, Length: 9800, dtype: int32

In [13]:
df['order_date'].dt.day_name()

0       Wednesday
1       Wednesday
2          Monday
3         Tuesday
4         Tuesday
          ...    
9795       Sunday
9796      Tuesday
9797      Tuesday
9798      Tuesday
9799      Tuesday
Name: order_date, Length: 9800, dtype: object

In [14]:
df['order_date'].dt.month_name()

0       November
1       November
2           June
3        October
4        October
          ...   
9795         May
9796     January
9797     January
9798     January
9799     January
Name: order_date, Length: 9800, dtype: object

Creating a Column containing only the Order Month
Using Timestamp objects for dates gives access to many time-related properties in pandas, such as the month, year, and quarter. These are available through the dt accessor: year, month, day, day_of_week, day_of_year, is_leap_year, and so on. Methods such as day_name() and month_name() are available through the same accessor.

In [15]:
df['order_month'] = df['order_date'].dt.month

df.head()

Unnamed: 0,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales,order_month
0,1,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,11
1,2,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,11
2,3,CA-2017-138688,2017-06-12,2017-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,6
3,4,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,10
4,5,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,10


Calculating Delivery Time from Order Date and Ship Date

In [16]:
df['delivery_time'] = df['ship_date'] - df['order_date']

df.head()

Unnamed: 0,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales,order_month,delivery_time
0,1,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,11,3 days
1,2,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,11,3 days
2,3,CA-2017-138688,2017-06-12,2017-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,6,4 days
3,4,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,10,7 days
4,5,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,10,7 days
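
The delivery_time column holds timedelta values, so the dt accessor can turn it into plain numbers, and the column can be averaged directly. A minimal sketch on three sample rows follows (the `sample` frame and the delivery_days column name are assumptions for illustration):

```python
import pandas as pd

# Three sample orders with already-parsed dates
sample = pd.DataFrame({
    "order_date": pd.to_datetime(["2017-11-08", "2017-06-12", "2016-10-11"]),
    "ship_date": pd.to_datetime(["2017-11-11", "2017-06-16", "2016-10-18"]),
})

# Difference of two datetime columns is a timedelta column
sample["delivery_time"] = sample["ship_date"] - sample["order_date"]

# Whole days as integers, convenient for further arithmetic or plotting
sample["delivery_days"] = sample["delivery_time"].dt.days

# The mean of a timedelta column is itself a Timedelta
mean_delivery = sample["delivery_time"].mean()

print(sample["delivery_days"].tolist())
print(mean_delivery)
```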


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype          
---  ------         --------------  -----          
 0   row_id         9800 non-null   int64          
 1   order_id       9800 non-null   object         
 2   order_date     9800 non-null   datetime64[ns] 
 3   ship_date      9800 non-null   datetime64[ns] 
 4   ship_mode      9800 non-null   object         
 5   customer_id    9800 non-null   object         
 6   customer_name  9800 non-null   object         
 7   segment        9800 non-null   object         
 8   country        9800 non-null   object         
 9   city           9800 non-null   object         
 10  state          9800 non-null   object         
 11  postal_code    9789 non-null   float64        
 12  region         9800 non-null   object         
 13  product_id     9800 non-null   object         
 14  category       9800 non-null   object         
 15  sub_