# Module 03 - Data Wrangling with Pandas

---

#### <a href="linkedin.com/in/tasmim-rahman-adib-403074221">Tasmim Rahman Adib</a>

# Lecture 4.1 - Introduction to Data Analysis & Manipulation with Pandas
## Agenda
- Introduction
- Getting Started
- Creating Series
- Creating Dataframes
- Importing Data from Different Sources

## 4.1.1 Introduction
### What is Pandas?
The Pandas library is built on NumPy and provides easy-to-use data structures and data analysis tools for the Python programming language.

### Pandas Series
A **one-dimensional** labeled array capable of holding any data type.

### Pandas DataFrame
A **two-dimensional** labeled data structure with columns of potentially different types.
![Pandas](../img/pandas.png)
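
As a minimal sketch of the two definitions above, the following builds a small DataFrame whose columns have different types and pulls one column out as a Series. The `people` table and its values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample values, used only to illustrate mixed column types.
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],
    "age": [31, 45, 27],
    "height_m": [1.62, 1.80, 1.75],
})

# Each column keeps its own dtype.
print(people.dtypes)

# Selecting a single column returns a Series that shares the row labels.
ages = people["age"]
print(type(ages).__name__)
print(ages.index.equals(people.index))
```

Each column of a DataFrame behaves like a Series, which is why the Series operations shown later also apply to individual columns.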

### Advantages of Pandas
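- Clear tabular representation of data with labeled rows and columns
- Concise syntax: common tasks take only a few lines of code
- An extensive set of features for cleaning, reshaping, and summarizing data
- Efficient handling of large in-memory datasets
- Flexible relabeling, reshaping, and customization of data
- Made for Python and built on NumPy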




## 4.1.2 Getting Started

In [1]:
# install pandas
!pip install pandas

Collecting pandas
  Using cached pandas-2.2.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 MB)
Collecting tzdata>=2022.7
  Using cached tzdata-2024.2-py2.py3-none-any.whl (346 kB)
Collecting pytz>=2020.1
  Downloading pytz-2024.2-py2.py3-none-any.whl (508 kB)
Collecting python-dateutil>=2.8.2
  Using cached python_dateutil-2.9.0.post0-py2.py3-none-any.whl (229 kB)
Collecting six>=1.5
  Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
Installing collected packages: pytz, tzdata, six, python-dateutil, pandas
Successfully installed pandas-2.2.3 python-dateutil-2.9.0.post0 pytz-2024.2 six-1.16.0 tzdata-2024.2


In [2]:
# Conventional  way to import pandas 
import pandas as pd 

In [3]:
# Check pandas version
pd.__version__

'2.2.3'

In [40]:
# Show version of all packages 
pd.show_versions()


INSTALLED VERSIONS
------------------
commit                : 0691c5cf90477d3503834d983f69350f250a6ff7
python                : 3.10.12
python-bits           : 64
OS                    : Linux
OS-release            : 6.8.0-45-generic
Version               : #45~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Sep 11 15:25:05 UTC 2
machine               : x86_64
processor             : x86_64
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 2.2.3
numpy                 : 2.1.2
pytz                  : 2022.1
dateutil              : 2.9.0.post0
pip                   : 22.0.2
Cython                : None
sphinx                : None
IPython               : 8.28.0
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : 4.12.3
blosc                 : None
bottleneck            : None
dataframe-api-compat  : None
fastparquet           : None
fsspec                : None
...

## 4.1.3 Creating Series

In [6]:
# import numpy
import numpy as np

In [8]:
A = np.array([3, 4, 5])
A 

array([3, 4, 5])

In [9]:
type(A)

numpy.ndarray

In [10]:
# Create Series 
s1 = pd.Series([3, 6, 9, 12, 20, 21])
s1

0     3
1     6
2     9
3    12
4    20
5    21
dtype: int64

In [11]:
# Check type 
type(s1)

pandas.core.series.Series

In [12]:
# To see values 
s1.values

array([ 3,  6,  9, 12, 20, 21])

In [14]:
# To see index/keys 
s1.index

RangeIndex(start=0, stop=6, step=1)

In [15]:
# Creating labeled series 
s2 = pd.Series([200000, 300000, 4000000, 500000], index=['A', 'B', 'C', 'D'])

In [16]:
s2

A     200000
B     300000
C    4000000
D     500000
dtype: int64

In [17]:
s2.values

array([ 200000,  300000, 4000000,  500000])

In [18]:
s2.index

Index(['A', 'B', 'C', 'D'], dtype='object')

In [19]:
# Indexing
s2['A']

np.int64(200000)
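
The bracket lookup above selects by label. As a small sketch, assuming the same `s2` values as in the cell above, `.loc` makes label-based access explicit and `.iloc` selects by position:

```python
import pandas as pd

s2 = pd.Series([200000, 300000, 4000000, 500000], index=["A", "B", "C", "D"])

by_label = s2.loc["A"]          # label-based lookup, same result as s2["A"]
by_position = s2.iloc[0]        # position-based lookup (first element)
several = s2.loc[["A", "C"]]    # a list of labels returns a Series

print(by_label, by_position)
print(several)
```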

In [20]:
# Boolean indexing
s2[s2 > 700000]

C    4000000
dtype: int64
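
Boolean indexing works in two steps: the comparison produces a boolean Series, and that Series then selects the matching entries. The sketch below reuses the `s2` values from above; the thresholds are arbitrary and only illustrate how conditions combine.

```python
import pandas as pd

s2 = pd.Series([200000, 300000, 4000000, 500000], index=["A", "B", "C", "D"])

# The comparison alone yields a boolean Series with the same labels.
mask = s2 > 700000
print(mask)

# Conditions combine with & (and) and | (or); each needs its own parentheses.
mid_range = s2[(s2 >= 300000) & (s2 <= 500000)]
print(mid_range)
```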

## 4.1.4 Creating Dataframes

In [21]:
# Create a DataFrame 
data = {'Country': ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasília'],
        'Population': [11190846, 1303171035, 207847528]
}

df = pd.DataFrame(data, columns=["Country", "Capital", "Population"])

In [22]:
df

| | Country | Capital | Population |
|---|---|---|---|
| 0 | Belgium | Brussels | 11190846 |
| 1 | India | New Delhi | 1303171035 |
| 2 | Brazil | Brasília | 207847528 |


In [23]:
type(df)

pandas.core.frame.DataFrame

In [24]:
# Indexing
df["Country"]

0    Belgium
1      India
2     Brazil
Name: Country, dtype: object

In [25]:
# or 
df.Country

0    Belgium
1      India
2     Brazil
Name: Country, dtype: object

In [26]:
# type 
type(df["Country"])

pandas.core.series.Series

In [27]:
# type 
type(df["Capital"])

pandas.core.series.Series

In [28]:
# type 
type(df["Population"])

pandas.core.series.Series

In [29]:
# Boolean indexing 
df["Population"]  > 4000

0    True
1    True
2    True
Name: Population, dtype: bool

In [30]:
df["Country"] == "Belgium"

0     True
1    False
2    False
Name: Country, dtype: bool
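
The comparisons above return boolean Series. To filter rows, pass such a Series back into the DataFrame. This is a minimal sketch using the same `data` as earlier in this section; the population threshold is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Belgium", "India", "Brazil"],
    "Capital": ["Brussels", "New Delhi", "Brasília"],
    "Population": [11190846, 1303171035, 207847528],
})

# Keep only the rows where the mask is True.
large = df[df["Population"] > 100_000_000]
print(large)

# .loc accepts a row mask and a column label together.
capital = df.loc[df["Country"] == "Belgium", "Capital"]
print(capital)
```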

## 4.1.5 Importing Data from Different Sources

### 4.1.5.1 Read Csv

In [31]:
# read data from csv file 
diabetes = pd.read_csv("../data/diabetes.csv")

In [32]:
# type 
type(diabetes)

pandas.core.frame.DataFrame

In [33]:
# Examine first few rows 
diabetes.head() 

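| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |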


### 4.1.5.2 Read Excel Sheet

In [43]:
# read data from excel file 
cholesterol = pd.read_excel("../data/Cholesterol.xlsx")

In [44]:
type(cholesterol)

pandas.core.frame.DataFrame

In [45]:
cholesterol.head()

| | Subject | Before | After | Difference |
|---|---|---|---|---|
| 0 | 1 | 195 | 146 | 49 |
| 1 | 2 | 145 | 155 | −10 |
| 2 | 3 | 205 | 178 | 27 |
| 3 | 4 | 159 | 146 | 13 |
| 4 | 5 | 244 | 208 | 36 |


### 4.1.5.3 From URL

In [46]:
# read a dataset of pulse rate directly from a URL and store the results in a DataFrame 
pulse = pd.read_table('http://media.news.health.ufl.edu/misc/bolt/Intro/SPSS/OriginalData/pulse.txt')

In [48]:
pulse.head()

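| | Height | Weight | Age | Gender | Smokes | Alcohol | Exercise | Ran | Pulse1 | Pulse2 | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 173 | 57.0 | 18 | 2 | 2 | 1 | 2 | 2 | 86.0 | 88.0 | 93 |
| 1 | 179 | 58.0 | 19 | 2 | 2 | 1 | 2 | 1 | 82.0 | 150.0 | 93 |
| 2 | 167 | 62.0 | 18 | 2 | 2 | 1 | 1 | 1 | 96.0 | 176.0 | 93 |
| 3 | 195 | 84.0 | 18 | 1 | 2 | 1 | 1 | 2 | 71.0 | 73.0 | 93 |
| 4 | 173 | 64.0 | 18 | 2 | 2 | 1 | 3 | 2 | 90.0 | 88.0 | 93 |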
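
`pd.read_table` behaves like `pd.read_csv` with a tab as the default separator. The sketch below shows this offline; the `text` string is a hypothetical stand-in for a downloaded tab-separated file, not the actual contents of the URL above.

```python
import io
import pandas as pd

# Hypothetical tab-separated text standing in for a downloaded file.
text = "Height\tWeight\tAge\n173\t57.0\t18\n179\t58.0\t19\n"

from_table = pd.read_table(io.StringIO(text))
from_csv = pd.read_csv(io.StringIO(text), sep="\t")

print(from_table)
print(from_table.equals(from_csv))
```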


### 4.1.5.4 Read Biological Data (.txt)

In [61]:
# read text/csv data into pandas 
chrom = pd.read_csv("../data/dummy.txt", delimiter= ",")

In [62]:
chrom.head()

| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 29 | New York |
| 1 | Bob | 34 | Los Angeles |
| 2 | Charlie | 25 | Chicago |
| 3 | David | 22 | Houston |
| 4 | Eva | 31 | Phoenix |


### 4.1.5.5 Advanced Data Importing Techniques

In [63]:
df = pd.read_csv("../data/covid19.csv")
# examine first few rows 
df.head() 

| | SNo | ObservationDate | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 01/22/2020 | Anhui | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
| 1 | 2 | 01/22/2020 | Beijing | Mainland China | 1/22/2020 17:00 | 14.0 | 0.0 | 0.0 |
| 2 | 3 | 01/22/2020 | Chongqing | Mainland China | 1/22/2020 17:00 | 6.0 | 0.0 | 0.0 |
| 3 | 4 | 01/22/2020 | Fujian | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
| 4 | 5 | 01/22/2020 | Gansu | Mainland China | 1/22/2020 17:00 | 0.0 | 0.0 | 0.0 |


In [64]:
# Set index 
df = pd.read_csv("../data/covid19.csv", index_col= "Country/Region")
df.head() 

| Country/Region | SNo | ObservationDate | Province/State | Last Update | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|---|
| Mainland China | 1 | 01/22/2020 | Anhui | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
| Mainland China | 2 | 01/22/2020 | Beijing | 1/22/2020 17:00 | 14.0 | 0.0 | 0.0 |
| Mainland China | 3 | 01/22/2020 | Chongqing | 1/22/2020 17:00 | 6.0 | 0.0 | 0.0 |
| Mainland China | 4 | 01/22/2020 | Fujian | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
| Mainland China | 5 | 01/22/2020 | Gansu | 1/22/2020 17:00 | 0.0 | 0.0 | 0.0 |


In [65]:
# No header row: header=None reads the file's first line as data and assigns integer column labels
df = pd.read_csv("../data/covid19.csv", header=None)
df.head() 

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | SNo | ObservationDate | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
| 1 | 1 | 01/22/2020 | Anhui | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
| 2 | 2 | 01/22/2020 | Beijing | Mainland China | 1/22/2020 17:00 | 14.0 | 0.0 | 0.0 |
| 3 | 3 | 01/22/2020 | Chongqing | Mainland China | 1/22/2020 17:00 | 6.0 | 0.0 | 0.0 |
| 4 | 4 | 01/22/2020 | Fujian | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
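
The effect of `header=None` can be reproduced on a small in-memory file. In this sketch the `csv_text` string is a hypothetical three-column excerpt, not the real `covid19.csv`. It shows that the original header line ends up as the first data row, which also prevents numeric columns from being parsed as numbers.

```python
import io
import pandas as pd

# Hypothetical excerpt used in place of the real file.
csv_text = "SNo,Country,Confirmed\n1,Mainland China,1.0\n2,Mainland China,14.0\n"

# Default behaviour: the first line supplies the column names.
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None: the first line is treated as data; columns are labeled 0, 1, 2.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header.dtypes)
print(no_header.head())
```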


In [66]:
# Custom column names 
df = pd.read_csv("../data/covid19.csv", header = 0,
                 names= ["SL", "ObservationDate", "State", "Country", "Last Update", "Confirmed", "Deaths", "Recovered"])
df.head() 

| | SL | ObservationDate | State | Country | Last Update | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 01/22/2020 | Anhui | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
| 1 | 2 | 01/22/2020 | Beijing | Mainland China | 1/22/2020 17:00 | 14.0 | 0.0 | 0.0 |
| 2 | 3 | 01/22/2020 | Chongqing | Mainland China | 1/22/2020 17:00 | 6.0 | 0.0 | 0.0 |
| 3 | 4 | 01/22/2020 | Fujian | Mainland China | 1/22/2020 17:00 | 1.0 | 0.0 | 0.0 |
| 4 | 5 | 01/22/2020 | Gansu | Mainland China | 1/22/2020 17:00 | 0.0 | 0.0 | 0.0 |


In [67]:
# Use only selected columns 
df = pd.read_csv("../data/covid19.csv", usecols = ["Country/Region", "Confirmed", "Deaths", "Recovered"])
df.head() 

| | Country/Region | Confirmed | Deaths | Recovered |
|---|---|---|---|---|
| 0 | Mainland China | 1.0 | 0.0 | 0.0 |
| 1 | Mainland China | 14.0 | 0.0 | 0.0 |
| 2 | Mainland China | 6.0 | 0.0 | 0.0 |
| 3 | Mainland China | 1.0 | 0.0 | 0.0 |
| 4 | Mainland China | 0.0 | 0.0 | 0.0 |


In [68]:
# Set index and use selected columns 
df = pd.read_csv("../data/covid19.csv", index_col="Country/Region",
                 usecols=["Country/Region", "Confirmed", "Deaths", "Recovered"])
df.head() 

| Country/Region | Confirmed | Deaths | Recovered |
|---|---|---|---|
| Mainland China | 1.0 | 0.0 | 0.0 |
| Mainland China | 14.0 | 0.0 | 0.0 |
| Mainland China | 6.0 | 0.0 | 0.0 |
| Mainland China | 1.0 | 0.0 | 0.0 |
| Mainland China | 0.0 | 0.0 | 0.0 |


In [69]:
# exploring columns 
df.columns

Index(['Confirmed', 'Deaths', 'Recovered'], dtype='object')

In [70]:
# Customize columns 
df.columns = ["Confirmed Cases", "Deaths Cases", "Recovered Cases"]

In [71]:
df.columns

Index(['Confirmed Cases', 'Deaths Cases', 'Recovered Cases'], dtype='object')

In [72]:
# Set index name 
df.index.name = "Country"

In [73]:
df.head()

| Country | Confirmed Cases | Deaths Cases | Recovered Cases |
|---|---|---|---|
| Mainland China | 1.0 | 0.0 | 0.0 |
| Mainland China | 14.0 | 0.0 | 0.0 |
| Mainland China | 6.0 | 0.0 | 0.0 |
| Mainland China | 1.0 | 0.0 | 0.0 |
| Mainland China | 0.0 | 0.0 | 0.0 |
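
Assigning a list to `df.columns` relies on the list being in the same order as the existing columns. As an alternative sketch (not the approach used in the cells above), `rename` maps old names to new ones explicitly and `rename_axis` sets the index name. The `csv_text` string is a hypothetical excerpt standing in for `covid19.csv`.

```python
import io
import pandas as pd

# Hypothetical excerpt used in place of the real file.
csv_text = (
    "SNo,Country/Region,Confirmed,Deaths,Recovered\n"
    "1,Mainland China,1.0,0.0,0.0\n"
    "2,Mainland China,14.0,0.0,0.0\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    index_col="Country/Region",
    usecols=["Country/Region", "Confirmed", "Deaths", "Recovered"],
)

# Map each old column name to its new name; order does not matter here.
df = df.rename(columns={
    "Confirmed": "Confirmed Cases",
    "Deaths": "Deaths Cases",
    "Recovered": "Recovered Cases",
})

# Name the index, equivalent to setting df.index.name.
df = df.rename_axis("Country")
print(df)
```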


### 4.1.5.6 Importing and Manipulating Excel Files with pd.read_excel()

In [80]:
df = pd.read_excel("../data/PIDD.xlsx")

In [81]:
df.head()

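| | Pregnancies | Glucose | Blood pressure | Skin thickness | Insulin | Body mass index | Diabetes pedigree function | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |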


In [83]:
df = pd.read_excel("../data/PIDD.xlsx", index_col = 0, header = 0, 
                   names=['Pregnancies', 'Glucose', 'Blood pressure', 'Skin thickness', 'Insulin', 'Body mass index','Diabetes pedigree function','Age','Outcome'])
df.head() 

| Pregnancies | Glucose | Blood pressure | Skin thickness | Insulin | Body mass index | Diabetes pedigree function | Age | Outcome |
|---|---|---|---|---|---|---|---|---|
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |


1. **`index_col=0`**: The first column (0th column) of the Excel file is used as the index for the DataFrame.
   
2. **`header=0`**: The first row of the Excel file is treated as the header (column names) for the DataFrame.

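3. **`names=[...]`**: The `names` parameter supplies a custom list of column names for the DataFrame. Because `header=0` is also passed, the header row in the Excel file is replaced by these names rather than read as data. With `index_col=0`, the first name labels the index. The names are: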
   - `Pregnancies`
   - `Glucose`
   - `Blood pressure`
   - `Skin thickness`
   - `Insulin`
   - `Body mass index`
   - `Diabetes pedigree function`
   - `Age`
   - `Outcome`
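
The interaction of `header`, `names`, and `index_col` can be checked without an Excel file, since `pd.read_csv` accepts the same three parameters. The sketch below is an assumption-laden stand-in: `csv_text` and its short column names (`preg`, `glu`, `bp`) are hypothetical, and it uses `read_csv` rather than `read_excel`.

```python
import io
import pandas as pd

# Hypothetical data with short original column names.
csv_text = "preg,glu,bp\n6,148,72\n1,85,66\n"
new_names = ["Pregnancies", "Glucose", "Blood pressure"]

# header=0 discards the original header row and applies the new names.
renamed = pd.read_csv(io.StringIO(csv_text), header=0, names=new_names, index_col=0)
print(renamed)

# Without header=0, read_csv treats the original header row as data.
kept = pd.read_csv(io.StringIO(csv_text), names=new_names)
print(kept)
```

Passing `header=0` together with `names` is what keeps the old header line out of the data.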

In [85]:
df = pd.read_excel("../data/PIDD.xlsx", index_col=0,  header = 0, usecols = "A:D")
df.head() 

| Pregnancies | Glucose | Blood pressure | Skin thickness |
|---|---|---|---|
| 6 | 148 | 72 | 35 |
| 1 | 85 | 66 | 29 |
| 8 | 183 | 64 | 0 |
| 1 | 89 | 66 | 23 |
| 0 | 137 | 40 | 35 |


1. **`index_col=0`**: The first column (0th column) of the Excel file will be used as the index for the DataFrame.

2. **`header=0`**: The first row of the Excel file contains the column headers for the DataFrame.

3. **`usecols="A:D"`**: Only columns from "A" to "D" in the Excel file are read and included in the DataFrame.


In [86]:
df = pd.read_excel("../data/PIDD.xlsx", index_col=0,  header = 0, usecols = "A:D, F")
df.head() 

| Pregnancies | Glucose | Blood pressure | Skin thickness | Body mass index |
|---|---|---|---|---|
| 6 | 148 | 72 | 35 | 33.6 |
| 1 | 85 | 66 | 29 | 26.6 |
| 8 | 183 | 64 | 0 | 23.3 |
| 1 | 89 | 66 | 23 | 28.1 |
| 0 | 137 | 40 | 35 | 43.1 |


1. **`index_col=0`**: The first column (0th column) of the Excel file will be used as the index for the DataFrame.

2. **`header=0`**: The first row of the Excel file contains the column headers for the DataFrame.

3. **`usecols="A:D, F"`**: Only columns from "A" to "D" and "F" in the Excel file are read and included in the DataFrame.


In [88]:
df = pd.read_excel("../data/PIDD.xlsx", index_col=0,  header = 0, usecols = ":F")
df.head() 

| Pregnancies | Glucose | Blood pressure | Skin thickness | Insulin | Body mass index |
|---|---|---|---|---|---|
| 6 | 148 | 72 | 35 | 0 | 33.6 |
| 1 | 85 | 66 | 29 | 0 | 26.6 |
| 8 | 183 | 64 | 0 | 0 | 23.3 |
| 1 | 89 | 66 | 23 | 94 | 28.1 |
| 0 | 137 | 40 | 35 | 168 | 43.1 |


**`usecols=":F"`**: Only columns from "A" to "F" in the Excel file are read and included in the DataFrame.

In [3]:
df = pd.read_excel("../data/PIDD.xlsx", index_col=0,  header = 0, usecols = [0,3,4])
df.head() 

| Pregnancies | Skin thickness | Insulin |
|---|---|---|
| 6 | 35 | 0 |
| 1 | 29 | 0 |
| 8 | 0 | 0 |
| 1 | 23 | 94 |
| 0 | 35 | 168 |


1. **`index_col=0`**: The first column (0th column) of the Excel file will be used as the index for the DataFrame.

2. **`header=0`**: The first row of the Excel file contains the column headers for the DataFrame.

3. **`usecols=[0,3,4]`**: Only the columns at zero-based positions 0, 3, and 4 (Excel columns A, D, and E) are read and included in the DataFrame.


In [7]:
df = pd.read_excel("../data/PIDD.xlsx",  usecols = ["Age", "Insulin"])
df.head() 

| | Insulin | Age |
|---|---|---|
| 0 | 0 | 50 |
| 1 | 0 | 31 |
| 2 | 0 | 32 |
| 3 | 94 | 21 |
| 4 | 168 | 33 |
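
Note that the output above lists `Insulin` before `Age` even though `usecols` named `Age` first: the order of the `usecols` list is ignored and columns come back in file order. The sketch below reproduces this with `pd.read_csv`, which treats `usecols` the same way, on a hypothetical `csv_text` excerpt, and then reorders the result explicitly.

```python
import io
import pandas as pd

# Hypothetical excerpt; in the file, Insulin appears before Age.
csv_text = "Glucose,Insulin,BMI,Age\n148,0,33.6,50\n85,0,26.6,31\n"

subset = pd.read_csv(io.StringIO(csv_text), usecols=["Age", "Insulin"])
print(list(subset.columns))  # columns follow the file order

# Select with a list of names to get the order you want.
reordered = subset[["Age", "Insulin"]]
print(list(reordered.columns))
```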


### 4.1.5.7 Customizing Imports from Excel Files with Multiple Sheets using pd.read_excel()

In [25]:
# read the sheet_name "day1"
pd.read_excel("../data/covid19_multiple_sheets.xlsx", sheet_name = "day1")

| | Province/State | Country/Region | Last Update | Confirmed | Deaths | Recovered |
|---|---|---|---|---|---|---|
| 0 | Hubei | Mainland China | 3/13/2020 06:00 | 67786 | 3062 | 51553 |
| 1 | Guangdong | Mainland China | 3/13/2020 06:00 | 1356 | 8 | 1296 |
| 2 | Zhejiang | Mainland China | 3/13/2020 06:00 | 1215 | 1 | 1209 |
| 3 | Shandong | Mainland China | 3/13/2020 06:00 | 760 | 7 | 739 |
| 4 | Henan | Mainland China | 3/13/2020 06:00 | 1273 | 22 | 1249 |
| ... | ... | ... | ... | ... | ... | ... |
| 216 | NaN | Mongolia | 3/13/2020 06:00 | 1 | 0 | 0 |
| 217 | NaN | St. Barth | 3/13/2020 06:00 | 1 | 0 | 0 |
| 218 | NaN | St. Vincent Grenadines | 3/13/2020 06:00 | 1 | 0 | 0 |
| 219 | NaN | Togo | 3/13/2020 06:00 | 1 | 0 | 0 |


In [30]:
# skip rows 0 and 1 (the header row and the first data row); the next row becomes the header
pd.read_excel("../data/covid19_multiple_sheets.xls", sheet_name = "day1", skiprows= [0,1])

| | Guangdong | Mainland China | 3/13/2020 06:00 | 1356 | 8 | 1296 |
|---|---|---|---|---|---|---|
| 0 | Zhejiang | Mainland China | 3/13/2020 06:00 | 1215 | 1 | 1209 |
| 1 | Shandong | Mainland China | 3/13/2020 06:00 | 760 | 7 | 739 |
| 2 | Henan | Mainland China | 3/13/2020 06:00 | 1273 | 22 | 1249 |
| 3 | Anhui | Mainland China | 3/13/2020 06:00 | 990 | 6 | 984 |
| 4 | Jiangxi | Mainland China | 3/13/2020 06:00 | 935 | 1 | 934 |
| ... | ... | ... | ... | ... | ... | ... |
| 214 | NaN | Mongolia | 3/13/2020 06:00 | 1 | 0 | 0 |
| 215 | NaN | St. Barth | 3/13/2020 06:00 | 1 | 0 | 0 |
| 216 | NaN | St. Vincent Grenadines | 3/13/2020 06:00 | 1 | 0 | 0 |
| 217 | NaN | Togo | 3/13/2020 06:00 | 1 | 0 | 0 |


In [29]:
pd.read_excel("../data/covid19_multiple_sheets.xls", sheet_name = "day1", skiprows= 2, usecols= "A:C")

| | Guangdong | Mainland China | 3/13/2020 06:00 |
|---|---|---|---|
| 0 | Zhejiang | Mainland China | 3/13/2020 06:00 |
| 1 | Shandong | Mainland China | 3/13/2020 06:00 |
| 2 | Henan | Mainland China | 3/13/2020 06:00 |
| 3 | Anhui | Mainland China | 3/13/2020 06:00 |
| 4 | Jiangxi | Mainland China | 3/13/2020 06:00 |
| ... | ... | ... | ... |
| 214 | NaN | Mongolia | 3/13/2020 06:00 |
| 215 | NaN | St. Barth | 3/13/2020 06:00 |
| 216 | NaN | St. Vincent Grenadines | 3/13/2020 06:00 |
| 217 | NaN | Togo | 3/13/2020 06:00 |
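
In both outputs above the Guangdong row has become the header, because `skiprows` counts lines from the top of the sheet, header included. The sketch below illustrates this with `pd.read_csv`, which accepts the same `skiprows` argument, on a hypothetical `csv_text` excerpt. It also shows that `skiprows=[1]` drops the first data row while keeping the header.

```python
import io
import pandas as pd

# Hypothetical excerpt standing in for the "day1" sheet.
csv_text = "State,Confirmed\nHubei,67786\nGuangdong,1356\nZhejiang,1215\n"

# Skipping lines 0 and 1 removes the header, so the next line is used as one.
shifted = pd.read_csv(io.StringIO(csv_text), skiprows=[0, 1])
print(shifted)

# An integer skips that many lines from the top: the same result here.
same = pd.read_csv(io.StringIO(csv_text), skiprows=2)

# Skipping only line 1 keeps the header and drops the first data row.
kept_header = pd.read_csv(io.StringIO(csv_text), skiprows=[1])
print(kept_header)
```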


### 4.1.5.8 Importing Data from the Web with pd.read_html()

In [31]:
url = "https://en.wikipedia.org/wiki/1976_Summer_Olympics_medal_table"

In [34]:
df = pd.read_html(url)

In [33]:
# If this error occurs: ImportError: Missing optional dependency 'lxml'. Use pip or conda to install lxml.
pip install lxml

Defaulting to user installation because normal site-packages is not writeable
Collecting lxml
  Downloading lxml-5.3.0-cp310-cp310-manylinux_2_28_x86_64.whl (5.0 MB)
Installing collected packages: lxml
Successfully installed lxml-5.3.0
Note: you may need to restart the kernel to use updated packages.


In [36]:
type(df)

list

In [37]:
df = pd.read_html(url)[0]
df.head() 

| | 1976 Summer Olympics medals | 1976 Summer Olympics medals.1 | Unnamed: 2 |
|---|---|---|---|
| 0 | Location | Montreal, Canada | NaN |
| 1 | Highlights | Highlights | NaN |
| 2 | Most gold medals | Soviet Union (49) | NaN |
| 3 | Most total medals | Soviet Union (125) | NaN |
| 4 | Medalling NOCs | 41 | NaN |


*Copyright &copy; 2024  [Md. Jubayer Hossain](https://hossainlab.github.io/) &  [Center for Bioinformatics Learning Advancement and Systematic Training (cBLAST)](https://www.cblast.du.ac.bd/). All rights reserved*