# **Ten days rigorous FREE data science training with Python**


**Duration: 01st March 2024 to 09th March 2024.**

**Lecture Time: 07:30-09:00 AM EST**

**Organized by: https://www.facebook.com/LearnPythonR4Datascience**

**Please like, share, and follow my Facebook page for more interesting content.**

**Please note that the Zoom link is already posted on my Facebook page.**

## **What will be covered in the three Pandas lectures?**

### *Lecture 1*
- What are a Pandas DataFrame and a Pandas Series?
- How to create a Pandas DataFrame from a Python dictionary?
- How to import and export different file formats using Pandas?
- Some common attributes and methods of a Pandas Series and a Pandas DataFrame
- How to select a single column or multiple columns from a Pandas DataFrame?
- How to use loc[ ] and iloc[ ] to select a single value or a group of values from a Pandas DataFrame?
- How to modify/replace a single value or a group of values in a Pandas DataFrame?
- How to count the unique (distinct) values in a Pandas Series?
- How to filter rows using a single boolean condition or a combined boolean condition (based on & and |)?
- How to sort a Pandas DataFrame in ascending or descending order by a single column or a group of columns?

### *Lecture 2*

- How to rename columns, how to drop columns, and how to create new columns from existing columns?
- How to use the apply, applymap and map methods on a Pandas DataFrame?
- How to group a Pandas DataFrame with groupby() using a single Pandas Series or a group of Series?
- What types of aggregation methods can be applied *following* the Pandas groupby() method to answer real-world questions?
- Application of the concepts using the real-world gapminder dataset

### *Lecture 3*

- What do we mean by merging and joining Pandas DataFrames?
- What are primary and foreign keys? Why do we need these concepts to merge datasets?
- What are inner, outer, left and right merges? When do we use which?
- Concatenation of DataFrames along the horizontal axis and the vertical axis
- How to treat missing values in a Pandas DataFrame?

**Note:** Students are also encouraged to complete the small project to practice Pandas concepts. The dataset and questions will be made available on my GitHub account.

# Python for Data Analysis

In [16]:
#! pip install pandas
import pandas as pd
import numpy as np

## Pandas
Pandas is a popular open-source data manipulation and analysis library for Python. It provides data structures and functions for efficiently working with **structured data**, such as tables or spreadsheets. Pandas is one of the most downloaded libraries of Python.

Two fundamental data structures in Pandas are:

- DataFrame.
- Series.

### *DataFrame:*

A DataFrame is a **two-dimensional** table-like data structure with rows and columns. It is similar to a spreadsheet or a SQL table. In a DataFrame, columns can have **different data types** (e.g., numbers, strings, dates), and each column represents a variable or a feature, while each row represents an observation or a record.

In [18]:
## Here's an example of creating a DataFrame from a dictionary

phd_scientists = {
              'Name': ['John', 'Emma', 'Peter'],
               'Age': np.array([25, 30, 35]),
               'Country': ['USA', 'Canada', 'UK'],
                "Bool1" : [True, False, True]
}

df = pd.DataFrame(phd_scientists, index = ["A","B","C"])

In [34]:
df

    Name  Age Country  Bool1
A   John   25     USA   True
B   Emma   30  Canada  False
C  Peter   35      UK   True


In [20]:
type(df)

pandas.core.frame.DataFrame

### *Series*:

A Series is a **one-dimensional** labeled array that can hold data of any type (e.g., numbers, strings, dates). It can be seen as a single column from a DataFrame. Each element in a Series has an associated index, which is used to access and manipulate the data.

*Two main Attributes of Pandas Series*

**Values:** We access the values in a Pandas Series with the "values" attribute of the Series. It returns them as a NumPy array.

**Index:** We can access the index of a series with the index attribute of Series.
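
As a minimal sketch of these two attributes, the snippet below builds a small Series and reads both of them. The name `scores` and its labels are made up for illustration and are not part of the lecture material.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: three scores labelled "a", "b", "c"
scores = pd.Series([88, 92, 79], index=["a", "b", "c"])

values = scores.values   # the underlying data as a NumPy array
labels = scores.index    # the row labels as a pandas Index

print(type(values))      # <class 'numpy.ndarray'>
print(list(labels))      # ['a', 'b', 'c']
print(scores["b"])       # label-based access to one element: 92
```
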


In [39]:
## You can create a Series from a list, array, or dictionary. Here's an example:

data = [10, 20, 30, 40, 50]

series = pd.Series(data)

print(series)


0    10
1    20
2    30
3    40
4    50
dtype: int64


In [42]:
series

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [9]:
boolean_array = np.array([True, False, True, False])

boolean_series = pd.Series(boolean_array)
print(boolean_series)

0     True
1    False
2     True
3    False
dtype: bool


In [10]:
type(boolean_series)

pandas.core.series.Series

In [24]:
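## Series from dictionary: each key becomes an index label.
## Here the single key "cgpa" maps to the whole list, so the Series has one element.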
dict1 = {"cgpa" : [3.40, 3.90, 2.50, 3.00]}

series3 = pd.Series(dict1)

print(series3)

cgpa    [3.4, 3.9, 2.5, 3.0]
dtype: object
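
In the output above the dictionary has a single key, so the Series holds one element containing the whole list. The sketch below shows what is probably the more common pattern, one scalar per key, which gives one element per key with the keys as the index. The student names are invented for illustration.

```python
import pandas as pd

# Hypothetical data: one CGPA per student; the dictionary keys become the index
cgpa_by_student = {"Ali": 3.40, "Sara": 3.90, "Omar": 2.50, "Zoe": 3.00}
cgpa_series = pd.Series(cgpa_by_student)

print(cgpa_series)
print(cgpa_series.dtype)   # float64

# Alternatively, pass the list itself and give the Series a name
cgpa_named = pd.Series([3.40, 3.90, 2.50, 3.00], name="cgpa")
print(len(cgpa_named))     # 4
```
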


In [25]:
type(series3)

pandas.core.series.Series

In [26]:
dir(series3)

['T',
 '_AXIS_LEN',
 '_AXIS_ORDERS',
 '_AXIS_TO_AXIS_NUMBER',
 '_HANDLED_TYPES',
 '__abs__',
 '__add__',
 '__and__',
 '__annotations__',
 '__array__',
 '__array_priority__',
 '__array_ufunc__',
 '__bool__',
 '__class__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__finalize__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__imod__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lt__',
 '__matmul__',
 '__mod__',
 '__module__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__nonzero__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__rep

## DataFrame

### Attributes
**shape**: Returns a tuple representing the dimensions of the DataFrame. The tuple contains the number of rows and columns, respectively.
df.shape

**index**: Returns the index labels of the DataFrame, representing the row labels or indices.
df.index

**columns**: Returns the column labels of the DataFrame, representing the column names or headers.
df.columns

**values**: Returns a 2D NumPy array containing the actual data values of the DataFrame.
df.values

**dtypes**: Returns the data type of each column in the DataFrame.
df.dtypes

**size**: Returns the total number of elements in the DataFrame, which is equal to the number of rows multiplied by the number of columns.
df.size

### Methods
**head()**: Returns the first n rows of the DataFrame. By default, it returns the first five rows.
df.head()

**tail()**: Returns the last n rows of the DataFrame. By default, it returns the last five rows.
df.tail()

**info()**: Provides a summary of the DataFrame, including the number of non-null values, data types, and memory usage.
df.info()
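
The attributes and methods listed above can be tried together on a small frame. This is only a sketch; the frame `people` is an illustrative stand-in with the same layout as the toy example used earlier.

```python
import pandas as pd

# Small illustrative frame with string row labels
people = pd.DataFrame(
    {"Name": ["John", "Emma", "Peter"], "Age": [25, 30, 35]},
    index=["A", "B", "C"],
)

print(people.shape)          # (3, 2): 3 rows, 2 columns
print(people.size)           # 6 = 3 rows * 2 columns
print(list(people.index))    # ['A', 'B', 'C']
print(list(people.columns))  # ['Name', 'Age']
print(people.dtypes)         # per-column data types

print(people.head(2))        # first two rows
print(people.tail(1))        # last row
people.info()                # prints the summary itself and returns None
```

Note that attributes such as `shape` are read without parentheses, while methods such as `head()` are called with them.
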

In [36]:
df

    Name  Age Country  Bool1
A   John   25     USA   True
B   Emma   30  Canada  False
C  Peter   35      UK   True


## **Series**

In [None]:
series.describe()

## **Gapminder dataset**

Let us apply these Pandas concepts to the real-world Gapminder dataset.

The Gapminder dataset is a comprehensive collection of global socio-economic indicators spanning several decades. It includes data on various metrics such as population, GDP per capita, life expectancy, education, and more, across different countries and regions of the world. The dataset is often used for analysis and visualization to understand trends and patterns in global development.

In [43]:
## We can read different types of files (csv, excel, json etc.) using Pandas functions
gapminder = pd.read_csv("gap_minder.csv")
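
Exporting works in the opposite direction with `to_csv`. The sketch below round-trips a two-row sample through CSV text held in memory, so it runs without the `gap_minder.csv` file. The two rows are copied from the Gapminder output in this notebook, and the name `sample` is illustrative.

```python
import io
import pandas as pd

# Two rows copied from the Gapminder output shown in this notebook
sample = pd.DataFrame(
    {
        "country": ["Afghanistan", "Zimbabwe"],
        "year": [1952, 2007],
        "lifeExp": [28.801, 43.487],
    }
)

# Export to CSV text; index=False leaves the row labels out of the file
csv_text = sample.to_csv(index=False)
print(csv_text)

# Import it back; read_csv accepts a file path or any file-like object
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)

# Other formats follow the same read_*/to_* naming, e.g. pd.read_json and
# DataFrame.to_json, or pd.read_excel and DataFrame.to_excel (Excel support
# needs an extra engine package to be installed).
```
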

In [44]:
gapminder

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.853030
2,Afghanistan,Asia,1962,31.997,10267083,853.100710
3,Afghanistan,Asia,1967,34.020,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
...,...,...,...,...,...,...
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.449960
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623


In [46]:
gapminder.head(10)  ## head() alone returns the first five rows; head(10) returns the ten rows shown below

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
5,Afghanistan,Asia,1977,38.438,14880372,786.11336
6,Afghanistan,Asia,1982,39.854,12881816,978.011439
7,Afghanistan,Asia,1987,40.822,13867957,852.395945
8,Afghanistan,Asia,1992,41.674,16317921,649.341395
9,Afghanistan,Asia,1997,41.763,22227415,635.341351


In [48]:
gapminder.tail(10)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
1694,Zimbabwe,Africa,1962,52.358,4277736,527.272182
1695,Zimbabwe,Africa,1967,53.995,4995432,569.795071
1696,Zimbabwe,Africa,1972,55.635,5861135,799.362176
1697,Zimbabwe,Africa,1977,57.674,6642107,685.587682
1698,Zimbabwe,Africa,1982,60.363,7636524,788.855041
1699,Zimbabwe,Africa,1987,62.351,9216418,706.157306
1700,Zimbabwe,Africa,1992,60.377,10704340,693.420786
1701,Zimbabwe,Africa,1997,46.809,11404948,792.44996
1702,Zimbabwe,Africa,2002,39.989,11926563,672.038623
1703,Zimbabwe,Africa,2007,43.487,12311143,469.709298


## *Accessing a single Pandas Series*

In [57]:
gapminder.year.values ## .values gives the column's data as a NumPy array; gapminder.year on its own returns a Series

array([1952, 1957, 1962, ..., 1997, 2002, 2007], dtype=int64)

In [54]:
gapminder["year"] ## Another way to access a single Pandas series.

0       1952
1       1957
2       1962
3       1967
4       1972
        ... 
1699    1987
1700    1992
1701    1997
1702    2002
1703    2007
Name: year, Length: 1704, dtype: int64

In [None]:
gapminder.year.values ## Accessing single column
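
The two access styles can be compared on a tiny stand-in frame. This is a sketch with two rows copied from the output above; the name `mini` is illustrative. It also shows one reason to prefer brackets: the Gapminder column `pop` shares its name with the `DataFrame.pop` method, so attribute access does not return the column.

```python
import pandas as pd

# Tiny stand-in for the Gapminder frame (values copied from the output above)
mini = pd.DataFrame(
    {
        "country": ["Afghanistan", "Afghanistan"],
        "year": [1952, 1957],
        "pop": [8425333, 9240934],
    }
)

by_attribute = mini.year     # attribute access
by_brackets = mini["year"]   # bracket access, works for any column name

print(by_attribute.equals(by_brackets))   # True: both give the same Series
print(type(mini["year"]).__name__)        # Series
print(type(mini[["year"]]).__name__)      # DataFrame (a list of columns gives a DataFrame)

# mini.pop is the DataFrame.pop method, not the 'pop' column
print(callable(mini.pop))                 # True
print(mini["pop"].tolist())               # [8425333, 9240934]
```
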

## *Accessing multiple pandas series*

Simply pass a list of all column names which we want to access from Pandas dataframe.

In [59]:
gapminder[["pop", "year", "lifeExp", "country"]]  ## This syntax selects one or more columns. Pass a list of column names when selecting multiple columns.

Unnamed: 0,pop,year,lifeExp,country
0,8425333,1952,28.801,Afghanistan
1,9240934,1957,30.332,Afghanistan
2,10267083,1962,31.997,Afghanistan
3,11537966,1967,34.020,Afghanistan
4,13079460,1972,36.088,Afghanistan
...,...,...,...,...
1699,9216418,1987,62.351,Zimbabwe
1700,10704340,1992,60.377,Zimbabwe
1701,11404948,1997,46.809,Zimbabwe
1702,11926563,2002,39.989,Zimbabwe


In [60]:
gapminder[["country","pop"]] ## Accessing multiple columns

Unnamed: 0,country,pop
0,Afghanistan,8425333
1,Afghanistan,9240934
2,Afghanistan,10267083
3,Afghanistan,11537966
4,Afghanistan,13079460
...,...,...
1699,Zimbabwe,9216418
1700,Zimbabwe,10704340
1701,Zimbabwe,11404948
1702,Zimbabwe,11926563


## *value_counts() method for Pandas Series*

How many distinct (unique) countries do we have in the gapminder dataset?

How many distinct (unique) years do we have in the gapminder dataset?

value_counts() returns the number of rows for each distinct value, so the Length reported at the end of its output is the number of distinct values.

In [61]:
gapminder["country"].value_counts()

country
Afghanistan          12
Pakistan             12
New Zealand          12
Nicaragua            12
Niger                12
                     ..
Eritrea              12
Equatorial Guinea    12
El Salvador          12
Egypt                12
Zimbabwe             12
Name: count, Length: 142, dtype: int64
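
When only the number of distinct values is needed, `nunique()` returns it directly, and `unique()` returns the distinct values themselves. The sketch below uses a made-up six-row sample rather than the real file, so its counts differ from the 142 countries with 12 rows each shown above.

```python
import pandas as pd

# Hypothetical mini-sample, not the real Gapminder data
sample = pd.DataFrame(
    {
        "country": ["Canada", "Canada", "Egypt", "Egypt", "Egypt", "Niger"],
        "year": [2002, 2007, 1997, 2002, 2007, 2007],
    }
)

counts = sample["country"].value_counts()         # rows per country, most frequent first
n_countries = sample["country"].nunique()         # number of distinct countries
years = sorted(sample["year"].unique().tolist())  # the distinct years themselves

print(counts["Egypt"])   # 3
print(n_countries)       # 3
print(years)             # [1997, 2002, 2007]
```
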

# **loc[ ] and iloc[ ] method**

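In Pandas, both loc and iloc are used to access and select data from a DataFrame, but they differ in how they index. Strictly speaking, they are indexers used with square brackets rather than methods called with parentheses.

**loc** is **label-based**. It accepts row and column labels as arguments and returns the subset of the DataFrame that matches those labels.

**iloc** is **integer-position-based**. It accepts row and column positions (starting from 0) and returns the subset of the DataFrame at those positions.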
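
One practical difference shows up when slicing: a label slice with loc includes its end label, while a position slice with iloc (which selects by integer position) excludes its end position. A minimal sketch, with `df_demo` as an illustrative name:

```python
import pandas as pd

# Illustrative frame with string row labels
df_demo = pd.DataFrame({"Age": [25, 30, 35]}, index=["A", "B", "C"])

by_label = df_demo.loc["A":"B", "Age"]   # label slice: the end label "B" is included
by_position = df_demo.iloc[0:2, 0]       # position slice: the end position 2 is excluded

print(by_label.tolist())      # [25, 30]
print(by_position.tolist())   # [25, 30]

# The same cell addressed by label and by position
print(df_demo.loc["C", "Age"] == df_demo.iloc[2, 0])   # True
```
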

## *Toy datasets*

Let us create toy (example) datasets to understand the difference between loc[] and iloc[] method.

In [62]:
data = {
    'Name': ['John', 'Emma', 'Peter'],
    'Age': [25, 30, 35],
    'Country': ['USA', 'Canada', 'UK']
}

df1 = pd.DataFrame(data, index=['A', 'B', 'C'])

print(df1)

    Name  Age Country
A   John   25     USA
B   Emma   30  Canada
C  Peter   35      UK


In [63]:
# Access a single row (the row labelled 'C') using label-based indexing
row_C = df1.loc['C']
print(row_C)


Name       Peter
Age           35
Country       UK
Name: C, dtype: object


In [65]:
# Access a subset of rows and columns using label-based indexing
# To subset multiple rows and columns simultaneously, we pass a separate list of row labels and a separate list of column labels.
subset = df1.loc[['A', 'C'], ['Name', 'Age']]
print(subset)

    Name  Age
A   John   25
C  Peter   35


In [66]:
country_info = {
    'Country': ['USA', 'Canada', 'UK'],
    'Population': [328_200_000, 37_590_000, 66_650_000],
    'Capital': ['Washington', 'Ottawa', 'London']
}

df_country = pd.DataFrame(country_info)
print(df_country)


  Country  Population     Capital
0     USA   328200000  Washington
1  Canada    37590000      Ottawa
2      UK    66650000      London


In [69]:
# Access a single row using integer-based indexing
row_0 = df_country.iloc[0, [0,1,2]]
print(row_0)


Country              USA
Population     328200000
Capital       Washington
Name: 0, dtype: object


In [71]:
# Access a subset of rows and columns using integer-based indexing
subset = df_country.iloc[[0, 2], [0, 1]]
print(subset)

  Country  Population
0     USA   328200000
2      UK    66650000


### *Accessing and modifying values using loc[ ] and iloc[ ] method*

In [72]:
df_country

  Country  Population     Capital
0     USA   328200000  Washington
1  Canada    37590000      Ottawa
2      UK    66650000      London


In [74]:
df_country.iloc[1, 2] = "Toronto"

In [75]:
df_country

  Country  Population     Capital
0     USA   328200000  Washington
1  Canada    37590000     Toronto
2      UK    66650000      London


In [77]:
df_country.iloc[[0,2], 0] = ["United States", "United Kingdom"]

In [78]:
df_country

          Country  Population     Capital
0   United States   328200000  Washington
1          Canada    37590000     Toronto
2  United Kingdom    66650000      London


In [None]:
## let us use loc and iloc on the gapminder dataset
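
The cell above is left empty in the notebook. As one possible sketch, the snippet below applies both indexers to four rows copied from the Gapminder output shown earlier. The row labels are kept as in the full dataset, so labels and positions differ, and the name `gap_sample` is illustrative.

```python
import pandas as pd

# Four rows copied from the Gapminder output; index labels kept as in the full dataset
gap_sample = pd.DataFrame(
    {
        "country": ["Afghanistan", "Afghanistan", "Zimbabwe", "Zimbabwe"],
        "year": [1952, 1957, 2002, 2007],
        "lifeExp": [28.801, 30.332, 39.989, 43.487],
    },
    index=[0, 1, 1702, 1703],
)

# loc uses the index labels and the column names
print(gap_sample.loc[1703, "lifeExp"])                  # 43.487
print(gap_sample.loc[[0, 1703], ["country", "year"]])

# iloc uses integer positions, regardless of the labels
print(gap_sample.iloc[-1, 2])                           # 43.487 (last row, third column)
print(gap_sample.iloc[0:2, 0:2])                        # first two rows, first two columns
```
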

# Filtering rows
You can create a boolean condition based on specific criteria and use it to filter the DataFrame using square brackets. You can filter rows based on various conditions, such as comparisons, equality, string matching, and more.
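
The kinds of condition mentioned above can be sketched on a four-row sample. The rows are copied from Gapminder output elsewhere in this notebook; the name `gap_sample` and the chosen thresholds are illustrative.

```python
import pandas as pd

# Small sample; rows copied from Gapminder output elsewhere in this notebook
gap_sample = pd.DataFrame(
    {
        "country": ["Afghanistan", "Algeria", "Canada", "Zimbabwe"],
        "continent": ["Asia", "Africa", "Americas", "Africa"],
        "year": [2007, 2007, 2007, 2007],
        "lifeExp": [43.828, 72.301, 80.653, 43.487],
    }
)

is_african = gap_sample["continent"] == "Africa"              # equality
starts_with_a = gap_sample["country"].str.startswith("A")     # string matching
in_list = gap_sample["country"].isin(["Canada", "Zimbabwe"])  # membership
long_lived = gap_sample["lifeExp"].between(70, 85)            # range check, both ends included

print(gap_sample[is_african]["country"].tolist())      # ['Algeria', 'Zimbabwe']
print(gap_sample[starts_with_a]["country"].tolist())   # ['Afghanistan', 'Algeria']
print(gap_sample[in_list]["country"].tolist())         # ['Canada', 'Zimbabwe']
print(gap_sample[long_lived]["country"].tolist())      # ['Algeria', 'Canada']
```
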

In [79]:
condition = (gapminder["country"] == "Afghanistan")
print(condition)

0        True
1        True
2        True
3        True
4        True
        ...  
1699    False
1700    False
1701    False
1702    False
1703    False
Name: country, Length: 1704, dtype: bool


In [80]:
gapminder[condition]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
0,Afghanistan,Asia,1952,28.801,8425333,779.445314
1,Afghanistan,Asia,1957,30.332,9240934,820.85303
2,Afghanistan,Asia,1962,31.997,10267083,853.10071
3,Afghanistan,Asia,1967,34.02,11537966,836.197138
4,Afghanistan,Asia,1972,36.088,13079460,739.981106
5,Afghanistan,Asia,1977,38.438,14880372,786.11336
6,Afghanistan,Asia,1982,39.854,12881816,978.011439
7,Afghanistan,Asia,1987,40.822,13867957,852.395945
8,Afghanistan,Asia,1992,41.674,16317921,649.341395
9,Afghanistan,Asia,1997,41.763,22227415,635.341351


In [81]:
combine_condition = (gapminder["year"]==2007) & (gapminder["country"]=="Canada")

In [82]:
gapminder[combine_condition]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
251,Canada,Americas,2007,80.653,33390141,36319.23501


### *Easy filtering questions*

- All the countries with a population above 10 million in 2007 in the gapminder dataset.
- The poorest country on the African continent in 2007.
- The richest country on the European continent in 1952.
- The most populous country in Africa in 2007.

In [83]:
ten_million_2007 = (gapminder["pop"] > 10000000) & (gapminder["year"]==2007)
ten_million_2007

0       False
1       False
2       False
3       False
4       False
        ...  
1699    False
1700    False
1701    False
1702    False
1703     True
Length: 1704, dtype: bool

In [84]:
gapminder[ten_million_2007]

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
11,Afghanistan,Asia,2007,43.828,31889923,974.580338
35,Algeria,Africa,2007,72.301,33333216,6223.367465
47,Angola,Africa,2007,42.731,12420476,4797.231267
59,Argentina,Americas,2007,75.320,40301927,12779.379640
71,Australia,Oceania,2007,81.235,20434176,34435.367440
...,...,...,...,...,...,...
1643,Venezuela,Americas,2007,73.747,26084662,11415.805690
1655,Vietnam,Asia,2007,74.249,85262356,2441.576404
1679,"Yemen, Rep.",Asia,2007,62.698,22211743,2280.769906
1691,Zambia,Africa,2007,42.384,11746035,1271.211593


In [95]:
df_africa = gapminder[(gapminder["continent"] == "Africa") & (gapminder["year"] == 2007)]

In [98]:
df_africa.sort_values(by = "pop", ascending = True)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
1307,Sao Tome and Principe,Africa,2007,65.528,199579,1598.435089
431,Djibouti,Africa,2007,54.791,496374,2082.481567
491,Equatorial Guinea,Africa,2007,51.579,551201,12154.08975
323,Comoros,Africa,2007,65.152,710960,986.147879
1271,Reunion,Africa,2007,76.442,798094,7670.122558
1463,Swaziland,Africa,2007,39.613,1133066,4513.480643
983,Mauritius,Africa,2007,72.801,1250882,10956.99112
551,Gabon,Africa,2007,56.735,1454867,13206.48452
635,Guinea-Bissau,Africa,2007,46.388,1472041,579.231743
167,Botswana,Africa,2007,50.728,1639131,12569.85177
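
The "poorest" and "most populous" questions can be approached with the same filter-then-sort pattern, or with `idxmin()`/`idxmax()`. The sketch below uses only five African rows for 2007 copied from the outputs above, so its answers hold for this sample and are not the answers for the full dataset. The name `africa_2007_sample` is illustrative.

```python
import pandas as pd

# A handful of African rows for 2007 copied from the outputs above.
# This is NOT the full dataset, so the results hold for this sample only.
africa_2007_sample = pd.DataFrame(
    {
        "country": ["Algeria", "Angola", "Zambia", "Djibouti", "Guinea-Bissau"],
        "pop": [33333216, 12420476, 11746035, 496374, 1472041],
        "gdpPercap": [6223.367465, 4797.231267, 1271.211593, 2082.481567, 579.231743],
    }
)

# Largest population: sort in descending order and take the first row
most_populous = africa_2007_sample.sort_values("pop", ascending=False).head(1)

# Smallest GDP per capita: idxmin returns the row label of the minimum
poorest = africa_2007_sample.loc[africa_2007_sample["gdpPercap"].idxmin()]

print(most_populous["country"].iloc[0])   # Algeria (within this sample)
print(poorest["country"])                 # Guinea-Bissau (within this sample)
```
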


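You can combine multiple conditions using the logical operators & (and) and | (or). Wrap each condition in its own parentheses, as in the cells above, because & and | take precedence over comparison operators such as == and >.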
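
A short sketch of the three operators on a four-row sample follows. The rows are copied from the Gapminder output in this notebook, and the names `gap_sample`, `asia_or_africa`, `africa_2007` and `not_africa` are illustrative.

```python
import pandas as pd

# Sample rows copied from the Gapminder output in this notebook
gap_sample = pd.DataFrame(
    {
        "country": ["Afghanistan", "Canada", "Zimbabwe", "Zimbabwe"],
        "continent": ["Asia", "Americas", "Africa", "Africa"],
        "year": [2007, 2007, 2002, 2007],
    }
)

# Each comparison sits in its own parentheses before being combined
asia_or_africa = (gap_sample["continent"] == "Asia") | (gap_sample["continent"] == "Africa")
africa_2007 = (gap_sample["continent"] == "Africa") & (gap_sample["year"] == 2007)
not_africa = ~(gap_sample["continent"] == "Africa")   # ~ negates a condition

print(gap_sample[asia_or_africa].shape[0])           # 3
print(gap_sample[africa_2007]["year"].tolist())      # [2007]
print(gap_sample[not_africa]["country"].tolist())    # ['Afghanistan', 'Canada']
```
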

### *Sorting the dataframe by ascending (descending) order or by single or multiple columns*

*Sorting by a single column*:
To sort a DataFrame by a single column, pass the column name to the sort_values() method.

*Sorting by multiple columns*:
To sort a DataFrame by multiple columns, pass a list of column names to the sort_values() method. The DataFrame is sorted by the first column in the list, and ties are broken by the subsequent columns. The ascending argument can also be a list, with one True/False value per column.


In [99]:
gapminder.sort_values("pop", ascending = False)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap
299,China,Asia,2007,72.961,1318683096,4959.114854
298,China,Asia,2002,72.028,1280400000,3119.280896
297,China,Asia,1997,70.426,1230075000,2289.234136
296,China,Asia,1992,68.690,1164970000,1655.784158
707,India,Asia,2007,64.698,1110396331,2452.210407
...,...,...,...,...,...,...
1299,Sao Tome and Principe,Africa,1967,54.425,70787,1384.840593
1298,Sao Tome and Principe,Africa,1962,51.893,65345,1071.551119
420,Djibouti,Africa,1952,34.812,63149,2669.529475
1297,Sao Tome and Principe,Africa,1957,48.945,61325,860.736903


In [100]:
data = {
    'Name': ['John', 'Emma', 'Peter', 'Emma'],
    'Age': [25, 30, 35, 28],
    'Country': ['USA', 'Canada', 'UK', 'Australia']
}

df = pd.DataFrame(data)
print(df)


    Name  Age    Country
0   John   25        USA
1   Emma   30     Canada
2  Peter   35         UK
3   Emma   28  Australia


In [101]:
# Sort by the 'Name' column in descending order and then by the 'Age' column in ascending order
sorted_df = df.sort_values(['Name', 'Age'], ascending=[False, True])
print(sorted_df)

    Name  Age    Country
2  Peter   35         UK
0   John   25        USA
3   Emma   28  Australia
1   Emma   30     Canada


In [None]:
gapminder