<center>
<table style="border:none">
    <tr style="border:none">
    <th style="border:none">
        <a  href='https://colab.research.google.com/github/AmirMardan/ml_course/blob/main/3_pandas/0_intro_to_pandas.ipynb'><img src='https://colab.research.google.com/assets/colab-badge.svg'></a>
    </th>
    <th style="border:none">
        <a  href='https://github1s.com/AmirMardan/ml_course/blob/main/3_pandas/0_intro_to_pandas.ipynb'><img src='../imgs/open_vscode.svg' height=20px width=115px></a>
    </th>
    </tr>
</table>
</center>


This notebook was created by <a href='https://amirmardan.github.io'>Amir Mardan</a>. For any feedback or suggestions, please contact me via <a href="mailto:mardan.amir.h@gmail.com">email</a> (mardan.amir.h@gmail.com).



<center>
<img id='PYTHON' src='img/pandas.svg' width='300px'>
</center>

<a name='top'></a>
# Introduction to pandas

This notebook will cover the following topics:

- [Introduction](#introduction)
- [1. Introducing pandas objects](#objects)
    - [The pandas `Series` object](#series)
    - [The pandas `DataFrame` object](#dataframe)
- [2. Data indexing and selection](#indexing)
    - [Data selection in Series](#index_series)
    - [Data selection in DataFrame](#index_df)
- [3. Handling missing data](#missing)
    - [Detecting the missing values](#check_missing)
    - [Dealing with missing values](#deal_missing)
- [4. IO in pandas](#import)

<a name='introduction'></a>
## Introduction 


pandas is a library for data manipulation and analysis.
It was created by **Wes McKinney** and first released in January 2008.

<center><img src='./img/wes.png' alt='Wes McKinney' width=300px></center>

In this notebook, we learn the basics of pandas:
- what the pandas objects are and how to create them,
- data selection and indexing,
- handling missing data,
- reading and writing files.


In [1]:
# Ignore this cell (helper used later to build index labels)

import numpy as np  # imported here so the helper does not depend on the import cell below


def letter_generator(n, random=True):
    """
    Generate lowercase letters.

    Parameters
    ----------
    n : int
        Number of required characters

    random : bool
        If True, return n randomly drawn letters;
        otherwise return the first n letters of the alphabet (n <= 26)

    Returns
    -------
    numpy.ndarray
        Array of n single-character strings
    """
    alphabet = np.array(list('abcdefghijklmnopqrstuvwxyz'))
    if random:
        return alphabet[np.random.randint(0, 26, n)]
    return alphabet[:n]

<a name='objects'></a>
## 1. Introducing pandas objects

At a basic level, pandas objects can be thought of as NumPy structured arrays in which the rows and columns are identified with labels rather than integer indices. There are three fundamental pandas structures:
- `Series`
- `DataFrame`
- `Index`

Let's import pandas and NumPy and discuss the mentioned structures.

In [2]:
import numpy as np
import pandas as pd

<a name='series'></a>
### 1.1 The pandas `Series` object


A pandas `Series` is a one-dimensional array of indexed data.

In [3]:
# Creating a series from list

data = pd.Series([2, 1, 3.4, -8])
data

0    2.0
1    1.0
2    3.4
3   -8.0
dtype: float64

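As we see, a `Series` holds a sequence of values together with a sequence of indices. Because the list mixes integers and floats, all values are stored as `float64`. When the values have no common numeric type, the dtype falls back to `object`: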

In [4]:
pd.Series(['k', 3, 2])

0    k
1    3
2    2
dtype: object

We can define the index explicitly:

In [5]:
pd.Series([1, 2, 4], index=['a', 'x', 't'])

a    1
x    2
t    4
dtype: int64

In [6]:
# Creating a series from dictionary

courses = {
    'Math': 3.4,
    'Literature': 4,
    'French': 3
}

pd.Series(courses)

Math          3.4
Literature    4.0
French        3.0
dtype: float64

In [7]:
# Creating a series from NumPy array

df = pd.Series(np.arange(3, 9, 1.2), index=['a', 'b', 'c', 'd', 'e'])
df

a    3.0
b    4.2
c    5.4
d    6.6
e    7.8
dtype: float64

We can access the values and the indices through the `values` and `index` attributes.



In [8]:
# Get the indices

df.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
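The `Index` listed among the three fundamental structures can also be used on its own. The snippet below is a minimal sketch (the names `idx_a` and `idx_b` are made up for this illustration): an `Index` behaves like an immutable array and also supports set-like operations.

```python
import pandas as pd

# Hypothetical labels, used only for illustration
idx_a = pd.Index(['a', 'b', 'c', 'd'])
idx_b = pd.Index(['c', 'd', 'e'])

# Array-like access: positions and slices work as in NumPy
first_label = idx_a[0]
first_two = idx_a[:2]

# Set-like operations between two Index objects
common = idx_a.intersection(idx_b)
combined = idx_a.union(idx_b)

# An Index is immutable: item assignment raises a TypeError
try:
    idx_a[0] = 'z'
    is_immutable = False
except TypeError:
    is_immutable = True

print(first_label, list(first_two), list(common), list(combined), is_immutable)
```

Immutability is what makes it safe to share one `Index` between several `Series` and `DataFrame` objects.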

In [9]:
# Get the values

df.values

array([3. , 4.2, 5.4, 6.6, 7.8])

Values are accessible using their indices, as Section 2 shows in detail.

In [10]:
# Creating a homogeneous series: the scalar is repeated for every index label

pd.Series(50, index=[1, 2, 3])

1    50
2    50
3    50
dtype: int64

<a name='dataframe'></a>
### 1.2 The pandas `DataFrame` object


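A pandas `DataFrame` can be thought of either as a two-dimensional NumPy array with labeled rows and columns, or as a dictionary of `Series` objects that share the same index.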

In [11]:
# Let's prepare some data

population_dict = {
    'China': 1439323776,
    'India': 1380004385,
    'US': 331002651,
    'Indonesia': 273523615,
    'Pakistan': 220892340
}

land_area_dict = {
    'China': 9388211,
    'India': 2973190,
    'US': 9147420,
    'Indonesia': 1811570,
    'Pakistan': 770880
}

In [12]:
# Creating DataFrame using Series 

# 1.  Creating Series
population = pd.Series(population_dict)
land_area = pd.Series(land_area_dict)

# 2. Combine the Series
countries = pd.DataFrame({'Population': population, 'Land Area': land_area})
countries

|  | Population | Land Area |
|---|---|---|
| China | 1439323776 | 9388211 |
| India | 1380004385 | 2973190 |
| US | 331002651 | 9147420 |
| Indonesia | 273523615 | 1811570 |
| Pakistan | 220892340 | 770880 |


In [13]:
# Creating DataFrame using list 

# 1.  Creating the list

countries_list = []
population_list = []
land_area_list = []

for param in land_area_dict:
    countries_list.append(param)
    population_list.append(population_dict[param])
    land_area_list.append(land_area_dict[param])

countries_list

['China', 'India', 'US', 'Indonesia', 'Pakistan']

In [14]:
# 2. Combine the lists

df = pd.DataFrame({"Population": population_list,
                   "Land Area": land_area_list},
                  index=countries_list)
df

|  | Population | Land Area |
|---|---|---|
| China | 1439323776 | 9388211 |
| India | 1380004385 | 2973190 |
| US | 331002651 | 9147420 |
| Indonesia | 273523615 | 1811570 |
| Pakistan | 220892340 | 770880 |


In [15]:
# Adding another column.
# For example, let's calculate the density

df['Density'] = df['Population'] / df['Land Area']

df

|  | Population | Land Area | Density |
|---|---|---|---|
| China | 1439323776 | 9388211 | 153.311827 |
| India | 1380004385 | 2973190 | 464.14941 |
| US | 331002651 | 9147420 | 36.185356 |
| Indonesia | 273523615 | 1811570 | 150.987053 |
| Pakistan | 220892340 | 770880 | 286.545688 |


We use the `index` and `columns` attributes to get the row labels and the column names.

In [16]:
df.index

Index(['China', 'India', 'US', 'Indonesia', 'Pakistan'], dtype='object')

In [17]:
df.columns

Index(['Population', 'Land Area', 'Density'], dtype='object')

In [18]:
# The values attribute returns the data as a NumPy array

df.values

array([[1.43932378e+09, 9.38821100e+06, 1.53311827e+02],
       [1.38000438e+09, 2.97319000e+06, 4.64149410e+02],
       [3.31002651e+08, 9.14742000e+06, 3.61853562e+01],
       [2.73523615e+08, 1.81157000e+06, 1.50987053e+02],
       [2.20892340e+08, 7.70880000e+05, 2.86545688e+02]])
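The integers in `Population` and `Land Area` show up as floats in the array above because a NumPy array holds a single dtype, so columns of different types are upcast to a common one. A small sketch with made-up numbers (the frame `small` and its column names are hypothetical); `to_numpy()` is the method the pandas documentation recommends over `values`:

```python
import pandas as pd

# Hypothetical two-column frame: one integer column, one float column
small = pd.DataFrame({'n_items': [1, 2, 3], 'ratio': [0.5, 1.5, 2.5]})

column_dtypes = small.dtypes   # each column keeps its own dtype
as_array = small.to_numpy()    # the array needs one common dtype

print(column_dtypes)
print(as_array.dtype)
```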

In [19]:
# Creating a DataFrame from a list of dictionaries
# Keys that are missing from a row become NaN

pd.DataFrame([{'a': 0, 'b': 1},
              {'a': 2, 'f':3, 'g':6}])

|  | a | b | f | g |
|---|---|---|---|---|
| 0 | 0 | 1.0 | NaN | NaN |
| 1 | 2 | NaN | 3.0 | 6.0 |


In [20]:
# Creating a DataFrame from a 2-D NumPy array

pd.DataFrame(np.random.random((3, 4)),
             columns=['col1', 'col2', 'col3', 'col4'], 
             index=[2, 4, 6])

|  | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| 2 | 0.2487 | 0.765018 | 0.690605 | 0.972128 |
| 4 | 0.502477 | 0.310301 | 0.332155 | 0.036046 |
| 6 | 0.402649 | 0.203905 | 0.657401 | 0.013301 |


<a name='indexing'></a>
## 2. Data indexing and selection


In this part, we learn how to access a part of the data and modify it.

<a name='index_series'></a>
### 2.1 Data selection in Series

In [21]:
# Creating a series from NumPy array
a = np.arange(2.5, 12, 1.5)
df = pd.Series(a, index=letter_generator(len(a), False))
df

a     2.5
b     4.0
c     5.5
d     7.0
e     8.5
f    10.0
g    11.5
dtype: float64

We can get a part of a `Series` with different methods:
- slicing
- masking
- fancy indexing

For indexing and slicing, the data is accessible either with explicit indices (the labels) or with implicit indices (the integer positions).

In [22]:
# Explicit indexing to one element

df['a']

2.5

In [23]:
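# Implicit indexing to one element
# (recent pandas versions emit a deprecation warning for positional access
#  with [] on a labeled Series; df.iloc[0] is the unambiguous form)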

df[0]

2.5

In [24]:
# Explicit slicing (the end label is included)

df['a': 'c']

a    2.5
b    4.0
c    5.5
dtype: float64
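Explicit and implicit slices treat the end point differently: a label slice includes its end label, while a positional slice excludes its end position, as in plain Python. A minimal sketch that rebuilds the same values under the hypothetical name `s`:

```python
import numpy as np
import pandas as pd

# Same values and labels as the Series above, rebuilt so the sketch is self-contained
s = pd.Series(np.arange(2.5, 12, 1.5), index=list('abcdefg'))

by_label = s.loc['a':'c']    # label slice: 'c' is included
by_position = s.iloc[0:2]    # positional slice: position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```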

In [25]:
# Fancy indexing with explicit labels

df[['a', 'd']]

a    2.5
d    7.0
dtype: float64

In [26]:
# Masking

# Let's create a mask
mask = (df > 1) & (df % 2 == 0)
print("The created mask is:\n{}".format(mask))

# Index using the mask

masked = df[mask]
print("\nThe masked Series is:\n{}".format(masked))


The created mask is:
a    False
b     True
c    False
d    False
e    False
f     True
g    False
dtype: bool

The masked Series is:
b     4.0
f    10.0
dtype: float64


In [27]:
# Fancy indexing with implicit positions
# (df.iloc[[0, 3]] is the unambiguous form in recent pandas versions)

df[[0, 3]]

a    2.5
d    7.0
dtype: float64

#### Indexers, loc and iloc

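Let's imagine a `Series` whose index consists of integers that don't start from zero and aren't sorted. With plain `[]` it is then unclear whether an integer means a label (explicit indexing) or a position (implicit indexing), which can cause a lot of confusion.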

In [28]:
df = pd.Series(letter_generator(5, random=False),
               index=[4, 2, 3, 1, 6])

df

4    a
2    b
3    c
1    d
6    e
dtype: object

<hr>
<div>
<span style="color:#151D3B; font-weight:bold">Question: 🤔</span><p>
What's the result of 

<code>df[2]</code>

Explicit indexing: 'b'

Implicit indexing: 'c'
</div>
<hr>

In [29]:
# Answer
# df[2]  # with an integer index, [] looks up the label, so this returns 'b'


To avoid confusion, pandas provides two special *indexers*:
- `loc`
- `iloc`

In [30]:
# loc for explicit indexing

df.loc[2]

'b'

In [31]:
# iloc for implicit indexing

df.iloc[2]

'c'

In [32]:
# Implicit slicing 

df.iloc[2: 4]

3    c
1    d
dtype: object

In [33]:
# Explicit slicing (from label 2 to label 1, both included)

df.loc[2: 1]

2    b
3    c
1    d
dtype: object

In [34]:
# Explicit fancy indexing

df.loc[[2, 4]]

2    b
4    a
dtype: object
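The same indexers can be used on the left-hand side of an assignment to modify the data. A minimal sketch, assuming a copy of the Series above under the hypothetical name `letters`:

```python
import pandas as pd

# Same labels and values as the Series above, rebuilt so the sketch is self-contained
letters = pd.Series(['a', 'b', 'c', 'd', 'e'], index=[4, 2, 3, 1, 6])

letters.loc[2] = 'B'         # explicit: the element labeled 2
letters.iloc[2] = 'C'        # implicit: the element at position 2 (label 3)
letters.loc[[1, 6]] = 'z'    # several labels at once

print(letters.tolist())
```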

<a name='index_df'></a>
### 2.2 Data selection in DataFrame

In [35]:
# Let's create a DataFrame

countries_list = ['China', 'India', 'US','Indonesia', 'Pakistan']
population_list = [1439323776, 1380004385, 331002651, 273523615, 220892340]

land_area_list = [9388211, 2973190, 9147420, 1811570, 770880]

density = list(map(np.divide, population_list, land_area_list))

df = pd.DataFrame({"Population": population_list,
                   "Land Area": land_area_list,
                   "Density": density},
                  index=countries_list)

df

|  | Population | Land Area | Density |
|---|---|---|---|
| China | 1439323776 | 9388211 | 153.311827 |
| India | 1380004385 | 2973190 | 464.14941 |
| US | 331002651 | 9147420 | 36.185356 |
| Indonesia | 273523615 | 1811570 | 150.987053 |
| Pakistan | 220892340 | 770880 | 286.545688 |


An individual `Series` of the DataFrame can be accessed with attribute-style indexing.

In [36]:
df.Population

China        1439323776
India        1380004385
US            331002651
Indonesia     273523615
Pakistan      220892340
Name: Population, dtype: int64

However, this does not work for every column: a name that is not a valid Python identifier (such as `Land Area`) or that clashes with a DataFrame attribute or method cannot be reached this way. In these cases, it's better to use dictionary-style indexing.

In [37]:
df['Population']

China        1439323776
India        1380004385
US            331002651
Indonesia     273523615
Pakistan      220892340
Name: Population, dtype: int64
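The pitfall of attribute-style access can be sketched with a made-up frame (the name `frame` and its columns are hypothetical): a column called `count` is shadowed by the `DataFrame.count` method, and a column whose name contains a space cannot be written as an attribute at all.

```python
import pandas as pd

# Hypothetical frame: one column name clashes with a method, the other contains a space
frame = pd.DataFrame({'count': [10, 20], 'Land Area': [1.5, 2.5]})

attr_result = frame.count      # the DataFrame.count method, not the column
item_result = frame['count']   # the column itself
land = frame['Land Area']      # only dictionary-style indexing works here

print(callable(attr_result))
print(item_result.tolist())
print(land.tolist())
```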

The other advantage of dictionary-style indexing is its functionality for picking more than one column.

In [38]:
df[['Population', 'Density']]

|  | Population | Density |
|---|---|---|
| China | 1439323776 | 153.311827 |
| India | 1380004385 | 464.14941 |
| US | 331002651 | 36.185356 |
| Indonesia | 273523615 | 150.987053 |
| Pakistan | 220892340 | 286.545688 |


We can also use `loc` and `iloc`.


In [39]:
# Explicit indexing for DataFrame

df.loc['India', ['Population', 'Density']]

Population    1.380004e+09
Density       4.641494e+02
Name: India, dtype: float64

<hr>
<div>
<span style="color:#151D3B; font-weight:bold">Question: 🤔</span><p>
Select the population and land area of Pakistan and India using explicit indexing.
</div>
<hr>

In [40]:
# Answer
df.loc[['Pakistan', 'India'], ['Population', 'Land Area']]

|  | Population | Land Area |
|---|---|---|
| Pakistan | 220892340 | 770880 |
| India | 1380004385 | 2973190 |


In [41]:
# Answer using implicit indexing

df.iloc[[4, 1], [0, 1]]

|  | Population | Land Area |
|---|---|---|
| Pakistan | 220892340 | 770880 |
| India | 1380004385 | 2973190 |


#### Conditional indexing

In [42]:
# Get all columns based on a condition

df[df['Density'] < 120]

|  | Population | Land Area | Density |
|---|---|---|---|
| US | 331002651 | 9147420 | 36.185356 |


<hr>
<div>
<span style="color:#151D3B; font-weight:bold">Question: 🤔</span><p>
Get the population and land area of the countries whose density is at least twice the density of the US.
</div>
<hr>

In [43]:
# Answer

# df.loc[df['Density'] >=  2 * df.loc['US', 'Density'], ['Population', 'Land Area']]

$\color{red}{\text{Note:}}$
- Indexing (`df['Population']`) refers to columns
- Slicing (`df['China':'US']`) refers to rows
- Masking (`df[df['Density'] < 120]`) operates row-wise
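The three rules in the note can be checked on a tiny frame. This is a minimal sketch with made-up data (the name `toy` and its contents are hypothetical):

```python
import pandas as pd

# Hypothetical frame with labeled rows
toy = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]}, index=['a', 'b', 'c'])

col = toy['x']                 # indexing: selects the column 'x'
rows = toy['a':'b']            # slicing: selects the rows 'a' through 'b'
masked = toy[toy['y'] > 10]    # masking: keeps the rows where the condition holds

print(col.tolist())
print(list(rows.index))
print(list(masked.index))
```

Because plain `[]` switches between columns and rows depending on what it receives, `loc` and `iloc` are usually the clearer choice when both axes are involved.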


<a name='missing'></a>
## 3. Handling missing data

Real-world data is rarely clean and homogeneous, and datasets usually contain missing values. To complicate matters, different sources mark missing data in different ways.
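As a rough sketch of what "different ways" can mean in practice: pandas recognizes `None` and `np.nan` as missing, but a placeholder string such as `'N/A'` is just text until it is converted. The values and variable names below are made up for illustration.

```python
import numpy as np
import pandas as pd

# None and np.nan are both treated as missing in a numeric Series
values = pd.Series([1.0, np.nan, None])
detected = values.isna()

# A placeholder string is not detected as missing
text = pd.Series(['3', 'N/A', '5'])
text_detected = text.isna()

# Converting to numbers with errors='coerce' turns unparseable entries into NaN
cleaned = pd.to_numeric(text, errors='coerce')

print(detected.tolist())
print(text_detected.tolist())
print(cleaned.isna().tolist())
```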


In [44]:
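# Let's create a DataFrame from two Series whose labels don't fully match:
# 'India' has no land area and 'Brazil' has no population, so both get NaN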

population_dict = {
    'China': 1439323776,
    'India': 1380004385,
    'US': 331002651,
    'Indonesia': 273523615,
    'Pakistan': 220892340,
}

land_area_dict = {
    'China': 9388211,
    'US': 9147420,
    'Indonesia': 1811570,
    'Pakistan': 770880,
    'Brazil': 8358140
}

# 1.  Creating Series
population = pd.Series(population_dict)
land_area = pd.Series(land_area_dict)

# 2. Combine the Series
df_missing = pd.DataFrame({'Population': population, 'Land Area': land_area})
df_missing

|  | Population | Land Area |
|---|---|---|
| Brazil | NaN | 8358140.0 |
| China | 1439324000.0 | 9388211.0 |
| India | 1380004000.0 | NaN |
| Indonesia | 273523600.0 | 1811570.0 |
| Pakistan | 220892300.0 | 770880.0 |
| US | 331002700.0 | 9147420.0 |


<a name='check_missing'></a>
### 3.1 Detecting the missing values

In [45]:
# Find the missing values using isna

df_missing.isna()

|  | Population | Land Area |
|---|---|---|
| Brazil | True | False |
| China | False | False |
| India | False | True |
| Indonesia | False | False |
| Pakistan | False | False |
| US | False | False |


In [46]:
# Find the missing values using isnull (an alias of isna)

df_missing.isnull()

|  | Population | Land Area |
|---|---|---|
| Brazil | True | False |
| China | False | False |
| India | False | True |
| Indonesia | False | False |
| Pakistan | False | False |
| US | False | False |


In [47]:
# Find the non-missing values using notna (the opposite of isna)

df_missing.notna()

|  | Population | Land Area |
|---|---|---|
| Brazil | False | True |
| China | True | True |
| India | True | False |
| Indonesia | True | True |
| Pakistan | True | True |
| US | True | True |


In [48]:
# Number of missing values

df_missing.isnull().sum()

Population    1
Land Area     1
dtype: int64

In [49]:
# Percentage of missing values

100 * df_missing.isnull().sum() / len(df_missing)

Population    16.666667
Land Area     16.666667
dtype: float64

<a name='deal_missing'></a>
### 3.2 Dealing with missing values

Missing values can be either *ignored*, *dropped*, or *filled*. 
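As a minimal sketch of the *ignored* option: pandas reductions such as `mean` skip NaN by default (`skipna=True`). The Series below reuses three land-area values from the DataFrame above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Three land-area values from above; India's entry is missing
land_area = pd.Series([8358140.0, 9388211.0, np.nan],
                      index=['Brazil', 'China', 'India'])

mean_ignoring = land_area.mean()             # NaN is skipped by default
mean_strict = land_area.mean(skipna=False)   # NaN propagates into the result

print(mean_ignoring)
print(mean_strict)
```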

#### Dropping

In [50]:
# Dropping the rows with missing values (axis=0, the default)

df_missing.dropna()

|  | Population | Land Area |
|---|---|---|
| China | 1439324000.0 | 9388211.0 |
| Indonesia | 273523600.0 | 1811570.0 |
| Pakistan | 220892300.0 | 770880.0 |
| US | 331002700.0 | 9147420.0 |


In [51]:
# Dropping the columns with missing values (axis=1)
# Both columns contain a NaN, so only the index is left

df_missing.dropna(axis=1)

|  |
|---|
| Brazil |
| China |
| India |
| Indonesia |
| Pakistan |
| US |


In [52]:
# Dropping the rows with missing values (axis='rows' is equivalent to axis=0)

df_missing.dropna(axis='rows')

|  | Population | Land Area |
|---|---|---|
| China | 1439324000.0 | 9388211.0 |
| Indonesia | 273523600.0 | 1811570.0 |
| Pakistan | 220892300.0 | 770880.0 |
| US | 331002700.0 | 9147420.0 |


In [53]:
# Drop the missing values of a specific column

df_missing['Population'].dropna()

China        1.439324e+09
India        1.380004e+09
Indonesia    2.735236e+08
Pakistan     2.208923e+08
US           3.310027e+08
Name: Population, dtype: float64
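Selecting the column first returns only that column. To keep whole rows while checking a single column for missing values, `dropna` accepts a `subset` argument. A minimal sketch on a three-row excerpt of the data above (the name `frame` is hypothetical):

```python
import numpy as np
import pandas as pd

# Three-row excerpt of the data above
frame = pd.DataFrame(
    {'Population': [np.nan, 1439323776, 1380004385],
     'Land Area': [8358140, 9388211, np.nan]},
    index=['Brazil', 'China', 'India'])

# Drop only the rows where 'Population' is missing; other columns may still hold NaN
kept = frame.dropna(subset=['Population'])

print(list(kept.index))
print(kept['Land Area'].isna().tolist())
```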

In [54]:
# dropna returns a new object, so the original DataFrame is unchanged

df_missing

|  | Population | Land Area |
|---|---|---|
| Brazil | NaN | 8358140.0 |
| China | 1439324000.0 | 9388211.0 |
| India | 1380004000.0 | NaN |
| Indonesia | 273523600.0 | 1811570.0 |
| Pakistan | 220892300.0 | 770880.0 |
| US | 331002700.0 | 9147420.0 |


#### Filling

In [55]:
# Filling with a specific value (here, the mean of the column)

df_missing['Population'].fillna(df_missing['Population'].mean())

Brazil       7.289494e+08
China        1.439324e+09
India        1.380004e+09
Indonesia    2.735236e+08
Pakistan     2.208923e+08
US           3.310027e+08
Name: Population, dtype: float64

In [56]:
# Forward fill (ffill): each gap takes the previous valid value in its column
# (fillna(method='ffill') is deprecated in recent pandas versions)

df_missing.ffill()

|  | Population | Land Area |
|---|---|---|
| Brazil | NaN | 8358140.0 |
| China | 1439324000.0 | 9388211.0 |
| India | 1380004000.0 | 9388211.0 |
| Indonesia | 273523600.0 | 1811570.0 |
| Pakistan | 220892300.0 | 770880.0 |
| US | 331002700.0 | 9147420.0 |


In [57]:
# Backward fill (bfill): each gap takes the next valid value in its column

df_missing.bfill()

|  | Population | Land Area |
|---|---|---|
| Brazil | 1439324000.0 | 8358140.0 |
| China | 1439324000.0 | 9388211.0 |
| India | 1380004000.0 | 1811570.0 |
| Indonesia | 273523600.0 | 1811570.0 |
| Pakistan | 220892300.0 | 770880.0 |
| US | 331002700.0 | 9147420.0 |


In [58]:
# Backward fill along the columns: each gap takes the value from the column to its right

df_missing.bfill(axis='columns')

|  | Population | Land Area |
|---|---|---|
| Brazil | 8358140.0 | 8358140.0 |
| China | 1439324000.0 | 9388211.0 |
| India | 1380004000.0 | NaN |
| Indonesia | 273523600.0 | 1811570.0 |
| Pakistan | 220892300.0 | 770880.0 |
| US | 331002700.0 | 9147420.0 |


<a name='import'></a>
## 4. IO in pandas

pandas has powerful functionality for dealing with different file formats. Here, we see how to read and write CSV and Excel files.

In [59]:
# Let's download the data

!curl -O https://raw.githubusercontent.com/AmirMardan/ml_course/main/data/house_intro_pandas.csv
!curl -O https://raw.githubusercontent.com/AmirMardan/ml_course/main/data/house_intro_pandas.xlsx




In [60]:
# Loading CSV file
df_csv = pd.read_csv('./house_intro_pandas.csv')

# We can use the method head() to see the first five rows of a DataFrame
df_csv.head()

|  | bedroom | price | area | furnish_type | bathroom | city |
|---|---|---|---|---|---|---|
| 0 | 2.0 | 20000.0 | 1450.0 | Furnished | 2.0 | Ahmedabad |
| 1 | 1.0 | 7350.0 | 210.0 | Semi-Furnished | 1.0 | Ahmedabad |
| 2 | 3.0 | 22000.0 | 1900.0 | Unfurnished | 3.0 | Ahmedabad |
| 3 | 2.0 | 13000.0 | 1285.0 | Semi-Furnished | 2.0 | Ahmedabad |
| 4 | 2.0 | 18000.0 | 1600.0 | Furnished | 2.0 | Ahmedabad |


In [61]:
# Saving CSV file
df_csv.to_csv('./house_intro_pandas1.csv', index=False)
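The `index=False` argument matters when the file is read back. The sketch below uses an in-memory buffer instead of a file and a made-up two-row frame (`houses` is a hypothetical name): saving with the index adds an unnamed first column, which `read_csv` loads as `Unnamed: 0`.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the housing data
houses = pd.DataFrame({'bedroom': [2, 1], 'city': ['Ahmedabad', 'Ahmedabad']})

with_index = houses.to_csv()                 # the index is written as the first column
without_index = houses.to_csv(index=False)   # only the data columns are written

reloaded_extra = pd.read_csv(io.StringIO(with_index))
reloaded_clean = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_extra.columns))
print(list(reloaded_clean.columns))
```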


In [62]:
# Loading Excel file
df_xlsx = pd.read_excel('./house_intro_pandas.xlsx')

df_xlsx.head()

|  | bedroom | price | area | furnish_type | bathroom | city |
|---|---|---|---|---|---|---|
| 0 | 2 | 20000 | 1450 | Furnished | 2 | Ahmedabad |
| 1 | 1 | 7350 | 210 | Semi-Furnished | 1 | Ahmedabad |
| 2 | 3 | 22000 | 1900 | Unfurnished | 3 | Ahmedabad |
| 3 | 2 | 13000 | 1285 | Semi-Furnished | 2 | Ahmedabad |
| 4 | 2 | 18000 | 1600 | Furnished | 2 | Ahmedabad |


In [63]:
# Saving Excel file

df_csv.to_excel('./house_intro_pandas1.xlsx', index=False)


### [TOP ☝️](#top)