# Python Data Analysis - Third Edition

## Pandas
### Creating pandas DataFrames

A **DataFrame** is a two-dimensional, labeled, tabular data structure made up of rows and columns. Each column can hold a different data type. DataFrames can store many kinds of objects and support grouping and joining operations, missing-value handling, pivot tables, and date handling.

A **Series** is a one-dimensional, labeled sequence that can hold any type of data, such as strings, numbers, datetimes, or Python objects like lists and dictionaries. Each column of a DataFrame is a Series.

In [5]:
import pandas as pd
import numpy as np

In [2]:
# creating DataFrame
data = {'Name': ['Mary', 'Jan', 'John', 'Angie'], 'Age': [23, 38, 45, 53]}
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age
0,Mary,23
1,Jan,38
2,John,45
3,Angie,53


In [3]:
data = [{'Name': 'Mary', 'Age': 23}, {'Name': 'Jan', 'Age': 38}, 
        {'Name': 'John', 'Age': 45}, {'Name': 'Angie', 'Age': 53}]
df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age
0,Mary,23
1,Jan,38
2,John,45
3,Angie,53


In [6]:
# creating Series
data = {0: 'Mary', 1: 'Jan', 2: 'John', 3: 'Angie'}
series_name = pd.Series(data)
series_name

0     Mary
1      Jan
2     John
3    Angie
dtype: object

In [7]:
arr = np.array([23, 38, 45, 53])
series_age = pd.Series(arr)
series_age

0    23
1    38
2    45
3    53
dtype: int64

In [9]:
df = pd.DataFrame()
df['Name'] = series_name
df['Age'] = series_age
df

Unnamed: 0,Name,Age
0,Mary,23
1,Jan,38
2,John,45
3,Angie,53
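
The column-by-column assignment above works because both Series share the same default integer index. As a minimal sketch (the variable names `names`, `ages`, and `people` are illustrative, not from the book), the same frame can be built in one step from a dict of Series. pandas aligns the Series on their index labels, so a label missing from one Series shows up as `NaN`:

```python
import pandas as pd

names = pd.Series(['Mary', 'Jan', 'John', 'Angie'])   # default index 0..3
ages = pd.Series([23, 38, 45], index=[0, 1, 2])       # deliberately lacks label 3

# The dict keys become column names; rows are matched by index label
people = pd.DataFrame({'Name': names, 'Age': ages})
print(people)
```

Because of the missing label, the `Age` column is stored as floating point so that it can hold `NaN`.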


In [10]:
series_scalar = pd.Series(10, index=['Mary', 'Jan', 'John', 'Angie'])
series_scalar

Mary     10
Jan      10
John     10
Angie    10
dtype: int64

In [11]:
df = pd.read_csv('../Python-Data-Analysis-Third-Edition/Chapter02/WHO_first9cols.csv')
df

Unnamed: 0,Country,CountryID,Continent,Adolescent fertility rate (%),Adult literacy rate (%),Gross national income per capita (PPP international $),Net primary school enrolment ratio female (%),Net primary school enrolment ratio male (%),Population (in thousands) total
0,Afghanistan,1,1,151.0,28.0,,,,26088.0
1,Albania,2,2,27.0,98.7,6000.0,93.0,94.0,3172.0
2,Algeria,3,3,6.0,69.9,5940.0,94.0,96.0,33351.0
3,Andorra,4,2,,,,83.0,83.0,74.0
4,Angola,5,3,146.0,67.4,3890.0,49.0,51.0,16557.0
...,...,...,...,...,...,...,...,...,...
197,Vietnam,198,6,25.0,90.3,2310.0,91.0,96.0,86206.0
198,West Bank and Gaza,199,1,,,,,,
199,Yemen,200,1,83.0,54.1,2090.0,65.0,85.0,21732.0
200,Zambia,201,3,161.0,68.0,1140.0,94.0,90.0,11696.0


In [15]:
series = df['Country']
type(series)

pandas.core.series.Series

In [24]:
# some DF attributes and methods
print(f'DF shape: {df.shape}')
print(f'DF columns list:\n {df.columns}')
print(f'DF columns types:\n {df.dtypes}')
print(f'DF count values:\n {df.Country.count()}')

DF shape: (202, 9)
DF columns list:
 Index(['Country', 'CountryID', 'Continent', 'Adolescent fertility rate (%)',
       'Adult literacy rate (%)',
       'Gross national income per capita (PPP international $)',
       'Net primary school enrolment ratio female (%)',
       'Net primary school enrolment ratio male (%)',
       'Population (in thousands) total'],
      dtype='object')
DF columns types:
 Country                                                    object
CountryID                                                   int64
Continent                                                   int64
Adolescent fertility rate (%)                             float64
Adult literacy rate (%)                                   float64
Gross national income per capita (PPP international $)    float64
Net primary school enrolment ratio female (%)             float64
Net primary school enrolment ratio male (%)               float64
Population (in thousands) total                           float64

In [19]:
# series slicing
series[-5:]

197               Vietnam
198    West Bank and Gaza
199                 Yemen
200                Zambia
201              Zimbabwe
Name: Country, dtype: object

In [21]:
series[3:10]

3                Andorra
4                 Angola
5    Antigua and Barbuda
6              Argentina
7                Armenia
8              Australia
9                Austria
Name: Country, dtype: object
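
The slices above select by position. With a default integer index, positions and labels coincide, which can hide the difference. The following sketch uses a made-up Series with hypothetical string labels to spell out the two forms: `.iloc` slices by position and excludes the end point, while `.loc` slices by label and includes it.

```python
import pandas as pd

# Hypothetical two-letter labels, used only for illustration
countries = pd.Series(['Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
                      index=['VN', 'YE', 'ZM', 'ZW'])

last_two = countries.iloc[-2:]         # positional: the last two items
by_label = countries.loc['YE':'ZM']    # label-based: both end points included

print(last_two)
print(by_label)
```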

In [25]:
# Quandl offers commercial and alternative financial data for investment analysts.
# The data is available through an API and through R, Python, or Excel.
!pip3 install Quandl

Collecting Quandl
  Downloading Quandl-3.6.1-py2.py3-none-any.whl (26 kB)
Collecting inflection>=0.3.1
  Downloading inflection-0.5.1-py2.py3-none-any.whl (9.5 kB)
Installing collected packages: inflection, Quandl
Successfully installed Quandl-3.6.1 inflection-0.5.1


In [27]:
import quandl
sunspots = quandl.get('SIDC/SUNSPOTS_A')

LimitExceededError: (Status 429) (Quandl Error QELx01) You have exceeded the anonymous user limit of 50 calls per day. To make more calls today, please register for a free Quandl account and then include your API key with your requests.

In [29]:
sunspots.head()

Unnamed: 0_level_0,Yearly Mean Total Sunspot Number,Yearly Mean Standard Deviation,Number of Observations,Definitive/Provisional Indicator
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1700-12-31,8.3,,,1.0
1701-12-31,18.3,,,1.0
1702-12-31,26.7,,,1.0
1703-12-31,38.3,,,1.0
1704-12-31,60.0,,,1.0


In [30]:
sunspots.tail()

Unnamed: 0_level_0,Yearly Mean Total Sunspot Number,Yearly Mean Standard Deviation,Number of Observations,Definitive/Provisional Indicator
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2016-12-31,39.8,3.9,9940.0,1.0
2017-12-31,21.7,2.5,11444.0,1.0
2018-12-31,7.0,1.1,12611.0,1.0
2019-12-31,3.6,0.5,12884.0,1.0
2020-12-31,8.8,4.1,14440.0,1.0


In [31]:
# filtering columns
sunspots_filtered = sunspots[['Yearly Mean Total Sunspot Number', 'Definitive/Provisional Indicator']]
sunspots_filtered

Unnamed: 0_level_0,Yearly Mean Total Sunspot Number,Definitive/Provisional Indicator
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
1700-12-31,8.3,1.0
1701-12-31,18.3,1.0
1702-12-31,26.7,1.0
1703-12-31,38.3,1.0
1704-12-31,60.0,1.0
...,...,...
2016-12-31,39.8,1.0
2017-12-31,21.7,1.0
2018-12-31,7.0,1.0
2019-12-31,3.6,1.0


In [32]:
# filtering rows
sunspots['20020101':'20131231']

Unnamed: 0_level_0,Yearly Mean Total Sunspot Number,Yearly Mean Standard Deviation,Number of Observations,Definitive/Provisional Indicator
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2002-12-31,163.6,9.8,6588.0,1.0
2003-12-31,99.3,7.1,7087.0,1.0
2004-12-31,65.3,5.9,6882.0,1.0
2005-12-31,45.8,4.7,7084.0,1.0
2006-12-31,24.7,3.5,6370.0,1.0
2007-12-31,12.6,2.7,6841.0,1.0
2008-12-31,4.2,2.5,6644.0,1.0
2009-12-31,4.8,2.5,6465.0,1.0
2010-12-31,24.9,3.4,6328.0,1.0
2011-12-31,80.8,6.7,6077.0,1.0
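
Row filtering by date strings relies on the frame having a `DatetimeIndex`. The sketch below reproduces the idea on a small synthetic frame (the values are made up and stand in for the sunspot data), using `.loc` to make the label-based selection explicit. Both ends of a date slice are inclusive, and a bare year string selects every row in that year.

```python
import pandas as pd

# Synthetic yearly data; index built explicitly to avoid version-specific freq aliases
idx = pd.to_datetime([f'{year}-12-31' for year in range(2000, 2006)])
demo = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)

subset = demo.loc['2002-01-01':'2004-12-31']   # rows dated 2002 through 2004
one_year = demo.loc['2003']                    # partial-string match on the year

print(subset)
print(one_year)
```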


In [33]:
# boolean filtering
sunspots[sunspots['Yearly Mean Total Sunspot Number'] > sunspots['Yearly Mean Total Sunspot Number'].mean()]

Unnamed: 0_level_0,Yearly Mean Total Sunspot Number,Yearly Mean Standard Deviation,Number of Observations,Definitive/Provisional Indicator
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1705-12-31,96.7,,,1.0
1717-12-31,105.0,,,1.0
1718-12-31,100.0,,,1.0
1726-12-31,130.0,,,1.0
1727-12-31,203.3,,,1.0
...,...,...,...,...
2003-12-31,99.3,7.1,7087.0,1.0
2011-12-31,80.8,6.7,6077.0,1.0
2012-12-31,84.5,6.7,5753.0,1.0
2013-12-31,94.0,6.9,5347.0,1.0
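
Boolean masks can also be combined. A small sketch on made-up data (the column names `sunspots` and `std_dev` are placeholders) shows the pattern: build each condition as its own boolean Series, then join them with `&` (and) or `|` (or). When conditions are written inline, each one needs its own parentheses.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'sunspots': [8.3, 96.7, 130.0, 4.2, 80.8],       # made-up sample values
    'std_dev': [np.nan, np.nan, np.nan, 2.5, 6.7],
})

above_mean = demo['sunspots'] > demo['sunspots'].mean()
has_std = demo['std_dev'].notnull()

# Keep rows that are above the mean AND have a recorded standard deviation
selected = demo[above_mean & has_std]
print(selected)
```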


In [37]:
# returns a small table with descriptive statistics
sunspots.describe()

Unnamed: 0,Yearly Mean Total Sunspot Number,Yearly Mean Standard Deviation,Number of Observations,Definitive/Provisional Indicator
count,321.0,203.0,203.0,321.0
mean,78.517134,7.892118,1691.857143,1.0
std,62.091523,3.86631,2913.060813,0.0
min,0.0,0.5,150.0,1.0
25%,24.2,4.55,365.0,1.0
50%,65.3,7.6,365.0,1.0
75%,115.2,10.35,366.0,1.0
max,269.3,19.1,14440.0,1.0


In [38]:
# returns the number of non-NaN items
sunspots.count()

Yearly Mean Total Sunspot Number    321
Yearly Mean Standard Deviation      203
Number of Observations              203
Definitive/Provisional Indicator    321
dtype: int64

In [48]:
# individual statistics are available as methods: .median(), .mean(), .min(), .max(), .mode(), .std()
sunspots.median()

Yearly Mean Total Sunspot Number     65.3
Yearly Mean Standard Deviation        7.6
Number of Observations              365.0
Definitive/Provisional Indicator      1.0
dtype: float64

In [54]:
# grouping
df.groupby('Continent').count()

Unnamed: 0_level_0,Country,CountryID,Adolescent fertility rate (%),Adult literacy rate (%),Gross national income per capita (PPP international $),Net primary school enrolment ratio female (%),Net primary school enrolment ratio male (%),Population (in thousands) total
Continent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,21,21,20,18,17,19,19,20
2,51,51,48,26,48,45,45,51
3,48,48,45,42,46,47,47,46
4,7,7,5,1,5,5,5,6
5,31,31,27,22,28,29,29,29
6,35,35,23,14,25,25,25,28
7,9,9,9,8,9,9,9,9


In [55]:
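# group the DataFrame by continent and take the mean adult literacy rate (%)
# selecting the column before calling mean() avoids averaging the non-numeric
# Country column, which raises a TypeError in pandas 2.x
df.groupby('Continent')['Adult literacy rate (%)'].mean()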

Continent
1    76.900000
2    97.911538
3    61.690476
4    91.600000
5    87.940909
6    87.607143
7    69.812500
Name: Adult literacy rate (%), dtype: float64

In [57]:
# group by multiple columns (select the numeric column before aggregating)
df.groupby(['Continent', 'Country'])['Adult literacy rate (%)'].mean()

Continent  Country    
1          Afghanistan    28.0
           Bahrain        86.5
           Cape Verde     81.2
           Djibouti        NaN
           Egypt          71.4
                          ... 
7          India          61.0
           Maldives       96.3
           Nepal          48.6
           Pakistan       49.9
           Sri Lanka      90.7
Name: Adult literacy rate (%), Length: 202, dtype: float64
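
Several aggregations can be computed per group in one call. The sketch below uses a made-up frame (the column name `Literacy` and its values are illustrative) and passes a list of function names to `agg()`. Missing values are skipped, so `count` reports only the non-null entries in each group.

```python
import pandas as pd

demo = pd.DataFrame({
    'Continent': [1, 1, 2, 2, 2],
    'Literacy': [28.0, 86.5, 98.7, None, 99.0],   # made-up values, one missing
})

# One row per continent, one column per aggregation
summary = demo.groupby('Continent')['Literacy'].agg(['count', 'mean', 'min', 'max'])
print(summary)
```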

In [60]:
# join multiple DataFrames using the merge() function
dest = pd.read_csv('../Python-Data-Analysis-Third-Edition/Chapter02/dest.csv')
tips = pd.read_csv('../Python-Data-Analysis-Third-Edition/Chapter02/tips.csv')
dest.head(10)

Unnamed: 0,EmpNr,Dest
0,5,The Hague
1,3,Amsterdam
2,9,Rotterdam


In [59]:
tips.head()

Unnamed: 0,EmpNr,Amount
0,5,10.0
1,9,5.0
2,7,2.5


In [61]:
# join DataFrames using inner join
df_inner = pd.merge(dest, tips, on='EmpNr', how='inner')
df_inner

Unnamed: 0,EmpNr,Dest,Amount
0,5,The Hague,10.0
1,9,Rotterdam,5.0


In [62]:
# join DataFrames using outer join
df_outer = pd.merge(dest, tips, on='EmpNr', how='outer')
df_outer

Unnamed: 0,EmpNr,Dest,Amount
0,5,The Hague,10.0
1,3,Amsterdam,
2,9,Rotterdam,5.0
3,7,,2.5


In [63]:
# join DataFrames using left join
df_left = pd.merge(dest, tips, on='EmpNr', how='left')
df_left

Unnamed: 0,EmpNr,Dest,Amount
0,5,The Hague,10.0
1,3,Amsterdam,
2,9,Rotterdam,5.0


In [64]:
# join DataFrames using right join
df_right = pd.merge(dest, tips, on='EmpNr', how='right')
df_right

Unnamed: 0,EmpNr,Dest,Amount
0,5,The Hague,10.0
1,9,Rotterdam,5.0
2,7,,2.5
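
When comparing join types it can help to see where each row came from. As a sketch, the two frames are rebuilt inline from the values shown above and merged with `indicator=True`, which adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`.

```python
import pandas as pd

# Rebuilt from the dest.csv and tips.csv contents displayed above
dest = pd.DataFrame({'EmpNr': [5, 3, 9],
                     'Dest': ['The Hague', 'Amsterdam', 'Rotterdam']})
tips = pd.DataFrame({'EmpNr': [5, 9, 7],
                     'Amount': [10.0, 5.0, 2.5]})

merged = pd.merge(dest, tips, on='EmpNr', how='outer', indicator=True)
print(merged)

# Rows that exist in only one of the two frames
unmatched = merged[merged['_merge'] != 'both']
print(unmatched)
```

Row order in an outer merge can differ between pandas versions, so it is safer to look rows up by key than by position.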


### Working with missing values

In [65]:
# check missing values in a DataFrame
# isnull() returns True wherever a value is missing;
# sum() then adds up the True values, giving the count of missing values per column
df.isnull().sum()

Country                                                    0
CountryID                                                  0
Continent                                                  0
Adolescent fertility rate (%)                             25
Adult literacy rate (%)                                   71
Gross national income per capita (PPP international $)    24
Net primary school enrolment ratio female (%)             23
Net primary school enrolment ratio male (%)               23
Population (in thousands) total                           13
dtype: int64

In [78]:
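# pd.isnull(df) is equivalent to df.isnull()
# this cell ran after the dropna() cell (In [77]), so every count is zero
pd.isnull(df).sum()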

Country                                                   0
CountryID                                                 0
Continent                                                 0
Adolescent fertility rate (%)                             0
Adult literacy rate (%)                                   0
Gross national income per capita (PPP international $)    0
Net primary school enrolment ratio female (%)             0
Net primary school enrolment ratio male (%)               0
Population (in thousands) total                           0
dtype: int64

In [76]:
df[df['Population (in thousands) total'].isnull()].head(5)

Unnamed: 0,Country,CountryID,Continent,Adolescent fertility rate (%),Adult literacy rate (%),Gross national income per capita (PPP international $),Net primary school enrolment ratio female (%),Net primary school enrolment ratio male (%),Population (in thousands) total
19,Bermuda,20,4,,,,,,
39,"Congo, Dem. Rep.",40,3,125.0,67.2,270.0,47.0,60.0,
40,"Congo, Rep.",41,3,132.0,84.4,2420.0,52.0,58.0,
62,French Polynesia,63,6,,,,,,
76,"Hong Kong, China",77,6,,,,,,


In [77]:
# drop rows with missing values from the DataFrame
# the inplace=True parameter modifies the original DataFrame instead of returning a new one
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 118 entries, 1 to 200
Data columns (total 9 columns):
 #   Column                                                  Non-Null Count  Dtype  
---  ------                                                  --------------  -----  
 0   Country                                                 118 non-null    object 
 1   CountryID                                               118 non-null    int64  
 2   Continent                                               118 non-null    int64  
 3   Adolescent fertility rate (%)                           118 non-null    float64
 4   Adult literacy rate (%)                                 118 non-null    float64
 5   Gross national income per capita (PPP international $)  118 non-null    float64
 6   Net primary school enrolment ratio female (%)           118 non-null    float64
 7   Net primary school enrolment ratio male (%)             118 non-null    float64
 8   Population (in thousands) total          
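
Dropping every row that has any missing value removed a large share of the data here (118 entries remain). A less aggressive option, sketched on a made-up frame with hypothetical column names, is to restrict the check to certain columns with `subset` or to require a minimum number of non-null values with `thresh`. Without `inplace=True`, `dropna()` returns a new frame and leaves the original untouched.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'country': ['A', 'B', 'C'],                 # made-up sample
    'literacy': [28.0, np.nan, 69.9],
    'population': [np.nan, np.nan, 74.0],
})

all_complete = demo.dropna()                       # rows with no missing values
need_literacy = demo.dropna(subset=['literacy'])   # only 'literacy' must be present
at_least_two = demo.dropna(thresh=2)               # keep rows with >= 2 non-null values

print(all_complete)
print(need_literacy)
print(at_least_two)
```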

In [80]:
# fill the missing values with a constant such as zero; a mean or median can be passed instead
df.fillna(0, inplace=True)
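
The cell above fills with zero only. As a minimal sketch on a made-up frame (column names and values are illustrative), the mean and median variants pass a Series of per-column statistics to `fillna()`, so each column is filled with its own value. A dict sets a different constant per column.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'literacy': [28.0, np.nan, 69.9, 98.7],        # made-up sample
    'population': [26088.0, 3172.0, np.nan, 74.0],
})

filled_mean = demo.fillna(demo.mean())             # each column's own mean
filled_median = demo.fillna(demo.median())         # each column's own median
filled_const = demo.fillna({'literacy': 0.0, 'population': -1.0})

print(filled_mean)
print(filled_median)
print(filled_const)
```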

### Creating pivot tables

In [81]:
purchase = pd.read_csv('../Python-Data-Analysis-Third-Edition/Chapter02/purchase.csv')
purchase.head()

Unnamed: 0,Weather,Food,Price,Number
0,cold,soup,3.745401,8
1,hot,soup,9.507143,8
2,cold,icecream,7.319939,8
3,hot,chocolate,5.986585,8
4,cold,icecream,1.560186,8


In [89]:
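# rows are Weather values, columns are Food values, cells hold the summed Number
# the string 'sum' is preferred over np.sum, which newer pandas versions warn about
pd.pivot_table(purchase, values='Number', index=['Weather'], columns=['Food'], aggfunc='sum')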

Food,chocolate,icecream,soup
Weather,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
cold,,16.0,16.0
hot,8.0,8.0,8.0
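
The empty cell for cold/chocolate appears because no such combination exists in the data. The sketch below rebuilds only the first five rows shown by `purchase.head()` (so the totals differ from the full file) and passes `fill_value=0` to replace missing combinations with zero.

```python
import pandas as pd

# First five rows of purchase.csv as displayed above, without the Price column
purchase_demo = pd.DataFrame({
    'Weather': ['cold', 'hot', 'cold', 'hot', 'cold'],
    'Food': ['soup', 'soup', 'icecream', 'chocolate', 'icecream'],
    'Number': [8, 8, 8, 8, 8],
})

table = pd.pivot_table(purchase_demo, values='Number', index='Weather',
                       columns='Food', aggfunc='sum', fill_value=0)
print(table)
```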


In [86]:
purchase['Weather'].value_counts()

cold    4
hot     3
Name: Weather, dtype: int64

In [87]:
purchase['Food'].value_counts()

soup         3
icecream     3
chocolate    1
Name: Food, dtype: int64

### Dealing with dates

In [94]:
# date_range() generates a sequence of dates and times at a fixed-frequency interval
# the freq parameter accepts aliases such as D for calendar day, B for business day,
# W for weekly, H for hourly, T (or min) for minute, S for second,
# L for millisecond, and U for microsecond frequency; M means month end, not minute
# (pandas 2.2 and later prefer the lowercase forms h, min, s, ms, and us)
pd.date_range('01-01-2021', periods=45, freq='D')

DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
               '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08',
               '2021-01-09', '2021-01-10', '2021-01-11', '2021-01-12',
               '2021-01-13', '2021-01-14', '2021-01-15', '2021-01-16',
               '2021-01-17', '2021-01-18', '2021-01-19', '2021-01-20',
               '2021-01-21', '2021-01-22', '2021-01-23', '2021-01-24',
               '2021-01-25', '2021-01-26', '2021-01-27', '2021-01-28',
               '2021-01-29', '2021-01-30', '2021-01-31', '2021-02-01',
               '2021-02-02', '2021-02-03', '2021-02-04', '2021-02-05',
               '2021-02-06', '2021-02-07', '2021-02-08', '2021-02-09',
               '2021-02-10', '2021-02-11', '2021-02-12', '2021-02-13',
               '2021-02-14'],
              dtype='datetime64[ns]', freq='D')
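
To illustrate the `freq` parameter beyond daily steps, here is a short sketch using two other aliases, plus the alternative of giving `end` instead of `periods`. The start date 2021-01-01 falls on a Friday, so the business-day range skips the weekend, and the plain `W` alias lands on Sundays.

```python
import pandas as pd

business_days = pd.date_range('2021-01-01', periods=5, freq='B')   # Mon-Fri only
weekly = pd.date_range('2021-01-01', periods=3, freq='W')          # weekly, on Sundays
by_end = pd.date_range(start='2021-01-01', end='2021-01-10', freq='D')

print(business_days)
print(weekly)
print(by_end)
```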

In [95]:
# to_datetime() converts a timestamp string into a datetime object
pd.to_datetime('1/1/1970')

Timestamp('1970-01-01 00:00:00')

In [96]:
pd.to_datetime(['20200101', '20200102'], format='%Y%m%d')

DatetimeIndex(['2020-01-01', '2020-01-02'], dtype='datetime64[ns]', freq=None)
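
Real-world date columns often contain entries that do not match the expected format. By default `to_datetime()` raises an error on such values. The sketch below (the input list is made up) passes `errors='coerce'`, which turns unparseable entries into `NaT` so they can be handled like other missing values.

```python
import pandas as pd

raw = ['20200101', '20200102', 'not-a-date']   # last entry is intentionally invalid
parsed = pd.to_datetime(raw, format='%Y%m%d', errors='coerce')

print(parsed)
print(parsed.isnull().sum())   # number of entries that failed to parse
```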

