# 7. Tutorial 4: Descriptive analysis and data visualisation (1)

#### Naoki TANI
#### Center for Advanced Policy Studies (CAPS), Institute of Economic Research, Kyoto University
#### April 25, 2024

In [1]:
import numpy as np
import pandas as pd
from IPython.display import Image

## 1. Descriptive analysis with pandas

### 1-1. Pandas `Series` and `DataFrame`

#### [pandas](https://pandas.pydata.org/) is a Python library that provides the `Series` and `DataFrame` objects for managing data. This lecture covers how to use both.
#### A `Series` is a one-dimensional array of indexed data. All values in a `Series` share one data type (dtype); the strings in the example below are reported as dtype `object`.

In [2]:
pref = ['Kochi', 'Ehime', 'Kagawa', 'Tokushima']
prefecture = pd.Series(pref)
prefecture # The values are labeled with their index number. First value has index 0, second value has index 1 etc.

0        Kochi
1        Ehime
2       Kagawa
3    Tokushima
dtype: object
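
#### As a minimal sketch of the single-dtype rule, the cell below builds a numeric `Series` with custom labels instead of the default 0, 1, 2, ... index. The figures are the population values used for `prefstat` later in this tutorial; the variable name `population` is an assumption made for illustration.

```python
import pandas as pd

# Population figures taken from the prefstat table used later in this tutorial
population = pd.Series([736000, 1396000, 980000, 763000],
                       index=['Kochi', 'Ehime', 'Kagawa', 'Tokushima'])

print(population.dtype)      # an integer dtype, because every value is an int
print(population['Ehime'])   # look up a value by its label
```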

#### A `Series` consists of two parts, the data and its labels, exposed as the `.values` and `.index` attributes.

In [3]:
print(prefecture.values)
print(prefecture.index)

['Kochi' 'Ehime' 'Kagawa' 'Tokushima']
RangeIndex(start=0, stop=4, step=1)


In [4]:
prefecture[1]

'Ehime'
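
#### With the default `RangeIndex`, the label `1` and the position `1` coincide, which is why `prefecture[1]` returns `'Ehime'`. A hedged sketch of the more explicit accessors: `.loc` selects by label and `.iloc` selects by position. The second Series `s` and its labels `'a'`, `'b'` are made up for illustration.

```python
import pandas as pd

prefecture = pd.Series(['Kochi', 'Ehime', 'Kagawa', 'Tokushima'])

by_label = prefecture.loc[1]       # label-based lookup
by_position = prefecture.iloc[1]   # position-based lookup
print(by_label, by_position)

# With non-default labels the two accessors take different keys
s = pd.Series(['Kochi', 'Ehime'], index=['a', 'b'])
print(s.loc['b'], s.iloc[1])
```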

#### `DataFrame` is a two-dimensional table of data with columns and rows. The columns are made up of `Series` objects.

In [5]:
prefstat = pd.DataFrame(
    {'population': [736000, 1396000, 980000, 763000],
     'gdp': [2349510000000, 4756495000000, 3672273000000, 3012328000000],
     'income': [1866110000000, 3516676000000, 2835364000000, 2219318000000]},
    index=prefecture)  # the prefecture names become the row labels
prefstat

           population            gdp         income
Kochi          736000  2349510000000  1866110000000
Ehime         1396000  4756495000000  3516676000000
Kagawa         980000  3672273000000  2835364000000
Tokushima      763000  3012328000000  2219318000000


In [6]:
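# A notebook cell displays only the value of its last expression,
# so only prefstat.columns appears in the output below.
prefstat.values   # the data as a 2-D NumPy array
prefstat.index    # row labels
prefstat.columns  # column labels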

Index(['population', 'gdp', 'income'], dtype='object')

In [7]:
prefstat.population

Kochi         736000
Ehime        1396000
Kagawa        980000
Tokushima     763000
Name: population, dtype: int64
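
#### A column can also be selected with brackets, `prefstat['population']`, which still works when a column name contains spaces or clashes with a DataFrame attribute. The sketch below also derives a new column by element-wise arithmetic; the column name `gdp_per_capita` is a hypothetical addition, not part of the original table.

```python
import pandas as pd

prefstat = pd.DataFrame(
    {'population': [736000, 1396000, 980000, 763000],
     'gdp': [2349510000000, 4756495000000, 3672273000000, 3012328000000]},
    index=['Kochi', 'Ehime', 'Kagawa', 'Tokushima'])

# Bracket access returns the same Series as attribute access
same = prefstat['population'].equals(prefstat.population)
print(same)

# Arithmetic between columns is applied row by row;
# 'gdp_per_capita' is a hypothetical column name
prefstat['gdp_per_capita'] = prefstat['gdp'] / prefstat['population']
print(prefstat)
```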

### 1-2. Working with consumption data

In [None]:
# We can read a CSV file with the pandas read_csv function.
df = pd.read_csv('consumption_data.csv')

# drop rows if all values in the rows are NaN
df.dropna(how='all', inplace=True)

df;
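
#### To see what `dropna(how='all')` does without needing `consumption_data.csv`, the sketch below reads a small CSV from a string. The text in `csv_text` and its column names `item` and `amount` are made up for illustration and are not the structure of the real file.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = "item,amount\nrice,100\n,\nbread,\n"
sample = pd.read_csv(io.StringIO(csv_text))

# how='all' drops a row only when every value in it is NaN
cleaned = sample.dropna(how='all')

# The default, how='any', drops a row as soon as one value is NaN
strict = sample.dropna()

print(len(sample), len(cleaned), len(strict))
```

#### Here the row for `bread` survives `how='all'` because its `item` value is present, but it is removed by the default `how='any'`.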

#### Data from a CSV file is stored in a DataFrame object.

In [None]:
type(df)

#### We can see some basic information about a dataframe by using `.info()`, `.columns`, and `.shape`.

In [None]:
df.info()  # info() prints its report directly and returns None, so print() is not needed
print(df.columns)
print('The shape of the dataframe is',df.shape)
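
#### A small sketch of the same three tools on the prefecture table from section 1-1, so the results can be checked by eye. The names `n_rows`, `n_cols`, and `result` are assumptions made for illustration.

```python
import pandas as pd

prefstat = pd.DataFrame(
    {'population': [736000, 1396000, 980000, 763000],
     'gdp': [2349510000000, 4756495000000, 3672273000000, 3012328000000]},
    index=['Kochi', 'Ehime', 'Kagawa', 'Tokushima'])

# .shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = prefstat.shape
print(n_rows, n_cols)

# .columns holds the column labels
print(list(prefstat.columns))

# .info() writes its report to the screen and returns None
result = prefstat.info()
```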

In [None]:
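# The '金額' (amount) column is read as strings because the values contain
# thousands separators. Remove the commas, then convert the column to integers.
df['金額'] = df['金額'].str.replace(',', '').astype(int)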
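
#### The conversion can be tried on a few made-up values. The strings in `amount` and the text in `csv_text` are hypothetical and only mimic numbers written with thousands separators. The second half shows an alternative, the `thousands` parameter of `read_csv`, which parses such numbers at load time; whether it suits the real file depends on how that file is formatted.

```python
import io
import pandas as pd

# Hypothetical values written with thousands separators
amount = pd.Series(['1,200', '35,000', '980'])

# Remove the commas as plain text (regex=False), then convert to integers
as_int = amount.str.replace(',', '', regex=False).astype(int)
print(as_int)

# Alternative: let read_csv handle the separators while loading
csv_text = 'item,amount\nrice,"1,200"\nbread,"35,000"\n'
parsed = pd.read_csv(io.StringIO(csv_text), thousands=',')
print(parsed['amount'])
```

#### Note that `.astype(int)` raises an error if the column still contains missing values, so any NaN entries need to be dropped or filled first.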

#### Each column has a single data type. `.describe()` reports summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numeric columns.

In [None]:
df.describe()
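
#### As a sketch of how to read the result, the cell below runs `.describe()` on the population column of the prefecture table from section 1-1. The result is itself a DataFrame, so single statistics can be picked out with `.loc`. The variable name `summary` is an assumption made for illustration.

```python
import pandas as pd

prefstat = pd.DataFrame(
    {'population': [736000, 1396000, 980000, 763000]},
    index=['Kochi', 'Ehime', 'Kagawa', 'Tokushima'])

summary = prefstat.describe()
print(summary)

# Rows of the summary are labelled by statistic, columns by variable
mean_population = summary.loc['mean', 'population']
print(mean_population)
```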