Importing pandas, NumPy and matplotlib.

In [40]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

The CSV file begins with 8 rows of miscellaneous information ahead of the actual table. We pass skiprows=8 to skip those rows and avoid read errors.

In [41]:
df = pd.read_csv('data/FEI_PREF_190228112345.csv', skiprows=8)
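To see what skiprows does in isolation, here is a minimal sketch that reads from an in-memory string instead of the real file. The preamble text and the three-column table are made up for illustration; only the skiprows=8 argument mirrors the cell above.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: 8 preamble lines, then the header row.
preamble = "\n".join(f"metadata line {i}" for i in range(1, 9))
table = "AREA,YEAR,TOTAL\nOkinawa-ken,2016,1439000\nOkinawa-ken,2015,1433566\n"
raw = preamble + "\n" + table

# skiprows=8 discards the first 8 lines, so the 9th line is parsed as the header.
demo = pd.read_csv(io.StringIO(raw), skiprows=8)
print(demo.columns.tolist())
print(len(demo))
```

Without skiprows, pandas would try to treat the first preamble line as the header, which is the kind of read problem the argument avoids.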

Let's take a look at our data using pandas' head(), which shows the first 5 rows.

In [42]:
df.head()

,AREA Code,AREA,YEAR Code,YEAR,/ITEMS,A1101_Total population (Both sexes)[person],Annotation,A110101_Total population (Male)[person],Annotation.1,A110102_Total population (Female)[person],Annotation.2
0,47000,Okinawa-ken,2016100000,2016,,1439000,,708000,,732000,
1,47000,Okinawa-ken,2015100000,2015,,1433566,,704619,,728947,
2,47000,Okinawa-ken,2014100000,2014,,1426000,,700000,,725000,
3,47000,Okinawa-ken,2013100000,2013,,1419000,,697000,,722000,
4,47000,Okinawa-ken,2012100000,2012,,1411000,,693000,,719000,


Now let's look at the kind of information stored in the dataframe.

In [43]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 11 columns):
AREA Code                                      42 non-null int64
AREA                                           42 non-null object
YEAR Code                                      42 non-null int64
YEAR                                           42 non-null int64
/ITEMS                                         0 non-null float64
A1101_Total population (Both sexes)[person]    42 non-null object
Annotation                                     0 non-null float64
A110101_Total population (Male)[person]        42 non-null object
Annotation.1                                   0 non-null float64
A110102_Total population (Female)[person]      42 non-null object
Annotation.2                                   0 non-null float64
dtypes: float64(4), int64(3), object(4)
memory usage: 3.7+ KB
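Note that info() reports the three population columns as object rather than a numeric dtype. The notebook does not say why, so the following is only a sketch under an assumption: that the values are strings, possibly with thousands separators. The sample frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical sample: population counts stored as strings (dtype object).
sample = pd.DataFrame({
    "TOTAL": ["1,439,000", "1433566"],
    "MALE": ["708000", "704,619"],
})

# Assumption: commas are the only non-numeric characters.
# Strip them, then let pd.to_numeric pick a numeric dtype.
numeric = sample.apply(
    lambda col: pd.to_numeric(col.str.replace(",", "", regex=False))
)
print(numeric.dtypes)
```

If the real columns hold something else (for example footnote markers), the cleaning step would need to change accordingly.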


There are a lot of columns that aren't required for our analysis, so we can drop them from the dataframe. I kept it simple with a single dropna call, which removes every column that contains NaN values.

In [44]:
df = df.dropna(axis='columns')
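One thing to keep in mind: with the default how='any', dropna(axis='columns') removes a column if even one value is missing. That is fine here because the unwanted columns are completely empty and the rest are complete, but the sketch below (on a made-up frame) shows how how='all' would behave differently if a data column had a gap.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: TOTAL has one gap, Annotation is entirely empty.
sample = pd.DataFrame({
    "YEAR": [2016, 2015],
    "TOTAL": [1439000, np.nan],
    "Annotation": [np.nan, np.nan],
})

# Default how='any': drop a column if it contains at least one NaN.
any_dropped = sample.dropna(axis="columns")
# how='all': drop a column only if every value in it is NaN.
all_dropped = sample.dropna(axis="columns", how="all")

print(any_dropped.columns.tolist())
print(all_dropped.columns.tolist())
```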

Let's see what it looks like now.

In [45]:
df.head()

,AREA Code,AREA,YEAR Code,YEAR,A1101_Total population (Both sexes)[person],A110101_Total population (Male)[person],A110102_Total population (Female)[person]
0,47000,Okinawa-ken,2016100000,2016,1439000,708000,732000
1,47000,Okinawa-ken,2015100000,2015,1433566,704619,728947
2,47000,Okinawa-ken,2014100000,2014,1426000,700000,725000
3,47000,Okinawa-ken,2013100000,2013,1419000,697000,722000
4,47000,Okinawa-ken,2012100000,2012,1411000,693000,719000


Now let's set the YEAR column as the index. I'll pass drop explicitly, just so I remember that the column is removed from the data once it becomes the index, and inplace so that the dataframe is modified in place instead of a new object being returned.

In [46]:
df.set_index('YEAR', drop=True, inplace=True)

Before we look at our dataframe, let's drop the columns AREA Code, AREA and YEAR Code. They just take up extra space, and I already know all of this data is for Okinawa Prefecture.

In [47]:
df.drop(['AREA Code', 'AREA', 'YEAR Code'], axis=1, inplace=True)

Again, before we look at the finished dataframe, let's do a final clean-up of the remaining column names to make them a bit more readable. I'm going to use the pandas set_axis method.

In [48]:
# Assign the result back: the inplace argument of set_axis was removed in pandas 2.0.
df = df.set_axis(['TOTAL', 'MALE', 'FEMALE'], axis='columns')
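The three clean-up steps above (set the index, drop columns, rename) can also be written as a single method chain that returns a new dataframe at each step. This is just an alternative sketch, not the approach the notebook takes; the two-row sample frame is a hypothetical stand-in that reuses the column names and values shown earlier.

```python
import pandas as pd

# Hypothetical two-row stand-in with the same columns as the cleaned CSV.
sample = pd.DataFrame({
    "AREA Code": [47000, 47000],
    "AREA": ["Okinawa-ken", "Okinawa-ken"],
    "YEAR Code": [2016100000, 2015100000],
    "YEAR": [2016, 2015],
    "A1101_Total population (Both sexes)[person]": [1439000, 1433566],
    "A110101_Total population (Male)[person]": [708000, 704619],
    "A110102_Total population (Female)[person]": [732000, 728947],
})

tidy = (
    sample
    .set_index("YEAR")                                  # YEAR becomes the index
    .drop(columns=["AREA Code", "AREA", "YEAR Code"])   # constant or redundant columns
    .set_axis(["TOTAL", "MALE", "FEMALE"], axis="columns")  # shorter names
)
print(tidy)
```

Chaining leaves the original frame untouched, which makes it easier to re-run a cell without hitting errors from columns that have already been dropped.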

Now let's take a look at the whole dataframe.

In [49]:
df

YEAR,TOTAL,MALE,FEMALE
2016,1439000,708000,732000
2015,1433566,704619,728947
2014,1426000,700000,725000
2013,1419000,697000,722000
2012,1411000,693000,719000
2011,1402000,688000,714000
2010,1392818,683328,709490
2009,1385000,679000,706000
2008,1378000,675000,703000
2007,1374000,674000,700000


Now let's plot just a few years as a stacked bar graph. This will utilize the matplotlib library.