Importing pandas, NumPy and matplotlib.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
%matplotlib inline

ImportError: No module named sklearn

The CSV file contains some miscellaneous information in its first 8 rows. We use skiprows to skip them and avoid any read errors.

In [None]:
df = pd.read_csv('data/FEI_PREF_190228112345.csv', skiprows=8)
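
Here is a minimal sketch of what skiprows does, in case it helps. The in-memory text below is a made-up stand-in for the real file (the note lines and numbers are assumptions, not the actual data).

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: 8 lines of preamble, then the header row.
raw = "\n".join(
    [f"note {i}" for i in range(8)]
    + ["YEAR,TOTAL", "2015,100", "2016,110"]
)

# skiprows=8 discards the first 8 lines, so the 9th line is read as the header.
demo = pd.read_csv(io.StringIO(raw), skiprows=8)
print(demo.columns.tolist())
print(len(demo))
```

Without skiprows, pandas would try to use the first preamble line as the header.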

Let's take a look at our data using pandas head() to see the top 5 rows and tail() to see the last 5 rows. These are two important commands to run to get a good look at your dataframe.

In [None]:
df.head()

In [None]:
df.tail()

Now let's look at the kind of info inside the dataframe.

In [None]:
df.info()

There are a lot of columns that aren't required for our analysis, so we can drop those from the dataframe. I kept it simple with a single dropna command that drops every column containing a NaN value.

In [None]:
df = df.dropna(axis = 'columns')
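
One thing worth knowing about this shortcut: by default dropna removes a column even if only one value in it is missing. A small sketch on made-up data (the column names NOTE and PARTIAL are hypothetical) shows the difference between the default and how='all'.

```python
import numpy as np
import pandas as pd

# Made-up data: NOTE is entirely empty, PARTIAL has a single missing value.
demo = pd.DataFrame({
    "YEAR": [2015, 2016, 2017],
    "TOTAL": [100, 110, 120],
    "NOTE": [np.nan, np.nan, np.nan],
    "PARTIAL": [1.0, np.nan, 3.0],
})

# Default (how="any"): a column is dropped if it contains any NaN at all.
cleaned = demo.dropna(axis="columns")
print(cleaned.columns.tolist())

# how="all": a column is dropped only if every value in it is NaN.
only_empty = demo.dropna(axis="columns", how="all")
print(only_empty.columns.tolist())
```

So the default is fine here as long as none of the columns we care about has a stray missing value.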

Let's see what it looks like now.

In [None]:
df.head()

Before we have another look at our dataframe, let's drop the columns AREA Code, AREA and YEAR Code. These are just taking up extra space, and I already know all this data is for Okinawa Prefecture.

In [None]:
df.drop(['AREA Code', 'AREA', 'YEAR Code'], axis=1, inplace=True)

Again, before we look at our finished dataframe, let's do a final cleanup of the remaining column names to make them a bit more readable. I'm going to use the pandas set_axis method.

In [None]:
# set_axis no longer accepts inplace= as of pandas 2.0, so assign the result back
df = df.set_axis(['YEAR', 'TOTAL', 'MALE', 'FEMALE'], axis='columns')
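
A quick sketch of the renaming step on a toy frame, assuming a recent pandas where set_axis returns a new dataframe. The original column names "col a" and "col b" are made up. set_axis relabels by position, while rename maps old names to new ones, which is a bit safer if the column order ever changes.

```python
import pandas as pd

# Toy frame with hypothetical column names.
demo = pd.DataFrame({"col a": [1], "col b": [2]})

# Positional relabeling: the list must match the column order exactly.
renamed = demo.set_axis(["A", "B"], axis="columns")

# Name-based relabeling: each old name is mapped to its new name.
mapped = demo.rename(columns={"col a": "A", "col b": "B"})

print(renamed.columns.tolist())
print(mapped.columns.tolist())
```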

Now let's take a look at the dataframe head() again.

In [None]:
df.head()

Looks sharp. Now let's check the shape and info one final time.

In [None]:
df.shape

In [None]:
df.info()
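
The numeric-type check can also be done in code instead of by eye. This is only a sketch on made-up data (the AREA column is there just to show a non-numeric hit), using pandas' is_numeric_dtype helper.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up frame: AREA holds text, the other columns hold numbers.
demo = pd.DataFrame({
    "YEAR": [2015, 2016],
    "TOTAL": [100, 110],
    "MALE": [49.0, 54.0],
    "AREA": ["Okinawa", "Okinawa"],
})

# Collect the names of any columns that are not integer or float typed.
non_numeric = [col for col in demo.columns if not is_numeric_dtype(demo[col])]

print(demo.shape)    # shape is an attribute: (rows, columns)
print(non_numeric)
```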

I ran shape and info() to make sure all values are stored as integers or floats. I also wanted to see how big our dataframe is and confirm our columns are counted properly. Time to run a quick and dirty plot.

In [None]:
df.plot(x='YEAR', y='TOTAL', style='o')
plt.title('Population Of Okinawa')
plt.xlabel('Years')
plt.ylabel('Population')
plt.show()

In [None]:
df.plot(x='YEAR', y=['MALE', 'FEMALE'], stacked=True, kind='bar')
ax = df['TOTAL'].plot(secondary_y=True, color='k', marker='o')
ax.set_ylabel('Total')
plt.show()

This graph is ugly, but we can fix that later. Right now I just remembered we should flip our entire dataframe so that the oldest year is on top, indexed as 0. Let's flip the dataframe!

In [None]:
df.head()

The easiest way to reverse the order is a step slice. The slice syntax is [start:end:step], so leaving the first two blank and putting -1 in the step position selects every row in reverse order. Thanks to user Grote in the Python Discord for the help!

In [None]:
df = df.iloc[::-1]

Now we will reset the index, passing drop=True so the old index is discarded instead of being kept as a new column.

In [None]:
df = df.reset_index(drop=True)
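
To see why both steps are needed, here is a small sketch on made-up numbers. Reversing with iloc keeps the old index labels attached to the rows, and reset_index(drop=True) then renumbers them from 0. Sorting by YEAR is shown as an alternative that doesn't depend on the file being in exact reverse order.

```python
import pandas as pd

# Made-up data, newest year first like the raw file.
demo = pd.DataFrame({"YEAR": [2017, 2016, 2015], "TOTAL": [120, 110, 100]})

# Step slice with -1 reverses the rows; the old labels travel with them.
flipped = demo.iloc[::-1]
print(flipped.index.tolist())

# Renumber from 0 and throw away the old labels.
flipped = flipped.reset_index(drop=True)
print(flipped.index.tolist())

# Alternative: sort by the year column, then renumber.
by_year = demo.sort_values("YEAR").reset_index(drop=True)
print(by_year["YEAR"].tolist())
```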

Let's check our work.

In [None]:
df.head()

Awesome. Now let's try that plot one more time. Also, let's unstack the bars, since we can't really tell if one is growing faster than the other.

In [None]:
df.plot(x='YEAR', y=['MALE', 'FEMALE'], stacked=False, kind='bar')
ax = df['TOTAL'].plot(secondary_y=True, color='k', marker='o')
ax.set_ylabel('Total')
plt.show()

Much better. It now reads normally from left to right. But this is too compact and ugly. Let's try a line graph.

In [None]:
df.plot(x='YEAR', y=['MALE', 'FEMALE'], grid=True)

Interesting. Just by looking at this simple graph, we can see a gap between the female and male populations, and it might be growing. Also, while the female population looks fairly regular, the male population had a slightly noticeable slowdown sometime before 2010, maybe 2008. Let me remember how many rows of data we have.

In [None]:
df.shape

Oh right, 42. OK, so let me look at the bottom 15 rows using a slice again.

In [None]:
df.iloc[-15:]

Great, 2008 is in there, with some earlier years on top to chew on. Let's find the percentage change in this little part of the dataset to see how big that slowdown was and exactly where it happened. We'll use loc to select only part of the dataframe and pct_change to compute the change from each year to the next.

In [None]:
df.loc[27:, ['MALE', 'FEMALE']].pct_change()
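
A note on reading the output: pct_change returns fractions, not percentages, and the first row is always NaN because it has no previous row to compare against. A minimal sketch on made-up numbers:

```python
import pandas as pd

# Made-up populations for three consecutive years.
demo = pd.DataFrame({
    "MALE": [100.0, 101.0, 101.5],
    "FEMALE": [100.0, 102.0, 104.04],
})

# Each value is (current - previous) / previous, so 0.01 means 1 percent.
change = demo.pct_change()
print(change)

# Multiply by 100 to read the values as percentages.
percent = change * 100

# idxmin skips the NaN row and returns the index label of the smallest change.
slowest = change["MALE"].idxmin()
print(slowest)
```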

You can see that while the FEMALE change over these years is generally above .4 percent, MALE had a slowdown at index 33, which is year 2008. The change dropped to .001484 here. Since pct_change returns fractions, that is about 0.15 percent. OK. Now let's move on to predictions!

In [None]:
X = df.iloc[:,0].values
Y = df.iloc[:,1].values