### Jupyter notebook demo from Python Tri-Cities, WA interest group

In this notebook we will work with some data from Wikipedia, specifically
the list of US cities by population, located here:
https://en.wikipedia.org/wiki/List_of_United_States_cities_by_population

Always remember data is never clean!

To run this notebook, clone the git repository.
You will also need to have Python 3 installed.
See this for information on how to do that:
https://github.com/PythonTriCities/starthere/blob/master/README.md
Then pip install the requirements.txt file and start the notebook.
From the git repository, type:
```
pip install -r requirements.txt
jupyter notebook
```
If you have trouble, or are on a Windows machine, try using Anaconda:
https://www.anaconda.com/download/

Use the [>| Run] button above or press Shift + Enter on each cell.

In [None]:
import pandas as pd

In [None]:
import matplotlib.pyplot as plt

Now let's import the data into what is called a data frame, or df for short.

In [None]:
df = pd.read_table('data/2016_city_data2')

In [None]:
df.head()
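
If you do not have the data file handy, here is a minimal sketch of what `pd.read_table` does, using a tiny made-up tab-separated sample held in memory. The `sample` text and its values are illustrative assumptions, not the Wikipedia data.

```python
import io
import pandas as pd

# Hypothetical tab-separated sample; not the real city data.
sample = "city\tstate\tpopulation\nAlpha\tWA\t1,000\nBeta\tOR\t250\n"

# read_table splits on tabs by default; read_csv needs sep="\t" to match.
demo = pd.read_table(io.StringIO(sample))
same = pd.read_csv(io.StringIO(sample), sep="\t")

print(demo.head())
print(demo.equals(same))
```

Note that the population values keep their commas, so pandas reads that column as text rather than numbers.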

So we have some data. Now let's look at the columns and determine why they don't line up.

In [None]:
df.columns

In [None]:
len(df.columns)

So it looks like we have 9 columns. Let's look at the "zeroth" location to see what that looks like.

In [None]:
df.iloc[0]

So it appears that the Name row has a multilevel index, that is, 2 values (1, New York[6]).
Let's look closer at the index.

In [None]:
df.index
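
One plausible explanation for the multilevel index, sketched here with made-up data: when the data rows contain more fields than the header names, pandas uses the extra leading fields as the row index. Whether this is exactly what happened with the city file is an assumption, since the raw file is not shown here.

```python
import io
import pandas as pd

# Hypothetical sample: the header names 2 columns, but each data row has 4 fields.
sample = "state\tpopulation\n1\tAlpha\tWA\t1000\n2\tBeta\tOR\t250\n"

demo = pd.read_table(io.StringIO(sample))

# The two extra leading fields end up as a two-level row index.
print(type(demo.index).__name__)
print(demo.index.nlevels)
print(list(demo.columns))
```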

There it is, right at the top. It reads 'MultiIndex(levels,
so let's reset that index to be one level.

In [None]:
df = df.reset_index(level=1)

In [None]:
df.index
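
To see what `reset_index(level=1)` does in isolation, here is a small sketch on a made-up frame with a two-level index. The frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical two-level index: (rank, city name).
idx = pd.MultiIndex.from_tuples([(1, "Alpha"), (2, "Beta")])
demo = pd.DataFrame({"state": ["WA", "OR"]}, index=idx)

# Move only the second level (position 1) out of the index and into a column.
flat = demo.reset_index(level=1)

print(flat.index.nlevels)
print(list(flat.columns))
```

Because the index level had no name, pandas labels the new column `level_1`, which is one reason to rename the columns afterwards.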

In [None]:
df.head()

Now let's reset our column names, but we need to make sure they line up by length.

In [None]:
len(df.columns)

In [None]:
new_col = ['city', 
           'state', 
           '2016 estimated', 
           '2010 census',
           'percent change', 
           '2016 land area mi2',
           '2016 land area km2',
           '2016 population density per mi2', 
           '2016 population density per km2',
           'location']

In [None]:
len(new_col)

In [None]:
df.columns = new_col

In [None]:
df.columns
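
The length check matters because pandas refuses a list of names that does not match the number of columns. A minimal sketch on a made-up frame (the names here are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "c"])

mismatch_rejected = False
try:
    demo.columns = ["x", "y"]  # only 2 names for 3 columns
except ValueError as err:
    mismatch_rejected = True
    print("rejected:", err)

demo.columns = ["x", "y", "z"]  # lengths match, so this works
print(list(demo.columns))
```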

In pandas it is easier to work with column names that have no spaces, so let's replace each space with an '_'. We could have done this when we typed the names above, but what if we had 1000 columns? Let's have Python do it for us by looping through each one.

In [None]:
df.columns = [c.replace(' ', '_') for c in df.columns]

In [None]:
df.columns
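
As an aside, the same rename can probably be done without an explicit loop by using the string methods on the column index. This sketch compares the two approaches on a made-up, empty frame:

```python
import pandas as pd

# Hypothetical frame with no rows; only the column names matter here.
demo = pd.DataFrame(columns=["2010 census", "percent change", "city"])

# The loop from the cell above, written as a list comprehension.
looped = [c.replace(" ", "_") for c in demo.columns]

# The vectorized alternative; regex=False treats the space as a literal.
vectorized = demo.columns.str.replace(" ", "_", regex=False)

print(looped)
print(list(vectorized))
```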

In [None]:
df.sort_values(by=['2010_census'])

Something is not right with this sort. What could it be? Maybe the columns are not numbers. Let's sort again in reverse order, with any values that are not a number placed first.

In [None]:
df.sort_values(by='2010_census', ascending=False, na_position='first')

No, that is not it. Maybe the values are not numbers? Let's look at the type.

In [None]:
df.dtypes
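
Here is a small sketch of why text columns sort oddly: strings are compared character by character, so '97,385' lands after '8,175,133' because '9' comes after '8'. The values are made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({"census": ["8,175,133", "97,385", "1,000,000"]})

# Sorting the raw strings compares them one character at a time.
as_text = demo.sort_values(by="census")["census"].tolist()

# Sorting by the numeric value gives the order we actually want.
as_number = sorted(demo["census"], key=lambda s: int(s.replace(",", "")))

print(as_text)
print(as_number)
```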

That is it, they are objects. Let's convert them to integers.
First we can remove the ',' and then cast them to an integer.

In [None]:
df['2016_estimated'] = df['2016_estimated'].str.replace(',', '').astype(int)

In [None]:
df['2010_census'] = df['2010_census'].str.replace(',', '').astype(int)
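
Since data is never clean, `astype(int)` will raise an error if any value is not a valid number after the commas are removed. One possible alternative, sketched on made-up values, is `pd.to_numeric` with `errors="coerce"`, which turns bad values into NaN instead of failing:

```python
import pandas as pd

# Hypothetical values; the last one is deliberately not a number.
raw = pd.Series(["8,175,133", "97,385", "n/a"])

# The approach used above works when every value is clean.
as_int = raw.iloc[:2].str.replace(",", "", regex=False).astype(int)

# to_numeric with errors="coerce" replaces unparseable values with NaN.
coerced = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

print(as_int.tolist())
print(coerced.isna().tolist())
```

The trade-off is that a column containing NaN becomes floating point rather than integer.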

In [None]:
df.head()

In [None]:
df.dtypes

In [None]:
df.sort_values(by=['2010_census'], ascending=False)

So let's verify the data in the percent change column.

In [None]:
df['increase'] = df['2016_estimated'] - df['2010_census']

In [None]:
df.head()

In [None]:
df['new_pct_increase'] = ((df.increase/df['2010_census']) * 100)

In [None]:
df.head()
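
The calculation above is (2016 estimate - 2010 census) / 2010 census * 100. As a quick check of the arithmetic, here it is on two made-up cities where the answer is easy to verify by hand (the column names and numbers are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({"census_2010": [200, 1000],
                     "estimate_2016": [250, 750]})

demo["increase"] = demo["estimate_2016"] - demo["census_2010"]
demo["pct"] = demo["increase"] / demo["census_2010"] * 100

# 200 -> 250 is a gain of 50, or 25 percent;
# 1000 -> 750 is a loss of 250, or -25 percent.
print(demo)
```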

In [None]:
col = df.columns.tolist()

Now let's re-arrange the columns with a classic Python slice, replacing col[4].

In [None]:
col[4]

In [None]:
col = col[:4] + col[-1:] + col[5:-1]

In [None]:
col

In [None]:
col[4]
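
To make the slice easier to follow, here is the same expression on a short made-up list, followed by one way to apply the new order to a data frame. The names are hypothetical.

```python
import pandas as pd

col = ["a", "b", "c", "d", "old_pct", "f", "g", "new_pct"]

# First four names, then the last name, then everything from position 5
# up to (but not including) the last name. Position 4 is left out.
reordered = col[:4] + col[-1:] + col[5:-1]
print(reordered)

# Selecting with a list of names returns the columns in that order.
demo = pd.DataFrame([list(range(len(col)))], columns=col)
demo = demo[reordered]
print(list(demo.columns))
```

Building the list only changes the list; the data frame keeps its old order until it is indexed with the new list.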

In [None]:
df = df[col]  # apply the re-arranged column order
df.head()

Now we loop through and replace all the column names, substituting an '_' where there is a ' '.

In [None]:
df.columns = [c.replace(' ', '_') for c in df.columns]

Now we drop the column named 'increase'. We pass axis=1 to indicate that we want to drop a column and not a row. Because drop returns a new data frame, we assign the result back to df.

In [None]:
df = df.drop(['increase'], axis=1)
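
A small sketch of how `drop` behaves, on a made-up frame: by default it returns a new data frame and leaves the original untouched, so the result has to be kept somewhere.

```python
import pandas as pd

demo = pd.DataFrame({"city": ["Alpha"], "increase": [50]})

# The result is discarded here, so demo still has the column.
demo.drop(["increase"], axis=1)
still_there = "increase" in demo.columns

# Keeping the result gives a frame without the column.
# columns=[...] is an equivalent spelling of axis=1.
trimmed = demo.drop(columns=["increase"])

print(still_there)
print(list(trimmed.columns))
```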

In [None]:
df.sort_values(by=['new_pct_increase'], ascending=False)

Find the highest growth rate among the cities

In [None]:
df.new_pct_increase.max()

Find the lowest growth rate among the cities

In [None]:
df.new_pct_increase.min()
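
`max()` and `min()` return the rate itself, not the city. One way to look up the matching city name, sketched here on made-up data, is `idxmax()` and `idxmin()`, which return the row label of the largest and smallest value:

```python
import pandas as pd

# Hypothetical cities and growth rates.
demo = pd.DataFrame({"city": ["Alpha", "Beta", "Gamma"],
                     "new_pct_increase": [2.5, 11.0, -3.0]})

# idxmax/idxmin give the row label; loc then fetches the city on that row.
fastest = demo.loc[demo["new_pct_increase"].idxmax(), "city"]
slowest = demo.loc[demo["new_pct_increase"].idxmin(), "city"]

print(fastest, slowest)
```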

In [None]:
%matplotlib inline

The line above tells the Jupyter notebook to display the graph in the notebook and not in a separate window.

In [None]:
plt.show()

In [None]:
df.plot(x='city', y='new_pct_increase', style='o')