# Descriptive Spatial Statistics

In this lab we will begin to work through some basic ways of deriving descriptive spatial statistics. First, let's get some data loaded so we can work with it. As always, we will start by importing some essential libraries, then use the os module to navigate to our data folder, and then import the CSV file.

In [None]:
import pandas as pd
import os

In [None]:
pd.set_option('display.max_columns', 500)

In [None]:
os.chdir('data')

In [None]:
os.listdir()

The file we will be working with today is 'food_all.csv'. The data is derived from a number of sources, including REIGN, UCDP, the World Bank, and others!

In [None]:
food_data = pd.read_csv('food_all.csv', encoding='latin-1')

Let's check that our data loaded properly.

In [None]:
food_data.head()

Great! It all looks good so we can proceed to discuss some of the important things we talked about last class.

## Variable Types

First up: variable types. As we talked about before, people generally conceptualize variable types at four basic levels: nominal, ordinal, interval, and ratio. Python has a number of data types, but for now we will focus on the ones related to the variable types discussed.

### Nominal

Nominal variables are those whose values are labels with no numerical meaning: they have no order and no magnitude. That does not mean they are unimportant! Furthermore, nominal data can come in a variety of Python data types. They can be numbers, like IDs, or names.

**EXERCISE**
Find a nominal variable that is a:
1. Object (This is a string variable)
2. Int64 (This is a number or integer)


*Hint*
The pandas attribute that reports the data type of each column is '.dtypes'
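
As a minimal sketch of how the hint works, the toy frame below uses made-up columns (`country`, `country_id`, `calories`) that are not taken from the lab data. The dtypes are set explicitly so the result does not depend on platform or pandas version defaults.

```python
import pandas as pd

# Toy frame with made-up values; these column names are assumptions for illustration.
toy = pd.DataFrame({
    "country": pd.Series(["Poland", "France", "USA"], dtype="object"),  # names as strings
    "country_id": pd.Series([1, 2, 3], dtype="int64"),                  # numeric ID codes
    "calories": [3400.5, 3500.0, 3700.2],                               # a measured quantity
})

# .dtypes is an attribute (no parentheses) listing the type of every column.
print(toy.dtypes)

# select_dtypes keeps only the columns of a given type, handy for wide frames.
int_cols = toy.select_dtypes(include="int64").columns.tolist()
print(int_cols)
```

Note that an integer column such as an ID can still be nominal: the dtype describes how the values are stored, not what level of measurement they represent.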

Great! So why are nominal variable types so important? They can be used to group data. We will talk about crosstabulation later in the lab where these values will become increasingly important. 

### Ordinal

This data does not actually include an ordinal variable, but that's OK because it's good practice to make a new variable! How about we take the leader age variable and bucket it? First, let's check the distribution of the age variable.

In [None]:
food_data.age.describe()

Alright, so we can now see the distribution of the variable. Let's make four bins based on these percentiles, and give each one a simple label as well.

In [None]:
bins = [0, 49, 57, 64, 94]
labels = [1,2,3,4]

Now we can use the cut function (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html) to make this a new column in our data.

In [None]:
food_data['age_bin'] = pd.cut(food_data['age'], bins,labels=labels)

In [None]:
food_data.age_bin.describe()

We now have an ordinal data type!
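
To see how the binning behaves at the edges, here is a small sketch on made-up ages using the same bins and labels as above. By default `pd.cut` treats each interval as open on the left and closed on the right, so an age of 49 lands in bin 1 and an age of 50 in bin 2. Values outside the outermost edges become missing.

```python
import pandas as pd

# Made-up ages chosen to sit on and around the bin edges.
ages = pd.Series([35, 49, 50, 57, 64, 80, 95])

bins = [0, 49, 57, 64, 94]
labels = [1, 2, 3, 4]

# Intervals are (0, 49], (49, 57], (57, 64], (64, 94].
age_bin = pd.cut(ages, bins, labels=labels)

print(age_bin.tolist())       # 95 falls outside the last edge, so it is NaN
print(age_bin.cat.ordered)    # the categories carry an order, which makes this ordinal
print(age_bin.isna().sum())   # number of values that fell outside the bins
```

If the real data has ages above 94 or missing ages, those rows would end up with a missing `age_bin`, so it is worth checking the count of missing values after binning.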

### Continuous

Let's put the next two levels (interval and ratio) together for this exercise, as there are quite a few of them.

**EXERCISE** 
Find the continuous value for the per person daily caloric food supply and record the:
1. Minimum
2. Maximum 


### BONUS! Binary/DUMMY Variables

Find one! What does each value (0, 1) represent?

## Measures of Central Tendency

Let's start with the simplest and often the most important: the average. Remember, the average is only meaningful for certain variable types. Let's pick one of the continuous variables and see what comes back.

In [None]:
food_data.Rye.mean()

How about a binary variable?

In [None]:
food_data.democracy.mean()

Great, let's get the median of both of these now.

In [None]:
food_data.Rye.median()

In [None]:
food_data.democracy.median()

How about the mode?

In [None]:
food_data.Rye.mode()

Hmm, that is odd. The mode is supposed to be the most frequent value, and yet this has returned four values. Let's try to get the mode a different way, namely with the value_counts() method. This returns the number of times each value occurs, sorted from most to least frequent.

In [None]:
food_data.Rye.value_counts()

Aha! So all four of those values are modes, because each appears in the data 4 times.
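
The same behaviour can be reproduced on a tiny made-up series. This is only a sketch with invented prices: when two values tie for most frequent, `mode()` returns both, and filtering `value_counts()` for the maximum count recovers the same values.

```python
import pandas as pd

# Made-up prices in which 10.5 and 12.0 each appear twice.
prices = pd.Series([10.5, 10.5, 12.0, 12.0, 15.0])

# mode() returns every value tied for the highest count, in sorted order.
modes = prices.mode()
print(modes.tolist())

# value_counts() gives the count behind each value.
counts = prices.value_counts()

# Keep only the values whose count equals the highest count.
top = counts[counts == counts.max()]
print(sorted(top.index.tolist()), counts.max())
```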

## Sum, Min, Max

Let's go backwards. Min and max can be really useful and are also really easy to find. Just use the describe method we have used plenty so far.

In [None]:
food_data.Corn.describe()

Now we can see that the price of corn has a low of 21.5 and a high of 148.9. This can be particularly useful when looking for outliers in the data. Outliers are always important to account for.
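
One common rule of thumb for flagging outliers, which is an assumption here and not something the lab prescribes, is the 1.5 × IQR rule: values more than 1.5 interquartile ranges below the first quartile or above the third quartile are flagged. The sketch below applies it to made-up prices.

```python
import pandas as pd

# Made-up prices with one value far above the rest.
prices = pd.Series([22.0, 25.0, 27.0, 30.0, 31.0, 33.0, 148.0])

# First and third quartiles, and the interquartile range between them.
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as candidate outliers.
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers.tolist())
```

A flagged value is a candidate for a closer look, not automatically an error. A real price spike would be flagged in the same way as a data entry mistake.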

Sum can be useful across an entire data frame in some circumstances, but for many variables it is not all that meaningful. What would be the use of summing the ages of all leaders? Let's find a variable where it is useful. Events are often worth summing. One event in this data is coups! pt_attempt means that a coup attempt was recorded by Powell and Thyne in the given year.

In [None]:
food_data.pt_attempt.sum()

That may seem like a lot of coups, but in fact it's only a very small percentage of the total country-years in the data frame.
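
For a variable coded 0/1, the sum counts the rows where the event happened and the mean gives the share of rows where it happened. A minimal sketch on a made-up indicator (not the lab data):

```python
import pandas as pd

# Made-up 0/1 indicator for ten rows, with the event occurring twice.
attempt = pd.Series([0, 0, 1, 0, 0, 0, 0, 1, 0, 0])

n_events = attempt.sum()   # number of rows equal to 1
share = attempt.mean()     # proportion of rows equal to 1

print(n_events, share)
```

This is also why taking the mean of a binary variable such as democracy is meaningful: it is the proportion of rows coded 1.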

**Exercise** Find the number of country-years that had a coup

## Data Frame Types

So we've covered some basics of descriptives, but now let's turn to the slightly more confusing task of manipulating data frames in ways that better illustrate the trends we want to see.

### Panel

We're going to go backwards, since you can always easily aggregate down, but not up. Let's try to identify the two key components that make a data frame a panel. Remember, in a panel data frame both time and grouping units must vary:
1. Grouping Variable
2. Time Variable

In [None]:
food_data.year.describe()

In [None]:
food_data.country.unique()
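
Both components can also be checked at once. The sketch below uses a made-up frame and assumes the unit and time columns are called `country` and `year`. It confirms that both vary and that no unit-year pair appears twice.

```python
import pandas as pd

# Made-up panel: two units observed in two years each.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2012, 2013, 2012, 2013],
})

n_units = toy["country"].nunique()   # how many distinct grouping units
n_years = toy["year"].nunique()      # how many distinct time periods

# A clean panel has at most one row per (unit, time) pair.
has_dupes = toy.duplicated(subset=["country", "year"]).any()

is_panel = n_units > 1 and n_years > 1 and not has_dupes
print(n_units, n_years, is_panel)
```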

Awesome, so this is definitely a panel data frame! We have been running descriptive statistics on the whole data frame up to this point, but now let's look at a few other ways to manipulate it with the groupby method.

In [None]:
food_data.groupby('democracy')['Rye'].mean()

Here we see that the price of Rye is a little higher on average in autocracies. Let's see how many coups we have across these two groups.

In [None]:
food_data.groupby('democracy')['pt_attempt'].sum()

Quite the disparity! A lot more coups have occurred in autocracies historically.

Let's see the average age and the percentage of leaders who have had a military career, all in one line.

In [None]:
food_data.groupby('democracy')[['age', 'militarycareer']].mean()

While the average age is only slightly higher in democracies, far fewer leaders have had a military career.
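
When different columns need different statistics, named aggregation with `agg` can produce them in a single table. This is a sketch on made-up rows. The output column names (`mean_age`, `coups`, `n`) are chosen here for illustration.

```python
import pandas as pd

# Made-up rows: three autocracy rows (0) and three democracy rows (1).
toy = pd.DataFrame({
    "democracy": [0, 0, 0, 1, 1, 1],
    "age": [60, 62, 58, 55, 57, 59],
    "pt_attempt": [1, 0, 1, 0, 0, 0],
})

# Each keyword is an output column: (input column, statistic).
summary = toy.groupby("democracy").agg(
    mean_age=("age", "mean"),
    coups=("pt_attempt", "sum"),
    n=("age", "size"),
)
print(summary)
```

Including the group size `n` next to the sums helps when comparing counts, since a group with more rows has more chances to record an event.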

### Longitudinal

We are now going to turn to a longitudinal data frame. This means that we are holding the grouping unit constant and looking just at the change over time within that unit.

In [None]:
poland = food_data.loc[food_data.country == "Poland"]

In [None]:
poland.tail(10)
           

Yep, that's Poland. As we can see, each row corresponds to a different year within Poland from 1950-2019. Now when we take descriptive statistics as before, we are doing so across the data on Poland only.

In [None]:
poland.democracy.value_counts()

In [None]:
poland['percap cals'].mean()
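
Because a longitudinal frame follows one unit through time, change from one year to the next is a natural descriptive to add. The sketch below uses made-up values for a hypothetical unit "A", sorts by year, and takes the difference between consecutive years with `diff()`.

```python
import pandas as pd

# Made-up panel; the values are invented for illustration.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2011, 2012, 2013, 2011, 2012, 2013],
    "percap cals": [3000.0, 3050.0, 3020.0, 2500.0, 2550.0, 2600.0],
})

# Hold the unit constant and put the rows in time order.
one = toy.loc[toy["country"] == "A"].sort_values("year")

# diff() subtracts the previous year's value; the first year has nothing to compare to.
change = one.set_index("year")["percap cals"].diff()
print(change)
```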

### Cross-Sectional

Cross-sectional data holds time constant across units. We could go about this a number of ways, such as taking ranges of years or aggregating across the entire data frame. Let's just pick a single year, though.

In [None]:
cross_13 = food_data.loc[food_data['year'] == 2013]

Again, we can run descriptives across the data frame to explain trends and patterns across all countries.

In [None]:
cross_13.government.value_counts()

In [None]:
cross_13.democracy.mean()

In [None]:
cross_13['percap cals'].describe()
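
A quick sanity check on a cross-section is that each unit appears exactly once. The sketch below builds a made-up panel, slices one year, and confirms that before computing a descriptive.

```python
import pandas as pd

# Made-up panel of three units over two years.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year": [2012, 2012, 2012, 2013, 2013, 2013],
    "democracy": [0, 0, 1, 0, 1, 1],
})

# Hold time constant by keeping a single year.
cross = toy.loc[toy["year"] == 2013]

# In a clean cross-section every unit shows up once.
one_row_per_unit = cross["country"].is_unique

# Share of units coded 1 in that year.
share_democracy = cross["democracy"].mean()
print(len(cross), one_row_per_unit, share_democracy)
```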

## BONUS Graph Descriptives

Don't worry about this next box for now; it is just some settings for visual purposes.

In [None]:
import matplotlib as mpl
from matplotlib.lines import Line2D
import matplotlib.pyplot as plt

def tableau_colors():
    """
    Args:

    Returns:
        dictionary of {color (str) : RGB (tuple) for the dark tableau20 colors}
    """
    tableau20 = [(31, 119, 180), (174, 199, 232), (255, 127, 14), (255, 187, 120),
                 (44, 160, 44), (152, 223, 138), (214, 39, 40), (255, 152, 150),
                 (148, 103, 189), (197, 176, 213), (140, 86, 75), (196, 156, 148),
                 (227, 119, 194), (247, 182, 210), (127, 127, 127), (199, 199, 199),
                 (188, 189, 34), (219, 219, 141), (23, 190, 207), (158, 218, 229)]

    for i in range(len(tableau20)):
        r, g, b = tableau20[i]
        tableau20[i] = (r / 255., g / 255., b / 255.)
    names = ['blue', 'orange', 'green', 'red', 'purple', 'brown', 'pink', 'gray', 'yellow', 'turquoise']
    colors = [tableau20[i] for i in range(0, 20, 2)]

    return dict(zip(names,colors))


def set_rc_params_jup():
    """
    Args:

    Returns:
        dictionary of settings for mpl.rcParams
    """
    params = {'axes.linewidth' : 1.5,
              'axes.unicode_minus' : False,
              'figure.dpi' : 100,  # matplotlib's default; dpi must be positive for figures to render
              'font.size' : 12,
              'legend.frameon' : True,
              'legend.handletextpad' : 0.4,
              'legend.handlelength' : 1,
              'legend.facecolor' : 'white',
              'legend.fancybox'  : True,
              'legend.fontsize' : 8,
              'mathtext.default' : 'regular',
              'savefig.bbox' : 'tight',
              'xtick.labelsize' : 10,
              'ytick.labelsize' : 10,
              'xtick.major.size' : 4,
              'ytick.major.size' : 4,
              'xtick.major.width' : 1,
              'ytick.major.width' : 1,
              'xtick.top' : True,
              'ytick.right' : True,
              'axes.edgecolor' : 'black',
              'savefig.facecolor'   : 'white',
              'axes.facecolor'   : 'whitesmoke',
              'font.family' : 'sans',
              'font.monospace' : 'computer modern roman',
              'text.usetex' : True,
              'axes.grid' : True,
              'grid.color' :   'gray',
              'grid.linestyle' :   '-',
              'grid.linewidth'  :   0.2,
              'grid.alpha'       :   0.3,
              'axes.axisbelow'      : 'line'
              }
    for p in params:
        mpl.rcParams[p] = params[p]
    return params

set_rc_params_jup()
tab = tableau_colors()

Let's plot a continuous variable!

In [None]:
fig, ax = plt.subplots(1, figsize=[10,10])
france = food_data.loc[food_data['country'] == "France"]
usa = food_data.loc[food_data['country'] == "USA"]
plt.plot(france['year'], france['percap cals'], label='France')
plt.plot(usa['year'], usa['percap cals'], label='USA')
plt.title('Food Supply Decrease', y=1.02)
plt.ylabel('Kilocalories per person per day')
plt.legend()
plt.tight_layout()  # adjust spacing first so the saved file matches what is displayed
fig.savefig('wealthy_decrease.png')
plt.show()

How about a nominal variable?

In [None]:
food_data.government.value_counts().plot(kind='barh')
plt.show()