# Scales
* We're going to talk about things you probably learned in grade school but also probably don't think about much
* And of course, we're going to talk about them in Pandas!

In [None]:
# Let's look at some letter grades...
import pandas as pd
df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                  index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good',
                         'ok', 'ok', 'ok', 'poor', 'poor'],
                  columns=["Grades"])
df

In [None]:
# What is our series datatype?
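
This cell has no code, so here is a minimal sketch of one way to check the datatype. It rebuilds the grades column so it runs on its own (the row labels are left out for brevity), and the name `grades_dtype` is only illustrative. Older pandas versions report `object` for a column of strings; newer versions may report a dedicated string dtype instead.

```python
import pandas as pd

# Rebuild the letter-grade column so this sketch is self-contained
df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                  columns=["Grades"])

# Every Series carries a dtype attribute describing what it stores
grades_dtype = df["Grades"].dtype
print(grades_dtype)
```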


* That seems pretty broad, eh? "object" pretty much means anything...
* We know more here. We have clear categories that have meaning to us. We can put this meaning into pandas `DataFrame` objects

In [None]:
# We can use the astype() method to tell pandas to treat this column as categorical
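
A minimal sketch of the conversion, assuming the same grades column as above (rebuilt here so the snippet runs on its own). The name `cat_grades` is illustrative. Passing the string `"category"` to `astype()` gives an unordered categorical with one category per distinct value.

```python
import pandas as pd

df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                  columns=["Grades"])

# Convert the column to pandas' built-in categorical type
cat_grades = df["Grades"].astype("category")
print(cat_grades.dtype)

# The .cat accessor exposes the categories and whether they are ordered
print(len(cat_grades.cat.categories), cat_grades.cat.ordered)
```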


* Notice that there are now 11 categories!
* But actually, our data isn't really categorical, is it? What else do we know about this data?

In [None]:
# We can tell pandas that the data is ordered by first creating our own data type

# then we just pass this to the astype() method
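
One way to fill in this cell, as a sketch: build a `pd.CategoricalDtype` that lists the grades from lowest to highest with `ordered=True`, then pass it to `astype()`. The names `my_categories` and `grades` are illustrative, and the lowest-to-highest ordering is an assumption about how the grades should rank.

```python
import pandas as pd

df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                  columns=["Grades"])

# List the categories from lowest to highest and mark them as ordered
my_categories = pd.CategoricalDtype(
    categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
    ordered=True)

# Apply the custom dtype to the column
grades = df["Grades"].astype(my_categories)
print(grades.dtype)

# With an ordering in place, min() and max() follow the grade scale
print(grades.min(), grades.max())
```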


In [None]:
# Now we can do ordinal comparisons! Look at the bad example first (the original dataframe, with no category dtype)
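
A sketch of the bad example, assuming the plain string column from the start of the notebook (the name `wrong` is illustrative). Strings compare lexicographically, character by character, so asking for grades "greater than C" returns the C+/C- variants and the D grades rather than the better ones.

```python
import pandas as pd

df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                  columns=["Grades"])

# Plain strings are compared alphabetically, not by grade quality
wrong = df[df["Grades"] > "C"]
print(wrong)
```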


In [None]:
# Now how does that look with the category-aware column?
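
The same comparison on an ordered categorical column, as a sketch. It rebuilds the column and the ordered dtype so it runs on its own; `my_categories`, `grades`, and `better` are illustrative names. Because the dtype knows the ranking, `> "C"` now returns the grades that actually rank above a C.

```python
import pandas as pd

df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                  columns=["Grades"])

# Ordered categorical dtype, listed from lowest to highest grade
my_categories = pd.CategoricalDtype(
    categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
    ordered=True)
grades = df["Grades"].astype(my_categories)

# The comparison now uses the category order instead of alphabetical order
better = grades[grades > "C"]
print(better)
```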


* Great! So we can encapsulate a limited set of allowed values (categories) and, where appropriate, an ordering (through our own dtype) in pandas, which lets us do operations we otherwise couldn't do
* Now, it turns out we use this in machine learning and data mining a fair bit. Some techniques (regression) are used to predict continuous values, while others (classification) are used to predict categories
* So how do we change from continuous data to categorical data in pandas? I'm glad you asked!

In [None]:
# Let's look at that census data
import numpy as np
df=pd.read_csv("datasets/census.csv")
# Keep only the county-level rows (SUMLEV == 50), then average the 2010 population by state
result=df[df['SUMLEV']==50]
result=result.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg(np.average)
result.head()

In [None]:
# Now if we just want to sort these values into "bins", we can use cut()
# this takes the series and the number of bins, and returns a new categorical series
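
A sketch of `pd.cut()` on a small made-up series, since the census file is not available here. The values and the labels `a` to `e` are hypothetical and only stand in for the per-state averages. In the notebook, the equivalent call would be something like `pd.cut(result, 10)`.

```python
import pandas as pd

# Hypothetical values standing in for the per-state averages
pops = pd.Series([10.0, 20.0, 55.0, 90.0, 100.0],
                 index=["a", "b", "c", "d", "e"])

# Split the value range into 3 equal-width bins
binned = pd.cut(pops, 3)
print(binned)

# The result is an ordered categorical whose categories are intervals
print(binned.cat.categories)
```

Values that fall in the same interval share a category, so `a` and `b` end up together while `c` lands in the next bin.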


* Notice the notation is mathematical (open/closed intervals)
* See how Alabama and Alaska are now in the same category, but Arizona is in another category
* Notice that pandas ordered all of these now too
* More on categories: https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html