# Scales

Ratio scale:

    - Units are equally spaced and there is a true zero
    - Mathematical operations of +, -, /, * are all valid
    - E.g. height and weight
    
Interval scale:

    - Units are equally spaced, but there is no true zero

Ordinal scale:

    - The order of the units is important, but they are not evenly spaced
    - Letter grades such as A+, A are a good example
    
Nominal scale:

    - Categories of data, but the categories have no order with respect to one another
    - E.g. Teams of a sport
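
The difference between a ratio scale and an interval scale can be checked with a little arithmetic. The sketch below is only an illustration: the heights and temperatures are made-up values, and Celsius is used as an assumed example of an interval scale (it is not mentioned in the notes above).

```python
import math

# Ratio scale: height has a true zero, so "twice as tall" is meaningful.
height_a, height_b = 180.0, 90.0   # centimetres (made-up values)
print(height_a / height_b)         # 2.0

# Interval scale: Celsius has no true zero, so the ratio depends on the unit.
c_a, c_b = 20.0, 10.0              # degrees Celsius (made-up values)
k_a, k_b = c_a + 273.15, c_b + 273.15  # the same temperatures in kelvin
print(c_a / c_b)                   # 2.0
print(k_a / k_b)                   # about 1.035

# Differences, however, are the same in both units.
print(math.isclose(c_a - c_b, k_a - k_b))  # True
```

On an interval scale only differences carry meaning, which is why the ratio changes with the unit while the 10-degree gap does not.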

In [2]:
import pandas as pd

# Let's create a DataFrame of letter grades in descending order.

df = pd.DataFrame(["A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D"], 
                 index = ["Excellent", "Excellent", "Excellent", "Good", "Good", "Good",
                         "Ok", "Ok", "Ok", "Poor", "Poor"],
                 columns = ["Grades"])

df

Unnamed: 0,Grades
Excellent,A+
Excellent,A
Excellent,A-
Good,B+
Good,B
Good,B-
Ok,C+
Ok,C
Ok,C-
Poor,D+
Poor,D


In [4]:
# If we check the datatype of this column, we see that it's just an object
df.dtypes

Grades    object
dtype: object

In [6]:
# We can tell pandas that we want to change the type to category, using the astype() function
df["Grades"].astype("category").head()

Excellent    A+
Excellent     A
Excellent    A-
Good         B+
Good          B
Name: Grades, dtype: category
Categories (11, object): [A, A+, A-, B, ..., C+, C-, D, D+]

There are eleven categories, and pandas is aware of what those categories are. More interesting, though, is that our data isn't just categorical but also ordered. That is, an A- comes after a B+. We can tell pandas that the data is ordered by first creating a new categorical data type with the list of the categories (from lowest to highest) and the **ordered=True** flag.

In [8]:
my_categories = pd.CategoricalDtype(categories = ["D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"],
                                   ordered=True)
my_categories

CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A',
                  'A+'],
                 ordered=True)

In [10]:
grades = df["Grades"].astype(my_categories)
grades

Excellent    A+
Excellent     A
Excellent    A-
Good         B+
Good          B
Good         B-
Ok           C+
Ok            C
Ok           C-
Poor         D+
Poor          D
Name: Grades, dtype: category
Categories (11, object): [D < D+ < C- < C ... B+ < A- < A < A+]

In [12]:
# Comparing the original object column uses plain string comparison, so the result ignores grade order
df[df["Grades"] > "C"]

Unnamed: 0,Grades
Ok,C+
Ok,C-
Poor,D+
Poor,D
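
The result above is what plain string comparison produces: on an object column, `>` compares the strings character by character rather than by grade. A small sketch in pure Python, using a few of the same grade labels:

```python
# Plain strings compare lexicographically (by character code point),
# which has nothing to do with grade order.
print("C+" > "C")   # True: "C+" starts with "C" and is longer
print("C-" > "C")   # True for the same reason
print("D" > "C")    # True: "D" sorts after "C"
print("A+" > "C")   # False: "A" sorts before "C"

# Sorting plain strings is alphabetical, not from lowest to highest grade.
plain_sorted = sorted(["C", "A-", "B+", "A+"])
print(plain_sorted)  # ['A+', 'A-', 'B+', 'C']
```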


In [13]:
# Here is what we want
grades[ grades > "C" ]

Excellent    A+
Excellent     A
Excellent    A-
Good         B+
Good          B
Good         B-
Ok           C+
Name: Grades, dtype: category
Categories (11, object): [D < D+ < C- < C ... B+ < A- < A < A+]
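
Boolean masks are not the only thing that respects the ordering. As a minimal sketch, assuming a shortened, hypothetical grade list (`D` to `A`) rather than the full eleven categories, other order-aware operations on an ordered categorical look like this:

```python
import pandas as pd

# Hypothetical, shortened grade scale listed from lowest to highest.
grade_type = pd.CategoricalDtype(categories=["D", "C", "B", "A"], ordered=True)
s = pd.Series(["B", "D", "A", "C"], dtype=grade_type)

# min() and max() follow the category order, not the alphabet.
print(s.min(), s.max())          # D A

# Sorting also follows the category order.
print(s.sort_values().tolist())  # ['D', 'C', 'B', 'A']

# Each value is stored as an integer code: its position in the category list.
print(s.cat.codes.tolist())      # [2, 0, 3, 1]
```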

In [15]:
# Let's work with census.csv
import numpy as np

# Now we read in our dataset
df = pd.read_csv("DataSets/census.csv")
df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861


In [20]:
# We reduce this to county data
df = df[df["SUMLEV"].eq(50)]

df.head()

Unnamed: 0,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861
5,50,3,6,1,9,Alabama,Blount County,57322,57322,57373,...,1.807375,-1.177622,-1.748766,-2.062535,-1.36997,1.859511,-0.84858,-1.402476,-1.577232,-0.884411


In [22]:
# Group by state and compute the average county population
df = df.set_index("STNAME").groupby(level=0)["CENSUS2010POP"].agg(np.average)
df.head()

STNAME
Alabama        71339.343284
Alaska         24490.724138
Arizona       426134.466667
Arkansas       38878.906667
California    642309.586207
Name: CENSUS2010POP, dtype: float64

In [26]:
# Let's build categories from our data using the cut() function
pd.cut(df, 10)

STNAME
Alabama                   (11706.087, 75333.413]
Alaska                    (11706.087, 75333.413]
Arizona                 (390320.176, 453317.529]
Arkansas                  (11706.087, 75333.413]
California              (579312.234, 642309.586]
Colorado                 (75333.413, 138330.766]
Connecticut             (390320.176, 453317.529]
Delaware                (264325.471, 327322.823]
District of Columbia    (579312.234, 642309.586]
Florida                 (264325.471, 327322.823]
Georgia                   (11706.087, 75333.413]
Hawaii                  (264325.471, 327322.823]
Idaho                     (11706.087, 75333.413]
Illinois                 (75333.413, 138330.766]
Indiana                   (11706.087, 75333.413]
Iowa                      (11706.087, 75333.413]
Kansas                    (11706.087, 75333.413]
Kentucky                  (11706.087, 75333.413]
Louisiana                 (11706.087, 75333.413]
Maine                    (75333.413, 138330.766]
Maryland     

cut gives you interval data, where the spacing between each category is equal-sized. Sometimes, though, you want to form categories based on frequency: you want the number of items in each bin to be the same, instead of the spacing between the bins. It really depends on the shape of your data and what you're planning to do with it.
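
For frequency-based bins, pandas provides `qcut()`, which splits the data at quantiles instead of at evenly spaced edges. The sketch below compares the two on a small, made-up, skewed series; the values are hypothetical and merely stand in for the state averages computed above.

```python
import pandas as pd

# Hypothetical skewed values (not taken from census.csv).
values = pd.Series([10, 12, 15, 20, 30, 50, 90, 200, 500, 1000])

# cut(): five bins of equal width, so most values land in the first bin.
equal_width = pd.cut(values, 5)
print(equal_width.value_counts().sort_index())

# qcut(): five bins holding the same number of items, so the widths differ.
equal_count = pd.qcut(values, 5)
print(equal_count.value_counts().sort_index())
```

With skewed data like this, `cut()` leaves some bins empty while `qcut()` puts two values in each bin, so the choice depends on whether equal spacing or equal group size matters more for the analysis.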