# Scale
We have already discussed the different types of scales commonly used in the pandas library. Here we will construct a scale ourselves using a pandas DataFrame, specifically a categorical scale.

In [1]:
import pandas as pd
import numpy as np

In [2]:
#Let's create a dataframe
dataframe=pd.DataFrame(["A+","A","A-","B+","B","B-","C+","C","C-","D","E","F"],
                       index=["Execellent","Execellent","Execellent","Very Good","Very Good","Very Good","Good","Good","Good"
                             ,"Average","Bad","Fail"],
                       columns=["Grades"])
dataframe

| | Grades |
|---|---|
| Execellent | A+ |
| Execellent | A |
| Execellent | A- |
| Very Good | B+ |
| Very Good | B |
| Very Good | B- |
| Good | C+ |
| Good | C |
| Good | C- |
| Average | D |
| Bad | E |
| Fail | F |


In [3]:
#Let us now see the datatype present in our dataframe
dataframe.dtypes

Grades    object
dtype: object

As we can see, the dtype of the Grades column is object. Since we want a scale rather than plain objects, we will convert the dtype from object to category. Remember that the categorical type is one of several scale types supported by the pandas library.


Although we have used it previously, it is worth mentioning again: there are a few ways to change the dtype of a variable. One of them is, of course, type casting. Another is the method __astype()__.

The general syntax of __astype()__ is
###### < Variable Whose Dtype Is To Be Converted >.astype("< datatype you want to convert it to >")

Let's see a case where we convert the object type elements of our dataframe to __category__.

In [4]:
newSeries=dataframe["Grades"].astype("category")
newSeries

Execellent    A+
Execellent     A
Execellent    A-
Very Good     B+
Very Good      B
Very Good     B-
Good          C+
Good           C
Good          C-
Average        D
Bad            E
Fail           F
Name: Grades, dtype: category
Categories (12, object): ['A', 'A+', 'A-', 'B', ..., 'C-', 'D', 'E', 'F']
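Note that a category dtype created this way is unordered. The following is a minimal sketch, using a hypothetical toy series (`toy` is not part of the tutorial's data), of what happens when we try an order-based comparison on such a series:

```python
import pandas as pd

# Hypothetical toy series, used only for illustration
toy = pd.Series(["A", "B", "C"]).astype("category")

# Categories declared via astype("category") carry no order
print(toy.cat.ordered)

try:
    toy > "B"
    comparison_failed = False
except TypeError as err:
    # pandas refuses order comparisons on unordered categoricals
    comparison_failed = True
    print("Comparison failed:", err)
```

Equality checks such as `toy == "B"` still work on unordered categories; only comparisons that need a ranking are rejected.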

Up until now, all we did was declare categories. To perform comparison operations, however, we need to set up an order for the available categories. Let us do just that.

How?

In pandas, an ordered set of categories can be created as follows:
###### < Variable >=pandas.CategoricalDtype(categories=[ List of categories in ascending order ],ordered=True)
Let's see an example below.

In [5]:
#Sample category declaration
myCatagories=pd.CategoricalDtype(categories=["F","E","D","C-","C","C+","B-","B","B+","A-","A","A+"],ordered=True)
newDataset1=dataframe["Grades"].astype(myCatagories)
#Let's understand what just happened in the code above. We take the Grades column of the existing dataframe "dataframe"
#and call the "astype" function on it, passing in the category object we created on the previous line. The result is
#stored in a new series, "newDataset1".
#What is the effect of this? The values of the Grades column are copied to the new series "newDataset1" unchanged.
#What does change is that the categories now have a defined order, which the original Grades column did not have.
#The advantage of doing this? Order-based operations such as comparisons now behave as expected,
#as you will see ahead.

In [6]:
newDataset1

Execellent    A+
Execellent     A
Execellent    A-
Very Good     B+
Very Good      B
Very Good     B-
Good          C+
Good           C
Good          C-
Average        D
Bad            E
Fail           F
Name: Grades, dtype: category
Categories (12, object): ['F' < 'E' < 'D' < 'C-' ... 'B+' < 'A-' < 'A' < 'A+']
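As a rough illustration of what the declared order buys us, here is a minimal sketch on a small hypothetical series (`grade_order`, `grade_dtype` and `grades` are illustrative names, not part of the dataframe above). It assumes the same ascending grade order used for `myCatagories`:

```python
import pandas as pd

# Same ascending order as in the tutorial, from worst to best
grade_order = ["F", "E", "D", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"]
grade_dtype = pd.CategoricalDtype(categories=grade_order, ordered=True)

# Hypothetical grades for illustration
grades = pd.Series(["B+", "A", "F", "C-"]).astype(grade_dtype)

print(grades.min())                   # F
print(grades.max())                   # A
print(grades.sort_values().tolist())  # ['F', 'C-', 'B+', 'A']

# Each value is stored as an integer code giving its rank in grade_order
print(grades.cat.codes.tolist())      # [8, 10, 0, 3]
```

Minimum, maximum and sorting all follow the declared ranking rather than alphabetical order.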

In [7]:
myCatagories

CategoricalDtype(categories=['F', 'E', 'D', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A',
                  'A+'],
, ordered=True)

Now let's print the grades greater than C.

We will do this in two ways:

1) Via the Grades column of the original dataframe (object dtype)

2) Via the ordered categorical series we just created

In [8]:
#Method 1
dataframe[dataframe["Grades"]>'C']

| | Grades |
|---|---|
| Good | C+ |
| Good | C- |
| Average | D |
| Bad | E |
| Fail | F |


In [9]:
#Method 2
newDataset1[newDataset1>"C"]

Execellent    A+
Execellent     A
Execellent    A-
Very Good     B+
Very Good      B
Very Good     B-
Good          C+
Name: Grades, dtype: category
Categories (12, object): ['F' < 'E' < 'D' < 'C-' ... 'B+' < 'A-' < 'A' < 'A+']

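See, the correct order is obtained here, unlike in the previous case. Method 1 compares plain strings alphabetically, so it returns C+, C-, D, E and F, whereas the ordered categories are compared by their declared rank.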
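The behaviour of Method 1 comes from ordinary Python string comparison, which works character by character. A minimal sketch with a few of the grade strings (the variable names are illustrative):

```python
# Plain strings are compared lexicographically, not by grade rank
c_minus_beats_c = "C-" > "C"   # True: "C" is a prefix of "C-", so "C-" sorts after it
d_beats_c = "D" > "C"          # True: "D" comes after "C" alphabetically
a_plus_beats_c = "A+" > "C"    # False: "A" comes before "C" alphabetically

print(c_minus_beats_c, d_beats_c, a_plus_beats_c)
```

This is why the object-typed column reports D, E and F as "greater than" C while leaving out every A and B grade.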


There is another scale-based operation where we find ourselves converting between scales, for instance
converting an interval or ratio scale to a categorical scale. Now, this might seem a bit counterintuitive
to you, since you are losing information about the value. But it's commonly done in a couple of places.
For instance, if you are visualizing the frequencies of categories, this can be an extremely useful
approach, and histograms are regularly used with converted interval or ratio data. In addition, if you're
using a machine learning classification approach on data, you need to be using categorical data, so
reducing dimensionality may be useful just to apply a given technique. Pandas has a function called cut
which takes an array-like structure as its argument, such as a column of a dataframe or a series. It also
takes the number of bins to be used, and all bins are kept at equal spacing.
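Before moving to real data, here is a minimal sketch of cut on a small hypothetical series (`scores` and the label names are made up for illustration). It uses the optional `labels` argument of `pd.cut` to name the bins:

```python
import pandas as pd

# Hypothetical values spread evenly between 0 and 100
scores = pd.Series([0, 25, 50, 75, 100])

# Three equal-width bins; each value is replaced by the interval it falls into
binned = pd.cut(scores, 3)
print(binned.cat.codes.tolist())   # [0, 0, 1, 2, 2]

# The same bins, but with readable labels instead of intervals
labelled = pd.cut(scores, 3, labels=["low", "mid", "high"])
print(labelled.tolist())           # ['low', 'low', 'mid', 'high', 'high']
```

The result is an ordered categorical series, so the comparisons discussed earlier also work on the binned values.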

Let's see an example using census data.

In [10]:
# let's bring in numpy
import numpy as np

# Now we read in our dataset
df=pd.read_csv("assets/census.csv")

# And we reduce this to county-level data (SUMLEV 50)
df=df[df['SUMLEV']==50]

# Then we group by state and take the average county population
df=df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg(np.average)

df.head()

STNAME
Alabama        71339.343284
Alaska         24490.724138
Arizona       426134.466667
Arkansas       38878.906667
California    642309.586207
Name: CENSUS2010POP, dtype: float64

In [11]:
# Now if we just want to make "bins" of each of these, we can use cut()
pd.cut(df,10)

STNAME
Alabama                   (11706.087, 75333.413]
Alaska                    (11706.087, 75333.413]
Arizona                 (390320.176, 453317.529]
Arkansas                  (11706.087, 75333.413]
California              (579312.234, 642309.586]
Colorado                 (75333.413, 138330.766]
Connecticut             (390320.176, 453317.529]
Delaware                (264325.471, 327322.823]
District of Columbia    (579312.234, 642309.586]
Florida                 (264325.471, 327322.823]
Georgia                   (11706.087, 75333.413]
Hawaii                  (264325.471, 327322.823]
Idaho                     (11706.087, 75333.413]
Illinois                 (75333.413, 138330.766]
Indiana                   (11706.087, 75333.413]
Iowa                      (11706.087, 75333.413]
Kansas                    (11706.087, 75333.413]
Kentucky                  (11706.087, 75333.413]
Louisiana                 (11706.087, 75333.413]
Maine                    (75333.413, 138330.766]
Maryland     

Here we see that states like Alabama and Alaska fall into the same category, while California and the
District of Columbia fall into a very different category.

Now, cutting is just one way to build categories from your data, and there are many other methods. For
instance, cut gives you interval data, where the spacing between each category is equal sized. But sometimes
you want to form categories based on frequency: you want the number of items in each bin to be the
same, instead of the spacing between bins. It really depends on what the shape of your data is, and what
you're planning to do with it.
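One pandas function for frequency-based binning is `pd.qcut`, which is not used elsewhere in this section. The following is a minimal sketch comparing it with cut on a small, deliberately skewed hypothetical series (`values` is made up for illustration):

```python
import pandas as pd

# Hypothetical skewed data: seven small values and one large outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Equal-width bins: nearly everything lands in the first bin
width_counts = pd.cut(values, 4).value_counts(sort=False).tolist()
print(width_counts)   # [7, 0, 0, 1]

# Equal-frequency bins: each bin holds the same number of items
freq_counts = pd.qcut(values, 4).value_counts(sort=False).tolist()
print(freq_counts)    # [2, 2, 2, 2]
```

With skewed data like this, the equal-width bins are mostly empty, while the quantile-based bins stay balanced at the cost of having very different widths.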