# Scales used in pandas

There are at least four scales of measurement (ratio, interval, ordinal, and nominal) that matter in pandas when you run statistical tests, machine learning, or other advanced calculations. They differ in whether the values have a meaningful order and whether the values are equally spaced.

## astype() Function

In [28]:
# The following is an example of ordinal scaling, using letter grades

import pandas as pd
import numpy as np

# Let's create a DataFrame of grades

df = pd.DataFrame(['A+', 'A', 'B+', 'B', 'C+', 'C', 'D+', 'D'], columns = ['Grades'],
                  index = ['excellent', 'excellent', 'good', 'good', 'ok', 'ok', 'poor', 'poor'])
print(df.head())

# Only string values were inserted, so the dtype of the 'Grades' column is object
# (confirmed with df.dtypes in the next cell)

          Grades
excellent     A+
excellent      A
good          B+
good           B
ok            C+


In [42]:
df.dtypes    # the dtypes attribute lists the dtype of each DataFrame column (a Series uses .dtype instead)
# both are attributes, so they are written without parentheses

Grades    object
dtype: object
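
The difference between the two attributes can be sketched on a small frame. This is only an illustration; the three-row DataFrame and the variable names `frame_dtypes` and `series_dtype` are made up for the example.

```python
import pandas as pd

# Hypothetical three-row frame, used only for illustration
df_small = pd.DataFrame({'Grades': ['A+', 'A', 'B+']})

# DataFrame.dtypes returns a Series that maps each column name to its dtype
frame_dtypes = df_small.dtypes

# Series.dtype returns the single dtype of that one column
series_dtype = df_small['Grades'].dtype

print(frame_dtypes)
print(series_dtype)
```

Both are plain attributes, so neither takes parentheses.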

In [73]:
# We can explicitly change the dtype of this data using the 'astype()' function

df['Grades'].astype('category').head()  # returns a copy of the column with dtype category (df itself is unchanged)

# Here the dtype of the values has been changed to category. We can also tell pandas that the categories
# are ordered (for example, that 'B+' ranks below 'A'), as shown in the following.

categorical_df = pd.CategoricalDtype(['D', 'D+', 'C', 'C+', 'B', 'B+', 'A', 'A+'], ordered = True)
# The categories are listed from lowest to highest, so this dtype describes ordinal data
# Now we can pass it to the astype() function to create an ordinal version of the column

grades = df['Grades'].astype(categorical_df)
grades
# With dtype object, pandas saw no ranking among the values in 'Grades'
# With 'pd.CategoricalDtype' and the 'astype()' function we can impose a hierarchy on the values
# This is helpful during comparisons and boolean masking

excellent    A+
excellent     A
good         B+
good          B
ok           C+
ok            C
poor         D+
poor          D
Name: Grades, dtype: category
Categories (8, object): [D < D+ < C < C+ < B < B+ < A < A+]
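
As a minimal sketch of what the ordering adds, the cell below rebuilds the same grades on a plain Series and inspects the categorical through the `.cat` accessor. The names `grade_order`, `grade_dtype`, `codes` and `sorted_grades` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Categories listed from lowest to highest grade
grade_order = ['D', 'D+', 'C', 'C+', 'B', 'B+', 'A', 'A+']
grade_dtype = pd.CategoricalDtype(grade_order, ordered=True)

grades = pd.Series(['A+', 'A', 'B+', 'B', 'C+', 'C', 'D+', 'D']).astype(grade_dtype)

# True, because the dtype was created with ordered=True
print(grades.cat.ordered)

# Each value is stored as the integer position of its category in grade_order
codes = grades.cat.codes
print(codes.tolist())              # [7, 6, 5, 4, 3, 2, 1, 0]

# min/max and sorting follow the category order, not alphabetical order
print(grades.min(), grades.max())  # D A+
sorted_grades = grades.sort_values()
print(sorted_grades.tolist())
```

Because the values are backed by integer codes in category order, sorting and min/max respect the grade hierarchy rather than the spelling of the labels.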

In [75]:
# The difference can be seen in the following, where we want all grades greater than 'C'

# using object dtype
df[df['Grades']>'C']
# this compares the values as strings rather than grades, so even 'D' and 'D+' are returned

# using categorical dtype
grades[grades>'C']
# here the values are compared as grades, so only the grades ranked above 'C' are returned
# NOTE: This is only possible when the dtype is created with ordered = True in 'pd.CategoricalDtype'.

excellent    A+
excellent     A
good         B+
good          B
ok           C+
Name: Grades, dtype: category
Categories (8, object): [D < D+ < C < C+ < B < B+ < A < A+]
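
The three cases (plain strings, ordered categories, unordered categories) can be put side by side in a small self-contained sketch. The variable names here are illustrative only.

```python
import pandas as pd

raw = pd.Series(['A+', 'A', 'B+', 'B', 'C+', 'C', 'D+', 'D'])

# 1. Plain strings: '>' is a lexicographic comparison
as_strings = raw[raw > 'C'].tolist()
print(as_strings)   # ['C+', 'D+', 'D']

# 2. Ordered categorical: '>' follows the declared grade order
ordered_dtype = pd.CategoricalDtype(
    ['D', 'D+', 'C', 'C+', 'B', 'B+', 'A', 'A+'], ordered=True)
as_grades = raw.astype(ordered_dtype)
above_c = as_grades[as_grades > 'C'].tolist()
print(above_c)      # ['A+', 'A', 'B+', 'B', 'C+']

# 3. Unordered categorical: '>' is not defined and raises TypeError
unordered = raw.astype('category')
try:
    unordered > 'C'
    unordered_comparison_allowed = True
except TypeError:
    unordered_comparison_allowed = False
print(unordered_comparison_allowed)   # False
```

The string comparison drops every A and B grade and keeps the D grades, which is the opposite of what a grade comparison should do.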

### Another operation converts interval- and ratio-scaled data (e.g., numeric grades) into categorical data.
* This is useful when visualizing the frequencies of the categories
* Histograms are generally built by binning interval and ratio data in this way

## Pandas has a function called 'cut()' that takes an array-like structure, such as a Series or a column of a DataFrame, as an argument, along with the number of bins to use. (These bins are equally spaced.)
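
Before the census data, here is a small sketch of equal-width binning on made-up numbers. The scores and the labels 'low', 'mid' and 'high' are hypothetical values chosen for illustration.

```python
import pandas as pd

# Hypothetical numeric scores (ratio-scaled data)
scores = pd.Series([52, 61, 68, 75, 83, 90, 97], name='score')

# Three equal-width bins spanning the range of the data
binned = pd.cut(scores, 3)
print(binned.cat.categories)

# The same bins with readable labels instead of interval notation
labelled = pd.cut(scores, 3, labels=['low', 'mid', 'high'])
print(labelled.tolist())

# Frequency of each bin, which is the information a histogram displays
counts = labelled.value_counts(sort=False)
print(counts)
```

The result of `cut()` is itself an ordered categorical, so the bins can be compared and sorted like the grades earlier.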

In [85]:
df = pd.read_csv('data/census.csv')
df = df[df['SUMLEV'] == 50]      # keep only the county-level rows

# average 2010 census population of the counties in each state
df = df.set_index('STNAME').groupby(level = 0)['CENSUS2010POP'].agg(np.nanmean)

pd.cut(df, 2)      # pass in the Series and the number of bins

STNAME
Alabama                  (11706.087, 327322.823]
Alaska                   (11706.087, 327322.823]
Arizona                 (327322.823, 642309.586]
Arkansas                 (11706.087, 327322.823]
California              (327322.823, 642309.586]
Colorado                 (11706.087, 327322.823]
Connecticut             (327322.823, 642309.586]
Delaware                 (11706.087, 327322.823]
District of Columbia    (327322.823, 642309.586]
Florida                  (11706.087, 327322.823]
Georgia                  (11706.087, 327322.823]
Hawaii                   (11706.087, 327322.823]
Idaho                    (11706.087, 327322.823]
Illinois                 (11706.087, 327322.823]
Indiana                  (11706.087, 327322.823]
Iowa                     (11706.087, 327322.823]
Kansas                   (11706.087, 327322.823]
Kentucky                 (11706.087, 327322.823]
Louisiana                (11706.087, 327322.823]
Maine                    (11706.087, 327322.823]
Maryland     