In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# A. Machine learning
The DataFrame object is great for storing a dataset and performing data analysis in Python. However, most machine learning frameworks (e.g. TensorFlow) work directly with NumPy data. Furthermore, the NumPy data used as input to machine learning models must contain only quantitative values.

Therefore, to use a DataFrame's data with a machine learning model, we need to convert the DataFrame to a NumPy matrix of quantitative data. So even the categorical features of a DataFrame, such as gender and birthplace, must be converted to quantitative values.
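To see why this conversion is needed, here is a minimal sketch that converts a DataFrame with a categorical column directly to NumPy. The `people` DataFrame and its values are hypothetical, made up purely for illustration. Because one column holds strings, the resulting array falls back to the generic `object` dtype instead of a numeric one.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one quantitative column and one categorical column
people = pd.DataFrame({'age': [31, 25, 47],
                       'birthplace': ['Lima', 'Oslo', 'Lima']})

# Converting the mixed DataFrame gives a non-numeric array
raw = people.to_numpy()
print(raw.dtype)  # object

# Converting only the quantitative column gives a numeric array
ages = people[['age']].to_numpy()
print(np.issubdtype(ages.dtype, np.number))  # True
```

An `object` array stores arbitrary Python objects, so it generally cannot be used as numeric input to a model.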

# B. Indicator features
When converting a DataFrame to a NumPy matrix of quantitative data, we need to find a way to modify the categorical features in the DataFrame.

The easiest way to do this is to convert each categorical feature into a set of indicator features for each of its categories. The indicator feature for a specific category represents whether or not a given data sample belongs to that category.

The code below shows a DataFrame with indicator features.

In [4]:
df = pd.DataFrame({'color':['red', 'blue', 'green', 'red', 'red', 'blue']},
                  index = ['r1', 'r2', 'r3', 'r4', 'r5', 'r6'])
print('{}\n'.format(df))

df_indicator = pd.get_dummies(df, columns=['color'], prefix = 'Indicator')
# Build the indicator DataFrame: one column per category of 'color', each named with the 'Indicator' prefix
print('{}\n'.format(df_indicator))

# Cast to int so the indicators print as 0/1 instead of False/True
df_indicator = df_indicator.astype(int)
print('{}\n'.format(df_indicator))

    color
r1    red
r2   blue
r3  green
r4    red
r5    red
r6   blue

    Indicator_blue  Indicator_green  Indicator_red
r1           False            False           True
r2            True            False          False
r3           False             True          False
r4           False            False           True
r5           False            False           True
r6            True            False          False

    Indicator_blue  Indicator_green  Indicator_red
r1               0                0              1
r2               1                0              0
r3               0                1              0
r4               0                0              1
r5               0                0              1
r6               1                0              0



In the code above, the DataFrame df has a single categorical feature called color. The corresponding indicator features for color are shown in df_indicator, first as True/False values and then, after the cast to int, as 1/0 values.

Note that an indicator feature contains 1 when the row has that particular category, and 0 if the row does not.
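To make the definition concrete, the sketch below builds the same indicator features by hand, comparing the column against each category in turn. This only illustrates what an indicator feature is, not how pandas computes it internally. The names `colors` and `manual` are chosen for this example.

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'green', 'red', 'red', 'blue'],
                   index=['r1', 'r2', 'r3', 'r4', 'r5', 'r6'])

# One indicator column per category: 1 where the row has that category, else 0
manual = pd.DataFrame({
    'Indicator_' + category: (colors == category).astype(int)
    for category in sorted(colors.unique())
})
print(manual)
```

Each row contains exactly one 1, because every sample belongs to exactly one color category.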

# C. Converting to indicators
In pandas, we convert each categorical feature of a DataFrame to indicator features with the get_dummies function. The function takes a DataFrame as its required argument and returns a new DataFrame in which each categorical feature is replaced by its indicator features.

The code below demonstrates how to use the get_dummies function.

In [25]:

df = pd.DataFrame({'lgID': ['AL', 'NL', 'AL', 'NL'],
                   'teamID': ['BOS', 'PIT', 'BOS', 'PIT']},
                   index  = ['bettsmo01', 'martest01', 'pedrodu01', 'polangr01'])
df.index.name = 'playerID'  # label the row index as 'playerID'
print('{}\n'.format(df))

converted = pd.get_dummies(df)
print('{}\n'.format(converted.columns))

print('{}\n'.format(converted[['teamID_BOS',
                               'teamID_PIT']]))
print('{}\n'.format(converted[['lgID_AL',
                               'lgID_NL']]))
converted = converted.astype(int)
print('{}\n'.format(converted[['teamID_BOS',
                               'teamID_PIT']]))
print('{}\n'.format(converted[['lgID_AL',
                               'lgID_NL']]))

          lgID teamID
playerID             
bettsmo01   AL    BOS
martest01   NL    PIT
pedrodu01   AL    BOS
polangr01   NL    PIT

Index(['lgID_AL', 'lgID_NL', 'teamID_BOS', 'teamID_PIT'], dtype='object')

           teamID_BOS  teamID_PIT
playerID                         
bettsmo01        True       False
martest01       False        True
pedrodu01        True       False
polangr01       False        True

           lgID_AL  lgID_NL
playerID                   
bettsmo01     True    False
martest01    False     True
pedrodu01     True    False
polangr01    False     True

           teamID_BOS  teamID_PIT
playerID                         
bettsmo01           1           0
martest01           0           1
pedrodu01           1           0
polangr01           0           1

           lgID_AL  lgID_NL
playerID                   
bettsmo01        1        0
martest01        0        1
pedrodu01        1        0
polangr01        0        1



Note that the indicator features have the original categorical feature's label as a prefix. This makes it easy to see where each indicator feature originally came from.
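As a variation, the sketch below passes `dtype=int` to `get_dummies` so the indicators come back as 0/1 directly, without a separate `astype(int)` step. It also includes a quantitative column to show that `get_dummies` leaves numeric columns unchanged. The `players` DataFrame is a hypothetical example built for this illustration.

```python
import pandas as pd

# Hypothetical data: one quantitative column (HR) and one categorical column (teamID)
players = pd.DataFrame({'HR': [24, 7, 7, 11],
                        'teamID': ['BOS', 'PIT', 'BOS', 'PIT']})

# dtype=int makes the indicator columns integers instead of booleans
encoded = pd.get_dummies(players, dtype=int)
print(encoded)
```

The numeric column HR is kept as is and appears first, followed by the indicator columns with the teamID prefix.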

# D. Converting to NumPy
After converting all the categorical features to indicator features, the DataFrame should contain only quantitative data. We can then convert it to a NumPy matrix using the values attribute (values is a property of the DataFrame, not a function, so it is accessed without parentheses).

The code below converts a DataFrame, df, into a NumPy matrix.

In [26]:
df = converted[['teamID_BOS', 'teamID_PIT']]
print('{}\n'.format(df))
df.insert(0, 'HR', [24, 7, 7, 11])  # insert a new column 'HR' at position 0, i.e. as the first column
print('{}\n'.format(df))

n_matrix = df.values # here this DataFrame is converted to a NumPy matrix
print(repr(n_matrix))

           teamID_BOS  teamID_PIT
playerID                         
bettsmo01           1           0
martest01           0           1
pedrodu01           1           0
polangr01           0           1

           HR  teamID_BOS  teamID_PIT
playerID                             
bettsmo01  24           1           0
martest01   7           0           1
pedrodu01   7           1           0
polangr01  11           0           1

array([[24,  1,  0],
       [ 7,  0,  1],
       [ 7,  1,  0],
       [11,  0,  1]], dtype=int64)


The rows and columns of the output matrix correspond to the rows and columns of the same position in the DataFrame. In the code above, the first column of the NumPy matrix represents HR, the second column represents teamID_BOS, and the third column represents teamID_PIT.
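The positional correspondence can be checked with a short sketch like the one below. It uses `to_numpy()`, the method pandas provides for the same conversion, and looks up positions with `get_loc`. The `stats` DataFrame is rebuilt here so the example is self-contained, and the variable names are chosen for illustration.

```python
import pandas as pd

stats = pd.DataFrame({'HR': [24, 7, 7, 11],
                      'teamID_BOS': [1, 0, 1, 0],
                      'teamID_PIT': [0, 1, 0, 1]},
                     index=['bettsmo01', 'martest01', 'pedrodu01', 'polangr01'])

matrix = stats.to_numpy()

# Column position in the DataFrame equals column position in the matrix
hr_col = stats.columns.get_loc('HR')
print(matrix[:, hr_col])

# The same holds for rows
row = stats.index.get_loc('pedrodu01')
print(matrix[row])
```

Since the matrix drops the row and column labels, it can help to keep `stats.columns` and `stats.index` around if you need to map results back to names later.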