# DAT210x - Programming with Python for DS

## Module2 - Lab5

## Lab Assignment 5

Barry Becker extracted a reasonably clean subset of the 1994 [U.S. Census database](https://archive.ics.uci.edu/ml/datasets/Census+Income), with the goal of predicting whether a person makes over 50K a year. The dataset is hosted on the University of California, Irvine's Machine Learning Repository and includes features such as a person's age, occupation, and hours worked per week.

As clean as the data is, it still isn't quite ready for analysis by SciKit-Learn! Using what you've learned in this chapter, clean up the various columns by encoding them properly, following best practices, so that they're ready to be examined. We've included a modified subset of the dataset at Module2/Datasets/<b>census.data</b>, along with some starter code to get you going at Module2/assignment5.py.

<ol>
<li>Load up the dataset and set header label names to:
['education', 'age', 'capital-gain', 'race', 'capital-loss', 'hours-per-week', 'sex', 'classification']
<p>Ensure you use the right command to do this, as there is more than one command! To verify you used the correct one, open the dataset in a text editor like SublimeText or Notepad, and double-check your df.head() to ensure the first values match up.</p></li>
<li>Make sure any value that needs to be replaced with a NaN is set as such. There are at least three ways to do this. One is much easier than the other two.</li>
<li>Look through the dataset and ensure all of your columns have appropriate data types. Numeric columns should be float64 or int64, and textual columns should be object.</li>
<li>Properly encode any ordinal features using the method discussed in the chapter.</li>
<li>Properly encode any nominal features by exploding them out into new, separate, boolean features.</li>
</ol>
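
The warning in step 1 about there being "more than one command" can be illustrated with a minimal sketch. The two-row sample below is hypothetical and only stands in for a headerless file like census.data; it compares passing `names=` to `pd.read_csv` against renaming the columns after a default load.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for a headerless data file.
raw = "Bachelors,39\nHS-grad,38\n"

# Passing names= tells pandas the file has no header row,
# so every line is kept as data.
with_names = pd.read_csv(io.StringIO(raw), names=["education", "age"])

# Loading with the default header and renaming afterwards
# consumes the first record as column labels.
renamed = pd.read_csv(io.StringIO(raw))
renamed.columns = ["education", "age"]

print(len(with_names))  # both records survive
print(len(renamed))     # the first record was lost to the header
```

Comparing `df.head()` against the first lines of the raw file, as the step suggests, is what exposes the lost record.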

Import and alias Pandas:

In [34]:
# .. your code here ..
import pandas as pd

In [35]:
# pd.read_csv?

As per usual, load up the specified dataset, setting appropriate header labels.

In [36]:
# .. your code here ..
col_name=['education', 'age', 'capital-gain', 'race', 'capital-loss',
    'hours-per-week', 'sex', 'classification']
df=pd.read_csv('Datasets/census.data',names=col_name,header=None,na_values='?')
print(df.head())


   education  age  capital-gain   race  capital-loss  hours-per-week     sex  \
0  Bachelors   39        2174.0  White             0              40    Male   
1  Bachelors   50           NaN  White             0              13    Male   
2    HS-grad   38           NaN  White             0              40    Male   
3       11th   53           NaN  Black             0              40    Male   
4  Bachelors   28           0.0  Black             0              40  Female   

  classification  
0          <=50K  
1          <=50K  
2          <=50K  
3          <=50K  
4          <=50K  


Excellent.

Now, use basic pandas commands to look through the dataset. Get a feel for it before proceeding!

Do the data types of each column reflect the values you see when you look through the data using a text editor or spreadsheet program? If you see `object` where you expect to see `int64` or `float64`, that is a good indicator that there might be a string, a missing value, or an erroneous value in the column.
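
As a minimal sketch of that symptom, the hypothetical three-row sample below has a `?` placeholder in an otherwise numeric column. The sample data is an assumption made for illustration, not taken from census.data.

```python
import io
import pandas as pd

# Hypothetical sample: the second row uses '?' as a missing-value marker.
raw = "39,2174\n50,?\n38,0\n"
cols = ["age", "capital-gain"]

# Without na_values, the stray '?' forces the whole column to a non-numeric type.
plain = pd.read_csv(io.StringIO(raw), names=cols)
print(plain.dtypes)

# With na_values='?', the marker becomes NaN and the column parses as float64.
clean = pd.read_csv(io.StringIO(raw), names=cols, na_values="?")
print(clean.dtypes)
```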

In [37]:
# .. your code here ..
print(df.describe())
print(df.dtypes)


                age  capital-gain  capital-loss  hours-per-week
count  29536.000000  29532.000000  29536.000000    29536.000000
mean      38.506094    928.454321     84.957408       40.243872
std       13.811739   6557.886804    397.107750       12.326211
min       17.000000      0.000000      0.000000        1.000000
25%       27.000000      0.000000      0.000000       40.000000
50%       37.000000      0.000000      0.000000       40.000000
75%       48.000000      0.000000      0.000000       45.000000
max       90.000000  99999.000000   4356.000000       99.000000
education          object
age                 int64
capital-gain      float64
race               object
capital-loss        int64
hours-per-week      int64
sex                object
classification     object
dtype: object


Try using `your_data_frame['your_column'].unique()` or, equivalently, `your_data_frame.your_column.unique()` to see the unique values of each column and identify the rogue values.

If you find any values that should be encoded as NaN, you can convert them either by using the `na_values` parameter when loading the dataframe or, alternatively, by using one of the other methods discussed in the reading.
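
The reading's other methods are not reproduced here, so the sketch below is only an assumption about what they might look like: two common pandas approaches for fixing a column after it has already been loaded. The `df_demo` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical column that was loaded as text because of a '?' marker.
df_demo = pd.DataFrame({"capital-gain": ["2174", "?", "0"]})

# Option 1: replace the marker with NaN, then cast the column to float.
replaced = df_demo["capital-gain"].replace("?", np.nan).astype(float)

# Option 2: let pandas coerce anything non-numeric to NaN in one step.
coerced = pd.to_numeric(df_demo["capital-gain"], errors="coerce")

print(replaced.isna().sum(), coerced.isna().sum())
```

Note that `errors="coerce"` turns every unparseable value into NaN, not just `?`, so it is worth checking `unique()` first to know what is being discarded.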

In [38]:
# .. your code here ..
# for i in range(len(df.columns)):
#     print(df.iloc[:,i].unique())
    
    
for i in col_name:
    print(i+' : ', df.loc[:,i].unique())

education :  ['Bachelors' 'HS-grad' '11th' 'Masters' '9th' 'Some-college' '7th-8th'
 'Doctorate' '5th-6th' '10th' '1st-4th' 'Preschool' '12th']
age :  [39 50 38 53 28 37 49 52 31 42 30 23 34 25 32 43 40 54 35 59 56 19 20 45
 22 48 21 24 57 44 18 47 46 41 29 36 79 27 67 33 76 17 55 61 70 64 71 68
 51 58 26 60 90 66 65 77 62 63 80 72 74 69 73 81 78 75 82 83 84 85 88 86
 87]
capital-gain :  [ 2174.    nan     0. 14084.  5178.  5013.  2407. 14344. 15024.  7688.
 34095.  4064.  4386.  7298.  1409.  3674.  1055.  3464.  2050.  2176.
   594.  6849.  4101.  1111.  3411.  2597. 25236.  4650.  9386.  2463.
  3103. 10605.  2964.  3325.  2580.  3471.  4865.  6514.  1471.  2329.
 99999. 20051.  2105.  2885. 25124. 10520.  2202.  2961. 27828.  6767.
  8614.  2228.  1506. 13550.  2635.  5556.  4787.  3781.  3137.  3818.
  3942.   914.   401.  2829.  2977.  4934.  2062. 15020.  1424.  3273.
 22040.  4416. 10566.   991.  4931.  1086.  7430.  6497.   114.  7896.
  2346.  3432.  2907.  1151.  2414.  2290

Look through your data and identify any potential categorical features. Ensure you properly encode any ordinal and nominal types using the methods discussed in the chapter.

Be careful! Some features can be represented as either categorical or continuous (numerical). If you ever get confused, think to yourself what makes more sense generally---to represent such features with a continuous numeric type... or a series of categories?
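
The difference between the two encodings can be sketched on a small hypothetical frame. Using integer category codes for ordinal features is one common approach and is assumed here rather than confirmed as the chapter's exact method. The three-level `order` list is a shortened stand-in for the full education ranking.

```python
import pandas as pd

# Hypothetical three-row frame with one ordinal and one nominal feature.
demo = pd.DataFrame({
    "education": ["HS-grad", "Bachelors", "9th"],
    "race": ["White", "Black", "White"],
})

# Ordinal: supply the ranking explicitly so the integer codes follow it.
order = ["9th", "HS-grad", "Bachelors"]
demo["education"] = pd.Categorical(
    demo["education"], categories=order, ordered=True
).codes

# Nominal: no ranking exists, so expand into one indicator column per value.
demo = pd.get_dummies(demo, columns=["race"])

print(demo)
```

Encoding a nominal feature with integer codes would imply an ordering that does not exist, which is why it gets indicator columns instead.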

In [39]:
# .. your code here ..
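# nominal (categorical, no ordering) - race, sex
# ordinal (categorical, ordered) - education, classification
# continuous (numeric) - age, capital-gain, capital-loss, hours-per-week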

In [40]:
# education - ordinal: an ordered categorical keeps the ranking of the levels
education_order = ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th', 'HS-grad', 'Some-college', 'Bachelors', 'Masters', 'Doctorate']
df['education'] = pd.Categorical(df['education'], categories=education_order, ordered=True)

# race - nominal: expand into one indicator column per value
race_category = ['White', 'Black', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other']  # listed for reference
df = pd.get_dummies(df, columns=['race'])

# sex - nominal
sex_cat = ['Male', 'Female']  # listed for reference
df = pd.get_dummies(df, columns=['sex'])

# classification - treated as ordinal in the lab answers, but expanded into indicator columns here
classification_cat = ['<=50K', '>50K']  # listed for reference
df = pd.get_dummies(df, columns=['classification'])
df.dtypes


education                  category
age                           int64
capital-gain                float64
capital-loss                  int64
hours-per-week                int64
race_Amer-Indian-Eskimo       uint8
race_Asian-Pac-Islander       uint8
race_Black                    uint8
race_Other                    uint8
race_White                    uint8
sex_Female                    uint8
sex_Male                      uint8
classification_<=50K          uint8
classification_>50K           uint8
dtype: object

Lastly, print out your dataframe!

In [44]:
# .. your code here ..
df.head()

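|   | education | age | capital-gain | capital-loss | hours-per-week | race_Amer-Indian-Eskimo | race_Asian-Pac-Islander | race_Black | race_Other | race_White | sex_Female | sex_Male | classification_<=50K | classification_>50K |
|---|-----------|-----|--------------|--------------|----------------|-------------------------|-------------------------|------------|------------|------------|------------|----------|----------------------|---------------------|
| 0 | Bachelors | 39 | 2174.0 | 0 | 40 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 1 | Bachelors | 50 | NaN | 0 | 13 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 2 | HS-grad | 38 | NaN | 0 | 40 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 3 | 11th | 53 | NaN | 0 | 40 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 4 | Bachelors | 28 | 0.0 | 0 | 40 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |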


### Lab Question 1

Before you made any changes to the downloaded dataset, how many of the columns were ordinal?
<p><b>Ans. 2</b></p>

<p>Ordinal features are those that are categorical but have an underlying ordering to them, such as: Tall, Medium, and Small. In this dataset, there are only two ordinal features: education and classification.</p>

Education is ordinal because people move from one education group to the next based on the amount of their studies. The classification column is ordinal because it's based on how much a person makes. If we had the actual numeric feature of their precise annual income, there is no doubt it would be a continuous column. Since we only know whether they make more than 50K or not, the column is ordinal.
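
To make the "ordering" idea concrete, here is a minimal sketch on a hypothetical three-value series. It assumes the two income brackets are declared as an ordered categorical, which is what allows comparisons and `max()` to work.

```python
import pandas as pd

# The two income brackets, listed from lower to higher.
levels = ["<=50K", ">50K"]

# Hypothetical sample values, declared as an ordered categorical.
income = pd.Series(
    pd.Categorical([">50K", "<=50K", ">50K"], categories=levels, ordered=True)
)

# Ordered categoricals support rank-based operations.
print(income.max())
print((income > "<=50K").tolist())
```

The same comparison on an unordered categorical such as race would raise an error, which is the practical difference between ordinal and nominal features.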


### Lab Question 2

<p>Before you made any changes to the downloaded dataset, how many of the columns were nominal?</p>
<p><b>Ans. 2</b></p>

Race and sex are the only columns that are nominal. Nominal features are those that are categorical, but lack any innate ordering.



