


Introduction

The Adult dataset is drawn from the 1994 Census database. Details of this dataset can be found at the UCI Machine Learning Repository. My main aim is to create a model that predicts whether a person's income will exceed $50,000. The objectives that I will follow to achieve this aim are set out below:

1) Clean the data: Replace or delete missing data.

2) Explore the data: Delete any unnecessary fields.

3) Scale the data.

4) Convert the needed data for modelling.

5) Explore the machine learning algorithms.

6) Optimise the selected machine learning algorithm.






Stage 1: Exploratory Data Analysis (EDA)

Task 1 - Present the code that tells me all of the columns, data types, and records

Importantly, you should begin considering whether the dimensionality of the dataset is high or low. Additionally, you should strive to understand the information described by all the columns.


In [19]:
#Task 1 - upload the Salary.txt file and turn it into a pandas dataframe.
import pandas as pd

df = pd.read_csv("Salary.txt",index_col=False, names = [ "age" ,"workclass", "fnlwgt","education","education-num","marital-status","occupation","relationship","race","sex","capital-gain","capital-loss","hours-per-week","native-country","label"])

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  label           32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
None


From the information collected on the dataframe, we know that there are 14 features that could be used to predict the label column. However, two of the original 15 columns are not self-explanatory: neither fnlwgt nor education-num has an obvious relationship to salary. After further examining the information on the dataset, fnlwgt is a sampling weight and education-num is a redundant numeric encoding of the education column. It was decided that both columns would be dropped from the dataframe.
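One way to back up the claim that education-num only repeats education is to check that every education level maps to exactly one numeric code. The sketch below does this on a small hand-made frame. The rows are illustrative values rather than records read from Salary.txt, and `demo` and `codes_per_level` are names introduced here.

```python
import pandas as pd

# Toy rows for illustration only; the real check would run on df before the drop.
demo = pd.DataFrame({
    "education": [" Bachelors", " HS-grad", " Bachelors", " Masters", " HS-grad"],
    "education-num": [13, 9, 13, 14, 9],
})

# Count the distinct numeric codes seen for each education level.
codes_per_level = demo.groupby("education")["education-num"].nunique()
print(codes_per_level)

# If every level has exactly one code, the numeric column adds no new information.
print("Redundant:", bool((codes_per_level == 1).all()))
```

If any level showed more than one code, the two columns would not be interchangeable and dropping one would need more justification.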

In [20]:
# Task 2 - drop the following columns from the dataframe :"fnlwgt", "education-num"
df = df.drop(["fnlwgt", "education-num"], axis=1)
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   education       32561 non-null  object
 3   marital-status  32561 non-null  object
 4   occupation      32561 non-null  object
 5   relationship    32561 non-null  object
 6   race            32561 non-null  object
 7   sex             32561 non-null  object
 8   capital-gain    32561 non-null  int64 
 9   capital-loss    32561 non-null  int64 
 10  hours-per-week  32561 non-null  int64 
 11  native-country  32561 non-null  object
 12  label           32561 non-null  object
dtypes: int64(4), object(9)
memory usage: 3.2+ MB
None




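From the information file on the dataset, we know that all missing data was represented as "?". In the loaded dataframe the value appears as " ?", with a leading space, because the fields in the file are separated by a comma and a space.

All missing data was changed to the pandas missing-value marker, pd.NA.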


In [21]:
#Task 3 - replace the ? with pd.NA

df = df.replace(" ?", pd.NA)
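As an alternative to replacing the marker after loading, pandas can treat "?" as missing while the file is being read. The sketch below uses a small in-memory sample in place of Salary.txt (the rows and the shortened column list are made up for illustration) and relies on the `skipinitialspace` and `na_values` parameters of `read_csv`.

```python
import io
import pandas as pd

# In-memory stand-in for Salary.txt, using the same ", " separator.
raw = io.StringIO(
    "39, State-gov, Bachelors, United-States, <=50K\n"
    "54, ?, HS-grad, ?, >50K\n"
    "28, Private, Masters, Cuba, <=50K\n"
)
names = ["age", "workclass", "education", "native-country", "label"]

# skipinitialspace strips the space after each comma, so "?" matches na_values.
demo = pd.read_csv(raw, names=names, skipinitialspace=True, na_values="?")

print(demo.isna().sum())
```

Note that this also removes the leading space from every category, so any later lookups would need "Mexico" rather than " Mexico".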

A report on the missing data was then produced

In [22]:
#Task 4 - check to see if we have any missing values
print(df.isnull().sum())

age                  0
workclass         1836
education            0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     583
label                0
dtype: int64


More information was gathered to find out how much data would be lost if we deleted all the missing data.

In [23]:
#Task 5 - check how many records we would lose by creating 2 dataframes: one that drops all missing values and one that doesn't. Compare the number of records in each dataframe.
new_df = df.dropna()
print("################")
print("Total records that would be lost if deleted : ")
print(df.shape[0]- new_df.shape[0])
del new_df

################
Total records that would be lost if deleted : 
2399


Although we would lose only 2399 of the 32561 records (about 7.4%), further exploration of the data could reduce that loss.

Importantly, there are several strategies that could be used to address this missing data. Before making a decision, I need to understand the significance of the missing records and the feature itself.
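Before choosing a strategy, it can help to see what share of each column is missing and whether the gaps fall on the same rows. The sketch below shows both checks on a made-up frame; the values are illustrative, and the variable names are introduced here.

```python
import pandas as pd

# Toy frame with gaps in the three columns that have missing data.
demo = pd.DataFrame({
    "workclass": [" Private", pd.NA, " State-gov", pd.NA, " Private"],
    "occupation": [" Sales", pd.NA, " Adm-clerical", pd.NA, pd.NA],
    "native-country": [" United-States", " Mexico", pd.NA, " United-States", " Cuba"],
})

# Fraction of missing values per column.
missing_share = demo.isna().mean()
print(missing_share)

# Rows where workclass and occupation are missing together.
both_missing = (demo["workclass"].isna() & demo["occupation"].isna()).sum()
print("workclass and occupation both missing:", both_missing)

# Rows that dropna() would remove (at least one missing value).
rows_lost = demo.isna().any(axis=1).sum()
print("rows lost by dropna():", rows_lost)
```

If two columns are almost always missing together, imputing one of them on its own gains little, which is useful to know before deciding.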

**Exploring the data**

In [24]:
#Task 6- Find out the value count for each of the columns that have missing records
print(df['workclass'].value_counts())
print(df['occupation'].value_counts())
print(df['native-country'].value_counts())



workclass
Private             22696
Self-emp-not-inc     2541
Local-gov            2093
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: count, dtype: int64
occupation
Prof-specialty       4140
Craft-repair         4099
Exec-managerial      4066
Adm-clerical         3770
Sales                3650
Other-service        3295
Machine-op-inspct    2002
Transport-moving     1597
Handlers-cleaners    1370
Farming-fishing       994
Tech-support          928
Protective-serv       649
Priv-house-serv       149
Armed-Forces            9
Name: count, dtype: int64
native-country
United-States                 29170
Mexico                          643
Philippines                     198
Germany                         137
Canada                          121
Puerto-Rico                     114
El-Salvador                     106
India                           100
Cuba                             95
England         



From the information above, we can see that one value of this feature, United-States, accounts for over 91% of the non-missing records (29170 of 31978). Therefore, a decision was made to set all missing values in the native-country column to "United-States".

*Importantly, this was not the only handling strategy that could have been used. I could have explored the correlation between country and salary to decide whether to remove this feature completely, or built an ML classifier to predict the most likely country from the other features.*
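The most-frequent-value approach described here can be sketched as follows. The series is a made-up example, not the real column, and `most_common` and `filled` are names introduced for illustration.

```python
import pandas as pd

# Toy column: nine United-States, one Mexico, two missing entries.
countries = pd.Series(
    [" United-States"] * 9 + [" Mexico"] + [pd.NA] * 2, dtype="object"
)

# Share of each value among the non-missing records.
shares = countries.value_counts(normalize=True)
print(shares)

# Fill the gaps with the most frequent value (the mode).
most_common = countries.mode()[0]
filled = countries.fillna(most_common)
print(filled.value_counts())
```

Mode imputation is simple, but it makes the dominant category even more dominant, so it is best suited to columns where one value already covers most records.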

Additionally, most of the other values have very few records, so the feature has high cardinality with many sparsely populated categories. The decision was made to combine the countries in a way that could add to the model. The decision was to use the information from the website:

https://www.nationsonline.org/oneworld/GNI_PPP_of_countries.htm

to split the countries into high_PPP, medium_PPP and low_PPP.

Task 7 - Replace all pd.NA values with United-States, then loop over each of the lists to replace the countries with high_PPP, medium_PPP or low_PPP

low_PPP = [" Honduras", " Vietnam"," Cambodia"," Laos"," Haiti", " Yugoslavia"," India"," Guatemala", " Nicaragua"]

medium_PPP = [" Trinadad&Tobago"," Poland" ," Mexico" , " Thailand"," Iran"," Columbia", " Peru", " Philippines" ," China"," Ecuador" , " Cuba"," El-Salvador"," Jamaica"," South"]

high_PPP = [" Holand-Netherlands"," Scotland"," Ireland"," Hong"," Beligum" ," Japan"," Italy"," England"," Germany"," Canada"," France"," Taiwan"," Greece"," Portugal" , " Hungary"," Outlying-US(Guam-USVI-etc)", " Puerto-Rico", " Dominican-Republic"]

Importantly, this was not the only handling strategy that could have been used. I could have kept the data as it was and evaluated the different models to select the most suitable one.


In [25]:

df["native-country"] = df["native-country"].replace(pd.NA, " United-States")
low_PPP = [" Honduras", " Vietnam"," Cambodia"," Laos"," Haiti",
               " Yugoslavia"," India"," Guatemala", " Nicaragua"]
medium_PPP = [" Trinadad&Tobago"," Poland" ," Mexico" , " Thailand"," Iran",
                " Columbia", " Peru", " Philippines" ," China"," Ecuador" ,
                " Cuba"," El-Salvador"," Jamaica"," South"]
high_PPP = [" Holand-Netherlands"," Scotland"," Ireland"," Hong"," Beligum" ," Japan"," Italy"," England"," Germany"," Canada"," France"," Taiwan"," Greece",
               " Portugal" , " Hungary"," Outlying-US(Guam-USVI-etc)",
               " Puerto-Rico", " Dominican-Republic"]

for i in high_PPP:
    df["native-country"] = df["native-country"].replace(i,"high_PPP")

for i in medium_PPP:
    df["native-country"] = df["native-country"].replace(i,"medium_PPP")

for i in low_PPP:
    df["native-country"] = df["native-country"].replace(i,"  low_PPP")

print(df['native-country'].value_counts())

native-country
 United-States    29753
medium_PPP         1536
high_PPP            897
  low_PPP           375
Name: count, dtype: int64
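The three loops above could also be expressed as a single dictionary lookup. The sketch below shows the idea with shortened lists and a made-up series; `groups` and `country_to_group` are names introduced here, and the full lists would be used in practice.

```python
import pandas as pd

# Shortened group lists for illustration only.
groups = {
    "high_PPP": [" Germany", " Canada"],
    "medium_PPP": [" Mexico", " Cuba"],
    "low_PPP": [" India", " Haiti"],
}

# Invert the groups into one country -> group mapping.
country_to_group = {
    country: group for group, members in groups.items() for country in members
}

countries = pd.Series([" United-States", " Mexico", " India", " Germany", " Cuba"])

# One replace call; values not in the mapping are left unchanged.
grouped = countries.replace(country_to_group)
print(grouped.tolist())
```

Building the mapping once also keeps the group labels spelled consistently, because each label is written in only one place.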


The next step would be to look at the workclass and occupation data and how the records are distributed among the unique values. Do they have high cardinality? What should I do with the missing values?

For the rest of the columns, I will drop all records with missing values.

In [26]:
#task 11 - Drop all missing values
df = df.dropna()



Encoding the data

The data has been cleaned and combined to make it more effective in the modelling process. However, a number of columns still have a datatype that is unsuitable for some of the machine learning algorithms that I will be using.

I will be converting all the object datatypes to a numerical type. The options that I have for this are LabelEncoder and OneHotEncoder.

Because the categories in each column have no ordinal relationship, I have chosen one-hot encoding, using pandas.get_dummies on the object datatypes.

The columns ["workclass","education","marital-status","occupation","relationship","race","sex","native-country"] were converted to numeric indicator columns.


In [27]:
#task 12 - encode the following columns "workclass","education","marital-status","occupation","relationship","race","sex","native-country" using the pd.get_dummies() method
df.info()
features = df[["workclass","education","marital-status","occupation","relationship","race","sex","native-country"]]
new_features = pd.get_dummies(features)
df.reset_index(inplace= True)
new_features.reset_index(inplace= True)
df = pd.merge(df,new_features,on="index",how="inner")
df = df.drop(["index","workclass","education","marital-status","relationship","race","sex","native-country","occupation"],axis=1)
df.head(2)

<class 'pandas.core.frame.DataFrame'>
Index: 30718 entries, 0 to 32560
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             30718 non-null  int64 
 1   workclass       30718 non-null  object
 2   education       30718 non-null  object
 3   marital-status  30718 non-null  object
 4   occupation      30718 non-null  object
 5   relationship    30718 non-null  object
 6   race            30718 non-null  object
 7   sex             30718 non-null  object
 8   capital-gain    30718 non-null  int64 
 9   capital-loss    30718 non-null  int64 
 10  hours-per-week  30718 non-null  int64 
 11  native-country  30718 non-null  object
 12  label           30718 non-null  object
dtypes: int64(4), object(9)
memory usage: 3.3+ MB


Unnamed: 0,age,capital-gain,capital-loss,hours-per-week,label,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,...,race_ Asian-Pac-Islander,race_ Black,race_ Other,race_ White,sex_ Female,sex_ Male,native-country_ low_PPP,native-country_ United-States,native-country_high_PPP,native-country_medium_PPP
0,39,2174,0,40,<=50K,False,False,False,False,False,...,False,False,False,True,False,True,False,True,False,False
1,50,0,0,13,<=50K,False,False,False,False,True,...,False,False,False,True,False,True,False,True,False,False


These features are then added to the original dataframe, and the original columns deleted.
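The encode, merge and drop steps above can also be done in one call by passing the `columns` argument to `pd.get_dummies`, which encodes the named columns and removes the originals. The sketch below uses a made-up three-row frame rather than the real data.

```python
import pandas as pd

# Toy frame: one numeric column, one categorical column and the label.
demo = pd.DataFrame({
    "age": [39, 50, 38],
    "sex": [" Male", " Female", " Male"],
    "label": [0, 0, 1],
})

# Encode only the listed columns; the untouched columns are kept as they are.
encoded = pd.get_dummies(demo, columns=["sex"])
print(encoded.columns.tolist())
print(encoded)
```

Unlike the merge approach, this keeps the original row index, so a `reset_index(drop=True)` would still be needed if a clean 0..n-1 index is wanted.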



Continuing with the necessary conversion of the data, the label column was updated so that:

0 = $50,000 or below

1 = Above $50,000


In [28]:
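#Task 13 - replace the label with a 1 for over 50k and a 0 for under
# map() builds the integer column directly and avoids the downcasting FutureWarning that replace() can raise in recent pandas versions
df['label'] = df['label'].map({" <=50K": 0, " >50K": 1})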
print(df.head())


   age  capital-gain  capital-loss  hours-per-week  label  \
0   39          2174             0              40      0   
1   50             0             0              13      0   
2   38             0             0              40      0   
3   53             0             0              40      0   
4   28             0             0              40      0   

   workclass_ Federal-gov  workclass_ Local-gov  workclass_ Private  \
0                   False                 False               False   
1                   False                 False               False   
2                   False                 False                True   
3                   False                 False                True   
4                   False                 False                True   

   workclass_ Self-emp-inc  workclass_ Self-emp-not-inc  ...  \
0                    False                        False  ...   
1                    False                         True  ...   
2             



Feature Selection

Feature selection was then conducted to identify any strongly or weakly correlated relationships between the label and the different variables. This process aims to reduce the dimensions of the dataset.

No features were removed; however, justification should be provided regarding why this ML model was selected based on these findings.

Importantly, this was just a start and should have been continued with a more detailed correlation investigation.

In [29]:
# Task 14 - find out if capital-gain and age have any correlation
print(df[['capital-gain', 'age']].corr())


              capital-gain       age
capital-gain      1.000000  0.080392
age               0.080392  1.000000
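The pairwise check above can be extended to rank every numeric feature by its correlation with the label. The sketch below does this on a made-up frame with an already encoded 0/1 label; the values are illustrative, and `corr_with_label` is a name introduced here.

```python
import pandas as pd

# Toy frame with two numeric features and a 0/1 label.
demo = pd.DataFrame({
    "age": [25, 38, 52, 46, 29, 61],
    "hours-per-week": [40, 50, 60, 45, 20, 40],
    "label": [0, 0, 1, 1, 0, 1],
})

# Pearson correlation of each feature with the label, strongest first.
corr_with_label = (
    demo.corr()["label"]
    .drop("label")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_label)
```

Pearson correlation only captures linear relationships, so a low value here is a prompt for further investigation rather than proof that a feature is useless.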


Lastly, I dropped the label so that I can start scaling and normalising the features.

In [30]:
# Task 15 - put the label column into its own dataframe and get rid of the label column in the main one
label_df = df['label']
features = df.drop("label",axis = 1)
print(features.columns)


Index(['age', 'capital-gain', 'capital-loss', 'hours-per-week',
       'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Private',
       'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc',
       'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 10th',
       'education_ 11th', 'education_ 12th', 'education_ 1st-4th',
       'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th',
       'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors',
       'education_ Doctorate', 'education_ HS-grad', 'education_ Masters',
       'education_ Preschool', 'education_ Prof-school',
       'education_ Some-college', 'marital-status_ Divorced',
       'marital-status_ Married-AF-spouse',
       'marital-status_ Married-civ-spouse',
       'marital-status_ Married-spouse-absent',
       'marital-status_ Never-married', 'marital-status_ Separated',
       'marital-status_ Widowed', 'occupation_ Adm-clerical',
       'occupation_ Armed-Forces', 'oc

Scaling the Data



The features were then scaled because of the differences between the distributions of the features.

Importantly, I should really prove this by showing the distributions of the different features and discussing how it would impact the different ML models.

The MinMaxScaler was used to transform the data.


In [31]:
# Task 16- scale the data
from sklearn.preprocessing import MinMaxScaler


scaler = MinMaxScaler()
scaler.fit(features)
scaled_df = scaler.transform(features)
scaled_df = pd.DataFrame(scaled_df, columns=features.columns)
scaled_df.head(5)

Unnamed: 0,age,capital-gain,capital-loss,hours-per-week,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,...,race_ Asian-Pac-Islander,race_ Black,race_ Other,race_ White,sex_ Female,sex_ Male,native-country_ low_PPP,native-country_ United-States,native-country_high_PPP,native-country_medium_PPP
0,0.30137,0.02174,0.0,0.397959,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
1,0.452055,0.0,0.0,0.122449,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
2,0.287671,0.0,0.0,0.397959,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0
3,0.493151,0.0,0.0,0.397959,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
4,0.150685,0.0,0.0,0.397959,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
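As a sanity check on the scaled values above, min-max scaling with the default range of 0 to 1 maps each value x to (x - min) / (max - min) within its column. The sketch below applies that formula by hand to a made-up age column; the minimum of 17 and maximum of 90 are assumptions chosen for illustration, not values read from the notebook.

```python
import pandas as pd

# Toy age column; 17 and 90 are assumed extremes for illustration.
ages = pd.Series([17, 39, 90], name="age")

# Min-max scaling: subtract the column minimum, divide by the column range.
scaled = (ages - ages.min()) / (ages.max() - ages.min())
print(scaled.tolist())
```

With these assumed extremes, an age of 39 becomes 22/73, which agrees with the 0.30137 shown for the first record.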




Exploring the machine learning algorithms

With the transformation of the data complete, a selection of algorithms was chosen to explore the best method of predicting the salary classification. Below are the selected algorithms:

1) Logistic Regression

2) Random Forest

3) Extra Trees

4) Support Vector Machine

5) Neural Network


In [32]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix


The data was then divided into train and test datasets. We have two ways to do this: a single hold-out split with train_test_split, or repeated splits through cross_val_score. We will initially look at the results from train_test_split.

In [33]:
#task 17 - split the data
X_train, X_test,y_train,y_test = train_test_split(scaled_df,label_df)
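For comparison, the cross_val_score route mentioned earlier could look like the sketch below. It uses synthetic data from `make_classification` as a stand-in for the real features and label, and it wraps the scaler and a LogisticRegression model in a pipeline so that the scaler is fitted only on the training part of each fold. The fold count of 5 and the choice of model are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; the notebook would use its own features and label.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# The pipeline refits the scaler on the training portion of every fold.
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# Five-fold cross-validation, reporting accuracy for each fold.
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Averaging over several folds gives a steadier estimate than one split, at the cost of training the model several times.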

The classifiers are then created with mostly default settings (only the number of estimators, gamma and the iteration limit are set explicitly), and each classifier is trained and tested with the data.

In [34]:
classifiers = [
    ExtraTreesClassifier(n_estimators=100),
    svm.SVC(gamma='scale'),
    RandomForestClassifier(n_estimators=100),
    MLPClassifier(max_iter=1000),
    LogisticRegression(max_iter=1000)
]


for clf in classifiers:
    clf.fit(X_train,y_train)
    name = clf.__class__.__name__
    print(name)
    prediction = clf.predict(X_test)
    acc = accuracy_score(y_test, prediction)
    print("Acc : ",acc)
    matrix = confusion_matrix(y_test, prediction)
    print(matrix)

ExtraTreesClassifier
Acc :  0.8296875
[[5205  577]
 [ 731 1167]]
SVC
Acc :  0.8358072916666667
[[5354  428]
 [ 833 1065]]
RandomForestClassifier
Acc :  0.847265625
[[5306  476]
 [ 697 1201]]
MLPClassifier
Acc :  0.8404947916666666
[[5214  568]
 [ 657 1241]]
LogisticRegression
Acc :  0.8459635416666667
[[5357  425]
 [ 758 1140]]




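As we can see, all of the machine learning algorithms achieve an accuracy above 80%, ranging from about 83.0% (ExtraTreesClassifier) to 84.7% (RandomForestClassifier). For context, 5782 of the 7680 test records belong to the <=50K class, so always predicting that class would already score about 75.3%. With so many algorithms having a similar accuracy, further investigation will be undertaken by looking at the confusion matrices.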





When evaluating a classification machine learning model, it's essential to use appropriate metrics that reflect its performance accurately. Metrics such as accuracy, precision, recall (sensitivity), and F1-score are suitable for assessing classification models. Accuracy measures the proportion of correctly classified instances out of the total instances, providing an overall view of the model's correctness. Precision quantifies the accuracy of positive predictions, while recall (sensitivity) measures the proportion of actual positives that are correctly identified by the model. The F1-score, which is the harmonic mean of precision and recall, balances these two metrics and is particularly useful in scenarios where there is an uneven class distribution.
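These metrics can be worked out directly from a confusion matrix. The sketch below uses the RandomForestClassifier matrix printed earlier, read in scikit-learn's layout where rows are the true classes and columns are the predicted classes, with class 1 (above $50,000) treated as the positive class.

```python
# Counts taken from the RandomForestClassifier confusion matrix:
# [[TN, FP],
#  [FN, TP]]
tn, fp = 5306, 476
fn, tp = 697, 1201

# Share of all predictions that were correct.
accuracy = (tp + tn) / (tn + fp + fn + tp)
# Share of predicted positives that were truly positive.
precision = tp / (tp + fp)
# Share of actual positives that the model found.
recall = tp / (tp + fn)
# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.3f}")
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"f1        = {f1:.3f}")
```

The accuracy of about 0.847 matches the printed score, while the recall of about 0.633 shows that over a third of the higher earners are missed, something the accuracy figure alone hides.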
