


Introduction

The Adult dataset is drawn from the 1994 Census database. Details of this dataset can be found at the UCI Machine Learning Repository. My main aim is to build a model that predicts whether a person's income exceeds $50,000 a year. The objectives that I will follow to achieve this aim are set out below:

1) Clean the data: replace or delete missing data.

2) Explore the data: delete any unnecessary fields.

3) Scale the data.

4) Convert the needed data for modelling.

5) Explore the machine learning algorithms.

6) Optimise the selected machine learning algorithm.






Stage 1: Exploratory Data Analysis (EDA)

Task 1 - Present the code that tells me all of the columns, data types, and records

Importantly, you should begin considering whether the dimensionality of the dataset is high or low. Additionally, you should strive to understand the information described by all the columns.


In [None]:
#Task 1 - upload the Salary.txt file and turn it into a pandas dataframe.
import pandas as pd

columns = ["age", "workclass", "fnlwgt", "education", "education-num",
           "marital-status", "occupation", "relationship", "race", "sex",
           "capital-gain", "capital-loss", "hours-per-week", "native-country",
           "label"]
df = pd.read_csv("Salary.txt", index_col=False, names=columns)

# info() prints its own report and returns None, so it is not wrapped in print().
df.info()
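
To judge whether the dimensionality is high or low, it can help to look at the shape of the dataframe and the number of distinct values per column. The sketch below is only illustrative: `toy` is a made-up four-row frame, not the census data, and the same calls would be applied to `df`.

```python
import pandas as pd

# Made-up stand-in for the census dataframe (values are illustrative only).
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": [" State-gov", " Private", " Private", " Private"],
    "label": [" <=50K", " <=50K", " >50K", " <=50K"],
})

n_rows, n_cols = toy.shape      # number of records and number of columns
cardinality = toy.nunique()     # distinct values in each column

# Non-numeric columns will need encoding before modelling.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print(cardinality)
print(categorical_cols)
```

Columns with many distinct values are the ones most likely to inflate the number of features once they are encoded.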

From the information collected on the dataframe, we know that there are 14 features that could be used to predict the label column. However, 2 of the original 15 columns are not self-explanatory: neither fnlwgt nor education-num has an obvious relationship to salary. After further examining the information on the dataset, fnlwgt is a weight on the sample and education-num is a numeric column that repeats the information in the education column. It was decided that both columns would be dropped from the dataframe.

In [None]:
# Task 2 - drop the following columns from the dataframe :"fnlwgt", "education-num"
df = df.drop(["fnlwgt", "education-num"], axis=1)
df.info()



From the information file on the dataset, we know that all missing data was represented as "?".

All missing data was changed to the pandas missing-value marker, pd.NA.


In [None]:
#Task 3 - replace the ? with pd.NA
# The values read from the file carry a leading space, hence " ?" rather than "?".
df = df.replace(" ?", pd.NA)

A report on the missing data was then produced.

In [None]:
#Task 4 - check to see if we have any missing values
print(df.isnull().sum())

More information was gathered to find out how much data would be lost if we deleted all the missing data.

In [None]:
#Task 5 - check to see how many records we will lose by creating 2 dataframes. One that drops all missing values and the other that doesn't. Compare the number of records in each dataframe.
new_df = df.dropna()
print("################")
print("Total records that would be lost if deleted : ")
print(df.shape[0]- new_df.shape[0])
del new_df

Although we would lose only 2,399 of the 32,561 records (about 7.4%), further exploration of the data could reduce that loss.

Importantly, there are several strategies that could be used to address this missing data. Before making a decision, I need to understand the significance of the missing records and the feature itself.
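
As a rough illustration of what those strategies look like in pandas, the sketch below compares three common options on a made-up frame (`toy` and the "Unknown" placeholder are assumptions for the example, not part of the dataset).

```python
import pandas as pd

# Made-up data with two missing workclass entries.
toy = pd.DataFrame({
    "workclass": [" Private", None, " Private", " State-gov", None],
    "age": [25, 38, 28, 44, 18],
})

# Strategy A: drop every record that has a missing value.
dropped = toy.dropna()

# Strategy B: fill the gaps with the most frequent value (the mode).
mode_value = toy["workclass"].mode()[0]
mode_filled = toy["workclass"].fillna(mode_value)

# Strategy C: keep "missing" as a category of its own.
flagged = toy["workclass"].fillna("Unknown")

print(len(dropped), repr(mode_value))
print(flagged.value_counts())
```

Dropping is the simplest option but loses records, mode filling keeps them at the cost of strengthening the dominant category, and a separate category lets the model decide whether missingness is informative.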

**Exploring the data**

In [None]:
#Task 6- Find out the value count for each of the columns that have missing records
print(df['workclass'].value_counts())
print(df['occupation'].value_counts())
print(df['native-country'].value_counts())





From the information above, we can see that a single value of the native-country feature, United-States, accounts for over 91% of the data. Therefore, a decision was made to set all missing data in the native-country column to "United-States".

*Importantly, this was not the only handling strategy that could have been used. I could have explored the correlation between country and salary to determine whether this feature could be removed completely, or trained an ML classifier to predict the most likely country from the other features.*
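
One hedged way to start exploring the relationship between country and salary is a row-normalised cross-tabulation. The frame below is made up for illustration; on the real data the same call would take `df["native-country"]` and `df["label"]`.

```python
import pandas as pd

# Made-up sample: 6 records for one country and 2 each for two others.
toy = pd.DataFrame({
    "native-country": [" United-States"] * 6 + [" Mexico"] * 2 + [" India"] * 2,
    "label": [" >50K", " <=50K", " <=50K", " >50K", " <=50K", " <=50K",
              " <=50K", " <=50K", " >50K", " <=50K"],
})

# Share of each income class within every country (each row sums to 1).
rates = pd.crosstab(toy["native-country"], toy["label"], normalize="index")
print(rates)
```

If the share of " >50K" barely changes from one country to the next, that would support dropping the feature; large differences would support keeping or grouping it.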

Additionally, most of the other unique values had very few records, which gives the column a high cardinality. The decision was made to combine the countries in a way that could add to the model, using the information from the website:

https://www.nationsonline.org/oneworld/GNI_PPP_of_countries.htm

to split the countries into high_PPP, medium_PPP and low_PPP.

Task 7 - Replace all of the pd.NA values with United-States and loop round each of the arrays to replace the countries with high_PPP, medium_PPP and low_PPP

low_PPP = [" Honduras", " Vietnam"," Cambodia"," Laos"," Haiti", " Yugoslavia"," India"," Guatemala", " Nicaragua"]

medium_PPP = [" Trinadad&Tobago"," Poland" ," Mexico" , " Thailand"," Iran"," Columbia", " Peru", " Philippines" ," China"," Ecuador" , " Cuba"," El-Salvador"," Jamaica"," South"]

high_PPP = [" Holand-Netherlands"," Scotland"," Ireland"," Hong"," Beligum" ," Japan"," Italy"," England"," Germany"," Canada"," France"," Taiwan"," Greece"," Portugal" , " Hungary"," Outlying-US(Guam-USVI-etc)", " Puerto-Rico", " Dominican-Republic"]

Importantly, this was not the only handling strategy that could have been used. I could have kept the data as it was and evaluated the different models to select the most suitable one.


In [None]:

# The most common value stands in for the missing countries.
df["native-country"] = df["native-country"].fillna(" United-States")

# Each name keeps the leading space carried by the values in the data file.
low_PPP = [" Honduras", " Vietnam"," Cambodia"," Laos"," Haiti",
               " Yugoslavia"," India"," Guatemala", " Nicaragua"]
medium_PPP = [" Trinadad&Tobago"," Poland" ," Mexico" , " Thailand"," Iran",
                " Columbia", " Peru", " Philippines" ," China"," Ecuador" ,
                " Cuba"," El-Salvador"," Jamaica"," South"]
high_PPP = [" Holand-Netherlands"," Scotland"," Ireland"," Hong"," Beligum" ," Japan"," Italy"," England"," Germany"," Canada"," France"," Taiwan"," Greece",
               " Portugal" , " Hungary"," Outlying-US(Guam-USVI-etc)",
               " Puerto-Rico", " Dominican-Republic"]

# replace() accepts a list of values, so each group is mapped in one call.
df["native-country"] = df["native-country"].replace(high_PPP, "high_PPP")
df["native-country"] = df["native-country"].replace(medium_PPP, "medium_PPP")
# The original label had two stray leading spaces ("  low_PPP"); removed for consistency.
df["native-country"] = df["native-country"].replace(low_PPP, "low_PPP")

print(df['native-country'].value_counts())

The next step would be to look at the workclass and occupation data and how the records are distributed among the unique values. Does either column have high cardinality? What should I do with the missing values?
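
A minimal sketch of how those two questions could be checked is shown below, using a made-up `workclass` column rather than the real one.

```python
import pandas as pd

# Made-up column: 7 Private, 2 State-gov and 1 missing entry.
toy = pd.Series([" Private"] * 7 + [" State-gov"] * 2 + [None], name="workclass")

# Share of each value, with missing entries counted as their own row.
shares = toy.value_counts(normalize=True, dropna=False)

missing_share = toy.isna().mean()   # fraction of records with no value
n_categories = toy.nunique()        # distinct values, missing not counted

print(shares)
print(missing_share, n_categories)
```

A small missing share with one dominant category points towards mode filling, whereas a large missing share is a reason to treat missingness as a category in its own right.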

For the rest of the columns, I will drop all missing values.

In [None]:
#task 11 - Drop all missing values
df = df.dropna()



Encoding the data

The data has been cleaned and combined to make it more effective in the modelling process. However, a number of columns still have a datatype that is unsuitable for some of the machine learning algorithms that I will be using.

I will be converting all the object datatypes to a numerical type. The options that I have for this include the LabelEncoder and the OneHotEncoder.

Because the categories in each column have no ordinal relationship, I have chosen to use pandas.get_dummies on the object datatypes. The following columns were converted to numeric values:

["workclass","education","marital-status","occupation","relationship","race","sex","native-country"]


In [None]:
#task 12 - encode the following columns "workclass","education","marital-status","occupation","relationship","race","sex","native-country" using the pd.get_dummies() method
df.info()
categorical_columns = ["workclass","education","marital-status","occupation",
                       "relationship","race","sex","native-country"]

# Passing columns= appends one 0/1 indicator column per category and drops
# the original columns in the same call, so no merge is needed.
df = pd.get_dummies(df, columns=categorical_columns, dtype=int)

# Renumber the rows, because dropna() left gaps in the index.
df = df.reset_index(drop=True)
df.head(2)

These features are then added to the original dataframe, and the original columns deleted.
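
To make the difference between the two encoding options concrete, here is a small sketch on a made-up `race` column. It is only meant to show why integer codes can suggest an order that does not exist; the toy values are assumptions for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up column with three categories.
toy = pd.DataFrame({"race": [" White", " Black", " Other", " White"]})

# LabelEncoder assigns integer codes in sorted order of the category names,
# so a model may read the codes as a ranking.
codes = LabelEncoder().fit_transform(toy["race"])

# get_dummies creates one 0/1 indicator column per category instead.
dummies = pd.get_dummies(toy, columns=["race"], dtype=int)

print(codes)
print(dummies)
```

With the indicator columns, every row has exactly one 1, so no category is treated as larger or smaller than another.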



Continuing with the necessary conversion of the data, the label column was updated so that:

0 = $50,000 or below

1 = Above $50,000


In [None]:
# map() converts the two label strings to integers in one step.
df['label'] = df['label'].map({" <=50K": 0, " >50K": 1})
print(df.head())


Feature Selection

Feature selection was then conducted to identify any highly or lowly correlated relationships between the label and the different variables. This process aims to reduce the dimensions of the dataset.

No features were removed; however, justification should be provided regarding why this ML model was selected based on these findings.

Importantly, this was just a start and should be continued with a more detailed correlation investigation.

In [None]:
print(df[['label','capital-gain','age']].corr())
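
A more detailed investigation could rank every feature by the strength of its correlation with the label. The sketch below does this on a made-up numeric frame; on the real data `toy` would simply be replaced by `df`.

```python
import pandas as pd

# Made-up numeric data with a binary label.
toy = pd.DataFrame({
    "age": [25, 38, 28, 44, 52, 31],
    "hours-per-week": [40, 50, 40, 40, 60, 20],
    "capital-gain": [0, 0, 0, 7688, 15024, 0],
    "label": [0, 0, 0, 1, 1, 0],
})

# Pearson correlation of each feature with the label, strongest first
# (sorted by absolute value so strong negative correlations are not hidden).
corr_with_label = (
    toy.corr()["label"]
    .drop("label")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_label)
```

Pearson correlation only captures linear relationships, so a low value here is a prompt for further checks rather than proof that a feature is useless.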


Lastly, I dropped the label so that I can start scaling and normalising the features.

In [None]:
label_df = df['label']
features = df.drop("label",axis = 1)
print(features.columns)


Scaling the Data



The features were then scaled because their distributions differ.

Importantly, I should really prove this by showing the distributions of the different features and discussing how they would affect the different ML models.

The MinMaxScaler was used to transform the data.
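
For reference, min-max scaling maps each column to the range 0 to 1 using (x - min) / (max - min). The sketch below checks that formula against MinMaxScaler on a made-up frame whose two columns have very different ranges.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up features on very different scales.
toy = pd.DataFrame({
    "age": [17.0, 38.0, 90.0],
    "capital-gain": [0.0, 5000.0, 99999.0],
})

# The raw ranges differ by several orders of magnitude.
print(toy.describe().loc[["min", "max"]])

# Min-max scaling applied by hand, column by column.
manual = (toy - toy.min()) / (toy.max() - toy.min())

# The same transformation with scikit-learn.
scaled = MinMaxScaler().fit_transform(toy)

print(np.allclose(manual.to_numpy(), scaled))
```

Distance-based and gradient-based models (such as the SVM and the neural network used later) are the ones most affected by unscaled features, while tree-based models are largely insensitive to them.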


In [None]:
from sklearn.preprocessing import MinMaxScaler


scaler = MinMaxScaler()
scaler.fit(features)
scaled_df = scaler.transform(features)
scaled_df = pd.DataFrame(scaled_df, columns=features.columns)
scaled_df.head(5)



Exploring the machine learning algorithms

With the transformation of the data complete, a selection of algorithms was chosen to explore the best method to predict the salary classification. Below are the selected algorithms:

1) Logistic Regression

2) Random Forest

3) Extra Trees

4) Support Vector Machine

5) Neural Network


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix


The data was then divided into training and test datasets. There are 2 ways to do this: a single split with train_test_split, or repeated splits with cross_val_score. We will initially look at the results from train_test_split.

In [None]:
#task 15
X_train, X_test,y_train,y_test = train_test_split(scaled_df,label_df)
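
For comparison, the cross_val_score option could look like the sketch below. It runs on synthetic data from make_classification (an assumption made so that the example is self-contained), and it places the scaler inside a pipeline, which is a variation on the approach used in this notebook rather than a copy of it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; the notebook would pass its own features and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# With the scaler inside the pipeline, it is refitted on the training part
# of every fold, so the held-out part never influences the scaling.
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation: five accuracy scores, one per held-out fold.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print(scores)
print(scores.mean(), scores.std())
```

The mean gives a steadier estimate than a single split, and the standard deviation shows how much the score depends on which records end up in the test fold.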

The classifiers are then created with mostly default settings, and each classifier is trained on the training data and tested on the test data.

In [None]:
classifiers = [
    ExtraTreesClassifier(n_estimators=100),
    svm.SVC(gamma='scale'),
    RandomForestClassifier(n_estimators=100),
    MLPClassifier(max_iter=1000),
    LogisticRegression(max_iter=1000)
]

alo = []                    # classifier names
min_max = []                # test accuracy on the min-max scaled features
confusion_matrix_list = []  # one confusion matrix per classifier
for clf in classifiers:
    clf.fit(X_train,y_train)
    name = clf.__class__.__name__
    alo.append(name)
    prediction = clf.predict(X_test)
    acc = accuracy_score(y_test, prediction)
    min_max.append(acc)
    matrix = confusion_matrix(y_test, prediction)
    confusion_matrix_list.append(matrix)

This information is placed in a chart so we can compare the accuracy of each classifier.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

print(min_max)
max_val = max(min_max)
print(max_val)
labels = list(alo)
index = np.arange(1,len(alo)+1)
bar_width = 0.35
fig, ax = plt.subplots(figsize=(10,10))

seen_labels = set()  # legend entries that have already been added
for pos, acc in zip(index, min_max):
    if acc == max_val:
        colour, label = "green", "Highest Acc Rating"
    elif acc > 0.8:
        colour, label = "r", "Above 80% Acc Rating"
    else:
        colour, label = "b", "Below 80% Acc Rating"
    # Only the first bar of each group gets a label, so the legend has no duplicates.
    legend_kwargs = {} if label in seen_labels else {"label": label}
    ax.bar(pos, acc, bar_width, color=colour, **legend_kwargs)
    seen_labels.add(label)

ax.legend(bbox_to_anchor=(1.3, 0.5))
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.set_xticks(index)
ax.set_xticklabels(labels, rotation=90)
# Write each accuracy just above its bar.
for pos, acc in zip(index, min_max):
    ax.text(pos-0.10, acc+0.02, "{:.2f}".format(acc))

[0.82421875, 0.8326822916666666, 0.8432291666666667, 0.83828125, 0.8438802083333333]
0.8438802083333333



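As the output above shows, all five machine learning algorithms score above 80% accuracy in this run, with logistic regression (84.4%) and the random forest (84.3%) marginally ahead of the neural network (83.8%). Because train_test_split shuffles the data randomly, the exact figures will differ from run to run. With so many algorithms having a similar accuracy rating, further investigation will be undertaken by looking at the confusion matrices.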





In [None]:
import matplotlib.pyplot as plt

# One panel per cell of the confusion matrix. scikit-learn lays the matrix
# out as [[TN, FP], [FN, TP]] for the labels 0 and 1.
fig, axes = plt.subplots(2, 2, figsize=(10,10), sharex=True)
titles = [["True Negative", "False Positive"],
          ["False Negative", "True Positive"]]
positions = list(range(1, len(confusion_matrix_list)+1))

for row in range(2):
    for col in range(2):
        ax = axes[row, col]
        for pos, matrix in zip(positions, confusion_matrix_list):
            # A separate bar() call per classifier gives each model its own colour.
            ax.bar(pos, matrix[row, col])
        ax.set_title(titles[row][col])

# Classifier names are shown on the bottom row only; the top row shares the x axis.
for ax in axes[1]:
    ax.set_xticks(positions)
    ax.set_xticklabels(alo, rotation=90)

When evaluating a classification machine learning model, it's essential to use appropriate metrics that reflect its performance accurately. Metrics such as accuracy, precision, recall (sensitivity), and F1-score are suitable for assessing classification models. Accuracy measures the proportion of correctly classified instances out of the total instances, providing an overall view of the model's correctness. Precision quantifies the accuracy of positive predictions, while recall (sensitivity) measures the proportion of actual positives that are correctly identified by the model. The F1-score, which is the harmonic mean of precision and recall, balances these two metrics and is particularly useful in scenarios where there is an uneven class distribution.
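
As a small worked example of these metrics, the sketch below scores a made-up set of ten predictions (the `y_true` and `y_pred` lists are illustrative, not results from the models above).

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels: 6 actual negatives followed by 4 actual positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)     # (TP + TN) / total
precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                 # 2 * P * R / (P + R)

print(tn, fp, fn, tp)
print(accuracy, precision, recall, f1)
```

Here the accuracy is 0.7 even though only half of the actual positives are found (recall 0.5), which is the kind of gap that accuracy alone can hide when the classes are uneven.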
