# Assignment 7 (Week 7)

>**Note**: 

```
- Late submissions are penalized.
- Only GitHub submissions are acceptable.
```

## Name: Sheriffdeen Abatan

<br>

## Please show and display ALL your calculations and results.
> Remember to read the **`instructions`** carefully.

## Documentation of Machine Learning Model Report

**`Introduction:`**
This report provides an overview of a machine learning model built to predict the income level of individuals from features such as age, education, and occupation. 
The model uses a classification algorithm to determine whether an individual earns more than $50,000 per year.

**`Data:`**
The data used to build the machine learning model was obtained from the UCI Machine Learning Repository. 
The dataset contains over 32,000 records, each with 14 attributes plus the target variable, income level. 
The data was preprocessed to remove missing values, and categorical variables were converted into numerical values with label encoding.
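
The preprocessing described above can be sketched on a toy frame. This is a minimal illustration, not the notebook's actual code: the `toy` frame is made up, and the use of `"?"` as the missing-value placeholder is an assumption about how gaps are marked in the raw file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (hypothetical values).
# Assumption: missing entries are marked with the placeholder "?".
toy = pd.DataFrame(
    {
        "Age": [39, 50, 38, 53],
        "Workclass": ["State-gov", "?", "Private", "Private"],
        "Salary": ["<=50K", ">50K", "<=50K", ">50K"],
    }
)

# Treat the placeholder as missing, then drop incomplete rows.
cleaned = toy.replace("?", np.nan).dropna().reset_index(drop=True)

# Convert the remaining categorical column to integer codes
# (categories are sorted alphabetically: Private -> 0, State-gov -> 1).
cleaned["Workclass"] = cleaned["Workclass"].astype("category").cat.codes

print(cleaned)
```

Dropping rows is the simplest option; imputing the most frequent category would keep more records at the cost of some added noise.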

**`Model Selection:`**
After preprocessing the data, several classification algorithms were tested to determine which performed best. 
The algorithms tested were Logistic Regression, Decision Tree, Random Forest, and Support Vector Machine. 
The Random Forest algorithm was chosen because it provided the highest accuracy score.
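
A comparison of this kind could look roughly like the sketch below. It runs on synthetic data from `make_classification`, not on the salary dataset, and the hyperparameters and 5-fold cross-validation are illustrative assumptions rather than the settings used for this report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the salary dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# The four candidate algorithms named in the report, with illustrative settings.
candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Support Vector Machine": SVC(),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Print the candidates from best to worst.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Cross-validation averages over several splits, so the ranking depends less on one particular train/test partition.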

**`Model Training and Evaluation:`**
The machine learning model was trained on 70% of the data using the Random Forest algorithm. 
The model was then evaluated on the remaining 30% to measure its performance. 
The evaluation metrics used were Accuracy Score, Precision Score, Recall Score, and F1 Score.

**`Results:`**
The machine learning model built using the Random Forest algorithm had an accuracy score of about 85.8%, 
which means that it correctly classified 85.8% of the records in the evaluation dataset. 
The Precision Score was about 74.8%, which means that when the model predicted that an individual earned more than $50,000 per year, 
it was correct 74.8% of the time. The Recall Score was about 62.4%, 
which means that the model correctly identified 62.4% of the individuals who earned more than $50,000 per year. 
The F1 Score, the harmonic mean of the Precision Score and Recall Score, was about 68.0%.

**`Conclusion:`**
The machine learning model built using the Random Forest algorithm performed reasonably well in predicting an individual's income level based on various features. 
The model achieved an accuracy score of about 85.8%. 
Its Precision (74.8%), Recall (62.4%), and F1 (68.0%) Scores for the over-$50,000 class were lower, which indicates that the model misses more than a third of the individuals who earned more than $50,000 per year. 
The model could be used to predict an individual's income level for various purposes, such as determining eligibility for loans or other financial services, provided this recall is acceptable for the application.

In [10]:
# Built-in library
import itertools

# Standard imports
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


# pandas settings
pd.options.display.max_rows = 1_000
pd.options.display.max_columns = 1_000
pd.options.display.max_colwidth = 1_000


# Black code formatter (Optional)
%load_ext lab_black



In [11]:
%reload_ext lab_black

In [12]:
df = pd.read_csv("salary.csv")

In [13]:
df.head(5)

Unnamed: 0,Age,Workclass,Final_weight,Education,Education_num,Marital_status,Occupation,Relationship,Race,Sex,Capital_gain,Capital_loss,Hours_per_week,Country,Salary
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K


> The data can be found [here](https://drive.google.com/file/d/1_c3KA14xQC02K0QZ4cpi1emjdz0rqHzb/view?usp=share_link).

### Data Dictionary

```
- Age: continuous.

- Workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

- Final_weight: continuous.

- Education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

- Education_num: continuous.

- Marital_status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

- Occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

- Relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

- Race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

- Sex: Female, Male.

- Capital_gain: continuous.

- Capital_loss: continuous.

- Hours_per_week: continuous.

- Country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

- Salary: >50K, <=50K.
```

### Objective

```
Predict whether a person makes over 50K a year.

```



### Qs 1. Build a machine learning model that predicts the salary.

### Qs 2. Evaluate the performance of your model using at least three (3) performance metrics.

<hr>

## Note: 

- The assignment **should** be submitted through a `public` GitHub repository.

In [14]:
# Qs 1. Build a machine learning model that predicts the salary.

# Encode each categorical feature as integer codes.
# The same encoder object is refit for every column.
categorical_columns = [
    "Workclass",
    "Education",
    "Marital_status",
    "Occupation",
    "Relationship",
    "Race",
    "Sex",
    "Country",
]

le = LabelEncoder()
for column in categorical_columns:
    df[column] = le.fit_transform(df[column])
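
The encoding step above refits a single encoder for every column, so only the mapping for the last column survives. If the original labels need to be recovered later, one option is to keep one encoder per column. The sketch below shows this on a made-up two-column frame; the `toy` data and the `encoders` dictionary are hypothetical names, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data with two categorical columns.
toy = pd.DataFrame(
    {"Sex": ["Male", "Female", "Male"], "Race": ["White", "Black", "White"]}
)

# Keep one fitted encoder per column so each mapping can be reversed later.
encoders = {}
for column in ["Sex", "Race"]:
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(toy[column])

# Recover the original labels for one column.
decoded_sex = encoders["Sex"].inverse_transform(toy["Sex"])
print(list(decoded_sex))
```

`LabelEncoder` assigns codes in sorted label order, so here `Female` becomes 0 and `Male` becomes 1.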

In [15]:
# split dataset into training and testing data

X = df.drop(columns=["Salary"])
y = df["Salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

In [16]:
# choose a machine learning algorithm and train on data

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

RandomForestClassifier(random_state=42)

In [17]:
# test the model using the testing data and calculate accuracy score

y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy score: {accuracy}")

Accuracy score: 0.8575962325962326


### Qs 2. Evaluate the performance of your model using at least three (3) performance metrics.

The following performance metrics are used to evaluate the model, with `>50K` treated as the positive class:

Accuracy: The accuracy score measures the proportion of correct predictions made by the model over all predictions. It is a commonly used metric for classification problems.

Precision: Precision is the ratio of true positives to the sum of true positives and false positives. It measures how often a positive prediction made by the model is correct.

Recall: Recall is the ratio of true positives to the sum of true positives and false negatives. It measures the ability of the model to find all positive instances.

F1: The F1 score is the harmonic mean of precision and recall. It is high only when both precision and recall are high.
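
To make these definitions concrete, the sketch below computes the counts and ratios by hand for eight made-up predictions. The labels are hypothetical and unrelated to the model's real output; the variable names are chosen only for this illustration.

```python
# Hypothetical true labels and predictions (not taken from the model).
true_labels = [">50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K"]
pred_labels = [">50K", ">50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K"]
positive = ">50K"

pairs = list(zip(true_labels, pred_labels))

# Confusion-matrix counts for the positive class.
tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
tn = sum(t != positive and p != positive for t, p in pairs)  # true negatives

accuracy_demo = (tp + tn) / len(pairs)  # correct predictions / all predictions
precision_demo = tp / (tp + fp)         # correct positives / predicted positives
recall_demo = tp / (tp + fn)            # correct positives / actual positives

print(tp, fp, fn, tn)
print(accuracy_demo, precision_demo, recall_demo)
```

With these labels the counts are 2 true positives, 1 false positive, 1 false negative, and 4 true negatives, giving an accuracy of 0.75 and a precision and recall of 2/3 each.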

In [18]:
# Qs 2. Evaluate the performance of your model using at least three (3) performance metrics.

# calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label=" >50K")
recall = recall_score(y_test, y_pred, pos_label=" >50K")
f1 = f1_score(y_test, y_pred, pos_label=" >50K")

# print evaluation metrics
print(f"Accuracy score: {accuracy}")
print(f"Precision score: {precision}")
print(f"Recall score: {recall}")
print(f"F1 score: {f1}")

Accuracy score: 0.8575962325962326
Precision score: 0.7482305358948432
Recall score: 0.6236831015592078
F1 score: 0.6803033785336704
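
As a quick consistency check on the printed scores, the F1 score can be recomputed from the printed precision and recall using the harmonic-mean formula. The values below are copied from the output above; the variable names are chosen for this sketch.

```python
# Values copied from the printed output above.
printed_precision = 0.7482305358948432
printed_recall = 0.6236831015592078
printed_f1 = 0.6803033785336704

# F1 is the harmonic mean of precision and recall.
f1_from_parts = (
    2 * printed_precision * printed_recall / (printed_precision + printed_recall)
)

print(f"{f1_from_parts:.6f}")  # prints 0.680303
```

The recomputed value agrees with the printed F1 score, so the three numbers are mutually consistent.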
