# DTSC-670 Foundations of Machine Learning
## Assignment: Census Income
### Name: (Please Enter Your Name Before Submitting)

## Copyright & Academic Integrity Notice
<span style="color:red">This material is for enrolled students' academic use only and protected under U.S. Copyright Laws. This content must not be shared outside the confines of this course, in line with Eastern University's academic integrity policies. Unauthorized reproduction, distribution, or transmission of this material, including but not limited to posting on third-party platforms like GitHub, is strictly prohibited and may lead to disciplinary action. You may not alter or remove any copyright or other notice from copies of any content taken from BrightSpace or Eastern University’s website.</span>
 
<span style="color:red">© Copyright Notice 2024, Eastern University - All Rights Reserved.</span> 

## Student Learning Objectives

- Keep refining your data preparation skills, including building pipelines and employing column transformers.
- Develop a custom transformer tailored to your data manipulation needs.
- Further enhance your proficiency in the machine learning work flow.

## CodeGrade
This assignment will be automatically graded through CodeGrade, and you will have unlimited submission attempts. To ensure successful grading, please follow these instructions carefully: Name your notebook as `census_income_assignment.ipynb` before submission, as CodeGrade requires this specific filename for grading purposes. Additionally, make sure there are no errors in your notebook, as CodeGrade will not be able to grade it if errors are present. Before submitting, we highly recommend restarting your kernel and running all cells again to ensure that there will be no errors when CodeGrade runs your script.

## Assignment Overview
In this assignment, your focus will remain on honing your skills in using Scikit-Learn functions for data preprocessing, with the overarching objective of constructing a classification machine learning model for predicting whether an individual's income exceeds $50,000 based on specific features. A pivotal aspect of this assignment will involve implementing a custom transformer to manipulate your dataset effectively.  Custom transformers enable us to customize preprocessing steps, making it effortless to reuse the code whenever we encounter new data or data preprocessing needs.

### Data
This data comes from the 1994 Census database and is a widely known beginner friendly dataset in the machine learning community.  You may hear it referred to as the "Adult" or "Census Income" dataset.  It is often used for practice with data exploration and in testing classification models.  More information about the dataset can be found on [UCI Machine Learning's site](https://archive.ics.uci.edu/dataset/2/adult).  Please make sure that you are using the files from Brightspace as they have been changed for this assignment.

The columns in the file are as follows:

    - age : continuous.
    - workclass : Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
    - fnlwgt : continuous. Individuals who share similar demographic characteristics should ideally have similar weightings. However, it's crucial to remember that this principle holds true only within each state. This limitation arises because the CPS sample comprises 51 state-specific samples, each with its unique probability of selection.
    - education : Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
    - education_num : continuous.
    - marital_status : Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
    - occupation : Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
    - relationship : Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
    - race : White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
    - sex : Female, Male.
    - capital_gain : continuous.
    - capital_loss : continuous.
    - days_per_week : continuous. Days worked per week.
    - hours_per_day : continuous. Hours worked per day.
    - native_country : United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
    - income : <=50K, >50K (this is our label)

### Assignment Instructions
Walk through the rest of the assignment, completing the exercises as indicated.  As you read through the markdown comments, the provided code, and create your own code, think about how each section fits into the overall machine learning process.  

Once you have completed all the tasks, you are ready to submit your assignment to CodeGrade for testing. Please restart your notebook's kernel and run your code from the beginning to ensure there are no error messages. Once you have verified that the code runs without any issues, submit your .ipynb notebook file to CodeGrade for evaluation. Your notebook should be called `census_income_assignment.ipynb`. You have unlimited attempts for this assignment.

### Table of Contents 
1. [Standard Imports](#import)
2. [Get the Data](#data)
3. [Explore the Data](#explore)
4. [Prepare the Data](#prepare)
5. [Model Selection & Evaluation](#model_selection)
6. [Classification Metrics](#metrics)
7. [Final Model Evaluation](#final_model)
 

## Standard Imports<a name="import"></a>
Run the code block below to import your standard imports and setup the notebook for CodeGrade grading.

In [32]:
# standard imports
import pandas as pd
import numpy as np

# Do not change this option; This allows the CodeGrade auto grading to function correctly
pd.set_option('display.max_columns', 20)
np.set_printoptions(suppress=True)
import warnings
warnings.filterwarnings("ignore")

## Get the Data<a name="data"></a>
**Exercise 1:** In the code block below, import the `census_income.csv` file and call the DataFrame `census_income.` If you examine the CSV file, you'll observe that it contains "?" for missing values. To handle this, refer to the [Pandas read_csv() function documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) to identify and use the suitable parameter for encoding all "?" values as NA/NaN.

In [33]:
### ENTER CODE HERE ###
census_income=pd.read_csv('census_income.csv')

Let's again begin by examining fundamental details about the dataset. First, we will review the columns, check the total count of non-null entries, and analyze the data types associated with each column.

In [34]:
# check basic info about dataset and notice missing values
census_income.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             32561 non-null  int64  
 1   workclass       32561 non-null  object 
 2   fnlwgt          32561 non-null  int64  
 3   education       32561 non-null  object 
 4   education_num   32561 non-null  int64  
 5   marital_status  32561 non-null  object 
 6   occupation      32561 non-null  object 
 7   relationship    32561 non-null  object 
 8   race            32561 non-null  object 
 9   sex             32561 non-null  object 
 10  capital_gain    32561 non-null  int64  
 11  capital_loss    32561 non-null  int64  
 12  days_per_week   32561 non-null  float64
 13  hours_per_day   32561 non-null  float64
 14  native_country  32561 non-null  object 
 15  income          32561 non-null  object 
dtypes: float64(2), int64(5), object(9)
memory usage: 4.0+ MB


**Exercise 2:** Before diving deeper into the data, we should stop and create a training and a test set.
1) Since we are trying to predict whether an individual earns over $50K, save the `income` column as a Series named `income_label`.
2) Drop the `income` column from the `census_income` DataFrame and save the remaining columns as a DataFrame named `income_features`.
3) Utilize Scikit-learn's `train_test_split` function, employing the `income_features` and `income_label` variables, to partition the data into a training set and a test set. Allocate 80% of the instances for training and 20% for testing. Set the random_state to 42 to ensure reproducibility of our results.  Assign the DataFrames the following names: `X_train`, `X_test`, `y_train`, and `y_test`.

In [35]:
### ENTER CODE HERE ###
income_label= census_income['income']



In [36]:
### ENTER CODE HERE ###
income_features=census_income.drop('income',axis=1)

In [37]:
### ENTER CODE HERE ###
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test=train_test_split(income_features,income_label, test_size=0.2, random_state=42)


## Explore the Data<a name="explore"></a>


Rather than exploring the data together in this notebook, we recommend taking some time to review how someone else approached data exploration. You can find a valuable example in this [Kaggle notebook by user ADITI MULYE](https://www.kaggle.com/code/aditimulye/adult-income-dataset-from-scratch). This individual has done a thorough job of exploring the data using methods we may not have covered yet. Examining how others explore data is a valuable learning experience that I highly recommend you invest time in.  Please keep in mind that while their exploration is insightful, our dataset may differ, so not all aspects may align perfectly.  

## Prepare the Data<a name="prepare"></a>
In this section, we will build a custom transformer to address specific data preparation needs. We'll also set up pipelines to automate the data preparation process and introduce a column transformer to apply transformations to numeric and categorical columns in your dataset.

Let's begin by creating our custom transformer.  

### Custom Transformer
Create a custom transformer, similar to the custom transformer that we saw in the California House Prices module example.  Go back and review that code, making sure that you understand the various pieces, in order to more easily create this custom transformer.  

**Exercise 3:** Create a custom transformer that takes the numerical columns from your data and performs the following transformations:
1) You must name your custom transformer class `CensusIncomeTransformer` 
2) Your class should include an input parameter called `create_new_column` with a default value of `True` that performs the following two data preparation steps when its value is `True`, but skips these steps and just returns the DataFrame as is when you pass a value of `False`.
   - Adds an attribute to the end of the numerical data (i.e. new last column) that is the result of the `days_per_week` column multiplied by the `hours_per_day` column.  We are creating this column to better compare the amount of hours worked between the individuals.
   - Since they are not needed with the new column, delete the `days_per_week` and `hours_per_day` columns.
   - Remember that you only want these two steps to occur when the `create_new_column` parameter is `True`.  Your custom transfomer will be tested in CodeGrade to make sure these steps are not ran when the `create_new_column` parameter is `False`.

This transformer will be used in a pipeline. In that pipeline, an imputer will be run *before* this transformer. Keep in mind that the imputer will output an array, so **this transformer must be written to accept an array.**  This is very important and a cause of many errors that students encounter.  In other words, think about using NumPy in your transformer instead of Pandas.

Additionally, this transformer will ONLY be given the numerical features of the data. The categorical features will be handled elsewhere in the full pipeline. This means that your code for this transformer **must reflect the absence of the categorical columns** when indexing data structures.  Again this is very important and a cause of the second most number of errors that students encounter.

In [38]:
### ENTER CODE HERE ###
from sklearn.base import BaseEstimator, TransformerMixin

age_1,fnlwgt_1,education_num_1, capital_gain_1, captial_loss_1,days_per_week_1, hours_per_day_1= 0,2,4,10,11,12,13

class CensusIncomeTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, create_new_column=True):
       self.create_new_column=create_new_column
        
    def fit(self,X,y=None):
        return self

    def transform(self, X):
       num_features=census_income [[ 'age','fnlwgt','education_num', 'capital_gain', 'captial_loss']]
       if create_new_column==True:
         new_last_column=  X[:,days_per_week_1] * X[:,hours_per_day_1]
         return np.c_[num_feature, new_last_column]

       else: 
         return num_feautres



### Pipelines

**Exercise 4:** Create a pipeline for only the numeric data called `num_pipeline` that:

1) Utilizes Scikit-Learn's `make_pipeline` function to generate a pipeline named `num_pipeline`.  
2) Within this pipeline, begin by incorporating a `SimpleImputer` transformation using the `mean` strategy. Please note that this strategy employs the "mean" instead of the "median" as used in the previous assignment. You might be wondering why we are introducing this step, given that there are no missing data in the numerical columns at the moment. While this holds true for the current dataset, we cannot always guarantee that incoming data will be free of missing values. Therefore, it is advisable to prepare for this possibility as a best practice.
3) Next, apply the custom `CensusIncomeTransformer` class to the data.    
4) Finally, add a `StandardScalar` transformation into the pipeline.


In [39]:
### ENTER CODE HERE ###
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_pipeline= make_pipeline(SimpleImputer(),CensusIncomeTransformer(), StandardScaler())

**Exercise 5:** Create a pipeline for only the categorical data called `cat_pipeline` that:

1) Utilizes Scikit-Learn's `make_pipeline` function to generate a pipeline named `cat_pipeline`.  
2) Begin by incorporating a `SimpleImputer` transformation using the `most_frequent` strategy.  This will replace any missing values with the most frequent value for each column.
3) Next, add an `OneHotEncoder` class to the pipeline, making sure that the `drop` parameter is set to "first".  You must also include the `sparse_output=False` parameter to prevent a sparse array from being generated to make CodeGrade testing easier.   

In [40]:
### ENTER CODE HERE ###
cat_pipeline=make_pipeline((SimpleImputer(strategy='most_frequent')), (OneHotEncoder(drop='first', sparse_output=False)))

### Column Transformer
Next, you will create a Column Transformer to pass your numeric data to the `num_pipeline` and your categorical features to the `cat_pipeline` you created above.

**Exercise 6:**
1) Create a list of your numerical feature column names (in the order they appear in the original data). Name this list `num_attributes`.
2) Create a list of your categorical feature column names (in the order they appear in the original data). Name this list `cat_attributes`.
3) Utilize Scikit-learn's `ColumnTransformer` function to create a transformer that:
    - Directs the numeric data through the previously defined `num_pipeline`.
    - Directs the categorical features through the previously defined `cat_pipeline`.
    - Name this ColumnTransformer object `preprocessing`.
4) Invoke the fit_transform() method on the `X_train` dataset to generate the preprocessed data. Store the resulting output in a variable named `X_train_prepared`.

In [43]:
### ENTER CODE HERE ###
from sklearn.compose import ColumnTransformer

num_attributes=['age','fnlwgt','education_num', 'capital_gain', 'capital_loss','days_per_week','hours_per_day']
cat_attributes=['workclass','education','marital_status','occupation','relationship','race','sex','native_country']

preprocessing= ColumnTransformer(transformers=[
    ('num_pipeline', num_pipeline,num_attributes),
    ('cat_pipeline',cat_pipeline,cat_attributes)
], remainder='passthrough')

X_train_prepared= preprocessing.fit_transform(X_train)
census_income.columns

KeyError: "['captial_loss'] not in index"

## Model Selection<a name="model_selection"></a>
In this section, we will employ various models on our preprocessed data and assess their accuracy scores using Scikit-Learn's `cross_val_score` function. Proceed by executing the following cell groups to fit models, including a Logistic Regression model, a Stochastic Gradient Descent classifier, and a Random Forest classifier.

In [44]:
### Logistic Regression Classifier ###

from sklearn.linear_model import LogisticRegression

# instantiate a Logistic Regression Class 
# increasing the maximum number of iterations taken for the solvers to converge
log_clf = LogisticRegression(random_state=42, max_iter=1000)

# fit the model
log_clf.fit(X_train_prepared, y_train)

NameError: name 'X_train_prepared' is not defined

In [27]:
from sklearn.model_selection import cross_val_score

# check the accuracy scores
cross_val_score(log_clf, X_train_prepared, y_train, cv=3, scoring="accuracy")

NameError: name 'X_train_prepared' is not defined

In [28]:
### Stochastic Gradient Descent Classifier ###

from sklearn.linear_model import SGDClassifier

# instantiate SGD CLassifier Class
sgd_clf = SGDClassifier(random_state=42)

# fit the model 
sgd_clf.fit(X_train_prepared, y_train)

NameError: name 'X_train_prepared' is not defined

In [29]:
# check the accuracy scores
cross_val_score(sgd_clf, X_train_prepared, y_train, cv=3, scoring="accuracy")

NameError: name 'X_train_prepared' is not defined

In [30]:
### Random Forest Classifier ###

from sklearn.ensemble import RandomForestClassifier

# instantiate a Random Forest Classifier Class using default parameters
# we won't do it in this assignment, but normally we would want to perform a grid search to 
# find the best parameters to use
rnd_for_clf = RandomForestClassifier(random_state=0)

# fit the model
rnd_for_clf.fit(X_train_prepared, y_train)

NameError: name 'X_train_prepared' is not defined

In [31]:
# check the accuracy scores
cross_val_score(rnd_for_clf, X_train_prepared, y_train, cv=3, scoring="accuracy")

NameError: name 'X_train_prepared' is not defined

The accuracy scores for all three models fall within the range of 84-85%. In a real-world project, the next step would involve fine-tuning the model's hyperparameters. However, for the purpose of this assignment, we will not delve into hyperparameter tuning. 

Given that the accuracy scores are quite similar among the models, we will proceed with the Logistic Regression model. As we have come to understand, accuracy alone doesn't provide a comprehensive assessment of classification tasks. To gain a more comprehensive insight, let's evaluate the precision, recall, and F1 scores for this model.

## Classification Metrics<a name="metrics"></a>
**Exercise 7:** 
1) Utilizing Scikit-Learn's [cross_val_predict](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html#sklearn.model_selection.cross_val_predict) function, generate a list of predictions named `y_train_pred` by using the `log_clf` model, your `X_train_prepared` data, and your `y_train` data using 3-fold cross-validation.
2) Next, calculate the precision score, recall score, and F1 score by employing the [precision_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score), [recall_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score), and [f1_score]((https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score)) functions on the `y_train` and `y_train_pred` data, and store the results in variables named `precision`, `recall`, and `f1` respectively. It's important to note that you should specify `pos_label=">50K"` as a parameter in these functions to indicate the positive label.  Round these scores to 2 decimal places.

In [None]:
### ENTER CODE HERE ###

In [None]:
### ENTER CODE HERE ###

In [None]:
### ENTER CODE HERE ###

In [None]:
### ENTER CODE HERE ###

Take a moment to reflect on these metrics and their meanings. If an employer or interviewer were to inquire about them, consider how you would explain these metric scores to someone else. It's very important for a data scientist, or someone working in the field, to be able to convey a clear understanding of these metrics, including their significance in evaluating the performance of a machine learning model.

## Final Model Evaluation<a name="final_model"></a>
We are now ready to assess our model's performance on the test set, utilizing the previously established Logistic Regression model. It is necessary that we apply any data transformations we performed on the training data to the testing data. It is crucial to emphasize that we should solely apply transformations to the testing data without using the "fit_transform" method, as we want to exclusively use the information derived from the training data for this transformation process.

**Exercise 8:**
1) Utilizing the previously established `preprocessing` ColumnTransformer, apply transformations to your `X_test` data, and label the resulting dataset as `X_test_prepared`. It is essential that you should refrain from using the fit_transform method on your testing data in any capacity.
2) With the pre-fitted `log_clf` model, make predictions using the `X_test_prepared` dataset and store these predictions as a variable called `final_predictions`.
3) Utilizing Scikit-learn's [accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) function, calculate the accuracy by passing your `y_test` and `final_predictions`.  Round the accuracy score to 2 decimal places. Save this score as `final_accuracy`.

In [None]:
### ENTER CODE HERE ###