
# Hands-on Exercise: Introduction to ML data preprocessing

## Problem Description:

For this hands-on exercise, we will work on a real-world classification problem using the "Adult Income" dataset. The goal is to predict whether an individual's annual income exceeds \$50,000 based on various demographic and employment-related features. This is a binary classification problem, where the target variable has two classes: ">50K" (representing an annual income greater than \$50,000) and "<=50K" (representing an annual income less than or equal to \$50,000).

## Dataset:

The "Adult Income" dataset is publicly available and can be obtained from the UCI Machine Learning Repository. The dataset contains information about individuals, including their age, education, occupation, marital status, and other relevant features.


## Key Features:

*   age: The age of the individual (continuous)
*   workclass: The type of employer (categorical)
*   fnlwgt: The final weight, i.e. the number of people the census takers believe the observation represents (continuous)
*   education: The highest level of education achieved (categorical)
*   education-num: The numeric representation of the education level (continuous)
*   marital-status: The marital status of the individual (categorical)
*   occupation: The occupation of the individual (categorical)
*   relationship: The family relationship of the individual (categorical)
*   race: The race of the individual (categorical)
*   sex: The gender of the individual (categorical)
*   capital-gain: The capital gains of the individual (continuous)
*   capital-loss: The capital losses of the individual (continuous)
*   hours-per-week: The number of hours the individual works per week (continuous)
*   native-country: The native country of the individual (categorical)
*   income: The target variable, indicating whether the individual's annual income exceeds \$50,000 (binary)




### Load dataset

In [13]:
# Import necessary libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

In [14]:
# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
column_names = [
    "age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
    "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
    "hours-per-week", "native-country", "income"
]
data = pd.read_csv(url, header=None, names=column_names)

In [15]:
# Display the first few rows of the dataset
print("First few rows of the dataset:")
print(data.head())

First few rows of the dataset:
   age          workclass  fnlwgt   education  education-num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital-status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital-gain  capital-loss  hours-per-week  native-country  income  
0          2174             0        

### Data cleaning

In [16]:
# Check for missing values
print("\nMissing values in each column:")
print(data.isnull().sum())


Missing values in each column:
age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64


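The output above shows that no column contains missing values (`NaN`). Note that `isnull()` only detects true nulls, so placeholder strings standing in for missing data would not be counted.
The next step is to check for unexpected or irregular values in the dataset.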

For categorical columns, list unique values to spot unexpected entries:

In [17]:
# List the unique values of a categorical column
data['workclass'].unique()

array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
       ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay',
       ' Never-worked'], dtype=object)
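
The same check can be run over every text column at once. A minimal sketch, using a few illustrative rows in the dataset's format as a stand-in for `data` (the `sample` frame, `text_columns`, and `placeholder_counts` are names introduced here, not part of the original notebook):

```python
import pandas as pd

# Illustrative rows in the dataset's format (note the leading space in each string).
sample = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": [" State-gov", " ?", " Private"],
    "occupation": [" Adm-clerical", " ?", " Handlers-cleaners"],
})

# Keep only the non-numeric (text) columns.
text_columns = [
    col for col in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[col])
]

placeholder_counts = {}
for col in text_columns:
    print(col, sample[col].unique())
    # Count the ' ?' placeholder, which isnull() does not report as missing.
    placeholder_counts[col] = int((sample[col] == " ?").sum())

print(placeholder_counts)
```

In the notebook itself, replacing `sample` with `data` should list the categories of all nine text columns in one pass.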

For numeric columns, generate descriptive statistics to get a quick overview of potential issues, such as implausible minimum or maximum values.

Plotting the numeric columns can also help to understand their distributions.

In [18]:
statistics_summary = data.describe()
statistics_summary

| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 |
| mean | 38.581647 | 189778.4 | 10.080679 | 1077.648844 | 87.30383 | 40.437456 |
| std | 13.640433 | 105550.0 | 2.57272 | 7385.292085 | 402.960219 | 12.347429 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237051.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |
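
The summary hints at two things worth a closer look: `capital-gain` is zero up to the 75th percentile, and its maximum of 99999 looks like a cap or sentinel rather than a real amount (that reading is an assumption, not something confirmed here). A sketch of how both could be quantified, on a small illustrative frame standing in for `data`:

```python
import pandas as pd

# Illustrative values only; use `data` instead of `sample` in the notebook.
sample = pd.DataFrame({"capital-gain": [0, 0, 0, 2174, 0, 99999, 0, 14084]})

col = sample["capital-gain"]
zero_share = (col == 0).mean()          # fraction of rows with no capital gain
at_max = int((col == col.max()).sum())  # rows sitting exactly at the maximum

print(f"share of zeros: {zero_share:.3f}")
print(f"rows at max ({col.max()}): {at_max}")
```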


The unique values reveal an irregular entry, ' ?' (with a leading space), in categorical columns such as `workclass`. Next, we replace these entries with an explicit "Unknown" category.

In [19]:
# Handle irregular values
data["workclass"] = data["workclass"].replace(" ?", "Unknown")
data["occupation"] = data["occupation"].replace(" ?", "Unknown")
data["native-country"] = data["native-country"].replace(" ?", "Unknown")
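
The ' ?' entries carry a leading space because the raw file separates fields with a comma followed by a space. An alternative sketch, not the approach used above: let `pd.read_csv` strip that space and treat `?` as missing at load time. The in-memory `raw` text is a hypothetical three-row stand-in for the file at `url`.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded file (same ", " separator).
raw = io.StringIO(
    "39, State-gov, Adm-clerical\n"
    "54, ?, ?\n"
    "38, Private, Handlers-cleaners\n"
)

sample = pd.read_csv(
    raw,
    header=None,
    names=["age", "workclass", "occupation"],
    skipinitialspace=True,  # drop the space after each comma
    na_values="?",          # read '?' as NaN
)

# The placeholders now show up as real missing values.
missing_counts = sample.isnull().sum()
print(missing_counts)

# Missing categories can then be labelled explicitly, as in the cell above.
sample = sample.fillna({"workclass": "Unknown", "occupation": "Unknown"})
print(sample)
```

With this approach `isnull().sum()` reports the placeholders directly, and the category strings no longer start with a space.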

Preparing data for machine learning models that require numerical input:

(1) Transform categorical variables into numeric format:

In [20]:
# Encode categorical variables using LabelEncoder
categorical_features = [
    "workclass", "education", "marital-status", "occupation",
    "relationship", "race", "sex", "native-country", "income"
]
label_encoders = {}
for feature in categorical_features:
    label_encoders[feature] = LabelEncoder()
    data[feature] = label_encoders[feature].fit_transform(data[feature])

(2) Scale numerical features using StandardScaler:

In [21]:
numerical_features = [
    "age", "fnlwgt", "education-num", "capital-gain", "capital-loss", "hours-per-week"
]
scaler = StandardScaler()

# Separate features and target variable
X = data.drop("income", axis=1)
y = data["income"]

# Standardize each numerical column to zero mean and unit variance
X[numerical_features] = scaler.fit_transform(X[numerical_features])

### Prepare training and testing sets

Splitting data into training and testing sets is essential for building a reliable machine learning model. Here's why:

Training the model: The training set is used to teach the model by allowing it to find patterns in the data. The model learns to predict the target variable (like "income") based on the features (like age, hours worked, etc.).

Testing the model: The testing set is kept separate to check how well the model performs on new, unseen data. This gives a realistic idea of how the model will work in real-world situations.

Detecting overfitting: If the model memorizes the training data, it becomes too specific to that data, a problem called overfitting. An overfitted model performs well on the training set but poorly on new data because it does not generalize. A testing set does not prevent overfitting by itself, but it lets us check whether the model overfits.

Comparing performance on the training set with performance on the testing set shows how well the model's predictions generalize to new data.
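
To make the overfitting point concrete, here is a small sketch that is not part of the lab itself. It assumes synthetic data from `make_classification` and an unrestricted `DecisionTreeClassifier`, both chosen only for illustration, and compares accuracy on the two splits:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 assigns a fraction of labels at random (unlearnable noise).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# A tree with no depth limit can memorize the training set, noise included.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

A large gap between the two numbers is the signal to look for; without the held-out split, only the optimistic training figure would be visible.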

In [22]:
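# Separate features and target variable
# Note: this rebuilds X from `data`, so the scaling applied to X in the earlier cell
# is discarded; skip this cell to keep the scaled features.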
X = data.drop("income", axis=1)
y = data["income"]

In [23]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
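
Because the testing set is meant to stand in for unseen data, a common practice is to fit preprocessing such as `StandardScaler` on the training split only and reuse those statistics on the testing split. The scaling cell earlier in this lab fits on the full dataset, so what follows is an alternative ordering rather than what was run. A sketch with synthetic numbers (`demo`, `X_tr`, and `X_te` are illustrative names):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numerical columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(17, 91, size=200),
    "hours-per-week": rng.integers(1, 100, size=200),
})

X_tr, X_te = train_test_split(demo, test_size=0.2, random_state=42)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only...
X_tr_scaled = scaler.fit_transform(X_tr)
# ...then apply the same transformation to the test rows.
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(3))  # close to 0 by construction
print(X_te_scaled.mean(axis=0).round(3))  # generally not exactly 0
```

Fitting on the training rows only keeps information about the test rows out of the preprocessing step.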

In [28]:
# Print the shapes of the training and testing sets
print("\nShape of training set:")
print("Features:", X_train.shape)
print("Target:", y_train.shape)
print("\nShape of testing set:")
print("Features:", X_test.shape)
print("Target:", y_test.shape)
print("\nExample of the data: ")
print(X_train.head())


Shape of training set:
Features: (26048, 14)
Target: (26048,)

Shape of testing set:
Features: (6513, 14)
Target: (6513,)

Example of the data: 
       age  workclass  fnlwgt  education  education-num  marital-status  \
5514    33          1  198183          9             13               4   
19777   36          3   86459          8             11               2   
10781   58          5  203039          6              5               5   
32240   21          3  180190          8             11               2   
9876    27          3  279872         15             10               0   

       occupation  relationship  race  sex  capital-gain  capital-loss  \
5514            9             1     4    0             0             0   
19777           3             0     4    1             0          1887   
10781           2             1     4    1             0             0   
32240           4             0     4    1             0             0   
9876            7             1  
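
The shapes follow from the 32,561 rows and `test_size=0.2`. A quick arithmetic check, assuming the test count is rounded up (which reproduces the shapes printed above):

```python
import math

n_samples = 32561  # row count reported by data.describe()
test_size = 0.2

n_test = math.ceil(n_samples * test_size)  # 6512.2 rounds up to 6513
n_train = n_samples - n_test               # remaining rows go to training

print(n_train, n_test)
```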

## Conclusion

Congratulations! You have reached the end of the first lab. You have successfully:


*   Retrieved a dataset
*   Inspected it for oddities
*   Cleaned the data
*   Encoded the categorical variables and scaled the numerical features
*   Split the data into training and testing sets
