# For Group Members

## Data Split

After doing EDA on the data, we saw that there were about twice as many males as females and that the majority race was White. This prompted us to think about these questions:
- How does the proportion of high income individuals vary across sex?
- How does the proportion of high income individuals vary across race?
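
Either question can be answered with a row-normalized cross-tabulation. The sketch below uses a small made-up frame (the `toy` data and its values are illustrative only, not drawn from the Adult dataset). With the real data, the same call would presumably take `X['sex']` and `y['income']`.

```python
import pandas as pd

# Hypothetical toy data standing in for the Adult features and targets
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "income": [">50K", "<=50K", "<=50K", ">50K", "<=50K", "<=50K"],
})

# Row-normalized crosstab: each row is the income distribution within one sex
income_by_sex = pd.crosstab(toy["sex"], toy["income"], normalize="index")
print(income_by_sex)

# Proportion of high-income individuals within each sex
high_income_share = income_by_sex[">50K"]
```

Swapping `"sex"` for `"race"` gives the corresponding table for the second question.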

At an in-person group meeting, we decided to focus on the first question. The `race` variable contains categories with very few observations, which would pose a challenge when doing a test/train split. Hence, we perform a test/train split stratified on `sex`, so that the distribution of the **sexes** in the test data matches that of the overall data.
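
To illustrate why sparse categories are awkward for stratification, here is a minimal sketch on made-up data (the `toy` frame and its `group` column are hypothetical). It shows the extreme case, where a category has a single row and scikit-learn refuses to stratify on it. The `race` categories in the real data are small rather than singletons, so this is an illustration of the failure mode, not a reproduction of it.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: group "C" has only one observation
toy = pd.DataFrame({
    "value": range(10),
    "group": ["A"] * 6 + ["B"] * 3 + ["C"],
})

try:
    train_test_split(toy, test_size=0.3, stratify=toy["group"], random_state=0)
    split_failed = False
except ValueError as err:
    # A class with a single member cannot appear in both train and test
    split_failed = True
    print(f"Stratified split failed: {err}")
```

Even when every category has enough rows for the split to run, a category with only a handful of rows contributes just one or two test observations, so any per-category estimate on the test set is noisy.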

In [13]:
# The imblearn module is distributed on PyPI as imbalanced-learn
!pip install imbalanced-learn

# Importing necessary libraries
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split

# Fetch the Adult dataset from UCI ML repository
adult = fetch_ucirepo(id=2)



In [14]:
## You can use this to generate the test/train data

# Separate features and targets
X = adult.data.features
# Copy the targets so the cleanup below modifies our own DataFrame
# (assigning into the original slice raises a SettingWithCopyWarning)
y = adult.data.targets.copy()
# Strip trailing periods and whitespace so variants of the same label collapse to one
y['income'] = y['income'].str.strip('.').str.strip()

# Ensure the split maintains the proportion of 'sex' in the test set
# Stratify by the 'sex' column in X (categorical feature)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=X['sex'], random_state=42
)

# Display the proportion of 'sex' in the overall dataset, train, and test sets
overall_sex_proportion = X['sex'].value_counts(normalize=True)
train_sex_proportion = X_train['sex'].value_counts(normalize=True)
test_sex_proportion = X_test['sex'].value_counts(normalize=True)

overall_sex_proportion, train_sex_proportion, test_sex_proportion


(sex
 Male      0.668482
 Female    0.331518
 Name: proportion, dtype: float64,
 sex
 Male      0.668492
 Female    0.331508
 Name: proportion, dtype: float64,
 sex
 Male      0.668441
 Female    0.331559
 Name: proportion, dtype: float64)

In [15]:
import os

# Create a 'data' directory in the parent directory if it doesn't exist
os.makedirs("../data", exist_ok=True)

# Save X_train, y_train, X_test, and y_test to separate CSV files
X_train.to_csv("../data/X_train.csv", index=True)        # Save features of train set
y_train.to_csv("../data/y_train.csv", index=True)        # Save targets of train set
X_test.to_csv("../data/X_test.csv", index=True)          # Save features of test set
y_test.to_csv("../data/y_test.csv", index=True)          # Save targets of test set

"Datasets saved as '../data/X_train.csv', '../data/y_train.csv', '../data/X_test.csv', and '../data/y_test.csv' with index."

"Datasets saved as '../data/X_train.csv', '../data/y_train.csv', '../data/X_test.csv', and '../data/y_test.csv' with index."

In [16]:
## You can use this code to load the saved train/test data
## If you paste it in commented out, select it and press Ctrl + / to uncomment (on Windows)

import pandas as pd

# Load the saved datasets from the CSV files
X_train = pd.read_csv("../data/X_train.csv", index_col=0)  # Use the first column as index
y_train = pd.read_csv("../data/y_train.csv", index_col=0)  # Use the first column as index
X_test = pd.read_csv("../data/X_test.csv", index_col=0)    # Use the first column as index
y_test = pd.read_csv("../data/y_test.csv", index_col=0)    # Use the first column as index

# Display the first few rows of the loaded datasets
(X_train.head(), y_train.head(), X_test.head(), y_test.head())


(       age         workclass  fnlwgt education  education-num  \
 27381   50       Federal-gov  222020   HS-grad              9   
 682     34  Self-emp-not-inc  190290   HS-grad              9   
 6266    31           Private   56026   HS-grad              9   
 15380   35           Private  167990       9th              5   
 39248   52           Private   91506   HS-grad              9   
 
            marital-status         occupation relationship   race   sex  \
 27381  Married-civ-spouse       Adm-clerical      Husband  White  Male   
 682    Married-civ-spouse       Craft-repair      Husband  White  Male   
 6266   Married-civ-spouse       Craft-repair      Husband  White  Male   
 15380  Married-civ-spouse  Machine-op-inspct      Husband  White  Male   
 39248  Married-civ-spouse  Handlers-cleaners      Husband  White  Male   
 
        capital-gain  capital-loss  hours-per-week native-country  
 27381             0             0              48  United-States  
 682          

# Checks

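I do some checks below to confirm that the saved CSV files reload to the same data as the original split. Note that the load cell above reassigns `X_train`, `y_train`, `X_test` and `y_test`, so for the comparison to be meaningful these variables should hold the original split (re-run the split cell first) rather than data already reloaded from CSV.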

In [17]:
## You can use this code to load it

import pandas as pd

# Load the saved datasets from the CSV files
X_train_loaded = pd.read_csv("../data/X_train.csv", index_col=0)  # Use the first column as index
y_train_loaded = pd.read_csv("../data/y_train.csv", index_col=0)  # Use the first column as index
X_test_loaded = pd.read_csv("../data/X_test.csv", index_col=0)    # Use the first column as index
y_test_loaded = pd.read_csv("../data/y_test.csv", index_col=0)    # Use the first column as index

# Display the first few rows of the loaded datasets
(X_train_loaded.head(), y_train_loaded.head(), X_test_loaded.head(), y_test_loaded.head())

# Compare loaded data with original data

# Check if the original and loaded training feature sets are equal
X_train_comparison = X_train_loaded.equals(X_train)
y_train_comparison = y_train_loaded.equals(y_train)

# Check if the original and loaded test feature sets are equal
X_test_comparison = X_test_loaded.equals(X_test)
y_test_comparison = y_test_loaded.equals(y_test)

# Create a summary of the comparisons
comparison_results = {
    'X_train_equal': X_train_comparison,
    'y_train_equal': y_train_comparison,
    'X_test_equal': X_test_comparison,
    'y_test_equal': y_test_comparison,
}

# Print out the comparison results
for key, value in comparison_results.items():
    print(f"{key}: {value}")

# If any discrepancies are found, print the differing cells.
# DataFrame.compare shows only the values that differ, side by side,
# and treats NaNs in the same position as equal (like .equals does).
# It requires both frames to have identical row and column labels.
if not X_train_comparison:
    print("\nMismatched values in X_train:")
    print(X_train_loaded.compare(X_train))

if not y_train_comparison:
    print("\nMismatched values in y_train:")
    print(y_train_loaded.compare(y_train))

if not X_test_comparison:
    print("\nMismatched values in X_test:")
    print(X_test_loaded.compare(X_test))

if not y_test_comparison:
    print("\nMismatched values in y_test:")
    print(y_test_loaded.compare(y_test))


X_train_equal: True
y_train_equal: True
X_test_equal: True
y_test_equal: True
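
As a stricter alternative to `.equals`, `pandas.testing.assert_frame_equal` raises an error describing the first difference it finds. The sketch below round-trips a small made-up frame through a temporary CSV file. The `original` frame and the `toy.csv` file name are hypothetical; the index values simply mimic the shuffled row labels of a split.

```python
import os
import tempfile

import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical toy frame with a non-default index, like the shuffled split indices
original = pd.DataFrame(
    {"age": [50, 34, 31], "sex": ["Male", "Male", "Female"]},
    index=[27381, 682, 6266],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")
    original.to_csv(path, index=True)           # keep the row labels in the file
    reloaded = pd.read_csv(path, index_col=0)   # restore them as the index

# Raises AssertionError with a description of the first difference, if any
assert_frame_equal(original, reloaded)
round_trip_ok = True
```

The comparison is only informative when `original` is the in-memory frame that was written, not a frame that was itself read back from the same file.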


# Extension (SMOTE)

Since the income classes are imbalanced (roughly 76% `<=50K` versus 24% `>50K` in the training data), we may want to use a technique to balance them. One such method is SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic minority-class samples by interpolating between a minority sample and its nearest neighbours.

You may want to use this if you decide to tackle the imbalanced data.
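
To make the interpolation idea concrete, here is a minimal NumPy sketch of how a single synthetic point could be generated. It is an illustration of the idea only, not imbalanced-learn's implementation. The `minority` array and the `synthetic_sample` helper are hypothetical names introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class points in a 2-D numeric feature space
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]])


def synthetic_sample(points, idx, k, rng):
    """Create one synthetic point between points[idx] and one of its k nearest neighbours."""
    distances = np.linalg.norm(points - points[idx], axis=1)
    # Position 0 of the sorted order is the point itself (distance 0), so skip it
    neighbour_ids = np.argsort(distances)[1 : k + 1]
    neighbour = points[rng.choice(neighbour_ids)]
    gap = rng.random()  # uniform in [0, 1)
    # Move a random fraction of the way from the point towards the neighbour
    return points[idx] + gap * (neighbour - points[idx])


new_point = synthetic_sample(minority, idx=0, k=2, rng=rng)
print(new_point)
```

Because the new point is a weighted average of two existing points, the method assumes numeric features. On integer-coded categorical columns the interpolated values fall between category codes, which is why imbalanced-learn also provides `SMOTENC` for mixed numeric and categorical data.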

In [18]:
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE
import pandas as pd

# Initialize label encoders for each categorical column
label_encoders = {}
categorical_columns = X_train.select_dtypes(include=['object']).columns  # Get categorical columns

# Apply label encoding to categorical columns
X_train_encoded = X_train.copy()  # Make a copy of the original data
for col in categorical_columns:
    le = LabelEncoder()
    X_train_encoded[col] = le.fit_transform(X_train_encoded[col])
    label_encoders[col] = le  # Save the encoder for each column

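# Apply SMOTE to the label-encoded training data
# Caveat: SMOTE interpolates between the integer category codes as if they were
# ordered numbers; imblearn's SMOTENC is intended for mixed numeric/categorical features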
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_encoded, y_train)

# After SMOTE, reverse the label encoding to restore original categorical values
X_train_smote_restored = X_train_smote.copy()
for col in categorical_columns:
    le = label_encoders[col]  # Get the original label encoder for the column
    X_train_smote_restored[col] = le.inverse_transform(X_train_smote_restored[col])

# Save the SMOTE-applied datasets (with restored categorical values) to CSV files
X_train_smote_restored.to_csv("../data/X_train_smote.csv", index=True)  # Save features of SMOTE train set
y_train_smote.to_csv("../data/y_train_smote.csv", index=True)           # Save targets of SMOTE train set

print("SMOTE-applied datasets (with restored categories) saved as '../data/X_train_smote.csv' and '../data/y_train_smote.csv'.")

SMOTE-applied datasets (with restored categories) saved as '../data/X_train_smote.csv' and '../data/y_train_smote.csv'.


In [19]:
# Load the SMOTE-applied datasets
X_train_smote_loaded = pd.read_csv("../data/X_train_smote.csv", index_col=0)  # Use the first column as index
y_train_smote_loaded = pd.read_csv("../data/y_train_smote.csv", index_col=0)  # Use the first column as index

# Display the first few rows of the datasets
print("X_train_smote Loaded from CSV:")
print(X_train_smote_loaded.head())

print("\ny_train_smote Loaded from CSV:")
print(y_train_smote_loaded.head())

X_train_smote Loaded from CSV:
   age         workclass  fnlwgt education  education-num      marital-status  \
0   50       Federal-gov  222020   HS-grad              9  Married-civ-spouse   
1   34  Self-emp-not-inc  190290   HS-grad              9  Married-civ-spouse   
2   31           Private   56026   HS-grad              9  Married-civ-spouse   
3   35           Private  167990       9th              5  Married-civ-spouse   
4   52           Private   91506   HS-grad              9  Married-civ-spouse   

          occupation relationship   race   sex  capital-gain  capital-loss  \
0       Adm-clerical      Husband  White  Male             0             0   
1       Craft-repair      Husband  White  Male             0             0   
2       Craft-repair      Husband  White  Male             0             0   
3  Machine-op-inspct      Husband  White  Male             0             0   
4  Handlers-cleaners      Husband  White  Male             0             0   

   hours-per-

In [20]:
# Check the distribution of the target variable y_train before and after SMOTE
y_train_distribution_before = y_train.value_counts(normalize=True)
y_train_smote_distribution_after = y_train_smote.value_counts(normalize=True)

y_train_distribution_before, y_train_smote_distribution_after

(income
 <=50K     0.759706
 >50K      0.240294
 Name: proportion, dtype: float64,
 income
 <=50K     0.5
 >50K      0.5
 Name: proportion, dtype: float64)