# Project Nexus: Phase 1 - AI-Assisted Data Preparation

This notebook focuses on the crucial first step of any responsible AI project: data preparation. We will load a publicly available dataset, analyze it for potential biases, and use interactive widgets and generative AI concepts to create a more balanced and robust training set.

---
**Note:** Before running this notebook, ensure you have run the `Nexus_Setup.ipynb` notebook to install all necessary libraries.

In [11]:
# Import libraries (assuming Nexus_Setup has been run)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact, interactive, fixed, IntSlider
from IPython.display import display, HTML

# ----------------------------------------------------
# 1. LOAD AND PRE-PROCESS DATA
# ----------------------------------------------------

print("Downloading and loading the Adult Census Income dataset...")
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
column_names = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
    'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
    'hours-per-week', 'native-country', 'income'
]

# Read the data and treat '?' as missing. skipinitialspace=True strips the
# leading space from each field, so missing entries appear as '?'.
df = pd.read_csv(data_url, names=column_names, na_values='?', skipinitialspace=True)
df.dropna(inplace=True)

# Convert income to a binary variable (0 for <=50K, 1 for >50K)
df['income'] = (df['income'] == '>50K').astype(int)

print("\nDataset loaded successfully. Here's a sample:")
display(df.head())
print(f"\nDataset shape: {df.shape}")

Downloading and loading the Adult Census Income dataset...

Dataset loaded successfully. Here's a sample:

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | 0 |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | 0 |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | 0 |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | 0 |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | 0 |

Dataset shape: (30162, 15)


In [13]:
# ----------------------------------------------------
# 2. INTERACTIVE BIAS IDENTIFICATION
# ----------------------------------------------------

print("\nAnalyzing dataset for potential biases...")

def plot_distribution(attribute):
    """Plot the count of each value of `attribute`, split by income class."""
    plt.figure(figsize=(10, 6))
    sns.countplot(x=attribute, hue='income', data=df)
    plt.title(f'Distribution of {attribute} by Income')
    plt.xticks(rotation=45)
    plt.show()

# Create an interactive dropdown to select the attribute.
# The list below is an illustrative choice of categorical columns.
print("Use the dropdown menu to explore the distribution of different attributes.")
interact(plot_distribution,
         attribute=['sex', 'race', 'education', 'workclass', 'marital-status', 'occupation']);

Analyzing dataset for potential biases...
Use the dropdown menu to explore the distribution of different attributes.
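
Count plots show raw counts, which can hide differences in rates when group sizes differ. The following is a minimal sketch of a numeric complement, assuming a frame with the same `sex` and binary `income` columns as `df`. The `positive_rate_by_group` helper and the tiny `sample` frame are hypothetical and exist only so the example runs without a download.

```python
import pandas as pd

def positive_rate_by_group(frame, attribute, target="income"):
    """Return group size and share of positive labels for each value of `attribute`."""
    summary = frame.groupby(attribute)[target].agg(count="size", positive_rate="mean")
    return summary.sort_values("positive_rate", ascending=False)

# Hypothetical toy data standing in for df
sample = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "income": [1, 1, 0, 0, 0, 0],
})

rates = positive_rate_by_group(sample, "sex")
print(rates)
```

In the notebook, the same call could be applied to the loaded data, for example `positive_rate_by_group(df, 'sex')`.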


# 3. Generative AI for Data Augmentation (Conceptual)

Generative AI can be a powerful tool for **data augmentation**, helping to address imbalances in a dataset. For example, if we have very few data points for a specific demographic group, we can use a generative model to create new, synthetic data points that mimic the characteristics of that group.
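
For comparison, the simplest non-generative way to rebalance a group is random oversampling, which duplicates existing rows rather than creating new ones. The sketch below illustrates that baseline only; it is not a generative model, and the `oversample_group` helper and `toy` frame are hypothetical names introduced for this example.

```python
import pandas as pd

def oversample_group(frame, column, value, target_count, seed=0):
    """Resample rows where frame[column] == value (with replacement) up to target_count."""
    group = frame[frame[column] == value]
    missing = target_count - len(group)
    if group.empty or missing <= 0:
        # Nothing to resample from, or the group is already large enough
        return frame.copy()
    extra = group.sample(n=missing, replace=True, random_state=seed)
    return pd.concat([frame, extra], ignore_index=True)

# Hypothetical toy data: six rows in one group, two in the other
toy = pd.DataFrame({
    "race": ["White"] * 6 + ["Black"] * 2,
    "income": [1, 0, 1, 0, 0, 1, 0, 1],
})

balanced = oversample_group(toy, "race", "Black", target_count=6)
print(balanced["race"].value_counts())
```

Because the added rows are exact copies, this adjusts group proportions without adding new information, which is the gap a generative approach aims to fill.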

Below, we'll use a pre-trained text generation model from **Hugging Face's Transformers library** to demonstrate this concept. Because our dataset is tabular, this is a conceptual illustration rather than a step in the pipeline: the model generates free-text descriptions of individuals, which could in principle serve as input to a text classification task.
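
One way to connect the tabular data to a text model is to format a record as a short sentence and use it as the prompt. This is a minimal sketch under that assumption; `row_to_prompt` and the example `record` are hypothetical, and the sentence template is an illustrative choice.

```python
def row_to_prompt(row):
    """Format one record (a dict or pandas Series) as a short descriptive sentence."""
    income_phrase = "earns over 50K" if row["income"] == 1 else "earns 50K or less"
    return (
        f"A {row['age']}-year-old {row['race']} {row['sex'].lower()}, "
        f"who works in {row['occupation']}, {income_phrase}."
    )

# Hypothetical record using the dataset's column names
record = {
    "age": 45,
    "race": "Asian-Pac-Islander",
    "sex": "Female",
    "occupation": "Prof-specialty",
    "income": 1,
}

print(row_to_prompt(record))
```

The same function accepts a row of the DataFrame, for example `row_to_prompt(df.iloc[0])`.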

In [14]:
# ----------------------------------------------------
# 3. DEMONSTRATE GENERATIVE AI CONCEPT
# ----------------------------------------------------


# Import the pipeline helper from Hugging Face Transformers
from transformers import pipeline

# Build a text-generation pipeline backed by GPT-2
generator = pipeline('text-generation', model='gpt2')


# Define a prompt to generate a synthetic data point
prompt = "A 45-year-old Asian female, who is a professor, earns over 50K."
print(f"Generating synthetic text based on the prompt:\n'{prompt}'\n")

# Generate text and display it
generated_text = generator(prompt, max_length=50, num_return_sequences=1)[0]['generated_text']
print("Generated Synthetic Text:")
print("------------------------")
print(generated_text)

Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
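
The last warning reports that both `max_length` (50) and the pipeline's default `max_new_tokens` (256) were set and that `max_new_tokens` took precedence, so the run could generate up to 256 new tokens rather than stopping near 50. The sketch below shows one way to bound the continuation explicitly by passing `max_new_tokens`. The `generate_continuation` helper is hypothetical, and `fake_generator` is a stand-in for the real pipeline so the example runs offline.

```python
def generate_continuation(generator, prompt, max_new_tokens=50):
    """Call a text-generation pipeline and return only the newly generated text."""
    outputs = generator(prompt, max_new_tokens=max_new_tokens, num_return_sequences=1)
    full_text = outputs[0]["generated_text"]
    # The pipeline echoes the prompt by default; strip it if present
    if full_text.startswith(prompt):
        return full_text[len(prompt):].strip()
    return full_text

# Stand-in for pipeline('text-generation', model='gpt2'); returns a fixed string
def fake_generator(prompt, max_new_tokens, num_return_sequences):
    return [{"generated_text": prompt + " She teaches statistics."}]

print(generate_continuation(fake_generator, "A 45-year-old professor."))
```

With the real pipeline, the call would be `generate_continuation(generator, prompt)`; the sampled text will differ from run to run.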


Generating synthetic text based on the prompt:
'A 45-year-old Asian female, who is a professor, earns over 50K.'

Generated Synthetic Text:
------------------------
A 45-year-old Asian female, who is a professor, earns over 50K. She said she has been receiving calls from people who say she is "toxic" and "sick," and that she is struggling to find a job.

"I am completely demoralized," she said. "I am sick of being called names and told to look for work. I am sick of being told that I am not qualified."

The harassment is not limited to Asian women, said the victim. "The harassment continues to be systemic. It is happening to me because of my race, my age, my sexual orientation."

According to a complaint filed by a local women's rights group, the harasser told her she would never work with a white person again.

But the harassment continues.

"I am afraid to go to an Asian community event, I am afraid to go to an Asian conference, I am afraid to go to an Asian theater, I am afraid to g