### Dataset https://www.kaggle.com/datasets/kamaumunyori/income-prediction-dataset-us-20th-century-data/code
About Dataset
This dataset was introduced in a Zindi competition that challenged data professionals to predict whether members of the test population earn below or above $50,000, based on the following variables:

- Age.
- Gender.
- Education.
- Class.
- Education institute.
- Marital status.
- Race.
- Is hispanic.
- Employment commitment.
- Unemployment reason.
- Employment state.
- Wage per hour.
- Is part of labor union.
- Working week per year.
- Industry code.
- Main Industry code.
- Occupation code.
- Main Occupation code.
- Total employed.
- Household stat.
- Household summary.
- Under 18 family.
- Veterans admin questionnaire.
- Veteran benefit.
- Tax status.
- Gains.
- Losses.
- Stocks status.
- Citizenship.
- Migration year.
- Country of birth own.
- Country of birth father.
- Country of birth mother.
- Migration code change in msa.
- Migration previous sunbelt.
- Migration code move within registration.
- Migration code change in registration.
- Residence 1 year ago.
- Old residence registration.
- Old residence state.
- Importance of record.


- Income above limit.<br>
Target column indicating whether a person earns below or above $50,000.

# Loading Dataset

In [None]:
import pandas as pd
import plotly.express as px
pd.set_option('display.max_columns', None)


In [None]:
df = pd.read_csv("Dataset/Train.csv")


# Data Exploring

## Understanding Data

In [None]:
df.info()

In [None]:
df.head(10)

In [None]:
# Show only the min and max rows to check the range of each numeric column
df.describe().loc[["min", "max"]]
#df.describe()

In [None]:
df.describe(include="object")

## Data Cleaning

### Drop Columns and Check for Duplicates


In [None]:
# ID and importance_of_record are dropped before checking for duplicates
df.drop(columns=["ID", "importance_of_record"], inplace=True)

In [None]:
df.duplicated().sum()

In [None]:
df.drop_duplicates(inplace=True,ignore_index=True) #drop duplicates with index reset

### Data inconsistency

Strip surrounding whitespace from the categorical columns and show their unique values.

In [None]:

for col in df.select_dtypes(include=["object"]).columns:
    df[col] = df[col].str.strip()  # remove leading/trailing whitespace
    print(f"{col} : {df[col].unique()}")
    print("------------------")

Fix ever and never in column household_stat.<br>
household_stat contains both "Child 18+ never marr Not in a subfamily" and "Child 18+ ever marr Not in a subfamily"; replace the standalone word ever with never so the two categories merge.

In [None]:
df["household_stat"]=df["household_stat"].str.replace(r'\bever\b',"never",regex=True)
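
The word-boundary pattern above matters because "never" already contains "ever". A minimal sketch on two toy values (copied from the category names in the note above, not read from the real file) compares the `\b` pattern with a plain substring replace:

```python
import pandas as pd

# Toy values shaped like the household_stat categories (illustrative only).
s = pd.Series([
    "Child 18+ never marr Not in a subfamily",
    "Child 18+ ever marr Not in a subfamily",
])

# \b matches only at a word boundary, so the "ever" inside "never" is left alone.
fixed = s.str.replace(r"\bever\b", "never", regex=True)

# A plain substring replace also rewrites the "ever" inside "never".
naive = s.str.replace("ever", "never", regex=False)

print(fixed.unique())
print(naive.iloc[0])
```

With the boundary pattern both rows collapse into one category, while the naive version corrupts the value that was already correct.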

Replace NA with "Do not know" in is_hispanic.

In [None]:
df["is_hispanic"]=df["is_hispanic"].str.replace(r'\bNA\b',"Do not know",regex=True)
df["is_hispanic"].unique()

Replace ? with Unknown in the country-of-birth and migration columns.

In [None]:
# "?" is a regex metacharacter, so match it literally with regex=False
unknown_cols = [
    "country_of_birth_own", "country_of_birth_father", "country_of_birth_mother",
    "migration_code_change_in_msa", "migration_prev_sunbelt",
    "migration_code_move_within_reg", "migration_code_change_in_reg",
]
for col in unknown_cols:
    df[col] = df[col].str.replace("?", "Unknown", regex=False)
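
If the `?` placeholder always fills the whole cell (an assumption that should be checked against the unique values printed earlier), `DataFrame.replace` is a possible alternative. It compares entire cell values by default, so no regex handling is involved. A small sketch on a toy frame, where `df_demo` and its rows are hypothetical:

```python
import pandas as pd

# Toy frame with two of the affected columns (values are made up).
df_demo = pd.DataFrame({
    "country_of_birth_own": ["US", "?", "Mexico"],
    "migration_prev_sunbelt": ["?", "No", "Yes"],
})

cols = ["country_of_birth_own", "migration_prev_sunbelt"]

# DataFrame.replace matches whole cell values, so "?" is treated as a literal.
df_demo[cols] = df_demo[cols].replace("?", "Unknown")

print(df_demo)
```

Unlike `str.replace`, this leaves a cell such as "What?" untouched, which may or may not be the desired behavior.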

### Data Type Mismatches: None Found

### Numeric Graph Exploring

In [None]:
numeric_columns = df.select_dtypes(include="number").columns
for column in numeric_columns:
    fig = px.histogram(df, x=column, title=column)
    fig.show()

### Categorical Graph Exploring

In [None]:
# For better performance, use a histogram with histfunc="count" instead of a bar chart
categorical_columns = df.select_dtypes(include="O").columns
for column in categorical_columns:
    # counts = df[column].value_counts()  # Get category counts
    # px.bar(data_frame=counts, x=counts.index, y=counts.values, title=column, labels={"y": "count"}).show()
    px.histogram(data_frame=df, x=column, title=column, histfunc="count").show()


### Handling Missing Values and Unneeded Columns



In [None]:
df.isna().mean() *100

#### Categorical columns

Keep occupation_code_main and class because of their importance.

Drop columns with more than 42% missing values:
- education_institute: 95% NaN
- unemployment_reason: 96% NaN
- is_labor_union: 90% NaN
- under_18_family: 72% NaN
- old_residence_reg: 92% NaN
- old_residence_state: 92% NaN
- veterans_admin_questionnaire: 99% NaN
- residence_1_year_ago: 50% NaN

Also drop occupation_code, which has the same meaning as occupation_code_main.

In [None]:
df.drop(columns=["occupation_code","education_institute","unemployment_reason","is_labor_union","under_18_family","old_residence_reg","old_residence_state","veterans_admin_questionnaire","residence_1_year_ago"], inplace=True)
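
The drop list above is typed by hand. As a rough sketch, it could instead be derived from the 42% cutoff, with an allow-list for the columns kept for their importance. The names `THRESHOLD`, `KEEP`, and the toy frame `demo` are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Toy frame: two columns are 75% missing, one is complete (values are made up).
demo = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "education_institute": [np.nan, np.nan, np.nan, "High school"],
    "class": ["Private", np.nan, np.nan, np.nan],
})

THRESHOLD = 42    # percent of missing values above which a column is dropped
KEEP = {"class"}  # hypothetical allow-list of columns kept despite missing data

missing_pct = demo.isna().mean() * 100
to_drop = [
    col for col in missing_pct.index
    if missing_pct[col] > THRESHOLD and col not in KEEP
]

demo = demo.drop(columns=to_drop)
print(to_drop)
```

Here `class` survives even though it is above the cutoff, mirroring the decision to keep it and fill the gaps later.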

Drop rows with missing values in columns where fewer than 5% of the values are missing.

In [None]:
# Drop rows where "migration_code_move_within_reg" or "migration_code_change_in_reg" or migration_code_change_in_msa have NaN values
df_drop_na_rows = df.dropna(subset=["migration_code_move_within_reg","migration_code_change_in_reg","migration_code_change_in_msa"],ignore_index=True)
df_drop_na_rows.shape

In [None]:
df_drop_na_rows.shape[0] / df.shape[0] * 100  # percentage of rows kept after the drop

In [None]:
df=df_drop_na_rows

In [None]:
df["class"]=df["class"].fillna("Unknown")
df["occupation_code_main"]=df["occupation_code_main"].fillna("Unknown")
df["migration_prev_sunbelt"]=df["migration_prev_sunbelt"].fillna("Unknown")


In [None]:
df_cat= df[["occupation_code_main","migration_code_change_in_msa","migration_prev_sunbelt","migration_code_move_within_reg","migration_code_change_in_reg"]]
for column in df_cat.columns:
    fig = px.histogram(df_cat, x=column, title=column, histfunc="count")
    fig.show()

In [None]:
df.isna().mean() *100

In [None]:
df.duplicated().sum()

In [None]:
df.drop_duplicates(inplace=True, ignore_index=True)  # Drop duplicates with index reset

#### No Missing Values in Numeric Columns

### Outliers Observation

In [None]:
for column in df.select_dtypes(include=["number"]).columns:
    px.box(df, x=column, title=column).show()
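
The box plots mark points beyond the whiskers visually. To put a number on that, one option is the common 1.5 × IQR rule, sketched below on a toy column. The helper `iqr_outlier_mask` is hypothetical, and the exact quartile method Plotly uses for its whiskers may differ from pandas' default interpolation:

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Toy numeric column with one obviously extreme value (made-up data).
demo = pd.DataFrame({"wage_per_hour": [10, 12, 11, 13, 12, 11, 10, 100]})

mask = iqr_outlier_mask(demo["wage_per_hour"])
outlier_count = int(mask.sum())
print(outlier_count, demo.loc[mask, "wage_per_hour"].tolist())
```

Applied per numeric column, the mask gives a count of flagged rows to go with each plot.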

### Feature Engineering

Add an education level category.

In [None]:
def categorize_education(level):
    if level in ['Less than 1st grade', '1st 2nd 3rd or 4th grade', '5th or 6th grade', '7th and 8th grade']:
        return 'Primary Education'
    elif level in ['9th grade', '10th grade', '11th grade', '12th grade no diploma']:
        return 'Secondary Education'
    elif level == 'High school graduate':
        return 'High School Completion'
    elif level in ['Some college but no degree', 'Associates degree-academic program', 'Associates degree-occup /vocational']:
        return 'Post-Secondary (Higher Education)'
    elif level in ['Bachelors degree(BA AB BS)', 'Masters degree(MA MS MEng MEd MSW MBA)', 
                   'Prof school degree (MD DDS DVM LLB JD)', 'Doctorate degree(PhD EdD)']:
        return 'University Education'
    else:
        return 'Other'
df["education_level"] = df["education"].apply(categorize_education)
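
The same grouping can also be written as a lookup table, which keeps the labels in one data structure and avoids calling a Python function per row. This is only a sketch on a subset of the labels; `EDUCATION_GROUPS` and `LABEL_TO_GROUP` are illustrative names:

```python
import pandas as pd

# Group name -> raw labels (subset of the labels used above, for brevity).
EDUCATION_GROUPS = {
    "Primary Education": ["Less than 1st grade", "5th or 6th grade"],
    "High School Completion": ["High school graduate"],
    "University Education": ["Bachelors degree(BA AB BS)", "Doctorate degree(PhD EdD)"],
}

# Invert to raw label -> group name so Series.map can use it directly.
LABEL_TO_GROUP = {
    label: group
    for group, labels in EDUCATION_GROUPS.items()
    for label in labels
}

education = pd.Series(["High school graduate", "Doctorate degree(PhD EdD)", "Some unlisted label"])

# Labels missing from the table become NaN, then fall back to "Other".
education_level = education.map(LABEL_TO_GROUP).fillna("Other")
print(education_level.tolist())
```

The `fillna("Other")` step plays the role of the final `else` branch in the function above.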


Add an earning column (wage per hour × working weeks per year).

In [None]:

df['earning'] = df['wage_per_hour'] * df['working_week_per_year']


# Analytics

Correlation

In [None]:
corr = df.corr(numeric_only=True)
px.imshow(corr, text_auto=True, aspect="auto", title="Correlation Matrix")

There are no strong relationships in the data, except that the newly generated earning column is strongly related to wage_per_hour. This is expected, because earning is computed from wage_per_hour.

How do males and females compare in earning above $50,000?

In [None]:
px.histogram(data_frame=df[["gender","income_above_limit"]],x="gender",facet_col="income_above_limit",barmode="group")

It is clear that the majority of the population earns less than $50,000, with more females than males falling into the lower income bracket.
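
Raw counts depend on how many males and females are in the sample, so within-gender shares can be a useful complement to the chart. A minimal sketch with `pd.crosstab` on toy rows follows; the label strings "Above limit" and "Below limit" are assumed here and should be checked against the actual values of income_above_limit:

```python
import pandas as pd

# Toy rows (made up): 1 of 4 males and 0 of 4 females are above the limit.
demo = pd.DataFrame({
    "gender": ["Male"] * 4 + ["Female"] * 4,
    "income_above_limit": ["Above limit"] + ["Below limit"] * 7,
})

# normalize="index" makes each row (gender) sum to 1.
shares = pd.crosstab(demo["gender"], demo["income_above_limit"], normalize="index")
print(shares)
```

Each row then reads as the fraction of that gender in each income bracket.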

How does average earning compare across education levels?

In [None]:
# Exclude rows with earning == 0 before averaging
dt = round(df[df["earning"] > 0].groupby("education_level")["earning"].mean().reset_index(), 2)
px.pie(data_frame=dt, names="education_level", values="earning")


University education ranks first in average earning compared with the other education levels.

How common is university education compared with the other education levels?

In [None]:
count_df = df.education_level.value_counts().reset_index()
px.histogram(
    data_frame=count_df,
    x="education_level",
    y="count",
    title="Record Count by Education Level"  # the chart shows counts, not earnings
)

How does average earning compare across races?

In [None]:

px.histogram(
    data_frame=df[["race", "earning"]],
    x="race",
    y="earning",
    histfunc="avg"  # average earning per race
)


In [None]:
df.to_csv("Dataset/Train_cleaned.csv", index=False)