# Project: Data Analysis and Visualization on Adult Census Income Dataset

---
**Problem description**
- Use basic visualization techniques to gain an initial understanding of the dataset. Specifically, visualize the relationship between each attribute and the class label. A continuous attribute may first need to be discretized with a simple strategy such as equal-width binning; if you discretize, experiment with at least three different bin widths. Study these visualizations and summarize your main insights. Tableau is strongly recommended for this task.
- Handling missing values: suggest and implement at least two strategies for handling missing values, for categorical and numeric attributes respectively. Base these strategies on the observations made in the previous step.
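
The equal-width discretization mentioned in the first task can be sketched in pandas as follows. This is a minimal illustration, not the notebook's own method: the `toy` frame holds made-up values, and the helper name `share_by_equal_width_bins` is hypothetical. It bins `age` with three different bin counts (hence three bin widths) and tabulates the share of each income class per bin.

```python
import pandas as pd

# Toy stand-in for the real data (made-up values, for illustration only)
toy = pd.DataFrame({
    "age":    [19, 23, 31, 38, 44, 47, 52, 58, 63, 70],
    "income": ["<=50K", "<=50K", "<=50K", ">50K", ">50K",
               "<=50K", ">50K", ">50K", "<=50K", "<=50K"],
})

def share_by_equal_width_bins(frame, column, n_bins, label="income"):
    """Cut `column` into n_bins equal-width intervals and return the
    row-normalized class distribution per interval."""
    binned = pd.cut(frame[column], bins=n_bins)  # an int gives equal-width bins
    return pd.crosstab(binned, frame[label], normalize="index")

# Three different bin counts, and therefore three different bin widths
tables = {n: share_by_equal_width_bins(toy, "age", n) for n in (3, 5, 10)}
for n, table in tables.items():
    print(f"{n} bins")
    print(table.round(2), "\n")
```

Comparing the three tables shows how much the apparent age/income pattern depends on the bin width: wide bins smooth the trend, while narrow bins leave few rows per bin and produce noisier shares.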


---
### **1.) Basic Imports and Access Setup**


In [None]:
# Imports

import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
# Read the CSV file
df = pd.read_csv('/kaggle/input/adult-census-income/adult.csv')
df

In [None]:
# Only the last expression in a cell is displayed automatically, so print the first one
print(df.head())
df.info()
df.describe()

---
### **2.) Data Cleaning**

- Analyse the missing/null values and apply the required fixes.
  - Replace the missing values with the most frequent category of each column.


In [None]:
print(df.isnull().sum())
print((df == '?').sum())  # '?' placeholders are plain strings, so isnull() does not count them

In [None]:
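# Placeholder strings treated as missing values
missing_markers = ['?', 'NAN', 'other_missing_value']
for column in df.columns:
    # Calculate the most frequent category (mode), ignoring the placeholders themselves
    most_frequent_category = df.loc[~df[column].isin(missing_markers), column].mode()[0]

    # Replace missing values with the most frequent category
    df[column] = df[column].replace(missing_markers, most_frequent_category)
print(df)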
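
Mode imputation is only one option. As a hedged sketch of alternatives, the example below uses a made-up `toy` frame (the missing `age` value is an assumption added for illustration) and compares two simple strategies for a categorical column and two for a numeric one.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; '?' marks a missing categorical entry
toy = pd.DataFrame({
    "workclass": ["Private", "?", "Private", "State-gov", "?"],
    "age": [25.0, 40.0, np.nan, 31.0, 58.0],
})

# Turn the '?' placeholder into a real missing value first
toy = toy.replace("?", np.nan)

# Categorical, strategy A: impute the most frequent category
workclass_mode = toy["workclass"].fillna(toy["workclass"].mode()[0])
# Categorical, strategy B: keep missingness visible as its own category
workclass_unknown = toy["workclass"].fillna("Unknown")

# Numeric, strategy A: median, which is less sensitive to outliers
age_median = toy["age"].fillna(toy["age"].median())
# Numeric, strategy B: mean
age_mean = toy["age"].fillna(toy["age"].mean())

print(workclass_mode.tolist())
print(workclass_unknown.tolist())
print(age_median.tolist(), age_mean.tolist())
```

Keeping an explicit "Unknown" category may be preferable when missingness itself could be related to income, since mode imputation hides that signal.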

---
### **3.) Exploratory Data Analysis and Visualization**

#### a. [Stacked bar plot]

In [None]:
grouped_data = df.groupby(['sex', 'income'])['income'].count().unstack()

# Create the stacked barplot
ax = grouped_data.plot(kind='bar', stacked=True, color=['#FFC0CB', '#ADD8E6'])

# Add labels and title
ax.set_title('Income by Sex')
ax.set_xlabel('Sex')
ax.set_ylabel('Count')
ax.legend(['<=50K', '>50K'], loc='upper left')

plt.show()


#### b. [Count Plot]
  - Create a count plot of income

In [None]:

sns.set(style="whitegrid")
sns.countplot(x="income", data=df, palette="Set2")
plt.title("Income Distribution")
plt.xlabel("Income")
plt.ylabel("Count")
plt.show()

#### c. [Box plot] 

In [None]:
sns.boxplot(x="income", y="age", hue="sex", data=df, palette="Set3")
plt.title("Age Distribution by Income Level and Sex")
plt.xlabel("Income")
plt.ylabel("Age")
plt.show()

#### d. [Scatter plot]

In [None]:
sns.scatterplot(x="age", y="hours.per.week", hue="income", data=df, palette="Set2")
plt.title("Age vs. Hours-per-Week")
plt.xlabel("Age")
plt.ylabel("Hours-per-Week")
plt.show()

#### e. [Histogram - Hist plot]
- Histogram of every column
- Histogram of the numerical columns only

In [None]:
for i in df.columns:
    sns.histplot(x=i, data=df, kde=False, bins=20) 
    plt.title(f"{i} Distribution")
    plt.xlabel(i)
    plt.ylabel("Count")
    plt.show()
   

In [None]:
df.hist(bins=20, figsize=(10, 8))
plt.show()

#### f. [Pie chart]

In [None]:
counts=df['education'].value_counts().sort_index()
print(counts)
counts.plot(kind='pie',title='education_count',figsize=(11,10))
plt.legend()
plt.show()

#### g. [Violin plot]

In [None]:
sns.violinplot(x="income", y="hours.per.week", hue="sex", data=df, split=True, inner="quart")
plt.title("Hours-per-Week vs. Income by Sex")
plt.xlabel("Income")
plt.ylabel("Hours-per-Week")
plt.show()

#### h. [Parallel coordinate plot - PCP]

In [None]:
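import plotly.express as px

# Parallel coordinates need numeric axes; color each line by a 0/1 income flag
# (assumes the income labels are '<=50K' and '>50K')
pcp_df = df.select_dtypes(include='number').copy()
pcp_df['income_gt_50k'] = (df['income'] == '>50K').astype(int)
fig = px.parallel_coordinates(pcp_df, color='income_gt_50k',
                             color_continuous_scale=px.colors.diverging.Tealrose,
                             color_continuous_midpoint=0.5)
fig.show()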

#### i. [Heatmap]

In [None]:
numeric_columns = df.select_dtypes(include=['number'])

# Calculate the correlation matrix for numeric columns
corr = numeric_columns.corr()

# Create a heatmap of the correlation matrix
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")

# Show the plot
plt.show()


#### j. [Pair plot]

In [None]:
sns.pairplot(df[['age', 'education.num', 'hours.per.week', 'income']], hue='income', palette="Set2")
# Raise the title slightly so it does not overlap the top row of plots
plt.suptitle("Pairplot of Age, Education-num, and Hours-per-week by Income Level", y=1.02)

plt.show()

---
### **4.) Label Encoding**

In [None]:
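from sklearn.preprocessing import LabelEncoder

# Encode only the categorical (object) columns so numeric columns keep their values
cat_cols = df.select_dtypes(include='object').columns
df[cat_cols] = df[cat_cols].apply(lambda col: LabelEncoder().fit_transform(col))
df.head()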
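
To map the integer codes back to the original labels later, one fitted encoder per column can be kept. The sketch below is illustrative only: the `toy` frame contains made-up rows, and the `encoders` dictionary is a hypothetical helper, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows for illustration only
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "income": ["<=50K", ">50K", "<=50K", "<=50K"],
})

encoders = {}          # column name -> fitted LabelEncoder
encoded = toy.copy()
for col in toy.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

print(encoded)
# Each encoder remembers its own classes, so codes can be mapped back
print(list(encoders["income"].classes_))
decoded_sex = encoders["sex"].inverse_transform(encoded["sex"])
print(list(decoded_sex))
```

`LabelEncoder` assigns codes in sorted order of the labels, so the codes carry no meaningful ranking; for nominal features fed to a model, one-hot encoding may be a better fit.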