## <p style="background-color:#47B699; font-family:Arial; color:white; font-size:200%; text-align:center; border-radius:10px 10px;">Analysis of US Citizens by Income Levels</p>

<a id="toc"></a>

## <p style="background-color:#47B699; font-family:Georgia; color:#FFFFFF; font-size:175%; text-align:center; border-radius:10px 10px;">Content</p>

* [Introduction](#0)
* [Dataset Info](#1)
* [Importing Related Libraries](#2)
* [Recognizing & Understanding Data](#3)
* [Univariate & Multivariate Analysis](#4)    
* [Other Specific Analysis Questions](#5)
* [Dropping Similar & Unnecessary Features](#6)
* [Handling with Missing Values](#7)
* [Handling with Outliers](#8)    
* [Final Step to make ready dataset for ML Models](#9)
* [The End of the Project](#10)

## <p style="background-color:#47B699; font-family:Georgia; color:white; font-size:175%; text-align:center; border-radius:10px 10px;">Introduction</p>

<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#47B699" data-toggle="popover">Content</a>

**``Exploratory Data Analysis (EDA)``** is one of the most important parts of any data science project, yet it rarely gets the attention it deserves. In short, EDA is **``"a first look at the data"``**. It is used to understand and summarize the contents of a dataset, so that the features we feed to machine learning algorithms are clean and the results are valid and correctly interpreted.
Scanning a column of numbers or a whole spreadsheet to find the important characteristics of the data is tedious. It is therefore good practice to understand the problem statement and the data before modeling, because this step alone yields many insights. This notebook walks through EDA using the Adult (Census Income) dataset available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Adult). The problem statement is to predict from census data whether a person's income exceeds $50K a year.

# Aim of the Project

Apply Exploratory Data Analysis (EDA) and prepare the data for machine learning algorithms:
1. Analyze the characteristics of individuals by income group
2. Prepare the data for a model that predicts people's income level from their characteristics (the "salary" feature is the target)

## <p style="background-color:#47B699; font-family:newtimeroman; color:white; font-size:175%; text-align:center; border-radius:10px 10px;">Dataset Info</p>

<a id="1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#47B699" data-toggle="popover">Content</a>

The Census Income dataset has 48,842 entries. Each entry contains the following information about an individual:

- **salary (target feature/label):** whether or not an individual makes more than $50,000 annually. (<=50K, >50K)
- **age:** the age of an individual. (Integer greater than 0)
- **workclass:** a general term to represent the employment status of an individual. (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked)
- **fnlwgt:** the number of people the census believes the entry represents. People with similar demographic characteristics should have similar weights. One important caveat: because the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, this statement only applies within a state. (Integer greater than 0)
- **education:** the highest level of education achieved by an individual. (Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.)
- **education-num:** the highest level of education achieved in numerical form. (Integer greater than 0)
- **marital-status:** marital status of an individual. Married-civ-spouse corresponds to a civilian spouse while Married-AF-spouse is a spouse in the Armed Forces. Married-spouse-absent includes married people living apart because either the husband or wife was employed and living at a considerable distance from home (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse)
- **occupation:** the general type of occupation of an individual. (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces)
- **relationship:** represents what this individual is relative to others. For example an individual could be a Husband. Each entry only has one relationship attribute. (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried)
- **race:** Descriptions of an individual’s race. (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black)
- **sex:** the biological sex of the individual. (Male, Female)
- **capital-gain:** capital gains for an individual. (Integer greater than or equal to 0)
- **capital-loss:** capital loss for an individual. (Integer greater than or equal to 0)
- **hours-per-week:** the hours an individual has reported to work per week. (continuous)
- **native-country:** country of origin for an individual (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands)

## <p style="background-color:#47B699; font-family:newtimeroman; color:white; font-size:175%; text-align:center; border-radius:10px 10px;">Importing Related Libraries</p>

<a id="2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#47B699" data-toggle="popover">Content</a>

Once NumPy, pandas, Matplotlib, and seaborn are installed, you can import them:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")
warnings.warn("this will not show")

plt.rcParams["figure.figsize"] = (10, 6)

sns.set_style("whitegrid")
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Set it to None to display all rows in the dataframe
# pd.set_option('display.max_rows', None)

# Set it to None to display all columns in the dataframe
pd.set_option('display.max_columns', None)

### <p style="background-color:#47AC34; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:left; border-radius:10px 10px;">Reading the data from file</p>

In [None]:
df = pd.read_csv("/kaggle/input/eda-project-analyze-us-citizens/adult_eda.csv")
df

## <p style="background-color:#47B699; font-family:Georgia; color:white; font-size:150%; text-align:center; border-radius:10px 10px;">Recognizing and Understanding Data</p>

<a id="3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#47B699" data-toggle="popover">Content</a>

### 1. Try to understand what the data looks like
- Check the head, shape, and data types of the features.
- Check whether there are duplicate rows. If there are, drop them.
- Check the summary statistics of the features.
- If needed, rename the columns for easier use.
- Take a first look at the missing values.

In [None]:
# Check the first rows of the dataset
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
# Check whether the dataset has any duplicate rows
df.duplicated().value_counts()

In [None]:
# Drop Duplicates
df.drop_duplicates(inplace = True)
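
Before dropping rows it can help to see how many duplicates there are. The sketch below uses a small made-up frame (the values are invented, not taken from the dataset) and resets the index afterwards, which is an optional design choice:

```python
import pandas as pd

# Toy data, invented for illustration: row 2 repeats row 0 exactly.
toy = pd.DataFrame({
    "age": [39, 50, 39],
    "workclass": ["State-gov", "Private", "State-gov"],
})

# Count rows that repeat an earlier row, then drop them.
n_duplicates = toy.duplicated().sum()
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, deduplicated.shape)
```

Resetting the index keeps the row labels contiguous after rows are removed; skip it if the original labels matter.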

In [None]:
# Check the shape of the Dataset
df.shape

In [None]:
df.describe().T

**Rename the features**<br>
**``"education-num"``**, **``"marital-status"``**, **``"capital-gain"``**, **``"capital-loss"``**, **``"hours-per-week"``**, **``"native-country"``** **to**<br>
**``"education_num"``**, **``"marital_status"``**, **``"capital_gain"``**, **``"capital_loss"``**, **``"hours_per_week"``**, **``"native_country"``**, **respectively and permanently.**

In [None]:
df.rename(columns = {"education-num": "education_num", "marital-status" : "marital_status", "capital-gain" : "capital_gain",
                     "capital-loss": "capital_loss", "hours-per-week": "hours_per_week", "native-country": "native_country" },
         inplace = True)
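
Since every rename above only swaps a hyphen for an underscore, the same result could be reached without listing each column. A minimal sketch on a toy frame (the values are made up):

```python
import pandas as pd

# Toy frame with hyphenated column names like the ones in the dataset.
toy = pd.DataFrame({"education-num": [13, 9], "hours-per-week": [40, 50], "age": [39, 50]})

# Replace every hyphen in the column labels with an underscore.
toy.columns = toy.columns.str.replace("-", "_", regex=False)

print(list(toy.columns))
```

The explicit dictionary used in the notebook is easier to audit; the string replacement is shorter when many columns follow the same pattern.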

In [None]:
# Check the sum of Missing Values per column

df.isnull().sum()

In [None]:
# Check the Percentage of Missing Values
percent_missing = df.isnull().sum() * 100 / len(df)
percent_missing
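
The count and the percentage of missing values can also be shown side by side, restricted to the columns that actually have gaps. A small sketch on invented toy data (the names `toy` and `missing` are only for illustration):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "education_num": [13.0, np.nan, 9.0, np.nan],
    "relationship": ["Husband", None, "Wife", "Own-child"],
})

# One table with the number and the percentage of missing values per column.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})

# Keep only columns with at least one missing value, worst first.
missing = missing[missing["n_missing"] > 0].sort_values("pct_missing", ascending=False)
print(missing)
```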

### 2. Look at the value counts of the object-dtype columns and detect strange values other than NaN

In [None]:
df.columns

In [None]:
df.describe(include = "object").T

**Assign the Columns (Features) of object data type as** **``"object_col"``**

In [None]:
object_col = df.select_dtypes(include = "object")

object_col.columns

In [None]:
for col in object_col:
    print(col)
    print("--"*8)
    print(df[col].value_counts(dropna=False))
    print("--"*20)

**Check whether the dataset contains any question marks** **``"?"``**

In [None]:
df.isin(["?"]).any()
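
`any()` only says whether a column contains `"?"` at all. Summing the same boolean mask gives the number of placeholders per column. A minimal sketch on made-up toy values:

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "?"],
    "occupation": ["Sales", "?", "Adm-clerical", "Tech-support"],
    "age": [39, 50, 38, 53],
})

# Number of "?" placeholders in each column.
question_marks = toy.isin(["?"]).sum()
print(question_marks)
```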

## <p style="background-color:#47B699; font-family:georgia; color:white; font-size:175%; text-align:center; border-radius:10px 10px;">Univariate & Multivariate Analysis</p>

<a id="4"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#47B699" data-toggle="popover">Content</a>

Examine all features separately (first the target feature "salary", then the numeric features, and finally the categorical ones) from different angles with respect to the target feature.

**To-do list for numeric features:**
1. Check the boxplot to see extreme values
2. Check the histplot/kdeplot to see the distribution of the feature
3. Check the summary statistics
4. Check the boxplot and histplot/kdeplot by "salary" level
5. Check the summary statistics by "salary" level
6. Write down the conclusions you draw from your analysis

**To-do list for categorical features:**
1. Find the features that contain similar values, examine the similarities, and analyze them together
2. Check the count/percentage of people in each category and visualize it with a suitable plot
3. If needed, reduce the number of categories by combining similar ones
4. Check the number of people at each "salary" level by category and visualize it with a suitable plot
5. Check the percentage distribution of people across "salary" levels within each category and visualize it with a suitable plot
6. Check the number of people in each category by "salary" level and visualize it with a suitable plot
7. Check the percentage distribution of people across categories within each "salary" level and visualize it with a suitable plot
8. Write down the conclusions you draw from your analysis

**Note :** Detailed instructions for each feature are also given under the corresponding feature.
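
The numeric checklist repeats the same two `describe()` calls for every feature. As a sketch, they could be wrapped in one helper; `describe_by_target` is a hypothetical name, not part of the notebook, and the toy values are invented:

```python
import pandas as pd

def describe_by_target(frame, feature, target="salary"):
    """Return describe() of `feature` overall and per `target` level in one table."""
    overall = frame[feature].describe().rename("all")
    # describe() per group has one row per level; transpose so statistics are rows.
    by_level = frame.groupby(target)[feature].describe().T
    return pd.concat([overall, by_level], axis=1)

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "salary": ["<=50K", "<=50K", ">50K", ">50K"],
    "age": [25, 35, 40, 50],
})

report = describe_by_target(toy, "age")
print(report)
```

With the real data the call would look like `describe_by_target(df, "age")`, giving the overall and per-level statistics in adjacent columns.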

## Salary (Target Feature)

**Check the number of people at each "salary" level and visualize it with a countplot**

In [None]:
df["salary"].value_counts()

In [None]:
sns.countplot(data = df, x = "salary");

**Check the percentage of people at each "salary" level and visualize it with a pie chart**

In [None]:
percentage_salary = df["salary"].value_counts() / len(df)

percentage_salary

In [None]:
# Take the labels from the index so they always match the order of the values
plt.pie(percentage_salary, labels = percentage_salary.index, autopct = "%.1f%%", shadow = True)

plt.title("Percentage of Income-Levels");


**Result :** 75.9% of people are in the low-income group (<=50K) and 24.1% are in the high-income group (>50K).

## Numeric Features

## age

**Check the boxplot to see extreme values**

In [None]:
sns.boxplot(data = df, x = "age");
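
The points a boxplot draws beyond its whiskers follow, by default, the 1.5 × IQR rule. The same fences can be computed directly to list the extreme values. A minimal sketch on invented ages (not the dataset's values):

```python
import pandas as pd

# Toy ages, invented for illustration.
ages = pd.Series([22, 25, 28, 30, 33, 35, 38, 40, 45, 90], name="age")

# Quartiles and interquartile range.
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences of the 1.5 * IQR rule; values outside them are flagged as extreme.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(lower, upper, outliers.tolist())
```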

**Check the histplot/kdeplot to see distribution of feature**

In [None]:
sns.histplot(data = df, x = "age", kde = True, bins = 20);

**Check the statistical values**

In [None]:
df.age.describe()

**Check the boxplot and histplot/kdeplot by "salary" levels**

In [None]:
sns.boxplot(data = df, x = "salary", y = "age");

In [None]:
sns.histplot(data = df, x = "age", hue = "salary", bins = 20, kde = True);

In [None]:
sns.kdeplot(data = df, x = "age", hue = "salary", fill = True);

**Check the statistical values by "salary" levels**

In [None]:
df.groupby("salary")["age"].describe()

**Result :** The mean and median age of the high-income group are higher than those of the low-income group (mean age 44.3 vs. 36.8), which suggests that older people are more likely to have a high income. The standard deviation is large in both groups, so ages vary widely within each group.

## fnlwgt

**Check the boxplot to see extreme values**

In [None]:
sns.boxplot(data = df, x = "fnlwgt");

**Check the histplot/kdeplot to see distribution of feature**

In [None]:
sns.kdeplot(data = df, x = "fnlwgt", fill = True);

**Check the statistical values**

In [None]:
df["fnlwgt"].describe()

**Check the boxplot and histplot/kdeplot by "salary" levels**

In [None]:
sns.boxplot(data = df, x = "salary", y = "fnlwgt");

In [None]:
sns.kdeplot(data = df, x = "fnlwgt", fill = True, hue = "salary");

**Check the statistical values by "salary" levels**

In [None]:
df.groupby("salary")["fnlwgt"].describe()

**Result :** There is very little correlation between "salary" and "fnlwgt".
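
One way to put a number on such a statement is to encode the target as 0/1 and compute the Pearson correlation with the numeric feature (the point-biserial correlation). The sketch below uses invented toy values chosen so that one feature is unrelated to the target and the other is strongly related; `is_high` is a hypothetical helper name:

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "salary": ["<=50K", "<=50K", ">50K", ">50K"],
    "fnlwgt": [100, 200, 100, 200],
    "age": [20, 30, 40, 50],
})

# Encode the target as 1 for the high-income level and 0 otherwise.
is_high = (toy["salary"] == ">50K").astype(int)

# Pearson correlation of each numeric feature with the 0/1 target.
r_fnlwgt = toy["fnlwgt"].corr(is_high)
r_age = toy["age"].corr(is_high)

print(r_fnlwgt, r_age)
```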

## capital_gain

**Check the boxplot to see extreme values**

In [None]:
sns.boxplot(data = df, x = "capital_gain");

**Check the histplot/kdeplot to see distribution of feature**

In [None]:
sns.kdeplot(data = df, x = "capital_gain", fill = True);

**Check the statistical values**

In [None]:
df.capital_gain.describe()

**Check the boxplot and histplot/kdeplot by "salary" levels**

In [None]:
sns.boxplot(data = df, x = "salary", y = "capital_gain");

In [None]:
sns.kdeplot(data = df, x = "capital_gain", fill = True, hue = "salary");

**Check the statistical values by "salary" levels**

In [None]:
df.groupby("salary")["capital_gain"].describe()

**Check the statistical values by "salary" levels for non-zero capital_gain values**

In [None]:
df_drop_cap = df[df["capital_gain"] != 0]

df_drop_cap["capital_gain"].describe()

df_drop_cap.groupby("salary")["capital_gain"].describe()

**Result :** The "capital_gain" feature does not provide very meaningful insight on its own. Nevertheless, higher "capital_gain" values go together with the high-income group.
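
Because most values of a feature like this are zero, the share of people with any gain at all can be more telling than the mean. A minimal sketch on invented toy values:

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "salary": ["<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
    "capital_gain": [0, 0, 0, 2174, 0, 15024],
})

# Boolean flag: does the person report any capital gain?
has_gain = toy["capital_gain"] > 0

# The mean of a boolean flag per group is the share of people with a gain.
share_with_gain = has_gain.groupby(toy["salary"]).mean()
print(share_with_gain)
```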

## capital_loss

**Check the boxplot to see extreme values**

In [None]:
sns.boxplot(data = df, x = "capital_loss");

**Check the histplot/kdeplot to see distribution of feature**

In [None]:
sns.kdeplot(data = df, x = "capital_loss", fill = True);

**Check the statistical values**

In [None]:
df["capital_loss"].describe()

**Check the boxplot and histplot/kdeplot by "salary" levels**

In [None]:
sns.boxplot(data = df, x = "salary", y = "capital_loss");

In [None]:
sns.kdeplot(data = df, x = "capital_loss", hue = "salary", fill = True);

**Check the statistical values by "salary" levels**

In [None]:
df.groupby("salary")["capital_loss"].describe()

**Check the statistical values by "salary" levels for non-zero capital_loss values**

In [None]:
df.loc[df["capital_loss"] != 0].groupby("salary")["capital_loss"].describe()

**Result :** There is no significant difference between the high-income and low-income groups in the "capital_loss" feature. Most people have a capital loss of 0, which dominates the distribution and the summary statistics.

## hours_per_week

**Check the boxplot to see extreme values**

In [None]:
sns.boxplot(data = df, x = "hours_per_week");

**Check the histplot/kdeplot to see distribution of feature**

In [None]:
sns.kdeplot(data = df, x = "hours_per_week", fill = True);

**Check the statistical values**

In [None]:
df.hours_per_week.describe()

**Check the boxplot and histplot/kdeplot by "salary" levels**

In [None]:
sns.boxplot(data = df, x = "salary", y = "hours_per_week");

In [None]:
sns.kdeplot(data = df, x = "hours_per_week", hue = "salary", fill = True);

**Check the statistical values by "salary" levels**

In [None]:
df.groupby("salary")["hours_per_week"].describe()

**Result :** In the high-income group the average working time is about 45 hours per week, compared with about 39 hours per week in the low-income group. Both distributions look roughly normal.
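
Whether a distribution is roughly symmetric can be checked numerically with the sample skewness, which is close to 0 for symmetric data and positive for a long right tail. This is only a rough check, not a formal normality test, and the toy values below are invented:

```python
import pandas as pd

# Toy working hours, invented for illustration.
symmetric = pd.Series([30, 35, 40, 45, 50])
right_skewed = pd.Series([40, 40, 40, 45, 99])

# Sample skewness of each series.
skew_symmetric = symmetric.skew()
skew_right = right_skewed.skew()

print(skew_symmetric, skew_right)
```

On the real data, `df.groupby("salary")["hours_per_week"].skew()` would give one value per salary level.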

### See the relationships between the numeric features by the target feature (salary) in a single plot

In [None]:
sns.pairplot(data = df, hue = "salary", palette = "viridis", corner = True);

## Categorical Features

## education & education_num

**Detect the similarities between these features by comparing unique values**

In [None]:
df.education.value_counts()

In [None]:
df.education_num.value_counts()

In [None]:
df.groupby("education")["education_num"].value_counts(dropna = False)

**Visualize the number of people in each category of these features (education, education_num) separately**

In [None]:
fig, ax = plt.subplots(figsize = (12, 6))

sns.countplot(data = df, x = "education")

# Rotate the existing tick labels instead of overwriting them
ax.tick_params(axis = "x", labelrotation = 90);

In [None]:
sns.countplot(data = df, x = "education_num");

**Check the number of people at each "salary" level by these features (education and education_num) separately and visualize them with a countplot**

In [None]:
df.groupby("education")["salary"].value_counts()

In [None]:
fig, ax = plt.subplots(figsize = (12, 6))

sns.countplot(data = df, x = "education", hue = "salary")

# Rotate the existing tick labels instead of overwriting them
ax.tick_params(axis = "x", labelrotation = 90);

In [None]:
df.groupby("education_num")["salary"].value_counts()

In [None]:
sns.countplot(data = df, x = "education_num", hue = "salary");

**Visualize the boxplot of "education_num" feature by "salary" levels**

In [None]:
sns.boxplot(data = df, x = "salary", y = "education_num");

**Reduce the categories of the "education" feature to low, medium, and high levels, and create a new feature with this new categorical data.**

In [None]:
def mapping_education(x):
    if x in ["Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th"]:
        return "low_level_grade"
    elif x in ["HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm"]:
        return "medium_level_grade"
    elif x in ["Bachelors", "Masters", "Prof-school", "Doctorate"]:
        return "high_level_grade"

In [None]:
df.education.apply(mapping_education).value_counts()

In [None]:
# Use the "mapping_education" function above to create a new column named "education_summary"
df["education_summary"] = df.education.apply(mapping_education)
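
The same grouping can be written as a dictionary and applied with `Series.map`. A side benefit is that any category missing from the dictionary becomes NaN, which makes unmapped values easy to spot. In the sketch below the grouping is copied from the function above; `"Unknown-school"` is a made-up category used only to show the check:

```python
import pandas as pd

# Grouping taken from the mapping function, written as level -> categories.
levels = {
    "low_level_grade": ["Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th"],
    "medium_level_grade": ["HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm"],
    "high_level_grade": ["Bachelors", "Masters", "Prof-school", "Doctorate"],
}

# Invert to category -> level so it can be passed to Series.map.
category_to_level = {cat: level for level, cats in levels.items() for cat in cats}

# Toy values; "Unknown-school" is invented and not part of the dataset.
education = pd.Series(["Bachelors", "HS-grad", "9th", "Unknown-school"])
mapped = education.map(category_to_level)

# Categories that the dictionary does not cover end up as NaN.
unmapped = sorted(education[mapped.isna()].unique())
print(mapped.tolist(), unmapped)
```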

**Visualize the number of people in each of the new education levels (high, medium, low)**

In [None]:
sns.countplot(data = df, x = "education_summary");

**Check the number of people at each "salary" level by the new education levels (high, medium, low) and visualize it with a countplot**

In [None]:
df.groupby("education_summary")["salary"].value_counts()

In [None]:
sns.countplot(data = df, x = "education_summary", hue = "salary");

**Check the percentage distribution of people across "salary" levels within each new education level (high, medium, low) and visualize each with a separate pie chart**

In [None]:
(df.groupby("education_summary")["salary"].value_counts()) / (df.groupby("education_summary")["salary"].count())

In [None]:
high_edu = (df.groupby("education_summary")["salary"].value_counts() / df.groupby("education_summary")["salary"].count())[:2]
high_edu

In [None]:
low_edu = (df.groupby("education_summary")["salary"].value_counts() / df.groupby("education_summary")["salary"].count())[2:4]
low_edu

In [None]:
medium_edu = (df.groupby("education_summary")["salary"].value_counts() / df.groupby("education_summary")["salary"].count())[4:6]
medium_edu

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize = (12, 6))

# set_ylabel is a method: call it instead of assigning to it
ax1.pie(x = high_edu, labels = ["<=50K", ">50K"], autopct = "%.2f%%")
ax1.set_ylabel("salary")
ax1.set_title("high_level_grade")

ax2.pie(x = low_edu, labels = ["<=50K", ">50K"], autopct = "%.2f%%")
ax2.set_ylabel("salary")
ax2.set_title("low_level_grade")

ax3.pie(x = medium_edu, labels = ["<=50K", ">50K"], autopct = "%.2f%%")
ax3.set_ylabel("salary")
ax3.set_title("medium_level_grade");
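
Row-wise percentages like these can also be produced in one step with `pd.crosstab` and `normalize="index"`, which avoids slicing the grouped result by position. A minimal sketch on invented toy data with shortened level names:

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "education_summary": ["high", "high", "high", "high", "low", "low"],
    "salary": ["<=50K", ">50K", ">50K", ">50K", "<=50K", "<=50K"],
})

# Percentage of each salary level within every education level (rows sum to 100).
share = pd.crosstab(toy["education_summary"], toy["salary"], normalize="index") * 100
print(share)
```

Swapping the two arguments gives the opposite view, the education mix within each salary level.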

**Check the number of people in each new education level (high, medium, low) by "salary" level and visualize it with a countplot**

In [None]:
df.groupby("salary")["education_summary"].value_counts()

In [None]:
sns.countplot(data = df, x = "salary", hue = "education_summary");

**Check the percentage distribution of people across the new education levels (high, medium, low) within each "salary" level and visualize each with a separate pie chart**

In [None]:
df.groupby("salary")["education_summary"].value_counts() / df.groupby("salary")["education_summary"].count()

In [None]:
low_sal = (df.groupby("salary")["education_summary"].value_counts() / df.groupby("salary")["education_summary"].count())[:3]
low_sal

In [None]:
high_sal = (df.groupby("salary")["education_summary"].value_counts() / df.groupby("salary")["education_summary"].count())[3:6]
high_sal

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (12, 6))

# set_ylabel and set_title are methods: call them instead of assigning to them
ax1.pie(x = low_sal, labels = ["<=50K, medium_level_grade", "<=50K, high_level_grade", "<=50K, low_level_grade"], autopct = "%.2f%%")
ax1.set_ylabel("education_summary")
ax1.set_title("<=50K")

ax2.pie(x = high_sal, labels = [">50K, high_level_grade", ">50K, medium_level_grade", ">50K, low_level_grade"], autopct = "%.2f%%")
ax2.set_ylabel("education_summary")
ax2.set_title(">50K");

In [None]:
salary = df.groupby("salary")["education_summary"].value_counts() / df.groupby("salary")["education_summary"].count()
salary

In [None]:
# Name the values "percentage" and turn the index levels into columns
salary_table = salary.rename("percentage").reset_index()
salary_table

In [None]:
low_1 = (df.groupby("salary")["education_summary"].value_counts() / df.groupby("salary")["education_summary"].count())[:3]
low_1

In [None]:
high_1 = (df.groupby("salary")["education_summary"].value_counts() / df.groupby("salary")["education_summary"].count())[3:6]
high_1

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (12, 6))

ax1.pie(x = low_1, labels = ["medium", "high", "low"], autopct = "%.2f%%")
ax1.set_title("<=50K")

ax2.pie(x = high_1, labels = ["high", "medium", "low"], autopct = "%.2f%%")
ax2.set_title(">50K");

**Result :** Higher education clearly goes together with higher income. However, 82% of people with a medium-level education still earn $50K or less.

## marital_status & relationship

**Detect the similarities between these features by comparing unique values**

In [None]:
df["marital_status"].value_counts()

In [None]:
df["relationship"].value_counts(dropna = False)

In [None]:
# Fill missing values with "Unknown" only in the "relationship" column

df["relationship"] = df["relationship"].fillna("Unknown")

In [None]:
df.groupby("relationship")["marital_status"].value_counts()

**Visualize the number of people in each category**

In [None]:
fig, ax = plt.subplots(figsize = (12, 6))

sns.countplot(data = df, x = "marital_status")

# Rotate the existing tick labels instead of overwriting them
ax.tick_params(axis = "x", labelrotation = 45);

**Check the number of people at each "salary" level by category and visualize it with a countplot**

In [None]:
df.groupby("marital_status")["salary"].value_counts()

In [None]:
fig, ax = plt.subplots(figsize = (12, 6))

sns.countplot(data = df, x = "marital_status", hue = "salary")

# Rotate the existing tick labels instead of overwriting them
ax.tick_params(axis = "x", labelrotation = 45);

**Reduce the categories of the "marital_status" feature to married and unmarried, and create a new feature with this new categorical data**

In [None]:
def mapping_marital_status(x):
    if x in ["Never-married", "Divorced", "Separated", "Widowed"]:
        return "unmarried"
    elif x in ["Married-civ-spouse", "Married-AF-spouse", "Married-spouse-absent"]:
        return "married"

In [None]:
df.marital_status.apply(mapping_marital_status).value_counts()

In [None]:
# Use the "mapping_marital_status" function above to create a new column named "marital_status_summary"
df["marital_status_summary"] = df.marital_status.apply(mapping_marital_status)

**Visualize the number of people in each of the new marital statuses (married, unmarried)**

In [None]:
sns.countplot(data = df, x = "marital_status_summary");

**Check the number of people at each "salary" level by the new marital statuses (married, unmarried) and visualize it with a countplot**

In [None]:
df.groupby("marital_status_summary")["salary"].value_counts()

In [None]:
sns.countplot(data = df, x = "marital_status_summary", hue = "salary");

**Check the percentage distribution of people across "salary" levels within each new marital status (married, unmarried) and visualize each with a separate pie chart**

In [None]:
df.groupby("marital_status_summary")["salary"].value_counts() / df.groupby("marital_status_summary")["salary"].count()

In [None]:
married = (df.groupby("marital_status_summary")["salary"].value_counts() / df.groupby("marital_status_summary")["salary"].count())[:2]
married

In [None]:
unmarried = (df.groupby("marital_status_summary")["salary"].value_counts() / df.groupby("marital_status_summary")["salary"].count())[2:4]
unmarried

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (12, 6))

ax1.pie(x = married, labels = ["<=50K", ">50K"], autopct = "%.2f%%")
ax1.set_ylabel("salary")
ax1.set_title("married")

ax2.pie(x = unmarried, labels = ["<=50K", ">50K"], autopct = "%.2f%%")
ax2.set_ylabel("salary")
ax2.set_title("unmarried");

**Check the number of people in each new marital status (married, unmarried) by "salary" level and visualize it with a countplot**

In [None]:
df.groupby("salary")["marital_status_summary"].value_counts()

In [None]:
sns.countplot(data = df, x = "salary", hue = "marital_status_summary");

**Check the percentage distribution of people across the new marital statuses (married, unmarried) within each "salary" level and visualize each with a separate pie chart**

In [None]:
df.groupby("salary")["marital_status_summary"].value_counts() / df.groupby("salary")["marital_status_summary"].count()

In [None]:
sal_mar = df.groupby("salary")["marital_status_summary"].value_counts() / df.groupby("salary")["marital_status_summary"].count()

In [None]:
# Name the values "percentage" and turn the index levels into columns
sal_mar_table = sal_mar.rename("percentage").reset_index()

# sort_values returns a new frame, so assign the result back
sal_mar_table = sal_mar_table.sort_values(by = ["salary", "marital_status_summary"])
sal_mar_table

In [None]:
low_salary = (df.groupby("salary")["marital_status_summary"].value_counts() * 100 / 
df.groupby("salary")["marital_status_summary"].count())[:2]

high_salary = (df.groupby("salary")["marital_status_summary"].value_counts() * 100 / 
df.groupby("salary")["marital_status_summary"].count())[2:4]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (12, 6))

# set_ylabel is a method: call it instead of assigning to it
ax1.pie(x = low_salary, labels = ["unmarried", "married"], autopct = "%.2f%%")
ax1.set_ylabel("percentage")
ax1.set_title("<=50K")

ax2.pie(x = high_salary, labels = ["married", "unmarried"], autopct = "%.2f%%")
ax2.set_ylabel("percentage")
ax2.set_title(">50K");

**Result :** Married people are more likely to have a high income than unmarried people. Most people in the high-income group are married, and most people in the low-income group are unmarried.

## workclass

**Check the number of people in each category and visualize it with a countplot**

In [None]:
df.workclass.value_counts()

In [None]:
fig, ax = plt.subplots(figsize = (12, 6))

sns.countplot(data = df, x = "workclass")

# Rotate the existing tick labels instead of overwriting them
ax.tick_params(axis = "x", labelrotation = 90);

**Replace the value "?" with "Unknown"**

In [None]:
# Replace "?" values with "Unknown"
# Assign the result back; an inplace call on df.workclass may not update df in newer pandas versions

df["workclass"] = df["workclass"].replace("?", "Unknown")
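
When several columns use the same placeholder, the replacement can be applied to all of them in one assignment. A minimal sketch on invented toy data; `placeholder_cols` is a hypothetical name for the list of affected columns:

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov"],
    "occupation": ["?", "Sales", "?"],
    "age": [39, 50, 38],
})

# Columns that use "?" as a placeholder for unknown values.
placeholder_cols = ["workclass", "occupation"]

# Replace the placeholder in all listed columns and assign the result back.
toy[placeholder_cols] = toy[placeholder_cols].replace("?", "Unknown")
print(toy)
```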

**Check the number of people at each "salary" level by workclass group and visualize it with a countplot**

In [None]:
df.groupby("workclass")["salary"].value_counts()

In [None]:
fig, ax = plt.subplots(figsize = (12, 6))

sns.countplot(data = df, x = "workclass", hue = "salary")

# Rotate the existing tick labels instead of overwriting them
ax.tick_params(axis = "x", labelrotation = 90);

**Check the percentage distribution of people across "salary" levels within each workclass group and visualize it with a bar plot**

In [None]:
df.groupby("workclass")["salary"].value_counts() / df.groupby("workclass")["salary"].count()

In [None]:
per_sal = df.groupby("workclass")["salary"].value_counts() / df.groupby("workclass")["salary"].count()

# Name the values "percentage" and turn the index levels into columns
tab_per = per_sal.rename("percentage").reset_index()

tab_per

In [None]:
fig, ax = plt.subplots(figsize = (12, 6))

sns.barplot(data = tab_per, x = "workclass", y = "percentage", hue = "salary")

ax.set_xticklabels(tab_per.workclass.unique(), rotation = 90)

for i in ax.containers:
    ax.bar_label(i,fmt="%.2f");

**Check the number of people in each workclass group by "salary" level and visualize it with a countplot**

In [None]:
df.groupby("salary")["workclass"].value_counts()

In [None]:
sns.countplot(data = df, x = "salary", hue = "workclass");

**Check the percentage distribution of people across workclass groups within each "salary" level and visualize it with a bar plot**

In [None]:
df.groupby("salary")["workclass"].value_counts() / df.groupby("salary")["workclass"].count()

In [None]:
work_per = df.groupby("salary")["workclass"].value_counts() / df.groupby("salary")["workclass"].count()

# Name the values "percentage" and turn the index levels into columns
tab_work = work_per.rename("percentage").reset_index()

tab_work

In [None]:
fig, ax = plt.subplots(figsize = (12, 6))

sns.barplot(data = tab_work, x = "salary", y = "percentage", hue = "workclass")

plt.legend(loc = "upper right");

for i in ax.containers:
    ax.bar_label(i,fmt="%.3f");

**Write down the conclusions you draw from your analysis**

**Result :** "Private" is by far the largest work class, so it makes up the largest share of both the high-income group and the low-income group.

## occupation

**Check the number of people in each category and visualize it with a countplot**

In [None]:
df.occupation.value_counts()

In [None]:
fig, ax = plt.subplots(figsize = (12, 6))

sns.countplot(data = df, x = "occupation")

# Rotate the existing tick labels instead of overwriting them
ax.tick_params(axis = "x", labelrotation = 90);

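**Replace the value "?" with "Unknown"**

In [None]:
# Replace "?" values with "Unknown"
# Assign the result back; an inplace call on df.occupation may not update df in newer pandas versions
df["occupation"] = df["occupation"].replace("?", "Unknown")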

**Check the number of people at each "salary" level by occupation group and visualize it with a countplot**

In [None]:
df.groupby("occupation")["salary"].value_counts()

In [None]:
fig, ax = plt.subplots(figsize = (10, 5))

sns.countplot(data = df, x = "occupation", hue = "salary")

# Rotate the existing tick labels instead of overwriting them
ax.tick_params(axis = "x", labelrotation = 90);

**Check the percentage distribution of people across "salary" levels within each occupation group and visualize it with a bar plot**

In [None]:
df.groupby("occupation")["salary"].value_counts() / df.groupby("occupation")["salary"].count()

In [None]:
per_occ = df.groupby("occupation")["salary"].value_counts() / df.groupby("occupation")["salary"].count()

# Name the values "percentage" and turn the index levels into columns
tab_occ = per_occ.rename("percentage").reset_index()

tab_occ

In [None]:
fig, ax = plt.subplots(figsize = (12, 6))

sns.barplot(data = tab_occ, x = "occupation", y = "percentage", hue = "salary")

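# tab_occ is sorted alphabetically by the groupby, so keep the labels seaborn assigned
# and only rotate them; overwriting them with df.occupation.unique() would mislabel the bars
ax.tick_params(axis = "x", labelrotation = 90)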

for i in ax.containers:
    ax.bar_label(i,fmt="%.2f");

**Check the number of people in each occupation group by "salary" level and visualize it with a countplot**

In [None]:
df.groupby("salary")["occupation"].value_counts()

In [None]:
sns.countplot(data = df, x = "salary", hue = "occupation");

**Check the percentage distribution of people across occupation groups within each "salary" level and visualize it with a bar plot**

In [None]:
df.groupby("salary")["occupation"].value_counts() / df.groupby("salary")["occupation"].count()

In [None]:
occ_per = df.groupby("salary")["occupation"].value_counts() / df.groupby("salary")["occupation"].count()

# Name the ratio explicitly, then flatten the MultiIndex into regular columns
tab_occ = occ_per.rename("percentage").reset_index()

tab_occ

In [None]:
fig, ax = plt.subplots(figsize = (12, 6))

sns.barplot(data = tab_occ, x = "salary", y = "percentage", hue = "occupation")

for i in ax.containers:
    ax.bar_label(i,fmt="%.2f");

**Result :** "Exec-managerial" and "Prof-specialty" occupations have a high ratio (~50%) of high-level income, both within their own occupation group and within the high-income group. In the low-income group, the shares of the occupations do not differ much.
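The two readings used in this result (the share within an occupation versus the share within a salary level) correspond to row-wise and column-wise normalization of the same table. The following is a minimal sketch on a small made-up frame; `toy` and its values are illustrative assumptions, not census data. It uses `pd.crosstab` with its `normalize` argument as an alternative to dividing two groupby results.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the two normalizations
toy = pd.DataFrame({
    "occupation": ["Sales"] * 6 + ["Tech"] * 2,
    "salary": ["<=50K"] * 4 + [">50K"] * 2 + ["<=50K", ">50K"],
})

# Share of each salary level within an occupation (each row sums to 1)
within_occupation = pd.crosstab(toy["occupation"], toy["salary"], normalize="index")

# Share of each occupation within a salary level (each column sums to 1)
within_salary = pd.crosstab(toy["occupation"], toy["salary"], normalize="columns")

print(within_occupation.round(3))
print(within_salary.round(3))
```

In this toy frame, one third of the "Sales" rows are high income, while "Sales" makes up two thirds of the high-income rows, so the two tables answer different questions.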

## race

**Check the number of people in each category and visualize it with a countplot**

In [None]:
df["race"].value_counts()

In [None]:
sns.countplot(data = df, x = "race");

**Check the number of people in each "salary" level by race and visualize it with a countplot**

In [None]:
df.groupby("race")["salary"].value_counts()

In [None]:
sns.countplot(data = df, x = "race", hue = "salary");

**Check the percentage distribution of people across "salary" levels within each race and visualize it with pie plots**

In [None]:
df.groupby("race")["salary"].value_counts() / df.groupby("race")["salary"].count()

In [None]:

# Percentage of each salary level within each race, computed once.
# groupby sorts the races alphabetically, so each race occupies two consecutive rows.
race_sal = df.groupby("race")["salary"].value_counts() * 100 / df.groupby("race")["salary"].count()

amer_sal = race_sal.iloc[:2]
asian_sal = race_sal.iloc[2:4]
black_sal = race_sal.iloc[4:6]
other_sal = race_sal.iloc[6:8]
white_sal = race_sal.iloc[8:10]


In [None]:
fig, ax = plt.subplots(2, 3, figsize = (12, 6))

ax[0,0].pie(x = amer_sal, 
            labels=["<=50K", ">50K"],
            autopct= "%.2f%%")
ax[0,0].set_title("Amer-Indian-Eskimo")
ax[0,0].set_ylabel("salary")

ax[0,1].pie(x = asian_sal, 
            labels=["<=50K", ">50K"],
            autopct= "%.2f%%")
ax[0,1].set_title("Asian-Pac-Islander")
ax[0,1].set_ylabel("salary")

ax[0,2].pie(x = black_sal, 
            labels=["<=50K", ">50K"],
            autopct= "%.2f%%")
ax[0,2].set_title("Black")
ax[0,2].set_ylabel("salary")

ax[1,0].pie(x = other_sal, 
            labels=["<=50K", ">50K"],
            autopct= "%.2f%%")
ax[1,0].set_title("Other")
ax[1,0].set_ylabel("salary")

ax[1,1].pie(x = white_sal, 
            labels=["<=50K", ">50K"],
            autopct= "%.2f%%")
ax[1,1].set_title("White")
ax[1,1].set_ylabel("salary")

ax[1,2].axis("off");

**Check the number of people in each race by "salary" level and visualize it with a countplot**

In [None]:
df.groupby("salary")["race"].value_counts()

In [None]:
sns.countplot(data = df, x = "salary", hue = "race");

**Check the percentage distribution of people across races within each "salary" level and visualize it with a bar plot**

In [None]:
df.groupby("salary")["race"].value_counts() / df.groupby("salary")["race"].count()

In [None]:
race_per = df.groupby("salary")["race"].value_counts() / df.groupby("salary")["race"].count()

# Name the ratio explicitly, then flatten the MultiIndex into regular columns
tab_race = race_per.rename("percentage").reset_index()

tab_race

In [None]:
fig, ax = plt.subplots(figsize = (12, 6))

sns.barplot(data = tab_race, x = "salary", y = "percentage", hue = "race")

for i in ax.containers:
    ax.bar_label(i,fmt="%.3f");

**Result :** "White" has the highest share in the high-income group and also in the low-income group. Within their own race groups, about 25% of "White" and "Asian-Pac-Islander" individuals have a high income.

## gender

**Check the number of people of each gender and visualize it with a countplot**

In [None]:
df.sex.value_counts()

In [None]:
sns.countplot(data = df, x = "sex");

**Check the number of people in each "salary" level by gender and visualize it with a countplot**

In [None]:
df.groupby("sex")["salary"].value_counts()

In [None]:
sns.countplot(data = df, x = "sex", hue = "salary");

**Check the percentage distribution of people across "salary" levels within each gender and visualize it with pie plots**

In [None]:
df.groupby("sex")["salary"].value_counts() / df.groupby("sex")["salary"].count()

In [None]:
fem_sal = (df.groupby("sex")["salary"].value_counts() * 100 / df.groupby("sex")["salary"].count())[:2]

ma_sal = (df.groupby("sex")["salary"].value_counts() * 100 / df.groupby("sex")["salary"].count())[2:4]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (12, 6))

ax1.pie(x = fem_sal, labels = ["<=50K", ">50K"], autopct = ("%.2f%%"))
ax1.set_ylabel("salary")
ax1.set_title("Female")

ax2.pie(x = ma_sal, labels = ["<=50K", ">50K"], autopct = ("%.2f%%"))
ax2.set_ylabel("salary")
ax2.set_title("Male");

**Check the number of people of each gender by "salary" level and visualize it with a countplot**

In [None]:
df.groupby("salary")["sex"].value_counts()

In [None]:
sns.countplot(data = df, x = "salary", hue = "sex");

**Check the percentage distribution of people across genders within each "salary" level and visualize it with pie plots**

In [None]:
df.groupby("salary")["sex"].value_counts() / df.groupby("salary")["sex"].count()

In [None]:
less_50 = (df.groupby("salary")["sex"].value_counts() * 100 / df.groupby("salary")["sex"].count())[:2]
more_50 = (df.groupby("salary")["sex"].value_counts() * 100 / df.groupby("salary")["sex"].count())[2:4]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (12, 6))

ax1.pie(x = less_50, labels = ["Male", "Female"],  autopct = ("%.2f%%"))
ax1.set_ylabel("gender")
ax1.set_title("<=50K")

ax2.pie(x = more_50, labels = ["Male", "Female"], autopct = ("%.2f%%"))
ax2.set_ylabel("gender")
ax2.set_title(">50K");

**Result :** Males are more likely to have a high income than females. Within each gender, about 30% of males and 11% of females are in the high-income group.

## native_country

**Check the number of people in each category and visualize it with a countplot**

In [None]:
df.native_country.value_counts()

In [None]:
fig, ax = plt.subplots(figsize = (12,6))

sns.countplot(data = df, x = "native_country")

# Rotate the existing tick labels instead of overwriting them
ax.tick_params(axis = "x", labelrotation = 90);

**Replace the value "?" with the value "Unknown"**

In [None]:
# Replace "?" values with "Unknown" (assign back instead of using inplace on a column)

df["native_country"] = df["native_country"].replace("?", "Unknown")

**Reduce the categories of the "native_country" feature to "US" and "Others", and create a new feature from this reduced categorical data**

In [None]:
def mapping_native_country(x):
    if x == "United-States":
        return "US"
    else:
        return "Others"

In [None]:
df.native_country.apply(mapping_native_country).value_counts()

In [None]:
# By using "mapping_native_country" def function above, create a new column named "native_country_summary"

df["native_country_summary"] = df.native_country.apply(mapping_native_country)

df.native_country_summary
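For a two-way split like this one, the same result can be obtained without calling a Python function on every row. The following is a minimal sketch on a made-up Series; `countries` is a hypothetical stand-in for `df["native_country"]`, and `np.where` is used here as an alternative to `apply`, not as the notebook's method.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values standing in for df["native_country"]
countries = pd.Series(["United-States", "Mexico", "Unknown", "United-States", "India"])

# Vectorized equivalent of applying mapping_native_country element by element
summary = pd.Series(
    np.where(countries == "United-States", "US", "Others"),
    index=countries.index,
)

print(summary.value_counts())
```

Note that "Unknown" falls into "Others" here, just as it does with the `mapping_native_country` function.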

**Visualize the number of people in each new category (US, Others)**

In [None]:
sns.countplot(data = df, x = "native_country_summary");

**Check the number of people in each "salary" level by the new native country categories (US, Others) and visualize it with a countplot**

In [None]:
df.groupby("native_country_summary")["salary"].value_counts()

In [None]:
sns.countplot(data = df, x = "native_country_summary", hue = "salary");

**Check the percentage distribution of people across "salary" levels within each new native country category (US, Others) and visualize it with separate pie plots**

In [None]:
df.groupby("native_country_summary")["salary"].value_counts() / df.groupby("native_country_summary")["salary"].count()

In [None]:
per_other = (df.groupby("native_country_summary")["salary"].value_counts() * 100 / df.groupby("native_country_summary")["salary"].count())[:2]
per_US = (df.groupby("native_country_summary")["salary"].value_counts() * 100 / df.groupby("native_country_summary")["salary"].count())[2:4]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (10, 5))

ax1.pie(x = per_other, labels = ["<=50K", ">50K"], autopct = ("%.2f%%"))
ax1.set_ylabel("salary")
ax1.set_title("Others")

ax2.pie(x = per_US, labels = ["<=50K", ">50K"], autopct = ("%.2f%%"))
ax2.set_ylabel("salary")
ax2.set_title("US");

**Check the number of people in each new native country category (US, Others) by "salary" level**

In [None]:
df.groupby("salary")["native_country_summary"].value_counts()

**Check the percentage distribution of people across the new native country categories (US, Others) within each "salary" level and visualize it with separate pie plots**

In [None]:
df.groupby("salary")["native_country_summary"].value_counts() / df.groupby("salary")["native_country_summary"].count()

In [None]:
sal_less50 = (df.groupby("salary")["native_country_summary"].value_counts() * 100 / df.groupby("salary")["native_country_summary"].count())[:2]
sal_more50 = (df.groupby("salary")["native_country_summary"].value_counts() * 100 / df.groupby("salary")["native_country_summary"].count())[2:4]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (12, 6))

ax1.pie(x = sal_less50, labels = ["US", "Others"],  autopct = ("%.2f%%"))
ax1.set_ylabel("native_country_summary")
ax1.set_title("<=50K")

ax2.pie(x = sal_more50, labels = ["US", "Others"], autopct = ("%.2f%%"))
ax2.set_ylabel("native_country_summary")
ax2.set_title(">50K");

**Result :** People from the United States make up a large share of both the high-income group and the low-income group. When we compare the salary distribution within "US" and within "Others", we see no notable difference between them.

## <p style="background-color:#47B699; font-family:newtimeroman; color:white; font-size:175%; text-align:center; border-radius:10px 10px;">Other Specific Analysis Questions</p>

<a id="5"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#47B699" data-toggle="popover">Content</a>

### 1. What is the average age of males and females by income level?

In [None]:
df.groupby(["salary", "sex"])["age"].mean()

In [None]:
ax = pd.DataFrame(df.groupby(["salary", "sex"])["age"].mean()).plot(kind = "bar", xlabel = "salary, sex")

ax.bar_label(ax.containers[0], fmt = "%.2f");

In [None]:
tab1 = df.groupby(["salary", "sex"])["age"].mean()

sal_gen_age = pd.DataFrame(tab1)

sal_gen_age.reset_index(level = [0, 1], inplace = True)

sal_gen_age

In [None]:
fig, ax = plt.subplots(figsize = (12, 6))

sns.barplot(data = sal_gen_age, x = "salary", y = "age", hue = "sex")

for i in ax.containers:
    ax.bar_label(i, fmt = ("%.2f"));
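The grouped means can also be laid out as a small salary-by-sex table, which is sometimes easier to read than the long format. This is a minimal sketch on made-up rows; `toy` and its values are illustrative assumptions. It uses `unstack` to move the "sex" level of the index into columns.

```python
import pandas as pd

# Hypothetical toy rows, only to illustrate the reshaping
toy = pd.DataFrame({
    "salary": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K"],
    "sex":    ["Male", "Female", "Male", "Female", "Male", "Female"],
    "age":    [30, 34, 44, 40, 40, 44],
})

# Mean age per (salary, sex), with one row per salary level and one column per sex
mean_age = toy.groupby(["salary", "sex"])["age"].mean().unstack("sex")

print(mean_age)
```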

### 2. What are the workclass percentages of Americans in the high-level income group?

In [None]:
# Americans in the high-income group
high_income = df.loc[(df.native_country == "United-States") & (df.salary == ">50K")]

# Share of each workclass within this subset, in percent
high_income.workclass.value_counts() * 100 / high_income.workclass.count()

In [None]:
graph1 = high_income.workclass.value_counts() * 100 / high_income.workclass.count()

fig, ax = plt.subplots(figsize = (12, 6))

sns.barplot(x = graph1.index, y = graph1.values)

ax.set_xticklabels(graph1.index, rotation = 45)

for i in ax.containers:
    ax.bar_label(i, fmt = "%.2f");

### 3. What are the occupation percentages of Americans in the "Private" workclass in the high-level income group?

In [None]:
private1 = df.loc[(df.native_country == "United-States") & (df.salary== ">50K") & (df.workclass == "Private")]

private1.occupation.value_counts() * 100 / private1.occupation.count()

In [None]:
per_private1 = private1.occupation.value_counts() * 100 / private1.occupation.count()

fig, ax = plt.subplots(figsize = (10, 5))

sns.barplot(x = per_private1.index, y = per_private1.values)

ax.set_xticklabels(per_private1.index, rotation = 90)

for i in ax.containers:
    ax.bar_label(i, fmt = "%.2f");

### 4. What are the education level percentages of the Asian-Pac-Islander race group in the high-level income group?

In [None]:
edu_asian = df.loc[(df.salary == ">50K") & (df.race == "Asian-Pac-Islander")]

edu_asian.education.value_counts() * 100 / edu_asian.education.count()

In [None]:
graph2 = edu_asian.education.value_counts() * 100 / edu_asian.education.count()

fig, ax = plt.subplots(figsize = (10, 5))

sns.barplot(x = graph2.index, y = graph2.values)

ax.set_xticklabels(graph2.index, rotation = 90)

for i in ax.containers:
    ax.bar_label(i, fmt = "%.2f");

### 5. What are the occupation percentages of the Asian-Pac-Islander race group with a Bachelors degree in the high-level income group?

In [None]:
bach_asian = df.loc[(df.salary == ">50K") & (df.race == "Asian-Pac-Islander") & (df.education == "Bachelors")]

bach_asian.occupation.value_counts() * 100 / bach_asian.occupation.count()

In [None]:
graph3 = bach_asian.occupation.value_counts() * 100 / bach_asian.occupation.count()

fig, ax = plt.subplots(figsize = (10, 5))

sns.barplot(x = graph3.index, y = graph3.values)

ax.set_xticklabels(graph3.index, rotation = 90)

for i in ax.containers:
    ax.bar_label(i, fmt = "%.2f");

### 6. What is the mean of working hours per week by gender for education level, workclass and marital status? Try to plot all required in one figure.

In [None]:
g = sns.catplot(data = df,
                x = "education_summary",
                y = "hours_per_week",
                hue = "sex",
                kind = "bar",
                ci = None,
                col = "marital_status_summary",
                row = "native_country_summary")

g.fig.set_size_inches(15, 8)
g.fig.subplots_adjust(top = 0.9)
g.fig.suptitle("Working Hours Per Week by Sex for Education, Native Country, Marital Status")

# Label the bars in every facet of the grid, whatever its size
for ax in g.axes.flat:
    for container in ax.containers:
        ax.bar_label(container, fmt = "%.2f")

## <p style="background-color:#47B699; font-family:newtimeroman; color:white; font-size:175%; text-align:center; border-radius:10px 10px;">Dropping Similar & Unnecessary Features</p>

<a id="6"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#47B699" data-toggle="popover">Content</a>

In [None]:
df.info()

In [None]:
# Drop the columns of "education", "education_num", "relationship", "marital_status", "native_country" permanently

df.drop(columns = ["education", "education_num", "relationship", "marital_status", "native_country"], inplace = True)

## <p style="background-color:#47B699; font-family:newtimeroman; color:white; font-size:175%; text-align:center; border-radius:10px 10px;">Handling with Missing Values</p>

<a id="7"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#47B699" data-toggle="popover">Content</a>

**Check for missing values in all features**

In [None]:
df.isnull().sum()

**1. There appear to be no missing values. But we know that the "workclass" and "occupation" features contain missing values stored as the string "Unknown". Examine these features in more detail.**

**2. Decide whether or not to drop the rows with these "Unknown" string values**
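Because the placeholder is an ordinary string, `isnull()` cannot see it. The following is a minimal sketch on a made-up frame; `toy` and its values are illustrative assumptions. It shows one way to count the placeholder per column before deciding what to do with it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; "Unknown" plays the role of a disguised missing value
toy = pd.DataFrame({
    "workclass": ["Private", "Unknown", "Private", "State-gov"],
    "occupation": ["Sales", "Unknown", "Unknown", "Tech-support"],
    "age": [25, 40, 31, 52],
})

# isnull() does not count string placeholders
print(toy.isnull().sum())

# Count the placeholder per column instead
unknown_counts = (toy == "Unknown").sum()
print(unknown_counts)

# After converting the placeholder to NaN, the usual missing-value tools apply
cleaned = toy.replace("Unknown", np.nan)
print(cleaned.isnull().sum())
```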

In [None]:
df.workclass.value_counts()

In [None]:
df["occupation"].value_counts()

In [None]:
df[(df["workclass"] == "Unknown") | (df.workclass == "Never-worked")]["workclass"].value_counts()

In [None]:
# Replace "Unknown" values with NaN using numpy library
df.replace("Unknown", np.nan, inplace = True)

In [None]:
df.isna().sum()

In [None]:
# Drop missing values in df permanently
df.dropna(inplace = True)

In [None]:
df.isnull().sum()

In [None]:
df.info()

## <p style="background-color:#47B699; font-family:newtimeroman; color:white; font-size:175%; text-align:center; border-radius:10px 10px;">Handling with Outliers</p>

<a id="8"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#47B699" data-toggle="popover">Content</a>

### Boxplot and Histplot for all numeric features

**Plot boxplots for each numeric feature in the same figure as subplots**

In [None]:
df.boxplot();

In [None]:
fig, ax = plt.subplots(2, 3, figsize = (12, 6))

sns.boxplot(data = df, x = "age", ax = ax[0,0])
sns.boxplot(data = df, x = "fnlwgt", ax = ax[0,1])
sns.boxplot(data = df, x = "capital_gain", ax = ax[0,2])
sns.boxplot(data = df, x = "capital_loss", ax = ax[1,0])
sns.boxplot(data = df, x = "hours_per_week", ax = ax[1,1])

ax[1,2].axis("off");

**Plot both boxplots and histograms for each numeric feature in the same figure as subplots**

In [None]:
fig, ax = plt.subplots(5, 2, figsize = (15, 18))

sns.boxplot(data = df, x = "age", ax = ax[0,0])
sns.boxplot(data = df, x = "fnlwgt", ax = ax[1,0])
sns.boxplot(data = df, x = "capital_gain", ax = ax[2,0])
sns.boxplot(data = df, x = "capital_loss", ax = ax[3,0])
sns.boxplot(data = df, x = "hours_per_week", ax = ax[4,0])

sns.histplot(data = df, x = "age", ax = ax[0,1])
sns.histplot(data = df, x = "fnlwgt", ax = ax[1,1])
sns.histplot(data = df, x = "capital_gain", ax = ax[2,1])
sns.histplot(data = df, x = "capital_loss", ax = ax[3,1])
sns.histplot(data = df, x = "hours_per_week", ax = ax[4,1]);

**Check the statistical values for all numeric features**

In [None]:
df.describe().T

**1. After analyzing all features, we have decided not to treat the extreme values in the "fnlwgt", "capital_gain" and "capital_loss" features as outliers.**

**2. So let's examine the "age" and "hours_per_week" features and detect extreme values which could be outliers by using the IQR rule.**
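The IQR rule flags values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where IQR = Q3 - Q1. The following is a minimal sketch that wraps the rule in a reusable helper; `iqr_limits` and the sample `ages` are hypothetical names and values introduced for illustration only.

```python
import pandas as pd

def iqr_limits(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical ages, for illustration only
ages = pd.Series([22, 25, 28, 31, 35, 38, 41, 44, 47, 90])

lower, upper = iqr_limits(ages)

# Values outside the fences are candidates for outliers, not confirmed outliers
outside = ages[(ages < lower) | (ages > upper)]

print(lower, upper)
print(outside.tolist())
```

A helper like this avoids reusing the names `Q1`, `Q3` and `IQR` for different features, which makes it harder to mix up the limits of "age" and "hours_per_week".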

### age

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (10, 5))

sns.boxplot(data = df, y = "age", ax = ax[0])
sns.histplot(data = df, x = "age", ax = ax[1]);

In [None]:
# Find IQR defining quantile 0.25 for low level and 0.75 for high level 
Q1 = df.age.quantile(0.25)
Q3 = df.age.quantile(0.75)
IQR = Q3 - Q1

Q1, Q3, IQR

In [None]:
# Find lower and upper limit using IQR
age_lower_lim = Q1 - 1.5 * IQR
age_upper_lim = Q3 + 1.5 * IQR

age_lower_lim, age_upper_lim

In [None]:
df.age.value_counts().tail(14)

In [None]:
# Select the observations whose age is greater than the upper limit and sort them by age in descending order
df[df.age > age_upper_lim].sort_values(by = ["age"], ascending = False)

### hours_per_week

In [None]:
fig, ax = plt.subplots(1, 2, figsize = (10, 5))

sns.boxplot(data = df, y = "hours_per_week", ax = ax[0])
sns.histplot(data = df, x = "hours_per_week", ax = ax[1], bins = 10);

In [None]:
# Find IQR defining quantile 0.25 for low level and 0.75 for high level 

Q1 = df.hours_per_week.quantile(0.25)
Q3 = df.hours_per_week.quantile(0.75)

IQR = Q3 - Q1

Q1, Q3, IQR

In [None]:
# Find the lower and upper limit using IQR
hours_per_week_lower_lim = Q1 - 1.5 * IQR
hours_per_week_upper_lim = Q3 + 1.5 * IQR

hours_per_week_lower_lim, hours_per_week_upper_lim

In [None]:
df[df.hours_per_week > hours_per_week_upper_lim].hours_per_week.value_counts().sort_index(ascending = False)

In [None]:
# Select the observations whose hours per week are greater than the upper limit and
# sort them by hours per week in descending order
df[df.hours_per_week > hours_per_week_upper_lim].sort_values(by = ["hours_per_week"], ascending = False)

In [None]:
# Observations whose hours per week are below the lower limit
df[df.hours_per_week < hours_per_week_lower_lim].hours_per_week.value_counts().sort_index(ascending = True)

In [None]:
df.loc[df.hours_per_week < hours_per_week_lower_lim].groupby("salary")["hours_per_week"].describe()

In [None]:
df.loc[df.hours_per_week < hours_per_week_lower_lim].groupby("salary")["age"].describe()

**Result :** As we can see, there are a number of extreme values in both the "age" and "hours_per_week" features. But how can we know whether these extreme values are outliers or not? At this point, **domain knowledge** comes to the fore.

**Domain Knowledge for this dataset:**
1. In this dataset, all values are based on the statements of individuals, so there may be some data-entry errors.
2. In addition, we restrict the sample space somewhat, with the aim of getting better performance from the ML model.
3. In this respect, our sample space ranges for some features are as follows.
    - **age : 17 to 80**
    - **hours_per_week : 7 to 70**
    - **if somebody's age is more than 60, he/she can't work more than 60 hours in a week**
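The three restrictions above can also be written as a single boolean mask, which selects the same rows as dropping them in three separate steps. The following is a minimal sketch on made-up rows; `toy`, `age_ok`, `hours_ok` and `senior_overwork` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical rows chosen so that each rule removes at least one of them
toy = pd.DataFrame({
    "age":            [16, 30, 85, 65, 65, 45],
    "hours_per_week": [20, 40, 10, 65, 40, 80],
})

age_ok = toy["age"].between(17, 80)               # inclusive on both ends
hours_ok = toy["hours_per_week"].between(7, 70)   # inclusive on both ends
senior_overwork = (toy["age"] > 60) & (toy["hours_per_week"] > 60)

# Keep only the rows that satisfy all three restrictions
kept = toy[age_ok & hours_ok & ~senior_overwork].reset_index(drop=True)

print(kept)
```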

### Dropping rows according to the domain knowledge

In [None]:
# Create a condition according to your domain knowledge on age stated above and
# sort the observations meeting this condition by age in descending order

df[(df.age > 80) | (df.age < 17)].sort_values(by = ["age"], ascending = False)

In [None]:
# Find the shape of the dataframe created by the condition defined above for age 
df[(df.age > 80) | (df.age < 17)].sort_values(by = ["age"], ascending = False).shape

In [None]:
# Assign the indices of the rows defined in accordance with condition above for age
age_17to80 = df[(df.age > 80) | (df.age < 17)].sort_values(by = ["age"], ascending = False)

age_17to80.index

In [None]:
# Drop these indices defined above for age
df.drop(age_17to80.index, inplace = True)

In [None]:
# Create a condition according to your domain knowledge on hours per week stated above and 
# sort the observations meeting this condition by hours per week in descending order
df[(df.hours_per_week < 7) | (df.hours_per_week > 70)].sort_values(by = ["hours_per_week"], ascending = False)

In [None]:
# Find the shape of the dataframe created by the condition defined above for hours per week 
hours_per_week_7to70 = df[(df.hours_per_week < 7) | (df.hours_per_week > 70)]

hours_per_week_7to70.shape

In [None]:
# Assign the indices of the rows defined in accordance with condition above for hours per week

hours_per_week_7to70.index

In [None]:
# Drop these indices defined above for hours per week

df.drop(hours_per_week_7to70.index, inplace = True)

In [None]:
# Create a condition according to your domain knowledge on both age and hours per week stated above 

domain_for_60 = df[(df.age > 60) & (df.hours_per_week > 60)]

domain_for_60

In [None]:
# Find the shape of the dataframe created by the condition defined above for both age and hours per week
domain_for_60.shape

In [None]:
# Assign the indices of the rows defined in accordance with condition above for both age and hours per week
domain_for_60.index

In [None]:
# Drop these indices defined above for both age and hours per week
df.drop(domain_for_60.index, inplace = True)

In [None]:
# What is new shape of dataframe now
df.shape

In [None]:
# Reset the indices and take the head of DataFrame now
df.reset_index(drop = True, inplace = True)
df.head()

## <p style="background-color:#47B699; font-family:newtimeroman; color:white; font-size:175%; text-align:center; border-radius:10px 10px;">Final Step to Make the Dataset Ready for ML Models</p>

<a id="9"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#47B699" data-toggle="popover">Content</a>

### 1. Convert all features to numeric

**Convert target feature (salary) to numeric (0 and 1) by using map function**

In [None]:
df.salary = df.salary.map({"<=50K": 0,">50K": 1})
df.salary

In [None]:
df.salary.value_counts()

**Convert all features to numeric by using get_dummies function**

In [None]:
pd.get_dummies(df,drop_first=True)

In [None]:
# What's the shape of dataframe
df.shape

In [None]:
# What's the shape of dataframe created by dummy operation
pd.get_dummies(df,drop_first=True).shape
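To see what `drop_first=True` does to the column count, it helps to compare both variants on a tiny frame. This is a minimal sketch; `toy` and its values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy frame with two categorical features and one numeric feature
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Female"],
    "race": ["White", "Black", "Other"],
    "age": [39, 50, 28],
})

# One indicator column per category
full = pd.get_dummies(toy)

# The first category of each feature is dropped and becomes the implicit baseline
reduced = pd.get_dummies(toy, drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Numeric columns such as "age" pass through unchanged, and each categorical feature with k categories contributes k - 1 columns when `drop_first=True`.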

### 2. Take a look at the correlation between features by using visualization

In [None]:
df_corr = pd.get_dummies(df,drop_first=True).corr()
df_corr

In [None]:
fig, ax = plt.subplots(figsize = (30, 18)) 

sns.heatmap(data = df_corr, annot = True, cmap="YlGnBu");

In [None]:
df_corr_salary = df_corr[["salary"]].sort_values(by = ["salary"],ascending = False)[1:]
df_corr_salary

In [None]:
fig, ax = plt.subplots(figsize = (6, 12)) 

sns.heatmap(data = df_corr_salary, annot = True, cmap="YlGnBu");

In [None]:
df_corr_salary.plot.barh();
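Sorting by the signed correlation puts strongly negative features at the bottom of the list, even though they can be as informative as strongly positive ones. The following is a minimal sketch that ranks features by the absolute value of their correlation with the target; `toy` and its columns are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

# Hypothetical numeric frame; "salary" is the binary target
toy = pd.DataFrame({
    "salary":        [0, 0, 0, 1, 1, 1],
    "age":           [22, 25, 30, 41, 45, 52],
    "never_married": [1, 1, 1, 0, 0, 0],
    "noise":         [3, 1, 2, 2, 3, 1],
})

# Correlation of every feature with the target, without the target itself
corr_with_target = toy.corr()["salary"].drop("salary")

# Rank by strength regardless of sign, but keep the signed values
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)

print(ranked)
```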

## <p style="background-color:#47B699; font-family:newtimeroman; color:white; font-size:150%; text-align:center; border-radius:10px 10px;">SUMMARY</p>
*   75.9% of people earn <=50K and 24.1% of people earn >50K.
*   The high-income group is older than the low-income group. This suggests that the older generation possesses more wealth than the young.
*   There is no significant difference between the high-income and low-income groups according to the "fnlwgt" feature.
*   The "capital_gain" feature has not provided very meaningful insights. Nevertheless, we can say that the higher the "capital_gain", the more likely a high income.
*   There is no significant difference between the high-income and low-income groups according to the "capital_loss" feature.
*   In the high-income group, the average working time is 45 hours per week. In the low-income group, it is 39 hours per week.
*   We can say that the higher the education level, the more likely a high income.
*   Married people have earned more income than unmarried people.
*   "Private" has the highest share among the work classes in the high-income group, and also in the low-income group.
*   "Exec-managerial" and "Prof-specialty" occupations have a high ratio (~50%) of high-level income, both within their own occupation group and within the high-income group.
*   About 25% of "White" and "Asian-Pac-Islander" individuals have a high income.
*   Males have earned more income than females.
*   When we compare "US" and "Others" according to salary, we see no notable difference between them.

## <p style="background-color:#47B699; font-family:newtimeroman; color:white; font-size:150%; text-align:center; border-radius:10px 10px;">RESULTS</p>

*   We can say that education, marriage, profession, age and weekly working hours have a direct effect on income.
*   For example, if you are a highly educated, 45-year-old married man, you probably have a high income in the US.