<a id="content"></a>
<p style="background-color:plum; color:floralwhite; font-size:175%; text-align:center; border-radius:10px 10px; font-family:newtimeroman; line-height: 1.4;">Content</p>

* [Introduction](#0)
* [About Dataset](#1)
* [Importing Related Libraries](#2)
* [Recognizing & Understanding Data](#3)
* [Univariate & Multivariate Analysis](#4)    
* [Other Specific Analysis Questions](#5)
* [Dropping Similar & Unnecessary Features](#6)
* [Handling Missing Values](#7)
* [Handling Outliers](#8)
* [Final Step to make ready dataset for ML Models](#9)
* [The End of the Project](#10)

**``Exploratory Data Analysis (EDA)``** is one of the most important parts of any data science experiment, yet it rarely gets the attention it deserves. In short, EDA is **``"A first look at the data"``**. It is a critical step in analyzing the data from an experiment: it is used to understand and summarize the content of the dataset, so that the features we feed to our machine learning algorithms are refined and the results are valid and correctly interpreted.
In general, scanning a column of numbers or a whole spreadsheet to determine the important characteristics of the data is tedious. It is also **good practice to understand the problem statement** and the data before you get your hands dirty, which in turn **helps to gain a lot of insights**. I will try to explain the concept using the Adult (Census Income) dataset available on the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Adult). The problem statement here is to predict, based on the census data, whether a person's income exceeds $50K a year.

**Aim of the Project**

Apply Exploratory Data Analysis (EDA) and prepare the data for machine learning algorithms:
1. Analyzing the characteristics of individuals according to income groups
2. Preparing data to create a model that will predict the income levels of people according to their characteristics (So the "salary" feature is the target feature)

<a id="1"></a>
<p style="background-color:plum; color:floralwhite; font-size:175%; text-align:center; border-radius:10px 10px; font-family:newtimeroman; line-height: 1.4;">About Dataset</p>


The Census Income dataset has 48,842 entries. Each entry contains the following information about an individual:

- **salary (target feature/label):** whether or not an individual makes more than $50,000 annually. (<= 50K, >50K)
- **age:** the age of an individual. (Integer greater than 0)
- **workclass:** a general term to represent the employment status of an individual. (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked)
- **fnlwgt:** the final weight, i.e. the number of people the census believes the entry represents. People with similar demographic characteristics should have similar weights. One important caveat: since the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, this statement only applies within a state. (Integer greater than 0)
- **education:** the highest level of education achieved by an individual. (Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.)
- **education-num:** the highest level of education achieved in numerical form. (Integer greater than 0)
- **marital-status:** marital status of an individual. Married-civ-spouse corresponds to a civilian spouse, while Married-AF-spouse is a spouse in the Armed Forces. Married-spouse-absent includes married people living apart because either the husband or the wife was employed and living at a considerable distance from home. (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse)
- **occupation:** the general type of occupation of an individual. (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces)
- **relationship:** represents what this individual is relative to others. For example, an individual could be a Husband. Each entry has only one relationship attribute. (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried)
- **race:** a description of an individual’s race. (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black)
- **sex:** the biological sex of the individual. (Male, Female)
- **capital-gain:** capital gains for an individual. (Integer greater than or equal to 0)
- **capital-loss:** capital loss for an individual. (Integer greater than or equal to 0)
- **hours-per-week:** the hours an individual has reported to work per week. (continuous)
- **native-country:** country of origin for an individual (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands)

## Reading the Data from File
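A minimal sketch of loading the data with pandas. The in-memory sample below is a hypothetical stand-in so that the cell runs on its own; in practice you would pass the path of your downloaded copy to `pd.read_csv` (the file name `adult.csv` used in the comment is an assumption, not something this notebook specifies).

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real file.
# With a local copy you would call, for example, pd.read_csv("adult.csv", ...).
sample_csv = io.StringIO(
    "age,workclass,education,salary\n"
    "39, State-gov, Bachelors, <=50K\n"
    "50, ?, HS-grad, >50K\n"
)

# skipinitialspace=True drops the blank that follows each comma,
# so values are read as "State-gov" rather than " State-gov".
df = pd.read_csv(sample_csv, skipinitialspace=True)
print(df)
```

Stripping the leading blanks at load time keeps later comparisons such as `df["salary"] == ">50K"` from silently failing.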

<a id="3"></a>
<p style="background-color:plum; color:floralwhite; font-size:175%; text-align:center; border-radius:10px 10px; font-family:newtimeroman; line-height: 1.4;">Recognizing and Understanding Data</p>


## Try to understand what the data looks like
- Check the head, shape, and data types of the features.
- Check whether there are duplicate rows. If there are, drop them.
- Check the statistical values of the features.
- Check the missing values (NaN, None).
- If needed, rename the columns for easier use.

### Check the head, shape, data-types of the features.
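A quick sketch of this first inspection, using a tiny made-up frame in place of the census data (the three rows are illustrative only):

```python
import pandas as pd

# Toy frame standing in for the census DataFrame.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
    "salary": ["<=50K", ">50K", "<=50K"],
})

print(df.head())   # first rows, to see what the values look like
print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of every column
```

`df.info()` gives a similar overview in a single call, including the non-null count per column.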

### Check whether there are duplicate rows. If there are, drop them.
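One possible way to do this with pandas, shown on a toy frame in which the third row repeats the first:

```python
import pandas as pd

# Toy frame: row 2 is an exact copy of row 0.
df = pd.DataFrame({
    "age": [39, 50, 39],
    "workclass": ["State-gov", "Private", "State-gov"],
})

# duplicated() flags every row that is identical to an earlier row.
n_duplicates = int(df.duplicated().sum())
print(f"duplicate rows: {n_duplicates}")

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
df = df.drop_duplicates().reset_index(drop=True)
print(df.shape)
```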

### Check the statistical values of features.
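A sketch of the summary statistics on a toy frame. Numeric and non-numeric columns are summarized separately, because `describe()` reports different statistics for each kind:

```python
import pandas as pd

# Toy frame with two numeric columns and one categorical column.
df = pd.DataFrame({
    "age": [20, 30, 40],
    "hours-per-week": [40, 40, 50],
    "workclass": ["Private", "Private", "State-gov"],
})

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = df.describe().T
print(numeric_summary)

# Non-numeric columns: count, number of unique values, most frequent value and its frequency.
categorical_summary = df.describe(exclude=["number"]).T
print(categorical_summary)
```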

### Check the missing values (NaN, None).
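A minimal sketch of counting missing values per column, again on a made-up frame:

```python
import numpy as np
import pandas as pd

# Toy frame with one NaN and one None.
df = pd.DataFrame({
    "age": [39, np.nan, 38],
    "workclass": ["Private", None, "Private"],
    "salary": ["<=50K", ">50K", "<=50K"],
})

# isnull() treats both NaN and None as missing.
missing_count = df.isnull().sum()
missing_pct = df.isnull().mean() * 100

print(pd.DataFrame({"count": missing_count, "percent": missing_pct.round(1)}))
```

Looking at the percentage next to the raw count helps to judge whether a column is worth imputing or should be dropped.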

### If needed, rename the columns for easier use.
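One possible convention, shown as an assumption rather than a requirement: replace the hyphens in the column names with underscores, so that every column can also be reached with attribute access (`df.education_num`).

```python
import pandas as pd

# Empty frame that only carries some of the original column names.
df = pd.DataFrame(columns=["age", "education-num", "marital-status", "native-country"])

# Strip stray whitespace and swap hyphens for underscores in every column name.
df = df.rename(columns=lambda name: name.strip().replace("-", "_"))
print(list(df.columns))
```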

## Examining the Data
- Look at the value counts of the columns that have the OBJECT data type
- Assign the columns (features) of object data type to **`"object_col"`**
- Detect strange values apart from the NaN values (`isin()`, `count()`, `sum()`, `any()`)

### Look at the value counts of the columns that have the OBJECT data type

### Assign the columns (features) of object data type to **``"object_col"``**

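```python
# Names of the columns stored with the object dtype
# (assumes `df` is the census DataFrame loaded earlier).
object_col = df.select_dtypes(include="object").columns

# Print the value counts of every object column, keeping NaN as its own category.
for col in object_col:
    print(col)
    print("--" * 20)
    print(df[col].value_counts(dropna=False))
    print("//" * 20)
```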

### Detect strange values apart from the NaN values (`isin()`, `count()`, `sum()`, `any()`)
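A sketch of this check on a toy frame. It assumes the placeholder to look for is `"?"`, which is how the Adult dataset marks unknown entries; verify this against the value counts of your own copy before replacing anything.

```python
import numpy as np
import pandas as pd

# Toy frame in which "?" plays the role of a hidden missing value.
df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["Private", "?", "State-gov"],
    "occupation": ["Sales", "?", "?"],
})

# Restrict the search to the non-numeric columns.
text_cols = df.select_dtypes(exclude="number").columns

# Count the placeholder per column and check whether it occurs at all.
placeholder_counts = df[text_cols].isin(["?"]).sum()
has_placeholder = bool(df[text_cols].isin(["?"]).any().any())
print(placeholder_counts)

# Turn the placeholder into a real NaN so that isnull() can see it.
df = df.replace("?", np.nan)
print(df.isnull().sum())
```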

<a id="4"></a>
<p style="background-color:plum; color:floralwhite; font-size:175%; text-align:center; border-radius:10px 10px; font-family:newtimeroman; line-height: 1.4;">Univariate & Multivariate Analysis</p>


Examine all features, each one separately and from different aspects with respect to the target feature:
- **Target Feature**
- **Numeric Ones**
- **Categoric Ones**

**To-do list for numeric features:**
1. Check the boxplot to see extreme values
2. Check the histplot/kdeplot to see the distribution of the feature
3. Check the statistical values
4. Check the boxplot and histplot/kdeplot by target feature
5. Check the statistical values by target feature
6. Write down the conclusions you draw from your analysis

**To-do list for categoric features:**
1. Find the features that contain similar values, examine the similarities, and analyze them together
2. Check the count/percentage in each category and visualize it with a suitable plot
3. If needed, reduce the number of categories by combining similar categories
4. Check the count/percentage in each target class by category and visualize it with a suitable plot
5. Check the percentage distribution in each target class by category and visualize it with a suitable plot
6. Check the count in each category by target class and visualize it with a suitable plot
7. Check the percentage distribution in each category by target class and visualize it with a suitable plot
8. Write down the conclusions you draw from your analysis

**Note:** Detailed **instructions/directions** for each feature are also given under the corresponding feature.

## Target Feature
- Salary
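
A small sketch of a first look at the target feature, using made-up labels. The counts and shares show how balanced the two income groups are:

```python
import pandas as pd

# Toy target column: three low-income rows and one high-income row.
salary = pd.Series(["<=50K", "<=50K", "<=50K", ">50K"], name="salary")

# Absolute count of each class.
counts = salary.value_counts()

# Share of each class in percent.
share = salary.value_counts(normalize=True).mul(100).round(1)

print(counts)
print(share)
```

A strongly uneven split is worth noting early, since it affects how model accuracy should be read later.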

## Numeric Features
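The statistical steps of the numeric to-do list (overall values and values by target feature) can be sketched as follows. The `age` values are made up, and the plots are left out so that the cell depends on pandas only.

```python
import pandas as pd

# Toy frame: one numeric feature and the target.
df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 90],
    "salary": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"],
})

# Statistical values of the feature as a whole.
overall_stats = df["age"].describe()

# The same statistics computed separately for each income group.
stats_by_target = df.groupby("salary")["age"].describe()

print(overall_stats)
print(stats_by_target)
```

For the visual steps, seaborn's `boxplot`, `histplot` and `kdeplot` accept the same column names through their `data`, `x`, `y` and `hue` parameters.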

## Categorical Features
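The count and percentage steps of the categoric to-do list can be sketched with `pd.crosstab`. The example uses `sex` with made-up rows; the same pattern applies to any other categorical feature.

```python
import pandas as pd

# Toy frame: one categorical feature and the target.
df = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Male"],
    "salary": ["<=50K", ">50K", "<=50K", "<=50K", ">50K"],
})

# Count of each category within each income group.
counts = pd.crosstab(df["sex"], df["salary"])

# Percentage distribution of income groups inside each category (rows sum to 100).
pct_within_category = pd.crosstab(df["sex"], df["salary"], normalize="index").mul(100)

# Percentage distribution of categories inside each income group (columns sum to 100).
pct_within_target = pd.crosstab(df["sex"], df["salary"], normalize="columns").mul(100)

print(counts)
print(pct_within_category.round(1))
print(pct_within_target.round(1))
```

The two normalized tables answer different questions: the first shows how likely a category is to be in the high-income group, the second shows who makes up each income group.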

<a id="5"></a>
<p style="background-color:plum; color:floralwhite; font-size:175%; text-align:center; border-radius:10px 10px; font-family:newtimeroman; line-height: 1.4;">Other Specific Analysis Questions</p>



## Analysis Questions

### What is the average age of males and females by income level?
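One way to answer this is a two-level `groupby`. The rows below are made up for illustration:

```python
import pandas as pd

# Toy frame with the three columns the question needs.
df = pd.DataFrame({
    "age": [30, 50, 40, 60, 44, 46],
    "sex": ["Male", "Male", "Female", "Male", "Female", "Female"],
    "salary": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"],
})

# Mean age for every (income level, sex) combination.
avg_age = df.groupby(["salary", "sex"])["age"].mean()
print(avg_age)

# unstack() turns the result into a small income-by-sex table.
print(avg_age.unstack())
```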

### What are the workclass percentages of Americans in the high-income group?
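A possible approach, assuming that "Americans" means rows whose `native-country` is `United-States` and that the high-income group is `salary == ">50K"`. Column names follow the dataset description above; adjust them if you renamed the columns. The rows are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "native-country": ["United-States", "United-States", "United-States",
                       "United-States", "Mexico", "United-States"],
    "salary": [">50K", ">50K", ">50K", "<=50K", ">50K", ">50K"],
    "workclass": ["Private", "Private", "Self-emp-inc",
                  "Private", "Private", "Local-gov"],
})

# Keep only the high-income rows from the United States.
high_income_us = df[(df["native-country"] == "United-States") & (df["salary"] == ">50K")]

# Share of each workclass inside that subgroup, in percent.
workclass_pct = high_income_us["workclass"].value_counts(normalize=True).mul(100)
print(workclass_pct.round(1))
```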

### What are the occupation percentages of Americans in the high-income group who work in the "Private" workclass?
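This question adds one more condition to the filter. The sketch below uses made-up rows and the column names from the dataset description; `native-country` is wrapped in backticks inside `query()` because of the hyphen.

```python
import pandas as pd

df = pd.DataFrame({
    "native-country": ["United-States"] * 5 + ["India"],
    "salary": [">50K", ">50K", ">50K", ">50K", "<=50K", ">50K"],
    "workclass": ["Private", "Private", "Private", "State-gov", "Private", "Private"],
    "occupation": ["Sales", "Exec-managerial", "Sales",
                   "Prof-specialty", "Sales", "Tech-support"],
})

# High-income rows from the United States whose workclass is "Private".
subset = df.query(
    "`native-country` == 'United-States' and salary == '>50K' and workclass == 'Private'"
)

# Share of each occupation inside that subgroup, in percent.
occupation_pct = subset["occupation"].value_counts(normalize=True).mul(100)
print(occupation_pct.round(1))
```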

<a id="6"></a>
<p style="background-color:plum; color:floralwhite; font-size:175%; text-align:center; border-radius:10px 10px; font-family:newtimeroman; line-height: 1.4;">Dropping Similar & Unnecessary Features</p>
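
As an illustration, `education` and `education-num` describe the same information in text and numeric form. A hedged sketch of how to confirm that two such columns map one-to-one before dropping one of them (toy rows; which column to keep is a judgment call, and here the numeric one is kept):

```python
import pandas as pd

df = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors", "Masters"],
    "education-num": [13, 9, 13, 14],
    "age": [39, 50, 38, 53],
})

# Every education level should map to exactly one code, and every code to one level.
codes_per_level = df.groupby("education")["education-num"].nunique()
levels_per_code = df.groupby("education-num")["education"].nunique()
is_one_to_one = bool((codes_per_level == 1).all() and (levels_per_code == 1).all())
print(f"one-to-one mapping: {is_one_to_one}")

# Drop the redundant text column only if the mapping really is one-to-one.
if is_one_to_one:
    df = df.drop(columns="education")
print(list(df.columns))
```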

<a href="#content" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

<a id="7"></a>
<p style="background-color:plum; color:floralwhite; font-size:175%; text-align:center; border-radius:10px 10px; font-family:newtimeroman; line-height: 1.4;">Handling Missing Values</p>
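
Three common options for categorical columns, sketched on a toy frame. Which one fits depends on how many values are missing and on the model you plan to use, so treat these as alternatives rather than as a recommended recipe.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "workclass": ["Private", np.nan, "Private", "State-gov"],
    "occupation": ["Sales", np.nan, "Sales", "Adm-clerical"],
})

# Option 1: keep the rows and mark the gap with an explicit "Unknown" category.
df_unknown = df.fillna("Unknown")

# Option 2: impute each column with its most frequent value (mode).
df_mode = df.fillna(df.mode().iloc[0])

# Option 3: drop every row that has at least one missing value.
df_dropped = df.dropna()

print(df_unknown, df_mode, df_dropped, sep="\n\n")
```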

<a href="#content" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

<a id="8"></a>
<p style="background-color:plum; color:floralwhite; font-size:175%; text-align:center; border-radius:10px 10px; font-family:newtimeroman; line-height: 1.4;">Handling Outliers</p>

<a href="#content" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

## Analyzing all Features and Detecting Extreme Values
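A minimal sketch of one common way to flag extreme values, the 1.5 × IQR rule that also defines the whiskers of a standard boxplot. The rule and the multiplier are assumptions made for illustration; this notebook does not prescribe a specific method, and the `age` values are made up.

```python
import pandas as pd

age = pd.Series([25, 30, 35, 40, 45, 90], name="age")

# Quartiles and interquartile range.
q1 = age.quantile(0.25)
q3 = age.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as extreme.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = age[(age < lower) | (age > upper)]

print(f"limits: [{lower}, {upper}]")
print(outliers)
```

Flagged values are candidates for a closer look, not automatic deletions: a 90-year-old respondent is unusual but still plausible.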