___

## <p style="background-color:#FDFEFE; font-family:Arial; color:#060108; font-size:200%; text-align:center; border-radius:10px 10px;">Data Analysis & Visualization with Python</p>

## <p style="background-color:#FDFEFE; font-family:Arial; color:#060108; font-size:200%; text-align:center; border-radius:10px 10px;">Project Solution</p>

![image.png](https://i.ibb.co/mT1GG7j/US-citizen.jpg)

## <p style="background-color:#FDFEFE; font-family:Arial; color:#060108; font-size:200%; text-align:center; border-radius:10px 10px;">Analysis of US Citizens by Income Levels</p>

<a id="toc"></a>

## <p style="background-color:#47AC34; font-family:Georgia; color:#FFFFFF; font-size:175%; text-align:center; border-radius:10px 10px;">Content</p>

* [Introduction](#0)
* [Dataset Info](#1)
* [Importing Related Libraries](#2)
* [Recognizing & Understanding Data](#3)
* [Univariate & Multivariate Analysis](#4)    
* [Other Specific Analysis Questions](#5)
* [Dropping Similar & Unnecessary Features](#6)
* [Handling Missing Values](#7)
* [Handling Outliers](#8)
* [Final Step to Make the Dataset Ready for ML Models](#9)
* [The End of the Project](#10)

## <p style="background-color:#47AC34; font-family:Georgia; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Introduction</p>

<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:#FFF9ED; background-color:#47AC34" data-toggle="popover">Content</a>

**``Exploratory Data Analysis (EDA)``** is one of the most important parts of any data science project, yet it rarely gets the attention it deserves. In short, EDA is **``"a first look at the data"``**. It is a critical step in analyzing data: it is used to understand and summarize the content of the dataset, so that the features fed to machine learning algorithms are refined and the results are valid and correctly interpreted.
Looking at a column of numbers or a whole spreadsheet and trying to pick out the important characteristics of the data can be tedious. It is also good practice to understand the problem statement and the data before getting your hands dirty, because doing so yields many insights early on. This project illustrates the process using the Adult (Census Income) dataset available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Adult). The problem statement is to predict, from census data, whether a person's income exceeds $50K a year.

# Aim of the Project

Apply Exploratory Data Analysis (EDA) and prepare the data for Machine Learning algorithms:
1. Analyzing the characteristics of individuals according to income groups
2. Preparing data to create a model that will predict the income levels of people according to their characteristics (So the "salary" feature is the target feature)

## <p style="background-color:#47AC34; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Dataset Info</p>

<a id="1"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:#20b2aa; background-color:#47AC34" data-toggle="popover">Content</a>

The full Census Income dataset has 48,842 entries; the adult.csv file loaded in this notebook contains 32,561 rows. Each entry contains the following information about an individual:

- **salary (target feature/label):** whether or not an individual makes more than $50,000 annually. (<=50K, >50K)
- **age:** the age of an individual. (Integer greater than 0)
- **workclass:** a general term to represent the employment status of an individual. (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked)
- **fnlwgt:** the number of people the census believes the entry represents. People with similar demographic characteristics should have similar weights. One important caveat: since the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, this statement only applies within a state. (Integer greater than 0)
- **education:** the highest level of education achieved by an individual. (Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.)
- **education-num:** the highest level of education achieved in numerical form. (Integer greater than 0)
- **marital-status:** marital status of an individual. Married-civ-spouse corresponds to a civilian spouse, while Married-AF-spouse is a spouse in the Armed Forces. Married-spouse-absent includes married people living apart because either the husband or wife was employed and living at a considerable distance from home. (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse)
- **occupation:** the general type of occupation of an individual. (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces)
- **relationship:** represents what this individual is relative to others. For example, an individual could be a Husband. Each entry has only one relationship attribute. (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried)
- **race:** description of an individual’s race. (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black)
- **sex:** the biological sex of the individual. (Male, Female)
- **capital-gain:** capital gains for an individual. (Integer greater than or equal to 0)
- **capital-loss:** capital loss for an individual. (Integer greater than or equal to 0)
- **hours-per-week:** the hours an individual has reported to work per week. (continuous)
- **native-country:** country of origin for an individual (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands)

## <p style="background-color:#47AC34; font-family:Georgia; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">How to Install/Enable IntelliSense or Autocomplete in Jupyter Notebook</p>

### Installing [jupyter_contrib_nbextensions](https://jupyter-contrib-nbextensions.readthedocs.io/en/latest/install.html)

**To install the current version from The Python Package Index (PyPI), which is a repository of software for the Python programming language, simply type:**

!pip install jupyter_contrib_nbextensions

**Alternatively, you can install directly from the current master branch of the repository:**

!pip install https://github.com/ipython-contrib/jupyter_contrib_nbextensions/tarball/master

### Enabling [Intellisense or Autocomplete in Jupyter Notebook](https://botbark.com/2019/12/18/how-to-enable-intellisense-or-autocomplete-in-jupyter-notebook/)


### Installing Hinterland for Jupyter without Anaconda

**``STEP 1:``** ``Open cmd prompt and run the following commands``
 1) pip install jupyter_contrib_nbextensions<br>
 2) pip install jupyter_nbextensions_configurator<br>
 3) jupyter contrib nbextension install --user<br> 
 4) jupyter nbextensions_configurator enable --user<br>

**``STEP 2:``** ``Open Jupyter Notebook``
 - Click on the Nbextensions tab<br>
 - Uncheck "disable configuration for nbextensions without explicit compatibility"<br>
 - Put a check on Hinterland<br>

**``STEP 3:``** ``Open a new Python file and check the autocomplete feature``

[VIDEO SOURCE](https://www.youtube.com/watch?v=DKE8hED0fow)

![Image_Assignment](https://i.ibb.co/RbmDmD6/E8-EED4-F3-B3-F4-4571-B6-A0-1-B3224-AAB060-4-5005-c.jpg)

## <p style="background-color:#47AC34; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Importing Related Libraries</p>

<a id="2"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#47AC34" data-toggle="popover">Content</a>

Once NumPy and pandas are installed, you can import them together with the plotting libraries:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")
warnings.warn("this will not show")

plt.rcParams["figure.figsize"] = (10, 6)

sns.set_style("whitegrid")
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Set it to None to display all rows in the dataframe
# pd.set_option('display.max_rows', None)

# Set it to None to display all columns in the dataframe
pd.set_option('display.max_columns', None)

### <p style="background-color:#47AC34; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:left; border-radius:10px 10px;">Reading the data from file</p>

In [None]:
df = pd.read_csv("adult.csv")

In [None]:
df.head(1)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K


In [None]:
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'salary'],
      dtype='object')

## <p style="background-color:#47AC34; font-family:Georgia; color:#FFF9ED; font-size:150%; text-align:center; border-radius:10px 10px;">Recognizing and Understanding Data</p>

<a id="3"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#47AC34" data-toggle="popover">Content</a>

### 1. Try to understand what the data looks like
- Check the head, shape, and data types of the features.
- Check whether there are duplicate rows. If there are, drop them.
- Check the statistical values of the features.
- If needed, rename the columns for easier use.
- Check the missing values.

In [None]:
# Your Code is Here
df.head()


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


Desired Output:

![image.png](https://i.ibb.co/qFn8RZs/US-Citicens1.png)

In [None]:
# Your Code is Here
df.shape


(32561, 15)

In [None]:
# Your Code is Here
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             32561 non-null  int64  
 1   workclass       32561 non-null  object 
 2   fnlwgt          32561 non-null  int64  
 3   education       32561 non-null  object 
 4   education-num   31759 non-null  float64
 5   marital-status  32561 non-null  object 
 6   occupation      32561 non-null  object 
 7   relationship    27493 non-null  object 
 8   race            32561 non-null  object 
 9   sex             32561 non-null  object 
 10  capital-gain    32561 non-null  int64  
 11  capital-loss    32561 non-null  int64  
 12  hours-per-week  32561 non-null  int64  
 13  native-country  32561 non-null  object 
 14  salary          32561 non-null  object 
dtypes: float64(1), int64(5), object(9)
memory usage: 3.7+ MB


In [None]:
# Check whether the dataset has any duplicate rows

df.duplicated().sum()


24

In [None]:
# Drop Duplicates

df.drop_duplicates(inplace=True)
df




Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13.000,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13.000,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9.000,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7.000,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13.000,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12.000,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9.000,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9.000,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9.000,Never-married,Adm-clerical,,White,Male,0,0,20,United-States,<=50K


In [None]:
# Check the shape of the Dataset

df.shape



(32537, 15)

In [None]:
df.describe().T



Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,32537.0,38.586,13.638,17.0,28.0,37.0,48.0,90.0
fnlwgt,32537.0,189780.849,105556.471,12285.0,117827.0,178356.0,236993.0,1484705.0
education-num,31735.0,10.084,2.575,1.0,9.0,10.0,12.0,16.0
capital-gain,32537.0,1078.444,7387.957,0.0,0.0,0.0,0.0,99999.0
capital-loss,32537.0,87.368,403.102,0.0,0.0,0.0,0.0,4356.0
hours-per-week,32537.0,40.44,12.347,1.0,40.0,40.0,45.0,99.0


Desired Output:

![image.png](https://i.ibb.co/HnG6Xdn/US-Citicens2.png)

**Rename the features**<br>
**``"education-num"``**, **``"marital-status"``**, **``"capital-gain"``**, **``"capital-loss"``**, **``"hours-per-week"``**, **``"native-country"``** **as**<br>
**``"education_num"``**, **``"marital_status"``**, **``"capital_gain"``**, **``"capital_loss"``**, **``"hours_per_week"``**, **``"native_country"``**, **respectively and permanently.**

In [None]:
df.columns.str.replace("-", "_")



Index(['age', 'workclass', 'fnlwgt', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
       'salary'],
      dtype='object')

In [None]:
df.columns = df.columns.str.replace("-", "_")
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,salary
0,39,State-gov,77516,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [None]:
# Check the sum of Missing Values per column

df.isna().sum()



age                  0
workclass            0
fnlwgt               0
education            0
education_num      802
marital_status       0
occupation           0
relationship      5064
race                 0
sex                  0
capital_gain         0
capital_loss         0
hours_per_week       0
native_country       0
salary               0
dtype: int64

In [None]:
# Check the Percentage of Missing Values

df.isna().sum() * 100 / len(df)



age               0.000
workclass         0.000
fnlwgt            0.000
education         0.000
education_num     2.465
marital_status    0.000
occupation        0.000
relationship     15.564
race              0.000
sex               0.000
capital_gain      0.000
capital_loss      0.000
hours_per_week    0.000
native_country    0.000
salary            0.000
dtype: float64

### 2. Look at the value counts of the columns with object data type and detect strange values other than NaN

In [None]:
df.columns



Index(['age', 'workclass', 'fnlwgt', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
       'salary'],
      dtype='object')

In [None]:
df.describe(include="object").T



Unnamed: 0,count,unique,top,freq
workclass,32537,9,Private,22673
education,32537,16,HS-grad,10494
marital_status,32537,7,Married-civ-spouse,14970
occupation,32537,15,Prof-specialty,4136
relationship,27473,5,Husband,13187
race,32537,5,White,27795
sex,32537,2,Male,21775
native_country,32537,42,United-States,29153
salary,32537,2,<=50K,24698


Desired Output:

![image.png](https://i.ibb.co/WspBGfZ/US-Citicens3.png)

**Assign the columns (features) with object data type to** **``"object_col"``**

In [None]:
object_col = df.loc[:, df.dtypes == object].columns
object_col



Index(['workclass', 'education', 'marital_status', 'occupation',
       'relationship', 'race', 'sex', 'native_country', 'salary'],
      dtype='object')

In [None]:
for col in object_col:
    print(col)
    print("--"*8)
    print(df[col].value_counts(dropna=False))
    print("--"*20)

workclass
----------------
Private             22673
Self-emp-not-inc     2540
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64
----------------------------------------
education
----------------
HS-grad         10494
Some-college     7282
Bachelors        5353
Masters          1722
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           645
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           332
1st-4th           166
Preschool          50
Name: education, dtype: int64
----------------------------------------
marital_status
----------------
Married-civ-spouse       14970
Never-married            10667
Divorced                  4441
Separated                 1025
Widowed                    993
Married-spouse-absent      418


**Check if the Dataset has any Question Mark** **``"?"``**

In [None]:
# Your Code is Here
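
One possible way to look for question marks is sketched below. It runs on a small made-up frame (``toy_df`` and its values are assumptions for illustration only); in the notebook the same calls would be applied to ``df``.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; the rows are made up for illustration
toy_df = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "?"],
    "occupation": ["Sales", "?", "Adm-clerical", "Craft-repair"],
    "age": [39, 50, 38, 53],
})

# Mark every cell that is exactly "?" and count the matches per column
question_marks = toy_df.isin(["?"]).sum()
print(question_marks)

# Keep only the rows that contain at least one "?"
rows_with_question_mark = toy_df[toy_df.isin(["?"]).any(axis=1)]
print(rows_with_question_mark)
```

Columns with a non-zero count are the ones that need cleaning later on; in the value counts printed above, ``workclass`` is one of them.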



## <p style="background-color:#47AC34; font-family:georgia; color:white; font-size:175%; text-align:center; border-radius:10px 10px;">Univariate & Multivariate Analysis</p>

<a id="4"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#47AC34" data-toggle="popover">Content</a>

Examine all features (first the target feature ("salary"), then the numeric ones, and lastly the categorical ones) separately and from different aspects with respect to the target feature.

**To-do list for numeric features:**
1. Check the boxplot to see extreme values
2. Check the histplot/kdeplot to see the distribution of the feature
3. Check the statistical values
4. Check the boxplot and histplot/kdeplot by "salary" level
5. Check the statistical values by "salary" level
6. Write down the conclusions you draw from your analysis

**To-do list for categorical features:**
1. Find the features that contain similar values, examine the similarities, and analyze them together
2. Check the count/percentage of people in each category and visualize it with a suitable plot
3. If needed, reduce the number of categories by combining similar ones
4. Check the count of people at each "salary" level by category and visualize it with a suitable plot
5. Check the percentage distribution of people at each "salary" level by category and visualize it with a suitable plot
6. Check the count of people in each category by "salary" level and visualize it with a suitable plot
7. Check the percentage distribution of people in each category by "salary" level and visualize it with a suitable plot
8. Write down the conclusions you draw from your analysis

**Note:** Detailed instructions for each feature are also given under the corresponding feature.

## Salary (Target Feature)

**Check the count of people at each "salary" level and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here
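
A minimal sketch of the two steps above, assuming a made-up ``toy_df`` in place of the notebook's ``df``: ``value_counts()`` gives the counts and ``sns.countplot`` draws them.

```python
import pandas as pd
import seaborn as sns

# Toy stand-in for the notebook's `df`; the rows are made up for illustration
toy_df = pd.DataFrame(
    {"salary": ["<=50K", "<=50K", ">50K", "<=50K", ">50K", "<=50K"]}
)

# Count how many people fall into each salary level
salary_counts = toy_df["salary"].value_counts()
print(salary_counts)

# Draw one bar per salary level
ax = sns.countplot(data=toy_df, x="salary")
ax.set_title("Number of people per salary level")
```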



Desired Output:

![image.png](https://i.ibb.co/9qwrtB1/US-Citicens4.png)

**Check the percentage of people at each "salary" level and visualize it with a pie plot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here
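
The percentages can be obtained with ``value_counts(normalize=True)``. The sketch below uses a made-up ``toy_df`` (an assumption, not the notebook's data) and plain matplotlib for the pie chart.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the notebook's `df`; the rows are made up for illustration
toy_df = pd.DataFrame({"salary": ["<=50K", "<=50K", "<=50K", ">50K"]})

# Share of each salary level, expressed in percent
salary_pct = toy_df["salary"].value_counts(normalize=True) * 100
print(salary_pct)

# One wedge per salary level, labelled with its percentage
fig, ax = plt.subplots()
ax.pie(salary_pct.values, labels=list(salary_pct.index), autopct="%.1f%%")
ax.set_title("Share of people per salary level")
```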



Desired Output:

![image.png](https://i.ibb.co/8YFvBrq/US-Citices5.png)

**Write down the conclusions you draw from your analysis**

**Result :** .................

## Numeric Features

## age

**Check the boxplot to see extreme values**

In [None]:
# Your Code is Here
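
A boxplot marks as extreme every point that lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below draws the plot and also computes those limits by hand; ``toy_df`` and its ages are made up for illustration.

```python
import pandas as pd
import seaborn as sns

# Toy stand-in for the notebook's `df`; the ages are made up for illustration
toy_df = pd.DataFrame({"age": [17, 25, 28, 33, 37, 41, 48, 52, 90]})

# Horizontal boxplot of the feature
ax = sns.boxplot(data=toy_df, x="age")

# Reproduce the whisker limits with the 1.5 * IQR rule
q1, q3 = toy_df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Rows that the boxplot would draw as individual points
extreme_rows = toy_df[(toy_df["age"] < lower_limit) | (toy_df["age"] > upper_limit)]
print(extreme_rows)
```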



Desired Output:

![image.png](https://i.ibb.co/JKKwy5K/US-Citizens6.png)

**Check the histplot/kdeplot to see distribution of feature**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/JcJ9cyp/US-Citizens7.png)

**Check the statistical values**

In [None]:
# Your Code is Here



**Check the boxplot and histplot/kdeplot by "salary" levels**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/64tBVNT/US-Citizens8.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/q5P0sVf/US-Citizens9.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/7Y2HkxB/US-Citizens10.png)

**Check the statistical values by "salary" levels**

In [None]:
# Your Code is Here
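
Grouping by the target and calling ``describe()`` returns the statistics for each salary level in one table. The sketch below assumes a made-up ``toy_df``; the same pattern applies to any numeric feature.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; the rows are made up for illustration
toy_df = pd.DataFrame({
    "age": [25, 30, 35, 45, 50, 55],
    "salary": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"],
})

# One row of summary statistics per salary level
age_by_salary = toy_df.groupby("salary")["age"].describe()
print(age_by_salary)
```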



Desired Output:

![image.png](https://i.ibb.co/xYYZcZZ/US-Citizens11.png)

**Write down the conclusions you draw from your analysis**

**Result :** ................

## fnlwgt

**Check the boxplot to see extreme values**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/x2TtkzH/US-Citizens12.png)

**Check the histplot/kdeplot to see distribution of feature**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/ZmMV8nv/US-Citizens13.png)

**Check the statistical values**

In [None]:
# Your Code is Here



**Check the boxplot and histplot/kdeplot by "salary" levels**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/ZxJS7JW/US-Citizens14.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/TgygLrz/US-Citizens15.png)

**Check the statistical values by "salary" levels**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/LzWqdBf/US-Citizens16.png)

**Write down the conclusions you draw from your analysis**

**Result :** ...............

## capital_gain

**Check the boxplot to see extreme values**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/6Xj1TCz/US-Citizens17.png)

**Check the histplot/kdeplot to see distribution of feature**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/X3nW72Q/US-Citizens18.png)

**Check the statistical values**

In [None]:
# Your Code is Here



**Check the boxplot and histplot/kdeplot by "salary" levels**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/CM3cTgt/19.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/h7DKvLY/20.png)

**Check the statistical values by "salary" levels**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/mzYxTD4/21.png)

**Check the statistical values by "salary" level for rows where capital_gain is not zero**

In [None]:
# Your Code is Here
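
Most rows have a capital gain of zero (the 75th percentile in ``df.describe()`` is 0), so the statistics are easier to read after filtering those rows out. A sketch on made-up data:

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; the rows are made up for illustration
toy_df = pd.DataFrame({
    "capital_gain": [0, 0, 2174, 0, 5000, 15000, 0, 7000],
    "salary": ["<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K", ">50K"],
})

# Share of rows with no capital gain at all
share_zero = (toy_df["capital_gain"] == 0).mean()
print(share_zero)

# Keep only non-zero gains, then summarize per salary level
nonzero_gain = toy_df[toy_df["capital_gain"] != 0]
gain_stats = nonzero_gain.groupby("salary")["capital_gain"].describe()
print(gain_stats)
```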



Desired Output:

![image.png](https://i.ibb.co/r3mdBkK/22.png)

**Write down the conclusions you draw from your analysis**

**Result :** ...........................

## capital_loss

**Check the boxplot to see extreme values**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/Db3XHKz/23.png)

**Check the histplot/kdeplot to see distribution of feature**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/z7P15zX/24.png)

**Check the statistical values**

In [None]:
# Your Code is Here



**Check the boxplot and histplot/kdeplot by "salary" levels**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/Dr7Bv9V/25.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/4Vg5Zyy/26.png)

**Check the statistical values by "salary" levels**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/h9DTKNW/27.png)

**Check the statistical values by "salary" level for rows where capital_loss is not zero**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/gJzQvmD/28.png)

**Write down the conclusions you draw from your analysis**

**Result :** ..................

## hours_per_week

**Check the boxplot to see extreme values**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/TkNCRYY/29.png)

**Check the histplot/kdeplot to see distribution of feature**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/tsp5GXb/30.png)

**Check the statistical values**

In [None]:
# Your Code is Here



**Check the boxplot and histplot/kdeplot by "salary" levels**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/4RhSct7/31.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/pbbVnMG/32.png)

**Check the statistical values by "salary" levels**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/6NbWfzz/33.png)

**Write down the conclusions you draw from your analysis**

**Result :** .....................

### See the relationships between the numeric features by the target feature (salary) in a single plot

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/N7Fz4hg/34.png)

## Categorical Features

## education & education_num

**Detect the similarities between these features by comparing unique values**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



In [None]:
# Your Code is Here
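
One way to compare the two features is to count how many distinct ``education_num`` values occur for each ``education`` label. If every label has exactly one number, the two columns carry the same information. The sketch uses made-up rows, including one missing number.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; the rows are made up for illustration
toy_df = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Bachelors", "11th", "HS-grad", "Masters"],
    "education_num": [13.0, 9.0, None, 7.0, 9.0, 14.0],
})

# Number of distinct education_num values per education label (NaN is ignored)
nums_per_label = toy_df.groupby("education")["education_num"].nunique()
print(nums_per_label)

# Label -> number lookup built from the rows where the number is present
label_to_num = (
    toy_df.dropna(subset=["education_num"])
    .drop_duplicates(["education", "education_num"])
    .set_index("education")["education_num"]
)
print(label_to_num)
```

A lookup like ``label_to_num`` could also be used to fill the missing ``education_num`` values from ``education``, if that is the route taken later.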



**Visualize the count of people in each category for these features (education, education_num) separately**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/5xc31HR/35.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/6HWtNN6/36.png)

**Check the count of people at each "salary" level by these features (education and education_num) separately and visualize them with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/qxZXX1y/37.png)

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/2M0BYyk/38.png)

**Visualize the boxplot of "education_num" feature by "salary" levels**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/mSBNzKw/39.png)

**Reduce the number of categories in the "education" feature to low, medium, and high levels, and create a new feature from this new categorical data.**

In [None]:
def mapping_education(x):
    """Map a detailed education label to a low/medium/high summary level."""
    if x in ["Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th"]:
        return "low_level_grade"
    elif x in ["HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm"]:
        return "medium_level_grade"
    elif x in ["Bachelors", "Masters", "Prof-school", "Doctorate"]:
        return "high_level_grade"
    # Any other value (e.g. NaN) falls through and the function returns None

In [None]:
# Your Code is Here



In [None]:
# Using the mapping_education function above, create a new column named "education_summary"

# Your Code is Here
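
A minimal sketch of this step: ``Series.apply`` calls the function once per value. The function is repeated here only so the example runs on its own, and ``toy_df`` is made up for illustration.

```python
import pandas as pd


def mapping_education(x):
    # Same mapping as defined in the notebook cell above
    if x in ["Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th"]:
        return "low_level_grade"
    elif x in ["HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm"]:
        return "medium_level_grade"
    elif x in ["Bachelors", "Masters", "Prof-school", "Doctorate"]:
        return "high_level_grade"


# Toy stand-in for the notebook's `df`; the rows are made up for illustration
toy_df = pd.DataFrame(
    {"education": ["Preschool", "HS-grad", "Bachelors", "Doctorate", "11th"]}
)

# Apply the mapping value by value and store the result in a new column
toy_df["education_summary"] = toy_df["education"].apply(mapping_education)
print(toy_df)
```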



**Visualize the count of people in each category for the new education levels (high, medium, low)**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/cx3Dzn1/40.png)

**Check the count of people at each "salary" level by the new education levels (high, medium, low) and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/tXk04LJ/41.png)

**Check the percentage distribution of people at each "salary" level by each new education level (high, medium, low) and visualize it with separate pie plots**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here
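
One possible approach is ``pd.crosstab`` with ``normalize="index"``, which turns each row (education level) into shares that sum to 1, followed by one pie per row. The data below is made up and covers only two levels to keep the sketch short.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the notebook's `df`; the rows are made up for illustration
toy_df = pd.DataFrame({
    "education_summary": ["low_level_grade"] * 4 + ["high_level_grade"] * 4,
    "salary": ["<=50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K", ">50K", ">50K"],
})

# Percentage of each salary level within every education level
salary_pct_by_edu = (
    pd.crosstab(toy_df["education_summary"], toy_df["salary"], normalize="index") * 100
)
print(salary_pct_by_edu)

# One pie chart per education level
fig, axes = plt.subplots(1, len(salary_pct_by_edu), figsize=(10, 4))
for ax, (level, row) in zip(axes, salary_pct_by_edu.iterrows()):
    ax.pie(row.values, labels=list(row.index), autopct="%.1f%%")
    ax.set_title(level)
```

Swapping the two arguments of ``pd.crosstab`` gives the reverse view, the share of each education level within every salary level.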



Desired Output:

![image.png](https://i.ibb.co/9W6kXc6/42.png)

**Check the count of people at each new education level (high, medium, low) by "salary" level and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/K9xLxvF/43.png)

**Check the percentage distribution of people at each new education level (high, medium, low) by "salary" level and visualize it with separate pie plots**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/42pnNPc/44.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/jHYrhz8/45.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/5BnYV6h/46.png)

**Write down the conclusions you draw from your analysis**

**Result :** ......................

## marital_status & relationship

**Detect the similarities between these features by comparing unique values**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



In [None]:
# Fill missing values with "Unknown" in the column of "relationship"

# Your Code is Here
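
A minimal sketch of the fill, assuming a made-up ``toy_df`` in place of the notebook's ``df``: ``fillna`` replaces every missing value in the column with the given label.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; the rows are made up for illustration
toy_df = pd.DataFrame(
    {"relationship": ["Husband", None, "Wife", None, "Own-child"]}
)

# Replace missing values with an explicit "Unknown" category
toy_df["relationship"] = toy_df["relationship"].fillna("Unknown")
print(toy_df["relationship"].value_counts(dropna=False))
```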



In [None]:
# Your Code is Here
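
A cross-tabulation is one way to see how the two features overlap: each cell counts the people with a given combination of ``marital_status`` and ``relationship``. The rows below are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; the rows are made up for illustration
toy_df = pd.DataFrame({
    "marital_status": [
        "Married-civ-spouse", "Married-civ-spouse", "Never-married",
        "Divorced", "Never-married",
    ],
    "relationship": ["Husband", "Wife", "Own-child", "Unknown", "Not-in-family"],
})

# Rows: marital_status, columns: relationship, cells: number of people
overlap = pd.crosstab(toy_df["marital_status"], toy_df["relationship"])
print(overlap)
```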



**Assessment :** ........

**Visualize the count of people in each category**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/1RNHVvj/47.png)

**Check the count of people at each "salary" level by category and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/qjNhW9h/48.png)

**Reduce the number of categories in the "marital_status" feature to married and unmarried, and create a new feature from this new categorical data**

In [None]:
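def mapping_marital_status(x):
    """Map a detailed marital status label to "married" or "unmarried"."""
    if x in ["Never-married", "Divorced", "Separated", "Widowed"]:
        return "unmarried"
    elif x in ["Married-civ-spouse", "Married-AF-spouse", "Married-spouse-absent"]:
        return "married"
    # Any other value (e.g. NaN) falls through and the function returns None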

In [None]:
# Your Code is Here



In [None]:
# Using the mapping_marital_status function above, create a new column named "marital_status_summary"

# Your Code is Here
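
The sketch below applies the function and then checks for values it did not cover. Because the function has no final ``else``, any unlisted value becomes a missing value in the new column. The ``"?"`` row is made up purely to show that case, and the function is repeated only so the example runs on its own.

```python
import pandas as pd


def mapping_marital_status(x):
    # Same mapping as defined in the notebook cell above
    if x in ["Never-married", "Divorced", "Separated", "Widowed"]:
        return "unmarried"
    elif x in ["Married-civ-spouse", "Married-AF-spouse", "Married-spouse-absent"]:
        return "married"


# Toy stand-in for the notebook's `df`; the rows are made up for illustration
toy_df = pd.DataFrame({
    "marital_status": [
        "Never-married", "Married-civ-spouse", "Widowed", "Married-AF-spouse", "?",
    ]
})

# Create the summary column
toy_df["marital_status_summary"] = toy_df["marital_status"].apply(mapping_marital_status)

# Values the mapping did not cover show up as missing
unmapped_count = toy_df["marital_status_summary"].isna().sum()
print(toy_df)
print("Unmapped rows:", unmapped_count)
```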



**Visualize the count of people in each category for the new marital status groups (married, unmarried)**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/wRjj6Bx/49.png)

**Check the count of people at each "salary" level by the new marital status groups (married, unmarried) and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/0JtYnFb/50.png)

**Check the percentage distribution of people in each "salary" level by the new marital status (married, unmarried) and visualize it with separate pie plots**
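
A percentage distribution can be obtained from the same kind of grouped count by passing `normalize=True` to `value_counts`. The sketch below assumes a hypothetical `toy` DataFrame; the numbers are invented and only illustrate the calculation.

```python
import pandas as pd

# Hypothetical toy data: 3 married and 4 unmarried people
toy = pd.DataFrame({
    "marital_status_summary": ["married"] * 3 + ["unmarried"] * 4,
    "salary": [">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K"],
})

# Share of each salary level within each marital status group, in percent
pct = (
    toy.groupby("marital_status_summary")["salary"]
    .value_counts(normalize=True)
    .mul(100)
    .round(1)
)
print(pct)
```

Each group's percentages sum to roughly 100, so selecting one group, for example `pct.loc["married"]`, gives the values for one pie plot.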

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/TYxT5Zz/51.png)

**Check the number of people in each new marital status (married, unmarried) by "salary" level and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/YWjjsZP/52.png)

**Check the percentage distribution of people in each new marital status (married, unmarried) by "salary" level and visualize it with separate pie plots**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/Swb4rb7/v53.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/cJxmqwG/54.png)

**Write down the conclusions you draw from your analysis**

**Result :** .................

## workclass

**Check the number of people in each category and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/NmKTp84/55.png)

**Replace the value "?" with the value "Unknown"**
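
A minimal sketch of the replacement on a hypothetical column is shown below. Stripping whitespace first is an assumption: depending on how the file was read, the placeholder may appear as `" ?"` with a leading space rather than as `"?"`.

```python
import pandas as pd

# Hypothetical toy column; note the last placeholder has a leading space
toy = pd.DataFrame({"workclass": ["Private", "?", "State-gov", " ?"]})

# Strip surrounding whitespace (assumption about the raw data),
# then swap the "?" placeholder for a readable label
cleaned = toy["workclass"].str.strip().replace("?", "Unknown")
print(cleaned.value_counts())
```

`Series.replace` matches whole values here, so a category that merely contains a question mark would be left untouched.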

In [None]:
# Replace "?" values with "Unknown"

# Your Code is Here



**Check the number of people in each "salary" level by workclass group and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/bPnNvsn/56.png)

**Check the percentage distribution of people in each "salary" level by workclass group and visualize it with a bar plot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/8YvM14M/57.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/NFN5q04/58.png)

**Check the number of people in each workclass group by "salary" level and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/98V8zkN/59.png)

**Check the percentage distribution of people in each workclass group by "salary" level and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/QcdnXpk/60.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/Kz5BDBj/61.png)

**Write down the conclusions you draw from your analysis**

**Result :** ..................

## occupation

**Check the number of people in each category and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/F3qqLjS/62.png)

**Replace the value "?" with the value "Unknown"**

In [None]:
# Replace "?" values with "Unknown"

# Your Code is Here



**Check the number of people in each "salary" level by occupation group and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/RhkhQCW/63.png)

**Check the percentage distribution of people in each "salary" level by occupation group and visualize it with a bar plot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/mb7JS3n/64.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/sW2b8wL/65.png)

**Check the number of people in each occupation group by "salary" level and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/cvHS3FH/66.png)

**Check the percentage distribution of people in each occupation group by "salary" level and visualize it with a bar plot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/7tK0PqX/67.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/7brj34F/68.png)

**Write down the conclusions you draw from your analysis**

**Result :** ................

## race

**Check the number of people in each category and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/LdKct3G/69.png)

**Check the number of people in each "salary" level by race and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/Qb4n8Y5/70.png)

**Check the percentage distribution of people in each "salary" level by race and visualize it with a pie plot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/xsJWXp4/71.png)

**Check the number of people of each race by "salary" level and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/RBpPR38/72.png)

**Check the percentage distribution of people of each race by "salary" level and visualize it with a bar plot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/Xy9sYCY/73.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/X8kf9NZ/74.png)

**Write down the conclusions you draw from your analysis**

**Result :** ................

## gender

**Check the number of people of each gender and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/GVTRbrb/75.png)

**Check the number of people in each "salary" level by gender and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/Nr8HRPk/76.png)

**Check the percentage distribution of people in each "salary" level by gender and visualize it with a pie plot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/nrHj2jk/77.png)

**Check the number of people of each gender by "salary" level and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/9sfsw11/78.png)

**Check the percentage distribution of people of each gender by "salary" level and visualize it with a pie plot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/0DzhNgG/79.png)

**Write down the conclusions you draw from your analysis**

**Result :** ..............

## native_country

**Check the number of people in each category and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/x3TNT7B/80.png)

**Replace the value "?" with the value "Unknown"**

In [None]:
# Replace "?" values with "Unknown"

# Your Code is Here



**Reduce the categories of the "native_country" feature to "US" and "Others", and create a new feature from this new categorical data**

In [None]:
def mapping_native_country(x):
    if x == "United-States":
        return "US"
    else:
        return "Others"
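
The sketch below applies the function to a hypothetical Series and compares it with a vectorized alternative built on `numpy.where`. The Series `toy_country` is invented for illustration. Note that, under this mapping, a value such as `"Unknown"` also ends up in `"Others"`.

```python
import numpy as np
import pandas as pd

def mapping_native_country(x):
    if x == "United-States":
        return "US"
    else:
        return "Others"

# Hypothetical values, including an "Unknown" placeholder
toy_country = pd.Series(["United-States", "Mexico", "Unknown", "United-States"])

# Element-wise mapping with the function above
by_apply = toy_country.apply(mapping_native_country)

# Vectorized alternative (not part of the original notebook)
by_where = pd.Series(
    np.where(toy_country == "United-States", "US", "Others"),
    index=toy_country.index,
)

print(by_apply.value_counts())
```

Both approaches give the same labels; the vectorized form mainly matters for speed on large columns.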

In [None]:
# Your Code is Here



In [None]:
# Using the "mapping_native_country" function defined above, create a new column named "native_country_summary"

# Your Code is Here



**Visualize the number of people in each of the new categories (US, Others)**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/wwDhVGd/81.png)

**Check the number of people in each "salary" level by the new native country groups (US, Others) and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/SVnKp4k/82.png)

**Check the percentage distribution of people in each "salary" level by the new native country groups (US, Others) and visualize it with separate pie plots**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/4NQ5b1b/83.png)

**Check the number of people in each new native country group (US, Others) by "salary" level and visualize it with a countplot**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/c1gQfcg/84.png)

**Check the percentage distribution of people in each new native country group (US, Others) by "salary" level and visualize it with separate pie plots**

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/QHc8m0x/85.png)

**Write down the conclusions you draw from your analysis**

**Result :** .................

## <p style="background-color:#47AC34; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Other Specific Analysis Questions</p>

<a id="5"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#47AC34" data-toggle="popover">Content</a>

### 1. What is the average age of males and females by income level?
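
One way to answer a question of this shape is a two-key `groupby` followed by `unstack`. The sketch below uses a hypothetical `toy` DataFrame; the column name `gender` and all ages are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; ages are invented
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Female", "Male", "Female", "Female"],
    "salary": [">50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K"],
    "age": [40, 50, 42, 30, 20, 30],
})

# Mean age for every (salary, gender) combination,
# reshaped so that genders become columns
mean_age = toy.groupby(["salary", "gender"])["age"].mean().unstack()
print(mean_age)
```

The resulting table has one row per income level and one column per gender, which also maps naturally onto a grouped bar plot.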

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/BBDy081/86.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/4PD1208/87.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/2n0yGt7/88.png)

### 2. What are the workclass percentages of Americans in the high-level income group?

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/gMHzLgH/89.png)

### 3. What are the occupation percentages of Americans in the "Private" workclass within the high-level income group?
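
Questions that narrow the data by several conditions can be sketched with a combined boolean mask followed by `value_counts(normalize=True)`. The `toy` DataFrame below is hypothetical, and treating `">50K"` as the high-level income label is an assumption about how the target is coded.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "workclass": ["Private", "Private", "Private", "Private", "State-gov", "Private"],
    "salary": [">50K", ">50K", ">50K", ">50K", ">50K", "<=50K"],
    "occupation": ["Sales", "Sales", "Tech-support", "Exec-managerial", "Sales", "Sales"],
})

# Both conditions must hold, so they are combined with &
mask = (toy["salary"] == ">50K") & (toy["workclass"] == "Private")

# Share of each occupation among the selected rows, in percent
pct = toy.loc[mask, "occupation"].value_counts(normalize=True) * 100
print(pct)
```

Each condition needs its own parentheses because `&` binds more tightly than `==` in Python.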

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/s3Kd7VS/90.png)

### 4. What are the education level percentages of the Asian-Pac-Islander race group in the high-level income group?

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/rZnSFBX/91.png)

### 5. What are the occupation percentages of the Asian-Pac-Islander race group members who have a Bachelors degree in the high-level income group?

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/zZVsbJf/92.png)

### 6. What is the mean number of working hours per week by gender for education level, workclass and marital status? Try to plot all of them in one figure.

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/G5KY8nf/93.png)

## <p style="background-color:#47AC34; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Dropping Similar & Unnecessary Features</p>

<a id="6"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#47AC34" data-toggle="popover">Content</a>

In [None]:
# Your Code is Here



In [None]:
# Permanently drop the columns "education", "education_num", "relationship", "marital_status", "native_country"

# Your Code is Here



## <p style="background-color:#47AC34; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Handling with Missing Values</p>

<a id="7"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#47AC34" data-toggle="popover">Content</a>

**Check for missing values in all features**

In [None]:
# Your Code is Here



**1. It seems that there are no missing values. However, we know that the "workclass" and "occupation" features contain missing values stored as the string "Unknown". Examine these features in more detail.**

**2. Decide whether or not to drop these "Unknown" string values**
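
The two points above can be sketched on a small hypothetical DataFrame: count the `"Unknown"` strings, turn them into real missing values with NumPy, and see how many rows a drop would remove. The `toy` data is invented, and dropping is shown only as one option, not as the required decision.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with "Unknown" placeholders
toy = pd.DataFrame({
    "workclass": ["Private", "Unknown", "State-gov", "Private"],
    "occupation": ["Sales", "Unknown", "Unknown", "Tech-support"],
    "age": [25, 40, 31, 52],
})

# isnull() cannot see the placeholders, so count the strings directly
unknown_counts = (toy[["workclass", "occupation"]] == "Unknown").sum()
print(unknown_counts)

# Convert the placeholders into real missing values
with_nan = toy.replace("Unknown", np.nan)
missing = with_nan.isnull().sum()
print(missing)

# Rows with at least one missing value are removed
dropped = with_nan.dropna()
print(dropped.shape)
```

Comparing `dropped.shape` with `toy.shape` shows how much data the drop would cost, which helps with the decision in point 2.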

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



In [None]:
# Replace "Unknown" values with NaN using numpy library

# Your Code is Here



In [None]:
# Your Code is Here



In [None]:
# Drop missing values in df permanently

# Your Code is Here



In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



## <p style="background-color:#47AC34; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Handling with Outliers</p>

<a id="8"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#47AC34" data-toggle="popover">Content</a>

### Boxplot and Histplot for all numeric features

**Plot boxplots of all numeric features as subplots in a single figure**
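
A minimal sketch of the subplot layout is given below, using Matplotlib directly on a hypothetical `toy` DataFrame. The column values are invented, and seaborn's `boxplot` could be substituted inside the loop.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data; "workclass" is non-numeric and gets skipped
toy = pd.DataFrame({
    "age": [17, 25, 38, 52, 90],
    "capital_gain": [0, 0, 0, 5000, 99999],
    "hours_per_week": [20, 40, 40, 45, 99],
    "workclass": ["Private", "Private", "State-gov", "Private", "Self-emp-inc"],
})

# Select only the numeric columns
numeric_cols = toy.select_dtypes(include="number").columns

# One row of axes, one subplot per numeric feature
fig, axes = plt.subplots(
    nrows=1, ncols=len(numeric_cols),
    figsize=(4 * len(numeric_cols), 4), squeeze=False,
)
for ax, col in zip(axes.ravel(), numeric_cols):
    ax.boxplot(toy[col].to_numpy())
    ax.set_title(col)

fig.tight_layout()
```

Passing `squeeze=False` keeps `axes` two-dimensional, so the loop also works when there is only one numeric column.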

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/DKMSBDk/94.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/JKtcs9S/95.png)

**Plot both boxplots and histograms of all numeric features as subplots in a single figure**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/fMpP3yR/96.png)

**Check the summary statistics of all numeric features**

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/t3MJHDr/97.png)

**1. After analyzing all features, we have decided that the extreme values in the "fnlwgt", "capital_gain" and "capital_loss" features cannot be evaluated as outliers.**

**2. So let's examine the "age" and "hours_per_week" features and use the IQR rule to detect extreme values that could be outliers.**
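
The IQR rule flags values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where IQR = Q3 - Q1. The sketch below works through it on a hypothetical `ages` Series; the factor 1.5 is the conventional choice and is assumed here, since the notebook does not state it.

```python
import pandas as pd

# Hypothetical ages, including one clearly extreme value
ages = pd.Series([17, 25, 30, 35, 40, 45, 50, 90], name="age")

# Quartiles and interquartile range
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Conventional 1.5 * IQR fences (assumed factor)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
print(q1, q3, iqr, lower, upper)

# Values above the upper fence, largest first
outliers = ages[ages > upper].sort_values(ascending=False)
print(outliers)
```

The rule only marks candidates; whether a flagged value is a true outlier still depends on what is plausible for the feature.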

### age

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/SnzH5Nz/98.png)

In [None]:
# Find the IQR, using the 0.25 quantile as the low level and the 0.75 quantile as the high level

# Your Code is Here



In [None]:
# Find the lower and upper limits using the IQR

# Your Code is Here



In [None]:
# Your Code is Here



In [None]:
# Select the observations whose age is greater than the upper limit and sort them by age in descending order

# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/x2wDgzQ/99.png)

### hours_per_week

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/xq53X6w/100.png)

In [None]:
# Find the IQR, using the 0.25 quantile as the low level and the 0.75 quantile as the high level

# Your Code is Here



In [None]:
# Find the lower and upper limits using the IQR

# Your Code is Here



In [None]:
# Your Code is Here



In [None]:
# Select the observations whose hours per week are greater than the upper limit and
# sort them by hours per week in descending order

# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/zGCnbjz/101.png)

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/swYNtdM/102.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/S7RWpxD/103.png)

**Result :** As we can see, there are a number of extreme values in both the "age" and "hours_per_week" features. But how can we know whether these extreme values are outliers or not? At this point, **domain knowledge** comes to the fore.

**Domain Knowledge for this dataset:**
1. In this dataset, all values are based on the statements of individuals, so there may be some data entry errors.
2. In addition, we aim to build the ML model under some restrictions in order to get better performance from it.
3. In this respect, our sample space ranges for some features are as follows:
    - **age : 17 to 80**
    - **hours_per_week : 7 to 70**
    - **a person older than 60 cannot work more than 60 hours a week**
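
The three restrictions above can be sketched as boolean conditions on a hypothetical `toy` DataFrame. Treating the range limits as inclusive is an assumption, and the sketch combines the conditions in one pass, whereas the cells below handle them one at a time.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "age": [17, 45, 85, 65, 30, 62],
    "hours_per_week": [40, 80, 20, 65, 5, 40],
})

# Rows that fall outside the stated ranges (limits assumed inclusive)
out_of_age = (toy["age"] < 17) | (toy["age"] > 80)
out_of_hours = (toy["hours_per_week"] < 7) | (toy["hours_per_week"] > 70)

# People older than 60 who report more than 60 hours a week
senior_overwork = (toy["age"] > 60) & (toy["hours_per_week"] > 60)

# Collect the indices to drop, remove them, and renumber the rows
drop_idx = toy[out_of_age | out_of_hours | senior_overwork].index
cleaned = toy.drop(index=drop_idx).reset_index(drop=True)
print(cleaned)
```

Checking `len(drop_idx)` before dropping is a cheap safeguard against a condition that removes far more rows than intended.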

### Dropping rows according to the domain knowledge

In [None]:
# Create a condition according to your domain knowledge on age stated above and 
# sort the observations meeting this condition by age in ascending order

# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/pJC50ZV/104.png)

In [None]:
# Find the shape of the dataframe created by the condition defined above for age 

# Your Code is Here



In [None]:
# Store the indices of the rows that meet the age condition above in a variable

# Your Code is Here



In [None]:
# Drop these indices defined above for age

# Your Code is Here



In [None]:
# Create a condition according to your domain knowledge on hours per week stated above and 
# sort the observations meeting this condition by hours per week in descending order

# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/rMp7C58/105.png)

In [None]:
# Find the shape of the dataframe created by the condition defined above for hours per week 

# Your Code is Here




In [None]:
# Store the indices of the rows that meet the hours per week condition above in a variable

# Your Code is Here



In [None]:
# Drop these indices defined above for hours per week

# Your Code is Here



In [None]:
# Create a condition according to your domain knowledge on both age and hours per week stated above 

# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/Ch8XSdW/106.png)

In [None]:
# Find the shape of the dataframe created by the condition defined above for both age and hours per week


# Your Code is Here



In [None]:
# Store the indices of the rows that meet the combined age and hours per week condition above in a variable

# Your Code is Here



In [None]:
# Drop these indices defined above for both age and hours per week

# Your Code is Here



In [None]:
# What is the new shape of the DataFrame?

# Your Code is Here



In [None]:
# Reset the indices and display the head of the DataFrame

# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/5MXPD2b/107.png)

## <p style="background-color:#47AC34; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Final Step to Make the Dataset Ready for ML Models</p>

<a id="9"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:#FFF9ED; background-color:#47AC34" data-toggle="popover">Content</a>

### 1. Convert all features to numeric

**Convert the target feature (salary) to numeric values (0 and 1) using the map function**
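
A minimal sketch of the mapping on a hypothetical Series is shown below. The labels `"<=50K"` and `">50K"` are assumed to match the values in the real column exactly; any label missing from the dictionary would become NaN.

```python
import pandas as pd

# Hypothetical target values
salary = pd.Series(["<=50K", ">50K", "<=50K"], name="salary")

# Dictionary-based mapping: low income -> 0, high income -> 1
encoded = salary.map({"<=50K": 0, ">50K": 1})
print(encoded)

# A non-zero count here would point to labels not covered by the dictionary
print(encoded.isna().sum())
```

Running `value_counts()` on the column before and after the mapping is a quick way to confirm that no rows were lost to unmatched labels.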

In [None]:
# Your Code is Here



In [None]:
# Your Code is Here



**Convert all features to numeric using the get_dummies function**
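
The sketch below shows what `pd.get_dummies` does to a small hypothetical DataFrame. Whether to pass `drop_first=True` is a modelling choice the notebook does not specify, so both variants are shown for comparison.

```python
import pandas as pd

# Hypothetical toy data: two numeric columns and one categorical column
toy = pd.DataFrame({
    "age": [25, 40, 31],
    "workclass": ["Private", "State-gov", "Private"],
    "salary": [0, 1, 0],
})

# Every category becomes its own indicator column
full = pd.get_dummies(toy)

# drop_first=True drops one indicator per feature (assumed option)
reduced = pd.get_dummies(toy, drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
print(toy.shape, full.shape, reduced.shape)
```

Numeric columns pass through unchanged, so only the categorical features add columns to the shape.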

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/0F1SHRt/108.png)

In [None]:
# What's the shape of the DataFrame?

# Your Code is Here



In [None]:
# What's the shape of the DataFrame created by the dummy operation?

# Your Code is Here



### 2. Take a look at the correlation between features using the power of visualization
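
Before plotting, the numbers behind a correlation heatmap can be inspected directly. The sketch below computes the correlation matrix of a hypothetical all-numeric `toy` DataFrame and ranks the features by their correlation with the target; the data is invented for illustration.

```python
import pandas as pd

# Hypothetical numeric data; hours_per_week is constructed as 80 - age
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "hours_per_week": [60, 50, 40, 30, 20],
    "salary": [0, 0, 1, 1, 1],
})

# Pairwise Pearson correlations between all columns
corr = toy.corr()

# Correlation of each feature with the target, strongest positive first
corr_with_target = corr["salary"].drop("salary").sort_values(ascending=False)
print(corr_with_target)
```

The full `corr` matrix is what a heatmap would display, while the sorted Series suits a bar plot of feature-to-target correlations.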

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/Dgb8RYZ/109.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/5XH3X4q/110.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/19RytkS/111.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/80GcYKr/112.png)

In [None]:
# Your Code is Here



Desired Output:

![image.png](https://i.ibb.co/0MCPc4d/113.png)

<a id="10"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#47AC34" data-toggle="popover">Content</a>

## <p style="background-color:#47AC34; font-family:arial; color:white; font-size:150%; text-align:center; border-radius:10px 10px;">The End of the Project</p>

___
