
# **Assignment 1: The data life cycle + Summary statistics**

In this week's assignment we will review the data life cycle with a particular application, as well as review some practical and fundamental concepts of statistics.

## **Description**

An individual’s annual income results from various factors. Intuitively, it is influenced by the individual’s education level, age, gender, occupation, etc.

## **Exercise 1: Data life cycle**

To study the relationship of different variables to the annual income of worldwide workers, you will need to design the entire data plan for the project.

This plan should cover everything from the mechanisms that will be used to collect data from workers, the storage architectures, the identification of errors or inconsistencies that we foresee may occur more frequently, and the specific question we want to answer. The plan will end once the solution to the main goal is obtained and the results have been shared.

Describe all the individual phases your data project should follow and give a brief description of the technical details in each phase. You should customize each phase to the proposed dataset and problem.

**[Solution]**

Check that all phases of the data life cycle are mentioned. All phases must describe the specific processes used for the proposed case.

The processes described are unambiguous, especially in the phases of question proposal, data collection and publication/resolution.

## **Exercise 2: Central tendency and dispersion**

Once the data has been collected and preprocessed, we are interested in knowing what workers are paid according to different dimensions: their age, their job position, their level of education, ... To do this, it is necessary to use some statistical metrics such as central tendency and dispersion.

We start by loading the dataset we will be working with. Remember that in order to start working with Google Colab you must **Connect to a runtime environment**.

Then you must go to the **Files** folder (in the left panel), where you must upload the csv attached to this assignment.

In [None]:
import pandas as pd
dataset = pd.read_csv('adult.csv')

**Exercise:** Make a brief preliminary inspection of the dataset, indicating the number of records, the number of descriptors, and the names of the descriptors. If you identify a descriptor that should not appear in the dataset, remove it.

In [None]:
# Your answer here
print('Dataset has {} records and {} descriptors'.format(*dataset.shape))
print(dataset.columns)

Dataset has 48842 records and 15 descriptors
Index(['age', 'workclass', 'fnlwgt', 'education', 'educational-num',
       'marital-status', 'occupation', 'relationship', 'race', 'gender',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')
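The exercise also asks to remove any descriptor that should not be there. The following is only a minimal sketch of how such a removal could look, assuming a hypothetical stray column named `Unnamed: 0` (the kind of leftover index column that appears when a CSV is saved with its index). The toy frame is made up for illustration; no such column shows up in the output above.

```python
import pandas as pd

# Toy frame standing in for the real data. 'Unnamed: 0' is a hypothetical
# leftover index column, not something observed in adult.csv.
toy = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'age': [25, 38, 28],
    'income': ['<=50K', '<=50K', '>50K'],
})

# Collect every column whose name starts with 'Unnamed'.
stray = [col for col in toy.columns if col.startswith('Unnamed')]

# Dropping an empty list of columns is a no-op, so this is safe to run
# even when no stray column exists.
cleaned = toy.drop(columns=stray)

print('Removed:', stray)
print('Remaining descriptors:', list(cleaned.columns))
```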


**Exercise:** Show a random selection of some records from the dataset. This will help you to perform a preliminary visual inspection to confirm that the data has been loaded correctly.

In [None]:
# Your answer here
print(dataset.iloc[0])

age                               25
workclass                    Private
fnlwgt                        226802
education                       11th
educational-num                    7
marital-status         Never-married
occupation         Machine-op-inspct
relationship               Own-child
race                           Black
gender                          Male
capital-gain                       0
capital-loss                       0
hours-per-week                    40
native-country         United-States
income                         <=50K
Name: 0, dtype: object
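Note that `iloc[0]` always returns the first record, while the exercise asks for a random selection. One possible way to draw random records is the `sample()` method of the pandas DataFrame. The sketch below uses a small made-up frame instead of `adult.csv`, and the `random_state` value is an arbitrary choice that only makes the draw reproducible.

```python
import pandas as pd

# Made-up records for illustration only (not rows from adult.csv).
toy = pd.DataFrame({
    'age': [25, 38, 28, 44, 18],
    'gender': ['Male', 'Male', 'Female', 'Male', 'Female'],
    'income': ['<=50K', '<=50K', '>50K', '>50K', '<=50K'],
})

# Draw 3 distinct rows at random; random_state fixes the seed so that
# repeated runs return the same rows.
sampled = toy.sample(n=3, random_state=0)

print(sampled)
```

On the real data the equivalent call would be `dataset.sample(n=5)`, where 5 is just an example size.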


To determine the mean, median, and standard deviation of the quantitative descriptors, we will use the _describe()_ method of the pandas DataFrame object.

In [None]:
dataset.describe()

| | age | fnlwgt | educational-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|
| count | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 |
| mean | 38.643585 | 189664.1 | 10.078089 | 1079.067626 | 87.502314 | 40.422382 |
| std | 13.71051 | 105604.0 | 2.570973 | 7452.019058 | 403.004552 | 12.391444 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117550.5 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178144.5 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237642.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1490400.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


The 50% row gives the value that splits the ordered values of each attribute in half, that is, the median.
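As a quick check of this reading of the `50%` row, the sketch below compares three ways of obtaining the same number on a small made-up series (the values are illustrative, not taken from `adult.csv`).

```python
import pandas as pd

# Five made-up values, already sorted; the middle one is 37.
values = pd.Series([17, 28, 37, 48, 90])

median = values.median()            # middle value of the ordered data
q50 = values.quantile(0.5)          # 50th percentile
described = values.describe()['50%']  # the '50%' row reported by describe()

print(median, q50, described)
```

All three expressions point at the same quantity, which is why the `50%` row can be read as the median.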

Note that for the **capital-gain** and **capital-loss** attributes the mean and median values (both central tendency values) are very different.

**Exercise:** What is the reason for this difference?

**Hint:** You may find it useful to visualize the ordered values of these variables.

**[Solution]**

The value of these descriptors is 0 for more than half of the records (in fact for more than 75% of them, since the third quartile is also 0). This value probably marks a missing or absent amount. For this reason the central value (the median) is 0. The mean, however, takes every value into account, so the few large nonzero amounts pull it well above the median.
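The effect can be reproduced on a tiny made-up series in which most values are 0 and a single value is large. This is only an illustrative sketch; the numbers are invented and do not come from the dataset.

```python
import pandas as pd

# Nine zeros and one large value: a zero-dominated, right-skewed sample.
gains = pd.Series([0] * 9 + [10000])

share_zero = (gains == 0).mean()  # fraction of entries equal to 0
median = gains.median()           # middle of the ordered values
mean = gains.mean()               # pulled up by the single large value

print('share of zeros:', share_zero)
print('median:', median)
print('mean:', mean)
```

The same `(column == 0).mean()` expression could be applied to `dataset['capital-gain']` to measure how dominant the zeros are in the real data.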

In [None]:
dataset['capital-gain'].value_counts()

0        44807
15024      513
7688       410
7298       364
99999      244
         ...  
2387         1
22040        1
6612         1
1111         1
1639         1
Name: capital-gain, Length: 123, dtype: int64

In [None]:
dataset['capital-loss'].value_counts()

0       46560
1902      304
1977      253
1887      233
2415       72
        ...  
1539        1
2489        1
2201        1
1421        1
1870        1
Name: capital-loss, Length: 99, dtype: int64

**Exercise:** What are the mean and standard deviation of the **capital-gain** attribute as a function of the worker's level of education? Order the results by educational level (you may use the **educational-num** descriptor) and interpret the results obtained.

**Hint:** Use the _groupby()_ method of the pandas DataFrame object. 

In [None]:
# Solution
dataset.groupby('educational-num')['capital-gain'].mean()

educational-num
1       732.000000
2       123.591093
3       360.365422
4       242.626178
5       313.398148
6       323.049676
7       203.739514
8       208.579909
9       573.314179
10      559.961574
11      778.602135
12      636.951905
13     1762.564984
14     2583.605947
15    10586.467626
16     5727.769360
Name: capital-gain, dtype: float64

In [None]:
dataset.groupby('educational-num')['capital-gain'].std()

educational-num
1      4798.968314
2       750.627866
3      4512.388159
4      1030.560526
5      3736.769093
6      4106.377612
7      1191.285890
8      1383.857230
9      4952.769963
10     4474.465161
11     4513.887224
12     3511.958827
13     9356.415141
14    11248.829635
15    27261.781745
16    19570.677258
Name: capital-gain, dtype: float64

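Broadly, the average capital gain rises with the level of education, although the increase is not strictly monotonic (level 1, for example, has a higher mean than levels 2 to 10). Level 15 has the highest average (about 10,586), followed by level 16 (about 5,728).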

Similarly, the standard deviation tends to grow with the level of education, so capital gains are more dispersed among highly educated workers (they are prepared to carry out both low-paid and well-paid jobs, depending on the opportunities they receive).
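Both statistics can also be obtained in a single table, which makes it easier to read the mean and the dispersion side by side. The sketch below is one possible way to do it with `groupby()` and `agg()`, shown on a made-up frame with three education levels; the values are illustrative only.

```python
import pandas as pd

# Made-up data: two workers per education level.
toy = pd.DataFrame({
    'educational-num': [16, 9, 13, 9, 16, 13],
    'capital-gain': [10000, 0, 4000, 1000, 0, 0],
})

# Mean and standard deviation per level, ordered by educational level.
summary = (
    toy.groupby('educational-num')['capital-gain']
       .agg(['mean', 'std'])
       .sort_index()
)

# Level with the highest average capital gain.
top_level = summary['mean'].idxmax()

print(summary)
print('Level with the highest mean:', top_level)
```

Using `idxmax()` avoids reading the maximum off the table by eye, which is easy to get wrong when several levels have large values.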

## **Exercise 3: Unconditional probability**

For all of the following exercises, the calculation of probabilities should be automatic and adaptive to the increase or decrease of records in the dataset.

**Exercise:** What is the probability that a worker randomly chosen from the dataset is a black male and earns more than 50k per year?

In [None]:
# Solution
num_total = dataset.shape[0]
num_group = dataset[(dataset['gender'] == 'Male') &
                    (dataset['race'] == 'Black') &
                    (dataset['income'] == '>50K')].shape[0]

print(num_group / num_total)

0.008885795012489251


0.89% of the records in the dataset are black men earning more than 50k per year. This is the probability that a randomly chosen record meets these criteria.
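Since the calculation must adapt to the number of records, the counting can be wrapped in a small helper. The function name `probability` below is hypothetical (it is not part of pandas or of the assignment), and the toy frame is made up for illustration.

```python
import pandas as pd

def probability(df, mask):
    """Fraction of rows of df for which the boolean mask is True."""
    return mask.sum() / len(df)

# Made-up records for illustration only.
toy = pd.DataFrame({
    'gender': ['Male', 'Male', 'Female', 'Male'],
    'race': ['Black', 'White', 'Black', 'Black'],
    'income': ['>50K', '>50K', '<=50K', '<=50K'],
})

# Same three conditions as in the solution above, combined with '&'.
mask = (
    (toy['gender'] == 'Male')
    & (toy['race'] == 'Black')
    & (toy['income'] == '>50K')
)

p = probability(toy, mask)
print(p)
```

Because the denominator is `len(df)`, the result updates automatically when records are added to or removed from the frame.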

**Exercise:** What is the probability that a worker randomly chosen from the dataset is a female under 30 years old, married, and earns at most 50k per year?

In [None]:
# Solution
num_total = dataset.shape[0]
num_group = dataset[(dataset['gender'] == 'Female') &
                    (dataset['age'] < 30) &
                    ((dataset['marital-status'] == 'Married-civ-spouse') |
                     (dataset['marital-status'] == 'Married-AF-spouse') |
                     (dataset['marital-status'] == 'Married-spouse-absent')) &
                    (dataset['income'] == '<=50K')].shape[0]

print(num_group / num_total)

0.009766184840915605


0.98% of the records in the dataset are married female workers under 30 years old who earn at most 50k per year (the `<=50K` category). This is the probability that a randomly chosen record meets these criteria.
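The chain of `|` comparisons on **marital-status** can be written more compactly with the `isin()` method of a pandas Series. The sketch below shows this on a made-up frame; which categories count as "married" is a modelling assumption carried over from the solution above.

```python
import pandas as pd

# Categories treated as "married", mirroring the solution above.
married = ['Married-civ-spouse', 'Married-AF-spouse', 'Married-spouse-absent']

# Made-up records for illustration only.
toy = pd.DataFrame({
    'gender': ['Female', 'Female', 'Male', 'Female', 'Female'],
    'age': [24, 29, 27, 45, 22],
    'marital-status': ['Married-civ-spouse', 'Married-AF-spouse',
                       'Married-civ-spouse', 'Married-civ-spouse',
                       'Never-married'],
    'income': ['<=50K', '<=50K', '<=50K', '<=50K', '<=50K'],
})

mask = (
    (toy['gender'] == 'Female')
    & (toy['age'] < 30)
    & toy['marital-status'].isin(married)  # replaces the three '|' tests
    & (toy['income'] == '<=50K')
)

# The mean of a boolean mask is the fraction of True values.
p = mask.mean()
print(p)
```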