Workshop 2 - Enyia Esther

## **Theoretical Task**

T2.1 - Explain why data normalisation is necessary for AI and Machine Learning models. Use an
example to flesh out your discussion (1000 words maximum) (5%)



## Data Normalization:

Data normalization is a pre-processing step that rescales numerical features to a common range or distribution, which helps keep values consistent and can improve model accuracy. Its main purpose is to prevent features with large values from dominating those with smaller ones. This matters most for models that rely on distances or gradient-based optimisation, where the scale of a feature directly affects the result. Two common methods are min-max scaling and z-score standardization.

Data normalization is a critical step in building effective machine learning models. It ensures features contribute equally, improves training speed, supports accurate distance calculations, stabilizes neural networks, and enables fair regularization.

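Normalization is generally required when the features are measured on different scales; otherwise, a feature with large values can drown out an equally important feature with small values. In the Adult dataset used in this workshop, for example, age lies between 17 and 90, whereas fnlwgt has a mean of roughly 190,000, so a distance-based model would be driven almost entirely by fnlwgt unless both columns are rescaled.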
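
To make the two methods named above concrete, here is a minimal sketch on made-up values (the `age` and `income` figures are illustrative assumptions, not taken from any dataset). Min-max scaling computes (x - min) / (max - min) per column, and z-score standardization computes (x - mean) / std per column.

```python
import pandas as pd

# Toy data (hypothetical values): two features on very different scales
toy = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "income": [20000, 40000, 60000, 100000],
})

# Min-max scaling: (x - min) / (max - min), every column ends up in [0, 1]
min_max = (toy - toy.min()) / (toy.max() - toy.min())

# Z-score standardization: (x - mean) / std, every column gets mean 0 and std 1
# ddof=0 uses the population standard deviation
z_score = (toy - toy.mean()) / toy.std(ddof=0)

print(min_max)
print(z_score)
```

After either transformation, both columns live on a comparable scale, so neither dominates purely because of its units.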

# --------------------------------------- Training -------------------------------------

* You can check the comments to see what each cell does. First, read the comments, and then run the cells to see the outputs.


* There are a few exercises in the notebook. Solutions to some of them are given, but please try to complete each exercise yourself before looking at the solution.


* After running all cells and completing all exercises, complete the tasks at the end of the notebook.

## Importing necessary Libraries  

In [None]:
import pandas as pd
import numpy as np

## Importing a dataset for this workshop

## Importing the dataset directly from the source

In [None]:
# Installing the package needed to download the dataset from its online source. The package is recommended by the source at the following
# URL: https://archive.ics.uci.edu/dataset/2/adult. You may need to restart the kernel after the package has been installed
!pip3 install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [None]:
# Downloading the dataset from the online source. The first two lines are given by the online source mentioned above
from ucimlrepo import fetch_ucirepo
import pandas as pd
# fetch dataset
adult = fetch_ucirepo(id=2)

# Putting data in a pandas dataframe
X = adult.data.features
y = adult.data.targets
data=pd.concat([X,y],axis=1)


In [None]:
#printing data
data

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K.
48838,64,,321403,HS-grad,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States,<=50K.
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K.
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K.


## Importing the dataset from a folder on your local disk

In [None]:
# Importing the dataset as a pandas DataFrame if it is stored on your local disk.
# You can download the dataset from the following URL: https://archive.ics.uci.edu/dataset/2/adult
# The path below points to Google Drive, so in Colab the drive must be mounted (next cell) before this cell is run.

data = pd.read_csv('/content/drive/MyDrive/adult.csv') # Replace the current data path with the path to your copy of the file

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Exploring the dataset

In [None]:
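# data.head() returns the first 5 rows, but a notebook cell only displays its last expression,
# so the output below comes from data.info(). Use print(data.head()) to see both.
data.head()
data.info()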

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   39              32560 non-null  int64 
 1    State-gov      32560 non-null  object
 2    77516          32560 non-null  int64 
 3    Bachelors      32560 non-null  object
 4    13             32560 non-null  int64 
 5    Never-married  32560 non-null  object
 6    Adm-clerical   32560 non-null  object
 7    Not-in-family  32560 non-null  object
 8    White          32560 non-null  object
 9    Male           32560 non-null  object
 10   2174           32560 non-null  int64 
 11   0              32560 non-null  int64 
 12   40             32560 non-null  int64 
 13   United-States  32560 non-null  object
 14   <=50K          32560 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
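
The column names in the output above (`39`, `State-gov`, `77516`, ...) look like a data record, which suggests that the local CSV has no header row and pandas used the first record as the header. A minimal sketch of the usual remedy, assuming a header-less file; the two-record `raw_csv` string and the shortened `column_names` list are made up for illustration:

```python
import io
import pandas as pd

# Hypothetical header-less CSV with two records and three columns
raw_csv = "39, State-gov, 77516\n50, Self-emp-not-inc, 83311\n"
column_names = ["age", "workclass", "fnlwgt"]

# Default behaviour: the first record is consumed as the header
wrong = pd.read_csv(io.StringIO(raw_csv))

# header=None keeps every record as data; names supplies the column labels;
# skipinitialspace=True removes the space that follows each comma
right = pd.read_csv(io.StringIO(raw_csv), header=None, names=column_names,
                    skipinitialspace=True)

print(wrong.shape)  # one record is lost to the header
print(right.shape)
```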


In [None]:
# Finding the shape of the data
print(data.shape)

(48842, 15)


In [None]:
# Generate a dataset by randomly extracting 30000 rows (samples)
data_new = data.sample(n=30000, random_state = 48)

In [None]:
# Printing the new dataset. Only the last expression of a cell is displayed,
# so the output below comes from data_new.info()
data_new
data_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             30000 non-null  int64 
 1   workclass       29428 non-null  object
 2   fnlwgt          30000 non-null  int64 
 3   education       30000 non-null  object
 4   education-num   30000 non-null  int64 
 5   marital-status  30000 non-null  object
 6   occupation      29425 non-null  object
 7   relationship    30000 non-null  object
 8   race            30000 non-null  object
 9   sex             30000 non-null  object
 10  capital-gain    30000 non-null  int64 
 11  capital-loss    30000 non-null  int64 
 12  hours-per-week  30000 non-null  int64 
 13  native-country  29831 non-null  object
 14  income          30000 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.4+ MB


In [None]:
# After random sampling, the rows keep their original index labels, so the index is no longer 0, 1, 2, ... This happens in many data science projects.
# Always reindex the dataset if you are unsure the indices are correct.
data_new.reset_index(drop=True, inplace=True)
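
A small sketch of why the reindexing step is needed after `sample()`, using a made-up five-row frame (the `toy` data is an illustrative assumption):

```python
import pandas as pd

# Hypothetical five-row frame standing in for the full dataset
toy = pd.DataFrame({"age": [39, 50, 38, 53, 28]})

# sample() keeps the original row labels, so the index is generally no longer 0, 1, 2, ...
subset = toy.sample(n=3, random_state=0)
print(subset.index.tolist())

# reset_index(drop=True) relabels the rows 0..n-1 and discards the old labels
subset = subset.reset_index(drop=True)
print(subset.index.tolist())
```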

In [None]:
# Checking if the indices are correct
data_new

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,29,Private,216481,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,>50K
1,36,Private,280570,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,45,United-States,<=50K.
2,25,?,100903,Bachelors,13,Married-civ-spouse,?,Wife,White,Female,0,0,25,United-States,<=50K
3,47,Private,145636,Assoc-voc,11,Married-civ-spouse,Handlers-cleaners,Husband,White,Male,0,0,48,United-States,>50K.
4,33,Private,119422,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,<=50K.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29995,20,Private,166371,HS-grad,9,Never-married,Craft-repair,Other-relative,White,Male,0,0,40,United-States,<=50K.
29996,80,Private,202483,HS-grad,9,Married-spouse-absent,Adm-clerical,Not-in-family,White,Female,0,0,16,United-States,<=50K
29997,20,Private,175808,HS-grad,9,Never-married,Craft-repair,Own-child,White,Male,0,0,40,United-States,<=50K.
29998,25,State-gov,31350,Some-college,10,Never-married,Other-service,Not-in-family,White,Male,0,0,40,United-States,<=50K.


In [None]:
data

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K.
48838,64,,321403,HS-grad,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States,<=50K.
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K.
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K.


In [None]:
# Getting statistical information of the dataset for different columns (features)
data.describe(include="all")

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
count,48842.0,47879,48842.0,48842,48842.0,48842,47876,48842,48842,48842,48842.0,48842.0,48842.0,48568,48842
unique,,9,,16,,7,15,6,5,2,,,,42,4
top,,Private,,HS-grad,,Married-civ-spouse,Prof-specialty,Husband,White,Male,,,,United-States,<=50K
freq,,33906,,15784,,22379,6172,19716,41762,32650,,,,43832,24720
mean,38.643585,,189664.1,,10.078089,,,,,,1079.067626,87.502314,40.422382,,
std,13.71051,,105604.0,,2.570973,,,,,,7452.019058,403.004552,12.391444,,
min,17.0,,12285.0,,1.0,,,,,,0.0,0.0,1.0,,
25%,28.0,,117550.5,,9.0,,,,,,0.0,0.0,40.0,,
50%,37.0,,178144.5,,10.0,,,,,,0.0,0.0,40.0,,
75%,48.0,,237642.0,,12.0,,,,,,0.0,0.0,45.0,,


In [None]:
# Showing the dataset information
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       47879 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      47876 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48568 non-null  object
 14  income          48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [None]:
# Getting the count of different values in the column "education-num"
data['education-num'].value_counts()

Unnamed: 0_level_0,count
education-num,Unnamed: 1_level_1
9,15784
10,10878
13,8025
14,2657
11,2061
7,1812
12,1601
6,1389
4,955
15,834


In [None]:
# Getting the count of different values in the column "education"
data['education'].value_counts()

Unnamed: 0_level_0,count
education,Unnamed: 1_level_1
HS-grad,15784
Some-college,10878
Bachelors,8025
Masters,2657
Assoc-voc,2061
11th,1812
Assoc-acdm,1601
10th,1389
7th-8th,955
Prof-school,834
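
The two count tables above list the same counts in the same order, which suggests that `education-num` is a numeric code for `education`. A sketch of one way to check such a one-to-one relationship, using a made-up frame (the rows below are illustrative, not the real data):

```python
import pandas as pd

# Hypothetical rows; in the notebook the same check would run on the full dataset
toy = pd.DataFrame({
    "education": ["HS-grad", "Bachelors", "HS-grad", "Masters", "Bachelors"],
    "education-num": [9, 13, 9, 14, 13],
})

# Number of distinct codes seen for each label
codes_per_label = toy.groupby("education")["education-num"].nunique()
# Number of distinct labels seen for each code
labels_per_code = toy.groupby("education-num")["education"].nunique()

# The mapping is one-to-one when every label has one code and every code has one label
one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print(one_to_one)
```

If the check holds on the real data, one of the two columns carries no extra information and could be dropped before encoding.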


In [None]:
# Dropping a column
data = data.drop(['fnlwgt'], axis=1)

In [None]:
data.shape

(48842, 14)

In [None]:
# Getting the number of unique values of a column
data['education'].nunique()

16

In [None]:
# Finding how many rows belong to each sex
data['sex'].value_counts()

Unnamed: 0_level_0,count
sex,Unnamed: 1_level_1
Male,32650
Female,16192


In [None]:
# Calculating the average age of different genders in the dataset
data['age'].groupby([data['sex']]).mean()


Unnamed: 0_level_0,age
sex,Unnamed: 1_level_1
Female,36.927989
Male,39.494395


In [None]:
# Getting the average age of different genders in the dataset broken down based on their education
data['age'].groupby([data['sex'],data['education']]).mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,age
sex,education,Unnamed: 2_level_1
Female,10th,36.474836
Female,11th,29.963077
Female,12th,29.469194
Female,1st-4th,47.098361
Female,5th-6th,45.362205
Female,7th-8th,50.577406
Female,9th,41.272727
Female,Assoc-acdm,36.508772
Female,Assoc-voc,38.111717
Female,Bachelors,35.880904


In [None]:
# Getting the maximum age of different races in the dataset
data['age'].groupby([data['race']]).max()

Unnamed: 0_level_0,age
race,Unnamed: 1_level_1
Amer-Indian-Eskimo,82
Asian-Pac-Islander,90
Black,90
Other,77
White,90


* Exercise 1: Write code to find the average capital-gain for each combination of sex and occupation. Apply the code to the dataset and print the result.

In [None]:
# exercise 1 solution

average_contribution=data['capital-gain'].groupby([data['sex'],data['occupation']]).mean()
average_contribution




Unnamed: 0_level_0,Unnamed: 1_level_0,capital-gain
sex,occupation,Unnamed: 2_level_1
Female,?,337.712247
Female,Adm-clerical,465.291059
Female,Craft-repair,728.386997
Female,Exec-managerial,1204.303204
Female,Farming-fishing,707.757895
Female,Handlers-cleaners,503.771654
Female,Machine-op-inspct,140.941542
Female,Other-service,204.280578
Female,Priv-house-serv,193.929825
Female,Prof-specialty,1255.421499
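
The same grouped mean can also be written with `DataFrame.groupby`, which some readers find easier to follow. A sketch on made-up rows (the `toy` values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical rows with the three columns used in the exercise
toy = pd.DataFrame({
    "sex": ["Female", "Female", "Male", "Male"],
    "occupation": ["Sales", "Sales", "Sales", "Tech-support"],
    "capital-gain": [0, 200, 100, 300],
})

# Group by both columns, then average capital-gain within each group
mean_gain = toy.groupby(["sex", "occupation"])["capital-gain"].mean()

# unstack() pivots the inner index level (occupation) into columns
table = mean_gain.unstack()
print(table)
```

Combinations that never occur in the data, such as Female and Tech-support in this toy frame, show up as NaN in the pivoted table.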


In [None]:
# Extracting the age and education columns and creating a new DataFrame using these columns
a=data['age']
b=data['education']
new_data=pd.concat([a,b],axis=1)

In [None]:
data

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K.
48838,64,,HS-grad,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States,<=50K.
48839,38,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K.
48840,44,Private,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K.


* Exercise 2: Write a function that receives the dataset and replaces Female with F and Male with M (please try to write it yourself before checking the answer in the next cell)

In [None]:
# exercise 2 solution
def encode_sex(data):
    data.reset_index(drop=True, inplace=True) # reindexing the dataset in case the dataset index is corrupted
    rows=data.shape[0]
    a=data.loc[:,'sex']
    for i in range(rows):
        if a[i]=="Male":
            data.loc[i,"sex"]="M"
        elif a[i]=="Female":
            data.loc[i,"sex"]="F"
    return data
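
The row-by-row loop above works, but it is slow on tens of thousands of rows. A vectorised alternative, sketched here under the assumption that only the two labels need replacing (the function name `encode_sex_vectorised` and the `toy` frame are hypothetical):

```python
import pandas as pd

def encode_sex_vectorised(data):
    """Return a copy of data with 'Male' -> 'M' and 'Female' -> 'F' in the 'sex' column."""
    result = data.copy()  # leave the caller's DataFrame untouched
    # replace() only touches the listed values; anything else (e.g. NaN) is kept as-is
    result["sex"] = result["sex"].replace({"Male": "M", "Female": "F"})
    return result

toy = pd.DataFrame({"sex": ["Male", "Female", "Male"], "age": [39, 28, 50]})
encoded = encode_sex_vectorised(toy)
print(encoded["sex"].tolist())
```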

In [None]:
# Copying the data
data_copy=data.copy()

In [None]:
# Applying the encode_sex function to the copied data
data_encoded=encode_sex(data_copy)
data_encoded.head()

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,M,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,M,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,M,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,M,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,F,0,0,40,Cuba,<=50K


In [None]:
data

Unnamed: 0,age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K.
48838,64,,HS-grad,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States,<=50K.
48839,38,Private,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K.
48840,44,Private,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K.


# Tasks

## P 2.1: Complete the following steps (4%):
## 1- Import the dataset from the URL we used in this workshop.
## 2- Generate a new dataset by randomly extracting 10000 samples.
## 3- Drop the 'income' column, reindex the new dataset and then clean it.
## 4- Print the new dataset and use it for the rest of the tasks

In [None]:
############# WRITE YOUR CODE IN THIS CELL (IF APPLICABLE)  ####################
import io
import requests
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

column_names = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"
]


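# Download the raw file and read it; the file has no header row, so the column names are supplied
data_request = requests.get(url).content
data = pd.read_csv(io.StringIO(data_request.decode('utf-8')), header=None, names=column_names, skipinitialspace=True)
print(data.shape)

# Randomly extract 10000 samples
new_dataset = data.sample(n=10000, random_state = 100)

# Drop the 'income' column, then remove rows with missing values and reindex.
# dropna() returns a new DataFrame, so its result must be assigned.
# Note: this file marks unknown values with "?", which dropna() does not treat as missing.
new_dataset = new_dataset.drop(['income'], axis=1)
new_dataset = new_dataset.dropna()
new_dataset.reset_index(drop=True, inplace=True)


data = new_dataset.copy()
data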

(32561, 15)


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,27,?,181284,12th,8,Married-civ-spouse,?,Husband,Black,Male,0,0,45,United-States
1,24,Private,235894,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,38,United-States
2,30,Private,65278,HS-grad,9,Married-civ-spouse,Tech-support,Husband,White,Male,0,0,40,United-States
3,20,Private,117476,Some-college,10,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States
4,54,Private,88019,Some-college,10,Divorced,Transport-moving,Not-in-family,White,Male,0,0,55,United-States
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,30,Local-gov,326104,HS-grad,9,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
9996,56,Federal-gov,277420,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,Puerto-Rico
9997,41,?,152880,HS-grad,9,Divorced,?,Not-in-family,Black,Female,0,0,28,United-States
9998,43,Private,237993,Some-college,10,Married-civ-spouse,Tech-support,Husband,White,Male,0,0,40,United-States
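
The output above still contains `?` entries (for example in `workclass` and `occupation`), because `dropna()` only removes real missing values (`NaN`) and this file marks unknown values with the string `?`. A minimal sketch of one way to handle this, on made-up rows (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical rows that use "?" as the missing-value marker
toy = pd.DataFrame({
    "age": [27, 24, 41],
    "workclass": ["?", "Private", "?"],
    "occupation": ["?", "Craft-repair", "Sales"],
})

# Turn the "?" placeholder into NaN so that pandas treats it as missing
cleaned = toy.replace("?", np.nan)

# Drop rows with any missing value, then renumber the remaining rows
cleaned = cleaned.dropna().reset_index(drop=True)
print(cleaned)
```

An alternative is to pass `na_values="?"` to `pd.read_csv`, so that the placeholder is converted while the file is being read.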


## P 2.2: Complete the following steps (4%):
## 1- Determine which columns are categorical and which columns are numerical.
## 2- Encode the categorical columns using the correct method.
## 3- Normalise the numerical columns.
## 4- Print the dataset

In [None]:
############# WRITE YOUR CODE IN THIS CELL (IF APPLICABLE)  ####################
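from sklearn.preprocessing import MinMaxScaler

# Numerical columns have an integer or float dtype; categorical columns are stored as objects (strings)
numerical_columns = data.select_dtypes(include=['int64', 'float64']).columns
data = data.dropna(subset=numerical_columns)
categorical_columns = data.select_dtypes(include=['object']).columns

# One-hot encode the categorical columns; drop_first=True removes one redundant dummy per column
encoded_categorical_columns = pd.get_dummies(data[categorical_columns], drop_first=True)

# Rescale every numerical column to the range [0, 1]
scale = MinMaxScaler()
numerical_columns_scale = scale.fit_transform(data[numerical_columns])
# Reuse the original index so that pd.concat aligns the rows correctly
numerical_columns_scaled = pd.DataFrame(numerical_columns_scale, columns=numerical_columns, index=data.index)

data = pd.concat([numerical_columns_scaled, encoded_categorical_columns], axis=1)
data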


Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,0.136986,0.113883,0.466667,0.0,0.0,0.448980,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
1,0.095890,0.151009,0.533333,0.0,0.0,0.377551,False,False,False,True,...,False,False,False,False,False,False,False,True,False,False
2,0.178082,0.035018,0.533333,0.0,0.0,0.397959,False,False,False,True,...,False,False,False,False,False,False,False,True,False,False
3,0.041096,0.070504,0.600000,0.0,0.0,0.397959,False,False,False,True,...,False,False,False,False,False,False,False,True,False,False
4,0.506849,0.050478,0.600000,0.0,0.0,0.551020,False,False,False,True,...,False,False,False,False,False,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,0.178082,0.212338,0.533333,0.0,0.0,0.397959,False,True,False,False,...,False,False,False,False,False,False,False,True,False,False
9996,0.534247,0.179240,0.800000,0.0,0.0,0.397959,True,False,False,False,...,False,True,False,False,False,False,False,False,False,False
9997,0.328767,0.094573,0.533333,0.0,0.0,0.275510,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
9998,0.356164,0.152436,0.600000,0.0,0.0,0.397959,False,False,False,True,...,False,False,False,False,False,False,False,True,False,False
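
A detail worth noting about the final `pd.concat(..., axis=1)`: it aligns rows by index label, not by position. A sketch on made-up data showing what happens when the two frames carry different indices (the `toy` frame and the `scaled_values` array are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index is not 0..n-1 (e.g. after rows were dropped)
toy = pd.DataFrame({"age": [20, 60], "sex": ["Male", "Female"]}, index=[3, 7])

# Stand-in for the array returned by a scaler: plain NumPy, no index attached
scaled_values = np.array([[0.0], [1.0]])
dummies = pd.get_dummies(toy[["sex"]], drop_first=True)

# Default index 0..1 does not match [3, 7]: concat produces the union of labels
misaligned = pd.concat(
    [pd.DataFrame(scaled_values, columns=["age"]), dummies], axis=1)

# Reusing the original index keeps each scaled row next to its own dummies
aligned = pd.concat(
    [pd.DataFrame(scaled_values, columns=["age"], index=toy.index), dummies], axis=1)

print(misaligned.shape)
print(aligned.shape)
```

The misaligned result has more rows than the input and is padded with NaN, which is easy to miss on a large dataset.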


## P 2.3: Write a function to split the dataset in half column-wise and swap the first half and the second half. Apply the function to the dataset and print the result (3%)

In [None]:
############# WRITE YOUR CODE IN THIS CELL (IF APPLICABLE)  ####################


def swap_columns_in_half(data):
    num_columns = len(data.columns)
    half_of_column = num_columns // 2
    first_half = data.iloc[:, :half_of_column]
    second_half = data.iloc[:, half_of_column:]
    swapped_column = pd.concat([second_half, first_half], axis=1)
    return swapped_column

data = swap_columns_in_half(data)
data


Unnamed: 0,relationship_Other-relative,relationship_Own-child,relationship_Unmarried,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_Other,race_White,sex_Male,native-country_Cambodia,...,occupation_Handlers-cleaners,occupation_Machine-op-inspct,occupation_Other-service,occupation_Priv-house-serv,occupation_Prof-specialty,occupation_Protective-serv,occupation_Sales,occupation_Tech-support,occupation_Transport-moving,relationship_Not-in-family
0,False,False,False,False,False,True,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,True,True,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,True,True,False,...,False,False,False,False,False,False,False,True,False,False
3,False,False,False,False,False,False,False,True,True,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,True,True,False,...,False,False,False,False,False,False,False,False,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,False,False,False,False,False,False,False,True,True,False,...,False,False,False,False,False,True,False,False,False,False
9996,False,False,False,False,False,False,False,True,True,False,...,False,False,False,False,False,False,False,False,False,False
9997,False,False,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
9998,False,False,False,False,False,False,False,True,True,False,...,False,False,False,False,False,False,False,True,False,False
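
With an odd number of columns, the integer division puts the extra column in the second half. A small sketch on a made-up three-column frame, using the same slicing as the function above (the name `swap_halves` and the `toy` frame are hypothetical):

```python
import pandas as pd

def swap_halves(frame):
    """Return frame with its left and right column halves exchanged."""
    half = len(frame.columns) // 2  # integer division: the extra column goes to the right half
    return pd.concat([frame.iloc[:, half:], frame.iloc[:, :half]], axis=1)

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
swapped = swap_halves(toy)
print(list(swapped.columns))  # ['b', 'c', 'a']
```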


## P 2.4: Write a function that receives the names of two numerical columns and compares their values for all rows. If the value of the first column is greater than that of the second column, the function should produce True; otherwise, it should produce False. The function should append an additional column to the dataset to store the results of the comparison for all rows. Apply the function to the "age" and "hours-per-week" columns in the dataset and print the result (4%).

In [None]:
############# WRITE YOUR CODE IN THIS CELL (IF APPLICABLE)  ####################

def compare_columns(data, column1, column2):
    data[column1 + '_greater_than_' + column2] = data[column1] > data[column2]
    return data

data = compare_columns(data, 'age', 'hours-per-week')
data


Unnamed: 0,relationship_Other-relative,relationship_Own-child,relationship_Unmarried,relationship_Wife,race_Asian-Pac-Islander,race_Black,race_Other,race_White,sex_Male,native-country_Cambodia,...,occupation_Machine-op-inspct,occupation_Other-service,occupation_Priv-house-serv,occupation_Prof-specialty,occupation_Protective-serv,occupation_Sales,occupation_Tech-support,occupation_Transport-moving,relationship_Not-in-family,age_greater_than_hours-per-week
0,False,False,False,False,False,True,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,True,True,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,True,True,False,...,False,False,False,False,False,False,True,False,False,False
3,False,False,False,False,False,False,False,True,True,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,True,True,False,...,False,False,False,False,False,False,False,True,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,False,False,False,False,False,False,False,True,True,False,...,False,False,False,False,True,False,False,False,False,False
9996,False,False,False,False,False,False,False,True,True,False,...,False,False,False,False,False,False,False,False,False,True
9997,False,False,False,False,False,True,False,False,False,False,...,False,False,False,False,False,False,False,False,True,True
9998,False,False,False,False,False,False,False,True,True,False,...,False,False,False,False,False,False,True,False,False,False
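
Because both columns were min-max scaled in P 2.2, the comparison above is between scaled values, not between years and hours. The function also adds the new column to the DataFrame it receives. A sketch of a variant that leaves its input untouched, shown on made-up unscaled values (the name `compare_columns_copy` and the `toy` frame are hypothetical):

```python
import pandas as pd

def compare_columns_copy(frame, column1, column2):
    """Return a new DataFrame with an extra boolean column: column1 > column2."""
    new_name = f"{column1}_greater_than_{column2}"
    # assign() returns a new frame, so the input is not modified
    return frame.assign(**{new_name: frame[column1] > frame[column2]})

toy = pd.DataFrame({"age": [27, 54, 41], "hours-per-week": [45, 55, 28]})
result = compare_columns_copy(toy, "age", "hours-per-week")
print(result)
```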
