# Daily Challenge

## What You Will Create
- Implementation of Min-Max scaling and Z-score normalization.
- Steps for data merging and dimensionality reduction using PCA.
- Analysis of detected outliers, including visualizations.
- Results of PCA, including the transformed dataset.

In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from scipy import stats
from scipy.stats import zscore
from scipy.stats.mstats import winsorize

In [4]:
col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

adult = pd.read_csv('adult.data', names=col_names, na_values=' ?', skipinitialspace=True)

adult.info()
adult.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 5 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 6 | 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K |
| 7 | 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K |
| 8 | 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K |
| 9 | 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K |


In [5]:
# check for duplicate rows
adult.duplicated().sum()

24

In [6]:
# obtaining general information about numeric columns
adult.describe()

|   | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
| --- | --- | --- | --- | --- | --- | --- |
| count | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 |
| mean | 38.581647 | 189778.4 | 10.080679 | 1077.648844 | 87.30383 | 40.437456 |
| std | 13.640433 | 105550.0 | 2.57272 | 7385.292085 | 402.960219 | 12.347429 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237051.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


<div style="border:solid green 2px; padding: 20px">
    
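- The dataset contains 32561 rows and 15 columns. adult.info() reports no missing values. Note, however, that skipinitialspace=True strips the leading space from each field, so na_values=' ?' never matches, and any ? entries stay in the data as ordinary strings.
- All columns have appropriate data types.
- There are 24 duplicate rows in the dataset, and they need to be removed.

</div>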
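How `na_values` interacts with `skipinitialspace` in the `read_csv` call above can be sketched with two made-up rows in the same comma-plus-space layout. The sample rows and the shortened column list are assumptions for illustration only; they are not taken from adult.data.

```python
import io

import pandas as pd

# Two hypothetical rows, separated by ", " like the fields in adult.data
raw = "39, State-gov, 13\n54, ?, 9\n"
cols = ["age", "workclass", "education-num"]

# skipinitialspace=True strips the space, so the field arrives as "?"
# and the marker " ?" (with a leading space) never matches it
no_match = pd.read_csv(io.StringIO(raw), names=cols,
                       na_values=" ?", skipinitialspace=True)

# Using the stripped marker "?" converts the entry to NaN
match = pd.read_csv(io.StringIO(raw), names=cols,
                    na_values="?", skipinitialspace=True)

print("missing with ' ?':", no_match["workclass"].isna().sum())
print("missing with '?': ", match["workclass"].isna().sum())
```

If the second variant reports missing entries on the real file, they would need to be handled (for example dropped or imputed) before scaling.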

In [7]:
# duplicate removal
adult = adult.drop_duplicates()
len(adult)

32537

## Apply Min-Max scaling and Z-score normalization to appropriate numerical features in the dataset

In [8]:
# Apply Min-Max scaling to the numerical columns
num_cols = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
min_max_scaler = MinMaxScaler()
df_min_max_scaled = pd.DataFrame(min_max_scaler.fit_transform(adult[num_cols]), columns=num_cols)

print("Min-Max Scaled Dataset:")
df_min_max_scaled

Min-Max Scaled Dataset:


|   | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.301370 | 0.044302 | 0.800000 | 0.021740 | 0.0 | 0.397959 |
| 1 | 0.452055 | 0.048238 | 0.800000 | 0.000000 | 0.0 | 0.122449 |
| 2 | 0.287671 | 0.138113 | 0.533333 | 0.000000 | 0.0 | 0.397959 |
| 3 | 0.493151 | 0.151068 | 0.400000 | 0.000000 | 0.0 | 0.397959 |
| 4 | 0.150685 | 0.221488 | 0.800000 | 0.000000 | 0.0 | 0.397959 |
| ... | ... | ... | ... | ... | ... | ... |
| 32532 | 0.136986 | 0.166404 | 0.733333 | 0.000000 | 0.0 | 0.377551 |
| 32533 | 0.315068 | 0.096500 | 0.533333 | 0.000000 | 0.0 | 0.397959 |
| 32534 | 0.561644 | 0.094827 | 0.533333 | 0.000000 | 0.0 | 0.397959 |
| 32535 | 0.068493 | 0.128499 | 0.533333 | 0.000000 | 0.0 | 0.193878 |


In [9]:
# Apply Z-score normalization to the numerical columns
num_cols = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
z_score_scaler = StandardScaler()
df_z_score_normalized = pd.DataFrame(z_score_scaler.fit_transform(adult[num_cols]), columns=num_cols)

print("Z-Score Normalized Dataset:")
df_z_score_normalized

Z-Score Normalized Dataset:


|   | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.030390 | -1.063569 | 1.134777 | 0.148292 | -0.216743 | -0.035664 |
| 1 | 0.836973 | -1.008668 | 1.134777 | -0.145975 | -0.216743 | -2.222483 |
| 2 | -0.042936 | 0.245040 | -0.420679 | -0.145975 | -0.216743 | -0.035664 |
| 3 | 1.056950 | 0.425752 | -1.198407 | -0.145975 | -0.216743 | -0.035664 |
| 4 | -0.776193 | 1.408066 | 1.134777 | -0.145975 | -0.216743 | -0.035664 |
| ... | ... | ... | ... | ... | ... | ... |
| 32532 | -0.849519 | 0.639678 | 0.745913 | -0.145975 | -0.216743 | -0.197650 |
| 32533 | 0.103716 | -0.335436 | -0.420679 | -0.145975 | -0.216743 | -0.035664 |
| 32534 | 1.423579 | -0.358779 | -0.420679 | -0.145975 | -0.216743 | -0.035664 |
| 32535 | -1.216148 | 0.110930 | -0.420679 | -0.145975 | -0.216743 | -1.655530 |


<div style="border:solid green 2px; padding: 20px">
In this dataset, MinMaxScaler performs Min-Max scaling, which maps each feature onto the range [0, 1], and StandardScaler performs Z-score normalization, which centers each feature at zero with unit variance. The resulting dataframes (df_min_max_scaled and df_z_score_normalized) contain the scaled values of the six numerical features.
</div>
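The two transformations can be written out by hand to make the formulas explicit. The sketch below is a minimal illustration on a toy column, not the notebook's pipeline: the frame `toy` holds four age values borrowed from the outputs above (17 and 90 are the column's minimum and maximum), and the population standard deviation (`ddof=0`) is used because that is what StandardScaler computes. With the age range of 17 to 90, the Min-Max formula gives (39 - 17) / (90 - 17) ≈ 0.3014, in line with the first row of the Min-Max output above.

```python
import pandas as pd

# Toy column with four age values (illustrative subset, not the full data)
toy = pd.DataFrame({"age": [17, 39, 50, 90]})

# Min-Max scaling: (x - min) / (max - min), mapping the column onto [0, 1]
min_max = (toy - toy.min()) / (toy.max() - toy.min())

# Z-score normalization: (x - mean) / std with the population std (ddof=0)
z_score = (toy - toy.mean()) / toy.std(ddof=0)

print(min_max["age"].round(4).tolist())
print(z_score["age"].round(4).tolist())
```

The Z-scores of the toy column differ from those in the notebook output, since the mean and standard deviation here come from four values only.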

## Identify any outliers in numerical features using statistical methods (like Z-score, IQR)

In [17]:
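# Identify data points with Z-scores higher than 3 or lower than -3
# Note: the mask turns every non-outlier cell into NaN, so dropna() keeps
# only rows in which all six columns are outliers at the same time
z_score_outliers = df_z_score_normalized[(df_z_score_normalized > 3) | (df_z_score_normalized < -3)].dropna()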

print("Identified outliers:")
z_score_outliers

Identified outliers:


|   | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week |
| --- | --- | --- | --- | --- | --- | --- |


In [16]:
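# Identify outliers using IQR (numeric columns only; quantiles are undefined for text columns)
numeric = adult.select_dtypes(include='number')
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1
# A row is flagged if at least one numeric value lies outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
outliers_iqr = ((numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))).any(axis=1)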

print("Outliers identified using IQR:")
adult[outliers_iqr]


Outliers identified using IQR:


|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 6 | 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K |
| 8 | 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K |
| 9 | 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32545 | 39 | Local-gov | 111499 | Assoc-acdm | 12 | Married-civ-spouse | Adm-clerical | Wife | White | Female | 0 | 0 | 20 | United-States | >50K |
| 32548 | 65 | Self-emp-not-inc | 99359 | Prof-school | 15 | Never-married | Prof-specialty | Not-in-family | White | Male | 1086 | 0 | 60 | United-States | <=50K |
| 32553 | 32 | Private | 116138 | Masters | 14 | Never-married | Tech-support | Not-in-family | Asian-Pac-Islander | Male | 0 | 0 | 11 | Taiwan | <=50K |
| 32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |


<div style="border:solid green 2px; padding: 20px">
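The Z-score filter returned no rows. This does not show that the data is free of extreme values: because of dropna(), a row is kept only if all six features have a Z-score above 3 or below -3 at the same time. Judging by the summary statistics above, the capital-gain maximum of 99999 lies roughly 13 standard deviations above the mean ((99999 - 1077.65) / 7385.29 ≈ 13.4). With the IQR method, 13554 rows containing at least one outlier were detected. Many of them are flagged because capital-gain and capital-loss have an IQR of 0, so every non-zero value in these columns counts as an outlier.
</div>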
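The difference between requiring every column to be extreme and requiring at least one can be sketched on a toy frame. This is a minimal illustration under stated assumptions: the frame `toy` and its values are made up, and the variable names are hypothetical. It also shows why a column whose first and third quartiles are both 0 gets every non-zero value flagged by the IQR rule.

```python
import pandas as pd

# Hypothetical toy data: ordinary ages, one extreme capital-gain value
toy = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50, 30, 35, 40, 45, 50, 55],
    "capital-gain": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 99999],
})

# Z-scores with the population standard deviation (ddof=0)
z = (toy - toy.mean()) / toy.std(ddof=0)
extreme = z.abs() > 3

# Masking followed by dropna() keeps rows where *every* column is extreme
all_columns = z[extreme].dropna()

# any(axis=1) keeps rows where *at least one* column is extreme
any_column = toy[extreme.any(axis=1)]

# IQR fences per column; for capital-gain Q1 == Q3 == 0, so the fences
# collapse to 0 and the single non-zero value falls outside them
q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
iqr_flags = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)

print("rows kept by dropna():", len(all_columns))
print("rows kept by any(axis=1):", len(any_column))
print(iqr_flags.sum())
```

Counting flags per column, as in the last line, helps to see which features drive the total before deciding how to treat them.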

## Choose an appropriate method to handle these outliers (e.g., trimming, capping, transformation) and apply it to the dataset

In [19]:
# Winsorize every column to handle outliers
# Note: apply() passes the categorical columns to winsorize as well
df_winsorized = adult.apply(lambda x: winsorize(x, limits=[0.05, 0.05]))

# Convert the winsorized array back to a DataFrame
df_winsorized = pd.DataFrame(df_winsorized, columns=adult.columns)

print("Winsorized Dataset:")
df_winsorized

Winsorized Dataset:


|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 39 | Self-emp-not-inc | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 18 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Unmarried | Black | Female | 0 | 0 | 40 | India | <=50K |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32556 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Unmarried | White | Female | 0 | 0 | 38 | United-States | <=50K |
| 32557 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 32558 | 58 | Private | 151910 | HS-grad | 9 | Separated | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |


<div style="border:solid green 2px; padding: 20px">
Outliers can be handled in various ways, and the right choice depends on the nature of the data and the goals of the analysis.
Capping (winsorizing) replaces values beyond a chosen percentile with the value at that percentile, which limits the influence of extreme values on the analysis. Here, winsorize is applied to every column with limits=[0.05, 0.05], so values are capped at the 5th and 95th percentiles. Because apply() also passes the categorical columns, their labels are capped in alphabetical order as well: in row 0, workclass changes from State-gov to Self-emp-not-inc. The function should therefore be restricted to the numerical columns.
</div>
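One way to keep categorical columns out of the capping step is to loop over the numeric columns only. The sketch below is a minimal illustration on a made-up 20-row frame; the frame `toy`, its column names, and the name `capped` are assumptions for the example, not part of the notebook. With 20 rows and limits of 5% on each side, exactly one value is replaced at each end.

```python
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize

# Hypothetical toy frame: one numeric column with an extreme value,
# one categorical column that must stay unchanged
toy = pd.DataFrame({
    "hours": list(range(1, 20)) + [1000],
    "workclass": ["Private"] * 19 + ["State-gov"],
})

capped = toy.copy()
# Winsorize only the numeric columns
for col in capped.select_dtypes(include="number").columns:
    # limits=[0.05, 0.05] replaces the lowest and highest 5% of the values
    # with the nearest remaining value; np.asarray drops the mask wrapper
    capped[col] = np.asarray(winsorize(capped[col].to_numpy(), limits=[0.05, 0.05]))

print(capped["hours"].min(), capped["hours"].max())
print(capped["workclass"].equals(toy["workclass"]))
```

Applied to the notebook's data, the same loop would leave columns such as workclass and native-country as they were read from the file.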