# **DATA EXPLORATION AND PREPROCESSING**

In [2]:
import pandas as pd

data = pd.read_csv('adult_with_headers.csv')
data.shape

(32561, 15)

In [74]:
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [75]:
data.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64

In [76]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [77]:
data[data.duplicated()]

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
4881,25,Private,308144,Bachelors,13,Never-married,Craft-repair,Not-in-family,White,Male,0,0,40,Mexico,<=50K
5104,90,Private,52386,Some-college,10,Never-married,Other-service,Not-in-family,Asian-Pac-Islander,Male,0,0,35,United-States,<=50K
9171,21,Private,250051,Some-college,10,Never-married,Prof-specialty,Own-child,White,Female,0,0,10,United-States,<=50K
11631,20,Private,107658,Some-college,10,Never-married,Tech-support,Not-in-family,White,Female,0,0,10,United-States,<=50K
13084,25,Private,195994,1st-4th,2,Never-married,Priv-house-serv,Not-in-family,White,Female,0,0,40,Guatemala,<=50K
15059,21,Private,243368,Preschool,1,Never-married,Farming-fishing,Not-in-family,White,Male,0,0,50,Mexico,<=50K
17040,46,Private,173243,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K
18555,30,Private,144593,HS-grad,9,Never-married,Other-service,Not-in-family,Black,Male,0,0,40,?,<=50K
18698,19,Private,97261,HS-grad,9,Never-married,Farming-fishing,Not-in-family,White,Male,0,0,40,United-States,<=50K
21318,19,Private,138153,Some-college,10,Never-married,Adm-clerical,Own-child,White,Female,0,0,10,United-States,<=50K


In [78]:
data[data.duplicated()].shape

(24, 15)

In [3]:
data.drop_duplicates(inplace = True)

In [4]:
data.shape

(32537, 15)

# **ENCODING TECHNIQUES**

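# When to use One-Hot Encoding vs Label Encoding

**One-hot encoding:**

* Use when the categories are nominal, i.e. they have no natural order.
* Use when you want each category to become a separate binary feature, so the model cannot read an order or a distance into the codes.
* Best suited to variables with a modest number of categories, because every category adds a column. If new categories may appear later, `OneHotEncoder(handle_unknown='ignore')` encodes them as all zeros.

**Label encoding:**

* Use when the categories are ordinal, so that integer codes reflect a real order (for example, education levels).
* Use for the target variable of a classifier, or for features fed to tree-based models, which are less sensitive to the arbitrary order of the codes.
* Avoid for nominal features in linear or distance-based models, since the integers imply an order and spacing that do not exist. Note that `LabelEncoder` raises an error on labels it did not see during fitting.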

**Example:**

Suppose you have a dataset of customers, and you want to encode the customer's gender. There are two possible values for gender: male and female.

If you use one-hot encoding, you would create two new features: `male` and `female`. The `male` feature would be set to 1 for male customers and 0 for female customers. The `female` feature would be set to 1 for female customers and 0 for male customers.

If you use label encoding, you would simply assign the values 0 and 1 to the male and female categories, respectively.

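In this case, either one-hot encoding or label encoding would be appropriate, because a binary variable has no order to misrepresent. However, if there were more than two possible values for gender, such as male, female, and non-binary, then one-hot encoding would be the better choice: label codes 0, 1, 2 would impose an arbitrary order on the categories.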
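The gender example above can be sketched in a few lines. The toy `customers` frame and its values are assumptions made for illustration and are not part of the Adult dataset. Note that scikit-learn's `LabelEncoder` sorts the classes, so in this sketch `female` maps to 0 and `male` to 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, used only to illustrate the two encodings
customers = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(customers["gender"]).astype(int)

# Label encoding: one integer per category (classes are sorted alphabetically)
le = LabelEncoder()
labels = le.fit_transform(customers["gender"])

print(one_hot)
print(labels)        # integer code for each row
print(le.classes_)   # position in this array is the integer code
```

With only two categories, the `male` column of the one-hot result carries the same information as the label codes, which is why either encoding works for a binary variable.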

In [81]:
data['workclass'].unique() # categorical variable with 9 categories

array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
       ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay',
       ' Never-worked'], dtype=object)

In [82]:
data['education'].unique()

array([' Bachelors', ' HS-grad', ' 11th', ' Masters', ' 9th',
       ' Some-college', ' Assoc-acdm', ' Assoc-voc', ' 7th-8th',
       ' Doctorate', ' Prof-school', ' 5th-6th', ' 10th', ' 1st-4th',
       ' Preschool', ' 12th'], dtype=object)

In [83]:
data['marital_status'].unique()

array([' Never-married', ' Married-civ-spouse', ' Divorced',
       ' Married-spouse-absent', ' Separated', ' Married-AF-spouse',
       ' Widowed'], dtype=object)

In [84]:
data['occupation'].unique()

array([' Adm-clerical', ' Exec-managerial', ' Handlers-cleaners',
       ' Prof-specialty', ' Other-service', ' Sales', ' Craft-repair',
       ' Transport-moving', ' Farming-fishing', ' Machine-op-inspct',
       ' Tech-support', ' ?', ' Protective-serv', ' Armed-Forces',
       ' Priv-house-serv'], dtype=object)
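The `unique()` outputs above show that every category starts with a space and that `workclass` and `occupation` contain a `' ?'` entry. This is presumably why `isnull().sum()` reported no missing values earlier. A minimal sketch of one way to clean this up follows; the three-row `sample` frame is hypothetical and only mimics the raw values.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample mimicking the raw values shown above
sample = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "occupation": [" Sales", " ?", " Adm-clerical"],
    "age": [38, 54, 39],
})

# Strip surrounding whitespace from the text columns
text_cols = ["workclass", "occupation"]
sample[text_cols] = sample[text_cols].apply(lambda s: s.str.strip())

# Treat the '?' placeholder as a real missing value
sample = sample.replace("?", np.nan)

print(sample.isnull().sum())
```

Whether to impute or drop the resulting missing values is a separate decision that depends on the model.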

## Label Encoder

In [85]:
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
data['income'] = labelencoder.fit_transform(data['income'])  # encode only the target variable
data

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,0
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,1
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,0
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,0


# Is label encoding applied to only target variables or all features?

- In scikit-learn, `LabelEncoder` is designed for the target variable: it takes a one-dimensional array of labels and maps each class to an integer. For a binary target such as income (`<=50K` / `>50K`), this gives the 0/1 coding that classification algorithms expect.

- It can also be applied column by column to categorical features, as in the next cell, but the integers then follow the alphabetical order of the categories. That is acceptable for ordinal features whose order happens to match, or for tree-based models, but it can mislead linear and distance-based models, which treat the codes as magnitudes. For features, `OrdinalEncoder` (with an explicit category order) or one-hot encoding is usually the safer choice.

- Ultimately, whether to label-encode all features or only the target depends on the type of each feature (nominal or ordinal) and on the machine learning algorithm being used.
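The distinction above can be sketched as follows, assuming a toy frame with one ordered feature and a binary target. The `toy` frame and the chosen education order are illustrative assumptions, not the notebook's actual preprocessing.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy frame with one ordered feature and a binary target
toy = pd.DataFrame({
    "education": ["HS-grad", "Bachelors", "Masters", "HS-grad"],
    "income": ["<=50K", "<=50K", ">50K", "<=50K"],
})

# Feature: spell out the order explicitly so the integers are meaningful
edu_order = [["HS-grad", "Bachelors", "Masters"]]
ord_enc = OrdinalEncoder(categories=edu_order)
toy["education_code"] = ord_enc.fit_transform(toy[["education"]]).ravel().astype(int)

# Target: LabelEncoder works on the 1-D label array
lab_enc = LabelEncoder()
toy["income_code"] = lab_enc.fit_transform(toy["income"])

print(toy)
```

Without the explicit `categories` argument, the codes would follow alphabetical order, placing `Bachelors` below `HS-grad`.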

In [86]:
# Apply label encoding to every categorical feature, not just the target

# Keep one fitted encoder per column so the codes can be mapped back later
encoders = {}
for col in data.select_dtypes(include='object').columns:
    encoders[col] = LabelEncoder()
    data[col] = encoders[col].fit_transform(data[col])

# Show the encoded data
data.head()


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,0
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,0
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,0
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,0
4,28,4,338409,9,13,2,10,5,2,0,0,0,40,5,0


## One Hot Encoding

In [87]:
from sklearn.preprocessing import OneHotEncoder

data3 = pd.read_csv('adult_with_headers.csv')
data3.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [88]:
# Applying one hot encoding

# Create a OneHotEncoder object
enc = OneHotEncoder(handle_unknown='ignore')

# Encoding Target variable
enc_df = pd.DataFrame(enc.fit_transform(data3[['income']]).toarray())

enc_df.head()

Unnamed: 0,0,1
0,1.0,0.0
1,1.0,0.0
2,1.0,0.0
3,1.0,0.0
4,1.0,0.0


# What does `handle_unknown='ignore'` mean?

- The `handle_unknown='ignore'` argument tells the `OneHotEncoder` what to do when `transform` meets a category that was not present when the encoder was fitted. Instead of raising an error, the encoder sets all of that feature's one-hot columns to 0 for the affected row.

- This is useful when new categories may appear in future data (for example, a country that is absent from the training set). It does not reduce the number of columns created for the categories seen during fitting.
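A small sketch of this behaviour, using made-up category values: the encoder is fitted on two categories and then asked to transform one known and one unseen value.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on two known categories (hypothetical values)
train = np.array([["Private"], ["State-gov"]], dtype=object)
enc_demo = OneHotEncoder(handle_unknown="ignore")
enc_demo.fit(train)

# Transform one known and one unseen category
new = np.array([["Private"], ["Never-worked"]], dtype=object)
encoded = enc_demo.transform(new).toarray()
print(encoded)
```

The row for the unseen category comes back as all zeros. With the default `handle_unknown='error'`, the same call would raise a `ValueError`.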



In [89]:
# merge with main df
data_final = data3.iloc[:, 0:14].join(enc_df)
data_final

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,0,1
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,1.0,0.0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,1.0,0.0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,1.0,0.0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,1.0,0.0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,1.0,0.0
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,0.0,1.0
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,1.0,0.0
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,1.0,0.0


In [90]:
enc = OneHotEncoder(handle_unknown='ignore')

# Encoding all the variables
enc_df = pd.DataFrame(enc.fit_transform(data3).toarray())

enc_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22136,22137,22138,22139,22140,22141,22142,22143,22144,22145
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


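- `OneHotEncoder` returns an unlabelled array, so the dummy columns above carry no meaningful names (they can be recovered with `get_feature_names_out()`).
- Passing the whole DataFrame also one-hot encodes the numeric columns, which is why the result has 22,146 columns.
- Hence, it is more convenient to use the `get_dummies()` function from pandas, which names the dummy columns and leaves numeric columns untouched.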

# One-Hot Encoding using Pandas

In [91]:
import pandas as pd
data1 = pd.read_csv('adult_with_headers.csv')

data_encoded = pd.get_dummies(data1, drop_first=True).astype('int')  # drop_first: drops the first dummy column of each categorical feature
data_encoded

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,...,native_country_ Puerto-Rico,native_country_ Scotland,native_country_ South,native_country_ Taiwan,native_country_ Thailand,native_country_ Trinadad&Tobago,native_country_ United-States,native_country_ Vietnam,native_country_ Yugoslavia,income_ >50K
0,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,50,83311,13,0,0,13,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,38,215646,9,0,0,40,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
3,53,234721,7,0,0,40,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
4,28,338409,13,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,257302,12,0,0,38,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
32557,40,154374,9,0,0,40,0,0,0,1,...,0,0,0,0,0,0,1,0,0,1
32558,58,151910,9,0,0,40,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
32559,22,201490,9,0,0,20,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0


In [92]:
data_encoded.shape

(32561, 101)

# Deciding which scaling is good for my dataset

There are a few factors to consider when deciding which scaling method is best for your dataset:

* **The distribution of your data.** Standard scaling centers each feature at zero with unit variance and works well when features are roughly bell-shaped. Neither method removes skew; a heavily skewed feature is usually better served by a transformation (such as a log) before scaling.
* **The presence of outliers.** Min-max scaling is more sensitive to outliers than standard scaling, because a single extreme value sets the minimum or maximum and squeezes all other values into a narrow part of the [0, 1] range.
* **The type of model you are training.** Some algorithms are more sensitive to feature scale than others. Distance-based methods (k-nearest neighbours, SVMs), regularized linear models, and models trained with gradient descent are affected by scale, whereas decision trees are not.

To decide which scaling method is best for your dataset, you can try both and compare model performance on held-out data, for example with cross-validation.
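The effect of an outlier on the two methods can be seen by applying their formulas by hand. The values in `x` are invented for illustration.

```python
import numpy as np

# Hypothetical feature with one extreme value
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max scaling: (x - min) / (max - min), result lies in [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standard scaling: (x - mean) / std, result has mean 0 and unit variance
x_standard = (x - x.mean()) / x.std()

print(x_minmax.round(3))
print(x_standard.round(3))
```

After min-max scaling, the four ordinary values are all squeezed below 0.05 because the single value of 100 defines the top of the range.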

# Min-Max Scaler vs Standard Scaler

In [93]:
# Comparing the performance of min-max scaling and standard scaling on my dataset:

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Separate the features and target variable
X = data_encoded.drop('income_ >50K', axis=1)
y = data_encoded['income_ >50K']

# Scale the data using min-max scaling
scaler = MinMaxScaler()
X_minmax = scaler.fit_transform(X)

# Scale the data using standard scaling
scaler = StandardScaler()
X_standard = scaler.fit_transform(X)

# Train a linear regression model on the min-max scaled data
model_minmax = LinearRegression()
model_minmax.fit(X_minmax, y)

# Train a linear regression model on the standard scaled data
model_standard = LinearRegression()
model_standard.fit(X_standard, y)

# Evaluate the performance of the models
y_pred_minmax = model_minmax.predict(X_minmax)
mse_minmax = mean_squared_error(y, y_pred_minmax)

y_pred_standard = model_standard.predict(X_standard)
mse_standard = mean_squared_error(y, y_pred_standard)

# Print the results
print('Min-max scaling:') # Mean Squared Error
print('MSE:', mse_minmax)

print('Standard scaling:')
print('MSE:', mse_standard)


Min-max scaling:
MSE: 0.11534084320536112
Standard scaling:
MSE: 0.11534992586125534


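- The two MSE values differ only in the fifth decimal place, so neither scaler is meaningfully better here. Ordinary least squares gives the same fitted values under any linear rescaling of the features, so the small gap is most likely numerical noise. Note also that the MSE is computed on the training data and that `income_ >50K` is a binary target, so a classifier evaluated on held-out data would give a more informative comparison.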
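One caveat when reading this comparison: for plain linear regression, the choice of scaler should not change the fitted values at all, apart from floating-point error. The sketch below checks this on synthetic data (an assumption for illustration, not the Adult dataset).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic data with features on very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])
y_demo = X_demo @ np.array([0.5, -0.02, 0.001]) + rng.normal(size=200)

preds = {}
for name, scaler in [("minmax", MinMaxScaler()), ("standard", StandardScaler())]:
    X_scaled = scaler.fit_transform(X_demo)
    preds[name] = LinearRegression().fit(X_scaled, y_demo).predict(X_scaled)

# The two sets of fitted values agree up to floating-point error
max_diff = np.max(np.abs(preds["minmax"] - preds["standard"]))
print(max_diff)
```

A scaler comparison becomes informative with models that do depend on feature scale, such as regularized regression or k-nearest neighbours.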

# **FEATURE ENGINEERING**

In [94]:
data4 = pd.read_csv('adult_with_headers.csv')
data4.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [95]:
# Create two new features on the data4 DataFrame that could help the model

import numpy as np

# "hours_per_week_squared": the square of "hours_per_week", which lets a linear
# model capture a non-linear relationship between hours worked and income.
data4["hours_per_week_squared"] = data4["hours_per_week"] ** 2

# "age_group": "age" binned into "young" (17-35), "middle-aged" (36-55) and
# "old" (56-100), so the model can learn a different effect for each group.
bins = [16, 35, 55, 100]
labels = ["young", "middle-aged", "old"]
data4["age_group"] = pd.cut(data4["age"], bins, labels=labels)

# Log-transform "capital_gain": the feature is skewed, and log1p(x) = log(1 + x)
# compresses the long tail while mapping 0 to 0.
data4["capital_gain"] = np.log1p(data4["capital_gain"])


In [96]:
# Display new features created

data4.head()


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income,hours_per_week_squared,age_group
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,7.684784,0,40,United-States,<=50K,1600,middle-aged
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0,13,United-States,<=50K,169,middle-aged
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0,40,United-States,<=50K,1600,middle-aged
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0,40,United-States,<=50K,1600,middle-aged
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0,40,Cuba,<=50K,1600,young


- The new features created are `hours_per_week_squared` and `age_group`; in addition, `capital_gain` has been log-transformed in place.
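To make the bin edges and the log transform concrete, here is a small check on hand-picked values. The ages and gains are chosen for illustration; only 2174 appears in the rows shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges
ages = pd.Series([17, 35, 36, 55, 56, 90])
groups = pd.cut(ages, bins=[16, 35, 55, 100], labels=["young", "middle-aged", "old"])
print(groups.tolist())

# log(1 + x): a zero gain stays at zero, large gains are compressed
gains = pd.Series([0, 2174, 99999])
log_gains = np.log1p(gains)
print(log_gains.round(6).tolist())
```

Because `pd.cut` uses right-closed intervals by default, an age of 35 falls in `young` and 36 in `middle-aged`. An age of 16 or below would fall outside the first bin and become `NaN`.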

# **FEATURE SELECTION**

## Isolation Forest

In [97]:
from sklearn.ensemble import IsolationForest

In [98]:
data_encoded

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,...,native_country_ Puerto-Rico,native_country_ Scotland,native_country_ South,native_country_ Taiwan,native_country_ Thailand,native_country_ Trinadad&Tobago,native_country_ United-States,native_country_ Vietnam,native_country_ Yugoslavia,income_ >50K
0,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,50,83311,13,0,0,13,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,38,215646,9,0,0,40,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
3,53,234721,7,0,0,40,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
4,28,338409,13,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,257302,12,0,0,38,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
32557,40,154374,9,0,0,40,0,0,0,1,...,0,0,0,0,0,0,1,0,0,1
32558,58,151910,9,0,0,40,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
32559,22,201490,9,0,0,20,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0


In [105]:
data_encoded.shape

(32561, 102)

In [100]:
# training the model
clf = IsolationForest(random_state=10, contamination=.01)
clf.fit(data_encoded)



In [101]:
# predictions

y_pred = clf.predict(data_encoded)

In [102]:
# -1 for outliers and 1 for inliers
y_pred

array([1, 1, 1, ..., 1, 1, 1])

In [103]:
data_encoded.loc[y_pred==-1]

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,...,native_country_ Puerto-Rico,native_country_ Scotland,native_country_ South,native_country_ Taiwan,native_country_ Thailand,native_country_ Trinadad&Tobago,native_country_ United-States,native_country_ Vietnam,native_country_ Yugoslavia,income_ >50K
157,71,494223,10,0,1816,2,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
162,44,78374,14,0,0,40,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
404,28,166481,4,0,2179,40,0,0,0,1,...,1,0,0,0,0,0,0,0,0,0
655,29,71592,10,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
668,20,114746,7,0,1762,40,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32341,74,199136,13,15831,0,8,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
32370,53,137547,15,27828,0,40,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
32401,52,143533,4,0,0,40,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
32428,39,110622,13,0,0,40,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [104]:
data_encoded['scores'] = clf.decision_function(data_encoded)

In [107]:
data_encoded['anamoly'] = clf.predict(data_encoded.iloc[:,0:101])

In [108]:
data_encoded

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,...,native_country_ South,native_country_ Taiwan,native_country_ Thailand,native_country_ Trinadad&Tobago,native_country_ United-States,native_country_ Vietnam,native_country_ Yugoslavia,income_ >50K,scores,anamoly
0,39,77516,13,2174,0,40,0,0,0,0,...,0,0,0,0,1,0,0,0,0.049153,1
1,50,83311,13,0,0,13,0,0,0,0,...,0,0,0,0,1,0,0,0,0.070523,1
2,38,215646,9,0,0,40,0,0,0,1,...,0,0,0,0,1,0,0,0,0.099192,1
3,53,234721,7,0,0,40,0,0,0,1,...,0,0,0,0,1,0,0,0,0.067907,1
4,28,338409,13,0,0,40,0,0,0,1,...,0,0,0,0,0,0,0,0,0.022195,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,257302,12,0,0,38,0,0,0,1,...,0,0,0,0,1,0,0,0,0.051770,1
32557,40,154374,9,0,0,40,0,0,0,1,...,0,0,0,0,1,0,0,1,0.114494,1
32558,58,151910,9,0,0,40,0,0,0,1,...,0,0,0,0,1,0,0,0,0.091817,1
32559,22,201490,9,0,0,20,0,0,0,1,...,0,0,0,0,1,0,0,0,0.106683,1


In [109]:
# print the outlier datapoints

data_encoded[data_encoded['anamoly']==-1]

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,...,native_country_ South,native_country_ Taiwan,native_country_ Thailand,native_country_ Trinadad&Tobago,native_country_ United-States,native_country_ Vietnam,native_country_ Yugoslavia,income_ >50K,scores,anamoly
157,71,494223,10,0,1816,2,0,0,0,0,...,0,0,0,0,1,0,0,0,-0.046031,-1
162,44,78374,14,0,0,40,0,0,0,0,...,0,0,0,0,1,0,0,0,-0.002579,-1
404,28,166481,4,0,2179,40,0,0,0,1,...,0,0,0,0,0,0,0,0,-0.012194,-1
655,29,71592,10,0,0,40,0,0,0,0,...,0,0,0,0,0,0,0,0,-0.007577,-1
668,20,114746,7,0,1762,40,0,0,0,0,...,1,0,0,0,0,0,0,0,-0.011104,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32341,74,199136,13,15831,0,8,0,0,0,0,...,0,0,0,0,0,0,0,1,-0.017565,-1
32370,53,137547,15,27828,0,40,0,0,0,0,...,0,0,0,0,0,0,0,1,-0.027747,-1
32401,52,143533,4,0,0,40,0,1,0,0,...,0,0,0,0,1,0,0,0,-0.020090,-1
32428,39,110622,13,0,0,40,1,0,0,0,...,0,0,0,0,0,0,0,0,-0.026650,-1


In [110]:
data_encoded[data_encoded['anamoly']==-1].shape

(326, 103)
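The count of 326 follows from `contamination=.01`: roughly 1% of the 32,561 rows are flagged. The relationship between `predict` and `decision_function` can be sketched on synthetic data; the cluster and the far-away point below are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic 2-D data: a tight cluster plus one far-away point
rng = np.random.default_rng(10)
X_demo = np.vstack([rng.normal(0, 1, size=(200, 2)), [[12.0, 12.0]]])

iso = IsolationForest(random_state=10, contamination=0.01)
labels = iso.fit_predict(X_demo)          # -1 = outlier, 1 = inlier
scores = iso.decision_function(X_demo)    # negative score <=> label -1

print((labels == -1).sum(), "points flagged")
print(labels[-1], scores[-1])
```

The `contamination` value sets the score threshold, so it fixes the share of flagged rows in advance rather than discovering it from the data.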

# How outliers can affect model performance

- Outliers can have a significant impact on model performance. This is because outliers can skew the distribution of the data, which can make it difficult for the model to learn the underlying relationships between the features and the target variable.

- For example, suppose you are training a linear regression model to predict the price of a house. If there are a few houses in the dataset that are much more expensive than the other houses, these outliers can cause the model to overestimate the price of the other houses.

- Outliers can also make it difficult for the model to generalize to new data. This is because the model may learn to make predictions that are specific to the outliers, which may not be applicable to other data points.

## Here are some ways that outliers can affect model performance:

* **Increased variance:** Outliers can increase the variance of the model, which can make it more likely to overfit the data.
* **Biased estimates:** Outliers can pull the fitted parameters towards themselves, which biases the predictions for typical data points.
* **Increased sensitivity to noise:** Outliers can make the model more sensitive to noise in the data.
* **Decreased interpretability:** Outliers can make it more difficult to interpret the model, as they can introduce unexpected patterns into the data.

It is important to be aware of the potential impact of outliers on model performance. If you suspect that there are outliers in your data, you should take steps to address them, such as removing them from the dataset or transforming them.
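The house-price example above can be illustrated with a least-squares line fitted by `np.polyfit`. The sizes and prices are invented so that the clean data lie exactly on a line.

```python
import numpy as np

# Hypothetical house data: price is exactly 100 * size
size = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
price = 100.0 * size

slope_clean, intercept_clean = np.polyfit(size, price, deg=1)

# Replace the last price with an extreme value
price_out = price.copy()
price_out[-1] = 2000.0
slope_out, intercept_out = np.polyfit(size, price_out, deg=1)

print(round(slope_clean, 2), round(slope_out, 2))
```

A single extreme price changes the fitted slope from 100 to 400, so the line no longer describes the four ordinary houses well.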


# PPS Score

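- The Predictive Power Score (PPS) measures how well one variable predicts another, on a scale from 0 (no predictive power) to 1 (perfect prediction). Unlike correlation, it works for both numerical and categorical variables and can detect non-linear relationships.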

In [1]:
# install the package

!pip install ppscore



In [5]:
import ppscore as pps

In [8]:
pps.score(data, 'occupation','income')

{'x': 'occupation',
 'y': 'income',
 'ppscore': 0.04651070335074244,
 'case': 'classification',
 'is_valid_score': True,
 'metric': 'weighted F1',
 'baseline_score': 0.6463617026132602,
 'model_score': 0.6628096685564765,
 'model': DecisionTreeClassifier()}

In [7]:
import warnings
warnings.filterwarnings('ignore')

pps.matrix(data)

Unnamed: 0,x,y,ppscore,case,is_valid_score,metric,baseline_score,model_score,model
0,age,age,1.000000e+00,predict_itself,True,,0.000000,1.000000,
1,age,workclass,2.090038e-02,classification,True,weighted F1,0.581447,0.590195,DecisionTreeClassifier()
2,age,fnlwgt,0.000000e+00,regression,True,mean absolute error,75736.243800,77225.235712,DecisionTreeRegressor()
3,age,education,6.124723e-02,classification,True,weighted F1,0.189800,0.239423,DecisionTreeClassifier()
4,age,education_num,0.000000e+00,regression,True,mean absolute error,1.865200,1.906863,DecisionTreeRegressor()
...,...,...,...,...,...,...,...,...,...
220,income,capital_gain,0.000000e+00,regression,True,mean absolute error,1165.066200,1839.292230,DecisionTreeRegressor()
221,income,capital_loss,0.000000e+00,regression,True,mean absolute error,80.707000,151.364703,DecisionTreeRegressor()
222,income,hours_per_week,0.000000e+00,regression,True,mean absolute error,7.451000,8.029614,DecisionTreeRegressor()
223,income,native_country,2.340301e-07,classification,True,weighted F1,0.850030,0.850030,DecisionTreeClassifier()


In [9]:
data.corr(numeric_only=True)

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
age,1.0,-0.076447,0.036224,0.077676,0.057745,0.068515
fnlwgt,-0.076447,1.0,-0.043388,0.000429,-0.01026,-0.018898
education_num,0.036224,-0.043388,1.0,0.122664,0.079892,0.148422
capital_gain,0.077676,0.000429,0.122664,1.0,-0.031639,0.078408
capital_loss,0.057745,-0.01026,0.079892,-0.031639,1.0,0.054229
hours_per_week,0.068515,-0.018898,0.148422,0.078408,0.054229,1.0


- The major differences between the PPS matrix and the correlation matrix are:
- PPS can be computed between categorical and numerical variables, and between two categorical variables.
- The correlation matrix works only on numerical variables.
- PPS is asymmetric (the score for x predicting y can differ from the score for y predicting x), whereas the correlation matrix is symmetric.
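The asymmetry can be illustrated without the `ppscore` package. The `rough_pps` helper below is a hand-rolled, simplified analogue loosely modeled on the regression case (a decision tree compared with a median baseline on mean absolute error); it is not the library's implementation, and the data are synthetic.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

def rough_pps(feature, target):
    """PPS-like score: 1 - MAE(tree) / MAE(median baseline), floored at 0."""
    cv = KFold(n_splits=4, shuffle=True, random_state=0)
    pred = cross_val_predict(
        DecisionTreeRegressor(random_state=0),
        feature.reshape(-1, 1), target, cv=cv,
    )
    mae_model = np.mean(np.abs(target - pred))
    mae_naive = np.mean(np.abs(target - np.median(target)))
    return max(0.0, 1.0 - mae_model / mae_naive)

x = np.linspace(-1, 1, 201)
y = x ** 2  # y is fully determined by x, but x is not determined by y

pearson = np.corrcoef(x, y)[0, 1]
score_xy = rough_pps(x, y)   # how well x predicts y
score_yx = rough_pps(y, x)   # how well y predicts x
print(pearson, score_xy, score_yx)
```

Here the Pearson correlation is essentially zero even though `x` predicts `y` almost perfectly, while `y` says little about the sign of `x`, so the two directions score very differently.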
