## **Lab 09 - Tasks**


Activity 1: Implement the decision tree algorithm on the data given in Table 9.1 and predict whether the
players can play when the weather is overcast and the temperature is mild. Also verify the
results by solving manually.


In [16]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

# convert the given data table into a pandas dataframe
data = {
    'Weather': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Overcast', 'Sunny'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Mild', 'Mild', 'Hot', 'Hot'],
    'Play': ['Yes', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No']
}
df = pd.DataFrame(data)

# Convert categorical data to numeric
df['Weather'] = df['Weather'].map({'Sunny': 0, 'Overcast': 1, 'Rain': 2})
df['Temperature'] = df['Temperature'].map({'Hot': 0, 'Mild': 1, 'Cool': 2})
df['Play'] = df['Play'].map({'Yes': 1, 'No': 0})

# target and feature splitting
X = df[['Weather', 'Temperature']]
y = df['Play']

# split train and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101)

# print shape of train and test data
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

# Create instance of model
model = DecisionTreeClassifier(criterion='entropy')
print("\nmodel: ", model)

# Train the model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Combine actual and predicted values side by side
results = np.column_stack((y_test, y_pred))

# Printing the results
print("\nActual Values  |  Predicted Values")
print("-----------------------------")
for actual, predicted in results:
    print(f"        {actual}      |       {predicted}")


print("\n------------ Model Evaluation --------------------")
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
accuracy = accuracy_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
print(f'Accuracy: {accuracy:.2f}')

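# Predict the outcome for Weather = Overcast (1) and Temperature = Mild (1).
# A DataFrame with the training column names avoids scikit-learn's feature-name warning.
new_sample = pd.DataFrame([[1, 1]], columns=['Weather', 'Temperature'])
prediction = model.predict(new_sample)
# Output the prediction
play_decision = 'Yes' if prediction[0] == 1 else 'No'
print(f'\n\nThe prediction for Weather = Overcast and Temperature = Mild is: {play_decision}')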

Shape of X_train: (8, 2)
Shape of X_test: (2, 2)
Shape of y_train: (8,)
Shape of y_test: (2,)

model:  DecisionTreeClassifier(criterion='entropy')

Actual Values  |  Predicted Values
-----------------------------
        0      |       1
        0      |       1

------------ Model Evaluation --------------------
Mean Squared Error: 1.0
Root Mean Squared Error: 1.0
Accuracy: 0.00


The prediction for Weather = Overcast and Temperature = Mild is: Yes




### Calculate the Entropy of the Target Variable (Play)

Entropy(S) = - p(Yes) \* log2(p(Yes)) - p(No) \* log2(p(No))

Entropy(S) = - 0.5 \* log2(0.5) - 0.5 \* log2(0.5) = 1

### Calculate Information Gain for each Attribute

**Weather = Sunny**

- Play = Yes: 2
- Play = No: 2
- Entropy = - (2/4) \* log2(2/4) - (2/4) \* log2(2/4) = 1

**Weather = Overcast**

- Play = Yes: 0
- Play = No: 3
- Entropy = - (0/3) \* log2(0/3) - (3/3) \* log2(3/3) = 0 (taking 0 \* log2(0) = 0 by convention)

**Weather = Rain**

- Play = Yes: 3
- Play = No: 0
- Entropy = - (3/3) \* log2(3/3) - (0/3) \* log2(0/3) = 0

Entropy(Weather) = 4/10 \* 1 + 3/10 \* 0 + 3/10 \* 0 = 0.4

IG(Weather) = 1−0.4 = 0.6

**Temperature = Hot**

- Play = Yes: 2
- Play = No: 3
- Entropy = - (2/5) \* log2(2/5) - (3/5) \* log2(3/5) ≈ 0.971

**Temperature = Mild**

- Play = Yes: 3
- Play = No: 0
- Entropy = - (3/3) \* log2(3/3) - (0/3) \* log2(0/3) = 0

**Temperature = Cool**

- Play = Yes: 0
- Play = No: 2
- Entropy = - (0/2) \* log2(0/2) - (2/2) \* log2(2/2) = 0

Entropy(Temperature) = 5/10 \* 0.971 + 3/10 \* 0 + 2/10 \* 0 ≈ 0.485

IG(Temperature) = 1−0.485 ≈ 0.515

### Choose the Attribute with the Highest Information Gain

Weather has the highest information gain (0.6), so it is chosen as the decision node.
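The entropy and information gain figures can be cross-checked numerically. The following is a minimal sketch, assuming the same ten rows as the DataFrame in the code cell; the helper names `entropy` and `information_gain` are illustrative and not part of any library.

```python
import math
from collections import Counter

import pandas as pd

# Same ten rows as the lab's DataFrame (before numeric encoding)
df = pd.DataFrame({
    "Weather": ["Sunny", "Sunny", "Overcast", "Rain", "Sunny",
                "Overcast", "Rain", "Rain", "Overcast", "Sunny"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool",
                    "Cool", "Mild", "Mild", "Hot", "Hot"],
    "Play": ["Yes", "Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No", "No"],
})


def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    counts = Counter(labels)
    total = len(labels)
    # Counter never stores zero counts, so log2(0) cannot occur here
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(frame, attribute, target="Play"):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    total = len(frame)
    weighted = sum(
        len(subset) / total * entropy(subset[target])
        for _, subset in frame.groupby(attribute)
    )
    return entropy(frame[target]) - weighted


gains = {attr: information_gain(df, attr) for attr in ["Weather", "Temperature"]}
best = max(gains, key=gains.get)
print(gains, best)
```

Splitting on every distinct value of an attribute, as done here, mirrors the hand calculation; scikit-learn's `DecisionTreeClassifier` instead uses binary threshold splits, so its tree need not have the same shape.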

### Build the Decision Tree and Predict

**Weather = Sunny**

- Subset: {(Hot, Yes), (Hot, Yes), (Cool, No), (Hot, No)}
- Decision: If Temperature = Hot -> Yes (2 Yes, 1 No); If Temperature = Cool -> No

**Weather = Overcast**

- Subset: {(Hot, No), (Cool, No), (Hot, No)}
- Decision: No

**Weather = Rain**

- Subset: {(Mild, Yes), (Mild, Yes), (Mild, Yes)}
- Decision: Yes

For the given condition (Weather = Overcast, Temperature = Mild):

- According to the decision tree, the Weather = Overcast branch leads to "No" (regardless of temperature).
- From the dataset, all instances of "Overcast" result in "No".
- Thus, if the weather is Overcast and the temperature is Mild, the result is "No".

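The prediction is that the players cannot play. The scikit-learn run above returned Yes instead: that model was fitted on only 8 of the 10 rows and on integer-encoded features, and the combination (Overcast, Mild) never occurs in the table, so the learned tree can route this sample through a different split than the hand-built one.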
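The hand-built tree can also be written down directly and queried. This is only a sketch of the tree derived above, not a learned model; `TREE` and `predict` are illustrative names.

```python
# Hand-derived tree: the root splits on Weather, and only the Sunny
# branch needs a second split on Temperature.
TREE = {
    "Sunny": {"Hot": "Yes", "Cool": "No"},  # Hot is a 2-to-1 majority vote
    "Overcast": "No",
    "Rain": "Yes",
}


def predict(weather, temperature):
    """Walk the hand-built tree; returns None for a Sunny case never seen in the table."""
    branch = TREE[weather]
    if isinstance(branch, str):
        return branch  # leaf reached at the Weather level
    return branch.get(temperature)


print(predict("Overcast", "Mild"))
```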


Activity 2: Implement the decision tree algorithm on any dataset taken from Kaggle.com and predict a
new entry entered by the user.


In [7]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

# Reading the file
df = pd.read_csv("./data/heart_v2.csv")
df.info()

# Checking first five records
print(df.head())

# Prepare the data
X = df.drop('heart disease', axis=1)
y = df['heart disease']

# split train and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=141)

# print shape of train and test data
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

# Create instance of model
model = DecisionTreeClassifier(criterion='entropy')
print("\nmodel: ", model)

# Train the model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Combine actual and predicted values side by side
results = np.column_stack((y_test, y_pred))

# Printing the results
print("\nActual Values  |  Predicted Values")
print("-----------------------------")
for actual, predicted in results:
    print(f"        {actual}      |       {predicted}")


print("\n------------ Model Evaluation --------------------")
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
accuracy = accuracy_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
print(f'Accuracy: {accuracy:.2f}')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 270 entries, 0 to 269
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   age            270 non-null    int64
 1   sex            270 non-null    int64
 2   BP             270 non-null    int64
 3   cholestrol     270 non-null    int64
 4   heart disease  270 non-null    int64
dtypes: int64(5)
memory usage: 10.7 KB
   age  sex   BP  cholestrol  heart disease
0   70    1  130         322              1
1   67    0  115         564              0
2   57    1  124         261              1
3   64    1  128         263              0
4   74    0  120         269              0
Shape of X_train: (216, 4)
Shape of X_test: (54, 4)
Shape of y_train: (216,)
Shape of y_test: (54,)

model:  DecisionTreeClassifier(criterion='entropy')

Actual Values  |  Predicted Values
-----------------------------
        0      |       0
        0      |       1
        0      |       1
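The task also asks for a prediction on a new entry supplied by the user, which the cell above does not do. Below is a minimal sketch of one way to do it. It trains on just the five rows printed by `df.head()` so that it is self-contained (the lab itself trains on the full CSV), and `FEATURES`, `predict_entry` and `read_entry_from_user` are hypothetical names introduced for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["age", "sex", "BP", "cholestrol"]  # column names as spelled in the CSV

# The five rows shown by df.head(); a stand-in for the full training set
sample = pd.DataFrame({
    "age": [70, 67, 57, 64, 74],
    "sex": [1, 0, 1, 1, 0],
    "BP": [130, 115, 124, 128, 120],
    "cholestrol": [322, 564, 261, 263, 269],
    "heart disease": [1, 0, 1, 0, 0],
})
model = DecisionTreeClassifier(criterion="entropy", random_state=0)
model.fit(sample[FEATURES], sample["heart disease"])


def predict_entry(fitted_model, values):
    """Predict the class for one record given as a list in FEATURES order."""
    entry = pd.DataFrame([values], columns=FEATURES)
    return int(fitted_model.predict(entry)[0])


def read_entry_from_user():
    """Prompt for each feature in turn (hypothetical helper, not called here)."""
    return [int(input(f"{name}: ")) for name in FEATURES]


result = predict_entry(model, [70, 1, 130, 322])
print(f"Predicted class: {result}")
```

In the notebook, `predict_entry(model, read_entry_from_user())` would apply the same idea to the model trained on `X_train`.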
     

Activity 3:
Implement the Naïve Bayes algorithm on the dataset given in Table 9.2 to predict whether the
players can play when the Outlook is Rain, the Temperature is Cool, the Humidity is High and the
Wind is Strong. Also verify the results by solving manually.


In [17]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# convert the given data table into a pandas dataframe
data = {
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Overcast', 'Sunny'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Mild', 'Mild', 'Hot', 'Hot'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'Normal'],
    'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Weak', 'Strong'],
    'Play': ['Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'No']
}

df = pd.DataFrame(data)

# Encode categorical variables
label_encoders = {}
for column in df.columns:
    label_encoders[column] = LabelEncoder()
    df[column] = label_encoders[column].fit_transform(df[column])

# target and feature splitting
X = df[['Outlook', 'Temperature', 'Humidity', 'Wind']]
y = df['Play']

# split train and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101)

# print shape of train and test data
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

# Create instance of model
model = GaussianNB()
print("\nmodel: ", model)

# Train the model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Combine actual and predicted values side by side
results = np.column_stack((y_test, y_pred))

# Printing the results
print("\nActual Values  |  Predicted Values")
print("-----------------------------")
for actual, predicted in results:
    print(f"        {actual}      |       {predicted}")


print("\n------------ Model Evaluation --------------------")
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
accuracy = accuracy_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
print(f'Accuracy: {accuracy:.2f}')

# Predict for the given condition: Outlook=Rain, Temperature=Cool, Humidity=High, Wind=Strong
test_data = pd.DataFrame({
    'Outlook': ['Rain'],
    'Temperature': ['Cool'],
    'Humidity': ['High'],
    'Wind': ['Strong']
})

# Encode test data
for column in test_data.columns:
    test_data[column] = label_encoders[column].transform(test_data[column])

# Make prediction
prediction = model.predict(test_data)
play_decision = label_encoders['Play'].inverse_transform(prediction)[0]
print(f'\n\nThe prediction for Outlook=Rain, Temperature=Cool, Humidity=High, Wind=Strong is: {play_decision}')

Shape of X_train: (8, 4)
Shape of X_test: (2, 4)
Shape of y_train: (8,)
Shape of y_test: (2,)

model:  GaussianNB()

Actual Values  |  Predicted Values
-----------------------------
        0      |       0
        0      |       1

------------ Model Evaluation --------------------
Mean Squared Error: 0.5
Root Mean Squared Error: 0.7071067811865476
Accuracy: 0.50


The prediction for Outlook=Rain, Temperature=Cool, Humidity=High, Wind=Strong is: Yes


Manual Solution:

Calculate the prior probabilities from the 10 rows used in the code above

- P(Yes) = Number of Yes / Total Samples = 5/10
- P(No) = Number of No / Total Samples = 5/10

Calculate the likelihoods for Play=Yes

- P(Outlook=Rain∣Play=Yes) = 2/5
- P(Temperature=Cool∣Play=Yes) = 1/5
- P(Humidity=High∣Play=Yes) = 4/5
- P(Wind=Strong∣Play=Yes) = 2/5

Calculate the likelihoods for Play=No

- P(Outlook=Rain∣Play=No) = 1/5
- P(Temperature=Cool∣Play=No) = 1/5
- P(Humidity=High∣Play=No) = 1/5
- P(Wind=Strong∣Play=No) = 2/5

Calculate the posterior probabilities using Bayes' Theorem

P(Yes∣Outlook=Rain,Temperature=Cool,Humidity=High,Wind=Strong)∝P(Yes)×P(Outlook=Rain∣Yes)×P(Temperature=Cool∣Yes)×P(Humidity=High∣Yes)×P(Wind=Strong∣Yes)

P(Yes∣Outlook=Rain,Temperature=Cool,Humidity=High,Wind=Strong)∝(5/10)×(2/5)×(1/5)×(4/5)×(2/5) = 8/625 = 0.0128

P(No∣Outlook=Rain,Temperature=Cool,Humidity=High,Wind=Strong)∝P(No)×P(Outlook=Rain∣No)×P(Temperature=Cool∣No)×P(Humidity=High∣No)×P(Wind=Strong∣No)

P(No∣Outlook=Rain,Temperature=Cool,Humidity=High,Wind=Strong)∝(5/10)×(1/5)×(1/5)×(1/5)×(2/5) = 1/625 = 0.0016

Since the score for Yes (0.0128) is larger than the score for No (0.0016), the prediction is Yes.
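The hand calculation can be cross-checked by counting frequencies directly. The sketch below assumes the ten rows from the code cell and applies no smoothing; `naive_bayes_scores` is an illustrative helper, not a library function. It differs from the `GaussianNB` model used above, which treats the label-encoded columns as continuous values.

```python
import pandas as pd

# Same ten rows as the lab's DataFrame (before label encoding)
df = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Sunny",
                "Overcast", "Rain", "Rain", "Overcast", "Sunny"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool",
                    "Cool", "Mild", "Mild", "Hot", "Hot"],
    "Humidity": ["High", "High", "High", "High", "Normal",
                 "Normal", "Normal", "High", "Normal", "Normal"],
    "Wind": ["Weak", "Strong", "Weak", "Weak", "Weak",
             "Strong", "Strong", "Weak", "Weak", "Strong"],
    "Play": ["Yes", "Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No", "No"],
})


def naive_bayes_scores(frame, query, target="Play"):
    """Return the unnormalised score P(class) * prod P(value | class) per class."""
    scores = {}
    for label, subset in frame.groupby(target):
        score = len(subset) / len(frame)  # prior P(class)
        for column, value in query.items():
            # likelihood P(column = value | class), estimated from raw counts
            score *= (subset[column] == value).mean()
        scores[label] = float(score)
    return scores


query = {"Outlook": "Rain", "Temperature": "Cool", "Humidity": "High", "Wind": "Strong"}
scores = naive_bayes_scores(df, query)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```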


Activity 4:
Consider the dataset given in Table 9.3. Implement the Naïve Bayes algorithm to classify the instance
(age = youth, income = medium, student = yes, credit_rating = fair). Also verify the results by solving manually.


In [18]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# convert the given data table into a pandas dataframe
data = {
    'age': ['youth', 'youth', 'middle_aged', 'senior', 'senior', 'senior', 'middle_aged', 'youth', 'youth', 'senior', 'youth', 'middle_aged', 'middle_aged', 'senior'],
    'income': ['high', 'high', 'high', 'medium', 'low', 'low', 'low', 'medium', 'low', 'medium', 'medium', 'medium', 'high', 'medium'],
    'student': ['no', 'no', 'no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no'],
    'credit_rating': ['fair', 'excellent', 'fair', 'fair', 'fair', 'excellent', 'excellent', 'fair', 'fair', 'fair', 'excellent', 'excellent', 'fair', 'excellent'],
    'buys_computer': ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
}

df = pd.DataFrame(data)

# Encode categorical variables
label_encoders = {}
for column in df.columns:
    label_encoders[column] = LabelEncoder()
    df[column] = label_encoders[column].fit_transform(df[column])

# target and feature splitting
X = df[['age', 'income', 'student', 'credit_rating']]
y = df['buys_computer']

# split train and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101)

# print shape of train and test data
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

# Create instance of model
model = GaussianNB()
print("\nmodel: ", model)

# Train the model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Combine actual and predicted values side by side
results = np.column_stack((y_test, y_pred))

# Printing the results
print("\nActual Values  |  Predicted Values")
print("-----------------------------")
for actual, predicted in results:
    print(f"        {actual}      |       {predicted}")


print("\n------------ Model Evaluation --------------------")
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
accuracy = accuracy_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"Root Mean Squared Error: {rmse}")
print(f'Accuracy: {accuracy:.2f}')

# Predict for the given instance
new_instance = pd.DataFrame({
    'age': ['youth'],
    'income': ['medium'],
    'student': ['yes'],
    'credit_rating': ['fair']
})

# Encode the new instance
for column in new_instance.columns:
    new_instance[column] = label_encoders[column].transform(
        new_instance[column])

# Make the prediction
prediction = model.predict(new_instance)
prediction_label = label_encoders['buys_computer'].inverse_transform(
    prediction)
print(f'\n\nThe prediction for age=youth, income=medium, student=yes, credit_rating=fair is: {prediction_label[0]}')

Shape of X_train: (11, 4)
Shape of X_test: (3, 4)
Shape of y_train: (11,)
Shape of y_test: (3,)

model:  GaussianNB()

Actual Values  |  Predicted Values
-----------------------------
        1      |       1
        1      |       0
        1      |       0

------------ Model Evaluation --------------------
Mean Squared Error: 0.6666666666666666
Root Mean Squared Error: 0.816496580927726
Accuracy: 0.33


The prediction for age=youth, income=medium, student=yes, credit_rating=fair is: yes


Manual Solution:

Calculate the prior probabilities

- P(Yes) = Number of Yes / Total Samples = 9/14
- P(No) = Number of No / Total Samples = 5/14

Calculate the likelihoods for buys_computer=Yes

- P(age=youth∣buys_computer=Yes) = 2/9
- P(income=medium∣buys_computer=Yes) = 4/9
- P(student=yes∣buys_computer=Yes) = 6/9
- P(credit_rating=fair∣buys_computer=Yes) = 6/9

Calculate the likelihoods for buys_computer=No

- P(age=youth∣buys_computer=No) = 3/5
- P(income=medium∣buys_computer=No) = 2/5
- P(student=yes∣buys_computer=No) = 1/5
- P(credit_rating=fair∣buys_computer=No) = 2/5

Calculate the posterior probabilities using Bayes' Theorem

P(Yes∣age=youth,income=medium,student=yes,credit_rating=fair)∝P(Yes)×P(age=youth∣Yes)×P(income=medium∣Yes)×P(student=yes∣Yes)×P(credit_rating=fair∣Yes)

P(Yes∣age=youth,income=medium,student=yes,credit_rating=fair)∝(9/14)×(2/9)×(4/9)×(6/9)×(6/9) ≈ 0.0282

P(No∣age=youth,income=medium,student=yes,credit_rating=fair)∝P(No)×P(age=youth∣No)×P(income=medium∣No)×P(student=yes∣No)×P(credit_rating=fair∣No)

P(No∣age=youth,income=medium,student=yes,credit_rating=fair)∝(5/14)×(3/5)×(2/5)×(1/5)×(2/5) ≈ 0.0069

Since the score for Yes (0.0282) is larger than the score for No (0.0069), the prediction is Yes.
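For categorical features, scikit-learn's `CategoricalNB` follows the frequency-count form of Naïve Bayes more closely than `GaussianNB` does. The sketch below is a swapped-in alternative to the cell above, not the lab's method. It assumes the 14 rows from the code cell, fits on all of them (no train/test split), and sets the smoothing constant `alpha` to a tiny value so the estimates approximate raw counts.

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Same 14 rows as the lab's DataFrame
df = pd.DataFrame({
    "age": ["youth", "youth", "middle_aged", "senior", "senior", "senior", "middle_aged",
            "youth", "youth", "senior", "youth", "middle_aged", "middle_aged", "senior"],
    "income": ["high", "high", "high", "medium", "low", "low", "low",
               "medium", "low", "medium", "medium", "medium", "high", "medium"],
    "student": ["no", "no", "no", "no", "yes", "yes", "yes",
                "no", "yes", "yes", "yes", "no", "yes", "no"],
    "credit_rating": ["fair", "excellent", "fair", "fair", "fair", "excellent", "excellent",
                      "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent"],
    "buys_computer": ["no", "no", "yes", "yes", "yes", "no", "yes",
                      "no", "yes", "yes", "yes", "yes", "yes", "no"],
})
features = ["age", "income", "student", "credit_rating"]

# Map each category to an integer code, as CategoricalNB expects
encoder = OrdinalEncoder()
X = encoder.fit_transform(df[features]).astype(int)
y = df["buys_computer"]

# alpha is the Laplace smoothing constant; near zero approximates unsmoothed counts
model = CategoricalNB(alpha=1e-9)
model.fit(X, y)

query = pd.DataFrame([["youth", "medium", "yes", "fair"]], columns=features)
query_encoded = encoder.transform(query).astype(int)
prediction = model.predict(query_encoded)[0]
posterior = dict(zip(model.classes_, model.predict_proba(query_encoded)[0]))
print(prediction, posterior)
```

`predict_proba` returns normalised posteriors, that is, each class score divided by the sum of both scores, so the values are larger than the unnormalised products but rank the classes the same way.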


Activity 5:
Implement the Naïve Bayes algorithm on any dataset taken from Kaggle.com and predict a
new entry entered by the user.


In [44]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder, RobustScaler
import category_encoders as ce

# load the dataset
df = pd.read_csv("./data/adult.csv")
display(df.head())
print(df.info())
display(df.describe())

# find categorical variables
categorical = [var for var in df.columns if df[var].dtype == 'O']
print('\nThe categorical variables are :\n\n', categorical)

# Print unique values for each categorical variable
for var in categorical:
    print(f'\nUnique values for {var}: {df[var].unique()}')

# find numerical variables
numerical = [var for var in df.columns if df[var].dtype != 'O']
print('\nThe numerical variables are :', numerical)

# target and feature splitting
X = df.drop(['income'], axis=1)
y = df['income']

# split train and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)


# print shape of train and test data
print(f"\nShape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

# encode categorical variables with one-hot encoding
# (the column list must be passed as cols=; the first positional argument is verbose)
encoder = ce.OneHotEncoder(cols=['workclass', 'education', 'marital-status', 'occupation', 'relationship',
                                 'race', 'gender', 'native-country'])
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)
cols = X_train.columns
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# pass cols directly; wrapping it in a list would create MultiIndex columns
X_train = pd.DataFrame(X_train, columns=cols)
X_test = pd.DataFrame(X_test, columns=cols)

# Create instance of model
model = GaussianNB()
print("\nmodel: ", model)

# Train the model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Combine actual and predicted values side by side
results = np.column_stack((y_test, y_pred))

# Printing the results
print("\nActual Values  |  Predicted Values")
print("-----------------------------")
for actual, predicted in results:
    print(f"        {actual}      |       {predicted}")


print("\n------------ Model Evaluation --------------------")
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Create validation data
validation_data = {
    'age': [25, 45, 30, 60],
    'workclass': ['Private', 'Self-emp-not-inc', 'Local-gov', 'State-gov'],
    'fnlwgt': [233516, 187454, 101320, 456123],
    'education': ['HS-grad', 'Bachelors', 'Masters', '10th'],
    'educational-num': [9, 13, 14, 6],
    'marital-status': ['Never-married', 'Married-civ-spouse', 'Divorced', 'Widowed'],
    'occupation': ['Adm-clerical', 'Exec-managerial', 'Prof-specialty', 'Other-service'],
    'relationship': ['Not-in-family', 'Husband', 'Unmarried', 'Other-relative'],
    'race': ['White', 'Black', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo'],
    'gender': ['Male', 'Female', 'Male', 'Female'],
    'capital-gain': [0, 5000, 1500, 0],
    'capital-loss': [0, 0, 0, 2000],
    'hours-per-week': [40, 60, 40, 20],
    'native-country': ['United-States', 'United-States', 'India', 'Mexico']
}

validation_df = pd.DataFrame(validation_data)
display(validation_df)

# Encode and scale the new validation data
validation_df_encoded = encoder.transform(validation_df)
validation_df_scaled = scaler.transform(validation_df_encoded)

validation_df_scaled = pd.DataFrame(
    validation_df_scaled, columns=cols)
display(validation_df_scaled)

# Predict using the model
validation_predictions = model.predict(validation_df_scaled)

# Combine the validation data with the predictions
validation_results = validation_df.copy()
validation_results['Predicted Income'] = validation_predictions

# Printing the results
print("\nValidation Set Predictions:")
display(validation_results)

|   | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB
None


|   | age | fnlwgt | educational-num | capital-gain | capital-loss | hours-per-week |
| --- | --- | --- | --- | --- | --- | --- |
| count | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 | 48842.0 |
| mean | 38.643585 | 189664.1 | 10.078089 | 1079.067626 | 87.502314 | 40.422382 |
| std | 13.71051 | 105604.0 | 2.570973 | 7452.019058 | 403.004552 | 12.391444 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117550.5 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178144.5 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237642.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1490400.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |



The categorical variables are :

 ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'native-country', 'income']

Unique values for workclass: ['Private' 'Local-gov' '?' 'Self-emp-not-inc' 'Federal-gov' 'State-gov'
 'Self-emp-inc' 'Without-pay' 'Never-worked']

Unique values for education: ['11th' 'HS-grad' 'Assoc-acdm' 'Some-college' '10th' 'Prof-school'
 '7th-8th' 'Bachelors' 'Masters' 'Doctorate' '5th-6th' 'Assoc-voc' '9th'
 '12th' '1st-4th' 'Preschool']

Unique values for marital-status: ['Never-married' 'Married-civ-spouse' 'Widowed' 'Divorced' 'Separated'
 'Married-spouse-absent' 'Married-AF-spouse']

Unique values for occupation: ['Machine-op-inspct' 'Farming-fishing' 'Protective-serv' '?'
 'Other-service' 'Prof-specialty' 'Craft-repair' 'Adm-clerical'
 'Exec-managerial' 'Tech-support' 'Sales' 'Priv-house-serv'
 'Transport-moving' 'Handlers-cleaners' 'Armed-Forces']

Unique values for relationship: ['Own-child' 'Husband' 'Not-in-family'

|   | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 25 | Private | 233516 | HS-grad | 9 | Never-married | Adm-clerical | Not-in-family | White | Male | 0 | 0 | 40 | United-States |
| 1 | 45 | Self-emp-not-inc | 187454 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | Black | Female | 5000 | 0 | 60 | United-States |
| 2 | 30 | Local-gov | 101320 | Masters | 14 | Divorced | Prof-specialty | Unmarried | Asian-Pac-Islander | Male | 1500 | 0 | 40 | India |
| 3 | 60 | State-gov | 456123 | 10th | 6 | Widowed | Other-service | Other-relative | Amer-Indian-Eskimo | Female | 0 | 2000 | 20 | Mexico |


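|   | age | workclass_1 | workclass_2 | workclass_3 | workclass_4 | workclass_5 | workclass_6 | workclass_7 | workclass_8 | workclass_9 | ... | native-country_33 | native-country_34 | native-country_35 | native-country_36 | native-country_37 | native-country_38 | native-country_39 | native-country_40 | native-country_41 | native-country_42 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -0.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.4 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | -0.35 | 0.0 | -1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1.15 | 0.0 | -1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |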



Validation Set Predictions:


|   | age | workclass | fnlwgt | education | educational-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | Predicted Income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 25 | Private | 233516 | HS-grad | 9 | Never-married | Adm-clerical | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 45 | Self-emp-not-inc | 187454 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | Black | Female | 5000 | 0 | 60 | United-States | >50K |
| 2 | 30 | Local-gov | 101320 | Masters | 14 | Divorced | Prof-specialty | Unmarried | Asian-Pac-Islander | Male | 1500 | 0 | 40 | India | <=50K |
| 3 | 60 | State-gov | 456123 | 10th | 6 | Widowed | Other-service | Other-relative | Amer-Indian-Eskimo | Female | 0 | 2000 | 20 | Mexico | <=50K |
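The encode, fit and predict steps of this activity can also be bundled into a single scikit-learn pipeline, so that new entries go through exactly the same preprocessing as the training data. The sketch below is an alternative to the `category_encoders` approach used above, not the lab's method. It uses only three columns from the five rows printed by `df.head()` to stay self-contained, and the `new_rows` values are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Three feature columns from the five rows shown by df.head()
train = pd.DataFrame({
    "age": [25, 38, 28, 44, 18],
    "workclass": ["Private", "Private", "Local-gov", "Private", "?"],
    "hours-per-week": [40, 50, 40, 40, 30],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})
X, y = train.drop(columns="income"), train["income"]

preprocess = ColumnTransformer(
    # unseen categories in new data become all-zero one-hot rows
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["workclass"])],
    remainder="passthrough",  # numeric columns pass through unchanged
    sparse_threshold=0,       # force a dense array, which GaussianNB requires
)
pipeline = make_pipeline(preprocess, GaussianNB())
pipeline.fit(X, y)

# Hypothetical new entry; "State-gov" does not occur in the five training rows
new_rows = pd.DataFrame(
    {"age": [30], "workclass": ["State-gov"], "hours-per-week": [40]}
)
predictions = pipeline.predict(new_rows)
print(predictions)
```

Because the encoder lives inside the pipeline, a single `pipeline.predict` call replaces the separate `encoder.transform` and `scaler.transform` steps applied to the validation data.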
