# Binary-Classification---Machine-Learning-Case-Study


### Introduction

You are a Data Scientist working in a Public Policy team. Your team needs a model that predicts, from a person's demographic data, whether that person earns more than $50,000 a year. The prediction will help the team make policy decisions about financial assistance for the low-income group. You are given sample population data along with each person's annual income, which you can use to train your machine learning model.

You can build your model on your own hardware (PC or laptop) and upload only the predictions, in the format shown below.

You are free to use Python to explore the data and build the model.

Instructions for the case study are provided below.

* Build a machine learning model capable of predicting whether an individual's income is greater than 50k.
* The prediction must be based on the data attributes described below.
* Use the 'TrainData' file provided below for building the model.
* Use the 'TestData' file provided below for testing your predictions.

Data attributes description:

* age: continuous.
* workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
* fnlwgt: continuous.
* education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
* education-num: continuous.
* marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
* occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
* relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
* race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
* sex: Female, Male.
* capital-gain: continuous.
* capital-loss: continuous.
* hours-per-week: continuous.
* native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
* income> $50K: binary (target that needs to be predicted).

You may use Python as your tool of choice. You are expected to submit your results in the specified format.

Datasets:

Click the links below to download the train and test data.

TrainData - The train data has 43957 records.

TestData - The test data has 898 records.

1. You can use the train data to build and train your model and perform your prediction using the test data.

2. Once you have the predictions ready, paste them into the IDE in the format below.

id, outcome

0,1

1,0

2,1

3,1

4,0

3. Click the "Run Code" button to view the model accuracy against the test data.

4. The accuracy must be at least 75 percent to pass this test. Once you have reached that accuracy, click the "Save and Submit" button.

------------------------------------------------------------------------------------------------
I have created a Jupyter notebook in VS Code and downloaded both CSV files.
Shall we start?

### Importing libraries

In [12]:
# Import necessary libraries
import pandas as pd

# Preprocessing utilities: imputation, encoding, scaling, and pipeline composition
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

### Loading the data

In [6]:
# File paths (assuming both files are in the same folder as the notebook)
train_data_path = "train.csv"
test_data_path = "test.csv"

# Load the datasets
train_data = pd.read_csv(train_data_path)
test_data = pd.read_csv(test_data_path)

# Display the first few rows of the training data
print("Training Data:")
print(train_data.head())

# Display the first few rows of the test data
print("\nTest Data:")
print(test_data.head())

Training Data:
   age  workclass  fnlwgt     education  educational-num      marital-status  \
0   67    Private  366425     Doctorate               16            Divorced   
1   17    Private  244602          12th                8       Never-married   
2   31    Private  174201     Bachelors               13  Married-civ-spouse   
3   58  State-gov  110199       7th-8th                4  Married-civ-spouse   
4   25  State-gov  149248  Some-college               10       Never-married   

         occupation   relationship   race gender  capital-gain  capital-loss  \
0   Exec-managerial  Not-in-family  White   Male         99999             0   
1     Other-service      Own-child  White   Male             0             0   
2   Exec-managerial        Husband  White   Male             0             0   
3  Transport-moving        Husband  White   Male             0             0   
4     Other-service  Not-in-family  Black   Male             0             0   

   hours-per-week nativ

### Exploratory Data Analysis (EDA) 
Check for missing values and get a sense of the data distribution.

In [7]:
# Check for missing values in the training data
print("Missing values in the training data:")
print(train_data.isnull().sum())

# Check for missing values in the test data
print("\nMissing values in the test data:")
print(test_data.isnull().sum())

# Summary statistics for numerical features
print("\nSummary statistics for numerical features (Training Data):")
print(train_data.describe())

# Summary statistics for categorical features
print("\nCategorical feature distribution (Training Data):")
print(train_data.describe(include=['O']))

Missing values in the training data:
age                   0
workclass          2498
fnlwgt                0
education             0
educational-num       0
marital-status        0
occupation         2506
relationship          0
race                  0
gender                0
capital-gain          0
capital-loss          0
hours-per-week        0
native-country      763
income_>50K           0
dtype: int64

Missing values in the test data:
age                0
workclass          0
fnlwgt             0
education          0
educational-num    0
marital-status     0
occupation         0
relationship       0
race               0
gender             0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
dtype: int64

Summary statistics for numerical features (Training Data):
                age        fnlwgt  educational-num  capital-gain  \
count  43957.000000  4.395700e+04     43957.000000  43957.000000   
mean      38.617149  1.896730e+05        10.074118   
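
Before modeling, it can also help to look at how the target is distributed, because the 75 percent pass mark only means something relative to the majority-class rate. The following is a minimal sketch on a small made-up frame; only the column name `income_>50K` comes from the notebook, and the values are an assumption for illustration.

```python
import pandas as pd

# Toy stand-in for train_data; only the column name matches the notebook.
toy = pd.DataFrame({"income_>50K": [0, 0, 0, 1, 0, 1, 0, 0]})

# Share of each class in the target column (sorted, most common first).
class_share = toy["income_>50K"].value_counts(normalize=True)

# Accuracy of a model that always predicts the most common class.
majority_baseline = class_share.max()

print(class_share)
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

On the real data, `toy` would be replaced with `train_data`. For reference, the validation split used later in this notebook contains 6656 negatives out of 8792 rows, about 76 percent, so a model that always predicts 0 would already sit near the pass mark.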


### Handle missing values

In [8]:
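# Handle missing values
# For categorical features, use the mode to fill missing values.
# Note: only the three columns that contain missing values are listed here.
# ColumnTransformer drops unlisted columns by default (remainder='drop'), so
# education, marital-status, relationship, race, and gender are not used.
categorical_features = ['workclass', 'occupation', 'native-country']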
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# For numerical features, you can also impute missing values if any
numerical_features = ['age', 'fnlwgt', 'educational-num', 'capital-gain', 'capital-loss', 'hours-per-week']
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Preprocessor for both numerical and categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Apply preprocessing to training and test data
X_train = train_data.drop('income_>50K', axis=1)
y_train = train_data['income_>50K']
X_test = test_data

X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# Display the transformed feature shape
print(f"Processed training data shape: {X_train_processed.shape}")
print(f"Processed test data shape: {X_test_processed.shape}")

Processed training data shape: (43957, 69)
Processed test data shape: (899, 69)
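
The 69 columns are the 6 scaled numerical features plus the one-hot categories of `workclass`, `occupation`, and `native-country`. The other categorical columns never reach the model, because `ColumnTransformer` drops anything not listed. One possible variant, sketched here on a made-up four-row frame, selects columns by dtype instead of by name so that every categorical column is encoded. The toy values and the names `toy_X` and `full_preprocessor` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for X_train (values are made up; column names follow the notebook).
toy_X = pd.DataFrame({
    "age": [25, 40, 58, 31],
    "hours-per-week": [40, 50, 38, 45],
    "workclass": ["Private", np.nan, "State-gov", "Private"],
    "education": ["Bachelors", "HS-grad", "Masters", "HS-grad"],
    "gender": ["Male", "Female", "Female", "Male"],
})

numeric_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Select columns by dtype so no categorical column is silently dropped.
full_preprocessor = ColumnTransformer([
    ("num", numeric_pipe, make_column_selector(dtype_include="number")),
    ("cat", categorical_pipe, make_column_selector(dtype_exclude="number")),
])

toy_processed = full_preprocessor.fit_transform(toy_X)
print(toy_processed.shape)  # (4, 9): 2 numeric + 2 + 3 + 2 one-hot columns
```

Applied to the real data, this would produce more than 69 columns, and whether the extra features improve accuracy would need to be checked on the validation split.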


### Model Building
Now that the data is preprocessed, we can proceed with building a machine learning model. Since this is a binary classification task, suitable models include:

* Logistic Regression: Simple and interpretable.
* Random Forest: Can handle complex interactions and provides feature importance.
* Gradient Boosting: Generally more accurate but can be more complex to tune.

We'll start with Logistic Regression as a baseline model. If the performance isn't satisfactory, we can try more complex models.
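
If the baseline turns out not to be enough, the three candidates listed above could be compared on equal footing with cross-validation. The sketch below assumes synthetic data from `make_classification` in place of the notebook's processed features, and the hyperparameters (`n_estimators=50`, `cv=5`) are arbitrary choices for illustration rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the processed features and labels (an assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

# Mean 5-fold cross-validated accuracy for each candidate.
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

In the notebook itself, `X_demo` and `y_demo` would be replaced with `X_train_processed` and `y_train`. The scores on synthetic data say nothing about which model wins on the income data.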

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Split the training data for validation
X_train_split, X_val, y_train_split, y_val = train_test_split(X_train_processed, y_train, test_size=0.2, random_state=42)

# Initialize the Logistic Regression model
model = LogisticRegression(max_iter=1000, random_state=42)

# Train the model
model.fit(X_train_split, y_train_split)

# Make predictions on the validation set
y_pred = model.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
print(f"Validation Accuracy: {accuracy:.4f}")

# Print the classification report
print("Classification Report:")
print(classification_report(y_val, y_pred))

# Print the confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_val, y_pred))


Validation Accuracy: 0.8218
Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.95      0.89      6656
           1       0.73      0.42      0.53      2136

    accuracy                           0.82      8792
   macro avg       0.78      0.69      0.71      8792
weighted avg       0.81      0.82      0.80      8792

Confusion Matrix:
[[6326  330]
 [1237  899]]
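
The numbers in the classification report can be recomputed directly from the confusion matrix, which makes it easier to see where the model falls short. This is a small worked check using the matrix printed above; scikit-learn's convention is that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the validation run: rows = true class, columns = predicted.
cm = np.array([[6326, 330],
               [1237, 899]])

# Unpack as true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()    # share of all rows predicted correctly
precision_pos = tp / (tp + fp)     # of predicted >50K, how many really are
recall_pos = tp / (tp + fn)        # of actual >50K, how many were found

print(f"accuracy={accuracy:.4f} precision={precision_pos:.2f} recall={recall_pos:.2f}")
# accuracy=0.8218 precision=0.73 recall=0.42
```

The accuracy clears the 75 percent requirement, but the model misses 1237 of the 2136 high earners in the validation set. If recall on that class mattered for the policy use case, one option to try would be `LogisticRegression(class_weight='balanced')`, at a likely cost in precision.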


### Generating Predictions on Test Data

In [14]:
# Generate predictions on the test data
test_predictions = model.predict(X_test_processed)

# Format predictions as required
submission = pd.DataFrame({'id': range(len(test_predictions)), 'outcome': test_predictions})

# Save the submission file
submission.to_csv('submission.csv', index=False)

print("Submission file created successfully!")


Submission file created successfully!
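
Before pasting the predictions into the IDE, a quick format check can catch mistakes such as a wrong column name or a missing row. The helper below is a hypothetical sketch (`check_submission` is not part of the notebook), and the rule that ids run from 0 upward is an assumption based on the sample format shown in the instructions.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, expected_rows: int) -> None:
    """Raise AssertionError if the frame does not match the id/outcome format."""
    assert list(submission.columns) == ["id", "outcome"], "unexpected columns"
    assert len(submission) == expected_rows, "row count mismatch"
    assert submission["id"].tolist() == list(range(expected_rows)), "ids must run 0..n-1"
    assert set(submission["outcome"].unique()) <= {0, 1}, "outcome must be 0 or 1"

# Toy stand-in for the notebook's `submission` frame (values are made up).
toy_submission = pd.DataFrame({"id": range(5), "outcome": [1, 0, 1, 1, 0]})
check_submission(toy_submission, expected_rows=5)
print("Submission format looks valid.")
```

In the notebook this could be called as `check_submission(submission, expected_rows=len(test_data))`. Passing `len(test_data)` avoids hardcoding the row count, which matters here because the task text mentions 898 test records while the processed test shape shows 899 rows.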
