<hr>

#  **Importing Libraries**

**Configuration Libraries**

In [2]:
import warnings
warnings.filterwarnings("ignore")

**Classical Python Libraries**

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

**Machine Learning Libraries**

In [4]:
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import AdaBoostClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# **Mounting Google Drive**

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


<hr>

# **Data Loading Phase**

**Data Loading**

In [6]:
df = pd.read_csv("/content/drive/MyDrive/census-income.csv", na_values = "?", skipinitialspace = True)
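The two extra arguments in the call above matter for how missing values are detected. The snippet below is a minimal sketch on a hypothetical three-row sample (the `raw` text and its column subset are made up for illustration, and it assumes the real file separates fields with a comma followed by a space). It shows `skipinitialspace=True` stripping the space after each comma so that `?` matches `na_values` and is parsed as `NaN`.

```python
import io
import pandas as pd

# Hypothetical sample, not the real file: fields are separated by ", "
# and a missing value is written as "?".
raw = io.StringIO(
    "age, workclass, annual_income\n"
    "39, State-gov, <=50K\n"
    "54, ?, >50K\n"
    "28, Private, <=50K\n"
)

# skipinitialspace drops the blank after each comma, so " ?" becomes "?"
# and is then recognised as a missing value through na_values.
sample = pd.read_csv(raw, na_values="?", skipinitialspace=True)

print(sample.columns.tolist())           # column names without leading spaces
print(sample["workclass"].isna().sum())  # count of "?" entries parsed as NaN
```

Without `skipinitialspace`, the value would be read as `" ?"` with a leading space and would likely not be treated as missing.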

**Data Inspection**

In [7]:
df.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,annual_income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


**Class Balance**

In [8]:
df["annual_income"].value_counts(normalize = True)

annual_income,proportion
<=50K,0.75919
>50K,0.24081


**The >50K class may be predicted less accurately because it has fewer data points (about 24% of the rows).**

**Shape Inspection**

In [9]:
a = df.shape
print(f"Rows: {a[0]} & Columns: {a[1]}")

Rows: 32561 & Columns: 15


# **Data Preprocessing**

**Null values check**

In [10]:
df.isnull().sum()

column,null_count
age,0
workclass,1836
fnlwgt,0
education,0
education-num,0
marital-status,0
occupation,1843
relationship,0
race,0
sex,0


In [11]:
df = df.dropna()

**Duplicates**

In [12]:
df.duplicated().sum()

23

In [13]:
df = df.drop_duplicates()

**Encoding**

In [14]:
encoder = LabelEncoder()

# Label-encode every text column; the single encoder is refit on each column in turn
for x in df.columns:
  if df[x].dtype == "object":
    df[x] = encoder.fit_transform(df[x])
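Because the loop above refits one encoder on every column, only the mapping for the last encoded column is kept. The sketch below is one possible variation, not the notebook's method: it stores a separate `LabelEncoder` per column in a hypothetical `encoders` dictionary so the integer codes can be translated back to labels. The `toy` frame is made-up illustration data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for the census data.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "annual_income": ["<=50K", ">50K", "<=50K"],
    "age": [39, 28, 50],
})

encoders = {}  # column name -> fitted LabelEncoder (illustrative helper)
for col in toy.columns:
    # Encode only the non-numeric (text) columns.
    if not pd.api.types.is_numeric_dtype(toy[col]):
        enc = LabelEncoder()
        toy[col] = enc.fit_transform(toy[col])
        encoders[col] = enc

# classes_ is sorted, so position 0 is "<=50K" and position 1 is ">50K".
print(list(encoders["annual_income"].classes_))
print(list(encoders["sex"].inverse_transform([0, 1])))
```

Since `LabelEncoder` sorts the labels, `<=50K` maps to 0 and `>50K` maps to 1, which is how the class labels in the classification report can be read.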

**Feature Division**

In [15]:
# Feature Data
X = df.drop("annual_income", axis = 1)

# Target data
Y = df["annual_income"]

**Data Division**

In [16]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 43)
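Given the class imbalance noted earlier, one option worth considering is a stratified split, which keeps the class proportions the same in the train and test sets. The sketch below uses made-up labels (`X_demo` and `y_demo` are hypothetical) rather than the census data. It is an alternative to the split above, and using it would change the reported scores.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 75% class 0, 25% class 1.
y_demo = np.array([0] * 750 + [1] * 250)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo asks for the same class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=43, stratify=y_demo
)

# Share of class 1 in each split; both should stay close to 0.25.
print(y_tr.mean(), y_te.mean())
```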

# **Model Building and Evaluation**

**Model Creation**

In [17]:
model = AdaBoostClassifier(n_estimators = 100, learning_rate = 0.7)

**Fitting the data**

In [18]:
model.fit(x_train, y_train)

**Predictions**

In [19]:
pred = model.predict(x_test)

**Evaluate**
  * **Accuracy: the fraction of all test samples that were predicted correctly. It summarizes overall performance and does not show whether the model performs well on each individual class.**

In [20]:
print(f"Accuracy Score: {accuracy_score(y_test, pred)}")

Accuracy Score: 0.8565582835655828


In [21]:
print(f"Classification Report: \n{classification_report(y_test, pred)}")

Classification Report: 
              precision    recall  f1-score   support

           0       0.88      0.94      0.91      6749
           1       0.78      0.60      0.68      2293

    accuracy                           0.86      9042
   macro avg       0.83      0.77      0.79      9042
weighted avg       0.85      0.86      0.85      9042
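To put the accuracy figure in context, the support counts in the report above can be used to work out what a trivial model would score. This is a small worked calculation, assuming a baseline that always predicts the majority class (class 0); the variable names are illustrative.

```python
# Support counts copied from the classification report above.
support_class_0 = 6749
support_class_1 = 2293

total = support_class_0 + support_class_1

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = support_class_0 / total
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

The baseline comes out at roughly 0.746, so the model's accuracy of about 0.857 is an improvement of around 11 percentage points over always predicting class 0.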



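**The F1 score is a better metric than accuracy for judging this model because the classes are imbalanced. It is the harmonic mean of the precision and recall of a particular class, and here it is 0.91 for class 0 but only 0.68 for class 1.**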
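As a quick check on the harmonic-mean definition, the F1 values in the report can be recomputed from the rounded precision and recall figures. The helper `f1_from` below is a hypothetical name introduced for illustration, not a scikit-learn function.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall values from the classification report above.
f1_class_0 = f1_from(0.88, 0.94)
f1_class_1 = f1_from(0.78, 0.60)

print(f"class 0: {f1_class_0:.2f}, class 1: {f1_class_1:.2f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the low recall of class 1 (0.60) drags its F1 down even though its precision is 0.78.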