# Problem Statement:
Census income data plays an important role in democratic government and strongly
affects economic sectors. Governments use census figures to allocate federal
funding to states and localities. Census data is also used for postcensal
population estimates and projections, for economic and social science research,
and for many other applications, so accurate data and accurate predictions matter.
The aim is to build awareness of how income affects not only the lives of
individual citizens but also the nation and its development. You will examine data
extracted from the 1994 Census Bureau database and look for insights into how
various features affect an individual's income.
The data contains approximately 32,000 observations with 15 variables.
The strategy is to analyze the data and perform a classification task: predict
whether an individual makes more than 50K a year using a logistic regression
algorithm.


In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# A raw string already keeps backslashes literal, so they are not doubled.
df = pd.read_csv(r'P:\DA DS AI\IIT M Data Science & AI\Assignments\Logistics-Regression-Assignment\census-income.csv')
df.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,annual_income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [2]:
# 1. How many types of occupations do we have?
# a. 13
# b. 14
# c. 15
# d. 11

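# Note: '?' marks an unknown occupation, so the 15 distinct values are
# 14 named occupations plus that placeholder.
print(df['occupation'].unique(), "\n")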
print("List of available occupations:", df['occupation'].nunique(), "\n")
print('(Q1) Answer: c. 15')

['Adm-clerical' 'Exec-managerial' 'Handlers-cleaners' 'Prof-specialty'
 'Other-service' 'Sales' 'Craft-repair' 'Transport-moving'
 'Farming-fishing' 'Machine-op-inspct' 'Tech-support' '?'
 'Protective-serv' 'Armed-Forces' 'Priv-house-serv'] 

List of available occupations: 15 

(Q1) Answer: c. 15


In [3]:
# 2. How many people are working as tech support and have an annual income greater than 50k?
# a. 278
# b. 389
# c. 289
# d. 934

print(df[(df['occupation'] == 'Tech-support') & (df['annual_income'] == '>50K')].shape[0])
print('(Q2) Answer: 283, None of the above given values are correct')

283
(Q2) Answer: 283, None of the above given values are correct


In [4]:
# 3. How many total missing values are present in the dataset?
# a. 4262
# b. 5000
# c. 5349
# d. 4302

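# isna() and isnull() are aliases, so both lines report NaN cells only.
# Note: '?' placeholders (see the occupation values above) are ordinary strings
# and are not counted here.
print("Count of Missing Values in each Column:\n", df.isna().sum(), '\n')
print("Sum of Null Values in each Column:\n", df.isnull().sum(), '\n')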
print('(Q3) Answer: 0 missing values, None of the above given values are correct')

Count of Missing Values in each Column:
 age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
annual_income     0
dtype: int64 

Sum of Null Values in each Column:
 age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
annual_income     0
dtype: int64 

(Q3) Answer: 0 missing values, None of the above given values are correct
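
The checks above count only NaN cells. The occupation column printed earlier contains a '?' entry, which looks like a placeholder for an unknown value, and `isna()` does not treat it as missing. Below is a minimal sketch of counting such placeholders. It assumes '?' is the only marker used and runs on a small made-up frame (`toy`), not on the census file:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with '?' standing in for unknown entries.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov", "Private"],
    "occupation": ["Sales", "?", "?", "Tech-support"],
    "age": [39, 50, 38, 53],
})

# isna() sees no missing values because '?' is an ordinary string.
nan_count = int(toy.isna().sum().sum())

# Convert the placeholder to NaN, then count per column and in total.
cleaned = toy.replace("?", np.nan)
missing_per_column = cleaned.isna().sum()
total_missing = int(missing_per_column.sum())

print(nan_count)
print(missing_per_column)
print(total_missing)
```

Applying the same `replace` to `df` before calling `isna().sum()` would show how many placeholder cells the census data actually holds.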


In [5]:
# 4. If there are missing values in the Marital Status column, which option among the following should be 
# used for replacing the missing values:
# a. Mean
# b. Median
# c. Mode
# d. All of the above

print('(Q4) Answer: c. Mode')

(Q4) Answer: c. Mode
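
Marital status is a categorical column, so a mean or median is not defined for it and the most frequent label is the usual fill value. Here is a small sketch of mode imputation on a made-up series (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries.
marital = pd.Series(
    ["Married", "Never-married", np.nan, "Married", "Divorced", np.nan],
    name="marital-status",
)

# mode() returns a Series because ties are possible; take the first value.
most_frequent = marital.mode(dropna=True).iloc[0]
filled = marital.fillna(most_frequent)

print(most_frequent)
print(filled.tolist())
```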


In [6]:
# 5. How many people are having private work classes and are not from the United States of America?
# a. 2151
# b. 2300
# c. 2000
# d. 2190

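# Note: any rows whose native-country is the '?' placeholder also satisfy
# != 'United-States' and are included in this count.
print(df[(df['workclass'] == 'Private' ) & (df['native-country'] != 'United-States')].shape[0])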
print('(Q5) Answer: 2561, None of the above given values are correct')

2561
(Q5) Answer: 2561, None of the above given values are correct


In [7]:
# 6. How many people are either having Annual Income(last column) less than or
# equal to 50k or their working hours is greater than or equal to 40 hrs:
# a. 23008
# b. 23448
# c. 29505
# d. 25903

print(df[(df['annual_income'] == '<=50K') | (df['hours-per-week'] >= 40)].shape[0])
print('(Q6) Answer: 31823, None of the above given values are correct')

31823
(Q6) Answer: 31823, None of the above given values are correct


In [8]:
# 7. Which of the following methods can you use for handling outliers?
# a. Interquartile Range(IQR) Method
# b. Z Score method
# c. Both of the above methods
# d. None of the above

print('(Q7) Answer: c. Both of the above methods')

(Q7) Answer: c. Both of the above methods
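
Both methods named above can be sketched in a few lines of NumPy. The values are made up, and the cutoffs (1.5 times the IQR, and a z-score of 2.5) are conventional choices assumed for this illustration, not something the assignment prescribes:

```python
import numpy as np

# Hypothetical hours-per-week values with one extreme entry.
values = np.array([38, 40, 40, 41, 39, 40, 42, 38, 40, 99], dtype=float)

# IQR method: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Z-score method: flag points far from the mean in standard-deviation units.
# A cutoff of 3 is common for large samples; 2.5 is used here because the
# toy sample is tiny.
z_scores = (values - values.mean()) / values.std()
z_mask = np.abs(z_scores) > 2.5

print(values[iqr_mask])
print(values[z_mask])
```

The IQR rule depends only on quartiles, so it is less affected by the extreme value itself than the z-score, whose mean and standard deviation are both pulled toward the outlier.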


In [9]:
# 8. Chi-square is used to analyze:
# a. Determine the relationship b/w the variables
# b. Compare observed results with expected results
# c. both a and b
# d. None of the above

print('(Q8) Answer: c. both a and b')

(Q8) Answer: c. both a and b
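
The two uses in the question are the same calculation: expected counts are derived under the assumption that the variables are independent, and the statistic measures how far the observed counts fall from them. Here is a hand-rolled sketch on a hypothetical 2x2 table (the counts are invented for illustration):

```python
import numpy as np

# Hypothetical contingency table: rows = sex, columns = income bracket.
observed = np.array([[90, 10],
                     [60, 40]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected counts if the two variables were independent.
expected = row_totals @ col_totals / grand_total

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells.
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected)
print(chi2, dof)
```

This sketch applies no continuity correction, so a library routine that applies one to 2x2 tables by default may report a somewhat smaller statistic for the same counts.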


In [10]:
# 9. What is VIF?
# a. It can detect multicollinearity
# b. If the VIF value is greater than 10, then there is no correlation between
# the independent variables
# c. It stands for Variance Impact Factor
# d. VIF is when there is no correlation between one predictor and the other
# predictors in a model.

print('(Q9) Answer: a. It can detect multicollinearity')
print('VIF stands for Variance Inflation Factor')
print('The value for VIF starts at 1 and has no upper limit.')
print('A VIF of 1 indicates two variables are not correlated, a VIF between 1 and 5 indicates moderate correlation, and a VIF above 5 indicates high correlation.')


(Q9) Answer: a. It can detect multicollinearity
VIF stands for Variance Inflation Factor
The value for VIF starts at 1 and has no upper limit.
A VIF of 1 indicates two variables are not correlated, a VIF between 1 and 5 indicates moderate correlation, and a VIF above 5 indicates high correlation.
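
For a concrete view of the definition, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. Below is a minimal NumPy sketch on synthetic data. The helper name `vif` and the generated columns are assumptions made for this illustration:

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Return the variance inflation factor of each column of X."""
    n_rows, n_cols = X.shape
    factors = np.empty(n_cols)
    for j in range(n_cols):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        factors[j] = 1.0 / (1.0 - r_squared)
    return factors

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)              # independent of a
c = a + 0.1 * rng.normal(size=500)    # nearly a copy of a

vif_values = vif(np.column_stack([a, b, c]))
print(vif_values)
```

Columns `a` and `c` should come out with large factors because each is almost fully explained by the other, while `b` should stay close to 1.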


In [11]:
# 10. What will predict_proba tell you?
# a. It will predict the class probabilities
# b. It will tell you the target value
# c. Both are correct
# d. None of the above

print('(Q10) Answer: a. It will predict the class probabilities')

(Q10) Answer: a. It will predict the class probabilities
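
To see the difference between `predict_proba` and `predict`, here is a short sketch on a tiny synthetic problem (the data is made up). It returns one probability per class for each row, while `predict` returns only the winning label:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic one-feature problem (hypothetical data).
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X)   # shape (n_samples, n_classes)
labels = model.predict(X)        # hard class labels

# Columns follow model.classes_ and each row sums to 1.
# predict() corresponds to the class with the largest probability.
from_proba = model.classes_[proba.argmax(axis=1)]

print(proba)
print(labels)
```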


In [12]:
# 11.Logistic regression is useful for regression problems:
# a. True
# b. False

print('(Q11) Answer: b. False')

(Q11) Answer: b. False


In [13]:
# 12.In logistic regression, if the predicted logit is 0, what’s the
# transformed probability?
# a. 0.5
# b. 0.05
# c. Both of the above
# d. None of the above

print('(Q12) Answer: a. 0.5')

(Q12) Answer: a. 0.5
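
The answer follows from the logistic (sigmoid) function that maps a logit to a probability: p = 1 / (1 + e^(-logit)). With a logit of 0 this is 1 / (1 + 1) = 0.5. A quick check, where `sigmoid` is a helper name chosen for this sketch:

```python
import math

def sigmoid(logit: float) -> float:
    """Map a logit (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))

print(sigmoid(0.0))                   # 0.5
print(sigmoid(2.0) + sigmoid(-2.0))   # symmetric logits give complementary probabilities
```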


In [14]:
# 13.Which variant of logistic regression is recommended when you have
# a categorical dependent variable with more than two values?
# a. Multiple Logistic regression
# b. Multinomial logistic regression
# c. Ordered logit regression
# d. Poisson regression

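# "Multiple" logistic regression means several predictors with a binary target;
# a target with more than two unordered categories calls for the multinomial variant.
print('(Q13) Answer: b. Multinomial logistic regression')

(Q13) Answer: b. Multinomial logistic regression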


In [15]:
# Perform the following tasks for answering the remaining questions
print(df.dtypes)

# ● Rename the last column as Annual Income
df = df.rename(columns={'annual_income':'Annual Income'})

# ● Remove the missing values from the dataset
print(df.info())
print('Currently there are no missing values to handle')


age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
annual_income     object
dtype: object
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex          

In [16]:
# Collect the names of the categorical (object-dtype) feature columns,
# excluding the target, so they can be label-encoded in the next cell.

col_list = []
for col in df.columns:
    if df[col].dtype == 'object' and col != 'Annual Income':
        col_list.append(col)

col_list


['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country']

In [17]:
import numpy as np

# ● Change the labels of categorical data into numerical data using Label Encoder
df['Annual Income'] = np.where(df['Annual Income'] == '>50K',1,0)

from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
for i in col_list:
    df[i] = labelencoder.fit_transform(df[i])

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             32561 non-null  int64
 1   workclass       32561 non-null  int32
 2   fnlwgt          32561 non-null  int64
 3   education       32561 non-null  int32
 4   education-num   32561 non-null  int64
 5   marital-status  32561 non-null  int32
 6   occupation      32561 non-null  int32
 7   relationship    32561 non-null  int32
 8   race            32561 non-null  int32
 9   sex             32561 non-null  int32
 10  capital-gain    32561 non-null  int64
 11  capital-loss    32561 non-null  int64
 12  hours-per-week  32561 non-null  int64
 13  native-country  32561 non-null  int32
 14  Annual Income   32561 non-null  int32
dtypes: int32(9), int64(6)
memory usage: 2.6 MB
None
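
Label encoding assigns integers in sorted label order, and a linear model such as logistic regression will read those integers as a ranking of the categories. One-hot encoding is a common alternative for nominal features. The sketch below uses `pd.get_dummies` on made-up rows. It is an option worth comparing, not what the assignment asks for:

```python
import pandas as pd

# Hypothetical rows with one nominal feature and one numeric feature.
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp-inc"],
    "age": [39, 50, 38, 53],
})

# One indicator column per category; drop_first removes a redundant column.
encoded = pd.get_dummies(toy, columns=["workclass"], drop_first=True)
print(encoded.columns.tolist())
```

With `drop_first=True` the first category in sorted order ("Private" here) becomes the baseline that the remaining indicator columns are measured against.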


In [18]:
# ● Split the dataset into a train and test of proportions 70:30 and set the random
# state to 0.

col_lst = list(df.columns)
col_lst.remove('Annual Income')
print(col_lst)
df_X = df[col_lst]

df_y = df['Annual Income']
print(df_y)

from sklearn.model_selection import train_test_split
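# Note: the task asks for random_state=0; the outputs shown below were produced
# with random_state=42.
x_train, x_test, y_train, y_test = train_test_split(df_X, df_y, test_size=0.3, random_state=42)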


# ● Build a Logistic Regression Model on the data.
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

##### Model Fitting/Training
lr.fit(x_train, y_train)

test_pred = lr.predict(x_test)
print(pd.DataFrame(lr.predict_proba(x_test)))


['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']
0        0
1        0
2        0
3        0
4        0
        ..
32556    0
32557    1
32558    0
32559    0
32560    1
Name: Annual Income, Length: 32561, dtype: int32
             0         1
0     0.777333  0.222667
1     0.753687  0.246313
2     0.805725  0.194275
3     0.794332  0.205668
4     0.656603  0.343397
...        ...       ...
9764  0.779321  0.220679
9765  0.705936  0.294064
9766  0.869441  0.130559
9767  0.815651  0.184349
9768  0.230062  0.769938

[9769 rows x 2 columns]


In [19]:
# Answer the following questions with the help of the above-created model.
# 14.What is the accuracy score of the above model?
# a. 0.60 to 0.70
# b. 0.40 to 0.60
# c. 0.70 to 0.85
# d. None of the above

from sklearn.metrics import confusion_matrix, accuracy_score
c1 = confusion_matrix(y_test,test_pred)
print('Confusion Matrix: \n')
cnfmtrx = pd.DataFrame(c1)
print(cnfmtrx, '\n')
TN = cnfmtrx.iloc[0,0]
FN = cnfmtrx.iloc[1,0]
TP = cnfmtrx.iloc[1,1]
FP = cnfmtrx.iloc[0,1]
    
print("Accuracy Score =", accuracy_score(y_test,test_pred))
print('(Q14) Answer: c. 0.70 to 0.85')

Confusion Matrix: 

      0    1
0  7232  223
1  1697  617 

Accuracy Score = 0.8034599242501791
(Q14) Answer: c. 0.70 to 0.85


In [20]:
# 15.What is the specificity of the above model?
# a. 0.20 to 0.30
# b. 0.30 to 0.40
# c. 0.50 to 0.60
# d. None of the above

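# Rows of c1 are actual classes and columns are predicted classes:
# c1[0,0] = TN, c1[0,1] = FP, c1[1,0] = FN, c1[1,1] = TP.
sen = c1[1,1]/(c1[1,1]+c1[1,0])   # sensitivity = TP / (TP + FN)
print('Sensitivity = ',sen)

spe = c1[0,0]/(c1[0,0]+c1[0,1])   # specificity = TN / (TN + FP)
print('Specificity = ', spe, '\n')

print('(Q15) Answer: d. None of the above')

Sensitivity =  0.266637856525497
Specificity =  0.9700871898054997 

(Q15) Answer: d. None of the above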
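
As a cross-check, the two rates can be recomputed directly from the confusion matrix printed for Q14. scikit-learn puts actual classes on the rows and predicted classes on the columns, so specificity comes from the first row and sensitivity from the second:

```python
import numpy as np

# Confusion matrix printed earlier: rows = actual class, columns = predicted.
cm = np.array([[7232, 223],
               [1697, 617]])
tn, fp = cm[0]
fn, tp = cm[1]

sensitivity = tp / (tp + fn)   # true positive rate (recall of class 1)
specificity = tn / (tn + fp)   # true negative rate (recall of class 0)

print(sensitivity, specificity)
```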


In [21]:
# 16.What is the model’s precision when the target is False?
# a. 0.60 to 0.70
# b. 0.40 to 0.60
# c. 0.70 to 0.80
# d. None of the above

from sklearn.metrics import precision_score
print('Model’s precision =', precision_score(y_test, test_pred, average=None))

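# Precision for class 0: of all rows predicted 0, the share that are truly 0.
PrecisionTargetFalse = TN / (TN + FN)
print('Model’s precision when the target is False =', PrecisionTargetFalse)
# (Q16) Answer: d. None of the above (about 0.81 lies outside every listed range)
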

Model’s precision = [0.80994512 0.73452381]
Model’s precision when the target is False = 0.8099451226341136


In [22]:
# 17.What is the total support value from the above model?
# a. 9049
# b. 9032
# c. 10000
# d. 9847
from sklearn.metrics import classification_report
print(classification_report(y_test, test_pred))

print('Total support value from the above model = ', TP+TN+FP+FN)
# (Q17) The total support is 9769, which matches none of the listed options.

              precision    recall  f1-score   support

           0       0.81      0.97      0.88      7455
           1       0.73      0.27      0.39      2314

    accuracy                           0.80      9769
   macro avg       0.77      0.62      0.64      9769
weighted avg       0.79      0.80      0.77      9769

Total support value from the above model =  9769


In [23]:
# 18.What is the f1 score of the above model when the target is True?
# a. 0.30 to 0.40
# b. 0.40 to 0.50
# c. 0.60 to 0.70
# d. 0.90 to 0.99

# Names are kept distinct from sklearn's recall_score and f1_score functions.
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * ((precision * recall) / (precision + recall))

print('f1 score of the above model when the target is True =', f1, '\n')

print('(Q18) Answer: a. 0.30 to 0.40')

f1 score of the above model when the target is True = 0.39124920735573876 

(Q18) Answer: a. 0.30 to 0.40
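
The F1 score is the harmonic mean of precision and recall, which simplifies to 2·TP / (2·TP + FP + FN). Below is a short check using the positive-class counts from the confusion matrix shown earlier:

```python
# Positive-class counts taken from the confusion matrix printed for Q14.
tp, fp, fn = 617, 223, 1697

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Equivalent closed form written directly in terms of counts.
f1_counts = 2 * tp / (2 * tp + fp + fn)

print(f1, f1_counts)
```

The low recall (about 0.27) drags the F1 score down even though precision for this class is about 0.73, because a harmonic mean is dominated by the smaller of its two inputs.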


In [24]:
# 19.How many records are correctly classified by the model?
# a. 7173
# b. 7043
# c. 7000
# d. None of the above
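# Correct classifications are the diagonal of the confusion matrix (TN + TP).
print('Number of records correctly classified :', TP + TN)
print('(Q19) Answer: d. None of the above')

Number of records correctly classified : 7849
(Q19) Answer: d. None of the above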
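
Correctly classified records are the ones on the diagonal of the confusion matrix: true negatives plus true positives. A quick check against the matrix and accuracy score printed for Q14:

```python
import numpy as np

# Confusion matrix printed earlier: rows = actual class, columns = predicted.
cm = np.array([[7232, 223],
               [1697, 617]])

correct = int(np.trace(cm))   # TN + TP on the diagonal
total = int(cm.sum())
accuracy = correct / total

print(correct, total, accuracy)
```

Dividing the diagonal sum by the total reproduces the accuracy score reported above, which confirms the count.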
