# Problem Statement:
Census-income data plays the most important role in the democratic system of
government, highly affecting the economic sectors. Census-related figures are used
to allocate federal funding by the government to different states and localities.


Census data is also used for post census residents estimates and predictions,
economic and social science research, and many other such applications.


Therefore, the importance of this data and its accurate predictions is very clear to us.

The main aim is to increase awareness about how the income factor actually has an
impact not only on the individual lives of citizens but also an effect on the nation and
its betterment. You will have a look at the data pulled out from the 1994 Census
bureau database, and try to find insights into how various features have an effect on
the income of an individual.


The data contains approximately 32,000 observations with over 15 variables.
The strategy is to analyze the data and perform a predictive task of classification to
predict whether an individual makes over 50K a year or less by using a logistic
regression algorithm.


Column Names Description
Age Age of the individual
Workclass department of the working individual
fnlwgt Final weight of the individual
education The education degree of the individual
education-num Number of years of education
marital-status Marital status of the individual
occupation Occupation of the individual
relationship Relation value
race Ethnicity of the individual
sex Female, Male
capital-gain capital gain of the individual
capital-loss capital loss of the individual
hours-per-week number of working hours
native-country The native country of the individual
Annual-Income Annual income either >50K or <=50K

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
%matplotlib inline

In [2]:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


In [3]:
df_ci=pd.read_csv(r"C:\Users\Rishi\Downloads\censusincome.csv")

In [4]:
df_ci.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,annual_income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [5]:
df_crw=pd.read_excel(r"C:\Users\Rishi\Downloads\CreditWorthiness.xlsx")

In [6]:
df_crw.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3
0,,,,
1,,,,Objective
2,,,,
3,,,,
4,,,,Cautious plc is in the business of providing c...


In [7]:
# 1. How many types of occupations do we have?
# a. 13
# b. 14
# c. 15
# d. 11

num_occ = df_ci['occupation'].nunique()
print(num_occ)

# coorect ans is  14

15


In [8]:
# 2. How many people are working as tech support and have an annual income greater than 50k?
# a. 278
# b. 389
# c. 289
# d. 934

tech_support = df_ci[(df_ci['occupation'] == 'Tech-support') & (df_ci['annual_income'] == '>50K')].shape[0]
print(tech_support)

# coorect ans is 278

283


In [9]:
# 3. How many total missing values are present in the dataset? ************
# a. 4262
# b. 5000
# c. 5349
# d. 4302

mis_val = df_ci.isnull().sum().sum()
print(mis_val)
# coorect ans is 4262

0


In [10]:
# 4. If there are missing values in the Marital Status column, which option among
# the following should be used for replacing the missing values:
# a. Mean
# b. Median
# c. Mode
# d. All of the above

print("Fill missing values with the mode most frequent value")
df_ci['marital-status'].fillna(df_ci['marital-status'].mode()[0], inplace=True)

Fill missing values with the mode most frequent value


In [11]:
df_ci.head(2)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,annual_income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K


In [12]:
# 5. How many people are having private work classes and are not from the United States of America? ****************
# a. 2151
# b. 2300
# c. 2000
# d. 2190


p_usa = df_ci[(df_ci['workclass'] == 'Private') & (df_ci['native-country'] != 'United-States')].shape[0]
print(p_usa)
# coorect ans is 2151

2561


In [13]:
# 6. How many people are either having Annual Income(last column) less than or
# equal to 50k or their working hours is greater than or equal to 40 hrs:
# a. 23008
# b. 23448
# c. 29505
# d. 25903

ans_6 = df_ci[(df_ci['annual_income'] == '<=50k') | (df_ci['hours-per-week'] >=40)]
print(len(ans_6))
# coorect ans is 29505

24798


In [14]:
# 7. Which of the following methods can you use for handling outliers? ******************
# a. Interquartile Range(IQR) Method
# b. Z Score method
# c. Both of the above methods
# d. None of the above

# Assuming 'age' is a numeric column where you want to handle outliers
# Use the Interquartile Range (IQR) method to handle outliers
Q1 = df_ci['age'].quantile(0.25)
Q3 = df_ci['age'].quantile(0.75)
IQR = Q3 - Q1
outliers_removed_df = df_ci[~((df_ci['age'] < (Q1 - 1.5 * IQR)) | (df_ci['age'] > (Q3 + 1.5 * IQR)))]
# coorect ans is Both of the above methods

In [15]:
# 8. Chi-square is used to analyze:
# a. Determine the relationship b/w the variables
# b. Compare observed results with expected results
# c. both a and b
# d. None of the above

print('Determine the relationship b/w the variables')
# coorect ans is both a n b

Determine the relationship b/w the variables


In [16]:
# 9. What is VIF?              
# a. It can detect multicollinearity
# b. If the VIF value is greater than 10, then there is no correlation between the independent variables
# c. It stands for Variance Impact Factor
# d. VIF is when there is no correlation between one predictor and the other

print('It can detect multicollinearity')

It can detect multicollinearity


# predictors in a model.


In [17]:
# 10.What predict_proba will tell you?                              *******************
# a. It will predict the class probabilities
# b. It will tell you the target value
# c. Both are correct
# d. None of the above

print('It will predict the class probabilities')

It will predict the class probabilities


In [18]:
# 11.Logistic regression is useful for regression problems:
# a. True
# b. False

print("False")

False


In [19]:
# 12.In logistic regression, if the predicted logit is 0, what’s the transformed probability? *************
# a. 0.5
# b. 0.05
# c. Both of the above
# d. None of the above

print('0.5')

0.5


In [20]:
# 13.Which variant of logistic regression is recommended when you have a categorical dependent varib with more than two values?
# a. Multiple Logistic regression
# b. Multinomial logistic regression
# c. Ordered logit regression
# d. Poisson regression

print(' Multinomial logistic regression')

 Multinomial logistic regression


Perform the following tasks for answering the remaining questions

● Rename the last column as Annual Income

● Remove the missing values from the dataset

● Change the labels of categorical data into numerical data using Label
Encoder.

● Split the dataset into a train and test of proportions 70:30 and set the random
state to 0.

● Build a Logistic Regression Model on the data.
Answer the following questions with the help of the above-created model.

In [21]:
df_ci.head(1)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,annual_income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K


In [22]:
#Rename the last column as Annual Income
df_ci.rename(columns={'annual_income': 'Annual_Income'}, inplace=True)

In [23]:
#  Remove the missing values from the dataset

df_ci.dropna(inplace=True)

In [24]:
# Change the labels of categorical data into numerical data using Label Encoder.   ******************
label_encoder = LabelEncoder()
categorical_columns = df_ci.select_dtypes(include=['object']).columns
for column in categorical_columns:
    df_ci[column] = label_encoder.fit_transform(df_ci[column])

In [25]:
#  Split the dataset into a train and test of proportions 70:30 and set the random state to 0.
X = df_ci.drop('Annual_Income', axis=1)
y = df_ci['Annual_Income']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)

In [26]:
#Build a Logistic Regression Model on the data. Answer the following questions with the help of the above-created model.
model = LogisticRegression()
model.fit(X_train, y_train)


In [27]:
# 14.What is the accuracy score of the above model?
# a. 0.60 to 0.70
# b. 0.40 to 0.60
# c. 0.70 to 0.85
# d. None of the above

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"14. Accuracy score of the model: {accuracy}")

14. Accuracy score of the model: 0.80090080868052


In [28]:
# 15.What is the specificity of the above model? ***************************
# a. 0.20 to 0.30
# b. 0.30 to 0.40
# c. 0.50 to 0.60
# d. None of the above
tn, fp, fn, tp =confusion_matrix(y_test, y_pred).ravel()
specificity = tn / (tn + fp)
print(specificity)

0.9712434183880113


In [29]:
# 16.What is the model’s precision when the target is False?     ****************
# a. 0.60 to 0.70
# b. 0.40 to 0.60
# c. 0.70 to 0.80
# d. None of the above

precision_false = tn / (tn + fn)
print(precision_false)

0.8059601165135559


In [30]:
# 17.What is the total support value from the above model? ***********
# a. 9049
# b. 9032
# c. 10000
# d. 9847

total_support = tn + fp + fn + tp
print(total_support)

# coorect ans is 9049

9769


In [31]:
# 18.What is the f1 score of the above model when the target is True? ***********
# a. 0.30 to 0.40
# b. 0.40 to 0.50
# c. 0.60 to 0.70
# d. 0.90 to 0.99
# coorect ans is .30 to .40

In [32]:
# 19.How many records are correctly classified by the model?
# a. 7173
# b. 7043
# c. 7000
# d. None of the above

correct_rec = accuracy_score(y_test, y_pred, normalize=False)
print(correct_rec)
# coorect ans is 7173

7824


In [33]:
# 20) Choose the code which is used to perform reset indexes in the original data frame.
# data.reset_index(inplace=False)
# data.reset(inplace=True)
# data.reset(inplace=False)
# data.reset_index(inplace=True)

print('df_ci.reset_index(inplace=True')

df_ci.reset_index(inplace=True


In [35]:
# 21) What is the cost function used in Binary Logistic Regression?               ******************
# a Log Loss
# b Categorical Cross Entropy
# c None of the options are correct
# d Sum Squared Error

print ('Log Loss')
# The cost function used in Binary Logistic Regression is typically the Log Loss, also known as Binary Cross-Entropy. 
# It measures the performance of a classification model whose output is a probability value between 0 and 1

Log Loss


In [37]:
# 22) Which below method is used to find the best fitting line in logistic regression       ********************
# A None of the above
# B Least Square Method
# C Maximum Likelihood
# D Sigmoid

print('Maximum Likelihood')

Maximum Likelihood


In [38]:
# 23) What is the relation between Sensitivity and Specificity?
# A No relation between both
# B Sensitivity Increases Specificity Increases
# C Sensitivity Increases Specificity Decreases

print('C. Sensitivity Increases, Specificity Decreases')

C. Sensitivity Increases, Specificity Decreases


In [39]:
df_crw.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3
0,,,,
1,,,,Objective
2,,,,
3,,,,
4,,,,Cautious plc is in the business of providing c...


In [40]:
# 24)Use the creditworthiness data for the following questions, 
# Don’t perform any Data cleaning or pre-processing before answering the questions.
# What is the entropy of the column Htype using the formula (-1 * np.sum(np.log2(probs) * probs)) 
# where probs is the list of probability of each unique item in the column?
# A 9.85
# B 0.56
# C 1.14
# D 5.75

import numpy as np

# Assuming probs is the list of probabilities
probs = [0.2, 0.3, 0.5]  # Replace with the actual probabilities

# Calculate entropy
entropy = -np.sum(np.log2(probs) * probs)

print(f"The entropy of the column Htype is: {entropy}")

The entropy of the column Htype is: 1.4854752972273344


In [41]:
# 25)Use the creditworthiness data for the following questions, 
# Don’t perform any Data cleaning or pre-processing before answering the questions.
# What is the total number of columns obtained post running the command pandas get_dummies in the raw data 
# when the Initial columns count is 21?
# A 65
# B 21
# C 48
# D 97

# When you use the pandas.get_dummies function to one-hot encode categorical variables in a DataFrame, 
# the number of columns obtained after one-hot encoding will depend on the number of unique values in those categorical columns.

# If the initial column count is 21 and you create dummy columns for all categorical variables, 
# the total number of columns after running pandas.get_dummies could potentially be more than 21.

In [42]:
initial_column_count = df_crw.shape[1]  # Get the initial column count

# Identify categorical columns
categorical_columns = df_crw.select_dtypes(include=['object']).columns

# One-hot encode categorical columns
df_encoded = pd.get_dummies(df_crw, columns=categorical_columns)

# Get the final column count
final_column_count = df_encoded.shape[1]

print(f"Initial Column Count: {initial_column_count}")
print(f"Final Column Count after get_dummies: {final_column_count}")

Initial Column Count: 4
Final Column Count after get_dummies: 5


In [43]:
# 26) What is the X & Y axis for drawing AUC ROC plot?
# TNR & FNR
# FPR & TPR
# TPR & FPR
# FNR & TNR

# X-axis: False Positive Rate (FPR)
# Y-axis: True Positive Rate (TPR)
# ans: FPR & TPR
    
# In the context of an ROC (Receiver Operating Characteristic) plot, the X-axis represents the False Positive Rate (FPR), 
# and the Y-axis represents the True Positive Rate (TPR), also known as Sensitivity or Recall

In [44]:
# 27)Use the creditworthiness data for the following questions, ***************
# Don’t perform any Data cleaning or pre-processing before answering the questions.
# How many records present with the following conditions age >= 25 and age <= 50, 
# and inplans equals ‘bank’ and Cpur is either ‘Business’ or ‘electronics’ and foreign as ‘yes’?
# A 3
# B 1
# C 2
# D 0

In [45]:
# 28) Calculate the Recall for the below confusion matrix Confusion
# Matrix	Predicted
# Good	Bad
# Actual	Good	300	200
# Bad	100	400
# A 4/10
# B 3/7
# C 3/10
# D 4/7

# Recall= True Positives/(True Positives + False Negatives)

# True Positives = 300
# False Negatives = 100

# Recall= 300/300+100 = 300/400 = 3/4

In [47]:
# 29) Calculate the F1 score for the below confusion matrix?
# Confusion Matrix	Predicted
# Good	Bad
# Actual	Good	300	200
# Bad	100	400
# A 0.40
# B 0.75
# C 0.80
# D 0.67

# The F1 score is calculated using the formula:
# F1 score=2×(Precision×Recall)/Precision+Recall

 

# Precision is calculated as:
# Precision=True Positives/(True Positives + False Positives)



# Recall is calculated as:
# Recall=True Positives/(True Positives + False Negatives)


Tasks To Be Performed:
1. Load the dataset using pandas

2. Extract data from outcome column is a variable named Y

3. Extract data from every column except outcome column in a variable named X

4. Divide the dataset into two parts for training and testing in 70% and 30%proportion

5. Create and train Logistic Regression Model on training set

6. Make predictions based on the testing set using the trained model

7. Check the performance by calculating the confusion matrix and accuracy score of the model


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

In [2]:
#1. Load the dataset using pandas
dff = pd.read_csv(r'C:\Users\Rishi\Downloads\Datasets\diabetes.csv')

In [3]:
dff.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [6]:
# 2. Extract data from outcome column in a variable named Y
Y = dff['Outcome']

In [7]:
# 3. Extract data from every column except the outcome column in a variable named X
X = dff.drop('Outcome', axis=1)

In [8]:
# 4. Divide the dataset into two parts for training and testing in 70% and 30% proportion
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=2)

In [9]:
# 5. Create and train Logistic Regression Model on the training set
model = LogisticRegression()
model.fit(X_train, Y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [10]:
# 6. Make predictions based on the testing set using the trained model
Y_pred = model.predict(X_test)

In [11]:
# 7. Check the performance by calculating the confusion matrix and accuracy score of the model
conf_matrix = confusion_matrix(Y_test, Y_pred)
accuracy = accuracy_score(Y_test, Y_pred)

In [12]:
print(conf_matrix)
print(f"Accuracy Score: {accuracy}")

[[139  16]
 [ 39  37]]
Accuracy Score: 0.7619047619047619
