# ADULT DATA - Logistic Regression

Prediction task is to determine whether a person makes over 50K a year.

Listing of attributes:

income (target class): >50K, <=50K.<br>

age: continuous.<br>

workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.<br>

fnlwgt: continuous.<br>

education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.<br>

education-num: continuous.<br>
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.<br>

occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.<br>

relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.<br>

race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.<br>
sex: Female, Male.<br>

capital-gain: continuous.<br>

capital-loss: continuous.<br>

hours-per-week: continuous.<br>

native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

In [1]:
# Import all libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Load the data 

adult_df = pd.read_csv(r"C:\Users\Anny\OneDrive\Desktop\Imarticus\GITHUB\Adult\adult_data (1).csv", header=None, delimiter=r' *, *', engine='python')
adult_df


* Since the data file has no header row, we assign the column names manually

In [None]:
adult_df.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                    'marital_status', 'occupation', 'relationship',
                    'race', 'sex', 'capital_gain', 'capital_loss',
                    'hours_per_week', 'native_country', 'income']
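
As a hedged alternative to assigning the headers afterwards, the names can be passed while reading the file. The sketch below uses two illustrative rows held in memory instead of the real file, and `skipinitialspace=True` in place of the regex delimiter; `raw`, `columns` and `sample_df` are names introduced only for this example.

```python
import io
import pandas as pd

# Two illustrative rows in the same "comma + space" layout as the raw file.
raw = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship',
           'race', 'sex', 'capital_gain', 'capital_loss',
           'hours_per_week', 'native_country', 'income']

# header=None: the file has no header row; names=...: supply the headers.
# skipinitialspace=True drops the blank that follows each comma.
sample_df = pd.read_csv(raw, header=None, names=columns, skipinitialspace=True)
print(sample_df.dtypes)
```

With this approach the default C parser can be used, since no regular-expression separator is needed.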

In [None]:
# Displaying the data to check that the headers have been applied

adult_df

* The data consists of numeric as well as categorical columns
* Looking at the variables we can infer that:
1. The columns 'education' and 'education_num' carry the same information in different data types<br>
Hence we will drop one of them
2. The column 'fnlwgt' is a calculated sampling weight and is not expected to help predict income<br>
Hence we will drop the column as a part of feature selection
3. The columns 'capital_gain' and 'capital_loss' have many values that are equal to 0
4. The columns 'relationship' and 'race' might not seem as important, but we will keep them for now


In [None]:
# General description of the data

print('Data shape -', adult_df.shape)
print()
print('Data types','\n', adult_df.dtypes)
print()
print('Description','\n')
print(adult_df.describe(include='all'))

* There are 32561 rows and 15 columns
* The data consists of mixed data types - int64 and object

#### Processing the Data

In [None]:
# Creating a copy of the data 

adult_df_rev = adult_df.copy()

In [None]:
adult_df_rev.columns

In [None]:
adult_df_rev.shape

In [None]:
# Dropping columns which are irrelevant for the model
# Dropping 'fnlwgt' and 'education'

adult_df_rev.drop(['fnlwgt', 'education'],axis = 1, inplace = True)

In [None]:
adult_df_rev.shape

* The number of columns has been reduced from 15 --> 13 
* This confirms that the columns 'fnlwgt' and 'education' have been successfully dropped

In [None]:
# Checking for missing values in the data 

adult_df_rev.isnull().sum()

* Here we can see that there are no null values in the data
* However, the data description tells us that missing values are recorded as '?', which isnull() does not detect
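
A minimal sketch of how the '?' placeholder could be turned into real missing values at load time, assuming the file is read with `pd.read_csv`. The tiny in-memory table and the names `raw`, `demo` and `missing_counts` are made up for illustration.

```python
import io
import pandas as pd

# Small made-up sample that uses '?' as the missing-value marker.
raw = io.StringIO(
    "age,workclass,occupation\n"
    "39, Private, Sales\n"
    "54, ?, ?\n"
    "28, Private, ?\n"
)

# na_values="?" makes pandas treat '?' as NaN;
# skipinitialspace=True removes the blank after each comma first.
demo = pd.read_csv(raw, skipinitialspace=True, na_values="?")

missing_counts = demo.isnull().sum()
print(missing_counts)
```

Once the placeholder is parsed as NaN, `isnull().sum()` reports the missing entries directly.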

In [None]:
# Checking the datatypes to see if the numeric variables have any anomaly in the data type

adult_df_rev.dtypes

Since the data types show no anomalies, we can go ahead and check for special characters in the data

In [None]:
# Checking for the special character (?) in the data

for i in adult_df_rev.columns:
    print({i:adult_df_rev[i].unique()})

We can infer that columns 'workclass', 'occupation' and 'native_country' have the special character '?' in place of null values

In [None]:
# Checking for duplicates in the data

adult_df_rev.duplicated().sum()

There are 3465 duplicated rows<br>
Since we have enough data we can go ahead and drop all the duplicated rows

In [None]:
# Dropping the duplicated values

adult_df_rev.drop_duplicates(inplace=True)

In [None]:
# Checking if the duplicate values have been dropped

adult_df_rev.shape

The number of records has been reduced from 32561 ---> 29096<br>
This means that the 3465 duplicate rows have been successfully dropped

In [None]:
# Replacing the special character '?' with nan

adult_df_rev.replace('?',np.nan, inplace=True)

In [None]:
# Checking if the special character '?' has been replaced 

adult_df_rev.isnull().sum()

We can infer that columns 'workclass', 'occupation' and 'native_country' have missing values in their data:
* workclass has 1632 missing values
* occupation has 1639 missing values
* native_country has 580 missing values

To avoid data loss we will fill these missing values instead of dropping the rows

In [None]:
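# Filling the missing values with the mode - since the columns are categorical in nature
# Assigning the result back avoids relying on inplace=True on a column selection

for value in ['workclass', 'occupation', 'native_country']:
    adult_df_rev[value] = adult_df_rev[value].fillna(adult_df_rev[value].mode()[0])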

In [None]:
# Checking if the missing values have been filled

adult_df_rev.isnull().sum()

We can conclude that the missing values have been successfully filled with their respective modes
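
The mode-filling step can be sketched on a small made-up table. This is only an illustration of the technique; `demo`, `categorical_cols` and `modes` are hypothetical names, and the values are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up data with gaps in two categorical columns.
demo = pd.DataFrame({
    "workclass": ["Private", "Private", np.nan, "State-gov"],
    "occupation": ["Sales", np.nan, "Sales", "Tech-support"],
    "age": [39, 50, 38, 53],
})

categorical_cols = ["workclass", "occupation"]

# mode() ignores NaN by default; [0] picks the first most frequent value.
modes = {col: demo[col].mode()[0] for col in categorical_cols}

# Passing a dict to fillna fills each column with its own value.
demo = demo.fillna(value=modes)
print(demo)
```

Keeping the modes in a dictionary also makes it possible to reuse the same fill values on another dataset later.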

#### Pre-processing the model

ASSUMPTION 1 - There should be no outliers in the data

In [None]:
adult_df_rev.dtypes

In [None]:
adult_num = ['age','education_num','capital_gain','capital_loss','hours_per_week']

In [None]:
adult_df_rev.boxplot()

In [None]:
for i in adult_num:
    adult_df_rev.boxplot(column=i)
    plt.show()
    

We can observe that there are a few outliers in the data<br>
But we also observe that the outliers are clustered in nature, so eliminating them would lead to data loss

Hence we will keep the outliers<br>
We treat the 1st assumption as acceptably met and proceed
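
To complement the boxplots with a number, the points beyond the whiskers can be counted with the usual 1.5 x IQR rule. This is a sketch on made-up values; `iqr_outlier_count` and `hours` are hypothetical names, not part of the notebook.

```python
import pandas as pd

def iqr_outlier_count(series, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], the boxplot whisker rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up weekly hours with one very low and one very high value.
hours = pd.Series([40, 40, 38, 45, 40, 42, 35, 40, 99, 1])
n_outliers = iqr_outlier_count(hours)
print(n_outliers)
```

Applied to each numeric column, such a count would show how much data would be lost if the outliers were removed.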

#### **Pre processing the data**

In [None]:
adult_df_rev.columns

In [None]:
# Converting all categoric data into numeric data

# Creating a list that has only the columns with 'object' datatype

colname = []

for x in adult_df_rev.columns:
    if adult_df_rev[x].dtypes=='object':
        colname.append(x)

print('Columns with "object" as their data type','\n', colname)

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

for x in colname:
    adult_df_rev[x]=le.fit_transform(adult_df_rev[x])
    le_name_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
    print('Feature', x)
    print('mapping', le_name_mapping)
    

All the categoric values have been converted into numeric data

Target variable labels

Feature : income<br>
{'<=50K' :  0,  '>50K' :  1}
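
The target mapping can be reproduced on a few made-up labels. `LabelEncoder` sorts the classes before numbering them, which is why '<=50K' receives 0 and '>50K' receives 1. The names `income`, `encoder` and `mapping` are used only in this sketch.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A few made-up target labels.
income = pd.Series(["<=50K", ">50K", "<=50K", "<=50K", ">50K"])

encoder = LabelEncoder()
codes = encoder.fit_transform(income)

# classes_ holds the sorted labels; the position of a label is its code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns codes back into the original labels.
print(encoder.inverse_transform([0, 1]))
```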

In [None]:
# Checking if the data has been converted

adult_df_rev.head()

In [None]:
adult_df_rev.dtypes

The categoric data have been converted successfully<br>
All the data is present in integer data type

In [None]:
# Create X and Y variable (Independent - X and Dependent - Y)

X = adult_df_rev.values[:,0:-1]
Y = adult_df_rev.values[:,-1]

print('X :',X)
print('Y :',Y)

In [None]:
# Checking if the X and Y variables are created properly

print('X = ',X.shape)
print('Y = ',Y.shape)

The X or independent variable has 29096 rows and 12 columns<br>
The Y or dependent or target variable has 29096 rows and 1 column

#### **Scaling the data**

In [None]:
# Using the standardization scaling technique to scale the data

# For X variable

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(X)
X = scaler.transform(X)

print('X variable','\n',X)
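
The cell above fits the scaler on all rows before the train/test split. A commonly used alternative, sketched here on random made-up data, is to split first and fit the scaler on the training rows only, so that the test rows do not influence the mean and standard deviation. All names below (`X_demo`, `demo_scaler`, and so on) are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up numeric features and binary labels.
rng = np.random.default_rng(10)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first ...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10
)

# ... then learn mean and standard deviation from the training rows only.
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)  # reuses the training statistics

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```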

In [None]:
# For Y variable - it holds class labels, so it is not scaled, only cast to int

Y = Y.astype(int)
Y.dtype

#### **Basic models**

In [None]:
# Splitting the data for Training and Testing

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.3, random_state=10)

In [None]:
print('Data reserved for Training = 70%')
print()
print('Training data shape for X variable - ', X_train.shape) 
print('Training data shape for Y variable - ',Y_train.shape)  
print()
print('Data reserved for Testing = 30%')
print()
print('Testing data shape for X variable - ',X_test.shape)
print('Testing data shape for Y variable - ',Y_test.shape)   
print()
print("Percent of train data = ",X_train.shape[0]/X.shape[0]*100)

#### **Logistic Regression Model**


Logistic regression is a supervised machine learning algorithm used mainly for classification tasks, where the goal is to predict the probability that an instance belongs to a given class. Although it is used for classification, it is called a regression because it fits a linear model to the log-odds of the class and passes the result through the logistic (sigmoid) function.

We have decided to implement Logistic regression here as the target is a binary class (income <=50K or >50K)
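
How a logistic regression turns features into a class probability can be sketched in a few lines. The coefficients, intercept and input below are hypothetical numbers chosen for illustration; they are not the fitted values of this notebook.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba_class1(x, coefficients, intercept):
    """Probability of class 1 = sigmoid(intercept + coefficients . x)."""
    return sigmoid(np.dot(x, coefficients) + intercept)

coef = np.array([0.8, -0.5])   # hypothetical beta coefficients
intercept = -0.2               # hypothetical beta0
x = np.array([1.0, 2.0])       # one already-scaled observation

p = predict_proba_class1(x, coef, intercept)
predicted_class = int(p > 0.5)  # default decision threshold of 0.5
print(p, predicted_class)
```

The class label follows from comparing the probability with a threshold, which is 0.5 by default.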

In [None]:
from sklearn.linear_model import LogisticRegression

# Create a model
classifier=LogisticRegression()

# Fitting training data to the model
classifier.fit(X_train,Y_train)

Y_pred=classifier.predict(X_test)
print(Y_pred)

In [None]:
print(list(zip(Y_test,Y_pred)))

In [None]:
print('Beta coefficient','\n')
print(list(zip(adult_df_rev.columns[:-1],classifier.coef_.ravel())))
print()
print('Intercept/Beta0','\n',classifier.intercept_)
print()

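* From the coefficients above we can infer the direction of the relation between each independent variable and the dependent variable: a positive coefficient increases the predicted probability of class 1 (>50K) and a negative coefficient decreases it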
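
One way to read the coefficients is to exponentiate them into odds ratios. The sketch below uses hypothetical coefficient values, not the ones printed by the model; `feature_names`, `coefficients` and `odds_ratios` are names made up for this example.

```python
import numpy as np

feature_names = ["age", "education_num", "hours_per_week"]
coefficients = np.array([0.45, 0.80, -0.10])  # hypothetical, for illustration only

# exp(beta) is the factor by which the odds of class 1 change
# when the (scaled) feature increases by one unit.
odds_ratios = np.exp(coefficients)

for name, beta, ratio in zip(feature_names, coefficients, odds_ratios):
    direction = "raises" if beta > 0 else "lowers"
    print(f"{name}: beta={beta:+.2f}, odds ratio={ratio:.2f} ({direction} the odds of class 1)")
```

An odds ratio above 1 corresponds to a positive coefficient and one below 1 to a negative coefficient.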

In [None]:
# Checking the probability of the data to belong to either class 0 or class 1

y_pred_class=classifier.predict_proba(X_test)

In [None]:
# Evaluation metrics

from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

cfm = confusion_matrix(Y_test,Y_pred)

print('Confusion Matrix','\n',cfm)
print()
print('Classification Report','\n',classification_report(Y_test,Y_pred))
print()
print('Accuracy of the model -', accuracy_score(Y_test,Y_pred))

From the above we can infer that-<br>
1. Class 0 : Out of 6550 values 6178 were predicted correctly while 372 were misclassified<br>
2. Class 1 : Out of 2179 values 968 were predicted correctly while 1211 were misclassified

Recall values:<br>
1. Class 0 : 94%<br>
2. Class 1 : 44%

Hence the model is working well for Class 0 and not so well for Class 1

The accuracy of the model is 81.86%
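
The recall and accuracy figures follow directly from the confusion matrix reported above. A short sketch of the arithmetic, using those four counts (rows are actual classes, columns are predicted classes):

```python
import numpy as np

# Confusion matrix reported above: rows = actual, columns = predicted.
cfm_reported = np.array([[6178, 372],
                         [1211, 968]])

correct = cfm_reported.diagonal()

# Recall: correct predictions divided by the actual count of each class.
recall_per_class = correct / cfm_reported.sum(axis=1)

# Precision: correct predictions divided by the predicted count of each class.
precision_per_class = correct / cfm_reported.sum(axis=0)

# Accuracy: all correct predictions divided by all observations.
accuracy = correct.sum() / cfm_reported.sum()

print("Recall   :", recall_per_class.round(2))
print("Precision:", precision_per_class.round(2))
print("Accuracy :", round(float(accuracy), 4))
```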



Recall for Class 1 is still low, hence we will tune the classification threshold

#### **Tuning the model**

In [None]:
for x in np.arange(0.4,0.61,0.01):
    predict_mine = np.where(y_pred_class[:,1]>x,1,0)
    cfm=confusion_matrix(Y_test,predict_mine)
    total_err=cfm[0,1]+cfm[1,0]
    print('Errors at threshold', x,':',total_err, ", type 2 error :", cfm[1,0],' type 1 error', cfm[0,1])

The best threshold value (lowest total number of errors) : 0.46
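
Instead of reading the best threshold off the printed lines, it can be picked programmatically. This is a sketch on six made-up probabilities, assuming "best" means the lowest total number of errors; `best_threshold` is a hypothetical helper.

```python
import numpy as np

def best_threshold(y_true, prob_class1, thresholds):
    """Return the threshold with the fewest misclassifications and that error count."""
    y_true = np.asarray(y_true)
    errors = []
    for t in thresholds:
        predicted = (prob_class1 > t).astype(int)
        errors.append(int((predicted != y_true).sum()))
    best_index = int(np.argmin(errors))  # first minimum if there are ties
    return float(thresholds[best_index]), errors[best_index]

# Made-up labels and class-1 probabilities.
y_true_demo = np.array([0, 0, 0, 1, 1, 1])
prob_demo = np.array([0.10, 0.30, 0.445, 0.475, 0.55, 0.90])

# Same grid as the loop above, rounded to avoid floating-point noise.
thresholds = np.round(np.arange(0.40, 0.61, 0.01), 2)

best_t, n_errors = best_threshold(y_true_demo, prob_demo, thresholds)
print(best_t, n_errors)
```

Rounding the grid also avoids values such as 0.4600000000000001 in the printed output.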

In [None]:
# Create an empty list
y_pred_class_final=[]

for value in y_pred_class[:,1]:
    if value >0.46:
        y_pred_class_final.append(1)
    else:
        y_pred_class_final.append(0)
        
print(y_pred_class_final)

In [None]:
cfm = confusion_matrix(Y_test,y_pred_class_final)
print('Confusion Matrix','\n',cfm)
print()
print('Previous Confusion Matrix')
print('[[6178  372]')
print('[1211  968]]')
print()
print('Classification Report','\n',classification_report(Y_test,y_pred_class_final))
print()
print('Accuracy of the model -', accuracy_score(Y_test,y_pred_class_final))


From the above we can infer that-<br>
1. Class 0 : Out of 6550 values 6094 were predicted correctly while 456 were misclassified<br>
2. Class 1 : Out of 2179 values 1053 were predicted correctly while 1126 were misclassified

Recall values:<br>
1. Class 0 : 93%<br>
2. Class 1 : 48%


The accuracy of the model is 81.87%

#### **Conclusion**

Hence we have built a Logistic Regression model that predicts whether a person's income exceeds 50K, based on the demographic and employment attributes listed at the start, with 81.87% accuracy on the validation data

#### Testing the model

In [None]:
# Load the test data

adult_test = pd.read_csv(r"C:\Users\Anny\OneDrive\Desktop\Imarticus\GITHUB\Adult\adult_test (2).csv", header=None, delimiter=r' *, *', engine='python')

adult_test

In [None]:
# Inserting column names

adult_test.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                    'marital_status', 'occupation', 'relationship',
                    'race', 'sex', 'capital_gain', 'capital_loss',
                    'hours_per_week', 'native_country', 'income']

In [None]:
# Checking if column names have been updated or not

adult_test

In [None]:
# Feature Selection

adult_test.drop(['education','fnlwgt'],axis = 1,inplace = True)

In [None]:
# Checking if the columns have been dropped

adult_test.columns

In [None]:
# Checking for missing values in the data

adult_test.isnull().sum()

There are no null values in the data

In [None]:
# Checking for anomalies in the data types 

In [None]:
adult_test.dtypes

A few of the numeric columns have object as their data types, which indicates some kind of anomaly

In [None]:
# Checking if the data has '?' as missing values in the data

for x in adult_test.columns:
    print({x:adult_test[x].unique()})
    

Work class, occupation and native country have '?' as their missing value

In [None]:
# Converting the special character '?' into nan values

adult_test.replace('?',np.nan,inplace=True)

In [None]:
# Checking if the special character has been converted into nan values

adult_test.isnull().sum()

There are missing values in workclass, occupation and native country

In [None]:
adult_test.shape

In [None]:
adult_test.columns

In [None]:
# Filling the missing data with the mode values

for value in ['workclass','occupation','native_country']:
    # Assigning the result back avoids relying on inplace=True on a column selection
    adult_test[value] = adult_test[value].fillna(adult_test[value].mode()[0])

In [None]:
# Checking if the missing values have been filled 

adult_test.isnull().sum()

All the missing values have been filled successfully with their respective modes

In [None]:
# Creating a list with all categoric columns

categoric = []

for i in adult_test.columns:
    if adult_test[i].dtypes=='object':
        categoric.append(i)
    
categoric

In [None]:
# Converting all the categoric data into numeric data

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

for i in categoric:
    adult_test[i]=le.fit_transform(adult_test[i])
    
    name = dict(zip(le.classes_,le.transform(le.classes_)))
    # Print the column being encoded (the loop variable is i)
    print('Feature:',i)
    print('Mapping', name)

In [None]:
adult_test

All the categoric columns have been converted into numeric values
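
Refitting the encoder on the test data gives the same codes as in training only if both files contain exactly the same categories. A hedged sketch of one way to guard against this is to keep one fitted encoder per column and reuse it. The tiny frames and the names `train`, `test` and `encoders` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up data: the test set does not contain every training category.
train = pd.DataFrame({"native_country": ["Cuba", "India", "Mexico", "India"]})
test = pd.DataFrame({"native_country": ["Mexico", "Cuba"]})

# Fit one encoder per categorical column on the training data and keep it.
encoders = {}
for col in ["native_country"]:
    encoders[col] = LabelEncoder().fit(train[col])
    train[col] = encoders[col].transform(train[col])

# Reuse the stored encoders so the test codes match the training codes.
for col, enc in encoders.items():
    test[col] = enc.transform(test[col])

# For comparison: refitting on the test values alone shifts the codes.
refit_codes = LabelEncoder().fit_transform(["Mexico", "Cuba"])

print(test["native_country"].tolist(), refit_codes.tolist())
```

Note that `transform` raises a `ValueError` for a category that was not seen during fitting, so unseen values would need to be handled before encoding.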

In [None]:
# Creating new X and Y variables

X_test_new = adult_test.values[:,:-1]
Y_test_new = adult_test.values[:,-1]

print('New X variable :', X_test_new)
print('New Y variable :', Y_test_new)

In [None]:
# Scaling the data

# Scaling the X variable

X_test_new = scaler.transform(X_test_new)
print(X_test_new)

In [None]:
# Casting the Y variable to int (class labels are not scaled)

Y_test_new = Y_test_new.astype(int)

In [None]:
# Predicting the probability of the values to belong to either class 0 or class 1

Y_pred_prob = classifier.predict_proba(X_test_new)
print(Y_pred_prob)

In [None]:
Y_pred_class_test=[]

for value in Y_pred_prob[:,1]:
    if value >0.46:
        Y_pred_class_test.append(1)
    else:
        Y_pred_class_test.append(0)
        
print(Y_pred_class_test)

In [None]:
# Evaluating the model 

from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, recall_score, f1_score

# Evaluation metrics

cfm = confusion_matrix(Y_test_new,Y_pred_class_test)

print('Confusion Matrix','\n', cfm)
print()
print('Classification report: ','\n')
print()
print(classification_report(Y_test_new,Y_pred_class_test))
print()
acc= accuracy_score(Y_test_new,Y_pred_class_test)
print()
print('Accuracy of the model = ', acc)
print(acc*100,'%')

In [None]:
# Comparing validation accuracy with test data accuracy
# Model is durable as the range is between 80-83%

Looking at the accuracy score one might infer that the model is a good model, but looking closely at the recall value for class 1, we can say that the model is not the best and can be improved further by applying different algorithms.