# Binary Logistic Regression

# Definition

Logistic regression is a statistical method for modeling an outcome from one or more independent variables. In the binary case the outcome is measured with a dichotomous variable, that is, one with only two possible values.
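To make the definition concrete, here is a minimal sketch of how a logistic regression turns a weighted sum of the independent variables into a probability and then into one of the two outcomes. The weights, bias and record below are hypothetical values chosen for illustration, not coefficients fitted on the HR data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Probability of the positive class for a single record."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients and one hypothetical record (illustration only)
weights = [0.8, -0.5]
bias = 0.3
record = [1.5, 2.0]

p = predict_probability(record, weights, bias)
# A 0.5 threshold converts the probability into a dichotomous outcome
label = 1 if p >= 0.5 else 0
print(round(p, 3), label)
```

The 0.5 threshold is the usual default; it can be moved if one kind of error matters more than the other.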


# Importing Data & Python Packages
 

In [0]:
import numpy as np 
import pandas as pd 
from sklearn import preprocessing
import matplotlib.pyplot as plt 
import seaborn as sns
# white grid background for seaborn plots (a second sns.set call would override the first, so one call is enough)
sns.set(style="whitegrid", color_codes=True)

I am using the HR (Human Resources) data, which consists of 1470 rows and 35 features/columns/variables, for a total of 51,450 values. The data is a mix of text and numeric columns. Our main objective is to predict attrition, so that when a new record comes close to the profile of attrition, the algorithm warns HR and HR can take the necessary action.

In [2]:
# Read CSV HR data file into DataFrame
hrdf = pd.read_csv("HRDATA.csv")
# Shape data
print('The shape of our features is:', hrdf.shape)
# Size of data
print("Size",hrdf.size)
# preview HR data
hrdf.head()



<font color=green>We need to convert all the text data into numbers, i.e. treat those columns as categorical data and encode them numerically. I used Excel because it was simpler and faster for this step; Python can also be used.</font>
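As a sketch of the Python route mentioned above, the snippet below encodes every non-numeric column as integer category codes with pandas. The column names and values are hypothetical stand-ins, since the actual contents of the HR file are not shown here.

```python
import pandas as pd

# Hypothetical sample; the real column names and values may differ
df = pd.DataFrame({
    "Department": ["Sales", "Research", "Sales", "HR"],
    "OverTime": ["Yes", "No", "No", "Yes"],
    "Age": [41, 49, 37, 33],
})

encoded = df.copy()
# Find the non-numeric columns and replace each with integer category codes
text_cols = [c for c in encoded.columns
             if not pd.api.types.is_numeric_dtype(encoded[c])]
for col in text_cols:
    encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
```

Integer codes imply an ordering between categories. For columns with no natural order, one-hot encoding with `pd.get_dummies` may be a better fit.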

In [0]:
# Read CSV HR data file into DataFrame
hrdf1 = pd.read_csv("HRDATA-1.csv")
# Shape data
print('The shape of our features is:', hrdf1.shape)
# Size of data
print("Size",hrdf1.size)
# preview HR data
hrdf1.head()


# Data Quality & Missing Value Assessment

In [0]:
# check missing values in each column
hrdf1.isnull().sum()

In [0]:
hrdf1.describe()

In [0]:
hrdf1['EmployeeCount'].value_counts()


<font color=green>We can see that the EmployeeCount column contains "1" in every row, because each row represents one employee. Since it is constant, we remove it.</font>

In [0]:
hrdf1=hrdf1.drop("EmployeeCount",axis=1)

In [0]:
# Now we see correlation
hrdf1.corr()

In [0]:
hrdf1["StandardHours"].value_counts()

In [0]:
hrdf1=hrdf1.drop("StandardHours",axis=1)

We remove StandardHours because it is constant and therefore holds no weight in the equation.
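Instead of checking columns one by one with `value_counts()`, constant columns can be found in one pass. This is a minimal sketch on a hypothetical three-row sample; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample with two constant columns
df = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [80, 80, 80],
})

# A column with a single distinct value cannot help separate the classes
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
df = df.drop(columns=constant_cols)

print(constant_cols)
print(df.columns.tolist())
```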

In [0]:
# Now we see correlation
corr=hrdf1.corr()
corr

In [0]:

# Heatmap of the correlation matrix, rounded to two decimals
plt.figure(figsize=(28,28))
sns.heatmap(corr.round(2),annot=True,cmap="Blues")
plt.title('Correlation heatmap', fontsize=20)

In [0]:
# shape of the data after dropping the constant columns
hrdf1.shape

# Splitting the data for training and testing

### The dataset is partitioned in the ratio 75:25, with 1102 records for training and 368 records for testing.
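The record counts follow from the split ratio. With a fractional `test_size`, scikit-learn rounds the test count up and assigns the remaining rows to training, which this short arithmetic sketch reproduces:

```python
import math

n_samples = 1470   # rows in the HR data
test_size = 0.25   # fraction held out for testing

# The test count is rounded up; the remainder goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```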

In [0]:
# split the dataset into features and target variable

X= hrdf1.iloc[:,0:31]
y = hrdf1.iloc[:,-1]

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
print("Xtrain shape:",X_train.shape)
print("Xtest shape:",X_test.shape)
print("ytrain shape:",y_train.shape)
print("ytest shape:",y_test.shape)

In [0]:
#import the class
from sklearn.linear_model import LogisticRegression
#instantiate the model (using the default parameters)
logreg=LogisticRegression()

In [0]:
#fit the model with data
logreg.fit(X_train,y_train)
y_pred=logreg.predict(X_test)

In [0]:
#import the metrics class
from sklearn import metrics
cnf_matrix=metrics.confusion_matrix(y_test,y_pred)
print(cnf_matrix) # the off-diagonal entries (26 and 11) are incorrect predictions

In [0]:
plt.imshow(cnf_matrix, cmap='binary')

In [0]:
sns.set(font_scale=1.3)#for label size
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu",fmt='g')


In [0]:
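# Refit on the full dataset and score on the same data.
# Note: this is training accuracy, which tends to be optimistic.
LogReg=LogisticRegression()
LogReg.fit(X,y)
print("Score ",LogReg.score(X,y))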

In [0]:
# Accuracy, precision and recall on the test set
print("Accuracy:",metrics.accuracy_score(y_test,y_pred))
print("Precision:",metrics.precision_score(y_test,y_pred))
print("Recall:",metrics.recall_score(y_test,y_pred))


# From the confusion matrix, the model accuracy is observed to be 87%
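To show how accuracy, precision and recall are read off a confusion matrix, here is a small sketch. scikit-learn's `confusion_matrix` puts true classes on the rows and predicted classes on the columns, so for labels 0/1 the layout is `[[TN, FP], [FN, TP]]`. The counts below are hypothetical and are not the results of this notebook.

```python
def classification_metrics(cm):
    """Accuracy, precision and recall from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Guard against division by zero when no positives are predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts for illustration only
example_cm = [[300, 10],
              [30, 28]]

accuracy, precision, recall = classification_metrics(example_cm)
print("Accuracy:", round(accuracy, 3))
print("Precision:", round(precision, 3))
print("Recall:", round(recall, 3))
```

In this made-up example accuracy is high while recall is low, which is why precision and recall are worth reporting alongside accuracy when one class is much rarer than the other.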

# Classification Report

In [0]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

# ROC (Receiver Operating Characteristic) Curve

In [0]:
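# predicted probability of the positive class
y_pred_proba=logreg.predict_proba(X_test)[:,1]
fpr,tpr,_=metrics.roc_curve(y_test, y_pred_proba)
auc=metrics.roc_auc_score(y_test,y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(round(auc,3)))
plt.plot([0,1],[0,1],'k--') # diagonal = random classifier
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend(loc=4)
plt.show()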

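# The ROC curve shows how well the model separates the two classes; the figure reported for this run is 0.6 (presumably the AUC in the legend), where 0.5 corresponds to random guessing and 1.0 to perfect separation.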
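The area under the ROC curve can be read as the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on a tiny hypothetical example; it is a pairwise illustration, not how `roc_auc_score` is implemented internally.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```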

# Cross Validation

In [0]:
from sklearn.model_selection import cross_val_score
clf = LogisticRegression()
# cross-validate on the full dataset rather than on the test split alone
scores = cross_val_score(clf, X, y, cv=10)
print("10-fold cross-validation scores")
for score in scores:
    print(score)
print("\n")
# mean of the cross-validation scores
print("Mean score", np.mean(scores))

#### Because the accuracy changes with which records fall into the training and testing samples, we use cross-validation, which gives an average accuracy for the model across all folds.
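The idea behind 10-fold cross-validation can be sketched without any library: the row indices are cut into ten folds, and each fold takes one turn as the test set while the other nine are used for training. The helper `k_fold_indices` below is a hypothetical illustration; it neither shuffles nor stratifies, whereas `cross_val_score` with a classifier and an integer `cv` uses stratified folds.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, test_indices) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds

folds = k_fold_indices(1470, 10)
for train_idx, test_idx in folds[:3]:
    print(len(train_idx), len(test_idx))
```

Averaging the score from each fold gives a single estimate that depends less on any one particular split.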