# Problem Statement

Your client is a financial distribution company. Over the last 10 years, they have created an offline distribution channel across the country. They sell financial products to consumers by hiring agents in their network. These agents are freelancers and get a commission when they make a product sale.

### Overview of your client's onboarding process

The managers at your client are primarily responsible for recruiting agents. Once a manager has identified a potential applicant, the manager explains the business opportunity to them. Once the applicant gives consent, an application is made to your client to become an agent. Within the next 3 months, this potential agent has to undergo 7 days of training at your client's branch (covering sales processes and the various products) and clear a subsequent examination in order to become an agent.

### The problem - who are the best agents?

As is evident from the above process, your client makes a significant investment in identifying, training, and recruiting these agents. However, a set of agents does not bring in the expected business. Your client is looking to data scientists like you for insights from their past recruitment data. They want to predict the target variable for each potential agent, which would help them identify the right agents to hire.

- ID: Unique application ID
- Office_PIN: PIN code of your client's office
- Applicant_City_PIN: PIN code of the applicant's address
- Applicant_Gender: Applicant's gender
- Applicant_Marital_Status: Applicant's marital status
- Applicant_Occupation: Applicant's occupation
- Applicant_Qualification: Applicant's educational qualification
- Manager_Joining_Designation: Manager's designation at joining
- Manager_Current_Designation: Manager's designation at the time of application sourcing
- Manager_Grade: Manager's grade
- Manager_Status: Manager's current employment status (Probation/Confirmation)
- Manager_Gender: Manager's gender
- Manager_Num_Application: Number of applications sourced by the manager in the last 3 months
- Manager_Num_Coded: Number of agents recruited by the manager in the last 3 months
- Manager_Business: Amount of business sourced by the manager in the last 3 months
- Manager_Num_Products: Number of products sold by the manager in the last 3 months
- Manager_Business2: Amount of business sourced by the manager in the last 3 months, excluding business from their Category A advisor
- Manager_Num_Products2: Number of products sold by the manager in the last 3 months, excluding business from their Category A advisor
- Business_Sourced (target): Whether the applicant sourced business within 3 months of recruitment (1/0)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [9]:
df = pd.read_csv('dataset.csv').drop(['ID'],axis = 1)
df.shape

(8844, 18)

In [10]:
df.isnull().sum()

Office_PIN                        0
Applicant_City_PIN                0
Applicant_Gender                 53
Applicant_Marital_Status         59
Applicant_Occupation           1090
Applicant_Qualification          71
Manager_Joining_Designation       0
Manager_Current_Designation       0
Manager_Grade                     0
Manager_Status                    0
Manager_Gender                    0
Manager_Num_Application           0
Manager_Num_Coded                 0
Manager_Business                  0
Manager_Num_Products              0
Manager_Business2                 0
Manager_Num_Products2             0
Business_Sourced                  0
dtype: int64
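
Raw counts are easier to judge as a share of rows; for example, the 1090 missing Applicant_Occupation values are about 12.3% of the 8844 rows. A minimal sketch of that calculation on a toy frame (the values below are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (illustrative values only).
toy = pd.DataFrame({
    "Applicant_Gender": ["M", "F", np.nan, "M"],
    "Applicant_Occupation": ["Salaried", np.nan, np.nan, "Business"],
    "Manager_Grade": [3.0, 2.0, 4.0, 3.0],
})

# The mean of the boolean null mask is the fraction of missing values per column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real frame the same expression would be `df.isnull().mean()`.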

In [11]:
df

| | Office_PIN | Applicant_City_PIN | Applicant_Gender | Applicant_Marital_Status | Applicant_Occupation | Applicant_Qualification | Manager_Joining_Designation | Manager_Current_Designation | Manager_Grade | Manager_Status | Manager_Gender | Manager_Num_Application | Manager_Num_Coded | Manager_Business | Manager_Num_Products | Manager_Business2 | Manager_Num_Products2 | Business_Sourced |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842001 | 844120 | M | M | Others | Graduate | Level 1 | Level 2 | 3.0 | Confirmation | M | 2.0 | 1.0 | 335249.0 | 28.0 | 335249.0 | 28.0 | 0 |
| 1 | 842001 | 844111 | M | S | Others | Class XII | Level 1 | Level 2 | 3.0 | Confirmation | M | 2.0 | 1.0 | 335249.0 | 28.0 | 335249.0 | 28.0 | 1 |
| 2 | 800001 | 844101 | M | M | Business | Class XII | Level 1 | Level 1 | 2.0 | Confirmation | M | 0.0 | 0.0 | 357184.0 | 24.0 | 357184.0 | 24.0 | 0 |
| 3 | 814112 | 814112 | M | S | Salaried | Class XII | Level 1 | Level 3 | 4.0 | Confirmation | F | 0.0 | 0.0 | 318356.0 | 22.0 | 318356.0 | 22.0 | 0 |
| 4 | 814112 | 815351 | M | M | Others | Class XII | Level 1 | Level 1 | 2.0 | Confirmation | M | 2.0 | 1.0 | 230402.0 | 17.0 | 230402.0 | 17.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8839 | 250001 | 250004 | F | M | NaN | Graduate | Level 1 | Level 2 | 3.0 | Confirmation | M | 1.0 | 1.0 | 55000.0 | 2.0 | 55000.0 | 2.0 | 0 |
| 8840 | 814112 | 816118 | M | M | NaN | Class XII | Level 1 | Level 1 | 2.0 | Confirmation | M | 4.0 | 2.0 | 418339.0 | 13.0 | 418339.0 | 13.0 | 0 |
| 8841 | 160017 | 160032 | M | M | Salaried | Graduate | Level 2 | Level 2 | 3.0 | Probation | M | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 8842 | 753012 | 753014 | F | M | Salaried | Graduate | Level 2 | Level 2 | 3.0 | Confirmation | M | 0.0 | 0.0 | 316126.0 | 9.0 | 305775.0 | 8.0 | 0 |


In [12]:
df.dtypes

Office_PIN                       int64
Applicant_City_PIN               int64
Applicant_Gender                object
Applicant_Marital_Status        object
Applicant_Occupation            object
Applicant_Qualification         object
Manager_Joining_Designation     object
Manager_Current_Designation     object
Manager_Grade                  float64
Manager_Status                  object
Manager_Gender                  object
Manager_Num_Application        float64
Manager_Num_Coded              float64
Manager_Business               float64
Manager_Num_Products           float64
Manager_Business2              float64
Manager_Num_Products2          float64
Business_Sourced                 int64
dtype: object

In [14]:
for i in df.columns:
    print('****Values in ', i, '******')
    print(df[i].value_counts()/len(df))
    print()

****Values in  Office_PIN ******
695014    0.043193
221010    0.026798
121002    0.025102
211001    0.024649
400075    0.022501
            ...   
110034    0.000339
395001    0.000226
144001    0.000226
334002    0.000226
517503    0.000113
Name: Office_PIN, Length: 98, dtype: float64

****Values in  Applicant_City_PIN ******
202001    0.020692
492001    0.007576
305001    0.007010
452001    0.006219
476001    0.005654
            ...   
302031    0.000113
461775    0.000113
224209    0.000113
689108    0.000113
680008    0.000113
Name: Applicant_City_PIN, Length: 2858, dtype: float64

****Values in  Applicant_Gender ******
M    0.752601
F    0.241407
Name: Applicant_Gender, dtype: float64

****Values in  Applicant_Marital_Status ******
M    0.648236
S    0.343962
W    0.000678
D    0.000452
Name: Applicant_Marital_Status, dtype: float64

****Values in  Applicant_Occupation ******
Salaried         0.400950
Business         0.243894
Others           0.204545
Self Employed    0.016508
S
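
Because the loop divides by `len(df)`, the fractions for a column with missing values do not sum to 1: for Applicant_Gender, 0.752601 + 0.241407 is about 0.994, and the remainder is the missing share. A small sketch of the difference between the three ways of normalising, on a made-up series:

```python
import numpy as np
import pandas as pd

# Made-up series with one missing value (illustrative only).
s = pd.Series(["M", "M", "F", np.nan])

# Share of all rows, as in the loop above: missing rows stay in the denominator.
share_of_all_rows = s.value_counts() / len(s)

# Share of non-null rows: missing rows are dropped before normalising.
share_of_non_null = s.value_counts(normalize=True)

# Share of all rows with the missing level reported explicitly.
share_with_nan = s.value_counts(normalize=True, dropna=False)

print(share_of_all_rows, share_of_non_null, share_with_nan, sep="\n\n")
```

The third form is often the most convenient for this kind of scan, since the missing share shows up as its own row.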

In [16]:
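# Categorical applicant columns that contain missing values
features = ['Applicant_Gender', 'Applicant_Marital_Status', 'Applicant_Occupation', 'Applicant_Qualification']

# Fill each column with its most frequent value. Assign the result back instead of
# calling fillna(inplace=True) on a column selection, which newer pandas versions
# warn about and may not apply to df.
for i in features:
    df[i] = df[i].fillna(df[i].mode()[0])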
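
Mode imputation folds every missing value into the most frequent level, which matters most for Applicant_Occupation, where roughly 12% of rows are missing. As a hedged alternative (not what this notebook does), missingness can be kept as its own level; the label `"Missing"` below is a hypothetical choice. A sketch on made-up data comparing the two:

```python
import numpy as np
import pandas as pd

# Made-up column with two missing entries (illustrative only).
toy = pd.DataFrame(
    {"Applicant_Occupation": ["Salaried", np.nan, "Business", "Salaried", np.nan]}
)

# Option 1: mode imputation, as in the cell above.
mode_value = toy["Applicant_Occupation"].mode()[0]
mode_filled = toy["Applicant_Occupation"].fillna(mode_value)

# Option 2 (alternative, an assumption): keep missingness as a separate level.
flag_filled = toy["Applicant_Occupation"].fillna("Missing")

print(mode_filled.value_counts())
print(flag_filled.value_counts())
```

Which option works better is an empirical question; comparing validation AUC for both would settle it.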

In [17]:
df.isnull().sum()

Office_PIN                     0
Applicant_City_PIN             0
Applicant_Gender               0
Applicant_Marital_Status       0
Applicant_Occupation           0
Applicant_Qualification        0
Manager_Joining_Designation    0
Manager_Current_Designation    0
Manager_Grade                  0
Manager_Status                 0
Manager_Gender                 0
Manager_Num_Application        0
Manager_Num_Coded              0
Manager_Business               0
Manager_Num_Products           0
Manager_Business2              0
Manager_Num_Products2          0
Business_Sourced               0
dtype: int64

In [18]:
df = pd.get_dummies(df)

In [19]:
df.shape

(8844, 48)
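
By default, `pd.get_dummies` expands only the text (object or category) columns and passes numeric columns through unchanged. Office_PIN and Applicant_City_PIN are stored as int64, so they stay as single numeric columns and the model will treat a PIN code as a magnitude. A minimal sketch on a made-up three-row frame:

```python
import pandas as pd

# Made-up rows (illustrative only).
toy = pd.DataFrame({
    "Office_PIN": [842001, 800001, 814112],
    "Applicant_Gender": ["M", "F", "M"],
    "Business_Sourced": [0, 1, 0],
})

# Only the text column is expanded; the integer columns are left as they are.
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
```

If PIN codes should be treated as categories, one option is to cast them to strings first or pass them via the `columns` argument, at the cost of many more dummy columns (Applicant_City_PIN has 2858 distinct values).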

In [23]:
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import roc_auc_score as ras
from sklearn.linear_model import LogisticRegression as log

In [22]:
x = df.drop(['Business_Sourced'], axis = 1)
y = df['Business_Sourced']

In [24]:
xtrain, xtest, ytrain, ytest = tts(x,y, test_size = 0.3, random_state = 51, stratify = y)
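
Passing `stratify = y` keeps the share of positive labels the same in the train and test splits, which makes the two AUC scores comparable. A small sketch that checks this on made-up labels (30% positives is an arbitrary choice, not the dataset's class balance):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 30 positives, 70 negatives.
y_toy = np.array([1] * 30 + [0] * 70)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=51, stratify=y_toy
)

# With stratification, both splits keep the same positive rate as the full data.
print(y_tr.mean(), y_te.mean())
```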

In [25]:
lgr = log()

lgr.fit(xtrain, ytrain)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [29]:
ras(ytrain, lgr.predict_proba(xtrain)[:,1])

0.47636648297768724
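
ROC AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so 0.5 corresponds to random ranking and a value below 0.5 means the ranking is slightly inverted. A tiny worked example with made-up scores:

```python
from sklearn.metrics import roc_auc_score

# Four made-up examples: two negatives, two positives.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (positive, negative) pairs are ordered correctly.
auc = roc_auc_score(y_true, scores)

# Reversing the ranking mirrors the AUC around 0.5.
auc_flipped = roc_auc_score(y_true, [1 - s for s in scores])

print(auc, auc_flipped)
```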

In [30]:
ras(ytest, lgr.predict_proba(xtest)[:,1])

0.458783962597036
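
Both scores are below 0.5, so this baseline ranks applicants no better than chance. One plausible cause (an assumption, not verified on this dataset) is feature scale: the PIN codes and Manager_Business are in the hundreds of thousands while the dummy columns are 0/1, which can keep the default `lbfgs` solver from converging within 100 iterations. A minimal sketch of the usual remedy, standardising features inside a pipeline, shown on synthetic data rather than the client's data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the client's dataset).
X_syn, y_syn = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)
# Inflate one column to mimic a feature on the scale of Manager_Business.
X_syn[:, 0] *= 1e5

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=51, stratify=y_syn
)

# The scaler is fitted on the training split only, then applied to the test split.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

auc_test = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(round(auc_test, 3))
```

Applied to this notebook, the same pipeline would be fitted on `xtrain` and scored on `xtest`; whether it lifts the AUC above 0.5 here has to be checked by running it.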