# Project Success/Failure Prediction

Two different methods were implemented:
### Method 1:
- Several well-established supervised learning algorithms (SVM, XGBoost, a GLM, gradient boosting and KNN) were chosen to predict the outcome (Success/Failure).
- To reduce overfitting, these classifiers were trained separately and their predictions were ensembled afterwards.

In [232]:
# Importing required libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [233]:
# Reading our dataset
df = pd.read_csv('Original_Exp.csv')
df.head(12)

Unnamed: 0,Stage,Name,Client_Revenue,Region,Client_Domain,Existing_Customer,TCS_Revenue,TCS_Domain,TimeStamp,ID
0,Stage 1,A Plus Lawn Care,251114.773898,north america,Healthcare,1,4487.269736,Cloud Infrastructure,2020-05-25 20:17:37.325647,0
1,Stage 9,A Plus Lawn Care,251114.773898,north america,Healthcare,1,4487.269736,Cloud Infrastructure,2020-06-30 20:17:37.325647,0
2,Stage 1,A+ Electronics,191404.595489,north america,BFSI,1,3746.0,IT infrastructure services,2021-04-25 20:17:37.325647,1
3,Stage 9,A+ Electronics,191404.595489,north america,BFSI,1,3746.0,IT infrastructure services,2021-04-29 20:17:37.325647,1
4,Stage 1,A+ Investments,246067.167131,north america,Telecom,1,10990.809563,Assurance services,2020-11-25 20:17:37.325647,2
5,Stage 9,A+ Investments,246067.167131,north america,Telecom,1,10990.809563,Assurance services,2021-05-16 20:17:37.325647,2
6,Stage 1,A21,147023.0,north america,BFSI,1,27419.506339,Engineering and Industrial services,2020-12-13 20:17:37.325647,3
7,Stage 9,A21,147023.0,north america,BFSI,1,27419.506339,Engineering and Industrial services,2021-05-25 20:17:37.325647,3
8,Stage 1,"Aaron'S, Inc.",105127.041768,north america,BFSI,1,19264.48159,IT infrastructure services,2020-11-22 20:17:37.325647,4
9,Stage 2,"Aaron'S, Inc.",105127.041768,north america,BFSI,1,19264.48159,IT infrastructure services,2020-11-26 20:17:37.325647,4


- Creating a new column called `Status` (1 if the project's last recorded stage is `Stage 9`, otherwise 0)

In [234]:
# Creating new column called Status (1: Success, 0: Fail).
# A project counts as a success when its last recorded stage is 'Stage 9'.
groups = []
for name in df.Name.unique():
    new_df = df[df.Name == name]
    status = int(new_df['Stage'].iloc[-1] == 'Stage 9')
    groups.append(new_df.assign(Status=status))

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat(groups)
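
The labelling rule can be checked on a small made-up frame. This is only a sketch: the `toy` frame and its values are hypothetical, and it assumes rows are already in chronological order within each project, as they appear to be in the dataset above.

```python
import pandas as pd

# Hypothetical toy data: A ends at Stage 9, B stops at Stage 2, C has a single Stage 9 row
toy = pd.DataFrame({
    'Name':  ['A', 'A', 'B', 'B', 'C'],
    'Stage': ['Stage 1', 'Stage 9', 'Stage 1', 'Stage 2', 'Stage 9'],
})

# Broadcast each project's last stage to all of its rows, then compare with 'Stage 9'
last_stage = toy.groupby('Name')['Stage'].transform('last')
toy['Status'] = (last_stage == 'Stage 9').astype(int)
print(toy)
```

Every row of a project receives the same label, so early-stage rows of a successful project are also marked 1.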

- Creating a new column called `Recency` (days between each record and the latest record of the same project)

In [235]:
# Converting TimeStamp parameter to date-time format
# (no explicit format: the values carry seconds and microseconds)
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])

# Recency: days between each record and the latest record of the same project
latest = df.groupby('ID')['TimeStamp'].transform('max')
df['Recency'] = (latest - df['TimeStamp']).dt.days

- Creating a new column called `Frequency` (number of records per project)

In [236]:
# Count the records of each project and attach the count to every one of its rows
df['Frequency'] = df.groupby('ID')['TimeStamp'].transform('count')

# Keep a copy of the un-encoded frame; it is reused later to build NextStage
final_df = df.copy()
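
Both features can be verified on a small made-up frame. The sketch below uses `groupby(...).transform`, which keeps each result aligned with the original rows. The `toy` frame and its dates are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two projects with two and three records
toy = pd.DataFrame({
    'ID': [0, 0, 1, 1, 1],
    'TimeStamp': pd.to_datetime([
        '2020-05-25', '2020-06-30',
        '2020-11-22', '2020-11-26', '2020-12-06',
    ]),
})

by_id = toy.groupby('ID')['TimeStamp']
# Recency: days between a row and the latest row of the same project
toy['Recency'] = (by_id.transform('max') - toy['TimeStamp']).dt.days
# Frequency: number of records of the project
toy['Frequency'] = by_id.transform('count')
print(toy)
```

The latest record of each project always has a `Recency` of 0.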

- One-hot encoding for categorical variables (`Region`, `TCS_Domain`, `Client_Domain`)


In [237]:
one_hot = pd.get_dummies(df['Region'])
df = df.drop('Region',axis = 1)
df = df.join(one_hot)
one_hot = pd.get_dummies(df['TCS_Domain'])
df = df.drop('TCS_Domain',axis = 1)
df = df.join(one_hot)
one_hot = pd.get_dummies(df['Client_Domain'])
df = df.drop(['Client_Domain','TimeStamp'],axis = 1)
df = df.join(one_hot)
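
The three encode/drop/join steps can also be written as one `pd.get_dummies` call. The following is a minimal sketch on a hypothetical `toy` frame. Passing empty `prefix` and `prefix_sep` keeps the bare category names as column names, matching the cell above.

```python
import pandas as pd

# Hypothetical toy data with the three categorical columns and one numeric column
toy = pd.DataFrame({
    'Region': ['north america', 'UK'],
    'TCS_Domain': ['Cloud Infrastructure', 'Assurance services'],
    'Client_Domain': ['Healthcare', 'BFSI'],
    'TCS_Revenue': [4487.27, 3746.0],
})

# Encode all three columns at once; the original columns are dropped automatically
encoded = pd.get_dummies(
    toy,
    columns=['Region', 'TCS_Domain', 'Client_Domain'],
    prefix='', prefix_sep='', dtype=int,
)
print(encoded.columns.tolist())
```

Without prefixes, a category name shared by two columns would produce duplicate column names, so the default prefixes are safer when that can happen.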

- Separating features from the target and extracting the stage number from the `Stage` string

In [238]:
# Target: Status as a single-column DataFrame
Y = df[['Status']].copy()
# Features: everything except the target and the project name
X = df.drop(['Status', 'Name'], axis=1).reset_index(drop=True)
# Keep only the numeric part of the stage label, e.g. 'Stage 9' -> 9
X['Stage'] = X['Stage'].str.extract(r'(\d+)', expand=False).astype(int)

- Sneak peek of our current dataset

In [239]:
df.head(5)

Unnamed: 0,Stage,Name,Client_Revenue,Existing_Customer,TCS_Revenue,ID,Status,Recency,Frequency,UK,...,BFSI,Energy & Utilities,Healthcare,Hi-Tech,Manufacturing,Media,Others,Retail & Distribution,Telecom,Travel and Hospitality
0,Stage 1,A Plus Lawn Care,251114.773898,1,4487.269736,0,1,36,2,0,...,0,0,1,0,0,0,0,0,0,0
1,Stage 9,A Plus Lawn Care,251114.773898,1,4487.269736,0,1,0,2,0,...,0,0,1,0,0,0,0,0,0,0
2,Stage 1,A+ Electronics,191404.595489,1,3746.0,1,1,4,2,0,...,1,0,0,0,0,0,0,0,0,0
3,Stage 9,A+ Electronics,191404.595489,1,3746.0,1,1,0,2,0,...,1,0,0,0,0,0,0,0,0,0
4,Stage 1,A+ Investments,246067.167131,1,10990.809563,2,1,172,2,0,...,0,0,0,0,0,0,0,0,1,0


- Splitting the dataset into train and test sets (split ratio = 75:25)

In [240]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.25,random_state=1)

- Training and predicting with **SVM Classifier** (kernel = sigmoid)

In [241]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

svclassifier = SVC(kernel='sigmoid')
svclassifier.fit(X_train, y_train)
pred1 = svclassifier.predict(X_test)

def print_metrics(x,y):
    print('Accuracy:', accuracy_score(x, y))
    print('F1 score:', f1_score(x, y, average='weighted'))
    print('Recall:', recall_score(x, y, average='weighted'))
    print('Precision:', precision_score(x, y))

print_metrics(pred1, y_test)

Accuracy: 0.5172839506172839
F1 score: 0.5169049111375982
Recall: 0.5172839506172839
Precision: 0.5528089887640449
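
scikit-learn's metric functions take the true labels first and the predictions second. The sketch below uses made-up labels to show that binary precision and recall trade places when the two arguments are swapped. This is worth keeping in mind when reading figures produced by a call such as `print_metrics(pred1, y_test)`.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels for illustration only: TP = 2, FN = 1, FP = 3, TN = 2
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 1, 0, 0]

# Conventional order: (y_true, y_pred)
p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)

# Swapping the arguments exchanges the roles of FP and FN
p_swapped = precision_score(y_pred, y_true)
r_swapped = recall_score(y_pred, y_true)
print(p, r, p_swapped, r_swapped)
```

Accuracy is unaffected by the order, since it only counts agreements between the two sequences.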


- Training and predicting with **XGBoost** (max_depth = 1, learning_rate = 0.001)

In [242]:
import xgboost as xgb

overall_xgb_model = xgb.XGBClassifier(max_depth=1, learning_rate=0.001,objective= 'binary:logistic',n_jobs=-1).fit(X_train, y_train)
pred2 = overall_xgb_model.predict(X_test)
print_metrics(pred2,y_test)

Accuracy: 0.7345679012345679
F1 score: 0.7361993029486826
Recall: 0.7345679012345679
Precision: 0.6292134831460674


- Training and predicting with **Generalized Linear Model** (logistic regression via `statsmodels`)

In [243]:
import statsmodels.api as sm

log_reg = sm.Logit(y_train,X_train).fit()
_pred = log_reg.predict(X_test)
pred3 = list(map(round, _pred))
print_metrics(pred3, y_test)

Optimization terminated successfully.
         Current function value: 0.589074
         Iterations 6
Accuracy: 0.671604938271605
F1 score: 0.6765899643825574
Recall: 0.671604938271605
Precision: 0.7752808988764045
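
`Logit.predict` returns probabilities, which the cell above turns into labels with Python's `round`. A small sketch on made-up probabilities shows one detail: `round` sends exact ties to the nearest even integer, so 0.5 becomes 0, while an explicit threshold makes the cut-off visible. The `proba` values are hypothetical.

```python
import numpy as np

# Hypothetical predicted probabilities
proba = np.array([0.2, 0.5, 0.73])

# Built-in round: ties go to the nearest even integer, so 0.5 -> 0
rounded = [round(float(p)) for p in proba]

# Explicit threshold: anything at or above 0.5 is labelled 1
labels = (proba >= 0.5).astype(int)
print(rounded, labels.tolist())
```

Exact ties are rare with real-valued probabilities, so the two approaches usually agree.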


- Training and predicting with **Gradient Boosting Classifier**

In [244]:
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=10, learning_rate=0.1,max_depth=1, random_state=0).fit(X_train, y_train)
pred4 = clf.predict(X_test)
print_metrics(pred4, y_test)

Accuracy: 0.7345679012345679
F1 score: 0.7361993029486826
Recall: 0.7345679012345679
Precision: 0.6292134831460674


- Training and predicting with **KNeighborsClassifier**

In [245]:
from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)
pred5 = neigh.predict(X_test)
print_metrics(pred5, y_test)

Accuracy: 0.7654320987654321
F1 score: 0.7648950843395287
Recall: 0.7654320987654321
Precision: 0.7303370786516854


- Ensembling by taking the mode (majority vote) of the five predictions

In [246]:
res = []
for i in range(0,len(pred1)):
  if (pred1[i] + pred2[i] + pred3[i] + pred4[i] + pred5[i] >= 3):
    res.append(1)
  else:
    res.append(0)

res = pd.DataFrame(res)
print_metrics(res, y_test)

Accuracy: 0.7654320987654321
F1 score: 0.7651817976513099
Recall: 0.7654320987654321
Precision: 0.7078651685393258
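
The same majority vote can be written without an explicit loop. This is a minimal sketch with NumPy on made-up predictions. The `preds` array is hypothetical and stands in for `pred1` to `pred5` stacked row by row.

```python
import numpy as np

# Hypothetical 0/1 predictions from five classifiers (rows) on four samples (columns)
preds = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
])

# A sample is labelled 1 when more than half of the classifiers vote 1
votes = preds.sum(axis=0)
majority = (votes > preds.shape[0] / 2).astype(int)
print(majority)
```

With five binary classifiers a tie is impossible, so "more than half" is the same rule as "at least 3".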


- Creating a new column called `NextStage` (the stage that follows each record within the same project)

In [247]:
# Pair every recorded stage with the stage that followed it in the same project
df = final_df.copy()
groups = []
for name in df.Name.unique():
    new_df = df[df.Name == name].reset_index(drop=True)
    new_df['NextStage'] = new_df['Stage'].shift(-1)
    # The last row of a project has no successor; drop it unless it is the only row
    if len(new_df) > 1:
        new_df = new_df.iloc[:-1]
    groups.append(new_df)

# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
my_df = pd.concat(groups)

# Saving the dataset
my_df.to_csv('NextStage_30.csv')
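
The pairing of each stage with its successor can be illustrated on a small made-up frame. The sketch below uses a grouped `shift(-1)`. The `toy` frame is hypothetical and assumes chronological order within each project.

```python
import pandas as pd

# Hypothetical toy data: project A has three records, project B has two
toy = pd.DataFrame({
    'Name':  ['A', 'A', 'A', 'B', 'B'],
    'Stage': ['Stage 1', 'Stage 2', 'Stage 9', 'Stage 1', 'Stage 4'],
})

# Shift within each project so every row sees the stage that came after it
toy['NextStage'] = toy.groupby('Name')['Stage'].shift(-1)
# The final row of each project has no successor and is dropped
pairs = toy.dropna(subset=['NextStage'])
print(pairs)
```

Each project with n records yields n - 1 (Stage, NextStage) pairs.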

## HMM (Hidden Markov Model) Filter
### Method 2: 
- HMMs are designed for data with sequential patterns.
- The HMM filter revises the base classifier's predictions according to their uncertainty and a state transition matrix estimated from unlabeled data, using the **Viterbi algorithm** to find the most likely sequence of states.
- An HMM is defined by its hidden states, the state transition probabilities, the possible observations and their emission probabilities.

**Steps:**
 - Train base classifier on training dataset
 - Predict labels for unlabeled dataset using trained classifier
 - Estimate HMM state transition matrix from predicted labels of unlabeled dataset
 - Estimate class probability distributions for test dataset using trained classifier
 - Predict most likely sequence of states for each session in test dataset using HMM filter
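
The last step relies on Viterbi decoding. The following is a minimal NumPy sketch of that idea, not the `hmm_filter` implementation: the function `viterbi`, its arguments and the example numbers are hypothetical, and it assumes the classifier's class probabilities can be used directly as emission scores.

```python
import numpy as np

def viterbi(start, trans, obs_probs):
    """Return the most likely state path.

    start:     (S,) initial state probabilities
    trans:     (S, S) matrix, trans[i, j] = P(next = j | current = i)
    obs_probs: (T, S) per-step class probabilities, used as emission scores
    """
    eps = 1e-12  # avoids log(0)
    log_trans = np.log(trans + eps)
    log_obs = np.log(obs_probs + eps)
    n_steps, n_states = obs_probs.shape
    score = np.log(start + eps) + log_obs[0]
    back = np.zeros((n_steps, n_states), dtype=int)
    for t in range(1, n_steps):
        # cand[i, j]: best log-score of a path ending in i at t-1, then moving to j
        cand = score[:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_obs[t]
    # Trace the best path backwards from the best final state
    path = [int(score.argmax())]
    for t in range(n_steps - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical example: two "sticky" states and one uncertain middle prediction
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
obs = np.array([[0.9, 0.1], [0.45, 0.55], [0.9, 0.1]])
print(obs.argmax(axis=1).tolist(), viterbi(start, trans, obs))
```

Taken step by step, the classifier would switch state in the middle. The decoded path stays in state 0 because the low-confidence prediction does not outweigh the cost of two unlikely transitions.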

In [248]:
# Reading dataset
df = pd.read_csv('NextStage_30.csv')

- Extracting stage number from columns `Stage` & `NextStage`

In [249]:
df.dropna(subset = ["NextStage"], inplace=True)
# Keep only the numeric part of each label, e.g. 'Stage 9' -> 9
df['Stage'] = df['Stage'].str.extract(r'(\d+)', expand=False).astype(int)
df['NextStage'] = df['NextStage'].str.extract(r'(\d+)', expand=False).astype(int)

- One-hot encoding for categorical variables (`Region`, `TCS_Domain`, `Client_Domain`)

In [250]:
one_hot = pd.get_dummies(df['Region'])
df = df.drop('Region',axis = 1)
df = df.join(one_hot)
one_hot = pd.get_dummies(df['TCS_Domain'])
df = df.drop('TCS_Domain',axis = 1)
df = df.join(one_hot)
one_hot = pd.get_dummies(df['Client_Domain'])
df = df.drop(['Client_Domain','TimeStamp','Name'],axis = 1)
df = df.join(one_hot)

In [251]:
# Importing required libraries
from os import cpu_count
from sklearn.model_selection import cross_validate
from hmm_filter.hmm_filter import HMMFilter

- Splitting dataset into `train_dataset`, `unlabeled_dataset` & `test_dataset` (roughly 60:30:10, in row order)

In [252]:
a = int(len(df)*0.6)
b = int(len(df)*0.3)

train_dataset = df[:a].copy()
unlabeled_dataset = df[a:a+b].copy()
test_dataset = df[a+b:].copy()

- Extracting Features & Labels

In [253]:
# training dataset
X_train = train_dataset.drop('NextStage',axis = 1).values
y_train = train_dataset["NextStage"].values

# test dataset
X_test = test_dataset.drop('NextStage',axis = 1).values
y_test = test_dataset["NextStage"].values

# unlabeled dataset
X_unlabeled = unlabeled_dataset.drop('NextStage',axis = 1).values
print(X_test)

[[1.00000000e+00 3.00000000e+00 1.30829000e+05 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [2.00000000e+00 4.00000000e+00 1.30829000e+05 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [3.00000000e+00 5.00000000e+00 1.30829000e+05 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 ...
 [0.00000000e+00 1.00000000e+00 2.03121213e+05 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [1.00000000e+00 2.00000000e+00 2.03121213e+05 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 1.00000000e+00 6.33434043e+04 ... 0.00000000e+00
  0.00000000e+00 1.00000000e+00]]


- Train **Bagging Classifier**

In [254]:
# Instantiate bagging classifier and fit to training data
# (with no base estimator given, scikit-learn bags decision trees)
from sklearn.ensemble import BaggingClassifier
clf = BaggingClassifier(n_estimators=500, random_state=100).fit(X_train, y_train)

test_dataset["prediction_rf"] = clf.predict(X_test)

def print_metrics(x,y):
    print('Accuracy:', accuracy_score(x, y))
    print('F1 score:', f1_score(x, y, average='weighted'))
    print('Recall:', recall_score(x, y, average='weighted'))
    print('Precision:', precision_score(x, y, average='weighted'))
# Evaluate accuracy of predictions
print_metrics(test_dataset['NextStage'],test_dataset['prediction_rf'])

Accuracy: 0.7962085308056872
F1 score: 0.7880108618894737
Recall: 0.7962085308056872
Precision: 0.8134971218971873


- Predict labels for unlabeled dataset

In [255]:
# predict classes for unlabeled dataset
unlabeled_dataset["prediction_rf"] = clf.predict(X_unlabeled)

- Estimate HMM state transition matrix

In [256]:
# train HMM filter by estimating the state transition matrix
hmmfilter = HMMFilter()
hmmfilter.fit(unlabeled_dataset, session_column="ID", prediction_column="prediction_rf")
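
How `HMMFilter.fit` computes the matrix internally is not shown here. One common estimate, sketched below on made-up data, counts consecutive label pairs within each session and normalises each row. The `toy` frame and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical predicted labels for two sessions
toy = pd.DataFrame({
    'ID':            [0, 0, 0, 1, 1],
    'prediction_rf': [1, 2, 9, 1, 9],
})

states = sorted(toy['prediction_rf'].unique().tolist())
index = {s: k for k, s in enumerate(states)}
counts = np.zeros((len(states), len(states)))

# Count consecutive label pairs inside each session, never across sessions
for _, labels in toy.groupby('ID')['prediction_rf']:
    seq = labels.tolist()
    for cur, nxt in zip(seq[:-1], seq[1:]):
        counts[index[cur], index[nxt]] += 1

# Normalise each row into probabilities; rows with no outgoing transition stay zero
row_sums = counts.sum(axis=1, keepdims=True)
trans = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
print(states)
print(trans)
```

Because the counts come from predicted labels rather than true ones, the estimate needs no labelled data, but it inherits the base classifier's errors.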

- Estimate class probability distributions

In [257]:
d = pd.DataFrame.from_records(clf.predict_proba(X_test), columns=clf.classes_).to_dict(orient="records")
test_dataset["probabs"] = [{ k:v for k,v in r.items() if v > 0} for r in d ]
test_dataset.head()

Unnamed: 0.1,Unnamed: 0,Stage,Client_Revenue,Existing_Customer,TCS_Revenue,ID,Status,Recency,Frequency,NextStage,...,Healthcare,Hi-Tech,Manufacturing,Media,Others,Retail & Distribution,Telecom,Travel and Hospitality,prediction_rf,probabs
1914,1,3,130829.0,1,11558.260454,1018,0,391,5,4,...,0,0,0,0,0,0,0,0,4,"{4: 0.818, 5: 0.134, 6: 0.042, 7: 0.002, 8: 0...."
1915,2,4,130829.0,1,11558.260454,1018,0,242,5,5,...,0,0,0,0,0,0,0,0,5,"{5: 0.872, 6: 0.072, 7: 0.044, 8: 0.008, 9: 0...."
1916,3,5,130829.0,1,11558.260454,1018,0,28,5,10,...,0,0,0,0,0,0,0,0,6,"{6: 0.476, 7: 0.1, 8: 0.02, 10: 0.274, 11: 0.13}"
1917,0,1,190318.552048,0,22958.497369,1019,1,44,2,9,...,0,0,0,0,0,0,0,0,9,{9: 1.0}
1918,0,1,220614.400247,1,19870.602832,1020,1,47,2,9,...,0,0,1,0,0,0,0,0,9,{9: 1.0}


- Predict most likely sequence of states using HMM filter

In [258]:
df = hmmfilter.predict(test_dataset, session_column='ID', probabs_column="probabs", prediction_column='prediction')


- Evaluating performance of our model

In [259]:
print_metrics(df['prediction'],y_test)

Accuracy: 0.8293838862559242
F1 score: 0.8328088620141683
Recall: 0.8293838862559242
Precision: 0.8508076683097592
