### Task: 
Campus Tech Data Science Coding Assignment

### Created by :
- Rama Lakkireddy
- Date: March 2nd, 2021

## Importing the required libraries for the task

In [117]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline 
from sklearn.preprocessing import StandardScaler # feature scaling library
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

import warnings # used to suppress FutureWarning messages
warnings.filterwarnings("ignore", category=FutureWarning)

In [118]:
OriginalData = pd.read_csv("DataSet2.csv") # read the dataset and keep it untouched for later steps

data = OriginalData.copy() # working copy used for preprocessing

In [119]:
# Check the shape of the data, i.e. the total number of rows and columns
data.shape

(1700, 12)

In [120]:
# Look at a few sample rows of the data
data.head(5)

Unnamed: 0,y,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11
0,0,a,-0.375866,0.427942,-0.922338,0.210758,0.109015,0.621001,-0.444421,0.08997,-0.707711,0.4737
1,0,b,0.047819,0.115627,-1.781739,-0.272785,0.392783,1.094168,-0.975254,-0.353424,0.145543,-0.064961
2,1,d,0.372868,-0.263291,-1.120545,-0.773828,0.830072,-1.727836,1.323876,-1.587291,-0.024916,0.082491
3,0,c,0.059598,0.270797,0.961795,-1.804197,2.93133,1.891656,0.094252,-0.873467,-1.21768,-1.848046
4,1,d,0.616319,0.291275,-1.113519,0.626864,-0.287989,-0.842649,-0.947257,1.198215,0.97242,-1.054313


In [121]:
data['y'].value_counts() # check whether the dataset is balanced or imbalanced

0    1000
1     700
Name: y, dtype: int64

### Observation:
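The dataset is only mildly imbalanced: 1000 samples of class 0 (about 59%) versus 700 samples of class 1 (about 41%), so it is treated as roughly balanced.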
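
Class proportions are often easier to judge than raw counts. The sketch below rebuilds a label column from the counts shown above; the Series `labels` is a stand-in for `data['y']`, introduced only for illustration.

```python
import pandas as pd

# Stand-in for data['y'], built from the counts reported above (1000 zeros, 700 ones)
labels = pd.Series([0] * 1000 + [1] * 700, name="y")

# normalize=True returns class proportions instead of raw counts
proportions = labels.value_counts(normalize=True)
print(proportions.round(3))

# Ratio of the majority class to the minority class
counts = labels.value_counts()
imbalance_ratio = counts.max() / counts.min()
print("Majority/minority ratio:", round(imbalance_ratio, 2))
```

A ratio this close to 1 rarely calls for resampling, although metrics such as F1 remain more informative than accuracy alone.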

In [122]:
# Check for missing values; a count of zero means the column has no missing values.
data.isna().sum() 

y      0
x1     0
x2     0
x3     0
x4     0
x5     0
x6     0
x7     0
x8     0
x9     0
x10    0
x11    0
dtype: int64

In [123]:
# Check for duplicate rows
duplicate_value = data[data.duplicated()]
duplicate_value # the result is empty, so there are no duplicate rows

Unnamed: 0,y,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11


In [124]:
# Get the data type of each column and the number of records
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1700 entries, 0 to 1699
Data columns (total 12 columns):
y      1700 non-null int64
x1     1700 non-null object
x2     1700 non-null float64
x3     1700 non-null float64
x4     1700 non-null float64
x5     1700 non-null float64
x6     1700 non-null float64
x7     1700 non-null float64
x8     1700 non-null float64
x9     1700 non-null float64
x10    1700 non-null float64
x11    1700 non-null float64
dtypes: float64(10), int64(1), object(1)
memory usage: 159.5+ KB


### Observation:
The x1 column has dtype object, so it needs to be converted to numeric form because the ML algorithms used here do not accept string values.

In [125]:
# Convert the categorical column to integer codes using pandas
data['x1'] = data['x1'].astype("category")
data['x1'] = data['x1'].cat.codes
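
Integer codes impose an artificial order on the categories (a < b < c < d). Tree-based models usually tolerate this, but for models such as SVC or Naive Bayes, one-hot encoding is a common alternative. A minimal sketch, assuming a small hypothetical frame `demo` in place of the real data; this is not the encoding used in the rest of the notebook.

```python
import pandas as pd

# Hypothetical mini-frame standing in for the real data
demo = pd.DataFrame({"x1": ["a", "b", "d", "c", "d"],
                     "x2": [-0.38, 0.05, 0.37, 0.06, 0.62]})

# Integer codes, as in the cell above: categories are sorted alphabetically before coding
codes = demo["x1"].astype("category").cat.codes
print(codes.tolist())

# One-hot alternative: one 0/1 column per category, with no implied order
one_hot = pd.get_dummies(demo, columns=["x1"], dtype=int)
print(one_hot.columns.tolist())
```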

In [126]:
# Splitting data into features and labels
x = data.drop(columns=['y'])
y = data['y']

In [127]:
# Convert the pandas DataFrame and Series to NumPy arrays
x = x.values
y = y.values

In [128]:
# Shuffle the data and split it into train and test datasets
x,y = shuffle(x,y,random_state=1)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=1)
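
Since `train_test_split` already shuffles by default, the separate `shuffle` call is optional. With classes of unequal size, a stratified split keeps the class ratio identical in both parts. A minimal sketch, assuming stand-in arrays `x_demo` and `y_demo` that only mimic the class counts of this dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the class counts of this dataset (1000 zeros, 700 ones)
y_demo = np.array([0] * 1000 + [1] * 700)
x_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the class ratio the same in the train and test parts
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)
print(np.bincount(y_tr), np.bincount(y_te))
```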

In [135]:
# Random forest classifier training
# Note: with with_mean=False and with_std=False, StandardScaler leaves the features unchanged
rf_clf = make_pipeline(StandardScaler(with_mean=False,with_std=False), RandomForestClassifier(n_estimators=20))
rf_clf.fit(x_train,y_train)
print(" Training accuracy ",accuracy_score(y_train,rf_clf.predict(x_train)))
print(" Testing accuracy ",accuracy_score(y_test,rf_clf.predict(x_test)))
print(" Confusion matrix")
print(confusion_matrix(y_test,rf_clf.predict(x_test)))
# average is passed by keyword; a third positional argument is not read as the averaging mode
print("F1 Score ",f1_score(y_test,rf_clf.predict(x_test),average='binary'))
print("Predicted probabilities ", rf_clf.predict_proba(x_test))

 Training accuracy  0.9992647058823529
 Testing accuracy  0.9882352941176471
 Confusion matrix
[[190   2]
 [  2 146]]
F1 Score  0.9864864864864865
Predicted probabilities  [[0.   1.  ]
 [0.95 0.05]
 [0.95 0.05]
 [0.95 0.05]
 [1.   0.  ]
 [1.   0.  ]
 [0.   1.  ]
 [1.   0.  ]
 [0.05 0.95]
 [0.15 0.85]
 [0.6  0.4 ]
 [1.   0.  ]
 [0.15 0.85]
 [1.   0.  ]
 [1.   0.  ]
 [1.   0.  ]
 [1.   0.  ]
 [0.   1.  ]
 [0.   1.  ]
 [1.   0.  ]
 [0.85 0.15]
 [0.2  0.8 ]
 [1.   0.  ]
 [1.   0.  ]
 [1.   0.  ]
 [0.85 0.15]
 [0.95 0.05]
 [0.95 0.05]
 [0.   1.  ]
 [1.   0.  ]
 [0.7  0.3 ]
 [0.95 0.05]
 [0.15 0.85]
 [1.   0.  ]
 [0.1  0.9 ]
 [0.9  0.1 ]
 [0.   1.  ]
 [1.   0.  ]
 [1.   0.  ]
 [0.95 0.05]
 [1.   0.  ]
 [1.   0.  ]
 [0.1  0.9 ]
 [0.05 0.95]
 [0.05 0.95]
 [0.85 0.15]
 [0.7  0.3 ]
 [1.   0.  ]
 [0.1  0.9 ]
 [0.1  0.9 ]
 [1.   0.  ]
 [0.05 0.95]
 [0.15 0.85]
 [1.   0.  ]
 [0.9  0.1 ]
 [1.   0.  ]
 [0.85 0.15]
 [1.   0.  ]
 [1.   0.  ]
 [0.05 0.95]
 [0.   1.  ]
 [0.95 0.05]
 [0.   1.  ]
 [0.   1.  ]

## Confusion Matrix
1. __True Negative__ - represents the values which are predicted to be false and are actually false.
2. __False Positive__ - represents the values which are predicted to be true, but are false.
3. __False Negative__ - represents the values which are predicted to be false, but are true.
4. __True Positive__ - represents the values which are predicted to be true and are actually true.

### False Positive (type I error)
When we predict that something occurs and it does not occur (rejection of a true null hypothesis). Example: we predict that an earthquake will occur, but it does not happen.

### False Negative (type II error)
When we predict that something will not occur but it does occur (non-rejection of a false null hypothesis). Example: we predict that there will be no earthquake, but an earthquake occurs.

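Which error is more costly depends on the application. In settings like the earthquake example, a type II error (a missed event) is usually considered more critical than a type I error (a false alarm).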

In [130]:
# Confusion Matrix Stats
TN = 190 
FN = 2   
FP = 2   
TP = 146 

# Confusion matrix stats and metrics into a dataframe for easy reading and interpretation
# The false positive rate is FP/(FP+TN), i.e. 1 - specificity
cm_d = {'Sensitivity': [(TP/(TP+FN))*100], 'Specificity': [(TN/(TN+FP))*100], 'Precision': [(TP/(TP+FP))*100], 'False_Positive_Rate': [(FP/(FP+TN))*100], 'Misclassification_Rate': [(FP+FN)/(TN+FN+FP+TP)*100], 'Accuracy': [(TP + TN)/(TN+FN+FP+TP)*100]}
cm_metrics = pd.DataFrame(data = cm_d)
cm_metrics.transpose()

Unnamed: 0,0
Sensitivity,98.648649
Specificity,98.958333
Precision,98.648649
False_Positive_Rate,1.041667
Misclassification_Rate,1.176471
Accuracy,98.823529


## Classification Table Metrics
1. __Sensitivity (Recall)__ = TP/(TP+FN)
2. __Specificity__ = TN/(TN+FP)
3. __Precision__ = TP/(TP+FP)
4. __False_Positive_Rate__ = FP/(FP+TN)
5. __Misclassification_Rate__ = (FP+FN)/(TN+FN+FP+TP)
6. __Accuracy__ = (TP+TN)/(TN+FN+FP+TP)
7. __Rate of Success__ = (TP+TN)/(FP+FN), the ratio of correct to incorrect predictions
8. __F1 score__ = 2*Precision*Recall/(Precision+Recall)
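
As a worked check of these formulas, the sketch below computes them from the four counts of the random forest confusion matrix above. The helper name `classification_metrics` is hypothetical, and the values are returned as fractions rather than percentages.

```python
def classification_metrics(tn, fp, fn, tp):
    """Return the classification-table metrics as fractions (not percentages)."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall is the same quantity as sensitivity
    return {
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "false_positive_rate": fp / (fp + tn),
        "misclassification_rate": (fp + fn) / total,
        "accuracy": (tp + tn) / total,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts from the random forest confusion matrix above
metrics = classification_metrics(tn=190, fp=2, fn=2, tp=146)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The F1 value computed this way agrees with the one reported by `f1_score` for the random forest.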


### Naive Bayes Classifier

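Naive Bayes calculates the posterior probability of each class from the class prior and the likelihood of the features, assuming the features are conditionally independent given the class.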
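
Written out, this is Bayes' theorem with the naive independence assumption; for `GaussianNB` each per-feature likelihood is a Gaussian. The symbols $\mu_{y,i}$ and $\sigma_{y,i}$ (mean and standard deviation of feature $i$ within class $y$) are notation introduced here for illustration.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^{2}}}
\exp\!\left(-\frac{(x_i - \mu_{y,i})^{2}}{2\sigma_{y,i}^{2}}\right)
```

The predicted class is the one with the largest posterior.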

In [131]:
# Naive Bayes classifier training
from sklearn.naive_bayes import GaussianNB
NB_clf = make_pipeline(StandardScaler(with_mean=False,with_std=False), GaussianNB())
NB_clf.fit(x_train,y_train)
print(" Training accuracy ",accuracy_score(y_train,NB_clf.predict(x_train)))
print(" Testing accuracy ",accuracy_score(y_test,NB_clf.predict(x_test)))
print(" Confusion matrix")
print(confusion_matrix(y_test,NB_clf.predict(x_test)))
print("F1 Score ",f1_score(y_test,NB_clf.predict(x_test),average='binary'))

 Training accuracy  0.8669117647058824
 Testing accuracy  0.8529411764705882
 Confusion matrix
[[154  38]
 [ 12 136]]
F1 Score  0.84472049689441


In [132]:
# Support Vector classifier training
from sklearn.svm import SVC
svc_clf = make_pipeline(StandardScaler(with_mean=False,with_std=False), SVC())
svc_clf.fit(x_train,y_train)
print(" Training accuracy ",accuracy_score(y_train,svc_clf.predict(x_train)))
print(" Testing accuracy ",accuracy_score(y_test,svc_clf.predict(x_test)))
print(" Confusion matrix")
print(confusion_matrix(y_test,svc_clf.predict(x_test)))
print("F1 Score ",f1_score(y_test,svc_clf.predict(x_test),average='binary'))

 Training accuracy  0.9191176470588235
 Testing accuracy  0.8794117647058823
 Confusion matrix
[[168  24]
 [ 17 131]]
F1 Score  0.8646864686468647


### Question a)

Why is it the best model? Give an example of a data point on which the model doesn’t do well and explain
the characteristics of that observation. How do you see the limitations of your model? How do you plan to
improve it?

### Solution
1. The random forest classifier is the best of the three models: it has the highest testing accuracy (0.988, versus 0.853 for Naive Bayes and 0.879 for the support vector classifier), and the difference between its training accuracy (0.999) and testing accuracy is very small.
2. Its F1 score is also the highest (0.986), and it produces only 2 false negatives on the test set, compared with 12 for Naive Bayes and 17 for the support vector classifier.
3. Random forest is an ensemble method. Here it combines 20 decision trees and predicts the class by aggregating the votes of the individual trees, which usually generalizes better than a single tree.
4. The predicted probabilities can be used to decide whether a sample belongs to class 0 or class 1. If the two probabilities are close, such as 0.6 and 0.4 or 0.5 and 0.5, the sample is a confusing example for the model.
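
Confusing samples can be located programmatically from the `predict_proba` output. A minimal sketch, assuming a hypothetical helper `uncertain_rows` and an arbitrary margin of 0.25 around 0.5; the probability rows are copied from the printed output above.

```python
import numpy as np

# A few rows in the shape returned by predict_proba: columns are P(class 0), P(class 1).
# The values are taken from the printed output above.
proba = np.array([[0.0, 1.0], [0.95, 0.05], [0.6, 0.4], [0.7, 0.3], [0.15, 0.85]])

def uncertain_rows(proba, margin=0.25):
    """Return indices of rows whose class-1 probability lies within `margin` of 0.5."""
    return np.where(np.abs(proba[:, 1] - 0.5) < margin)[0]

print(uncertain_rows(proba))
```

The rows returned this way are the ones worth inspecting by hand to see which feature values make them hard to classify.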

### Question b(1)
A few days after you finish the initial model, you learn new information that the value of the X2 feature
makes sense only if it is positive (i.e. X2> 0 should be in the model instead of original X2). Show how you would address this issue.

### Solution:
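1. The pipeline includes a StandardScaler created with with_mean=False and with_std=False.
2. With both options disabled, the scaler passes the features through unchanged. With the default settings (both True), it would scale each feature to mean 0 and standard deviation 1.
3. The standard score of a sample x is calculated as z=(x-u)/s, where x is a value in a column, u is the mean of that column, and s is its standard deviation.
4. To address the new information about x2, the cell below keeps only the rows with x2 > 0 and retrains the random forest on them.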
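
What StandardScaler does under each setting can be checked directly. A minimal sketch on a small made-up matrix `X` (not the assignment data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small hypothetical feature matrix with two columns on different scales
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

# Both options disabled: the transform returns the input values unchanged
passthrough = StandardScaler(with_mean=False, with_std=False).fit_transform(X)
print(np.allclose(passthrough, X))

# Default settings: each column is centred to mean 0 and scaled to unit standard deviation
standardized = StandardScaler().fit_transform(X)
print(standardized.mean(axis=0), standardized.std(axis=0))
```

Random forests are insensitive to feature scaling, so this choice mainly matters for the SVC and Naive Bayes pipelines.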


In [108]:
data1 = OriginalData[OriginalData['x2'] > 0].copy() # .copy() avoids SettingWithCopyWarning when x1 is re-encoded below
data1
#data = data[data['x2'] > 0] #To execute in one single step

Unnamed: 0,y,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11
1,0,b,0.047819,0.115627,-1.781739,-0.272785,0.392783,1.094168,-0.975254,-0.353424,0.145543,-0.064961
2,1,d,0.372868,-0.263291,-1.120545,-0.773828,0.830072,-1.727836,1.323876,-1.587291,-0.024916,0.082491
3,0,c,0.059598,0.270797,0.961795,-1.804197,2.931330,1.891656,0.094252,-0.873467,-1.217680,-1.848046
4,1,d,0.616319,0.291275,-1.113519,0.626864,-0.287989,-0.842649,-0.947257,1.198215,0.972420,-1.054313
6,0,c,0.142068,0.345289,-1.460133,-0.532363,0.758932,0.774621,-0.822781,-0.494892,-1.736284,-1.076106
...,...,...,...,...,...,...,...,...,...,...,...,...
1689,1,b,0.607494,0.011744,-1.238585,-0.487627,0.523256,0.491626,-0.909850,-0.427637,0.887907,0.565468
1690,1,c,0.675912,0.164728,-0.238624,0.057843,-0.631889,-0.121608,0.551179,1.154057,0.520261,-0.493283
1692,1,b,0.503167,-0.230747,-1.704355,-0.459350,-0.578345,-2.201978,0.785508,-0.533763,0.540966,0.474100
1693,0,c,0.002887,0.355344,0.118125,-0.325976,0.558218,0.345394,-0.588490,0.440795,-0.158578,0.651394


In [114]:
# Convert the categorical column to integer codes using pandas
data1['x1'] = data1['x1'].astype("category")
data1['x1'] = data1['x1'].cat.codes

# Splitting data into features and labels
x = data1.drop(columns=['y'])
y = data1['y']

# Convert the pandas DataFrame and Series to NumPy arrays
x = x.values
y = y.values

# Shuffle the data and split it into train and test datasets
x,y = shuffle(x,y,random_state=1)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=1)

#Random forest classifier training
rf_clf = make_pipeline(StandardScaler(with_mean=False,with_std=False), RandomForestClassifier(n_estimators=20))
rf_clf.fit(x_train,y_train)
print(" Training accuracy ",accuracy_score(y_train,rf_clf.predict(x_train)))
print(" Testing accuracy ",accuracy_score(y_test,rf_clf.predict(x_test)))
print(" Confusion matrix")
print(confusion_matrix(y_test,rf_clf.predict(x_test)))
print("F1 Score ",f1_score(y_test,rf_clf.predict(x_test),average='binary'))


 Training accuracy  1.0
 Testing accuracy  1.0
 Confusion matrix
[[ 77   0]
 [  0 105]]
F1 Score  1.0
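
Filtering rows, as in the cell above, is one reading of the requirement. Another reading of "X2 > 0 should be in the model instead of original X2" is to replace the column with a 0/1 indicator and keep every row. A minimal sketch of that alternative, assuming a small hypothetical frame `demo` and a hypothetical column name `x2_positive`; the x2 values are the first five rows shown by `data.head(5)`.

```python
import pandas as pd

demo = pd.DataFrame({
    "y": [0, 0, 1, 0, 1],
    "x2": [-0.375866, 0.047819, 0.372868, 0.059598, 0.616319],
})

# Replace x2 with an indicator that is 1 when x2 > 0 and 0 otherwise
demo["x2_positive"] = (demo["x2"] > 0).astype(int)
demo = demo.drop(columns=["x2"])
print(demo)
```

This keeps all 1700 rows available for training, at the cost of discarding the magnitude of x2.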




### Question b(2)
Later, you additionally learn that X4 should’ve been used as an indicator variable (e.g. X4 that is greater than
a certain threshold should be treated differently than X4 values below that threshold). However, you don’t
know the value of threshold. Adapt your code to support a systematic way of finding threshold which leads
to improved model performance.
   
### Solution
Although the pipeline includes a StandardScaler step, it is not mandatory for all features to lie in the same range (0 to 1 or -1 to 1). When standardization is enabled, the ranges still differ between features, but each feature has mean 0 and standard deviation 1, which makes the patterns in the data easier for scale-sensitive machine learning algorithms to learn. As a first attempt, the cell below takes 0 as the threshold for x4 and keeps only the rows above it.

In [115]:
data2 = OriginalData[OriginalData['x4'] > 0].copy() # take 0 as a trial threshold; .copy() avoids SettingWithCopyWarning below
data2
#data = data[data['x4'] > 0] #To execute in one single step

Unnamed: 0,y,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11
3,0,c,0.059598,0.270797,0.961795,-1.804197,2.931330,1.891656,0.094252,-0.873467,-1.217680,-1.848046
5,0,c,-0.716144,0.259650,0.799474,-1.058823,-0.505464,2.202489,1.507675,1.245329,-2.106922,-0.421270
9,1,a,0.599369,0.062345,0.877789,0.171441,0.078635,-1.624168,-0.995793,1.162826,0.106844,2.119611
10,0,a,-0.583062,0.364628,0.779796,-1.365581,-0.636767,0.024686,-0.013499,1.690769,-0.842906,-0.675950
11,1,c,0.000381,-0.099098,0.256639,-1.494390,-0.452845,1.346259,-0.828126,0.559479,-1.149758,-0.524137
...,...,...,...,...,...,...,...,...,...,...,...,...
1691,0,c,-0.854529,-0.187580,0.916175,-0.095200,-0.627267,1.466028,-0.199312,0.005658,0.432920,-0.871332
1693,0,c,0.002887,0.355344,0.118125,-0.325976,0.558218,0.345394,-0.588490,0.440795,-0.158578,0.651394
1694,0,d,-0.505463,0.412871,0.207284,-0.011785,0.642973,1.374254,0.409001,1.168988,-0.077401,0.211484
1698,0,c,-0.448287,0.359409,0.983409,-1.057294,-0.501924,0.808380,-1.330170,3.595073,-0.716719,-0.891881


In [116]:
# Convert the categorical column to integer codes using pandas
data2['x1'] = data2['x1'].astype("category")
data2['x1'] = data2['x1'].cat.codes

# Splitting data into features and labels
x = data2.drop(columns=['y'])
y = data2['y']

# Convert the pandas DataFrame and Series to NumPy arrays
x = x.values
y = y.values

# Shuffle the data and split it into train and test datasets
x,y = shuffle(x,y,random_state=1)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=1)

#Random forest classifier training
rf_clf = make_pipeline(StandardScaler(with_mean=False,with_std=False), RandomForestClassifier(n_estimators=20))
rf_clf.fit(x_train,y_train)
print(" Training accuracy ",accuracy_score(y_train,rf_clf.predict(x_train)))
print(" Testing accuracy ",accuracy_score(y_test,rf_clf.predict(x_test)))
print(" Confusion matrix")
print(confusion_matrix(y_test,rf_clf.predict(x_test)))
print("F1 Score ",f1_score(y_test,rf_clf.predict(x_test),average='binary'))


 Training accuracy  1.0
 Testing accuracy  0.9941520467836257
 Confusion matrix
[[93  1]
 [ 0 77]]
F1 Score  0.9935483870967742
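
The run above fixes the threshold at 0. One systematic alternative is to treat the threshold as a hyperparameter: convert x4 to a 0/1 indicator for each candidate value, score every candidate with cross-validation, and keep the best one. A minimal sketch under stated assumptions: the data is synthetic (not DataSet2.csv), and `search_threshold`, `demo` and the quantile grid are hypothetical choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def search_threshold(df, column, target, candidates, cv=5):
    """Score each candidate threshold for `column`; return (best_threshold, best_score, scores)."""
    scores = {}
    for t in candidates:
        features = df.drop(columns=[target]).copy()
        # Replace the raw column with an indicator: 1 when the value exceeds the threshold
        features[column] = (features[column] > t).astype(int)
        model = RandomForestClassifier(n_estimators=20, random_state=0)
        scores[float(t)] = cross_val_score(
            model, features, df[target], cv=cv, scoring="f1"
        ).mean()
    best = max(scores, key=scores.get)
    return best, scores[best], scores

# Synthetic demo data: the label depends on whether x4 exceeds 0.5
rng = np.random.default_rng(0)
demo = pd.DataFrame({"x4": rng.normal(size=400), "x5": rng.normal(size=400)})
demo["y"] = (demo["x4"] > 0.5).astype(int)

# Candidate thresholds: evenly spaced quantiles of the observed x4 values
candidates = np.quantile(demo["x4"], np.linspace(0.1, 0.9, 17))
best_t, best_score, _ = search_threshold(demo, "x4", "y", candidates)
print(f"best threshold: {best_t:.3f}, mean CV F1: {best_score:.3f}")
```

The search should run on the training portion only, with the test set held back, so that the chosen threshold is not tuned to the data used for the final evaluation.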




### Question b(3)
Now that you realize feature engineering is a frequent need for improving our models, you want to create a
general framework to tackle such situations (e.g. conversion to indicators, defining certain thresholds, etc.).
What function/module do you develop to enable user-defined column conversions for given thresholds?
Suppose thresholds are given to you as arguments to your function.

### Solution
1. I have used make_pipeline with a StandardScaler step followed by the classifier.
2. With its default settings, StandardScaler tackles feature scaling by bringing each feature to mean 0 and standard deviation 1 before the model is applied. As configured here (with_mean=False, with_std=False), it leaves the values unchanged.
3. The pipeline automates preprocessing and model fitting in a single object, so conversions such as the x2 > 0 rule or a threshold on x4 can be added as extra steps and are then applied in the same way to training and test data.
4. In the future, any further preprocessing steps can be added to the make_pipeline call.
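
One possible shape for the user-defined conversion function is sketched below. It is an assumption about how such a helper could look, not code from the assignment: `apply_thresholds`, its `keep_original` flag and the frame `demo` are hypothetical names, and the thresholds arrive as a dictionary argument.

```python
import pandas as pd

def apply_thresholds(df, thresholds, keep_original=False):
    """Return a copy of `df` with a 0/1 indicator for each column named in `thresholds`.

    `thresholds` maps a column name to its cut-off; the indicator is 1 when value > cut-off.
    """
    out = df.copy()
    for column, cutoff in thresholds.items():
        if column not in out.columns:
            raise KeyError(f"column {column!r} not found")
        indicator = (out[column] > cutoff).astype(int)
        if keep_original:
            # Keep the raw column and add the indicator next to it
            out[f"{column}_gt_{cutoff}"] = indicator
        else:
            # Overwrite the raw column with the indicator
            out[column] = indicator
    return out

demo = pd.DataFrame({"x2": [-0.4, 0.1, 0.6], "x4": [-0.9, 0.96, 0.2], "x5": [0.2, -1.8, 0.6]})
result = apply_thresholds(demo, {"x2": 0, "x4": 0.5})
print(result)
```

Because the function takes and returns a DataFrame, it could be wrapped in scikit-learn's `FunctionTransformer` (passing the thresholds through `kw_args`) and placed as the first step of `make_pipeline`, provided the pipeline is fed DataFrames rather than NumPy arrays.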

### Conclusion
1. The ensemble Random Forest classifier performed well when compared with the Naive Bayes and Support Vector Machine algorithms.
2. make_pipeline and StandardScaler are used to automate preprocessing and prediction.